Seminar Period 3 - Part 1
- Due 10 Feb 2023 by 17:00
- Points 0
- Submitting a file upload
- File types pdf and py
Overview of assignment
The assignment this period is designed to partially mimic the experience you may have if you get a job as a data scientist or machine learning expert in the real world, especially in an industry that analyzes customers' behaviours and backgrounds and then makes decisions based on the analysis. The assignment has two parts. Part 2 of the assignment will be released after Part 1 has been completed. You must complete both parts before your seminar. You can complete the assignment either alone or with a partner. The seminars will take place in the last week of the period, 27 Feb - 6 March (inclusive).
Note: If you complete the project with a partner, please make individual submissions to Canvas. You can upload the same report for both partners; just note in the document who the two students in the group are.
Part 1 is described on this Canvas page and you should submit the requested material by 17:00, Feb 10.
Part 2 will be released on Feb 13 and you should submit the requested assignment by 17:00, Feb 24.
Part 1
The scenario:
You have recently been hired by an insurance firm and are working in their data analytics and actuarial team. The team has collected a large database of the number of car insurance claims made by their customers over a period of time, plus a set of 9 potentially useful attributes associated with each customer. Previously, another employee investigated the usefulness of training regression models to predict the frequency of car insurance claims using the data in the database. Note that this employee had a real interest in neural networks. Your boss knows that you are an ML whizz kid and has set you the following task:
- Scan the Python code previously written, decide which model you should use and see if you can do some engineering to improve the results
- The boss says you are also free to train your own regression model
- The boss also says you should not spend more than approx half a day (<= 5 hours) on this task as there is lots of other stuff on your work To Do list.
Your boss gives you the guideline that you should choose the model that "is going to save us money and make our customers happy". After your research you should submit a very short document (2 pages max including figures; 1 page would be more appreciated) with the following information:
- a short paragraph describing your trained model and any major feature wrangling you did,
- one short paragraph explaining why you chose this model, and
- a summary of the performance of your model on the test set.
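One possible way to summarise test-set performance for a claim-frequency model is the exposure-weighted mean Poisson deviance, compared against a constant-rate baseline. The sketch below is only an illustration under stated assumptions: the toy numbers are made up (they are not from the assignment data), the names `claims`, `exposure` and `model_pred` are hypothetical, and frequency is assumed to be claims divided by exposure. scikit-learn offers the same metric as `sklearn.metrics.mean_poisson_deviance` with a `sample_weight` argument; the pure-Python version here just makes the formula explicit.

```python
import math


def poisson_deviance(y_true, y_pred, weights=None):
    """Weighted mean Poisson deviance: lower is better, 0 is a perfect fit."""
    if weights is None:
        weights = [1.0] * len(y_true)
    total = 0.0
    for y, mu, w in zip(y_true, y_pred, weights):
        if mu <= 0:
            raise ValueError("predictions must be strictly positive")
        # By convention the term y * log(y / mu) is 0 when y == 0.
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        total += w * 2.0 * (log_term - y + mu)
    return total / sum(weights)


# Made-up toy data, NOT taken from the assignment dataset.
claims = [0, 0, 1, 0, 2, 0]                    # number of claims per policy
exposure = [1.0, 0.5, 1.0, 0.25, 1.0, 0.75]    # assumed exposure per policy
frequency = [c / e for c, e in zip(claims, exposure)]

# Baseline: predict the same exposure-weighted mean frequency for everyone.
baseline_rate = sum(claims) / sum(exposure)
baseline_pred = [baseline_rate] * len(claims)

# Hypothetical predictions standing in for a trained model's output.
model_pred = [0.1, 0.1, 0.8, 0.1, 1.5, 0.1]

baseline_dev = poisson_deviance(frequency, baseline_pred, exposure)
model_dev = poisson_deviance(frequency, model_pred, exposure)
print(f"baseline deviance: {baseline_dev:.4f}")
print(f"model deviance:    {model_dev:.4f}")
```

Reporting the model's deviance next to the baseline's gives the reader of a one-page report an immediate sense of how much the model improves on simply predicting the average claim rate.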
Code & data shared with you
The previous employee based much of the code on the scikit-learn tutorial "Poisson regression and non-normal loss". You should quickly read this page to get an overview of the problem and standard solutions. Note that you are very welcome to base your solution solely on the code from the tutorial, but please use the train/val/test splits specified below. The data used and the code written by the employee:
- The French Motor Third-Party Liability Claims dataset, slightly cleaned up and partitioned into training, validation and test splits
- Python code to read in the above csv files, display the data and train simple regressors (mainly based on the scikit-learn tutorial) and a neural network.
- The code requires the python packages:
- scikit-learn,
- PyTorch (installed without GPU support),
- matplotlib, pandas,
- tabulate and
- numpy
- The code was run from the command line as follows:
- python3.9 seminar3_main.py -d True -n False -dir name_of_data_dir
This displays plots showing the data and the results of the trained regressors. The neural network regressors are not trained. The last argument is the name of the directory containing the data files.
- python3.9 seminar3_main.py -d True -n True -dir name_of_data_dir
This displays plots showing the data and the results of the trained regressors. The neural network regressors are also trained.
Note that the default settings for the neural network regressors (not balanced and balanced) use large batch sizes, quite a large number of hidden nodes and 20 epochs of training, so training will be a little slow to run. You can change these and other neural network training settings in the file seminar3_main.py.
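The commands above pass `True` and `False` as text on the command line. As a rough illustration of how such flags can be parsed, here is a minimal sketch using `argparse`. It is an assumption about how a script like this could work, not the actual contents of seminar3_main.py: the destination names, defaults and help texts are hypothetical, and reading `-d` as a display switch is a guess based on the descriptions above.

```python
import argparse


def str_to_bool(text):
    """Convert command-line text such as 'True' or 'False' to a bool."""
    lowered = text.strip().lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {text!r}")


def build_parser():
    # Hypothetical parser; the real script may define its options differently.
    parser = argparse.ArgumentParser(description="Claim-frequency regressors")
    parser.add_argument("-d", dest="display", type=str_to_bool, default=True,
                        help="assumed: show plots of the data and results")
    parser.add_argument("-n", dest="train_nn", type=str_to_bool, default=False,
                        help="train the neural network regressors")
    parser.add_argument("-dir", dest="data_dir", required=True,
                        help="directory containing the csv data files")
    return parser


# Parse the first example command from the text above.
args = build_parser().parse_args(
    ["-d", "True", "-n", "False", "-dir", "name_of_data_dir"]
)
print(args.display, args.train_nn, args.data_dir)
```

An explicit converter is used because `type=bool` would treat any non-empty string, including `"False"`, as true.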
What you should upload to Canvas:
- The short pdf document you have written to summarize your work and model choice
- A Python file containing the code to train your chosen model.
Some background material:
For full disclosure, I have written (copied) the code linked to above! The dataset was released with this publication:
- Case Study: French Motor Third-Party Liability Claims, by Alexander Noll, Robert Salzmann and Mario V. Wuthrich
and much of the code is adapted/copied from the scikit-learn tutorial:
- Poisson regression and non-normal loss
The first linked article will give you some extra background information and some ideas for data wrangling etc.