Seminar Period 3 - Part 1

Due 10 Feb 2023 by 17:00
Points 0
Submitting a file upload
File types pdf and py

Overview of assignment

The assignment this period is designed to partially mimic the experience you may have if you get a job as a data scientist or machine learning expert in the real world, especially in an industry that analyzes customers' behaviours and backgrounds and then makes decisions based on that analysis. The assignment has two parts. Part 2 will be released after Part 1 has been completed. You must complete both parts before your seminar. You can complete the assignment either alone or with a partner. The seminars will take place in the last week of the period, 27 Feb - 6 March (inclusive).

Note: If you complete the project with a partner, please make individual submissions to Canvas. Both partners can upload the same report; just note in the document who the two students in the group are.

Part 1 is described on this Canvas page and you should submit the requested material by 17:00, Feb 10.
Part 2 will be released on Feb 13 and you should submit the requested assignment by 17:00, Feb 24.

Part 1

The scenario:

You have recently been hired by an insurance firm and are working in their data analytics and actuarial team. The team has collected a large database of the number of car insurance claims made by their customers over a period of time, plus a set of 9 potentially useful attributes associated with each customer. Previously, another employee investigated the usefulness of training regression models to predict the frequency of car insurance claims using the data in the database. Note that this employee had a real interest in neural networks. Your boss knows that you are an ML whizz kid and has set you the following task:

Scan the Python code previously written, decide which model you should use, and see if you can do some engineering to improve the results.
The boss says you are also free to train your own regression model.
The boss also says you should not spend more than approximately half a day (at most 5 hours) on this task, as there is lots of other stuff on your work To Do list.

Your boss gives you the guideline that you should choose the model that "is going to save us money and make our customers happy". After your research you should submit a very short document (2 pages max, including figures; 1 page would be more appreciated) with the following information:

a short paragraph describing your trained model and any major feature wrangling you did,
a short paragraph explaining why you chose this model, and
a summary of the performance of your model on the test set.
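
If it helps when summarising test-set performance: the scikit-learn tutorial "Poisson regression and non-normal loss" scores claim-frequency models with the mean Poisson deviance, among other metrics. The sketch below is a plain-Python version of that metric, not part of the code shared with you. The function name and the optional exposure weights are illustrative assumptions.

```python
import math


def mean_poisson_deviance(y_true, y_pred, weights=None):
    """Weighted mean Poisson deviance: 2 * (y*log(y/mu) - y + mu), averaged.

    y_true  : observed claim frequencies (>= 0)
    y_pred  : predicted claim frequencies (must be > 0)
    weights : optional per-customer weights, e.g. exposure (an assumption)
    """
    if weights is None:
        weights = [1.0] * len(y_true)
    total = 0.0
    for y, mu, w in zip(y_true, y_pred, weights):
        if mu <= 0:
            raise ValueError("predictions must be strictly positive")
        # By convention y*log(y/mu) is taken to be 0 when y == 0.
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        total += w * 2.0 * (log_term - y + mu)
    return total / sum(weights)


if __name__ == "__main__":
    # Toy numbers only, not results from the assignment data.
    print(mean_poisson_deviance([0.0, 1.0, 0.0], [0.1, 0.5, 0.2]))
```

Lower is better, and a model that predicts the observed frequencies exactly scores 0.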

Code & data shared with you

The previous employee based much of the code on the scikit-learn tutorial "Poisson regression and non-normal loss". You should quickly read this page to get an overview of the problem and standard solutions. Note that you are very welcome to base your solution solely on the code from the tutorial, but please use the train/val/test splits provided below. The data used and the code written by the employee:

The French Motor Third-Party Liability Claims dataset slightly cleaned up and partitioned into a training, validation and test split
Python code to read in the above csv files, display the data and train simple regressors (this is mainly based on the scikit-learn tutorial linked below) and a neural network.
- seminar3_main.py
- neural_network_regressor.py
The code requires the Python packages:
- scikit-learn,
- PyTorch (installed without GPU support),
- matplotlib, pandas,
- tabulate and
- numpy
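
A quick way to check your environment before running the code is sketched below. Note that the import names differ from the package names for scikit-learn (`sklearn`) and PyTorch (`torch`). The helper name `missing_packages` is made up for this example.

```python
from importlib.util import find_spec

# Package name as listed above -> module name used in an import statement.
REQUIRED = {
    "scikit-learn": "sklearn",
    "pytorch": "torch",
    "matplotlib": "matplotlib",
    "pandas": "pandas",
    "tabulate": "tabulate",
    "numpy": "numpy",
}


def missing_packages(required=REQUIRED):
    """Return the names of packages whose module cannot be found."""
    return [name for name, module in required.items()
            if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("missing:", ", ".join(missing) if missing else "none")
```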
The code was run from the command line as follows:
- python3.9 seminar3_main.py -d True -n False -dir name_of_data_dir
  This displays plots showing the data + results of the regressors trained. The neural network regressors are not trained. The last entry is the name of the directory where the data files are.
- python3.9 seminar3_main.py -d True -n True -dir name_of_data_dir
  This displays plots showing the data + results of the regressors trained. The neural network regressors are trained.
  Note that the default settings for the neural network regressors (not balanced and balanced) use large batch sizes, quite a large number of hidden nodes and 20 epochs of training, so training will be a little slow to run. You can change these and other neural network training settings in the file seminar3_main.py.
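
If you would rather start from the tutorial's approach than from the neural network, a Poisson GLM baseline could look roughly like the sketch below. It is only an illustration, not the code in seminar3_main.py: the data is synthetic, the column names (DrivAge, VehAge, VehGas, Exposure, ClaimNb) are borrowed from the public freMTPL2freq dataset and may not match the shared csv files, and the alpha grid is arbitrary. The model is fitted on the training split, the regularisation strength is picked on the validation split, and the test split is touched once at the end.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import PoissonRegressor
from sklearn.metrics import mean_poisson_deviance
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC, CATEGORICAL = ["DrivAge", "VehAge"], ["VehGas"]
FEATURES = NUMERIC + CATEGORICAL


def make_synthetic_split(n_rows, seed):
    """Stand-in for one csv split (assumed column names, fake data)."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "DrivAge": rng.integers(18, 90, size=n_rows),
        "VehAge": rng.integers(0, 20, size=n_rows),
        "VehGas": rng.choice(["Diesel", "Regular"], size=n_rows),
        "Exposure": rng.uniform(0.1, 1.0, size=n_rows),
    })
    rate = 0.05 + 0.10 * (df["DrivAge"] < 25)  # claims per year of exposure
    df["ClaimNb"] = rng.poisson(rate * df["Exposure"])
    return df


def frequency(df):
    """Regression target: claims per unit of exposure."""
    return df["ClaimNb"] / df["Exposure"]


def deviance(model, df):
    """Exposure-weighted mean Poisson deviance of the model on one split."""
    return mean_poisson_deviance(
        frequency(df), model.predict(df[FEATURES]), sample_weight=df["Exposure"]
    )


def fit_and_select(train, val, alphas=(1e-4, 1e-2, 1.0)):
    """Fit one Poisson GLM per alpha; keep the one with lowest val deviance."""
    best = None
    for alpha in alphas:
        preprocess = ColumnTransformer([
            ("num", StandardScaler(), NUMERIC),
            ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
        ])
        model = make_pipeline(preprocess,
                              PoissonRegressor(alpha=alpha, max_iter=300))
        model.fit(train[FEATURES], frequency(train),
                  poissonregressor__sample_weight=train["Exposure"])
        val_dev = deviance(model, val)
        if best is None or val_dev < best[0]:
            best = (val_dev, alpha, model)
    return best


train = make_synthetic_split(4000, seed=0)
val = make_synthetic_split(1000, seed=1)
test = make_synthetic_split(1000, seed=2)

val_dev, best_alpha, model = fit_and_select(train, val)
test_dev = deviance(model, test)  # reported once, after model selection
print(f"alpha={best_alpha}  val deviance={val_dev:.4f}  test deviance={test_dev:.4f}")
```

To try this on the real data you would replace `make_synthetic_split` with `pandas.read_csv` calls on the provided splits and adjust the column lists to the attributes actually present.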

What you should upload to canvas:

The short PDF document you have written to summarize your work and model choice.
A Python file containing the code to train your chosen model.

Some background material:

For full disclosure, I have written (copied) the code linked to above! The dataset was released with this publication:

Case Study: French Motor Third-Party Liability Claims
Alexander Noll, Robert Salzmann, Mario V. Wuthrich

and much of the code is adapted/copied from the scikit-learn tutorial:

Poisson regression and non-normal loss

The first linked article will give you some extra background information and some ideas for data wrangling etc.
