7.CodeReview. Tips for code submission

What is the game?

Scikit-learn is a popular open-source project and a very useful machine learning toolkit. Open-source projects receive contributions from many people. Scikit-learn has a committee that regularly approves new contributions (but they need to be novel, well written, well tested, and adhere to the scikit-learn interface). Play the game! Make new (potential) contributions to scikit-learn.

To that end, the binomes will split the work in the following way:

- Binome "modelisation": you are responsible for creating a supervised-learning class called model with at least two methods: fit and predict. Make it inherit from BaseEstimator. Heri provided you with a basic outline.

- Binome "preprocessing": you are responsible for creating a preprocessing class. In this example the class Preprocessor just calls PCA, but you need to do something more original than that. To adhere to the scikit-learn interface, make it inherit from BaseEstimator AND provide 3 methods: fit, transform, and fit_transform (see this post to understand the difference between fit and fit_transform).

- Binome "visualisation": we give you more freedom; just generate interesting graphs for the input data and the results.
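As an illustration of the "modelisation" task, here is a minimal sketch of a class called model that inherits from BaseEstimator and provides fit and predict. It is not Heri's outline: the wrapped DecisionTreeClassifier and the max_depth hyper-parameter are assumptions chosen only to keep the example short.

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.tree import DecisionTreeClassifier


class model(BaseEstimator):
    """Supervised-learning class exposing fit and predict (illustrative only)."""

    def __init__(self, max_depth=3):
        # Store each hyper-parameter under the same name as the argument:
        # BaseEstimator.get_params and set_params rely on this convention.
        self.max_depth = max_depth

    def fit(self, X, y):
        """Train on X (n_samples, n_features) with labels y (n_samples,)."""
        # Assumption: a decision tree stands in for your own learning algorithm.
        self.clf_ = DecisionTreeClassifier(max_depth=self.max_depth, random_state=0)
        self.clf_.fit(X, y)
        return self  # returning self allows model().fit(X, y).predict(X)

    def predict(self, X):
        """Return one predicted label per row of X."""
        return self.clf_.predict(X)
```

Because the class follows the BaseEstimator conventions, calls such as get_params() work without extra code, which is what grid search and pipelines need.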

One of the benefits of sticking to the interface of scikit-learn is that you will be able to use your new classes like any other scikit-learn models and combine them with other classes like grid search and cross-validation, voting classifiers, and pipelines. Surprise us! Try things we do not even know of yet in scikit-learn and get more originality points!

This is an example of a pipeline (from this sample code):

The object fancy_classifier is a model like any other. You can call the methods fit and predict on it. It chains a preprocessing step (from the class Preprocessor) and a classifier (from the class BaggingClassifier). You can chain several preprocessing steps in this way and terminate with a classifier or a regressor.
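The original sample code is not reproduced here, so the following is only a sketch of what such a pipeline could look like. The Preprocessor below is a stand-in that just calls PCA; the step names, n_components, and the synthetic data are assumptions.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import BaggingClassifier
from sklearn.pipeline import Pipeline


class Preprocessor(BaseEstimator, TransformerMixin):
    """Stand-in preprocessing class that just calls PCA (assumption)."""

    def __init__(self, n_components=2):
        self.n_components = n_components

    def fit(self, X, y=None):
        self.pca_ = PCA(n_components=self.n_components).fit(X)
        return self

    def transform(self, X):
        return self.pca_.transform(X)


# Chain the preprocessing and the classifier into a single model.
fancy_classifier = Pipeline([
    ("preprocessing", Preprocessor()),
    ("classification", BaggingClassifier(random_state=0)),
])

# Synthetic data, used only to show that fit and predict work on the pipeline.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
fancy_classifier.fit(X, y)
predictions = fancy_classifier.predict(X)
```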

This is an example of a voting classifier (from this sample code):

The object clf is a model like any other. You can call the methods fit and predict on it. It votes among a number of classifiers, including fancy_classifier. Scikit-learn is like a Lego game: you can create pipelines of voting classifiers or voting classifiers of pipelines, etc.
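Again as a sketch rather than the original sample code: a voting classifier that includes a pipeline among its voters. The choice of GaussianNB and RandomForestClassifier as the other voters, the estimator names, and the synthetic data are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline

# A pipeline used as one of the voters (PCA stands in for your Preprocessor).
fancy_classifier = Pipeline([
    ("preprocessing", PCA(n_components=2)),
    ("classification", BaggingClassifier(random_state=0)),
])

# Hard voting: each classifier predicts a label and the majority wins.
clf = VotingClassifier(
    estimators=[
        ("fancy", fancy_classifier),
        ("naive_bayes", GaussianNB()),
        ("forest", RandomForestClassifier(n_estimators=20, random_state=0)),
    ],
    voting="hard",
)

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
clf.fit(X, y)
votes = clf.predict(X)
```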

Instructions

- Submit your homework on Chagrade under 7.CodeReview (NOT 10.CodeReview); only ONE team member should submit the homework. Deadline March 21, 2020.

- For projects GAIASAVERS and MEDICHAL: use PREPROCESSED data competitions.

- Make sure to submit your CODE (not results).

- Give meaningful names to your Codalab submission zip files to identify them more easily.

- Test your code first on your local computer to make sure it works. Use Anaconda with Python 3.7. Avoid installing extra libraries with pip install (they will not be on Codalab). You should have a main in your model.py that runs a test; then also try your code in the Jupyter notebook.

- Common errors include indentation errors. Copy the indentation from previous lines (it may include a combination of spaces and tabs that look alike to the eye but not to the computer).

- Remove the pickle; submit ONLY metadata and model.py in a zip file that does NOT contain the top directory.

- Look at the log files if there is an error (click on the [+] sign next to your submission).

- If you submit results (only for debug purposes): use the public data to generate the results (NOT the sample data).
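For the instruction about zipping without the top directory, one possible approach is to build the archive with Python's zipfile module so that each file sits at the root of the zip. This is a sketch: the function name zip_submission and the pickle extensions it skips are assumptions, and zip_path should point outside source_dir so the archive does not include itself.

```python
import os
import zipfile


def zip_submission(source_dir, zip_path):
    """Zip the files of source_dir so that they sit at the root of the archive."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in sorted(os.listdir(source_dir)):
            full_path = os.path.join(source_dir, name)
            # Skip sub-directories and pickled models (extensions are an assumption).
            if os.path.isfile(full_path) and not name.endswith((".pickle", ".pkl")):
                # arcname=name drops the directory part, so no top directory is stored.
                archive.write(full_path, arcname=name)
    return zip_path
```

Opening the resulting zip should show model.py and metadata directly, not a folder containing them.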

FEEDBACK AND HELP TO GET A BETTER GRADE:

Your grade is based on the code submitted on Codalab. It must be a SELF-CONTAINED zip archive and include:

- the main class model, which MUST be in model.py

- the preprocessing

- the tests

You do not need to put everything in a single model.py file. For example, the preprocessing class Preprocessor can be in a file preprocess.py. Then at the top of model.py you can import the preprocessing as:

from preprocess import Preprocessor

- Clarity (/3):

+ Make the code "YOURS":

1) Add at the top a header with:

Author: your team name

Last revision Date: Date of last modification

Description: Brief description of what this is useful for.

Revision History: Dates and brief description of changes.

2) For small changes/bug fixes, put the initials of the person who made the change, e.g.:

# IG: Bug fixed here, replaced = by ==.

3) Use meaningful variable names of your own choice.

4) Make short and clear programming statements (avoid long lines).

5) When you add a new method or function, describe what it does in a comment between triple quotes: ''' what the function does '''.

+ Organize your GitHub repository well.

+ Add many comments and prints in your code (clarity is also clarity of execution!)

+ Make your code modular.
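The header, initialed comment, and triple-quoted description from the Clarity points could look like the sketch below. The team name, dates, and the function count_matches are placeholders invented for illustration.

```python
"""
Author: TEAM_NAME (placeholder)
Last revision date: YYYY-MM-DD (placeholder)
Description: Counts how many predictions agree with the true labels.
Revision history:
    YYYY-MM-DD: first version (placeholder entry)
"""


def count_matches(predictions, labels):
    """Return the number of positions where predictions and labels agree."""
    matches = 0
    for predicted, expected in zip(predictions, labels):
        # IG: Bug fixed here, replaced = by ==.
        if predicted == expected:
            matches += 1
    return matches
```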

- Originality (/3):

+ Include some personal code

+ CODE WITHOUT PREPROCESSING WILL NOT GET 3/3 ON ORIGINALITY

+ HIGHLIGHT big changes you made, e.g. ### NEW CONTRIBUTION OF GROUP XXX ### (for small changes, see the "Clarity" section)

+ Make changes that make the code easier to use.

+ Make use of pipelines to combine the preprocessing and the model (see scikit-learn documentation and examples from Isabelle in zClassifer.py and zPreprocessor.py)

However, if you feel uncomfortable with pipelines, note that Heri's example combines the preprocessing and the model without pipelines.

+ Make use of VotingClassifier to combine several classifiers (see examples from Isabelle in zClassifer.py)

+ Search the doc of scikit-learn to find interesting models to try.

+ Make comparisons of various combinations of preprocessing, models, model hyper-parameters. Use grid-search and cross-validation to select the best combination.

+ Implement a new algorithm not already in scikit-learn. For example this version of Naive Bayes, or this version of kernel ridge regression.

+ Go back to raw data and implement a new feature construction algorithm or apply some deep learning methods (more advanced).
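The point above about comparing combinations with grid search and cross-validation can be sketched as follows. The pipeline steps, the parameter grid, and the synthetic data are assumptions; substitute your own preprocessing, model, and challenge data.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=120, n_features=8, random_state=0)

pipeline = Pipeline([
    ("preprocessing", PCA()),
    ("classification", DecisionTreeClassifier(random_state=0)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter name>.
param_grid = {
    "preprocessing__n_components": [2, 4],
    "classification__max_depth": [2, 5],
}

# 3-fold cross-validation over the 2 x 2 = 4 combinations.
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X, y)

best_params = search.best_params_  # the best combination found
best_score = search.best_score_    # its mean cross-validated accuracy
```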

- Tests (/3):

+ Add a main at the bottom of model.py showing how you tested your code (Heri created a simple example showing you how to do that); or

+ Create a class to test your code; or

+ Create a Jupyter-notebook to test your code.

In the last two cases, do not forget to include the test code with your Codalab submission.
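A main at the bottom of model.py could look like the sketch below. It is not Heri's example: the wrapped GaussianNB, the helper run_self_test, and the random toy data are assumptions used only to show the pattern.

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB


class model(BaseEstimator):
    """Stand-in model (assumption): wraps GaussianNB behind fit and predict."""

    def fit(self, X, y):
        self.clf_ = GaussianNB().fit(X, y)
        return self

    def predict(self, X):
        return self.clf_.predict(X)


def run_self_test():
    """Train on toy data, predict on held-out rows, and return the accuracy."""
    rng = np.random.RandomState(0)
    X = rng.rand(60, 4)
    y = (X[:, 0] > 0.5).astype(int)  # label depends on the first feature only
    tested = model().fit(X[:40], y[:40])
    predictions = tested.predict(X[40:])
    assert predictions.shape == (20,), "predict must return one label per row"
    return accuracy_score(y[40:], predictions)


if __name__ == "__main__":
    # Runs only when model.py is executed directly, not when it is imported.
    print("Held-out accuracy on toy data:", run_self_test())
```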

- Score on Codalab (/1):

+ Outperform the baseline (sample code submission).

+ If you get better performance without preprocessing: show in your code that you compared a model with and without preprocessing, but submit the version without preprocessing.

- Preprocessing:

+ Make a separate class for the preprocessor. Keep preprocessing and model in 2 separate files (easier for development).

+ Using pipelines is a good idea to combine preprocessing and model.
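A separate preprocessing file (for example preprocess.py, as mentioned earlier) could contain a class along these lines. This is a sketch under assumptions: the technique (dropping near-constant features, then standardizing) and the min_variance parameter are illustrative choices, and fit_transform is inherited from scikit-learn's TransformerMixin rather than written by hand.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class Preprocessor(BaseEstimator, TransformerMixin):
    """Drop near-constant features, then standardize the remaining ones."""

    def __init__(self, min_variance=1e-8):
        self.min_variance = min_variance

    def fit(self, X, y=None):
        """Learn which columns to keep and their mean and standard deviation."""
        X = np.asarray(X, dtype=float)
        self.keep_ = X.var(axis=0) > self.min_variance
        self.mean_ = X[:, self.keep_].mean(axis=0)
        self.std_ = X[:, self.keep_].std(axis=0)
        return self

    def transform(self, X):
        """Apply the column selection and scaling learned in fit."""
        X = np.asarray(X, dtype=float)
        return (X[:, self.keep_] - self.mean_) / self.std_

    # fit_transform(X) comes from TransformerMixin: it calls fit, then transform.
```

Since the statistics are learned in fit and only applied in transform, the same object can be fitted on training data and reused unchanged on test data, which is what a pipeline does for you.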