Data Science

What is a sample, a feature, a target?

Here is, as an example, the Iris dataset (the one you studied during TP0).

* Each line in the data matrix is a sample: it represents one particular flower (iris).

* Each column is a feature (an attribute) of the flowers. For example, flower 1 has a petal length of 1.3 and a petal width of 0.2.

* The target (or label) is the variable that we aim to predict in the end: the type of iris. The first one is a setosa and the second a virginica. We usually map classes to integers to manipulate them easily (this procedure is called label encoding).
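As a minimal sketch of label encoding, the following uses plain Python on a two-flower excerpt. The first row reuses the values quoted above; the second row's values and all variable names are made up for illustration, and libraries such as scikit-learn offer ready-made encoders for the same purpose.

```python
# Data matrix: one row per sample (flower), one column per feature
# (petal length, petal width).
X = [
    [1.3, 0.2],  # flower 1, values quoted in the text
    [5.1, 1.9],  # flower 2, hypothetical values for illustration
]

# Targets as class names, one per sample.
y_names = ["setosa", "virginica"]

# Label encoding: give each distinct class an integer, in sorted order.
classes = sorted(set(y_names))
class_to_int = {name: index for index, name in enumerate(classes)}
y = [class_to_int[name] for name in y_names]

print(class_to_int)  # {'setosa': 0, 'virginica': 1}
print(y)             # [0, 1]
```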


Why should I care about learning Python?

Python is a very versatile interpreted language, which allows you to write both simple scripts and elaborate object-oriented programs. Many modern machine learning and AI toolkits such as scikit-learn, TensorFlow, and PyTorch are based on Python.

Where do I learn more about Python?

Come to the extra Python class we give: Friday, 15:45 to 17:45, Room 313, with Adrien. All class material can be DOWNLOADED.

Jupyter notebooks

Why is my notebook formatting messed up on GitHub or Nbviewer when it is good on my desktop?

Not all viewers are alike: some are more lenient than others toward minor mistakes in the markup cells. Verify that all your HTML tags are closed, e.g. if you open a block with <div>, close it with </div>.
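One way to spot an unclosed tag before uploading is a small check with Python's standard html.parser module. This is only a sketch: the names TagChecker, unclosed_tags and VOID_TAGS are made up here, the list of self-closing tags is deliberately short, and it checks a string of markup rather than a whole notebook file.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag (short, non-exhaustive list).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TagChecker(HTMLParser):
    """Keeps a stack of open tags and records mismatched closing tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.problems.append(f"unexpected </{tag}>")


def unclosed_tags(markup):
    """Return tags left open, followed by any mismatched closing tags."""
    checker = TagChecker()
    checker.feed(markup)
    checker.close()
    return checker.open_tags + checker.problems


print(unclosed_tags("<div><b>hello</b>"))  # ['div']
print(unclosed_tags("<div>hello</div>"))   # []
```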

Git and GitHub

Why should I care about Git and GitHub?

This is going to be one of the most important collaborative tools you will be using in your project. It will allow you to work separately on your own computers and yet have a common place where the latest and greatest version is constantly updated (and will be graded by your teachers).

What is the difference between Git and GitHub?

GitHub is a web-based service that hosts Git repositories, holding your code in a shared place. The aim of Git is to manage software development projects (with revision control of files) as they change over time, ON YOUR COMPUTER. In principle you can use Git without GitHub, just for yourself, to keep track of older versions of your code. The GitHub repositories allow you to easily share your code with others and work on the same software without having diverging versions.

Where can I learn about Git and GitHub?

There are many tutorials, like THIS ONE.

How do I create a new repo?

To create a new repo on your computer, you can either fetch the contents of one on GitHub or create a directory and put some files in it.

Case 1: Fetch the contents of a GitHub repo.

- Create a GitHub account if you do not have one.

- Find a repo you like (e.g. a homework)

- Click "Fork" in the upper right corner to create your own personal copy

- Click "clone or download" to get the address of the repo (e.g. https://github.com/YOUR-USERNAME/info232.git)

- At the prompt on your computer execute

git clone https://github.com/YOUR-USERNAME/info232.git

in a directory of your choice (where the files will be copied).

Case 2: Create a repo on your computer from scratch.

- At the prompt on your computer, go to a directory that has some files in it.

- Then execute

git init

What are the most useful git commands?

git pull: ALWAYS do that first at the prompt of your computer, when you resume working, if you share a GitHub repo with others (not for TP1 and 2), because others may have made a change.

git status: Lets you know what's what in your repo. Some files may be "tracked" and others not. A tracked file is a file for which you follow changes over time. For simplicity you may want to track all files. But some files (e.g. data) may be big, so you may want to "ignore" them.

git add: Allows you to add a file to your next commit (the file becomes tracked if it was not already).

git add .

If you use a dot, you add everything in the current directory. This is simple!

git commit: Allows you to declare that changes you last added (with git add) should now be made ready to push, and you can attach a comment (message) to track changes.

git commit -m "some useful comment"

The message is really important to keep track of your changes.

git push: Allows you to push your changes to the web repo on GitHub.

git push

For more, see this CHEAT SHEET.

How to use GitHub

  • Create an account on https://github.com/

  • Create a new repository

  • Follow the quick setup instructions, which typically are to open a command shell, go to the directory where your code is, then type:

git init

git add .

git commit -m "first commit"

git branch -M main

git remote add origin https://github.com/your-github-id/your-repo.git

git push -u origin main


Why should I care about dockers?

Codalab runs your code and the scoring program in docker containers. This allows us to isolate them from the rest of the world and run them in a well-defined environment (with the correct Python version and libraries). Your challenge may be using a version of Python with different libraries than those you already have on your computer, and other specialized code. If you run your code on your computer using a docker image identical to the one we use on Codalab, you ensure better reproducibility of the results.

What is the difference between a "docker container" and a "docker image"?

A "docker container" is an instantiation of a "docker image". The difference between "docker image" and "docker container" is the same as the difference between a class and a runtime instance of a class (in object oriented programming).
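The class/instance analogy can be made concrete with a few lines of Python. This is purely an illustration of the analogy and not how docker works internally; the class name ImageAnalogy, the container names and the python_version value are invented for the example.

```python
class ImageAnalogy:
    """Plays the role of a docker image: a fixed description of an environment."""

    python_version = "3.6"  # hypothetical setting, shared by every instance

    def __init__(self, container_name):
        # Each instance plays the role of one container with its own state.
        self.container_name = container_name
        self.files = []


# Two "containers" created from the same "image".
first = ImageAnalogy("first-container")
second = ImageAnalogy("second-container")

# A change made inside one container does not affect the other.
first.files.append("results.zip")

print(first.files)   # ['results.zip']
print(second.files)  # []
print(first.python_version == second.python_version)  # True
```

In the same way, removing a container discards its own state but leaves the image available for creating new containers.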

Where can I learn about dockers?

In the docker documentation. Use this documentation to install the docker software on your personal computer.

What are the most useful docker commands?

If you are an L2 student, you will use already created dockers. All you should need is:

1) Retrieve the "docker image" (if you don't do it, docker will pull the image automatically the first time you do "docker run"):

docker pull [docker-image-name]

2) Make a copy of your starting kit and your public data to an auxiliary directory on your server, e.g.:

mkdir ~/aux

cp -r starting_kit ~/aux

cp -r public_data ~/aux

This will allow you to see them from within the docker.

3) Create and start a container with the run command:

docker run --name [docker-container-name] -it -v ~/aux:/home/aux [docker-image-name]

4) If all goes well, you will get a prompt inside your docker container. At the docker prompt:

# cd /home/aux

You now see your starting kit and public data. You can run the ingestion and scoring programs:

# cd starting_kit

# python3 ingestion_program/ingestion.py sample_data sample_result_submission ingestion_program sample_code_submission

# python3 scoring_program/score.py sample_data sample_result_submission scoring_output

Or you can replace sample_data with public_data:

# python3 ingestion_program/ingestion.py ../public_data sample_result_submission ingestion_program sample_code_submission

# exit

5) Back to your regular prompt:

cd ~/aux/starting_kit

Locate your sample_result_submission, zip it (without directory) and submit it.
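Zipping "without directory" means the files sit at the top level of the archive rather than inside a sample_result_submission/ folder. A minimal sketch with Python's standard zipfile module follows; the function name zip_without_directory and the output file name in the usage note are assumptions, not part of the starting kit.

```python
import zipfile
from pathlib import Path


def zip_without_directory(source_dir, zip_path):
    """Zip the contents of source_dir so files appear at the archive root.

    zip_path should be located outside source_dir, otherwise the archive
    being written would be picked up as one of the files to add.
    """
    source = Path(source_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source.rglob("*")):
            if path.is_file():
                # Store each file under its path relative to source_dir,
                # which drops the enclosing directory name.
                archive.write(path, arcname=path.relative_to(source).as_posix())
```

For example, calling zip_without_directory("sample_result_submission", "submission.zip") from ~/aux/starting_kit would write submission.zip next to the results directory.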

If you are an M2 student and need to create or modify a docker for use with Codalab, check these instructions. The docker image name is a competition property, which can be set in the YAML configuration file of a competition or changed with the editor.

For more useful commands, see this cheat sheet.

How to run jupyter-notebook from inside the docker?

2) Go to your starting kit directory and run from the command line:

docker run --name [docker-container-name] -it -p 5000:8888 [docker-image-name] jupyter-notebook --ip --allow-root

[docker-container-name] : a name YOU give to the container that will run on your machine.

[docker-image-name] : the correct docker image for your project (see below).

This should allow you to create and start a docker container.

Once started, copy and paste the Jupyter Notebook link you are provided with into your browser, after replacing the host name and using port 5000 instead of 8888.


3) When you finish running the notebook, extract the submission zip file from the container:

docker cp [docker-container-name]:/starting_kit/example_submission.zip [DEST PATH]

Here, replace example_submission.zip with the actual filename, e.g. sample_code_submission_18-12-10-02-00.zip
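The link rewriting mentioned in step 2 can be sketched in Python with the standard urllib.parse module. The printed link below is hypothetical, and using "localhost" as the replacement host is an assumption that holds only when the browser runs on the same machine as docker; the port 5000 comes from the -p 5000:8888 option of the run command.

```python
from urllib.parse import urlsplit, urlunsplit


def rewrite_notebook_url(url, host="localhost", port=5000):
    """Swap the host and port of a notebook link, keeping path and token."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme, f"{host}:{port}", parts.path, parts.query, parts.fragment)
    )


# Hypothetical link as printed inside the container.
printed = "http://0123456789ab:8888/?token=abc123"
print(rewrite_notebook_url(printed))  # http://localhost:5000/?token=abc123
```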

What is the "docker image name" for my project?

If your project uses the default docker "codalab/codalab-legacy:py3", you may want to just use Anaconda for Python 3.6 installed directly on your computer rather than running inside a docker. Some projects have two versions: a simple one that uses preprocessed data and can be solved with the default docker or Anaconda for Python 3.6, and a more complex one that starts from raw image data and requires a custom docker (with TensorFlow or PyTorch installed).

I am getting the message "docker container name already exists", what's wrong?

You may have already run "docker run", which creates and starts a container.

There are several solutions; see this answer.

The simplest is:

docker stop [docker-container-name]

docker rm [docker-container-name]

then you can run your "docker run" command again. But this way you destroy your container and all its previous contents.

How do I find out which containers are on my system?

docker ps --all

Note that if you did not give a name to your container with the --name option, a cool random name is assigned automatically.