Challenge Design and Participation

The goal of the Project A and B classes is to learn data science while having fun. Project A students create mini-challenges and Project B students solve them. Prior to 2020-2021, M2 students took Project A and L2 students took Project B. We now offer both classes to M1 students.

Mini Challenges 2019-2020

Challenges (M2 students) | Abstract | Task | L2 groups (who solved it)

Plankton recognition

Many species are endangered, if not simply disappearing. Help protect what can still be protected, and save what can still be saved.

About 70% of our planet is covered by water, so a large part of Earth's biodiversity is found under water. Help monitor the health of the oceans by studying plankton (microscopic organisms) and determining their diversity in a given location, as a measure of biodiversity.

This is a multi-class classification problem. Each sample is a photograph of plankton. The original images were 300x300 pixels; we preprocessed them and reduced them to 100x100 pixels to save memory. We then used the vertical and horizontal histograms, together with the mean and the variance, as features.

Each image is then characterized by these 202 features.
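The count of 202 features is consistent with 100 row values, 100 column values, the mean, and the variance. The sketch below shows one way such features could be computed. It assumes that the "histograms" are the per-row and per-column intensity sums, which the description does not spell out; `histogram_features` and `toy_image` are hypothetical names used for illustration only.

```python
def histogram_features(image):
    """Return row sums, column sums, mean and variance of a 2-D grayscale image.

    `image` is a list of rows of pixel intensities.
    Assumption: the "histograms" are per-row and per-column intensity sums.
    """
    n_rows = len(image)
    n_cols = len(image[0])
    # Horizontal histogram: one total per row.
    row_sums = [sum(row) for row in image]
    # Vertical histogram: one total per column.
    col_sums = [sum(image[r][c] for r in range(n_rows)) for c in range(n_cols)]
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    variance = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    return row_sums + col_sums + [mean, variance]


# A synthetic 100x100 image yields 100 + 100 + 2 = 202 features.
toy_image = [[(r + c) % 256 for c in range(100)] for r in range(100)]
features = histogram_features(toy_image)
print(len(features))  # 202
```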

Traffic prediction

Lemonade sales depend on car traffic near your place of business. Your mission, should you decide to accept it, is to predict the number of cars that will pass near the lemonade stand, given the date, the hour, and additional meteorological information.

This is a regression problem. You have to predict highway traffic volume as a function of 58 features (including time specifications and various weather descriptions).

Malaria parasite identification

Malaria is a disease caused by parasites of the genus Plasmodium and spread by the bite of an infected female mosquito. Early diagnosis could help treat and control the disease. In this challenge you will have access to pre-processed images of segmented cells from thin blood smear slides. Your goal is to distinguish parasitized cell images from uninfected ones in order to diagnose malaria.

This is a binary classification problem. Each sample is an image of a cell, which is either infected or not. Each image has been reduced to 6 numerical features. For training, you are given a matrix containing those 6 features for 60% of the images.

Acknowledgements: These challenges are hosted by CodaLab. We received a grant from the FCS Paris-Saclay and sponsorship from Microsoft Azure for Research and Google Research.

Mini Challenges 2018-2019

Challenges (M2 students) | Abstract | Task | L2 groups (who solved it)

Recognize landscapes from satellite images

This challenge views things from on high! From aerial images, you will figure out whether the scene is a beach, chaparral, cloud, desert, forest, island, lake, meadow, mountain, river, sea_ice, snowberg, or wetland. Classifying terrain is important to control urban development, favor economic growth, and protect the environment. [Slides]

This is a multi-class classification problem. You must assign each image to one of 13 classes.

There are two versions of the challenge: one on raw data and one on preprocessed data. In the raw data version, the task is to classify images represented as 128x128 pixel maps. In the preprocessed version, the task is to classify vectors of 4096 high-level abstract features extracted with a pre-trained CNN.

Health data challenge

The dataset describes patients who have been diagnosed at different stages of cancer. Your task is to improve the classification of those patients by stage. [Slides]

This is a multi-class classification problem. You must classify patients from a specific population into one of 10 cancer-stage categories. The data is a matrix with one row per patient and one column per feature. The features are methylation measurements related to each patient's medical condition.

Learning to run a power network

The goal of this challenge is to control electricity transportation in power grids while keeping people and equipment safe. This is the "gamification" of a serious problem: operating the grid is becoming increasingly complex because of the advent of less predictable renewable energies, the globalization of energy markets, growth in consumption, and concurrent limitations on new line construction. [Slides]

This is a reinforcement learning (RL) problem. You will have access to a simulator of a small-scale grid. The RL agents you design should learn a policy that keeps the power grid secure. The possible actions include switching a line's status (in service or out of service) or changing the line interconnections.

Detect fake paintings

The goal of the challenge is to detect fake paintings. We present you with real paintings and paintings generated by a computer program. Can you tell them apart? [Slides]

The problem is a binary classification problem. Each sample (image) is characterized by 200 features. You must predict whether the images are fake or real.


Influence of nutrition on life expectancy

Evaluate how nutrition affects longevity using data from NHANES (US National Health and Nutrition Examination Survey). [Slides]

This is a regression problem (prediction of time of death) with censored data (some people leave the study or are still alive at the end of the study). The evaluation metric is the concordance index.
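To make the metric concrete, here is a minimal sketch of a concordance index for right-censored data, in the spirit of Harrell's C. It is an illustration under stated assumptions, not the scoring program used by the challenge: the function name, the argument names, and the toy values are all hypothetical. A pair of people is comparable only when the one with the shorter follow-up time actually died; the index is the fraction of comparable pairs that the predictions put in the right order.

```python
def concordance_index(times, predicted_times, observed):
    """Concordance index for right-censored data (illustrative sketch).

    times: follow-up time of each person
    predicted_times: predicted survival time (larger means longer survival)
    observed: True if death was observed, False if the person was censored
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Comparable pair: i died, and did so before j's last follow-up.
            if observed[i] and times[i] < times[j]:
                comparable += 1
                if predicted_times[i] < predicted_times[j]:
                    concordant += 1.0   # predictions agree with the true order
                elif predicted_times[i] == predicted_times[j]:
                    concordant += 0.5   # tied predictions count for half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy example: the second person is censored at time 4.
times = [2, 4, 6, 8]
observed = [True, False, True, True]
predicted = [1, 5, 9, 3]
print(concordance_index(times, predicted, observed))  # 0.75
```

A value of 0.5 corresponds to random ordering and 1 to a perfect ordering of the comparable pairs.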


Info 232:

Legacy class material from 2018-2019.

Legacy homework from 2018-2019.

Acknowledgements: These challenges are hosted by CodaLab. We received a grant from the FCS Paris-Saclay and EIT Health and sponsorship from Microsoft Azure for Research and Google Research.

Mini Challenges 2017-2018

Challenges (M2 students) | Abstract | Task | L2 groups (who solved it)

Computer vision

Autonomous vehicles will become a common means of transportation very soon. However, obstacles remain to be overcome, in particular obstacle avoidance. This requires powerful computer vision algorithms. In this challenge you will contribute to solving the problem of recognizing animals and vehicles.

To illustrate this problem, we propose to study the CIFAR-10 image dataset, which groups entities that can interact with the vehicle's environment, such as animals (cat, horse, dog, ...) and vehicles (bike, car, truck, ...). We preprocessed the images so that you get to solve a multi-class classification problem from pre-computed features. Your score is the balanced accuracy, or BAC: the average of the per-class accuracies (the fraction of samples of each class that are correctly classified). Make predictions that are vectors [0 0 ... 1 ... 0 0], with a 1 at the ith position if you want to predict that your sample belongs to class i.
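As an illustration of this score, the sketch below computes a balanced accuracy from one-hot prediction vectors, taking it to be the mean over classes of the fraction of that class's samples that are correctly predicted. This is an assumption about the exact definition; the challenge's own scoring code may differ in details such as the handling of empty classes. The helper names and toy labels are hypothetical.

```python
def one_hot(label, n_classes):
    """Encode class index `label` as a vector [0 0 ... 1 ... 0 0]."""
    return [1 if k == label else 0 for k in range(n_classes)]


def balanced_accuracy(true_onehot, pred_onehot):
    """Mean over classes of the per-class accuracy (recall)."""
    n_classes = len(true_onehot[0])
    recalls = []
    for k in range(n_classes):
        # Indices of the samples whose true class is k.
        members = [i for i, row in enumerate(true_onehot) if row[k] == 1]
        if not members:
            continue  # class absent from the data: skipped in this sketch
        hits = sum(1 for i in members if pred_onehot[i][k] == 1)
        recalls.append(hits / len(members))
    return sum(recalls) / len(recalls)


# Toy example with 3 unbalanced classes.
true_labels = [0, 0, 0, 0, 1, 1, 2]
pred_labels = [0, 0, 0, 1, 1, 1, 0]
y_true = [one_hot(c, 3) for c in true_labels]
y_pred = [one_hot(c, 3) for c in pred_labels]
print(balanced_accuracy(y_true, y_pred))
```

Here the per-class accuracies are 3/4, 2/2 and 0/1, so the balanced accuracy is about 0.58, while the plain accuracy is 5/7: the rare class that is always missed weighs as much as the frequent one.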

Over-prescription of opioids

Over-prescription of opioid medicines presents a new public health problem because many people have become addicted. This challenge asks you to help predict which doctors tend to over-prescribe such medicines.

This is a binary classification task. The target indicates, for each medical prescription, whether or not an opioid has been prescribed. The features include, among others, the specialty of the doctor who wrote the prescription and the names of the non-opioid drugs present in the prescription.

Your score is the Gini or "normalized AUC": 2 AUC - 1. AUC stands for Area under ROC curve. Make numerical predictions for test samples that are larger for the positive class and smaller for the negative class (discriminant values). Random guesses give a score close to 0 while perfect predictions give a score of 1.
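The following sketch illustrates the score on toy data. It computes the AUC by direct pairwise comparison (the fraction of positive/negative pairs in which the positive sample receives the larger discriminant value, ties counting for half) and then applies Gini = 2 AUC - 1. The function names and the toy labels and scores are hypothetical, and the challenge's scoring program may be implemented differently.

```python
def auc(labels, scores):
    """Area under the ROC curve, computed by comparing all pos/neg pairs."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count for half
    return wins / (len(positives) * len(negatives))


def gini(labels, scores):
    """Normalized AUC: 0 for uninformative scores, 1 for a perfect ranking."""
    return 2.0 * auc(labels, scores) - 1.0


labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.3, 0.5, 0.8, 0.1]
print(gini(labels, scores))          # 8 of 9 pairs are ordered correctly
print(gini(labels, [0.5] * 6))       # constant predictions give 0.0
```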

House pricing

Predicting the price at which a house will sell helps people sell their property at a fair price. This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.

This is a regression problem. The dataset contains 19 house features plus the price and the id columns, along with 21613 observations.

Your score is the R-square = 1 - MSE / var(Y). It is 0 for the baseline method that predicts the average target value. It is 1 for perfect predictions. It can be negative if your predictions are worse than predicting the average target value!
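The three cases mentioned (baseline, perfect, and worse than baseline) can be checked with a short sketch of the formula. The function name and the toy target values are hypothetical; the variance is taken here as the population variance of the true targets.

```python
def r_square(y_true, y_pred):
    """R-square = 1 - MSE / var(Y), with var(Y) the population variance."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    var_y = sum((t - mean_y) ** 2 for t in y_true) / n
    return 1.0 - mse / var_y


y = [1.0, 2.0, 3.0, 4.0]
print(r_square(y, [2.5, 2.5, 2.5, 2.5]))  # baseline (mean of y): 0.0
print(r_square(y, y))                      # perfect predictions: 1.0
print(r_square(y, [4.0, 3.0, 2.0, 1.0]))  # worse than the baseline: -3.0
```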

Air quality challenge

Pollution, the introduction of various forms of waste into our environment, has negative effects on the ecosystem we rely on. With modernization and development, pollution has reached a peak, contributing to global warming and human illness.

This is a regression problem. The goal of this challenge is to predict the NOx levels in the air in Northern Taiwan, which are an indicator of pollution. The dataset was initially provided by the Environmental Protection Administration, Executive Yuan, R.O.C.

Your score is the R-square = 1 - MSE / var(Y). It is 0 for the baseline method that predicts the average target value. It is 1 for perfect predictions. It can be negative if your predictions are worse than predicting the average target value!

Give me some credit

This challenge deals with a fundamental task in the financial industry: credit scoring. In plain English, it means deciding whether or not to grant credit to someone, depending on her/his financial history.

This is a binary classification problem. The data set contains 150,000 instances split into 2 classes, indicating whether or not a client experiences serious delinquency within two years.

Your score is the Gini or "normalized AUC": 2 AUC - 1. AUC stands for Area under ROC curve. Make numerical predictions for test samples that are larger for the positive class and smaller for the negative class (discriminant values). Random guesses give a score close to 0 while perfect predictions give a score of 1.

Acknowledgements: These challenges were generated with ChaLab and are hosted by CodaLab. We received a grant from the FCS Paris-Saclay and sponsorship from Microsoft Azure for Research.

Auto-sklearn performances

Mini Challenges 2016-2017

Challenges (M2 students) | Abstract | Task | L2 groups (who solved it)

Activity of molecules against HIV

The problem is to relate molecular structure to activity in order to screen new compounds before actually testing them with High Throughput Screening (HTS) in vitro experiments. HTS is a method for massive scientific experimentation used in drug discovery, linking the fields of biology and chemistry. This method remains a very costly process despite many recent technological advances in biotechnology. This is why machine learning methods would be of great benefit to the pharmaceutical industry: they can reduce the number of compounds that need to be tested.

The objective is to predict which compounds are active against HIV infection. The dataset has two classes: active or inactive (binary classification). The variables represent properties of the molecule inferred from its structure.

Note: this project is running on the LRI server. In case of problems, a previous version on the main Codalab instance is available.

Lothlorien

This challenge addresses the issue of restricting access to resources (websites, drug purchases, violent movies, etc.) based on a person's age. Indeed, a lot of violent content is accessible on the internet, and 45% of children under 12 are not monitored by parental controls. To this end, we rely on a real-time image of the person to estimate his or her age category. Facial aging effects are mainly correlated with bone movement and growth, skin wrinkles, and reduction of muscle strength. Since human observation lacks accuracy, we want to find an automatic algorithm to make this distinction.

A computer vision challenge is proposed for undergraduate students, in which the participant must predict the class of a person (adult or minor) based on a picture of his/her face.

Note: the main Codalab instance of this challenge has been tested.

Note: this project is running on the LRI server. In case of problems, a previous version on the main Codalab instance is available.

Ecocity

Help SimCity's mayor fight pollution and traffic jams by optimizing the city's bike rental system!

SimCity's mayor has invested a lot of money to fight pollution and reduce traffic jams. Her first action was the purchase of a bike rental system. To improve the system, she wishes to predict the number of bikes rented at each station at any moment of the day using weather data.

The challenge is to use weather data (temperature, humidity, cloud cover) to predict the number of bikes rented at a given station on a given day. To make the challenge more interesting, predictions are requested for either the morning or the afternoon.

Note: this project is running on the LRI server. In case of problems, a previous version on the main Codalab instance is available.

Movie recommendation

There is more and more music to listen to, and there are more and more movies to watch and things to buy on the Internet. Therefore, developing systems that help users find items they may like is crucial. Recommending items is different from "classical" machine learning, where you only have to predict a class given several features. Recommendation implies using predictions to recommend suitable items (in this case movies) to the right people. In addition, these preferences can sometimes evolve over time.

In this challenge, you will work on the famous MovieLens dataset. The goal is to predict, for a given user and a given film, the rating that the user is most likely to give.

Note: There is also an LRI version. Warning: the two versions used different scores. They should now both use a_metric = 1 - MAE/MAD.
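The score named in this note can be sketched as follows. MAE is the mean absolute error of the predictions. The note does not define MAD; the sketch assumes it is the mean absolute deviation of the true ratings from their own mean, so that always predicting the mean rating scores 0 and perfect predictions score 1. That reading, the function body, and the toy ratings are assumptions for illustration only.

```python
def a_metric(y_true, y_pred):
    """Illustrative 1 - MAE/MAD score (MAD definition is an assumption)."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    # Mean absolute error of the predictions.
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    # Assumed MAD: mean absolute deviation of the true values from their mean.
    mad = sum(abs(t - mean_y) for t in y_true) / n
    return 1.0 - mae / mad


ratings = [1.0, 2.0, 4.0, 5.0]                   # true ratings, mean 3.0
print(a_metric(ratings, [2.0, 2.0, 4.0, 4.0]))   # MAE 0.5, MAD 1.5
print(a_metric(ratings, [3.0, 3.0, 3.0, 3.0]))   # mean baseline gives 0.0
```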

Pick The Sneak Peek

According to IMDb, 60,234 titles (movies and TV shows) were released in 2000, 165,830 in 2010, and 190,275 in 2016. The number of releases keeps growing, and the databases that aggregate this data need more information to keep up.

This is a text processing challenge.

The idea is to facilitate the genre labeling of movies from their summaries, and thus to help categorize the movie database.

Note: this project is running on the LRI server. In case of problems, a previous version on the main Codalab instance is available.

The Godfather returns!

After last year's purge by Batman, the Godfather has returned, and he is looking for new talent: the best criminals in SF, so that crime organizations can prosper again and return to their golden age. To assess the recruits' abilities, records of their previous crimes in the San Francisco Bay Area are being investigated, background checks are being conducted on the candidates' CVs, and software is being developed to highlight the criminals' potential.

The goal is to design software to predict, for each criminal record, the category of crime. If the candidate's crime falls into the category that the Godfather needs, he will be recruited!

Note: No LRI implementation so far.

Acknowledgements: These challenges were generated with ChaLab and are hosted by CodaLab. We received a grant of the FCS Paris-Saclay and sponsorship of Microsoft Azure for Research.

Auto-sklearn performances

Mini Challenges 2015-2016

Challenges (M2 students) | Abstract | Task | L2 groups (who solved it)

Diabetes diagnosis

Diabetes will be the seventh most common cause of death in 2030 according to the World Health Organization. In 2014, the global prevalence of diabetes was estimated to be more than 9% among adults aged 18 and over. While most hospitals have the necessary medical equipment to treat this disease, some do not.

The task is a binary classification problem. Using the training set, you must learn to predict the length of stay of a patient given his or her diagnosis and medications. The label has two categories: a stay shorter than 7 days, or a stay of 7 days or more.

Restaurant recommendation

We propose a restaurant recommendation challenge: predict the rating that a particular user would give to any restaurant. We have very detailed information about the restaurants (geographical information, number of stars, reviews, etc.) and, for each person, a list of some restaurants he or she visited along with his or her personal ratings.

The participants will work on two main tasks:

Task 1: Select the most prevalent features in the three datasets.

Task 2: Improve the prediction results using other methods and by augmenting the training dataset with Yelp data.

Computer vision

Robots are taking a larger place in society every day, and soon they may be walking in the streets among us. Many problems need to be solved before that, and one of them is adaptation. An AI needs to adapt its vision of the world: when it sees an entity for the first time, it should be able to tell whether it is a domestic animal, a predator, a vehicle, or maybe another robot. That is where transfer learning comes in: extracting general features from specific examples of a group makes it possible to classify unknown entities efficiently.

The idea of the challenge is to learn how to separate distinct classes of images. More precisely, we consider different superclasses, like "aquatic animals", each containing several classes, like "dolphin", and the goal is to tell these superclasses apart.

Crimes in Gotham City

Batman is fighting on the front line to deliver Gotham City from crime. Now he and his team want to create a system to work more efficiently. They have crime data for Gotham City from recent years, collected from the GCPD and Batman's database. The data include the location, the time, and some other information about each crime. Some crimes have been solved; others have not.

The main goal of this project is to help Batman develop this system, that is, to classify crimes. You can treat it as a binary classification problem: predict whether or not a crime can be solved. You can also first use logistic regression to compute how likely a crime is to be solved. Batman can then use this system to prioritize crimes.
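The idea of ranking crimes by their estimated probability of being solved can be sketched with a one-feature logistic regression fitted by plain gradient descent. Everything in this example is hypothetical: the single feature (days elapsed before the crime was reported), the toy data, and the hyperparameters are assumptions chosen for illustration, and a real solution would use the challenge's actual features and probably a library implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit p(solved | x) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Difference between predicted probability and label, per sample.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        b -= learning_rate * sum(errors) / n
    return w, b


# Hypothetical toy data: days elapsed before a crime was reported, and
# whether it was eventually solved (1) or not (0).
days_to_report = [1, 2, 3, 4, 5, 6, 7, 8]
solved = [1, 1, 0, 1, 0, 1, 0, 0]
w, b = fit_logistic(days_to_report, solved)

# Rank new cases from most to least likely to be solved.
new_cases = [9, 2, 5]
ranked = sorted(new_cases, key=lambda x: sigmoid(w * x + b), reverse=True)
print(ranked)
```

On this toy data the fitted slope is negative, so cases reported sooner are ranked first.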

Opinion mining from text

In this project you will tackle the problem of opinion mining in movie reviews with a basic set of techniques used in text classification. Many sentiment-analysis methods for the classification of reviews use training and test data based on star ratings provided by reviewers. However, when reading reviews, it appears that the reviewers' ratings do not always give an accurate measure of the sentiment of the review.

The objective of the challenge is to determine the polarity of an opinion from raw text. Since this is a challenge for beginners, you will focus only on classifying opinions as positive or negative. You could go further and detail sentiments like happiness, sadness, or satisfaction, but that is not our goal in this contest.