Pump It Up

Machine learning classification competition

Posted by Kurt Eulau on October 03, 2020 · 3 mins read

Problem Description

Predict the operating condition of 14,850 unlabelled waterpoints given a labelled dataset of 54,900 waterpoints. Features in the labelled dataset include characteristics such as the location, installer, construction year, and type of well. This provided dataset has missing and incorrect values. This project is an entry in the Pump It Up competition held by Driven Data.

A full description from Driven Data can be found here.

The link to my Github repo for this particular project can be found here.

Datasets

The rules of the competition prohibit the publication of the data by third parties, though you can find the problem description at Driven Data using this link. Once you log in or create an account you can then gain access to the datasets.

Workflow of notebooks as seen in Github

  1. The EDA_clean.ipynb notebook contains plots and steps taken to clean the data. Some values are also imputed. Please note that several functions used in this notebook have been stored in imputing_functions.py in order to declutter the notebook. The data is then exported as a .csv file.

  2. train_catboost.ipynb and train_rf.ipynb import the .csv file from EDA_clean.ipynb and train a Catboost and Random Forest model, respectively.

  3. eval_catboost.ipynb and eval_rf.ipynb export a .csv file in the format required by the competition. They both require that the evaluation dataset provided by Driven Data be cleaned with EDA_clean.ipynb. The evaluation dataset does not include labels, as these are stored internally at Driven Data and are used to calculate the accuarcy submitted models.

Notes

Currently, the Random Forest model performs better on the training dataset but significantly worse during evaluation compared to the CatBoost model.

I have used GridSearchCV on both Random Forest and CatBoost models to tune hyperparameters but did not achieve an increase in accuracy.

Next steps (when I have the time) could include:

  • better imputation of missing data, especially construction year, population, and other numerical features
  • using regular expressions to match funders and installers rather than just forcing all to lower case and removing double speaces
  • assessing the reasons for the Random Forest model to do well in training and then losing around 18% accuracy during evaluation, despite the use of cross-validation and stratifying duing train_test_split
  • incorporate ensemble method to increase accuracy

Photo credit: title image provided under CC0 Public Domain license. No attribution required.