The goal is to predict the operating condition of 14,850 unlabelled waterpoints, given a labelled dataset of 54,900 waterpoints. Features include characteristics such as the location, installer, construction year, and type of well. The provided data contains missing and incorrect values. This project is an entry in the Pump It Up competition held by Driven Data.
A full description from Driven Data can be found here.
My GitHub repo for this project can be found here.
The rules of the competition prohibit third parties from publishing the data, but you can find the problem description at Driven Data using this link. Once you log in or create an account, you can access the datasets.
The EDA_clean.ipynb
notebook contains the plots and steps taken to clean the data, including imputation of some missing values. Several functions used in this notebook are stored in imputing_functions.py
to declutter the notebook. The cleaned data is then exported as a .csv file.
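As a rough illustration of what that cleaning step might look like, the sketch below replaces placeholder zeros, imputes missing values, and exports a cleaned .csv. The file names, the choice of columns, and the imputation rules here are assumptions for illustration, not the exact logic in EDA_clean.ipynb.

```python
import numpy as np
import pandas as pd

# Load the raw training data downloaded from Driven Data
# (file names are assumptions; adjust to match your downloads).
features = pd.read_csv("training_set_values.csv")
labels = pd.read_csv("training_set_labels.csv")
df = features.merge(labels, on="id")

# Zeros in construction_year and longitude act as placeholders for missing
# data, so convert them to NaN before imputing.
df["construction_year"] = df["construction_year"].replace(0, np.nan)
df["longitude"] = df["longitude"].replace(0, np.nan)

# Impute numeric gaps with the median and categorical gaps with a sentinel value.
df["construction_year"] = df["construction_year"].fillna(df["construction_year"].median())
df["longitude"] = df["longitude"].fillna(df["longitude"].median())
df["installer"] = df["installer"].fillna("unknown")

# Export the cleaned data for the training notebooks to pick up.
df.to_csv("train_clean.csv", index=False)
```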
train_catboost.ipynb
and train_rf.ipynb
import the .csv file from EDA_clean.ipynb
and train a CatBoost and a Random Forest model, respectively.
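A minimal sketch of that training step is shown below, assuming the cleaned .csv contains an id column, a status_group label, and no missing categorical values. The file name, split, and model settings are illustrative rather than the exact configuration in the notebooks.

```python
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load the cleaned data exported by EDA_clean.ipynb (file name is an assumption).
df = pd.read_csv("train_clean.csv")
X = df.drop(columns=["id", "status_group"])
y = df["status_group"]

# CatBoost handles categorical columns natively; the Random Forest needs them one-hot encoded.
cat_cols = X.select_dtypes(include="object").columns.tolist()
X_encoded = pd.get_dummies(X, columns=cat_cols)

# Hold out a validation split so the two models can be compared on unseen data.
X_tr, X_val, Xe_tr, Xe_val, y_tr, y_val = train_test_split(
    X, X_encoded, y, test_size=0.2, random_state=42, stratify=y
)

rf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=42)
rf.fit(Xe_tr, y_tr)
print("Random Forest validation accuracy:", rf.score(Xe_val, y_val))

cb = CatBoostClassifier(iterations=500, learning_rate=0.1, random_seed=42, verbose=0)
cb.fit(X_tr, y_tr, cat_features=cat_cols)
print("CatBoost validation accuracy:", cb.score(X_val, y_val))
```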
eval_catboost.ipynb
and eval_rf.ipynb
export a .csv file in the format required by the competition. They both require that the evaluation dataset provided by Driven Data be cleaned with EDA_clean.ipynb
. The evaluation dataset does not include labels, as these are stored internally at Driven Data and are used to calculate the accuracy of submitted models.
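The sketch below shows one way the evaluation notebooks could produce the required submission file, assuming a cleaned test .csv and a CatBoost model saved during training. The file names and the saved-model step are assumptions for illustration.

```python
import pandas as pd
from catboost import CatBoostClassifier

# Load the cleaned evaluation set, produced by running EDA_clean.ipynb on the
# test file from Driven Data (file names are assumptions).
test = pd.read_csv("test_clean.csv")
X_test = test.drop(columns=["id"])

# Load a CatBoost model assumed to have been saved during training
# with model.save_model("catboost_model.cbm").
model = CatBoostClassifier()
model.load_model("catboost_model.cbm")

# Driven Data expects two columns: id and the predicted status_group.
submission = pd.DataFrame({
    "id": test["id"],
    "status_group": model.predict(X_test).ravel(),
})
submission.to_csv("submission.csv", index=False)
```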
Currently, the Random Forest model achieves higher accuracy on the training dataset but scores significantly lower during evaluation than the CatBoost model, which suggests the Random Forest is overfitting.
I have used GridSearchCV on both the Random Forest and CatBoost models to tune hyperparameters, but this did not improve accuracy.
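For reference, here is a self-contained sketch of that kind of GridSearchCV tuning, with an illustrative parameter grid rather than the grid actually searched in the notebooks.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Reload the cleaned training data (file and column names are assumptions).
df = pd.read_csv("train_clean.csv")
X = pd.get_dummies(df.drop(columns=["id", "status_group"]))
y = df["status_group"]

# An illustrative grid; the values searched in the notebooks may differ.
param_grid = {
    "n_estimators": [200, 400],
    "max_depth": [None, 20, 40],
    "min_samples_leaf": [1, 3],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```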
Next steps (when I have the time) could include:
Photo credit: title image provided under CC0 Public Domain license. No attribution required.