Map Covid-19 vulnerabilities in South Africa using machine learning on old data.
This hackathon was about predicting the vulnerabilities of South Africa's population during the Covid-19 pandemic using old data. The target variable to predict is the percentage of large households who have to leave their premises for water.
The data contains information about different wards (areas in South Africa), such as the number of households or individuals in a ward, the percentage of each dwelling type, the percentage reporting present school attendance, data about the wealth of the wards, and so on.
I tried different models, such as:
- Linear Regression;
- Elastic Net;
- LightGBM;
- eXtreme Gradient Boosting (XGBoost);
- CatBoost
and others. CatBoost, the last of these, was my best model.
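As a rough illustration of how such a comparison can be set up, here is a minimal sketch using cross-validation. It is not the notebook's code: the data is synthetic, RMSE is an assumed metric, and scikit-learn's `GradientBoostingRegressor` stands in for the boosted-tree libraries (LightGBM, XGBoost, CatBoost) so the example only depends on scikit-learn.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import ElasticNet, LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for the ward-level features (assumption).
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

# Candidate models; the gradient boosting entry is a stand-in for
# LightGBM / XGBoost / CatBoost, which expose similar fit/predict interfaces.
models = {
    "linear_regression": LinearRegression(),
    "elastic_net": ElasticNet(alpha=0.1, l1_ratio=0.5),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Same folds for every model so the scores are comparable.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

rmse_by_model = {}
for name, model in models.items():
    # scikit-learn returns negated RMSE (higher is better), so flip the sign.
    scores = cross_val_score(
        model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
    )
    rmse_by_model[name] = -scores.mean()

# Lowest mean RMSE wins.
best_model = min(rmse_by_model, key=rmse_by_model.get)
print(rmse_by_model, best_model)
```

Swapping in `catboost.CatBoostRegressor` or another library's regressor should only require adding an entry to the `models` dictionary, assuming that library is installed.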
The submitted solution involves very little feature engineering: only the outliers in the target variable were removed, and the features that needed transformation were scaled. All features were used except those containing only zeros or approximately 50% zeros.
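The preprocessing described above could look roughly like the sketch below. The column names, the 1.5 × IQR outlier rule, the 0.5 zero-fraction threshold, and the use of standard scaling are all assumptions made for illustration; the notebook may use different rules.

```python
import numpy as np
import pandas as pd


def drop_zero_heavy_columns(df, feature_cols, max_zero_fraction=0.5):
    """Return feature names whose share of zeros is below the threshold."""
    zero_fraction = (df[feature_cols] == 0).mean()
    return [c for c in feature_cols if zero_fraction[c] < max_zero_fraction]


def remove_target_outliers(df, target_col, k=1.5):
    """Keep rows whose target lies within k * IQR of the quartiles (assumed rule)."""
    q1, q3 = df[target_col].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = df[target_col].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask].reset_index(drop=True)


def standardize(df, cols):
    """Scale the given columns to zero mean and unit variance."""
    out = df.copy()
    out[cols] = (out[cols] - out[cols].mean()) / out[cols].std(ddof=0)
    return out


# Toy data with hypothetical column names.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "total_households": rng.integers(100, 5000, size=200).astype(float),
    "all_zero": np.zeros(200),
    "mostly_zero": np.where(np.arange(200) % 4 == 0, 1.0, 0.0),  # 75% zeros
    "target": rng.normal(10, 2, size=200),
})
toy.loc[0, "target"] = 500.0  # inject an obvious outlier

features = ["total_households", "all_zero", "mostly_zero"]
kept = drop_zero_heavy_columns(toy, features)
clean = remove_target_outliers(toy, "target")
scaled = standardize(clean, kept)
print(kept, len(toy), len(clean))
```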
I worked on Google Colab, so you can find the notebook in the notebooks folder or directly here: vulnerability_covid_map_sa
I tried quite a few feature selection techniques, such as the XGBoost feature selection function and backward elimination. Neither gave a meaningful performance improvement. I also applied a neural net, but the results were very similar to those of my submitted ensemble model and no better.
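For reference, one way to run backward elimination is scikit-learn's `SequentialFeatureSelector` with `direction="backward"`, which starts from all features and repeatedly drops the one whose removal hurts the cross-validated score least. This is only a sketch on synthetic data; the estimator, the number of features to keep, and the scoring metric are assumptions, and the notebook may implement backward elimination differently (for example, based on p-values).

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data: 8 features, only 3 of which carry signal (assumption).
X, y = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=5.0, random_state=0
)

# Start from all 8 features and remove one at a time until 3 remain.
selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=3,
    direction="backward",
    scoring="neg_root_mean_squared_error",
    cv=5,
)
selector.fit(X, y)

selected_idx = selector.get_support(indices=True)  # indices of kept columns
X_reduced = selector.transform(X)                  # data with kept columns only
print(selected_idx, X_reduced.shape)
```

Comparing the cross-validated score of the model on `X_reduced` against the score on the full `X` shows whether the selection actually helps.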
- [ ] Improve feature selection and engineering
- [ ] Reorder feature selection and neural net code
- [ ] Upload the notebooks and Python files if better results are obtained.
For any questions or contributions, you can reach me on Twitter: @balde_ahmed. Thank you :)