This is a practical introduction to two Python data analysis and machine learning libraries, pandas and scikit-learn, through a Kaggle competition problem. The tutorial is an adaptation of the PyCon UK Introductory Tutorial given by Ezzeri Esa. The original version can be found here: https://github.com/savarin/pyconuk-introtutorial
Compared to the original PyCon introductory tutorial, this tutorial adds more sophisticated analyses for:
- data exploration and visualisation
- data preprocessing, including feature selection
- cross-validation and hyperparameter tuning for various model types through the use of pipelines
- model comparison with statistical significance tests on accuracy and area under the ROC curve, both estimated from cross-validation
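To give a flavour of the pipeline-based tuning listed above, here is a minimal sketch. It is not taken from the notebooks: the synthetic data from `make_classification`, the step names, and the parameter grid are assumptions chosen for illustration, and the imports assume a recent scikit-learn release that provides `sklearn.model_selection`.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the competition data (assumption: the notebooks
# work on the Kaggle Titanic CSV files instead).
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=0
)

# Chain preprocessing, feature selection and the classifier so that every
# step is re-fitted inside each cross-validation fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", SVC()),
])

# Hypothetical grid: "<step name>__<parameter>" addresses a pipeline step.
param_grid = {
    "select__k": [3, 5, 10],
    "clf__C": [0.1, 1.0, 10.0],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=cv)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Putting the feature selection inside the pipeline, rather than running it once on the full data set, keeps information from the held-out folds out of the selection step.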
This tutorial requires pandas and scikit-learn, and is best run in the IPython Notebook. If you're not sure how to install these packages, we recommend the free Anaconda distribution.
Once everything is installed, you should be able to type
ipython notebook
in your terminal window and see the notebook panel load in your web browser.
You can clone the material in this tutorial using git as follows:
git clone https://github.com/pipalu/TutorialMLPython.git
Alternatively, there is a link above to download the contents of this repository as a zip file.
The notebooks can be viewed in a static fashion using the nbviewer site, as per the links in the section below. However, we recommend reviewing them interactively with the IPython Notebook.
The tutorial starts with data manipulation using pandas: loading and cleaning data. We then explore the data with some visualisation and use scikit-learn to make predictions. By the end of the tutorial, we will have worked through the Kaggle Titanic competition from start to finish, over a number of iterations of increasing sophistication.
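As a rough idea of what the loading and cleaning step looks like, here is a small pandas sketch. The inline CSV is a hypothetical five-row sample whose column names mimic the Titanic training file, and the imputation choices (median age, most frequent port) are assumptions for illustration rather than the exact steps used in the notebooks.

```python
import io

import pandas as pd

# Hypothetical sample; the notebooks read the Kaggle CSV from disk instead.
csv_text = """PassengerId,Survived,Pclass,Sex,Age,Fare,Embarked
1,0,3,male,22,7.25,S
2,1,1,female,38,71.2833,C
3,1,3,female,,7.925,S
4,1,1,female,35,53.1,
5,0,3,male,,8.05,S
"""
df = pd.read_csv(io.StringIO(csv_text))

# Fill missing ages with the median and missing ports with the mode.
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Encode the categorical columns as numbers so scikit-learn can use them.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
df = pd.get_dummies(df, columns=["Embarked"])

print(df.head())
```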
- Section 1 - Data cleaning and exploration.ipynb
- Section 2 - Parameter Tuning for Random Forest.ipynb
- Section 3 - Cross-validation with ROC analysis.ipynb
- Section 4 - SVM with Feature Selection and Parameter Tuning.ipynb
- Section 5 - Model Comparison using Pipelines.ipynb
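The model comparison that the last section builds up to can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's code: the data is synthetic, the two models are arbitrary picks, and the paired t-test from `scipy.stats.ttest_rel` is one possible significance test, not necessarily the one the notebook uses.

```python
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the Titanic features and labels (assumption).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)

# A fixed random_state gives both models exactly the same folds,
# so their per-fold scores can be compared as pairs.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "svm": make_pipeline(StandardScaler(), SVC()),
}

# Area under the ROC curve for each fold and each model.
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    for name, model in models.items()
}

# Paired t-test on the per-fold scores.
t_stat, p_value = ttest_rel(scores["random_forest"], scores["svm"])

for name, fold_scores in scores.items():
    print(name, round(fold_scores.mean(), 3))
print("p-value:", round(p_value, 3))
```

Scores from overlapping training folds are not independent, so the p-value from a plain paired t-test should be read as a rough guide rather than an exact figure.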
A Kaggle account is required to make submissions and review our performance on the leaderboard.
Most of the credit goes to the original instructor of the PyCon UK Introductory Tutorial, Ezzeri Esa ([savarin](https://github.com/savarin)), for providing the excellent tutorial materials through GitHub.