Note: sometimes your answer doesn't match one of the options exactly. That's fine. Select the option that's closest to your solution.
In this homework, we will use the California Housing Prices from Kaggle.
Here's a wget-able link:
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv
The goal of this homework is to create a regression model for predicting housing prices (column 'median_house_value').
For this homework, we only want to use a subset of data. This is the same subset we used in homework #2. But in contrast to homework #2, we are going to use all columns of the dataset.
First, keep only the records where ocean_proximity is either '<1H OCEAN' or 'INLAND'.
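The filter above can be sketched with pandas as follows. The tiny frame here is made up so the snippet runs offline; in the homework you would presumably load the real file with pd.read_csv("housing.csv") instead.

```python
import pandas as pd

# Hypothetical mini-frame standing in for the real housing.csv.
df = pd.DataFrame({
    "ocean_proximity": ["<1H OCEAN", "INLAND", "NEAR BAY", "ISLAND", "INLAND"],
    "median_house_value": [350000.0, 120000.0, 410000.0, 500000.0, 95000.0],
})

# Keep only the two categories named in the task and renumber the rows.
keep = ["<1H OCEAN", "INLAND"]
df = df[df["ocean_proximity"].isin(keep)].reset_index(drop=True)

print(df["ocean_proximity"].value_counts())
```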
Preparation:
- Fill missing values with zeros.
- Apply the log transform to median_house_value.
- Do train/validation/test split with 60%/20%/20% distribution.
- Use the train_test_split function and set the random_state parameter to 1.
- Use DictVectorizer(sparse=True) to turn the dataframes into matrices.
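Here is one possible sketch of the preparation steps. It runs on a small synthetic frame that borrows a few of the real column names, so the numbers mean nothing. Two choices are assumptions rather than something the homework states: np.log1p as the log transform, and a second split with test_size=0.25 to carve the 20% validation part out of the remaining 80%.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the filtered housing data (assumption, not the real file).
rng = np.random.default_rng(1)
n = 100
df = pd.DataFrame({
    "median_income": rng.uniform(0.5, 15.0, n),
    "total_bedrooms": rng.integers(1, 2000, n).astype(float),
    "ocean_proximity": rng.choice(["<1H OCEAN", "INLAND"], n),
    "median_house_value": rng.uniform(50_000, 500_000, n),
})
df.loc[::10, "total_bedrooms"] = np.nan  # simulate some missing values

# Fill missing values with zeros, then log-transform the target (log1p is assumed).
df = df.fillna(0)
df["median_house_value"] = np.log1p(df["median_house_value"])

# 60/20/20: hold out 20% for test, then 25% of the remaining 80% for validation.
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

def to_xy(frame):
    """Split a frame into (features, target) and reset its index."""
    frame = frame.reset_index(drop=True)
    y = frame.pop("median_house_value").values
    return frame, y

df_train, y_train = to_xy(df_train)
df_val, y_val = to_xy(df_val)
df_test, y_test = to_xy(df_test)

# Fit the vectorizer on train only, then reuse it for validation and test.
dv = DictVectorizer(sparse=True)
X_train = dv.fit_transform(df_train.to_dict(orient="records"))
X_val = dv.transform(df_val.to_dict(orient="records"))
X_test = dv.transform(df_test.to_dict(orient="records"))

print(X_train.shape, X_val.shape, X_test.shape)
```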
Let's train a decision tree regressor to predict the median_house_value variable.
- Train a model with max_depth=1.
Which feature is used for splitting the data?
- ocean_proximity
- total_rooms
- latitude
- population
Train a random forest model with these parameters:
- n_estimators=10
- random_state=1
- n_jobs=-1 (optional - to make training faster)
What's the RMSE of this model on validation?
- 0.045
- 0.245
- 0.545
- 0.845
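A minimal sketch of training the forest and computing validation RMSE is shown below. The data comes from make_regression purely so the snippet is runnable on its own; in the homework, X_train, y_train, X_val and y_val would be the matrices and log-transformed targets from the preparation step.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the housing matrices (assumption).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=1)

rf = RandomForestRegressor(n_estimators=10, random_state=1, n_jobs=-1)
rf.fit(X_train, y_train)

# RMSE is the square root of the mean squared error on the validation set.
y_pred = rf.predict(X_val)
rmse = np.sqrt(mean_squared_error(y_val, y_pred))
print(f"validation RMSE: {rmse:.3f}")
```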
Now let's experiment with the n_estimators parameter:
- Try different values of this parameter from 10 to 200 with step 10.
- Set random_state to 1.
- Evaluate the model on the validation dataset.
After which value of n_estimators does RMSE stop improving?
Consider 3 decimal places for retrieving the answer.
- 10
- 25
- 50
- 160
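The sweep over n_estimators could look roughly like this. The data is synthetic again, and reading "stops improving" as "the smallest n_estimators that reaches the lowest RMSE after rounding to 3 decimals" is only one possible interpretation; looking at the printed table (or a plot) is just as valid.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared housing matrices (assumption).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=1)

scores = []
for n in range(10, 201, 10):  # 10, 20, ..., 200
    rf = RandomForestRegressor(n_estimators=n, random_state=1, n_jobs=-1)
    rf.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_val, rf.predict(X_val)))
    scores.append((n, round(float(rmse), 3)))

for n, rmse in scores:
    print(f"n_estimators={n:3d}  rmse={rmse:.3f}")

# One reading of "stops improving": first n that hits the best rounded RMSE.
best_rmse = min(r for _, r in scores)
first_best_n = next(n for n, r in scores if r == best_rmse)
print("first n_estimators with the best rounded RMSE:", first_best_n)
```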
Let's select the best max_depth:
- Try different values of max_depth: [10, 15, 20, 25]
- For each of these values,
  - try different values of n_estimators from 10 till 200 (with step 10)
  - calculate the mean RMSE
- Fix the random seed: random_state=1
What's the best max_depth, using the mean RMSE?
- 10
- 15
- 20
- 25
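The nested search can be sketched as two loops that collect the RMSE for every (max_depth, n_estimators) pair and then average per depth. The dataset below is synthetic and small, so the winning depth it prints is not the homework answer.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared housing matrices (assumption).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=1)

mean_rmse = {}
for depth in [10, 15, 20, 25]:
    rmses = []
    for n in range(10, 201, 10):
        rf = RandomForestRegressor(
            n_estimators=n, max_depth=depth, random_state=1, n_jobs=-1
        )
        rf.fit(X_train, y_train)
        rmses.append(np.sqrt(mean_squared_error(y_val, rf.predict(X_val))))
    # Average over all n_estimators values for this depth.
    mean_rmse[depth] = float(np.mean(rmses))

for depth, score in mean_rmse.items():
    print(f"max_depth={depth:2d}  mean rmse={score:.3f}")

best_depth = min(mean_rmse, key=mean_rmse.get)
print("best max_depth by mean RMSE:", best_depth)
```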
We can extract feature importance information from tree-based models.
At each step, the decision tree learning algorithm finds the best split. For every split we can calculate the "gain": the reduction in impurity that the split achieves. This gain is quite useful for understanding which features are important for tree-based models.
In Scikit-Learn, tree-based models contain this information in the feature_importances_ field.
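To make "gain" concrete, here is a small illustration for the regression case, where impurity is taken to be the variance of the target in a node. The helper variance_gain is a hypothetical name written for this example. It is not how Scikit-Learn exposes the value; it only mirrors the idea that gain is the parent impurity minus the size-weighted impurity of the two children.

```python
import numpy as np

def variance_gain(y, left_mask):
    """Impurity reduction of a split, using variance as the impurity measure."""
    y = np.asarray(y, dtype=float)
    left_mask = np.asarray(left_mask, dtype=bool)
    left, right = y[left_mask], y[~left_mask]
    n = len(y)
    # Children are weighted by the share of samples they receive.
    children = (len(left) / n) * left.var() + (len(right) / n) * right.var()
    return y.var() - children

y = [1.0, 1.0, 5.0, 5.0]

# A split that separates low from high targets removes all of the variance.
good_split = variance_gain(y, [True, True, False, False])
# A split that mixes them equally removes none of it.
bad_split = variance_gain(y, [True, False, True, False])

print(good_split, bad_split)  # 4.0 0.0
```

A feature that keeps producing splits like the first one accumulates a lot of gain across the trees, which is why it ends up with a high importance score.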
For this homework question, we'll find the most important feature:
- Train the model with these parameters:
  - n_estimators=10,
  - max_depth=20,
  - random_state=1,
  - n_jobs=-1 (optional)
- Get the feature importance information from this model
What's the most important feature (among these 4)?
- total_rooms
- median_income
- total_bedrooms
- longitude
Now let's train an XGBoost model! For this question, we'll tune the eta parameter:
- Install XGBoost
- Create DMatrix for train and validation
- Create a watchlist
- Train a model with these parameters for 100 rounds:
xgb_params = {
    'eta': 0.3,
    'max_depth': 6,
    'min_child_weight': 1,
    'objective': 'reg:squarederror',
    'nthread': 8,
    'seed': 1,
    'verbosity': 1,
}
Now change eta from 0.3 to 0.1.
Which eta leads to the best RMSE score on the validation dataset?
- 0.3
- 0.1
- Both give equal value
- Submit your results here: https://forms.gle/Qa2SuzG7QGZNCaoV9
- If your answer doesn't match options exactly, select the closest one.
- You can submit your solution multiple times. In this case, only the last submission will be used.
The deadline for submitting is October 23 (Monday), 23:00 CET. After that the form will be closed.