# Statistical models
In this chapter, we will not learn about all the models out there that you may or may not need.
Instead, I will show you how you can use what you have learned so far and how you can apply these
concepts to modeling. Also, as you read at the beginning of the book, R has many, many packages. So
the model you need is most probably already implemented in some package, and you will very likely
not need to write your own from scratch.
In the first section, I will discuss the terminology used in this book. Then I will discuss
linear regression; showing how linear regression works illustrates very well how other models
work too, without loss of generality. Then I will introduce the concept of hyper-parameters
with ridge regression. This chapter will then finish with an introduction to cross-validation as
a way to tune the hyper-parameters of models that feature them.
## Terminology
Before going further with statistical models and model fitting, it is worthwhile to clarify some
terminology. Depending on your background, you might call an explanatory variable a
feature or the dependent variable the target. These are the same objects. The matrix of features
is usually called a design matrix, and what statisticians call the intercept is what
machine learning engineers call the bias. Referring to the intercept as the bias is unfortunate, as bias
also has a very different meaning; bias is also what we call the error in a model that may cause
*biased* estimates. To finish up, the estimated parameters of the model may be called coefficients
or weights. Here again, I don't like using *weight*, as weight has a very different meaning in
statistics.
So, in the remainder of this chapter, and of this book, I will use the terminology from the statistical
literature, using dependent and explanatory variables (`y` and `x`), and calling the
estimated parameters coefficients and the intercept... well the intercept (the $\beta$s of the model).
However, I will talk of *training* a model, instead of *estimating* a model.
## Fitting a model to data
Suppose you have a variable `y` that you wish to explain using a set of other variables `x1`, `x2`,
`x3`, etc. Let's take a look at the `Housing` dataset from the `Ecdat` package:
```{r, include=FALSE}
library(Ecdat)
data(Housing)
```
```{r, eval=FALSE}
library(Ecdat)
data(Housing)
```
You can read a description of the dataset by running:
```{r, eval=FALSE}
?Housing
```
```
Housing package:Ecdat R Documentation
Sales Prices of Houses in the City of Windsor
Description:
a cross-section from 1987
_number of observations_ : 546
_observation_ : goods
_country_ : Canada
Usage:
data(Housing)
Format:
A dataframe containing :
price: sale price of a house
lotsize: the lot size of a property in square feet
bedrooms: number of bedrooms
bathrms: number of full bathrooms
stories: number of stories excluding basement
driveway: does the house has a driveway ?
recroom: does the house has a recreational room ?
fullbase: does the house has a full finished basement ?
gashw: does the house uses gas for hot water heating ?
airco: does the house has central air conditioning ?
garagepl: number of garage places
prefarea: is the house located in the preferred neighbourhood of the city ?
Source:
Anglin, P.M. and R. Gencay (1996) “Semiparametric estimation of
a hedonic price function”, _Journal of Applied Econometrics_,
*11(6)*, 633-648.
References:
Verbeek, Marno (2004) _A Guide to Modern Econometrics_, John Wiley
and Sons, chapter 3.
Journal of Applied Econometrics data archive : <URL:
http://qed.econ.queensu.ca/jae/>.
See Also:
‘Index.Source’, ‘Index.Economics’, ‘Index.Econometrics’,
‘Index.Observations’
```
or by looking for `Housing` in the help pane of RStudio. Usually, you would take a look at the data
before doing any modeling:
```{r}
glimpse(Housing)
```
Housing prices depend on a set of variables such as the number of bedrooms, the area it is located in,
and so on. If you believe that housing prices depend linearly on a set of explanatory variables,
you will want to estimate a linear model. To estimate a *linear model*, you will need to use the
built-in `lm()` function:
```{r}
model1 <- lm(price ~ lotsize + bedrooms, data = Housing)
```
`lm()` takes a formula as an argument, which defines the model you want to estimate. In this case,
I ran the following regression:
\[
\text{price} = \beta_0 + \beta_1 * \text{lotsize} + \beta_2 * \text{bedrooms} + \varepsilon
\]
where \(\beta_0, \beta_1\) and \(\beta_2\) are three parameters to estimate. To take a look at the
results, you can use the `summary()` method (not to be confused with `dplyr::summarise()`):
```{r}
summary(model1)
```
if you wish to remove the intercept (\(\beta_0\) in the above equation) from your model, you can
do so with `-1`:
```{r}
model2 <- lm(price ~ -1 + lotsize + bedrooms, data = Housing)
summary(model2)
```
or if you want to use all the columns inside `Housing`, replacing the column names by `.`:
```{r}
model3 <- lm(price ~ ., data = Housing)
summary(model3)
```
You can access different elements of `model3` with `$`, because the result of `lm()` is a list
(you can check this claim with `typeof(model3)`):
```{r}
print(model3$coefficients)
```
but I prefer to use the `{broom}` package, and more specifically the `tidy()` function, which
converts `model3` into a neat `data.frame`:
```{r}
results3 <- broom::tidy(model3)
glimpse(results3)
```
I explicitly write `broom::tidy()` because `tidy()` is a popular function name. For instance,
it is also a function from the `{yardstick}` package, which does not do the same thing at all. Since
I will also be using `{yardstick}`, I prefer to explicitly write `broom::tidy()` to avoid conflicts.
Using `broom::tidy()` is useful, because you can then work on the results easily, for example if
you wish to only keep results that are significant at the 5\% level:
```{r}
results3 %>%
filter(p.value < 0.05)
```
You can even add new columns, such as the confidence intervals:
```{r}
results3 <- broom::tidy(model3, conf.int = TRUE, conf.level = 0.95)
print(results3)
```
Going back to model estimation, you can of course use `lm()` in a pipe workflow:
```{r}
Housing %>%
select(-driveway, -stories) %>%
lm(price ~ ., data = .) %>%
broom::tidy()
```
The first `.` in the `lm()` function is used to indicate that we wish to use all the data from `Housing`
(minus `driveway` and `stories`, which I removed using `select()` and the `-` sign), and the second `.`
indicates where the result of the two `{dplyr}` instructions that precede it should be placed.
The picture below should help you understand:
```{r, echo = FALSE}
knitr::include_graphics("pics/pipe_to_second_position.png")
```
You have to specify this because, by default, when using `%>%`, the left-hand side gets
passed as the first argument of the function on the right-hand side.
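As a quick illustration of this default behaviour (a trivial sketch, not part of the original analysis), the two calls below are equivalent:
```{r, eval=FALSE}
# `%>%` passes the left-hand side as the first argument of the function on the right.
Housing %>% glimpse()
glimpse(Housing)
```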
Since version 4.2, R now also natively includes a placeholder, `_`:
```{r}
Housing |>
select(-driveway, -stories) |>
lm(price ~ ., data = _) |>
broom::tidy()
```
For the example above, I've also switched from `%>%` to `|>`, or else I can't use the `_` placeholder.
The advantage of the `_` placeholder is that it disambiguates `.`. So here, the `.` is a placeholder for
all the variables in the dataset, and `_` is a placeholder for the dataset.
## Diagnostics
Diagnostics are useful metrics to assess model fit. You can read some of these diagnostics, such as
the \(R^2\) at the bottom of the summary (when running `summary(my_model)`), but if you want to do
more than simply reading these diagnostics from RStudio, you can put those in a `data.frame` too,
using `broom::glance()`:
```{r}
glance(model3)
```
You can also plot the usual diagnostics plots using `ggfortify::autoplot()` which uses the
`{ggplot2}` package under the hood:
```{r}
library(ggfortify)
autoplot(model3, which = 1:6) + theme_minimal()
```
`which = 1:6` is an additional option that shows you all the diagnostic plots. If you omit this
option, you will only get 4 of them.
You can also get the residuals of the regression in two ways; either you grab them directly from
the model fit:
```{r}
resi3 <- residuals(model3)
```
or you can augment the original data with a residuals column, using `broom::augment()`:
```{r, include=FALSE}
housing_aug <- augment(model3)
```
```{r, eval=FALSE}
housing_aug <- augment(model3)
```
Let's take a look at `housing_aug`:
```{r}
glimpse(housing_aug)
```
A few columns have been added to the original data, among them `.resid` which contains the
residuals. Let's plot them:
```{r}
ggplot(housing_aug) +
geom_density(aes(.resid))
```
Fitted values are also added to the original data, under the variable `.fitted`. It would also have
been possible to get the fitted values with:
```{r}
fit3 <- fitted(model3)
```
but I prefer using `augment()`, because the columns get merged into the original data, which then
makes it easier to find specific individuals. For example, you might want to know for which housing
units the model underestimates the price:
```{r}
total_pos <- housing_aug %>%
filter(.resid > 0) %>%
summarise(total = n()) %>%
pull(total)
```
we find `r total_pos` individuals where the residuals are positive. It is also easier to
extract outliers:
```{r}
housing_aug %>%
mutate(prank = cume_dist(.cooksd)) %>%
filter(prank > 0.99) %>%
glimpse()
```
`prank` is a variable I created with `cume_dist()` which is a `dplyr` function that returns the
proportion of all values less than or equal to the current rank. For example:
```{r}
example <- c(5, 4.6, 2, 1, 0.8, 0, -1)
cume_dist(example)
```
by filtering `prank > 0.99` we get the top 1% of outliers according to Cook's distance.
## Interpreting models
Model interpretation is essential in the social sciences, but it is also getting very important
in machine learning. As usual, the terminology is different; in machine learning, we speak about
explainability. There is a very important distinction one has to understand when it comes to
interpretability/explainability: the one between *classical, parametric* models and *black-box* models. This
is very well explained in @breiman2001, an absolute must-read (link to the paper, in PDF format:
[click here](https://projecteuclid.org/download/pdf_1/euclid.ss/1009213726)). The gist of the paper
is that there are two cultures of statistical modeling; one culture relies on modeling the data
generating process, for instance by considering that a variable y (the dependent variable, or target)
is a linear combination of input variables x (the explanatory variables, or features) plus some noise. The
other culture uses complex algorithms (random forests, neural networks)
to model the relationship between y and x. The author argues that most statisticians have relied
for too long on modeling data generating processes and do not use all the potential offered by
these complex algorithms. I think that a lot of things have changed since then, and that nowadays
any practitioner that uses data is open to using any type of model or algorithm, as long as it does
the job. However, the paper is very interesting, and the discussion of the trade-off between
model simplicity and interpretability/explainability is still relevant today.
In this section, I will explain how one can go about interpreting or explaining models from these
two cultures.
Also, it is important to note here that the discussion that will follow will be heavily influenced
by my econometrics background. I will focus on marginal effects as a way to interpret parametric
models (models from the first culture described above), but depending on the field, practitioners
might use something else (for instance by computing odds ratios in a logistic regression).
I will start with the interpretability of *classical* statistical models.
### Marginal effects
If one wants to know the effect of variable `x` on the dependent variable `y`,
so-called marginal effects have to be computed. This is easily done in R with the `{marginaleffects}` package.
Formally, marginal effects are the partial derivative of the regression equation with respect to the variable
we want to look at.
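Written out (a standard definition, using the notation of the regression above), the marginal effect of a continuous explanatory variable \(x_k\) is:
\[
\frac{\partial E[y \mid x]}{\partial x_k}
\]
which, for a linear model such as the one above, is simply the coefficient \(\beta_k\).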
```{r}
library(marginaleffects)
effects_model3 <- marginaleffects(model3)
summary(effects_model3)
```
Let's go through this: `summary(effects_model3)` shows the average marginal effects for each of the explanatory
variables that were used in `model3`. The way to interpret them is as follows:
*everything else held constant (often you'll read the Latin ceteris paribus for this), a unit increase in
`lotsize` increases the `price` by 3.546 units, on average.*
The *everything held constant* part is crucial; with marginal effects, you're looking at just the effect of
one variable at a time. For discrete variables, like `driveway`, this is simpler: imagine two houses
that are exactly the same, except that one has a driveway and the other doesn't. The one with the driveway
is 6687 units more expensive, *on average*.
Now it turns out that in the case of a linear model, the average marginal effects are exactly equal to the
coefficients. Just compare `summary(model3)` to `effects_model3` to see
(and remember, I told you that marginal effects are the partial derivative of the regression equation with
respect to the variable of interest; so the derivative of $\alpha X_1 + \dots$ with respect to $X_1$ is
$\alpha$). But in the case of a more complex, non-linear model, this is not so obvious. This is
where `{marginaleffects}` will make your life much easier.
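As a quick check (a sketch using the objects created above, not additional output from the package), you can print both side by side:
```{r, eval=FALSE}
# The coefficients of the linear model...
coef(model3)
# ...should match the average marginal effects reported by {marginaleffects}.
summary(effects_model3)
```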
It is also possible to plot the results:
```{r}
plot(effects_model3)
```
`effects_model3` is a data frame containing the effects for each house in the data set. For example,
let's take a look at the first house:
```{r}
effects_model3 %>%
filter(rowid == 1)
```
`rowid` is a column that identifies the houses in the original data set, so `rowid == 1` keeps only
the first house. This shows you the marginal effects (column `dydx`) computed for this house; but
remember, since we're dealing with a linear model, the values of the marginal effects are constant.
If you don't see the point of this discussion, don't fret, the next example should make things
clearer.
Let's estimate a logit model and compute the marginal effects. You might know logit models as
*logistic regression*. Logit models can be estimated using the `glm()` function, which stands for
generalized linear models.
As an example, we are going to use the `Participation` data, also from the `{Ecdat}` package:
```{r}
data(Participation)
```
```{r, eval=FALSE}
?Participation
```
```
Participation package:Ecdat R Documentation
Labor Force Participation
Description:
a cross-section
_number of observations_ : 872
_observation_ : individuals
_country_ : Switzerland
Usage:
data(Participation)
Format:
A dataframe containing :
lfp labour force participation ?
lnnlinc the log of nonlabour income
age age in years divided by 10
educ years of formal education
nyc the number of young children (younger than 7)
noc number of older children
foreign foreigner ?
Source:
Gerfin, Michael (1996) “Parametric and semiparametric estimation
of the binary response”, _Journal of Applied Econometrics_,
*11(3)*, 321-340.
References:
Davidson, R. and James G. MacKinnon (2004) _Econometric Theory
and Methods_, New York, Oxford University Press, <URL:
http://www.econ.queensu.ca/ETM/>, chapter 11.
Journal of Applied Econometrics data archive : <URL:
http://qed.econ.queensu.ca/jae/>.
See Also:
‘Index.Source’, ‘Index.Economics’, ‘Index.Econometrics’,
‘Index.Observations’
```
The variable of interest is `lfp`: whether the individual participates in the labour force or not.
To know which variables are relevant in the decision to participate in the labour force, one could
train a logit model, using `glm()`:
```{r}
logit_participation <- glm(lfp ~ ., data = Participation, family = "binomial")
broom::tidy(logit_participation)
```
From the results above, one can only interpret the sign of the coefficients. To know how much a
variable influences the labour force participation, one has to use `marginaleffects()`:
```{r}
effects_logit_participation <- marginaleffects(logit_participation)
summary(effects_logit_participation)
```
As you can see, the average marginal effects here are not equal to the estimated coefficients of the
model. Let's take a look at the first row of the data:
```{r}
Participation[1, ]
```
and let's now look at `rowid == 1` in the marginal effects data frame:
```{r}
effects_logit_participation %>%
filter(rowid == 1)
```
Let's focus on the first row, where `term` is `lnnlinc`. What we see here is the effect of an infinitesimal
increase in the variable `lnnlinc` on the probability of labour force participation, for an individual who has the following
characteristics: `lnnlinc = 10.7875`, `age = 3`, `educ = 8`, `nyc = 1`, `noc = 1` and `foreign = no`, which
are the characteristics of this first individual in our data.
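For reference, this is a standard result (not something computed by the package itself): in a logit model, the marginal effect of a continuous variable \(x_k\) for an individual with characteristics \(x\) is
\[
\frac{\partial P(y = 1 \mid x)}{\partial x_k} = \beta_k \, \Lambda(x\beta)\bigl(1 - \Lambda(x\beta)\bigr),
\]
where \(\Lambda\) is the logistic cumulative distribution function. Since the effect depends on \(x\beta\), it differs from one individual to the next, which is exactly what we observe in the output above.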
So let's look at the value of `dydx` for every individual:
```{r}
dydx_lnnlinc <- effects_logit_participation %>%
filter(term == "lnnlinc")
head(dydx_lnnlinc)
```
`dydx_lnnlinc` is a data frame with all the individual marginal effects for the variable `lnnlinc`.
What if we compute the mean of this column?
```{r}
dydx_lnnlinc %>%
summarise(mean(dydx))
```
Let's compare this to the average marginal effects:
```{r}
summary(effects_logit_participation)
```
Yep, it's the same! This is why we speak of *average marginal effects*. Now that we know why
these are called average marginal effects, let's go back to interpreting them. This time,
let's plot them, because why not:
```{r}
plot(effects_logit_participation)
```
So an infinitesimal increase of, say, 0.001 in the log of non-labour income (`lnnlinc`) is associated with a
decrease in the probability of labour force participation of about 0.001*17 percentage points.
This is just scratching the surface of interpreting these kinds of models. There are many more
types of effects that you can compute and look at. I highly recommend you read the documentation
of `{marginaleffects}` which you can find
[here](https://vincentarelbundock.github.io/marginaleffects/index.html). The author
of the package, Vincent Arel-Bundock writes a lot of very helpful documentation for his packages,
so if model interpretation is important for your job, definitely take a look.
### Explainability of *black-box* models
Just read Christoph Molnar's
[Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/).
Seriously, I cannot add anything meaningful to it. His book is brilliant.
## Comparing models
Consider this section more as an illustration of what is possible with the knowledge you have acquired
at this point. Imagine that the task at hand is to compare two models. We would like to select
the one with the best fit to the data.
Let's first estimate another model on the same data; prices are only positive, so a linear regression
might not be the best model, because the model could predict negative prices. Let's look at the
distribution of prices:
```{r}
ggplot(Housing) +
geom_density(aes(price))
```
it looks like modeling the log of `price` might provide a better fit:
```{r}
model_log <- lm(log(price) ~ ., data = Housing)
result_log <- broom::tidy(model_log)
print(result_log)
```
Let's take a look at the diagnostics:
```{r}
glance(model_log)
```
Let's compare these to the ones from the previous model:
```{r}
diag_lm <- glance(model3)
diag_lm <- diag_lm %>%
mutate(model = "lin-lin model")
diag_log <- glance(model_log)
diag_log <- diag_log %>%
mutate(model = "log-lin model")
diagnostics_models <- full_join(diag_lm, diag_log) %>%
select(model, everything()) # put the `model` column first
print(diagnostics_models)
```
I saved the diagnostics in two different `data.frame` objects using the `glance()` function and added a
`model` column to indicate which model the diagnostics come from. Then I merged both datasets using
`full_join()`, a `{dplyr}` function. Using this approach, we can easily build a data frame with the
diagnostics of several models and compare them. The model using the logarithm of prices has lower
AIC and BIC (and thus a higher likelihood), so if you're worried about selecting the model with the better
fit to the data, you'd go for this model.
## Using a model for prediction
Once you have estimated a model, you might want to use it for prediction. This is easily done using the
`predict()` function that works with most models. Prediction is also useful as a way to test the
accuracy of your model: split your data into a training set (used for training) and a testing
set (used for the pseudo-prediction) and see if your model overfits the data. We are going to see
how to do that in a later section; for now, let's just get acquainted with `predict()` and other
functions. I insist, keep in mind that this section is only to get acquainted with these functions.
We are going to explore prediction, overfitting and tuning of models in a later section.
Let's go back to the models we trained in the previous section, `model3` and `model_log`. Let's also
take a subsample of data, which we will be using for prediction:
```{r}
set.seed(1234)
pred_set <- Housing %>%
sample_n(20)
```
In order to always get the same `pred_set`, I set the random seed first. Let's take a look at the
data:
```{r}
print(pred_set)
```
If we wish to use it for prediction, this is easily done with `predict()`:
```{r}
predict(model3, pred_set)
```
This returns a vector of predicted prices. This can then be used to compute the Root Mean Squared Error
for instance. Let's do it within a `tidyverse` pipeline:
```{r}
rmse <- pred_set %>%
mutate(predictions = predict(model3, .)) %>%
summarise(sqrt(sum((predictions - price)^2)/n()))
```
The root mean square error of `model3` is `r rmse`.
I also used the `n()` function which returns the number of observations in a group (or all the
observations, if the data is not grouped). Let's compare `model3`'s RMSE with the one from
`model_log`:
```{r}
rmse2 <- pred_set %>%
mutate(predictions = exp(predict(model_log, .))) %>%
summarise(sqrt(sum((predictions - price)^2)/n()))
```
Don't forget to exponentiate the predictions; remember you're dealing with a log-linear model! `model_log`'s
RMSE is `r rmse2`; compare it to `model3`'s RMSE above. However, keep in mind that the model was trained
on the whole data, and then the prediction quality was assessed using a subsample of the data the
model was trained on... so actually we can't really say if `model_log`'s predictions are very useful.
Of course, this is the same for `model3`.
In a later section we are going to learn how to do cross validation to avoid this issue.
Just as a side note, notice that I had to copy and paste basically the same lines twice to compute
the predictions for both models. That's not much, but if I wanted to compare 10 models, copy-and-paste
mistakes could have sneaked in. Instead, it would be nicer to have a function that
computes the RMSE and then apply it to each model. We are going to learn how to write our own functions
and use them just as if they were built-in R functions; a sketch of the idea follows.
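To give you a taste of what such a function could look like, here is a minimal sketch (the helper `compute_rmse()` is hypothetical, written for illustration; we will cover writing functions properly later):
```{r, eval=FALSE}
# Hypothetical helper: compute the RMSE of a model on a data set.
# `log_model = TRUE` exponentiates the predictions, for models fitted on log(price).
compute_rmse <- function(model, data, log_model = FALSE) {
  preds <- predict(model, data)
  if (log_model) preds <- exp(preds)
  sqrt(mean((preds - data$price)^2))
}

compute_rmse(model3, pred_set)
compute_rmse(model_log, pred_set, log_model = TRUE)
```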
## Beyond linear regression
R has a lot of other built-in functions for regression, such as `glm()` (for Generalized Linear
Models) and `nls()` (for Nonlinear Least Squares). There are also functions and additional
packages for time series, panel data, machine learning, Bayesian and nonparametric methods.
Presenting everything here would take too much space, and would be pretty useless as you can find
whatever you need using an internet search engine. What you have learned until now is quite general
and should work on many types of models. To help you out, here is a list of methods and the
recommended packages that you can use:
Model | Package | Quick example
------|---------|--------------
Robust Linear Regression | `MASS` | `rlm(y ~ x, data = mydata)`
Nonlinear Least Squares | `stats`^[This package gets installed with R, no need to add it] | `nls(y ~ x1 / (1 + x2), data = mydata)`^[The formula in the example is shown for illustration purposes.]
Logit | `stats` | `glm(y ~ x, data = mydata, family = "binomial")`
Probit | `stats` | `glm(y ~ x, data = mydata, family = binomial(link = "probit"))`
K-Means | `stats` | `kmeans(data, n)`^[`data` must only contain numeric values, and `n` is the number of clusters.]
PCA | `stats` | `prcomp(data, scale = TRUE, center = TRUE)`^[`data` must only contain numeric values, or a formula can be provided.]
Multinomial Logit | `mlogit` | Requires several steps of data pre-processing and formula definition, refer to the [Vignette](https://cran.r-project.org/web/packages/mlogit/vignettes/mlogit.pdf) for more details.
Cox PH | `survival` | `coxph(Surv(y_time, y_status) ~ x, data = mydata)`^[`Surv(y_time, y_status)` creates a *survival* object, where `y_time` is the time to event `y_status`. It is possible to create more complex survival objects depending on exactly which data you are dealing with.]
Time series | Several, depending on your needs. | Time series in R is a vast subject that would require a very thick book to cover. You can get started with the following series of blog articles: [Tidy time-series, part 1](http://www.business-science.io/timeseries-analysis/2017/07/02/tidy-timeseries-analysis.html), [Tidy time-series, part 2](http://www.business-science.io/timeseries-analysis/2017/07/23/tidy-timeseries-analysis-pt-2.html), [Tidy time-series, part 3](http://www.business-science.io/timeseries-analysis/2017/07/30/tidy-timeseries-analysis-pt-3.html) and [Tidy time-series, part 4](http://www.business-science.io/timeseries-analysis/2017/08/30/tidy-timeseries-analysis-pt-4.html)
Panel data | `plm` | `plm(y ~ x, data = mydata, model = "within")`^[Or `model = "random"` for the random effects estimator.]
Machine learning | Several, depending on your needs. | R is a very popular programming language for machine learning. [This book](https://www.tmwr.org/) is a must read if you need to do machine learning with R.
Nonparametric regression | `np` | Several functions and options available, refer to the [Vignette](https://cran.r-project.org/web/packages/np/vignettes/np.pdf) for more details.
This table is far from complete. Should you be a Bayesian, you'd want to look at packages
such as `{rstan}`, which uses `Stan`, an external piece of software that must be installed on your
system. It is also possible to train models using Bayesian inference without the need for external
tools, with the `{bayesm}` package, which estimates the usual micro-econometric models. There really
are a lot of packages available for Bayesian inference, and you can find them all in the [related
CRAN Task View](https://cran.r-project.org/web/views/Bayesian.html).
## Hyper-parameters
Hyper-parameters are parameters of the model that cannot be directly learned from the data.
A linear regression does not have any hyper-parameters, but a random forest for instance has several.
You might have heard of ridge regression, lasso and elasticnet. These are
extensions of linear models that avoid over-fitting by penalizing *large* models. These
extensions of the linear regression have hyper-parameters that the practitioner has to tune. There
are several ways one can tune these parameters: for example, by doing a grid search, a random
search over the grid, or by using more elaborate methods. To introduce hyper-parameters, let's get
to know ridge regression, also called Tikhonov regularization.
### Ridge regression
Ridge regression is used when the data you are working with has a lot of explanatory variables,
or when there is a risk that a simple linear regression might overfit to the training data, because,
for example, your explanatory variables are collinear.
If you are training a linear model and then you notice that it generalizes very badly to new,
unseen data, it is very likely that the linear model you trained overfit the data.
In this case, ridge regression might prove useful. The way ridge regression works might seem
counter-intuitive; it boils down to fitting a *worse* model to the training data, but in return,
this worse model will generalize better to new data.
The closed form solution of the ordinary least squares estimator is defined as:
\[
\widehat{\beta} = (X'X)^{-1}X'Y
\]
where $X$ is the design matrix (the matrix made up of the explanatory variables) and $Y$ is the
dependent variable. For ridge regression, this closed form solution changes a little bit:
\[
\widehat{\beta} = (X'X + \lambda I_p)^{-1}X'Y
\]
where $\lambda \in \mathbb{R}$ is a hyper-parameter and $I_p$ is the identity matrix of dimension $p$
($p$ is the number of explanatory variables).
The formula above is the closed-form solution to the following optimisation program:
\[
\min_{\beta} \; \sum_{i=1}^n \left(y_i - \sum_{j=1}^p x_{ij}\beta_j\right)^2
\]
such that:
\[
\sum_{j=1}^p(\beta_j)^2 < c
\]
for any strictly positive $c$.
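To make the closed-form expression concrete, here is a minimal sketch (my own illustration, not code from `{glmnet}`) that computes the ridge estimator directly from the formula above, for a numeric design matrix `X` and a response vector `y`:
```{r, eval=FALSE}
# Naive ridge estimator from the closed-form solution (X'X + lambda * I_p)^(-1) X'Y.
# Note: glmnet() standardizes the variables and does not penalize the intercept,
# so its estimates will generally differ from this textbook formula.
ridge_closed_form <- function(X, y, lambda) {
  p <- ncol(X)
  solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)
}
```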
The `glmnet()` function from the `{glmnet}` package can be used for ridge regression, by setting
the `alpha` argument to 0 (setting it to 1 would do LASSO, and setting it to a number between
0 and 1 would do elasticnet). But in order to compare linear regression and ridge regression,
let me first divide the data into a training set and a testing set:
```{r}
index <- 1:nrow(Housing)
set.seed(12345)
train_index <- sample(index, round(0.90*nrow(Housing)), replace = FALSE)
test_index <- setdiff(index, train_index)
train_x <- Housing[train_index, ] %>%
select(-price)
train_y <- Housing[train_index, ] %>%
pull(price)
test_x <- Housing[test_index, ] %>%
select(-price)
test_y <- Housing[test_index, ] %>%
pull(price)
```
I do the train/test split this way, because `glmnet()` requires a design matrix as input, and not
a formula. Design matrices can be created using the `model.matrix()` function:
```{r}
library("glmnet")
train_matrix <- model.matrix(train_y ~ ., data = train_x)
test_matrix <- model.matrix(test_y ~ ., data = test_x)
```
Let's now run a linear regression, by setting the penalty to 0:
```{r}
model_lm_ridge <- glmnet(y = train_y, x = train_matrix, alpha = 0, lambda = 0)
```
The model above provides the same result as a linear regression, because I set `lambda` to 0. Let's
compare the coefficients between the two:
```{r}
coef(model_lm_ridge)
```
and now the coefficients of the linear regression (because I provide a design matrix, I have to use
`lm.fit()` instead of `lm()`, which requires a formula, not a matrix):
```{r}
coef(lm.fit(x = train_matrix, y = train_y))
```
as you can see, the coefficients are the same. Let's compute the RMSE for the unpenalized linear
regression:
```{r}
preds_lm <- predict(model_lm_ridge, test_matrix)
rmse_lm <- sqrt(mean((preds_lm - test_y)^2))
```
The RMSE for the linear unpenalized regression is equal to `r rmse_lm`.
Let's now run a ridge regression, with `lambda` equal to 100, and see if the RMSE is smaller:
```{r}
model_ridge <- glmnet(y = train_y, x = train_matrix, alpha = 0, lambda = 100)
```
and let's compute the RMSE again:
```{r}
preds <- predict(model_ridge, test_matrix)
rmse <- sqrt(mean((preds - test_y)^2))
```
The RMSE for the penalized regression is equal to `r rmse`. But which value of `lambda` gives the
smallest RMSE? To find out, one must run the model over a grid of `lambda` values and pick the one with the
lowest RMSE. This procedure is available in the `cv.glmnet()`
function, which picks the best value for `lambda`:
```{r}
best_model <- cv.glmnet(train_matrix, train_y)
# lambda that minimises the MSE
best_model$lambda.min
```
According to `cv.glmnet()` the best value for `lambda` is `r best_model$lambda.min`. In the
next section, we will implement cross validation ourselves, in order to find the hyper-parameters
of a random forest.
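Before moving on, here is a minimal sketch (written for illustration only, not the author's code) of what a naive grid search over `lambda` could look like, using the train and test matrices defined above. In practice you would tune on a separate validation set rather than on the test set, which is precisely what cross-validation addresses:
```{r, eval=FALSE}
# Naive grid search: fit one ridge regression per value of lambda and
# keep the value with the lowest RMSE on the held-out data.
lambda_grid <- c(0.1, 1, 10, 100, 1000)

rmse_by_lambda <- sapply(lambda_grid, function(lambda) {
  fit <- glmnet(y = train_y, x = train_matrix, alpha = 0, lambda = lambda)
  preds <- predict(fit, test_matrix)
  sqrt(mean((preds - test_y)^2))
})

lambda_grid[which.min(rmse_by_lambda)]
```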
## Training, validating, and testing models
Cross-validation is an important procedure which is used to compare models but also to tune the
hyper-parameters of a model. In this section, we are going to use several packages from the
[`{tidymodels}`](https://github.com/tidymodels) collection of packages, namely
[`{recipes}`](https://tidymodels.github.io/recipes/),
[`{rsample}`](https://tidymodels.github.io/rsample/) and
[`{parsnip}`](https://tidymodels.github.io/parsnip/) to train a random forest the tidy way. I will
also use [`{mlrMBO}`](http://mlrmbo.mlr-org.com/) to tune the hyper-parameters of the random forest.
### Set up
Let's load the needed packages:
```{r, include=FALSE}
library("tidyverse")
library("recipes")
library("rsample")
library("parsnip")
library("yardstick")
library("brotools")
library("mlbench")
```
```{r, eval=FALSE}
library("tidyverse")
library("recipes")
library("rsample")
library("parsnip")
library("yardstick")
library("brotools")
library("mlbench")
```
Load the data, which is included in the `{mlbench}` package:
```{r}
data("BostonHousing2")
```
I will train a random forest to predict the housing prices, which is the `cmedv` column:
```{r}
head(BostonHousing2)
```
Only keep relevant columns:
```{r}
boston <- BostonHousing2 %>%
select(-medv, -tract, -lon, -lat) %>%
rename(price = cmedv)
```
I remove `tract`, `lat` and `lon` because the information contained in the column `town` is enough.
To train and evaluate the model's performance, I split the data in two.
One data set, called the training set, will be further split into two down below. I won't
touch the second data set, the test set, until the very end, to finally assess the model's
performance.
```{r}
train_test_split <- initial_split(boston, prop = 0.9)
housing_train <- training(train_test_split)
housing_test <- testing(train_test_split)
```
`initial_split()`, `training()` and `testing()` are functions from the `{rsample}` package.
I will train a random forest on the training data, but the question is, *which* random forest?
Random forests have several hyper-parameters, and as explained in the introduction, these
hyper-parameters cannot be directly learned from the data; so which ones should we choose? We could
train 6 random forests, for instance, and compare their performance, but why only 6? Why not 16?
In order to find the right hyper-parameters, the practitioner can
use values from the literature that seem to have worked well (as is done in macro-econometrics),
or further split the training set into two, create a grid of hyper-parameters, train the model
on one part of the data for all values of the grid, and compare the predictions of the models on the
second part of the data. You then stick with the model that performed best, for example the
model with the lowest RMSE. The thing is, you can't estimate the true value of the RMSE from only
one split. It's as if you wanted to estimate the average height of a population by drawing a single
observation from it. You need more observations. To approach the true value of the
RMSE for a given set of hyper-parameters, instead of doing one split, let's do 30. Then we
compute the average RMSE, which implies training 30 models for each combination of hyper-parameter values.
First, let's split the training data again, using the `mc_cv()` function from the `{rsample}` package.
This function implements Monte Carlo cross-validation:
```{r}
validation_data <- mc_cv(housing_train, prop = 0.9, times = 30)
```
What does `validation_data` look like?
```{r}
validation_data
```
Let's look further down:
```{r}
validation_data$splits[[1]]
```
The first value is the number of rows in the first set, the second the number of rows in the second set,
and the third the number of rows in the original training data, before splitting again.
What should we call these two new data sets? The author of `{rsample}`, Max Kuhn, talks about
the *analysis* and the *assessment* sets, and I'm going to use this terminology as well.
```{r, echo=FALSE, include = FALSE}
blogdown::shortcode("tweet", "1066131042615140353")
```
Now, in order to continue I need to pre-process the data. I will do this in three steps.
The first and the second steps are used to center and scale the numeric variables and the third step
converts character and factor variables to dummy variables. This is needed because I will train a
random forest, which cannot handle factor variables directly. Let's define a recipe to do that,
and start by pre-processing the testing set. I write a wrapper function around the recipe,
because I will need to apply this recipe to various data sets:
```{r}