-
Notifications
You must be signed in to change notification settings - Fork 3
/
Class7_notes.Rmd
289 lines (194 loc) · 12.7 KB
/
Class7_notes.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
---
title: "Class7_notes"
author: "Anita"
date: "11/3/2019"
output: html_document
---
## Welcome to Class 7!
Today we will learn how to build **linear regression models** in R! The new commands lm() and summary() are from base R, however we still want the tidyverse package to be loaded.
In this class we will work with the child aggression data.
### Importing data
It comes in .dat format, which is different from .csv we normally work with. Therefore, instead of using the specific read_csv() function that is looking for commas between values, we we will use the more general command read_delim() to read it in.
read_delim() comes from tidyverse and needs us to specify the 'delim' argument, i.e. what character separates values. Normally it's either white space, tab, comma or semicolon. When you try to read files, you can try experimenting with it. In tis case it's the space character:
```{r}
library(tidyverse)
#load data
child <- read_delim("ChildAggression.dat", delim = ' ')
```
Once you read in your data, you can use head(your_data) to quickly inspect data by seeing the first several values from it. Make a code chunk below and try this function out.
Here is interpretation for variables and their values:
Aggression (high score = more aggressive) <- this will be our dependent variable (**outcome variable**)
Television (high score = more time spent watching television)
Computer_ Games (high score = more time spent playing computer games)
Sibling_Aggression (high score = more aggression seen in their older sibling)
Diet (high score = the child has a good diet low in additives)
Parenting_Style (high score = bad parenting practices)
Already now you should try to answer the following questions:
What is the type of variables you can see in this data?
What kind of statistical tests can you use to explore this kind of data? Could you run a t-test?
### Part 1: Simple linear regression
#### lm(formula, data)
takes in the following arguments:
formula in the format: outcome ~ predictor
name of the dataset, where the outcome and predictor come from
Fits the model, i.e. finds the estimates that produce the line that minimizes the sum of squares of the residuals
...and returns the output that contains results of fitting a linear regression
The output contains coefficients of the fitted model, its residuals (distance between actual and predicted values), fitted values, degrees of freedom, and all other kinds of stuff relevant to the model.
Use summary() function on the fitted model to see this output.
Example:
```{r}
#make a linear model
m <- lm(Aggression ~ Television, child)
#see results of model fit to data
summary(m)
```
#### Reporting simple regression
By looking at the output above, we could report the following result:
*"A linear regression analysis was used to test if the amount of television time significantly predicted children's ratings of aggression. Adjusting for the number of predictors, the results of the regression indicated that the predictor explained 2.36% of the variance in aggression metric (Adjusted R2 =.0236, F(1,664)= 17.11, p<.001). It was found that bigger amount of television significantly predicted aggressive tendencies (β = 0.1634, SE =0.0395, t = 4.137, p<.001)."*
Report regression in APA style:
https://web2.uconn.edu/writingcenter/pdf/Reporting_Statistics.pdf
When reporting model performance:
mention either of R squared and corresponding explained variance
if you have just one predictor - R squared is fine
if you have more predictors - adjusted R squared is better
F(number of predictors, remaining degrees of freedom) = F statistic
p-value associated with F statistic
When reporting results for research question (do so for every predictor):
beta value
standard error
t value
associated p-value
#### Visualizing regression
From the class on correlation, we know that to visualize how one continuous variable changes in relation to another continuous variable, we need to draw a scatter plot (geom_point in ggplot2):
```{r}
#plot the data
ggplot(child, aes(Television, Aggression)) + #base layer
geom_point() + #scatter plot
ggtitle('Scatter plot')
```
The distinct feature of regression plots is the fitted line that produces the least sum of squares, let's draw it using the summary of the model we defined in the previous part. In that summary, we can access coefficients. In coefficients the first element is intercept and the second one is slope for our regression line. We write them down so we can use them in ggplot's geom_abline().
```{r}
#writing down the model summary
sum <- summary(m)
#extracting intercept and slope from the model summary using index
int <- sum$coefficients[1] #the first element is intercept
slope <- sum$coefficients[2] #the second element is slope
#plot the line
ggplot(child, aes(Television, Aggression)) + #base layer
geom_point() + #scatter plot
geom_abline(intercept = int, slope = slope) + #drawing a line with intercept and slope we wrote down from model summary
ggtitle('Scatter plot with the regression line: manually')
```
Alternatively, we can draw a regression line automatically using geom_smooth(method = lm). Use this page: http://www.sthda.com/english/wiki/print.php?id=188 to look up more information about scatter plots and regression lines in R.
```{r}
#... or just use geom_smooth(method = lm) to plot that line automatically (does not show extrapolation)
ggplot(child, aes(Television, Aggression)) +
geom_point() +
geom_smooth(method = lm) + #fit the best straight line to the data
ggtitle('Scatter plot with the regression line: quicker using geom_smooth (method lm)')
#same but without the confidence intervals around the line, so it looks more like the abline geom
ggplot(child, aes(Television, Aggression)) +
geom_point() +
geom_smooth(method = lm, se = FALSE) + #fit the best straight line to the data, don't show confidence intervals
ggtitle('Scatter plot with the regression line: geom_smooth without confidence intervals ')
```
#### Part 1 Exercise:
Now, it's your turn to build linear models.
Make 5 regressions: one for each predictor
Visualize the results using scatter plots with regression lines
Which one of predictors was the worst?
### Part 2: Multiple regression
In general, you have to be very picky when you choose multiple predictors for your regression model. However, here we just learn, so feel welcome to experiment and try different combinations of predictors. You can include more than one predictor by simply adding it using '+' sign in the formula (see below):
```{r}
mm <- lm(Aggression ~ Computer_Games + Television, child)
summary(mm)
```
#### Multicollinearity
However, you should still keep in mind that if you add two predictors that are highly correlated, the estimates of your regression will become hard to interpret and unstable.
Therefore, make sure you always check correlation between your predictors!
You can check how all of variables in your data are correlated with each other by running cor() function on the entire dataset. High correlation coefficients (e.g. over 0.7) are worrying and indicate that it might be better not to include such correlated predictors in the same model together. To make it easier to see, round up the results to 2 decimal points using round() function:
```{r}
#check cor()
cor(child)
#same, but rounded to 2 decimal points
round(cor(child),2) #much better! but still hard to see
```
You can make it even easier by displaying the correlation matrix of all predictors as **a heatmap**. For this you will need to install and load reshape2 library and run the following code (source and explanation: https://onunicornsandgenes.blog/2019/07/22/using-r-correlation-heatmap-with-ggplot2/)
From the author of this code:
"*What is going on?*
- cor makes a correlation matrix with all the pairwise correlations between variables (twice; plus a diagonal of ones)
- melt takes the matrix and creates a data frame in long form, each row consisting of id variables Var1 and Var2 and a single value
- We then plot with the tile geometry, mapping the indicator variables to rows and columns, and value (i.e. correlations) to the fill colour.
- In ggplot2, a scale that has a midpoint and a different colour in each direction is called scale_colour_gradient2, and we just need to add it. I also set the limits to -1 and 1, which doesn’t change the colour but fills out the legend for completeness. Done!"
Run the code below and look at the output - are there predictors that are highly correlated?
```{r}
pacman::p_load(ggplot2, reshape2)
#plot as heatmap
qplot(x=Var1, y=Var2,
data=melt(cor(child, use="p")),
fill=value,
geom="tile") +
scale_fill_gradient2(limits=c(-1, 1))
##### Highlight only the 'worrying' correlation values
#plot only values above .7
plot_df <- melt(cor(child))
plot_df$value[plot_df$value < 0.7 & plot_df$value > - 0.7] = 0
qplot(x=Var1, y=Var2, data=plot_df, fill=value, geom="tile") +
scale_fill_gradient2(limits=c(-1, 1))
```
After you've examined this, you just need to make sure that you don't put highly correlated predictors in the same model at the same time. In this case, it seems that Television and Parenting style variables are correlated more than others. Even though their correlation coefficient is below 0.7 (it's 0.53), you still might want to avoid putting those two together as predictors in a model.
#### Part 2 Exercise
Pick a predictor that you think is the most important for predicting child aggression. Then create 3 regression models:
1) where aggression is predicted by just the predictor you picked
2) where aggression is predicted by your predictor and some another additional predictor from the dataset
3) where aggression is predicted by your predictor and two additional predictors from the dataset
Look through summaries of these models and answer questions:
How did the beta estimate for your most important predictor change?
How did model performance change? Which metric do you use to answer this question?
Write up an APA style report for your best model.
#### The top model competition
*In groups*
Who can make the model which explains the data the best? E.g. have the highest adjusted R2
If you believe you have the best model send me an email reporting the results of the model in APA style
Either mail, pdf or html output of the markdown is fine
my email: 201608652@post.au.dk
We will choose the winner next week
### Optional stuff
#### 1) Playing with ggplot
1a) Play with geom_smooth, try making a regression line without the method argument
1b) All of our predictor variables are standardized (their mean values were assigned to be 0, and all of observations were centered around it correspondingly to the original relations in data). Try to make a scatter plot that has one x axis for all predictor variables (they have roughly the same standradized scale, so it's possible) and the y axis for the outcome variable. Plot all 5 regression lines on it.
#### 2) Making a loop
If you’re bored: make a loop that runs through the 5 predictors, fits a model to each and saves the R^2 to a dataframe
#### 3) Look through code for stuff from slides
##### Here's how extrapolation example about meatless day was made (last year by Kenneth)
```{r}
#data - source: https://vegetarisk.dk/statistik-om-danmark/
tible <- data_frame(year = c(2010, 2017, 2018),
perc_one_meatless_day_pr_week = c(17, 28, 32)
)
#linear model
m1 <- lm(perc_one_meatless_day_pr_week ~ year, tible)
sum_m1 <- summary(m1)
#visaulisation
ggplot(tible, aes(year, perc_one_meatless_day_pr_week)) +
geom_point() +
geom_abline(intercept = sum_m1$coefficients[1, 1], slope = sum_m1$coefficients[2, 1]) +
labs(title = paste("adjusted R^2: ", round(sum_m1$adj.r.squared, 2), sep = ""))
#extrapolation
predict(m1, data.frame(year = c(2018-100, 2018+100)))
ggplot(tible, aes(year, perc_one_meatless_day_pr_week)) +
geom_point() +
geom_abline(intercept = sum_m1$coefficients[1,1], slope = sum_m1$coefficients[2,1]) +
xlim(2010, 2118)+ylim(0,200) +
geom_abline(intercept = 100, slope = 0) +
labs(title = paste("adjusted R^2: ", round(sum_m1$adj.r.squared, 2), sep= ""))
```
##### Here's how the 'top model' image was made (last year by Kenneth)
```{r}
pacman::p_load(tidyverse, png)
mypng <- readPNG('model_image.png')
plot_dat <- data_frame(x = c(0, 0.5, 1, 1.5, 1.7, 2, 2.5, 3, 3.5, 4, 4.5, 5),
y = c(3, 3.3, 3.3, 2.9, 2.5,2.7, 3.1, 1.6, 1.6, 2, 1.5, 1.2))
p <- ggplot(plot_dat, aes(x,y)) + theme_bw()
p + annotation_raster(mypng, ymin = 0,ymax= 4,xmin = 0,xmax = 5) + xlim(0,4.5) + ylim(0,4) + stat_smooth(se = F, method = "lm", formula = y ~ poly(x, 8))
```