---
title: "Describing Data"
date: "1/27/2022"
output: html_document
---
<script type="text/javascript">
function showhide(id) {
var e = document.getElementById(id);
e.style.display = (e.style.display == 'block') ? 'none' : 'block';
}
</script>
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(mosaic)
library(pander)
library(tidyverse)
```
## Numerical Summaries
### Treatment Means
In the following explanations:
* `Y` must be a “numeric” vector of the quantitative response variable.
* `X` is a qualitative variable. It would represent a treatment factor.
* `YourDataSet` is the name of your data set.
You can take a tidyverse approach or a mosaic package approach to calculating numerical summaries for each treatment.
#### mosaic package:
Calculating treatment means for **one factor**:
```{r eval = FALSE, echo = TRUE}
library(mosaic)
library(pander)
favstats(Y~X, data=YourDataSet)
```
Example code:
<a href="javascript:showhide('mosaic_mean')">
<div class="hoverchunk">
<span class="tooltipr">
library(mosaic)
<span class="tooltiprtext">mosaic is an R Package that is useful in the teaching of statistics to beginning programmers.</span>
</span><br><span class="tooltipr">
library(pander)
<span class="tooltiprtext">pander is an R Package that makes R output look pretty</span>
</span><br><span class="tooltipr">
favstats(
<span class="tooltipRtext">a function from the mosaic package that returns a set of favorite summary statistics</span>
</span><span class="tooltipr">
Temp
<span class="tooltipRtext">This is our response variable. From `?airquality` you can see in the help file that Temp is the maximum daily temperature in degrees F at La Guardia Airport during 1973.</span>
</span><span class="tooltipr">
~
<span class="tooltipRtext">"~" is the tilde symbol. It can be interpreted as "y broken down by x"; "y modeled by x"; "y explained by x", etc. Where y is on the left of the tilde and x is on the right. </span>
</span><span class="tooltipr">
Month,
<span class="tooltiprtext">"Month" is a column from the airquality dataset that can be treated as qualitative.</span>
</span><span class="tooltipr">
data = airquality
<span class="tooltiprtext">You have to tell R what dataset the variables Temp and Month come from. 'airquality' is a preloaded dataset in R. </span>
</span><span class="tooltipr">
)
<span class="tooltiprtext">Functions must always end with a closing parenthesis.</span>
</span><span class="tooltipr" style="float:right;">
Click to view output
<span class="tooltiprtext">Click to View Output.</span>
</span>
</div>
</a>
<div id="mosaic_mean" style="display:none;">
```{r, echo=FALSE}
favstats(Temp~Month, data=airquality)
```
</div>
<br>
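As a cross-check on the `favstats()` output above, the per-month mean, standard deviation, and count can also be sketched in base R. This is only an illustration, not how `favstats()` works internally; `temp_by_month` and `summary_tbl` are names made up for this sketch.

```r
# Split the response (Temp) into one vector per level of the factor (Month).
temp_by_month <- split(airquality$Temp, airquality$Month)

# Compute a few of the statistics that favstats() also reports.
summary_tbl <- data.frame(
  Month = names(temp_by_month),
  mean  = sapply(temp_by_month, mean),
  sd    = sapply(temp_by_month, sd),
  n     = sapply(temp_by_month, length)
)

summary_tbl
```

The `mean`, `sd`, and `n` columns should agree with the matching columns of `favstats(Temp~Month, data=airquality)`.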
When calculating treatment **means for combinations of 2 or more factors** you can use `+` or `|` to separate the factors. `|` (read as 'vertical bar' or 'pipe') has the advantage that in addition to calculating means for every factor level combination, favstats will also output the marginal means for each level of the last factor listed.
*NOTE:* unlike its use in the `aov()` command, using `*` within `favstats()` does not yield the expected results and should NOT be used.
```{r eval = FALSE, echo = TRUE}
library(mosaic)
favstats(Y ~ X + Z, data = YourDataSet)
#OR
favstats(Y ~ X | Z, data = YourDataSet)
```
Example code:
<a href="javascript:showhide('favstats_plus')">
<div class="hoverchunk">
<span class="tooltipr">
library(mosaic)
<span class="tooltiprtext">mosaic is an R Package that is useful in the teaching of statistics to beginning programmers.</span>
</span><br><span class="tooltipr">
favstats(
<span class="tooltipRtext">a function from the mosaic package that returns a set of favorite summary statistics</span>
</span><span class="tooltipr">
mpg
<span class="tooltipRtext">This is a quantitative variable (numerical vector) from the mtcars dataset</span>
</span><span class="tooltipr">
~
<span class="tooltipRtext">"~" is the tilde symbol. It can be interpreted as "y broken down by x"; "y modeled by x"; "y explained by x", etc. Where y is on the left of the tilde and x variables are on the right. </span>
</span><span class="tooltipr">
am
<span class="tooltiprtext">A qualitative variable from the mtcars dataset. It is coded as 0 and 1 and is therefore treated as numeric. That is a key distinction when creating the model, but it does not matter when calling favstats().</span>
</span><span class="tooltipr">
+
<span class="tooltiprtext">This allows us to create additional subgroups within 'am' for each level of 'cyl'.</span>
</span><span class="tooltipr">
cyl,
<span class="tooltiprtext">A variable from the mtcars dataset with 3 distinct values: 4, 6, and 8. Though it is a numeric column we want to treat it as a factor. This is a key distinction when creating the model, but it does not matter when calling favstats().</span>
</span><span class="tooltipr">
data = mtcars
<span class="tooltiprtext">You have to tell R what dataset the variables 'mpg', 'am', and 'cyl' come from. 'mtcars' is a preloaded dataset in R. </span>
</span><span class="tooltipr">
)
<span class="tooltiprtext">Functions must always end with a closing parenthesis.</span>
</span><span class="tooltipr" style="float:right;">
Click to view output
<span class="tooltiprtext">Click to View Output.</span>
</span>
</div>
</a>
<div id="favstats_plus" style="display:none;">
```{r, echo=FALSE}
favstats(mpg ~ am + cyl, data=mtcars)
```
</div>
<br>
Notice that the first column in the output contains the factor level combinations of `am` and `cyl`. So, '0.4' is interpreted as level 0 for `am` and level 4 for `cyl`. In other words, the summary statistics on that row are for automatic transmission, 4 cylinder engine vehicles. The column label 'am.cyl' indicates which factor is represented on which side of the period. The next example labels the factor level combinations the same way, but its column label is not as intuitive or helpful.
<br>
<a href="javascript:showhide('favstats_bar')">
<div class="hoverchunk">
<span class="tooltipr">
library(mosaic)
<span class="tooltiprtext">mosaic is an R Package that is useful in the teaching of statistics to beginning programmers.</span>
</span><br><span class="tooltipr">
favstats(
<span class="tooltipRtext">a function from the mosaic package that returns a set of favorite summary statistics</span>
</span><span class="tooltipr">
mpg
<span class="tooltipRtext">This is a quantitative variable (numerical vector) from the mtcars dataset</span>
</span><span class="tooltipr">
~
<span class="tooltipRtext">"~" is the tilde symbol. It can be interpreted as "y broken down by x"; "y modeled by x"; "y explained by x", etc. Where y is on the left of the tilde and x variables are on the right. </span>
</span><span class="tooltipr">
am
<span class="tooltiprtext">A qualitative variable from the mtcars dataset. It is coded as 0 and 1 and is therefore treated as numeric. That is a key distinction when creating the model, but it does not matter when calling favstats().</span>
</span><span class="tooltipr">
|
<span class="tooltiprtext">Referred to as a vertical bar or pipe, this symbol further defines subgroups of the variable on its left, using the values of the variable on its right side </span>
</span><span class="tooltipr">
cyl,
<span class="tooltiprtext">A variable from the mtcars dataset with 3 distinct values: 4, 6, and 8. Though it is a numeric column we want to treat it as a factor. This is a key distinction when creating the model, but it does not matter when calling favstats().</span>
</span><span class="tooltipr">
data = mtcars
<span class="tooltiprtext">You have to tell R what dataset the variables 'mpg', 'am', and 'cyl' come from. 'mtcars' is a preloaded dataset in R. </span>
</span><span class="tooltipr">
)
<span class="tooltiprtext">Functions must always end with a closing parenthesis.</span>
</span><span class="tooltipr" style="float:right;">
Click to View Output
<span class="tooltiprtext">Click to View Output.</span>
</span>
</div>
</a>
<div id="favstats_bar" style="display:none;">
```{r, echo=FALSE}
favstats(mpg ~ am | cyl, data=mtcars)
```
</div>
<br>
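Labels such as '0.4' are easier to read when the factors carry descriptive level names. The sketch below uses base R's `interaction()`, which joins levels with a period in the same 'am.cyl' style; it is an illustration of the labeling idea and makes no claim about how `favstats()` builds its labels. The copy `cars`, the level labels "automatic" and "manual", and the names `combo` and `combo_means` are assumptions introduced for this example (`?mtcars` documents 0 = automatic, 1 = manual).

```r
# Work on a copy so the built-in mtcars data is left untouched.
cars <- mtcars

# Give the 0/1 coding of am descriptive labels and treat cyl as a factor.
cars$am  <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))
cars$cyl <- factor(cars$cyl)

# interaction() builds one combined factor, e.g. "automatic.4".
combo <- interaction(cars$am, cars$cyl)

# Mean mpg for every am/cyl combination, named by the combined label.
combo_means <- tapply(cars$mpg, combo, mean)
combo_means
```

Here the entry named `automatic.4` corresponds to the row labeled '0.4' in the output above.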
#### tidyverse approach
Calculating treatment means for **one factor**:
```{r eval = FALSE, echo = TRUE}
library(tidyverse)
YourDataSet %>%
  group_by(X) %>%
  summarise(MeanY = mean(Y), sdY = sd(Y), sampleSize = n())
```
<a href="javascript:showhide('mean3')">
<div class="hoverchunk">
<span class="tooltipr">
library(tidyverse)
<span class="tooltiprtext">tidyverse is an R Package that is very useful for working with data.</span>
</span><br><span class="tooltipr">
airquality
<span class="tooltipRtext">`airquality` is a dataset in R.</span>
</span><span class="tooltipr">
%>%
<span class="tooltipRtext">The pipe operator that will send the `airquality` dataset down inside of the code on the following line.</span>
</span><br/><span class="tooltipr">
group_by(
<span class="tooltipRtext">"group_by" is a function from library(tidyverse) that allows us to split the airquality dataset into "little" datasets, one dataset for each value in the "Month" column.</span>
</span><span class="tooltipr">
Month
<span class="tooltiprtext">"Month" is a column from the airquality dataset that can be treated as qualitative.</span>
</span><span class="tooltipr">
)
<span class="tooltiprtext">Functions must always end with a closing parenthesis.</span>
</span><span class="tooltipr">
%>%
<span class="tooltipRtext">The pipe operator that will send the grouped version of the `airquality` dataset down inside of the code on the following line.</span>
</span><br/><span class="tooltipr">
summarise(
<span class="tooltipRtext">"summarise" is a function from library(tidyverse) that allows us to compute numerical summaries on data.</span>
</span><span class="tooltipr">
aveTemp =
<span class="tooltiprtext">"aveTemp" is just a name we made up. It will contain the results of the mean(...) function.</span>
</span><span class="tooltipr">
mean(
<span class="tooltiprtext">"mean" is an R function used to calculate the mean.</span>
</span><span class="tooltipr">
Temp
<span class="tooltiprtext">Temp is a quantitative variable (numeric vector) from the airquality dataset.</span>
</span><span class="tooltipr">
)
<span class="tooltiprtext">Functions must always end with a closing parenthesis.</span>
</span><span class="tooltipr">
)
<span class="tooltiprtext">Functions must always end with a closing parenthesis.</span>
</span><span class="tooltipr">
<span class="tooltiprtext">Press Enter to run the code.</span>
</span><span class="tooltipr" style="float:right;">
Click to View Output
<span class="tooltiprtext">Click to View Output.</span>
</span>
</div>
</a>
<div id="mean3" style="display:none;">
```{r, echo=FALSE}
airquality %>%
group_by(Month) %>%
summarise(aveTemp = mean(Temp)) %>%
pander()
```
Note that R calculated the mean `Temp` for each month in `Month` from the `airquality` dataset: May (5), June (6), July (7), August (8), and September (9), respectively.
Further, note that to get the "nicely formatted" table, you would have to use
```{}
library(pander)
airquality %>%
group_by(Month) %>%
summarise(aveTemp = mean(Temp)) %>%
pander()
```
</div>
<br>
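One caution when adapting the one-factor code: `mean()` and `sd()` return `NA` if the response contains missing values. `Temp` has none, but the `Ozone` column of `airquality` does. The sketch below shows one possible way to handle this with `na.rm = TRUE`; the names `ozone_summary`, `meanOzone`, `nUsed`, and `nMissing` are made up for this example.

```r
library(dplyr)  # part of the tidyverse

# Ozone contains missing values (NA), so mean(Ozone) alone would return NA
# for any month that has at least one missing reading.
ozone_summary <- airquality %>%
  group_by(Month) %>%
  summarise(
    meanOzone = mean(Ozone, na.rm = TRUE),  # drop NAs before averaging
    nUsed     = sum(!is.na(Ozone)),         # readings that went into the mean
    nMissing  = sum(is.na(Ozone))           # readings that were dropped
  )

ozone_summary
```

Reporting `nUsed` instead of `n()` keeps the sample size honest, because `n()` counts rows whether or not the response is missing.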
To calculate treatment means for each combination of factor levels of **2 or more factors**, simply add the additional variable to the `group_by()` statement.
Example code:
<a href="javascript:showhide('mean_tidy2')">
<div class="hoverchunk">
<span class="tooltipr">
library(tidyverse)
<span class="tooltiprtext">tidyverse is an R Package that is very useful for working with data.</span>
</span><br><span class="tooltipr">
mtcars
<span class="tooltipRtext">`mtcars` is a dataset preloaded in R.</span>
</span><span class="tooltipr">
%>%
<span class="tooltipRtext">The pipe operator that will send the `mtcars` dataset down inside of the code on the following line.</span>
</span><br/><span class="tooltipr">
group_by(
<span class="tooltipRtext">"group_by" is a function from library(tidyverse) that allows us to split the mtcars dataset into "little" datasets, one dataset for each combination of values in the 'am' and 'cyl' variables</span>
</span><span class="tooltipr">
am, cyl
<span class="tooltiprtext">'am' and 'cyl' are both columns in mtcars. By listing them both here we are going to get output for each combination of 'am' and 'cyl' that exists in the dataset</span>
</span><span class="tooltipr">
)
<span class="tooltiprtext">Functions must always end with a closing parenthesis.</span>
</span><span class="tooltipr">
%>%
<span class="tooltipRtext">The pipe operator that will send the grouped version of the `mtcars` dataset down inside of the code on the following line.</span>
</span><br/><span class="tooltipr">
summarise(
<span class="tooltipRtext">"summarise" is a function from library(tidyverse) that allows us to compute numerical summaries on data.</span>
</span><span class="tooltipr">
mean_mpg =
<span class="tooltiprtext">"mean_mpg" is just a name we made up. It will contain the results of the mean(...) function.</span>
</span><span class="tooltipr">
mean(
<span class="tooltiprtext">"mean" is an R function used to calculate the mean.</span>
</span><span class="tooltipr">
mpg
<span class="tooltiprtext">mpg is a quantitative variable (numeric vector) from the mtcars dataset.</span>
</span><span class="tooltipr">
)
<span class="tooltiprtext">Functions must always end with a closing parenthesis.</span>
</span><span class="tooltipr">
)
<span class="tooltiprtext">Functions must always end with a closing parenthesis.</span>
</span><span class="tooltipr">
<span class="tooltiprtext">Press Enter to run the code.</span>
</span><span class="tooltipr" style="float:right;">
Click to View Output
<span class="tooltiprtext">Click to View Output.</span>
</span>
</div>
</a>
<div id="mean_tidy2" style="display:none;">
```{r, echo=FALSE}
mtcars %>%
group_by(am, cyl) %>%
summarise(mean_mpg = mean(mpg)) %>%
pander()
```
</div>
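The two-factor summary can report the same statistics as the one-factor template (mean, standard deviation, and sample size). The sketch below is one way to write it; `mpg_summary`, `sd_mpg`, and `sampleSize` are names chosen for illustration. It assumes a version of dplyr that supports the `.groups` argument of `summarise()` (1.0.0 or later, to the best of my knowledge).

```r
library(dplyr)  # part of the tidyverse

mpg_summary <- mtcars %>%
  group_by(am, cyl) %>%
  summarise(
    mean_mpg   = mean(mpg),
    sd_mpg     = sd(mpg),
    sampleSize = n(),       # number of cars in each am/cyl combination
    .groups    = "drop"     # return an ungrouped table
  )

mpg_summary
```

Setting `.groups = "drop"` is optional. Without it, `summarise()` keeps the result grouped by `am` and prints a message saying so.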
-----
### Treatment Effects
---
## Graphical Summaries
I'm thinking of boxplot, a scatter/jitter plot with means connected by a line, and an interaction plot. Try to decide the degree to teach base R vs. ggplot.