# Chi-square test for independence
<!-- Please don't mess with the next few lines! -->
<style>h5{font-size:2em;color:#0000FF}h6{font-size:1.5em;color:#0000FF}div.answer{margin-left:5%;border:1px solid #0000FF;border-left-width:10px;padding:25px} div.summary{background-color:rgba(30,144,255,0.1);border:3px double #0000FF;padding:25px}</style>`r options(scipen=999)`<p style="color:#ffffff">`r intToUtf8(c(50,46,48))`</p>
<!-- Please don't mess with the previous few lines! -->
::: {.summary}
### Functions introduced in this chapter: {-}
No new R functions are introduced here.
:::
## Introduction
In this chapter we will learn how to run the chi-square test for independence.
A chi-square test for independence tests the relationship between two categorical variables. This is an extension of the test for two proportions, except now applied in situations where either the predictor or response variables (or both) have three or more categories.
### Install new packages
There are no new packages used in this chapter.
### Download the R notebook file
Check the upper-right corner in RStudio to make sure you're in your `intro_stats` project. Then click on the following link to download this chapter as an R notebook file (`.Rmd`).
<a href = "https://vectorposse.github.io/intro_stats/chapter_downloads/18-chi_square_test_for_independence.Rmd" download>https://vectorposse.github.io/intro_stats/chapter_downloads/18-chi_square_test_for_independence.Rmd</a>
Once the file is downloaded, move it to your project folder in RStudio and open it there.
### Restart R and run all chunks
In RStudio, select "Restart R and Run All Chunks" from the "Run" menu.
## Load packages
We load the standard `tidyverse`, `janitor`, and `infer` packages. We also use the `MASS` package for the `birthwt` data, and the `openintro` package for the `smoking` and `diabetes2` data.
```{r}
library(tidyverse)
library(janitor)
library(infer)
library(MASS)
library(openintro)
```
## Research question
Are mothers from certain racial groups more or less likely to have low birth weight babies? In other words, are low birth weight and race associated?
Let's look at the data. The `birthwt` data was collected at Baystate Medical Center in Springfield, Massachusetts, during 1986. In terms of addressing the research question, we are, of course, limited to conclusions about women in that area of the country in the mid-1980s.
```{r}
birthwt
```
```{r}
glimpse(birthwt)
```
The `low` variable is an indicator of birth weight less than 2.5 kg. So even though birth weight is numerical, we have a convenient categorical variable that serves as a marker of low birth weight, gathering all low birth weight babies into a single group. The `race` variable is categorical, coded as 1 = white, 2 = black, 3 = other.
Neither variable appears in the data frame as a factor variable, so we will need to change that. The new tibble will be called `birthwt2`.
```{r}
birthwt2 <- birthwt %>%
  mutate(low_fct = factor(low, levels = c(0, 1),
                          labels = c("no", "yes")),
         race_fct = factor(race, levels = c(1, 2, 3),
                           labels = c("white", "black", "other")))
birthwt2
```
```{r}
glimpse(birthwt2)
```
## Chi-square test for independence
In a previous chapter, we learned about the chi-square goodness-of-fit test. With a single categorical variable, we summarized data in a frequency table. Each cell of the table had an observed count from the data that we compared to an expected count from the assumption of a null hypothesis. The chi-square statistic measured the discrepancy between observed and expected.
With two categorical variables, we use a contingency table instead of a frequency table. But the principle of the chi-square statistic is the same: each cell in the contingency table has an observed count and an expected count. This forms the basis of a chi-square test for independence.
Below is the contingency table for these two variables. Normally, we only care about column totals because we care how the response variable (here, `low_fct`) is distributed in each group of the predictor variable (i.e., each racial group). But for the calculation of chi-squared, we will need both row and column totals.
```{r}
tabyl(birthwt2, low_fct, race_fct) %>%
  adorn_totals(where = c("row", "col"))
```
A test for independence has a simple null hypothesis: the two variables are independent. This gives us a way to compute expected counts. To see how, look at the sum of all the normal weight babies ($73 + 15 + 42 = 130$) and all the low birth weight babies ($23 + 11 + 25 = 59$). In other words, if race is ignored, there were 130 normal weight babies and 59 low birth weight babies out of 189 total babies. 59 of 189 is 0.31217 or 31.217%, and 130 of 189 is 0.68783 or 68.783%.
Now, if low birth weight and race are truly independent, it shouldn't matter if the mothers were white, black, or some other race. In other words, of 96 white mothers, we should still expect 68.783% of them to have normal weight babies and 31.217% of them to have low birth weight babies. 68.783% of 96 is 66.032. **This is the expected cell count for normal birth weight babies of white women.** 31.217% of 96 is 29.968. **This is the expected cell count for low birth weight babies of white women.** The same analysis can be done for the next two columns as well.
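The arithmetic above can be reproduced directly in R. The following is a minimal sketch that types in the counts from the contingency table by hand; the variable names (`n_normal`, `n_low`, `n_white`, and so on) are ours and are not used elsewhere in the chapter.

```{r}
# Row totals, ignoring race (counts copied from the contingency table above)
n_normal <- 73 + 15 + 42   # normal weight babies
n_low    <- 23 + 11 + 25   # low birth weight babies
n_total  <- n_normal + n_low

# Overall proportions of normal and low birth weight babies
prop_normal <- n_normal / n_total
prop_low    <- n_low / n_total

# Expected counts for the 96 white mothers if birth weight and race are independent
n_white <- 96
expected_white <- c(no = prop_normal * n_white, yes = prop_low * n_white)
round(expected_white, 3)
```

Changing `n_white` to another column total applies the same two percentages to that racial group.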
##### Exercise 1 {-}
Complete the list of expected cell counts in the table above. In other words, apply the percentages 68.783% and 31.217% to the totals of the "black" and "other" columns. Put them in the table below:
::: {.answer}
| | white | black | other |
|-----|--------|--------|--------|
| no | 66.032 | ? | ? |
| yes | 29.968 | ? | ? |
:::
*****
Unlike the goodness-of-fit test, which requires one to specify expected counts for each cell, the test for independence uses only the data to determine the expected counts. For any given cell, if $R$ is the row total, $C$ is the column total, and $n$ is the grand total (the sample size), the expected count in that cell is simply
$$
E = \frac{R C}{n}.
$$
This is equivalent to the explanation in the previous paragraph. Using low birth weight babies among white mothers as an example, $R/n$ is $59/189$ which is 0.31217. Then we multiply this by the column total $C = 96$ to get
$$
\left(\frac{R}{n}\right) C = \frac{R C}{n} = \frac{59 \times 96}{189} = 29.96825.
$$
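As a quick check, the formula can be wrapped in a small helper. This is only a sketch: `expected_count()` is a hypothetical function written for this illustration and is not part of any package.

```{r}
# Hypothetical helper: expected cell count under independence, E = R * C / n
expected_count <- function(row_total, col_total, grand_total) {
  row_total * col_total / grand_total
}

# Low birth weight babies among white mothers: R = 59, C = 96, n = 189
e_yes_white <- expected_count(59, 96, 189)
e_yes_white

# Same answer as applying the overall proportion 59/189 to the 96 white mothers
all.equal(e_yes_white, (59 / 189) * 96)
```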
Everything else works almost the same as it did for a chi-square goodness-of-fit test. We still compute $\chi^{2}$ by adding up deviations across all cells:
$$
\chi^{2} = \sum \frac{(O - E)^{2}}{E}.
$$
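To see the formula in action, here is a sketch that computes $\chi^{2}$ by hand in base R. It assumes the observed counts are typed into a matrix (called `observed_sketch` here, a name of our choosing) and uses `outer()` to apply $E = RC/n$ to every cell at once. The expected counts are deliberately not printed.

```{r}
# Observed counts from the contingency table (rows: low_fct, columns: race_fct)
observed_sketch <- matrix(c(73, 15, 42,
                            23, 11, 25),
                          nrow = 2, byrow = TRUE,
                          dimnames = list(low = c("no", "yes"),
                                          race = c("white", "black", "other")))

# Expected counts: (row total * column total) / grand total for every cell
expected_sketch <- outer(rowSums(observed_sketch), colSums(observed_sketch)) /
  sum(observed_sketch)

# Chi-square statistic: sum of (O - E)^2 / E over all six cells
chisq_sketch <- sum((observed_sketch - expected_sketch)^2 / expected_sketch)
chisq_sketch
```

The result should agree with the test statistic that `infer` reports in the Mechanics section.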
Even under the assumption of the null, there will still be some sampling variability. Like any hypothesis test, our job is to determine whether the deviations we see are possible due to pure chance alone. The random values of $\chi^{2}$ that result from sampling variability will follow a chi-square model. But how many degrees of freedom are there? This is a little different from the goodness-of-fit test. Instead of the number of cells minus one, we use the following formula:
$$
df = (\#rows - 1)(\#columns - 1).
$$
In our example we have 2 rows ("yes", "no") and 3 columns ("white", "black", "other"); therefore,
$$
df = (2 - 1)(3 - 1) = 1 \times 2 = 2
$$
and we have 2 degrees of freedom (even though there are 6 cells).
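The degrees of freedom can also be read off the dimensions of the table. A minimal sketch, assuming the counts are stored in a matrix (the names `counts_for_df` and `df_sketch` are ours):

```{r}
# A 2 x 3 matrix of observed counts (rows: low_fct, columns: race_fct)
counts_for_df <- matrix(c(73, 15, 42,
                          23, 11, 25),
                        nrow = 2, byrow = TRUE)

# df = (#rows - 1) * (#columns - 1)
df_sketch <- (nrow(counts_for_df) - 1) * (ncol(counts_for_df) - 1)
df_sketch
```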
Let's run through the rubric in its entirety.
## Exploratory data analysis
### Use data documentation (help files, code books, Google, etc.) to determine as much as possible about the data provenance and structure.
You should type `?birthwt` at the Console to read the help file. We don't have any information about how these mothers were selected. The "Source" at the end of the help file is a statistics textbook, so we'd have to track down that book to see where its authors got the data and whether it can be traced back to a primary source.
```{r}
birthwt
```
```{r}
glimpse(birthwt)
```
### Prepare the data for analysis.
```{r}
# Although we've already done this above,
# we include it here again for completeness.
birthwt2 <- birthwt %>%
  mutate(low_fct = factor(low, levels = c(0, 1),
                          labels = c("no", "yes")),
         race_fct = factor(race, levels = c(1, 2, 3),
                           labels = c("white", "black", "other")))
birthwt2
```
### Make tables or plots to explore the data visually.
```{r}
tabyl(birthwt2, low_fct, race_fct) %>%
  adorn_totals()
```
```{r}
tabyl(birthwt2, low_fct, race_fct) %>%
  adorn_totals() %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting()
```
Commentary: Earlier we used row and column totals to explain how expected cell counts arise. Here, however, we return to our standard practice of generating one contingency table with counts and another with column percentages.
## Hypotheses
### Identify the sample (or samples) and a reasonable population (or populations) of interest.
The sample consists of 189 mothers who gave birth at the Baystate Medical Center in Springfield, Massachusetts in 1986. The population is presumably all mothers, although it's safest to conclude only about mothers who gave birth at this hospital.
### Express the null and alternative hypotheses as contextually meaningful full sentences.
$H_{0}:$ Low birth weight and race are independent.
$H_{A}:$ Low birth weight and race are associated.
### Express the null and alternative hypotheses in symbols (when possible).
For a chi-square test for independence, this section is not applicable. With multiple categories in the response and predictor variables, there are no specific parameters of interest to express symbolically.
## Model
### Identify the sampling distribution model.
We will use a chi-square model with 2 degrees of freedom.
### Check the relevant conditions to ensure that model assumptions are met.
* Random
- We hope that these 189 women are representative of all women who gave birth in this hospital (or, at best, in that region) around that time.
* 10%
- We don't know how many women gave birth at this hospital, but perhaps over many years we might have more than 1890 women.
* Expected cell counts
- You checked the cell counts as a part of Exercise 1. Note that all expected cell counts are larger than 5, so the condition is met.
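The expected cell count condition can also be checked in code rather than by hand. The sketch below is one way to do it, assuming the counts are entered as a matrix (the names `counts_for_check` and `expected_ok` are ours). Base R's `chisq.test()` stores the expected counts in its `expected` component, so we only need to ask whether all of them are larger than 5.

```{r}
# Observed counts from the contingency table (rows: low_fct, columns: race_fct)
counts_for_check <- matrix(c(73, 15, 42,
                             23, 11, 25),
                           nrow = 2, byrow = TRUE,
                           dimnames = list(low = c("no", "yes"),
                                           race = c("white", "black", "other")))

# TRUE if every expected cell count is larger than 5
expected_ok <- all(chisq.test(counts_for_check)$expected > 5)
expected_ok
```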
## Mechanics
### Compute the test statistic.
```{r}
obs_chisq <- birthwt2 %>%
  specify(response = low_fct, explanatory = race_fct) %>%
  hypothesize(null = "independence") %>%
  calculate(stat = "chisq")
obs_chisq
```
### Report the test statistic in context (when possible).
The value of $\chi^{2}$ is `r obs_chisq %>% pull(1)`.
Commentary: As in the last chapter, there's not much context to report with a value of $\chi^{2}$, so the most we can do here is just report it in a full sentence.
### Plot the null distribution.
```{r}
low_race_test <- birthwt2 %>%
  specify(response = low_fct, explanatory = race_fct) %>%
  assume(distribution = "chisq")
low_race_test
```
```{r}
low_race_test %>%
  visualize() +
  shade_p_value(obs_chisq, direction = "greater")
```
### Calculate the P-value.
```{r}
low_race_test_p <- low_race_test %>%
  get_p_value(obs_chisq, direction = "greater")
low_race_test_p
```
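As a cross-check on `get_p_value()`, the same upper-tail area can be computed with base R's `pchisq()`. This is a sketch under the assumption that the counts are typed in by hand from the contingency table; all object names ending in `_for_p` or `_sketch` are ours.

```{r}
# Observed counts (rows: low_fct, columns: race_fct)
counts_for_p <- matrix(c(73, 15, 42,
                         23, 11, 25),
                       nrow = 2, byrow = TRUE)

# Expected counts under independence and the resulting chi-square statistic
expected_for_p <- outer(rowSums(counts_for_p), colSums(counts_for_p)) /
  sum(counts_for_p)
chisq_for_p <- sum((counts_for_p - expected_for_p)^2 / expected_for_p)

# Upper-tail area of a chi-square model with 2 degrees of freedom
p_sketch <- pchisq(chisq_for_p, df = 2, lower.tail = FALSE)
p_sketch
```

We use `lower.tail = FALSE` because large values of $\chi^{2}$ are the ones that count as evidence against the null, which matches `direction = "greater"` above.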
### Interpret the P-value as a probability given the null.
The P-value is `r low_race_test_p %>% pull(1)`. If low birth weight and race were independent, there would be a `r 100 * low_race_test_p %>% pull(1)`% chance of seeing results at least as extreme as we saw in the data.
## Conclusion
### State the statistical conclusion.
We fail to reject the null hypothesis.
### State (but do not overstate) a contextually meaningful conclusion.
There is insufficient evidence that low birth weight and race are associated.
### Express reservations or uncertainty about the generalizability of the conclusion.
Given our uncertainty about how the data was collected, it's not clear what our conclusion means. Also, failing to reject the null is really a "non-conclusion" in that it leaves us basically knowing nothing. We don't have evidence of such an association (and there are good reasons to believe there may not be one), but failing to reject the null does not prove anything.
### Identify the possibility of either a Type I or Type II error and state what making such an error means in the context of the hypotheses.
It's possible that we have made a Type II error. It may be that low birth weight and race are associated, but our sample has not given enough evidence of such an association.
## Confidence interval
There are no parameters of interest in a chi-square test, so there is no confidence interval to report.
## Your turn
Use the `smoking` data set from the `openintro` package. Run a chi-square test for independence to determine if smoking status is associated with marital status.
The rubric outline is reproduced below. You may refer to the worked example above and modify it accordingly. Remember to strip out all the commentary. That is just exposition for your benefit in understanding the steps, but is not meant to form part of the formal inference process.
Another word of warning: the copy/paste process is not a substitute for your brain. You will often need to modify more than just the names of the data frames and variables to adapt the worked examples to your own work. Do not blindly copy and paste code without understanding what it does. And you should **never** copy and paste text. All the sentences and paragraphs you write are expressions of your own analysis. They must reflect your own understanding of the inferential process.
**Also, so that your answers here don't mess up the code chunks above, use new variable names everywhere.**
##### Exploratory data analysis {-}
###### Use data documentation (help files, code books, Google, etc.) to determine as much as possible about the data provenance and structure. {-}
::: {.answer}
Please write up your answer here.
```{r}
# Add code here to print the data
```
```{r}
# Add code here to glimpse the variables
```
:::
###### Prepare the data for analysis. [Not always necessary.] {-}
::: {.answer}
```{r}
# Add code here to prepare the data for analysis.
```
:::
###### Make tables or plots to explore the data visually. {-}
::: {.answer}
```{r}
# Add code here to make tables or plots.
```
:::
##### Hypotheses {-}
###### Identify the sample (or samples) and a reasonable population (or populations) of interest. {-}
::: {.answer}
Please write up your answer here.
:::
###### Express the null and alternative hypotheses as contextually meaningful full sentences. {-}
::: {.answer}
$H_{0}:$ Null hypothesis goes here.
$H_{A}:$ Alternative hypothesis goes here.
:::
###### Express the null and alternative hypotheses in symbols (when possible). {-}
::: {.answer}
$H_{0}: math$
$H_{A}: math$
:::
##### Model {-}
###### Identify the sampling distribution model. {-}
::: {.answer}
Please write up your answer here.
:::
###### Check the relevant conditions to ensure that model assumptions are met. {-}
::: {.answer}
Please write up your answer here. (Some conditions may require R code as well.)
:::
##### Mechanics {-}
###### Compute the test statistic. {-}
::: {.answer}
```{r}
# Add code here to compute the test statistic.
```
:::
###### Report the test statistic in context (when possible). {-}
::: {.answer}
Please write up your answer here.
:::
###### Plot the null distribution. {-}
::: {.answer}
```{r}
# Add code here to plot the null distribution.
```
:::
###### Calculate the P-value. {-}
::: {.answer}
```{r}
# Add code here to calculate the P-value.
```
:::
###### Interpret the P-value as a probability given the null. {-}
::: {.answer}
Please write up your answer here.
:::
##### Conclusion {-}
###### State the statistical conclusion. {-}
::: {.answer}
Please write up your answer here.
:::
###### State (but do not overstate) a contextually meaningful conclusion. {-}
::: {.answer}
Please write up your answer here.
:::
###### Express reservations or uncertainty about the generalizability of the conclusion. {-}
::: {.answer}
Please write up your answer here.
:::
###### Identify the possibility of either a Type I or Type II error and state what making such an error means in the context of the hypotheses. {-}
::: {.answer}
Please write up your answer here.
:::
## Bonus section: Residuals
Just like with the chi-square test for goodness of fit, rejecting the null hypothesis using the chi-square test for independence informs us that two variables are associated, but it doesn't tell us which combinations of categories have higher or lower counts than expected. And just like the chi-square test for goodness of fit, we can examine the *residuals table* to find that information.
**A word of caution**: You should only examine the residuals if your test was statistically significant! The residuals table for tests in which we fail to reject the null hypothesis can be misleading.
Because we failed to reject the null hypothesis in the `low_race_test`, it would be unwise for us to examine the residuals table in that test. Instead, we'll use a different example.
The `diabetes2` dataset in the `openintro` package contains information about an experiment evaluating three treatments for Type 2 diabetes in patients aged 10-17 who were being treated with metformin. The three treatments summarized in the `treatment` variable were: continued treatment with metformin (`met`), treatment with metformin combined with rosiglitazone (`rosi`), or a lifestyle intervention program (`lifestyle`). Each patient had a primary `outcome`, which was either "lacked glycemic control" (`failure`) or did not lack that control (`success`). Here is the summary of the results of the experiment:
```{r}
tabyl(diabetes2, treatment, outcome)
```
For the sake of a streamlined presentation, we'll omit the usual details of condition-checking, hypothesis-writing, etc., and skip right to the conclusion.
```{r}
tabyl(diabetes2, treatment, outcome) %>%
  chisq.test() -> outcome_treatment_chisq.test
outcome_treatment_chisq.test
```
Notice that the P-value obtained from the test is below our usual significance level $\alpha = 0.05$, so it makes sense for us to examine the residuals.
```{r}
outcome_treatment_chisq.test$residuals
```
Again, these values don't mean much in the real world; our job is to look at the most positive and most negative values.
- Since the `rosi` and `failure` cell has the most negative value, the count of people who failed to achieve glycemic control with rosiglitazone is the most *below* expected. (That's a good result!)
- Since the `rosi` and `success` cell has the most positive value, the count of people who succeeded in achieving glycemic control with rosiglitazone is the most *above* expected. (That's also a good result!)
Overall, we can conclude that the rosiglitazone treatment was quite successful in helping people achieve their glycemic control goals.
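For readers curious about where the residuals come from: the values stored by `chisq.test()` are Pearson residuals, $(O - E)/\sqrt{E}$ for each cell. The sketch below recomputes them by hand as a check. It assumes the `openintro` package is installed and builds the table with base R's `table()` rather than `tabyl()`; the object names beginning with `diabetes_` are ours.

```{r}
library(openintro)

# Observed counts: treatment (rows) by outcome (columns)
diabetes_counts <- table(diabetes2$treatment, diabetes2$outcome)

# Expected counts under independence: E = R * C / n
diabetes_expected <- outer(rowSums(diabetes_counts), colSums(diabetes_counts)) /
  sum(diabetes_counts)

# Pearson residuals: (O - E) / sqrt(E)
diabetes_resid <- (diabetes_counts - diabetes_expected) / sqrt(diabetes_expected)
diabetes_resid

# Should agree with the residuals stored by chisq.test()
all.equal(as.vector(diabetes_resid),
          as.vector(chisq.test(diabetes_counts)$residuals))
```

A positive residual means the observed count is above the expected count, and a negative residual means it is below, which is why we look at the sign and size of each value.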
### Your turn
Examine the residuals table to determine which marital statuses are most associated with smoking or not smoking.
::: {.answer}
```{r}
# Add code here to produce the chisq.test result.
# Add code here to examine the residuals table.
```
Please write your answer here.
:::
## Conclusion
With two categorical variables, we can run a chi-square test for independence to test the null hypothesis that the two variables are independent. While technically we can run this test for any two categorical variables, if both variables have only two levels, we would usually choose to run a test for two proportions. The chi-square test for independence is useful when one or both of the response and predictor variables have three or more levels. The expected cell counts are derived from the data and then the chi-squared statistic is computed as usual. Using the correct degrees of freedom, we can test how much the observed cell counts deviate from the expected cell counts and derive a P-value.
### Preparing and submitting your assignment
1. From the "Run" menu, select "Restart R and Run All Chunks".
2. Deal with any code errors that crop up. Repeat steps 1--2 until there are no more code errors.
3. Spell check your document by clicking the icon with "ABC" and a check mark.
4. Hit the "Preview" button one last time to generate the final draft of the `.nb.html` file.
5. Proofread the HTML file carefully. If there are errors, go back and fix them, then repeat steps 1--5.
If you have completed this chapter as part of a statistics course, follow the directions you receive from your professor to submit your assignment.