21-inference_for_two_independent_means-web.Rmd

# Inference for two independent means

<!-- Please don't mess with the next few lines! -->
<style>h5{font-size:2em;color:#0000FF}h6{font-size:1.5em;color:#0000FF}div.answer{margin-left:5%;border:1px solid #0000FF;border-left-width:10px;padding:25px} div.summary{background-color:rgba(30,144,255,0.1);border:3px double #0000FF;padding:25px}</style>`r options(scipen=999)`<p style="color:#ffffff">`r intToUtf8(c(50,46,48))`</p>
<!-- Please don't mess with the previous few lines! -->


::: {.summary}

### Functions introduced in this chapter: {-}

No new R functions are introduced here.

:::


## Introduction

If we have a numerical variable and a categorical variable with two categories, we can think of the numerical variable as response and the categorical variable as predictor. The idea is that the two categories sort your numerical data into two groups which can be compared. Assuming the two groups are independent of each other, we can use them as samples of two larger populations. This leads to inference to decide if the difference between the means of the two groups is statistically significant and then estimate the difference between the means of the two populations represented. The relevant hypothesis test is called a two-sample t test (or Welch's t test, to be specific).


### Install new packages

There are no new packages used in this chapter.

### Download the R notebook file

Check the upper-right corner in RStudio to make sure you're in your `intro_stats` project. Then click on the following link to download this chapter as an R notebook file (`.Rmd`).

<a href = "https://vectorposse.github.io/intro_stats/chapter_downloads/21-inference_for_two_independent_means.Rmd" download>https://vectorposse.github.io/intro_stats/chapter_downloads/21-inference_for_two_independent_means.Rmd</a>

Once the file is downloaded, move it to your project folder in RStudio and open it there.

### Restart R and run all chunks

In RStudio, select "Restart R and Run All Chunks" from the "Run" menu.


## Load packages

We load the standard `tidyverse`, `janitor`, and `infer` packages. We also use the `MASS` package for the `birthwt` data.

```{r}
library(tidyverse)
library(janitor)
library(infer)
library(MASS)
```


## Research question

Recall the `birthwt` data that was collected at Baystate Medical Center, Springfield, Mass during 1986. In a previous chapter, we measured low birth weight babies using a categorical variable that served as an indicator for low birth weight.

##### Exercise 1 {-}

How was it determined if a baby was considered "low birth weight" for purposes of constructing the variable `low`? Use the help file to find out.

::: {.answer}

Please write up your answer here.

:::

*****


We have the actual birth weight of the babies in this data. So, rather than using a coarse classification into a binary "yes or no" variable, why not use the full precision of the birth weight measured in grams? This is a very precisely measured numerical variable.

We'd like to compare mean birth weights among two groups: women who smoked during pregnancy, and women who didn't.


## Data preparation

The actual mean weights in each sample (the smoking women and the nonsmoking women) can be found using a `group_by` and `summarise` pipeline:

```{r}
birthwt %>%
  group_by(smoke) %>%
  summarise(mean(bwt))
```

Note that 0 means "nonsmoker" and 1 means "smoker". Looks like We need to address the fact the `smoke` variable is recorded as a numerical variable instead of a categorical variable. Here is `birthwt2` that we will use from here on out:

```{r}
birthwt2 <- birthwt %>%
    mutate(smoke_fct = factor(smoke, levels = c(0, 1), labels = c("Nonsmoker", "Smoker")))
birthwt2
```

```{r}
glimpse(birthwt2)
```

The difference between the means is now calculated using `infer` tools. We will store the result as `obs_diff` for "observed difference".

```{r}
obs_diff <- birthwt2 %>%
  specify(response = bwt, explanatory = smoke_fct) %>% 
  calculate(stat = "diff in means", order = c("Nonsmoker", "Smoker"))
obs_diff
```


##### Exercise 2 {-}

What would happen if we used `order = c("Smoker", "Nonsmoker")` instead? Why might we have a slight preference for `order = c("Nonsmoker", "Smoker")`?

::: {.answer}

Please write up your answer here.

:::

*****


Note that it will not actually make a difference to the inferential process in which order we subtract. However, we do have to be consistent to use the same order throughout. When interpreting the test statistic, effect size, and confidence interval, we will need to pay attention to the order of subtraction to make sure we are interpreting our results correctly.


## Every day I'm shuffling

Whenever there are two groups, the obvious null hypothesis is that there is no difference between them.

Consider the `smoke` variable. If there were truly no difference in mean birth weights between women who smoked and women who didn't, then it shouldn't matter if we know the smoking status or not. It becomes irrelevant under the assumption of the null.

We can simulate this assumption by shuffling the list of smoking status. More concretely, we can randomly assign a smoking status label to each mother and then calculate the average birth weight in each group. Since the smoking labels are random, there's no reason to expect a difference between the two average weights other than random fluctuations due to sampling variability.

For example, here is the actual smoking status of the women:

```{r}
birthwt2$smoke_fct
```

But we're going to use values that have been randomly shuffled, like this one, for example:

```{r}
set.seed(1729)
sample(birthwt2$smoke_fct)
```

The `infer` package will perform this random shuffling over and over again. Given the now arbitrary labels of "Nonsmoker" and "Smoker" (which are meaningless because each women was assigned to one of these labels randomly with no regard to her actual smoking status), `infer` will calculate the mean birth weights among the first group of women (labeled "Nonsmokers" but not really consisting of all nonsmokers) and the second group of women (labeled "Smokers" but not really consisting of all smokers). Finally `infer` will compute the difference between those two means. And it will do this process 1000 times.

```{r}
set.seed(1729)
bwt_smoke_test <- birthwt2 %>%
  specify(response = bwt, explanatory = smoke_fct) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("Nonsmoker", "Smoker"))
bwt_smoke_test
```

##### Exercise 3 {-}

Before we graph these simulated values, what do you guess will be the mean value? Keep in mind that we have computed differences in the mean birth weights between two groups of women. But because we have shuffled the smoking labels randomly, we aren't really calculating the difference in mean birth weights of nonsmokers vs smokers. We're just computing the difference in mean birth weights of randomly assigned groups of women.

::: {.answer}

Please write up your answer here.

:::

*****


Here's the visualization:

```{r}
bwt_smoke_test %>%
    visualize()
```

No surprise that this histogram looks nearly normal, centered at zero: the simulation is working under the assumption of the null hypothesis of no difference between the groups.

Here is the same plot but including our sample difference:

```{r}
bwt_smoke_test %>%
    visualize() +
    shade_p_value(obs_stat = obs_diff, direction = "two_sided")
```

Our observed difference (from the sampled data) is quite far out into the tail of this simulated sampling distribution, so it appears that our actual data would be somewhat unlikely due to pure chance alone if the null hypothesis were true.

We can even find a P-value by calculating how many of our sampled values are as extreme or more extreme than the observed data difference.

```{r}
bwt_smoke_test %>%
    get_p_value(obs_stat = obs_diff, direction = "two-sided")
```

Indeed, this is a small P-value.


## The sampling distribution model

In the previous section, we simulated the sampling distribution under the assumption of a null hypothesis of no difference between the groups. It certainly looked like a normal model, but which normal model? The center is obviously zero, but what about the standard deviation?

Let's assume that both groups come from populations that are normally distributed with normal models $N(\mu_{1}, \sigma_{1})$ and $N(\mu_{2}, \sigma_{2})$. If we take samples of size $n_{1}$ from group 1 and $n_{2}$ from group 2, some fancy math shows that the distribution of the differences between sample means is

$$
N\left(\mu_{1} - \mu_{2}, \sqrt{\frac{\sigma_{1}^{2}}{n_{1}} + \frac{\sigma_{2}^{2}}{n_{2}}}\right).
$$

Under the assumption of the null, the difference of the means is zero ($\mu_{1} - \mu_{2} =  0$). Unfortunately, though, we make no assumption on the standard deviations. It should be clear that the only solution is to substitute the sample standard deviations $s_{1}$ and $s_{2}$ for the population standard deviations $\sigma_{1}$ and $\sigma_{2}$.^[When we were testing two proportions with categorical data, one option (described in an optional appendix in that chapter) was to pool the data. With numerical data, we can calculate a pooled mean, but that doesn't help with the unknown standard deviations. Nothing in the null hypothesis suggests that the standard deviations of the two groups should be the same. In the extremely rare situation in which one can assume equal standard deviations in the two groups, then there is a way to run a pooled t test. But this "extra" assumption of equal standard deviations is typically questionable at best.]

$$
SE = \sqrt{\frac{s_{1}^{2}}{n_{1}} + \frac{s_{2}^{2}}{n_{2}}}.
$$

However, $s_{1}$ and $s_{2}$ are not perfect estimates of $\sigma_{1}$ and $\sigma_{2}$; they are subject to sampling variability too. This extra variability means that a normal model is no longer appropriate as the sampling distribution model.

In the one-sample case, a Student t model with $df = n - 1$ was the right choice. In the two-sample case, we don't know the right answer. And I don't mean that we haven't learned it yet in our stats class. I mean, statisticians have not found a formula for the correct sampling distribution. It is a famous unsolved problem, called the Behrens-Fisher problem.

Several researchers have proposed solutions that are "close" though. One compelling one is called "Welch's t test". Welch showed that even though it's not quite right, a Student t model is very close as long as you pick the degrees of freedom carefully. Unfortunately, the way to compute the right degrees of freedom is crazy complicated. Fortunately, R is good at crazy complicated computations.

Let's go through the full rubric.


## Exploratory data analysis

### Use data documentation (help files, code books, Google, etc.) to determine as much as possible about the data provenance and structure.

Type `birthwt` at the Console to read the help file. We have the same concerns about the lack of details as we did in Chapter 16.

```{r}
birthwt
```

```{r}
glimpse(birthwt)
```

### Prepare the data for analysis.

We need to be sure `smoke` is a factor variable, so we create the new tibble `birthwt2` with the mutated variable `smoke_fct`.

```{r}
birthwt2 <- birthwt %>%
    mutate(smoke_fct = factor(smoke, levels = c(0, 1), labels = c("Nonsmoker", "Smoker")))
birthwt2
```

```{r}
glimpse(birthwt2)
```

### Make tables or plots to explore the data visually.

How many women are in each group?

```{r}
tabyl(birthwt2, smoke_fct) %>%
  adorn_totals()
```

With a numerical response variable and a categorical predictor variable, there are two useful plots: a side-by-side boxplot and a stacked histogram.

```{r}
ggplot(birthwt2, aes(y = bwt, x = smoke_fct)) +
    geom_boxplot()
```

```{r}
ggplot(birthwt2, aes(x = bwt)) +
    geom_histogram(binwidth = 250, boundary = 0) +
    facet_grid(smoke_fct ~ .)
```

The histograms for both groups look sort of normal, but the nonsmoker group may be a little left skewed and the smoker group may have some low outliers. Here are the QQ plots to give us another way to ascertain normality of the data.

```{r}
ggplot(birthwt2, aes(sample = bwt)) +
    geom_qq() +
    geom_qq_line() +
    facet_grid(smoke_fct ~ .)
```

There's a little deviation from normality, but nothing too crazy.

Commentary: The boxplots and histograms show why statistical inference is so important. It's clear that there is some difference between the two groups, but it's not obvious if that difference will turn out to be statistically significant. There appears to be a lot of variability in both groups, and both groups have a fair number of lighter and heavier babies.


## Hypotheses

### Identify the sample (or samples) and a reasonable population (or populations) of interest.

The samples consist of 115 nonsmoking mothers and 74 smoking mothers. The populations are those women who do not smoke during pregnancy and those women who do smoke during pregnancy.

### Express the null and alternative hypotheses as contextually meaningful full sentences.

$H_{0}:$ There is no difference in the birth weight of babies born to mothers who do not smoke versus mothers who do smoke.

$H_{A}:$ There is a difference in the birth weight of babies born to mothers who do not smoke versus mothers who do smoke.

### Express the null and alternative hypotheses in symbols (when possible).

$H_{0}: \mu_{Nonsmoker} - \mu_{Smoker} = 0$

$H_{A}: \mu_{Nonsmoker} - \mu_{Smoker} \neq 0$

Commentary: As mentioned before, the order in which you subtract will not change the inference, but it will affect your interpretation of the results. Also, once you've chosen a direction to subtract, be consistent about that choice throughout the rubric.


## Model

### Identify the sampling distribution model.

We use a t model with the number of degrees of freedom to be determined.

Commentary: For Welch's t test, the degrees of freedom won't usually be a whole number. Be sure you understand that the formula is no longer $df = n - 1$. That doesn't even make any sense as there isn't a single $n$ in a *two*-sample test. The `infer` package will tell us how many degrees of freedom to use later in the Mechanics section.

### Check the relevant conditions to ensure that model assumptions are met.

* Random (for both groups)
    - We have very little information about these women. We hope that the 115 nonsmoking mothers at this hospital are representative of other nonsmoking mothers, at least in that region at that time. And same for the 74 smoking mothers.

* 10% (for both groups)
    - 115 is less than 10% of all nonsmoking mothers and 74 is less than 10% of all smoking mothers.
    
* Nearly normal (for both groups)
    - Since the sample sizes are more than 30 in each group, we meet the condition.


## Mechanics

### Compute the test statistic.

```{r}
obs_diff <- birthwt2 %>%
  specify(response = bwt, explanatory = smoke_fct) %>%
  calculate(stat = "diff in means", order = c("Nonsmoker", "Smoker"))
obs_diff
```

```{r}
obs_diff_t <- birthwt2 %>%
  specify(response = bwt, explanatory = smoke_fct) %>%
  calculate(stat = "t", order = c("Nonsmoker", "Smoker"))
obs_diff_t
```

### Report the test statistic in context (when possible).

The difference in the mean birth weight of babies born to nonsmoking mothers and smoking mothers is `r obs_diff %>% pull(1)` grams. This was obtained by subtracting nonsmoking mothers minus smoking mothers. In other words, the fact that this is positive indicates that nonsmoking mothers had heavier babies, on average, than smoking mothers.

The t score is `r obs_diff_t %>% pull(1)`. The sample difference in birth weights is about 2.7 standard errors higher than the null value of zero.

Commentary: Remember that whenever you are computing the difference between two quantities, you must indicate the direction of that difference you so your reader knows how to interpret the value, whether it is positive or negative.

### Plot the null distribution.

```{r}
bwt_smoke_test_t <- birthwt2 %>%
  specify(response = bwt, explanatory = smoke_fct) %>%
  hypothesise(null = "independence") %>%
  assume("t")
bwt_smoke_test_t
```

```{r}
bwt_smoke_test_t %>%
  visualize() +
  shade_p_value(obs_stat = obs_diff_t, direction = "two-sided")
```


Commentary: We use the name `bwt_smoke_test_t` (using the assumption of a Student t model) as a new variable name so that it doesn't overwrite the variable `bwt_smoke_test` we performed earlier as a permutation test (the one with the shuffling). This results of using `bwt_smoke_test` versus `bwt_smoke_test_t` will be very similar.

Note that the `infer` output tells us there are 170 degrees of freedom. (It turns out to be 170.1.) Note that this number is the result of a complicated formula, and it's not just a simple function of the sample sizes 115 and 74.

Finally, note that the alternative hypothesis indicated a two-sided test, so we need to specify a "two-sided" P-value in the `shade_p_value` command.

### Calculate the P-value.

```{r}
bwt_smoke_p <- bwt_smoke_test_t %>%
  get_p_value(obs_stat = obs_diff_t, direction = "two-sided")
bwt_smoke_p
```

### Interpret the P-value as a probability given the null.

The P-value is `r bwt_smoke_p %>% pull(1)`. If there were no difference in the mean birth weights between nonsmoking and smoking women, there would be a `r 100 * bwt_smoke_p %>% pull(1)`% chance of seeing data at least as extreme as what we saw.


## Conclusion

### State the statistical conclusion.

We reject the null hypothesis.

### State (but do not overstate) a contextually meaningful conclusion.

We have sufficient evidence that there is a difference in the mean birth weight of babies born to mothers who do not smoke versus mothers who do smoke.

### Express reservations or uncertainty about the generalizability of the conclusion.

As when we looked at this data before, our uncertainly about the data provenance means that we don't know if the difference observed in these samples at this one hospital at this one time are generalizable to larger populations. Also keep in mind that this data is observational, so we cannot draw any causal conclusion about the "effect" of smoking on birth weight.

### Identify the possibility of either a Type I or Type II error and state what making such an error means in the context of the hypotheses.

If we've made a Type I error, then that means that there might be no difference in the birth weights of babies from nonsmoking versus smoking mothers, but we got some unusual samples that showed a difference.


## Confidence interval

### Check the relevant conditions to ensure that model assumptions are met.

There are no additional conditions to check.

### Calculate the confidence interval.

```{r}
bwt_smoke_ci <- bwt_smoke_test_t %>%
  get_confidence_interval(point_estimate = obs_diff, level = 0.95)
bwt_smoke_ci
```

Commentary: Pay close attention to when we use `obs_diff` and `obs_diff_t`. In the hypothesis test, we assumed a t distribution for the null and so we have to use the t score `obs_diff_t` to shade the P-value. However, for a confidence interval, we are building the interval centered on our sample difference `obs_diff`.

### State (but do not overstate) a contextually meaningful interpretation.

We are 95% confident that the true difference in birth weight between nonsmoking and smoking mothers is captured in the interval (`r bwt_smoke_ci$lower_ci` g, `r bwt_smoke_ci$upper_ci` g). We obtained this by subtracting nonsmokers minus smokers.

Commentary: Again, remember to indicate the direction of the difference by indicating the order of subtraction.

### If running a two-sided test, explain how the confidence interval reinforces the conclusion of the hypothesis test.

Since zero is not contained in the confidence interval, zero is not a plausible value for the true difference in birth weights between the two groups of mothers.

### When comparing two groups, comment on the effect size and the practical significance of the result.

In order to know if smoking is a risk factor for low birth weight, we would need to know what a difference of 80 g or 490 grams means for babies. Although most of us presumably don't have any special training in obstetrics, we could do a quick internet search to see that even half a kilogram is not a large amount of weight difference between two babies. Having said that, though, any difference in birth weight that might be attributable to smoking could be a concern to doctors. In any event, our data is observational, so we cannot make causal claims here.


## Your turn

Continue to use the `birthwt` data set. This time, see if a history of hypertension is associated with a difference in the mean birth weight of babies. In the "Prepare the data for analysis" section, you will need to create a new tibble---call it `birthwt3`---in which you convert the `ht` variable to a factor variable.

The rubric outline is reproduced below. You may refer to the worked example above and modify it accordingly. Remember to strip out all the commentary. That is just exposition for your benefit in understanding the steps, but is not meant to form part of the formal inference process.

Another word of warning: the copy/paste process is not a substitute for your brain. You will often need to modify more than just the names of the data frames and variables to adapt the worked examples to your own work. Do not blindly copy and paste code without understanding what it does. And you should **never** copy and paste text. All the sentences and paragraphs you write are expressions of your own analysis. They must reflect your own understanding of the inferential process.

**Also, so that your answers here don't mess up the code chunks above, use new variable names everywhere.**


##### Exploratory data analysis {-}

###### Use data documentation (help files, code books, Google, etc.) to determine as much as possible about the data provenance and structure. {-}

::: {.answer}

Please write up your answer here

```{r}
# Add code here to print the data
```

```{r}
# Add code here to glimpse the variables
```

:::

###### Prepare the data for analysis. [Not always necessary.] {-}

::: {.answer}

```{r}
# Add code here to prepare the data for analysis.
```

:::

###### Make tables or plots to explore the data visually. {-}

::: {.answer}

```{r}
# Add code here to make tables or plots.
```

:::


##### Hypotheses {-}

###### Identify the sample (or samples) and a reasonable population (or populations) of interest. {-}

::: {.answer}

Please write up your answer here.

:::

###### Express the null and alternative hypotheses as contextually meaningful full sentences. {-}

::: {.answer}

$H_{0}:$ Null hypothesis goes here.

$H_{A}:$ Alternative hypothesis goes here.

:::

###### Express the null and alternative hypotheses in symbols (when possible). {-}

::: {.answer}

$H_{0}: math$

$H_{A}: math$

:::


##### Model {-}

###### Identify the sampling distribution model. {-}

::: {.answer}

Please write up your answer here.

:::

###### Check the relevant conditions to ensure that model assumptions are met. {-}

::: {.answer}

Please write up your answer here. (Some conditions may require R code as well.)

:::


##### Mechanics {-}

###### Compute the test statistic. {-}

::: {.answer}

```{r}
# Add code here to compute the test statistic.
```

:::

###### Report the test statistic in context (when possible). {-}

::: {.answer}

Please write up your answer here.

:::

###### Plot the null distribution. {-}

::: {.answer}

```{r}
# IF CONDUCTING A SIMULATION...
set.seed(1)
# Add code here to simulate the null distribution.
```

```{r}
# Add code here to plot the null distribution.
```

:::

###### Calculate the P-value. {-}

::: {.answer}

```{r}
# Add code here to calculate the P-value.
```

:::

###### Interpret the P-value as a probability given the null. {-}

::: {.answer}

Please write up your answer here.

:::


##### Conclusion {-}

###### State the statistical conclusion. {-}

::: {.answer}

Please write up your answer here.

:::

###### State (but do not overstate) a contextually meaningful conclusion. {-}

::: {.answer}

Please write up your answer here.

:::

###### Express reservations or uncertainty about the generalizability of the conclusion. {-}

::: {.answer}

Please write up your answer here.

:::

###### Identify the possibility of either a Type I or Type II error and state what making such an error means in the context of the hypotheses. {-}

::: {.answer}

Please write up your answer here.

:::

##### Confidence interval {-}

###### Check the relevant conditions to ensure that model assumptions are met. {-}

::: {.answer}

Please write up your answer here. (Some conditions may require R code as well.)

:::

###### Calculate and graph the confidence interval. {-}

::: {.answer}

```{r}
# Add code here to calculate the confidence interval.
```

```{r}
# Add code here to graph the confidence interval.
```

:::

###### State (but do not overstate) a contextually meaningful interpretation. {-}

::: {.answer}

Please write up your answer here.

:::

###### If running a two-sided test, explain how the confidence interval reinforces the conclusion of the hypothesis test. [Not always applicable.] {-}

::: {.answer}

Please write up your answer here.

:::

###### When comparing two groups, comment on the effect size and the practical significance of the result. [Not always applicable.] {-}

::: {.answer}

Please write up your answer here.

:::


## Conclusion

A numerical variable can be split into two groups using a categorical variable. As long as the groups are independent of each other, we can use inference to determine if there is a statistically significant difference between the mean values of the response variable for each group. Such a test can be run by simulation (using a permutation test) or by meeting the conditions for and assuming a t distribution (with a complicated formula for the degrees of freedom).

### Preparing and submitting your assignment

1. From the "Run" menu, select "Restart R and Run All Chunks".
2. Deal with any code errors that crop up. Repeat steps 1–-2 until there are no more code errors.
3. Spell check your document by clicking the icon with "ABC" and a check mark.
4. Hit the "Preview" button one last time to generate the final draft of the `.nb.html` file.
5. Proofread the HTML file carefully. If there are errors, go back and fix them, then repeat steps 1--5 again.

If you have completed this chapter as part of a statistics course, follow the directions you receive from your professor to submit your assignment.