Coffee | Died | Did not die |
---|---|---|
Non-drinker | 1039 | 5438 |
Occasional drinker | 4440 | 29712 |
Regular drinker | 3601 | 24934 |
We have more than two samples: non-drinkers, occasional drinkers, and regular drinkers.
Is there an association between coffee drinking status and whether somebody died? Are the two independent?
Let H0 be the hypothesis that coffee drinking status and dying are independent. If H0 were true, the count we would expect in each cell is determined by the row and column totals:
Coffee | Died | Did not die | Total |
---|---|---|---|
Non-drinker | 1039 | 5438 | 6477 |
Occasional drinker | 4440 | 29712 | 34152 |
Regular drinker | 3601 | 24934 | 28535 |
Total | 9080 | 60084 | 69164 |
Let's investigate non-coffee drinking and dying:
Of the 69164 participants, 6477 are non-drinkers and 9080 died. If these were independent, we would expect P(Non-drinker AND Died) to be P(Non-drinker) × P(Died) = 6477/69164 × 9080/69164 ≈ 0.0123. So, we expect approximately 0.0123 × 69164 ≈ 850 study participants to be non-drinkers who died.
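The arithmetic for this one cell can be checked with a short R sketch. The object names (`observed`, `p_joint`, `expected_count`) are illustrative choices, not names from the slides; the counts are the ones in the table.

```r
# Observed counts from the table (rows: coffee status, columns: outcome).
observed <- matrix(
  c(1039,  5438,
    4440, 29712,
    3601, 24934),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    coffee  = c("Non-drinker", "Occasional drinker", "Regular drinker"),
    outcome = c("Died", "Did not die")
  )
)

n <- sum(observed)                                   # 69164 participants
p_non_drinker <- sum(observed["Non-drinker", ]) / n  # 6477 / 69164
p_died <- sum(observed[, "Died"]) / n                # 9080 / 69164

# Under independence, the joint probability is the product of the marginals.
p_joint <- p_non_drinker * p_died
expected_count <- p_joint * n

round(p_joint, 4)      # 0.0123
round(expected_count)  # 850
```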
The observed number is 1039, for a difference of 189 participants between the observed and expected counts.
Is this strong evidence against the claim of independence?
Well, that was just one cell! There are five more cells in which there may be differences between observed and expected counts.
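As a rough sketch of what the other cells look like, the same row-total × column-total / grand-total calculation can be applied to all six cells at once. The names `observed` and `expected` are illustrative, and `outer()` is just one convenient way to form every row-by-column product.

```r
# Observed counts from the table (rows: coffee status, columns: outcome).
observed <- matrix(
  c(1039,  5438,
    4440, 29712,
    3601, 24934),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    coffee  = c("Non-drinker", "Occasional drinker", "Regular drinker"),
    outcome = c("Died", "Did not die")
  )
)

# Expected count in each cell under independence:
# (row total * column total) / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

round(expected, 1)             # expected counts for all six cells
round(observed - expected, 1)  # observed minus expected, cell by cell
```

Within each row the two differences cancel, since the expected counts preserve the row totals. This is one reason the raw differences cannot simply be added up.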
How can we sum up these differences in a principled way, and use it to conduct statistical inference?
The chi-squared test has a very nice motivation: it compares the observed counts with the counts we would expect if H0 were true.
If the total difference between them is "large enough," then we reject the null hypothesis.
To combine differences across table cells, we square them before adding them up (so that negative differences aren't canceled out by positive ones).
We also scale each squared difference by the expected count (a difference of 189 participants isn't large when thinking about 100,000 total observations, but is huge when thinking about 300 total observations!).
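To see the effect of the scaling, here is a small sketch with made-up counts. `scaled_sq_diff` is a hypothetical helper, and the expected counts of 100,000 and 300 are chosen only to mirror the example above; they are not from the coffee data.

```r
# Hypothetical helper: squared difference scaled by the expected count.
scaled_sq_diff <- function(observed, expected) {
  (observed - expected)^2 / expected
}

# The same raw difference of 189, against two very different expected counts.
scaled_sq_diff(100189, 100000)  # about 0.36: negligible
scaled_sq_diff(489, 300)        # about 119: very large
```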
The chi-square ($\chi^2$) test statistic is
$$\chi^2 = \sum_{i \in \text{cells}} \frac{(O_i - E_i)^2}{E_i},$$
where the sum runs over all $r \times c$ cells in the table (rows times columns), $i$ indexes the cells, $O_i$ is the observed count in cell $i$, and $E_i$ is the expected count in cell $i$.
This statistic is the total squared difference between the observed and expected cell counts, with each squared difference scaled by the expected count for that cell.
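For the coffee table, the statistic can be computed by hand in a few lines. This is a minimal sketch with illustrative object names (`observed`, `expected`, `contributions`, `chi_sq`), not the code used to build the slides.

```r
# Observed counts from the table (rows: coffee status, columns: outcome).
observed <- matrix(
  c(1039,  5438,
    4440, 29712,
    3601, 24934),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    coffee  = c("Non-drinker", "Occasional drinker", "Regular drinker"),
    outcome = c("Died", "Did not die")
  )
)

# Expected counts under independence.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Per-cell contributions (O - E)^2 / E, then their sum.
contributions <- (observed - expected)^2 / expected
chi_sq <- sum(contributions)

round(contributions, 2)  # the non-drinker / died cell contributes the most
round(chi_sq, 1)         # 55.2
```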
Under H0, the distribution of this sum is approximated by a $\chi^2$ distribution with $(r-1)\times(c-1)$ degrees of freedom.
Remember, we only reject if the difference is "large enough." So, when calculating p-values we only examine the right tail: the p-value is the probability, under H0, of seeing a $\chi^2$ statistic as large as ours or larger.
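A right-tail probability like this can be read off with base R's `pchisq()`. The sketch below plugs in the rounded statistic of 55.2 for the 3 × 2 coffee table, so the result only approximately matches a p-value computed from the unrounded statistic; the variable names are illustrative.

```r
chi_sq <- 55.2                # rounded test statistic for the coffee table
df <- (3 - 1) * (2 - 1)       # (rows - 1) * (columns - 1) = 2

# Right-tail probability: P(chi-square with df degrees of freedom >= chi_sq).
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)
p_value                       # on the order of 1e-12
```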
Luckily, you don't have to calculate all the expected counts by hand, create the test statistic, and manually compare to a chi-square distribution.
coffee_data %>% slice(1:10)
## # A tibble: 10 x 2
##    coffee                health_outcome
##    <chr>                 <chr>
##  1 Does not drink coffee Died
##  2 Does not drink coffee Died
##  3 Does not drink coffee Died
##  4 Does not drink coffee Died
##  5 Does not drink coffee Died
##  6 Does not drink coffee Died
##  7 Does not drink coffee Died
##  8 Does not drink coffee Died
##  9 Does not drink coffee Died
## 10 Does not drink coffee Died
coffee_data %>% chisq_test(formula = health_outcome ~ coffee)
## # A tibble: 1 x 3
##   statistic chisq_df  p_value
##       <dbl>    <int>    <dbl>
## 1      55.2        2 1.05e-12
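If only the summary table of counts is available rather than one row per participant, base R's `chisq.test()` should give the same answer when passed the counts as a matrix. This is a cross-check sketch, not the approach used in the slides; `observed` and `result` are illustrative names.

```r
# Observed counts from the table (rows: coffee status, columns: outcome).
observed <- matrix(
  c(1039,  5438,
    4440, 29712,
    3601, 24934),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    coffee  = c("Non-drinker", "Occasional drinker", "Regular drinker"),
    outcome = c("Died", "Did not die")
  )
)

# Pearson's chi-squared test of independence on the 3 x 2 table.
result <- chisq.test(observed)

result$statistic  # X-squared, about 55.2
result$parameter  # degrees of freedom: 2
result$p.value    # very small, on the order of 1e-12
result$expected   # expected counts under independence
```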
Formally assess the hypothesis that coffee drinking and health outcome are independent.
What might we conclude given these data?