Comparing three or more groups

Prof. Maria Tackett

1

An old example...

Coffee               Died    Did not die
Non-drinker          1039           5438
Occasional drinker   4440          29712
Regular drinker      3601          24934

We have more than two samples: non-coffee drinkers, occasional drinkers, and regular drinkers.

Is there an association between coffee drinking status and whether somebody died? Are the two independent?
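
As a first look at the question, the sketch below enters the counts as a base R matrix and computes the share of each coffee group that died. The object names `observed` and `row_props` are illustrative choices, not part of the original slides.

```r
# Observed counts (rows: coffee group, columns: health outcome)
observed <- matrix(
  c(1039,  5438,
    4440, 29712,
    3601, 24934),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("Non-drinker", "Occasional drinker", "Regular drinker"),
    c("Died", "Did not die")
  )
)

# Proportion of each coffee group in each outcome (every row sums to 1)
row_props <- prop.table(observed, margin = 1)
round(row_props, 3)
```

If coffee drinking and health outcome were independent, the "Died" proportions in the three rows should be roughly equal.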

3

A new hypothesis test...

  • H0: Coffee-drinking category and health outcome are independent; there is no association between the two variables
  • Ha: Coffee-drinking category and health outcome are NOT independent; there is an association between the two variables
4

Review

If H0 were true, then we would expect:

  • P(Non-drinker) × P(Died) = P(Non-drinker AND Died)
  • P(Occasional drinker) × P(Died) = P(Occasional drinker AND Died)
  • P(Regular drinker) × P(Died) = P(Regular drinker AND Died)
  • P(Non-drinker) × P(Lived) = P(Non-drinker AND Lived)
  • P(Occasional drinker) × P(Lived) = P(Occasional drinker AND Lived)
  • P(Regular drinker) × P(Lived) = P(Regular drinker AND Lived)
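
The six products above can be computed in one step. This is a minimal sketch in base R; the object names are illustrative, and the "Did not die" column plays the role of "Lived".

```r
# Observed counts (rows: coffee group, columns: health outcome)
observed <- matrix(
  c(1039, 5438, 4440, 29712, 3601, 24934),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("Non-drinker", "Occasional drinker", "Regular drinker"),
    c("Died", "Did not die")
  )
)

n <- sum(observed)                   # total number of participants
p_coffee  <- rowSums(observed) / n   # P(Non-drinker), P(Occasional), P(Regular)
p_outcome <- colSums(observed) / n   # P(Died), P(Did not die)

# Under H0, each joint probability is the product of its two marginals
p_joint_h0 <- outer(p_coffee, p_outcome)
round(p_joint_h0, 4)
```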
5

Observed vs. expected counts

Let's investigate non-coffee drinking and dying:

  • P(Non-Drinker) = 6477/69164 ≈ 0.09365
  • P(Died) = 9080/69164 ≈ 0.131

If these were independent, we would expect P(Non-Drinker AND Died) to be 6477/69164 × 9080/69164 ≈ 0.012. Multiplying this (unrounded) probability by the 69164 study participants, we expect approximately 850 of them to be non-drinkers who died.
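
Written out, the arithmetic behind the expected count looks like this. The symbols $E$ and $n$ are notation introduced here for illustration; the last step is the usual "row total times column total over grand total" shortcut.

```latex
E_{\text{Non-drinker, Died}}
  = n \times P(\text{Non-drinker}) \times P(\text{Died})
  = 69164 \times \frac{6477}{69164} \times \frac{9080}{69164}
  = \frac{6477 \times 9080}{69164}
  \approx 850.3
```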

6

Observed vs. expected counts

The observed number is 1039, for a difference of 189 participants between the observed and expected counts.

Is this strong evidence against the claim of independence?

7

Observed vs. expected counts

Well, that was just one cell! There are five more cells in which there may be differences between observed and expected counts.

How can we sum up these differences in a principled way, and use the result to conduct statistical inference?
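
Repeating the calculation for every cell is straightforward in R. The sketch below (base R, with illustrative object names) computes all six expected counts and the observed-minus-expected differences. It also shows why simply adding the raw differences is not an option: they cancel out.

```r
# Observed counts (rows: coffee group, columns: health outcome)
observed <- matrix(
  c(1039, 5438, 4440, 29712, 3601, 24934),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("Non-drinker", "Occasional drinker", "Regular drinker"),
    c("Died", "Did not die")
  )
)

# Expected count in each cell under independence:
# (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Observed minus expected, cell by cell
differences <- observed - expected

round(expected, 1)
round(differences, 1)

# The raw differences sum to zero (up to floating-point error)
sum(differences)
```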

8

The chi-square test

The chi-squared test has a very nice motivation: it compares the observed counts with the counts we would expect if H0 were true.

If these total differences are "large enough," then we reject the null hypothesis.

  • To combine differences across table cells, we need to square them before adding them up (so that negative differences aren't canceled out by positive differences)

  • We will also scale each squared difference by the expected count for that cell (a difference of 189 participants isn't large relative to an expected count of 100,000, but is huge relative to an expected count of 300!)
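
The two steps above, squaring and then scaling by the expected count, can be sketched per cell as follows. This uses base R only, and the object names (`observed`, `expected`, `contributions`) are illustrative.

```r
# Observed counts (rows: coffee group, columns: health outcome)
observed <- matrix(
  c(1039, 5438, 4440, 29712, 3601, 24934),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("Non-drinker", "Occasional drinker", "Regular drinker"),
    c("Died", "Did not die")
  )
)

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Squared difference, scaled by the expected count, for each cell
contributions <- (observed - expected)^2 / expected
round(contributions, 2)
```

Cells with a large scaled squared difference are the ones that depart most from what independence would predict.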

9

The chi-square test statistic

The chi-square ($\chi^2$) test statistic is

$$\chi^2 = \sum_{i=1}^{r \times c} \frac{(O_i - E_i)^2}{E_i},$$

where $r \times c$ is the number of cells in the table (rows times columns), $i$ indexes across all cells, $O_i$ is the observed count in cell $i$, and $E_i$ is the expected count in cell $i$.

This statistic is the total squared difference between the observed and expected cell counts, with each squared difference scaled by the expected count for that cell.

Under H0, the distribution of this sum is approximated by a $\chi^2$ distribution with $(r - 1) \times (c - 1)$ degrees of freedom.
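
For the coffee table, the statistic and its degrees of freedom can be computed by hand along these lines. This is a minimal base R sketch with illustrative object names, not the approach used later in the slides.

```r
# Observed counts (rows: coffee group, columns: health outcome)
observed <- matrix(
  c(1039, 5438, 4440, 29712, 3601, 24934),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("Non-drinker", "Occasional drinker", "Regular drinker"),
    c("Died", "Did not die")
  )
)

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square statistic: sum over all r x c cells of (O - E)^2 / E
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (r - 1) x (c - 1)
deg_free <- (nrow(observed) - 1) * (ncol(observed) - 1)

chi_sq
deg_free
```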

10

Chi-squared distributions

Remember, we only reject if the difference is "large enough," so we only examine the right tail: the p-value is the probability of seeing a $\chi^2$ statistic at least as large as ours if H0 were true.
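
A right-tail probability can be obtained with `pchisq()` from base R. The sketch below plugs in the rounded statistic (55.2) and degrees of freedom (2) for the coffee data; the 5% significance level used for the cutoff is an assumption made for illustration.

```r
chi_sq   <- 55.2   # rounded test statistic for the coffee data
deg_free <- 2      # (3 - 1) x (2 - 1)

# Right-tail area: P(chi-square with 2 df >= observed statistic)
p_value <- pchisq(chi_sq, df = deg_free, lower.tail = FALSE)

# Cutoff that leaves 5% in the right tail (assumed significance level)
critical_value <- qchisq(0.95, df = deg_free)

p_value
critical_value
```

Because the statistic is rounded here, the p-value agrees with the one from the unrounded statistic only approximately.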

11

Implementation in R

Luckily, you don't have to calculate all the expected counts by hand, create the test statistic, and manually compare to a chi-square distribution.

library(dplyr)  # provides %>% and slice()

coffee_data %>%
  slice(1:10)
## # A tibble: 10 x 2
##    coffee                health_outcome
##    <chr>                 <chr>
##  1 Does not drink coffee Died
##  2 Does not drink coffee Died
##  3 Does not drink coffee Died
##  4 Does not drink coffee Died
##  5 Does not drink coffee Died
##  6 Does not drink coffee Died
##  7 Does not drink coffee Died
##  8 Does not drink coffee Died
##  9 Does not drink coffee Died
## 10 Does not drink coffee Died
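
The data are stored with one row per participant rather than as a table of counts. The sketch below rebuilds a data frame of that shape from the counts and tabulates it again. The name `coffee_sketch` and the labels for the two coffee-drinking groups are assumptions; only "Does not drink coffee" and "Died" appear in the output above.

```r
# Hypothetical labels for the three groups (only the first appears in the slides)
coffee_levels <- c("Does not drink coffee",
                   "Drinks coffee occasionally",
                   "Drinks coffee regularly")

# One row per participant, rebuilt from the table of counts
coffee_sketch <- data.frame(
  coffee = rep(coffee_levels, times = c(6477, 34152, 28535)),
  health_outcome = c(
    rep(c("Died", "Did not die"), times = c(1039, 5438)),
    rep(c("Died", "Did not die"), times = c(4440, 29712)),
    rep(c("Died", "Did not die"), times = c(3601, 24934))
  )
)

# Cross-tabulate to recover the contingency table
counts <- table(coffee_sketch$coffee, coffee_sketch$health_outcome)
counts
```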
12

Chi-square test using infer

library(infer)  # provides chisq_test()

coffee_data %>%
  chisq_test(formula = health_outcome ~ coffee)
## # A tibble: 1 x 3
##   statistic chisq_df  p_value
##       <dbl>    <int>    <dbl>
## 1      55.2        2 1.05e-12
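
As a cross-check that does not need the participant-level data, base R's `chisq.test()` can be applied directly to the table of counts. This is a sketch of an alternative route, not the method used in the slides, and the object names are illustrative.

```r
# Observed counts (rows: coffee group, columns: health outcome)
observed <- matrix(
  c(1039, 5438, 4440, 29712, 3601, 24934),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("Non-drinker", "Occasional drinker", "Regular drinker"),
    c("Died", "Did not die")
  )
)

# Pearson's chi-square test of independence on the contingency table
result <- chisq.test(observed)

result$statistic   # test statistic
result$parameter   # degrees of freedom
result$p.value     # right-tail p-value
result$expected    # expected counts under H0
```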

Formally assess the hypothesis that coffee drinking and health outcome are independent.

What might we conclude given these data?

13