Two-sample inference

Prof. Maria Tackett

Recap

So far, we've talked about performing interval estimation and hypothesis testing for means using

  • simulation-based methods, such as bootstrap or direct simulation, and
  • the Central Limit Theorem

In all cases so far, we've only compared one sample against a hypothesized value.

But what if we wanted to compare two samples against each other?

Two-sample inference for means

Suppose we have two (representative) samples and want to either

  • estimate the difference in means in the two populations

    • confidence interval for $\mu_1 - \mu_2$
  • test the hypotheses

$$H_0: \mu_1 = \mu_2 \qquad H_a: \mu_1 \neq \mu_2,$$

where $\mu_1$ and $\mu_2$ are the population means in groups 1 and 2.

How might you calculate a confidence interval and address the above hypothesis test using simulation-based methods? How about the CLT?

Today's data

Adapted from Erdogdu Sakar, B., et al. Collection and Analysis of a Parkinson Speech Dataset with Multiple Types of Sound Recordings, IEEE Journal of Biomedical and Health Informatics, vol. 17(4), pp. 828-834, 2013 (image from Wikipedia)

Some voice analysis terminology

  • Jitter: frequency variation from cycle to cycle
  • Shimmer: amplitude variation of the sound wave

Jitter and shimmer are affected by lack of control of vocal cord vibration, and pathological differences from average values may be indicative of Parkinson's Disease (PD).

(from Teixeira, Oliveira, and Lopes, 2013)

Question of interest

Is there a difference in average voice jitter between patients with Parkinson's disease (PD) and those who don't have Parkinson's disease (control group)?

parkinsons.csv contains repeated voice recordings from a number of patients, some with PD and some serving as non-PD controls (Erdogdu Sakar et al., 2013). For now, assume that all samples were taken independently of each other (this is not actually the case, since recordings are repeated within patients, but we'll make this assumption).

Jitter is given in milliseconds (ms), and shimmer is given in decibels (dB).

Bootstrap estimation

Let's construct the bootstrap distribution for the difference in means.

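library(tidyverse)  # read_csv(), %>%, summarize()
library(infer)      # specify(), generate(), calculate()

set.seed(2020)  # make the resampling reproducible
parkinsons <- read_csv("data/parkinsons.csv")

# 1,000 bootstrap resamples; each records mean(Healthy) - mean(PD)
boot_diffs <- parkinsons %>%
  specify(jitter ~ status) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in means",
            order = c("Healthy", "PD"))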
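To make the resampling step concrete, here is a minimal base-R sketch of roughly what the pipeline above does. It assumes rows are resampled with replacement from the whole data frame. The data frame `toy` and the helpers `diff_in_means()` and `one_boot()` are hypothetical names, and the jitter values are simulated stand-ins, not the real recordings.

```r
set.seed(1)

# Simulated stand-in for parkinsons.csv (hypothetical values, not real data)
toy <- data.frame(
  status = rep(c("Healthy", "PD"), times = c(40, 60)),
  jitter = c(rnorm(40, mean = 0.004, sd = 0.002),
             rnorm(60, mean = 0.007, sd = 0.004))
)

# Difference in group means, Healthy minus PD (same order as in the slides)
diff_in_means <- function(d) {
  mean(d$jitter[d$status == "Healthy"]) - mean(d$jitter[d$status == "PD"])
}

# One bootstrap replicate: resample rows with replacement, recompute the statistic
one_boot <- function(d) {
  idx <- sample(nrow(d), size = nrow(d), replace = TRUE)
  diff_in_means(d[idx, ])
}

boot_stats <- replicate(1000, one_boot(toy))

# 95% percentile interval: middle 95% of the bootstrap statistics
boot_ci <- quantile(boot_stats, probs = c(0.025, 0.975))
boot_ci
```

The percentile interval is read directly off the bootstrap statistics, so no distributional formula is needed.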

CI for difference in means

Let's use the bootstrap distribution to compute a 95% confidence interval for the difference in means.

boot_diffs %>%
  summarize(lower = quantile(stat, 0.025),
            upper = quantile(stat, 0.975))
## # A tibble: 1 x 2
##      lower    upper
##      <dbl>    <dbl>
## 1 -0.00413 -0.00220

Interpretation: We are 95% confident that the mean voice jitter for people without Parkinson's disease is about 0.002 to 0.004 ms less than the mean voice jitter for those with Parkinson's disease.

Is there evidence that there is a difference in mean voice jitter between PD patients and healthy patients?

Hypothesis testing

Let $\mu_P$ be the mean voice jitter among PD patients, and $\mu_H$ be the mean voice jitter among healthy patients. Let's test

$$H_0: \mu_P = \mu_H \qquad H_a: \mu_P \neq \mu_H$$

If the two means are truly equal (i.e., if $H_0$ is true), then the difference, $\mu_H - \mu_P$, should be zero.

Hypothesis testing

Let's construct the simulated null distribution for the difference in means, $\mu_H - \mu_P$. If the two means are truly equal (i.e., if $H_0$ is true), then this difference should be zero, so we simulate under $H_0$ by permuting the group labels.

# Shuffle the status labels 1,000 times; each shuffle records mean(Healthy) - mean(PD)
null_dist <- parkinsons %>%
  specify(jitter ~ status) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means",
            order = c("Healthy", "PD"))

Hypothesis testing

# Observed difference in sample means, Healthy minus PD
obs_diff <- parkinsons %>%
  specify(jitter ~ status) %>%
  calculate(stat = "diff in means", order = c("Healthy", "PD")) %>%
  pull()
obs_diff
## [1] -0.00312321
# Two-sided p-value: share of permuted differences at least as extreme as the observed one
null_dist %>%
  filter(abs(stat) >= abs(obs_diff)) %>%
  summarise(p_val = n() / nrow(null_dist))
## # A tibble: 1 x 1
##   p_val
##   <dbl>
## 1     0
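The permutation logic can also be sketched by hand in base R. This is only an illustration under stated assumptions: the `status` and `jitter` vectors are simulated stand-ins (not the real recordings), and `diff_in_means()` is a hypothetical helper.

```r
set.seed(2)

# Simulated stand-in data (hypothetical values, not the real recordings)
status <- rep(c("Healthy", "PD"), times = c(40, 60))
jitter <- c(rnorm(40, mean = 0.004, sd = 0.002),
            rnorm(60, mean = 0.007, sd = 0.004))

# Difference in group means, Healthy minus PD
diff_in_means <- function(y, g) mean(y[g == "Healthy"]) - mean(y[g == "PD"])

obs <- diff_in_means(jitter, status)

# Under H0 the labels carry no information, so shuffle them and recompute
perm_stats <- replicate(1000, diff_in_means(jitter, sample(status)))

# Two-sided p-value: proportion of shuffled differences at least as extreme as obs
p_val <- mean(abs(perm_stats) >= abs(obs))
p_val
```

With a finite number of permutations, a p-value of 0 only means that no shuffled difference was as extreme as the observed one, so the true p-value is small but not literally zero.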

Conclusion

The p-value is very small (none of the 1,000 permuted differences was as extreme as the observed one, so it is reported as 0), so we reject $H_0$. The data provide sufficient evidence that there is a difference in the mean voice jitter between patients who have Parkinson's disease and those who don't have the disease.

Difference in means using CLT

CLT-based inference for a difference in means relies on the two-sample t-test for independent samples. Like the t-test we've seen before, the test statistic takes on the following form:

$$T = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\widehat{SE}_{\text{diff}}}$$

The test statistic depends on whether we can assume that the two groups have the same underlying variability in their observations.

The exact form of the test statistic under the null hypothesis, including the degrees of freedom, is a complicated expression that no one calculates by hand. Let's let R handle this!
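For reference, here is what those expressions look like if we do not assume equal variances (the Welch form, which is the default in R's t.test()). The symbols $s_1^2, s_2^2$ (sample variances), $n_1, n_2$ (sample sizes), and $\nu$ (approximate degrees of freedom) are notation introduced here, not taken from the slides.

```latex
\widehat{SE}_{\text{diff}} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}},
\qquad
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^{2}}
                 {\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}
```

Under $H_0$ (so $\mu_1 - \mu_2 = 0$), $T$ is then compared to a t distribution with approximately $\nu$ degrees of freedom.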

CLT: Difference in means

parkinsons %>%
  t_test(jitter ~ status,
         mu = 0,
         order = c("Healthy", "PD"),
         alternative = "two-sided",
         conf_int = TRUE, conf_level = 0.95)
## # A tibble: 1 x 6
##   statistic  t_df      p_value alternative lower_ci upper_ci
##       <dbl> <dbl>        <dbl> <chr>          <dbl>    <dbl>
## 1     -5.96  187. 0.0000000124 two.sided   -0.00416 -0.00209
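To see where the statistic, degrees of freedom, and p-value in such output come from, the sketch below computes them by hand and checks them against base R's t.test(). It assumes the unequal-variance (Welch) version of the test, and the `healthy` and `pd` vectors are simulated stand-ins rather than the real jitter measurements.

```r
set.seed(3)

# Simulated stand-in samples (hypothetical values, not the real recordings)
healthy <- rnorm(40, mean = 0.004, sd = 0.002)
pd      <- rnorm(60, mean = 0.007, sd = 0.004)

# Per-group variance of the sample mean: s^2 / n
v1 <- var(healthy) / length(healthy)
v2 <- var(pd) / length(pd)

# Standard error of the difference and the test statistic under H0 (difference = 0)
se_diff <- sqrt(v1 + v2)
t_stat  <- (mean(healthy) - mean(pd) - 0) / se_diff

# Welch-Satterthwaite approximate degrees of freedom
df <- (v1 + v2)^2 / (v1^2 / (length(healthy) - 1) + v2^2 / (length(pd) - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df)

# Same test via base R, for comparison
fit <- t.test(healthy, pd, mu = 0, alternative = "two.sided", var.equal = FALSE)
c(by_hand = t_stat, t_test = unname(fit$statistic))
```

The degrees of freedom are generally not a whole number in this version of the test, which is why the output reports a value such as `187.` rather than a simple function of the sample sizes.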

Comprehensively evaluate the research question by specifying the hypotheses, the test statistic and its distribution under the null, the p-value, and the decision at the $\alpha = 0.05$ significance level. Interpret the conclusions from your hypothesis test in the context of the original research question.
