class: center, middle, inverse, title-slide

# Inference with the CLT
### Prof. Maria Tackett

---

layout: true

<div class="my-footer">
<span>
<a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---

class: middle center

## [Click for PDF of slides](14-clt-inference.pdf)

---

## The Central Limit Theorem

For a population with a well-defined mean `\(\mu\)` and standard deviation `\(\sigma\)`, these three properties hold for the distribution of the sample average `\(\bar{X}\)`, assuming certain conditions hold:

✅ The distribution of the sample statistic is nearly normal

✅ The distribution is centered at the (often unknown) population parameter

✅ The variability of the distribution is inversely proportional to the square root of the sample size

---

## Why do we care?

Knowing the distribution of the sample statistic `\(\bar{X}\)` can help us

--

- estimate a population parameter as point **estimate** `\(\boldsymbol{\pm}\)` **margin of error**
    - the .vocab[margin of error] consists of a measure of how confident we want to be and a measure of how variable the sample statistic is

--

<br>

- test a claim about a population parameter by evaluating how likely it is to obtain the observed sample statistic, assuming the null hypothesis is true
    - this probability will depend on how variable the sampling distribution is

---

class: center, middle

## Inference based on the CLT

---

## Inference based on the CLT

If the necessary conditions are met, we can also use inference methods based on the CLT. Suppose we know the true population standard deviation.

--

Then the CLT tells us that `\(\bar{X}\)` approximately has the distribution `\(N\left(\mu, \sigma/\sqrt{n}\right)\)`. That is,

`$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$`

---
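## Inference based on the CLT

As a quick sanity check, we could simulate this standardization ourselves: draw many samples from a normal population with known mean and standard deviation, standardize each sample mean, and compare the results to `\(N(0, 1)\)`. The population values, sample size, and number of repetitions below are arbitrary choices for illustration.

```r
set.seed(1234)
mu <- 50; sigma <- 10; n <- 40

# standardized sample mean from each of 5000 simulated samples
z_sim <- replicate(5000, {
  x <- rnorm(n, mean = mu, sd = sigma)
  (mean(x) - mu) / (sigma / sqrt(n))
})

# this proportion should be close to pnorm(-1.5)
mean(z_sim < -1.5)
```

---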
## Probabilities under N(0,1) curve

```r
# P(Z < -1.5)
pnorm(-1.5)
```

```
## [1] 0.0668072
```

<img src="14-clt-inference_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

## Probability between two values

.question[
If `\(Z \sim N(0, 1)\)`, what is `\(P(-1 < Z < 2)\)`?
]

--

<img src="14-clt-inference_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---

## Probability between two values

.question[
If `\(Z \sim N(0, 1)\)`, what is `\(P(-1 < Z < 2)\)`?
]

<img src="14-clt-inference_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---

## Probability between two values

.question[
If `\(Z \sim N(0, 1)\)`, what is `\(P(-1 < Z < 2)\)`?
]

<img src="14-clt-inference_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

## Probability between two values

.question[
If `\(Z \sim N(0, 1)\)`, what is `\(P(-1 < Z < 2)\)`?
]

<img src="14-clt-inference_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

```r
pnorm(2) - pnorm(-1)
```

```
## [1] 0.8185946
```

---

## Probability between two values

.question[
If `\(Z \sim N(0, 1)\)`, what is `\(P(-1 < Z < 2)\)`?
]

```r
pnorm(2) - pnorm(-1)
```

```
## [1] 0.8185946
```

---

## Finding cutoff values under N(0,1) curve

```r
# find the median, Q2
qnorm(0.5)
```

```
## [1] 0
```

<img src="14-clt-inference_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

---

## What if `\(\sigma\)` isn't known?

<img src="img/14/guinness.jpg" width="80%" style="display: block; margin: auto;" />

---

## T distribution

- In practice, we never know the true value of `\(\sigma\)`, so we estimate it from our data with `\(s\)`.

- We can construct the following test statistic for testing a single sample's population mean, which has a .vocab[t-distribution with n-1 degrees of freedom]:

.question[
`$$T = \frac{\bar{X} - \mu}{s/\sqrt{n}} \sim t_{n-1}$$`
]

---

## T distribution

The t-distribution is also unimodal and symmetric, and is centered at 0

--

It has thicker tails than the normal distribution

- This is to make up for the additional variability introduced by using `\(s\)` instead of `\(\sigma\)` in the calculation of the SE

---

## T vs Z distributions

<img src="14-clt-inference_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

---
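## T vs Z distributions

The thicker tails also show up numerically: for the same cutoff, the t distribution puts more probability in the tail than the standard normal, and the difference shrinks as the degrees of freedom grow. The cutoff and degrees of freedom below are arbitrary values chosen for illustration.

```r
pnorm(-2)         # tail area under N(0, 1)
pt(-2, df = 9)    # larger tail area under t with 9 degrees of freedom
pt(-2, df = 100)  # with more degrees of freedom, the t curve is closer to N(0, 1)
```

---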
## T distribution

.pull-left[
.vocab[Finding probabilities under the t curve:]

```r
# P(t < -1.96)
pt(-1.96, df = 9)
```

```
## [1] 0.0408222
```

```r
# P(t > -1.96)
pt(-1.96, df = 9, lower.tail = FALSE)
```

```
## [1] 0.9591778
```
]

--

.pull-right[
.vocab[Finding cutoff values under the t curve:]

```r
# Find Q1
qt(0.25, df = 9)
```

```
## [1] -0.7027221
```

```r
# Q3
qt(0.75, df = 9)
```

```
## [1] 0.7027221
```
]

---

## Resident satisfaction in Durham

`durham_survey` contains resident responses to a survey given by the City of Durham in 2018. The respondents are a randomly selected, representative sample of Durham residents.

Questions were rated on a scale of 1 to 5, with 1 being "highly dissatisfied" and 5 being "highly satisfied."

--

.question[
Is there evidence that, on average, Durham residents are generally satisfied (score greater than 3) with the quality of the public library system?
]

---

## Exploratory Data Analysis

```r
durham <- read_csv("data/durham_survey.csv") %>%
  filter(quality_library != 9)
```

```r
durham %>%
  summarise(x_bar = mean(quality_library),
            med = median(quality_library),
            sd = sd(quality_library),
            n = n())
```

```
## # A tibble: 1 x 4
##   x_bar   med    sd     n
##   <dbl> <dbl> <dbl> <int>
## 1  3.97     4 0.900   521
```

---

## Exploratory Data Analysis

<img src="14-clt-inference_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />

---

## Hypotheses

.question[
What are the hypotheses for evaluating if Durham residents, on average, are generally satisfied with the public library system?
]

--

`$$H_0: \mu = 3$$`
`$$H_a: \mu > 3$$`

---

## Conditions

.question[
What conditions must be satisfied to conduct this hypothesis test using methods based on the CLT? Are these conditions satisfied?
]

--

**Independence?**

--

✅ The residents were randomly selected for the survey, and 521 is less than 10% of the Durham population (~ 270,000).

--

**Sample size / distribution?**

--

✅ 521 > 30, so the sample is large enough to apply the Central Limit Theorem.

---

## Calculating the test statistic

Summary statistics from the sample:

```
## # A tibble: 1 x 3
##    xbar     s     n
##   <dbl> <dbl> <int>
## 1  3.97 0.900   521
```

--

And the CLT says:

`$$\bar{X} \sim N\left(mean = \mu, SE = \frac{\sigma}{\sqrt{n}}\right)$$`

--

.question[
How many standard errors away from the population mean is the observed sample mean?
]

.question[
How likely are we to observe a sample mean that is at least as extreme as the observed sample mean, if in fact the null hypothesis is true?
]

---

## Calculations

--

```r
(se <- durham_summary$s / sqrt(durham_summary$n)) # SE
```

```
## [1] 0.03944416
```

--

```r
(t <- (durham_summary$xbar - 3) / se) # Test statistic
```

```
## [1] 24.57372
```

--

```r
(df <- durham_summary$n - 1) # Degrees of freedom
```

```
## [1] 520
```

--

```r
pt(t, df, lower.tail = FALSE) # P-value, P(T > t | H_0 true)
```

```
## [1] 2.247911e-89
```

---

## Conclusion

The p-value is very small, so we reject `\(H_0\)`.

--

The data provide sufficient evidence at the `\(\alpha = 0.05\)` level that Durham residents, on average, are satisfied with the quality of the public library system.

--

.question[
Would you expect a 95% confidence interval to include 3?
]

---

## Confidence interval for a mean

.alert[
**General form of the confidence interval**
`$$point~estimate \pm critical~value \times SE$$`
]

--

.alert[
**Confidence interval for the mean**
`$$\bar{x} \pm t^*_{n-1} \times \frac{s}{\sqrt{n}}$$`
]

---

## Calculate 95% confidence interval

.alert[
`$$\bar{x} \pm t^*_{n-1} \times \frac{s}{\sqrt{n}}$$`
]

--

```r
# Critical value
t_star <- qt(0.975, df)
```

--

```r
# Point estimate
point_est <- durham_summary$xbar
```

--

```r
# Confidence interval
round(point_est + c(-1, 1) * t_star * se, 2)
```

```
## [1] 3.89 4.05
```

---

## Interpret 95% confidence interval

```r
round(point_est + c(-1, 1) * t_star * se, 2)
```

```
## [1] 3.89 4.05
```

.question[
Interpret this interval in the context of the data.
]

--

**We are 95% confident that the true mean rating for Durham residents' satisfaction with the library system is between 3.89 and 4.05.**

---

class: middle, center

## Inference with the CLT using `infer`

---

## CLT-based hypothesis testing in `infer`

`$$H_0: \mu = 3 \text{ vs } H_a: \mu > 3$$`

--

```r
durham %>%
  t_test(response = quality_library,
         mu = 3,
         alternative = "greater",
         conf_int = FALSE)
```

```
## # A tibble: 1 x 4
##   statistic  t_df  p_value alternative
##       <dbl> <dbl>    <dbl> <chr>
## 1      24.6   520 2.25e-89 greater
```

---

## CLT-based confidence intervals in `infer`

Calculate a 95% confidence interval for the mean satisfaction rating.

--

```r
durham %>%
  t_test(response = quality_library,
         alternative = "two-sided",
         conf_int = TRUE,
         conf_level = 0.95)
```

```
## # A tibble: 1 x 6
##   statistic  t_df p_value alternative lower_ci upper_ci
##       <dbl> <dbl>   <dbl> <chr>          <dbl>    <dbl>
## 1      101.   520       0 two.sided       3.89     4.05
```

---

## Other built-in functionality in R

- There are more built-in functions for running tests like these in R.

- However, a learning goal of this course is not to go through an exhaustive list of all CLT-based tests and how to implement them.

- Instead, the goal is to understand how these methods are (and are not) like the simulation-based methods we learned about earlier.

--

.question[
What is similar, and what is different, between a CLT-based test for a mean and a simulation-based test?
]
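
---

## For comparison: a simulation-based test

To help compare the two approaches, here is a sketch of what the simulation-based version of this test could look like using the `infer` pipeline from earlier in the course. The number of repetitions and the seed below are arbitrary choices for illustration.

```r
set.seed(1234)

# observed sample mean
x_bar <- durham %>%
  specify(response = quality_library) %>%
  calculate(stat = "mean")

# simulate the null distribution of the sample mean under H_0: mu = 3
null_dist <- durham %>%
  specify(response = quality_library) %>%
  hypothesize(null = "point", mu = 3) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

# p-value: proportion of simulated means at least as large as the observed mean
null_dist %>%
  get_p_value(obs_stat = x_bar, direction = "greater")
```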