class: center, middle, inverse, title-slide

# The Central Limit Theorem
## (CLT)

### Prof. Maria Tackett

---

layout: true

<div class="my-footer">
<span>
<a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---

class: middle, center

## [Click for PDF of slides](13-clt.pdf)

---

class: center, middle

## Sample Statistics and Sampling Distributions

---

## Variability of sample statistics

- We've seen that each sample from the population yields a slightly different sample statistic (sample mean, sample proportion, etc.)

- Previously, we quantified this variability via simulation.

- Today, we discuss some of the theory underlying .vocab[sampling distributions], particularly as they relate to sample means.

---

## Statistical inference

- Statistical inference is the act of generalizing from a sample in order to make conclusions about a population.

- We are interested in population parameters, which we do not observe. Instead, we must calculate statistics from our sample in order to learn about them.

- As part of this process, we must quantify the degree of uncertainty in our sample statistic.

---

## Sampling distribution of the mean

Suppose we're interested in the mean resting heart rate of students at Duke, and we are able to do the following:

--

1. Take a random sample of size `\(n\)` from this population, and calculate the mean resting heart rate in this sample, `\(\bar{X}_1\)`.

--

2. Put the sample back, take a second random sample of size `\(n\)`, and calculate the mean resting heart rate from this new sample, `\(\bar{X}_2\)`.

--

3. Put the sample back, take a third random sample of size `\(n\)`, and calculate the mean resting heart rate from this sample, too...

--

...and so on.

---

## Sampling distribution of the mean

After repeating this many times, we have a dataset that has the sample averages from the population:

`\(\bar{X}_1\)`, `\(\bar{X}_2\)`, `\(\cdots\)`, `\(\bar{X}_K\)` (assuming we took `\(K\)` total samples).

--

.question[
Can we say anything about the distribution of these sample means (that is, the .vocab[sampling distribution] of the mean)?
]

*(Keep in mind, we don't know what the underlying distribution of resting heart rate looks like in Duke students!)*

---

class: center, middle

## The Central Limit Theorem

---

class: middle

A quick caveat...

For now, let's assume we know the underlying standard deviation, `\(\sigma\)`, of our distribution.

---

## The Central Limit Theorem

For a population with a well-defined mean `\(\mu\)` and standard deviation `\(\sigma\)`, these three properties hold for the distribution of the sample average `\(\bar{X}\)`, assuming certain conditions hold:

--

1. The mean of the sampling distribution of the mean is identical to the population mean `\(\mu\)`.

--

2. The standard deviation of the distribution of the sample averages is `\(\sigma/\sqrt{n}\)`.
    - This is called the .vocab[standard error] (SE) of the mean (see the short numeric sketch on the next slide).

--

3. For `\(n\)` large enough, the sampling distribution of the mean is approximately .vocab[normally distributed].
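---

## The standard error, numerically

To make property 2 concrete, here is a minimal sketch of how the SE shrinks as the sample size grows (the value of `\(\sigma\)` below is hypothetical):

```r
sigma <- 15   # hypothetical population standard deviation

# SE of the mean is sigma / sqrt(n):
# quadrupling n cuts the SE in half
sigma / sqrt(c(25, 100, 400))
```

```
## [1] 3.00 1.50 0.75
```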
---

## The normal (Gaussian) distribution

The normal distribution is unimodal and symmetric, and is described by its .vocab[density function].

If a random variable `\(X\)` follows the normal distribution, then

`$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{ -\frac{1}{2}\frac{(x - \mu)^2}{\sigma^2} \right\}$$`

where `\(\mu\)` is the mean and `\(\sigma^2\)` is the variance `\((\sigma \text{ is the standard deviation})\)`.

.alert[
We often write `\(N(\mu, \sigma)\)` to describe this distribution.
]

---

## The normal distribution (graphically)

<img src="13-clt_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" />

---

## Wait, *any* distribution?

The central limit theorem tells us that *<b>sample averages</b>* are normally distributed, if we have enough data and certain assumptions hold.

This is true *even if our original variables are not normally distributed*.

Click [here](http://onlinestatbook.com/stat_sim/sampling_dist/index.html) to see an interactive demonstration of this idea.

---

## Conditions for CLT

We need to check two conditions for the CLT to hold: independence and sample size/distribution.

--

✅ .vocab[Independence:] The sampled observations must be independent. This is difficult to check, but the following are useful guidelines:

- the sample must be taken randomly
- if sampling without replacement, the sample size must be less than 10% of the population size

--

If observations are independent, then by definition one observation's value does not "influence" another observation's value.

---

## Conditions for CLT

✅ .vocab[Sample size / distribution:]

- if the data are numerical, usually `\(n > 30\)` is considered a large enough sample for the CLT to kick in
- if we know for sure that the underlying data are normally distributed, then the distribution of sample averages will also be exactly normal, regardless of the sample size
- if the data are categorical, we need at least 10 successes and 10 failures

---

class: middle, center

## Let's run our own simulation

---

### Underlying population (not observed in real life!)

.small[

```r
library(tidyverse)

rs_pop <- tibble(x = rbeta(100000, 1, 5) * 100)
```
]

<img src="13-clt_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

**The true population parameters**

.small[

```
## # A tibble: 1 x 2
##      mu sigma
##   <dbl> <dbl>
## 1  16.6  14.1
```
]

---

## Sampling from the population - 1

```r
set.seed(1)

samp_1 <- rs_pop %>%
  sample_n(size = 50) %>%
  summarise(x_bar = mean(x))
```

```r
samp_1
```

```
## # A tibble: 1 x 1
##   x_bar
##   <dbl>
## 1  15.9
```

---

## Sampling from the population - 2

```r
set.seed(2)

samp_2 <- rs_pop %>%
  sample_n(size = 50) %>%
  summarise(x_bar = mean(x))
```

```r
samp_2
```

```
## # A tibble: 1 x 1
##   x_bar
##   <dbl>
## 1  17.1
```

---

## Sampling from the population - 3

```r
set.seed(3)

samp_3 <- rs_pop %>%
  sample_n(size = 50) %>%
  summarise(x_bar = mean(x))
```

```r
samp_3
```

```
## # A tibble: 1 x 1
##   x_bar
##   <dbl>
## 1  19.2
```

--

keep repeating...

---

## Sampling distribution

.small[

```r
library(infer)  # provides rep_sample_n()

set.seed(092620)

sampling <- rs_pop %>%
  rep_sample_n(size = 50, replace = TRUE, reps = 5000) %>%
  group_by(replicate) %>%
  summarise(xbar = mean(x))
```
]

<img src="13-clt_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

```r
sampling %>%
  summarise(mean = mean(xbar), se = sd(xbar))
```

```
## # A tibble: 1 x 2
##    mean    se
##   <dbl> <dbl>
## 1  16.6  1.99
```

---

.question[
How do the shapes, centers, and spreads of these distributions compare?
]

<img src="13-clt_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />
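---

## Does the theory match?

As a sanity check, we can compare the simulated standard error to the CLT's prediction. A minimal sketch, using the population values from the simulation above (`\(\sigma \approx 14.1\)`, `\(n = 50\)`):

```r
# CLT prediction for the SE of the mean: sigma / sqrt(n)
14.1 / sqrt(50)
```

```
## [1] 1.994041
```

This closely matches the simulated SE of 1.99.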
---

## Recap

- If certain assumptions are satisfied, regardless of the shape of the population distribution, the sampling distribution of the mean follows an approximately normal distribution.

--

- The center of the sampling distribution is at the center of the population distribution.

--

- The sampling distribution is less variable than the population distribution (and we can quantify by how much).

--

.question[
What is the standard error, and how are the standard error and sample size related? What does that say about how the spread of the sampling distribution changes as `\(n\)` increases?
]

---

class: center, middle

## Finding probabilities in R

---

## Probabilities under the N(0,1) curve

```r
# P(Z < -1.5)
pnorm(-1.5)
```

```
## [1] 0.0668072
```

<img src="13-clt_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />

---

## Probability between two values

.question[
If `\(Z \sim N(0, 1)\)`, what is `\(P(-1 < Z < 2)\)`?
]

--

<img src="13-clt_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />

---

## Probability between two values

.question[
If `\(Z \sim N(0, 1)\)`, what is `\(P(-1 < Z < 2)\)`?
]

<img src="13-clt_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />

---

## Probability between two values

.question[
If `\(Z \sim N(0, 1)\)`, what is `\(P(-1 < Z < 2)\)`?
]

<img src="13-clt_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />

---

## Probability between two values

.question[
If `\(Z \sim N(0, 1)\)`, what is `\(P(-1 < Z < 2)\)`?
]

<img src="13-clt_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />

```r
pnorm(2) - pnorm(-1)
```

```
## [1] 0.8185946
```

---

## Probability between two values

.question[
If `\(Z \sim N(0, 1)\)`, what is `\(P(-1 < Z < 2)\)`?
]

```r
pnorm(2) - pnorm(-1)
```

```
## [1] 0.8185946
```

---

## Finding cutoff values under the N(0,1) curve

```r
# find Q1
qnorm(0.25)
```

```
## [1] -0.6744898
```

<img src="13-clt_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" />

---

## Looking ahead...

We will use the Central Limit Theorem and the normal distribution to conduct statistical inference.
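---

## A first taste

To preview how the pieces fit together, here is a minimal sketch (the numbers are the population values from the simulation earlier in these slides). By the CLT, the mean of `\(n = 50\)` draws from that population is approximately `\(N(16.6, 14.1/\sqrt{50})\)`, so `pnorm()` can answer questions about `\(\bar{X}\)` directly:

```r
# P(x_bar > 20) under the CLT's normal approximation;
# returns roughly 0.044
pnorm(20, mean = 16.6, sd = 14.1 / sqrt(50), lower.tail = FALSE)
```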