class: center, middle, inverse, title-slide

# The Central Limit Theorem
## (CLT)

### Prof. Maria Tackett

---

layout: true

<div class="my-footer">
<span>
<a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---

class: middle, center

## [Click for PDF of slides](13-clt.pdf)

---

class: center, middle

## Sample Statistics and Sampling Distributions

---

## Variability of sample statistics

- We've seen that each sample from the population yields a slightly different sample statistic (sample mean, sample proportion, etc.)

- Previously, we quantified this variability via simulation.

- Today, we discuss some of the theory underlying .vocab[sampling distributions], particularly as they relate to sample means.

---

## Statistical inference

- Statistical inference is the act of generalizing from a sample in order to make conclusions about a population.

- We are interested in population parameters, which we do not observe. Instead, we must calculate statistics from our sample in order to learn about them.

- As part of this process, we must quantify the degree of uncertainty in our sample statistic.

---

## Sampling distribution of the mean

Suppose we're interested in the mean resting heart rate of students at Duke, and we are able to do the following:

--

1. Take a random sample of size `\(n\)` from this population, and calculate the mean resting heart rate in this sample, `\(\bar{X}_1\)`.

--

2. Put the sample back, take a second random sample of size `\(n\)`, and calculate the mean resting heart rate from this new sample, `\(\bar{X}_2\)`.

--

3. Put the sample back, take a third random sample of size `\(n\)`, and calculate the mean resting heart rate from this sample, too...

--

...and so on.

---

## Sampling distribution of the mean

After repeating this many times, we have a dataset that has the sample averages from the population:

`\(\bar{X}_1\)`, `\(\bar{X}_2\)`, `\(\cdots\)`, `\(\bar{X}_K\)` (assuming we took `\(K\)` total samples).

--

.question[
Can we say anything about the distribution of these sample means (that is, the .vocab[sampling distribution] of the mean)?
]

*(Keep in mind, we don't know what the underlying distribution of resting heart rate looks like in Duke students!)*

---

class: center, middle

## The Central Limit Theorem

---

class: middle

A quick caveat...

For now, let's assume we know the underlying standard deviation, `\(\sigma\)`, of our distribution.

---

## The Central Limit Theorem

For a population with a well-defined mean `\(\mu\)` and standard deviation `\(\sigma\)`, these three properties hold for the distribution of the sample average `\(\bar{X}\)`, assuming certain conditions hold:

--

1. The mean of the sampling distribution of the mean is identical to the population mean `\(\mu\)`.

--

2. The standard deviation of the distribution of the sample averages is `\(\sigma/\sqrt{n}\)`.
    - This is called the .vocab[standard error] (SE) of the mean (see the short numeric sketch on the next slide).

--

3. For `\(n\)` large enough, the sampling distribution of the mean is approximately .vocab[normally distributed].
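---

## The standard error, numerically

To make property 2 concrete, here is a minimal sketch of how the SE shrinks as the sample size grows (the value of `\(\sigma\)` below is hypothetical):

```r
sigma <- 15   # hypothetical population standard deviation

# SE of the mean is sigma / sqrt(n):
# quadrupling n cuts the SE in half
sigma / sqrt(c(25, 100, 400))
```

```
## [1] 3.00 1.50 0.75
```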
---

## The normal (Gaussian) distribution

The normal distribution is unimodal and symmetric, and is described by its .vocab[density function].

If a random variable `\(X\)` follows the normal distribution, then

`$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{ -\frac{1}{2}\frac{(x - \mu)^2}{\sigma^2} \right\}$$`

where `\(\mu\)` is the mean and `\(\sigma^2\)` is the variance `\((\sigma \text{ is the standard deviation})\)`.

.alert[
We often write `\(N(\mu, \sigma)\)` to describe this distribution.
]

---

## The normal distribution (graphically)

<img src="13-clt_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" />

---

## Wait, *any* distribution?

The central limit theorem tells us that *<b>sample averages</b>* are normally distributed, if we have enough data and certain assumptions hold.

This is true *even if our original variables are not normally distributed*.

Click [here](http://onlinestatbook.com/stat_sim/sampling_dist/index.html) to see an interactive demonstration of this idea.

---

## Conditions for CLT

We need to check two conditions for the CLT to hold: independence and sample size/distribution.

--

✅ .vocab[Independence:] The sampled observations must be independent. This is difficult to check, but the following are useful guidelines:

- the sample must be taken randomly
- if sampling without replacement, the sample size must be less than 10% of the population size

--

If observations are independent, then by definition one observation's value does not "influence" another observation's value.

---

## Conditions for CLT

✅ .vocab[Sample size / distribution:]

- if the data are numerical, usually `\(n > 30\)` is considered a large enough sample for the CLT to kick in
- if we know for sure that the underlying data are normally distributed, then the distribution of sample averages will also be exactly normal, regardless of the sample size
- if the data are categorical, we need at least 10 successes and 10 failures

---

class: middle, center

## Let's run our own simulation

---

### Underlying population (not observed in real life!)

.small[

```r
library(tidyverse)

rs_pop <- tibble(x = rbeta(100000, 1, 5) * 100)
```
]

<img src="13-clt_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

**The true population parameters**

.small[

```
## # A tibble: 1 x 2
##      mu sigma
##   <dbl> <dbl>
## 1  16.6  14.1
```
]

---

## Sampling from the population - 1

```r
set.seed(1)

samp_1 <- rs_pop %>%
  sample_n(size = 50) %>%
  summarise(x_bar = mean(x))
```

```r
samp_1
```

```
## # A tibble: 1 x 1
##   x_bar
##   <dbl>
## 1  15.9
```

---

## Sampling from the population - 2

```r
set.seed(2)

samp_2 <- rs_pop %>%
  sample_n(size = 50) %>%
  summarise(x_bar = mean(x))
```

```r
samp_2
```

```
## # A tibble: 1 x 1
##   x_bar
##   <dbl>
## 1  17.1
```

---

## Sampling from the population - 3

```r
set.seed(3)

samp_3 <- rs_pop %>%
  sample_n(size = 50) %>%
  summarise(x_bar = mean(x))
```

```r
samp_3
```

```
## # A tibble: 1 x 1
##   x_bar
##   <dbl>
## 1  19.2
```

--

keep repeating...

---

## Sampling distribution

.small[

```r
library(infer)  # provides rep_sample_n()

set.seed(092620)

sampling <- rs_pop %>%
  rep_sample_n(size = 50, replace = TRUE, reps = 5000) %>%
  group_by(replicate) %>%
  summarise(xbar = mean(x))
```
]

<img src="13-clt_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

```r
sampling %>%
  summarise(mean = mean(xbar), se = sd(xbar))
```

```
## # A tibble: 1 x 2
##    mean    se
##   <dbl> <dbl>
## 1  16.6  1.99
```

---

.question[
How do the shapes, centers, and spreads of these distributions compare?
]

<img src="13-clt_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />
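---

## Does the theory match?

As a sanity check, we can compare the simulated standard error to the CLT's prediction. A minimal sketch, using the population values from the simulation above (`\(\sigma \approx 14.1\)`, `\(n = 50\)`):

```r
# CLT prediction for the SE of the mean: sigma / sqrt(n)
14.1 / sqrt(50)
```

```
## [1] 1.994041
```

This closely matches the simulated SE of 1.99.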
---

## Recap

- If certain assumptions are satisfied, regardless of the shape of the population distribution, the sampling distribution of the mean follows an approximately normal distribution.

--

- The center of the sampling distribution is at the center of the population distribution.

--

- The sampling distribution is less variable than the population distribution (and we can quantify by how much).

--

.question[
What is the standard error, and how are the standard error and sample size related? What does that say about how the spread of the sampling distribution changes as `\(n\)` increases?
]

---

class: center, middle

## Finding probabilities in R

---

## Probabilities under the N(0,1) curve

```r
# P(Z < -1.5)
pnorm(-1.5)
```

```
## [1] 0.0668072
```

<img src="13-clt_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />

---

## Probability between two values

.question[
If `\(Z \sim N(0, 1)\)`, what is `\(P(-1 < Z < 2)\)`?
]

--

<img src="13-clt_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />

---

## Probability between two values

.question[
If `\(Z \sim N(0, 1)\)`, what is `\(P(-1 < Z < 2)\)`?
]

<img src="13-clt_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />

---

## Probability between two values

.question[
If `\(Z \sim N(0, 1)\)`, what is `\(P(-1 < Z < 2)\)`?
]

<img src="13-clt_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />

---

## Probability between two values

.question[
If `\(Z \sim N(0, 1)\)`, what is `\(P(-1 < Z < 2)\)`?
]

<img src="13-clt_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />

```r
pnorm(2) - pnorm(-1)
```

```
## [1] 0.8185946
```

---

## Probability between two values

.question[
If `\(Z \sim N(0, 1)\)`, what is `\(P(-1 < Z < 2)\)`?
]

```r
pnorm(2) - pnorm(-1)
```

```
## [1] 0.8185946
```

---

## Finding cutoff values under the N(0,1) curve

```r
# find Q1
qnorm(0.25)
```

```
## [1] -0.6744898
```

<img src="13-clt_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" />

---

## Looking ahead...

We will use the Central Limit Theorem and the normal distribution to conduct statistical inference.
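---

## A first taste

To preview how the pieces fit together, here is a minimal sketch (the numbers are the population values from the simulation earlier in these slides). By the CLT, the mean of `\(n = 50\)` draws from that population is approximately `\(N(16.6, 14.1/\sqrt{50})\)`, so `pnorm()` can answer questions about `\(\bar{X}\)` directly:

```r
# P(x_bar > 20) under the CLT's normal approximation;
# returns roughly 0.044
pnorm(20, mean = 16.6, sd = 14.1 / sqrt(50), lower.tail = FALSE)
```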