Multiple regression allows us to relate a numerical response variable to one or more numerical or categorical predictors.
We can use multiple regression models to understand relationships, assess differences, and make predictions.
But what about a situation where the response of interest is categorical and binary?
On April 15, 1912, the famous ocean liner Titanic sank in the North Atlantic
after striking an iceberg on its maiden voyage. The dataset titanic.csv
contains the survival status and other attributes of individuals on the Titanic.
survived: survival status (1 = survived, 0 = died)
pclass: passenger class (1 = 1st, 2 = 2nd, 3 = 3rd)
name: name of individual
sex: sex (male or female)
age: age in years
fare: passenger fare in British pounds

We are interested in investigating the variables that contribute to passenger survival. Do women and children really come first?
library(tidyverse)
library(broom)
glimpse(titanic)
## Rows: 887
## Columns: 7
## $ pclass   <dbl> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3, …
## $ name     <chr> "Mr. Owen Harris Braund", "Mrs. John Bradley (Florence Brigg…
## $ sex      <chr> "male", "female", "female", "female", "male", "male", "male"…
## $ age      <dbl> 22, 38, 26, 35, 35, 27, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55…
## $ fare     <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, 2…
## $ died     <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, …
## $ survived <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, …
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon$$
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$$
Denote by p the probability of survival and consider the model below.
$$p = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon$$
Can you see any problems with this approach?
lm_survival <- lm(survived ~ age + sex, data = titanic)
tidy(lm_survival)
## # A tibble: 3 x 5
##   term         estimate std.error statistic  p.value
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  0.752     0.0356      21.1   2.88e-80
## 2 age         -0.000343  0.000979    -0.350 7.26e- 1
## 3 sexmale     -0.551     0.0289     -19.1   3.50e-68


This isn't helpful! We need to develop a new tool.
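One concrete problem with modeling p as a linear function is that the fitted values are not constrained to lie between 0 and 1. The sketch below illustrates this with simulated data (not the Titanic data; the names `x`, `y`, `lm_sim`, and `edge_preds` are made up for illustration, and the data-generating process is an assumption chosen only to make the point visible).

```r
# Simulated illustration only: x and y are made-up data, not the Titanic data
set.seed(1)
n <- 1000
x <- runif(n, min = -3, max = 3)
y <- rbinom(n, size = 1, prob = plogis(2 * x))  # binary 0/1 response

# Fit an ordinary linear model to the 0/1 response
lm_sim <- lm(y ~ x)

# Predictions at the two edges of the simulated range of x
edge_preds <- predict(lm_sim, newdata = data.frame(x = c(-3, 3)))

# Compare these "probabilities" with the valid range [0, 1]
print(edge_preds)
```

A straight line has to keep rising or falling, so somewhere along the predictor's range it leaves the interval a probability must stay in.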
Odds are sometimes expressed as X : Y and read X to Y.
The odds of an event A are the ratio of successes to failures, $\frac{P(A)}{1 - P(A)}$, where values larger than 1 favor a success and values smaller than 1 favor a failure.
If $P(A) = 1/2$, the odds of A are $\frac{1/2}{1/2} = 1$.
If $P(B) = 1/3$, the odds of B are $\frac{1/3}{2/3} = 0.5$.
An odds ratio is a ratio of odds.
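The two odds examples, and an odds ratio built from them, can be checked with a few lines of R. This is a minimal sketch; the helper name `odds` and the choice to compare A against B are assumptions made for illustration.

```r
# Hypothetical helper: odds of an event that has probability p
odds <- function(p) p / (1 - p)

odds_a <- odds(1 / 2)  # event A from the text
odds_b <- odds(1 / 3)  # event B from the text

# Odds ratio comparing A to B: how many times larger the odds of A are
odds_ratio_ab <- odds_a / odds_b

print(c(odds_a = odds_a, odds_b = odds_b, odds_ratio = odds_ratio_ab))
```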
$$\text{logit}(p) = \log\left(\frac{p}{1 - p}\right)$$
The logit takes a value of p between 0 and 1 and outputs a value between −∞ and ∞.
The inverse logit (logistic) takes a value between −∞ and ∞ and outputs a value between 0 and 1.
$$\text{inverse logit}(x) = \frac{e^x}{1 + e^x}$$
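To see that the two functions undo each other, here is a short sketch that writes them out by hand. The function names `logit` and `inv_logit` are chosen for illustration; base R provides the same transformations as `qlogis()` and `plogis()`.

```r
# Hand-written versions of the two transformations
logit <- function(p) log(p / (1 - p))
inv_logit <- function(x) exp(x) / (1 + exp(x))

p <- c(0.1, 0.5, 0.9)
log_odds <- logit(p)          # probabilities -> log-odds
p_back <- inv_logit(log_odds) # log-odds -> probabilities

print(data.frame(p = p, log_odds = log_odds, p_back = p_back))

# Base R equivalents of logit and inverse logit
print(all.equal(log_odds, qlogis(p)))
print(all.equal(p_back, plogis(log_odds)))
```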
$$\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k$$
Use the inverse logit to find the expression for p.
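$$p = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k}}$$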
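For readers who want the intermediate steps, the expression for p can be derived as follows. The shorthand $\eta$ for the linear predictor is introduced here only to keep the algebra compact; it is not notation from the slides.

```latex
\begin{aligned}
\eta &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k \\
\log\left(\frac{p}{1 - p}\right) &= \eta \\
\frac{p}{1 - p} &= e^{\eta} \\
p &= e^{\eta} (1 - p) \\
p \left(1 + e^{\eta}\right) &= e^{\eta} \\
p &= \frac{e^{\eta}}{1 + e^{\eta}}
\end{aligned}
```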
We can use the logistic regression model to obtain predicted probabilities of success for a binary response variable.
We handle fitting the model via computer using the glm function.
logit_mod <- glm(survived ~ sex + age, data = titanic, family = "binomial")
tidy(logit_mod)
## # A tibble: 3 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  1.11      0.208       5.34  9.05e- 8
## 2 sexmale     -2.50      0.168     -14.9   3.24e-50
## 3 age         -0.00206   0.00586    -0.351 7.25e- 1
And use augment to find predicted log-odds.
pred_log_odds <- augment(logit_mod)
The fitted model, where sex is 1 for males and 0 for females, is
$$\log\left(\frac{\hat{p}}{1 - \hat{p}}\right) = 1.11 - 2.50~\text{sex} - 0.00206~\text{age}$$
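The link between predicted log-odds and predicted probabilities can be checked without the Titanic data. The sketch below fits a logistic regression to simulated data (the data frame `sim_data`, its coefficients, and the model name `sim_mod` are all made up for illustration) and uses base R's `predict()` rather than `augment()` to get predictions on both scales.

```r
# Simulated illustration only: made-up data, not the Titanic data
set.seed(2)
n <- 500
sim_data <- data.frame(
  age = runif(n, min = 1, max = 80),
  sex = sample(c("female", "male"), size = n, replace = TRUE)
)

# Assumed data-generating log-odds, chosen only for this sketch
true_log_odds <- 1 - 2 * (sim_data$sex == "male") - 0.01 * sim_data$age
sim_data$survived <- rbinom(n, size = 1, prob = plogis(true_log_odds))

sim_mod <- glm(survived ~ sex + age, data = sim_data, family = "binomial")

# Predictions on the log-odds (link) scale and the probability (response) scale
sim_log_odds <- predict(sim_mod, type = "link")
sim_probs <- predict(sim_mod, type = "response")

# Applying the inverse logit to the log-odds recovers the probabilities
print(all.equal(unname(plogis(sim_log_odds)), unname(sim_probs)))
```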
Holding sex constant, for every additional year of age, we expect the log-odds of survival to decrease by approximately 0.002.
Holding age constant, we expect males to have a log-odds of survival that is 2.50 less than females.
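Log-odds are hard to read directly, so it can help to convert a couple of them into probabilities. The sketch below plugs the rounded coefficient estimates from the model output into the fitted equation for two illustrative passengers; the helper `pred_log_odds_for` and the choice of age 30 are assumptions, not part of the original analysis.

```r
# Rounded coefficient estimates as reported in the model output
b0 <- 1.11
b_male <- -2.50
b_age <- -0.00206

# Hypothetical helper: predicted log-odds, with male = 1 for males, 0 for females
pred_log_odds_for <- function(male, age) b0 + b_male * male + b_age * age

# Illustrative (made-up) passengers: a 30-year-old female and a 30-year-old male
lo_female <- pred_log_odds_for(male = 0, age = 30)
lo_male <- pred_log_odds_for(male = 1, age = 30)

# Convert log-odds to predicted probabilities with the inverse logit
p_female <- plogis(lo_female)
p_male <- plogis(lo_male)

print(round(c(female = p_female, male = p_male), 3))
```

At any fixed age, the two log-odds differ by exactly 2.50, which is the sex coefficient described in the interpretation.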
$$\frac{\hat{p}}{1 - \hat{p}} = e^{1.11 - 2.50~\text{sex} - 0.00206~\text{age}}$$
Holding sex constant, for every one year increase in age, the odds of survival are expected to multiply by a factor of $e^{-0.00206} \approx 0.998$.
Holding age constant, the odds of survival for males are $e^{-2.50} \approx 0.082$ times the odds of survival for females.
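The multiplicative factors quoted here come from exponentiating the coefficients. A minimal sketch using the rounded estimates from the model output follows; the vector name `coefs` and the ten-year comparison are added for illustration.

```r
# Rounded coefficient estimates from the fitted logistic regression
coefs <- c(sexmale = -2.50, age = -0.00206)

# Exponentiating a log-odds coefficient gives an odds ratio
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))

# For a ten-year age difference the one-year factor is applied ten times,
# which is the same as exponentiating ten times the coefficient
or_age_10 <- exp(10 * coefs[["age"]])
print(or_age_10)
```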
## # A tibble: 2 x 3
##   survived  Died Survived
##      <dbl> <int>    <int>
## 1        0   464       81
## 2        1   109      233
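The code that produced this table is not shown. Assuming the rows are the observed survival status and the columns are the class predicted by the model (an assumption, since the slides do not say so), the counts can be summarized as below. The object names are made up for illustration.

```r
# Counts copied from the table; rows are observed survival status (0/1).
# Assumption: columns are the model's predicted class.
conf_mat <- matrix(
  c(464, 81,
    109, 233),
  nrow = 2, byrow = TRUE,
  dimnames = list(observed = c("0", "1"), predicted = c("Died", "Survived"))
)

# Share of passengers whose predicted class matches the observed outcome
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Share of observed survivors predicted to survive
sensitivity <- conf_mat["1", "Survived"] / sum(conf_mat["1", ])

# Share of observed deaths predicted to die
specificity <- conf_mat["0", "Died"] / sum(conf_mat["0", ])

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity), 3))
```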
Weaknesses
Strengths