
Logistic Regression

Prof. Maria Tackett

Introduction

Multiple regression allows us to relate a numerical response variable to one or more numerical or categorical predictors.

We can use multiple regression models to understand relationships, assess differences, and make predictions.

But what about a situation where the response of interest is categorical and binary?

  • spam or not spam
  • malignant or benign tumor
  • survived or died
  • admitted or not admitted

Titanic

On April 15, 1912, the famous ocean liner Titanic sank in the North Atlantic after striking an iceberg on its maiden voyage. The dataset titanic.csv contains the survival status and other attributes of individuals on the Titanic.

  • survived: survival status (1 = survived, 0 = died)
  • pclass: passenger class (1 = 1st, 2 = 2nd, 3 = 3rd)
  • name: name of individual
  • sex: sex (male or female)
  • age: age in years
  • fare: passenger fare in British pounds

We are interested in investigating the variables that contribute to passenger survival. Do women and children really come first?

Data and Packages

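library(tidyverse)
library(broom)
# Load the data (assumes titanic.csv is in the working directory)
titanic <- read_csv("titanic.csv")
glimpse(titanic)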
## Rows: 887
## Columns: 7
## $ pclass <dbl> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3, …
## $ name <chr> "Mr. Owen Harris Braund", "Mrs. John Bradley (Florence Brigg…
## $ sex <chr> "male", "female", "female", "female", "male", "male", "male"…
## $ age <dbl> 22, 38, 26, 35, 35, 27, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55…
## $ fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, 2…
## $ died <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, …
## $ survived <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, …

Exploratory Data Analysis

The linear model with multiple predictors

  • Population model:

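$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon$$

  • Sample model that we use to estimate the population model:

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$$

Denote by p the probability of survival and consider the model below.

$$p = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon$$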

Can you see any problems with this approach?
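
One problem can be seen in a minimal sketch on made-up toy data (the vectors x and y below are hypothetical, not the Titanic data): a straight line fit to a 0/1 response can produce fitted values below 0 or above 1, which cannot be probabilities.

```r
# Toy illustration only: x and y are made-up values, not from titanic.csv
x <- 1:10
y <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)

# Fit an ordinary linear model to the binary response
toy_lm <- lm(y ~ x)

# Range of fitted values: the smallest is below 0, the largest is above 1
fitted_range <- range(fitted(toy_lm))
fitted_range
```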

Linear Regression?

lm_survival <- lm(survived ~ age + sex, data = titanic)
tidy(lm_survival)
## # A tibble: 3 x 5
##   term         estimate std.error statistic  p.value
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  0.752     0.0356      21.1   2.88e-80
## 2 age         -0.000343  0.000979    -0.350 7.26e- 1
## 3 sexmale     -0.551     0.0289     -19.1   3.50e-68

Visualizing the Model

Diagnostics

This isn't helpful! We need to develop a new tool.

Preliminaries

  • Denote by p the probability of some event
  • The odds that the event occurs are $\frac{p}{1-p}$

Odds are sometimes expressed as X : Y and read "X to Y".

The odds are the ratio of successes to failures, where values larger than 1 favor a success and values smaller than 1 favor a failure.

If $P(A) = 1/2$, the odds of A are $\frac{1/2}{1/2} = 1$

If $P(B) = 1/3$, the odds of B are $\frac{1/3}{2/3} = 0.5$

An odds ratio is a ratio of odds.
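
The arithmetic above can be checked with a short sketch. The helper name odds is a hypothetical choice for illustration, not a function from any package used in these slides.

```r
# Hypothetical helper: convert a probability to odds, p / (1 - p)
odds <- function(p) p / (1 - p)

odds_A <- odds(1/2)   # P(A) = 1/2
odds_B <- odds(1/3)   # P(B) = 1/3

# Odds ratio comparing A to B: a ratio of two odds
odds_ratio <- odds_A / odds_B
c(odds_A = odds_A, odds_B = odds_B, odds_ratio = odds_ratio)
```

Here the odds of A are twice the odds of B, so the odds ratio is 2.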

Preliminaries

  • Taking the natural log of the odds yields the logit of p

$$\text{logit}(p) = \log\left(\frac{p}{1-p}\right)$$

The logit takes a value of p between 0 and 1 and outputs a value between $-\infty$ and $\infty$.

The inverse logit (logistic) takes a value between $-\infty$ and $\infty$ and outputs a value between 0 and 1.

$$\text{inverse logit}(x) = \frac{e^{x}}{1+e^{x}}$$
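
A brief sketch of both transformations in base R, where qlogis() computes the logit and plogis() computes the inverse logit. The probabilities in p are arbitrary values chosen for illustration.

```r
# Arbitrary example probabilities
p <- c(0.1, 0.5, 0.9)

# logit: probability -> log-odds (qlogis is base R's logit)
log_odds <- qlogis(p)
manual_log_odds <- log(p / (1 - p))

# inverse logit: log-odds -> probability (plogis is base R's logistic function)
p_back <- plogis(log_odds)
manual_p_back <- exp(log_odds) / (1 + exp(log_odds))

rbind(p, log_odds, p_back)
```

A probability of 0.5 maps to log-odds of 0, and applying the inverse logit returns the original probabilities.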

Logistic Regression Model

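$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$$

Use the inverse logit to find the expression for p.

$$p = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k}}$$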

We can use the logistic regression model to obtain predicted probabilities of success for a binary response variable.

Logistic Regression Model

We fit the model in R using the glm function.

logit_mod <- glm(survived ~ sex + age, data = titanic,
                 family = "binomial")
tidy(logit_mod)
## # A tibble: 3 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  1.11      0.208       5.34  9.05e- 8
## 2 sexmale     -2.50      0.168     -14.9   3.24e-50
## 3 age         -0.00206   0.00586    -0.351 7.25e- 1

Logistic Regression Model

We can then use augment to find the predicted log-odds, which are stored in the .fitted column.

pred_log_odds <- augment(logit_mod)
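
As a sketch of how predicted log-odds relate to predicted probabilities, the example below uses R's built-in mtcars data as a stand-in (an assumption made so the code runs without titanic.csv). In base R, predict() with type = "link" returns log-odds and type = "response" returns probabilities; broom's augment() accepts type.predict = "response" for the same purpose.

```r
# Stand-in model: mtcars ships with R; it is NOT the Titanic data
stand_in_mod <- glm(am ~ wt, data = mtcars, family = "binomial")

# Predicted log-odds (the link scale for a binomial glm)
log_odds_hat <- predict(stand_in_mod, type = "link")

# Predicted probabilities (the response scale)
p_hat <- predict(stand_in_mod, type = "response")

# Applying the inverse logit to the log-odds recovers the probabilities
all.equal(unname(plogis(log_odds_hat)), unname(p_hat))
```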

The Estimated Logistic Regression Model

tidy(logit_mod)
## # A tibble: 3 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  1.11      0.208       5.34  9.05e- 8
## 2 sexmale     -2.50      0.168     -14.9   3.24e-50
## 3 age         -0.00206   0.00586    -0.351 7.25e- 1

$$\log\left(\frac{\hat{p}}{1-\hat{p}}\right) = 1.11 - 2.50\,\text{sex} - 0.00206\,\text{age}$$

$$\hat{p} = \frac{e^{1.11 - 2.50\,\text{sex} - 0.00206\,\text{age}}}{1 + e^{1.11 - 2.50\,\text{sex} - 0.00206\,\text{age}}}$$
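
To see the fitted equation in action, here is a minimal sketch that plugs the rounded estimates from the tidy(logit_mod) output into the formula for the predicted probability. The helper pred_survival and the two 30-year-old passengers are hypothetical, chosen only for illustration; sex enters as 1 for males and 0 for females, matching the sexmale term.

```r
# Rounded coefficient estimates from the tidy(logit_mod) output
b0 <- 1.11
b_male <- -2.50
b_age <- -0.00206

# Hypothetical helper: predicted probability of survival
# male is 1 for males and 0 for females
pred_survival <- function(male, age) {
  log_odds <- b0 + b_male * male + b_age * age
  exp(log_odds) / (1 + exp(log_odds))
}

p_male_30 <- pred_survival(male = 1, age = 30)    # roughly 0.19
p_female_30 <- pred_survival(male = 0, age = 30)  # roughly 0.74
c(male = p_male_30, female = p_female_30)
```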

Interpreting coefficients

$$\log\left(\frac{\hat{p}}{1-\hat{p}}\right) = 1.11 - 2.50\,\text{sex} - 0.00206\,\text{age}$$

Holding sex constant, for every additional year of age, we expect the log-odds of survival to decrease by approximately 0.002.

Holding age constant, we expect males to have a log-odds of survival that is 2.50 less than females.

Interpreting coefficients

$$\frac{\hat{p}}{1-\hat{p}} = e^{1.11 - 2.50\,\text{sex} - 0.00206\,\text{age}}$$

Holding sex constant, for every one year increase in age, the odds of survival are expected to multiply by a factor of $e^{-0.00206} = 0.998$.

Holding age constant, the odds of survival for males are $e^{-2.50} = 0.082$ times the odds of survival for females.
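
These multiplicative factors come from exponentiating the log-odds coefficients. A small sketch using the rounded estimates from the model output (with the fitted model object available, exponentiating coef(logit_mod) or calling tidy(logit_mod, exponentiate = TRUE) should give the unrounded equivalents):

```r
# Rounded log-odds coefficients from the model output
log_odds_coefs <- c(sexmale = -2.50, age = -0.00206)

# Exponentiate to move from the log-odds scale to the odds scale
odds_ratios <- exp(log_odds_coefs)
round(odds_ratios, 3)
```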

Classification

  • Logistic regression allows us to obtain predicted probabilities of success for a binary variable.
  • By imposing a threshold (for example, predicting a success if the predicted probability is greater than 0.50), we can create a classifier.

## # A tibble: 2 x 3
##   survived  Died Survived
##      <dbl> <int>    <int>
## 1        0   464       81
## 2        1   109      233
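
A minimal sketch of the thresholding step and of summarizing the table above. The three predicted probabilities are hypothetical, and the table layout (rows are observed status, columns are predicted class) is an assumption about how the counts were produced.

```r
# Thresholding: hypothetical predicted probabilities for three passengers
p_hat <- c(0.19, 0.74, 0.50)
pred_class <- ifelse(p_hat > 0.50, "Survived", "Died")

# Counts copied from the table above
# (assumed layout: rows = observed status, columns = predicted class)
conf_mat <- matrix(c(464,  81,
                     109, 233),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(observed = c("0", "1"),
                                   predicted = c("Died", "Survived")))

# Proportion of passengers classified correctly
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy
```

Under that assumed layout, 697 of the 887 passengers are classified correctly. A probability of exactly 0.50 is classified as "Died" because the rule uses a strict inequality.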

Strengths and Weaknesses

Weaknesses

  • Logistic regression has assumptions: independence and linearity in the log-odds (some other methods require fewer assumptions)
  • If the predictors are correlated, coefficient estimates may be unreliable

Strengths

  • Straightforward interpretation of coefficients
  • Handles numerical and categorical predictors
  • Can quantify uncertainty around a prediction
  • Can extend to more than 2 categories (multinomial regression)