Hang out with the TAs from STA 199! This is a casual conversation and a fun opportunity to meet the members of the STA 199 teaching team. The only rule is these can’t turn into office hours!
Tea with a TA counts as a statistics experience.
Caroline Levenson, Mon, Nov 2, 1p - 2p
What questions do you have about statistics and the US election? Click here to submit your questions by Friday, Oct 30. We will discuss some of the questions during the Nov 2 live lecture.
And…if you’re eligible, VOTE!
Write the draft in the written-report.Rmd file in the project repo.
Draft should include
Let’s filter spam emails! We’ll examine a data set of 3921 emails and use logistic regression to determine which ones are potentially spam.
We’ll use the following variables in the analysis:
- spam: 1: email is spam, 0: email is not spam
- winner: yes: “winner” appeared in the email, no: “winner” did not appear in the email
- num_char: Number of characters in the email (in thousands)

email <- read_csv("data/email.csv") %>%
  mutate(spam = factor(spam))
Visualize the relationship between (1) spam and winner, and (2) spam and num_char. Use these plots to describe the relationship between the response variable and each of the explanatory variables.
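One possible way to set up the two plots is sketched below. So that it runs on its own, it uses a small made-up data frame, email_toy (a hypothetical stand-in with the same three column names, not the real data), so the pictures it draws say nothing about the actual emails. In the exercise, swap in the email object read from data/email.csv. The choice of a filled bar chart and side-by-side boxplots is just one reasonable option.

```r
library(tidyverse)

# Made-up stand-in for `email` (NOT the real data); values are arbitrary.
set.seed(1)
email_toy <- tibble(
  spam     = factor(rbinom(200, 1, 0.1)),
  winner   = sample(c("no", "yes"), 200, replace = TRUE, prob = c(0.9, 0.1)),
  num_char = rexp(200, rate = 1 / 10)
)

# (1) spam vs. winner: both categorical, so compare proportions within each bar
p_winner <- ggplot(email_toy, aes(x = winner, fill = spam)) +
  geom_bar(position = "fill") +
  labs(x = "\"winner\" appears in email", y = "Proportion", fill = "spam")

# (2) spam vs. num_char: categorical vs. numeric, so side-by-side boxplots
p_char <- ggplot(email_toy, aes(x = spam, y = num_char)) +
  geom_boxplot() +
  labs(x = "spam", y = "Number of characters (in thousands)")

p_winner
p_char
```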
Fit a logistic regression model with spam as the response and winner and num_char as explanatory variables. Use the tidy function to display the model output. Hint: You need to use the glm function and family = "binomial" for the model.
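The hint can be sketched as follows. The data frame email_toy is simulated from arbitrary, made-up coefficients purely so the code runs on its own; its estimates are not the answers for the real data. In the exercise, pass the email object to glm instead. The name spam_fit is also just an illustrative choice.

```r
library(tidyverse)
library(broom)

# Simulated stand-in for `email` (NOT the real data); coefficients are arbitrary.
set.seed(199)
n_emails <- 500
email_toy <- tibble(
  winner   = sample(c("no", "yes"), n_emails, replace = TRUE, prob = c(0.9, 0.1)),
  num_char = rexp(n_emails, rate = 1 / 10)
) %>%
  mutate(
    log_odds = -1.5 + 1.2 * (winner == "yes") - 0.08 * num_char,
    spam     = factor(rbinom(n_emails, 1, plogis(log_odds)))
  )

# Logistic regression: family = "binomial" models the log odds that spam is 1
spam_fit <- glm(spam ~ winner + num_char, data = email_toy, family = "binomial")

# One row per term: estimate, standard error, test statistic, p-value
tidy(spam_fit)
```

Because winner is categorical with levels no and yes, the output has a row labeled winneryes: the coefficient for emails that contain “winner”, relative to the baseline of emails that do not.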
Write the model equation. You can use “log-odds-spam” as the response variable.
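As a template, the general form of the equation looks like the following. The symbol \(\hat{\beta}_0\) for the intercept and the indicator \(winner_{yes}\) (1 if “winner” appeared in the email, 0 otherwise) are notation assumed here; replace each \(\hat{\beta}\) with the corresponding estimate from your tidy output.

\[
\text{log-odds-spam} = \hat{\beta}_0 + \hat{\beta}_{winner} \times winner_{yes} + \hat{\beta}_{num\_char} \times num\_char
\]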
Interpret \(\hat{\beta}_{num\_char}\) in terms of the log odds that an email is spam.
Interpret \(\hat{\beta}_{winner}\) in terms of the log odds that an email is spam.
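One way to see what each coefficient measures is to compare the predicted log odds for two emails that differ in only one variable. Writing \(\text{log-odds}(winner, num\_char)\) for the predicted log odds (a shorthand assumed here), every other term cancels in the difference:

\[
\text{log-odds}(winner,\ num\_char + 1) - \text{log-odds}(winner,\ num\_char) = \hat{\beta}_{num\_char}
\]

\[
\text{log-odds}(yes,\ num\_char) - \text{log-odds}(no,\ num\_char) = \hat{\beta}_{winner}
\]

Since num_char is measured in thousands, a one-unit increase corresponds to 1,000 additional characters.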
Calculate the predicted log odds that an email that has 400 words and contains the word “winner” is spam.
Calculate the predicted probability that an email that has 400 words and contains the word “winner” is spam.
Suppose your model is used to identify which emails should be classified as spam and moved to the junk folder. Should the email from the previous question be classified as spam? Briefly explain your decision.
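The last three questions follow one chain: compute the log odds, convert them to a probability, then compare the probability to a cutoff. A minimal sketch is below. The helper predict_spam is hypothetical, the coefficient values passed to it are placeholders chosen for round arithmetic (not estimates from the email data), and the 0.5 threshold is an assumption rather than a rule. Note that num_char is in thousands of characters, so an email's length has to be expressed in that unit before it is plugged in.

```r
# Hypothetical helper: log odds -> probability -> classification
predict_spam <- function(b_intercept, b_winner, b_num_char,
                         winner_yes, num_char, threshold = 0.5) {
  # Linear predictor on the log-odds scale
  log_odds <- b_intercept + b_winner * winner_yes + b_num_char * num_char
  # Inverse logit: probability = odds / (1 + odds)
  prob <- exp(log_odds) / (1 + exp(log_odds))
  list(log_odds = log_odds, prob = prob, classify_as_spam = prob > threshold)
}

# Placeholder inputs only; substitute your own estimates and the email's values
result <- predict_spam(b_intercept = -1, b_winner = 2, b_num_char = -0.5,
                       winner_yes = 1, num_char = 2)
result
```

Whether 0.5 is a sensible cutoff depends on the relative cost of the two mistakes: sending a real email to the junk folder versus letting spam into the inbox.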