library(tidyverse)

# Load the Titanic data and drop rows with missing values
titanic <- read.csv("https://sta214-f22.github.io/labs/Titanic.csv")
titanic <- titanic %>%
  drop_na()
# logodds_plot() is the empirical log-odds plotting helper provided with
# the course materials (assumed already loaded); here we use 30 equal-size
# bins of Age
logodds_plot(titanic, 30, "equal_size", "Age", "Survived",
             reg_formula = y ~ x)
There may be a negative relationship between Age and the probability of survival. The logistic regression assumption that the log odds are linear in Age seems reasonable here.
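For intuition, here is a minimal sketch of what an empirical log-odds plot like this computes under the hood, assuming Survived is coded 0/1 (the internals of logodds_plot itself may differ):

# Bin Age into 30 equal-count bins, then compute the observed log odds
# of survival within each bin (a bin whose survival proportion is 0 or 1
# would give an infinite log odds)
titanic %>%
  mutate(age_bin = cut_number(Age, 30)) %>%
  group_by(age_bin) %>%
  summarize(mean_age = mean(Age),
            p_hat = mean(Survived),
            log_odds = log(p_hat / (1 - p_hat)))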
logodds_plot(titanic, 30, "equal_size", "Age", "Survived",
             grouping = "Sex",
             reg_formula = y ~ x)
The assumption that the log odds are linear in Age seems reasonable for both the male and female passengers. We can see that female passengers generally have a higher chance of survival than male passengers, and that the relationship between Age and survival appears positive for female passengers but negative for male passengers. This suggests an interaction between Age and Sex, which we will include in the model below.
logodds_plot(titanic, 30, "equal_size", "Fare", "Survived",
             reg_formula = y ~ x)
The shape assumption does not seem reasonable for Fare; the log odds appear to be a non-linear function of Fare. We should try some transformations.
logodds_plot(titanic, 30, "equal_size", "Fare", "Survived",
             reg_formula = y ~ log(x))
A log transformation appears to be a good way of capturing the relationship between Fare and survival.
We will fit the model
\[Y_i \sim \text{Bernoulli}(p_i)\]
\[\log \left( \dfrac{p_i}{1 - p_i} \right) = \beta_0 + \beta_1 Sex_i + \beta_2 \log(Fare_i + 1) + \beta_3 Age_i + \beta_4 Age_i \cdot Sex_i,\]
where \(Sex_i = 1\) for male passengers and 0 for female passengers. (We add 1 to Fare before taking the log because some fares are 0, and \(\log(0)\) is undefined.)
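As a quick check (output not shown), we can count the zero fares that motivate adding 1:

# log(0) is -Inf, so confirm how many fares are exactly 0
sum(titanic$Fare == 0)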
m1 <- glm(Survived ~ log(Fare + 1) + Sex * Age,
          data = titanic, family = binomial)
summary(m1)
## 
## Call:
## glm(formula = Survived ~ log(Fare + 1) + Sex * Age, family = binomial, 
##     data = titanic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2142  -0.6479  -0.5059   0.7538   2.5885  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -1.40695    0.44682  -3.149  0.00164 ** 
## log(Fare + 1)  0.69449    0.11065   6.276 3.47e-10 ***
## Sexmale       -1.27467    0.41654  -3.060  0.00221 ** 
## Age            0.01107    0.01107   1.000  0.31730    
## Sexmale:Age   -0.03638    0.01378  -2.639  0.00831 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 964.52  on 713  degrees of freedom
## Residual deviance: 697.21  on 709  degrees of freedom
## AIC: 707.21
## 
## Number of Fisher Scoring iterations: 4
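Though not needed for the hypothesis test below, exponentiating the coefficients puts them on the odds scale, which can be easier to interpret:

# Odds ratios: multiplicative change in the odds of survival per one-unit
# change in each predictor, holding the others fixed
exp(coef(m1))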
Now let’s check for any influential points using Cook’s distance:
# Distribution of the Cook's distances, and the largest value
hist(cooks.distance(m1))
max(cooks.distance(m1))
## [1] 0.03384292
The largest observed Cook's distance is about 0.034, which is well below both of the common thresholds for influential points (0.5 and 1). So, we are not concerned about influential points here.
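If we did want to flag specific observations, one option is:

# Indices of observations exceeding the usual 0.5 cutoff (none here,
# since the maximum Cook's distance is only about 0.034)
which(cooks.distance(m1) > 0.5)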
We also want to assess the shape assumption for our fitted model, using quantile residual plots. (This is a bit different from the empirical logit plots in the earlier questions, because we can use predictions from the full model when calculating the quantile residuals.) We plot the quantile residuals against our quantitative predictors, Age and \(\log(Fare + 1)\):
# Quantile residuals (from the statmod package) plotted against Age
titanic %>%
  mutate(resids = statmod::qresid(m1)) %>%
  ggplot(aes(x = Age, y = resids)) +
  geom_point() +
  geom_smooth() +
  theme_bw()

# Quantile residuals plotted against log(Fare + 1)
titanic %>%
  mutate(resids = statmod::qresid(m1)) %>%
  ggplot(aes(x = log(Fare + 1), y = resids)) +
  geom_point() +
  geom_smooth() +
  theme_bw()
Both plots look good (there is a random scatter of residuals around 0, with no clear pattern), so the shape assumption seems reasonable.
Our null hypothesis is that, after accounting for Sex and Fare, Age is not associated with survival for either sex; that is, \(\beta_3 = \beta_4 = 0\). Letting \(\beta_{(2)} = (\beta_3, \beta_4)^T\), our hypotheses are
\(H_0: \beta_{(2)} = \beta_{(2)}^0 = (0, 0)^T\)
\(H_A: \beta_{(2)} \neq \beta_{(2)}^0\)
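The Wald test statistic is
\[W = (\hat{\beta}_{(2)} - \beta_{(2)}^0)^T \left[ \widehat{\text{Var}}(\hat{\beta}_{(2)}) \right]^{-1} (\hat{\beta}_{(2)} - \beta_{(2)}^0),\]
which has an approximate \(\chi^2_2\) distribution when \(H_0\) is true. Since \(\beta_{(2)}^0 = (0, 0)^T\) here, the statistic reduces to \(\hat{\beta}_{(2)}^T \left[ \widehat{\text{Var}}(\hat{\beta}_{(2)}) \right]^{-1} \hat{\beta}_{(2)}\).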
Calculating the test statistic and p-value:
# Estimated covariance matrix of (beta3_hat, beta4_hat): rows/columns 4-5
# of vcov(m1) correspond to the Age and Sexmale:Age coefficients
I22 <- vcov(m1)[4:5, 4:5]
# Wald statistic W = b' V^{-1} b, with b = (beta3_hat, beta4_hat)
W <- t(coefficients(m1)[4:5]) %*% solve(I22) %*% coefficients(m1)[4:5]
pchisq(W, df = 2, lower.tail = F) # p-value
##             [,1]
## [1,] 0.005869939
The p-value is about 0.006, so we reject \(H_0\) and conclude that, after accounting for Fare, Age is associated with the odds of survival for at least one of the two sexes.
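As a sanity check, a drop-in-deviance (likelihood ratio) test comparing the full model to the reduced model without the Age terms should lead to the same conclusion; a sketch:

# Reduced model under H0 (no Age main effect or Age:Sex interaction)
m0 <- glm(Survived ~ log(Fare + 1) + Sex,
          data = titanic, family = binomial)
# Likelihood ratio (drop-in-deviance) test
anova(m0, m1, test = "Chisq")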