Class Activity, September 16
Data
The RMS Titanic was a huge luxury passenger liner designed and built in the early 20th century. Although the ship was believed to be unsinkable, during her maiden voyage the Titanic struck an iceberg late on April 14, 1912, and sank in the early hours of April 15. Of all the passengers and crew, fewer than half survived. The low survival rate has been attributed in part to the fact that the Titanic did not carry enough lifeboats for its passengers and crew. This meant that there was competition for space in the boats, and not everyone was able to make it aboard. Communication errors, stress, and shock also played a role; a great many factors contributed to this tragedy.
The loss of life during the Titanic tragedy was enormous, but there were survivors. Was it random chance that these particular people survived? Or did specific characteristics of these people help them secure a place in the lifeboats? Let’s investigate.
We have observations on 12 different variables, some categorical and some numeric:
Passenger: A unique ID number for each passenger.
Survived: An indicator for whether the passenger survived (1) or perished (0) during the disaster.
Pclass: Indicator for the class of the ticket held by this passenger; 1 = 1st class, 2 = 2nd class, 3 = 3rd class.
Name: The name of the passenger.
Sex: Binary indicator for the biological sex of the passenger.
Age: Age of the passenger in years; Age is fractional if the passenger was less than 1 year old.
SibSp: Number of siblings/spouses the passenger had aboard the Titanic. Here, siblings are defined as brother, sister, stepbrother, and stepsister. Spouses are defined as husband and wife.
Parch: Number of parents/children the passenger had aboard the Titanic. Here, parent is defined as mother/father and child is defined as daughter, son, stepdaughter, or stepson. NOTE: Some children traveled only with a nanny, so Parch = 0 for them. There were no parents aboard for these children.
Ticket: The unique ticket number for each passenger.
Fare: How much the ticket cost in US dollars.
Cabin: The cabin number assigned to each passenger. Some cabins hold more than one passenger.
Embarked: Port where the passenger boarded the ship; C = Cherbourg, Q = Queenstown, S = Southampton.
Recall that our goal is to build a model to help explore what characteristics were related to whether or not a passenger survived the disaster.
Loading the data
The titanic data can be loaded into R with the following command:
titanic <- read.csv("https://sta214-f22.github.io/labs/Titanic.csv")
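After loading, it can help to check the variable types and count the missing values in each column before any modeling. The following is a minimal sketch; it uses a small made-up data frame called toy (a hypothetical stand-in for titanic, not real passenger records) so that it runs without a network connection.

```r
# Toy stand-in for the titanic data (made-up values, for illustration only)
toy <- data.frame(
  Survived = c(1, 0, 1, 0, 0),
  Age      = c(22, NA, 35, 4, NA),
  Sex      = c("female", "male", "female", "male", "male"),
  Fare     = c(71.28, 7.25, 53.10, 21.08, 8.05)
)

# Variable names and types
str(toy)

# Number of missing values (NA) in each column
missing_counts <- colSums(is.na(toy))
missing_counts
```

With the real data, the same two calls would be applied to titanic in place of toy.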
Research question
We want to investigate the following question:
Is there a relationship between passenger age and their probability of survival, after accounting for sex, passenger class, and the cost of their ticket?
Part I: Exploratory data analysis (EDA)
To begin, we will explore relationships between survival and the different explanatory variables, using empirical logit plots. Code and examples for creating empirical logit plots can be found at
https://sta712-f22.github.io/homework/empirical_logits.html
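To make the idea concrete, here is a rough base R sketch of what an empirical logit plot computes: bin the explanatory variable into groups of roughly equal size, then compute the log odds of survival within each bin. This is not the course's logodds_plot implementation. The data are simulated, and the names toy and num_bins, along with the 0.5 adjustment to the counts, are assumptions made for illustration.

```r
# Simulated data (not the real Titanic data)
set.seed(214)
n <- 300
toy <- data.frame(Age = runif(n, 1, 70))
toy$Survived <- rbinom(n, 1, plogis(0.5 - 0.03 * toy$Age))

# Split Age into bins containing roughly equal numbers of observations
num_bins <- 10
breaks <- quantile(toy$Age, probs = seq(0, 1, length.out = num_bins + 1))
bin <- cut(toy$Age, breaks = breaks, include.lowest = TRUE)

# Per-bin summaries
mean_age  <- as.numeric(tapply(toy$Age, bin, mean))
successes <- as.numeric(tapply(toy$Survived, bin, sum))
totals    <- as.numeric(tapply(toy$Survived, bin, length))

# Empirical log odds; adding 0.5 to each count keeps the value finite
# when a bin contains only survivors or only non-survivors
emp_logit <- log((successes + 0.5) / (totals - successes + 0.5))

# Plot the empirical logits with a least-squares line through them
plot(mean_age, emp_logit, xlab = "Mean age in bin", ylab = "Empirical log odds")
abline(lm(emp_logit ~ mean_age))
```

If the points fall roughly along the line, a linear relationship between the explanatory variable and the log odds looks plausible.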
Our research question is focused on age as an explanatory variable, so let us begin by exploring age and survival.
Question 1
Use the following code to create an empirical logit plot to examine the relationship between age and survival:
library(tidyverse)
logodds_plot(titanic, 30, "equal_size", "Age", "Survived",
             reg_formula = y ~ x)
The research question asks us to account for other variables, like sex, in the model. So, let us see how the relationship between age and survival changes when we add sex. To investigate this question, we can fit separate lines on the empirical logit plot for male and female passengers.
Question 2
Use the following code to create an empirical logit plot for the relationship between age and survival, broken down by sex. Does the assumption that the log odds are a linear function of Age seem appropriate?
logodds_plot(titanic, 30, "equal_size", "Age", "Survived",
             grouping = "Sex",
             reg_formula = y ~ x)
We also care about including Fare in the model, so let’s explore the relationship between Fare and survival.
Question 3
Use the following code to create an empirical logit plot for the relationship between fare and survival. Does the shape/linearity assumption seem reasonable here?
logodds_plot(titanic, 30, "equal_size", "Fare", "Survived",
             reg_formula = y ~ x)
When the shape assumption does not seem reasonable, we can try different transformations. This is done in the logodds_plot function by changing the regression formula (reg_formula).
For example, to try a log transformation on Fare:
logodds_plot(titanic, 30, "equal_size", "Fare", "Survived",
             reg_formula = y ~ log(x))
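One thing to watch for with a log transformation is a value of exactly 0, since log(0) is -Inf in R. The short sketch below compares a few candidate transformations on a made-up vector of fares (the values and the name fare are hypothetical, chosen only to include a 0).

```r
# Made-up fares, including a zero, for illustration only
fare <- c(0, 7.25, 8.05, 26, 71.28, 512.33)

log_fare   <- log(fare)      # -Inf for the zero fare
log1p_fare <- log(fare + 1)  # finite for every value
sqrt_fare  <- sqrt(fare)     # also finite, but compresses large values less

data.frame(fare, log_fare, log1p_fare, sqrt_fare)
```

In the plotting function, the corresponding formulas would presumably be written as reg_formula = y ~ log(x + 1) or reg_formula = y ~ sqrt(x).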
Question 4
Experiment with different transformations for Fare. Which transformation seems most appropriate?
Question 5
Explore any other variables you need in the model to address the research question. Based on your exploratory data analysis, propose a model that allows you to address the research question.
Part II: Diagnostics
Now let’s fit the model, and assess whether the model assumptions are met.
Question 6
Fit the model you proposed in Question 5. Note: you may need to remove missing values from the titanic data first!
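As a sketch of the mechanics only, the code below drops rows with missing values in the model variables and then fits a logistic regression with glm. The data are simulated, and the formula Survived ~ Age + Sex is a placeholder, not the model you are expected to propose.

```r
# Simulated data with some missing ages (not the real Titanic data)
set.seed(214)
n <- 200
toy <- data.frame(
  Age = round(runif(n, 1, 70)),
  Sex = sample(c("male", "female"), n, replace = TRUE)
)
toy$Survived <- rbinom(n, 1,
                       plogis(0.5 - 0.03 * toy$Age + 1.5 * (toy$Sex == "female")))
toy$Age[sample(n, 20)] <- NA  # make 20 ages missing

# Keep only rows with no missing values in the variables used by the model
model_vars <- c("Survived", "Age", "Sex")
toy_complete <- toy[complete.cases(toy[, model_vars]), ]

# Logistic regression (placeholder formula)
fit <- glm(Survived ~ Age + Sex, data = toy_complete, family = binomial)
summary(fit)
```

glm would drop incomplete rows on its own by default, but removing them explicitly keeps the data frame aligned with the fitted values and residuals, which is convenient for the diagnostic plots that follow.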
Question 7
Use the cooks.distance(...) function to check for any influential points. Recall that we typically use a threshold of 0.5 or 1 to identify influential points.
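A minimal sketch of this check, on simulated data with a placeholder model (the names toy, fit, and cooks are assumptions for illustration):

```r
# Simulated data (not the real Titanic data)
set.seed(214)
n <- 200
toy <- data.frame(
  Age = round(runif(n, 1, 70)),
  Sex = sample(c("male", "female"), n, replace = TRUE)
)
toy$Survived <- rbinom(n, 1,
                       plogis(0.5 - 0.03 * toy$Age + 1.5 * (toy$Sex == "female")))

fit <- glm(Survived ~ Age + Sex, data = toy, family = binomial)

# One Cook's distance per observation used in the fit
cooks <- cooks.distance(fit)

max(cooks)          # largest Cook's distance
which(cooks > 0.5)  # indices of observations above the 0.5 threshold, if any

# Visual check against the threshold
plot(cooks, type = "h", xlab = "Observation", ylab = "Cook's distance")
abline(h = 0.5, lty = 2)
```

If which(...) returns no indices, no observation exceeds the threshold.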
Question 8
Use the qresid(...) function from the statmod package to calculate quantile residuals for your fitted model. Make two quantile residual plots: one for Fare, and one for Age. Does the shape assumption seem reasonable?
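The sketch below shows one way such a plot might be produced, again on simulated data with a placeholder model. It assumes the statmod package is installed. For a binary response the quantile residuals are randomized, so a seed is set to make the plot reproducible.

```r
library(statmod)

# Simulated data (not the real Titanic data)
set.seed(214)
n <- 200
toy <- data.frame(
  Age = round(runif(n, 1, 70)),
  Sex = sample(c("male", "female"), n, replace = TRUE)
)
toy$Survived <- rbinom(n, 1,
                       plogis(0.5 - 0.03 * toy$Age + 1.5 * (toy$Sex == "female")))

fit <- glm(Survived ~ Age + Sex, data = toy, family = binomial)

# Randomized quantile residuals, one per observation
q_res <- qresid(fit)

# Quantile residuals against an explanatory variable
plot(toy$Age, q_res, xlab = "Age", ylab = "Quantile residual")
abline(h = 0, lty = 2)
```

When the shape assumption is reasonable, the residuals should scatter around 0 with no clear trend across the explanatory variable.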
Part III: Hypothesis testing
Question 9
Use a Wald test to address the original research question. Give your null and alternative hypotheses, test statistic, and p-value.
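For a single coefficient, a Wald test compares the estimate divided by its standard error to a standard normal distribution; for example, the hypotheses for an age coefficient would be that it equals 0 versus that it does not. The sketch below computes the statistic and two-sided p-value by hand on simulated data with a placeholder model, so the numbers it produces say nothing about the real Titanic data.

```r
# Simulated data (not the real Titanic data)
set.seed(214)
n <- 200
toy <- data.frame(
  Age = round(runif(n, 1, 70)),
  Sex = sample(c("male", "female"), n, replace = TRUE)
)
toy$Survived <- rbinom(n, 1,
                       plogis(0.5 - 0.03 * toy$Age + 1.5 * (toy$Sex == "female")))

fit <- glm(Survived ~ Age + Sex, data = toy, family = binomial)
coefs <- summary(fit)$coefficients

# Wald statistic for the Age coefficient: estimate / standard error
est <- coefs["Age", "Estimate"]
se  <- coefs["Age", "Std. Error"]
z   <- est / se

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)

c(z = z, p_value = p_value)
```

These values should match the z value and Pr(>|z|) columns that summary(fit) reports for the same coefficient.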