Empirical logit plots
Function
The following R function can be used to create empirical logit plots:
<- function(data, num_bins, bin_method,
logodds_plot grouping = NULL,
x_name, y_name, reg_formula = y ~ x){
if(is.null(grouping)){
<- data.frame(x = data %>% pull(x_name),
dat y = data %>% pull(y_name),
group = 1)
else {
} <- data.frame(x = data %>% pull(x_name),
dat y = data %>% pull(y_name),
group = as.factor(data %>% pull(grouping)))
}
if(bin_method == "equal_size"){
<- dat %>%
logodds_table drop_na() %>%
arrange(group, x) %>%
group_by(group) %>%
mutate(obs = y,
bin = rep(1:num_bins,
each=ceiling(n()/num_bins))[1:n()]) %>%
group_by(bin, group) %>%
summarize(mean_x = mean(x),
prop = mean(c(obs, 0.5)),
num_obs = n()) %>%
ungroup() %>%
mutate(logodds = log(prop/(1 - prop)))
else {
} <- dat %>%
logodds_table drop_na() %>%
group_by(group) %>%
mutate(obs = y,
bin = cut(x,
breaks = num_bins,
labels = F)) %>%
group_by(bin, group) %>%
summarize(mean_x = mean(x),
prop = mean(c(obs, 0.5)),
num_obs = n()) %>%
ungroup() %>%
mutate(logodds = log(prop/(1 - prop)))
}
if(is.null(grouping)){
%>%
logodds_table ggplot(aes(x = mean_x,
y = logodds)) +
geom_point(size=2) +
geom_smooth(se=F, method="lm", formula = reg_formula) +
theme_bw() +
labs(x = x_name,
y = "Empirical log odds") +
theme(text = element_text(size=15))
else {
} %>%
logodds_table ggplot(aes(x = mean_x,
y = logodds,
color = group,
shape = group)) +
geom_point(size=2) +
geom_smooth(se=F, method="lm", formula = reg_formula) +
theme_bw() +
labs(x = x_name,
y = "Empirical log odds",
color = grouping,
shape = grouping) +
theme(text = element_text(size=15))
}
}
Arguments
data
: the dataset of interest (e.g.,titanic
)num_bins
: the number of bins to use- The number of bins should be chosen based on the size of the data.
For example, with
bin_method = "equal_size"
, we probably want at least 15 observations per bin, sonum_bins
< (number of observations)/15
- The number of bins should be chosen based on the size of the data.
For example, with
bin_method
: whether to choose bins with the same number of observations ("equal_size"
) or the same width ("equal_width"
)x_name
: the name of the column containing the explanatory variable (e.g.,"Fare"
). The quotation marks are neededy_name
: the name of the column containing the response (e.g.,"Survived"
). The quotation marks are neededgrouping
: the name of a categorical variable in the data; fit a separate line for each level of the categorical variablereg_formula
: This is the shape of the relationship you want to plot. If you want a line, this is y ~ x (the default). Some other examples:y ~ log(x)
: a log transformation of xy ~ sqrt(x)
: a square root transformation of xy ~ poly(x, 2)
: a second-order polynomialy ~ poly(x, 3)
: a third-order polynomial
Examples
The titanic
dataset contains a number of variables
recorded for passengers on the Titanic when it sunk on its
first voyage. The titanic
dataset can be loaded into R
with
<- read.csv("https://sta214-f22.github.io/labs/Titanic.csv") titanic
Now let’s see how to make empirical logit plots.
Looking at Fare
The response variable we care about with the titanic
data is Survived
: whether a passenger survived the disaster
or not. Let’s first look at Fare
as an explanatory
variable. We will make an empirical logit plot with 30 equally sized
bins:
library(tidyverse)
logodds_plot(titanic, 30, "equal_size", "Fare", "Survived",
reg_formula = y ~ x)
Trying a different shape
That linear fit doesn’t look very good! Maybe we should try a different shape. Let’s try a quadratic fit instead:
logodds_plot(titanic, 30, "equal_size", "Fare", "Survived",
reg_formula = y ~ poly(x, 2))
And maybe a log transformation too:
logodds_plot(titanic, 30, "equal_size", "Fare", "Survived",
reg_formula = y ~ log(x))
Potential interactions
Does the relationship between Fare and Survival depend on passenger
class (Pclass
)? We can add Pclass
as a
grouping variable to the empirical logit plot, and fit a different line
for each passenger class.
logodds_plot(titanic, 30, "equal_size", "Fare", "Survived",
grouping = "Pclass",
reg_formula = y ~ log(x))
Now, this doesn’t let us see the relationship very well. Let’s transform Fare first, then make the empirical logit plot with the transformed variable:
%>%
titanic mutate(logFare = log(Fare + 1)) %>%
logodds_plot(30, "equal_size", "logFare", "Survived",
grouping = "Pclass",
reg_formula = y ~ x)
If we think these lines have different slopes, we might consider adding an interaction between Fare and passenger class.