Class Activity, August 31
Part I
In the first part of this class activity, we will explore quantile
residuals as a diagnostic tool for logistic regression. Quantile
residuals can be calculated for a fitted GLM in R using the
qresid()
function in the statmod
package (you
may need to install this package).
In class, we used the following code to simulate data for which the logistic regression model assumptions hold:
# simulate a single explanatory variable from a Normal distribution
<- rnorm(1000)
x
# create P(Y = 1 | X) for each entry in x
# Here log odds = -1 + 2x
<- exp(-1 + 2*x)/(1 + exp(-1 + 2*x))
p
# Finally, simulate a binary response at each x
<- rbinom(1000, 1, p)
y
# fit the model and plot the quantile residuals against x
# add a smooth fit to see if there is a relationship
<- glm(y ~ x, family = binomial)
m1 data.frame(x = x, residuals = qresid(m1)) %>%
ggplot(aes(x = x, y = residuals)) +
geom_point() +
geom_smooth() +
theme_bw()
Now, let’s see what happens when we break a logistic regression assumption!
Questions
- Modify the simulation code above so that the log odds are
not a linear function of \(X\). Fit the same logistic regression model
as above (
y ~ x
, which is now wrong) and make a quantile residual plot. Does the plot show that the assumptions are violated?
Part II
In the second part of this class activity, we will use diagnostic tools to help fit a logistic regression model on real data. Here we work with data from a website called ScienceForums.Net (SFN), which has been open since 2002 and hosts conversations on a range of topics from biological and physical science to religion and philosophy. Each row in the data represents one ‘thread’, which is comprised of a series of posts stemming from an initial post. For each thread, we have some information that SFN collects such as the number of views and the number of authors. The threads present in the data are a random sample of threads from 2002-2014, with the data collected in 2014. SFN moderators are interested in using this data to determine which threads warrant the most attention.
You can load the SFN data into R by
<- read.csv("https://sta712-f22.github.io/class_activities/sfn.csv") sfn
The sfn
dataset contains the following columns:
Age
: the age of the thread (in days) when the data was collected in 2014, measured from the first post in the threadState
: sometimes moderators close threads if they are inappropriate. closed indicates the thread has been closed, otherwise State is openPosts
: the number of posts in the threadViews
: the total number of views of the threadDuration
: the number of days between the first and last posts in the threadAuthors
: the number of distinct authors posting in the threadAuthorExperience
: the number of days the author of the first post in the thread had been registered on SFN when the thread began (0 indicates they registered that day)DeletedPosts
: the number of posts in the thread that have been deleted by a moderatorForum
: the forum in which the thread was posted (e.g., Science)AuthorBanned
: whether the original author of the thread is currently banned from posting on SFN (at the time of data collection, not when the thread was first posted)
Questions
Moderators are interested in predicting whether a thread will have any posts which need to be deleted.
Create a new variable,
HasDeleted
, which is = 1 when the number of deleted posts is > 0, and 0 otherwise.Fit a logistic regression model to predict whether any posts have been deleted, using explanatory variables like the number of views, the number of posts, and the number of authors.
Create empirical logit and quantile residual plots for your fitted model. Does it look like any transformations are needed on your explanatory variables?
The below function, logodds_plot
, can be used to create
an empirical logit plot. To use the function, you will need to specify
the following arguments
data
: the dataset of interest (e.g.,sfn
)num_bins
: the number of bins to use- The number of bins should be chosen based on the size of the data.
For example, with
bin_method = "equal_size"
, we probably want at least 15 observations per bin, sonum_bins
< (number of observations)/15
- The number of bins should be chosen based on the size of the data.
For example, with
bin_method
: whether to choose bins with the same number of observations ("equal_size"
) or the same width ("equal_width"
)x
: the name of the column containing the explanatory variable (e.g.,"Views"
). The quotation marks are neededy
: the name of the column containing the response (e.g.,"HasDeleted"
). The quotation marks are neededgrouping
: Fit a separate line for different levels of the grouping variablereg_formula
: This is the shape of the relationship you want to plot. If you want a line, this is y ~ x (the default). Some other examples:y ~ log(x)
: a log transformation of xy ~ sqrt(x)
: a square root transformation of xy ~ poly(x, 2)
: a second-order polynomialy ~ poly(x, 3)
: a third-order polynomial
<- function(data, num_bins, bin_method,
logodds_plot grouping = NULL, reg_formula = y ~ x){
x_name, y_name,
if(is.null(grouping)){
<- data.frame(x = data %>% pull(x_name),
dat y = data %>% pull(y_name),
group = 1)
else {
} <- data.frame(x = data %>% pull(x_name),
dat y = data %>% pull(y_name),
group = data %>% pull(grouping))
}
if(bin_method == "equal_size"){
<- dat %>%
logodds_table drop_na() %>%
arrange(group, x) %>%
group_by(group) %>%
mutate(obs = y,
bin = rep(1:num_bins,
each=ceiling(n()/num_bins))[1:n()]) %>%
group_by(bin, group) %>%
summarize(mean_x = mean(x),
prop = mean(c(obs, 0.5)),
num_obs = n()) %>%
ungroup() %>%
mutate(logodds = log(prop/(1 - prop)))
else {
} <- dat %>%
logodds_table drop_na() %>%
group_by(group) %>%
mutate(obs = y,
bin = cut(x,
breaks = num_bins,
labels = F)) %>%
group_by(bin, group) %>%
summarize(mean_x = mean(x),
prop = mean(c(obs, 0.5)),
num_obs = n()) %>%
ungroup() %>%
mutate(logodds = log(prop/(1 - prop)))
}
if(is.null(grouping)){
%>%
logodds_table ggplot(aes(x = mean_x,
y = logodds)) +
geom_point(size=2) +
geom_smooth(se=F, method="lm", formula = reg_formula) +
theme_bw() +
labs(x = x_name,
y = "Empirical log odds") +
theme(text = element_text(size=25))
else {
} %>%
logodds_table ggplot(aes(x = mean_x,
y = logodds,
color = group,
shape = group)) +
geom_point(size=2) +
geom_smooth(se=F, method="lm", formula = reg_formula) +
theme_bw() +
labs(x = x_name,
y = "Empirical log odds",
color = grouping,
shape = grouping) +
theme(text = element_text(size=25))
}
}