No offset is needed in this model. For all students, we are interested in the number of articles published in the last three years of their program. If we compared the number of articles published by students at different points in the PhD, then we would want to include the length of time they had been a PhD student as an offset term.
Based on the empirical log mean plot, it seems reasonable to model the log mean number of articles as a linear function of PhD prestige. I don’t think a transformation is needed for Prestige.
logmean_plot(articles, 35, "equal_size", "phd", "art")
p1 <- logmean_plot(articles, 25, "equal_size", "ment", "art")
p2 <- logmean_plot(articles, 25, "equal_size", "ment", "art",
reg_formula = y ~ log(x+1))
grid.arrange(p1, p2, ncol=2)
p1 <- logmean_plot(articles, 30, "equal_size", "phd", "art",
grouping = "mar")
p2 <- logmean_plot(articles, 30, "equal_size", "phd", "art",
grouping = "fem")
p3 <- articles %>%
mutate(log_ment = log(ment + 1)) %>%
logmean_plot(30, "equal_size", "log_ment", "art",
grouping = "mar")
p4 <- articles %>%
mutate(log_ment = log(ment + 1)) %>%
logmean_plot(30, "equal_size", "log_ment", "art",
grouping = "fem")
grid.arrange(p1, p2, p3, p4, ncol=2)
\[Articles_i \sim Poisson(\lambda_i)\]
\[\log(\lambda_i) = \beta_0 + \beta_1 Female_i + \beta_2 Married_i + \beta_3 Kids_i + \\ \beta_4 Prestige_i + \beta_5 \log(Mentor_i + 1)\]
m1 <- glm(art ~ fem + mar + kid5 + phd + log(ment + 1),
data = articles, family = poisson)
summary(m1)
##
## Call:
## glm(formula = art ~ fem + mar + kid5 + phd + log(ment + 1), family = poisson,
## data = articles)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6018 -1.3180 -0.3248 0.5678 5.2470
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.09434 0.11098 -0.850 0.39529
## femWomen -0.23390 0.05447 -4.294 1.75e-05 ***
## marMarried 0.16767 0.06125 2.737 0.00619 **
## kid5 -0.17224 0.03992 -4.315 1.60e-05 ***
## phd -0.02302 0.02717 -0.847 0.39680
## log(ment + 1) 0.37546 0.02957 12.698 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1817.4 on 914 degrees of freedom
## Residual deviance: 1598.8 on 909 degrees of freedom
## AIC: 3278.5
##
## Number of Fisher Scoring iterations: 5
A one-unit increase in prestige is associated with a change in the average number of articles published by a factor of \(e^{-0.023} = 0.977\), holding the other variables fixed.
The estimated average number of articles published for a student of this description would be 2.47.
a <- c(1, 0, 1, 0, 3, log(10 + 1))
exp(t(coef(m1)) %*% a)
## [,1]
## [1,] 2.470879
library(statmod)
p1 <- articles %>%
mutate(qresids = qresid(m1)) %>%
ggplot(aes(x = phd, y = qresids)) +
geom_point() +
geom_smooth() +
theme_bw() +
labs(x = "PhD prestige", y = "Quantile residuals")
p2 <- articles %>%
mutate(qresids = qresid(m1)) %>%
ggplot(aes(x = log(ment + 1), y = qresids)) +
geom_point() +
geom_smooth() +
theme_bw() +
labs(x = "log(Number of mentor articles + 1)", y = "Quantile residuals")
grid.arrange(p1, p2, ncol=2)
The shape assumption seems reasonable for both PhD prestige and the transformed number of mentor articles.
plot(log(articles$ment + 1), cooks.distance(m1))
The largest Cook’s distance is about 0.25, which is below the usual thresholds of 0.5 or 1 at which we are concerned about influential points.
library(car)
vif(m1)
## fem mar kid5 phd log(ment + 1)
## 1.102697 1.259549 1.279447 1.108363 1.105538
There does not appear to be any multicollinearity issues in our data.
There are no apparent assumption violations.
The residual deviance is 1598.8, on 909 degrees of freedom. This gives a goodness of fit p-value of essentially 0, so our model may not be a good fit to the data.
pchisq(1598.8, df=909, lower.tail=F)
## [1] 1.142904e-40
Pearson estimate: \(\widehat{\phi} = 1.81\)
pearson_resids <- residuals(m1, type="pearson")
sum(pearson_resids^2)/df.residual(m1)
## [1] 1.814574
We are 95% confident that the average number of articles published by married students is between \(e^{0.0071} = 1.007\) and \(e^{0.329} = 1.39\) times higher than the average number of articles published by single students, holding all other variables constant.