1. No offset is needed in this model. For all students, we are interested in the number of articles published in the last three years of their program. If we compared the number of articles published by students at different points in the PhD, then we would want to include the length of time they had been a PhD student as an offset term.

  2. Based on the empirical log mean plot, it seems reasonable to model the log mean number of articles as a linear function of PhD prestige. I don’t think a transformation is needed for Prestige.

logmean_plot(articles, 35, "equal_size", "phd", "art")

  1. There appears to be a nonlinear relationship between the log means and the number of articles published by the mentor. A log transformation seems like it might be appropriate.
p1 <- logmean_plot(articles, 25, "equal_size", "ment", "art")
p2 <- logmean_plot(articles, 25, "equal_size", "ment", "art", 
                   reg_formula = y ~ log(x+1))

grid.arrange(p1, p2, ncol=2)

  1. Based on the plots below, we may want to include an interaction between PhD program prestige and marital status, though honestly it is hard to tell. I will err on the side of simplicity here, and leave out the interaction term.
p1 <- logmean_plot(articles, 30, "equal_size", "phd", "art",
                   grouping = "mar")

p2 <- logmean_plot(articles, 30, "equal_size", "phd", "art",
                   grouping = "fem")

p3 <- articles %>% 
  mutate(log_ment = log(ment + 1)) %>%
  logmean_plot(30, "equal_size", "log_ment", "art",
                   grouping = "mar")

p4 <- articles %>% 
  mutate(log_ment = log(ment + 1)) %>%
  logmean_plot(30, "equal_size", "log_ment", "art",
                   grouping = "fem")

grid.arrange(p1, p2, p3, p4, ncol=2)

  1. For simplicity I’ll leave out any interaction terms for now, since it isn’t clear from EDA whether the interactions are needed. The model is therefore

\[Articles_i \sim Poisson(\lambda_i)\]

\[\log(\lambda_i) = \beta_0 + \beta_1 Female_i + \beta_2 Married_i + \beta_3 Kids_i + \\ \beta_4 Prestige_i + \beta_5 \log(Mentor_i + 1)\]

  1. Fitted model:
m1 <- glm(art ~ fem + mar + kid5 + phd + log(ment + 1),
          data = articles, family = poisson)
summary(m1)
## 
## Call:
## glm(formula = art ~ fem + mar + kid5 + phd + log(ment + 1), family = poisson, 
##     data = articles)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6018  -1.3180  -0.3248   0.5678   5.2470  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -0.09434    0.11098  -0.850  0.39529    
## femWomen      -0.23390    0.05447  -4.294 1.75e-05 ***
## marMarried     0.16767    0.06125   2.737  0.00619 ** 
## kid5          -0.17224    0.03992  -4.315 1.60e-05 ***
## phd           -0.02302    0.02717  -0.847  0.39680    
## log(ment + 1)  0.37546    0.02957  12.698  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1817.4  on 914  degrees of freedom
## Residual deviance: 1598.8  on 909  degrees of freedom
## AIC: 3278.5
## 
## Number of Fisher Scoring iterations: 5
  1. A one-unit increase in prestige is associated with a change in the average number of articles published by a factor of \(e^{-0.023} = 0.977\), holding the other variables fixed.

  2. The estimated average number of articles published for a student of this description would be 2.47.

a <- c(1, 0, 1, 0, 3, log(10 + 1))
exp(t(coef(m1)) %*% a)
##          [,1]
## [1,] 2.470879
  1. Quantile residual plots:
library(statmod)

p1 <- articles %>%
  mutate(qresids = qresid(m1)) %>%
  ggplot(aes(x = phd, y = qresids)) +
  geom_point() +
  geom_smooth() +
  theme_bw() +
  labs(x = "PhD prestige", y = "Quantile residuals")

p2 <- articles %>%
  mutate(qresids = qresid(m1)) %>%
  ggplot(aes(x = log(ment + 1), y = qresids)) +
  geom_point() +
  geom_smooth() +
  theme_bw() +
  labs(x = "log(Number of mentor articles + 1)", y = "Quantile residuals")

grid.arrange(p1, p2, ncol=2)

The shape assumption seems reasonable for both PhD prestige and the transformed number of mentor articles.

  1. Cook’s distance:
plot(log(articles$ment + 1), cooks.distance(m1))

The largest Cook’s distance is about 0.25, which is below the usual thresholds of 0.5 or 1 at which we are concerned about influential points.

  1. Variance inflation factors:
library(car)

vif(m1)
##           fem           mar          kid5           phd log(ment + 1) 
##      1.102697      1.259549      1.279447      1.108363      1.105538

There does not appear to be any multicollinearity issues in our data.

  1. There are no apparent assumption violations.

  2. The residual deviance is 1598.8, on 909 degrees of freedom. This gives a goodness of fit p-value of essentially 0, so our model may not be a good fit to the data.

pchisq(1598.8, df=909, lower.tail=F)
## [1] 1.142904e-40
  1. Mean deviance estimate: \(\widehat{\phi} = 1598.8/909 = 1.76\)

Pearson estimate: \(\widehat{\phi} = 1.81\)

pearson_resids <- residuals(m1, type="pearson")
sum(pearson_resids^2)/df.residual(m1)
## [1] 1.814574
  1. 95% confidence interval for \(\beta_2\): \(0.168 \pm 1.96 \cdot \sqrt{1.81} \cdot (0.061) = (0.0071, 0.329)\)

We are 95% confident that the average number of articles published by married students is between \(e^{0.0071} = 1.007\) and \(e^{0.329} = 1.39\) times higher than the average number of articles published by single students, holding all other variables constant.