Class Activity, September 5
In this class activity, we will explore leverage and Cook’s distance for identifying influential points (i.e., observations which can substantially change our fitted regression model).
The following code generates data with a potential outlier at \(x = -2\):
# simulate a single explanatory variable from a Normal distribution
x <- rnorm(100)

# create P(Y = 1 | X) for each entry in x
# Here log odds = -1 + 2x
p <- exp(-1 + 2*x)/(1 + exp(-1 + 2*x))

# Finally, simulate a binary response at each x
y <- rbinom(100, 1, p)

# add the potential outlier: a response of 1 at x = -2
x1 <- c(x, -2)
y1 <- c(y, 1)
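The questions in this activity refer to a fitted model, but the simulation code stops before fitting one. The sketch below is one possible way to proceed, assuming the intended model is a logistic regression of y1 on x1 fit with glm. The seed and the object names m0 and m1 are assumptions made for illustration and are not part of the activity.

```r
# Assumption: a fixed seed, used only so this sketch is reproducible
set.seed(5)

# Simulate the data as in the activity
x <- rnorm(100)
p <- exp(-1 + 2*x)/(1 + exp(-1 + 2*x))
y <- rbinom(100, 1, p)

# Add the potential outlier: a response of 1 at x = -2
x1 <- c(x, -2)
y1 <- c(y, 1)

# Assumption: the fitted model is a logistic regression of y1 on x1
m1 <- glm(y1 ~ x1, family = binomial)

# For comparison, fit the same model without the added point
m0 <- glm(y ~ x, family = binomial)

# Side-by-side coefficients (intercept and slope) for the two fits
rbind(with_outlier = coef(m1), without_outlier = coef(m0))
```

Comparing the two rows gives a rough sense of how much a single added observation moves the estimated intercept and slope.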
Questions
Run the code above, then answer the following questions:
The leverage values for a fitted model can be computed in R with the hatvalues(...) function. Plot leverage against the predictor x; does leverage always increase as we move away from the center of \(X\)?
Cook’s distance can be computed in R with the cooks.distance(...) function. Plot Cook’s distance against x. Is the potential outlier identified as an influential point?
Try changing the location of the outlier from \(x = -2\) to \(x =\) something else. How does Cook’s distance change as we move the potential outlier?
Now increase the sample size of the simulated data. How does sample size impact whether an outlier is influential?
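As a starting point for the questions above, here is a minimal sketch that wraps the simulation in a function so the outlier location and the sample size are easy to vary. The helper outlier_diagnostics, its arguments n and x_out, the seed, and the choice of a logistic regression fit with glm are all assumptions made for illustration. Only hatvalues and cooks.distance come from the activity itself.

```r
# Hypothetical helper: simulate n points from the activity's model, add one
# point at (x_out, y = 1), fit a logistic regression, and return the
# leverage and Cook's distance for every observation
outlier_diagnostics <- function(n = 100, x_out = -2) {
  x <- rnorm(n)
  p <- exp(-1 + 2*x)/(1 + exp(-1 + 2*x))
  y <- rbinom(n, 1, p)
  x1 <- c(x, x_out)
  y1 <- c(y, 1)
  m1 <- glm(y1 ~ x1, family = binomial)
  data.frame(x = x1, leverage = hatvalues(m1), cooks = cooks.distance(m1))
}

set.seed(5)  # assumption: seed chosen only for reproducibility
d <- outlier_diagnostics(n = 100, x_out = -2)

# The added point is the last row; draw it in red
cols <- c(rep("black", nrow(d) - 1), "red")
plot(d$x, d$leverage, col = cols, xlab = "x", ylab = "Leverage")
plot(d$x, d$cooks, col = cols, xlab = "x", ylab = "Cook's distance")

# Cook's distance of the added point at several locations (n fixed at 100)
sapply(c(-3, -2, -1, 0), function(x_out) {
  d <- outlier_diagnostics(n = 100, x_out = x_out)
  d$cooks[nrow(d)]
})

# Cook's distance of the added point at several sample sizes (x_out = -2)
sapply(c(100, 500, 1000), function(n) {
  d <- outlier_diagnostics(n = n, x_out = -2)
  d$cooks[nrow(d)]
})
```

Each call to the helper simulates a fresh data set, so the values vary from run to run. Repeating the calls a few times helps separate the effect of the outlier location or sample size from simulation noise.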