# load necessary packages
library(dplyr)
library(plotly)
library(stargazer)
library(knitr)
library(MASS)
library(arm)
library(AER)
library(margins)
library(stats)
set.seed(123)
Gelman and Hill (2007) collected New York City (NYC) “stop and frisk” data for 175,000 stops over a 15-month period in 1998-19991 and it is this data that we will use in our count model analysis. We use count models, such as the Poisson, when each data point \(y_{i}\) has no natural limit or number of successes (i.e., a binomial).
precincts
- NYC police precinct, numbered 1-75
ethnicity
- 1 = black, 2 = hispanic, 3 = white
crime type
- 1 = violent, 2 = weapons, 3 = property, 4 = drug
past.arrests
- the number of arrests within New York City in 1997 as recorded by the Division of Criminal Justice Services (DCJS) of New York State. This is a proxy for number of crimes committed by each ethnic group.
pop
- population
df <- read.table("https://raw.githubusercontent.com/afogarty85/replications/master/Stop%20and%20Frisk/stop_and_frisk.dat",
skip = 6, header = TRUE)
df <- na.omit(df)
df <- aggregate(cbind(stops, past.arrests) ~ precinct + eth, data = df, sum)
df$eth <- dplyr::recode(df$eth, `1` = 'black', `2` = 'hispanic', `3` = 'white')
To begin our analysis, we first specify a hypothesis:
\(H_{1}\): In a comparison of stop and frisk events, blacks are more likely to be searched than hispanics or whites.
Since our dependent variable are the number of stop and frisk events (stops
), let’s first view our data’s distribution by ethnicity (eth
). The barplot below shows quite clearly that blacks are more likely to be stopped than hispanics or whites.
dv_count <- df %>% group_by(eth) %>% tally(stops)
dv_count$eth <- car::recode(dv_count$eth, "1='black'; 2='hispanic'; 3='white'")
plot_ly(data = dv_count, type = 'bar') %>%
add_trace(x = ~factor(eth), y =~n, name = '') %>%
layout(showlegend = FALSE,
xaxis = list(
title = 'Ethnicity'),
yaxis = list(
title = 'Stop and Frisk Events'))
Because we have a continuous dependent variable, we can view its relationship with our primary independent variable of interest quite easily through a scatterplot. We can see a clear pattern across most NYC precincts where blacks are stopped and frisked more than hispanics or whites.
We begin by fitting our model using a poisson. The poisson distribution is used to model variation in count data, meaning data that can equal 0, 1, 2, etc.
fit <- glm(stops ~ factor(eth) + factor(precinct),
family = poisson(link = 'log'), data = df, offset=log(past.arrests))
We begin by checking whether or not our data is overdispersed. Overdispersion occurs when our dependent variable’s variance exceeds the mean. The poisson does not give us an independent variance parameter \(\sigma\) which means that our result can be overdispersed.
## [1] 584.0889
## [1] 376742.9
We can also test for this formally with a Z-score test. Z-tests the hypothesis that the Poisson model is overdispersed. In practice, it tests whether the data should be modeled as Poisson or negative binomial. We test for overdispersion by computing the following:
# check for overdispersion
mu <- predict(fit, type = 'response')
z <- ((df$stops - mu)^2 - df$stops) / (mu * sqrt(2))
zscore <- lm(z ~ 1)
summary(zscore)
##
## Call:
## lm(formula = z ~ 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.206 -9.262 -5.180 4.209 75.677
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.4973 0.9085 10.45 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.63 on 224 degrees of freedom
An alternative method is available here, which is only applicable to the relationship of poisson and negative binomial models, as well as zero inflated models. What we are testing here is whether or not the dispersion parameter is significantly greater than 0, corresponding to a Poisson dispersion equal to 1.0; meaning we should use a negative binomial.
# fit a negative binomial for this test
nbreg <- glm.nb(stops ~ factor(eth) + factor(precinct) + offset(log(past.arrests)), data = df)
# boundary likelihood ratio test; test whether dispersion parameter greater than 0
l_poisson <- logLik(fit)[1] # get log likelihood
l_nb <- logLik(nbreg)[1] # get log likelihood
LRtest <- -2 * (l_poisson - (l_nb)) # calculate likelihood ratio
pchisq(LRtest, 1, lower.tail = FALSE) / 2
## [1] 0
Given the results above, we can reject the null of no overdispersion which means that real overdisperson exists in the data and that a negative binomial is preferred. To account for overdispersion, we can use a number of different models to better fit our count data which we will use later.
Tabling model fit for the moment, we can interpret our poisson model’s results directly which states: Compared to the baseline category (blacks), hispanics have approximately the same rate of being stopped while whites have a 1 - exp(-0.41900122)
or 34% lower chance of being stopped.
stargazer(fit$coefficients[1:3],
type = "html",
title = "Stop and Frisk Results",
dep.var.labels = "Stop and Frisks",
covariate.labels = c("Intercept", "Hispanics", "Whites"))
Intercept | Hispanics | Whites |
-1.379 | 0.010 | -0.419 |
We can also use the predict
command to generate predicted counts of stop and frisks for each row in our dataset.
We can also use the prediction
command to generate average predicted counts for our primary independent variable, eth
. We can see that blacks and hispanics are, on average, exceedingly more likely to be stopped than whites.
## Data frame with 675 predictions from
## glm(formula = stops ~ factor(eth) + factor(precinct), family = poisson(link = "log"),
## data = df, offset = log(past.arrests))
## with average predictions:
## eth x
## black 621.3
## hispanic 627.7
## white 408.7
We use quasipossion models to account for overdispersion in count data. To fit a quasipossion, we make a simple change to the model specifications. We can see from its results that it no longer assumes an overdispersed parameter of 1. A careful inspection will also show much higher standard errors which is how quasipossion handles the overdispersion.
fit2 <- glm(stops ~ factor(eth) + factor(precinct),
family = quasipoisson(link = 'log'), data = df, offset = log(past.arrests))
stargazer(fit2$coefficients[1:3],
type = "html",
title = "Stop and Frisk Results",
dep.var.labels = "Police Stops",
covariate.labels = c("Intercept", "Hispanics", "Whites"))
Intercept | Hispanics | Whites |
-1.379 | 0.010 | -0.419 |
Given that we know the poisson is a poor fit for our data, we will compare the poisson to the negative binomial.
#Negative Binomial
nbreg <- glm.nb(stops ~ factor(eth) + factor(precinct) + offset(log(past.arrests)), data = df)
We can see that the model’s coefficients are not all that different when we compare the poisson to the negative binomial.
stargazer(nbreg$coefficients[1:3],
type = "html",
title = "Stop and Frisk Results",
dep.var.labels = "Police Stops",
covariate.labels = c("Intercept", "Hispanics", "Whites"))
Intercept | Hispanics | Whites |
-1.227 | 0.009 | -0.486 |
However, by comparing model fit, we can see that the negative binomial fits our data much better than the poisson.
## df AIC
## fit 77 5287.752
## nbreg 78 2759.344
We can also use the prediction
command to generate average predicted counts for our primary independent variable, eth
. We can see that blacks and hispanics are, on average, exceedingly more likely to be stopped than whites.
nbreg.count <- predict(nbreg, type="response")
prediction(nbreg, at=list(eth = c('black','hispanic','white')))
## Data frame with 675 predictions from
## glm.nb(formula = stops ~ factor(eth) + factor(precinct) + offset(log(past.arrests)),
## data = df, init.theta = 19.55174066, link = log)
## with average predictions:
## eth x
## black 606.1
## hispanic 611.3
## white 372.8
We can also visualize the difference in predicted counts between models at each NYC precinct by doing the following:
count.compare <- data.frame(poisson = poisson.count,
nbreg = nbreg.count,
eth = df$eth)
plot_ly(type = 'scatter', mode = 'markers') %>%
add_trace(y =~count.compare$poisson, name = 'Poisson') %>%
add_trace(y =~count.compare$nbreg, name = 'Negative Binomial') %>%
layout(xaxis = list(
title = 'NYC Police Precinct'),
yaxis = list(
title = 'Expected Count'))
While we have lightly examined the poisson and negative binomial models for count data, there exists a wider spectrum of count models which excel at modeling count data with excessive amounts of zeros in the dependent variable. These models are: zero inflated negative binomial and zero inflated negative poisson. In this section, we will analyze count data with many zeros and use these models to better fit our data.
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
# load data
df <- read.csv('https://raw.githubusercontent.com/afogarty85/replications/master/German%20Socio-Economic%20Panel/mdvis.csv')
The data we will use is from a subset of the German Socio-Economic Panel (SOEP). The subset was created by Rabe-Hesketh and Skrondal (2005). Only working women are included in these data. Beginning in 1997, German health reform in part entailed a 200 co-payment as well as limits in provider reimbursement. Patients were surveyed for the one year panel (1996) prior to and the one year panel (1998) after reform to assess whether the number of physician visits by patients declined - which was the goal of reform legislation. The response, or variable to be explained by the model, is numvisit, which indicates the number of patient visits to a physician’s office during a three month period. The data is derived from the COUNT
package.
Some of the data is described as follows:
numvisit
- the number of patient visits to a physician’s office during a three month period
reform
- health reform in part entailed a 200 co-payment as well as limits in provider reimbursement; 1 = post-reform, 0 = pre-reform
badh
- 1 = bad health, 0 = not bad health
age
- patient age
educ3
- patient has a 12th grade education
We can see that the data appears to be clean and has no missing data from the summary table below.
## numvisit reform badh age
## Min. : 0.000 Min. :0.0000 Min. :0.0000 Min. :20.00
## 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:28.00
## Median : 2.000 Median :1.0000 Median :0.0000 Median :34.00
## Mean : 2.589 Mean :0.5061 Mean :0.1136 Mean :36.78
## 3rd Qu.: 3.000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:46.00
## Max. :60.000 Max. :1.0000 Max. :1.0000 Max. :60.00
## educ educ1 educ2 educ3
## Min. :1.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:2.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :2.000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :2.091 Mean :0.2465 Mean :0.4158 Mean :0.3377
## 3rd Qu.:3.000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :3.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## agegrp age1 age2 age3
## Min. :1.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :1.000 Median :1.0000 Median :0.0000 Median :0.0000
## Mean :1.555 Mean :0.6071 Mean :0.2313 Mean :0.1617
## 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :3.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## loginc
## Min. :5.832
## 1st Qu.:7.484
## Median :7.725
## Mean :7.713
## 3rd Qu.:7.949
## Max. :9.333
Similar to the poisson, we begin by checking the extent to which the variance exceeds the mean. We can see from the results below that we probably have some overdispersion.
## [1] 2.589133
## [1] 16.12985
Unlike the precinct data, this dataset contains a large number of zeros; roughly 30% of the dataset, which leads us to suspect we might need a zero inflated model.
##
## FALSE TRUE
## 1562 665
Prior to fitting other count models, we should start by fitting a poisson and then determine whether or not alternatives should be sought. According to the tests below, we can see that our data is indeed overdispersed.
poisson <- glm(numvisit ~ factor(reform) + factor(badh) + educ3 + age, data = df, family = "poisson")
summary(poisson)
##
## Call:
## glm(formula = numvisit ~ factor(reform) + factor(badh) + educ3 +
## age, family = "poisson", data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.9745 -1.9500 -0.6999 0.5392 12.4506
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.642769 0.050350 12.766 < 2e-16 ***
## factor(reform)1 -0.135512 0.026544 -5.105 3.30e-07 ***
## factor(badh)1 1.122070 0.030199 37.156 < 2e-16 ***
## educ3 -0.119454 0.029045 -4.113 3.91e-05 ***
## age 0.005204 0.001235 4.212 2.53e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 8848.8 on 2226 degrees of freedom
## Residual deviance: 7423.7 on 2222 degrees of freedom
## AIC: 11899
##
## Number of Fisher Scoring iterations: 5
# check for overdispersion
mu <- predict(poisson, type = 'response')
z <- ((df$numvisit - mu)^2 - df$numvisit) / (mu * sqrt(2))
zscore <- lm(z ~ 1)
summary(zscore)
##
## Call:
## lm(formula = z ~ 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.144 -2.900 -2.131 -0.831 286.158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3383 0.2729 8.568 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.88 on 2226 degrees of freedom
Adding onto the fact that 30% of the observations for our dependent variable are zeros, we should move to try a zero inflated model. We begin by fitting two different zero inflated models, the negative binomial and the poisson. We then compare their model fit through AIC and then through the vuong test.
zinb <- zeroinfl(numvisit ~ factor(reform) + factor(badh) + educ3 + age, data = df, dist = "negbin")
zip <- zeroinfl(numvisit ~ factor(reform) + factor(badh) + educ3 + age, data = df, dist = "poisson")
AIC(zinb, zip, poisson) # ZINB fits better
## df AIC
## zinb 11 9145.569
## zip 10 10800.434
## poisson 5 11899.206
The standard fit test for ZINB models is the Vuong test which compares the predicted fit values of the ZINB and ZIP; assessing whether there is a significant difference between the two.
## Vuong Non-Nested Hypothesis Test-Statistic:
## (test-statistic is asymptotically distributed N(0,1) under the
## null that the models are indistinguishible)
## -------------------------------------------------------------
## Vuong z-statistic H_A p-value
## Raw 7.121645 model1 > model2 5.3324e-13
## AIC-corrected 7.121645 model1 > model2 5.3324e-13
## BIC-corrected 7.121645 model1 > model2 5.3324e-13
# positive values: model 1 preferred over model 2
# negative values: model 2 preferred over model 1,
A border likelihood ratio test can also be run to compare ZINB with ZIP. The method is identical to the one used previously when comparing poisson to negative binomial.
# boundary likelihood ratio test; test whether dispersion parameter greater than 0
l_poisson <- logLik(zip)[1] # get log likelihood
l_nb <- logLik(zinb)[1] # get log likelihood
LRtest <- -2 * (l_poisson - (l_nb)) # calculate likelihood ratio
pchisq(LRtest, 1, lower.tail = FALSE) / 2
## [1] 0
Given the results above, we can reject the null of no overdispersion which means that real overdisperson exists in the data and that a negative binomial is preferred.
##
## Call:
## zeroinfl(formula = numvisit ~ factor(reform) + factor(badh) + educ3 +
## age, data = df, dist = "negbin")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -0.9662 -0.7968 -0.3713 0.3241 10.0931
##
## Count model coefficients (negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.605392 0.104439 5.797 6.77e-09 ***
## factor(reform)1 -0.107721 0.055568 -1.939 0.05256 .
## factor(badh)1 1.082017 0.076344 14.173 < 2e-16 ***
## educ3 -0.121444 0.062047 -1.957 0.05031 .
## age 0.006818 0.002471 2.760 0.00579 **
## Log(theta) 0.069733 0.074237 0.939 0.34756
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.92785 2.65363 -1.857 0.0633 .
## factor(reform)1 1.01365 1.47608 0.687 0.4923
## factor(badh)1 -2.09529 3.91362 -0.535 0.5924
## educ3 -7.97084 35.73915 -0.223 0.8235
## age 0.03514 0.03415 1.029 0.3035
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Theta = 1.0722
## Number of iterations in BFGS optimization: 51
## Log-likelihood: -4562 on 11 Df
We interpret the coefficients in the model in the following way:
## [1] 0.8978781
## [1] 2.755641
Next, if we want to generate predicted counts with standard errors, we do so by running simulations.
n.draws <- 1000 # number of simulated sets of coefficients
sim <- rmvnorm(n.draws, coef(zinb), vcov(zinb)) # sim
# sequence independent variable
sequence <- seq(min(df$age), max(df$age), by = 1)
# Empty vectors to store the point estimates and confidence intervals
pe.ov <- as.matrix(NA, nrow=length(sequence))
lo.ov <- as.matrix(NA, nrow=length(sequence))
hi.ov <- as.matrix(NA, nrow=length(sequence))
for(j in 1:length(sequence)){ # Loop goes across distance values
# Set the other independent variables to all of their observed values
# factor(reform) + factor(badh) + educ3 + age
x.sequence <- data.frame(intercept = 1,
reform = df$reform ,
badh = df$badh,
educ3 = df$educ3,
age = sequence[j])
x.sequence <- data.matrix(x.sequence)
# Save average of linear predictor across all the observations
pp <- matrix(NA, nrow = n.draws)
for(i in 1:n.draws){# Loop over coefficients
# For each observation in the dataset
xb <- exp(x.sequence %*% sim[i, 1:5]) # log link function
pp[i] <- mean(xb)
}
# Compute point estimate and CI for each value of distance
pe.ov[j] <- quantile(pp, 0.5)
lo.ov[j] <- quantile(pp, 0.025)
hi.ov[j] <- quantile(pp, 0.975)
}
# put results in df
result_df = data.frame(age = sequence,
predicted_y = pe.ov,
ci_low = lo.ov,
ci_high = hi.ov)
x <- list(title = "Patient Age", dtick = 1, zeroline = FALSE)
y <- list(title = "Expected Count")
plot_ly(data = result_df, type = 'scatter', mode = 'lines') %>%
add_trace(x =~age, y =~ci_high, line = list(color = 'transparent'),
name = 'Upper CI', showlegend = FALSE) %>%
add_trace(x =~age, y =~ci_low, fill = 'tonexty', fillcolor='rgba(0,100,80,0.2)',
line = list(color = 'transparent'), name = 'Lower CI', showlegend = FALSE) %>%
add_trace(x =~age, y =~predicted_y, line = list(color='rgb(0,100,80)'),
name = 'Expected Count') %>%
layout(xaxis = x,
yaxis = y)
We interpret the results as follows:
##
## Attaching package: 'faraway'
## The following objects are masked from 'package:survival':
##
## rats, solder
## The following objects are masked from 'package:car':
##
## logit, vif
## The following objects are masked from 'package:arm':
##
## fround, logit, pfround
n.draws <- 1000 # number of simulated sets of coefficients
sim <- rmvnorm(n.draws, coef(zinb), vcov(zinb)) # sim
# sequence independent variable
sequence <- seq(min(df$age), max(df$age), by = 1)
# Empty vectors to store the point estimates and confidence intervals
pe.ov <- as.matrix(NA, nrow=length(sequence))
lo.ov <- as.matrix(NA, nrow=length(sequence))
hi.ov <- as.matrix(NA, nrow=length(sequence))
for(j in 1:length(sequence)){ # Loop goes across distance values
# Set the other independent variables to all of their observed values
x.sequence <- data.frame(intercept = 1,
reform = df$reform ,
badh = df$badh,
educ3 = df$educ3,
age = sequence[j])
x.sequence <- data.matrix(x.sequence)
# Save average of linear predictor across all the observations
pp <- matrix(NA, nrow = n.draws)
for(i in 1:n.draws){# Loop over coefficients
# For each observation in the dataset
xb <- ilogit(x.sequence %*% sim[i, 6:10]) # logit link function
pp[i] <- mean(xb)
}
# Compute point estimate and CI for each value of distance
pe.ov[j] <- quantile(pp, 0.5)
lo.ov[j] <- quantile(pp, 0.025)
hi.ov[j] <- quantile(pp, 0.975)
}
result_df = data.frame(age = sequence,
predicted_y = pe.ov,
ci_low = lo.ov,
ci_high = hi.ov)
x <- list(title = "Age", dtick = 1, zeroline = FALSE)
y <- list(title = "Probability of Zero Doctor Visits")
plot_ly(data = result_df, type = 'scatter', mode = 'lines') %>%
add_trace(x =~age, y =~ci_high, line = list(color = 'transparent'),
name = 'Upper CI', showlegend = FALSE) %>%
add_trace(x =~age, y =~ci_low, fill = 'tonexty', fillcolor='rgba(0,100,80,0.2)',
line = list(color = 'transparent'), name = 'Lower CI', showlegend = FALSE) %>%
add_trace(x =~age, y =~predicted_y, line = list(color='rgb(0,100,80)'),
name = 'Predicted Probability') %>%
layout(xaxis = x,
yaxis = y)
We interpret the results as follows:
The graph here is not all that exciting because the variable is not statistically significant. The point here is just to demonstrate how we would produce predicted probabilities for the logit given other data.
Kropko, Jonathan. PLAD 8320: Generalized Linear Models. University of Virginia.
Gelman, Andrew, and Jennifer Hill. Data analysis using regression and multilevel/hierarchical models. Cambridge university press, 2006.
Hilbe, Joseph M. Negative binomial regression. Cambridge University Press, 2011.
https://www.rdocumentation.org/packages/COUNT/versions/1.3.4/topics/mdvis