Binary Regression Analysis

Andrew Fogarty

8/14/2019

# load necessary packages
library(carData)
library(stargazer)
library(dplyr)
library(knitr)
library(MASS)
library(faraway)
library(plotly)
library(arm)
library(margins)
set.seed(123)

1 Introduction

In October 1988, a plebiscite was held in Chile to determine whether Augusto Pinochet should extend his rule for another eight years. The package carData contains Chilean national survey data collected in April and May 1988. In this report, we remain agnostic about any single primary independent variable and instead explore the effects of several independent variables on the probability of voting for Pinochet's extension.

2 Data

The data frame contains the following variables:

data(Chile) # load data
summary(Chile) # summarize data
##  region     population     sex           age        education  
##  C :600   Min.   :  3750   F:1379   Min.   :18.00   P   :1107  
##  M :100   1st Qu.: 25000   M:1321   1st Qu.:26.00   PS  : 462  
##  N :322   Median :175000            Median :36.00   S   :1120  
##  S :718   Mean   :152222            Mean   :38.55   NA's:  11  
##  SA:960   3rd Qu.:250000            3rd Qu.:49.00              
##           Max.   :250000            Max.   :70.00              
##                                     NA's   :1                  
##      income         statusquo          vote    
##  Min.   :  2500   Min.   :-1.80301   A   :187  
##  1st Qu.:  7500   1st Qu.:-1.00223   N   :889  
##  Median : 15000   Median :-0.04558   U   :588  
##  Mean   : 33876   Mean   : 0.00000   Y   :868  
##  3rd Qu.: 35000   3rd Qu.: 0.96857   NA's:168  
##  Max.   :200000   Max.   : 2.04859             
##  NA's   :98       NA's   :17

To begin, we recode our dependent variable vote so that it is binary. From the summary above, we can see that vote currently contains four categories. A reasonable transformation is to recode undecided (U) and abstain (A) as NA. Next, we drop the missing values so we can proceed with the analysis.

Chile$vote <- with(Chile, ifelse(vote == "Y", 1, ifelse(vote=="N", 0, NA))) 
table(Chile$vote)
## 
##   0   1 
## 889 868
Chile <- na.omit(Chile)
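
As a quick sanity check, the number of complete cases left after listwise deletion should match the observation count reported by our models below:

nrow(Chile) # 1703 complete cases remain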

3 Analysis

Now, we can fit our first model. With binary dependent variables, we can choose among three link functions: logit, probit, and cloglog. In this analysis, we will look only at logit and probit for the sake of time and utility; cloglog models are rarely used, at least within political science. Our logit model can be specified formally as:

\[ Pr(Y) = \frac{\exp(\alpha + \beta_{1}statusquo_{i} + \beta_{2}age_{i} + \beta_{3}income_{i} + \beta_{4}sex_{i} + \beta_{5}education_{i})}{1 + \exp(\alpha + \beta_{1}statusquo_{i} + \beta_{2}age_{i} + \beta_{3}income_{i} + \beta_{4}sex_{i} + \beta_{5}education_{i})} \]

And our probit model can be specified formally as:

\[ Pr(Y) = \Phi(\alpha + \beta_{1}statusquo_{i} + \beta_{2}age_{i} + \beta_{3}income_{i} + \beta_{4}sex_{i} + \beta_{5}education_{i} ) \]

Translating this into R, our models are instantiated like so:

logit1 <- glm(vote ~ statusquo + age + income + sex + education, 
              data = Chile, family=binomial(link="logit")) 

probit1 <- glm(vote ~ statusquo + age + income + sex + education, 
               data = Chile, family=binomial(link="probit")) 
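
For reference, the rarely used cloglog variant would be specified the same way, swapping only the link argument; we sketch it here but do not pursue it further:

cloglog1 <- glm(vote ~ statusquo + age + income + sex + education, 
                data = Chile, family=binomial(link="cloglog")) 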

We can then compare the estimates of our two models via stargazer. Judging by the sign and significance of the coefficients, the two models return substantively identical results, although the magnitude of the coefficients differs noticeably for statusquo, sex, and education.

stargazer(logit1, probit1, type="text", 
          title="Who voted for Pinochet?", align = TRUE,
          dep.var.labels = "Vote Yes",
          column.labels = c("Logit", "Probit"),
          model.names = FALSE)
## 
## Who voted for Pinochet?
## ==============================================
##                       Dependent variable:     
##                   ----------------------------
##                             Vote Yes          
##                       Logit         Probit    
##                        (1)            (2)     
## ----------------------------------------------
## statusquo            3.183***      1.733***   
##                      (0.147)        (0.068)   
##                                               
## age                   0.001         0.0004    
##                      (0.007)        (0.004)   
##                                               
## income               -0.00000      -0.00000   
##                     (0.00000)      (0.00000)  
##                                               
## sexM                -0.572***      -0.319***  
##                      (0.203)        (0.105)   
##                                               
## educationPS         -0.960***      -0.505***  
##                      (0.342)        (0.175)   
##                                               
## educationS          -0.628***      -0.300**   
##                      (0.242)        (0.125)   
##                                               
## Constant             1.007***      0.509***   
##                      (0.376)        (0.194)   
##                                               
## ----------------------------------------------
## Observations          1,703          1,703    
## Log Likelihood       -353.758      -354.173   
## Akaike Inf. Crit.    721.517        722.346   
## ==============================================
## Note:              *p<0.1; **p<0.05; ***p<0.01
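
The scale difference between the two columns is expected: the logistic distribution has a standard deviation of \(\pi/\sqrt{3} \approx 1.8\), so logit coefficients typically run about 1.6 to 1.8 times the size of their probit counterparts. A quick check of the ratios:

coef(logit1) / coef(probit1) # e.g., statusquo: 3.183 / 1.733 is roughly 1.84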

3.1 Analysis: Error Rate

We can also see how well our model predicts our data. First, we pull the fitted values out of our model. Then we recode these predictions so that any prediction below 0.5 becomes a 0 and any prediction above 0.5 becomes a 1, creating two simple classes. Then we can create a table, or confusion matrix, to review the results.

Chile.pred <- logit1$fitted.values
Chile.pred[Chile.pred < 0.5] <- 0
Chile.pred[Chile.pred > 0.5] <- 1

The rows represent the observed data while the columns represent the predictions. We can see that our logit correctly classifies, or predicts, most cases: out of 867 No voters, the model predicts 808 correctly and generates 59 false positives, and out of 836 Yes voters, it predicts 772 correctly while generating 64 false negatives.

table(Chile$vote, Chile.pred)
##    Chile.pred
##       0   1
##   0 808  59
##   1  64 772

We can also compute a more general error rate, which tells us the proportion of observations the logit model predicts incorrectly. This is calculated as 1 - accuracy, where accuracy is the sum of the true positive and true negative diagonal divided by the total number of observations. Our results tell us that the model does relatively well: it misclassifies roughly 7% of our sample.

error.rate <- mean((logit1$fitted.values > 0.5 & Chile$vote == 0) | 
                     (logit1$fitted.values < 0.5 & Chile$vote == 1))
error.rate # 0.07222548
## [1] 0.07222548
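
Equivalently, the error rate falls straight out of the confusion matrix: accuracy is the sum of the diagonal divided by the total, and the error rate is its complement.

conf.mat <- table(Chile$vote, Chile.pred)
accuracy <- sum(diag(conf.mat)) / sum(conf.mat) # (808 + 772) / 1703
1 - accuracy # 0.07222548, matching the error rate above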

3.2 Analysis: Predictions

Predicted probability point estimates are the bread and butter of our binary models. While it is interesting to know that women were more likely than men to vote in favor of Pinochet, what we really want to know is by how much. What is the average probability that a woman votes for Pinochet? The probability that a female respondent will vote in favor of keeping the military regime in power is, on average, 0.58, after controlling for preference for the status quo, age, income, and education.

gender_levels <- levels(Chile$sex)

logit.prob <- sapply(gender_levels, FUN=function(x){
      mean(predict(logit1, type = "response", 
                   newdata = data.frame(statusquo = mean(Chile$statusquo), 
                     age = mean(Chile$age), 
                     income = mean(Chile$income), 
                     sex = x, 
                     education = "S")))
      })

probit.prob <- sapply(gender_levels, FUN=function(x){
      mean(predict(probit1, type = "response", 
                   newdata = data.frame(statusquo = mean(Chile$statusquo), 
                     age = mean(Chile$age), 
                     income = mean(Chile$income), 
                     sex = x, 
                     education = "S")))
      })

QI <- data.frame(gender = gender_levels,
                 logit.prob = logit.prob,
                 probit.prob = probit.prob)
row.names(QI) <- NULL

kable(QI)
gender   logit.prob   probit.prob
F         0.5805265     0.5699259
M         0.4384230     0.4432041

We can see from the results above that there is little difference in the predicted probabilities across our logit and probit models. The drawback of this slightly more automated way of generating predicted probabilities is that it does not yield confidence intervals around our estimates. For individual point predictions with confidence intervals, we can instead use the code below, though this process becomes fairly tedious as the number of predictions grows. To get around this problem, we must run simulations on our data, similar to the multinomial logit analysis.

# hold all other variables at mean / mode; specify female gender
df_gender <- data.frame(1, statusquo = mean(Chile$statusquo), 
                     age = mean(Chile$age), 
                     income = mean(Chile$income), 
                     sex = "F", 
                     education = "S")

# generate predictions
gender_predict <- predict(logit1, 
                       newdata = df_gender, 
                       type="response", 
                       se.fit=TRUE)

# generate confidence intervals
gender_ci <- c(gender_predict$fit - 1.96 * gender_predict$se.fit, # lower CI
  gender_predict$fit + 1.96 * gender_predict$se.fit) # upper CI

# prepare data
QI2 <- data.frame(gender = 'female',
                 logit.prob = gender_predict$fit,
                 logit.ci.lower = gender_ci[1],
                 logit.ci.upper = gender_ci[2])
kable(QI2)
gender   logit.prob   logit.ci.lower   logit.ci.upper
female    0.5805265        0.4879098        0.6731432
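
One caveat with this approach: an interval computed on the response scale can stray outside [0, 1] near the boundaries. A common alternative, sketched below, is to build the interval on the linear predictor (link) scale and then transform it with the inverse logit:

# compute the CI on the link scale, then transform to probabilities
link_pred <- predict(logit1, newdata = df_gender, type = "link", se.fit = TRUE)
ilogit(c(link_pred$fit - 1.96 * link_pred$se.fit, # lower CI
         link_pred$fit + 1.96 * link_pred$se.fit)) # upper CI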

3.3 Analysis: Simulations

To generate confidence intervals for predicted probabilities, we need to use other packages which involve simulations. We will go through the process in steps.

3.3.1 Simulations: Step 1

Our first step involves taking 10,000 draws from a multivariate normal distribution, which is justified because asymptotic normality is a property of maximum likelihood estimators. Simulation is what allows us to account for estimation uncertainty when making predictions.

logit1 <- glm(vote ~ statusquo + age + income + sex + education, 
              data = Chile, family=binomial(link="logit")) 

sqsex <- c('F','M') # vary the primary independent variable
n.draws <- 10000 # specify the number of draws

coefs <- coef(logit1) # pull the model coefficients
vcov <- vcov(logit1) # pull the model variance-covariance matrix

# simulate coefficients for each category
sim.coefs <- mvrnorm(n.draws, mu = coefs, Sigma = vcov)

If we run the summary command on our simulation results, we can see that the mean for each coefficient is very close to the estimate found in our stargazer model results.

summary(sim.coefs)
##   (Intercept)        statusquo          age           
##  Min.   :-0.3717   Min.   :2.671   Min.   :-0.026413  
##  1st Qu.: 0.7476   1st Qu.:3.083   1st Qu.:-0.003781  
##  Median : 1.0064   Median :3.179   Median : 0.001175  
##  Mean   : 1.0052   Mean   :3.181   Mean   : 0.001210  
##  3rd Qu.: 1.2615   3rd Qu.:3.279   3rd Qu.: 0.006262  
##  Max.   : 2.5463   Max.   :3.698   Max.   : 0.033262  
##      income                sexM           educationPS     
##  Min.   :-1.284e-05   Min.   :-1.35478   Min.   :-2.2594  
##  1st Qu.:-4.200e-06   1st Qu.:-0.70703   1st Qu.:-1.1930  
##  Median :-2.342e-06   Median :-0.57004   Median :-0.9633  
##  Mean   :-2.323e-06   Mean   :-0.56993   Mean   :-0.9615  
##  3rd Qu.:-4.148e-07   3rd Qu.:-0.43405   3rd Qu.:-0.7315  
##  Max.   : 8.605e-06   Max.   : 0.08794   Max.   : 0.2216  
##    educationS     
##  Min.   :-1.5917  
##  1st Qu.:-0.7893  
##  Median :-0.6297  
##  Mean   :-0.6280  
##  3rd Qu.:-0.4692  
##  Max.   : 0.3979

3.3.2 Simulations: Step 2

Then, we build one covariate profile per gender to specify our model, holding every variable at its mean or mode except the primary independent variable sex. We then convert the profiles into matrices so that we can use matrix multiplication.

# Set Data to Average Values
df_female <- cbind(1, # intercept
                   mean(Chile$statusquo), #status quo
                     mean(Chile$age), #age
                     mean(Chile$income), #income
                     0, # female
                     0, # education is set by mode, so PS = 0
                     1) # secondary education = 1

df_female <- data.matrix(df_female)

#
df_male <- cbind(1, # intercept
                   mean(Chile$statusquo), #status quo
                     mean(Chile$age), #age
                     mean(Chile$income), #income
                     1, # male
                     0, # education is set by mode, so PS = 0
                     1) # secondary education = 1

df_male <- data.matrix(df_male)

3.3.3 Simulations: Step 3

Third, we initiate a loop and matrix multiply our sample's average values, instantiated in df_female and df_male, by each draw of simulated coefficients. We then wrap this matrix multiplication in the inverse logit function via the ilogit wrapper from the faraway package. Before running the calculations, however, we need to build some containers to store our results.

pp.female <- matrix(NA, nrow=n.draws)
pp.male <- matrix(NA, nrow=n.draws)
for(k in 1:n.draws){# Loop over coefficients
  print(paste(c("Now working on iteration", k), collapse=" "))
# Female Voters
  female.chile <- ilogit(df_female %*% sim.coefs[k,]) # Matrix multiply + link function
  pp.female[k] <- mean(female.chile)
  
# Male Voters
  male.chile <- ilogit(df_male %*% sim.coefs[k,]) # Matrix multiply + link function
  pp.male[k] <- mean(male.chile)
}
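
The loop above makes the mechanics explicit, but because ilogit is vectorized, the same quantities can be computed in one line each. A sketch of the equivalent computation:

pp.female <- ilogit(as.vector(df_female %*% t(sim.coefs))) # 10000 draws at once
pp.male <- ilogit(as.vector(df_male %*% t(sim.coefs)))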

3.3.4 Simulations: Step 4

Lastly, to derive our predicted probabilities and credible intervals, we calculate the 2.5th, 50th, and 97.5th percentiles of the simulated draws for each gender.

female.ci.low <- quantile(pp.female, 0.025)
female.pp.mean <- quantile(pp.female, 0.5)
female.ci.high <- quantile(pp.female, 0.975)

male.ci.low <- quantile(pp.male, 0.025)
male.pp.mean <- quantile(pp.male, 0.5)
male.ci.high <- quantile(pp.male, 0.975)
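
Because we retain the full set of draws for both genders, the simulation also lets us summarize the female-male first difference directly, with its own credible interval; this is something the point-prediction approach above could not give us. A sketch:

fd <- pp.female - pp.male # first difference for each simulated draw
quantile(fd, c(0.025, 0.5, 0.975))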

We can see that our simulation yields estimates and confidence (credible) intervals for sex very similar to those we obtained above with the much simpler point-prediction function.

# prepare data
QI3 <- data.frame(gender = 'female',
                 logit.prob = female.pp.mean,
                 logit.ci.lower = female.ci.low,
                 logit.ci.upper = female.ci.high)
row.names(QI3) <- NULL
kable(QI3)
gender   logit.prob   logit.ci.lower   logit.ci.upper
female    0.5799348        0.4878316        0.6683374

It is also instructive to graph our findings.

x_data = c(female.pp.mean, male.pp.mean)

y_data = c("P(Vote in Favor of Pinochet|Female)", 
           "P(Vote in Favor of Pinochet|Male)")

# asymmetric error bars: distance from the median up to the 97.5th
# percentile and down to the 2.5th (using the full interval width as a
# symmetric error would draw bars twice as wide as the credible interval)
ci_plus = c(female.ci.high - female.pp.mean, 
            male.ci.high - male.pp.mean)
ci_minus = c(female.pp.mean - female.ci.low, 
             male.pp.mean - male.ci.low)

x <- list(title = "Probability", dtick = 0.10, range = c(0, 1))
y <- list(title = "")

plot_ly(type = 'scatter', mode = 'markers') %>%
  add_trace(x = ~x_data, y = ~y_data, name = '',
            error_x = list(
              type = 'data',
              symmetric = FALSE,
              array = ci_plus,
              arrayminus = ci_minus)
            ) %>% 
  layout(title = "Simulated Predicted Probabilities",
         showlegend = FALSE,
         xaxis = x,
         yaxis = y,
         margin = list(l = 150))

While the above calculations may seem tedious for dummy variables, the process earns its keep when we derive predicted probabilities and confidence (credible) intervals for continuous variables.

3.4 Analysis: Simulations - Continued

How does a voter’s preference for the status quo affect whether or not they will vote for Pinochet to remain in power? We can find out by performing calculations similar to those above. First, we sequence our primary independent variable, statusquo, across 1000 values between its minimum and maximum. Then, we create a matrix that allows statusquo to vary but holds all other variables at their means or modes. Finally, we matrix multiply this matrix with our simulated coefficients and wrap the calculation in the ilogit function, which applies the inverse logit for us: \[\begin{equation} p = \frac{\exp(X\beta)}{1+\exp(X\beta)} \end{equation}\]

statusquo_sequence <- seq(from=min(na.omit(Chile$statusquo)),
                          to=max(na.omit(Chile$statusquo)),
                          length.out=1000)

# Set Data to Average Values
df_sq <- cbind(1, # intercept
                   statusquo_sequence, #status quo
                     mean(Chile$age), #age
                     mean(Chile$income), #income
                     1, # male = mode
                     0, # education is set by mode, so PS = 0
                     1) # secondary education = 1; mode

pred.statusquo <- ilogit(df_sq %*% t(sim.coefs))

We can then loop over our data to prepare it for graphing. To generate our average predicted probabilities, we take the results above and apply the quantile function at the 50th percentile over each row. To generate our credible (confidence) intervals, we apply the quantile function again at the 2.5th and 97.5th percentiles.

y_data = apply(pred.statusquo, 1, quantile, .5)
y_lower_ci = apply(pred.statusquo, 1, quantile, .025)
y_upper_ci = apply(pred.statusquo, 1, quantile, .975)

We can then plug this data into plotly to generate an attractive, interactive graph. We interpret the graph as follows: the probability that a male Chilean voter who is indifferent to the status quo will vote in favor of keeping Pinochet in power is 0.44, on average, after controlling for age, income, and education.

x <- list(title = "Support for the Status Quo", dtick = 0.5, zeroline = FALSE)
y <- list(title = "P(Vote in Favor of Pinochet)")

plot_ly(type = 'scatter', mode = 'lines') %>%
  add_trace(x =~statusquo_sequence, y =~y_upper_ci, line = list(color = 'transparent'), 
            name = 'Upper CI', showlegend = FALSE) %>%
  add_trace(x =~statusquo_sequence, y =~y_lower_ci, fill = 'tonexty', fillcolor='rgba(0,100,80,0.2)', 
            line = list(color = 'transparent'), name = 'Lower CI', showlegend = FALSE) %>%
  add_trace(x =~statusquo_sequence, y =~y_data, line = list(color='rgb(0,100,80)'),
            name = 'Average Probability') %>%
  layout(xaxis = x,
         yaxis = y)

With the hard work done, we can quickly alter our gender variable and juxtapose the probability of voting for Pinochet, by voter gender, while varying preferences over the status quo. We can see from the graph below that the two genders' mean predicted probabilities converge only at extreme values of support for the status quo, where both approach 0 or 1, suggesting that a gender gap remains at more moderate preferences.

# Set Data to Average Values
df_sq2 <- cbind(1, # intercept
                   statusquo_sequence, #status quo
                     mean(Chile$age), #age
                     mean(Chile$income), #income
                     0, # female 
                     0, # education is set by mode, so PS = 0
                     1) # secondary education = 1; mode

pred.statusquo2 <- ilogit(df_sq2 %*% t(sim.coefs))
y_data2 = apply(pred.statusquo2, 1, quantile, .5)
y_lower_ci2 = apply(pred.statusquo2, 1, quantile, .025)
y_upper_ci2 = apply(pred.statusquo2, 1, quantile, .975)
x <- list(title = "Support for the Status Quo", dtick = 0.5, zeroline = FALSE)
y <- list(title = "P(Vote in Favor of Pinochet)")

plot_ly(type = 'scatter', mode = 'lines') %>%
  add_trace(x =~statusquo_sequence, y =~y_upper_ci2, line = list(color = 'transparent'), 
            name = 'Upper CI', showlegend = FALSE) %>%
  add_trace(x =~statusquo_sequence, y =~y_lower_ci2, fill = 'tonexty', fillcolor='rgba(3,29,68,0.2)', 
            line = list(color = 'transparent'), name = 'Lower CI', showlegend = FALSE) %>%
  add_trace(x =~statusquo_sequence, y =~y_data2, line = list(color='rgb(4,57,94)'),
            name = 'Avg. Female Prob.') %>%
  add_trace(x =~statusquo_sequence, y =~y_upper_ci, line = list(color = 'transparent'), 
            name = 'Upper CI', showlegend = FALSE) %>%
  add_trace(x =~statusquo_sequence, y =~y_lower_ci, fill = 'tonexty', fillcolor='rgba(0,100,80,0.2)', 
            line = list(color = 'transparent'), name = 'Lower CI', showlegend = FALSE) %>%
  add_trace(x =~statusquo_sequence, y =~y_data, line = list(color='rgb(0,100,80)'),
            name = 'Avg. Male Prob.') %>%
  layout(xaxis = x,
         yaxis = y)

4 Analysis - Marginal Change

For continuous independent variables, we can take the first derivative of the predicted probability to obtain its marginal effect. We will continue our examination of statusquo to generate some marginal effects. Instead of sequencing statusquo across 1000 values between its min and max, we now use its actual observed values so that we estimate marginal changes at real data points.
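
Formally, for the logit model this derivative is the coefficient scaled by the logistic density evaluated at the linear predictor:

\[ \frac{\partial \, Pr(Y)}{\partial \, statusquo_{i}} = \beta_{1} \frac{\exp(X_{i}\beta)}{(1+\exp(X_{i}\beta))^{2}} = \beta_{1} \, p_{i}(1-p_{i}) \]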

statusquo_sequence <- sort(Chile$statusquo) # sequence real values

# Set Data to Average Values
df_sq <- cbind(1, # intercept
                   statusquo_sequence, #status quo
                     mean(Chile$age), #age
                     mean(Chile$income), #income
                     1, # male = mode
                     0, # education is set by mode, so PS = 0
                     1) # secondary education = 1; mode

sim.coefs <- mvrnorm(n.draws, mu = coefs, Sigma = vcov) # sim again

beta.sq <- sim.coefs[,2] # Pull Beta for status quo

# Calculate Average Marginal Effect: each column of the density matrix is
# one simulated draw, so scale column j by its own coefficient draw beta.sq[j]
# (a plain `beta.sq *` would recycle the 10000 draws down the 1703 rows)
marginal.change <- sweep(dlogis(df_sq %*% t(sim.coefs)), 2, beta.sq, "*")
mean(marginal.change)
## [1] 0.1901343
# dlogis is the logistic density, which conveniently computes:
# qi1 <- df_sq %*% t(sim.coefs)
# dlogis(qi1) == exp(-qi1) / (1 + exp(-qi1))^2

We can check our marginal effects calculation against the margins package.

library(margins)
margins1 <- margins(logit1, type = 'response')
summary(margins1)
##       factor     AME     SE       z      p   lower   upper
##          age  0.0001 0.0004  0.1683 0.8663 -0.0008  0.0009
##  educationPS -0.0580 0.0209 -2.7712 0.0056 -0.0990 -0.0170
##   educationS -0.0385 0.0151 -2.5520 0.0107 -0.0680 -0.0089
##       income -0.0000 0.0000 -0.8444 0.3985 -0.0000  0.0000
##         sexM -0.0342 0.0122 -2.8003 0.0051 -0.0581 -0.0103
##    statusquo  0.1863 0.0043 43.6438 0.0000  0.1779  0.1946

We interpret the average marginal change as follows: as a voter identifies one point more strongly with the status quo, their probability of voting in favor of Pinochet increases by 0.19, on average, after controlling for age, income, gender, and education.

4.1 Analysis - Marginal Change Plot

Lastly, we can take these calculations and plot them with plotly to view the complete marginal change picture across all observed statusquo values in our survey sample.

# loop over data applying quantile function
y_data = apply(marginal.change, 1, quantile, .5)
y_lower_ci = apply(marginal.change, 1, quantile, .025)
y_upper_ci = apply(marginal.change, 1, quantile, .975)
x <- list(title = "Support for the Status Quo", dtick = 0.5, zeroline = FALSE)
y <- list(title = "Marginal Change in P(Vote in Favor of Pinochet)")

plot_ly(type = 'scatter', mode = 'lines') %>%
  add_trace(x =~statusquo_sequence, y =~y_upper_ci, line = list(color = 'transparent'), 
            name = 'Upper CI', showlegend = FALSE) %>%
  add_trace(x =~statusquo_sequence, y =~y_lower_ci, fill = 'tonexty', fillcolor='rgba(0,100,80,0.2)', 
            line = list(color = 'transparent'), name = 'Lower CI', showlegend = FALSE) %>%
  add_trace(x =~statusquo_sequence, y =~y_data, line = list(color='rgb(0,100,80)'),
            name = 'Average Marginal Effect') %>%
  layout(xaxis = x,
         yaxis = y)
