2.4 Primary Independent Variable
The primary independent variable is tyr
, or the total years of schooling at time \(t\). Following a similar pattern, the average adult’s level of education is also increasing over time.
Why are some countries more democratic than others? Existing accounts assume that improvements in democracy is driven mostly by economic development (GDP), coalitions, inequality, or resource abundance which seemingly eliminates the need to study this question. I challenge this view by arguing that institutions, particularly education, plays a more meaningful role in promoting democracy.
# load libraries
library(googleVis)
library(dplyr)
library(tidyr)
library(ggplot2)
library(plotly)
library(knitr)
library(plm)
library(MASS)
library(lmtest)
df <- read.csv("https://raw.githubusercontent.com/afogarty85/replications/master/Education%20and%20Democracy/education_democracy.csv")
## country year tyr pri
## Afghanistan: 39 Min. :1820 Min. : 0.0049 Min. : 0.000
## Albania : 39 1st Qu.:1865 1st Qu.: 0.3850 1st Qu.: 2.746
## Algeria : 39 Median :1915 Median : 2.1626 Median : 32.279
## Argentina : 39 Mean :1915 Mean : 3.2381 Mean : 42.654
## Australia : 39 3rd Qu.:1965 3rd Qu.: 5.2344 3rd Qu.: 85.000
## Austria : 39 Max. :2010 Max. :13.2446 Max. :100.000
## (Other) :3861 NA's :1050
## sec ter democ autoc
## Min. : 0.0000 Min. : 0.00000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 0.00000 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 0.7481 Median : 0.09498 Median : 2.000 Median : 3.000
## Mean : 14.3694 Mean : 5.06813 Mean : 3.953 Mean : 3.641
## 3rd Qu.: 14.5832 3rd Qu.: 2.00000 3rd Qu.: 8.000 3rd Qu.: 7.000
## Max. :100.0000 Max. :100.00000 Max. :10.000 Max. :10.000
## NA's :1720 NA's :1720
## gdp region population area
## Min. : 231.9 Min. : 1.000 Min. :1.520e+05 Min. : 2040
## 1st Qu.: 1643.2 1st Qu.: 2.000 1st Qu.:3.077e+06 1st Qu.: 92120
## Median : 3251.2 Median : 4.000 Median :7.769e+06 Median : 301340
## Mean : 5604.3 Mean : 3.968 Mean :3.226e+07 Mean : 1275891
## 3rd Qu.: 6497.8 3rd Qu.: 5.000 3rd Qu.:2.397e+07 3rd Qu.: 799380
## Max. :80413.1 Max. :10.000 Max. :1.310e+09 Max. :22100000
## NA's :2341 NA's :1720 NA's :2025 NA's :3038
## democavg autocavg
## Min. : 0.000 Min. :0.000
## 1st Qu.: 1.667 1st Qu.:2.381
## Median : 2.882 Median :3.875
## Mean : 3.746 Mean :3.739
## 3rd Qu.: 5.545 3rd Qu.:5.118
## Max. :10.000 Max. :8.778
##
In this analysis, we are going to examine the effect of education on levels of democracy across 105 countries from 1820 to 2010. The dataset used in this analysis stitches together three different datasets, of which the reader can learn more about in the sources section below. A brief description of the variables is as follows:
The plot below shows the geographic extent of our dataset. Specifically, it shows the average number of years of education that citizens (age 15-64) acquired from school in 2010.
Following a similar pattern, the plot below shows each country’s polity democracy score in 2010. What these two graphs tell us is that the intensity of the color green tends to covary across both variables: as citizens acquire more education, the level of a country’s democracy increases.
Since we are evaluating time series cross-sectional data, we might also want to better understand the temporal coverage of our data.
kable(df %>%
summarise(distinct_countries = n_distinct(country),
distinct_years = n_distinct(year),
min_year = min(year),
max_year = max(year)))
distinct_countries | distinct_years | min_year | max_year |
---|---|---|---|
105 | 39 | 1820 | 2010 |
The dependent variable is democ
, or the level of democracy within a country at time \(t\). A brief look at our dependent variable shows that within our sample, the level of democracy has been increasing over time.
The primary independent variable is tyr
, or the total years of schooling at time \(t\). Following a similar pattern, the average adult’s level of education is also increasing over time.
tyr_mean <- df %>% group_by(country) %>%
summarize_at(vars(tyr), list(mean), na.rm = TRUE) # compute average education levels
democ_mean <- df %>% group_by(country) %>%
summarize_at(vars(democ), list(mean), na.rm = TRUE) # compute average democracy levels
regLine <- lm(democ_mean$democ ~ tyr_mean$tyr) # generate slope
plot_ly(type = 'scatter', mode = 'markers') %>%
add_trace(x = ~tyr_mean$tyr, y = ~democ_mean$democ, color = ~democ_mean$country,
showlegend = FALSE) %>%
add_trace(x = ~tyr_mean$tyr, y = fitted(regLine), mode = "lines", name = 'Slope',
showlegend = FALSE) %>%
layout(xaxis = list(
title = 'Avg. Years of Adult Education'),
yaxis = list(
title = 'Avg. Country Democracy Levels'))
Now that we have a better understanding of our data and we can see a clear relationship between education and democracy, let’s specify a hypothesis and test it with an appropriate model.
\(H_{1}\): In a comparison of countries, as the years of the schooling among adults increases, measures of democracy increases.
Notice that the hypothesis implicitly refers to the over time effects of education on the levels of democracy. Thus, we need to specify case fixed effects in order to realize the over time variation. Case fixed effects are apparent in our equation given the absence of \(_{t}\) from country
.
\[\begin{eqnarray} \small \text{Democracy}_{it} = \alpha + \beta_{1} \text{Education}_{it} + \beta_{2} \text{GDP}_{it} + \beta_{3} \text{Population}_{it} + \beta_{4} \text{Democracy}_{it-1} + \text{Country}_{i} + \epsilon_{it} \end{eqnarray}\]
The dependent variable is included as a lagged independent variable because despite applying fixed effects to the regression, I have not accounted for any other issue associated with time series data such as auto-correlation. Lagging the dependent variable is one method to account for auto-correlation which means that I am modeling this time series as an Auto Regressive (AR) 1 process. While this makes a large assumption that all 105 countries are identical in their auto-correlation and rate of decay, there are few tools for panel models to improve our estimation currently. Additionally, the lagged dependent variable is also incorporated for an AR1 impulse response function analysis that we will incorporate later.
Before running our first model, we need to transform several our variables with the natural log to improve our model’s fit.
We can estimate the over time effects because the fixed effects procedure is as if we created a subset of our dataset by country, performed a regression for that subset, then subtracted the cross-sectional variation leaving only the time effects, and then repeated that for every country. After running 105 regressions, we would average the slopes together and weight it by its sample size (the number of countries) and the variance of our independent variables. The effect we derive is a within case over time analysis of itself. A templated interpretation is as follows: Within an average case
, one unit increase in \(X\), is associated with a \(\hat{\beta}\) change in \(Y\), on average, after controlling for all other variables.
As we can see, the primary independent variable is statistically significant. The result tells us that: within an average country, a ten year increase in the average years of schooling for citizens aged 15-64 is associated with a 1.7 increase in the polity democracy score, on average, after controlling for GDP and population. This effect is quite notable as the polity democracy score is on an 11 point scale. However, a ten year increase in the average years of schooling is also quite challenging as many developing countries are averaging below 5 years (as of 2010) while all Western nations exceed 10. It is also worth noting that GDP is not statistically significant after controlling for education which challenges the findings of previous research which purports to demonstrate GDP’s very close and significant link with democracy.1
# Case Fixed Effects - Over Time Variation
pdata1 <- pdata.frame(df, index=c("country", "year"), row.names=FALSE) # case fixed effects
caseFE <- plm(democ ~ tyr + gdp + population + dplyr::lag(democ, k=1), model="within", data=pdata1)
summary(caseFE)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = democ ~ tyr + gdp + population + dplyr::lag(democ,
## k = 1), data = pdata1, model = "within")
##
## Unbalanced Panel: n = 102, T = 2-28, N = 1408
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -8.894454 -0.686317 -0.084028 0.536275 8.396564
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## tyr 0.168844 0.048388 3.4894 0.0005002 ***
## gdp 0.109639 0.149019 0.7357 0.4620235
## population 0.109853 0.172494 0.6369 0.5243321
## dplyr::lag(democ, k = 1) 0.701194 0.020695 33.8822 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 10152
## Residual Sum of Squares: 4046.2
## R-Squared: 0.60144
## Adj. R-Squared: 0.56929
## F-statistic: 491.181 on 4 and 1302 DF, p-value: < 2.22e-16
pbgtest(caseFE) # null = no serial correlation; cannot reject; model has no serial correlation of errors
##
## Breusch-Godfrey/Wooldridge test for serial correlation in panel models
##
## data: democ ~ tyr + gdp + population + dplyr::lag(democ, k = 1)
## chisq = 1.3149, df = 2, p-value = 0.5182
## alternative hypothesis: serial correlation in idiosyncratic errors
bptest(caseFE, studentize = F) # null = there is no heteroskedasticity; can reject; model has heteroskedasticity
##
## Breusch-Pagan test
##
## data: caseFE
## BP = 112.01, df = 4, p-value < 2.2e-16
The impulse response function below is based on a simplified model above which depicts the marginal effect of average years of adult education over time. Substantively, this result tells us: A one year increase in average years of adult schooling for citizens aged 15-64 today (\(t_{=0}\)), is associated with an approximately 0.7 increase in the Polity democracy index one year from now (\(t_{=1}\)). The difference between the IRF and the regression results in the table above is due to the following: a regression represents average treatment effects across time while the IRF depicts average treatment effects at specified time intervals.
## Create model
reg <- plm(democ ~ tyr + dplyr::lag(democ, k=1), model="within", data=pdata1)
summary(reg)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = democ ~ tyr + dplyr::lag(democ, k = 1), data = pdata1,
## model = "within")
##
## Unbalanced Panel: n = 105, T = 3-29, N = 1878
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -9.00561 -0.59027 -0.09114 0.46019 8.54134
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## tyr 0.168426 0.016428 10.252 < 2.2e-16 ***
## dplyr::lag(democ, k = 1) 0.726113 0.016760 43.325 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 13149
## Residual Sum of Squares: 4803.3
## R-Squared: 0.63469
## Adj. R-Squared: 0.61283
## F-statistic: 1538.48 on 2 and 1771 DF, p-value: < 2.22e-16
## Create loop
iterations <- 10000
results <- data.frame()
coefs <- coef(reg) #store coefs
vcov <- vcov(reg) #store vcov
sim.irf <- mvrnorm(iterations, coefs, vcov) # simulate new coefs 1000x
head(sim.irf)
## tyr dplyr::lag(democ, k = 1)
## [1,] 0.1768250 0.7120246
## [2,] 0.1532820 0.7595639
## [3,] 0.1439394 0.7312199
## [4,] 0.2060065 0.7043871
## [5,] 0.1855121 0.7045173
## [6,] 0.1744217 0.7171867
data.sim <- data.frame(sim.irf)
for(i in 1:iterations){
#print(paste(c("Now working on iteration", i), collapse=" "))
irf.funx <- c(data.sim[i,2],
data.sim[i,2]*data.sim[i,1],
data.sim[i,2]*data.sim[i,1]^2,
data.sim[i,2]*data.sim[i,1]^3,
data.sim[i,2]*data.sim[i,1]^4,
data.sim[i,2]*data.sim[i,1]^5,
data.sim[i,2]*data.sim[i,1]^6,
data.sim[i,2]*data.sim[i,1]^7,
data.sim[i,2]*data.sim[i,1]^8,
data.sim[i,2]*data.sim[i,1]^9) # IRF function
# for every iteration, pull the IRF values for each time point
# store 1000 IRF results
results <- rbind(results, irf.funx)
}
# Mean IRFs
mean.irfs <- apply(results, 2, mean)
low.ci <- apply(results, 2, FUN=function(x){
quantile(x, .025)
})
low.ci
## X0.712024624636414 X0.125903781819358 X0.0222629410949251
## 6.938696e-01 1.007601e-01 1.375452e-02
## X0.00393664542108215 X0.000696097479001088 X0.000123087463675728
## 1.873446e-03 2.560293e-04 3.493386e-05
## X2.1764945530136e.05 X3.84858733605703e.06 X6.80526604707105e.07
## 4.747940e-06 6.483768e-07 8.834213e-08
## X1.20334143225824e.07
## 1.202984e-08
## X0.712024624636414 X0.125903781819358 X0.0222629410949251
## 7.598629e-01 1.431292e-01 2.860344e-02
## X0.00393664542108215 X0.000696097479001088 X0.000123087463675728
## 5.714401e-03 1.143147e-03 2.290501e-04
## X2.1764945530136e.05 X3.84858733605703e.06 X6.80526604707105e.07
## 4.583837e-05 9.171275e-06 1.835012e-06
## X1.20334143225824e.07
## 3.671941e-07
irf.data <- data.frame(time = 0:9, irf = mean.irfs, irf.lb = low.ci, irf.ub = hi.ci)
g <- ggplot(data = irf.data,
aes(x = time, y = irf,
ymin = irf.lb, ymax = irf.ub)) +
geom_path() +
geom_ribbon(alpha=0.2) +
theme_bw() +
labs(x = "Year", y = "Marginal Effects: Education") +
theme(legend.position="bottom") +
guides(color = guide_legend(ncol=1),fill = guide_legend(nrow=1)) +
geom_hline(aes(yintercept=0))
g
Over time analysis is not the only option available to us in time series cross-sectional analysis. We can also estimate the cross-sectional effect by choosing where we place our fixed effects. To derive the cross-sectional effect, we use time fixed effects by placing our fixed effects on year
.
We can estimate the cross-sectional effects because the fixed effects procedure is as if we created a subset of our dataset by year, performed a regression for that subset, then subtracted the time variation leaving only the cross-sectional effects, and then repeated that for every year. After running 39 regressions, we would average the slopes together and weight it by its sample size (the number of years) and the variance of our independent variables. The effect we derive is a across case analysis – comparing one unit to the next. A templated interpretation is as follows: When comparing one case to another, if one case has \(X\) worth, we expect on average, its \(Y\) will be \(\hat{\beta}\).
The model’s results tells us: When comparing one country to another, if one country’s adult population has 10 years of education, we expect on average, that its polity democracy score will be 6.3 out of 11.
pdata2 <- pdata.frame(df, index=c("year", "country"), row.names=FALSE) # year fixed effects
timeFE <- plm(democ ~ tyr + gdp + population + dplyr::lag(democ, k=1), model="within", data=pdata2)
summary(timeFE)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = democ ~ tyr + gdp + population + dplyr::lag(democ,
## k = 1), data = pdata2, model = "within")
##
## Unbalanced Panel: n = 28, T = 7-97, N = 1150
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -7.64459 -2.25500 0.09491 2.34190 8.73758
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## tyr 0.6398062 0.0572493 11.1758 < 2.2e-16 ***
## gdp 1.0069573 0.1577995 6.3812 2.568e-10 ***
## population 0.0014959 0.0669478 0.0223 0.9822
## dplyr::lag(democ, k = 1) 0.0143816 0.0236460 0.6082 0.5432
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 19123
## Residual Sum of Squares: 11489
## R-Squared: 0.3992
## Adj. R-Squared: 0.38254
## F-statistic: 185.715 on 4 and 1118 DF, p-value: < 2.22e-16
Kropko, Jonathan. PLAD 7500: Time Series. University of Virginia.
Haber, Stephen, and Victor Menaldo. “Do natural resources fuel authoritarianism? A reappraisal of the resource curse.” American political science Review 105, no. 1 (2011): 1-26. Dataset: Haber, Stephen, and Victor Menaldo
Barro, Robert and Jong-Wha Lee, 2013, “A New Data Set of Educational Attainment in the World, 1950-2010.” Journal of Development Economics, vol 104, pp.184-198. Dataset: Barro, Robert and Jong-Wha Lee
Marshall, Monty G., Keith Jaggers, and Ted Robert Gurr. Polity IV project. Center for International Development and Conflict Management at the University of Maryland College Park, 2002. Dataset: Polity
Boix, Carles. “Democracy, development, and the international system.” American Political Science Review 105, no. 4 (2011): 809-828.↩︎