Linear Models

Andrew Fogarty

8/19/2019

# load necessary packages
library(dplyr)
library(googleVis)
library(plotly)
library(stargazer)
library(knitr)
library(MASS)

1 Introduction

The Association of Religion Data Archives (ARDA) has assembled a dataset that stitches together economic, social, and demographic variables from three sources, covering 252 countries.

Since we have 244 potential variables to work with, we first trim the dataset down to a handful of potentially interesting variables so that we can test a hypothesis. This dataset is particularly interesting because it includes measures of life satisfaction. A suitable research question might be: what explains differences in life satisfaction across countries?

2 Hypothesis

\(H_{1}\): In a comparison of countries, those with greater percentages of literate adults are more likely to have greater life satisfaction than those with lower percentages of literate adults.

3 Data

# load data
df <- read.csv('https://raw.githubusercontent.com/afogarty85/replications/master/ARDA/socio_economic_religion_world.csv')

# trim variables
df <- dplyr::select(df, UNSATI, UN_GINI, UNPOP10, UNMAGE10, 
                    UNMPI, UNGDPCAP, I_RELIGION, CIAOIL_P, 
                    UNLIT, CIALAREA, UNPOLFRE, UNADJNS,
                    UNNAME, COWCODE, CIANAME)
df$UNPOLFRE <- as.factor(df$UNPOLFRE)

3.1 Data: Variable Descriptions

We can quickly look at the geospatial distribution of our dependent variable UNSATI with the help of gvisGeoChart. The graph shows a wide variance in overall life satisfaction across the world, with the happiest people living in Western countries and in some oil-producing states like Saudi Arabia. This is consistent with the strong correlation between national poverty and low life satisfaction, which is evident across most of the African continent.

df_worldplot <- dplyr::select(df, CIANAME, UNSATI) %>% na.omit() # prepare new df for graph

worldplot <- gvisGeoChart(df_worldplot, locationvar = "CIANAME", # gvis follows certain naming conventions
                          colorvar = "UNSATI", # our dependent variable
                          options=list(width=750, 
                                       height=600))

cat(worldplot$html$chart, file = "worldplot.html") # work around to get our graph to run inside our RMD
htmltools::includeHTML("worldplot.html") # load the graph here

4 Data: Dependent Variable

We can also plot the distribution of our dependent variable. We can see that we have a continuous dependent variable that has a shape very similar to the normal distribution, which in part makes OLS a reasonable choice to model this data.

plot_ly(data = df, type = 'histogram') %>%
  add_trace(x =~UNSATI, name = 'Life Satisfaction', showlegend = F) %>%
  layout(xaxis = list(
    title = 'Life Satisfaction'),
    yaxis = list(
      title = 'Count of Countries'))

5 Data: Independent Variable

Following our hypothesis, we can also inspect our primary independent variable. We can see that the mode of the adult literacy rate is at or near 100%, but that the distribution has a long left tail, meaning that in many countries large portions of adults are not literate.

plot_ly(data = df, type = 'histogram') %>%
  add_trace(x =~UNLIT, name = 'Adult Literacy Rate', nbinsx = 50, 
            marker = list(color = 'rgb(49,130,189)'), showlegend = F) %>%
  layout(xaxis = list(
    title = 'Adult Literacy Rate'),
    yaxis = list(
      title = 'Count of Countries'))

6 Data: Dependent + Independent Variable

We can look at our data a little differently to better understand the relative differences between countries in their adult literacy and their overall life satisfaction. We can see that a large number of Latin American countries, Costa Rica in particular, have the highest values in terms of life satisfaction and adult literacy. On the opposite end, many African countries occupy the lowest values. The scatter plot is particularly helpful because it lets us look at, and interact with, the relative distances between countries. The slope of the best-fit line is positive, which indicates that countries with higher adult literacy rates tend to have higher life satisfaction.

scatter_df = dplyr::select(df, UNLIT, UNSATI, CIANAME)
scatter_df <- na.omit(scatter_df)
fit <- lm(UNSATI ~ UNLIT, data = scatter_df)

plot_ly(data = scatter_df, type = 'scatter', mode = 'markers') %>%
  add_trace(x =~UNLIT, y =~UNSATI, 
            color =~CIANAME, showlegend = F) %>%
  add_trace(x =~UNLIT, y = fitted(fit), mode = 'lines', 
            line = list(color = 'rgb(49,130,189)'), name = 'Slope') %>%
  layout(xaxis = list(
    title = 'Adult Literacy Rate'),
    yaxis = list(
      title = 'Life Satisfaction'))

7 Analysis

Before we run some models, we can first calculate them by hand so we have a better understanding of what the lm() function is doing for us. We will begin by calculating a simple linear regression by hand and then move on to multiple linear regression.

7.1 Analysis: Calculate Slope of Best-Fit Line

We can find the slope apparent in the 2-way scatter above through the following calculation,

\[ \beta = \frac{\sum_{i=1}^N (y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^N (x_i - \bar{x})^2} \]

Whereby \(y_i\) is the life satisfaction of country \(i\), \(x_i\) is its adult literacy rate, and \(\bar{y}\) and \(\bar{x}\) are the respective sample means.

Translating this into R, we do the following:

df2 = dplyr::select(df, UNLIT, UNSATI) # select X and Y variables
df2 <- na.omit(df2) # remove NA
df2 <- mutate(df2, mean_literacy = mean(UNLIT),
               mean_satisfaction = mean(UNSATI),
               numerator = (UNLIT - mean_literacy) * (UNSATI - mean_satisfaction),
               denominator = (UNLIT - mean_literacy)^2)
coef <- sum(df2$numerator) / sum(df2$denominator)
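For completeness, once we have the slope we can recover the intercept from the sample means, since \(\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}\). A quick sketch, which should match the intercept reported by lm():

intercept <- mean(df2$UNSATI) - coef * mean(df2$UNLIT) # alpha-hat = y-bar minus beta-hat times x-bar
intercept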

Now to check our work, we use the lm function.

# check the calculation above
fit1 <- lm(UNSATI ~ UNLIT, data = df2) # fit a model

And then we compare our by-hand coefficient to that of the lm function.

c(coef, fit1$coefficients[2]) # match
##                 UNLIT 
## 0.03661856 0.03661856

While simple linear regression is relatively straightforward, multiple linear regression requires a few more steps and the use of matrix multiplication.

7.2 Analysis: Multiple Regression by Hand

df_hand <- dplyr::select(df, UNSATI, UNLIT, 
                         UNGDPCAP, UNPOP10, UN_GINI) # pull variables we will use in the model
df_hand <- na.omit(df_hand) # remove NA
Y <- as.matrix(df_hand$UNSATI) # place Y in a vector

X <- data.frame(intercept = 1, 
                       literacy = df_hand$UNLIT, 
                       gdp = df_hand$UNGDPCAP, 
                       pop = df_hand$UNPOP10, 
                       gini = df_hand$UN_GINI) # place X variables in a data frame

X <- data.matrix(X) # convert to matrix

Through linear algebra, the Ordinary Least Squares estimator can be written parsimoniously as,

\[ \hat{\beta} = (X'X)^{-1}X'Y \] Whereby the formula is read as: X prime X inverse \(\times\) X prime Y

We can calculate multiple linear regression by hand if we follow the five steps below:

  1. Transpose X to create X'

  2. Matrix multiply X' by Y

  3. Matrix multiply X' by X

  4. Invert the result of Step 3

  5. Matrix multiply Step 4 by Step 2

In R, each step looks like the following:

# Step 1 - Transpose X
transpose_X <- t(X)
# Step 2 - Matrix Multiply X' by Y
x_prime_y <- transpose_X %*% Y
# Step 3 - Matrix Multiply X' by X
x_prime_x <- transpose_X %*% X
# Step 4 - X Prime X - Inverse
x_prime_x_inv <- solve(x_prime_x)
# Step 5 - multiply X Prime X Inverse by X Prime Y
betas <- x_prime_x_inv %*% x_prime_y

However, we can use a simplified approach to calculate \(\hat{\beta}\) in one line as shown below:

betas <- solve(t(X) %*% X) %*% t(X) %*% Y
stargazer(betas, type = 'text')
## 
## =================
## intercept  1.517 
## literacy   0.029 
## gdp       0.00005
## pop        0.001 
## gini       0.032 
## -----------------

To check that we did the calculation correctly, we compare the results to the lm function.

beta_check <- lm(UNSATI ~ UNLIT + UNGDPCAP + UNPOP10 + UN_GINI, data = df)
c(betas, beta_check$coefficients)
##                                                                   (Intercept) 
## 1.516640e+00 2.867659e-02 4.719285e-05 8.055081e-04 3.188312e-02 1.516640e+00 
##        UNLIT     UNGDPCAP      UNPOP10      UN_GINI 
## 2.867659e-02 4.719285e-05 8.055081e-04 3.188312e-02

7.3 Analysis: Interpreting Results

Now that we have a good handle on how we calculate our slope, the next task is its interpretation. Here, we fit a model which we think parsimoniously explains life satisfaction as driven by our primary independent variable: literacy. As we know, literacy is a proxy for all sorts of desirable qualities in adults. Given an ability to read, adults can attain better jobs, obtain an education, and, importantly, earn more money. While literacy is important, we also control for other factors: the natural log of a country's GDP per capita, the natural log of its population, and the relative distribution of income across households (the Gini index).

fit1 <- lm(UNSATI ~ log(UNLIT) + log(UNGDPCAP) + 
             log(UNPOP10) + UN_GINI, data = df)
summary(fit1)
## 
## Call:
## lm(formula = UNSATI ~ log(UNLIT) + log(UNGDPCAP) + log(UNPOP10) + 
##     UN_GINI, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.41541 -0.53448 -0.00523  0.58033  2.35745 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -4.41417    1.51531  -2.913 0.004442 ** 
## log(UNLIT)     1.32005    0.37985   3.475 0.000765 ***
## log(UNGDPCAP)  0.36334    0.07944   4.573 1.42e-05 ***
## log(UNPOP10)   0.12375    0.07060   1.753 0.082766 .  
## UN_GINI        0.02354    0.01065   2.210 0.029477 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9658 on 97 degrees of freedom
##   (151 observations deleted due to missingness)
## Multiple R-squared:  0.4495, Adjusted R-squared:  0.4268 
## F-statistic:  19.8 on 4 and 97 DF,  p-value: 6.073e-12
stargazer(fit1,
          type = "html",
          title = "Literacy's Relationship with Life Satisfaction",
          dep.var.labels = "Life Satisfaction",
          covariate.labels = c("Literacy", "GDP", "Population", 
                               "Inequality"),
          ci = TRUE)
Literacy's Relationship with Life Satisfaction

                        Dependent variable: Life Satisfaction
  Literacy              1.320***   (0.576, 2.065)
  GDP                   0.363***   (0.208, 0.519)
  Population            0.124*     (-0.015, 0.262)
  Inequality            0.024**    (0.003, 0.044)
  Constant              -4.414***  (-7.384, -1.444)
  Observations          102
  R2                    0.450
  Adjusted R2           0.427
  Residual Std. Error   0.966 (df = 97)
  F Statistic           19.804*** (df = 4; 97)
  Note:                 *p<0.1; **p<0.05; ***p<0.01

We interpret the literacy coefficient, its confidence interval, and its p-value in the following way:

  • Effect: because literacy enters the model in logs, a 1% increase in the adult literacy rate is associated with roughly a 1.320 × log(1.01) ≈ 0.013 point increase in life satisfaction, on average, holding GDP per capita, population, and inequality constant.

  • Confidence interval: the 95% confidence interval for the literacy coefficient, (0.576, 2.065), does not contain zero.

  • P-value: the p-value of roughly 0.0008 is well below conventional thresholds, so we reject the null hypothesis of no relationship between literacy and life satisfaction.
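As a quick check of that level-log arithmetic, we can plug the printed coefficient into the conversion (a sketch):

1.32005 * log(1.01) # ~0.013 point change in life satisfaction for a 1% increase in literacy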

7.3.1 Analysis: Interpreting Log-Log Results

Here is how we interpret our results when both the dependent variable and a predictor are transformed by the natural logarithm (a log-log relationship).

fit1 <- lm(log(UNSATI) ~ UNLIT + log(UNGDPCAP) + 
             log(UNPOP10) + UN_GINI, data = df)
summary(fit1)
## 
## Call:
## lm(formula = log(UNSATI) ~ UNLIT + log(UNGDPCAP) + log(UNPOP10) + 
##     UN_GINI, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.66833 -0.09020  0.00771  0.10900  0.43553 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.534958   0.150942   3.544 0.000608 ***
## UNLIT         0.004710   0.001229   3.832 0.000225 ***
## log(UNGDPCAP) 0.063552   0.016369   3.882 0.000189 ***
## log(UNPOP10)  0.027509   0.013904   1.979 0.050701 .  
## UN_GINI       0.004448   0.002099   2.120 0.036595 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1898 on 97 degrees of freedom
##   (151 observations deleted due to missingness)
## Multiple R-squared:  0.4551, Adjusted R-squared:  0.4327 
## F-statistic: 20.26 on 4 and 97 DF,  p-value: 3.747e-12
100*(exp(0.006057) - 1) # convert to percent change in Y
## [1] 0.6075381
  • Effect: a 10% increase in a country’s GDP per capita is associated with roughly a 0.6% increase in life satisfaction, on average, after controlling for literacy, population, and inequality.
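The 0.006057 plugged into the conversion above is not printed directly in the output; it appears to be the log-GDP coefficient multiplied by log(1.1), i.e., the change in log(UNSATI) implied by a 10% increase in GDP per capita. A sketch of that arithmetic, under that assumption:

0.063552 * log(1.1) # ~0.006057, assuming this is the implied change in log(UNSATI)
100 * (exp(0.063552 * log(1.1)) - 1) # ~0.61 percent increase in life satisfaction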

7.4 Analysis: Understanding Control Variables

What do “control” variables actually do in our regressions? In a model that regresses life satisfaction on literacy plus the controls used below (see the sketch after the data preparation), the coefficient on our primary independent variable is 0.02308. How did we get this exact number?

First, we will prepare some fresh data for this experiment:

df_controls <- dplyr::select(df, UNSATI, UNLIT, UNGDPCAP, UNPOP10, UN_GINI)
df_controls <- na.omit(df_controls)
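For reference, here is a minimal sketch of the full model these steps reproduce; the name fit_full is hypothetical, but the outcome, predictor, and controls match the partial regressions that follow:

fit_full <- lm(UNSATI ~ UNLIT + log(UNGDPCAP, base = 1.1) + 
             log(UNPOP10) + UN_GINI, data = df_controls) # hypothetical full model
fit_full$coefficients["UNLIT"] # should be the 0.02308 we recover below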

Next, we will omit our primary independent variable from the regression.

fit_control1 <- lm(UNSATI ~ log(UNGDPCAP, base = 1.1) + 
             log(UNPOP10) + UN_GINI, data = df_controls)

Then we will save this model’s residuals by:

control_errors1 <- residuals(fit_control1)

Next, we exchange the dependent variable, UNSATI, with our primary IV, UNLIT.

fit_control2 <- lm(UNLIT ~ log(UNGDPCAP, base = 1.1) + 
             log(UNPOP10) + UN_GINI, data = df_controls)

And we save its residuals like so:

control_errors2 <- residuals(fit_control2)

Next, we regress the errors from our first regression on the errors from the second. Remember, we say “y is regressed on x,” not the other way around.

fit_control3 <- lm(control_errors1 ~ control_errors2)

Notice that we have recovered our beta. This partialling-out procedure is known as the Frisch-Waugh-Lovell theorem.

fit_control3$coefficients[2]
## control_errors2 
##      0.02307966
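Using the hypothetical fit_full from the sketch above, we can place the two slopes side by side; by construction they should agree:

c(full_model = unname(fit_full$coefficients["UNLIT"]),
  partialled_out = unname(fit_control3$coefficients[2])) # both ~0.02308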

8 Analysis: Interactions and Marginal Effects

Interaction terms can be tricky to work with: how do we formulate and interpret them correctly? Normally, we have substantive reasons and theory for creating an interaction. An interaction term states that the effect of one variable depends on the level of another. Here, we have created one for illustration, whereby the effect of adult literacy depends on the level of inequality in a country.

With a continuous-by-continuous interaction, we aim for marginal effects, which means we take the first partial derivative of the expected value of \(Y\) (life satisfaction) with respect to adult literacy.

\[ \frac{\partial E(\text{life satisfaction}_i)}{\partial \text{adult literacy}_i} = \beta_1 + \beta_5\text{inequality}_i.\] First we run the model:

fit1 <- lm(UNSATI ~ UNLIT + UN_GINI + log(UNGDPCAP, base = 1.1) + 
             log(UNPOP10) + UNLIT*UN_GINI,  data = df)
summary(fit1)
## 
## Call:
## lm(formula = UNSATI ~ UNLIT + UN_GINI + log(UNGDPCAP, base = 1.1) + 
##     log(UNPOP10) + UNLIT * UN_GINI, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.50437 -0.45556 -0.00166  0.46892  2.07966 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                6.6847467  2.6881924   2.487 0.014620 *  
## UNLIT                     -0.0567818  0.0297263  -1.910 0.059098 .  
## UN_GINI                   -0.1461452  0.0635925  -2.298 0.023723 *  
## log(UNGDPCAP, base = 1.1)  0.0303350  0.0076349   3.973 0.000137 ***
## log(UNPOP10)               0.0861714  0.0701704   1.228 0.222438    
## UNLIT:UN_GINI              0.0020153  0.0007347   2.743 0.007262 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9274 on 96 degrees of freedom
##   (151 observations deleted due to missingness)
## Multiple R-squared:  0.4977, Adjusted R-squared:  0.4715 
## F-statistic: 19.02 on 5 and 96 DF,  p-value: 4.187e-13

Then we pull the beta coefficients.

beta_lit <- fit1$coefficients[2]
beta_interaction <- fit1$coefficients[6]

Then we sequence adult literacy across its observed range.

sq_lit <- seq(from = min(df_controls$UNLIT), 
              to = max(df_controls$UNLIT), length.out = 10)

And calculate the marginal effect:

marg.effect <- beta_lit + beta_interaction*sq_lit
ME <- data.frame(sq_lit, marg.effect)
kable(ME)
sq_lit marg.effect
26.16667 -0.0040489
34.34444 0.0124315
42.52222 0.0289120
50.70000 0.0453924
58.87778 0.0618728
67.05556 0.0783532
75.23333 0.0948336
83.41111 0.1113141
91.58889 0.1277945
99.76667 0.1442749
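Because the partial derivative \(\beta_1 + \beta_5\text{inequality}_i\) varies with the level of inequality, we could also evaluate the marginal effect of literacy across the observed range of the Gini index rather than across literacy. A minimal sketch, using a hypothetical sq_gini sequence:

sq_gini <- seq(from = min(df_controls$UN_GINI), 
               to = max(df_controls$UN_GINI), length.out = 10) # sequence the moderator
me_literacy <- fit1$coefficients["UNLIT"] + 
  fit1$coefficients["UNLIT:UN_GINI"] * sq_gini # beta_1 + beta_5 * inequality
data.frame(sq_gini, me_literacy)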

9 Analysis: Marginal Effect Standard Errors

To derive standard errors for our marginal effects, we must run simulations in a manner similar to the logit and multinomial analyses.

coef <- coef(fit1) # pull coefs
sigma <- vcov(fit1) # pull variance-covariance matrix
sim.betas <- data.frame(mvrnorm(n = 1000, mu = coef, Sigma=sigma)) # simulate
sim.betas <- dplyr::select(sim.betas, UNLIT, UNLIT.UN_GINI) # pull beta and interaction term
head(sim.betas)
##         UNLIT UNLIT.UN_GINI
## 1 -0.14633756  0.0041983082
## 2 -0.06978633  0.0022529103
## 3 -0.00745908  0.0008181124
## 4 -0.05231454  0.0017624026
## 5 -0.02849129  0.0014858894
## 6 -0.02989192  0.0012247231
sq_lit <- as.data.frame(sq_lit) # transform sequence to a dataframe
sq_lit <- mutate(sq_lit, one = 1) # add a variable for easy joining
sim.betas <- mutate(sim.betas, one = 1) # do the same
ME <- full_join(sim.betas, sq_lit) # join the data
## Joining, by = "one"
ME <- mutate(ME, marg.effect = UNLIT # calculate marginal effect
             + UNLIT.UN_GINI*sq_lit) # beta + interaction term * sequenced x-variable
ME <- ME %>%
      group_by(sq_lit) %>% 
  summarize(me = mean(marg.effect),
            LB = quantile(marg.effect, .025),
            UB = quantile(marg.effect, .975))
ME
## # A tibble: 10 x 4
##    sq_lit       me       LB     UB
##     <dbl>    <dbl>    <dbl>  <dbl>
##  1   26.2 -0.00386 -0.0283  0.0189
##  2   34.3  0.0125  -0.00226 0.0261
##  3   42.5  0.0289   0.0171  0.0421
##  4   50.7  0.0453   0.0258  0.0654
##  5   58.9  0.0617   0.0322  0.0916
##  6   67.1  0.0780   0.0371  0.119 
##  7   75.2  0.0944   0.0424  0.149 
##  8   83.4  0.111    0.0464  0.178 
##  9   91.6  0.127    0.0498  0.204 
## 10   99.8  0.144    0.0550  0.231

9.1 Analysis: Interactions and Marginal Effects Graphs

plot_ly(data = ME, type = 'scatter', mode = 'lines') %>%
  add_trace(x=~sq_lit, y=~me, name = 'Marginal Effect', line = list(color='rgb(0,100,80)')) %>%
  add_trace(x =~sq_lit, y =~UB, line = list(color = 'transparent'), 
            name = 'Upper CI', showlegend = FALSE) %>%
  add_trace(x =~sq_lit, y =~LB, fill = 'tonexty', fillcolor='rgba(0,100,80,0.2)', 
            line = list(color = 'transparent'), name = 'Lower CI', showlegend = FALSE) %>%
  layout(xaxis = list(
    title = 'Adult Literacy Rate', dtick = 10),
    yaxis = list(
      title = 'Marginal Effect'))

Interpretation: at the lowest literacy values, the estimated marginal effect is small and its 95% confidence interval includes zero; as literacy increases, the marginal effect grows and becomes reliably positive, reaching roughly 0.14 (95% CI: 0.06 to 0.23) at the top of the sequenced range.

10 OLS Assumptions Check

10.1 Zero Conditional Mean

A violation of the zero conditional mean assumption risks biased coefficients. We check for violations of the zero conditional mean in the plot below, which plots our residuals against the fitted values (a linear combination of our independent variables). In the plot, we look for a roughly flat red line centered on zero. We can see that the red line is indeed mostly flat, with some slight curvature toward the ends, likely owing to our limited number of observations. We therefore assess that our primary model does not violate this assumption.

plot(fit1, which = 1)

10.2 Homoskedasticity

The assumption of homoskedastic error consists of the following two components:

  1. \(\epsilon_{i}\) is independent of \(\epsilon_{j}\) conditional on X for all \(i\) \(\neq\) \(j\).

  2. The variance of the error does not depend on the predictor: Var(\(\epsilon_{i}\)|X) = Var(\(\epsilon_{i}\))

The assumption that all the errors have the same variance is called homoskedasticity.

The null hypothesis for the Breusch-Pagan test is homoskedasticity. Given the results of our test, we reject the null and assume we have heteroskedasticity. However, we need to be cautious about this test, as it is sensitive to the number of observations.

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(sandwich)
bptest(fit1)
## 
##  studentized Breusch-Pagan test
## 
## data:  fit1
## BP = 17.795, df = 5, p-value = 0.003214

We might also suspect evidence of heteroskedasticity in the plot below, which shows some unevenness in the spread of the residuals, particularly where the fitted values cluster toward the right. However, it does not look too severe.

plot(fit1, which = 1)

We can also judge heteroskedasticity from the scale-location plot below: a non-flat red line would suggest heteroskedasticity. If we had homoskedastic data, the red line would be roughly flat, which is what we see.

plot(fit1, which = 3)

Taken collectively, we judge that the homoskedasticity assumption is not badly violated, which means that, practically, we do not have to incorporate robust standard errors, although we might still want to in order to be conservative.
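If we did want robust standard errors, the sandwich package loaded above provides heteroskedasticity-consistent covariance estimators; a minimal sketch pairing it with coeftest() from lmtest (HC1 errors):

coeftest(fit1, vcov = vcovHC(fit1, type = "HC1")) # coefficients with robust standard errors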

10.3 Normality of Errors

The normality of errors assumes that the errors must have this specific distribution:

\[\begin{eqnarray} \epsilon &\sim& N(0, \sigma^2), \\ Corr(\epsilon_{j}, \epsilon_{k}) &=& 0, \quad \forall j \neq k \end{eqnarray}\]

Which means:

  1. all errors are normally distributed with a mean of 0,

  2. all errors are independent from one another,

  3. and all errors have the same variance.

If the errors are not normally distributed, or do not have a mean of 0, then we say we have a model misspecification problem. If the errors are not independent from one another, we have an autocorrelation problem.
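As an aside, lmtest also offers a Durbin-Watson test for first-order autocorrelation in the residuals; it is mainly meaningful when observations have a natural ordering, such as time series, which our cross-sectional country data do not, but a minimal sketch looks like this:

dwtest(fit1) # Durbin-Watson test for autocorrelated errors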

We conduct several tests below which, taken collectively, suggest that our errors are normal, as evidenced by: (1) the histogram of residuals, which shows a near-normal distribution, (2) the Shapiro-Wilk test, which fails to reject the null hypothesis that our residuals are drawn from a normal distribution, and (3) a normal Q-Q plot, which shows only a few observations deviating from the diagonal line at each end.

hist(fit1$residuals, breaks = 50, main = "Model 2 Residuals", xlab="Residuals")

shapiro.test(fit1$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  fit1$residuals
## W = 0.98825, p-value = 0.5124
# qq plot
plot(fit1, which = 2)

11 Sources