Missingness refers to observations or measurement that, in principle, could have been carried out but for whatever reason, failed to occur (Ward and Ahlquist 2018). Data that we collect, whether observational or experimental data, will with near certainty, have some values that are missing. Missing data comes in all sorts of ways from: (1) individuals not choosing to answer survey questions, (2) human error, (3) losing computer files, and (4) data is reported asynchronously. Missing data is easy to ignore since everyone deals with missing data. However, ignoring missing data can result in biased estimates and standard errors that are wildly deflated or inflated (King et al., 2001).
There are three standard ways to think about missing data.
Missing Completely at Random (MCAR): The probability that any particular value is missing is independent of the values of any of the data, whether missing or observed.
Missing at Random (MAR): The probability that an observed value is is unrelated to unobserved values, conditional on observed data.
Not Missing at Random (NMAR): Missingness that depends on the value of the missing data, even after conditioning on the observed values. A mechanism separate from that which generates the observed data is causing the missingness.
There is no statistical test that we can use to determine what type of missing data we have. MCAR, MAR, and NMAR are assumptions about the data we have. If we have MCAR or MAR, we can use data imputation techniques described here.
The general idea behind missing data imputation algorithms is that they all fill in the missing data with estimates of what real data would look like if it were available. Because the estimated data is by nature uncertain, we replicate the missing data many times to incorporate the uncertainty into the analysis. We use the following process when using multiple imputation which can be time consuming and computationally expensive:
Create multiple new data sets
Estimate model parameters on each data set
Combine and summarize the results from each estimation
Amelia
includes a nice method, missmap
, to help us visualize our missing data. To demonstrate its ease of use and utility, consider the following data set:
## simulate data
set.seed(1)
n <- 100
x1 <- rnorm(n) ; x2 <- rnorm(n) ; x3 <- rnorm(n)
y <- 1 + 2*x1 -1*x2 + 1*x3 + rnorm(n)
## organize into matrix
raw <- cbind(y, x1, x2, x3)
## simulate missingness
naMat <- matrix(rbinom(n*4, 1, .7),
nrow=nrow(raw), ncol=ncol(raw))
naMat[naMat==0] <- NA
## remove observations
data <- raw * naMat
## plot missing data
data <- data.frame(data)
missmap(data)
While there are a number of different data augmentation methods ranging from Multiple Imputation via Chained Equations (MICE) to Bayesian Multiple Imputation, this applied demonstration will focus on a new cutting-edge model that uses Gaussian copulas. Copulas are a way to decompose a multivariate distribution such that the dependence across variables can be considered separately from the marginal distribution of each variable. Once the new distribution is created, we sample from it when new values are needed (Ward and Ahlquist 2018).
Imputation is about predicting the missing data with minimum variance. Consequently, we want to take advantage of all the information we have available to build an imputation model. This means that we include all variables that are feasible to manage; ignoring which side of the equation the variable will be on. The more information we use, the more credible the MAR assumption. Sometimes, standardizing our variables helps improve imputation speed and convergence.
Using the previously generated data above, the goal is to show how to conduct inference on the effect of \(x_{1}\), \(x_{2}\), and \(x_{3}\) on \(y\) after imputing the missing values. The main function that we will use is sbgcop.mcmc
which requires us to set the following four arguments:
1.\(Y\) - A matrix with missing values to be imputed. The object passed to the argument must be in matrix format. Users should only include variables that can provide information to the imputation algorithm. For example, this can include lags and leads of a variable in the case of time series cross-sectional data. Identification variables, such as actor names, abbreviations, or years, should not be included in the matrix.
nsamp
- Number of Markov chain iterations. Since this is a Bayesian estimation scheme, we must pass the number of iterations for which we want the Markov chain to run. If nsamp
is set to 100, then the Markov chain will run for 100 iterations and 100 imputed data sets will be produced.
odens
- Number of iterations between saved samples. The odens
argument specifies how often an iteration from the Markov chain should be saved. Thus, if nsamp
is set to 100 and odens
is set to 4, 25 imputed data sets will be returned by sbgcop.mcmc.
seed
- Random seed integer. Since this is a Bayesian model and we will be sampling from distributions to arrive at parameter values, one should always pass an integer to the seed argument.
We run the Markov chain for 2000 iterations and save every 10th iteration. We store the output from sbgcop.mcmc
to sbgcopOutput
.
## 5 percent done Wed Apr 22 16:26:25 2020
## 10 percent done Wed Apr 22 16:26:26 2020
## 15 percent done Wed Apr 22 16:26:27 2020
## 20 percent done Wed Apr 22 16:26:28 2020
## 25 percent done Wed Apr 22 16:26:29 2020
## 30 percent done Wed Apr 22 16:26:30 2020
## 35 percent done Wed Apr 22 16:26:31 2020
## 40 percent done Wed Apr 22 16:26:32 2020
## 45 percent done Wed Apr 22 16:26:33 2020
## 50 percent done Wed Apr 22 16:26:34 2020
## 55 percent done Wed Apr 22 16:26:35 2020
## 60 percent done Wed Apr 22 16:26:36 2020
## 65 percent done Wed Apr 22 16:26:38 2020
## 70 percent done Wed Apr 22 16:26:39 2020
## 75 percent done Wed Apr 22 16:26:40 2020
## 80 percent done Wed Apr 22 16:26:41 2020
## 85 percent done Wed Apr 22 16:26:42 2020
## 90 percent done Wed Apr 22 16:26:43 2020
## 95 percent done Wed Apr 22 16:26:44 2020
## 100 percent done Wed Apr 22 16:26:45 2020
The output sbgcopOutput
contains the following objects:
C.psamp
- Contains posterior samples of the correlation matrix structured as an array of size p x p x nsamp
/odens
where p indicates the number of variables included in the imputation process. In our case, the data
object includes 4 variables and we ran the Markov chain for 2000 iterations saving every tenth. Thus giving us dimensions of: 4 x 4 x 200. Each value in this array is providing us with the estimated association between a pair of parameters at every saved iteration of the Markov chain.We show an example below using the 1st and 5th saved iterations.
## , , 1
##
## y x1 x2 x3
## y 1.0000000 0.7559554 -0.4941585 0.4979331
## x1 0.7559554 1.0000000 -0.2372739 0.2285985
## x2 -0.4941585 -0.2372739 1.0000000 -0.2072172
## x3 0.4979331 0.2285985 -0.2072172 1.0000000
##
## , , 5
##
## y x1 x2 x3
## y 1.0000000 0.45869390 -0.40908045 0.38735260
## x1 0.4586939 1.00000000 0.02845669 0.08295806
## x2 -0.4090805 0.02845669 1.00000000 -0.04835457
## x3 0.3873526 0.08295806 -0.04835457 1.00000000
To generate a trace plot of this data we need to restructure our data from wide to long.
sbgcopCorr = reshape2::melt(sbgcopOutput$C.psamp)
# remove cases where variable is the same in both columns
sbgcopCorr = sbgcopCorr[sbgcopCorr$Var1 != sbgcopCorr$Var2,]
# construct an indicator for pairs of variables
sbgcopCorr$v12 = paste(sbgcopCorr$Var1, sbgcopCorr$Var2, sep='-')
print(head(sbgcopCorr))
## Var1 Var2 Var3 value v12
## 2 x1 y 1 0.7559554 x1-y
## 3 x2 y 1 -0.4941585 x2-y
## 4 x3 y 1 0.4979331 x3-y
## 5 y x1 1 0.7559554 y-x1
## 7 x2 x1 1 -0.2372739 x2-x1
## 8 x3 x1 1 0.2285985 x3-x1
Using the reshape2
package we have reformatted the array into a data frame, in which the first two columns designate the variables for which a correlation is being estimated, the third an indicator of the saved iteration, the fourth the correlation, and the fifth an indicator designating the variables being compared.
list1 <- unique(sbgcopCorr$v12)
lapply(list1, function(j) {
current_j = j # pull current var
temp_data <- sbgcopCorr %>% filter(v12 == current_j) # subset df
plot_ly( # plot
data = temp_data,
name = ~v12,
x = ~ Var3,
y = ~ value,
type = "scatter",
mode='lines'
)
}) %>% subplot(
nrows = 4,
shareX = TRUE,
shareY = TRUE,
margin = 0.015,
which_layout =1)
The trace plots show that the Markov chain tends to converge rather quickly in this example. The coda
package provides an excellent set of diagnostics to test convergence in more depth.
After conducting the imputation and evaluating convergence, we now use the imputed data sets to conduct inferential analysis. For the purpose of this example, we estimate the effect of \(x_{1}\), \(x_{2}\), and \(x_{3}\) on \(y\). sbgcop
, generated 200 copies of our original data set in which posterior samples of the original missing values have been included. Each of these copies are saved in the output from sbgcop.mcmc
, which has dimensions of 100 x 4 x 200. The first two dimensions of this object correspond to the original dimensions of our data object, and the third corresponds to the number of saved iterations from the Markov chain.
Having generated a set of imputed data sets, our next step is to use a regression model to estimate the effect of our independent variables on \(y\). We cannot just use one of the imputed data sets, as this would not take into account the uncertainty in our imputations. Instead we run several regression on as many of the imputed data sets generated by sbgcop.mcmc
that we think are appropriate. For the sake of this example, we utilize all 200 imputed data sets, but typically randomly sampling around 20 imputed data sets should be be sufficient
Each time we run the regression model, we will save the coefficient and standard errors for the independent variables and organize the results into a matrix as shown below.
coefEstimates <- NULL
serrorEstimates <- NULL
for( copy in 1:dim(sbgcopOutput$'Y.impute')[3]){
# extract copy from sbgcopOutput
copyDf <- data.frame(sbgcopOutput$'Y.impute'[,,copy])
names(copyDf) <- colnames(sbgcopOutput$Y.pmean)
# run model
model <- lm(y~x1+x2+x3,data=copyDf)
# extract coefficients
beta <- coef(model)
coefEstimates <- rbind(coefEstimates, beta)
# extract standard errors
serror <- sqrt(diag(vcov(model)))
serrorEstimates <- rbind(serrorEstimates, serror)
}
print(head(coefEstimates))
## (Intercept) x1 x2 x3
## beta 1.335304 1.772706 -0.8913761 0.8487361
## beta 1.432138 1.956273 -0.8839710 0.8503200
## beta 1.130142 1.712177 -0.7695652 0.7345462
## beta 1.217125 1.777386 -0.9003787 0.5431423
## beta 1.270255 1.469008 -0.9496130 0.7110963
## beta 1.344896 1.904784 -0.7715629 0.9956991
The last step is to combine each of the estimates using using Rubin’s rule. Many existing packages have implemented functions to aid in this last step, one could use the pool function from mice
or the mi.meld
function from Amelia II
as below.
## $q.mi
## (Intercept) x1 x2 x3
## [1,] 1.266575 1.834104 -0.8757189 0.783168
##
## $se.mi
## (Intercept) x1 x2 x3
## [1,] 0.1769035 0.211011 0.2071008 0.1943801
The resulting parameter estimates take into account the uncertainty introduced through the imputation process, and we can interpret them just as we would interpret the results from a typical regression.
Honaker, James, Anne Joseph, Gary King, Kenneth Scheve, and Naunihal Singh. “Amelia: A Program for Missing Data (Windows version) Cambridge, MA: Harvard University.” (2001).
Ward, Michael D., and John S. Ahlquist. Maximum Likelihood for Social Science: Strategies for Analysis. Cambridge University Press, 2018.