Sitemap

A list of all the posts and pages found on the site. For the robots out there, an XML version is available for digesting as well.

Pages

Posts

BERT-Vision

Published:

What compression methods can extract regularity from BERT during fine-tuning? Drawing on research that demonstrates the utility of information found across all of BERT's layers, we propose a compression method, BERT-Vision, that captures the regularities produced by BERT during fine-tuning. BERT-Vision's contribution is two-fold: First, we show that compression during fine-tuning can yield comparable and sometimes better performance than BERT, and second, we show that this performance is realized with a model that is 209x smaller than BERT in terms of its parameters. To view this project, please click here.

Typos: A Survey Experiment

Published:

Command of language is one of the most significant cognitive abilities we possess and is often the most pervasive signal we encounter in a social media setting. When we notice overt and unintentional grammatical errors in social media posts, do we make unconscious assumptions about the authors' general intelligence? Do we attribute difficulty with written language to other indicators, such as lower verbal acuity or overall intelligence? Further, are some categories of grammatical errors more injurious than others, or do we take all these trespasses in stride? To view this project, please click here.

Latent Control: Hidden Markov Models

Published:

Who controls territory in civil war? This is a central variable in the research and analysis of civil wars, yet it is incredibly difficult to measure. In this post, I model territorial control as a latent variable: an unobserved variable presumed to cause its observable indicators. This project models the latent variable across the entire country of Afghanistan using sub-national event data, a Hidden Markov Model, Uber's hexagonal spatial index, and logistic spatial and temporal decay functions to treat serially correlated data in time and space. To view this project, please click here.

PetaFlights

Published:

What accounts for flight delays in the U.S.? This project presents the machine learning end of a large data engineering project that merged 630 million rows of weather data with 31 million rows of flight data. I use the state of the art in distributed deep learning, leveraging Petastorm, Horovod, and PyTorch to produce a multilayer perceptron model that is distributed across 8 workers in Databricks. Importantly, I use novel approaches to transform categorical data into continuous features through an embedding table. To view this project, please click here.

NLP: Natural Language Propaganda

Published:

Who are the targets of insurgent propaganda? I investigate the ability to classify the targets (e.g., the U.S. or Kabul) of insurgent propaganda messages using a novel corpus containing over 11,000 Taliban statements from 2014 to 2020. In experiments with Convolutional Neural Network (CNN) and transformer architectures, I demonstrate that the audiences of insurgent messages are best captured by transformers, likely owing to their encoder-decoder architecture. This paper's contribution is twofold: First, it offers a novel data set with utility in classification and summarization tasks for machine learning. Second, it suggests that since the audience of messaging can be reliably identified, analysts have new opportunities to look more closely at the contrasts in language and better understand the targets of information.

Nutritionalcart

Published:

How healthy is the average Instacart user? Are certain types of food buyers (e.g., vegetarians, carnivores) healthier than others? I bring new data to bear on these questions to better understand how healthy the average Instacart user is and to better understand the health benefits afforded to Instacart users who choose some types of foods (e.g., plant-based, meat-based) over others. To determine the relative health of Instacart users, I matched the top 10 most ordered products by aisle with USDA nutrient data, using the USDA's API to query their database and return results as JavaScript Object Notation (JSON). To view this project, please click here. An upgraded algorithm that better searches the USDA database can be found here.

Map Off

Published:

Map Off is a game designed to test your geography skills in the United States or around the world. The inspiration for this game comes from my wife, Hannah, because we often test our spatial skills against one another whenever a map is in front of us. In turn, we now have access to maps and competition anytime we want.

Reexamining Civilian Preferences in Civil War: A Survey in Afghanistan

Published:

How do civilians react to changing authority in civil war? We investigate this question in Afghanistan using survey data from The Asia Foundation following the end of U.S.-led combat operations in 2014. I demonstrate that there is clear evidence that civilian attitudes are indeed conditional on the following three-way interaction: territorial control, ethnicity, and survival. For instance, there is a notable and statistically significant distinction between Pashtuns and non-Pashtuns under Taliban control in their approval of the Afghan Government. I bring largely unused country-wide individual-level data to bear on analyzing civilian wartime beliefs. To view this research project, please click here.

Azure

Azure Synapse Analytics

Published:

Azure Synapse Analytics is a limitless analytics service that brings together data integration, enterprise data warehousing, and big data analytics. It gives you the freedom to query data on your terms, using either serverless or dedicated resources at scale. Azure Synapse brings these worlds together with a unified experience to ingest, explore, prepare, manage, and serve data for immediate BI and machine learning needs. In this demo, we will walk through the procedures necessary to: (1) create an Azure Synapse environment, (2) write SQL tables to Parquet and dump them into our Data Lake, (3) run automated pipeline computations on our Data Lake files with AzureML, (4) connect our Data Lake with Azure Databricks for big data analytics, and (5) set up Azure DevOps for Synapse version control.

Azure Static Web Apps

Published:

Azure Static Web Apps is a serverless hosting service that offers streamlined full-stack development from source code to global high availability. In this guide, we will build a static web app, secure it to a limited audience, and set up GitHub Actions that build our posts automatically for us.

Azure Machine Learning

Published:

Azure Machine Learning is an enterprise-grade machine learning service for building and deploying models. In this guide, we will go through an end-to-end process that: (1) instantiates a workspace and compute resources, (2) loads a tabular data set for prediction, (3) runs a single experiment, (4) scales a hyperparameter search across multiple VMs with HyperDrive, (5) deploys a model for inference to the web, and (6) shows how to send new input data and retrieve predictions.

Azure Logic Apps

Published:

Azure Logic Apps is a cloud-based platform for creating and running automated workflows that integrate your apps, data, services, and systems. With this platform, you can quickly develop highly scalable integration solutions for your enterprise and business-to-business (B2B) scenarios.

Bayes

Bayesian Modeling

Published:

Bayesian models begin with one set of plausibilities assigned to each parameter, called prior plausibilities. These priors are then updated, as the model learns from the data, to produce posterior plausibilities. At each step as the model learns, the updated set of plausibilities becomes the new initial plausibilities for the next observation. In other words, the Bayesian model updates the prior distributions with their logical consequence: the posterior distribution. In this post, I fit a linear model in pymc3, using its latest capabilities, to provide a complete and state-of-the-art example of generalized linear modeling.
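A minimal sketch of the prior-to-posterior workflow described above, using simulated data rather than the data from the post, might look like this in pymc3:

```python
import numpy as np
import pymc3 as pm

# Simulated data: y = 1 + 2x + noise (illustrative only)
rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=200)

with pm.Model() as model:
    # Priors: our initial plausibilities for each parameter
    alpha = pm.Normal("alpha", mu=0, sigma=10)
    beta = pm.Normal("beta", mu=0, sigma=10)
    sigma = pm.HalfNormal("sigma", sigma=1)

    # Likelihood: how the data are assumed to be generated
    mu = alpha + beta * x
    pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)

    # Posterior: the priors updated by the data
    trace = pm.sample(1000, tune=1000)
```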

DL

Multilayer Perceptrons

Published:

Artificial Neural Networks (ANN) are powerful classification and regression algorithms that can solve simple and complex, linear and non-linear modeling problems. In this post, we walk through the application of multilayer perceptrons by building a basic deep learning multi-layer perceptron model in PyTorch on the famous MNIST data set.

Feature Normalization and Initialization

Published:

In this guide, we will walk through feature normalization and weight initialization schemes in PyTorch. In short, we normalize our inputs for gradient descent because large weights will dominate our updates in our attempt to find global or local minima. Separately, we use custom weight initialization schemes to improve our ability to converge during optimization or to improve our ability to use certain activation functions. In this post, we walk through the application of batch normalization and Xavier initialization in PyTorch.
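A minimal sketch of both ideas in PyTorch, with an illustrative network rather than the post's actual model (layer sizes are placeholders):

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, n_features: int, n_classes: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64),
            nn.BatchNorm1d(64),   # normalize activations so no single feature dominates updates
            nn.ReLU(),
            nn.Linear(64, n_classes),
        )
        self.apply(self._init_weights)

    @staticmethod
    def _init_weights(module):
        # Xavier (Glorot) initialization keeps activation variance stable across layers
        if isinstance(module, nn.Linear):
            nn.init.xavier_uniform_(module.weight)
            nn.init.zeros_(module.bias)

    def forward(self, x):
        return self.net(x)

model = MLP(n_features=20, n_classes=2)
```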

Regularization

Published:

We can think of regularization as a general means to reduce overfitting. In practice, this means that regularizing effects aim to reduce the model capacity and/or the variance of the predictions. In this post, we walk through the application of L2 regularization (manually and automatically) and dropout in PyTorch.
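A minimal sketch of both approaches in PyTorch, with illustrative data and hyperparameters rather than the post's actual setup:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # randomly zeroes activations during training to reduce co-adaptation
    nn.Linear(64, 2),
)
criterion = nn.CrossEntropyLoss()

# Automatic L2 regularization: the optimizer applies weight decay at every update
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# Manual L2 regularization: add the penalty to the loss yourself
# (if you do this, set weight_decay=0 above so the penalty is not applied twice)
x, y = torch.randn(32, 20), torch.randint(0, 2, (32,))
l2_penalty = sum(p.pow(2).sum() for p in model.parameters())
loss = criterion(model(x), y) + 1e-4 * l2_penalty
```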

Optimization

Published:

Numerical optimizers provide the means to estimate our parameters by finding the values that maximize the likelihood of generating our data. This guide helps us understand how optimization algorithms find the best estimates for our coefficients in a step-by-step process. It uses gradient descent and regularization techniques in a completely manual approach to finding the parameters that most likely generated our data. Another way of thinking about gradient descent is that we are essentially asking the algorithm: What parameter values will push our error to zero?
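A minimal sketch of manual gradient descent for a simple linear model, using simulated data rather than the data from the guide:

```python
import numpy as np

# Simulated data generated by y = 2 + 3x + noise (illustrative only)
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=500)

# Start from arbitrary guesses and repeatedly step down the gradient of the
# mean squared error, i.e., "what parameter values push our error toward zero?"
b0, b1, lr = 0.0, 0.0, 0.1
for _ in range(1000):
    error = (b0 + b1 * x) - y
    grad_b0 = 2 * error.mean()
    grad_b1 = 2 * (error * x).mean()
    b0 -= lr * grad_b0
    b1 -= lr * grad_b1

print(round(b0, 2), round(b1, 2))  # should land near 2 and 3
```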

Distributed Training

Published:

Spark is an open source cluster computing framework that automates the distribution of data and computations on a cluster of computers. Databricks handles much of the architecture and cluster management for you, leveraging Jupyter-style notebooks. This guide shows how to perform distributed deep learning using PyTorch on Databricks.

ML

K Nearest Neighbors

Published:

The K Nearest Neighbors (KNN) algorithm is part of a family of classifier algorithms that aim to predict the class or category of an observation. KNN works by calculating the distance, often the Euclidean (i.e., straight line) distance, between observations. In this post, we walk through the application of the KNN algorithm and demonstrate the conditions under which the algorithm excels, does poorly, and is improved through feature engineering.
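A minimal sketch of the distance-and-vote logic, using toy data rather than the post's data set:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by majority vote among its k nearest (Euclidean) neighbors."""
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy example: two well-separated clusters of points
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(knn_predict(X, y, np.array([3.5, 3.5])))  # expected: 1
```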

Decision Trees

Published:

The Decision Tree algorithm is part of a family of classifier and regression algorithms that aim to predict the class or value of an observation. Decision trees classify data by splitting features at specified thresholds such that, ideally, we can perfectly predict the observation's label. At its core, features are split using two relatively simple measures: entropy and information gain. When deciding how to split a feature, a threshold is selected such that the information gain is highest, meaning more information is revealed and our predictions of the dependent variable's label are thereby improved (or perfect).
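A minimal sketch of entropy and information gain for a single candidate split, with a toy feature rather than the post's data:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(feature, labels, threshold):
    """Reduction in entropy from splitting the labels at the given threshold."""
    left, right = labels[feature <= threshold], labels[feature > threshold]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

# Toy example: a feature that separates the two classes perfectly at 5.0
feature = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
labels = np.array([0, 0, 0, 1, 1, 1])
print(information_gain(feature, labels, threshold=5.0))  # 1.0 bit, a perfect split
```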

Naive Bayes

Published:

The Naive Bayes algorithm is part of a family of classifier algorithms that aim to predict the category of an observation. It is a generative model, fit by Maximum Likelihood Estimation (MLE), in which each class is assumed to generate its features. At its core, the algorithm uses Bayes' theorem. In this post, we walk through the application of the Naive Bayes algorithm and demonstrate the conditions under which the algorithm excels, does poorly, and is improved through feature engineering.
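A minimal sketch of the generative idea using scikit-learn's MultinomialNB on a toy corpus (the texts and labels below are purely illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus: each class "generates" its characteristic words
texts = ["great match and a late goal", "striker scores a goal",
         "markets fell on rate fears", "stocks rally as rates hold"]
labels = ["sports", "sports", "finance", "finance"]

vec = CountVectorizer()
X = vec.fit_transform(texts)          # word counts per document
clf = MultinomialNB().fit(X, labels)  # class-conditional word probabilities via Bayes' theorem

print(clf.predict(vec.transform(["a dramatic goal in the final minute"])))  # ['sports']
```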

Support Vector Machines

Published:

The Support Vector Machine (SVM) algorithm is part of a family of classifier and regression algorithms that aim to predict the class or value of an observation. The SVM algorithm identifies data points, called support vectors, that generate the widest possible margin between two classes in order to yield the best classification generalization. The SVM is made powerful by the use of kernels, functions that compute the dot product of two vectors in a transformed feature space, thereby allowing us to effectively skip explicit feature transformations and consequently improve computational performance. In this post, we walk through the application of the SVM algorithm through linear and nonlinear modeling.
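A minimal sketch of the kernel idea using scikit-learn's SVC on a toy nonlinear problem (the simulated data is illustrative only):

```python
import numpy as np
from sklearn.svm import SVC

# Toy nonlinear problem: a ring of one class around a cluster of the other
rng = np.random.default_rng(2)
inner = rng.normal(0, 0.5, (100, 2))
angles = rng.uniform(0, 2 * np.pi, 100)
outer = np.c_[3 * np.cos(angles), 3 * np.sin(angles)] + rng.normal(0, 0.2, (100, 2))
X = np.vstack([inner, outer])
y = np.array([0] * 100 + [1] * 100)

# A linear kernel struggles here; the RBF kernel separates the classes
# without any explicit feature transformation
print(SVC(kernel="linear").fit(X, y).score(X, y))
print(SVC(kernel="rbf").fit(X, y).score(X, y))
```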

Dimensionality Reduction

Published:

The idea behind dimensionality reduction is simple: take high dimensional feature spaces ($k$) and project them onto lower dimensional subspaces ($m$), where $m$ < $k$. Dimensionality reduction has several appealing properties, such as mitigating the curse of dimensionality and overfitting, and it also allows us to visualize and compress high dimensional data. High dimensional data that would otherwise be too difficult for us to understand or interpret becomes much more salient when we collapse it down into two or three dimensions. In this post, we walk through the application of principal component analysis, a central dimensionality reduction algorithm.
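A minimal sketch of projecting correlated high dimensional data onto two components with scikit-learn's PCA (simulated data, not the post's):

```python
import numpy as np
from sklearn.decomposition import PCA

# Correlated 10-dimensional data that really lives on a 2-dimensional subspace
rng = np.random.default_rng(3)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 10)) + rng.normal(scale=0.1, size=(200, 10))

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)              # project k=10 dimensions down to m=2
print(X_2d.shape)                        # (200, 2)
print(pca.explained_variance_ratio_)     # nearly all variance sits in the first two components
```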

Cluster Analysis

Published:

Cluster analysis is a form of unsupervised learning that aims to discover and explore the underlying structure in the data. The crux of a cluster analysis algorithm is its distance metric: the way you measure similarity or distance between observations. Unsupervised learning is often used in situations where you do not have labelled data (perhaps labelling is expensive) or when you do not know the correct values for some of your data and therefore want to evaluate its underlying structure.
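A minimal sketch using k-means, one common distance-based clustering algorithm, on simulated unlabelled data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabelled data with three underlying groups
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(loc, 0.5, (50, 2)) for loc in (0, 5, 10)])

# k-means assigns each observation to the cluster with the nearest centroid (Euclidean distance)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(np.bincount(kmeans.labels_))  # roughly 50 observations per discovered cluster
```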

NLP

Classification: CNN

Published:

Text or sequence classification aims to label a sentence or document based on its content. In this post, we use Convolutional Neural Networks and pre-trained embeddings to classify a novel corpus. This post provides a full treatment of the steps required to prepare data for NLP analysis and to analyze it with PyTorch. It also includes sample code to optimize the hyperparameters through state-of-the-art pruning and search algorithms via Optuna.

Classification: DistilBERT

Published:

Text or sequence classification aims to label a sentence or document based on its content. In this post, we use Transformers to classify a novel data set that I created based on insurgent propaganda messages. This post provides a full treatment of the steps required to prepare data for NLP analysis and to analyze it with PyTorch. It also includes sample code to optimize the hyperparameters through state-of-the-art pruning and search algorithms via Optuna.

Generation: DistilGPT-2

Published:

Language models are trained to predict the probability of the next token given the tokens that came before it. A token can be a word, a letter, or a subcomponent of a word. In this guide, we use a decoder-only Transformer language model to predict text from our novel insurgent propaganda corpus.
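A minimal sketch of next-token generation with an off-the-shelf distilgpt2 from the Hugging Face transformers library (not the model fine-tuned on the propaganda corpus in the guide); the prompt is illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# The model assigns a probability to each possible next token given the prompt,
# and generate() extends the sequence one token at a time.
inputs = tokenizer("The statement released this morning", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_k=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```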

Classification: BERT-CNN

Published:

In this guide, we prepare a BERT-CNN ensemble that takes the embeddings generated by the BERT base model and feeds them into a CNN. The general logic from this guide can be used to replace the CNN with any other neural network of your choice. While it is a fun task to explore, adding what is technically an inferior model on top of a Transformer is not really necessary. Like other guides, this walkthrough provides a complete treatment of the data preparation and training of the BERT-CNN ensemble in PyTorch.

Summarization: T5

Published:

There are two types of summaries: (1) abstractive, or explaining in your own words, and (2) extractive, or building a summary from existing text. Humans are mostly abstractive while NLP systems are mostly extractive. In this guide, we use T5, a pre-trained and very large (roughly twice the size of BERT-base) encoder-decoder Transformer model. T5, a model devised by Google, is an important advancement in the field of Transformers because it achieves near human-level performance on a variety of benchmarks like GLUE and SQuAD.

Classification: Hierarchical Attention Networks

Published:

Hierarchical Attention Networks (HAN), as the name suggests, have a hierarchical structure that reflects the hierarchical nature of documents. They apply two levels of attention mechanisms, at the word and sentence level, which afford them the ability to differentiate more and less important content when evaluating documents. In this guide, we walk through how to create a Hierarchical Attention Network in PyTorch, as well as how to create and structure our data appropriately.

Classification: T5

Published:

In this guide, we use T5, a pre-trained and very large (roughly twice the size of BERT-base) encoder-decoder Transformer model, for a classification task. T5, a model devised by Google, is an important advancement in the field of Transformers because it achieves near human-level performance on a variety of benchmarks like GLUE and SQuAD. Another important advancement is that it treats NLP as a text-to-text problem, whereby our inputs are text and our outputs are also text. In this universal framework, T5 can therefore handle any NLP task (in English). T5 was pre-trained on the C4 (Colossal Clean Crawled Corpus) corpus, which amounts to roughly 750GB of clean English text. For comparative purposes, BERT was trained on roughly 13GB of text and XLNet was trained on roughly 126GB of text. For these reasons, T5 is the state of the art, and its encoder-decoder architecture is likely the future of NLP models.

Classification: Character CNN

Published:

In this post, we use character-level Convolutional Neural Networks (CNN) to classify a novel data set in PyTorch. CNNs are useful for extracting information from raw signals in domains ranging from computer vision to speech and text. Character-level CNNs treat text characters as a kind of raw signal, thereby allowing CNNs to eschew the requirement to develop an understanding of words. While character-level CNNs work in many different situations, they are particularly well suited to Twitter or multilingual text, where advanced embeddings are usually unavailable to researchers. In this guide, we replicate the model specified in the paper Character-level Convolutional Networks for Text Classification.

Classification: Capsule Routing

Published:

While capsule networks have been used with CNNs in the field of computer vision, recent work shows that they work well in Natural Language Processing (NLP) as well. “A capsule is a group of neurons whose outputs represent different properties of the same entity in different contexts. Routing by agreement is an iterative form of clustering in which a capsule detects an entity by looking for agreement among votes from input capsules that have already detected parts of the entity in a previous layer” (Heinsen, 2019). Capsule networks are a means of aggregating the importance of embeddings, akin to attention mechanisms.

Classification: Entity Embeddings

Published:

In this guide, we will implement entity embeddings in two ways via PyTorch: (1) via nn.Embedding(), and (2) via transformers. We will also show how to load data more efficiently through a custom PyTorch data set class. This style of data management is slightly more complicated to initialize, but it is the precise way we want to load our data when dealing with: (1) big data, or (2) a memory-constrained environment. Entity embeddings refer to the idea of transforming categorical variables into continuous embeddings to avoid one-hot encoding and sparse matrices.
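A minimal sketch of the nn.Embedding() approach, where the feature name and dimensions are illustrative rather than taken from the guide:

```python
import torch
import torch.nn as nn

# Suppose a categorical feature "aisle" with 30 distinct levels; instead of a 30-column
# one-hot matrix, each level gets a dense, learnable 8-dimensional vector.
n_categories, embedding_dim = 30, 8
embedding = nn.Embedding(num_embeddings=n_categories, embedding_dim=embedding_dim)

aisle_ids = torch.tensor([0, 7, 7, 29])  # integer-encoded category values for a batch
dense = embedding(aisle_ids)             # shape: (4, 8), trained with the rest of the model
print(dense.shape)
```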

Semantic Search

Published:

Natural language processing and computer vision methods generate high-dimensional vectors that represent text and images, yet traditional databases that can be queried with SQL are not adapted to these new representations. Given enough text and media, this information can quickly encompass billions of vectors. Finding similar entries means finding similar high-dimensional vectors, which is inefficient and likely impossible with standard query languages. Similarity search fills this void by searching for similar vectors: those nearby in Euclidean space. We can leverage similarity search algorithms once our vectors are generated by deep learning algorithms. In this post, we will use Faiss (Facebook AI Similarity Search).
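A minimal sketch of exact L2 similarity search with Faiss, using random vectors as stand-ins for deep-learning embeddings:

```python
import numpy as np
import faiss  # Facebook AI Similarity Search

d = 128                                              # dimensionality of the embedding vectors
rng = np.random.default_rng(5)
corpus = rng.random((10_000, d)).astype("float32")   # stand-ins for stored embeddings
queries = rng.random((3, d)).astype("float32")

index = faiss.IndexFlatL2(d)                         # exact search over Euclidean (L2) distance
index.add(corpus)
distances, ids = index.search(queries, 5)            # 5 nearest stored vectors per query
print(ids)
```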

applied

Survival Analysis

Published:

This post walks through survival analysis on a panel dataset. It replicates a model that estimates the duration of democratic regime survival on a panel dataset incorporating 105 countries from 1950 to 2004.

Unordered Categorical Models

Published:

This post replicates and extends a flagship American Political Science Review article that uses a multinomial logit to predict the likelihood of an insurgent group’s mode of warfare given the presence or absence of the Cold War.

Binary Regression

Published:

In October 1988, a plebiscite vote was held in Chile to determine whether or not Augusto Pinochet should extend his rule for another eight years. The package carData contains Chilean national survey data collected in April and May 1988. In this analysis, we evaluate the effect that several variables have on a voter’s likelihood to keep Pinochet in power using binary regression models.

Time Series Analysis

Published:

This post applies time series analysis to data provided by Kwon (2015) which seeks to empirically understand the causal relationship between political polarization and income inequality in the U.S.

Linear Models

Published:

What causes differences in people’s life satisfaction across countries? The Association of Religion Data Archives (ARDA) has assembled a dataset that stitches together economic, social and demographic variables across 252 countries. In this analysis, we inspect factors associated with life satisfaction with linear models.

Ordered Categorical Models

Published:

In 2009, a scandal broke out across England. Many British Members of Parliament (MPs) were exposed as misusing the allowances and expenses permitted to them as elected officials. In 2010, prior to parliamentary elections, British voters were surveyed as to whether they thought the MPs implicated in the scandal should resign. At the time of the scandal, the Labour party, led by PM Gordon Brown, was in power. The Conservative party, led by David Cameron, won the largest number of votes and seats in the 2010 general election on the heels of the scandal. In this analysis, we use generalized linear models for ordered categorical data to further explore the survey data.

Count Model Analysis

Published:

Gelman and Hill (2007) collected New York City (NYC) “stop and frisk” data for 175,000 stops over a 15-month period in 1998-1999. In this analysis, count models are used to model the data.

Multiple Imputation

Published:

Missingness refers to observations or measurements that, in principle, could have been carried out but for whatever reason failed to occur (Ward and Ahlquist 2018). Data that we collect, whether observational or experimental, will almost certainly have some missing values. The general idea behind missing data imputation algorithms is that they fill in the missing data with estimates of what the real data would look like if it were available. Because the estimated data is by nature uncertain, we replicate the missing data many times to incorporate the uncertainty into the analysis. This post walks through a cutting-edge multiple imputation technique developed by Hollenbach et al. (2018).

Numerical Optimization

Published:

Maximum likelihood fixes our observed data and asks: What parameter values most likely generated the data we have? In a likelihood framework, our data are viewed through a joint probability expressed as a function of the parameter values of a specified mass or density function. This joint probability is maximized with respect to the parameters. A maximum likelihood estimate is the one that provides the density or mass function with the highest likelihood of generating the observed data. Numerical optimizers provide the means to estimate our parameters by finding the values that maximize the likelihood of generating our data. This guide helps us understand how optimization algorithms find the best estimates for our coefficients in a step-by-step process.
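A minimal sketch of this idea with scipy, maximizing a normal log-likelihood over simulated data (not the guide's manual implementation):

```python
import numpy as np
from scipy.optimize import minimize

# Simulated data from a normal distribution with unknown mean and standard deviation
rng = np.random.default_rng(6)
data = rng.normal(loc=5.0, scale=2.0, size=1_000)

def neg_log_likelihood(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)  # keep sigma positive
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (data - mu) ** 2 / (2 * sigma**2))

# The optimizer searches for the parameter values that maximize the likelihood
# (equivalently, minimize the negative log-likelihood) of generating the data.
result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), method="BFGS")
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(round(mu_hat, 2), round(sigma_hat, 2))  # should be near 5 and 2
```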

portfolio

publications

talks

teaching

Essential Empirical Methods

Undergraduate-level course, 2019

Essential Empirical Methods is a three part course that introduces new analysts to ideas of: (1) concept measurement and construction, (2) describing and summarizing data, and (3) formulating hypotheses and making comparisons.