Posts by Collection

DL

Multilayer Perceptrons

Published:

Artificial Neural Networks (ANNs) are powerful classification and regression algorithms that can solve simple and complex linear and non-linear modeling problems. In this post, we demonstrate the functionality of a basic deep learning multilayer perceptron model in PyTorch using the famous MNIST data set.

Feature Normalization and Initialization

Published:

In this guide, we walk through feature normalization and weight initialization schemes in PyTorch. In short, we normalize our inputs for gradient descent because features measured on large scales would otherwise dominate our updates as we search for global or local minima. Separately, we use custom weight initialization schemes to improve convergence during optimization or to make certain activation functions easier to use. Specifically, we apply batch normalization and Xavier initialization in PyTorch.
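
To make the two ideas concrete, here is a minimal sketch in plain Python rather than PyTorch. It assumes z-score standardization as the normalization step and the uniform variant of Xavier (Glorot) initialization; the names `standardize`, `xavier_uniform`, and the `income` feature are hypothetical and used only for illustration.

```python
import math
import random

def standardize(column):
    """Rescale one feature to zero mean and unit variance (z-scores)."""
    mean = sum(column) / len(column)
    var = sum((x - mean) ** 2 for x in column) / len(column)
    std = math.sqrt(var)
    return [(x - mean) / std for x in column]

def xavier_uniform(fan_in, fan_out, rng):
    """Draw a fan_out x fan_in matrix from U(-a, a) with a = sqrt(6 / (fan_in + fan_out))."""
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)] for _ in range(fan_out)]

# A feature on a large scale (hypothetical incomes) is brought onto a unit scale.
income = [20_000.0, 35_000.0, 50_000.0, 95_000.0]
z_income = standardize(income)

# Weights for a layer with 4 inputs and 3 outputs, seeded for reproducibility.
weights = xavier_uniform(fan_in=4, fan_out=3, rng=random.Random(0))
```

After standardization every feature contributes on a comparable scale, and the Xavier bound shrinks as layers get wider so that activations neither explode nor vanish at the start of training.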

Regularization

Published:

We can think of regularization as a general means to reduce overfitting. In practice, this means that regularizing effects aim to reduce the model capacity and/or the variance of the predictions. In this post, we walk through the application of L2 regularization (manually and automatically) and dropout in PyTorch.
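
As a rough illustration of both techniques, the sketch below applies a manual L2 penalty inside a single gradient step and implements inverted dropout in plain Python. The function names and the toy weights are assumptions for this example, not code from the post.

```python
import random

def sgd_step(weights, grads, lr, lam=0.0):
    """One gradient step on loss + lam * ||w||^2.

    The L2 penalty contributes 2 * lam * w to each weight's gradient,
    which pulls every weight toward zero.
    """
    return [w - lr * (g + 2.0 * lam * w) for w, g in zip(weights, grads)]

def dropout(activations, p, rng):
    """Inverted dropout: zero each unit with probability p, rescale survivors by 1 / (1 - p)."""
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

weights = [3.0, -2.0, 0.5]
zero_grads = [0.0, 0.0, 0.0]  # pretend the data loss is flat at this point

plain = sgd_step(weights, zero_grads, lr=0.1)             # weights unchanged
decayed = sgd_step(weights, zero_grads, lr=0.1, lam=0.5)  # each weight shrinks by 10%

activations = [1.0, 2.0, 3.0, 4.0]
dropped = dropout(activations, p=0.5, rng=random.Random(0))
```

With a flat data loss, the only force acting on the weights is the penalty, which makes the shrinkage easy to see. Rescaling the surviving activations keeps their expected value unchanged, so no adjustment is needed at inference time.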

Optimization

Published:

Numerical optimizers provide the means to estimate our parameters by finding the values that maximize the likelihood of generating our data. This guide helps us understand how optimization algorithms find the best estimates for our coefficients in a step-by-step process. This guide uses gradient descent and regularization techniques in a completely manual approach to finding the parameters that most likely generated our data. Another way of thinking about gradient descent is that we are ultimately asking the algorithm the following: What parameter values will push our error as close to zero as possible?
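
A fully manual gradient descent loop can be sketched in a few lines. The example below assumes a one-feature linear model fit with mean squared error; the toy data (generated from y = 2x + 1), the learning rate, and the step count are illustrative choices, not values from the post.

```python
def gradient_descent(xs, ys, lr=0.05, steps=2000):
    """Fit y = w * x + b by repeatedly stepping against the gradient of the MSE."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))  # d(MSE)/dw
        grad_b = 2.0 / n * sum(errors)                             # d(MSE)/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1 with no noise
w, b = gradient_descent(xs, ys)
```

Because the data are noiseless, the error can be driven essentially to zero and the loop recovers a slope near 2 and an intercept near 1. With noisy data the same loop would settle at the parameters with the smallest achievable error instead.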

Distributed Training

Published:

Spark is an open-source cluster computing framework that automates the distribution of data and computations on a cluster of computers. Databricks handles much of the architecture and cluster management for you, leveraging Jupyter-style notebooks. This guide shows how to perform distributed deep learning using PyTorch on Databricks.

ML

K Nearest Neighbors

Published:

The K Nearest Neighbors (KNN) algorithm is part of a family of classifier algorithms that aim to predict the class or category of an observation. KNN works by calculating the distance, often the Euclidean (i.e., straight line) distance, between observations. In this post, we walk through the application of the KNN algorithm and demonstrate the conditions under which the algorithm excels, does poorly, and is improved through feature engineering.
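
The core of the algorithm fits in a short function. The sketch below assumes Euclidean distance and a simple majority vote among the `k` closest training points; the toy points and labels are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)  # Euclidean (straight-line) distance
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

train_points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
train_labels = ["a", "a", "a", "b", "b", "b"]

near_first_cluster = knn_predict(train_points, train_labels, (2.0, 2.0))
near_second_cluster = knn_predict(train_points, train_labels, (7.5, 8.0))
```

Because every prediction depends directly on raw distances, features measured on very different scales can distort the vote, which is one reason feature engineering matters for KNN.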

Decision Trees

Published:

The Decision Tree algorithm is part of a family of classifier and regression algorithms that aim to predict the class or value of an observation. Decision trees classify data by splitting features at specified thresholds such that, ideally, we can perfectly predict the observation’s label. At their core, splits are chosen using two relatively simple measures: entropy and information gain. When deciding how to split a feature, a threshold is selected such that the information gain is the highest, meaning more information is revealed and thereby our predictions for our dependent variable’s label are improved (or perfect).
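
The two quantities can be computed directly. The sketch below assumes Shannon entropy in bits and a binary split of a single numeric feature at a threshold; the helper names and the toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Parent entropy minus the size-weighted entropy of the two child nodes."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    children = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - children

values = [1, 2, 3, 10, 11, 12]
labels = [0, 0, 0, 1, 1, 1]

clean_split = information_gain(values, labels, threshold=5)    # separates the classes
poor_split = information_gain(values, labels, threshold=1.5)   # leaves a mixed child
```

A tree-building routine would evaluate candidate thresholds like these for every feature and keep the one with the highest gain; here the threshold of 5 removes all uncertainty, so its gain equals the full parent entropy of one bit.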

Naive Bayes

Published:

The Naive Bayes algorithm is part of a family of classifier algorithms that aim to predict the category of an observation. It is a generative model, typically fit by maximum likelihood estimation (MLE), that describes how each class generates its features. At its core, the algorithm uses Bayes' theorem. In this post, we walk through the application of the Naive Bayes algorithm and demonstrate the conditions under which the algorithm excels, does poorly, and is improved through feature engineering.
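
A small counting-based sketch shows how Bayes' theorem and the "naive" conditional independence assumption combine. It assumes categorical features and uses no smoothing, so a value never seen with a class gets probability zero; the weather-style data and function names are invented for illustration.

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Count class frequencies and per-class feature value frequencies."""
    priors = Counter(labels)
    likelihood = defaultdict(Counter)  # (label, feature position) -> value counts
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            likelihood[(label, j)][value] += 1
    return priors, likelihood

def posterior(row, priors, likelihood):
    """P(class | features), proportional to P(class) * product of P(feature | class)."""
    n = sum(priors.values())
    scores = {}
    for label, count in priors.items():
        score = count / n
        for j, value in enumerate(row):
            score *= likelihood[(label, j)][value] / count
        scores[label] = score
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "mild"), ("rainy", "cool"), ("sunny", "cool")]
labels = ["out", "out", "out", "in", "in", "in"]

priors, likelihood = train(rows, labels)
post = posterior(("sunny", "mild"), priors, likelihood)
```

For the query ("sunny", "mild"), the "out" class scores 1/2 * 2/3 * 2/3 and the "in" class scores 1/2 * 1/3 * 1/3, which normalizes to posteriors of 0.8 and 0.2.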

Support Vector Machines

Published:

The Support Vector Machine (SVM) algorithm is part of a family of classifier and regression algorithms that aim to predict the class or value of an observation. The SVM algorithm identifies data points, called support vectors, that generate the widest possible margin between two classes in order to yield the best classification generalization. The SVM is made powerful by the use of kernels, functions that compute the dot product of two vectors in a transformed feature space, thereby allowing us to effectively skip explicit feature transformations and consequently improve computational performance. In this post, we walk through the application of the SVM algorithm through linear and nonlinear modeling.
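
The kernel idea can be checked numerically. The sketch below assumes a homogeneous degree-2 polynomial kernel on two-dimensional inputs, whose explicit feature map is known in closed form; `phi` and `poly_kernel` are illustrative names.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit degree-2 feature map for a 2-D input: (x1^2, sqrt(2) * x1 * x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def poly_kernel(x, z):
    """Degree-2 polynomial kernel: the same dot product, computed without calling phi."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, 4.0)
explicit = dot(phi(x), phi(z))  # transform first, then take the dot product
implicit = poly_kernel(x, z)    # kernel evaluated in the original 2-D space
```

Both routes give 121 for these inputs, but the kernel never builds the transformed vectors. For higher degrees or the RBF kernel the explicit feature space becomes very large or infinite, which is where skipping the transformation pays off.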

Dimensionality Reduction

Published:

The idea behind dimensionality reduction is simple: take high dimensional feature spaces ($k$) and project them onto lower dimensional subspaces ($m$), where $m < k$. Dimensionality reduction has several appealing properties, such as mitigating the curse of dimensionality and overfitting, and it also allows us to visualize high dimensional data and to compress it. High dimensional data that would otherwise be too difficult for us to understand or interpret becomes much easier to grasp when we collapse it down into two or three dimensions. In this post, we walk through the application of principal component analysis, a central dimensionality reduction algorithm.
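
As a minimal sketch of the projection step, the code below finds the first principal component with power iteration on the sample covariance matrix and projects the data onto it. Power iteration is a stand-in chosen to keep the example dependency-free (a library routine would normally use an eigendecomposition or SVD), and it assumes the starting vector is not orthogonal to the leading component.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return the leading eigenvector of the covariance matrix and the 1-D projections."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    centered = [[r[j] - means[j] for j in range(k)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(k)]
           for i in range(k)]
    v = [1.0] + [0.0] * (k - 1)  # assumed not orthogonal to the leading eigenvector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(r[j] * v[j] for j in range(k)) for r in centered]
    return v, scores

# Four 2-D points that lie on the line y = x, so one dimension captures everything.
rows = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
component, scores = first_principal_component(rows)
```

Here $k = 2$ and $m = 1$: the component points along the diagonal, and each observation is summarized by a single score instead of two coordinates.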

Cluster Analysis

Published:

Cluster analysis is a form of unsupervised learning that aims to discover and explore the underlying structure in the data. The crux of a cluster analysis algorithm is its distance metric: the way you measure similarity or distance between observations. Unsupervised learning is often used in situations where you do not have labelled data (perhaps labelling is expensive) or when you might not know the correct values for some of your data and therefore want to evaluate its underlying structure.

NLP

Classification: CNN

Published:

Text or sequence classification aims to label a sentence or document based on its content. In this post, we use Convolutional Neural Networks and pre-trained embeddings to classify a novel corpus. This post provides a full treatment of the steps required to prepare data for NLP analysis and to analyze it with PyTorch. It also includes sample code to optimize the hyperparameters through state-of-the-art pruning and search algorithms via Optuna.

Classification: DistilBERT

Published:

Text or sequence classification aims to label a sentence or document based on its content. In this post, we use Transformers to classify a novel data set that I created based on insurgent propaganda messages. This post provides a full treatment of the steps required to prepare data for NLP analysis and to analyze it with PyTorch. It also includes sample code to optimize the hyperparameters through state-of-the-art pruning and search algorithms via Optuna.

Generation: DistilGPT-2

Published:

Language models are trained to predict the probability of the next token given the tokens that precede it. A token can be a word, a letter, or a subcomponent of a word. In this guide, we use a decoder-only Transformer language model to generate text from our novel insurgent propaganda corpus.
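
The prediction task can be illustrated with a deliberately simplified stand-in: a bigram model that conditions on only one preceding token, whereas a Transformer conditions on the whole preceding context. The toy sentence and function names below are assumptions for this sketch.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each preceding token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev):
    """Turn the follow-up counts for `prev` into a probability distribution."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

tokens = "the cat sat on the mat and the cat slept".split()  # word-level tokens
model = train_bigram(tokens)
probs = next_token_probs(model, "the")
```

In this corpus "the" is followed by "cat" twice and "mat" once, so the model assigns them probabilities of 2/3 and 1/3. Generating text amounts to repeatedly sampling from such a distribution and appending the result.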

Classification: BERT-CNN

Published:

In this guide, we prepare a BERT-CNN ensemble which takes the embeddings generated by the BERT base model and feeds them into a CNN. The general logic from this guide can be used to replace the CNN with any other NN of your choice. While it is a fun task to explore, adding what is technically an inferior model on top of a Transformer is not really necessary. Like other guides, this walkthrough provides a complete treatment of the data preparation and training of the BERT-CNN ensemble in PyTorch.

Summarization: T5

Published:

There are two types of summaries: (1) abstractive, or explaining in your own words, and (2) extractive, or building a summary from existing text. Humans are mostly abstractive while NLP systems are mostly extractive. In this guide, we use T5, a pre-trained and very large (roughly twice the size of BERT-base) encoder-decoder Transformer model. T5, a model devised by Google, is an important advancement in the field of Transformers because it achieves near human-level performance on a variety of benchmarks like GLUE and SQuAD.

Classification: Hierarchical Attention Networks

Published:

A Hierarchical Attention Network (HAN), as its name suggests, has a hierarchical structure that reflects the hierarchical nature of documents. It applies two levels of attention, at the word level and at the sentence level, which allow it to weight more and less important content differently when evaluating documents. In this guide, we walk through how to create a Hierarchical Attention Network in PyTorch as well as how to create and structure our data appropriately.

Classification: T5

Published:

In this guide, we use T5, a pre-trained and very large (roughly twice the size of BERT-base) encoder-decoder Transformer model, for a classification task. T5, a model devised by Google, is an important advancement in the field of Transformers because it achieves near human-level performance on a variety of benchmarks like GLUE and SQuAD. Another important advancement is that it treats NLP as a text-to-text problem, whereby our inputs are text and our outputs are also text. In this universal framework, T5 can therefore handle any NLP task (in English). T5 was pre-trained on C4 (the Colossal Clean Crawled Corpus), which amounts to roughly 750GB of clean English text. For comparative purposes, BERT was trained on roughly 13GB of text and XLNet was trained on roughly 126GB of text. For these reasons, T5 is the state of the art and its encoder-decoder architecture is likely the future of NLP models.

Classification: Character CNN

Published:

In this post, we use character-level Convolutional Neural Networks (CNN) to classify a novel data set in PyTorch. CNNs are useful in extracting information from raw signals in domains ranging from computer vision to speech and text. Character-level CNNs treat text characters as a kind of raw signal, thereby allowing CNNs to eschew the requirement to develop an understanding of words. While character-level CNNs work in many different situations, they are particularly well suited to Twitter or multilingual text, where advanced embeddings are usually unavailable to researchers. In this guide, we replicate the model as specified in the paper Character-level Convolutional Networks for Text Classification.

Classification: Capsule Routing

Published:

While capsule networks have been used in the field of computer vision and CNNs, recent work shows that they work well in Natural Language Processing (NLP) as well. “A capsule is a group of neurons whose outputs represent different properties of the same entity in different contexts. Routing by agreement is an iterative form of clustering in which a capsule detects an entity by looking for agreement among votes from input capsules that have already detected parts of the entity in a previous layer” (Heinsen, 2019). Capsule networks are a means for aggregating the importance of embeddings akin to attention mechanisms.

Classification: Entity Embeddings

Published:

In this guide, we will implement entity embeddings in two ways via PyTorch: (1) via nn.Embedding(), and (2) via transformers. We will also show how to load data in a more efficient manner through a custom PyTorch data set class. This style of data management is slightly more complicated to initialize, but is the precise way we want to load our data when dealing with: (1) big data, or (2) a memory-conservative environment. Entity embeddings refer to the idea of transforming categorical variables into continuous embeddings to avoid one-hot encoding and sparse matrices.
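
To show why an embedding replaces one-hot encoding, the plain-Python sketch below compares a direct table lookup with the equivalent one-hot matrix product. The random table stands in for weights that would normally be learned during training, and the color categories and function names are hypothetical.

```python
import random

categories = ["red", "green", "blue"]
index = {c: i for i, c in enumerate(categories)}

rng = random.Random(0)
dim = 2  # embedding size, chosen arbitrarily for this sketch
table = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in categories]

def embed(category):
    """Embedding lookup: pick the row of the table for this category."""
    return table[index[category]]

def one_hot_matmul(category):
    """The same result computed the sparse way: a one-hot vector times the table."""
    one_hot = [1.0 if i == index[category] else 0.0 for i in range(len(categories))]
    return [sum(one_hot[i] * table[i][j] for i in range(len(categories)))
            for j in range(dim)]

dense = embed("green")
sparse_route = one_hot_matmul("green")
```

Both routes return the same vector, but the lookup never materializes the one-hot vector. With thousands of categories that difference is what avoids large sparse matrices.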

Semantic Search

Published:

Natural language processing and computer vision methods generate high-dimensional vectors that represent text and images, yet traditional databases queried with languages like SQL are not adapted to these new representations. Given enough text and media, this information can quickly encompass billions of vectors. Finding similar entries means finding similar high-dimensional vectors, which is inefficient and likely impossible with standard query languages. Similarity search fills this void by searching for similar vectors, that is, those nearby in Euclidean space. We can leverage similarity search algorithms once our vectors are generated by deep learning algorithms. In this post, we will use Faiss (Facebook AI Similarity Search).
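
The underlying operation is a nearest-neighbor query over vectors. The sketch below is a brute-force version in plain Python, assuming squared Euclidean distance; it is not Faiss code, and the tiny two-dimensional vectors stand in for real embeddings.

```python
import heapq

def squared_l2(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def search(vectors, query, k):
    """Return the k (distance, index) pairs closest to the query, nearest first."""
    return heapq.nsmallest(
        k, ((squared_l2(v, query), i) for i, v in enumerate(vectors))
    )

vectors = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.9, 1.2)]
hits = search(vectors, query=(1.0, 1.0), k=2)
nearest_ids = [i for _, i in hits]
```

This exhaustive scan compares the query against every stored vector, which is exact but scales linearly with the collection size. Dedicated similarity search libraries exist largely to make this step fast, or approximate, at the scale of millions or billions of vectors.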

applied

Survival Analysis

Published:

This post walks through survival analysis on a panel dataset. It replicates a model that estimates the duration of democratic regime survival using data on 105 countries from 1950 to 2004.

Unordered Categorical Models

Published:

This post replicates and extends a flagship American Political Science Review article that uses a multinomial logit to predict the likelihood of an insurgent group’s mode of warfare given the presence or absence of the Cold War.

Binary Regression

Published:

In October 1988, a plebiscite was held in Chile to determine whether or not Augusto Pinochet should extend his rule for another eight years. The package carData contains Chilean national survey data collected in April and May 1988. In this analysis, we evaluate the effect that several variables have on a voter’s likelihood to keep Pinochet in power using binary regression models.

Time Series Analysis

Published:

This post applies time series analysis to data from Kwon (2015), a study that seeks to empirically understand the causal relationship between political polarization and income inequality in the U.S.

Linear Models

Published:

What causes differences in people’s life satisfaction across countries? The Association of Religion Data Archives (ARDA) has assembled a dataset that stitches together economic, social and demographic variables across 252 countries. In this analysis, we inspect factors associated with life satisfaction with linear models.

Ordered Categorical Models

Published:

In 2009, a scandal broke out across England. Many British Members of Parliament (MPs) were exposed as misusing the allowances and expenses permitted to them as elected officials. In 2010, prior to parliamentary elections, British voters were surveyed as to whether they thought the MPs implicated in the scandal should resign. At the time of the scandal, the Labour Party led by PM Gordon Brown was in power. The Conservative Party, led by David Cameron, won the largest number of votes and seats in the 2010 general election on the heels of the scandal. In this analysis, we use generalized linear models for ordered categorical data to further explore the survey data.

Count Model Analysis

Published:

Gelman and Hill (2007) collected New York City (NYC) “stop and frisk” data for 175,000 stops over a 15-month period in 1998-1999. In this analysis, count models are used to model the data.

Multiple Imputation

Published:

Missingness refers to observations or measurements that, in principle, could have been carried out but, for whatever reason, failed to occur (Ward and Ahlquist 2018). Data that we collect, whether observational or experimental, will with near certainty have some values that are missing. The general idea behind missing data imputation algorithms is that they all fill in the missing data with estimates of what real data would look like if it were available. Because the estimated data is by nature uncertain, we replicate the missing data many times to incorporate the uncertainty into the analysis. This post walks through the cutting-edge multiple imputation technique developed by Hollenbach et al. (2018).

Numerical Optimization

Published:

Maximum likelihood fixes our observed data and asks: What parameter values most likely generated the data we have? In a likelihood framework, we treat the joint probability of our data as a function of the parameter values for a specified mass or density function, and we maximize that joint probability with respect to the parameters. A maximum likelihood estimate is one that provides the density or mass function with the highest likelihood of generating the observed data. Numerical optimizers provide the means to estimate our parameters by finding the values that maximize the likelihood of generating our data. This guide helps us understand how optimization algorithms find the best estimates for our coefficients in a step-by-step process.
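
A small worked example makes the question concrete. The sketch below assumes a Bernoulli model for ten made-up binary outcomes and uses a coarse grid search as a stand-in for a real numerical optimizer; the function and variable names are illustrative.

```python
import math

def bernoulli_log_likelihood(p, data):
    """Log of the joint probability of the data, viewed as a function of p."""
    return sum(math.log(p) if y == 1 else math.log(1.0 - p) for y in data)

data = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]  # 7 successes in 10 trials (made-up data)

# Hold the data fixed and try candidate parameter values from 0.01 to 0.99.
grid = [i / 100 for i in range(1, 100)]
p_hat = max(grid, key=lambda p: bernoulli_log_likelihood(p, data))
```

The grid search lands on 0.7, which matches the closed-form Bernoulli estimate (the sample mean). For models without a closed-form solution, an optimizer performs the same search over the parameters in a smarter, gradient-guided way.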

portfolio

publications

talks

teaching

Essential Empirical Methods

Undergraduate-level course, 2019

Essential Empirical Methods is a three-part course that introduces new analysts to the ideas of: (1) concept measurement and construction, (2) describing and summarizing data, and (3) formulating hypotheses and making comparisons.