# load packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.datasets import fetch_openml
from sklearn.model_selection import GridSearchCV
np.random.seed(0) # set seed
The Naive Bayes algorithm is part of a family of classifier algorithms that aim to predict the class or category of an observation. It is a Maximum Likelihood (MLE) generative model that suggests \(Y\) is generated by its features. At its core, the algorithm uses Bayes theorem which looks says: \(P(Y | X) = \frac{P(X | Y) \times P(Y)}{P(X)}\). This equation has four quantities of interest and each has its own name:
Prior: \(P(Y)\)
Posterior: \(P(Y|X)\)
Likelihood: \(P(X|Y)\)
Marginal Probability: \(P(X)\)
While Naive Bayes is useful for classifying large data sets, owing to its calculative speed, it naïvely assumes that every feature is independent of the other, given \(Y\) (i.e., conditional independence). The consequences for violating this assumption are overconfident predictions; owing to correlation among our \(X\) variables.
As described above, Naive Bayes is often described as being useful for its speed and simplicity, leading some people to suggest that we should use Naive Bayes as a baseline to compare non-nested models. I argue that Naive Bayes should be used purposefully, taking into account its assumptions, and used only then under those conditions. It makes little sense to use Naive Bayes as a baseline given most of our predictive tasks will involve correlated features and thus yield overly confident predictions. After all, correlated \(X\) variables are exactly why we use methods like regression in the first place – we use it to defeat omitted variable bias.
Naive Bayes requires that we calculate three quantities of interest: (1) \(P(Y)\), (2) \(P(X | Y)\), and (3) \(P(X)\). To better understand the algorithm and the computation required to produce these quantities of interest, we will use the iris data set for our initial analysis. We load our data below:
# load the data
iris = load_iris()
X, Y = iris.data, iris.target
# shuffle the data
shuffle = np.random.permutation(np.arange(X.shape[0]))
X, Y = X[shuffle], Y[shuffle]
# split into train and test
train_data, train_labels = X[:100], Y[:100]
test_data, test_labels = X[100:], Y[100:]
Next, to explore the Bayesian procedure, we transform our continuous features into binary measurements for simplification purposes. This threshold splits our data (i.e., petal length, petal width) into one of two types of features: short or long.
# for simplicity, we will binarize features
def binarize_iris(data, thresholds=[5.7, 3.3, 2.1, 0.6]):
# thresholds derived from information gain; see decision tree analysis
# initialize a new feature array with the same shape as the original data.
binarized_data = np.zeros(data.shape)
# apply a threshold to each feature.
for feature in range(data.shape[1]):
binarized_data[:, feature] = data[:, feature] > thresholds[feature]
return binarized_data
# create new binarized training and test data
binarized_train_data = binarize_iris(train_data)
binarized_test_data = binarize_iris(test_data)
We begin by finding our first quantity of interest, the prior or \(P(Y)\), by: (1) calculating the count for each \(Y_{i}\), and (2) calculating the the probability distribution of \(Y_{i}\).
# step 1
# initialize counter storage
label_counts = np.zeros((1, 3))
# generate label counts
for label in train_labels:
label_counts[0, label] += 1
# step 2
# generate probability distribution
label_probs = np.zeros((0, 3))
for count in label_counts:
label_probs = count / np.sum(label_counts)
# print probability distribution
for prob, name in zip(label_probs, iris.target_names):
print('For label', name, 'its probability distribution is:', prob)
## For label setosa its probability distribution is: 0.31
## For label versicolor its probability distribution is: 0.33
## For label virginica its probability distribution is: 0.36
Next, we calculate the likelihood or \(P(X | Y)\) which we can derive as the joint probability of \(X\) and \(Y\) normalized by the probability of \(Y\), or \(\frac{P(X,Y)}{P(Y)}\).
# compute joint probability: P(X, Y)
# create containers for counts of x = 0 and y
feature0_label_counts = np.zeros([len(iris.feature_names), len(iris.target_names)]) # length of x and y
# create containers for counts of x = 1 and y
feature1_label_counts = np.zeros([len(iris.feature_names), len(iris.target_names)]) # length of x and y
# for each training sample
for i in range(binarized_train_data.shape[0]):
label = train_labels[i] # y_{i}
features = binarized_train_data[i] # x_{i}
# loop over training samples
for index, value in enumerate(features):
# count instances of each type
feature0_label_counts[index][label] += (value == 0)
feature1_label_counts[index][label] += (value == 1)
# normalize probabilities; divide by y
feature0_label_counts = feature0_label_counts / label_counts # (x|y) / y
feature1_label_counts = feature1_label_counts / label_counts # (x|y) / y
# print count results by binary feature
print(' Feature Counts: Short\n', np.round(feature0_label_counts, 2))
## Feature Counts: Short
## [[1. 0.42 0.08]
## [0.32 0.97 0.94]
## [1. 0. 0. ]
## [1. 0. 0. ]]
print('\n Feature Counts: Long\n', np.round(feature1_label_counts, 2))
# check probabilities sum to 1
##
## Feature Counts: Long
## [[0. 0.58 0.92]
## [0.68 0.03 0.06]
## [0. 1. 1. ]
## [0. 1. 1. ]]
print ('\nEnsure P(feature = 0 | label) + P(feature = 1 | label) = 1\n'
, feature0_label_counts + feature1_label_counts)
##
## Ensure P(feature = 0 | label) + P(feature = 1 | label) = 1
## [[1. 1. 1.]
## [1. 1. 1.]
## [1. 1. 1.]
## [1. 1. 1.]]
Lastly, we need to calculate the marginal probability or the denominator \(P(X)\) of Bayes Rule. In our algorithm, we actually do not directly calculate this quantity of interest in part because we only need to derive which class has the highest posterior value rather than deriving the true posterior value. Instead, what we do is we marginalize over the joint distribution to compute: \(\sum_{y}P(Y)\times P(X|Y)\). Practically, this means to derive \(P(X)\) we just sum over all possible values of \(Y_{i}\). Additionally, this means that since the \(P(X)\) will always sum to 1, we can effectively ignore the denominator for our calculations.
In this section, we show how the algorithm derives posterior predictions by gradually updating our belief \(P(Y|X)\) given the introduction of new features \(P(X|Y)\). We begin by displaying reference notation for our features and labels.
## Features: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
## Labels : ['setosa' 'versicolor' 'virginica']
We begin by pulling our first quantity of interest: \(P(Y)\).
## [0.31 0.33 0.36]
Next, we pull our second quantity of interest: \(P(X_{short}|Y)\)
## [[1. 0.42424242 0.08333333]
## [0.32258065 0.96969697 0.94444444]
## [1. 0. 0. ]
## [1. 0. 0. ]]
Then, we prepare our posterior prediction which is simply our prior \(P(Y)\), absent the introduction of evidence.
## the posterior is [0.31 0.33 0.36]
Now we can generate our first posterior prediction by multiplying our prior \(P(Y)\) by our joint probability for sepal length \(P(\text{Sepal Length}_{short}|Y)\).
posterior *= p_x0_y[0] # sepal length is the index value
print('P(sepal length = 0 | Y)', p_x0_y[0]) # knowing we had a short sepal length, its nearly certain its a setosa
## P(sepal length = 0 | Y) [1. 0.42424242 0.08333333]
## The posterior sum is 0.48
Our posterior prediction tells us that: by taking into account the joint probability of sepal length among short length iris’, the predicted probability that the iris is a setosa is: 0.65
.
## [0.65 0.29 0.06]
We continue this process by including a new feature one at a time to make this process clear.
## P(sepal width = 0 | Y) [0.32258065 0.96969697 0.94444444]
## the posterior sum is 0.550189393939394
## Given sepal length and width, our posterior prediction is [0.38 0.51 0.11]
## P(petal length = 0 | Y) [1. 0. 0.]
## the posterior sum is 0.378657487091222
## Given sepal length and width and petal length, our posterior prediction is [1. 0. 0.]
## P(petal width = 0 | Y) [1. 0. 0.]
## the posterior sum is 1.0
## Given sepal length and width and petal length and width, our posterior prediction is [1. 0. 0.]
Notice that our posterior.sum()
has eventually summed to 1 which is why the denominator is effectively ignored in our Bayes application. Also notice that our algorithm came up with the right answer, but it is highly overconfident as evidenced by the 1
or 0
results. We derived the overconfidence from our \(P(X | Y)\) calculations which, because we binarized our data, found zero short iris observations in terms of petal length
and petal width
for versicolors
and virginicas
. As our calculations involve multiplication, our estimates will inevitably include zero. To overcome this problem, we can implement Laplace smoothing which replaces a count of 0 in our joint probability calculations \(P(X | Y)\) with a count of \(\alpha\). By default sklearn
sets \(\alpha\) to 1.
Next, we implement a basic Bernoulli Naive Bayes model to demonstrate its ability to generate posterior predictions and to classify data given binary features.
## BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
## Accuracy: 0.80
posterior = np.exp(model.feature_log_prob_).T
print('Posterior Predicted Probabilities for Features = Long:\n', np.round(posterior, 2))
## Posterior Predicted Probabilities for Features = Long:
## [[0.03 0.57 0.89]
## [0.67 0.06 0.08]
## [0.03 0.97 0.97]
## [0.03 0.97 0.97]]
Although our posterior predictions look robust in that there is a clear separation between our iris classes, we need to remember that the perceived robustness is influenced by correlation in our \(X\), which is demonstrated below. It is for these reasons that we need to remain cautious about using Naive Bayes as a baseline model to derive predictions and accuracy for our classification problems.
## 0 1 2 3
## 0 1.000000 -0.444297 0.697650 0.697650
## 1 -0.444297 1.000000 -0.686502 -0.686502
## 2 0.697650 -0.686502 1.000000 1.000000
## 3 0.697650 -0.686502 1.000000 1.000000
In this section, we demonstrate some modeling choices that we can make when using different Naive Bayes models from sklearn
on the MNIST data set. In the chunk below, we begin by loading our data using sklearn
’s handy package to access MNIST.
# load MNSIT from https://www.openml.org/d/554 via fetch_openml
X, Y = fetch_openml(name='mnist_784', return_X_y=True, cache=False)
# rescale grayscale values to [0,1].
X = X / 255.0
# shuffle the data
shuffle = np.random.permutation(np.arange(X.shape[0])) # generate unique random indices
X, Y = X[shuffle], Y[shuffle] # apply shuffle
# prepare some datasets
test_data, test_labels = X[61000:], Y[61000:]
dev_data, dev_labels = X[60000:61000], Y[60000:61000]
train_data, train_labels = X[:60000], Y[:60000]
mini_train_data, mini_train_labels = X[:1000], Y[:1000]
Since MNIST is a data set of handwritten digits, we can visualize the data thanks to imshow
from matplotlib
. In the chunk below, we collect 10 examples of all 10 digits and display them in a 2x2 matrix.
def visualize_mnist(label_array):
fig, ax = plt.subplots(10, 10, figsize = (10,10)) # create 10x10 subplots
for i in range(0, 10): # initiate loop
find_digit = np.where(label_array == str(i)) # find numbers 0-9
find_digit = mini_train_data[find_digit][0:10] # truncate data
for j in range(0, 10): # initiate loop
working_item = find_digit[j] # pull individual arrays
working_item.shape = (28, 28) # reshape for viewing
working_item = 255 - working_item # subtract itself for white background
ax[i, j].imshow(working_item, cmap = 'gray') # plot the image
ax[i, j].set_xticks([]) # remove axis ticks for clarity
ax[i, j].set_yticks([]) # remove axis ticks for clarity
return plt.show()
visualize_mnist(mini_train_labels)
Prior to fitting our model, we create a function to binarize our \(X\) data, resulting in 664k 0
s and 120k 1
s.
# binarize X for BernoulliNB
def binarize(data):
binarized_data = np.zeros(data.shape) # create array of zeros
binarized_data[(data < 0.3)] = 0 # thresholds chosen arbitrarily
binarized_data[(data >= 0.3)] = 1 # thresholds chosen arbitrarily
return binarized_data
binarized_train = binarize(mini_train_data) # binarize x train
binarized_dev = binarize(dev_data) # binarize x test
print(np.bincount(binarized_train.astype(int).flat)) # review binarized results
## [664735 119265]
Next, we apply the simplistic Bernoulli Naive Bayes model that functions exactly like the model we explicated at the start of this guide with the iris data set. To tune the central hyperparameter for Naive Bayes, we use grid search to loop over different \(\alpha\) parameters to find the best performing model as determined by accuracy
.
We can see that our model does relatively well given only binary features. We can also see that low \(\alpha\) parameters produce better accuracy than high \(\alpha\) parameters. This is because \(\alpha\) replaces what otherwise would be zeros with 0.0001
allowing features some slight chance of occurring and thereby improving our generalizability.
alphas = {'alpha': [1.0e-10, 0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 10.0]}
# setup grid search
grid_nb = GridSearchCV(estimator = BernoulliNB(binarize = None), param_grid = alphas, cv = 5, scoring = 'accuracy')
# fit nb
grid_nb.fit(binarized_train, mini_train_labels) # use our binarized thresholds; ignoring sklearns
# check best estimator result
## GridSearchCV(cv=5, error_score=nan,
## estimator=BernoulliNB(alpha=1.0, binarize=None, class_prior=None,
## fit_prior=True),
## iid='deprecated', n_jobs=None,
## param_grid={'alpha': [1e-10, 0.0001, 0.001, 0.01, 0.1, 0.5, 1.0,
## 2.0, 10.0]},
## pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
## scoring='accuracy', verbose=0)
## BernoulliNB(alpha=0.001, binarize=None, class_prior=None, fit_prior=True)
## 0.84
scores = grid_nb.cv_results_['mean_test_score'] # capture scores
# print scores
print("Best alpha = ", grid_nb.best_params_)
## Best alpha = {'alpha': 0.001}
for i in range(len(scores)):
print('For alpha', alphas['alpha'][i], 'the accuracy is:', round(scores[i], 3))
## For alpha 1e-10 the accuracy is: 0.8
## For alpha 0.0001 the accuracy is: 0.819
## For alpha 0.001 the accuracy is: 0.823
## For alpha 0.01 the accuracy is: 0.823
## For alpha 0.1 the accuracy is: 0.815
## For alpha 0.5 the accuracy is: 0.813
## For alpha 1.0 the accuracy is: 0.811
## For alpha 2.0 the accuracy is: 0.801
## For alpha 10.0 the accuracy is: 0.768
Since Naive Bayes is a generative model, we can demonstrate its ability below by generating digit images.
# create a NB model
nb = BernoulliNB(alpha = 0.0001) # alpha set to grid search's best result
# fit the model; lets use my binarized data
nb.fit(binarized_train, mini_train_labels)
# generate probabilities
probs = np.exp(nb.feature_log_prob_)
# produce function to generate images from predicted probabilities
def gen_images(feature_log_prob):
shape_n = feature_log_prob.shape[0]
random_nums = np.round(np.random.rand(784), 2)
pixels = np.zeros(shape_n)
for i in range(shape_n):
if random_nums[i] <= np.exp(feature_log_prob[i]):
pixels[i] = 1
return pixels
# plot the images
fig, ax = plt.subplots(10,20, figsize = (10,10))
for i in range(0, 20):
for j in range(0, 10):
temp_pix = gen_images(nb.feature_log_prob_[j])
temp_pix = temp_pix.reshape(28, 28)
temp_pix = 1 - temp_pix # give it a white background
ax[j,i].imshow(temp_pix, cmap = 'gray')
ax[j,i].set_xticks([])
ax[j,i].set_yticks([])
sklearn
also includes a multinomial options for Naive Bayes which allows us to classify dependent variables that have three or more labels. Using the same MNIST data set, we first trinarize our data set to produce three labels for classification. Given these thresholds, we produce 664k 0
s, 31k 1
s, and 88k 2
s.
# Produce a MultinomialNB; trinarize the data
def trinarize(data):
trinarized_data = np.zeros(data.shape) # create array of zeros
trinarized_data[(data <= 0.3)] = 0 # arbitrary thresholds
trinarized_data[(data > 0.3)] = 1
trinarized_data[(data >= 0.7)] = 2
return trinarized_data
trinarized_train = trinarize(mini_train_data)
trinarized_dev = trinarize(dev_data)
print(np.bincount(trinarized_train.astype(int).flat))
## [664735 30990 88275]
Next, we fit a familiar model and generate our accuracy
score to find that our multinomial model performs slightly worse than our Bernoulli model. This is almost certainly because adding a new categorical value for effectively middle values does little to help explain the digit label. What helps explain the digit label is whether or not there is pixel data to begin with, thus, the binary model is sufficient for our tasks. We just need information about a pixel being “on” or “off” to do well.
alphas = {'alpha': [1.0e-10, 0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 10.0]}
# setup grid search
grid_nb = GridSearchCV(estimator = MultinomialNB(), param_grid = alphas, cv = 5, scoring = 'accuracy')
# fit nb
grid_nb.fit(trinarized_train, mini_train_labels) # use our trinarized data
# check best estimator result
## GridSearchCV(cv=5, error_score=nan,
## estimator=MultinomialNB(alpha=1.0, class_prior=None,
## fit_prior=True),
## iid='deprecated', n_jobs=None,
## param_grid={'alpha': [1e-10, 0.0001, 0.001, 0.01, 0.1, 0.5, 1.0,
## 2.0, 10.0]},
## pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
## scoring='accuracy', verbose=0)
## MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True)
## 0.823
scores = grid_nb.cv_results_['mean_test_score'] # capture scores
# print scores
print("Best alpha = ", grid_nb.best_params_)
## Best alpha = {'alpha': 0.1}
for i in range(len(scores)):
print('For alpha', alphas['alpha'][i], 'the accuracy is:', round(scores[i], 3))
## For alpha 1e-10 the accuracy is: 0.788
## For alpha 0.0001 the accuracy is: 0.805
## For alpha 0.001 the accuracy is: 0.808
## For alpha 0.01 the accuracy is: 0.808
## For alpha 0.1 the accuracy is: 0.812
## For alpha 0.5 the accuracy is: 0.809
## For alpha 1.0 the accuracy is: 0.808
## For alpha 2.0 the accuracy is: 0.806
## For alpha 10.0 the accuracy is: 0.794
Lastly, sklearn
also incorporates a Gaussian Naive Bayes model which aims to use continuous features, as compared to binary/trinary integers, to predict labels. When using the Gaussian model, we should consider the var_smoothing
hyperparameter to improve our predictive performance. In short, var_smoothing
adds an additional value to the variance (\(\sigma\)) – in a similar fashion to the Laplace smoothing, to counter the effects of multiplying by zero. By adding some specified \(\sigma\) to each observation, we can improve its performance and stability.
# instantiate Gaussian NB
default_gnb = GaussianNB()
# fit model
default_gnb.fit(mini_train_data, mini_train_labels)
# generate accuracy
## GaussianNB(priors=None, var_smoothing=1e-09)
default_gnb_score = default_gnb.score(dev_data, dev_labels)
# poor performance; need to alter the smoothing function
# consider effects of theta and sigma
default_gnb.theta_.min()
## 0.0
## 0.9713831478537366
## 2.0075504984236782e-10
## 0.21110558821718223
var_smoothing_gnb = GaussianNB(var_smoothing = 0.06) # increase variance through var_smoothing
var_smoothing_gnb.fit(mini_train_data, mini_train_labels) # fit data again
## GaussianNB(priors=None, var_smoothing=0.06)
var_smoothing_gnb_score = var_smoothing_gnb.score(dev_data, dev_labels) # check score
# capture data to print results
model_names = ['GaussianNB', 'Smoothed GaussianNB']
model_scores = [default_gnb_score, var_smoothing_gnb_score]
for i in range(len(model_names)):
print('For model', model_names[i], 'the accuracy is', round(model_scores[i], 3))
## For model GaussianNB the accuracy is 0.585
## For model Smoothed GaussianNB the accuracy is 0.82
This post walked through the construction and application of the Naive Bayes algorithm by demonstrating the conditions under which the algorithm excelled, did poorly, and could be improved through hyperparameter or feature engineering.
Matthias Feurer, Jan N. van Rijn, Arlind Kadra, Pieter Gijsbers, Neeratyoy Mallik, Sahithya Ravi, Andreas Mueller, Joaquin Vanschoren, Frank Hutter. OpenML-Python: an extensible Python API for OpenML. arXiv:1911.02490 [cs.LG], 2019
LeCun, Yann. “The MNIST database of handwritten digits.” http://yann.lecun.com/exdb/mnist/ (1998).