Naïve Bayes

Andrew Fogarty


# load python
# load packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_iris
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.datasets import fetch_openml
from sklearn.model_selection import GridSearchCV

np.random.seed(0) # set seed

1 Introduction

The Naive Bayes algorithm is part of a family of classifier algorithms that aim to predict the class or category of an observation. It is a Maximum Likelihood (MLE) generative model that suggests \(Y\) is generated by its features. At its core, the algorithm uses Bayes theorem which looks says: \(P(Y | X) = \frac{P(X | Y) \times P(Y)}{P(X)}\). This equation has four quantities of interest and each has its own name:

While Naive Bayes is useful for classifying large data sets, owing to its calculative speed, it naïvely assumes that every feature is independent of the other, given \(Y\) (i.e., conditional independence). The consequences for violating this assumption are overconfident predictions; owing to correlation among our \(X\) variables.

2 When to Use Naive Bayes

As described above, Naive Bayes is often described as being useful for its speed and simplicity, leading some people to suggest that we should use Naive Bayes as a baseline to compare non-nested models. I argue that Naive Bayes should be used purposefully, taking into account its assumptions, and used only then under those conditions. It makes little sense to use Naive Bayes as a baseline given most of our predictive tasks will involve correlated features and thus yield overly confident predictions. After all, correlated \(X\) variables are exactly why we use methods like regression in the first place – we use it to defeat omitted variable bias.

3 Naive Bayes Components

Naive Bayes requires that we calculate three quantities of interest: (1) \(P(Y)\), (2) \(P(X | Y)\), and (3) \(P(X)\). To better understand the algorithm and the computation required to produce these quantities of interest, we will use the iris data set for our initial analysis. We load our data below:

# load the data
iris = load_iris()
X, Y =,

# shuffle the data
shuffle = np.random.permutation(np.arange(X.shape[0]))
X, Y = X[shuffle], Y[shuffle]

# split into train and test
train_data, train_labels = X[:100], Y[:100]
test_data, test_labels = X[100:], Y[100:]

Next, to explore the Bayesian procedure, we transform our continuous features into binary measurements for simplification purposes. This threshold splits our data (i.e., petal length, petal width) into one of two types of features: short or long.

# for simplicity, we will binarize features
def binarize_iris(data, thresholds=[5.7, 3.3, 2.1, 0.6]):
    # thresholds derived from information gain; see decision tree analysis
    # initialize a new feature array with the same shape as the original data.
    binarized_data = np.zeros(data.shape)

    # apply a threshold  to each feature.
    for feature in range(data.shape[1]):
        binarized_data[:, feature] = data[:, feature] > thresholds[feature]
    return binarized_data

# create new binarized training and test data
binarized_train_data = binarize_iris(train_data)
binarized_test_data = binarize_iris(test_data)

3.1 Quantity of Interest: P(Y)

We begin by finding our first quantity of interest, the prior or \(P(Y)\), by: (1) calculating the count for each \(Y_{i}\), and (2) calculating the the probability distribution of \(Y_{i}\).

## For label setosa its probability distribution is: 0.31
## For label versicolor its probability distribution is: 0.33
## For label virginica its probability distribution is: 0.36

3.2 Quantity of Interest: P(X | Y)

Next, we calculate the likelihood or \(P(X | Y)\) which we can derive as the joint probability of \(X\) and \(Y\) normalized by the probability of \(Y\), or \(\frac{P(X,Y)}{P(Y)}\).

##  Feature Counts: Short
##  [[1.   0.42 0.08]
##  [0.32 0.97 0.94]
##  [1.   0.   0.  ]
##  [1.   0.   0.  ]]
##  Feature Counts: Long
##  [[0.   0.58 0.92]
##  [0.68 0.03 0.06]
##  [0.   1.   1.  ]
##  [0.   1.   1.  ]]
## Ensure P(feature = 0 | label) + P(feature = 1 | label) = 1
##  [[1. 1. 1.]
##  [1. 1. 1.]
##  [1. 1. 1.]
##  [1. 1. 1.]]

3.3 Quantity of Interest: P(X)

Lastly, we need to calculate the marginal probability or the denominator \(P(X)\) of Bayes Rule. In our algorithm, we actually do not directly calculate this quantity of interest in part because we only need to derive which class has the highest posterior value rather than deriving the true posterior value. Instead, what we do is we marginalize over the joint distribution to compute: \(\sum_{y}P(Y)\times P(X|Y)\). Practically, this means to derive \(P(X)\) we just sum over all possible values of \(Y_{i}\). Additionally, this means that since the \(P(X)\) will always sum to 1, we can effectively ignore the denominator for our calculations.

In this section, we show how the algorithm derives posterior predictions by gradually updating our belief \(P(Y|X)\) given the introduction of new features \(P(X|Y)\). We begin by displaying reference notation for our features and labels.

## Features:  ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
## Labels  :  ['setosa' 'versicolor' 'virginica']

We begin by pulling our first quantity of interest: \(P(Y)\).

## [0.31 0.33 0.36]

Next, we pull our second quantity of interest: \(P(X_{short}|Y)\)

## [[1.         0.42424242 0.08333333]
##  [0.32258065 0.96969697 0.94444444]
##  [1.         0.         0.        ]
##  [1.         0.         0.        ]]

Then, we prepare our posterior prediction which is simply our prior \(P(Y)\), absent the introduction of evidence.

## the posterior is [0.31 0.33 0.36]

Now we can generate our first posterior prediction by multiplying our prior \(P(Y)\) by our joint probability for sepal length \(P(\text{Sepal Length}_{short}|Y)\).

## P(sepal length = 0 | Y) [1.         0.42424242 0.08333333]
## The posterior sum is 0.48

Our posterior prediction tells us that: by taking into account the joint probability of sepal length among short length iris’, the predicted probability that the iris is a setosa is: 0.65.

## [0.65 0.29 0.06]

We continue this process by including a new feature one at a time to make this process clear.

## P(sepal width = 0 | Y) [0.32258065 0.96969697 0.94444444]
## the posterior sum is 0.550189393939394
## Given sepal length and width, our posterior prediction is [0.38 0.51 0.11]
## P(petal length = 0 | Y) [1. 0. 0.]
## the posterior sum is 0.378657487091222
## Given sepal length and width and petal length, our posterior prediction is [1. 0. 0.]
## P(petal width = 0 | Y) [1. 0. 0.]
## the posterior sum is 1.0
## Given sepal length and width and petal length and width, our posterior prediction is [1. 0. 0.]

Notice that our posterior.sum() has eventually summed to 1 which is why the denominator is effectively ignored in our Bayes application. Also notice that our algorithm came up with the right answer, but it is highly overconfident as evidenced by the 1 or 0 results. We derived the overconfidence from our \(P(X | Y)\) calculations which, because we binarized our data, found zero short iris observations in terms of petal length and petal width for versicolors and virginicas. As our calculations involve multiplication, our estimates will inevitably include zero. To overcome this problem, we can implement Laplace smoothing which replaces a count of 0 in our joint probability calculations \(P(X | Y)\) with a count of \(\alpha\). By default sklearn sets \(\alpha\) to 1.

4 sklearn: Naive Bayes

Next, we implement a basic Bernoulli Naive Bayes model to demonstrate its ability to generate posterior predictions and to classify data given binary features.

model = BernoulliNB(), train_labels)
## BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
print('Accuracy: %3.2f' % model.score(binarized_test_data, test_labels))
## Accuracy: 0.80
posterior = np.exp(model.feature_log_prob_).T
print('Posterior Predicted Probabilities for Features = Long:\n', np.round(posterior, 2))
## Posterior Predicted Probabilities for Features = Long:
##  [[0.03 0.57 0.89]
##  [0.67 0.06 0.08]
##  [0.03 0.97 0.97]
##  [0.03 0.97 0.97]]

Although our posterior predictions look robust in that there is a clear separation between our iris classes, we need to remember that the perceived robustness is influenced by correlation in our \(X\), which is demonstrated below. It is for these reasons that we need to remain cautious about using Naive Bayes as a baseline model to derive predictions and accuracy for our classification problems.

df = pd.DataFrame(binarized_train_data)
##           0         1         2         3
## 0  1.000000 -0.444297  0.697650  0.697650
## 1 -0.444297  1.000000 -0.686502 -0.686502
## 2  0.697650 -0.686502  1.000000  1.000000
## 3  0.697650 -0.686502  1.000000  1.000000

4.1 sklearn: Naive Bayes Extended

In this section, we demonstrate some modeling choices that we can make when using different Naive Bayes models from sklearn on the MNIST data set. In the chunk below, we begin by loading our data using sklearn’s handy package to access MNIST.

Since MNIST is a data set of handwritten digits, we can visualize the data thanks to imshow from matplotlib. In the chunk below, we collect 10 examples of all 10 digits and display them in a 2x2 matrix.

Visualizing the Data

Visualizing the Data

4.2 Bernoulli Naive Bayes

Prior to fitting our model, we create a function to binarize our \(X\) data, resulting in 664k 0s and 120k 1s.

## [664735 119265]

Next, we apply the simplistic Bernoulli Naive Bayes model that functions exactly like the model we explicated at the start of this guide with the iris data set. To tune the central hyperparameter for Naive Bayes, we use grid search to loop over different \(\alpha\) parameters to find the best performing model as determined by accuracy.

We can see that our model does relatively well given only binary features. We can also see that low \(\alpha\) parameters produce better accuracy than high \(\alpha\) parameters. This is because \(\alpha\) replaces what otherwise would be zeros with 0.0001 allowing features some slight chance of occurring and thereby improving our generalizability.

## GridSearchCV(cv=5, error_score=nan,
##              estimator=BernoulliNB(alpha=1.0, binarize=None, class_prior=None,
##                                    fit_prior=True),
##              iid='deprecated', n_jobs=None,
##              param_grid={'alpha': [1e-10, 0.0001, 0.001, 0.01, 0.1, 0.5, 1.0,
##                                    2.0, 10.0]},
##              pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
##              scoring='accuracy', verbose=0)
## BernoulliNB(alpha=0.001, binarize=None, class_prior=None, fit_prior=True)
## 0.84
## Best alpha =  {'alpha': 0.001}
## For alpha 1e-10 the accuracy is: 0.8
## For alpha 0.0001 the accuracy is: 0.819
## For alpha 0.001 the accuracy is: 0.823
## For alpha 0.01 the accuracy is: 0.823
## For alpha 0.1 the accuracy is: 0.815
## For alpha 0.5 the accuracy is: 0.813
## For alpha 1.0 the accuracy is: 0.811
## For alpha 2.0 the accuracy is: 0.801
## For alpha 10.0 the accuracy is: 0.768

Since Naive Bayes is a generative model, we can demonstrate its ability below by generating digit images.

Generating Data

Generating Data

4.3 Multinomial Naive Bayes

sklearn also includes a multinomial options for Naive Bayes which allows us to classify dependent variables that have three or more labels. Using the same MNIST data set, we first trinarize our data set to produce three labels for classification. Given these thresholds, we produce 664k 0s, 31k 1s, and 88k 2s.

## [664735  30990  88275]

Next, we fit a familiar model and generate our accuracy score to find that our multinomial model performs slightly worse than our Bernoulli model. This is almost certainly because adding a new categorical value for effectively middle values does little to help explain the digit label. What helps explain the digit label is whether or not there is pixel data to begin with, thus, the binary model is sufficient for our tasks. We just need information about a pixel being “on” or “off” to do well.

## GridSearchCV(cv=5, error_score=nan,
##              estimator=MultinomialNB(alpha=1.0, class_prior=None,
##                                      fit_prior=True),
##              iid='deprecated', n_jobs=None,
##              param_grid={'alpha': [1e-10, 0.0001, 0.001, 0.01, 0.1, 0.5, 1.0,
##                                    2.0, 10.0]},
##              pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
##              scoring='accuracy', verbose=0)
## MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True)
## 0.823
## Best alpha =  {'alpha': 0.1}
## For alpha 1e-10 the accuracy is: 0.788
## For alpha 0.0001 the accuracy is: 0.805
## For alpha 0.001 the accuracy is: 0.808
## For alpha 0.01 the accuracy is: 0.808
## For alpha 0.1 the accuracy is: 0.812
## For alpha 0.5 the accuracy is: 0.809
## For alpha 1.0 the accuracy is: 0.808
## For alpha 2.0 the accuracy is: 0.806
## For alpha 10.0 the accuracy is: 0.794

4.4 Gaussian Naive Bayes

Lastly, sklearn also incorporates a Gaussian Naive Bayes model which aims to use continuous features, as compared to binary/trinary integers, to predict labels. When using the Gaussian model, we should consider the var_smoothing hyperparameter to improve our predictive performance. In short, var_smoothing adds an additional value to the variance (\(\sigma\)) – in a similar fashion to the Laplace smoothing, to counter the effects of multiplying by zero. By adding some specified \(\sigma\) to each observation, we can improve its performance and stability.

## GaussianNB(priors=None, var_smoothing=1e-09)
## 0.0
## 0.9713831478537366
## 2.0075504984236782e-10
## 0.21110558821718223
## GaussianNB(priors=None, var_smoothing=0.06)
## For model GaussianNB the accuracy is 0.585
## For model Smoothed GaussianNB the accuracy is 0.82

5 Conclusion

This post walked through the construction and application of the Naive Bayes algorithm by demonstrating the conditions under which the algorithm excelled, did poorly, and could be improved through hyperparameter or feature engineering.

6 Sources