1 Introduction

The Naive Bayes algorithm is part of a family of classifier algorithms that aim to predict the class or category of an observation. It is a Maximum Likelihood (MLE) generative model that suggests \(Y\) is generated by its features. At its core, the algorithm uses Bayes theorem which looks says: \(P(Y | X) = \frac{P(X | Y) \times P(Y)}{P(X)}\). This equation has four quantities of interest and each has its own name:

Prior: \(P(Y)\)
Posterior: \(P(Y|X)\)
Likelihood: \(P(X|Y)\)
Marginal Probability: \(P(X)\)

While Naive Bayes is useful for classifying large data sets, owing to its calculative speed, it naïvely assumes that every feature is independent of the other, given \(Y\) (i.e., conditional independence). The consequences for violating this assumption are overconfident predictions; owing to correlation among our \(X\) variables.

2 When to Use Naive Bayes

As described above, Naive Bayes is often described as being useful for its speed and simplicity, leading some people to suggest that we should use Naive Bayes as a baseline to compare non-nested models. I argue that Naive Bayes should be used purposefully, taking into account its assumptions, and used only then under those conditions. It makes little sense to use Naive Bayes as a baseline given most of our predictive tasks will involve correlated features and thus yield overly confident predictions. After all, correlated \(X\) variables are exactly why we use methods like regression in the first place – we use it to defeat omitted variable bias.

3 Naive Bayes Components

Naive Bayes requires that we calculate three quantities of interest: (1) \(P(Y)\), (2) \(P(X | Y)\), and (3) \(P(X)\). To better understand the algorithm and the computation required to produce these quantities of interest, we will use the iris data set for our initial analysis. We load our data below:

# load the data
iris = load_iris()
X, Y = iris.data, iris.target

# shuffle the data
shuffle = np.random.permutation(np.arange(X.shape[0]))
X, Y = X[shuffle], Y[shuffle]

# split into train and test
train_data, train_labels = X[:100], Y[:100]
test_data, test_labels = X[100:], Y[100:]

Next, to explore the Bayesian procedure, we transform our continuous features into binary measurements for simplification purposes. This threshold splits our data (i.e., petal length, petal width) into one of two types of features: short or long.

# for simplicity, we will binarize features
def binarize_iris(data, thresholds=[5.7, 3.3, 2.1, 0.6]):
    # thresholds derived from information gain; see decision tree analysis
    # initialize a new feature array with the same shape as the original data.
    binarized_data = np.zeros(data.shape)

    # apply a threshold  to each feature.
    for feature in range(data.shape[1]):
        binarized_data[:, feature] = data[:, feature] > thresholds[feature]
    return binarized_data

# create new binarized training and test data
binarized_train_data = binarize_iris(train_data)
binarized_test_data = binarize_iris(test_data)

3.1 Quantity of Interest: P(Y)

We begin by finding our first quantity of interest, the prior or \(P(Y)\), by: (1) calculating the count for each \(Y_{i}\), and (2) calculating the the probability distribution of \(Y_{i}\).

# step 1

# initialize counter storage
label_counts = np.zeros((1, 3))

# generate label counts
for label in train_labels:
    label_counts[0, label] += 1

# step 2

# generate probability distribution
label_probs = np.zeros((0, 3))
for count in label_counts:
    label_probs = count / np.sum(label_counts)

# print probability distribution
for prob, name in zip(label_probs, iris.target_names):
    print('For label', name, 'its probability distribution is:', prob)

## For label setosa its probability distribution is: 0.31
## For label versicolor its probability distribution is: 0.33
## For label virginica its probability distribution is: 0.36

3.2 Quantity of Interest: P(X | Y)

Next, we calculate the likelihood or \(P(X | Y)\) which we can derive as the joint probability of \(X\) and \(Y\) normalized by the probability of \(Y\), or \(\frac{P(X,Y)}{P(Y)}\).

# compute joint probability: P(X, Y)

# create containers for counts of x = 0 and y
feature0_label_counts = np.zeros([len(iris.feature_names), len(iris.target_names)]) # length of x and y

# create containers for counts of x = 1 and y
feature1_label_counts = np.zeros([len(iris.feature_names), len(iris.target_names)]) # length of x and y

# for each training sample
for i in range(binarized_train_data.shape[0]):

    label = train_labels[i] # y_{i}
    features = binarized_train_data[i] # x_{i}

    # loop over training samples
    for index, value in enumerate(features):

        # count instances of each type
        feature0_label_counts[index][label] += (value == 0)
        feature1_label_counts[index][label] += (value == 1)

# normalize probabilities; divide by y
feature0_label_counts = feature0_label_counts / label_counts # (x|y) / y
feature1_label_counts = feature1_label_counts / label_counts # (x|y) / y

# print count results by binary feature
print(' Feature Counts: Short\n', np.round(feature0_label_counts, 2))

##  Feature Counts: Short
##  [[1.   0.42 0.08]
##  [0.32 0.97 0.94]
##  [1.   0.   0.  ]
##  [1.   0.   0.  ]]

print('\n Feature Counts: Long\n', np.round(feature1_label_counts, 2))

# check probabilities sum to 1

## 
##  Feature Counts: Long
##  [[0.   0.58 0.92]
##  [0.68 0.03 0.06]
##  [0.   1.   1.  ]
##  [0.   1.   1.  ]]

print ('\nEnsure P(feature = 0 | label) + P(feature = 1 | label) = 1\n'
, feature0_label_counts + feature1_label_counts)

## 
## Ensure P(feature = 0 | label) + P(feature = 1 | label) = 1
##  [[1. 1. 1.]
##  [1. 1. 1.]
##  [1. 1. 1.]
##  [1. 1. 1.]]

3.3 Quantity of Interest: P(X)

Lastly, we need to calculate the marginal probability or the denominator \(P(X)\) of Bayes Rule. In our algorithm, we actually do not directly calculate this quantity of interest in part because we only need to derive which class has the highest posterior value rather than deriving the true posterior value. Instead, what we do is we marginalize over the joint distribution to compute: \(\sum_{y}P(Y)\times P(X|Y)\). Practically, this means to derive \(P(X)\) we just sum over all possible values of \(Y_{i}\). Additionally, this means that since the \(P(X)\) will always sum to 1, we can effectively ignore the denominator for our calculations.

In this section, we show how the algorithm derives posterior predictions by gradually updating our belief \(P(Y|X)\) given the introduction of new features \(P(X|Y)\). We begin by displaying reference notation for our features and labels.

print("Features:  ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']")

## Features:  ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

print("Labels  :  ['setosa' 'versicolor' 'virginica']\n")

## Labels  :  ['setosa' 'versicolor' 'virginica']

We begin by pulling our first quantity of interest: \(P(Y)\).

p_y = label_probs
print(p_y)

## [0.31 0.33 0.36]

Next, we pull our second quantity of interest: \(P(X_{short}|Y)\)

p_x0_y = feature0_label_counts
print(p_x0_y)

## [[1.         0.42424242 0.08333333]
##  [0.32258065 0.96969697 0.94444444]
##  [1.         0.         0.        ]
##  [1.         0.         0.        ]]

Then, we prepare our posterior prediction which is simply our prior \(P(Y)\), absent the introduction of evidence.

posterior = label_probs
print('the posterior is', posterior)

## the posterior is [0.31 0.33 0.36]

Now we can generate our first posterior prediction by multiplying our prior \(P(Y)\) by our joint probability for sepal length \(P(\text{Sepal Length}_{short}|Y)\).

posterior *= p_x0_y[0] # sepal length is the index value
print('P(sepal length = 0 | Y)', p_x0_y[0]) # knowing we had a short sepal length, its nearly certain its a setosa

## P(sepal length = 0 | Y) [1.         0.42424242 0.08333333]

print('The posterior sum is',  posterior.sum()) # notice that this eventually sums to 1

## The posterior sum is 0.48

posterior /= posterior.sum() # the marginal probability or denominator of Bayes

Our posterior prediction tells us that: by taking into account the joint probability of sepal length among short length iris’, the predicted probability that the iris is a setosa is: 0.65.

print(np.round(posterior, 2))

## [0.65 0.29 0.06]

We continue this process by including a new feature one at a time to make this process clear.

posterior *= p_x0_y[1] # sepal width
print('P(sepal width = 0 | Y)', p_x0_y[1])

## P(sepal width = 0 | Y) [0.32258065 0.96969697 0.94444444]

print('the posterior sum is',  posterior.sum())

## the posterior sum is 0.550189393939394

posterior /= posterior.sum()

print('Given sepal length and width, our posterior prediction is', np.round(posterior, 2))

## Given sepal length and width, our posterior prediction is [0.38 0.51 0.11]

posterior *= p_x0_y[2] # petal length
print('P(petal length = 0 | Y)', p_x0_y[2])

## P(petal length = 0 | Y) [1. 0. 0.]

print('the posterior sum is',  posterior.sum())

## the posterior sum is 0.378657487091222

posterior /= posterior.sum()

print('Given sepal length and width and petal length, our posterior prediction is', np.round(posterior, 2))

## Given sepal length and width and petal length, our posterior prediction is [1. 0. 0.]

posterior *= p_x0_y[3]
print('P(petal width = 0 | Y)', p_x0_y[3])

## P(petal width = 0 | Y) [1. 0. 0.]

print('the posterior sum is',  posterior.sum())

## the posterior sum is 1.0

posterior /= posterior.sum()

print('Given sepal length and width and petal length and width, our posterior prediction is', np.round(posterior, 2))

## Given sepal length and width and petal length and width, our posterior prediction is [1. 0. 0.]

Notice that our posterior.sum() has eventually summed to 1 which is why the denominator is effectively ignored in our Bayes application. Also notice that our algorithm came up with the right answer, but it is highly overconfident as evidenced by the 1 or 0 results. We derived the overconfidence from our \(P(X | Y)\) calculations which, because we binarized our data, found zero short iris observations in terms of petal length and petal width for versicolors and virginicas. As our calculations involve multiplication, our estimates will inevitably include zero. To overcome this problem, we can implement Laplace smoothing which replaces a count of 0 in our joint probability calculations \(P(X | Y)\) with a count of \(\alpha\). By default sklearn sets \(\alpha\) to 1.

4 sklearn: Naive Bayes

Next, we implement a basic Bernoulli Naive Bayes model to demonstrate its ability to generate posterior predictions and to classify data given binary features.

model = BernoulliNB()
model.fit(binarized_train_data, train_labels)

## BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

print('Accuracy: %3.2f' % model.score(binarized_test_data, test_labels))

## Accuracy: 0.80

posterior = np.exp(model.feature_log_prob_).T
print('Posterior Predicted Probabilities for Features = Long:\n', np.round(posterior, 2))

## Posterior Predicted Probabilities for Features = Long:
##  [[0.03 0.57 0.89]
##  [0.67 0.06 0.08]
##  [0.03 0.97 0.97]
##  [0.03 0.97 0.97]]

Although our posterior predictions look robust in that there is a clear separation between our iris classes, we need to remember that the perceived robustness is influenced by correlation in our \(X\), which is demonstrated below. It is for these reasons that we need to remain cautious about using Naive Bayes as a baseline model to derive predictions and accuracy for our classification problems.

df = pd.DataFrame(binarized_train_data)
print(df.corr())

##           0         1         2         3
## 0  1.000000 -0.444297  0.697650  0.697650
## 1 -0.444297  1.000000 -0.686502 -0.686502
## 2  0.697650 -0.686502  1.000000  1.000000
## 3  0.697650 -0.686502  1.000000  1.000000

4.1 sklearn: Naive Bayes Extended

In this section, we demonstrate some modeling choices that we can make when using different Naive Bayes models from sklearn on the MNIST data set. In the chunk below, we begin by loading our data using sklearn’s handy package to access MNIST.

# load MNSIT from https://www.openml.org/d/554 via fetch_openml
X, Y = fetch_openml(name='mnist_784', return_X_y=True, cache=False)

# rescale grayscale values to [0,1].
X = X / 255.0

# shuffle the data
shuffle = np.random.permutation(np.arange(X.shape[0])) # generate unique random indices
X, Y = X[shuffle], Y[shuffle] # apply shuffle

# prepare some datasets
test_data, test_labels = X[61000:], Y[61000:]
dev_data, dev_labels = X[60000:61000], Y[60000:61000]
train_data, train_labels = X[:60000], Y[:60000]
mini_train_data, mini_train_labels = X[:1000], Y[:1000]

Since MNIST is a data set of handwritten digits, we can visualize the data thanks to imshow from matplotlib. In the chunk below, we collect 10 examples of all 10 digits and display them in a 2x2 matrix.

def visualize_mnist(label_array):
    fig, ax = plt.subplots(10, 10, figsize = (10,10))  # create 10x10 subplots

    for i in range(0, 10): # initiate loop
        find_digit = np.where(label_array == str(i)) # find numbers 0-9
        find_digit = mini_train_data[find_digit][0:10] # truncate data

        for j in range(0, 10): # initiate loop
            working_item = find_digit[j] # pull individual arrays
            working_item.shape = (28, 28) # reshape for viewing
            working_item = 255 - working_item # subtract itself for white background
            ax[i, j].imshow(working_item, cmap = 'gray') # plot the image
            ax[i, j].set_xticks([]) # remove axis ticks for clarity
            ax[i, j].set_yticks([]) # remove axis ticks for clarity

    return plt.show()

visualize_mnist(mini_train_labels)

Visualizing the Data

4.2 Bernoulli Naive Bayes

Prior to fitting our model, we create a function to binarize our \(X\) data, resulting in 664k 0s and 120k 1s.

# binarize X for BernoulliNB
def binarize(data):
    binarized_data = np.zeros(data.shape) # create array of zeros
    binarized_data[(data < 0.3)] = 0 # thresholds chosen arbitrarily
    binarized_data[(data >= 0.3)] = 1 # thresholds chosen arbitrarily
    return binarized_data

binarized_train = binarize(mini_train_data) # binarize x train
binarized_dev = binarize(dev_data) # binarize x test

print(np.bincount(binarized_train.astype(int).flat)) # review binarized results

## [664735 119265]

Next, we apply the simplistic Bernoulli Naive Bayes model that functions exactly like the model we explicated at the start of this guide with the iris data set. To tune the central hyperparameter for Naive Bayes, we use grid search to loop over different \(\alpha\) parameters to find the best performing model as determined by accuracy.

We can see that our model does relatively well given only binary features. We can also see that low \(\alpha\) parameters produce better accuracy than high \(\alpha\) parameters. This is because \(\alpha\) replaces what otherwise would be zeros with 0.0001 allowing features some slight chance of occurring and thereby improving our generalizability.

alphas = {'alpha': [1.0e-10, 0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 10.0]}

# setup grid search
grid_nb = GridSearchCV(estimator = BernoulliNB(binarize = None), param_grid = alphas, cv = 5, scoring = 'accuracy')

# fit nb
grid_nb.fit(binarized_train, mini_train_labels) # use our binarized thresholds; ignoring sklearns

# check best estimator result

## GridSearchCV(cv=5, error_score=nan,
##              estimator=BernoulliNB(alpha=1.0, binarize=None, class_prior=None,
##                                    fit_prior=True),
##              iid='deprecated', n_jobs=None,
##              param_grid={'alpha': [1e-10, 0.0001, 0.001, 0.01, 0.1, 0.5, 1.0,
##                                    2.0, 10.0]},
##              pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
##              scoring='accuracy', verbose=0)

grid_nb.best_estimator_
# the best estimator is when alpha = 0.0001

## BernoulliNB(alpha=0.001, binarize=None, class_prior=None, fit_prior=True)

grid_nb.score(binarized_dev, dev_labels) # score on dev data

## 0.84

scores = grid_nb.cv_results_['mean_test_score'] # capture scores

# print scores
print("Best alpha = ", grid_nb.best_params_)

## Best alpha =  {'alpha': 0.001}

for i in range(len(scores)):
    print('For alpha', alphas['alpha'][i], 'the accuracy is:', round(scores[i], 3))

## For alpha 1e-10 the accuracy is: 0.8
## For alpha 0.0001 the accuracy is: 0.819
## For alpha 0.001 the accuracy is: 0.823
## For alpha 0.01 the accuracy is: 0.823
## For alpha 0.1 the accuracy is: 0.815
## For alpha 0.5 the accuracy is: 0.813
## For alpha 1.0 the accuracy is: 0.811
## For alpha 2.0 the accuracy is: 0.801
## For alpha 10.0 the accuracy is: 0.768

Since Naive Bayes is a generative model, we can demonstrate its ability below by generating digit images.

# create a NB model
nb = BernoulliNB(alpha = 0.0001) # alpha set to grid search's best result

# fit the model; lets use my binarized data
nb.fit(binarized_train, mini_train_labels) 

# generate probabilities
probs = np.exp(nb.feature_log_prob_)

# produce function to generate images from predicted probabilities
def gen_images(feature_log_prob):
    shape_n = feature_log_prob.shape[0]
    random_nums = np.round(np.random.rand(784), 2)
    pixels = np.zeros(shape_n)
    for i in range(shape_n):
        if random_nums[i] <= np.exp(feature_log_prob[i]):
            pixels[i] = 1
    return pixels

# plot the images
fig, ax = plt.subplots(10,20, figsize = (10,10))
for i in range(0, 20):
    for j in range(0, 10):
        temp_pix = gen_images(nb.feature_log_prob_[j])
        temp_pix = temp_pix.reshape(28, 28)
        temp_pix = 1 - temp_pix # give it a white background
        ax[j,i].imshow(temp_pix, cmap = 'gray')
        ax[j,i].set_xticks([])
        ax[j,i].set_yticks([])

Generating Data

4.3 Multinomial Naive Bayes

sklearn also includes a multinomial options for Naive Bayes which allows us to classify dependent variables that have three or more labels. Using the same MNIST data set, we first trinarize our data set to produce three labels for classification. Given these thresholds, we produce 664k 0s, 31k 1s, and 88k 2s.

# Produce a MultinomialNB; trinarize the data
def trinarize(data):
    trinarized_data = np.zeros(data.shape) # create array of zeros
    trinarized_data[(data <= 0.3)] = 0 # arbitrary thresholds
    trinarized_data[(data > 0.3)] = 1
    trinarized_data[(data >= 0.7)] = 2
    return trinarized_data
    
trinarized_train = trinarize(mini_train_data)
trinarized_dev = trinarize(dev_data)

print(np.bincount(trinarized_train.astype(int).flat))

## [664735  30990  88275]

Next, we fit a familiar model and generate our accuracy score to find that our multinomial model performs slightly worse than our Bernoulli model. This is almost certainly because adding a new categorical value for effectively middle values does little to help explain the digit label. What helps explain the digit label is whether or not there is pixel data to begin with, thus, the binary model is sufficient for our tasks. We just need information about a pixel being “on” or “off” to do well.

alphas = {'alpha': [1.0e-10, 0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 10.0]}

# setup grid search
grid_nb = GridSearchCV(estimator = MultinomialNB(), param_grid = alphas, cv = 5, scoring = 'accuracy')

# fit nb
grid_nb.fit(trinarized_train, mini_train_labels) # use our trinarized data

# check best estimator result

## GridSearchCV(cv=5, error_score=nan,
##              estimator=MultinomialNB(alpha=1.0, class_prior=None,
##                                      fit_prior=True),
##              iid='deprecated', n_jobs=None,
##              param_grid={'alpha': [1e-10, 0.0001, 0.001, 0.01, 0.1, 0.5, 1.0,
##                                    2.0, 10.0]},
##              pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
##              scoring='accuracy', verbose=0)

grid_nb.best_estimator_
# the best estimator is when alpha = 0.01

## MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True)

grid_nb.score(trinarized_dev, dev_labels) # score on dev data

## 0.823

scores = grid_nb.cv_results_['mean_test_score'] # capture scores

# print scores
print("Best alpha = ", grid_nb.best_params_)

## Best alpha =  {'alpha': 0.1}

for i in range(len(scores)):
    print('For alpha', alphas['alpha'][i], 'the accuracy is:', round(scores[i], 3))

## For alpha 1e-10 the accuracy is: 0.788
## For alpha 0.0001 the accuracy is: 0.805
## For alpha 0.001 the accuracy is: 0.808
## For alpha 0.01 the accuracy is: 0.808
## For alpha 0.1 the accuracy is: 0.812
## For alpha 0.5 the accuracy is: 0.809
## For alpha 1.0 the accuracy is: 0.808
## For alpha 2.0 the accuracy is: 0.806
## For alpha 10.0 the accuracy is: 0.794

4.4 Gaussian Naive Bayes

Lastly, sklearn also incorporates a Gaussian Naive Bayes model which aims to use continuous features, as compared to binary/trinary integers, to predict labels. When using the Gaussian model, we should consider the var_smoothing hyperparameter to improve our predictive performance. In short, var_smoothing adds an additional value to the variance (\(\sigma\)) – in a similar fashion to the Laplace smoothing, to counter the effects of multiplying by zero. By adding some specified \(\sigma\) to each observation, we can improve its performance and stability.

# instantiate Gaussian NB
default_gnb = GaussianNB()

# fit model
default_gnb.fit(mini_train_data, mini_train_labels)

# generate accuracy

## GaussianNB(priors=None, var_smoothing=1e-09)

default_gnb_score = default_gnb.score(dev_data, dev_labels)
# poor performance; need to alter the smoothing function

# consider effects of theta and sigma
default_gnb.theta_.min()

## 0.0

default_gnb.theta_.max()

# variance of each

## 0.9713831478537366

default_gnb.sigma_.min()

## 2.0075504984236782e-10

default_gnb.sigma_.max()

## 0.21110558821718223

var_smoothing_gnb = GaussianNB(var_smoothing = 0.06) # increase variance through var_smoothing
var_smoothing_gnb.fit(mini_train_data, mini_train_labels) # fit data again

## GaussianNB(priors=None, var_smoothing=0.06)

var_smoothing_gnb_score = var_smoothing_gnb.score(dev_data, dev_labels) # check score

# capture data to print results
model_names = ['GaussianNB', 'Smoothed GaussianNB']
model_scores = [default_gnb_score, var_smoothing_gnb_score]

for i in range(len(model_names)):
    print('For model', model_names[i], 'the accuracy is', round(model_scores[i], 3))

## For model GaussianNB the accuracy is 0.585
## For model Smoothed GaussianNB the accuracy is 0.82

5 Conclusion

This post walked through the construction and application of the Naive Bayes algorithm by demonstrating the conditions under which the algorithm excelled, did poorly, and could be improved through hyperparameter or feature engineering.

6 Sources

Matthias Feurer, Jan N. van Rijn, Arlind Kadra, Pieter Gijsbers, Neeratyoy Mallik, Sahithya Ravi, Andreas Mueller, Joaquin Vanschoren, Frank Hutter. OpenML-Python: an extensible Python API for OpenML. arXiv:1911.02490 [cs.LG], 2019
LeCun, Yann. “The MNIST database of handwritten digits.” http://yann.lecun.com/exdb/mnist/ (1998).

Naïve Bayes

Andrew Fogarty

3/03/2020