Decision Trees

Andrew Fogarty

2/07/2020

# load packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time

from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

import graphviz
import xgboost as xgb
from sklearn.metrics import f1_score
from hyperopt import hp, fmin, tpe, STATUS_OK, Trials
from sklearn.datasets import fetch_openml

np.random.seed(0) # set seed

1 Introduction

The Decision Tree algorithm is part of a family of classification and regression algorithms that aim to predict the class or value of an observation. Decision trees classify data by splitting features at specified thresholds such that, ideally, we can perfectly predict the observation’s label. At its core, the choice of split rests on two relatively simple quantities: entropy and information gain. When deciding how to split a feature, a threshold is selected such that the information gain is highest, meaning more information is revealed and our predictions of the dependent variable’s label improve (or become perfect). Decision trees are very popular algorithms and have done exceedingly well in competitions. Indeed, academic research from 2014 found tree-based methods among the most accurate and generalizable across many different data sets in a comparison of 179 different classifiers (Fernández-Delgado et al., 2014).

2 When to Use Decision Trees

In short, we should use decision trees when we want to present highly interpretable predictions because, as we will see, we can produce very simple and useful graphical outputs of our analysis. The time required to run decision trees, and more extensive random forests, will of course depend on the number of features and the size of the data. In fact, the number of possible decision trees over \(n\) binary features is estimated by \(2^{(2^n)}\), so for a model with six features there are potentially \(2^{64} = 18446744073709551616\) possible decision trees. This enormous space of candidate trees gives us considerable flexibility to find one that predicts \(\hat{Y}\) well.

Decision trees also form the basis of a very popular and recurring Kaggle-winning algorithm known as Extreme Gradient Boosting, or XGBoost. At the end of this demonstration, a full application of xgboost is provided. xgboost is so powerful that we should consider it for nearly every machine learning problem that does not involve computer vision, image recognition, or natural language processing.

3 Decision Tree Components

Decision tree algorithms require that we compute two quantities of interest: entropy and information gain. Entropy is a measure of uncertainty, while information gain measures the reduction in entropy achieved by a split. To better understand these quantities, we replicate and walk through them below with artificial data.

def entropy(distribution):
    ''' entropy function calculated by:
    sum(-probability * np.log2(probability))
    '''
    h = 0.0
    for probability in distribution:
        if probability > 0.0:
            h += -probability * np.log2(probability)
    return h

As we can see from our function above, it expects a probability distribution that we can generate below:

uniform = np.array([0.33, 0.33, 0.33])
non_uniform = np.array([0.15, 0.0, 0.85])

We can see that entropy is maximized when we are least sure about something; here we have just three equal groups in our uniform distribution.

print(entropy(uniform))
## 1.5834674497121084
print(entropy(non_uniform))
## 0.6098403047164004

Before we begin to look at information gain, let us first start building our functions and loading our data so that we can approach it methodically. Our decision tree analysis will be based on the simple but instructive iris data set provided by sklearn.

iris = load_iris()
X, y = iris.data, iris.target

# Shuffle the data, but make sure that the features and accompanying labels stay in sync.
np.random.seed(0)
shuffle = np.random.permutation(np.arange(X.shape[0]))
X, y = X[shuffle], y[shuffle]

# Split into train and test.
train_data, train_labels = X[:100], y[:100]
test_data, test_labels = X[100:], y[100:]

Next, we need a function to generate the distribution of our labels so we can feed it into our entropy function.

def get_label_distribution(labels):
    # count how many times each label occurs
    label_probs = np.bincount(labels)

    # normalize the counts to get a probability distribution
    label_probs = label_probs / label_probs.sum()
    return label_probs

We begin by calculating our starting entropy by feeding our training labels into the functions.

initial_entropy = entropy(get_label_distribution(train_labels))

Now that we understand our initial entropy, what we want to do next is find which feature in our data set will maximize our information gain, or in other words, give us the most predictive power as the root variable for our tree. For now, we will show how to calculate information gain for a single feature, and then we will generalize it to show why we choose some features over others for the root of our tree. Since we already know that petal_length is tied for the most important feature, we will use that variable to demonstrate the calculation and intuition behind information gain.

A common technique used in building decision trees is to binarize features such that information gain is maximized. While decision trees can support continuous data, thereby offering increased precision, we can aim for parsimony through binarization. So, then, how can we split the data such that we can generate the most accurate predictions on our first split? A histogram will help make this clear.

plt.hist(train_data[:, 2])

Notice the gap between the histogram’s bins between roughly 2 and 2.5. The values below 2 for petal_length are those of the Setosas, while values above roughly 2.5 are those of the Virginicas and Versicolors. Thus, if we want to classify all of the Setosas based on petal_length, the histogram tells us that we can do so easily by binarizing our feature petal_length with a threshold at any point roughly between these two values. With all of the Setosas classified, we can then move on to correctly classifying the Virginicas and Versicolors through the rest of the features.

[Figure: Petal Length histogram]

So let’s binarize petal_length based on the threshold we visually identified above, and then calculate its information gain.

# pull out petal length from data
petal_length = train_data[:, 2] # all rows; third column

def binarize_iris(data, threshold):
    # Initialize a new feature array with the same shape as the original data.
    binarized_data = np.zeros(data.shape)
    # Mark each observation 1 if it is at or below the threshold, 0 otherwise.
    for observation in range(len(data)):
        binarized_data[observation] = data[observation] <= threshold
    return binarized_data

binarized_petal_length = binarize_iris(petal_length, 2.45)

Next, we want to split our feature and create two tree branches. One branch will contain all of our Setosas while the other branch will contain the Virginicas and Versicolors.

subset0, subset1 = [], []
for datum, label in zip(binarized_petal_length, train_labels):
    if datum > 0:
        subset1.append(label)
    else:
        subset0.append(label)

Then, we want to calculate information gain, which we do by: (1) computing the label distribution and entropy of each binarized subset of petal_length, (2) weighting each subset’s entropy by the fraction of training observations it contains, (3) summing the weighted entropies to obtain the final entropy, and (4) subtracting the final entropy from the initial entropy. We then repeat this process for every other feature, i.e., petal_width, sepal_width, etc. The feature with the highest information gain is then used as the root of our tree, from which we split.

# compute the entropy of each subset.
subset0_entropy = entropy(get_label_distribution(subset0))
subset1_entropy = entropy(get_label_distribution(subset1))

# compute the weight 
subset0_weight = 1.0 * len(subset0) / len(train_labels)
subset1_weight = 1.0 * len(subset1) / len(train_labels)

# compute weighted entropy
final_entropy = (subset0_weight * subset0_entropy) + (subset1_weight * subset1_entropy)

# compute information gain as the difference between the initial and final entropy.
information_gain = initial_entropy - final_entropy

print(information_gain)
## 0.8931734583778569

To see that petal_length indeed is tied for the highest information gain, we prepare two functions: (1) to generalize information gain, and (2) to find the threshold that delivers the highest information gain.

def information_gain(data, labels, feature, threshold=0):
    # Get the initial entropy of the label distribution.
    initial_entropy = entropy(get_label_distribution(labels))

    # subset0 will contain the labels for which the feature is at or below the
    # threshold, and subset1 the labels for which the feature is above it.
    subset0, subset1 = [], []
    for datum, label in zip(data, labels):
        if datum[feature] > threshold:
            subset1.append(label)
        else:
            subset0.append(label)

    # Compute the entropy of each subset.
    subset0_entropy = entropy(get_label_distribution(subset0))
    subset1_entropy = entropy(get_label_distribution(subset1))

    # Compute the final entropy by weighting each subset's entropy according to its size.
    subset0_weight = 1.0 * len(subset0) / len(labels)
    subset1_weight = 1.0 * len(subset1) / len(labels)
    final_entropy = subset0_weight * subset0_entropy + subset1_weight * subset1_entropy

    # Finally, compute information gain as the difference between the initial and final entropy.
    return initial_entropy - final_entropy

def try_features_and_thresholds(data, labels):
    best_thresholds = []
    for feature in range(data.shape[1]):
        # Choose a set of thresholds between the min- and max-valued features.
        thresholds = np.linspace(data[:,feature].min(), data[:,feature].max(), 20)[1:-1]

        # Try each threshold and keep track of the best one for this feature.
        best_threshold = 0
        best_ig = 0
        for threshold in thresholds:
            ig = information_gain(data, labels, feature, threshold)
            if ig > best_ig:
                best_ig = ig
                best_threshold = threshold
        best_thresholds.append(best_threshold)
        # Show the best threshold and information gain for this feature.
        print('%d %s\t best threshold: %.3f  |  maximal info gain: %.3f ' 
        %(feature, iris.feature_names[feature], best_threshold, best_ig))
        
    return best_thresholds
    
optimal_thresholds = try_features_and_thresholds(train_data, train_labels)
## 0 sepal length (cm)   best threshold: 5.732  |  maximal info gain: 0.525 
## 1 sepal width (cm)    best threshold: 3.389  |  maximal info gain: 0.338 
## 2 petal length (cm)   best threshold: 2.116  |  maximal info gain: 0.893 
## 3 petal width (cm)    best threshold: 0.605  |  maximal info gain: 0.893

The results above report, for each feature, the threshold that maximizes information gain and the gain it provides. They tell us that we should select petal_length or petal_width, as they are tied for the highest information gain, so one of the two should be the root of our decision tree and what we split on first.

Decision trees then follow this process iteratively until each observation is classified or split as far as it can meaningfully go. To select the next branch, we then take what we have left to classify, which in our case are the Virginicas and Versicolors, and then we determine which feature will yield us the most successful classifications as determined by information gain.

The code below shows how this process unfolds through sklearn:

# only split a node if it contains at least 10 observations
dt = DecisionTreeClassifier(criterion = 'entropy', min_samples_split = 10)

# fit the data
dt.fit(train_data, train_labels)
## DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
##                        max_depth=None, max_features=None, max_leaf_nodes=None,
##                        min_impurity_decrease=0.0, min_impurity_split=None,
##                        min_samples_leaf=1, min_samples_split=10,
##                        min_weight_fraction_leaf=0.0, presort='deprecated',
##                        random_state=None, splitter='best')

# print the accuracy
print('Accuracy:', dt.score(test_data, test_labels))
## Accuracy: 0.96

# print feature importances
print(dt.feature_importances_)
## [0.         0.         0.67259596 0.32740404]

We can see that the Decision Tree does remarkably well classifying the Iris data set using only two of its features: petal_width and petal_length. This makes intuitive sense because these are the two features with the highest information gain; with a single threshold on either one we could quickly separate a large share of the irises. The dt.feature_importances_ attribute tells us just how important each feature was. In this case, we can see that the algorithm found petal_length the most important for its classification task.

# produce graphical output
import pydotplus
from IPython.display import Image, display

dot_data = tree.export_graphviz(dt, out_file=None,
                                feature_names=iris.feature_names,
                                class_names=iris.target_names,
                                filled=True, rounded=True)
graph = pydotplus.graph_from_dot_data(dot_data)
display(Image(data=graph.create_png()))
[Figure: Decision Tree]

4 Decision Tree Engineering

There are a couple of feature engineering strategies that we can take when building decision trees to improve our ability to generalize to new test data. The first is thresholding, a strategy that we have used extensively in this guide. The point is that if we can collapse multiple categories into one or a few, we can build a more parsimonious model that will generalize better. For instance, instead of having a feature recording whether vehicles have 2, 4, 6, or 8 cylinders, we might collapse these categories into two: six or more cylinders and fewer than six, as the exact cylinder count may not be all that helpful. In other words, we might aim to binarize or trinarize our data for parsimony.
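
As a hypothetical illustration of collapsing a multi-category feature into a binary one, we might do something like the following (the cylinders column here is invented purely for demonstration):

vehicles = pd.DataFrame({'cylinders': [2, 4, 4, 6, 8, 8, 6, 4]})

# collapse the cylinder counts into a single binary feature
vehicles['six_or_more_cyl'] = (vehicles['cylinders'] >= 6).astype(int)
print(vehicles)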

The second strategy is pruning. When pruning, we grow a fully built-out tree and then delete branch splits where the function pchance is greater than a threshold maxpchance. pchance, derived from chi-square tests, checks whether or not the data’s distribution at a branch occurred by chance and is calculated at each branch. maxpchance is a regularization hyperparameter that is often tuned with development data; it is a threshold that we, the machine learning engineers, set. In general, as maxpchance increases, our test data accuracy increases, but at some point it will start to decrease, and that is when we stop increasing maxpchance. This practice again helps us prevent overfitting and allows us to generalize better to our test data. In xgboost, the hyperparameters gamma and max_depth play this regularizing role.
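
sklearn does not expose chi-square (pchance) pruning, but its cost-complexity pruning plays an analogous regularizing role: larger values of ccp_alpha prune more aggressively, and we keep the value that maximizes held-out accuracy. A minimal sketch using the iris tree fit earlier:

# candidate pruning strengths for the fitted tree
path = dt.cost_complexity_pruning_path(train_data, train_labels)

for ccp_alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(criterion='entropy', ccp_alpha=ccp_alpha)
    pruned.fit(train_data, train_labels)
    print('alpha = %.4f | test accuracy = %.3f'
          % (ccp_alpha, pruned.score(test_data, test_labels)))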

5 Boosting

Boosting is a term that describes the sequential training of models that emphasize misclassified observations. It is a technique executed in the following steps (a minimal sketch follows the list):

  1. Weight each observation as 1/n
  2. Train the classifier using the weights
  3. Reduce weights for correct examples; increase weights for misclassified examples
  4. Go back to step 2
  5. Stop after i iterations or generally when accuracy stops improving
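
To make the loop concrete, here is a minimal AdaBoost-style sketch that follows these steps, using decision stumps as the weak learner; the dataset and the number of iterations are chosen only for demonstration.

from sklearn.datasets import load_breast_cancer

X_b, y_b = load_breast_cancer(return_X_y=True)
y_b = np.where(y_b == 0, -1, 1)            # use labels in {-1, +1}
n = len(y_b)
weights = np.full(n, 1.0 / n)              # step 1: weight each observation 1/n

stumps, alphas = [], []
for _ in range(25):                        # step 5: stop after i iterations
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X_b, y_b, sample_weight=weights)   # step 2: train with the weights
    pred = stump.predict(X_b)
    err = weights[pred != y_b].sum()
    if err == 0 or err >= 0.5:
        break
    alpha = 0.5 * np.log((1 - err) / err)
    # step 3: increase weights on misclassified examples, decrease on correct ones
    weights *= np.exp(-alpha * y_b * pred)
    weights /= weights.sum()               # step 4: go back to step 2
    stumps.append(stump)
    alphas.append(alpha)

# the boosted ensemble predicts with a weighted vote of the stumps
ensemble = np.sign(sum(a * s.predict(X_b) for a, s in zip(alphas, stumps)))
print('Training accuracy:', np.mean(ensemble == y_b))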

6 Gradient Boosting

Gradient Boosting sequentially fits each new tree to the residual errors made by the previous trees. Over time, an ensemble of trees is created which can then collectively make predictions.
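
A minimal regression sketch of this idea, with an arbitrary dataset, tree depth, and learning rate chosen only for illustration:

from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X_d, y_d = load_diabetes(return_X_y=True)
learning_rate = 0.1
prediction = np.full(len(y_d), y_d.mean())   # start from a constant prediction
boosted_trees = []

for _ in range(50):
    residuals = y_d - prediction                      # errors of the current ensemble
    tree_m = DecisionTreeRegressor(max_depth=3)
    tree_m.fit(X_d, residuals)                        # fit the next tree to the residuals
    prediction += learning_rate * tree_m.predict(X_d) # add a damped correction
    boosted_trees.append(tree_m)

print('Training MSE:', np.mean((y_d - prediction) ** 2))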

6.1 Extreme Gradient Boosting (xgboost)

In this section, we perform a detailed application of xgboost using the xgboost API instead of the sklearn API. We also document the use of hyperopt for hyperparameter optimization.

In the chunk below, we load and preprocess our MNIST data.
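
A minimal sketch of this step, assuming we pull MNIST from OpenML via fetch_openml and take a 75/25 split (the variable names and preprocessing choices here are our own):

# pull MNIST from OpenML; scale pixel values to [0, 1] and cast labels to int
X_mnist, y_mnist = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
X_mnist = X_mnist / 255.0
y_mnist = y_mnist.astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X_mnist, y_mnist, test_size=0.25, random_state=0)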

Since we are interested in a classification task, the code below allows us to calculate the weights for our classes in case we want to better balance the observations that we do have.
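
A sketch of one common weighting scheme (inverse class frequency), assuming the split above; the resulting vector could be passed to the DMatrix below if we choose to rebalance:

class_counts = np.bincount(y_train)
class_weights = len(y_train) / (len(class_counts) * class_counts)  # inverse-frequency weights
sample_weights = class_weights[y_train]                            # one weight per observation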

Next, we prepare our data by using xgboost’s DMatrix.
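
A sketch of the conversion, assuming the arrays from the split above (the optional weight argument would attach the class weights computed earlier):

dtrain = xgb.DMatrix(X_train, label=y_train)   # or xgb.DMatrix(X_train, label=y_train, weight=sample_weights)
dtest = xgb.DMatrix(X_test, label=y_test)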

We set up some initial hyperparameters for our model like so:
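
The exact values behind the output further below are not given, so the settings here are assumptions; the objective and metric are chosen to match the merror columns reported later:

params = {
    'objective': 'multi:softmax',  # predict the class label directly
    'num_class': 10,
    'eval_metric': 'merror',       # multiclass classification error
    'eta': 0.3,
    'max_depth': 6,
    'tree_method': 'hist',
}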

Next, we specify that our gradient boosting algorithm should be evaluated against our test set performance, as we are aiming for a final prediction, and we create a container for its per-round results.
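
A sketch of the evaluation list and results container; the 'eval' entry is the test set that drives early stopping:

watchlist = [(dtrain, 'train'), (dtest, 'eval')]
results = {}   # xgb.train will fill this with per-round train/eval metrics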

Then we prepare a learning rate scheduler that lowers our learning rate over time.
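
A sketch of one way to do this, assuming a recent xgboost release that provides the LearningRateScheduler callback (older releases exposed xgb.callback.reset_learning_rate instead); the decay schedule itself is arbitrary:

def eta_decay(boosting_round):
    # shrink the base learning rate a little more each round
    return 0.3 * (0.99 ** boosting_round)

lr_scheduler = xgb.callback.LearningRateScheduler(eta_decay)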

Finally, we are prepared to train our model. For the purposes of this demonstration, we set num_boost_round to 10 to end the process quickly. In practice, we would set this value very high and rely on our early_stopping_rounds to end our training early in the event of no progress.
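
A sketch of such a training call, assuming the objects from the sketches above; the exact configuration behind the output that follows may differ:

model = xgb.train(params, dtrain,
                  num_boost_round=10,          # kept small for this demonstration
                  evals=watchlist,
                  evals_result=results,
                  early_stopping_rounds=5,     # stop if eval-merror stalls for 5 rounds
                  callbacks=[lr_scheduler])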

## [0]  train-merror:0.130077   eval-merror:0.156113
## Multiple eval metrics have been passed: 'eval-merror' will be used for early stopping.
## 
## Will train until eval-merror hasn't improved in 5 rounds.
## [1]  train-merror:0.104483   eval-merror:0.12966
## [2]  train-merror:0.090636   eval-merror:0.115995
## [3]  train-merror:0.082731   eval-merror:0.107127
## [4]  train-merror:0.078138   eval-merror:0.100969
## [5]  train-merror:0.074766   eval-merror:0.097862
## [6]  train-merror:0.072347   eval-merror:0.09468
## [7]  train-merror:0.069547   eval-merror:0.092747
## [8]  train-merror:0.066299   eval-merror:0.09012
## [9]  train-merror:0.064583   eval-merror:0.088784

We generate our predictive results as usual:
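
A sketch of this step, assuming the trained model above; with objective 'multi:softmax', predict() returns class labels, which we score with accuracy, a confusion matrix, and a macro-averaged F1:

preds = model.predict(dtest).astype(int)
print('the accuracy of this classification is', accuracy_score(y_test, preds))
print(confusion_matrix(y_test, preds))
print(f1_score(y_test, preds, average='macro'))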

## the accuracy of this classification is 0.9122
## array([[1641,    1,    2,    4,    6,    5,    9,    0,   29,    5],
##        [   1, 1893,   23,   11,    2,    1,    7,    4,   13,    1],
##        [  15,   19, 1575,   38,   23,    4,   12,   19,   44,    8],
##        [   7,   12,   60, 1549,    8,   31,    8,   19,   49,   25],
##        [   4,    7,   13,    3, 1553,    4,   10,    4,   15,   85],
##        [  21,    8,   10,   49,   13, 1376,   35,    8,   37,   34],
##        [   9,    6,    8,    2,   21,   55, 1591,    0,   39,    2],
##        [   6,   12,   34,   11,   19,    2,    1, 1678,    8,   48],
##        [  11,   34,   15,   31,   14,   26,   13,    6, 1538,   39],
##        [   7,    7,    8,   28,   62,   12,    1,   45,   10, 1569]],
##       dtype=int64)
## 0.9114312314368508

6.2 Hyperopt: xgboost

Hyperopt is one of the most popular and advanced hyperparameter search algorithms in use today. In the section below, we document its use:

Here we specify the parameters to search and sample over. The hyperparameters below, which aim to control model complexity, are chosen primarily based on comments from Tong He, an xgboost contributor.
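
A sketch of such a search space; the parameter names match the best-parameter printout further below, but the ranges are assumptions:

space = {
    'max_depth': hp.quniform('max_depth', 3, 12, 1),
    'min_child_weight': hp.quniform('min_child_weight', 1, 6, 1),
    'gamma': hp.uniform('gamma', 0, 1),
    'subsample': hp.quniform('subsample', 0.5, 1.0, 0.05),
    'colsample_bytree': hp.quniform('colsample_bytree', 0.5, 1.0, 0.05),
    'eta': hp.uniform('eta', 0.05, 0.5),
}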

And we execute hyperopt like so:
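
A sketch of the objective and the search call, assuming the space, base params, and DMatrix objects defined above; each trial trains a short model and reports its held-out error as the loss for hyperopt to minimize:

def objective(sampled):
    trial_params = dict(params)              # start from the base parameters
    trial_params.update({
        'max_depth': int(sampled['max_depth']),
        'min_child_weight': int(sampled['min_child_weight']),
        'gamma': sampled['gamma'],
        'subsample': sampled['subsample'],
        'colsample_bytree': sampled['colsample_bytree'],
        'eta': sampled['eta'],
    })
    booster = xgb.train(trial_params, dtrain, num_boost_round=10,
                        evals=[(dtest, 'eval')], early_stopping_rounds=5,
                        verbose_eval=False)
    trial_preds = booster.predict(dtest).astype(int)
    return {'loss': 1 - accuracy_score(y_test, trial_preds), 'status': STATUS_OK}

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=30, trials=trials)
print('Best XGB parameters are', best)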

## 
  0%|          | 0/30 [00:00<?, ?trial/s, best loss=?]
  3%|3         | 1/30 [00:15<07:19, 15.15s/trial, best loss: 0.09253672848487682]
  7%|6         | 2/30 [00:33<07:34, 16.23s/trial, best loss: 0.06922530527562143]
 10%|#         | 3/30 [01:45<14:42, 32.70s/trial, best loss: 0.03956804422541704]
 13%|#3        | 4/30 [02:29<15:42, 36.24s/trial, best loss: 0.03941184211827076]
 17%|#6        | 5/30 [03:19<16:50, 40.42s/trial, best loss: 0.038020415097349436]
 20%|##        | 6/30 [04:07<16:59, 42.49s/trial, best loss: 0.0342222628419494]  
 23%|##3       | 7/30 [04:53<16:45, 43.71s/trial, best loss: 0.033233773246691745]
 27%|##6       | 8/30 [05:10<13:02, 35.55s/trial, best loss: 0.033233773246691745]
 30%|###       | 9/30 [06:32<17:24, 49.74s/trial, best loss: 0.033233773246691745]
 33%|###3      | 10/30 [07:15<15:52, 47.65s/trial, best loss: 0.033233773246691745]
 37%|###6      | 11/30 [08:08<15:34, 49.17s/trial, best loss: 0.033233773246691745]
 40%|####      | 12/30 [09:34<18:04, 60.25s/trial, best loss: 0.033233773246691745]
 43%|####3     | 13/30 [10:09<14:52, 52.53s/trial, best loss: 0.033233773246691745]
 47%|####6     | 14/30 [10:27<11:17, 42.34s/trial, best loss: 0.033233773246691745]
 50%|#####     | 15/30 [10:58<09:41, 38.76s/trial, best loss: 0.033233773246691745]
 53%|#####3    | 16/30 [12:22<12:15, 52.55s/trial, best loss: 0.033233773246691745]
 57%|#####6    | 17/30 [13:13<11:15, 51.98s/trial, best loss: 0.033233773246691745]
 60%|######    | 18/30 [14:26<11:39, 58.25s/trial, best loss: 0.033233773246691745]
 63%|######3   | 19/30 [14:34<07:54, 43.10s/trial, best loss: 0.033233773246691745]
 67%|######6   | 20/30 [15:43<08:29, 50.95s/trial, best loss: 0.033233773246691745]
 70%|#######   | 21/30 [16:17<06:52, 45.84s/trial, best loss: 0.033233773246691745]
 73%|#######3  | 22/30 [17:00<05:59, 44.95s/trial, best loss: 0.033233773246691745]
 77%|#######6  | 23/30 [17:42<05:09, 44.19s/trial, best loss: 0.033233773246691745]
 80%|########  | 24/30 [18:02<03:41, 36.96s/trial, best loss: 0.033233773246691745]
 83%|########3 | 25/30 [18:35<02:59, 35.87s/trial, best loss: 0.033233773246691745]
 87%|########6 | 26/30 [19:19<02:33, 38.29s/trial, best loss: 0.033233773246691745]
 90%|######### | 27/30 [20:06<02:02, 40.91s/trial, best loss: 0.033233773246691745]
 93%|#########3| 28/30 [20:45<01:20, 40.34s/trial, best loss: 0.033233773246691745]
 97%|#########6| 29/30 [20:55<00:31, 31.14s/trial, best loss: 0.033233773246691745]
100%|##########| 30/30 [21:29<00:00, 32.10s/trial, best loss: 0.033233773246691745]
100%|##########| 30/30 [21:29<00:00, 43.00s/trial, best loss: 0.033233773246691745]
## Best XGB parameters are {'colsample_bytree': 0.7000000000000001, 'eta': 0.4825210766587527, 'gamma': 0.5067306790027051, 'max_depth': 10, 'min_child_weight': 2, 'subsample': 0.9}

We can plot the results of our hyperopt search by running the following code. The plot shows where in the search space hyperopt concentrated its sampling and how each sampled value performed: the x-axis represents the values searched while the y-axis represents the loss.
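
A sketch of one way to build such a plot, assuming the trials object from the search above:

searched_params = ['max_depth', 'min_child_weight', 'gamma',
                   'subsample', 'colsample_bytree', 'eta']
losses = trials.losses()

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for ax, param in zip(axes.flatten(), searched_params):
    values = [t['misc']['vals'][param][0] for t in trials.trials]
    ax.scatter(values, losses, alpha=0.5)
    ax.set_xlabel(param)
    ax.set_ylabel('loss')
plt.tight_layout()
plt.show()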

[Figure: Hyperopt Search Space]

6.3 Cross-Validating xgboost

We can validate our hyperopt search by running its suggested parameters through xgboost’s built-in cross-validation routine.
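
A sketch of such a run, assuming the best dictionary returned by hyperopt and the base params and dtrain objects from above:

cv_params = dict(params)
cv_params.update({
    'max_depth': int(best['max_depth']),
    'min_child_weight': int(best['min_child_weight']),
    'gamma': best['gamma'],
    'subsample': best['subsample'],
    'colsample_bytree': best['colsample_bytree'],
    'eta': best['eta'],
})

cv_results = xgb.cv(cv_params, dtrain, num_boost_round=10, nfold=5,
                    metrics='merror', early_stopping_rounds=5, seed=0)
print(cv_results)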

##    train-merror-mean  train-merror-std  test-merror-mean  test-merror-std
## 0           0.060292          0.001115          0.105184         0.000449
## 1           0.032558          0.000493          0.079282         0.000319
## 2           0.018925          0.000377          0.067146         0.000662
## 3           0.010944          0.000346          0.059527         0.000422
## 4           0.006409          0.000081          0.053537         0.000437
## 5           0.003538          0.000186          0.049477         0.000876
## 6           0.002213          0.000308          0.046493         0.000578
## 7           0.001338          0.000131          0.044496         0.000765
## 8           0.000691          0.000080          0.042789         0.001099
## 9           0.000371          0.000070          0.040340         0.000971

7 Sources