Decision Trees

Andrew Fogarty

2/07/2020

# load packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time

from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

import graphviz
import xgboost as xgb
from sklearn.metrics import f1_score
from hyperopt import hp, fmin, tpe, STATUS_OK, Trials
from sklearn.datasets import fetch_openml

np.random.seed(0) # set seed

1 Introduction

The Decision Tree algorithm is part of a family of classification and regression algorithms that aim to predict the class or value of an observation. Decision trees classify data by splitting features at specified thresholds such that, ideally, we can perfectly predict the observation’s label. At its core, the choice of split rests on two relatively simple quantities: entropy and information gain. When deciding how to split a feature, a threshold is selected such that the information gain is highest, meaning more information is revealed and our predictions of the dependent variable’s label improve (or become perfect). Decision trees are very popular algorithms and have done exceedingly well in competitions. Indeed, academic research from 2014 found tree-based methods among the most accurate and generalizable across many different data sets in a comparison of 179 different classifiers (Fernández-Delgado et al., 2014).

2 When to Use Decision Trees

In short, we should use decision trees when we want to present highly interpretable predictions because, as we will see, we can produce very simple and useful graphical outputs of our analysis. The time required to run decision trees, and more extensive random forests, will of course depend on the number of features and the size of the data. In fact, the number of possible decision trees over \(n\) binary features is estimated by \(2^{(2^n)}\), so for a model with six features there are potentially \(2^{64} = 18446744073709551616\) possible decision trees. This enormous space of candidate trees gives us considerable flexibility to find one that predicts \(\hat{Y}\) well.

Decision trees also form the basis of a very popular and recurring Kaggle-winning algorithm known as Extreme Gradient Boosting, or XGBoost. At the end of this demonstration, a full application of xgboost is provided. xgboost is so powerful that we should consider it for nearly every machine learning problem that does not involve computer vision, image recognition, or natural language processing.

3 Decision Tree Components

Decision tree algorithms require that we compute two quantities of interest: entropy and information gain. Entropy is a measure of uncertainty, while information gain measures the reduction in entropy achieved by a split. To better understand these quantities, we replicate and walk through them below with artificial data.

def entropy(distribution):
    ''' entropy function calculated by:
    sum(-probability * np.log2(probability))
    '''
    h = 0.0
    for probability in distribution:
        if probability > 0.0:
            h += -probability * np.log2(probability)
    return h

As we can see from our function above, it expects a probability distribution that we can generate below:

uniform = np.array([0.33, 0.33, 0.33])
non_uniform = np.array([0.15, 0.0, 0.85])

We can see that entropy is maximized when we are least sure about something; here we have just three equal groups in our uniform distribution.

print(entropy(uniform))
## 1.5834674497121084
print(entropy(non_uniform))
## 0.6098403047164004

Before we begin to look at information gain, let us first start building our functions and loading our data so that we can approach it methodically. Our decision tree analysis will be based on the simple but instructive iris data set provided by sklearn.

iris = load_iris()
X, y = iris.data, iris.target

# Shuffle the data, but make sure that the features and accompanying labels stay in sync.
np.random.seed(0)
shuffle = np.random.permutation(np.arange(X.shape[0]))
X, y = X[shuffle], y[shuffle]

# Split into train and test.
train_data, train_labels = X[:100], y[:100]
test_data, test_labels = X[100:], y[100:]

Next, we need a function to generate the distribution of our labels so we can feed it into our entropy function.

def get_label_distribution(labels):
    # count how many times each label occurs
    label_probs = np.bincount(labels)

    # normalize the counts to get a probability distribution
    label_probs = label_probs / label_probs.sum()
    return label_probs

We begin by calculating our starting entropy by feeding our training labels into the functions.

initial_entropy = entropy(get_label_distribution(train_labels))

Now that we understand our initial entropy, what we want to do next is find which feature in our data set will maximize our information gain, or in other words, give us the most predictive power as the root variable for our tree. For now, we will show how to calculate information gain for a single feature, and then we will generalize it to show why we choose some features over others for the root of our tree. Since we already know that petal_length is tied for the most important feature, we will use that variable to demonstrate the calculation and intuition behind information gain.

A common technique used in building decision trees is to binarize features such that information gain is maximized. While decision trees can support continuous data, thereby offering increased precision, we can aim for parsimony through binarization. So, then, how can we split the data such that we can generate the most accurate predictions on our first split? A histogram will help make this clear.

plt.hist(train_data[:, 2])

Notice the gap between the histogram’s bins between roughly 2 and 2.5. The values below 2 for petal_length are those of the Setosas, while values above roughly 2.5 are those of the Virginicas and Versicolors. Thus, if we want to classify all of the Setosas based on petal_length, the histogram tells us that we can do so easily by binarizing our feature petal_length with a threshold at any point roughly between these two values. With all of the Setosas classified, we can then move on to correctly classifying the Virginicas and Versicolors through the rest of the features.

[Figure: Petal Length histogram]

So let’s binarize petal_length based on the threshold we visually identified above, and then calculate its information gain.

# pull out petal length from data
petal_length = train_data[:, 2] # all rows; third column

def binarize_iris(data, threshold):
    # Initialize a new feature array with the same shape as the original data.
    binarized_data = np.zeros(data.shape)
    # Mark each observation 1 if it is at or below the threshold, 0 otherwise.
    for observation in range(len(data)):
        binarized_data[observation] = data[observation] <= threshold
    return binarized_data

binarized_petal_length = binarize_iris(petal_length, 2.45)

Next, we want to split our feature and create two tree branches. One branch will contain all of our Setosas while the other branch will contain the Virginicas and Versicolors.

subset0, subset1 = [], []
for datum, label in zip(binarized_petal_length, train_labels):
    if datum > 0:
        subset1.append(label)
    else:
        subset0.append(label)

Then, we want to calculate information gain, which we do by: (1) computing the label distribution and entropy of each binarized subset of petal_length, (2) weighting each subset’s entropy by the fraction of training observations it contains, (3) summing the weighted entropies to obtain the final entropy, and (4) subtracting the final entropy from the initial entropy. We then repeat this process for every other feature, i.e., petal_width, sepal_width, etc. The feature with the highest information gain is then used as the root of our tree, from which we split.

# compute the entropy of each subset.
subset0_entropy = entropy(get_label_distribution(subset0))
subset1_entropy = entropy(get_label_distribution(subset1))

# compute the weight 
subset0_weight = 1.0 * len(subset0) / len(train_labels)
subset1_weight = 1.0 * len(subset1) / len(train_labels)

# compute weighted entropy
final_entropy = (subset0_weight * subset0_entropy) + (subset1_weight * subset1_entropy)

# compute information gain as the difference between the initial and final entropy.
information_gain = initial_entropy - final_entropy

print(information_gain)
## 0.8931734583778569

To see that petal_length indeed is tied for the highest information gain, we prepare two functions: (1) to generalize information gain, and (2) to find the threshold that delivers the highest information gain.

def information_gain(data, labels, feature, threshold=0):
    # Get the initial entropy of the label distribution.
    initial_entropy = entropy(get_label_distribution(labels))

    # subset0 will contain the labels for which the feature is at or below the
    # threshold, and subset1 the labels for which the feature is above it.
    subset0, subset1 = [], []
    for datum, label in zip(data, labels):
        if datum[feature] > threshold:
            subset1.append(label)
        else:
            subset0.append(label)

    # Compute the entropy of each subset.
    subset0_entropy = entropy(get_label_distribution(subset0))
    subset1_entropy = entropy(get_label_distribution(subset1))

    # Compute the final entropy by weighting each subset's entropy according to its size.
    subset0_weight = 1.0 * len(subset0) / len(labels)
    subset1_weight = 1.0 * len(subset1) / len(labels)
    final_entropy = subset0_weight * subset0_entropy + subset1_weight * subset1_entropy

    # Finally, compute information gain as the difference between the initial and final entropy.
    return initial_entropy - final_entropy

def try_features_and_thresholds(data, labels):
    best_thresholds = []
    for feature in range(data.shape[1]):
        # Choose a set of thresholds between the min- and max-valued features.
        thresholds = np.linspace(data[:,feature].min(), data[:,feature].max(), 20)[1:-1]

        # Try each threshold and keep track of the best one for this feature.
        best_threshold = 0
        best_ig = 0
        for threshold in thresholds:
            ig = information_gain(data, labels, feature, threshold)
            if ig > best_ig:
                best_ig = ig
                best_threshold = threshold
        best_thresholds.append(best_threshold)
        # Show the best threshold and information gain for this feature.
        print('%d %s\t best threshold: %.3f  |  maximal info gain: %.3f ' 
        %(feature, iris.feature_names[feature], best_threshold, best_ig))
        
    return best_thresholds
    
optimal_thresholds = try_features_and_thresholds(train_data, train_labels)
## 0 sepal length (cm)   best threshold: 5.732  |  maximal info gain: 0.525 
## 1 sepal width (cm)    best threshold: 3.389  |  maximal info gain: 0.338 
## 2 petal length (cm)   best threshold: 2.116  |  maximal info gain: 0.893 
## 3 petal width (cm)    best threshold: 0.605  |  maximal info gain: 0.893

The results above report, for each feature, the threshold that maximizes information gain and the gain it provides. They tell us that we should select petal_length or petal_width, as they are tied for the highest information gain, so one of the two should be the root of our decision tree and what we split on first.

Decision trees then follow this process iteratively until each observation is classified or split as far as it can meaningfully go. To select the next branch, we then take what we have left to classify, which in our case are the Virginicas and Versicolors, and then we determine which feature will yield us the most successful classifications as determined by information gain.

The code below shows how this process unfolds through sklearn:

# only split a node if it contains at least 10 observations
dt = DecisionTreeClassifier(criterion = 'entropy', min_samples_split = 10)

# fit the data
dt.fit(train_data, train_labels)
## DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
##                        max_depth=None, max_features=None, max_leaf_nodes=None,
##                        min_impurity_decrease=0.0, min_impurity_split=None,
##                        min_samples_leaf=1, min_samples_split=10,
##                        min_weight_fraction_leaf=0.0, presort='deprecated',
##                        random_state=None, splitter='best')

# print the accuracy
print('Accuracy:', dt.score(test_data, test_labels))
## Accuracy: 0.96

# print feature importances
print(dt.feature_importances_)
## [0.         0.         0.67259596 0.32740404]

We can see that the Decision Tree does remarkably well classifying the Iris data set using only two of its features: petal_width and petal_length. This makes intuitive sense because these are the two features with the highest information gain; with a single threshold on either one we could quickly separate a large share of the irises. The dt.feature_importances_ attribute tells us just how important each feature was. In this case, we can see that the algorithm found petal_length the most important for its classification task.

# produce graphical output
import pydotplus
from IPython.display import Image, display

dot_data = tree.export_graphviz(dt, out_file=None,
                                feature_names=iris.feature_names,
                                class_names=iris.target_names,
                                filled=True, rounded=True)
graph = pydotplus.graph_from_dot_data(dot_data)
display(Image(data=graph.create_png()))
[Figure: Decision Tree]

4 Decision Tree Engineering

There are a couple of feature engineering strategies that we can take when building decision trees to improve our ability to generalize to new test data. The first is thresholding, a strategy that we have used extensively in this guide. The point is that if we can collapse multiple categories into one or a few, we can build a more parsimonious model that will generalize better. For instance, instead of having a feature recording whether vehicles have 2, 4, 6, or 8 cylinders, we might collapse these categories into two: six or more cylinders and fewer than six, as the exact cylinder count may not be all that helpful. In other words, we might aim to binarize or trinarize our data for parsimony.
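
As a hypothetical illustration of collapsing a multi-category feature into a binary one, we might do something like the following (the cylinders column here is invented purely for demonstration):

vehicles = pd.DataFrame({'cylinders': [2, 4, 4, 6, 8, 8, 6, 4]})

# collapse the cylinder counts into a single binary feature
vehicles['six_or_more_cyl'] = (vehicles['cylinders'] >= 6).astype(int)
print(vehicles)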

The second strategy is pruning. When pruning, we grow a fully built-out tree and then delete branch splits where the function pchance is greater than a threshold maxpchance. pchance, derived from chi-square tests, checks whether or not the data’s distribution at a branch occurred by chance and is calculated at each branch. maxpchance is a regularization hyperparameter that is often tuned with development data; it is a threshold that we, the machine learning engineers, set. In general, as maxpchance increases, our test data accuracy increases, but at some point it will start to decrease, and that is when we stop increasing maxpchance. This practice again helps us prevent overfitting and allows us to generalize better to our test data. In xgboost, the hyperparameters gamma and max_depth play this regularizing role.
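
sklearn does not expose chi-square (pchance) pruning, but its cost-complexity pruning plays an analogous regularizing role: larger values of ccp_alpha prune more aggressively, and we keep the value that maximizes held-out accuracy. A minimal sketch using the iris tree fit earlier:

# candidate pruning strengths for the fitted tree
path = dt.cost_complexity_pruning_path(train_data, train_labels)

for ccp_alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(criterion='entropy', ccp_alpha=ccp_alpha)
    pruned.fit(train_data, train_labels)
    print('alpha = %.4f | test accuracy = %.3f'
          % (ccp_alpha, pruned.score(test_data, test_labels)))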

5 Boosting

Boosting is a term that describes the sequential training of models that emphasize misclassified observations. It is a technique executed in the following steps (a minimal sketch follows the list):

  1. Weight each observation as 1/n
  2. Train the classifier using the weights
  3. Reduce weights for correct examples; increase weights for misclassified examples
  4. Go back to step 2
  5. Stop after i iterations or generally when accuracy stops improving
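
To make the loop concrete, here is a minimal AdaBoost-style sketch that follows these steps, using decision stumps as the weak learner; the dataset and the number of iterations are chosen only for demonstration.

from sklearn.datasets import load_breast_cancer

X_b, y_b = load_breast_cancer(return_X_y=True)
y_b = np.where(y_b == 0, -1, 1)            # use labels in {-1, +1}
n = len(y_b)
weights = np.full(n, 1.0 / n)              # step 1: weight each observation 1/n

stumps, alphas = [], []
for _ in range(25):                        # step 5: stop after i iterations
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X_b, y_b, sample_weight=weights)   # step 2: train with the weights
    pred = stump.predict(X_b)
    err = weights[pred != y_b].sum()
    if err == 0 or err >= 0.5:
        break
    alpha = 0.5 * np.log((1 - err) / err)
    # step 3: increase weights on misclassified examples, decrease on correct ones
    weights *= np.exp(-alpha * y_b * pred)
    weights /= weights.sum()               # step 4: go back to step 2
    stumps.append(stump)
    alphas.append(alpha)

# the boosted ensemble predicts with a weighted vote of the stumps
ensemble = np.sign(sum(a * s.predict(X_b) for a, s in zip(alphas, stumps)))
print('Training accuracy:', np.mean(ensemble == y_b))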

6 Gradient Boosting

Gradient Boosting sequentially fits each new tree to the residual errors made by the previous trees. Over time, an ensemble of trees is created which can then collectively make predictions.
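
A minimal regression sketch of this idea, with an arbitrary dataset, tree depth, and learning rate chosen only for illustration:

from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X_d, y_d = load_diabetes(return_X_y=True)
learning_rate = 0.1
prediction = np.full(len(y_d), y_d.mean())   # start from a constant prediction
boosted_trees = []

for _ in range(50):
    residuals = y_d - prediction                      # errors of the current ensemble
    tree_m = DecisionTreeRegressor(max_depth=3)
    tree_m.fit(X_d, residuals)                        # fit the next tree to the residuals
    prediction += learning_rate * tree_m.predict(X_d) # add a damped correction
    boosted_trees.append(tree_m)

print('Training MSE:', np.mean((y_d - prediction) ** 2))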

6.1 Extreme Gradient Boosting (xgboost)

In this section, we perform a detailed application of xgboost using the xgboost API instead of the sklearn API. We also document the use of hyperopt for hyperparameter optimization.

In the chunk below, we load and preprocess our MNIST data.
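
A minimal sketch of this step, assuming we pull MNIST from OpenML via fetch_openml and take a 75/25 split (the variable names and preprocessing choices here are our own):

# pull MNIST from OpenML; scale pixel values to [0, 1] and cast labels to int
X_mnist, y_mnist = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
X_mnist = X_mnist / 255.0
y_mnist = y_mnist.astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X_mnist, y_mnist, test_size=0.25, random_state=0)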

Since we are interested in a classification task, the code below allows us to calculate the weights for our classes in case we want to better balance the observations that we do have.
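
A sketch of one common weighting scheme (inverse class frequency), assuming the split above; the resulting vector could be passed to the DMatrix below if we choose to rebalance:

class_counts = np.bincount(y_train)
class_weights = len(y_train) / (len(class_counts) * class_counts)  # inverse-frequency weights
sample_weights = class_weights[y_train]                            # one weight per observation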

Next, we prepare our data by using xgboost’s DMatrix.
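
A sketch of the conversion, assuming the arrays from the split above (the optional weight argument would attach the class weights computed earlier):

dtrain = xgb.DMatrix(X_train, label=y_train)   # or xgb.DMatrix(X_train, label=y_train, weight=sample_weights)
dtest = xgb.DMatrix(X_test, label=y_test)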

We set up some initial hyperparameters for our model like so:
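
The exact values behind the output further below are not given, so the settings here are assumptions; the objective and metric are chosen to match the merror columns reported later:

params = {
    'objective': 'multi:softmax',  # predict the class label directly
    'num_class': 10,
    'eval_metric': 'merror',       # multiclass classification error
    'eta': 0.3,
    'max_depth': 6,
    'tree_method': 'hist',
}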

Next, we specify that our gradient boosting algorithm should be evaluated against our test set performance, as we are aiming for a final prediction, and we create a container for its per-round results.
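
A sketch of the evaluation list and results container; the 'eval' entry is the test set that drives early stopping:

watchlist = [(dtrain, 'train'), (dtest, 'eval')]
results = {}   # xgb.train will fill this with per-round train/eval metrics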

Then we prepare a learning rate scheduler that lowers our learning rate over time.
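
A sketch of one way to do this, assuming a recent xgboost release that provides the LearningRateScheduler callback (older releases exposed xgb.callback.reset_learning_rate instead); the decay schedule itself is arbitrary:

def eta_decay(boosting_round):
    # shrink the base learning rate a little more each round
    return 0.3 * (0.99 ** boosting_round)

lr_scheduler = xgb.callback.LearningRateScheduler(eta_decay)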

Finally, we are prepared to train our model. For the purposes of this demonstration, we set num_boost_round to 10 to end the process quickly. In practice, we would set this value very high and rely on our early_stopping_rounds to end our training early in the event of no progress.
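
A sketch of such a training call, assuming the objects from the sketches above; the exact configuration behind the output that follows may differ:

model = xgb.train(params, dtrain,
                  num_boost_round=10,          # kept small for this demonstration
                  evals=watchlist,
                  evals_result=results,
                  early_stopping_rounds=5,     # stop if eval-merror stalls for 5 rounds
                  callbacks=[lr_scheduler])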

## [0]  train-merror:0.130077   eval-merror:0.156113
## Multiple eval metrics have been passed: 'eval-merror' will be used for early stopping.
## 
## Will train until eval-merror hasn't improved in 5 rounds.
## [1]  train-merror:0.104483   eval-merror:0.12966
## [2]  train-merror:0.090636   eval-merror:0.115995
## [3]  train-merror:0.082731   eval-merror:0.107127
## [4]  train-merror:0.078138   eval-merror:0.100969
## [5]  train-merror:0.074766   eval-merror:0.097862
## [6]  train-merror:0.072347   eval-merror:0.09468
## [7]  train-merror:0.069547   eval-merror:0.092747
## [8]  train-merror:0.066299   eval-merror:0.09012
## [9]  train-merror:0.064583   eval-merror:0.088784

We generate our predictive results as usual:
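
A sketch of this step, assuming the trained model above; with objective 'multi:softmax', predict() returns class labels, which we score with accuracy, a confusion matrix, and a macro-averaged F1:

preds = model.predict(dtest).astype(int)
print('the accuracy of this classification is', accuracy_score(y_test, preds))
print(confusion_matrix(y_test, preds))
print(f1_score(y_test, preds, average='macro'))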

## the accuracy of this classification is 0.9122
## array([[1641,    1,    2,    4,    6,    5,    9,    0,   29,    5],
##        [   1, 1893,   23,   11,    2,    1,    7,    4,   13,    1],
##        [  15,   19, 1575,   38,   23,    4,   12,   19,   44,    8],
##        [   7,   12,   60, 1549,    8,   31,    8,   19,   49,   25],
##        [   4,    7,   13,    3, 1553,    4,   10,    4,   15,   85],
##        [  21,    8,   10,   49,   13, 1376,   35,    8,   37,   34],
##        [   9,    6,    8,    2,   21,   55, 1591,    0,   39,    2],
##        [   6,   12,   34,   11,   19,    2,    1, 1678,    8,   48],
##        [  11,   34,   15,   31,   14,   26,   13,    6, 1538,   39],
##        [   7,    7,    8,   28,   62,   12,    1,   45,   10, 1569]],
##       dtype=int64)
## 0.9114312314368508

6.2 Hyperopt: xgboost

Hyperopt is one of the most popular and advanced hyperparameter search algorithms in use today. In the section below, we document its use:

Here we specify the parameters to search and sample over. The hyperparameters below, which aim to control model complexity, are chosen primarily based on comments from Tong He, an xgboost contributor.
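
A sketch of such a search space; the parameter names match the best-parameter printout further below, but the ranges are assumptions:

space = {
    'max_depth': hp.quniform('max_depth', 3, 12, 1),
    'min_child_weight': hp.quniform('min_child_weight', 1, 6, 1),
    'gamma': hp.uniform('gamma', 0, 1),
    'subsample': hp.quniform('subsample', 0.5, 1.0, 0.05),
    'colsample_bytree': hp.quniform('colsample_bytree', 0.5, 1.0, 0.05),
    'eta': hp.uniform('eta', 0.05, 0.5),
}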

And we execute hyperopt like so:
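
A sketch of the objective and the search call, assuming the space, base params, and DMatrix objects defined above; each trial trains a short model and reports its held-out error as the loss for hyperopt to minimize:

def objective(sampled):
    trial_params = dict(params)              # start from the base parameters
    trial_params.update({
        'max_depth': int(sampled['max_depth']),
        'min_child_weight': int(sampled['min_child_weight']),
        'gamma': sampled['gamma'],
        'subsample': sampled['subsample'],
        'colsample_bytree': sampled['colsample_bytree'],
        'eta': sampled['eta'],
    })
    booster = xgb.train(trial_params, dtrain, num_boost_round=10,
                        evals=[(dtest, 'eval')], early_stopping_rounds=5,
                        verbose_eval=False)
    trial_preds = booster.predict(dtest).astype(int)
    return {'loss': 1 - accuracy_score(y_test, trial_preds), 'status': STATUS_OK}

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=30, trials=trials)
print('Best XGB parameters are', best)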

## 
  0%|          | 0/30 [00:00<?, ?trial/s, best loss=?]
  3%|3         | 1/30 [00:15<07:19, 15.15s/trial, best loss: 0.09253672848487682]
  7%|6         | 2/30 [00:33<07:34, 16.23s/trial, best loss: 0.06922530527562143]
 10%|#         | 3/30 [01:45<14:42, 32.70s/trial, best loss: 0.03956804422541704]
 13%|#3        | 4/30 [02:29<15:42, 36.24s/trial, best loss: 0.03941184211827076]
 17%|#6        | 5/30 [03:19<16:50, 40.42s/trial, best loss: 0.038020415097349436]
 20%|##        | 6/30 [04:07<16:59, 42.49s/trial, best loss: 0.0342222628419494]  
 23%|##3       | 7/30 [04:53<16:45, 43.71s/trial, best loss: 0.033233773246691745]
 27%|##6       | 8/30 [05:10<13:02, 35.55s/trial, best loss: 0.033233773246691745]
 30%|###       | 9/30 [06:32<17:24, 49.74s/trial, best loss: 0.033233773246691745]
 33%|###3      | 10/30 [07:15<15:52, 47.65s/trial, best loss: 0.033233773246691745]
 37%|###6      | 11/30 [08:08<15:34, 49.17s/trial, best loss: 0.033233773246691745]
 40%|####      | 12/30 [09:34<18:04, 60.25s/trial, best loss: 0.033233773246691745]
 43%|####3     | 13/30 [10:09<14:52, 52.53s/trial, best loss: 0.033233773246691745]
 47%|####6     | 14/30 [10:27<11:17, 42.34s/trial, best loss: 0.033233773246691745]
 50%|#####     | 15/30 [10:58<09:41, 38.76s/trial, best loss: 0.033233773246691745]
 53%|#####3    | 16/30 [12:22<12:15, 52.55s/trial, best loss: 0.033233773246691745]
 57%|#####6    | 17/30 [13:13<11:15, 51.98s/trial, best loss: 0.033233773246691745]
 60%|######    | 18/30 [14:26<11:39, 58.25s/trial, best loss: 0.033233773246691745]
 63%|######3   | 19/30 [14:34<07:54, 43.10s/trial, best loss: 0.033233773246691745]
 67%|######6   | 20/30 [15:43<08:29, 50.95s/trial, best loss: 0.033233773246691745]
 70%|#######   | 21/30 [16:17<06:52, 45.84s/trial, best loss: 0.033233773246691745]
 73%|#######3  | 22/30 [17:00<05:59, 44.95s/trial, best loss: 0.033233773246691745]
 77%|#######6  | 23/30 [17:42<05:09, 44.19s/trial, best loss: 0.033233773246691745]
 80%|########  | 24/30 [18:02<03:41, 36.96s/trial, best loss: 0.033233773246691745]
 83%|########3 | 25/30 [18:35<02:59, 35.87s/trial, best loss: 0.033233773246691745]
 87%|########6 | 26/30 [19:19<02:33, 38.29s/trial, best loss: 0.033233773246691745]
 90%|######### | 27/30 [20:06<02:02, 40.91s/trial, best loss: 0.033233773246691745]
 93%|#########3| 28/30 [20:45<01:20, 40.34s/trial, best loss: 0.033233773246691745]
 97%|#########6| 29/30 [20:55<00:31, 31.14s/trial, best loss: 0.033233773246691745]
100%|##########| 30/30 [21:29<00:00, 32.10s/trial, best loss: 0.033233773246691745]
100%|##########| 30/30 [21:29<00:00, 43.00s/trial, best loss: 0.033233773246691745]
## Best XGB parameters are {'colsample_bytree': 0.7000000000000001, 'eta': 0.4825210766587527, 'gamma': 0.5067306790027051, 'max_depth': 10, 'min_child_weight': 2, 'subsample': 0.9}

We can plot the results of our hyperopt search by running the following code. The plot shows where in the search space hyperopt concentrated its sampling and how each sampled value performed: the x-axis represents the values searched while the y-axis represents the loss.
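
A sketch of one way to build such a plot, assuming the trials object from the search above:

searched_params = ['max_depth', 'min_child_weight', 'gamma',
                   'subsample', 'colsample_bytree', 'eta']
losses = trials.losses()

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for ax, param in zip(axes.flatten(), searched_params):
    values = [t['misc']['vals'][param][0] for t in trials.trials]
    ax.scatter(values, losses, alpha=0.5)
    ax.set_xlabel(param)
    ax.set_ylabel('loss')
plt.tight_layout()
plt.show()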

[Figure: Hyperopt Search Space]

6.3 Cross-Validating xgboost

We can validate our hyperopt search by running its suggested parameters through xgboost’s built-in cross-validation routine.
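
A sketch of such a run, assuming the best dictionary returned by hyperopt and the base params and dtrain objects from above:

cv_params = dict(params)
cv_params.update({
    'max_depth': int(best['max_depth']),
    'min_child_weight': int(best['min_child_weight']),
    'gamma': best['gamma'],
    'subsample': best['subsample'],
    'colsample_bytree': best['colsample_bytree'],
    'eta': best['eta'],
})

cv_results = xgb.cv(cv_params, dtrain, num_boost_round=10, nfold=5,
                    metrics='merror', early_stopping_rounds=5, seed=0)
print(cv_results)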

##    train-merror-mean  train-merror-std  test-merror-mean  test-merror-std
## 0           0.060292          0.001115          0.105184         0.000449
## 1           0.032558          0.000493          0.079282         0.000319
## 2           0.018925          0.000377          0.067146         0.000662
## 3           0.010944          0.000346          0.059527         0.000422
## 4           0.006409          0.000081          0.053537         0.000437
## 5           0.003538          0.000186          0.049477         0.000876
## 6           0.002213          0.000308          0.046493         0.000578
## 7           0.001338          0.000131          0.044496         0.000765
## 8           0.000691          0.000080          0.042789         0.001099
## 9           0.000371          0.000070          0.040340         0.000971

7 Sources