```
# load packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import graphviz
import xgboost as xgb
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from hyperopt import hp, fmin, tpe, STATUS_OK, Trials
from sklearn.datasets import fetch_openml
np.random.seed(0) # set seed
```

The *Decision Tree* algorithm is part of a family of *classifier* and *regression* algorithms that aim to predict the *class* or *value* of an observation. Decision trees classify data by splitting features at specified thresholds such that, ideally, we can perfectly predict the observation’s label. At its core, features are split by using two relatively simplistic algorithms: `entropy`

and `information gain`

. When deciding how to split a feature, a threshold is selected such that the informational gain is the highest, meaning more information is revealed and thereby our predictions for our dependent variable’s label is improved (or perfect). Decision trees are very popular algorithms and have done exceedingly well in competitions. Indeed, academic research from 2014 found decision trees among the most accurate and generalizable across many different data sets in a comparison of 179 different classifiers (Fernández-Delgado et. al, 2014).

In short, we should use decision trees when we want to present highly interpretative predictions because as we will see, we can produce very simple and useful graphical outputs of our analysis. The timing required to run decision trees, and more extensive random forests, will of course depend on the number of features and the size of the data. In fact, the number of possible decision trees is estimated by the following calculation: \(2^{(2^n)}\) so for a model with six features, there are potentially \(18446744073709551616\) possible decision trees. This range of possibilities means that we can increase our likelihood of predicting \(\hat{Y}\).

Decision Trees also basd on a very popular and recurring Kaggle-winning algorithm known as Extreme Gradient Boosting, or `XGBoost`

. At the end of this demonstration, a full application of `xgboost`

is provided. `xgboost`

is so powerful that we should consider using it for nearly every machine learning problem that is not associated with computer vision and image recognition or natural language processing.

Decision tree algorithms require that we compute two quantities of interest: `entropy`

and `information gain`

. `Entropy`

is a measure of uncertainty while `information gain`

is a measure of the change in entropy. To better understand these algorithms, we replicate and walk through them below with artificial data.

```
def entropy(distribution):
''' entropy function calculated by:
sum(-probability * np.log2(probability))
'''
h = 0.0
for probability in distribution:
if probability > 0.0:
h += -probability * np.log2(probability)
return h
```

As we can see from our function above, it expects a probability distribution that we can generate below:

We can see that entropy is maximized when we are least sure about something; here we have just three equal groups in our uniform distribution.

`## 1.5834674497121084`

`## 0.6098403047164004`

Before we begin to look at information gain, let us first start building our functions and loading our data so that we can approach it methodically. Our decision tree analysis will be based on the simple but instructive `iris`

data set provided by `sklearn`

.

```
iris = load_iris()
X, y = iris.data, iris.target
# Shuffle the data, but make sure that the features and accompanying labels stay in sync.
np.random.seed(0)
shuffle = np.random.permutation(np.arange(X.shape[0]))
X, y = X[shuffle], y[shuffle]
# Split into train and test.
train_data, train_labels = X[:100], y[:100]
test_data, test_labels = X[100:], y[100:]
```

Next, we need a function to generate the distribution of our labels so we can feed it into our entropy function.

```
def get_label_distribution(labels):
# initialize counters for all labels to zero
label_probs = np.bincount(labels)
# Normalize to get a distribution
label_probs = label_probs / label_probs.sum()
return label_probs
```

We begin by calculating our starting `entropy`

by feeding our training labels into the functions.

Now that we understand our initial entropy, what we want to do next is find which feature in our data set will maximize our `information gain`

, or in other words, give us the most predictive power as the root variable for our tree. For right now, we will show how to calculate `information gain`

for a single feature and then we will generalize it to show why we choose some features over others for the roots of our trees. Since we already know that `petal_length`

is tied for the most important feature, we will use that variable to demonstrate the calculation and intuition behind `information gain`

.

A common technique used in building decision trees is to binarize features such that `information gain`

is maximized. While decision trees can support continuous data, thereby offering increased precision, we can aim for parsimony through binarization. So, then, how can we split the data such that we can generate the most accurate predictions on our first split? A histogram will help make this clear.

Notice the gap between the histogram’s bins between roughly 2 and 2.5. The values below 2 for `petal_length`

are those of the Setosa while values above roughly 2.5 are those of the Virginica and Versicolor. Thus, if we want to classify all of the Setosas based on `petal_length`

, the histogram tells us that we can do so easily by binarizing our feature `petal_length`

with a threshold at any point roughly between these two values. With all of the Setosas classified, we can then move onto correctly classifying the Virginicas and Versicolors through the rest of the features.

So let’s binarize `petal_length`

, based on the threshold we visually inspected above and then calculate its `information gain`

.

```
# pull out petal length from data
petal_length = train_data[:, 2] # all rows; third column
def binarize_iris(data, threshold):
# Initialize a new feature array with the same shape as the original data.
binarized_data = np.zeros(data.shape)
# Apply a threshold to each feature.
for feature in range(100):
binarized_data[feature] = data[feature] <= threshold
return binarized_data
binarized_petal_length = binarize_iris(petal_length, 2.45)
```

Next, we want to split our feature and create two tree branches. One branch will contain all of our Setosas while the other branch will contain the Virginicas and Versicolors.

```
subset0, subset1 = [], []
for datum, label in zip(binarized_petal_length, train_labels):
if datum > 0:
subset1.append(label)
else:
subset0.append(label)
```

Then, we want to calculate `information gain`

, which we do by: (1) getting the distribution of our binarized and thresholded subset of `petal_length`

, (2) we generate weights by dividing the fraction of observations in our thresholded subset over the whole of the training set, (3) we compute the weighted entropy score, and (4) we compare the resulting value with the initial entropy value. We then repeat this process for every other feature i.e., `petal_width`

, `sepal_width`

, etc. The feature with the highest `information gain`

value is then used as root of our tree from which we split.

```
# compute the entropy of each subset.
subset0_entropy = entropy(get_label_distribution(subset0))
subset1_entropy = entropy(get_label_distribution(subset1))
# compute the weight
subset0_weight = 1.0 * len(subset0) / len(train_labels)
subset1_weight = 1.0 * len(subset1) / len(train_labels)
# compute weighted entropy
final_entropy = (subset0_weight * subset0_entropy) + (subset1_weight * subset1_entropy)
# compute information gain as the difference between the initial and final entropy.
information_gain = initial_entropy - final_entropy
print(information_gain)
```

`## 0.8931734583778569`

To see that `petal_length`

indeed is tied for the highest `information gain`

, we prepare two functions: (1) to generalize information gain, and (2) to find the threshold that delivers the highest information gain.

```
def information_gain(data, labels, feature, threshold=0):
# Get the initial entropy of the label distribution.
initial_entropy = entropy(get_label_distribution(labels))
# subset0 will contain the labels for which the feature is 0 (short) and
# subset1 will contain the labels for which the feature is 1 (long).
subset0, subset1 = [], []
for datum, label in zip(data, labels):
if datum[feature] > threshold:
subset1.append(label)
else:
subset0.append(label)
# Compute the entropy of each subset.
subset0_entropy = entropy(get_label_distribution(subset0))
subset1_entropy = entropy(get_label_distribution(subset1))
# Compute the final entropy by weighting each subset's entropy according to its size.
subset0_weight = 1.0 * len(subset0) / len(labels)
subset1_weight = 1.0 * len(subset1) / len(labels)
final_entropy = subset0_weight * subset0_entropy + subset1_weight * subset1_entropy
# Finally, compute information gain as the difference between the initial and final entropy.
return initial_entropy - final_entropy
```

```
def try_features_and_thresholds(data, labels):
best_thresholds = []
for feature in range(data.shape[1]):
# Choose a set of thresholds between the min- and max-valued features.
thresholds = np.linspace(data[:,feature].min(), data[:,feature].max(), 20)[1:-1]
# Try each threshold and keep track of the best one for this feature.
best_threshold = 0
best_ig = 0
for threshold in thresholds:
ig = information_gain(data, labels, feature, threshold)
if ig > best_ig:
best_ig = ig
best_threshold = threshold
best_thresholds.append(best_threshold)
# Show the best threshold and information gain for this feature.
print('%d %s\t best threshold: %.3f | maximal info gain: %.3f '
%(feature, iris.feature_names[feature], best_threshold, best_ig))
return best_thresholds
optimal_thresholds = try_features_and_thresholds(train_data, train_labels)
```

```
## 0 sepal length (cm) best threshold: 5.732 | maximal info gain: 0.525
## 1 sepal width (cm) best threshold: 3.389 | maximal info gain: 0.338
## 2 petal length (cm) best threshold: 2.116 | maximal info gain: 0.893
## 3 petal width (cm) best threshold: 0.605 | maximal info gain: 0.893
```

The results above tell us the relative `information gain`

provided by the identified threshold. It tells us that we should select `petal_length`

or `petal_width`

as they are tied for the highest `information gain`

and thus one of the two should be the root of our decision tree and what we split on first.

Decision trees then follow this process iteratively until each observation is classified or split as far as it can meaningfully go. To select the next branch, we then take what we have left to classify, which in our case are the Virginicas and Versicolors, and then we determine which feature will yield us the most successful classifications as determined by `information gain`

.

The code below shows how this process unfolds through `sklearn`

:

```
# only split if we get 10 observations classified in a split
dt = DecisionTreeClassifier(criterion = 'entropy', min_samples_split = 10)
# fit the data
dt.fit(train_data, train_labels)
# print the accuracy
```

```
## DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
## max_depth=None, max_features=None, max_leaf_nodes=None,
## min_impurity_decrease=0.0, min_impurity_split=None,
## min_samples_leaf=1, min_samples_split=10,
## min_weight_fraction_leaf=0.0, presort='deprecated',
## random_state=None, splitter='best')
```

`## Accuracy: 0.96`

`## [0. 0. 0.67259596 0.32740404]`

We can see that the Decision Tree does remarkably well classifying the Iris data set by only using two of its features; `petal_width`

and `petal_length`

. This makes intuitive sense because they are both the features that we found with the highest `information gain`

because we could quickly and easily separate large numbers of the iris’ through thresholds. `dt.feature_importances_`

, a `sklearn`

function tells us just how important each feature was. In this case, we can see that the algorithm found `petal_length`

the most important in its classifying task.

```
# produce graphical output
dot_data = tree.export_graphviz(dt, out_file=None,
feature_names=iris.feature_names,
class_names=iris.target_names,
filled=True, rounded=True)
graph = pydotplus.graph_from_dot_data(dot_data)
display(Image(data=graph.create_png()))
```

There are a couple feature engineering strategies that we can take when building decision trees to improve our ability to generalize across new test data. The first is thresholding, which is a strategy that we have undertaken extensively here in this guide. The point here is that if we can collapse multiple categories into one or a few, we can build a more parsimonious model that will generalize better. For instance, instead having a feature for whether or not vehicles have 2, 4, 6, or 8 cylinders, we might collapse these categories into two: greater than 6 and less than 6 as the specificity between cylinders may not be all that helpful. In other words, we might aim to binarize or trinarize our data for parsimony.

The second strategy is pruning. When pruning, we build a completely built out tree and then proceed to delete branch splits where the function `pchance`

is greater than the a threshold `maxpchance`

. `pchance`

, derived from chi-square tests, checks whether or not the data’s distribution occurred by chance and is calculated at each branch. `maxpchance`

is a regularization and hyperparameter that is often estimated with development data. `maxpchance`

is a threshold that we, the machine learning engineer, sets. In general, as `maxpchance`

increases, our test data accuracy increases, but at some point, it will start to decrease and that is when we stop increasing `maxpchance`

. This practice again helps us prevent over fitting and allows us generalize better on our test data. In terms of `xgboost`

, we use the hyperparameters `gamma`

and `max_depth`

.

Boosting is a term which describes a sequential training of models that emphasize misclassified observations. It is a technique that is executed in the following steps:

- Weight each observation as 1/n
- Train the classifier using the weights
- Reduce weights for correct examples; increase weights for misclassified examples
- Go back to step 2
- Stop after i iterations or generally when accuracy stops improving

Gradient Boosting sequentially fits a new tree to the residual errors made by previous tree. Over time, an ensemble of trees is created which can then collectively make predictions.

In this section, we perform a detailed application of `xgboost`

using the the `xgboost`

API instead of the `sklearn`

API. We also document the use of `hyperopt`

for hyperparameter optimization.

In the chunk below, we load and preprocess our MNIST data.

```
# load MNIST from https://www.openml.org/d/554 via fetch_openml
X, y = fetch_openml(name='mnist_784', return_X_y=True, cache=False)
# shuffle the data
shuffle = np.random.permutation(np.arange(X.shape[0])) # generate random indices
X, y = X[shuffle], y[shuffle] # apply shuffle
# set data types
X = X.astype('float32')
y = y.astype('float32')
# split data
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, random_state=42)
X_train, X_dev, y_train, y_dev = train_test_split(X_train_full, y_train_full, random_state=42)
# standardize
scaler = StandardScaler()
X_train_full = scaler.fit_transform(X_train_full)
X_train = scaler.transform(X_train)
X_dev = scaler.transform(X_dev)
X_test = scaler.transform(X_test)
```

Since we are interested in a classification task, the code below allows us to calculate the weights for our classes in case we want to better balance the observations that we do have.

```
# find appropriate weights for balancing classes
def find_weights(y_train):
bins = np.bincount(y_train.astype('int64')) # find bins
bin_perc = bins/np.sum(bins) # get percentage distribution
min_val = np.amin(bin_perc) # get min value
weighting = min_val/bin_perc # divide smallest by others
weight_container = np.empty([0,]) # create storage
for i in y_train.astype('float32'): # make y same type so condition can find it
tmp_array = np.zeros([1,]) # create temp array
for index, value in enumerate(weighting): # enumerate
if i == index:
tmp_array[0] = value # append value
# append to storage
weight_container = np.append(weight_container, tmp_array, axis = 0)
return weight_container
```

Next, we prepare our data by using `xgboost`

’s `DMatrix`

.

```
# prepare xgboost data
xg_train = xgb.DMatrix(X_train, label=y_train, weight=find_weights(y_train))
xg_test = xgb.DMatrix(X_test, label=y_test, weight=find_weights(y_test))
xg_dev = xgb.DMatrix(X_dev, label=y_dev, weight=find_weights(y_dev))
```

We setup some initial hyperparameters for our model like so:

```
# setup parameters for xgboost
param = {'objective': 'multi:softprob', # softmax multiclass
'eta': 0.1, # learning rate
'max_depth': 6, # depth of tree
'eval_metric': 'merror', # multiclass error rate
'num_class': 10} # 10 classes to estimate
```

Next, we specify that our gradient boosting algorithm be based on our `test`

set performance, as we are aiming for a final prediction, and create a container for its results.

```
# validations set to watch performance
watchlist = [(xg_train,'train'), (xg_test,'eval')]
# validation results
evals_results = {}
```

Then we parepare a learning rate scheduler that lowers our learning rate over time.

```
# decay learning rate
def learning_rate_decay(boosting_round, num_boost_round):
learning_rate_start = param['eta']
learning_rate_min = 0.01
lr_decay = 0.95
lr = learning_rate_start * np.power(lr_decay, boosting_round)
return max(learning_rate_min, lr)
```

Finally, we are prepared to train our model. For the purposes of this demonstration, we set `num_boost_round`

to 10 to end the process quickly. In practice, we would set this value very high and rely on our `early_stopping_rounds`

to end our training early in the event of no progress.

```
bst = xgb.train(params=param,
dtrain=xg_train,
num_boost_round=10,
evals=watchlist,
evals_result=evals_results,
early_stopping_rounds=5,
callbacks=[xgb.callback.reset_learning_rate(learning_rate_decay)])
```

```
## [0] train-merror:0.130077 eval-merror:0.156113
## Multiple eval metrics have been passed: 'eval-merror' will be used for early stopping.
##
## Will train until eval-merror hasn't improved in 5 rounds.
## [1] train-merror:0.104483 eval-merror:0.12966
## [2] train-merror:0.090636 eval-merror:0.115995
## [3] train-merror:0.082731 eval-merror:0.107127
## [4] train-merror:0.078138 eval-merror:0.100969
## [5] train-merror:0.074766 eval-merror:0.097862
## [6] train-merror:0.072347 eval-merror:0.09468
## [7] train-merror:0.069547 eval-merror:0.092747
## [8] train-merror:0.066299 eval-merror:0.09012
## [9] train-merror:0.064583 eval-merror:0.088784
```

We generate our predictive results as usual:

```
# get prediction
predicted_y = bst.predict(xg_test, ntree_limit=bst.best_ntree_limit) # produces probabilities
# transform to hard predictions
predicted_y = np.argmax(predicted_y, axis=1)
# manual accuracy for fun
correct_pred = predicted_y == y_test
# compare actual to predicted
total = len(y_test)
correct = 0.0
accuracy = 0.0
for j in range(len(y_test)):
if correct_pred[j]: # if true
correct += 1 # add 1
accuracy = correct / total # generate accuracy
print('the accuracy of this classification is', round(accuracy, 4))
```

`## the accuracy of this classification is 0.9122`

```
## array([[1641, 1, 2, 4, 6, 5, 9, 0, 29, 5],
## [ 1, 1893, 23, 11, 2, 1, 7, 4, 13, 1],
## [ 15, 19, 1575, 38, 23, 4, 12, 19, 44, 8],
## [ 7, 12, 60, 1549, 8, 31, 8, 19, 49, 25],
## [ 4, 7, 13, 3, 1553, 4, 10, 4, 15, 85],
## [ 21, 8, 10, 49, 13, 1376, 35, 8, 37, 34],
## [ 9, 6, 8, 2, 21, 55, 1591, 0, 39, 2],
## [ 6, 12, 34, 11, 19, 2, 1, 1678, 8, 48],
## [ 11, 34, 15, 31, 14, 26, 13, 6, 1538, 39],
## [ 7, 7, 8, 28, 62, 12, 1, 45, 10, 1569]],
## dtype=int64)
```

`## 0.9114312314368508`

Hyperopt is one of the most popular and advanced hyperparameter search algorithms in use today. In the section below, we document its use:

```
def objective(param):
# validations set to watch performance
watchlist = [(xg_train,'train'), (xg_dev,'eval')]
# validation results
evals_results = {}
# boost
bst = xgb.train(params=param,
dtrain=xg_train,
num_boost_round=30,
evals=watchlist,
evals_result=evals_results,
early_stopping_rounds=5,
verbose_eval=False)
# get best run
predicted_y = bst.predict(xg_dev, ntree_limit=bst.best_iteration)
# transform to hard predictions
predicted_y = np.argmax(predicted_y, axis=1)
current_score = f1_score(y_dev, np.round(predicted_y), average='weighted')
#use 1-merror; we are minimizing
return {'loss': 1-current_score, 'status': STATUS_OK, 'model': bst}
```

Here we specify the parameters to search and sample over. The hyperparameters below, which aim to control model complexity, are chosen primarily based on comments from Tong He, a `xgboost`

code contributor.

```
# setup parameters for xgboost
param = {'objective': 'multi:softprob', # softmax multiclass
'eval_metric': 'merror',
'num_class': 10, # number of outputs
# control complexity
'gamma': hp.uniform('gamma', 0, 2),
'max_depth': hp.choice('max_depth', np.arange(1, 16, 1, dtype=int)),
'eta': hp.uniform('eta', 0.01, 0.7),
'subsample': hp.quniform('subsample', 0.5, 1.0, 0.10),
'colsample_bytree': hp.quniform('colsample_bytree', 0.5, 1.0, 0.10),
'min_child_weight': hp.choice('min_child_weight', np.arange(0, 4, 1, dtype=int))
}
```

And we execute `hyperopt`

like so:

```
trials = Trials()
best = fmin(fn=objective,
space=param,
algo=tpe.suggest,
max_evals=30,
trials=trials)
```

```
##
0%| | 0/30 [00:00<?, ?trial/s, best loss=?]
3%|3 | 1/30 [00:15<07:19, 15.15s/trial, best loss: 0.09253672848487682]
7%|6 | 2/30 [00:33<07:34, 16.23s/trial, best loss: 0.06922530527562143]
10%|# | 3/30 [01:45<14:42, 32.70s/trial, best loss: 0.03956804422541704]
13%|#3 | 4/30 [02:29<15:42, 36.24s/trial, best loss: 0.03941184211827076]
17%|#6 | 5/30 [03:19<16:50, 40.42s/trial, best loss: 0.038020415097349436]
20%|## | 6/30 [04:07<16:59, 42.49s/trial, best loss: 0.0342222628419494]
23%|##3 | 7/30 [04:53<16:45, 43.71s/trial, best loss: 0.033233773246691745]
27%|##6 | 8/30 [05:10<13:02, 35.55s/trial, best loss: 0.033233773246691745]
30%|### | 9/30 [06:32<17:24, 49.74s/trial, best loss: 0.033233773246691745]
33%|###3 | 10/30 [07:15<15:52, 47.65s/trial, best loss: 0.033233773246691745]
37%|###6 | 11/30 [08:08<15:34, 49.17s/trial, best loss: 0.033233773246691745]
40%|#### | 12/30 [09:34<18:04, 60.25s/trial, best loss: 0.033233773246691745]
43%|####3 | 13/30 [10:09<14:52, 52.53s/trial, best loss: 0.033233773246691745]
47%|####6 | 14/30 [10:27<11:17, 42.34s/trial, best loss: 0.033233773246691745]
50%|##### | 15/30 [10:58<09:41, 38.76s/trial, best loss: 0.033233773246691745]
53%|#####3 | 16/30 [12:22<12:15, 52.55s/trial, best loss: 0.033233773246691745]
57%|#####6 | 17/30 [13:13<11:15, 51.98s/trial, best loss: 0.033233773246691745]
60%|###### | 18/30 [14:26<11:39, 58.25s/trial, best loss: 0.033233773246691745]
63%|######3 | 19/30 [14:34<07:54, 43.10s/trial, best loss: 0.033233773246691745]
67%|######6 | 20/30 [15:43<08:29, 50.95s/trial, best loss: 0.033233773246691745]
70%|####### | 21/30 [16:17<06:52, 45.84s/trial, best loss: 0.033233773246691745]
73%|#######3 | 22/30 [17:00<05:59, 44.95s/trial, best loss: 0.033233773246691745]
77%|#######6 | 23/30 [17:42<05:09, 44.19s/trial, best loss: 0.033233773246691745]
80%|######## | 24/30 [18:02<03:41, 36.96s/trial, best loss: 0.033233773246691745]
83%|########3 | 25/30 [18:35<02:59, 35.87s/trial, best loss: 0.033233773246691745]
87%|########6 | 26/30 [19:19<02:33, 38.29s/trial, best loss: 0.033233773246691745]
90%|######### | 27/30 [20:06<02:02, 40.91s/trial, best loss: 0.033233773246691745]
93%|#########3| 28/30 [20:45<01:20, 40.34s/trial, best loss: 0.033233773246691745]
97%|#########6| 29/30 [20:55<00:31, 31.14s/trial, best loss: 0.033233773246691745]
100%|##########| 30/30 [21:29<00:00, 32.10s/trial, best loss: 0.033233773246691745]
100%|##########| 30/30 [21:29<00:00, 43.00s/trial, best loss: 0.033233773246691745]
```

`## Best XGB parameters are {'colsample_bytree': 0.7000000000000001, 'eta': 0.4825210766587527, 'gamma': 0.5067306790027051, 'max_depth': 10, 'min_child_weight': 2, 'subsample': 0.9}`

We can plot the results of our `hyperopt`

search by running the following code. The plot will tell us the direction to which our search space ran and its results. The x-axis represents the values searched while the y-axis represents the loss.

```
parameters = ['max_depth', 'eta', 'subsample', 'colsample_bytree', 'gamma', 'min_child_weight']
f, axes = plt.subplots(nrows=1, ncols=5, figsize=(20,5))
cmap = plt.cm.jet
for i, val in enumerate(parameters):
xs = np.array([t['misc']['vals'][val] for t in trials.trials]).ravel()
ys = [t['result']['loss'] for t in trials.trials]
xs, ys = zip(*sorted(zip(xs, ys)))
ys = np.array(ys)
axes[i].scatter(xs, ys, s=20, linewidth=0.01, alpha=0.5)
axes[i].set_title(val)
```

We can validate our `hyperopt`

search by running its suggested parameters over xgboost’s cross-validation algorithm.

```
# setup parameters for xgboost; best results from hyperopt ex-post
param = {'objective': 'multi:softprob', # softmax multiclass
'eta': 0.639, # learning rate
'max_depth': 14, # depth of tree
'gamma': 0.104, # regularization
'min_child_weight': 3,
'subsample': 0.9,
'colsample_bytree': 1.0,
'eval_metric': 'merror', # multiclass error rate
'num_class': 10} # 10 classes to estimate
# prepare xgboost cv data
xg_X = xgb.DMatrix(X, label=y, weight=find_weights(y))
bst_cv = xgb.cv(params=param,
dtrain=xg_X,
num_boost_round=10,
stratified=True,
nfold=3,
metrics='merror',
early_stopping_rounds=5,
as_pandas=True,
verbose_eval=False)
print(bst_cv)
```

```
## train-merror-mean train-merror-std test-merror-mean test-merror-std
## 0 0.060292 0.001115 0.105184 0.000449
## 1 0.032558 0.000493 0.079282 0.000319
## 2 0.018925 0.000377 0.067146 0.000662
## 3 0.010944 0.000346 0.059527 0.000422
## 4 0.006409 0.000081 0.053537 0.000437
## 5 0.003538 0.000186 0.049477 0.000876
## 6 0.002213 0.000308 0.046493 0.000578
## 7 0.001338 0.000131 0.044496 0.000765
## 8 0.000691 0.000080 0.042789 0.001099
## 9 0.000371 0.000070 0.040340 0.000971
```

- Fernández-Delgado, Manuel, Eva Cernadas, Senén Barro, and Dinani Amorim. “Do we need hundreds of classifiers to solve real world classification problems?.” The journal of machine learning research 15, no. 1 (2014): 3133-3181.