Faster Gradient-Boosting Decision Trees

How to Lift Machine Learning Performance Using
Intel® Data Analytics Acceleration Library

Machine learning (ML) has grown explosively over the last decade. And alongside specific domains such as image and speech recognition, where deep learning thrives, one of the most popular techniques for solving a wide range of real-world ML problems, like regression and classification, is gradient boosting. Recently, boosting has been used for page ranking in commercial Web search engines and in many winning solutions to ML challenges.

To make these techniques easier to use, Intel® Data Analytics Acceleration Library (Intel® DAAL) provides a collection of practical ML algorithms optimized for Intel® processors. The latest version adds gradient-boosted tree classification and regression algorithms based on the stochastic gradient boosting technique. The implementation enhances performance through multithreading and vectorization.

This article describes, in a nutshell, the gradient-boosted trees algorithm and explains why it's so widely used by ML practitioners. Compared to the popular XGBoost* library, Intel DAAL gradient boosting can achieve up to 6.5x better training performance on the same datasets.

Gradient Boosting and Decision Trees

Gradient boosting is a powerful supervised ML algorithm used to achieve state-of-the-art accuracy on a variety of tasks such as regression, classification, and ranking. The generalized gradient boosting algorithm, introduced by Jerome Friedman1, uses ensemble (or committee) methods to reduce the variance of an estimated prediction function. Gradient boosting is typically used with decision trees as base learners. Decision trees are simple, powerful analytical models used in a variety of analytic solutions such as segmentation, regression, and classification. [Editor’s note: You can learn about the advantages, construction, and scoring of decision tree models, as well as some of their limitations and ways to overcome them, here.] Ensemble learning is one of the advanced methods for managing sampling and overfitting errors in decision trees. Alongside random decision forests, gradient-boosted trees are another popular tree ensemble model, implemented in many ML software packages including scikit-learn*, XGBoost*, LightGBM*, GBM*, H2O*, Spark MLlib*, and OpenCV*.

Gradient-Boosted Decision Tree Algorithm Details

As we mentioned above, the gradient-boosted decision trees (GBDT) classification and regression algorithms are based on an ensemble of regression (decision) trees built using the stochastic gradient boosting technique.3

Given a data set of n p-dimensional feature vectors X = { x₁ = (x₁₁,…,x₁ₚ), …, xₙ = (xₙ₁,…,xₙₚ) } and n responses Y = { y₁,…,yₙ }, the learning stage of the algorithm builds a gradient-boosted trees classification or regression model from the feature and response data; the trained model is then used to classify or predict new incoming samples.

The tree ensemble model uses M additive functions to predict the output: ŷᵢ = f(xᵢ) = Σₖ fₖ(xᵢ), k = 1,…,M, with each fₖ ∈ F, where F = { f(x) = w_q(x), q: Rᵖ → {1,…,T}, w ∈ Rᵀ } is the space of regression trees, T is the number of leaves in a tree, w is the vector of leaf weights, and wᵢ is the score on the i-th leaf. The function q(x) represents the structure of each tree, mapping an observation to the corresponding leaf index.

The training procedure is an iterative functional gradient descent algorithm, which minimizes the objective function by iteratively choosing a function (regression tree) that points in the negative gradient direction. The objective function is:

L(f) = Σᵢ l(yᵢ, f(xᵢ)) + Σₖ Ω(fₖ)

where l is a twice-differentiable convex loss function and Ω(f) = γT + 1/2 λ||w||² is a regularization term that penalizes the complexity of the model, defined by the number of leaves T and the squared L2 norm of the leaf weights ||w||² for each tree; γ and λ are the regularization parameters.

The GBDT algorithm principle has been well explained.7 The main processing effort is in building the trees. Figure 1 shows an additive tree ensemble that determines whether a person likes computer games; the leaf (terminal) nodes of each tree hold the corresponding response values (i.e., that tree’s contribution to the prediction). (Find more details here.)

Figure 1 – Tree ensembles6
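To make the additive structure concrete, here is a toy sketch in the spirit of Figure 1. The split rules and leaf scores below are hypothetical (not taken from the figure or from Intel DAAL): each tree routes a sample to a leaf, and the model output is the sum of the leaf scores over all trees.

def tree1(person):
    # first tree: splits on age, then on gender; returns a leaf score
    if person["age"] < 15:
        return 2.0 if person["gender"] == "male" else 0.1
    return -1.0

def tree2(person):
    # second tree: splits on daily computer use
    return 0.9 if person["uses_computer_daily"] else -0.9

def ensemble_score(person):
    # f(x) = f1(x) + f2(x): the additive prediction
    return tree1(person) + tree2(person)

boy = {"age": 10, "gender": "male", "uses_computer_daily": True}
print(ensemble_score(boy))   # 2.0 + 0.9 = 2.9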

Training Flow

Let S = (X, Y) be the set of observations. Given the training parameters, such as the number of iterations M, the loss function l(f), the regression tree training parameters, the regularization parameters γ and λ, and the shrinkage (learning rate) parameter θ, the algorithm does the following (see the sketch after this list):

  • Find an initial guess ŷᵢ⁽⁰⁾, i = 1,…,n
  • For k = 1,…,M:

— Update the gradients gᵢ and hessians hᵢ, i = 1,…,n
— Grow a regression tree fₖ ∈ F that minimizes the objective function
— Assign the optimal weight wⱼ* = −Gⱼ/(Hⱼ + λ) to leaf j, j = 1,…,T, where Gⱼ and Hⱼ are the sums of gᵢ and hᵢ over the observations in leaf j
— Apply the shrinkage parameter θ to the tree leaves and add the tree to the model
— Update ŷᵢ⁽ᵏ⁾
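The following is a minimal NumPy/scikit-learn sketch of this loop for squared-error regression (illustrative only, not Intel DAAL code). With l(y, f) = (y − f)²/2 the gradient is gᵢ = ŷᵢ − yᵢ and the hessian is hᵢ = 1, so growing each tree reduces to fitting the negative gradient; the function and parameter names are made up.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_train(X, y, M=50, shrinkage=0.1, max_depth=6):
    pred = np.full(len(y), np.mean(y))          # initial guess ŷ(0)
    trees = []
    for _ in range(M):                          # k = 1, ..., M
        g = pred - y                            # gradients (hessians are all 1)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, -g)                         # grow a tree on the negative gradient
        pred += shrinkage * tree.predict(X)     # apply shrinkage and update ŷ(k)
        trees.append(tree)
    return np.mean(y), trees

def gbdt_predict(X, base, trees, shrinkage=0.1):
    # prediction is the initial guess plus the shrunken sum of tree responses
    return base + shrinkage * sum(t.predict(X) for t in trees)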

The algorithm for growing the tree:

  • Generate a bootstrap training set if required (stochastic gradient boosting).
  • Start from the tree with depth 0.
  • For each leaf node in the tree:
    – Choose a subset of features for split finding if required (stochastic gradient boosting)
    – Find the best split that maximizes the gain

      Gain = 1/2 [ G_left²/(H_left + λ) + G_right²/(H_right + λ) − (G_left + G_right)²/(H_left + H_right + λ) ] − γ

      where G_left, H_left (G_right, H_right) are the sums of gᵢ and hᵢ over the observations the candidate split sends to the left (right) child. (A small helper computing this gain is sketched after the list.)

  • Stop when the termination criteria are met.6
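A tiny helper makes the gain computation explicit. This is an illustrative sketch; the argument names are made up and it is not Intel DAAL code.

def split_gain(G_left, H_left, G_right, H_right, lam=1.0, gamma=0.0):
    # G_*, H_* are the sums of gradients g_i and hessians h_i routed to each child
    def leaf_score(G, H):
        return G * G / (H + lam)
    return 0.5 * (leaf_score(G_left, H_left) + leaf_score(G_right, H_right)
                  - leaf_score(G_left + G_right, H_left + H_right)) - gamma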

Prediction Flow

Once the trees are built in the training phase, we can use them to make predictions for a given set of queries. The classification algorithm computes the sum of the responses of all trees for each class, and the class with the maximal response value (highest class probability) determines the prediction. The regression algorithm returns the aggregated (summed) tree responses as the final result for a given sample.
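As a rough sketch of that rule (illustrative code, not the Intel DAAL implementation), assume trees_per_class[c] holds the trees built for class c as callables:

import numpy as np

def predict_class(x, trees_per_class):
    # sum the responses of all trees for each class, then pick the largest
    scores = [sum(tree(x) for tree in trees) for trees in trees_per_class]
    return int(np.argmax(scores))

def predict_regression(x, trees):
    # regression: the aggregated (summed) tree responses are the prediction
    return sum(tree(x) for tree in trees)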

The GBDT Implementation in Intel® DAAL

The GBDT algorithm is computationally expensive, especially with continuous features and large datasets, but Intel DAAL provides highly tuned implementations for classification and regression. For maximum performance, it uses vectorization and multiple levels of parallelization in tree construction and prediction.

The training algorithms implemented in Intel DAAL support two kinds of split calculation modes to build a tree:

  1. Exact: All possible split values are examined when searching for the best split for a feature.
  2. Inexact: Continuous features are bucketed into discrete bins, and the possible splits are restricted to the bucket borders only. Bucketing reduces the number of splits to evaluate at each step, making computation faster (see the sketch below).
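The sketch below (plain NumPy, not Intel DAAL code) shows the difference between the two modes for one continuous feature: in exact mode every distinct value is a candidate threshold, while in inexact mode only the borders of a fixed number of quantile buckets are considered.

import numpy as np

x = np.random.rand(100000)                        # one continuous feature

# Exact mode: every distinct feature value is a candidate split threshold
exact_candidates = np.unique(x)[:-1]

# Inexact mode: bucket the feature into, e.g., 256 quantile bins and use
# only the bucket borders as candidate thresholds
borders = np.quantile(x, np.linspace(0.0, 1.0, 257))[1:-1]
inexact_candidates = np.unique(borders)

print(len(exact_candidates), len(inexact_candidates))   # ~100000 vs. 255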

The sample code in Figure 2 shows how to train a GBDT model with Intel DAAL.

# pydaal imports (module paths follow the Intel DAAL Python examples)
from daal.algorithms import classifier
from daal.algorithms.gbt.classification import prediction, training
from daal.data_management import (
    FileDataSource, DataSourceIface, HomogenNumericTable,
    MergedNumericTable, NumericTableIface
)

trainDatasetFileName = 'df_classification_train.csv'
nFeatures = 30
nClasses = 6

# Gradient boosted trees parameters
maxIterations = 50
minObservationsInLeafNode = 1

# Model object for the gradient boosted trees classification algorithm
model = None
predictionResult = None
testGroundTruth = None

def trainModel():
    global model

    # Initialize FileDataSource<CSVFeatureManager> to retrieve the input data from a .csv file
    trainDataSource = FileDataSource(
        trainDatasetFileName,
        DataSourceIface.notAllocateNumericTable,
        DataSourceIface.doDictionaryFromContext)

    # Create Numeric Tables for training data and labels
    trainData = HomogenNumericTable(nFeatures, 0, NumericTableIface.notAllocate)
    trainGroundTruth = HomogenNumericTable(1, 0, NumericTableIface.notAllocate)
    mergedData = MergedNumericTable(trainData, trainGroundTruth)

    # Retrieve the data from the input file
    trainDataSource.loadDataBlock(mergedData)

    # Create an algorithm object to train the gradient boosted trees classification model
    algorithm = training.Batch(nClasses)
    algorithm.parameter().maxIterations = maxIterations
    algorithm.parameter().minObservationsInLeafNode = minObservationsInLeafNode
    algorithm.parameter().featuresPerNode = nFeatures

    # Pass the training data set and dependent values to the algorithm
    algorithm.input.set(classifier.training.data, trainData)
    algorithm.input.set(classifier.training.labels, trainGroundTruth)

    # Train the model and retrieve the results of the training algorithm
    trainingResult = algorithm.compute()
    model = trainingResult.get(classifier.training.model)

Figure 2 – Training a GBDT model

For training parameters like splitMethod, maxIterations, maxTreeDepth, shrinkage, and their default values, see Usage Model: Training and Prediction.
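For instance, a few of these parameters can be overridden on the algorithm object before compute() is called; the values below are arbitrary examples, not recommended settings.

algorithm = training.Batch(nClasses)
algorithm.parameter().maxIterations = 100     # number of boosting iterations (trees)
algorithm.parameter().maxTreeDepth = 6        # limit the depth of each tree
algorithm.parameter().shrinkage = 0.1         # learning rate applied to each tree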

The sample code in Figure 3 demonstrates how to use the trained GBDT model to make predictions.

testDatasetFileName = 'df_classification_test.csv'

def testModel():
    global testGroundTruth, predictionResult

    # Initialize FileDataSource<CSVFeatureManager> to retrieve the test data from a .csv file
    testDataSource = FileDataSource(
        testDatasetFileName,
        DataSourceIface.notAllocateNumericTable,
        DataSourceIface.doDictionaryFromContext)

    # Create Numeric Tables for testing data and labels
    testData = HomogenNumericTable(nFeatures, 0, NumericTableIface.notAllocate)
    testGroundTruth = HomogenNumericTable(1, 0, NumericTableIface.notAllocate)
    mergedData = MergedNumericTable(testData, testGroundTruth)

    # Retrieve the data from input file
    testDataSource.loadDataBlock(mergedData)

    # Create algorithm objects for gradient boosted trees classification prediction
    algorithm = prediction.Batch(nClasses)

    # Pass the testing data set and trained model to the algorithm
    algorithm.input.setTable(classifier.prediction.data, testData)
    algorithm.input.setModel(classifier.prediction.model, model)

    # Compute prediction results and retrieve algorithm results
    # (Result class from classifier.prediction)
    predictionResult = algorithm.compute()

Figure 3 – Using the trained GBDT model to make predictions
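Once both functions have run, the predicted labels can be read from the prediction result. The result ID classifier.prediction.prediction used below is the one used in the pydaal samples; treat it as an assumption and check the prediction Result class in the Developer Guide.

trainModel()
testModel()
predictedLabels = predictionResult.get(classifier.prediction.prediction)  # a NumericTable of labels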

Performance Considerations and Results

XGBoost* is a well-known library designed and optimized for tree boosting.8 It's been integrated into various ML software stacks and provides native interfaces for C++, R, Python*, Julia*, and Java* users. We compared Intel DAAL and XGBoost. Note that this performance comparison represents a snapshot in time; both packages are being actively updated. Table 1 shows the performance advantage of Intel DAAL 2019 beta versus XGBoost 1.6. The results were measured for classification on four data sets: HIGGS, MNIST, Letter, and Isolet from the UCI ML repository.

  • HIGGS: A classification problem to distinguish between a signal process which produces Higgs bosons and a background process which does not. It has 1M samples, each with 28 features.
  • Letter: A database of character image features. The goal is to identify the 26 English capital letters. It has 16,000 samples, each with 16 features.
  • Isolet: A simple classification task to predict which letter name was spoken. It has 7,797 samples, each with 617 features.
  • MNIST: A standard classification task to identify handwritten digits. The training set consists of 60,000 images and the test set of 10,000 images. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in fixed-size images.

The tests use 50 trees, with default Intel DAAL parameters (inexact split mode) and XGBoost's approximate split method. The results show that Intel DAAL training is, on average, 2.8x faster, and up to 6.5x faster, while inference is about 1.2x faster.
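For reference, one plausible way to configure a comparable XGBoost run is sketched below. This is not the benchmark script used for Table 1, and the file layout (features in the first nFeatures columns, label in the last) is an assumption carried over from Figure 2.

import numpy as np
import xgboost as xgb

data = np.loadtxt('df_classification_train.csv', delimiter=',')
dtrain = xgb.DMatrix(data[:, :nFeatures], label=data[:, nFeatures])

params = {'objective': 'multi:softmax', 'num_class': nClasses,
          'tree_method': 'approx'}                         # approximate split finding
booster = xgb.train(params, dtrain, num_boost_round=50)    # 50 trees, as in Table 1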

Table 1. Comparison of Intel DAAL and XGBoost

                 HIGGS    MNIST    Letter   Isolet
Training (s)
  XGBoost         22.6    119.5     3.9     15
  Intel DAAL      10.5     95.7     0.6      9.98
  Speedup          2.15x    1.25x   6.5x     1.5x
Inference (ms)
  XGBoost        161       29.7    26.7     24
  Intel DAAL     150       25.7    31       15
  Speedup          1.07x    1.16x   0.86x    1.6x

 

Boosting Performance

Intel DAAL contains a collection of ML algorithms optimized for Intel® processors. It recently added gradient-boosted trees classification and regression algorithms to its repertoire. Compared to the popular XGBoost library, Intel DAAL gradient boosting achieved better performance for both model training and inference.

References

  1. Jerome H. Friedman. “Greedy Function Approximation: A Gradient Boosting Machine,” Annals of Statistics, 29(5), 2001.
  2. Kaggle: The Home of Data Science and Machine Learning
  3. Developer Guide for Intel® Data Analytics Acceleration Library 2018 Update 3
  4. XGBoost on GitHub
  5. Intel® Data Analytics Acceleration Library Decision Forest
  6. Tianqi Chen. “Introduction to Boosted Trees,” University of Washington, 2014.
  7. Tianqi Chen and Carlos Guestrin. “XGBoost: A Scalable Tree Boosting System,” 22nd SIGKDD Conference on Knowledge Discovery and Data Mining, 2016.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

For more complete information about compiler optimizations, see our Optimization Notice.