Solving Real-World Machine Learning Problems
Machine learning plays an important and growing role in the fields of statistics, data mining, and artificial intelligence. With the rapid growth of data, there are good reasons to believe that learning from data will become even more pervasive—and a necessary ingredient for future business growth. At the same time, choosing the right algorithms and libraries to solve a given problem depends on many factors, including:
- Class of problem (e.g., classification, regression)
- Input data
- Required performance
- Prediction accuracy
- Model interpretability
This creates barriers for the wider adoption of machine learning, which requires varied skill sets.
In this article, we discuss criteria you can use to select appropriate algorithms, based on two real-world machine learning problems taken from Kaggle, the well-known platform for predictive modeling and analytics competitions where data miners compete to produce the best models. We'll use algorithm implementations from:
- Scikit-learn*, the most popular library among Python* data scientists,
- R, the language of data analytics and statistical computing, and
- Intel® Data Analytics Acceleration Library (Intel® DAAL), a performance library that provides optimized building blocks for data analysis and machine learning on Intel® platforms.
Real-world machine learning usually has high CPU and memory requirements, which makes Intel® Xeon Phi™ processors an ideal platform. Intel DAAL provides a quick way of building machine learning applications optimized for Intel® Xeon® and Intel Xeon Phi processors. We will demonstrate how to use KNN (K-nearest neighbors), boosting, and support vector machines (SVM) with Intel DAAL on two real-world machine learning problems, both from Kaggle (Leaf Classification and Titanic: Machine Learning from Disaster), and compare the results with the same algorithms from scikit-learn and R.
Why Kaggle?
Kaggle is a platform for predictive modeling and analytics competitions in which companies and researchers post their data, and statisticians and data miners from all over the world compete to produce the best models [1]. As of May 2016, Kaggle had more than 536,000 registered users, or “Kagglers.” Spanning 194 countries, the community is one of the largest and most diverse in the world. Kaggle has run over 200 data science competitions since it was founded.
The goal of each competition is to produce the best model for a given real-world problem. The model is often evaluated by its prediction accuracy on a test data set. You can evaluate your model in an instant by submitting your predictions for the test data set and seeing your result on a leaderboard.
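As a minimal sketch of what "prediction accuracy on a test data set" means, the function below counts the fraction of matching labels. The function name and the toy labels are illustrative assumptions; each competition defines its own scoring metric, so this is not Kaggle's scoring code.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty label list")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy labels, invented for illustration.
y_true = [1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 1, 0]
score = accuracy(y_true, y_pred)  # 3 of the 5 predictions match
```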
We will evaluate the models, produced by different algorithms and libraries, to see how they perform in Kaggle competitions.
Leaf Classification
There are nearly half a million species of plants in the world. Classifying species has been historically problematic, often resulting in duplicate identifications. This Kaggle challenge is to accurately identify 99 species of plants using leaf images and extracted features (e.g., shape, margin, and texture) to train a classifier. The training data contains 990 leaf images, and the test data contains 594 images (Figure 1). Three sets of features are also provided per image: a shape contiguous descriptor, an interior texture histogram, and a fine-scale margin histogram (Figure 2). For each feature, a 64-attribute vector is given per leaf sample.
One approach is to apply a list of machine learning algorithms to the training data, evaluate their accuracy on validation data, and find optimal algorithms and hyperparameters. Scikit-learn, the most popular machine learning library among Python data scientists, provides a wide range of algorithms. In the Kaggle kernel, we analyzed the prediction accuracy of 10 algorithms. The linear discriminant analysis and KNN algorithms proved to be the best on validation data (see the Kaggle kernel for detailed results).
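The selection loop described above can be sketched with the standard library alone. This is a hedged illustration, not the code from the Kaggle kernel: the brute-force `knn_predict` helper, the toy data (including one deliberately mislabeled training point), and the candidate values of k are all assumptions made for the example.

```python
import math
from collections import Counter


def knn_predict(X_train, y_train, x, k):
    """Majority vote among the k training points closest to x (brute force)."""
    nearest = sorted(
        zip(X_train, y_train),
        key=lambda pair: math.dist(pair[0], x),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def validation_accuracy(X_train, y_train, X_val, y_val, k):
    """Share of validation samples that KNN with this k labels correctly."""
    hits = sum(knn_predict(X_train, y_train, x, k) == y
               for x, y in zip(X_val, y_val))
    return hits / len(y_val)


# Toy 2-D data, invented for illustration. The last training point is
# deliberately mislabeled to mimic noise.
X_train = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
           (1.0, 1.0), (0.9, 1.1), (1.1, 0.9),
           (0.06, 0.1)]
y_train = ["a", "a", "a", "b", "b", "b", "b"]
X_val = [(0.05, 0.1), (0.95, 1.0)]
y_val = ["a", "b"]

# Score every candidate hyperparameter on the validation data, keep the best.
scores = {k: validation_accuracy(X_train, y_train, X_val, y_val, k)
          for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
```

With k=1 the noisy neighbor decides the vote for the first validation point, while larger k values outvote it. The same loop extends to a list of different algorithms rather than different values of k.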
Here is the linear discriminant analysis in Python (scikit-learn):
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
clf = LinearDiscriminantAnalysis()
clf.fit(X_train, y_train)
test_predictions = clf.predict(X_test)
And KNN in Python (scikit-learn):
from sklearn.neighbors import KNeighborsClassifier
# The number of neighbors is set with n_neighbors
clf = KNeighborsClassifier(n_neighbors=4)
clf.fit(X_train, y_train)
test_predictions = clf.predict(X_test)
In R, you can also apply linear discriminant analysis and KNN.
library(MASS)
r <- lda(formula = Species ~ ., data = train)
plda = predict(object = r, newdata = test)
test_predictions = plda$class
KNN in R:
library(class)
test_predictions = knn(X_train, X_test, y_train, k=4)
Intel DAAL provides a scalable version of KNN [2] that uses the KD-tree algorithm and low-level optimizations to make it extremely fast on Intel® architectures while also providing better accuracy.
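To give a feel for why a KD-tree speeds up neighbor search, here is a minimal single-neighbor sketch in plain Python. It is an illustrative assumption about the general technique, not Intel DAAL's implementation: the dictionary-based node layout, the function names, and the sample points are all invented for the example.

```python
import math


def build_kdtree(points, depth=0):
    """Recursively split the points at the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def nearest(node, target, best=None):
    """Return the stored point closest to target, pruning distant subtrees."""
    if node is None:
        return best
    if best is None or math.dist(node["point"], target) < math.dist(best, target):
        best = node["point"]
    diff = target[node["axis"]] - node["point"][node["axis"]]
    if diff < 0:
        near, far = node["left"], node["right"]
    else:
        near, far = node["right"], node["left"]
    best = nearest(near, target, best)
    # Visit the far side only if the splitting plane is closer than the best match.
    if abs(diff) < math.dist(best, target):
        best = nearest(far, target, best)
    return best


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
closest = nearest(tree, (9, 2))
```

The pruning test is what saves work: whole subtrees are skipped when they cannot contain a closer point. A k-neighbor classifier would keep the k best candidates instead of one and take a majority vote over their labels.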
KNN training stage in Python (Intel DAAL):
from daal.algorithms.kdtree_knn_classification import training, prediction
from daal.algorithms import classifier, kdtree_knn_classification
trainAlg = kdtree_knn_classification.training.Batch()
trainAlg.input.set(classifier.training.data, X_train)
trainAlg.input.set(classifier.training.labels, y_train)
trainAlg.parameter.k = 4
trainingResult = trainAlg.compute()
KNN prediction stage in Python (Intel DAAL):
predictAlg = kdtree_knn_classification.prediction.Batch()
predictAlg.input.setTable(classifier.prediction.data, X_test)
predictAlg.input.setModel(classifier.prediction.model,
                          trainingResult.get(classifier.training.model))
predictAlg.compute()
predictionResult = predictAlg.getResult()
test_predictions = predictionResult.get(classifier.prediction.prediction)
Figures 3 and 4 show performance comparison graphs. For details on system configurations used for benchmarking, see Configurations and Tools Used at the end of this article.
Figure 5 shows accuracy comparison graphs:
It is possible to improve KNN accuracy if we apply it to data with fewer dimensions. According to statistical decision theory, if we know the conditional (discrete) distribution $P(G \mid X)$, where $G$ is the label to predict, and we use the 0-1 loss function, then we predict $\hat{G}(x) = G_k$ if $P(G_k \mid X = x) = \max_{g \in \mathcal{G}} P(g \mid X = x)$, where $\mathcal{G}$ is the set of possible labels. KNN classification assumes that $P(G_k \mid X = x)$ is constant in the neighborhood of $x$. The larger the number of dimensions, the larger the neighborhood of $x$ must be to contain $k$ training samples. In this problem, we do not have a large number of samples, so settling for the neighborhood as a surrogate for conditioning will fail miserably. The convergence still holds, but the rate of convergence decreases as the dimension increases. See Sections 2.4 and 2.5 of The Elements of Statistical Learning [3] for a more detailed explanation.
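A small calculation illustrates how quickly the neighborhood grows. The sketch assumes data spread uniformly over a unit hypercube, which is the simplifying setup used in the textbook discussion and not a property of the leaf data. Under that assumption, a cube-shaped neighborhood that captures a given fraction of the data has an edge length equal to that fraction raised to the power 1/dimension. The figure of 192 dimensions comes from the three 64-attribute feature vectors per leaf.

```python
def edge_length(fraction, dim):
    """Edge of a sub-cube holding `fraction` of uniform data in [0, 1]^dim."""
    return fraction ** (1.0 / dim)


# How wide must a neighborhood be to capture 1% of the data?
# 192 = 3 feature groups x 64 attributes in the leaf data set.
for dim in (1, 2, 10, 192):
    print(dim, round(edge_length(0.01, dim), 3))
```

In one dimension the neighborhood covers 1% of the axis, but in high dimensions it has to span almost the full range of every feature, so it is no longer local.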
We can improve KNN accuracy by preprocessing the original data using the linear discriminant analysis (LDA) algorithm. In our approach, we preprocessed input data with LDA (nComponents=40) and trained the KNN model on the preprocessed data.
Preprocessing with LDA (Python):
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(n_components=40)
X_train_reduced = lda.fit_transform(X_train, y_train)
X_test_reduced = lda.transform(X_test)
KNN in Python (scikit-learn):
from sklearn.neighbors import KNeighborsClassifier
# The number of neighbors is set with n_neighbors
clf = KNeighborsClassifier(n_neighbors=4)
clf.fit(X_train_reduced, y_train)
test_predictions = clf.predict(X_test_reduced)
Preprocessing with LDA (R):
library(MASS)
r <- lda(formula = Species ~ ., data = train)
plda = predict(object = r, newdata = train)
X_train_reduced = plda$x
plda = predict(object = r, newdata = test)
X_test_reduced = plda$x
KNN in R (class):
library(class)
test_predictions = knn(X_train_reduced, X_test_reduced, y_train, k=4)
KNN training stage in Python (Intel DAAL):
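from daal.algorithms.kdtree_knn_classification import training, prediction
from daal.algorithms import classifier, kdtree_knn_classification
trainAlg = kdtree_knn_classification.training.Batch()
# Train on the LDA-reduced data instead of the original features
trainAlg.input.set(classifier.training.data, X_train_reduced)
trainAlg.input.set(classifier.training.labels, y_train)
trainAlg.parameter.k = 4
trainingResult = trainAlg.compute()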
KNN prediction stage in Python (Intel DAAL):
predictAlg = kdtree_knn_classification.prediction.Batch()
predictAlg.input.setTable(classifier.prediction.data, X_test_reduced)
predictAlg.input.setModel(classifier.prediction.model,
                          trainingResult.get(classifier.training.model))
predictAlg.compute()
predictionResult = predictAlg.getResult()
test_predictions = predictionResult.get(classifier.prediction.prediction)
Figures 6, 7, and 8 show the result of applying data preprocessing.
Obviously, feature engineering plays a key role in machine learning, and good feature selection is critical to achieving accurate predictions. Moreover, Intel DAAL achieves the best accuracy and performance among the libraries tested.
Titanic: Machine Learning from Disaster
Another Kaggle competition is based on the sinking of the RMS Titanic, one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1,502 out of 2,224 passengers and crew members. This sensational tragedy shocked the world and led to better safety regulations for ships. The challenge here is to analyze the different classes of passengers and crew and predict who among them survived the tragedy. The input data contains the features shown in Figure 9. See the data overview on Kaggle for details.
One approach is to preprocess the data into informative feature vectors that can be used to train the machine learning models, and then to try several classifiers on the preprocessed data to find out which algorithms perform best. In this Kaggle kernel, feature engineering is performed. The following features were constructed from the original ones (see Figure 10):
- Passenger class
- Sex
- Age (transformed with feature binning)
- Passenger fare (transformed with feature binning)
- Port of embarkation
- Is alone (true if the person has no siblings, spouse, children, or parents on the Titanic)
- Title (Mrs./Miss/Mr./Master)
Then, 10 algorithms from scikit-learn were tested and their prediction accuracy was compared. The SVM classifier with the Gaussian kernel gave the best accuracy (see the Kaggle kernel for detailed results). The SVM parameters were obtained with cross-validation.
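Three of the constructed features (title, is alone, and binned values) can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the code from the Kaggle kernel: the column names (`Name`, `Age`, `SibSp`, `Parch`, `Fare`), the sample record, and especially the bin edges are hypothetical choices made for the example.

```python
import re


def extract_title(name):
    """Pull the honorific (such as 'Mr' or 'Miss') from 'Last, Title. First'."""
    match = re.search(r",\s*([^.]+)\.", name)
    return match.group(1).strip() if match else "Unknown"


def is_alone(siblings_spouses, parents_children):
    """True if the passenger travels with no relatives on board."""
    return siblings_spouses + parents_children == 0


def bin_index(value, upper_edges):
    """Index of the first bin whose upper edge is >= value (edges ascending)."""
    for index, edge in enumerate(upper_edges):
        if value <= edge:
            return index
    return len(upper_edges)


# Hypothetical bin edges, chosen only for illustration.
AGE_EDGES = [16, 32, 48, 64]
FARE_EDGES = [8, 15, 31]

# A sample record; the column names are assumed to follow the Kaggle data.
passenger = {"Name": "Braund, Mr. Owen Harris", "Age": 22,
             "SibSp": 1, "Parch": 0, "Fare": 7.25}

features = {
    "title": extract_title(passenger["Name"]),
    "is_alone": is_alone(passenger["SibSp"], passenger["Parch"]),
    "age_bin": bin_index(passenger["Age"], AGE_EDGES),
    "fare_bin": bin_index(passenger["Fare"], FARE_EDGES),
}
```

Binning turns a continuous value into a small set of categories, which can make a classifier less sensitive to outliers such as unusually high fares.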
SVM in Python (scikit-learn):
from sklearn.svm import SVC
clf = SVC(C=5, gamma=1.5)
clf.fit(X_train, y_train)
test_predictions = clf.predict(X_test)
SVM in R:
library(e1071)
model <- svm(X_train, y_train, gamma=1.5, cost=5)
test_predictions <- predict(model, X_test)
SVM with the Gaussian kernel involves a lot of time-consuming exponential computations. In Intel DAAL, these computations are highly optimized for Intel architectures, enabling us to quickly create an SVM model.
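To show where the exponential computations come from, here is a minimal plain-Python sketch of the Gaussian kernel and the full kernel matrix. It uses the gamma form, exp(-gamma * ||x - y||^2). The `gamma_from_sigma` helper assumes the common sigma form, exp(-||x - y||^2 / (2 * sigma^2)). Whether a given library uses that exact form is an assumption to check against its documentation before reusing one number for both parameters. The function names and sample points are invented for illustration.

```python
import math


def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel in the gamma form: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def kernel_matrix(X, gamma):
    """Full n x n kernel matrix: one exponential per pair of samples."""
    return [[rbf_kernel(xi, xj, gamma) for xj in X] for xi in X]


def gamma_from_sigma(sigma):
    """Convert sigma to gamma, assuming exp(-||x - y||^2 / (2 * sigma^2))."""
    return 1.0 / (2.0 * sigma ** 2)


X = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K = kernel_matrix(X, gamma=1.5)
```

The matrix has n squared entries, so the number of exponentials grows quadratically with the number of training samples. That growth is why vectorized, hardware-tuned implementations matter here.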
SVM training stage in Python (Intel DAAL):
from daal.algorithms.svm import prediction, training
# svm is imported as well because the code refers to svm.training below
from daal.algorithms import kernel_function, classifier, svm
import daal.algorithms.kernel_function.rbf
trainAlg = svm.training.Batch()
trainAlg.input.set(classifier.training.data, X_train)
trainAlg.input.set(classifier.training.labels, y_train)
kernel = kernel_function.rbf.Batch()
kernel.parameter.sigma = 1.5
trainAlg.parameter.C = 5
trainAlg.parameter.kernel = kernel
trainAlg.parameter.cacheSize = 60000000
trainingResult = trainAlg.compute()
SVM prediction stage in Python (Intel DAAL):
predictAlg = svm.prediction.Batch()
predictAlg.input.setTable(classifier.prediction.data, X_test)
predictAlg.input.setModel(classifier.prediction.model,
                          trainingResult.get(classifier.training.model))
predictAlg.parameter.kernel = kernel
predictAlg.compute()
predictionResult = predictAlg.getResult()
test_predictions = predictionResult.get(classifier.prediction.prediction)
Figures 11 and 12 show performance comparison graphs.
Figure 13 shows accuracy comparison graphs:
We see that Intel DAAL and scikit-learn produced the best accuracy and that Intel DAAL has the best performance.
We will now apply boosting classifiers to this classification problem. Boosting is one of the most powerful learning ideas introduced in the last 20 years. The idea behind boosting is to combine the outputs of many weak classifiers to produce a powerful committee [3]. We will consider the following boosting algorithms:
- AdaBoost*
- BrownBoost*
- LogitBoost*
- Gradient boosting
Numerous resources are available [4, 5, 6] with detailed explanations of these algorithms.
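The committee idea can be sketched with a small discrete AdaBoost that uses decision stumps (one-feature threshold rules) as weak classifiers. This is a plain-Python illustration of the general technique, assuming labels in {-1, +1}. It is not the implementation used by any of the libraries in this article, and the function names and toy data are invented for the example.

```python
import math


def train_stump(X, y, weights):
    """Find the one-feature threshold rule with the lowest weighted error."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for sign in (1, -1):
                preds = [sign if row[feature] <= threshold else -sign for row in X]
                error = sum(w for w, p, t in zip(weights, preds, y) if p != t)
                if best is None or error < best[0]:
                    best = (error, feature, threshold, sign)
    return best


def stump_predict(stump, row):
    _, feature, threshold, sign = stump
    return sign if row[feature] <= threshold else -sign


def adaboost_fit(X, y, rounds):
    """Discrete AdaBoost for labels in {-1, +1}; returns (alpha, stump) pairs."""
    n = len(X)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, weights)
        error = max(stump[0], 1e-10)  # guard against division by zero
        if error >= 0.5:
            break  # the weak learner is no better than chance
        alpha = 0.5 * math.log((1.0 - error) / error)
        ensemble.append((alpha, stump))
        # Raise the weight of misclassified samples, lower the rest, renormalize.
        weights = [w * math.exp(-alpha * t * stump_predict(stump, row))
                   for w, row, t in zip(weights, X, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def adaboost_predict(ensemble, row):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Toy 1-D data, invented for illustration; no single threshold separates it.
X = [(1,), (2,), (3,), (4,), (5,), (6,)]
y = [1, 1, -1, -1, 1, 1]
model = adaboost_fit(X, y, rounds=3)
predictions = [adaboost_predict(model, row) for row in X]
```

No single stump can label this data correctly, since the best one misclassifies a third of the samples. The weighted vote of three stumps reproduces every training label, which is the sense in which weak classifiers combine into a stronger committee.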
Python (scikit-learn), AdaBoost:
from sklearn.ensemble import AdaBoostClassifier
clf = AdaBoostClassifier(n_estimators=1000)
clf.fit(X_train, y_train)
test_predictions = clf.predict(X_test)
Python (Intel DAAL), AdaBoost (training):
from daal.algorithms.adaboost import prediction, training
from daal.algorithms import classifier
trainAlg = training.Batch()
trainAlg.input.set(classifier.training.data, X_train)
trainAlg.input.set(classifier.training.labels, y_train)
trainAlg.parameter.maxIterations = 1000
trainingResult = trainAlg.compute()
Python (Intel DAAL), AdaBoost (prediction):
predictAlg = prediction.Batch()
predictAlg.input.setTable(classifier.prediction.data, X_test)
predictAlg.input.setModel(classifier.prediction.model,
                          trainingResult.get(classifier.training.model))
predictAlg.compute()
predictionResult = predictAlg.getResult()
test_predictions = predictionResult.get(classifier.prediction.prediction)
Python (Intel DAAL), BrownBoost (training):
from daal.algorithms.brownboost import prediction, training
from daal.algorithms import classifier
trainAlg = training.Batch()
trainAlg.input.set(classifier.training.data, X_train)
trainAlg.input.set(classifier.training.labels, y_train)
trainAlg.parameter.maxIterations = 1000
trainingResult = trainAlg.compute()
Python (Intel DAAL), BrownBoost (prediction):
predictAlg = prediction.Batch()
predictAlg.input.setTable(classifier.prediction.data, X_test)
predictAlg.input.setModel(classifier.prediction.model,
                          trainingResult.get(classifier.training.model))
predictAlg.compute()
predictionResult = predictAlg.getResult()
test_predictions = predictionResult.get(classifier.prediction.prediction)
Python (Intel DAAL), LogitBoost (training):
# Import from the logitboost module, not brownboost
from daal.algorithms.logitboost import prediction, training
from daal.algorithms import classifier
trainAlg = training.Batch()
trainAlg.input.set(classifier.training.data, X_train)
trainAlg.input.set(classifier.training.labels, y_train)
trainAlg.parameter.maxIterations = 1000
trainingResult = trainAlg.compute()
Python (Intel DAAL), LogitBoost (prediction):
predictAlg = prediction.Batch()
predictAlg.input.setTable(classifier.prediction.data, X_test)
predictAlg.input.setModel(classifier.prediction.model,
                          trainingResult.get(classifier.training.model))
predictAlg.compute()
predictionResult = predictAlg.getResult()
test_predictions = predictionResult.get(classifier.prediction.prediction)
AdaBoost R (fastAdaBoost):
library(fastAdaBoost)
model <- adaboost(Survived ~ ., train, 1000)
pred <- predict(model, newdata=test)
LogitBoost R (caTools):
library(caTools)
model <- LogitBoost(X_train, y_train, nIter=1000)
pred <- predict(model, X_test)
Gradient boosting R (gbm):
library(gbm)
model <- gbm(Survived ~ ., data=train, n.trees = 1000, shrinkage = 1)
pred <- predict(model, test, n.trees = 1000)
Figure 14 shows the accuracy of different boosting algorithms.
As we see, the BrownBoost algorithm from Intel DAAL demonstrates the best prediction accuracy.
Solving Data Analytics Problems Using Machine Learning and Intel DAAL
Selecting an algorithm to solve a machine learning problem is nontrivial and requires a lot of thought. Libraries like Intel DAAL and scikit-learn provide a wide variety of machine learning algorithms, so you can choose the one that best suits your problem.
We demonstrated how you can use Intel DAAL to harness the power of Intel platforms for faster model training and prediction. Our benchmarks show that Intel DAAL has a performance advantage over the scikit-learn and R implementations while also producing more accurate models.
Configurations and Tools Used
System configurations used for benchmarking:
Intel® Xeon®:
Model name: Intel® Xeon® CPU E5-2699 v4 @ 2.20 GHz
Core(s) per socket: 22
Socket(s): 2
MemTotal: 256 GB
Intel® Xeon Phi™:
Model name: Intel® Xeon Phi™ Processor 000A @ 1.40 GHz
Core(s) per socket: 68
Socket(s): 1
RAM: 16 GB
Software tools used in this example:
- Intel® DAAL 2017 Beta update 2
- R version 3.3.2
- scikit-learn version 0.19.1
- Class package version 7.3-14
- MASS package version 7.3-45
- e1071 package version 1.6-8
- fastAdaBoost package version 1.0.0
- caTools package version 1.17.1
- gbm package version 2.1.1
References
1. Overview of Kaggle on Wikipedia.
2. Patwary, Md. Mostofa Ali; Satish, Nadathur Rajagopalan; Sundaram, Narayanan; Liu, Jialin; Sadowski, Peter; Racah, Evan; Byna, Suren; Tull, Craig; Bhimji, Wahid; Prabhat, Dubey, Pradeep. 2016. “PANDA: Extreme Scale Parallel K-Nearest Neighbor on Distributed Architectures,” IEEE International Parallel and Distributed Processing Symposium.
3. Hastie, Trevor; Tibshirani, Robert; and Friedman, Jerome. 2009. The Elements of Statistical Learning, Second Edition. Springer International Publishing AG.
4. Freund, Yoav, and Schapire, Robert E. 1999. “A Short Introduction to Boosting,” Journal of Japanese Society for Artificial Intelligence, 14(5), pp. 771–780.
5. Friedman, Jerome; Hastie, Trevor; and Tibshirani, Robert. 2000. “Additive Logistic Regression: A Statistical View of Boosting,” The Annals of Statistics, 28(2), pp. 337–407.
6. Friedman, Jerome. 2001. “Greedy Function Approximation: A Gradient Boosting Machine,” The Annals of Statistics 29(5), pp. 1189–1232.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.