Dealing with Outliers

How to Find Fraudulent Transactions in a Real-World Dataset

One challenging―but also very important―task in data analytics is dealing with outliers. We generally define outliers as samples or events that are inconsistent with the rest of data population1. An outlier often contains useful information about abnormal characteristics of the systems and entities that impact the data generation process2.

Common applications for outlier detection algorithms include:

  • Intrusion detection systems
  • Credit card fraud
  • Interesting sensor events
  • Medical diagnosis

In this article, we’ll focus on one of the most common applications of outlier detection—credit card fraud. With some simple outlier detection approaches, it’s possible to find 75 to 85 percent of fraudulent transactions―with false alarms less than 1 percent of the time―on a real-world dataset.

Defining the Approaches

There are two basic approaches to detecting outliers:

  1. Supervised, where we use fraudulent and non-fraudulent examples to predict the class of a new observation
  2. Unsupervised, where we don’t have labeled examples and detect fraudulent examples as outliers or abnormalities

Breaking it down further, there are three basic supervised approaches:

  1. Neural networks
  2. SVM
  3. Logistic regression

Unsupervised approaches include:

  1. Statistical-based techniques like BACON* outlier detection
  2. Multivariate outlier detection
  3. Clustering-based techniques3, 4 like k-means, expectation-maximization (EM), and DBSCAN*5, an algorithm that detects noise in data that can be treated as outliers6.

We’ll compare these techniques and evaluate each methodology based on certain design criteria. Accuracy metrics will include both detection and false positive rates7.

It’s important to remember that fraud detection is about more than capturing fraudulent events. It also requires capturing these activities as quickly as possible—so an algorithm’s performance is an important metric8. In our test, we’ll use some popular machine learning libraries, including Intel® Data Analytics Acceleration Library (Intel® DAAL), that implement algorithms for outlier detection to compare results for both accuracy and performance.

Figure 1. Dataset attributes

Dataset Description

We used a dataset9 from Kaggle*, a platform for predictive modeling and analytics competitions in which companies and researchers post their data and statisticians and data miners from all over the world compete to produce the best models10. We chose the most popular real-world dataset on credit card fraud detection from Kaggle11. It contains 31 numerical input variables (Figure 1).

The time attribute represents seconds elapsed between each transaction and the first transaction in the dataset. Amount represents the transaction amount, as the name implies. The remaining 28 features are the principal components obtained with PCA. The original features are not provided due to confidentiality issues.

Class is the response variable. It equals 1 in case of fraud and 0 otherwise. The dataset is highly unbalanced because fraud transactions are far less common than normal ones. Frauds represent only 0.172 percent of all transactions (492 frauds out of 284,807 transactions).

Let’s take a look at conditional distributions of some of principal components (Figures 2 and 3). Principal components are not correlated, so dropping those with similar conditional distribution for fraudulent and normal examples could help us build more accurate models. We’ll use a point-biserial correlation coefficient12 to measure the relationship between each principal component and target variable (Table 1). We’ll use this information later for feature engineering.

Credit Card Fraud Detection Approaches

Now we’ll apply different outlier detection techniques to the data, evaluate their accuracy and performance, and compare them. For credit card fraud detection, we’ll use both supervised and unsupervised methods.

Figure 2. Attribute with similar distribution between two types of transactions

Figure 3. Attribute with different distribution between two types of transactions

Table 1. Point-biserial correlation coefficients for principal components

Unsupervised Approaches

Statistical approaches were the earliest algorithms used for outlier detection13. We’ll start from outlier-detection-specific algorithms―multivariate outlier detection and BACON* outlier detection. A great advantage of these methods is that they are so computationally efficient that it’s easy to apply them to large datasets14. We achieved the best results with six principal components with a high point-biserial correlation coefficient (V11, V12, V14, V16, V17, V18) as features. We used implementations from Intel DAAL and the R* package robustX*. Multivariate outlier detection gave better results for accuracy than BACON outlier detection (Table 2).

Table 2. Statistical-based outlier detection approaches

Clustering and outlier detection have a complementary relationship. In clustering, the goal is to partition the points into dense subsets. In outlier detection, the goal is to identify points that don’t seem to fit naturally into these dense subsets2. Some clustering algorithms, like DBSCAN*, provide sub-products, which can be considered outliers.

There are many clustering-based approaches to outlier detection2,3,4,15. We’ll apply some of them in our tests. Clustering is inherent in credit card fraud because perpetrators usually produce a group of fraudulent transactions16.

There are two types of outliers, local and global. Different clustering-based algorithms can detect both types. The attribute values of a global outlier are outlying with respect to the values taken by the majority of the data points. The attribute values of a local outlier are extreme when compared to those of its neighbors17.

To detect global outliers, we can perform clustering on a dataset and define some clusters as outliers18 (Table 3). We’ll perform clustering with k-means and EM, treating small clusters as outliers19. Values of k and n_components were obtained by grid search from range [2, 3, …, 50]. We used the same initial centroids for k-means for Intel DAAL and Scikit-learn*, obtaining them with k-means++ initialization using the implementation in Intel DAAL. We achieved the best results using 17 principal components with high point-biserial correlation coefficients as features. With these methods, we achieved a relatively low false positive rate.

Table 3. Clustering-based methods for global outlier detection

We’ll also apply clustering techniques that can detect local outliers20. A naïve approach is to cluster the transactions using the k-means algorithm21 and treat points far from cluster centers as outliers (Table 4).

We obtained a good rate of detections. However, the number of false alarms is high. The reason is that the k-means algorithm is extremely sensitive to outliers, which have a large impact on cluster configuration. As a result, data points that should be declared as outliers are masked by the clustering. Thus, we need to use a robust version of k-means that can gracefully handle the presence of outliers―k,l-means15 (Table 5). Like k-means, this algorithm requires initial centroids. We’ll compute them with different initialization algorithms available in Intel DAAL―random, k-means++22, and k-means||23―and compare the results.

Table 4. K-means-based method for local outliers detection

 

Table 5. K,l-means algorithm results

We can see that this approach provides better accuracy than naïve k-means. Results are different for different initializations. The random method proved to be the best for this problem.

We’ll also use a density-based clustering algorithm that can detect noise in data (Table 6). A point is detected as noise if it differs significantly from its neighbors. A natural assumption is to treat noise data as local outliers. We used the same six features as for statistical methods. Values of parameters were obtained by grid search. We used the R package dbscan*. We reached a good level of detected frauds with a moderate level of false alarms. However, the performance is significantly worse compared to previous approaches.

Table 6. Dbscan for outlier detection

Supervised Approaches

Supervised learning methods require training data that contain examples of each class—in our case, normal and fraudulent transactions2. We’ll divide our dataset into training and evaluation sets, with 80 percent of the data used for training and 20 percent used for model evaluation.

We start with the neural network (Table 7) proposed in the Kaggle kernel24. A simple multi-layer perceptron with three hidden layers was used by the kernel author. Multi-layered perceptrons (MLPs) are widely used neural networks for classification as well as outlier detection25.

Table 7. Neural network

Another supervised learning algorithm used in outlier detection is logistic regression16 (Tables 8 and 9). We’ll use 17 attributes with the highest point-biserial correlation coefficients (Table 1) and the Amount feature, which was pre-processed with Z-score normalization. From the histograms24, we can see that fraud and normal observation are linearly separable due to the use of logistic regression. Our dataset is unbalanced, so use of random undersampling could help to build a model that detects frauds more accurately. This omits many non-fraudulent examples from training, so we’ll use a voting classifier26, with each estimator trained on a randomly under-sampled training dataset. (The regularization parameter, C, was set to one after experimentation with a wide range of values.) With this technique, we improved the detection rate by 3 percent with almost the same level of false alarms.

Table 8. Logistic regression

Table 9. Ensemble of logistic regression models

Let’s see how support vector machines (SVM), another supervised learning algorithm which can also be used for outlier detection16, works on the same data. We run SVM with 100,000 iterations, a linear kernel, and C=1. (These parameters were obtained with a grid search.) The same features were used as for logistic regression. Use of SVM ensembles improves the detection rate significantly while giving a low false alarm rate (Table 10).

Evaluation of the Different Anomaly Detection Approaches

There are a variety of unsupervised approaches for fraud detection. We applied a few of them in this article. Outlier detection-specific algorithms, k,l-means and DBSCAN, achieved the best detection rates, reaching 85 percent of frauds detected, while use of k-means to detect clusters of fraudulent transactions gives the lowest number of false alarms. The sooner fraudulent transactions are detected, the more losses can be avoided by stopping transactions made with fraudulent credit cards16―making the performance of the algorithm another key metric. The final choice of fraud detection approach will depend on many factors, including the cost of the fraudulent behavior detected and the cost associated with stopping it7. Unsupervised methods can be useful in situations where there is no prior knowledge of classes of observations.

Table 10. SVM

When we have prior knowledge of classes of observations, we can use supervised learning algorithms. We used techniques for unbalanced classification, like undersampling, to help increase the number of frauds detected. Ensembles of SVM classifiers give the highest detection rates, while random forests have the lowest false alarm rates. Although supervised methods like SVM gave us significantly better accuracy compared to unsupervised ones, use of them can pose problems:

  • Accurately labeled training data might be prohibitively expensive to obtain25 and a data scientist has to be confident about the true classes of the training data used to build the models.
  • The time to train a model can be high depending on the number of samples and the number of features per sample.
  • Unlike unsupervised approaches, supervised learning cannot detect new types of fraud without retraining16,27.

As we’ve shown, choosing the right machine learning algorithm for a given application is a non-trivial problem that requires a lot of thought.

Configuration Information

Hardware:

  • Processor: Intel® Xeon® Platinum 8168 processor @ 2.70GHz
  • Core(s) per socket: 24
  • Socket(s): 2
  • Memory: 63 GB

Software

  • Intel® Data Analytics Acceleration Library 2018 Gold
  • R version 3.4.1
  • Scikit-learn version 0.19.1
  • robustX package 1.2-2
  • dbscan package 1.1-1

References

  1. Victoria Hodge, “Outlier and Anomaly Detection: A Survey of Outlier and Anomaly Detection Methods.”
  2. Charu C. Aggarwal, “Outlier Analysis,” Second Edition.
  3. Vaishali, “Fraud Detection in Credit Card by Clustering Approach.”
  4. Yogesh Bharat Sonawane, Akshay Suresh Gadgil, Aniket Eknath More, Niranjan Kamalakar Jathar, “Credit Card Fraud Detection Using Clustering Based Approach.”
  5. Ester, M., H. P. Kriegel, J. Sander, and X. Xu, “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise” in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR, AAAI Press, pp. 226-231. 1996.
  6. Samson Kiware, B.A, “Detection of Outliers in Time Series Data.”
  7. Elham Hormozi , Hadi Hormozi, Mohammad Kazem Akbari, Morteza Sargolzaei Javan, “Accuracy Evaluation of a Credit Card Fraud Detection System on Hadoop MapReduce.”
  8. Shraddha Ramesh Bhagwat, Vaishali Londhe, “A Review of Various Credit Card Detection Techniques.”
  9. Credit card fraud detection dataset.
  10. Overview of Kaggle on Wikipedia.
  11. Credit card fraud datasets on Kaggle.
  12. Point-biserial correlation coefficient on Wikipedia.
  13. Victoria J. Hodge and Jim Austin, “A Survey of Outlier Detection Methodologies.”
  14. Nedret Billor, Ali S. Hadi, Paul F. Velleman, “BACON: Blocked Adaptive Computationally Efficient Outlier Nominators.”
  15. Sanjay Chawla, Aristides Gionis, “k-means: A unified Approach to Clustering and Outlier Detection.”
  16. Sanjeev Jha, Montserrat Guillen, and J. Christopher Westland, “Employing Transaction Aggregation Strategy to Detect Credit Card Fraud.”
  17. Marie Ernst and Gentiane Haesbroeck, “Detection of Local and Global Outliers in Spatial Data.”
  18. Bin-mei Liang, “A Hierarchical Clustering Based Global Outlier Detection Method.”
  19. Yuan Wang, Xiaochun Wang, Xia Li Wang, “A Spectral Clustering Based Outlier Detection Technique.”
  20. Zengyou He, Xiaofei Xu, Shengchun Deng, “Discovering Cluster-Based Local Outliers.”
  21. Stuart P Lloyd, “Least Squares Quantization in PCM,” IEEE Transactions on Information Theory 1982, 28 (2): 1982pp: 129–137.
  22. Arthur, D. and Vassilvitskii, S, “k-means++: The Advantages of Careful Seeding,” Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics Philadelphia, PA, USA, 2007, pp. 1027-1035.
  23. B. Bahmani, B. Moseley, A. Vattani, R. Kumar, and S. Vassilvitskii, “Scalable K-means++,” Proceedings of the VLDB Endowment, 2012.
  24. Kaggle kernel, use of neural network for fraud detection.
  25. Varun Chandola, Arindam Banner Jee, and Vipin Kumar, “Outlier Detection : A Survey.”
  26. Gareth James, “Majority Vote Classifiers: Theory and Applications.”
  27. Bolton, R.J. and Hand, D.J., “Unsupervised Profiling Methods for Fraud Detection” in Conference on Credit Scoring and cCedit Control, Edinburgh, 2001.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

For more complete information about compiler optimizations, see our Optimization Notice.