JD Speeds Up Image Analysis By Replacing GPUs with CPUs

Chinese cloud service provider JD tried using GPUs to extract features from its vast database of product images. It found it could achieve 3.83x¹ the performance using its existing servers based on the Intel® Xeon® processor.

At a Glance:

  • Copying data between the storage and GPU analysis clusters took half of the total process time
  • Initial attempts at GPU acceleration were relatively difficult to develop and execute
  • JD used the BigDL library to run its Caffe* model on the Spark* clusters, based on the Intel® Xeon® processor E5 family
  • Performance increased by up to 3.83x¹

JD is a leading retailer and cloud service provider in China. Its catalog of product images holds a wealth of information that, if analyzed, could form the foundation for visual search or price comparison applications. When JD tried to use GPUs to analyze its huge database of merchandise images, it found them hard to manage, and found that copying data between the storage and GPU analysis clusters was time-consuming. Working closely with Intel, JD deployed the BigDL deep learning library on the Intel® Xeon® processor-based servers that store the images, and delivered a performance increase of up to 3.83x¹. This gives JD greater agility in using its product images as the foundation for new services.

Challenge

  • Enable image feature extraction from JD’s full catalog of hundreds of millions of products, across a wide range of categories including computing, toys, and clothing
  • Create a performance-efficient infrastructure for image analysis that scales with the vast and growing database
  • Establish a cloud analytics platform that would be easy to develop for, and could be used to create new image analysis applications

Solution

  • JD used BigDL to deploy its existing Caffe* model on the servers where the data is stored, based on the Intel® Xeon® processor E5 family
  • The infrastructure can be scaled out efficiently by adding new standard Intel® Xeon® processor servers
  • The Apache Hadoop* and Spark* frameworks handle resource management, making it easier to develop new applications in the future, while still remaining performance efficient

Results

  • Performance increased by up to 3.83x¹, compared to the alternative GPU-based solutions
  • JD established a platform for innovation that will be used to create new applications easily, for internal use and new public cloud services
  • JD lowered its total cost of ownership for the solution, by reusing its existing hardware estate for analysis, compared to operating a separate GPU cluster

Enabling Effective Image Analysis

For JD, the cloud is the foundation of its business. As well as being a leading retailer in China and providing a platform for others to sell online, the company has a public cloud offering. In some cases, the retail side of the business drives innovations, which can then be offered to public cloud customers.

With a huge catalog of products on sale in its retail site, the company has hundreds of millions of product images. These are stored in Apache HBase*, the distributed, big data store of the Hadoop framework. JD wanted to be able to match features in different product images. This capability could be used, for example, to enable a visual search feature, where customers could snap a photo of something they like, and JD could find something similar they can buy. Alternatively, it could be used to match its products with those on other websites, so that JD can price its products competitively.
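To make the idea of matching features concrete, here is a minimal sketch of a similarity lookup. It assumes each product image has already been reduced to a vector of floats, and it uses cosine similarity as the comparison; the source does not say which similarity measure or search method JD uses, so the functions, product names, and vectors below are purely illustrative.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two feature vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def most_similar(
    query: List[float], catalog: Dict[str, List[float]], top_k: int = 1
) -> List[Tuple[str, float]]:
    """Rank catalog items by similarity to the query and keep the top_k."""
    scored = [(pid, cosine_similarity(query, vec)) for pid, vec in catalog.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical three-dimensional features; real feature vectors are much longer.
catalog = {
    "red-sneaker": [0.9, 0.1, 0.0],
    "blue-sneaker": [0.7, 0.1, 0.6],
    "teapot": [0.0, 1.0, 0.2],
}
query = [0.8, 0.2, 0.1]  # features of a customer's photo (made up)

for product_id, score in most_similar(query, catalog, top_k=2):
    print(f"{product_id}: {score:.3f}")
```

A brute-force scan like this is only workable for small catalogs. At hundreds of millions of images, some form of indexed or approximate search would likely be needed, which is outside the scope of this sketch.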

The team at JD had tried to build the feature matching application using graphics processing units (GPUs), but had found it difficult to scale GPUs adequately to handle the database. JD had tried using both multi-GPU servers, and a GPU cluster. In the cluster setting, JD had encountered frequent out of memory errors and program crashes due to insufficient GPU memory. The resource management and allocation of individual GPU cards in the cluster proved to be complex and error-prone. For multi-GPU servers, JD’s developers had to manually manage data partitioning, task balancing, and fault tolerance. There were also a lot of dependencies, such as CUDA*, which made production deployment difficult.

With the image processing carried out on GPUs, there was also latency resulting from the time taken to copy the data from HBase to the GPUs for analysis, and to copy the results back again. This part of the process was responsible for about half of the total time taken for feature extraction. Image preprocessing was also challenging because the GPU setup had no software framework to support resource management, data processing and fault tolerance.
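A rough back-of-the-envelope model shows why the copy step matters. The sketch below assumes, for illustration only, that copy time and compute time simply add up and that analyzing data in place removes the copy entirely. The function name is hypothetical, and the 0.5 fraction comes from the "about half" figure above rather than from a precise measurement.

```python
def speedup_from_removing_copy(copy_fraction: float) -> float:
    """Upper bound on speedup if the copy step disappears and nothing else changes.

    copy_fraction is the share of total run time spent copying data (0 <= f < 1).
    """
    if not 0.0 <= copy_fraction < 1.0:
        raise ValueError("copy_fraction must be in [0, 1)")
    return 1.0 / (1.0 - copy_fraction)


# "About half" of the total time was spent copying, per the text above.
estimate = speedup_from_removing_copy(0.5)
print(f"Estimated speedup from removing the copy step alone: {estimate:.2f}x")
```

Under these simplifying assumptions, removing the copy step on its own would roughly double throughput, so it is a significant cost but only one of the factors that affect overall performance.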

JD needed an infrastructure that would support the feature extraction pipeline for its database of images, in a way that would be scalable and sustainable.

Using BigDL for Scalable Deep Learning

JD used BigDL, a distributed deep learning library for Apache Spark*, to enable the feature extraction workload on CPUs, using servers based on the Intel® Xeon® processor E5-2650 v4. BigDL enables Scala or Python* to be used to create deep learning applications that are based on scalable Spark clusters. It can be scaled out to hundreds or thousands of servers. To enhance its performance, BigDL uses the Intel® Math Kernel Library (Intel® MKL) and parallel computing to take advantage of the capabilities of the Intel Xeon processor.

BigDL enabled JD to take the Caffe* model it had previously trained using its GPU estate, and redeploy it on its existing CPU architecture, where the images were stored. In JD’s application, the Single Shot MultiBox Detector (SSD) model is used to detect objects in the images, and the DeepBit model is then used to extract features from the objects.

Figure 1 – The workflow for feature extraction at JD, using BigDL to manage the SSD model for object detection, and the DeepBit model for feature extraction.

The workflow is as follows (see also Figure 1):

  1. Read hundreds of millions of pictures from HBase as resilient distributed datasets (RDD).
  2. Use BigDL to preprocess these images to prepare them for the SSD model (including resizing, normalization, and batching). BigDL provides an image preprocessing library based on OpenCV, which supports common transformations and augmentations.
  3. Load the Single Shot MultiBox Detector (SSD) model for large scale, distributed object detection on Spark. This generates the coordinates and confidence scores for the detected objects.
  4. Crop the original picture based on the coordinates of the object with the highest score.
  5. Preprocess the target images in preparation for the DeepBit model (including resizing and batching).
  6. Use BigDL to load the DeepBit model for distributed feature extraction of the target images on Spark. This generates the corresponding features as vectors of floats.
  7. Store the RDD of the extracted object features in the Hadoop Distributed File System* (HDFS*).
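The selection and cropping logic in step 4, and the batching mentioned in steps 2 and 5, can be sketched in plain Python. This is an illustration of the logic only, not BigDL or Spark code: the `Detection` record, the helper functions, and the toy image are all hypothetical, and a real pipeline would apply the same operations to each element of an RDD.

```python
from typing import Iterator, List, NamedTuple


class Detection(NamedTuple):
    """Hypothetical record for one detected object: pixel box plus confidence."""
    x_min: int
    y_min: int
    x_max: int
    y_max: int
    score: float


Image = List[List[int]]  # toy grayscale image: a list of pixel rows


def best_detection(detections: List[Detection]) -> Detection:
    """Step 4: keep the detection with the highest confidence score."""
    if not detections:
        raise ValueError("no objects detected")
    return max(detections, key=lambda d: d.score)


def crop(image: Image, box: Detection) -> Image:
    """Step 4: cut the box out of the image (x_max and y_max are exclusive)."""
    return [row[box.x_min:box.x_max] for row in image[box.y_min:box.y_max]]


def batches(items: list, batch_size: int) -> Iterator[list]:
    """Steps 2 and 5: group items into batches; the last one may be smaller."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Toy 4x4 image whose pixel value encodes its position (row * 10 + column).
image = [[r * 10 + c for c in range(4)] for r in range(4)]
detections = [Detection(0, 0, 2, 2, 0.40), Detection(1, 1, 3, 4, 0.90)]

target = crop(image, best_detection(detections))
print(target)
print([len(batch) for batch in batches([target] * 5, batch_size=2)])
```

Keeping only the highest-scoring object means each product image yields a single target image, so the later feature extraction step produces one feature vector per image.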

 

Using a highly parallel architecture with 1,200 logical cores, the process of reading the image data from the database was greatly accelerated, contributing to an overall performance increase of up to 3.83x¹.

The solution is based on JD’s existing CPU estate. It uses the Intel Xeon processor E5-2650 v4, running at 2.20GHz. Each server has 24 physical cores with Intel® Hyper-Threading Technology (Intel® HT) enabled, and is configured to support 50 logical cores using Apache Hadoop Yet Another Resource Negotiator* (YARN*), a cluster management technology. Using 24 servers, the solution has a total of 1,200 logical cores, offering a highly parallel workflow.
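The cluster totals follow directly from the per-server figures. The short calculation below simply reproduces that arithmetic; the variable names are illustrative, and the value of 50 logical cores per server is the YARN configuration stated above.

```python
SERVERS = 24
PHYSICAL_CORES_PER_SERVER = 24
YARN_LOGICAL_CORES_PER_SERVER = 50  # as configured in YARN, per the text above

total_logical_cores = SERVERS * YARN_LOGICAL_CORES_PER_SERVER
total_physical_cores = SERVERS * PHYSICAL_CORES_PER_SERVER

print(f"Logical cores available to YARN: {total_logical_cores}")
print(f"Physical cores in the cluster:   {total_physical_cores}")
```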

By using BigDL, JD was able to reuse the model it had previously trained using GPUs, on its existing servers based on the Intel Xeon processor. This lowered costs, compared to running a separate GPU cluster for feature extraction, because there is no need for additional GPU cards and the GPU servers are otherwise the same configuration as the CPU servers. Additionally, the CPU cluster processes the BigDL workload overnight, and is available to use on other tasks during the day, increasing utilization. The highly parallel data loading significantly cut the time taken for feature extraction, and JD benefited from the use of the Spark framework to manage resources, fault tolerance and task balancing.

With the increased performance, and the ability to easily scale out the solution by adding additional standard servers, JD can now handle larger data sets in its image analysis.

Technical Components of the Solution

  • BigDL. BigDL is a distributed deep learning library for Apache Spark*, which enables developers to write deep learning applications for Spark clusters using Scala or Python. It is open source and enables organizations to analyze data on the same Hadoop or Spark cluster that stores the data
  • Intel® Xeon® processor E5 family. Designed for architecting next-generation data centers, the Intel Xeon processor E5 family delivers versatility across diverse workloads in the data center or cloud
  • Intel® Math Kernel Library (Intel® MKL). Intel MKL optimizes code with minimal effort for future generations of Intel® processors. It features highly optimized, threaded and vectorized math functions that maximize performance on each processor family
  • Apache Spark*. Apache Spark is a fast engine for large-scale data processing, which can be used from Java, Scala, Python or R
  • Apache Hadoop*. The Apache Hadoop software library is a framework that enables distributed processing of large data sets across computing clusters. Its modules include the Hadoop Distributed File System (HDFS), which JD uses for storing the feature data extracted from images. Hadoop YARN offers a framework for job scheduling and cluster resource management on CPUs
  • Intel® Ethernet Server Adapter I350 and Intel® Ethernet Converged Network Adapter X710. For its networking requirements, JD uses Intel® adapters which address the demanding needs of the agile data center

Intel is a Close Ally

The new feature extraction capability was developed in collaboration with research and development engineers from Intel. JD and Intel have a long-standing relationship, which in previous years has focused on developing big data and analytics applications. Intel’s R&D team in China assists cloud service providers with enabling open source solutions, such as BigDL, and is able to bring the experience of working on many previous deployments.

“We had the challenge of how to build a large scale deep learning application, based on our big data cluster,” said Zhenhua Wang, Senior Software Engineer (Algorithm), JD. “Intel had a perfect match in the BigDL technology, and worked alongside us in implementing it. The Intel team brings expertise and experience that accelerates our time to market, and enables us to keep innovating.”

A Platform for Innovation

JD has established a platform that it can use to create new services based on image matching and feature extraction, and that it can use as a template as it develops other deep learning and artificial intelligence applications. The BigDL framework enables JD to use pretrained models from frameworks such as Caffe*, Torch* and TensorFlow* on general purpose hardware, enabling JD to test and launch new services more quickly and without investment in dedicated hardware. JD continues to apply BigDL to a wide range of deep learning applications, including distributed model training, for both internal applications and for cloud-based services. In its public cloud offering, JD has already launched a text classification model based on BigDL, which classifies articles by topic. JD continues to work closely with Intel on these and other new technology initiatives.

Spotlight on JD

JD is part of Jingdong Group, which began its work in ecommerce in 2004. As of March 2017, JD had more than 120,000 regular employees and was one of the largest online retailers of mobile phones, digital products and computers in China. The company’s catalog covers categories including household, computing, toys, menswear, womenswear, shoes, books, gifts, sports equipment, and auto accessories. In May 2014, Jingdong successfully listed on NASDAQ in the United States.

www.jd.com

Lessons Learned

Other cloud service providers can learn the following from JD’s experience:

  • By carrying out deep learning analysis on the same cluster that stores the data, the time taken to copy the data to a separate analysis cluster can be eliminated. In JD’s case, this accounted for half of the time of the entire analysis workload
  • BigDL offers a framework for taking models that have been trained using GPUs in frameworks such as Caffe*, and using them in Spark* on CPUs. BigDL can also be used with third-party pre-trained models, accelerating time to market
  • By establishing a competency in image feature extraction, JD is now able to develop and deploy innovative apps that help to differentiate its public cloud offering (such as text classification) or its ecommerce business (such as image search)

Find the solution that is right for your organization. Contact your Intel representative or visit intel.com/CSP

 

References

1 Building Large-Scale Image Feature Extraction with BigDL at JD.com, https://software.intel.com/en-us/articles/building-large-scale-imagefeature-extraction-with-bigdl-at-jdcom.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer, or learn more at https://software.intel.com/en-us/articles/bigdl-distributed-deep-learning-on-apache-spark.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks

Performance estimates were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as “Spectre” and “Meltdown”. Implementation of these updates may make these results inapplicable to your device or system.

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

The performance test in this paper compares 20 NVIDIA Tesla* K40 GPUs with 24 servers based on the Intel® Xeon® processor E5-2650 v4 at 2.20GHz, providing a total of 1,200 logical cores. Each server has 24 physical cores with Intel® Hyper-Threading Technology (Intel® HT Technology) enabled, and is configured to support 50 logical cores in Apache Hadoop Yet Another Resource Negotiator* (YARN*).

Intel, the Intel logo, and Xeon are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.

* Other names and brands may be claimed as the property of others.

For more complete information about compiler optimizations, see our Optimization Notice.