Accelerating Large-scale Image Retrieval on Heterogeneous Architectures with Spark ∗
Hanli Wang
Bo Xiao
Lei Wang
Jun Wu
Department of Computer Science and Technology, Tongji University, Shanghai, China
Key Laboratory of Embedded System and Service Computing, Ministry of Education, Tongji University, Shanghai, China
{hanliwang,1314xiaobo,110_wangleixx,wujun}@tongji.edu.cn
ABSTRACT
Apache Spark is a general-purpose cluster computing system for big data processing that has recently drawn much attention from fields such as pattern recognition and machine learning. Unlike MapReduce, Spark is especially suitable for iterative and interactive computations. Building on the computing power of Spark, a utility library referred to as IRlib is proposed in this work to accelerate large-scale image retrieval applications by jointly harnessing the power of GPUs. Similar to MLlib, the built-in machine learning library of Spark, IRlib fits into the Spark APIs and benefits from the powerful functionality of Spark. The main contributions of IRlib are twofold. First, IRlib provides a uniform set of APIs for programming image retrieval applications. Second, the computational performance of Spark equipped with multiple GPUs is dramatically boosted by developing high-performance modules for common image retrieval algorithms. Comparative experiments on large-scale image retrieval demonstrate the significant performance improvement achieved by IRlib over both a single CPU thread implementation and Spark without GPUs.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Relevance feedback; H.3.4 [Systems and Software]: Distributed systems, Performance evaluation (efficiency and effectiveness)

General Terms
System, Experimentation, Performance

Keywords
Heterogeneous Computing; Spark; Image Retrieval; Graphics Processing Units

∗H. Wang is the corresponding author. This work was supported in part by the National Natural Science Foundation of China under Grant 61472281, the "Shu Guang" project of Shanghai Municipal Education Commission and Shanghai Education Development Foundation under Grant 12SG23, the Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning (No. GZ2015005), and the Fundamental Research Funds for the Central Universities under Grant 0800219270.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
MM'15, October 26–30, 2015, Brisbane, Australia.
© 2015 ACM. ISBN 978-1-4503-3459-4/15/10 $15.00.
DOI: http://dx.doi.org/10.1145/2733373.2806392

1. INTRODUCTION
With the popularity of social networks and mobile multimedia applications, huge amounts of multimedia data are generated, containing countless information to be mined and analyzed. Content-based image retrieval is a golden example of multimedia data mining applications. Most image retrieval systems are based on the Bag of Features (BoF) model [1], which represents images as sparse vectors by extracting salient local features and assigning each feature to the nearest visual word(s) in a previously learned vocabulary. As the number of images to be processed grows, challenges in both efficiency and accuracy arise from the huge computational demands of large-scale image retrieval. To meet these escalating demands, scientists and engineers seek solutions from the high performance computing community. Meanwhile, there has been a trend towards heterogeneity in the last decade, since Graphics Processing Units (GPUs) have developed into general-purpose coprocessors with the aid of programming models and architectures such as CUDA (https://developer.nvidia.com/cuda-zone) and OpenACC (http://www.openacc-standard.org/). A GPU, usually manufactured with hundreds or even thousands of stream processors, is capable of massive fine-grained parallelization, which makes it ideal for executing data-intensive computations. As a consequence, heterogeneous clusters equipped with multiple GPUs per computing node are more powerful and environmentally friendly than traditional computer clusters. Moreover, the development of distributed applications deployed on computer clusters has become much more productive and reliable as a wide range of programming paradigms, middlewares and frameworks have been developed. MapReduce [2], for instance, is a popular programming paradigm proposed by Google, built on the intuition that many large-scale workloads can be scaled out horizontally and expressed with map and reduce operations. Apache Spark (http://spark.apache.org/) is another well-known general-purpose cluster computing system for big data analysis, which aims at improving the performance of iterative and interactive computations. Although the performance of heterogeneous clusters is impressive and distributed parallel frameworks continue to prosper, it remains challenging to utilize heterogeneous clusters for large-scale image retrieval with existing parallel frameworks.

To meet the huge computational demands of large-scale multimedia mining, several researchers have employed parallel frameworks. In [3], MapReduce is applied to multimedia data mining tasks including K-Means clustering and background subtraction. Shukla et al. [4] demonstrate how iterative computations can benefit from parallel execution on Spark, with results on Amazon EC2 clusters. In [5], a framework named YAFIM is proposed to parallelize frequent itemset mining with Spark; experiments show that YAFIM achieves an 18× speedup on average over MapReduce across various benchmarks. In addition, a number of works aim to achieve the maximum performance of combined CPU+GPU systems. In [6], StarPU is designed for numerical kernels to execute parallel tasks on a shared-memory machine with heterogeneous multicore architectures. He et al. [7] propose Mars, a GPU-based MapReduce framework, to evaluate the benefits of cooperative usage of CPU and GPU on traditional applications such as string matching and matrix multiplication. In this work, a utility library for Spark, namely IRlib, is designed to accelerate large-scale image retrieval on heterogeneous clusters. IRlib fits into the Spark APIs and thus can be easily integrated into Spark as a built-in library. It dramatically increases the parallel computing performance of Spark by harnessing the power of GPUs. The rest of this paper is organized as follows.
The proposed IRlib is introduced in Section 2. Evaluation results are shown in Section 3. Finally, Section 4 concludes this paper.
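The map-and-reduce intuition mentioned above can be illustrated with a tiny word-count sketch in plain Python. This is illustrative only: a real MapReduce or Spark job would distribute the map and reduce phases across the cluster, and the function names here are ours.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each key."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Each document plays the role of one input split handled by a mapper.
splits = ["spark makes iterative jobs fast", "spark extends map reduce"]
counts = reduce_phase(chain.from_iterable(map_phase(s) for s in splits))
```

Because each map call touches only its own split and the reduce is a per-key sum, both phases scale out horizontally, which is exactly the property that MapReduce and Spark exploit.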
2. PROPOSED IRLIB IMPLEMENTATION

Figure 1: IRlib Overview.

The design goal of IRlib is to make full use of heterogeneous clusters and to provide a uniform set of APIs, compatible with those of Spark, for large-scale image retrieval. A number of common image retrieval algorithms are implemented as modules to exploit the joint power of CPU and GPU. Each module in IRlib is composed of two relatively independent parts, namely the Module Driver and the Algorithm Worker, as depicted in Fig. 1. CUDA, which provides a comprehensive environment for GPU programming, is employed to develop the Algorithm Worker, which is responsible for the implementation of the corresponding algorithm. The Module Driver, programmed in Java, interacts with the Spark engine and invokes the Algorithm Worker. The two parts communicate with each other via the Java Native Interface (JNI). Although Spark provides plenty of data structures, dedicated data structures implemented with Protocol Buffers (https://developers.google.com/protocol-buffers/) are developed for inner communication among the modules of IRlib so as to improve the performance of data serialization and deserialization, which is important for handling image retrieval data such as feature vectors, visual dictionary vocabularies, BoF vectors and inverted file indexes. These data structures can also be accessed directly by users for performance reasons. The Internet Communications Engine (ICE, https://www.zeroc.com/), a fast and highly scalable communication infrastructure, is utilized as the Remote Procedure Call (RPC) service for building online retrieval applications.

Within the proposed IRlib, a number of image retrieval modules are realized, including feature extraction, K-Means clustering, BoF generation, similarity search, Hamming Embedding (HE) [8], Weak Geometric Consistency (WGC) [8], RANSAC [9] for homography estimation, the Selective Match Kernel (SMK)/Aggregated Selective Match Kernel (ASMK) and their binary variants [10], etc. Due to space limits, only the K-Means clustering module is discussed in detail to demonstrate the implementation of an IRlib module. K-Means clustering is an iterative and computationally laborious algorithm which groups a set of feature vectors (in the context of image retrieval) such that the feature vectors within one group are more similar to each other than to those in other groups. The Spark machine learning library MLlib (http://spark.apache.org/mllib/) implements K-Means clustering and achieves a significant improvement over a MapReduce implementation thanks to its in-memory techniques. However, MLlib cannot exploit the power of GPUs. Since the most time-consuming calculation, finding the nearest centroid, can be greatly accelerated on GPUs, a K-Means clustering module is designed in IRlib which achieves high performance by dispatching computations onto GPUs. A comparative experiment of K-Means clustering with IRlib and MLlib is described in Section 3.4.

Specifically, the API of the K-Means clustering module is identical to that provided by the built-in MLlib, which makes it easy for users to port existing applications that adopt the MLlib K-Means module to heterogeneous clusters. First, the splits of input data scheduled by the Spark engine are received by the Module Driver in the Protocol Buffers format, serialized as binary streams. Then, after acquiring computational resources (i.e., GPUs), the data is transferred to GPU memory and the related processing is executed by the Algorithm Worker on both CPU and GPU. In order to achieve coalesced reads, which can dramatically increase GPU efficiency, the centroid data is stored in column-major order in the global memory of the GPU. A feature vector is cached in GPU shared memory while a thread block locates the nearest centroid for that feature vector.
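The assignment step that IRlib offloads to the GPU can be sketched on the CPU side with NumPy. This is an illustration of the computation, not IRlib's actual CUDA kernel; the function names are ours. The centroids are kept with one centroid per column, mirroring the column-major layout used for coalesced reads described above.

```python
import numpy as np

def assign_to_nearest(features, centroids_T):
    """Assign each feature vector to its nearest centroid.

    features:    (n, d) array of feature vectors
    centroids_T: (d, k) array with one centroid per column,
                 mirroring the GPU's column-major layout
    returns:     (n,) array of centroid indices
    """
    # Squared distance via ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2.
    # The ||x||^2 term is constant per row and can be dropped for argmin.
    # On the GPU, each thread block caches one feature vector in shared
    # memory and streams centroid columns from global memory.
    cross = features @ centroids_T                # (n, k)
    c_norms = np.sum(centroids_T ** 2, axis=0)    # (k,)
    return np.argmin(c_norms[None, :] - 2.0 * cross, axis=1)

def kmeans_step(features, centroids_T):
    """One Lloyd iteration: assignment followed by centroid update."""
    labels = assign_to_nearest(features, centroids_T)
    new_T = np.empty_like(centroids_T)
    for j in range(centroids_T.shape[1]):
        members = features[labels == j]
        # Keep the old centroid if a cluster loses all its members.
        new_T[:, j] = members.mean(axis=0) if len(members) else centroids_T[:, j]
    return labels, new_T
```

Only the assignment step is dispatched to the GPU in the design above; the centroid update is a cheap reduction by comparison, which is why the speedup in Section 3 grows with the vocabulary size k.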
3. EXPERIMENTAL RESULTS

The computing performance of IRlib for large-scale image retrieval on heterogeneous clusters is evaluated by means of comparative experiments against applications running in a single CPU thread and against applications deployed on Spark without GPUs.

3.1 Experimental Setup

The experiments are carried out on a 12-node computer cluster equipped with multiple GPUs. Each node runs the GNU/Linux Ubuntu 12.04 64-bit Desktop operating system, and gigabit NICs are used for communication among the nodes. The detailed configuration of the cluster is given in Table 1. In addition, a computer with middle-level performance, indicated as 'single*' in Table 1, is selected to run the single CPU thread version of all the following experiments.

Table 1: Hardware configuration of Spark cluster.

Hardware
Amount        2                    6+1 (single*)        4
CPU           Intel Core i5-3450   Intel Core i5-3470   Intel Core i5-4430
CPU Cores     4                    4                    4
Host Memory   24 GB                24 GB                24 GB
GPU           GTX 660Ti×2          GTX 660×2            GTX 760×2
CUDA Cores    1344×2               960×2                1152×2
GPU Memory    2 GB×2               2 GB×2               2 GB×2
Network       1 Gbps LAN

In order to verify the retrieval accuracy of the proposed IRlib, two benchmark works, [11] and [10], are followed to guide our implementation of the image retrieval algorithms. Specifically, [11] is employed to demonstrate the performance of HE and WGC, using the Hessian-Affine detector [12] and the SIFT descriptor [13]. The SMK/ASMK/ASMK-Binary algorithms are carried out according to [10], where the modified Hessian-Affine detector [14] and the Root-SIFT descriptor [15] are employed for feature detection and description. Several benchmark image datasets are utilized in the experiments, including Oxford5K (http://www.robots.ox.ac.uk/~vgg/data/oxbuildings/), Paris (http://www.robots.ox.ac.uk/~vgg/data/parisbuildings/), Flickr100K (http://www.robots.ox.ac.uk/~vgg/data/oxbuildings/flickr100k.html) and MSR-Bing1M (http://web-ngram.research.microsoft.com/GrandChallenge/Datasets.aspx). The information on these image datasets and the related descriptors is shown in Table 2.

Table 2: Image datasets and the related descriptors.

Hessian-Affine detector [12] and SIFT descriptor [13]:
Dataset       # images    # descriptors   descriptor size
Oxford5K      5,063       18,190,561      11.2 GB
Flickr100K    100,000     317,565,032     195.0 GB
MSR-Bing1M    1,000,000   392,528,043     242.0 GB

Modified Hessian-Affine detector [14] and Root-SIFT descriptor [15]:
Dataset       # images    # descriptors   descriptor size
Oxford5K      5,063       13,759,319      8.5 GB
Paris         6,392       16,402,339      10.2 GB

3.2 Image Retrieval on Oxford5K

In this subsection, the Oxford5K dataset is used to evaluate the computational performance and the retrieval accuracy, in terms of mean Average Precision (mAP), of the proposed IRlib, while the MSR-Bing1M dataset is used to provide distracting images and Flickr100K is employed as an independent dataset to train the visual dictionary vocabularies. The speedup of K-Means clustering achieved by IRlib over the single CPU thread version is listed in Table 3, where it can be seen that dramatic speedups are obtained and the speedup grows with the size of the visual dictionary.

Table 3: Speedup of K-Means clustering on Flickr100K with different sized visual dictionaries.

Size      20K       200K      1M
Speedup   1,201.3   2,318.7   2,657.9

Table 4 presents the speedups of several representative IRlib modules, including feature extraction (ImageProc), BoF calculation (BoF), HE model training (HETrain) and inverted file index generation (InvFlGen). The speedups of the ImageProc module in Table 4 remain the same under all test cases because this module is unrelated to the size of the visual dictionary. In addition, compared with the other modules, the speedup of ImageProc is relatively low because it currently has only a CPU implementation.

Table 4: Speedup of image retrieval on Oxford5K with different sized visual dictionaries.

             20K       200K      1M
ImageProc    30.8      30.8      30.8
BoF          1,413.7   2,365.2   2,635.1
HETrain      502.1     650.5     713.1
InvFlGen     523.1     654.1     720.9
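The BoF and InvFlGen modules benchmarked in Table 4 correspond to the standard quantization and inverted-file construction steps of a BoF retrieval pipeline. A minimal CPU-side sketch of the two steps follows; it is illustrative only, the function names are ours, and IRlib's actual implementation dispatches the quantization onto the GPU via the Algorithm Worker.

```python
from collections import Counter, defaultdict
import numpy as np

def build_bof(descriptors, centroids):
    """Quantize one image's local descriptors into a sparse BoF vector.

    descriptors: (n, d) local features of one image
    centroids:   (k, d) visual dictionary
    returns:     {visual_word_id: count}
    """
    # Hard-assign every descriptor to its nearest visual word.
    dists = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    words = dists.argmin(axis=1)
    return dict(Counter(words.tolist()))

def build_inverted_index(bof_vectors):
    """Map each visual word to its postings: a list of (image_id, count)."""
    index = defaultdict(list)
    for image_id, bof in bof_vectors.items():
        for word, count in bof.items():
            index[word].append((image_id, count))
    return index
```

At query time only the posting lists of the visual words present in the query BoF vector need to be scanned, which is what makes the inverted file efficient for the sparse vectors produced by large vocabularies.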
Table 5 shows the retrieval accuracy in terms of mAP for different sized vocabularies. Compared with [11], the mAP results are significantly better under all test conditions, with a minimum improvement of 0.03, a maximum improvement of 0.16, and an average improvement of 0.11. This improvement of mAP is mainly because many more images and feature vectors can be processed by using IRlib with Spark.

Table 5: Retrieval accuracy in terms of mAP.

Vocabulary     20K             200K            1M
Distractors    0      1M       0      1M       0
Baseline       0.397  0.330    0.461  0.284    0.502
HE             0.588  0.448    0.625  0.383    0.612
WGC            0.533  0.363    0.542  0.266    0.544
HE+WGC         0.657  0.427    0.656  0.363    0.618

3.3 Extension with SMK/ASMK

In this subsection, the performance of the SMK module is presented by strictly following the benchmark work of [10], including the SMK, ASMK and ASMK-Binary algorithms. Two image datasets, Oxford5K and Paris, are utilized for performance evaluation. As specified in [10], a 65K-sized visual dictionary is generated by K-Means clustering. The visual dictionary trained on Paris is applied for Oxford5K image retrieval, and vice versa. The experimental results for speedup and mAP are listed in Table 6, where the mAP values in parentheses indicate the improvement of our implementation over that of [10]. Since our implementation follows [10], the mAP performance should be similar, which is verified by the small mAP differences shown in parentheses in Table 6. As far as computational performance is concerned, significant speedup ratios are achieved by the SMK module for all test cases.

Table 6: Speedup and mAP performances achieved by the SMK module.

                      ASMK            SMK             ASMK-Binary
Speedup   Oxford5K    340.9           372.3           401.7
          Paris       343.5           381.7           410.9
mAP       Oxford5K    0.817 (0.000)   0.795 (0.021)   0.800 (-0.004)
          Paris       0.785 (0.003)   0.744 (0.026)   0.788 (0.018)

3.4 Comparison with Spark MLlib

As aforementioned, MLlib is the built-in library of Spark for scalable machine learning, consisting of common learning algorithms and utilities including classification, clustering, etc. In this subsection, a comparison of K-Means clustering between IRlib and MLlib is carried out to verify the dramatic performance improvement of IRlib. The Flickr100K dataset is employed for testing with visual dictionaries of different sizes from 20K to 400K.

Table 7: Speedup of K-Means clustering achieved by IRlib over Spark MLlib with different sized visual dictionaries.

Size      20K    60K    120K   200K   400K
Speedup   26.3   42.4   50.0   56.3   65.9

As seen in Table 7, the performance improvement achieved by IRlib over Spark MLlib is significant because the computing power of GPUs is harnessed by IRlib.

4. CONCLUSION

In this work, a utility library for Spark, namely IRlib, is designed to accelerate large-scale image retrieval by harnessing the computing power of GPUs. In order to fully exploit the power of GPUs and take advantage of Spark, several techniques are employed in the development of IRlib. The experimental results demonstrate that a significant speedup is achieved by the proposed IRlib compared with both a single CPU thread implementation and the Spark version without GPUs. In the future, more research efforts will be devoted to enriching the utilities and functions of IRlib.

5. REFERENCES

[1] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV'03, pages 1470–1477, Oct. 2003.
[2] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Comm. of the ACM, 51(1):107–113, Jan. 2008.
[3] B. White, T. Yeh, J. Lin, and L. Davis. Web-scale computer vision using MapReduce for multimedia data mining. In MDMKDD'10, pages 1–10, Jul. 2010.
[4] S. Shukla, M. Lease, and A. Tewari. Parallelizing ListNet training using Spark. In ACM SIGIR'12, pages 1127–1128, Aug. 2012.
[5] H. Qiu, R. Gu, C. Yuan, and Y. Huang. YAFIM: A parallel frequent itemset mining algorithm with Spark. In PDPS'14, pages 1664–1671, May 2014.
[6] C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier. StarPU: A unified platform for task scheduling on heterogeneous multicore architectures. In Euro-Par'09, pages 863–874, Aug. 2009.
[7] B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang. Mars: A MapReduce framework on graphics processors. In PACT'08, pages 260–269, Oct. 2008.
[8] H. Jégou, M. Douze, and C. Schmid. Hamming embedding and weak geometric consistency for large scale image search. In ECCV'08, pages 304–317, Oct. 2008.
[9] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Comm. of the ACM, 24(6):381–395, Jun. 1981.
[10] G. Tolias, Y. Avrithis, and H. Jégou. To aggregate or not to aggregate: Selective match kernels for image search. In ICCV'13, pages 1401–1408, Dec. 2013.
[11] H. Jégou, M. Douze, and C. Schmid. Improving Bag-of-Features for large scale image search. IJCV, 87(3):316–336, May 2010.
[12] K. Mikolajczyk and C. Schmid. Scale and affine invariant interest point detectors. IJCV, 60(1):63–86, Oct. 2004.
[13] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, Nov. 2004.
[14] M. Perd'och, O. Chum, and J. Matas. Efficient representation of local geometry for large scale object retrieval. In CVPR'09, pages 9–16, Jun. 2009.
[15] R. Arandjelovic and A. Zisserman. Three things everyone should know to improve object retrieval. In CVPR'12, pages 2911–2918, Jun. 2012.