A Large-scale Image Processing Model Based on the Hadoop Platform
GongRong Zhang, QingXiang Wu*, ZhiQiang Zhuo, XiaoWei Wang, Xiaojin Lin
College of Photonic and Electronic Engineering, Fujian Normal University, Fuzhou, Fujian, 350007, China
*Corresponding author.
[email protected],
[email protected] ABSTRACT This paper presents a parallel processing model based on Hadoop platform for large-scale images processing, which aims to make use of the advantages of high reliability and high scalability of Hadoop distributed platform for distributed memory and distributed computing, so as to achieve the purpose of fast processing of large-scale images. The Hadoop streaming technology is used in the model. The main operations are written on shell script as the mapper of Hadoop streaming, then an assigned filelist is used as the Hadoop streaming’s input. The large numbers of image files are delivered to cluster computers for concurrent image processing. The model has been implemented using virtual machines. A set of experimental results and analysis are provided.
Keywords
Hadoop platform; Image processing; HDFS; MapReduce; Hadoop Streaming
1. INTRODUCTION
With the rapid development of computer technology and image data acquisition technology, and with the popularity of the Internet, large amounts of multimedia data are produced in various domains. For example, huge numbers of mammography images require fast processing in breast cancer screening. These data are mainly image and video data, and traditional image processing is based on a single node, computing on only one computer or one server, which suffers from slow processing and poor concurrency. Therefore, cluster-based, fast, parallel image processing system architectures have become a research focus.

Hadoop, a project of the Apache Software Foundation, is an open-source distributed computing platform [1]. Derived from an implementation of Google's MapReduce [2], Hadoop consists of two chief components: Hadoop MapReduce and the Hadoop Distributed File System (HDFS). HDFS and MapReduce, as the core of Hadoop, hide the low-level details of the distributed system infrastructure from users. The Hadoop architecture is succinct: using it, clusters of hundreds or thousands of inexpensive machines can be built to achieve PB (petabyte) level data operations [3]. The distributed storage and distributed computing of Hadoop clusters offer high reliability, scalability, and efficiency. The architecture of the Hadoop file system is suitable for storing and managing large volumes of
medical images [4]. Hadoop has also been used in massive image retrieval. A massive image retrieval system based on Hadoop, compared with a single-node retrieval system, can effectively reduce the search time and improve the retrieval speed when dealing with large data sets [5].

This paper proposes a parallel processing model, based on research on the Hadoop platform, to achieve an effective solution for mass image processing. In this model, huge numbers of images and the corresponding image processing tasks are evenly assigned to the computer nodes of a cluster for concurrent processing, so that the image processing job can be sped up. Compared with traditional single-node image processing, this distributed, parallel model has the merits of high speed and high efficiency.

To test the properties of the model, the same architecture is simulated using virtual machines in this paper. Hadoop clusters of 2, 4, 6, 11, 16, and 21 nodes are built, and these clusters are then used for large-scale image processing. The experimental results show that image processing on a multi-node cluster is much faster than on a single computer, which clearly demonstrates the advantages of Hadoop clusters for large-scale image processing. Since Hadoop clusters can be made up of hundreds or even thousands of physical computers, they can bring significant increases in efficiency for large-scale image processing.

An N-node Hadoop cluster has one master node, which is responsible for task scheduling, while the remaining nodes are slave nodes that do the computing work. A Hadoop cluster is preferably built with the same version of the Linux system on every machine. After the system is installed on each machine, the cluster's network should be configured: the N nodes are connected to each other via a router or switch, and the master and slave nodes are configured so that they can communicate with each other. Then Hadoop and Java are installed and configured on each machine, which completes the process of building an N-node Hadoop platform.
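As a minimal sketch of these configuration steps (the hostnames and IP addresses match Table 4 below; the file locations assume the Hadoop 1.x layout, and the exact commands may vary by Linux distribution), the mutual communication between master and slaves could be set up roughly as follows:

```sh
# Sketch only: hostnames/IPs are those of Table 4; paths assume Hadoop 1.x.
# 1. Give every node the same host map (append to /etc/hosts on each machine):
cat >> /etc/hosts <<'EOF'
192.168.11.129 master
192.168.11.132 slave1
192.168.11.130 slave2
192.168.11.131 slave3
EOF

# 2. Let the master start daemons on the slaves without a password:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
for h in slave1 slave2 slave3; do ssh-copy-id "$h"; done

# 3. List the worker nodes on the master so that bin/start-all.sh
#    can launch the DataNode/TaskTracker daemons on them:
printf 'slave1\nslave2\nslave3\n' > $HADOOP_HOME/conf/slaves
```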
2. DESIGN OF COMPUTER CLUSTER
The NameNode of a cluster is generally a server with high computational performance. Because the NameNode stores all the metadata, choosing a high-performance machine ensures high stability of the cluster. For example, the NameNode's configuration can be: 8 CPU * 2 GHz, 4 GB RAM, 750 GB disk space. DataNodes can be ordinary computers instead; for example, their configuration can be: 2 CPU * 2 GHz, 2 GB RAM, 500 GB disk space.
2.1 Simulation Using Virtual Machines
For simplicity, we describe the 4-node Hadoop cluster configuration in detail. The other clusters have different numbers of nodes, but their configuration is similar to that of the 4-node cluster. Four virtual machines are built on a server whose configuration is: 24 CPU * 2.0 GHz, 64 GB RAM, 500 GB disk space. The 4-node cluster configuration is shown in the following tables:
Table 1. Cluster Hardware Configuration
Name          Amount        Detailed configuration
NameNode      1             1 CPU * 2.0 GHz, 4 GB RAM
DataNode      3             1 CPU * 2.0 GHz, 1 GB RAM
Disk space    (each node)   20 GB
Table 2. Cluster Software Configuration
Software name   Version
Red Hat AS      CentOS 6.3
JDK             jdk-6u31-linux-i586
Hadoop          1.0.0
Table 3. Single Node Configuration
Name         Detailed configuration
Hardware     1 CPU * 2.0 GHz, 1 GB RAM
Software     The same as the cluster
Disk space   20 GB
Table 4. Network Configuration
Machine name   IP address
Master         192.168.11.129
Slave1         192.168.11.132
Slave2         192.168.11.130
Slave3         192.168.11.131
In this article, VMware software is used to build the cluster of virtual machines. VMware is a leading provider of virtualization solutions from the desktop to the data center. The reason for using virtual machines in the experiments is that it is relatively efficient to set up a virtual Hadoop cluster to simulate different scenarios: the cluster can be expanded quickly, nodes can be added or dropped conveniently, and the hardware configuration can also be changed quickly. For example, when a virtual machine is low on memory, it can be turned off and its memory increased through the software settings. The total memory of all virtual machine nodes cannot exceed the actual memory size of the physical machine.
2.2 HDFS Module
The Hadoop Distributed File System (HDFS) is an open-source implementation of Google's GFS [6]. HDFS is the primary storage system used by Hadoop applications. The cluster's distributed file system has a Master/Slave architecture, which contains one NameNode and a number of DataNodes. The NameNode manages the file system metadata, while the DataNodes store the actual data. Clients reading from and writing to the file system need to interact with both the NameNode and the DataNodes [7]. The HDFS architecture [8] is shown in Figure 1.
Figure 1. HDFS Structure

Before a job starts, all the image data are transferred to the HDFS distributed file system. When the job begins, the client sends a job request to the NameNode. The NameNode receives the request from the client and then sends the data block information of the file system and the locations of the data blocks back to the client. The client can then directly access the DataNodes where the data blocks are located and process the data via Hadoop commands. HDFS has good reliability: multiple copies are created when data blocks are stored, and the copies are kept on different machines. The number of copies can be specified when a file is created, with a default value of 3. When a machine malfunctions and the required data cannot be obtained from it, the data can be obtained from other nodes, so job execution is not affected.
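As a minimal sketch of this staging step (the HDFS paths are hypothetical; the commands are the standard Hadoop 1.x file-system shell), the image data could be uploaded and its replication factor adjusted as follows:

```sh
# Hypothetical paths; assumes a running Hadoop 1.x cluster.
# Upload the local image directory to HDFS before the job starts.
hadoop fs -mkdir /user/hadoop/images
hadoop fs -put /data/images/*.bmp /user/hadoop/images/

# Blocks are replicated 3 times by default; the replication factor
# can be changed explicitly, e.g. set to 2 for the whole directory:
hadoop fs -setrep -R 2 /user/hadoop/images
```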
2.3 MapReduce Module
MapReduce is a programming model for the parallel processing of distributed large-scale data [9]. A MapReduce job is a unit of work that the client wants performed: it consists of the input data, the MapReduce program, and configuration information. A MapReduce job is completed by two types of services, the JobTracker and the TaskTrackers. A cluster has only one JobTracker, running on the NameNode, which is responsible for scheduling the work; the TaskTrackers, distributed across all the DataNodes, are responsible for executing the tasks. Hadoop splits a job into several small tasks of two types, map tasks and reduce tasks, which correspond to the map and reduce functions and are mainly used for parallel computation over large data sets. In this program, MapReduce is primarily responsible for controlling the executable shell script that performs the image processing; the script obtains the executable file's input data from the HDFS system.
3. THE CORE TECHNOLOGY USED IN THE PROGRAM
In general, the MapReduce programming model can process image files in three ways: ① convert images into binary serialized data and process the serialized data (the approach referred to in this paper); ② implement a custom image file input interface; ③ design dedicated Hadoop data types for processing image files directly. This program uses Hadoop Streaming technology. Hadoop Streaming helps users create and run a special kind of map/reduce job, which can be executed through an executable file or script that acts as the mapper or reducer.

In this program, a shell script acts as the mapper. The shell script calls the executable image processing program; the image files are used as the input data of the executable, and the processing results are uploaded to the HDFS file system. The image files to be processed are first uploaded to HDFS, and then a filelist is created in the directory specified as the Hadoop Streaming input. The directory contains a set of filelist files whose contents are the HDFS paths of the image files to be processed, one image path per line. Each input split is a text file, which is regarded as the input of a mapper. The shell script mapper reads its input line by line, retrieves the image at each path, and calls the image processing executable to process it; finally, the processing result is saved back to HDFS.

In general, a reducer combines the output of the mappers and then emits the final results. In this scenario, the mapper is the shell script, and its output is no longer an intermediate result: the script already contains the operations of fetching the image files from HDFS, processing the images, and uploading the results back to HDFS. Since the map stage directly produces the final image processing results, the number of reduce tasks can be set to 0.
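A minimal sketch of such a mapper is given below; it assumes Hadoop 1.x shell commands and a hypothetical C binarization executable named binarize, neither of which is specified in detail in this paper. Hadoop Streaming feeds each line of the assigned filelist (one HDFS image path) to the script on standard input:

```sh
#!/bin/bash
# Sketch of the shell-script mapper (paths and program name are
# illustrative). Each stdin line is the HDFS path of one image.
while read hdfs_path; do
    name=$(basename "$hdfs_path")
    # Fetch the image from HDFS to local scratch space.
    hadoop fs -get "$hdfs_path" "/tmp/$name"
    # Run the C binarization program on the local copy.
    ./binarize "/tmp/$name" "/tmp/bin_$name"
    # Upload the result back to HDFS and clean up.
    hadoop fs -put "/tmp/bin_$name" "/user/hadoop/results/bin_$name"
    rm -f "/tmp/$name" "/tmp/bin_$name"
done
```

The job itself would then be submitted roughly as follows (the streaming jar ships under contrib/streaming in Hadoop 1.0.0; the input and output paths are hypothetical). Setting mapred.reduce.tasks to 0 reflects the design above, in which the mapper already writes the final results to HDFS:

```sh
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.0.jar \
    -D mapred.reduce.tasks=0 \
    -input /user/hadoop/filelist \
    -output /user/hadoop/streaming_out \
    -mapper mapper.sh \
    -file mapper.sh \
    -file binarize
```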
4. THE EXPERIMENTAL RESULTS AND ANALYSIS
4.1 The Experimental Process and Results of Virtual Clusters
In order to facilitate testing the model, virtual machines are used to simulate the whole process. The experiments use 3, 5, 10, 15, and 20 TaskTrackers as image processing nodes, with the same data set in all five cases, and the image processing results for a single node and for multiple nodes are compared. The simulation results show that the speedup increases with the number of nodes.

The experimental data are one thousand BMP image files, each of size 256×256. The job is to binarize all the images; the binarization program is written in C. The basic virtual machine cluster contains four nodes: one JobTracker node is responsible for scheduling the work, and the other three TaskTracker nodes are responsible for performing the computation. The images are assigned to the nodes through three files in the filelist directory, named file1, file2, and file3; file1 stores 334 image file paths, while file2 and file3 store 333 image file paths each. The experiment produces three map tasks, and each of the three TaskTrackers completes one map task: each task reads its assigned file and processes the listed images, and the processing results are uploaded to HDFS. The experiments with 5, 10, 15, and 20 TaskTracker nodes are similar to the 3-node experiment but use different filelists: the 5-node experiment has five files in the filelist directory, each storing 200 image paths; the 10-node experiment has ten files, each storing 100 image paths; and so on. The experimental results are shown in Table 5.

Table 5. Experimental results
Number of DataNodes   Files in filelist   Job finish time (s)
1                     1                   2343
3                     3                   810
5                     5                   496
10                    10                  338
15                    15                  241
20                    20                  209
The experimental results show that, for the same image data, multi-node processing is more efficient than single-node processing. Figure 2 compares the clusters' processing times with the single node's. Moreover, the 5-DataNode cluster saves 38.76% of the time compared with the 3-DataNode cluster; the 10-DataNode cluster saves 31.85% compared with the 5-DataNode cluster; the 15-DataNode cluster saves 28.69% compared with the 10-DataNode cluster; and the 20-DataNode cluster saves 13.28% compared with the 15-DataNode cluster. From these results we can see that, for a fixed amount of data, increasing the number of nodes does not bring a proportional speedup; the marginal gain decreases gradually as nodes are added.
Figure 2. Time saved compared with the single node by clusters of 3, 5, 10, 15, and 20 DataNodes: 65.48%, 78.83%, 85.57%, 89.71%, and 91.08%, respectively.
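For concreteness, the speedup and parallel efficiency implied by Table 5 can be computed with the standard definitions (these derived values are our own arithmetic, not figures taken from the tables):

$$S(n) = \frac{T(1)}{T(n)}, \qquad E(n) = \frac{S(n)}{n}$$

which gives S(3) = 2343/810 ≈ 2.89, S(5) ≈ 4.72, S(10) ≈ 6.93, S(15) ≈ 9.72, and S(20) ≈ 11.21. The corresponding efficiency E(n) thus falls from about 96% at 3 DataNodes to about 56% at 20, consistent with the diminishing returns noted above.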
This paper also uses data sets of different sizes for experiments on the single node and on a cluster containing 10 DataNodes. The results are shown in Table 6.

Table 6. Experimental results
Data set   Number of images   Single-node time (s)   Cluster time (s)
1          100                249                    55
2          200                483                    87
3          400                991                    140
4          800                1967                   253
The experimental results show that as the amount of image data increases, the advantages of the Hadoop cluster's parallel processing become more and more obvious, and the ratio of the cluster's processing time to the single node's becomes smaller. When dealing with 100 images, the cluster's time is 22.09% of the single node's; for 200 images, the proportion is 18.01%; for 400 images, 14.13%; and when the number of images reaches 800, the proportion decreases to 12.86%. The results are shown in Figure 3.
Figure 3. Ratio of the 10-DataNode cluster's processing time to the single node's for 100, 200, 400, and 800 images: 22.09%, 18.01%, 14.13%, and 12.86%, respectively.
5. SUMMARY AND FUTURE WORK
This paper presents a parallel processing model based on the Hadoop platform for large-scale image processing. The model uses Hadoop Streaming technology: the main image operations are placed in a shell script that serves as the mapper, and the input of Hadoop Streaming is the filelist storing the HDFS paths of the image files. After reading its input, the mapper fetches the images to each node for processing, which makes full use of the distributed storage and distributed computing advantages of Hadoop clusters to achieve parallel binarization of large-scale image collections. Experimental results show that, compared with a single-node machine, Hadoop clusters bring significant efficiency gains for large-scale image processing, and the clusters save more time as the number of images grows. In this paper, binarization is used as the example to test the proposed model; other, more complex image processing programs can also be run in this model in a similar way for the rapid processing of large numbers of image files.
6. ACKNOWLEDGEMENTS
The authors gratefully acknowledge funding from the Natural Science Foundation of China (Grant No.61179011) and the Science and Technology Major Projects for Industry-Academic Cooperation of Universities in Fujian Province (Grant No.2013H6008).

7. REFERENCES
[1] T. White. Hadoop: The Definitive Guide. O'Reilly Media, Yahoo! Press, June 5, 2009.
[2] F.N. Afrati, J.D. Ullman. Optimizing multiway joins in a Map-Reduce environment. IEEE Transactions on Knowledge and Data Engineering, Vol.23, No.9, (Sept. 2011), 1282-1298.
[3] HDFS [EB/OL]. 2011-12-08. http://hadoop.apache.org/.
[4] LI Peng-jun, CHEN Guang-jie, GUO Wen-ming. A distributed storage architecture for regional medical image sharing and cooperation based on HDFS. J South Med Univ, Vol.31, No.3, (Mar. 2011), 495-498.
[5] WANG Mei, ZHU Xin-zhong, ZHAO Jian-min, HUANG Cai-feng. Massive Images Retrieval System Based on Hadoop. Computer Technology and Development, Vol.23, No.1, (Jan. 2013), 204-208.
[6] S. Ghemawat, H. Gobioff, S. Leung. The Google File System. New York: ACM, 2003.
[7] Ekpe Okorafor, Mensah Kwabena Patrick. Availability of JobTracker machine in Hadoop/MapReduce ZooKeeper coordinated clusters. Advanced Computing: An International Journal (ACIJ), Vol.3, No.3, (May 2012), 19-30.
[8] Hyeokju Lee, Myoungjin Kim, Joon Her, Hanku Lee. Implementation of MapReduce-based image conversion module in cloud computing environment. International Conference on Information Networking 2012 (ICOIN 2012), 234-238.
[9] Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Comm. ACM, Vol.51, No.1, (Jan. 2008), 107-113.