2011 12th IEEE/ACM International Conference on Grid Computing
Using the Gfarm File System as a POSIX Compatible Storage Platform for Hadoop MapReduce Applications

Shunsuke Mikami, Kazuki Ohta, Osamu Tatebe
Department of Computer Science, University of Tsukuba; Preferred Infrastructure, Inc.; Japan Science and Technology Agency, CREST
[email protected] [email protected] [email protected]
Abstract—MapReduce is a promising parallel programming model for processing large data sets. Hadoop is an up-and-coming open-source implementation of MapReduce. It uses the Hadoop Distributed File System (HDFS) to store input and output data. Due to a lack of POSIX compatibility, it is difficult for existing software to directly access data stored in HDFS, so storage cannot be shared between existing software and MapReduce applications. In order for external applications to process data using MapReduce, we must first import the data, process it, and then export the output data into a POSIX compatible file system. This results in a large number of redundant file operations. To solve this problem we propose using the Gfarm file system instead of HDFS. Gfarm is a POSIX compatible distributed file system with an architecture similar to that of HDFS. We design and implement the Hadoop-Gfarm plug-in, which enables Hadoop MapReduce to access files on Gfarm efficiently. We compared the MapReduce workload performance of HDFS, Gfarm, PVFS and GlusterFS, all of which are open-source distributed file systems. Our evaluations show that Gfarm performed just as well as Hadoop's native HDFS, and in most evaluations more than twice as well as PVFS and GlusterFS.
I. INTRODUCTION

Increasingly, researchers manipulate and explore massive datasets for scientific discovery. As is often the case with scientific computations, the work has to be distributed across hundreds or thousands of computers in order to finish in a reasonable amount of time. However, distributed computing is complex: we have to be able to handle machine failures, schedule processes, partition the data, and so on. MapReduce [1], an emerging framework for data-intensive computing, hides these complexities. It is very beneficial for scientists to use the MapReduce programming model because it makes distributed computation easy. Hadoop is an up-and-coming open-source implementation of MapReduce. It is used by various scientific communities, such as genome analytics [2] and astronomy [3]. Hadoop MapReduce uses HDFS (the Hadoop Distributed File System) [4] to store data. However, there are several problems with HDFS. HDFS only supports the operations required for MapReduce since it was developed specifically for Hadoop MapReduce.
Fig. 1. HDFS supports only MapReduce applications, while Gfarm supports MapReduce, POSIX and MPI applications.
For example, MapReduce applications only create new files or append to existing files when writing to HDFS; HDFS does not support file modifications other than the append operation, nor does it support concurrent writes from multiple clients to a single file. The lack of these features makes it difficult for applications other than MapReduce to access the file system. In scientific research, it is often the case that researchers need to use existing POSIX software such as MATLAB, as well as MPI, which is widely used in high performance computing; neither can run on HDFS. There are also applications that are much better suited to MPI than to MapReduce. It is therefore often necessary to import the files of POSIX programs into HDFS, run MapReduce on them, and then export the results to a file system that can be read by a POSIX application. Essentially we need to execute redundant copy and storage operations. We propose using the Gfarm file system [5] in place of HDFS because Gfarm has a POSIX compliant API and can exploit data locality. Gfarm can be mounted on Linux and other OSs by using the FUSE [6] kernel module, so existing software can run on Gfarm transparently. We evaluated not only HDFS and Gfarm but also PVFS and GlusterFS, which are all open source distributed file systems. PVFS and GlusterFS are also POSIX compatible distributed file systems and are capable of running existing software. Our performance evaluations show that Gfarm performs as well as HDFS and better than PVFS and GlusterFS. In MapReduce workloads, PVFS and GlusterFS suffer performance degradation due to disk contention because these file systems cannot exploit locality.
The major contribution of this work is to show that Gfarm, a POSIX compatible distributed file system, can perform as well as Hadoop's native HDFS for MapReduce applications without changing the configuration of the file system. This means that data in the Gfarm file system can be shared by MapReduce, POSIX and MPI applications without any performance penalty (Figure 1).
This paper is structured as follows. Section II introduces MapReduce and Hadoop. Section III introduces the distributed file systems which we evaluated. Section IV describes the Hadoop-Gfarm plug-in. Section V shows the performance evaluation. Section VI introduces related work and explains how our approach differs. Section VII concludes this paper.
II. MAPREDUCE

MapReduce is a framework for distributed computing which was proposed by Google in 2004. In the MapReduce programming model, the algorithm takes a set of input key/value pairs and produces a set of output key/value pairs. The programming model splits a workflow into three phases: map, shuffle, and reduce. In the map phase, the map function, which is written by the user, takes an input pair and produces a set of intermediate key/value pairs. In the shuffle phase, intermediate data is forwarded to the reduce workers based on the intermediate key, so data with the same key is grouped together. In the reduce phase, reduce workers receive the sorted intermediate data, that is, a key and the corresponding set of intermediate values, and pass it to a user-written reduce function. The job master controls this workflow, and workers execute the computation. The input data is divided into equally sized blocks (64 MB in Google's implementation); the MapReduce master takes the location information of the input files and attempts to schedule a map task on the machine that contains the input data. If that is not possible, it tries to find a machine that is close to the machine holding the data. This algorithm conserves network bandwidth and exploits locality to minimize computation time.
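As a concrete illustration of the model, the following is a minimal word-count-style map and reduce pair written against one of the MapReduce APIs shipped with Hadoop 0.20. The class names and the whitespace tokenization are our own choices for the example, not part of this paper's evaluation.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: each input record (byte offset, line of text) is turned
// into intermediate (word, 1) pairs.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      if (token.isEmpty()) continue;
      word.set(token);
      context.write(word, ONE);   // emit an intermediate key/value pair
    }
  }
}

// Reduce phase: the shuffle groups all values for one key together,
// and the reducer folds them into a single output record.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) sum += v.get();
    context.write(key, new IntWritable(sum));
  }
}
```

A driver class wires these two classes into a Job together with the input and output paths; the shuffle between the two phases is handled entirely by the framework.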
A. Hadoop

Google has published papers on several distributed computing systems, such as the Google File System, MapReduce, and BigTable, but often does not release the source code to the community. The Apache Hadoop Project bridges this gap by releasing open-source implementations of these systems. The Apache Hadoop Project contains many sub-projects, for example Hadoop MapReduce, HDFS, and HBase, which correspond to Google MapReduce, the Google File System, and BigTable respectively. Recently, related projects such as Pig [7] and Hive [8], high-level languages for data analysis, have started to run on top of MapReduce, simplifying its use. These tools allow users to quickly write applications that take advantage of MapReduce without having to learn all of its details. These technologies promote data-intensive computing by making it easier to process large datasets.
Fig. 2. HDFS Architecture

Fig. 3. Interaction of Hadoop MapReduce and HDFS
III. DISTRIBUTED FILE SYSTEMS

We investigated several distributed file systems to find out which are suitable as a POSIX compatible storage platform. We evaluated HDFS, Gfarm, PVFS and GlusterFS. This section describes the architecture of these file systems.

A. HDFS

HDFS is a distributed file system designed to hold very large amounts of data and provide high throughput access. Files are split into chunks which are managed by different nodes in the cluster. Each chunk is replicated across several machines, so that a single machine failure does not make any data unavailable. HDFS is specialized for MapReduce applications; it therefore omits features which are not necessary for MapReduce but may be necessary for other applications. HDFS relaxes some POSIX requirements in order to achieve high throughput for streaming access. For example, HDFS does not support file modification operations other than append and does not support concurrent writes to a single file. The lack of these features makes it difficult for applications other than MapReduce to access HDFS. HDFS has a master/slave architecture, as depicted in Figure 2. The NameNode is a single master server that manages the file system namespace and regulates access from clients.
Fig. 4. Gfarm File System Architecture
Fig. 5. Interaction of Hadoop MapReduce and the Gfarm file system
Clients contact the NameNode for metadata operations such as open, close, and mkdir. Each DataNode process manages the storage attached to the node it runs on; the DataNodes are responsible for serving read and write requests from clients. HDFS can be integrated with Hadoop MapReduce, allowing data to be read and computed locally when possible. When HDFS is integrated with Hadoop MapReduce, the DataNode and the TaskTracker, Hadoop's compute server, should run on the same node in order to exploit locality, as depicted in Figure 3. When a Hadoop MapReduce application is executed, the JobTracker gets the block replica locations from the NameNode and allocates map tasks to the TaskTrackers on the slave nodes. TaskTrackers that are allocated map tasks then read the necessary data from HDFS through the DataNodes.
B. Gfarm

The Gfarm file system is a general purpose distributed file system which can be used in place of NFS. It is licensed under the BSD license and is available at http://sourceforge.net/projects/gfarm/. Like HDFS, it is a commodity-based distributed file system that federates the local file systems of hundreds to thousands of compute nodes. The Gfarm file system consists of a metadata server (MDS) and I/O servers, as depicted in Figure 4. The MDS manages file system metadata including a hierarchical namespace, file attributes, and the replica catalog. I/O servers provide file access to their local file systems.
Clients access Gfarm using the Gfarm client library, which provides basic file system operations such as open, read, write, and close by interacting with the MDS and I/O servers. Details of the Gfarm API are described on the Gfarm web page [9]. In addition, Linux clients can mount the Gfarm file system using the FUSE kernel module. Files stored in the Gfarm file system can be replicated and stored on any node, and can be accessed from any node regardless of where the file is physically stored.
Many distributed file systems, such as PVFS [10], use file striping, but Gfarm does not. Although file striping is useful for improving access performance when a small number of clients access large files, such files can instead be managed as a file group, specified by a directory or a file name with a wildcard. Using file groups instead of large striped files gives us one big advantage: by splitting large files into groups we can explicitly manage file replica placement. This is key for location-aware distributed data computing, and being able to take advantage of such locality is what differentiates Gfarm from many other distributed file systems.
The Gfarm file system provides scalable I/O performance by decentralizing disk access, in particular by giving priority to the local file system of a compute node when that node has a replica of the file needed by the current application, thereby reducing the number of times a node has to access files across the network. Gfarm decentralizes disk access using the following two approaches. The first approach concerns how to choose the most appropriate node when accessing a file. The client selects an I/O server by requesting file system node status information from the MDS and measuring the RTT from the client to each file system node that contains a copy of the requested file. Basically, file system nodes that have enough capacity and a CPU load lower than a threshold are selected in increasing order of RTT. The second approach concerns how to choose the most appropriate node when scheduling a process. The scheduler tries to select a node among the nodes that contain a replica of the file that the process will access; if such a node is busy, the scheduler selects a node that is near the replica. This gives us a greater opportunity to take advantage of file locality. We call this method file affinity scheduling. It requires help from batch schedulers, and APIs and commands for file affinity scheduling are provided by Gfarm.
Unlike Gfarm, other general purpose distributed file systems such as Ceph [11] and Lustre [12] require a dedicated object storage cluster, which needs additional storage hardware besides the compute nodes. Furthermore, these systems require sufficient network bandwidth between the storage nodes and the compute nodes, otherwise the compute nodes could become data starved. By contrast, the Gfarm file system federates the local file systems of the compute nodes and does not require any special configuration or additional hardware.
C. PVFS
PVFS is designed to provide high performance I/O for parallel applications. Each PVFS file is striped across the disks of the I/O nodes; the default stripe size is 64 KB. Striping increases throughput for a single client, but it can cause performance degradation for concurrent data access from multiple nodes, as in a MapReduce workload. MapReduce uses file affinity scheduling, which allocates tasks near the data, but a striping file system cannot exploit this kind of scheduling because each file is striped across all nodes and the data is read over the network. One could, of course, allocate a task per stripe, but 64 KB is far too small given the overhead of starting a task.
D. GlusterFS
GlusterFS is also an open source distributed file system. Unlike many distributed file systems, GlusterFS does not use a metadata server: every storage server in the cluster can determine the location of any piece of data from the path and file name using a hash function, without looking it up in an index or querying another server. Users can optionally enable replication and striping. In MapReduce workloads, GlusterFS cannot exploit locality. The location of a file in GlusterFS is fixed when the file is created, and Hadoop's file affinity scheduling cannot be used because there is no mechanism to find out on which server a file is located.
Fig. 6. Relation of Hadoop and the Hadoop-Gfarm plug-in
IV. INTERACTION BETWEEN HADOOP AND FILE SYSTEMS

Hadoop MapReduce natively supports HDFS and does not natively support Gfarm, PVFS or GlusterFS. There are two ways to use another file system as the Hadoop file system. The first is to mount the file system on each node and use it as a local file system in Hadoop (file:///mnt). The second is to implement Hadoop's abstract FileSystem API. The abstract FileSystem API includes POSIX-like operations such as read(), write(), open() and mkdir(). It also includes getFileBlockLocations(), which returns the location information of replicas; if the getFileBlockLocations() method is implemented, Hadoop MapReduce uses it to allocate map tasks near the data. We use the first method for PVFS and GlusterFS because, as explained above, they cannot exploit data locality. We developed the Hadoop-Gfarm plug-in for Gfarm because Gfarm can exploit data locality and can use Hadoop MapReduce's file affinity scheduling. We describe the details of the Hadoop-Gfarm plug-in in the next subsection.

A. Design and implementation of the Hadoop-Gfarm plug-in

In this section, we discuss the design and implementation of the Hadoop-Gfarm plug-in [13], which enables Hadoop MapReduce to access files on Gfarm file systems efficiently. Hadoop MapReduce could access files on Gfarm using FUSE. However, in that case we would have to contend with the overhead that FUSE imposes and, furthermore, we would not be able to use Hadoop MapReduce's file affinity scheduling.
Fig. 7. How each task accesses the disks on the two file systems.
use Hadoop MapReduce’s file affinity scheduling. In order to use file affinity scheduling as well as access the files on Gfarm directly, we decided to not use this approach and developed a plugin instead. The physical layout of HDFS and Gfarm are very similar as you can see by comparing Figure 2 and Figure 4. The NameNode corresponds to the Gfarm MDS, and DataNodes correspond to Gfarm I/O servers. Both file systems federate local file systems to provide a single file system. Therefore, Gfarm can be deployed in the same layout as HDFS. In particular, as depicted in Figure 5, the Gfarm MDS is deployed on the master node instead of the NameNode, and the Gfarm I/O servers are deployed on slave nodes instead of DataNodes. However, HDFS splits files into chunks which are distributed across several machines. Hadoop MapReduce allocates map tasks corresponding these chunks. Meanwhile Gfarm does not split files into chunks, but the disk access pattern of Hadoop MapReduce tasks running on Gfarm can
As depicted in Figure 7, imagine that each slave node stores two files of 256 MB each. In HDFS, these files are split into 128 MB chunks, while in Gfarm the files are not split. Suppose you run a MapReduce job with 4 tasks on HDFS; each task processes one of the blocks. When you run the same MapReduce job on Gfarm, each task processes either the first or the second half of its designated file, as shown in Figure 7. In this way, MapReduce tasks can be distributed among multiple disks on the Gfarm file system, the same as on HDFS. When the number of input files is smaller than the number of slave nodes, the disks which do not contain any of the input files are not used. However, this is not a problem since most large datasets are split into multiple small files; in fact, according to statistics from Yahoo!'s cluster, a file on average consists of 1.5 128 MB blocks [14].
Hadoop exposes the FileSystem API in Java via the FileSystem class, which handles metadata operations, the FSDataInputStream class, which handles files opened for reading, and the FSDataOutputStream class, which handles files opened for writing. The Hadoop-Gfarm plug-in implements these classes and enables affinity scheduling and reading parts of a file as explained above. There are similar implementations for other file systems, such as org.apache.hadoop.fs.kfs [15] and org.apache.hadoop.fs.s3 [16]. We use JNI (Java Native Interface) in the Hadoop-Gfarm plug-in since Hadoop is written in Java and Gfarm is written in C; that is, the Hadoop-Gfarm plug-in is a JNI shim layer, as depicted in Figure 6. Hadoop programs take a URI path of the form scheme://authority/path and use it to select the appropriate file system. For example, in order to use Gfarm, we configure Hadoop to use a URI of the form gfarm:///path. Given this URI, Hadoop knows to instantiate the GfarmFileSystem class which we implemented, and Hadoop MapReduce uses its getFileBlockLocations() method to allocate map tasks near the data.
V. PERFORMANCE EVALUATION

A. Evaluation environment and configuration

We executed write and read micro-benchmarks, the grep and sort application benchmarks, and the Gridmix workload in order to compare Gfarm, HDFS, PVFS and GlusterFS. Furthermore, we considered a real use case and evaluated the speedup gained by eliminating the redundant operations. We used two types of nodes; Tables I and II show the machine specifications of the cluster nodes. We used 15 of the nodes described in Table I and 15 of the nodes described in Table II. In the evaluation, cluster nodes in Table I are used for 1, 2, 4, and 8 nodes. For 16 nodes, the 15 nodes in Table I and one node in Table II are used. For 30 nodes, the 15 nodes in Table I and the 15 nodes in Table II are used. When we use all 30 nodes, one of the nodes doubles as both the master node and a slave node. We used Hadoop version 0.20.2, Gfarm version 2.3.0, PVFS 2.8.2, GlusterFS 3.2.1 and Linux kernel version 2.6. Each node is connected using 10 Gigabit Ethernet. HDFS, Gfarm, and GlusterFS can create replicas, but we executed the benchmarks without replication. For the micro-benchmarks, one process is allocated on each node. In the MapReduce applications, 4 map/reduce tasks are executed simultaneously on each node since each node has 8 cores.

TABLE I
MACHINE SPECIFICATION I
CPU: 2.33GHz Quadcore Xeon E5410 (2 sockets)
Memory: 32GB
Disk: Hitachi HUA72101 1TB

TABLE II
MACHINE SPECIFICATION II
CPU: 2.4GHz Quadcore Xeon E5620 (4 sockets)
Memory: 24GB
Disk: Hitachi HUA72101 1TB

B. Write performance

We used the TestDFSIO benchmark that comes with Hadoop. One process is executed on each node, and each process writes to a separate file concurrently. Each process writes a 50 GB file, which is larger than the machine's 32 GB of main memory. Figure 8 shows the write performance when we vary the number of nodes from 1 to 30. Both Gfarm and HDFS show scalable performance in terms of the number of nodes, and Gfarm shows 15% higher performance than HDFS. The performance is limited by the disk performance since each map task physically writes data to the local disk; the performance difference is considered to come from the usage of the disk cache. PVFS shows 64% lower performance than Gfarm when using 30 nodes due to the overhead imposed by striping files, as described in Section III-C. GlusterFS shows 72% lower performance than Gfarm when using 30 nodes. In GlusterFS, each node writes one file, but the file location is determined by the hash function, so some nodes may receive two files while other nodes may not receive any; GlusterFS performs worse due to this unbalanced writing.

Fig. 8. Write performance
C. Read performance

We also used TestDFSIO for our read benchmark, but with some changes. MapReduce applications can use file affinity scheduling to speed up I/O, but since TestDFSIO does not have this feature, we added it to the benchmark. Each node runs one process and reads a different 5 GB file concurrently. In order to minimize the influence of the page cache, we flushed the cache between the write and read operations. For Gfarm, the performance results are shown with and without file affinity scheduling to investigate the performance impact of the scheduling. Gfarm without affinity performs about as well as Gfarm with affinity. We used 10 Gigabit Ethernet, which means that the network bandwidth does not become a bottleneck even when using 30 nodes. Gfarm performed about as well as HDFS; the overall performance was limited by the disk performance. PVFS performed about 57% worse than Gfarm due to the overhead imposed by striping files, as described in Section III-C. Due to the large number of collisions and the unbalanced file access, GlusterFS performed about 59% worse than Gfarm. In GlusterFS, files cannot be allocated in a balanced fashion since users cannot determine the file location when writing.

Fig. 9. Read performance
D. Grep performance

The grep application scans through all the data, searching for a given input string. The grep benchmark is a read-intensive application because the output data size is much smaller than the input data size. Figure 10 shows the throughput of each of the file systems we tested. We also evaluated the performance of Gfarm without file affinity scheduling. The more nodes we use, the bigger the impact of file affinity scheduling becomes. In the read performance benchmark, running Gfarm with or without affinity did not change the throughput much; in the grep benchmark, however, adding file affinity scheduling drastically increases Gfarm's throughput. This is because Hadoop MapReduce selects input files in the order of the file name, so many map tasks can access the same file concurrently. Gfarm with affinity and HDFS performed similarly. GlusterFS performed about 90% worse than Gfarm with affinity; it uses the same access pattern, so we see the same poor performance there as well. PVFS performed about 70% worse than Gfarm with affinity and HDFS. The result is similar to the read performance result, which is understandable because grep is a read-intensive application.

Fig. 10. Grep performance

E. Terasort benchmark

We used the Terasort [17] program as a benchmark. Gfarm performed about 10% better than HDFS, as shown in Figure 11. In sort applications, the input and output data sizes are the same. In the map phase, the performance gap between Gfarm and HDFS is very small, but in the reduce phase the gap grows larger because the sort job writes a large amount of data; Gfarm is faster than HDFS because Gfarm's write performance is better, as described in Section V-B. As in the other benchmarks we ran, PVFS performed about 60% worse than Gfarm. As in the grep application, GlusterFS performed about 80% worse than Gfarm.

Fig. 11. Sort performance

F. Gridmix
Gridmix is a benchmark for live clusters. It models the mixture of jobs seen on a typical shared Hadoop cluster. We used Gridmix version 2, which generates random input data and submits several types of MapReduce jobs; users can specify the number and types of jobs via the configuration file. We used three types of jobs: JavaSort, Combiner, and MonsterQuery. JavaSort is a sort program. Combiner is a word count program that uses a combiner, a Hadoop feature. MonsterQuery is a 3-stage job which resembles multi-stage jobs such as those generated by Pig and Hive. We configured the benchmark to execute 45 jobs in total. Figure 12 shows the number of unfinished jobs over time. Gfarm finished all of the jobs about 4% faster than HDFS. The reason Gfarm is slightly faster may be that Gfarm's write performance is better than HDFS's.
Fig. 12. Gridmix

G. StarCount

In the previous evaluations, Gfarm performed as well as HDFS. Furthermore, by using Gfarm, you can eliminate redundant copy and storage operations without any performance penalty. In order to quantitatively evaluate the speedup gained by eliminating these redundant operations, we measured both the job execution time and the time required to copy the files in and out of HDFS for the StarCount application described below. The StarCount program counts the number of stars at each star magnitude; it takes star magnitudes as keys and sums of stars as values, and is similar to the WordCount program except that it counts stars instead of words. We used the 2MASS All-Sky Data [18] FITS images as our input data. First, we ran Source Extraction [19] on the FITS images to produce the Source Catalog. Then, we ran StarCount on the Source Catalog. If we run StarCount on HDFS, we need to import the Source Catalog into HDFS first, whereas when we run StarCount on Gfarm we do not need to copy the data. So when we measure the performance on HDFS, we also measure the copy time from a POSIX compatible file system such as NFS or Gfarm. The Source Catalog, which is the StarCount input data, is about 145 GB. We used 15 nodes in this evaluation. Figure 13 shows the result. On HDFS, the input data resides in the page cache since we execute the job just after the copy. On Gfarm, we measured both cases, with and without the page cache. HDFS's job execution time is almost the same as that of Gfarm with the cache. The total execution time on HDFS when copying from NFS is about 6.8 times larger than the execution time on Gfarm without the cache; the throughput of this copy was 45 MB/sec. Furthermore, the total execution time on HDFS when copying from Gfarm is about 2 times larger than the execution time on Gfarm without the cache; the throughput of this copy was 219 MB/sec.

Fig. 13. Total execution time of StarCount, including copy time
VI. RELATED WORK
A. Hadoop-PVFS
In [20], the authors compare HDFS and PVFS and show that PVFS can perform about as well as HDFS. PVFS typically keeps storage servers separate from the compute infrastructure; however, they set up the PVFS I/O servers on the same nodes as the Hadoop compute servers. They also changed the configuration, increasing the default 64 KB stripe size to 64 MB, the same as the HDFS block size. PVFS performs about as well as HDFS when these configuration changes are applied, but the changes may cause performance degradation in other applications. Since our goal is to run MapReduce applications alongside other applications, this performance penalty is not tolerable.

B. Cloudstore

CloudStore [15] is a distributed file system integrated with Hadoop. CloudStore is implemented in C++ and has a POSIX-like C++ library for accessing the file system; Hadoop MapReduce applications access CloudStore through the Java Native Interface. Gfarm and CloudStore are similar in that both have a POSIX-like interface and Hadoop MapReduce accesses them through the Java Native Interface. However, CloudStore lacks many of the features required of a general purpose file system, such as security features and file permissions, which we need in order to safely share data between many users.

C. Ceph

Ceph is a high-performance distributed file system under development since 2005 and is now included in the Linux kernel. Since Ceph has multiple metadata servers, it can bypass the scaling limits of the many distributed file systems that have a single metadata server.
There are two ways to use Ceph as the file system for Hadoop [21]. The first is mounting Ceph on the Linux machine using a Linux kernel of version 2.6.34 or later; then you can use it as a local file system in Hadoop (file:///mnt/ceph). The second is patching the Hadoop core with the patch available in the HADOOP-6253 JIRA [22]; this uses the user-level client of Ceph just as our plug-in uses the user-level Gfarm client. However, Ceph is still experimental: when we placed a heavy load on Ceph, the client crashed, and when we ran some MapReduce applications on Ceph version 0.20.2-1.fc13, the client also crashed.

D. Hadoop on GPFS

GPFS is a high-performance and highly-available distributed file system for enterprise use. The GPFS on Hadoop Project [23] allows Hadoop MapReduce applications to access files on GPFS. In their performance evaluation of MapReduce applications, GPFS is faster than HDFS. However, it requires changing the block size: they used a 128 MB block size for MapReduce and a 128 KB block size for online transaction processing. That means the data cannot really be shared by MapReduce applications and other applications even within the same file system, since the optimal block sizes differ.
E. BlobSeer

BlobSeer [24] is a large-scale distributed versioning file system. It can also be integrated with Hadoop MapReduce. BlobSeer performs better than HDFS in real MapReduce applications. However, BlobSeer stores its data in memory.

VII. CONCLUSION

In this paper we have shown that it is possible to use Gfarm, a POSIX compatible file system, in place of HDFS for Hadoop MapReduce applications. In order to process files on Gfarm efficiently when using Hadoop MapReduce, we designed and implemented the Hadoop-Gfarm plug-in. In the micro-benchmarks, Gfarm's read performance is about the same as HDFS's and Gfarm's write performance is 15% better than HDFS's. In the real MapReduce applications we tested, we have shown that Gfarm performs as well as HDFS. We also evaluated PVFS and GlusterFS, but the performance of those file systems was less than half that of Gfarm in all evaluations. PVFS performed worse than both Gfarm and HDFS since striping file systems are not suitable for applications like MapReduce where multiple clients concurrently write to files. GlusterFS performed the worst since it cannot exploit locality, which results in unbalanced file access and thus poor performance.
HDFS relaxes some POSIX requirements to enable high throughput streaming access, but this paper shows that a POSIX compatible file system can perform as well as HDFS. You can eliminate redundant copy and storage operations without performance degradation by using Gfarm instead of HDFS. In the use case scenario we created, Gfarm's total execution time is about 6.8 times less than HDFS's, since no redundant copy operations are needed when you use Gfarm.
Our future work will include a performance evaluation using replication. We carried out the benchmarks in this paper without replication since Gfarm's replication method differs greatly from that of HDFS. HDFS uses chain replication, meaning that by the time the client finishes writing the data, the replicas have already been created. Gfarm, however, creates replicas in the background after the client has finished writing the data, so even if you create replicas, write performance does not change. Although we did not use replication in this paper, replication is an important part of modern distributed file systems and we feel that including it would result in scores that more closely reflect real-world situations.

ACKNOWLEDGMENT

This work was partially supported by the JST CREST, "System Software for Post Petascale Data Intensive Science", the MEXT Promotion of Research and Development for Key Technologies, Research for Next Generation IT Infrastructure, Resources Linkage for e-Science (RENKEI), Research of Data Sharing Technology, and the MEXT Grant-in-Aid for Scientific Research on Priority Areas, "New IT Infrastructure for the Information-explosion Era" (Grant number 21013005).

REFERENCES
[1] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified data processing on large clusters," in Proc. OSDI '04, 2004.
[2] Ben Langmead, Michael C. Schatz, Jimmy Lin, Mihai Pop, and Steven L. Salzberg, "Searching for SNPs with cloud computing," Genome Biology, 10:R134, 2009.
[3] Ewa Deelman, Gurmeet Singh, Miron Livny, Bruce Berriman, and John Good, "The cost of doing science on the cloud: The Montage example," 2008.
[4] Apache, "HDFS architecture," http://hadoop.apache.org/common/docs/current/hdfs_design.html.
[5] Osamu Tatebe, Kohei Hiraga, and Noriyuki Soda, "Gfarm grid file system," New Generation Computing, vol. 28, no. 3, pp. 257-275, DOI: 10.1007/s00354-009-0089-5, 2010.
[6] M. Szeredi, "Filesystem in Userspace," http://fuse.sourceforge.net/, 2003.
[7] Apache, "Pig," http://wiki.apache.org/hadoop/pig.
[8] Apache, "Hive," http://wiki.apache.org/hadoop/Hive.
[9] Grid Datafarm, http://datafarm.apgrid.org/.
[10] Philip H. Carns, Walter B. Ligon III, Robert B. Ross, and Rajeev Thakur, "PVFS: A parallel file system for Linux clusters," in Proc. of the 4th Annual Linux Showcase and Conference, 2000.
[11] Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, and Carlos Maltzahn, "Ceph: A scalable, high-performance distributed file system," in Proc. of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI '06), pp. 307-320, 2006.
[12] P. J. Braam, "Lustre," http://www.lustre.org/.
[13] Kazuki Ohta and Shunsuke Mikami, "Hadoop-Gfarm," https://gfarm.svn.sourceforge.net/svnroot/gfarm/gfarm_hadoop.
[14] Konstantin V. Shvachko, "HDFS scalability: the limits to growth," ;login:, vol. 35, no. 2, 2010.
[15] "CloudStore," http://kosmosfs.sourceforge.net.
[16] Amazon Web Services, "S3," https://s3.amazonaws.com/.
[17] Owen O'Malley and Arun C. Murthy, "Winning a 60 second dash with a yellow elephant," 2009.
[18] R. M. Cutri et al., "2MASS All-Sky Catalog of Point Sources (Amherst: Univ. Massachusetts Press; Pasadena: IPAC)," http://www.ipac.caltech.edu/2mass/releases/allsky/, 2003.
[19] Masahiro Tanaka, https://github.com/masa16/pwrake/tree/master/demo/setile, 2010.
[20] Wittawat Tantisiriroj, Swapnil Patil, and Garth Gibson, "Data-intensive file systems for internet services: A rose by any other name," CMU-PDL-08-114, 2008.
[21] Carlos Maltzahn, Esteban Molina-Estolano, Amandeep Khurana, Alex J. Nelson, Scott A. Brandt, and Sage Weil, "Ceph as a scalable alternative to the Hadoop Distributed File System," ;login:, vol. 35, no. 4, 2010.
[22] Gregory Farnum, "Add a Ceph file system interface," https://issues.apache.org/jira/browse/HADOOP-6253, 2010.
[23] Karan Gupta, Reshu Jain, Himabindu Pucha, Prasenjit Sarkar, and Dinesh Subhraveti, "Scaling highly-parallel data-intensive supercomputing applications on a parallel clustered file system," The SC10 Storage Challenge, 2010.
[24] B. Nicolae, G. Antoniu, L. Bouge, D. Moise, and A. Carpen-Amarie, "BlobSeer: Next-generation data management for large scale infrastructures," Journal of Parallel and Distributed Computing, 2010.