Performance Evaluation of HDFS in Big Data Management

Dipayan Dev, Ripon Patgiri
Department of Computer Science & Engineering
National Institute of Technology, Silchar
Silchar, India
dev.dipayan
[email protected]
[email protected]
Abstract — The size of the data used in today's enterprises has been growing at exponential rates over the last few years. Simultaneously, the need to process and analyze these large volumes of data has also increased. To handle and analyze such large datasets, Hadoop, an open-source framework from the Apache Software Foundation, is widely used nowadays. For managing and storing all the resources across its cluster, Hadoop possesses a distributed file system called the Hadoop Distributed File System (HDFS). HDFS is written completely in Java and is designed in such a way that it can store Big Data reliably and stream it at high throughput to user applications. In recent days, Hadoop is used widely by popular organizations like Yahoo, Facebook and various online shopping market vendors. On the other hand, many experiments on data-intensive computation aim to parallelize the processing of data, but few of them actually achieve desirable performance. Hadoop, with its Map-Reduce parallel data processing capability, can achieve these goals efficiently [1]. This paper initially provides a detailed overview of HDFS. Later on, the paper reports our experimental work on Hadoop with big data and identifies the various factors that affect the performance of a Hadoop cluster. The paper concludes with the different real-field challenges of Hadoop in recent days and the scope for future work.

Keywords — HDFS; NameNode; DataNode; Job Tracker; Metadata.

I. INTRODUCTION

The last few years of internet technology, as well as the computing world in general, have seen a lot of growth in the popularity of cloud computing. As a consequence, cloud applications have given birth to big data. Hadoop, an open-source distributed system made by the Apache Software Foundation, has contributed hugely to handling and managing such big data. Hadoop [2][7] has a master-slave architecture and provides scalability and reliability to a great extent. In the last few years, this framework has been extensively used and accepted by different cloud vendors as well as big data handlers. With the help of Hadoop, a user can deploy programs and execute processes on the configured cluster. The main parts of Hadoop include HDFS (the Hadoop Distributed File System). A crucial property of the Hadoop framework is its capacity to partition the computation and data among various nodes (known as Data Nodes) and to run the computations in parallel. Hadoop increases the storage capacity, I/O bandwidth and computation capacity by adding commodity servers. HDFS keeps application data and file system meta-data on different servers. Like popular distributed file systems, viz. PVFS [3][10], Lustre [11] and GFS [9][13], HDFS too saves its meta-data on a dedicated server, known as the Name Node. The rest of the application data are kept on the slave servers, known as Data Nodes.

The main purpose of this paper is to reveal the different factors that influence the efficiency of a Hadoop cluster. None of the previous papers demonstrated this kind of work exposing the dependencies of Hadoop efficiency. In this paper, we discuss the mechanism of HDFS and try to find the different factors under which HDFS provides maximum efficiency. The remainder of this paper is organized as follows. We discuss the Hadoop architecture in Section 2. In Section 3, we discuss file I/O operations and issues related to interaction with clients. Section 4 discusses two experimental works, which reveal the major factors on which a Hadoop cluster's performance depends. Section 5 deals with the major challenges in the Hadoop field. The conclusions and future work on the concerned issues are given in Section 6.

II. ARCHITECTURE

The cluster used by the Hadoop framework has a Master/Slaves structure (Fig. 1). The master node consists of the Name Node and the Job Tracker. The main duty of the JobTracker is to initiate tasks and to track and dispatch their execution. The charge of the Data Node and the Task Tracker is given to the slave nodes [4]. The main role of the TaskTracker is to process local data, collect the result data as per the requests received from applications, and then report progress at periodic intervals to the JobTracker [5]. HDFS, which can be seen as the heart of Hadoop, delegates all of its charges to the NameNode and DataNodes, while the JobTracker and TaskTracker mainly deal with MapReduce tasks.

In this paper, we deal only with the HDFS architecture, so here is a brief description of the Name Node and Data Node [6]. A short description of the interaction of the HDFS client with the Name Node and Data Node is also portrayed.
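The JobTracker/TaskTracker division of labour described above can be sketched as a toy model (illustrative only: real Hadoop dispatches tasks with data locality, speculative execution and heartbeat-driven reporting, none of which are modeled here):

```python
class TaskTracker:
    """Slave-side daemon: processes local data and reports the result back."""
    def __init__(self, name):
        self.name = name
        self.completed = []

    def run(self, task):
        self.completed.append(task)
        return f"{self.name}:{task}:done"

class JobTracker:
    """Master-side daemon: initiates tasks and tracks their execution."""
    def __init__(self, trackers):
        self.trackers = trackers
        self.reports = []

    def dispatch(self, tasks):
        # Naive round-robin dispatch; real Hadoop schedules for data locality.
        for i, task in enumerate(tasks):
            tracker = self.trackers[i % len(self.trackers)]
            self.reports.append(tracker.run(task))
        return self.reports

jt = JobTracker([TaskTracker("tt1"), TaskTracker("tt2")])
print(jt.dispatch(["map-0", "map-1", "map-2"]))
# -> ['tt1:map-0:done', 'tt2:map-1:done', 'tt1:map-2:done']
```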
978-1-4799-5958-7/14/$31.00 ©2014 IEEE

A. NameNode

The HDFS namespace is a hierarchy of files and directories. Files and directories are represented on the NameNode by i-nodes, which record attributes like permissions, modification and access times, and namespace and disk space quotas.

Fig 1: Master/Slaves structure of Hadoop cluster
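The i-node based namespace just described can be modeled minimally as follows (an illustrative sketch; the field and function names are assumptions, not HDFS's actual Java classes):

```python
import time

class INode:
    """Minimal model of a NameNode i-node: records the attributes of a
    file or directory in the namespace hierarchy."""
    def __init__(self, name, is_dir=False, permissions=0o644):
        self.name = name
        self.is_dir = is_dir
        self.permissions = permissions
        self.modification_time = time.time()
        self.access_time = time.time()
        self.children = {} if is_dir else None  # directories hold child i-nodes
        self.quota = None  # namespace / disk-space quota, when set

def resolve(root, path):
    """Walk the namespace tree from the root directory i-node."""
    node = root
    for part in [p for p in path.split("/") if p]:
        node = node.children[part]
    return node

# Build /user/data.txt in the namespace
root = INode("/", is_dir=True)
root.children["user"] = INode("user", is_dir=True)
root.children["user"].children["data.txt"] = INode("data.txt")
print(resolve(root, "/user/data.txt").name)  # -> data.txt
```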
The file content is split into large blocks, and each block of the file is independently replicated at multiple Data Nodes. The NameNode maintains the namespace tree and the mapping of file blocks to Data Nodes. For faster execution of cluster operations, HDFS keeps the whole namespace in the Name Node's main memory.

Fig 2: Hadoop high-level architecture; interaction between the Client and HDFS
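The block split and the block-to-DataNode mapping kept by the NameNode can be illustrated with a toy model (the block size and placement here are deliberately tiny and simplistic; HDFS uses large blocks, e.g. 128 MB, and a rack-aware placement policy):

```python
BLOCK_SIZE = 4    # bytes, tiny for illustration only
REPLICATION = 3

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Split file content into fixed-size blocks, as HDFS does."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, datanodes, replication=REPLICATION):
    """Toy block-to-DataNode mapping: each block is independently
    replicated on `replication` distinct DataNodes."""
    mapping = {}
    for idx, _ in enumerate(blocks):
        # Round-robin choice of distinct nodes (real HDFS is rack-aware).
        mapping[idx] = [datanodes[(idx + r) % len(datanodes)]
                        for r in range(replication)]
    return mapping

blocks = split_into_blocks(b"hello hdfs")
mapping = place_blocks(blocks, ["dn1", "dn2", "dn3", "dn4"])
print(len(blocks), mapping[0])  # 3 blocks; block 0 lives on 3 distinct nodes
```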
The i-node data and the list of blocks belonging to each file constitute the meta-data of the name system, termed the FsImage. The persistent record of the FsImage stored in the Name Node's local file system is termed a checkpoint.

B. DataNodes

Each block replica stored on a DataNode is represented by two files in the local host's native file system. The first contains the data itself; the second acts as storage for the block's meta-data. During startup, each Data Node connects to the Name Node and performs a handshake. The purpose of this handshake is to verify the namespaceID and the software version of the Data Node, and to check whether there is a mismatch with the Name Node. If either does not match, that Data Node automatically shuts down.

A namespaceID is assigned to the file system each time it is formatted. The namespaceID is persistently stored on all nodes of the cluster. A node with a different namespaceID will not be able to join the cluster, thus maintaining the integrity of the file system. A Data Node that is newly initialized, and therefore without any namespaceID, is permitted to join the cluster and receives the cluster's namespaceID.

After the handshake is done, the Data Node registers with the Name Node. A storageID is allotted to the Data Node during its first registration with the Name Node. The Data Node periodically sends a block report to the Name Node identifying the block replicas in its control; the report contains the id and the length of each block, among other fields. During normal operation, Data Nodes also periodically send heartbeats to the Name Node to confirm that they are alive and active and that all the block replicas they store are available. If the Name Node does not receive a heartbeat from a Data Node within 10 minutes, it considers that Data Node to be dead, and the block replicas stored at that Data Node become inaccessible. The Name Node then schedules the creation of new replicas of those blocks on other, randomly selected Data Nodes.

C. HDFS Client

This part explains the interaction among the client, the Name Node and the Data Nodes. The HDFS client is the medium via which a user application accesses the file system. For faster access, the namespace is kept in the Name Node's RAM. It remains hidden from an application that the file system meta-data and the application data are placed on separate servers, and that blocks are replicated. When an application reads a file for the first time, the HDFS client gets from the Name Node the list of Data Nodes that store replicas of the blocks of that particular file. The client checks the list, then contacts a Data Node directly and requests the transfer of the desired block.

When a client wants to write to HDFS, it first sends a query to the Name Node to choose Data Nodes for hosting the replicas of the blocks of the file. When the client application receives the response from the Name Node, it writes to the given Data Nodes. When the first block is filled, or a Data Node does not have enough disk space, the client asks the Name Node to choose new Data Nodes to host replicas for the next block, and the process continues in this way. The detailed interactions among the client, the Name Node and the Data Nodes are illustrated in Figure 2.

III. FILE I/O OPERATIONS AND MANAGEMENT OF REPLICATION
A. File Read and Write

This part describes the HDFS file I/O operations. When a client wants to write data into, or add data to, HDFS, it has to do the task through an application. HDFS follows a "single-writer, multiple-reader" model [14]. When the application closes the file, the content already written cannot be modified; however, new data can be appended by reopening the file. A client that opens a file for writing is granted a lease for it. The writing application periodically renews the lease by sending heartbeats to the NameNode; when the file is closed, the lease is revoked. The lease duration is bound by a soft limit and a hard limit. As long as the soft limit lasts, the writer is granted exclusive access to the file. The client is also granted a hard limit of one hour: when the hard limit expires and the client has failed to renew the lease, HDFS automatically closes the file on behalf of the writer and recovers the lease granted for it. The lease provided to a writer never prevents other clients from reading that file; a file may be read by many readers simultaneously.

Hadoop clusters consist of thousands of nodes, so it is quite natural to come across failures frequently. The replicas stored on the Data Nodes might get corrupted as a result of memory faults or disk issues. Checksums are verified by HDFS clients to check the integrity of the data blocks of an HDFS file. During reading, this helps detect any corrupt packets in the network.
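The client-side checksum verification just described can be sketched as follows (a simplified model using CRC32 over fixed-size chunks; the chunk size and function names here are illustrative, not HDFS's actual values):

```python
import zlib

CHUNK = 512  # bytes per checksummed chunk (illustrative value)

def checksums(block):
    """Per-chunk CRC32s, as stored in the block's meta-data file on the DataNode."""
    return [zlib.crc32(block[i:i + CHUNK]) for i in range(0, len(block), CHUNK)]

def verify(block, stored):
    """Client-side check: recompute checksums and compare with those received."""
    return checksums(block) == stored

block = b"some block data" * 100
meta = checksums(block)
print(verify(block, meta))        # True: replica intact
corrupted = b"X" + block[1:]      # e.g. a flipped byte on disk
print(verify(corrupted, meta))    # False: client reports it, reads another replica
```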
All the checksums are stored in a meta-data file on the Data Node, separate from the block's data file. When HDFS reads a file, the block data and the checksums are transferred to the client. The client calculates checksums for the received data and confirms that they match the ones it received. If they do not match, the client informs the Name Node about the damaged replica and fetches another replica from a different DataNode [14].

When a file is opened by the client for reading, the client first fetches from the NameNode the whole list of blocks and the locations of their replicas. While reading the content of a block, the client tries to read the closest replica first. If this attempt fails, it tries the next one in sequence. If no DataNode is available, or every block replica is found to be corrupted by the checksum test, the read operation fails completely.

Fig. 3: Racks of commodity hardware and interaction with the NameNode and JobTracker

B. Block Placement

When we set up a large cluster, we do not connect all nodes to a single switch. A better solution is for the nodes of a rack to share a common switch, with the switches of the racks connected by one or more further switches. The commodity hardware is configured in such a way that the network bandwidth between two nodes in the same rack is greater than the network bandwidth between two nodes in different racks. Figure 3 shows the cluster topology used in the Hadoop architecture.

The distance from a node to its parent node is always measured as one. The distance between two nodes can be measured by adding up their distances to their closest common ancestor. A shorter distance between two nodes means greater bandwidth, and thus greater capacity to transfer data.

The main concern of the HDFS block placement policy is to minimize the write cost while maximizing data reliability, scalability and the overall bandwidth of the cluster. After the generation of a new block, HDFS places the first replica on the writer's node; the 2nd and 3rd replicas are stored on two different nodes in a different rack, and the rest are placed on random nodes, with the restriction that no more than one replica is stored at one node and no more than two replicas are stored in the same rack.

Summing up, HDFS follows the following replica placement policy:
1. A single Data Node does not contain more than one replica of any block.
2. A single rack never contains more than two replicas of the same block, provided there is a sufficient number of racks in the cluster.

C. Replication Management

The NameNode ensures that each block of a file always has the intended number of replicas stored on the Data Nodes. Data Nodes periodically send block reports to the Name Node, from which the Name Node detects whether a block is under- or over-replicated. When the Name Node finds a block over-replicated, it chooses a replica to remove. It prefers not to reduce the number of racks that host replicas, and removes the replica from the Data Node with the least available disk space. The purpose is to balance the storage utilization across all the DataNodes without hampering the availability of the block.

There is a replication priority queue, which stores the blocks that are under-replicated. The highest priority is given to blocks that have only one replica; blocks having more than two thirds of their specified replication factor are given the lowest priority. A thread scans the replication priority queue to determine the best position for the new replicas. Block re-replication follows the same kind of policy as new block placement. If HDFS finds that the number of existing replicas of a block is one, it places the next replica on a rack different from the existing one. The main motive here is to minimize the cost of creating new replicas of blocks.

The NameNode also distributes blocks so that all replicas of a particular block are not put on one single rack. If the Name Node detects that a block's replicas have ended up on a common rack, it treats the block as under-replicated and replicates the block to a different rack. When the Name Node receives the notification that the new replica has been created, the block becomes over-replicated, and, following the policy written above, the Name Node decides to remove an old replica, chosen randomly.
IV. PERFORMANCE EVALUATION

In this section, the performance of the Hadoop Distributed File System is evaluated in order to determine what the efficiency of a Hadoop cluster depends on.

A. Experimental Setup

For performance characterization, an 8-node Hadoop cluster was configured. The first 5 nodes provided both computation (as Map-Reduce clients) and storage resources (as Data Node servers), and the 6th and 7th nodes served as the Job Tracker (Resource Manager) and the NameNode storage manager. Each node runs at a 3.10 GHz clock speed, with 4GB of RAM, a gigabit Ethernet NIC and a 500GB hard disk. All nodes used Hadoop framework 2.4.0 and Java 1.6.0. The total configured capacity shown on the NameNode web interface is 2.66TB.

B. Test Using TestDFSIO to Find the Criteria Under Which Hadoop Shows Optimal Efficiency

This test aims to find the performance characteristics of two different file sizes and the bottlenecks posed by the network interface; the comparison checks the performance on small files against big files. Write and read tests with a 1 GB file size and a 10 GB file size are carried out, creating a total of 500 GB of data. An HDFS block size of 512 MB is used.

For this test we used an industry-standard benchmark: TestDFSIO, which measures the performance of HDFS as well as of the network and I/O subsystems. The command reads and writes files in HDFS, which is useful for measuring system-wide performance and exposing network bottlenecks on the NameNode and DataNodes. A majority of MapReduce workloads are I/O bound more than compute bound, and hence TestDFSIO can provide an accurate initial picture of such scenarios.

We executed two tests for both write and read: one with 50 files each of size 10GB, and another with 500 files each of size 1GB. As an example, the command for a read test may look like:

$ hadoop jar Hadoop-*test*.jar TestDFSIO -read -nrFiles 100 -fileSize 10000

This command reads 100 files, each of size 10GB, from HDFS and thereafter provides the necessary performance measures.

Fig 4: The write bandwidth and throughput for the 10GB file size are almost 6 times greater than for 1GB
Fig 5: The same amount of write takes almost 1.71 times longer in the case of 10GB
Fig 6: Reading the same amount of data with the 10GB file size offers close to 4 times more throughput and average bandwidth
Fig 7: The same amount of read takes almost 1.95 times longer in the case of 10GB
The command for a write test may look like:

$ hadoop jar Hadoop-*test*.jar TestDFSIO -write -nrFiles 100 -fileSize 10000

This command writes 100 files, each of size 10GB, to HDFS and thereafter provides the necessary performance measures.

TestDFSIO generates one map task per file, and splits are defined such that each map gets only one file name. After every run, the command generates a log file indicating performance in terms of 4 metrics: throughput in MBytes/s, average I/O rate in MBytes/s, I/O rate standard deviation, and execution time. The outputs of the different tests are given below.

To obtain a good picture of performance, each benchmark was run 3 times on the 1GB file size and the results were averaged to reduce the error margin. The same process was carried out on the 10GB file size to get data for comparison.

The experiments show that the test execution time, i.e. the total time it takes the hadoop jar command to execute, is almost halved during the 1GB file test. From Figures 4 and 6, we can see that the throughput and I/O rate also show a significant decline, in terms of both write and read, for the 1GB file test. This is somewhat unexpected. However, one major observation explains it. In these tests there is always one reducer that runs after all the map tasks have completed. The reducer is responsible for generating the result-set file; it basically sums up the values (rate, sqrate, size, etc.) from each of the map tasks. So the throughput, I/O rate and standard deviation results are based on individual map tasks and not on the overall throughput of the cluster. nrFiles equals the number of map tasks: in the 1GB file test there are 500/6 ≈ 83.33 map tasks running simultaneously on each node manager, versus 8.33 map tasks on each node in the 10GB file test.

The 10GB file test yields a throughput of 64.94 MB/s per map task; therefore, the 10GB file test yields 8.33 × 64.94 MB/s ≈ 540.95 MB/s per node manager. The 1GB file test yields a throughput of 11.13 MB/s per map task; therefore, the 1GB file test yields 83.33 × 11.13 MB/s ≈ 927.46 MB/s per node manager. Clearly, the 1GB file test shows the maximum efficiency. Moreover, increasing the number of commodity machines eventually decreases the execution time and increases the average I/O rate.

C. Dependence of the Execution Time of the Write Operation on the Number of Reducers and the Block Size

A word count job is submitted, and the experiment is carried out on a 100 GB file, varying the number of reducers while keeping the HDFS block size fixed. The experiment is carried out with 4 different block sizes, viz. 512MB, 256MB, 128MB and 64MB. The charts in Figures 8, 9, 10 and 11 were produced from the test reports we obtained, and the conclusions below follow from them.

All the graphs appear to form a roughly uniform straight line, or in some cases show a slight negative slope, which indicates that with an increase in the number of reducers for a given block size, the processing time either remains the same or reduces to some extent. However, a significant negative slope is visible in Figure 8, where the block size equals 512MB. On the other hand, in Figures 10 and 11, the execution times are unpredictable and show unexpected behavior.

It can be concluded that, for large block sizes, the number of reducers plays an important role in the execution time of a Map-Reduce job; for smaller block sizes, a change in the number of reducers does not bring noticeable changes in the execution time. Moreover, for significantly large files, a small change in block size does not lead to a drastic change in execution time.

Fig 8: Variation of processing times with variation of reducers, keeping block size fixed at 512MB
Fig 9: Variation of processing times with variation of reducers, keeping block size fixed at 256MB
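The per-node-manager throughput figures reported in Section IV.B can be reproduced with two lines of arithmetic (the per-map-task rates 64.94 MB/s and 11.13 MB/s and the map-task counts 8.33 and 83.33 are taken directly from the test results in the text):

```python
# Per-node-manager throughput = (map tasks per node) x (throughput per map task)
per_node_10gb = 8.33 * 64.94    # MB/s per node manager, 10GB file test
per_node_1gb = 83.33 * 11.13    # MB/s per node manager, 1GB file test

print(round(per_node_10gb, 2))  # -> 540.95
print(round(per_node_1gb, 2))   # -> 927.46
```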
Fig 10: Variation of processing times with variation of reducers, keeping block size fixed at 128MB
Fig 11: Variation of processing times with variation of reducers, keeping block size fixed at 64MB

V. MAJOR CHALLENGES IN HADOOP

Although the Hadoop framework has been widely accepted for its flexibility and fast parallel computing technology, there are still many problems, summarized briefly in the following points [7]:

1) Hadoop suffers from an irrecoverable failure: the single point of failure of the Name Node. Hadoop possesses a single master server to control all the associated slave servers executing tasks, which leads to shortcomings such as a single point of failure and a lack of space capacity, seriously affecting its scalability.

2) HDFS faces huge problems in dealing with small files. HDFS meta-data is stored in the Name Node, and the meta-data for each block occupies about 200 bytes; with a replication factor of 3 (the default), this takes approximately 600 bytes. If there is a huge number of such small files in HDFS, the Name Node will consume a lot of space. Since the Name Node keeps all the meta-data in its main memory, this is a challenging problem for researchers [8].

3) Researchers are focusing on designing a more developed Hadoop component for monitoring, while the Job Tracker will be given the charge of overall scheduling.

4) Improving data processing performance is also a major challenge for the coming days. A special optimization process should be applied based on the actual needs of the application. Different experiments show that there is considerable scope for increasing processing performance and thus improving the execution time of a particular job [15][16].

VI. CONCLUSION AND FUTURE WORK

This section describes related future work that we are considering. Hadoop, being an open-source project, justifies the addition of new features and changes for the sake of better scalability and file management. Hadoop is currently one of the best systems for managing Big Data. Still, in our experiment, we found that it performs poorly in terms of throughput when the number of files is relatively large compared to a smaller number of files. Our experiment shows that the performance bottlenecks are not directly imputable to application code but actually depend on the number of data nodes available, the size of the files used in HDFS, and the number of reducers used. It also shows how MapReduce I/O performance can vary depending on the data size, the number of map/reduce tasks, and the available cluster resources.

However, the biggest issue on which we are focusing is the scalability of the Hadoop framework. The Hadoop cluster becomes unavailable when its NameNode is down. The scalability of the Name Node [12] has been a major struggle. The Name Node keeps all the namespace and block locations in its main memory. The main challenge with the Name Node is that when its namespace table comes close to the size of its main memory, it becomes unresponsive due to Java garbage collection. This scenario is bound to happen, because the number of files stored by users is increasing exponentially. Therefore, this is a burning issue for Hadoop in recent days.

Acknowledgment

We would like to thank the Data Science & Analytics Lab of NIT Silchar for granting us the laboratory, which contributed greatly to our research and gave us the opportunity to configure Hadoop.

References
[1] M. Tu, P. Li, I-L. Yen, B. Thuraisingham, L. Khan, "Secure Data Objects Replication in Data Grid," IEEE Transactions on Dependable and Secure Computing, vol. 7, no. 1, 2010.
[2] Apache Hadoop, http://hadoop.apache.org/
[3] P. H. Carns, W. B. Ligon III, R. B. Ross, and R. Thakur, "PVFS: A parallel file system for Linux clusters," in Proc. 4th Annual Linux Showcase and Conference, 2000, pp. 317-327.
[4] Hong Sha, Yang Shenyuan, "Key technology of cloud computing and research on cloud computing model based on Hadoop," Software Guide, 2010, 9(9): 9-11.
[5] Xie Guilan, Luo Shengxian, "Research on applications based on Hadoop MapReduce model," Microcomputer & its Applications, 2010, (8): 4-7.
[6] J. Shafer, S. Rixner, and A. L. Cox, "The Hadoop Distributed Filesystem: Balancing Portability and Performance," in Proc. IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS), 2010, pp. 122-133.
[7] Xin Daxin, Liu Fei, "Research on optimization techniques for Hadoop cluster performance," Computer Knowledge and Technology, 2011, 8(7): 5484-5486.
[8] Lu Huang, Hai-Shan Chen, Ting-Ting Hu, "Research on Hadoop Cloud Computing Model and its Applications," in Proc. International Conference on Networking and Distributed Computing (ICNDC), 2012, pp. 59-63.
[9] S. Ghemawat, H. Gobioff, S. Leung, "The Google File System," in Proc. ACM Symposium on Operating Systems Principles, Lake George, NY, Oct. 2003, pp. 29-43.
[10] W. Tantisiriroj, S. Patil, G. Gibson, "Data-intensive file systems for Internet services: A rose by any other name," Technical Report CMU-PDL-08-114, Parallel Data Laboratory, Carnegie Mellon University, Pittsburgh, PA, October 2008.
[11] Lustre File System, http://www.lustre.org
[12] K. V. Shvachko, "HDFS Scalability: The limits to growth," ;login:, April 2010, pp. 6-16.
[13] M. K. McKusick, S. Quinlan, "GFS: Evolution on Fast-forward," ACM Queue, vol. 7, no. 7, New York, NY, August 2009.
[14] K. Shvachko, H. Kuang, S. Radia, R. Chansler, "The Hadoop Distributed File System," in Proc. IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), 3-7 May 2010, pp. 1-10.
[15] M. Bhandarkar, "MapReduce programming with Apache Hadoop," in Proc. IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 19-23 April 2010.
[16] Jinshuang Yan, Xiaoliang Yang, Rong Gu, Chunfeng Yuan, Yihua Huang, "Performance Optimization for Short MapReduce Job Execution in Hadoop," in Proc. Second International Conference on Cloud and Green Computing (CGC), 1-3 Nov. 2012, pp. 688-694.