Performance Evaluation of HDFS in Big Data Management

Dipayan Dev
Department of Computer Science & Engineering
National Institute of Technology, Silchar
Silchar, India
dev.dipayan [email protected]

Ripon Patgiri
Department of Computer Science & Engineering
National Institute of Technology, Silchar
Silchar, India
[email protected]

Abstract— The size of the data used in today's enterprises has been growing at exponential rates over the last few years. Simultaneously, the need to process and analyze these large volumes of data has also increased. To handle and analyze such large datasets, an open-source implementation of the Apache framework, Hadoop, is widely used nowadays. For managing and storing all the resources across its cluster, Hadoop possesses a distributed file system called the Hadoop Distributed File System (HDFS). HDFS is written completely in Java and is designed in such a way that it can store Big Data reliably and can stream it at high bandwidth to user applications. In recent days, Hadoop is used widely by popular organizations like Yahoo, Facebook and various online shopping market vendors. On the other hand, experiments on data-intensive computation are going on to parallelize the processing of data, but none of them could actually achieve a desirable performance. Hadoop, with its Map-Reduce parallel data processing capability, can achieve these goals efficiently [1]. The main purpose of this paper is to focus on and reveal the different factors that influence the efficiency of a Hadoop cluster; none of the previous papers demonstrated this kind of work that exposes the dependencies of Hadoop efficiency. The paper initially provides an overview of HDFS in detail. Later on, it reports the experimental work of Hadoop with big data and suggests the various factors that affect Hadoop cluster performance. The paper concludes with the different real-field challenges of Hadoop in recent days and the scope for future work.

Keywords— HDFS; NameNode; DataNode; JobTracker; Metadata.

I. INTRODUCTION

The last few years of internet technology, as well as the computer world, have seen a lot of growth and popularity in the field of cloud computing. As a consequence, cloud applications have given birth to big data. Hadoop, an open source distributed system made by the Apache Software Foundation, has contributed hugely to handling and managing such big data. Hadoop [2][7] has a master-slave architecture and provides scalability and reliability to a great extent. In the last few years, this framework has been extensively used and accepted by different cloud vendors as well as big data handlers. With the help of Hadoop, a user can deploy programs and execute processes on the configured cluster.

The main parts of Hadoop include HDFS (the Hadoop Distributed File System). A crucial property of the Hadoop Framework is the capacity to partition the computation and data among various nodes (known as DataNodes) and to run the computations in parallel. Hadoop increases the storage capacity, I/O bandwidth and computation capacity by adding commodity servers. HDFS keeps application data and file system meta-data on different servers. Like popular DFSs, viz. PVFS [3][10], Lustre [11] and GFS [9][13], HDFS too saves its meta-data on a dedicated server, known as the NameNode. The rest of the data, the application data, are kept on the slave servers, known as DataNodes.

In this paper, we have discussed the mechanism of HDFS and have tried to find out the different factors on which HDFS provides maximum efficiency. The remainder of this paper is organized as follows. We discuss the Hadoop architecture in Section 2. In Section 3, we discuss file I/O operations and issues related to interaction with the clients. Section 4 discusses two experimental works, which reveal the major factors on which a Hadoop cluster's performance depends. Section 5 deals with the major challenges in the Hadoop field. The conclusions and future work on the concerned issues are given in Section 6.

978-1-4799-5958-7/14/$31.00 ©2014 IEEE

II. ARCHITECTURE

The cluster used by the Hadoop framework follows a Master/Slaves structure (Fig. 1). The master node consists of the NameNode and the JobTracker. The main duty of the JobTracker is to initiate tasks and to track and dispatch their execution. The charge of the DataNode and the TaskTracker is given to the slave nodes [4]. The main role of the TaskTracker is to process local data, collect the result data as per the requests received from applications, and then report the progress at periodic intervals to the JobTracker [5]. HDFS, which is the heart of Hadoop, gives all of its charges to the NameNode and DataNodes, while the JobTracker and TaskTrackers mainly deal with MapReduce tasks.

In this paper, we are only dealing with the HDFS architecture, so here is a brief account of the NameNode and the DataNode [6]. A short description of the interaction of the HDFS client with the NameNode and DataNode is also portrayed.

A. NameNode

The HDFS namespace is a hierarchy of files and directories. Files and directories are represented on the NameNode by i-nodes, which record attributes like permissions, modification and access times, namespace quota and disk space quota.

Fig 1: Master/Slaves structure of Hadoop cluster

The file content is split into large blocks, and each block of the file is independently replicated at multiple DataNodes. The NameNode maintains the namespace tree and the mapping of file blocks to DataNodes. For faster execution of the cluster operations, HDFS has to keep the whole namespace in its main memory. The i-node data and the list of blocks belonging to each file constitute the meta-data of the name system, termed the FSImage. The persistent record of the FSImage stored in the NameNode's local file system is termed a checkpoint.

B. DataNodes

A block replica stored on a DataNode is represented by two files in the local host's own file system. The first one holds the actual data, and the second acts as storage for the meta-data of the block. During startup, each DataNode connects to the NameNode in a procedure that works just like a handshake. The purpose of this handshake is to verify the namespaceID and to check whether there is a mismatch between the software versions of the DataNode and the NameNode. If either does not match with the NameNode, that particular DataNode automatically shuts down.

A namespaceID is assigned to the file system each time it is formatted. The namespaceID is persistently stored on all nodes of the cluster. A node with a different namespaceID will not be able to join the cluster, thus maintaining the integrity of the file system. A DataNode that is newly initialized and without any namespaceID is permitted to join the cluster and receives the cluster's namespaceID.

After the handshake is done, the DataNode registers with the NameNode. A storageID is allotted to the DataNode during its first registration with the NameNode. A DataNode periodically sends a block report to the NameNode to identify the block replicas in its control; the report consists of the block id, the length of each block, etc. During normal operation, all the DataNodes also periodically send a heartbeat to the NameNode to confirm that they are alive and active and that all the block replicas they store are available. If the NameNode does not receive a heartbeat from a DataNode in 10 minutes, it considers the DataNode to be dead, and the block replicas stored at that DataNode become inaccessible. The NameNode then reschedules and allocates all of those blocks to other, randomly selected DataNodes.

C. HDFS Client

This part explains the interaction among the client, the NameNode and the DataNodes. The HDFS client is the medium via which a user application accesses the file system. For faster access, the namespace is kept in the NameNode's RAM. It remains unknown to an application that the file system meta-data and the application data are put on separate servers, as does the replication of blocks. When an application reads a file for the first time, the HDFS client gets from the NameNode the lists of DataNodes that store replicas of the blocks of that particular file. The client checks the lists, then contacts a DataNode directly and requests it to transfer the desired block for reading.

When a client wants to write into HDFS, it first sends a query to the NameNode to choose DataNodes for hosting replicas of the blocks of the file. When the client application gets a response from the NameNode, it starts writing to the given DataNode. When it finds that the first block is filled, or that the DataNode doesn't have enough disk space, the client requests the NameNode to choose new DataNodes for hosting replicas of the next block, and in this way the process continues. The detailed interactions among the client, the NameNode and the DataNodes are illustrated in Figure 2.

Fig 2: Hadoop high-level architecture, interaction between Client and HDFS
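The heartbeat and block-report bookkeeping described above can be sketched as a simple model (an illustrative Python sketch of our own, not actual NameNode code; the 10-minute timeout is the one stated in the text):

```python
# Simplified model of the NameNode's liveness check: a DataNode that has
# not sent a heartbeat within the threshold is declared dead, and all of
# its replicas must be re-created on other, randomly selected DataNodes.
HEARTBEAT_TIMEOUT_SECS = 10 * 60  # 10 minutes, as stated in the text

class HeartbeatMonitor:
    def __init__(self):
        self.last_heartbeat = {}   # datanode id -> time of last heartbeat
        self.block_map = {}        # datanode id -> set of block ids it holds

    def receive_heartbeat(self, node_id, now):
        self.last_heartbeat[node_id] = now

    def receive_block_report(self, node_id, block_ids):
        self.block_map[node_id] = set(block_ids)

    def dead_nodes(self, now):
        return [n for n, t in self.last_heartbeat.items()
                if now - t > HEARTBEAT_TIMEOUT_SECS]

    def blocks_to_rereplicate(self, now):
        # Replicas on dead nodes become inaccessible and are rescheduled.
        lost = set()
        for n in self.dead_nodes(now):
            lost |= self.block_map.get(n, set())
        return lost

m = HeartbeatMonitor()
m.receive_heartbeat("dn1", now=0)
m.receive_block_report("dn1", ["blk_1", "blk_2"])
m.receive_heartbeat("dn2", now=500)
m.receive_block_report("dn2", ["blk_2"])
print(m.dead_nodes(now=700))                 # ['dn1'] (700s since last beat > 600s)
print(sorted(m.blocks_to_rereplicate(700)))  # ['blk_1', 'blk_2']
```

Note that "dn2" keeps "blk_2" alive on its own node; the NameNode still re-replicates it because the model does not track replica counts, which the real NameNode of course does.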

III. FILE I/O OPERATIONS AND MANAGEMENT OF REPLICATION

A. File Read and Write

This part of the paper describes the operation of HDFS for different file I/O operations. When a client wants to write some data or add something into HDFS, it has to do the task through an application. HDFS follows a "single-writer, multiple-reader" model [14]. When the application closes the file, the content already written cannot be manipulated; however, new data can be appended by reopening the file. A client that opens a file for writing is granted a lease for it, and the writing application periodically renews the lease by sending heartbeats to the NameNode. When the file is closed, the lease is revoked. The lease duration is bound by a soft limit and a hard limit. The writer is granted exclusive access to the file as long as the soft limit persists. The client application is also bound by a hard limit of one hour: when the hard limit expires and the client has failed to renew the lease, HDFS automatically closes the file on behalf of the writer and recovers the lease granted for it. The lease provided to a writer never prevents other clients from reading that file; HDFS permits many simultaneous readers of a file.

Hadoop clusters consist of thousands of nodes, so it is quite natural to come across failures frequently. The replicas that are stored in the DataNodes might get corrupted as a result of memory faults or disk issues. Checksums are therefore verified by HDFS clients to check the integrity of the data blocks of an HDFS file; during reading, this helps the detection of any packet corrupted in the network. All the checksums are stored in a meta-data file on the DataNode's system, which is separate from the block's data file. During the reading of files from HDFS, all block data and checksums are transferred to the client. The client calculates the checksums for the data and confirms that the newly calculated checksums match the ones it received. If they don't match, the client informs the NameNode about the damaged replica and fetches another replica from a different DataNode [14].

When a file is opened by the client for reading, the client first fetches the whole list of blocks and the locations of the replicas from the NameNode. In reading the content of a block, the client tries to read the closest replica first; if this attempt fails, it tries the next one in sequence. If the DataNode is not available, or if the block replica is found to be corrupted (checksum test), the read operation fails completely.

Fig. 3: Racks of commodity hardware and interaction with NameNode and JobTracker
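The client-side integrity check described above can be illustrated with a short sketch. We use CRC32 over fixed-size chunks as a stand-in; HDFS likewise uses CRC-based chunk checksums, but the exact chunking and on-disk format are simplified here:

```python
import zlib

CHUNK = 512  # bytes per checksum; 512 is a typical HDFS chunk size

def checksums(data: bytes):
    """Checksum each fixed-size chunk, as the DataNode stores them in the
    block's separate meta-data file."""
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def verify(data: bytes, stored: list) -> bool:
    """Client-side check: recompute checksums for the received block data
    and compare against those received from the DataNode. On a mismatch,
    the client would report the damaged replica to the NameNode and read
    another replica instead."""
    return checksums(data) == stored

block = b"some block contents" * 100
meta = checksums(block)
print(verify(block, meta))       # True
corrupted = b"x" + block[1:]     # corrupt the first byte
print(verify(corrupted, meta))   # False
```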

B. Block Placement

When we want to set up a large cluster, we don't connect all nodes to one particular switch. A better solution is that the nodes of a rack share a common switch, and the rack switches are connected by one or more core switches. The configuration of commodity hardware is done in such a way that the network bandwidth between two nodes in the same rack is greater than the network bandwidth between two nodes in different racks. Figure 3 shows a cluster topology used in the Hadoop architecture.

The distance from a node to its parent node is always measured as one. The distance between two nodes can be measured by adding up their distances to their closest common ancestor. A shorter distance between two nodes means greater bandwidth, which eventually increases the capacity to transfer data.
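The distance rule above (one hop to a parent; two nodes are separated by the sum of their hops up to the closest common ancestor) can be sketched as:

```python
def distance(path_a, path_b):
    """Topological distance between two nodes given their location paths,
    e.g. ("dc1", "rack1", "n1"). Each step to a parent costs 1, so the
    distance is the number of hops from each node up to their closest
    common ancestor. The path names here are our own illustration."""
    common = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        common += 1
    return (len(path_a) - common) + (len(path_b) - common)

same_node = distance(("dc1", "rack1", "n1"), ("dc1", "rack1", "n1"))
same_rack = distance(("dc1", "rack1", "n1"), ("dc1", "rack1", "n2"))
off_rack  = distance(("dc1", "rack1", "n1"), ("dc1", "rack2", "n3"))
print(same_node, same_rack, off_rack)  # 0 2 4
```

The smaller the distance, the higher the expected bandwidth: same node (0), same rack (2), different rack (4).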

The main concern of the HDFS block placement policy is the minimization of write cost and the maximization of data reliability, availability and overall read bandwidth of the cluster. After the generation of a new block, HDFS places the first replica on the node where the writer is located; the 2nd and 3rd replicas are stored on two different nodes in a different rack, and the rest are placed on random nodes. HDFS imposes the restriction that more than one replica cannot be stored at one node, and more than two replicas cannot be stored in the same rack. The main motive here is to minimize the cost of creating new replicas of blocks.

Summing up all of the above, HDFS follows the following replica placement policy:

1. A single DataNode does not contain more than one replica of any block.

2. A single rack never contains more than two replicas of the same block, given that there is a significant number of racks in the cluster.

C. Replication management

The NameNode ensures that each block of a file always has the intended number of replicas stored in the DataNodes. DataNodes periodically send their block reports to the NameNode, and from these the NameNode detects whether a block is under- or over-replicated. When the NameNode finds a block over-replicated, it chooses a replica to remove. The NameNode prefers not to reduce the number of racks that host replicas, and removes the replica from the DataNode with the least available disk space. The purpose is to balance storage utilization across the DataNodes without hampering block availability.

There is a replication priority queue, which stores the blocks that are under-replicated. The highest priority is given to blocks that have only one replica, while blocks having more than two-thirds of their specified replication factor are given the lowest priority. A background thread scans the replication priority queue to determine where to place the new replicas. Block replication follows the same kind of policy as new block placement: if HDFS finds that the number of existing replicas of a block is one, it places the next replica on a rack different from the existing one.

The NameNode also distributes the blocks so that all replicas of a particular block are not put on one single rack. If the NameNode detects that a block's replicas have ended up on one common rack, it treats the block as under-replicated and replicates it to a different rack. Once the NameNode receives the notification that a replica has been created on a different node, the block becomes over-replicated; following the policy written above, the NameNode then decides to remove an old replica, chosen randomly.
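The two placement rules summarized above can be expressed as a small validity check (an illustrative helper of our own, not HDFS code):

```python
from collections import Counter

def placement_ok(replica_nodes, node_to_rack, total_racks):
    """Check a block's replica placement against the two rules in the text:
    (1) no DataNode holds more than one replica of the block;
    (2) no rack holds more than two replicas, provided the cluster has
        a significant number of racks (here simplified to: more than one)."""
    if len(set(replica_nodes)) != len(replica_nodes):
        return False  # rule 1 violated: some node holds two replicas
    if total_racks > 1:
        racks = Counter(node_to_rack[n] for n in replica_nodes)
        if any(count > 2 for count in racks.values()):
            return False  # rule 2 violated: more than two replicas in a rack
    return True

rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
print(placement_ok(["n1", "n2", "n3"], rack_of, total_racks=2))  # True
print(placement_ok(["n1", "n1", "n3"], rack_of, total_racks=2))  # False
print(placement_ok(["n1", "n2"], rack_of, total_racks=2))        # True (two replicas in one rack is allowed)
```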

IV. PERFORMANCE EVALUATION

In this section, the performance of the Hadoop Distributed File System is evaluated in order to determine what the efficiency of a Hadoop cluster depends on.

Fig 4: The write throughput and average I/O rate for the 10GB file size are almost 6 times greater than for 1GB

A. Experimental Setup

For performance characterization, an 8-node Hadoop cluster was configured. The first 5 nodes provided both computation (as Map-Reduce clients) and storage resources (as DataNode servers), and the 6th and 7th nodes served as the JobTracker (Resource-Manager) and the NameNode storage manager. Each node runs at a 3.10 GHz clock speed with 4GB of RAM and a gigabit Ethernet NIC. All nodes used Hadoop framework 2.4.0 and Java 1.6.0, and each node is configured with a 500GB hard disk. The total configured capacity shown on the NameNode web interface is 2.66TB.

Fig 5: The same amount of write takes almost 1.71 times more time in the case of 10GB

B. Test using TestDFSIO to find the various criteria on which Hadoop shows most optimal efficiency

The test process aims to find the optimal efficiency of the performance characteristics of two different sizes of files and

the bottlenecks posed by the network interface. The comparison is done to check the performance difference between small and big files. A test of write and read between 1 GB files and 10 GB files is carried out, and a total of 500 GB of data is created through it. An HDFS block size of 512 MB is used.

For this test we have used an industry standard benchmark: TestDFSIO is used to measure the performance of HDFS as well as of the network and I/O subsystems. The command reads and writes files in HDFS, which is useful in measuring system-wide performance and exposing network bottlenecks on the NameNode and DataNodes. A majority of MapReduce workloads are I/O bound more than compute bound, and hence TestDFSIO can provide an accurate initial picture of such scenarios.

Fig 6: Reading the same amount of data with a 10GB file size offers close to 4 times more throughput and average bandwidth

We executed two tests for both write and read: one with 50 files, each of size 10GB, and the other with 500 files, each of size 1GB. As an example, the command for a read test may look like:

$ hadoop jar Hadoop-*test*.jar TestDFSIO -read -nrFiles 100 -fileSize 10000

This command will read 100 files, each of size 10GB, from the HDFS and thereafter provide the necessary performance measures.



Fig 7: The same amount of read takes almost 1.95 times more time in the case of 10GB

The command for a write test may look like:

$ hadoop jar Hadoop-*test*.jar TestDFSIO -write -nrFiles 100 -fileSize 10000

This command will write 100 files, each of size 10GB, to the HDFS and thereafter provide the necessary performance measures. TestDFSIO generates 1 map task per file, and splits are defined such that each map gets only one file name. After every run, the command generates a log file indicating performance in terms of 4 metrics: Throughput in MBytes/s, Average I/O rate in MBytes/s, I/O rate standard deviation, and execution time.

The outputs of the different tests are given below. To obtain a good picture of performance, each benchmark tool was run 3 times on the 1GB file size and the results were averaged to reduce the error margin. The same process was carried out on the 10GB file size to get data for comparison.

The experiments show that the test execution time is almost half during the 1GB file test; this is the total time it takes for the Hadoop jar command to execute. From figures 4 and 6, we can visualize that the throughput and I/O rate also show a significant decline, in terms of both write and read, for the 1GB file test. This is somewhat unexpected in nature. However, one major conclusion that we encountered is as follows: in these tests there is always one reducer that runs after all the map tasks have completed. The reducer is responsible for generating the result set file; it basically sums up the values ("rate", "sqrate", "size", etc.) from each of the map tasks. So the Throughput, I/O rate and standard deviation results are based on individual map tasks and not on the overall throughput of the cluster. The nrFiles is equal to the number of map tasks. In the 1GB file test there will be (500/6) = 83.33 (approx.) map tasks running simultaneously on each node manager, versus 8.33 map tasks on each node in the 10GB file test.

The 10GB file test yields throughput results of 64.94 MB/s on each node manager; therefore, the 10GB file test yields (8.33 * 64.94 MB/s) = 540.95 MB/s per node manager. Whereas the 1GB file test yields throughput results of 11.13 MB/s on each node manager; therefore, the 1GB file test yields (83.33 * 11.13 MB/s) = 927.46 MB/s per node manager. Clearly the 1GB file test shows the maximum efficiency. Moreover, increasing the number of commodity hardware nodes eventually decreases the execution time and increases the average I/O rate.

C. Dependence of execution time of the write operation on the number of reducers and the block size

A word count job is submitted, and the experiment is carried out on a 100 GB file, varying the number of reducers while keeping the block size of HDFS fixed. The experiment is carried out with 4 different block sizes, viz. 512MB, 256MB, 128MB and 64MB. Based on the test reports we obtained, the charts in figures 8, 9, 10 and 11 have been made and the proper conclusions drawn.

All the graphs appear to form a uniform straight line, or in some cases show a slight negative slope, which indicates that with an increase in the number of reducers for a given block size, the processing time either remains the same or reduces to some extent. However, a significant negative slope is visible in figure 8, where the block size equals 512MB. On the other hand, in figures 10 and 11, the execution times are unpredictable and show unexpected behavior. It can be concluded that, for large block sizes, the reducers play an important role in the execution time of a Map-Reduce job; for smaller block sizes, a change in the number of reducers doesn't bring noticeable changes in the execution time. Moreover, for significantly large files, a small change in block size doesn't lead to a drastic change in execution time.

Fig 8: Variation of processing times with variation of reducers, keeping block size fixed at 512MB

Fig 9: Variation of processing times with variation of reducers, keeping block size fixed at 256MB
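The per-node aggregate throughput arithmetic in the TestDFSIO comparison above can be checked directly:

```python
# Reproducing the paper's per-node-manager aggregate throughput estimate:
# concurrent map tasks per node * per-map-task throughput.
maps_10gb = 8.33    # approx. concurrent map tasks per node, 10GB test (50/6)
maps_1gb = 83.33    # approx. concurrent map tasks per node, 1GB test (500/6)

agg_10gb = maps_10gb * 64.94   # MB/s per node manager in the 10GB test
agg_1gb = maps_1gb * 11.13     # MB/s per node manager in the 1GB test

print(round(agg_10gb, 2))  # 540.95
print(round(agg_1gb, 2))   # 927.46
print(agg_1gb > agg_10gb)  # True: the 1GB test moves more data per node
```

So although each 1GB map task is individually slower (11.13 MB/s vs 64.94 MB/s), the much larger number of concurrent map tasks makes the 1GB test the more efficient one per node.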

Fig 10: Variation of processing times with variation of reducers, keeping block size fixed at 128MB

Fig 11: Variation of processing times with variation of reducers, keeping block size fixed at 64MB

V. MAJOR CHALLENGES IN HADOOP

Although the Hadoop Framework has been widely accepted for its flexibility and fast parallel computing technology, there are still many problems, summarized in the following points [7]:

1) Hadoop suffers from an irrecoverable failure called the single point of failure of the NameNode. Hadoop possesses a single master server to control all the associated sub-servers (slaves) for the tasks to execute, which leads to server shortcomings like a single point of failure and lack of space capacity, and these seriously affect its scalability.

2) HDFS faces huge problems dealing with small files. HDFS data are recorded in the NameNode as meta-data, and each meta-data entry corresponding to a block occupies about 200 bytes; taking a replication factor of 3 (the default), it would take approximately 600 bytes. If there is a huge number of such small files in HDFS, the NameNode will consume a lot of space. The NameNode keeps all the meta-data in its main memory, which leads to a challenging problem for researchers [8].

3) Researchers are focusing on designing a more developed version of a Hadoop component for monitoring, while the JobTracker will be given the charge of overall scheduling.

4) Improving data processing performance is also a topic of major challenge for the upcoming days. A special optimization process should be assigned based on the actual need of the application. Different experiments show that there are various scopes to increase the processing performance and thus improve the time complexity of data for the execution of a particular job [15][16].

VI. CONCLUSION AND FUTURE WORK

This section of the paper describes related future work that we are considering; Hadoop being an open source project justifies the addition of new features and changes for the sake of better scalability and file management. Hadoop is recently one of the best systems for managing Big Data. Still, in our experiment, we have found that it performs poorly in terms of throughput when the number of files is relatively large compared to a smaller number of files. Our experiment shows how the performance bottlenecks are not directly imputable to application code but actually depend on the number of data nodes available, the size of the files used in HDFS, and also on the number of reducers used. It also shows how MapReduce I/O performance can vary depending on the data size, the number of map/reduce tasks, and the available cluster resources.

However, the biggest issue on which we are focusing is the scalability of the Hadoop Framework. The Hadoop cluster becomes unavailable when its NameNode is down. The scalability issue of the NameNode [12] has been a major struggle. The NameNode keeps all the namespace and block locations in its main memory. The main challenge with the NameNode has been that when its namespace table comes close to filling the NameNode's main memory, the NameNode becomes unresponsive due to Java garbage collection. This scenario is bound to happen because the number of files used by the users is increasing exponentially. Therefore, this is a burning issue for Hadoop in recent days.

Acknowledgment

We would like to thank the Data Science & Analytics Lab of NIT Silchar for granting us the laboratory, which contributed a lot to our research and gave us the opportunity to configure Hadoop.

References

[1] Manghui Tu, Peng Li, I-Ling Yen, Bhavani Thuraisingham, Latifur Khan, "Secure Data Objects Replication in Data Grid," IEEE Transactions on Dependable and Secure Computing, Vol. 7, No. 1, 2010.

[2] Apache Hadoop, http://hadoop.apache.org/

[3] P. H. Carns, W. B. Ligon III, R. B. Ross, and R. Thakur, "PVFS: A parallel file system for Linux clusters," in Proc. of 4th Annual Linux Showcase and Conference, 2000, pp. 317-327.

[4] Lustre File System, http://www.lustre.org

[5] Xie Guilan, Luo Shengxian, "Research on applications based on Hadoop MapReduce model," Microcomputer & its Applications, 2010, (8): 4-7.

[6] Jeffrey Shafer, Scott Rixner, and Alan L. Cox, "The Hadoop Distributed Filesystem: Balancing Portability and Performance," Performance Analysis of Systems & Software (ISPASS), 2010, pp. 122-133.

[7] Xin Daxin, Liu Fei, "Research on optimization techniques for Hadoop cluster performance," Computer Knowledge and Technology, 2011, 8(7): 5484-5486.

[8] Lu Huang, Hai-Shan Chen, Ting-Ting Hu, "Research on Hadoop Cloud Computing Model and its Applications," Networking and Distributed Computing (ICNDC), 2012, pp. 59-63.

[9] S. Ghemawat, H. Gobioff, S. Leung, "The Google file system," in Proc. of ACM Symposium on Operating Systems Principles, Lake George, NY, Oct 2003, pp. 29-43.

[10] W. Tantisiriroj, S. Patil, G. Gibson, "Data-intensive file systems for Internet services: A rose by any other name," Technical Report CMU-PDL-08-114, Parallel Data Laboratory, Carnegie Mellon University, Pittsburgh, PA, October 2008.

[11] Hong Sha, Yang Shenyuan, "Key technology of cloud computing and research on cloud computing model based on Hadoop," Software Guide, 2010, 9(9): 9-11.

[12] K. V. Shvachko, "HDFS Scalability: The limits to growth," ;login:, April 2010, pp. 6-16.

[13] M. K. McKusick, S. Quinlan, "GFS: Evolution on Fast-forward," ACM Queue, vol. 7, no. 7, New York, NY, August 2009.

[14] Shvachko, K., Hairong Kuang, Radia, S., Chansler, R., "The Hadoop Distributed File System," Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, pp. 1-10, 3-7 May 2010.

[15] Bhandarkar, M., "MapReduce programming with Apache Hadoop," Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on, 19-23 April 2010.

[16] Jinshuang Yan, Xiaoliang Yang, Rong Gu, Chunfeng Yuan, Yihua Huang, "Performance Optimization for Short MapReduce Job Execution in Hadoop," Cloud and Green Computing (CGC), 2012 Second International Conference on, pp. 688-694, 1-3 Nov. 2012.
