Performance-Aware Scheduling for Data-Intensive Cloud Computing

A Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in Engineering

Performance-Aware Scheduling for Data-Intensive Cloud Computing

Ph.D. Candidate: Shadi Ibrahim

Major: Computer Architecture

Supervisor: Prof. Hai Jin

Huazhong University of Science and Technology
Wuhan 430074, P. R. China
August 2011

Declaration of Originality

I hereby declare that this dissertation presents research work carried out by me under the guidance of my supervisor, together with the results obtained from it. To the best of my knowledge, except for the content explicitly cited in the text, this dissertation contains no research results that have been published or written by any other individual or group. All individuals and groups who contributed to the research reported here have been clearly acknowledged in the text. I am fully aware that I bear the legal consequences of this declaration.

Signature of the author:            Date:

Authorization for the Use of the Dissertation

The author of this dissertation fully understands the university's regulations on retaining and using dissertations, namely: the university has the right to retain the dissertation and to submit copies and an electronic version to the relevant national departments or institutions, and to allow the dissertation to be consulted and borrowed. I authorize Huazhong University of Science and Technology to include all or part of this dissertation in relevant databases for retrieval, and to preserve and compile it by photocopying, reduced-format printing, scanning, or other means of reproduction.

This dissertation is:
    Confidential □, and this authorization applies after declassification in the year _____.
    Not confidential □.
(Please tick "√" in the appropriate box above.)

Signature of the author:            Date:

Signature of the supervisor:        Date:



Abstract

Data volumes are ever growing, from traditional applications such as databases and scientific computing to emerging applications like Web 2.0 and online social networks. This growth has driven intensive research on scalable data-intensive systems, including MapReduce and Dryad. Among these systems, Hadoop, an open-source MapReduce implementation, is widely adopted both by companies such as Facebook and Yahoo! and by academia. Recently, MapReduce has also been deployed in the cloud as software-as-a-service. Due to this wide adoption, the performance of Hadoop in particular (and MapReduce in general) has received much attention in systems research. Meanwhile, virtual machines (VMs) have become increasingly important for supporting efficient and flexible resource provisioning. By means of this technique, cloud computing provides users with the ability to perform elastic computation using large pools of VMs, without the burden of owning or maintaining physical infrastructure. Consequently, when building large-scale data-intensive systems for the cloud (data-intensive cloud computing), developers need to understand the principles of designing large systems in order to obtain performance guarantees, load balancing, and fair charging for the use of resources.

Performance in data-intensive cloud computing is determined by many factors, including data locality, the application type, and the underlying, mainly VM-based, cloud infrastructure. First, a novel replica-aware map scheduler named Maestro is presented to overcome non-local map execution in MapReduce systems. In Maestro, map tasks are scheduled in two phases: the first-wave scheduler assigns maps when the job is initialized, to fill all the empty slots, and the run-time scheduler then assigns map tasks according to data locality, node availability, and block weight, a measure of the probability that a given replica is the best one on which to schedule the task. Notably, Maestro not only efficiently achieves higher locality in MapReduce-like systems, but also reduces unnecessary map-task speculation and balances the intermediate data distribution before the shuffle phase.

Existing MapReduce systems overlook the data skew problem that arises when there is significant variance in both the frequencies of intermediate keys and their distributions among the different data nodes, referred to here as partitioning skew. Experimental results with Hadoop demonstrate that, in the presence of partitioning skew, applications suffer performance degradation due to long data transfers during the shuffle phase, along with computation skew, particularly in the reduce phase. To address this problem, a novel algorithm for locality-aware and fairness-aware key partitioning in MapReduce, referred to as LEEN, is developed. LEEN embraces an asynchronous map and reduce scheme: all buffered intermediate keys are partitioned according to their frequencies and the fairness of the expected data distribution after the shuffle phase. LEEN not only efficiently achieves higher locality and reduces the amount of shuffled data, but also guarantees a fair distribution of the reduce inputs.

In the cloud, the computing unit is VM-based; it is therefore important to demonstrate the applicability of data-intensive computing on a virtualized data center. Although virtualization brings many benefits, such as improved resource utilization and isolation, VM interference poses a challenging problem for performance predictability and system throughput in large-scale virtualized environments. To this end, a quantitative analysis of the impact of interference on system fairness is presented. Because the cloud is an economics-based distributed system, the concept of pricing fairness is adopted from microeconomics. The analysis shows that the current pay-as-you-go pricing is neither personally nor socially fair. Accordingly, to resolve the unfairness caused by interference, a new pricing scheme, pay-as-you-consume, is proposed, in which users are charged according to their effective resource consumption, excluding interference. The key idea behind the pay-as-you-consume pricing scheme is a machine-learning-based model that predicts the relative cost of interference. Preliminary experimental results with Xen demonstrate the accuracy of the prediction model and the fairness of the pay-as-you-consume pricing scheme.

The introduction of virtualization into Hadoop clusters poses new challenges due to the architectural design of the hypervisor. A series of experiments is conducted to measure and analyze the performance of Hadoop on VMs in terms of Hadoop Distributed File System (HDFS) throughput, performance variation under different VM consolidation and configuration settings, and task speculation. As a result, this dissertation outlines several issues that will need to be considered when adapting MapReduce to run entirely on virtual machines, such as decoupling the storage system (HDFS) from the computation unit (VMs). A novel MapReduce framework that runs on virtual machines, called Cloudlet, is then proposed.

Virtualization interference arises from intertwined factors, including the application type, the number of concurrent VMs, and the VM scheduling algorithms used within the host. Further studies revealed that selecting the appropriate disk I/O scheduler pair can significantly affect application performance. Furthermore, a typical Hadoop application consists of different interleaving stages, each with different I/O workloads and access patterns. As a result, a fixed disk scheduler pair is not only sub-optimal across different MapReduce applications, but also sub-optimal across different sub-phases of the same job. Accordingly, a novel approach is proposed for adaptively tuning the disk scheduler pair, in both the hypervisor and the virtual machines, during the execution of a single MapReduce job. Experimental results show that MapReduce performance can be significantly improved; specifically, adaptive tuning of the disk scheduler pair achieves a 25% performance improvement on a sort benchmark with Hadoop.

Keywords: Cloud computing, Virtualization, MapReduce, Hadoop, Replica-aware, Skew partitioning, Meta-scheduler, Fairness.

Index

Abstract ......................................................................... I

1 INTRODUCTION
  1.1 CLOUD COMPUTING ........................................................ (2)
  1.2 GOOGLE MAPREDUCE ....................................................... (4)
  1.3 VIRTUALIZATION TECHNOLOGY .............................................. (6)
  1.4 RELATED WORK ........................................................... (11)
  1.5 THESIS ORGANIZATION .................................................... (18)

2 REPLICA-AWARE TASK SCHEDULING
  2.1 MAP SCHEDULING IN HADOOP ............................................... (20)
  2.2 EMPIRICAL STUDY ON NON-LOCAL MAP IMPACTS ............................... (22)
  2.3 MAESTRO DESIGN ......................................................... (26)
  2.4 PERFORMANCE EVALUATION ................................................. (31)
  2.5 SUMMARY ................................................................ (33)

3 LOCALITY AND FAIRNESS-AWARE KEY PARTITIONING
  3.1 BACKGROUND ............................................................. (34)
  3.2 EMPIRICAL STUDY ON THE IMPACTS OF PARTITIONING SKEW .................... (37)
  3.3 LEEN DESIGN ............................................................ (39)
  3.4 PERFORMANCE EVALUATION ................................................. (43)
  3.5 SUMMARY ................................................................ (47)

4 VIRTUALIZATION INTERFERENCE COST
  4.1 BACKGROUND ............................................................. (48)
  4.2 EMPIRICAL STUDY ON INTERFERENCE IN XEN ................................. (51)
  4.3 KEY RESULTS AND DISCUSSION ............................................. (53)
  4.4 PAY-AS-YOU-CONSUME SCHEME .............................................. (60)
  4.5 SUMMARY ................................................................ (66)

5 MAPREDUCE ON VIRTUAL MACHINES
  5.1 BACKGROUND ............................................................. (68)
  5.2 EVALUATION METHODOLOGY AND HARDWARE PLATFORM ........................... (69)
  5.3 EXPERIMENTAL RESULTS ................................................... (71)
  5.4 DISCUSSION AND OPEN ISSUES ............................................. (75)
  5.5 CLOUDLET ............................................................... (76)
  5.6 PERFORMANCE EVALUATION ................................................. (78)
  5.7 SUMMARY ................................................................ (79)

6 ADAPTIVE DISK SCHEDULING FOR MAPREDUCE
  6.1 BACKGROUND ............................................................. (80)
  6.2 EMPIRICAL STUDY OF HADOOP ON A XEN CLUSTER ............................. (82)
  6.3 A META-SCHEDULER FOR ADAPTIVE DISK I/O SCHEDULER SELECTION ............. (89)
  6.4 PERFORMANCE EVALUATION ................................................. (96)
  6.5 DISCUSSION ............................................................. (99)
  6.6 SUMMARY ................................................................ (100)

7 CONCLUSIONS ................................................................ (102)

ACKNOWLEDGMENTS .............................................................. (105)
REFERENCES ................................................................... (107)
LIST OF ABBREVIATIONS ........................................................ (119)
APPENDIX 1 PUBLICATIONS ...................................................... (120)
APPENDIX 2 RESEARCH EXPERIENCE ............................................... (122)

1 Introduction

The increasing popularity of Internet services such as Amazon Web Services¹, Google App Engine² and Microsoft Azure³ has drawn a lot of attention to the cloud computing paradigm. The interest in cloud computing has been motivated by many factors, such as the low cost of system hardware, the increase in computing power and storage capacity (e.g., a modern data center consists of hundreds of thousands of cores and petascale storage), and the increase in the energy cost of operating such systems. However, this interest is accompanied by massive growth in the volume of data generated by digital media, social networks and scientific instruments. The traditional data-intensive approach (the data-to-computing paradigm) is not efficient for cloud computing, because the Internet becomes a bottleneck when transferring large amounts of data to a distant CPU [1]. To cope with this issue a new paradigm should be adopted, in which computing and data resources are co-located, thereby minimizing communication costs and benefiting from the large improvements in I/O speed offered by local disks. Google has successfully implemented and practiced this data-intensive paradigm in its MapReduce system [2] (e.g., Google uses its MapReduce framework to process 20 petabytes of data per day [3]). The MapReduce model has become, thanks to the popularity of its open-source implementation Hadoop [4], the de facto standard programming paradigm for massive dataset computation. Hadoop, developed primarily by Yahoo!, where it processes hundreds of terabytes of data on at least 10,000 cores [5], is now used by other companies, including Facebook, Amazon, Last.fm, and the New York Times [6]. Research groups from enterprises and academia are starting to study the MapReduce model for a better fit with the cloud, and to explore the possibilities of adapting it to more applications.

1 Amazon Web Services: http://aws.amazon.com/.
2 Google App Engine: http://code.google.com/appengine/.
3 Windows Azure platform: http://www.microsoft.com/windowsazure/.

In this chapter, the emerging cloud computing paradigm is briefly discussed; the Google MapReduce programming model, along with its popular implementation Hadoop, is then presented; next, virtualization technology and the Xen hypervisor are introduced. Finally, the chapter discusses related work on improving data-intensive cloud computing, with emphasis on data-aware execution and virtualization-aware optimization.

1.1 Cloud Computing

A cloud is essentially a class of systems that deliver IT resources to remote users as a service. The resources encompass hardware, programming environments and applications. The services provided through cloud systems can be classified into Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS).

• Infrastructure as a Service (IaaS) is one of the "Everything as a Service" trends. IaaS is easier to understand if we view it as Hardware as a Service (i.e., instead of constructing its own server farm, a small firm could pay to use infrastructure provided by professional enterprises). Companies such as Google, Microsoft and IBM are involved in offering such services. Large-scale computer hardware and high-bandwidth network connectivity are essential components of an effective IaaS. IaaS is categorized into: (1) Computation as a Service (CaaS), in which virtual machine based servers are rented and charged per hour according to the virtual machine capacity (mainly CPU and RAM size), the features of the virtual machine, and the OS and deployed software; and (2) Data as a Service (DaaS), in which unlimited storage space is used to store the user's data regardless of its type, charged per GByte for data size and data transfer. The most popular IaaS systems are Amazon EC2⁴, GoGrid⁵, Amazon S3⁶ and Rackspace⁷.

4 Amazon Elastic Compute Cloud: http://aws.amazon.com/ec2/.
5 GoGrid Cloud Hosting: http://www.gogrid.com/.
6 Amazon Simple Storage Service: http://aws.amazon.com/s3/.
7 Rackspace Managed Hosting: http://www.rackspace.com/.

• Platform as a Service (PaaS) cloud systems provide a software execution environment in which application services can run. The environment is not just a pre-installed operating system; it is also integrated with a programming-language-level platform, which users can use to develop and build applications for the platform. From the point of view of PaaS users, computing resources are encapsulated into independent containers; users can develop their own applications with the programming languages and APIs supported by the container, without having to take care of resource management or allocation problems such as automatic scaling and load balancing. The main PaaS vendors are Google App Engine, Microsoft Azure, and Force.com⁸.

• Software as a Service (SaaS) is based on licensing software use on demand, where the software is already installed and running on a cloud platform. These on-demand applications may have been developed and deployed on the PaaS or IaaS layer of a cloud platform. SaaS replaces traditional software usage with a subscribe/rent model, reducing the user's physical equipment deployment and management costs. SaaS clouds may also allow users to compose existing services to meet their requirements. Some examples of SaaS are the "Global Hosted Operating SysTem" (G.ho.st)⁹, Google Apps¹⁰ and Salesforce¹¹.

Different enterprises play different roles in building and using cloud systems. These roles range from cloud technology enablers (enabling the underlying technologies used to build the cloud, such as hardware technologies, virtualization technology and web services), to cloud providers (delivering their infrastructure and platform to customers), to cloud customers (using the providers' services to improve their web applications), and finally users (who use the web applications, possibly unaware that they are being delivered using cloud technologies).

8 Force.com cloud computing: http://www.salesforce.com/platform/.
9 Ghost Cloud Computing (G.ho.st): http://ghost.cc/.
10 Google Apps: http://www.google.com/apps/intl/en/business/index.html.
11 Salesforce homepage: http://www.salesforce.com/crm/.

1.2 Google MapReduce

Google's MapReduce [2, 3] is a programming model that demonstrates a simpler way to develop data-intensive applications for large distributed systems. It was an early realization of what Alex Szalay and Jim Gray stated in a commentary on 2020 computing [7]: "In the future, working with large data sets will typically mean sending computations to data rather than copying the data to your workstation."

At the time of writing, due to its remarkable features, including simplicity, fault tolerance and scalability, MapReduce is by far the most powerful realization of data-intensive cloud computing programming. It is often advocated as an easier-to-use, efficient and reliable replacement for the traditional data-intensive programming model in cloud computing. More significantly, MapReduce has been proposed to form the basis of the data center software stack [8].

The MapReduce [2, 3] system runs on top of the Google File System (GFS) [9], within which data is loaded, partitioned into chunks, and each chunk replicated. Data processing is co-located with data storage: as shown in Figure 1.1, when a file needs to be processed, the job scheduler consults a storage metadata service to get the host node of each chunk, and then schedules a "map" process on that node, so that data locality is exploited efficiently. A node reads the content of the corresponding input split and passes key/value pairs to the user-defined Map function. The intermediate key/value pairs produced by the Map function are first buffered in memory and then periodically written to local disk, partitioned into R sets by the partitioning function. The master passes the locations of these stored pairs to the reduce workers, which read the buffered data from the map workers using remote procedure calls (RPC). Each reduce worker then sorts the intermediate keys so that all occurrences of the same key are grouped together. For each key, the worker passes all the corresponding intermediate values to the Reduce function. Finally, the output is available in R output files (one per reduce task).
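To make this data flow concrete, the following minimal sketch reproduces the model just described: a user-defined Map function, the hash-based partitioning function that splits intermediate pairs into R sets, the shuffle/sort step that groups all occurrences of a key, and a user-defined Reduce function producing one output per reduce task. It is written in plain Python for illustration only (it is not the Google or Hadoop API, and all names in it are ours):

    from collections import defaultdict

    R = 3  # number of reduce tasks (and of output "files")

    def map_fn(_, line):
        # user-defined Map: emit one (word, 1) pair per word
        for word in line.split():
            yield word, 1

    def reduce_fn(word, counts):
        # user-defined Reduce: sum all occurrences of the same key
        yield word, sum(counts)

    def partition(key):
        # the default partitioning function: hash(key) mod R
        return hash(key) % R

    def run_job(input_splits):
        # map phase: in a real system each split is processed on the node
        # that stores it; intermediate pairs are buffered per partition
        partitions = [defaultdict(list) for _ in range(R)]
        for split_id, split in enumerate(input_splits):
            for line in split:
                for key, value in map_fn(split_id, line):
                    partitions[partition(key)][key].append(value)
        # reduce phase: each reduce worker fetches its partition (the
        # "shuffle"), sorts it by key, and applies reduce_fn per key
        outputs = []
        for part in partitions:
            out = []
            for key in sorted(part):
                out.extend(reduce_fn(key, part[key]))
            outputs.append(out)
        return outputs  # R outputs, one per reduce task

    print(run_job([["the quick brown fox", "the lazy dog jumps"]]))

In a real deployment the map and reduce loops run on different nodes and the shuffle crosses the network, which is precisely why the locality and partitioning decisions studied in Chapters 2 and 3 matter.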

Hadoop. Hadoop [4] is a top-level Apache project, built and used by a community of contributors from all over the world [10]. It was advocated by the industry's premier Web players - Google, Yahoo!, Microsoft, and Facebook - as the engine to power the cloud [11]. The Hadoop project is described as a collection of various subprojects for reliable, scalable distributed computing [4].

Figure 1.1: MapReduce execution overview [2]

Yahoo! has been the largest contributor to the Hadoop project [11] and uses Hadoop extensively in its web search and advertising businesses [11]. For example, in 2009, Yahoo! launched what it described as the world's largest Hadoop production application, called Yahoo! Search Webmap. The Yahoo! Search Webmap runs on a Linux cluster of more than 10,000 cores and produces data that is now used in every Yahoo! Web search query [8]. Besides Yahoo!, many other vendors have introduced and developed their own solutions for the enterprise cloud; these include IBM Blue Cloud [12], Cloudera¹², the OpenSolaris Hadoop Live CD¹³ by Sun Microsystems, and Amazon Elastic MapReduce¹⁴. Beside the aforementioned vendors, many other organizations are using Hadoop solutions to run large distributed computations [6].

Hadoop MapReduce overview. Hadoop Common [4], formerly Hadoop Core, includes the file system, RPC and serialization libraries, and provides the basic services for building a cloud computing environment with commodity hardware. The two fundamental subprojects are the MapReduce framework and the Hadoop Distributed File System (HDFS). HDFS is a distributed file system designed to run on clusters of commodity machines. It is highly fault-tolerant and is appropriate for data-intensive applications, as it provides high-speed access to the application data. The Hadoop MapReduce framework is highly reliant on its shared file system (i.e., it comes with plug-ins for HDFS, CloudStore¹⁵, and the Amazon Simple Storage Service S3). The MapReduce framework has a master/slave architecture. The master, called the JobTracker, is responsible for: (1) querying the NameNode for the block locations; (2) scheduling tasks on the slaves hosting the tasks' blocks; and (3) monitoring the successes and failures of the tasks. The slaves, called TaskTrackers, execute the tasks as directed by the master.

1.3 Virtualization Technology

Virtualization is the idea of partitioning or dividing the resources of a single server into multiple segregated virtual machines (VMs). Virtualization technology has been proposed and developed over a relatively long period [13, 14]. Recently, due to the rapid growth in IT infrastructure, we have seen the emergence of multi-core processors and a wide variety of hardware, operating systems and software.

12 Cloudera homepage: http://www.cloudera.com/.
13 OpenSolaris Hadoop Live CD: http://opensolaris.org/os/project/livehadoop/.
14 Amazon Elastic MapReduce: http://aws.amazon.com/elasticmapreduce/.
15 CloudStore (formerly Kosmos File System): http://kosmosfs.sourceforge.net/.

In this environment, virtualization has had a resurgence of popularity. Virtualization can provide dramatic benefits for a computing system, including increased utilization, energy saving, rapid deployment, improved maintenance capability, isolation, and encapsulation [15, 16, 17]. Moreover, virtualization enables applications to migrate from one server to another while they are still running, without downtime, providing flexible workload management and high availability during planned maintenance or unplanned events [18, 19, 20, 21]. There are numerous reasons why virtualization is effective in practical scenarios, for example [22, 23]:

• Server and application consolidation: under virtualization we can run multiple applications at the same time on the same server, resulting in more efficient utilization of resources;

• Configurability: virtualization allows dynamic configuration and bundling of resources for a wider variety of applications than could be achieved at the hardware level - different applications require different resources (some requiring more storage, others more computing);

• Increased application availability: virtual machine checkpointing and migration allow quick failure recovery from unplanned outages, with no interruption in service;

• Improved responsiveness: resource provisioning, monitoring and maintenance can be automated, and common resources can be cached and reused.

1.3.1 Virtualization Platforms

Virtualization technology has been developed to make the best use of computing capacity. Server virtualization has been described as follows: "In most cases, server virtualization is accomplished by the use of a hypervisor (VMM) to logically assign and separate physical resources. The hypervisor allows a guest operating system, running on the virtual machine, to function as if it were solely in control of the hardware, unaware that other guests are sharing it. Each guest operating system is protected from the others and is thus unaffected by any instability or configuration issues of the others." [24]

Virtualization methods can be classified into two categories according to whether or not the guest OS kernel needs to be modified: (1) full virtualization (supported by VMware¹⁶, VirtualBox¹⁷, Microsoft Hyper-V¹⁸, Xen [25], KVM¹⁹, etc.), and (2) paravirtualization (currently supported only by Xen). Full virtualization emulates the entire hardware environment by utilizing hardware virtualization support, binary code translation, or binary code rewriting; thus the guest OS does not need a modified kernel. Full virtualization is important for running non-open-source operating systems such as Windows, because it is too difficult to modify the Windows kernel without its source code. Paravirtualization requires the guest OS kernel to be modified to become aware of the hypervisor. Because it need not emulate the entire hardware environment, paravirtualization can attain better performance than full virtualization. In paravirtualized architectures, OS-level information about the VM can be passed explicitly from the OS to the VMM, and this is done in practice to some extent [26, 27]. Any explicit information supplied by a paravirtualized OS is guaranteed to match what is available inside the OS. In some important environments, however, the explicit approach is less valuable, and because paravirtualization requires OS-level modification, that functionality cannot be deployed in VMMs running beneath legacy or closed-source operating systems anyway.

1.3.2 Xen Hypervisor

The Xen hypervisor is a paravirtualizing virtual machine monitor (VMM) [25, 28], in which the machine architecture presented to an operating system is not identical to the underlying hardware. The Xen hypervisor, which sits directly on the bare machine, is responsible for resource (CPU, memory, I/O device, etc.) allocation among the various virtual machines running on the same hardware device, as shown in Figure 1.2.

16 VMware: www.vmware.com/.
17 Sun VirtualBox: http://dlc.sun.com.edgesuite.net/virtualbox/.
18 Microsoft Hyper-V Server: www.microsoft.com/hyper-v-server/.
19 Kernel-based Virtual Machine (KVM): www.linux-kvm.org/.

There is an initial domain, called Domain 0 (Dom0), which runs a modified Linux kernel. Dom0 is a unique virtual machine running on the Xen VMM that has privileges to access physical I/O devices as well as to interact with the other VMs. The other VMs sharing the same host with Dom0 are called Domain U (DomU) or guest domains; they run modified UNIX-like operating systems and are aware that they do not have direct access to the hardware.

Figure 1.2: Xen architecture (a driver domain and guest domains running above the Xen hypervisor, which provides virtual processors, virtual memory, virtual network and virtual block devices on top of the physical processors, memory, network and disks) [43]

1.3.3 Scheduling in Xen

Xen is unique among VMM software in that it allows users to choose among different CPU schedulers and I/O schedulers. From version 3.1.0, Xen has two different CPU schedulers available, "Credit" and "Simple Earliest Deadline First" (SEDF), both allowing users to specify CPU allocation via CPU weights. Moreover, there are four I/O schedulers available in the 2.6 Linux kernels: Noop, Anticipatory, Deadline, and the Completely Fair Queuing scheduler (CFQ). Furthermore, users can select the I/O scheduler on the fly, in both Dom0 and the DomUs.

CPU schedulers in Xen. The Xen CPU scheduler determines how VMs share the physical CPUs within the same host. Below, we briefly characterize the main features that motivated the inclusion of each scheduler in Xen.

The Credit scheduler [29] is currently the default scheduler in Xen. It is a proportional fair-share CPU scheduler built to be work-conserving on SMP hosts: it guarantees that no CPU idles while there is a runnable VCPU.

Simple Earliest Deadline First (SEDF) [30] is intended to deliver hard guarantees on CPU allocation using real-time algorithms.

Table 1.1 briefly compares the two Xen CPU schedulers.

Table 1.1: Comparison of the two Xen CPU schedulers

Scheduler | Preemptive | Both WC & NWC modes | Parameterizable time granularity | Global load balancing on multiprocessors
SEDF      | yes        | yes                 | yes                              | no
Credit    | no         | yes                 | no                               | yes
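As an illustration of how these knobs are exposed in practice, the short sketch below sets a Credit-scheduler weight and cap for one guest through Xen's xm management tool (available on Xen 3.x hosts); the domain name and the chosen values are hypothetical:

    import subprocess

    def set_credit_params(domain, weight, cap):
        # "xm sched-credit" adjusts the Credit scheduler's per-domain
        # parameters: weight is a relative share (default 256); cap is a
        # percentage of one physical CPU, with 0 meaning "no cap"
        subprocess.check_call(
            ["xm", "sched-credit", "-d", domain, "-w", str(weight), "-c", str(cap)]
        )

    # give a (hypothetical) guest twice the default share, capped at one core
    set_credit_params("hadoop-vm1", weight=512, cap=100)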

Disk I/O schedulers in Xen. The disk I/O scheduler performs two basic operations: merging and sorting. The merging operation reduces the number of transactions between the guest OS and the VMM by merging adjacent I/O requests [31], while the sorting operation arranges pending I/O requests in block order to minimize seek time [32].

The Noop scheduler [33] is implemented as a simple FIFO queue and performs only basic merging and sorting.

Completely Fair Queuing (CFQ) [34] is the default scheduler in the Linux 2.6 kernel. It implements both request merging and the elevator, and attempts to ensure fairness among all users (guaranteeing the same number of I/O requests over a particular time interval for the different VMs within the same physical machine).

Deadline [35] implements request merging and a one-way elevator, and imposes a deadline on all operations to prevent resource starvation.

Anticipatory [36] assumes that processes typically perform multiple I/O operations within a short time; it therefore introduces a short delay before dispatching an I/O request, thus avoiding head movements where possible.

Table 1.2 briefly compares the four disk I/O schedulers.

Table 1.2: Comparison of the four Xen disk I/O schedulers

Scheduler    | Complexity  | Size     | Performance                    | Suitable scenarios
No-op        | simple      | small    | depends on the specific application | embedded systems, or storage systems which have their own scheduling mechanisms
Anticipatory | complicated | large    | best in most cases             | desktop & server (but gives highly erratic performance on database systems)
Deadline     | moderate    | moderate | almost as good as Anticipatory | server (the preferred scheduler for database systems, especially with Tagged Command Queuing (TCQ) aware disks)
CFQ          | moderate    | moderate | good                           | multimedia, desktop & multi-user
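Because later chapters tune these schedulers at run time, it is worth recalling the standard Linux mechanism behind selecting an I/O scheduler "on the fly": each block device exposes its active and available elevators through sysfs, and writing a scheduler name switches it immediately. The sketch below is plain Python around that mechanism (the device name sda is only an example; root privileges are required):

    def get_scheduler(dev="sda"):
        # the active scheduler is shown in brackets, e.g. "noop [cfq] deadline"
        with open("/sys/block/%s/queue/scheduler" % dev) as f:
            line = f.read()
        return line[line.index("[") + 1 : line.index("]")]

    def set_scheduler(dev, name):
        # writing a valid name ("noop", "anticipatory", "deadline", "cfq" on
        # 2.6 kernels) switches the elevator on the fly; this works in Dom0
        # and, for its own virtual devices, inside a DomU as well
        with open("/sys/block/%s/queue/scheduler" % dev, "w") as f:
            f.write(name)

    print(get_scheduler("sda"))       # e.g. "cfq"
    set_scheduler("sda", "deadline")  # switch without rebooting or remounting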

1.4 Related Work

Since J. Dean and S. Ghemawat proposed the MapReduce model [2], it has received much attention from both industry and academia. Many projects are exploring ways to support MapReduce on various types of distributed architecture and for a wider range of applications, as shown in Figure 1.3. For instance, Qt Concurrent²⁰ is a C++ library for multi-threaded applications; it provides a MapReduce implementation for multi-core computers. Stanford's Phoenix [37] is a MapReduce implementation that targets shared-memory architectures, while M. Kruijf and K. Sankaralingam implemented MapReduce for the Cell B.E. architecture [38]. Mars [39] is a MapReduce framework on graphics processors (GPUs); it aims to provide a generic framework for developers to implement data- and computation-intensive tasks correctly, efficiently and easily on the GPU.

20 Qt Concurrent page: http://labs.trolltech.com/page/Projects/Threads/QtConcurrent.

Figure 1.3: Different MapReduce implementations.

Hadoop [4], Disco²¹, Skynet²² and GridGain²³ are open-source implementations of MapReduce for large-scale data processing. Map-Reduce-Merge [40] is an extension of MapReduce: it adds a merge phase to MapReduce in order to easily process data relationships among heterogeneous datasets.

Microsoft Dryad [41] is a distributed execution engine for coarse-grained data-parallel applications. In Dryad, computation tasks are expressed as directed acyclic graphs (DAGs).

Other efforts [42, 43] focus on enabling MapReduce to support a wider range of applications. S. Chen and S. W. Schlosser from Intel are working on making MapReduce suitable for performing earthquake simulation, image processing and general machine-learning computations [44]. MRPSO [45] utilizes Hadoop to parallelize a compute-intensive application, called Particle Swarm Optimization. Research groups from Cornell, Carnegie Mellon, the University of Maryland and PARC are also starting to use Hadoop for both web-data and non-data-mining applications, like seismic simulation and natural language processing [46].

21 Disco project: http://discoproject.org/.
22 Skynet: http://skynet.rubyforge.org/, 2009.
23 GridGain - High Performance Cloud Computing: www.gridgain.com/.

This thesis focuses on data-intensive cloud computing. As such, it builds upon prior work in a number of related areas. Topics related to improving the performance of traditional Hadoop through data-aware execution, and topics related to the proposed virtualization-aware methodologies, are discussed here.

1.4.1 Data-Aware Scheduling

Sangwon et al. [47] have proposed pre-fetching and pre-shuffling schemes for shared MapReduce computation environments. While the pre-fetching scheme exploits data locality by assigning tasks to the node nearest to the blocks, the pre-shuffling scheme significantly reduces the network overhead required to shuffle key-value pairs. The pre-shuffling scheme provides data-aware partitioning over the intermediate data: it looks over the input splits before the map phase begins and predicts the target reducer into whose local node the key-value pairs of the intermediate output will be partitioned; thus, the expected data is assigned to a map task near the future reducer before the mapper executes.

Jiang et al. [48] have presented an in-depth study of MapReduce that identifies four factors with significant effects on MapReduce performance: I/O mode, indexing, parsing, and sorting. The study also gives alternative strategies for each factor: using direct I/O instead of streaming I/O for data-local maps; using range-indexes for selection and join tasks; using a mutable decoder to handle database-like workloads; and using fingerprinting-based sort. With the appropriate implementation of each factor, the performance of MapReduce can be improved by a factor of 2.5 to 3.5, approaching that of parallel databases.

Zaharia et al. [49] have proposed a simple scheduling algorithm, called delay scheduling, to achieve locality and fairness in cluster scheduling: when the job that should be scheduled next according to fairness cannot launch a local task, it waits for a small amount of time, letting other jobs launch tasks instead (see the sketch below). The work targets the problem of a cluster shared between multiple users, running a mix of long batch jobs and short interactive queries over a common data set. It starts by designing a fair scheduler that divides resources using max-min fair scheduling, and then applies delay scheduling to achieve locality.
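The core of delay scheduling is just a per-job counter. The following minimal sketch (illustrative Python, not the authors' Hadoop fair-scheduler code; the skip threshold D and the job methods are our own names) captures the rule described above:

    D = 3  # scheduling opportunities a job may skip while waiting for locality

    def assign_task(free_node, jobs):
        # jobs are ordered by fairness, most underserved job first
        for job in jobs:
            if job.has_local_task_on(free_node):
                job.skipcount = 0
                return job.launch_local_task(free_node)
            # the job lacks local data on this node: let it skip this
            # opportunity a few times before launching non-locally
            if job.skipcount >= D:
                job.skipcount = 0
                return job.launch_any_task(free_node)
            job.skipcount += 1
        return None  # no job had a runnable task

The intuition, per the cited work, is that slots free up frequently, so waiting a few scheduling opportunities is usually enough for a job to find a local slot.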

Ananthanarayanan et al. [50] have analyzed Dryad job logs from Microsoft Bing's cluster and observed a wide disparity in data popularity: 18% of the data exhibits high access concurrency, and contention for slots on the machines storing popular data hurts job performance. This work presents Scarlett, a system that replicates blocks based on their popularity. With accurate prediction of data popularity (using a combination of historical usage statistics and information about submitted and executing jobs), appropriate data placement and replication, and data compression, Scarlett improves locality, and thereby performance, for both Hadoop and Dryad.

1.4.2 Network-Oriented Optimizations

Tyson et al. [51] have proposed MapReduce Online, a modified MapReduce architecture in which intermediate data is pipelined between operators within a job or across multiple jobs. In standard MapReduce, by contrast, data is pulled by downstream operators from upstream operators instead of being pushed from upstream to downstream, and every MapReduce operator is a blocking operator: reduce tasks cannot begin until the map tasks are finished, and the map tasks of the next job cannot start until the reduce tasks of the previous job have completed.

Abouzeid et al. [52] have built a hybrid system, named HadoopDB, that takes the best features from MapReduce-based systems and parallel databases. MapReduce is used as the communication layer above multiple nodes running DBMS instances; SQL queries are translated into MapReduce by Hive, so that as much of the query-processing logic as possible is pushed into the DBMSs. Compared with Hadoop and several databases (Vertica, DB-X), HadoopDB achieves good performance, efficiency, scalability, fault tolerance and flexibility.

Ananthanarayanan et al. [53] have shown that outliers significantly prolong job completion time, and discuss three causes of outliers: hardware reliability and resource contention, varying network bandwidth and congestion, and workload imbalance among tasks. They present a system named Mantri that monitors tasks and deals with outliers using three techniques: smart, resource-aware restart of outliers; network-aware placement of tasks; and cost-model-based replication to protect the outputs of valuable tasks (intermediate data). The resulting system improves job performance significantly.

Since re-computation of failed tasks significantly decreases application performance, Quiane-Ruiz et al. [54] have proposed three recovery algorithms to cope with task failure and node failure: local checkpointing (spill the map outputs periodically to disk and record the offset of the map execution, to handle map task failure); remote checkpointing (push the spilled data to the corresponding reducers and replicate the locally consumed data to remote workers, to handle node failure); and query metadata checkpointing (instead of replicating locally consumed data, keep track of the input key-value pairs that produce the intermediate data and replicate the metadata files to backup nodes, to save network bandwidth).

1.4.3 Data Skew in MapReduce

Few studies have reported on the impact of data skew on MapReduce-based systems. Qiu et al. have reported on skew problems in some bioinformatics applications [55] and discussed potential solutions based on implementing those applications with cloud technologies. Lin analyzed the skewed running times of MapReduce tasks, maps and reduces, caused by Zipfian distributions of the input and intermediate data, respectively [56].

Chen et al. [57] have proposed Locality-Aware Reduce Scheduling (LARS), designed specifically to minimize data transfer in their grid-enabled MapReduce framework, called USSOP. Due to the heterogeneity of grid nodes in terms of computation power, USSOP varies the data size of map tasks, assigning map tasks with different data sizes to workers according to their computation capacity. Obviously, this causes variation in the sizes of the map outputs. The master node defers the assignment of reduces to grid nodes until all maps are done; then, using the LARS algorithm, the nodes with the largest region sizes are assigned reduces (all the intermediate data is hashed and stored as regions, and one region may contain different keys). Thus, LARS avoids transferring large regions out.

Kwon et al. [58] have proposed SkewReduce to overcome computation skew in MapReduce-based systems, where the running time of a partition depends on the input size as well as on the data values. At the heart of SkewReduce is an optimizer, parameterized by a user-defined cost function, that determines how best to partition the input data to minimize computational skew.

1.4.4 Cost of Adopting the Cloud Platform

There have been a few studies on evaluating user costs in the pay-as-you-go cloud. Deelman et al. [59] studied the cost and performance tradeoffs of different execution and resource-provisioning plans for a scientific application via simulation. Palankar et al. [60] studied the cost, availability and performance of Amazon S3 services in the context of data-intensive scientific applications. Napper et al. [61] showed that the cost of solving a linear system in Linpack increases exponentially with the problem size, in contrast with the linear scalability of the private cloud. Walker [62] analyzed the performance differences between running the NAS Parallel Benchmarks (NPB) in private and public clouds. Assuncao et al. [63] evaluated the cost-performance of different scheduling strategies for combining private (self-owned) and public clouds.

There are also some studies on the cost variation across different runs in the cloud. Garfinkel [64] performed a half-year study of Amazon web services, including EC2, S3 and SQS, and documented their outages and the variation of simple operations such as GET/PUT. Previous work by Wang et al. [65] demonstrated the cost variance between different runs on Amazon EC2.

Recent studies [66] have investigated service-centric models for better cloud service offerings. Mihoob et al. [67] presented a case study of a consumer-centric resource accounting model that lets consumers verify any discrepancies in their bills. Yao et al. [68] introduced an accountability service model, towards achieving a trustworthy service-oriented architecture, to unambiguously identify the cause of, and the responsible party for, a faulty service.

1.4.5 VM Interference

There have been a few studies on performance interference in virtualized servers. Gupta et al. [91] implemented XenMon to monitor the CPU usage of each guest and device-driver domain, and passed the usage information to a hypervisor scheduler to enable fair scheduling between applications that use device-driver domains and those that do not.

In later work, Cherkasova et al. [69] provided a comparative evaluation of three different CPU schedulers for virtual machines. They analyzed the impact of the choice of scheduler and its parameters on application performance, and discussed the challenges of estimating application resource requirements in virtualized environments.

There have been many studies of the impact of I/O schedulers in native operating systems, as well as of improving I/O scheduling by refining one of the existing schedulers [70], using heuristic I/O schedulers [71], or using intelligent I/O schedulers [72, 73]. Two state-of-the-art studies of the impact of native-OS I/O schedulers on application performance are those of Pratt and Heger [74] and Seelam et al. [75], who thoroughly evaluated the different Linux I/O schedulers and their effects on different workloads.

There have been a few studies on how the composition of I/O schedulers, in both the VMM and the guest OS, affects application performance in virtualized environments. Boutcher and Chandra [76] examined whether traditional disk I/O scheduling still provides benefits in a layered system consisting of virtualized operating systems and an underlying virtual machine monitor. They demonstrated that choosing the appropriate scheduling algorithm in the guest operating system provides performance benefits, while scheduling in the virtual machine monitor has no measurable advantage. Kesavan et al. [77] also examined the impact of I/O schedulers in virtualized environments, and reached a different conclusion from the former study: the choice of an appropriate I/O scheduler at the VMM layer has a significant impact on inter-application isolation and on the performance guarantees inside a given VM.

Diego et al. [78] explored the relationship between domain scheduling in a virtual machine monitor (VMM) and I/O performance, by studying the impact of the VMM scheduler on performance with multiple guest domains concurrently running different types of applications.

Pu et al. [79] and Mei et al. [80] have measured the performance interference between two VMs running network I/O workloads that are either CPU-bound or network-bound, and elaborated the impacts of co-locating applications in a virtualized cloud in terms of throughput and resource-sharing effectiveness.

Koh et al. [81] studied the effects of performance interference between two virtual machines hosted on the same hardware platform by looking at system-level workload characteristics. Through analysis of the collected characteristics, they successfully predicted the performance of a new application from its workload characteristic values, within an average error of approximately 5%.

1.4.6 Improving the Performance of MapReduce on VMs

There have been a few studies on improving MapReduce performance in the cloud, particularly on Xen-based virtual clusters.

Zaharia et al. [82] have proposed a new scheduling algorithm, called Longest Approximate Time to End (LATE), to improve the performance of Hadoop in the heterogeneous environments brought about by the variation of VM consolidation among different physical machines. LATE runs "speculative" tasks - that is, it looks for tasks that are running slowly and might possibly fail, and replicates them on another node in case they do not finish. In LATE, the slow tasks are prioritized by how much they hurt job response time (a sketch of the heuristic is given below), and the number of speculative tasks is capped to prevent thrashing.

The Tashi project [83] aims to build a software infrastructure for cloud computing on massive Internet-scale datasets; it aims to provide a VM management system that works together with Hadoop.
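For intuition, LATE's ranking can be captured in a few lines. The sketch below is illustrative Python based on the published description (it is not the authors' code; the progress score stands for Hadoop's per-task progress estimate in [0, 1]): it estimates each slow task's time to completion and speculates on the one expected to finish furthest in the future:

    def time_left(progress_score, elapsed_seconds):
        # progress rate = score / time; estimated time left = (1 - score) / rate
        if progress_score <= 0:
            return float("inf")
        rate = progress_score / elapsed_seconds
        return (1.0 - progress_score) / rate

    def pick_speculative_task(running_tasks, slow_rate_threshold):
        # running_tasks: (task_id, progress_score, elapsed_seconds) tuples;
        # consider only tasks progressing below the slow-rate threshold, and
        # speculate on the one with the longest approximate time to end
        candidates = [
            (time_left(score, t), task_id)
            for task_id, score, t in running_tasks
            if score < 1.0 and score / t < slow_rate_threshold
        ]
        return max(candidates)[1] if candidates else None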

1.5 Thesis Organization

The performance of data-intensive cloud computing can be improved at two levels: the programming model and the underlying infrastructure. To this end, this thesis is organized as follows (see Figure 1.4). Chapter 2 characterizes the impact of Hadoop's map-task scheduling on application performance. In that chapter, a bottleneck is identified that degrades Hadoop's performance and efficiency - non-local map execution, which slows applications and increases needless map speculation - and a new scheduling algorithm is proposed to overcome it. Chapter 3 identifies the bottleneck in Hadoop's key-partitioning decision when partitioning skew occurs, a phenomenon widespread in scientific and database applications. To identify the impact of the underlying infrastructure on data-intensive cloud computing, Chapter 4 characterizes the interference brought about by virtualization, and evaluates the impact of interference on application performance in terms of personal and social fairness. Chapter 5 evaluates the performance of Hadoop in a virtualized environment and motivates a new MapReduce model that better fits MapReduce to the virtual cloud. Next, Chapter 6 reports on a set of experiments identifying the impact of different I/O schedulers on Hadoop performance, and examines a new methodology, adaptive disk I/O scheduling, to improve the performance of Hadoop applications. Finally, Chapter 7 concludes this thesis.

Figure 1.4: Thesis organization (Chapter 1: Introduction; Chapter 2: Replica-Aware Task Scheduling; Chapter 3: Locality- and Fairness-Aware Key Partitioning; Chapter 4: Virtualization Interference Impacts on Application Performance; Chapter 5: MapReduce on Virtual Machines; Chapter 6: Adaptive Disk I/O Scheduling for MapReduce; Chapter 7: Conclusion)

华 中 科 技 大 学 博 士 学 位 论 文 References [1]

A. Szalay, A. Bunn, J. Gray, I. Foster, I. Raicu. The Importance of Data Locality in Distributed computing Applications.

In: Proceedings of NSF

Workflow, 2006. [2]

J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In: Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI ’04), California, USA, Dec. 6-8, 2004.137-150

[3]

J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In: Communications of the ACM, Jan. 2008, 51(1): 107-113

[4]

Hadoop: http://lucene.apache.org/, 2011.

[5]

Yahoo!,

Yahoo!

Developer

Network:

http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-produ ction-hadoop.html (accessed September 2009). [6]

Hadoop,

Applications

powered

by

Hadoop:

http://wiki.apache.org/hadoop/PoweredB [7] A. S. Szalay, P. Z. Kunszt, A. Thakar, J. Gray, D. Slutz, and R. J. Brunner. Designing and Mining Multi-terabyte Astronomy Archives: The Sloan Digital Sky Survey. In: Proceedings of the 19thACM SIGMOD international conference on Management of data (SIGMOD’00), Texas, USA, ACM Press, May. 14-19, 2000.451 - 462 [8] D. A. Patterson. Technical perspective: the Data Center is the Computer. In: Communications of the ACM, Jan. 2008, 51 (1):105-105 [9] S. Ghemawat, H. Gobioff and S.T. Leung. The Google File System. In: Proceedings of the 19thACM Symposium on Operating Systems Principles (SOSP’03), New York, USA, ACM Press, Oct.19-22, 2003. 29-43 [10]

Hadoop

in

Wikipedia:

http://en.wikipedia.org/wiki/Hadoop,

(accessed

September 2009). [11]

CNET

news:

http://news.cnet.com/8301-13505_3-10196871-16.html,

107

华 中 科 技 大 学 博 士 学 位 论 文 (accessed September 2009). [12]

IBM

Blue

Cloud

Announcement:

http://www-03.ibm.com/press/us/en/pressrelease/22613.wss [13]

VMware white paper. ESX Server Performance and Resource Management for CPU-Intensive Workloads. Feb 2006. http://www.vmware.com/pdf/ESX2_CPU_Performance.pdf

[14]

R. P. Goldberg. Survey of Virtual Machine Research. In: IEEE Computer Magazine, Jun. 1974, 7(6): 34-45

[15]

C. A. Waldspurger. Memory Resource Management in VMware ESX Server. In: Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI ’02), Massachusetts, USA, Dec. 9-11, 2002. 181-194

[16]

K. Fraser, S. Hand, R. Neugebauer, I. Pratt, A. Warfield, and M. Williamson. Safe Hardware Access with the Xen Virtual Machine Monitor. In: Proceedings of the 1st Workshop on Operating System and Architectural Support for the on demand IT Infrastructure (OASIS’ 04), MA, USA, Oct. 7-13, 2004. 1-10

[17]

C. Clark, K. Fraser, S. Hand, J. G. Hansen, E. Jul, C. Limpach, L. Pratt and A. Warfield. Live Migration of Virtual Machines. In: Proceedings of the 2nd Symposium on Networked Systems Design and Implementation (NSDI ’05), Massachusetts, USA, May. 2-4, 2005. 1-11

[18]

T. Garfinkel, B. Pfaff, J. Chow, M. Rosenblum, and D. Boneh D. Terra: A Virtual Ma-chine-Based Platform for Trusted computing. In: Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP ’03), Bolton Landing, New York, USA, ACM Press, Oct. 19 - 22, 2003.193-206

[19]

T. C. Bressoud and F. B. Schneider. Hypervisor based Fault-tolerance. In: Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP’95), Colorado, USA, ACM Press, Dec. 3-6, 1995. 1-11

[20]

F. Petrini, D. J. Kerbyson, and S. Pakin. The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q. In: Proceedings of ACM/IEEE Conference on Supercomputing (SC’03), Washington, DC, USA, Nov. 15-21, 2003. 55-55

108

华 中 科 技 大 学 博 士 学 位 论 文 [21]

K. Koch. How does ASCI Actually Complete Multi-month 1000-processor Milestone Simulations? In: Proceedings of the Conference on High Speed computing, OR, USA, Apr. 2002.

[22]

I. Foster, Y. Zhao, I. Raicu and S. Lu. Cloud computing and Grid computing 360-Degree Compared. In: Proceedings of the Grid Computing Environments Workshop (GCE’08), Texas, USA, Nov. 12-16, 2008. 1-10

[23]

S. Nanda and T. Chiueh. A Survey on Virtualization Technologies. Technical Report TR-179, Department of Computer Science, State University of New York, Feb. 2005. www.ecsl.cs.sunysb.edu/tr/TR179.pdf

[24]

IBM white paper. Seeding the Clouds: Key Infrastructure Elements for Cloud Computting. IBM Press, Feb. 2009. ftp://ftp.software.ibm.com/common/ssi/sa/wh/n/oiw03022usen/OIW03022USE N.PDF

[25]

Xen Homepage: http://www.xen.org/, 2009.

[26]

A. Whitaker, M. Shaw, and S.D. Gribble. Scale and Performance in the Denali Isolation Kernel. In: Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI’02), Massachusetts, USA, Dec. 9-11, 2002. 195-209

[27] Xen Wiki: http://wiki.xensource.com/xenwiki/XenArchitecture, 2011.
[28] I. Pratt, A. Warfield, P. Barham and R. Neugebauer. Xen and the Art of Virtualization. In: Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP’03), New York, USA, ACM Press, Oct. 19-22, 2003. 164-177
[29] Xen Credit Scheduler: http://wiki.xensource.com/xenwiki/CreditScheduler, 2011.
[30] D. Gupta. Scalable Virtual Machine Multiplexing. Ph.D Thesis, University of California, San Diego, May 2009.
[31] V. Chadha, R. Illiikkal, R. Iyer, J. Moses, D. Newell, and R.J. Figueiredo. I/O Processing in a Virtualized Platform: a Simulation-driven Approach. In: Proceedings of the 3rd International Conference on Virtual Execution Environments (VEE’07), California, USA, June 13-15, 2007. 116-125
[32] H. Frank. Analysis and Optimization of Disk Storage Devices for Time-sharing Systems. In: Journal of the ACM, Oct. 1969, 16(4): 602-620
[33] Noop scheduler (Wiki): http://en.wikipedia.org/wiki/Noop_scheduler, 2011.
[34] Linux: Fair Queuing Disk Schedulers: http://kerneltrap.org/node/580, 2011.
[35] Linux I/O schedulers: http://www.wlug.org.nz/LinuxIoScheduler, 2011.
[36] Anticipatory scheduling (Wiki): http://en.wikipedia.org/wiki/Anticipatory_scheduling, 2011.
[37] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, C. Kozyrakis. Evaluating MapReduce for Multi-core and Multiprocessor Systems. In: Proceedings of the 13th International Symposium on High-Performance Computer Architecture (HPCA’07), Arizona, USA, ACM Press, Feb. 10-14, 2007. 13-24
[38] M. Kruijf and K. Sankaralingam. MapReduce for the Cell B.E. Architecture. Technical Report TR-1625, Department of Computer Sciences, the University of Wisconsin-Madison, Oct. 2007.
[39] B. S. He, W. B. Fang, Q. Luo, N. K. Govindaraju and T. Y. Wang. Mars: a MapReduce Framework on Graphics Processors. In: Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT’08), Ontario, Canada, ACM Press, Oct. 25-29, 2008. 260-269
[40] H. C. Yang, A. Dasdan, R. L. Hsiao, and D. S. Parker Jr. Map-reduce-merge: Simplified Relational Data Processing on Large Clusters. In: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, Beijing, China, ACM Press, June 11-14, 2007. 1029-1040
[41] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. In: Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems (EuroSys’07), Lisbon, Portugal, Mar. 21-23, 2007. 59-72
[42] S. Chen, S. W. Schlosser. Map-Reduce Meets Wider Varieties of Applications. Technical Report IRP-TR-08-05, Intel Research Pittsburgh, May 2008.
[43] R. E. Bryant. Data-Intensive Supercomputing: The Case for DISC. Technical Report CMU-CS-07-128, Department of Computer Science, Carnegie Mellon University, May 2007.
[44] A. W. McNabb, C. K. Monson, and K. D. Seppi. Parallel PSO Using MapReduce. In: Proceedings of the Congress on Evolutionary Computation (CEC’07), Singapore, July 23-26, 2007. 7-14
[45] Presentations by Steve Schlosser and Jimmy Lin at the 2008 Hadoop Summit: http://developer.yahoo.com/hadoop/summit/, accessed September 2009.
[46] S. Seo, I. Jang, K. C. Woo, I. Kim, J. S. Kim, and S. Maeng. HPMR: Prefetching and Pre-shuffling in Shared MapReduce Computation Environment. In: Proceedings of the IEEE International Conference on Cluster Computing, Louisiana, USA, IEEE Press, Aug. 31-Sep. 4, 2009. 1-8
[47] D. Jiang, B.C. Ooi, L. Shi, and S. Wu. The Performance of MapReduce: An In-depth Study. In: Journal of the VLDB Endowment, Sep. 2010, 3(2): 472-483
[48] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling. In: Proceedings of the 5th European Conference on Computer Systems (EuroSys’10), Paris, France, ACM Press, Apr. 13-16, 2010. 265-278
[49] G. Ananthanarayanan, S. Agarwal, S. Kandula, A. Greenberg, I. Stoica, D. Harlan, E. Harris. Scarlett: Coping with Skewed Content Popularity in MapReduce Clusters. In: Proceedings of the 6th European Conference on Computer Systems (EuroSys’11), Salzburg, Austria, ACM Press, Apr. 10-13, 2011.
[50] T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, and R. Sears. MapReduce Online. In: Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation (NSDI’10), California, USA, Apr. 28-30, 2010. 313-328
[51] A. Abouzeid, K. Bajda-Pawlikowski, D.J. Abadi, A. Rasin, and A. Silberschatz. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. In: Journal of the VLDB Endowment, Aug. 2009, 2(1): 922-933
[52] G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, Y. Lu, B. Saha, and E. Harris. Reining in the Outliers in MapReduce Clusters Using Mantri. In: Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (OSDI’10), California, USA, Oct. 4-6, 2010. 1-16
[53] J. A. Quiane-Ruiz, C. Pinkel, J. Schad, J. Dittrich. RAFTing MapReduce: Fast Recovery on the RAFT. In: Proceedings of the IEEE International Conference on Data Engineering (ICDE’11), Hannover, Germany, IEEE Press, Apr. 11-16, 2011. To appear
[54] X. Qiu, J. Ekanayake, S. Beason, T. Gunarathne, G. Fox, R. Barga, and D. Gannon. Cloud Technologies for Bioinformatics Applications. In: Proceedings of the ACM Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS 2009), Oregon, USA, ACM Press, Nov. 16, 2009. 6
[55] J. Lin. The Curse of Zipf and Limits to Parallelization: A Look at the Stragglers Problem in MapReduce. In: Proceedings of the Workshop on Large-Scale Distributed Systems for Information Retrieval (LSDS-IR’09), Boston, USA, July 23, 2009. 57-62
[56] P. C. Chen, Y. L. Su, J. B. Chang, and C. K. Shieh. Variable-Sized Map and Locality-Aware Reduce on Public-Resource Grids. In: Proceedings of the 5th International Conference on Grid and Pervasive Computing (GPC’10), Hualien, Taiwan, May 10-14, 2010. 234-243
[57] Y. C. Kwon, M. Balazinska, B. Howe, and J. Rolia. Skew-Resistant Parallel Processing of Feature-Extracting Scientific User-Defined Functions. In: Proceedings of the 1st ACM Symposium on Cloud Computing (SOCC’10), Indiana, USA, ACM Press, June 10-11, 2010. 75-86
[58] E. Deelman, G. Singh, M. Livny, B. Berriman, and J. Good. The Cost of Doing Science on the Cloud: the Montage Example. In: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (SC’08), Texas, USA, Nov. 15-21, 2008. 1-12
[59] M.R. Palankar, A. Iamnitchi, M. Ripeanu, and S. Garfinkel. Amazon S3 for Science Grids: a Viable Solution? In: Proceedings of the 2008 International Workshop on Data-aware Distributed Computing (DADC’08), Boston, USA, ACM Press, June 24, 2008. 55-64
[60] J. Napper and P. Bientinesi. Can Cloud Computing Reach the Top500? In: Proceedings of the UnConventional High Performance Computing Workshop (UCHPC’09), Ischia, Italy, ACM Press, May 20, 2009. 17-20
[61] E. Walker. Benchmarking Amazon EC2 for High-Performance Scientific Computing. In: ;login:, Oct. 2008, 33(5): 18-23
[62] M.D. de Assuncao, A. di Costanzo, and R. Buyya. Evaluating the Cost-Benefit of Using Cloud Computing to Extend the Capacity of Clusters. In: Proceedings of the 18th ACM International Symposium on High Performance Distributed Computing (HPDC-18), Munich, Germany, ACM Press, June 11-13, 2009. 141-150
[63] S.L. Garfinkel. An Evaluation of Amazon's Grid Computing Services: EC2, S3 and SQS. Technical Report TR-08-07, Harvard University, July 2007.
[64] H. Y. Wang, Q. F. Jing, R. S. Chen, B. S. He, Z. P. Qian and L. D. Zhou. Distributed Systems Meet Economics: Pricing in the Cloud. In: Proceedings of the 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud’10), Boston, USA, June 22, 2010.
[65] H. Cai, K. Zhang, M. Wang, J.L. Li, L. Sun, X.S. Mao. Customer Centric Cloud Service Model and a Case Study on Commerce as a Service. In: Proceedings of the IEEE 2009 International Conference on Cloud Computing (CLOUD-II 2009), Bangalore, India, IEEE Press, Sep. 21-25, 2009. 57-64
[66] A. Mihoob, C. Molina-Jimenez, S. Shrivastava. A Case for Consumer Centric Resource Accounting Models. In: Proceedings of the IEEE 2010 International Conference on Cloud Computing (CLOUD 2010), Florida, USA, IEEE Press, July 5-10, 2010. 506-512
[67] J.H. Yao, S.P. Chen, C. Wang, D. Levy, J. Zic. Accountability as a Service for the Cloud. In: Proceedings of the IEEE International Conference on Web Services (ICWS’10), Florida, USA, July 5-10, 2010. 81-88
[68] D. Gupta, L. Cherkasova, R. Gardner, and A. Vahdat. Enforcing Performance Isolation Across Virtual Machines in Xen. In: Proceedings of the ACM/IFIP/USENIX 7th International Middleware Conference (Middleware’06), Melbourne, Australia, Nov. 27-Dec. 1, 2006. 342-362
[69] L. Cherkasova, D. Gupta, A. Vahdat. Comparison of the Three CPU Schedulers in Xen. In: SIGMETRICS Performance Evaluation Review, Sep. 2007, 35(2): 42-51
[70] S. Iyer and P. Druschel. Anticipatory Scheduling: A Disk Scheduling Framework to Overcome Deceptive Idleness in Synchronous I/O. In: Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP’01), Banff, Canada, Oct. 21-24, 2001. 117-130
[71] J. Wildstrom, P. Stone, E. Witchel, and M. Dahlin. Machine Learning for On-Line Hardware Reconfiguration. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI’07), Hyderabad, India, Jan. 6-12, 2007. 1113-1118
[72] Y. Zhang and B. K. Bhargava. Self-Learning Disk Scheduling. In: IEEE Transactions on Knowledge and Data Engineering, Jan. 2009, 21(1): 50-65
[73] K. Lund and V. Goebel. Adaptive Disk Scheduling in a Multimedia DBMS. In: Proceedings of the 11th ACM International Conference on Multimedia (Multimedia’03), New York, USA, ACM Press, Nov. 2-8, 2003. 65-74
[74] S. L. Pratt and D. A. Heger. Workload Dependent Performance Evaluation of the Linux 2.6 I/O Schedulers. In: Proceedings of the Linux Symposium 2004, Ottawa, Canada, July 21-24, 2004. 425-448
[75] S. R. Seelam, R. Romero, P. J. Teller, and W. Buros. Enhancements to Linux I/O Scheduling. In: Proceedings of the Linux Symposium 2005, Ottawa, Canada, July 20-23, 2005. 175-192
[76] D. Boutcher and A. Chandra. Does Virtualization Make Disk Scheduling Passé? In: ACM SIGOPS Operating Systems Review, Jan. 2010, 44(1): 20-24
[77] M. Kesavan, A. Gavrilovska, and K. Schwan. On Disk I/O Scheduling in Virtual Machines. In: Proceedings of the 2nd Conference on I/O Virtualization (WIOV’10), Pennsylvania, USA, Mar. 13, 2010. 1-6
[78] D. Ongaro, A. L. Cox, and S. Rixner. Scheduling I/O in Virtual Machine Monitors. In: Proceedings of the 4th International Conference on Virtual Execution Environments (VEE’08), Washington, USA, Mar. 5-7, 2008. 1-10
[79] X. Pu, L. Liu, Y.D. Mei, S. Sivathanu, Y. Koh, C. Pu. Understanding Performance Interference of I/O Workload in Virtualized Cloud Environments. In: Proceedings of the 2010 IEEE 3rd International Conference on Cloud Computing (Cloud’10), Washington, USA, July 5-10, 2010. 51-58
[80] Y.D. Mei, L. Liu, X. Pu, S. Sivathanu. Performance Measurements and Analysis of Network I/O Applications in Virtualized Cloud. In: Proceedings of the 2010 IEEE 3rd International Conference on Cloud Computing (Cloud’10), Washington, USA, July 5-10, 2010. 59-66
[81] Y. Koh, R. Knauerhase, P. Brett, M. Bowman, Z. Wen, and C. Pu. An Analysis of Performance Interference Effects in Virtual Environments. In: Proceedings of the IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS’07), California, USA, Apr. 25-27, 2007. 200-209
[82] M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica. Improving MapReduce Performance in Heterogeneous Environments. In: Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI’08), California, USA, Dec. 8-10, 2008. 29-42
[83] Tashi Homepage: http://www.pittsburgh.intelresearch.net/projects/tashi/, accessed September 2009.
[84] R. K. Jain, D. W. Chiu, and W. R. Hawe. A Quantitative Measure of Fairness and Discrimination for Resource Allocation in Shared Computer Systems. Technical Report, Digital Equipment Corporation, Sep. 1984.
[85] F. Cappello, E. Caron, M. Dayde, F. Desprez, Y. Jegou, P. Primet, E. Jeannot, S. Lanteri, J. Leduc, N. Melab, G. Mornet, R. Namyst, B. Quetier, O. Richard. Grid’5000: A Large Scale and Highly Reconfigurable Grid Experimental Testbed. In: Proceedings of the 6th IEEE/ACM International Workshop on Grid Computing (GRID’05), Washington, USA, Nov. 13-14, 2005. 99-106
[86] D. J. DeWitt and J. Gray. Parallel Database Systems: The Future of High Performance Database Systems. In: Communications of the ACM, June 1992, 35(6): 85-98
[87] S. Ibrahim, H. Jin, L. Lu, B.S. He, L. Qi, and S. Wu. LEEN: Locality/Fairness-aware Key Partitioning for MapReduce in the Cloud. In: Proceedings of the 2010 IEEE 2nd International Conference on Cloud Computing Technology and Science (CloudCom’10), Indiana, USA, Nov. 30-Dec. 3, 2010. 17-24
[88] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. H. Katz, A. Konwinski, G. Lee, D. A. Patterson, A. Rabkin, I. Stoica and M. Zaharia. Above the Clouds: A Berkeley View of Cloud Computing. Technical Report, University of California, Berkeley, Feb. 10, 2009. http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-28.pdf
[89] L. Gillam and N. Antonopoulos. Cloud Computing: Principles, Systems and Applications. Springer Press, Aug. 2010.
[90] S. Maxwell. The Price is Wrong: Understanding What Makes a Price Seem Fair and the True Cost of Unfair Pricing. Wiley, Jan. 2008.
[91] F. Black and M.S. Scholes. The Pricing of Options and Corporate Liabilities. In: Journal of Political Economy, May-June 1973, 81(3): 637-654
[92] P. Apparao, R. Iyer, D. Newell. Towards Modeling and Analysis of Consolidated CMP Servers. In: ACM SIGARCH Computer Architecture News, May 2008, 36(2): 38-45
[93] L. Cherkasova, R. Gardner. Measuring CPU Overhead for I/O Processing in the Xen Virtual Machine Monitor. In: Proceedings of the USENIX Annual Technical Conference (USENIX’05), California, USA, Apr. 10-15, 2005. 24-24
[94] J. Katcher. PostMark: A New File System Benchmark. Technical Report, Network Appliance, Aug. 1997.
[95] C. Bienia, S. Kumar, J.P. Singh, and K. Li. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In: Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT’08), Ontario, Canada, Oct. 25-29, 2008. 72-81
[96] B. Fischer and M. Scholes. A Study of Option Pricing Models. Technical Report, Bradley University, 1973.
[97] V. Chadha, R. Illiikkal, R. Iyer, J. Moses, D. Newell, and R.J. Figueiredo. I/O Processing in a Virtualized Platform: a Simulation-driven Approach. In: Proceedings of the 3rd International Conference on Virtual Execution Environments (VEE’07), California, USA, June 13-15, 2007. 116-125
[98] System Performance Benchmark: http://sysbench.sourceforge.net/
[99] Virtual Memory Statistic: http://linux.die.net/man/8/vmstat.
[100] Flexible File System Benchmark: http://sourceforge.net/projects/ffsb/.
[101] C. C. Chang and C. J. Lin. LIBSVM: A Library for Support Vector Machines. 2001. Software available at: http://www.csie.ntu.edu.tw/~cjlin/libsvm
[102] R. Figueiredo, P. Dinda, J. Fortes. A Case for Grid Computing on Virtual Machines. In: Proceedings of the 23rd International Conference on Distributed Computing Systems (ICDCS’03), Rhode Island, USA, IEEE Press, May 19-22, 2003. 550-559
[103] M.F. Mergen, V. Uhlig, O. Krieger, J. Xenidis. Virtualization for High Performance Computing. In: ACM SIGOPS Operating Systems Review, Apr. 2006, 40(2): 8-11
[104] W. Huang, J. Liu, B. Abali, D.K. Panda. A Case for High Performance Computing with Virtual Machines. In: Proceedings of the 20th ACM International Conference on Supercomputing (ICS’06), Queensland, Australia, ACM Press, June 28-July 1, 2006. 125-134
[105] A.B. Nagarajan, F. Mueller, C. Engelmann, S.L. Scott. Proactive Fault Tolerance for HPC with Xen Virtualization. In: Proceedings of the 21st ACM International Conference on Supercomputing (ICS’07), Washington, USA, ACM Press, June 16-20, 2007. 23-32
[106] C. Clark, K. Fraser, S. Hand, J.G. Hansen, E. Jul, C. Limpach, I. Pratt, A. Warfield. Live Migration of Virtual Machines. In: Proceedings of the 2nd ACM/USENIX Symposium on Networked Systems Design and Implementation (NSDI’05), Massachusetts, USA, May 2-4, 2005. 273-286
[107] M. Zhao, R.J. Figueiredo. Experimental Study of Virtual Machine Migration in Support of Reservation of Cluster Resources. In: Proceedings of the 2nd International Workshop on Virtualization Technology in Distributed Computing (VTDC’07), Nevada, USA, ACM Press, Nov. 12, 2007. 1-8
[108] S. Ibrahim, H. Jin, B. Cheng, H. Cao, S. Wu, and L. Qi. Cloudlet: Towards MapReduce Implementation on Virtual Machines. In: Proceedings of the ACM International Symposium on High Performance Distributed Computing (HPDC-18), Munich, Germany, ACM Press, June 11-13, 2009. 65-66
[109] S. Ibrahim, H. Jin, L. Lu, L. Qi, S. Wu, and X. Shi. Evaluating MapReduce on Virtual Machines: The Hadoop Case. In: Proceedings of the 1st International Conference on Cloud Computing (CloudCom 2009), Beijing, China, Springer LNCS, Dec. 1-4, 2009. 519-528
[110] C. Olston, B. Reed, U. Srivastava, R. Kumar and A. Tomkins. Pig Latin: A Not-So-Foreign Language for Data Processing. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD’08), British Columbia, Canada, ACM Press, June 10-12, 2008. 1099-1110

List of Abbreviations

Amazon EC2    Amazon Elastic Compute Cloud
Amazon S3     Amazon Simple Storage Service
Amazon SQS    Amazon Simple Queue Service
CaaS          Computation as a Service
CFQ           Completely Fair Queuing
DaaS          Data as a Service
DAG           Directed Acyclic Graph
DBMS          Database Management System
EVMTime       Effective Virtual Machine Time
G.ho.st       Global Hosted Operating SysTem
GFS           Google File System
GPU           Graphics Processing Unit
HDFS          Hadoop Distributed File System
IaaS          Infrastructure as a Service
LARS          Locality Aware Reduce Scheduling
LATE          Longest Approximate Time to End
LEEN          Locality-aware and Fairness-aware Key Partitioning
NPB           NAS Parallel Benchmark
PaaS          Platform as a Service
RPC           Remote Procedure Call
SaaS          Software as a Service
SAN           Storage Area Network
SEDF          Simple Earliest Deadline First
TCQ           Tagged Command Queuing
VM            Virtual Machine
VMM           Virtual Machine Monitor

Appendix 1 Publications

[1] Shadi Ibrahim, Hai Jin, Lu Lu, Bingsheng He, Song Wu. Adaptive Disk I/O Scheduling for MapReduce in Virtualized Environment. To appear in: Proceedings of the 40th Annual International Conference on Parallel Processing (ICPP 2011).
[2] Shadi Ibrahim, Bingsheng He, Hai Jin. Towards Pay-As-You-Consume Cloud Computing. In: Proceedings of the IEEE 8th International Conference on Services Computing (SCC 2011), Washington, USA. IEEE Press, July 4-9, 2011.
[3] Shadi Ibrahim, Hai Jin, Lu Lu, Bingsheng He, Li Qi, Song Wu. LEEN: Locality/Fairness-aware Key Partitioning for MapReduce in the Cloud. In: Proceedings of the 2nd IEEE International Conference on Cloud Computing Technology and Science (CloudCom 2010), Indiana, USA. IEEE Press, Nov. 30-Dec. 3, 2010. 17-24
[4] Shadi Ibrahim, Hai Jin, Lu Lu, Li Qi, Song Wu, Xuanhua Shi. Evaluating MapReduce on Virtual Machines: The Hadoop Case. In: Proceedings of the 1st International Conference on Cloud Computing (CloudCom 2009), Beijing, China. Springer, Dec. 1-4, 2009. 519-528
[5] Shadi Ibrahim, Hai Jin, Bin Cheng, Haijun Cao, Song Wu, Li Qi. CLOUDLET: Towards MapReduce Implementation on Virtual Machines. In: Proceedings of the 18th International Symposium on High Performance Distributed Computing (HPDC-18), Munich, Germany. ACM Press, June 11-13, 2009. 65-66
[6] Shadi Ibrahim, Hai Jin, Li Qi, Chunqiang Zeng. Grid Maintenance: Challenges and Existing Models. In: Proceedings of the 3rd IEEE International Conference on Information & Communication Technologies: from Theory to Applications (ICTTA 2008), Damascus, Syria. IEEE Press, Apr. 7-11, 2008. 1-6
[7] Hai Jin, Shadi Ibrahim, Li Qi, Haijun Cao, Song Wu, Xuanhua Shi. The MapReduce Programming Model and Implementations. Book Chapter in Cloud Computing: Principles and Paradigms, Wiley Press, Jan. 2011. 373-390
[8] Hai Jin, Shadi Ibrahim, Tim Bell, Wei Gao, Dachuan Huang, Song Wu. Cloud Types and Services. Book Chapter in the Handbook of Cloud Computing, Springer Press, Sep. 2010. 335-355
[9] Hai Jin, Shadi Ibrahim, Tim Bell, Li Qi, Haijun Cao, Song Wu, Xuanhua Shi. Tools and Technologies for Building the Clouds. Book Chapter in Cloud Computing: Principles, Systems and Applications, Springer Press, Aug. 2010. 3-20
[10] Dachuan Huang, Xuanhua Shi, Shadi Ibrahim, Lu Lu, Song Wu, Hai Jin. MR-Scope: A Real-Time Tracing Tool for MapReduce. In: Proceedings of the First International Workshop on MapReduce and its Applications (MAPREDUCE’10), Chicago, USA. ACM Press, June 21, 2010. 849-855
[11] Haijun Cao, Hai Jin, Song Wu, Shadi Ibrahim. ImageFlow: Workflow based Image Processing with Legacy Program in Grid. In: Proceedings of the 2009 International Conference on Future Information Technology and Management Engineering (FITME 2009), Sanya, China. IEEE Press, Dec. 13-14, 2009. 115-118
[12] Haijun Cao, Hai Jin, Song Wu, Shadi Ibrahim. Petri Net based Grid Workflow Verification and Optimization. To appear in Journal of Supercomputing.

Pending Papers

[13] Shadi Ibrahim, Hai Jin, Lu Lu, Song Wu. Maestro: Replica-aware Map Execution to Improve MapReduce Performance in the Cloud. In submission.