Integrated QoS-aware Resource Provisioning for Parallel and Distributed Applications

Zengxiang Li∗, Long Wang∗, Yu Zhang†, Tram Truong-Huu‡, En Sheng Lim†, Purnima Murali Mohan‡, Shibin Chen†, Shuqin Ren†, Mohan Gurusamy‡, Zheng Qin∗, Rick Siow Mong Goh∗

∗Institute of High Performance Computing, A*STAR, Singapore. Email: {liz, wangl, qinz, gohsm}@ihpc.a-star.edu.sg
†Data Storage Institute, A*STAR, Singapore. Email: {Zhang Yu, Lim En Sheng, Chen Shibin, Ren Shuqin}@dsi.a-star.edu.sg
‡National University of Singapore, Singapore. Email: {eletht, isepmm, elegm}@nus.edu.sg
Abstract—With more parallel and distributed applications moving to Clouds and data centers, it is challenging to provide predictable and controllable resources to multiple tenants and thus guarantee application performance. In this paper, we propose an integrated QoS-aware resource provisioning platform based on virtualization technology for computing, storage and network resources. Coarse-grained CPU mapping and fine-grained CPU scheduling mechanisms are proposed to enable adjustable computing power. A hierarchical distributed scheduling mechanism is implemented on a scalable storage system to guarantee I/O throughput for individual tenants and applications. A network rate controller has also been developed to guarantee the data transmission rate. A web-based interface enables users to monitor real-time resource utilization and to adjust resource QoS levels on the fly. According to our experimental results, resource cost can be reduced by up to 45% without degrading the performance of a distributed data processing benchmark, and the performance of a parallel agent-based simulation can be improved by 91% using the same amount of resources. Index Terms—Cloud computing, quality of service, virtualization, resource provisioning, multi-tenant data centers
I. INTRODUCTION

In this decade, we have witnessed a burst of parallel and distributed applications moving to Clouds and data centers. The success of Amazon EC2 has popularized the Infrastructure-as-a-Service (IaaS) cloud computing model, in which users are allowed to request virtual machines (VMs) and then deploy and run arbitrary operating systems and applications. Computing resources can be acquired on a pay-per-use basis; hence, resource costs for customers might be reduced significantly. However, Amazon's EC2 does not enable users to configure or adjust VM capability flexibly. Moreover, resource QoS cannot be guaranteed, especially for resources located in remote regions or subscribed at peak hours [1]. On the other hand, applications may require resources with different Quality of Service (QoS) requirements, according to their diverse characteristics. Online applications (e.g., multiplayer games and weather forecasting) require resources at high QoS levels to enable fast decision making and real-time response. In contrast, some background applications (e.g., scientific computation) have much more flexible time constraints, and thus may use cheaper resources at low QoS levels. Generally, compute-intensive applications [2] may require a large amount of CPU resources, while data-intensive applications [3] may require high network bandwidth and I/O throughput. However, some applications may require different kinds of resources at different execution phases. Take distributed data processing (refer to Section VI-A) as an example: high I/O throughput is desired at the data reading and writing steps, while a powerful CPU matters at the data processing step. Furthermore, application workload may change dynamically. For instance, the number of active players of an online game shows a strong diurnal pattern [4]. Large-scale applications, such as parallel agent-based simulations (refer to Section VI-B), are usually used to study a complex system at the desired fidelity. They are typically composed of a group of parallel components with imbalanced and dynamically changing workloads. Consequently, application performance may be degraded significantly by the components with high workloads, while the components with low workloads may leave rented resources idle. To enhance application performance and improve resource utilization, parallel components should be coordinated in both execution speed and resource provisioning. In this paper, an integrated QoS-aware resource provisioning platform is proposed to meet the resource requirements of various applications. By exploiting virtualization technologies, multiple tenants are allowed to share data center infrastructure. The performance interference caused by noisy tenants or malicious applications can be avoided by well-designed resource mapping and scheduling. Users are free to choose resources at high or low QoS levels according to their available budget and application characteristics. Adaptive resource provisioning is also supported to handle dynamically changing workloads, based on accurate workload prediction.
Last but not least, a web-based interactive interface is implemented. Users are allowed to manage and adjust resource QoS levels intuitively. Real-time resource monitoring and pricing models assist users in finding cost-effective resource renting strategies. Experiments using a distributed data processing benchmark
Fig. 1. Architecture of integrated QoS-aware resource provisioning platform.
have verified that predictable and controllable CPU, storage space and network resources are provided to multiple tenants. Experimental results have also illustrated that adaptive resource provisioning can coordinate the parallel components of an agent-based simulation properly. The rest of this paper is organized as follows: Section II illustrates the architecture and advantages of our proposed resource provisioning platform. The CPU and storage resource provisioning mechanisms are presented in Section III and Section IV, respectively. In Section V, we present the network manager module of the platform. Section VI reports experimental results using a distributed data processing benchmark and a parallel agent-based simulation. Section VII reviews related work on QoS-aware resource provisioning. Section VIII concludes the paper and outlines future work.

II. INTEGRATED QOS-AWARE RESOURCE PROVISIONING PLATFORM

The architecture of the integrated QoS-aware resource provisioning platform is shown in Figure 1. It enables multiple tenants to execute various applications on data center infrastructure in a flexible and cost-effective manner. According to the application characteristics, tenants rent the desired amount of resources at preferred QoS levels. Furthermore, the resource quantity and QoS levels can be adjusted dynamically, without restarting the VMs or interrupting the applications. Different pricing models are defined based on resource quality or quantity. As shown in Equation 1, the VM price is determined by the QoS levels (Q) of the CPU, storage and bandwidth requirements through the functions f(.), g(.) and h(.). Generally, these functions are provided by resource providers according to their resource operation cost and market conditions:

P_VM = f(Q_CPU) + g(Q_Storage) + h(Q_Network).    (1)

The VM price can also be calculated based on resource quantity, as shown in Equation 2:

P_VM = Σ_{i=1}^{k} ω_i × R_i    (2)
Fig. 2. Fine-grained CPU resource provisioning for parallel simulation.
where R_i represents the quantity of the i-th resource measured in a fine-grained manner and ω_i represents its weight in determining the VM price. Based on the predefined pricing models, VM prices can be calculated for arbitrary resource configurations. Consequently, tenants can easily find a cost-effective strategy, making the tradeoff between application performance and resource budget constraints. Resource mapping and scheduling mechanisms are essential to provide predictable and controllable resources to multiple tenants. Resources may be assigned to applications exclusively to avoid performance interference and guarantee high-level QoS requirements. Resources may also be shared among different applications with low-level QoS requirements in order to reduce cost. The resource scheduling mechanism can provide a specified amount of resources in a fine-grained manner. Hence, application performance can be controlled and coordinated. By tracing application performance and resource utilization, the application workload can be predicted. Consequently, resource provisioning can be adjusted automatically according to the dynamically changing workload. More details are given in Sections III and IV. Since the resource requirements are well defined in the interface with sufficient detail, data center operators are able to manage their infrastructure efficiently. For instance, applications which require diverse resources can be consolidated on the same computer [5]. VMs with predictable workloads can be relocated to avoid hotspots and thus reduce the cooling overhead [6]. Therefore, more tenants can be served using the same amount of resources within the same operation cost, leading to a potential revenue increase. Similar to most cloud resource management platforms, e.g., OpenStack [7] and Eucalyptus [8], our platform delivers services to tenants through a web-based interface.
Tenants are able to manage (e.g., create, suspend, terminate) VMs and list resources with their QoS levels through local web browsers. They are also allowed to adjust CPU shares and storage I/O throughput on the fly. Resource utilization rates are monitored in real time and presented to tenants using various statistical graphs. When resources are underutilized, users can downgrade QoS levels to save resource cost; otherwise, users can upgrade QoS levels to improve application performance.
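The two pricing models in Equations 1 and 2 can be sketched as follows. The linear function shapes and the weight values below are hypothetical placeholders for illustration; the paper only states that providers derive f(.), g(.), h(.) and ω_i from operation cost and market conditions.

```python
# Sketch of the two pricing models (Equations 1 and 2).
# The QoS-level functions and weights here are hypothetical placeholders.

def price_by_qos(q_cpu, q_storage, q_network,
                 f=lambda q: 2.0 * q,    # CPU pricing function (assumed linear)
                 g=lambda q: 1.5 * q,    # storage pricing function (assumed)
                 h=lambda q: 1.0 * q):   # network pricing function (assumed)
    """Equation 1: P_VM = f(Q_CPU) + g(Q_Storage) + h(Q_Network)."""
    return f(q_cpu) + g(q_storage) + h(q_network)

def price_by_quantity(quantities, weights):
    """Equation 2: P_VM = sum over i of w_i * R_i for k resource types."""
    return sum(w * r for w, r in zip(weights, quantities))

# A VM with high CPU QoS (level 2) and low storage/network QoS (level 1):
p1 = price_by_qos(2, 1, 1)                               # 2*2 + 1.5*1 + 1*1 = 6.5
# The same VM priced by quantity (2 cores, 30 MB/s I/O, 20 MB/s bandwidth):
p2 = price_by_quantity([2, 30, 20], [1.0, 0.05, 0.04])   # 2 + 1.5 + 0.8 = 4.3
```

Either model lets a tenant enumerate candidate configurations and compare price against expected performance before renting.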
III. CPU RESOURCE PROVISIONING MECHANISM

QoS-aware CPU resource provisioning is supported in either a coarse-grained or a fine-grained manner based on resource mapping and scheduling. In the coarse-grained manner, a resource mapping algorithm is developed to map a VM's virtual CPUs (VCPUs) to physical CPU cores (PCPUs) by maintaining a mapping table. To meet high-level QoS requirements, each VCPU, such as the VCPUs in VM1, is pinned to a PCPU exclusively. For low-level QoS requirements, at most two VCPUs, such as the VCPUs in VM2 and VM3, may share the same PCPU. In the case that only VM2 is active, its VCPUs may achieve similar performance to those at high QoS levels. However, if both VM2 and VM3 (with the same priority) are active, the performance of their VCPUs may degrade by half. Generally, VCPUs at low QoS levels are cheaper but at the risk of performance degradation, especially when the infrastructure is highly utilized. In the fine-grained manner, VCPU capability is adjusted dynamically according to the overlying application workload [9]. It is implemented based on the native CPU scheduler in Xen, i.e., the credit scheduler [10]. By default, the credit scheduler is a fair-share scheduler, as all VMs are assigned the same credit. However, the assigned credit can be adjusted by setting the parameters (i.e., weight and cap) of the credit scheduler [10]. In the case that only the weight parameter is set for each VM, the credit scheduler is work-conserving, which means that a VM that has spent all of its credit will be allocated additional CPU share if there is CPU resource available. Optionally, we can use the cap parameter to specify the maximum CPU share the VM is allowed to consume. In this case, the credit scheduler is non-work-conserving, which means that a VM never consumes CPU share beyond the cap even if there is CPU resource available.
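As a concrete sketch of the two modes, the snippet below assembles the corresponding Xen toolstack commands (`xl vcpu-pin` for the coarse-grained mapping table, `xl sched-credit -c` for the fine-grained cap). The commands are only built as strings here, not executed, and the domain names and core numbers are illustrative, not taken from the paper's testbed.

```python
# Sketch of the two CPU provisioning modes, assuming Xen's xl toolstack.
# Commands are assembled (not executed); domains and cores are illustrative.

def pin_commands(mapping):
    """Coarse-grained mode: pin each VCPU to a PCPU from a mapping table.
    mapping: {(domain, vcpu): pcpu}. A high-QoS VCPU gets a PCPU exclusively;
    low-QoS VCPUs may share one (at most two per PCPU in this platform)."""
    return ["xl vcpu-pin %s %d %d" % (dom, vcpu, pcpu)
            for (dom, vcpu), pcpu in sorted(mapping.items())]

def cap_command(domain, cap_percent):
    """Fine-grained mode: set a non-work-conserving cap on CPU share.
    cap_percent=50 limits the domain to half of one PCPU's time slots,
    and a new cap takes effect without restarting the VM."""
    return "xl sched-credit -d %s -c %d" % (domain, cap_percent)

cmds = pin_commands({("vm1", 0): 0, ("vm1", 1): 1,    # high QoS: exclusive cores
                     ("vm2", 0): 2, ("vm3", 0): 2})   # low QoS: shared core
cmds.append(cap_command("vm2", 50))
```

The cap path is what the resource manager later adjusts at runtime to steer federate execution speeds.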
Compared with the work-conserving scheduler, the non-work-conserving scheduler provides better performance isolation among VMs [11], [12]. The cap value, rather than the weight value, precisely specifies the CPU share (percentage of time slots) allocated to a VM. In addition, changing the cap value does not introduce additional overhead, and the new cap value takes effect immediately [11]. Fine-grained CPU resource provisioning is very important for performance coordination in a large-scale application. For instance, a parallel simulation is usually composed of a group of simulation components (federates in High Level Architecture (HLA) terminology [13]). They are encapsulated in and executed on resident VMs with their own guest operating systems. Therefore, federates developed by different participants using different operating systems can be consolidated on the same computer. By default, the federates share resources evenly, as their resident VMs are scheduled by the hypervisor using a fair-share scheduler. Consequently, simulation performance may degrade significantly due to workload imbalance [14]. To solve this problem, adaptive fine-grained CPU resource provisioning is proposed, as shown in Fig. 2. The performance monitor, using a middleware approach, measures federate performance (i.e., execution speed) transparently to the simulation
Fig. 3. Two-level storage controller based on distributed token bucket.
application. The resource manager is able to limit each VM to a certain resource share through fine-grained adjustment of its cap value. Hence, federate execution speed can be controlled in a fine-grained manner. The resource manager periodically retrieves federate execution speeds from the performance monitors and fetches the available resources from the hypervisor. A self-adaptive auto-regressive moving-average (ARMA) model, which is commonly used in control theory, is adopted to capture the relationship between federate performance and the resource share of the resident VM. Based on the ARMA model, the resource manager is able to distribute the available resources among the VMs so that their corresponding federates have comparable execution speeds. The higher the simulation workload, the greater the resource share the federate is allocated. Since federates are proactively controlled to advance simulation time at comparable speeds, synchronization overhead is avoided and the entire simulation is sped up.

IV. STORAGE RESOURCE PROVISIONING MECHANISM

To accommodate high-volume data, a distributed storage system is composed of multiple gateways and servers (refer to Figure 5). A tenant may run multiple applications concurrently, connecting to the storage system through multiple gateways. To guarantee predictable I/O performance, a hierarchical distributed scheduling mechanism is proposed. On the tenant level, the I/O throughput rented by each tenant is guaranteed and strictly limited by a cap value. On the application level, the rented resources are served on demand to multiple applications of the same tenant. Therefore, performance interference caused by “noisy” tenants can be avoided. In the meantime, tenants with multiple applications are allowed to share the rented I/O throughput according to their real-time demands. The hierarchical scheduling mechanism is implemented based on distributed token buckets [15].
The storage gateway maintains an individual token bucket for each tenant. Each token bucket has two parameters: the average rate at which tokens are generated and the maximum capacity of the bucket. The former restricts the average I/O throughput, while the latter restricts the burst size of an I/O service. In the case that the applications of a tenant are served across multiple gateways, these distributed token buckets work together to behave as a virtual global token bucket for the tenant. Suppose that tenant U_i subscribes a service rate (i.e., I/O throughput) of L_i and has K applications (A_1, A_2, ..., A_K) served by gateways (G_1, G_2, ..., G_K), respectively. The I/O demand of A_j is denoted as D_ij. A mutualistic piggyback mechanism is developed to transmit the I/O demands among the gateways efficiently. Hence, each gateway can calculate the total I/O demand of tenant U_i (i.e., D_i = Σ_{j=1}^{K} D_ij). Thus, gateway G_j can schedule its local service proportionally (i.e., S_ij = (D_ij / D_i) × L_i). In this way, the gateways together serve tenant U_i at service rate Σ_{j=1}^{K} S_ij, which is equal to the subscribed service rate L_i. In other words, the QoS level of the storage resource subscribed by the tenant is guaranteed. It is quite common that the applications of the same tenant have imbalanced and dynamically changing I/O demands. The hierarchical scheduling mechanism allows applications with high I/O demands to occupy the spare I/O service of other applications. Therefore, their performance can be improved. In the meantime, the tenant's subscribed resource can achieve a higher utilization rate.
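The proportional split S_ij = (D_ij / D_i) × L_i can be sketched as follows; this is only the per-round rate computation each gateway performs once the piggyback mechanism has propagated all demands, not the authors' implementation.

```python
# Sketch of the proportional service split in Section IV: once each
# gateway knows all demands D_ij of tenant U_i, it grants the local
# rate S_ij = (D_ij / D_i) * L_i, so the gateways jointly serve the
# tenant at exactly its subscribed rate L_i.

def local_service_rates(demands, subscribed_rate):
    """demands: per-gateway I/O demand D_ij of one tenant (MB/s).
    Returns the local rate S_ij each gateway should grant."""
    total = sum(demands)                       # D_i, tenant's total demand
    if total == 0:
        return [0.0] * len(demands)            # idle tenant: grant nothing
    return [d / total * subscribed_rate for d in demands]

# Tenant subscribed at L_i = 30 MB/s, three applications on three gateways:
rates = local_service_rates([10.0, 20.0, 30.0], 30.0)
# rates is approximately [5, 10, 15] MB/s and always sums to L_i, so an
# application with high demand absorbs capacity its siblings leave spare.
```

Because the split is recomputed as demands change, a bursty application gains throughput without ever letting the tenant's aggregate exceed L_i.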
V. NETWORK RATE GUARANTEEING MECHANISM

We implemented differentiated network bandwidth provisioning using traffic control (tc) in the VMs, based on the Token Bucket Filter (tbf qdisc), which controls the data rate injected into the network. In the tbf queueing discipline, tokens are saved in a bucket of limited size and refreshed at a required rate. Each outgoing traffic byte is serviced by a single token from the bucket. To control the traffic leaving the interface eth0, we replace the default root queuing discipline with our own configuration. We use the tc command in a client program that runs in the VMs, which essentially slows down traffic based on commands from the Network Manager module shown in Fig. 4. The module serves as a central manager for quality of service at the network level. A server program running on the network manager controls the bandwidth assigned to a VM through a client running on the VM using tc. Thus, egress traffic is shaped, keeping the transmission rate under control. The network manager module maintains a database of the bandwidth allocated to users with low-level and high-level QoS requirements. Upon a user request, the network manager dynamically allocates bandwidth based on the QoS levels of the user using the qdisc modify command. A client daemon runs on the VMs which belong to users with high-level QoS requirements. Through this interface, additional bandwidth is assigned to such users. This allows effective shaping policies to be created on both the receiving and sending interfaces, controlling the flows in and out of the network differentially based on the QoS level.

Fig. 4. Architecture of network manager module.

VI. EXPERIMENTS AND RESULTS

To evaluate our proposed QoS-aware integrated resource provisioning platform, experiments are conducted using a distributed data processing benchmark and a parallel agent-based simulation.

Fig. 5. Experimental platform for distributed data processing benchmark.
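The per-VM egress shaping described in Section V can be sketched as below, assuming the standard Linux `tc` tbf syntax. The command is only assembled as a string (the network manager's client would run it on the VM); the device name, burst and latency values are illustrative assumptions, not the paper's configuration.

```python
# Sketch of the per-VM egress shaping used by the network manager
# (Section V), assuming Linux tc with the token bucket filter (tbf).
# The command is assembled, not executed; parameter values are assumed.

def tbf_command(rate_mbit, dev="eth0", burst="32kbit", latency="400ms"):
    """Replace the root qdisc on `dev` with a tbf capped at rate_mbit.
    `replace` installs the qdisc or updates it in place, which is how
    a VM's rate can be changed on a QoS-level upgrade or downgrade."""
    return ("tc qdisc replace dev %s root tbf rate %dmbit "
            "burst %s latency %s" % (dev, rate_mbit, burst, latency))

# Low-QoS VM capped at 160 Mbit/s (20 MB/s); raised after a QoS upgrade:
low = tbf_command(160)
high = tbf_command(240)   # 30 MB/s, enough to sustain the high-QoS I/O rate
```

Setting the cap below the storage QoS level makes the network the bottleneck, which is exactly the effect observed in the rate controller experiment of Section VI.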
A. Distributed Data Processing Benchmark

A typical data processing application usually includes three steps: 1) reading original data from remote shared storage, 2) processing data on the local computer, and 3) writing processed data back to remote storage. The first and third steps are data intensive, requiring high I/O throughput and network bandwidth, while the second step is compute intensive, requiring a high-speed CPU and large memory. We focus on the QoS levels of CPU and storage resources, assuming memory and network resources are sufficient. The distributed data processing benchmark is executed in a distributed environment as shown in Fig. 5. Two computer hosts are connected to the distributed storage system, which is composed of two gateways and two servers. One high priority VM (VM1) is created on the first computer host, while two low priority VMs (VM2 and VM3) are created on the second computer host. Each VM has two VCPUs to support two data encryption processes running concurrently. The CPU resource mapping algorithm described in Section III provides high (and low) priority VMs with CPU resources at high (and low, respectively) QoS levels. The distributed storage system employs the hierarchical scheduling mechanism described in Section IV. It provides six storage volumes: two volumes at the high QoS level (i.e., 30 MB/s I/O throughput) and four volumes at the low QoS level (i.e., 20 MB/s I/O throughput). They are mounted on the high and low priority VMs, respectively, for
TABLE I
PERFORMANCE OF DISTRIBUTED DATA PROCESSING BENCHMARK IN DIFFERENT EXECUTION SCENARIOS

Scenario | Running VMs | Read Time (s) | Process Time (s) | Write Time (s) | Performance | Cost Effi.
1        | VM1 (High)  | 41.5          | 9.0              | 38.7           | 1           | 1
2        | VM2 (Low)   | 59.8          | 9.2              | 62.5           | 0.678       | 1.36
3        | VM2 (Low)   | 62.3          | 17.2             | 61.8           | 0.631       | 1.263
         | VM3 (Low)   | 62.3          | 17.3             | 60.7           | 0.636       | 1.272
4        | VM1 (High)  | 41.3          | 8.95             | 38.5           | 1.01        | 1.01
         | VM2 (Low)   | 62.6          | 9.1              | 60.5           | 0.675       | 1.35
5        | VM1 (High)  | 41.5          | 9.3              | 40.8           | 0.974       | 0.974
         | VM2 (Low)   | 61.7          | 18.5             | 59.5           | 0.639       | 1.277
         | VM3 (Low)   | 61.9          | 18.7             | 60.4           | 0.633       | 1.265
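As a check on Table I, the Performance and Cost Effi. columns can be reproduced from the step times, assuming (per the text) that a low-QoS VM is priced at half a high-QoS VM:

```python
# Reproducing Table I's normalized performance and cost efficiency.
# Performance normalizes a VM's total benchmark time against the
# scenario-1 baseline; cost efficiency divides that by the VM's
# relative price (0.5 for a low-QoS VM, as assumed in the text).

BASELINE = 41.5 + 9.0 + 38.7   # scenario 1: VM1 (High) total time (s)

def normalized_perf(read_s, proc_s, write_s):
    return BASELINE / (read_s + proc_s + write_s)

def cost_efficiency(perf, relative_price):
    return perf / relative_price

# Scenario 2: VM2 (Low) running alone.
perf = normalized_perf(59.8, 9.2, 62.5)   # ~0.678, matching Table I
eff = cost_efficiency(perf, 0.5)          # ~1.36, matching Table I
```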
Fig. 6. CPU utilization of VMs in the 4th scenario.

Fig. 7. I/O throughput of storage volumes in the 4th scenario.
saving the original and encrypted data, each around 1 GB in size. Five execution scenarios are investigated to evaluate the performance of the data processing benchmark on the high and low priority VMs. The execution times of the different processing steps are shown in Table I. In the first scenario, only one high priority VM is running. It can read and write one gigabyte of data within 38 to 40 seconds, which indicates high-level QoS with around 30 MB/s I/O throughput. In addition, the data processing step can be finished within 9 seconds by fully utilizing two CPU cores. In the second scenario, only one low priority VM is running. It achieves almost the same processing speed as the high priority VM, as there is no resource competition. However, it takes more time on data reading/writing because its storage volume is at the low QoS level. In the third scenario, two low priority VMs are running concurrently. Since they share the physical CPU cores, the data processing time is almost doubled. In the fourth scenario, one high priority VM and one low priority VM run concurrently. Since performance interference is avoided, they achieve performance comparable to the first and second scenarios, respectively. As shown in Figure 6, both VMs fully utilize two CPU cores at the data processing step. As shown in Figure 7, the applications running on the high (or low) priority VM read/write data from the distributed storage system at the speed of 30 MB/s (or 20 MB/s, respectively). In the fifth scenario, one high priority VM and two low priority VMs are running concurrently. As shown in Figure 8,
Fig. 8. CPU utilization of VMs in the 5th scenario.
the low priority VMs can use only one CPU core. Hence, their data processing performance is degraded by half. In contrast, the performance of data reading and writing on both high and low priority VMs remains the same due to the guaranteed high or low I/O throughput. Table I also lists the normalized performance of the VMs in the different scenarios, calculated from the execution time of the entire data processing benchmark. Supposing that the price of a CPU (storage) resource at the high QoS level is twice that at the low QoS level, we can easily obtain the cost efficiency of all VMs (refer to Equation 1). As we can see, the low priority VMs may encounter performance degradation, but they are likely to have higher cost efficiency, especially in scenarios 2 and 4 (i.e., when the data center is not fully utilized). Our integrated resource provisioning platform enables users to adjust resource QoS levels on the fly. As shown in Figures 6 and 8, downgrading the CPU resource at the data reading and writing steps does not degrade performance. Similarly, downgrading the storage resource at the data processing step does not degrade performance either, as shown in Figure 7. Consequently, users are able to save 45% (or 5%) of the CPU (or storage, respectively) resource cost, while achieving the same performance as VM1 in the first scenario. Using the aforementioned VMs with the same configurations of CPU and I/O, we also evaluate the performance of the network rate controller. According to our experiments, VMs cannot transfer data beyond the specified rate. When we set the network rate for both VMs to 20 MB/s, the network rate becomes the bottleneck for the VMs' reading from and writing to the storage, even though the high priority VM's I/O throughput was set to 30 MB/s. When we increase the network rate for the high priority VM, its I/O rate can attain that throughput, resulting in better performance.
B. Parallel Agent-based Simulation

Agent-based simulations are able to mimic human or object behaviors, simulate the interactions between individuals, and provide spatio-temporal information. Therefore, they are widely used in various areas, e.g., city transportation [16] and disease propagation [17]. Since a large-scale agent-based simulation is time consuming, it is usually partitioned into a number of parallel components (federates) which are managed by a parallel simulation framework [18]. One of the greatest challenges in parallel simulation is time synchronization, which ensures that events, either generated by the federate itself or received from other federates, are processed in time stamp (TS) order. Optimistic synchronization [19] allows a federate to process events and advance simulation time freely. However, a faster federate (in terms of simulation time) may conduct over-optimistic executions and roll back its execution on receiving messages from a slower federate. It is generally agreed that optimistic synchronization performs well when all federates have comparable execution speeds (i.e., how fast simulation time is advanced) [20]. This usually requires that federates have balanced simulation workloads; unfortunately, it is very difficult to meet this requirement in practice. The fine-grained CPU resource scheduling mechanism described in Section III is integrated with SEARUMS [21], an agent-based epidemiological modeling software. SEARUMS is very convenient for studying global disease propagation, as it provides a Graphical User Interface (GUI) and Geographic Information Systems (GIS)-based map visualization. The backend of SEARUMS is an efficient optimistically synchronized parallel simulation framework implemented in C++ and MPI [18]. A synthetic disease propagation model is developed and executed on SEARUMS for performance evaluation.
It simulates the behaviors and interactions of N = 1600 people located in a 40 × 40 km² area over a 5000-hour time period. It is composed of two federates (Fed1 and Fed2). Due to imbalanced partitioning, they may handle different numbers of people. Supposing the workload imbalance factor is k, Fed1 has N/2 − (N/2)k people, while Fed2 has N/2 + (N/2)k people. People may send external events to each other to simulate disease transmission. The probability of generating an external event is denoted as P_ExternEvent. In the fixed scenario, the VM CPU shares are fixed at 50%. In contrast, in the adaptive scenario, the VM CPU shares are adjusted automatically. The sum of the CPU shares of these two VMs is capped at 100%, which indicates that the same
amount of CPU resources is used in the fixed and adaptive scenarios. In some simple agent-based simulations, federate workload increases proportionally with the number of objects, so comparable execution speeds can be achieved by adjusting VM CPU shares according to the number of objects. However, in complex agent-based simulations, the number of objects in each federate may change because of object movement or migration. Furthermore, agents may execute different computation kernels to mimic diverse behaviors. Therefore, we monitor the performance of federates in real time and predict their workloads, as described in Section III. We have further evaluated our adaptive resource provisioning mechanism using different simulation parameter settings. As shown in Figure 9, the simulation speed (i.e., the ratio between the simulation length in hours and the execution time in seconds) decreases with increasing P_ExternEvent. The adaptive scenario outperforms the fixed one by 44% to 51%. Besides the number of execution rollbacks, execution efficiency is also commonly used to measure optimistic synchronization overhead. It is defined as the ratio between the number of committed events and the number of scheduled events. It never exceeds one, as events scheduled over-optimistically must be discarded when execution rollbacks occur. As shown in Figure 10, the adaptive scenario avoids most execution rollbacks, achieving execution efficiency close to one. In the fixed scenario, the number of execution rollbacks increases dramatically with increasing P_ExternEvent; however, the execution efficiency remains around 0.7, because only a small number of events are scheduled over-optimistically when rollbacks occur frequently. As shown in Figure 11, the simulation speed decreases dramatically with an increasing workload imbalance factor in the fixed scenario. The adaptive scenario outperforms the fixed one by 5% to 91%.

Fig. 9. Simulation execution speed vs. external event probability.
The simulation speed remains almost the same despite a high workload imbalance factor. This observation verifies that our adaptive resource provisioning mechanism is able to eliminate the effect of workload imbalance on simulation execution performance. As shown in Figure 12, most execution rollbacks are avoided in the adaptive scenario, and the execution efficiency is close to one. In contrast, the higher
Fig. 10. Simulation synchronization overhead vs. external event probability. (a) Execution rollbacks. (b) Execution efficiency.
VII. RELATED WORK

QoS-aware resource provisioning is a hot research topic in both resource virtualization and cloud computing. To serve dynamic application workloads, the execution platform can be scaled up (or down) by increasing (or decreasing) the number of VMs [22]. A variety of mechanisms have been proposed to provision QoS-aware CPU [23], [24], [9], network [25], storage [26], [15], or both computing and network resources [27]. However, these solutions handle individual resources and do not tackle the performance interference from noisy tenants and malicious applications. In contrast, our integrated resource provisioning platform coordinates CPU and storage resources in a distributed environment and guarantees QoS on both the tenant and application levels. Plenty of service quality control mechanisms have been proposed for distributed storage systems, in either a centralized or a distributed manner. The centralized approaches [28] tune disk scheduling to satisfy delay bounds; however, they have poor scalability and are not applicable to large-scale shared storage systems. Similar to [26], we provide QoS guarantees based on a distributed scheduling algorithm. However, we target guaranteed service instead of differentiated service, concerning service utilization and global fairness on both the application and tenant levels. A number of adaptive resource provisioning mechanisms have been proposed [23], [24] that adjust VM capabilities according to dynamically changing workloads. However, they target server applications and cannot be applied directly to a large-scale application with highly coupled parallel components. To the best of our knowledge, limited work has been conducted on speeding up parallel and distributed simulations by harnessing virtualization technologies. A global VM scheduler [29], [14] collects the simulation times of all federates and schedules their resident VMs in least-simulation-time-first order. Similar to our fine-grained CPU resource provisioning mechanism, it can improve synchronization efficiency. However, its implementation is non-trivial, as the simulation application, the guest operating system and the VM scheduler in the hypervisor must all be modified. In contrast, our mechanism is implemented in a transparent manner.

VIII. CONCLUSION

In this paper, an integrated QoS-aware resource provisioning platform is proposed to provide multiple tenants with predictable and controllable resources. Tenants are allowed to change resource QoS levels through the web-based interface according to application requirements and budget constraints. Cost-effective resource renting strategies can be made with the assistance of real-time resource monitoring and pricing models. Automatic resource provisioning is also supported to adapt to dynamically changing workloads on the fly. To guarantee QoS levels, CPU mapping/scheduling mechanisms, a hierarchical storage I/O scheduling mechanism and a network manager are designed and implemented. Experiments using a distributed data processing benchmark have verified that CPU and storage resources are provided to tenants with diverse QoS requirements. By applying a cost-effective strategy, users may save up to 45% of resource cost without any performance degradation. Experimental results have also illustrated that adaptive fine-grained resource provisioning aligns execution
6 5.5 5 4.5 4 3.5 3 2.5 2 0
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Workload Imblance Factor Adaptive
Fig. 11.
Fixed
Simulation execution speed VS workload imbalance factor.
workload imbalance factor, the more execution rollbacks and the lower execution efficiency in the fixed scenario.
500000
Execution Efficiency
Number of Rollbacks
600000
400000 300000 200000 100000 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Workload Imbalance Factor Adaptive
Fixed
(a) Execution Rollbacks. Fig. 12.
1.2 1.1 1 0.9 0.8 0.7 0.6 0.5 0.4 0
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Workload Imbalance Factor Adaptive
Fixed
(b) Execution Efficiency.
Simulation synchronization overhead VS workload imbalance factor.
speeds of parallel components in an agent-based simulation in spite of workload imbalance. As a result, the performance can be improved by 91% using the same amount of resources. For future work, we will investigate the scalability of our resource provisioning platform. It will be applied on a large scale data center infrastructure to serve large number of tenants and applications with diverse resource requirements. ACKNOWLEDGMENT The authors would like to thank Prof. Dhananjai M. Rao from Miami University to assist with the experiments on parallel agent-based simulations based on SEARUMS [21]. We would like to acknowledge the assistance provided by Zhao Yang for development and integration of the platform. R EFERENCES [1] A. C. Zhou, B. He, X. Cheng, and C. T. Lau, “A declarative optimization engine for resource provisioning of scientific workflows in iaas clouds,” in HPDC 2015, 2015. [2] M. Brock and A. Goscinski, “Execution of compute intensive applications on hybrid clouds (case study with mpiblast),” in Proc. Int. Conf. on Complex, Intelligent and Software Intensive Systems (CISIS), 2012. [3] X. Li and J. Qiu, Cloud Computing for Data-Intensive Applications. Reading, MA: Springer-Verlag, 2014. [4] V. Nae, A. Iosup, S. Podlipnig, R. Prodan, D. Epema, and T. Fahringer, “Efficient management of data center resources for massively multiplayer online games,” in Proc. Conf. on Supercomputing (SC), 2008, pp. 10:1–10:12. [5] L. Lu, H. Zhang, E. Smirni, G. Jiang, and K. Yoshihira, “Predictive vm consolidation on multiple resources: Beyond load balancing,” in Proc. Int. Symp. on Quality of Service (IWQoS), 2013. [6] M.Tarighi, S.A.Motamedi, and S.Sharifian, “A new model for virtual machine migration in virtualized cluster server based on fuzzy decision making,” Journal of Telecommunications, vol. 1, no. 1, 2010. [7] Openstack: Open source software for creating private and public clouds. [Online]. 
Available: http://www.openstack.org/
[8] Eucalyptus cloud-computing platform. [Online]. Available: https://github.com/eucalyptus/eucalyptus
[9] Z. Li, X. Li, T. N. B. Duong, W. Cai, and S. J. Turner, "Accelerating optimistic HLA-based simulations in virtual execution environments," in Conf. on Principles of Advanced Discrete Simulation, 2013.
[10] Xen, "Xen Credit Scheduler," http://wiki.xen.org/wiki/Credit_Scheduler.
[11] D. Schanzenbach and H. Casanova, "Accuracy and responsiveness of cpu sharing using xen's cap values," Computer and Information Sciences Dept., University of Hawai'i at Manoa, Tech. Rep., 2008.
[12] S. K. Barker and P. Shenoy, "Empirical evaluation of latency-sensitive application performance in the cloud," in Proc. Conf. on Multimedia Systems (MMSys), 2010.
[13] IEEE, 1516-2010 IEEE Standard for Modeling and Simulation (M&S) High Level Architecture (HLA) – Framework and Rules, August 2010.
[14] S. Yoginath and K. Perumalla, "Optimized hypervisor scheduler for parallel discrete event simulations on virtual machine platforms," in Int. Conf. on Simulation Tools and Techniques, 2013.
[15] S. Ren, S. Chen, Y. Zhang, E. S. Lim, K. L. Yong, and Z. Li, "Two-level storage qos to manage performance for multiple tenants with multiple workloads," in IEEE CloudCom 2014, 2014.
[16] L. M. Martinez, G. H. A. Correia, and J. M. Viegas, "An agent-based simulation model to assess the impacts of introducing a shared-taxi system: an application to lisbon (portugal)," Journal of Advanced Transportation, vol. 49, 2015.
[17] D. M. Rao, "Accelerating parallel agent-based epidemiological simulations," in Conf. on Principles of Advanced Discrete Simulation, 2014.
[18] D. M. Rao, "Study of dynamic component substitution," Ph.D. dissertation, Univ. of Cincinnati, 2003.
[19] D. R. Jefferson, "Virtual time," ACM Trans. Program. Lang. Syst., vol. 7, no. 3, 1985.
[20] G. D'Angelo, "Parallel and distributed simulation from many cores to the public Cloud," in HPCS 2011, 2011.
[21] D. M. Rao, A. Chernyakhovsky, and V. Rao, "Modeling and analysis of global epidemiology of avian influenza," Environmental Modelling and Software, vol. 24, 2009.
[22] D. Ta, X. Li, R. S. M. Goh, X. Tang, and W. Cai, "Qos-aware revenue-cost optimization for latency-sensitive services in iaas clouds," in Proc. Int. Sym. on Distributed Simulation and Real Time Applications, 2012.
[23] Z. Gong, X. Gu, and J. Wilkes, "Press: Predictive elastic resource scaling for cloud systems," in Proc. Int. Conf. on Network and Service Management (CNSM'10), 2010.
[24] Z. Shen, S. Subbiah, X. Gu, and J. Wilkes, "Cloudscale: Elastic resource scaling for multi-tenant Cloud systems," in Proc. Sym. on Cloud Computing (SOCC), 2011.
[25] D. M. Divakaran and M. Gurusamy, "Towards flexible guarantees in clouds: Adaptive bandwidth allocation and pricing," IEEE Transactions on Parallel and Distributed Systems (TPDS), 2014.
[26] Y. Wang and A. Merchant, "Proportional-share scheduling for distributed storage systems," in Proc. Int. Conf. on File and Storage Technologies (FAST), 2007.
[27] T. Truong-Huu, G. Koslovski, F. Anhalt, J. Montagnat, and P. Vicat-Blanc Primet, "Joint Elastic Cloud and Virtual Network Framework for Application Performance-cost Optimization," J. Grid Computing, vol. 9, no. 1, pp. 27–47, 2011.
[28] A. Povzner, T. Kaldewey, S. Brandt, R. Golding, T. M. Wong, and C. Maltzahn, "Efficient guaranteed disk request scheduling with fahrrad," in Proc. European Conference on Computer Systems (EuroSys), 2008.
[29] S. B. Yoginath, K. S. Perumalla, and B. J. Henz, "Taming wild horses: The need for virtual time-based scheduling of vms in network simulations," in Proc. Int. Symp. on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS'12), 2012, pp. 68–77.