
IEEE COMMUNICATIONS LETTERS, VOL. 19, NO. 4, APRIL 2015


Modeling Heterogeneous Virtual Machines on IaaS Data Centers

Bin Wang, Xiaolin Chang, and Jiqiang Liu

Abstract—This letter considers IaaS data centers in which tenant tasks differ in the number of requested physical cores, leading to heterogeneous virtual machines (VMs). We develop a novel analytical model, based on a Continuous Time Markov Chain, to evaluate the performance of heterogeneous VMs on the same physical machine.

Index Terms—Cloud computing, performance analysis, heterogeneous, Markov chain, IaaS.

I. INTRODUCTION

IN “Infrastructure-as-a-Service (IaaS)” data centers, virtual computing resources are allocated to tenants in the form of virtual machines (VMs), which are deployed on Physical Machines (PMs). Accurate performance evaluation of the IaaS cloud is one of the critical issues that IaaS providers must address. Analytical modeling is an effective performance evaluation approach. Stochastic models (see [1], [2] and references therein) have been proposed for evaluating the performance of IaaS data centers under the assumption that all VMs/servers are homogeneous. In actual cloud data centers, the capacities and/or types of physical and/or virtual resources requested by tenants may differ [4], leading to VM heterogeneity. IaaS providers offer various VM types to meet tenants’ diverse service requests [5].

Large scale is also a fact of life in IaaS data centers. Recently, the authors of [1] and [3] proposed scalable modeling approaches, namely interacting stochastic model approaches, to overcome the complexity of a monolithic model of a whole large-scale data center (including multiple PMs). Note that an approximately accurate monolithic model of a PM is a necessary prerequisite for the effectiveness of these interacting stochastic model approaches.

These discussions motivate the work presented in this paper. This paper focuses on the performance evaluation of a PM on which multiple heterogeneous VMs can be deployed simultaneously and each tenant request/job consists of a single task. Task and job are used interchangeably in the following. We consider a simple form of VM heterogeneity: only the number of cores allocated to each VM may differ. That is, the number of cores requested by each task may differ. Fig. 1 describes the queuing system considered in this paper, which consists of two stages: VM provisioning and task/job serving. There is one provisioning server in Stage 1 and there are multiple working servers in Stage 2. A job begins its service only after its VM provisioning is completed. There is no waiting queue between the VM provisioning stage and the job service stage. As in [1] and [3], we assume that at most one task can stay in the provisioning stage but there may be multiple tasks in the job service stage.

Fig. 1. Two stage queuing system.

Manuscript received November 20, 2014; revised February 6, 2015; accepted February 9, 2015. Date of publication February 13, 2015; date of current version April 8, 2015. This work was supported in part by NCET (Grant No. NCET-11-0565), Fundamental Research Funds for the Central Universities (Grant No. 2012JBZ010), IRT (Grant No. IRT 201206). The associate editor coordinating the review of this paper and approving it for publication was G. Reali. The authors are with the School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China (e-mail: 12120498@bjtu.edu.cn; [email protected]; [email protected]). Digital Object Identifier 10.1109/LCOMM.2015.2403832

A recent analysis of the Google cluster dataset [6] indicated that single-task jobs account for 64% of all jobs. Note that a data center contains a large number of PMs. Therefore, even if the number of job arrivals at an IaaS data center is high, the probability of a single-task job arriving at a given PM is low. The Poisson distribution is therefore an adequate model for the single-task job arrival process [8].

We develop a novel state-space-based analytical model for PM performance evaluation using a Continuous Time Markov Chain. An analysis of real cloud data [7] indicated that 1) few jobs requested more than 4 cores and more than 80% of jobs requested 1 or 2 cores, and 2) the average number of cores of a PM is 15.95. When the number of cores in a PM is 25, the maximum number of cores requested by a job is 4 and the queue size is 10, there are 3850 states in the state space of our proposed model. This suggests that exploring a monolithic Markov model for the PM dynamics is reasonable. Note that our proposed model can be used to extend the interacting stochastic model approaches [1] and [3] to evaluate large-scale scenarios.

The main contributions are summarized as follows:
1) The state transition rules and the formulas for computing performance measures are described in detail. These formulas help define various SLAs and reveal practical insights into capacity planning for cloud data centers.
2) Extensive numerical analysis experiments and simulations are performed. The experiment results validate the approximation accuracy of the model.
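The state count quoted above can be checked by brute force. The sketch below is an illustration added for that purpose, not the authors' code: the function name `count_states` is hypothetical, and the existence rules are taken from the set Πzero defined in Section II-C. It also assumes that states with busy cores but no job in service (isj = 0 < isc) do not exist, which the second condition of Πzero does not state explicitly.

```python
def count_states(m, MC, K):
    """Count 4-tuples (i_qj, i_dj, i_sc, i_sj) that are not excluded by Pi_zero."""
    count = 0
    for i_qj in range(K + 1):
        for i_dj in (0, 1):
            for i_sc in range(m + 1):
                for i_sj in range(m + 1):
                    if i_sc < i_sj:
                        continue  # every job in service holds at least one core
                    if i_sj * MC < i_sc:
                        # i_sj jobs cannot hold more than i_sj * MC cores
                        # (assumption: this also covers i_sj = 0 < i_sc)
                        continue
                    if i_dj == 0 and i_qj > 0 and MC <= m - i_sc:
                        continue  # the head-of-queue job would already be provisioning
                    count += 1
    return count


print(count_states(m=25, MC=4, K=10))  # 3850
```

Under these assumptions the enumeration reproduces the 3850 states reported for m = 25, MC = 4 and K = 10.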
Our literature investigation indicates that the PM modeling approach proposed in [3] is close to ours. However, there are two major differences between its PM model and our model. The first difference is the formula for computing the probability that a departing job releases $c_{rel}$ cores, which is denoted as $P_l(i_{sc}, i_{sj}, c_{rel})$ in Section II. Khazaei et al. [3] assumed $P_l(i_{sc}, i_{sj}, c_{rel}) = 1/MC$, independent of $i_{sc}$ and $i_{sj}$. These two variables and $MC$ are defined in Section II. Such an assumption may introduce state transitions that do not exist, and hence lead to incorrect steady-state probabilities and performance measures. In Section II, we prove that $P_l(i_{sc}, i_{sj}, c_{rel})$ depends on both $i_{sc}$ and $i_{sj}$. Our experiment results in Section III demonstrate that $P_l(i_{sc}, i_{sj}, c_{rel})$ has a significant effect on model accuracy. The second difference is the state transition rules. Our model assumes that the job under provisioning cannot go into service until its VM provisioning is completed. In contrast, Khazaei et al. [3] assumed that the job in the VM provisioning stage leaves the system immediately if there is a job departure, whether or not this job’s VM provisioning is completed. In addition, the states considered by each model and the state transition rates are different.

The rest of this paper is organized as follows. Section II presents our analytical model in detail. In Section III we present evaluation results. Our conclusion is summarized in Section IV.

1558-2558 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

II. THE ANALYTICAL MODEL

This section first presents the system assumptions. Then we present the state transition rules. Finally, the formulas for computing performance measures are presented.

A. Assumptions

A PM uses special cores for VM provisioning. $m$ denotes the number of homogeneous physical cores used for Stage 2. $K$ denotes the size of the job waiting buffer/queue before the VM provisioning server. The number of cores requested by an arriving job, denoted by $c_{req} \in [1, MC]$, follows the uniform distribution. Here, $MC$ is the maximum value of $c_{req}$ and is less than $m$. The inter-arrival times of successive jobs are independent, exponentially distributed random variables with mean $1/\lambda$. VM provisioning is offered to jobs in the order of the job arrivals (First-come, First-served, FCFS).
A job goes into VM provisioning only if no job is under provisioning and the number of free cores is at least the number required by the job. An arriving job joins the waiting queue if the waiting queue is not empty or if it cannot enter VM provisioning. A job goes into the service provided by a working server immediately after its provisioning is completed. The provisioning and service rates are denoted by $\gamma$ and $\mu$, respectively. If the waiting queue is full, an arriving job is rejected immediately.

B. The State Transition Rules

With the above definitions and assumptions, we now present the model. A state is described by a 4-tuple index $(i_{qj}, i_{dj}, i_{sc}, i_{sj})$. Here, $i_{qj}$ is the number of waiting jobs, $0 \le i_{qj} \le K$; $i_{dj}$ is the number of jobs under provisioning, $0 \le i_{dj} \le 1$; $i_{sc}$ is the number of busy cores, $0 \le i_{sc} \le m$; $i_{sj}$ is the number of jobs in service, $0 \le i_{sj} \le m$. $P_a^{c_{req}}$ denotes the probability that an arriving job requests $c_{req}$ cores. Thus, $P_a^{1} = P_a^{2} = \cdots = P_a^{MC} = 1/MC$. $P_l(i_{sc}, i_{sj}, c_{rel})$ denotes the probability that a departing job (one of the $i_{sj}$ jobs) releases $c_{rel}$ cores when the number of busy cores and the number of jobs in service are $i_{sc}$ and $i_{sj}$, respectively, just before the job departs. Theorem 1 provides the formula for computing $P_l(i_{sc}, i_{sj}, c_{rel})$, which shows that $P_l(i_{sc}, i_{sj}, c_{rel})$ does not always follow the uniform distribution.

Theorem 1: Assume that $N[i_{sc}, i_{sj}]$ denotes the number of ways (permutations) in which $i_{sj}$ distinct jobs/VMs share $i_{sc}$ cores, and that $\xi[\ell][c_{rel}][i_{sc}, i_{sj}]$ denotes the number of jobs that each occupy $c_{rel}$ cores in the $\ell$-th way ($\ell \in [1, N[i_{sc}, i_{sj}]]$). Then the probability that a departing job releases $c_{rel}$ cores is

\[ P_l(i_{sc}, i_{sj}, c_{rel}) = \sum_{\ell=1}^{N[i_{sc}, i_{sj}]} \frac{1}{N[i_{sc}, i_{sj}]} \cdot \frac{\xi[\ell][c_{rel}][i_{sc}, i_{sj}]}{i_{sj}}. \tag{1} \]

Proof: Given $i_{sc}$ and $i_{sj}$, there are $N[i_{sc}, i_{sj}]$ ways in which $i_{sj}$ distinct jobs/VMs share $i_{sc}$ cores. Thus, $P_l(i_{sc}, i_{sj}, c_{rel}) = \sum_{\ell=1}^{N[i_{sc}, i_{sj}]} \left(P^{\ell}_{way} \cdot P^{\ell}_{c_{rel}}\right)$. Here, $P^{\ell}_{way}$ denotes the probability of the $\ell$-th way occurring, and $P^{\ell}_{c_{rel}}$ denotes the conditional probability of a departing job releasing $c_{rel}$ cores in the $\ell$-th way. Since $c_{req}$ of an arriving job follows the uniform distribution and no job is dropped unless the waiting queue is full, the number of cores requested by a job going into Stage 2 is also uniformly distributed. Thus, given $i_{sc}$ and $i_{sj}$, all $P^{\ell}_{way}$ are the same. Since $\sum_{\ell=1}^{N[i_{sc}, i_{sj}]} P^{\ell}_{way} = 1$, we obtain $P^{\ell}_{way} = 1/N[i_{sc}, i_{sj}]$. Now consider $P^{\ell}_{c_{rel}}$. Since the service time of each job follows the exponential distribution, within a given way every job departs the system with the same probability. Thus, $P^{\ell}_{c_{rel}} = \xi[\ell][c_{rel}][i_{sc}, i_{sj}]/i_{sj}$. This completes the proof.

Now we present the state transition rules under new job arrival, job service completion, and job’s VM provisioning completion.

Event 1: New job arrival. When a job arrives, the resulting state transitions can be divided into three categories according to $i_{qj}$ and $i_{dj}$. The first is given in (2), in which the arriving job goes directly into the VM provisioning stage with probability $\sum_{\ell=1}^{\min(m - i_{sc}, MC)} P_a^{\ell}$.

\[ (0, 0, i_{sc}, i_{sj}) \xrightarrow{\;\mathrm{rate\_a1}\;} (0, 1, i_{sc}, i_{sj}), \quad \text{where } \mathrm{rate\_a1} = \lambda \cdot \left( \sum_{\ell=1}^{\min(m - i_{sc}, MC)} P_a^{\ell} \right) \text{ and } i_{sc} < m. \tag{2} \]

The second is given by (3) and (4), which cover the situations in which the arriving job joins the waiting queue.

\[ (0, 0, i_{sc}, i_{sj}) \xrightarrow{\;\lambda - \mathrm{rate\_a1}\;} (1, 0, i_{sc}, i_{sj}) \tag{3} \]

\[ (i_{qj}, i_{dj}, i_{sc}, i_{sj}) \xrightarrow{\;\lambda\;} (i_{qj} + 1, i_{dj}, i_{sc}, i_{sj}), \quad \text{where } 0 < i_{qj} < K \text{ or } i_{qj} < i_{dj} = 1. \tag{4} \]

The third is given in (5), in which the arriving job is rejected because there is no space in the waiting queue.

\[ (K, i_{dj}, i_{sc}, i_{sj}) \xrightarrow{\;\lambda\;} (K, i_{dj}, i_{sc}, i_{sj}). \tag{5} \]
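The split in (2) and (3) of the arrival rate λ between direct provisioning and queueing can be illustrated numerically. The following is a minimal sketch under the uniform request-size assumption of Section II-A; the function name `arrival_rates` and the sample parameter values are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def arrival_rates(lam, m, MC, i_sc):
    """Rates out of state (0, 0, i_sc, i_sj) on a job arrival, following (2) and (3)."""
    p_a = Fraction(1, MC)  # uniform probability of each request size 1..MC
    # Probability that the requested number of cores fits into the free cores.
    fits = sum(p_a for _ in range(1, min(m - i_sc, MC) + 1))
    rate_a1 = lam * fits           # job enters VM provisioning, as in (2)
    return rate_a1, lam - rate_a1  # the remainder joins the waiting queue, as in (3)


# With m = 20, MC = 4 and 18 busy cores, only requests for 1 or 2 cores fit.
print(arrival_rates(lam=6, m=20, MC=4, i_sc=18))  # (Fraction(3, 1), Fraction(3, 1))
```

When at least MC cores are free, the whole rate λ goes to provisioning and the queueing rate in (3) is zero.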


Event 2: Job service completion. For each $c_{rel} \in [\max(1, i_{sc} - MC \cdot (i_{sj} - 1)), \min(m - i_{sc} + 1, MC)]$, there are four categories of state transitions according to $i_{qj}$ and $i_{dj}$. The first, given in (6), covers the case in which no job enters the VM provisioning stage after a job finishes its service.

\[ (0, i_{dj}, i_{sc}, i_{sj}) \xrightarrow{\;\mathrm{rate\_s1}\;} (0, i_{dj}, i_{sc} - c_{rel}, i_{sj} - 1), \quad \text{where } \mathrm{rate\_s1} = i_{sc} \cdot \mu \cdot P_l(i_{sc}, i_{sj}, c_{rel}) \text{ and } i_{dj} \in \{0, 1\}. \tag{6} \]

The second is given in (7). Note that a job may be under provisioning, with its provisioning not yet completed, when another job’s service completes. Thus, $i_{qj}$ is unchanged when a job departs.

\[ (i_{qj}, 1, i_{sc}, i_{sj}) \xrightarrow{\;\mathrm{rate\_s1}\;} (i_{qj}, 1, i_{sc} - c_{rel}, i_{sj} - 1), \quad \text{where } i_{qj} > 0. \tag{7} \]

The third category is given in (8). Even after a job departs the system, the number of available cores cannot satisfy the number required by the first job in the waiting queue. Thus, $i_{qj}$ is unchanged.

\[ (i_{qj}, 0, i_{sc}, i_{sj}) \xrightarrow{\;\mathrm{rate\_s2}\;} (i_{qj}, 0, i_{sc} - c_{rel}, i_{sj} - 1), \quad \text{where } m - i_{sc} + c_{rel} < MC,\ i_{qj} > 0 \text{ and } \mathrm{rate\_s2} = i_{sc} \cdot \mu \cdot P_l(i_{sc}, i_{sj}, c_{rel}) \cdot \left( 1 - \sum_{\ell=1}^{\min(m - i_{sc} + c_{rel}, MC)} P_a^{\ell} \right). \tag{8} \]

The fourth category is given in (9). After a job departs the system, the number of available cores meets the number required by the first job in the waiting queue, and that job then begins its VM provisioning.

\[ (i_{qj}, 0, i_{sc}, i_{sj}) \xrightarrow{\;\mathrm{rate\_s3}\;} (i_{qj} - 1, 1, i_{sc} - c_{rel}, i_{sj} - 1), \quad \text{where } i_{qj} > 0 \text{ and } \mathrm{rate\_s3} = i_{sc} \cdot \mu \cdot P_l(i_{sc}, i_{sj}, c_{rel}) \cdot \left( \sum_{\ell=1}^{\min(m - i_{sc} + c_{rel}, MC)} P_a^{\ell} \right). \tag{9} \]
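The departure rates in (6)–(9) all depend on the release probability of Theorem 1. As a rough illustration of (1), the sketch below evaluates it by brute-force enumeration, treating each "way" as an ordered assignment of between 1 and MC cores to each of the isj distinct jobs. The function name `p_leave` is hypothetical, and the enumeration is exponential in isj, so it is meant only for small examples rather than for building the full model.

```python
from fractions import Fraction
from itertools import product


def p_leave(i_sc, i_sj, c_rel, MC):
    """P_l(i_sc, i_sj, c_rel) from (1), evaluated by enumerating all ways."""
    if i_sj == 0:
        return Fraction(0)  # nobody is in service, so nobody can depart
    # Each way assigns 1..MC cores to every job so that the total is i_sc.
    ways = [w for w in product(range(1, MC + 1), repeat=i_sj) if sum(w) == i_sc]
    if not ways:
        return Fraction(0)  # (i_sc, i_sj) is not a feasible combination
    # xi / i_sj is the chance that the departing job holds c_rel cores in a way;
    # all ways are equally likely, hence the division by N = len(ways).
    total = sum(Fraction(w.count(c_rel), i_sj) for w in ways)
    return total / len(ways)


# Three jobs sharing five cores with MC = 4: the result is far from 1/MC.
for c in range(1, 5):
    print(c, p_leave(5, 3, c, 4))  # prints 1 1/2, 2 1/3, 3 1/6, 4 0
```

In this example the six ways are the orderings of (1, 1, 3) and (1, 2, 2), so a departing job releases one core with probability 1/2 and can never release four cores, whereas a uniform assumption would assign 1/4 to each value.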


Event 3: Job’s VM provisioning completion. For each $c_{req} \in [1, \min(m - i_{sc}, MC)]$, there are three categories of state transitions according to $i_{qj}$ and $i_{dj}$, expressed in (10)–(12), respectively.

\[ (0, 1, i_{sc}, i_{sj}) \xrightarrow{\;\mathrm{rate\_d1}\;} (0, 0, i_{sc} + c_{req}, i_{sj} + 1), \quad \text{where } \mathrm{rate\_d1} = \frac{\gamma \cdot P_a^{c_{req}}}{\sum_{\ell=1}^{\min(m - i_{sc}, MC)} P_a^{\ell}}. \tag{10} \]

\[ (i_{qj}, 1, i_{sc}, i_{sj}) \xrightarrow{\;\mathrm{rate\_d2}\;} (i_{qj}, 0, i_{sc} + c_{req}, i_{sj} + 1), \quad \text{where } i_{qj} > 0,\ m - i_{sc} - c_{req} < MC \text{ and } \mathrm{rate\_d2} = \frac{\gamma \cdot P_a^{c_{req}}}{\sum_{\ell=1}^{\min(m - i_{sc}, MC)} P_a^{\ell}} \cdot \left( 1 - \sum_{\ell=1}^{m - i_{sc} - c_{req}} P_a^{\ell} \right). \tag{11} \]

\[ (i_{qj}, 1, i_{sc}, i_{sj}) \xrightarrow{\;\mathrm{rate\_d3}\;} (i_{qj} - 1, 1, i_{sc} + c_{req}, i_{sj} + 1), \quad \text{where } i_{qj} > 0 \text{ and } \mathrm{rate\_d3} = \frac{\gamma \cdot P_a^{c_{req}}}{\sum_{\ell=1}^{\min(m - i_{sc}, MC)} P_a^{\ell}} \cdot \left( \sum_{\ell=1}^{\min(m - i_{sc} - c_{req}, MC)} P_a^{\ell} \right). \tag{12} \]

C. Performance Measures

With the state transition rules in Section II-B, we obtain the state transition rate matrix $Q$ and $2(K + 1)(m + 1)^2$ balance equations, described in (13).

\[ \Pi Q = 0. \tag{13} \]

Here $\Pi = \{\pi_{(i_{qj}, i_{dj}, i_{sc}, i_{sj})},\ 0 \le i_{qj} \le K,\ i_{dj} \in \{0, 1\},\ 0 \le i_{sc} \le m,\ 0 \le i_{sj} \le m\}$. We define

\[ \Pi_{zero} = \{\pi_{(i_{qj}, i_{dj}, i_{sc}, i_{sj})},\ 0 \le i_{qj} \le K,\ i_{dj} \in \{0, 1\},\ 0 \le i_{sc} < i_{sj} \le m\} \]
\[ \cup\ \{\pi_{(i_{qj}, i_{dj}, i_{sc}, i_{sj})},\ 0 \le i_{qj} \le K,\ i_{dj} \in \{0, 1\},\ 0 < i_{sj} \cdot MC < i_{sc} \le m\} \]
\[ \cup\ \{\pi_{(i_{qj}, i_{dj}, i_{sc}, i_{sj})},\ 0 = i_{dj} < i_{qj} \le K,\ MC \le m - i_{sc}\}. \]

Each $\pi_{(i_{qj}, i_{dj}, i_{sc}, i_{sj})}$ in $\Pi_{zero}$ is set to zero, representing that the states in $\Pi_{zero}$ do not exist. Then the balance equations are applied to compute all $\pi_{(i_{qj}, i_{dj}, i_{sc}, i_{sj})} \in \Pi_{nonzero} = \Pi - \Pi_{zero}$ together with the normalization equation

\[ \sum_{i_{qj}=0}^{K} \sum_{i_{dj}=0}^{1} \sum_{i_{sc}=0}^{m} \sum_{i_{sj}=0}^{m} \pi_{(i_{qj}, i_{dj}, i_{sc}, i_{sj})} = 1. \]

Since these equations cannot be solved in closed form, we must resort to a numerical solution, as outlined in Section III. With the obtained $\pi_{(i_{qj}, i_{dj}, i_{sc}, i_{sj})}$, we are able to compute the following measures:

1) Job immediate service probability (entering into the VM provisioning stage) is defined as

\[ \sum_{i_{sc}=0}^{m} \sum_{i_{sj}=0}^{m} \left( \pi_{(0, 0, i_{sc}, i_{sj})} \cdot \left( \sum_{\ell=1}^{\min(m - i_{sc}, MC)} P_a^{\ell} \right) \right). \]

2) Job mean completion time is defined as

\[ \frac{\sum_{i_{qj}=0}^{K} \sum_{i_{dj}=0}^{1} \sum_{i_{sc}=0}^{m} \sum_{i_{sj}=0}^{m} i_{qj} \cdot \pi_{(i_{qj}, i_{dj}, i_{sc}, i_{sj})}}{\lambda \left( 1 - \sum_{i_{dj}=0}^{1} \sum_{i_{sc}=0}^{m} \sum_{i_{sj}=0}^{m} \pi_{(K, i_{dj}, i_{sc}, i_{sj})} \right)} + \frac{1}{\gamma} + \frac{1}{\mu}. \]

III. NUMERICAL ANALYSIS

Our literature investigation of analytic model-based performance evaluation of cloud data centers indicated that only the model proposed in [3] considered VM heterogeneity. However, it is hard, if not impossible, to compare the experiment results of these two models because different assumptions (see Section I) are made in each modeling approach. Thus, this section evaluates the model by comparing its numerical results with simulation results. We also investigate the effects of different formulas for computing $P_l(i_{sc}, i_{sj}, c_{rel})$ on model accuracy.

The results for the proposed analytical model are computed using Maple 18 [9]. The simulations are conducted using Arena [10]. The simulation source file is available in [11]. No cloud provider publishes information regarding the buffer space, number of servers, or the percentage of reserved, on-demand or spot instances [3]. Thus, we determine the parameter values with reference to [1], [3] and [7]. In all experiments, $K = 10$, $\gamma = 50$ and $\mu = 1$. $m$ varies from 20 to 25, $\lambda$ from 4 to 6, and $MC$ from 1 to 4. Note that the upper bounds of both $m$ and $MC$ are set according to the real cloud data analysis [7].

Fig. 2. Immediate Service Probability over m under various λ and MC. (a) λ = 6; (b) λ = 5; (c) λ = 4.

Fig. 3. Mean Job Completion Time over m under various λ and MC. (a) λ = 4 and MC = 1; (b) λ = 6 and MC = 1; (c) λ = 6 and MC = 2; (d) λ = 6 and MC = 3; (e) λ = 6 and MC = 4.

Fig. 2 and Fig. 3 present the results for job immediate service probability and job completion time. “NumN-MCy” denotes the numerical results when $P_l(i_{sc}, i_{sj}, c_{rel})$ is calculated using (1). Here, y is the value of $MC$. “NumU-MCy” denotes the numerical results when $P_l(i_{sc}, i_{sj}, c_{rel})$ is assumed to be uniformly distributed. We obtain the “NumU-MCy” results by replacing $P_l(i_{sc}, i_{sj}, c_{rel})$ with $P_a^{c_{rel}}$ in (6)–(9) of our proposed model. The other equations are unchanged. “Simu-MCy” denotes the corresponding simulation results, which are calculated with a confidence level of 95%. Note that we do not report the immediate service probability for the case in which $P_l(i_{sc}, i_{sj}, c_{rel})$ is uniformly distributed, because its values are very small in all experiments.

From the experiment results, we observe that:
1) Our model can capture the system dynamics accurately.
2) With increasing $\lambda$ and/or $MC$, the job immediate service probability decreases but the job completion time increases.
3) Increasing $m$ may not improve the immediate service probability when the PM is working under light traffic, such as when $MC = 1$ and $\lambda = 4, 5, 6$ in Fig. 2(a), (b), (c), or when $MC = 1, 3, 4$ and $\lambda = 4$ in Fig. 2(c). Fig. 2(c) indicates that, for all configurations, the trend changes little (i.e., by less than 10%).
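The numerical step itself, solving ΠQ = 0 together with the normalization equation, is done with Maple 18 in this letter. Purely as an illustration of that step, the sketch below solves the same kind of system in plain Python with exact rational arithmetic. The function name `steady_state` and the three-state toy generator (a single-server queue with one waiting place, arrival rate 1 and service rate 2) are assumptions made for the example; they are not the PM model of Section II.

```python
from fractions import Fraction


def steady_state(Q):
    """Solve pi * Q = 0 with sum(pi) = 1 for an irreducible CTMC generator Q."""
    n = len(Q)
    # Transpose the balance equations into A x = b. One balance equation is
    # redundant, so the last one is replaced by the normalization condition.
    A = [[Fraction(Q[j][i]) for j in range(n)] for i in range(n)]
    A[-1] = [Fraction(1)] * n
    b = [Fraction(0)] * (n - 1) + [Fraction(1)]
    # Gauss-Jordan elimination in exact rational arithmetic.
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        inv = 1 / A[col][col]
        A[col] = [a * inv for a in A[col]]
        b[col] *= inv
        for r in range(n):
            if r != col and A[r][col] != 0:
                f = A[r][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
                b[r] -= f * b[col]
    return b


# Toy generator: states 0, 1, 2 jobs; arrivals at rate 1, departures at rate 2.
Q = [[-1, 1, 0],
     [2, -3, 1],
     [0, 2, -2]]
print(steady_state(Q))  # [Fraction(4, 7), Fraction(2, 7), Fraction(1, 7)]
```

For the model of Section II, the rows and columns of Q would be indexed by the existing states, with the off-diagonal entries filled from (2)–(12). For state spaces of a few thousand states, a floating-point sparse solver would likely be a more practical choice than exact elimination.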

IV. CONCLUSION

In this letter, a novel analytic model is developed for heterogeneous VMs on a PM under Poisson arrivals with exponentially distributed VM provisioning and job service times. Experiment results indicate that the proposed model approximates the PM behavior well.

REFERENCES

[1] R. Ghosh, F. Longo, V. K. Naik, and K. S. Trivedi, “Modeling and performance analysis of large scale IaaS clouds,” Future Gener. Comput. Syst., vol. 29, no. 5, pp. 1216–1234, Jul. 2013.
[2] D. Bruneo, F. Longo, and A. Puliafito, “A stochastic model to investigate data center performance and QoS in IaaS cloud computing systems,” IEEE Trans. Parallel Distrib. Syst., vol. 25, no. 3, pp. 560–569, Mar. 2014.
[3] H. Khazaei, J. V. Misic, V. B. Misic, and N. B. Mohammadi, “Modeling the performance of heterogeneous IaaS cloud centers,” in Proc. IEEE ICDCS Workshops, 2013, pp. 232–237.
[4] C. Reiss, A. Tumanov, G. Ganger, R. Katz, and M. Kozuch, “Heterogeneity and dynamicity of clouds at scale: Google trace analysis,” in Proc. ACM Symp. Cloud Comp., 2012, pp. 1–13.
[5] [Online]. Available: http://aws.amazon.com/ec2/instance-types/
[6] S. Di, D. Kondo, and W. Cirne, “Characterizing cloud applications on a Google data center,” in Proc. IEEE ICPP, 2013, pp. 468–473.
[7] R. Birke, A. Podzimek, L. Y. Chen, and E. Smirni, “State-of-the-practice in data center virtualization: Toward a better understanding of VM usage,” in Proc. DSN, 2013, pp. 1–12.
[8] G. R. Grimmett and D. R. Stirzaker, Probability and Random Processes, 3rd ed. London, U.K.: Oxford Univ. Press, 2001.
[9] Maplesoft, Inc., Maple 18. [Online]. Available: http://www.maplesoft.com/products/maple/
[10] [Online]. Available: http://www.arenasimulation.com
[11] [Online]. Available: http://www.xingbin.net/xiaolin/HeSimuArena.doe