2013 IEEE/ACM 6th International Conference on Utility and Cloud Computing

A Distributed Control Approach for Autonomic Performance Management in Cloud Computing Environment

Rajat Mehrotra, Mississippi State University, Starkville, MS
Sherif Abdelwahed, Mississippi State University, Starkville, MS
Abdelkarim Erradi, Qatar University, Doha, Qatar

Abstract: In this paper, a distributed control based performance management approach is developed for a general class of web services deployed in a cloud computing environment. This approach is developed using interaction balance principles that have been applied for optimal control of large scale dynamic systems. The proposed approach can be applied to a general class of distributed computing systems where performance can be tuned by changing a finite set of control inputs. The developed distributed control structure is demonstrated by applying it to manage a distributed web service to minimize power consumption and maximize quality of service.

Keywords: Interaction Balance; Distributed Control.

I. INTRODUCTION

Modern web services are deployed in cloud computing environments over multiple computing nodes in accordance with service level agreements (SLAs). These SLAs are negotiated between the cloud provider and the service provider for computing resource availability and pay-per-use. Despite being behaviourally decoupled (no dynamic interactions), these nodes are coupled in the overall deployment cost and performance objectives as defined by SLAs between the service provider and the end users of the service. Therefore, the objective function and constraints of each computing node depend on the state of other nodes. For example, all of the web service instances process incoming HTTP requests according to each node's processing capacity, while minimizing the overall operational cost and SLA violation penalty.

In a cloud computing environment, SLAs between a service provider and a cloud provider cover only the pay-per-use and availability aspects of a cloud infrastructure. There is no agreement on the service level performance or quality of service (QoS) that the given resources are able to provide to the deployed web services. Application performance issues in a cloud computing environment can be addressed by service providers through developing service specific controllers that can manage the web service requirements for computing, network, and storage resources [1]. These service level controllers can be designed, developed, and deployed by the service providers at each computing node independent of the cloud level controllers. Cloud level controllers are deployed by the cloud providers to ensure resource availability and minimum downtime for the deployed web service.

978-0-7695-5152-4/13 $26.00 © 2013 IEEE. CFP13UCC-USB/13 $26.00 © 2013 IEEE. DOI 10.1109/UCC.2013.54

II. RELATED WORK

Researchers from academia and industry have recently addressed the coordination issues among nodes in distributed systems that are coupled only through objective functions and constraints. A hierarchical control framework is developed in [2] for solving the performance management problems of a distributed web service deployment. In this case, a limited look-ahead controller is deployed at each level in the hierarchy to manage the interaction among lower-level controllers. Another implementation of hierarchical control is presented in [3], which utilizes approximation techniques based on regression trees and neural networks for dynamically learning the controller behaviour and making optimal resource allocation decisions at multiple levels of the hierarchy. A self-management approach is developed in [4], where multiple decentralized controllers form a dynamic overlay network for communication. These controllers share their local information with their neighbors to approximate the overall system level state.

All of these research contributions require abstract models at various levels of control, which can be a major bottleneck in large scale systems with many logical hierarchical levels. Similarly, in decentralized approaches the communication overhead among the subsystems increases significantly as the number of subsystems grows. In this paper, a distributed control algorithm is developed for performance management of a web service system hosted in a cloud infrastructure. This algorithm utilizes the interaction balance based management approach, which requires neither models at higher levels of the hierarchy nor communication among the subsystems. Important results from prior work on the distributed control of large scale systems [5, 6] are utilized in the paper, where the main idea is cooperation among multiple independent subsystems to optimize a global cost function under certain constraints while independently optimizing the subsystem cost function.

III. THE DISTRIBUTED CONTROL APPROACH

This paper addresses the performance management issues in distributed systems, such as those used to host services in cloud environments. The system under consideration consists of N subsystems as shown in Figure 1, where incoming HTTP requests $\omega(k)$ from different clients arrive at





a global “Coordinator”, which forwards these HTTP requests to the subsystems through the “Dispatcher”. The “Coordinator” maintains a global queue (Q) for the incoming HTTP requests. The current global queue level is accessible to all of the subsystems. At each time step k, the coordinator assigns a fraction $\alpha_i(k)$ of the total requests $(Q(k) + \omega(k))$ to subsystem $i \in \{1, \dots, N\}$. $\alpha_i(k)$ is determined by the subsystems at each time step and passed to the coordinator. The proposed control approach is based on a decomposition of the global optimization problem into a set of N sub-problems that are solved at each subsystem by using model-based control [2]. The state dynamics at subsystem i are described as

$$\hat{q}_i(k+1) = \left[ q_i(k) + \alpha_i(k)\,(Q(k) + \hat{\omega}(k)) - \frac{u_i(k)\,T}{\hat{c}_i(k)\,u_i^m} \right]^+ \tag{1}$$

$$\hat{r}_i(k+1) = \left(1 + \hat{q}_i(k+1)\right) \frac{\hat{c}_i(k)\,u_i^m}{u_i(k+1)} \tag{2}$$

$$\hat{Q}_i(k+1) = (Q(k) + \hat{\omega}(k)) \left[ 1 - \alpha_i(k) - \sum_{j \neq i}^{N} \alpha_j^*(k) \right]^+ \tag{3}$$

where $[a]^+ = \max(0, a)$, $q_i(k)$ is the local queue size of subsystem i, $Q(k)$ is the measured global queue size, and $\hat{\omega}(k)$ is the expected arrival rate of requests at the cluster level. $\hat{q}_i(k+1)$ is the expected queue level of subsystem i, $\hat{r}_i(k+1)$ is the expected response time, $\hat{Q}_i(k+1)$ is the estimated global queue size as seen by subsystem i, T is the sampling interval, and $u_i(k) \in U_i$ is the CPU frequency ($U_i$ is the finite set of all possible CPU frequencies at each node). $u_i^m$ is the maximum supported frequency in subsystem i and $\hat{c}_i(k)$ is the predicted average service time per request at the maximum frequency. $\alpha_i(k)$ is the fraction of the workload $\hat{\omega}(k)$ desired by subsystem i, and the values of $\alpha_j^*(k)$ ($j \neq i$) are received from the coordinator. The combination of $u_i(k)$ and $\alpha_i(k)$ is considered as the control input at subsystem i.

Figure 1. The Two-level Distributed Control Structure

The operating cost of the cluster of N subsystems for a look-ahead horizon of H is $J(k) = \sum_{i=1}^{N} J_i(k)$, where $J_i(k)$ is the operating cost of subsystem i including the SLA violation penalty, the energy cost of control inputs, and the global queue size (see Equation 4). The energy cost of the subsystem depends on its frequency $u_i(k)$ and can be represented as $u_i(k)/u_i^m$. $A_i$, $B_i$, $C_i$, and $G$ are user defined norm weights. To include the effect of the coordinated goal, the operating cost $J_i$ at subsystem i is indirectly coupled with all of the other subsystems through the workload fraction $\alpha_i(k)$ as shown in Equation 1. Therefore, changes in $\alpha_i(k)$ at subsystem i affect the cost functions $J_i(k)$ and $J_j(k)$, where $j \neq i$.

$$J_i(k) = \sum_{k=1}^{H} \left( \left\| Q_i(k+1) - Q_s \right\|_{G} + \left\| q_i(k+1) - q_s \right\|_{A_i} + \left\| r_i(k+1) - r_s \right\|_{B_i} + \left\| \frac{u_i(k)}{u_i^m} \right\|_{C_i} \right) \tag{4}$$

As described previously, incoming HTTP requests are first queued at the coordinator in the global queue (Q) and then at the subsystem in the local queue ($q_i$). The requests queued in the global queue can be processed by any subsystem that has system resources available, while requests queued in the local queue at subsystem i can only be processed by subsystem i. The subsystems are therefore discouraged from increasing their local queues by applying a higher penalty on the local queue than on the global queue in the cost function. Also, a higher value of $\alpha_i(k)$ increases the local queue while reducing the global queue (Equations 1 and 3). Therefore, the overall optimization problem is to find the optimal set of control inputs u and workload fractions α for each subsystem i that minimizes the overall cost function J, while satisfying the operational constraints of the subsystems and the SLAs of the application.

A. Problem Decomposition

The distributed system considered here consists of N subsystems, which are coupled only through the workload fractions $\alpha(k)$ and the cluster level cost function $J(k)$. The global cost function $J(k)$ is the sum of the cost functions $J_i(k)$ of each subsystem i given in Equation 4. For proper decomposition, an interaction variable $Z_i(k)$ should be defined at subsystem i to represent the effect of the subsystem dynamics on the global cost. $Z_i$ is chosen as the sum of the workload fractions $\alpha_j^*$ received from the other subsystems j ($j \neq i$), that is, $Z_i(k) = \sum_{j \neq i}^{N} \alpha_j^*(k)$. The constraint at subsystem i is $\alpha_i(k) = 1 - Z_i(k)$, and as such $\sum_{j \neq i}^{N} \alpha_j^*(k) + \alpha_i(k) = 1$. The Lagrangian $L_i$ of each subsystem is

$$L_i(k) = J_i(k) + \sum_{k=1}^{H} \beta_i(k) \left( 1 - \alpha_i(k) - \sum_{j \neq i}^{N} \alpha_j^*(k) \right) \tag{5}$$

where $\beta_i \in \mathbb{R}^{H}$ is a vector corresponding to subsystem i that is extracted from the Lagrange multipliers vector β received from the coordinator ($\beta \in \mathbb{R}^{NH}$). The Lagrangian $L(k)$ for the cost function $J(k)$ is represented as $L(k) = \sum_{i=1}^{N} L_i(k)$. The overall problem of minimizing the cost function J is decomposed into N first level problems of minimizing $L_i$, such that Equation 1 is satisfied with $Q(1) = 0$ and $q_i(1) = 0$. The problem at the coordinator is expressed as updating the value of β to reduce the interaction error (Equation 7) to a pre-defined small value ε.
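As an illustration of the subsystem model, the one-step prediction in Equations 1 to 3 and a single term of the cost sum in Equation 4 can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' Matlab implementation: the names `Subsystem`, `predict`, and `stage_cost` are hypothetical, each weighted norm is taken to be the weight times an absolute value, the quantities $Q_s$, $q_s$, and $r_s$ are treated as set points that default to zero, and the frequency $u_i(k+1)$ in Equation 2 is assumed equal to $u_i(k)$ unless `u_next` is given. The numeric values are made up for the example.

```python
from dataclasses import dataclass


def pos(a: float) -> float:
    """[a]^+ = max(0, a), as used in Equations 1 and 3."""
    return max(0.0, a)


@dataclass
class Subsystem:
    c_hat: float  # predicted service time per request at maximum frequency (s)
    u_max: float  # maximum supported CPU frequency u_i^m
    T: float      # sampling interval (s)


def predict(sub, q, Q, w_hat, alpha_i, alpha_others, u, u_next=None):
    """One-step prediction of (q_i, r_i, Q_i) following Equations 1 to 3."""
    # Assumption: the frequency is held constant unless u_next is supplied.
    u_next = u if u_next is None else u_next
    total = Q + w_hat                                   # requests available to share
    served = u * sub.T / (sub.c_hat * sub.u_max)        # requests processed in T
    q_next = pos(q + alpha_i * total - served)                  # Equation 1
    r_next = (1.0 + q_next) * sub.c_hat * sub.u_max / u_next    # Equation 2
    Q_next = total * pos(1.0 - alpha_i - sum(alpha_others))     # Equation 3
    return q_next, r_next, Q_next


def stage_cost(sub, q_next, r_next, Q_next, u, weights,
               setpoints=(0.0, 0.0, 0.0)):
    """One term of the sum in Equation 4.

    Assumption: each weighted norm is weight * |x| for scalar quantities.
    """
    G, A, B, C = weights
    Q_s, q_s, r_s = setpoints
    return (G * abs(Q_next - Q_s) + A * abs(q_next - q_s)
            + B * abs(r_next - r_s) + C * abs(u / sub.u_max))


# Illustrative values only (not taken from the paper's experiments).
node = Subsystem(c_hat=0.01, u_max=2.0, T=30.0)
q1, r1, Q1 = predict(node, q=100.0, Q=0.0, w_hat=2000.0,
                     alpha_i=0.5, alpha_others=[0.25, 0.25], u=0.5)
cost = stage_cost(node, q1, r1, Q1, u=0.5, weights=(1.0, 2.0, 1.0, 1.0))
```

In this example the local queue weight (2.0) is larger than the global queue weight (1.0), which mirrors the stated design choice of penalizing local queue growth more heavily than global queue growth.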

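The two-level idea behind this decomposition can be illustrated on a toy problem. The sketch below is a deliberately simplified stand-in, not the algorithm of the paper: it assumes a single scalar multiplier instead of the vector $\beta \in \mathbb{R}^{NH}$, a hypothetical quadratic local cost $a_i \alpha_i^2$ in place of $J_i$, a continuous workload fraction with a closed-form minimizer instead of a discretized search, and a plain gradient step on the multiplier. The function names `local_alpha` and `coordinate` are illustrative.

```python
def local_alpha(beta: float, a: float) -> float:
    """Subsystem step for the toy problem.

    Minimizes a * alpha**2 + beta * (1 - alpha - z) over alpha in [0, 1].
    The share z claimed by the other subsystems is a constant here, so the
    minimizer is beta / (2 * a), clipped to the feasible interval.
    """
    return min(1.0, max(0.0, beta / (2.0 * a)))


def coordinate(costs, beta=0.0, step=0.5, eps=1e-6, max_iter=1000):
    """Coordinator loop: adjust beta until the workload shares sum to one."""
    for iteration in range(1, max_iter + 1):
        # Each subsystem solves its own problem using only the current beta.
        alphas = [local_alpha(beta, a) for a in costs]
        # Interaction error: one minus the sum of the requested shares.
        error = 1.0 - sum(alphas)
        if abs(error) <= eps:
            return alphas, beta, iteration
        # Plain gradient step on the multiplier (an assumption of this sketch).
        beta += step * error
    raise RuntimeError("interaction error did not reach the tolerance")


# Four subsystems; a larger coefficient means load is more costly there.
alphas, beta, iterations = coordinate([1.0, 2.0, 4.0, 4.0])
```

With these coefficients the shares settle close to 0.5, 0.25, 0.125, and 0.125, so the subsystem for which load is cheapest takes the largest share. The subsystems never exchange information with one another; they see only the multiplier sent by the coordinator.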

Figure 2. Local (subsystem) Control Structure

B. Subsystem Level Control

At the subsystem, the Lagrangian $L_i$ is minimized using the subsystem dynamics (Equations 2 and 3) subject to the constraints on the control input $[u_i(k), \alpha_i(k)]$. A uniform discretization is created for the workload fraction $\alpha_i$, and the optimal value of the control inputs that minimizes $L_i$ is then computed using the following model predictive control based steps.

1) Use $\beta_i^{l}(k)$ and $\alpha_j^{*(l)}(k)$ ($j \neq i$) as received from the coordinator to compute the optimal sequence of $(\alpha_i^{*(l)}(k), u_i^{*(l)}(k))$ over the horizon $k \in [1, H]$ that minimizes the Lagrangian $L_i(k)$ in Equation 5 by using a tree search method. l indicates the iteration instance between the subsystem and the coordinator at time step k.
2) Forward the optimal values of $\alpha_i^{*(l)}$ to the coordinator.

C. Coordinator Level Control

At the coordinator, the values of the Lagrange multipliers β are updated to decrease the interaction error e defined as

$$e_i^{l}(k) = 1 - \sum_{j=1}^{N} \alpha_j^{*(l)}(k) \tag{6}$$

$$e^{l} = \begin{bmatrix} e_1^{l} & e_2^{l} & \cdots & e_i^{l} & \cdots & e_N^{l} \end{bmatrix}^{T} \tag{7}$$

$$e_i^{l} = \begin{bmatrix} e_i^{l}(1) & e_i^{l}(2) & \cdots & e_i^{l}(k) & \cdots & e_i^{l}(H) \end{bmatrix}^{T} \tag{8}$$

The interaction error vector e is used as the gradient to modify the Lagrange multipliers $\beta(k)$ using the conjugate gradient method [5] as per the following set of equations.

$$\beta^{(l+1)}(k) = \beta^{(l)}(k) + \xi^{l} d^{l}(k) \tag{9}$$

where $\xi^{l}$ is the step length and $d^{l}$ is the search direction. $d^{l}(k)$ is calculated using the following set of equations with $d^{0} = e^{0}$.

$$d^{l+1}(k) = -e^{l+1}(k) + \sigma^{l+1} d^{l}(k) \tag{10}$$

$$\sigma^{l+1} = \frac{\left\| e^{l+1} \right\|}{\left\| e^{l} \right\|} \tag{11}$$

where $\|\cdot\|$ denotes the (Cartesian) $\ell_2$-norm. The main steps of the algorithm at the coordinator level are as follows:

1) Set initial values of the Lagrange multipliers vector β and forward it to the subsystems.
2) Calculate the interaction error vector e from the $\alpha_i^*$ received from the subsystems, using Equation 7.
3) If $\|e^{l}\|_2 \leq \epsilon$, stop and send the corresponding $\alpha(k)$ to the Dispatcher (see Figure 1) for workload distribution among the subsystems; otherwise go to the next step.
4) Calculate the Lagrange multipliers β for the next iteration using Equations 9, 10, and 11. Send this updated β to the subsystems for solving the subsystem level optimization problem. Increment l and jump to Step 2.

IV. PERFORMANCE MANAGEMENT OF A WEB SERVICE

The proposed distributed control approach is simulated in Matlab using a web service application hosted in a distributed environment over four computing nodes as shown in Figure 1. An ARIMA filter based Traffic Estimator module is developed to estimate the future environmental input $\hat{\omega}(k)$ [2]. The node level controller dynamics are shown in Figure 2. The simulation compares the performance of the proposed distributed control approach directly with a centralized approach managing the same deployment. The experiment settings and the coefficients used in the cost function are shown in Figure 3. The HTTP workload (see Figure 5) used during this simulation is based on the 1998 Football World Cup [7]. The interaction error tolerance ε is set to 0.05 and the look-ahead horizon H is set to 2. Results of the simulation are shown in Figures 4, 5, and 6.

Figure 3. Simulation settings

Figure 4 shows that the proposed approach applies higher frequencies on subsystems 1 and 4 and lower frequencies on subsystems 2 and 3 compared to the centralized approach. According to Figure 4, this results in lower response times on all the subsystems except on a few occasions at subsystems 1 and 4. The primary reason for this higher response time is the higher workload processed at Nodes 1 and 4 at these time steps. However, the average response time across all subsystems is still lower with the proposed approach. The queue size at each subsystem follows a similar trend to the response time [8]. As shown in Figure 5, the proposed approach distributes the incoming workload among the subsystems more closely in line with each subsystem's relative processing capability in the cluster than the centralized approach does. The proposed approach utilizes all subsystems fairly instead of over-utilizing a few subsystems and leaving other subsystems under-utilized, as is the case with the centralized approach. Furthermore, it adapts to changes in the workload arrival rate by changing the load distribution on the subsystems to maintain their individual utility and the total deployment utility simultaneously, whereas the centralized approach considers only the total deployment utility. Figure 5 shows that the global queue size is mostly zero except in cases of extremely high workload. The number of interactions (l) between the subsystems and the coordinator in the proposed approach is mostly 1 except in cases of extreme workload rate. It varies between 1 and 100 due to different relative priorities of performance parameters at each subsystem. However, in

case of low communication delay in interaction, the total time can still be lower than the computation time of calculating control inputs in the centralized approach. The sum of squared error ($\|e^{l}\|_2$) varies between 0 and 0.4 due to different priorities for performance parameters at each subsystem.

To compare the performance of the centralized and distributed control approaches, their deployment utilities are computed offline by using Equation 4. According to Figure 6, the low value of utility (higher cost) in the centralized approach is due to the higher queue size and response time compared to the distributed approach at extremely high workload arrival instances. The utility difference plot shows a 98% average improvement. Also, the CPU time measured in Matlab for the distributed approach shows a 95% improvement.

Figure 4. Comparison of Performance Parameters

Figure 5. Work Load Share and Interaction Statistics

Figure 6. Comparison of Deployment Utility

V. CONCLUSION AND FUTURE WORK

In this paper, a distributed control approach is developed for managing the SLAs of a web service deployed in a distributed environment by minimizing the response time and power consumption simultaneously. The developed approach can be applied to a general class of web services hosted in a distributed environment, including cloud infrastructure, for performance management under given operating constraints. This approach scales with the number of computing nodes because it does not require any offline controller behavior approximation for hierarchical deployment. In future work, this approach will be extended to manage a multi-level hierarchy by arranging computing nodes in multiple levels to enhance the performance of the deployment.

ACKNOWLEDGEMENT

This research was made possible by NPRP grant # NPRP 09-778-2299 from the Qatar National Research Fund (a member of Qatar Foundation). The statements made herein are solely the responsibility of the authors.

REFERENCES

[1] H. C. Lim, S. Babu, J. S. Chase, and S. S. Parekh, “Automated control in cloud computing: challenges and opportunities,” in Proceedings of the 1st workshop on Automated control for datacenters and clouds, ser. ACDC ’09. New York, NY, USA: ACM, 2009, pp. 13–18. [Online]. Available: http://doi.acm.org/10.1145/1555271.1555275
[2] N. Kandasamy, S. Abdelwahed, and M. Khandekar, “A hierarchical optimization framework for autonomic performance management of distributed computing systems,” in Proc. 26th IEEE Int’l Conf. Distributed Computing Systems (ICDCS), 2006.
[3] D. Kusic, N. Kandasamy, and G. Jiang, “Approximation modeling for the online performance management of distributed computing systems,” in Autonomic Computing, 2007. ICAC ’07. Fourth International Conference on, 2007, pp. 23–23.
[4] J. Xu, M. Zhao, and J. A. Fortes, “Cooperative autonomic management in dynamic distributed systems,” in Proceedings of the 11th International Symposium on Stabilization, Safety, and Security of Distributed Systems, Berlin, Heidelberg, 2009, pp. 756–770.
[5] M. G. Singh and A. Titli, Systems: Decomposition, Optimisation, and Control. Pergamon Press, 1978.
[6] N. Sadati, “A novel approach to coordination of large-scale systems; part ii interaction balance principle,” in IEEE Int’l Conf. on Industrial Technology, Dec. 2005, pp. 648–654.
[7] M. Arlitt and T. Jin, “Workload characterization of the 1998 world cup web site,” HP Labs, Technical Report HPL-99-35R1, Sep. 1999.
[8] R. Mehrotra and S. Abdelwahed, “Application of interaction balance principle for optimal control in the distributed web service deployment,” Miss. State Uni., Tech. Rep. MSU-13-002, May 2013.