Autonomic Benchmarking for Cloud Infrastructures

An economic optimization model

Steffen Haak, Michael Menzel
Research Center for Information Technology (FZI)
Haid-und-Neu-Str. 10-14, D-76131 Karlsruhe, Germany

{haak, menzel}@fzi.de

ABSTRACT

The growing number of Cloud Infrastructure-as-a-Service (IaaS) offerings today leaves a wide range of choices when deploying an application in the Cloud. Self-configuring and self-optimizing autonomic systems have to select an infrastructure that fits their performance preferences while simultaneously offering the optimal performance-per-price ratio, a task which is not trivial. The indicators published by providers are often neither coherent nor sufficient to predict the actual performance of a deployed application and, thus, raise the need for benchmarking the offered services. Benchmarking, however, implies considerable effort to gather the needed metrics, growing with every additional provider taken into consideration. In this paper we present an approach based on the theory of optimal stopping that enables an automated search for an optimal infrastructure service regarding the performance-per-price ratio while reducing the costs of benchmarking.

Categories and Subject Descriptors

H.4 [Information Systems Applications]: Miscellaneous; D.2.8 [Software Engineering]: Metrics—complexity measures, performance measures

Keywords

Cloud Computing, Benchmarking, Theory of Optimal Stopping, Self-Optimizing, Self-Configuring, Economic Model

General Terms

Economics, Measurement, Performance

1. INTRODUCTION

The ever growing number of Cloud infrastructure services, or Infrastructure-as-a-Service (IaaS) offerings, leaves a wide range of deployment alternatives for a self-optimizing and self-configuring system. When the different performance levels available for each IaaS are considered as well, the range of selectable service alternatives increases even more dramatically. To differentiate the available services by their performance, comparable indicators become indispensable. The indicators provided as SLAs by the Cloud companies themselves, however, are often neither coherent nor sufficient to predict the actual performance of an application. In particular, to determine the performance of an application running on an infrastructure service built on heterogeneous hardware, non-functional Quality-of-Service attributes that are specific to the application have to be measured. Hence, a self-optimizing system must rely on its own performance benchmarks to find the provider and service that fulfills its needs best. This, however, implies considerable effort to gather the needed metrics, growing with every additional provider taken into consideration. Thus, an optimization approach is required that helps to find the optimal benchmarking effort, i.e. that defines when to stop benchmarking in order to settle on the most promising IaaS offer. The focus of our research in this paper lies on intelligently reducing the benchmarking effort caused by the search for a performance-cost optimal IaaS offering for self-deploying systems.

The remainder of the paper is structured as follows: First, we state the problem and list existing research related to it. Next, we present and explain our approach and continue with a discussion of open issues within or related to our method. This is followed by a brief description of the planned evaluation of our approach. Finally, we conclude the paper and give an outlook on our future work.

2. PROBLEM STATEMENT


The Cloud has become a popular infrastructure, and more and more applications are migrated to or specifically developed for the Cloud. Most applications exploit the openness of basic infrastructure services offered by a growing number of Cloud providers. In addition, principles of autonomic computing are applied to improve efficiency and overall performance by introducing self-* capabilities. However, a self-configuring and self-optimizing application faces the choice among many Cloud providers, where each provider offers its infrastructure services with different facets, e.g. cheap, low-performance compute resources and expensive, high-performance compute resources. Even if particular services fulfilling all given requirements can be identified, there is still a vast number of choices left for each provider. As analyzing the performance of a service is time and cost intensive, the autonomic system is restrained for economic reasons in the number of services it can analyze. Since Cloud providers do not publish comprehensive performance information about their offerings, analyzing here is equivalent to measuring the expected performance of the application deployed on an infrastructure service. Intensive measurements involving benchmarking setups cause effort and costs. Hence, the number of providers selected for analysis depends heavily on the costs of measuring the performance of a single provider, for budget reasons but for optimization reasons as well. A comprehensive model to determine these costs in advance is not yet available, but it is a prerequisite for determining a precise, optimal number of providers. Moreover, the performance differences between the services influence the number of providers to analyze. This number cannot be determined in advance; it must be assumed initially and can only be assessed during the measurement process. In summary, this leads to the problem of analyzing an appropriate number of providers while maximizing the expected benefit of finding the most suitable provider from a performance-per-cost perspective.

3. RELATED WORK

Multiple approaches exist to obtain comparable performance metrics for analyzing Cloud infrastructure services [2, 12, 11] and virtualized infrastructures [4, 8, 9], but standard hardware benchmarks also help to acquire valuable measurement results [13]. However, none of the existing benchmarking methods addresses the benchmarking of multiple services as a whole and, hence, none offers concepts for cost reduction or optimization. The theory of optimal stopping has been applied to many similar problems. Yet, it has not been applied to finding an optimal benchmarking strategy, which differs in many aspects from the problems already described. Historically, the theory has its foundations in the sequential analysis of stochastic observations. A broad overview is provided by Chow et al. [6] and Ferguson [7]. McCall [10] describes its application to economic problems related to incomplete information.

4. APPROACH & METHOD

Our approach has to consider four major elements. First of all, we need to be aware of the costs accruing from benchmarking the non-functional attributes, e.g. CPU and RAM performance or network latency, of a single infrastructure service, such as Amazon EC2. We therefore roughly describe how to gather the relevant information; the cost calculation must consider the specifics of benchmarking the non-functional attributes of an infrastructure service. We then discuss the distribution of performance among different IaaS providers. Based on the literature, we present a model for a scoring function which allows us to express non-functional preferences in order to rank the Cloud alternatives according to their Quality-of-Service (QoS) attributes. The main contribution follows, as we describe how to apply the theory of optimal stopping to the question of when to stop searching for a better IaaS provider. Optimal stopping theory provides a general approach to minimizing the effort of finding a sufficient option; we customize it for the infrastructure service search by providing a benchmarking cost calculation and a QoS-attribute-based scoring function.

4.1 Cost Calculation

Costs caused by infrastructure service analysis mainly relate to one's own efforts in planning, executing and evaluating benchmarks. Unfortunately, there is no comprehensive cost model or method yet to calculate the costs of benchmarking an infrastructure service. However, we suggest the following cost categories:

• Staff
• Licenses for benchmarking and evaluation software
• Infrastructure service usage
• Network traffic

The aggregated sum of the expected costs over all cost categories yields the costs for benchmarking a single infrastructure service regarding one QoS attribute, such as CPU or RAM maximum performance or network latency. The total costs cbi for benchmarking an infrastructure service i then result from the sum of the costs of benchmarking all its QoS attributes. This is a very simple cost model that raises no claim to completeness, but it is a straightforward approach for predicting estimated costs. Strictly, the cost calculation is required per service, as each service i raises different costs cbi, e.g. in the infrastructure service usage category. Nevertheless, we assume identical costs cbi for every service i to simplify the application of our approach, and we intend to extend it with a more comprehensive service-wise cost calculation model.
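
To illustrate the cost model, the following sketch aggregates per-category costs into the cost of benchmarking one QoS attribute and sums these into the per-service total cbi. The cost categories follow the list above; the concrete figures and attribute names are made up for illustration.

```python
# Hypothetical sketch of the per-service benchmarking cost model described above.
# Category names follow the list in the text; all figures are illustrative only.

COST_CATEGORIES = ("staff", "licenses", "service_usage", "network_traffic")

def attribute_cost(costs_per_category: dict) -> float:
    """Cost of benchmarking one QoS attribute: sum over all cost categories."""
    return sum(costs_per_category.get(cat, 0.0) for cat in COST_CATEGORIES)

def service_benchmarking_cost(attribute_costs: dict) -> float:
    """Total cost cbi for one service: sum over all benchmarked QoS attributes."""
    return sum(attribute_cost(per_cat) for per_cat in attribute_costs.values())

# Example: benchmarking CPU and network latency of a single (hypothetical) service.
example = {
    "cpu":     {"staff": 40.0, "licenses": 5.0, "service_usage": 3.4, "network_traffic": 0.0},
    "latency": {"staff": 20.0, "licenses": 0.0, "service_usage": 1.7, "network_traffic": 0.5},
}
print(service_benchmarking_cost(example))  # 70.6
```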

4.2 Distribution of Performance

One important aspect of automated benchmarking of infrastructure services based on the theory of optimal stopping is the distribution of expected performance among the services under consideration. As no or only little information about the performance attributes of infrastructure services is known beforehand, it is not clear which service provides the best performance. With multiple performance attributes, such as the already mentioned response time, throughput or availability, different services might be optimal in each dimension. We later present a utility function that maps multi-dimensional performance values to a single monetarized utility value representing the non-functional quality preferences of the autonomic agent. As the costs for benchmarking all available infrastructure service options often exceed the given budget, it is not possible to decide between infrastructure services based on perfect information. Hence, the set of considered infrastructure services must be inspected intelligently in order to approach an optimal service selection. The inspection process described by the theory of optimal stopping consists of picking one service after another, benchmarking it, and applying the stopping rule after each benchmark; a minimal sketch of this loop is given below. The order of picking depends on the expected distribution of performance. The following sections give insights into the generally assumed random distributions and two more advanced views on picking services.
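
The sequential inspection process can be pictured as the following loop. The functions `benchmark`, `utility`, and `should_stop` are placeholders for the benchmarking step, the scoring function of Section 4.3, and the stopping rule of Section 4.4; they are not part of the formal model itself.

```python
# Skeleton of the sequential inspection process: pick a service, benchmark it,
# evaluate the result, then apply the stopping rule. All callables are placeholders.

def inspect_services(services, benchmark, utility, should_stop):
    """Benchmark services one after another until the stopping rule fires.

    services    -- iterable of service identifiers, in the chosen inspection order
    benchmark   -- returns the observed QoS attribute values of a service (costly)
    utility     -- maps observed attributes to a single monetarized utility value
    should_stop -- stopping rule, called after every observation
    """
    best_service, best_utility = None, float("-inf")
    observations = []
    for n, service in enumerate(services, start=1):
        attrs = benchmark(service)            # costly observation A_n
        observations.append(attrs)
        u = utility(attrs)
        if u > best_utility:
            best_service, best_utility = service, u
        if should_stop(observations, n):      # stopping rule phi_n
            break
    return best_service, best_utility
```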

4.2.1 Stochastic Distribution

In a very simplistic model, we could abstract from the different quality attributes and only consider the distribution of the aggregated utility value. With no information about providers and their infrastructure services, the benchmarked performance is expected to be distributed randomly among all service alternatives according to some distribution function. The number of services that will be benchmarked when applying our method based on the theory of optimal stopping is unknown and, hence, the overall accruing costs cannot be foreseen, as they depend on the actual observations. Assuming a certain distribution function for the aggregated utility value, our automated approach would benchmark randomly picked services until the stopping rule ends the benchmarking process after an unpredictable, but from a local perspective optimal, number of benchmarks and returns the currently best service. An example of the randomly drawn aggregated utility values for N benchmarked services is depicted in Figure 1.

Figure 1: Random Distribution of Expected Performance Among Unordered Infrastructure Services

The example shows random performance values, evaluated by a utility function, for all benchmarked services. Every service is represented by one aggregated value, the result of a utility function aggregating the perceived utility over all attributes of interest. The presented graph denotes this aggregated utility; as stated before, however, this is a simplistic representation. As we describe in detail in the following section, the aggregated value is a function of the actually observed Quality-of-Service attributes (like response time or throughput), which can be seen as a multi-dimensional random variable, each dimension having its own stochastic distribution function. For a better understanding, consider the following example: you are benchmarking different infrastructure services with equal functionality. With respect to service quality, you are interested in three QoS attributes: the service's response time, throughput and availability. As we obtain these values through benchmarking, they are individual random variables, although they are most likely not completely independent of each other. A utility function representing your preferences allows you to rank the different benchmarking results, i.e. the three obtained QoS attribute values, as the utility function evaluates them to a single value between zero (unsatisfying) and one (perfectly satisfying).

This utility value is thus a function of a multi-dimensional random variable. Generally, our approach requires some knowledge or estimate of either the distribution of the single QoS attribute values or the distribution of the aggregated utility value; the latter is less precise, yet can serve as a simplifying assumption if knowledge about the distribution of the single QoS attributes is not available. In the following subsections we examine two further aspects that either minimize the benchmarking effort by using prior information or help to apply a benchmarking budget limit.

4.2.2 Prior Knowledge about Service Performance Attributes

With prior knowledge about the performance attributes of any of the infrastructure services, the unordered set of services can be turned into an ordered list. Composing the list involves picking services and adding them to the list in descending order of their guessed overall performance. The guessing of performance must be aligned with the utility function that determines the actual overall performance value. Defining a specific order of the service list plays an important role in our stopping theory-based approach. Commonly, ordering the list of infrastructure services can reduce the benchmarking effort tremendously and lead to early stopping. When defining the list of infrastructure services, existing knowledge about providers or their services can help to find an intelligent order. By using prior knowledge about expected performance attributes of a service, or by relying on experience with providers, a pre-ordering of the list can be accomplished. Figure 2 shows the expected performance distribution of a list that is ordered based on sparse knowledge about the considered services, e.g. drawn from the SLAs published on the providers' websites.

Figure 2: Distribution of Expected Performance Among Ordered Infrastructure Services

As the example shows, the probability of benchmarking services with higher performance first is increased. Applying the automated stopping theory-based method on a perfectly ordered list typically ends after benchmarking only a few infrastructure services. In general, any list with some ordering has a higher chance of having the stopping rule end the search earlier than a random list.
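
As a rough illustration, such a pre-ordering could be realized by sorting the candidate list by a crude prior score derived from published SLA figures. The field names, weights, and providers below are purely hypothetical.

```python
# Illustrative pre-ordering of candidate services by a rough prior score,
# e.g. derived from published SLA figures; names, fields and weights are hypothetical.

candidates = [
    {"name": "provider_a", "sla_availability": 0.995, "advertised_cpu_units": 2},
    {"name": "provider_b", "sla_availability": 0.999, "advertised_cpu_units": 4},
    {"name": "provider_c", "sla_availability": 0.990, "advertised_cpu_units": 8},
]

def prior_score(service, w_avail=0.5, w_cpu=0.5, max_cpu=8):
    """Guess of the overall performance, aligned with the later utility function."""
    return (w_avail * service["sla_availability"]
            + w_cpu * service["advertised_cpu_units"] / max_cpu)

ordered = sorted(candidates, key=prior_score, reverse=True)
print([s["name"] for s in ordered])  # descending by guessed overall performance
```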

4.2.3 Service List under Benchmarking Budget Limitations

In case a hard budget limit is specified, the total benchmarking costs cb must not exceed this limit. There are two ways to tackle this constraint. The more complex but recommended alternative is to alter the stopping rule accordingly and enforce the budget limit as a side constraint. The second option at hand is to limit the list of considered services so that cb can never exceed the budget limit. Admittedly, the optimal service, i.e. the one offering the best performance-per-cost ratio, might not be discovered when the inspection is limited to a subset of infrastructure services. But decreasing the number of infrastructure services helps to obey budget rules and reduces the costs of the service search. In general, we suggest assuming identical costs for each service benchmark and defining the size of the list according to the maximum cost budget. For example, when the costs are $50 per service benchmark and the budget limit is $500, the maximum size of the list is 10 services. Choosing the maximum size is recommended, as more services raise the chance of finding a better one. Still, the best service might not be in the limited list of benchmarked services and, thus, the optimal service may not be found. Therefore, we suggest using the more complex first option and altering the stopping rule accordingly.
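
Under the identical-cost assumption, the maximum list size follows directly from the budget, as in this small sketch.

```python
# Maximum list size under a hard benchmarking budget, assuming identical
# per-service benchmarking costs (the simplification made above).

def max_list_size(budget: float, cost_per_service: float) -> int:
    return int(budget // cost_per_service)

print(max_list_size(500.0, 50.0))  # 10 services, matching the example above
```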

4.3 Utility Model

For maximizing the expected performance, we need to define a metric that allows us to rank and evaluate the benchmarking results according to our performance preferences. This is not only necessary for an objective decision process, it also allows us to balance the benefit of finding a more suitable service against the additional search costs for benchmarking. For evaluating the utility of a certain service's performance, we rely on a quasi-linear utility or scoring function as known from multi-attribute auction theory [1]. An adaptation for evaluating QoS properties of Web services has been proposed by Blau [3]. The main benefit of quasi-linear functions is their computational simplicity, which allows their usage in linear programming approaches. The scoring function S has the following form:

S(A) = Σ_{i=1}^{N} λ_i â_i

The scoring function maps the N performance values a_i in A to the real-valued interval [0, 1] such that S : A → [0, 1], with Λ = (λ_1, λ_2, . . . , λ_N) being the weights for each attribute,

Σ_{i=1}^{N} λ_i = 1

ensuring that all weights sum up to one, and

â_i : a_i → [0, 1]

a linear normalization function for each attribute A_i such that better performance values lead to higher normalized values bounded to the interval [0, 1]. Typically, this can be achieved by defining upper and lower bounds for each QoS attribute, yielding a linear utility progression between these bounds. Examples of such attributes are the aforementioned response time, throughput and availability, which one observes in a comparable way by benchmarking each service. To clarify this with an example: if we define bounds of 5 ms (lower) and 10 ms (upper) for the response time, a measured response time of 7.5 ms leads to a normalized value of 0.5. More complex non-linear utility functions, also allowing interdependencies between attributes, are conceivable as well, yet at the cost of computational complexity.

The scoring function S can be used to monetarize the utility of a Cloud service alternative by multiplying the scoring value of S with the monetary valuation for a perfect service, α. We obtain the monetary equivalent of the benefit of using this service:

b : A × α → R+, b(A, α) = α · S(A)

If we subtract the costs we have to pay for the service (if we actually choose it), denoted as c_s, and the costs spent for the n conducted benchmarks, defined as c_{b_n} = n · c_b, we obtain the overall utility, which will be used for finding the optimal stopping rule:

u(A, α, n) = b(A, α) − c_s − n · c_b

The value of u thus represents the monetarized utility of choosing a service with attributes A.
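
The following sketch puts the utility model together: linear normalization between per-attribute bounds, the quasi-linear score S(A), and the monetarized overall utility u(A, α, n). The bounds, weights, valuation α, and measured values are illustrative assumptions; only the 7.5 ms response time example mirrors the text above.

```python
# Sketch of the quasi-linear scoring function S(A) and the overall utility
# u(A, alpha, n) = alpha * S(A) - c_s - n * c_b. All concrete numbers are
# illustrative assumptions, not values taken from the paper.

def normalize(value, lower, upper, lower_is_better=True):
    """Linear normalization to [0, 1]; better performance -> higher value."""
    clipped = min(max(value, lower), upper)
    fraction = (clipped - lower) / (upper - lower)
    return 1.0 - fraction if lower_is_better else fraction

def score(attrs, weights, bounds):
    """S(A) = sum_i lambda_i * a_hat_i, with the weights summing to one."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(weights[k] * normalize(attrs[k], *bounds[k]) for k in weights)

def overall_utility(attrs, weights, bounds, alpha, service_cost, n, benchmark_cost):
    """u(A, alpha, n) = b(A, alpha) - c_s - n * c_b."""
    return alpha * score(attrs, weights, bounds) - service_cost - n * benchmark_cost

weights = {"response_time_ms": 0.5, "throughput_rps": 0.3, "availability": 0.2}
bounds = {
    "response_time_ms": (5.0, 10.0, True),      # lower is better; 7.5 ms -> 0.5
    "throughput_rps":   (100.0, 500.0, False),  # higher is better
    "availability":     (0.99, 0.999, False),
}
measured = {"response_time_ms": 7.5, "throughput_rps": 300.0, "availability": 0.995}
print(score(measured, weights, bounds))                                  # about 0.511
print(overall_utility(measured, weights, bounds, 1000.0, 200.0, 3, 50.0))  # roughly 161.1
```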

4.4 Optimal Stopping

A stopping rule problem [5, 10, 7] is typically described by a sequence of random variables X_1, X_2, . . . whose joint distribution is assumed to be known, and a sequence of utility functions u_0, u_1(x_1), u_2(x_1, x_2), . . . , u_∞(x_1, x_2, . . .) assigning a real-valued utility to the observations X_1 = x_1, X_2 = x_2, . . . , X_n = x_n, which are one-dimensional. After each observation, one can either decide to stop or keep on gathering information. As described, the costs for an additional observation reduce the utility, whereas the chance of observing a better value for X_{n+1} increases it. In our case, however, we are confronted with multi-dimensional random variables, as the utility function depends on more than one observed performance value. We therefore rewrite the above formalism to meet our specific needs. When sequentially benchmarking different providers' services, we observe a sequence of multi-dimensional random variables A_1, A_2, . . ., evaluating the result by a sequence of utility functions u_0, u_1(A_1, α), u_2(A_1, A_2, α), . . . , u_∞(A_1, A_2, . . . , A_∞, α) with

u_n = max_{A_i} u(A_i, α, n)

being the utility of the best performing provider (with respect to our preferences) after conducting n service benchmarks. Additionally, we need to define a stopping rule, i.e. the decision on whether to continue or to stop benchmarking after each observation. The stopping rule φ is defined as a sequence of binary decision functions. Given the observations A_1, . . . , A_n, it is denoted as φ = (φ_0, φ_1(A_1), φ_2(A_1, A_2), . . .), where a value of one means stopping after n observations and a value of zero means not stopping. The probability of stopping after n observations is thus defined as

ψ_n(A_1, . . . , A_n) = [ Π_{j=1}^{n−1} (1 − φ_j(A_1, . . . , A_j)) ] · φ_n(A_1, . . . , A_n)

The problem is to choose an optimal stopping rule φ which maximizes the expected utility U(φ), defined as

U(φ) = E[ Σ_{j=1}^{∞} ψ_j(A_1, . . . , A_j) · u_j(A_1, . . . , A_j, α) ]

calculated as the sum over all rewards u_j, each multiplied by the probability ψ_j of stopping after j observations. It is obvious that the optimal rule depends on the distribution of the QoS attributes within the random variable A. If the reward is a function of a Markov chain, one only has to consider stopping rules that are functions of the chain. The problem can be solved by calculating the optimal expected revenue and stopping whenever this value has been exceeded. This type of problem is typically known as house selling with or without recall. The solution to our benchmarking problem, however, is more complex than the typical house selling problems, as the reward depends on a multi-dimensional random variable (the utility function aggregating the randomly distributed QoS attribute values) with different underlying distributions. We are still working on finding the optimal rule for this problem.
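
For intuition, the sketch below computes the reservation value of the classical house-selling rule for the simplified one-dimensional case (known utility distribution, constant benchmarking cost c_b, infinite horizon with recall): stop as soon as the best observed utility exceeds V*, where V* solves c_b = E[max(X − V*, 0)]. This is only an illustration of the rule our approach builds on, not the multi-dimensional stopping rule that is still under development; the utility distribution and cost figure are hypothetical.

```python
# Sketch of the classical house-selling stopping rule for the simplified
# one-dimensional case: known utility distribution, constant benchmark cost c_b,
# infinite horizon, recall of earlier observations. The reservation value V*
# solves c_b = E[max(X - V*, 0)]; benchmarking stops once the best observed
# utility exceeds V*. Illustration only, not the multi-dimensional rule.

import random

def reservation_value(utility_samples, benchmark_cost, tol=1e-6):
    """Solve E[max(X - v, 0)] = c_b for v by bisection over the sample range."""
    def expected_gain(v):
        return sum(max(x - v, 0.0) for x in utility_samples) / len(utility_samples)
    lo, hi = min(utility_samples), max(utility_samples)
    if expected_gain(lo) <= benchmark_cost:
        return lo  # benchmarking is never worth its cost
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_gain(mid) > benchmark_cost:
            lo = mid  # gain still exceeds the cost, threshold lies higher
        else:
            hi = mid
    return (lo + hi) / 2.0

# Example with hypothetical monetarized utilities distributed uniformly on [0, 1000].
random.seed(42)
samples = [random.uniform(0.0, 1000.0) for _ in range(100_000)]
v_star = reservation_value(samples, benchmark_cost=50.0)
print(round(v_star, 1))  # roughly 683.8 for Uniform(0, 1000) and c_b = 50
```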

5. DISCUSSION

We present an approach based on the theory of optimal stopping to capture the problem of incomplete information in the Cloud provider selection scenario. There are several aspects that have a major influence on the solution process, which we plan to investigate in further research. A major assumption of the presented approach is that the distributions of the performance random variables are known. This is a rather strong assumption, as obtaining this kind of information requires at least some benchmarking effort beforehand. However, one can argue that generic market studies or SLAs offered by the IaaS providers can at least reveal upper and lower bounds that can serve as a coarse estimate. Yet, it remains an open research question how this information can be processed in a formal way to assess the distribution of the random variables. We also have to distinguish between an infinite and a finite horizon problem. In the infinite case, we assume an infinite number of possible observations (i.e. IaaS providers or configurations), whereas the finite case assumes that

one has to stop after a finite number of benchmarks, as there are no further alternatives left. The presented approach considers observations in random order, i.e. we assume a random selection of providers to benchmark. While this assumption may be appropriate in other domains and suffices for a generic description of the problem, there is no doubt that an intelligent sequential benchmarking order can further improve the results by saving unnecessary benchmarks. This can be achieved by applying sampling methods from statistical theory. We assume constant benchmarking costs for each provider; however, these costs vary across providers, which also influences the optimal stopping rule. In a very dynamic scenario, we also have to reflect changing conditions, such as varying prices or performance values, while benchmarks are being conducted. Last but not least, in reality, benchmarking will not provide us with a constant and stable performance value, as assumed by introducing the random variable A. The observed result is rather a performance distribution itself, whose fluctuation should be considered in the scoring function S(A).

6. PLANNED EVALUATION

For evaluating our approach we plan to rely on two main pillars: mathematical proofs and simulation studies. The correctness of the optimal stopping rule under the stated assumptions must be shown by proving it analytically. Applied to reality, however, the approach will face a non-idealistic environment, especially with respect to the assumption of known performance distributions. We plan to investigate the robustness of our approach by simulating realistic situations in which a real distribution exists but is unknown to the stopping rule. In this case one would work with an estimate of this distribution. In a controlled simulation study, we can compare the results obtained with this estimate to the optimal solution and thereby judge the robustness of our optimization approach with respect to the deviation of the estimate from reality.
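
To indicate the shape of such a simulation study, the following sketch draws utilities from a true distribution while the stopping threshold is computed from a possibly misspecified estimate. Uniform distributions are assumed so that the reservation value has a closed form; all numbers are illustrative and not results of the planned evaluation.

```python
# Hypothetical shape of the planned robustness simulation: utilities are drawn
# from a true distribution, while the stopping threshold comes from a (possibly
# wrong) estimate of it. Uniform(0, B) is assumed so that the reservation value
# has the closed form v* = B - sqrt(2 * B * c_b).

import math
import random

def reservation_value_uniform(upper_bound, benchmark_cost):
    return upper_bound - math.sqrt(2.0 * upper_bound * benchmark_cost)

def run_search(true_upper, assumed_upper, benchmark_cost, rng, max_services=50):
    """Benchmark until the best observed utility exceeds the (estimated) threshold."""
    threshold = reservation_value_uniform(assumed_upper, benchmark_cost)
    best = float("-inf")
    for n in range(1, max_services + 1):
        best = max(best, rng.uniform(0.0, true_upper))  # one costly benchmark
        if best >= threshold:
            break
    # An overestimated bound can push the threshold above anything observable,
    # so the search then exhausts the whole list and pays for every benchmark.
    return best - n * benchmark_cost  # realized net utility

rng = random.Random(7)
true_upper, cost = 1000.0, 50.0
for assumed_upper in (600.0, 1000.0, 1400.0):  # under-, correctly, over-estimated
    runs = [run_search(true_upper, assumed_upper, cost, rng) for _ in range(10_000)]
    print(assumed_upper, round(sum(runs) / len(runs), 1))
```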

7. CONCLUSION & FUTURE WORK

We presented an approach that is able to reduce the costs raised by benchmarking analyses during infrastructure service selection. In detail, we introduced ideas for a method based on the theory of optimal stopping and described how to apply the method when benchmarking multiple services. We also pointed out the problem of how to define the stopping rule that is mandatory for reducing costs. Our first attempt to apply this theory to our domain seems promising for reducing costs during the infrastructure service search. Yet, it is still subject to enhancements and evolves in our ongoing work. In particular, a comprehensive cost calculation scheme and model is under development and will be integrated into the method. Moreover, composing a list of infrastructure services by leveraging prior knowledge is not well supported yet, but essential to automated benchmarking. Detailed instructions on how to inspect and order services intelligently are part of our future work. Besides the order of the service list, the stopping rule is a sensitive part of the method with a huge influence on how well the method performs. We understand that intensive support and guidance for finding and defining a stopping rule is required and has to be included in future enhancements of the method. The benefit of such rules relies strongly on a good estimate of the underlying performance distributions. Without knowledge about what to expect from future benchmarks, there is no benefit over randomly deciding when to stop. Thus, future work must also consider how these distributions can be obtained. Additionally, we plan to introduce a process describing the application of our method, which will be the basis for implementing it in a software tool for automated benchmarking. With this software tool, a proof of concept for a self-configuring system shall be developed that finds an optimal infrastructure service for deploying a service.

8. REFERENCES

[1] J. Asker and E. Cantillon. Properties of Scoring Auctions. The RAND Journal of Economics, 39(1):69–85, 2008.
[2] C. Binnig, D. Kossmann, T. Kraska, and S. Loesing. How is the weather tomorrow?: Towards a benchmark for the cloud. In Proceedings of the Second International Workshop on Testing Database Systems, DBTest '09, pages 9:1–9:6, New York, NY, USA, 2009. ACM.
[3] B. S. Blau. Coordination in Service Value Networks. PhD dissertation, Universitaet Karlsruhe (TH), Fakultaet fuer Wirtschaftswissenschaften, 2009.
[4] J. Casazza, M. Greenfield, and K. Shi. Redefining server performance characterization for virtualization benchmarking. Intel Technology Journal, 10(3):243–251, 2006.
[5] Y. Chow and H. Robbins. On optimal stopping rules. Probability Theory and Related Fields, 2(1):33–49, 1963.
[6] Y. Chow, H. Robbins, and D. Siegmund. Great Expectations: The Theory of Optimal Stopping. Houghton Mifflin, Boston, 1971.
[7] T. Ferguson. Optimal Stopping and Applications. Preprint, Mathematics Department, UCLA, http://www.math.ucla.edu/~tom/Stopping/Contents.html.
[8] R. Iyer, R. Illikkal, O. Tickoo, L. Zhao, P. Apparao, and D. Newell. VM3: Measuring, modeling and managing VM shared resources. Computer Networks, 53(17):2873–2887, 2009.
[9] V. Makhija, B. Herndon, P. Smith, L. Roderick, E. Zamost, and J. Anderson. VMmark: A scalable benchmark for virtualized systems. VMware Inc., CA, Tech. Rep. VMware-TR-2006-002, September 2006.
[10] J. McCall. The economics of information and optimal stopping rules. Journal of Business, 38(3):300–317, 1965.
[11] Y. Mei, L. Liu, X. Pu, and S. Sivathanu. Performance measurements and analysis of network I/O applications in virtualized Cloud. In Cloud Computing (CLOUD), 2010 IEEE 3rd International Conference on, pages 59–66, 2010.
[12] W. Sobel, S. Subramanyam, A. Sucharitakul, J. Nguyen, H. Wong, S. Patil, A. Fox, and D. Patterson. Cloudstone: Multi-platform, multi-language benchmark and measurement tools for Web 2.0. In Proc. of CCA. Citeseer, 2008.
[13] R. Weicker. An overview of common benchmarks. Computer, pages 65–75, 1990.
