An Off-Line Approach for Generating On-Line Adaptation Policies

Matti Hiltunen∗, Kaustubh Joshi∗, Gueyoung Jung∗, Calton Pu, Richard Schlichting∗
∗ AT&T Labs Research, 180 Park Ave., Florham Park, NJ, USA
College of Computing, Georgia Tech, Atlanta, GA, USA

Abstract: Industry trends towards virtualization and server consolidation are opening new avenues for improved efficiencies through adaptive system management. However, writing effective policies that specify when and how an adaptive system should adapt is still a difficult, error-prone activity. Quantitative models open the door for automatic policy construction by providing ways to quantify alternative system configurations in a given environment, thus enabling optimization across possible policy choices. However, such "model-in-the-loop" approaches must balance the huge parameter spaces involved with the need for solutions that must be generated quickly, yet permit verification or review. In this paper, we propose a novel approach for generating resource allocation policies in multi-tier enterprise systems. This approach addresses execution speed by performing model solution and optimization offline, and uses decision-tree learners to extrapolate results to unevaluated environments. In doing so, the technique generates rule sets that can be subject to human review processes and used in COTS rule-based policy engines. To perform offline optimization, we develop and validate layered queuing models of the target system, and use them in a novel optimization algorithm based on bin-packing and gradient search techniques.
I. INTRODUCTION

The current trend in enterprise computing is to move away from separate application silos with designated resources towards solutions that rely on dynamic sharing of a pool of resources. Examples include virtualization, utility/on-demand computing, and grid computing. The potential advantages of adaptive systems in such environments are great, and include higher resource utilization, lower cost, and the ability to deal with large workload fluctuations, changes in user behavior, and failures. However, realizing these benefits is challenging. For example, in traditional systems, pre-deployment resource planning is typically carried out to ensure the system is able to satisfy its service level agreements (SLAs). The resulting plan is a "point solution", i.e., it specifies a good configuration for the system considering the workload at a single point in the parameter space (usually the mean or peak). In contrast, adaptive systems must be able to reallocate resources at runtime in response to load changes, failures, and maintenance actions. Hence, a plan for every possible resource and workload condition is required. Such a function that maps changes in the environment to configuration changes is called the adaptation policy.

Developing good adaptation policies is one of the key problems in adaptive systems research. Traditionally, human operators have provided policy implicitly by reacting to alarms generated by monitoring systems or by acting proactively based on observed trends in the system (e.g., high resource utilization). However, such a solution is not ideal due to slow human reaction times, the difficulty operators face in evaluating all the factors that contribute to the optimal execution of the system, and, finally, cost: the cost of 24/7 operator-based system management can easily exceed the cost of the hardware or software of the managed system. A better solution is to reduce runtime human involvement by capturing operator knowledge as a set of rules. In fact, most management systems, such as IBM's Tivoli and HP's OpenView, support some form of rule-based automation. However, maintaining the rules through configuration and requirements
changes is tedious and error-prone, and every modification may require the designers to re-evaluate all the factors affecting the decision.

Alternative approaches are possible for generating adaptation rules. For example, rules can be constructed based on extensive experimentation with different configurations and workloads, or by using modeling. In this paper, we present a novel model-based approach for generating adaptation policies that can then be used by standard COTS rule engines to manage systems. Such an approach is easy to deploy in an existing system, the resulting rules can be checked by human operators before deployment, and the runtime overhead of the approach is expected to be small compared to on-line model evaluation.

II. OVERVIEW

We consider an adaptive system consisting of a fixed pool of computing resources R and a set of multi-tier applications A executing on these resources. Each application is implemented by a set of tiers or components, some of which may be replicated to increase throughput. For example, an application might consist of a web server tier (e.g., Apache), an application server tier (e.g., Tomcat), and a database tier (e.g., MySQL). Each application may support multiple transaction types. An application's workload (request rate) can only be roughly estimated at design time, and may change significantly at runtime due to events such as diurnal cycles, flash crowds, and changing application popularity. For each application, and for each transaction type, its SLA specifies the desired response time and a reward/penalty for meeting/missing that response time.

The goal of the adaptive system is to configure the applications A dynamically on the resources R, as the workloads of the applications change, so that the overall reward for the system is maximized. Specifically, for each application, the system determines the number of replicas for each component, and for each component instance, its placement and resource share on the physical resources. A system configuration C specifies this information for all applications. In this paper we use RUBiS [1]¹ as the example application, and we execute each application component in its own Xen virtual machine. The credit-based scheduling mechanism of the Xen hypervisor is used to enforce that each component uses at most its allocated CPU fraction. Here, we focus on identifying the best configuration given the current application workloads. Choosing the optimal set of adaptation actions (starting and stopping component replicas, migrating components, and adjusting virtual machine parameters) to reach this configuration is left as future work.

A. Approach

We use modeling as a key building block. Specifically, for each application ai, we construct a queueing model Mi that, given the application's configuration Ci and workload wi, produces the expected mean response time (for each transaction type) and the utilization of each component instance. Given the models M for all the individual applications, we can calculate the system reward for any system configuration C and any workload W. We envision two approaches for using these models: "model inline" (MIL) and "model offline" (MOL). In MIL, the models are evaluated at runtime using the current configuration C and the measured workload as inputs. A better configuration can then be searched for by executing the models with alternative configurations as inputs.
In MOL, the models are evaluated before system deployment with different configurations and workloads as input. The model outputs can then be used to generate policy rules as outlined below. Both of these approaches face the challenge of choosing the points at which to evaluate the models. In MIL, the number of possible configurations is very large, consisting of all possible replication degrees of all the components in all the applications and all the possible resource fractions for each component instance. MOL not only has to consider this same range of possible configurations, but also all the possible request rates for each of the applications.
¹ RUBiS is a J2EE-based auction system commonly used as an example of a multi-tier enterprise application.
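As a point of reference for the terminology above, the following is a minimal sketch (not the paper's implementation) of how a system configuration C from this section could be represented; all class and field names are illustrative assumptions.

```python
# Illustrative data structures for the system model of Section II; the names are
# assumptions made for this sketch, not identifiers from the actual system.
from dataclasses import dataclass, field

@dataclass
class ReplicaPlacement:
    resource: str            # physical host the replica (VM) is assigned to, e.g. "host-3"
    cpu_fraction: float      # cap enforced via the Xen credit scheduler, in (0.0, 1.0]

@dataclass
class ComponentConfig:
    name: str                            # e.g. "apache", "tomcat", "mysql"
    allowed_replicas: list[int]          # allowed replication levels, e.g. [1, 2, 3]
    replicas: list[ReplicaPlacement] = field(default_factory=list)

@dataclass
class ApplicationConfig:
    name: str                            # e.g. "rubis"
    components: list[ComponentConfig] = field(default_factory=list)

@dataclass
class SystemConfig:
    """A system configuration C covers all applications sharing the resource pool R."""
    applications: list[ApplicationConfig] = field(default_factory=list)

    def cpu_load(self, resource: str) -> float:
        """Sum of CPU fractions allocated on one physical resource (must stay <= 1)."""
        return sum(r.cpu_fraction
                   for a in self.applications
                   for c in a.components
                   for r in c.replicas
                   if r.resource == resource)
```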
Fig. 1. Approach Overview. (The figure shows the offline pipeline: application workloads W, resources R, and SLAs feed per-application models; the tree constructor drives the optimizer, which invokes the model solver to obtain response times and utilizations and returns an optimized configuration; the resulting decision tree is linearized into a rule set used by the runtime management system, which observes request rates and issues actions on the system.)
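To make the offline (MOL) pipeline of Fig. 1 concrete, here is a tiny, self-contained toy version of it. Everything specific in this sketch is an assumption made for illustration: two applications share one CPU, the "configuration" is just the CPU split between them, the "model solver" is a crude M/M/1 formula rather than an LQN model, the "optimizer" is brute force rather than the bin-packing/gradient search of Section III, and an off-the-shelf decision-tree learner stands in for the unspecified tree constructor.

```python
# Toy illustration of the MOL pipeline in Fig. 1 (not the paper's code).
# Two applications share one CPU; the configuration is the CPU fraction given to
# application 1 (application 2 gets the rest). A crude M/M/1 formula stands in for
# the model solver, and a decision tree over workloads becomes the policy.
from itertools import product
from sklearn.tree import DecisionTreeClassifier, export_text

SERVICE_TIME = 0.01                            # invented: seconds of CPU per request
TARGET_RT, REWARD, PENALTY = 0.05, 1.0, 2.0    # invented SLA: target RT, reward, penalty

def mean_response_time(rate, cpu_fraction):
    """M/M/1 approximation with the service rate scaled by the CPU fraction."""
    capacity = cpu_fraction / SERVICE_TIME                 # sustainable requests/second
    return float('inf') if rate >= capacity else 1.0 / (capacity - rate)

def utility(rates, frac1):
    """Paper-style utility: w * reward * (TRT - RT) if met, w * penalty * (TRT - RT) if missed."""
    total = 0.0
    for rate, frac in zip(rates, (frac1, 1.0 - frac1)):
        rt = mean_response_time(rate, frac)
        weight = REWARD if rt <= TARGET_RT else PENALTY
        total += rate * weight * (TARGET_RT - rt)          # negative when the target is missed
    return total

FRACTIONS = [round(0.1 * k, 1) for k in range(1, 10)]      # candidate CPU splits

def best_split(rates):
    """Brute-force 'optimizer': pick the split with the highest modeled utility."""
    return max(FRACTIONS, key=lambda f: utility(rates, f))

# Offline sweep of workload points (the 'tree constructor' choosing evaluation points).
workloads = [(r1, r2) for r1, r2 in product(range(10, 91, 20), repeat=2)]
labels = [best_split(w) for w in workloads]

# Compress the table into a decision tree and 'linearize' it into readable rules.
tree = DecisionTreeClassifier(max_depth=3).fit(workloads, labels)
print(export_text(tree, feature_names=["req_rate_app1", "req_rate_app2"]))
```

The printed rules (e.g., thresholds on the two request rates leading to a recommended CPU split) are the kind of human-reviewable artifact that a COTS rule engine could consume at runtime; in the paper, the model solver is an LQNS layered queuing model and the optimizer is the bin-packing/gradient-search procedure of Section III.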
The MOL approach has some significant advantages. If MOL is used to generate the adaptation policy as a set of rules, it is easy to integrate with existing rule-based management systems, while MIL requires replacing the existing management software with the new model engine. Furthermore, since policy rules are human readable, they can be inspected and checked by system administrators. The decoupling of policy generation from runtime policy evaluation also makes it possible to update policy generation algorithms, policy engines, and even policy verification tools without affecting the other parts of the system. The model used in MIL may be very hard to understand (and not as familiar as policy rules) and thus hard to inspect. Finally, since the models are evaluated before system execution (for numerous configurations), the potentially considerable model execution and optimization time is out of the critical path at runtime. On the other hand, MIL has the potential to produce more accurate results, since the model is evaluated at runtime with exactly the current workload and configuration, while MOL is limited to some preselected set of points. Furthermore, it is easier to update the model parameters (e.g., service times) when the model is inline. However, given the practical advantages of MOL, we choose it for our problem domain.

Our overall approach is outlined in figure 1. First, we develop queueing models for each of the applications. The optimization process then uses these models and information about the system resources R and application SLAs to construct a decision tree that is then linearized into a set of rules used at runtime by a rule-based management system to manage the adaptive system. The tree constructor chooses the points of workload W to evaluate. The optimizer searches for the best (or good enough) configuration for each such workload by invoking the model solver with different configurations. This paper focuses on the optimizer and the application queueing models. The decision tree construction will be addressed in future work.

III. GENERATION OF OPTIMAL CONFIGURATIONS

More formally, let $R = \{r_1, \ldots, r_{|R|}\}$ be a set of resources and $A = \{a_1, \ldots, a_{|A|}\}$ be a set of applications. For application $a_i$, let $N_i = \{n_i^1, \ldots, n_i^{|N_i|}\}$ be the set of its constituent components, and for each component $n_i^j$, let $\mathrm{repset}(n_i^j) \subset \mathbb{N}$ be the set of allowed replication levels. For example, a web application $a_i$ might consist of an Apache server with up to 2 replicas, a Tomcat server with up to 3 replicas, and an unreplicated MySQL server. For this application, $\mathrm{repset}(\mathrm{apache}_i) = \{1, 2\}$, $\mathrm{repset}(\mathrm{tomcat}_i) = \{1, 2, 3\}$, and $\mathrm{repset}(\mathrm{mysql}_i) = \{1\}$.

Each application $a_i$ may support multiple transaction types $T_i = \{t_i^1, \ldots, t_i^{|T_i|}\}$. For example, RUBiS has transactions for login, profile, browsing, searching, buying, and selling. The workload for application $a_i$ can then be characterized by the set of request rates for its transactions, $w_i = \{w_i^t \mid t \in T_i\}$, and the workload for the entire system by $W = \{w_1, \ldots, w_{|A|}\}$. Furthermore, each transaction $t_i$ is characterized by a directed acyclic transaction graph $G_T(t_i)$ that defines how the transaction uses the application components. The vertices of $G_T$ represent the components, and the directed edges represent function calls made by the source component to the destination
component. Each edge is labeled by the mean number of calls made during the course of a single transaction. Figure 2 shows examples corresponding to the AboutMe and Home transactions in RUBiS. While Home involves only a call to the Apache server, for AboutMe the Apache server makes a single call to the Tomcat server, which in turn makes an average of 1241 calls to the database server.

Fig. 2. RUBiS Transaction Dependency Graphs (Home: Client calls Apache once; AboutMe: Client calls Apache once, Apache calls Tomcat once, Tomcat makes 1241 calls to MySQL)

Finally, for each application $a_i$ and each of its transaction types $t_i^j$, an SLA specifies a target response time $\mathrm{TRT}_i^j$, a reward $Ur_i^j$ for meeting the target, and a penalty $Up_i^j$ for missing it. Ideally, the utility function should apply the reward/penalty on a per-request basis. However, because doing so would require modeling response time distributions, we use an SLA definition based only on the mean response time, defining the utility for application $a_i$ and transaction $t_i^j$ as
$$U_i^j = w_i^j \, Ur_i^j \, (\mathrm{TRT}_i^j - \mathrm{RT}_i^j) \;\; \text{if } \mathrm{TRT}_i^j \ge \mathrm{RT}_i^j, \qquad U_i^j = w_i^j \, Up_i^j \, (\mathrm{TRT}_i^j - \mathrm{RT}_i^j) \;\; \text{otherwise.}$$
Overall utility is the sum across all transactions and applications, $U = \sum_{i \in A} \sum_{j \in T_i} U_i^j$. Other utility functions could be defined; what is important is that the rewards can differ across transactions, allowing differentiation based on transaction importance, and that utility is a monotonically decreasing function of the observed mean response time.

A. Optimization

The goal of runtime adaptation is to configure the system such that, for a given workload W, the utility U of the entire system is maximized. Each system configuration $c \in C$ specifies (1) the replication level $\mathrm{rep}(n_i^j)$ of each component $n_i^j$ of each application $a_i$, chosen from the set $\mathrm{repset}(n_i^j)$, (2) the assignment of each replica $n_i^j(k)$ ($k = 1, \ldots, \mathrm{rep}(n_i^j)$) to a physical resource $r(n_i^j(k))$, and (3) the maximum fraction $\mathrm{frac}(n_i^j(k)) \in [0, 1]$ of the resource each replica is allowed to use, with the constraint that for every resource $r_p$, the sum of the fractions is at most 1. In our current implementation, the only resource type we consider is CPU capacity, and the resources are identical CPUs.

Even with only a single resource type, the optimization task is challenging. The parameter space contains both discrete and continuous variables, the space generated by the discrete variables alone is very large even for small applications, and the goal function (the utility) depends on mean response times, which are non-linear functions of the optimization parameters. Furthermore, the optimization can easily be shown to be NP-complete via a reduction from the bin-packing problem. To tackle these challenges, we rely on the following observations: for any application and transaction, the utility function U is monotonically decreasing with increasing response time, the response time is monotonically (but not necessarily strictly) increasing with a reduction in the number of replicas of a component, and the response time is monotonically increasing with a reduction in the resource fraction allocated to the replicas of a component. Therefore, if one starts with the highest allowed replication level and a resource fraction of 1.0 for each component, the utility is at its highest. However, given the constraints imposed by resource availability, it might not be possible to configure all applications with the highest replication level and with a dedicated resource for each replica.
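Before turning to the search procedure, the following small sketch makes the utility definition above concrete; it is a direct reading of the formula with invented example numbers, not code from the paper.

```python
# Sketch of the SLA utility from Section III. The numeric values are invented
# examples; w is the request rate, trt the target response time, rt the modeled
# mean response time, and ur/up the per-transaction reward and penalty weights.

def transaction_utility(w, trt, rt, ur, up):
    """U_i^j = w * Ur * (TRT - RT) if the target is met, else w * Up * (TRT - RT)."""
    weight = ur if rt <= trt else up
    return w * weight * (trt - rt)          # positive when under target, negative when over

def system_utility(transactions):
    """Overall utility: sum over all applications and transaction types."""
    return sum(transaction_utility(*t) for t in transactions)

# Example: two RUBiS-like transactions (rates in req/s, times in seconds).
transactions = [
    (50.0, 0.2, 0.15, 1.0, 3.0),   # Home: 50 req/s, meets its 200 ms target
    (10.0, 0.5, 0.80, 2.0, 5.0),   # AboutMe: 10 req/s, misses its 500 ms target
]
print(system_utility(transactions))  # 50*1.0*0.05 + 10*5.0*(-0.3) = 2.5 - 15.0 = -12.5
```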
We determine how much of a resource is truly needed by each replica by solving the queuing model assuming full replication and infinite resource availability, and computing the utilization $\rho(n_i^j(k))$ of each replica. Using these computed utilizations, a bin-packing approximation algorithm is executed to determine whether the replicas can be "packed" into the set of available resources so that the sum of the utilizations at each resource is at most 1.
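Our implementation uses the first-fit decreasing heuristic for this packing step (see the next paragraph); the following is a minimal, self-contained sketch of such a feasibility check, assuming identical unit-capacity CPUs and utilization values invented for illustration. The optimizer described next would re-run a check like this after every candidate reduction.

```python
# First-fit decreasing (FFD) feasibility check: can replicas with the given CPU
# utilizations be packed onto `num_resources` identical unit-capacity CPUs so that
# no CPU exceeds a total utilization of 1? Returns the placement, or None.
def ffd_pack(utilizations, num_resources):
    bins = [0.0] * num_resources                 # current load of each CPU
    placement = {}                               # replica index -> CPU index
    # Sort replicas by decreasing utilization, then place each on the first CPU that fits.
    for idx in sorted(range(len(utilizations)), key=lambda i: -utilizations[i]):
        for cpu, load in enumerate(bins):
            if load + utilizations[idx] <= 1.0:
                bins[cpu] = load + utilizations[idx]
                placement[idx] = cpu
                break
        else:                                    # no CPU could host this replica
            return None
    return placement

# Examples with invented utilizations from a hypothetical model solution:
print(ffd_pack([0.6, 0.55, 0.4, 0.35, 0.1], num_resources=2))   # fits on 2 CPUs
print(ffd_pack([0.6, 0.55, 0.4, 0.35, 0.3], num_resources=2))   # None: a reduction step is needed
```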
If a viable bin-packing can be found, the algorithm terminates. Otherwise, the replication level of some component, or the allocated CPU fraction of some component, is reduced by a fixed amount, and the bin-packing procedure is tried again to assign replicas to the resources. This process is repeated until a viable bin-packing is found. Various strategies can be used to choose which replication level (or CPU fraction) to reduce. Our current algorithm uses a gradient-descent-style strategy: it chooses the configuration change that yields the maximum reduction in overall CPU utilization per unit reduction in utility, i.e., the change for which
$$\frac{\sum_{i,j,k} \rho_{\mathrm{old}}(n_i^j(k)) - \sum_{i,j,k} \rho_{\mathrm{new}}(n_i^j(k))}{U_{\mathrm{old}} - U_{\mathrm{new}}}$$
is maximized. This technique never gets stuck in local minima, because the resource fraction allocated to replicas can always be reduced down to 0 to ensure that the bin-packing succeeds. On the other hand, it may not produce the optimal configuration either.

Even though optimality is not guaranteed, splitting the process into a bin-packing phase and a discrete-search phase is useful. The bin-packing problem has been studied extensively, and it is one of the few NP-complete problems for which polynomial-time approximation algorithms are known that can approximate the optimal solution to within any fixed percentage. Our implementation uses the n log n time first-fit decreasing algorithm, which guarantees results that are (asymptotically) within 22.22% of optimal [4]. By using bin-packing to solve the component placement problem, we are left with a much smaller parameter space for the possibly non-optimal heuristic search. In the future, we will explore improved search techniques such as simulated annealing.

IV. APPLICATION MODELING

Queuing models have previously been used to predict response times of multi-tier web-based systems, specifically for RUBiS (e.g., [6], [5]). However, these approaches have drawbacks that make them unsuitable for our purposes. Specifically, they use regular queuing networks that do not account for the simultaneous resource possession that applications such as RUBiS, and J2EE multi-tier applications in general, exhibit. For instance, when an Apache server processes a request, it not only uses hardware resources (e.g., the processor), but also blocks software resources (e.g., threads); i.e., it possesses multiple system resources simultaneously. Similarly, when a web-server thread makes a call to an application server, it blocks both caller and callee threads. While simultaneous resource possession can be ignored when a system is lightly loaded with little or no queuing, it assumes paramount importance when modeling heavily loaded (or bottlenecked) resources, as is expected when dealing with multiple applications running on shared hardware. Fortunately, these types of synchronous interactions have been studied in depth in the literature, beginning with Woodside [7], and a number of mean-value-analysis-based algorithms, starting with the method of surrogate delays [3], have been proposed to solve them. For our models, we used the LQNS layered queuing network modeling tool [2] due to its rich support for multiple classes and servers in addition to synchronous interactions.

A. Layered Queuing Model

The resulting LQN model is simple, and only models contention due to software threads and CPU in the system. A sample model that includes only the AboutMe and Home transactions from RUBiS is shown in figure 3.
Fig. 3. Layered Queuing Network Model for RUBiS (a client task issues AboutMe and Home requests to the Apache, Tomcat, and MySQL tasks, each running on its own PS processor, with network delay elements between the tiers)

In this model, each transaction type is represented by a separate class. Each software component replica is represented by a task (or queuing station) that uses the FCFS discipline and has a number of servers, n, equal to the maximum number of software threads allowed by the component. In the figure, tasks are represented using parallelograms. Each replica (task) executes on a physical resource (the CPU), which is represented using a circle in figure 3, and is modeled using a queuing station that uses the processor sharing (PS) discipline to
approximate a round-robin time-slice-based scheduler. Each software server is assigned its own "virtual" CPU (corresponding to a virtual machine). The service rate of all jobs running on the virtual CPU is scaled by the fraction of the physical CPU dedicated to its virtual machine. For simplicity, disk usage and utilization are ignored. During the execution of a transaction t belonging to application $a_i$, a task $n_i^j(k)$ makes a number of calls to other tasks as specified by the transaction dependency graph $G_T(t)$. As shown in the figure, these calls are routed through a network task that runs on a pure delay server. The pure delay server is justified by the observation that, for the kinds of enterprise applications we consider, network bandwidth is usually not a bottleneck resource. The processing corresponding to each transaction is represented in a task using "entries", which are shown as rectangular boxes within the task. Each entry can have a different service time, representing the varying demands that different transactions can place on resources. Calls both to the underlying processor and to other tasks are synchronous, and consume two distinct resources.

For client requests, most previous work ([5], [6]) models the system as a closed queuing network. However, doing so requires a Markov model of the users' transaction generation process. Such models are difficult to estimate for systems without an established history. In our case, however, it is possible to measure the instantaneous rates of individual transaction types at runtime. Therefore, we model the workload for each application $a_i$ as a set of $|T_i|$ independent open Poisson processes, one for each transaction type. A special client task models the Poisson arrival process for each transaction class.

Rather than creating a single model that includes all the applications sharing the available physical resources, the above approach creates a separate model for each application (since each replica gets a dedicated virtual CPU). This allows the models to be regenerated and solved incrementally, on an application-by-application basis, when the replication level of a particular server, or the CPU fraction assigned to a replica, is changed during optimization.

B. Parameter Collection and Validation

To collect the parameters for the queuing models, a pre-deployment training phase is conducted for each application. During this phase, several requests corresponding to each of the transaction types are sent to the application, one request at a time, and measurements are made. The following measurements are needed to fully parameterize the models: (1) the mean service time per request at each server, (2) the mean number of calls made by a server to other servers, and (3) the round-trip network latency. The service time measurements are carried out by assigning a full physical CPU to each server. If a full CPU is not available, the measured service times are scaled appropriately to account for the actual CPU fraction allotted to the server.

We validated our models on a deployment of a 3-tier servlet version of RUBiS using one Apache, one Tomcat, and one MySQL server. Each server was installed with default values for all parameters in its own Xen-based virtual machine. The only exception was that we increased the heap size of the Tomcat server to 512 MB to avoid garbage-collection-induced slowdowns.
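Returning to the parameter-collection step above, the following is a small sketch (with invented measurement values) of how the per-server model parameters could be derived from such instrumentation; the function names, the scaling convention, and the numbers are assumptions made for illustration, not the paper's actual tooling.

```python
# Sketch of deriving queuing-model parameters from training-phase measurements.
# All numbers below are invented; requests are sent one at a time, so the measured
# per-request service duration approximates the service time directly. The scaling
# by cpu_fraction assumes the measurement was taken under a capped CPU share.

def service_time_full_cpu(measured_busy_seconds, num_requests, cpu_fraction=1.0):
    """Mean service time per request, rescaled to a full CPU if the server only
    had a fraction of a CPU during measurement."""
    return (measured_busy_seconds / num_requests) * cpu_fraction

def mean_calls_per_request(outgoing_calls, incoming_requests):
    """Mean number of calls made to a downstream server per incoming request
    (the edge labels of the transaction dependency graph)."""
    return outgoing_calls / incoming_requests

# Example: 100 AboutMe requests measured at the Tomcat tier under a 30% CPU cap.
tomcat_service_time = service_time_full_cpu(
    measured_busy_seconds=4.0, num_requests=100, cpu_fraction=0.3)   # 0.012 s on a full CPU
db_calls_per_aboutme = mean_calls_per_request(
    outgoing_calls=124_100, incoming_requests=100)                   # 1241 calls, as in Fig. 2
print(tomcat_service_time, db_calls_per_aboutme)
```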
Fig. 4. Measured vs. Modeled Mean Response Times for RUBiS
In this setup, the round-trip latency was easy to estimate using ICMP pings. However, measuring the other parameters was more challenging. To that end, we instrumented the bytecode of the Servlet.class Java class that is extended by all Java servlets hosted by the Tomcat server. The instrumented class kept a count of the number of incoming requests, the service time of each incoming request, and the number of outgoing database requests. Using these measurements, along with the network latency and the end-to-end response time, we were able to compute all the necessary model parameters.

Figure 4 shows preliminary results comparing the measured response times with the response times predicted by the model as a function of the number of concurrent users (an indirect measure of the workload). These results were obtained by setting the fraction of CPU allocated to each server to 30% of total CPU capacity. The results show that although the queuing model is simple, the overall predictions are accurate enough over a range of workload values. We are currently validating the model over a wider range of parameter changes (i.e., CPU fractions, number of replicas) and for predictions regarding individual transaction types.

V. CONCLUSIONS AND FUTURE WORK

We have presented a novel approach for the automatic generation of adaptation policy rules for complex distributed systems. Our immediate future work includes a complete implementation of this approach, including decision tree construction, and the evaluation of its effectiveness with real distributed applications. We will evaluate the accuracy of the resulting policy rules against a model-inline approach. Other extensions include factoring in workload prediction and adaptation costs.

REFERENCES

[1] E. Cecchet, A. Chanda, S. Elnikety, J. Marguerite, and W. Zwaenepoel. Performance comparison of middleware architectures for generating dynamic web content. In Proc. of the 4th ACM/IFIP/USENIX International Middleware Conf., pages 242–261, 2003.
[2] G. Franks, S. Majumdar, J. Neilson, D. Petriu, J. Rolia, and M. Woodside. Performance analysis of distributed server systems. In Proc. of the Sixth International Conf. on Software Quality (6ICSQ), pages 15–26, 1996.
[3] P. Jacobson and E. Lazowska. The method of surrogate delays: Simultaneous resource possession in analytic models of computer systems. SIGMETRICS Performance Eval. Rev., 10(3):165–174, 1981.
[4] E. G. Coffman Jr., G. Galambos, S. Martello, and D. Vigo. Bin packing approximation algorithms: Combinatorial analysis. In D.-Z. Du and P. Pardalos, editors, Handbook of Combinatorial Optimization. Kluwer, 1998.
[5] Y. Udupi, A. Sahai, and S. Singhal. A classification-based approach to policy refinement. In Proc. of the 10th IFIP/IEEE International Symposium on Integrated Network Management, pages 785–788, 2007.
[6] B. Urgaonkar, G. Pacifici, P. Shenoy, M. Spreitzer, and A. Tantawi. An analytical model for multi-tier internet services and its applications. In SIGMETRICS '05: Proc. of the ACM SIGMETRICS International Conf. on Measurement and Modeling of Computer Systems, pages 291–302, 2005.
[7] C. M. Woodside, E. Neron, E. D. S. Ho, and B. Mondoux. An "active server" model for the performance of parallel programs written using rendezvous. J. Systems and Software, pages 125–131, 1986.