QoS Aware Load Balancing in Multi-tenant Cloud Environments

SUSARA DE SARAM, University of Colombo School of Computing, Sri Lanka
SRINATH PERERA, WSO2 Inc., Sri Lanka
MAHEN JAYAWARDANE, University of Colombo School of Computing, Sri Lanka

Current enterprise cloud services must meet the challenge of serving customers who expect different service levels while achieving high resource utilization. To capture a dominant market share, cloud vendors often need to provide different kinds of offerings to meet the emerging needs of their customers, rather than taking a one-size-fits-all approach. Multi-tenancy technology has dramatically increased the level of resource sharing, and the improved sharing makes it challenging to provide differentiated services without losing the key benefits that sharing achieves. In this research, we investigated the problem of serving different classes of users with different levels of Quality of Service while adhering to multi-tenancy. We propose a dynamic load balancing policy based on the principles of Queuing Theory to offer differentiated services. Our solution achieves the differentiation by using the level of concurrency in servers as the key attribute. The mechanism is self-adaptive to dynamic load conditions and is thus able to maintain the desired distance in average service times among the service classes. We implemented and tested the system on a prototype of a commercial cloud platform. Furthermore, we demonstrate the possibility of extending the proposed mechanism to support auto scaling as well.

Keywords: Cloud Computing, Multi-tenancy, Quality of Service, Queuing Theory, Service Differentiation

1. INTRODUCTION

Cloud computing [Buyya et al. 2008] is a popular approach to software delivery, which has made a positive impact on the computing industry. Cloud computing reduces the upfront cost of software, hardware and other IT infrastructure, as well as the costs of maintenance and upkeep, by allowing companies to rent the necessary artifacts from a third party. The improved sharing of resources achieved through resource centralization is beneficial for both cloud users and service providers [Zhang et al. 2010].

However, the diverse needs of users complicate the service providers' ability to take advantage of shared resources. For example, two users needing different customizations of the same application force the service provider to run two separate application instances at the back-end. Service providers often handle this situation by running separate Virtual Machines (VMs) to serve different users or client organizations (tenants). Multiple VMs may run on a single physical machine using a hypervisor, which lets service providers share the hardware of a physical server among multiple tenants.

The concept of multi-tenancy brings further benefits to the cloud-computing paradigm by allowing a single application instance to be shared between multiple tenants who need different customizations. Unlike virtual machines, multi-tenancy shares both the software and the hardware. Since it achieves better resource utilization, multi-tenancy has become a key requirement for modern cloud services. There are several stages at which one can achieve multi-tenancy [Chong and Carraro 2006]; the finest stage allows running multiple application instances as a load-balanced farm of resources. This provides the best scalability for the applications and enables them to cope with unpredictable load conditions.


Cloud vendors and users negotiate Service Level Agreements (SLAs) in order to define the required levels of quality of service (QoS) and the associated costs and penalties [Bianco et al. 2008]. However, the users of a particular cloud service may have diverse QoS requirements. Therefore, cloud vendors have to offer differentiated services rather than taking a one-size-fits-all approach. To reduce the variance of QoS across different users, service providers offer a set of service classes from which a user can preselect a service level according to the provided SLA metrics. Providing differentiated services helps cloud vendors capture a wider market share. However, achieving service differentiation in multi-tenant environments, which are composed of highly shared services, is difficult: service differentiation drives the cloud away from resource sharing, while multi-tenancy drives it towards more sharing. Balancing these tradeoffs is a challenge.

This paper investigates the problem of serving different classes of users with different levels of performance while adhering to multi-tenancy. The most widely used performance measurements in cloud computing are throughput and service time. Throughput can be controlled via connection throttling, which allows cloud vendors to enable on-demand resource provisioning. As the service time is a performance metric that gives a fair description of the end user's experience [Broadwell 2004], we focus on the service time of user-requests when differentiating the services.

We propose a dynamic load balancing policy that achieves service differentiation by considering the multi-threaded architecture of application servers. We will show that the level of concurrency in servers is an attribute that can be used to differentiate performance, and we use Queuing Theory as the basis of our approach. The proposed mechanism can maintain a desired 'distance' in average service times among any number of service classes, adapting dynamically to changing load conditions. In this research, we assume all physical servers to be homogeneous. We consider a cloud that hosts a single type of service, but we will show the applicability of our solution to multiple types of services as well by extrapolating from the single-service case. We have implemented the proposed solution on top of the Apache Synapse mediation framework (http://synapse.apache.org/) and tested it using a prototype multi-tenant environment with a static pool of resources.

The next section (section 2) discusses related work; sections 3, 4 and 5 discuss the design and implementation. In the evaluation section (section 6), we perform a comprehensive evaluation of the results. The final section (section 7) further discusses how this mechanism could support auto scaling in order to provide a more reliable service.

2. RELATED WORK

2.1 Attempts for Service Differentiation

Service differentiation has been addressed in both cloud computing and networking research. However, we could not find any prior work on service differentiation combined with multi-tenancy. In the following, we present a brief review of some relevant prior work.

Goudarzi and Pedram [2011] present a distributed solution for SLA-driven resource allocation based on Weighted Fair Queuing (WFQ), the packetized version of Generalized Processor Sharing (GPS) scheduling. The authors consider a cloud computing system composed of clusters that have different numbers and possibly different types of servers. They statically partition the set of clusters among several groups of users. Within each cluster, they differentiate the service among clients in different service classes based on response time. The differentiation is achieved using multi-class queues, which is not possible with multi-threaded servers since these do not incorporate queues.

Hayel et al. [2004] study methods that provide Less-than-Best-Effort (LBE) services. LBE has been proposed as a service for non-critical purposes.



The goal of LBE is to utilize unused network capacity to provide non-critical services without affecting the best-effort flows. The authors demonstrate LBE as a way of providing service differentiation for two service classes.

Susitaival and Aalto [2003] compare scheduling and routing as means of achieving service differentiation. They formulate service differentiation as an optimization problem in terms of average delay; for scheduling they use the WFQ algorithm, while their routing algorithm routes requests of high-priority service classes along paths that are less congested. Their results show that the level of service differentiation achieved by scheduling is quite low and that routing provides more flexibility.

Lu et al. [2001] propose an architecture featuring a feedback control loop to achieve relative delays for different service classes on web servers. They perform dynamic connection scheduling in servers to enforce a desired ratio of the 'average service times' among the different service classes. They show that relatively differentiated services achieve performance differentiation more precisely than the best-effort differentiation model.

Most of the prior work on service differentiation focuses on scheduling. Since most multi-threaded architectures do not incorporate queuing, we could not adapt any prior scheduling technique to our problem. Since most cloud service providers hire resources dynamically, a load balancing mechanism suits our problem better. None of the research mentioned in this section focuses on multi-tenancy. The following section describes prior contributions to multi-tenancy, highlighting the key prior work for our research.

2.2 Implementations of Multi-tenancy

The concept of multi-tenancy has been implemented using various architectures [Chong and Carraro 2006] [D. Banks 2009] [Azeez et al. 2010] [Mudigonda et al. 2011]. In this research, we focus on the multi-tenant cloud architecture proposed by Azeez et al. [2010]. The commercial cloud platform WSO2 Stratos [Azeez et al. 2011] uses this architecture as the basis for implementing multi-tenancy. It achieves multi-tenancy at the SOA (Service Oriented Architecture) level and enables users to run their services in a multi-tenant SOA framework. The design and implementation of our solution are based on this environment, in which we performed all the testing using a prototype.

Azeez et al. [2010] achieve multi-tenant deployment of services using Apache Axis2 servers, because the architecture of Axis2 is originally designed to support multi-tenant deployments. In Axis2, all configuration data is stored in a single configuration tree called AxisConfiguration, and the rest of the system is stateless. To support multi-tenant service deployment on Axis2, they suggest having multiple AxisConfiguration instances, one per tenant. Each tenant's services and other Axis2 deployment artifacts are created within its own AxisConfiguration. In addition to the per-tenant configurations, a master AxisConfiguration dispatches requests to each tenant AxisConfiguration and manages them. Since tenants cannot 'see' outside their AxisConfiguration's scope, they cannot manipulate code hosted by other tenants or change any configurations in the master.
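To make the dispatching model concrete, the following is a minimal sketch of the idea in Java. The class and method names (TenantConfiguration, MasterConfiguration, dispatch) are our own illustrative stand-ins, not the actual Axis2 API, which operates on org.apache.axis2.engine.AxisConfiguration instances.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a tenant's isolated AxisConfiguration:
// it holds only that tenant's services and deployment artifacts.
class TenantConfiguration {
    private final String tenantDomain;
    TenantConfiguration(String tenantDomain) { this.tenantDomain = tenantDomain; }
    void process(String serviceName, String payload) {
        // Only this tenant's services run here; other tenants are invisible.
        System.out.println(tenantDomain + " -> " + serviceName);
    }
}

// Hypothetical master configuration: routes each request to the
// configuration of the tenant it belongs to, and nothing else.
class MasterConfiguration {
    private final Map<String, TenantConfiguration> tenants = new ConcurrentHashMap<>();

    void registerTenant(String tenantDomain) {
        tenants.put(tenantDomain, new TenantConfiguration(tenantDomain));
    }

    void dispatch(String tenantDomain, String serviceName, String payload) {
        TenantConfiguration config = tenants.get(tenantDomain);
        if (config == null) {
            throw new IllegalArgumentException("Unknown tenant: " + tenantDomain);
        }
        config.process(serviceName, payload); // tenant cannot 'see' outside this scope
    }
}
```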

3. CONCURRENCY VS. PERFORMANCE

Since we focus on the multi-tenant architecture proposed by Azeez et al. [2010], it is important to understand how Axis2 handles requests. Axis2 servers are capable of processing requests in parallel, as they are multi-threaded by design. They have a large pool of threads (e.g. 100-500), and each arriving request takes a separate thread for processing; the thread is released when the processing is done. All requests are processed concurrently, and the number of concurrent threads can increase up to the 'thread pool size', a system configuration parameter. If all the threads in the pool are busy when a request arrives, that request is rejected.

When a server runs with a high number of concurrent threads, the threads compete for resources in the system, and this can significantly affect system performance. Due to resource contention, higher concurrency increases the service time (a.k.a. response time or latency).
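This request-handling behavior can be approximated with a standard java.util.concurrent thread pool. The sketch below is illustrative, not the actual Axis2 transport code; a SynchronousQueue combined with an AbortPolicy reproduces the 'reject when all threads are busy' behavior described above.

```java
import java.util.concurrent.*;

public class RequestPool {
    public static void main(String[] args) {
        // Pool sized like an Axis2 worker pool; each request runs on a
        // dedicated thread, and requests are rejected once all threads are busy.
        int poolSize = 100; // analogous to the 'thread pool size' parameter
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                poolSize, poolSize,
                0L, TimeUnit.MILLISECONDS,
                new SynchronousQueue<>(),            // no request queue
                new ThreadPoolExecutor.AbortPolicy() // reject when saturated
        );

        try {
            pool.execute(() -> {
                // process one request; the thread is released when done
            });
        } catch (RejectedExecutionException e) {
            // all threads busy: the request is rejected
        }
        pool.shutdown();
    }
}
```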



Figure 1. Increase of the Service Time with Concurrency Level: average service time (seconds) plotted against the number of concurrent threads.

The average service time of requests increases when the average number of concurrent threads (the concurrency level) increases, and vice versa. This happens because all concurrent threads share a single or a limited number of CPUs and a limited memory. Most often, processor sharing happens in a time-shared manner that keeps processes waiting. Moreover, concurrent threads may contend for 'locks', which block a thread's execution while a lock is held by another. Therefore, a higher concurrency level brings down the performance, and that leads to an increase of the service time. This is why auto-scaling load balancers are generally used to avoid increasing the load on servers, keeping free threads in the pool and keeping the number of concurrent threads in a machine low.

We performed a test to visualize how the concurrency level affects the average service time of requests; Figure 1 shows the results. We ran this test on a single physical machine running a single instance of Axis2, and we name it the 'Initial Test' for ease of exposition. It helps us analyze our environment more precisely. Figure 1 plots the average service times against the average number of concurrent threads measured during the test, which we repeated with different amounts of workload (different values for concurrency). We used a sample web service that takes around 350 milliseconds to process a request without any concurrent overhead. The average service time increases as the number of concurrent threads increases (Figure 1): small numbers of parallel threads did not affect the service times, but higher levels affected them dramatically. We also plotted the standard deviations of the average service times in order to visualize the variation of the measured values. This result shows that the performance level experienced by a user directly depends on the concurrency level. Consequently, we can use the concurrency level to control the service time.
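For reference, the following sketch outlines the shape of such a concurrency probe; the service URL, request counts and measurement details are illustrative assumptions, not our actual test harness.

```java
import java.net.URI;
import java.net.http.*;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicLong;

public class ConcurrencyProbe {
    public static void main(String[] args) throws Exception {
        int concurrency = Integer.parseInt(args.length > 0 ? args[0] : "10");
        int requestsPerThread = 50;
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://localhost:8080/services/sample")) // illustrative URL
                .GET().build();

        AtomicLong totalNanos = new AtomicLong();
        AtomicLong count = new AtomicLong();
        ExecutorService workers = Executors.newFixedThreadPool(concurrency);
        for (int i = 0; i < concurrency; i++) {
            workers.execute(() -> {
                for (int j = 0; j < requestsPerThread; j++) {
                    long start = System.nanoTime();
                    try {
                        client.send(request, HttpResponse.BodyHandlers.discarding());
                    } catch (Exception e) {
                        continue; // skip failed requests
                    }
                    totalNanos.addAndGet(System.nanoTime() - start);
                    count.incrementAndGet();
                }
            });
        }
        workers.shutdown();
        workers.awaitTermination(10, TimeUnit.MINUTES);
        // Average service time at this concurrency level, one data point in Figure 1
        System.out.printf("concurrency=%d avg=%.1f ms%n",
                concurrency, totalNanos.get() / 1e6 / Math.max(1, count.get()));
    }
}
```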

4. METHODOLOGY

As discussed in the previous section, we identified the level of concurrency as a parameter that can control the quality of the service provided. Therefore, we propose to use the level of concurrency to differentiate the service among the service classes. We first describe an architecture for supporting QoS level differentiation and then propose a Queuing Theory based model to achieve the differentiation.


Figure 2. The load balancer distributes the user-requests across each server group.

4.1 Proposed Architecture

To support QoS within a multi-tenant setup, we propose the following architecture. Consider a cloud with a load-balanced farm of servers hosting a single type of service. We group all the available server instances by service class, so that each class gets a particular group of servers (Figure 2). Each group behaves as a farm of servers, and the user-requests of each class are distributed across the servers in the corresponding group. Within a group, we use the Round Robin algorithm to distribute requests, as it is simple and treats all servers equally (a sketch of this per-class dispatch appears at the end of this subsection).

The number of servers in each group (the group sizes) is the key to providing service differentiation: changes in the group sizes are reflected in the concurrency levels. If the load on each QoS class were known a priori, we could solve the service differentiation problem by statically assigning a sufficient number of nodes to each group. However, the load of each class changes often. Therefore, we have designed a mechanism to re-evaluate and re-assign servers to the groups at run time, based on the demands, in order to maintain the difference of concurrency levels among the groups. Multi-tenancy loads tenants into servers as needed, which enables us to avoid costly VM migrations, as opposed to VM-based isolation. This gives better elasticity to the server groups we propose and helps them adjust quickly to changing demands at run time.

In this work, we limit our focus to a static number of available servers: we do not start new instances or shut down any worker nodes. However, as discussed in the 'Conclusions and Future Work' section, the same model extends easily to auto scaling scenarios. To figure out the best way to allocate servers to the groups, we need a theoretical basis. Surveying the literature [Chen and Li 2010] [Khazaei et al. 2012], we found that the theory best suited to modeling this situation is Queuing Theory.
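A minimal sketch of the per-class round-robin dispatch is shown below; the Server type and class names are our own illustrative inventions, not the implementation described in this paper.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical server handle (host/port of a worker node).
record Server(String host, int port) {}

// Routes each service class to its own group of servers, round-robin
// within the group; group membership is adjusted at run time.
class ClassAwareBalancer {
    private final Map<String, List<Server>> groups = new ConcurrentHashMap<>();
    private final Map<String, AtomicInteger> counters = new ConcurrentHashMap<>();

    void setGroup(String serviceClass, List<Server> servers) {
        // Called by the periodic re-evaluation step when group sizes change.
        groups.put(serviceClass, List.copyOf(servers));
        counters.putIfAbsent(serviceClass, new AtomicInteger());
    }

    Server next(String serviceClass) {
        List<Server> servers = groups.get(serviceClass);
        int i = counters.get(serviceClass).getAndIncrement();
        return servers.get(Math.floorMod(i, servers.size()));
    }
}
```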

4.2 Principles of Queuing Theory

Queuing Theory is the mathematical study of queuing systems. A queuing system consists of discrete objects, usually called 'items', which arrive at the system at some rate, spend some time in the system, and then leave. Little's Law gives a fundamental result on queuing systems: the average number of items in a queuing system, denoted L, equals the average arrival rate of items to the system, λ, multiplied by the average waiting time (the time spent in the system) of an item, W [Little 2011]. Thus,

L = λW.    (1)
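A quick numeric example of our own, for illustration: if requests arrive at a server at 20 requests per second and each spends half a second in the system on average, Little's Law gives the average number of requests in the server:

```latex
% Illustrative numbers (ours): \lambda = 20 requests/s, W = 0.5 s.
L = \lambda W = 20\,\tfrac{\text{requests}}{\text{s}} \times 0.5\,\text{s}
  = 10 \text{ requests in the system, on average.}
```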

This is a remarkable result, as the relationship holds for any kind of arrival process distribution, service distribution and service order [Simchi-Levi and Trick 2011]. The only assumption needed is that the system is in 'steady state', whereby the average arrival rate is less than or equal to the service rate of the system.

4.3 Controlling the Levels of Concurrency

In general, a multi-tenant server environment is composed of a load balancer dividing the workload across a farm of homogeneous servers. We identify each server as a separate queuing system: user-requests arrive at a server at some rate, spend some time being processed, and leave the system. Here, the requests are the items. In a multi-threaded architecture, the number of requests (items) in a server equals the number of concurrent threads running, and the time spent in the system equals the service time of the request.

We use Little's Law to adjust the server groups that we propose. The purpose of our server groups is to maintain different levels of concurrency, class-wise. According to Little's Law, the concurrency level depends on two parameters: the arrival rate and the service time. We can alter either one to control the concurrency level. However, we cannot use the service time, as it is what we ultimately need to differentiate. Therefore, the only attribute we can alter is the arrival rate. The system can alter the class-wise arrival rates using the server-grouping mechanism discussed in section 4.1.

Consider a particular service class, class k. Let the average arrival rate of class k user-requests at the load balancer be A_k, and let the number of servers in the class k group (the server group which corresponds to class k) be n_k. Since we use Round Robin to distribute requests over the n_k servers of the group, the average arrival rate at one server, λ_k, must be

λ_k = A_k / n_k.

Therefore, by changing the number of servers in the group (n_k) we can alter the average arrival rate at a single server. Applying Little's formula to an arbitrary server of the class k group, we get

L_k = λ_k W_k = (A_k W_k) / n_k.

Here L_k is the average number of concurrent threads in any server of the class k group within the observed period, and W_k is the average of the service times over that period. The sizes of the groups are thus directly related to the average number of concurrent threads, allowing us to use the group sizes to control the concurrency level of servers.
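Rearranging for n_k yields a sizing rule: to hold a class at a target concurrency level L_k given a measured arrival rate A_k and average service time W_k, the group needs n_k = ⌈A_k W_k / L_k⌉ servers. A minimal sketch of this rule follows, with variable names of our own choosing.

```java
public class GroupSizing {
    // Little's Law: L_k = A_k * W_k / n_k. Solving for n_k at a target
    // per-server concurrency L_k and rounding up gives the group size.
    static int requiredGroupSize(double arrivalRate,       // A_k, requests/second
                                 double avgServiceTime,    // W_k, seconds
                                 double targetConcurrency) // desired L_k per server
    {
        double offeredLoad = arrivalRate * avgServiceTime; // total concurrent work
        return Math.max(1, (int) Math.ceil(offeredLoad / targetConcurrency));
    }

    public static void main(String[] args) {
        // e.g. 120 req/s arriving, 0.5 s average service time, target of 10
        // concurrent threads per server -> ceil(60 / 10) = 6 servers.
        System.out.println(requiredGroupSize(120, 0.5, 10)); // prints 6
    }
}
```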

4.4 On-Demand Server Group Adjustment

Our solution can enforce a desired 'distance' in QoS levels among the classes. We adjust the concurrency levels in servers by changing the group sizes on demand, in order to maintain the difference among average service times. Even if the resources are not sufficient to provide a quality service to every class, we guarantee that the higher-class users still get a better service than the lower-class users.

We achieve the service differentiation by maintaining a desired ratio among the concurrency levels of the groups. As Figure 1 (section 3) shows, the ratio of the concurrency levels is closely reflected in the service times of the user-requests. We monitor the system to identify the changes of arrival rates and service times separately for each service class, and we update the sizes of the groups at regular intervals. For the implementation, we used 10-second intervals: after each 10 seconds, we calculate the new sizes of the groups according to the data gathered during the last 10 seconds. Here, we assume the work demands of the client base within the last time frame will remain nearly the same over the next time frame as well. We calculate class-wise average arrival rates and average service times during each interval; these two values describe the current condition of the environment. Based on these values, we calculate the group sizes that would enforce the desired ratio of concurrency levels. Our solution supports any number of service classes, so we present a general solution for m classes. For ease of presenting the calculations and formulas for group-size adjustments, we use the following notation.

m = Number of service classes (0
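Since the notation list above is cut off, the following sketch reflects our reading of the mechanism rather than the paper's exact formulas: group sizes are recomputed every 10 seconds in proportion to each class's offered load (A_k W_k) divided by its target concurrency weight, keeping the total number of servers fixed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Measured statistics for one service class over the last window.
record ClassStats(double arrivalRate, double avgServiceTime) {}

class GroupResizer {
    // Classes whose offered load (A_k * W_k) is high relative to their
    // target concurrency weight get more servers, which lowers the
    // per-server concurrency level for that class.
    static Map<String, Integer> resize(Map<String, ClassStats> stats,
                                       Map<String, Double> weight, // desired relative L_k
                                       int totalServers) {
        Map<String, Double> demand = new HashMap<>();
        double totalDemand = 0;
        for (var e : stats.entrySet()) {
            double d = e.getValue().arrivalRate() * e.getValue().avgServiceTime()
                       / weight.get(e.getKey());
            demand.put(e.getKey(), d);
            totalDemand += d;
        }
        Map<String, Integer> sizes = new HashMap<>();
        for (var e : demand.entrySet()) {
            int n = (int) Math.round(totalServers * e.getValue() / totalDemand);
            sizes.put(e.getKey(), Math.max(1, n));
        }
        return sizes; // rounding may leave the sum off by one; reconcile in practice
    }

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Recompute every 10 seconds from the last window's measurements
        // (illustrative numbers below); runs until the process is killed.
        scheduler.scheduleAtFixedRate(() -> {
            Map<String, ClassStats> window = Map.of(
                    "gold",   new ClassStats(40, 0.4),
                    "silver", new ClassStats(40, 0.4));
            Map<String, Double> weights = Map.of("gold", 1.0, "silver", 2.0);
            System.out.println(resize(window, weights, 12));
        }, 10, 10, TimeUnit.SECONDS);
    }
}
```

With equal loads and a 1:2 concurrency weight, this allocates gold twice the servers of silver (8 vs. 4 of 12), so gold servers run at concurrency 2 while silver servers run at 4, preserving the desired ratio.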
