A Flexible QoS Framework for Cluster-based Network Services
Kai Shen, Hong Tang, and Tao Yang
Department of Computer Science, University of California at Santa Barbara, CA 93106
kshen, htang, [email protected]
Abstract
Quality of service (QoS) support that provides customized service qualities to multiple classes of client requests can effectively utilize available system resources. This paper presents the design and implementation of a flexible and efficient QoS framework for cluster-based network services. This framework achieves four objectives. First, the framework provides a flexible mechanism for service providers to specify a variety of desired service quality metrics through QoS yield functions. Secondly, the framework provisions service differentiation and admission control for multiple classes of service accesses by employing both the yield functions and resource allocation guarantees. Thirdly, the framework achieves efficient resource utilization by producing high aggregate QoS yield through greedy scheduling and speculative admission control that drops zero- or low-yield requests. Finally, the design of this QoS framework takes service scalability and availability into consideration by deploying a decentralized two-level request distribution and scheduling scheme. We implemented the proposed framework on top of the Neptune clustering middleware, which provides replication and load-balancing support for cluster-based services. Our simulations and experiments based on the prototype implementation show that the proposed schemes can effectively utilize system resources under QoS constraints, especially when system resource consumption is approaching or beyond the saturation point. Compared with a previously proposed dynamic server partitioning scheme, the evaluations also show that our system responds more promptly to demand spikes and behaves more smoothly during server failures.

1 Introduction

A large amount of work has been done on network quality of service (QoS) support and service differentiation with respect to packet delay and connection bandwidth [5, 11, 15, 24]. It is equally important to extend network-level QoS guarantees to endpoint systems where service fulfillment and content generation take place. Previous studies show that the client request rates for Internet services tend to be bursty and fluctuate dramatically [9, 10, 21]. For example, the daily peak-to-average load ratio at the Internet search service www.ask.com is typically 3:1, and it can be much higher and unpredictable during extraordinary events. (The author is also affiliated with Ask Jeeves/Teoma Technologies.) As a consequence, over-provisioning system resources for a service site to accommodate the potential peak is not cost-effective, if not impossible. Recent studies on endpoint QoS support have mostly focused on single-host systems or Web servers serving static content [1, 4, 6, 16, 19, 26]. With the increasing demand for highly scalable, available, and easy-to-manage services, the deployment of large-scale complex server clusters has been rapidly emerging, in which service components are usually partitioned, replicated, and aggregated [13, 14, 20, 23]. Limited studies have been conducted on service differentiation for cluster-based servers [27], and there is still a lack of comprehensive QoS support for these large-scale cluster-based network services.

This paper presents the design and implementation of a flexible and efficient QoS framework for cluster-based network services. This framework addresses the issues of flexible service quality specification, service differentiation, resource utilization efficiency, and system scalability and availability. One of our design principles is that the QoS framework should be flexible enough to allow service providers to express a variety of desired service qualities. Most previous studies have used a monolithic metric to measure system utilization and define QoS constraints, be it system throughput, mean response time, or mean stretch factor [1, 23, 27]. In contrast, we give service providers the flexibility of choosing the system utilization metrics that best suit their own needs or the nature of individual services. To be more specific, we consider that the fulfillment of a service request generates a certain yield, called the QoS yield, which may be linked to the amount of economic benefit [9] or social reach resulting from serving this request. The QoS yield of each service access is considered to be a function of the service response time in our framework. By carefully selecting this QoS yield function, service providers can express desired service qualities.

Our second principle is to provide class-based service differentiation in terms of resource allocation and admission control. In such a scheme, client requests are first classified into a number of service classes based on different criteria such as service types or client identities. Our framework then provides differentiated services to multiple service classes through two means: 1) service classes can acquire differentiated QoS support by having different QoS yield functions; 2) each class can also be guaranteed to receive a certain predetermined portion of system resources.

Our third principle is that we want to utilize the system resources efficiently by seeking to produce high aggregate QoS yield and dropping service requests that deliver zero or little QoS yield. In our framework, this efficiency is achieved at two levels: 1) a class-aware load-balancing scheme ensures a balanced distribution of service requests to a set of replicated service nodes; 2) a multi-queue scheduling scheme produces high QoS yield at each service node.
The QoS framework must also work together with the clustering middleware to ensure the scalability and availability of a service cluster. Scalability and availability are always overriding concerns for large-scale network services. Several prior studies rely on centralized components to provide QoS support for a cluster of replicated servers [19, 27]. In contrast, our framework is built on a fully distributed architecture that contains no centralized component. This design principle is crucial to ensuring smooth system response to demand spikes and server failures.

Our QoS framework targets large-scale cluster-based network services. Inside those clusters, services are usually partitioned, replicated, aggregated, and then delivered to external clients through protocol gateways. Figure 1 illustrates an example of our targeted architecture. In this example, the service cluster delivers a discussion group and a photo album service to wide-area network browsers and wireless clients through Web servers and WAP gateways. The discussion group service is a relatively stand-alone service, while the photo album service relies on an internal image store service. All the components, including protocol gateways, are replicated. In addition, the image store service is partitioned into two partition groups.

Figure 1: A targeted system architecture. (The figure shows Web servers and WAP gateways connecting the wide-area and wireless networks to a service cluster built on a high-throughput, low-latency network, which hosts replicated Photo Album (partitions 0-19) and Discussion Group (partitions 0-19) service nodes and an Image Store service split into partition groups 0-9 and 10-19.)

Our QoS work in this study is focused on service qualities inside the service cluster. Issues related to wide-area network latency or bandwidth are out of the scope of this paper. The design and implementation of our QoS framework is a continuation of our previous research on Neptune: a cluster-based infrastructure for aggregating and replicating partitionable network services [22, 23]. Neptune addresses the issues of scalability, availability, data replication, and fine-grain load balancing in service construction. Our work is built entirely at the cluster middleware level, with no change required in the OS kernel.

The rest of this paper is organized as follows. Section 2 describes the flexible service quality specification in our framework. Section 3 presents our two-level request distribution and scheduling architecture. Section 4 illustrates the service scheduling inside each service node. Section 5 describes our simulation experiments based on synthetic workloads and two traces from a commercial search engine. Section 6 presents our prototype implementation on a Linux cluster and the evaluation of the system performance. Section 7 discusses related work and Section 8 concludes the paper.

2 Service Quality Specification

The service quality specification in our QoS framework contains two parts. First, we introduce the concepts of QoS yield and QoS yield functions through which service providers can express a variety of desired service qualities. Secondly, using QoS yield functions and resource allocation guarantees, our framework allows the service providers to determine the desired level of service differentiation among multiple service classes.

2.1 Flexible Specification for System Utilization

Most previous studies have used a monolithic metric such as system throughput, mean response time, or mean stretch factor to measure system utilization [1, 23, 27]. In our framework, we give service providers the flexibility of choosing system utilization metrics that best suit their own needs or the nature of specific services. Fundamentally, we conceive that the fulfillment of a client request generates a certain yield, called the QoS yield, which may be linked to the amount of economic benefit or social reach resulting from serving this request in a timely fashion. Both goals of provisioning QoS and efficient resource utilization can be naturally combined as producing high aggregate yield. In our framework, we consider the QoS yield of each service access to be a function of the service response time. The QoS yield function is normally determined by service providers to allow high flexibility in expressing desired service qualities.
Let $Y(\cdot)$ represent the QoS yield function, and let $r_1, r_2, \ldots, r_k$ be the response times of the $k$ service accesses completed in an operation period. The goal of our system is to maximize the aggregate yield, i.e.,

$$\text{maximize} \quad \sum_{i=1}^{k} Y(r_i).$$
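As a small illustration of this objective (the yield function and response times below are made up), the aggregate yield over an operation period is just the sum of the per-access yields:

```python
def aggregate_yield(yield_fn, response_times):
    """Objective from Section 2.1: the sum of Y(r_i) over completed accesses."""
    return sum(yield_fn(r) for r in response_times)

# A throughput-style yield of 1 per access completed within a 2-second deadline.
print(aggregate_yield(lambda r: 1.0 if r <= 2.0 else 0.0, [0.3, 1.8, 2.5]))  # 2.0
```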
We give a few examples to illustrate how service providers can use yield functions to express desired QoS requirements. For instance, a system with the yield function $Y_{throughput}$ in Equation (1) is intended to achieve high system throughput under a deadline D. In other words, the goal of such a system is to complete as many service accesses as possible with response times no larger than D. Similarly, a system with the yield function $Y_{resptime}$ in Equation (2) is designed to achieve low mean response time. Note that the traditional concept of mean response time does not count dropped requests. $Y_{resptime}$ differs from that concept by considering dropped requests as if they were completed in D.

$$Y_{throughput}(r) = \begin{cases} C & \text{if } 0 \le r \le D, \\ 0 & \text{if } r > D. \end{cases} \quad (1)$$

$$Y_{resptime}(r) = \begin{cases} C\left(1 - \frac{r}{D}\right) & \text{if } 0 \le r \le D, \\ 0 & \text{if } r > D. \end{cases} \quad (2)$$

We notice that $Y_{throughput}$ does not care about the exact response time of each service access as long as it is completed within the deadline. In contrast, $Y_{resptime}$ always reports a higher yield for accesses completed faster. As a combination of these two, $Y_{combo}$ in Equation (3) produces the maximum yield when the response time is within a pre-deadline D', and the yield decreases linearly thereafter. The yield finally declines to a drop penalty C' when the response time reaches the deadline D.

$$Y_{combo}(r) = \begin{cases} C & \text{if } 0 \le r \le D', \\ C' + (C - C')\,\frac{D - r}{D - D'} & \text{if } D' < r \le D, \\ 0 & \text{if } r > D. \end{cases} \quad (3)$$

This corresponds to the real-world scenario in which users are generally comfortable as long as a service request is completed within D'. They get more or less annoyed when the service takes longer, and they most likely abandon the service after waiting for D. C represents the maximum yield resulting from a low response time, and the drop penalty C' represents the loss when the service is not completed within the final deadline D. Figure 2 illustrates these three functions. We want to point out that $Y_{throughput}$ is a special case of $Y_{combo}$ when D' = D, and $Y_{resptime}$ is also a special case of $Y_{combo}$ when D' = 0 and C' = 0. In general, the QoS yield function can be any function that returns non-negative numbers for non-negative inputs. However, we restrict our study to monotonically non-increasing functions. This restriction makes sense because it should not hurt, if not help, to complete a service access faster.

Figure 2: Illustration of QoS yield functions. (The figure plots QoS yield against response time for $Y_{throughput}$, $Y_{resptime}$, and $Y_{combo}$, with the yield levels C and C' and the deadlines D' and D marked on the axes.)
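To make these definitions concrete, here is a minimal Python sketch of Equations (1) through (3); the piecewise form of Y_combo follows the description above, and the parameter names (C, C_prime, D_prime, D) simply mirror the symbols in the equations rather than any interface of the framework itself:

```python
def y_throughput(r, C, D):
    """Equation (1): full yield C for any response time within the deadline D."""
    return C if 0 <= r <= D else 0.0

def y_resptime(r, C, D):
    """Equation (2): yield decays linearly from C at r = 0 to zero at the deadline D."""
    return C * (1.0 - r / D) if 0 <= r <= D else 0.0

def y_combo(r, C, C_prime, D_prime, D):
    """Equation (3): full yield up to the pre-deadline D', then a linear decline
    to the drop penalty C' at the deadline D, and zero yield afterwards."""
    if 0 <= r <= D_prime:
        return C
    if D_prime < r <= D:
        return C_prime + (C - C_prime) * (D - r) / (D - D_prime)
    return 0.0

# Example: a 2-second deadline with a 0.5-second pre-deadline.
print(y_combo(0.30, C=10.0, C_prime=2.0, D_prime=0.5, D=2.0))  # 10.0
print(y_combo(1.25, C=10.0, C_prime=2.0, D_prime=0.5, D=2.0))  # 6.0
print(y_combo(2.50, C=10.0, C_prime=2.0, D_prime=0.5, D=2.0))  # 0.0
```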
2.2 Service Differentiation and Resource Guarantees

We use the concept of service classes to provide service differentiation. A service class is defined as a category of service accesses that enjoy the same level of QoS support. However, service accesses belonging to different service classes may receive differentiated QoS support. Service classes can be defined based on client identities. For instance, a special group of clients may be guaranteed to receive preferential service support or a guaranteed share of system resources. Service classes can also be defined based on service types or data partitions. For example, a checkout transaction may be considered more important than a catalog-browsing request.

Our framework provides differentiated services to different service classes on two fronts. First, service classes can acquire differentiated QoS support by specifying different QoS yield functions. For instance, serving a VIP-class client can be configured to generate more QoS yield than serving a regular client. Secondly, each service class can be guaranteed to receive a certain portion of system resources. Most previous service differentiation studies have focused on only one of these two means of QoS support. We believe a combination of them provides two benefits when the system is overloaded: 1) the resource allocation is biased toward high-yield classes for efficient resource utilization; 2) the guaranteed portion of system resources for low-yield classes, however, will not be compromised. The second benefit is important because overly sacrificing low-priority classes might have an indirect adverse effect on the overall system yield in the long run. Taking an online catalog service as an example, if we drop too many low-priority catalog-browsing requests, soon there will be a dramatic drop in high-priority order placement requests, because each order placement request is normally preceded by a number of browsing requests.
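As an illustration only, the two differentiation mechanisms could be wired together in a per-class specification that pairs a yield function with an optional guaranteed resource share; the ServiceClassSpec type and its field names below are hypothetical and are not the configuration interface of Neptune or of this framework:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ServiceClassSpec:
    name: str
    yield_fn: Callable[[float], float]  # maps response time (seconds) to QoS yield
    guaranteed_share: float = 0.0       # guaranteed fraction of node resources (0 = none)

# Hypothetical configuration: checkout requests are worth more yield, while
# catalog browsing keeps a small guaranteed share so it is never starved.
classes = [
    ServiceClassSpec("checkout", lambda r: 10.0 if r <= 2.0 else 0.0, guaranteed_share=0.2),
    ServiceClassSpec("browse",   lambda r:  1.0 if r <= 2.0 else 0.0, guaranteed_share=0.1),
]

# Guaranteed shares must not exceed the node's total capacity.
assert sum(c.guaranteed_share for c in classes) <= 1.0
```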
3 Two-level Request Distribution and Scheduling
This section presents our two-level request distribution and scheduling architecture. In our system, each external service request enters the service cluster through one of the gateways, where it is classified into one of the service classes according to rules specified by service providers. The gateway node then accesses one or more (in the case of service aggregation) internal service nodes to fulfill the request. Inside the service cluster, each service can be made available at multiple nodes through service replication. In such a context, supporting quality of service for multiple service classes essentially becomes a cluster-wide resource allocation issue. The dynamic partitioning approach proposed in our previous work adaptively partitions all replicas for each service into several groups, and each group is assigned to handle requests from one service class [27]. We believe such a scheme has a number of drawbacks. First, a cluster-wide scheduler is required to make server partitioning decisions, which is not only a single point of failure but also a potential performance bottleneck. Secondly, cluster-wide server groups cannot be repartitioned very frequently, which makes it difficult to respond promptly to changing resource demand. Lastly, the requirement of session affinity by most online services could limit the applicability of server partitioning. More specifically, requests from the same external client have to be served by the same service node under the constraint of session affinity; thus a client's high-priority requests will be served by a server group dedicated to low-priority requests if that client initially issues some low-priority requests.

In order to address those problems, the QoS framework we propose in this paper does not explicitly partition server groups. Instead, we use a decentralized two-level request distribution and scheduling architecture. Figure 3 illustrates such an architecture.

Figure 3: Two-level request distribution and scheduling. (The figure shows external requests entering through replicated gateways, cluster-wide request distribution, and per-node scheduling at each service node inside the service cluster.)

Each node in this architecture can process requests from all existing service classes. The request distribution and scheduling decision is made at two levels. At the cluster level, we employ a class-aware load-balancing scheme, called class LB, to evenly distribute requests for each class to all servers. The load-balancing scheme we use is a random-polling policy that discards slow-responding polls. Under this policy, whenever a service client is about to seek an internal service for a particular service class, it polls a certain number of randomly selected service nodes to obtain the number of active or queued requests for that service class on those nodes. It then directs the service request to the node with the smallest number of active or queued requests. Polls not responded to within a deadline are discarded. This strategy also helps exclude faulty nodes from request distribution. We empirically choose a poll size of 3 and a polling deadline of 10 ms in our system. Our prior study shows that such a policy is scalable and well suited for services of all granularities [22]. Inside each service node, our approach must also deal with resource allocation across multiple service classes. This is handled by a node-level class-aware scheduling scheme, which will be discussed in Section 4. Notice that node-level class-aware scheduling is not necessary for the server partitioning approach, because every node is configured to serve a single service class under that approach.
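A minimal sketch of the cluster-level random-polling policy described above is given below; the poll size of 3 and the 10 ms deadline follow the empirical settings in the text, while the poll helper (which returns a node's per-class queue length, or None on timeout) is a hypothetical stand-in for the actual polling RPC:

```python
import random

POLL_SIZE = 3            # number of randomly selected nodes to poll
POLL_DEADLINE = 0.010    # seconds; slower polls are discarded

def pick_node(nodes, service_class, poll):
    """Class-aware load balancing: poll a few random replicas and direct the
    request to the one with the fewest active or queued requests for this class.
    `poll(node, service_class, timeout)` is assumed to return the queue length,
    or None if the node does not answer within the polling deadline."""
    candidates = random.sample(nodes, min(POLL_SIZE, len(nodes)))
    best_node, best_load = None, None
    for node in candidates:
        load = poll(node, service_class, POLL_DEADLINE)
        if load is None:      # late or missing reply: skip (also excludes faulty nodes)
            continue
        if best_load is None or load < best_load:
            best_node, best_load = node, load
    return best_node          # None means every polled node timed out

# Toy usage with a fake, instantaneous poll function.
fake_loads = {"n1": 4, "n2": 1, "n3": 7, "n4": 2}
print(pick_node(list(fake_loads), "checkout", lambda n, c, t: fake_loads[n]))
# e.g. 'n2': the least-loaded of the three sampled nodes
```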
An Alternative Approach for Comparison
For the purpose of comparison, we also design a request distribution scheme based on server partitioning [27]. Server partitioning is adjusted periodically at fixed intervals. This scheme uses past resource usage to predict future resource demand, and it makes different partitioning decisions in system under-load and overload situations.

When the aggregate demand does not exceed the total system resources, every service class gets its demanded resource allocation. The remaining resources are then allocated to all classes in proportion to their demand. When the system is overloaded, in the first round we allocate to each class its resource demand or its resource allocation guarantee, whichever is smaller. Then the remaining resources are allocated to the classes in a priority order. The priority order is sorted by the expected yield per unit time for each class, which is calculated by dividing the initial yield by the mean service time of that class.

Fractional server allocations are allowed in this scheme. All servers are partitioned into two pools, a dedicated pool and a shared pool. A service class with a 2.4-server allocation, for instance, will get two servers from the dedicated pool and acquire a 0.4-server allocation from the shared pool through sharing with other classes that have fractional allocations.

The length of the adjustment interval should be chosen carefully: it should not be so small that it threatens system stability, nor so large that the system cannot respond promptly to demand changes. We choose the interval to be 10 seconds in this paper. Within each allocation period, service requests are directed to one of the servers allocated to the corresponding service class according to the load-balancing policy [22].
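The following sketch captures one reading of the periodic partitioning rule described above: under light load each class receives its demand plus a demand-proportional share of the surplus, while under overload each class first receives the smaller of its demand and its guarantee, and the leftover servers go to classes in decreasing order of expected yield per unit time. The function and argument names are illustrative only:

```python
def partition_servers(total, demand, guarantee, initial_yield, mean_service_time):
    """Periodic server partitioning (fractional allocations allowed).
    All arguments except `total` are dicts keyed by service class."""
    alloc = {}
    if sum(demand.values()) <= total:
        # Under-load: satisfy every demand, then spread the surplus
        # proportionally to demand.
        surplus = total - sum(demand.values())
        total_demand = sum(demand.values()) or 1.0
        for cls, d in demand.items():
            alloc[cls] = d + surplus * d / total_demand
    else:
        # Overload, round 1: min(demand, guarantee) for each class.
        for cls, d in demand.items():
            alloc[cls] = min(d, guarantee.get(cls, 0.0))
        remaining = total - sum(alloc.values())
        # Round 2: hand out what is left by expected yield per unit time.
        by_priority = sorted(demand,
                             key=lambda c: initial_yield[c] / mean_service_time[c],
                             reverse=True)
        for cls in by_priority:
            if remaining <= 0:
                break
            extra = min(demand[cls] - alloc[cls], remaining)
            alloc[cls] += extra
            remaining -= extra
    return alloc

print(partition_servers(
    total=10,
    demand={"gold": 8, "silver": 6},
    guarantee={"gold": 3, "silver": 3},
    initial_yield={"gold": 10.0, "silver": 2.0},
    mean_service_time={"gold": 0.05, "silver": 0.05}))
# -> gold gets min(8, 3) = 3 plus the remaining 4 servers (7 total); silver keeps its guarantee of 3.
```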
4 Node-level Service Scheduling

Our scheduling scheme at each service node is based on a class-aware multi-queue scheduler. Each service node contains a request queue per service class. Whenever a service request arrives, it enters the appropriate queue for the service class it belongs to. When the previous service request is completed, or when a new request arrives while the system is currently idle, the scheduler dequeues a request from one of the queues for service. Figure 4 illustrates the runtime environment of a service node. We define the service time of a request to be the elapsed time between its scheduling time and its completion time. Thus the total response time of a service request is the sum of its service time, queuing time, and network communication time.

Figure 4: Runtime environment of a service node. (The figure shows per-class request queues, Class1 through ClassN, feeding a single service scheduler.)

Each service class is configured with a QoS yield function and, optionally, a guaranteed portion of resource allocation. The goal of the scheduling scheme is to provide the guaranteed system resources for all service classes and to schedule the remaining resources to achieve high aggregate QoS yield. Our analysis of the scheduling scheme is based on the following assumptions.

1. We assume the resource consumption of a service access is proportional to its service time. This assumption should be close to reality when all accesses are of the same type. Supporting multi-dimensional resource scheduling is beyond the scope of this paper. However, this simplification should not affect the general applicability of our schemes, because we can provide such support by plugging in a multi-dimensional resource measurement and allocation module. Based on this assumption, we will use resource consumption and service time interchangeably from now on.

2. We also assume that the scheduler has knowledge of the mean service time for each service class. Generally this can be achieved through service provider specification or off-line profiling [27]. But it might not have prior knowledge of the service time for each individual service access.

We use the first-come-first-served (FCFS) rule within each queue to ensure fairness for service requests belonging to the same service class. However, the application of this rule may not be the optimal choice for maximizing aggregate QoS yield. We have done an analysis, and the following theorem shows that under certain assumptions, at least one optimal scheduling scheme employs FCFS scheduling within each service class queue.

Theorem 1 For any history with a finite number of incoming requests, if 1) all requests are eventually served; 2) all requests in the same service class have the same service time; and 3) the first derivatives of all QoS yield functions are monotonically non-increasing; then at least one of the optimal scheduling schemes that maximize aggregate QoS yield employs FCFS scheduling within each service class queue.

The proof is laid out in Appendix A. Notice that all sample QoS yield functions in Section 2.1 do have monotonically non-increasing first derivatives before the yield drops to zero. We do realize that some assumptions for this theorem may not hold in reality. For instance, some requests have to be dropped when resource demand exceeds available resources over the long haul. In addition, requests in the same service class are unlikely to have exactly the same service time. Nevertheless, the above analysis suggests that the FCFS rule is a reasonable heuristic for achieving high QoS yield in addition to maintaining fairness.

With the FCFS rule enforced on each queue, the scheduling issue becomes choosing the appropriate queue each time the scheduler is invoked. For a service node hosting N service classes $C_1, C_2, \ldots, C_N$, each class $C_k$ is configured with a QoS yield function $Y_k$ and a minimum system resource share guarantee $g_k$, which is expressed as a percentage of total system resources. A guaranteed resource share of zero implies no such guarantee at all. Note that when $\sum_{k=1}^{N} g_k = 1$, our system falls back to a static resource partitioning scheme. Figure 5 illustrates the framework of our service scheduling algorithm at each scheduling point. In the rest of this section, we will discuss two aspects of the scheduling algorithm: 1) maintaining the resource allocation guarantee; and 2) achieving high aggregate QoS yield.

1. Drop from each queue head those requests that are likely to generate zero or very small yield according to the request arrival time, expected service time, and the yield function. (Section 4.2)

2. Search for the service classes that have a resource consumption of less than the guaranteed share and also have a non-empty request queue. (Section 4.1)

   (a) If found, schedule the one with the largest gap between the actual consumption and the guaranteed share. (Section 4.1)

   (b) Otherwise, schedule a request from one of the non-empty queues that is likely to produce high aggregate QoS yield. (Section 4.2)

Figure 5: The framework of our service scheduling algorithm.
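The scheduling loop of Figure 5 can be sketched as follows; the queue representation and the expected_yield and consumption_share helpers are placeholders for the mechanisms of Sections 4.2 and 4.1, so this is an illustrative skeleton under those assumptions rather than the framework's actual scheduler:

```python
from collections import deque

def schedule_next(queues, guarantees, expected_yield, consumption_share, now):
    """One scheduling decision following Figure 5.
    queues: {cls: deque of (arrival_time, expected_service_time)}
    guarantees: {cls: guaranteed resource share in [0, 1]}
    expected_yield(cls, request, now): yield expected if `request` starts now
    consumption_share(cls): recent resource consumption of cls as a fraction
    Returns (cls, request), or None if every queue is empty."""
    # Step 1: drop head requests that would produce (almost) no yield.
    for cls, q in queues.items():
        while q and expected_yield(cls, q[0], now) <= 0.0:
            q.popleft()

    # Step 2: look for classes running below their guaranteed share.
    lagging = [(guarantees.get(cls, 0.0) - consumption_share(cls), cls)
               for cls, q in queues.items()
               if q and consumption_share(cls) < guarantees.get(cls, 0.0)]
    if lagging:
        # Step 2a: serve the class with the largest shortfall.
        _, cls = max(lagging)
        return cls, queues[cls].popleft()

    # Step 2b: otherwise be greedy on expected yield per unit of resource.
    best_cls, best_score = None, None
    for cls, q in queues.items():
        if not q:
            continue
        score = expected_yield(cls, q[0], now) / q[0][1]
        if best_score is None or score > best_score:
            best_cls, best_score = cls, score
    if best_cls is None:
        return None
    return best_cls, queues[best_cls].popleft()

# Toy usage: bronze is below its guaranteed share, so step 2a picks it.
queues = {"gold": deque([(0.0, 0.2)]), "bronze": deque([(0.1, 0.2)])}
print(schedule_next(queues, {"bronze": 0.3},
                    expected_yield=lambda c, r, t: 1.0,
                    consumption_share=lambda c: 0.0, now=0.5))
# -> ('bronze', (0.1, 0.2))
```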
4.1 Calculating Resource Consumption for Allocation Guarantee

A central piece of maintaining the resource allocation guarantee is to calculate the resource consumption of each service class at any desired time. This calculation should be biased toward recent usage so that it stabilizes quickly when the actual resource consumption jumps from one level to another. It should not be too shortsighted either, in order to avoid oscillations or over-reactions to short-term spikes. Among the many possible functions that exhibit those properties, we define the resource consumption of class $C_k$ at time t to be the weighted summation of the resource usage of all class $C_k$ requests completed no later than t. The weight is chosen to decrease exponentially with the elapsed time since the request completion. For each request r, let ct(r) be its completion time and s(r) be its actual service time, which is known after its completion. Equation (4) defines $u_k(t)$ to be the resource consumption of class $C_k$ at time t:

$$u_k(t) = \sum_{\{r \,\mid\, r \in C_k \text{ and } ct(r) \le t\}} \gamma^{\,t - ct(r)} \, s(r), \quad (4)$$

where the decay base $\gamma$ satisfies $0 < \gamma < 1$.

With the definition of $u_k(t)$, the proportional resource consumption of class $C_k$ can be represented by $u_k(t) / \sum_{j=1}^{N} u_j(t)$. In step 2 of the service scheduling, this proportional consumption is compared with the guaranteed share to search for under-allocated service classes.
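One way to maintain $u_k(t)$ incrementally is to decay the running sum whenever it is read or updated, as in the sketch below; the incremental-update form, the choice of decay base gamma, and the helper names are assumptions on our part, since Equation (4) only defines the weighted sum itself:

```python
class DecayedConsumption:
    """Tracks u_k(t) from Equation (4): a sum of completed service times,
    each weighted by gamma ** (elapsed time since that request completed)."""

    def __init__(self, gamma=0.9):
        assert 0.0 < gamma < 1.0
        self.gamma = gamma
        self.value = 0.0
        self.last_update = 0.0

    def _decay_to(self, now):
        # Age the existing sum: every term gains a factor gamma ** (now - last_update).
        self.value *= self.gamma ** (now - self.last_update)
        self.last_update = now

    def record_completion(self, now, service_time):
        self._decay_to(now)
        self.value += service_time   # a just-completed request has weight gamma ** 0 = 1

    def consumption(self, now):
        self._decay_to(now)
        return self.value

# Proportional consumption used in step 2 of Figure 5.
def proportional_share(trackers, cls, now):
    total = sum(t.consumption(now) for t in trackers.values())
    return trackers[cls].consumption(now) / total if total > 0 else 0.0

u = {"gold": DecayedConsumption(0.9), "bronze": DecayedConsumption(0.9)}
u["gold"].record_completion(now=1.0, service_time=0.5)
u["bronze"].record_completion(now=2.0, service_time=0.5)
print(round(proportional_share(u, "gold", now=2.0), 3))  # 0.45 / 0.95 ~= 0.474
```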
4.2 Achieving High Aggregate QoS Yield
We start by examining the policies employed in step 2b of the service scheduling. The basic policy we consider is global FCFS scheduling, which chooses the request with the earliest arrival time among all queue heads. The drawback of this policy is that it does not take class-specific properties into consideration. In order to maximize the aggregate yield, we design a GREEDY policy that examines the head request of each queue and chooses the one with the highest yield per unit of resource consumption. This policy requires us to know in advance an expected service time es(r) for each request r. An accurate prediction will likely improve the scheduling performance, but alternatively we can simply use the mean service time of the corresponding service class for this purpose. Let $r_k$ be the head request of $C_k$, let at(r) be the arrival time of request r, and let t be the present time. The GREEDY policy always schedules the $r_k$ with the highest expected yield per unit of resource consumption, $Y_k(t + es(r_k) - at(r_k)) / es(r_k)$.
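To make the GREEDY score concrete, the snippet below computes $Y_k(t + es(r_k) - at(r_k)) / es(r_k)$ for a head request; the numbers in the example are made up purely for illustration:

```python
def greedy_score(yield_fn, arrival_time, expected_service, now):
    """Expected yield per unit of resource if the head request is scheduled now:
    its response time would be (now - arrival_time) + expected_service."""
    expected_response = (now - arrival_time) + expected_service
    return yield_fn(expected_response) / expected_service

# Two head requests competing at now = 10.0 s, both under a throughput-style
# yield function with C = 10 and a 2-second deadline.
y = lambda r: 10.0 if r <= 2.0 else 0.0
print(greedy_score(y, arrival_time=9.5, expected_service=0.2, now=10.0))  # 10 / 0.2 = 50.0
print(greedy_score(y, arrival_time=9.0, expected_service=0.8, now=10.0))  # 10 / 0.8 = 12.5
# GREEDY schedules the first (cheaper) request even though the second is older.
```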
Next we look at the request dropping policies in step 1 of the service scheduling in Figure 5. Request dropping is necessary when resource demand exceeds the available resources over a long period of time. The basic, conservative dropping policy drops all requests with an expected yield of zero even if they are scheduled right away. Let $D_k$ be the deadline of yield function $Y_k(\cdot)$ after which the yield becomes zero, and let t be the present time. The basic policy drops all requests of $C_k$ such that $t + es(r_k) - at(r_k) > D_k$. However, such a dropping scheme might be too conservative. It does not drop those close-to-deadline requests that are only expected to generate a small yield, or even zero yield due to the inaccuracy of the expected service time. An aggressive dropping strategy, on the other hand, has its own pitfall. It might drop positive-yield requests only to find the service node idling later on.
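The two dropping policies discussed above can be sketched as predicates over a queue's head request; the conservative test mirrors the condition $t + es(r_k) - at(r_k) > D_k$, while the aggressive variant shown here is one possible instantiation that adds a hypothetical yield-per-resource threshold epsilon chosen by the operator:

```python
def should_drop_conservative(deadline, arrival_time, expected_service, now):
    """Drop only if the request is already certain to yield nothing:
    even if scheduled immediately, it would finish past the deadline."""
    return (now - arrival_time) + expected_service > deadline

def should_drop_aggressive(yield_fn, arrival_time, expected_service, now, epsilon):
    """Drop if the expected yield per unit of resource falls below a threshold.
    More resource-efficient under overload, but it risks dropping requests
    that could still have produced some yield."""
    expected_response = (now - arrival_time) + expected_service
    return yield_fn(expected_response) / expected_service < epsilon
```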