Evaluating Cluster-Based Network Servers∗

Enrique V. Carrera and Ricardo Bianchini
Department of Computer Science, Rutgers University
Piscataway, NJ 08854-8019
{vinicio,ricardob}@cs.rutgers.edu

Abstract

In this paper we use analytic modeling and simulation to evaluate network servers implemented on clusters of workstations. More specifically, we model the potential benefits of locality-conscious request distribution within the cluster and evaluate the performance of a cluster-based server (called L2S) we designed in light of our experience with the model. Our most important modeling results show that locality-conscious distribution on a 16-node cluster can increase server throughput with respect to a locality-oblivious server by up to 7-fold, depending on the average size of the files requested and on the size of the server's working set. Our simulation results demonstrate that L2S achieves throughput that is within 22% of the full potential of locality-conscious distribution on 16 nodes, outperforming and significantly outscaling the best-known locality-conscious server. Based on our results and on the fact that the files serviced by network servers are becoming larger and more numerous, we conclude that our locality-conscious network server should prove very useful for its performance, scalability, and availability.

1 Introduction

The number of Internet users has increased rapidly over the last few years. This large customer base is placing significant stress on the computing resources of popular services available on the Internet. As a result, service providers have no alternative but to use either large multiprocessor or distributed network servers in order to achieve reasonable quality of service levels. We are interested in distributed servers (cluster-based servers in particular), as they can deliver extremely high performance at low cost. The commercial HotBot search engine is an interesting example of a popular network service that performs several million queries per day using a very large cluster-based server [12]. Unfortunately, the performance achieved by these cluster-based servers is often far from ideal, mostly as a result of the way the clients' requests are distributed within the servers. Traditionally, these cluster-based servers distribute the requests among the cluster nodes according to a simple load balancing mechanism, and each node is responsible for serving its requests independently of the other nodes.

∗ This work has been partially supported by NSF under Grant number CCR-9510173 and by Brazilian CNPq.

The main performance problems with these traditional servers are that the memories of the cluster are utilized as a set of independent caches of disk content and that the request distribution is oblivious to the contents of the different caches. Although this caching strategy performs well when a small set of files accounts for a large percentage of requests and the average size of the files is small, performance suffers when these conditions do not hold. Furthermore, the files serviced by network servers are becoming larger and more numerous, especially with the proliferation of multimedia files and content hosting servers. For instance, a WWW hosting service, such as that provided by most ISPs, is especially demanding, as WWW pages from a large number of renters (individuals or corporations) are managed by the same set of nodes.

Under these adverse circumstances, using the set of memories of the cluster as a single large cache and distributing requests among the nodes according to cache locality prove extremely profitable. In more detail, a locality-conscious server can schedule a request on the node most likely to be caching the requested file locally, i.e. the node that last serviced the requested file. A strict implementation of such a server does not allow replication of cached files, which can increase cache hit rates significantly, but can also produce severe load imbalance. A looser implementation, where some amount of replication is allowed, can promote both cache locality and load balancing.

Pai et al. [16] proposed LARD (Locality-Aware Request Distribution), an elegant request distribution algorithm that considers both locality and load balancing, as well as a cluster-based server (hereafter called the LARD server) of static WWW content implementing the algorithm. Their results show that the LARD server exhibits better performance than servers that promote either locality or load balancing in isolation. Unfortunately, their work did not involve any analytical modeling and did not consider the potential benefits of locality-conscious distribution in general. It is unclear from the LARD paper, for instance, how much of the potential benefit of locality-conscious distribution is actually accrued by the LARD server. Moreover, LARD has a few characteristics that may limit its usefulness: (a) it involves a front-end node that is responsible for distributing requests and represents both a single point of failure and a potential bottleneck; (b) the cache space available at the front-end is wasted, as the front-end is completely dedicated to request distribution; and (c) all requests must incur the overhead of being forwarded from the front-end to the back-end nodes. Even though the single point of failure limitation could probably be dealt with by having an additional failover node, the other limitations would still remain.
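To make the locality-conscious scheduling idea above concrete, here is a minimal Python sketch of a strict, no-replication dispatcher: every request for a file goes to the node that last serviced it. This is only an illustration of the concept, not LARD or L2S; the class and helper names are our own.

```python
class StrictLocalityDispatcher:
    """Strict locality-conscious dispatch: no replication of cached files."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.load = {n: 0 for n in self.nodes}   # open connections per node
        self.owner = {}                          # file -> node that caches it

    def least_loaded(self):
        return min(self.nodes, key=lambda n: self.load[n])

    def dispatch(self, requested_file):
        # The first request for a file is placed on the least-loaded node;
        # every later request goes to that same node, maximizing hit rates at
        # the risk of load imbalance (hence the looser variants discussed above).
        node = self.owner.setdefault(requested_file, self.least_loaded())
        self.load[node] += 1
        return node
```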

The remainder of this paper is organized as follows. The next section presents some background material on cluster-based network servers. Section 3 describes our model in detail and presents its most important results. Section 4 presents L2S and compares it against LARD. Section 5 describes the methodology used in our evaluation of L2S and presents its most important results. Section 6 discusses some related pieces of work. Finally, Section 7 summarizes our findings and concludes the paper.


Figure 1: Typical cluster of workstations.

To address these concerns, in this paper we use analytic modeling and simulation to evaluate locality-conscious network servers of static content based on clusters. In the first part of the paper, we model the potential benefits of locality-conscious request distribution using an open queuing network under varying cache hit rate and average serviced file size assumptions. The model is solved by instantiating these parameters and mathematically solving the system of equations that describes it. In effect, the model places an upper bound on the throughput achievable by locality-conscious request distribution. Our modeling results show that this type of distribution can increase server throughput with respect to a locality-oblivious server by up to 7-fold for a 16-node cluster.

Based on these observations and avoiding the limitations of LARD, in the second part of the paper we propose a new locality-conscious server, the Locality and Load balancing Server (L2S). In our server all nodes can both forward and service requests and, thus, the server can achieve higher throughput and has no single point of failure or potential bottleneck. As a result of these properties, the server is highly efficient and scalable. In addition, L2S avoids forwarding requests whenever servicing them locally is profitable and balances load effectively, by allowing some amount of file replication and fairly quickly detecting variations in load. L2S can be used to implement a variety of cluster-based network servers, such as ftp, database, or WWW servers.

Without loss of generality, our evaluation of L2S concentrates on its application as a WWW server. The evaluation is based on the detailed simulation of a 16-node cluster driven by 4 real WWW traces, and on a performance comparison of L2S against the LARD server and a traditional WWW server. Our simulation results demonstrate that L2S achieves throughput that is within 22% of the full potential of locality-conscious distribution, outperforming and significantly outscaling both the LARD and traditional servers. Based on these results, we conclude that our locality-conscious network server should become very useful in practice for its combination of three important properties: high performance, good scalability, and decentralized control.

In summary, the main contributions made by this paper are:

• Our modeling results place theoretical bounds on the performance of any locality-conscious server. These bounds are used to evaluate the real servers.

• L2S seems to be the first cluster-based server to distribute requests based on locality and load balancing considerations in a totally distributed fashion.

• Our simulation results show that L2S performs very well, even in comparison with the bounds set by the analytical model.

2 Cluster-Based Network Servers

Figure 1 presents a typical architecture for a cluster of workstations. Each node of the cluster is a full-blown commodity workstation or PC, with a processor, multiple levels of caches, main memory, a disk, and a network interface. All nodes in the cluster have access to data stored on any disk via a distributed file system. The nodes are interconnected with a high-performance commodity network (such as Gigabit Ethernet), which is accessed without operating system intervention, using a modern messaging layer (such as the Virtual Interface Architecture or simply VIA [10]). The network also connects the nodes to a bridge/router that links the cluster to the Internet.

A simple network server for such an architecture can simply consist of several independent sequential servers, as long as some load balancing mechanism is used to distribute the requests among the cluster nodes. Request distribution can be performed outside or in the context of the cluster. Round-robin DNS (Domain Name Service) is the simplest scheme for implementing load balancing from outside the cluster [15]. Under this scheme, the DNS server translates the "name" of the server to node addresses in round-robin fashion. The translation is then cached by intermediate name servers and possibly clients. This caching of translations can cause significant load imbalance [5]. In essence, the main problem with DNS distribution is that the server cannot adjust the request distribution according to its own instantaneous load and/or locality information.

With this in mind, several servers implement their own request distribution (with or without a dedicated front-end node responsible for the distribution of requests). In this scheme, the main issues are then: what information to consider when distributing requests, and how to handle the TCP connection accepted by one of the nodes when another node will be responsible for serving the request.

Traditional network servers distribute requests based solely on their load metric, which is usually the resource utilization or the number of open connections being handled by each node (e.g. [9, 14, 1, 11, 5, 18]). As a result, these traditional servers may suffer from high cache miss rates (when the server's working set size exceeds the size of each local memory), among other problems. To reduce miss rates, the network server can distribute requests based on the content requested and on cache locality information [16, 8]. L2S is in this category. Some servers [19, 13, 22, 20] use content-based distribution for other reasons besides cache locality, such as avoiding the regeneration of dynamic WWW content or segregating different types of WWW content (images, text, CGI scripts, etc.).

In terms of handling network connections, the most efficient servers use some form of packet re-writing/connection hand-off [16, 5], where the connection is transferred to the node that will actually service the request, which responds directly to the client. L2S uses this strategy.
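For contrast, the sketch below shows the load-only policy used by the traditional servers cited above: pick the node with the fewest open connections, ignoring cache contents. The function and node names are illustrative only.

```python
from typing import Dict

def fewest_connections(load: Dict[str, int]) -> str:
    """Content-oblivious assignment: node with the fewest open connections."""
    return min(load, key=load.get)

# Example: the same file may land on a different node each time, so every
# node's memory ends up caching (and missing on) the same popular files.
load = {"node0": 3, "node1": 1, "node2": 2}
for request in ("/a.html", "/b.jpg", "/a.html"):
    node = fewest_connections(load)
    load[node] += 1
    print(request, "->", node)
```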

Figure 2: Model for a cluster of workstations. (Requests arrive at rate Nλ at the router, which has service rate µr; each node is modeled by queues for its incoming and outgoing network interface (µi, µo), its CPU (µp, µm, µf), and its disk (µd).)

3 Modeling

3.1 The Model

We developed a simple open queuing network to model locality-conscious and locality-oblivious servers. Figure 2 depicts our model of a cluster of N workstations. The model assumes that all queues are M/M/1. Besides the rates mentioned in the figure and described in Table 1, other important parameters of our model are the cache (main memory) size per node (C), the average size of the requested files (S), the number of files stored on the server (F), and the file request distribution. This latter parameter deserves further comment. In this study, we concentrate on heavy-tailed distributions of access, such as the ones exhibited by WWW servers [7]. Such distributions can be approximated by means of Zipf-like distributions [7], where the probability of a request for the i'th most popular file is proportional to 1/i^α, with α typically taking on some value less than unity.

With these parameters, we can define the total cache space and the average cache hit rate for the two types of servers we consider. In the traditional, locality-oblivious server, the total cache space Clo is only C bytes, since the most requested files end up cached by all nodes. On the other hand, a locality-conscious server can use Clc = N × C bytes of cache space if no file replication is allowed, or Clc = N × (1 − R) × C + R × C bytes if an R percentage of the main memory is used for file replication. (Note that we can regard a locality-oblivious server as a locality-conscious server with R = 1.) With these cache sizes, the average cache hit rate can be defined as H = z(n, F), where z(n, F) represents the accumulated probability of requesting the n most accessed files in a Zipf-like distribution of the requests to F files. The number of cached files (n) is equal to min(Clo ÷ S, F) for the locality-oblivious server, and min(Clc ÷ S, F) for the locality-conscious server. Furthermore, the percentage of requests forwarded to another workstation in the locality-conscious server can be defined as Q = (N − 1) × (1 − h) ÷ N, where the hit rate for replicated files (h) is equal to z(min(R × C ÷ S, F), F).

Note that our use of z(n, F) to describe hit rates effectively means that the most accessed files are always cached in the model. These hit rate definitions are useful and intuitive, but in order to simplify the presentation of our results, we define the locality-conscious hit rate (Hlc) as a function of the locality-oblivious hit rate (Hlo) as follows: Hlc = z(min(Clc ÷ S, f), f), where f is such that Hlo = z(Clo ÷ S, f). In the same way, we define the percentage of requests forwarded to another workstation as Q = (N − 1) × (1 − h) ÷ N, where h = z(R × Clo ÷ S, f). Note that all the above definitions assume that the probability that an unreplicated file is cached by a specific workstation is the same for all workstations (1/N).

The first two columns of Table 1 summarize the model parameters and their descriptions. As our model assumes perfect load balancing and does not consider cache replacements or contention for the memory and I/O buses, it provides an upper bound on the throughput and a lower bound on the service latency achievable by the servers. Since the latencies involved in servers are usually low compared to the overall latency a client experiences establishing connections, issuing requests, and waiting for replies across a wide-area network, in this paper we focus only on throughput performance.

Parameter values. We carefully selected default values for our model parameters. The service rate µr was selected to approximate the performance of the Cisco 7576 router (4 Gbits/s). The service rates of the network interfaces (µi and µo) were selected assuming that the cluster network provides 1 Gbit/s full-duplex links and a 3-microsecond overhead per message. The disk service rate µd, on the other hand, was determined assuming a disk with an access (seek + rotation) time of 14 ms and a transfer rate of 10 MBytes/s [16]. The µd value presented in Table 1 considers the extra access needed for reading the disk directory. The µp and µm rates are based on data from [17] and [16]. Finally, µf is based on data from [5]. The last column of Table 1 summarizes the default values of the parameters.

All our modeling results are derived by instantiating specific variables in the system of equations that describes our queuing model and then mathematically solving the system.
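To make these definitions concrete, the following Python sketch computes z(n, F), the two hit rates, and the forwarded fraction Q. The Zipf exponent follows the α = 1 default of Table 1, while F, S, and R are illustrative values chosen for the example, not numbers taken from the paper.

```python
def z(n, F, alpha=1.0):
    """Cumulative probability of the n most popular files out of F (Zipf-like)."""
    n = min(int(n), int(F))
    weights = [1.0 / (i ** alpha) for i in range(1, int(F) + 1)]
    return sum(weights[:n]) / sum(weights)

# Illustrative instantiation (sizes in KBytes), not the paper's exact numbers.
N, R = 16, 0.15                      # nodes, fraction of memory used for replication
C = 128 * 1024                       # cache (main memory) per node
S, F = 32, 100_000                   # average file size, number of files
C_lo = C                             # locality-oblivious: nodes cache the same files
C_lc = N * (1 - R) * C + R * C       # locality-conscious: replicated space counted once
H_lo = z(C_lo / S, F)
H_lc = z(C_lc / S, F)
h = z(R * C / S, F)                  # hits on locally replicated files
Q = (N - 1) * (1 - h) / N            # fraction of requests forwarded
print(f"H_lo={H_lo:.2f}, H_lc={H_lc:.2f}, Q={Q:.2f}")
```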

3.2 Results

In this section we study the performance of both locality-conscious and locality-oblivious servers, as a function of the average size of requested files and the file working set size. To simplify the analysis, we model variations in working set size using the cache hit rate of the locality-oblivious server; as the working set increases, the hit rate decreases.

Base results. Figures 3 and 4 show the throughput of the locality-oblivious server and that of the locality-conscious server, respectively. The throughput peaks represent parts of the parameter space where the workload is CPU or memory-bound, while the relatively flat areas represent parts of the parameter space where the workload is I/O-bound. We can see from the figures that the throughput of both servers increases as the average file size decreases and the cache hit rate increases. However, throughputs only increase significantly in certain areas of the parameter space. For the locality-oblivious server, throughputs only increase significantly for files smaller than 64 KBytes and hit rates higher than 80%.

Param   Description                            Definition or Default Value
N       Number of nodes                        16
R       Percentage of replication              0%
α       Zipf constant                          1
µr      Routing rate                           500000/size ops/s
µi      Request service rate at NI             140000 ops/s
µp      Request read/parsing rate              6300 ops/s
µf      Request forwarding rate                10000 ops/s
µm      Reply rate (after stored locally)      (0.0001 + S/12000)^-1 ops/s
µd      Disk access rate                       (0.028 + S/10000)^-1 ops/s
µo      Reply service rate at NI               (0.000003 + S/128000)^-1 ops/s
C       Total cache space                      Clo = C = 128 MBytes; Clc = N × (1 − R) × C + R × C
H       Cache hit rate                         Hlo = z(min(Clo ÷ S, F), F); Hlc = z(min(Clc ÷ S, F), F)
h       Cache hit rate for replicated files    h = z(min(R × C ÷ S, F), F)
Q       Percentage of requests forwarded       Q = (N − 1) × (1 − h) ÷ N

Table 1: Model parameters and their default values. S = avg file size in KBytes; size = avg transfer size in KBytes.
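The paper does not reproduce its full system of equations, but the way an open network of M/M/1 queues bounds throughput can be illustrated directly: the server saturates once some queue's arrival rate reaches its service rate, so an upper bound is the minimum over queues of µ_q/v_q, where v_q is the average number of visits per request. The sketch below uses the Table 1 rates with simplified, assumed visit ratios (one CPU sub-queue per activity, disk visited only on misses); it is a rough illustration, not the paper's exact model.

```python
def node_bound(mu, visits):
    """Per-node throughput bound: the smallest mu_q / v_q over the node's queues."""
    return min(mu[q] / v for q, v in visits.items() if v > 0)

S = 32.0                    # average file size in KBytes (illustrative)
H, Q, N = 0.95, 0.4, 16     # hit rate, forwarded fraction, nodes (illustrative)

mu = {                      # service rates from Table 1, in ops/s
    "ni_in":  140000.0,
    "parse":  6300.0,
    "fwd":    10000.0,
    "reply":  1.0 / (0.0001 + S / 12000),
    "disk":   1.0 / (0.028 + S / 10000),
    "ni_out": 1.0 / (0.000003 + S / 128000),
}
visits = {"ni_in": 1.0, "parse": 1.0, "fwd": Q, "reply": 1.0,
          "disk": 1.0 - H, "ni_out": 1.0}

per_node = node_bound(mu, visits)
router = 500000.0 / S       # routing rate, 500000/size ops/s (size taken as S here)
print("cluster throughput bound ~", round(min(N * per_node, router)), "reqs/s")
```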

Figure 3: Throughput of a locality-oblivious server.

Figure 4: Throughput of a locality-conscious server.

In contrast, the area of significant throughput increase for the locality-conscious server is much larger: files smaller than 96 KBytes and hit rates higher than about 50%. Furthermore, the locality-conscious server sustains its peak throughput over a much larger part of the parameter space than the locality-oblivious server.

Figure 5 directly compares the throughput of the servers. It plots the throughput increase provided by considering locality when distributing requests, i.e. it plots the throughput results of figure 4 divided by those of figure 3. Figure 6 shows a side view of figure 5. These figures demonstrate that the locality-conscious server can achieve throughput that is a factor of 7 higher than that of the locality-oblivious server. The improvements start growing as the hit rate increases and the average file size decreases. However, the improvements come down quickly after the hit rate reaches 80%, as this is the point where the locality-oblivious server starts performing well. When the hit rate of the locality-oblivious server reaches 95%, the throughput improvement for small files becomes slightly smaller than 1, due to the extra cost of forwarding requests in the locality-conscious server.

We also studied variations in memory size from 128 MBytes (the default setting) to 512 MBytes. The results here are not surprising, since for each memory size we are modeling the gains obtainable by locality-conscious servers across the whole hit rate spectrum. As one would expect, larger memories reduce the throughput benefit of considering locality just about everywhere in the parameter space. Nevertheless, the potential gains achievable by the locality-conscious server are still significant. For a memory size of 512 MBytes, these gains peak at a factor of about 6.5 over the locality-oblivious server, assuming all default parameter values (except for memory size, of course).

Summary of other modeling results. Besides the results just described, we find that load imbalance is another potential source of disappointment for a locality-conscious server for hit rates lower than 40%, and for hit rates higher than 65% combined with large files. A small degree of file replication (15%) in the locality-conscious server alleviates these problems. It provides for robust performance in the presence of load imbalance and reduces the overhead of request forwarding within the server. More details can be found in [8].

Figure 5: Throughput increase due to locality.

Figure 6: Throughput increase due to locality (side view).

4 L2S

Based on the results of our modeling work, in this section we propose the Locality and Load balancing Server (L2S). The most important aspect of the server is its request distribution algorithm. The idea behind it is to allow multiple nodes (called the server set) to cache a certain file and to distribute the requests for the file among these nodes, according to load considerations. A node's load is measured as the number of connections being serviced by the node.

Requests are directed to nodes using a standard method, round-robin DNS for instance. When a request arrives at a node, the request is parsed and, based on its content, the node must decide whether to service the request itself or forward the request to another node via some form of packet rewriting/hand-off. (From now on, we will refer to the node that received the client's request as the "initial" node and the node that actually services the request as the "service" node.) An initial node is chosen as the service node if it is not overloaded (i.e. its number of open connections is not larger than a user-defined threshold T) and either it already caches the requested file or this is the first time the file is requested. A node that does not cache the requested file should only be chosen as the service node if both the initial node and the least loaded node in the set of servers are overloaded. If these two conditions indeed hold, the node is included in the set of servers for the requested file. To avoid the widespread replication of files, the amount of replication is controlled by periodically reducing the size of the server sets, again according to load considerations. More specifically, the size of the server set for a file is reduced if the node assigned to service the file is underloaded (according to a user-defined threshold t), the number of servers for the file is greater than 1, and it has been a while since the server set was modified.

It is clear then that a node trying to make a distribution decision requires locality and load information about all nodes. These data are disseminated throughout the cluster via periodic broadcast (or point-to-point) messages. To avoid excessive overheads, it is important to utilize a user-level communication mechanism such as VIA [10] and to control the frequency of broadcasts according to their cost.

The dissemination of caching information is straightforward. Whenever an initial node changes the server set for a certain file, it broadcasts information about the modification to all the other nodes in the cluster.

The number of these broadcasts is almost negligible, since modifications to the server sets occur only seldom in steady state. The dissemination of load information is also simple. A node broadcasts information about its load changes when its local load is a certain number of connections greater or smaller than the last value it broadcast.

It is interesting to compare L2S and LARD (Locality-Aware Request Distribution), the best previously published locality-conscious distribution algorithm. Both L2S and LARD intend to optimize locality without disregarding load balancing. However, in LARD a front-end node accepts, parses, and distributes requests, while in L2S all nodes can perform these tasks as well as service requests. In more detail, LARD uses a dedicated front-end that makes all distribution decisions by centralizing the locality and load data of all other cluster nodes, called back-end nodes. The front-end forwards requests to back-end nodes using a hand-off mechanism. This strategy simplifies the management of load and locality information by centralizing them at the front-end. However, the front-end represents both a potential bottleneck and a single point of failure. In addition, the fact that the front-end must be dedicated to request distribution (in order not to worsen the bottleneck) reduces the number of nodes that can be used to service requests and, more importantly, cache files. Finally, the front-end forces all requests to undergo one forward operation.

In L2S we eliminate all of these problems. Nodes can accept, service, and distribute requests, so throughput is higher and not all requests must be forwarded. Moreover, all nodes behave exactly the same independently of the node configuration of the cluster, so the system is bottleneck-free and has no single point of failure. Thus, L2S trades off greater complexity (due to distributed management of load and locality information) for better efficiency, scalability, and fault tolerance than LARD.

Similarly to L2S, LARD also allows some amount of file replication and uses the number of open connections as the load measure of each back-end node. It is unclear from [16] how the LARD front-end finds out that a request has been serviced (i.e. that a connection has been terminated) by a back-end node. To simplify our comparison, we assume that LARD uses the same strategy as L2S. In L2S, the service node informs the other nodes when it finishes serving a request (or small set of requests). In LARD, we assume that whenever a back-end node finishes serving a request (or small set of requests), it informs the front-end node.

Note that the L2S and LARD distribution algorithms we just presented are tailored to non-persistent connection-oriented protocols (such as HTTP/1.0), where each client request represents a different connection. However, persistent connections (such as those of HTTP/1.1) can be handled by slightly modifying the algorithms, as described in [8] and [3], respectively.
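As a concrete illustration of the distribution decision described in this section, here is a Python sketch reconstructed from the prose above. The thresholds T and t follow the values used in Section 5 (20 and 10 connections); the choice of which node joins or leaves a server set and the "a while" timer are our own assumptions, since the paper does not spell them out.

```python
import time

T, t = 20, 10                 # overload / underload thresholds (open connections)
SHRINK_DELAY = 30.0           # assumed meaning of "a while" since the last change, in seconds

class L2SNode:
    """One cluster node; in L2S every node runs this same distribution logic."""

    def __init__(self, name, nodes):
        self.name = name
        self.nodes = list(nodes)
        self.load = {n: 0 for n in self.nodes}   # local view of open connections per node
        self.servers = {}                        # file -> set of nodes caching it (server set)
        self.changed = {}                        # file -> time the server set last changed

    def pick_service_node(self, f):
        srv = self.servers.setdefault(f, set())
        if self.load[self.name] <= T and (self.name in srv or not srv):
            # The initial node services the request: it is not overloaded and either
            # already caches the file or the file is being requested for the first time.
            srv.add(self.name)
            self.changed.setdefault(f, time.time())
            return self.name
        least = min(srv, key=lambda n: self.load[n]) if srv else self.name
        if self.load[least] <= T:
            return least                         # forward to a non-overloaded caching node
        # Both the initial node and the least-loaded caching node are overloaded:
        # grow the server set (new member chosen as the least-loaded node overall; assumed).
        new = min(self.nodes, key=lambda n: self.load[n])
        srv.add(new)
        self.changed[f] = time.time()
        return new

    def maybe_shrink(self, f, assigned):
        # Periodically reduce replication: drop one replica if the assigned node is
        # underloaded, the file has more than one server, and the set is old enough.
        srv = self.servers.get(f, set())
        if (len(srv) > 1 and self.load[assigned] < t
                and time.time() - self.changed.get(f, 0.0) > SHRINK_DELAY):
            srv.discard(next(n for n in srv if n != assigned))
            self.changed[f] = time.time()
```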

Log        Num files   Avg file size   Num requests   Avg req size   α
Calgary    8397        42.9 KB         567895         19.7 KB        1.08
Clarknet   35885       11.6 KB         3053525        11.9 KB        0.78
NASA       5500        53.7 KB         3147719        47.0 KB        0.91
Rutgers    24098       30.5 KB         535021         26.2 KB        0.79

Table 2: Main characteristics of the WWW server traces.

5 Simulation

5.1 Methodology

We are interested in evaluating performance under varying architectural assumptions (especially memory size and network performance) and, thus, we use detailed trace-driven simulation for our studies. Unfortunately, space limitations prevent us from presenting our study of parameter variations; the study can be found in [8]. Our simulators are based on the same basic code and use the same parameters listed in Table 1, except for the memory size, as explained below.

In the traditional server we simulate, all nodes behave independently of each other, except for the communication required by the distributed file system. The requests are assigned to nodes using a fewest-connections scheme (all cluster nodes are equally powerful), where a new request is assigned to the node with the fewest open connections at the moment. The simulated LARD system, on the other hand, assigns a node within the cluster to work as the front-end. We use the same execution parameters as determined by the designers of LARD in [16], since they produce the best results for our traces as well. In our L2S simulations, the requests are assigned to nodes using a round-robin DNS scheme. After a request is received by one of the nodes, the L2S distribution algorithm is executed with T = 20 connections and t = 10 connections.

We simulate a cluster communication infrastructure comprised of a software implementation of VIA, M-VIA [6], for a Gigabit Ethernet network. With this setup, each broadcast of locality or load information is implemented by multiple point-to-point M-VIA messages. Forwarding of requests is also implemented with M-VIA for both L2S and the LARD server. All communication latencies are simulated faithfully, including the CPU and network interface overheads. More specifically, sending a 4-byte message across the network takes 19 microseconds [6]. Out of this one-way latency, 6 microseconds are spent by the CPUs (3 microseconds apiece) and 12 microseconds are spent by the network interfaces (6 microseconds apiece) involved. The network itself peaks at 1 Gbit/s bandwidth and exhibits a 1-microsecond switch latency [21]. All types of contention (CPU, memory and I/O bus, network interface, and disk) are also simulated faithfully, except for the contention within the network fabric itself, since we are simulating a very fast switched network.

To avoid excessive communication overheads, a node in L2S only broadcasts information about load changes when its local number of active connections becomes 4 greater or smaller than the last broadcast value. Similarly, a back-end node in the LARD server only updates its load information at the front-end when 4 local connections have terminated since the last update. These frequencies were found to be the best for both L2S and the LARD server in most cases when assuming our default simulation parameters.
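The load-dissemination throttle just described can be sketched in a few lines; the class name and the broadcast callback are illustrative, and in the real system the update would travel over M-VIA messages.

```python
DELTA = 4   # re-broadcast only after the load drifts by this many connections

class LoadBroadcaster:
    """Throttled dissemination of a node's open-connection count."""

    def __init__(self, broadcast):
        self.broadcast = broadcast   # callback that ships the new load to the other nodes
        self.current = 0             # local number of active connections
        self.last_sent = 0           # value included in the last broadcast

    def connection_opened(self):
        self.current += 1
        self._maybe_send()

    def connection_closed(self):
        self.current -= 1
        self._maybe_send()

    def _maybe_send(self):
        if abs(self.current - self.last_sent) >= DELTA:
            self.broadcast(self.current)
            self.last_sent = self.current

# Example: opening nine connections triggers only two broadcasts (at 4 and 8).
b = LoadBroadcaster(broadcast=lambda load: print("broadcast load =", load))
for _ in range(9):
    b.connection_opened()
```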

We use four WWW server traces to drive our simulator. Three of them were studied in [2]: one trace from the University of Calgary, one trace from Clarknet (a commercial Internet provider), and one trace from NASA's Kennedy Space Center. The fourth trace contains the accesses made to the main server for the Computer Science Department at Rutgers University in the first 25 days of March 2000. We eliminated from the traces all requests that were incomplete (due to network failures or client commands) and ended up with the characteristics listed in Table 2. Note that the traces cover a relatively wide spectrum of behavior.

Despite the diversity of trace characteristics, the traces' working sets are fairly small (from 288 MBytes to 717 MBytes) compared to those of today's popular WWW content providers or hosting services and, thus, we simulate small main memories (32 MBytes) to reproduce situations where working set sizes are significant in comparison to cache sizes. In addition, this memory size allows us more direct comparisons with the LARD paper, since Pai et al. also assumed 32-MByte memories in their work. To avoid penalizing the traditional server excessively, we warm the node caches (by simulating the accesses in each trace once) before starting our measurements. To determine the maximum throughput of each system, we disregarded the timing information in the traces and scheduled new requests as soon as the router and network interface buffers would accept them. These characteristics and simulation setup produce cache miss rates between 9 and 28%, assuming a sequential server with 32 MBytes of main memory.

5.2 Results

Base results. Our base throughput results are shown in figures 7–10. The figures plot the throughput of each system as well as the best possible throughput predicted by our model assuming 15% replication. Many interesting observations can be made from these figures. Perhaps the most important is that L2S achieves throughput that is very close to that derived from our model. The throughput performance of L2S is always within 22% of the model results on 16 nodes. This result shows that L2S is extremely efficient and scalable, as the model determines an upper bound on the throughput achievable by any locality-conscious network server based on our cluster.

Another important observation is that L2S clearly outperforms the other two servers, especially for the Calgary, Clarknet, and Rutgers traces. For these traces L2S outperforms the LARD server on 16 nodes by 33%, 141%, and 56%, respectively. The comparison against the traditional server is even more favorable to L2S; our gains on 16 nodes are 180%, 366%, and 442%, respectively. In fact, as we expected, the traditional server exhibits unexciting performance for all traces and numbers of nodes.

Figure 7: Throughputs for the Calgary trace.


Figure 8: Throughputs for the Clarknet trace.


Figure 9: Throughputs for the NASA trace.

Figure 10: Throughputs for the Rutgers trace.

The L2S advantage for the NASA trace is not as significant (7% with respect to the LARD server and 27% with respect to the traditional server, again on 16 nodes), since the average size of the requested files (47 KBytes) is relatively large for this trace, making the router and network interfaces the real throughput bottleneck for configurations of 4 nodes or larger. In smaller cluster configurations the servers cannot service enough requests to saturate these resources.

The LARD server performs well for clusters of up to 8 or 12 nodes, but flattens out at about 5000 requests/second (exactly as pointed out in the LARD paper), as the connection establishment overhead at the front-end node becomes a serious bottleneck. However, even for smaller numbers of nodes, the throughput performance of the LARD server suffers due to the resources (especially cache space) wasted on the front-end. This problem is most easily visible when studying the cache miss rate behavior of the different systems. For a small number of nodes, L2S exhibits the lowest miss rates, but as we increase the number of nodes, the LARD server starts to exhibit miss rates that are comparable to (if not slightly lower than) those of L2S for all traces. This significant improvement in the LARD server's miss rates is due to the fact that the cache space wasted on the front-end becomes a smaller fraction of the overall cache space of the server.

L2S also has better load balancing behavior than its counterparts. For all traces, the CPU idle times of the traditional server stay roughly constant as we increase the number of cluster nodes. The CPU idle times observed for the LARD server decrease up to 8 or 12 nodes, but then start to increase as the front-end becomes the performance bottleneck. In contrast, the L2S idle times always improve, approaching full utilization for 16 nodes.

Finally, another advantage of L2S over LARD is our smaller number of forwarded requests. Recall that LARD forwards 100% of the requests. The forwarding results show that for clusters of up to 4 nodes L2S forwards at least 15% fewer requests than the LARD server. For 16 nodes, L2S still forwards at least about 8% fewer requests than the LARD server (Clarknet and Rutgers), but this difference can be as significant as about 25% (NASA and Calgary).

Increasing the size of the local memories affects the servers in very different ways. We find that increasing the size of the memories improves the performance of the traditional server tremendously, as this increase directly impacts the hit rates achievable by that server. The increase in memory size affects the behavior of the other two servers much less significantly, because their miss rates are already low. However, note that increasing the cache space will never have a significant effect on the LARD server, as the 5000 requests/second mark is a hard barrier set by the processing power of the front-end node.

As a result, for some of our traces, the throughput of the traditional server becomes higher than that of the LARD server for larger memories (128 MBytes) and numbers of nodes (8 or more).

Summary of other simulation results. Besides these results, we find that the performance of L2S is only slightly affected by reasonable variations in broadcast frequency, messaging overhead, and network latency and bandwidth. For more details refer to [8].


6 Related Work

Throughout the paper we mentioned the most closely related pieces of work, except for a paper [4] that is (as of this writing) yet to be published. In that paper, the LARD designers describe and evaluate a more scalable version of the LARD server, where the request distribution algorithm is centralized at a "dispatcher" node, but client connections can be accepted by all the other cluster nodes. In their scheme, a client connection is assigned to a node by a simple load-balancing switch (although round-robin DNS could have been used instead of the switch); the chosen node then queries the dispatcher and hands off the connection to the node the dispatcher determines. This new version improves the scalability of their server, as the saturation points of the switch and of the dispatcher are reached at a higher throughput than that of the original LARD front-end. Nevertheless, several problems with their original strategy remain in the new system: (a) the switch and the dispatcher are (much less serious but) still potential bottlenecks and points of failure; (b) the cache space of the dispatcher is still wasted; and (c) all requests must incur the overhead of a two-way communication between the node that accepted the client connection and the dispatcher. L2S has none of these problems.


7 Conclusions


In this paper we have evaluated the advantages of locality-conscious cluster-based network servers via modeling and simulation. Our modeling results assessed the potential benefits of locality-conscious servers. The model provided the motivation for the locality-conscious server we proposed. The server has several important properties, among them high efficiency, good scalability, and the lack of a single point of failure. Based on our modeling and simulation results, our main conclusion is that our locality-conscious network server should be very useful in practice, especially as files become larger and more numerous and hosting services proliferate. We are currently implementing a native version of our server.


Acknowledgements


We would like to thank Mark Crovella, Brian Davison, Sandhya Dwarkadas, Liviu Iftode, Rich Martin, Wagner Meira Jr., and Robert Stets for comments that helped improve this paper significantly.


References

[1] D. Andresen, T. Yang, V. Holmedahl, and O. H. Ibarra. SWEB: Towards a Scalable WWW Server on Multicomputers. In Proceedings of the 10th IPPS, April 1996.

[2] M. Arlitt and C. Williamson. Web Server Workload Characterization: The Search for Invariants. In Proceedings of ACM SIGMETRICS, May 1996.

[3] M. Aron, P. Druschel, and W. Zwaenepoel. Efficient Support for P-HTTP in Cluster-Based Web Servers. In Proceedings of the USENIX'99 Technical Conference, June 1999.

[4] M. Aron, D. Sanders, P. Druschel, and W. Zwaenepoel. Scalable Content-Aware Request Distribution in Cluster-Based Network Servers. In Proceedings of the USENIX'2000 Technical Conference (to appear), June 2000.

[5] A. Bestavros, M. Crovella, J. Liu, and D. Martin. Distributed Packet Rewriting and its Application to Scalable Server Architectures. In Proceedings of the International Conference on Network Protocols, October 1998.

[6] P. Bozeman and B. Saphir. A Modular High Performance Implementation of the Virtual Interface Architecture. In Proceedings of the 2nd Extreme Linux Workshop, June 1999.

[7] L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker. Web Caching and Zipf-like Distributions: Evidence and Implications. In Proceedings of IEEE InfoCom'99, March 1999.

[8] E. V. Carrera and R. Bianchini. Analytical and Experimental Evaluation of Cluster-Based Network Servers. Technical Report 718, Department of Computer Science, University of Rochester, August 1999 (revised April 2000).

[9] Cisco. Cisco LocalDirector. http://www.cisco.com/, April 2000.

[10] Compaq Corp., Intel Corp., and Microsoft Corp. Virtual Interface Architecture Specification, Version 1.0, 1997.

[11] D. M. Dias, W. Kish, R. Mukherjee, and R. Tewari. A Scalable and Highly Available Web Server. In Proceedings of COMPCON'96, pages 85–92, 1996.

[12] A. Fox, S. Gribble, Y. Chawathe, E. Brewer, and P. Gauthier. Cluster-Based Scalable Network Services. In Proceedings of SOSP, 1997.

[13] V. Holmedahl, B. Smith, and T. Yang. Cooperative Caching of Dynamic Content on a Distributed Web Server. In Proceedings of the 7th HPDC, July 1998.

[14] IBM. IBM SecureWay Network Dispatcher. Technical Report, http://www.software.ibm.com/network/dispatcher, IBM Corporation, April 2000.

[15] E. D. Katz, M. Butler, and R. McGrath. A Scalable HTTP Server: The NCSA Prototype. Computer Networks and ISDN Systems, 27:155–164, 1994.

[16] V. Pai, M. Aron, G. Banga, M. Svendsen, P. Druschel, W. Zwaenepoel, and E. Nahum. Locality-Aware Request Distribution in Cluster-based Network Servers. In Proceedings of ASPLOS-8, October 1998.

[17] V. Pai, P. Druschel, and W. Zwaenepoel. Flash: An Efficient and Portable Web Server. In Proceedings of the USENIX'99 Technical Conference, June 1999.

[18] Red Hat, Inc. Piranha – Load-Balanced Web and FTP Clusters. http://www.redhat.com/support/wpapers/piranha/, 2000.

[19] Resonate. Resonate Central Dispatch. http://www.resonateinc.com/, June 1999.

[20] J. Song, E. Levy-Abegnoli, A. Iyengar, and D. Dias. Design Alternatives for Scalable Web Server Accelerators. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, April 2000.

[21] E. Speight, H. Abdel-Shafi, and J. Bennett. Realizing the Performance of the Virtual Interface Architecture. In Proceedings of ICS'99, June 1999.

[22] H. Zhu, B. Smith, and T. Yang. A Scheduling Optimization for Resource-Intensive Web Requests on Server Clusters. In Proceedings of the 11th SPAA, June 1999.

