Performance Tuning of Large-Scale Distributed WWW Caches

Siniša Srbljić, Andro Milanović, and Nikola Hadjina

Abstract — World Wide Web (WWW) caches, such as the Harvest cache, its successor Squid, and a cache proposed by Malpani, Lorch, and Berger from the University of California, Berkeley (referred to hereafter as the Berkeley cache), implement proxy-to-proxy communication. This communication mechanism unifies the communicating proxy caches and makes them perform as a single cache that is distributed over multiple proxy machines. In this paper, we investigate how the size of the distributed cache affects its performance. We compare the performance results for two different distributed caches: Squid and Berkeley. We show how to choose the number of proxy machines in a distributed cache in order to avoid network congestion and prevent overloading of the proxy machines.

Index Terms — Distributed WWW cache, distributed cache management protocol, performance tuning.

S. Srbljić and A. Milanović are with the School of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia. E-mail: {sinisa, andro}@zemris.fer.hr. N. Hadjina is with the Ministry of Defense, Republic of Croatia. E-mail: [email protected].
I. INTRODUCTION

World Wide Web (WWW) proxy caches were introduced to avoid the bottlenecks of the classical client/server Internet architecture. However, WWW proxy caches do not improve the performance of distributed systems as much as private caches improve the performance of multiprocessor systems. A low cache hit rate is one of the main reasons for the poor performance of WWW caches: it is well known that multiprocessor caches have hit rates over 0.998, while the hit rates of WWW proxy caches are not much over 0.30. Experience has shown that connecting a larger number of clients to a proxy machine can increase the cache hit rate. However, the number of clients per proxy machine is limited. Therefore, the Harvest cache [1] introduced the idea that proxy machines should communicate in order to increase the cache hit rate: before sending a request to the original server, the proxy machine should check whether a copy of the requested object exists at nearby proxy machines. This proxy-to-proxy communication distributes the WWW cache over multiple proxy machines that cooperate in order to return the requested object. Since a distributed cache allows both the number of proxy machines and the number of clients to grow, it can increase the cache hit rate.

There are three known solutions for a distributed cache management protocol (DCM protocol). The Harvest cache [1] and its successor, the Squid cache [2], are based on a hierarchy of communicating proxy machines. An alternative approach to the design of the distributed proxy cache is described in [3]; we will denote this protocol as the Berkeley protocol. While details of the DCM protocols can be found in [1]-[3], in this paper we describe only their basic algorithms. Since the main design issue is scalability, we present how to determine the number of proxy host machines in order to tune the performance of the distributed WWW cache. In Section 2 we describe, evaluate, and compare current DCM protocols. Section 3 introduces the graphs used to present the results of the performance tuning process. Section 4 discusses the results of the tuning process for different system parameters and different application parameters. Final comments on the results of the performance tuning are given in the last section.

II. PERFORMANCE EVALUATION AND COMPARISON OF DCM PROTOCOLS

A DCM protocol consists of five major algorithms: the admission algorithm, the search algorithm, the decision algorithm, the replication algorithm, and the termination algorithm. Since the cache is distributed across multiple proxy machines, the admission algorithm must decide to which cache the client should make the first request. We will denote the proxy machine that runs the chosen cache as the master proxy, to be consistent with the terminology in [2]. When the proxy machine receives a request and the local cache does not have a copy of the requested data, the DCM protocol must start its search algorithm in order to find a valid copy in one of the other cooperating proxy machine caches. We will call these cooperating proxy machines neighbors. After the DCM protocol finds out where valid copies of the requested object exist, the decision algorithm determines which neighbor proxy machine will return the copy. Finally, the replication algorithm determines whether the copy should be cached and replicated at the master proxy machine.
If there is no copy in the distributed cache, the request for the object is sent to the original server. The termination algorithm decides when to stop the search at the neighbor proxy machines and send the request to the original server.

In this paper, we present how three main algorithms affect the performance of the distributed cache: the search algorithm, the replication algorithm, and the termination algorithm. The Squid protocol and the Harvest protocol use the same algorithms; therefore, we present only the results for the Squid protocol. The search algorithm is based on broadcasting the request: the master cache sends a query message to each neighbor cache, one packet per neighbor. The neighbor caches respond either with a hit or a miss, depending on whether they have the requested object. On the first hit message, the master cache issues a get request to that neighbor cache. The replication algorithm stores a copy of the requested object at the master cache. The termination algorithm counts the number of miss responses: if all neighbors respond with a miss, or the preset timeout period expires, the request for the object is sent to the original server.
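To make the flow of these algorithms concrete, the sketch below walks one request through a Squid-style master cache: a local lookup, a query to every neighbor, first-hit selection, replication at the master, and termination after all misses. It is only an illustration of the algorithms as described above, not code from Squid or Harvest; the class and function names, the in-memory object store, and the omission of the admission algorithm and of real timeouts are simplifying assumptions.

```python
# Illustrative sketch of a Squid-style DCM request flow (hypothetical names,
# not the actual Squid/Harvest implementation).

from dataclasses import dataclass, field

@dataclass
class ProxyCache:
    name: str
    store: dict = field(default_factory=dict)   # url -> cached object

    def lookup(self, url):
        return self.store.get(url)

def origin_fetch(url):
    # Stand-in for contacting the original server.
    return f"object({url})"

def squid_style_request(master: ProxyCache, neighbors: list, url: str) -> str:
    # Local (master cache) hit.
    obj = master.lookup(url)
    if obj is not None:
        return obj

    # Search algorithm: one query per neighbor.
    # Decision algorithm: the first neighbor reporting a hit serves the object.
    misses = 0
    for neighbor in neighbors:
        obj = neighbor.lookup(url)
        if obj is not None:
            master.store[url] = obj          # replication algorithm
            return obj
        misses += 1                          # miss response

    # Termination algorithm: every neighbor answered with a miss
    # (a real implementation would also stop on the preset timeout).
    assert misses == len(neighbors)
    obj = origin_fetch(url)
    master.store[url] = obj                  # replicate the fetched object
    return obj

if __name__ == "__main__":
    master = ProxyCache("master")
    n1 = ProxyCache("n1", {"/a.html": "object(/a.html)"})
    n2 = ProxyCache("n2")
    print(squid_style_request(master, [n1, n2], "/a.html"))  # neighbor hit
    print(squid_style_request(master, [n1, n2], "/b.html"))  # all miss -> original server
```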
In order to reduce the network traffic, the Berkeley protocol uses IP multicast [4]. With IP multicast, the master cache sends only one UDP packet, rather than one packet per neighbor as in the broadcast-based scheme. To further reduce network traffic, the master cache does not replicate the requested object after a neighbor cache hit, and neighbor caches that do not have the requested object do not send miss responses. The termination algorithm is therefore based only on the preset timeout period. These features additionally reduce the network traffic, because the requested object is sent only once through the network (in contrast, the Squid protocol sends the requested object twice: once from the neighbor cache to the master cache, and once from the master cache to the client) and because there are no miss responses from neighbors that do not have the requested object.
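The traffic argument can be illustrated with a rough count of the bytes that one neighbor-hit request puts on the network under each protocol. The sketch below is a back-of-the-envelope comparison, not a measurement or the analytical model from [5]; the 40-byte query size and the assumption that exactly one neighbor holds the object are illustrative.

```python
# Rough per-request traffic (in bytes) for a request that hits in a neighbor cache.
# Squid-style: one query and one response per neighbor, object crosses the network twice.
# Berkeley-style: one multicast query, only the hit response, object crosses once.

def squid_traffic(num_neighbors: int, object_size: int, query_size: int = 40) -> int:
    queries = num_neighbors * query_size      # one query packet per neighbor
    responses = num_neighbors * query_size    # one hit + (N-1) miss responses
    object_bytes = 2 * object_size            # neighbor -> master, master -> client
    return queries + responses + object_bytes

def berkeley_traffic(num_neighbors: int, object_size: int, query_size: int = 40) -> int:
    queries = query_size                      # a single multicast query packet
    responses = query_size                    # only the neighbor with the object answers
    object_bytes = object_size                # object crosses the network once
    return queries + responses + object_bytes

if __name__ == "__main__":
    for n in (8, 32, 128):
        print(n, squid_traffic(n, 10_000), berkeley_traffic(n, 10_000))
```

The Squid-style total grows linearly with the number of neighbors, while the Berkeley-style total is essentially independent of it, which is the intuition behind the wider non-saturation interval reported below.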
Figure 1: Performance comparison of the Squid and Berkeley caches (average latency per request in seconds vs. number of proxies N, with one curve for each protocol).
As a performance measure, we chose the average latency per request. Figure 1 presents the average latency
per request for two different DCM protocols¹: Berkeley and Squid. The results are generated by an analytical model developed in [5] and compared to the results of a large-scale network simulator developed in [6]. The distributed cache saturates both when the number of proxy machines is too small and when it is too large. With too few proxy machines, the proxies are overloaded and cannot handle all the input requests. Increasing the number of proxy machines, however, also increases the amount of proxy-to-proxy communication, which eventually causes network congestion. As we can see from Figure 1, the Berkeley protocol has a much wider interval over which it does not saturate (between 6 and 265 proxy hosts) than the Squid protocol (between 10 and 32 proxy hosts). Since the Berkeley protocol is based on IP multicast and does not use miss responses or object replication, it generates lower traffic and a lower proxy load, which widens the non-saturation interval. Although the Squid protocol generates higher traffic and a higher proxy load, the miss responses and object replication significantly reduce the latency: the miss responses help stop the search even before the preset timeout period expires, and replication increases the master cache hit rate. Within either non-saturation interval, the latency does not change significantly as long as the DCM protocol does not saturate. Therefore, it is important to determine the non-saturation intervals for different values of system and application parameters in order to tune the performance of the distributed cache.

¹ Figure 1 presents the performance results for the following distributed cache configuration: the total number of proxy hosts N in the distributed cache ranges from 2 to 300, the average cross-sectional bandwidth of the network is 1024 Mb/s, the average bandwidth of a proxy host is 100 Mb/s, the total number of client hosts is 9100, the average number of requests per client per second is 0.53, the average bandwidth of a client is 10 Mb/s, the average size of an object is 10 kB, the total number of server hosts per distributed cache is 3, the average bandwidth of a server host is 100 Mb/s, the number of subnets that connect the proxy hosts, D, increases with the number of proxy machines as D = max(1, 0.2 × N), the timeout period used by the termination algorithm is 40 milliseconds, the IP packets are of three typical sizes (40 bytes, 180 bytes, and 1024 bytes), the master cache hit rate is 0.25, and the neighbor cache hit rate is 0.30. We also use a number of other DCM protocol and TCP/IP parameters [5] that are not important for the discussion presented in this paper.

III. GRAPHICAL PRESENTATION OF PERFORMANCE TUNING PROCESS

Figure 2 gives an example of the graphs used to present the results of the tuning process. For a particular DCM protocol, the graph shows the non-saturation intervals for different values of a system or an application parameter. The non-saturation intervals for different values of the system or application parameter define the non-saturation area. Hereafter, a non-saturation interval is presented as a triple: (system or application parameter value, lower non-saturation interval bound, upper non-saturation interval bound). To the left of the non-saturation area is the overloaded proxies saturation area: for a given number of proxy host machines and a given value of the system or application parameter, the proxy machines are overloaded, which causes the DCM protocol to saturate. To the right of the non-saturation area is the congested network saturation area: for a given number of proxy host machines and a given value of the system or application parameter, the communication network is congested, which also causes the DCM protocol to saturate. The next section presents the results of the tuning process for two system parameters (network bandwidth and proxy host bandwidth) and two application parameters (number of clients and average object size). The results are given for both the Berkeley protocol and the Squid protocol.
Figure 2: Saturation and non-saturation areas (system or application parameter value vs. number of proxies; the overloaded proxies saturation area lies to the left of the non-saturation area and the congested network saturation area to its right).
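As an illustration of how the two saturation areas arise, the following sketch checks, for a given number of proxies, whether the per-proxy load or the cross-sectional network load exceeds its capacity. This is a deliberately crude toy model, not the analytical model of [5]; the load formulas, and the reuse of the Figure 1 configuration values as defaults, are simplifying assumptions, so the bounds it produces are only qualitative.

```python
# Toy saturation check for a Squid-like distributed cache (illustrative only).

def saturated(num_proxies: int,
              clients: int = 9100,
              req_per_client: float = 0.53,       # requests per client per second
              object_bits: float = 10_000 * 8,    # 10 kB objects
              query_bits: float = 40 * 8,         # one small query/response packet
              proxy_bps: float = 100e6,           # proxy host bandwidth
              network_bps: float = 1024e6) -> str: # cross-sectional network bandwidth
    total_req = clients * req_per_client

    # Per-proxy load: this proxy's share of object traffic (object in and out of
    # the master) plus the query/response packets it exchanges with the others.
    per_proxy_bits = ((total_req / num_proxies) * 2 * object_bits
                      + total_req * 2 * query_bits)
    if per_proxy_bits > proxy_bps:
        return "overloaded proxies"

    # Cross-sectional network load: object transfers plus proxy-to-proxy
    # query/response traffic that grows with the number of proxies.
    network_bits = total_req * (2 * object_bits + 2 * (num_proxies - 1) * query_bits)
    if network_bits > network_bps:
        return "congested network"
    return "non-saturated"

if __name__ == "__main__":
    for n in (4, 10, 32, 100, 300):
        print(n, saturated(n))
```

Even this crude model reproduces the shape of Figure 2: too few proxies land in the overloaded proxies area, too many in the congested network area, with a non-saturation interval in between.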
IV. PERFORMANCE TUNING FOR DIFFERENT SYSTEM AND APPLICATION PARAMETER VALUES
Figures 3a-d present the non-saturation areas for the four system and application parameters: decreasing cross-sectional network bandwidth, decreasing proxy host bandwidth, increasing number of clients, and increasing average object size.

As we can see from Figure 3a, a decrease in the cross-sectional network bandwidth does not affect the lower bound of the non-saturation area, while the upper bound decreases. For the Berkeley protocol, the typical values for the given distributed cache configuration are: (1024 Mb/s, 6, 265), (800 Mb/s, 6, 125), (625 Mb/s, 6, 16), and (611 Mb/s, -, -), where the first value is the cross-sectional network bandwidth and the other two values are the lower and upper bounds on the number of proxy host machines, respectively. Once the cross-sectional network bandwidth drops to 611 Mb/s, it is no longer possible to choose a number of proxy hosts for which the Berkeley protocol avoids saturation. The typical values for the Squid protocol are: (1024 Mb/s, 10, 32), (900 Mb/s, 10, 20), (850 Mb/s, 10, 15), and (825 Mb/s, -, -).

As we expect, if the proxy host bandwidth decreases (see Figure 3b), the upper bound does not change, while the lower bound increases. The typical values for the Berkeley protocol are: (100 Mb/s, 6, 265), (40 Mb/s, 17, 265), (10 Mb/s, 157, 265), and (9 Mb/s, -, -), where the first value is the proxy host bandwidth and the other two values are the non-saturation interval bounds. The typical values for the Squid protocol are: (100 Mb/s, 10, 32), (60 Mb/s, 20, 32), (50 Mb/s, 27, 32), and (47 Mb/s, -, -).

An increasing number of clients changes both the lower and the upper bound of the non-saturation interval: the lower bound increases and the upper bound decreases. The typical values in Figure 3c are: (9100 clients, 6, 265), (12100 clients, 9, 107), (13100 clients, 10, 70), and (14840 clients, 11, 11) for the Berkeley protocol, and (9100 clients, 10, 32), (9700 clients, 11, 26), (11100 clients, 13, 15), and (11124 clients, 14, 14) for the Squid protocol. The first value of each triple denotes the total number of clients per distributed cache.

Figure 3d shows that both the lower bound and the upper bound change if the average size of the object increases. The typical values are: (10 kB, 6, 265), (13 kB, 8, 159), (16 kB, 10, 53), and (16.61 kB, 11, 11) for the Berkeley protocol, and (10 kB, 10, 32), (12 kB, 12, 19), (12.6 kB, 12, 15), and (12.77 kB, 13, 13) for the Squid protocol. The first value of each triple denotes the average size of the object.
Figure 3: Non-saturation areas for different system/application parameters. Each panel plots the parameter value against the number of proxies, with areas for both the Berkeley and Squid protocols: (a) network bandwidth (Mb/s), (b) proxy bandwidth (Mb/s), (c) number of clients, (d) object size (kB).
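In practice, the reported triples can be combined directly when choosing the number of proxy hosts: a configuration that must tolerate a range of parameter values should use a number of proxies that lies inside every corresponding non-saturation interval. The helper below sketches this intersection; the function name is illustrative, and the sample data are the Berkeley-protocol triples for decreasing network bandwidth quoted above.

```python
# Intersect non-saturation intervals, each given as
# (parameter value, lower bound, upper bound) on the number of proxy hosts.

def usable_proxy_range(triples):
    lower = max(lo for _, lo, hi in triples)
    upper = min(hi for _, lo, hi in triples)
    return (lower, upper) if lower <= upper else None   # None: no safe choice exists

if __name__ == "__main__":
    berkeley_net_bw = [(1024, 6, 265), (800, 6, 125), (625, 6, 16)]
    print(usable_proxy_range(berkeley_net_bw))   # -> (6, 16)
```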
V. CONCLUSION

In order to tune the performance of a large-scale distributed WWW cache, we should determine the number of proxy host machines for which the distributed cache does not saturate. The protocol that manages the distributed cache has a major effect on the performance tuning process. The Berkeley DCM protocol does not saturate over a wide range of numbers of proxy machines, which simplifies the performance tuning process. On the other hand, the Squid DCM protocol does not saturate over a narrower range of numbers of proxy machines, but it incurs lower latency. Therefore, the performance of a WWW cache may be improved if the Squid DCM protocol is used, but the number of proxy host machines must then be determined precisely.

In addition, we show how four major parameters affect the performance tuning process: cross-sectional network bandwidth, proxy host bandwidth, number of clients, and average object size. If the cross-sectional network bandwidth decreases and all other parameters remain the same, the number of proxy machines should be decreased in order to reduce proxy-to-proxy communication. If the proxy host bandwidth decreases, the number of proxy machines should be increased in order to reduce the proxy load. An increase in the number of clients or in the average object size affects both the upper and the lower bound on the number of proxy host machines: the lower bound increases and the upper bound decreases. Although we assume that the cross-sectional network bandwidth is high (1024 Mb/s) and that the proxy host bandwidth is moderate (100 Mb/s), the increase in the number of clients or the average object size affects the upper bound more than it affects the lower bound. Therefore, the performance of the distributed cache is much more sensitive to the proxy-to-proxy communication than to the proxy load.
VI. ACKNOWLEDGMENTS

The authors wish to thank A. Grbic, M. Blazevic, D. Ivosevic, I. Benc, I. Crkvenac, R. Radovic, I. Skuliber, and M. Stefanec for providing valuable comments on the research presented in this paper.

VII. REFERENCES

[1] "Harvest: A Scalable, Customizable Discovery and Access System", Technical Report CU-CS-732-94, Department of Computer Science, University of Colorado - Boulder, Revised March 1995.

[2] D. Wessels: "The Squid Internet Object Cache", National Laboratory for Applied Network Research, http://squid.nlanr.net/Squid/.

[3] R. Malpani, J. Lorch, and D. Berger: "Making World Wide Web Caching Servers Cooperate", World Wide Web Journal, Vol. I, Issue 1, Winter 1996.

[4] S. Deering: "RFC 1054: Host Extensions for IP Multicasting", May 1988.

[5] S. Srbljić, P. P. Duta, and D. F. Vrsalović: "AT&T GeoPlex Directory-Based Distributed Cache Management Protocol: Boosting the Scalability of World Wide Web Proxy Caches", paper submitted for publication.

[6] M. Blazević: "Distributed Cache Management Protocols", Diploma Thesis, University of Zagreb, School of Electrical Engineering and Computing, Zagreb, July 1999. (Thesis published in Croatian, original title: "Protokoli upravljanja distribuiranim priručnim memorijama".)