Parallel Pull-Based LRU: a Request Distribution Algorithm for Clustered Web Caches using a DSM for Memory Mapped Networks
Emmanuel Cecchet
INRIA - SIRAC laboratory*
[email protected]
Abstract
The SIRAC laboratory has developed SciFS, a Distributed Shared Memory (DSM) that tries to benefit from the high performance and the remote addressing capabilities of the Scalable Coherent Interface (SCI) memory mapped network. We use SciFS for high performance cluster computing, but we now experiment with it to build large scale clustered web caches. We propose Whoops!, a clustered web cache prototype with a new request distribution algorithm, called PPBL (Parallel Pull-Based LRU), especially designed for use with a memory mapped network and a DSM system. Unlike other distribution algorithms, the decision is distributed over all nodes, thus providing better scalability. We evaluate a first PPBL implementation and discuss its scalability issues. Then, we propose a solution to build a scalable implementation of PPBL. We conclude with other improvements that can be made to build efficient large scale clustered web caches using a DSM over memory mapped networks.
1. Introduction
We have implemented a distributed shared memory system, called SciFS, which manages physical memory as in a NUMA architecture. The average memory access time is lowered by using several techniques: a relaxed memory consistency model and dynamic page migration and replication mechanisms, combined with the use of idle remote memory instead of disk swap. SciFS relies on a lower layer, SciOS, which we have also developed. The SciOS/SciFS architecture is detailed in [2]. The SciOS layer is based on Dolphin's PCI-SCI adapters [3] and is implemented as a Linux kernel loadable module. SciFS is implemented as a Linux distributed file system. The distributed shared memory paradigm allows us to write distributed programs as on a centralized system
* SIRAC is a joint laboratory of Institut National Polytechnique de Grenoble, Université Joseph Fourier and Institut National de Recherche en Informatique et en Automatique.
without dealing with low-level communications and data transfers. We use SciFS to build a clustered web cache prototype that uses the DSM to provide large and fast data storage. With such an architecture, data placement is eased and traditional request distribution algorithms like simple round-robin or the more elaborate Locality-Aware Request Distribution (LARD) [1] can be enhanced with new parallel strategies. Using a DSM like SciFS, we can exploit the memory of all nodes, including idle nodes. This way, we get the largest storage space that can be built with such a hardware architecture. If several nodes serve the same request at the same time, the DSM automatically replicates the data and thus increases the cache throughput. If one node is overloaded and cannot serve a request anymore, the data are migrated to the new node serving the request. We believe that the combination of all these techniques eases the building of an efficient clustered web cache. In the following sections, we first introduce clustered web cache concepts (section 2) and then present the architecture of the Whoops! prototype (section 3). In section 4, we describe PPBL, our new request distribution algorithm. We propose an implementation and discuss PPBL scalability issues in section 5. Finally, we describe related work (section 6) and conclude (section 7).
2. Clustered web caches
Web caches have become the standard method to ensure an acceptable quality of Web access. For a cache implemented on a single dedicated computer, the throughput is quickly limited by the network interface. SMP computers also face this limitation. In contrast, cluster-based Web caches allow the throughput to be increased by simply adding nodes (and therefore network interfaces) to the cluster. A common clustered web cache architecture consists of one frontend node that distributes the incoming client requests among several backends that serve them. The request is forwarded by the frontend according to a request distribution algorithm such as Weighted Round-Robin,
Locality Based or LARD [1, 4]. Each backend sends the reply directly to the client and, if needed, first fetches the document from the web server through the Internet. Each backend node owns a part of the whole cache, so the distribution algorithm has to take into account the load of the different nodes and data locality. Using a DSM, new request-distribution algorithms can be designed, since data locality is handled by the DSM. Unlike traditional approaches where each backend manages its own cache according to its resources, the DSM provides a large and fast storage space that can optimally use the whole memory of every node, including idle nodes. Backends can then transparently share and access data no matter where the data are physically located in the cluster. We think that new memory mapped networks offer the required hardware support for building high performance DSMs. We believe that efficient and scalable Web caches can be developed on an SCI-based cluster with the ease of programming provided by the shared memory paradigm. In [5], we simulated the clustered web cache behavior using a simple waiting-queue based model. Using various kinds of workloads following a Zipf-like distribution [6], both the bandwidth and the number of transactions per second increase linearly with the number of nodes as long as the frontend is not saturated. We have built a prototype to validate these simulation results. We want to show that this approach allows an easy adjustment of the cluster size to the clients' needs. Our simulation shows that Web cache services are not CPU intensive, so idle time is available on the cluster for other applications or more complex algorithms. We also try to design algorithms that leave CPU time for network processing or other services.
3. The Whoops! prototype
We have implemented Whoops!, a Web cache with TCP Handoff, On-the-fly cOmpression and Parallel Pull-based LRU for Sci clusters. Whoops! uses the DSM for all web cache management. Using a dedicated network for the DSM allows us to split the network traffic in two: the regular Ethernet network only carries client requests and replies, while the SCI network is used for request distribution and internal Web cache management and communication. We do not want to modify the browser on the client side, so when a client connects to the frontend, a backend cannot reply directly because the TCP connection is established with the frontend. We have to modify the TCP/IP stacks on both the frontend and the backend so that the backend replies directly to the client and the frontend forwards the TCP acknowledgements from the client to the backend, in order to maintain a coherent TCP connection. This well-known problem is called TCP handoff and a solution has already been proposed in [1]. Unlike usual implementations,
we map some parts of the TCP/IP stack in the DSM so that the frontend directly writes the TCP acknowledgements into the TCP/IP stack of the right backend through an SCI mapping. This way, Whoops! implements an efficient TCP handoff using only the SCI network, but this part is beyond the scope of this article. The frontend node distributes the incoming requests to the backends. Backends treat the requests and send the requested documents back to the clients. Whoops! uses a URL table in which we store the URL name, some information related to the HTTP header and a pointer to the compression algorithm used. As the SciFS DSM provides a file system interface, we store the HTTP contents in a separate SciFS file for each entry. Each entry also contains a small three-slot LRU cache, called the serving LRU, holding the identifiers of the last nodes that served this request. That way, if a node is overloaded, it can choose another node from this LRU to serve the request while preserving data locality as much as possible.
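As an illustration, a URL table entry could look like the following C structure; the exact field layout is not mandated by the design, so all sizes and names are only indicative.

/* Illustration of one URL table entry (sizes and names are only indicative). */
#define SERVING_LRU_SLOTS 3

struct url_entry {
    char     url[256];                        /* URL name */
    unsigned hash;                            /* hash code stored with the URL (see section 5.1) */
    char     http_header[512];                /* selected fields of the HTTP reply header */
    int      codec_id;                        /* compression algorithm used for this document */
    char     scifs_file[64];                  /* SciFS file holding the compressed content */
    int      serving_lru[SERVING_LRU_SLOTS];  /* ids of the last nodes that served this entry */
};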
Figure 1. Flow of a typical request
Figure 1 shows the flow of a typical request. The frontend node accepts a client connection (arrow 1). The HTTP request is forwarded through the DSM to a backend (2) according to a given request distribution algorithm. The backend parses the request and looks for the requested object in the cache. If it is a cache miss, the backend node opens a connection to the origin server (3). It sends a request and receives the HTTP reply. The reply header is parsed and selected elements are stored in the cache information structure. The received data are compressed on the fly and written to the DSM. The compression algorithm is chosen according to the content type, e.g., we do not use the same compression algorithm for text and image files. At the same time, the document is sent back to the client (4). If the request generates a cache hit, the backend node reads the data from the DSM, uncompresses it on the fly and sends it back to the client (4).
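For illustration, choosing a compression algorithm according to the content type could be sketched as follows; the codec identifiers are only indicative.

/* Illustration: choosing a compression algorithm according to the content type. */
#include <string.h>

enum codec { CODEC_NONE, CODEC_TEXT, CODEC_IMAGE };   /* indicative codec identifiers */

static enum codec pick_codec(const char *content_type)
{
    if (strncmp(content_type, "text/", 5) == 0)
        return CODEC_TEXT;     /* e.g. a dictionary-based compressor for HTML and text */
    if (strncmp(content_type, "image/", 6) == 0)
        return CODEC_IMAGE;    /* images are usually already compressed and may be stored as-is */
    return CODEC_NONE;
}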
4. Parallel Pull-Based LRU
We propose a new approach to distribute incoming requests to backend nodes. Existing decision algorithms are executed on the frontend node, which takes the decision and sends the request to the chosen backend node [1, 4, 12, 13]. To limit the bottleneck induced by this centralized solution, distribution algorithms have to stay simple to be fast, but this does not really scale. It is possible to parallelize the decision algorithm over all nodes. We believe that such algorithms can sustain higher request peaks, which is essential for large scale web caches. We propose to put all the incoming requests in the DSM so that they are treated in parallel by the backend nodes. Instead of having one frontend pushing the requests to the backends, our approach lets the backends pull the requests from the frontend. We distribute the URL table among the backend nodes. Each node manages its own part of the URL table using a local LRU. We call this request distribution algorithm Parallel Pull-Based LRU (PPBL).
Figure 2. PPBL principle
Figure 2 shows that the shared memory segment containing the incoming requests is fixed on the frontend node. Each backend node allocates an LRU linked list in its local memory and maps a part of the URL table in the DSM. Each mapped part is different and aligned on page boundaries to prevent any false sharing. Using this scheme, every access is local to each backend node. Each backend node picks up a not-yet-found URL request from the incoming queue and searches its URL list to check whether it owns it or not. The lookup is made in LRU order. If the matching entry is found, the backend signals that it has found the URL by updating the incoming request queue entry with its node identifier and the index of the URL in its table. If the backend accepts to serve the request, it sends the reply directly to the client. At this point, the frontend can simply discard the entry from the incoming queue. It is possible to use a signaling mechanism to interrupt the lookup on the other nodes as soon as one node has found the requested URL. If the entry is not found, the frontend chooses the least loaded node to add it to the URL table and fetch the document from the web. A "Status table" containing the load of each node is stored in the DSM and is periodically updated by each backend. The load of a backend is currently estimated by the amount of data it still has to serve: the least loaded backend is the one that has the lowest amount of data to serve. If a backend has found a request but is overloaded (according to a static threshold), it can choose a less loaded node from the entry's serving LRU. If the serving LRU is empty, it chooses the least loaded node of the cluster from the Status table. If the backend itself is the least loaded node of the cluster, it keeps the request; otherwise it forwards it directly to the selected node. The information related to the URL (SciFS file name, size, header, compression algorithm, …) remains on the same backend node and is accessed remotely by the serving node through the SCI network. The SciFS file containing the data is simply opened and read, letting SciFS automatically replicate the data. This case happens when a document is very hot and the node usually serving it is overloaded. The SciFS replication throughput is far higher than that of the FastEthernet network used to send the document, so it does not hurt cache throughput.
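For illustration, the selection of the serving node could be sketched as follows; the threshold value, table sizes and names are only indicative.

/* Illustration: choosing the node that will serve a request found in our URL table. */
#define NODES      8
#define LRU_SLOTS  3
#define LOAD_MAX   (4L * 1024 * 1024)   /* indicative static overload threshold (bytes left to serve) */

/* status[i] = amount of data node i still has to serve (the Status table);
 * serving_lru holds the node ids of the last servers of this entry, -1 for empty slots. */
static int pick_serving_node(int me, const long status[NODES], const int serving_lru[LRU_SLOTS])
{
    if (status[me] < LOAD_MAX)                  /* not overloaded: keep the request */
        return me;

    for (int s = 0; s < LRU_SLOTS; s++) {       /* prefer a less loaded recent server */
        int n = serving_lru[s];
        if (n >= 0 && status[n] < status[me])
            return n;
    }

    int best = me;                              /* otherwise, least loaded node in the Status table */
    for (int n = 0; n < NODES; n++)
        if (status[n] < status[best])
            best = n;
    return best;
}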
Figure 3. PPBL with multiple frontends
As shown in Figure 3, PPBL can also easily be adapted to a multi-frontend environment. One major problem with such an architecture is the request distribution over the different frontends. Round Robin DNS (RR-DNS) is often used, but it may result in poor load balancing. Another issue is that it is very hard to build a cooperative distribution algorithm on each frontend that equally distributes the work over the backends. PPBL offers a solution to this problem because the decision belongs to the backends and each frontend just has to put the incoming requests in the DSM. Request distribution, and thus indirectly load balancing, is handled by the backends, which are in a better position to take the decision. In the example of Figure 3, one frontend owns the first part of the incoming request queue and accesses the other part remotely; the other frontend, conversely, accesses its part of the queue locally and maps the other part remotely. Each time a new frontend is added to the system, the incoming request queue is expanded with the local part of the queue owned by this frontend. Backend nodes just pick the requests, as the DSM handles the remote accesses to the right frontend.
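For illustration, assuming equally sized local parts, the mapping from a global queue slot to the frontend that owns it could be sketched as:

/* Illustration: locating the owner of a slot in an incoming queue distributed over frontends. */
#define SLOTS_PER_FRONTEND 1024   /* indicative size of each frontend's local part */

struct slot_location {
    int  frontend;   /* frontend node owning the physical memory of this slot */
    long offset;     /* index inside that frontend's local part of the queue */
};

static struct slot_location locate_slot(long global_slot)
{
    struct slot_location loc;
    loc.frontend = (int)(global_slot / SLOTS_PER_FRONTEND);
    loc.offset   = global_slot % SLOTS_PER_FRONTEND;
    /* Whether the part is local or remote is transparent: the DSM/SCI mapping
     * makes every part addressable from every node. */
    return loc;
}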
5. PPBL implementation
We have implemented the PPBL algorithm in our web cache prototype. We begin with the first implementation and its evaluation. We then point out the scalability issues of this approach and propose a new implementation that solves them.
5.1. PPBL first implementation
Each client connection is handled by a dedicated thread on the frontend. The request is put in the incoming queue that resides in the DSM. The algorithm implements a producer/consumer model using the distributed locks provided by SciFS. Here is the algorithm executed by the frontend thread:
1. put(request);
2. for (x in [1..backend_nb]) unlock(start[request_nb]);
3. for (y in [1..backend_nb]) lock(end[request_nb]);
4. if (request.node == not_found) add(request, less_loaded_backend);
There is one start lock and one end lock per incoming queue entry. The request is first put in the queue (step 1). In step 2, the start lock is released to make the backends start the lookup. Next, the frontend thread waits on the end lock for lookup completion; this is a rendez-vous with all cluster nodes (step 3). If the request is not found, it is added by the least loaded backend chosen according to the Status table (step 4). The backend process is also multi-threaded to allow multiple lookups in parallel. Here is the algorithm executed by each backend thread:
1. lock(start[request_nb]);
2. lookup(request);
3. if (found) request.node = me;
4. unlock(end[request_nb]);
5. if (found) send_reply(request.client);
In step 1, the thread waits for a request to be ready (the consumer waits for the producer). Then we perform the lookup (step 2). Three optimizations are used to accelerate the lookup:
• The request is copied into a local buffer to avoid a remote read of the request on each loop iteration.
• We also use a hash code to speed up the lookup. Instead of comparing two URLs, we compare two integers (hash codes). Each URL stored in the URL list is saved with its hash code, so it is not necessary to compute it again on each lookup (a sketch is given after this list).
• As the backend code is multi-threaded, several lookups may occur in parallel. It is possible that another node has already found the request before we start; in this case, we can avoid the lookup.
If the request is found (step 3), we store our backend node number to inform all nodes that the request has been found and will be serviced by us (step 5). There is an enhancement not shown in the code sample: if the backend is overloaded, it can forward the request to another backend as explained in section 4. In any case, the end of the lookup is signaled by releasing the end lock (step 4).
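For illustration, the lookup with the hash-code optimization could be sketched as follows; the data layout and names are only indicative.

/* Illustration: lookup in LRU order, comparing hash codes before full URLs. */
#include <string.h>

struct url_node {
    struct url_node *next;      /* next entry in LRU order (most recently used first) */
    unsigned         hash;      /* hash code saved with the entry */
    char             url[256];
};

static struct url_node *lookup(struct url_node *lru_head,
                               unsigned req_hash, const char *req_url)
{
    for (struct url_node *n = lru_head; n != NULL; n = n->next) {
        if (n->hash != req_hash)             /* cheap integer comparison first */
            continue;
        if (strcmp(n->url, req_url) == 0)    /* full comparison only on a hash match */
            return n;
    }
    return NULL;                             /* not in this backend's part of the URL table */
}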
5.2. Evaluation and scalability issues
We have evaluated this implementation of the PPBL algorithm in terms of performance and scalability. We avoid all TCP/IP communications to estimate the maximum raw throughput of PPBL. A multi-threaded process on the frontend directly generates the requests and puts them in the incoming queue. The testbed used for these experiments is an 8-node cluster. Each node is a Pentium II 450 MHz with a FastEthernet network adapter and a Dolphin PCI-SCI D321 card running Linux 2.2.14 with SciOS v2.2 and SciFS v2.2 release 2 (available at http://sciserv.inrialpes.fr). The input workload is composed of 250000 requests on 100000 documents following a Zipf-like distribution [6] with α = 0.8 to mimic the DEC traces. The cache can contain 10000 URLs. Results presented in Figure 4 should be considered preliminary. In these experiments, each backend searches 10 requests in parallel. This better amortizes the synchronization costs, and the global throughput is therefore increased by more than 25% compared to the sequential version.
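For reference, a request stream with a Zipf-like popularity distribution similar to this workload could be generated as follows; this is a generic sketch, not our actual test driver.

/* Illustration: draw document ids from a Zipf-like distribution with parameter alpha. */
#include <stdlib.h>
#include <math.h>

#define DOCS   100000
#define ALPHA  0.8

static double cdf[DOCS];            /* cumulative popularity of documents 0..DOCS-1 */

static void zipf_init(void)
{
    double sum = 0.0;
    for (int i = 0; i < DOCS; i++)
        sum += 1.0 / pow(i + 1, ALPHA);
    double acc = 0.0;
    for (int i = 0; i < DOCS; i++) {
        acc += 1.0 / pow(i + 1, ALPHA) / sum;
        cdf[i] = acc;
    }
}

static int zipf_draw(void)          /* returns a document id in [0, DOCS) */
{
    double u = (double)rand() / RAND_MAX;
    int lo = 0, hi = DOCS - 1;
    while (lo < hi) {               /* binary search in the cumulative distribution */
        int mid = (lo + hi) / 2;
        if (cdf[mid] < u) lo = mid + 1; else hi = mid;
    }
    return lo;
}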
Figure 4. PPBL first implementation performance (throughput in req/s for 1 to 7 nodes)
We can notice that the centralized version (1 node) obtains nearly the same throughput as the 2-backend configuration because of the synchronization cost. The raw speedup is only 1.55 for 4 nodes and 1.9 for 7 nodes, but the one-node results are too optimistic since a cpu usage of 100% is not feasible in real working conditions. Throughput does not grow linearly with the number of nodes because synchronization becomes dominant over computation. We can achieve better speedups with larger cache sizes, but the raw throughput is then low. The cpu consumption is between 43% (for 7 nodes) and 55% (for 2 nodes) in saturated mode
on the backends, leaving sufficient time for other activities like network I/O. The frontend cpu usage is about 15%, which leaves a lot of cpu time for TCP handoff management. This first PPBL implementation has scalability problems and the speedup remains very modest. This is due to the heavy synchronization needed by this approach. Under heavy load, i.e., in saturated mode, the synchronization cost is too high. If we have n backend nodes, we take and release 2n locks (n times start and n times end) for each request. SciFS locks cost about 40 µs under contention, so each additional backend node adds 80 µs of synchronization per request. As the lookup becomes faster, the synchronization time is harder to amortize and the speedup levels off. We have to reduce the number of lock operations to make the lookup time dominant over the synchronization time.
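For illustration, assuming the lock operations are the only per-request overhead that grows with the cluster size, the synchronization time per request in this first implementation is roughly
\[
T_{sync}(n) = 2n \cdot t_{lock} \approx 2n \times 40\,\mu s = 80n\,\mu s ,
\]
i.e., about 560 µs with 7 backends, which quickly dominates the lookup time.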
5.3. A new scalable PPBL implementation
We propose a new implementation that reduces the synchronization to only one lock acquire and release per request in the saturated case. In this case, backends do not have to synchronize on the lookup start since there are always requests to treat. In the same way, only the last backend node that finishes the lookup has to signal the frontend for completion. As depicted in Figure 5, we add two new data structures in the DSM: an index table and a completion queue, both fixed in the frontend's memory. The completion queue has the same number of entries as the incoming queue. This queue is accessed through a special SCI mapping that performs an atomic fetch-and-increment on each access. When a backend has finished its lookup on request i, it accesses completion queue entry i, which performs the atomic fetch-and-increment. If the fetched value is equal to the number of backend nodes minus 1, this backend is the last one to end the lookup, and only this node releases the end lock. The cost of an atomic remote fetch-and-increment is about 4 µs, whereas the lock cost is more than 40 µs. We still have to solve the lookup starting problem when the cache is not saturated: we only want to block when there is no request ready for lookup, otherwise there is no need to synchronize. We use an index table in which each node, including the frontend, indicates the next request number to treat. If the index of a node is higher than the frontend index, that node must block by taking a lock and wait for a new request. When the frontend inserts a new request, it wakes up every blocked node by releasing the locks. A starvation may occur if a backend goes to sleep when all frontend threads have already finished waking up backends and have just started filling the incoming queue. If the incoming queue is full, the thread waiting for an empty slot also tries to wake up sleeping backends. If there is no incoming request, a thread with the lowest scheduling priority (scheduled when all other threads are asleep or idle) is used to periodically check for starvation and wake up the needed nodes. Note that this case only happens when the incoming request stream stops.
Figure 5. A scalable PPBL implementation (index table and completion queue fixed in the frontend's memory)
In the saturated case, the n start lock acquires and releases are replaced by a local index lookup (< 1 µs). In the worst case, i.e., when the web cache is idle, we have to unlock all backends and we do not gain anything compared to the first implementation. In both the saturated and the normal case, the n end lock acquires and releases are replaced by only one.
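For illustration, the backend-side completion and blocking logic could be sketched as follows; the SCI atomic mapping and the SciFS locks are abstracted behind indicative wrappers.

/* Illustration: completion detection and blocking for the scalable PPBL implementation. */
#define BACKEND_NB 7

/* Indicative wrappers: sci_fetch_and_inc stands for an access through the special SCI
 * mapping that atomically fetches and increments the remote word; scifs_lock/unlock
 * stand for the SciFS distributed locks. */
extern int  sci_fetch_and_inc(volatile int *completion_entry);
extern void scifs_lock(int lock_id);
extern void scifs_unlock(int lock_id);

/* Called by a backend once its lookup on request i is finished. */
static void lookup_done(volatile int *completion_queue, const int *end_lock, int i)
{
    int previous = sci_fetch_and_inc(&completion_queue[i]);
    if (previous == BACKEND_NB - 1)        /* last backend to finish the lookup */
        scifs_unlock(end_lock[i]);         /* the only end-lock release for this request */
}

/* Called by a backend before picking request number `next`: block only if the frontend
 * has not produced it yet. */
static void wait_for_request(volatile long *index_table, int me, int frontend,
                             long next, int wait_lock)
{
    index_table[me] = next;                /* publish our next request number */
    while (next > index_table[frontend])   /* no request ready for lookup */
        scifs_lock(wait_lock);             /* released by the frontend when it inserts a request */
}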
Figure 6. PPBL scalable implementation results (throughput in req/s for 1 to 7 nodes)
Figure 6 shows early results of the PPBL scalable implementation. We obtain a speedup of 5.8 with 7 nodes using the same problem and testbed as in section 5.2. The index table and the completion queue mechanisms reduce the synchronization cost by more than 86% under a heavy load of requests. These results show that it is possible to implement an efficient Web cache distribution algorithm using a DSM and a memory mapped network. However, to build an efficient and scalable application, the programmer needs a good knowledge of the mechanisms involved in the middleware and of their cost.
6. Related work
Several distribution strategies are described in [10]. Our approach can be viewed as a mix between routing-based dispatching and broadcast-based dispatching. The request forwarding from the router to the frontend is similar to routing-based forwarding, and the way the frontend makes the requests available to the backends can be viewed as a
passive broadcast. [1] introduces the locality-aware request distribution (LARD) algorithm, but all the decisions are centralized on the frontend. The frontend also has to support the TCP connection handoff. We believe that in large scale web clusters using TCP handoff, it is necessary to relieve the frontend of the decision algorithm to give it more time to handle the intensive network I/O. Our approach distributes this load over the backends, which usually have cpu cycles to spare. [4] proposes a more scalable solution that distributes the handoff over all backend nodes, but the decision remains in a centralized dispatcher. Whoops! implements a TCP handoff that uses SCI mappings to write the client TCP acknowledgements directly from the frontend to the right place in the backend's TCP/IP stack; the Ethernet network is thus relieved of this traffic. [8] suggests that very simple policies such as round-robin achieve good load balancing if the achievable per-request bandwidth is strongly limited by the client or the network, which is what we expect with our architecture. They also describe how cache affinities may be effectively estimated through hashed assignment of the URL space, but we propose a solution that maps cache affinity onto DSM locality, thus providing the best throughput in our case. [12] concludes that locality proved more important than load balancing for request response time. Replacement algorithms studied in [9, 13] show that LRU is not the optimal algorithm, and we consider testing other policies like Greedy Dual-Size and its optimized versions [11]. However, with sufficient memory, the differences in performance are not significant and all the distribution schemes perform equally well [12, 13].
7. Conclusion
We have designed a clustered web cache prototype called Whoops! on top of the SciFS distributed shared memory. Our approach uses the DSM for all web cache management and document storage, while the Ethernet network is just used to carry client requests and replies. Traditional request distribution algorithms do not scale well because the decision is centralized on one node. Therefore, we propose a new algorithm, called Parallel Pull-Based LRU (PPBL), especially designed for this kind of architecture. Unlike usual algorithms, the decision is no longer taken on the frontend and sent to a backend; instead, the backends take the incoming requests directly from the frontend. The request treatment can easily be parallelized, freeing the frontend from this heavy task. The cpu usage on the backends remains low, leaving time for network I/O or compression algorithms. We have proposed and evaluated a first PPBL implementation. We noticed scalability problems due to an overly heavy synchronization between the frontend and the backends in the saturated case. We propose a new scalable
implementation of PPBL that uses new data structures and the SCI atomic fetch-and-increment feature to reduce this synchronization cost. Early results confirm that PPBL is an efficient and scalable distribution algorithm for clustered web caches built over a DSM system and a memory mapped network. They also show that even if the programming model offers more facilities to the end user, it remains difficult to implement an efficient and scalable algorithm. The preliminary results of Whoops! are very encouraging. We are currently evaluating the TCP handoff that exploits the SCI memory mapping capabilities. We will then be able to evaluate the whole Whoops! prototype using the Polygraph benchmark [7] and compare it to commercial solutions.
8. References
[1] V. Pai, M. Aron, G. Banga, M. Svendsen, P. Druschel, W. Zwaenepoel and E. Nahum - Locality-Aware Request Distribution in Cluster-based Network Servers - Proceedings of the 8th Symposium on Architectural Support for Programming Languages and Operating Systems, pages 205-216, October 1998
[2] P. Koch, E. Cecchet and X. Rousset de Pina - Global Management of Coherent Shared Memory on an SCI Cluster - Proceedings of SCI Europe'98, Bordeaux, September 1998
[3] Dolphin Interconnect Solutions - PCI-SCI cluster adapter specification, version 1.2, May 1996. See also http://www.dolphinics.no.
[4] M. Aron, D. Sanders, P. Druschel and W. Zwaenepoel - Scalable Content-aware Request Distribution in Cluster-based Network Servers - Proceedings of the 2000 Annual Usenix Technical Conference, San Diego, June 2000
[5] C. Perrin and E. Cecchet - Web cache design for SCI clusters using a Distributed Shared Memory - 2nd Workshop on "Parallel Computing for Irregular Applications", January 2000
[6] L. Breslau, P. Cao, L. Fan, G. Phillips and S. Shenker - Web Caching and Zipf-like Distributions: Evidence and Implications - Proceedings of the IEEE Infocom Conference, 1999
[7] Web Polygraph - Proxy performance benchmark - http://www.ircache.net/Polygraph.
[8] R. Bunt, D. Eager, G. Oster and C. Williamson - Achieving load balance and effective caching in clustered web servers - Proceedings of the 4th International Web Caching Workshop, January 1999
[9] P. Cao and S. Irani - Cost-Aware WWW Proxy Caching Algorithms - Proceedings of the USENIX Symposium on Internet Technologies and Systems, pages 193-206, December 1997
[10] O. Damani, P. Chung, Y. Huang, C. Kintala and Y. Wang - ONE-IP: Techniques for hosting a service on a cluster of machines - Computer Networks and ISDN Systems, December 1997
[11] S. Jin and A. Bestavros - Popularity-Aware Greedy Dual-Size Web Proxy Caching Algorithms - Technical Report 1999-009, Boston University, June 1999
[12] A. Khaleel and A. Reddy - Evaluation of data request distribution policies in clustered servers - Proceedings of the 6th International Conference on High Performance Computing, pages 50-55, December 1999
[13] I. Tatarinov - Performance analysis of cache policies for Web servers - Proceedings of the 9th International Conference on Computing and Information, June 1998