Document Replacement Policies dedicated to Web Caching

A. Belloum and L.O. Hertzberger
Computer Architecture and Parallel Systems Group, Dept. of Computer Science, Universiteit van Amsterdam, Kruislaan 403, NL 1098 SJ Amsterdam
Email: {adam,[email protected]

ABSTRACT

Web caching has been considered a powerful solution to deal with the growth of web traffic. Several studies have shown that caching documents throughout the Internet can save network bandwidth and reduce document access latency [9, 8, 10]. However, this technique has introduced new problems, such as maintaining document coherency and selecting the next document to be removed. With the continuous increase in demand for documents, web cache servers are becoming the new bottleneck. A need for better resource management is becoming urgent in order to reduce the overhead sustained by web cache servers. In this paper, a number of web replacement policies are discussed and compared on the basis of trace-driven simulations. The impact of the web cache server configuration is pointed out through a set of experiments that use the cache size as a tuning parameter.

KEYWORDS: Web caching, simulation, modeling

1. INTRODUCTION

Web caching is widely used nowadays; almost all web browsers allow end-users to choose their own web cache or proxy server. At each request, the web browser first checks whether the configured proxy server has a valid copy of the requested document. In turn, the proxy server forwards the user request either to another proxy server or directly to the original remote server. A three-level proxy server hierarchy has been investigated in [4]; the latency introduced by such a topology is relatively low. When a proxy server receives the requested document, it sends a copy to the end-user and keeps another copy in its local storage. Upon the next request for this document, the proxy server returns the copy to the end-user without contacting the remote server.

Due to the limited storage space at the proxy servers, it is often necessary to remove previously cached documents before storing newly requested documents. In this paper, the focus will be on the replacement policies used in web caching. Document replacement strategies have a great impact on web cache performance [13, 2]. The replacement process is tightly related to the cache size: if an infinite cache existed, there would be no need for a document removal process, since all the documents that enter the cache could be kept forever. In real caches, documents have to be replaced because of a lack of memory space, and the choice of the next document to be removed is quite important. The removal process is started once the web cache is full or the remaining amount of memory is not enough to store the new incoming document. Usually, this strategy is referred to as the "on-demand replacement strategy" to distinguish it from the so-called "periodic replacement strategy", where documents are removed after a certain amount of time [12]. The argument for using periodic removal is that if a cache is nearly full, then running the removal policy only on demand (when no more space is available) will invoke the removal policy on nearly all document requests. If removal is time consuming, it might generate a significant overhead. On the other hand, periodic removal reduces the hit-ratio because documents are removed earlier than required.

The rest of the report is organized as follows: In Section 2, we discuss some of the document replacement strategies proposed in the literature for web caching. In Section 3, we show the importance of different document replacement parameters using Principal-Component-Analysis. In Section 4, we present a new replacement strategy derived from the Nearest Neighbor Classifier (NNC), which is commonly used in pattern recognition and classification problems. In Section 5, a set of experiments is presented in order to show the impact of the NNC strategy on the web cache performance. Section 6 concludes this paper.
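As a concrete illustration of the on-demand trigger discussed above, the sketch below (illustrative Python, not taken from the paper; all names are assumptions) runs the removal policy only when the incoming document does not fit.

```python
# Minimal sketch of on-demand replacement: documents are evicted only when the
# incoming document does not fit.  removal_key stands for whatever policy is in
# use (LRU age, LFU count, a weighted function, ...); lower key = evicted first.
def admit(cache: dict, capacity: int, doc_id: str, size: int, removal_key) -> None:
    while cache and sum(cache.values()) + size > capacity:
        victim = min(cache, key=removal_key)   # head of the removal ordering
        del cache[victim]
    if sum(cache.values()) + size <= capacity:
        cache[doc_id] = size                   # store the new document
```

A periodic variant would instead run the same eviction loop from a timer, regardless of whether a new document has just arrived.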

2. REPLACEMENT STRATEGIES

The document replacement strategy can be split into two phases. First, the documents are sorted in order to determine the removal ordering. Second, one or more documents are removed from the head of the removal list. The algorithms currently used in practice to replace documents have not been designed specifically for the web but, instead, have been taken straight from earlier work on processor caches.

Most of these methods are not well adapted to the web environment because they focus on only one or a few of the parameters that could impact web cache performance and ignore the rest. Several strategies have been proposed, such as least frequently used (LFU), first in first out (FIFO) and many others. Only a few of them, however, have been implemented and used in real cache servers. Most of the time, documents are sorted in the cache according to the least recently used (LRU) strategy. The efficiency of the LRU strategy has been widely discussed in the literature [13, 1]. The fact that LRU considers only time when selecting documents has been the main reason for criticism [7]. Many web traffic studies show that time is not the only factor that impacts web cache performance. Other factors such as the number of references to the document, the document size, the document time-to-live and the document retrieve time were identified as potential keys for the document selection process. What is not yet clear is how these parameters impact the web cache performance. It becomes more and more clear that an appropriate key for selecting documents should consider several factors. However, the appropriate way to combine these factors has not been defined yet. In some methods, factors are combined two by two, such as LRU-MIN, a variant of LRU that considers the document size [12], or LFU-Aging, which combines the number of document references and the time the documents entered the cache [1]. These methods show a slight improvement of the cache performance over the LRU strategy.

New methods have been investigated by different research groups to deal with the document replacement problem. The most interesting results were published by Marc Abrams et al. from Virginia Tech [13] and Bolot and Hoschka from INRIA [7]. In their methods, each document is assigned a priority according to which the removal process selects the next document to be replaced. The priority is computed according to a certain mathematical model [7, 13]. In each model, the different factors are assigned a weight, which is calculated using weight optimization methods [7] or simulation-driven techniques [13]. The weighted function proposed by Bolot-Hoschka combines several factors related to each document according to the following mathematical model:

$$ w(t_i, s_i, rtt_i, ttl_i) = \frac{w_1 \cdot rtt_i + w_2 \cdot s_i}{ttl_i} + w_3 + w_4 \cdot t_i \cdot s_i \qquad (1) $$

Where
t_i: the time since document i was last referenced.
s_i: the document size.
rtt_i: the time it takes to retrieve the document.
ttl_i: the time to live of the document.
w_1, ..., w_4: weights fixed using the Lagrange multiplier technique.
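Purely as an illustration (the grouping of terms in equation (1) is reconstructed from the original layout, and the weight values below are placeholders, not the ones obtained by Bolot-Hoschka), a priority of this kind could be computed as follows.

```python
# Illustrative computation of a Bolot-Hoschka-style weighted priority.
# The weights w1..w4 are placeholders; in the original work they are fixed
# with a Lagrange multiplier technique.
from dataclasses import dataclass

@dataclass
class DocStats:
    last_ref_age: float   # t_i: time since the document was last referenced
    size: float           # s_i
    retrieve_time: float  # rtt_i
    ttl: float            # ttl_i: time to live

def weighted_priority(d: DocStats, w1=1.0, w2=1.0, w3=1.0, w4=1.0) -> float:
    # Equation (1): a retrieval-cost term plus a locality-of-reference term.
    return (w1 * d.retrieve_time + w2 * d.size) / d.ttl + w3 + w4 * d.last_ref_age * d.size
```

The removal list of the two-phase scheme described at the start of this section would then be sorted on this priority before documents are taken from its head.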

In this model, time plays an important role. The two terms on the right-hand side capture respectively the cost to retrieve a document and the document's locality of reference. The second model, which was proposed by Wooster-Abrams (the HYB algorithm), focuses on the number of references to a document rather than the time. Besides this, the estimated quantities are kept on a per-server and not on a per-document basis. HYB replaces the document with the lowest value of the following expression:

$$ w(n_{ref}, CBW, clat) = \left( clat_{ser(i)} + \frac{W_B}{CBW_{ser(i)}} \right) \cdot \frac{n_{ref}^{W_N}}{s_i} \qquad (2) $$

Where
ser(i): the server on which document i resides.
clat_j: the latency to open a connection to server j.
CBW_j: the bandwidth of the connection to server j.
s_i: the size of the cached document.
n_ref: the number of references to document i.
W_B and W_N: constants that set the relative importance of the variables.
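A sketch of the HYB selection rule follows (again illustrative Python; the constants W_B and W_N are arbitrary placeholders that would have to be tuned): the per-server statistics clat and CBW are looked up for the server a document came from, and the document with the lowest value is replaced.

```python
# Illustrative HYB computation following equation (2).  Server statistics are
# kept per server; w_b and w_n below are arbitrary placeholders.
def hyb_value(n_ref: int, size: float, clat: float, cbw: float,
              w_b: float = 1.0, w_n: float = 1.0) -> float:
    # (connection latency + W_B / bandwidth) * n_ref^W_N / document size
    return (clat + w_b / cbw) * (n_ref ** w_n) / size

def hyb_victim(docs: dict) -> str:
    """docs: doc_id -> (n_ref, size, clat_of_server, cbw_of_server)."""
    return min(docs, key=lambda d: hyb_value(*docs[d]))  # lowest value is replaced
```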

3. WORKLOAD ANALYSIS

When combining several parameters to remove documents from the cache, the first question that comes to mind is which parameter is the most important. The mathematical models proposed in the weighted methods described in the previous section were not clearly justified. It is clear that these models are a new formulation of classical replacement strategies such as LRU or LFU. New parameters have been added either to the time (Bolot-Hoschka) or to the number of references (HYB strategy). These parameters express the cost to retrieve documents, such as the retrieve time (rtt) or the estimated bandwidth of the connection to the server (CBW). In this section, we will try to underline the most relevant parameters for describing the behavior of the workloads used for web cache simulations.

3.1. Workload traces

The workloads used in this study are part of access log-files provided by the web server of the Computer Science Department of the University of Amsterdam, WINS (wins.uva.nl), and the proxy server NLANR (ircache.nlanr.net). These traces, of which some characteristics are shown in Table 1, present two types of workloads. The WINS workload contains external requests to documents provided by the wins.uva.nl server. This workload exhibits a strong document reference locality. The NLANR workload involves requests coming from different web servers that use NLANR as a proxy server. Usually, this kind of server is busier than a normal cache server. These access log-files are quite large, and it is very hard to use more than a few days of workload duration.

Workload   Duration   Transferred data   Number of requests
WINS       1 month    4.2 GB             737750
NLANR      1 day      2.5 GB             261135

Table 1: Workload characteristics

3.2. The Principal-Component-Analysis

A commonly used method to perform classification on workloads is the Principal-Component-Analysis (PCA) [11]. This method classifies the workload components by the weighted sum of their values. Given a set of parameters $\{x_1, x_2, \dots, x_n\}$, the PCA produces a set of factors $\{y_1, y_2, \dots, y_n\}$ such that the following holds: (i) $y_i = \sum_{j=1}^{n} a_{ij} x_j$; (ii) the y's are an ordered set such that $y_1$ explains the highest percentage of the variance of the workload characterization and $y_n$ the lowest. To apply the PCA to workloads used in web caching, we have considered a set of parameters, namely: the document addresses, the number of references to each document, the document retrieve time and the document size. The numerical values associated with each parameter are pre-processed in order to bring the data into the same range of magnitude. The scatter plots of the data, presenting the variation along the principal factors associated with each parameter, show that all the factors hold the same amount of information regarding the variance of the used workloads. There is no clear distribution of the workload components along one of the principal factors (Figure 1). We therefore expect that a method that assumes equivalent importance of all the workload components will lead to better cache performance.
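The analysis described above can be reproduced with a few lines of numerical code; the sketch below (our illustration, with assumed feature names, not the authors' code) standardizes the parameters and inspects the eigenvalues of their correlation matrix, which is the quantity tracked in Figure 2.

```python
# Minimal PCA sketch: standardize the workload parameters, then look at the
# eigenvalues of their correlation matrix.  Comparable eigenvalues suggest that
# no single parameter dominates the workload variance.  Column names are
# illustrative assumptions.
import numpy as np

def pca_eigenvalues(features: np.ndarray) -> np.ndarray:
    """features: one row per document, one column per parameter
    (e.g. address, number of references, retrieve time, size)."""
    standardized = (features - features.mean(axis=0)) / features.std(axis=0)
    corr = np.corrcoef(standardized, rowvar=False)   # correlation matrix
    return np.sort(np.linalg.eigvalsh(corr))[::-1]   # eigenvalues, largest first
```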

[Figure 1 consists of scatter plots of the workload observations along pairs of principal factors; the axes correspond to the document address, the document size, the number of references, and the last reference time.]

Figure 1: Workload observations plotted along two principal-factors (WINS workload)

We have applied the PCA to different workloads and have noticed the same distribution along the different principal factors. Another way to evaluate the importance of the workload components is to follow the evolution of the eigenvalues related to the principal factors. Our experiment shows that the eigenvalues are of the same order of magnitude, which implies that all the considered factors are relevant for the workload characterization process (Figure 2). However, we can see a small advantage for "the number of references". It also appears that the "last reference time" contains less information on the workload than the other parameters. This confirms the assumption made by Wooster-Abrams that the number of accesses to a document has more impact on the cache performance than its reference time [13].

[Figure 2 plots, for the WINS workload, the eigenvalues associated with the parameters size, last reference time, and number of references against the number of requests (up to 6 x 10^4).]

Figure 2: The evolution of the eigenvalues associated with the principal factors (WINS workload)

4. NNC REPLACEMENT STRATEGY

Decision making has been extensively studied in the area of pattern recognition and classification. One of the most commonly used methods to solve such a problem is the Nearest Neighbor Classifier (NNC). Basically, the NNC method relies on the strong relationship between similarity and distance [3]. Given a set of observations of n parameters, we can define an n-dimensional space in which similar observations are close to each other. It is possible in such a situation to assume an inverse relationship between these two metrics: in other terms, the smaller the distance between two observations, the more similar they are. It is then possible to divide this set into several classes containing similar observations. To apply the NNC method, we usually start by selecting one or more descriptors for each class; the observations are then assigned to the different classes according to their distance to the different class descriptors. Figure 3 shows a very simple example, where n = 3 and the number of classes is equal to 2.

To apply the NNC to web caching, we should first describe the web caching problem in terms of classes and distances. In the web caching context, we define only two classes: the first class contains the documents that should be removed and the second class the documents that should be kept in the cache. The class descriptor for the first class is "the worst cacheable document" (WCD) and "the best cacheable document" (BCD) for the second class.

[Figure 3 sketches the classifier: observations described by three parameters are assigned to class A or class B according to their distance to the descriptors of the two classes.]

Figure 3: Nearest neighbor classifier

The characteristics we consider to identify the WCD and the BCD should be specified beforehand. By doing so, we do not have to choose one of the observations to be the class descriptor, as is commonly done in the NNC technique; we can build this element according to our wish. The efficiency of the method relies on the choice of the distance metric used to measure the similarity of the documents, which is not obvious. The document replacement methods discussed in the literature can be mapped on the NNC strategy, since they implicitly define the WCD and remove first the documents that are similar to it. For instance, in the FIFO strategy the oldest document is the WCD, while in the LRU strategy the least recently requested document is the WCD. There are many other examples of document replacement strategies that could be mapped on the NNC; depending on the number of parameters considered in the removal process, the WCD is defined in a mono- or multi-dimensional space. The NNC combines the several parameters representing a document using a mathematical model based on distance calculations. Our choice of this distance was influenced by the PCA results, which show that all the workload components considered in the simulations contain approximately the same amount of information on the workload. It appears then that the Euclidean distance is a good choice, since it assumes equivalent importance for all the factors. The WCD can be defined as the oldest document, the least frequently referenced document, the one which has the lowest retrieve time, the lowest time to live, and so on. It seems that some of the components that define the WCD are easy to identify and to fix. This is not the case for all the parameters that have been shown to have a direct impact on the document replacement strategy. Let us consider, for instance, the size of the documents. It is not yet clear which document size is worth caching. It is obvious that caching small documents increases the document hit-ratio in the cache but also increases the volume of data transfer, while caching large documents results in a decrease of both metrics. For such a parameter, the bad value is not easy to determine; it should be assigned according to the goal we aim for.
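To make the mechanism concrete, the sketch below (illustrative code, not the authors' implementation) evicts the cached document whose feature vector lies closest, in Euclidean distance, to the WCD descriptor.

```python
# Illustrative NNC-based victim selection: the document most similar to the
# "worst cacheable document" (WCD) descriptor is removed first.  Feature
# vectors are assumed to be normalized to a comparable range beforehand.
import math

def euclidean(a, b) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nnc_victim(docs: dict, wcd: tuple) -> str:
    """docs: doc_id -> feature tuple; wcd: descriptor of the 'remove' class."""
    return min(docs, key=lambda d: euclidean(docs[d], wcd))
```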

5. EXPERIMENTS

We examine both the weighted function and the NNC replacement strategies for different cache size configurations; the goal of these experiments is to identify the impact of the cache size on the different replacement policies. In all the simulations, we consider a two-level web cache server architecture. The reason for this choice is merely the fact that the two-level architecture is becoming a typical web cache server configuration. The second level is used as a victim cache of the first level. This means that, instead of being discarded when they are removed from the first-level cache, documents are pushed into the second-level cache. This strategy is often used since the first-level cache is used to store popular documents.
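The victim-cache organisation can be sketched as follows (illustrative code with assumed names): a document removed from the first level is pushed into the second level, which evicts from itself only when it, too, is full.

```python
# Illustrative two-level ("victim cache") organisation: documents evicted from
# the first-level cache are pushed into the second level instead of being
# discarded outright.
def demote(l1: dict, l2: dict, l2_capacity: int, victim_id: str, removal_key) -> None:
    size = l1.pop(victim_id)                      # remove from the first level
    while l2 and sum(l2.values()) + size > l2_capacity:
        del l2[min(l2, key=removal_key)]          # make room in the second level
    if sum(l2.values()) + size <= l2_capacity:
        l2[victim_id] = size                      # keep the victim one level down
```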

5.1. Performance metrics

The performance metrics used in this web cache analysis focus on the document hit-ratio and the byte hit-ratio:

- The document hit-ratio "SHR": this metric records the global hit-ratio obtained using a two-level cache configuration.
- The byte hit-ratio "SBHR": this metric records the global byte hit-ratio obtained using a two-level cache configuration.

The byte hit-ratio gives more information on the network bandwidth. As documents have different sizes, recording only the document hit-ratio will not give insight into how the strategy impacts the network bandwidth. Such information is very important for a web cache performance analysis, since the current caching methods seem to favor small document sizes.
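For reference, the two metrics can be accumulated over a trace as in the following sketch (our illustration; the is_hit callback stands for the two-level cache simulator, which is not shown here).

```python
# Illustrative accumulation of the two metrics over a request trace:
# SHR counts hits per request, SBHR weights each hit by the document size.
def hit_ratios(trace, is_hit):
    """trace: iterable of (doc_id, size); is_hit: doc_id -> bool."""
    hits = bytes_hit = requests = total_bytes = 0
    for doc_id, size in trace:
        requests += 1
        total_bytes += size
        if is_hit(doc_id):
            hits += 1
            bytes_hit += size
    shr = hits / requests if requests else 0.0
    sbhr = bytes_hit / total_bytes if total_bytes else 0.0
    return shr, sbhr
```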

5.2. The weighted function

According to what was stated in the previous section, it is clear that improving the replacement strategy is closely related to the combination of different parameters. Combining two or more parameters to make the "decision" of replacing documents requires identifying the relationship among the given parameters. Bolot and Hoschka and Wooster-Abrams have regarded document replacement as a weight optimization problem. The simulations presented by Bolot-Hoschka showed a great advantage of such weighted methods over the simplistic LRU.


Applying the weighted function replacement strategy to a two-level web cache architecture shows a high document hit-ratio, but it decreases the byte hit-ratio (compared to the LRU or LFU strategies [5]). Figures 4 and 5 compare the SHR and the SBHR recorded with the weighted function with those recorded using the other strategies. We note here that the weights used in these experiments were proposed by Bolot-Hoschka to optimize the perceived retrieve time [7] and not the byte hit-ratio. In our experiments, the simulation results show that this approach can have a negative impact on the byte hit-ratio. The experiment conducted on the weighted function shows how the choice of the weights can have a bad effect on performance metrics other than the one we focus on. This behavior could be a drawback for the weighted strategies, since the document hit-ratio, the byte hit-ratio, the perceived time (the elapsed time between sending the request and receiving the document) and many other parameters are tightly correlated, and a strategy that focuses on only one of these parameters could lead to a decrease in the global performance.

[Figure 5 plots the SBHR against the cache size for the LRU, LFU, SIZE, Bolot-Hoschka, and NNC strategies (WINS workload).]

Figure 5: Byte hit-ratio recorded for the different replacement strategies

[Figure 4 plots the SHR against the cache size for the LRU, LFU, SIZE, Bolot-Hoschka, and NNC strategies (WINS workload).]

Figure 4: Hit rate recorded for the different replacement strategies

To understand the mechanisms of the weighted function, we have to compare it to another method which has the same impact on the cache performance. There is a similarity between the hit-ratio evolution of the weighted function strategy and that of the SIZE-based replacement strategy. Different simulations conducted using different workloads showed the same behavior of the cache hit-ratio [5]. It seems that, at least for the weight values proposed by the authors, the size is given more importance than the other parameters.

5.3. The NNC replacement strategy

To show the behavior of the cache performance when documents are removed according to the NNC strategy, let us consider a four-dimensional space where each document is represented as a 4-tuple <size, Etime, Rtime, Nref>. The components of the 4-tuple represent respectively the document size, the entry time, the retrieve time and the number of references to the document. The WCD has been assigned the 4-tuple <size, 1, 1, 1>, in which the component size is used as a parameter of the simulation. In the first experiment, the size is assigned a small value (1 KB) and in the second experiment a large one (64 MB). Figures 4 and 5 show the evolution of both the SHR and the SBHR for several replacement strategies. Contrary to what we were expecting, simplistic replacement strategies such as LRU or LFU exhibit high hit-ratios. We have shown in another study that these methods deal better with one-timer documents, which reduce web cache performance [6]. The NNC strategy, as well as the weighted method, does not cope well with one-timer documents, since they combine several parameters to select the next document to be removed, especially when small cache sizes are considered. However, the NNC results seem to make a trade-off between the hit-ratios compared to the weighted method and the size-based strategy. It is also clear from these simulations that the impact of the size has been considerably reduced. When describing the WCD as having a large size (64 MB), a decrease in the byte hit-ratio has been recorded only for small cache sizes (Figure 6). Like all removal policies, the NNC method is very sensitive to the workload nature. When simulating the NNC strategy using workloads extracted from a busy web server, it turns out that the hit-ratios are dramatically affected by the lack of reference locality exhibited by such workloads. A more detailed study of this phenomenon is presented in [5].
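In terms of the illustrative nnc_victim sketch from Section 4, the two experiments differ only in the size component of the WCD descriptor (values below are the ones quoted above; the normalize step is a hypothetical pre-processing helper, not defined in the paper).

```python
# The WCD 4-tuple <size, Etime, Rtime, Nref> used in the two experiments;
# only the size component changes.
KB, MB = 1024, 1024 * 1024
wcd_small = (1 * KB, 1, 1, 1)    # first experiment: WCD size = 1 KB
wcd_large = (64 * MB, 1, 1, 1)   # second experiment: WCD size = 64 MB
# victim = nnc_victim(docs, normalize(wcd_small))  # hypothetical call; 'normalize'
#                                                  # scales features to one range
```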

[Figure 6 plots the SBHR against the cache size for the NNC strategy with the WCD size component set to 64 MB and to 1 KB (WINS workload).]

Figure 6: The byte hit-ratio obtained using NNC (WINS workload); the WCD has different values for the parameter size

6. CONCLUSIONS

The study provided in this paper shows that combining several parameters in the document replacement strategy is not by itself enough to achieve higher performance. The increase in web cache performance we were expecting from the new document replacement policies can be compensated for, in favor of the simplistic methods, by using larger caches. This is partly due to the mathematical model used to combine the different parameters. It is often difficult to clearly identify the importance of the different parameters composing the workload. The workload component analysis performed in this study showed that the workload components often hold the same amount of information; it is impossible to identify the most relevant parameters. Because of this, we have investigated a new strategy derived from the NNC classification method, in which all the workload components are handled alike. The first evaluation of the web cache performance has shown a positive impact on both the document hit-ratio and the byte hit-ratio. Giving all the workload parameters the same weight in selecting the next document to be removed has considerably reduced the negative impact of certain parameters on the web cache performance. Besides this, the NNC document replacement strategy can be tuned by the designer, a characteristic which is very important when we consider the continuous evolution of the Internet traffic.

Acknowledgments

The authors are grateful to Henk Muller, Andy Pimentel and Arjan Peddemors for their reviews of this paper. The work presented here is part of the JERA project funded by the Dutch HPCN foundation.

References

[1] Arlitt, M. F., and Williamson, C. L., "Trace-Driven Simulation of Document Caching Strategies for Internet Web Servers," Simulation, Vol. 68, No. 1, January 1997, pp. 23-33.
[2] Arlitt, M. F., "A Performance Study of Internet Web Servers," Master's Thesis, Department of Computer Science, University of Saskatchewan.
[3] Batchelor, B. G., "Practical Approach to Pattern Classification," Plenum Press, 1988.
[4] Baentsch, M., et al., "World Wide Web Caching: The Application-Level View of the Internet," IEEE Communications Magazine, June 1997.
[5] Belloum, A., and Hertzberger, L. O., "Simulation of a Two-Level Cache Server," Technical Report CS-98-01, Computer Science Department of the University of Amsterdam.
[6] Belloum, A., and Hertzberger, L. O., "Dealing with One-Timer Documents in Web Caching," in Proceedings of the EUROMICRO'98 Conf. on Multimedia and Telecommunication, August 1998.
[7] Bolot, J. C., and Hoschka, P., "Performance Engineering of the World Wide Web: Application to Dimensioning and Cache Design," in Proceedings of the Conf. on Computer Networks and ISDN, September 1996, pp. 1397-1405.
[8] Danzig, P. B., "A Case for Caching File Objects Inside Internetworks," Technical Report CU-CS-642-93, Department of Computer Science, University of Colorado at Boulder.
[9] Gwertzman, J., and Seltzer, M., "An Analysis of Geographical Push Caching," http://www.eecs.harvard.edu/
[10] Smith, N. G., "The UK National Web Cache - The State of the Art," in Proceedings of the Conf. on Computer Networks and ISDN, September 1996, pp. 1407-1414.
[11] Jain, R., "The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling," New York: Wiley, 1991, ch. 6, pp. 76-78.
[12] Williams, S., et al., "Removal Policies in Network Caches for World-Wide Web Documents," in Proceedings of ACM SIGCOMM, August 1997, pp. 293-305.
[13] Wooster, R. P., "Optimizing Response Time, Rather Than Hit Rates, of WWW Proxy Caches," Master's Thesis, Blacksburg, Virginia, 1996.
