Decision Support Systems 46 (2008) 52–60
A new approach for a proxy-level web caching mechanism

Chetan Kumar a,⁎, John B. Norris b,1

a Department of Information Systems and Operations Management, College of Business Administration, California State University San Marcos, 333 South Twin Oaks Valley Road, San Marcos, CA 92096, United States
b Krannert School of Management, Purdue University, 403 West State Street, West Lafayette, IN 47907, United States
Article info

Article history: Received 12 October 2007; Received in revised form 7 April 2008; Accepted 21 May 2008; Available online 27 May 2008

Keywords: Web caching; Proxy-level mechanism; Web request patterns; Performance evaluation
Abstract

In this study we propose a new proxy-level web caching mechanism that takes into account aggregate patterns observed in user object requests. Our integrated caching mechanism consists of a quasi-static portion that exploits historical request patterns, as well as a dynamic portion that handles deviations from normal usage patterns. This approach is more comprehensive than existing mechanisms because it captures both the static and the dynamic dimensions of user web requests. The performance of our mechanism is empirically tested against the popular least recently used (LRU) caching policy using an actual proxy trace dataset. The results demonstrate that our mechanism performs favorably versus LRU. Our caching approach should be beneficial for computer network administrators to significantly reduce web user delays due to increasing traffic on the Internet. © 2008 Elsevier B.V. All rights reserved.
1. Introduction and problem motivation

There has been tremendous growth in the amount of information available on the Internet. The trend of increasing traffic on the Internet is likely to continue [5]. Despite technological advances this huge traffic can lead to considerable delays in accessing objects on the web [13,17]. Web caching is one of the approaches to reduce such delays. Caching involves storing copies of objects in locations that are relatively close to the user. This allows user requests to be served faster than if they were served directly from the origin web server [2,5,9,10]. Caching may be performed at different levels, namely the browser, proxy, and web-server levels [6,11]. Browser caching typically occurs closest to the end user, such as on the user computer's hard disk [13]. Proxy caches are situated at network access points for web users [7]. Consequently proxy caches can store documents and directly serve requests for them in the network, thereby avoiding repeated traffic to
⁎ Corresponding author. Tel.: +1 760 477 3976. E-mail addresses: [email protected] (C. Kumar), [email protected] (J.B. Norris).

doi:10.1016/j.dss.2008.05.001
web servers. This reduces network traffic, the load on web servers, and the average delays experienced by network users while accessing the web [1,5]. Proxy caching is widely used by computer network administrators, technology providers, and businesses to reduce user delays on the Internet [7]. Examples include proxy caching solution providers such as IBM (www.ibm.com/websphere), Internet service providers (ISPs) such as AOL (www.aol.com), and content delivery network (CDN) firms such as Akamai (www.akamai.com). Effective proxy caching benefits both the specific network where it is used and all Internet users in general. Web-server caching, which is performed at the source of web content, focuses on reducing demand for HTTP connections to a single server [13]. A caching mechanism, performed at any network level, requires two key decisions: which objects are to be stored in the cache (the cache entry decision), and which of the currently cached objects are to be evicted to make room for new ones (the cache replacement decision). In this paper we propose a new proxy-level caching mechanism that takes into account aggregate patterns observed in user object requests. Based on how frequently the contents of the cache are modified, existing caching mechanisms can be classified into two types: static, where the contents of the cache are
fixed, and dynamic, where the contents are changed dynamically according to incoming user requests [2,13,14,17]. Our proposed caching mechanism falls into a third category: quasi-static, where the same objects are retained in the cache in between pre-determined time intervals, while they may be changed across intervals. Thus this mechanism is static within time intervals but may be dynamic across intervals. We derive the motivation for our approach from studies that have demonstrated the existence of patterns in proxy-level user requests. Cao and Irani [1], Rizzo and Vicisano [15], and Lorenzetti et al. [12] have shown that users typically re-access documents on a daily basis, with surges in demand for a document occurring in multiples of 24 h. Cao and Irani [1] have further demonstrated that these proxy-level patterns exist due to the combined effect of individual user re-access behavior, even in the presence of browser caches. However the earlier studies concentrate on repeating 24 h access patterns for static documents that remain unchanged in terms of content and size. Instead, in our caching approach we aim to identify and exploit repeating access patterns for documents whose contents may change over time but whose uniform resource locator (URL) address remains the same. Examples of this type of web content include the front pages of many sites (e.g., www.yahoo.com, www.aol.com, www.cnn.com, www.google.com, etc.) which may vary the specific content of their sites but retain the same name for the home page. Thus even if the contents of the website front page change, as long as we have identified aggregate user patterns for accessing the site at a particular time of the day, the latest contents of the front page can be downloaded prior to the spike in user requests.
We exploit the repeated-access pattern in our mechanism by making the caching decision for a specific time interval based on the history of observed requests for the same interval. However the caching decision also depends on the cost associated with caching. Determining the optimal quasi-static caching decision, while considering these factors, is the first part of our study. Following that we extend the mechanism to include a dynamic policy as well, i.e., current user access patterns are also taken into consideration to determine the objects to be cached. Note that users may exhibit repeating access patterns beyond the front-page level of websites. For example, after viewing the CNN front page users may often view reports in the CNN/Weather section. Another example is when users access a section of a website that has a URL with a unique session ID assigned to it. However, since the names of documents in a specific section of a website typically change across days, it is difficult to identify historical access patterns at a sub-front-page level. Hence all objects beyond the front-page level are treated as current requests and can be cached by the dynamic part of our policy. A caching mechanism that contains both quasi-static and dynamic dimensions can handle, besides normal usage patterns, unanticipated events such as natural disasters or major accidents which can generate huge unexpected loads on websites. Therefore the objective of this study is to develop an integrated caching mechanism that utilizes both historically and currently observed request occurrences. The performance of the proposed mechanism is evaluated using real world data. The results indicate that our caching approach is beneficial for computer network administrators to significantly reduce delays experienced by web users at proxy server levels.
The plan of the rest of this paper is as follows. We first discuss literature related to our topic. We then illustrate the model for our caching mechanism. Next we present performance results for the mechanism using a proxy trace dataset. The results section consists of two sub-parts: first the quasi-static portion is evaluated, followed by the integrated mechanism. Finally we discuss conclusions and areas for future research.

2. Literature review

While caching has been extensively studied in computer science, there has recently been a growing interest in the topic in the Information Systems (IS) area. Datta et al. [5] have identified caching to be a key research area due to its application in reducing user delays while accessing the increasingly congested Internet. Zeng et al. [18] have further highlighted the benefits of caching strategies that can handle dynamic content. Podlipnig and Boszormenyi [14], Zeng et al. [18], and Datta et al. [5] provide extensive surveys of the numerous caching techniques that have been proposed. These include popular cache replacement strategies such as least recently used (LRU), where the least recently requested object is evicted from the cache to make space for a new one; least frequently used (LFU), where the least frequently requested object is removed; lowest latency first, where the document with the lowest download latency or time delay is removed; size, where the largest document is evicted; and their numerous extensions. The LRU policy, with its variations, is one of the most commonly employed proxy caching mechanisms [1,12,14,18]. The advantage of LRU is its simplicity. On the other hand LRU does not consider historical patterns in user requests. Most caching studies focus on improving performance on metrics such as user latency and bandwidth reduction. There have been relatively few studies that consider a data or model driven approach for managing caches effectively.
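For concreteness, the replacement strategies surveyed above differ only in which cached object they pick as the eviction victim. A minimal sketch, with invented object metadata fields (last_access, hits, latency, size) rather than any particular implementation:

```python
# Sketch: which cached object each classical replacement policy would evict.
# The metadata values below are illustrative, not from a real cache.
cache = {
    "a.html": {"last_access": 10, "hits": 50, "latency": 0.2, "size": 4},
    "b.html": {"last_access": 95, "hits": 3,  "latency": 0.9, "size": 12},
    "c.html": {"last_access": 60, "hits": 7,  "latency": 0.1, "size": 2},
}

def victim(policy):
    if policy == "LRU":    # least recently requested object
        return min(cache, key=lambda u: cache[u]["last_access"])
    if policy == "LFU":    # least frequently requested object
        return min(cache, key=lambda u: cache[u]["hits"])
    if policy == "lowest-latency-first":  # cheapest to re-fetch
        return min(cache, key=lambda u: cache[u]["latency"])
    if policy == "size":   # largest document
        return max(cache, key=lambda u: cache[u]["size"])
    raise ValueError(policy)

print(victim("LRU"), victim("LFU"), victim("size"))
```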
The Mookherjee and Tan [13] study provides an analytical framework for the LRU cache replacement policy. The framework is utilized to evaluate LRU policy performance under different demand and cache characteristics. The study specifically models caching at the browser level for individual caches. Hosanagar and Tan [9] develop a model for optimal replication of objects using a version of the LRU policy. The study considers a framework of two caches whose capacities are partitioned into regions where the level of duplication is controlled. Their model does not utilize historical patterns of web requests for caching decisions. Hosanagar et al. [10] develop an incentive compatible pricing scheme for caching, involving content providers and cache operators, with multiple levels of Quality of Service. The study focuses on improving adoption of caching services among content providers rather than developing a specific caching mechanism. Cockburn and McKenzie [4], and Tauscher and Greenberg [16], specifically study client-side behavior in the context of the Internet. They suggest that the probability of users revisiting websites is very high. Cao and Irani [1], Rizzo and Vicisano [15], and Lorenzetti et al. [12] have further demonstrated 24 h re-access patterns for documents. But they do not consider dynamic documents such as website front pages that often change contents. Zeng et al. [18] describe some caching methods that attempt to predict access
requests, such as the Top-10 algorithm that compiles a list of the most popular websites. Our caching mechanism partially follows along those lines in the quasi-static portion, but also exploits the 24 h re-access patterns mentioned earlier. Further, we include a dynamic portion in our integrated mechanism that can handle deviations from past requests. Therefore our study is distinct from prior work as we consider both historical patterns and requests for dynamic content in order to develop a comprehensive proxy-level caching mechanism. This study expands on the preliminary research of Kumar and Norris [11] with a comprehensive performance evaluation of the proposed mechanism using an actual proxy trace dataset.
3. Model

We first consider a quasi-static caching mechanism with no dynamic policy. A 0–1 mathematical program model is developed for this mechanism. It minimizes the total cost of caching and the delay due to requests for objects that are not in the cache, given the constraint on cache size. The parameters for the mathematical program model are: the number, n, of time intervals in a 24-h period for which caching decisions are to be provided; the capacity, k, of the cache; the cost, c, of caching an object; and the delay, t, to download an object from the web server. The number of past requests, R^past_ij, for object i (i = 1,…, m) in time interval j (j = 1,…, n) is an objective coefficient. The variables are:

x_ij = 1 if object i is cached at time interval j, 0 otherwise.
y_ij = 1 if caching cost is incurred for x_ij = 1, 0 otherwise.

Using the above, we formulate the mathematical program for the quasi-static model as follows:

(W)  min  t ∑_{i=1}^{m} ∑_{j=1}^{n} (1 − x_ij) R^past_ij + c ∑_{i=1}^{m} ∑_{j=2}^{n} y_ij + c ∑_{i=1}^{m} x_i1    (1)

s.t.  x_ij − x_ij−1 ≤ y_ij   for i = 1,…, m and j = 2,…, n    (2)

∑_{i=1}^{m} x_ij ≤ k   for j = 1,…, n    (3)

x_ij, y_ij ∈ {0, 1}   for i = 1,…, m and j = 1,…, n.    (4)

The decision variable x_ij provides the optimal solution for problem (W), i.e., the objects that should be cached at any time interval. Variable y_ij is used for calculating the cost of caching. The first term of the objective function (1) represents the communication delays incurred when requests are served directly from the origin server. When an object is not present in the cache x_ij = 0 and the cost for serving requests is t times the number of requests. The second and third terms together capture the cost of updating caches that is incurred whenever a new object is brought into the cache. This cost takes into account the computer resources required to refresh cache contents. The second term considers the cache updating cost at any given time interval j where j ≠ 1. Note that this cost is incurred only if an object is not already present in the cache in the previous interval j − 1. The third term is the updating cost in the initial time interval j = 1, which is always incurred when objects are first cached in the quasi-static mechanism. The objective is to minimize these costs, given the following constraints. Constraint (2) ensures that the cache updating cost in period j is only incurred when a new object is brought into the cache relative to the preceding period j − 1. This is because if for an object i both x_ij−1 = 1 and x_ij = 1 then the cache need not be updated. Only if x_ij−1 = 0 and x_ij = 1 is y_ij = 1 and an updating cost incurred. Constraint (3) captures the cache capacity restrictions. Constraint (4) ensures that the variables can only take binary values. Collecting terms and simplifying in Eq. (1), (W) can be rewritten as follows:

(DW)  min  −t ∑_{i=1}^{m} ∑_{j=2}^{n} x_ij R^past_ij + c ∑_{i=1}^{m} ∑_{j=2}^{n} y_ij + ∑_{i=1}^{m} (−t R^past_i1 + c) x_i1   s.t. (2), (3), (4).    (5)
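The 0–1 program can be sanity-checked by exhaustive search on a toy instance. The sketch below is a stand-in for the branch and bound solution computed by CPLEX in the paper, with invented R^past values; it enumerates all assignments of x_ij, derives y_ij implicitly from consecutive x values (which satisfies constraint (2) at minimum cost), and evaluates objective (1) subject to the capacity constraint (3):

```python
from itertools import product

# Toy instance of problem (W): m objects, n intervals, capacity k.
# Parameter and R^past values are invented for illustration.
m, n, k, t, c = 3, 2, 1, 2.0, 1.0
R = [[5, 1],   # R^past_ij: past requests for object i in interval j
     [1, 6],
     [2, 2]]

def cost(x):
    """Objective (1): delay for uncached requests plus cache-update costs."""
    delay = t * sum((1 - x[i][j]) * R[i][j] for i in range(m) for j in range(n))
    # y_ij = 1 exactly when object i enters the cache at interval j >= 2
    updates = c * sum(x[i][j] and not x[i][j - 1]
                      for i in range(m) for j in range(1, n))
    initial = c * sum(x[i][0] for i in range(m))  # third term, j = 1
    return delay + updates + initial

best = None
for bits in product([0, 1], repeat=m * n):
    x = [list(bits[i * n:(i + 1) * n]) for i in range(m)]
    # constraint (3): at most k objects cached in each interval
    if all(sum(x[i][j] for i in range(m)) <= k for j in range(n)):
        if best is None or cost(x) < cost(best):
            best = x

print(best, cost(best))
```

With these numbers the optimum caches object 1 in the first interval and object 2 in the second, mirroring the intuition that each interval should hold its historically most requested objects net of updating costs.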
The solution to (DW) provides the optimal quasi-static caching decisions for the n time periods under consideration. Problem (DW) is a 0–1 integer program among the class of NP problems that are hard to solve [8]. However, our problem structure is such that mathematical program solvers such as CPLEX can solve reasonably large problem sizes quickly using branch and bound techniques. This is facilitated as we assume uniform object sizes as a first cut of the model, analogous to the basic LRU policy. In addition, in our problem n is typically small, in the order of 24 or 48 time intervals, which reduces the number of variables under consideration. Solution approaches and problem sizes are discussed further in the performance analysis of Section 4. To study the performance of our approach, (DW) is solved with R^past_ij values observed from historical data and using different values of the parameters n, k, c, and t. In order to conduct a statistically meaningful analysis we use a moving window of 30 days of proxy requests data for recording R^past_ij. The commercially available mathematical program solver CPLEX version 8.1 is used for solving (DW). Next we incorporate a dynamic dimension to our caching mechanism. A portion of the cache is now set aside to handle cases when there are a significant number of requests that deviate from historically observed patterns. The partitioning of the cache between the quasi-static and the dynamic portions is determined by the volume of unanticipated requests. The more deviation there is between currently observed request occurrences and R^past_ij, the greater will be the dynamic portion. This kind of adaptive allocation of the proportion of dynamic and quasi-static caching allows the mechanism to handle both historical patterns as well as deviations from such patterns.

4. Performance analysis

The performance of our caching mechanism (in terms of caching costs and request delays) is tested against the LRU caching mechanism, using an actual proxy trace dataset. The LRU cache replacement strategy, along with its many extensions, is widely employed for proxy caching [1,12,14,18]. Therefore we use LRU as a benchmark to compare the performance of our proposed mechanism. We employ the following cache entry and replacement policies for implementing an LRU
Table 1
URL request frequencies for days 1 through 62

Frequency of requests    Number of URLs (cumulative)
> 10,000                 32 (32)
9999–5000                17 (49)
4999–2000                68 (117)
1999–1000                128 (245)
999–500                  227 (472)
499–200                  537 (1009)
199–100                  905 (1914)
99–50                    2703 (3608)
49–30                    2371 (5979)
< 30                     139,166 (145,145)
caching mechanism. Any newly requested object that is not already present in the cache is always brought into the cache. The least recently requested object is evicted from the cache to make way for new objects. A detailed illustration of the LRU implementation using parameters is provided in the next sub-section. We have obtained data of the proxy traces of web objects requested in the nine server locations of the IRCache network across multiple days (www.ircache.net). For performance testing we utilize 62 days of trace data of the New York IRCache proxy server. The data was collected between 29 April and 30 June, 2004. The comprehensive trace data includes the URLs of requests, the times when they are requested, the type of object requested, an assigned identifier for the IP address of the user requesting a URL, and the elapsed time for serving the request. In our model we include all front page requests that are defined as “page views” (i.e., objects with suffixes such as .htm, .html, .php, and .jsp) in Christ et al. [3]. As mentioned earlier, the documents within specific sections of the website front page are to be cached in the dynamic portion of our mechanism. If a specific section has a URL name that is unchanged across days then it may be treated as an object to be considered for caching at the quasi-
Table 2
Top 20 requested sites for days 1 through 62

Site                 Number of requests
yahoo.com            138,775
friendster.com       89,814
microsoft.com        57,951
gator.com            52,064
msn.com              46,096
doubleclick.net      43,550
cisinternet.net      38,131
google.com           37,140
icq.com              36,914
animespy.com         34,327
water.com            32,395
ebay.com             31,095
hotbar.com           24,503
formulababe.com      23,288
phpwebhosting.com    21,107
atwola.com           20,855
wv-cis.net           17,475
17tahun.com          15,950
aol.com              15,950
atdmt.com            15,756
static portion. As stated earlier, we assume all objects to be of unit size as a first cut of the model. Subsequent versions of the mechanism can be modified to include different object sizes. We include the following subset of fields from the comprehensive dataset for our analysis: date of access, time of access (ranging from 0 to 86,400 s in a 24 h period), front page level URL requests, and frequency of URL requests over the time period under consideration. The request frequencies are determined by assigning an ID to every unique URL front page and counting the number of occurrences of the ID over a time period. The comprehensive dataset of logs with all 62 days of requests is 2.1 GB in size. After converting the dataset to a Microsoft Access database, and querying for the required fields, the final dataset is 800 MB and has 2,567,818 records. The frequency distribution for URL requests for all 62 days is presented in Table 1. The distribution indicates that there are a large number of URLs that are requested a few times and a small set of very popular sites. Of the 145,145 total unique URLs in the dataset, 139,166 URLs are requested less than 30 times each. There are only 32 URLs that are requested more than 10,000 times. This conforms to the intuition that users tend to revisit particular websites that are popular across the broad population. This pattern can be exploited by the quasi-static portion of our caching mechanism. The 20 most requested sites for the comprehensive 62 days dataset are presented in Table 2. The most popular site is yahoo.com, and other well known sites such as microsoft.com and google.com are also part of the list. Note that since yahoo.com is a portal that offers a large number of services such as email, search, news, music, etc., it is not surprising that it is the most requested site in our dataset. As mentioned earlier, we are using a 30 day moving window for recording R^past_ij. Therefore it is useful to characterize the dataset patterns and evaluate URL frequency for batches of 30 days of requests instead of the comprehensive 62 days. Using the first 30 days of the dataset we confirm that the pattern of a few very popular sites is repeated. Of the 145,145 total unique URLs, 145,022 URLs are requested less than 500 times each. There are only 12 URLs that are requested more than 8000 times. While considering the most requested sites for the first 30 days, the usual suspects of popular sites such as yahoo.com, microsoft.com, and google.com are present as in the comprehensive 62 days case. For further illustration refer to Tables 3 and 4, which present the URL request frequencies and the 20 most requested sites, respectively, for the first 30 days of data. These results indicate that a 30 day
Table 3
URL request frequencies for days 1 through 30

Frequency of requests    Number of URLs (cumulative)
> 15,000                 4 (4)
14,999–12,000            5 (9)
11,999–8000              3 (12)
7999–5000                5 (17)
4999–3000                12 (29)
2999–1500                32 (61)
1499–1200                11 (72)
1199–1000                11 (83)
999–500                  40 (123)
< 500                    145,022 (145,145)
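The frequency tabulations in Tables 1 and 3 reduce to a counting pass over the trace. A minimal sketch of recording R^past-style counts over a moving window of days, where the log tuples are invented stand-ins rather than the actual IRCache log format:

```python
from collections import Counter

# Sketch: count front-page URL requests over a moving window of days.
# The (day, url) records below are illustrative, not real trace entries.
log = [
    (1, "yahoo.com"), (1, "google.com"), (2, "yahoo.com"),
    (2, "yahoo.com"), (3, "msn.com"), (31, "yahoo.com"),
]

def window_counts(log, start_day, days=30):
    """Request counts for the window [start_day, start_day + days)."""
    window = range(start_day, start_day + days)
    return Counter(url for day, url in log if day in window)

counts = window_counts(log, start_day=1)   # days 1..30; day 31 excluded
print(counts.most_common())
```

Sliding the window forward by one day then only requires adding the newest day's counts and dropping the oldest, as described in Section 4.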
Table 4
Top 20 requested sites for days 1 through 30

Site                             Number of requests
yahoo.com                        43,624
friendster.com                   30,264
microsoft.com                    15,126
water.com                        15,111
icq.com                          14,766
animespy.com                     14,542
atwola.com                       13,314
msn.com                          13,123
google.com                       12,726
phpwebhosting.com                11,573
ebay.com                         8009
adbureau.net                     6705
17tahun.com                      6609
gator.com                        6440
216.66.24.58 (NY IRCache node)   5649
doubleclick.net                  4900
everyone.net                     4626
ircache.net                      4241
go.com                           3776
plasa.com                        3684

moving window is a good indicator for the overall URL frequency patterns that exist in proxy-level requests. The values for R^past_ij and the parameters n, k, c, and t are the input to our mathematical program model. Proxy trace data is used for recording R^past_ij. The solution of the model provides us with the optimal caching decisions based on the historical pattern over the moving window. Using these caching decisions we evaluate the performance of our model on the object request patterns, R^current_ij, for the day following the 30 day moving window. In this manner we can evaluate the performance for any given proxy log period. We then repeat this performance analysis with the dynamic dimension included in the caching mechanism. The mechanism that includes both quasi-static and dynamic portions is referred to as integrated. Since we consider R^past_ij to be an indicator of R^current_ij request patterns, it would be interesting to note if any similarities can be identified by visually inspecting the request data. Table 5 presents the 20 most requested sites for day 31 alone (i.e., R^current_ij), the day following the moving window of days 1 through 30 (i.e., R^past_ij). As can be observed, popular sites such as yahoo.com and microsoft.com are also present here (refer to Table 5), though the ranking order may be different from that of days 1 through 30 (refer to Table 4). Of course this is just one instance of aggregate URL popularities in R^past_ij corresponding to R^current_ij. The actual benefit of the above approach is to be evaluated by the overall model performance. The values of the parameters n, k, c, and t that are used in the mathematical program model are likely to have an impact on the overall performance of our approach. Therefore we utilize different values to analyze the sensitivity of these parameters on mechanism performance. In addition, for the integrated mechanism we vary the percentage allocation of cache space between the quasi-static and dynamic portions and test the effect on overall performance. The mechanism performance is compared to the LRU policy for each case. In the following subsections we first present the performance of the quasi-static portion of the mechanism, followed by the performance of the integrated mechanism.

4.1. Quasi-static caching mechanism
We first compare the performance of our quasi-static caching mechanism versus the popularly used LRU caching mechanism. In our mechanism the caching decisions for any given 24 h time period are determined by recording R^past_ij for the 30 days of requests prior to the given day. Problem (DW) is then solved for a given set of parameter values. These tasks, as well as all other performance testing, are accomplished using the C programming language and the CPLEX mathematical program solver version 8.1. Problem (DW) can be solved quite quickly for even relatively large problem sizes. This is demonstrated as follows. We begin by dividing a day into 24 1 h time intervals for caching decisions. This means that the quasi-static portion of the cache is refreshed every hour. Setting n = 24 is a starting point given that repeating 24 h object re-access patterns have been previously identified [1,15]. Let m = 6000 objects and cache capacity k = 1000 objects. Here proxy cache capacity is relatively large in proportion to the total number of objects being requested. This can occur if network administrators invest resources in acquiring a large disk storage space for the cache. The benefit of caching is due to reducing requests to origin web servers with unit cost t, as such requests tie up network bandwidth. In contrast the cost c of internally updating caches is relatively inexpensive as it is not affected by network congestion. Accordingly we first set the t/c ratio to be two, with t = 2 and c = 1. This ratio can be changed depending on a judgment of relative costs by network administrators. A more detailed discussion of the current t/c ratio value, as well as an alternative scenario, is provided later. Since R^past_ij is a coefficient that does not affect the solution time of (DW) we generate it randomly. Using the above values of m = 6000, n = 24, k = 1000, t = 2, and c = 1, the optimal solutions for (DW) require a presolve time of 9.42 s and a root relaxation solution time of 125.46 s.
This shows that (DW) can be quickly solved for reasonably large problem sizes. Since in our problem instances n is typically small, in the order of 24 or 48, the number of variables is reduced. Given our problem structure, the optimal quasi-static caching decisions can be

Table 5
Top 20 requested sites for day 31

Site                             Number of requests
friendster.com                   2513
yahoo.com                        2288
icq.com                          885
microsoft.com                    712
msn.com                          563
water.com                        550
animespy.com                     507
17tahun.com                      501
adbureau.net                     473
google.com                       372
phpwebhosting.com                340
doubleclick.net                  334
ebay.com                         323
detik.com                        232
gator.com                        207
216.66.24.58 (NY IRCache node)   182
go.com                           153
geocities.com                    142
everyone.net                     142
atwola.com                       123
determined using available program solvers. Significant deviations from past requests can be handled by the dynamic portion of the mechanism. As we later demonstrate, this often favors our mechanism, in addition to less frequent cache updating costs, when comparing performance to heuristic based procedures such as the LRU caching strategy. However, in different situations where the problem size is greatly expanded, it is difficult to find optimal solutions in a reasonable time frame. For example, if m = 100,000, n = 250, and k = 10,000, (DW) cannot be solved to optimality within an upper bound of 30 min. In cases where the problem size is very large, alternative solution approaches, which aim to find good results quickly, can be developed. Examples include dynamic programming and heuristic procedures [8]. Table 6 describes the characteristics of the comprehensive proxy cache dataset for all 62 days, when the m most requested URLs are included for analysis. As mentioned earlier, when we include all URLs that have 1 or more requests then there are 145,145 total sites. There are a large number of URLs that are requested only a few times. By increasing the inclusion limit to sites that are requested more than 5 times we consider the top 26,938 requested URLs. For testing purposes we use the top 53 requested URLs, which include all sites that have 4500 or more requests. Recording R^past_ij for any dataset is a one-time event that is not impacted by parameter changes. An index is created for counting R^past_ij that is proportional to website requests and the number of objects m. The trace of website requests is already maintained in the proxy server. Using the trace, R^past_ij may be recorded prior to the time at which it is used. Since we are exploiting 24 h re-access patterns, at the very least we have one full day before the values are required. Therefore R^past_ij does not have to be freshly determined for a given time interval. This process is further aided by using a 30 day moving window for R^past_ij. For any new day the previous 29 days of request counts already exist and only the additional day's count needs to be quickly added. For these reasons R^past_ij is an input coefficient for (DW) that does not affect solution time. The trace data is also used for implementing the LRU caching mechanism. The steps for the LRU implementation are detailed in Fig. 1. We now compare the performance of the quasi-static model and the LRU caching mechanism for different sets of parameter values. Table 7 measures the performance of the two mechanisms while varying n for quasi-static decision making and keeping other parameters constant. For both mechanisms total cost is t times the number of requests for URL objects not in cache plus c times the number of objects brought into cache. Varying n determines the time intervals
Table 6 Proxy requests comprehensive dataset characteristics
57
Fig. 1. Implementation of LRU mechanism.
at which cached objects are updated by the quasi-static mechanism. The appropriate choice of n depends on specific proxy request patterns. Based on the performance results cache administrators decide if objects in quasi-static model are to be refreshed very frequently or infrequently. For our tests we vary n from 48 to 3. Using this range we can typically identify interior solutions for the quasi-static model where optimal n is in between the upper and lower values. Note that choice of n does not affect LRU mechanism costs. This is because in LRU cache updating decision is evaluated on the arrival of every new request. The optimal solution for the from days 1 to 30 is quasi-static model based on Rpast ij for day 31. The LRU mechanism costs on evaluated on Rcurrent ij day 31 are determined starting from the cache contents obtained after simulating the model on requests from day 1 to 30. As before we begin by setting t = 2 and c = 1. For this t/c ratio it is twice as beneficial to serve a request internally from cache as compared to origin server. The t/c ratio of 2 captures the normal situation where origin servers are not choked with extraordinarily high traffic. Alternatively the network connection is fast such as a T1 line. An example of this scenario is illustrated by the IRCache network (www.ircache.net). If a requested object is cached at a location close to the user, then
Table 7 = days 1 to 30, Rcurrent = day 31, t = 2, c = 1, Quasi-static performance for Rpast ij ij and varying n; corresponding LRU mechanism cost is 12,009
Number of requests
Total number of records
Parameter values (varying n) m = 53, k = 10, t = 2, c = 1
Quasi-static mechanism cost
% improvement over LRU
Top m requested objects 145,145 26,938 5978 1006 100 53 3
≥1 N5 N30 N200 N2,300 N4,500 N55,000
2,567,818 2,360,214 2,109,905 1,772,191 1,240,175 1,087,916 286,540
n = 48 n = 36 n = 24 n = 18 n = 12 n=6 n=3
9486 9463 9428 9093 9025 9121 9858
21.01 21.20 21.49 24.28 24.85 24.05 17.91
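The cost accounting used for both mechanisms (t per request that must be served from the origin server, plus c per object brought into cache) can be sketched for the LRU case as follows. This is our own illustrative reading of the procedure in Fig. 1, not the authors' code; the function name and data shapes are assumptions.

```python
from collections import OrderedDict

def lru_cost(requests, capacity, t=2, c=1):
    """Simulate an LRU cache over a request stream and tally total cost:
    t per cache miss (request served from the origin server)
    plus c per object brought into the cache."""
    cache = OrderedDict()  # keys kept in recency order, most recent last
    cost = 0
    for url in requests:
        if url in cache:
            cache.move_to_end(url)         # hit: refresh recency, no cost
        else:
            cost += t + c                  # miss: fetch from origin and cache it
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used object
            cache[url] = True
    return cost
```

For example, with t = 2 and c = 1 every miss contributes 3 to total cost, so a stream with four misses costs 12 regardless of hit count.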
Table 8
Quasi-static performance for R^past_ij = days 1 to 30, R^current_ij = day 31, t = 200, c = 1, and varying n; corresponding LRU mechanism cost is 804,603

Parameter values (varying n):      Quasi-static      % improvement
m = 53, k = 10, t = 200, c = 1     mechanism cost    over LRU
n = 48                             1,003,680         −24.74
n = 36                             974,463           −21.11
n = 24                             937,445           −16.51
n = 18                             921,639           −14.55
n = 12                             899,827           −11.83
n = 6                              909,823           −13.08
n = 3                              984,414           −22.35
the waiting time, in fractions of seconds, is about 5 times less than in the alternative case. Of course, while this is true for a single request, total network delays depend on the aggregation of individual delays. In effect, a relatively small t/c ratio means the penalty for having to access a distant origin web server is not disproportionately large. This ratio is changed later. Using the above parameters, we observe that varying n indeed affects quasi-static mechanism performance. The costs decrease as n is reduced from 48 to 12, and then increase when n is further reduced to 3. The best quasi-static performance occurs at n = 12, where the minimum cost value is 9025. For the same t = 2 and c = 1 values, the LRU mechanism cost, using the procedure outlined in Fig. 1, is 12,009. At n = 12, the quasi-static mechanism improves over LRU performance by almost 25%. The quasi-static mechanism always outperforms the LRU policy, as shown by the positive percentage improvement figures in Table 7. This is because the latter mechanism incurs frequent costs of bringing new objects into the cache. In order to compare the above results to other days in the dataset, we run the models again for R^past_ij = days 2 to 31 and R^current_ij = day 32. We confirm that similar patterns exist for these durations, where the minimum cost of 9841 is achieved by the quasi-static mechanism at n = 48. The corresponding LRU mechanism cost is again higher, at 14,265. However, how would the two mechanisms perform relative to one another when the cost of not serving requests from the cache is very much higher than the cost of frequently updating cache contents? This is tested by setting t = 200 and c = 1, and running the quasi-static model for different values of n. The t/c ratio of 200 can be representative of the case where the origin servers are experiencing very high traffic. Alternatively, the network connection may be very slow or severely congested.
An example of this scenario is when, due to a major disaster, news websites experience extraordinarily high traffic from users seeking updates. Since the origin servers would be very slow, the penalty for accessing them is very high. The results, reported in Table 8, show that for the above t/c value the LRU mechanism, with a cost of 804,603, always outperforms the quasi-static model. This is indicated by the negative figures for quasi-static performance relative to LRU in percentage terms. The quasi-static mechanism has a minimum cost of 899,827 at n = 12, where it underperforms LRU by almost 12%. This occurs because now the cost of frequently updating cache contents in the LRU mechanism is quite small compared to the relatively large cost of even a few request misses in the quasi-static mechanism. Therefore we conclude that as long as the relative difference between the per unit cost of updating caches and
that of URL request misses is not greatly exaggerated, the quasi-static mechanism outperforms the LRU policy for our given dataset.

4.2. Integrated caching mechanism

We now include a dynamic portion in our mechanism, in addition to the quasi-static model, to create an integrated caching mechanism that can also handle requests that deviate significantly from historically observed patterns. The dynamic portion of our mechanism is implemented using a variation of the traditional LRU policy as follows. For the cache entry policy of the dynamic portion, we ensure that only newly requested URLs that are not already present in the quasi-static part of the cache can be brought in. The cache replacement policy is the same as the usual LRU mechanism, where the least recently requested site is evicted to make way for new requests. The proportion of cache space allocated to the quasi-static and dynamic portions can have a significant impact on overall mechanism performance. Of course, the allocation that produces the best performance is specific to the patterns of the requests under consideration. We parametrically test the effect of the allocation proportion on performance for the proxy dataset, and evaluate the costs of our integrated mechanism versus the traditional LRU policy used for comparison earlier (refer to Fig. 1). We retain the earlier values of t = 2 and c = 1, and begin by setting n = 48. Table 9 presents the results for R^past_ij = days 1 to 30, R^current_ij = day 31, and a varying proportion of the quasi-static portion in the integrated mechanism (with the remaining part of cache capacity allocated to the dynamic policy). We observe that integrated mechanism performance improves as the quasi-static portion is increased from 10% to 70% of capacity, after which performance worsens. The best integrated mechanism performance of 4935 is achieved at a 70% quasi-static portion and 30% dynamic portion. It greatly improves on the LRU mechanism cost of 12,009, by almost 59%.
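The integrated idea, a fixed quasi-static portion plus an LRU-managed dynamic portion that only admits URLs absent from the static part, can be sketched roughly as follows. All names here are hypothetical, and charging the static portion a one-time loading cost of c per object is our simplifying assumption, not a detail given in the text.

```python
from collections import OrderedDict

def integrated_cost(requests, static_set, dyn_capacity, t=2, c=1):
    """Hedged sketch of the integrated mechanism: requests found in the
    quasi-static portion are free hits; all other URLs flow through a
    small LRU-managed dynamic portion. Costs: t per origin fetch (miss),
    c per object brought into cache."""
    dyn = OrderedDict()                   # dynamic portion, recency-ordered
    cost = c * len(static_set)            # assumed one-time static loading cost
    for url in requests:
        if url in static_set:
            continue                      # hit in the quasi-static portion
        if url in dyn:
            dyn.move_to_end(url)          # hit in the dynamic portion
        else:
            cost += t + c                 # miss: fetch and admit to dynamic portion
            if len(dyn) >= dyn_capacity:
                dyn.popitem(last=False)   # evict least recently used dynamic entry
            dyn[url] = True
    return cost
```

The entry policy from the text is enforced by checking `static_set` first, so a URL already cached quasi-statically is never duplicated in the dynamic portion.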
As noted before, changes of the parameter n do not affect LRU performance. Note that the integrated mechanism also outperforms the purely quasi-static mechanism cost of 9486 (refer to Table 7) for the above parameter values. Next we compare the performance of the mechanisms, with results presented in Table 10, by setting n = 24 and retaining the earlier parameter values. In this case the integrated mechanism costs decrease as we increase the quasi-static allocation from 10% to 30%, after which the costs increase for higher quasi-static allocations. The integrated mechanism, which has the least cost of 5349 at 30% quasi-static allocation, improves on the LRU caching policy cost of 12,009 by about 55%. It also performs better than the purely quasi-static
Table 9
Integrated performance for R^past_ij = days 1 to 30, R^current_ij = day 31, n = 48, and varying quasi-static portion; corresponding LRU mechanism cost is 12,009

Parameter values: m = 53, n = 48, k = 10, t = 2, and c = 1

Quasi-static portion %    Integrated mechanism cost    % improvement over LRU
10                        6981                         41.87
20                        7455                         37.92
50                        4974                         58.58
70                        4935                         58.91
90                        5214                         56.58
Table 10
Integrated performance for R^past_ij = days 1 to 30, R^current_ij = day 31, n = 24, and varying quasi-static portion; corresponding LRU mechanism cost is 12,009

Parameter values: m = 53, n = 24, k = 10, t = 2, and c = 1

Quasi-static portion %    Integrated mechanism cost    % improvement over LRU
10                        8784                         26.85
20                        8058                         32.90
30                        5349                         55.46
50                        5385                         55.16
70                        6132                         48.94
mechanism cost of 9428 (refer to Table 7). In order to confirm this result, we test the model performances for the above parameter values on R^past_ij = days 2 to 31 and R^current_ij = day 32. Once again we observe that the integrated mechanism, which has the least cost of 5979 at 30% quasi-static allocation, outperforms both the LRU policy cost of 14,265 and the pure quasi-static mechanism cost of 9923. Therefore, it can be concluded that for the above parameter values, the mechanisms in decreasing order of performance are the integrated mechanism, followed by the pure quasi-static mechanism, and finally the LRU caching policy. The above results demonstrate the advantage of our caching mechanisms over the LRU policy. An effective caching mechanism has many benefits for all Internet users, including reduced network traffic, load on web servers, and web user delays [1,5]. The difference can be immediately apparent to an end user: a website that is cached may seem to load instantaneously, compared to a delay of several seconds in the alternative case. Users appreciate fast-loading websites and tend to revisit them. In addition, Internet companies can save on investing resources in server farms around the world for replicating web content to improve load speeds [7]. We have shown that caching at the proxy level can be improved by exploiting historical re-access patterns. For example, in our tests our integrated mechanism performed better than the LRU policy by more than 50% in terms of total costs (refer to Tables 9 and 10). We have used a portion of a proxy trace dataset for performance testing. Given our test results, we believe that our proxy caching mechanism can significantly reduce delays for web users if it were to be deployed in large-scale networks.

5. Discussion and conclusions

In this study we propose a new proxy-level caching mechanism that takes into account aggregate patterns observed in user object requests. Previous studies have shown that users typically re-access documents on a daily basis.
We exploit this repeated-access pattern in our mechanism by making caching decisions for a specific time interval based on the history of observed requests for the same interval. This forms the quasi-static portion of our mechanism. We then extend the mechanism to include a dynamic policy as well; i.e., current user access patterns are also taken into consideration in determining the objects to be cached. Our integrated caching mechanism, which contains both quasi-static and dynamic policies, can handle not only normal usage patterns but also unanticipated events that generate huge unexpected loads on websites. Hence our approach is more comprehensive than existing mechanisms because it captures both the static and the dynamic dimensions of user web object request patterns. We compare the performance of
both our quasi-static and integrated mechanisms against the popular LRU caching policy. The parametric test results, using a comprehensive IRCache network proxy trace dataset, indicate that our mechanisms outperform the LRU mechanism. Our caching approach should be useful for computer network administrators and online content providers seeking to significantly reduce the delays experienced by web users at the proxy server level. There are a number of interesting avenues for future research, detailed as follows. In this study we have used the most popular requested sites for testing purposes, and the quasi-static mechanism has performed better than the LRU policy. It would be interesting to determine how the mechanisms perform if we were to consider URLs that are not in the top range in terms of request popularity, so that the cost of updating the cache has a smaller effect on the overall cost experienced by the two mechanisms. Currently, any sub-front-page level of requests, or website sections with unique session IDs, are handled by the dynamic portion of our mechanism. An area for improvement would be to develop methods for identifying patterns for these dynamic contents as well. Thus far we have considered a continuous 30-day moving window as an indicator of historical patterns. We may be able to improve performance by considering historical patterns specific to the day of the week. Alternatively, we could collate past data depending on whether the day under consideration is a weekday or a weekend. In the future we also plan to extend our model to include different object sizes, as well as test it against other popular caching mechanisms such as k-LRU, LFU, and Top-10 variants [14,18]. Another interesting area would be to develop an analytical model for our integrated mechanism in order to characterize it and compare it to existing analytical studies of the LRU policy [13].
Finally, an area of extension is to test our mechanism's performance using alternative solution approaches, such as dynamic programming and heuristic-based procedures, which aim to find good solutions quickly. Our approach is the first attempt to adopt a quasi-static mechanism for caching decisions. We have also proposed a novel combination of quasi-static and dynamic schemes. By design, this approach is more comprehensive than existing mechanisms because it captures both the static and the dynamic dimensions of web requests. As our testing results indicate, it is likely to perform better than existing approaches and should prove to be an effective caching strategy.

Acknowledgements

We thank Prabuddha De, Amar Narisetty, Karthik Kannan, seminar participants of Purdue University, and the 2004 Americas Conference on Information Systems (AMCIS) Doctoral Consortium for valuable comments and contributions on this study.

References

[1] C. Cao, S. Irani, Cost-aware WWW proxy caching algorithms, Proceedings of the Usenix Symposium on Internet Technologies and Systems, 1997.
[2] R.I. Chiang, P.B. Goes, Z. Zhang, Periodic cache replacement policy for dynamic content at application server, Decision Support Systems 43 (2007) 336–348.
[3] M. Christ, R. Krishnan, D. Nagil, O. Gunther, R. Kraut, On saturation of web usage by lay Internet users, Carnegie Mellon University Working Paper, 2000.
[4] A. Cockburn, B. McKenzie, Pushing back: evaluating a new behaviour for the back and forward buttons in web browsers, International Journal of Human-Computer Studies (2002).
[5] A. Datta, K. Dutta, H. Thomas, D. VanderMeer, World wide wait: a study of Internet scalability and cache-based approaches to alleviate it, Management Science 49 (10) (2003) 1425–1444.
[6] B.D. Davison, A web caching primer, IEEE Internet Computing 5 (4) (2001) 38–45.
[7] B.D. Davison, Web Caching and Content Delivery Resources, 2007, http://www.web-caching.com.
[8] M.R. Garey, D.S. Johnson, Computers and Intractability, W.H. Freeman, New York, 1979.
[9] K. Hosanagar, Y. Tan, Optimal duplication in cooperative web caching, Proceedings of the 13th Workshop on Information Technology and Systems, 2004.
[10] K. Hosanagar, R. Krishnan, J. Chuang, V. Choudhary, Pricing and resource allocation in caching services with multiple levels of QoS, Management Science 51 (12) (2005) 1844–1859.
[11] C. Kumar, J.B. Norris, A proxy-level web caching mechanism using historical user request patterns, Working Paper, 2007.
[12] P. Lorenzetti, L. Rizzo, L. Vicisano, Replacement Policies for a Proxy Cache, 1996, http://info.iet.unipi.it/~luigi/research.html.
[13] V.S. Mookherjee, Y. Tan, Analysis of a least recently used cache management policy for web browsers, Operations Research 50 (2) (2002) 345–357.
[14] S. Podlipnig, L. Boszormenyi, A survey of web cache replacement strategies, ACM Computing Surveys 35 (4) (2003) 374–398.
[15] L. Rizzo, L. Vicisano, Replacement policies for a proxy cache, IEEE/ACM Transactions on Networking 8 (2) (2000) 158–170.
[16] L. Tauscher, S. Greenberg, How people revisit web pages: empirical findings and implications for the design of history systems, International Journal of Human Computer Studies 47 (1997) 97–138, special issue on World Wide Web usability.
[17] E.F. Watson, Y. Shi, Y. Chen, A user-access model-driven approach to proxy cache performance analysis, Decision Support Systems 25 (1999) 309–338.
[18] D. Zeng, F. Wang, M. Liu, Efficient web content delivery using proxy caching techniques, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 34 (3) (2004) 270–280.
Glossary

n: number of time intervals in a 24-hour period
m: number of objects
i: object (i = 1, …, m)
j: time interval (j = 1, …, n)
k: capacity of the cache
c: cost of caching an object
t: delay to download an object from the web server
xij: decision variable where xij = 1 if object i is cached at time interval j, and xij = 0 otherwise
yij: variable where yij = 1 if caching cost is incurred for xij = 1, and yij = 0 otherwise
R^past_ij: number of past requests for object i in time interval j for a 30-day moving window
R^current_ij: number of current requests for object i in time interval j for the day following the 30-day moving window
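The glossary symbols can be tied together in a small sketch. The actual quasi-static model (DW) is an optimization; as an assumption-laden stand-in we simply cache the k objects with the highest R^past_ij in each of the n intervals, charge c whenever an object newly enters the cache between intervals (y_ij = 1), and charge t for each request missing from the interval's cached set. The function name and input shapes are hypothetical.

```python
def quasi_static_cost(past_counts, day_requests, k, t=2, c=1):
    """Greedy stand-in for the quasi-static decision: per interval j, set
    x_ij = 1 for the k objects with the highest historical counts, then
    evaluate total cost on a new day's requests.

    past_counts[j]  : dict mapping object -> R^past_ij for interval j
    day_requests[j] : list of objects requested in interval j on the new day
    """
    n = len(past_counts)
    # Choose the k most-requested objects per interval (the x_ij = 1 set).
    plan = [set(sorted(past_counts[j], key=past_counts[j].get, reverse=True)[:k])
            for j in range(n)]
    cost, prev = 0, set()
    for j in range(n):
        cost += c * len(plan[j] - prev)   # y_ij = 1: caching cost for new entries
        prev = plan[j]
        # t per request for an object not in this interval's cached set
        cost += t * sum(1 for url in day_requests[j] if url not in plan[j])
    return cost
```

With n = 2, k = 1, and t = 2, c = 1, a plan of {"a"} then {"b"} against requests [["a", "b"], ["b", "b"]] incurs two caching costs and one miss, for a total of 4.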
Chetan Kumar is an Assistant Professor in the Department of Information Systems and Operations Management at the College of Business Administration, California State University San Marcos. He received his PhD from the Krannert School of Management, Purdue University. His research interests include pricing and optimization mechanisms for managing computer networks, caching mechanisms, peer-to-peer networks, ecommerce mechanisms, web analytics, and IS strategy for firms. He has presented his research at conferences such as WEB, WISE, ICIS Doctoral Consortium, and AMCIS Doctoral Consortium. He has served as a reviewer for journals such as EJOR, JMIS, DSS, and JECR. John B. Norris received his PhD from the Quantitative Methods area at the Krannert School of Management, Purdue University. His research interests include web analytics, healthcare management, and decision support tools for student team assignment. He has presented his research at AOM, DSI, INFORMS, and POMS conferences.