Bringing the Web to the Network Edge: Large ... - Semantic Scholar

6 downloads 543 Views 202KB Size Report
When a cache-satellite distribution is used, documents requested for .... Cache [25], and isp-sat [8] have started to offer this cache-satellite distribution service.
Bringing the Web to the Network Edge: Large Caches and Satellite Distribution Pablo Rodriguez

Ernst W. Biersack

Institut EURECOM 2229, route des Crˆetes. 06904, Sophia Antipolis Cedex, FRANCE

frodrigue, erbi [email protected] May 5, 1999

Abstract In this paper we discuss the performance of a Web caching distribution where caches are interconnected through a satellite channel. Web caching is emerging as an important way to reduce client-perceived latency and network resource requirements in the Internet. A satellite distribution is being extensively deployed to offer broadcast services avoiding highly congested terrestrial links. When a cache-satellite distribution is used, documents requested for the first time in any cache are immediately broadcast to all other caches via the satellite. Caches connected to the satellite distribution end up containing all documents requested by a huge community of clients. Thus, the probability that a client is the first one to request a document is very small, increasing the number of requests that are hit in the cache. In this paper we develop analytical models to calculate upper bounds in the performance of a cache-satellite distribution. We derive simple expressions for the hit rate in the caches as well as the bandwidth in the satellite channel and the required capacity of the caches. Based on our model we find that an ISP with a large cache connected to a cache-satellite distribution can theoretically achieve hit rates up to

85%. We calculate that the

satellite channel needs a bandwidth of several Mbps to transmit newly appearing documents as well as document updates. Additionally, we use trace driven simulations to validate our model and evaluate the performance of a real cache-satellite distribution. Our results show that an ISP with a cache of can achieve local hit rates of up to

65%.

64 Gbytes connected to a satellite channel

 Preliminary results of this study were presented at the ACM/IEEE MobiCom’98 Workshop on Satellite-based Information Services

(WOSBIS’98), Dallas

1

Keywords: WWW, Caching, Push, Satellites

1 Introduction The growth of the World Wide Web leads to an overload of popular servers, increase in the network traffic, and high responses to the clients. To alleviate these problems, Web caching is being extensively deployed in the Internet. When a client first requests a document, the request travels to the origin server and a copy of the document is stored in the cache. Further requests for the same document are delivered to the client directly from the cache. Web caching happens at different network levels. Clients have local caches that satisfy multiple requests for the same document (i.e., temporal locality). Local ISPs have institutional caches that satisfy requests for the same document from different clients (i.e., geographical locality). Requests satisfied in the cache are called hits. Requests not satisfied in the cache are called misses. Misses can be classified into:



First-Access: Misses occurring when requesting documents for the first time.



Capacity: Misses occurring when accessing documents previously requested but discarded from the cache due to space limitations.



Updates: Misses occurring when accessing documents previously requested but already expired.



Non-cacheable: Misses occurring when accessing documents that need to be delivered from the origin server (e.g. dynamic documents generated from cgi-bin scripts or fast changing documents)

Even when a cache has an infinite storage capacity, the number of misses in an institutional cache can be very

50 ? 70%) [3] [4].

high (

While non-cacheable documents do not typically account for more than

requests[27][20] and update misses do not account for more than of requests that result in first-access misses (

10% of all

9% of all requests [27], there is a large percentage

30-50%) (See Section 6.1).

One way to reduce the number of first-access misses and update misses is to prefetch the cache; documents are pushed into the cache even if the cache has never requested them. The idea is to get documents in the cache expecting that clients will likely request them. Prepopulating caches has two main inconveniences: additional disk space and network bandwidth. Disk space may not be such a problem because disk capacity is increasing at rate of 60% per year and large disks are becoming cheaper and cheaper [7][15]. A more serious problem is network bandwidth, especially when documents are prefetched through the Internet. If many documents are prefetched through congested links and no client requests them, the bandwidth waste is 2

considerable. An alternative way to prefetch caches with documents is through a satellite distribution. A satellite distribution has fewer losses and congestion problems than a distribution in the Internet. Also, a satellite distribution can reach a very large population with relatively little effort; adding a new additional client does not increase the cost of transmission. A cache-satellite distribution that prefetches documents into Web caches works as follows. Every document request that results in a miss in any ISP, is automatically broadcasted to all the other ISP caches through the satellite channel. With a cache-satellite distribution a cache ends up storing the documents requested by a much larger community, approximately the total number of clients in all caches connected to the satellite distribution. The probability that a single client is the first client asking for a document is very small, thus, the number of first-access misses and update misses is clearly reduced. As more caches join the satellite distribution, the aggregation of clients is higher and the number of misses in an institutional cache is smaller. Even small ISPs with large caches can achieve very high hit rates locally. ISPs do not need to establish manually any cooperation with other ISP caches at the same or at different network levels; the satellite automatically joins a large number of ISPs. Recently several companies like Sky Cache [25], and isp-sat [8] have started to offer this cache-satellite distribution service. In this paper we investigate the feasibility and performance of a cache-satellite scheme. In the first part of the paper we develop analytical models to calculate upper bounds in the performance of a cache-satellite distribution. We derive expressions to calculate the maximum achievable hit rate in an institutional cache connected to the satellite distribution. We find that a cache connected to a satellite distribution can achieve maximum hit rates close to

85%, which are very high compared to the maximum hit rates achieved by a cache not connected to the satellite distribution 30-50%. However, this would require a very large receiver population, infinite disk space, and that the satellite distribution center broadcasts all documents are as soon as they have one request. We analyze the latency experienced when ISPs are connected through a satellite distribution and when ISPs are connected through a caching hierarchy. We find that a cache-satellite distribution clearly reduces the latency to the clients compared to a hierarchical caching distribution because documents are pushed in the institutional caches after the first document request. We calculate the storage capacity needed to keep all documents in the Web. We see that in the year 1998 a

6

cache with Tbytes would be enough to store all Web documents, and a cache with

600 Gbytes would be enough to

store all Web documents requested more than ten times. We also find that sending only those documents requested more than ten times reduces the hit rate in an institutional cache by less than

2%.

Additionally, we calculate the

bandwidth needed at the satellite to transmit all document updates and newly appearing documents. We find that a

30 Mbps is enough to distribute all Web documents in the year 1998. If only those documents requested more than ten times are distributed through the satellite, the bandwidth needed is reduced to 1:5 Mbps.

satellite channel of

3

In the second part of the paper we use trace driven simulations to evaluate the hit rate of an institutional cache that is not connected to the satellite distribution and the hit rate of an institutional cache connected to the satellite distribution. First, we calculate the percentage of first-access misses in several institutional caches and confirm that this number is quite high (

30-50%). We also calculate the maximum achievable hit rate in several institutional

caches when all accessed documents are prefetched into an infinite storage capacity cache. We find that the maximum achievable hit rate is close to

90%, which is the same value than the already calculated in our model. Second, we

consider an institutional cache connected to the satellite distribution and calculate the increase in the hit rate due to

28 Gbytes and a local hit rate when it is not connected to the satellite channel of 30-40%. We find that using a disk space of 20 Gbytes to prefetch documents from the satellite increases the hit rate in the institutional cache by 18-25%. Finally, we simulate that the number

the satellite distribution. The institutional cache has a disk space of

of caches connected to the satellite distribution increases and we find that an ISP with an additional disk space of

16 Gbytes can increase its hit rate by an additional 5%. Thus, an institutional cache connected to a cache-satellite distribution and a disk space of 64 Gbytes can achieve local hit rates up to 65%. When an ISP does not have enough space to prefetch a large amount of documents, filtering policies should be applied to store only those documents that best match the preferences of the ISP clients (i.e., do not store documents from the bottom of document trees, remove documents from sites that no local user has ever visited, etc.). A situation that deserves a special attention is extending a cache-satellite distribution to home-client caches. In this case very efficient filtering policies could be applied. Clients could benefit from local copies of the requested Web documents avoiding high response times due to congested links and slow modem connections. The rest of the paper is organized as follows. Section 2 discusses related work and similar approaches to increase the hit rate. In Section 3 we present the model that will be the base of our analysis. Section 4 derives expressions to calculate upper bounds in the performance of a cache-satellite distribution. In Section 5 we pick reasonable values for the parameters in the model and present some numerical results for the latency, hit rate, disk capacity, and bandwidth. Section 6 uses trace driven simulation to study the performance of a cache-satellite distribution. The results obtained in Section 6 are also used to validate the model developed in Section 3. The paper ends with a conclusion and summary of the results.

2 Related Work Web caching has been recognized as one of the most important techniques to reduce the bandwidth consumption, the latency to the clients, and the servers load. There has been extensive research to increase the number of documents that are hit at the closest possible network level. The number of documents hit in local caches can be increased by

4

increasing the client population connected to the cache. The more clients share the cache, the greater the chance that several clients are interested in the same document. ISPs cooperate and serve each others misses to increase the number of documents that are hit at close network levels. Web caching cooperation was first proposed in the context of the Harvest project [10]. The Harvest group designed the Internet Cache Protocol (ICP) [30], which supports discovery and retrieval of documents from neighboring caches. Today, many caches have established hierarchies of caches that cooperate via ICP [3]. Other approaches to make caches cooperate have been proposed recently, Cache Array Routing Protocol (CARP) [28], the central directory approach (CRISP) [26], Summary Cache [12], Cache Digest [24], the Relais project [19]. All these approaches use different strategies to share meta-information indicating the location of Web documents in Web caches. Adaptive Web caching [31] proposes a multicast-based adaptive caching infrastructure. Requests travel towards the origin server through a series of caches that are automatically selected using a content based-routing protocol. A similar approach is used by cachemesh [29] and cache-routing [17], with routing tables in every cache or at the network level to route requests not satisfied in the cache. Cache cooperation needs a careful selection of the cooperating caches, and in some cases a whole new applicationlayer routing infrastructure with intermediate caches in the path from the local ISP to the origin server. Additionally, requests not hit in the local ISP need to travel to other ISP caches which may be overloaded or connected through congested links. If many documents could be prefetched into huge local caches, most of the requests could be satisfied locally and no intermediate caches are needed. Since the price of disks is dropping dramatically, it is more expensive to download a document twice than storing it into a disk [14]. In [13] [18] documents linked in a Web page are prefetched into clients’ caches connected through modem links. Since clients are connected through modem links, no bandwidth is wasted to prefetch documents. There are other systems that prefetch documents linked in Web pages into Web proxies. Documents are fetched from the origin servers or from parent caches using the Internet [16]. In this paper we present a prefetch strategy that downloads documents requested by a large community into local caches using a satellite distribution, which is bandwidth plentiful and reaches a large population with little effort. A cache-satellite distribution is complementary to any other cache sharing scheme. Document requests not hit in the local cache connected to the satellite channel can be hit in other cooperating caches, which are not connected to the satellite distribution.

3 The Model 3.1 Hierarchical Caching in the Internet As shown in Figure 1, the Internet connecting the server and the clients can be modeled as a hierarchy of ISPs, each ISP with its own autonomous administration. We shall make the reasonable assumption that the Internet hierarchy 5

consists of three levels of ISPs: local, regional, and national. All of the clients are connected to the local ISP; the local ISPs are connected to the regional networks; the regional networks are connected to the national ISP. Caches International Path

National Cache

11 00 00 11 00 11 00 11

Origin Server National ISP

Regional Caches

Regional ISP

Regional ISP

Institutional Caches Local ISP

Local ISP

Local ISP

Local ISP 1 0 0 1

Clients

1 0 0 1

1 0 0 1

1 0 0 1

Figure 1: Internet topology. are usually placed at the access points between two different networks to reduce the cost of transmitting across a new network. As shown in Figure 1, we make this assumption for all three levels. Every local ISP has an institutional cache, every regional ISP has a regional cache, and every national ISP has a national cache. z National cache

International Path

111111 000000

Origin Server O H

Regional caches

1 0

1 0

1 0

1 0

l=H+1

H l=2 Institutional caches

1 0

1 0

1 0

1 0

l=1

......... Clients

l=0

Figure 2: The tree model. We model the underlying network topology in Figure 2 as a full

O-ary tree.

Let

O be the nodal outdegree of the

tree. Let H be the number of network links between the root node of a national ISP and the root node of a regional ISP. We also suppose that H is the number of links between the root node of a regional ISP and the root node of a local ISP. Let z be the number of links between a origin server and the root node of the tree (i.e., the international path). Institutional caches are placed on level national caches are placed at level

1 of the tree, regional caches are placed at level H + 1 of the tree and

2H + 1. There is one national cache. There are OH regional caches and there are 6

O2H institutional caches.

Hierarchical caching works as follows. At the bottom level of the hierarchy there are the

client caches. When a request is not satisfied by the client’s cache, the request is redirected to the institutional cache. If the document is not found at the institutional cache, the request is then forwarded to the regional cache, which in turn forwards unsatisfied requests to the national cache. If a requested document is not found in the cache hierarchy the national cache requests the document directly from the server. When the document is found, either at a cache or at the origin server, it travels down the hierarchy, leaving a copy at each of the intermediate caches.

3.2 Satellite Distribution Given the model of the Internet described on Figure 2, we now consider the situation where institutional caches cooperate via a satellite distribution (Figure 3). Every local ISP has an institutional cache with Internet connection, high storage capacity, and a satellite dish for receiving. Note that there is no more a caching hierarchy (i.e. regional or national caches). There is a master site which has an Internet connection and a satellite transmitter. A cache

International Path Origin Server

Master Internet

l=2 l=1 Satellite Dish

Institutional Caches ......... Clients

l=0

Figure 3: Push distribution from the servers to caches. satellite distribution works as follows. When there is a miss at some institutional cache, the institutional cache obtains the document from the origin server using the HTTP protocol. The institutional cache reports the missed URL to the master site. The master site obtains the document from the origin server and transmits the document into the satellite channel. As a result, all institutional caches receive the broadcasted document and can decide to keep it or not. Institutional caches receive all the documents requested by any client. After a certain period of time, institutional caches with huge storage capacity can end up containing most of the documents in the Web.

7

3.3 Web Documents We denote the total number of documents in the WWW in the year j -th as Nj . Let x be the annual rate at which the number of documents in the WWW is increasing.

Nj = Nj ?1  x: Denote Si , the size in bytes for Web document i,

1  i  Nj . A Web document contains several objects (i.e. html,

images, applets). Web document i expires after an elapsed update interval t following an exponential distribution with average update interval

i .

f (t) =

1 e?t= : i i

Choosing an exponential distribution for document updates we are not favoring the performance of a cache because documents have a higher probability of being stale. When document

i is modified,

not the whole document is

modified if not only a percentage of the document i [21]. Let Ri be the number of requests for document i. Requests from a local ISP for document i in a period t are Poisson distributed with average request rate I;i .

e? t  (I;i  t)r : r! Assuming that requests are uniformly distributed between all O2H institutional caches, there are I;i  O2H P (Ri = rjt) =

I ;i

requests for document i. Let I be the request rate from a local ISP for all

Nj documents , I

= PN

total

i=1 I;i . I is j

Zipf distributed [6] [32] that is, if we rank all Nj documents in the Web in order of their popularity, the i ? th most popular document has a request rate I;i given by

I;i = I where takes values between

 i

0:6 and 0:8 [6], and  is given by N X 1 )?1: =( i j

i=1

We assume that each document is requested independently from other documents, so we are neglecting any source of correlation between requests of different documents. We consider that newly appearing documents will also be Zipf distributed. That is, there will be some new appearing documents that will be very popular but there will also be many other new documents that will be requested by few clients. Let  be the percentage of requests that are for non-cacheable documents (fast-changing documents, cgi-bin, etc).

8

4 Performance Analysis In this section we present analytical models to calculate upper bounds in the performance of a cache-satellite distribution. We derive expressions to calculate the maximum achievable hit rate in an institutional cache connected to the satellite distribution and in an institutional cache not connected to the satellite distribution. We also derive simple expressions to calculate the bandwidth at the satellite needed to distribute all Web documents missed in any local ISP.

4.1 Hit Rate The hit rate is the percentage of requests that meet an up-to-date document in the cache. We assume that the caches have an infinite capacity and therefore documents are not removed from the cache due to storage limitations. First, we analyze the hit rate in an institutional cache not connected to the satellite distribution and then we consider the hit rate in an institutional cache connected to the satellite distribution.

P (L = 1) is the probability to meet a document at the institutional cache. The hit rate in an institutional cache in year j , Hitj is given by Let L be a random variable denoting the number of links traversed to hit a document.

Hitj = where P

Let



(L = 1) is given by

N X I;i j

P (L = 1) =

denote the time into the interval

I

i=1

Z1 0

 P (L = 1)  (1 ? )

(1)

P (L = 1jt)  f (t)  dt

(2)

[0; t] at which a request for document i occurs.

uniformly distributed over the interval. Thus,

P (L = 1jt) =

1  Z t P (Ri > 0j )  d: t

The random variable



is

(3)

0

(Ri > 0j ) is the probability that there has been at least one requests for document i in the interval [0;  ]. For an institutional cache not connected to the satellite distribution P (Ri > 0j ) = 1 ? e?  . Evaluating P (L = 1jt) where P

I ;i

we obtain

P (L = 1jt) = 1 +

e? t ? 1 : I;i  t I ;i

(4)

Combining equations( 2) and ( 4) we obtain

P (L = 1) = 1 +

1

I;i  i

 ln( 1 + 1   ) I;i i

When all institutional cache are connected to the satellite distribution, only the first request for document i out of all requests for the same document in any local ISP sees a document miss. The rest of the requests see a document hit. 9

The hit rate at any institutional cache when O2H institutional caches are connected to the satellite distribution can be calculated by assuming an institutional cache with a client request rate equal to I;i  O2H . Thus, the probability that there has been at least one request for document ? e? O2  .

1

I ;i

i in the period [0;  ] gets closer to one, P (Ri >

0j ) =

H

4.2 Bandwidth Now we calculate the bandwidth needed at the satellite to transmit all newly appearing documents as well as all document updates. The bandwidth needed at the satellite in year j , BWj is given by

BWj =

N X j

bwi

(5)

bwi(t)  f (t)dt

(6)

i=1

where bwi is the bandwidth needed to transmit document i

bwi = and bwi

0

(t) is the bandwidth needed to transmit document i given an update period t. bwi(t) is given by bwi(t) =

where

Z1

1 ? e?

I ;i

Si  (1 ? e? t

I ;i

O2 t H

)

(7)

O2 t is the probability that document i is requested at least once in the period t from any local ISP. H

5 Numerical Results In this section we pick some reasonable values for the different parameters in the model to show upper bounds in the requirements and performance of a cache satellite distribution. The number of documents Nj in the Web at the beginning of the year j =1998 is about 250 millions [5]. The Web is increasing at a rate such that the number of documents get duplicated every year (x HTTP traffic in the Internet backbone (O2H I ) is equal to grow by a factor of

= 2) [5]. We consider that the

1000 document requests per second [7]. This traffic will

2:8 per year (i.e., I ) due to the increasing number of clients, the increasing period of time that

clients are connected, and the increasing bandwidth of clients’ connections [7]. We consider that there is no relationship between the update period of a document and its popularity [6]. Thus, we

i = . We vary  from 1 hour to 20 days [11]. We assume the same document size Si = S for every Web document varying it from 10 KB to 20 KB[3]. We take the percentage of all document requests that are for non-cacheable objects as 10%, i.e.  = 0:1 [20] [27].

assume that all Web documents expire randomly with the same average update period

10

First, we will analyze the latency experienced by clients in local ISPs connected through a caching hierarchy and the latency experienced by clients in local ISPs connected through a satellite distribution. Second, we calculate and upper bound on the hit rate at any institutional cache connected to the satellite distribution and compare it with the hit rate of an institutional cache not connected to the satellite distribution. Third, we calculate the maximum disk capacity of an institutional cache to keep all documents in the Web. Finally, we calculate the bandwidth needed at the satellite channel to distribute all missed documents and also consider some simple filtering schemes.

5.1 Latency Analysis Given the hierarchical topology of the Internet with very congested top levels and much less congested low levels, documents hit at network levels close to the clients have small latencies. To quantify the latency we calculate the expected number of network links that a request travels to hit a document. We analyze the latency in the case that institutional caches cooperate through a caching hierarchy and in the case that institutional caches are connected through a satellite distribution.

= 4, H = 3 and we set the distance between the clients and the metropolitan caches to 1 link. We also set the distance z between the origin server and the national cache to 3 links. Thus, the height of the tree, which is the distance between the clients and the origin server, is equal to 10 links. Let Lj be the number of links traversed We choose O

by the j ?th request to hit an up-to-date document. Thus, we have

E [Lj ] =

X

l2f1;H +1;2H +1;2H +z+1g

l  P (Lj = l)

Lj . The first request j = 1 for a document in an update interval always travels to the origin server, i.e., P (L1 = 2H + z + 1) = 1. In the case that institutional caches cooperate through a caching hierarchy, for j  2, P (Lj = l) is given by We now proceed to calculate the distribution of

8 >> 1 ? (1 ? O12 )j?1 l=1 >> >> >< (1 ? 1 )j?1 ? (1 ? 1 )j?1 l = H + 1 O2 O P (Lj = l) = > >> >> (1 ? 1 )j?1 l = 2H + 1 O >> : H

H

H

(8)

H

In the case that institutional caches are connected through a satellite distribution, for j

 2, P (Lj = 1) = 1. Figure 4

shows the expected number of links traversed by the j -th document request in an update period. We consider a three level caching hierarchy and a cache satellite distribution. In both schemes, the first client asking for a document 11

10

8

Hierarchical Caching Satellite distribution

L

6

4

2

0 0 10

2

4

10

6

10

10

j−th Request

Figure 4: Average number of links traversed by the j -th request before a cache hit occurs in a caching hierarchy and in a cache-satellite distribution. always needs to travel

10 links to hit the document at the origin server.

In hierarchical caching, a document gets

closer to the clients as far as more clients request the document. If there are many requests for the same document (j

> 104), client requests can hit the document at the institutional cache (L = 1).

In a cache-satellite distribution,

on the other hand, the document is pushed to all institutional caches after the first client request. Any other request for the same document (j

> 1), is directly satisfied from the institutional caches. Thus, a cache satellite distribution

clearly reduces the latency to the clients by bringing documents close to them (i.e. to the institutional caches) after the first document request.

5.2 Hit Rate In this section we calculate the hit rate achieved by an institutional cache connected to the satellite distribution as a function of average update period

, and we compare it with the hit rate achieved by an institutional cache not

connected to the satellite distribution. Note that our model for the satellite distribution has been pushed to the limit to get upper bounds in the maximum achievable hit rate. For this purpose we have assumed some idealized scenarios where caches have infinite disk space and all institutional caches are connected to the satellite channel. As a result, we show that the hit rate in an institutional cache connected to the satellite channel can be much higher than the hit rate in an institutional cache not connected to the satellite channel. In a satellite distribution the client populations of each local ISPs are aggregated together to form one huge user population. The greater the user population, the greater the likelihood of repeated requests, and thus the greater the hit rate. Using the analysis presented in Section 4.1 we obtain Figure 5.

12

1 0.9 0.8

Hit Rate

0.7 0.6 0.5 0.4 0.3

− ∆=1 day

0.2

Hit−sat Hit−sat−thres Hit

−− ∆=7 days .− ∆=15 days

0.1 0 0

5

10 Threshold (req/∆)

15

20

Figure 5: Hit rate at a local ISP not connected to a satellite distribution and at a local ISP connected to a satellite distribution . For the satellite distribution, different filtering thresholds are considered. Table 1 summarizes the results of Figure 5 and shows the hit rate for a cache connected to the satellite distribution and for an institutional cache not connected to the satellite distribution. Our model for the hit rate at an institutional



1 day

7 days

15 days

Hit rate no Satellite

36 %

45 %

48 %

Hit rate Satellite

74 %

82 %

85 %

62 %

72 %

76 %

61 %

70 %

73 %

Hit rate Satellite, Ri Hit rate Satellite, Ri

>5

> 10

Table 1: Hit rate for a cache connected to the satellite distribution and for a cache not connected to the satellite distribution. 10% of uncacheable objects cache not connected to the satellite channel gives values ranging from

33% to 53%, which are very similar to the

actual hit rates reported in many institutional caches [3]. For the hit rate at an institutional cache connected to the satellite distribution we obtain hit rates close to

90% (this value remains almost invariable as more Web documents

appear). Given the long-tail distribution of Web documents, there will many documents that are requested only once or less in an update period. Documents requested only once in an update period are not shared by several clients and thus it is useless to place these kind of documents in all the caches. The master site can take the decision to broadcast only those documents that are requested at least once (Ri

>

0) in an update period. In this situation, the hit rate in an

institutional cache connected to the satellite channel remains almost invariable. If the master site wants to filter even 13

more the traffic in the satellite channel and only transmits those documents requested more than ten times (Ri the hit rate at an institutional cache is reduced by less than

2%.

> 10),

5.3 Caches Capacity The price of disks is decreasing faster than the price of the network capacity. Additionally, the capacity of the clients’ disks is increasing at a rate of 60% per year [7] with a baseline of 9 GBytes in 1998 (Figure 6). In a few years it will be easy to find disks with capacities close to TBytes at current prices [15].

Disk Capacity (GBytes)

350 300 250 200 150 100 50 1998

1999

2000

2001 2002 year

2003

2004

2005

Figure 6: Disk Capacity Trend. With such large storage capacities it will be feasible to store a large portion of the Web on several disks at different points in the network. In figure 7 we show the disk capacity needed to store all the Web documents assuming that any Web document is requested at least once in an update interval (Ri

> 0).

The values presented are an an upper

= 10 Kbytes, and S = 20 Kbytes the total number of documents in the Web can be stored in several TBytes (i.e., 10 TBytes in 1999). The bound for the storage requirements in a cache. Taking an average document size of

S

storage capacity to store all newly appearing Web documents needs to be doubled every year, following the rate at which the number of Web documents is increasing. If the master side only transmits those documents requested more than once (Ri

> 1), the needed storage capacity

1:8 TBytes in year 1999 and increases at rate equal to rate at which the HTTP Internet backbone traffic increases (x2:8 per year) (Figure 7(a)). If the master site only transmits those documents requested more than 10 times, the needed storage capacity in the year 1999 drops to 187 Gbytes and increases at a much smaller rate (Figure 7(b)). As we discussed in Section 5.2 sending those documents that have more than 5 requests reduces the hit rate at an institutional cache by 10%. On the other hand, this simple filtering policy saves a lot of disk space in drops to

14

S=10 KB. R >0 i S=20 KB. R >0 i S=10 KB. Ri>10 S=20 KB. R

Disk Capacity (TBytes)

30 25

35 30 Disk Capacity (TBytes)

35

20 15 10 5

25

S=10 KB. R >0 i S=20 KB. R >0 i S=10 KB. R >10 i S=20 KB. R >10 i

20 15 10 5

0 1998

1998.5

1999

1999.5 year

2000

2000.5

0 1998

2001

(a) More than once.

1998.5

1999

1999.5 year

2000

2000.5

2001

(b) More than ten times.

Figure 7: Disk capacity needed to store all Web documents requested at least once and disk capacity needed to store all Web documents requested multiple times.

S = 10 Kbytes, S = 20 Kbytes.

the institutional caches. Currently, Web caches have storage capacities with several tens of Gbytes[23], which is still lower than

180 Gbytes.

Thus, institutional caches should implement more sophisticated filtering policies to reduce the needed storage capacity while hardly affecting the hit rate.

5.4 Bandwidth As we discussed in Section 4.2, the satellite needs send all document updates and newly appearing documents in the Web. Using equation 5 we plot in Figure 8 the necessary bandwidth at the satellite to keep broadcasting newly appearing documents and document updates that are requested at least once in an update interval (Ri

>

0).

We

= 10. From Figure 8(a) we can observe that the bandwidth needed to send documents through the satellite is close to 31 MBps for year 1999 and that keeps increasing with a rate close to 2

take the size of a Web document as S

per year. The rate at which the bandwidth increases depends on the rate at which the HTTP Internet backbone traffic increases and on the rate at which new documents appear in the Web. If all appearing documents were requested at least once in an update period, the bandwidth would increase at the same rate that Web documents appear in the

2

Web (x per year). As discussed in section 5.3 there is a large group of documents that are only requested once or zero times in an update interval. The master site can decide to broadcast only documents that are requested more than once in one update interval and thus reduce the bandwidth used in the satellite. From Figure 8(a) we see that the bandwidth 15

needed when the master site only transmits documents requested more than once (Ri reduced to

>

1) and  = 10 days, is

15 Mbps in 1999. When the update interval  increases, there are fewer documents that do not receive

any request in an update interval. Thus, the needed bandwidth increases as well. A satellite distribution of an analog television channel uses a bandwidth of

27 Mbps.

Thus, with the capacity needed to broadcast one TV channel a

cache-satellite distribution can distribute almost all Web in the year

140 120

300

∆=10 days. R >0 i ∆=30 days. R >0 i ∆=10 days. R >1 i ∆=30 days. R >1

∆=10 days. R >0 i ∆=30 days. R >0 i ∆=10 days. R >10 i ∆=30 days. R >10

250

i

100

BW (Mbps)

BW (Mbps)

1999.

80 60

200

i

150 100

40

50 20 1998

1998.5

1999

1999.5 year

2000

2000.5

1998

2001

1998.5

(a) More than once.

1999

1999.5 year

2000

2000.5

2001

(b) More than ten times.

Figure 8: Required bandwidth at the satellite to transmit new documents and document updates: i) all documents requested more than once in an update interval are transmitted across the satellite, ii) only documents requested multiple times in an update interval are transmitted across the satellite. S

= 10 KBytes.

Figure 8(b) shows the needed bandwidth at the satellite when the master site only transmits those documents requested more than

10 times.

In this case, the needed bandwidth at the satellite in 1999 is equal to

2:2 MBps.

Currently, there is a company (SkyCache [25]) that distributes Web documents into Web caches in the USA and

4

Europe. This company is using a Mbps link to feed Web documents into the subscribed caches and only broadcasts those documents that are requested a certain number of times (i.e., two or three requests). In next section we use logs from caches connected to this cache-satellite service to evaluate the performance of a cache-satellite distribution.

6 Trace Driven Simulations In this section we perform a trace driven analysis to evaluate the hit rate of an institutional cache that it is not connected to a cache-satellite distribution and the hit rate of an institutional cache connected to a cache-satellite distribution. We also use the trace driven results to validate the analytical model developed in Section 5.2.

16

First, we will present the distribution of hits and misses in an institutional cache. Second, we consider an ISP that connects its institutional cache to a cache-satellite channel (i.e., SkyCache [25]) and calculate the increase in the hit rate at the institutional cache. Finally, we simulate that the number of caches cooperating in the satellite distribution increases to obtain the increase in the hit rate at the institutional cache achieved by having more institutional caches cooperate.

6.1 General Distribution of Hits and Misses To investigate the distribution of hits and misses, we analyzed the logs of our institutional cache at Institut EURE-

7 days of logs from the 10th of December to the 17th of December, which include 187; 054 requests. Our institutional cache has 6 GBytes of disk space and connects about 100 clients.

COM [2]. We took

100 90

Max−HIT First−Acces+Capacity HIT HIT−no−check Update Not−Cach

80

Object Rate (%)

70 60 50 40 30 20 10 0 10

11

12

13

14

15

16

17

day

Figure 9: General distribution of EURECOM’s cache entries. Hits are shown with a continuous line, misses are shown with a dashed line. Figure 9 shows the distribution of the object rate for the different days of the logs (results for the byte distribution show a similar behavior). The total hit rate (Hit) ranges between

35% to 50%. We divide the total hit rate into hits

that previously need to poll the server to check if the document has changed, and hits that do not need to poll the server. In Figure 9 we see that the hit rate with no poll check (Hit-no-check) to the origin server ranges between to

35%. The percentage of non-cacheable objects (i.e.,max-age=0, cgi-bin, private, etc.)

never exceeds

20%

10%; this

result has also been confirmed by other studies [27]. Document misses are the sum of first-access misses, capacity misses, and update misses and account for about

9%.

50% of all requests.

We see that the percentage of update misses is

We also observe that first-access misses account for a big percentage of the misses (

30%-50%).

An

explanation of why first-access misses are very high is because EURECOM’s cache has a very small and diverse

17

client population. Next, we calculate the maximum achievable hit rate (Max-Hit). To calculate Max-Hit, we have assume an ideal scenario where institutional caches have an infinite storage capacity and all accessed documents are prefetched in the cache. In this case all requests are hit in the institutional cache, except requests for non-cacheable documents and requests where the client imposes a poll check with the origin server (i.e., reload). This scenario is equivalent to have many institutional caches cooperating via a satellite distribution where the probability that a client is the first client asking for a document is very small. All first-access misses, and update misses are, therefore, hits. Given this

90% (similar results were obtained with the model developed in Section 5.2). Given the hit rates obtained at EURECOM’s cache without a satellite distribution (3050%), a cache-satellite distribution with many cooperating institutional caches could theoretically improve the hit rates in EURECOM’s cache by 40% to 60%. scenario, EURECOM’s cache could experience hit rates up to

6.2 Satellite Distribution To investigate the effect of a cache-satellite distribution we analyzed the logs of an ISP in the USA (AYE [1]) connected to a satellite distribution. This ISP gives access to about

1000 residential and business clients and has an

48 GBytes of disk space. The institutional cache of this ISP is also connected to a satellite distribution (SkyCache [25]) where Web documents are constantly being pushed at a rate of 4 Mbps. This satellite channel is currently joining more than 30 ISPs in the USA, including two national ISPs. Documents pushed are institutional cache with

documents that were missed in other institutional caches connected to the satellite channel. The satellite distribution

2 3

center only broadcasts to the ISPs those documents that receive more than or request, depending on the available satellite bandwidth [25]. We took one week of logs (Dec

18-25), which account for 13; 689; 620 requests.

First, we have analyzed the general distribution of requests in the AYE traces assuming that their cache is not connected to the satellite channel. Then, we calculate the additional hit rate achieved when the AYEs cache is connected to the satellite distribution. Figure 10 shows that the total hit rate at AYE’s institutional cache, when the cache is not connected to the satellite distribution is than in the EURECOM’s trace,

30-40%.

The number of first-access misses is much smaller

25-35%, because the client population is much higher.

If the cache is connected

to the satellite distribution there are many requests that hit the documents previously pushed through the satellite distribution. Calculating the percentage of requests that are misses without a cache-satellite distribution and result in local hits with the cache-satellite distribution, the hit rate is increased by and additional

18-25%. The maximum

possible hit rate (Max-Hit) assuming the idealized case where all documents are prefetched into the institutional caches and caches have infinite disk is about

90-94%. 18

6.3 Scaled Satellite Distribution Now, we proceed to calculate the effect of a cache-satellite distribution when the number of institutional caches connected to the satellite distribution increases. We calculate how many first-access misses and update misses can be hits by adding more ISPs to the satellite channel. For this purpose we have taken seven logs from the four major parent caches, “bo”, “pb”, “sd”, and “uc” in the National Web Cache hierarchy by National Lab of Applied Network Research (NLANR) [3] accounting for

32; 892; 307 requests. These caches are the top-level caches of the NLANR

caching hierarchy and they all cooperate to share their load. We consider the satellite distribution described in the previous section and then we assume that NLANR top-level caches get also connected to the satellite channel. Thus, every time that a client at NLANR requests a document and the document is not found in the NLANR caches, the document is distributed through the satellite to all the other ISPs connected to the satellite channel (i.e., AYEs cache). To simulate the effect of adding more caches to the satellite distribution, we take one day of logs from the AYE cache during one week. For every day we take requests not satisfied in AYEs cache, assuming that AYEs cache is receiving documents from a satellite distribution where NLANR caches are not connected. Then, we assume that NLANR caches are connected to the satellite channel and for every request not satisfied in AYEs cache we parse the previous days of the NLANR logs. If some client in the NLANR caches has previously accessed a document request that was not previously satisfied in AYEs cache, the document request will now result in a local hit in AYEs cache. To model a more realistic situation we consider that the AYE cache is finite and can only keep a certain number of documents pushed through the satellite from NLANR (i.e., one day). This corresponds to a storage capacity of

16

Gbytes. 100 90

Max−HIT HIT First−Acces+Capacity Satellite NLANR Not−Cach Update

80

Object Rate (%)

70 60 50 40 30 20 10 0 18

19

20

21

22

23

24

25

day

Figure 10: General distribution of AYE’s proxy log entries. Hits are shown with a continuous line, Misses are shown with a dashed line.

19

In the absence of MIME headers (i.e., last-modified, expires, etc.) in either NLANR or AYE traces, it is difficult to make sure that a document request in AYE which URL matches one entry in the NLANR logs, is still up-to-date. To consider a document up-to-date we compare the documents size of a request in the AYE logs and in the NLANR logs. Thus, if the URL and the document size of a miss in the AYE logs matches the URL and the document size of any entry in two days of NLANR logs we account for a new document hit. In Figure 10 we show the increase in the hit rate at AYEs institutional cache when all NLANR caches get connected to the satellite distribution in addition to the caches already connected. The hit rate at the AYEs cache would increase

4 5%. Thus, AYEs cache could have a local hit rate close to 65%. Other caches connected to the satellite channel

by a -

would also benefit from more caches getting connected to the satellite distribution. The benefits are more appreciable for a small ISP that does not have a large client population than for a large ISP with many clients and already a high local hit rate. Hit rates close to

65% have also been reported in a caching hierarchy [23][27]. However, in a caching hierarchy a

request needs to travel up in the network through a series of caches until the document is hit. These caches can be very congested or can be connected trough very congested links, in which case clients can experience high response times. On the other hand, a cache-satellite distribution can achieve the same hit rates locally with fast response times, low bandwidth usage, and a simple configurations. However, there is still a big gap between the and the maximum achievable hit rate

65% hit rate

90%. This can be due to the fact that the local cache is only preloaded with

one day of logs from the NLANR caches; if more days would be used the hit rates would increase. Also, we should note that a satellite distribution needs to receive the first request for a document before it is transmitted through the satellite. Thus, if the number of documents that are requested only once is high, the maximum achievable hit rate will be limited to

85% (see Section 5.2).

7 Economical Implications and Future Work A cache-satellite distribution of Web documents into Web caches has important economic implications for both content providers and ISPs. Content providers’ sites will see fewer requests, since many more requests are now serviced by the caches. Thus, a content provider can significantly reduce the rate of its access link connection to the Internet and the capacity of its server. ISP caches are filled with many documents almost for free. Most of the requests get hit in the ISP cache saving expensive bandwidth in the terrestrial links and avoiding the overhead of communicating with neighboring caches. ISPs could reduce the capacity of their access link to other ISPs. If the local caches of home clients are also connected to the satellite distribution, they do not suffer the last mile problem for the Web traffic. Transmitting documents through the satellite to local home-caches is a distribution scheme that

20

will become even more relevant when the Web and the TV get integrated [22]. Due to the broad and large number of documents that are transmitted through the satellite, good filtering mechanisms should be applied to discard certain documents without reducing the hit rate. We are working on different filtering policies that track automatically the preferences of clients connected to a cache and remove those documents that are not likely to be requested.

8 Conclusions Caching is being extensively deployed in the Internet to alleviate the problems related to the exponential growth of the Web. However, caching has a limited performance due to the high number of requests not hit in the cache. Requests not hit in a cache are requests for documents that have been purged from the cache, or requests for noncacheable documents, or requests for documents that no client has fetched before in that cache. Since the price of the disks is dropping very fast and the capacity of the disks is increasing at very high rates, we expect that only few documents will be purged from the caches due to space constraints. Only when the number of multimedia documents and continuous media streams grows, limited disk capacity may become a problem again. Non-cacheable documents may become cacheable if some intelligence is placed in the caches to cope with dynamic content (i.e., rotating banners for advertisements, cookies, etc.) [9]. To reduce the number of requests in a cache that result in first requests for a document, the cache needs to be shared by a large client population. In this paper we analyzed the distribution of hits and misses in several institutional caches and showed that the number of requests that are first requests for a document is usually very high (

30-

50%). A satellite distribution of Web documents into large caches can reduce the number of requests that are first

requests for a certain document almost to zero. The satellite prefetches the caches with all documents requested by a large community of clients. Documents get into the caches almost for free and significantly alleviate the congested Internet links. In this paper we have presented analytical models to evaluate the performance and requirements of a cache-satellite distribution. We also performed a trace driven analysis to show that a satellite distribution of Web documents into caches has big advantages. The satellite channel would need a bandwidth of several Mbps to distribute all appearing Web documents and document updates during year rate that an ISP can achieve with a satellite distribution is close to

1998. The maximum theoretical hit

85%. This result assumes that ISPs have infinite

storage capacity, that there is a high client population connected to the satellite and that no filtering is performed at the satellite distribution center. In a more realistic situation where ISPs have a limited cache (i.e.

64 Gbytes) and

the satellite distribution filters out every document that does not receive more than 3 requests, we show that an ISP connected to the satellite channel can achieve local hit rates up to

21

65%.

9 Acknowledgments The support of P. Rodriguez by the European Commission in form of a TMR (Training and Mobility for Researchers in Europe) fellowship (ERBFMBICT97263 9) is gratefully acknowledged. We would like to thank Keith W. Ross for his help and fruitful discussions and Peter Cohen from SkyCache and Jim Hardley from AYE for the cooperation in the trace driven study. Thanks to NLANR for making their logs available. EURECOM’s research is partially supported by its industrial partners: Ascom, Cegetel, France Telecom, Hitachi, IBM France, Motorola, Swisscom, Texas Instruments, and Thomson CSF.

References [1] “”AYE” Net, USA: http://www.aye.net/”. [2] “Institut EURECOM, Sophia Antipolis, France: http://www.eurecom.fr/”. [3] “National Lab of Applied Network Research (NLANR)”, http://ircache.nlanr.net/. [4] M. Baentsch, L. Baum, G. Molter, S. Rothkugel, and P. Sturm, “World Wide Web Caching: The ApplicationLevel View of the Internet”, IEEE Communications Magazine, pp. 170–178, June 1997. [5] K. Bharat and A. Broder, “A Technique for Measuring the Relative Size and Overlap of Public Web Search Engines”, In Seventh International WWW Conference, Brisbane Australia, April 1997. [6] L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker, “On the Implications of Zipf’s Law for Web Caching”, In 3rd International WWW Caching Workshop, June 1998. [7] E. A. Brewer, P. Gauthier, and D. McEvoy, “The Long-Term Viability of Large-Scale Caching”, In 3rd International WWW Caching Workshop, Manchester, UK, June 1998. [8] Broadcast Satellite Services, “http://www.isp-sat.com”. [9] P. Cao, J. Zhang, and K. Beach, “Active Cache: Caching Dynamic Contents on the Web”, , University of Wisconsin-Madison, 1998. [10] A. Chankhunthod et al., “A Hierarchical Internet Object Cache”, In Proc. 1996 USENIX Technical Conference, San Diego, CA, January 1996.

22

[11] F. Douglis, A. Feldmann, B. Krishnamurthy, and J.Mogul, “Rate of change and other metrics: A live study of the World Wide Web”, In Proceedings of the USENIX Symposium on Internet Technologies and Systems, December 1997. [12] L. Fan, P. Cao, J. Almeida, and A. Broder, “Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol”, Feb 1998, SIGCOMM’98. [13] L. Fan, P. Cao, and Q. Jacobson, “Web Prefetching Between Low-Bandwidth Clients and Proxies: Potential and Performance”, In 3rd International WWW Caching Workshop, June 1998. [14] A. Freedman, “Caching and IP over Satellite”, , isp-sat, 1998. [15] M. L. Gullickson, A. L. Chervenak, and E. W. Zegura, “Using experience to Guide Web Server Selection”, , Georgia Institute of Technology, 1998. [16] K. ichi Chinen, “An Interactive Prefetching Proxy Server for Improvements of WWW Latency.”, In Proc. INET, 1997. [17] U. Legedza and J. Guttag, “Using network-level support to improve cache routing”, In 3rd International WWW Caching Workshop, Manchester, UK, June 1998. [18] T. S. Loon and V. Bharghavan, “Alleviating the latency and bandwidth problems in WWW browsing”, In Proceedings of the USENIX Symposium on Internet Technologies and Systems, December 1997. ´ [19] M. Makpangou and Eric B´erenguier,

“Relais : un protocole de maintien de coh´erence de caches Web

coop´erants”, In Proceedings of the NoTeRe colloquium, March 1997. [20] S. Manley and M. Seltzer, “Web Facts and Fantasy”, In Proceedings of the USENIX Symposium on Internet Technologies and Systems, December 1997. [21] J. Mogul, F. Douglis, A. Feldmann, and B. Krishnamurthy, “Potential benefits of delta-encoding and data compression for HTTP”, In Proc.SIGCOMM, pp. 181–194, Cannes, France, sep 1997. [22] P. Rodriguez, J. Gafsi, and J. Nonnenmacher, “A more Attractive and Interactive TV”, In W3C Workshop on Television and the Web, Sophia Antipolis, France, July 1998, Position Paper. [23] A. Rousskov, “On Performance of Caching Proxies”, In ACM SIGMETRICS, Madison, USA, september 1998. [24] A. Rousskov and D. Wessels, “Cache Digest”, In 3rd International WWW Caching Workshop, June 1998. [25] SkyCache, “http://www.skycache.com”. 23

[26] J. C. Syam Gadde, Michael Rabinovich, “Reduce, Reuse, Recycle: An Approach to Building Large Internet Caches”, In The Sixth Workshop on Hot Topics in Operating Systems (HotOS-VI), May 1997. [27] R. Tewari, M. Dahlin, H. M. Vin, and J. S. Kay, “Beyond Hierarchies: Design Considerations for Distributed Caching on the Internet”, , The University of Texas at Austin, Feb 1998. [28] V. Valloppillil and K. W. Ross,

“Cache Array Routing Protocol v1.1. Internet Draft”, February 1998,

http://ds1.internic.net/internet-drafts/draft-vinod-carp-v1-03.txt. [29] Z. Wang, “Cachemesh: A Distributed Cache System For World Wide Web”, , UCL, June 1997. [30] D. Wessels and K. Claffy, “Application of Internet Cache Protocol (ICP), version 2”, Internet Draft:draftwessels-icp-v2-appl-00. Work in Progress., Internet Engineering Task Force, May 1997. [31] L. Zhang, S. Floyd, and V. Jacobson, “Adaptive Web Caching”, , University of California, Los Angeles, February 1997. [32] G. K. Zipf, Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology, AddisonWesley, Reading, MA, 1949.

24

Suggest Documents