Implementation and Evaluation of an Architecture for Web Search Engine Freshness Vijay Shivshanker Gupta Cisco Systems 250 W. Tasman Dr. San Jose, CA 95134
Roy Harold Campbell Department of Computer Science Univ. of Illinois at Urbana-Champaign 1304 W. Springfield Ave., Urbana, IL 61801, USA
[email protected]
[email protected]
Abstract

Currently, web search engines use crawlers to poll web servers on a per-URL basis to obtain update information. This leads to a cycle time of many weeks before a search engine can revisit a web page. To address this problem, we present the implementation and evaluation of FreshFlow, an architecture for informing search engines about updates occurring at the various web servers on the Internet. First, we demonstrate that implementing FreshFlow incurs a low penalty at an Apache web server. Second, we use trace-driven simulations to show that the algorithm used in FreshFlow performs much better than other naive algorithms. Third, we analyze how FreshFlow is advantageous over an approach in which search engines poll individual web sites for update information.
1 Introduction The WWW has emerged as an important source of information for Internet users. The publicly-visible portion of the WWW is vast – at the time of this writing, there are more than a billion documents on the Web [19]. As a result, it is difficult for an end user to manually discover information on the Web. To alleviate this problem, web search engines index a large portion of the WWW. Users discover information on the Web by posting a query containing relevant keywords to a search engine. The search engine uses its index to respond with a list of web pages that could potentially contain the information that the user is looking for. The service provided by search engines is extremely valuable as demonstrated by the fact that there are billions of search engine queries per month [2]. Currently, search engines obtain update information by probing web servers on a per-URL basis with little help from web servers [30]. As a result of this, (i) it takes up to
six months for a new page to be indexed by popular web search engines [21], and (ii) the data indexed by search engines is often stale [12]. Li [22] and Brewington [7] provide interesting examples of how the importance of much of the information on the Web diminishes with the passage of time. Thus, providing search engines with update information will help them schedule their downloads in a fashion that increases the relevance of their responses to queries. Cho and Garcia-Molina [12] and Brewington and Cybenko [8] have examined various search-engine-based schemes for improving the freshness of search engine repositories. Gupta and Campbell [18] and Brandman et al. [5] proposed that it would be useful if web servers kept track of updates occurring at their web sites and made the list of updates available to search engines. But neither of these proposals provides a practical implementation of its respective scheme. Furthermore, Brandman's proposal results in significant polling overhead at the search engines. To overcome those limitations, this article provides an implementation of the design proposed in [18]. Doing an actual implementation led to significant refinements of the preliminary design proposed in [18]. We first present the enhanced design of the FreshFlow architecture, and the implementation details for an Apache web server. Second, we use trace-driven simulations to compare the FreshFlow algorithm [18] with other intuitively obvious alternative algorithms. Third, we analyze the advantages of the FreshFlow architecture over Brandman et al.'s polling-based approach.
1.1 Motivation for Solution Approach Assuming web server help for search engine freshness, a naive solution to the problem could be as follows: Each time an update to a file occurs at a web
site, the web site immediately transmits the file to all the search engines.
Unfortunately, this solution is not very practical, for a number of reasons. First, bandwidth is one of the most important components of the operating cost of search engines, and it would be prohibitive for search engines to receive push updates from all the web servers in the world. Second, many web servers resort to spamming search engines by including words unrelated to their content. If recency became a metric for page ranks, web servers could spam search engines with insignificant updates. Third, the interest of a search engine in a page is dictated by many factors, such as the size of the search engine repository, the bias towards advertisers, and so on. Thus, not all search engines are interested in each and every page on the Internet.

Given the impracticality of pushing the entire modified file from every web server to every search engine, suppose that instead only the information about the occurrence of the update is transmitted. The problem is that there are numerous competing search engines on the Internet, and it would be administratively cumbersome for web sites to deal separately with each search engine. FreshFlow [18] alleviates this problem by using a centralized entity called the middleman, which receives update information from web servers and becomes a one-stop location for search engines to look for updates. For the current Internet, a single middleman should be sufficient, as shown in Gupta [17]. This middleman could reside within a major ISP (Internet Service Provider), a CDN (Content Distribution Network), or a web-hosting server.

The rest of the article is organized as follows: Section 2 presents an overview of the FreshFlow architecture. Section 3 presents a brief background of the FreshFlow algorithm [18]. Section 4 describes the implementation details for the FreshFlow architecture. Section 5 describes the performance results for an Apache web server that supports an implementation of the FreshFlow architecture.
Section 6 uses results from trace-driven simulation to compare the performance of FreshFlow algorithm with alternative algorithms. Section 6 also makes an analytic comparison between FreshFlow and a pure-polling based approach for the metric of the ‘freshness of knowledge of update information’. Section 7 discusses related work. Section 8 discusses the limitations of and issues with the FreshFlow architecture. Section 9 offers our conclusions.
2 Overview of FreshFlow

An example scenario for using the FreshFlow architecture is depicted in Figure 1. In the figure, z1.com and z2.com are web servers that use the middleman mid1.com. Also shown in the figure are two search engines, altavista and hotbot.

In Figure 2, t1 is a time at which z1.com creates a publicly-accessible HTML page (called an update-list-page) that has the URL "z1.com/add5". We call the event of creating an update-list-page a publication. The update-list-page published at t1 by z1.com lists the URLs of the static web pages (f1 and f2) that were modified since z1.com's last publication. The URL for an update-list-page is called an update-list-URL. After creating this HTML page, z1.com informs mid1.com about the publication by sending the update-list-URL "z1.com/add5". The choice of the publication times can be made using the FreshFlow algorithm [18].

At times t2 and t3, z2.com sends the URLs "z2.com/add31" and "z2.com/add32" respectively. The HTML page for "z2.com/add31" contains the URLs for the files g1 and g2, while the HTML page for "z2.com/add32" contains the URLs for the files g3, g4 and g5. At time t4, z1.com sends the URL "z1.com/add6" to the middleman.

The middleman mid1.com collects the URLs of published updates from both z1.com and z2.com and, in its turn, appends their listing into HTML pages. These HTML pages are called SE-pages and their corresponding URLs are called SE-URLs.1 There is a separate SE-URL (say mid1.com/ROBOTS/hotbot.html) for each participating search engine. To get the list of new updates, a search engine, such as hotbot, sends an HTTP GET request (RFC 2616 [15]) for its SE-URL (mid1.com/ROBOTS/hotbot.html). The HTTP response is the SE-page for hotbot. The SE-page contains the update-list-URLs received by mid1.com since the previous HTTP GET request from hotbot.
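To make the publication step concrete, the following sketch generates an update-list-page and the corresponding notification message. This is our own illustration: the article does not prescribe an exact page format beyond an HTML listing of updated URLs, so the markup and helper names here are hypothetical.

```python
import time

def make_update_list_page(modified_urls):
    # An update-list-page is simply a publicly-accessible HTML page
    # listing the URLs modified since the last publication.
    items = "\n".join('<a href="%s">%s</a><br>' % (u, u)
                      for u in modified_urls)
    return "<html><body>\n%s\n</body></html>" % items

# z1.com publishes the modifications to f1 and f2 as z1.com/add5.
page = make_update_list_page(["http://z1.com/f1", "http://z1.com/f2"])

# The notification to the middleman carries the update-list-URL and
# the page's creation time (seconds since 1 January 1970 UTC).
notification = ("http://z1.com/add5", int(time.time()))
```

Note that the middleman never receives the page contents, only the update-list-URL; search engines later fetch the update-list-page itself from z1.com.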
2.1 Contributions

Recently, there have been proposals for crawler-friendly web servers [5, 18]. Gupta and Campbell [18] presented the design and analysis of the FreshFlow algorithm, and a preliminary design of the FreshFlow architecture. However, first, [18] does not report the implementation details. Second, [18] does not compare the performance of the FreshFlow algorithm with other intuitively obvious alternatives. Third, [18] does not provide an idea of the improvement in the freshness of the knowledge of update information at search engines if one uses FreshFlow instead of an approach in which search engines poll the web servers for update information. This article aims to fill those gaps.

1 SE stands for search engine.

Figure 1: An example scenario for the FreshFlow architecture. Web servers z1.com and z2.com send their update-list-URLs to the middleman mid1.com; the search engines hotbot and altavista issue GET requests for mid1.com/ROBOTS/hotbot.html and mid1.com/ROBOTS/altavista.html and receive their respective SE-pages.

Figure 2: Timeline of sample sequence of events. On z1.com's timeline, modifications to f1 and f2 are published at time t1 as z1.com/add5, and modifications to f3 and f4 at time t4 as z1.com/add6. On z2.com's timeline, modifications to g1 and g2 are published at time t2 as z2.com/add31, and modifications to g3, g4 and g5 at time t3 as z2.com/add32; a later modification to g6 remains unpublished. Each marked event corresponds to a file modification at a web server.

3 Algorithm for Deciding Publication Times at Web Servers

Gupta and Campbell [18] proposed an algorithm for deciding the publication times for update-list-URLs. Since one of the goals of this article is to compare the performance of the FreshFlow algorithm with other intuitively obvious alternatives, we provide a brief outline of the FreshFlow algorithm.

The motivation for the FreshFlow algorithm is that research indicates that the popularity of documents on the Web varies widely. Specifically, popularity follows a Zipf or Zipf-like distribution [6]. Hence, given a content file f, one would like to model the fact that popular files have more importance from the point of view of search engine results. One could use either the probability of access of a file at a web server, or the number of accesses to the file. In practice, the number of accesses is more useful than the probability of access at a web site, because a relatively unpopular file at a popular web site could have more accesses than a popular file at an unpopular web site.

Definition 3.1 Weight of a File. The weight, w(f, t), of a file f at time t is defined to be the number of accesses to file f. Since a file that is popular over a short duration might become unpopular later on, we age the weight of the file with the passage of time.

Definition 3.2 Last Modification Time. The last modification time, lm(f, t), of a file f at time t is the most recent time before t when f got updated.

3.1 The Cost Model
We present a cost model that drives the working of the FreshFlow algorithm. The first component of the cost model, called connection cost (CC), models the fact that, for scalability reasons, the middleman would like to receive as few updates as possible. The other component is a cost called opportunity cost (OC). The OC accounts for the staleness of the search engines' knowledge of updates occurring at web servers. The reason for calling it opportunity cost is that we forgo the opportunity of having more accurate knowledge at search engines for a reduced cost of publication of updates.

Connection Cost (CC)

Definition 3.3 Connection cost for transmitting an update-list-URL. This cost is the same whether there is one updated file or several updated files at the web server. Hence, we state the total cost for all pending updates as:

    CC(t) = K * I(t)                                        (1)

where K is a configurable value for each web server, and

    I(t) = 1 if there is a pending update at time t,
           0 if there is no pending update at time t.

In our implementation and simulation, we used a K of 60 million access-seconds. Here access is the dimension for "number of accesses". Strictly speaking, "number of accesses" is dimensionless, but for ease of understanding, we assume otherwise. Different web servers can choose different values of K, depending on how aggressive they want to be in publishing their updates. The smaller the value of K, the more aggressive a web server is in the publication of its updates.

Opportunity Cost (OC)

Definition 3.4 Opportunity cost (OC_u). Given an unpublished update u, occurring at time t_u, to a content file f, we define the opportunity cost for update u at time t as

    OC_u(t) = w(f, t) * (t - t_u)                           (2)

Let U(t) denote the set of unpublished updates at time t. Using Definition 3.4, we have the following definition for the opportunity cost of all the unpublished updates:

    OC(t) = sum over u in U(t) of OC_u(t)                   (3)

The Cost Function (C)

The cost function for unpublished updates at time t is given by:

    C(t) = CC(t) + OC(t)                                    (4)

3.2 FreshFlow Algorithm

In brief, the proposed algorithm is as follows: We assume that a web server has statistics on the popularity of the documents it serves. When OC(t) equals K at any time t, the web server can publish the updates, and send a message to the middleman. For the cost model presented above, the cost incurred by the FreshFlow algorithm is no more than twice the cost incurred by an optimal algorithm. The analysis and details of the algorithm are reported in Gupta and Campbell [18].

Figure 3: A process-level view of FreshFlow. On the web server machine, the Apache httpd transfers its CustomLog to the crawler-help program, which notifies the middleman of the creation of each new file containing a batch of updates. On the middleman machine, an Apache httpd transfers its log to the middleman program, and serves HTTP requests from search engines for SE-pages.

4 Implementation

Section 2 discussed how the FreshFlow architecture works. In Figure 3, we show the process-level view of FreshFlow. The architecture has been deliberately kept as simple as possible to make it easy for webmasters and search engine operators to deploy. In other words, a simple architecture provides a low-resistance migration path for webmasters and search engine operators. In the rest of this section, we provide the implementation details for each of the different types of entities involved, namely the web server, the middleman, and the search engine.

4.1 Web Server
We based our design on the Apache web server. Apache currently has a penetration of more than 60% of all the web servers in the world [26], and hence we feel that it is a good choice. When an Apache server (called httpd) is started, it examines a configuration file called httpd.conf. For our implementation, the only Apache-related modification was attaching our publish program (which we henceforth call crawler-help) to httpd. This was done using the TransferLog directive with a custom log format inside the httpd.conf file. We used a custom log format (via the CustomLog directive) because the common log format (CLF, described in [20]) does not provide file path names, but we need the latter for doing stat to detect file modifications.

The crawler-help program does not send the URL for an updated file to the middleman as soon as a modification is detected. Instead, it batches the updates into an update-list-page using the FreshFlow algorithm. After creating an update-list-page, the web server sends a message to the middleman. The contents of the message include (i) the update-list-URL, and (ii) the time (in seconds since 1st January 1970 UTC) at which the HTML page containing the update list was created.

To avoid burdening the file system, we rejected the option of periodically polling all the content files for possible updates. Instead, following Liu and Cao [23], we check for an update to a content file only when there is an access to that content file. Most modern operating systems, such as Solaris [29] and Windows NT [27], have tools for auditing updates, to facilitate intrusion detection. If the audit file could be made available to the crawler-help program, then it could detect updates to files even when there are no accesses to the files. For garbage-collecting the update-list-pages, [17] argues that it should be sufficient to delete published files whenever the Apache log files are deleted.
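The access-triggered detection step can be sketched as follows. This is our own illustration, not the paper's code: it assumes Apache is configured to pipe a custom-format log (whose entries contain the served file's path) into crawler-help, which then stats a file only when it is accessed.

```python
import os

last_mtime = {}   # last-seen modification time per content file
pending = []      # updated file paths awaiting the next publication

def on_access(path):
    # Following Liu and Cao, stat() a content file only when a
    # request for it appears in the piped log.
    try:
        mtime = os.stat(path).st_mtime
    except OSError:
        return
    if path in last_mtime and mtime > last_mtime[path]:
        pending.append(path)   # modification detected; batch it
    last_mtime[path] = mtime

# In deployment, crawler-help would read Apache's piped log from
# stdin, one file path per request line:
#     for line in sys.stdin:
#         on_access(line.rstrip("\n"))
```

The pending list is the batch that the FreshFlow algorithm later turns into an update-list-page.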
4.2 Middleman The design at the middleman follows the same pattern as that at the web server. In particular, the middleman consists of an Apache web server and a helper program called middleman. The helper program is spawned using the TransferLog facility in the Apache web server. The middleman program listens for update-list-URLs, and creates SE-pages for search engines. The mechanism for garbage-collecting SE-pages is described in [17]. In brief, whenever the middleman serves an SE-page to a search engine, the middleman deletes the old SE-page for that search engine, and starts creating a fresh SE-page for that search engine.
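The middleman's bookkeeping can be sketched as below (again our own illustration; in the real implementation the SE-pages are files served by Apache, and the helper program is spawned via the TransferLog facility):

```python
# One accumulating SE-page (a list of update-list-URLs) per
# participating search engine.
se_pages = {"hotbot": [], "altavista": []}

def on_update_list_url(url):
    # An update-list-URL received from a web server is appended to
    # every search engine's pending SE-page.
    for engine in se_pages:
        se_pages[engine].append(url)

def serve_se_page(engine):
    # Serving an SE-page also garbage-collects it: the old page is
    # discarded and a fresh one starts accumulating.
    body = "\n".join('<a href="%s">%s</a>' % (u, u)
                     for u in se_pages[engine])
    se_pages[engine] = []
    return "<html><body>\n%s\n</body></html>" % body

on_update_list_url("http://z1.com/add5")
on_update_list_url("http://z2.com/add31")
hotbot_page = serve_se_page("hotbot")
```

After hotbot's GET, its SE-page starts empty again, while altavista's SE-page still holds both update-list-URLs until altavista polls.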
4.3 Search Engine

A search engine uses a crawler to automatically download pages from the Web. The crawler starts with a set of seed URLs (e.g., www.yahoo.com) [9]. It downloads these pages, extracts hyperlinks, and crawls the pages pointed to by the new hyperlinks, repeating this step indefinitely. Every downloaded page needs to be refreshed, so the crawler splits its resources between crawling new pages and checking whether previously crawled pages have changed. In other words, search engines use a "pull" approach to gather update information. FreshFlow retains this pull approach, and hence it is not necessary to modify the search engine design.

One requirement for a participating search engine is to make its SE-URL a seed URL for crawling the Web. Another requirement is that when a search engine downloads an SE-page from a middleman, it only needs to crawl that page to a depth of two. Search engines already have some built-in depth requirement for seed URLs, and hence this constraint is easily implementable in practice. Search engines would need to crawl their SE-URL at a higher rate than other URLs. For example, Google has a facility called "Google Custom SiteSearch" [16] for downloading content at different frequencies, such as one day, one week or one month.
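The depth-two rule can be pictured with a toy link graph (the fetch function below is a stand-in for an HTTP GET plus hyperlink extraction; the URLs follow the running example):

```python
def crawl_se_url(fetch, se_url):
    # Depth 1: the SE-page lists update-list-URLs.
    # Depth 2: each update-list-page lists the modified content
    # URLs, which are the pages worth re-downloading.
    to_redownload = []
    for update_list_url in fetch(se_url):             # depth 1
        to_redownload.extend(fetch(update_list_url))  # depth 2
    return to_redownload

# Toy link graph standing in for the Web.
links = {
    "mid1.com/ROBOTS/hotbot.html": ["z1.com/add5", "z2.com/add31"],
    "z1.com/add5": ["z1.com/f1", "z1.com/f2"],
    "z2.com/add31": ["z2.com/g1", "z2.com/g2"],
}
stale = crawl_se_url(lambda u: links.get(u, []),
                     "mid1.com/ROBOTS/hotbot.html")
```

Crawling deeper than two levels from the SE-URL would only revisit content reachable through the normal crawl anyway.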
5 Apache Web Server Performance with crawler-help
For the web server, we used Apache (version 1.3.12). In our experiments, the web server was hosted on an UltraSparc-2 (Solaris 5.7) machine with 256 MB of memory. The two client machines used in the experiments were similar. All these machines were connected by a 100 Mbps LAN. The choice of such slow machines was dictated by the fact that the only other machines available to us were dual-processor UltraSparc-60s. It was not possible to conduct repeatable experiments on the latter since they are time-shared machines.

The objective of this performance study was to assess the performance penalty of using the crawler-help program, as compared to a plain Apache server, under moderate and heavy loads. We used the wc_day38_1 file from the World Cup traces [3, 31].2 The experimental details for the trace-driven workload generation are provided in Gupta [17]. We used and modified various tools to act as clients, but ended up using wget for performing the measurements. The format of the command used at the client was: wget -t 0 -O /dev/null -q -i urlsfile --ignore-length.

As shown in Table 1, for moderate CPU loads, the use of crawler-help does not inflict any penalty on the throughput of the Apache web server. But for heavy CPU loads, we observed a penalty of 10% for this experiment. Note that while we call 133.4 fetches per second moderate, it actually corresponds to several million requests per day (even when accounting for the burstiness of web server loads).

Load      Number of  CPU load on    Avg. fetch    Total bytes  Fetches/sec     Fetches/sec with  Performance
character clients    plain Apache   size (bytes)  fetched      (plain Apache)  crawler-help      penalty
Moderate  1          approx. 50%    7300          5.677 GB     133.4           133.4             0%
Heavy     2          saturated      7300          5.677 GB     263.0           236.0             10%

Table 1: Performance penalty at Apache web server for wc_day38_1 traces because of using the crawler-help program.

2 The World Cup traces are extremely voluminous (9 GB compressed, 25 GB uncompressed), in a custom format which is more concise than the CLF [3]. Given the hardware at our disposal, it would have been impossible to analyze all of them.

The reader is cautioned that web server performance evaluation, by itself, is a difficult topic [4], and the approach in this article might not be the best if the primary objective is to evaluate representative loads on a web server.
6 Evaluation of FreshFlow

6.1 Comparison with Other Algorithms

The FreshFlow algorithm is somewhat different from the CPU scheduling and page replacement algorithms used in operating systems, and is also different from the scheduling algorithms used for packet scheduling in network switches. Gupta and Campbell [18] proved that, for a particular cost model, FreshFlow is no more than twice as bad as an optimal algorithm. Yet the practitioner will be eager to learn the advantages of the FreshFlow algorithm over intuitively obvious alternative algorithms. The alternative algorithms that we examine are:

Periodic: Web sites publish updates periodically.

Random: The interval between two publications by a web site is distributed uniformly at random in some interval.

UpdatePeriodic: This is similar to Periodic, except that it is periodic in the number of updates rather than being periodic in time.

To compare the different algorithms, we performed trace-driven simulation on wc_day38 from the World Cup web site, and on three different web servers in the Computer Science department at the University of Illinois at Urbana-Champaign. The relevant details of the sample logs are listed in Table 2.

There are arbitrarily large spaces from which one can choose the time period for Periodic and Random, or the number of updates that determines the period for UpdatePeriodic. To make the comparison fair, we assume that each web site has some quota Q on the number of connections allowed per day. For Periodic, one connection is made every 86400/Q seconds. For Random, the inter-connection period is distributed uniformly at random in the interval [0, 2 * 86400/Q] seconds. For UpdatePeriodic, given that the average number of updates per day is m, a connection is made after every m/Q updates.

Recall that in the FreshFlow algorithm, a connection is made when the opportunity cost (OC) passes a threshold. To determine Q, we first simulated the FreshFlow algorithm with a threshold of 60 million access-seconds on the sample traces. If OC stayed below the threshold even after 1 day had passed since the last publication, and if there were pending updates, then a publication was made at least once every day. From this, we found the total number of connections made by FreshFlow for each of the sample traces. Then we divided these numbers by the respective durations of the traces. Thus we obtained a different quota for each site. We feel that doing so helps one make a fair comparison between the different algorithms. The value of the update period in UpdatePeriodic for each of the traces was based on the total number of updates during the trace period, divided by the number of connections made by the FreshFlow algorithm.

In Tables 3 and 4, we compare the performance of the FreshFlow algorithm with the performance of the Random, Periodic and UpdatePeriodic algorithms. As can be seen from the tables, there is not much difference in the average OC between Periodic, Random, UpdatePeriodic and FreshFlow. However, FreshFlow does significantly better than the other algorithms on the metric of the standard deviation of OC. A lower standard deviation of OC is beneficial from the point of view of the web server.
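The simulation setup can be sketched as below. This is a simplified, self-contained version (synthetic update stream, a small arbitrary threshold, unit time steps) rather than the actual trace-driven simulator, but it implements the same four publication policies:

```python
import random

def simulate(policy, events, horizon, K=60.0, period=None, quota=None):
    # Replay (time, weight) update events; return (connections made,
    # time-averaged opportunity cost), where
    # OC(t) = sum over pending updates of weight * (t - update_time).
    events = sorted(events)
    pending, conns, oc_sum = [], 0, 0.0
    next_pub = period
    upds_since_pub = 0
    for t in range(horizon):
        while events and events[0][0] <= t:
            pending.append(events.pop(0))
            upds_since_pub += 1
        oc = sum(w * (t - t0) for t0, w in pending)
        oc_sum += oc
        if policy == "freshflow":
            publish = oc >= K                  # threshold rule
        elif policy in ("periodic", "random"):
            publish = t >= next_pub            # time-driven rule
        else:                                  # "updateperiodic"
            publish = upds_since_pub >= quota  # update-count rule
        if publish and pending:
            pending, upds_since_pub, conns = [], 0, conns + 1
            if policy == "periodic":
                next_pub = t + period
            elif policy == "random":
                next_pub = t + random.uniform(0, 2 * period)
    return conns, oc_sum / horizon

random.seed(0)
stream = [(random.uniform(0, 1000), random.choice([1, 2, 5]))
          for _ in range(200)]
results = {p: simulate(p, stream, 1000, period=50, quota=10)
           for p in ("periodic", "random", "updateperiodic", "freshflow")}
```

The period of 50 and quota of 10 mirror the fairness constraint in the text: each policy is tuned to roughly the same connection budget before the average and spread of OC are compared.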
6.2 Comparison with Polling-based Architecture We were curious to evaluate the performance advantages of the FreshFlow architecture over Brandman et al.’s approach [5] in which search engines directly poll web servers for update information. Specifically, we wanted to evaluate the metrics, “Number of messages for discovering updates”, and “Freshness of knowledge about updates”.
Trace    Origin Server            Trace Period                  # of       Bytes     # total   Avg. #       Avg. #    Avg. #
Name                                                            requests   served    modifns   modifns/day  msgs/day  modifns/msg
CS1      www.cs.uiuc.edu          Jan 2001 (31 days)            718,694    3.44 GB   21,007    677          15        45
CS2      www.cs.uiuc.edu          Feb 2001 (28 days)            718,263    3.24 GB   22,605    807          15        53
CSIL1    www-courses.cs.uiuc.edu  Jan 2001 (31 days)            2,233,823  20.19 GB  115,629   3729         16        233
CSIL2    www-courses.cs.uiuc.edu  Feb 2001 (28 days)            3,893,088  30.45 GB  220,640   7880         30        262
Choices  choices.cs.uiuc.edu      16:50 27 Jun 2000 to
                                  14:13 21 Mar 2001 (267 days)  1,428,471  25.99 GB  91,035    341          3         113
WC38     World Cup 1998           May 6th, 1998                 7,188,041  37.29 GB  41,954    41,954       524       80

Table 2: Details of traces used, and the number of modifications per message to the middleman. "Avg. # msgs/day" is the average number of messages per day that the web server in question needs to send to the middleman as per the FreshFlow algorithm. "Avg. # modifns/msg" is the average number of updates that occur at the corresponding web server per message sent to the middleman by that web server.

                     CS1                         CS2                         CSIL1
Algorithm        Conns  Avg. OC  NSD of OC   Conns  Avg. OC  NSD of OC   Conns  Avg. OC  NSD of OC
Periodic         462    63M      60.51       418    78M      76.82       475    102M     45.73
Random           471    80M      132.93      429    97M      171.39      482    137M     95.54
UpdatePeriodic   467    58.5M    54.59       419    73M      58.14       482    60.29M   25.31
FreshFlow        431    60.5M    1.00        437    60.5M    1.00        482    60.29M   1.00

Table 3: Comparison with Periodic, Random and UpdatePeriodic algorithms for CS1, CS2 and CSIL1 traces. Conns is the number of connections made to the middleman. NSD refers to Normalized Standard Deviation: the value of the standard deviation divided by the standard deviation for the FreshFlow algorithm. Here, OC refers to OC(t).

                     CSIL2                       Choices                     WC38
Algorithm        Conns  Avg. OC  NSD of OC   Conns  Avg. OC  NSD of OC   Conns  Avg. OC  NSD of OC
Periodic         838    73M      287         798    41M      12.03       527    77M      198.35
Random           839    95M      1123        797    55M      29.44       533    101M     431.92
UpdatePeriodic   818    60M      256         796    42M      13.46       525    66M      106.79
FreshFlow        818    60M      1           619    60M      1.00        525    60M      1.00

Table 4: Comparison with Periodic, Random and UpdatePeriodic algorithms for CSIL2, Choices and WC38 traces.

6.2 Comparison with Polling-based Architecture

We were curious to evaluate the performance advantages of the FreshFlow architecture over Brandman et al.'s approach [5], in which search engines directly poll web servers for update information. Specifically, we wanted to evaluate the metrics "Number of messages for discovering updates" and "Freshness of knowledge about updates".

Let S be the set of sites that participate in any scheme for supporting web search engine freshness. Let the size of the set S be n, and let the sites be numbered from 1 to n.

6.2.1 Number of Messages for Discovering Updates
Suppose that in the FreshFlow architecture, a search engine makes p probes per day at the middleman. With a polling-based approach, the search engine instead has to make at least n probes per day (each to a different site) to get information about all the updates occurring at participating web sites with a maximum delay of 1 day. Thus, using FreshFlow, the number of probes that a search engine has to make is reduced by a factor of n/p. Since n could be of the order of a few million, and p could be as small as one thousand, using FreshFlow could result in a potential savings of up to three orders of magnitude.

6.2.2 Freshness of Knowledge about Updates
If we had the luxury of performing a very large number of probes every second, then even a naive polling approach would work well and provide good freshness. Hence, we take the view that the number of probes per second in the polling approach is reasonably bounded when compared to the number of participating web sites on the Internet. To compare FreshFlow with the polling approach, we have to specify the scheduling algorithm used at search engines to decide the order for polling the web sites. We consider two scheduling algorithms: round robin (RR), and weighted fair queueing (WFQ) [13]. In RR scheduling, in each round, each web site is polled exactly once. In WFQ scheduling, we assume that at any given instant, the probability of a site being polled is proportional to the popularity of that site.

Definitions

Definition 6.1 Opportunity cost at search engine for all participating web sites (FreshFlow approach).

    OC_SE = sum over i of avgOC_i

where avgOC_i is the average value of OC_i(t) for site i.
Definition 6.2 Popularity of a web site. Given web site i, the popularity of the site is denoted by pop_i. The pop_i's constitute a probability distribution.

Definition 6.3 Average Staleness for a given web site. Given web site i, let Stale_i denote the average staleness at the search engine with respect to site i.

    Stale_i = c * pop_i * T_i                               (5)

where T_i is the average inter-arrival time between probes at site i, and c is a constant that accounts for the discrepancy that OC uses accesses to documents while staleness uses the popularity of sites.

Note that the average staleness will depend on the architecture, the scheduling algorithm, and the number of messages used for propagating update information. The motivation for defining Stale_i using pop_i is that, for scalability reasons, polling has to be performed on a per-site basis, rather than on a per-page basis.

Definition 6.4 Average Staleness at search engine for all participating web sites (Polling-based approach).

    Stale_poll = sum over i of Stale_i
In the rest of the discussion, we first state the value of staleness for the RR and WFQ scheduling algorithms. Thereafter, we find the condition for FreshFlow to be advantageous over a polling-based approach.

Stale_poll for WFQ and RR

Let s be the number of search engines; n, the number of participating web sites; and r, the number of sites polled per second. Then T_i^WFQ, the average inter-arrival time (at site i) between successive polls for the weighted fair queueing scheduling algorithm, equals 1/(s * r * pop_i); and T_i^RR, the average inter-arrival time (at site i) between successive polls for the round robin scheduling algorithm, equals n/(s * r). Using simple algebra, it follows that in the polling-based approach, Stale_poll is identical for these two popular scheduling algorithms, with a value of c * n/(s * r) [17].
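Using the per-site staleness Stale_i = c * pop_i * T_i, a quick numerical check confirms that RR and WFQ give the same aggregate staleness (the popularity values and parameters below are arbitrary choices for illustration):

```python
c, s, r = 2.0, 3, 100.0    # constant, search engines, sites polled/sec
pop = [0.5, 0.25, 0.125, 0.0625, 0.0625]   # popularities; sums to 1
n = len(pop)

# RR: every site is polled once per round of n polls -> T_i = n/(s*r).
stale_rr = sum(c * p * (n / (s * r)) for p in pop)

# WFQ: site i is polled with probability pop_i -> T_i = 1/(s*r*pop_i).
stale_wfq = sum(c * p * (1.0 / (s * r * p)) for p in pop)

# Both sums collapse to c*n/(s*r), independent of the distribution.
```

In the RR sum the pop_i's sum to 1, while in the WFQ sum each pop_i cancels against 1/pop_i, so both reduce to c * n/(s * r).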
Condition for FreshFlow to be Advantageous over the Polling-based Approach

With FreshFlow, the opportunity cost at each site is bounded by the threshold K, so OC_SE is at most n * K. For FreshFlow to be better than polling with RR or WFQ, we need the following relationship:

    n * K < c * n/(s * r), i.e., K < c/(s * r)

Let K_0 = c/(s * r). If we set K = K_0/alpha (for alpha >= 1), then FreshFlow would be alpha times as good as a polling-based approach. This result also shows how to adjust K for changing the aggressiveness of FreshFlow.

7 Related Work
The problem of maintaining freshness of data at web servers has been previously studied by Cho and Garcia-Molina [12], and by Brewington and Cybenko [8]. Both assumed that web search engines poll the web sites on a per-URL basis to obtain update information. Brandman et al. [5] and Gupta and Campbell [18] independently proposed the use of crawler-friendly web servers. But neither [5] nor [18] provides an implementation.

Recent work on continuous consistency models for replicated services [32] argues that there exist applications which can benefit from bounding the maximum rate of inconsistent access in an application-specific manner. The particular applications studied were a bulletin board, an airline reservation system, and distributed web servers. FreshFlow also uses numerical methods for improving freshness at search engines, but there are differences: (i) the number of replicas in FreshFlow is much less than that in the work on continuous consistency models, and (ii) the scale of FreshFlow's target environment is much larger than those considered in the work on continuous consistency models.

Herald [10] and SCRIBE [28] are designs for large-scale event notification on the Internet. Distribution of update information could benefit from an event notification infrastructure. However, neither Herald nor SCRIBE demonstrates its applicability to any practical problem. Specifically, they do not examine the applicability of their designs to the problem of web search engine freshness.
8 Discussion

Our implementation addresses modifications to static web pages, but does not currently handle modifications to dynamically-created web pages. Researchers have addressed the issue of detecting change to dynamic web data using application-specific techniques [11]. So, web servers could use application-specific methods to notify crawler-help about changes to dynamic web pages.

Many popular web sites make use of content distribution networks (CDNs) such as Akamai [1]. The use of CDNs results in multiple URLs for the same resource. Our current implementation does not address this, although the crawler-help program could be made to listen for batched updates from CDNs, just as the middleman program listens for batched updates from the crawler-help programs at the various web servers.

A search engine may not be able to download all the reported updates. The number of pages that a search engine can download depends on the bandwidth of its connection to the rest of the Internet. The exact heuristics that each search engine will use for selecting pages for download are beyond the scope of the implementation reported in this article. The crawler-help program could aid search engines in this decision-making process by providing the size of the delta between the current version and the previous version.

Centralizing the middleman service makes the middleman a tempting target for denial-of-service attacks. The idea of having a single middleman for the entire Internet is somewhat similar to the idea of having a few root servers for DNS (Domain Name System) [14] on the Internet. As pointed out by Cerf [14], currently the 13 root servers are not immune to denial-of-service attacks. Thus the security issues with centralization are not restricted to FreshFlow, but also exist for one of the most widely used services on the Internet.
Gupta [17] identifies some possible approaches to securing centralized entities, but security issues are beyond the scope of this article.

Another problem is that web servers could send spurious update-list-URLs to the middleman and consequently frustrate the search engines. Gupta [17] provides details on authenticating messages from millions of web servers at a middleman, but we omit that discussion since it uses relatively standard techniques.

Centralizing the middleman also creates a single point of failure. Fortunately, the nature of the problem is such that it is straightforward to distribute the middleman functionality across multiple sites using the primary-backup approach [24] or the state-machine approach [25], as discussed in [17].
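To make the "relatively standard techniques" concrete, one conventional way a middleman could reject spurious update-list-URLs is to require each registered web server to tag its reports with an HMAC computed under a per-server shared key. The key store, message format, and function names below are assumptions for illustration, not the scheme actually described in [17].

```python
import hmac
import hashlib

# Illustrative sketch of authenticating update-list-URL reports at the
# middleman using a per-server shared key and HMAC-SHA256. The key
# distribution mechanism is out of scope here and assumed to exist.

KEYS = {"www.example.org": b"per-server-secret"}  # hypothetical key store

def sign_report(server, update_list_url, key):
    """Compute the tag a web server would attach to its report."""
    msg = f"{server}|{update_list_url}".encode()
    return hmac.new(key, msg, hashlib.sha256).hexdigest()

def verify_report(server, update_list_url, tag):
    """Middleman-side check: accept only reports from known servers
    whose tag verifies under that server's key."""
    key = KEYS.get(server)
    if key is None:
        return False  # unknown server: drop spurious reports
    expected = sign_report(server, update_list_url, key)
    return hmac.compare_digest(expected, tag)

tag = sign_report("www.example.org", "/updates/latest.txt", KEYS["www.example.org"])
print(verify_report("www.example.org", "/updates/latest.txt", tag))  # → True
print(verify_report("www.example.org", "/updates/other.txt", tag))   # → False
```

Using `hmac.compare_digest` rather than `==` avoids timing side channels; a deployment would also need replay protection (e.g. a timestamp in the signed message), which is omitted here.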
9 Conclusions

Currently, web search engines repeatedly query web servers for updates on a per-URL basis. This leads to a cycle time of weeks before a search engine can check whether a web page has changed. In this article, we described the implementation and evaluation of FreshFlow – an architecture that gives search engines quick access to update information. To the best of our knowledge, this is the first reported implementation that facilitates search-engine freshness using web server help. We showed how an Apache web server can propagate update information to search engines without modification of the Apache source code. Furthermore, FreshFlow preserves the way search engines obtain their update information, namely the pull approach.

Experimental results indicate that providing FreshFlow functionality imposes no performance penalty (in terms of the number of fetches per second) on a moderately loaded Apache web server. For a heavily loaded server, though, the performance penalty of our untuned implementation was up to 10% under a trace-driven workload.

The FreshFlow architecture makes use of the FreshFlow algorithm [18] at web servers. Although [18] analytically compared the FreshFlow algorithm with an optimal algorithm, it did not experimentally compare the performance of FreshFlow with intuitively obvious alternatives. The trace-driven simulations reported in this article indicate the superiority of FreshFlow over such alternatives. Furthermore, [18] did not study how much fresher the update information known to search engines becomes if one uses FreshFlow instead of a polling-based approach; this article provided an analytic comparison of the FreshFlow approach and the pure polling-based approach.

The use of FreshFlow requires the use of a middleman
that is situated inside an ISP, a CDN, or a web-hosting service. Gupta [17] has argued that even a single middleman machine is sufficient for the current Internet. We cannot predict the commercial viability of providing a middleman service. However, should such a service be provided, and should search engines look to the middleman for update information, then web servers have an incentive to publish their updates, because doing so could lead to faster indexing of their sites. Search engines likewise have an incentive to use FreshFlow, because it would improve the freshness of their repositories.
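The intuition behind the analytic comparison of notification and polling can be reproduced with a toy calculation (this is not the paper's trace-driven simulation; the parameters are arbitrary): if a crawler polls a page every P time units, an update is discovered roughly P/2 later on average, whereas a notified search engine can learn of it almost immediately.

```python
import random

# Toy illustration of the freshness gap between per-URL polling and
# prompt notification. Updates occur at random times; a poller that
# visits every POLL_INTERVAL hours learns of each update only at the
# next poll, for an average lag near POLL_INTERVAL / 2.

random.seed(0)
POLL_INTERVAL = 14 * 24.0                                   # two weeks, in hours
updates = [random.uniform(0, 1000) for _ in range(10000)]   # random update times

# An update at time t is discovered at the next poll tick after t.
poll_lag = [POLL_INTERVAL - (t % POLL_INTERVAL) for t in updates]
mean_lag = sum(poll_lag) / len(poll_lag)

print(f"mean polling lag: {mean_lag:.1f} hours")
print(f"P/2 reference:    {POLL_INTERVAL / 2:.1f} hours")
```

With notification, the lag collapses to the batching delay at the crawler-help and middleman programs, which is minutes to hours rather than weeks; this is the gap the analytic comparison in the article quantifies.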
Acknowledgments

We are very grateful to Kevin Chang, Dennis Mickunas, Marianne Winslett and Hugh Holbrook for helpful suggestions and feedback, to Prashant Vishwanathan and David Raila for their help with the testbed setup, and to Martin Arlitt and Tai Jin for providing the world cup traces.

References

[1] Akamai Inc. www.akamai.com.
[2] Alexa Internet, 2000. www.alexaresearch.com.
[3] M. Arlitt and T. Jin. Workload characterization of the 1998 world cup web site. Technical Report HPL-1999-35R1, Hewlett-Packard Laboratories, Palo Alto, CA, September 1999.
[4] P. Barford and M. Crovella. Generating representative web workloads for network and server performance evaluation. In Proceedings of the ACM SIGMETRICS Conference, pages 151–160, June 1998.
[5] O. Brandman, J. Cho, H. Garcia-Molina, and N. Shivakumar. Crawler-friendly web servers. In Workshop on Performance and Architecture of Web Servers (PAWS), June 2000.
[6] L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker. Web caching and Zipf-like distributions: Evidence and implications. In Proceedings of the INFOCOM, March 1999.
[7] B. Brewington. Observation of changing information sources. PhD thesis, Thayer School of Engineering, Dartmouth College, June 2000.
[8] B. Brewington and G. Cybenko. Keeping up with the changing web. IEEE Computer, 33(5):52–58, May 2000.
[9] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the Seventh World Wide Web Conference, April 1998.
[10] L. Cabrera, M. Jones, and M. Theimer. Herald: Achieving a global event notification service. In Hot Topics in Operating Systems (HOTOS VIII), May 2001.
[11] J. Challenger, A. Iyengar, and P. Dantzig. A scalable system for consistently caching dynamic web data. In Proceedings of the INFOCOM, volume 1, pages 294–303, March 1999.
[12] J. Cho and H. Garcia-Molina. Synchronizing a database to improve freshness. In Proceedings of the SIGMOD, pages 117–128, May 2000.
[13] A. Demers, S. Keshav, and S. Shenker. Analysis and simulation of a fair queueing algorithm. Journal of Internetworking Research and Experience, 1(1):3–26, September 1990. Also in Proc. ACM SIGCOMM 1989, pages 1–12, September 1989.
[14] Officials: Web security work needed, November 2001. dailynews.yahoo.com/htx/ap/20011115/tc/internet security 1.html.
[15] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee. RFC 2616: Hypertext Transfer Protocol – HTTP/1.1, June 1999. Network Working Group, Internet Engineering Task Force.
[16] Google Custom SiteSearch, 2001. www.google.com/services/customsitesearch.html.
[17] V. Gupta. Scalable Distribution of Data Across Autonomous Systems. PhD thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, October 2001.
[18] V. Gupta and R. Campbell. Internet search engine freshness by web server help. In Proceedings of the Symposium on Applications and the Internet (SAINT), pages 113–119, January 2001.
[19] Web surpassed one billion documents, 2000. www.inktomi.com/new/press/billion.html.
[20] M. Johns. RFC 1413: Identification Protocol, February 1993. Network Working Group, Internet Engineering Task Force.
[21] S. Lawrence and C. Giles. Accessibility and distribution of information on the web. Nature, 400:107–109, July 1999.
[22] D. Li. Scalable Reliable Multicast and Its Application to Web Caching. PhD thesis, Stanford University, April 2000.
[23] C. Liu and P. Cao. Maintaining strong cache consistency in the World Wide Web. In Proceedings of the International Conference on Distributed Computing Systems, pages 12–21, 1997.
[24] S. Mullender, editor. Distributed Systems, chapter 8, pages 199–216. Addison-Wesley, Reading, MA, second edition, 1993. Chapter author: N. Budhiraja et al.
[25] S. Mullender, editor. Distributed Systems, chapter 7, pages 169–198. Addison-Wesley, Reading, MA, second edition, 1993. Chapter author: F. Schneider.
[26] The Netcraft web server survey. www.netcraft.com/Survey.
[27] Network Security Auditing Tools home page, November 2001. www.ntobjectives.com.
[28] A. Rowstron, A.-M. Kermarrec, M. Castro, and P. Druschel. SCRIBE: The design of a large-scale event notification infrastructure. In Proc. Third International Workshop on Networked Group Communication (NGC '01), November 2001.
[29] Solaris Security Guide, 2001. www.sabernet.net/papers/Solaris.html.
[30] D. Sullivan. How search engines work, 1998. searchenginewatch.internet.com/webmasters/work.html.
[31] 1998 World Cup Web Site Access Logs, April 2000. ita.ee.lbl.gov/html/contrib/WorldCup.html.
[32] H. Yu and A. Vahdat. Design and evaluation of a continuous consistency model for replicated services. In Proceedings of the Fourth Symposium on Operating Systems Design and Implementation, pages 305–318, October 2000.