Semantics-Based Personalized Prefetching to Improve Web Performance
Cheng-Zhong Xu and Tamer I. Ibrahim
Department of Electrical and Computer Engineering, Wayne State University, Detroit, MI 48202
[email protected]
Abstract

With the rapid growth of web services on the Internet, users are experiencing access delays more often than ever. Recent studies showed that prefetching could alleviate WWW latency to a larger extent than caching. Existing prefetching methods are mostly based on URL graphs. They use the graphical nature of HTTP links to determine the possible paths through a hypertext system. While they have been demonstrated to be effective in prefetching documents that are often accessed, few of them can pre-retrieve documents whose URLs have never been accessed. In this paper, we propose a semantics-based prefetching technique to overcome this limitation. It predicts future requests based on semantic preferences of past retrieved documents. We applied this technique to news reading activities and prototyped a NewsAgent prefetching system. The system extracts document semantics by identifying keywords in URL anchor texts and relies on neural networks over the keyword set to predict future requests. It features a self-learning capability and good adaptivity to changes of user surfing interest. We cross-examined the system in daily browsing of the ABC, CNN, and MSNBC news sites for three months. The experimental results showed approximately a 60% hit ratio due to prefetching. Of the pre-fetched documents, less than 24% were undesired.
1. Introduction

Bandwidth demands for the Internet are growing rapidly and saturating the Internet capacity due to the increasing popularity of web services. Although upgrades of the Internet with higher bandwidth networks have never stopped, users are experiencing web access delays more often than ever. They are demanding latency tolerant techniques to improve web performance. The demand is particularly urgent for users who connect to the Internet via low-bandwidth modem or wireless links.
Preliminary results were published in Proc. of the 20th IEEE Conf. on Distributed Computing Systems, pages 636—643, April 2000. This work was supported in part by NSF Grants EIA-0729828, CCR-9988266.
Many latency tolerant techniques have been developed over the years. Most notable are caching and prefetching. Caching takes advantage of the temporal locality of web references and is widely used on server and proxy sides to alleviate web latency. Caching is often enabled by default in both the popular IE and Netscape browsers. However, recent studies showed that the benefits from client-side caching were limited due to the lack of enough temporal locality in the web references of a client [21]. The potential for caching requested files was also shown to be declining over the years [1][2]. As a complement to caching, prefetching pre-retrieves web documents (more generally, web objects) that tend to be browsed in the near future while a user is processing previously retrieved documents [22]. Prefetching can also happen off-line when a user is not browsing the web at all. A recent tracing experiment revealed that prefetching could offer more than twice the improvement due to caching alone [21].

Previous studies on web prefetching were mostly based on the history of user access patterns [8][24][29][31][14][30]. If the history information shows an access pattern of URL address A followed by B with a high probability, then B will be pre-fetched once A is accessed. The access patterns are often represented by a URL graph, where each edge between a pair of vertices represents a follow-up relationship between the two corresponding URLs and the edge weight denotes the statistical probability of that relationship. A limitation of this URL graph based technique is that the prefetched URLs must have been frequently accessed before. URL graph based approaches have been demonstrated to be effective in prefetching documents that are often accessed. However, there is no way for these approaches to pre-retrieve documents that are newly created or have never been visited before. For example, all anchored URLs of a page are fresh when a client gets into a new web site, and none of them will be pre-fetched by these approaches. For dynamic HTML sites where a URL has time-variant contents, the existing approaches could not retrieve desired documents by any means.

This paper presents a novel semantics-based prefetching technique to overcome these limitations. It predicts future requests based on the semantic preferences of past retrieved documents, rather than on the temporal relationships between URL accesses. It is observed that client surfing is often guided by keywords in the anchor texts of URLs. Anchor text refers to the text that surrounds hyperlink definitions (hrefs) in web pages. For example, a user with a strong interest in the Year 2000 problem won't miss any relevant news or web pages that are pointed to by a URL with the keyword "Y2K" in its anchor text. An on-line shopper with a favorite auto model (e.g., "BMW X5") would like to see all consumer reports or reviews about the model. We refer to this phenomenon as "semantic locality". Unlike search engines that take keywords as input and respond with a list of semantically related web documents, a semantics-based prefetching system captures preferences from a client's past access patterns and predicts future references from a list of possible URLs when the client visits a new web site. Based on document semantics, this approach is capable of prefetching documents whose URLs have never been accessed. For example, it can help a client with interests in Y2K problems pre-fetch Y2K related documents when he/she gets into a new web site.
It is known that the HTML format was designed to represent a presentation structure of a document, rather than logical data. A challenge is how to extract semantics of the HTML documents and construct adaptive semantic nets between URLs on-line. Semantics extraction is a key to
web search engines as well. Our bitter experience with today's search engines leads us to believe that current semantics extraction technology is far from mature enough for a general-purpose semantics-based system. Instead, we focus on domain-specific semantics-based prefetching.

The prefetching technique is first targeted at news services. We are interested in this application domain for the following reasons. First, news headlines are often used as URL anchors in most news web sites. The meaning of a news item can easily be represented by a group of keywords in the headline. Second, news web sites like abc.com, cnn.com, and msnbc.com often cover the same topics during a certain period. They can be used to cross-examine the performance of semantics-based prefetching systems. That is, one site is used to accumulate URL preferences and the others to evaluate prediction accuracy. Third, the effects of prefetching in the news domain can be evaluated separately from caching. Prefetching is often coupled with caching and takes effect in the presence of a cache. Since a user rarely accesses the same news item more than once, there is very little temporal locality in user access patterns of news surfing. This leads to a zero impact of caching on the overall performance of prefetching systems.

We prototyped a NewsAgent system. It monitors web surfing behaviors and dynamically establishes keyword-oriented access preferences. It employs neural networks to predict future requests based on the preferences. Such a domain-specific prefetching system need not be running all the time; it should be turned on or off on demand.

The rest of the paper is organized as follows. Section 2 presents neural network basics, with an emphasis on the key parameters that affect prediction accuracy and the self-learning rules. Section 3 presents the design of the NewsAgent and its architecture. The subsequent three sections present the details of our experimental methodology and results in browsing three popular news sites. Related work is discussed in Section 7. Section 8 concludes the paper with remarks on future work.
2. Neural Network Basics

Neural nets have been around since the late 1950s and have been in practical use for general applications since the mid-1980s. Due to their resilience against distortions in the input data and their capability of learning, neural networks are often good at solving problems that do not have algorithmic solutions [18].
2.1. Neurons

A neural network comprises a collection of neurons. Each neuron performs a very simple function: it computes a weighted sum of a number of inputs and then compares the result to a predefined net threshold value. If the result is less than the net threshold, the neuron yields an output of zero; otherwise, one. Specifically, given a set of inputs x1, x2, x3, ..., xn associated with weights w1, w2, w3, ..., wn, respectively, and a net threshold T, the binary output y of a neuron can be modeled as:
\[
y =
\begin{cases}
1 & \text{if } \sum_{i=1}^{n} w_i x_i \ge T \\
0 & \text{otherwise}
\end{cases}
\qquad \text{(Equation 2.1)}
\]
The output y is often referred to as the net of the neuron. Figure 1 shows a graphical representation of the neuron. Although this is an over-simplified model of the functions performed by a real biological neuron, it has been shown to be a useful approximation. Neurons can be interconnected to form powerful networks that are capable of solving problems of various levels of complexity.
Figure 1. A graphical representation of a neuron (inputs x1, ..., xn with weights w1, ..., wn feeding a threshold unit T that produces the output y)

To apply the neural network to web prefetching, the predictor examines the URL anchor texts of a document and identifies a set of keywords for each URL. Using the keywords as input, the predictor calculates the net of each URL neuron. Their nets determine the priority of the URLs that are to be pre-fetched.
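To make the neuron model concrete, the following sketch shows how a URL's net could be computed from the keywords found in its anchor text, following Equation 2.1. It is only an illustration under our own naming (the paper does not publish code); the class UrlNeuron, its methods, and the hash map used as the weight table are all our assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// A minimal sketch of the URL neuron of Equation 2.1.
// Class, method, and field names are our own; the paper publishes no code.
public class UrlNeuron {
    private final Map<String, Double> weights = new HashMap<>(); // one weight per keyword input
    private final double threshold;                              // the net threshold T

    public UrlNeuron(double threshold) {
        this.threshold = threshold;
    }

    // Weighted sum of Equation 2.1; a keyword present in the anchor text acts as an input x_i = 1.
    public double net(Set<String> anchorKeywords) {
        double sum = 0.0;
        for (String kw : anchorKeywords) {
            sum += weights.getOrDefault(kw, 0.0);
        }
        return sum;
    }

    // Binary output y of Equation 2.1: 1 if the net reaches the threshold T, 0 otherwise.
    public int output(Set<String> anchorKeywords) {
        return net(anchorKeywords) >= threshold ? 1 : 0;
    }

    // Expose the weight table so a learning rule (Section 2.2) can adjust it.
    public Map<String, Double> weights() {
        return weights;
    }
}
```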
2.2. Learning Rules

A distinguishing feature of neurons (or neural networks) is self-learning. It is accomplished through learning rules. A learning rule incrementally adjusts the weights of the network to improve a predefined performance measure over time. Generally, there are three types of learning rules: supervised, reinforced, and unsupervised. In this study, we are mainly interested in supervised learning, where each input pattern is associated with a specific desired target pattern. Usually, the weights are synthesized gradually and updated at each step of the learning process so that the error between the output and the corresponding desired target is reduced [18]. Many supervised learning rules are available for single unit training. A most widely used approach is the µ-LMS learning rule. Let
\[
J(w) = \frac{1}{2} \sum_{i=1}^{m} (d_i - y_i)^2
\qquad \text{(Equation 2.2)}
\]
be the sum of squared error criterion function. Using the steepest gradient descent to minimize J(w) gives
\[
w^{k+1} = w^{k} + \mu \sum_{i=1}^{m} (d_i - y_i)\, x_i
\qquad \text{(Equation 2.3)}
\]
where µ is often referred to as the learning rate of the net because it determines how fast the neuron can react to the new inputs. Its value normally ranges between 0 and 1. The learning rule governed by the equation is called the batch LMS rule. The µ-LMS is its incremental version, governed by
\[
w^{1} = 0 \text{ or arbitrary}, \qquad
w^{k+1} = w^{k} + \mu\, (d^{k} - y^{k})\, x^{k}
\qquad \text{(Equation 2.4)}
\]
The µ-LMS learning rule is employed to update the weights of the keywords of the URL neurons.
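The incremental update of Equation 2.4 could be realized as follows. Again, this is a sketch under our own naming rather than the NewsAgent code: desired is d (1 for a requested link, 0 for a prefetched document discarded unused), actual is the neuron's output y for that link, and each keyword present in the anchor text is treated as an input x_i = 1.

```java
import java.util.Map;
import java.util.Set;

// Incremental mu-LMS update (Equation 2.4), applied to the weights of the
// keywords found in the anchor text of a link. Our sketch, not the authors' code.
public class LmsTrainer {
    private final double mu; // learning rate, normally between 0 and 1 (0.1 in the paper's experiments)

    public LmsTrainer(double mu) {
        this.mu = mu;
    }

    // weights:  keyword -> current weight (e.g., the table exposed by UrlNeuron.weights())
    // keywords: keywords of the link whose outcome was just observed (x_i = 1 for these inputs)
    // desired:  1.0 if the user requested the link, 0.0 if its prefetched copy was discarded unused
    // actual:   the neuron's binary output y for that link before the update
    public void update(Map<String, Double> weights, Set<String> keywords, double desired, double actual) {
        double delta = mu * (desired - actual);
        for (String kw : keywords) {
            // w_{k+1} = w_k + mu * (d - y) * x, with x = 1; a first-time keyword
            // simply receives delta as its initial weight.
            weights.merge(kw, delta, Double::sum);
        }
    }
}
```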
3. NewsAgent: A Web Prefetching Predictor

This section presents the details of the keyword-oriented, neural network based prefetching technique and the architecture of an application-domain predictor: NewsAgent.
3.1. Keywords

A keyword is the most meaningful word within the anchor text of a URL. It is usually a noun referring to some role in an affair or some object of an event. For example, associated with the news "Microsoft is ongoing talks with DOJ" as of March 24, 1999 on MSNBC are two keywords, "Microsoft" and "DOJ"; the news "Intel speeds up Celeron strategy" contains the keywords "Intel" and "Celeron". Note that it is the predictor that identifies the keywords from the anchor texts of URLs while the user retrieves documents. Since keywords tend to be nouns and often start with capital letters, this simplifies the task of keyword identification.

It is known that the meaning of a keyword is often determined by its context. An interesting keyword in one context may be of no interest at all to the same user if it occurs in other contexts. For example, a Brazil soccer fan may be interested in all the sports news related to the team. In this sports context, the word "Brazil" will guide the fan to all related news. However, the same word may be of less importance to the same user if it occurs in the general or world news categories, unless the user has a special tie with Brazil. The fact that a keyword has different implications in different contexts is well recognized in the structural organization of news sites. Since a news site usually,
if not always, organizes its URLs into categories, we assume one prefetching predictor per category for accurate prediction. The idea of using categorical or topic-specific information to improve the precision of information retrieval can also be seen in the focused crawling of search engines [3][16][27]. It is also known that the significance of a keyword to a user changes with time. For example, news related to "Brazil" hit the headlines during the Brazilian economic crisis in 1998. The keyword was significant to any reader of market or financial news during that period. After the crisis was over, its significance decreased unless the reader had special interests in the Brazilian market. Due to the self-learning property of a neural network, the weight of the keyword can adapt to such changes.
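For illustration, keyword identification as described in this section (keywords tend to be capitalized nouns in anchor texts) might be approximated as follows. The tokenization rule and the small stop list are our own simplifications and are not claimed to match the NewsAgent implementation.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Sketch of keyword identification from a URL's anchor text.
// The paper only states that keywords tend to be nouns starting with a capital
// letter; the stop list below is our own simplification.
public class KeywordExtractor {
    private static final Set<String> STOP_WORDS =
            Set.of("The", "A", "An", "In", "On", "At", "For", "With", "And", "Of", "To");

    public Set<String> extract(String anchorText) {
        Set<String> keywords = new LinkedHashSet<>();
        for (String token : anchorText.split("[^A-Za-z0-9]+")) {
            if (token.length() > 1
                    && Character.isUpperCase(token.charAt(0))
                    && !STOP_WORDS.contains(token)) {
                keywords.add(token);
            }
        }
        return keywords;
    }
}
// Example: extract("Intel speeds up Celeron strategy") yields {Intel, Celeron}.
```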
3.2. Architecture

Figure 2 shows the architecture of the NewsAgent. Its kernel is a prediction unit. It monitors the way the user browses. For each document the user requests, the predictor examines the keywords within the anchor texts of its links, calculates their preferences (nets), and pre-fetches those URLs that contain keywords with preferences higher than the neuron threshold.
The numbered steps annotated in Figure 2 are as follows:
1. The user requests a URL associated with a category.
2. The URL document is retrieved and displayed to the user.
3. The contents of the document are examined to identify the first set of links to evaluate.
4. Each link in the set is processed to extract its keywords.
5. The keywords are matched against the keyword list; they are added to the list if they occur for the first time.
6. The corresponding preferences are computed using the weights associated with the keywords.
7. Links with preferences higher than the neuron threshold are sorted and placed in the prefetching queue.
8. The links are accessed one by one and the objects are placed in the cache.
9. If the prefetching queue is empty, the next set of links is processed [Step 2].
10. When the user requests a link, the cache is examined to see whether the link is there.
11. If it is there, it is displayed.
12. If not, it is retrieved from the web and the corresponding keyword weights are updated.

Figure 2. Architecture of the NewsAgent (links window, keyword extraction, keyword list, preference computation, prefetch queue, and prefetch cache)
Since a document may contain a large number of links, the predictor does not examine all anchor texts of URLs at a time. Instead, it examines and pre-fetches documents through a window; that is, a fixed number of URLs are considered at a time. This mirrors user browsing behavior, since users look at a few links before deciding which one to follow. Since the evaluation of links takes non-negligible time, group evaluation also provides a quick response to user web surfing. The window size is referred to as the breadth of the prefetching.

Once an anchored URL is clicked, the predictor looks for the requested document in the cache. If it has already been pre-fetched, the document is displayed. (The weights of its related keywords remain unchanged by the rule of Equation 2.4.) Otherwise, the predictor suspends any ongoing prefetching process and starts to examine the caption of the requested link for keywords. If a keyword is encountered for the first time, a new weight is assigned to it. Otherwise, its existing weight is positively updated (d = 1) by the rule of Equation 2.4.
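Putting the earlier sketches together, the windowed evaluation and the click-time handling described above might look roughly like this. All names are illustrative, fetch() is a placeholder for an actual HTTP retrieval, and the prefetch queue is simplified to a sorted pass over the window; none of this is claimed to mirror the NewsAgent source.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative driver for windowed evaluation and click handling.
// Reuses the UrlNeuron, KeywordExtractor, and LmsTrainer sketches above.
public class PrefetchDriver {
    private final UrlNeuron neuron;
    private final KeywordExtractor extractor;
    private final LmsTrainer trainer;
    private final Map<String, byte[]> cache = new HashMap<>(); // prefetch cache: URL -> object
    private final int windowSize;                               // breadth of the prefetching

    public PrefetchDriver(UrlNeuron neuron, KeywordExtractor extractor,
                          LmsTrainer trainer, int windowSize) {
        this.neuron = neuron;
        this.extractor = extractor;
        this.trainer = trainer;
        this.windowSize = windowSize;
    }

    // Evaluate one window of links: compute each link's net and prefetch those whose
    // preference exceeds the neuron threshold, in decreasing order of preference.
    public void evaluateWindow(List<Link> links) {
        List<Link> window = new ArrayList<>(links.subList(0, Math.min(windowSize, links.size())));
        window.sort(Comparator.comparingDouble(
                (Link l) -> neuron.net(extractor.extract(l.anchorText))).reversed());
        for (Link l : window) {
            Set<String> kws = extractor.extract(l.anchorText);
            if (neuron.output(kws) == 1 && !cache.containsKey(l.url)) {
                cache.put(l.url, fetch(l.url)); // place the prefetched object in the cache
            }
        }
    }

    // Handle a user click: serve from the cache if the document was prefetched,
    // otherwise fetch it and positively update the weights of its keywords (d = 1).
    public byte[] onUserClick(Link link) {
        byte[] doc = cache.get(link.url);
        if (doc != null) {
            return doc; // hit: weights remain unchanged (d = y = 1 in Equation 2.4)
        }
        Set<String> kws = extractor.extract(link.anchorText);
        trainer.update(neuron.weights(), kws, 1.0, neuron.output(kws));
        return fetch(link.url);
    }

    private byte[] fetch(String url) {
        return new byte[0]; // placeholder for an actual HTTP retrieval
    }

    public static final class Link {
        final String url;
        final String anchorText;
        public Link(String url, String anchorText) { this.url = url; this.anchorText = anchorText; }
    }
}
```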
3.3. Control of the Pre-fetch Cache and Keyword List

Notice that the keyword list could fill up with unwanted items when the user loses interest in a longstanding topic. The pre-fetch cache may also contain undesired documents due to the presence of some trivial (non-essential) keywords in document captions. For example, the keyword "DOJ" is trivial in a technology news section. It is of interest only when it appears together with some other technology-related keywords like Microsoft or IBM. Figure 3 shows the mechanism the NewsAgent uses to control overflow of the keyword list and purge undesired documents from the pre-fetch cache. By checking the cache, we can identify documents that have been pre-fetched but never accessed. The system then decrements the weights of their related keywords. We also set a purging threshold value that determines whether a keyword should be eliminated from the keyword list. If the weight of a keyword is less than this threshold, we examine the trend of change of the weight. If it is negative, the keyword is considered trivial and is eliminated from the list.
The overflow control, as annotated in Figure 3, proceeds as follows:
1. When the cache is full, use the LRU algorithm to remove the least recently accessed objects.
2. Extract the corresponding keywords.
3. Negatively update their weights.
4. Compare the updated weights to the minimum weight threshold.
5. If they are below the threshold, check the updating trend (decreasing in this case).
6. Purge those keywords from the list if they are below the minimum weight threshold.
7. If the keyword list is full, apply steps 5 and 6; however, the trend will not necessarily be decreasing in this case.

Figure 3. Overflow Control for the Keyword List and Pre-fetch Cache

Keywords are purged once the keyword list is full. The list size should be set appropriately. A list that is too small will force the unit to purge legitimate keywords, while one that is too large would be filled with trivial keywords, wasting space and evaluation time. We set the list size to 100 and the purging threshold value to 0.02 in our experiments. These values were set based on our preliminary testing and are by no means optimal choices.

The keywords can also be purged when the cache is full (or filled up to a certain percentage) and the program needs to place a new document in it; the only way to do that is to replace an existing document or a number of documents. The documents to be replaced are those that have been in the cache for a long time without being accessed. The caption that resulted in prefetching those documents is first examined and its keywords are extracted. Then the weights associated with those keywords are negatively updated (d = 0) by the rule of Equation 2.4. This ensures that those keywords are less likely to trigger the prefetching unit in the future.
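The overflow control of Figure 3 could be sketched as follows, reusing the LmsTrainer above. The trend test is approximated by remembering each keyword's previous weight; this bookkeeping, like the class itself, is our own assumption rather than the paper's implementation.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

// Sketch of the keyword-list and cache overflow control of Figure 3.
// Names and the trend bookkeeping are our own approximations.
public class OverflowController {
    private final Map<String, Double> weights;                            // keyword -> weight
    private final Map<String, Double> previousWeights = new HashMap<>();  // to estimate the trend
    private final double purgeThreshold;                                  // e.g., 0.02 in the experiments

    public OverflowController(Map<String, Double> weights, double purgeThreshold) {
        this.weights = weights;
        this.purgeThreshold = purgeThreshold;
    }

    // Called when a prefetched document is evicted (LRU) without ever being requested:
    // negatively update (d = 0) the weights of the keywords that triggered its prefetching.
    public void onUnusedEviction(Set<String> keywords, LmsTrainer trainer) {
        for (String kw : keywords) {
            previousWeights.put(kw, weights.getOrDefault(kw, 0.0));
        }
        trainer.update(weights, keywords, 0.0, 1.0); // the neuron had fired (y = 1) but d = 0
    }

    // Purge keywords whose weight fell below the threshold and is still decreasing.
    public void purgeTrivialKeywords() {
        Iterator<Map.Entry<String, Double>> it = weights.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Double> e = it.next();
            double prev = previousWeights.getOrDefault(e.getKey(), e.getValue());
            boolean belowThreshold = e.getValue() < purgeThreshold;
            boolean decreasing = e.getValue() < prev;
            if (belowThreshold && decreasing) {
                it.remove();
            }
        }
    }
}
```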
4. Experimental Methodology

The system was developed entirely in Java and integrated with a pure Java web browser, namely IceBrowser from IceSoft.com. This lightweight browser, with its open source code, gave us great flexibility in the evaluation of our algorithm. Using such a lightweight browser simplified the process of setting experiment parameters because the browser did only what it was programmed for, with no hidden functions or add-ins to worry about, unlike full-blown browsers such as Netscape and Explorer. Such browsers have their own prefetching and caching schemes that may conflict with our algorithm.

Since the web is essentially a dynamic environment, its varying network traffic and server load make the design of a repeatable experiment challenging. There have been studies dedicated to characterizing user access behaviors [2] and to prefetching/caching evaluation methodologies; see [11] for a survey of the evaluation methods. Roughly, two major evaluation approaches are available: simulation based on a trace of user accesses, and real-time simultaneous evaluation. A trace-based simulation works like an instruction execution trace based machine simulator. It assumes a parameterized test environment with varying network traffic and workload. While there are many web traffic benchmarking and characterization studies, few of the trace data sets available to the public provide semantic information. Since semantics-based prefetching techniques are domain specific, they should be enabled or disabled when clients enter or exit domain services. To the best of our knowledge, there are no such domain-specific trace data sets available for evaluation of the NewsAgent. Real-time simultaneous evaluation [10] was initially designed to evaluate multiple web proxies in parallel by duplicating and passing object requests to each proxy. It complements trace-based simulations because it reduces the problems of unrealistic test environments and dated and/or inappropriate workloads. Along the lines of this approach, we evaluated the performance of NewsAgent by cross-examining multiple news sites. One site was used for neural network training. The trained smart reader then worked, simultaneously with a browser without prefetching, to retrieve news information from other sites for a period. Since the algorithm aims to predict future requests based on the user's access pattern, all the experiments were conducted by a single tester on a Pentium II 600 MHz Windows NT machine with 128 MB of RAM connected to the Internet through a T-1 connection. Our experimental settings were as follows.
• Topic selection: In order to establish user access patterns systematically, we defined topics of interest to guide the tester's daily browsing behaviors. We are interested in topics that have extensive coverage in major news sites for a certain period so that we can complete the data collection and cross-examination before the media changes its interest. During the time this experiment was conducted (from 4/1 to 6/15/1999), the Kosovo crisis was at its peak. We selected this topic because it was receiving all the attention in the media and we expected it would last long enough for us to complete the experiment. Notice that the NewsAgent has no intrinsic requirement for topics to be selected in advance. Reader interests may change over time, and the algorithm works in the same way whether or not a topic is defined. However, without a selected topic in the evaluation, most of the keywords generated from the tester's random browsing would be trivial ones that are soon purged.
• Site selection: The next step was to choose a number of sites where a reader interested in such a topic would most likely find an abundance of information about it. After evaluating many of the popular news content sites, we decided on cnn.com for the first phase (training) and abc.com for the second one (evaluation). The training phase took a little over a month (4/1 to 5/5/1999), followed by a comparable evaluation phase (5/7 to 6/14/1999).
• User access pattern: The user access pattern was monitored and logged daily. The time allowed to view a given page was set not to exceed one minute. Obviously, with unlimited time between two consecutive requests, all other links could be evaluated, pre-fetched, and cached in advance; this is not the case in real life, which is why we set a time limit.
5. Experimental Results

As indicated in the previous section, the experiment was conducted in two phases: training and evaluation. This section presents the results of each phase.
5.1. Training Phase

As is the case for any neural network employing the µ-LMS rule, the training phase is essential for the success of the network. In this phase, the network does not pre-fetch any future requests. All it does is monitor the access pattern and continually establish and update the inputs and their corresponding weights. In general, more training results in more accurate predictions. This may not be the case in news prefetching training, because new topics, characterized by new sets of keywords, emerge endlessly to replace old subjects. During this phase (04/01 to 05/05/1999), a total of 163 links were accessed. The average link size was 18K bytes (minimum 9.4K, maximum 33.4K). Figure 4 shows the daily access pattern during this period.
Figure 4. Access Patterns in Training Session (number of links accessed and number of new keywords identified per day, 4/1/99 to 4/29/99)
From Figure 4, it can be seen that in the first four days (4/1 to 4/4/99), new keywords were identified by the neuron as more links were accessed. This is expected because we started with an initialized neuron with no inputs (keywords). In the next two days (4/5 to 4/6), no new keywords were found, even though more links were accessed. This means that the topic is characterized by a very limited number of keywords during a certain period. The number of new keywords identified then rose sharply over the following two days (4/7 to 4/8/99). This is due to the dynamics of the subject: any related breaking news introduced new keywords. We can identify a repetitive cycle here. As links are accessed, a number of new keywords are identified. This number declines as the topic becomes saturated (i.e., no more keywords are introduced). Then changes to the topic introduce another set of new keywords and the cycle repeats. This cycle is always visible in dynamic topics such as the one selected for this study.

The neuron parameters used in the training phase were as follows: a net threshold of 0.5 and a learning rate of 0.1. The threshold was set to 0.5 because it is a typical median value for the output of the neuron. The learning rate was set to 0.1 to maintain the stability of the neuron and guarantee a smooth weight updating process. In the next section, we will show that high learning rate values can lead to instability and, subsequently, the failure of the unit.

To examine the progress of the training process, we evaluated the output of the neuron after each link was accessed. The output is calculated according to Equation 2.1. An output of 0.5 or higher is considered a success, since it indicates that the neuron identified the input set as a match. Figure 5 plots the success rate of the neuron against the number of links accessed. As we can see, the success rate increases with the number of links accessed until it stabilizes around 60%. That means that after a reasonable training period, the unit could identify 6 out of 10 links as links of interest. Since learning is an iterative process and the system on which we are training the unit is not deterministic, a success rate of 60% is an acceptable value. Since topic changes add new sets of keywords to the neuron, we would not expect an extremely high success rate.
Figure 5. Success Rate of the Neuron versus the Number of Links
5.2. Evaluation Phase

The training phase was conducted using cnn.com as the source of the training set. In this subsection we present the results obtained from testing the unit on abc.com. There were neither weight updates nor extra keyword inputs to the neuron while abc.com was visited. We used the neuron trained in the browsing of cnn.com, as discussed in the preceding subsection. Figure 6 shows the access pattern we used in this phase.

Prefetching systems are often evaluated in terms of four major performance metrics: hit ratio, byte hit ratio, waste ratio, and byte waste ratio. Hit ratio refers to the percentage of requested links that are found in the prefetch cache out of the total number of requests; byte hit ratio refers to the percentage of hit links in size. This pair of metrics reflects the improvement in web access latency. Waste ratio refers to the percentage of undesired documents that are pre-fetched into the cache; byte waste ratio refers to the percentage of undesired documents in size. This pair of metrics reflects the cost (i.e., the extra bandwidth consumption) of prefetching.
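Stated in symbols (our notation; the paper gives only the prose definitions above, and we take the waste-ratio denominators to be the totals of prefetched documents), the metrics are:

\[
\text{hit ratio} = \frac{\#\,\text{requested links found in the prefetch cache}}{\#\,\text{requests}}, \qquad
\text{byte hit ratio} = \frac{\text{bytes of requested links found in the cache}}{\text{bytes requested}},
\]
\[
\text{waste ratio} = \frac{\#\,\text{prefetched documents never requested}}{\#\,\text{prefetched documents}}, \qquad
\text{byte waste ratio} = \frac{\text{bytes of prefetched documents never requested}}{\text{bytes prefetched}}.
\]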
Figure 6. Access Patterns in Evaluation Session (number of links accessed per day, 5/7/99 to 6/11/99)

Figure 7 shows the hit and byte hit ratios in this phase as a function of the number of accessed links. While the news on abc.com had never been accessed before by the tester, the NewsAgent achieved a hit ratio of around 60% based on the semantic knowledge accumulated in the visits to cnn.com. Since the news pages differ little in size, the plot of the byte hit ratio matches that of the hit ratio. Since the topic "Kosovo" lasted longer than our experiment period, the keywords that were related to the Kosovo news and used to train the NewsAgent neuron in April 1999 remained the leading keywords guiding the user's daily browsing of related news. Since the input keywords and their weights in the neuron remained unchanged in the evaluation phase, it is expected that the accuracy of the predictor would decrease slowly as the tester drifts away from the selected topic.
Figure 7. Hit Ratio and Byte Hit Ratio in the Evaluation Session (plotted against the number of links)

Figure 8 shows an average waste ratio of 24% and approximately the same byte waste ratio during the evaluation phase. This is because many of the links contained similar material and the reader might be satisfied with only one of them. Also, in the evaluation phase, the process of updating weights and purging trivial keywords was not active; the neuron kept pre-fetching undesired documents as long as its output was greater than the net threshold. This leads to a higher waste ratio as well.

Figure 8. Waste Ratio and Byte Waste Ratio in the Evaluation Session (plotted against the number of links)
6. Effects of Net Threshold and Learning Rate

The accuracy of the predictive decisions is affected by a number of parameters in NewsAgent. To evaluate the effects of these parameters, particularly the net threshold and the learning rate, on the hit ratio and waste ratio, we implemented multiple predictors with identical parameters, except for the one under test, with the
same inputs and access sequence. We conducted the experiment over two new sites: the Today in America front page at MSNBC and the US News page at CNN.com. The first is full of multimedia documents, while the other is a combination of plain text and image illustrations. The experiment ran from November 18, 1998 to December 16, 1998, five days a week, with two half-hour sessions a day. Results from surfing these two new sites not only reveal the impact of the parameters, but also help verify the observations we obtained in the previous experiment.

During the experiment, the topic "Clinton Affair" had extensive coverage in both sites. As explained in the preceding section, we selected a guiding hot topic in the evaluation because its extensive coverage would help collect more data within a short time period. The neural-net based predictor can identify keywords and establish user access patterns automatically by monitoring user browsing behaviors. Keywords related to the "Clinton Affair" topic include Clinton, Starr, Lewinsky, Testimony, etc. Each related piece of news contains one or more such keywords. To reflect the fact that users may be distracted from their main topic of interest for a while when browsing, we intentionally introduced noise by periodically selecting headlines that did not belong to the topic. We accessed a random link every 10 trials. We also assumed each access was followed by one minute of viewing. A recent analysis of web access patterns showed that HTTP requests were made at an average rate of about 2 requests per minute [2]. Users may lose interest in a certain topic after a period of time; the system responds to this by updating the associated weights of the relevant keywords.
6.1. Effect of Net Threshold

Figure 9 plots the hit ratios due to prefetching with neuron thresholds of 0.25, 0.50, and 0.75, respectively. In agreement with Equation 2.1, the figure shows that a low threshold yields a high hit ratio for a large cache. Since an extremely small threshold would trigger the prefetching unit frequently, it would lead to pre-retrieval of most of the objects on a given URL. By contrast, the higher the threshold, the smoother the curve and the more accurate the prediction.
Figure 9. Effects of Threshold on Cache Hit Ratio (hit ratio versus number of requests, for thresholds T = 0.25, 0.50, and 0.75)
Figure 10 shows the waste ratio of the cache. As expected, the lower the threshold value, the more undesired objects get cached, and the cache may fill up with undesired objects quickly. There is a tradeoff between the hit ratio and the utilization ratio of the cache. The optimum net threshold should strike a good balance between the two objectives.
Figure 10. Effects of Threshold on Cache Waste Ratio (percentage of undesired documents in the cache, for thresholds T = 0.25, 0.50, and 0.75)
6.2. Effects of Learning Rate

The learning rate controls the speed at which a prediction unit responds to a given input. Figure 11 shows the hit ratio for different learning rates. From the figure, it can be seen that a high learning rate leads to a high hit ratio, because high learning rates update the weights quickly. The ripples on the curve are due to this fast updating. The figure also shows that the smaller the learning rate, the smoother the curve and the more accurate the prediction unit. This is because the weights of trivial keywords are updated with minimal magnitude. This generally affects cache utilization, because insignificant weights will not trigger the prediction unit at a lower learning rate. This may not be the case for a unit with a higher learning rate, as shown in Figure 12. Thus, choosing a proper learning rate depends mainly on how fast we want the prediction unit to respond and what percentage of cache waste we can tolerate.
Figure 11. Effect of Learning Rate on Hit Ratio (hit ratio versus number of requests, for learning rates R = 0.05, 0.10, 0.20, and 0.50)
Figure 12. Effect of Learning Rate on Cache Waste Ratio (percentage of undesired documents in the cache, for learning rates R = 0.05, 0.10, 0.20, and 0.50)
6.3. Update of Keyword Weights and Purging of Trivial Keywords

Recall that the prefetching unit relies on a purging mechanism to identify trivial weights and decrease their impact on the prediction unit so that it is not triggered by accident. Through our tests, we found that on average approximately 4 to 5 trivial keywords are introduced for each essential keyword. Figure 13 plots the distribution of keyword weights in NewsAgent. The figure shows that the system kept the trivial weights bounded and close to their minimum throughout the browsing process. As outlined in Figure 3, the weights of keywords are reduced whenever their related objects in the cache are to be replaced by the prefetching unit. The objects that were never referenced have the lowest preference value and are replaced first. The weights of the keywords associated with a replaced object are then updated accordingly. This guarantees that trivial keywords have minimal effects on the prediction unit.
Figure 13. Effect of Unrelated Keywords on Prediction Accuracy (weight values of main and trivial keywords versus number of requests)
7. Related Work

The concept of prefetching has long been used in a variety of distributed and parallel systems to hide communication latency [7]. A prefetching topic closely related to web surfing is file prefetching and caching in the design of remote file systems. Shriver et al. presented an overview of file system prefetching and explained with an analytical model why file prefetching works [32]. A most notable work in automatic prefetching is due to Griffioen and Appleton [17]. They proposed to predict future file accesses based on past file activities that are characterized by probability graphs. Padmanabhan and Mogul adapted the Griffioen-Appleton predictive file algorithm to reduce web latency [29].

Web prefetching is an active topic in both academia and industry. Algorithms used in today's commercial products are either greedy or user-guided. A greedy prefetching technique does not keep track of any user web access history. It pre-fetches documents based solely on the static URL relationships in a hypertext document. For example, WebMirror [25] retrieves all anchored URLs of a document for fast mirroring of a whole web site; NetAccelerator [19] pre-fetches the URLs on a link's downward path for a quick copy of HTML formatted documents. User-guided prefetching techniques are based on user-provided preference URL links. Go Ahead Got-It! [15] and Connectix Surf Express [5] are two industry-leading examples in this category. The former relies on a list of predefined favorite links, while the latter maintains a record of the user's frequently accessed web sites to guide prefetching of related URLs.

Research in academia is mostly targeted at intelligent predictors based on user access patterns. The predictors can be associated with servers, clients, or proxies. Padmanabhan and Mogul [29] presented a first server-side prefetching technique based on that of Griffioen and Appleton. It depicts the access pattern as a weighted URL dependency graph, where each edge represents a follow-up relationship between a pair of URLs and the edge weight denotes the transfer probability from one URL to the other. The server updates the dependency graph dynamically as requests come in from clients and makes predictions of future requests. Schechter et al. [31] presented methods of building a sequence prefix tree (path profile) based on HTTP requests in server logs and using the longest matching, most frequent sequence to predict the next requests. Sarukkai proposed a probabilistic sequence generation model based on Markov chains [30]. On receiving a client request, the server uses HTTP probabilistic link prediction to calculate the probabilities of the next requests based on the history of requests from the same client. Loon and Bharghavan proposed an integrated approach for proxy servers to perform prefetching, image filtering, and hoarding for mobile clients [24]. They used usage profiles to characterize user access patterns. A usage profile is similar to the URL dependency graph, except that each node is associated with an extra weight indicating the frequency of access of the corresponding URL. They presented heuristic weighting schemes to reflect the changing access patterns of users and the temporal locality of accesses in the update of usage profiles. Markatos and Chronaki [26] proposed a top-10 prefetching technique for servers to push their most popular documents to proxies regularly, according to clients' aggregated access profiles. An open question is that a high hit ratio in proxies due to prefetching may give servers a wrong perception of future popular documents.
Fan et al. [14] proposed a proxy-initiated prefetching technique between a proxy and clients. It is based on a Prediction-by-Partial-Matching (PPM) algorithm originally developed for data compression [9]. PPM is a parametric algorithm that predicts the next l requests based on the past m accesses by the user, and limits the candidates by a probability threshold t. PPM reduces to the algorithm of Padmanabhan and Mogul when m is set to 1. Fan et al. showed that the PPM prefetching algorithm performs best when m = 2 and l = 4. A similar idea was recently applied to prefetching of multimedia documents by Su et al. [33].

From the viewpoint of clients, prefetching speeds up web access, particularly to those links that are expected to experience long delays. Klemm [20] suggested pre-fetching web documents based on their estimated round-trip retrieval times. Since this technique requires knowledge of the sizes of prospective documents to estimate their round-trip times, it was limited to prefetching of static HTTP pages. Prefetching is essentially a speculative process and incurs a high cost when its prediction is wrong. Cunha and Jaccoud [8] focused on the decision of whether prefetching should be enabled or not. They constructed two user models, based on random walk approximation and digital signal processing techniques, to characterize user behaviors and help predict users' next actions. They proposed coupling the models with prefetching modules for customized prefetching services. Crovella and Barford [6] studied the effects of prefetching on network performance. They showed that prefetching might lead to an increase in traffic burstiness. They proposed using rate-controlled transport mechanisms to smooth the traffic caused by prefetching.

Previous studies on intelligent prefetching were mostly based on temporal relationships between accesses to URLs. Semantics-based prefetching has been studied in the context of file prefetching. Lei and Duchamp proposed to capture intrinsic correlations between file accesses in tree structures on-line and then match them against existing pattern trees to predict future accesses [23]. By contrast, we deploy neural nets over keywords to capture the semantic relationships between URLs and improve the prediction accuracy by adjusting the weights of keywords on-line.
8. Concluding Remarks

This paper has presented a semantics-based web prefetching technique and its application to news surfing. It captures the semantics of a document by identifying keywords in its URL anchor texts and relies on neural networks over the keyword set to construct the semantic net of the URLs and to predict the user's future requests. Unlike previous URL graph based techniques, this approach takes advantage of the "semantic locality" of access patterns and is able to pre-fetch documents whose URLs have never been accessed before. It also features a self-learning capability and good adaptivity to changes of user surfing interests. The technique was implemented in a NewsAgent system and cross-examined in daily browsing of the ABC.com, CNN.com, and MSNBC.com news sites. Experimental results showed an achievement of approximately 60% hit ratio due to the prefetching approach. We note that the NewsAgent system was proposed from the perspective of clients. Its predictive prefetching technique is readily applicable to client proxies.
Future work will focus on the following aspects. The first is to construct a prediction unit using neural networks instead of single neurons. Besides keywords, other factors, such as document size and the URL graphs of a hypertext document, will also be taken into account in the design of the predictor. The second is to improve cache utilization. We noticed that the hit ratio saturates at around 60-70% after a reasonable number of requests; more work needs to be done to reduce the cache waste ratio. On a different track altogether, other statistical classification and learning techniques, such as Bayesian networks, could be deployed.

Like many domain-specific search engines [3][16][27], the semantics-based prefetching technique is presently targeted at domain applications. In addition to news services, the prefetching technique could be applied to other application domains, such as on-line shopping and auctions, where the semantics of web pages are often explicitly presented in their URL anchor texts. Such domain-specific prefetching facilities can be turned on or off on demand.

The semantics-based prefetching technique is based on keyword matching. In general, there are two fundamental issues with keyword matching for information retrieval: polysemy and synonymy. Polysemy refers to the fact that a keyword may have more than one meaning. Synonymy means that there are many ways of referring to the same concept. In this paper we alleviate the impact of polysemy by taking advantage of the categorical information in news service domains, but we do little to address synonymy. There exist many techniques that address these two issues in conventional information retrieval. Most notable is latent semantic indexing (LSI) [12]. It takes advantage of the implicit higher-order structure of the association of terms with articles to create a multi-dimensional semantic structure of the information. Through the pattern of co-occurrences of words, LSI is able to infer the structure of relationships between articles and words. Its application to web prefetching deserves investigation. Recently, XML and RDF (Resource Description Framework) have been emerging as standards for data exchange formats in support of semantic interoperability on the web [13]. With the growing popularity of XML or RDF compliant web resources, we expect semantics-based prefetching techniques to play an increasingly important role in reducing web access latency and to lead to more personalized, quality services on the semantic web.
References

[1] V. Almeida, A. Bestavros, M. Crovella, and A. de Oliveira. Characterizing reference locality in the WWW. In Proc. of the IEEE Conf. on Parallel and Distributed Information Systems, December 1996.
[2] P. Barford, A. Bestavros, A. Bradley, and M. Crovella. Changes in web client access patterns: Characteristics and caching implications. World Wide Web: Special Issue on Characterization and Performance Evaluation.
[3] S. Chakrabarti, M. van der Berg, and B. Dom. Focused crawling: A new approach to topic-specific web resource discovery. In Proceedings of the 8th Int'l World-Wide Web Conference (WWW8), 1999.
[4] CNN United States News. http://www.cnn.com/US
[5] Connectix, Inc. http://www.connectix.com/.
[6] M. Crovella and P. Barford. The network effects of prefetching. In Proceedings of IEEE Infocom'98.
[7] D. Culler, J. Singh, with A. Gupta. Parallel Computer Architecture: A Hardware/Software Interface. Morgan Kaufmann Pub., 1998.
[8] C. R. Cunha and C. F. B. Jaccoud. Determining WWW user's next access and its application to prefetching. In Proc. of the International Symposium on Computers and Communication '97, July 1997.
[9] K. M. Curewitz, P. Krishnan, and J. Vitter. Practical prefetching via data compression. In Proceedings of SIGMOD'93, pages 257-266, May 1993.
[10] B. Davison. Simultaneous proxy evaluation. In Proceedings of the 4th International Web Caching Workshop, March 1999.
[11] B. Davison. A survey of proxy cache evaluation techniques. In Proceedings of the 4th International Web Caching Workshop, March 1999.
[12] S. Deerwester, et al. Indexing by latent semantic analysis. Journal of the American Society for Information Science, Vol. 41(6), 1990, pages 391-407.
[13] S. Decker, S. Melnik, et al. The semantic web: The role of XML and RDF. IEEE Internet Computing, September/October 2000, pages 63-74.
[14] L. Fan, P. Cao, W. Lin, and Q. Jacobson. Web prefetching between low-bandwidth clients and proxies: Potential and performance. In Proceedings of SIGMETRICS'99.
[15] Go Ahead Home Page. http://www.goahead.com/.
[16] E. J. Glover, G. W. Flake, S. Lawrence, W. Birmingham, A. Kruger, C. Giles, and D. Pennock. Improving category specific web search by learning query modifications. In Symposium on Applications and the Internet (SAINT 2001), 2001.
[17] J. Griffioen and R. Appleton. Reducing file system latency using a predictive approach. In Proceedings of the 1994 Summer USENIX Conference.
[18] H. Hassoun. Fundamentals of Artificial Neural Networks. The MIT Press, 1995.
[19] ImsiSoft Home Page. http://www.imsisoft.com/.
[20] R. P. Klemm. WebCompanion: A friendly client-side web prefetching agent. IEEE Transactions on Knowledge and Data Engineering, Vol. 11, No. 4, July/August 1999, pages 577-594.
[21] T. Kroeger, D. Long, and J. Mogul. Exploring the bounds of web latency reduction from caching and prefetching. In Proc. of the USENIX Symp. on Internet Technologies and Systems, December 1997.
[22] D. Lee. Methods for web bandwidth and response time improvement. In M. Abrams (ed.), World Wide Web: Beyond the Basics, Chapter 25, Prentice-Hall Pub., April 1998.
[23] H. Lei and D. Duchamp. An analytical approach to file prefetching. In Proceedings of the USENIX 1997 Annual Technical Conference, January 1997, pages 275-288.
[24] T. S. Loon and V. Bharghavan. Alleviating the latency and bandwidth problems in WWW browsing. In Proc. of the USENIX Symposium on Internet Technologies and Systems, December 1997.
[25] Macca Soft Home Page. http://www.maccasoft.com/webmirror/.
[26] E. P. Markatos and C. E. Chronaki. A Top-10 approach to prefetching on the web. In Proceedings of INET'98, July 1998.
[27] A. McCallum, K. Nigam, J. Rennie, and K. Seymore. Building domain-specific search engines with machine learning techniques. In Proceedings of the AAAI Spring Symposium on Intelligent Agents in Cyberspace, 1999.
[28] MSNBC Today in America. http://www.msnbc.com/news/affiliates_front.asp
[29] V. Padmanabhan and J. Mogul. Using predictive prefetching to improve world wide web latency. In ACM SIGCOMM Computer Communication Review, July 1996.
[30] R. Sarukkai. Link prediction and path analysis using Markov chains. In Proceedings of the 9th Int'l World Wide Web Conference, 2000.
[31] S. Schechter, M. Krishnan, and M. Smith. Using path profiles to predict HTTP requests. In Proceedings of the 7th International World Wide Web Conference. Also appeared in Computer Networks and ISDN Systems, 20(1998), pages 457-467.
[32] E. Shriver and C. Small. Why does file system prefetching work? In Proceedings of the 1999 USENIX Annual Technical Conference, June 1999.
[33] Z. Su, Q. Yang, and H. Zhang. A prediction system for multimedia prefetching in Internet. In Proceedings of ACM Multimedia, 2000.