Mining Web Logs for Personalized Site Maps

Fergus Toolan and Nicholas Kushmerick
Smart Media Institute, University College Dublin, Ireland
{fergus.toolan, nick}@ucd.ie
Abstract

Navigating through a large Web site can be a frustrating exercise. Many sites employ Site Maps to help visitors understand the overall structure of the site. However, by their very nature, unpersonalized Site Maps show most visitors large amounts of irrelevant content. We propose techniques based on Web usage mining to deliver Personalized Site Maps that are specialized to the interests of each individual visitor. The key challenge is to resolve the tension between simplicity (showing just relevant content) and comprehensibility (showing sufficient context so that visitors can understand how the content is related to the overall structure of the site). We develop two baseline algorithms (one that displays just shortest paths, and one that mines the server log for popular paths), and compare them to a novel approach that mines the server log for popular path fragments that can be dynamically assembled to reconstruct popular paths. Our experiments with two large Web sites confirm that the mined path fragments provide much better coverage of visitors' sessions than the baseline approach of mining entire paths.
1. Introduction

Finding relevant information in a large Web site can be tedious and frustrating. Site Maps are commonly used by Web developers to help visitors understand and navigate complex sites. For example, Figure 1(a) shows a portion of Apple.com's Site Map. By their very nature, Site Maps present nearly all of a Web site's content. Of course, most visitors are interested in just a small subset of this content [3]. Figure 1(b) illustrates how this Site Map could be personalized for some particular visitor who is interested in just a few aspects of Apple.com.

Our research goal is to develop techniques to enable Web sites to automatically deliver Personalized Site Maps. Achieving this goal involves solving two sub-problems.
The first challenge is to determine what content items (i.e., Web pages) each visitor is actually interested in. The second challenge is to display these relevant pages in a way that helps visitors understand how the relevant pages are related. Web designers invest substantial effort in crafting Site Maps in order to help visitors understand the overall structure of the Web site, and Personalized Site Maps must not "throw the baby out with the bathwater" by ignoring this structure. For example, a visitor interested in Apple's 17-inch Studio Displays should not simply be pointed to the most relevant page; rather, she should be shown how that page is related to the site's Products section.

We adopt a simple solution to the first problem: we assume that the visitor expresses her interests with an explicit query such as "inexpensive studio displays" that is processed with standard information retrieval techniques. We defer to future work more sophisticated approaches based on collaborative filtering and other forms of user modeling. In this paper, we focus on the second sub-problem: how to organize a set of relevant Web pages to reflect the site's structure.

We note that there is a trade-off between two competing considerations. On the one hand, Personalized Site Maps should be as simple as possible. This suggests a trivial approach in which a Personalized Site Map displays just the shortest paths from each of the relevant pages to the site's home page. However, short paths are not necessarily intuitively meaningful to visitors [4]. For example, Web sites often contain numerous navigational cross-links, so the shortest path between two pages may well involve completely unrelated parts of the site.

We adopt the assumption that the most comprehensible path between two pages will be the one that has been most popular with previous site visitors [7]. We therefore add to our Personalized Site Maps paths that have been frequently traversed by past site visitors, which are often not the shortest. The technical challenge of our work concerns how to compute the most popular path between a given pair of pages. The naive approach would be to extract the most popular paths from the server's access log.
Figure 1. (a) The Apple.com Site Map, and (b) a fictitious personalized version that displays only the few pages that are relevant to a particular visitor.
However, given the inherent diversity of visitors' interests, such a collection of stored paths must be extremely large in order to obtain sufficient coverage over actual site visitors. To address this sparseness, we propose a novel algorithm for mining fragments of paths, rather than entire paths, from the server logs, and then assembling the fragments. For example, suppose that A → B → C → D and A → B → C → E are two paths that occur frequently in past visitors' sessions, where the notation X → Y indicates a traversal by a particular visitor from page X to page Y. Using the naive approach we would need to store the two paths in their entirety. However, we could store only the fragments A → B → C, C → D and C → E, and then recreate the full paths. This path-fragment method allows us to compress the previous sessions much more than storing entire paths.

We make the following contributions. First, we formalize the problem of constructing Personalized Site Maps (Section 2). Second, we describe our algorithm for solving this problem, which mines popular path fragments from server logs (Section 3). Finally, using data from two Web sites, we empirically demonstrate that shortest paths are often quite unpopular (thus providing evidence that shortest paths are not intuitively meaningful), and that our mined path fragments provide better coverage than simply storing entire paths (Section 4).
2. Problem Formalization

We formalize the problem of constructing a Personalized Site Map as follows. We take as input a Web site graph G = (V, E) and its distinguished "home page" root r ∈ V. Each node v ∈ V corresponds to a Web page, and each directed edge (u, v) ∈ E represents a hyperlink between the corresponding documents. We also assume that a set of relevant nodes R = {v1, v2, ..., vn} ⊆ V has been identified during the initial relevance-assessment step.

A Personalized Site Map G' = (V', E') is a subgraph of the original graph, such that G' contains the root and the relevant nodes (i.e., {r} ∪ R ⊆ V'), as well as sufficient additional nodes and edges from G so that G' contains a path from the root r to each relevant node vi ∈ R.

So far, this personalization task is highly underconstrained, as there may be many such subgraphs G'. To decide between alternative subgraphs, we exploit the actual visitor usage data from the site's server log. The intuition is that we want to select the alternative whose edges are the most popular among previous site visitors. Our task thus reduces to the following: given a Web site server log, we want to mine sufficient data from the log to be able to reliably reconstruct the most popular path from any node a ∈ V to any other node b ∈ V. Naturally, without access to the entire log some data will necessarily be lost and this reconstruction process cannot be perfect. We will therefore be interested in empirically comparing the coverage of alternative algorithms.
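To make this formalization concrete, the following minimal sketch (Python, with hypothetical names; not the authors' implementation) assembles a Personalized Site Map as the union of the most popular root-to-relevant-node paths, where most_popular_path stands in for whichever estimator (SP, PP or MP, described in Section 3) supplies those paths:

```python
from typing import Callable, List, Set, Tuple

Page = str
Path = List[Page]          # an ordered list of pages
Edge = Tuple[Page, Page]   # a directed hyperlink (u, v)


def build_personalized_site_map(
    root: Page,
    relevant: Set[Page],
    most_popular_path: Callable[[Page, Page], Path],
) -> Tuple[Set[Page], Set[Edge]]:
    """Union of the most popular root-to-relevant-node paths: the
    Personalized Site Map subgraph G' = (V', E')."""
    nodes: Set[Page] = {root} | set(relevant)
    edges: Set[Edge] = set()
    for target in relevant:
        path = most_popular_path(root, target)
        nodes.update(path)
        edges.update(zip(path, path[1:]))  # consecutive pages form edges
    return nodes, edges
```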
3. Algorithms

We begin with a brief discussion of server log pre-processing. We then describe three alternative algorithms for constructing Personalized Site Maps. The first baseline algorithm, SP, ignores the server log and simply assumes that shorter paths are more popular than longer paths. The second algorithm, PP, extracts the most popular paths from the server log, and tries to reconstruct the most popular path between two pages using these paths. As mentioned above, PP is ineffective because path traversal logs for large graphs are necessarily very sparse, and thus a very large number of paths must be stored to ensure adequate coverage. Our third algorithm, MP, mines path fragments from server logs and then dynamically assembles them into a path from a given node a to another node b. Since MP discards strictly more information than PP, it can potentially make mistakes, but our experiments in Section 4 demonstrate its effectiveness in practice.
3.1. Server Log Pre-Processing
Web server logs contain a large amount of noise which must be discarded, and they also often omit data that must be inferred [1]. Noise corresponds to requests for images, applets, etc., which are logged yet are irrelevant for our purposes. Data may be missing due to caching by, for example, the browser or Internet service provider. This arises most commonly when the visitor uses the browser's back button. For example, if a user traverses the path A → B, then hits the back button, and then traverses A → C, this will appear in the log as A → B → C. We use a simple path-completion algorithm to automatically insert entries that must be missing given the known structure of the site graph.

The final problem with server logs is that requests are stored in the order that the server receives them. Specifically, if multiple people are browsing the site concurrently, their requests are intermingled in the log file. We use simple session-extraction heuristics to segment the entire log into a sequence of sessions. First, we partition requests by IP address. Second, we use an inter-access delay threshold D to split a sequence of accesses from a given IP into one or more sessions: D is the amount of time allowed between requests for them to be treated as members of the same session. From our preliminary experiments described in Section 4.2 we set D = 10 minutes.
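As a concrete illustration of this session-extraction step, here is a minimal sketch (Python; the tuple format and function name are assumptions, and it presumes requests have already been cleaned and path-completed):

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Request = Tuple[str, float, str]   # (ip address, unix timestamp, url)


def extract_sessions(requests: List[Request],
                     delay_threshold: float = 10 * 60) -> List[List[str]]:
    """Partition cleaned log requests by IP, then split each visitor's
    request stream wherever the inter-access delay exceeds the threshold."""
    by_ip: Dict[str, List[Tuple[float, str]]] = defaultdict(list)
    for ip, timestamp, url in requests:
        by_ip[ip].append((timestamp, url))

    sessions: List[List[str]] = []
    for visits in by_ip.values():
        visits.sort()                      # order each visitor's requests by time
        current = [visits[0][1]]
        for (prev_ts, _), (ts, url) in zip(visits, visits[1:]):
            if ts - prev_ts > delay_threshold:
                sessions.append(current)   # gap too long: close this session
                current = []
            current.append(url)
        sessions.append(current)
    return sessions
```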
3.2. SP (“shortest path”) Algorithm

The simplest technique in any route planning system is the shortest path between two points [4]. In order to estimate the most popular path from node a to node b, the SP algorithm ignores the past visitors entirely and simply assumes that short paths are more popular than long paths.

To evaluate the SP algorithm, we measure its coverage. The coverage of the SP algorithm is the fraction of extracted sessions in which users went from page a to page b via the shortest path. In Section 4 we empirically demonstrate that the coverage of SP is in fact quite low.

3.3. PP (“popular path”) Algorithm

The PP algorithm simply records the N most frequent sessions extracted from the server log during the pre-processing step. It is an example of sequential pattern discovery from Web logs as seen in [5] and [2].

The coverage of PP is the fraction of extracted sessions in which the user navigated from page a to page b via the most popular path from a to b. In Section 4 we demonstrate that N must be quite large in order to obtain sufficient coverage over the entire Web site graph.

3.4. MP (“mined path”) Algorithm

The MP algorithm expands each server log session into the set of all its subpaths of length between l_min and l_max. The N most popular such fragments are then used to reconstruct a path from a page a to a page b. To do so, MP considers all possible ways to assemble the mined fragments, subject to the constraint that adjacent fragments must overlap on at least k pages. For instance, if k = 2 then the two fragments A → B → C and B → C → D can be assembled to create the path A → B → C → D. This overlap constraint corresponds to an assumption that Web navigation can be modelled as a Markov process of order k [9]. In our experiments we use fixed values of k, l_min and l_max; we leave to future work a systematic exploration of optimal values for these parameters.

The coverage of MP is the fraction of extracted sessions that can be recovered from the mined fragments. In Section 4 we demonstrate that, for a given value of N, MP has better coverage than PP.
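The paper gives no pseudocode for MP, but the description above suggests a sketch along the following lines (Python; the greedy depth-first assembly and the parameter handling are illustrative assumptions rather than the authors' exact implementation):

```python
from collections import Counter
from typing import List, Optional, Tuple

Path = Tuple[str, ...]


def mine_fragments(sessions: List[List[str]], l_min: int, l_max: int,
                   n_top: int) -> List[Path]:
    """Count every subpath of length l_min..l_max over all sessions and
    keep the n_top most popular fragments."""
    counts: Counter = Counter()
    for session in sessions:
        for length in range(l_min, l_max + 1):
            for i in range(len(session) - length + 1):
                counts[tuple(session[i:i + length])] += 1
    return [fragment for fragment, _ in counts.most_common(n_top)]


def assemble(fragments: List[Path], start: str, end: str, k: int,
             max_pages: int = 30) -> Optional[Path]:
    """Depth-first search for a chain of fragments, each overlapping the
    previous one on k pages, that leads from `start` to `end`."""
    stack = [frag for frag in fragments if frag[0] == start]
    while stack:
        path = stack.pop()
        if end in path:
            return path[:path.index(end) + 1]
        if len(path) >= max_pages:         # give up on overly long candidates
            continue
        for frag in fragments:
            # adjacent fragments must share the last k pages of the path so far
            if len(frag) > k and path[-k:] == frag[:k]:
                stack.append(path + frag[k:])
    return None
```

In this reading, mine_fragments runs offline over the pre-processed sessions, and assemble is invoked at query time for each (root, relevant page) pair required by the Personalized Site Map.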
4. Experiments

We now describe an experimental evaluation of the techniques we discussed in the previous section. We begin with a discussion of the two datasets we used for our experiments. We then describe the results of experiments designed to answer the following questions:
1. How sensitive are our results to the inter-access delay threshold D used to segment the raw server log into sessions? (Section 4.2)

2. How frequently is the shortest path between two pages the most popular path? (Section 4.3)
3. How does the coverage of PP compare to that of MP, as a function of the amount of mined data? (Section 4.4)
4.1. Datasets

We evaluated our techniques on two Web sites: the server for the Computer Science Department of University College Dublin (www.cs.ucd.ie), and Music Machines (machines.hyperreal.org)*. Figure 2 summarises these datasets.

* The Music Machines server logs were archived by Mike Perkowitz and are available at http://www.cs.washington.edu/ai/adaptive-data
                          UCD CS                 Music Machines
Time Period               Apr 2000 - Dec 2001    Feb 1997 - Apr 1999
Total Requests            4,327,397              14,722,468
After Pre-Processing      1,258,643              2,996,322
Number of Distinct IPs    55,429                 270,092
Number of Sessions        236,675                554,801
Mean Session Length       5.32                   5.40
Figure 2. Summary of the experimental data.

The total number of requests includes images, applets, etc. The number after pre-processing is the number of requests for actual page views. The number of distinct IPs is the number of IP addresses from which the server received requests in the time period. Note that the number of IP addresses is not equal to the number of actual visitors, due to noise introduced by proxy servers, and it is not equal to the number of sessions, because each visitor may initiate several sessions in the log file time period.
4.2. Threshold Experiments
The first experiments relate to the inter-access delay threshold D used to segment the raw server log into sessions. Specifically, we want to ensure that the results from our subsequent experiments are not overly sensitive to the setting of this free parameter.

Figure 3 shows the number of distinct sessions extracted from the log files of the UCD Web site as the session threshold D increases from 5 to 45 minutes. While the number of sessions grows rapidly as D decreases, the variation is much smaller at the intuitively reasonable larger thresholds.

Figure 3. Number of sessions extracted from the UCD log, as a function of the inter-access delay threshold D.

Our second experiment compared the sets of popular paths mined by the PP algorithm using inter-access delay thresholds of 10, 15 and 20 minutes. The overlap between two such sets is defined as the percentage of paths they have in common. As shown in Figure 4, there is a substantial overlap between the various sets of mined paths.

Figure 4. Overlap between paths mined from the UCD log by the PP algorithm, for three pairs of inter-access delay thresholds.

Based on this data, we conclude that our technique is relatively stable across values of the inter-access delay threshold D. We set the threshold D = 10 minutes for the remainder of the experiments.
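For illustration only, a curve in the style of Figure 3 can be approximated by sweeping the threshold over the extract_sessions sketch from Section 3.1 (cleaned_requests is a hypothetical list of pre-processed (ip, timestamp, url) tuples, not a variable defined in the paper):

```python
# Hypothetical sweep in the style of Figure 3: number of extracted sessions
# as a function of the inter-access delay threshold D (in minutes).
for minutes in range(5, 50, 5):
    sessions = extract_sessions(cleaned_requests, delay_threshold=minutes * 60)
    print(minutes, len(sessions))
```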
4.3. Comparison of shortest and popular paths

The next experiment seeks to confirm that shortest paths are not necessarily the most popular. Figure 5 shows the fraction of popular paths that are in fact the shortest path, as a function of the number N of paths mined by the PP algorithm, for both Web sites. For the most popular paths (i.e., for small values of N), roughly 60% of the mined paths are in fact the shortest path; that is, as the paths become more popular, shortness is indeed a good proxy for popularity. However, as N increases the overlap (percentage in common) between PP and SP decreases substantially. We conclude that, as predicted, popular paths are frequently sub-optimal.

Figure 5. Overlap between popular and shortest paths.

4.4. Coverage of Mined and Popular Paths

So far, our experiments have been concerned with demonstrating that SP and MP do indeed generate different paths. In this section we investigate the benefits of using mined path fragments as opposed to just using the popular paths. For various values of N, we measure the coverage, i.e., the fraction of the extracted sessions that can be reconstructed in their entirety using MP. With N = 1000 we can reconstruct 27% of the sessions from the UCD log file in their entirety, and we can recreate 14% of the Music Machines sessions.

We can generalize this experiment by measuring the fraction of individual sessions that each algorithm can reconstruct. That is, we know that 27% of the UCD sessions can be fully (100%) reconstructed, but presumably many other sessions can be, say, 75% reconstructed. We therefore measured the average fraction of a given session that can be reconstructed by each of the two algorithms.

For both Web sites, our experimental results in Figures 6 and 7 demonstrate that the mined path fragments can be used to reconstruct a greater proportion of individual traversals than relying solely on popular paths. Specifically, we measure the coverage of MP and PP for various values of N up to 1000, and find that for each value of N, MP has higher coverage than PP, and that the coverage gap grows rapidly as N increases.

Figure 6. Coverage of the MP and PP algorithms for the UCD site.

Figure 7. Coverage of the MP and PP algorithms for the Music Machines site.
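To make the coverage metric concrete, here is an illustrative (greedy, hence approximate) sketch of how full and partial session reconstruction could be scored against a set of mined fragments; it mirrors the overlap-k chaining of Section 3.4 and is not the authors' evaluation code:

```python
from typing import List, Sequence, Tuple

Path = Tuple[str, ...]


def reconstructed_prefix(session: Sequence[str], fragments: List[Path], k: int) -> int:
    """Greedily rebuild `session` from its start by chaining fragments that
    overlap on k pages; return how many pages of the session were matched."""
    matched = max((len(f) for f in fragments
                   if tuple(session[:len(f)]) == f), default=0)
    while 0 < matched < len(session):
        for frag in fragments:
            tail = len(frag) - k
            if (tail > 0 and tuple(session[matched - k:matched]) == frag[:k]
                    and tuple(session[matched:matched + tail]) == frag[k:]):
                matched += tail
                break
        else:
            break                          # no fragment extends the reconstruction
    return matched


def coverage(sessions: List[List[str]], fragments: List[Path], k: int) -> Tuple[float, float]:
    """Return (fraction of sessions fully reconstructed,
    mean fraction of each session that could be reconstructed)."""
    lengths = [reconstructed_prefix(s, fragments, k) for s in sessions]
    full = sum(m == len(s) for m, s in zip(lengths, sessions)) / len(sessions)
    mean = sum(m / len(s) for m, s in zip(lengths, sessions)) / len(sessions)
    return full, mean
```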
5. Related Work

Previous research in Web usage and server log mining addresses two major issues: the pre-processing of the raw data, and the discovery of patterns or rules in the data. Our work relates to both of these areas.

Pre-processing is discussed in detail in [1]. The aim of pre-processing Web server log files is to obtain the set of sessions (visits) recorded in the log files. It can be divided into three distinct phases: data cleansing, user/session identification, and path completion. Our system implements all of these components of Web usage mining.

The second phase of Web usage mining is pattern discovery [1, 2, 5, 6]. Pattern discovery involves the extraction of some meaningful information, such as association rules, classification rules, or sequential patterns. The PP and MP algorithms can be seen as the pattern discovery phase for the Personalized Site Map task.

The construction of improved site maps is discussed in [3]. Li et al. discuss the need for topic-focused site maps that home in on the user's interests and display only the relevant section of the map. They also discuss the granularity of the site map, i.e., the level of detail the map should show. Their method extracts logical domains from the Web site, where each logical domain is associated with a certain topic. Unlike our system, they use semantic knowledge from the pages' contents.

6. Future Work
Our Personalized Site Map algorithms have been fully implemented. Our current focus involves measuring the effectiveness of our approach. Our experiments have demonstrated that our technique works well, in the sense that we are able to build site maps containing popular (as opposed to merely short) paths. We believe that users will find popular paths intuitive, but we have not yet established this empirically. We intend to conduct user trials of the system to get users' judgements of the quality of the Personalized Site Maps. For example, one important topic is the generality/specificity of the pages on the paths: do the pages earlier in the path contain more general information than later pages?

We are also exploring other applications for our path-mining algorithm. At its core, we have developed an approach to predicting which pages are likely to be viewed next, given a prefix of a visitor's trajectory. Therefore, a second potential application concerns using this predictive ability for pre-fetching and caching [8]. Our preliminary investigation of pre-fetching shows promising results. For the UCD site, we allow the PP algorithm to recommend a likely next page after each session prefix (a simple sketch of this experiment is given at the end of this section). The entire dataset contains over 236,000 sessions, leading to 832,307 recommendations from PP. Of these recommendations, 43,349 (5.2%) were correct (i.e., the user did indeed visit the recommended page next). We intend to extend this experiment to the caching of multiple pages, and to compare our approach to existing page-prediction algorithms.

Another possible direction would be to introduce a collaborative element to the system. We could rate each popular path for a user based on whether it appears in his sessions or not. Standard collaborative filtering techniques could then be used to select a popular path to recommend with greater confidence than our current PP algorithm.
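A minimal sketch of the next-page recommendation experiment mentioned above (Python; the helper names are hypothetical and the prefix-matching scheme is an assumption about how PP's paths would be applied):

```python
from collections import Counter
from typing import Dict, List, Tuple

Prefix = Tuple[str, ...]


def build_next_page_model(popular_paths: List[List[str]]) -> Dict[Prefix, str]:
    """Map each path prefix to the page that most often follows it."""
    counts: Dict[Prefix, Counter] = {}
    for path in popular_paths:
        for i in range(1, len(path)):
            prefix = tuple(path[:i])
            counts.setdefault(prefix, Counter())[path[i]] += 1
    return {prefix: c.most_common(1)[0][0] for prefix, c in counts.items()}


def recommendation_accuracy(model: Dict[Prefix, str],
                            sessions: List[List[str]]) -> float:
    """Fraction of recommendations where the visitor really did view the
    predicted page next."""
    hits = total = 0
    for session in sessions:
        for i in range(1, len(session)):
            prediction = model.get(tuple(session[:i]))
            if prediction is not None:
                total += 1
                hits += prediction == session[i]
    return hits / total if total else 0.0
```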
7. Conclusions

We have introduced the problem of automatically constructing Personalized Site Maps. The key challenge is to display to the visitor a subgraph that both contains relevant content items and organizes them in a coherent and meaningful manner. Our approach is based on the assumption that the best way to indicate the relationship between a given pair of pages is to show the path between them that has been most popular with past visitors. Based on this observation, we propose a naive algorithm (PP) for mining popular paths from raw server logs, and a more sophisticated algorithm (MP) for mining path fragments. The key idea of MP is to mine a collection of path fragments that can be dynamically assembled in order to reconstruct many popular paths. Our experiments with two large Web sites confirm that MP can reconstruct a larger fraction of visitors' sessions than PP.
Acknowledgements: We thank Barry Smyth for helpful discussions. This research was funded by grant N-00014-00-1-0021 from the US Office of Naval Research, and grant 01/F.1/C015 from Science Foundation Ireland.
References

[1] Cooley, R. "Web Usage Mining." PhD Thesis, Department of Computer Science, University of Minnesota, 2001.

[2] Gaul, W., Schmidt-Thieme, L. "Mining Web Navigation Path Fragments." In Proceedings of the Workshop on Web Mining for E-Commerce: Challenges and Opportunities, Boston, MA, August 2000.

[3] Li, W-S., Ayan, N. F., Kolak, O., Vu, Q. "Constructing Multi-Granular and Topic-Focused Web Sites." In Proceedings of WWW10, Hong Kong, 2001.

[4] McGinty, L., Smyth, B. "Case-Based Route Planning." In Proceedings of the Conference on Artificial Intelligence and Cognitive Science, Galway, Ireland, 2000.

[5] Srikant, R., Agrawal, R. "Mining Sequential Patterns: Generalisations and Performance Improvements." In Proceedings of the International Conference on Extending Database Technology, 1996.

[6] Srivastava, J., Cooley, R., Deshpande, M., Tan, P-N. "Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data." SIGKDD Explorations, Vol. 1, Issue 2, 2000.

[7] Wexelblat, A., Maes, P. "Footprints: History-Rich Tools for Information Foraging." In Proceedings of the CHI'99 Conference on Human Factors in Computing Systems, 1999.

[8] Yang, Q., Zhang, H. H., Li, T. "Mining Web Logs for Prediction in WWW Caching and Pre-fetching." In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'01), San Francisco, 2001.

[9] Ypma, A., Heskes, T. "Categorization of Web Pages and User Clustering with Mixtures of Hidden Markov Models." In Proceedings of the International Workshop on Web Knowledge Discovery and Data Mining (WEBKDD'02), Edmonton, Canada, 2002.