Dynamic Personalization of Web Sites Without User Intervention
A novel online recommender system builds profiling models and offers suggestions without the user taking the lead.

By Ranieri Baraglia and Fabrizio Silvestri

The Web is an integral part of today's business dealings. Companies and institutions exploit the Web to conduct their business; customers make daily use of the Net to perform all kinds of transactions. In addition, most users browse through pages of personal interest. The Web, as we know, is massive, and its data is collected from countless sources. Consequently, search tools should be able to accurately extract, filter, and select what is "hidden" from such tools.
Web Usage Mining (WUM) typically extracts knowledge by analyzing historical data such as Web server access logs, browser caches, or proxy logs. WUM techniques are important for several reasons. They make it possible to model user behavior and, therefore, to forecast users' future movements. The information mined can subsequently be used to personalize the contents of Web pages, to improve Web server performance, to structure a Web site according to the preferences expressed by its users, or to help a business reach a specific target group of users. Web Personalization (WP) or recommender systems [3] are typical applications of WUM. These systems were introduced to improve Web site usage by customizing the contents of a Web site with respect to the users' needs. They provide mechanisms that collect information describing user activity and elaborate this information. In a first stage, WUM can be used to determine the number of Web server accesses, the pages requested, the time interval between different user sessions, and the IP addresses of the Web server users. The WP system elaborates this information in order to extract a user profile, based on navigational behavior, that can be employed to provide personalized navigational information. Personalization systems usually process the information related to user sessions, that is, sequences of pages requested by the same user. For instance, the sequence Home → Science → Computer Science → Algorithms could be a typical session of a person browsing a directory (for example, Yahoo!) with an interest in computer algorithms.

The WUM-based personalization process is typically structured according to three phases: Preprocessing, Pattern Discovery, and Pattern Analysis. Preprocessing consists of elaborating the raw Web access logs to produce data in a format usable by the Pattern Discovery phase. The goal of preprocessing is to prune all non-relevant data from the logs, depending on the kind of mining analysis to be carried out. At the end of this phase, a knowledge base is produced.
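As a concrete illustration of the Preprocessing phase, the sketch below reconstructs user sessions from a raw access log. It is a minimal example, not the pipeline of any particular system: it assumes logs in the Apache "combined" format and uses the conventional 30-minute inactivity timeout to split a visitor's requests into sessions; real deployments may identify visitors with cookies rather than IP/user-agent pairs.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed input: Apache "combined" log lines, e.g.
# 127.0.0.1 - - [10/Oct/2006:13:55:36 +0200] "GET /index.html HTTP/1.0" 200 2326 "-" "Mozilla/4.0"
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)
TIMEOUT = timedelta(minutes=30)  # session-inactivity threshold (an assumption)

def sessionize(log_lines):
    """Group page requests into sessions per (host, user-agent) visitor."""
    visits = defaultdict(list)  # visitor -> [(timestamp, url), ...]
    for line in log_lines:
        m = LOG_RE.match(line)
        if not m or m.group("method") != "GET" or not m.group("status").startswith("2"):
            continue  # prune failed or non-GET requests (part of preprocessing)
        url = m.group("url")
        if url.endswith((".gif", ".jpg", ".png", ".css", ".js")):
            continue  # drop embedded objects; keep only page views
        ts = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
        visits[(m.group("host"), m.group("agent"))].append((ts, url))

    sessions = []
    for requests in visits.values():
        requests.sort()
        current = [requests[0][1]]
        for (prev_ts, _), (ts, url) in zip(requests, requests[1:]):
            if ts - prev_ts > TIMEOUT:   # long pause: start a new session
                sessions.append(current)
                current = []
            current.append(url)
        sessions.append(current)
    return sessions
```

The output of such a step, a list of per-visitor page sequences, is the knowledge-base input the later phases assume.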
The goal of the Pattern Discovery phase is to find useful patterns in the knowledge base created in the previous phase (for example, user access patterns) by exploiting various data mining techniques such as Statistical Analysis, Frequent Pattern Analysis, Clustering, and Association Rule Mining. Pattern Analysis then aims to single out the interesting patterns: once the correlations have been determined, one must decide which patterns to keep and which to reject with respect to the goal of the mining process.
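As a small, hedged example of what Pattern Discovery can produce, the following sketch counts how often pairs of pages co-occur within the sessions built during Preprocessing and keeps the pairs whose support exceeds a chosen threshold. The `sessionize` helper and the `min_support` value are assumptions carried over from the previous sketch, not part of the original system.

```python
from collections import Counter
from itertools import combinations

def frequent_page_pairs(sessions, min_support=0.01):
    """Return page pairs whose co-occurrence frequency across sessions
    is at least min_support (expressed as a fraction of all sessions)."""
    if not sessions:
        return {}
    pair_counts = Counter()
    for session in sessions:
        # a pair is counted once per session, regardless of repetitions
        for a, b in combinations(sorted(set(session)), 2):
            pair_counts[(a, b)] += 1
    n = len(sessions)
    return {pair: c / n for pair, c in pair_counts.items() if c / n >= min_support}

# Example usage with the sessionize() sketch above:
# with open("access.log") as f:
#     sessions = sessionize(f)
# patterns = frequent_page_pairs(sessions, min_support=0.05)
```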
In the past, WP system architectures comprised two components, executed offline and online with respect to the Web server activity. The offline component includes the Preprocessing and Pattern Discovery phases, while the online one implements the Pattern Analysis phase in order to generate the personalized content, such as links to pages, advertisements, or information about products or services estimated to be of interest to the current user. We should note that, in order to enhance the navigation experience, these systems must have two main characteristics: they should be non-intrusive and they should scale with the Web site size. Moreover, the management of dynamic pages is another important issue for WP systems. Recent Web-based marketing strategies have focused mainly on the presentation of products and services and on interaction with clients, which has led Web designers to use dynamic pages extensively; a static approach is preferable only for "small" and "quasi-static" Web sites. Here, we introduce SUGGEST, a novel solution that implements WP as a single online module performing user profiling, model updating, and recommendation building [4].

Figure 1. Architecture of a typical Web Recommender System: log files feed an offline phase (Preprocessing and Pattern Discovery) that builds the knowledge base used by the online phase (Pattern Analysis) to answer users' requests.

SUGGEST is a novel solution to implementing Web personalization as a single online module that performs user profiling, model updating, and recommendation building.
SUGGEST is designed to dynamically generate personalized content of potential interest for users of large Web sites made up of dynamically generated pages. It is based on an incremental personalization procedure tightly coupled with the Web server: it incrementally and automatically updates the knowledge base obtained from historical usage data and dynamically generates a list of page links (the suggestions), which are used to personalize the requested HTML page on-the-fly. The adoption of an LRU-based (Least Recently Used) algorithm for handling the knowledge base makes it possible for SUGGEST to manage large Web sites.

The majority of the existing WP systems are structured according to the offline and online modules (see Figure 1). Quite a large number of commercial and open source solutions have been proposed and are often available on the market. IndexFinder [9] is a semi-automatic solution for developing adaptive Web sites. Based on statistical analysis and the visit-coherence assumption (that is, pages within the same session are in general conceptually related), Web logs are analyzed to build clusters of frequently co-occurring pages. Then, index pages containing a collection of hyperlinks to related but unlinked pages regarding a specific topic are generated; these index pages form the suggestions presented to the Web master. A solution proposed in [7] involves enhancing usage mining by enriching the information usually registered in Web logs with formal semantics based on the ontology underlying the site. Data mining techniques can be applied to the enriched Web logs to extract knowledge that can be used to improve the navigational structure as well as exploited in recommender systems. A recommender system proposed in [4] is based on both content and usage data and exploits semantic annotation of Web logs to produce suggestions that include documents that are not in the same path but whose content is relevant with respect to the page visited. In order to characterize every site page, a set of keywords is extracted by means of a text mining analysis and mapped into categories according to a domain-specific taxonomy and thesaurus. The Web logs are then enriched with the relevant keywords and categories. The documents are clustered, based on the similarity between the category terms, and used to expand the recommendation set suggested to the end user. SETA is aimed at supporting e-commerce initiatives, in particular at assisting the navigation of users through Web virtual stores [1]. The system is designed as a multi-agent architecture; each specialized agent supports a different activity of the front-end of a Web store. Adaptation is carried out using a classification of the users based on Bayesian Networks that model user behavior from the specified profile, and the system requires an initial step of personal information collection. Oracle 10gAS Personalization is a commercial product offering Web personalization functions [8]. A periodically rebuilt predictive model is used to generate real-time suggestions. The model itself is built by choosing the best among those generated by two distinct classification algorithms: Predictive Association Rules [5] and a Transactional Naïve Bayes [6]. The model is stored in tables of an Oracle database, making it possible to exploit the data mining model from PL/SQL procedures.

SUGGEST: AN ONLINE WEB PERSONALIZATION SYSTEM
The main limitation of traditional systems is the loosely coupled integration of the Web personalization system with the Web server's ordinary activity.
Indeed, the use of two components has the disadvantage of "asynchronous cooperation" among the components. The offline component must be run periodically to keep the data patterns up-to-date, but the frequency of the updates is a problem that must be solved on a case-specific basis. On the other hand, the integration of the offline and online functions into a single component poses other problems in terms of overall system performance, which should have a very low impact on user response times. Thus, the system must be able to generate personalized results within a small fraction of a user session. Moreover, the knowledge mined by a single component must be comparable to, or better than, that mined by two separate components.

Figure 2. Architecture of the SUGGEST Online Recommender System: a single online module performs session recognition, user profiling, and model updating against the knowledge base, returning suggestions in response to users' requests.

Figure 2 shows the architecture of SUGGEST, the completely online Web Recommender System we recently proposed [2]. SUGGEST is completely online and incremental, and it aims to provide users with information about the pages they may find of interest. It bases personalization on a classification of users that evolves according to their requests. Usage information is represented by means of an undirected graph whose nodes are associated with the identifiers of the accessed pages, and each edge is associated with a measure of the correlation existing between the two nodes (pages). This graph is incrementally modified to keep the user model up-to-date. In our model the "interest" in a page does not depend on its contents but on the order in which pages are visited during a session. Therefore, to weight each edge of the graph we introduced a novel formula:

Wij = Nij / max(Ni, Nj)

where Nij is the number of sessions containing both pages i and j, and Ni and Nj are the numbers of sessions containing only page i or page j, respectively. Dividing Nij by the maximum of the single occurrences of the two pages has the effect of discriminating internal pages from the so-called index pages. Index pages generally do not contain useful content and are only used as a starting point for a browsing session. We decided to consider index pages of too little interest as potential suggestions because they are very likely to be included in too many sessions. Index pages are used in other works (for example, [9]) to present the results of the personalization phase; in those cases, index pages are not actually used to identify potentially useful information but just to present the personalization results.

In SUGGEST, user sessions (identified by means of a cookie-based protocol) are used to build "Session Clusters" that eventually lead to a list of suggestions. The system finds groups of strongly correlated pages by partitioning the graph according to its connected components; each component in turn represents a different class, or cluster, of users. The connected components are obtained incrementally by using a derivation of the well-known Breadth-First Search (BFS) visit, limited to the nodes involved in the request. Basically, we start from the current page identifier and explore the component to which it belongs. If there are any nodes not considered in the visit, a previously connected component has been split and needs to be identified; we simply apply the BFS again, starting from one of the nodes not yet visited. Furthermore, in order to limit the number of edges of the graph, we apply a threshold: edges whose weights are smaller than the predefined threshold are considered poorly correlated and thus discarded. Pages in the same cluster are ranked according to their co-occurrence frequency, and clusters whose size is lower than a threshold value are discarded as not significant enough. Note that the update algorithm does not involve the exploration of the entire graph but only the nodes associated with the pages of the cluster containing the starting page. Usually, each cluster is composed of only a fraction of the entire node set, so the cost of the algorithm is very limited.
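To make the model-update step concrete, here is a minimal sketch of the incremental structure just described. It is an illustration of the published idea, not the authors' code: the class name, the `min_freq` threshold value, and the dictionary-based storage (in place of the LRU-bounded adjacency matrix the system actually uses) are assumptions. The edge weight follows the formula Wij = Nij / max(Ni, Nj), and the clusters are the connected components of the thresholded graph, computed with a BFS restricted to the component of the page just visited.

```python
from collections import defaultdict, deque

class UsageGraph:
    """Incremental co-occurrence graph over visited pages (illustrative sketch)."""

    def __init__(self, min_freq=0.2):
        self.min_freq = min_freq          # edge-weight threshold (assumed value)
        self.n = defaultdict(int)         # N_i: sessions containing page i
        self.nij = defaultdict(int)       # N_ij: sessions containing both i and j

    def add_session_page(self, session_pages, new_page):
        """Update counts when `new_page` is requested in a session whose
        previously visited (distinct) pages are `session_pages`."""
        if new_page in session_pages:
            return                        # already counted for this session
        self.n[new_page] += 1
        for p in session_pages:
            self.nij[frozenset((p, new_page))] += 1

    def weight(self, i, j):
        # W_ij = N_ij / max(N_i, N_j): penalizes index pages seen in many sessions
        nij = self.nij.get(frozenset((i, j)), 0)
        denom = max(self.n[i], self.n[j])
        return nij / denom if denom else 0.0

    def neighbors(self, i):
        """Pages connected to i by an edge whose weight passes the threshold."""
        for key in self.nij:
            if i in key:
                (j,) = key - {i}
                if self.weight(i, j) >= self.min_freq:
                    yield j

    def cluster_of(self, start):
        """Connected component of `start` in the thresholded graph, via BFS.
        Only this component needs to be re-explored after an update."""
        seen, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nb in self.neighbors(node):
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        return seen
```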
The data structure used to store the weights is an adjacency matrix in which each entry contains the weight related to a pair of accessed pages. In order to manage Web sites whose number of pages is not known in advance, such as Web sites that make intensive use of dynamic pages, SUGGEST applies an innovative solution: a page is indexed only when it is first requested. To keep the adjacency matrix manageable in size, an LRU algorithm is applied. The Web master of a site may adjust the matrix size according to constraints such as the available resources and the required performance level; smaller matrix sizes, however, may lead to poor system performance due to frequent page replacements.

After the model has been updated, SUGGEST prepares the list of suggestions on the basis of a classification of the user session. This is done in a straightforward way by finding the cluster having the largest intersection with the pages belonging to the current session. The final suggestions are composed of the most relevant pages in that cluster, according to the ranking determined by the clustering phase, and are inserted as a list of links in the requested page. Visited pages are not included in the suggestions; therefore, users belonging to the same class can receive different sets of suggestions, depending on which pages they have visited in their active session.
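The per-request suggestion step can then be sketched as follows. Again, this is an illustrative reconstruction under the assumptions of the previous sketch, not the module's actual code: `clusters` is assumed to be a list of page lists, each already ranked by co-occurrence frequency, and `max_suggestions` is an arbitrary cut-off.

```python
def build_suggestions(clusters, session_pages, max_suggestions=5):
    """Classify the active session into the cluster sharing the most pages
    with it, then suggest that cluster's top-ranked pages not yet visited."""
    visited = set(session_pages)
    best = max(clusters, key=lambda c: len(visited & set(c)), default=[])
    if not visited & set(best):
        return []                        # no cluster overlaps the session
    return [page for page in best if page not in visited][:max_suggestions]

# Example usage (hypothetical data):
# clusters = [["/papers.html", "/wum.html", "/suggest.html"],
#             ["/sports.html", "/football.html"]]
# build_suggestions(clusters, ["/index.html", "/wum.html"])
# -> ["/papers.html", "/suggest.html"]
```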
SUGGEST is implemented as a single Apache Web server module, which allows easy deployment on potentially any kind of Web site currently available, without changing the site itself. Experimental results demonstrate that SUGGEST is able to provide significant suggestions as well as good system performance. In order to validate our approach, we performed several experiments using three Web server access logs available at www.web-caching.com: NASA (27 days, 19K sessions), USASK (180 days, 10K sessions), and BERKELEY (22 days, 22K sessions). The metric used to measure the quality of SUGGEST estimates the effectiveness of a recommender system as its capacity to anticipate user requests that will be made farther in the future (see [2] for more details).

Figure 3. (a) Effectiveness of SUGGEST, measured as the probability of recommending a page of potential future interest. (b) Comparison of the number of requests satisfied per second when using SUGGEST vs. Apache without SUGGEST installed.

Figure 3a shows the experimental results. The threshold used to prune the graph edges (minfreq) is represented on the x-axis, whereas the quality of the suggestions is given on the y-axis. It can be seen that, in the case of the USASK log, more than 50% of the suggested pages met the users' needs. We also plotted the results of a recommender system giving random links as output; as expected, the results in this case are quite poor. In terms of efficiency, we processed 100,000 requests, varying the number of requests performed simultaneously from 10 to 110. As shown in Figure 3b, the overhead introduced by SUGGEST is less than 8%. Moreover, considering that SUGGEST is able to anticipate users' requests, the module will increase the efficiency of the whole Web server system, since users will spend less time navigating the Web server pages, leaving more capacity for a larger number of users.

CONCLUSION
In this article we have introduced SUGGEST, a completely online Web recommender system that does not require user intervention in the model-building module. We also empirically demonstrated that SUGGEST effectively and efficiently provides recommendations to users.

References
1. Ardissono, L., Goy, A., Petrone, G., and Segnan, M. Personalization in business-to-customer interaction. Commun. ACM 45, 5 (May 2002), 52–53.
2. Baraglia, R. and Silvestri, F. An online recommender system for large Web sites. In Proceedings of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence (Sept. 20–24, 2004).
3. Eirinaki, M. and Vazirgiannis, M. Web mining for Web personalization. ACM Trans. on Internet Technology 3, 1 (Feb. 2003), 1–27.
4. Eirinaki, M., Vazirgiannis, M., and Varlamis, I. SEWeP: Using site semantics and a taxonomy to enhance the Web personalization process. In Proceedings of Knowledge Discovery in Data 2003 (Aug. 2003).
5. Megiddo, N. and Srikant, R. Discovering predictive association rules. In Proceedings of Knowledge Discovery and Data Mining (1998), 274–278.
6. Nath, S.V. Customer churn analysis in the wireless industry: A data mining approach. See the section on Transactional Naïve Bayes.
7. Oberle, D., Berendt, B., Hotho, A., and Gonzalez, J. Conceptual user tracking. In Proceedings of the First International Atlantic Web Intelligence Conference. E. Menasalvas Ruiz, J. Segovia, and P.S. Szczepaniak, Eds. (Madrid, Spain, May 5–6, 2003). Springer, 142–154.
8. Oracle Corporation. Oracle Application Server 10g business intelligence overview. See the section on the Personalization Tool.
9. Perkowitz, M. and Etzioni, O. Adaptive Web sites. Commun. ACM 43, 8 (Aug. 2000).
Ranieri Baraglia ([email protected]) is a researcher in computer science at the Information Science and Technology Institute (ISTI) of the Italian National Research Council in Pisa, Italy.
Fabrizio Silvestri ([email protected]) is a researcher in computer science at the Information Science and Technology Institute (ISTI) of the Italian National Research Council in Pisa, Italy.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
© 2007 ACM 0001-0782/07/0200 $5.00