User-Web Interaction Models for Complete Web Recommendation

Tingshao Zhu¹, Russ Greiner¹, Gerald Häubl² and Bob Price¹

¹ Dept. of Computing Science, University of Alberta, Canada T6G 2E8 {tszhu, greiner, price}@cs.ualberta.ca
² School of Business, University of Alberta, Canada T6G 2R6 [email protected]

Abstract. This paper summarizes our research on predicting relevant pages for Web users based on a learned model. The prediction is based on the content of the pages the user has visited and the actions the user has applied to these pages. We assume that the sequence of actions chosen by the user indicates the degree to which pages satisfy the user’s information need. We show that the model parameters can be estimated by learning from a labelled corpus. Data from lab experiments demonstrate that the prediction model can recommend previously unseen pages from anywhere on the Web for new users.

1 Introduction

While the World Wide Web contains a vast quantity of information, it is often difficult for web users to find the information they are seeking. Our goal is to build a "passive" system that can recommend pages relevant to the user's current information need without burdening the user with intrusive questions. Like most recommendation systems, our system watches a user as she navigates through a sequence of pages, and suggests pages that (it hopes) will provide the relevant information [10, 11].

Our system is unique in several respects. First, many recommendation systems are server-side, and can only provide information about one specific website. By contrast, our system is not specific to a single website, but can point users to pages anywhere on the Web.

The fact that our intended coverage is the entire Web leads to a second difference: adaptation. We observe that the user's needs can change dramatically from day to day, or even in the course of a single browsing session. As the user works on various tasks and subtasks, she will often require information on various unrelated topics, including topics she has not investigated before. This motivates us to build a system that can predict the user's information need dynamically. A dynamic system also adapts to different user communities and different domains.

The third difference concerns the goal of the recommendation system. Many recommendation systems first determine other users that appear similar to the current user B, then recommend that B visit the pages those similar users have visited. Unfortunately, there is no reason to believe that these correlated pages will contain information useful to B. Indeed, the suggested pages may simply be irrelevant pages on the paths that others have taken towards their various goals, or worse, standard dead-ends that everyone seems to hit. By contrast, our goal is to recommend only useful pages, i.e., pages that are relevant to the user's task. We call these "Information Content" pages, or IC-pages for short.

In our research, the recommended IC-pages are based on a learned Web browsing behavior model. We propose that there exists a general browsing behavior model of goal-directed information search on the web; that is, people follow some very general rules to locate the information they are looking for. If we can detect such patterns, and use them to predict a web user's current information need, we can provide useful content recommendations. The models used by our system are learned from annotated webpages.

Systems that learn from data, i.e., that can tune their internal parameters based on training examples, have several advantages over traditional approaches. First, learning algorithms can discover features in enormous datasets that might otherwise not be found by human inspection. Second, embedded machine learning algorithms can improve the performance of programs by endowing them with adaptive, self-modifying behaviours. Finally, and most importantly, these methods avoid the need to program complex systems by hand, and thus allow us to develop working systems for tasks that are difficult, or impossible, to design and implement directly.

2 Learning Tasks

Our work on a web browsing behavior model is driven by several observations we have made concerning the nature of human-computer interaction while searching for information on the Web. Consider the example suggested by Figure 1. Imagine the user needs information about marine animals. The user sees a page with links on "Dolphins" and "Whales". Clicking on the "Dolphins" link takes the user to a page about the NFL football team. This page is not an IC-page for this user at this time, so the user "backs up". We might conclude that the word "Football", which appears on that page but not on the current page, is unlikely to appear in an IC-page, i.e., it is not an IC-word. The user then tries the "Whales" pointer, which links to a page whose title includes "whale", and whose content includes whales, oceans, and other marine terms. The user then follows a link on that page to another with similar content, terminating in a page about a local aquarium. We might conclude that words such as "Dolphin" and "Ocean" are IC-words.

These observations suggest that the user's current information need can be identified from the pages the user visits and the actions the user applies to those pages. We infer the user's attitude towards the content of the pages from the actions she applies to them; the content of a page is communicated by the roles that words play on the page. We make the simplifying assumption that we can represent the information need of the session by a set of significant words from the session. We further assume that the significance of each word can be judged independently, from the roles played by instances of the word on pages throughout the sequence (e.g., appearing as plain text, highlighted in a title, appearing in a followed hyperlink, etc.). To capture the reaction of the Web user, we derived a number of browsing features from the page sequence; previous papers [17, 18] give the full list of these "browsing features".
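As a rough illustration of how such per-word browsing features could be derived from a session, consider the sketch below. The `Page` fields and the particular feature names are invented for this example; the actual feature set is the one listed in [17, 18].

```python
from dataclasses import dataclass

@dataclass
class Page:
    """One visited page in a browsing session (hypothetical representation)."""
    title: str
    body: str
    followed_anchor: str = ""  # anchor text of the link the user clicked next
    search_query: str = ""     # query terms, if this page is a search-results page

def word_features(session):
    """Map each word seen in the session to a few illustrative browsing features."""
    feats = {}
    for page in session:
        for w in set(page.body.lower().split()):
            f = feats.setdefault(w, {"in_title": False,
                                     "in_followed_link": False,
                                     "in_query": False})
            f["in_title"] |= w in page.title.lower().split()
            f["in_followed_link"] |= w in page.followed_anchor.lower().split()
            f["in_query"] |= w in page.search_query.lower().split()
    return feats

# Toy session modeled on the marine-animals example above.
session = [
    Page(title="Marine animals", body="dolphins whales ocean football",
         followed_anchor="whales"),
    Page(title="Whales", body="whales ocean marine life"),
]
feats = word_features(session)
```

Note that the resulting feature vectors describe how each word appeared in the session, not what the word is; this is what later lets a learned model generalize to words it has never seen.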
To obtain the user model for Web recommendation, we first collected a set of annotated web logs (in which the user has indicated which pages are IC-pages) [18], from which our learning algorithm learned to characterize the IC-pages associated with the pages in any partial subsession. For more information on data collection and preprocessing, please refer to our previous papers [17, 18].

Fig. 1. Browsing Behavior to Locate IC-page

In the next two sections, we describe two approaches to inferring the user's information need from her browsing actions. In the first approach, we build a model that predicts whether a page is an IC-page based on "browsing features" of its URL. In the second approach, we train a model for IC-word prediction, and identify pages that match the predicted IC-words.

2.1 Predicting IC-pages from URLs

The model learned from URLs takes as input an annotated sequence of pages and a target URL V, and predicts whether V points to an IC-page, i.e., whether V contains information relevant to the user. This model examines properties of the URLs, both of the target V and of the observed sequence, such as the domain type, whether the URL followed a search engine, etc., to learn rules of the form

    any URL whose domain has been accessed more than 10 times,
    and whose depth is more than 3, is likely to be an IC-page     (1)

Notice this rule is not about any specific page, e.g., it is not about http://www.cs.ualberta.ca, but instead identifies a page based on the user's browsing patterns.

Let Uj = [u1, u2, ..., uj] be the sequence of URL properties associated with the first j pages of the user's browsing sequence, where each ui = ["in .org domain", "probability of domain being IC-page", "previous URL was a search-engine page", ...]. Let Cj = [c1, c2, ..., cj] be the sequence of classifications of pages by the user, as to whether each page is an IC-page or not. The learning task is to estimate a model of Pr(cj | Uj, Cj−1) for 1 ≤ j ≤ n from labelled data.

During the training phase, we first extracted certain "browsing features" of the URLs from the raw data, and labelled each page to indicate whether it is an IC-page.
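A minimal sketch of the kind of URL features involved, with rule (1) written out by hand, might look as follows. The search-engine list and the exact feature names are assumptions for illustration, not the features used in our experiments.

```python
from collections import Counter
from urllib.parse import urlparse

# Illustrative list; the recognized engines are an assumption of this sketch.
SEARCH_ENGINES = {"www.google.com", "search.yahoo.com"}

def url_features(history, target):
    """Compute a few URL 'browsing features' for a target URL, given the
    URLs visited so far in the session (a simplified sketch)."""
    domain_counts = Counter(urlparse(u).netloc for u in history)
    parsed = urlparse(target)
    depth = len([seg for seg in parsed.path.split("/") if seg])
    prev_domain = urlparse(history[-1]).netloc if history else ""
    return {
        "in_org_domain": parsed.netloc.endswith(".org"),
        "domain_visits": domain_counts[parsed.netloc],
        "depth": depth,
        "follows_search_engine": prev_domain in SEARCH_ENGINES,
    }

def rule_1(f):
    """Hand-coded analogue of learned rule (1)."""
    return f["domain_visits"] > 10 and f["depth"] > 3

feats = url_features(["https://www.google.com/search?q=whales"],
                     "https://www.whales.org/species/orca")
```

In the actual system a rule like `rule_1` is not written by hand; it is induced by the learner from the labelled logs.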
After data preparation, we ran several classification algorithms on the data set, producing a decision tree (C4.5) [13], a Naïve Bayes (NB) [7] classifier, and a Boosted Naïve Bayes (BNB) [15] classifier. Among them, BNB has the best "worst-case" performance, averaging around 65%. Table 1 shows the results from BNB on balanced training data (equal numbers of IC and ¬IC pages). Note that random guessing on balanced data would yield a precision of 50% and a recall of 50%. Refer to [18] for more detail on data preparation and feature extraction.

Table 1. Precision and Recall of IC-page Prediction from URLs

              Precision   Recall
    IC-page   67%         70%
    ¬IC-page  69%         65%
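For concreteness, the precision and recall figures reported in Table 1 are computed in the standard way; the sketch below shows the calculation on toy predictions (the labels and counts are invented, not our experimental data).

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class, from parallel label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    predicted = sum(1 for p in y_pred if p == positive)
    actual = sum(1 for t in y_true if t == positive)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

# Toy balanced data: 10 IC pages and 10 non-IC pages, hypothetical predictions.
y_true = ["IC"] * 10 + ["nonIC"] * 10
y_pred = ["IC"] * 7 + ["nonIC"] * 3 + ["IC"] * 3 + ["nonIC"] * 7

p, r = precision_recall(y_true, y_pred, "IC")  # 0.7, 0.7
```

On balanced data, a classifier that guessed at random would drive both numbers toward 0.5, which is the baseline Table 1 should be read against.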

The results suggest that the information in a sequence of URLs, together with some domain knowledge about the types of URLs, significantly improves IC-page prediction over random guessing. Although this model is fairly accurate, it does require the user to annotate the pages while she is browsing.

2.2 IC-page Identification Based on IC-word Prediction

To avoid the annotations required in Section 2.1, we break the task of IC-page prediction down into two stages: (1) predicting the user's information need, which corresponds to the IC-words, then (2) locating and returning pages that provide this information (the IC-pages). The first subtask, IC-word prediction, does not require annotation; the second subtask identifies pages that match the predicted IC-words. Thus, in this model, no annotation is required to predict IC-pages.

2.2.1 IC-word Prediction

The IC-word predictor takes as input an unannotated sequence of visited webpages U = ⟨U1, ..., Un⟩ as well as a learned word-recommender model MW, and returns a list of IC-words, i.e., words that are likely to appear within the IC-page associated with U (and that, presumably, reflect the user's information need). In particular, it considers every word w that appears in any Ui, then assigns "browsing properties" to each w based on how w appears within the session, e.g., did w appear in the title of any webpage, did it ever appear as a query to a search engine, etc. MW is then used to classify each w, determining the chance that a word with these browsing properties will be an IC-word.
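Since MW classifies browsing-property vectors rather than words, any standard classifier over those vectors will do. As a self-contained illustration (the feature names and the tiny training set are invented; our experiments used NB, SVM, and C4.5 on the real feature set), here is a minimal Naïve Bayes over binary browsing properties:

```python
import math
from collections import defaultdict

class BinaryNaiveBayes:
    """Minimal Naive Bayes over binary browsing features (illustrative only)."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.prior = {c: y.count(c) / len(y) for c in self.classes}
        self.cond = {c: defaultdict(lambda: 0.5) for c in self.classes}
        for c in self.classes:
            rows = [x for x, label in zip(X, y) if label == c]
            for f in X[0]:
                # Laplace-smoothed estimate of P(feature is True | class)
                self.cond[c][f] = (sum(x[f] for x in rows) + 1) / (len(rows) + 2)
        return self

    def predict(self, x):
        def log_post(c):
            s = math.log(self.prior[c])
            for f, v in x.items():
                p = self.cond[c][f]
                s += math.log(p if v else 1 - p)
            return s
        return max(self.classes, key=log_post)

# Hypothetical labelled words: the features describe how a word appeared
# in the session, never the word itself.
X = [{"in_title": True,  "in_query": True},
     {"in_title": True,  "in_query": False},
     {"in_title": False, "in_query": False},
     {"in_title": False, "in_query": False}]
y = ["IC", "IC", "other", "other"]

model = BinaryNaiveBayes().fit(X, y)
pred = model.predict({"in_title": True, "in_query": True})  # "IC"
```

Because the model never sees the words themselves, it can score a word like "moose" that never occurred in training, provided its browsing properties are familiar.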
Notice that this classifier bases its decisions on the browsing properties of a word, rather than on the word itself; i.e., it might claim that

    a word that appears in at least two titles in the session,
    and was part of a search-engine query, is likely to be an IC-word     (2)

but it will not make claims about, say, whether "moose" is an IC-word or not, nor will it build an association rule like "given moose, Alberta will be an IC-word" [1].

After preparing the data, we trained several classifiers, including Naïve Bayes, Support Vector Machines (SVM), and Decision Tree (C4.5). Our previous papers [18, 19] report the test results of these classifiers. Among them, C4.5 performed the best, scoring an average of 80%.

To demonstrate how browsing models generalize across users, we trained the model on all but one user and tested it on the user left out. The results [16] indicate that user models generalize well to new users. In a separate experiment [18], we show that explicitly training a model for a particular user can yield benefits beyond the generic model.

A good recommendation system should predict all-and-only the IC-pages, and predict these pages early. We therefore evaluate the ability of the model to predict IC-words from a subset of the pages taken from the beginning of the user's browsing sequence. In [17], we compare our method with two baseline methods: in the first, we let the IC-words be all of the words in the subset; in the second, we let the IC-words be all of the feature words in the subset (i.e., words enclosed by certain HTML tags). Our approach did significantly better at accurately identifying IC-words, and at identifying them early in the user's session. Comparing Naïve Bayes with C4.5, C4.5 performs much better [19]. Since many of the IC-words are not even present in the subset of pages leading up to the IC-page, one cannot obtain a perfect prediction; we do, however, predict well the IC-words that do appear in the early subset.

2.2.2 IC-page Identification

After predicting the IC-words, the next step is to find pages that contain these words. In our research, the user's current information need is represented as a list of word-probability pairs {⟨w, p(w)⟩}, where p(w) estimates the probability that the word w will be an IC-word. Figure 2 shows two ways our system could use this information. First, it could "scout ahead": follow the outward links from the current page (recursively, in a breadth-first fashion), seeking pages that match many of these IC-words, and then recommend such IC-word-rich pages to the user. Alternatively, it could send an appropriate query to a search engine (e.g., Google), then possibly scout forward from the pages returned. We have implemented both methods of identifying IC-pages in our complete recommender system, WebIC [19].
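One simple way to rank candidate pages against the predicted need {⟨w, p(w)⟩} is to sum p(w) over the IC-words a page contains. The scoring rule, the probabilities, and the candidate URLs below are invented for illustration; they are not WebIC's actual ranking function.

```python
def page_score(page_words, ic_words):
    """Score a candidate page by summing p(w) over predicted IC-words it contains."""
    return sum(p for w, p in ic_words.items() if w in page_words)

# Predicted information need as word-probability pairs (hypothetical values).
ic_words = {"whale": 0.9, "ocean": 0.7, "aquarium": 0.6, "football": 0.1}

# Candidate pages found by scouting ahead or by querying a search engine,
# each reduced here to its set of words.
candidates = {
    "http://example.org/aquarium": {"whale", "ocean", "aquarium"},
    "http://example.org/nfl": {"football", "dolphins"},
}
best = max(candidates, key=lambda u: page_score(candidates[u], ic_words))
```

Under either strategy in Figure 2, scouting ahead or re-ranking search-engine results, the same score can be used to decide which pages are worth recommending.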

3 Related Work

Many groups have built various types of systems that recommend pages to web users. This section summarizes several of those systems, and discusses how they differ from our approach.

Zukerman [20] distinguishes two main approaches to learning which pages will appeal to a user: a content-based system tries to learn a model based on the contents of the webpage, while a collaborative system bases its model on finding "similar" users, assuming the current user will like the pages that those similar users have visited. Our models are basically content-based, as their decisions are based on the combination of page content and the actions applied to it. However, our approach differs from standard content-based systems. As many such systems are restricted to a single website, their classifiers can be based on a limited range of words or URLs, which means they can make predictions about the importance of specific URLs (see association rules [1]) or of specific hand-selected words [4, 8, 3]. We did not want to restrict ourselves to a single website, but wanted a system that could recommend pages anywhere on the web, which could therefore involve an unrestricted range of words. For this reason, we built our content-based classifiers on characteristics ("browsing properties") of the words, or of the URLs. Notice this means our system is not restricted to predefined words, nor to webpages that have already been visited by this user, nor even to webpages that have been visited by similar users. As these browsing properties appear across different websites, we expect them to be useful even in novel web environments.

Fig. 2. IC-page Identification

Another way to predict useful Web pages is to take advantage of a Web user model. Chi et al. [6], and Pirolli and Fu [12], construct the user's information need from the context of the followed hyperlinks, based on the theory of Information Scent, and view it as the information that the user wants; this appears very similar to our approach. By contrast, our approach is based on the idea that certain browsing properties of a word (e.g., appearing in the context around hyperlinks, or in titles, etc.) indicate whether a user considers the word to be important; moreover, our system learned this from training data, rather than making an ad hoc assumption. Letizia [9] and Watson [5] anticipate the user's information needs using heuristics. While heuristics may represent user behavior well, we believe a model learned from user data should produce a more accurate user model.

In Table 2, we compare several techniques according to key aspects of recommender systems, where:

– Co-occurrence Based (COB), e.g., Association Rules [1], Sequential Patterns [2], etc.
– Collaborative Filtering (CF) [14]
– Content-Based (CB) [4, 8, 3]
– Heuristic-Based Model (HBM) [12, 9, 5]
– IC-page Prediction from URLs (IC-URL), i.e., Section 2.1
– IC-page Prediction from IC-words (IC-word), i.e., Section 2.2

The most common kinds of collaborative filtering are based on weakly informative choice signals from users (purchases at Amazon, viewings of a web page, etc.); we view these systems as not requiring user annotation. In some systems, such as movie databases, users explicitly rate the value of items; we treat these explicit ratings as annotations.

Table 2. Techniques for Recommender Systems

                                       COB       CF        CB          HBM         IC-URL      IC-word
    Specific Site/Domain               Yes       Yes       Yes         No          No          No
    Model Acquirement                  Learning  Learning  Learning    Hand-coded  Learning    Learning
    Annotation Required (training)     No        No        Yes         No          Yes         Yes
    Annotation Required (performance)  No        No        No          No          Yes         No
    Personalization                    Generic   Cluster   Individual  None        Individual  Individual
    Recommending Useful Information    No        N/A       Yes         Yes         Yes         Yes
    Using Sequential Information       No        No        No          Yes         Yes         Yes

4 Conclusion and Future Work

Our browsing behaviour model identifies relevant IC-pages based on the browsing features of words. The model is independent of any particular words or domain, and although we can train a personalized model, the patterns are largely user-independent. Our feature-based model uses a unique source of information, browsing behaviours, to provide recommendations when other paradigms cannot, and it can be implemented in a practical and useful application.

Correlation-based recommenders point users to pages other users visit, not necessarily to pages other users found useful, or that will be useful to the current user. Content-based recommenders that learn content models of a specific website, or set of sites, cannot recommend pages for other sites. Browsing-pattern-based recommenders lack the leverage provided by content or peer knowledge, but tap into a new source of knowledge that works for any content and any user.

We are currently investigating the best way to connect our system to multiple search engines, and to learn which connections can provide specific page recommendations to the user. We plan to explore natural language processing techniques to extend the range of our IC-word predictions, and other machine learning algorithms to make better predictions. We also would like to run further tests on other domains and with larger user pools.

Acknowledgement

The authors gratefully acknowledge the generous support of Canada's Natural Science and Engineering Research Council, the Alberta Ingenuity Centre for Machine Learning (http://www.aicml.ca), and the Social Sciences and Humanities Research Council of Canada Initiative on the New Economy Research Alliances Program (SSHRC 538-2002-1013).

References

1. R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of the 20th Int'l Conference on Very Large Databases (VLDB'94), Santiago, Chile, Sep 1994.
2. R. Agrawal and R. Srikant. Mining sequential patterns. In Proc. of the Int'l Conference on Data Engineering (ICDE), Taipei, Taiwan, Mar 1995.
3. Corin Anderson and Eric Horvitz. Web montage: A dynamic personalized start page. In Proceedings of the 11th World Wide Web Conference (WWW 2002), Hawaii, USA, 2002.
4. D. Billsus and M. Pazzani. A hybrid user model for news story classification. In Proceedings of the Seventh International Conference on User Modeling (UM '99), Banff, Canada, 1999.
5. Jay Budzik and Kristian Hammond. Watson: Anticipating and contextualizing information needs. In Proceedings of the 62nd Annual Meeting of the American Society for Information Science, Medford, NJ, 1999.
6. E. Chi, P. Pirolli, K. Chen, and J. Pitkow. Using information scent to model user information needs and actions on the web. In ACM CHI 2001 Conference on Human Factors in Computing Systems, pages 490–497, Seattle, WA, 2001.
7. Richard Duda and Peter Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.
8. Andrew Jennings and Hideyuki Higuchi. A user model neural network for a personal news service. User Modeling and User-Adapted Interaction, 3(1):1–25, 1993.
9. H. Lieberman. Letizia: An agent that assists web browsing. In International Joint Conference on Artificial Intelligence, Montreal, Canada, Aug 1995.
10. Bamshad Mobasher, R. Cooley, and J. Srivastava. Automatic personalization through web usage mining. Technical Report TR99-010, Department of Computer Science, DePaul University, 1999.
11. Mike Perkowitz and Oren Etzioni. Adaptive sites: Automatically learning from user access patterns. Technical Report UW-CSE-97-03-01, University of Washington, 1997.
12. P. Pirolli and W. Fu. SNIF-ACT: A model of information foraging on the world wide web. In Ninth International Conference on User Modeling, Johnstown, PA, 2003.
13. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, 1992.
14. P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl. GroupLens: An open architecture for collaborative filtering of netnews. In Proceedings of the ACM 1994 Conference on Computer Supported Cooperative Work, pages 175–186, Chapel Hill, North Carolina, 1994. ACM.
15. Robert Schapire. Theoretical views of boosting and applications. In Tenth International Conference on Algorithmic Learning Theory, 1999.
16. Tingshao Zhu, Russ Greiner, and Gerald Häubl. Predicting web information content. In IJCAI-03 Workshop on Intelligent Techniques for Web Personalization, pages 1–7, Acapulco, Mexico.
17. Tingshao Zhu, Russ Greiner, and Gerald Häubl. An effective complete-web recommender system. In The Twelfth International World Wide Web Conference (WWW2003), Budapest, Hungary, May 2003.
18. Tingshao Zhu, Russ Greiner, and Gerald Häubl. Learning a model of a web user's interests. In The 9th International Conference on User Modeling (UM2003), Johnstown, USA, June 2003.
19. Tingshao Zhu, Russ Greiner, Gerald Häubl, and Bob Price. The clicks predict: Learning a model for predicting information need from human-WWW interaction. Submitted to CHI'04.
20. I. Zukerman and D. Albrecht. Predictive statistical models for user modeling. User Modeling and User-Adapted Interaction, 11(1-2):5–18, 2001.