Understanding and Improving Automated Collaborative Filtering Systems

A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY

Jonathan Lee Herlocker

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

September 2000

© Jonathan Lee Herlocker 2000

ABSTRACT

Automated collaborative filtering (ACF) is a recent software technology that provides personalized recommendation and filtering independent of the type of content. In an ACF system, users indicate their preferences by rating their level of interest in items that the system presents. The ACF system uses the ratings information to match together users with similar interests (who are known as neighbors). Finally, the ACF system can predict a user's rating for an unseen item by examining his neighbors' ratings for that item.

This dissertation presents a broad set of results with the goal of improving the effectiveness and understanding of ACF systems. The results cover four specific challenges: understanding and standardizing evaluation of ACF systems, improving the accuracy of ACF systems, designing and utilizing effective explanations for ACF predictions, and improving ACF to support focused ephemeral recommendations. To address these challenges, a combination of offline analysis and user testing is used.

All of the evaluation metrics that have been proposed for ACF are examined theoretically and compared empirically. The empirical results show that all proposed ACF evaluation metrics perform similarly, which argues for the adoption of a standardized evaluation metric – for which I propose mean absolute error.

With respect to improving algorithm accuracy, I present a detailed empirical examination of the neighborhood-based algorithm, which has been the most successful algorithm, both in research and in commercial applications. The results show that the most significant factor with respect to prediction accuracy is the difference in ratings distributions between users. As a result, normalization of the ratings provides the most significant accuracy improvement. Other algorithm improvements that increase accuracy are devaluation of low-supported correlations and weighting of neighbor contributions.

Explanations have the potential to increase the effectiveness of ACF systems. ACF systems make predictions based on data of highly variable quantity and quality, but current ACF systems are black boxes, so users have no indication of when to trust an ACF prediction. Explanations expose some of the process and data behind the ACF prediction, allowing users to judge for themselves whether a prediction is appropriate for their current context of risk. I present results showing what forms of explanation users find the most compelling, as well as indications that explanations can increase the acceptance of ACF systems.

Finally, I present results from tests of a new algorithm for supporting focused ephemeral user information needs. Traditional ACF systems recommend based on long-term composite interests, and provide no mechanism for focused information needs. Ephemeral information needs are those needs that are immediate, focused, and often temporary. The proposed algorithm provides support for ephemeral information needs using no additional data beyond the standard ACF ratings. The study shows that users greatly value the ephemeral query interface, and that the recommendations are accurate, given an appropriate support threshold – a key algorithmic parameter.

TABLE OF CONTENTS

Chapter 1: Introduction and Overview
1.1 Content-based Filtering and Retrieval
1.1.1 Information Filtering vs. Information Retrieval
1.2 Automated Collaborative Filtering: A New Way of Filtering
1.3 Research Challenges
1.3.1 Challenge: How can we improve the predictive accuracy of automated collaborative filtering algorithms?
1.3.2 Challenge: How can we increase the effectiveness of ACF as a decision-making aid using explanation interfaces?
1.3.3 Challenge: How can we improve automated collaborative filtering systems for meeting ephemeral user information needs?
1.4 Research Methods
1.5 Contributions
1.5.1 Improving Prediction Accuracy
1.5.2 Explaining Collaborative Filtering Predictions
1.5.3 Ephemeral Recommendations
1.5.4 Data and Software Artifacts
1.6 Overview of Dissertation
Chapter 2: Related Work
2.1 Technology for Matching Information to Information Needs
2.1.1 Information Retrieval
2.1.2 Information Filtering
2.1.3 Machine Learning
2.1.4 Collaborative Filtering
2.1.5 Automated Collaborative Filtering
2.1.6 Evaluation of Information Filtering Systems
2.2 Embodiments of Recommendation Technology
2.2.1 Intelligent Agents
2.2.2 Recommender Systems
2.2.3 Methods of Collecting Preferences
Chapter 3: Evaluation of Automated Collaborative Filtering Systems
3.1 Steps in Evaluation
3.1.1 Identify High Level Goals
3.1.2 Identify Specific Tasks
3.1.3 Performing System-Level Analysis
3.2 Metrics
3.2.1 Evaluation of previously used metrics
3.2.2 Which metric to use?
3.2.3 An Empirical Comparison of Evaluation Metrics
3.3 Summary and Recommendations
Chapter 4: Improving Predictive Accuracy
4.1 Problem Space
4.2 Related Work
4.3 Experimental Design
4.3.1 Data
4.3.2 Experimental Method
4.3.3 Metrics
4.3.4 Parameters Evaluated
4.4 Weighting Possible Neighbors
4.4.1 Similarity Weighting
4.4.2 Significance Weighting
4.4.3 Variance Weighting
4.5 Selecting Neighborhoods
4.5.1 Correlation Weight Threshold
4.5.2 Maximum Number of Neighbors Used
4.6 Producing a Prediction
4.6.1 Rating Normalization
4.6.2 Weighting Neighbor Contributions
4.7 Summary
Chapter 5: Explanations: Improving the Performance of Human Decisions
5.1 Background
5.2 Sources of Error
5.2.1 Model/Process Errors
5.2.2 Data Errors
5.3 Explanations
5.4 Research Questions
5.5 Building a Model of Explanations
5.5.1 White Box Conceptual Model
5.5.2 Black Box Model
5.5.3 Misinformed Conceptual Models
5.6 Experiment 1 – Investigating the Model
5.6.1 Design
5.6.2 Results
5.6.3 Analysis
5.7 Experiment 2 – Acceptance and Filtering Performance
5.7.1 Hypotheses
5.7.2 Design
5.7.3 Results
5.7.4 Analysis
5.8 Summary
5.9 Experimental Notes
Chapter 6: Addressing Ephemeral Information Needs
6.1 Approach
6.2 Item Correlation
6.3 Algorithm Description
6.3.1 Algorithm Detailed Description
6.4 Research Goals
6.5 Experiment Design
6.5.1 User Interface
6.6 Results
6.6.1 Results for the Population: Including all Control Groups
6.6.2 The Effect of Support Threshold: Between Groups Results
6.6.3 Qualitative Survey Responses
6.7 Discussion
6.7.1 Evaluate the effectiveness of the proposed interest model and search algorithm
6.7.2 Investigate query-by-example "themes" as a query mechanism
6.7.3 Measure user reaction to the proposed interface
6.7.4 Identify classes of themes for which this technique is effective (using the proposed algorithm)
6.8 Summary
Chapter 7: Software and Data Artifacts
7.1 Software Artifacts
7.1.1 Usenet GroupLens Client Library
7.1.2 MovieLens
7.1.3 DBLens ACF Environment
7.2 Data Artifacts
7.2.1 GroupLens Usenet Data
7.2.2 MovieLens Data
7.3 Summary
Chapter 8: Conclusion
8.1 Answers to Challenges Addressed
Appendix I: Depictions of all Explanation Interfaces Used in Chapter 5
Bibliography

LIST OF FIGURES

3-1 A possible representation of density function for relevant items
3-2 An example of an ROC curve
3-3 Comparative performance of different evaluation metrics
3-4 Pearson correlation per-user compared to overall
3-5 MAE versus Pearson correlation
4-1 Comparison of three different similarity metrics
4-2 Effects of significance weighting on prediction accuracy
4-3 Effects of significance weighting – non-weighted contributions
4-4 Effects of variance weighting on prediction accuracy
4-5 Effects of correlation weight threshold on prediction accuracy
4-6 Max_nbors parameter versus MAE
4-7 Rating normalization versus MAE
4-8 Effects of weighting neighbors contributions on MAE
4-9 Effects of weighting neighbors – constant significance weighting
5-1 Example of an explanation interface
5-2 Explanation interface: clustered rating histogram
5-3 Explanation interface: simple confidence value
5-4 Explanation interface: neighborhood graph
5-5 Percentage of correct movie decisions by control group
6-1 Architecture of the ephemeral recommendation system
6-2 Main screen of the MovieLens Matcher ephemeral recommender
6-3 Theme edit screen of the MovieLens Matcher recommender
6-4 Search results screen of the MovieLens Matcher recommender

LIST OF TABLES

1-1 Overview of dissertation
3-1 Precision/recall contingency table
3-2 Evaluation algorithms versus data/task characteristics
4-1 Example of collaborative filtering data space
4-2 Prediction algorithm variants tested
4-3 MAE of Spearman versus Pearson correlation
4-4 MAE of Pearson correlation versus mean-squared difference
4-5 Interaction between weight threshold, coverage, and MAE
4-6 Interaction between weight threshold and coverage
4-7 Significance of MAE differences due to max_nbors
4-8 Comparison of different overall average algorithms
5-1 Mean response of users to various explanation interfaces
6-1 Sample matrix of CF data space for recommending movies
6-2 Summary of user participation in ephemeral experiment
6-3 List of survey questions presented to experimental subjects
6-4 Mean responses to survey questions for ephemeral experiment
6-5 Statistical significance for survey answers between groups 1 and 2
6-6 Statistical significance for survey answers between groups 1 and 3
6-7 Statistical significance for survey answers between groups 2 and 3
6-8 Themes for which the ephemeral rec. algorithm was successful
6-9 Themes for which the ephemeral rec. algorithm was not successful


Chapter 1: Introduction and Overview

The visibility of personal computers, individual workstations, and local networks has focused most of the attention on generating information—the process of producing documents and disseminating them. It is now time to focus more attention on receiving information—the processes of controlling and filtering information that reaches the persons who must use it.
—Peter Denning, President of the ACM, 1982 [22]

We all experience information overload—most of us on a day-to-day basis. The information comes from many different sources: in our professions, there are memos, books, technical articles, conference and journal publications, web pages, voice messages, and email messages. To keep up with events in our world there are news articles, editorial columns, weather reports, stock tickers, and PR announcements. To maintain our lifestyle, we often have to parse information from consumer journals, shopping specials, for-sale ads, and online auctions. To keep ourselves entertained we must select from a seemingly uncountable collection of movies, videos, music CDs, theater performances, musical performances, travel locations, and recreational events. Any one of these information flows represents more data than an individual can handle. Yet, in all of the above cases, people can benefit from, and often require, specific individual pieces within each information flow. We spend significant effort in locating those pieces of information, often unsuccessfully.

Filtering information flows such as those described above often involves tedious examination of large amounts of data. The tedious, repetitive nature of such work makes it an ideal target for automation using computer software, as computers are very good at performing repetitive tasks. For many years, the fields of information retrieval and information filtering have worked to apply software technology to information overload problems. More recently, software systems employing automated collaborative filtering technology have been applied to alleviate the problem of information overload. This dissertation describes several years of research into automated collaborative filtering algorithms, interfaces, and systems.

This chapter is organized as follows. First, we will briefly define and describe content-based filtering, which was the prevalent approach to information filtering before the development of automated collaborative filtering. Second, we will describe automated

collaborative filtering and its advantages. Third, we will present a list of the research challenges that we have addressed. Fourth, we will briefly describe the experimental methods used in the pursuit of these challenges. Fifth, we present an outline for the remaining chapters.

1.1 Content-based Filtering and Retrieval

For more than thirty years, computer scientists have been addressing the problem of information overload by designing software technology that automatically recognizes and categorizes information. Such software automatically generates descriptions of each item's content, and then compares the description of each item to a description of the user's information need to determine if the item is relevant to that need. The descriptions of the user's information needs are either supplied by the user, such as in a query, or learned from observing the content of items the user consumes. We call these techniques content-based because the software performs filtering based on software analysis of the content of the items analyzed.

Text search engines are a prime example of content-based filtering. Many text search engines use a technique called term-frequency indexing [70]. In term-frequency indexing, documents and user information needs are described by vectors in a space with one dimension for every word that occurs in the database. Each component of the vector is the frequency with which the respective word occurs in the document or the user query. The document vectors that are found to be the closest to the query vectors (computed using the dot product) are considered the most likely to be relevant to the user's query. Most information filtering and information retrieval systems today are built using entirely content-based information retrieval technology.

Other examples of content-based filtering are Boolean search indexes, where the query is a set of keywords combined by Boolean operators [14]; probabilistic retrieval systems, where probabilistic reasoning is used to determine the probability that a document meets a user's information need [26,69,88]; and natural language query interfaces, where queries are posed in natural sentences [42,49,81].
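As a minimal illustration of the term-frequency approach described above, the following sketch builds raw term-frequency vectors for a toy document collection and ranks the documents by their dot product with a query vector. It is a deliberately simplified assumption-laden example (no stemming, stop-word removal, or TF-IDF weighting), not the indexing scheme of any particular system cited here.

```python
from collections import Counter

def term_frequencies(text):
    """Sparse term-frequency vector: word -> number of occurrences."""
    return Counter(text.lower().split())

def dot_product(vec_a, vec_b):
    """Dot product of two sparse vectors stored as dictionaries."""
    return sum(count * vec_b.get(word, 0) for word, count in vec_a.items())

# Toy document collection and query, invented for this example.
documents = {
    "d1": "collaborative filtering predicts ratings from the ratings of similar users",
    "d2": "search engines index documents by the words the documents contain",
    "d3": "users rate movies and the system recommends movies to similar users",
}
query_vec = term_frequencies("recommend movies to users")

doc_vecs = {name: term_frequencies(text) for name, text in documents.items()}
# Rank documents by similarity to the query, highest dot product first.
ranking = sorted(doc_vecs, key=lambda name: dot_product(query_vec, doc_vecs[name]),
                 reverse=True)
print(ranking)  # "d3" scores highest because it shares the most query terms
```

Note that the exact-match assumption is why synonym handling matters in real indexes: the query term "recommend" does not match "recommends" here without stemming.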

1.1.1 Information Filtering vs. Information Retrieval

The terms "information retrieval" and "information filtering" both describe the application of software technology to information overload problems. Within computer and information sciences, there is often a distinction made between information filtering and information retrieval [9]. The field of information retrieval traditionally develops storage, indexing, and retrieval technology for textual documents. A user describes his information need in the form of a query to the information retrieval system, and the system attempts to find items that match the query within a document store. The information need is usually very dynamic and temporary; a user issues a query to an information retrieval system describing an immediate need. Furthermore, information retrieval systems tend to maintain a relatively static store of information. Internet search engines and searchable bibliographic databases are results of information retrieval research.

While information retrieval is user-initiated and query-focused, information filtering systems generally operate on continuous information streams. Unlike information retrieval systems, information-filtering systems generally maintain a profile of the user's interests throughout many uses of the system. As a result, information-filtering systems tend to filter information based on more long-term interests. Filters are applied to each new item of information that arrives via an information stream and appropriate action is taken if an item matches a filter. Appropriate actions might include ignoring the information, bringing the information to the notice of a user, or taking action on the information on behalf of the user. Software that categorizes the probable importance of incoming email is an example of information filtering. The automated collaborative filtering technology addressed in this dissertation provides solutions for both information retrieval tasks and information filtering tasks.

1.2 Automated Collaborative Filtering: A New Way of Filtering

In recent years, collaborative filtering (CF) has been developed to address areas where content-based filtering is weak. CF systems are different from traditional computerized information filtering systems in that they do not require computerized understanding or recognition of content. In a CF system, items are filtered based on user

evaluations of those items instead of the content of those items. For example, the Tapestry system [28] allows users to specify queries such as "Show me all memos that Joe tagged as important." Members of the community can benefit from others' experience before deciding to consume new information. Active collaborative filtering is a variant of CF where users of the system explicitly forward items of interest to other individuals or groups of people who might be interested [56].

While early CF systems required users to specify the predictive relationships between users' opinions [28] or explicitly forward items of interest [56], automated collaborative filtering (ACF) systems automate all procedures except for the collection of user ratings on items. A user of an ACF system rates each item he experiences based on how much that item matches his information need. These ratings are collected from groups of people, allowing each user to benefit from the ratings of other users in the community. An ACF system uses the ratings of other people to predict the likelihood that an item will prove valuable to the user. Simple ACF systems present to a user the average rating of each item of potential interest [55]. This allows the user to discover items that are of popular interest, and avoid items that are of popular dislike. Such ACF systems are not personalized, presenting each user with the same prediction for each item. More advanced ACF systems automatically discover predictive relationships between users, based on common patterns discovered in the ratings [5,20,38,68,79]. These systems provide every user with personalized predictions that may be different from everybody else's. An example of ACF is the MovieLens movie recommendation system [20]. A user provides MovieLens with ratings of movies they have seen before. MovieLens then uses those ratings to find people with similar movie tastes, and can then recommend movies that those similar people enjoyed but the user has not yet seen.

ACF has several advantages:

You can filter information of any type of content. The prerequisite for ACF is that users must be able to provide preference ratings for the items being filtered, which places almost no limitations on the type of content you can filter. The software does not have to be able to evaluate, comprehend, or index the content being filtered. This overcomes one of the primary limitations of content-based filtering, which has limited the

use of content-based filtering to well-structured content that can be compactly summarized in electronic format. While content-based filtering is useful primarily on text, ACF can be used to filter more complex items, such as artwork, music, mutual funds, and vacation packages. In addition, with ACF, you can use the same computational framework for many different domains. For example, the same commercial ACF system has been used for filtering Usenet news, movies, music, art, and books [4].

You can filter based on complex and hard-to-represent preferences. In content-based filtering, the software has to be able to create an electronic description of the information need. ACF does not have to represent information needs explicitly. As a result, you can use ACF to filter and recommend items based on deep and complex relationships to the user, such as taste or quality. For example, ACF could be used to build a recommender system that recommends well-written documents, while content-based filtering would have trouble differentiating between well-written and poorly written documents.

You can receive serendipitous recommendations. Serendipitous recommendations are recommendations for items that do not contain content that the user was actively searching for. In an ACF system, an item is recommended if a user with similar interests likes it. The item may not be similar to anything the user has seen before. In content-based filtering, the user provides the system with a description of desired items, and items matching the description are returned. Therefore, the user must proactively anticipate the descriptions of the items desired. For example, Jane, a user of MovieLens, may have seen several bad science fiction movies and conclude that science fiction movies are not to her taste. In a content-filtering system, she would specify "no more sci-fi", and would never get recommendations for sci-fi. With MovieLens, she would be connected with other people who disliked the same sci-fi movies. Yet, if some of those people end up seeing a new sci-fi movie and like it, that sci-fi movie may be recommended to Jane. Jane may then discover that not all sci-fi movies are bad (just most of them). She would not have discovered this had she been using the content-based filtering system.

ACF systems can provide additional value, such as helping to create community by connecting together people with similar interests. However, the focus of this dissertation is on using ACF to filter information.

The first research report on an ACF system was published in 1994 [68]. Since then, both research and commercial interest in ACF has exploded. Several new companies have been formed to market ACF technology, including Net Perceptions (www.netperceptions.com), which was started by researchers at the University of Minnesota, and now has an extensive list of customers. The fact that ACF systems have moved successfully from academic invention to commercial success in such a short period partially validates the usefulness of the technology. However, the technology is still young, and there is considerable potential for improving the effectiveness of ACF beyond what is currently available.

1.3 Research Challenges

This dissertation presents results related to three significant challenges to creating effective ACF systems.

1.3.1 Challenge: How can we improve the predictive accuracy of automated collaborative filtering algorithms?

The potential of automated collaborative filtering techniques has been generally established by previous work, which is discussed in Chapter 2. The neighborhood-based ACF algorithm has been shown to be one of the best performing of several tested algorithms [12]. However, the existing neighborhood-based algorithms were designed in an ad hoc manner and have not been examined in detail. Few alternative algorithms have been developed or tested. The effect of various parameters of existing algorithms, such as neighborhood size, is not fully understood.

With algorithms that make fewer errors, interactions with ACF systems will make decision-making processes faster and more efficient. Algorithms that are more accurate will also increase the domains in which ACF can be useful, supporting areas where higher risk is involved.

Part of the challenge is to develop a framework for the analysis of algorithms. This includes collecting and standardizing datasets, developing conceptual and

procedural frameworks for algorithm analysis, developing tools to support rapid and repeatable analysis, and then disseminating all of the above to the research community to encourage verifiable analysis and encourage comparison with new techniques.

This challenge is addressed in Chapters 3, 4, and 7. Chapter 3 provides an examination of evaluation metrics for prediction accuracy, Chapter 4 provides the development of an analysis framework and results from empirical tests, and Chapter 7 presents software and data artifacts that have been or will be distributed to the public.

1.3.2 Challenge: How can we increase the effectiveness of ACF as a decision-making aid using explanation interfaces?

Automated collaborative filtering provides support for human decisions in filtering information. Successful ACF tools should increase the quality and efficiency of filtering information while reducing the amount of time necessary to make a decision or reducing the number of decisions that must be made by the user. They should also inspire user confidence in their effectiveness.

When ACF systems are used as black-box oracles, prediction errors can prevent them from becoming successful. Due to the use of approximation algorithms and incomplete data, prediction errors are unavoidable. Each prediction error increases the uncertainty in the user's decision-making process. As a result, the system may not be accepted, or may be used in a manner that requires verification from other information sources. Such systems will not significantly improve the speed or quality of the decision-making process.

We can reduce the impact of prediction errors by increasing the amount of information that is communicated to the user regarding a prediction made by the automated collaborative filtering system. One approach is to provide the user with explanations of ACF predictions. Explanations offered by the ACF system should result in a better understanding of the goals, techniques, strengths, and limitations of the ACF system. Ideally, a user should be able to determine when it is appropriate to trust the recommendation of an ACF system. Together, the user and the ACF system should be much more effective in making information filtering decisions.

In order to build automated collaborative filtering systems that communicate more effectively with the user, we need to understand what user-interface techniques we can use that will increase the effectiveness of the human interaction with the ACF system.

1.3.3 Challenge: How can we improve automated collaborative filtering systems for meeting ephemeral user information needs?

A core assumption of current automated collaborative filtering systems is that a user's interests will remain relatively consistent over time. Automated collaborative filtering systems represent the user's interest as a single interest profile. All of the historical preferences are used to predict what the user will find interesting. However, changes in information needs do occur. In the long term, interests may appear relatively static, but on a day-to-day basis, transient information needs may come and go. Today a user may be interested in a news article about Monica Lewinsky, but having read the article, the user may not want to read more about Lewinsky for several days. Systems need to be built that can support such ephemeral, or transient, information needs.

However, it is also important to leverage the historical profile of a user that has been collected over past interactions with the automated collaborative filtering system. Within the profile, there is knowledge that is relevant to current ephemeral information needs. By using knowledge gained from the historical profile, we should be able to optimize the interaction between the user and the automated collaborative filtering system. For example, a user's profile represents long-term interests, but that long-term interest will generally include sub-interests that relate specifically to the immediate information need as well as sub-interests that do not. An ACF interface could help the user select which elements of the profile are relevant to the current information need.

1.4 Research Methods

The primary method applied in this work was empirical analysis. Empirical research was conducted using two basic experimental methods: simulations of algorithm performance on collected rating data and experiments on live users of a web-based recommendation site. In both cases, the content being recommended is movies, both in the theater and on video.

The GroupLens research group operates a web-based movie recommender known as MovieLens (movielens.umn.edu). Users rate movies based on how much they enjoyed viewing those movies. We collect and archive the ratings entered. We can simulate how accurate new algorithms would have been by having them generate predictions on items for which we already know the rating. In this manner, we used a single dataset to compare many different algorithmic variants.

To measure the effectiveness of ACF interfaces, we performed controlled experiments with volunteer users of the MovieLens web recommender. We provided different interfaces to different sets of users, and their usage of the system was monitored and compared. Both surveys and usage traces were used to evaluate the effectiveness of different interfaces and algorithms. All studies involving human participation have been reviewed and approved by the Research Subjects Protection Plan of the University of Minnesota (http://www.research.umn.edu/subjects/).

1.5 Contributions

The key contributions of this dissertation are stated here without extensive analysis or context. Each contribution is described individually in the dissertation within the appropriate chapter. In addition, the conclusion addresses and synthesizes these contributions in more detail.

1.5.1 Improving Prediction Accuracy

• An analysis of common user tasks in ACF systems, and their effect on the choice of an evaluation metric.
• A description and comparison of seven different potential evaluation metrics for ACF systems, with identification of strengths and weaknesses.
• Empirical results showing that the choice of evaluation metric is not that important, and a recommendation for standardization on one metric.
• A conceptual framework for the analysis of ACF algorithms.
• An extensive empirical study of the effects and interactions of proposed variations of neighborhood-based ACF algorithms.
• Identification of key parameters in neighborhood-based collaborative filtering algorithms, based on extensive empirical analysis.
• A set of recommendations for the most accurate neighborhood-based ACF algorithm.
• A new, more accurate average prediction algorithm.

1.5.2 Explaining Collaborative Filtering Predictions

• An analysis of different models for presenting explanations of ACF recommendations.
• Identification of the most compelling forms of explanation interfaces, based on empirical user study data.
• Experimental proof from user studies that explanations have the capacity to improve the acceptance of ACF recommenders.
• Several new algorithms for computing the most influential items in a user's profile.

1.5.3 Ephemeral Recommendations

• First implementation of a focused ephemeral recommendation strategy for automated collaborative filtering.
• An algorithm for providing focused recommendations in an ACF system without any additional data beyond the existing rating data.
• A user study verifying the effectiveness of a theme query interface to the focused recommendation algorithm in the ACF space.

1.5.4 Data and Software Artifacts

• A client library for Usenet GroupLens, enabling the Usenet trials.
• The MovieLens web recommender (lead designer/developer with Shyong (Tony) Lam, Brent Dahlen, Tim Lee, and other members of the GroupLens Research Group).
• The DBLens software environment for rapid analysis of ACF algorithms (collaboration with Hannu Huhdanpa).
• A Usenet news dataset, containing both user ratings and the full text of the associated articles (in collaboration with other members of the GroupLens team).
• Publicly available rating datasets from the MovieLens web site (in collaboration with other members of the GroupLens team).

1.6 Overview of Dissertation

An outline of the remainder of the dissertation can be found in Table 1-1.

Chapter 1: Introduction (this chapter).
Chapter 2: Related work. This chapter describes past research related to the areas of study addressed in this dissertation. An additional, more specific related work section can be found in Chapter 4.
Chapter 3: Evaluation of Automated Collaborative Filtering Systems. This chapter provides a thorough analysis of considerations and methods for evaluating ACF systems. Evaluation methods that have been used before in ACF are critiqued. Recommendations are made regarding the analysis of ACF systems.
Chapter 4: Analysis of predictive accuracy. Describes work and results related to the analysis and improvement of prediction accuracy in neighborhood-based ACF algorithms. This includes an extensive empirical analysis measuring the effects of different algorithmic parameters. The parameters that have the most effect on prediction accuracy are identified.
Chapter 5: Explanations: improving the performance of human decisions. Presents work that shows how explanation interfaces can improve the effectiveness of ACF systems. Several experiments with live users of the MovieLens system are performed to evaluate the performance of an ACF system with an explanation interface. In addition, this chapter describes a survey of MovieLens users to identify the most compelling forms of explanation.
Chapter 6: Addressing Dynamic Information Needs. Description and experimental tests of methods for supporting dynamic information needs.
Chapter 7: Software and Data Artifacts. To perform the research described in earlier chapters, new software was written and data collected and organized. Much of that data and software was then released to the public. This chapter describes all such software and data artifacts.
Chapter 8: Conclusions.

Table 1-1. Overview of the dissertation.


Chapter 2: Related Work

The problem of information overload is neither new nor easy. Many techniques exist to solve specific subsets of information overload. The ideas that I present in this dissertation have been influenced by or are similar to preexisting ideas in several subfields. This chapter presents other work that is related in general to the topics covered by the dissertation. Chapter 4 contains a more focused description of work related to algorithm design. I have grouped related work into two parts: Part 1—Technology for Matching Information to Information Needs—describes the algorithms and back-end systems that have been developed to locate and discover those items that will meet a user's information need; Part 2—Interfaces for Presenting Predictions—discusses the different paradigms of presenting filtering and retrieval technology to the user.

2.1 Technology for Matching Information to Information Needs

2.1.1 Information Retrieval

The field of information retrieval (IR) develops indexing and query technology

based on automatic analysis of item content, primarily for textual documents. Indexing is the process of examining content collections and creating software-searchable data structures that contain descriptions of the items available [13,14,24,25,41,45,52,57,71]. One of the key focuses of indexing research is to identify the key high-level concepts of a piece of content while discarding low-level noise. For example, when indexing documents, it is common to collapse all synonyms into a single concept term [41,45]. Without synonym indexing, if a user requests a document using the term 'calculate,' she will not receive documents using the term 'compute,' even if those documents effectively cover the same topic.

IR queries are requests for documents or information by the user. Query research includes the design of query languages [18,40,52], visual and audio interfaces for queries [35,80], and matching user queries with indexed representations of items [72]. In most cases, query research and indexing research overlap, as the query technology depends on the indexing technology used.

Most information location systems today are built entirely on content-based information retrieval technology. While valuable, content-based techniques have limitations because current computer programs can only reliably recognize simple content features, such as text keywords. There have been attempts to perform more complex natural language processing of text documents, but for the most part, natural language processing has shown only small improvements over simpler keyword-based indexing [49].

Information retrieval is very successful at supporting users who know how to describe exactly what they are looking for in a manner that is compatible with the descriptions of the content that were created during the indexing. In contrast, automated collaborative filtering (ACF) does not support such users well. Items recommended by an ACF system are likely to be of interest, but could contain content completely different from any items previously rated. In many cases, this serendipitous effect is desirable, but there exist situations where results containing specific content are necessary. Therefore, it is often the case that the most effective information location or filtering services include both automated collaborative filtering technology and content-based information retrieval technology [7,16,20,30,78]. However, Chapter 6 explores one way of extending ACF technology to handle a traditional IR task – content-focused recommendations.

There has been considerable published work in IR with respect to evaluation metrics. Many of those papers [13-15,17,19,27,34,66,87,90] influenced the work on ACF evaluation metrics described in Chapter 3.

In the neighborhood-based ACF algorithms analyzed in Chapter 4, the data space can be represented as a table, where the users are rows and the items are columns. Each user can be seen as a vector of ratings for items. One step of the neighborhood-based ACF algorithm is to locate the users (vectors) most similar to a given user (vector). This problem is very similar to the problem of locating similar documents in Salton's vector space model [70]. In the vector space model, documents are represented as high-dimensional vectors of keyword weights, and the problem is to locate similar keyword weight vectors. Successful techniques from the vector space model could be applied to the ACF neighborhood-based recommendation algorithm.
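To make the analogy concrete, the sketch below treats each user as a sparse vector of ratings and scores user-user similarity with the cosine measure, just as document vectors are compared in the vector space model. The toy data are invented for the example, and cosine is used only for illustration; the similarity weightings actually evaluated in Chapter 4 (such as Pearson correlation) differ.

```python
from math import sqrt

# Sparse user vectors: user -> {item: rating on a 1-5 scale}. Invented toy data.
ratings = {
    "ann":  {"Alien": 4, "Brazil": 5, "Casablanca": 2},
    "bob":  {"Alien": 5, "Brazil": 4, "Dune": 3},
    "cara": {"Casablanca": 5, "Dune": 1, "Brazil": 2},
}

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse rating vectors."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    numerator = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(r * r for r in u.values()))
    norm_v = sqrt(sum(r * r for r in v.values()))
    return numerator / (norm_u * norm_v)

# Rank the other users by similarity to "ann", most similar first.
target = ratings["ann"]
neighbors = sorted(((cosine_similarity(target, vec), name)
                    for name, vec in ratings.items() if name != "ann"), reverse=True)
print(neighbors)
```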

2.1.2 Information Filtering

Information filtering differs from information retrieval. Information filtering deals

with continuous information streams while information retrieval deals with mostly static data stores [9]. Many commercial office automation systems have implemented simple content-based information filtering. Most commercial email systems include some rule-based system for assigning priorities to messages. The Lotus Notes product provides extensive text routing and text filtering based on syntactic content. Experimental information filtering systems include Information Lens for electronic mail [54], the Stanford Information Filtering Tool (SIFT) [89], LyricTime for music [51], and the SIFTER prototype [62]. In contrast to information retrieval, information-filtering tools are often used to support interests that are more long-term. As a result, many existing information filtering tools can be replaced by or supplemented by an ACF system.
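The rule-based email prioritization mentioned above can be illustrated with a minimal sketch. The rules, message fields, and addresses below are hypothetical, invented for the example; they do not describe any of the commercial or experimental systems cited.

```python
# Each rule pairs a predicate over a message with the priority assigned when it matches.
RULES = [
    (lambda m: "urgent" in m["subject"].lower(),      "high"),
    (lambda m: m["sender"].endswith("@example.edu"),  "high"),
    (lambda m: "unsubscribe" in m["body"].lower(),    "low"),
]

def prioritize(message, default="normal"):
    """Return the priority of the first matching rule, or the default priority."""
    for predicate, priority in RULES:
        if predicate(message):
            return priority
    return default

msg = {"sender": "advisor@example.edu", "subject": "Draft comments", "body": "See notes inline."}
print(prioritize(msg))  # -> "high"
```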

2.1.3 Machine Learning

The fundamental goal of machine learning is to build computer systems that can

learn from experience. Automated collaborative filtering has a similar goal—to predict items of interest based on what can be learned from the experience of a community of people. Machine learning research applies ideas from statistics, numerical computing, and continuous functions. Most machine learning techniques, such as neural networks, decision trees, and probabilistic networks, were designed to involve extensive amounts of off-line computation, often making them inappropriate for real-time information filtering such as automated collaborative filtering. An excellent overview of machine learning techniques can be found in [60]. The neighborhood-based recommendation algorithm described in Chapter 4 has some similarities to nearest neighbor techniques that are well known in machine learning, particularly the locally weighted regression described in [60].

Recently, other machine-learning techniques have been applied to ACF. Breese et al. found that neighborhood-based automated collaborative filtering techniques performed better than Bayesian belief networks for non-binary rating domains [12]. Billsus and Pazzani showed that combining singular value decomposition with a neural network produced results that were more accurate than unoptimized neighborhood-based

collaborative filtering [10]. Cohen and Basu have applied induction-rule learning to collaborative filtering.

2.1.4 Collaborative Filtering

In collaborative filtering, a user's actions and analyses regarding a particular piece

of information are recorded for the benefit of a larger community. Members of the community can benefit from others' experience before deciding to consume new information. In essence, collaborative filtering systems automate "word-of-mouth" recommendations. Much of the work in collaborative filtering and automated collaborative filtering was originally motivated by the Tapestry system from Xerox PARC [28]. Tapestry was a full-featured filtering system for electronic documents—primarily electronic mail and Usenet postings. A user could create mail-filtering rules like "show me all documents that are replied to by other members of my research group," or "don't show me any messages which Joe said were a waste-of-time." Non-automated collaborative filtering systems such as Tapestry required the user to determine the relevant predictive relationships within the community, placing a large cognitive load on the user. As a result, such systems were only valuable in small, close-knit communities where everyone was aware of other users' interests and duties. Another example of non-automated collaborative filtering was Maltz's active collaborative filtering [56], where users could explicitly indicate that resources (such as documents, news articles, or web pages) should be forwarded to specified users or groups of users.

2.1.5 Automated Collaborative Filtering

Automated collaborative filtering (ACF) took the concept of collaborative

filtering and added automation, scale, and anonymity. ACF automates all of the CF procedures except for the collection of user ratings on items. Automated collaborative filtering systems can support predictions or recommendations for large-scale communities of users, and due to the large number of users, anonymity can be provided. The GroupLens Usenet filtering system was the first to provide automated collaborative filtering [47,58,59]. Since then, the GroupLens Research Project at the University of Minnesota has conducted further research in ACF systems, covering areas such as matrix storage methods for ACF [31], integration of content-based filterbots to

improve prediction quality [30,78], combining independent filtering agents with ACF [30], empirical analysis of prediction algorithms [36], reducing dimensionality with SVD [76,77], and explanation of ACF recommendations [37].

Several other similar systems were developed around the same time as the GroupLens Usenet system, including the Ringo music recommender [79] and the Bellcore Video Recommender [38]. These three research systems, two of which evolved into commercial products, used what have come to be called neighborhood-based prediction algorithms. Due to their speed, flexibility, and understandability, neighborhood-based prediction algorithms are currently one of the most effective ways to compute predictions in automated collaborative filtering. This dissertation includes a thorough analysis of neighborhood-based prediction algorithms in Chapter 4.

Maltz's master's thesis [55] presents some early work on a system at Xerox that collected ratings for Usenet news articles. However, Maltz's system only presented the overall average ratings and not personalized predictions.

Much recent work has introduced new computational models to ACF, with empirical tests on non-binary numeric rating data. These computational models have included Latent Semantic Indexing [11], rule induction [16], Bayesian networks [12], graph theory [5], and weighted-majority voting [21]. The relative performance of different algorithms has been shown to be significantly different on binary data than on non-binary numeric data [12]. There have been few published results of ACF on binary data. These include statistical clustering methods on CD purchase data [86], and Bayesian networks on web page access data [12].

There has also been research on ACF systems that are not based on numeric ratings. Linton and associates present the Organization-Wide Learning (OWL) system, which uses a collaborative filtering technique to identify Microsoft Word commands that other members of your community are using frequently, but you aren't [50]. Other automated collaborative filtering systems have been built for web pages [7], social networks [46], and web resources [84], to name a few.
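As a concrete sketch of the neighborhood-based approach described above: weight each candidate neighbor by the Pearson correlation computed over co-rated items, then predict a rating as the target user's mean rating plus a correlation-weighted average of the neighbors' deviations from their own means. This is one common formulation offered for illustration only; the specific variants and refinements evaluated in Chapter 4 (significance weighting, thresholds, normalization choices) differ in their details, and the toy data here are invented.

```python
from math import sqrt

# Toy ratings on a 1-5 scale: user -> {item: rating}.
ratings = {
    "ann":  {"Alien": 4, "Brazil": 5, "Casablanca": 2},
    "bob":  {"Alien": 5, "Brazil": 4, "Dune": 3},
    "cara": {"Alien": 1, "Brazil": 2, "Casablanca": 5, "Dune": 2},
}

def mean_rating(user):
    r = ratings[user]
    return sum(r.values()) / len(r)

def pearson(u, v):
    """Pearson correlation over co-rated items (0.0 if fewer than two are shared).
    For simplicity, each user's mean is taken over all of that user's ratings."""
    shared = set(ratings[u]) & set(ratings[v])
    if len(shared) < 2:
        return 0.0
    mu, mv = mean_rating(u), mean_rating(v)
    num = sum((ratings[u][i] - mu) * (ratings[v][i] - mv) for i in shared)
    du = sqrt(sum((ratings[u][i] - mu) ** 2 for i in shared))
    dv = sqrt(sum((ratings[v][i] - mv) ** 2 for i in shared))
    return num / (du * dv) if du and dv else 0.0

def predict(user, item):
    """Predict a rating: the user's mean plus weighted neighbor deviations from their means."""
    weighted = [(pearson(user, n), n) for n in ratings if n != user and item in ratings[n]]
    normalizer = sum(abs(w) for w, _ in weighted)
    if normalizer == 0:
        return mean_rating(user)  # no usable neighbors: fall back to the user's own average
    offset = sum(w * (ratings[n][item] - mean_rating(n)) for w, n in weighted) / normalizer
    return mean_rating(user) + offset

print(f"Predicted rating for ann on Dune: {predict('ann', 'Dune'):.2f}")
```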

2.1.6 Evaluation of Information Filtering Systems

Information retrieval, information filtering, machine learning, collaborative filtering, and automated collaborative filtering all result in systems that connect people to relevant information. Accurate and repeatable evaluation of these systems is important to quantify the effectiveness of proposed technologies. Different methods of evaluation have evolved in each of the different fields described above. Much of this work in developing evaluation systems is relevant to the area of automated collaborative filtering.

There is a large body of published research addressing evaluation of information retrieval systems and machine learning systems. Van Rijsbergen's text has an excellent chapter on the evaluation of information retrieval systems [87]. Appropriate statistical tests for machine learning algorithms are addressed in Dietterich [23]. Saracevic et al. present the design, execution, and analysis of a large evaluation project [73-75]. Many publications address the meaning of relevance and "information need" as it is measured in information retrieval [17,34,64]. Other papers on evaluation include [27,29]. More discussion of related work can be found in Chapter 3.

2.2 Embodiments of Recommendation Technology

2.2.1 Intelligent Agents

Recommendation technology matches people's information needs to potentially

valuable and interesting items. Intelligent software agents represent one vision of how recommendation technology such as automated collaborative filtering (ACF) could be embodied. While there is no exact definition of what intelligent software agents are, there are some characteristics that are commonly associated with them. These characteristics include the ability to learn our preferences and the ability to perform actions on our behalf autonomously. Each agent is generally trained to support the needs of a single individual. These agents watch and record the actions of a user in day-to-day business and attempt to predict what the user will do next or what piece of information the user will need next. Then the agent can assist the user by making a recommendation, offering to perform a task, or fetching the necessary information and having it ready for the user. Agent systems designed by Maes from the MIT Media Lab [53] have addressed information filtering in email, scheduling, and Usenet news. ACF can be used by

intelligent agents to help predict the user's interests. ACF can also be used to summarize ratings from a collection of potential predictive agents [30].

2.2.2 Recommender Systems

Recommender systems provide an alternative interface to filtering and retrieval

technology. Recommender systems have a single focus: predicting what items or pieces of information a user will find interesting or useful. Predictions from a recommender system are personalized based on each user's individual profile, which generally contains relevance or interest judgments of previously seen items. Unlike agents, recommender systems are centralized systems accessed by multiple users. Most automated collaborative filtering systems manifest themselves as recommender systems. The experimental systems that we will use during this project will have recommender system interfaces.

2.2.3 Methods of Collecting Preferences

One of the challenges facing all preference-learning systems is how to collect

preference information from the user. One approach is to have the user explicitly specify preferences, such as through numeric ratings for items. However, this creates an additional load on users of the system. Another approach is to learn preferences by gathering implicit ratings. Implicit ratings are ratings that are not entered by the user, but are inferred from observation of the user's actions. Oard presents a taxonomy of different methods of observable behavior for implicit ratings [65]. There are several examples of ACF systems that make use of implicit ratings. Balabanovic presents a system that adjusts a user preference model for text documents based on observation of user interactions with a profile-building interface [6]. Baudisch describes a TV recommender system where users can create printable lists of TV shows that they are interested in seeing [8]. These "laundry lists" are then used as implicit ratings of interest. Other work suggests that the amount of time spent reading a document [47,61] can be used to predict ratings. For our work, we have not built agent-like interfaces. Our ACF systems do not perform any tasks autonomously for the user, and all ratings must be entered explicitly.


Chapter 3: Evaluation of Automated Collaborative Filtering Systems

As automated collaborative filtering (ACF) systems are a relatively new invention, there are no common standards for the evaluation of ACF systems. At a glance, it would seem that every research group involved in the design and analysis of ACF systems has used a different evaluation technique and metric. Furthermore, there is no clear understanding of the relationships between the different metrics used. Such diversity of experimental techniques causes three key problems that impede the progress of research in ACF.
1. If two different ACF researchers evaluate their systems with different metrics, their results are not comparable.
2. If there is no standardized metric, then we must be suspicious that researchers have chosen a metric that produces the results they desire.
3. Without a standardized metric, every researcher must spend the effort to identify or develop an appropriate metric.
In this chapter, we address the issue of evaluation both theoretically and empirically. Theoretically, we provide a procedural framework for the evaluation of an ACF system, which includes identification of the key goals of the ACF system and the key tasks that users will perform with the system, both of which are important in choosing the correct metric. We analyze in detail the strengths and weaknesses of the different metrics that have been used to evaluate ACF systems in the past. Empirically, we compare 11 different evaluation metrics applied to the results from 2180 different collaborative filtering algorithms. Based on the theoretical and empirical analysis, we provide recommendations for the choice of a proper evaluation metric. It is important to note that, in order to have comparable research results, it is also necessary to have publicly available standardized data sets. Several such sets are available for collaborative filtering (http://research.compaq.com/SRC/eachmovie/; http://kdd.ics.uci.edu/databases/msweb/msweb.data.html). Furthermore, some of the datasets used in this dissertation will be released to the public. Descriptions of the released datasets can be found in Chapter 7.

3.1 Steps in Evaluation

There are three steps to successfully measuring the performance of an ACF system:
1. Identify the high-level goals of the system.
2. Identify the specific tasks towards those goals that the system will enable.
3. Identify system-level metrics and perform system evaluation.
We will address each in turn.

3.1.1 Identify High-Level Goals

Before measuring the performance of an information filtering system, we must

determine exactly the goals of the system as well as the exact tasks the users will be performing with the system. This may sound straightforward, but it is often overlooked in the analysis of information filtering systems. Information filtering systems are not valuable by themselves. Rather, they are valuable because they help people to perform tasks better than those people could without assistance from the filtering system. Therefore, at the highest level, information systems have one of two goals. Either they improve the effectiveness of an existing user task or they enable a new valuable information-filtering user task. If people are currently engaged in the performance of a task, then it is not the job of the information-filtering-system builder to justify that task. Significantly improving the efficiency, quality, or speed of such a task can be seen as a valuable contribution. For example, thousands of people around the world scan Usenet newsgroups at least once a week, trying to find valuable articles among lots of noise. A system that reduces the number of bad articles that those people have to read without decreasing the number of good articles they read will have made a contribution. With cutting-edge technology, it is common for information-filtering systems to enable tasks that were not previously possible. In these cases, the first step is to identify whether the new tasks themselves are valuable before attempting to measure how the system improves the performance of those tasks. For example, the Jester [32] system is a web-based collaborative filtering system that recommends jokes that you might like. Is this a valuable task?

A task can be valuable for several reasons. The most obvious reason is economic value. A task can be economically valuable if it increases profit or productivity. For example, an information-filtering system that locates the best value car sale on the Internet would be economically valuable for the users, allowing them to get better quality cars for lower cost. Economic value is the most commonly proposed contribution of an information-filtering system and is usually measured by improvements in productivity. Other measurements include the amount of money made or saved through use of the filtering system. If by performing a task, we improve the strength or value of a community, that task can be seen as socially valuable. An information-filtering system that connected voters to information about political candidates' platforms and past records would be socially valuable, because it would allow the voters to cast more informed votes, resulting in a political leadership that more closely matched the needs and wants of the community. Most social value arguments are qualitative. Measuring the social value of an information filtering system quantitatively is extremely tricky, and is rarely done. Over the long term, it may be possible to measure the social value of an information filtering system. For example, the effectiveness of the voter support system mentioned above could be measured by looking at the legislative effectiveness of the elected officials after the fact. In many cases, a task doesn't provide economic value or necessarily contribute to the strength of a community. Yet, the task can still be valued by an individual. In such cases, the task improves the quality of life for the individual. Examples of improving the quality of life are entertainment, recreation, aesthetics, and automation (reducing undesirable activity). This category of task covers the whole gamut of tasks that don't necessarily provide economic value or improve the community, yet are desired by individuals. We can measure whether a system improves the quality of life by presenting the system to the users and observing the effects. High usage and stated user happiness indicate that a system is improving quality of life. Finally, one aspect of value that has not really been explored with information filtering systems is the psychological value of a system. Information filtering systems can strengthen or reinforce positive psychological characteristics. For example, there are

large numbers of users of our MovieLens movie recommender who have rated hundreds and hundreds of movies, even though there is no clear value to rating more than one hundred movies. MovieLens gives them a forum to state their opinions in a way from which others can benefit. We believe that this helps build and reinforce self-worth by giving all users a "voice" of sorts. Measuring this kind of value would require expertise in cataloguing psychological health. The four different values that we have described here are not mutually exclusive. They overlap in subtle ways. However, they are generally measured independently.

3.1.2 Identify Specific Tasks

Having specified what the high-level goals are, the next step is to identify the specific

tasks that the users will perform, aided by the information-filtering system. These tasks will describe explicitly the nature of interaction between the user and the system. The choice of the appropriate metric to use in evaluating a system will depend on the specific tasks that are identified for the system. The appropriate tasks will depend greatly on the high-level goals of the users. However, we will present here six different representative tasks that a user of an ACF system might have. These tasks will illustrate the competing features of different metrics.
1. A user wants to locate a single item whose value exceeds a threshold. For example, a common task would be to locate a single "decent" movie to watch, or a book to read next.
2. A user is about to make a selection decision that has significant cost, and wants to know what the "best" option is. For example, a selection between many different health plans (HMOs) could have significant future consequences on a person. They are going to want to make the best possible selection.
3. A user has a fixed amount of time or resources, and wants to see as many of the most valuable items as possible within that restriction. Therefore, the user will be interested in the top n items, where n depends on the amount of time the user has. For example, consider news articles. People generally have a fixed amount of time to spend reading news (such as a half-hour before

starting work). In that time, they would like to see the news articles that are most likely to be interesting.
4. A user wants to gain or maintain awareness within a specific content area. Awareness in this context means knowing about all relevant events or all events above a given level of interest to the user. For example, a person in public relations for a company might want to be sure to read all articles that might have an effect on the stock price of the company.
5. A user wants to examine a stream of information in a given order, consuming items of value and skipping over items that are not interesting or valuable. For example, in Usenet bulletin boards, some readers frequently examine the subject line of every article posted to a group. If a subject appears interesting, the entire article is retrieved and read.
6. A user has a single item and wants to know if the item is worth consuming. For example, a user may see an advertisement for a new book and wants to know if it is worth reading.
These tasks illustrate several very different manners in which users will interact with ACF systems. We will refer to users whose primary task is the first one shown above as "Type 1" users, users with task 2 as "Type 2" users, and so on. Having identified the key tasks that users will perform with a system, the next step is to identify the indicators of user-task performance, and then perform system-level evaluation.

3.1.3 Performing System-Level Analysis

System-level evaluation is performed in cases where researchers can identify

measurable indicators of the system that will significantly correlate with the effectiveness of a system independent of the user interaction. Researchers who use system-level evaluation assume that differences in the given indicators will result in better task performance given any reasonable user interface. System-level evaluation has been the most prevalent form of evaluation in information filtering, because it offers inexpensive, easily repeatable analysis. The data are collected from users once, and then many different systems can be evaluated on the

collected data without further expensive user sessions. There have been many instances where a data set released to the public has allowed considerable work by other researchers who did not have the resources to collect their own data, such as EachMovie (http://research.compaq.com/SRC/eachmovie/), Cranfield [15], and TREC (http://trec.nist.gov/). In some cases, the user data will be simulated from user models, so users never have to be consulted.

3.1.3.1 Indicators of user task performance

The problem of identifying the correct indicators that will correlate highly with user task performance was addressed by Cleverdon in his analysis of information retrieval systems [15]. Cleverdon argues that there are five measurable quantities affecting the users of an information system:
Time Lag. The interval between the demand being made and the answer being given.
Presentation. The physical form of the output.
User effort. The effort, intellectual or physical, demanded of the user.
Recall. The ability of the system to present all relevant documents.
Precision. The ability of the system to withhold non-relevant documents.
Cleverdon argues that of these five metrics, only precision and recall are of interest. He dismisses the others, because (1) time lag is easily measured, and generally a function of underlying hardware or implementation; (2) presentation is successful if the user can read and understand the list of references returned; and (3) user effort can be measured with a straightforward examination of a small number of cases. Although it is rarely mentioned in the research literature, time lag can be a significant consideration for certain tasks. For example, most ACF systems nowadays are designed to support users through the web. Any reasonably successful web site can expect to attract a large number of concurrent users. Furthermore, these users expect results to be returned to them within seconds. With large numbers of concurrent users expecting one- or two-second response times, the systems must often respond within tens of milliseconds. Add to that the fact that these ACF systems will be making predictions from huge amounts of data (millions of ratings from millions of users). Most researchers assume that

their research systems and algorithms, which are generally not fast enough for commercial use, can be tuned to provide commercial-level performance. However, we must keep in mind that many of the techniques used to improve performance can also diminish the quality of the information retrieval system. Evaluating presentation and user effort can be much more complex than Cleverdon suggests. If the task is to simply read each of the documents on the list returned, then Cleverdon's analysis is correct. However, the task may be much more complex and more general. For example, the task may be to select an item for purchase from a large set of competing alternatives. The system may recommend a set of ten items. A presentation that describes the reasoning behind each recommendation is more likely to be effective in helping a user make a purchase decision than a presentation that simply returns a ranked list. A system that requires the user to go through many screens will require more effort and reduce the speed with which the user can make decisions. However, measuring quantitatively the effect of presentation and user effort on a task is almost impossible without performing user tests. Precision and recall have been by far the most popular metrics for information retrieval, primarily due to the strength of statements by early researchers such as Cleverdon. It is clear, however, that they are not appropriate for measuring the performance of all tasks. This has been argued by many researchers over the years [19,66,82,83,90]. However, it does seem reasonable to measure in some way the extent to which the information system accurately matches content items with the user's information need. Precision and recall are just instances in a large space of accuracy evaluation metrics. In the next section, we will examine other metrics that have been used to evaluate the ability of an ACF system to meet a user's information need.

3.2 Metrics

In this section, we will examine theoretically and empirically metrics that have been used before to evaluate information filtering systems.

3.2.1 Evaluation of previously used metrics

ACF systems have been evaluated in the research literature since 1994. During

that time, there have been many different ACF systems. With a couple of exceptions, a

different metric was used for the evaluation of each system that has been published since then. We will examine each of the metrics used, identifying the strengths and the weaknesses.

3.2.1.1 Coverage

Coverage is simply the percentage of items for which the system could generate a prediction. In certain cases, an automated collaborative filtering system is not able to generate a prediction for an item due to lack of data or over-restrictive thresholding of data. It is important for evaluators to report the level of coverage achieved by their systems. In many cases, it is possible for systems to detect situations where they will not be able to produce a good prediction with high confidence (i.e. when there are small amounts of data on an item). It is possible for systems to artificially inflate the accuracy values by not producing predictions for such items. For certain ranking-based tasks, this may be appropriate (potentially appropriate for user task types 1, 2, and 3). However, for tasks in which a user can request a prediction for any item in the database (tasks 5 and 6), it is generally inappropriate to not be able to produce a prediction. In any case, coverage should be reported, and system accuracy should only be compared on items for which both systems can produce predictions.

3.2.1.2 Mean absolute error and related measures

Mean absolute error measures the average absolute deviation between a predicted rating and the user's true rating. Mean absolute error (Equation 3-1) has been used to evaluate ACF systems in several cases [12,36,79].

E = \frac{\sum_{i=1}^{N} |p_i - r_i|}{N}        (Eq. 3-1)

With mean absolute error, the error from every item in the test set is treated equally. This makes mean absolute error most appropriate for Type 5 and Type 6 user tasks, where a user is requesting predictions for all items in the information stream, or it is not known for which items predictions will be requested.

It may be less appropriate for tasks where a ranked result is returned to the user, who then only views items at the top of the ranking, such as types 1-4. In these cases, we are not particularly concerned with error that occurs towards the bottom of the ranking scale. Essentially, mean absolute error assigns the same weight to errors on all items. If certain items are significantly more important than others are, then mean absolute error may not be the right choice. However, the mean absolute error metric should not be discounted as a potential metric for ranking-based tasks. Intuitively, it seems clear that as mean absolute error decreases, all other metrics must eventually show improvements. There are two other advantages to mean absolute error. First, the mechanics of the computation are simple and easily recognized by all. Second, mean absolute error has well studied statistical properties that provide a means for testing the significance of the difference between the mean absolute errors of two systems. Two related measures are the mean squared error and the root mean squared error. These variations square the error before summing it. The result is more emphasis on large errors. For example, an error of one point increases the sum by one, but an error of two points increases the sum by four. In addition to mean absolute error, Shardanand and Maes [79] measured separately the mean absolute error over items with extreme user ratings. They partitioned their items into two groups, based on user rating (a scale of 1 to 7). Items rated below three or greater than five were considered extremes. The intuition was that users would be much more aware of a recommender system's predictive performance on items that they felt strongly about. From Shardanand and Maes's results, the mean absolute error of the extremes provides a different ranking of algorithms than the normal mean absolute error. In addition, the extreme mean absolute error appears to be more distinguishing. Measuring the mean absolute error of the extremes can be valuable. However, unless users are concerned only with how their extremes are predicted, it should not be used in isolation.
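To make the computation concrete, the following Python sketch (illustrative only, not the evaluation code used in this dissertation; the data are hypothetical) computes coverage together with mean absolute error (Equation 3-1) and root mean squared error, treating a missing prediction as an item the system declined to predict.

import math

def coverage_and_errors(predictions, ratings):
    """Compute coverage, mean absolute error (Eq. 3-1), and root mean
    squared error over one test set.

    predictions: predicted ratings, with None where the system could not
                 generate a prediction
    ratings:     the users' true ratings, aligned with predictions
    """
    # Coverage: the fraction of requested items that received a prediction.
    covered = [(p, r) for p, r in zip(predictions, ratings) if p is not None]
    cov = len(covered) / len(ratings)

    # Accuracy is computed only over the items that were actually predicted.
    mae = sum(abs(p - r) for p, r in covered) / len(covered)
    rmse = math.sqrt(sum((p - r) ** 2 for p, r in covered) / len(covered))
    return cov, mae, rmse

# Hypothetical example: five test ratings on a 1-5 scale, one unpredictable.
print(coverage_and_errors([4.5, 3.0, None, 2.0, 5.0], [5, 3, 4, 1, 4]))
# -> (0.8, 0.625, 0.75)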

3.2.1.3 Precision and Recall and Related Measures

Precision and recall are the most popular metrics for evaluating information retrieval systems. In 1968, Cleverdon proposed them as the key metrics [15], and they have held that position ever since. For the evaluation of ACF systems, they have been used by Billsus [10] and Basu [16].

                Relevant   Irrelevant   Total
Selected        N_rs       N_is         N_s
Not Selected    N_rn       N_in         N_n
Total           N_r        N_i          N

Table 3-1. Contingency table showing the categorization of items in the document set with respect to a given information need.

Precision and recall are computed from a 2x2 contingency table, such as the one shown in Table 3-1. The item set must be separated into two classes – relevant or not relevant. That is, if the rating scale is not already binary, we need to transform it into a binary scale. The justification for performing this transformation is that users are looking for items that meet their information need. If an item meets an information need, then it is a successful recommendation (i.e. relevant). If we measure how likely the system is to return relevant documents, then we are measuring how likely the system is to meet the user's information need. Likewise, we need to separate the item set into the set that was returned to the user (selected), and the set that was not. We assume that the user will consider all items that are retrieved. Precision is defined as the ratio of relevant documents selected to the number of documents selected, shown in Equation 3-2.

P = \frac{N_{rs}}{N_s}        (Eq. 3-2)

Precision represents the probability that a selected document is relevant. Recall (Equation 3-3) is defined as the ratio of relevant documents selected to total number of relevant documents available. Recall represents the probability that a relevant document will be selected.

R = \frac{N_{rs}}{N_r}        (Eq. 3-3)
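As a concrete illustration (a sketch only; the threshold of 4 and the data are hypothetical, not values prescribed by this chapter), the following code binarizes non-binary ratings at a relevance threshold and computes precision (Equation 3-2) and recall (Equation 3-3) for a set of selected items.

def precision_recall(selected, user_ratings, threshold=4):
    """selected:    set of item ids returned to the user
    user_ratings:   dict mapping item id -> the user's rating
    threshold:      ratings >= threshold are treated as relevant
    Returns (precision, recall) as in Equations 3-2 and 3-3."""
    relevant = {item for item, r in user_ratings.items() if r >= threshold}
    n_rs = len(selected & relevant)                # relevant and selected
    precision = n_rs / len(selected) if selected else 0.0
    recall = n_rs / len(relevant) if relevant else 0.0
    return precision, recall

ratings = {'a': 5, 'b': 2, 'c': 4, 'd': 1, 'e': 3}
print(precision_recall({'a', 'b', 'c'}, ratings))  # -> (0.666..., 1.0)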

Precision and recall depend on the separation of relevant and non-relevant items. The definition of "relevance" and the proper way to compute it have been a significant source of argument within the field of information retrieval [34]. Most information retrieval evaluation has focused on an objective version of relevance, where relevance is defined with respect to the query, and is independent of the user. Teams of experts can compare documents to queries and determine which documents are relevant to which queries. However, objective relevance makes no sense in automated collaborative filtering. ACF recommends items based on the likelihood that they will meet a specific user's taste or interest. That user is the only person who can determine if an item meets his taste requirements. Thus, relevance in ACF is inherently subjective.

Because of the highly subjective nature of relevance in ACF, precision may not be appropriate if users are rating items on a numerical scale with more than two ranks (such as in [38,68,79]). To compute precision on a non-binary scale, a rating threshold must be selected such that items rated higher than the threshold are relevant and items rated below the threshold are not relevant. Because of variance in both rating distributions and information need, the appropriate threshold may be different for each individual user. Furthermore, it may be extremely hard to determine.

Recall is even more impractical to measure in ACF systems than it is in IR systems. To truly compute recall, we must determine how many relevant items are contained in the entire database. Since the user is the only person who can determine relevance, we must have every user examine every item in the database. While this problem exists in information retrieval systems, IR researchers approximated the value by taking the union of relevant documents that all users found. This was possible because relevance was defined to be objective and global, which is not the case with ACF systems. To estimate recall in ACF, a sufficiently large set of items rated by each user must therefore be selected.

Precision and recall can be linked to probabilities that directly affect the user. This makes them more understandable to users and managers than metrics such as mean absolute error. Users can more intuitively comprehend the meaning of a 10% difference

in precision than they can a 0.5-point difference in mean absolute error. As a result, precision and recall may be more appropriate than other metrics for arguing cost benefits. Another weakness of precision and recall is that there is not a single number for comparisons of systems. Rather, two numbers must be presented to describe the performance of the system. In addition, it has been observed that precision and recall are inversely related [15] and are dependent on the length of the result list returned to the user. If more documents are returned, then recall increases and precision decreases. Therefore, if the information filtering system doesn't always return a fixed number of documents, we must provide a vector of precision/recall pairs to fully describe the performance of the system. While such an analysis may provide a good amount of information about a single system, it makes comparison of more than two systems complicated, tedious, and variable between different observers. It should be noted that in certain cases, the task for which we are evaluating is not concerned with recall, only precision. This is true when filtering in many entertainment domains. For example, a person who is looking to find a video to rent for the weekend doesn't need to see all the videos in existence that he will like. Rather, he cares primarily that the movies recommended accurately match his tastes. In this case, it becomes easier to compare systems, although we must still be concerned about how different sizes of retrieval sets will affect the precision. Precision and recall can be appropriate for tasks where there is a clear threshold between items that meet an information need and items that don't. For example, in type 1 and type 4 user tasks, either an item meets the information need or it doesn't. For task 1, precision by itself (without recall) is a good metric. The user wants to find any item of interest, and will probably be provided with a short list of options. The user does not care whether they are getting a complete list of all relevant items. However, for task 4, the recall of an ACF system could be very important, since missing a relevant item could have costly consequences. For tasks 2 and 3, precision and recall are not appropriate metrics. Tasks 2 and 3 represent cases where ranked results are returned. At any point in the ranking, we want the current document to be more relevant than all documents lower in the ranking. Since

precision and recall only measure binary relevance, they cannot measure how good the ordering of items is within the categories of relevant or not relevant. Precision and recall may or may not be appropriate for tasks 5 and 6. If the recommender system is making binary recommendations (i.e. read or don't read) then precision and recall could be effective. On the other hand, if the system is producing non-binary predicted ratings, precision and recall may not be effective, since they cannot measure how close predicted ratings are to real user ratings.

3.2.1.4 ROC curves, Swets' A measure, and related metrics

There are two different popularly held definitions for the acronym ROC. Swets [82,83] introduced the ROC metric to the information retrieval community under the name "relative operating characteristic". More popular, however, is the name "receiver operating characteristic," which evolved from the use of ROC curves in signal detection theory [33]. Regardless, in all cases, ROC refers to the same underlying metric. The ROC model attempts to measure the extent to which an information filtering system can successfully distinguish between signal (relevance) and noise. The ROC model assumes that the information filtering system will assign a predicted level of relevance to every potential item. Given this assumption, we can see that there will be two distributions, shown in Figure 3-1. The distribution on the left represents the probability that the system will predict a given level of relevance for an item that is in reality not relevant to the information need. The distribution on the right indicates the same probability distribution for items that are relevant. Intuitively, we can see that the further apart these two distributions are, the better the system is at differentiating relevant items from non-relevant items.


Figure 3-1. A possible representation of the density functions for relevant and irrelevant items.

With systems that return a ranked list, the user will generally view the recommended items starting at the top of the list and working down until either the information need is met or enough time has elapsed; alternatively, the system may return only the top 10 or so results. In any case, the ROC model assumes that there is a filter tuning value z_c, such that all items that the system ranks above the cutoff are viewed by the user, and those below the cutoff are not viewed by the user. This cutoff defines the search length. As shown in Figure 3-1, at each value of z_c, there will be a different value of recall (percentage of good documents returned) and fallout (percentage of bad documents returned). The ROC curve represents a plot of recall versus fallout, where the points on the curve correspond to each value of z_c. An example of an ROC curve is shown in Figure 3-2.


Figure 3-2. An example of an ROC curve. The p-values shown on the curve represent different prediction cutoffs. For example, if we chose to select all items with predictions of 4 or higher, then we would see approximately 45% of all relevant items and 20% of all non-relevant items.
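The construction of such a curve can be sketched as follows (an illustrative reconstruction, not the analysis code used later in this chapter). The cutoff z_c is swept down the ranked list; at each position the recall and fallout so far give one point on the curve, and the trapezoid rule over those points approximates the area underneath the curve, which is used as a summary measure below. The sketch assumes at least one relevant and one non-relevant item.

def roc_curve_and_area(predictions, relevant):
    """predictions: predicted scores, one per item
    relevant:       parallel list of booleans (True = relevant)
    Returns the ROC curve as (fallout, recall) points and the area under it."""
    n_rel = sum(relevant)
    n_irr = len(relevant) - n_rel

    # Sort items by predicted score, best first; each position in this
    # ranking corresponds to one possible cutoff z_c.
    ranked = sorted(zip(predictions, relevant), key=lambda x: -x[0])

    points = [(0.0, 0.0)]
    hits = false_alarms = 0
    for _, is_rel in ranked:
        if is_rel:
            hits += 1
        else:
            false_alarms += 1
        points.append((false_alarms / n_irr, hits / n_rel))

    # Trapezoid rule over the (fallout, recall) points gives the ROC area.
    area = sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
    return points, area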

ROC curves are useful for tuning the signal/noise tradeoff in information filtering systems. For example, by looking at an ROC curve, you might discover that your information filter performs well for an initial burst of signal at the top of the rankings, then produces only small increases of signal for moderate increases in noise from then on. Comparing multiple systems using ROC curves becomes tedious and subjective, just as with precision and recall. However, a single summary performance number can be obtained from an ROC curve. The area underneath an ROC curve, also known as Swets' A measure, can be used as a single metric of the system's ability to discriminate between good and bad items, independent of the search length. According to [33], the area

underneath the ROC curve turns out to be the probability that the system will be able to choose correctly between two items, one randomly selected from the set of bad items, and one randomly selected from the set of good items. Intuitively, the area underneath the ROC curve captures the recall of the system at many different levels of fallout. It is also possible to measure the statistical significance of the difference between two areas [33,48]. The advantages of ROC area are (a) it provides a single number representing the overall performance of an information filtering system, (b) it is developed from solid statistical decision theory designed for measuring the performance of tasks such as those that an ACF system performs, (c) it covers the performance of the system over all different query lengths, and (d) it accounts for non-binary system rankings. The disadvantages of the ROC area metric are (a) a reasonably large set of potentially relevant items is needed for each query; and (b) it may need a large number of data points to ensure good statistical power for differentiating between two areas. The ROC curve represents the performance of the system at different levels of search length. Therefore, an ROC area needs to be computed for each query or user and averaged over all queries. If there is only a small set of items on which to evaluate per query, then the meaning of ROC area is less clear. The ROC curve is no longer precision versus fallout, it is precision versus 1-precision. The ROC area is likely only valuable when there is a large result set per user to evaluate on. Hanley [33] presents a method by which one can determine the number of data points necessary to sample to ensure that a comparison between two areas has good statistical power (high probability of identifying a difference if one exists). From this data, it becomes clear that many data points may be required to have a high level of statistical power. The number of required data points becomes especially large when the two areas being compared are very close in value. Therefore, the potential result set for each query must also be large. The ROC area measure is most appropriate for type 4 tasks, and to a lesser extent, type 3 tasks. In type 4 tasks, users have a binary concept of relevance, which matches with ROC area. Furthermore, a type 4 user is interested in both recall and fallout, wanting

to see the largest percentage of relevant items with the least amount of noise. ROC area is less appropriate for type 3 users, since type 3 users want the current item in the ranking to be more relevant than all items ranked later, and ROC does not guarantee this within the binary class of relevance. ROC area is not appropriate for the type 1 and 2 tasks, since ROC area provides a summary of performance over all query lengths, and type 1 and 2 users are only interested in what is recommended at the top of the ranking. ROC area is also less appropriate for type 5 and 6 tasks because ROC does not measure how close a prediction is to the actual rating, only the likelihood that a relevant item will receive a higher prediction than a non-relevant item.

3.2.1.5 Prediction-Rating Correlation

There are two classes of correlation. Pearson correlation measures the extent to which there is a linear relationship between two variables. It is defined as the covariance of the z-scores.

c = \frac{\sum (x - \bar{x})(y - \bar{y})}{n \cdot \mathrm{stdev}(x)\,\mathrm{stdev}(y)}        (Eq. 3-4)

Rank correlations, such as Spearman's ρ (Equation 3-5) and Kendall's Tau, measure the extent to which two different rankings agree independent of the actual values of the variables. Spearman's ρ is computed in the same manner as the Pearson product-moment correlation, except that the x's and y's are transformed into ranks, and the correlation is computed on the ranks.

\rho = \frac{\sum (u - \bar{u})(v - \bar{v})}{n \cdot \mathrm{stdev}(u)\,\mathrm{stdev}(v)}        (Eq. 3-5)
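Both correlations can be computed directly from these definitions. The sketch below (illustrative only, not code from this dissertation) implements Equation 3-4 and obtains Spearman's ρ by applying the same formula to the ranks, assigning tied values their average rank.

def pearson(x, y):
    """Pearson correlation of two equal-length lists (Eq. 3-4)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = (sum((a - mx) ** 2 for a in x) / n) ** 0.5
    sy = (sum((b - my) ** 2 for b in y) / n) ** 0.5
    return cov / (sx * sy)

def ranks(values):
    """Replace each value by its rank; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1        # ranks are 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg_rank
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho (Eq. 3-5): the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))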

In spite of their simplicity, correlation metrics have not been used extensively in the measurement of ACF systems or information retrieval systems. Pearson correlation was used by Hill [38] to evaluate the performance of their ACF system. The advantages of correlation metrics are (a) they compare the full system ranking to the full user ranking, with any number of distinct levels in each, (b) they are well understood

by the entire scientific community, (c) they provide a single measurement score for the entire system. Correlation metrics support user tasks 2 and 3 better than the previously discussed metrics. They compare a non-binary ranking provided by the system to a non-binary ranking provided by the user, and are appropriate if the user wants the "best" alternatives. Any interchanges that occur in the predicted ranking with respect to the true ranking will reduce the value of the correlation. Correlation metrics may be overly sensitive for tasks 1 and 4 since the user isn't concerned about the ordering of items within the relevance classes. For example, even if the top ten items ranked by the system were relevant, a correlation metric might give a non-perfect value because the best item is actually ranked 10th. However, this is not important to users of type 1 and 4. This metric is not appropriate for tasks of type 5 and 6, for the same reason that ROC area and precision/recall were inappropriate – the inability to measure the accuracy of individual predictions. For type 2 and 3 users, there are weaknesses in the way in which the "badness" of an interchange is calculated. For example, Kendall's Tau metric applies equal weight to any interchange of equal distance, no matter where it occurs. Therefore, an interchange between ranks 1 and 2 will be just as bad as an interchange between ranks 1000 and 1001. Generally, the user is much more likely to consider the first five, and will probably never examine items ranked in the thousands, so the interchange between 1 and 2 should be worse. The Spearman correlation metric also does not handle weak orderings well. Weak orderings occur whenever there are at least two items in the ranking such that neither item is preferred over the other. If a ranking is not a weak ordering then it is called a complete ordering. If the user's ranking (based on their ratings) is a weak ordering and the system ranking is a complete ordering, then the Spearman correlation will be penalized for every pair of items that the user has rated the same but the system ranks at different levels. This is not ideal, since the user shouldn't care how the system orders items that the user has rated at the same level. Kendall's Tau metric also suffers the same problem, although to a lesser extent than the Spearman correlation.

3.2.1.6 Breese's Ranking Metric

In [12], Breese presents a new evaluation metric for ACF systems that is designed for tasks where the user is presented with a ranked list of results, and is unlikely to browse very deeply into the ranked list. The task for which the metric is designed is an Internet web-page recommender. It is well accepted that most Internet users will not browse very deeply into results returned by search engines. Breese's metric attempts to evaluate the utility of a ranked list to the user. The utility is defined as the difference between the user's rating for an item and the "default rating" for an item. The default rating is generally a neutral or slightly negative rating. The likelihood that a user will view each successive item is described with an exponential decay function, where the strength of the decay is described by a half-life parameter. The expected utility R_a is shown in Equation 3-6. r_{a,j} represents the rating of user a on item j of the ranked list, d is the default rating (generally a non-committal rating, or slightly negative), and α is the half-life. The half-life is the rank of the item on the list such that there is a 50% chance that the user will view that item. Breese used a half-life of 5 for his experiments, claiming that using a half-life of 10 caused little additional sensitivity in the results.

R_a = \sum_j \frac{\max(r_{a,j} - d,\, 0)}{2^{(j-1)/(\alpha-1)}}        (Eq. 3-6)

The overall score for a dataset across all users, R, is shown in Equation 3-7. R_a^max is the maximum achievable utility if the system ranked the items in the exact order that the user ranked them.

R = 100 \, \frac{\sum_a R_a}{\sum_a R_a^{\max}}        (Eq. 3-7)
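An illustrative sketch of the metric follows (a reconstruction from Equations 3-6 and 3-7, not Breese's own code; the data layout is hypothetical). For each user, R_a is accumulated over the system's ranking of that user's test items, and R_a^max is obtained by ranking the same items by their true ratings.

def breese_utility(ranked_ratings, default=3, half_life=5):
    """R_a (Eq. 3-6) for one user. ranked_ratings holds the user's true
    ratings in the order in which the system ranked the items."""
    return sum(max(r - default, 0) / 2 ** ((j - 1) / (half_life - 1))
               for j, r in enumerate(ranked_ratings, start=1))

def breese_score(users, default=3, half_life=5):
    """Overall score R (Eq. 3-7). `users` maps each user id to a list of
    (predicted rating, true rating) pairs for that user's test items."""
    total = total_max = 0.0
    for pairs in users.values():
        by_prediction = [r for p, r in sorted(pairs, key=lambda x: -x[0])]
        by_rating = sorted((r for p, r in pairs), reverse=True)
        total += breese_utility(by_prediction, default, half_life)
        total_max += breese_utility(by_rating, default, half_life)
    return 100.0 * total / total_max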

Breese’s metric is best for evaluation of type 2 tasks. Breese’s metric applies most of the weight to early items, with every successive rank having exponentially less weight in the measure. To obtain high values of the metric, the first five predicted rankings must consist of items rated highly by the user. This metric could also be applied to type 3 tasks where the length of length of search is not known beforehand, or varies from day to day. The downside is that if the true function describing the likelihood of accessing each rank

is significantly different from the exponential used in the metric then the measured results may not be indicative of actual performance. For example, if the user almost always searches 20 items into the ranked recommendation list, then the true likelihood function is a step function which is constant for the first 20 items and 0 afterwards. Breese's metric may be overly sensitive for tasks that don't require the "best" items to be ranked first, such as user task types 1 and 4. A high value of Breese's utility metric will probably indicate high performance on task 1, since the first few items have such significance. If the number of relevant documents in task 4 is large, then using Breese's metric may not be appropriate, because the metric is assigning exponentially decreasing utility values when the utility is remaining constant. There are three other disadvantages to the Breese metric. First, unlike mean absolute error, prediction-rating correlation, and ROC area, there is no statistical significance measure. Second, the metric does not handle the fact that weak orderings created by the system will result in different possible scores for the same system ranking. Suppose the system outputs a recommendation list, with three items sharing the top rank. If the user rated those three items differently, then depending on what order the metric chooses to parse the items, the metric could have very different values (if the ratings were significantly different). Third, due to the application of the max() function in Breese's utility metric (Equation 3-6), all items that are rated less than the default vote contribute equally to the score. Therefore, an item occurring at system rank 2 that is rated just slightly less than the default rating (which usually indicates ambivalence) will have the same effect as an item that has the worst possible rating. The occurrence of extremely wrong predictions in the high levels of a system ranking can undermine the confidence of the user in the system.

3.2.1.7 The ndpm measure

Ndpm was used to evaluate the accuracy of the FAB ACF system [7]. It was originally proposed by Yao [90]. Yao developed ndpm theoretically, using an approach from decision and measurement theory. Ndpm stands for "normalized distance-based

performance measure". Ndpm (Equation 3-8) can be used to compare two different weakly-ordered rankings.

\mathrm{ndpm} = \frac{2C^- + C^u}{2C^i}        (Eq. 3-8)

C^- is the number of contradictory preference relations between the system

ranking and the user ranking. A contradictory preference relation happens when the system says that item 1 will be preferred to item 2, and the user ranking says the opposite. C^u is the number of compatible preference relations, where the user rates item 1 higher

than item 2, but the system ranking has item 1 and item 2 at equal preference levels. C^i is the number of "preferred" relationships in the user's ranking. This metric is comparable among different queries (hence normalized), because the numerator represents the distance, and the denominator represents the worst possible distance. Ndpm is ideal for evaluating type 3 tasks. It is similar in form to the Spearman and Kendall rank correlations, but provides a more correct interpretation of the effect of tied user ranks. It is also reasonable for evaluating type 2 tasks; however, it does suffer from the same weakness as the rank correlation metrics (interchanges at the top of the ranking have the same weight as interchanges at the bottom of the ranking). Because ndpm does not penalize the system for its orderings when the user ranks are tied, ndpm may also be appropriate for evaluation of type 1 and type 4 tasks. User ratings could be transformed to binary ratings (if they were not already), and ndpm could be used to compare the results to the system ranking. One of the key weaknesses of ndpm with respect to evaluating ranked retrieval is the lack of a statistical significance test. As ndpm only evaluates ordering and not prediction value, it is not appropriate for evaluating type 5 and type 6 user tasks.
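A sketch of the computation is shown below (an illustrative reading of Yao's definition, not code from this dissertation). It counts the contradictory relations C^-, the compatible relations C^u, and the user-preferred relations C^i over all item pairs, and then applies Equation 3-8.

from itertools import combinations

def ndpm(user_ratings, system_scores):
    """user_ratings, system_scores: dicts mapping item id -> value.
    Lower is better; 0 means no user preference is contradicted."""
    c_contra = c_compat = c_preferred = 0
    for i, j in combinations(user_ratings, 2):
        if user_ratings[i] == user_ratings[j]:
            continue                     # the user expresses no preference
        c_preferred += 1
        if user_ratings[i] < user_ratings[j]:
            i, j = j, i                  # orient the pair so the user prefers i
        if system_scores[i] < system_scores[j]:
            c_contra += 1                # the system prefers the other item
        elif system_scores[i] == system_scores[j]:
            c_compat += 1                # the system ties a user preference
    return (2 * c_contra + c_compat) / (2 * c_preferred)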

3.2.2 Which metric to use?

To summarize some of the key points of the discussion, Table 3-2 lists the

appropriate metrics, given certain characteristics of the data or the task that is being

evaluated. An 'X' in a row indicates that the respective evaluation metric is appropriate for the characteristic of the data or task listed in the row header.

Table 3-2. A table that indicates appropriate evaluation metrics, given the requirements stated in the left-most column. The metrics (columns) are: mean absolute error, precision/recall, ROC area, Spearman rank correlation, Kendall's tau-b correlation, Breese's utility metric, and ndpm. The requirements (rows) are: user can request predictions on any item; want to measure ranking effect; non-binary ratings; small test set per query; very discrete user ratings while almost complete system ordering; statistical significance test; very discrete system rankings; many tied ranks; focus exclusively on top ranks. [The individual 'X' entries of the table body could not be recovered in this copy.]

3.2.3 An Empirical Comparison of Evaluation Metrics

In an effort to quantify the differences between the above mentioned evaluation

metrics, we computed all the stated evaluation metrics on a large set of different algorithms and examined the extent to which the different evaluation metrics agreed or disagreed. As part of the research described in Chapter 4, we examined the predictions generated by the basic nearest-neighbor automated collaborative filtering algorithm, after perturbing many different key parameters. We used this data also for examination of evaluation metrics. There were 2180 different algorithms tested, resulting in the same number of sets of predictions. For each of these result sets, we computed mean absolute


error, root mean squared error, Pearson correlation, Spearman rank correlation, area underneath an ROC-4 curve, the Breese utility metric, and the ndpm metric. The data set was the same in all cases – 100,000 movie ratings from 943 users on 1682 items. Each user has rated at least 20 items. For each of the users, 10 rated items are withheld from the training. After training the system with all the other ratings, predictions are generated for the 10 withheld items. Then the predictions (or the prediction-ranked list) are compared to the user's ratings (or the user's ranked list). A scatter plot matrix is shown in Figure 3-3. Each cell of the matrix represents a comparison between two of the metrics. Each point of the scatter plot represents a different ACF algorithm. For several of the metrics, there are two different variations: overall and per-user. The difference between these two techniques is the manner in which averaging was performed. In the overall case, the metric was applied once to the entire test set of ratings, without regard to performance per user. In the per-user case, the metric was applied individually to the results for each user, and then the metric was averaged over all users to arrive at the final value. These two forms of averaging were applied to Pearson prediction-rating correlation and ROC-4.¹

¹ The '4' in ROC-4 indicates that the cutoff between relevant and non-relevant items was chosen to be 4. Therefore items rated 4 or higher are considered relevant, and items rated 3 or lower non-relevant.
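The difference between the two variations amounts to where the averaging happens. The sketch below (illustrative; the function and data names are hypothetical) applies an arbitrary metric either once to the pooled predictions of all users, or separately to each user's ten withheld items before averaging over users.

def overall(metric, per_user_results):
    """Apply `metric` once to the concatenation of all users' test pairs.
    per_user_results maps user id -> list of (prediction, rating) pairs."""
    pooled = [pair for pairs in per_user_results.values() for pair in pairs]
    return metric(pooled)

def per_user_average(metric, per_user_results):
    """Apply `metric` to each user's pairs separately, then average."""
    scores = [metric(pairs) for pairs in per_user_results.values()]
    return sum(scores) / len(scores)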


Figure 3-3: Comparative performance of several different evaluation metrics on 2180 different algorithms (effectively 2180 different ACF systems).

Figure 3-3 shows some striking patterns. First, there is very strong agreement among all the ranking-based metrics (Breese, Pearson, Spearman, Kendall's tau-b, ROC-4, ROC-5, and ndpm), when they are computed per user and averaged over all users. Second, notice that mean absolute error, Pearson Overall, and ROC-4 overall form two linear clusters when they are compared with any of the ranking metrics that are computed per-user. Analyzing this further, consider Figure 3-4. This is an expanded view of a cell in Figure 3-3, showing the scatter plot of Pearson per-user versus Pearson Overall. Furthermore, the points are coded, based on the type of normalization method used in the ACF algorithm that generated the point (normalization method was one of the parameters that was being evaluated in the research that generated this data; see Chapter 4 for more information on what normalization is and how it affects the algorithm). Points represented by crosses are evaluations of ACF algorithms where rating normalization was not used. Points represented by squares are evaluations of ACF algorithms where rating normalization was used.

As shown in the figure, the two point clouds form separate clusters based on the type of normalization used. The lower cloud, representing lower Pearson Overall correlation, is entirely data points where no normalization was used. The upper cloud contains all the data points where some form of normalization was used. In all the cells of Figure 3-3 where we see two linear clusters, the clusters are based on rating normalization. Figure 3-5 shows the same pattern when comparing Pearson per-user with mean absolute error.

Figure 3-4: A plot comparing the performance of the Pearson prediction-rating correlation when averaged per user (x-axis: Pearson per user) to when not averaged (y-axis: Pearson Overall, computed over all ratings once). There are two distinct bands, where membership in the bands is entirely predicted by which algorithmic technique was used to compute the data point. [Scatter plot; the legend lists the normalization variants: combination, weighted_average_zscore, weighted_average_rating, weighted_average_deviation_from_mean, average_zscore, average_rating, average_deviation_from_mean.]

Figure 3-5. This graph plots mean absolute error (y-axis: mean abs err) versus Pearson prediction-rating correlation that is computed per user and averaged (x-axis: Pearson per user). Notice that, similar to Figure 3-4, there are two bands that are completely predicted by the algorithmic technique depicted in the legend. [Scatter plot; the legend lists the same normalization variants as in Figure 3-4.]

One conclusion from Figures 3-3, 3-4, and 3-5 is that mean absolute error has the ability to detect the effects of smaller significant differences in prediction accuracy. These small differences in prediction accuracy are not large enough to affect each user’s ten-item ranking, but when the predictions and ratings are merged together into lists of 9430 items, the small difference in accuracy results in a noticeable difference in ranking. For tasks in which the ordering of the predicted items is much more important than the actual predicted ratings, this suggests that mean absolute error or ranking metrics computed across aggregated lists may be too sensitive, indicating significant differences when there is not enough of a difference to change the rankings.

The other conclusion is that when it comes to comparing per-user rank metrics to each other and mean absolute error to overall rank metrics, there is strong agreement among the different metrics, so it's unclear that there is any horribly wrong choice of metric, once we have chosen appropriately between those two classes. One should keep in mind that these empirical results, while based on numerous data points, all represent perturbations of the same base algorithm. The range of rank scores is not that large. Future work could extend the comparison of these evaluation metrics across significantly different recommendation algorithms.

3.3 Summary and Recommendations

An analysis of the application of different evaluation metrics to six different user tasks indicated that there are key design differences that make certain evaluation metrics more appropriate for certain tasks. These design differences create the potential for inaccurate measurement of certain tasks if the wrong metric is used. However, empirical analysis of the different evaluation metrics on a large collection of “systems” indicates that there is no significant amount of variance among the different evaluation metrics. Results from all of the evaluation metrics collapse into two classes. The first class contains all per-user ranking metrics, where a ranking-based metric was applied to each user’s results independently, and then the result averaged over all users. The second class contains the mean absolute error metrics and ranking metrics when applied once to the concatenation of predictions from all users. Within each class, most of the metrics are in strong agreement. Analysis of the data shows that disagreement between the two classes of metrics occurs only in systems that have specific modifications. These modifications appear to increase the mean error of the predictions, but do not significantly change the rank ordering that each individual user receives. We believe that this is because of the small number of items in each user’s test set (ten). Because there are so few items, the probability that there are two or more items with very close predictions is small. As a result, a small change in prediction accuracy is unlikely to cause a switch in ranking. With large numbers of items in the ranked list, small changes in prediction accuracy have the potential to greatly rearrange the ranking, at least within a local region of the ranking.

For practical purposes, given the results of the empirical testing, the choice of evaluation metric does not seem to have a significant biasing effect on reported results. Within each class, all of the evaluation metrics appeared to be in agreement, and it seems reasonable to believe that there would be agreement between the two classes if there were a significant number of test items per user. Most real-life ACF systems must rank a significant number of items per user, or else they wouldn't be performing a very valuable filtering task (if there were only ten items, the user could probably view them all himself). If a system has been designed to support user tasks of type 5 or 6, where a user can request a prediction for any individual item, then at least one metric from the mean absolute error class should be chosen. If the system has also been designed to support per-user ranked results, then it may also be appropriate to evaluate using a per-user ranked-retrieval metric from the second class of evaluation metrics. However, this may not be true if your test set has a large enough distribution of test items per user. Furthermore, the results from the per-user ranked retrieval metric on a small test set may not catch differences that will affect the results on a live, large data set. If, as we argue, all of the evaluation metrics are in general agreement, then the best approach would be for the research community to standardize on one or two metrics. Even though all the metrics produce roughly the same results when comparing two systems, if two systems have been evaluated with different metrics, it makes them hard to compare, especially between researchers. Mean absolute error is the obvious choice for a standardized metric in automated collaborative filtering evaluation. It is simple, well recognized, and has well understood properties. There is extensive literature and knowledge on performing statistical tests and computing confidence intervals on the results of mean absolute error computations. If a per-user ranked metric must be chosen, the choice for the best metric is not so clear. Ndpm is one of the cleanest metrics, being derived axiomatically [90]. It seems to have theoretical advantages over the similar rank-correlation metrics (Spearman, Kendall's Tau). Pearson correlation is well understood, and accounts for relationships between the rating values and the prediction values, not just between the ranks. Breese's utility metric is the only ranked metric to assign more weight to items earlier in the

ranking. During the empirical analysis, Breese’s metric provided the only uncertainty. While Breese’s utility metric appeared to be correlated with the other per-user rank metrics, it also had the largest number of outliers. Further analysis of the Breese metric is necessary to identify the characteristics of the systems for which the Breese metric disagrees with the classic per-user rank metrics.


Chapter 4: Improving Predictive Accuracy

In this chapter, we present an algorithmic framework for performing collaborative filtering and examine empirically how existing algorithm variants perform under this framework. We present new, effective enhancements to existing prediction algorithms and conclude with a set of recommendations for selection of prediction algorithm variants.

4.1 Problem Space

The problem of automated collaborative filtering is to predict how well a user will like an item that he has not rated, given that user’s ratings for other items and a set of historical ratings for a community of users. A prediction engine collects ratings and uses collaborative filtering technology to provide predictions. An active user provides the prediction engine with a list of items, and the prediction engine returns a list of predicted ratings for those items. Most prediction engines also provide a recommendation mode, where the prediction engine returns the top predicted items for the active user from the database.

The problem space can be formulated as a matrix of users versus items, with each cell representing a user's rating on a specific item. Under this formulation, the problem is to predict the values for specific empty cells (i.e., predict a user’s rating for an item). In collaborative filtering, this matrix is generally very sparse, since each user will only have rated a small percentage of the total number of items. Table 4-1 shows a simplified example of a user-rating matrix where predictions are being computed for movies.

2. Selections from this chapter have been published as a conference paper in the SIGIR Conference on Research and Development in Information Retrieval [36].


              Joe   John   Al   Nathan
Star Wars      5     2      2     5
Hoop Dreams    2     5      2     1
Contact        5            4     5
Titanic        4     3      2     ?

Table 4-1: Collaborative filtering can be represented as the problem of predicting missing values in a user-item matrix. This is an example of a user-item rating matrix where each filled cell represents a user’s rating for an item. The prediction engine is attempting to provide Nathan a prediction for the movie ‘Titanic.’
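To make the sparse-matrix formulation concrete, the sketch below (an illustration of ours, not code from the MovieLens system) stores the ratings of Table 4-1 as a Python dictionary of dictionaries, a common in-memory representation in which absent keys stand for the empty cells of the matrix.

```python
# A sparse user-item rating matrix as a dict of dicts: ratings[user][item] = rating.
# Missing keys correspond to the empty cells that the prediction engine must fill in.
ratings = {
    "Joe":    {"Star Wars": 5, "Hoop Dreams": 2, "Contact": 5, "Titanic": 4},
    "John":   {"Star Wars": 2, "Hoop Dreams": 5, "Titanic": 3},
    "Al":     {"Star Wars": 2, "Hoop Dreams": 2, "Contact": 4, "Titanic": 2},
    "Nathan": {"Star Wars": 5, "Hoop Dreams": 1, "Contact": 5},  # Titanic is the cell to predict
}

def co_rated_items(ratings, a, u):
    """Items rated by both user a and user u -- the overlap on which similarity is computed."""
    return sorted(set(ratings[a]) & set(ratings[u]))

if __name__ == "__main__":
    print(co_rated_items(ratings, "Nathan", "Joe"))   # ['Contact', 'Hoop Dreams', 'Star Wars']
```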

The most prevalent algorithms used in collaborative filtering are what we call the neighborhood-based methods. In neighborhood-based methods, a subset of appropriate users is chosen based on their similarity to the active user, and a weighted aggregate of their ratings is used to generate predictions for the active user. Other algorithmic methods that have been used are Bayesian networks[12], singular value decomposition with neural net classification[10], and induction rule learning[16].

As an example of a neighborhood-based method, consider Table 4-1 again. We wish to predict how Nathan will like the movie “Titanic.” Joe is Nathan's best neighbor, since the two of them have agreed closely on all movies that they have both seen. As a result, Joe's opinion of the movie Titanic will influence Nathan's prediction the most. John and Al are not as good neighbors because both of them have disagreed with Nathan on certain movies. As a result, they will influence Nathan's predictions less than Joe will.

In this chapter, we will explore the space of neighborhood-based collaborative filtering methods and describe some new, better-performing algorithms that we have developed. Neighborhood-based methods can be separated into three steps:

1. Weight all users with respect to similarity with the active user.
2. Select a subset of users to use as a set of predictors (possibly for a specific item).
3. Normalize ratings and compute a prediction from a weighted combination of selected neighbors' ratings.

Within specific systems, these steps may overlap or the order may be slightly different.

We will begin by discussing the most relevant prior work on collaborative filtering algorithms, examining which techniques were used to implement the three steps of neighborhood-based methods.

4.2 Related Work

GroupLens first introduced an automated collaborative filtering system using a neighborhood-based algorithm[68]. GroupLens provided personalized predictions for Usenet news articles. The original GroupLens system used Pearson correlations to weight user similarity, used all available correlated neighbors, and computed a final prediction by performing a weighted average of deviations from the neighbor's mean (Equation 4-1). Here $p_{a,i}$ represents the prediction for the active user $a$ for item $i$, $n$ is the number of neighbors, and $w_{a,u}$ is the similarity weight between the active user and neighbor $u$, as defined by the Pearson correlation coefficient shown in Equation 4-2.

$$p_{a,i} = \bar{r}_a + \frac{\sum_{u=1}^{n} (r_{u,i} - \bar{r}_u)\, w_{a,u}}{\sum_{u=1}^{n} w_{a,u}} \qquad \text{(Eq. 4-1)}$$

$$w_{a,u} = \frac{\sum_{i=1}^{m} (r_{a,i} - \bar{r}_a)(r_{u,i} - \bar{r}_u)}{\sqrt{\sum_{i=1}^{m} (r_{a,i} - \bar{r}_a)^2}\, \sqrt{\sum_{i=1}^{m} (r_{u,i} - \bar{r}_u)^2}} \qquad \text{(Eq. 4-2)}$$
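A minimal sketch of Equations 4-1 and 4-2 follows, assuming the dictionary-of-dictionaries rating representation shown earlier; it illustrates the computation rather than reproducing the original GroupLens implementation, and the use of the absolute weight in the denominator is an assumption of this sketch.

```python
from math import sqrt

def pearson(ratings, a, u):
    """Pearson correlation (Eq. 4-2) computed over the items co-rated by users a and u."""
    common = set(ratings[a]) & set(ratings[u])
    if len(common) < 2:
        return 0.0
    mean_a = sum(ratings[a][i] for i in common) / len(common)
    mean_u = sum(ratings[u][i] for i in common) / len(common)
    num = sum((ratings[a][i] - mean_a) * (ratings[u][i] - mean_u) for i in common)
    den = sqrt(sum((ratings[a][i] - mean_a) ** 2 for i in common)) * \
          sqrt(sum((ratings[u][i] - mean_u) ** 2 for i in common))
    return num / den if den else 0.0

def predict(ratings, a, item):
    """Weighted average of deviations from each neighbor's mean rating (Eq. 4-1)."""
    mean_a = sum(ratings[a].values()) / len(ratings[a])
    num = den = 0.0
    for u in ratings:
        if u == a or item not in ratings[u]:
            continue
        w = pearson(ratings, a, u)
        mean_u = sum(ratings[u].values()) / len(ratings[u])
        num += (ratings[u][item] - mean_u) * w
        # Eq. 4-1 sums the raw weights; taking abs() is a common safeguard when
        # negative correlations are allowed, and is an assumption of this sketch.
        den += abs(w)
    return mean_a if den == 0 else mean_a + num / den
```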

The Ringo music recommender [79] and the Bellcore Video Recommender [38] expanded upon the original GroupLens algorithm. Ringo claimed better performance by computing similarity weights using a constrained Pearson correlation coefficient, shown in Equation 4-3. The value 4 was chosen because it was the midpoint of Ringo’s seven-point rating scale. Ringo limited membership in a neighborhood by only selecting those neighbors whose correlation was greater than a fixed threshold, with higher thresholds resulting in greater accuracy but reducing the number of items for which Ringo was able to generate predictions. To generate predictions, Ringo computed a weighted average of ratings from all users in the neighborhood.

$$w_{a,u} = \frac{\sum_{i=1}^{m} (r_{a,i} - 4)(r_{u,i} - 4)}{\sqrt{\sum_{i=1}^{m} (r_{a,i} - 4)^2}\, \sqrt{\sum_{i=1}^{m} (r_{u,i} - 4)^2}} \qquad \text{(Eq. 4-3)}$$
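A short sketch of the constrained Pearson measure of Equation 4-3; passing the scale midpoint in as a parameter (4 on Ringo's seven-point scale) rather than hard-coding it is a choice of this illustration.

```python
from math import sqrt

def constrained_pearson(ratings, a, u, midpoint=4.0):
    """Constrained Pearson correlation (Eq. 4-3): deviations are taken from the
    rating-scale midpoint instead of from each user's own mean rating."""
    common = set(ratings[a]) & set(ratings[u])
    num = sum((ratings[a][i] - midpoint) * (ratings[u][i] - midpoint) for i in common)
    den = sqrt(sum((ratings[a][i] - midpoint) ** 2 for i in common)) * \
          sqrt(sum((ratings[u][i] - midpoint) ** 2 for i in common))
    return num / den if den else 0.0
```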

The Bellcore Video Recommender used Pearson correlation to weight a random sample of neighbors, selected the best neighbors, and performed a full multiple regression on them to create a prediction. Breese et al. [12] performed an empirical analysis of several variants of neighborhood-based collaborative filtering algorithms. For similarity weighting, Pearson correlation and cosine vector similarity were used, with correlation being found to perform better.

4.3 Experimental Design

4.3.1 Data

In order to compare the results of different neighborhood-based prediction algorithms, we ran a prediction engine using historical ratings data collected for purposes of anonymous review from the MovieLens movie recommendation site. The historical data consisted of 100,000 ratings from 943 users, with every user having at least 20 ratings. The ratings were on a five-point scale, with 1 and 2 representing negative ratings, 4 and 5 representing positive ratings, and 3 indicating ambivalence.

4.3.2 Experimental Method

From each user in the test set, ratings for 10 items were withheld, and predictions were computed for those 10 items using each variant of the tested neighborhood-based prediction algorithms. For each item predicted, the highest-ranking neighbors that have rated the item in question are used to compute a prediction (they form the user's neighborhood for that item). Note that this means that a user may have a different neighborhood for each item. All users in the database are examined as potential neighbors for a user---no sampling is performed.
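A sketch of the hold-out procedure just described, assuming ratings are held as a dict of per-user dicts; the use of `random.sample` and of returning separate train/test structures is ours, not a detail of the original experiment harness.

```python
import random

def withhold_ratings(ratings, n_withheld=10, seed=0):
    """Split each user's ratings into a training profile and n_withheld test items.
    Predictions are later computed for the withheld items and compared against
    the true ratings."""
    rng = random.Random(seed)
    train, test = {}, {}
    for user, items in ratings.items():
        held_out = rng.sample(sorted(items), min(n_withheld, len(items)))
        test[user] = {i: items[i] for i in held_out}
        train[user] = {i: r for i, r in items.items() if i not in test[user]}
    return train, test
```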

The quality of a given prediction algorithm can be measured by comparing the predicted values for the withheld ratings to the actual ratings.

4.3.3 Metrics

For a full discussion of metrics in collaborative filtering, please refer to Chapter 3. For this experiment, we chose to consider the following metrics.

Coverage. Coverage is a measure of the percentage of items for which a recommendation system can provide predictions. Common system features that can reduce coverage are small neighborhood sizes and sampling of users to find neighbors. We compute coverage as the percentage of prediction requests for which the algorithm was able to return a prediction. Unless otherwise noted, all experimental results demonstrated in this chapter had maximal coverage. Maximal coverage may be slightly less than perfect (99.8% in our experiments) because there may be no ratings in the data for certain items, or because very few people rated an item, and those that did had zero correlations or no overlap with the active user.

Accuracy. To measure the accuracy of predictions, we chose mean absolute error as the prediction error evaluation metric.
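Both metrics can be computed directly from the list of prediction requests. The sketch below assumes a predictor function that returns None when it cannot cover a request; that convention is an assumption of this illustration.

```python
def evaluate(predict_fn, test):
    """Compute coverage (fraction of requests answered) and mean absolute error
    (average |prediction - actual| over the answered requests)."""
    abs_errors, requests = [], 0
    for user, items in test.items():
        for item, actual in items.items():
            requests += 1
            predicted = predict_fn(user, item)
            if predicted is not None:       # uncovered requests reduce coverage only
                abs_errors.append(abs(predicted - actual))
    coverage = len(abs_errors) / requests if requests else 0.0
    mae = sum(abs_errors) / len(abs_errors) if abs_errors else float("nan")
    return coverage, mae
```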

4.3.4 Parameters Evaluated

The components that were evaluated are listed in Table 4-2, along with the variations of each component that were tested. For each component, the performance using each of the variations was measured. All components except the one being measured were held constant to ensure that the results reflected the differences in the component being tested.


Component                 Variants Tested
Similarity Weight         Pearson correlation; Spearman correlation; vector similarity;
                          entropy; mean-squared difference
Significance Weighting    No significance weighting; n/50 weighting
Variance Weighting        None; (variance - variance_min) / (variance_max - variance_min)
Selecting Neighborhoods   Weight thresholding; best-n neighbors
Rating Normalization      No normalization; deviation from mean; z-score

Table 4-2: A list of prediction algorithm components tested and the variants of each component that were tested.

4.4 Weighting Possible Neighbors

The first step in neighborhood-based prediction algorithms is to weight all users with respect to similarity with the active user. When you are given recommendations for movies or books, you are more likely to trust those that come from people who have historically proven themselves as providers of accurate recommendations. Likewise, when automatically generating a prediction, we want to weight neighbors based on how likely they are to provide an accurate prediction.

We compute similarity weights to measure closeness between users. However, we have only incomplete data specifying the history of agreement between users. In certain cases, we have only a small sample of ratings on common items for pairs of users. We have to be wary of false indicators of predictive relationships between users. To address these issues, we adjust the similarity weight with significance weighting and variance weighting.

4.4.1 Similarity Weighting

Several different similarity weighting measures have been used. The most common weighting measure is the Pearson correlation coefficient. Pearson correlation (Equation 4-2) measures the degree to which a linear relationship exists between two variables. The Spearman rank correlation coefficient is similar to Pearson, but computes a measure of correlation between ranks instead of rating values. To compute Spearman’s correlation, we first convert the user’s list of ratings to a list of ranks, where the user’s highest rating gets rank 1. Tied ratings get the average of the ranks for their positions. Then the computation is the same as the Pearson correlation, but with ranks substituted for ratings (Equation 4-4). $k_{a,i}$ represents the rank of the active user’s rating of item $i$, and $k_{u,i}$ represents the rank of neighbor $u$’s rating for item $i$.

$$w_{a,u} = \frac{\sum_{i=1}^{m} (k_{a,i} - \bar{k}_a)(k_{u,i} - \bar{k}_u)}{\sqrt{\sum_{i=1}^{m} (k_{a,i} - \bar{k}_a)^2}\, \sqrt{\sum_{i=1}^{m} (k_{u,i} - \bar{k}_u)^2}} \qquad \text{(Eq. 4-4)}$$
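A sketch of the rank conversion behind Equation 4-4: ratings are replaced by ranks (highest rating gets rank 1, ties receive the average of their ranks), after which the Pearson formula is applied to the ranks. The helper names are ours, and the second function reuses the `pearson` sketch given earlier.

```python
def to_ranks(user_ratings):
    """Map each item's rating to its rank (1 = highest rating); tied ratings share
    the average of the ranks they occupy."""
    ordered = sorted(user_ratings.items(), key=lambda kv: -kv[1])
    ranks, i = {}, 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j][1] == ordered[i][1]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0          # average of 1-based positions i+1 .. j
        for k in range(i, j):
            ranks[ordered[k][0]] = avg_rank
        i = j
    return ranks

def spearman(ratings, a, u):
    """Spearman rank correlation (Eq. 4-4): Pearson correlation computed on ranks."""
    ranked = {user: to_ranks(ratings[user]) for user in (a, u)}
    return pearson(ranked, a, u)              # reuses the earlier Pearson sketch
```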

Mean-squared difference (Equation 4-5) is another alternative that was used in the Ringo music recommender [79]. Mean-squared difference gives more emphasis to large differences between user ratings than small differences.

$$d = \frac{\sum_{i=1}^{m} (r_{a,i} - r_{u,i})^2}{m} \qquad \text{(Eq. 4-5)}$$

In our experiments, we have found that Spearman correlation performs similarly to Pearson correlation. As an example, Figure 4-1 shows results for both Spearman and Pearson correlation across a variety of different values of max_nbors3, with all other parameters held constant. Note that the results are very close between Spearman and Pearson. The same data is shown in Table 4-3, along with indications of statistical significance. With the exception of one data point in Table 4-3, there was no statistical difference between the algorithm using Pearson and the algorithm using Spearman.

3. The max_nbors parameter stands for “maximum number of neighbors used” and is discussed in detail in Section 4.5.2.

Occasionally, such as with the max_nbors=60 data point, Pearson correlation will perform significantly better than Spearman.


Figure 4-1. This graph compares the performance of algorithms using three different correlation measures as similarity metrics. Spearman is a rank-based metric, Pearson is the standard product-moment correlation, and mn_sq_diff is the mean squared difference of the ratings. The performance of the two correlation algorithms is effectively the same. The data used for this chart is weighted_average_deviation_from_mean, negative correlations = 0, devalue = 50, and no threshold.

max_nbors   Pearson MAE   Spearman MAE   Significant?   P-value
5           0.7829        0.7855         No             > 0.05
10          0.7618        0.7636         No             > 0.05
20          0.7545        0.7558         No             > 0.05
60          0.7518        0.7529         Yes            0.05
80          0.7523        0.7531         No             > 0.05
100         0.7528        0.7533         No             > 0.05

Table 4-3. This table shows the mean absolute error for algorithms using Pearson and Spearman correlation. This is the same data as shown in Figure 4-1.

The performance of the mean-squared difference as a similarity metric is also shown in Figure 4-1. The graph shows that mean-squared difference results in lower

prediction accuracy than either Spearman or Pearson correlation. Table 4-4 shows the same data on a point-by-point comparison with algorithms using Pearson correlation.

max_nbors   Pearson MAE   mn_sq_diff MAE   Significant?   P-value
5           0.7829        0.7898           Yes            0.05
10          0.7618        0.7718           Yes            0.001
20          0.7545        0.7634           Yes            0
60          0.7518        0.7602           Yes            0
80          0.7523        0.7605           Yes            0
100         0.7528        0.7610           Yes            0

Table 4-4. This table shows the mean absolute error for algorithms using Pearson correlation and mean-squared difference as similarity measures. This is the same data as shown in Figure 4-1.

Given the data we have observed, Spearman correlation does not appear to be valuable. Algorithms using Spearman correlation perform worse than or the same as comparable algorithms using Pearson correlation. Furthermore, computation of Spearman correlation is much more compute-intensive, due to the additional pass through the ratings necessary to compute the ranks. The lack of improvement using Spearman correlation was surprising. We had thought that the inherent ranking characteristics in the rating process would cause the rank-based correlation to perform better. We believe that the large number of tied rankings (there are only five distinct ratings – five distinct ranks – for each user) results in a degradation of the accuracy of Spearman correlations. Increasing the size of the rating scale could increase the difference in accuracy between Spearman and Pearson; however, with larger rating scales, there is the issue of rate-rerate reliability.

4. “Rate-rerate reliability” describes the likelihood that a user, having given a rating on an item, will give the same rating when asked again at a later date. Larger scales lead to lower reliability. For example, imagine we had users rate movies on a 100-point scale, and a user rated “Star Wars” 83. If we asked that same user a month later what he rated the item, it is unlikely that he would say exactly 83 again. However, if the scale is only five points, there is a good chance that he will give the same rating both times.

Other similarity measures include the vector similarity “cosine” measure, the entropy-based uncertainty measure, and the mean-squared difference algorithm. The vector similarity measure has been shown to be successful in information retrieval; however, Breese has found that vector similarity does not perform as well as Pearson correlation in collaborative filtering[12]. The measure of association based on entropy uses conditional probability techniques to measure the reduction in entropy of the active user's ratings that results from knowing another user's ratings[67]. In our tests, entropy has not shown itself to perform as well as Pearson correlation. We also found that the mean-squared difference algorithm, introduced in the Ringo system, did not perform well compared to Pearson correlation.

4.4.2 Significance Weighting

One issue that has not been addressed in previously published studies is the

amount of trust to be placed in a correlation with a neighbor. In our experience with collaborative filtering systems, we have found that it was common for the active user to have highly correlated neighbors that were based on a very small number of co-rated items. These neighbors, based on tiny samples (often three to five co-rated items), frequently proved to be terrible predictors for the active user. The more data points we have with which to compare the opinions of two users, the more we can trust that the computed correlation is representative of the true correlation between the two users.

We hypothesized that the accuracy of prediction algorithms would be improved if we were to add a correlation significance-weighting factor that would devalue similarity weights based on a small number of co-rated items. For our experiments, we applied a linear drop-off to correlations that were based on fewer than a certain threshold number of co-rated items. For example, if our significance threshold was 50 and two users had fewer than 50 commonly rated items, we multiplied their correlation by a significance weight of n/50, where n is the number of co-rated items. If there were 50 or more co-rated items, then no adjustment was applied. In this manner, correlations based on small numbers of co-rated items are appropriately devalued, but correlations with 50 or more commonly co-rated items are not dependent on the number of co-rated items.

Figure 4-2 compares the accuracy with and without the devaluing term, at different significance cutoff levels. The line with a significance cutoff of 1 represents the use of non-weighted raw correlations. Increasing the cutoff to 10 actually caused an increase in error, which was quite surprising. Our interpretation of this result is that 10 is

not a high enough significance threshold to catch the worst offending false correlations, and therefore simply enters noise into the system. However, significance thresholds of 25 or more do improve the accuracy of the system. Such high thresholds effectively remove the non-performing neighbors from your neighborhood. Significance thresholds larger than 50 do not seem to have much additional effect on the accuracy of the system.
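A sketch of the significance-weighting adjustment described above; the function name is illustrative. With threshold = 50, a correlation based on 5 co-rated items is multiplied by 0.1, while one based on 60 items passes through unchanged.

```python
def significance_weighted(correlation, n_corated, threshold=50):
    """Linearly devalue a correlation based on fewer than `threshold` co-rated
    items (multiply by n/threshold); correlations supported by threshold or
    more co-rated items are left unchanged."""
    if n_corated >= threshold:
        return correlation
    return correlation * n_corated / threshold
```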


Figure 4-2. Shows the effects of significance weighting on prediction accuracy. Devalue indicates the number of shared ratings a correlation must be based on in order not to be devalued by the significance weighting; correlations based on fewer shared ratings than the specified constant were devalued linearly. There are 48 data points represented in this chart, with Pearson similarity measure, weighted neighbor contributions, and deviation-from-mean combination normalization.

Note that applying devaluing based on significance thresholds has several other advantages. As you increase the significance threshold, users need to have rated more items in common with you in order to enter your neighborhood. The fact that those users have seen more of the same movies as you is an indication in and of itself of potential commonality of tastes, independent of the ratings. In addition, if you are choosing neighborhoods independent of the items for which ratings are being predicted, high significance thresholds will make it more likely that the neighbors selected have rated a large number of items, thus making them more useful in the generation of predictions.

An additional interesting observation is that devaluing correlation has a noticeably weaker effect on prediction accuracy if a non-weighted average of neighbor ratings is used to generate a prediction. Figure 4-3 shows the effects of devaluing on such algorithms. Compare this to Figure 4-2, which contains data from algorithms where weighted averages of ratings were used.


Figure 4-3. This graph is similar to Figure 4-2, except the 48 points shown here do not weight neighbor contributions to the prediction. Since the similarity measure is used only once (to find the neighborhoods) and not twice (it is not used for weighting contributions), the effect of devaluing the similarity measure is smaller.

4.4.3 Variance Weighting

All the similarity measures described above treat each item evenly in a user-to-user correlation. However, knowing a user's rating on certain items is more valuable than knowing it on others in discerning a user's interest. For example, we have found that the majority of MovieLens users have rated the movie “Titanic” highly. Therefore, knowing that two users rated Titanic high tells us very little about the shared interests of those two users. Opinions on other movies have been known to distinguish users' tastes. The movie “Sleepless in Seattle” has shown itself to separate those users who like action movies from those who like romance movies. Knowing that two people agree on “Sleepless in

Seattle” tells us a lot more about their shared interests than Titanic would have. We hypothesized that giving the distinguishing movies more influence in determining a correlation would improve the accuracy of the prediction algorithm. To achieve this, we modified the mean-squared difference algorithm to incorporate an item-variance weight factor: we added a variance weight for each item and a normalizing factor in the denominator. This is shown in Equation 4-6. By incorporating a variance weight term, we increase the influence of items with high variance in ratings and decrease the influence of items with low variance.

$$d = \frac{\sum_{i=1}^{m} v_i\, (r_{a,i} - r_{u,i})^2}{\sum_{i=1}^{m} v_i} \qquad \text{(Eq. 4-6)}$$

We computed an item variance weight as

$$v_i = \frac{\mathrm{var}_i - \mathrm{var}_{\min}}{\mathrm{var}_{\max}}, \qquad \mathrm{var}_i = \frac{\sum_{u=1}^{n} (r_{u,i} - \bar{r}_i)^2}{n - 1},$$

where $\mathrm{var}_{\min}$ and $\mathrm{var}_{\max}$ are respectively the minimum and maximum variances over all items. Contrary to our initial hypothesis, applying variance-weighting terms had no significant effect on the accuracy of the prediction algorithm. These results are shown in Figure 4-4. One explanation is that our variance-weighting scheme does not take into account the fact that a user who disagrees with the popular feeling provides a lot of information.
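A sketch of the variance-weighted mean-squared difference of Equation 4-6, with the item weights v_i computed from the definition given above; the function and variable names are ours.

```python
def item_variances(ratings):
    """Sample variance of each item's ratings across all users who rated it."""
    by_item = {}
    for user_items in ratings.values():
        for item, r in user_items.items():
            by_item.setdefault(item, []).append(r)
    variances = {}
    for item, rs in by_item.items():
        if len(rs) > 1:
            mean = sum(rs) / len(rs)
            variances[item] = sum((r - mean) ** 2 for r in rs) / (len(rs) - 1)
        else:
            variances[item] = 0.0
    return variances

def variance_weighted_msd(ratings, a, u, variances):
    """Mean-squared difference with item-variance weights v_i (Eq. 4-6)."""
    var_min, var_max = min(variances.values()), max(variances.values())
    common = set(ratings[a]) & set(ratings[u])
    num = den = 0.0
    for i in common:
        v = (variances[i] - var_min) / var_max if var_max else 0.0
        num += v * (ratings[a][i] - ratings[u][i]) ** 2
        den += v
    return num / den if den else 0.0
```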



Figure 4-4. This graph shows the lack of difference in prediction accuracy once variance weighting was applied to the mn_sq_difference algorithm. These data points are no threshold, and weighted average deviation from mean combination.

4.5 Selecting Neighborhoods

After having assigned similarity weights to users in the database, the next step is to determine which other users' data will be used in the computation of a prediction for the active user. It is useful, both for accuracy and performance, to select a subset of users (the neighborhood) to use in computing a prediction instead of the entire user database. Commercial collaborative filtering systems are beginning to handle millions of users, making consideration of every user as a neighbor infeasible. The system must select a subset of the community to use as neighbors at prediction computation time in order to guarantee acceptable response time. Furthermore, many of the members of the community do not have similar tastes to the active user, so using them as predictors will only increase the error of the prediction. Another consideration in selecting neighborhoods suggested by Breese is that high correlates (such as those with correlations greater than 0.5) can be exceptionally more valuable as predictors than those with lower correlations can.

Two techniques, correlation thresholding and best-n neighbors, have been used to determine how many neighbors to select. The first technique, used by Shardanand and Maes, is to set an absolute correlation threshold, where all neighbors with absolute correlations greater than a given threshold are selected. The second technique, used in the GroupLens systems [68], is to pick the best n correlates for a given n.
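The sketch below illustrates the two selection strategies side by side, assuming a precomputed mapping from candidate neighbors to similarity weights; it is an illustration, not the MovieLens implementation.

```python
def select_by_threshold(weights, min_weight=0.1):
    """Correlation thresholding: keep every neighbor whose absolute weight
    exceeds min_weight."""
    return {u: w for u, w in weights.items() if abs(w) > min_weight}

def select_best_n(weights, n=20):
    """Best-n neighbors: keep the n neighbors with the largest weights."""
    top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:n]
    return dict(top)
```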

4.5.1 Correlation Weight Threshold

Correlation thresholding sets a minimum correlation weight that a neighbor must have in order to be accepted into a user’s neighborhood. This ensures that neighbors have a minimum proven predictive value. The downside of correlation thresholding is that it can significantly reduce the coverage of the prediction algorithm. If you set the correlation threshold too high, then very few people make it into a user’s neighborhood, and the ACF engine can only predict for items that those few neighbors have rated.

We tested the effect of correlation weight thresholding in combination with all the other factors. The effect of weight thresholding was consistent across all data runs. Examples of the effects of correlation weight threshold can be seen in Tables 4-5 and 4-6. In both tables, the inverse relationship between weight threshold and coverage is immediately apparent. In our dataset, weight thresholds above 0.2 provide clearly unacceptable coverage.

A somewhat surprising result was that applying a correlation weight threshold never actually improved the accuracy of the predictions. Figure 4-5 shows the mean absolute error of a given set of algorithms for different correlation weight thresholds. It appears from Figure 4-5 that a correlation weight threshold of 0.2 decreases the mean absolute error. We see this in Table 4-5 with a decreased mean absolute error for 0.2, and we see the same for weight thresholds of 0.2 and 0.3 in Table 4-6. However, this turns out to be a “false” gain in prediction accuracy. If we compare the mean absolute error only on items that both algorithms covered, we see that the weight threshold = 0 algorithm always outperforms the higher weight thresholds.


Weight Threshold   Coverage   Mean Absolute Error   MAE (Weight Threshold = 0)
0.00               100%       0.7528                0.7528
0.01               100%       0.7530                0.7528
0.05               99.7%      0.7533                0.7522
0.10               96%        0.7590 (p = 0)        0.7445
0.20               72%        0.7521 (p = 0)        0.7260
0.30               55%        0.7559 (p = 0)        0.7191
0.40               36%        0.7613 (p = 0)        0.6925
0.50               15%        0.7837 (p = 0)        0.6864

Table 4-5. The interaction between correlation weight threshold, coverage, and mean absolute error. As the weight threshold is increased above 0.01, coverage is lost. Column 4 of this table shows the performance of the algorithm with weight-threshold = 0, but only on the items that the weight-thresholded algorithm for the corresponding row covered. It is obvious that weight thresholding in this case has reduced the accuracy of the predictions. These data points are weighted-average-deviation-from-mean, negative correlations = 0, max_nbors = 100, devalue = 50. Mean absolute errors with p-values listed indicate values that have a statistically significant difference compared with the error at threshold = 0.

Weight Threshold   Coverage   Mean Absolute Error   MAE (Weight Threshold = 0)
0.00               100%       0.7467                0.7467
0.01               100%       0.7468                0.7467
0.05               100%       0.7473 (p = 0.05)     0.7461
0.10               96%        0.7526 (p = 0)        0.7384
0.20               72%        0.7432 (p = 0)        0.7184
0.30               55%        0.7452 (p = 0)        0.7108
0.40               36%        0.7475 (p = 0)        0.6847
0.50               15%        0.7745 (p = 0)        0.6753

Table 4-6. The interaction between correlation weight threshold and coverage. This table is similar to Table 4-5. Note the “false” gain in prediction accuracy at threshold = 0.2. The mean absolute error appears to decrease compared to no threshold. However, if you consider the accuracy of the algorithm with no threshold calculated only on items that are covered by the 0.2 threshold algorithm, we find that the 0.2 threshold algorithm is still performing worse. These data points are weighted-average-z-score, negative correlations = 0, max_nbors = 60, devalue = 50.


Figure 4-5. This graph depicts the relationship between correlation weight threshold and mean absolute error of the prediction for one class of algorithms. Points with less than 70% coverage (which are obviously unacceptable) are not shown. Notice that the 0.1 threshold increases the error, while the 0.2 threshold decreases the error slightly, but at a loss of 30% of coverage. The points shown in this graph are weighted-average-z-score, devalue = 50.

For the data set examined, correlation weight thresholding appears to have no redeeming value. In all examined cases, weight thresholding only made matters worse,

decreasing both the coverage and the accuracy of the algorithm. Should the number of users in a system increase (the tested data set had 943 users), the value of using correlation weight thresholding could increase, since there is more likelihood of encountering high-correlating users. However, as the number of users increases significantly beyond 943, it becomes less and less practical to examine all users, forcing the system to rely on sampling. Unless new intelligent sampling techniques are developed, the sample set will be very similar to working with a smaller set of users. Furthermore, this movie rating set is relatively dense: there are a relatively large number of users who have seen the same movies as you. With more sparse data sets, such as web sites, it will be extremely hard to find high correlates that happen to have seen the same web pages as you.

The improvements in prediction accuracy for the non-thresholding algorithm on only the items that were covered by the thresholding algorithms (see column 4 of Tables 4-5 and 4-6) suggest that when high correlates participate in a prediction, the quality of that prediction is likely to rise. However, the low correlates still provide a good amount of value, since discarding them resulted in lower accuracies (column 3 of Tables 4-5 and 4-6).

4.5.2 Maximum Number of Neighbors Used

The data shows that the “maximum number of neighbors to use” (max_nbors) parameter affects the error of the algorithm in a reasonably consistent manner. This is demonstrated in Figure 4-6. When the size of neighborhoods is restricted by using a small max_nbors (less than 20), the accuracy of the system suffers, with increased mean absolute error (Figure 4-6).



Figure 4-6. Graph showing the performance of algorithms when the “maximum number of neighbors to use” parameter is varied. Small neighborhood sizes cause considerably increased error. The error stabilizes around 20 neighbors. Eventually, as neighborhood sizes are increased beyond 20, the error starts to trend upwards. 32 data points are plotted, with algorithms using a significance weighting of n/25 and not using negative correlations.

The loss of accuracy when using small neighborhoods was an expected result. The problem lies in the fact that, with the MovieLens dataset, even the top neighbors are imperfect predictors of the active user’s taste. Each user’s different experiences produce many different subtleties of taste, which makes it very unlikely that there is an excellent predictor for a given user within a database of 943 users. Even if there was one user with a perfect match in tastes, the other members of the neighborhood might cloud the recommendations. As the size of the neighborhood is allowed to increase, more users contribute to the prediction. As a result, variance introduced by individual users is averaged out over the larger numbers. Some serendipity may be lost, since consensus is required for a high prediction, but overall accuracy increases. We have found that even in the full MovieLens database, which contains approximately 80,000 users, there aren’t many exceptionally high correlations between users once there are more than 20 shared items. So there doesn’t seem to be potential for using small neighborhood sizes. In most real-world situations, a neighborhood of 20 to 50 neighbors seems reasonable, providing enough neighbors to average out extremes.

Another observation is that the error increases slowly as the max_nbors parameter is increased beyond 20. As the number of neighbors consulted during a prediction increases, the probability of having to include low-correlating neighbors increases. However, this effect is much less pronounced than the effect of small neighborhood sizes. Consider Table 4-7, which shows the results of applying a paired-sample t-test to each pair of data points on the weighted_average_deviation_from_mean line of Figure 4-6. Max_nbors values of 5 and 10 are clearly worse than a max_nbors of 20, but larger values of max_nbors do not provide statistically significant differences in mean absolute error. Small values of max_nbors always resulted in an increase in error that was statistically significant.

max_nbors   MAE      t-value
5           0.7836   -10.18 (p = 0)
10          0.7605   -4.74 (p = 0)
20          0.7524   0.00
30          0.7520   0.41
40          0.7511   1.11
60          0.7508   1.14
80          0.7518   0.37
100         0.7527   -0.17

Table 4-7. Significance of differences in MAE due to the parameter max_nbors. Each row shows the MAE of the algorithm and the paired-sample t-value computed when comparing the algorithm to the max_nbors = 20 algorithm. This data is the same as the weighted_average_deviation_from_mean curve shown in Figure 4-6. The differences for max_nbors values of 5 and 10 are statistically significant.

4.6 Producing a Prediction

Once the neighborhood has been selected, the ratings from those neighbors are combined to compute a prediction. The basic way to combine all the neighbors' ratings into a prediction is to compute an average of the ratings. The averaging technique has been used in all published work using neighborhood-based ACF algorithms. We discuss two modifications to the ratings combination algorithm: rating normalization and weighting neighbor contributions. Both have been proposed as improvements to the ratings combination algorithm [68].

4.6.1 Rating Normalization

The basic weighted average assumes that all users rate on approximately the same

distribution. From observation of collected data, we know that users do not all rate on the same distribution. Therefore, it makes sense to perform some sort of transformation so that all users’ ratings are in the same space. The approach taken by GroupLens was to compute the average deviation of a neighbor's rating from that neighbor's mean rating, where the mean rating is taken over all items that the neighbor has rated. The deviation-from-mean approach is demonstrated in Equation 4-1. The justification for this approach is that users may have rating distributions centered around different points. One user may tend to rate items higher, with good items getting 5s and poor items getting 3s, while other users may give primarily 1s, 2s, and 3s. Intuitively, if a user infrequently gives ratings of 5, then that user should not receive many predictions of 5 unless they are extremely significant. The average deviation from the mean computed across all neighbors is converted into the active user's rating distribution by adding it to the active user's mean rating. An extension to the GroupLens algorithm is to account for the differences in spread between users' rating distributions by converting ratings to z-scores and computing a weighted average of the z-scores (Equation 4-7).

$$p_{a,i} = \bar{r}_a + \sigma_a\, \frac{\sum_{u=1}^{n} \left(\dfrac{r_{u,i} - \bar{r}_u}{\sigma_u}\right) w_{a,u}}{\sum_{u=1}^{n} w_{a,u}} \qquad \text{(Eq. 4-7)}$$
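A sketch of the z-score combination of Equation 4-7, assuming each user's mean and standard deviation are computed over all of that user's ratings and that the neighbor weights have already been computed (for example, by the Pearson and significance-weighting sketches above); as elsewhere, the helper names and the absolute-value denominator are assumptions of this illustration.

```python
from math import sqrt

def mean_and_std(user_ratings):
    """Mean and standard deviation of one user's ratings."""
    values = list(user_ratings.values())
    mean = sum(values) / len(values)
    std = sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return mean, std

def predict_zscore(ratings, a, item, neighbors):
    """Eq. 4-7: weighted average of neighbors' z-scores, rescaled into the
    active user's rating distribution. `neighbors` maps user -> similarity weight."""
    mean_a, std_a = mean_and_std(ratings[a])
    num = den = 0.0
    for u, w in neighbors.items():
        if item not in ratings[u]:
            continue
        mean_u, std_u = mean_and_std(ratings[u])
        if std_u == 0:
            continue                      # a neighbor with constant ratings carries no z-score signal
        num += ((ratings[u][item] - mean_u) / std_u) * w
        den += abs(w)
    return mean_a if den == 0 else mean_a + std_a * (num / den)
```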

We compared the three different modes of rating normalization: no normalization, deviation-from-mean, and z-score. The results from 404 different algorithms are shown in Figure 4-7. Note that the three different modes of normalization form three clearly identifiable bands in the chart. Performing rating normalization produces an obvious benefit. The deviation-from-mean normalization performs significantly better than no normalization, while the z-score normalization performs only slightly better than the deviation-from-mean normalization.



Figure 4-7. A scatterplot demonstrating the effect of rating normalization on prediction accuracy. Each point represents the accuracy of one algorithm. There are 404 algorithms shown here, all using Pearson correlation, but with varying significance weighting, varying values of max_nbors, and both with and without using negative correlations.

The increase in accuracy due to normalization comes about because of the differences in rating distributions among users. These differences exist because of both different perceptions of the world, and different perceptions of the ratings scale. The most significant difference between user rating distributions is a lateral shift in the distribution. For example, some users tend to give primarily positive ratings, saving the less than average ratings for the worst movies. Other users may rate most movies less than average, and give the above-average ratings to the few excellent movies. As a result, their ratings distributions are shifted when compared with each other. The rate-high user will have a high mean and median rating, while the rate-low user will have a low mean and median rating. A movie that the rate-high user rates 4 will probably only be a 3 for the rate-low user. By performing the deviation-from-mean normalization, we account for shifts in rating distributions. The z-score normalization accounts for differences in widths of the rating distributions in addition to shifts. By widths, we mean the variance of the rating

distribution. The intuition here is that some users are willing to give plentiful ratings on the extremes of their scale (such as 1s and 5s), while other users may rarely give extreme ratings. Thus, their rating distributions will have different variances. By performing z-score normalization, we appropriately adjust each prediction into the ratings distribution of the user receiving the prediction. As can be seen from Figure 4-7, z-score normalization provides a small increase in accuracy over deviation-from-mean normalization. However, performing z-score normalization is relatively cheap computationally and requires only one additional storage element per user to record the variance of the user’s ratings.

4.6.2 Weighting Neighbor Contributions

Having computed a similarity measure to locate the closest neighbors, it makes sense to use that similarity measure to weight the contribution of each neighbor based on how close they are to the active user. The original GroupLens algorithm (see Equation 4-1) did perform this weighting, while the Ringo algorithm [79] did not; the Ringo algorithm simply averaged the ratings of users in the selected neighborhood.

Figure 4-8 demonstrates the effect of weighting neighbor contributions on the accuracy of predictions. The crosses represent the accuracy of algorithms that weighted neighbor contributions, while the diamonds represent algorithms that did not weight neighbor contributions. Clearly, weighting neighbor contributions lowers the mean absolute error. There is some overlap between the two modes shown in the graph. This can be explained by variations in other parameters. What you are seeing is that the best performing algorithms without neighborhood weighting are about as good as the worst performing algorithms with neighborhood weighting. Figure 4-9 shows what the graph would look like when the significance weighting parameter is controlled.



Figure 4-8. Demonstrates the effect of weighting neighbor’s contributions to a prediction. The diamonds represent algorithms where weighting was used. The crosses represent where weighting was not used. This graph shows 204 algorithms, all using deviation-from-mean normalization.


Figure 4-9. Indicates the clear separation between data points that weight neighbor contributions and those that don’t. This was created from the data points shown in Figure 4-8 by controlling the significance weighting factor (using n/50).

4.7 Summary

Collaborative filtering is an exciting new approach to filtering information that can select and reject items from an information stream based on characteristics beyond content, such as quality and taste. It has the potential to enhance existing information filtering and retrieval techniques. In this chapter, we have presented an algorithmic framework that breaks the collaborative prediction process into components, and we have provided empirical results regarding variants of each component, as well as presented new algorithms that enhance the accuracy of predictions.

The empirical conclusions in this chapter are drawn from an analysis of historical data collected from an operational movie prediction site. The data is representative of a large set of rating-based systems, where the domain of predictions is high-volume targeted entertainment with a generally high level of quality control. Domains meeting these criteria include movies, videos, books, and music. There is reason to believe that these results are generalizable to other prediction domains, but we do not yet have empirical results to prove it. Our algorithmic recommendations are certainly a good place to start when exploring a new and different prediction domain.

We have made new contributions in each of the three steps of the neighborhood-based prediction algorithm. We have shown that Spearman correlation is not an appropriate replacement for the Pearson correlation coefficient, taking longer to compute and producing less accuracy. We demonstrated that incorporating significance weighting by devaluing correlations that are based on small numbers of co-rated items provided a significant gain in prediction accuracy. We hypothesized that decreasing the contributions of items that had a low rating variance across all users would increase prediction accuracy, but this proved false, with variance weights failing to improve prediction accuracy. Best-n neighbors proved to be the best approach to selecting neighbors to form a neighborhood, while correlation-weight thresholding did not provide any clear value. When it comes to combining ratings to form a prediction, deviation-from-mean averaging was shown to increase prediction accuracy significantly over a normal weighted average, while z-score averaging provided no significant improvements over deviation-from-

mean. Furthermore, weighting the contributions of neighbors by their correlation with the user did increase the accuracy of the end predictions.

For those who are considering using a neighborhood-based prediction algorithm to perform automated collaborative filtering, we have the following recommendations. Use Pearson correlation as a similarity measure; it remains the most accurate technique for computing similarity. If your rating scale is binary or unary, you may have to consider a different approach (see Breese [12] for more information). It is important to use a significance weight to devalue correlates with small numbers of co-rated items, as it will often give you a larger gain in accuracy than your choice of similarity algorithm. Finally, users will rate on slightly different scales, so use the deviation-from-mean approach to normalization. These recommendations are summarized in Table 4-8.

Component                Recommended                        Not Recommended
Similarity Weighting     Pearson correlation                Spearman, entropy, vector similarity,
                                                            mean-squared difference
Significance Weighting   Yes
Selecting Neighbors      Top-n                              Weight thresholding
Rating Normalization     Deviation-from-mean or z-score     No normalization

Table 4-8. A tabulation of recommendations based on the results presented in this chapter.

In the process of examining personalized algorithms, we also discovered a much more accurate non-personalized average algorithm. Automated collaborative filtering systems use non-personalized average algorithms to provide predictions when not enough is known about the user to provide a personalized prediction. The normal approach is to compute the average rating of the item being predicted over all users (Equation 4-8).

$$p_{a,i} = \frac{\sum_{u=1}^{n} r_{u,i}}{n} \qquad \text{(Eq. 4-8)}$$

However, we have found that computing a deviation-from-mean average over all users (Equation 4-9) results in a much more accurate non-personalized prediction as is demonstrated in Table 4-9.


$$p_{a,i} = \bar{r}_a + \frac{\sum_{u=1}^{n} (r_{u,i} - \bar{r}_u)}{n} \qquad \text{(Eq. 4-9)}$$

Average Algorithm             MAE      t-value (compared to average rating)
Average rating                0.8354
Average z-score               0.7900   9.77 (p = 0)
Average deviation from mean   0.7835   12.2 (p = 0)

Table 4-9. Two more accurate overall-average prediction algorithms. Both the average-z-score and average-deviation-from-mean algorithms are significantly more accurate than the base average-rating algorithm.
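A sketch contrasting the two non-personalized baselines of Equations 4-8 and 4-9, again assuming the dictionary-of-dictionaries ratings structure used in the earlier sketches.

```python
def item_average(ratings, item):
    """Eq. 4-8: predict the plain average of all ratings given to the item."""
    values = [r[item] for r in ratings.values() if item in r]
    return sum(values) / len(values) if values else None

def deviation_from_mean_average(ratings, a, item):
    """Eq. 4-9: average each rater's deviation from their own mean rating,
    then shift the result onto the active user's mean."""
    mean_a = sum(ratings[a].values()) / len(ratings[a])
    deviations = []
    for u, items in ratings.items():
        if item in items:
            mean_u = sum(items.values()) / len(items)
            deviations.append(items[item] - mean_u)
    return (mean_a + sum(deviations) / len(deviations)) if deviations else None
```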


Chapter 5: Explanations: Improving the Performance of Human Decisions

Automated collaborative filtering (ACF) systems predict a person’s affinity for items or information by connecting that person’s recorded interests with the recorded interests of a community of people and sharing ratings between like-minded persons. However, current recommender systems are black boxes, providing no transparency into the working of the recommendation. Explanations provide that transparency, exposing the reasoning and data behind a recommendation. In this chapter, we address explanation interfaces for ACF systems – how they should be implemented and why they should be implemented. To explore how, we present a model for explanations based on the user’s conceptual model of the recommendation process. We then present experimental results demonstrating what components of an explanation are the most compelling. To address why, we present experimental evidence that shows that providing explanations can improve the acceptance of ACF systems. We also describe some initial explorations into measuring how explanations can improve the filtering performance of users.

5.1 Background

While automated collaborative filtering (ACF) systems have proven to be accurate enough for entertainment domains [2,19,20,38,44,47,51], they have yet to be successful in content domains where higher risk is associated with accepting a filtering recommendation. While a user may be willing to risk purchasing a music CD based on the recommendation of an ACF system, he will probably not risk choosing a honeymoon vacation spot or a mutual fund based on such a recommendation.

There are several reasons why ACF systems are not trusted for high-risk content domains. First, ACF systems use models that are not always appropriate for every user’s current information need. Second, and probably most important, ACF systems base their computations on extremely sparse and incomplete data. These two conditions lead to recommendations that are often correct, but also occasionally very wrong. More importantly, it is hard for users to predict when a recommendation might be correct and when it might be wrong. ACF systems today are black boxes, computerized oracles that give advice, but cannot be questioned. A user is given no indicators to consult to

determine when to trust a recommendation and when to doubt one. These problems have prevented acceptance of ACF systems in all but low-risk content domains.

Explanation capabilities provide a solution to building trust and may improve the filtering performance of people using ACF systems. An explanation of the reasoning behind an ACF recommendation provides transparency into the workings of the ACF system. Users will be more likely to trust a recommendation when they know the reasons behind that recommendation. Explanations will help users understand the process of ACF, and know where its strengths and weaknesses are. As a result, users can better judge for themselves when a recommendation is sound and when it is speculative.

5.2 Sources of Error

Explanations help either detect or estimate the likelihood of errors in the recommendation. Let us examine the different sources for errors. Errors in recommendations by automated collaborative filtering (ACF) systems can be roughly grouped into two categories: model/process errors and data errors.

5.2.1 Model/Process Errors

Model or process errors occur when the ACF system uses a process to compute

recommendations that does not match the user’s requirements. For example, suppose Nathan establishes a rating profile containing positive ratings for both movie adaptations of Shakespeare and Star Trek movies. Whether he prefers Shakespeare or Star Trek depends on his context (primarily whether or not he is taking his lady-friend to the movies). The ACF system he is using, however, does not have a computational model that is capable of recognizing the two distinct interests represented in his rating profile. As a result, the ACF system may match Nathan up with hard-core Star Trek fans, resulting in a continuous stream of recommendations for movies that could only be loved by a Star Trek fan, regardless of whether Nathan is with his lady-friend or not.

5.2.2 Data Errors

Data errors result from inadequacies of the data used in the computation of

recommendations. Data errors usually fall into three classes: not enough data, poor or bad data, and high variance data.

Missing and sparse data are two inherent factors of ACF computation. If the data were complete, there would be no need for ACF systems to predict the missing data points. Items and users that have newly entered the ACF system are particularly prone to error. When a new movie is first released, very few people have rated the movie, so the ACF must base predictions for that movie on a small number of ratings. Because there are only a small number of people who have rated the movie, the ACF system may have to base recommendations on ratings from people who do not share the user’s interests very closely. Likewise, when new users begin using the system, they are unwilling to spend excessive amounts of time entering ratings before seeing some results, forcing the ACF system to produce recommendations based on a small and incomplete profile of the user’s interests. The result is that new users may be matched with other people who share their interests on a small subset of items, but in actuality don’t share much more in common.

Even in cases where considerable amounts of data are available about the users and the items, some of the data may contain errors. For example, suppose Nathan accidentally visits a pornography site because the site is located at a URL very similar to that of the White House – the residence of the US president. Nathan is using an ACF system that considers web page visits as implicit preference ratings. Because of his accidental visit to the wrong web site, he may soon be surprised by the type of movies that are recommended to him. However, if he has visited a large number of web sites, he may not be aware of the offending rating.

High variance data is not necessarily considered bad data, but can be the cause of recommendation errors. For example, of all the people selected who have rating profiles similar to Nathan’s, half rated the movie “Dune” high and half rated it low. As a movie that polarizes interests, the proper prediction is probably not the average rating (indicating ambivalence), although this is probably what will be predicted by the ACF system.

5.3 Explanations

Explanations provide us with a mechanism for handling errors that come with a recommendation. Consider how we as humans handle suggestions as they are given to us

by other humans. We recognize that other humans are imperfect recommenders. In the process of deciding to accept a recommendation from a friend, we might consider the quality of previous recommendations by the friend, or we may consider how that friend’s general interests compare to ours in the domain of the suggestion. However, if there is any doubt, we will ask “why?” and let the friend explain their reasoning behind a suggestion. Then we can analyze the logic of the suggestion and determine for ourselves if the evidence is strong enough. It seems sensible to provide explanation facilities for recommender systems such as automated collaborative filtering systems.

Previous work with another type of decision aid – expert systems – has shown that explanations can provide considerable benefit. The same benefits seem possible for automated collaborative filtering systems. Most expert systems that provided explanation facilities, such as MYCIN[1], used rule-based reasoning to arrive at conclusions. MYCIN provided explanations by translating traces of rules followed from LISP to English. A user could ask both why a conclusion was arrived at and how much was known about a certain concept. Other work describing explanation facilities in expert systems includes Horvitz, Breese, and Henrion[39]; and Miller and Larson[59]. Since collaborative filtering does not generally use rule-based reasoning, the problems of explanation there will require different approaches and different solutions.

Work related to explanations can be found in cognitive science, psychology, and philosophy. Johnson & Johnson[43] argue that there is a need for a unified theory of explanation in human-computer interfaces, which would provide predictions for the proper use of content, timing, and patterns of successful explanation. They state that empirical studies are necessary to explore and define the unified theory of explanation. There has also been considerable study of the psychology of questioning and question answering with humans and how it can be applied to human-computer interfaces[2,3]. Philosophers have studied the rules and logic of human discourse, such as in the book “The Uses of Argument” by Toulmin[85]. The work described in this chapter is preliminary empirical work, and does not draw heavily on theories from the social sciences, although there may be considerable potential in that line of research.

Building an explanation facility into a recommender system can benefit the user in many ways. It removes the black box from around the recommender system, and provides transparency. Some of the benefits provided are:

Justification. User understanding of the reasoning behind a recommendation, so that he may decide how much confidence to place in that recommendation.

User Involvement. User involvement in the recommendation process, allowing the user to add his knowledge and inference skills to the complete decision process.

Education. Education of the user as to the processes used in generating a recommendation, so that he may better understand the strengths and limitations of the system.

Acceptance. Greater acceptance of the recommender system as a decision aid, since its limits and strengths are fully visible and its suggestions are justified.

Together, the potential for increasing the impact of automated collaborative filtering systems is great.

5.4 Research Questions

There are three key research questions that we are interested in answering about the use of explanations with automated collaborative filtering (ACF) systems.

1. What models and techniques are effective in supporting explanation in an ACF system? An ACF system's computational model can be complex. What is the right amount of detail to expose? How much information is too much? What type of data do we present? Can we develop a theoretical model to guide our design decisions? Many such issues can be explored through theoretical design and experimentation with users.

2. Can explanation facilities increase the acceptance of automated collaborative filtering systems? We believe that by providing transparency into the workings of the ACF process, we will build users' confidence in the system, and increase their willingness to use the ACF system as a decision aid.

3. Can explanation facilities increase the filtering performance of ACF system users? The goal of an ACF system is to reduce information overload by helping the user to separate the good and valuable from that which is not. The information filter helps users to make decisions about which items to consume, such as what books to read or what movies to watch. We want ACF systems that result in more of the correct decisions, as well as filters that improve the rate at which we can process information, without missing anything important. But we also want systems that can reduce stress by making us more confident about our decisions. Can explanation interfaces strengthen these effects?

5.5 Building a Model of Explanations

There are many different ways that we can explain a recommendation from an automated collaborative filtering (ACF) system. What kinds of data summaries are the most important or the most effective? What format of presentation is the most understandable and effective? What are the most compelling forms of explanation that we can give for a collaborative filtering recommendation? To address the first research question, we work "outside-in." That is to say, we start with the user-perceived conceptual model [63] of the ACF system, and then from that we generate the key components of explanation. We discuss the white-box and black-box conceptual models, as well as misinformed conceptual models.

5.5.1 White Box Conceptual Model

One of the strengths of ACF is that it has an easily communicated and understood conceptual model of operation. The operation of an ACF system is analogous to the human word-of-mouth recommendation. Users of an ACF system are provided with the following three-step conceptual model of the operation of the ACF system:

1. User enters profile of ratings
2. ACF system locates people with similar profiles (neighbors)
3. Neighbors' ratings are combined to form recommendations

At the implementation level, these steps are broken up into more detailed steps, but the user is generally not aware of such details. A user's perception of the performance

of the above listed three tasks will affect her perception of the performance of the overall ACF system. From this model, we can derive potential means of explanation of an ACF recommendation. We can focus on techniques to justify that the ACF system is indeed performing each of the above steps to the satisfaction of the user and her current context. Let us examine each of the steps in more detail, focusing on two components that we need to explain: the process and the data.

1. User enters profile of ratings

Explaining step (1) may seem relatively straightforward, but consideration of this step brings to light information that can be very important to the user. Consider process information: exactly how was the profile information collected? We can collect an immense amount of preference information from the user, both implicit, such as pageviews, and explicit, such as numeric ratings. Any interactions the user has with a system can possibly affect the outcome of the recommendations. The user can benefit from knowing how her actions have affected her recommendations. For example, a user, upon learning that web-page visits are considered weak ratings, may decide to provide more explicit preference ratings in order to have greater control over her recommendations. An explanation might need to state what kinds of preference information were used in producing a given recommendation.

What kinds of data does the profile consist of? Was the movie recommendation for Titanic based purely on ratings of previously seen movies, or on the fact that the user spent time reading plot summaries of romance movies? Is every rating in the profile of equal weight? Users frequently believe that more recent ratings are more influential than older ones. In addition, has the user provided enough profile information to allow for high-quality recommendations? Perhaps the user has not rated a large enough or diverse enough set of movies to allow the ACF system to provide accurate recommendations with confidence. An explanation interface might be required to give the user feedback on the quality of her profile. For example, we have designed an explanation interface that identifies movie ratings in a user profile that had the most significant effect on a prediction. Ratings in the profile that have an exceptionally significant effect on the recommendation are a sign that the profile may not be diverse

enough, as well as an indication of potential similarities in content or taste between the significant item and the item being recommended.

2. ACF system locates people with similar profiles (neighbors)

It is in performing step (2) that ACF systems show their true value over normal human word-of-mouth recommendations, with ACF systems being able to examine thousands of potential soulmates and choose the most similar of the bunch. What do we have to do to help the user determine if the ACF system has identified the correct set of neighbors for the user's current context of need?

The process that is used to locate other people with similar profiles is one key to the success of the collaborative filtering technology. If the neighbors selected by the system are the best predictors for the user's current information need, then the resulting recommendations will be the best possible. This is especially important for higher risk domains, where the user will often want to know when approximations and shortcuts are taken. For example, most ACF systems have huge numbers of profiles; their user community often numbers in the millions. These same ACF systems must also provide thousands of predictions per second. Supporting this large number of users at that level of performance requires many approximations. In most cases, the neighbors selected are not necessarily the "best" neighbors, but rather are the most similar profiles that could be found by sampling the profiles that are available in high-speed memory. The similarity metric that is used to judge potential neighbors can also be important in evaluating a prediction. Does the "closeness" measured by the given similarity metric match the user's current context of information need?

Providing descriptions of the data processed in locating a neighbor can be important to explaining a prediction. How many potential neighbors were examined (i.e., what was the sample size)? Of the neighbors that were selected, what do their profiles look like? Do their interests match the user's current context of information need? When measuring similarity between users, most ACF systems will give equal weight to all items that occur in both profiles. However, the user will often have strong constraints that are not captured by the system. For example, a user may feel that anyone who rated Star Wars low has no right to give ratings for science fiction movies. An explanation could give the user the ability to examine the ratings of the chosen neighbors, and when the user

discovers the offending neighbor, he can disregard the prediction, or perhaps the system will allow him to manually remove that neighbor from consideration.

3. Neighbors' ratings are combined to form recommendations

The final step is explaining the data and the process of taking the ratings of the neighbors and aggregating them into a final prediction. It is at this level that many of the symptoms of weak predictions can be discovered with good explanations. The data are the most important in explaining this step. Users can benefit greatly from knowing exactly how each of their neighbors rated the item being explained, or, if there are large numbers of ratings, the distribution of ratings for the item being recommended. They can combine this information with information from step (2), such as knowing how "good" or "close" their neighbors are. Users can detect instances where the prediction is based on a small amount of data, and investigate further to determine if a recommendation is an error or just a sleeper item.

For example, imagine that Jane has received a movie recommendation from an ACF-based movie recommender. She requests an explanation for the movie. She finds that the recommendation is based only on the ratings of five of her neighbors. From this Jane knows that the movie is either very new, not well known, or has received bad publicity. Of the five ratings from her neighbors, three are exceptionally high, and two are ambivalent or slightly negative. She then looks closely at the profiles of the neighbors. The three who liked the movie seem to share her interests in eclectic art films. The two who did not rate the movie higher seemed to share only Jane's interest in popular Hollywood films. From this information, Jane determines that the movie is probably a not-well-known art film and decides to trust the recommendation.

The process used to aggregate neighbor ratings into a prediction may also be of interest to the user. However, in most cases, the prediction is simply a weighted average of the neighbors' ratings.
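To make the aggregation step concrete, the following is a minimal sketch of a similarity-weighted average of neighbors' ratings, the general form of prediction referred to above. The specific neighbors, similarity values, and the choice to ignore non-positive similarities are illustrative assumptions, not the exact MovieLens formula.

    def predict_rating(neighbor_ratings, neighbor_similarities):
        """Aggregate neighbors' ratings into one prediction using a
        similarity-weighted average (illustrative sketch only)."""
        numerator = 0.0
        denominator = 0.0
        for neighbor, rating in neighbor_ratings.items():
            sim = neighbor_similarities.get(neighbor, 0.0)
            if sim <= 0:          # skip dissimilar or unknown neighbors
                continue
            numerator += sim * rating
            denominator += sim
        if denominator == 0:
            return None           # no usable neighbors -> no prediction
        return numerator / denominator

    # Hypothetical example: three neighbors rated the target movie 5, 4, and 2.
    ratings = {"n1": 5, "n2": 4, "n3": 2}
    sims = {"n1": 0.9, "n2": 0.6, "n3": 0.2}
    print(predict_rating(ratings, sims))   # roughly 4.3

An explanation interface can expose exactly these inputs – the neighbors' ratings and their weights – which is what the rest of this chapter investigates.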

5.5.2 Black Box Model

Often, there is not the opportunity, or possibly the desire, to convey the conceptual model of the system to each user. In such cases, the ACF system becomes a black box recommender, producing recommendations like those of an oracle. The user

may not even be aware that the ACF system is collecting implicit ratings, such as time-spent-reading [47,61], to fuel the recommendations. For example, a video store could use past rental history as rating profiles and produce personalized recommendations for users based on ACF technology. For fear that other video stores will copy their technique, they do not wish to reveal the process by which they compute recommendations, yet they would like to provide some sort of explanation, justification, or reason to trust the recommendation.

In these situations, the forms of explanation generated through the white box model are not appropriate. We must focus on ways to justify recommendations that are independent of the mechanics that occur within the black box recommender. One technique is to use the past performance of the recommender as justification. For example, an explanation might include the statement "This system has been correct for you 80% of the time when recommending items similar to this one." Another technique might be to bring in supporting evidence that may not have been used during the computation of the recommendation. For example, even though the video store recommendation was based only on the purchase records of customers, the video store could justify its predictions by displaying correlating recommendations from local film critics.

Any white box can be viewed as a black box by focusing only on the inputs and outputs. Because of this, forms of explanation for black box recommenders should also be useful in providing explanations for white box recommenders. For example, even if information about the process and data used in the computation is available to the user, knowing the system's past overall performance can be very useful.
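As an illustration of the statistic behind a past-performance explanation such as "correct for you 80% of the time," the sketch below counts how often a stored prediction fell within one star of the rating the user later entered. The data layout and the one-star definition of "correct" are assumptions for illustration, not how any deployed system necessarily defines accuracy.

    def past_accuracy(history, tolerance=1.0):
        """Fraction of past predictions that landed within `tolerance`
        of the rating the user eventually entered (illustrative sketch)."""
        if not history:
            return None
        correct = sum(1 for predicted, actual in history
                      if abs(predicted - actual) <= tolerance)
        return correct / len(history)

    # Hypothetical (predicted, actual) pairs from one user's history.
    history = [(4.5, 5), (3.0, 1), (4.0, 4), (2.5, 3), (5.0, 4)]
    print(past_accuracy(history))   # 0.8 -> "correct 80% of the time"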

5.5.3 Misinformed Conceptual Models

It is inevitable that some users will form incorrect conceptual models of the ACF systems that they are using to filter information. One common misconception that users acquire is that an ACF system is making decisions based on content characteristics. For example, several users of MovieLens have written us with comments that make it clear they believe we are recommending based on movie content characteristics such as director, actors/actresses, and genre. Here, the educational aspect of explanations comes

into play. Users with conflicting conceptual models will quickly realize that the explanations do not match their expectations, and, through the process of examining explanations, learn the proper conceptual model.

A related issue occurs when users are intentionally led to believe in an incorrect conceptual model. This might happen if the computational model is believed to be too complex to explain, so users are led to believe that a simpler, more understandable process is being used. There could even be instances where the recommender is using methods that the user could consider subversive, such as claiming to provide personalized recommendations while pushing high-inventory or high-margin items. All these issues greatly complicate explanations, and we do not discuss them in this chapter.

5.6 Experiment 1 – Investigating the Model

The cognitive models described act as a guide that can indicate potential key areas of explanation. However, there are huge amounts of information that could be explained in a prediction. Automated collaborative filtering (ACF) tools evolved to combat information overload, and we should avoid creating a new kind of information overload by presenting too much data or data that is too confusing. When we design an explanation interface to an ACF system, we are faced with the initial problem: what exactly do we explain, and in what manner? The model we have described solves this problem to some extent by suggesting information that is key in the user's cognitive model of the system. However, even with the model, we are left with a huge number of features to potentially explain.

What makes an explanation interface successful will vary from situation to situation. In some cases, a successful explanation is one that successfully convinces you to purchase a recommended product. In other cases, a successful explanation interface is one that helps you to identify predictions that have weak justification. In all cases, a successful explanation interface will be one that users perceive to be useful or interesting, and will continue to use. To explore this issue, we have performed an experiment that measures how users of an ACF system respond to different explanations, each derived from a different component of the explanation models described in the previous section.

5.6.1 Design

The study was performed as a survey; test subjects were volunteer users of the MovieLens web-based movie recommender. MovieLens uses ACF technology to produce personalized recommendations for movies and videos. The MovieLens database currently contains 4.6 million ratings from 74,000 users on 3500 movies, and the site averages approximately 1000 active users per week. Study participants were presented with the following hypothetical situation:

Imagine that you have $7 and a free evening coming up. You are considering going to the theater to see a movie, but only if there is a movie worth seeing. To determine if there is a movie worth seeing, you consult MovieLens for a personalized movie recommendation. MovieLens recommends one movie, and provides some justification.

Each user was then provided with 21 individual movie recommendations, each with a different explanation component, and asked to rate on a scale of 1 – 7 how likely they would be to go and see the movie. An example of one explanation interface is shown in Figure 5-1.

Figure 5-1. One of the twenty-one different explanation interfaces shown in the user survey. Notice that the title has been encoded so that it does not influence a user's decision to try a movie.

To ensure that the response to each stimulus could be compared fairly, the 21 different explanation interfaces all describe the same movie recommendation. The

explanations are based on data from an observed recommendation on MovieLens.5 The recommendation chosen for the survey was one that we, as experienced experts with the system, recognized as having good justification. That is to say, had we been presented with the explanation data, we would have believed that its probability of being correct was high.

The study was organized as a randomized block design, with the blocks being users and the treatment being the different explanation interfaces. The 21 different interfaces were presented in a random order for each user to account for learning effects. The survey was presented to 78 users of the MovieLens site. A list of the different explanation interfaces provided is shown in Table 5-1, along with the accompanying results. Figures of every explanation interface shown can be found in Appendix I.

5.6.2 Results

Table 5-1 is ordered from the best performing explanation at the top to the worst performing explanation at the bottom. The first column indicates the rank of the explanation, which we will use to identify each row. The second column contains a simple textual description of the explanation – for a picture of the corresponding explanation interface, see Appendix I. The third column (N) indicates the number of user responses for that explanation. The number of times each question was answered differed slightly because when users clicked on the reload or refresh button of their browser, no result was recorded and the survey moved to the next question. Mean response is the average rating given to that interface on a scale of one to seven. Explanations 11 and 12 are the base case. They represent no additional explanation data, beyond the simple knowledge of the white-box cognitive model. Therefore, explanations 13 and greater can be seen as negatively contributing to the acceptance of the recommendation. Rows with shaded backgrounds indicate explanations that had a statistically significant difference in mean response compared to explanation 11.

5 The explanation interfaces were based on the MovieLens recommendation of "Titanic" for the author.

#   Explanation                                   N    Mean Response   Std Dev
1   Histogram with grouping                       76   5.25            1.29
2   Past performance                              77   5.19            1.16
3   Neighbor ratings histogram                    78   5.09            1.22
4   Table of neighbors ratings                    78   4.97            1.29
5   Similarity to other movies rated              77   4.97            1.50
6   Favorite actor or actress                     76   4.92            1.73
7   MovieLens percent confidence in prediction    77   4.71            1.02
8   Won awards                                    76   4.67            1.49
9   Detailed process description                  77   4.64            1.40
10  # neighbors                                   75   4.60            1.29
11  No extra data – focus on system               75   4.53            1.20
12  No extra data – focus on users                78   4.51            1.35
13  MovieLens confidence in prediction            77   4.51            1.20
14  Good profile                                  77   4.45            1.53
15  Overall percent rated 4+                      75   4.37            1.26
16  Complex graph: count, ratings, similarity     74   4.36            1.47
17  Recommended by movie critics                  76   4.21            1.47
18  Rating and %agreement of closest neighbor     77   4.21            1.20
19  # neighbors with std. deviation               78   4.19            1.45
20  # neighbors with avg correlation              76   4.08            1.46
21  Overall average rating                        77   3.94            1.22

Table 5-1. Mean response of users to each explanation interface, based on a scale of one to seven. Explanations 11 and 12 represent the base case of no additional information. Shaded rows indicate explanations with a mean response significantly different from the base cases (two-tailed α = 0.05).

5.6.3 Analysis

First, it is important to recognize the big winners: histograms of the neighbors' ratings, past performance, similarity to other items in the user's profile, and favorite actor or actress. There were three different rating histograms. The best performing histogram (explanation 1) is shown in Figure 5-2 (the entire set of explanation interfaces is shown in Appendix I). Explanation 3 was a standard bar chart histogram, with one bar for each category of rating (1-5). Explanation 4 presented the same data as Explanation 3, but in numerical tabular format instead of a bar chart (see Figure 5-1). Explanation 1 performed

better than a basic bar chart because it reduced the dimensionality of the data to a point where only one binary comparison is necessary (the good versus the bad). The hypothesis that simple graphs are more compelling is supported by observing the poor performance of Explanation 16, which presents a superset of the data shown in histograms, adding information about how close each neighbor is to the user.

[Figure 5-2 here: a bar chart titled "Your Neighbors' Ratings for this Movie"; y-axis: Number of Neighbors (0-25); x-axis: Rating, grouped into "1's and 2's," "3's," and "4's and 5's," with bar labels 7, 3, and 23.]

Figure 5-2. A histogram of neighbors' ratings for the recommended item, with the "good" ratings clustered together and the "bad" ratings clustered together, and the ambivalent ratings separated out. The result is that the user has to do only a single binary visual comparison to understand the consequence of the explanation. This was the best performing explanation.
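For illustration, here is a minimal sketch of the grouping idea behind Explanation 1: neighbors' 1-5 star ratings are collapsed into "bad" (1's and 2's), ambivalent (3's), and "good" (4's and 5's) bins before charting. The sample ratings are hypothetical.

    from collections import Counter

    def group_ratings(neighbor_ratings):
        """Collapse 1-5 star ratings into the three bins used by the
        grouped-histogram explanation (sketch of the idea, not the UI code)."""
        bins = Counter({"1's and 2's": 0, "3's": 0, "4's and 5's": 0})
        for r in neighbor_ratings:
            if r <= 2:
                bins["1's and 2's"] += 1
            elif r == 3:
                bins["3's"] += 1
            else:
                bins["4's and 5's"] += 1
        return bins

    # Hypothetical neighborhood with mostly positive ratings.
    print(group_ratings([5, 4, 5, 4, 2, 3, 1, 5, 4, 4]))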

Stating positive past performance of the ACF system was just as compelling as demonstrating the ratings evidence behind the recommender. The exact explanation was “MovieLens has predicted correctly for you 80% of the time in the past.”6 This highlights the fact that in many cases, if a recommendation source is accurate enough, we may not really care how the recommendation was formed. While this form of explanation is useful for setting the context, it is not valuable for distinguishing between different recommendations from the same system. However, recommendation-specific explanation

6 It is important to note that this is the only explanation that is not based on actual data. We extrapolated this number from our experience observing the accuracy of our prediction algorithm.

information could be introduced by providing explanations such as "MovieLens has been 80% accurate on movies similar to this one."

Explanation 5 – movie similarity – was "This movie is similar to 4 other movies that you rated 4 stars or higher." This kind of explanation can be generated either by using content analysis or by identifying movies that have correlated rating patterns. The success of explanation 6 – favorite actor/actress – shows that domain-specific content features are an important element of any explanation. However, notice the unusually high variance. Clearly there is a division between those who evaluate movies based on actors/actresses and those who don't.

Some explanations (18–21) had significantly lower mean response than the base cases. Poorly designed explanations can actually decrease the effectiveness of a recommender system. This stresses the importance of good design of explanation interfaces. One of the key parameters in ACF systems is the similarity between the user and the neighbors. It is often the case that some of the neighbors chosen do not really share that much in common, so indicating the similarity can be important. Yet explaining similarity is tricky, since the statistical similarity metrics that have been demonstrated to be the most accurate, such as correlation [36], are hard for the average user to understand. For example, in the recommendation explained in this study, the average correlation was 0.4, which we recognize from experience as being very strong for movie rating data. However, users are not aware that correlations greater than 0.4 are rare; they perceive 0.4 to be less than half on the scale of 0 to 1. This highlights the need to recode the similarity metric into a scale that is perceptually balanced. In this specific case, we might recode the correlations into three classes: good, average, and weak.

It is interesting to note that external "official" rating services such as awards or critics did not fare particularly well (explanations 8 and 17). This indicates that users believe personalized recommendations to be more accurate than ratings from critics, a fact that has been shown by previous work[2].

Prior to the main study, we performed a small pilot study where we had the opportunity to interview participants after they took the survey. From these interviews,

we learned that many users perceived each "recommendation" as having been generated using a different model – which was then explained. Each explanation was changing the user's internal conceptual model of how the recommender computed predictions. In the primary study, we attempted to control for this effect by clearly stating to study participants up front that the model was going to be the same in each case.

5.7 Experiment 2 – Acceptance and Filtering Performance

In the previous section, we addressed the first research question. In this section, we present an experiment that addresses the remaining two research questions: (2) can explanations improve acceptance of automated collaborative filtering (ACF) systems, and (3) can explanations improve the filtering performance of users?

5.7.1 Hypotheses

The goal of this experiment was to test two central hypotheses related to the research questions:

Hypothesis 1: Adding explanation interfaces to an ACF system will improve the acceptance of that system among users.

Hypothesis 2: Adding explanation interfaces to an ACF system will improve the performance of filtering decisions made by users of the ACF system.

5.7.2 Design

The experimental subjects were users of the MovieLens web-based ACF movie recommender. A link inviting users to try experimental interfaces was placed on the front page, and users volunteered to participate in an approximately month-long study. Experimental subjects were assigned randomly, either to a control group or to a group that was presented with a new experimental explanation interface. Members of control groups either saw the standard MovieLens interface or saw the standard interface with aesthetic changes to encourage them to believe they were seeing a significantly different system. Figures 5-3 and 5-4 depict two of the explanation interfaces shown. Figure 5-3 presents a simple discrete confidence metric, while Figure 5-4 presents graphically the

distribution of ratings for a movie within the user's neighborhood, based on the similarity of each neighbor.

Figure 5-3. A simple confidence value as an explanation for a movie recommendation.

Each experimental subject was given a survey upon entering and leaving the experiment regarding their impressions of the MovieLens site. These surveys were used to assess how explanation interfaces might affect the acceptance of ACF systems. Each experimental subject was asked to use MovieLens in the manner they normally would for recommendations, but to return to MovieLens whenever they saw a new movie and fill out a mini-survey with the following questions:

• Which movie did you see?
• Did you go because you thought you would enjoy the movie or did you go for other reasons (such as other viewers)?
• Did you consult MovieLens before going?
• If you consulted MovieLens, what did MovieLens predict?
• How much did MovieLens influence your decision?
• Was the movie worth seeing?
• What would you now rate the movie?


Figure 5-4. A screen explaining the recommendation for the movie “The Sixth Sense.” Each bar represents a rating of a neighbor. Upwardly trending bars are positive ratings, while downward trending ones are negative. The x-axis represents similarity to the user.

5.7.3 Results

210 users participated in this study, filling out 743 mini-surveys. In 315 of those mini-surveys, users consulted MovieLens before seeing the movie. In 257 of those mini-surveys, MovieLens had some effect on the user's decision to see the movie. In 213 (83%) of the cases where MovieLens had an effect on the decision, the MovieLens recommendation was not the sole reason for choosing a movie. Figure 5-5 shows the filtering performance of each experimental group. There was no statistically significant difference between any two experimental groups (based on a one-way ANOVA with α = 0.05).

[Figure 5-5 here: bar chart titled "Percent Correct Decisions – Consulted MovieLens with Effect"; y-axis: Percent Correct (approximately 0.7–1.0); x-axis: GROUP, one bar per experimental group, with the two control groups first.]

Figure 5-5. Percentage of correct movie decisions users made while using different versions of the explanation system. The first two bars represent control groups where no explanation interface was seen. The data was of extremely high variance, with none of the differences being statistically significant.
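For reference, the comparison reported above is the kind of result a one-way ANOVA over per-group decision accuracies produces. Below is a minimal sketch using SciPy; the group names and numbers are invented for illustration and are not the study's data.

    from scipy import stats

    # Hypothetical per-user proportions of correct movie decisions in three groups.
    control_1  = [0.80, 0.90, 0.75, 1.00, 0.85, 0.70]
    control_2  = [0.82, 0.88, 0.79, 0.95, 0.83, 0.76]
    confidence = [0.85, 0.92, 0.81, 0.90, 0.88, 0.78]

    f_stat, p_value = stats.f_oneway(control_1, control_2, confidence)
    print(f"F = {f_stat:.2f}, p = {p_value:.3f}")  # p > 0.05 -> no significant difference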

In exit surveys given at the end of the study, users in non-control groups were asked if they would like to see the explanation interface they had experienced added to the main MovieLens interface. 97 experimental subjects filled out the exit survey. 86% of these users said that they would like to see their explanation interface added to the system.

As part of the exit surveys, users were given the opportunity to provide qualitative feedback in the form of text comments. They were asked what they liked most about the explanation interfaces, what they liked least about the explanation interfaces, and given the chance to provide suggestions and comments. The qualitative feedback from all those who responded (60 users) was almost entirely positive. Comments ranged from

"It made sense of what seemed to be somewhat senseless numerical ratings."

to

"I could see the variety of responses to a film which corresponds to what I do with my friends. It helps me see how strongly they felt and the power or range of that diversity which always helps me be prepared for a film which evokes powerful response."

Some users were particularly excited by the ability to view the ratings of their neighbors. The viewable ratings profiles gave some more substance and reality to the

previously invisible "neighbors." Several users asked for features to explore their interests with neighbors further (e.g., "just show me the movies we agreed on"), while others wanted to meet and converse with their neighbors. Another user wanted to be able to bookmark certain users, so she could return and see what movies they were going to see.

The majority of the negative comments stemmed from the question "What did you like least about the explanation interfaces?" and were related to inadequacies in the prediction algorithm, not in the explanation interface. By using the explanation interfaces, users discovered many predictions that were based on small amounts of data or on neighbors who weren't that similar. It was this that they complained about, not the quality of the explanation interfaces.

5.7.4 Analysis

The overwhelming request to see the explain feature added to the system and the supporting positive remarks from the text comments indicate that users see explanation as a valuable component of an ACF system. The experimental subjects successfully used the system to identify predictions based on small amounts of data or on neighbors that weren't that similar.

The filtering performance measurements performed during this study were inconclusive. The results were confounded primarily by a lack of good data. Most of the filtering decisions reported by the study participants were made without consulting MovieLens first, even in the groups that received explanations. There was also a considerably large amount of uncontrolled variance, especially between users. A more controlled study would hopefully reveal more clearly the effect of explanations on decision performance.

One of the key components of the explanations that we built was the graph of neighbors' ratings shown in Figure 5-4. We believed this graph to be exceptionally effective at conveying a large amount of data about the prediction in a small amount of space. Most "experts" to whom we demonstrated this graph were impressed by its utility. However, through the process of performing this experiment and experiment 1, we have learned that normal MovieLens users do not find this graph a compelling explanation.

Our hypothesis is that while expert users prefer this graph, it is too complex for ordinary users. The confusion factor related to this graph may have affected people's decisions not to use the explanation facilities more frequently in experiment 2. However, experiment 1 has demonstrated what the most effective explanation components are, which will be useful in designing the next generation of explanation interface.

5.8 Summary

Explanations have shown themselves to be very successful in previous work with expert systems. From this knowledge, it seems intuitive that they will prove to be successful in interfaces to automated collaborative filtering systems. The challenges will be to extract meaningful explanations from computational models that are more ad hoc than rule-based expert systems, and to provide a usable interface to the explanations. Hopefully, the result will be filtering systems that are more accepted, more effective, more understandable, and which give greater control to the user.

In this chapter, we have explored the utility of explanations in automated collaborative filtering (ACF) systems. We have explored theoretically and experimentally three key research questions related to explanations in ACF systems.

What models and techniques are effective in supporting explanation in an ACF system? ACF systems are built around complex mathematical models. Knowing exactly what and how to explain is not straightforward. We have presented an approach that develops the key process and data components of an explanation based on the user's cognitive model of the system. Furthermore, we have performed an experiment to identify how compelling each of the identified explanation components is to the user. Rating histograms seem to be the most compelling ways to explain the data behind a prediction. In addition, indications of past performance; comparisons to similar rated items; and domain-specific content features, such as the actors and actresses in a movie, are also compelling ways to justify a high recommendation.

Can explanation facilities increase the acceptance of automated collaborative filtering systems? We hypothesized that adding explanation interfaces to ACF systems would increase their acceptance as filtering systems. Through an experiment with 210 users of the MovieLens web-based movie recommender, we have demonstrated that most

users value the explanations and would like to see them added to their ACF system (86% of survey respondents). These sentiments were validated by qualitative textual comments given by survey respondents.

Can explanation facilities increase the filtering performance of ACF system users? We began an initial investigation into measuring the filtering performance of users both with and without the explanation interface. We believe that explanations can increase filtering performance. Unfortunately, due to many factors, we were unable to prove or disprove our hypothesis. Users perform filtering based on many different channels of input, and attempting to isolate the effect of one filtering or decision aid requires well-controlled studies, which are hard to perform through a web site with users that you never meet.

5.9 Experimental Notes

The two experiments described in this chapter were actually performed in the reverse of the order in which they are described. We have exchanged the order to communicate our ideas more effectively.


Chapter 6: Addressing Ephemeral Information Needs

Current automated collaborative filtering (ACF) technologies provide little or no support for focused ephemeral information needs. By "focused," we mean that users have a specific idea of the type of information they are interested in. Ephemeral information needs are those information needs that are immediate and often temporary. Current ACF systems only support long-term, unfocused interest needs. In this chapter we present an approach to supporting focused ephemeral information needs in ACF systems that requires no additional data beyond the existing ratings data and uses a theme-based search interface.

Current ACF systems aggregate all the ratings the user has ever presented to the system into a single interest profile and produce recommendations that are the best match for the entire profile. This profile represents all of a user's interests within a data stream. The items that best match all of the user's interests are recommended. No attempt is made to customize the recommendations for the user's current context of need. Even if the user creates a ratings profile representing only immediate information needs, recommendations may still not be focused towards the user's immediate information needs. This happens because the user is correlated with other people who have additional interests beyond the interests that the user is focused on. For example, consider a user who wants to find a "witty romance movie." He rates a bunch of witty romances highly and the ACF system matches him with other people who also like witty romances (they become his neighbors). However, these neighbors have other movie interests besides just witty romances. Therefore, the user may still end up getting recommendations for movies that have nothing to do with witty romances.

One approach to providing support for focused ephemeral information needs is to incorporate additional forms of data into the recommendation engine. For example, content descriptions together with a content-search engine can be used to support focused ephemeral information needs. However, since content descriptions can vary greatly between different forms of media, approaches using content information are not generally applicable to many different content domains. For example, content search engines for text are very different from content search engines for movies, music, and art. In this chapter, we examine an approach that provides focused ephemeral recommendations

independent of the underlying content being recommended. As a result, this approach can be applied very generally to any domain where the items are ratable (and thus support ACF).

6.1 Approach

Our approach leverages the ratings of an existing ACF system to provide focused ephemeral recommendations. We have investigated this approach with an implementation of a movie recommender. We will describe the approach both generally and within the context of our movie recommender implementation.

[Figure 6-1 here: architecture diagram. A long-term user profile (ratings) feeds (1) a specification step (taken from the user or the user context) that produces an ephemeral interest profile; (2) an item-item matching algorithm matches that profile against the items to produce a focused item set; (3) an ACF prediction engine, using the long-term profile, orders the focused item set by predicted rating.]

Figure 6-1. A depiction of the architecture of the proposed ephemeral recommendation system.

Figure 6-1 depicts the architecture of our approach. The recommendation process begins in the specification step (1), where an ephemeral profile is built through specification of the elements of the long-term profile that relate to the ephemeral

information need. In our movie recommender, the user selects movies from her long-term profile that match a desired "theme." Themes are classes of movies that share some common element. The goal of the ephemeral recommender system is to provide recommendations for other movies that have the same common elements as the specified movies—i.e., match the theme. Our movie recommender has users create ephemeral profiles explicitly, but such information could also be taken implicitly from the user's context. For example, if we were tracking a user browsing a web page, we could use the most recently visited pages as an ephemeral profile representing the current user's need. In our movie recommender, the ephemeral profile is represented simply by an unweighted list of positive examples. This is the simplest approach. Other approaches could apply a different weight to each item or include negative examples.

In step 2 of the recommendation process (Figure 6-1), an item-item matching algorithm identifies the items that have the highest potential of being similar to the specified ephemeral profile. We call the output of this algorithm the "focused item set." For our movie recommender, we have designed a prototype matching algorithm based on item-to-item correlations. The algorithm is described in Sections 6.2 and 6.3.

Step 3 of the recommendation process (Figure 6-1) orders the items in the focused item set based on the predictions from a traditional ACF recommender. Thus, for each item in the focused item set, we compute a personalized predicted rating based on the user's long-term profile. The item set is then sorted by predicted rating. This sorted item set is the final output of the process. The final ranked list represents movies that have strong similarity to the ephemeral profile, and the top ranked items will be movies that the long-term ACF system predicts the user will like.

6.2 Item Correlation

In automated collaborative filtering (ACF), every member of a community provides ratings indicating preference for or against items. This data can be represented as a matrix, with each row representing a user and each item representing a column. Normally, ACF determines the similarity between users by correlating the rows, finding

rows of ratings that agree. However, to provide the focused queries, we compute correlations between the columns in this matrix. For example, consider Table 6-1. Star Wars and Return of the Jedi will be strongly correlated because they tend to receive the same rating from users. They were both rated high by Joe, and both rated mediocre by John and Al. On the other hand, Star Wars and Hoop Dreams are negatively correlated, since people who liked Star Wars appear to dislike Hoop Dreams and vice versa. For our approach, we consider only positive correlations.

          Star Wars   Hoop Dreams   Contact   Return of the Jedi
Joe       5           2             5         4
John      2           5             3         3
Al        2           2             –         2
Nathan    5           1             5         –

Table 6-1: A simple example of the collaborative filtering data space for movies, represented as a matrix, with users as the rows and items as the columns. Notice that Star Wars and Return of the Jedi are positively correlated while Star Wars and Hoop Dreams are negatively correlated.

What does a strong positive correlation between two movies A and B mean? It means that, in general, the higher the rating a user gives to A, the higher the rating he gave to B. Our hypothesis is that this correlation will capture similarities between movies. For example, if a single user gives the same rating to two movies, then there is some probability that those movies contain similarities. As we increase the number of users who give those two movies an identical or similar rating, the probability that those movies actually have something in common increases. Thus, by computing correlations over the entire user base, we feel that a high positive correlation could represent significant common elements.

6.3 Algorithm Description

We present a simple algorithm for the matching of user theme profiles to items that are potentially similar. We first compute the correlations between all pairs of items in an offline batch process and create data structures such that for each item in the database, we have a ranked list of the top correlates for that item. We then select those items that occur in the top-correlates list of each query item. The results are selected based on the maximum rank at which they occur in each query item's top-correlates list. The top N results are selected, where N is a configurable parameter. We chose 20 for our experiments. For an item to be selected, it must be in the top-correlates list for each query item. Items that occur close to the top of the rankings of all query items are more likely to be selected. An item that occurs close to the bottom of any of the lists will probably not be selected. The actual value of the correlation is not considered.

If the ratings database is sparse, i.e., each user only rates a small percentage of all movies, then certain pairs of movies will have very little overlap. That is to say, for two movies A and B, there may be only a small number of users who have rated both A and B. To address this, we use a concept commonly used in data mining – support. The support for a correlation is the number of data points used in the computation of that correlation. In other words, it is the number of users who have rated both A and B. In the algorithm presented here, we only consider correlations with a support above a specified threshold.
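The following is a minimal Python sketch of the two pieces described above: the offline item-to-item Pearson correlation (with its support) and the online matching step detailed in Section 6.3.1. The data layout – a dictionary of {user: {item: rating}} and per-item ranked correlate lists of (item, correlation, support) tuples – and all function names are assumptions made for illustration, not the thesis implementation.

    from math import sqrt

    def item_correlation(ratings, a, b):
        """Pearson correlation between the rating columns of items a and b,
        computed over users who rated both; also returns the support
        (number of co-rating users). Illustrative sketch only."""
        common = [(r[a], r[b]) for r in ratings.values() if a in r and b in r]
        support = len(common)
        if support < 2:
            return 0.0, support
        xs = [x for x, _ in common]
        ys = [y for _, y in common]
        mx, my = sum(xs) / support, sum(ys) / support
        cov = sum((x - mx) * (y - my) for x, y in common)
        vx = sum((x - mx) ** 2 for x in xs)
        vy = sum((y - my) ** 2 for y in ys)
        if vx == 0 or vy == 0:
            return 0.0, support
        return cov / sqrt(vx * vy), support

    def focused_item_set(query_items, top_correlates, support_thresh, n=20):
        """Walk down each query item's ranked top-correlates list in parallel
        and keep candidates that appear (with enough support) for every
        query item, until n results are found."""
        if not query_items:
            return []
        seen = {}
        results = []
        depth = max(len(top_correlates[q]) for q in query_items)
        for p in range(depth):
            for q in query_items:
                if p >= len(top_correlates[q]):
                    continue
                candidate, corr, support = top_correlates[q][p]
                if support <= support_thresh:
                    continue
                seen[candidate] = seen.get(candidate, 0) + 1
                if seen[candidate] == len(query_items):
                    results.append(candidate)
                    if len(results) >= n:
                        return results
        return results

In practice, as the chapter notes, the correlations would be computed in an offline batch process and only the ranked top-correlates lists consulted at query time, before the focused item set is re-ranked by the traditional ACF predictions.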

6.3.1 Algorithm Detailed Description

For each item in the database, we maintain the top P correlates.

N – the number of desired result items.
correlates_i[p] – the pth highest positive correlate for query item i.
support_i[p] – the number of data points (user ratings) in the computation of the pth highest positive correlate for query item i.
support_thresh – the minimal acceptable level of support for a correlation to be considered.

1. The user submits a list of M query items.
2. Compute the result set:

   p = 0
   num_found = 0
   while (p < P) and (num_found < N)
       for i = 1 to M
           if support_i[p] > support_thresh
               seen[correlates_i[p]]++
               if seen[correlates_i[p]] = M
                   output result item correlates_i[p]
                   num_found++
       p++

3. Get traditional ACF predictions for each result item, and order the result set by the predicted ratings.

6.4 Specific Research Goals

This work focuses on addressing four research goals with respect to focused ephemeral information needs in automated collaborative filtering (ACF) systems.

1. Evaluate the effectiveness of the proposed interest model and search algorithm. Are the results returned by the search algorithm relevant to the needs of the user? Does such a system allow users to discover new interesting items?

2. Investigate query-by-example "themed profiles" as an information need specification mechanism. Are users able to describe their information needs using examples? Are negative examples necessary?

3. Measure user reaction to this form of query interface. Do they find such a feature/interface valuable? Are they satisfied with an implementation of such an interface?

4. Identify classes of themes for which this query process is valuable. Are there certain themes for which this search algorithm and data model are useful?

6.5 Experiment Design

To reach the goals specified above, we performed a live experiment with users of the MovieLens web-based recommender [20]. Users of MovieLens rate movies on a 1-5 scale, and an ACF system produced personalized predictions for new movies by

matching them with other people of similar interests. The MovieLens database contains 5.3 million ratings, 81,000 users, and 3800 movies, with about 600-1000 active users per week. An invitation to join the experiment was placed on the MovieLens front page. After 90 people chose to join, the experiment was closed. Participants in the study were randomly assigned to three different groups, each of which was presented a different support threshold: low (30), middle (50), and high (75). Participants then had between three days and one week to use the experimental interface. At the end of the experiment, all the users were surveyed. The exact questions used in the survey are listed in Appendix II.

6.5.1 User Interface

The theme-based query interface was reached through a link from the main MovieLens site. The interface was called "MovieLens Matcher." The main screen of MovieLens Matcher is shown in Figure 6-2. Users create theme profiles by selecting movies from a list of movies that they have already rated. By restricting the users to select from movies they have already rated, we reduce the amount of data they have to browse through and ensure that the users are not overwhelmed with unfamiliar movies. The profile create/edit screen is shown in Figure 6-3. Finally, when a user selects a profile from the main screen and requests recommendations, they are presented with a list of potentially similar movies, ordered by predicted rating as shown in Figure 6-4. On this screen, users can rate movie results that they have already seen.


Figure 6-2: The main screen of the MovieLens Matcher experimental interface. Users select a theme profile that they have created and click on the GO! button to receive recommendations of movies that might be similar.


Figure 6-3: A screenshot of the theme profile create and edit screen. Users can select movies from a list of movies that they have rated on the left.


Figure 6-4: The search results screen for the theme “Great Spy Flicks,” which is entirely James Bond movies. Notice that the first result is another James Bond movie (possibly the only James Bond movie that was not already in the profile.) Furthermore, the system identifies “The Saint,” which has strong similarities to James Bond movies. It is interesting to note the appearance of “Dante’s Peak,” which is not a spy flick, but does star Pierce Brosnan, who also stars as James Bond.

6.6 Results

As mentioned earlier, 90 users joined the experiment. Of those, 73 users created at least one themed profile. On average, each user created 1.8 profiles, for a total of 134 profiles. The average size of each profile was 10.6 movies. After the experiment ended, all users were invited by email to fill out the survey and 52 users returned to the web site

to fill out a survey. These participation results are also shown in Table 6-2.

Opted in to the experiment         90
Created at least one profile       73
Average profiles/user              1.8 (134 total)
Average profile size               10.6
Users who responded to survey      52

Table 6-2: User participation in the focused ephemeral queries experiment.

RELEVANCE (REL)

– On a scale of 1 to 5, please rate how much you agree with the following statement: "MovieLens Matcher returned movies that were relevant to my selected profiles."

DISCOVERY (DISC)

- On a scale of 1 to 5, please rate how much you agree with the following statement: "I discovered interesting new movies using MovieLens Matcher."

VALUE

– On a scale of 1 to 5, please rate how much you agree with the following statement "I would find valuable a service that accurately recommends movies that are similar to ones I present."

EXAMPLES

– On a scale of 1 to 5, please rate how much you agree with the following statement "I was able to describe my immediate movie needs well by giving examples."

NEGATIVE_EXAMPLES (NEG_EXAM)

– On a scale of 1 to 5, please rate how much you agree with the following statement "I would like the ability to specify negative examples as part of my profile."

DEFINITION_STRENGTH (DEF_STR)

- On a scale of 1 to 5, would you describe the themes represented in your profiles as weakly defined or clearly defined, with 1 being weakly defined and 5 being clearly defined?

MOVIES_EXIST (EXISTS)

– On a scale of 1 to 5, how would you rank the probability (on average) that there actually exist movies that match the themes of your profiles?

SATISFACTION (SATIS)

– On a scale of 1 to 5, how satisfied were you with MovieLens Matcher?

ADDED (ADD)

– Would you like to see this feature added? (yes/no response)

GOOD_PROFILES

– Please describe each profile (one line per profile) for which you felt MovieLens Matcher was effective at locating potentially similar items.

BAD_PROFILES

– Please describe each profile (one line per profile) for which you felt MovieLens Matcher was not effective at locating potentially similar items.

Table 6-3. List of all survey questions presented to experimental subjects. In later tables, some of the descriptive tags have been shortened. The shortened tags are displayed in parentheses.

6.6.1 Results for the Population: Including all Control Groups

The questions asked are shown in Table 6-3. The mean responses from the quantitative survey questions are shown in Table 6-4. If we first consider the survey responses over all control groups (listed in the row "Total" in Table 6-4), we see four results:

Overall, users did not find the results from searches to be incredibly valuable. The response to the relevance of movies returned by the themed profiles (REL) was weak (3.05 out of 5), and the response to the discovery of new interesting items (DISC) was also weak (3.07).

Users agreed that the system as presented would be very valuable if it were accurate. Not only was the average response to VALUE high (4.32 out of 5), but the standard deviation was considerably lower than with other questions, indicating a stronger consensus.

Users felt somewhat positive about their ability to accurately represent their desired themes using examples, and desired the availability of negative examples. Furthermore, they believed that items actually existed that matched their profiles. An overall response of 3.45 was somewhat positive for EXAMPLES. However, responses of 4.35 for NEG_EXAM and 4.1 for EXISTS were very positive.

Users were weakly satisfied with the interface, and the overwhelming majority of them wanted the interface added permanently to MovieLens. A response of 3.24 for SATIS indicated weak satisfaction, but almost 87% of all users surveyed wanted MovieLens Matcher to become a permanent part of MovieLens.


GROUP                 REL     DISC    VALUE   EXAMPLES  NEG_EXAM  DEF_STR  EXISTS  SATIS   ADD
1.00   Mean           2.4000  2.4667  4.3333  3.2667    4.5714    2.7857   4.1333  2.6667  .8667
       N              15      15      15      15        14        14       15      15      15
       Std. Deviation .9103   .9904   .9759   1.2228    .9376     1.0509   .6399   1.0465  .3519
2.00   Mean           3.6364  3.4545  4.1818  3.6000    4.5455    3.7273   4.5455  3.4545  .9091
       N              11      11      11      10        11        11       11      11      11
       Std. Deviation 1.2863  1.2136  .8739   1.0750    .6876     1.2721   .5222   1.2136  .3015
3.00   Mean           3.2667  3.4000  4.4000  3.5333    4.0000    3.3333   3.7333  3.6667  .8667
       N              15      15      15      15        15        15       15      15      15
       Std. Deviation 1.2228  1.2421  .7368   .9904     1.0000    .8165    .9612   1.1127  .3519
Total  Mean           3.0488  3.0732  4.3171  3.4500    4.3500    3.2500   4.0976  3.2439  .8780
       N              41      41      41      40        40        40       41      41      41
       Std. Deviation 1.2237  1.2122  .8497   1.0849    .9213     1.0801   .8002   1.1786  .3313

Table 6-4: Mean responses for the quantitative survey questions. Except for ADD, all questions were on a scale of 1 to 5. Question ADD had a binary response. Group 1 had a support threshold of 30, group 2 a threshold of 50, and group 3 a threshold of 75. Users could choose not to answer any question, which results in different values of N for each question.

6.6.2 The Effect of Support Threshold: Between Groups Results

Each of the three experimental groups received results from an algorithm using a different support threshold. Nothing else was different between the groups, either in interface or in algorithm. When we examine the survey results by experimental group7, we see some very different results for the key questions. The mean results are shown in Table 6-4, with the results from statistical significance tests in Tables 6-5, 6-6, and 6-7.

Users who received recommendations from the mid-to-high support threshold groups found the results to be significantly more relevant than users in the low support threshold group. Users of group 2 had a reasonably positive response to relevance (3.64), while users of group 3 had a slightly positive response (3.27). Both were considerably higher responses than the mean response across all groups, due to the negative response of the low support threshold experimental group.

Users who received recommendations from the mid-to-high support threshold groups found significantly more new interesting movies. Similar to relevance, the responses were more positive than group 1 and the overall average.

7 I have avoided calling these "support groups". :-)

Users in the high support threshold group were significantly more satisfied with the interface than users in the low support threshold group. Again, this is the same pattern seen with REL and DISC.

            t        Sig. (2-tailed)
REL        -2.726    .014
DISC       -2.213    .039
VALUE       .416     .682
EXAMPLES   -.718     .480
NEG_EXAM    .080     .937
DEF_STR    -1.981    .062
EXISTS     -1.806    .084
SATIS      -1.732    .099
ADD        -.330     .744

Table 6-5: Statistical significance of survey answers between experimental groups 1 and 2.

            t        Sig. (2-tailed)
REL        -2.202    .037
DISC       -2.275    .031
VALUE      -.211     .834
EXAMPLES   -.656     .517
NEG_EXAM   1.588     .124
DEF_STR    -1.559    .132
EXISTS     1.342     .192
SATIS      -2.535    .017
ADD         .000     1.000

Table 6-6: Statistical significance of survey answers between experimental groups 1 and 3.

            t        Sig. (2-tailed)
REL         .739     .468
DISC        .112     .912
VALUE      -.671     .510
EXAMPLES    .157     .877
NEG_EXAM   1.647     .113
DEF_STR     .900     .381
EXISTS     2.763     .011
SATIS      -.456     .653
ADD         .330     .744

Table 6-7: Statistical significance of survey answers between experimental groups 2 and 3.
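For reference, the group-versus-group comparisons in Tables 6-5 through 6-7 are the kind of result an independent two-sample t-test produces. Below is a minimal sketch using SciPy; the responses are invented for illustration and are not the study's data.

    from scipy import stats

    # Hypothetical 1-5 responses to one survey question (e.g. REL) from two groups.
    group_1 = [2, 3, 2, 1, 3, 2, 4, 2, 3, 2]   # low support threshold
    group_2 = [4, 3, 5, 4, 3, 4, 2, 5, 4, 3]   # middle support threshold

    t_stat, p_value = stats.ttest_ind(group_1, group_2)
    print(f"t = {t_stat:.3f}, p = {p_value:.3f}")  # significant if p < 0.05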

6.6.3 Qualitative Survey Responses

In addition to the quantitative survey questions discussed above, users were also presented with two qualitative survey questions:

1. Please describe each profile (one line per profile) for which you felt MovieLens Matcher was effective at locating potentially similar items.

2. Please describe each profile (one line per profile) for which you felt MovieLens Matcher was not effective at locating potentially similar items.

There were 12 users surveyed who provided on-topic answers to the first question. Their answers are summarized in Table 6-8. Notably, three different users specified that profiles of small size tended to be more effective.

Themes for which the system was effective:

British costume dramas made in the 90s
Woody Allen films
Themes with small numbers of movies (x3)
Kids movies
"Light" entertainment movies
Action movies – minus sequels
Witty romances
Fun '80s nostalgia flicks
Intelligent romantic comedies
Romantic comedies
Horror comedies (?)
Raw comedy – usually vulgar
Films the critics panned but I loved

Table 6-8. Themes for which the system was successful at producing relevant recommendations. Three different users specified that themes with small numbers of items were more likely to be successful.

Eleven users provided on-topic answers to the second question. Their responses are shown in Table 6-9. Again, multiple responses indicated that large profiles produced inaccurate results.

Themes for which the system was NOT effective:

Cool British films of the '90s
When large number of movies in a theme (x2)
Original, unpredictable movies
The few comedies I like
Serious movies
Harrison Ford movies
What I call Science Fiction
Screwball comedies
Straight-up comedies
Westerns
Black comedies
Good action movies
Light comedies

Table 6-9. Themes for which the system was not successful at producing relevant recommendations.

6.7 Discussion

As described in Section 6.4, our research work had four goals. We will address our results in the context of those four goals.

6.7.1 Evaluate the effectiveness of the proposed interest model and search algorithm

We measured perceived performance by asking users how relevant results were (question REL) and how much the system helped them find interesting new movies (question DISC). The user-perceived performance of the proposed item-item correlation-based algorithm proved to be highly dependent on the value of the support-threshold parameter. A low support threshold (in our case 30) resulted in exceptionally low values of perceived item relevance and item discovery. For support thresholds of 45 and higher, users responded with slightly positive values of perceived item relevance and item discovery. While positive, the results for the best-performing experimental group were not exceptionally high.

Multiple users observed that query performance was reduced when theme profiles contained more than a small number of items. One user specifically observed that this number appeared to be around 4 items. Analysis of the algorithm indicates that poor performance for large profiles is an inherent limitation of the algorithm used. For a result item to be selected, it must occur in every query item's top-correlate list. As the number of items in the profile increases, the probability that there exists a result that occurs in every query item's top-correlate list decreases. We did not attempt to measure the optimal theme profile size, but if the optimal size is around four items, then poor performance is to be expected, since average profile sizes were measured to be larger than ten. Furthermore, in our experiment, we did not truncate the top-correlate lists – rather, each top-correlate list contained all the other movies in the database, ordered by correlation value. So, as the number of query items in a profile increases, so does the probability that one of the result items actually occurs very low in one or more of the query items' top-correlate lists. An alternative approach might be to truncate each item's top-correlate list at some fixed length, for example 100 items. Then query results would have to be among the "closest" 100 items to every query item. This alternative approach would probably provide much more precision, but result in lower recall.
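To make the preceding discussion concrete, the sketch below shows the query step just described: intersecting the query items' (optionally truncated) top-correlate lists and ranking the surviving candidates. The function name, data layout, and the use of average correlation as the ranking score are illustrative assumptions; the exact scoring and support-threshold handling are those defined earlier in this chapter.

    from typing import Dict, List, Optional, Tuple

    def theme_query(profile: List[str],
                    correlates: Dict[str, List[Tuple[str, float]]],
                    top_k: Optional[int] = None,
                    n_results: int = 10) -> List[Tuple[str, float]]:
        # correlates[item] is that item's correlate list (other items ordered
        # by decreasing correlation).  If top_k is given, each list is
        # truncated, trading recall for precision as discussed above.
        per_item = []
        for q in profile:
            ranked = correlates[q][:top_k] if top_k else correlates[q]
            per_item.append(dict(ranked))
        # A candidate must appear in every query item's (possibly truncated) list.
        candidates = set(per_item[0]) - set(profile)
        for scores in per_item[1:]:
            candidates &= set(scores)
        # Rank survivors; average correlation to the query items is an
        # illustrative choice of score.
        scored = [(m, sum(s[m] for s in per_item) / len(per_item))
                  for m in candidates]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:n_results]

With untruncated lists (top_k = None), every non-query item survives the intersection, which matches the behavior of the experimental system described above; truncation is the precision/recall trade-off suggested as an alternative.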

6.7.2 Investigate query-by-example "themes" as a query mechanism

Themes appear to be an excellent mechanism for the specification of focused ephemeral queries within an automated collaborative filtering (ACF) environment. The themes interface allowed users to describe potentially complex information needs without the use of a bulky query language. Furthermore, the theme query approach is applicable to any domain where ACF is in use.

Users had few problems comprehending and using the theme-based query system. Most participants (81%) in the experiment actively created multiple profiles, in spite of minimal documentation and no human contact for explanation or clarification. In survey responses, they indicated positively that they were able to describe their desired themes using examples (mean EXAMPLES = 3.45). The lack of strength in the EXAMPLES response could be attributed to the fact that users wanted to be able to provide negative examples (mean NEG_EXAM = 4.35), and could also be an effect of low-relevance results (the user may believe that he isn't getting relevant results because he can't describe his information need well enough using examples).

The themes created by users were personalized and expressive. They were personal in that they identified themes that were personally important ("my kind of Sci Fi"), and likely to be unique or shared only by a subset of other users. They identified attributes of movies that were complex and expressive, beyond the obvious attributes of movies (genre, actors, director, etc.).

In addition to allowing focused ephemeral recommendations in a ratings-only space, themed profiles allow users to specify queries for concepts that would be exceptionally hard to capture in a content-filtering space. Table 6-8 lists concepts such as "witty romances" and "British costume dramas of the '90s." It would be exceptionally hard to come up with a content-based filter or search engine that could locate movies based on descriptions such as those.

Finally, it is important to note that users wanted to specify profiles that on average contained approximately 10 movies. This may be useful information when tuning future algorithms. For the purposes of evaluating the algorithm presented in this chapter, the profiles that users were specifying were larger than the zone for which the algorithm appeared to have optimal performance (observed by one user to be fewer than four movies). This undoubtedly had a negative effect on the perceived performance.

6.7.3 Measure user reaction to the proposed interface

User reaction to the interface, independent of the algorithm, was strongly positive. Users stated clearly that they would find such an interface valuable if it were to present accurate results (VALUE = 4.3). Furthermore, this response did not change significantly between different levels of support, indicating that the response was not influenced by the performance of this particular algorithm, even when that performance was abysmal (group 1). Such strong results reinforce the strength of the themed-profile query approach.

The reaction to this particular implementation of the themed query interface was mixed, with results dependent on the experimental group. Users in the experimental group with the low support threshold clearly stated they were not satisfied with the implementation of the system. Users in the other two experimental groups responded with a slightly positive slant, although not an impressive one. Clearly, the algorithm needs further improvement to gain full acceptance.

6.7.4 Identify classes of themes for which this technique is effective (using the proposed algorithm)

Identifying the successful classes of themes is tricky with such a small set of examples. However, we can combine our knowledge of the process with the examples to identify characteristics of successful themes. We know that successful themes generally have small numbers of items – possibly fewer than four. Other than size, what characteristics make a successful theme?

Successful themes appear to have two elements. First, a large number of people agree about which movies fit the theme. Second, a large number of people either like all movies in the theme or dislike all movies in the theme. Themes that capture those characteristics will most likely be successful with the algorithm presented in this chapter. For example, the "Woody Allen" profile works because his movies are so distinctive and his mark is so recognizable. Most people recognize Woody Allen movies, and their style is relatively consistent. Furthermore, most people either hate all Woody Allen movies or like them all. Profiles representing the works of less well-known or more diverse directors would probably not be so successful.

In contrast, consider the theme "cool British films." First of all, there is probably not a common shared perception of which movies are "cool." Second, there are also not large numbers of users who would rate a movie highly primarily because it was a "cool British film." Therefore, the theme was unsuccessful.

6.8 Summary

This chapter presents an early exploration of providing focused ephemeral queries within an automated collaborative filtering (ACF) environment.

We introduce an effective new ACF query interface based on themes. The query-by-theme interface removes the need to specify a query language. Specifying examples is an approach that can be extended to almost any content, making a single implementation effective across many content domains. Furthermore, the theme approach does not require users to translate their internal representation of the information they need into a system query language, reducing the cognitive load on the user. Rather, all they need to do is identify representative items.

The results from the experiments clearly show that the theme-based query interface is easy to understand, easy to use, and provides functionality to which users would like to have access. Examination of the themes created confirms that users are creating queries for concepts that would be exceptionally hard to handle with content-based search engines (especially for non-text content). Furthermore, users showed a positive reaction to the interface even in cases where the performance of the underlying search algorithm was poor. They appeared to recognize the potential value of the system in those situations.

Themed queries will only be effective when the user is searching for content that is similar to content she has previously encountered, and for which she is thus able to provide examples. Thus, theme queries will be less effective for knowledge-retrieval situations where the user is seeking a piece of knowledge rather than an item or document (such as question-answering systems). However, it has been shown that relevance feedback, or query modification by example, can improve the accuracy of text search engines. One could theoretically build a system that provides an initial content-based search, followed by ACF-supported query-by-example relevance feedback.

The item-item matching algorithm has strengths and weaknesses. The primary strength of the item-item matching algorithm is that it requires no additional data, either content or ratings, given an existing ACF rating database. The algorithm is also good at matching characteristics that are commonly recognized by the entire population of users. In addition, the algorithm's online run time is exceptionally fast. On the downside, the item-item matching algorithm is not good at matching concepts that are personal, or not commonly recognized by the entire population. Unless the users are extremely knowledgeable, it would be hard to instruct them on what kinds of themes will work and what kinds won't. For the proposed user interface, the current algorithm will probably not be sufficient without modifications, or support from other processes.

Clearly, there is more work to be done in examining content-focused queries in ACF systems. However, this basic item-item correlation algorithm could be useful in bootstrapping other approaches. For example, one potential approach to content-focused queries in ACF is to have users rate content relationships between items. Users could explicitly specify that two movies contain similar content. These similarity ratings could then be used to create recommendations. In such an environment, the item-item correlation algorithm presented in this chapter could be used as a bootstrap algorithm, providing recommendations until sufficient similarity ratings have been collected.

Finally, it is important to note that the scenario evaluated in this chapter represents an extreme. Users are explicitly specifying a need, and can directly compare the results to their initial specification. In such a setting, the results must be exceptionally correct to produce user satisfaction. The ephemeral recommendation process described here could also be used in scenarios where the user has not explicitly specified an ephemeral profile. For example, consider providing the content layout of a web page based on ephemeral recommendations. The user's ephemeral profile could be created implicitly by observing the user's immediate browsing history. The ephemeral recommendation algorithm would identify items that are likely to match what the user is searching for during this browsing session. In the case where the matching algorithm is not successful at identifying similar items, the recommended items are still predicted to appeal highly to the user's long-term interests. In the other case, we have produced a recommendation that is focused on the user's immediate needs. In either case, the user is likely to be pleasantly surprised.


Chapter 7: Software and Data Artifacts

In order to accomplish the research presented in chapters one through six, it was necessary to create numerous software and data artifacts. Several of these artifacts have the potential to provide significant continuing value to the GroupLens research group at the University of Minnesota and to other members of the computer science research community. In this chapter, I provide descriptions of the significant artifacts. They represent contributions of the thesis work.

7.1 Software Artifacts

Software systems are tools that enable research. Yet it is often the case in computer science research that the availability of software artifacts greatly affects the direction and outcomes of research. Because building stable software systems can take significant effort, researchers will often choose research that fits the tools instead of building tools to fit the research. Thus, the creation of easy-to-use, reusable research software artifacts can greatly benefit the community. I will describe several such systems for which I was a key developer during my thesis work.

7.1.1 Usenet GroupLens Client Library

The original Usenet GroupLens trial provided automated collaborative filtering for users of Usenet all across the world. I was a member of the six-member team in 1995 that designed and developed the software necessary to support GroupLens. I designed and developed a client library in C that authors of Usenet browsers could integrate into their systems to provide support for GroupLens. The XRN and tin newsreaders were then modified to support GroupLens using my client library. In 1996, the University of Minnesota licensed the code for the library (and all other GroupLens code) to the startup company Net Perceptions. Net Perceptions is now a very successful company selling ACF-based recommendation engines. The first worldwide Usenet trial resulted in the publication of three conference papers [68] and one journal paper [47].

7.1.2 MovieLens

In 1997, the GroupLens research project built a web-based ACF system for recommending movies. The initial prototype was built by a master's student, Brent Dahlen, who then graduated. I then developed the production site that is still active today. From 1997 through 1999, I rewrote and updated almost every part of the code to improve the stability, interface, and performance of the service in order to attract more users and to collect data for our experiments. In 2000, under my guidance, an undergraduate student, Tony Lam, designed an experimental framework that makes it extremely easy to plug new experiments into MovieLens, with little or no modification of source code. Researchers can now easily set up experiments where different groups of users see different user interfaces, and their usage is tracked separately.

The MovieLens system has enabled and continues to enable large numbers of successful research projects in ACF by all members of the research group. Work with the MovieLens system has resulted in the publication of four papers in highly selective conferences [37,78].

7.1.3 DBLens ACF Environment

With the licensing of the GroupLens recommendation engine to Net Perceptions, we no longer had an engine which we could modify to experiment with new recommendation algorithms. I designed the DBLens ACF Environment to allow us to easily build and test different forms of recommendation algorithms. The DBLens environment began with a prototype ACF system developed by Hannu Huhdanpa, a master's student. From the prototype, I developed an extremely configurable framework for computation and evaluation of collaborative filtering algorithms. The DBLens system supports dynamic configuration of every stage of the ACF algorithm, allowing users to easily test the effects of tweaking any parameter. There is an extensive set of support scripts and tools, providing data import/export, experiment automation, and analysis of prediction results.

The DBLens ACF environment will be released to the public by the end of this year. At the time of writing this document, there are no free ACF systems with publicly available source code. The release of DBLens will allow other researchers to quickly get started performing analyses of ACF algorithms without having to create a software system to execute the algorithms. The DBLens ACF environment has been used by four other members of the GroupLens research group, on projects independent of the author's. Results collected through the use of the DBLens environment have been the core of two published research papers in highly selective conferences [30,36].

7.2 Data Artifacts

Publicly available datasets are an enormous resource to the scientific community. Data collection can be an expensive and time-consuming process, so the release of valuable data sets to the public can result in significant research. The data sets that have resulted from projects of which I was a key member have enabled a collection of publications within the group, and some of them will hopefully enable much more research when the datasets are released to the public before the end of the year.

7.2.1 GroupLens Usenet Data

From the third Usenet trial, which I was responsible for managing, we collected a large set of recorded predictions and ratings, together with the full text of the articles referenced. This collection of data allowed us not only to test new algorithms, but also to integrate algorithmic components that considered the textual content of the items being recommended. This led to the work with filterbots, which was published in 1998 [78]. The results from the filterbot paper were obtained through analysis of the Usenet data.

7.2.2 MovieLens Data

The MovieLens site continues to be used by approximately 900 people every week. Several occurrences of high-visibility publicity (such as the Wall Street Journal, the New Yorker, and ABC Nightline) resulted in short spikes of significantly more users. Together, all those users provide extensive traces of usage. Most importantly, they provide ratings, which we can then use to analyze the performance of new algorithms. However, the usage traces also allow us to examine in significant detail how users interact with ACF-based recommender systems.

Ratings data from the MovieLens system have been responsible for several publications in high-quality conferences by members of the research group [30,36]. The first set of ratings data has just been released to the public. The set of 100,000 ratings from 943 users is the same data set used for the empirical analysis in chapter 4. Releasing this data set to the public will enable further research by other members of the computer science research community, as well as provide them the opportunity to compare their results with the results that we publish.

7.3 Summary

Considerable effort was expended in the creation of the above artifacts. All of the mentioned artifacts have already shown themselves to be vital components of further research by other members of the research group. As the artifacts are released to the public outside the university, it is believed that they will continue to support further research. Therefore, I submit these artifacts as significant contributions of my thesis work.


Chapter 8: Conclusion

Large distributed networks such as the Internet connect millions of users whose experiences, when collected together, hold massive quantities of knowledge and represent an incredible diversity of tastes. Automated collaborative filtering (ACF) systems provide individual users with the ability to tap into that massive collection of experience and knowledge and use it to filter information or make decisions. The work presented in this dissertation focuses on making automated collaborative filtering systems more understood, more effective, and more usable. Three different challenges are addressed, one focused on improving understanding of current methods and two focused on exploring potential new functionality of ACF systems.

8.1 Answers to Challenges Addressed

ACF systems have proven remarkably successful in targeted entertainment domains (movies, music, books, etc.), but have not yet achieved strong acceptance in other domains. This dissertation works toward broader understanding and acceptance of ACF by addressing three challenges.

The first challenge was "How can we improve the predictive accuracy of automated collaborative filtering algorithms?" The development of more accurate algorithms is an iterative and collaborative process involving the entire research community. Standardized evaluation procedures, data sets, and evaluation metrics focus the efforts of the research community and increase the value of each individual contribution by providing comparability with work by other researchers. To support this, we analyze both theoretically and empirically the different metrics that have been used for evaluation of ACF algorithms. We show that all reviewed metrics perform comparably, with some variation due to the size of test sets. Since the metrics are comparable, we recommend that the research community standardize on the mean absolute error metric. This work is presented in chapter 3. In addition to the review of evaluation metrics, there are now several new publicly available rating data sets and a software framework for the rapid evaluation of ACF algorithms, presented in chapter 7.
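For reference, the standard definition of mean absolute error over a test set of N prediction/rating pairs (the notation here is generic, not necessarily that of chapter 3) is:

    \mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| p_i - r_i \right|

where p_i is the system's predicted rating and r_i is the user's actual rating for the i-th withheld item.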

To address ways of increasing the accuracy of ACF algorithms, we present a detailed empirical analysis of neighborhood-based prediction algorithms, which are currently the most widely used class of prediction algorithms. We test proposed variations of neighborhood-based algorithms and identify those parameters and variations that have the most significant effect on prediction accuracy. Most notably, rating normalization and significance weighting provide the most significant increases in prediction accuracy. This work is presented in chapter 4.

The second challenge was "How can we increase the effectiveness of ACF as a decision-making aid using explanation interfaces?" Increasing the acceptance and effectiveness of ACF systems beyond entertainment involves addressing situations where there is greater risk involved in accepting the recommendations of an ACF system. Users are unlikely to act on recommendations from a black-box prediction engine if it has been known to make mistakes and there is significant risk involved in the action. One approach to increasing the acceptance and effectiveness of ACF systems in such situations is to provide functionality that allows the ACF system to explain the reasoning and data behind each recommendation. We present a theoretical and empirical exploration of providing explanations for ACF systems. We present results showing which types of explanation interfaces are the most compelling for users. Furthermore, we present data that shows that explanation interfaces will increase the acceptance of ACF systems. This work is presented in chapter 5.

The third challenge was "How can we improve automated collaborative filtering for meeting ephemeral user information needs?" One of the weaknesses of current ACF systems is the assumption that each user's interests remain consistent over time. All of the user's recorded historical preferences are used to predict new items of interest. Adaptation to changes in interests can only occur in the long term. ACF systems provide no mechanisms for meeting information needs that are ephemeral and more focused. One of the problems is creating ACF systems that support focused information queries without the use of content information, using only rating information. By using only rating information, the ACF technology remains flexible enough to apply to any human-describable content with little or no domain-specific customization.

We explore the extension of ACF functionality to support more focused information needs without the use of content information. We present one approach, which involves the utilization of correlations between items to represent potential similarities in content. This work is presented in chapter 6.
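As a concrete illustration of that approach, the sketch below computes item-item correlations over co-rating users from a plain ratings table, discarding pairs whose number of common raters falls below the support threshold. The function name, data layout, default threshold, and the use of Pearson correlation are illustrative assumptions; chapter 6 gives the definitive formulation.

    from collections import defaultdict
    from math import sqrt
    from typing import Dict, Iterable, Tuple

    def item_correlations(ratings: Iterable[Tuple[str, str, float]],
                          support_threshold: int = 50
                          ) -> Dict[Tuple[str, str], float]:
        # ratings: (user, item, rating) triples from an ordinary ACF database.
        by_item: Dict[str, Dict[str, float]] = defaultdict(dict)
        for user, item, r in ratings:
            by_item[item][user] = r
        corr: Dict[Tuple[str, str], float] = {}
        items = list(by_item)
        for i, a in enumerate(items):
            for b in items[i + 1:]:
                common = set(by_item[a]) & set(by_item[b])
                if len(common) < support_threshold:
                    continue  # drop low-support pairs entirely
                xs = [by_item[a][u] for u in common]
                ys = [by_item[b][u] for u in common]
                mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
                cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
                sx = sqrt(sum((x - mx) ** 2 for x in xs))
                sy = sqrt(sum((y - my) ** 2 for y in ys))
                if sx and sy:
                    corr[(a, b)] = cov / (sx * sy)
        return corr

The resulting pairwise correlations are what each item's top-correlate list is built from, and the support threshold here is the parameter whose effect on perceived relevance was tested in the experiment reported in chapter 6.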


Appendix I: Depictions of all Explanation Interfaces Used in Chapter 5

1. Histogram with grouping
2. Past performance
3. Neighbor ratings histogram
4. Table of neighbors' ratings
5. Similarity to other movies rated
6. Favorite actor or actress
7. MovieLens percent confidence in prediction
8. Won awards
9. Detailed process description
10. Number of neighbors
11. No extra data – focus on system
12. No extra data – focus on users
13. MovieLens confidence in prediction
14. Good profile
15. Overall percent rated 4+
16. Complex graph: count, ratings, and similarity
17. Recommended by critics
18. Rating and % agreement of closest neighbor
19. Number of neighbors with std. deviation
20. Number of neighbors with average correlation
21. Overall average rating

Bibliography

1. Buchanan,B., Shortliffe,E., (Eds.) 1984. Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project. Addison-Wesley, Reading, MA.
2. Lauer,T.W., Peacock,E., Graesser,A.C., (Eds.) 1985. Questions and Information Systems. Lawrence Erlbaum and Associates.
3. Graesser,A.C., Black,J.B., (Eds.) 1985. The Psychology of Questions. Lawrence Erlbaum and Associates.
4. NetPerceptions Inc. Web Site. 1999.
5. Aggarwal,C.C., Wolf,J.L., Wu,K.-L., Yu,P.S., 1999. Horting Hatches an Egg: A New Graph-Theoretic Approach to Collaborative Filtering. Proceedings of ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
6. Balabanovíc,M., 1998. An Interface for Learning Multi-topic User Profiles from Implicit Feedback. Proceedings of the 1998 Workshop on Recommender Systems 6-10.
7. Balabanovíc,M., Shoham,Y., 1997. Fab: Content-Based, Collaborative Recommendation. Communications of the ACM 40 (3), 66-72.

8. Baudisch,P., 1998. Recommending TV Programs: How far can we get at zero effort? Proceedings of the 1998 Workshop on Recommender Systems 16-18.
9. Belkin,N., Croft,B.W., 1992. Information Filtering and Information Retrieval: Two Sides of the Same Coin? Communications of the ACM 35 (12), 29-38.
10. Billsus,D., Pazzani,M.J., 1998. Learning collaborative information filters. Machine Learning Proceedings of the Fifteenth International Conference (ICML'98).
11. Billsus,D., Pazzani,M.J., 1998. Learning collaborative information filters. Proceedings of the 1998 Workshop on Recommender Systems.
12. Breese,J.S., Heckerman,D., Kadie,C., 1998. Empirical analysis of predictive algorithms for collaborative filtering. Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence (UAI-98). (pp. 43-52).
13. Cleverdon,C., 1963. The testings of index language devices. Aslib Proceedings 15 (4), 106-130.
14. Cleverdon,C., 1967. The Cranfield tests on index language devices. Aslib Proceedings 19 173-192.
15. Cleverdon,C., Kean,M. 1968. Factors Determining the Performance of Indexing Systems.
16. Cohen,W.W., Basu,C., Hirsh,H., 1998. Using Social and Content-Based Information in Recommendation. Proceedings of the AAAI-98.
17. Cooper,W.S., 1971. A Definition of Relevance for Information Retrieval. Information Storage and Retrieval 7 19-37.
18. Cooper,W.S., 1988. Getting beyond Boole. Information Processing & Management 24 243-248.
19. Cooper,W.S., 1968. Expected Search Length: A Single Measure of Retrieval Effectiveness Based on the Weak Ordering Action of Retrieval Systems. American Documentation 19 (1), 30-41.

20. Dahlen,B.J., Konstan,J.A., Herlocker,J.L., Good,N., Borchers,A., Riedl,J., 1998. Jump-starting movielens: User benefits of starting a collaborative filtering system with "dead data". University of Minnesota TR 98-017.
21. Delgado,J., Ishii,N., 1999. Memory-Based Weighted Majority Prediction for Recommender Systems. 1999 SIGIR Workshop on Recommender Systems.
22. Denning,P.J., 1982. Electronic Junk. Communications of the ACM 25 (2), 163-165.
23. Dietterich,T.G. Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. 1997.
24. Doyle,L.B., 1962. Indexing and Abstraction by Association. System Development Corporation (Unisys) SP-718/001/00.
25. Foskett,D.J., 1980. Thesaurus. In: Encyclopedia of Library and Information Science. Marcel Dekker, New York, pp. 416-462.
26. Fuhr,N., Buckley,C., 1991. A Probabilistic Learning Approach for Document Indexing. Transactions on Information Systems 9 (3), 223-248.
27. Goffman,W., Newill,V., 1966. A Methodology for Test and Evaluation of Information Retrieval Systems. Information Storage and Retrieval 3 (1), 19-25.
28. Goldberg,D., Nichols,D., Oki,B.M., Terry,D., 1992. Using Collaborative Filtering to Weave an Information Tapestry. Communications of the ACM 35 (12), 61-70.
29. Good,I.J., 1966. A Decision-Theory Approach to the Evaluation of Information-Retrieval Systems. Information Storage and Retrieval 3 (2), 31-34.
30. Good,N., Schafer,J.B., Konstan,J.A., Borchers,A., Sarwar,B.M., Herlocker,J.L., Riedl,J., 1999. Combining collaborative filtering with personal agents for better recommendations. Proceedings of the 1999 Conference of the American Association of Artificial Intelligence (AAAI-99). (pp. 439-446).
31. Gordon,L.R. Observations on Using Three Database Management Systems to Store Sparse Matrices in GroupLens. 1997. Department of Computer Science, University of Minnesota.
32. Gupta,D., Digiovanni,M., Narita,H., Goldberg,K. Jester 2.0: Collaborative Filtering to Retrieve Jokes. Proceedings of ACM Conference on Research and Development in Information Retrieval (SIGIR). 1999.
33. Hanley,J.A., McNeil,B.J., 1982. The Meaning and Use of the Area under a Receiver Operating Characteristic (ROC) Curve. Radiology 143 29-36.
34. Harter,S.P., 1996. Variations in Relevance Assessments and the Measurement of Retrieval Effectiveness. Journal of the American Society for Information Science 47 (1), 37-49.
35. Hearst,M.A., 1995. TileBars: visualization of term distribution information in full text information access. Conference proceedings on human factors in computing systems 59-66.
36. Herlocker,J.L., Konstan,J.A., Borchers,A., Riedl,J., 1999. An algorithmic framework for performing collaborative filtering. Proceedings of the 1999 Conference on Research and Development in Information Retrieval.

37. Herlocker,J.L., Konstan,J.A., Riedl,J., 2000. Explaining Collaborative Filtering Recommendations. Proceedings of the 2000 Conference on Computer Supported Cooperative Work.
38. Hill,W., Stead,L., Rosenstein,M., Furnas,G.W., 1995. Recommending and Evaluating Choices in a Virtual Community of Use. Proceedings of ACM CHI'95 Conference on human factors in computing systems. Denver, CO., (pp. 194-201).
39. Horvitz,E.J., Breese,J.S., Henrion,M., 1988. Decision Theory in Expert Systems and Artificial Intelligence. International Journal of Approximate Reasoning 2 (3), 247-302.
40. Hull,D.A., Grefenstette,G., 1996. Querying Across Languages: A dictionary based approach to multilingual information retrieval. Proceedings of the 19th Annual International Conference on Research and Development in Information Retrieval 49-57.
41. Hutchins,W.J., 1978. The concept of "aboutness" in subject indexing. Aslib Proceedings 30 172-181.
42. Jacobs,P.S., Rau,L.F., 1988. Natural language techniques for intelligent information retrieval. Proceedings of the eleventh international conference on research & development in information retrieval 85-99.
43. Johnson,H., Johnson,P., 1993. Explanation facilities and interactive systems. Proceedings of Intelligent User Interfaces '93. (pp. 159-166).
44. Jordan,J.R., 1968. A Framework for Comparing SDI Systems. American Documentation 19 (3), 221-222.
45. Joyce,T., Needham,R.M., 1958. The Thesaurus Approach To Information Retrieval. American Documentation 9 192-197.
46. Kautz,H., Selman,B., Shah,M., 1997. Referral Web: Combining social networks and collaborative filtering. Communications of the ACM 40 (3), 63-65.
47. Konstan,J.A., Miller,B.N., Maltz,D., Herlocker,J.L., Gordon,L.R., Riedl,J., 1997. GroupLens: Applying collaborative filtering to Usenet news. Communications of the ACM 40 (3), 77-87.
48. Le,C.T., Lindren,B.R., 1995. Construction and Comparison of Two Receiver Operating Characteristics Curves Derived from the Same Samples. Biom.J. 37 (7), 869-877.
49. Lewis,D.D., Sparck Jones,K., 1996. Natural Language Processing for Information Retrieval. Communications of the ACM 39 (1), 92-101.
50. Linton,F., Charron,A., Joy,D., 1998. OWL: A Recommender System for Organization-Wide Learning. Proceedings of the 1998 Workshop on Recommender Systems 65-69.
51. Loeb,S., 1992. Architecting Personalized Delivery of Multimedia Information. Communications of the ACM 35 (12), 39-47.
52. Luhn,H.P., 1961. The automatic derivation of information retrieval encodements from machine-readable texts. In: Information Retrieval and Machine Translation. Interscience Publication, New York, pp. 1021-1028.
53. Maes,P., 1994. Agents that Reduce Work and Information Overload. Communications of the ACM 37 (7), 31-71.

54. Malone,T.W., Grant,K.R., Turbak,F.A., Brobst,S.A., Cohen,M.D., 1987. Intelligent Information-Sharing Systems. Communications of the ACM 30 (5), 390-402.
55. Maltz,D. Distributing information for collaborative filtering on Usenet net news. 1994. M.I.T. Department of EECS, Cambridge, Mass.
56. Maltz,D., Ehrlich,K., 1995. Pointing the way: Active collaborative filtering. Proceedings of the 1995 ACM Conference on Human Factors in Computing Systems. New York.
57. Maron,M.E., Kuhns,J.L., 1960. On relevance, probabilistic indexing and information retrieval. Journal of the Association for Computing Machinery 7 216-244.
58. Miller,B.N., Riedl,J., Konstan,J.A., Resnick,P., Maltz,D., Herlocker,J.L. The GroupLens Protocol Specification. 1999.
59. Miller,C.A., Larson,R., 1992. An Explanatory and ``Argumentative'' Interface for a Model-Based Diagnostic System. Proceedings of User Interface Software and Technology (UIST '92). (pp. 43-52).
60. Mitchell,T. 1997. Machine Learning, 1 ed. McGraw Hill.
61. Morita,M., Shinoda,Y., 1994. Information filtering based on user behavior analysis and best match text retrieval. Proceedings of SIGIR '94. New York.
62. Mostafa,S., Mukhopadhyay,W., Lam,W., Palakal,M., 1997. A Multilevel Approach to Intelligent Information Filtering. ACM Transactions on Information Systems 15 (4), 368-399.
63. Norman,D.A. 1989. The Design of Everyday Things. Currency-Doubleday, New York.
64. O'Connor,J., 1968. Some Questions Concerning "Information Need". American Documentation 19 (2), 200-203.
65. Oard,D.W., Kim,J., 1998. Implicit Feedback for Recommender Systems. Proceedings of the 1998 Workshop on Recommender Systems 81-83.
66. Pollock,S.M., 1968. Measures for the Comparison of Information Retrieval Systems. American Documentation 19 (4), 387-397.
67. Press,W.H., Flannery,B.P., Teukolsky,S.A., Yan,T. 1986. Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, New York, NY.
68. Resnick,P., Iacovou,N., Suchak,M., Bergstrom,P., Riedl,J., 1994. GroupLens: An open architecture for collaborative filtering of netnews. Proceedings of 1994 Conference on Computer Supported Collaborative Work. (pp. 175-186).
69. Robertson,S.E., 1977. The probability ranking principle in IR. Journal of Documentation 33 294-304.
70. Salton,G., Buckley,C., 1988. Term Weighting Approaches in Automatic Text Retrieval. Information Processing & Management 24 (5), 513-523.
71. Salton,G., Lesk,M.E., 1968. Computer Evaluation of Indexing and Text Processing. Journal of the Association for Computing Machinery 15 8-36.
72. Salton,G., Wong,A., Yang,C.S., 1975. A vector space model for automatic indexing. Communications of the ACM 18 613-620.

73. Saracevic,T., Kantor,P., 1988. A Study of Information Seeking and Retrieving. II. Users, Questions, and Effectiveness. Journal of the American Society for Information Science 39 (3), 177-196.
74. Saracevic,T., Kantor,P., 1988. A Study of Information Seeking and Retrieving. III. Searchers, Searches, and Overlap. Journal of the American Society for Information Science 39 (3), 197-216.
75. Saracevic,T., Kantor,P., Chamis,A.Y., Trivison,D., 1988. A Study of Information Seeking and Retrieving. I. Background and Methodology. Journal of the American Society for Information Science 39 (3), 161-176.
76. Sarwar,B.M., Karypis,G., Konstan,J.A., Riedl,J., 2000. Analysis of Recommender Algorithms for E-Commerce. Proceedings of E-Commerce 2000.
77. Sarwar,B.M., Karypis,G., Konstan,J.A., Riedl,J., 2000. Application of Dimensionality Reduction in Recommender System--A Case Study. WebKDD workshop August 20, 2000.
78. Sarwar,B.M., Konstan,J.A., Borchers,A., Herlocker,J.L., Miller,B.N., Riedl,J., 1998. Using filtering agents to improve prediction quality in the grouplens research collaborative filtering system. Proceedings of CSCW '98. Seattle, WA.
79. Shardanand,U., Maes,P., 1995. Social Information Filtering: Algorithms for Automating "Word of Mouth". Proceedings of ACM CHI '95. Denver, CO., (pp. 210-217).
80. Sparck Jones,K., Foote,G.J.F., Young,S.J., 1996. Experiments in Spoken Document Retrieval. Information Processing & Management 32 399-419.
81. Strzalkowski,T., Perez-Carballo,J., Marinescu,M., 1996. Natural language information retrieval in digital libraries. Proceedings of the 1st ACM international conference on digital libraries 117-125.
82. Swets,J.A., 1963. Information Retrieval Systems. Science 141 245-250.
83. Swets,J.A., 1969. Effectiveness of Information Retrieval Methods. American Documentation 20 (1), 72-89.
84. Terveen,L., Hill,W., Amento,B., McDonald,D., Creter,J., 1997. PHOAKS: A System for Sharing Recommendations. Communications of the ACM 40 (3), 59-62.
85. Toulmin,S.E. 1958. The Uses of Argument. Cambridge University Press.
86. Ungar,L.H., Foster,D.P., 1998. Clustering Methods for Collaborative Filtering. Proceedings of the 1998 Workshop on Recommender Systems 114-129.
87. van Rijsbergen,C.J. 1979. Information Retrieval, 2nd ed. Butterworths.
88. Wong,S.K.M., Yao,Y.Y., 1995. On Modeling Information Retrieval with Probabilistic Inference. Transactions on Information Systems 13 (1), 38-68.
89. Yan,T., Garcia-Molina,H., 1995. SIFT: A tool for wide-area information dissemination. Proceedings of the USENIX 1995 Winter Technical Conference.
90. Yao,Y.Y., 1995. Measuring Retrieval Effectiveness Based on User Preference of Documents. Journal of the American Society for Information Science 46 (2), 133-145.