User Friendly Recommender Systems
Mark Hingston
SID: 0220763
Supervisor: Judy Kay

This thesis is submitted in partial fulfillment of the requirements for the degree of Bachelor of Information Technology (Honours)
School of Information Technologies
The University of Sydney
Australia
3 November 2006
Abstract

Recommender systems are a recent but increasingly widely used resource. Yet most, if not all, of them suffer from serious deficiencies. Recommender systems often require first-time users to enter ratings for a large number of items — a tedious process that often deters users. Thus, this thesis investigated whether useful recommendations could be made without requiring users to explicitly rate items. It was shown that ratings automatically generated from implicit information about a user can be used to make useful recommendations.

Most recommender systems also provide no explanations for the recommendations that they make, and give users little control over the recommendation process. Thus, when these systems make a poor recommendation, users cannot understand why it was made and are not able to easily improve their recommendations. Hence, this thesis investigated ways in which scrutability and control could be implemented in such systems.

A comprehensive questionnaire was completed by 18 participants as a basis for a broader understanding of the issues mentioned above and to inform the design of a prototype; a prototype was then created and two separate evaluations performed, each with at least 9 participants. This investigation highlighted a number of key scrutability and control features that could be useful additions to existing recommender systems. The findings of this thesis can be used to improve the effectiveness, usefulness and user friendliness of existing recommender systems. These findings include:

• Explanations, controls and a map based presentation are all useful additions to a recommender system.
• Specific explanation types can be more useful than others for explaining particular recommendation techniques.
• Specific recommendation techniques can be useful even when a user has not entered many ratings.
• Ratings generated from purely implicit information about a user can be used to make useful recommendations.
Acknowledgements

Firstly, I would like to thank my supervisor, Judy Kay, for all of the time and effort she has put into guiding me through the production of this thesis. I would like to thank Mark van Setten and the creators of the Duine Toolkit for producing a high quality piece of software and making it available to the public. I want to also thank Joseph Konstan, for taking the time to talk with me and give me encouragement at the formative, early stages of my thesis. I would also like to thank my lovely girlfriend Sarah Kulczycki, for her unwavering support and fun-loving spirit.
Contents

Abstract
Acknowledgements
List of Figures
Chapter 1  Introduction
    1.1  Background
    1.2  Research Questions
Chapter 2  Literature Review
    2.0.1  Social Filtering
    2.0.2  Content-Based Filtering
    2.1  Hybrid Recommenders (The Duine Toolkit)
    2.2  Unobtrusive Recommendation
    2.3  Scrutability and Control
    2.4  Conclusion
Chapter 3  Exploratory Study
    3.1  Introduction
    3.2  Qualitative Analysis
    3.3  Recommendation Algorithm Analysis
    3.4  Questionnaire - Design
        3.4.1  Part A - Presentation Style
        3.4.2  Part B - Understanding & Usefulness
        3.4.3  Final Questions - Integrative
    3.5  Questionnaire - Results
        3.5.1  Usefulness
        3.5.2  Understanding
        3.5.3  Understanding And Usefulness
        3.5.4  Control
        3.5.5  Presentation Method
        3.5.6  Final Questions
    3.6  Test Data
    3.7  Conclusion
Chapter 4  Prototype Design
    4.1  Introduction
    4.2  User's View
        4.2.1  iSuggest-Usability
        4.2.2  iSuggest-Unobtrusive
    4.3  Design & Architecture
        4.3.1  iSuggest-Usability
        4.3.2  iSuggest-Unobtrusive
    4.4  Conclusion
Chapter 5  Evaluations
    5.1  Introduction
    5.2  Design
        5.2.1  iSuggest-Usability
        5.2.2  iSuggest-Unobtrusive
    5.3  iSuggest-Usability Evaluations — Results
        5.3.1  Recommender Usefulness
        5.3.2  Explanations
        5.3.3  Controls
        5.3.4  Presentation Method
    5.4  iSuggest-Unobtrusive - Results
        5.4.1  Statistical Evaluations
        5.4.2  Ratings Generation
        5.4.3  Recommendations
    5.5  Conclusion
Chapter 6  Conclusion
    6.1  Future Work
References
Appendix A — Questionnaire Form
Appendix B — Questionnaire Results
Appendix C — iSuggest-Usability Evaluation Instructions
Appendix D — iSuggest-Usability Evaluation Results
Appendix E — iSuggest-Unobtrusive Evaluation Instructions
Appendix F — iSuggest-Unobtrusive Evaluation Results
List of Figures

2.1  MAE For The Duine Toolkit's System Lifecycle Test. Lower MAE Values Indicate Better Performance. The Numbers Below Each Group Indicate The Sample Size (In Number Of Predictions)
2.2  Examples Of Features That Can Be Computed For Various Item Types
2.3  Mean Response Of Users To Each Explanation Interface, Based On A Scale Of One To Seven. Explanations 11 And 12 Represent The Base Case Of No Additional Information. Shaded Rows Indicate Explanations With A Mean Response Significantly Different From The Base Cases.
3.1  Summary Of Possible Explanations And Control Features For The Major Algorithms In The Duine Toolkit.
3.2  Demographic Information For Each Of The Respondents.
3.3  List Based Presentation That Was Shown To Participants In The Questionnaire
3.4  Map Based Presentation That Was Shown To Participants In The Questionnaire
3.5  One Of The Explanation Screens Shown To Participants In The Questionnaire. This Screen Explains Recommendations From The Learn By Example Technique
3.6  One Of The Explanation Screens Shown To Participants In The Questionnaire. This Screen Explains Recommendations From The Social Filtering Technique
3.7  The Genre Based Control Shown To Participants In The Questionnaire
3.8  The Screens With The Maximum Average Usefulness For Each Recommendation Method. Error Bars Show One Standard Deviation Above And Below The Mean. N = 18.
3.9  Average Ranking Given To Each Presentation Method. N = 18. Top Ranking = 1. Bottom Ranking = 6.
3.10  Average Response For Contribution That Each Method Should Make To A Combination Of Recommendation Methods. Error Bars Show One Standard Deviation Above And Below The Mean. N = 18.
3.11  The Screens With The Maximum Average Understanding For Each Recommendation Method. Error Bars Show One Standard Deviation Above And Below The Mean. N = 18
3.12  Respondents' Average Understanding Of Recommendation Methods Before And After Explanations. Error Bars Show One Standard Deviation Above And Below The Mean. N = 18
3.13  Average Ratings For Questions Regarding Respondents' Understanding, Likelihood Of Using And Perceived Usefulness Of Each Control Feature. Error Bars Show One Standard Deviation Above And Below The Mean. N = 18
3.14  Users' Responses For Questions Regarding Recommendation Presentation Methods. Error Bars Show One Standard Deviation Above And Below The Mean. N = 18
3.15  Average Rating For The Usefulness Of Possible Features Of A Recommender. Error Bars Show One Standard Deviation Above And Below The Mean.
4.1  List Based Presentation Of Recommendations
4.2  The Star Bar That Users Used To Rate Items
4.3  Recommendation Technique Selection Screen. Note: The 'Word Of Mouth' Technique Shown Here Is Social Filtering And The 'Let iSuggest Choose' Technique Is The Duine Toolkit Taste Strategy
4.4  Explanation Screen For Genre Based Recommendations
4.5  Social Filtering (Simple Graph) Explanation Screen For Social Filtering Recommendations
4.6  Explanation Screen For Learn By Example Recommendations
4.7  Explanation Screen For Most Popular Recommendations
4.8  The Genre Based Control (Genre Slider)
4.9  The Social Filtering Control. Note: The Actual Control Is The 'Ignore This User' Link
4.10  Full Map Presentation — Zoomed Out View
4.11  Full Map Presentation — Zoomed In View
4.12  Similar Items Map Presentation
4.13  The Explanation Screen Displayed After Ratings Generation
4.14  Architecture Of The Basic Prototype, With Components Constructed During This Thesis Marked In Blue
4.15  Architecture Of iSuggest-Usability, With Components Constructed During This Thesis Marked In Blue
4.16  Architecture Of iSuggest-Unobtrusive, With Components Constructed During This Thesis Marked In Blue
5.1  Demographic Information About The Users Who Conducted The Evaluations Of iSuggest-Usability
5.2  Demographic Information About The Users Who Conducted The Evaluations Of iSuggest-Unobtrusive
5.3  Average Usefulness Ratings For Each Recommendation Method. Error Bars Show Standard Deviation.
5.4  Average Usefulness Ratings For Each Explanation. Error Bars Show Standard Deviation.
5.5  Users' Ratings For The Overall Use Of The iSuggest Explanations.
5.6  Users' Ratings For The Effectiveness Of Control Features.
5.7  Users' Ratings For The Overall Effectiveness Of The iSuggest Control Features.
5.8  Average Usefulness Of The Map Based Presentations. Error Bars Show Standard Deviation.
5.9  Sum Of Votes For The Preferred Presentation Type.
5.10  Comparison Of Distribution Of Ratings Values.
5.11  Comparison Of MAE And SDAE For Movielens Recommendations And Recommendations Using Generated Ratings. Lower Scores Are Better. Techniques Are Sorted By MAE.
5.12  Average Usefulness Ratings For Each Recommendation Method. Error Bars Show Standard Deviation.
CHAPTER 1
Introduction
Recommender systems are a recent, but increasingly widely used, resource. Yet most, if not all, of them suffer from serious deficiencies. With so much information available over the Internet, people often turn to recommendation services to highlight the items that will be of most interest to them. All of the significant systems in the area of recommendation build up a profile of a user (usually through asking users to rate items they have seen) and then use content-based or collaborative filtering, or a combination (hybrid) of these methods, to make recommendations about what other pieces of information a user might be interested in.

However, many recommender systems require first-time users to enter ratings for a large number of items. Further, these systems do not always make useful recommendations. Recommendations can be poor for a number of reasons, but what happens when a recommender does make a poor recommendation? Most recommender systems offer no information about the reason that they made particular recommendations. Further, most also offer users little opportunity to affect the system in a way that can improve recommendations. The fact that recommenders require users to rate items can also be a failing, as the tedious process of entering ratings can often deter users. When we take account of all of these factors, it is obvious that many existing recommender systems are not meeting their potential for usefulness and usability.
1.1 Background

Since about 1995, recommender systems have been deployed across many domains. Two of the most important early recommender systems were Ringo (publicly available in 1994) and GroupLens (www.grouplens.org; available in 1996). The success of Ringo, one of the first large-scale music recommendation systems, is reported in (Shardanand and Maes, 1995). GroupLens, an automated collaborative filtering system for Usenet
news, also proved highly successful. (Konstan et al., 1997) reported trials of the GroupLens system, and this classic paper showed that collaborative filtering could be effective on a large scale. The GroupLens project was soon adapted to produce MovieLens (http://movielens.umn.edu/), a large-scale, publicly available movie recommendation system. Large interest in recommender systems was soon fostered by the increasing public demand for systems that helped deal with the problem of information overload. Since then, much academic and commercial interest has been shown in recommender systems for many different domains. Although much of their research is not published, Amazon.com is one of the most well known implementers of this technology. Amazon.com makes use of collaborative filtering systems to recommend products that a user might like to purchase. Other companies that use recommender systems include netflix.com for videos, TiVo for digital television and Barnes and Noble for books. Many music recommendation systems are also available today, such as Pandora.com (which maintains a staff of music analysts who tag songs as they enter the system) and last.fm (http://www.last.fm). (Atkinson, 2006) rated these two systems as the best music recommenders currently available to the public.
1.2 Research Questions

In order to make recommender systems more user friendly, the problems detailed above need to be addressed. However, there is a lack of existing research into the ways that recommender systems can make recommendations unobtrusively, explain recommendations, and offer users useful control over the recommendation process. This lack of research is especially prevalent in the area of music recommendation, where little research has been published. Thus, this project investigated the following research questions:

Scrutability & Control: What is the impact of adding scrutability and control to a recommender system?

Unobtrusive Recommendation: Can a recommender system provide useful recommendations without asking users to explicitly rate items?

This thesis originally aimed to investigate these questions with reference to music recommender systems. To further this goal, a dataset containing unobtrusively obtained information about users was located for use in investigating Unobtrusive Recommendation. However, it quickly became apparent that few music
datasets exist that contain users' explicit ratings of music. Thus, in order to conduct a thorough and rigorous study of Scrutability & Control, the MovieLens standard dataset was used. This contained information on users and their ratings of movies. The contributions of this thesis are: the identification of a lack of existing research into scrutability, control and unobtrusiveness in recommender systems (Chapter 2); the identification of a number of promising methods for adding scrutability and control to a recommender (Chapter 3); the creation of a prototype that implements these scrutability and control methods, and can also provide unobtrusive recommendations (Chapter 4); and the evaluation of the methods implemented in this prototype for providing scrutability, control and unobtrusiveness within a recommender system (Chapter 5).
CHAPTER 2
Literature Review
The basic purpose of a music recommender is to recommend items that will be of interest to a specific user. This task is required because an abundance of information is now available to people via the Internet and many don't have the time to sort through it all. Currently, all major recommendation systems use social filtering, content-based filtering, or some combination of these two approaches to predict how interested a user will be in a specific item. This information is then used to recommend items that the system believes will be of the most interest to that user. Each of these approaches to recommendation is discussed below, with reference to Figure 2.1 (taken from (van Setten et al., 2002)). This graph shows the results of testing a series of approaches to recommendation using the MovieLens standard data set. These tests were evaluated using the Mean Absolute Error (MAE) metric, which (Herlocker et al., 2004) lists as an appropriate metric for the evaluation of recommender systems. Figure 2.1 gives a good indication of the relative levels of performance that can be achieved by using each approach.
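For reference, the MAE of a prediction technique over a test set of N predictions is simply the average absolute difference between each predicted rating p_i and the rating r_i that the user actually gave:

\[ \mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \lvert p_i - r_i \rvert \]

Lower values therefore indicate predictions that sit closer to users' actual ratings.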
2.0.1 Social Filtering

(Polcicova et al., 2000), (Breese et al., 1998) and (Shardanand and Maes, 1995) explain that social filtering systems work by first asking users to rate items. Then, by comparing those ratings, they locate users who share common interests and make personalized recommendations based on like-minded users' opinions. Social filtering does not take formal content into account and makes judgments based purely upon the ratings of users. The GroupLens project, documented in (Konstan et al., 1997), involved a large-scale trial of a social filtering recommender system. This trial was confirmatory research - a large number of users were asked to test the system, and the results of this testing were collated to provide a statistical confirmation that social filtering could be effective on a large scale. Many further research projects into social filtering have confirmed its utility through simulation. Such projects include (Breese
et al., 1998) and (van Setten et al., 2002), which both contain simulations run on the MovieLens data set and evaluated using mean error metrics. In general, social filtering algorithms work in the following way: "In the first step, they identify the k users in the database that are the most similar to the active user. During the second step, they compute the [set] of items [liked] by these users and associate a weight with each item based on its importance in the set. In the third and final step, from this [set] they select and recommend the items that have the highest weight and have not already been seen by the active user" - (Deshpande and Karypis, 2004), p 4. Figure 2.1 shows the social filtering recommender to have the equal lowest MAE in four of the five tests, showing that it is a highly effective recommendation method. However, social filtering is not without its problems. (Adomavicius and Tuzhilin, 2005) summarises the issues with social filtering as:

• An inability to make accurate predictions for new users. (Referred to in this thesis as the cold start problem for new users.)
• Poor recommendation accuracy during the initial stages of the system. (Referred to in this thesis as the cold start problem for new systems.)
• A lack of ability to recommend new items until they are rated by users.

Social filtering was one recommendation technique used in this project to make music and movie related recommendations. As stated above, social filtering does not make use of the content of items, only the ratings that users have given each item. This means that social filtering approaches were easily adapted for use in both music and movie related recommendation.
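As an illustration of the three-step process quoted above, the following sketch implements a minimal user-based social filtering recommender in Java. It is not the Duine Toolkit's implementation; the cosine similarity measure, the "liked" threshold of 4 out of 5, and all class and method names are illustrative assumptions.

import java.util.*;

/**
 * Minimal sketch of the three-step user-based social filtering described by
 * Deshpande and Karypis: (1) find the k most similar users, (2) weight the
 * items those neighbours liked, (3) recommend the highest-weighted items the
 * active user has not yet rated. All names here are illustrative and are not
 * part of the Duine Toolkit API.
 */
public class SocialFilteringSketch {

    /** Cosine similarity between two users' rating vectors (co-rated items only). */
    static double similarity(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
                normA += e.getValue() * e.getValue();
                normB += other * other;
            }
        }
        return (normA == 0 || normB == 0) ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    static List<String> recommend(String activeUser,
                                  Map<String, Map<String, Double>> ratings,
                                  int k, int topN) {
        Map<String, Double> active = ratings.get(activeUser);

        // Step 1: the k users most similar to the active user.
        List<Map.Entry<String, Double>> neighbours = new ArrayList<>();
        for (String user : ratings.keySet()) {
            if (!user.equals(activeUser)) {
                neighbours.add(Map.entry(user, similarity(active, ratings.get(user))));
            }
        }
        neighbours.sort((x, y) -> Double.compare(y.getValue(), x.getValue()));
        neighbours = neighbours.subList(0, Math.min(k, neighbours.size()));

        // Step 2: weight each unseen item by the similarity of the neighbours who liked it.
        Map<String, Double> itemWeights = new HashMap<>();
        for (Map.Entry<String, Double> n : neighbours) {
            for (Map.Entry<String, Double> r : ratings.get(n.getKey()).entrySet()) {
                if (!active.containsKey(r.getKey()) && r.getValue() >= 4.0) {  // "liked"
                    itemWeights.merge(r.getKey(), n.getValue(), Double::sum);
                }
            }
        }

        // Step 3: recommend the highest-weighted unseen items.
        return itemWeights.entrySet().stream()
                .sorted((x, y) -> Double.compare(y.getValue(), x.getValue()))
                .limit(topN)
                .map(Map.Entry::getKey)
                .toList();
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> ratings = new HashMap<>();
        ratings.put("alice", Map.of("Fargo", 5.0, "Alien", 4.0, "Heat", 2.0));
        ratings.put("bob",   Map.of("Fargo", 5.0, "Alien", 5.0, "Se7en", 4.0));
        ratings.put("carol", Map.of("Heat", 5.0, "Se7en", 1.0, "Alien", 2.0));
        System.out.println(recommend("alice", ratings, 2, 3)); // e.g. [Se7en]
    }
}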
2.0.2 Content-Based Filtering

In content-based filtering systems, users are again asked to rate items. The system then analyses the content of those items and creates a profile that represents a user's interests in terms of item content (features, key phrases, etc.). Then the content of items unknown to the user is analysed and these are compared with the user's profile in order to find the items that will be of interest to the user. The information that a content-based filtering system can compute about a particular item falls into one of two categories: content-derived and meta-content information. Content-derived information (used in (Cano et al., 2005), (Logan, 2004) and (Mooney and Roy, 2000)) is computed by the system through
FIGURE 2.1: MAE For The Duine Toolkit's System Lifecycle Test. Lower MAE Values Indicate Better Performance. The Numbers Below Each Group Indicate The Sample Size (In Number Of Predictions)
analysis of the actual content of an item (e.g. the beats per minute of a song or the key words found in a document). Meta-content information (used in (Mak et al., 2003), (van Setten et al., 2002) and (van Setten et al., 2003)) is any information that the system can glean about an item that does not come from analysing the content of that item (such information may come from an external database, or a header attached to the item). Examples of the type of features that can be computed for text, music and movie data are given in Figure 2.2. Content-derived information about an item needs to make use of algorithms that are specific to the type of item that is being analysed. In contrast, meta-content information does not need to be computed from actual items and, in fact, meta-content information is often quite similar for items from different domains. Figure 2.2 shows that meta-content information for each of the different item types exhibits certain similarities, whereas the content-derived information is quite specific to the type of item. This fact means that meta-content based recommenders are able to be easily adapted for use in new domains, but that it is much more difficult to perform the same adaptation on recommenders that use content-derived information. However, systems that make use of content-derived information gain a better picture of each of the items in the system and thus should be able to make more accurate recommendations than systems that use only meta-content information.
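To make the content-based idea concrete, the sketch below builds a simple meta-content profile from genres, in the spirit of the Genre Based technique analysed later in Chapter 3. The genre sets, the 1-5 rating scale with 3 as the neutral point, and all identifiers are illustrative assumptions rather than the Duine Toolkit's actual data model.

import java.util.*;

/**
 * Minimal sketch of meta-content based filtering: build a profile of the
 * user's interest in each genre from the items they have rated, then score
 * unseen items by how well their genres match that profile.
 */
public class GenreBasedSketch {

    record Item(String title, Set<String> genres) {}

    /** Average (rating - neutral) per genre, so liked genres score above zero. */
    static Map<String, Double> buildProfile(Map<Item, Double> ratedItems) {
        Map<String, double[]> sums = new HashMap<>(); // genre -> {sum, count}
        for (Map.Entry<Item, Double> e : ratedItems.entrySet()) {
            for (String g : e.getKey().genres()) {
                double[] s = sums.computeIfAbsent(g, x -> new double[2]);
                s[0] += e.getValue() - 3.0; // 3 = neutral on a 1-5 scale
                s[1] += 1;
            }
        }
        Map<String, Double> profile = new HashMap<>();
        sums.forEach((g, s) -> profile.put(g, s[0] / s[1]));
        return profile;
    }

    /** Score an unseen item by the mean profile weight of its genres. */
    static double score(Item item, Map<String, Double> profile) {
        return item.genres().stream()
                .mapToDouble(g -> profile.getOrDefault(g, 0.0))
                .average().orElse(0.0);
    }

    public static void main(String[] args) {
        Item fargo = new Item("Fargo", Set.of("Crime", "Drama"));
        Item alien = new Item("Alien", Set.of("Sci-Fi", "Horror"));
        Item heat  = new Item("Heat",  Set.of("Crime", "Thriller"));
        Map<Item, Double> rated = Map.of(fargo, 5.0, alien, 2.0);

        Map<String, Double> profile = buildProfile(rated);
        System.out.printf("Score for %s: %.2f%n", heat.title(), score(heat, profile));
    }
}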
Like social filtering, content-based filtering also has weaknesses. (Adomavicius and Tuzhilin, 2005) states that they:

• Become over specialised and only recommend very specific types of items to each user.
• Are also subject to the cold start problem for new users.
• May rely on content-derived information, which is often expensive (or impossible) to compute accurately.
                   Text                Music            Movies
Meta-content:      Author              Composer         Writer
                   Abstract            N/A              Synopsis
                   Publisher           Producer         Producer
                   Genre               Genre            Genre
                   N/A                 Performer        Actors
Content-derived:   Key phrases         Beats / min      Color Histogram
                   Term frequencies    MFCC's           Story Tempo
FIGURE 2.2: Examples Of Features That Can Be Computed For Various Item Types

(van Setten et al., 2002) makes use of content-based filtering using meta-content to make movie recommendations. This content-based filtering approach is one of a number of prediction techniques used in the Duine Toolkit to make recommendations. This toolkit is discussed in detail in Section 2.1. The tests summarized in (van Setten et al., 2002) show that the content-based algorithm included in the Duine Toolkit performed well during simulations. This project extended the Duine Toolkit to also include content-based prediction techniques for music recommendations.
2.1 Hybrid Recommenders (The Duine Toolkit)

Hybrid recommender systems combine content-based and social filtering in the hope that this combination might contain all the strengths of the two approaches, while also alleviating their problems. The Duine Toolkit is a hybrid recommender that was produced as part of a PhD completed by Mark van Setten. It is a piece of software that makes available a number of prediction techniques (including both social filtering and content-based techniques) and allows them to be combined dynamically. This project involved using the Duine Toolkit to make both music and movie related recommendations. This toolkit makes use of prediction strategies, which were introduced in (van Setten et al., 2002). Such
prediction strategies are a way of easily combining prediction techniques dynamically and intelligently in an attempt to provide better and more reliable prediction results. (van Setten et al., 2002) introduces these prediction strategies and demonstrates how they can be adapted depending upon the various states that a system might be in. It introduces a software platform called Duine, which implements prediction strategies and can be extended to include new prediction techniques and new strategies. Simulations run in (van Setten et al., 2002) and (van Setten et al., 2004) showed that the combination of prediction techniques into prediction strategies can improve the effectiveness of a recommendation system. The testing done in these papers was of sound quality and was performed on the data set made available by the MovieLens project, which is a well-known, standard data set for recommender systems. The results of these tests are summarised in (van Setten et al., 2002). These results show that in every case, the Taste Strategy (a particular prediction strategy used in testing) had the lowest MAE of all of the prediction techniques used. This strategy is able to choose the most effective prediction technique for a particular situation and thus is able to maximise prediction accuracy. The work done in (van Setten et al., 2002) and (van Setten et al., 2004) focused on making predictions based on movie data. This project built upon this work by extending the Duine Toolkit for use in music recommendation. As well as making use of the Duine Toolkit in a new domain, this project also involved the addition of Scrutability & Control features and Unobtrusive Recommendation to this toolkit. Each of these additions is discussed in the following sections.
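The following sketch illustrates the prediction strategy idea in miniature: each technique reports a confidence alongside its prediction, and the strategy routes each request to the technique it currently trusts most, falling back to a Most Popular style technique when nothing else is confident. The Predictor interface, the confidence scores and the stub techniques are all illustrative assumptions; they do not reproduce the Duine Toolkit's actual classes or its Taste Strategy logic.

/**
 * Minimal sketch of a prediction strategy: it does not predict by itself, it
 * chooses which underlying technique to trust given the current state of the
 * system (for example, how confident each technique is for the active user).
 */
public class PredictionStrategySketch {

    interface Predictor {
        /** Predicted interest in an item, in [1, 5]. */
        double predict(String user, String item);
        /** How reliable this predictor currently is for the user, in [0, 1]. */
        double confidence(String user);
        String name();
    }

    /** Pick the most confident technique, falling back to the first argument. */
    static Predictor choose(String user, Predictor fallback, Predictor... others) {
        Predictor best = fallback;
        for (Predictor p : others) {
            if (p.confidence(user) > best.confidence(user)) {
                best = p;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Stub predictors standing in for real techniques.
        Predictor popular = stub("MostPopular", 3.8, 0.3);
        Predictor social  = stub("SocialFiltering", 4.4, 0.1); // cold start: low confidence
        Predictor genre   = stub("GenreBased", 4.0, 0.6);

        Predictor chosen = choose("alice", popular, social, genre);
        System.out.println("Strategy chose: " + chosen.name()
                + ", prediction = " + chosen.predict("alice", "Fargo"));
    }

    static Predictor stub(String name, double prediction, double confidence) {
        return new Predictor() {
            public double predict(String user, String item) { return prediction; }
            public double confidence(String user) { return confidence; }
            public String name() { return name; }
        };
    }
}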
2.2 Unobtrusive Recommendation

Generally, recommender systems build a profile of a user's likes and dislikes by asking a user to rate specific items after they have listened to them. However, users often find this process to be tedious. Further, the cold start problem for new users means that users may need to rate many items before they receive useful recommendations. As a result, this thesis investigated ways in which a system can elicit information about a user's likes and dislikes in an unobtrusive manner. In order to investigate Unobtrusive Recommendation, new features were added to the Duine Toolkit. This would allow the system to make recommendations without needing to ask a user to rate the items that they have seen or heard. Accomplishing this task required an unobtrusive way to gauge a user's level of interest in an item. Some of the unobtrusive methods for judging how interested a user is in an item are summarised in (Oard and Kim, 1998). These methods include the length of time that a user spends viewing an item, the number of times a user has viewed an item, the items that a user is willing to purchase, the items
that a user deletes from their collection and the items that a user chooses to retain in their collection. Unfortunately, (Oard and Kim, 1998) merely presents a summary of these methods and does not present any testing of the methods it mentions.

Of course, one of the problems with all of the methods mentioned above for modelling users unobtrusively is the fact that preferences based upon such data are likely to be less accurate than preferences based upon explicit user ratings. (Adomavicius and Tuzhilin, 2005) states that "[unobtrusive] ratings (such as time spent reading an article) are often inaccurate and cannot fully replace explicit ratings provided by the user. Therefore, the problem of minimizing intrusiveness while maintaining certain levels of accuracy of recommendations needs to be addressed by the recommender systems researchers" - (Adomavicius and Tuzhilin, 2005), p 12. This paper recognises the need for more research into unobtrusive user modelling and notes a number of papers that have reported on work in this area.

Unfortunately, there is a distinct lack of research published that deals with eliciting a user's musical preferences unobtrusively. The literature available on unobtrusive user modelling is often concerned with determining users' preferences in regard to websites and not their opinions on pieces of music. (Kiss and Quinqueton, 2001) mentions the use of navigation histories to gauge a user's level of interest in particular websites. It also proposes some more creative methods for using implicit input, such as matching the sort order of a search with the order that results were visited and using the time taken to press the 'back' button on a browser to judge a user's interest in a page. Although (Kiss and Quinqueton, 2001) is obviously based upon some amount of research, and claims "the implementation has started and is well advancing, and we begin to have some experimental results" - (Kiss and Quinqueton, 2001), p 15, disappointingly, results from the project are not easily available and, as user modelling forms only one part of the paper, it is unlikely that it would be easy to identify the impact that particular user modelling techniques had upon the results of this research. However, this paper does still present some useful ideas on making use of implicit preference information that could be adapted for use in a music recommender. (Middleton et al., 2001) describes similar techniques for user modelling and includes results of a number of exploratory case studies that show that this form of user modelling can be quite successful. This project built upon existing methods for user profiling and extended these to investigate methods for inferring a user's level of interest in an item from only implicit data.
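As a concrete example of the kind of mapping this involves, the sketch below converts one implicit signal, per-artist play counts, into 1-5 ratings by ranking the artists within a single user's listening history and splitting that ranking into five bands. The play-count data and the quintile mapping are illustrative assumptions only; they are not the ratings-generation scheme evaluated later in this thesis.

import java.util.*;

/**
 * Minimal sketch of turning purely implicit evidence into ratings, assuming
 * the implicit signal is a per-artist play count (as in a listening-history
 * dataset). The mapping (ranking each artist by play count within the user's
 * own collection and dividing that ranking into five equal bands) is an
 * illustrative assumption.
 */
public class ImplicitRatingsSketch {

    /** Map each artist's play count to a 1-5 rating by within-user quintile. */
    static Map<String, Integer> generateRatings(Map<String, Integer> playCounts) {
        List<String> byCount = new ArrayList<>(playCounts.keySet());
        byCount.sort(Comparator.comparingInt(playCounts::get));

        Map<String, Integer> ratings = new LinkedHashMap<>();
        int n = byCount.size();
        for (int i = 0; i < n; i++) {
            int rating = 1 + (i * 5) / n;            // 1 (least played) .. 5 (most played)
            ratings.put(byCount.get(i), Math.min(rating, 5));
        }
        return ratings;
    }

    public static void main(String[] args) {
        Map<String, Integer> plays = Map.of(
                "Radiohead", 412, "The Knife", 38, "Interpol", 97,
                "Aphex Twin", 12, "Sigur Ros", 230);
        System.out.println(generateRatings(plays));
    }
}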
2.3 Scrutability and Control
The literature discussed in the sections above all deals with the desire to make high quality recommendations. Once these recommendations are made, scrutability is concerned with explaining to the user why a particular recommendation was made. Further, control is concerned with allowing users to control a recommender system in order to improve recommendations. Research published in (Sinha and Swearingen, 2001) and (Sinha and Swearingen, 2002) shows that users are more willing to trust or make use of recommendations that are well explained (i.e. that are scrutable). Joseph Konstan, a leading figure in recommender systems research, noted that "adding scrutability to recommender systems is important, but hard" - (Konstan, J., personal communication, June 3, 2006).

Scrutability is a key component in a recommender system for a number of reasons. First, users are not always willing to trust a system when they are just beginning to use it. If users can be provided with some level of assurance that the recommendations made by a system are of a high quality, then they are more likely to trust that system. Such assurances are given to the user by showing why a particular recommendation was made. Scrutability is also useful in cases where a recommendation is made that a user believes is not appropriate. In this case, if a user can access some explanation for the recommendation, they may be more likely to understand why that recommendation might be of interest to them. Explanations may also help a user to identify areas where a system is making errors and, ideally, control functions should then be able to help the user alter the function of the system to make it less likely to make inappropriate recommendations. The value of control functions is not limited to allowing alterations to the recommendation process when errors occur. Rather, users can often make use of control functions at any time during the operation of a recommender system. This allows them to influence the process of recommendation in a way that hopefully leads to improved recommendation accuracy.

Sinha and Swearingen have shown that scrutability improves the effectiveness of a recommendation system. (Sinha and Swearingen, 2001) and (Sinha and Swearingen, 2002) published the results of research that involved asking users to test a number of publicly available recommendation systems and then evaluate their experience with each one. The findings of these studies show that "in general users like and feel more confident in recommendations perceived as transparent" - (Sinha and Swearingen, 2002), p 2. Although their experiments were on only a small scale, they were well designed, and the concept of the importance of transparency is supported by other research, such as that conducted by "Johnson & Johnson (1993) [who] point out that explanations play a crucial role in the interaction between users and complex systems" - (Sinha and Swearingen, 2002), p 1. A similar experimental study was
conducted in (Herlocker, 2000), which describes scrutability experiments conducted on a much larger sample group and confirms that "most users value explanations and would like to see them added to their [recommendation] system. These sentiments were validated by qualitative textual comments given by survey respondents" - (Herlocker et al., 2000), p 10. (Herlocker, 2000) describes in detail a series of approaches to adding scrutability to social filtering recommender systems. It reports on user trials that were conducted involving a large number of users, who were each asked to use prototype recommender systems and provide feedback on the value of the explanations given for recommendations. The results of these tests can be seen in Figure 2.3, which shows the most useful techniques for adding scrutability to be explanations showing histograms of ratings from like-minded users (nearest neighbours) and explanations showing the past performance of the recommender. (van Setten, 2005) also describes a small scale investigation into explanations for recommender systems, and (McSherry, 2005) and (Cunningham et al., 2003) present methods for explaining a particular method of recommendation, named Learn By Example. Some commercial systems (such as liveplasma, http://www.liveplasma.com) also offer innovative ways of presenting recommendations, such as Map Based presentation of items. Such presentations may increase the usefulness of recommendations and the ability of a user to understand these explanations.

The papers (and systems) mentioned above each demonstrate that scrutability can be beneficial in recommender systems, and present some ways of creating it. However, Scrutability & Control in recommender systems is an area which has not received much research attention, and thus there are still many questions to be answered regarding the best way to achieve these goals. Specifically, there is a lack of existing research into:

• Comparison of multiple recommendation techniques in terms of their usefulness and ability to be explained.
• Providing explanations for recommendation techniques other than social filtering.
• The impact of adding controls to a recommender system.
• The relationship between a user's understanding of a recommendation technique and the usefulness of its recommendations, and the potential trade-off between the two.
• The effect of a Map Based presentation on the usefulness and understandability of recommendations.

As a result, this project added Scrutability & Control features to the Duine Toolkit in order to build upon current research and investigate each of these areas.
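As a small illustration of the explanation style that performed best in Herlocker's trials, the sketch below renders a text histogram of how a user's nearest neighbours rated a recommended item. The neighbour ratings and the wording are invented for the example; this is not the interface Herlocker et al. evaluated, nor an explanation produced by the Duine Toolkit.

import java.util.*;

/**
 * Minimal sketch of a histogram-style explanation for a social filtering
 * recommendation: show how the active user's nearest neighbours rated the
 * recommended item, one bar per star value.
 */
public class NeighbourHistogramExplanation {

    static String explain(String item, List<Integer> neighbourRatings) {
        int[] counts = new int[6];                        // index 1..5 = star counts
        neighbourRatings.forEach(r -> counts[r]++);

        StringBuilder sb = new StringBuilder(
                "Your neighbours' ratings for \"" + item + "\":\n");
        for (int stars = 5; stars >= 1; stars--) {
            sb.append(stars).append(" stars | ")
              .append("#".repeat(counts[stars]))
              .append(" (").append(counts[stars]).append(")\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(explain("Fargo", List.of(5, 5, 4, 5, 3, 4, 5, 2)));
    }
}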
FIGURE 2.3: Mean Response Of Users To Each Explanation Interface, Based On A Scale Of One To Seven. Explanations 11 And 12 Represent The Base Case Of No Additional Information. Shaded Rows Indicate Explanations With A Mean Response Significantly Different From The Base Cases.
2.4 Conclusion

At this stage of the project, a number of key areas where more research was required were identified. The first of these areas was the provision of Unobtrusive Recommendation to users. Although there is existing work into unobtrusive modelling of a user's interests, most of this research has concentrated upon the field of web browsing. Using implicit data to infer a user's interests in items such as music or movies is an area where little research has been conducted. Thus, this project aimed to build upon existing work in the field of unobtrusive user modelling and investigate unobtrusive music recommendation. Adding Scrutability & Control to recommender systems is the second area where a lack of existing
research was identified. Current research into explaining and controlling recommender systems is quite sparse, and although some research does exist, there are still many questions to be answered regarding this goal. These questions include issues relating to the impact of adding controls to a recommender system, as well as many issues related to providing scrutable recommendations. Ultimately, this project aimed to advance research into both Scrutability & Control in recommender systems and Unobtrusive Recommendation.
CHAPTER 3
Exploratory Study
3.1 Introduction

The review of literature from Chapter 2 highlighted that there is a lack of existing research in the areas of scrutability, control and unobtrusiveness within recommender systems. This lack of research is especially prominent in the area of music recommendation, where little research at all has been published. Thus, this project aimed to investigate questions related to Scrutability & Control and Unobtrusive Recommendation. In order to investigate these areas, an exploratory study was first conducted, which involved the following tasks:
• A qualitative analysis of existing recommender technologies.
• The conduct of a questionnaire to investigate aspects of recommender systems, as a foundation for gaining the understanding needed to create a prototype recommender system.
• The creation of a dataset of implicit information about a large number of users, required for performing evaluations on a prototype at a later stage of the thesis.
The first stage of this research project was a qualitative analysis of a number of existing recommender systems and recommendation algorithms. This aimed to identify a suitable code base that could be extended into a prototype recommender system. An analysis of the recommendation algorithms contained in the chosen code base was then performed. This analysis aimed to discover methods that could be used to add controls and explanations to the prototype recommender system. To investigate users' attitudes toward these explanations and controls (as well as attitudes toward other aspects of recommender systems and usability), a questionnaire was conducted. The results of this questionnaire would be used later in this thesis to guide the construction of the prototype. Finally, a source of test data was established for use in evaluating the prototype. Each of these tasks is detailed in the sections below.
3.2 Qualitative Analysis

The system chosen as a code base needed to be open source and have good code quality, resource consumption (with particular reference to running time and memory usage) and recommendation quality. It would also be highly useful if it provided support for the implementation of features such as explanations, control features and unobtrusive recommendation. The recommendation toolkits that were examined during the course of this qualitative analysis include:

Taste: open-source recommender, written in Java. Available from http://taste.sourceforge.net/
Cofi: open-source, written in Java. Available from http://www.nongnu.org/cofi/
RACOFI: open-source, written in Java. Available from http://www.daniel-lemire.com/fr/abstracts/COLA2003.htm
SUGGEST: free, written in C. Available from http://www-users.cs.umn.edu/~karypis/suggest/
Rating-Based Item-to-Item: public domain, written in PHP. Available from http://www.daniel-lemire.com/fr/abstracts/TRD01.html
consensus: open-source, written in Python. Available from http://exogen.case.edu/projects/consensus/
The Duine Toolkit: open-source, written in Java. Available from http://sourceforge.net/projects/duine

The qualitative analysis of these systems began with an examination of the specifications of each toolkit. Further analysis involved the examination of any available reference documentation. This analysis, combined with learnings from the critical literature review described in Chapter 2, narrowed the candidates down to just Taste and the Duine Toolkit. At this stage, the code for each of these toolkits was downloaded and examined. Ultimately, the Duine Toolkit was chosen for the following reasons:

Well documented code base: the Duine Toolkit has complete and high quality documentation, as well as reference documents.

Good recommendation quality: (van Setten et al., 2004) showed that the Duine Toolkit is able to choose the most effective recommendation technique for a particular situation and thus is able to maximise the quality of recommendations.

Good resource usage: the Duine Toolkit has been built to conserve resources and ensures that the most resource intensive operations (which involve calculating the similarity between a user and all other users) occur only once for each user session, and not every time that a user rates an item.

Multiple recommendation methods: the Duine Toolkit has six built in recommendation techniques and the facility to dynamically alter the recommendation technique that is being used.
This meant that a system could be built that allowed users to easily swap from using one recommendation technique to another. This also meant that we could test issues regarding users' interactions with not just one, but several methods of recommendation.

Built in explanation facility: the Duine Toolkit was designed with explanations in mind — each recommendation that is created using this toolkit can have an explanation object attached to it, which describes exactly how that prediction was produced. This feature was included in the Duine Toolkit in anticipation of further extensions to the toolkit that enabled recommendations to be displayed.

Easy to add user controls: in the Duine Toolkit, personal settings can be set and saved for each user. Some of these settings affect the recommendations that are produced by the system. The fact that the Duine Toolkit can set and save such personal settings means that it could be extended to allow users to exert control over the recommendation process.
3.3 Recommendation Algorithm Analysis

Once the Duine Toolkit was chosen as the code base for this thesis, an analysis of the recommendation techniques that it provided was necessary. The major recommendation techniques made available within the Duine Toolkit are:
Most Popular: This technique recommends the most popular items, based on the average rating each item was given, across all users of the system.

Genre Based: This is a content-based technique that uses a user's ratings to decide what genres that user likes and dislikes. It then recommends items based upon this decision.

Social Filtering: This is a social filtering technique that looks at the current user's ratings and finds others who are similar to that user. These similar users are then used to recommend new items. (Note: this method also makes use of 'opposite users'.)

Learn By Example: This is a content-based technique that predicts how interested a user will be in a new artist by looking at how they have rated other similar items in the past. (Requires some measure of similarity to be defined.)

Information Filtering: This is a content-based technique that uses natural language processing techniques to process a given piece of text for each item (e.g. a description). This information, combined with a user's ratings, is used to predict the user's level of interest in new items.
Note that examination of this technique showed that it could be used to create recommendations that were either Lyrics Based (using lyrics from songs) or Description Based (using descriptions of particular artists).

Taste Strategy: As noted in Chapter 2, (van Setten et al., 2004) shows that this is the recommendation technique that produces the highest quality recommendations within the Duine Toolkit. This technique is, in fact, a 'Prediction Strategy' that is able to choose to make recommendations using any of the five techniques described above. This technique chooses the best available recommendation technique at any given point in time and makes recommendations using that technique. This is the default recommendation technique for the Duine Toolkit. Note that this technique was not considered as a candidate for the addition of scrutability or control, as it is a 'Prediction Strategy' that merely makes use of other recommendation techniques to make recommendations and does not actually create recommendations itself.

Thorough examination and testing were conducted on these algorithms to ascertain ways in which they could be explained and controlled. The results from this investigation are summarised in Figure 3.1. This table shows the possible explanations and control features that could be implemented for each of the recommendation algorithms within the Duine Toolkit. It also lists any problems that may be encountered when adding scrutability and control to that algorithm. For example, the entry for the Genre Based technique notes that recommendations produced using this technique could be explained by telling the user what genre an item belongs to and how interested the system thinks that user is in those genres. It also notes that one of the ways that users could be given control over this technique would be to allow them to specify their level of interest in particular genres. Finally, it shows that a possible problem that may be encountered when offering users controls and explanations for this technique would be if a user did not agree with the genres that an item was classified into.
Algorithm: Most Popular
    Possible Explanations: Tell the user where this item ranks in terms of popularity. Tell the user the average rating that has been given to this item. Tell the user how many users have rated this item.
    Possible Control Features: —
    Problems: —

Algorithm: Genre Based
    Possible Explanations: Tell the user the recommendation was based on the genres that item belongs to. Show the user how interested the system thinks they are in each genre.
    Possible Control Features: Allow the user to specify their interest in a particular genre.
    Problems: What if users don't agree with the genre classifications?

Algorithm: Social Filtering
    Possible Explanations: Show the user how similar users have rated an item. Show the user the similar users that factored heavily in their recommendation.
    Possible Control Features: Allow the user to specify the impact that similar and opposite users should have on recommendations. Allow the user to choose users who they want to be considered as similar to them.
    Problems: What if users do not think they are really similar to particular users? There is A LOT of information involved in this algorithm. The 'opposite users' idea is a hard one to convey.

Algorithm: Learn By Example
    Possible Explanations: Show the user the similar items that factored heavily in their recommendation and how they rated those similar items.
    Possible Control Features: Allow the user to specify what factors should determine the similarity between items.
    Problems: What if users do not think this item is actually similar to the items they have rated in the past.

Algorithm: Information Filtering
    Possible Explanations: Show the user the key words that are present in the descriptions of items that they have liked in the past.
    Possible Control Features: Allow user to control the features used in recommendation.
    Problems: Users might disagree with the keywords used to categorise their interest - even if these key words are quite appropriate. Users might not understand how this approach is working, especially if it works on something other than descriptions (e.g. it may work on the text from forum posts about an item).

FIGURE 3.1: Summary Of Possible Explanations And Control Features For The Major Algorithms In The Duine Toolkit.
The Taste Strategy was also examined at this stage, but it was found that because it switches between recommendation techniques, it is not a technique that can be explained in a consistent way to users. This meant that it was not considered as a suitable technique to add scrutability and control to.
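Since the Learn By Example technique "requires some measure of similarity to be defined", the sketch below shows one way such a technique can be structured: predict the rating for an unseen item as a similarity-weighted average of the user's past ratings, here using genre overlap (Jaccard) as the stand-in similarity measure. The similarity measure, the neutral fallback of 3, and all names are illustrative assumptions, not the Duine Toolkit's implementation.

import java.util.*;

/**
 * Minimal sketch of the "Learn By Example" idea: a user's predicted interest
 * in a new item is the similarity-weighted average of their ratings for items
 * they already know.
 */
public class LearnByExampleSketch {

    record Item(String title, Set<String> genres) {}

    /** Jaccard overlap of two items' genre sets, in [0, 1]. */
    static double similarity(Item a, Item b) {
        Set<String> common = new HashSet<>(a.genres());
        common.retainAll(b.genres());
        Set<String> union = new HashSet<>(a.genres());
        union.addAll(b.genres());
        return union.isEmpty() ? 0 : (double) common.size() / union.size();
    }

    /** Similarity-weighted average of the user's past ratings. */
    static double predict(Item target, Map<Item, Double> pastRatings) {
        double weighted = 0, totalSim = 0;
        for (Map.Entry<Item, Double> e : pastRatings.entrySet()) {
            double sim = similarity(target, e.getKey());
            weighted += sim * e.getValue();
            totalSim += sim;
        }
        return totalSim == 0 ? 3.0 : weighted / totalSim; // 3 = neutral fallback
    }

    public static void main(String[] args) {
        Map<Item, Double> rated = Map.of(
                new Item("Fargo", Set.of("Crime", "Drama")), 5.0,
                new Item("Alien", Set.of("Sci-Fi", "Horror")), 2.0);
        Item heat = new Item("Heat", Set.of("Crime", "Thriller"));
        System.out.printf("Predicted rating for Heat: %.2f%n", predict(heat, rated));
    }
}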
3.4 Questionnaire - Design

The recommendation algorithm analysis described in the previous section highlighted a number of usability features that could be added to a recommender system. Further, the analysis of existing recommender systems described in Section 3.2 and the review of literature described in Chapter 2 also brought to light some of the different usability features of existing recommender systems. In order to investigate how understandable and effective users would find these usability features, a questionnaire was designed. The results of this questionnaire would then be used to inform the construction of the prototype. A questionnaire was chosen as it was the most efficient way to gather large amounts of detailed information about users' opinions on the set of potential usability features. The specific aims of the questionnaire were to assess several potential usability features related to:

• Understanding of recommendations provided by various recommendation techniques.
• Usefulness of recommendations provided by various recommendation techniques.
• Attitudes toward control features for recommenders and understanding of how these would be used.
• Preferences for recommendation presentation format.

To this end, an extensive questionnaire was designed. It asked users to answer questions on a scale of 1 to 5, where 1 was the lowest score and 5 was the highest. Particular care was taken during the design of the questionnaire to ensure that each question would elicit useful information from participants and that all of the questions were clear and free of bias. An initial group of five respondents filled out the questionnaire, each answering 60 questions. After these respondents had completed the questionnaire, a number of revisions were made. These revisions included the removal of two questions, the addition of seven new questions and minor changes to the wording of a small number of questions. The questionnaire was then conducted with a further 13 people, who answered 65 questions (58 in common with the original questionnaire). Most respondents took around 40 minutes to complete the questionnaire. Figure 3.2 shows demographic information for each of the respondents. The sample group for this questionnaire was carefully selected to contain people from a variety of backgrounds and both males and females. The majority (12/18) of the users who completed the questionnaire were aged under 30. Since modern recommender systems are used most often by people who fall in the 18-30 age range, a higher proportion of respondents in this age range was deemed to be appropriate.
Participant | Age | Gender | Has An IT Background? | Has Used Any Type Of Recommender Before?
1  | 22 | F | N | Y
2  | 21 | F | N | N
3  | 20 | M | N | Y
4  | 30 | M | N | Y
5  | 22 | M | Y | Y
6  | 51 | F | N | N
7  | 52 | M | N | Y
8  | 19 | M | N | Y
9  | 21 | F | N | Y
10 | 22 | F | N | Y
11 | 22 | F | N | Y
12 | 21 | F | N | N
13 | 19 | M | N | Y
14 | 47 | F | N | N
15 | 48 | M | N | N
16 | 18 | F | N | Y
17 | 47 | F | N | Y
18 | 19 | F | N | Y
FIGURE 3.2: Demographic Information For Each Of The Respondents.

Sections 3.4.1 to 3.4.3 now describe the final set of questions that were presented to respondents. Although there were many questions, they were actually in three groups: Part A had one set of 5 questions, Part B had six sets of questions totalling 52 questions, and the Final Questions comprised one set of seven questions. The entire questionnaire is included as Appendix A.
3.4.1 Part A - Presentation Style

This section of the questionnaire aimed to investigate users' preferences for recommendation presentation format. At this stage, respondents were shown two forms of recommendation presentation. The first was a standard List Based format (shown in Figure 3.3) and the second was a Map Based format (shown in Figure 3.4), which was similar to the liveplasma1 interface mentioned in Chapter 2. After viewing an example of each presentation format, respondents were asked to rate how well they understood the information conveyed by that example and how useful they would find recommendations presented in this format. Finally, after viewing both formats, respondents were asked to indicate whether they would prefer the List Based format, the Map Based format or both.
3.4.2 Part B - Understanding & Usefulness

This section of the questionnaire aimed to investigate understanding of recommendations, usefulness of recommendations and attitudes toward control features. This section presented six recommendation techniques to respondents (Most Popular, Genre Based, Social Filtering, Learn By Example, Description Based and Lyrics Based). For each of these techniques, respondents followed the process described below.

1 http://www.liveplasma.com/
Figure 3.3: List Based Presentation That Was Shown To Participants In The Questionnaire.
Figure 3.4: Map Based Presentation That Was Shown To Participants In The Questionnaire.

Respondents were first presented with a short textual description of how this technique works. At this stage, they rated their initial understanding of the technique. Respondents were then presented with a number of explanation screens, each of which showed a recommended item and an explanation of why it was recommended (example explanation screens are shown in Figures 3.5 and 3.6). For each screen, respondents rated how well they understood why the recommendation had been made and how
useful they would find recommendations that were produced using this technique and explained in this fashion. If this technique had control features, then respondents were also presented with a control feature screen for each of the controls for this technique (an example control feature screen is shown in Figure 3.7). After viewing each control feature screen, respondents rated how well they understood how they would use this control, how likely they would be to use it and how useful they expected it would be. Finally, respondents rated the overall usefulness of this recommendation technique, and their overall understanding of it.
Figure 3.5: One Of The Explanation Screens Shown To Participants In The Questionnaire. This Screen Explains Recommendations From The Learn By Example Technique.
Figure 3.6: One Of The Explanation Screens Shown To Participants In The Questionnaire. This Screen Explains Recommendations From The Social Filtering Technique.
3.4.3 Final Questions - Integrative

This section of the questionnaire aimed to investigate the usefulness of recommendation techniques and attitudes toward explanations and control features.
Figure 3.7: The Genre Based Control Shown To Participants In The Questionnaire.
At this stage of the questionnaire, respondents were asked to indicate their general opinion on the usefulness of all six recommendation techniques. They first ranked the techniques from 1 to 6 in order of usefulness. Respondents were then asked to indicate the weight they would want to place on each technique if a combination of techniques were to be used in a recommender system. The weight that could be placed on each technique ranged from ‘Not At All’ (weight of 0) to ‘Very Much’ (weight of 100). The final five questions in the questionnaire asked respondents to rate how useful they would find the following five potential features of a recommender system:
System Chooses Recommendation Method: The recommender system chooses the best recommendation technique to use at any point in time.
System Chooses Combination Of Recommendation Methods: The recommender system chooses a combination of recommendation techniques to be used.
View Results From Other Recommendation Methods: The recommender system chooses the best recommendation technique to use at any point in time. However, users are then able to view what their recommendations would look like if other recommendation techniques were used.
Explanations: Explanations are provided for how recommendations were made.
Controls: Users are given some amount of control over how recommendations are made.
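The combination-of-methods feature above amounts to a weighted hybrid: each technique produces a predicted rating, and user-supplied weights (0-100) determine how much each prediction contributes. The following Python fragment is a minimal illustrative sketch of that idea only; it is not the Duine Toolkit's actual combination strategy, and the technique names and data structures are assumptions made for the example.

```python
def combine_predictions(predictions, weights):
    """Blend per-technique predicted ratings into one score per item.

    predictions: {technique: {item: predicted rating on a 1-5 scale}}
    weights:     {technique: user-chosen weight, 0-100}
    Returns {item: weighted average of the predictions available for that item}.
    """
    totals, weight_sums = {}, {}
    for technique, item_scores in predictions.items():
        w = weights.get(technique, 0)
        if w == 0:
            continue  # the user switched this technique off
        for item, score in item_scores.items():
            totals[item] = totals.get(item, 0.0) + w * score
            weight_sums[item] = weight_sums.get(item, 0) + w
    return {item: totals[item] / weight_sums[item] for item in totals}

# Hypothetical usage: two techniques predicting ratings for the same items.
preds = {"Genre Based": {"item1": 4.0, "item2": 2.0},
         "Social Filtering": {"item1": 5.0, "item2": 3.0}}
print(combine_predictions(preds, {"Genre Based": 80, "Social Filtering": 90}))
```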
These final questions would give an overall picture of users' attitudes toward a variety of potential features of a recommender system. As well as providing useful information, these questions also acted as internal consistency checks, allowing a user's answers to be validated. For example, when asked to rank the
recommendation techniques in order of usefulness, a user’s answers would be expected to correlate with answers to usefulness questions asked earlier in the survey.
3.5 Questionnaire - Results

In total, 5 respondents answered the initial questionnaire (60 questions) and a further 13 respondents answered the revised questionnaire (65 questions). We now present and discuss the results of the questionnaire, with reference to the aims expressed in Section 3.4. The results in this section are rather long because they report respondents' answers in terms of recommendation usefulness, recommendation understanding, control features and presentation method; each of these factors is important and each of them is different. For each factor, this section reports a small number of averages, together with illustrative additional data that helps in interpreting the results, followed by a summary of the conclusions and a separate list of the implications for the prototype design. Although this section is quite long, it has not been relegated to an appendix because it is all new information about how users can understand and control recommenders.
3.5.1 Usefulness

This section discusses the questionnaire results relevant to the aim of assessing the perceived usefulness of recommendations provided using various recommendation techniques. In Part B of the questionnaire, respondents rated the usefulness of 18 screens that presented recommendations. The screens that had the maximum average usefulness for each technique are presented in Figure 3.8, along with their average rating (error bars show one standard deviation above and below the mean; actual results for each respondent are shown in Appendix B). For example, of the five Social Filtering screens presented to respondents, the one with the highest average usefulness rating was the Simple Text screen, so this is the Social Filtering screen shown in Figure 3.8. In the Final Questions section of the questionnaire, respondents ranked the recommendation techniques in order of usefulness (where 1 is the highest possible ranking and 6 is the lowest). Figure 3.9 shows the average ranking given to each technique, together with the standard deviation (actual results for each respondent are shown in Appendix B).
Figure 3.8: The Screens With The Maximum Average Usefulness For Each Recommendation Method. Error Bars Show One Standard Deviation Above And Below The Mean. N = 18.

Technique            Avg.   St. Dev.
Word of Mouth        1.9    1.3
Genre Based          2.4    1.2
Most Popular         2.8    1.3
Learn By Example     3.3    1.0
Description Based    4.6    1.0
Lyrics Based         5.8    0.5
Figure 3.9: Average Ranking Given To Each Recommendation Technique. N = 18. Top Ranking = 1. Bottom Ranking = 6.

In the Final Questions section, respondents also indicated the weight they would want to place on each technique if a combination of techniques was to be used. Figure 3.10 shows the average weight (0-100) chosen for each method. Note that respondents could choose any value from 0 to 100 for each technique. For example, Participant 6 gave Most Popular a weight of 30, Genre Based a weight of 80, Social Filtering a weight of 90, Learn By Example a weight of 70, Description Based a weight of 30 and Lyrics Based a weight of 0. We now discuss these results.

Social Filtering: This method had the best average ranking (1.9, where 1 is the best possible ranking) and had high average usefulness scores, but, surprisingly, it had only the second highest average contribution, with a weight of 68. Six people indicated that Social Filtering should have the most contribution, but low scores from other respondents caused this technique to receive a lower average contribution score than Genre Based. Social Filtering (Simple Text) was the highest rated Social Filtering screen. This screen had the highest average usefulness rating (4.4/5) of all screens shown in the questionnaire. The next highest rated Social Filtering screen was
Figure 3.10: Average Response For Contribution That Each Method Should Make To A Combination Of Recommendation Methods. Error Bars Show One Standard Deviation Above And Below The Mean. N = 18.
the Simple Graph screen, with an average of 3.9/5. Although Social Filtering (Similar Users) had an average usefulness score of 3.1/5 (the lowest of all Social Filtering screens), four respondents commented that they thought the Social Filtering (Similar Users) screen was useful because it allowed them to view similar users and their profiles. One respondent commented that Social Filtering "is a great way to recommend new music." A further two people commented that this method would be useful, as long as similarity between users was calculated accurately. Another person commented that they did not like the idea of opposite users factoring into their recommendations. Finally, another commented that they would like to be able to indicate friends who have similar interests and are already using the recommender system.

Genre Based: This method received the highest average contribution score (76) — six people indicated that this technique should have the most contribution. It was also given the second best average ranking (2.4). However, one respondent did mention that he thought classifying items by genres was too broad. The Genre Based (Simple Text) screen had the second highest average usefulness (4.1/5) of all screens presented in the questionnaire, and the two Genre Based screens both had average scores of 4 or more. Two people commented that they thought Genre Based (Genre Listing) was the better Genre Based screen as it provided more information.

Learn By Example: This method had an average contribution score of 58, and only two people indicated that this method should have the highest contribution. It was given an average ranking of 3.3, the fourth best average ranking. The Similar Artists screen had the highest average usefulness score of the Learn By Example screens, with an average usefulness
of 4.0/5 — the third highest average usefulness score. One respondent commented that they doubted whether similarity between artists could be calculated objectively.

Most Popular: Five respondents commented that they would not necessarily be interested in the most popular items. However, Most Popular had an average contribution score of 68 (equal second highest, alongside Social Filtering), and seven people indicated that Most Popular should have the most contribution. Most Popular was also given an average ranking of 2.8, which was the third best average ranking. The two screens displaying Most Popular recommendations — Most Popular (Ranking) and Most Popular (Avg. Rating Info.) — had average scores of 3.5/5 and 3.4/5 respectively.

Description Based: This method had an average contribution score of 41 and the second worst average ranking. Respondents viewed only one screen that presented Description Based recommendations. This screen had an average usefulness rating of 2.7/5, the second lowest average usefulness score. Nine people commented that they doubted the usefulness of using descriptions to make recommendations. Four of these people commented that descriptions are too subjective to be useful.

Lyrics Based: This method had an average contribution score of 12 and the worst average ranking. Respondents viewed only one screen that presented Lyrics Based recommendations. This screen had an average usefulness rating of 2.2/5, the lowest average usefulness score. Nine respondents commented that they did not think lyrics would be useful for making recommendations. Seven of these commented that lyrics did not determine whether they liked an item.
Findings.

• Social Filtering and Genre Based were judged by respondents to be the most useful techniques. This is supported by the fact that these two methods both had either the first or the second best average score on every question.
• Respondents were less interested in having Most Popular recommendations delivered on their own than they were in having this recommendation method combined with other techniques. We can see this because this method had the second highest average weight in the question regarding how techniques should be combined, yet five respondents commented that they were not interested in just the most popular items.
• Respondents did not think that Description Based or Lyrics Based would be useful. This is shown by the fact that these two methods consistently had the lowest average scores for each question.
• Social Filtering (Simple Text), Genre Based (Simple Text), Most Popular (Ranking) and Learn By Example (Simple Text) were all judged by respondents to be the most useful screens for their particular recommendation techniques.
• Genre Based (Simple Text) and Genre Based (Genre Listing) were approximately equally useful (their average usefulness scores were quite similar) and each offered a different form of useful information.
• Most Popular (Avg. Rating Info.) and Most Popular (Ranking) were approximately as useful as one another (their average usefulness scores were quite similar) and each offered a different form of useful information.
• Some users would find the Social Filtering (Similar Users) screen useful. This screen did not receive a high average usefulness score, but four respondents commented that they liked the ability it provided to examine the ratings of similar users.
Implications for the prototype.
• Social Filtering and Genre Based should be included as recommendation techniques.
• Most Popular should be included as an optional recommendation technique, or one which can be combined with other techniques.
• Learn By Example should also be included as a recommendation technique, as it was not found to be significantly less useful than the top three recommendation techniques.
• Description Based and Lyrics Based should not be included in the prototype.
• Social Filtering (Simple Text), Genre Based (Simple Text), Most Popular (Ranking) and Learn By Example (Simple Text) should all be included as explanation screens in the prototype.
• Genre Based (Simple Text) and Genre Based (Genre Listing) should be combined into a single explanation screen, as their average usefulness scores were similar and each displays a different piece of information which would be useful to users. Further, these two screens could easily be combined without causing conflicting information to be displayed. For the same reasons, Most Popular (Avg. Rating Info.) and Most Popular (Ranking) should also be combined.
• Social Filtering (Similar Users) should be considered for implementation in the prototype.
3.5.2 Understanding

This section discusses the questionnaire results relevant to the aim of assessing understanding of recommendations provided using various recommendation techniques. In Part B of the questionnaire, respondents rated their understanding of the 18 screens that presented recommendations. The screens that had the maximum average understanding for each technique are presented in Figure 3.11, along with their average rating (error bars show one standard deviation above and below the mean; actual results for each respondent are shown in Appendix B). For example, of the five Social Filtering screens presented in the questionnaire, the one with the highest average understanding rating was the Simple Text screen, so this is the Social Filtering screen shown in Figure 3.11 (third bar from the left).
Figure 3.11: The Screens With The Maximum Average Understanding For Each Recommendation Method. Error Bars Show One Standard Deviation Above And Below The Mean. N = 18.

In Part B of the questionnaire, respondents also rated their understanding of four recommendation techniques before and after they saw the screens for that technique. Figure 3.12 shows the average understanding rating given to each technique before and after explanations, with error bars showing one standard deviation above and below the mean (actual results for each respondent are shown in Appendix B). We now discuss the results shown in Figures 3.11 and 3.12.

Social Filtering: Social Filtering (Simple Text) had the highest average understanding of all the Social Filtering screens, with 4.6/5, which was also one of the highest average understanding scores given to any of the 18 explanation screens. The Social Filtering (Simple Graph) screen (average of 4.5/5) and the Social Filtering (Table) screen (average of 4.3/5) also received high average scores for understanding. Both Social Filtering (Graph w/ Opposites) and Social Filtering (Similar Users)
Figure 3.12: Respondents' Average Understanding Of Recommendation Methods Before And After Explanations. Error Bars Show One Standard Deviation Above And Below The Mean. N = 18.
showed ‘opposite users’ in their explanation, but three users said that they were confused by the ‘opposite users’ concept, and these screens had the lowest average ratings of all the Social Filtering screens in the questionnaire (Social Filtering (Similar Users) averaged 3.9/5 and Social Filtering (Graph w/ Opposites) averaged 3.8/5 — these were the only average scores below 4.0). Social Filtering was given the highest average understanding rating before explanations were provided (average of 4.4/5). However, after explanations were provided, the average for this technique dropped to 3.9/5 — the lowest after-explanation average understanding rating. As mentioned above, three respondents commented that ‘opposite users’ had confused them, and a further two people commented that the explanations contained too much information and were confusing.

Genre Based: Two Genre Based screens were presented in the questionnaire. Genre Based (Simple Text) received the highest average understanding of all the explanation screens — 4.7/5. Genre Based (Genre Listing) also received a high average understanding rating of 4.6/5 — the third highest average understanding given to any of the 18 explanation screens. One respondent commented that Genre Based (Simple Text) was the better of the two Genre Based screens as it gave "more information about the individual artist and not just a genre". However, another commented that Genre Based (Genre Listing) was better, as it was more related to his ratings and profile. Genre Based actually received the lowest average understanding rating before the explanation screens were provided (average of 4.2/5). Remarkably, after explanations, the average understanding rating for this method increased to 4.8/5. Eight people gave this method a higher
understanding rating after viewing the explanation screens, ten gave it the same rating, and no respondents gave it a lower rating.

Learn By Example: Learn By Example (Simple Text) had the highest average understanding rating of the two Learn By Example screens presented in the questionnaire, with an average of 4.2/5, which was just higher than the average of 4.1/5 for Learn By Example (Similar Artists). Learn By Example had the equal highest average understanding (4.4/5) before explanation screens were presented. However, this dropped to an average of 4.1/5 after respondents viewed the explanation screens — this was the second lowest after-explanation average. Only one respondent gave Learn By Example a higher understanding rating after explanations, fourteen gave it the same rating and three gave it a lower understanding rating.

Most Popular: The Most Popular screen with the highest average rating was Most Popular (Ranking), with a score of 4.7/5 (which was the highest average understanding across all the explanation screens). However, Most Popular (Avg. Rating Info.) also received a score of 4.5/5. Five people commented that Most Popular (Ranking) made recommendations easier to understand as it gave more information. One person commented that he would like comments from users about that item to be added to the screen, indicating why they liked or disliked it. Figure 3.12 shows that this method improved from an average understanding of 4.3/5 before explanations to an average of 4.6/5 after the viewing of explanation screens. The average understanding rating for Most Popular after explanations is the second highest average understanding score shown in Figure 3.12. Four respondents gave Most Popular a higher understanding rating after explanations, twelve respondents gave it the same rating and two gave it a lower understanding rating.

Description Based: Respondents viewed only one screen that presented Description Based recommendations. This screen had an average understanding rating of 4.0/5, which is the lowest of all the scores shown in Figure 3.11. Four respondents gave this method a score of 3 or less. This method is not shown in Figure 3.12 because once the first five respondents had completed the questionnaire, respondents were no longer asked to report their understanding of this method before and after viewing its screens. This decision was made because this method had been given low usefulness and low understanding scores by the first five respondents.

Lyrics Based: Respondents viewed only one screen that presented Lyrics Based recommendations. This screen had an average understanding rating of 4.1/5, which is the second lowest of
all the scores shown in Figure 3.11. Three people gave this method a score of 3 or less. One respondent commented that the way this method works "just seems to make no sense". This method is not shown in Figure 3.12 because once the first five respondents had completed the questionnaire, respondents were no longer asked to report their understanding of this method before and after viewing its screens. This decision was made because this method had been given low usefulness and low understanding scores by the first five respondents.
Findings. The findings that came from this section of the questionnaire were:

• Each of the recommendation techniques can be explained in a way that users can easily understand. This is supported by the fact that all of the values shown in Figure 3.12 were close to or above 4.0.
• When explaining recommendations, providing more information can often be beneficial. This is supported by user comments that indicated a desire for more information about recommendations. However, it is important to find a clear, concise way to deliver that information to people.
• Complicated or poor explanations will often confuse a user's understanding of a recommendation technique. For example, three people commented that the ‘opposite users’ idea was confusing. Further, the screens showing opposite users received the lowest average understanding scores, and after these screens were shown to users, the average understanding of the Social Filtering technique dropped from 4.4/5 to 3.9/5. This finding was also reported in (Herlocker et al., 2000).
• Social Filtering (Simple Text), Genre Based (Simple Text), Most Popular (Ranking) and Learn By Example (Simple Text) were judged by users to be the most understandable explanations of their recommendation techniques (as each of these had the highest average understanding of the screens for their technique).
• Social Filtering (Simple Graph) was almost as understandable as Social Filtering (Simple Text) (their average understanding scores were only 0.1 points apart).
• Similarly, Learn By Example (Similar Artists) was almost as understandable as Learn By Example (Simple Text) (their average understanding scores were only 0.1 points apart).
• Genre Based (Simple Text) and Genre Based (Genre Listing) were approximately as effective at explaining recommendations as one another (their average understanding scores were quite similar) and each offered a different form of useful information.
• Most Popular (Avg. Rating Info.) and Most Popular (Ranking) were also approximately as effective at explaining recommendations as one another (their average understanding scores were quite similar) and each offered a different form of useful information.
• The inclusion of the ‘opposite users’ concept negatively affected users' perceived understanding of the Social Filtering (Similar Users) screen. This is supported by the fact that four respondents commented that the ‘opposite users’ concept confused their understanding of Social Filtering.
• People found Learn By Example harder to understand than techniques such as Most Popular, Genre Based and even Social Filtering. This is surprising, as one of the benefits often noted for the Learn By Example technique is the "potential to use retrieved cases to explain [recommendations]" (Cunningham et al., 2003, p. 1).
• Different people prefer different styles of explanation. Evidence supporting this finding includes the fact that different users rated their understanding of different explanation screens higher than others.
Implications for the prototype.
• Social Filtering (Simple Text), Genre Based (Simple Text), Most Popular (Ranking) and Learn By Example (Simple Text) should all be included as explanation screens in the prototype.
• Learn By Example (Simple Text) and Learn By Example (Similar Artists) should be combined into a single explanation screen, as their average understanding scores were similar and each displays a different piece of information which would be useful to users. Further, these two screens could easily be combined without causing conflicting information to be displayed.
• The case for combining Most Popular (Avg. Rating Info.) with Most Popular (Ranking), and Genre Based (Simple Text) with Genre Based (Genre Listing), is also strengthened by these results, as each of these pairs had similar average understanding ratings.
• Social Filtering (Similar Users) should be included in the prototype, without any reference to ‘opposite users’. This is because the ability to view similar users was deemed useful by some respondents, and the ratings for this control may have been negatively affected by the fact that it displayed ‘opposite users’ — a concept which consistently confused people.
3.5.3 Understanding And Usefulness

The Pearson correlation was calculated between the ratings that respondents gave for the usefulness of particular explanation screens and the ratings that they gave for their understanding of these screens. This correlation was calculated to be 0.28. Squaring this value gives 0.078, or 7.8 percent. This suggests that a user's understanding of a recommendation is related to how useful they deem it to be: roughly 7.8 percent of the variance in respondents' usefulness ratings can be accounted for by their understanding ratings. This result is consistent with a number of cases that were observed within the questionnaire. Particularly significant were the cases in which a user's understanding was confused by complicated concepts within explanations. This often caused a decrease in both the user's understanding rating and their usefulness rating for that screen.
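For reference, the correlation figure quoted above can be reproduced with a short calculation. The sketch below (in Python, with made-up rating lists standing in for the real per-screen data in Appendix B) computes the Pearson correlation between paired usefulness and understanding ratings and squares it to obtain the proportion of shared variance.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length rating lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Illustrative values only; the study used the full per-screen ratings.
usefulness = [4, 3, 5, 2, 4, 3]
understanding = [5, 3, 4, 3, 4, 2]
r = pearson(usefulness, understanding)
print(r, r ** 2)  # r**2 is the proportion of shared variance (0.078 in the study)
```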
Findings.

• A user's opinions on the usefulness of recommendations are related to their understanding of these recommendations.
3.5.4 Control

This section discusses the questionnaire results relevant to the aim of assessing users' attitudes toward features that provide control over recommendation techniques and their understanding of how these would be used. In Part B of the questionnaire, respondents rated three control features according to how well they understood each control, how useful they thought each control would be and how likely they would be to use that control. Figure 3.13 shows the average score for each of these questions, with error bars showing one standard deviation above and below the mean (actual results for each respondent are shown in Appendix B).

Genre Based Control (Genre Slider): This control had the highest average scores for understanding (4.9/5), usefulness (4.5/5) and likelihood of use (4.6/5). All but two respondents gave this control a 5 for understanding; the other two gave it a 4. All but three people gave this control a 5 for how likely they would be to use it, and all but one respondent gave this control a rating of 4 or 5 when asked how useful they thought it would be. Further, seven users
Figure 3.13: Average Ratings For Questions Regarding Respondents' Understanding, Likelihood Of Using And Perceived Usefulness Of Each Control Feature. Error Bars Show One Standard Deviation Above And Below The Mean. N = 18.

commented that they strongly liked this control. One respondent commented that they would like to specify interest in more specific genres (i.e. sub-genres), but another commented that they thought too many genres would become confusing for users.

Social Filtering Control (Like/Not Like): This control had the second highest average scores on all questions. Its average ratings were 4.6/5 for understanding, 3.5/5 for likelihood of use and 4.3/5 for usefulness. All but two respondents gave this control a rating of 4 or 5 for understanding, and the other two gave a rating of 3. Most users also gave this control a rating of 4 or 5 for usefulness. However, there was much more variation in the likelihood of use ratings for this control. In fact, this question had the second highest standard deviation (1.3) of any question asked about the three controls, and responses to this question were distributed relatively evenly between 1 and 5.

Social Filtering Control (Adjust Influence): This control had the lowest average scores for all questions, with an average understanding rating of 3.8, a likelihood of use rating of 3.0 and a usefulness rating of 3.4. This control asked users to adjust the impact of ‘opposite users’ upon recommendations. As mentioned in Section 3.5.2, three users commented that the concept of ‘opposite users’ was confusing, and the average understanding ratings for the Social Filtering technique fell when this concept was introduced. The ratings given to this control were highly varied — three people responded with a 5 for the usefulness of this control and a 5 for their likelihood of using it, yet three others gave scores of only 1 or 2 for both of these questions (each of these three gave lower ratings for their understanding of the Social Filtering technique once the concept of ‘opposite users’ was
introduced).
Findings.
• The Genre Based Control (Genre Slider) would get used often and would be easy to understand. Further, respondents also believed that it would be very useful. These findings are supported by the fact that this control received the highest average scores, and most users gave a rating of 4 or 5 for all questions regarding this control.
• It is important to get the number of available genres correct when allowing users to specify their interest in genres. This is supported by the fact that users commented that having too many genres would be overwhelming.
• Social Filtering Control (Like/Not Like) is easy to understand (most users gave a rating of 4 or 5 for understanding). It would be used by some, but not all, users (as there was a high variation in likelihood of use ratings). Further, most users would find this control to be quite useful (most users gave 4 or 5 for usefulness).
• In general, most users would not understand how Social Filtering Control (Adjust Influence) works and most users would not use it. Most respondents believed that this control would not be very useful. These findings are supported by the fact that this control scored the lowest average rating on every question, and three users commented that they were confused by the opposite users concept, which is a part of Social Filtering Control (Adjust Influence).
Implications for the prototype. Based upon these findings, it was decided:
• To include Genre Based Control (Genre Slider) in the prototype. It is important that the right number of genres is used with this control: the number of genres should not be too large (as this may become overwhelming) and should not be too small (as this may not be useful).
• To include Social Filtering Control (Like/Not Like) in the prototype. This control may not be rated highly by all users, but it is worth testing its effectiveness in a real prototype.
• Not to include Social Filtering Control (Adjust Influence) in the prototype.
3.5.5 Presentation Method

This section discusses the questionnaire results relevant to the aim of assessing users' preferences for recommendation presentation format. In Part A of the questionnaire, respondents rated their understanding and opinion of the usefulness of two presentation methods: Map Based and List Based. Figure 3.14(a) shows the average score for each of these questions, with error bars showing one standard deviation above and below the mean. Users also indicated their preference for the way in which they would like recommendations to be displayed; Figure 3.14(b) shows the sums of responses to this question. The actual results for each respondent are shown in Appendix B.
[Figure 3.14 comprises two charts: (a) Understanding And Usefulness Of Presentation Methods; (b) Sum Of Recommendation Presentation Preferences (List Only, Both List And Map, Map Only).]
Figure 3.14: Users' Responses For Questions Regarding Recommendation Presentation Methods. Error Bars Show One Standard Deviation Above And Below The Mean. N = 18.

Ten users indicated that they would prefer to have only the List Based presentation. Four of these users commented that List Based is quicker to understand and read. These comments are supported by the results shown in Figure 3.14, which show that List Based had an average understanding rating of 4.7/5, exactly one point higher than the average understanding rating of 3.7/5 for Map Based. In addition, seven users commented that the map took longer to work out. However, List Based and Map Based had similar average usefulness scores — List Based scored an average of 3.8/5 and Map Based an average of 3.5/5. Two users indicated that they would like to have recommendations presented through a Map Based presentation only, and six users indicated that they would like to have recommendations displayed in both List Based and Map Based formats. Four users commented that the map gave more information and was useful for that reason.
Findings.

• Most users would find a List Based presentation easier to understand and quicker to read than a Map Based presentation. This is supported by the fact that users commented that a list based presentation is quicker and easier to read, and by the fact that the List Based presentation scored a higher average understanding rating than Map Based.
• In general, users indicated they would find a List Based presentation useful. This is evidenced by the fact that 16/18 respondents indicated that they would want List Based as a part of their recommendation system, and this presentation received the highest average usefulness score.
• Some users indicated they would also find a Map Based presentation to be useful. Evidence supporting this finding includes the fact that 8/18 users indicated that they would want a Map Based presentation included in a recommender.
• Different people prefer different styles of presentation. This was shown through the variation in the ratings that were given for the questions regarding presentation.

Implications for the prototype. Based upon these findings, it was decided:

• To definitely include a List Based presentation in the prototype.
• That there was enough support for the usefulness of a Map Based presentation to include it in the prototype, in order to examine how users would interact with an implementation of a Map Based presentation.
3.5.6 Final Questions

This section discusses the results from the final questions asked of users, which gave an overall indication of their opinion of the various features shown in the questionnaire. In the Final Questions section of the questionnaire, respondents rated the general usefulness of five features that could be included in a recommender system. Figure 3.15 shows the average ratings for each of these features, with error bars showing one standard deviation above and below the mean.

Choice Of Recommendation Method: The average rating for the usefulness of the system deciding which recommendation method should be used was 3.6/5. Most people gave this feature a rating of 3 or more, but one person gave this feature a rating of 1 while giving all other features mentioned in this section a rating of 5. The average rating for this feature was much lower than
Figure 3.15: Average Rating For The Usefulness Of Possible Features Of A Recommender. Error Bars Show One Standard Deviation Above And Below The Mean. N = 18.

the average rating for the usefulness of having the system choose a combination of methods (average of 4.6/5). There was very little deviation in the responses given to the usefulness of the system selecting a combination of methods, with all respondents giving ratings of either 4 or 5. This feature had the highest average rating of all features presented in this section of the questionnaire. Another feature with a high average usefulness rating was the ability to view recommendations made using different recommendation techniques, which had an average of 4.5/5. One respondent commented that "viewing what your recommendations would be like with different methods allows you to compare the usefulness of each method and choose the best one" and another commented that it would be "interesting and useful to see what your recommendations would look like using different methods."

Explanations: The average rating for the usefulness of explanations was 3.8/5. One respondent commented that the addition of explanations "allows you to make your own judgments about the usefulness of the results." More than half of the respondents gave explanations a usefulness rating of 4 or 5.

Controls: The average rating given by users for the usefulness of controls was 4.5/5. As noted in Section 3.5.4, seven respondents commented that they had a strong liking for the Genre Based Control (Genre Slider). Twelve respondents rated the usefulness of controls as 5, four rated it as 4, and the remaining two gave controls scores of 2 and 1.
Findings.
• Rather than having the system choose a single recommendation technique to use, people would prefer to have the system choose a combination of recommendation techniques or allow them to view recommendations using various techniques. This is supported by the fact that, on average, users rated the usefulness of the ‘System chooses recommendation method’ feature lower than the features that involved a combination of recommendation techniques and viewing recommendations using different techniques.
• People in our study believed that explanations would be a useful addition to a recommender system. This is evidenced by the fact that users gave an average of 3.8/5 when asked to rate the usefulness of explanations, and more than half of the respondents gave a score of 4 or 5.
• In general, people in our study believed that having control over a recommender system would be very useful. This is supported by the fact that users gave an average of 4.5/5 when asked to rate the usefulness of having control over a recommender system.
Implications for the prototype.

• The prototype should allow users to view recommendations produced using various techniques and/or make recommendations using a combination of prediction techniques.
• The prototype should contain explanations for the recommendations that it produces. These explanations should be offered to users if they are interested.
• The prototype should allow users to have control over certain elements of the recommender system, to help them improve their recommendations.
3.6 Test Data

In order to perform evaluations at a later stage of the thesis, a source of test data needed to be established. (Polcicova et al., 2000), (Maltz and Ehrlich, 1995), (Konstan et al., 1997) and (Basu et al., 1998) note that recommender systems are likely to exhibit poor performance unless they contain a significantly large number of user ratings. As a result, the data set used for testing needed to be large enough to allow effective recommendations to be made. In addition, the type and quantity of test data that could be obtained would heavily influence the process of creating and evaluating a prototype at later stages of the project. An ideal set of test data for this project would have been a data set that contained information about around 1000 users, detailing:
• Their ratings for particular artists.
• The time that they spent listening to individual music tracks.
• The actions that they performed while listening to music tracks.

This mixture of music ratings information and listening patterns was desirable, as it would allow ratings generated from implicit data to be compared with each user's explicit ratings. However, the lack of sources of information regarding music ratings and listening patterns meant that it was not possible to find a single data set containing both users' explicit ratings and information about listening habits. Further, it was not possible to find any significant source of information about actions users had performed while listening to music.

A dataset used in (Hu et al., 2005) was identified as a possible source of test data. This dataset is a collection of users' ratings for particular albums, taken from the epinions.com2 website. However, this dataset was inadequate for use in this project, as it was deemed too small to enable a recommendation system to produce good recommendations. last.fm, an online radio service, was another source of data that was identified. This service makes a large amount of data on users' play-counts available through a web service. Due to the large amount of data available through this service, it was decided to use it to produce a dataset for investigating Unobtrusive Recommendation. Reading data from this service produced an initial dataset of 500,000 play-counts, spanning 10,000 artists and 5,000 users. This dataset was then culled (to remove the users and artists that had few play-counts associated with them) to a size of 100,000 play-counts, spanning 3333 artists and 948 users.

However, at this stage, the only source of test data that had been established was implicit data based upon users' listening patterns. This data would indeed be useful for exploring the Unobtrusive Recommendation question, yet it was not ideal for exploring the Scrutability & Control question. This is because, if scrutability and control features were added to a prototype that generated ratings from implicit data, the performance of these features might be affected by the fact that the underlying data was implicit rather than explicit. Therefore, a data set consisting of explicit ratings was required in order to investigate the Scrutability & Control question. At this point, no significant source of explicit music ratings could be located, so it was decided that the MovieLens standard dataset (which provides explicit ratings on movies) should be used to investigate issues relating to Scrutability & Control. This dataset contains 100,000 ratings, from 943 users, on 1682 movies. Thus, two datasets were chosen for use in this thesis: a dataset compiled from data taken from last.fm and the MovieLens standard dataset.

2 http://www.epinions.com
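The culling step described above simply discards users and artists with too few associated play-count records. The exact thresholds are not stated in this chapter, so the cut-off values in the following sketch are assumptions chosen purely for illustration.

```python
def cull(play_counts, min_per_user=20, min_per_artist=20):
    """play_counts: list of (user, artist, count) triples read from last.fm.

    Repeatedly removes users and artists with too few play-count records,
    since removing one can push the other below its threshold.
    The threshold values here are illustrative, not those used in the thesis.
    """
    records = list(play_counts)
    while True:
        user_freq, artist_freq = {}, {}
        for user, artist, _ in records:
            user_freq[user] = user_freq.get(user, 0) + 1
            artist_freq[artist] = artist_freq.get(artist, 0) + 1
        kept = [(u, a, c) for u, a, c in records
                if user_freq[u] >= min_per_user and artist_freq[a] >= min_per_artist]
        if len(kept) == len(records):
            return kept  # nothing more to remove; the dataset has stabilised
        records = kept
```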
Implications for the prototype. The prototype would need two variants in order to separately test the two goals of the thesis:

• A prototype based upon the MovieLens standard dataset, to investigate Scrutability & Control.
• A prototype based upon the last.fm dataset that was created, to investigate Unobtrusive Recommendation.
3.7 Conclusion

In order to investigate the areas of Scrutability & Control and Unobtrusive Recommendation, an exploratory study was conducted. This began with a Qualitative Analysis, which identified the Duine Toolkit as the most appropriate code base for extension. This toolkit makes available six different recommendation techniques that could be used within a prototype system. A thorough examination of each technique was then conducted to ascertain ways in which it could be explained and controlled. A number of possible recommender usability features were brought to light through this analysis, and these, along with existing recommender usability features, were investigated through a questionnaire.

Based upon the results of this questionnaire, a large number of findings could be gleaned about the respondents in general. However, the data collected through this questionnaire was quite rich and demonstrated the individuality of each of the respondents. Particular respondents had preferences for different types of presentation, and their answers clearly reflected this. This type of variance in preferences makes a strong case for providing personalisation of presentations and explanations within recommender systems. The key findings were:

• Each of the recommendation techniques can be explained in a way that users can easily understand.
• When explaining recommendations, providing more information can often be beneficial.
• Complicated or poor explanations will often confuse a user's understanding of a recommendation technique.
• A user's opinions on the usefulness of recommendations are related to their understanding of these recommendations.
• Social Filtering and Genre Based were judged by respondents to be the most useful recommendation techniques.
• Respondents wanted the Most Popular recommendation technique to be combined with other techniques.
• Respondents did not think that Description Based or Lyrics Based recommendation techniques would be useful.
• Respondents believed that the Social Filtering (Simple Text), Genre Based (Simple Text), Most Popular (Ranking) and Learn By Example (Simple Text) screens were the easiest to understand and the most useful for their recommendation techniques.
• Some respondents had a strong interest in the ability to view the profiles of other similar users.
• Respondents indicated they would use the Genre Based Control (Genre Slider) often and that it was easy to understand. Further, respondents believed that it would be very useful.
• Most respondents indicated they would find a List Based presentation easier to understand and quicker to read than a Map Based presentation. Most users indicated they would find a List Based presentation useful, and some users indicated they would also find a Map Based presentation useful.
• Respondents indicated they would like to have the system choose a combination of recommendation techniques or allow them to view recommendations produced using various techniques.
• Respondents believed that explanations would be a useful addition to a recommender system.
• Respondents also believed that having control over a recommender system would be very useful.
• Different users prefer different forms of presentation and explanation.
These findings meant that the prototype should:
• Include both List Based and Map Based presentations.
• Allow users to view recommendations produced using various techniques and/or make recommendations using a combination of prediction techniques.
• Contain explanations for recommendations.
• Allow users to have control over certain elements of the recommender system.
• Allow users to view profiles for users similar to them.
• Include Social Filtering, Genre Based, Most Popular and Learn By Example recommendation techniques.
• Include the following optional explanation screens:
– Social Filtering (Simple Text), Social Filtering (Simple Graph) and Social Filtering (Similar Users)
– Combination of Genre Based (Simple Text) and Genre Based (Genre Listing)
– Combination of Most Popular (Avg. Rating Info.) and Most Popular (Ranking)
– Combination of Learn By Example (Simple Text) and Learn By Example (Similar Artists)
• Include the following controls:
– Genre Based Control (Genre Slider)
– Social Filtering Control (Like/Not Like)

Finally, two sources of test data were established for use in conducting simulations and evaluations at a later stage of the thesis. The results of the investigations described in this chapter, along with the test data that was acquired, would inform the construction of a prototype, described in Chapter 4.
Chapter 4
Prototype Design
4.1 Introduction
In order to investigate questions regarding Scrutability & Control in recommender systems and Unobtrusive Recommendation, a prototype was developed. This prototype would later be used to conduct user evaluations and simulations to establish the usefulness of a number of unobtrusive user modeling and usability features. The findings of the questionnaire described in Chapter 3 were used to guide the construction of this prototype and to ensure that only features that were likely to be of use in improving recommendation quality would be included.

Chapter 1 stated that this thesis aimed to investigate two main questions: the Scrutability & Control question and the Unobtrusive Recommendation question. However, these are two separate research questions, and if a single prototype were created to investigate both of them at once, it could be difficult to link each of the findings of this study to one specific research question. So, it was decided that two variants of our prototype should be created, one to investigate each of the major research questions for this project. Each of these prototype variants could then be evaluated separately, and the results from each evaluation would provide findings that clearly relate to only one research question.

The prototype that we created to investigate these questions was called iSuggest. The two variants of this prototype were called iSuggest-Usability and iSuggest-Unobtrusive. iSuggest-Usability incorporated the highest rated usability and interface features from the questionnaire. This version of the prototype made movie recommendations, based upon the MovieLens standard data set. iSuggest-Usability would later be used to investigate Scrutability & Control for recommenders through user evaluations.
iSuggest-Unobtrusive made music recommendations based upon the last.fm1 dataset described in Section 3.6, and would be used to investigate Unobtrusive Recommendation. iSuggest-Unobtrusive incorporated the ability to automatically generate the ratings that a user would give particular items using only unobtrusively obtained information. Specifically, this meant that it read the play-counts from a user's iPod and then automatically generated a set of ratings that the user would give to particular artists. The automatically generated ratings were then used to produce recommendations for that user. This prototype aimed to generate ratings for a user in a way that was accurate, but also easy for them to understand. iSuggest-Unobtrusive would later be used to investigate Unobtrusive Recommendation through both user evaluations and statistical evaluations.

This chapter describes the functions that each prototype variant made available to users, and then describes the architecture of each of the two variants.
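As a rough illustration of this kind of unobtrusive rating generation (and not the actual iSuggest-Unobtrusive algorithm, which is built on the Duine Toolkit and described later), a user's play-counts can be normalised onto the 0-5 star scale, for example relative to that user's most-played artist:

```python
def ratings_from_play_counts(play_counts, top_rating=5.0):
    """play_counts: {artist: number of plays for one user (e.g. read from an iPod)}.

    Returns {artist: generated rating on a 0-top_rating scale}, scaling each
    artist's play-count against the user's most-played artist. This is an
    illustrative mapping only, not the rating-generation method used in iSuggest.
    """
    if not play_counts:
        return {}
    max_plays = max(play_counts.values())
    return {artist: top_rating * plays / max_plays
            for artist, plays in play_counts.items()}

# e.g. {"Radiohead": 120, "Muse": 60, "Beck": 12} -> ratings 5.0, 2.5, 0.5
```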
4.2 User's View

The basic iSuggest prototype showed users the standard type of interface that is used within most current recommender systems. A user's first interaction with the basic iSuggest system was to create an account within iSuggest and then log in. Users could then view three basic screens:

Rate Items: Showed the items that the user had not yet rated and could still enter a rating for.
My Ratings: Showed the items that the user had rated, and the rating that the user had given each item.
Recommendation List: Showed a list of the recommendations that the system had produced for the user. Figure 4.1 shows an example of this screen.

Each of these screens used a standard List Based presentation style, as suggested by the study reported in Chapter 3. Users were able to click to view more information about any of the items shown on any of these screens. They could then click to search the Internet for more information about any of these items (this linked to imdb.com2 for movie items and Amazon.com3 for music items). Users rated items by clicking on the Star Bar (shown in Figure 4.2) and dragging their mouse to produce a rating between 0 stars (worst) and 5 stars (best) for each item.

1 www.last.fm
2 www.imdb.com
3 www.amazon.com
This basic prototype made all recommendations using a single recommendation method — the Duine Toolkit's default Taste Strategy (described in Section 3.3). The Taste Strategy was chosen for use within the basic prototype as it is shown in (van Setten et al., 2004) to be the most effective recommendation method available for use in the Duine Toolkit. In this way, the basic iSuggest prototype utilised the optimum configuration of the Duine Toolkit and provided a standard List Based presentation of information. The two prototype variants that would be used to investigate the research goals of this thesis extended this basic prototype to incorporate new features and enable these features to be evaluated.
Figure 4.1: List Based Presentation Of Recommendations.
Figure 4.2: The Star Bar That Users Used To Rate Items.
4.2.1 iSuggest-Usability

This version of the prototype extended the basic iSuggest prototype to incorporate all of the usability features that the results of the questionnaire suggested would be useful additions to a recommender system. This version of the prototype made movie recommendations, based upon the MovieLens standard data set. When using iSuggest-Usability, users were presented with the following new usability and interface features:

• Multiple recommendation techniques.
• Explanations for all recommendations that were produced.
• The ability to view a list of users similar to the current user.
• Control features that allowed the user to affect the recommendation process.
• A Map Based presentation of recommendations.

Each of these features is discussed in detail in the sections below.
Multiple Recommendation Techniques. Social Filtering, Genre Based, Most Popular and Learn By Example were all included as additional recommendation techniques that could be used by iSuggest-Usability. These were included as the questionnaire suggested that users would find these recommendation techniques to be the most useful. The questionnaire also suggested that users would like a recommender system to combine multiple techniques to make recommendations and/or allow users to select which recommendation technique should be used. Thus, iSuggest-Usability allowed users to select which of the five available methods (including the standard Taste Strategy) should be used to create recommendations. Users selected the recommendation technique to be used by accessing an options screen that presented them with the five techniques. An example of this screen is shown in Figure 4.3. Each of these techniques had a small description underneath its name to describe how it functioned. Users selected one option from the list of techniques and confirmed this choice. This caused the user’s recommendations to be replaced with a new set of recommendations. The questionnaire suggested that it would also have been desirable for iSuggest-Usability to enable combinations of recommendation techniques to be used; however, this was deemed to be outside the scope of the project.
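The ability to switch techniques can be viewed as a thin strategy-selection layer that sits above the individual prediction algorithms and simply swaps which one is used to score candidate items. The following Python sketch illustrates this idea; the class and method names are hypothetical illustrations and do not correspond to the Duine Toolkit's actual API.

class RecommenderStrategy:
    """One recommendation technique (hypothetical base class)."""
    name = "Base"

    def predict(self, user_id, item_id):
        raise NotImplementedError


class MostPopularStrategy(RecommenderStrategy):
    """Predicts every item's average rating, regardless of the user."""
    name = "Most Popular"

    def __init__(self, all_ratings):
        # all_ratings: dict mapping item_id -> list of ratings from all users
        self.averages = {item: sum(r) / len(r) for item, r in all_ratings.items()}

    def predict(self, user_id, item_id):
        return self.averages.get(item_id, 2.5)


class StrategySelector:
    """Lets the interface swap the active technique, as iSuggest-Usability allowed."""

    def __init__(self, strategies):
        self.strategies = {s.name: s for s in strategies}
        self.active = next(iter(self.strategies))

    def choose(self, name):
        if name not in self.strategies:
            raise ValueError("Unknown technique: " + name)
        self.active = name

    def recommend(self, user_id, candidate_items, top_n=10):
        strategy = self.strategies[self.active]
        ranked = sorted(candidate_items,
                        key=lambda item: strategy.predict(user_id, item),
                        reverse=True)
        return ranked[:top_n]

Selecting a different option on the screen shown in Figure 4.3 corresponds to calling choose() with a new technique name and re-scoring the candidate items, which is why the user's recommendation list was replaced in full whenever the technique was changed.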
FIGURE 4.3: Recommendation Technique Selection Screen. Note: The ‘Word Of Mouth’ Technique Shown Here Is Social Filtering And The ‘Let iSuggest Choose’ Technique Is The Duine Toolkit Taste Strategy
Explanations. Every recommendation that was produced using the Social Filtering, Genre Based, Most Popular or Learn By Example technique was accompanied by an explanation that users could view by clicking to see "More Info" about the recommended movie. The explanation provided depended upon the recommendation technique that was used to create the recommendation. The way in which recommendations from each technique were explained is described below.
Most Popular: The questionnaire suggested that the Most Popular (Avg. Rating Info.) and Most Popular (Ranking) screens would be useful in explaining this technique to users. Most Popular was therefore explained using a combination of these two screens, which displayed the number of users who had rated the recommended movie, the average rating that these users had given to the movie, and the rank that the movie therefore held in the database. The Most Popular explanation screen is shown in Figure 4.7.
Genre Based: The questionnaire suggested that the Genre Based (Simple Text) and Genre Based (Genre Listing) screens would be useful in explaining this technique to users. However, the Genre Based (Genre Listing) screen showed users the average rating that they had given to movies within a particular genre. This average is not used by the Genre Based technique to create recommendations, so using it to explain recommendations would not necessarily produce useful explanations. Rather, the Genre Based technique calculates a user’s interest in particular genres and uses this interest to make recommendations. Hence, the explanation for the Genre Based technique contained a listing of the genres that a movie belonged to and a link to a screen where the user could view their calculated interest in each genre. The Genre Based explanation screen is shown in Figure 4.4.
Social Filtering: The questionnaire showed that Social Filtering (Simple Text), Social Filtering (Simple Graph) and Social Filtering (Similar Users) could all be useful ways to explain this technique. However, these explanations could not easily be combined. As a result, three different types of Social Filtering explanation were provided to users — Simple Text, Graph and Similar Users. Simple Text presented text indicating the number of similar users that the recommendation was based upon. Graph (shown in Figure 4.5) presented the same text and also displayed a graph of the number of users who ‘Liked This Movie’ and ‘Didn’t Like This Movie’. Finally, Similar Users showed the names of the similar users who were most significant in the creation of the recommendation and whether these users ‘Liked This Movie’ or ‘Didn’t Like This Movie’. Users could then click to view the detailed profiles of these similar users.
Learn By Example: The questionnaire suggested that the Learn By Example (Simple Text) and Learn By Example (Similar Artists) screens would be useful in explaining this technique to users. Thus, Learn By Example was explained using a combination of these two screens. This combined screen listed the similar items that the recommendation was based upon (including the rating that the user had given to each of these items) and stated the average rating that the user had given to these similar items. The Learn By Example explanation screen is shown in Figure 4.6.
FIGURE 4.4: Explanation Screen For Genre Based Recommendations
FIGURE 4.5: Social Filtering (Simple Graph) Explanation Screen For Social Filtering Recommendations
FIGURE 4.6: Explanation Screen For Learn By Example Recommendations
FIGURE 4.7: Explanation Screen For Most Popular Recommendations
Similar Users. This screen allowed a user to view a list of other users who the system believed were the most similar to them. A user could then click to view the ratings given by each of the similar users displayed in the list. This screen was included because the questionnaire suggested that users had a strong interest in the ability to view the profiles of other similar users.
Control Features. The questionnaire suggested that control features would be a useful addition to a recommender system. In particular, it was suggested that Genre Based Control (Genre Slider) and Social Filtering Control (Like/Not Like) would be quite useful to users. As a result, these two features were incorporated into iSuggest-Usability. These control features are detailed below.
FIGURE 4.8: The Genre Based Control (Genre Slider)
Genre Based Control (Genre Slider): (shown in Figure 4.8) This control screen displayed the interest that the system had calculated the user had in each genre. These interest levels were displayed using slider bars, and the user was able to manually adjust these sliders to indicate their actual interest level in each genre.
FIGURE 4.9: The Social Filtering Control. Note: The actual control is the ‘Ignore This User’ Link
Social Filtering Control: (shown in Figure 4.9) This control was integrated into all screens that displayed similar users to the current user. On every screen where the system displayed the details of a similar user, these details were accompanied by the option to ‘Ignore This User’. Users could then choose to ignore a particular user if they felt that user was not similar to them. This control feature was a slight variation upon the Social Filtering Control screen shown in the questionnaire. The difference is that this feature no longer allowed users to confirm that another user was indeed similar to them. This is because such a confirmation would not have had any impact upon recommendations (as the system already believed that these two users were similar).
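In effect, the ‘Ignore This User’ control removes the chosen neighbour from the set of similar users before predictions are recomputed. The Python sketch below illustrates this with a simple weighted-average form of social filtering; it is an illustration of the general idea only, not the Duine Toolkit's implementation, and all names in it are hypothetical.

def social_filtering_prediction(item, neighbours, ignored, ratings):
    """Predict a rating for item from the ratings of similar users.

    neighbours: list of (user_id, similarity) pairs for the current user
    ignored:    set of user_ids the current user has chosen to ignore
    ratings:    dict mapping (user_id, item) -> rating
    """
    weighted_sum, weight_total = 0.0, 0.0
    for user_id, similarity in neighbours:
        if user_id in ignored:        # the control: ignored users contribute nothing
            continue
        rating = ratings.get((user_id, item))
        if rating is None:
            continue
        weighted_sum += similarity * rating
        weight_total += similarity
    if weight_total == 0.0:
        return None                   # no usable neighbours for this item
    return weighted_sum / weight_total

One consequence visible in this sketch is that ignoring only a handful of neighbours shifts the weighted average very little, which is consistent with the evaluation results for this control reported in Chapter 5.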
Map Based Presentation. The questionnaire suggested that many users would find the option of a Map Based presentation of recommendations to be useful. As a result, this form of presentation was incorporated into the prototype. The Map Based presentation displayed items to users so that:
• Each movie on the map was shown as a circle and the name of the movie was written on that circle.
• The closer that two circles were to one another, the more related they were (e.g. two very closely related movies would appear right next to one another and two movies not related to one another at all would appear far away from one another). Note: different relationships between items existed for different map types; these are discussed below.
• If a user had seen a movie, it was coloured blue.
• If a user had not seen a movie, but their predicted rating for that movie was above 2.5 stars, it was coloured a shade of green (darker green indicated a higher rating).
• If a user had not seen a movie, but their predicted rating for that movie was close to 2.5 stars, it was coloured orange.
• If a user had not seen a movie, but their predicted rating for that movie was less than 2.5 stars, it was coloured a shade of red (darker red indicated a lower rating); a simple sketch of this colour mapping is given below.
• Users were allowed to zoom in and out on the map and move left, right, up and down on the map.
• Users could click on a particular circle to view more information about the movie that circle represented.
Three variants of Map Based presentation were included in iSuggest-Usability. These variants were included in order to investigate how useful users would find particular styles of Map Based presentation. The details of each of these variants are described below.
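The colour coding described above is a direct mapping from a movie's status (seen or not) and predicted rating to a display colour. A minimal Python sketch of such a mapping follows; the exact RGB values and the width of the "close to 2.5 stars" band are illustrative assumptions, not the values used in the prototype.

def map_colour(seen, predicted_rating):
    """Return an (R, G, B) colour for a movie circle on the map."""
    if seen:
        return (0, 0, 255)                       # seen movies are blue
    if abs(predicted_rating - 2.5) < 0.25:       # assumed width of the neutral band
        return (255, 165, 0)                     # near-neutral predictions are orange
    if predicted_rating > 2.5:
        # higher predictions give a darker shade of green
        shade = 200 - int((predicted_rating - 2.5) / 2.5 * 120)
        return (0, shade, 0)
    # lower predictions give a darker shade of red
    shade = 200 - int((2.5 - predicted_rating) / 2.5 * 120)
    return (shade, 0, 0)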
FIGURE 4.10: Full Map Presentation — Zoomed Out View
Full Map: (shown in Figures 4.10 & 4.11) This map displayed all of the movies found in the MovieLens dataset. Each movie on this map was placed close to the genres that it belonged to. The names of the genres that movies were divided into were displayed in large writing on the map.
FIGURE 4.11: Full Map Presentation — Zoomed In View
FIGURE 4.12: Similar Items Map Presentation
Top 100 Map: This map was exactly the same as the Full Map, except that to reduce clutter and confusion on the map, it displayed only 100 movies. These 100 movies were the movies with the highest predicted rating for this user. Similar Items Map: (shown in Figure 4.12) This map showed the user a single focus item, surrounded by a number of items. These items were described to users as being related to the focus item because the users who liked the focus item also liked these items. This map was
chosen for inclusion because it displays items in a way similar to the way that liveplasma (http://www.liveplasma.com) displays items.
4.2.2 iSuggest-Unobtrusive
This version of the prototype extended the basic iSuggest prototype to incorporate the ability to generate ratings using only unobtrusively obtained information about a user. iSuggest-Unobtrusive made use of the play-counts that were stored on users’ iPods to automatically generate a set of ratings that these users would give to particular artists. These ratings were then used to generate recommendations for that user. When using iSuggest-Unobtrusive, users connected their iPod and clicked ‘Get Ratings From My iPod’; ratings were then generated from the iPod connected to the system and an explanation of the ratings generation was shown. Users could then see the ratings that had been generated for them and the recommendations that had been produced for them. Users were able to choose from three different recommendation techniques — Random (which merely assigned a random number as the user’s predicted rating for each item), Social Filtering and Genre Based. The explanation of the ratings generation that was displayed is shown in Figure 4.13. It described the number of ratings that had been generated. It also noted that artists the user listened to frequently had been given a high rating and artists the user listened to less frequently received lower ratings. The construction of the ratings generation algorithm and this explanation screen was guided by the findings of the questionnaire. A particularly important consideration was the suggestion that complicated explanations could confuse a user’s understanding and do more harm than good. Thus, this explanation screen was designed to be simple for users to understand, yet still communicate effectively the way that ratings had been generated.
FIGURE 4.13: The Explanation Screen Displayed After Ratings Generation
4.3 Design & Architecture
The architecture of the basic prototype is shown in Figure 4.14, with components constructed during this thesis marked in blue. The core components of the basic prototype were the iSuggest Controller, the iSuggest Interface and the Duine Toolkit. The iSuggest Controller managed the iSuggest system, allowing users to log in, submit ratings, set preferences and receive recommendations. It submitted any ratings and preferences to the Duine Toolkit and decided when a user’s recommendations needed to be updated. Such an update was required whenever a user changed their preferences or had submitted a certain number of new ratings to the Duine Toolkit. The iSuggest Interface managed all of the user interaction for the iSuggest system. This component was built using the Processing graphical toolkit (available from http://processing.org/). The basic iSuggest Interface incorporated List Based presentation screens that enabled users to rate items and view recommendations. The iSuggest Interface submitted the users’ ratings and preferences to the iSuggest Controller, and it received new recommendations from the iSuggest Controller whenever the user’s recommendations were updated. The Duine Toolkit received ratings and preferences from the iSuggest Controller and used these, along with a Ratings Database, to generate recommendations when required.
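The controller's rule for deciding when to regenerate a user's recommendations (on any preference change, or once enough new ratings have arrived) can be sketched as follows. This is a hypothetical illustration, not code from the prototype, and the threshold of five new ratings is an assumed value.

class UpdatePolicy:
    """Decides when the iSuggest Controller should refresh recommendations."""

    def __init__(self, ratings_threshold=5):      # threshold value is assumed
        self.ratings_threshold = ratings_threshold
        self.new_ratings_since_update = 0

    def on_rating_submitted(self):
        self.new_ratings_since_update += 1
        if self.new_ratings_since_update >= self.ratings_threshold:
            self.new_ratings_since_update = 0
            return True                            # enough new ratings: update now
        return False

    def on_preferences_changed(self):
        # Any change to preferences (e.g. genre interests) forces an update.
        self.new_ratings_since_update = 0
        return True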
FIGURE 4.14: Architecture Of The Basic Prototype, With Components Constructed During This Thesis Marked In Blue
4.3.1 iSuggest-Usability
iSuggest-Usability extended the basic prototype by adding scrutability and control features. This version of the prototype made movie recommendations, based upon the MovieLens standard data set. Figure 4.15 shows the architecture of iSuggest-Usability, with components constructed during this thesis marked in blue. The additional features included in this version of the prototype were:
FIGURE 4.15: Architecture Of iSuggest-Usability, With Components Constructed During This Thesis Marked In Blue
Map Based Presentation Screens: These presentation screens made use of the traer.physics (http://www.cs.princeton.edu/~traer/physics/) and traer.animation (http://www.cs.princeton.edu/~traer/animation/) libraries. The traer.physics library was used to create a simulated particle system. In such a system, all particles repel one another, and links hold particles close to one another. This particle system was used to determine the positions of items in the Map Based presentation. The Full Map and Top 100 Map began by placing all of the system’s movie genres onto the map as particles. Items were then placed one-by-one onto the map, and each item was linked to the genres that it belonged to. In this way, each item was repelled by all other items in the system, but it stayed close to the genres that it belonged to (a simplified sketch of this layout approach is given after this list). The Similar Items Map used a different method to position items. This map calculated the correlation between each movie and all other movies in the database in terms of the ratings that users had given them. It then displayed a single focus item, encircled by all of the movies that had a high level of correlation with the focus item.
Similar Users Screen: This screen made use of a list of similar users that was output by the Social Filtering algorithm. It then displayed the users who were the most similar to the current user (to a maximum of 9 similar users).
Control Features: These features received input from the user regarding their preferences and forwarded this information to the iSuggest Controller. The iSuggest Controller then set these preferences in the Duine Toolkit and updated the user’s recommendations.
Modified Recommendation Algorithms: The Social Filtering, Genre Based, Learn By Example and Most Popular algorithms were all modified so that they attached extensive explanation information to each recommendation that was made. This allowed the Explanation Screens to
fully explain each of the recommendations. The Social Filtering and Genre Based algorithms were also modified to make use of the user preferences that were set using control features. Explanation Screens: These screens took the explanation information that was attached to each recommendation and displayed this information in a way that the user should be able to understand.
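The genre-anchored layout used by the Full Map and Top 100 Map can be approximated by a small force-directed simulation: every item repels every other item, while spring-like forces pull each item towards the genres it belongs to. The self-contained Python sketch below illustrates this approach; it does not use the traer.physics API that the prototype relied upon, and the force constants, grid spacing and step count are arbitrary choices.

import random

def genre_map_layout(genres, items, item_genres, steps=300):
    """Return 2D positions for genres (fixed anchors) and items (free particles)."""
    # Spread genre anchors on a coarse grid; items start at random positions.
    pos = {g: (150.0 * (i % 4), 150.0 * (i // 4)) for i, g in enumerate(genres)}
    pos.update({m: (random.uniform(0, 450), random.uniform(0, 450)) for m in items})

    for _ in range(steps):
        force = {m: [0.0, 0.0] for m in items}
        # Mutual repulsion between items keeps the map from collapsing into clumps.
        for i, a in enumerate(items):
            for b in items[i + 1:]:
                dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
                d2 = dx * dx + dy * dy + 0.01
                fx, fy = 800.0 * dx / d2, 800.0 * dy / d2
                force[a][0] += fx; force[a][1] += fy
                force[b][0] -= fx; force[b][1] -= fy
        # Spring attraction keeps each item near the genres it belongs to.
        for m in items:
            for g in item_genres.get(m, []):
                force[m][0] += 0.05 * (pos[g][0] - pos[m][0])
                force[m][1] += 0.05 * (pos[g][1] - pos[m][1])
        # Genres stay fixed; items take a small step along their net force.
        for m in items:
            pos[m] = (pos[m][0] + 0.1 * force[m][0], pos[m][1] + 0.1 * force[m][1])
    return pos

A real implementation would also damp particle velocities and keep the simulation running so the map can animate while the user zooms and pans, which is the kind of behaviour the traer libraries provided.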
4.3.2 iSuggest-Unobtrusive
iSuggest-Unobtrusive extended the basic prototype by adding the ability to automatically generate a user’s ratings from play-counts stored on their iPod. This version of the prototype made music recommendations based upon the last.fm dataset. The architecture of iSuggest-Unobtrusive is shown in Figure 4.16, with components constructed during this thesis marked in blue.
FIGURE 4.16: Architecture Of iSuggest-Unobtrusive, With Components Constructed During This Thesis Marked In Blue
The additional features included in this version of the prototype were:
Ratings Generation Algorithm. This algorithm needed to be both accurate at generating ratings from a user’s play-counts and easy to explain to users. The algorithm that was chosen to generate ratings worked in the following way:
Algorithm 1: Ratings Generation Algorithm
Input: Artists and play-counts from an iPod
Output: User’s ratings for the artists found on the iPod
1:  minimum count = min(play-counts)
2:  maximum count = max(play-counts)
3:  foreach artist on the iPod do
4:      artist play-count = sum(play-counts from songs by this artist)
5:      normalized play-count = (artist play-count - minimum count) / (maximum count - minimum count)
6:      new rating = (normalized play-count + 1) * 2.5
7:  end
On line 5, the play-counts are normalized with reference to the other play-counts that exist on the iPod. This places them on a scale of 0.0 – 1.0. Then, on line 6, these normalized values are converted into ratings on the system’s 0.0 – 5.0 scale. The minimum rating produced by this algorithm is 2.5, as this is a neutral rating, and the worst rating that any artist on a user’s iPod should receive is neutral (the mere fact that the artist is on their iPod implies that the user has at least a neutral attitude toward that artist).
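For reference, a Python sketch of Algorithm 1 is given below. It assumes that play-counts are supplied as a mapping from (artist, song) to play-count, and it takes the minimum and maximum of lines 1 and 2 over the per-artist totals; reading the data from an actual iPod is outside the scope of the sketch.

from collections import defaultdict

def generate_ratings(play_counts):
    """Generate artist ratings from song play-counts (sketch of Algorithm 1).

    play_counts: dict mapping (artist, song) -> play-count, e.g. read from an iPod.
    Returns a dict mapping artist -> rating on the 0-5 star scale.
    """
    # Line 4: total play-count per artist.
    artist_counts = defaultdict(int)
    for (artist, _song), count in play_counts.items():
        artist_counts[artist] += count

    # Lines 1-2: minimum and maximum counts (taken here over the per-artist totals).
    minimum = min(artist_counts.values())
    maximum = max(artist_counts.values())
    spread = maximum - minimum

    ratings = {}
    for artist, count in artist_counts.items():
        # Line 5: normalize onto the 0.0 - 1.0 scale (all artists get 1.0 if counts are equal).
        normalized = (count - minimum) / spread if spread else 1.0
        # Line 6: map onto 2.5 - 5.0, so every artist on the iPod rates at least neutral.
        ratings[artist] = (normalized + 1) * 2.5
    return ratings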
Explanation Screen. This screen took the explanation information that was provided by the ratings generation algorithm and displayed this in a way that users should be able to understand.
4.4 Conclusion
To investigate the research goals of this project, a prototype called iSuggest was developed. This prototype was offered in two different versions, named iSuggest-Usability and iSuggest-Unobtrusive, each of which was built to explore a separate research question. The basic iSuggest system was created to imitate existing recommender interfaces and use the default Duine Toolkit recommendation technique (the Taste Strategy). This basic prototype was extended to create the two prototype variants - iSuggest-Usability and iSuggest-Unobtrusive. iSuggest-Usability incorporated the highest rated usability interface features from the questionnaire. This prototype made movie recommendations, based upon the MovieLens standard data set. It would
later be used to investigate the first research goal of the project through user evaluations. iSuggest-Usability made the following functions available to the user:
Multiple Recommendation Techniques: The questionnaire suggested that the ability to choose the recommendation technique to be used would be useful to users. Thus, iSuggest-Usability allowed users to request that recommendations be produced using any of five different recommendation techniques (Social Filtering, Genre Based, Most Popular, Learn By Example and the Duine Toolkit’s Taste Strategy).
Explanations: Explanations were provided for all recommendations that were produced. Each recommendation technique was explained using its highest rated explanation screen from the questionnaire. Social Filtering was explained using three different explanation screens, each of which was shown by the questionnaire to be useful.
Similar Users: Users were given the ability to view a list of the other users of the system who were deemed to be the most similar to the current user. Users could view all of the ratings entered by each similar user.
Control Features: These allowed the user to affect the recommendation process. The control features implemented were the Genre Based Control (Genre Slider) and the Social Filtering Control, as respondents to the questionnaire rated these highly.
Map Based Presentation Of Recommendations: This form of presentation was rated as useful by many questionnaire respondents. Three different map based presentations were made available to the user - Full Map, Top 100 Map and Similar Items Map.
iSuggest-Unobtrusive incorporated the ability to read the play-counts from a user’s iPod and then generate a set of ratings that the user would give to particular artists. These ratings could then be used to produce recommendations for a user. This prototype made the following functions available to the user:
Automatic ratings generation: Users could have ratings automatically generated from the play-counts on their iPod.
Ratings generation explanation: Every time that ratings were automatically generated by the system, an explanation screen was shown to users that described how many ratings were generated and how these had been generated.
Recommendations using unobtrusive information: Recommendations were provided to each user based upon the ratings that had been automatically generated. iSuggest-Unobtrusive made
use of the last.fm dataset, which contains only unobtrusively obtained information, to make recommendations. Once the construction of the prototypes was complete, each of them needed to be evaluated to investigate the research goals of the project. The evaluation of these prototypes is described in Chapter 5.
CHAPTER 5
Evaluations
5.1 Introduction
In order to investigate the research goals for this thesis, the two versions of the prototype — iSuggest-Usability and iSuggest-Unobtrusive — were evaluated. These evaluations aimed to establish the effectiveness of the methods implemented in the prototype for providing scrutability, control and unobtrusiveness. iSuggest-Usability was evaluated through a user evaluation, which was completed by 10 people. This evaluation aimed to investigate the effectiveness of explanations, controls and Map Based presentations for improving explanations and providing scrutability. It also aimed to investigate how users interact with these elements. iSuggest-Unobtrusive was evaluated through both a user evaluation and statistical evaluations. These evaluations aimed to assess the ability of the prototype to generate ratings from implicit data, and its ability to make useful recommendations using these ratings. Each of these evaluations needed to be rigorously designed to ensure that it meaningfully and accurately tested effectiveness and investigated users’ interactions with the prototype system. This chapter describes the design of these evaluations and their results.
5.2 Design
In order to investigate the way in which users interact with recommender systems and the usefulness of the particular Scrutability & Control elements that we added to the two prototype systems that we developed, we designed two user evaluations, one for each of the prototype systems that we produced. During the completion of these evaluations, users were asked to answer questions about the usefulness of particular aspects of iSuggest-Usability. For each of these questions, 1 was the lowest score that could be given, and 5 was the highest. Further, the evaluations were conducted through a process called a Think-aloud (detailed in (Nielsen, 1993)), which involves asking users to verbalise their thought process while
making use of particular elements of a system. During the Think-aloud process, notes were made to record the thought processes expressed by users. Through the Think-aloud process, we aimed to discover information about how users interacted with recommender systems and how useful they found particular elements of the prototype that could not be captured by asking simple questions. The design of the two user evaluations is described below.
5.2.1 iSuggest-Usability
The evaluations of iSuggest-Usability were designed with the following goals in mind:
Goal 1: Investigate whether providing explanations for recommendations can improve the usefulness of these recommendations.
Goal 2: Investigate the most effective way to explain recommendations to users.
Goal 3: Investigate whether there is a trade-off between recommender usefulness and understanding of recommendations.
Goal 4: Investigate whether users can utilise control features to improve the quality of their recommendations.
Goal 5: Investigate whether a recommender system benefits from the introduction of a map based presentation.
Goal 6: Investigate the way in which users interact with a map-based style of presentation.
In order to achieve each of these goals, the user evaluations for iSuggest-Usability consisted of a Setup stage, Part A and Part B. Each user began by entering ratings for movies at the Setup stage. Following this stage, users were asked to complete the Part A and Part B stages, each of which asked them to view recommendations and rate a number of different elements that were presented to them. Finally, users were presented with a set of final questions to answer about their general opinion of iSuggest-Usability. Part A presented users with a standard set of recommendations, with no additional Scrutability & Control features at all. This stage was included in the evaluation in order to serve as a control, to gauge the quality of the recommendations presented to users and to present them with a standard method of recommendation, without any Scrutability & Control features. Part B presented users with recommendations that incorporated the Scrutability & Control elements of this prototype and asked them to rate the recommendations and the usefulness of particular Scrutability & Control elements. In order to produce a Double Cross-over study, half of the participants in the evaluations were
asked to complete Part A before Part B (Type 1), and the other half completed Part B before Part A (Type 2). A full description of each stage of the evaluation is included below (the instructions that users followed during these evaluations can be found in Appendix C).
Setup. During this stage, users moved through a list of movies and rated any of the movies that they had seen, according to how much they liked or disliked that movie. Users were asked to rate approximately 30 movies, as this number of ratings meant that the user was still considered to be a new user to the system, and the cold start problem for new users would still be very apparent for this user. The choice to simulate the cold start problem for new users during these user evaluations was motivated by the fact that explanation and control features are both elements that we have added to our prototype with the specific intention of: building users’ trust in the system, despite the quality of recommendations produced; aiding users in making better use of poor recommendations; and improving the quality of recommendations that are produced by the system. The cold start problem for new users is a well documented problem with recommender systems that causes such systems to produce poor recommendations. Thus, simulating this problem should produce some poor quality recommendations and allow us to assess the effectiveness of the Scrutability & Control elements that were added to this prototype.
Part A. During Part A of the user evaluations, users were presented with a list of recommendations that were produced using the Duine Toolkit’s Main Strategy. These recommendations were presented to the user without any form of explanation and users were offered no form of control over these recommendations. Recommendations were presented to users in this form to show them that often recommender systems do not provide the Scrutability & Control features that were introduced with this prototype.
Part B. During this part of the user evaluations, users were presented with multiple sets of recommendations, accompanied by Scrutability & Control features such as explanations and controls. During Part B, users were asked a number of questions in order to assess the usefulness of the recommendation methods and the Scrutability & Control features that were added to the prototype. Users were instructed to select and use each of the different recommendation methods in turn. Each of these recommendation methods was accompanied by a short explanation of how it worked, to give users some idea of how recommendations would be produced. The questions that were presented to the user during this stage were divided into the following categories:
Recommender Usefulness: After each set of recommendations was presented, the user was asked to rate how useful they found these recommendations.
Explanation Usefulness: The recommendations presented to users at this stage were each accompanied by an explanation, and users were asked to rate how useful they found that explanation for helping them to understand and make use of the recommendations that were provided. In the case of the Social Filtering recommendations, users were in fact presented with three different forms of explanation for each recommendation and they were asked to rate each of these forms of explanation in turn.
Control Feature Usefulness: For the Genre Based and Social Filtering recommendations, users were instructed to make use of specific control features that were intended to improve the quality of recommendations. Users were then asked to rate how useful they found each control feature for improving their predictions.
Map Usefulness: Users were presented with the three different Map Based presentations: Full Map, Top 100 Map and Similar Items Map. They were asked to spend some time making use of each Map Based presentation and then they were asked to rate its usefulness as a method for viewing recommendations. In addition to asking users to rate each form of Map Based presentation, the way in which users interacted with each of them was observed. This section of the user trial focused on discovering whether users were interested in having a map based presentation of recommendations and, if so, how such a presentation could most effectively be created.
Final Questions. Upon completion of the user evaluations, users were asked five questions. They were asked to rate the general usefulness of the explanations provided by the system and the usefulness of the control features in improving recommendations. Users were also asked whether they would prefer a list based presentation of recommendations, a map based presentation, or both. Finally, they were asked to state what the best and worst features of the iSuggest prototype were.
Participants. In all, 10 people completed the evaluations of iSuggest-Usability. This is well beyond the recommended minimum of 3 to 5 people for usability evaluations stated in (Nielsen, 1994). The sample group for this evaluation was carefully selected to contain people from a variety of backgrounds and both males and females. The majority (8/10) of the users who completed the evaluation were aged under 30, but modern recommender systems are used most often by people who fall in the 18-30 age range, so a higher proportion of respondents in this age range was deemed to be appropriate. Figure 5.1 shows demographic information about each of the participants, as well as indicating whether they completed Part A first (Type 1) or Part B first (Type 2).
Participant Number   1   2   3   4   5   6   7   8   9   10
Age                  22  52  18  21  21  30  23  51  25  23
Gender               F   M   F   F   M   M   M   F   M   F
Type 1 or 2          1   2   1   2   1   2   1   2   1   2
(Participants 1-5 formed Group 1; participants 6-10 formed Group 2.)
FIGURE 5.1: Demographic Information About The Users Who Conducted The Evaluations Of iSuggest-Usability
5.2.2 iSuggest-Unobtrusive
The evaluations of iSuggest-Unobtrusive were designed with the following goals in mind:
Goal 1: Investigate whether users’ play counts can be accurately mapped to their ratings.
Goal 2: Investigate whether effective recommendations can be made for users using only ratings generated from play counts.
In order to achieve each of these goals, the user evaluations for iSuggest-Unobtrusive consisted of Parts A and B. The instructions that users followed during this evaluation can be found in Appendix E. During Part A, ratings were generated for each user by applying the ratings generation algorithm, and users were then asked to indicate how well they understood how these ratings had been generated and how accurate the ratings were. Part B presented three sets of recommendations to users:
Random Recommendations: These recommendations were created by assigning a random number as the user’s predicted interest in each item. These recommendations were included to act as a control, a reference point which could be used to judge the utility of the rest of the recommendations presented to users.
Social Filtering Recommendations: These recommendations were created using the Social Filtering recommendation technique. This technique was chosen for use as it was the top performing algorithm on a set of statistical evaluations (the results of these statistical evaluations are summarised in Section 5.4.1).
Genre Based Recommendations: These recommendations were created using the Genre Based recommendation technique. This technique was chosen as it was the second highest performing algorithm on the set of statistical evaluations (the results of these statistical evaluations are summarised later in this chapter, in Section 5.4.1).
For each set of recommendations, users were first presented with the list of recommendations, then they were asked to spend as much time as they wanted assessing how useful they found the recommendations that were provided. Users were then asked to give the recommendations a rating according to how useful they were. In order to produce a Double Cross-over study, five of the participants in the evaluations were shown the Random Recommendations before the Social Filtering and Genre Based Recommendations (Type 1), and the other four were shown the Social Filtering and Genre Based Recommendations before the Random Recommendations (Type 2). Once users had completed the trial, they were also asked to indicate whether or not they would like to have the ‘Get Ratings From My iPod’ feature incorporated into the iSuggest system.
Participants. In all, 9 people completed the evaluations of iSuggest-Unobtrusive. These users were not all the same users that completed the evaluation of iSuggest-Usability, though some users did complete both evaluations. Again, the sample group for this evaluation was carefully selected to contain people from a variety of backgrounds and both males and females. The majority (6/9) of the users who completed the evaluation were again aged under 30. Figure 5.2 shows demographic information about each of the participants, as well as indicating whether they were shown Random Recommendations first (Type 1) or Social Filtering and Genre Based recommendations first (Type 2).
Participant      1   2   3   4   5   6   7   8   9
Age              18  52  20  51  19  21  20  23  31
Gender           F   M   F   F   M   F   M   M   F
Type 1 or 2      1   2   1   2   1   2   1   2   1
FIGURE 5.2: Demographic Information About The Users Who Conducted The Evaluations Of iSuggest-Unobtrusive
Statistical Evaluations. In order to evaluate more thoroughly the ratings and recommendations that were produced by iSuggest-Unobtrusive, a set of simulations was carried out, and statistical data was collected during these simulations. An important issue in the execution of these simulations was the choice of statistical measures for evaluating performance. The chosen measures needed to provide a useful and reliable gauge of each system’s performance. It was decided to evaluate the performance of
the ratings algorithm through the distribution of the ratings that were produced by that algorithm. This distribution could then be compared to the distribution of ratings within the MovieLens standard dataset. Evaluation of the usefulness of recommendations produced by iSuggest-Unobtrusive was slightly more complicated. (Herlocker, 2000) provides an evaluation of a number of possible measures for evaluating the usefulness of recommendations. This paper concluded that the MAE metric is an appropriate metric for use in evaluating recommender systems. This metric judges the accuracy of the predictions that a recommender system makes about a user’s level of interest in specific items. More accurate predictions will lead to higher quality recommendations and thus, a better MAE will result in better recommendations. One of the advantages of calculating the MAE is the fact that this metric was also used in (van Setten et al., 2002). This means that results from this simulation should be roughly comparable to the results of this study. MAE measures the absolute difference between a predicted rating and the user’s true rating for an item. The MAE is computed by taking the average value of this difference across the entire system. The MAE of a system represents the overall accuracy of predictions (and thus recommendations) made by that system. The standard deviation of the absolute error values (SDAE) is also useful to compute, as this measure describes how consistently a system will produce reliable predictions (and thus reliable recommendations). Thus, MAE and SDAE metrics were used to evaluate the iSuggest-Unobtrusive prototype.
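Concretely, if p is the predicted rating and r the rating the user actually gave, then each prediction contributes an absolute error |p - r|; MAE is the mean of these errors and SDAE is their standard deviation. The short Python sketch below computes both measures from a list of (predicted, actual) pairs.

import math

def mae_and_sdae(pairs):
    """Compute MAE and SDAE from (predicted_rating, actual_rating) pairs."""
    errors = [abs(predicted - actual) for predicted, actual in pairs]
    mae = sum(errors) / len(errors)
    variance = sum((e - mae) ** 2 for e in errors) / len(errors)
    return mae, math.sqrt(variance)

# Example: three predictions compared against the ratings users actually gave.
print(mae_and_sdae([(3.5, 4.0), (2.0, 3.0), (4.5, 4.0)]))   # approx. (0.667, 0.236)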
5.3 iSuggest-Usability Evaluations — Results
This section reports the results of the evaluations of iSuggest-Usability. The results are reported in terms of recommendation usefulness, explanations, control features and presentation method. At this point, it is important to note that the average number of ratings that were entered by users during evaluations was 27.1. This is only a small number of ratings for a user to have entered into a recommender system, so the cold start problem for new users existed for each user during evaluations.
5.3.1 Recommender Usefulness
Users rated the usefulness of the six sets of recommendations produced. Figure 5.3 shows the average score for each of the different techniques, with error bars showing one standard deviation above and below the mean (actual results for each user shown in Appendix D). We now discuss these techniques in order of average usefulness.
FIGURE 5.3: Average Usefulness Ratings For Each Recommendation Method. Error Bars Show One Standard Deviation Above And Below The Mean. N = 10
Genre Based (Revised): (average score of 3.9/5 after control features were used, ranked 1st). The Genre Based recommendations were the lowest rated when first presented, with an average score of 2.7/5. Five users gave their lowest rating to these recommendations and no users gave their highest score. However, once users were given the chance to adjust their genre interests, the average score for this method improved by 20% to 3.9/5. Seven people gave their highest score to these revised recommendations, and only two did not (due to an error in copying the questionnaire, one user did not give a rating for the revised Genre Based recommendations).
Learn By Example: (average score of 3.7/5, ranked 2nd). This method produced the largest variation in users’ ratings, with most users rating this method above 3, yet others rating it as a 2. Despite the variation, this method had the second highest average score, and six users gave this method their highest score.
Most Popular: (average score of 3.3/5, ranked 3rd). Three users rated this method highest and two of these users spontaneously commented that they would be very interested in the movies that were the most popular overall. In contrast, two other users rated this method lowest and one user spontaneously commented that this recommendation method was unlikely to ever produce good recommendations for him, as he was not interested in popular movies.
Duine: (average score of 3.1/5, ranked 4th). Most users were observed to find that these recommendations contained just a few items that were very interesting to them, among many that they were uninterested in. Similar to the Most Popular method, three users rated Duine the highest, and two users rated it lowest.
Social Filtering: (average score of 2.8/5, ranked 5th). Four users rated this method the lowest, and although three users did give this method a score of 4/5, in general it was observed to often recommend movies that were completely unsuited to the user’s tastes.
Discussion. Individuals differentiated the quality of the recommender techniques. However, there was no consistently superior technique: all methods were given at least one user’s highest rating, yet all methods were also given at least one user’s lowest rating. This suggests the value of allowing users to choose their recommendation method. Further, participants commented that the different recommendation methods could be useful for different tasks (e.g. one user commented that if he were in the mood to see something quite mainstream, he would choose Most Popular recommendations; however, if he were in the mood to see something more tailored to his own interests, he could choose Genre Based recommendations). The fact that some users commented that they would be interested in Most Popular recommendations, while others commented that they would not be, is an example of the individuality of users. Such individuality makes a case for providing personalisation of presentations and explanations within recommender systems. Of significant interest is the fact that allowing users to adjust their genre interests improved recommendations significantly, moving the Genre Based recommendations from the lowest rated to the highest rated set of recommendations. The average rating for Genre Based recommendations increased by 20% after the introduction of the Genre Control. This is strong evidence of the usefulness of control features in recommender systems. Also interesting was the impact of the cold start problem for new users on the performance of recommendation techniques. The Learn By Example algorithm was rated highly by users, indicating that it is able to produce good recommendations even when users have entered few ratings. In contrast, users rated the Social Filtering recommendations the second lowest, indicating it produced poor recommendations. The poor performance of this recommendation algorithm was due to its inability to cope with such a small amount of ratings information. This serves as confirmation of the existence of the cold-start problem in our evaluations. It is in this case, where the recommendations produced by the social filtering algorithm are not good, that the explanations that are provided to users are quite crucial — in order to help the user to decide how much trust to place in recommendations by allowing them to
understand how and why the system made a recommendation, especially if it is a recommendation that the user feels is not useful.
FIGURE 5.4: Average Usefulness Ratings For Each Explanation. Error Bars Show Standard Deviation. N = 10
5.3.2 Explanations
Users rated six explanation methods according to their usefulness for helping understand and use recommendations. Figure 5.4 shows the average score for each of the different explanations, with error bars showing one standard deviation above and below the mean (actual results for each user shown in Appendix D). We now discuss these explanations in order of average usefulness.
Most Popular: (average score of 4.0/5, ranked equal 1st). Seven people gave the Most Popular explanation a score of 4 or more, and no users rated it below 3. However, one user did state that he believed that the Most Popular recommendations were calculated using more than just a simple average of the ratings given to each item — this belief was incorrect.
Social Filtering (Graph): (average score of 4.0/5, ranked equal 1st). This explanation had the highest average rating of all the Social Filtering explanations. Seven users rated this explanation highest and no users rated it lowest.
Learn By Example: (average score of 3.6/5, ranked 3rd). Nine users gave this explanation a rating of 3 or more and four of these users rated this explanation the highest. However, while viewing these explanations, two users spontaneously commented that they disagreed with the similarity measure used by the Learn By Example technique. They were interested in knowing more information about how similarity is computed. One of these users expressed a desire to control the way that similarity is calculated.
Genre Based: (average score of 3.4/5, ranked 4th). Five users gave this explanation their lowest score. Users were often observed to find these explanations inadequate. Two users spontaneously commented that although these explanations indicated the genres that each item belonged to, the reason that items from these genres were recommended was not made clear.
Social Filtering (Simple Text): (average score of 2.8/5, ranked 5th). This explanation had the highest variance of all the explanations. Two users gave this explanation a score of 4 or more, and yet five users rated this method the lowest of all the explanations.
Social Filtering (Similar Users): (average score of 2.6/5, ranked 6th). Similar to the Social Filtering (Simple Text), five users rated this method the lowest of all the explanations. No users gave this method a 5, and only two users gave this method a score above 3.
Users also rated the overall usefulness of the iSuggest explanations for helping them understand and use recommendations. The average score for this question was 3.7/5. Figure 5.5 shows each user’s response
to this question (actual results for each user shown in Appendix D).
FIGURE 5.5: Users’ Ratings For The Overall Use Of The iSuggest Explanations. N = 10
Discussion. The fact that users gave an average rating of 3.7 when asked to rate the usefulness of the iSuggest explanations shows that explanations appear to improve the usefulness and understandability of recommendations. After viewing the explanations provided for the Learn By Example technique, one user even expressed a desire to control how similarity between items was computed. This suggests
that scrutability might spur some users to take more control over a system. In general, most of the complaints that users did have about the explanations provided were that they wanted to know more details about how the recommendation process worked. In particular, users wanted the Genre Based and Learn By Example explanations to contain more information. Possible extensions to the existing iSuggest explanations could include:
Genre Based: Indicating the user’s calculated interest in each genre that an item belongs to. Learn By Example: Indicating why items were judged to be similar to one another. Further, a useful control feature could be the ability to adjust the factors that are used to judge similarity between items.
Of course, further research would be required to discover whether these extensions could be useful in improving the understandability and usefulness of recommendations. It was not surprising that the Most Popular explanations were rated highest on average. This method is quite simple in operation and thus is easy to explain to users. However, the fact that the Social Filtering (Graph) explanations were also rated highest on average was remarkable, as this recommendation method is much more complicated. On average, the Graph-based explanation of the Social Filtering technique was rated higher than both the Simple Text and the Similar Users forms of explanation. This suggests that users found this graph of the ratings of similar users to aid their understanding and ability to use recommendations. The high performance of the Social Filtering (Graph) explanation conflicted with the results of the questionnaire (where Social Filtering (Simple Text) had the highest average understanding rating). The fact that Social Filtering (Graph) scored a higher average rating than Simple Text demonstrated the value of implementing and testing explanations. In fact, this result is supported by research in (Herlocker, 2000), where it was found that a histogram of similar users’ ratings was the most effective form of Social Filtering explanation. The fact that the Learn By Example explanations were rated third is somewhat surprising, as one of the benefits often noted for the Learn By Example technique is the "potential to use retrieved cases to explain [recommendations]" (Cunningham et al., 2003, p. 1). Finally, the Genre Based explanations scored poorly mainly due to the fact that these explanations did not contain enough detail.
5.3.3 Controls
Users rated two control features according to their effectiveness in improving recommendations. Figure 5.6 shows users’ ratings for each of the control features, with error bars showing the standard deviation (results for each user also shown in Appendix D).
FIGURE 5.6: Users’ Ratings For The Effectiveness Of Control Features. (a) Genre Based; (b) Social Filtering.
Prediction Method Control: No specific statistical results were collected with respect to the ability of users to control the recommendation method that was used. However, three users of the system spontaneously commented that the ability to use many different prediction mechanisms was quite useful and one user stated that this helped him to "work with the system to produce recommendations rather than simply be given a set of ‘take-it-or-leave-it’ recommendations."
Genre Based Control: (average score of 4.4/5, rated 1st). Nine users gave this method a score of 4 or more, and one user gave this control a 3. As noted in Section 5.3.1, the original Genre Based recommendations received the lowest average score. However, once users were given the chance to adjust their genre interests, the revised Genre Based recommendations received an average of 3.9/5 — the highest average score. One user spontaneously commented that he would like his genre interests to be used as input to other recommendation techniques, not just Genre Based. Another user spontaneously commented that he would like to be able to adjust his interest in sub-genres, as well as genres. He felt that the ability to specify interest in sub-genres would enable this control to improve his recommendations even further.
Social Filtering Control: (average score of 2.6/5, rated 2nd). Three users rated this control either 4 or 5, while the other seven users gave this control a rating of 2 or less. One user was observed to find no users whom he thought should be ignored, despite examining the ratings for all of the 9 most similar users. Two other users spontaneously commented that although they did click to ignore particular users, this had little to no impact upon their recommendations. Users also rated the overall effectiveness of the iSuggest control features for improving their recommendations. The average score for this question was 4.4/5. Figure 5.7 shows each user’s response to this question (actual results for each user shown in Appendix D).
FIGURE 5.7: Users’ Ratings For The Overall Effectiveness Of The iSuggest Control Features.
Discussion. The results of the survey showed that users were highly interested in having control over their recommender system. The results of these evaluations confirmed that such control features can be effectively incorporated into a recommender system. When asked how useful they found the iSuggest control features in improving their recommendations, all users gave consistently high scores. This is strong evidence to support the case for including controls in recommender systems. However, the Social Filtering control feature was rated quite lowly by many users. This is most probably due to the fact that the average number of users that were ignored through the use of this control was only 2.3 — which is often not enough users to produce a significant change. This result suggests that most users would not use this control to ignore a large number of users, and thus it would not be likely to be highly effective. However, some users did rate this control highly, so further investigation is needed. Despite the poor performance of this particular control, the overall results from this section of the evaluation show that control features can be highly effective — as long as the controls that are incorporated are able to demonstrate a noticeable effect.
The conclusions that we can draw from this investigation into the usefulness of control features include:
• Controls can be useful in improving recommendations.
• Users have shown a strong interest in being offered control over their recommender system.
• The Genre Based Control is a very useful method for allowing users to improve the quality of recommendations.
• Users found the ability to choose which recommendation technique was used to be highly useful.
5.3.4 Presentation Method
Five users rated the usefulness of three types of Map Based Presentation. After these users completed evaluations, their feedback was used to make the following changes to the Map Based Presentations:
• Spread out the items in the map to make it less cluttered.
• Allowed users to click on a genre to zoom in on that genre.
• Had the map start in the ‘zoomed out’ state, rather than a very ‘zoomed in’ state.
• Allowed users to zoom in further to read movie titles more clearly.
A further group of five users then rated the usefulness of the Map Based Presentations. Figure 5.8 shows the average score that each group gave to the different forms of Map Based Presentation, with error bars showing the standard deviation (actual results for each user shown in Appendix D).
FIGURE 5.8: Average Usefulness Of The Map Based Presentations. Error Bars Show Standard Deviation. (a) Group 1; (b) Group 2 (After Revision Of Maps).
Full Map Presentation: (average of 2.0/5 from Group 1, average of 4.3/5 from Group 2). Group 1 gave this method a maximum rating of 3. Two users from this group commented that the Map was too crowded. One user spontaneously commented that sometimes items were placed near genres that they didn’t really belong to — which was confusing. However, following the revision of the maps, Group 2 gave this method an average of 4.3/5 — the highest score for any of the maps. Further, all users from Group 2 gave the Full Map more than 3/5. Three users from Group 2 rated Full Map the highest. One user from Group 2 commented that the Full Map "gives you a scope and makes it easier to navigate between genres". Another user spontaneously commented that she found the colour coding to be a useful way to quickly discover what genres the system thought you were interested in.
Top 100 Presentation: (average of 2.6/5 from Group 1, average of 4.0/5 from Group 2). On average, Group 1 rated Top 100 slightly higher than Full Map. However, as was the case with the Full Map presentation, all users from Group 1 rated Top 100 as 3 or below. The average rating for Top 100 from Group 2 (4.0/5) was slightly lower than the average for Full Map, but 4.0 was the second highest average score for any of the maps. One user from Group 2 gave this map a 5, three gave it a 4 and one user gave it a 3. Two users from Group 2 rated Top 100 the highest.
Item-to-item Similarity: (average of 2.6/5 from Group 1, average of 3.0/5 from Group 2). Two users from Group 2 gave this method a four, but all other users from Groups 1 and 2 gave this method 3 or less. In Group 1, this map had the equal highest average score. In Group 2, the average scores of Full Map and Top 100 improved, but the average score for this map did not. This meant that this map had the lowest average score for Group 2. One user spontaneously commented that this map was not useful as it showed items that were not highly rated for her and that often the map would display relationships between items that she felt were not related. Another user volunteered that he felt this map should show more levels of Item-To-Item similarity.
Users also reported their preferred presentation type (‘List Only’, ‘Map Only’ or ‘Both List And Map’). Figure 5.9 shows the sum of the responses given by Groups 1 and 2 (actual results for each user shown in Appendix D).
FIGURE 5.9: Sum Of Votes For The Preferred Presentation Type, For (a) Group 1 And (b) Group 2 (After Revision Of Maps).

Discussion. The initial group of five users gave all of the map based forms of presentation quite low scores. Only one of this initial group indicated he would like Map Based Presentations included in a recommender system. In general, users in Group 1 felt that the map based presentations were difficult to use. This was because the map seemed very crowded and it was hard to zoom in on particular items or areas of interest. However, once the map interface was revised, the second group of users gave the map based presentations higher scores for utility. Users in Group 2 found the Full Map and Top 100 maps to be especially useful. The probable cause for the lower performance of the Item-To-Item map lies in the fact that the Item-To-Item collaborative filtering process can sometimes produce relationships between items that a user might not expect. This confused users who were expecting items that were more directly related to be displayed with one another (e.g. movies in the same genre). After the revision of the maps, four out of five users said they would like both List Based and Map Based presentation. This strongly suggests that Map Based Presentation of recommendations would be a worthwhile addition to a recommender system. The Full Map and Top 100 presentations are useful presentation methods, though user interaction and scalability are two areas where more research needs to be conducted. However, in general, once the initial usability issues were overcome, users seemed quite keen on having a Full Map presentation incorporated into a recommender system.
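As background to this observation, item-to-item relationships of the kind shown in this map are typically derived from the rating matrix itself. The sketch below (Python, with hypothetical data; it is not the implementation used in iSuggest) scores item-to-item similarity as the cosine similarity between the vectors of ratings each item has received. It also illustrates why two items can appear related without sharing a genre: they only need to have been rated similarly by the same users.

from collections import defaultdict
from math import sqrt

# (user, item) -> rating on the 0.0-5.0 scale (hypothetical data).
ratings = {
    ("u1", "Alien"): 4.5, ("u1", "Aliens"): 4.0, ("u1", "Amelie"): 2.0,
    ("u2", "Alien"): 4.0, ("u2", "Aliens"): 4.5,
    ("u3", "Aliens"): 3.0, ("u3", "Amelie"): 5.0,
}

def item_vectors(ratings):
    """Group the ratings into one {user: rating} vector per item."""
    vectors = defaultdict(dict)
    for (user, item), value in ratings.items():
        vectors[item][user] = value
    return vectors

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two items' rating vectors."""
    common = set(vec_a) & set(vec_b)
    if not common:
        return 0.0
    dot = sum(vec_a[u] * vec_b[u] for u in common)
    norm_a = sqrt(sum(v * v for v in vec_a.values()))
    norm_b = sqrt(sum(v * v for v in vec_b.values()))
    return dot / (norm_a * norm_b)

vectors = item_vectors(ratings)
print(cosine_similarity(vectors["Alien"], vectors["Aliens"]))   # rated together similarly: high
print(cosine_similarity(vectors["Alien"], vectors["Amelie"]))   # little overlap: low

Because the similarity depends only on co-rating patterns, the pairs it surfaces may cut across genres, which is exactly the behaviour some users found confusing in the Item-To-Item map.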
5.4 iSuggest-Unobtrusive - Results
This section reports the results of both statistical and user evaluations of iSuggest-Unobtrusive. At this point, it is important to note that the average number of ratings that were automatically generated for users during user evaluations was 80.5. This was a sufficient number of ratings to ensure that the cold start problem for new users would not be a factor during evaluations.
5.4.1 Statistical Evaluations
Before any user evaluations were performed, statistical evaluations were carried out on iSuggest-Unobtrusive. These evaluations attempted to investigate the performance of the ratings generation algorithm and the quality of recommendations produced using these ratings. The datasets used to complete these evaluations were the MovieLens standard dataset, which contained 100,000 ratings, and the last.fm dataset, which contained 100,000 play-counts that were converted into 70,149 ratings. The two statistical evaluations that were conducted were: a calculation of the distribution of the ratings that existed or were produced for each dataset; and a calculation of the MAE and SDAE for four recommendation techniques using each of the datasets. The results of these evaluations are reported below.

The distribution of the ratings that were generated from play-count data was calculated and compared to the distribution of ratings within the MovieLens standard dataset. Figures 5.10(a) and 5.10(b) show these distributions.
FIGURE 5.10: Comparison Of Distribution Of Ratings Values For (a) Unobtrusively Generated Music Ratings And (b) Movie Ratings From The MovieLens Dataset.

The rating scale that was used to calculate the distribution of ratings was a scale of 0.0-5.0, with increments of 0.5 (as all ratings within iSuggest were displayed on this scale). However, the ratings contained within the MovieLens dataset were based on a scale of 1.0-5.0, with increments of 1. This means that there are a number of values shown in Figure 5.10(b) for which no ratings exist. Despite this, the general distribution of ratings in the MovieLens dataset is clear. Only sixteen percent of the ratings in the MovieLens dataset occur below the value of 2.5, and zero percent of the generated ratings occur below this value. Twenty-seven percent of the MovieLens ratings were 2.5s, compared to sixteen percent of the generated ratings. Thirty-five percent of the MovieLens ratings occur within the range of 3.0 to 4.5 (inclusive), whereas eighty-three percent of the generated ratings occur within this range. Finally, twenty percent of the MovieLens ratings were 5s; only one percent of the generated ratings were 5s.
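For clarity, these distributions are simply half-star histograms over each set of ratings. A minimal sketch of the calculation is given below (Python, with made-up ratings rather than the actual datasets):

from collections import Counter

def rating_distribution(ratings, low=0.0, high=5.0, step=0.5):
    """Percentage of ratings falling on each value of the 0.0-5.0 half-star scale."""
    counts = Counter(round(r / step) * step for r in ratings)
    values = [low + i * step for i in range(int((high - low) / step) + 1)]
    return {v: 100.0 * counts.get(v, 0) / len(ratings) for v in values}

# Hypothetical generated ratings, clustered between 2.5 and 4.5.
print(rating_distribution([2.5, 2.5, 3.0, 3.0, 3.5, 4.0, 4.5, 5.0]))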
The MAE for four different recommendation techniques was calculated using the ratings generated from play-count data. This was compared to the MAE for the same techniques when recommending movies using the MovieLens standard dataset. Figures 5.11(a) and 5.11(b) show the MAE and SDAE for each of the four recommendation techniques, using the generated ratings and the MovieLens ratings respectively.

(a) MAE and SDAE of Recommendation Techniques Using Unobtrusively Generated Music Ratings
Technique         GMAE   St. Dev.
Social Filtering  0.091  0.171
Genre Based       0.101  0.174
Learn By Example  0.102  0.185
Most Popular      0.106  0.178

(b) MAE and SDAE of Recommendation Techniques Using Movie Ratings Taken From The MovieLens Dataset
Technique         GMAE   St. Dev.
Social Filtering  0.384  0.490
Genre Based       0.425  0.530
Learn By Example  0.465  0.592
Most Popular      0.384  0.488

FIGURE 5.11: Comparison Of MAE And SDAE For MovieLens Recommendations And Recommendations Using Generated Ratings. Lower Scores Are Better. Techniques Are Sorted By MAE.

The average MAE for the recommendations using the generated ratings was calculated to be 0.315 lower than the average MAE for the recommendations using the MovieLens dataset. Further, the average SDAE for recommendations using generated ratings was 0.348 lower than the average SDAE for recommendations using MovieLens ratings. The Most Popular technique had the best (i.e. the lowest) MAE for recommendations using the MovieLens dataset. It also had the lowest standard deviation. In contrast, this technique had the highest MAE for the recommendations created using generated ratings. Genre Based had the second best MAE in the generated ratings simulation. Learn By Example had the second worst MAE for the MovieLens recommendations, and the worst MAE for the generated rating recommendations. Finally, Social Filtering had the second worst MAE when recommendations were made using the MovieLens ratings. However, it had the best MAE when the generated ratings were used to make recommendations.
Discussion. The statistical evaluations showed that the ratings generation algorithm was generally quite conservative — the percentage of generated ratings above 3 was smaller than the percentage of ratings above 3 in the MovieLens data. One of the causes of this was the fact that the data used to generate ratings was counts of songs that users listened to. Often this data will contain artists for whom the user has only one song, and whom the user listens to infrequently. Such artists would be given a rating quite close to 2.5 by the generation algorithm. Another cause is the fact that, often, a user will listen to one 'favourite' artist very frequently, and other artists less frequently. In this case, the normalisation performed by the generation algorithm will result in the 'favourite' artist getting a high rating and the other artists getting lower ratings. In fact, the more that a user listens to a single artist, the lower the ratings for other artists will be. As many users listen to a few 'favourite' artists very often, the ratings for the artists who are not a user's favourites are likely to be relatively close to 2.5. The use of additional information in the ratings generation process (such as the number of songs by each artist that are on a user's iPod and the amount of time that a user has spent listening to each track) would be likely to improve the accuracy of the ratings generation.

The evaluation of the ratings algorithm using MAE and SDAE showed that the average MAE and SDAE for the recommendations using the generated ratings were much lower than the averages for the recommendations using MovieLens. For the most part, this is due to the fact that the generated ratings were distributed over a much smaller range than the MovieLens ratings. The smaller range of the generated ratings meant that predictions of a user's interest in a particular item using these generated ratings would be more likely to be correct than the predictions made using MovieLens data. Therefore, the MAE when using generated ratings is likely to be much lower than the MAE when using the MovieLens ratings. Due to the complexity of this situation, the MAE and SDAE calculations for the two simulations are not comparable. However, the MAE does still provide a useful measure of the performance of each of the prediction techniques. The two techniques that had the best MAE for the generated ratings simulation were Genre Based and Social Filtering. This meant that these two techniques were likely to be the most useful for making recommendations based upon the generated ratings.

Once these statistical evaluations had been completed, user evaluations were conducted. The results of the user evaluations are reported in Sections 5.4.2 and 5.4.3.
5.4.2 Ratings Generation
Users rated their understanding of how ratings had been generated from their iPod. They also rated the accuracy of the ratings that were generated. The results from these questions are discussed below.
Understanding Of Ratings Generation: (average score of 5.0/5). All users responded to this question with a score of 5/5.
Accuracy Of The Ratings Generated: (average score of 4.3/5). One user spontaneously commented that the program seemed to be a little bit conservative — being quite hesitant to give out higher ratings, and tending to give out ratings of mainly 2.5 and 3 stars. However, this question received very high scores from all users — no users responded with less than a score of 4, and three users gave a score of 5. Two users spontaneously commented that their favourite artist had been given the highest rating.
Discussion. Users gave consistently high scores when asked about their understanding of how their ratings were generated. This indicates that they believed they had a very clear understanding of how their ratings had been generated. Users also gave consistently high scores when asked about the accuracy of their ratings. This suggests that the algorithm implemented in this prototype was able to successfully model users’ interests in particular artists. Some users did comment that, as was shown in Section 5.4, the ratings generation process was quite conservative. Yet despite this, users felt that the ratings generated were quite accurate, especially due to the fact that the users’ favourite artists were consistently given the highest ratings.
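As context for this conservatism, the sketch below (Python) illustrates one plausible play-count normalisation of the kind discussed in Section 5.4.1. It is illustrative only and is not the exact algorithm implemented in iSuggest-Unobtrusive: a single heavily played favourite artist takes the top of the scale, and all other artists are pushed towards the neutral 2.5 mark.

def ratings_from_play_counts(play_counts):
    """Map per-artist play-counts onto the 0.0-5.0 rating scale (0.5 increments).

    A plausible normalisation (not necessarily the one iSuggest uses): scale
    each artist's play-count by the most-played artist, so generated ratings
    fall between 2.5 and 5.0, then round to the nearest half-star.
    """
    max_count = max(play_counts.values())
    ratings = {}
    for artist, count in play_counts.items():
        raw = 2.5 + 2.5 * (count / max_count)
        ratings[artist] = round(raw * 2) / 2  # snap to 0.5 increments
    return ratings

# Hypothetical listening history: one dominant favourite plus a long tail.
print(ratings_from_play_counts({"Favourite Artist": 400, "Occasional Artist": 60, "One-Song Artist": 2}))
# The favourite receives 5.0; the remaining artists are squashed towards 2.5.

Under any normalisation of this general shape, the behaviour users observed follows directly: favourite artists reliably receive the top ratings, while the long tail of occasionally played artists clusters around 2.5 to 3 stars.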
5.4.3 Recommendations
Users rated the usefulness of the three sets of recommendations produced from their generated ratings. Figure 5.12 shows the average score for each of the different techniques, with error bars showing the standard deviation (actual results for each user shown in Appendix F). We now discuss these techniques in order of average usefulness.

Genre Based Recommendations: (average score of 3.9/5, ranked 1st). The average rating for these recommendations was substantially higher than the average for Random recommendations. In fact, all but one of the users gave Genre Based recommendations their highest rating.

Social Filtering Recommendations: (average score of 3.1/5, ranked 2nd). This method received a higher average score than the Random Recommendations, yet it was not the highest rated recommendation method. One user commented that some artists that were recommended did seem to be quite appropriate, but that the recommendation list contained too many incorrect recommendations for it to be really useful.

Random Recommendations: (average score of 2.2/5, ranked 3rd). Seven users gave this method their lowest rating. No users gave this method their highest rating.
FIGURE 5.12: Average Usefulness Ratings For Each Recommendation Method. Error Bars Show Standard Deviation.

Users also reported whether they would like the 'Get Ratings From My iPod' feature incorporated into a recommender system. In answer to this question, all users reported that they would like to have the 'Get Ratings From My iPod' function incorporated into a recommender system. One user spontaneously commented that "this is a great idea, and a really useful time saver". Three users commented that having ratings generated was highly preferable to rating items individually by moving through a large list. One of these users continued, saying that they would be willing to make minor adjustments to the ratings produced by the generation process to make the ratings more accurate and receive better recommendations.

Discussion. The fact that Random recommendations received the lowest average score is not surprising, as these recommendations were presented to users to act as a control. The fact that 2.2/5 is the score that users would give a random set of recommendations can serve as a reference point for judging the utility of the recommendations presented to users. Social Filtering performed the best in the statistical evaluations described in Section 5.4.1, so it was assumed that users would find it to be highly useful. However, on average the usefulness of this method was rated lower than the Genre Based recommendations. The most likely reason for this is the fact that the ratings produced by the generation algorithm were distributed over only a small range. This meant that the process of matching similar users to one another was less successful, as the differences between users in terms of their ratings were less pronounced. This resulted in lower quality Social Filtering recommendations. Social Filtering performed well in statistical evaluations because it predicts a user's rating for a new item in a way that is similar to taking the average rating that similar users gave this item. When there is such a low range of ratings in the system, this 'average rating' style approach is very likely to calculate a predicted value that is close to the average rating that users gave to items. Basically, because the range of ratings was so small in this example, a predictor such as this, which draws heavily upon users' ratings, is more likely to perform well on statistical evaluations. However, when used in a real world system, this recommendation method does not produce optimum results because it struggles to clearly identify similar and opposite users and thus produces poor recommendations.

The fact that Genre Based recommendations were rated highly by the majority of users is strong evidence to suggest that useful recommendations can indeed be made using only implicit ratings data. The most likely reason that this recommendation method was able to produce high quality recommendations is the fact that it does not use the ratings that are input by a user in the same way that the Social Filtering method does. The Genre Based method uses the user's ratings to adjust their predicted interest in particular genres. This predicted interest is most significantly affected by the items that a user has rated very highly or very low. Items that the user has given a relatively neutral rating affect these predicted interests in a much less significant way. As a result, this recommendation method is not adversely affected by the fact that the ratings generation algorithm produced a large number of relatively neutral ratings. Thus, this recommendation method was able to use the items that the user has rated highly to infer genre interests and make successful recommendations. The results of these user trials strongly suggest that useful recommendations can be made using only implicit data as ratings information. One strong indicator of this lay in the fact that, when asked, all users reported that they would like to have the 'Get Ratings From My iPod' function incorporated into a recommender system. In the future, more research is required to investigate whether ratings generated using a different algorithm might alter the performance of each recommendation technique.
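To make this contrast concrete, the sketch below (Python, illustrative only; it is not the Duine Toolkit or iSuggest implementation) shows a genre interest predictor of the kind described above. Strong likes and dislikes move the per-genre interest noticeably, near-neutral ratings barely move it, and an unrated item is scored from the interests of the genres it belongs to. A neighbour-averaging predictor, by contrast, would be dominated by the cluster of near-neutral generated ratings.

NEUTRAL = 2.5  # midpoint of the 0.0-5.0 rating scale

def genre_interests(rated_items, learning_rate=0.3):
    """Infer per-genre interest from a user's ratings.

    Each rating nudges the interest of the item's genres in proportion to how
    far it sits from neutral, so strong likes and dislikes dominate while the
    many near-neutral generated ratings have little effect.
    """
    interests = {}
    for genres, rating in rated_items:
        for genre in genres:
            current = interests.get(genre, NEUTRAL)
            interests[genre] = current + learning_rate * (rating - NEUTRAL)
    return interests

def predict(genres, interests):
    """Predict interest in an unrated item as the mean interest of its genres."""
    return sum(interests.get(g, NEUTRAL) for g in genres) / len(genres)

# Hypothetical ratings: mostly neutral, with one strongly liked rock artist.
rated = [(["rock"], 5.0), (["pop"], 2.5), (["rock", "indie"], 3.0), (["jazz"], 2.5)]
interests = genre_interests(rated)
print(predict(["rock"], interests))  # pulled above neutral by the strong like
print(predict(["jazz"], interests))  # stays at neutral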
5.5 Conclusion Evaluations were designed and conducted for each of the two prototype variants. These evaluations aimed to investigate the research questions defined in Chapter 1 and build upon the knowledge that was gained from the questionnaire conducted in Chapter 3. iSuggest-Usability was evaluated through user evaluations, conducted with 10 people. These user evaluations produced the following findings:
Recommendation usefulness.
• Despite the fact that very few ratings had been entered by each user, the Genre Based and Learn By Example techniques were highly rated by users. This suggests that these two techniques would be useful, even in situations where the cold start problem for new users exists.
Understanding.
• Explanations were shown to be a useful addition to a recommender system.
• A graph based method was shown to be the most effective way to explain Social Filtering recommendations.
• On average, the Learn By Example recommendations were rated to be the third most understandable recommendations — a curious result, given that one of the benefits of the Learn By Example technique is stated to be the "potential to use retrieved cases to explain [recommendations]" (Cunningham et al., 2003).
• Some of the explanations incorporated into the prototype would benefit from the addition of extra information.
• Comments made by users during evaluations suggested that the addition of scrutability might spur some users to take more control over a system.
User Control.
• Controls can be useful for allowing users to improve their recommendations, particularly the Genre Based control.
• Users have a high level of interest in being given control of their recommender system.
• Evidence showed that allowing users to select which recommendation technique should be used is highly useful.
Presentation.
• Evidence suggested that a Map Based presentation of recommendations (such as the Full Map or Top 100 Map included in iSuggest-Usability) would be a useful addition to a recommender system.
Evaluations also highlighted the individuality of users, many of whom preferred different presentation styles, explanation styles and recommendation techniques. In general, users found many of the features included in iSuggest-Usability to be quite useful for improving the quality of recommendations and the scrutability of a recommender system.

iSuggest-Unobtrusive was evaluated through user evaluations, conducted with 9 people, as well as through statistical evaluations. These evaluations produced the following findings:
• Ratings can be generated from implicit information in a way that users have indicated is easy to understand and is generally accurate.
• Useful recommendations can be made based purely upon ratings generated from implicit information about users.
• The ratings generation algorithm implemented in iSuggest-Unobtrusive is conservative, and could be improved upon.
• Genre Based is a useful recommendation technique to use when the distribution of ratings values is conservative.
• The addition of other types of implicit data to the ratings generation process (such as time spent listening to each track) could improve the quality of the generated ratings.

Generally, it was found that iSuggest-Unobtrusive incorporated highly useful features that enabled ratings to be generated unobtrusively and effective recommendations to be produced from this information. Overall, the evaluations of the two prototype variants produced a number of important findings regarding both the Scrutability & Control and Unobtrusive Recommendation research questions.
CHAPTER 6
Conclusion
The research questions for this thesis were expressed in Chapter 1 to be:
Scrutability & Control: What is the impact of adding scrutability and control to a recommender system? Unobtrusive Recommendation: Can a recommender system provide useful recommendations without asking users to explicitly rate items?
As noted in Chapter 2, there is very little published research that deals with either of these two questions, but there is clear recognition of their importance and of the challenges involved in addressing them. Thus, this thesis investigated each of these questions. An exploratory study was conducted, which involved an analysis of existing systems and the administration of a questionnaire. The results from this study informed the creation of a prototype system, which included a number of scrutability, control and unobtrusive recommendation features. Finally, this system was evaluated through a combination of statistical methods and user evaluations. Both the exploratory study and the evaluations of the prototype produced significant findings. These findings include:
Scrutability & Control. Based on the results from the questionnaire (which had 18 respondents and is detailed in Chapter 3) and the two user evaluations (each of which had at least 9 participants and are detailed in Chapter 5), the following findings were made:
• Explanations are a useful addition to a recommender system. However, complicated or poor explanations can often confuse a user's understanding of recommendations.
• Specific explanation types were found to be more useful than others for explaining particular recommendation techniques.
• Different users prefer different forms of presentation and explanation.
• Genre Based and Learn By Example are both techniques that could be utilised to avoid the cold start problem for new users.
• A Map Based presentation of recommendations can be a useful addition to a recommender system.
• Users have a high level of interest in being given control of their recommender system. Further, such controls can be useful for allowing users to improve the usefulness of recommendations.
• Respondents to our questionnaire did not think that Description Based or Lyrics Based recommendation techniques would be useful.

Unobtrusive Recommendation.
• Ratings can be generated from implicit information in a way that users have indicated is easy to understand and is generally accurate. These ratings can then be used to make useful recommendations.

Overall, this thesis was highly successful. It highlighted a number of key scrutability and control features that would appear to be useful additions to existing recommender systems. These features can be used to improve recommendation quality and usefulness, as well as improve users' trust in and understanding of recommender systems. Further, the Genre Based and Learn By Example techniques were shown to produce useful recommendations, even when users had not entered a large number of ratings (a situation that causes many recommendation techniques to produce poor recommendations). It was also shown that a Map Based presentation would be a useful presentation method, which could be incorporated into existing recommender systems. Finally, it was shown that ratings automatically generated from implicit information about a user can be used to make useful recommendations. Each of these findings is significant, as they can be used to improve the effectiveness, usefulness and user friendliness of existing recommender systems.
6.1 Future Work
Despite the substantial progress made during this thesis, there are a number of areas that require future research. These areas include:
• Investigation of the usefulness of dynamically combining multiple recommendation techniques.
• Investigation of new or extended ways of providing explanations and control to users.
• Further investigation into the most useful methods for providing a Map Based presentation of recommendations.
• Improvements to the ratings generation algorithm presented in this thesis.
• Investigation of other types of implicit data that could be used to generate ratings.
References
G. Adomavicius and A. Tuzhilin. 2005. Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. Knowledge and Data Engineering, IEEE Transactions on, 17(6):734–749.
J. Atkinson. 2006. Free music recommendation services, 25th May.
C. Basu, H. Hirsh, and W. Cohen. 1998. Recommendation as classification: Using social and content-based information in recommendation. Proceedings of the Fifteenth National Conference on Artificial Intelligence.
J. S. Breese, D. Heckerman, and C. Kadie. 1998. Empirical analysis of predictive algorithms for collaborative filtering. Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, 461.
P. Cano, M. Koppenberger, and N. Wack. 2005. An industrial-strength content-based music recommendation system. Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 673–673.
P. Cunningham, D. Doyle, and J. Loughrey. 2003. An Evaluation of the Usefulness of Case-Based Explanation. Case-Based Reasoning Research and Development. LNAI, 2689:122–130.
M. Deshpande and G. Karypis. 2004. Item-based top-n recommendation algorithms. ACM Transactions on Information Systems (TOIS), 22(1):143–177.
J. L. Herlocker, J. A. Konstan, and J. Riedl. 2000. Explaining collaborative filtering recommendations. Proceedings of the 2000 ACM Conference on Computer Supported Cooperative Work, pages 241–250.
J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. T. Riedl. 2004. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems (TOIS), 22(1):5–53.
J. L. Herlocker. 2000. Understanding and Improving Automated Collaborative Filtering Systems. Ph.D. thesis, University of Minnesota.
X. Hu, J. S. Downie, K. West, and A. Ehmann. 2005. Mining Music Reviews: Promising Preliminary Results. Proceedings of the 6th International Symposium on Music Information Retrieval, pages 536–539.
A. Kiss and J. Quinqueton. 2001. Machine learning of user preferences in a corporate knowledge management system. Proceedings of ISMCIK '01, pages 257–269.
J. A. Konstan, B. N. Miller, D. Maltz, J. L. Herlocker, L. R. Gordon, and J. Riedl. 1997. GroupLens: applying collaborative filtering to Usenet news. Communications of the ACM, 40(3):77–87.
B. Logan. 2004. Music recommendation from song sets. Proceedings of ISMIR.
H. Mak, I. Koprinska, and J. Poon. 2003. INTIMATE: a web-based movie recommender using text categorization. Proceedings of the 2003 IEEE/WIC International Conference on Web Intelligence, pages 602–605.
D. Maltz and K. Ehrlich. 1995. Pointing the way: active collaborative filtering. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 202–209.
D. McSherry. 2005. Explanation in Recommender Systems. Artificial Intelligence Review, 24(2):179–197.
S. E. Middleton, D. C. De Roure, and N. R. Shadbolt. 2001. Capturing knowledge of user preferences: ontologies in recommender systems. Proceedings of the International Conference on Knowledge Capture, pages 100–107.
R. J. Mooney and L. Roy. 2000. Content-based book recommending using learning for text categorization. Proceedings of the Fifth ACM Conference on Digital Libraries, pages 195–204.
J. Nielsen. 1993. Evaluating the thinking-aloud technique for use by computer scientists. Advances in Human-Computer Interaction, 3:69–82.
J. Nielsen. 1994. Estimating the number of subjects needed for a thinking aloud test. International Journal of Human-Computer Studies, 41(3):385–397.
D. W. Oard and J. Kim. 1998. Implicit feedback for recommender systems. Proceedings of the AAAI Workshop on Recommender Systems, pages 81–83.
G. Polcicova, R. Slovak, and P. Navrat. 2000. Combining content-based and collaborative filtering. Proceedings of ADBIS-DASFAA Symposium 2000, pages 118–127.
U. Shardanand and P. Maes. 1995. Social information filtering: algorithms for automating "word of mouth". Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 210–217.
R. Sinha and K. Swearingen. 2001. Beyond algorithms: An HCI perspective on recommender systems. Proceedings of the SIGIR 2001 Workshop on Recommender Systems.
R. Sinha and K. Swearingen. 2002. The role of transparency in recommender systems. Proceedings of the Conference on Human Factors in Computing Systems, pages 830–831.
M. van Setten, M. Veenstra, and A. Nijholt. 2002. Prediction strategies: Combining prediction techniques to optimize personalization. Proceedings of the Workshop Personalization in Future TV'02, pages 23–32.
M. van Setten, M. Veenstra, A. Nijholt, and B. van Dijk. 2003. Prediction strategies in a TV recommender system: Framework and experiments. Proceedings of IADIS WWW/Internet 2003, pages 203–210.
M. van Setten, M. Veenstra, A. Nijholt, and B. van Dijk. 2004. Case-based reasoning as a prediction strategy for hybrid recommender systems. Proceedings of the Atlantic Web Intelligence Conference, pages 13–22.
M. van Setten. 2005. Supporting People In Finding Information. Ph.D. thesis, Telematica Instituut.
APPENDIX A
Appendix A — Questionnaire Form
Note: On this questionnaire, the technique referred to in the thesis as Learn By Example is called Learning From Similar. Also, the technique referred to in the thesis as Social Filtering is called Word Of Mouth.
APPENDIX B
Appendix B — Questionnaire Results
Note: A * indicates that this user did not answer this question due to the fact that the content of the questionnaire changed after the first five respondents.
APPENDIX C
Appendix C — iSuggest-Usability Evaluation Instructions
APPENDIX D
Appendix D — iSuggest-Usability Evaluation Results
Note: A * indicates that this user did not answer this question due to a copying error.
APPENDIX E
Appendix E — iSuggest-Unobtrusive Evaluation Instructions
APPENDIX F
Appendix F — iSuggest-Unobtrusive Evaluation Results