
TO VERA AND MICHAEL


Foreword


Markets, particularly those for media products, are characterized by an immense range of alternative products, not least through the rise of internet trade and the almost unlimited digital “warehouse space” associated with it. For individual consumers, it is difficult to keep pace with this ever increasing variety of choice options. At the same time, digitization also offers new technological approaches to and possibilities for dealing with these challenges. The work of Paul Marx constitutes just such an approach to automatic recommendation systems. Specifically, Paul Marx considers an important issue relating to the generation of such recommendations, namely how recommendation systems can provide recommendations that are genuinely useful and valuable to consumers. This issue is linked to a fundamental problem of such systems that limits their practical acceptance: many consumers trust these recommendations only to a limited extent (or not at all), because they do not contain any relevant explanations, but only vague statements such as “This product is being recommended to you because you like similar products”.

In this book (which also served as his dissertation thesis), Paul Marx develops a highly innovative recommendation system that attempts to reduce the acceptance problem that characterizes automatic recommendation systems by providing relevant justifications for the generated recommendations, while at the same time offering state-of-the-art predictions with low predictive error. Based on a comprehensive review of the literature, Paul Marx has achieved his ambitious objective by combining extensive high-quality econometric analysis with bright and creative ideas. His work makes a substantial contribution to the highly competitive research area of recommendation systems, marking an impressive academic achievement. Paul Marx’s handling of the problems and solutions is impressive, including his extensive knowledge of econometrics at a level rarely encountered in dissertations in the field of business administration. Similarly, he is able to link his methodological skills to a comprehensive review of the literature on both recommenders and the motion picture industry, theoretically underpinning his considerations. This piece of work is a rare case in which theory and a sophisticated methodology really work well together in obtaining meaningful results. It is also impressive in both language and form: lively, yet appropriately factual in its formulation.

Paul Marx’s investigation is part of a DFG research project. He has worked on this topic for almost six years, in the meantime also contributing to other research projects in this context which have, inter alia, been published in the Journal of Marketing, one of our discipline’s finest outlets. I am really happy to say that his continued commitment has paid off for both marketing theory and practice. Now I hope that his work receives the attention which its concepts and solutions duly deserve. And, dear managers, I hope for high-quality and well justified recommendations à la Paul Marx in my future shopping!

Münster and London, 18th December, 2012

Univ.-Prof. Dr. Thorsten Hennig-Thurau

Acknowledgments


One paradox of a dissertation project is that it is always allegedly accomplished by a single person but actually represents the result of the joint efforts of many individuals. I wish to thank all of these people for their invaluable contributions to the success of my work; without them, I would have been facing hard times.

First of all, a great, big thank you goes to my supervisor, Thorsten Hennig-Thurau, for awakening my interest in this thesis project in particular and academic work in general. He has constantly provided an exceptional and encouraging personal example of how the combination of hard work and attention to detail produces fruitful results, personal satisfaction, and professional advancement. I would also like to thank Tobias Bauckhage, the CEO of MoviePilot, for providing invaluable data for my experiments.

Thanks are also due to all of my teachers, who taught me the value of constant learning and inspired my curiosity towards and respect for the unknown. In particular, I would like to thank the teachers of Novosibirsk Aerospace Lyceum, the professors of the Aircraft Faculty of Novosibirsk State Technical University, and the professors of the Khristianovich Institute of Theoretical and Applied Mechanics. I am proud to have studied at these institutions. Special mention also goes to Anne Priller and Denis Rechkin for teaching me languages: Anne for English and Denis for C#.

And of course, I am eternally grateful to my family. I would like to thank my parents for shaping the individual that I have become and for providing me with their absolute love and continuing support throughout my life, particularly during the course of writing this thesis. I am also grateful to my ex-wife Elena for her patience and understanding and to my children Vera and Michael for reminding me that there are other important things going on out there in the world. Thank you for all your support. I have no doubt that this thesis would not have been possible without you.

Langenhagen, April 2012

Paul Marx


Table of Contents


Glossary
List of Tables
List of Figures

1 Introduction and Motivation
  1.1 Motivation
  1.2 Objectives
  1.3 The Outline of the Thesis

2 Background and Related Work
  2.1 A Parsimonious Overview of Recommendation Techniques
    2.1.1 Collaborative Filtering
      2.1.1.1 User-Based Approaches
      2.1.1.2 Item-Based Approaches
      2.1.1.3 Matrix Factorization and Latent Factor Models
    2.1.2 Content-Based Filtering
      2.1.2.1 The Principles of Content-Based Approaches
      2.1.2.2 The Exploitation of Content Characteristics in Non-Textual Item Domains
    2.1.3 Critical Issues of Collaborative and Content-Based Approaches
      2.1.3.1 Data Sparsity
      2.1.3.2 “Ramp-up”: New User and New Item Problems
      2.1.3.3 Overspecialization
      2.1.3.4 Stability vs. Plasticity
      2.1.3.5 Other Problems: “Gray Sheep”, “Starvation”, and Shilling Attacks
    2.1.4 Hybrid Recommender Systems
      2.1.4.1 Parallelized Hybridization Design
      2.1.4.2 Monolithic Hybridization Design
      2.1.4.3 Pipelined Hybridization Design
  2.2 Explanations in Recommender Systems
    2.2.1 The Relevance and Advantages of Explanation Facilities
    2.2.2 Explanation Styles
    2.2.3 Explanations in Hybrid Approaches
    2.2.4 Summary
  2.3 Movie-Related Preferences and Relevant Movie Characteristics
    2.3.1 The Operationalization of Preferences
      2.3.1.1 The Multiattribute Utility Model and the Weighted Additive Decision Rule
      2.3.1.2 The Approach for Addressing Unstable Utility Functions
      2.3.1.3 The Advantages of the WADD Approach for the Production of Explainable Recommendations
    2.3.2 The Preference-Relevant Attributes of Motion Pictures
  2.4 Summary

3 Conceptual Framework of a Hybrid Recommender System that Allows for Effective Explanations of Recommendations
  3.1 The Modeling of User Preferences
    3.1.1 The Motivation for the Approach
    3.1.2 A Basic Model of User Preferences
    3.1.3 Accounting for Static Effects Beyond the User-Item Interaction
    3.1.4 Accounting for Time
  3.2 The Estimation of Model Parameters
    3.2.1 Step 1: The Estimation of the Initial Parameter Values
      3.2.1.1 The Omitted Variable Bias in OLS Models and a Method to Counteract this Bias
      3.2.1.2 The Estimation of User- and Item-Related Effects
      3.2.1.3 The Estimation of Attribute Part-Worths
    3.2.2 Step 2: The Optimization of the Parameters
  3.3 Hybridization with Collaborative Filtering
    3.3.1 The Motivation for Hybridization
    3.3.2 The Selection of a Hybridization Method

4 Empirical Study
  4.1 The Examined Datasets and Their Properties
  4.2 Measures of Prediction Accuracy
  4.3 The Employed Algorithms and Benchmarks
  4.4 Results
    4.4.1 Comparisons of Prediction Accuracies
    4.4.2 The Provided Explanation Styles
  4.5 Summary

5 Conclusions and Future Work
  5.1 Research Summary, Key Findings and Contributions
  5.2 Discussion and Implications
  5.3 Limitations and Avenues for Future Research

Bibliography
Appendix A: Sources of Error in Recommender Systems
Appendix B: A List of Preference-Relevant Attributes
Appendix C: The Technical Details of Prediction Accuracy Tests

Glossary


ACM      Association for Computing Machinery
CB       Content-Based Filtering
CF       Collaborative Filtering
CS       Computer Science
CSCW     Computer Supported Cooperative Work
DFG      Deutsche Forschungsgemeinschaft (German Research Foundation)
DVD      Digital Versatile Disk
EBA      Elimination by Aspects
esp.     Especially
GB       Gigabyte
GHz      Gigahertz
GPS      Global Positioning System
IDF      Inverse Document Frequency
IMDb     Internet Movie Database
kNN      k-Nearest Neighbor
MAE      Mean Absolute Error
MAU      Multiattribute Utility
MDS      Multidimensional Scaling
MF       Matrix Factorization
NMAE     Normalized Mean Absolute Error
NRMSE    Normalized Root Mean Squared Error
OLS      Ordinary Least Squares
RAM      Random-Access Memory
RecSys   ACM Conference on Recommender Systems
RMSE     Root Mean Squared Error
RS       Recommender System
SD       Standard Deviation
SE       Standard Error
SVD      Singular Value Decomposition
TF       Term Frequency
TF-IDF   Term Frequency - Inverse Document Frequency
WADD     Weighted Additive Linear Model
w.r.t.   With Respect To

List of Tables


Table 2.1: A ratings database for collaborative filtering
Table 2.2: The similarities between Daniela’s profile and other user profiles, as measured by Pearson’s correlation coefficients
Table 2.3: A mean-adjusted ratings database for collaborative filtering
Table 2.4: The adjusted cosine similarities of “Thor” to other movies in the dataset
Table 2.5: The principle of content-based filtering
Table 2.6: A summary of the strengths and weaknesses of different recommendation approaches
Table 2.7: Reasons and benefits for the provision of explanations
Table 2.8: The capacities of different recommendation methods to provide effective explanations
Table 2.9: A summary of motion picture success factors
Table 2.10: A summary of preference-relevant movie attributes
Table 4.1: Descriptive statistics for the raw rating datasets
Table 4.2: Descriptive statistics for the datasets that are employed in the study
Table 4.3: A comparison of the prediction accuracies of different algorithms for the MoviePilot dataset
Table 4.4: A comparison of the prediction accuracies of different algorithms for the Netflix dataset
Table 4.5: The distribution parameters for the absolute prediction error of the optimization step
Table 4.6: The accuracy improvements produced by the hybrid method
Table 4.7: The explanation styles provided to users
Table C.1: An overview of the employed source code snippets from Press et al. 2007


List of Figures

Figure 2.1: A comparison of three user rating profiles
Figure 2.2: A comparison of three movie rating profiles
Figure 2.3: A simplified illustration of the latent factor approach
Figure 2.4: An illustration of the extraction of a features vector from a document
Figure 2.5: Basic types of hybridization designs
Figure 3.1: The decomposition of a time-varying measure into three components: baseline, long-term trend, and short-term fluctuations
Figure 3.2: Successive minimization with gradient methods
Figure 3.3: A flowchart of the optimization step
Figure 4.1: The rating scales of the user interfaces of recommender systems
Figure 4.2: The distribution of the absolute prediction errors of the optimization step for the MoviePilot dataset

Chapter 1

Introduction and Motivation

This chapter describes the motivations underlying the thesis that is presented. The objectives of this thesis and the subjects that are included in this document are briefly discussed. The chapter ends by describing the structure and contents of the thesis.

1.1 Motivation

Recommendations are an aspect of everyday life. It is natural for individuals to seek recommendations during the course of making a decision about a particular item or action. We rely on recommendations from different sources, including (but hardly limited to) other individuals, bestseller lists, travel guides, test reports, technical reviews, and restaurant and movie critics. Personalized recommender systems (RSs) are intended to support and augment this natural social process by helping users find the most interesting and valuable items for them in a fast and efficient manner.

On the internet, service providers are not bound by shelf space limitations and can therefore offer far more products than traditional retailers.[1] As a result, in an online environment, the choice task can become overwhelming to customers, causing optimal selection decisions to be nearly impossible to reach; this phenomenon is known as the information overload problem (Jacoby, Speller, and Berning 1974; Anderson 2004). In fact, the identification of the optimal product for satisfying one’s requirements among hundreds and sometimes even thousands of concurrent offerings is far from a simple task, particularly if the choice decision requires the consideration of several product characteristics and the maintenance of an acceptable price-budget relationship. In these situations, individuals strive to minimize their search effort. In other words, in these contexts, consumers are eager to avoid being overloaded by a vast quantity of offerings that do not interest them but instead wish to view only items that could potentially be relevant to their needs (Herlocker et al. 2004).

[1] For instance, the internet music store Rhapsody offers 19 times as much music as Wal-Mart, which stocks 39,000 songs. Amazon’s offerings include 2.3 million books, whereas even specialized book retailers typically only stock a maximum of 130,000 books. Netflix, an online DVD rental service, offers 25,000 DVDs, whereas the average inventory of a conventional DVD rental store consists of only 3,000 DVDs (Anderson 2004).

RSs largely mitigate the information overload problem for users. In particular, RSs simplify search and decision tasks through the application of numeric algorithms that reduce the entire list of products in the domain of a user’s interests (e.g., various product offerings, such as books, CDs, movies, and other goods) to a small and manageable set of relevant items that are well matched to the user’s preferences. Moreover, RSs allow e-commerce providers to increase their up-selling and cross-selling potentials (Schafer, Konstan, and Riedl 2001; Bodapati 2008), e.g., by recommending items from adjacent product domains that are concordant with a consumer’s preferences. By providing customers with simpler and more effective ways of selecting products, RSs help online store owners to better manage customer relationships in ways that produce higher levels of customer loyalty to a particular firm and provide greater competitive barriers for the firm in question (Wei, Shaw, and Easley 2002; Ricci, Rokach, and Shapira 2011). In other words, RSs allow both parties in a business transaction to benefit considerably from the transaction in question by completing their tasks more efficiently.

Accordingly, recommender systems have already found their way into many commercial applications and have established themselves as important components of online stores (Schafer et al. 1999; Ansari et al. 2000). In fact, most users of the internet have encountered a recommender system. A prominent example of a commercial RS is Amazon’s[2] service of offering personalized recommendations, which is widely known as “Customers Who Bought This Item Also Bought”. Netflix[3], an online DVD rental and video streaming service, recommends movies to its subscribers through the following statement: “If You Liked This Movie, You Will Also Like”. Last.fm[4] and Pandora[5] allow their users to create their own “personalized radio stations” online; these created stations play songs that conform to a user’s tastes. Mendeley[6], an online researcher community, recommends scientific articles for individuals to read. Moviepilot[7], which is purely a movie recommendation service, offers a series of recommendation systems. One of these systems generates forecasts for a user’s appreciation of a particular movie. The second Moviepilot recommendation system suggests a nearby movie theater that is showing the film that a user is expected to like the most out of all of the films that are currently being screened in movie theaters. The third Moviepilot recommendation system provides real-time rankings of TV programs based on a user’s preferences and then recommends a channel to watch. Besides restaurants and news, other widespread examples of the domains where RSs are employed include the recommendation of physicians, lawyers, sightseeing locations, vacation resorts, libraries, web sites, acquaintances, sports centers, conventional goods, food items, and even lifestyles. In fact, Netflix has recently awarded a prize of one million dollars to the research team that first succeeded in making substantial improvements to the performance of the Netflix recommender algorithm (Koren, Bell, and Volinsky 2009); this award convincingly indicates the importance of RSs to online service providers.

[2] http://www.amazon.com
[3] http://www.netflix.com
[4] http://www.last.fm
[5] http://www.pandora.com
[6] http://www.mendeley.com
[7] http://www.moviepilot.com

In recent years, levels of research interest in recommender systems have dramatically increased. A search of the EBSCO Business Source Premier Database reveals that over 300 scientific papers that explicitly addressed this topic were published during the last fifteen years. Conferences and workshops on RSs have become premier annual events.[8] Sessions that are dedicated to RSs are frequently incorporated into more traditional conferences in the field of information systems.[9] Furthermore, several noted academic journals have presented special issues that discuss research findings and developments in the area of RSs.[10] The topic of recommender systems is also frequently investigated in academic publications in the fields of psychology, e-commerce, and marketing.[11]

[8] We refer specifically to the ACM conference on Recommender Systems (RecSys), which began in 2007 and is now conducted annually.
[9] Several of the most prominent conferences that have included sessions dedicated to RSs are the ACM Special Interest Group on Information Retrieval (SIGIR); User Modeling, Adaptation and Personalization (UMAP); and the ACM Special Interest Group on Management of Data (SIGMOD) (Ricci et al. 2011, p. 3).
[10] The journals that have presented special issues on RSs include AI Communications (2008), IEEE Intelligent Systems (2007), the International Journal of Electronic Commerce (2006), the International Journal of Computer Science and Applications (2006), ACM Transactions on Computer-Human Interaction (2005), and ACM Transactions on Information Systems (2004) (Ricci et al. 2011, p. 3).
[11] These academic publications include studies by Hennig-Thurau, Marchand, and Marx (2012); Hennig-Thurau et al. (2010); Bodapati (2008); Aksoy et al. (2006); Ying, Feinberg, and Wedel (2006); Fitzsimons and Lehmann (2004); Rutkovsky, Senecal, and Nantel (2004); Fairchild and Rijsman (2004); Gershoff, Mukherjee, and Mukhopadhyay (2003); Cooke et al. (2002); Mild and Natter (2002); and Ansari, Essegaier, and Kohli (2000).

However, an RS can only deliver personalized recommendations if it possesses knowledge about its users. Every RS must obtain and maintain a user profile, i.e., data that allow the RS to draw conclusions about the relevance of particular items for users. These data may come from various sources, such as the user’s purchase history. In this case, each purchase act or purchased item can be regarded as an expression of a user’s preference in the purchased item’s domain; thus, purchases provide the RS with data about a user’s preferences and the portion of the item’s domain in which the user’s tastes or interests may be manifested. Another source of information that describes users’ preferences is the users’ explicit ratings of particular items. An RS may potentially acquire more information from ratings than from purchase behaviors because ratings allow users to indicate the magnitude and the direction of the preferences that they associate with an item, i.e., the degree to which an item is liked or disliked.

Once user profiles are acquired, RSs can begin to produce recommendations. This process is typically accomplished through the use of numeric algorithms that exploit data from user profiles and item catalogs. The modern literature on recommender systems has identified three key recommendation approaches: content-based, collaborative, and hybrid approaches (Balabanovic and Shoham 1997; Adomavicius and Tuzhilin 2005). In each particular situation, the choice of a recommendation approach is heavily dependent on the type of user profile data that are available, the quality of these data, and the characteristics of the item domain to which these data are applied.

However, the numeric algorithms of RSs are subject to errors that may result from a number of factors, such as the incompleteness of collected data; data input, profile extraction, and algorithmic processing errors; and the misspecification of the user decision strategy model (Herlocker, Konstan, and Riedl 2000; Aksoy et al. 2006). By presenting users with erroneous recommendations, RSs risk compromising their credibility for these users and thereby damaging user trust; these effects may diminish firms’ reputations and lead to the loss of customers (Sinha and Swearingen 2002; Gershoff, Mukherjee, and Mukhopadhyay 2003; O’Donovan and Smith 2005; Cramer et al. 2008). This issue raises two questions:

(i) How can the recommendation algorithms be improved to reduce the rate and magnitude of recommendation errors?

(ii) How can the negative effects of inaccurate recommendations on user acceptance and trust be mitigated?

Although the first of these two questions is directly related to the numeric algorithms and is focused on reducing prediction errors, the second question is typically addressed in the modern RS literature through the issue of explanations. In other words, by providing personalized explanations, RSs enable users to assess the quality and suitability of recommendations for their current decision contexts. An understanding of the reasons underlying the recommendation of particular items allows users to make optimal choices, even in situations in which the recommendation process reflected the user’s preferences in a suboptimal way, failed to address the user’s decision context appropriately, or simply contained calculation errors. This understanding can reduce the negative effects of inaccurate recommendations, improving the credibility of an RS and increasing user trust in this RS (Herlocker, Konstan, and Riedl 2000).

Although both of the questions that are posed above have been examined in the extant literature, there remains room for improvement in these research directions. An important shortcoming of the current research is that the research streams that have addressed these questions have largely remained separate. We argue that the integration of these independent research streams may be beneficial for the reasons that are explained below.

Research in this field has been stimulated by the Netflix Prize competition, which prompted investigations that primarily assessed the accuracy of recommender algorithms. In this competition, Netflix allowed researchers to analyze a movie rating dataset that consisted of more than 100 million date-stamped ratings of 17,770 movies that had been provided by approximately 500,000 anonymous Netflix customers (Bennett and Lanning 2007) and thereby indirectly influenced RS studies by causing these studies to focus on this dataset. This concentration on rating data was further aggravated by the limited ability of contemporary information processing algorithms to automatically extract meaningful attributes that are descriptive of multimedia content, such as movies (Wei, Shaw, and Easley 2002; Pazzani and Billsus 1997; Lops, de Gemmis, and Semeraro 2011). Consequently, various movie characteristics, such as stars, budgets, countries of origin, and other traits, were not adequately addressed by the extant recommender research. Although prior research on movies has provided evidence that these characteristics not only significantly influence the success of a movie (Hennig-Thurau, Houston, and Walsh 2006) but also impact consumer preferences (Austin 1989), this research has largely been ignored in the RS literature.[12] We argue that the incorporation of these characteristics in the movie recommendation process can be fruitful for at least the following reasons:

First, the capture of attribute-related movie preferences may offer more information than the use of rating data alone. This capture allows user preferences to be addressed in a more flexible manner and at a finer level of resolution during the course of recommendation generation; thus, this process may lead to more precise predictions of user ratings, i.e., the overall preferences of users for particular items.

Second, if attribute-related preference information is readily available, then the recommendation process may be aligned with the users’ preference structures, and the recommendation generation procedure may therefore reflect users’ idiosyncratic attribute weights and their decision strategies. According to Aksoy et al. (2006), this feature of RSs can produce higher choice efficiency for users.

[12] In extant research that has utilized preferences regarding movie attributes (e.g., Ying, Feinberg, and Wedel 2006), the choice of these attributes has often been based on information availability rather than a thorough study of relevant attributes; alternatively, these attributes have been used for the post-processing of recommendations that have already been generated (e.g., Symeonidis, Nanopoulos, and Manolopoulos 2009).

Third, not only the ability to provide explanations but also the content and level of detail of these explanations are dependent on the algorithm that is employed for providing recommendations, i.e., on the construction of the algorithm itself and on the type of information that it processes (Herlocker, Konstan, and Riedl 2000). Knowledge of the attribute-related weights that produce a particular recommendation can enable RSs to provide users with personalized explanations of recommendations in terms of movie characteristics that are meaningful to these users. Moreover, the consideration of preference-relevant attributes allows these explanations to emphasize the aspects of recommended items that users regard as important while evaluating them. Consequently, these comprehensive explanations can be better understood by users and may therefore prove to be not only potentially more valuable but also more actionable. This feature of explanations increases the transparency and credibility of an RS (Sinha and Swearingen 2002; Cramer et al. 2008; Herlocker, Konstan, and Riedl 2000) and offers other benefits for users, thereby reducing the negative effects of inaccurate recommendations.[13]

[13] These positive effects of explanations will be discussed in Chapter 2.1 of this thesis.

Thus, we conclude that the questions of improving the accuracy of recommendation algorithms and of handling inaccurate recommendations, i.e., the ability to provide users with explanations of recommendations, are not mutually independent but are instead complementary considerations. Therefore, these questions should be addressed concurrently in an integrative fashion. The considerations set out above motivate the current thesis and form the basis for the objectives that are formulated in the subsequent section of this document.

1.2 Objectives

In accordance with the considerations and rationales provided in the previous section, the current thesis focuses on developing a recommendation method that not only aligns the recommendation process with user preferences but also possesses the capability to provide both accurate recommendations and actionable explanations of the reasoning underlying these recommendations. In contrast to the typical RS research approach of constructing an explanation facility around pre-calculated recommendations, we seek to incorporate the ability to provide explanations into the base framework of the recommendation algorithm. The stated objectives should be accomplished by incorporating attribute-based preferences into the recommendation process. Through an integrated examination of the algorithmic and explanatory aspects of RSs, we seek to combine the advantages of algorithmic accuracy with the benefits that are offered by explanation facilities in ways that mitigate the disadvantages of these two RS features.

Although we wish to develop a general recommendation method that is applicable to a broad range of product domains, the development process itself should focus on the domain of motion pictures. For the following two reasons, this constraint represents an additional challenge for our research but enhances the contribution of this thesis to the RS and marketing literature. First, although research on the drivers of movie success is ongoing in the field of marketing, little is known about which attributes of motion pictures are predictive of the movie preferences of individual users. By focusing the current thesis on the topic of movies, we contribute to the stream of marketing research on motion pictures by elucidating and empirically testing the movie attributes that are informative for predicting individual movie preferences. Second, by constructing a list of movie attributes that are relevant to user preferences, we contribute to an RS-related research stream that has previously considered only a small subset of “technical” movie characteristics for the purpose of providing recommendations. In other words, we make the list of movie attributes that are considered within recommendation algorithms more complete and better grounded.[14]

[14] This issue will be discussed in greater depth in Section 2.3.2 of this thesis.

Furthermore, we also regard the practical applicability of our approach as an important constraint in the development of our method. Thus, we must account for the fact that not all individuals form their movie preferences based solely on movie attributes; instead, there might be persons with other intrinsic dispositions who base their choice decisions on factors that extend beyond movie characteristics and are difficult to measure, such as anticipated emotions, social pressures, how closely the plot of the movie relates to personal experiences, and other considerations. For this group of individuals, our attribute-based recommender algorithm will most likely fail to provide reliable recommendations. In a practical context, this phenomenon implies that customers with unusual preferences will not benefit from RS recommendations but will instead be confused or even distracted by the inaccurate nature of these recommendations. This effect can produce negative consequences for an e-commerce provider, such as decreased customer trust and loyalty, the loss of customers, and lower revenues. To counteract these negative effects, we suggest combining our attribute-based algorithm with an item-based collaborative filtering algorithm that is known to perform reasonably well with respect to both prediction quality and the ability to provide actionable explanations for users with unusual preferences (in other words, preferences that are not based on movie attributes). This type of hybridization of recommendation approaches will allow for the generation of the best possible recommendations and actionable explanations for these recommendations for all users of an RS. Consequently, our final objective in this study is the development of a method for hybridizing our attribute-based algorithm with an item-based collaborative filtering recommendation process.

1.3 The Outline of the Thesis

This document is structured into five chapters; a bibliography and appendices are provided at the end of the thesis. Chapter 2 describes the extant research that relates to the objectives of this thesis. In particular, this chapter encompasses an overview of contemporary recommendation algorithms as well as research on multiattribute utility, movie preferences, and explanations of recommendations. Thus, this chapter provides us with important information for the design of our proposals. Chapter 3 describes our proposed conceptual framework for a recommendation algorithm that not only incorporates the attribute-based preferences of users but also allows for the alignment of the recommendation process with users’ preference structures. In addition, this algorithm provides the information that is required for the generation of detailed and actionable explanations. This framework and the proposed algorithm represent the core of the current thesis. In Chapter 4, the proposed algorithm is empirically tested using real-world data from the commercial recommendation systems of Moviepilot and Netflix. The accuracy of the proposed method is compared with that of state-of-the-art recommendation algorithms. The proposed algorithm is also evaluated with respect to its ability to provide explanations and the level of detail of these explanations. Finally, Chapter 5 concludes the thesis by restating the main contributions of this work and discussing avenues for further research.

Chapter 2

Background and Related Work

This chapter summarizes the theoretical background that underlies the proposals of the current thesis and provides an overview of the extant research that relates to our objectives. In particular, the first section of this chapter provides an overview of the key recommendation approaches that currently exist and presents detailed descriptions of the corresponding recommendation algorithms; this knowledge is essential to the development of a new recommendation method. The second section of this chapter addresses the questions of why and how explanations of the rationales for recommendations should be provided. The third section of this chapter projects these findings into the domain of motion pictures and elaborates on the operationalization of movie characteristics for their subsequent use in the process of generating recommendations. The fourth section of this chapter recapitulates the main points of the theoretical discussion and concludes this portion of the thesis.

2.1 A Parsimonious Overview of Recommendation Techniques

The goal of RSs is to provide users with recommendations of items that they are not yet aware of and that are potentially interesting to them. In other words, RSs help consumers find useful items. To accomplish this task, RSs attempt to predict user preferences, i.e., ratings, for these yet unseen items. These predictions are based on preference data, usually ratings, that RSs acquire from their user bases. After ratings for items that have not previously been rated have been estimated, an RS can recommend the item(s) with the highest estimated rating(s) to the user (Adomavicius and Tuzhilin 2005).

More formally, the recommendation task can be described in the following manner. Let $U$ be the set of all users, and let $I$ be the set of all items that can be recommended, such as movies, books, CDs, websites, news articles, or other products. Let $\mathbf{R}$ denote a $|U| \times |I|$ matrix of ratings $r_{u,i}$, where the indices of each rating denote a particular user-item combination. Finally, let $f$ be a preference function that measures the preference of user $u$ for item $i$, i.e., $f(u,i) = r_{u,i}$. Then the recommendation task is, for each user $u \in U$, to choose such an item $i_u^*$ that maximizes the user's preference function:

$$ \forall u \in U: \quad i_u^* = \underset{i \in I}{\arg\max} \; f(u,i) \tag{2.1} $$

However, the central problem for RSs is that the preference function $f$ is unknown and its mapping onto the rating space is only defined on a subset of the $U \times I$ space (rather than the entirety of this space). In particular, if a certain user $u$ does not rate an item $i$, the corresponding matrix entry $r_{u,i}$ will remain empty. Consequently, to predict these unknown ratings, $f$ must be estimated from the non-empty entries of $\mathbf{R}$ and subsequently extrapolated to the entire $U \times I$ space (Adomavicius and Tuzhilin 2005; Jannach et al. 2011). Once these predictions are made, recommendations are produced in accordance with equation (2.1); a short illustrative sketch of this procedure follows the classification below.

To estimate the ratings of items that have not yet been rated by a user, contemporary RSs employ a number of techniques. Although these techniques may be implemented in different ways, the underlying principles by which recommendations are produced allow these techniques to be classified into three general categories (Balabanovic and Shoham 1997):

- Collaborative filtering,
- Content-based filtering, and
- Hybrid approaches.

The approaches in these three categories differ with respect to the strategies that they employ, the methods that they use, the data on which they rely, and their inherent strengths and weaknesses. The following subsections describe these approaches in greater detail.
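To make the formal statement above concrete before turning to the individual techniques, consider the following minimal Python sketch. It is an illustration only: the tiny data set, the placeholder predictor (an item's average rating), and all function names are assumptions of this example rather than part of any published algorithm.

```python
# A minimal, illustrative sketch of the recommendation task in equation
# (2.1): estimate the unknown ratings, then recommend the unrated item
# with the highest estimated preference.

RATINGS = {  # (user, item) -> rating r_ui on a 1-to-10 scale
    ("Daniela", "Sin City"): 10, ("Daniela", "Titanic"): 5,
    ("Thorsten", "Sin City"): 5, ("Thorsten", "Thor"): 7,
    ("André", "Sin City"): 8, ("André", "Thor"): 10,
}
ITEMS = {item for _, item in RATINGS}

def predict(user: str, item: str) -> float:
    """Stand-in for the estimated preference function f(u, i)."""
    known = [r for (u, i), r in RATINGS.items() if i == item]
    return sum(known) / len(known)  # naive placeholder: item average

def recommend(user: str) -> str:
    """Equation (2.1): argmax of f(u, i) over the user's unrated items."""
    unseen = [i for i in ITEMS if (user, i) not in RATINGS]
    return max(unseen, key=lambda i: predict(user, i))

print(recommend("Daniela"))  # -> 'Thor' (estimated rating 8.5)
```

Each of the three families discussed in the following subsections can be viewed as a different way of implementing the `predict` step.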


2.1.1 Collaborative Filtering

The key concept of collaborative filtering (CF) is that the information about the preferences of the entire user base of an RS can be exploited to produce recommendations. In other words, CF methods utilize all ratings from all users for all items that are available to an RS to predict which items a particular participant in an RS community will most probably like or be interested in. The fact that every user potentially contributes to a recommendation is evinced by the title of this group of methods, i.e., in these methods, users are thought to jointly “collaborate” during the course of the recommendation process. The family of CF methods encompasses three types of approaches that differ with respect to the ways in which rating data are used: user-based CF, item-based CF, and matrix factorization. In the following paragraphs, we provide a brief overview of each of these three types of approaches.

2.1.1.1 User-Based Approaches

The main idea underlying user-based CF approaches (e.g., Shardanand and Maes 1995; Konstan et al. 1997; Breese, Heckerman, and Kadie 1998; Nakamura and Abe 1998; Delgado and Ishii 1999; Herlocker et al. 1999; Jannach et al. 2011) is that those users who exhibited preferences that are similar to the ones that the current user exhibited in the past can serve as predictors of the preferences that the current user will have for the items s/he has not yet seen. In other words, the aggregated ratings of these similar users (who are also referred to as peer users or nearest neighbors) are used as predictors of the ratings of the current user. In accordance with this reasoning, the algorithm for this type of approach can be decomposed into the following steps:

1. From all users in the user base, find a subset of users that are similar to the current user.
2. Aggregate the ratings of these users for the set of items that the current user has not yet rated.
3. Recommend the item from this set with the highest aggregated rating.


Table 2.1: A ratings database for collaborative filtering

             Sin City   Titanic   Memento   Avatar   Thor
Daniela         10          5         8         8      ?
Thorsten         5          1         3         5      7
André            8          5         8         5     10
Michael          5          8         1        10      6
Paul             1         10        10         3      3

To obtain an intuitive idea of the functioning of this algorithm, let us examine Table 2.1, which presents an example of a ratings database. In this example, the active user, Daniela, has rated “Sin City” as a “10” on a 1-to-10 scale; this rating indicates that Daniela strongly liked this movie. The task of our RS is to predict Daniela’s rating for “Thor”, which she has not yet seen or rated. The system searches the database for users with tastes that are similar to Daniela’s preferences, i.e., users whose movie ratings are similar to Daniela’s existing movie ratings; the RS then uses the ratings of these users to predict Daniela’s appreciation for “Thor”. If an RS can predict that Daniela will greatly enjoy “Thor”, then this RS should recommend “Thor” to her.

[Figure 2.1: A comparison of three user rating profiles (Daniela, Thorsten, and Paul) across “Sin City”, “Titanic”, “Memento”, and “Avatar”; modified from Jannach et al. (2011, p. 15)]

In our first simplified example, among the four users other than Daniela who are in the database, Thorsten’s rating profile is the most similar to Daniela’s rating profile, whereas Paul’s rating profile is the most dissimilar to Daniela’s rating profile (see also Figure 2.1).


Thus, Thorsten’s rating for “Thor” will primarily be used to predict Daniela’s appreciation for this movie.[15]

Various approaches have been proposed for computing the similarity between two users of CF systems (Herlocker et al. 1999; Herlocker, Konstan, and Riedl 2002; Adomavicius and Tuzhilin 2005). Most of these approaches compute this similarity based on the ratings for items that both users have rated in common. The two most popular similarity measures are Pearson’s correlation coefficient and cosine similarity (Adomavicius and Tuzhilin 2005; Jannach et al. 2011). To introduce these metrics, let $I_{ab}$ be the set of items that are rated by both user $a$ and user $b$. Pearson’s correlation coefficient is then defined as follows (e.g., Resnick et al. 1994; Shardanand and Maes 1995):

$$ sim(a,b) = \frac{\sum_{i \in I_{ab}} (r_{a,i} - \bar{r}_a)(r_{b,i} - \bar{r}_b)}{\sqrt{\sum_{i \in I_{ab}} (r_{a,i} - \bar{r}_a)^2} \, \sqrt{\sum_{i \in I_{ab}} (r_{b,i} - \bar{r}_b)^2}} \tag{2.2} $$

For instance, the value of Pearson’s correlation coefficient for Daniela’s and Thorsten’s profiles is calculated as follows (with $\bar{r}_{Daniela} = 7.75$ and $\bar{r}_{Thorsten} = 3.5$ over the co-rated items):

$$ sim(\text{Daniela}, \text{Thorsten}) = \frac{2.25 \cdot 1.5 + (-2.75)(-2.5) + 0.25 \cdot (-0.5) + 0.25 \cdot 1.5}{\sqrt{12.75} \cdot \sqrt{11}} = \frac{10.5}{11.84} \approx .886 \tag{2.3} $$

The values of Pearson’s correlation coefficient vary between +1 and -1. In particular, a Pearson’s correlation coefficient of +1 corresponds to the case of perfect positive correlation, i.e., a situation in which the examined user profiles are identical; by contrast, a Pearson’s correlation coefficient of -1 corresponds to the case of perfect negative correlation, i.e., a situation in which the examined user profiles are exact opposites. A Pearson’s correlation coefficient of zero is calculated for two examined user profiles that are completely unrelated.

[15] The ratings of other users can also be used to predict Daniela’s rating of “Thor”. However, to utilize the ratings of other users (besides Thorsten), the prediction process will require extensions and/or modifications that increase its complexity, such as the weighting and aggregation of ratings from the “source users”. These modifications will be discussed later in this thesis. At this point, we omit these modifications from our example for the sake of brevity and to ensure that the discussion conveys a good understanding of the fundamental aspects of this recommendation approach.

Thus, values of Pearson’s correlation coefficient that are closer to +1 indicate higher levels of similarity between the two examined users. Consider Table 2.2, which summarizes the similarities between Daniela’s profile and the profile of each other user who is included in our example. This table reveals that among these other users, Thorsten and André have the profiles with the highest similarity to Daniela’s (with Pearson’s correlation coefficients of .886 and .700, respectively), whereas Michael and Paul have profiles that are dissimilar to Daniela’s (as evidenced by the negative coefficients of -.330 and -.758, respectively). Thus, Thorsten’s and André’s ratings are informative for predicting Daniela’s rating of “Thor”.

Table 2.2: The similarities between Daniela’s profile and other user profiles, as measured by Pearson’s correlation coefficients

             Thorsten   André   Michael    Paul
Daniela        .886      .700    -.330    -.758
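Equation (2.2) can be verified directly on the data of Table 2.1. The following sketch is illustrative code (the helper name `pearson_sim` is ours); it computes the similarity of Daniela’s profile to each of the other profiles and reproduces the coefficients of Table 2.2 up to rounding in the third decimal place:

```python
from math import sqrt

# Illustrative implementation of equation (2.2): Pearson's correlation
# over the items that two users have co-rated.

PROFILES = {
    "Daniela":  {"Sin City": 10, "Titanic": 5, "Memento": 8, "Avatar": 8},
    "Thorsten": {"Sin City": 5, "Titanic": 1, "Memento": 3, "Avatar": 5, "Thor": 7},
    "André":    {"Sin City": 8, "Titanic": 5, "Memento": 8, "Avatar": 5, "Thor": 10},
    "Michael":  {"Sin City": 5, "Titanic": 8, "Memento": 1, "Avatar": 10, "Thor": 6},
    "Paul":     {"Sin City": 1, "Titanic": 10, "Memento": 10, "Avatar": 3, "Thor": 3},
}

def pearson_sim(a: dict, b: dict) -> float:
    common = a.keys() & b.keys()  # I_ab: the co-rated items
    mean_a = sum(a[i] for i in common) / len(common)
    mean_b = sum(b[i] for i in common) / len(common)
    num = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    den = (sqrt(sum((a[i] - mean_a) ** 2 for i in common))
           * sqrt(sum((b[i] - mean_b) ** 2 for i in common)))
    return num / den

daniela = PROFILES["Daniela"]
for name in ("Thorsten", "André", "Michael", "Paul"):
    print(f"{name}: {pearson_sim(daniela, PROFILES[name]):+.3f}")
# Thorsten: +0.887, André: +0.700, Michael: -0.330, Paul: -0.758
```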

Another possible method for calculating the similarity between two users is the cosine similarity metric (e.g., Breese, Heckerman, and Kadie 1998; Sarwar et al. 2001). Cosine similarity treats users as vectors in an $m$-dimensional space, where $m = |I_{ab}|$, i.e., the number of items the users have both rated. The similarity between the users is then computed as the cosine of the angle between the two user vectors[16]:

$$ sim_{cos}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\|_2 \, \|\mathbf{b}\|_2} \tag{2.4} $$

where $\mathbf{a} \cdot \mathbf{b}$ denotes the dot product[17] between the two user vectors $\mathbf{a}$ and $\mathbf{b}$, and $\|\cdot\|_2$ is the second norm of the vector, i.e., the vector’s Euclidean length, which is defined as the square root of the dot product of a vector with itself.

[16] In this location and throughout the remainder of the thesis, we use boldfaced font to denote vectors and regular font to denote scalars.
[17] Recall that the dot product of two vectors $\mathbf{a}$ and $\mathbf{b}$ in $m$-dimensional Euclidean space is defined as the sum of the pairwise products of the vectors’ coordinates, which produces a scalar: $\mathbf{a} \cdot \mathbf{b} = \sum_{k=1}^{m} a_k b_k$.

For the profiles of Daniela and Thorsten, the cosine-based approach therefore produces the following result:

$$ sim_{cos}(\text{Daniela}, \text{Thorsten}) = \frac{10 \cdot 5 + 5 \cdot 1 + 8 \cdot 3 + 8 \cdot 5}{\sqrt{253} \cdot \sqrt{60}} = \frac{119}{123.2} \approx .966 \tag{2.5} $$
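The scale issue discussed in the next paragraph is easy to demonstrate on the running example. The sketch below is illustrative only; it assumes the profiles of Table 2.1 restricted to the co-rated items, and it shows that raw cosine similarity remains clearly positive even for Paul, whose Pearson correlation with Daniela is -.758:

```python
from math import sqrt

# Illustrative check of equation (2.4) on raw 1-to-10 ratings.
# On a positive scale, cosine similarity cannot become negative, so
# even a negatively correlated profile receives a clearly positive score.

PROFILES = {
    "Daniela":  {"Sin City": 10, "Titanic": 5, "Memento": 8, "Avatar": 8},
    "Thorsten": {"Sin City": 5, "Titanic": 1, "Memento": 3, "Avatar": 5},
    "Paul":     {"Sin City": 1, "Titanic": 10, "Memento": 10, "Avatar": 3},
}

def cosine_sim(a: dict, b: dict) -> float:
    common = a.keys() & b.keys()  # items rated by both users
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

daniela = PROFILES["Daniela"]
print(round(cosine_sim(daniela, PROFILES["Thorsten"]), 3))  # 0.966, cf. (2.5)
print(round(cosine_sim(daniela, PROFILES["Paul"]), 3))      # 0.711, despite r = -.758
```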

Note, however, that although cosine similarity, as defined by equation (2.4), is generally scaled between -1 and +1, if this metric is applied to positively directed scales, e.g., scales that range from 0 to 10, cosine similarity will always produce a number that is between 0 and 1. In fact, for negatively correlated user profiles, this measure produces relatively high positive values (of approximately .66). Thus, higher values are typically observed for the cosine similarity measures than for Pearson’s correlation coefficients. Furthermore, cosine similarity will only reach a value of zero if all examined co-rated pairs include at least one rating of zero. However, the scales that are typically assessed by recommender systems generally begin with a value of 1 (for instance, this type of scale might range from 1 to 5), ensuring that it is technically impossible for expression (2.4) to produce a value of zero. Hence, in many RS contexts, the cosine similarity approach cannot recognize unrelated rating profiles. One possible method of overcoming this drawback is by mean-centering the examined data prior to the application of cosine similarity; this mean-centering process involves the subtraction of the mean user rating from all of the ratings of the examined users. However, from a technical point of view, the mean-centering of the data effectively produces expression (2.2), which is the definition of the Pearson’s correlation coefficient (Nanopoulos, Radavanović, and Ivanović 2009). This fact explains why approaches for user-based collaborative filtering typically prefer to calculate user similarity with Pearson’s correlation coefficient rather than cosine similarity (Jannach et al. 2011). Various other metrics, such as Spearman’s rank correlation coefficient, the normalized Euclidian distance or the mean squared difference measure, have also been proposed to determine the similarity between two users of an RS system (Shardanand and Maes 1995; Herlocker et al. 1999, 2002; Adomavisius and Tuzhilin 2005; Jannach et al. 2011). However, empirical analysis has provided evidence that for user-based CF systems, Pearson’s coefficient outperforms other measures of comparing users (Herlocker et al. 1999). By contrast, for

18

2. Background and Related Work

By contrast, for the item-based CF systems, which will be described in the next section, cosine similarity consistently outperforms Pearson’s correlation coefficient (Jannach et al. 2011).

Before the ratings of an active user can be predicted, a set of peer users, whose ratings will be considered in the prediction, must be defined, i.e., the users who are most similar to the active user must be selected through the application of a particular similarity assessment. The set of the most similar peers for a user is also referred to as the “k nearest neighbors” of the user in question. Because these neighbors compose the basis for predictions, i.e., recommendations, collaborative approaches for generating recommendations are often referred to as k-nearest neighbor (kNN) approaches. The value of $k$ can range anywhere from 1 to the total number of users in an RS (Adomavicius and Tuzhilin 2005). However, the question of how the exact value of $k$ should be determined has remained unanswered. In practice, the value of $k$ is typically determined heuristically by either defining a specific minimum similarity threshold (e.g., Shardanand and Maes 1995; Breese, Heckerman, and Kadie 1998) or choosing an explicit value of $k$ (Herlocker et al. 1999, 2002; Anand and Mobasher 2005; Jannach et al. 2011). Both of these techniques for choosing $k$ are problematic. If an overly high value of $k$ is selected, then predictions rely on many users who have only limited similarity to the active user; this phenomenon produces “noisy” predictions. By contrast, low values of $k$ can also negatively impact the quality of the predictions because the probability that peer users will have profiles that contain relevant rating data decreases as the size of the examined neighborhood diminishes. In the worst case, none of the profiles of the $k$ users who are most similar to the active user will contain data that are useful for predicting the active user’s rating for the item of interest. Similarly, an overly high similarity threshold can radically reduce the size of the examined neighborhood for each active user, causing the ratings for many items to be unpredictable. However, an overly low similarity threshold increases neighborhood size but also increases the amount of “noise” that is included in predictions. Jannach et al. suggest that “in most real-world situations, a neighborhood of 20 to 50 neighbors seems reasonable” (Herlocker et al. 2002, cited in Jannach et al. 2011, p. 18).¹⁸ A more detailed discussion of the problem of the selection of neighborhood size is provided by Herlocker et al. (2002) and by Anand and Mobasher (2005).

¹⁸ Jannach et al. (2011) include this statement as a quote from Herlocker et al. (2002). However, despite careful reading, we could not find this quotation in the referenced publication. Thus, in this thesis, we cite Jannach et al. (2011) as the source of the statement in question.

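To make the neighborhood-formation step concrete, the following Python sketch selects the peers who are most similar to an active user according to Pearson’s correlation coefficient. The sketch is an illustration rather than code from this thesis; the function names, the default of k = 20 (merely echoing the heuristic quoted above), and the zero similarity threshold are our own choices.

```python
from math import sqrt

def pearson(ratings_a, ratings_b):
    """Pearson's correlation coefficient over the items co-rated by two
    users; each argument is a dict mapping item -> rating."""
    common = set(ratings_a) & set(ratings_b)
    if len(common) < 2:
        return 0.0                      # no meaningful overlap
    mean_a = sum(ratings_a[i] for i in common) / len(common)
    mean_b = sum(ratings_b[i] for i in common) / len(common)
    cov = sum((ratings_a[i] - mean_a) * (ratings_b[i] - mean_b)
              for i in common)
    var_a = sum((ratings_a[i] - mean_a) ** 2 for i in common)
    var_b = sum((ratings_b[i] - mean_b) ** 2 for i in common)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0                      # a flat profile carries no signal
    return cov / (sqrt(var_a) * sqrt(var_b))

def nearest_neighbors(active, ratings, k=20, threshold=0.0):
    """Return the (at most) k peers most similar to the active user whose
    similarity exceeds the minimum threshold."""
    sims = [(pearson(ratings[active], ratings[u]), u)
            for u in ratings if u != active]
    sims = [(s, u) for (s, u) in sims if s > threshold]
    sims.sort(reverse=True)
    return sims[:k]
```

Both of the knobs discussed above appear explicitly: raising the threshold shrinks the neighborhood and may leave some items unpredictable, while raising k admits noisier peers.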

Once the neighborhood size or similarity threshold is defined, the ratings for the active user are predicted through the use of an aggregation rule. Various functions have been proposed as potential aggregation rules, including the following suggestions (Adomavicius and Tuzhilin 2005; Herlocker et al. 2002):

$$\hat{r}_{a,i} = \frac{1}{|\hat{U}|} \sum_{u \in \hat{U}} r_{u,i} \qquad (2.6)$$

$$\hat{r}_{a,i} = \bar{r}_a + \frac{1}{|\hat{U}|} \sum_{u \in \hat{U}} \left( r_{u,i} - \bar{r}_u \right) \qquad (2.7)$$

$$\hat{r}_{a,i} = \kappa \sum_{u \in \hat{U}} sim(a,u) \cdot r_{u,i} \qquad (2.8)$$

$$\hat{r}_{a,i} = \bar{r}_a + \kappa \sum_{u \in \hat{U}} sim(a,u) \cdot \left( r_{u,i} - \bar{r}_u \right) \qquad (2.9)$$

where $\hat{U}$ denotes the set of the $k$ users who are most similar to the current user $a$ and have rated the item $i$. In the above equations, the multiplier $\kappa$ serves as a normalizing factor and is defined as $\kappa = 1 / \sum_{u \in \hat{U}} |sim(a,u)|$, and $\bar{r}_u$ denotes the average rating of user $u$, with $\bar{r}_u = \frac{1}{|I_u|} \sum_{i' \in I_u} r_{u,i'}$, where $I_u$ is the set of items that user $u$ has rated.

In the simplest case, the aggregation can be a simple average (Adomavicius and Tuzhilin 2005), as specified by function (2.6). In other words, based on Thorsten’s and André’s ratings of “Thor”, i.e., using a neighborhood size of k = 2, this formula will predict that Daniela’s rating of the movie will be approximately $(9 + 8)/2 = 8.5$; thus, this formula would suggest that Daniela will greatly enjoy “Thor”. Intuitively, because function (2.6) does not account for the degree of similarity of different peers, its predictions are subject to the influence of “noisy” input from neighbors with limited similarity. This issue can be addressed by establishing an appropriate similarity threshold; however, as described above, this type of countermeasure tends to reduce the coverage of an RS. Thus, if the goal is to accurately predict user ratings, function (2.6) might not be the best choice. However, the simplicity of this function is its greatest advantage. The aggregation rule in function (2.6) requires few resources and can be computed quickly; these traits could be very useful for an RS that must provide ad-hoc and real-time recommendations from considerable catalogs of items. Furthermore, for situations in which an RS does not have sufficient knowledge about an active user to produce a personalized prediction (the “new user problem”, which will be discussed in Section 2.1.3.2 of this thesis), recommendations that utilize the average rule might prove superior to a lack of any recommendations. However, in this situation, the condition under the sum sign in function (2.6) must be relaxed to $\hat{U} = \{u \mid r_{u,i} \text{ is known}\}$; in other words, to address this scenario, all of the users who have rated an item should be included in the generation of recommendations.

Equation (2.7) represents a slight modification of the previously discussed aggregation rule by reformulating this rule in a “deviation form”. In other words, the aggregation here does not occur over the ratings that the users $u \in \hat{U}$ have provided for an item $i$ but instead occurs over the deviations of these ratings from the average ratings $\bar{r}_u$ of these users. The produced sum is then adjusted by the mean rating $\bar{r}_a$ of the active user. This approach accounts for the fact that different users may use the rating scale in distinct ways. For instance, Thorsten’s rating of “5” may correspond to exactly the same quantity of preference as André’s rating of “8”. Moreover, the mean-adjustment corrects for the “gap” between user profiles that exhibit reasonable correlation but are shifted along the rating scale. An example of these profiles can be seen in Figure 2.1. In this figure, Daniela’s ratings and Thorsten’s ratings are strongly correlated but are “shifted” vertically; on average, Thorsten’s ratings lie approximately 5 points below Daniela’s ratings. Although Thorsten might not necessarily share Daniela’s movie tastes, his ratings appear to be reliable predictors for Daniela’s ratings. However, to predict Daniela’s preferences appropriately, Thorsten’s ratings should be incremented by approximately 5 points. The mean-adjustment performs this type of correction. To extend this example, the mean-corrected prediction of Daniela’s rating ($\hat{r}_{Daniela,Thor}$) in this situation will be $7.75 + \frac{(9 - 6.2) + (8 - 5.2)}{2} = 7.75 + 2.8 = 10.55$.¹⁹ Thus, although the simple average function predicted that Daniela will greatly enjoy “Thor”, the mean-adjusted prediction causes us to strengthen this assertion to a prediction that Daniela will love this movie. Herlocker et al. (2002) have found that the mean-adjusted average, as defined in function (2.7), significantly outperforms function (2.6) with respect to prediction accuracy.

¹⁹ For the sake of consistency with the rating scale that is provided to users, the system will round off 10.55 such that a prediction of 10 points will be displayed to Daniela.

Similarly to function (2.6), the deviation-from-mean average approach does not account for different degrees of user similarity; thus, the merits and shortcomings of the simple average are largely also applicable to this rule.

However, the most common aggregation approach is the weighted sum, which is defined in equation (2.8) (Adomavicius and Tuzhilin 2005; Jannach et al. 2011). As noted above, peer users may be assigned weights that represent their similarity to the active user. Conventional wisdom indicates that users whose tastes are more similar to the tastes of the active user will be more credible recommenders for this active user and should therefore provide greater contributions to the eventual recommendation. The weighted aggregation procedure strives to achieve this effect. The normalization factor $\kappa$, as introduced above, adjusts the predicted rating to ensure that this predicted rating lies within the boundaries of the rating scale. Thus, in our example, the weighted sum prediction of Daniela’s rating for “Thor” will be $\hat{r}_{Daniela,Thor} = \kappa \sum_{u \in \hat{U}} sim(Daniela, u) \cdot r_{u,Thor} \approx 8.22$. Although in our simple example, we might not see a large difference between the weighted sum prediction of 8.22 and the rating of 8.5 that is predicted with the simple mean rule of function (2.6), the difference between these two types of predictions can become very substantial if low similarity thresholds are established or large neighborhood sizes (high values of k) are chosen because these two factors can potentially increase both the variability of the ratings of peer users and the variability of the similarities between the active user and her peers. The actual magnitude and the direction of the difference between these two types of predictions depend on whether user ratings or user similarities are more diverse in a particular neighborhood, as well as on the absolute values of these similarities.

We observe a deviation of more than 2 points between the prediction of 8.22, which is obtained from function (2.8), the weighted sum approach, and the prediction of 10.55, which is obtained from function (2.7), the mean-adjusted average aggregation rule. This deviation occurs because equation (2.8), similarly to equation (2.6), does not account for differences in the average ratings of different users. The mean-adjusted weighted sum aggregation rule, function (2.9), addresses this shortcoming in equation (2.8), similarly to the way in which the mean-adjustment of the simple average approach, function (2.7), addresses this shortcoming in equation (2.6). In accordance with function (2.9), Daniela’s predicted rating for “Thor” will be $\hat{r}_{Daniela,Thor} = \bar{r}_{Daniela} + \kappa \sum_{u \in \hat{U}} sim(Daniela, u) \cdot (r_{u,Thor} - \bar{r}_u) \approx 10.51$. In the given example, we do not observe a large difference between the predictions of 10.55 and 10.51, which are obtained from the mean-adjusted aggregation rules of functions (2.7) and (2.9), respectively. Once again, this difference can become much larger for high values of k or low similarity thresholds.

After the ratings for unfamiliar items have been predicted, the item with the highest projected rating can be recommended to the active user. Alternatively, a set of $N$ items with the highest ratings can be displayed to the user. The latter presentation is often referred to as a “top-N recommendation” (e.g., Sarwar et al. 2000; Seyerlehner, Flexer, and Widmer 2009; Zhang 2009).
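To make the mechanics of rules (2.6) through (2.9) concrete, the following Python sketch implements the aggregation functions. The raw ratings shown are reconstructed so as to be consistent with the mean-adjusted values in Table 2.3; they are assumptions for the purpose of illustration rather than a verbatim copy of Table 2.1.

```python
def user_mean(ratings, u):
    """Average rating of user u over all items that he or she has rated."""
    return sum(ratings[u].values()) / len(ratings[u])

def simple_average(item, peers, ratings):
    """Rule (2.6): the unweighted mean of the peers' ratings for the item."""
    votes = [ratings[u][item] for u in peers if item in ratings[u]]
    return sum(votes) / len(votes)

def mean_adjusted_average(active, item, peers, ratings):
    """Rule (2.7): the active user's mean plus the average peer deviation."""
    devs = [ratings[u][item] - user_mean(ratings, u)
            for u in peers if item in ratings[u]]
    return user_mean(ratings, active) + sum(devs) / len(devs)

def weighted_sum(active, item, peers, sims, ratings, mean_adjust=False):
    """Rules (2.8)/(2.9): similarity-weighted aggregation with the
    normalizing factor kappa = 1 / sum of |sim(a, u)|."""
    num = den = 0.0
    for u in peers:
        if item not in ratings[u]:
            continue
        vote = (ratings[u][item] - user_mean(ratings, u)
                if mean_adjust else ratings[u][item])
        num += sims[u] * vote
        den += abs(sims[u])
    base = user_mean(ratings, active) if mean_adjust else 0.0
    return base + num / den

# Raw ratings reconstructed to match the mean-adjusted values in Table 2.3
# (assumed; the original Table 2.1 may differ in detail).
ratings = {
    "Daniela":  {"Sin City": 10, "Titanic": 5, "Memento": 8, "Avatar": 8},
    "Thorsten": {"Sin City": 7, "Titanic": 3, "Memento": 5,
                 "Avatar": 7, "Thor": 9},
    "André":    {"Sin City": 6, "Titanic": 3, "Memento": 6,
                 "Avatar": 3, "Thor": 8},
}
peers = ["Thorsten", "André"]
print(simple_average("Thor", peers, ratings))                    # -> 8.5
print(mean_adjusted_average("Daniela", "Thor", peers, ratings))  # -> ~10.55
```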

2.1.1.2 Item-Based Approaches

Instead of basing recommendations on the similarities between pairs of users, item-based collaborative filtering relies on the similarities between items (Sarwar et al. 2001; Rashid et al. 2002; Linden, Smith, and York 2003; Ziegler et al. 2005). The item-based CF algorithm can be decomposed into the following steps:

1. From all available items, find a subset $S$ of items that have not been rated by the current user $a$ but are similar to the items that the user had liked the most in the past.

2. For each item from $S$, predict its rating for the active user as a sum of the active user’s ratings for the items that she has already rated, weighted by their respective similarities to the item of interest.

3. Recommend the item from $S$ that exhibits the highest predicted rating for the active user.

To obtain an intuitive notion about how this algorithm functions, examine Table 2.1 once again. From this table, we can observe that the ratings for “Sin City” and “Thor” are distributed similarly among the examined users who have rated both films (see also Figure 2.2). Thus, “Sin City” is given a high weight for the prediction of Daniela’s rating for “Thor”. As mentioned above, empirical analysis has revealed that for item-based CF approaches, the cosine similarity measure performs best with respect to prediction accuracy (Jannach et al. 2011). Thus, the cosine similarity measure is the most frequently utilized metric for item-based predictions (e.g., Sarwar et al. 2001; Rashid et al. 2002; Linden, Smith, and York 2003; Ziegler et al. 2005).

[Figure 2.2: A comparison of three movie rating profiles. The chart plots the ratings (on the 1 to 10 scale) of “Sin City”, “Titanic”, and “Thor” for the users Thorsten, André, Michael, and Paul.]

However, one fundamental difference between user-based CF methods and item-based CF methods with respect to the computation of similarity values is that in the former approaches, the similarities are computed between pairs of users (i.e., the columns of the rating matrix), whereas in the latter approaches, the similarities are computed between pairs of items (i.e., the rows of the rating matrix; see Table 2.1). In other words, in the item-based CF approaches, each pair of co-rated entries corresponds to different users. Thus, in the item-based case, the computation of the similarity between items with a cosine measure that is analogous to the metric that was specified in equation (2.4) for the user-based case produces one important drawback: this computation does not account for differences in the rating scales of different users. The adjusted cosine similarity measure addresses this drawback by subtracting the average rating of the corresponding user from each co-rated pair (Sarwar et al. 2001). This process causes the values of the similarity measure to range between -1 and +1 (Jannach et al. 2011). In this approach, the similarity between items $i$ and $j$ may be expressed as follows:

$$sim(i,j) = \frac{\sum_{u \in U_{i,j}} (r_{u,i} - \bar{r}_u)(r_{u,j} - \bar{r}_u)}{\sqrt{\sum_{u \in U_{i,j}} (r_{u,i} - \bar{r}_u)^2} \, \sqrt{\sum_{u \in U_{i,j}} (r_{u,j} - \bar{r}_u)^2}} \qquad (2.10)$$


where $U_{i,j}$ is the set of users that have rated both item $i$ and item $j$. To calculate the adjusted cosine similarity, we can transform the original ratings database by replacing the original rating values with their deviations from the mean ratings of the corresponding users, as presented in Table 2.3.

Table 2.3: A mean-adjusted ratings database for collaborative filtering

              Daniela   Thorsten   André   Michael   Paul
  Sin City      2.25        .8       .8      -1      -4.4
  Titanic      -2.75      -3.2     -2.2       2       4.6
  Memento        .25      -1.2       .8      -5       4.6
  Avatar         .25        .8     -2.2       4      -2.4
  Thor            ?        2.8      2.8       0      -2.4

As an example, the adjusted cosine similarity measure for “Thor” and “Titanic” is then calculated as follows:

$$sim(Thor, Titanic) = \frac{2.8 \cdot (-3.2) + 2.8 \cdot (-2.2) + 0 \cdot 2 + (-2.4) \cdot 4.6}{\sqrt{2.8^2 + 2.8^2 + 0^2 + (-2.4)^2} \cdot \sqrt{(-3.2)^2 + (-2.2)^2 + 2^2 + 4.6^2}} \approx -.891 \qquad (2.11)$$

The computation of the similarities between “Thor” and each of the other movies that are present in our example dataset produces Table 2.4. This table indicates that among the movies in this dataset, “Sin City”, which has an adjusted cosine similarity of .698 to “Thor”, is the most similar movie to “Thor”. By contrast, “Titanic” (with an adjusted cosine similarity of -.891) and “Memento” (with an adjusted cosine similarity of -.378) can both be classified as dissimilar to “Thor”; in fact, among the examined movies, “Titanic” is the most dissimilar film to “Thor”. “Avatar”, which has an adjusted cosine similarity to “Thor” of nearly zero (.076), can be regarded as a movie that is unrelated to Daniela’s enjoyment of “Thor”.

Table 2.4: The adjusted cosine similarities of “Thor” to other movies in the dataset

           Sin City   Titanic   Memento   Avatar
  Thor       .698      -.891     -.378     .076
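Because Table 2.3 contains everything that expression (2.10) requires, the similarities of Table 2.4 can be recomputed in a few lines of Python. The sketch below is our own illustration, with the data keyed exactly as in Table 2.3:

```python
from math import sqrt

# Mean-adjusted ratings from Table 2.3 (one list per movie, ordered as
# Daniela, Thorsten, André, Michael, Paul); None marks the missing rating.
deviations = {
    "Sin City": [2.25,  0.8,  0.8, -1.0, -4.4],
    "Titanic": [-2.75, -3.2, -2.2,  2.0,  4.6],
    "Memento":  [0.25, -1.2,  0.8, -5.0,  4.6],
    "Avatar":   [0.25,  0.8, -2.2,  4.0, -2.4],
    "Thor":     [None,  2.8,  2.8,  0.0, -2.4],
}

def adjusted_cosine(i, j):
    """Adjusted cosine similarity (2.10) over users who rated both movies."""
    pairs = [(a, b) for a, b in zip(deviations[i], deviations[j])
             if a is not None and b is not None]
    num = sum(a * b for a, b in pairs)
    den = (sqrt(sum(a * a for a, _ in pairs))
           * sqrt(sum(b * b for _, b in pairs)))
    return num / den

for movie in ("Sin City", "Titanic", "Memento", "Avatar"):
    print(movie, round(adjusted_cosine("Thor", movie), 3))
# -> Sin City 0.698, Titanic -0.891, Memento -0.378, Avatar 0.076
```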


After the similarities between the items are determined, the prediction of the rating of the active user $a$ for item $i$ is computed as a weighted sum of the active user’s ratings for the items that are most similar to the questioned item; formally, this process can be expressed as follows:

$$\hat{r}_{a,i} = \frac{\sum_{j \in \hat{I}} sim(i,j) \cdot r_{a,j}}{\sum_{j \in \hat{I}} |sim(i,j)|} \qquad (2.12)$$

where $\hat{I}$ denotes the set of $k$ items that are most similar to $i$, the item of interest, and that have been rated by the active user. In other words, similarly to the user-based case, the size of the considered neighborhood is limited to the $k$ items that are most similar to the item of interest, for a specific value of $k$.

In our example, there is only one movie, “Sin City”, that can be considered relevant for predicting Daniela’s rating for “Thor”. For the purposes of demonstration, we may also consider “Avatar” to be relevant for predictive purposes, although the nearly zero similarity between “Avatar” and “Thor” implies that the contribution of “Avatar” to the resulting predicted rating for “Thor” will be relatively low. According to prediction rule (2.12), Daniela’s predicted rating for “Thor” will be $\hat{r}_{Daniela,Thor} = \frac{.698 \cdot 10 + .076 \cdot 8}{.698 + .076} \approx 9.8$.
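A compact Python sketch of prediction rule (2.12), applied to the running example, follows. Daniela’s raw ratings are reconstructed from Table 2.3 and are therefore assumed values:

```python
def predict_item_based(user_ratings, item_sims, k=2):
    """Prediction rule (2.12): similarity-weighted average of the active
    user's own ratings over the k most similar items she has rated."""
    rated = [(s, j) for j, s in item_sims.items() if j in user_ratings]
    neighbors = sorted(rated, reverse=True)[:k]
    num = sum(s * user_ratings[j] for s, j in neighbors)
    den = sum(abs(s) for s, _ in neighbors)
    return num / den

# Daniela's raw ratings as reconstructed from Table 2.3 (assumed values)
daniela = {"Sin City": 10, "Titanic": 5, "Memento": 8, "Avatar": 8}
sims_to_thor = {"Sin City": 0.698, "Titanic": -0.891,
                "Memento": -0.378, "Avatar": 0.076}
print(round(predict_item_based(daniela, sims_to_thor, k=2), 2))  # -> 9.8
```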

Similarly to the user-based approaches, in the item-based approaches, after predicted ratings are obtained, the item(s) with the highest predicted rating(s) will be recommended.

2.1.1.3 Matrix Factorization and Latent Factor Models

Another approach within the class of collaborative filtering techniques is matrix factorization (Sarwar et al. 2000, 2002; Goldberg et al. 2001; Canny 2002; Koren, Bell and Volinsky 2009; Koren and Bell 2011; Jannach et al. 2011). The general objective of this approach is to first utilize the data that are received from all users of an RS to derive a set of latent factors that describe hidden associations between users and items and then apply this knowledge to the production of recommendations. In other words, matrix factorization (MF) techniques map both users and items onto a multidimensional joint factor space; in this space, user-item interactions are modeled as inner products of the vectors that represent user and item rating profiles. The latent space attempts to explain ratings by characterizing both items and users in terms of factors that are automatically inferred from the ratings that have been gathered from the user community (Koren and Bell 2011). For instance, in the domain of motion pictures, these automatically identified factors may correspond to either obvious movie aspects, such as genre, or obscure movie dimensions, such as the depth of character development in a film or the quirkiness of a movie; however, these factors can also be completely uninterpretable (Koren, Bell, and Volinsky 2009).

Figure 2.3: A simplified illustration of the latent factor approach
Source: Koren, Bell and Volinsky (2009), p. 44

Figure 2.3 depicts a simplified example of how latent factor models function; this example has been obtained from Koren, Bell, and Volinsky (2009). The figure indicates the locations of several well-known movies and a selection of fictitious users on two hypothetical dimensions, i.e., two factors; in this example, these factors are characterized as serious versus escapist and male- versus female-oriented. In a sense, the interpretation of the graph is similar to the interpretation of perceptual maps within the context of multidimensional scaling (MDS) procedures, which are well known in marketing (Myers 1996): The relative positions of the users and items in the two-dimensional space characterize the degree to which the user’s taste matches the movie’s characteristics with respect to the derived factors. User or movie locations that are further from the origin in a factor’s direction indicate that the factor in question is more pronounced in the user’s taste or the movie’s properties, respectively. A shorter distance between a particular user and a movie should be associated with a greater enjoyment of the movie in question by this user. Accordingly, we can describe Gus as having a strong preference for male-oriented escapist movies and “The Color Purple” as a serious, female-oriented movie. Therefore, in this example situation, we would expect Gus to love “Dumb and Dumber”, to hate “The Color Purple” and to give “Braveheart” an approximately average rating. However, note that certain movies, such as “Ocean’s 11”, and certain users, such as Dave, would be characterized as fairly neutral with respect to the two identified dimensions of the example scenario (Koren, Bell, and Volinsky 2009); this characterization implies that the two identified factors fail to describe either Dave’s movie tastes or the properties of “Ocean’s 11” substantively enough to generate accurate predictions.

The underlying concept for deriving the factors is the method of singular value decomposition (SVD; Golub and Kahan 1965), which is an established technique for identifying latent semantic factors during the course of information retrieval (Koren, Bell and Volinsky 2009; Jannach et al. 2011). SVD is based on a linear algebra theorem that states that any matrix $M$ can be decomposed into a product of three matrices in the following manner:

$$M = U \Sigma V^{T} \qquad (2.13)$$

where the columns of $U$ and $V$ are known as the left and right singular vectors of $M$, respectively, and the values of the diagonal elements of $\Sigma$ are called the singular values of $M$ (Jannach et al. 2011; Golub and Kahan 1965; Press et al. 2007). The main purpose of this decomposition is that it enables us to approximate the full matrix $M$ by examining only the most important features of this matrix, which are the features with the largest singular values (Jannach et al. 2011; Press et al. 2007).

Informally, the SVD technique can be described as follows. The singular values correspond to the eigenvalues of the eigenvectors that span the range of $M$ (Press et al. 2007). Thus, the eigenvectors with the largest singular values capture the greatest portion of the variance in $M$. These eigenvectors build up the basis, i.e., the set of “factors”, of the target factor space. If $M$ is the user-item matrix of ratings (e.g., our example rating dataset from Table 2.1), then $U$ will correspond to the users, and $V$ will correspond to the catalog of items (Jannach et al. 2011). Furthermore, if $r$ factors have been determined to possess non-zero singular values, then the product of the first $r$ columns from $U$, the first $r$ columns from $V$ and the $r \times r$ dimensional diagonal matrix $\Sigma_r$ of singular values, as expressed in equation (2.13), would produce the best approximation of $M$ in terms of the least-squares error (Press et al. 2007). Thus, the first $r$ columns of $U$ and $V$ describe the coordinates of the users and the items, respectively, along the $r$ dimensions of the factor space; in other words, these columns describe user tastes and item properties in terms of the determined factors.

However, the conventional SVD approach is not defined if knowledge about the matrix is incomplete (Koren, Bell, and Volinsky 2009; Press et al. 2007), and incomplete knowledge is necessarily present in the context of RS use. In fact, if each element of the user-item matrix was known, there would be no reason to predict user ratings because all of these ratings would already be determined. To overcome this issue, previously published works have suggested the use of imputation techniques for completing missing ratings and ensuring that the ratings matrix is dense (e.g., Sarwar et al. 2000; Kim and Yum 2005; Ying, Feinberg, and Wedel 2006). However, the imputation approaches have been criticized for being very expensive with respect to computational resources. Moreover, inaccurate imputation may cause considerable distortions in the data (Koren, Bell and Volinsky 2009; Koren and Bell 2011). Consequently, recent works have suggested performing the decomposition of the user-item matrix based only on observed ratings and using an adequate regularization to address overfitting concerns (Canny 2002; Funk 2006; Paterek 2007; Bell, Koren, and Volinsky 2007; Salakhutdinov, Mnih, and Hinton 2007; Koren 2008; Koren and Bell 2011). If this approach is adopted, then the rating $r_{u,i}$ of the user $u$ for the item $i$ may be modeled as the inner product of the vector of movie qualities $q_i$ and the vector of user preferences $p_u$, which are each described in terms of $f$ latent factor dimensions (Koren, Bell, and Volinsky 2009):

$$\hat{r}_{u,i} = q_i^{T} p_u \qquad (2.14)$$

In other words, a rating is considered to be a projection of the results from the interaction between a user’s preferences and a movie’s properties onto the common space that is defined by these two vectors. However, neither the two vectors themselves nor the dimensionality of these vectors is known. The only information that the system can utilize are the results of user-item interactions, i.e., the ratings that users have previously provided for items. The task of the system is therefore to first recover knowledge about the users and items of interest from these prior ratings and then use this knowledge and expression (2.14) to predict future ratings.


In essence, the system must iterate through all ratings and infer which portion of each rating reflects user preferences and which aspects of these ratings reflect item properties; in other words, the system must decompose ratings into user and item vectors. This decomposition should additionally be performed such that expression (2.14) “gains” validity across the entire set of known ratings.

To determine the vectors $q_i$ and $p_u$, the algorithm for this approach iteratively minimizes the regularized squared error on the set of the observed ratings (Koren, Bell, and Volinsky 2009):

$$\min_{q^*, p^*} \sum_{(u,i) \in K} \left( r_{u,i} - q_i^{T} p_u \right)^2 + \lambda \left( \| q_i \|^2 + \| p_u \|^2 \right) \qquad (2.15)$$

where $K$ denotes the “training” set, i.e., the set of $(u,i)$ pairs for which $r_{u,i}$ is known. The constant $\lambda$ controls the extent of regularization; this constant seeks to address any potential overfitting of the learned parameter values to the data by penalizing the growth of the magnitudes of these parameters in each iteration, thus reducing the size of the iteration step. The value of $\lambda$ is typically determined by cross-validation (Koren, Bell, and Volinsky 2009).

The learning of the parameters, i.e., the minimization of the sum that appears in expression (2.15), is typically performed by either the alternating least squares (ALS) method or the stochastic gradient descent method. As suggested by its name, the ALS approach alternates between fixing the $q_i$ values and fixing the $p_u$ values. During each iteration of this approach that involves fixed $p_u$ values, the algorithm recomputes the $q_i$ values by solving a least-squares problem; conversely, during each iteration of the ALS approach that involves fixed $q_i$ values, the algorithm recomputes the $p_u$ values by solving a least-squares problem. Each step of this process decreases the value of expression (2.15). The alternation continues until the values that are obtained for this expression converge (Bell and Koren 2007).

Another method for learning the parameters for a rating situation, stochastic gradient descent, can be attributed to Funk (2006), who popularized this approach during the Netflix Prize contest. This simple technique allowed him to reach the top ranking among the contestants for this prize; as a result, this approach became the focus of extensive attention in the RS literature (Paterek 2007; Salakhutdinov, Mnih, and Hinton 2007; Takács et al. 2007; Koren 2008; Koren, Bell, and Volinsky 2009; Koren and Bell 2011).

This technique utilizes an algorithm that loops through all of the ratings in the training set; for each given rating $r_{u,i}$, the algorithm computes both its predicted value $\hat{r}_{u,i}$ and the associated prediction error $e_{u,i} = r_{u,i} - q_i^{T} p_u$. Subsequently, the algorithm modifies its calculated parameters by a magnitude that is proportional to the learning rate $\gamma$, i.e., the step size, in the opposite direction of the calculated gradient (Koren, Bell, and Volinsky 2009):

$$q_i \leftarrow q_i + \gamma \left( e_{u,i} \, p_u - \lambda \, q_i \right), \qquad p_u \leftarrow p_u + \gamma \left( e_{u,i} \, q_i - \lambda \, p_u \right) \qquad (2.16)$$

The learning is completed if the sum in equation (2.15) cannot be reduced any further or if the magnitude of the decrease in this sum during a given iteration does not exceed a particular preassigned threshold, such as 0.001.

The dimensionality $f$ of the factor space can be either established from certain considerations, such as system performance, or determined directly during the process of decomposition. In the latter situation, another loop wraps around the algorithm. During each cycle through this outer loop, the algorithm learns one factor dimension, i.e., one coordinate of the $q_i$ and $p_u$ values. Once no further iteration of the inner loop can decrease the cost function (2.15), another factor dimension is added, and the learning process continues, incorporating this new dimension. This loop proceeds until the addition of further factors does not decrease the cost function (Funk 2006). The intuition behind this procedure is that during the first cycle of iterations of the algorithm, the parameters of the factor with the highest explanatory power are learned until this first factor captures the maximal amount of the variance in the ratings that it can explain. The second factor attempts to capture the majority of the remaining variance, and this process continues as more factors are identified. Thus, the explanatory power of each successive factor decreases. This approach is directly analogous to the principle of SVD decomposition; therefore, the matrix factorization techniques are often collectively known as “SVD methods”.

A comprehensive overview of the recent advances in matrix factorization for CF has been provided by Koren and Bell (2011). These authors addressed topics that relate to computational issues, various aspects of modeling, and parameter estimation; furthermore, they demonstrated how temporal models and implicit user feedback may be utilized to improve the accuracy of a model. In addition, these authors described their insights from applying these techniques in the context of the Netflix Prize competition.
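As an illustration of the procedure defined by expressions (2.15) and (2.16), the following Python sketch implements the stochastic gradient descent loop. The hyperparameter values, the seed, and the toy rating triples are illustrative assumptions rather than settings taken from the literature:

```python
import random

def train_mf(triples, n_factors=2, lr=0.02, reg=0.02, n_epochs=500):
    """Sketch of the SGD minimization of (2.15) with the updates of (2.16);
    triples is a list of (user, item, rating) tuples, i.e., the set K."""
    random.seed(0)
    p = {u: [random.gauss(0, 0.1) for _ in range(n_factors)]
         for u, _, _ in triples}
    q = {i: [random.gauss(0, 0.1) for _ in range(n_factors)]
         for _, i, _ in triples}
    for _ in range(n_epochs):
        for u, i, r in triples:
            pred = sum(pf * qf for pf, qf in zip(p[u], q[i]))
            err = r - pred                    # e_ui = r_ui - q_i^T p_u
            for f in range(n_factors):
                pf, qf = p[u][f], q[i][f]
                q[i][f] += lr * (err * pf - reg * qf)   # first rule of (2.16)
                p[u][f] += lr * (err * qf - reg * pf)   # second rule of (2.16)
    return p, q

# Illustrative training data and a prediction via (2.14):
triples = [("Thorsten", "Thor", 9), ("Thorsten", "Titanic", 3),
           ("André", "Thor", 8), ("André", "Titanic", 3)]
p, q = train_mf(triples)
print(sum(pf * qf for pf, qf in zip(p["Thorsten"], q["Thor"])))
```

With only four observations the example is purely mechanical; in practice the loop runs over millions of ratings, and the stopping criterion described above replaces the fixed epoch count.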


2.1.2 Content-Based Filtering

Content-based (CB) approaches generate predictions by accounting for both the similarities between items and information about the past preferences of the active user. In contrast to CF methods, the calculation of item similarity in CB methods does not rely on the ratings of other users but instead solely reflects the content characteristics of the examined items. The primary advantage of CB approaches over CF approaches is that the former approaches require neither the existence of a large user community nor a considerable rating history to produce recommendations. In essence, CB methods do not need knowledge about any users other than the active user who is seeking recommendations (Jannach et al. 2011). The recommendation task consists of determining the items that are similar to items that the active user has enjoyed in the past (Balabanovic and Shoham 1997; Mladenic 1999; Herlocker et al. 1999). Historically, CB approaches have been developed for recommending text-based items, such as e-mail messages or news (Jannach et al. 2011). Accordingly, CB methods continue to primarily be utilized for the recommendation of textual documents. Nevertheless, the general idea of exploiting an object’s content can also be expanded to the domains of non-textual products or items. However, to address this situation, certain modifications to the original CB approach must be implemented. Hence, the current section is divided into two subsections. The first subsection describes the principles and procedures of the “original” text-based CB approaches, whereas the second subsection addresses the specific details of CB applications in non-textual domains.

2.1.2.1 The Principles of Content-Based Approaches

Because content-based approaches were first developed in the field of information retrieval and data mining, these approaches are primarily utilized for generating recommendations of textual documents (Jannach et al. 2011). The standard content-based approach involves extracting a list of relevant keywords from the content of a document or from a textual description of this document (Balabanovic and Shoham 1997; Adomavicius and Tuzhilin 2005; Lops, de Gemmis and Semeraro 2011; Jannach et al. 2011). Consequently, each document is described with a vector of dimensionality that is equal to the number of relevant keywords (which are also often known as features) that are maintained in the system. These vectors are then used to determine the documents, i.e., the items, that are similar to the documents that have previously interested a particular user. After these items are determined, they can be recommended to this user. To obtain an intuitive idea of how this process functions, examine Figure 2.4 and Table 2.5. The figure illustrates the principle of how keywords are extracted from documents and how these keywords are utilized to construct a vector representation of each document. In the given example, the vector’s elements correspond to the frequency of appearance of each keyword (a term which is used to encompass terms of interest that may have multiple words, such as “E. coli”) in the document. Alternative techniques for constructing the keyword vector in a more comprehensive manner will be discussed below. For the moment, to simplify our example, we simply regard the elements of a keyword vector as representative of the presence of each keyword in the examined document.

[Figure 2.4: An illustration of the extraction of a features vector from a document. The article snippet “Emmerich defends Shakespeare film: German film director Roland Emmerich admits courting controversy with his film that questions the authorship of Shakespeare’s plays.” is mapped onto a keyword-frequency vector whose elements count the occurrences of each maintained keyword, e.g., Emmerich: 2, film: 2, Aid: 0, …, director: 1, E. coli: 0.]


Consider now Table 2.5, which encompasses five article headlines²⁰ and their binary representations as four-element row-vectors. For each element, a “y” denotes the presence of a specific term in an examined article. The last column of the table indicates the preferences of the user Thorsten for the first four articles. We observe that Thorsten liked articles with the “director”, “film” and “aid” keywords; by contrast, Thorsten did not like articles with the “E. coli” and “aid” keywords. Thus, the system will assign positive weights for the “film” and “director” keywords to Thorsten’s user profile. The “E. coli” keyword will receive a negative weight in Thorsten’s profile, whereas “aid” will be neutrally weighted in this profile because “aid” appears equally frequently in both liked and disliked articles. Based on these considerations, the system will predict that Thorsten will like the last article, which describes Tom Hanks’ attendance at his new movie, because this article’s content includes the keyword “film”. The system would predict that Thorsten would be less appreciative of the article if this article also contained the “E. coli” keyword. The precise magnitude of Thorsten’s predicted rating would depend on the relative weighting of each keyword.

Table 2.5: The principle of content-based filtering

                                          aid   E. coli   director   film   Thorsten
  Emmerich defends Shakespeare film                           y        y        +
  EU sets E. coli aid at 150m euros        y       y                            -
  E. coli map: How the outbreak looks              y                            -
  Nadir to receive legal aid               y                                    +
  Tom Hanks had a 'personal mission'
  with Larry Crowne                                                    y        ?

²⁰ The article headlines and annotations in Figure 2.4 and Table 2.5 are obtained from http://bbc.com; these articles were retrieved on 07.06.2011.
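The weighting logic just described can be expressed in a few lines. The following Python sketch is our own naive illustration (the ±1 feedback encoding and the averaging scheme are simplifying assumptions, not the full machinery discussed below); it derives Thorsten’s keyword weights from Table 2.5 and scores the unseen Tom Hanks article:

```python
# Each article: (set of keywords present, Thorsten's feedback as +1 or -1).
articles = {
    "Emmerich defends Shakespeare film": ({"director", "film"}, +1),
    "EU sets E. coli aid at 150m euros": ({"aid", "E. coli"}, -1),
    "E. coli map: How the outbreak looks": ({"E. coli"}, -1),
    "Nadir to receive legal aid": ({"aid"}, +1),
}

feedback_per_keyword = {}
for keywords, feedback in articles.values():
    for kw in keywords:
        feedback_per_keyword.setdefault(kw, []).append(feedback)
profile = {kw: sum(v) / len(v) for kw, v in feedback_per_keyword.items()}
print(profile)  # film/director positive, E. coli negative, aid neutral (0.0)

# Score an unseen article by summing the profile weights of its keywords.
unseen = {"film"}
print(sum(profile.get(kw, 0.0) for kw in unseen))  # positive -> recommend
```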

We now consider the content-based approach in greater detail: The aforementioned binary (as depicted in Table 2.5) and frequency-based (as illustrated in Figure 2.4) encodings of keywords are not the only methods to construct vector representations of documents. The need for more comprehensive techniques emerges because of the following shortcomings of the aforementioned methods. The binary representation assumes that all keywords have the same importance for characterizing the content of a document. However, conventional wisdom indicates that keywords that occur more often in a document should be more descriptive of the document in question. Although the frequency-based encoding approach addresses this issue, another serious drawback remains. In particular, longer documents inherently feature higher keyword frequencies and a richer vocabulary; therefore, both the probability that a keyword vector will contain a specific word and the weights of these keywords for this document will increase with the length of the document in question. Consequently, longer documents will have a higher probability of being recommended than shorter documents because, compared with shorter documents, longer documents will not only possess keyword vectors that are more likely to overlap with user profiles but the relevance weights of their keywords will also be more likely to be overestimated (Jannach et al. 2011; Lops, Gemmis, and Semeraro 2011).

A standard approach to address these shortcomings is the term frequency - inverse document frequency (TF-IDF) method (Salton, Wong and Yang 1975), an established technique from the field of information retrieval. The primary idea underlying this approach is that the descriptive power of a keyword for a document is dependent on both how frequently this word appears within the document itself and how often this word occurs within the entire corpus of examined documents. Accordingly, TF-IDF is composed of two measures: Term frequency (TF) describes the frequency of a keyword’s occurrence in a document; this assessment assumes that important words will occur more frequently within a particular text. To account for document lengths and to prevent longer documents from being allocated higher relevance weights, each keyword’s frequency is normalized (Jannach et al. 2011); typically, this normalization is accomplished by relating each keyword’s frequency to the maximum frequency of other words in the document²¹ (Adomavicius and Tuzhilin 2005; Lops, Gemmis, and Semeraro 2011). By contrast, inverse document frequency (IDF) assumes that words that occur infrequently among the entire set of examined documents will be the most powerful terms for describing a document’s contents. In other words, in the IDF approach, terms that are commonplace among the examined documents are not regarded as very helpful terms for discriminating among these documents (Jannach et al. 2011). Thus, IDF discounts the weights of words that appear frequently. The product of TF and IDF yields the TF-IDF metric, which accounts for both of the aspects described above.

²¹ Other normalization schemes, which are optimized for specific cases, have been proposed by Chakrabarti (2002), Pazzani and Billsus (2007), and Salton and Buckley (1988).

More formally, let $f_{i,j}$ be the frequency of the keyword $k_i$ in the document $d_j$, and let $\max_z f_{z,j}$ denote the maximum frequency among the examined keywords in this document. Furthermore, let $N$ be the number of documents in the corpus, and let $n_i$ denote the number of documents from these $N$ in which $k_i$ appears. Then, in a given document corpus, the TF-IDF metric and its components are defined as follows:

$$TF(k_i, d_j) = \frac{f_{i,j}}{\max_z f_{z,j}} \qquad (2.17)$$

$$IDF(k_i) = \log \frac{N}{n_i} \qquad (2.18)$$

$$TF\text{-}IDF(k_i, d_j) = TF(k_i, d_j) \cdot IDF(k_i) \qquad (2.19)$$
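A minimal Python sketch of expressions (2.17) to (2.19) follows; the toy corpus and the function names are illustrative:

```python
from math import log

def tf_idf_vectors(docs):
    """TF-IDF per expressions (2.17) to (2.19); docs maps a document id to
    the list of keywords extracted from that document."""
    n_docs = len(docs)
    doc_freq = {}                             # n_i: documents containing k_i
    for words in docs.values():
        for w in set(words):
            doc_freq[w] = doc_freq.get(w, 0) + 1
    vectors = {}
    for doc, words in docs.items():
        counts = {}
        for w in words:
            counts[w] = counts.get(w, 0) + 1
        max_freq = max(counts.values())       # normalization of (2.17)
        vectors[doc] = {w: (f / max_freq) * log(n_docs / doc_freq[w])
                        for w, f in counts.items()}
    return vectors

# A toy corpus in the spirit of Figure 2.4 and Table 2.5:
docs = {
    "d1": ["film", "film", "director", "Emmerich", "Emmerich"],
    "d2": ["aid", "E. coli", "aid"],
    "d3": ["film", "Hanks"],
}
print(tf_idf_vectors(docs)["d1"])  # "film" is discounted: it also occurs in d3
```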

After TF-IDF vector representations are computed, the similarities among the examined documents can be determined through the use of a similarity measure. Depending on the problem at hand, various similarity measures may be applicable (Maimon and Rokach 2005; Baeza-Yates and Ribeiro-Neto 1999; Zanker et al. 2006). However, in the domain of the recommendation of textual documents, the most common approach is the use of the cosine similarity metric that is defined in equation (2.4) (Adomavicius and Tuzhilin 2005; Jannach et al. 2011; Lops, Gemmis, and Semeraro 2011). In essence, the remaining procedures for recommendation generation in CB approaches are analogous to the procedures for item-based techniques; however, in CB approaches, only document ratings from the active user are employed. In other words, among the items that an active user has rated, the most similar items to each unrated item “vote” for this unrated item (Allan et al. 1998; Jannach et al. 2011). In addition, similarly to CF approaches, in CB techniques, the number of the “voters” can either be explicitly established or determined through the use of a minimum similarity threshold (Billsus, Pazzani, and Chen 2000; Billsus and Pazzani 1999). The “votes” are then aggregated into predicted ratings; this procedure typically employs a weighting rule that is based on the degree of similarity between the examined item pairs. This type of weighting rule is analogous to aggregation rule (2.8). After the predictions are completed, the item(s) with the highest rating(s) or with the highest similarity to previously enjoyed items can be recommended.

2.1.2.2 The Exploitation of Content Characteristics in Non-Textual Item Domains

As noted in the introductory portion of Section 2.1, the idea of exploiting the content characteristics of items for the production of recommendations can also be transferred into the domains of non-textual objects, such as songs or movies. However, this transference poses the challenge of identifying and extracting qualitative characteristics that appropriately represent user and item profiles. This task is difficult primarily because of the very limited ability of modern content processing algorithms to automatically extract meaningful features that are descriptive of multimedia content (Wei, Shaw, and Easley 2002; Pazzani and Billsus 1997; Lops, de Gemmis, and Semeraro 2011). Thus, to assess multimedia items, recommender algorithms must rely on rather “technical” characteristics of the examined content (such as genre, cast, and length, among other traits), which are either available from content providers or content manufacturers (Jannach et al. 2011) or extractable from external sources of information, such as catalogs or movie critic web sites (e.g., Alspector, Kolcz, and Karunanithi 1998). Nevertheless, these technical content characteristics do not always overlap with the qualitative features that determine a consumer’s judgment of items: For example, in the domains of quality and taste, the reasons that a consumer likes an item are often based on subjective impressions, such as features of an item’s exterior design, rather than particular product characteristics (Jannach et al. 2011). A manual specification of the item’s features by domain experts appears to be the only option for addressing this limitation (Adomavicius and Tuzhilin 2005; Lops, de Gemmis, and Semeraro 2011; Jannach et al. 2011).


The most prominent and exceptional example of the application of the CB approach to manually coded items is provided by Pandora.com, a popular internet radio and music recommendation service. Pandora’s services rely on data from the “Music Genome Project”²²; these data are manually entered by highly-trained analysts.²³ A song’s description often encompasses up to several hundred features²⁴, or “music genes”, which relate to a song’s instrumentation, influence, measures, key tonality, structure, vocal harmonies, aesthetics, phrasing, lyrical mood, and emotional content, among other considerations. However, in most applications, the effort to manually encode item characteristics is considered to be impractical because of resource limitations (Adomavicius and Tuzhilin 2005; Jannach et al. 2011). As stated by the founder of Pandora, Tim Westergren, the “unlocking” of a track’s music genes, i.e., its manual annotation, by a trained musician requires times that range from approximately fifteen minutes for a pop song to approximately an hour and a half for more sophisticated compositions (Tim Westergren, cited in Tran-Le 2010).

Because of resource limitations, RSs frequently only utilize item characteristics (i.e., attributes) that are available in electronic form (Jannach et al. 2011). Although considerable quantities of “technical” attributes are available in certain domains, such as the motion picture domain, an RS typically exploits only a subset of the available attributes for the items that are being examined (e.g., Ansari, Essegaier, and Kohli 2000; Kim and Kim 2001; Burke 2002; Melville, Mooney, and Nagarajan 2002; Ying, Feinberg, and Wedel 2006; Gunawardana and Meek 2009; Park and Chu 2009). This phenomenon reflects the issue of assigning importance weights to attributes within the vector representations of items. In the simplest case, the attributes of movies, i.e., genres, actors, directors, and other traits, would be coded in binary fashion to indicate whether each attribute of interest, e.g., a specific actor, is present in an examined movie. However, in this case, all attributes would be regarded as equally important for movie description, whereas conventional wisdom indicates that some attributes may discriminate more strongly than others. For instance, the presence of a specific star actor in a movie may be a stronger signal than the categorization of a movie to a particular genre or the movie’s association with a specific production company.

²² http://www.pandora.com/mgp.shtml
²³ http://blog.pandora.com/faq/contents/506.html
²⁴ http://blog.pandora.com/faq/contents/19.html

In contrast to the case of text-based items, in the movie domain, attribute importance weights cannot be characterized in terms of the frequencies of these attributes because each movie can be described only once on each attribute. In other words, we cannot state that there is more Clint Eastwood in “For a Few Dollars More” than in “The Good, the Bad and the Ugly”, and we cannot assert that one of these two films is more of a western movie than the other. Consequently, the frequency-based TF-IDF measure is not available for allocating importance weights to the attributes of motion pictures.

Because of the lack of an instrument to assign importance weights to attributes that are indicative of the abilities of these attributes to differentiate among movies, the typical CB approach for assessing movies involves the use of the binary movie vector representations that are described above. The issue of the different roles that the attributes play in the formation of a user’s movie preferences is addressed strictly through the user’s profile. This profile is represented as a vector with a number of dimensions that is equal to the number of attributes in each movie vector plus one. Hence, each dimension of the user vector represents both the importance weight of the corresponding attribute for the user’s process of discriminating among different movies and the quantity of movie preference that the user associates with this attribute; the last dimension of this vector represents the user’s rating baseline. The values of the vector’s entries are estimated by regressing the user’s past ratings on the set of available movie attributes (e.g., Ansari, Essegaier, and Kohli 2000; Kim and Kim 2001; Ying, Feinberg, and Wedel 2006). The regression model is typically formulated as follows:

$$r_{u,j} = \beta_{u,0} + \beta_{u,1} x_{j,1} + \beta_{u,2} x_{j,2} + \dots + \beta_{u,P} x_{j,P} + \varepsilon_{u,j} \qquad (2.20)$$

where $r_{u,j}$ denotes the rating of user $u$ for movie $j$; $x_{j,1}, \dots, x_{j,P}$ are binary dummy variables that indicate the presence of the $p$th attribute in the movie’s characteristics; $\beta_{u,1}, \dots, \beta_{u,P}$ are the regression coefficients for these dummy variables, with the constant term of $\beta_{u,0}$; and $\varepsilon_{u,j}$ denotes the estimation error of the regression model.

Note that the regression coefficients $\beta_{u,1}, \dots, \beta_{u,P}$ correspond to the movie attributes and capture the portion of the rating that is due to the presence of each movie attribute in a movie characteristics vector, i.e., the part-worths of the examined attributes. The beta values may be either positive, which would indicate an increase in preference if a particular attribute is present, or negative, which would indicate a dislike of an attribute.


The baseline estimate $\beta_{u,0}$ indicates the quantity of preference that a user demonstrates for movies in general, i.e., if no information about the movie’s characteristics is available; thus, this baseline estimate is equal to the active user’s mean movie rating.²⁵ The regression coefficients that are estimated from equation (2.20) constitute the user preference profiles.

²⁵ This phenomenon occurs because of the specific features of dummy regressions (for further detail, see Gujarati 2004).

In contrast to the situation involving textual items, in the case of non-textual items, the CB recommendation procedure omits the step of computing the similarity between item pairs. This omission reflects two specific properties of the non-textual domain: First, the content of non-textual items is described in terms of binary vectors; as demonstrated above, binary vectors represent the only practical way of automatically describing non-textual items because of issues with the assigning of different levels of an attribute to an item. By contrast, in a textual domain, each attribute (i.e., keyword) can be characterized by the number of its occurrences in a document, which “grants” descriptive power to the quantity of a feature in an item; this descriptive power is essentially absent in the assessment of non-textual items. Second, relative to the representation of a textual item, the simplified representation of a non-textual item allows preferences to be fully attributed to user profiles. Thus, a recommendation can be generated through the direct matching of item profiles with a user’s profile; this matching does not require a search for items that are similar to the items that were most strongly enjoyed by the active user in the past. In a sense, the information that was contained in the concept of similarity for the examination of textual items has been incorporated into the user profile for the examination of non-textual products. Thus, instead of predicting ratings for unseen movies through the “voting” of items that are similar to each unseen movie, predictions can be accomplished through the application of regression equation (2.20); in terms of vector forms, this process simply involves computing the inner product of $\boldsymbol{\beta}_u$, a user’s profile vector, and $\mathbf{x}_j$, the vector of a movie’s attributes:

$$\hat{r}_{u,j} = \boldsymbol{\beta}_u^{T} \mathbf{x}_j \qquad (2.21)$$

However, note that for expression (2.21) to hold in a formal and mathematical manner, a unity entry must be added to the movie vector $\mathbf{x}_j$ at a position that corresponds to the position of the $\beta_{u,0}$ entry in the user vector; this addition ensures that the baseline estimate is appropriately incorporated into the final calculated sum. Analogously to all of the previously discussed methods, after the ratings are predicted, the item(s) with the highest rating(s) can be recommended to the active user.

We now return to the issue of explaining the aforementioned assertion that CB approaches for recommendation generation typically utilize only a fraction of the available attributes of non-textual items. Although in many non-textual domains, including the domain of motion pictures, considerable quantities of attributes are frequently available in electronic form or can be readily extracted from additional information sources, the natural desire to include as many of these attributes as possible in the recommendation process, to increase the “overlap” of the assessed technical item characteristics with qualitative considerations, cannot be fulfilled because of the restrictions of the regression analysis. In particular, the problem is that regression analysis typically requires at least one observation per estimated parameter; in the absence of these observations, there are insufficient data to obtain a solution for expression (2.20) (Gujarati 2004). In addition, to avoid multicollinearity, the assessed observations are required to be mutually linearly independent with respect to the estimated parameters; in the presence of multicollinearity, expression (2.20) is only solvable if the parameters that cause multicollinearity are omitted from the model or other countermeasures are implemented (Gujarati 2004). Besides other requirements of the regression analysis that can also be violated, the two aforementioned issues considerably limit the number of possible parameters, i.e., attributes, that can be introduced to the regression model. In the best case, if no multicollinearity issues exist, the upper limit for the number of attributes that can be considered for each user is equal to the number of this user’s ratings that are available to an RS. Given that in the available movie RS datasets, the majority of users have each rated approximately twenty movies, the inclusion of an overly high number of attributes in the regression model would harm an RS because this highly specific RS would be able to produce recommendations only for a narrow group of its users. Thus, at present, the number of attributes that are considered by content-based movie RSs varies between ten (Kim and Kim 2001) and twelve (Ansari, Essegaier, and Kohli 2000; Ying, Feinberg, and Wedel 2006).
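The following Python sketch illustrates this regression-based procedure under the constraints just stated. The three attributes and the ratings are hypothetical, and numpy’s least-squares solver stands in for the econometric estimation described above:

```python
import numpy as np

def fit_user_profile(attribute_matrix, past_ratings):
    """Estimate regression (2.20) by ordinary least squares; the returned
    vector holds the part-worths beta_u1..beta_uP followed by the baseline
    beta_u0 (estimated via the appended unity column)."""
    X = np.column_stack([attribute_matrix, np.ones(len(past_ratings))])
    beta, *_ = np.linalg.lstsq(X, np.asarray(past_ratings, float), rcond=None)
    return beta

def predict_rating(profile, movie_attributes):
    """Prediction (2.21): the inner product of the user profile and the
    movie's attribute vector, extended by the unity entry for the baseline."""
    return float(profile @ np.append(movie_attributes, 1.0))

# Hypothetical binary attributes (e.g., action, drama, star actor) for four
# movies that the user has rated, and the corresponding past ratings:
rated_movies = np.array([[1, 0, 1],
                         [0, 1, 0],
                         [1, 1, 0],
                         [0, 0, 1]])
profile = fit_user_profile(rated_movies, [9, 4, 6, 8])
print(predict_rating(profile, np.array([1, 0, 0])))  # an unseen action movie
```

Note that the example obeys the observation constraint discussed above: four ratings are just enough to estimate three part-worths plus the baseline.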


In fact, the estimation of a larger number of attribute part-worths for each user than is currently examined by content-based movie RSs would not be feasible within the CB approach that is described above. However, the discarding of a substantial quantity of potentially relevant attribute knowledge implies that a considerable portion of preference-relevant variance in a user’s prior ratings will not be captured by a CB model. As a result of this issue, CB models will demonstrate higher levels of prediction errors and reduced recommendation accuracy. This concern provides an explanation for why the majority of approaches that incorporate movie attributes into recommendation algorithms (e.g., Baudisch 1999; Burke 2002; Melville, Mooney, and Nagarajan 2002; Park and Chu 2009; Gunawardana and Meek 2009) not only exploit only a small fraction of the available movie attributes but also refrain from directly using knowledge about these attributes to generate rating predictions; instead, CB information is typically utilized as additional information that can improve the CF predictions of hybrid models. A brief overview of hybrid approaches for RSs will be provided in the subsequent sections of this thesis. However, before providing this overview, we first examine the issues and trade-offs of the collaborative and content-based approaches that motivate the creation and emergence of hybrid recommendation techniques.

2.1.3 Critical Issues of Collaborative and Content-based Approaches

All of the recommendation techniques that have been introduced in the preceding sections have merits and limitations that require trade-offs with respect to the question of which approach a particular RS should employ. Certain trade-offs, namely, trade-offs that are relevant to the provision of effective explanations of recommendations, will be discussed in Section 2.2. In the current section of this thesis, we provide a brief overview of the strengths and weaknesses of CB and CF approaches that influence the functionality of an RS in a technical sense. In other words, in this section, the advantages and disadvantages of these approaches that impact the ability of an RS to provide recommendations are examined. Table 2.6 summarizes the discussion of these strengths and weaknesses.


Table 2.6: A summary of the strengths and weaknesses of different recommendation approaches
“+” denotes a tendency to exhibit a problem, “–” indicates a lack of susceptibility to a problem, and “±” symbolizes the presence of a weakened form of a problem

  Type of problem              User-based   Item-based   Matrix factorization   Content-based
  Sparsity                         +            +                  ±                  –
  New User                         +            –                  ±                  +
  New Item                         –            +                  ±                  –
  Overspecialization               –            –                  –                  +
  Gray Sheep                       +            –                  ±                  –
  Starvation                       –            +                  ±                  –
  Shilling Attacks                 +            +                  ±                  –
  Stability vs. Plasticity         +            +                  +                  +

2.1.3.1 Data Sparsity

Arguably, the most common problem for RSs that produces nearly all of the other issues of concern for these systems is the sparsity of the underlying data base for each RS. In other words, RSs must produce their recommendations on the basis of a user-item rating matrix that is typically very far from being dense (Burke 2002). Consider the example of Amazon, a firm that stocks millions of items and offers these items to millions of users. In this situation, it is unrealistic for any individual user to have rated a considerable percentage of the items in Amazon’s catalog. Quite the contrary, it is more realistic to assume that the majority of Amazon’s customers have rated only vanishingly small subsets of the items that are offered by this firm. These types of scarce datasets are typical for most RSs. For instance, in the Netflix Prize dataset, more than 99% of the possible ratings are missing (Koren and Bell 2011). The same problem applies to the publicly available data sets of EachMovie and MovieLens (O’Sulivan, Smyth and Wilson 2004, p. 230) as well as to the data that form the basis of MoviePilot’s recommender system. Although sparsity is problematic for all types of recommendation approaches, it is a particularly thorny issue for collaborative techniques, particularly for item-based and user-

2. Background and Related Work

43

based collaborative methods. This is because they base their predictions on neighborhoods of like-minded users or similar items that are similar to an unknown item of interest. However, to form the latter, however, some level of overlap between pairs of users or item profiles is required (Burke 2000). In other words, if two users with identical tastes have rated different segments of items, then a user-based CF system will fail to detect the similarity between these two users because the profiles of these users will not feature a sufficient number of common items. Thus, the system will not be able to recommend the items that are enjoyed by one of the users to the other user in question, despite the fact that these two users have identical tastes. Analogously, in the item-based approach, if two item profiles do not demonstrate sufficient overlap, they will not be regarded as similar, even if these two profiles are compiled for duplicates of the same item. Thus, in the absence of sufficient overlap, the information that is contained in one of the item profiles cannot be used to predict the user ratings for the other item. For MF approaches, the sparsity problem has not been sufficiently investigated in the extant literature. However, there are indications that the sparsity problem is mitigated in MF approaches because these methods reduce the dimensionality of the space in which recommendations are made by extracting latent factors from the original data (Burke 2000). Nevertheless, intuitively, factors can only successfully capture the variance in user and item profiles if these profiles demonstrate a degree of overlap. However, in contrast to the itembased and user-based approaches, in MF methods, user and item profiles are simultaneously involved in factor extraction. Thus, it is enough that the overlap happens along either of these two profile dimensions; thus, sufficient overlap is more likely to occur in MF methods than in cases in which only one of these two profile dimensions is considered. Nonetheless, sparsity remains a significant problem in domains that include a high number of items, unless the user base is very large (Burke 2000). As described above, CB approaches do not utilize the ratings of other users for their predictions but instead base these predictions on the content characteristics of items. Moreover, these content descriptions compose the data basis of CB approaches and are therefore available for each item in the catalog of an RS. Thus, the item space of a CB recommender is dense, and the density of the user space for a CB RS is irrelevant. Therefore, relative to other RS approaches, CB approaches are less likely to suffer from sparsity. Nonetheless, the density of the active user’s profile remains an important issue for CB techniques

This concern about the density of the active user’s profile may, however, be regarded as a subclass of the “ramp-up” problem for RSs (Konstan et al. 1998), which is discussed in the subsequent section of this thesis.

2.1.3.2 “Ramp-up”: New User and New Item Problems

The “ramp-up” problem (often known as the “cold-start” problem) applies to situations in which an RS does not possess sufficient information to generate an accurate rating prediction (Konstan et al. 1998). These situations may occur if (i) a new user or (ii) a new item is introduced to a system; accordingly, they are frequently referred to as the “new user” and “new item” problems (Konstan et al. 1998; Burke 2002; Adomavicius and Tuzhilin 2005). The new user problem primarily poses an issue for user-based methods and content-based approaches. These techniques must acquire sufficient knowledge about an active user through this user’s ratings to either identify users with preferences that are similar to the active user’s opinions (in user-based CF systems) or detect items that match the active user’s profile (in CB systems). In these systems, to establish the basis for future recommendations, new users must supply ratings, which provide information about their tastes and preferences. New items, in turn, are added frequently to the catalogs that are maintained by an RS. In CB approaches, items are described in terms of their content and can therefore be recommended immediately after their introduction to the system. By contrast, in CF approaches, new items must receive ratings before they can be recommended. The new item problem is also known as the “early rater” problem because users who provide the first ratings for new items receive little benefit from doing so, given that these early ratings do not improve the users’ ability to be matched with other users (Avery and Zeckhauser 1997). Thus, CF systems must provide other incentives to encourage users to provide ratings (Burke 2002).

MF approaches, as a subclass of CF approaches, suffer equally from the new user and the new item problem; from the perspective of matrix decomposition, it matters little whether a new entry occurs in a row or a column of the user-item matrix. However, MF approaches rely less than other RS approaches on the similarity between users or items; instead, they factorize the matrix entries in a manner that can be described as “independent” of the row or column affiliations of these entries. Thus, ceteris paribus, the generation of recommendations is likely to require fewer ratings from a new user, or for a new item, in MF approaches than in user-based or item-based approaches, respectively. As evidenced by the above discussion, all CF approaches suffer from the ramp-up problem in one form or another. Thus, RSs must continuously acquire additional data, i.e., ratings, from users to improve both their ability to generate recommendations and the quality of these recommendations.

2.1.3.3 Overspecialization

Although CB approaches are less prone than other RS approaches to the new item problem and are capable of generating recommendations once a single rating has been acquired from a user, these approaches often suffer from recommendation uniformity (Zhang, Callan, and Minka 2002; Jannach et al. 2011), a phenomenon that is frequently referred to as the “portfolio effect” (Billsus and Pazzani 2000; Linden, Smith, and York 2003; Burke 2002). Because CB systems recommend items that score highly against an active user’s profile, they tend to recommend items that are similar to those the user has already seen (Adomavicius and Tuzhilin 2005). As a consequence, the recommendations of a CB system tend to remain within a particular topic of interest; in extreme cases, a user may receive recommendations for different versions of the same item (such as a book or a news article), even if s/he already owns or has read it (Linden, Smith, and York 2003). Conversely, a user must exhibit an interest in at least one item that relates to a certain topic before this topic becomes relevant in the user’s profile.

By contrast, CF approaches allow for more diverse recommendations than CB approaches. Because CF approaches do not rely on item properties but instead utilize the ratings that users assign to a wide range of items, CF methods tend to be more capable than CB methods of identifying cross-genre relationships among items (Adomavicius and Tuzhilin 2005; Jannach et al. 2011). Thus, CF techniques are more helpful than CB techniques for discovering items that users might not otherwise have considered (Burke 2002).

2.1.3.4 Stability vs. Plasticity

As noted above, the recommendation ability of CF and CB approaches improves over time through the continuous acquisition of additional user input, which addresses the ramp-up problem. The converse of the ramp-up problem is the “stability vs. plasticity” problem (Burke 2002), which refers to the concern that an RS may become overly rigid, i.e., insensitive to changes in user preferences. In a sense, this problem consists of the fact that established knowledge about the prior preferences of a user may “dominate” any new user input. For instance, suppose that a devoted science fiction fan abruptly begins to give high ratings to dramas. An RS might not recognize this change in the user’s preferences, particularly if the new input conflicts with prior negative ratings for dramas. Instead, the system is likely to regard a new positive drama rating as an outlier and continue recommending science fiction movies. As in the ramp-up situation, the user would need to provide the system with a substantial number of positive drama ratings before the system’s knowledge reflects the change in preferences. To counteract this development, certain RS approaches suggest discounting past user preferences to diminish the influence of older ratings on recommendation results; however, these approaches risk losing information about interests of the user that are stable over the long term but only occasionally exercised (Billsus and Pazzani 2000; Schwab, Kobsa, and Koychev 2001; Burke 2002; Tsymbal 2004). For instance, if the hypothetical science fiction fan also enjoys westerns but watches western films relatively infrequently, a temporal discount function might gradually “forget” the user’s preference for westerns and stop recommending these films to the user.
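To illustrate, a temporal discount of this kind is often operationalized as an exponential decay of rating weights; the following sketch (the half-life parameter and the numbers are purely illustrative assumptions, not a formula from the cited works) shows how a rarely refreshed preference fades:

```python
def decayed_weight(age_in_days, half_life_days=180.0):
    """Exponential decay: a rating loses half its influence every half-life."""
    return 0.5 ** (age_in_days / half_life_days)

# A western rated 5 stars two years ago vs. a drama rated 4 stars yesterday:
print(5 * decayed_weight(730))  # ~0.30: the stable but rarely exercised interest fades
print(4 * decayed_weight(1))    # ~3.98: the recent rating dominates
```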

2.1.3.5 Other Problems: “Gray Sheep”, “Starvation”, and Shilling Attacks

Although user-based CF methods are not affected by portfolio effects and can identify cross-genre niches, these methods suffer from an issue that is known as the “gray sheep” problem. This issue refers to the fact that users with “unusual” tastes cannot easily be categorized into a neighborhood of similar users because their rating profiles do not correlate well with the rating profiles of other users of an RS (Rashid et al. 2002; Claypool et al. 1999). Consequently, the generation of recommendations for these unusual users can be problematic. Similarly, items can be “starved” to the benefit of other items. Popular items become easier to find as more users rate them: the high number of ratings that are provided for a particular item increases the likelihood that this item will participate in the process of matching user profiles, and because popular items typically receive higher ratings, the probability that these items will be recommended increases as well. Likewise, in item-based approaches, popular items are more likely than unpopular items to exhibit a high similarity in terms of rating profiles; thus, popular items are again recommended more often than unpopular ones. For ambiguous items, i.e., items that provoke polarizing attitudes, it may be problematic to identify a neighborhood of similar items that can serve as a “source” for rating predictions. Thus, unpopular and ambiguous items become more difficult to discover (Rashid et al. 2002; McNee et al. 2003). The starvation effect also causes CF systems to be susceptible to malicious manipulations (often called shilling attacks), i.e., the injection of ratings that explicitly seek to inflate or deflate the popularity of an item (Lam and Riedl 2004; Sandvig, Mobasher, and Burke 2007; Resnick and Sami 2007; Mobasher et al. 2007; Mehta, Hofmann, and Nejdl 2007). For example, a vendor who wants to increase his or her revenues from an independent online store could compromise the RS of this store by first creating several user profiles with ratings that conform to the preferences of target customers and then using these profiles to assign high ratings to the vendor’s own products and low ratings to competitors’ products. This manipulation would increase the chances that the vendor’s products are recommended and thereby raise the likelihood that these products are purchased, even though a competitor’s products might be better suited to fulfilling the needs of certain actual customers of the online store.
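The mechanics of such a push attack can be sketched in a few lines (a deliberately bare-bones user-based predictor on toy data of our own; none of this corresponds to any specific attacked system):

```python
import numpy as np

def predict(target, others, item):
    """Bare-bones user-based CF: similarity-weighted average of other users'
    ratings for `item` (NaN = missing; only positive correlations count)."""
    num = den = 0.0
    for u in others:
        mask = ~np.isnan(target) & ~np.isnan(u)
        if mask.sum() < 2 or np.isnan(u[item]):
            continue
        sim = np.corrcoef(target[mask], u[mask])[0, 1]
        if np.isnan(sim) or sim <= 0:
            continue
        num += sim * u[item]
        den += sim
    return num / den if den else None

# Items 0-2 are ordinary movies; item 3 is the item to be "pushed".
target = np.array([5.0, 4.0, 1.0, np.nan])
honest = [np.array([5.0, 3.0, 1.0, 2.0])]
# Shill profiles mimic the target's visible tastes and rate item 3 highly.
shills = [np.array([5.0, 4.0, 1.0, 5.0]) for _ in range(3)]

print(predict(target, honest, 3))           # ~2.0 before the attack
print(predict(target, honest + shills, 3))  # ~4.3: pushed towards the shills' rating
```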

Although MF approaches do not explicitly account for the relationships between pairs of user or item profiles, the amount and the character of the ratings in the rows and columns of the decomposed matrix influence the “direction” and the information content of the extracted factors. Although this issue has not been studied in prior research, we can plausibly presume that the higher number of ratings for a popular item pulls a factor towards the item in question, increasing the chances that this item is recommended (the starvation problem). Moreover, an unusual rating pattern in a user vector may cause that vector to exhibit lower factor loadings, which would make rating predictions for such users less reliable (the gray sheep problem). However, because the reduced dimensionality of the factor solution does not correspond directly to the user and item dimensions, the “distortion” of one factor may be compensated by other factors; this compensation is likely to reduce the extent of both types of problems in the context of MF approaches. CB approaches are immune to the problems discussed above because neither the rating profiles of items nor the ratings of other users are relevant to them. This discussion of the problems and trade-offs of recommendation approaches has demonstrated that different types of RSs suffer from different problems in different situations (see also Table 2.6). In other words, different recommendation techniques share only a subset of the problems that have been described and exhibit unequal tendencies to suffer from the remaining issues of concern. Thus, an intuitive solution for avoiding these problems is to combine two or more techniques within one RS in a manner that allows the problems of each component technique to be addressed by another technique that is not susceptible to these issues. This intuition has led to the emergence of a class of RSs known as hybrid recommender systems, which are discussed in the next section of this thesis.

2.1.4 Hybrid Recommender Systems

To transcend the trade-offs and problems that are associated with individual CF and CB recommendation approaches, hybrid systems generate recommendations by utilizing a combination of both types of recommendation methods. Most of these hybrid systems combine CB techniques with item-based CF (e.g., Balabanovic and Shoham 1997; Basu, Hirsh, and Cohen 1998; Claypool et al. 1999; Pazzani 1999; Soboroff and Nicholas 1999; Tran and Cohen 2000; Melville, Mooney, and Nagarajan 2002; Burke 2002; O’Sullivan et al. 2004; Symeonidis, Nanopoulos, and Manolopoulos 2007; Koren 2008). The goal of this combination is to take advantage of the invulnerability of CB techniques to data sparsity, the new item problem, and the starvation problem while utilizing CF methods to avoid the susceptibility of CB approaches to overspecialization. Another benefit of these hybrid approaches is that an item can be recommended to an active user both when the item is highly rated by similar users and when it scores highly directly against the user’s profile (Burke 2002; Adomavicius and Tuzhilin 2005). Hybrid approaches differ with respect to both how the different methods are combined to generate rating predictions and how deeply these methods are integrated with each other. According to Jannach et al. (2011), three basic types of hybridization designs can be identified: parallelized, monolithic, and pipelined hybrids (see also Figure 2.5). The subsequent subsections provide a parsimonious overview of these hybridization designs.

2.1.4.1 Parallelized Hybridization Design

Parallelized hybrids first implement CF and CB recommendation methods separately and subsequently combine the individual predictions of these methods (Adomavicius and Tuzhilin 2005).

The combination of the predictions of the individual recommendation approaches can follow mixed, weighted, or switching strategies (Burke 2002; Jannach et al. 2011). The mixed strategy combines the results of the different components of a hybrid system at the level of the user interface. In other words, the lists of the top-scoring items produced by all of the individual recommendation techniques that are used within a mixed-strategy hybrid are presented together to users (e.g., Burke, Hammond, and Young 1997; Wasfi 1999). This strategy is particularly applicable if each component of a hybrid processes information from different sources or product domains. For instance, this approach has merit for composing a TV viewing schedule (Cotter and Smyth 2000) or for recommending bundles of several distinct offerings, such as a combination of accommodations and sport/leisure activities (Zanker, Aschinger, and Jessenitschnig 2007). However, in certain cases, conflict resolution procedures may be required during the hybridization step; these procedures may include predefined precedence rules for different recommendation functions or a particular constraint filter, such as “activities must be within a 20 km reach of accommodations” (Jannach et al. 2011). The weighted strategy computes the final predictions of a hybrid as a linear combination of the individual predictions (e.g., Claypool et al. 1999), possibly employing a weighting scheme for the individual prediction methods (e.g., Pazzani 1999). The weights of the individual recommendation approaches in the hybrid can either be equal (in this case, the final rating prediction is a simple average of the individual predictions) or be established in proportion to the accuracy, in terms of a particular error metric such as the MAE (mean absolute error), that each individual predictor achieves on a holdout set. Furthermore, these weights can either be static throughout the hybrid approach or be determined dynamically for each particular recommendation situation (Jannach et al. 2011). A prominent example of a weighted hybrid is the approach that won the one-million-dollar Netflix Prize by “blending” the results from more than 100 different recommendation algorithms (Bell, Koren, and Volinsky 2007b, 2008). In this approach, the contribution of each individual algorithm to the final rating (i.e., its weight) is determined through a linear regression in which the vector of ratings in the holdout set serves as the dependent variable and the vectors of ratings predicted by the different methods for the same set of ratings serve as the independent variables.
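The blending step can be sketched as an ordinary least-squares fit (a simplified illustration of the weighted strategy, not the actual Netflix Prize code; all numbers are invented):

```python
import numpy as np

# Columns: each base recommender's predictions for the same holdout ratings.
P = np.column_stack([
    [3.1, 4.2, 2.0, 4.8],   # e.g., an item-based CF predictor
    [3.5, 3.9, 2.4, 4.5],   # e.g., a matrix factorization predictor
])
y = np.array([3.0, 4.0, 2.5, 5.0])         # true holdout ratings

w, *_ = np.linalg.lstsq(P, y, rcond=None)  # regression coefficients = blend weights
blended = P @ w                            # final weighted-hybrid predictions
print(w, blended)
```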

[Figure 2.5: Basic types of hybridization designs: (a) parallelized hybridization, (b) monolithic hybridization, and (c) pipelined hybridization. Modified from Jannach et al. (2011, p. 129)]

Finally, switching hybrids utilize the predictions of a single recommender that is selected from a set of available recommenders through a decision rule that determines which recommender should be used in each specific situation. For example, the DailyLearner news recommender uses a CB/CF hybrid in which the CB approach is utilized first; if the CB approach fails to produce recommendations with a sufficient confidence level, CF methods are employed. Similarly, NewsDude employs cross-genre recommendations that are generated by CF methods if its CB processes fail to identify closely related articles (Billsus and Pazzani 2000). In Tran and Cohen’s (2000) switching hybrid, users are asked to indicate their agreement with the predictions of the individual recommendation methods; the method that achieves the best score in a user’s survey is then selected for the generation of future recommendations for that user.

Marx, Hennig-Thurau, and Marchand (2010)26 proposed a switching hybrid that selects an individual recommendation approach based on its prediction accuracy for a given user.
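A per-user, accuracy-based switching rule of this kind might look as follows (a minimal sketch; the component names, the MAE figures, and the constant predictors are our own placeholders, not the cited method itself):

```python
def switch(per_user_mae, components, user, item):
    """Return the prediction of the component with the lowest holdout MAE
    for this user (the per-user decision rule of a switching hybrid)."""
    best = min(per_user_mae[user], key=per_user_mae[user].get)
    return best, components[best](user, item)

# Hypothetical holdout errors and (constant) component predictors:
per_user_mae = {"u1": {"CB": 0.62, "item-CF": 0.81}}
components = {
    "CB":      lambda u, i: 4.1,
    "item-CF": lambda u, i: 3.6,
}
print(switch(per_user_mae, components, "u1", "some-movie"))  # ('CB', 4.1)
```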

2.1.4.2 Monolithic Hybridization Design

In contrast to the parallelized hybridization design, the monolithic hybridization design does not combine the results of different recommenders to generate its final predictions but instead integrates multiple approaches by preprocessing and combining several knowledge sources (Jannach et al. 2011). In other words, each component of a monolithic hybrid generates information that is subsequently consolidated in a form that can be processed by a single recommendation method. Two strategies can be followed for this type of input enhancement: feature augmentation and feature combination (Burke 2002; Jannach et al. 2011). The feature augmentation strategy augments the user-item rating matrix with artificial user rating vectors in order to increase the overlap between user profiles within the CF recommender that generates the final recommendations (e.g., Good, Schafer, and Konstan 1999; Melville, Mooney, and Nagarajan 2002). These augmented vectors are produced by content-analysis agents, which are commonly known as “filterbots”. As a result of this augmentation, users whose rating profiles agree with the profiles of the filterbots may receive better recommendations. Instead of extending the user-item rating matrix with artificial profiles, the feature combination strategy extends the user or item rating profiles by incorporating additional features into them. For instance, Basu, Hirsh, and Cohen (1998) report on experiments with a feature combination hybrid that combines collaborative features, such as user likes and dislikes, with the content features of catalog items.

26 This study by Marx, Hennig-Thurau, and Marchand (2010) can be regarded as a report that lays the groundwork for the current thesis. However, this thesis not only substantially extends the previously published study but also incorporates considerable modifications of the method proposed by these authors; moreover, compared with the published study, this thesis provides a more in-depth and comprehensive theoretical foundation and a more versatile empirical study.

Koren’s approach to multifaceted CF (Koren 2008) first enriches the MF model with information about item neighborhoods and then factorizes the user-item rating matrix based on this extended model.
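To illustrate the feature-augmentation strategy described above, the following sketch appends the dense rating vector of a toy “filterbot” to a sparse user-item matrix before the collaborative step runs (the bot’s simple genre rule and the data are our own invention):

```python
import numpy as np

# Rows: real users; columns: items; NaN = missing rating.
R = np.array([[5.0, np.nan, 1.0],
              [np.nan, 4.0, np.nan]])

genres = ["drama", "comedy", "drama"]   # content metadata, one genre per item

def drama_bot(item_genres):
    """A "filterbot": rates every item purely from its content."""
    return np.array([5.0 if g == "drama" else 1.0 for g in item_genres])

# Append the bot's dense vector as an artificial user profile; real users
# whose tastes agree with the bot gain overlap and thus better neighborhoods.
R_augmented = np.vstack([R, drama_bot(genres)])
print(R_augmented)
```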

2.1.4.3 Pipelined Hybridization Design

Pipelined hybridization designs involve a staged process in which several techniques sequentially build on each other before the final technique in the sequence produces the recommendations. Two strategies can be followed in these designs: cascade and meta-level hybridization (Burke 2002; Jannach et al. 2011). In the cascade strategy, the preceding recommender produces a coarse ranking of items that is subsequently refined by the next recommender in the sequence (Burke 2002). Because the recommendation list of each successive recommender is restricted to the items that were recommended by its predecessor, successive recommenders can only change the ordering of the recommended items or exclude certain items from the list. This technique is especially useful for breaking ties within recommendation lists, i.e., for providing a better rank order of items (Jannach et al. 2011). For instance, collaborative techniques that rely on implicit feedback, such as purchasing histories, instead of user ratings are prone to producing ties because of the low differentiation ability of binary representations of buying acts. A content-based recommender that matches the cross-genre suggestions of a preceding recommender to user interest profiles could then be used to break these ties and to rank-order the remaining items in the list of recommendations. In the meta-level hybridization strategy, a preceding recommender builds a model that is used as the input for the principal recommender. This approach deviates significantly from that of monolithic designs: monolithic hybrids use the models of each recommender component to enrich the input data for a subsequent algorithm, whereas in meta-level hybrids, the entire model serves as the input (Burke 2002; Jannach et al. 2011). For instance, Balabanovic and Shoham (1997) and Pazzani (1999) suggest a technique of “collaboration via content”. This technique extracts, for each user, content-based vectors of terms and weights that describe the user’s areas of interest.

Instead of user-item ratings, these vectors, which are essentially models, are then compared across users through the application of CF approaches that identify the similarities between pairs of users. The rating predictions are then produced by a CF cascade that is applied to the ratings of the users who have been identified as similar. Another example of a meta-level hybrid is the method devised by Soboroff and Nicholas (1999), which uses latent semantic indexing to reduce the dimensionality of CB user profiles that are initially represented by term vectors; the collaborative technique is then applied to these “reduced” user vectors. All of the hybrid approaches that have been discussed in the extant literature have demonstrated better prediction accuracy and/or better overall performance than the individual recommendation methods. This fact makes the general idea of hybridization attractive for the development of our own recommendation method, which is presented in the following chapter. In other words, the effectiveness of the aforementioned hybrid approaches suggests that we should consider the possibility of increasing the accuracy of our proposed recommendation method by hybridizing it with another technique that allows us to appropriately address any issues that we may encounter in our initial RS approach. However, because our objectives include several concurrent aims (namely, the provision of accurately predicted recommendations and of explanations for these recommendations), the choices of hybridization design, strategy, and technique are constrained by the abilities of extant recommendation approaches to provide effective explanations. Thus, before we can substantiate our choice of approaches, the topic of explanations in recommender systems must be discussed. The next section of this thesis is therefore dedicated to this topic.

2.2 Explanations in Recommender Systems

A cover article of the Wall Street Journal from 2002, titled “If TiVo Thinks You Are Gay, Here’s How to Set It Straight”, described users’ frustration with the irrelevant choices made by their “TiVo”, a digital video recorder that records programs that it assumes its owner will enjoy, based on the shows s/he chose to record in the past.

For instance, Mr. Iwanyk suspected that his TiVo thought that he was gay because it inexplicably continued to record programs with gay themes. Another case described in the article concerns Jeff Bezos, the founder of Amazon.com: “For a live demonstration before an audience of 500 people, Mr. Bezos once logged onto [Amazon.com] to show how it caters to his interests. The top recommendation it gave him? The DVD for ‘Slave Girls From Beyond Infinity’. That popped up because he had previously ordered ‘Barbarella’, starring Jane Fonda, a spokesman explains” (Zaslow 2006). Whereas Mr. Bezos could salvage his situation because a reasonable justification for the risqué recommendation was available, Mr. Iwanyk had no explanation for his TiVo’s behavior and therefore had to devise his own remedy for the incorrect recommendations that he was receiving. These examples illustrate the need to integrate explanation facilities into recommender systems. In the subsequent sections of this chapter, more detailed evidence for providing explanations for recommendations is presented, and the foundations of the criteria for determining how these explanations should be created are elucidated in the context of the objectives of this thesis.

2.2.1 The Relevance and Advantages of Explanation Facilities

The idea of providing explanations to the users of intelligent systems is not new. The provision of explanations has been extensively examined in the research on expert systems (e.g., Buchanan and Shortliffe 1984; Horvitz, Breese, and Henrion 1988; Andersen, Olsen, and Jensen 1990; Johnson and Johnson 1993; Miller and Larson 1992; Sørmo, Cassens, and Aamodt 2005). Consider, for example, MYCIN27, the most frequently referenced expert system. MYCIN was designed by Shortliffe and Buchanan (1975) to assist physicians in the prescription of antibiotics, and it incorporated an explanation facility as an important component of its function.

27 The name MYCIN is not an acronym; instead, it was derived from “-mycin”, the typical suffix of the antibiotics that the expert system was intended to prescribe.

MYCIN possessed a knowledge base of approximately 600 rules and would ask a physician a series of simple yes/no questions to identify the bacteria that were causing a patient’s infection. At the conclusion of the query process, the expert system provided a list of bacteria that could be responsible for the infection, ranked from high to low according to the probability of each diagnosis; MYCIN also recommended a particular course of drug treatment and provided the reasoning behind its recommendations, i.e., the list of questions and rules that led to a particular diagnosis and to the ranking of the bacteria. Despite MYCIN’s success as an expert system, its developers claimed that its power was less associated with the details of its underlying numeric model than with its knowledge representation and reasoning scheme, i.e., the explanations that allowed physicians to verify why a conclusion was reached and how much was known about a certain concept. The developers concluded that expert systems that act as decision guides must provide explanations for their advice (Buchanan and Shortliffe 1984). Since the initial implementation of MYCIN, the need to explain the reasoning behind the recommendations produced by expert systems has been widely recognized. It has been noted that explanation facilities are required for expert systems to be considered useful and acceptable because they remove the “black box” from around the recommendation process, thus increasing users’ confidence in the recommendations that they receive by providing these users with transparency, i.e., an understanding of the recommendation model that is used and the ability to reassess the recommended actions (Moore and Swartout 1988; Horvitz, Breese, and Henrion 1988; Majchrzak and Gasser 1991; Miller and Larson 1992; Johnson and Johnson 1993; Brézillon and Pomerol 1996; Doyle, Tsymbal, and Cunningham 2003; Lacave and Díez 2004). Because RSs and expert systems share common roots and strive for similar goals, namely, the provision of recommendations that help users make their choices more efficiently, RSs can be regarded as the successors of expert systems. Thus, the arguments that support the provision of explanations for recommendations in the domain of expert systems remain valid in the domain of RSs (Herlocker, Konstan, and Riedl 2000; see also Tintarev and Masthoff 2008; Cramer et al. 2008). In other words, explanations play a crucial role in both expert systems and RSs. Explanations increase the transparency of the recommendation process and provide users with an instrument for handling the errors that come along with recommendations (Herlocker, Konstan, and Riedl 2000). Furthermore, “good explanations [can] help inspire user trust and loyalty, increase satisfaction, make it easier for users to find what they want, and to persuade them to try or purchase a recommended item” (Tintarev 2007, p. 203).

Thus, the importance of an explanation facility in an RS can hardly be overestimated. First, it is natural for humans to ask for reasoning when handling recommendations. “Consider how we […] handle suggestions as they are given to us by other humans. We recognize that other humans are imperfect recommenders. In the process of deciding to accept a recommendation from a friend, we might consider the quality of previous recommendations by the friend or we may compare how that friend’s general interests compare to ours in the domain of the suggestion. However, if there is any doubt, we will ask “why?” and let the friend explain their reasoning behind a suggestion. Then[,] we can analyze the logic of the suggestion and determine for ourselves if the evidence is strong enough” (Herlocker, Konstan, and Riedl 2000, p. 242). Second, the recommendations that are generated by RSs are inherently prone to errors. In essence, automated recommender systems are stochastic processes that create their recommendations based on heuristic approximations of human decision processes that are implemented through numeric algorithms, and their computations are accomplished using extremely sparse and incomplete data. These two conditions result in recommendations that are often correct and reliable but that may occasionally be extremely inaccurate; in other words, the suggestions that are generated by an RS are subject to errors. These errors can be caused, for instance, either by a misspecification of the employed user model or by inadequate data (see Appendix A of this thesis for additional details).28 The chance of receiving an erroneous recommendation impairs users’ acceptance of and trust in an RS. Explanations of the reasoning underlying the recommendations of an RS provide users with indications of when to trust a recommendation and when to doubt one. By helping RS users detect or estimate the likelihood of recommendation errors, explanations mitigate the loss of acceptance and trust that is caused by erroneous recommendations and may even cause user levels of acceptance and trust to increase (Herlocker, Konstan, and Riedl 2000).

28 Because the exact mechanism by which prediction errors may occur is secondary to this section’s topic, an in-depth discussion of prediction errors would divert our narrative from the main focus of the current section. Thus, we refer the interested reader to Appendix A for further details about the sources of error in recommender systems.

The effects of transparency on acceptance and trust have been extensively explored in the context of expert systems but not in the context of RSs. To the best of our knowledge, only two studies have examined these effects for RSs (Cramer et al. 2008; Herlocker, Konstan, and Riedl 2000).29 Cramer and colleagues conducted a between-subjects experiment that involved three groups of users. In the first group, explanations of why a recommendation had been made were presented along with each recommendation; in the second group, the system additionally reported how certain it was that a recommendation would be of interest to the active user; and the third group served as the control group, i.e., these users saw recommendations without any additional information that increased the transparency of the recommendation process. Unfortunately, this study is limited to the domain of artwork and utilizes a rather small sample of 60 persons; thus, its findings can hardly be regarded as generalizable. Nevertheless, the study by Cramer et al. provided initial support for the aforementioned notion that the arguments that justify the provision of explanations with recommendations from expert systems are also applicable to recommendations from RSs. In particular, the findings of these researchers confirmed that an explanation of why a recommendation was made (i.e., recommendation transparency) significantly increases user acceptance of these recommendations. In this study, trust in RSs was not directly influenced by transparency; however, the results showed that the RSs that provided explanations of the reasoning underlying the generated recommendations were perceived by users to be more understandable than the RS that provided no explanations. Perceived understanding was correlated with the perceived competence of an RS, trust in the RS, and user acceptance of an RS. This result indicates either that the effects of transparency on trust in an RS and on the perceived competence of an RS did not become apparent in this study due to the small sample size or that these effects are mediated or moderated by the users’ perceived understanding of the explanations that they received (a possibility that was not tested by Cramer et al.). Both of these possible outcomes underline the importance to RSs of the transparency that explanations provide; in particular, such explanations appear especially crucial for building or maintaining user trust in an RS.

29 Studies also exist that have addressed the acceptance of explanations by users (Bilgic and Mooney 2005; Symeonidis, Nanopoulos, and Manolopoulos 2008). However, these studies examine changes in users’ acceptance of different explanation styles and assume an a priori positive effect of explanations (of various styles) on the acceptance of RSs. Thus, we do not consider these explanation styles in the current discussion but instead address them in the next section of the thesis.

A study by Herlocker, Konstan, and Riedl (2000) examined different variants of explanation interfaces in the domain of “MovieLens”30, a CF movie recommender system. In this study, twenty-one variants of explanation presentations were compared with a base case in which no explanations were provided. The results indicated that the integration of an explanation facility into the provision of recommendations can often significantly increase users’ acceptance of these recommendations. Although the RS literature includes a variety of other academic studies that provide evidence for certain reasons for providing explanations and substantiate the benefits of these explanations for users and RS providers, each of these publications addresses only a few selected reasons, and none of them claims to provide a systematic overview of the reasons for and benefits of explanations.31 To the best of our knowledge, only two groups of authors have recently attempted to develop a systematic classification of the reasons for providing explanations. However, these sets of authors derived their classifications from different perspectives; thus, the developed taxonomies are neither exclusive nor complete. In particular, Herlocker, Konstan, and Riedl (2000) considered the benefits of explanations from the user’s perspective and restricted their assessment to the case of automated collaborative filtering systems, whereas Tintarev and Masthoff developed their explanation taxonomy from the provider’s perspective, with an emphasis on the objectives of providing explanations in different types of RSs (Tintarev 2007; Tintarev and Masthoff 2007, 2007a, 2007b, 2011). Furthermore, in the classification proposed by Herlocker, Konstan, and Riedl, all of the benefits to users are consequences of the transparency that explanations provide, whereas in the classification suggested by Tintarev and Masthoff, transparency is only one goal among a collection of coequal objectives. Table 2.7 summarizes the reasons for and benefits of the provision of explanations according to the Herlocker, Konstan, and Riedl classification and the Tintarev and Masthoff classification; these reasons and benefits are supplemented with arguments from Chen (2009) that do not fall into either of the aforementioned classifications.32

30 http://www.movielens.com
31 A comprehensive survey of these studies is provided by Tintarev and Masthoff (2011).

In his study, Chen demonstrated that explanations that emphasize product features that either conflict with user needs or involve incomplete user preferences can potentially increase user choice efficiency by helping users address their contextual needs, examine their unconscious choice criteria, and solve their preference conflicts. We do not claim that the classifications in Table 2.7 are complete; rather, we use them to expand our understanding of the topic and to emphasize the need for explanations in RSs, a need that requires recommendation algorithms that allow for the generation of comprehensive explanations. However, it must be noted that the distinct reasons and benefits identified in Table 2.7 are not mutually independent and may therefore interact. For example, the provision of explanations for the purpose of justification may contribute not only to uncovering hidden preferences but also to increasing decision efficiency, decision effectiveness, satisfaction, and trust (Herlocker, Konstan, and Riedl 2000; Tintarev and Masthoff 2007). Because of the advantages and benefits discussed above and the positive interactions among them, it appears sensible to equip RSs with explanation facilities. Explanations help users judge the suitability of recommendations more accurately, handle erroneous recommendations more productively, and achieve greater choice efficiency. Moreover, explanations provide a series of other benefits that potentially increase trust in, loyalty to, and the credibility of an RS; these effects are desired by all commercial RS applications. Thus, the ability to generate explanations should be integrated into the recommendation method that we seek to develop in the current thesis.

32 Other authors have also elaborated on the reasons for the provision of explanations. However, as mentioned above, their arguments are rather fragmented; moreover, these arguments are either complementary to the points in Table 2.7 (e.g., Sinha and Swearingen 2002; O’Donovan and Smyth 2005; Cramer et al. 2008; Symeonidis, Nanopoulos, and Manolopoulos 2008; Jannach et al. 2011) or have served as a basis for the aforementioned publications. For the sake of brevity, we do not discuss the latter works here but instead refer interested readers to Herlocker, Konstan, and Riedl (2000) and Tintarev and Masthoff (2007, 2007a, 2007b, 2011) for further information.

Table 2.7: Reasons and benefits for the provision of explanations

Herlocker, Konstan, and Riedl (2000):
- Justification / Validation: Increase users’ understanding of the reasoning behind a recommendation, enabling users to decide how much confidence to place in that recommendation
- User Involvement: Allow users to add their own knowledge and inference skills to the complete decision process
- Education: Help users understand the strengths and limitations of an RS and better comprehend a product domain
- Acceptance: Promote the greater acceptance of an RS by ensuring that the system’s strengths and limits are fully visible and that its suggestions are justified

Tintarev (2007), Tintarev and Masthoff (2007, 2011):
- Transparency: Explain how the system works and why one item is preferred relative to another
- Scrutability: Allow users to tell the system that it is wrong and justify requirements for additional information from users
- Trust and Credibility: Increase users’ confidence in the system, thereby reducing the complexity of decision making in uncertain situations
- Effectiveness: Help users make better decisions
- Efficiency: Help users make decisions more quickly and with lower levels of cognitive effort
- Persuasiveness: Change users’ buying behaviors by convincing them to try or buy a product or service
- Satisfaction: Improve the ease of use, enjoyment, and customer return rates

Chen (2009):
- Address contextual needs: Help users determine whether a recommendation is suitable for their particular context or situation
- Uncover hidden criteria: Help users uncover important choice criteria that they did not previously regard as relevant
- Solve preference conflicts: Provide additional preference-relevant information that causes the optimal option to become more evident

However, to conceptualize a framework for the development of this method, the question of how the explanations should be formed must be answered; in other words, the appropriate explanation style for our recommendation algorithm should be identified. The next section of this thesis examines this question.

2.2.2 Explanation Styles

The capability to provide personalized explanations varies across recommendation approaches; in particular, this capability is rather limited in collaborative filtering approaches, whereas content-based approaches may allow for the provision of highly informative explanations (Tintarev and Masthoff 2007, 2011; Jannach et al. 2011, p. 165). Collaborative filtering approaches produce recommendations based solely on holistic preference data, such as item ratings or purchasing acts. As a consequence, the explanation ability of these approaches is rather limited. In particular, CF approaches allow for only two types of rather generalized statements: (i) “customers who bought item X also bought items Y, Z, …” and (ii) “item Y is recommended to you because you rated item X” (Symeonidis, Nanopoulos, and Manolopoulos 2008).33 The first type of explanation statement mimics the human word-of-mouth recommendation process (Jannach et al. 2011). This statement connects the user to whom the recommendations are presented, i.e., the active user, to other users who have rated the recommended item. Because the underlying process produces recommendations on the basis of user profile similarities, i.e., it considers only users who have revealed preferences similar to those of the active user, this type of explanation is referred to as the “nearest neighbor” style of explanation. By contrast, the second type of statement connects the recommended item to items that the active user has bought or rated in the past. During this process, the system identifies item X as the item that was most influential with respect to the recommendation of Y.

33 In the context of movie recommendations, these statements can be correspondingly paraphrased as “individuals who liked movie X also like movie Y” and “you will like movie Y because you liked movie X”.

In the existing literature, this type of explanation is therefore known as the “influence” style of explanation (Tintarev and Masthoff 2007, 2011; Symeonidis, Nanopoulos, and Manolopoulos 2008). In contrast to CF systems, content-based (CB) filtering systems utilize attribute-level preferences for the generation of recommendations.34 Thus, CB systems are able to explain their recommendations at a finer level of resolution, one that allows the item attributes that are relevant to the formation of user preferences and choices to be individually addressed. Because these attributes are typically extracted from the content of the recommended items, such explanations are regarded as representative of the “content-based” (Symeonidis, Nanopoulos, and Manolopoulos 2008; Jannach et al. 2011; Tintarev and Masthoff 2011) or “keyword” (Bilgic and Mooney 2005; Tintarev 2007; Tintarev and Masthoff 2011) style of explanation.35 An example of this type of explanation is “[t]his story received a high relevance score, because it contains the words f1, f2 and f3”36 (Billsus and Pazzani 1999). At the present time, only three studies involving real users have evaluated explanation styles for RSs. In the context of the goals of this thesis, the results of these studies can be summarized as follows. As mentioned in the previous section, Herlocker, Konstan, and Riedl (2000) examined twenty-one variants of explanation presentations and demonstrated that an explanation facility can increase the acceptability of recommendations to users. However, this study also determined that this acceptability can decrease if the information provided in explanations exceeds the cognitive skills of the users who receive them; in other words, explanations are less acceptable if they cannot be easily understood. In particular, in situations in which explanations presented additional information, such as complex graphs, the percentage of agreement among a user’s closest neighbors, or the number of neighbors together with either the standard deviations or the average pairwise correlations between these neighbors and the active user, the acceptance of recommendations decreased below that of the baseline situation.

34 For a detailed description of CB approaches, see Section 2.1.2 of this thesis.
35 Because the terms “content-based style” and “keyword style” are largely used synonymously, to avoid ambiguity, we utilize the term “keyword style” throughout the remainder of the manuscript.
36 In the domain of movie recommendations, this example of a content-based explanation could be altered to “we recommend that you watch this movie because Bruce Willis received an Oscar for his acting in this film”.

In other words, although these technical details undoubtedly increase the transparency of an RS, users may not consider them relevant for making decisions. Thus, increased RS transparency will only be beneficial if users are able to understand the details that an RS provides about the way it generates recommendations. This observation is consistent with the conclusion of Aksoy and colleagues, who determined that RSs should “think like the people they are attempting to help” (Aksoy et al. 2006, p. 310); this conclusion arguably maintains its validity with respect to explanations. In other words, RSs should not only reason similarly to the users that they support but also explain any provided recommendations in the terms that these users themselves apply during their decision-making processes. Bilgic and Mooney (2005) criticized Herlocker and colleagues for their narrow concentration on the acceptance of explanations and for their inability to demonstrate that any of the examined explanation variants actually increased users’ satisfaction with the items that they eventually chose. Instead, Bilgic and Mooney argued that “the goal of a good explanation should not be to “sell” the user on a recommendation, but rather, to enable the user to make a more accurate judgment of the true quality of an item” (Bilgic and Mooney 2005, p. 6). Therefore, these authors conducted a user study in which they evaluated different explanation approaches in terms of how well these approaches allowed users to accurately predict their true opinions of an item. The results of this study indicated that users who were presented with explanations in the nearest neighbor style tended to overestimate the quality of the recommended items. Bilgic and Mooney claimed that this overestimation produces mistrust and could cause users to stop using an RS. Explanations provided in either the keyword style or the influence style were found to be significantly more effective than explanations in the nearest neighbor style with respect to enabling accurate assessments; in this study, the keyword style of explanation was superior to the influence style, although not significantly so. Symeonidis, Nanopoulos, and Manolopoulos (2008) conducted a survey to measure user satisfaction with three styles of explanation. Given the results of Bilgic and Mooney, these researchers omitted the nearest neighbor style of explanation from their study and instead introduced a new type of explanation that combined the keyword and influence styles. This new explanation was provided in the following form: “Item X is recommended, because it contains features a, b, …, which are included in items Z, W, … that you have already rated”.37

These researchers used a between-subjects experimental design in which study participants received movie recommendations accompanied by justifications in one of the three tested explanation styles. The participants were then asked to rate each explanation style separately and to explicitly express their actual preference among the three styles. The study results revealed that the combined explanation style dominated both the keyword and the influence explanation styles at a high level of significance. In this study, however, the influence explanation style performed better than the keyword explanation style. The authors did not discuss the significance of the latter outcome; this lack of analysis might indicate that, similarly to the results of Bilgic and Mooney, the difference between these two explanation styles was not significant. However, Symeonidis, Nanopoulos, and Manolopoulos did argue that relative to the influence style, the keyword explanation style provides the advantages of convenience and effectiveness and requires users to apply lower levels of inference skill. To further illustrate the advantages of the keyword explanation style, consider two examples of explanations that might be provided by a movie recommender. A keyword-style explanation could be expressed as follows: “Million Dollar Baby (2004) is recommended because it is a Drama that is directed by Clint Eastwood and stars Morgan Freeman; these features are included in the movies that you have rated highly.” By contrast, the following statement is an example of an influence-style explanation: “Million Dollar Baby (2004) is recommended because you gave high ratings to Unforgiven (1992), Se7en (1995) and Gran Torino (2008)”. The latter explanation style burdens the user with connecting the referenced movies and recognizing the commonalities among them, such as the facts that all of these movies are dramas, that two of them were directed by Clint Eastwood, and that two of them star Morgan Freeman. For a frequent consumer of movies, these commonalities may be easy to deduce; for less experienced movie consumers, however, the effort required to identify these common traits can be rather discouraging. It can be argued that the explicit specification of these common features simplifies the inference process for both types of users.

37 The following statement provides a concrete example of the wording that was employed in this study: “Recommended movie title: Indiana Jones and the last crusade (1989). The reason for recommendation is the participant Harrison Ford, who appears in 5 movies you have rated.”
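Mechanically, the styles compared above differ only in the inputs fed to a short textual template; the sketch below (our own wording and toy data, not the exact phrasing of any of the cited systems) renders the keyword, influence, and combined styles for the running movie example:

```python
def keyword_style(item, features):
    return (f"{item} is recommended because it features "
            f"{', '.join(features)}, which appear in movies you have rated highly.")

def influence_style(item, influencers):
    return (f"{item} is recommended because you gave high ratings to "
            f"{', '.join(influencers)}.")

def combined_style(item, features, influencers):
    return (f"{item} is recommended because it contains the features "
            f"{', '.join(features)}, which are included in "
            f"{', '.join(influencers)}, movies that you have already rated.")

movie = "Million Dollar Baby (2004)"
feats = ["Drama", "Clint Eastwood", "Morgan Freeman"]
seen  = ["Unforgiven (1992)", "Se7en (1995)", "Gran Torino (2008)"]

print(keyword_style(movie, feats))
print(influence_style(movie, seen))
print(combined_style(movie, feats, seen))
```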

From the aforementioned observations, it follows that explanations are able to increase the acceptance of an RS and the satisfaction of its users; moreover, explanations can help RS users make better choices. Relative to the nearest neighbor explanation style, the keyword and influence explanation styles both produce greater user satisfaction and an improved capability to accurately judge the true quality of the recommended items. Although the combination of the keyword and influence explanation styles produces the greatest overall satisfaction with recommendations, the keyword aspect of explanations appears to be the most important trait with respect to users’ ability to efficiently judge recommendation quality. However, certain explanations are beneficial neither to a user nor to an RS. As described above, in contrast to the influence and keyword explanation styles, the nearest neighbor style potentially causes users to overestimate the quality of recommended items and can therefore produce suboptimal choices, which decreases users’ trust in and loyalty to an RS. Furthermore, the ability of an RS to provide explanations in a particular style is tightly coupled to the recommendation technique that it employs. In particular, user-based approaches are capable of utilizing the nearest neighbor explanation style, item-based approaches allow for the influence style of explanation, and CB approaches can generate explanations in the keyword style. From the above discussion, we can rank the different recommendation techniques with respect to their potential for providing explanations that are helpful for both users and RSs; this ranking is depicted in Table 2.8.

Table 2.8: The capacities of different recommendation methods to provide effective explanations

Rank   Recommendation method   Associated explanation style
1      CB + item-based CF      keyword + influence
2/3+   CB                      keyword
2/3    Item-based CF           influence
4      User-based CF           nearest neighbor

Thus, a combination of CB and item-based methods allows for the best possible explanations, whereas each individual component of this combination offers the next-highest level of explanation capability. This tie reflects the fact that prior studies have not revealed a significant difference between the effects of the keyword and influence explanation styles.

However, we argued above that, from a qualitative perspective, the keyword style appears to dominate the influence style; we indicate this conjectured dominance in Table 2.8 by the plus sign next to the rank of the CB approach. Finally, the lowest rank in this table is assigned to the user-based CF recommendation method, which demonstrates poor explanation performance and a propensity to generate negative effects for both users and RSs. MF techniques are absent from the table because they permit none of the explanation styles discussed above; instead, these techniques base their recommendations on uninterpretable factor solutions (see Section 2.1.1.3). The intuitive conclusion from the above findings is that, to increase user satisfaction, the algorithm that we develop should implement a hybrid of CB and item-based CF approaches. However, certain hybridization designs are unsuitable for the generation of explainable recommendations. In fact, the majority of hybrid designs involve a series of problems that can potentially decrease the quality of the explanations that they provide. First, the hybridization process can limit the ability of a hybrid approach to provide explanations. Second, hybrid processes may provide explanations that are not aligned with the user’s preferences. These issues are discussed in the subsequent sections of this thesis.

2.2.3 Explanations in Hybrid Approaches

The capacity of hybrid approaches to provide explanations of recommendations depends on the specific approaches that are incorporated into each hybrid and varies with the degree to which these individual approaches are interwoven with each other. In the previous section, different explanation styles and their associations with different recommendation approaches were presented. Hybrid approaches are able to utilize the explanation styles that are available to the individual recommendation methods that they employ. However, the properties of these explanation styles, e.g., transparency or effectiveness, remain valid only if the final recommendation is produced solely by one of the constituent methods of the hybrid. These properties do not remain valid if the predictions of the constituent methods are combined; by contrast, their validity is maintained if the rating of the single best-performing constituent method is utilized directly.

68

2. Background and Related Work

by contrast, this validity is maintained if the rating of the single best-performing constituent method is utilized directly. However, for situations in which a recommendation is produced as a mixed result of multiple methods, an explanation for why a particular item was recommended can nevertheless be generated. One possible method for generating this explanation is by adapting the explanation that would be applicable if an individual method had generated a particular item recommendation. For instance, if a hybrid combined a user-based CF with a CB technique in a pipelined manner (see Section 2.1.4.3) or if a CB approach was used for feature augmentation for CF processes within a monolithic hybrid (see Section 2.1.4.2), an explanation could be formed as a mix of the nearest neighbor style (“…because other users also liked”) and the keyword style (“…because it contains features X, Y, Z”). However, this method is not applicable to scenarios in which various hybridized recommendation techniques are more tightly integrated with each other. In these situations, the explanations from the individual constituent methods of a hybrid are not applicable. For instance, consider the case of a monolithic hybrid process (see Section 2.1.4.2). A possible method to generate an explanation for this scenario is to post-process the recommendation results with a CB technique. A concrete implementation of this type of approach is described in Symeonidis, Napopoulos, and Manopoulos (2008, 2009). In these papers, the recommendations are produced through the use of an item-based CF approach that is applied to previously formed biclusters of users and items. However, these explanations are generated in a manner that is consistent with a content-based approach. To accomplish this feat, the authors examine the correlations between the item feature profiles and user ratings and identify the features that are associated with the movies that a user most enjoyed. These features would then be emphasized in the explanations of recommendations for items that include these features. For instance, a CF algorithm may have predicted that a user would like “Gran Torino” and may therefore include this movie in its list of recommendations. To construct an explanation for this recommendation, the CB portion of the hybrid would be utilized to produce the list of properties of “Gran Torino” that are present in other movies that the active user has previously enjoyed. For a particular active user, the CB approach may observe that the features “Clint Eastwood”, “Tom Cruise”, “Angelina Jolie”, “Action”, “Comedy”, and “Drama” have high probabilities of being contained in the movies that the user appreciates the

2. Background and Related Work

69

most. Because “Clint Eastwood” and “Drama” apply to “Gran Torino”, the explanation will be “you will like Gran Torino because it is a Drama that stars Clint Eastwood”. This post-processing method allows for the generation of keyword-style explanations for virtually all recommendation approaches. Although previous research has demonstrated that the keyword style is the most effective single explanation style among the styles that have been examined (see Section 2.2.2), this post-processing utilization of the keyword explanation style incorporates the serious drawback that it is not reflective of the ways in which recommendations are actually produced. Thus, these post-processed explanations fail to achieve the goal of increased transparency for the recommendation system; this failure may negatively impact users’ acceptance of, trust in, and loyalty to an RS. More importantly, in this scenario, the recommendation process is not generally aligned with the preferences of the active user. Thus, the keyword explanation style’s capacity to increase the effectiveness of a user’s choices cannot be thoroughly realized. Although the provided explanations in the post-processing situation might efficiently highlight reasons why a user may enjoy the recommended item, these explanations are unable to explain why the system believes that the item in question is the best recommendation for the user because the recommendation procedure cannot access a user’s attribute preferences. As discussed later in this thesis, a deviation from the user’s preference function potentially decreases not only the effectiveness of a user’s choices but also a user’s satisfaction with and loyalty to an RS (see Section 2.3.1.2). Before we proceed to the summary of our discussion on the topic of explanations in RSs, we first conclude from the statements above that for our concurrent objectives, i.e., the generation of accurately predicted recommendations that are accompanied by actionable explanations, our choice of hybridization designs and strategies is restricted to schemes that use the predictions of an individual (unhybridized) RS method to generate the final ratings of items. Only these types of designs allow for the generation of explanations that maintain the highest degree of transparency and therefore enable the other advantages of explanations to be fully realized. These acceptable hybridization designs are (i) parallelized switching hybrids, (ii) monolithic feature-augmenting hybrids, and (iii) pipelined cascade hybrids (for additional details, see Section 2.1.4).
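To make the post-processing idea concrete, the following minimal sketch mirrors the “Gran Torino” illustration above. The feature profiles, ratings, and the “liked” threshold are hypothetical stand-ins, not the actual procedure of Symeonidis, Nanopoulos, and Manolopoulos.

```python
# Hypothetical sketch of post-processed keyword-style explanations.
# Data structures and the "liked" threshold are illustrative assumptions.

def favored_features(user_ratings, profiles, like_threshold=4, top_n=6):
    """Count how often each feature appears in the movies the user rated highly."""
    counts = {}
    for movie, rating in user_ratings.items():
        if rating >= like_threshold:
            for feature in profiles.get(movie, set()):
                counts[feature] = counts.get(feature, 0) + 1
    return set(sorted(counts, key=counts.get, reverse=True)[:top_n])

def keyword_explanation(movie, user_ratings, profiles):
    """Intersect the user's favored features with the recommended movie's profile."""
    matches = favored_features(user_ratings, profiles) & profiles.get(movie, set())
    if matches:
        return f"You will like {movie} because it features: " + ", ".join(sorted(matches))
    return f"{movie} matches your overall viewing history."

profiles = {
    "Gran Torino": {"Clint Eastwood", "Drama"},
    "Million Dollar Baby": {"Clint Eastwood", "Drama"},
    "Top Gun": {"Tom Cruise", "Action"},
}
ratings = {"Million Dollar Baby": 5, "Top Gun": 2}

print(keyword_explanation("Gran Torino", ratings, profiles))
# -> You will like Gran Torino because it features: Clint Eastwood, Drama
```

Note that such a sketch reproduces only the surface form of the explanation; as argued above, it cannot reveal why the underlying CF procedure actually ranked the item highly.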


Furthermore, it appears reasonable for our hybrid to contain CB and item-based CF components because these methods are capable of producing the keyword and influence styles of explanations, which are the most effective types among the various explanation styles that have been examined (see Section 2.2.2). Although the concept of producing CF recommendations in the domain of motion pictures is applicable through the approach that was described in Section 2.1.1.2, the design of the CB component for the movie domain requires refinement. Thus, we will focus the remainder of our discussion on the development of the CB component of our recommendation method. The choice of a particular hybridization design will be discussed in Section 3.3, after the development of this CB component has been finished; at that point, we will possess all of the information that is required to substantiate our choice of recommendation approach.

2.2.4 Summary

To summarize the discussion of Section 2.2, we can conclude that it appears sensible to incorporate an explanation facility into an RS because this facility provides a series of benefits for both users and RS providers (see Table 2.7). In addition to increasing the transparency of an RS, an explanation facility also increases users’ acceptance of, trust in, and loyalty to an RS. In particular, explanations of the reasoning underlying recommendations provide users with an instrument to assess recommendation errors and to mitigate the negative effects of these errors. Furthermore, explanations allow users to form their own judgments about recommendations and to evaluate these recommendations more efficiently; these effects increase the quality and effectiveness of user choices. However, for the benefits of explanations to be realized, explanations must be understandable to users. In the context of our thesis objectives, this requirement restricts our choice of hybrid methods to approaches that use the raw predictions of an individual constituent method of a hybrid. In particular, we should avoid approaches that aggregate predictions from different methods to produce final ratings or that feature tightly interwoven RS components, because these structures cause the explanation styles of the individual constituent approaches to become inaccessible and inapplicable, reducing the transparency of a hybrid RS and generating other negative effects. Thus, to allow our hybrid method to provide explanations through the most effective type of explanation, namely, the keyword explanation style, we focus our development on the content-based component of our hybrid. The nuances of hybridizing this CB component with the item-based CF component (which permits the second most effective type of explanations, namely, the influence explanation style) will be clarified immediately after the development of the CB component has been accomplished. The next section of this thesis addresses the questions of how our CB approach should address movie attributes in a manner that aligns its recommendation processes with users’ movie preferences and which attributes this CB approach should examine.

2.3 Movie-Related Preferences and Relevant Movie Characteristics

Although background knowledge about key recommender algorithms and the rationale for the incorporation of an explanation facility into an RS were provided in the previous sections of this document, in the context of the objectives of this thesis, our development of an RS approach should focus on the domain of motion pictures (see Section 1.2); in this domain, item attributes are not readily accessible for automatic algorithmic processing. Thus, to develop a numerical algorithm that implements our approach in the movie domain and to allow the reader to comprehend this development process, a further understanding of the topics that are relevant to the movie domain is required. These topics may essentially be summarized by the questions of (i) how user preferences can be operationalized in the domain of motion pictures and (ii) which attributes of motion pictures are relevant for the formation of preferences in this domain. Therefore, the next subsections of this thesis provide a brief discussion of these questions.


2.3.1 The Operationalization of Preferences

2.3.1.1 The Multiattribute Utility Model and the Weighted Additive Decision Rule

The concept of multiattribute utility (MAU) has an extensive history in the research fields of psychology, decision making, and marketing (e.g., Edwards 1954; Tversky 1967; Fishburn 1967; Green, Wind, and Jain 1972; Luce 1992; Carroll and Green 1995). This concept relies on two fundamental notions: the principle of utility maximization and the decomposition hypothesis. The former idea asserts that individuals make choices according to certain criteria of worth. Thus, each possible choice is associated with a certain quantity of utility (U), and the alternative that is considered best or most preferred by a consumer over other alternatives should possess the greatest utility (Tversky 1967). In other words, if alternative A is preferred over alternative B, then the following inequality should apply:

U(A) > U(B)    (2.22)

The decomposition hypothesis states that the utility of an alternative can be decomposed into the utilities of the basic independent components of this alternative. In other words, this hypothesis suggests that individuals evaluate alternatives by examining a set of the components of these alternatives; these components are referred to as attributes (Tversky 1967). During the course of this evaluation process, they assign partial utilities, which are also known as part-worths, to each of the attributes of an alternative; these part-worths are thought to reflect the quantity of preference that a consumer associates with the levels of the attributes that are possessed by an evaluated alternative (Bettman, Johnson, and Payne 1991).³⁸ In addition, because the relative importance of different attributes may vary depending on the preference formation of a consumer, these part-worths are weighted by the relative importance of each attribute (w_j). Thus, the utility of a multiattribute alternative (U) is equal to the sum of the part-worths (v_{jk}) of its attributes, weighted by the relative importance of each attribute. Formally, this reasoning produces the following equation:

U = Σ_{j=1}^{J} w_j · v_{jk}    (2.23)

where
U = the utility of a multiattribute alternative
w_j = the relative importance of the j-th attribute
v_{jk} = the part-worth of the k-th level of the j-th attribute
j = 1, …, J: the attributes of an alternative
k = 1, …, K_j: the levels of an attribute that are possessed by an alternative

Equation (2.23) specifies an additive composition model of the multiattribute utility and therefore represents an operationalization of preferences (because utility reflects preferences). Thus, the MAU model allows for a set of alternatives (e.g., movies or other products) to be arranged in rank order with respect to a consumer’s preference, assuming that all of the part-worths and all of the corresponding importance weights are either already known or can be elicited in a particular manner, such as through the use of a numeric algorithm. This procedure of rank ordering is comparable to the weighted additive decision rule (WADD) (Bettman, Johnson, and Payne 1991; Corner and Kirkwood 1991; Weiss, Weiss, and Edwards 2009). In accordance with the aforementioned principle of decomposition, WADD suggests a normative procedure of decision making that involves the consideration of all of the relevant information about a problem. In other words, the WADD approach considers the values of each alternative on all of the relevant attributes and all of the relative importance weights of these attributes to the individual (Bettman, Johnson, and Payne 1991).

In the context of the objectives of the current thesis, the MAU model and the WADD approach prescribe the way in which a recommender algorithm should be constructed. In particular, for each user, the algorithm should elicit the user’s preferences for individual attribute levels, i.e., the part-worths of attribute levels, and the importance weights of these attributes, which are relevant for the formation of the user’s preferences. The obtained information can then be aggregated through the WADD approach to calculate the utilities of the examined alternatives. The rank ordering of these alternatives can then directly produce a recommendation of the most preferred alternative (or a set of alternatives with high ranking with respect to user preferences). In the framework of a movie RS, the utility (U) from equation (2.23) can be regarded as a rating (e.g., the number of stars) that a user provides for a particular movie. Higher ratings for a movie from a user correspond to greater levels of user enjoyment for the movie in question. Thus, a movie with the highest rating for a user should possess the highest utility for this user and should therefore be the user’s favorite film among the rated movies. Therefore, the operationalization of a user’s utility in terms of movie ratings allows for the comparison of different movies with respect to user preferences. Moreover, in an attribute composition model of user utility, this operationalization allows for numerical inferences to be drawn regarding the part-worths of each attribute and the contribution of each attribute to the cumulative utility and rank order of arbitrary movies; thus, these inferences permit the generation of movie recommendations.

However, from psychology, it is known that consumers do not exhibit a stable utility function (Jannach et al. 2011, p. 195). Thus, the choices that consumers make under this condition can differ due to variations in the utility functions that these consumers employ at the moment that they make each choice. Therefore, to address this issue, the next sections of this thesis examine the question of how an RS should address the instability of user utility functions (i.e., the changeability of user preferences) to provide recommendations that help users make optimal choices. Through this examination, the suitability of the WADD approach for the provision of recommendations in not only the general case of providing recommendations but also the specific conditions of unstable utility functions is substantiated.

38 To further illustrate the relationships among alternatives, attributes, and attribute levels, consider the example of choosing among different models of cellular phones. Each model represents a (choice) alternative, which may be evaluated on various attributes, such as brand, display size, battery durability, price, and other traits. The levels for the attribute of brand may include Motorola, Samsung, Siemens, HTC, and other phone manufacturers, whereas the levels of the attribute price may include €20, €60, €120, and other price points. A consumer may regard price as more important than brand (in other words, for this consumer, the importance (and corresponding weight) of the price attribute is higher than the importance (and corresponding weight) of the brand attribute) and may therefore prefer cheap phones to more expensive models (because the part-worth of a €20 price is significantly higher than the part-worths of €60 and €120 prices).
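For concreteness, the following sketch evaluates equation (2.23) for a pair of movies and rank-orders them. The weights and part-worths are invented placeholders; in the approach developed later, they would be estimated from a user’s rating history.

```python
# Minimal sketch of the WADD rule from equation (2.23): U = sum_j w_j * v_jk.
# All weights and part-worths below are invented placeholders.

def wadd_utility(levels, weights, part_worths):
    """Sum the part-worths of the attribute levels a movie possesses, each
    weighted by the relative importance of the corresponding attribute."""
    return sum(weights[attr] * part_worths[attr][lvl]
               for attr, lvl in levels.items())

weights = {"genre": 0.7, "star": 0.3}                    # w_j
part_worths = {                                          # v_jk
    "genre": {"Drama": 1.0, "Action": -0.5},
    "star":  {"Clint Eastwood": 0.8, "Tom Cruise": 0.1},
}
movies = {
    "Gran Torino": {"genre": "Drama", "star": "Clint Eastwood"},
    "Knight and Day": {"genre": "Action", "star": "Tom Cruise"},
}

# Rank-order the alternatives by utility; the top-ranked movie is recommended.
ranking = sorted(movies,
                 key=lambda m: wadd_utility(movies[m], weights, part_worths),
                 reverse=True)
print(ranking)  # ['Gran Torino', 'Knight and Day']
```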


2.3.1.2 The Approach for Addressing Unstable Utility Functions

Although the normative procedures of the MAU and WADD approaches are widely utilized in studies that analyze the decision-making behaviors of customers, these procedures have been criticized for their restricted ability to describe how individuals actually make choices (Simon 1982; Edwards 1961; Luce 1992). A series of simplifying heuristics has been suggested to better describe actual choice behavior in various contexts, such as time-pressured decisions, routine choices, decisions involving low-involvement products, and decisions under diverse mood, cognitive effort, consumer environment, and uncertainty conditions (Kahneman and Tversky 1984; Bettman, Johnson, and Payne 1991; Payne, Bettman, and Johnson 1993; Gigerenzer et al. 1999). A well-known example of these heuristics is the lexicographic rule (LEX); in accordance with this rule, consumers first determine their most important attribute, e.g., price, and then choose the alternative with the highest part-worth of this attribute, completely ignoring all other attributes. Another example of a simplifying heuristic is the frequency of good and bad features (FRQ) rule; under this heuristic, consumers choose alternatives simply based on counts of the good or bad features that these alternatives possess. Other examples of simplifying decision rules include the satisficing (SAT) heuristic, the elimination-by-aspects (EBA) approach, the equal weight (EQW) heuristic, the majority of confirming dimensions (MCD) rule, the habitual choice principle, and combined heuristics (for a comprehensive review, see Bettman, Johnson, and Payne 1991). A toy sketch contrasting one such heuristic with WADD is provided at the end of this subsection.

Consequently, with respect to making choices, individuals can rely on different decision strategies; these strategies involve product attribute sets of various sizes. In diverse situations, these attribute sets may be composed in diverse ways and may be evaluated differently. Thus, intuitively, to be appropriately aligned with user preferences, a recommendation algorithm must account for not only the size and composition of the attribute set but also the user’s current decision strategy, for both producing and explaining recommendations. However, for an automated process of recommendation generation that attempts to be non-intrusive for users, i.e., a process that strives to minimize the required level of interaction with users, it can be challenging to account for users’ current preference states.

To resolve this issue, we provide a brief excerpt of a study by Aksoy et al. (2006) that can contribute to our understanding of how unstable utility functions can be addressed within a recommender algorithm in a manner that allows for the provision of effective recommendations and explanations to users. These authors examined the role of similarity between an RS and a consumer with respect to the quality of consumer choices. Two dimensions of similarity were considered. One of these dimensions is the degree to which consumer preferences for different product attributes are incorporated into the process of recommendation generation.³⁹ The other dimension of similarity is the degree to which the RS in question employs decision-making strategies that are similar to those used by consumers. In their experiment, these authors surveyed the degree of similarity that existed between the actual decision strategies of study participants and the weighted additive model (WADD). The authors then separated the study participants into two groups; one group exhibited high similarity to the recommender that was used in the experiment, whereas the other group displayed low similarity to this recommender. Within these groups, the attribute weight similarity was manipulated by adding a random number to each participant’s own attribute weight. This number was low (ranging between -1 and +1) for simulations that involved a high degree of attribute weight similarity or high (ranging between -9 and +9) for simulations that involved a low degree of similarity.

Aksoy et al. hypothesized that the attribute weight similarity and the perceived decision strategy similarity independently influence decision quality. Surprisingly, the results of a preliminary study revealed that the use of an RS that was similar to a study participant with respect to either attribute weights or decision strategy produced the same quality of consumer decisions as the use of an agent that was similar to a study participant with respect to both of these traits. In other words, three out of the four experimental conditions demonstrated high objective choice quality that did not vary significantly across these conditions. In particular, these three conditions were (i) participants who received recommendations from an RS that maintained a high degree of similarity to them only with respect to attribute weights; (ii) participants who received recommendations from an RS that maintained a high degree of similarity to them only with respect to decision strategy; and (iii) participants who received recommendations from an RS that was similar to them with respect to both dimensions. Notably, Aksoy et al. verified this finding in their main study and successfully replicated the aforementioned results. Thus, to produce recommendations that significantly increase decision quality and reduce search effort, an RS only needs to be similar to a user with respect to one of the two similarity dimensions that have been discussed. In addition, Aksoy et al. demonstrated that an RS can produce increased web site loyalty and satisfaction regardless of the dimension in which the RS and its users are similar. By contrast, dissimilarity between an RS and its users with respect to both attribute weights and decision strategy hurts consumer welfare by increasing perceived costs, reducing choice quality, and lowering web site loyalty. In particular, consumers “believe they make better decisions using no [recommendation] agent at all than using a doubly dissimilar agent” (Aksoy et al. 2006, p. 311). Based on these findings, the authors conclude that the similarity between an RS and its consumers is relevant and that recommendation agents “should think like the people they are attempting to help if the goal is to assist consumers in making better choices” (Aksoy et al. 2006, p. 310).

The results of Aksoy and colleagues (2006) indicate the importance of individual attribute preferences for RSs in general and for the process of recommendation generation in particular. However, the finding that it is sufficient for an RS to maintain either attribute weight or decision strategy similarity with its users allows a recommender algorithm to maintain reasonable decision quality for users by concentrating on the former type of similarity. In this thesis, we do not seek to describe actual consumer behavior but instead attempt to provide users with a decision aid that contributes to the achievement of better choices. Therefore, given the aforementioned results, the use of the WADD approach within a recommendation algorithm appears to be a promising method to achieve this objective. Further arguments for basing recommendations on the WADD approach will be discussed in the next section of this document.

39 Recommendation algorithms differ with respect to the extent to which user preferences are incorporated into the recommendation process. Certain recommendation agents, such as mySimon.com, provide randomly ordered alternative lists that do not include any information about consumer preferences. Other recommendation agents, such as Amazon.com, indirectly elicit attribute importance information based on a consumer’s previous choices, which are not necessarily consistent with this consumer’s utility function. Finally, certain recommendation agents, such as activeBuyersGuide.com, directly elicit consumers’ attribute importance weights and explicitly use these weights to rank alternatives (Aksoy et al. 2006; Diehl, Kornish, and Lynch 2003).
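To illustrate how sharply such simplifying heuristics can diverge from WADD, here is a toy comparison of the lexicographic rule with the weighted additive rule; both functions and all numbers are invented for illustration (the cellular-phone example mirrors footnote 38).

```python
# Toy contrast between the lexicographic rule (LEX) and WADD on invented data.

def wadd_choice(alternatives, weights, part_worths):
    """Pick the alternative with the highest weighted sum of part-worths."""
    def utility(levels):
        return sum(weights[a] * part_worths[a][l] for a, l in levels.items())
    return max(alternatives, key=lambda m: utility(alternatives[m]))

def lex_choice(alternatives, weights, part_worths):
    """Pick the alternative that is best on the single most important
    attribute, completely ignoring all other attributes."""
    top_attr = max(weights, key=weights.get)
    return max(alternatives,
               key=lambda m: part_worths[top_attr][alternatives[m][top_attr]])

weights = {"price": 0.6, "brand": 0.4}
part_worths = {
    "price": {"cheap": 1.0, "expensive": 0.0},
    "brand": {"no-name": -0.9, "premium": 0.9},
}
phones = {
    "budget no-name": {"price": "cheap", "brand": "no-name"},
    "premium model":  {"price": "expensive", "brand": "premium"},
}

print(lex_choice(phones, weights, part_worths))   # budget no-name (price only)
print(wadd_choice(phones, weights, part_worths))  # premium model (compensatory)
```

The two rules select different alternatives from identical preference data, which is precisely why an RS that guesses the wrong decision strategy can recommend items a user would not choose.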

2.3.1.3 The Advantages of the WADD Approach for the Production of Explainable Recommendations

As noted in the previous section, the minimal need for similar decision strategies between an RS and its users allows recommendation algorithms to ensure the highest quality of recommendations with respect to decision effectiveness through the use of the weighted additive (WADD) compensatory decision rule. In addition, we argue that the use of WADD provides at least four advantages for an RS.

First, the WADD approach is capable of processing user preferences at the attribute level. Therefore, this decision strategy can easily be implemented within a numerical algorithm that attempts to increase users’ choice efficiency by addressing attribute-related user preferences during the process of recommendation generation and calculation.

Second, it has been found that the WADD approach generates normatively better consumer decisions than heuristic decision procedures (Tversky 1967; von Winterfeldt and Edwards 1986; Payne, Bettman, and Johnson 1988; Aksoy, Cooil, and Lurie 2011). Thus, the WADD model should produce the most effective choices if a consumer’s attribute-related preference weights are known or can be accurately estimated by an RS. The task of producing efficient recommendations can therefore be reduced to the task of eliciting users’ attribute-related preference weights. Both the strategy of employing attribute preferences that are similar to a user’s own preferences and the strategy of utilizing a decision rule that generates optimal consumer decisions can potentially increase the robustness of recommendations; in other words, both approaches may make an RS more tolerant to violations of the premise of attribute preference weight similarity. These violations may be caused by various factors, such as calculation errors.

Third, an RS that seeks to maintain a decision strategy similar to its users’ under conditions that involve unstable utility functions would have to conjecture about the decision strategy that a consumer employs each time the consumer in question requests a recommendation. However, the derivation of a decision rule is likely to be a time-consuming process that frequently leads to cognitive overload for respondents; this issue diminishes or even eliminates the advantages of an RS. Instead, it appears reasonable for an RS to utilize WADD, the decision rule that generally demonstrates the best performance among the known decision-making approaches, and to maintain attribute preference weights that are similar to those of its users. Although consumer preferences may change over time, these preferences are more likely than decision strategies to persist for longer periods of time. Furthermore, changes in preferences can be tracked automatically, without the need to interfere with user interactions with an RS; in fact, the recalculation of attribute preference weights can be triggered automatically after each implicit user input, such as a purchasing act or the rating of an item.

Finally, because the WADD approach is compensatory (i.e., the WADD rule accounts for preference valence such that negative attribute-related preferences can counterbalance positive attribute-related preferences), this approach allows the attributes of recommended items that exhibit negative preferences to be used as negative cues in explanation statements. This feature offers the potential to increase choice efficiency; in fact, several researchers have found that consumers tend to place more weight on negative information than on positive information during the course of evaluating an item (Lutz 1975; Wright 1974; Kanouse and Hanson 1972; Ito, Larsen, and Cacioppo 1998). However, to the best of our knowledge, this property of negative cues has not been considered in the existing works on RSs that have discussed the keyword explanation style; thus, to date, only positive information has been included in analyses of keyword-based explanations. The incorporation of negative cues into the keyword explanation style can therefore allow this style to be extended to a “pros-and-cons” approach that can produce explanations of the following form:

“Titanic (1997) is recommended to you because it highly matches your preferences. Pros: Titanic is a high-budget Hollywood movie that is directed by James Cameron. Cons: You don’t like the movie’s genre of drama or its star, Leonardo DiCaprio. After accounting for these factors, we expect you to rate this movie an 8 out of 10.”

This explanation style maintains the advantages of the keyword explanation style with respect to choice effectiveness and strengthens these advantages by allowing users to consider negative cues. It can be argued that because the item features here are derived directly from the attributes towards which user preferences exist, a “pros-and-cons” explanation involves the terms that users actually employ in their evaluations. Thus, this style is informative, understandable, and actionable for users. A sketch of how such an explanation could be assembled from signed attribute contributions is provided at the end of this subsection.

The discussion above substantiates the suitability of the WADD approach as an underlying model for the generation of recommendations and the provision of explanations for these recommendations. To complete our conception of a movie recommendation algorithm that involves attribute-related preferences, we require a notion of which attributes of motion pictures should be considered by this algorithm; this notion will determine the movie attributes that should be elicited directly from users or derived from user preference data. The next section of this thesis is dedicated to this topic.
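The following sketch shows how a WADD-based recommender could assemble a pros-and-cons explanation from the signs of the weighted part-worth contributions. The attribute data, baseline, and rating scaling are invented for illustration and do not reproduce the exact wording rules of the “Titanic” example above.

```python
# Illustrative sketch: derive "pros" and "cons" from the signs of the
# weighted part-worth contributions of a movie's attributes (invented data).

def pros_and_cons(movie, levels, weights, part_worths, baseline=5.0):
    contributions = {lvl: weights[attr] * part_worths[attr][lvl]
                     for attr, lvl in levels.items()}
    pros = [lvl for lvl, c in contributions.items() if c > 0]
    cons = [lvl for lvl, c in contributions.items() if c < 0]
    predicted = baseline + sum(contributions.values())   # toy rating model
    return (f"{movie} is recommended. Pros: {', '.join(pros) or 'none'}. "
            f"Cons: {', '.join(cons) or 'none'}. "
            f"Expected rating: {predicted:.0f} out of 10.")

weights = {"budget": 0.5, "director": 1.5, "genre": 1.0, "star": 1.0}
part_worths = {
    "budget":   {"high budget": 1.0},
    "director": {"James Cameron": 1.5},
    "genre":    {"Drama": -0.4},
    "star":     {"Leonardo DiCaprio": -0.3},
}
titanic = {"budget": "high budget", "director": "James Cameron",
           "genre": "Drama", "star": "Leonardo DiCaprio"}

print(pros_and_cons("Titanic (1997)", titanic, weights, part_worths))
# -> ... Pros: high budget, James Cameron. Cons: Drama, Leonardo DiCaprio. ...
```

Because the cons are read off directly from negative weighted part-worths, the explanation stays aligned with the very quantities that produced the prediction, which is the transparency property argued for above.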

2.3.2 The Preference-Relevant Attributes of Motion Pictures

The task of identifying the attributes of motion pictures that drive consumer preferences and determine their choices is less trivial than it may appear. The extant research on movie consumption leads to the recognition that addressing comprehensible, preference-relevant movie attributes is complicated by the nature of movies.

Movies are experiential goods. Thus, the main motive for individuals to consume a movie is the receipt of a hedonic value (e.g., pleasure or thrill) from experiencing the film rather than the fulfillment of a utilitarian need (Cooper-Martin 1991, 1992; Holbrook and Hirschman 1982). However, it is much more difficult to understand the nature and outcomes of hedonic motives than of utilitarian motives (Hennig-Thurau, Houston, and Walsh 2007); thus, hedonic motives are hard to formalize. Moreover, the domain of motion pictures is dominated by experience qualities. As a result, consumers can only assess the quality of a movie during the course of watching the film in question (De Vany and Walls 1999). This aspect of movies forces consumers to form their quality judgments based on proxies, which are known as “quasi-search qualities” (that is, movie traits that consumers can comprehend before they actually watch a movie), and on movie-related communications (Hennig-Thurau, Walsh, and Wruck 2001; Hennig-Thurau, Houston, and Walsh 2007).

Although recent studies on movie consumption have extensively focused on movie consumers (e.g., Hirschmann and Morris 1982; Austin 1981, 1982, 1989; Cooper-Martin 1991, 1992; Moon, Bergey, and Iacobucci 2010), these studies have primarily been driven by the hedonic nature of movies; therefore, this line of research has largely concentrated on the unique aspects of consumer behaviors with respect to these types of goods rather than on the search for formalizable movie attributes that allow the preferences of individual movie watchers to be assessed. Therefore, little is known about which attributes of movies actually contribute to individual movie watcher preferences. To the best of our knowledge, Austin (1989) is the only author who has provided a thorough overview of the reasons that an individual selects a specific movie. Although the author has questioned the general validity of his own assertions, we believe that it is sensible to provide a brief excerpt of these conjectures.

Austin suggested that movie genre is the most influential attribute for determining consumers’ choice of a particular movie, although a single movie can be simultaneously classified into several different genres. The genre categorization of a movie informs the consumer about the type of story that will appear in the film and the elements of the film’s plot, thereby narrowing the set of hedonic qualities that the consumer can anticipate the film in question to evince. Additional attributes that influence movie choices include the onscreen and offscreen production personnel and firms for a film; the name recognition of these entities, which may include star actors, directors, producers, screenwriters, and the production companies that are responsible for visual effects, can affect a moviegoer’s attendance decisions. Although star actors “no doubt contribute much to the audiences’ awareness [of] and knowledge about the film” (Austin 1989, p. 77), only a few offscreen players typically possess sufficiently strong public recognition to affect movie attendance decisions (Austin 1989). Thus, a recommender algorithm does not need to regard every name from the movie industry as a preference-relevant movie attribute: we can narrow down the list of names that should be considered by a recommender algorithm to entities that possess star power, i.e., individuals and firms with names that are popular enough to influence a consumer’s movie preference assessments. This type of list can be obtained from various sources, including analytical web sites, such as IMDb⁴⁰ or InsideKino⁴¹, that maintain updated lists of not only movie stars but also offscreen personnel and firms that possess star power. Appendix B of this thesis presents 248 persons and 6 production companies that InsideKino lists as entities that possess star power.

According to Austin, other factors that influence movie choice include advertising, trailers, critical reviews, and word-of-mouth accounts (Austin 1989). However, these factors are not exactly movie attributes but instead represent additional sources of information about movies. In other words, these factors influence the process of preference assessment by providing customers with additional clues about the qualities of a movie. Although the presence of additional information may increase choice effectiveness, the information sources themselves are unlikely to possess distinct characteristics towards which an individual may exhibit relatively stable movie-relevant preferences: the utility of an information source depends on the utility of the information that this source transfers. In other words, we argue that it is unlikely that a consumer will like all movies equally more or equally less as a result of watching a movie trailer, viewing a TV advertisement for a movie, or hearing about a movie from a particular friend. Thus, we discard the aforementioned factors from the list of preference-relevant movie attributes and refrain from further discussion regarding these considerations.

Additional preference-relevant movie attributes can be obtained from the stream of movie-related research that addresses the economic success of motion pictures. This research stream also considers consumer preferences but approaches these preferences from the perspective of the movie-producing industry rather than from the viewpoint of consumers. In particular, this stream of research is focused on the economic qualities of a movie, such as its profitability and its box-office gross (Hennig-Thurau, Walsh, and Wruck 2001). A film’s economic status is determined by the fees that consumers pay for various aspects of consuming the movie in question, such as attending the movie in a theater, acquiring the movie on DVD, or other methods of experiencing the movie. Thus, the “success factors” of films are determined by consumers’ reactions to movie studio actions, non-studio-related factors, and the characteristics of these movies (Hennig-Thurau, Houston, and Walsh 2007). A summary of motion picture success factors is provided in Table 2.9.

40 http://pro.imdb.com/people
41 http://insidekino.de/Starpower.htm

Table 2.9: A summary of motion picture success factors based on Hennig-Thurau, Walsh, and Wruck (2001) and Hennig-Thurau, Houston, and Walsh (2007)

Movie characteristics:       Genre; Stars; Directors; Budget; Symbolicity; Certification; Sequel; Language; Country of Origin; Movie Length
Post-filming studio actions: Advertising expenditures; Timing of movie release; Number of screens
Non-studio actions:          Critical reviews; Awards; Customer-perceived movie quality; Early box-office information; Word-of-mouth accounts

However, analyses of movie success only indirectly involve consumers, and this involvement occurs through the monetary value that consumers generate on the aggregate level. Thus, the consideration of individual customers does not occur in these analyses. Therefore, the empirical evidence of the influence of various success factors on decisions to consume a movie in general cannot be interpreted as proof that these particular success factors, including the specific movie characteristics that are included among these factors, are relevant to the movie preferences of individuals. Nevertheless, the fact that these success factors significantly influence aggregate consumption decisions can be interpreted as an indication of the possibility that the factors in question are valid determinants of the movie consumption decisions of individual consumers. Thus, we assume that the motion picture success factors listed in Table 2.9 may possess explanatory power for individual consumer movie preferences. This assumption has two consequences for the objectives of this thesis. First, this assumption provides additional support for the idea that genre and production personnel characteristics affect the preferences of individual consumers. Second, this assumption extends the list of movie attributes that may potentially be relevant to an individual’s preference assessments.


In particular, our list of preference-relevant movie characteristics, i.e., attributes, now includes the budget, symbolicity, certification, sequel, language, country of origin, and length of a movie. In the following paragraphs, we briefly describe the meaning of these attributes and the motivations for their inclusion on the list of preference-relevant movie attributes that must be considered by a recommendation algorithm.

To consumers, movie budgets serve as an indicator of film quality, “since the budget indicates whether the producer has the resources to turn an idea into convincing reality through acting, artistry, and technology” (Hennig-Thurau, Walsh, and Wruck 2001, p. 11). Thus, budgetary information for a movie allows consumers to calibrate their expectations of the movie’s quality before they choose to watch this film. In fact, if we consider the popularity of recent high-budget movies (e.g., “Avatar”, “Lord of the Rings”, “Titanic”, and “Godzilla”), we observe that higher budgets tend to attract more moviegoers. Thus, although many consumers may not explicitly consider movie budgets during the course of their movie consumption decisions (perhaps because this information is not always available), we should affirm the indirect influence of this factor on movie preferences. A recommender algorithm can elicit these types of hidden preferences from data about a user’s prior movie consumption and use these indicators for the generation of predictions.

Certifications are intended to classify the potential offensiveness of movies to certain audiences with respect to a variety of issues, such as suitability for children, violence, sex, abusive language, and other considerations. Although the impact of certifications on a film’s box-office success remains debatable, certifications are considered to be factors that influence consumer interest in movies (Hennig-Thurau, Houston, and Walsh 2007). Thus, we include certifications in the list of preference-relevant movie attributes.

Certain movie-producing countries are often associated with a specific style of narration that may affect the attractiveness of a movie to particular consumers. For instance, French movies are expected to be relatively artsy, whereas Hollywood films are expected to ‘merely’ focus on entertainment (Hennig-Thurau, Walsh, and Wruck 2001). Thus, the country of a film’s origin may affect an individual’s movie preferences.

The language that is spoken in a movie is closely related to a movie’s country of origin and may also influence a consumer’s decision to watch the movie in question. Conventional wisdom indicates that consumers who are unable to understand foreign languages are unlikely to watch undubbed foreign movies, whereas other consumers may enjoy watching movies in their original language. However, in several non-English-speaking countries, the original language of a movie is relatively unimportant because a majority of foreign films are either dubbed (as occurs in Germany, Russia, and France) or subtitled (as occurs in the Netherlands, Sweden, and Bulgaria; Hennig-Thurau, Walsh, and Wruck 2001). Thus, the informativeness of a movie’s language with respect to consumer preferences may depend on the country in which a particular recommender algorithm operates.

Movie length can also be regarded as a factor that impacts consumer movie choices because a significant number of consumers are unwilling to spend more than a particular duration of time watching a movie; this duration can be regarded as the ‘critical length’ of a movie for these consumers (Hennig-Thurau, Walsh, and Wruck 2001).

Awards that are granted by prestigious institutions, such as the Academy of Motion Picture Arts and Sciences (AMPAS), can be regarded as an independent indicator of the aesthetic quality of a movie (Hennig-Thurau, Walsh, and Wruck 2001; Hennig-Thurau, Houston, and Walsh 2007). The influence of awards on consumer choice making has been illustrated in the service sector (Dick and Basu 1994; Hennig-Thurau and Klee 1997), and it has been suggested that this influence also exists in the domain of motion pictures (Hennig-Thurau, Houston, and Walsh 2007). Although awards are not inherent attributes of movies, they are closely associated with these attributes. Thus, we can consider awards to be “exogenous” movie characteristics that are preference-relevant attributes of motion pictures.

However, certain motion picture success factors that are listed in Table 2.9 cannot be regarded as relevant movie attributes from the perspective of our goals because these factors are not suitable for the algorithmic prediction of consumer preferences. In particular, the notion of “customer-perceived movie quality”, which encompasses not only a movie’s experiential traits but also its structural qualities, such as its budget and personnel (Hennig-Thurau, Walsh, and Wruck 2001), presents three serious drawbacks that prevent it from being identified as a preference-relevant movie attribute. First, this notion is a composite factor that includes several entities; moreover, its exact composition is not specified by previously published research, which implies that this factor cannot be operationalized within a numeric process. Second, this notion incorporates the effects of a movie’s budget and personnel. Although a movie’s budget is a new piece of information for examination, the effects of movie personnel are already included on our list of preference-relevant movie attributes. The repeated consideration of these effects is therefore unnecessary and may even be harmful to a recommender algorithm because multiple instances of the same entity demonstrate perfect multicollinearity. Finally, the notion of customer-perceived movie quality inherently implies that a consumer has already seen a particular film and can therefore assess his or her preferences with respect to the experiential traits of the movie in question. This implication indicates that a portion of the information that is incorporated into customer-perceived movie quality cannot be provided to an algorithm prior to the consumer’s consumption of a movie; thus, movies that are unknown to consumers would be impossible to recommend. Accordingly, a recommendation algorithm that included this feature would not be a reasonable recommendation approach.

Similar arguments apply to the issue of symbolicity, which refers to a movie’s potential to be easily categorized by consumers into existing categories that are familiar to these consumers (Hennig-Thurau, Walsh, and Wruck 2001). This categorization is based on various considerations, including a movie’s relationship to prior works (e.g., novels, myths, fairy tales, comics, TV programs, or computer games) and its affiliation to a series of movies (Hennig-Thurau, Walsh, and Wruck 2001; Hennig-Thurau, Houston, and Walsh 2007). The property of being a sequel can thus also be regarded as a dimension of the concept of “symbolicity” (Hennig-Thurau, Houston, and Walsh 2007) because sequels are both part of a series of movies and related to their predecessors. Therefore, although a report regarding the elements of a movie’s symbolicity can help customers to assess their liking for a movie prior to watching the film in question and can thereby potentially increase decision effectiveness, we doubt that symbolicity as a single attribute (which would be represented by the question of whether a movie is based on prior work) can increase the quality of predictions by a recommender algorithm. A consumer may generally enjoy the prior work on which a movie is based (e.g., Greek myths) but may dislike a particular subset of this work (e.g., myths about Heracles). Similarly, the fact that a movie watcher liked certain sequels (e.g., the sequels to “Mission Impossible” or “The Matrix”) does not necessarily imply that the consumer in question likes sequels in general because this consumer might dislike other sequels (e.g., the sequels to “Batman” or “Spider-Man”).


Therefore, we classify the movie characteristics of “symbolicity” and “sequel” as inappropriate traits for inclusion in our preference-eliciting recommendation algorithm.

Furthermore, although the number of screens on which a film is shown, the timing of a movie release, a movie’s advertising expenditures, and a film’s early box-office information can influence movie attendance decisions, it can be argued that the impact of these considerations is concentrated in the period near the opening of a movie and diminishes significantly over the course of time. Moreover, the influence of these factors occurs largely through increasing consumers’ awareness of a movie rather than by directly impacting consumers’ preferences for the movie in question. Because the primary value of a recommendation algorithm is that it recommends movies that match the user’s preferences, irrespective of the release dates of the recommended films, and because this algorithm should certainly recommend movies that may be unfamiliar to an active user, we can omit the aforementioned movie success factors from further consideration in this thesis. Analogously, because word-of-mouth accounts and critical reviews are difficult to operationalize and because these factors do not necessarily mimic a consumer’s own preferences, we consider these factors to be irrelevant for the purpose of developing a recommendation algorithm that accurately describes an individual’s preferences.

However, completely discarding factors that have been proven to reflect the aggregated movie attendance decisions of consumers may be dangerous because this process would involve the loss of preference-relevant information that might not necessarily be captured by the retained movie attributes. We suggest compensating for this information loss by including a movie’s box-office gross and admissions (i.e., the number of people that have attended a movie) in our recommendation algorithm. We propose two arguments to justify this suggestion. First, in movie success studies, these quantities are determined by movie watchers’ decisions to consume a particular movie; therefore, these quantities are somewhat reflective of a film’s relative popularity. We argue that the popularity of a particular movie may constitute a separate and independent motive for consuming the movie in question. Thus, the quantity of preference that is allocated to a film’s box-office gross and admissions represents a “quasi-search” quality because this quality indicates the movie’s popularity, which reflects the quality judgments of other consumers. Second, the success factors that we suggested omitting have been proven to influence the box-office gross of a movie (Hennig-Thurau, Houston, and Walsh 2007); thus, the consideration of the latter trait captures the variance in the former characteristics. Therefore, the box-office gross of a film can serve as a proxy for assessing not only the experiential qualities of the movie but also other omitted factors that would otherwise be difficult to operationalize (e.g., advertising pressure and word-of-mouth qualities).

Another attribute that we propose to include in our list of preference-relevant movie attributes is a movie’s year of production. This trait is typically not considered in the field of movie research because this research stream primarily addresses the managerially relevant success of a movie that occurs after the release of the film in question; thus, the extant movie research typically evinces a short-term focus that does not involve the tracking of a movie through its complete lifecycle. However, in the context of a movie RS, users can access virtually all movies that have ever been produced; therefore, the age of a movie may also be one determinant of a consumer’s intentions to watch the film in question. Certain consumers may tend to prefer only newly released movies, whereas other consumers may possess stronger preferences for relatively old and “mature” films. This assumption is broadly accepted within the field of recommender research (e.g., Ansari, Essegaier, and Kohli 2000; Adomavicius and Tuzhilin 2005; Ying, Feinberg, and Wedel 2006; Symeonidis, Nanopoulos, and Manolopoulos 2009; Koren 2009). Thus, we suggest that the year of a film’s production may be relevant to the formation of an individual’s preference regarding the consumption of a particular motion picture.

The above discussion provides an overview of movie attributes that are relevant for assessing consumers’ preferences for movies that these consumers have not yet viewed. Thus, these attributes may be incorporated into a recommendation algorithm that seeks to generate recommendations that reflect individual user preferences and provides comprehensive and actionable explanations for these recommendations. The final list of the identified preference-relevant movie attributes is presented in Table 2.10.

In addition to the list of the preference-relevant attributes that have been characterized in the discussion above, Table 2.10 includes three other properties that provide insights into how these attributes may be operationalized within the recommender algorithm that we will develop in the next chapter of this thesis. The column “operationalization” indicates the scale level at which each corresponding attribute is intended to be coded for the purpose of algorithmic processing. Attributes in the lower portion of Table 2.10 can be measured on a metric scale (i.e., in minutes, dollars, or years), whereas attributes in the upper portion of this table are not measurable in a metric manner. Moreover, the “attribute” column does not denote the actual attributes themselves but instead refers to the specific categories of certain attributes, such as star actors or genres. In other words, the specified attribute categories may contain more than one entity. For instance, the star actor category may include various attributes that indicate whether specific actors, such as Tom Cruise or Clint Eastwood, are present in or absent from a particular movie. Such attributes are coded in binary fashion; a value of 1 indicates the presence of an attribute, whereas a value of 0 indicates the absence of the attribute in question.

Table 2.10: A summary of preference-relevant movie attributes
(* indicates that the data regarding this attribute group was unavailable)

Attribute              Operationalization   Number of Parameters   Source
Star Actors            binary               133                    InsideKino
Awards*                binary               n/a*                   n/a*
Certification          binary               29                     IMDb
Country of Origin      binary               38                     IMDb
Directors              binary               106                    InsideKino
Genre                  binary               26                     IMDb
Language               binary               22                     IMDb
Producers              binary               4                      InsideKino
Production Companies   binary               6                      InsideKino
Screenwriters          binary               5                      InsideKino
Admissions             metric               1                      IMDb
Box-Office Gross       metric               1                      IMDb
Budget                 metric               1                      IMDb
Movie Length           metric               1                      IMDb
Year of Production     metric               1                      IMDb
Total:                                      374

The next column in Table 2.10, “number of parameters”, indicates the number of entities that are contained in an attribute category. For instance, the category of “star actors” contains 133 actors and actresses that are each operationalized as a separate parameter that is addressed by the RS algorithm that will be developed. Finally, the “source” column indicates the source from which information regarding the composition of an attribute category was obtained.
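To illustrate the coding scheme just described, the following sketch builds a small feature vector: binary presence/absence flags for the categorical attribute groups plus raw metric attributes. The entity lists and the example figures are tiny invented stand-ins for the 374 actual parameters summarized in Table 2.10.

```python
# Sketch of coding a movie per Table 2.10: binary flags for categorical
# attribute groups plus raw metric attributes. Entity lists are invented
# stand-ins for the 374 actual parameters.

BINARY_GROUPS = {
    "star_actors": ["Clint Eastwood", "Tom Cruise"],  # 133 entities in the thesis
    "genres": ["Drama", "Action"],                    # 26 entities in the thesis
}
METRIC_ATTRIBUTES = ["budget", "movie_length", "year_of_production"]

def encode(movie):
    vector = []
    for group, entities in BINARY_GROUPS.items():
        present = set(movie.get(group, ()))
        vector.extend(1 if entity in present else 0 for entity in entities)
    vector.extend(float(movie[name]) for name in METRIC_ATTRIBUTES)
    return vector

gran_torino = {
    "star_actors": ["Clint Eastwood"],
    "genres": ["Drama"],
    "budget": 33_000_000,        # illustrative figures only
    "movie_length": 116,
    "year_of_production": 2008,
}
print(encode(gran_torino))  # [1, 0, 1, 0, 33000000.0, 116.0, 2008.0]
```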


As noted above, the list of the preference-relevant persons and companies, i.e., actors, directors, producers, screenwriters, and production firms, was built based on the star power of these entities. In other words, the list of considered entities only includes persons and companies that are considered to possess sufficient star power to be recognizable by consumers and to be influential for consumers’ preference assessments of movies. These lists were obtained from InsideKino. The remaining attribute lists were obtained from IMDb. The list of genres consists of the complete genre classifications of IMDb, whereas all of the other attribute lists have been restricted by the properties of the datasets that have been employed in our study (see Chapter 4). For instance, the certification ratings are restricted to the ratings that are used in the USA and in Germany because our datasets only include customers from these countries; we have no logical reason to assume that our users’ preferences could be described by the certification ratings of other countries, such as Russia or Finland. The lists of countries of origin and of languages also reflect the movies that are contained in the examined datasets. Unfortunately, we could not manage to obtain a list of movie awards that can be matched with all (or a substantial portion) of the movies that are contained in our datasets. Although this information is available at IMDb, the non-commercial license for IMDb does not encompass movie awards, and the commercial licensing costs of $15.00042 that would be required to acquire these data would have been prohibitive in the context of this thesis. Therefore, we must omit the attribute of awards from our study. Nevertheless, we encourage commercial RS vendors to consider acquiring this license and to integrate awards into their algorithms because this information could potentially increase the quality of these vendors’ recommendations. The metric attributes of admissions, box-office gross, budget, movie length, and year of production can each be described by a single parameter. However, a degree of recoding is necessary for the description of money-valued attributes. To ensure the consistency of the attributes of movie budget and box-office gross, i.e., to ensure the comparability of different movies based on these attributes, different currencies must be consolidated into a single metric (in our case, the US dollar) by accounting both for the currency exchange rate that was

42 http://www.imdb.com/licensing/


applicable at the release date of a movie and for inflation. Although this procedure is time-consuming and involves a vast quantity of sources that become progressively harder to acquire for dates that are further in the past, this consolidation undoubtedly contributes to the quality of the recommendations that are generated by our suggested algorithm. In our case, however, exchange rate information was unavailable prior to 1990 (see footnote 43). Therefore, the monetary metrics for movies that were released prior to this date were converted through the use of the exchange rates that were applicable on January 1st, 1990. After monetary values in US dollars are obtained for movies from different years, the annual US inflation rates that have been reported by the Bureau of Labor Statistics (see footnote 44) can be used to correct these monetary values for inflation.

The complete list of binary attributes that are employed in our study is provided in Appendix B. In accordance with the discussion above, 374 total attributes are utilized to describe movies and calculate user preferences. The following section closes the current chapter by summarizing its main points, allowing us to proceed in the next chapter to the development of our algorithm for the generation of recommendations and the provision of explanations for these recommendations.
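As an illustration of this consolidation, the sketch below converts a money-valued attribute to inflation-adjusted US dollars. The exchange-rate and CPI figures are invented placeholders (the actual series come from the sources given in footnotes 43 and 44); only the two-step logic, conversion at the release-date rate with a 1990 fallback followed by a CPI ratio correction, mirrors the procedure described above.

```python
# Hedged sketch of the two-step money consolidation. All rates and CPI values
# below are invented placeholders; the real series come from oanda.com
# (footnote 43) and the BLS CPI data (footnote 44).
usd_per_unit = {("DEM", 1990): 0.62, ("DEM", 1995): 0.70}  # USD per currency unit
cpi = {1985: 107.6, 1990: 130.7, 2010: 218.1}              # annual US CPI levels

def to_usd(amount, currency, release_year):
    # Exchange rates are unavailable before 1990, so earlier releases are
    # converted with the rate of January 1st, 1990, as described in the text.
    rate_year = release_year if release_year >= 1990 else 1990
    return amount * usd_per_unit[(currency, rate_year)]

def adjust_for_inflation(usd, from_year, to_year):
    # Correct a dollar amount with the ratio of CPI levels between two years.
    return usd * cpi[to_year] / cpi[from_year]

# A 1985 German production budget, expressed in 2010 US dollars:
budget_usd = to_usd(10_000_000, "DEM", 1985)
print(round(adjust_for_inflation(budget_usd, 1985, 2010)))
```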

2.4 Summary

In this chapter, we provided an overview of the published theoretical work that relates to the objectives and the underlying proposals of the current thesis, which seeks to develop a

43 Daily currency exchange rates from January 1, 1990 until the present can be obtained at http://www.oanda.com/currency/historical-rates/
44 The annual and monthly inflation rates for the US beginning from 1913 can be found at ftp://ftp.bls.gov/pub/special.requests/cpi/cpiai.txt. Alternatively, the Consumer Price Index (CPI) inflation calculator can be utilized to calculate the inflation rate between two dates; this calculator is available at http://www.bls.gov/data/inflation_calculator.htm


recommendation method that is capable of providing both accurate recommendations and actionable explanations of the reasoning behind these recommendations.

In the first section of this chapter, we provided an overview of several key recommendation approaches: collaborative filtering, content-based filtering, and hybrid methods. We also provided detailed descriptions of recommendation algorithms that are representative of each of these approaches. This knowledge allows us to comprehend the principles and the details of recommendation generation; the merits and limitations of different RS approaches; and the problems that we must consider and account for in the development of our proposed recommendation method.

In the second section of this chapter, we addressed the questions of why and how explanations of recommendations should be provided. In particular, we concluded that to be effective and actionable, explanations must be aligned with user preferences. This alignment also increases a user's acceptance of, trust in, and loyalty to a recommender system. Furthermore, we demonstrated that the generation of such explanations requires a recommendation algorithm that is capable of reflecting a user's attribute preferences and directly incorporating these preferences into the recommendation generation process. Moreover, we determined that for hybrid systems, the only hybridization designs that are capable of providing actionable explanations are designs whose final ratings are "raw" predictions of the individual constituent RS approaches; by contrast, designs that produce final ratings by either mixing the results of multiple individual approaches or tightly interweaving different RS methods cannot provide such explanations. Based on these considerations, we reasoned that the hybridization of an item-based CF with a CB method will allow for the most effective explanations. An item-based CF approach was described in the first section of this chapter, whereas both a CB method that is applicable to the domain of motion pictures and the method of hybridizing these two approaches will be developed in the current thesis.

In the third section of this chapter, to refine the conceptual framework for the development of our CB method, we introduced the concept of multiattribute utility (MAU) and the weighted additive composition rule (WADD), which serve as the bases for the operationalization of a user's attribute preferences and for the derivation of recommendations from a user's attribute-related preferences, respectively. In particular, we suggested utilizing the MAU approach to decompose a consumer's preferences into attribute part-worths. To accomplish this task, we proposed operationalizing the movie utility for a


consumer in terms of the rating that the consumer assigns to a movie to indicate his or her preferences. If a consumer's attribute part-worths can be elicited by a recommendation algorithm, these part-worths can then be used to calculate the predicted ratings (utilities) of arbitrary movies via the application of the WADD rule. Thus, movie options may be sorted in rank order in accordance with the consumer's preferences; the movies with the highest calculated preference ratings will then be suggested as the actual recommendations of an RS.

We also discussed how the instability of consumers' intrinsic utility functions should be addressed within the recommendation process. Based on a study by Aksoy et al. (2006), we demonstrated that an RS can provide effective recommendations if it predicts a user's preference part-worths for different movie attributes reasonably well and then produces recommendations through the WADD model, which is the normatively best preference model among those examined in the literature. This approach possesses several advantages over alternative techniques. First, it reduces the recommendation task to the challenge of eliciting a user's attribute preference weights, obviating the need to derive a decision model for each user in each particular recommendation context; this reduction unifies and simplifies the problem of calculating recommendations. Second, because this approach directly utilizes a user's attribute preference weights in the calculation of recommendations, the contribution of each attribute to every recommendation is known. Thus, this information can be directly harnessed to generate an explanation of the underlying reasoning for each recommendation in the keyword explanation style, which is known to be understandable and actionable to users and to make a reasonable contribution to choice effectiveness. Finally, the suggested approach allows the keyword explanation style to be extended to the "pros-and-cons" style, which permits negative cues to be incorporated into explanations. These negative cues play an important role in the evaluation process and may therefore increase the decision effectiveness of RS users.

To apply these findings to the situation of motion picture recommendations, we then examined the question of which movie attributes are relevant to the formation of consumer preferences for movies. However, this discussion was challenged by the dearth of research on the movie preferences of individual consumers. The extant research on movie consumption merely suggests a set of movie attributes without demonstrating the explanatory


power of these attributes, whereas research on movie success only addresses attribute preferences at the aggregate level. Based on the argument that empirical evidence regarding movie attributes at the aggregate level supports the relevance of these attributes for individual preferences, we combined the suggestions of both of these research streams. In addition, the suggested attributes were examined with respect to their potential descriptive power in the context of an RS and their suitability for being operationalized within a recommender algorithm. We also suggested that one additional attribute (year of production) that was not identified by either of the examined research streams may be descriptive of consumer movie preferences. The resulting list of preference-relevant movie attributes is summarized in Table 2.10 and is presented in detail in Appendix B of this thesis.

On the whole, the discussion of Chapter 2 offered an integrative perspective on the explanatory and algorithmic issues of RSs within a common framework. At this point, we have addressed all of the important concepts that allow us to construct an algorithm that can provide personalized recommendations that are accompanied by effective and actionable explanations. Thus, we proceed to the next chapter, which describes the concepts of the RS method that will be utilized to achieve our goals.


Chapter 3

The Conceptual Framework of a Hybrid Recommender System that Allows for Effective Explanations of Recommendations

This chapter presents the actual proposal of the current thesis, which consists of a recommendation method that is capable of providing not only accurate recommendations in the domain of motion pictures but also actionable and effective explanations of the reasons for these recommendations. In accordance with the objectives that were stated in Section 1.2 of this thesis, this recommendation method pursues three concurrent aims. First, this method should provide accurate recommendations, which are recommendations that are both relevant to the users of an RS with respect to user preferences and helpful in encouraging these users to make optimal choices. Second, the method should be able to provide users with actionable explanations of recommendations, which may be described as explanations that allow a user to assess the suitability of recommendations for his or her particular choice-making contexts and to address recommendation errors in a manner that increases the effectiveness of his or her choices. Third, this method should demonstrate practical applicability; in particular, this method should be able to provide both recommendations and explanations to all of the users of an RS.

In accordance with the conclusions of the previous chapter, this method should directly integrate user attribute preferences into the process of generating recommendations and should also align the recommendation process with user preferences. In this respect, this method mimics the general approach of content-based techniques. However, in the


domain of motion pictures, the TF-IDF metric, which is the central concept of these content-based techniques, is not applicable; moreover, other techniques, such as regression analysis, can process only a limited number of attributes because of the scarcity of the underlying rating data (see Section 2.1.2). In this chapter, we propose an algorithm that overcomes these restrictions and is capable of estimating a large number of parameters, which are the attribute part-worths of users, from a low number of data points (namely, ratings). This algorithm allows for recommendations and explanations to be generated from attribute-based user preferences.

However, not all users form their movie preferences based solely on movie attributes. Thus, to ensure that our approach is able to provide recommendations to all of the users of an RS, we suggest hybridizing our algorithm with the item-based collaborative filtering technique, which is able to capture aspects of user preferences that extend beyond movie characteristics. We suggest the switching hybridization design (see Section 2.1.4.1) as a suitable method of balancing all of the concurrent aims of the current thesis, and we validate this suggestion in the discussion that appears in the closing part of the current chapter.

This chapter is divided into three sections. The first section discusses modeling-related issues. In particular, the model of user preferences is gradually derived in this section, and the traits that the model incorporates and considers are discussed. The second section presents the method of parameter estimation for the derived model. In essence, this section presents the core of our proposal, which is an algorithm that is capable of estimating the attribute part-worths of users from very scarce data sets (see footnote 45), i.e., data sets in which the number of parameters that must be estimated is much greater than the number of data points, so that an algebraic solution to the estimation problem is impossible to obtain. The third section provides the motivation for the hybridization of our algorithm and discusses the methodology of this hybridization.

45 For a discussion of the problems that are related to this issue, see Sections 2.1.2.2 and 2.1.3.1.


3.1 The Modeling of User Preferences

3.1.1 The Motivation for the Approach

As stated above, a recommender algorithm that seeks to help a user make better choices and to increase his or her choice efficiency through the provision of actionable explanations should reflect the user's way of thinking (see Section 2.3.1.2). This requirement can be fulfilled either by conforming the algorithm's model to its user's decision strategy or by ensuring that the algorithm possesses accurate estimations of the attribute preference weights of its user. As demonstrated by Aksoy et al. (2006), the relationship between these two aspects of a recommender algorithm is not additive; thus, it suffices for a recommender algorithm to achieve one of the two aforementioned conditions. In our approach, we choose to achieve the second condition (the accurate estimation of a user's attribute preference weights) because this type of algorithm is relatively generalizable and allows us to address all users in the same manner, namely, by applying the additive decision rule to a user's estimated attribute part-worths.

By contrast, the alternative approach to algorithm design, which involves deriving the user's decision strategies, suffers from the critical disadvantage that consumers not only have unstable decision functions but are also likely to rely on simplified heuristics in a number of situations (e.g., under time pressure). From the perspective of an RS, this type of spontaneous strategy change would greatly impede the accomplishment of the recommendation task because these strategy shifts would force an RS to adapt to every minor change in a user's behavior; moreover, such changes are difficult to track automatically. Furthermore, the derivation of a user's decision strategy typically requires knowledge of the user's attribute part-worths; this requirement would complicate the recommendation process and render it more prone to errors. Instead, we suggest a reliance on the WADD approach, which is the normatively most efficient decision rule among the decision methods that have been examined in the literature, accompanied by a focus on


the accurate estimation of attribute preference weights (see footnote 46). This proposed technique also conforms to our aim of providing users with an efficient decision aid instead of obtaining an in-depth understanding of each individual user.

However, actionable explanations must be understandable to a user, i.e., delivered in terms that are meaningful to the user and relevant to the formation of his or her preferences. As demonstrated in Section 2.3.1.2, movie attributes constitute a suitable basis that fulfills these requirements: these attributes are both understandable to a user and relevant to the formation of user preferences. The latter consideration again points to attribute preferences and confirms our choice to concentrate our proposal on the reliable estimation of a user's attribute preference weights (that is, a user's part-worths for each attribute).

Consequently, we develop our model of user preferences in terms of the user's attribute part-worths. To accomplish this objective, we utilize the concept of multiattribute utility, which connects a user's preferences to the utility that a particular movie option possesses for the user in question; this concept states that this utility can be decomposed into its attribute-related components, which are the user's attribute part-worths (see Section 2.3.1.1). The following subsections present the development of the model in detail. Each subsection builds on the preceding one to refine the model by introducing additional model components.

3.1.2 A Basic Model of User Preferences

The datasets that recommender systems (RSs) operate with typically represent a set of ratings that the users of an RS have assigned to the items that are contained in the system’s catalog (see Section 2.1). In the context of movie recommendations, ratings are used to describe a user’s enjoyment of a movie, i.e., the degree to which a user liked a particular film.

46 For a detailed discussion of the arguments regarding this proposed methodology, see Section 2.3.1.


Higher ratings of a movie from a user are indicative of greater levels of enjoyment of the movie by the user in question. Thus, ratings can be regarded as a method of expressing users' preferences for movies; this concept can also be expressed as the usefulness of movies for users with respect to enjoyment. This notion is directly analogous to the concept of utility. In fact, a higher rating corresponds to a greater utility; two movies with the same rating from a user were equally "useful" to this user. Thus, we can argue that ratings are proxies for the utility of movies to users and for the preferences of users for particular movies.

A user's preference for a movie can be decomposed into the (partial) preferences of the user for the movie's attributes, i.e., the user's attitudes towards a movie's characteristics (see Section 2.3.1.2). Thus, a rating can be described as a sum of the part-worths of the movie's components to the user; more formally, a rating may be expressed as follows:

$$ r_{ui} = \sum_{j \in A} p_{uj}\, x_{ij} \tag{3.1} $$

where $r_{ui}$ is the rating of user $u$ for movie $i$; $p_{uj}$ denotes the preference of the user for the $j$th attribute of the movie, i.e., the $j$th attribute part-worth of the movie for this user; and $x_{ij}$ denotes a binary variable. For this binary variable, 1 indicates the presence of an attribute, such as the appearance of a particular actor in the movie of interest, and 0 indicates the absence of this attribute in the movie in question. In the above equation, $A$ defines the indices of the set of attributes that are used to describe movies in the system's dataset.

If rewritten in vector form, expression (3.1) becomes the following equation:

$$ r_{ui} = \mathbf{x}_i^{\top} \mathbf{p}_u \tag{3.2} $$

where $\mathbf{x}_i^{\top}$ denotes the transposed binary vector of the movie's characteristics and $\mathbf{p}_u$ is the vector of the user's part-worths for each corresponding attribute (see footnote 47).

47 Throughout this thesis, we use boldfaced font to denote vectors and regular font to denote scalars.

This first simple model assumes that movie ratings are known from the user's past rating records and that movie characteristics are available from a particular source, such as the Internet Movie Database (IMDb). The vector of the user's preferences $\mathbf{p}_u$ must be estimated. After this estimation is completed, the user's part-worths for each attribute can be used not

only for generating predictions of the user’s ratings for new and unfamiliar movies but also for providing the explanations for recommendations. Note that the model above implies that the elements of the part-worth vector are real numbers; thus, these elements may assume either positive or negative values. These properties provide the ability to arrange the attribute part-worths in rank order according to the contributions of each attribute to the user’s final rating. This feature allows for explanations to be provided in the pros-and-cons style (see Section 2.3.1.3). Moreover, this feature enables explanations to emphasize the most important considerations that influenced a recommendation in either positive or negative ways, thereby increasing the effectiveness of these explanations.
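To make the basic model tangible, the following sketch computes expression (3.2) for one user and one movie and derives a pros-and-cons ordering from the signed part-worths, as discussed above. The attribute names and all numeric values are invented for illustration.

```python
# A minimal sketch of the basic preference model (3.2): a predicted rating is
# the dot product of a movie's binary attribute vector and the user's
# part-worth vector. All names and values here are invented placeholders.
import numpy as np

attributes = ["Clint Eastwood", "Western", "Comedy", "Budget > $100M"]
p_u = np.array([0.9, 0.6, -0.4, 0.2])   # estimated part-worths of one user
x_i = np.array([1, 1, 0, 1])            # binary profile of one movie

r_hat = x_i @ p_u                       # expression (3.2)
print(f"predicted rating contribution: {r_hat:.2f}")

# Because part-worths are signed real numbers, the attributes present in the
# movie can be ranked by contribution, yielding a pros-and-cons explanation.
contributions = sorted(
    ((attributes[j], p_u[j]) for j in range(len(attributes)) if x_i[j] == 1),
    key=lambda kv: kv[1], reverse=True,
)
pros = [a for a, w in contributions if w > 0]
cons = [a for a, w in contributions if w < 0]
print("pros:", pros, "| cons:", cons)
```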

3.1.3 Accounting for Static Effects Beyond the User-Item Interaction

The model that we propose in expressions (3.1) and (3.2) is relatively simple, and certain refinements are required to improve its efficiency. The first shortcoming of this model relates to the centering of the part-worths. In particular, the part-worth values in the basic model are centered around zero. Although this centering is advantageous for distinguishing between "good" and "bad" attribute preferences, which are indicated by positive and negative part-worths, respectively, it is not well suited to the rating scales of most recommender systems. Most RSs employ rating scales that begin at a lowest rating of 1 point (or 1 star). To produce positive rating values, the current model requires each movie to possess at least one attribute with a positive part-worth that is sufficiently high to compensate for all of the negative attributes of the movie in question. Moreover, to obtain a score that is greater than 0, this requirement must be fulfilled for all users, a condition that appears to be rather unrealistic.

A common way to compensate for this shortcoming is to integrate a constant term into the model; this term is often referred to as the "baseline". By these means, the model parameters are shifted by the value of this constant, which causes the part-worths to be centered on the baseline. A suitable choice for the baseline is the mean value of all movie ratings that are contained in the system's dataset. This choice is advantageous because the mean represents the first moment of the distribution of ratings in the examined dataset. In accordance with the law of large numbers, if a high number of ratings are examined (a condition that is frequently fulfilled for an RS), the sample mean will converge to the expected value of the rating of a movie. Accordingly, the model may be updated to the following expression:

$$ r_{ui} = \mu + \mathbf{x}_i^{\top} \mathbf{p}_u \tag{3.3} $$

where $\mu$ denotes the mean value of the movie ratings, i.e., the expected value for every movie in the absence of additional information about the movie and the user. In other words, if a user has no preferences for a movie's attributes, i.e., if the user's part-worths for all of the movie's attributes are zero, then the most probable value of the rating will be $\mu$. Positive and negative attribute part-worths produce increases and decreases, respectively, in a user's rating. However, the meaning of the part-worths is slightly different in this modified model than in the previous formulation. In particular, in model (3.3), the user's attribute part-worths indicate the degree to which a user's evaluation of a movie differs from the user's impression of an average movie due to the existence of a specific attribute.

Furthermore, expressions (3.1) and (3.2) model the rating solely as the product of interactions between item attributes and a user's attribute part-worths. However, there are certain effects that are independent of this interaction and are instead associated with either users or items (Koren 2009). Thus, the recommender literature frequently indicates that different users may use the rating scale in diverse ways. For instance, certain users tend to systematically give higher ratings than other users (e.g., Sarwar et al. 2001; Adomavicius and Tuzhilin 2005; Jannach et al. 2011; see also Sections 2.1.1.1 and 2.1.1.2). This phenomenon causes the mean rating of individual users to deviate from the overall mean, an effect that we refer to as user bias. An item bias may result from various causes, such as the "appeal to popularity" of mainstream movies, which causes the mean ratings of these movies to generally be higher than the overall mean movie rating (Austin 1989; Koren 2009); by contrast, movies that are less popular are likely to exhibit lower average ratings than the overall mean rating for all movies.


Users may also differ with respect to their reactions to the average ratings and popularity levels of movies. One group of users may adapt to mainstream assessments of a film, whereas other users may respond to these assessments in an overly positive manner, and a third group may respond more skeptically to these assessments by rating movies in ways that defy general trends. Although these reactions involve both a user and a movie, it can be argued that these responses are directed at a movie as a whole rather than at any specific characteristics of any particular movie. In other words, these reactions occur on a general level that does not relate to attribute-level interactions, i.e., changes in the user's preferences for a movie that are conditional on the presence of a certain movie characteristic in the movie's profile. The incorporation of these effects into the model of this thesis leads to the following modified expression:

$$ r_{ui} = \mu + b_u + c_u b_i + \mathbf{x}_i^{\top} \mathbf{p}_u \tag{3.4} $$

where $b_u$ denotes the user bias, which is defined as the deviation of a user's mean rating value from the overall mean rating. Analogously, in the equation above, the item bias $b_i$ is defined as the deviation of a movie's mean rating value from the overall mean rating. The user's reactions to the movie bias are captured by the scale factor $c_u$.

3.1.4 Accounting for Time

The model that is described by equation (3.4) separates user-item interactions from the effects caused by factors that are not related to users' preference formation but rather influence the magnitude of the rating through the inherent natures of users and movies. This approach allows for the estimation of the attribute preferences (part-worths) of a user that are actually involved in the determination of this user's overall preference for a particular movie. However, this model is static. In other words, this model does not account for temporal changes in either the preferences and rating behaviors of users or the popularity levels of movies. However, since RSs rely on historical data and since the performance of an


RS is affected by the quantity of data that is available to this system (see Section 2.1.3), the model must account for time to avoid the "stability vs. plasticity" issue (see Section 2.1.3.4). In fact, time affects all components of the model in one way or another. For instance, certain movies, such as "Casablanca", can become classics over time, whereas other films, such as "Night of the Creeps", may fall into oblivion. Moreover, over time, users may change their rating behaviors or adopt new perspectives on various aspects of movies, such as genres, actors, or directors. Thus, it is crucial to account for time-changing factors (Koren 2009).

Time-varying effects are typically modeled by splitting them into three components (see Figure 3.1). The first of these three components is a constant term, which represents the effect's baseline. This component can be interpreted as the quantity of the modeled measure that exists at the 'starting' timepoint, i.e., at the point $t = 0$. The second component of time-varying effects captures the long-term trend of an effect and addresses aspects of temporal changes that develop linearly over the course of time. In other words, this component represents the "drift" in a measure's baseline that occurs at a constant rate over time. The third component of the temporal effect captures short-term fluctuations, i.e., deviations from the drifted baseline at a particular point in time $t$. These deviations may occur either irregularly or periodically. For instance, Christmas movies become more popular at Christmas time (a periodic effect). By contrast, the popularity of an actor often increases following the premiere of a new movie that stars the actor in question or if the actor's name is mentioned in a considerable number of press reports; in general, these phenomena have no periodic basis but instead occur irregularly. Figure 3.1 illustrates the three components of time-varying effects.

Although all three types of time-varying effects can be modeled, their estimation is problematic because it requires an algebraic problem with a high number of degrees of freedom to be solved, which is a challenging task. In particular, vector $\mathbf{p}_u$ already includes 374 part-worth parameters that describe the attribute-based preferences of a user (see Section 2.3.2). The user bias, item bias, and scale factor, which are represented by $b_u$, $b_i$, and $c_u$, respectively, are another three parameters that must be estimated. To account for the long-term trend and the short-term fluctuations in $b_u$, $b_i$, $c_u$, and each of the part-worths, three times the number of unknowns that exist in model (3.4) would be required because each original parameter of this model would be accompanied by two additional variables that describe the changes in this parameter over time. For instance, the representation of user bias would take the form of $b_u + \beta_u t + \varepsilon_u(t)$, where $\beta_u$ is the slope of the user's rating trend, $\varepsilon_u(t)$ is the deviation of the user's mean rating at the point in time $t$, and $b_u$ is the static component of the user's rating. Thus, there would be $3 \times (374 + 3) = 1{,}131$ parameters to be estimated for each user. However, the median numbers of ratings per user in the two data sets that are employed in our study are 25 and 96 (see Table 4.1). Thus, our algebraic problem would frequently be considerably underdetermined because we would have to estimate radically more parameters than the number of data points that would be available. The latter fact substantially increases the risk of obtaining an arbitrary solution to our modeled problem. Taking into account that even the simpler problem that considers only long-term trends remains underdetermined, we believe that a complete model of time-varying user ratings would produce an unacceptably high risk of obtaining an arbitrary solution at the current stage of algorithm development.

[Figure: a schematic plot of a time-varying measure against time, annotated with the baseline, the linear long-term trend, and the long-term change and short-term fluctuation at the point t = X.]

Figure 3.1: The decomposition of a time-varying measure into three components: baseline, long-term trend, and short-term fluctuations

Nevertheless, we regard accounting for temporal changes in preferences as an important issue for our recommender algorithm. Thus, to address the trade-off between the level of decomposition of time-varying effects and the number of parameters that must be


estimated, we choose a compromise approach. In particular, we decide to include only the long-term trend in our final model. Relative to the situation involving the full decomposition of time-varying effects, this compromise involves the estimation of one-third fewer parameters; specifically, the compromise approach includes $2 \times (374 + 3) = 754$ parameters per user.

Accordingly, term $b_u$ in equation (3.4) must be replaced with the expression $b_u + \beta_u t$. In this expression, $\beta_u$ is the slope of a user's long-term rating trend, whereas $b_u$ is redefined as the static portion of a user's rating. Analogously, the movie bias and the user reaction factor must be replaced with $b_i + \beta_i t$ and $c_u (b_i + \beta_i t)$, respectively, where $\beta_i$ denotes the slope of the long-term trend of the movie bias and $c_u$ represents the user's reaction to this bias. The described modifications to expression (3.4) lead to our final model:

$$ r_{ui}(t) = \mu + b_u + \beta_u t + c_u \left( b_i + \beta_i t \right) + \mathbf{x}_i^{\top} \mathbf{p}_u(t) \tag{3.5} $$

As noted above, user preferences can also be subject to temporal changes; thus, each element of a user's part-worth vector $\mathbf{p}_u(t)$ is constructed as $p_{uj}(t) = p_{uj} + \beta_{uj} t$. In this equation, the index $j$ denotes the attribute with which a part-worth value is associated, and $\beta_{uj}$ indicates the slope of the long-term change in a user's preference towards the $j$th movie attribute. In the next section of this thesis, we describe our method for estimating model parameters.
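The sketch below evaluates the final model (3.5) for a single user-movie pair. The symbols follow the equation; all numeric values are invented, and the function is a direct transcription of the formula rather than a production implementation.

```python
# Hedged sketch of the final time-aware model (3.5). Symbols follow the text:
# mu (global mean), b_u/beta_u (user bias and its trend), b_i/beta_i (movie
# bias and its trend), c_u (the user's reaction to the movie bias), and the
# per-attribute pairs (p_uj, beta_uj). All numeric values are illustrative.
import numpy as np

def predict_rating(t, mu, b_u, beta_u, c_u, b_i, beta_i, x_i, p_u, beta_p):
    """Expression (3.5): rating prediction at time t (days since first rating)."""
    part_worths = p_u + beta_p * t          # drifting part-worth vector p_u(t)
    return (mu
            + b_u + beta_u * t              # user bias with long-term trend
            + c_u * (b_i + beta_i * t)      # reaction to the drifting movie bias
            + x_i @ part_worths)            # attribute-based user-item interaction

x_i = np.array([1, 0, 1])
p_u = np.array([0.8, -0.3, 0.1])
beta_p = np.array([-0.001, 0.0, 0.0005])    # slow long-term preference drift

print(predict_rating(t=200, mu=3.6, b_u=0.4, beta_u=0.0002,
                     c_u=0.7, b_i=0.5, beta_i=-0.0003,
                     x_i=x_i, p_u=p_u, beta_p=beta_p))
```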

3.2 The Estimation of Model Parameters

From the above discussion, we obtained a model of user preferences, which has been formulated in equation (3.5). This model allows for a user’s ratings for movies that the user has not seen to be predicted from knowledge about the characteristics of a movie and the characteristics (i.e., attribute-based preferences and rating behavior) of a user. Whereas


knowledge about movie characteristics can be obtained from various sources, such as the Internet Movie Database (see footnote 48), the other model parameters must be learned from the past user ratings that are available to a recommender system. To describe our approach for the estimation of model parameters, we assume that both user- and movie-related datasets are available and that the dataset of past user ratings includes the associations among ratings, users, and the movies to which these ratings were provided.

As deduced in the previous sections, our model includes 754 parameters. In total, 748 parameters describe a user's preferences for 374 movie attributes (see Section 2.3.2 for the derivation of these attributes and Appendix B for the list of the attributes in question); in particular, these parameters include the 374 pairs of $p_{uj}$ and $\beta_{uj}$ values that build the elements of the $\mathbf{p}_u(t)$ vectors. The remaining six parameters describe the effects that are associated with either a user or a movie. The complete set of 754 parameters must be estimated for each user based on the user's prior ratings. Note that we do not regard $\mu$ as a separate parameter because the mean rating in a dataset can be readily calculated from the examined dataset of user ratings.

However, a direct solution to this problem can only be obtained if the available data are sufficient, i.e., if the number of ratings from a user that exist in the examined dataset is at least equal to the number of parameters that must be estimated. Moreover, the data points are required to be linearly independent; in other words, no two movie vectors can consist of exactly the same attributes. In the case of movie recommenders, neither of these requirements is likely to be fulfilled. A linear dependence between movie vectors may be observed, for instance, if sequels or series, such as the "Matrix" trilogy or "Friends", are included in the database. However, the more serious problem is that the users of movie recommenders are not likely to rate a sufficient number of items. In fact, as noted in a previous section of this thesis, the median numbers of ratings in the MoviePilot and Netflix databases that are utilized in our study are 25 and 96 ratings per user, respectively (see Table 4.1); these median numbers of ratings are less than the number of parameters that must be estimated per user by more than

48 These data are available for download at http://www.imdb.com/interfaces. Licensing information is provided at http://www.imdb.com/licensing (for commercial use) and http://www.imdb.com/licensing/noncommercial (for non-commercial use).


30-fold and more than 7-fold, respectively. Thus, expression (3.5) can neither be solved algebraically nor addressed through various statistical techniques, such as regression analysis. In this situation, optimization techniques, such as gradient descent, can be applied to learn the model parameters through the minimization of the following dedicated error function:

$$ E_u = \sum_{i \in I_u} \left( \hat{r}_{ui} - r_{ui} \right)^2 \tag{3.6} $$

In the above equation, $\hat{r}_{ui}$ denotes a predicted rating, and $I_u$ denotes the set of movies that have been rated by a user. However, optimization methods are strongly dependent on the initial point of optimization; thus, the choice of an improper starting point for this optimization will likely lead to the identification of local minima instead of a global solution to the specified problem (see footnote 49) (Press et al. 2007; Paterek 2007; Koren, Bell, and Volinsky 2009). This issue may result in unreliable estimates of the model parameters and consequently in higher errors in the predictions that are produced by this model. However, if the initial guess, i.e., a suboptimal yet good solution to equation (3.5), lies near the global optimum of expression (3.6), optimization techniques will be able to determine this optimum and refine the "initial" model parameters such that the predictions made by the model exhibit the lowest possible errors, as calculated by expression (3.6).

Accordingly, the task of estimating the parameters for our model of user preferences can be divided into two steps: (i) the generation of an accurate guess for an initial solution to expression (3.5) and (ii) the optimization of the model parameters through the minimization of expression (3.6), the dedicated error of the model. In the following subsections, we provide a description of our two-step method for parameter estimation.

49 The optimization method and its tendency to find local minima will be described in greater detail in Section 3.2.2. For the moment, we simply assume that it is possible to use optimization-based methods to complete the estimation.
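As a minimal illustration of step (ii), the following sketch minimizes an error function of the form (3.6) by batch gradient descent for the part-worth vector alone, starting from a neutral initial guess. It is a toy stand-in under invented data, not the optimization routine of Section 3.2.2 (which also handles biases, trends, and constrained starting points).

```python
# Toy illustration of minimizing the squared-error function (3.6) by gradient
# descent, restricted to the attribute part-worths. Data, dimensions, and the
# learning rate are invented for the sake of a runnable example.
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(30, 8)).astype(float)   # 30 rated movies, 8 attributes
p_true = rng.normal(0, 0.5, size=8)
r = X @ p_true + rng.normal(0, 0.1, size=30)         # observed (centered) ratings

p = np.zeros(8)                                      # neutral "initial guess"
lr = 0.01
for _ in range(2000):
    residual = X @ p - r                             # r_hat - r for every rating
    p -= lr * (X.T @ residual)                       # gradient step on (3.6)

print("remaining squared error:", float(residual @ residual))
```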


3.2.1 Step 1: The Estimation of the Initial Parameter Values

As noted above, in the context of a movie RS, which typically involves an insufficient number of ratings from each user, the values of the parameters in model (3.5) can generally neither be precisely solved for nor simultaneously estimated. Nevertheless, to determine an efficient approximation of the solution for these parameters through an optimization approach, we must generate an initial guess for the parameter values that corresponds to a point in the parameter space that is as close as possible to the actual solution for the model.

We propose the use of OLS regression analysis to separately obtain the initial parameter estimates for each user and each parameter. In other words, instead of estimating all model parameters jointly (which is impossible because of limitations on data availability), we suggest performing a set of regressions on each user's set of ratings to independently estimate the individual parameters of the proposed model. Although this approach is likely to produce estimates that are biased, the OLS regression method offers a series of advantages. In addition to the estimates themselves, this method provides both (i) inferences about parameter significance and (ii) access to confidence limits for parameter values, i.e., intervals that are likely to include the true value of a parameter. This access allows us to interpret the OLS results as interval estimates and to additionally constrain the optimization routine such that the search for parameter values is performed within the scope of possible solutions that most likely contains the true model solution. This constraint prevents the search procedure from venturing outside this scope to reach a local minimum that satisfies the restrictions of error function (3.6) but provides unreliable estimates of a user's preferences in terms of model (3.5). The information about the significance of a parameter can be used to remove parameters that are statistically meaningless for describing the user's movie preferences and for generating and explaining rating predictions; thus, this information simplifies the search procedure and reduces the probability of finding local minima of error function (3.6).

However, compared with the simultaneous estimation of all parameters, the individual estimation of parameters introduces a model specification error to OLS. In other words, the regression model becomes underspecified, which may negatively influence the "quality" of


the individual estimates that are obtained, particularly if the omitted regressors correlate with the independent variable in the underspecified regression model. In this situation, it is not straightforward to estimate parameter values and confidence limits or to draw conclusions about the significance of these estimates. Therefore, before we present the details of how the individual parameters can be estimated, in the next subsection of this thesis, we examine the consequences of the underspecification of an OLS regression and present our method to counteract these consequences and achieve more reliable initial parameter estimates.
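The following sketch illustrates the per-parameter initialization idea on synthetic data: a univariate OLS regression for a single attribute, whose slope is kept as an initial part-worth only if it passes a significance cut-off, together with its confidence interval. The data and thresholds are illustrative assumptions; the bias that such isolated regressions introduce is treated in the next subsection.

```python
# Sketch of one per-parameter auxiliary regression: estimate a single
# attribute's part-worth, its p-value, and a 95% confidence interval.
# Synthetic data; estimating each slope in isolation is what makes the
# estimates biased when attributes correlate (the omitted-variable problem).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.integers(0, 2, 40).astype(float)     # presence of one attribute in 40 movies
r = 3.5 + 0.8 * x + rng.normal(0, 0.5, 40)   # the user's ratings

res = stats.linregress(x, r)                 # univariate OLS: r = a + b * x
t_crit = stats.t.ppf(0.975, df=len(x) - 2)
ci = (res.slope - t_crit * res.stderr, res.slope + t_crit * res.stderr)

if res.pvalue <= 0.05:                       # significance cut-off for the guess
    print(f"keep part-worth ~ {res.slope:.2f}, 95% CI [{ci[0]:.2f}, {ci[1]:.2f}]")
else:
    print("drop this attribute from the user's model")
```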

3.2.1.1 The Omitted Variable Bias in OLS Models and a Method to Counteract this Bias

In the majority of regressions, the omission of a relevant variable from the regression model produces biased estimates of the parameters and their corresponding variances (Gujarati 2004). Consequently, because the variance of the parameters in a regression analysis serves as the basis for inferences about the significance of these parameters, statements about the latter may become misleading. To further understand the rationale behind these assertions, let us consider the following example (see footnote 50). In order to maintain consistency with the notation that is commonly used within regression analyses, let us redefine, for the length of this section, the symbols $\alpha$ and $\beta$ as the coefficients of regression equations, $r_{jk}$ as the correlation coefficient between the $j$th and $k$th independent variables of a regression model, and $t$ as the $t$-value that is obtained from Student's t-test. Suppose now that the true regression model to estimate is expressed as follows:

$$ Y_i = \beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + u_i \tag{3.7} $$

but instead, we omit the relevant variable $X_3$ and fit the model:

$$ Y_i = \alpha_1 + \alpha_2 X_{2i} + v_i \tag{3.8} $$

50 This example and its associated explanation are based on Gujarati (2004), Chapter 13, esp. pp. 510-513 and 556-557.

The consequences of omitting $X_3$ are as follows (see footnote 51):

1. If the omitted variable $X_3$ is correlated with the included variable $X_2$, i.e., if the correlation coefficient $r_{23}$ between $X_2$ and $X_3$ is nonzero, then the estimates of $\alpha_1$ and $\alpha_2$ will be biased and inconsistent. More formally, this implies that $E(\hat{\alpha}_1) \neq \beta_1$ and $E(\hat{\alpha}_2) \neq \beta_2$. Moreover, this bias will not disappear as the sample size increases.
2. If $X_2$ and $X_3$ are not correlated, then the constant term $\alpha_1$ will be biased, although $\alpha_2$ will be unbiased in this case.
3. The disturbance variance $\hat{\sigma}^2 = \mathrm{RSS}/df$, where $df$ denotes the degrees of freedom of the model, will be incorrectly estimated.
4. The variance of estimator $\hat{\alpha}_2$ will be a biased estimator of the variance of the true estimator $\hat{\beta}_2$.
5. Consequently, the hypothesis-testing procedure, i.e., the t-test, will be likely to provide misleading conclusions about the statistical significance of $\alpha_2$ and its confidence limits.

51 The proof of each individual statement that is included here lies beyond the scope of this thesis but can be found in various sources, such as Kmenta (1971) or Johnston and DiNardo (1997).

For our proposed method, these consequences produce the following implications: (i) we may erroneously drop a parameter from our model based on an inapplicable conclusion about this parameter's insignificance; (ii) the absolute value of a parameter may be overestimated, causing our starting point for the optimization to be shifted further from the global optimum of the error function (3.6), which increases the risk of finding a local minimum instead of a global minimum of function (3.6) during the course of the optimization process; and (iii) the confidence intervals for a parameter might not include the true value of the parameter, which would cause our optimized solution to deviate from the global optimum estimation.

However, we can counteract the consequences of OLS model misspecification and thereby reduce the aforementioned risks. In other words, to an extent, we can compensate for biased parameter values and any biases in the corresponding variances, allowing us to obtain more efficient initial estimates and more reliable confidence limits.

First of all, note that problems 1-5 and their consequences (i)-(iii) only apply to the estimation of the part-worth vectors and are inapplicable with respect to the estimations of user bias, item bias, and users' scale reaction factors because, by definition, these latter features are free from correlations with other model variables. These features (that is, features other than part-worth vectors) only capture effects that are associated with either a user or an item. Hence, by their nature, these effects should not be influenced by a source other than the inherent traits of a user or an item. Thus, the consequences of OLS model misspecification are irrelevant to the estimations of these parameters. For these reasons, the discussion below is only relevant for the estimation of the part-worth parameters.

Note that with respect to the part-worth parameters, we are uninterested in the estimates for $\alpha_1$ or $\beta_1$, the regressions' constant terms, because the baseline for the part-worths is provided by the user and item biases. In other words, in our underspecified auxiliary regressions, we only wish to obtain the values of the effects of the variables that appear in model (3.5), which are the slope coefficients. Thus, we only need to correct for the biases in these parameters and their corresponding variances. The bias of the constant term affects neither our initial model solution nor the subsequent optimization of this solution.

To begin, let us consider how the estimate biases can be eliminated. It can be demonstrated that the expected value $E(\hat{\alpha}_2)$ of the slope of $X_2$ in the underspecified regression model is equal to the sum of the following two quantities: the true value $\beta_2$ of the slope that would be obtained from the regression of the true model, i.e., from the model that contains both $X_2$ and $X_3$; and the product of the true value $\beta_3$ of the slope of the omitted variable $X_3$ and the slope $b_{32}$ of the auxiliary regression of $X_3$ on $X_2$. Formally, this result may be expressed as follows:

$$ E(\hat{\alpha}_2) = \beta_2 + \beta_3 b_{32} \tag{3.9} $$

where $b_{32}$ is the slope in the regression of the excluded variable $X_3$ on the included variable $X_2$ (Gujarati 2004, Chapter 13). As demonstrated in (3.9), $\hat{\alpha}_2$ is biased unless $\beta_3$ and/or $b_{32}$ is zero, which would imply that $X_3$ has no effect on $Y$ and/or that $X_2$ and $X_3$ are uncorrelated. Thus, the first step for determining the bias of an estimate is the examination of the correlations between the variables. If no correlations can be determined, then the estimate of the corresponding parameter and its variance will be unbiased. However, if $X_2$ and $X_3$ are correlated, then the estimate of the slope is biased and thus must be corrected.

For instance, if "Clint Eastwood" and "Western" are the two attributes, both part-worths that are estimated from two independent univariate regressions will be overestimated because "Clint Eastwood" and "Western" are highly correlated. Thus, the use of these overestimated values in (3.5) for generating predicted user ratings of a western that stars Clint Eastwood, such as "For a Few Dollars More", would also lead to the overestimation of this rating. Thus, the part-worths of both attributes must be corrected to rule out the bias in order to be able to make realistic rating predictions.

In our example with two variables, this type of bias correction can be accomplished through the use of two additional auxiliary regressions, namely, (i) the regression of $X_3$ on $X_2$ and (ii) the regression of $X_2$ on $X_3$:

$$ X_{3i} = b_{30} + b_{32} X_{2i} + w_i \qquad \text{and} \qquad X_{2i} = b_{20} + b_{23} X_{3i} + z_i \tag{3.10} $$

The slopes $b_{32}$ and $b_{23}$ that are obtained from these auxiliary regressions can then be substituted into expression (3.9). This substitution produces a system of two equations with two unknowns that can be algebraically solved for $\beta_2$ and $\beta_3$:

$$ \begin{cases} \hat{\alpha}_2 = \beta_2 + \beta_3 b_{32} \\ \hat{\alpha}_3 = \beta_3 + \beta_2 b_{23} \end{cases} \tag{3.11} $$

In this system of equations, $\hat{\alpha}_2$, $\hat{\alpha}_3$, $b_{32}$, and $b_{23}$ are known, and $\beta_2$ and $\beta_3$ are the unknowns. The values of $\beta_2$ and $\beta_3$ may be obtained by solving system (3.11) for these two variables:

$$ \beta_2 = \frac{\hat{\alpha}_2 - \hat{\alpha}_3 b_{32}}{1 - b_{23} b_{32}}, \qquad \beta_3 = \frac{\hat{\alpha}_3 - \hat{\alpha}_2 b_{23}}{1 - b_{23} b_{32}} \tag{3.12} $$

These values represent the unbiased estimates of the effects of interest. Note that in equations (3.11) and (3.12), we omitted the expected value sign for greater simplicity; because all the terms that are involved in these equations are produced by OLS regressions, all of these terms are stochastic in nature. This feature is also the reason why we consider the solutions to equations (3.12) to be further optimizable (see footnote 52). That is, the estimates of $\beta_2$ and $\beta_3$ that are obtained here are in fact the most probable values of the true slopes in terms of expectancy theory. However, in each given setting, other parameter values that lie near the most probable values may constitute the "actual" global solution to the model estimation problem.

52 The optimization procedures will be presented in Section 3.2.2.

The next step is the correction of the variance estimates for $\beta_2$ and $\beta_3$. This correction is required because these variance estimates are used for the calculation of both the confidence limits and the significance statistic, i.e., the $t$-value of Student's t-test (Gujarati 2004, Chapter 8):

$$ t = \frac{\hat{\beta}_2}{\operatorname{se}(\hat{\beta}_2)} \tag{3.13} $$

$$ \hat{\beta}_2 - t_{\alpha/2} \operatorname{se}(\hat{\beta}_2) \le \beta_2 \le \hat{\beta}_2 + t_{\alpha/2} \operatorname{se}(\hat{\beta}_2) \tag{3.14} $$

As stated in consequences 4 and 5 above, the variance of the regression parameters in an OLS regression with an omitted variable is biased. Consequently, as may be observed from expressions (3.13) and (3.14), the $t$-value and the confidence limits are biased as well. This bias implies that Student's t-test, which is used within the regression analysis to test a parameter's significance, is likely to provide misleading conclusions. Furthermore, both the biased variance and a biased $t$-value produce errors in the calculations of the confidence limits that may increase the probability that the true value of $\beta_2$ lies outside of the predicted confidence limits.

One way to counteract this issue is to simply recalculate the variance from its definition (Gujarati 2004):

$$ \operatorname{var}(\hat{\beta}_2) = \frac{\hat{\sigma}^2}{\sum_i \left( X_{2i} - \bar{X}_2 \right)^2} \cdot \mathrm{VIF}, \qquad \mathrm{VIF} = \frac{1}{1 - R_2^2} \tag{3.15} $$

where $\mathrm{VIF}$ is the variance inflation factor, which quantifies the extent of multicollinearity in the OLS regression, and $R_2^2$ is the multiple coefficient of determination of the regression of $X_2$ on the other covariates. However, prior to completing this recalculation, it is necessary to obtain the value of the residual sum of squares $\mathrm{RSS}$ of the "true" OLS model (3.7). Given the unbiased values of $\beta_2$ and $\beta_3$ obtained from equations (3.12) and the following definition of the constant term of the regression:

$$ \hat{\beta}_1 = \bar{Y} - \hat{\beta}_2 \bar{X}_2 - \hat{\beta}_3 \bar{X}_3 \tag{3.16} $$

(Gujarati 2004), we are able to calculate the model's residuals and thereby determine the value of $\mathrm{RSS}$. Because the number of degrees of freedom equals the number of data points minus the number of regressors minus one, i.e., $df = n - 2 - 1$, we can now calculate $\hat{\sigma}^2 = \mathrm{RSS}/df$ and hence $\operatorname{var}(\hat{\beta}_2)$ from equation (3.15). After this procedure, the bias-corrected $t$-value and the confidence limits may be obtained through the use of expressions (3.13) and (3.14). Accordingly, the test of significance can now be performed using the corrected $t$-values.

In the discussion above, we presented our method to counteract the issue of underspecified OLS models in the context of our proposal to estimate the parameters of model (3.5) through the use of auxiliary regressions that each examine only one parameter at a time. The following sections of this thesis present the details regarding the estimation of the model parameters.
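To make the correction procedure concrete, the following sketch replays expressions (3.9)-(3.16) on synthetic data for two correlated binary attributes. It illustrates the logic only, not the thesis' implementation; all variable names and data are invented.

```python
# Numeric sketch of the correction pipeline (3.9)-(3.16): solve system (3.11)
# for the de-biased slopes, rebuild the constant via (3.16), and recompute
# var(beta2) with the VIF as in (3.15). All data below are synthetic.
import numpy as np

rng = np.random.default_rng(2)
n = 200
x2 = rng.integers(0, 2, n).astype(float)
x3 = np.where(rng.random(n) < 0.8, x2, 1 - x2)       # x3 correlated with x2
y = 1.0 + 0.7 * x2 + 0.4 * x3 + rng.normal(0, 0.3, n)

slope = lambda a, b: np.cov(a, b, bias=True)[0, 1] / np.var(b)
a2, a3 = slope(y, x2), slope(y, x3)                  # biased univariate slopes
b32, b23 = slope(x3, x2), slope(x2, x3)              # auxiliary regressions (3.10)

det = 1 - b23 * b32
beta2 = (a2 - a3 * b32) / det                        # solution (3.12)
beta3 = (a3 - a2 * b23) / det
beta1 = y.mean() - beta2 * x2.mean() - beta3 * x3.mean()   # constant (3.16)

rss = np.sum((y - beta1 - beta2 * x2 - beta3 * x3) ** 2)
sigma2 = rss / (n - 3)                               # df = n - 2 regressors - 1
vif = 1 / (1 - np.corrcoef(x2, x3)[0, 1] ** 2)
var_beta2 = sigma2 / np.sum((x2 - x2.mean()) ** 2) * vif   # expression (3.15)
print(f"beta2 ~ {beta2:.2f} (true 0.7), corrected t = {beta2 / var_beta2 ** 0.5:.1f}")
```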

3.2.1.2 The Estimation of User- and Item-Related Effects

User bias, item bias, and a user’s popularity reaction scale factor are assumed to be conceptually independent from each other and from user-item interactions (see Section 3.1.3). Thus, these three factors are unaffected by the omitted variable problem that has been described in the previous section. Although there may be certain “technical” correlations with other model variables, these correlations and the associated variables are not relevant to the actual effects of interest because there are no conceptual associations linking any of these three factors to other variables in the model. Thus, we can simply perform bivariate auxiliary regressions to determine not only the appropriate initial parameter values but also the significance and confidence intervals of these values. We begin by estimating the user bias parameters an OLS regression of the form

and

. For each user, we conduct

. The user’s rating trend parameter

is

3. Conceptual Framework

115

derived directly from this regression, whereas the baseline subtracting the overall rating mean from

:

is recovered from

. We choose

by

to be the cut-

off criterion for assessing the significance of the regression parameters. With respect to time resolution, we select

to denote the number of days that have elapsed since the user first

provided a rating; thus, we assume that users’ rating behaviors are constant within each day but may vary between different days. Furthermore, because new users require time to become accustomed to an RS, we assume that rating behaviors will change more rapidly for new users than for experienced users. Therefore, to prevent the overfitting of the regression to unstable fluctuations of an average user’s rating and to increase the reliability of our initial estimates, we require the standard deviation of a user’s rating time to be at least 60 days. In other words, we require the user to have provided movie ratings for at least 120 days to capture his or her drifts in rating behavior. For users who do not meet this condition and for users with values of that were determined to be insignificant in auxiliary regressions, the parameter carded from the model, and cases, the confidence limits of

is dis-

is calculated to be the mean of the user’s ratings. In these are established for

(in accordance with expression

(3.14)). In this confidence limit, is obtained from Student’s t-distribution for

and for

a number of degrees of freedom that is equal to one less than a user’s number of ratings, and is the standard deviation of differences between a user’s ratings and the overall mean. The item biases are estimated in a similar fashion, using auxiliary regressions of the form

. Once again, the time resolution of one day is established. We expect

slower changes in movie popularity than in user bias, and we therefore require the time frame between a movie’s first and last rating to be at least 240 days. The estimates for the parameters that capture a user’s reaction to movie bias can now be determined in two steps. First, we fix the user and item parameters in equation (3.5) at their estimated values and ignore the aspects of the model that address user-item interactions, i.e., we set

. We are allowed to ignore these user-item effects because they are

conceptually unrelated to the inherent traits of users and items. Given fixed parameter values for each user’s rating, we calculate the difference between the actual rating and the user’s bias ; we also determine the value of the movie bias the second step of the estimation procedure, we solve the following regression problem:

. In

116

3. Conceptual Framework (3.17)

The rationale underlying this regression is that

is intended to capture the portion of

the rating that is unrelated to user bias but instead varies with time and movie biases. Because accounts for both of these latter factors and estimate for constant term

is representative of the user’s bias, the

from expression (3.17) provides precisely this knowledge. The regression’s captures the stable portion of this effect.

Analogously to the user and item bias cases, we discard parameters that do not reach the significance level of

. Again, we require the user to have provided movie ratings for a

minimum of 120 days. For the users who either do not fulfill this requirement or display insignificant values of both regression parameters, we discard value of

from the model and establish a

. In this situation, both the upper and the lower confidence limits are also set

to , which allows for no variations of

within the optimization process. In other words, if

the user does not exhibit any statistically significant reaction to average movie ratings or is relatively inexperienced in rating movies, s/he will not be expected to have this type of reaction for the purpose of rating predictions. This condition is equivalent to the removal of and

from our model.

In the above discussion, we have obtained the initial estimates for the effects that are not involved in the user-item interaction and clarified our model of parameters that appear to be irrelevant for the description of the preferences of a particular user. In the next step, the residual variance that is associated with the actual user-item interaction must be explained by movie attributes, and the initial values of the correspondent part-worths must be estimated. The following section of this thesis is dedicated to these questions.

3.2.1.3 The Estimation of Attribute Part-Worths

In contrast to user and item biases, user attitudes toward movie characteristics are not necessarily mutually independent. In fact, these attitudes may be strongly related to each other. Thus, a moviegoer may perceive different movie attributes as a signal of the same expected "quality" of a movie. For example, Clint Eastwood may be strongly associated with protracted westerns that contain little dialogue but significant quantities of unsettling music; Pixar Studio may be strongly associated with entertaining, high-quality computer animation; Andrey Tarkovsky may be strongly associated with contemplative, surrealistically framed Soviet classics; and France may be strongly associated with Alain Delon, Gerard Depardieu and artsy plots. Moreover, correlations among movie attributes are inherently present in movie attribute data. Thus, certain actors exhibit tendencies to appear in movies of a specific genre; for instance, Bruce Willis is known to act primarily in action movies. In addition, directors tend to employ the same stars in their films; for instance, Quentin Tarantino is known to have a stable "team" of actors. Strong correlations may also exist among particular directors, genres, and producers; producers and writers; studios and directors; and many other pairs or groups of entities. Consequently, during the course of the determination of model parameters, we inevitably encounter the concern of OLS model underspecification and must account for biases in the parameter estimates and their variances in our auxiliary regressions (see Section 3.2.1.1).

However, although we proposed a method to correct for omitted variable bias, we may confront another issue during the course of parameter initialization, namely, the problem of multicollinearity. Multicollinearity is associated with the risk of the poor estimation of the coefficients in the auxiliary regressions that are specified by expression (3.10); in fact, in extreme cases, multicollinearity may entirely preclude these estimations. This issue is particularly problematic if two variables are almost (or completely) perfectly correlated, i.e., if the correlation coefficient between these variables is close to (or equal to) ±1. In these cases, the solution to (3.11) is either highly biased or indeterminate, preventing the effects of the two highly correlated variables from being reliably separated from each other (for proof, see Gujarati 2004, pp. 345-346). Thus, bias in both the parameter estimates and the variances of these estimates cannot be ruled out; again, these biases could generate incorrect conclusions about parameter significance and an erroneous estimation of confidence limits. Nevertheless, the joint effect of two highly correlated variables can be estimated through the approach that is specified in expression (3.9) (Gujarati 2004, pp. 347, 511). We utilize this property to mitigate the problem of multicollinearity in our model.

We argue that knowledge about the joint effects of two or more highly correlated variables is sufficient to describe user preferences in terms of our model (3.5): if certain attributes (nearly) always occur jointly in movies, their individual relative contributions to a user's preferences become irrelevant because these attributes always affect preferences in combination. Thus, we examine the database of movie attributes for pairwise correlations. From each pair of highly correlated attributes, we eliminate the attribute of the pair that is less helpful for discriminating between movies, i.e., the attribute that exhibits the lower variance in the dataset. Note that this elimination occurs at the global level and is not performed separately for each examined user.

By applying this rule to the datasets that are employed in our study (see Section 4.1), we eliminated all language attributes from our model. The logical rationale for this elimination is that in the USA, movies are more frequently subtitled than dubbed. Hence, the movie languages in the Netflix dataset were perfectly multicollinear with the countries of their origin. By contrast, in Germany, it is traditional to dub all movies in German; therefore, in the dataset of the Germany-based MoviePilot, there were no entries that were associated with languages other than German. Thus, languages were uninformative for the description of user preferences in the MoviePilot dataset. For the Netflix dataset, we suggest that country of origin is a more informative feature than a film's language. In particular, a list of countries possesses more differentiating power than a list of languages because the former differentiates between countries that speak the same language but exhibit cultural differences (e.g., the country list distinguishes among the English-speaking nations of the USA, the UK, New Zealand, Australia, and India, or among Argentina, Mexico, Spain, and other Spanish-speaking nations). In addition, in contrast to the language list, the country list contains different names of the same country; these names may be associated with different epochs and mentalities that may be highly influential for user preferences (for instance, consider the cultural distinctions among Germany, East Germany, and West Germany; between Russia and the Soviet Union; or between the Czech Republic and Czechoslovakia). Thus, through the application of our multicollinearity procedures, we removed the redundant attributes from our model, leaving 708 parameters to be estimated (see Sections 2.3.2 and 3.1.4 as well as Appendix B for additional details regarding the number of parameters in the model).
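This global pruning rule can be sketched as follows; the correlation cutoff of .9 is a placeholder assumption for the threshold that defines "highly correlated" in our procedure, and the function name is illustrative.

import pandas as pd

def prune_correlated_attributes(X: pd.DataFrame, cutoff: float = 0.9) -> pd.DataFrame:
    """Drop one attribute of each highly correlated pair, at the global level.

    X      : movies x attributes data (binary dummies and metric columns)
    cutoff : absolute correlation above which two attributes are treated
             as redundant (illustrative value)
    From each offending pair, the attribute with the lower variance is
    removed because it discriminates less between movies.
    """
    corr = X.corr().abs()
    variances = X.var()
    cols = list(corr.columns)
    to_drop = set()
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            a, b = cols[i], cols[j]
            if a in to_drop or b in to_drop:
                continue  # the pair is already resolved
            if corr.iat[i, j] >= cutoff:
                to_drop.add(a if variances[a] < variances[b] else b)
    return X.drop(columns=sorted(to_drop))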

In the next step of the parameter determination, we estimate the regression coefficients of the pairwise auxiliary regressions that are specified by expression (3.10). In other words, each of the attribute-describing variables is regressed on each of the remaining variables that constitute the movie attribute vector. In this procedure, we set the values of insignificant regression coefficients to zero; thus, for these regressions, we assume that one variable of an examined variable pair does not "influence" the other variable of this pair in equations (3.10) and therefore that there is no need to account for this variable pair during the process of bias correction. Through this process, we obtain a set of auxiliary parameters that will later be used to correct the biases of the estimates and their variances, as described in Section 3.2.1.1. This operation is also accomplished at the global level rather than on the level of individual users. The following rationale underlies this process. Equations (3.10) and (3.11) seek to clarify the way in which the effect of the variable that is examined in expression (3.8) on a user's ratings is affected by an omitted variable. However, note that the effect of a movie attribute on a user's rating occurs on the level of an individual user, i.e., this effect is relevant only for a specific user. By contrast, the effect of one attribute on another attribute is applicable to movie attributes in general; therefore, each auxiliary coefficient can be regarded as a quantity that adjusts the "explanatory power" of one attribute for the global-level role that this attribute plays in the effects of the other.
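The computation of these global auxiliary parameters can be sketched as follows. The .05 significance level is again an assumption, and the quadratic number of simple regressions (with roughly 700 attributes, about half a million) is computationally noticeable but feasible because it is performed only once, at the global level.

import numpy as np
from scipy import stats

def pairwise_auxiliary_coefficients(X, alpha=0.05):
    """Global pairwise auxiliary regressions between movie attributes.

    X : numpy array of shape (movies, attributes); attributes with zero
        variance should be removed beforehand (see the pruning step above)
    Returns a matrix B in which B[k, j] is the slope obtained from
    regressing attribute k on attribute j across all movies; slopes whose
    p-values exceed `alpha` are set to zero, i.e., attribute j is assumed
    not to "influence" attribute k.
    """
    n_attrs = X.shape[1]
    B = np.zeros((n_attrs, n_attrs))
    for k in range(n_attrs):
        for j in range(n_attrs):
            if k == j:
                continue
            fit = stats.linregress(X[:, j], X[:, k])
            if fit.pvalue < alpha:
                B[k, j] = fit.slope
    return B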

However, the performance of these auxiliary regressions on the global level allows us to estimate the part-worths of an individual user in a manner that accounts for the underspecification of the OLS model and reduces the impact of the multicollinearity issue. Consider an example in which a user has only rated the "Lord of the Rings" trilogy. Because all of the episodes of the trilogy were directed by Peter Jackson and involve a constant set of stars, all of the attributes that describe these films are perfectly correlated. This phenomenon would lead to equal estimates for all of the attribute parameters in our separate auxiliary regressions for these films; this equality would then cause equations (3.11) to generate an indeterminate solution for all part-worths. However, because we estimated the auxiliary coefficients at the global level, the values of these coefficients are no longer likely to be the same; in particular, Peter Jackson has directed other films besides the "Lord of the Rings" trilogy, and the stars of this series of films have acted in other movies. Thus, although the user-level estimates will remain equal in this example, the unequal auxiliary coefficients will clarify the effects of the omitted variables for the different alphas to various degrees and thereby produce a determinate solution of (3.11), as shown in (3.12).


After the aforementioned preparations have been completed, we can estimate the attribute part-worth parameters. For each user, we perform a set of regressions of the form that is specified in expression (3.18). In this regression, the parameters of interest designate the static and the time-dependent components, respectively, of each movie attribute's part-worth for the user; each component of the movie's attribute vector is a binary dummy variable that takes a value of 1 if the particular attribute is present in the movie's characteristics and 0 otherwise; the regression also contains a constant term; and the time variable is the time that has elapsed in days since the first rating in the dataset was produced. Similarly to the process of user bias estimation, in these parameter estimations, we require a user to have rated movies for at least 120 days before we attempt to capture the time-varying components of the user's part-worths. For users who did not fulfill this requirement, we discard the time-dependent components and estimate the simplified regression that is specified in expression (3.19).
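The structure of these per-user estimations can be sketched as follows; the design-matrix layout is an assumption that mirrors the verbal description of (3.18) and (3.19), and the significance testing and omitted-variable correction of the actual procedure are omitted for brevity.

import numpy as np

def estimate_part_worths(ratings, X, days, model_time=True):
    """OLS initialization of one user's attribute part-worths.

    ratings : the user's residual ratings (user and item biases removed)
    X       : binary attribute dummies of the rated movies, shape (n, p)
    days    : days elapsed since the first rating in the dataset
    If `model_time` is True, every attribute receives a static and a
    time-dependent component, mirroring the "complete" model (3.18);
    otherwise only static components are fitted, as in (3.19).
    """
    n, p = X.shape
    ones = np.ones((n, 1))
    if model_time:
        # Design matrix: [1, x_1 .. x_p, x_1*t .. x_p*t]
        D = np.hstack([ones, X, X * days[:, None]])
    else:
        D = np.hstack([ones, X])
    # Least squares; in the typically underdetermined case (p > n),
    # np.linalg.lstsq returns the minimum-norm solution.
    coef, *_ = np.linalg.lstsq(D, ratings, rcond=None)
    constant = coef[0]
    static = coef[1:p + 1]
    dynamic = coef[p + 1:] if model_time else np.zeros(p)
    return constant, static, dynamic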

This simplified OLS model is also utilized for situations in which the "complete" model (3.18) cannot be estimated because of data insufficiency. In these cases, the time-dependent parameters are once again discarded. Subsequently, to correct for omitted variable bias, the estimated parameters are pooled together with the previously derived auxiliary parameters to create, analogously to (3.11), a system of equations of the form that is specified in expression (3.20), in which each equation relates the estimated value of a parameter (i.e., a static or a time-dependent part-worth component) to the unbiased value of this parameter (for details, see Section 3.2.1.1) and to each of the remaining parameters.


This equation system is solved through the use of the SVD technique, which is described by Press et al. (2007, chapter 2.6).53 We choose to employ SVD because it is capable of handling ill-conditioned54 equation systems in a manner that provides an optimal solution in terms of least squares (Press et al. 2007). In general, as discussed above, we have approached the modeling process in a manner that seeks to minimize the risk of multicollinearity; therefore, we do not necessarily assume that system (3.20) is ill-conditioned. Nevertheless, we cannot guarantee that ill-conditioned systems will be avoided throughout the vast variety of cases that may be encountered during the estimation process. Thus, through the adoption of the SVD approach, we ensure that our algorithm can obtain a productive solution for equation (3.20) in any situation, including scenarios that involve ill-conditioned systems. Using the solution to equation (3.20), we recalculate the variances of the estimated parameters in accordance with expression (3.15), as described in Section 3.2.1.1. At this point, we are able to complete the test for parameter significance that is specified by expression (3.13). The parameters that do not reach the chosen significance level are discarded, and the confidence limits for the remaining parameters are estimated in accordance with expression (3.14). We utilize the procedure that is described above to finalize the estimation of the initial values for the parameters of our model of user preferences. As described in the introductory portion of this chapter, the initial parameter estimates are then passed to an optimization method as the coordinates of the starting point for an optimization in multiple dimensions, with the objective of obtaining a parameter solution that is closer to the optimum. The method and process of optimization are described in the next section of this thesis.

53 Because the SVD method is one of the standard methods for solving linear equations, the description of this method extends beyond the scope of the current thesis. For a detailed and comprehensive introduction to SVD, we refer to Press et al. (2007).
54 A system of linear equations is ill-conditioned if its underlying matrix is not of full rank, i.e., if linear dependencies are present among the rows or columns of the equation matrix; these dependencies would indicate relationships among the variables and/or among the equations of the system (e.g., Press et al. 2007).
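A minimal sketch of this SVD-based solution of system (3.20), in the spirit of Press et al. (2007), is given below; the relative cutoff for small singular values is an assumed tuning constant.

import numpy as np

def solve_via_svd(A, b, rel_cutoff=1e-10):
    """Least-squares solution of A x = b through the singular value
    decomposition.

    Singular values below `rel_cutoff` times the largest one are zeroed
    before inversion; this is what makes the method robust: directions of
    (near-)linear dependence in an ill-conditioned system simply do not
    contribute to the solution, which remains optimal in the
    least-squares sense.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_inv = np.where(s > rel_cutoff * s.max(), 1.0 / s, 0.0)
    return Vt.T @ (s_inv * (U.T @ b))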


3.2.2 Step 2: The Optimization of the Parameters

In the field of numerical research, the term "optimization" refers to (typically iterative) mathematical methods that attempt to determine the best available values of the parameters of a particular objective function (e.g., Press et al. 2007; Lange 2010). In our situation, we seek to find the parameter values of model (3.5) that produce the minimum possible error for the model's predictions. This objective corresponds to finding the minimum of expression (3.6), the quadratic loss function of the model. We choose the quadratic form in (3.6) because its U-shape ensures that the loss function will have a single extremum, i.e., a single definite global minimum, and because it penalizes errors of high magnitude, potentially reducing the error in the final solution.

Typical methods for solving this type of optimization problem include the steepest gradient descent method and the conjugate gradient method. Both methods are based on the same idea of an iterative process that approaches the minimum of the optimized function through stepwise updates that push the solution in the direction opposite the function's gradient, i.e., in the direction of the function's fastest descent.55 The difference between these two methods is that steepest gradient descent optimizes only one dimension in each iteration by "stepping" in the direction of the dimension that exhibits the highest value of the function's first partial derivative at a given point; by contrast, the conjugate gradient method considers all dimensions of a function's space to choose the direction of movement, ensuring that function minimization in one direction is not "spoiled" by a subsequent minimization along another direction. This approach allows us to avoid cycling through a set of directions and thereby reduces the number of iterations that are required to achieve the optimum solution (Press et al. 2007, Chapter 10.7). Figure 3.2 displays the differences between these two methods.

Although both methods are suitable for the minimization of our loss function and converge to the same solution (Press et al. 2007), we choose to employ the conjugate gradient method because this method displays higher efficiency than the steepest gradient descent method. However, due to the specific nature of our task, certain adjustments to the method must be implemented. These adjustments include (i) the initialization of the starting point of the optimization; (ii) the restriction of the optimization procedure to the confidence limits of the parameters; and (iii) measures to prevent the overfitting of the model.

55 Because both of these methods are standard, well-known optimization approaches, we do not discuss these methods in detail in this thesis but instead refer the reader to Press et al. (2007) for an in-depth description of these approaches.

Figure 3.2: Successive minimization with gradient methods. (a) The steepest gradient descent method is less efficient than (b) the conjugate gradient method. In particular, more steps are required to reach a function's minimum for the former method than for the latter; in the former method, these steps cross and re-cross the principal axis of the function. The graphs are adapted from Komarek (2004), p. 11.

We initialize the optimization process with the parameter values that are obtained from the auxiliary regressions that are described in Section 3.2.1. This initialization plays a crucial role in the convergence of the optimization method. In particular, appropriate initial parameter values not only reduce the number of iterations that are required to achieve the optimal solution but also, in combination with the restriction of the optimization process to the parameters' confidence intervals, help to ensure that the solution that we achieve is the true global optimum. Recall that for most users, model (3.5) is underdetermined (see Section 3.1.4), i.e., the number of parameters that must be estimated is greater than the number of data points that are available. This fact "relaxes" the optimization procedure and creates the possibility of achieving more than one solution through the optimization process. Note that these additional solutions, i.e., "local" optima, are not caused by the form of the loss function but instead represent a set of possible spatial dispositions of the n-dimensional U-shape that satisfy condition (3.6). By initializing the optimization with the values that are obtained through statistical techniques (see Section 3.2.1), we ensure that the starting point for the optimization already lies near the "true" minimum of the loss function (3.6). By restricting the "area" of optimization to the confidence limits of the parameters that we have determined through the auxiliary regressions, we additionally ensure that the true solution is only sought within the scope of the space in which this solution is most likely to occur. In other words, by not allowing the optimization procedure to exit the area that is bounded by these confidence limits, we remove the risk of "slipping" into the area of a local minimum instead of the global minimum of the function.

To obtain a better understanding of this issue, consider the following simplified example. We assume that our model reduces to a linear combination of three attribute part-worths and that we dispose of two user ratings that the user has assigned to two movies, each described by its attribute vector. Our task is to estimate the three components of the part-worth vector. For an estimation that uses the steepest gradient descent method, the objective function to be minimized is the squared prediction error. Because we need to estimate three parameters from only two data points, our problem is underdetermined and therefore does not possess a definite solution. In the absence of a starting point from which to begin the optimization process, we can end up at any arbitrary point in the three-dimensional space of preferences that satisfies the minimum of the error function; indeed, infinitely many solution vectors satisfy the minimum condition of the error function and (in this example) produce a zero error. However, the exact point that is obtained from the optimization is dependent on how the U-shaped loss function is initially oriented in this three-dimensional space. This orientation is essentially determined by the starting point of the optimization: solutions that lie within a "short reach" of the chosen starting point are more likely to be reached than more distant solutions. However, because the optimization process is unconstrained, i.e., because this process possesses more degrees of freedom than the number of data points that are available, the probability of arriving at other solutions is nonzero. In other words, in the absence of constraints, the U-shaped error function can change its orientation during the course of the optimization in a manner that is dependent on both the direction of the gradient and the size of the step that is performed in this direction.

Thus, the unconstrained search for the solution occurs not by moving along the U-shape of the error function but rather by performing U-shaped steps in the entire space of preferences. The imposition of interval constraints on each coordinate of the part-worth vector limits this "freedom to move" of the loss function. In our example, the solution that is found under such constraints is the one that lies within the confidence limits of each part-worth. Note that although all of the admissible solutions fit the two observed ratings equally well, they generally produce different predictions for movies that have not yet been rated; the predicted rating of a new movie therefore depends on which of the solutions the optimization arrives at, and the constraints ensure that this solution is the most plausible one.

Another issue that is caused by the underdetermination of the model is the tendency towards overfitting (e.g., Koren 2009), i.e., the determination of parameter values that fit the available data well but exhibit large errors in predictions for data that are not included in the optimization process. To counteract overfitting and thereby ensure that the model developed in this thesis is generalizable and suitable for predictions of future ratings, we utilize a holdout set of six randomly drawn ratings for each user. The ratings that are contained in the holdout set are completely excluded from the entire procedure of learning the parameter values, i.e., these ratings are used for neither the auxiliary regressions nor the parameter optimization. Instead, they are used in the gradient method to determine the stop point of the optimization that prevents overfitting. In particular, for each iteration of the optimization process, the value of the loss function (3.6) is independently calculated for the holdout data set. The optimization is stopped after an iteration of the optimization process decreases neither the error value for the holdout data nor the "original" error value of the method. Figure 3.3 illustrates the flowchart of the optimization step of our algorithm; our adjustments to the original method are marked in bold.

In accordance with its definition, the gradient of the loss function is calculated as the set of partial derivatives with respect to each parameter of the model. During each iteration of the optimization process, these parameters are adjusted in the direction opposite the gradient by a magnitude that is proportional to the overall step size, as described by expressions (3.21). In these equations, sk denotes the step size in the direction of the k-th parameter, and e designates the prediction error of a user's ratings, which is calculated from the parameter values of the current iteration of the optimization process.
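The following toy computation reproduces the logic of the example above with invented numbers: two observed ratings, three part-worths, a quadratic loss, and plain (projected) gradient descent. The attribute vectors, ratings, and confidence limits are all assumptions made purely for illustration.

import numpy as np

# Two rated movies described by three attributes: an underdetermined
# system in which infinitely many part-worth vectors fit both ratings.
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
r = np.array([4.0, 3.0])

def gradient_descent(w0, bounds=None, lr=0.05, steps=2000):
    """Minimize ||X w - r||^2 from a chosen starting point, optionally
    clipping every coordinate to its confidence interval after each step."""
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(steps):
        w -= lr * 2.0 * X.T @ (X @ w - r)
        if bounds is not None:
            w = np.clip(w, bounds[0], bounds[1])
    return w

# Different starting points reach different zero-error solutions:
print(gradient_descent(np.zeros(3)))                 # ~ [1.67, 0.67, 2.33]
print(gradient_descent(np.array([5.0, 5.0, -5.0])))  # ~ [6.67, 5.67, -2.67]

# Interval constraints (here: assumed confidence limits) confine the
# search to the region in which the "true" solution is expected:
lo, hi = np.array([2.5, 1.5, 0.5]), np.array([3.5, 2.5, 1.5])
print(gradient_descent(np.zeros(3), bounds=(lo, hi)))  # ~ [3.0, 2.0, 1.0]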


Figure 3.3: A flowchart of the optimization step. The procedure starts with the initial parameter values and with the error-function values for the training set and the holdout set initialized to infinity. In each iteration, the loss function is calculated for the training set and for the holdout set; if neither value has decreased relative to the previous iteration, the parameter values are saved and passed to the hybridization step. Otherwise, the gradient and the conjugate direction for the optimization are calculated together with the step sizes in each direction, and each parameter's value is adjusted according to the method; any parameter whose new value falls outside its confidence limits is set to the boundary value of its confidence interval. Boldfaced font is utilized in the figure to indicate our modifications to the original method.
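In code, the control flow of Figure 3.3 can be sketched as follows. A plain gradient step stands in for the conjugate-direction update of the actual method, and all names, as well as the fixed step size and the iteration cap, are placeholders.

import numpy as np

def optimize(params, loss_fn, grad_fn, lower, upper, train, holdout,
             step=0.01, max_iter=10_000):
    """Constrained gradient optimization with holdout-based stopping.

    loss_fn(params, data) -> value of the loss function (3.6)
    grad_fn(params, data) -> gradient of the loss with respect to params
    lower, upper          -> per-parameter confidence limits
    """
    e_prev = eh_prev = np.inf
    for _ in range(max_iter):
        e, eh = loss_fn(params, train), loss_fn(params, holdout)
        if not (e < e_prev or eh < eh_prev):
            break  # neither the training nor the holdout error improved
        e_prev, eh_prev = e, eh
        params = params - step * grad_fn(params, train)
        # A parameter that leaves its confidence interval is set to the
        # boundary value of that interval.
        params = np.clip(params, lower, upper)
    return params  # saved and passed on to the hybridization step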

θk ← θk − sk · ∂e/∂θk, for each model parameter k (3.21)

Using the procedure that has been described above, we can obtain the final estimates for the parameters of the model of user preferences that is specified in expression (3.5); we can now predict users' future ratings and provide recommendations to these users. Knowledge about the significance and values of the examined parameters allows for the generation of explanations of these recommendations in the pros-and-cons style that is discussed in Section 2.2.3. At this point, we have fully described the recommendation algorithm that achieves the two main objectives of this thesis, namely, the generation of movie recommendations that account for attribute-based user preferences and the provision of actionable explanations of the rationales underlying these recommendations. However, this algorithm can only be effective for users who actually base their preferences on movie characteristics. That is, if a user relies on criteria other than movie attributes to select a movie to watch, the algorithm will most likely fail to produce helpful recommendations because the user's movie choices will not reflect his or her attribute-based preferences. Thus, before we proceed to the empirical test of our proposed algorithm, we will discuss the hybridization of our recommendation algorithm with the item-based collaborative filtering method; this hybridization not only helps us overcome this problem with respect to the movie choices of certain users but also allows us to provide effective recommendations and explanations to the users who experience this issue. The next section of this thesis is dedicated to this topic.


3.3 Hybridization with Collaborative Filtering

3.3.1 The Motivation for Hybridization

In our discussion in Section 2.3.2, which we initiated with the assertion that movies are experience goods, we emphasized the contrast between the hedonic nature of movie consumption and the utilitarian consumption of various other goods. These hedonic aspects of movie consumption, in combination with the problem of automatically extracting meaningful and preference-relevant attributes from multimedia content, complicate the derivation of movie attributes that are descriptive of the preferences of movie consumers. Consequently, although the preference-relevant movie attributes that we identified for the operationalization of consumer tastes were carefully derived to capture the major portion of these preferences, in certain cases, these attributes might not fully address all of the movie aspects that determine the preferences of consumers. For example, certain characteristics of a movie's quality, such as the depth and dynamics of character development, may frequently be described well by the attributes of "actors", "writers" and "directors" because these attributes undoubtedly contribute to character development and tend to exhibit general tendencies or affinities that correlate well with the aforementioned characteristics. However, in particular movies, these associations may not necessarily surface. On the other hand, a consumer for whose preferences character development plays an essential role may not always consider this characteristic desirable and may, for instance, disfavor protracted stories. Furthermore, a consumer may have conflicting tastes that depend on the context in which s/he watches a movie: in certain situations, this consumer may prefer thoughtful motion pictures with complex storylines, whereas in other settings, s/he may be more interested in light-hearted entertainment movies. A consumer may also devote attention to other aspects of movies that do not correlate with our list of attributes, such as an overall "message" that deeply impacts the consumer's soul or the degree to which the plot of a movie is aligned with the consumer's personal experiences. Finally, the data on which we base our estimations of a user's attribute preferences may simply be insufficient for our suggested procedure to uncover the user's preference structure.


Although we proposed a method that is capable of estimating part-worths under underdetermined conditions, this approach cannot extract preferences that are not reflected in the analyzed dataset. For instance, if a user has a strong attitude towards a particular actor but has not rated any movie that features this actor, our algorithm would have no basis for deducing this particular preference of the user in question. This concern is an example of the overspecialization problem that is inherent to content-based RS techniques (see Section 2.1.3.3). Several of the aspects that have been mentioned above could be accounted for through the introduction of interaction effects into our model of user preferences. However, this modification would increase the complexity of our model, which is already complicated, by an order of magnitude; this additional complexity could render it nearly impossible to reliably estimate the model's parameters. Other aspects, such as the movie consumption context, cannot be addressed in our approach without the use of additional information that can only be collected through interactions with the user. These interactions can radically decrease recommendation efficiency and may cancel out the benefits of an RS to a movie consumer. Additional information, i.e., additional ratings, may also be required to counteract the overspecialization problem by determining the part-worths of the attributes that are inapplicable to the movies that have been rated by the user.

Recommendation approaches that do not utilize item attributes during the course of the recommendation process can help to counteract the potential problems that have been mentioned above. Because these approaches do not rely on item attributes, they are more likely than content-based methods to be able to capture relationships between ratings and items that extend beyond attribute preferences. Thus, these approaches may produce reliable predictions of the user's preferences in situations in which the concepts that underlie these relationships are highly valuable to a user and cannot be satisfactorily captured by our proposed content-based method. Furthermore, because these approaches are not subject to the overspecialization problem, they are able to predict ratings for movies with attributes that could not be addressed by our approach. This feature may allow us to enrich the set of movies that may be examined for recommendation purposes with movies that exhibit higher predicted ratings than the ones selected by our method, thus potentially increasing the effectiveness of the recommendations that can be generated for a user.


Thus, it appears sensible to extend our approach by including predictions that are provided by non-content-based recommendation techniques. The two questions that we must answer in this context are (i) which method(s) to combine with our proposed algorithm and (ii) how this combination should be implemented.

3.3.2 The Selection of a Hybridization Method

As discussed in Section 2.1.4, several strategies can be followed for the construction of a hybrid recommender. However, note that the majority of hybridization strategies strictly seek to increase the accuracy of the recommendations that an RS can provide, whereas accuracy is only one of the concurrent objectives of the current thesis. In Section 2.2.1, we demonstrated that the explanations of recommendations play an important role in determining not only users' perceptions of the transparency of an RS but also users' acceptance of and trust in an RS. Moreover, explanations increase the effectiveness of user choices. Thus, we search for a hybridization approach that addresses the problems that have been described in the previous section but maintains the advantages of explanations. Our hybrid solution must balance the three concurrent aims of our thesis, which can be summarized in the following sentence: "The provision of the best possible recommendations and explanations of the reasons underlying these recommendations to the highest possible number of users of an RS" (see Section 1.2 as well as the introduction to Chapter 3).

From the discussion of different explanation styles that was provided in Section 2.2.2, recall that each recommendation method is associated with a particular explanation style; these associations arise from the specific nature of the recommendation generation method that is utilized by each recommendation approach. Each explanation style exhibits a different potential to increase user satisfaction with an RS and to improve users' abilities to accurately assess the true quality of recommended items. Among all of the explanation styles that were examined, the nearest neighbor explanation style that is inherent to the user-based CF approach produces the worst performance and may even decrease users' levels of trust in an RS. By contrast, the keyword and influence explanation styles (which are available to CB methods and to item-based CF approaches, respectively) were found to be effective at facilitating accurate assessments by users. Although no overall consensus exists regarding which of these two explanation styles is better, the combination of these styles generates optimal results in terms of both overall user satisfaction and the quality of user assessments of recommendations. To complete our discussion of different explanation styles and their corresponding recommendation methods, note that MF methods allow for no meaningful explanations because they base their recommendations on an uninterpretable factor solution (see Section 2.1.1.3).

As mentioned above, among the explanation styles that have been examined in the literature, the most effective explanations combine the keyword explanation style and the influence explanation style. The keyword explanation style is a less extensive version of the pros-and-cons style (see Section 2.3.1.3) that is already implemented by our proposed method; therefore, to achieve our objective of providing effective explanations that accompany accurate recommendations, our suggested hybrid recommender should combine the proposed method with a method that permits the use of the influence explanation style. Thus, an item-based CF method should be utilized to extend our proposed content-based approach.

It is relatively straightforward to determine how the predictions of these two methods should be combined. In Section 2.2.3, we argued that only three hybridization schemes are conceptually suitable for building a hybrid that maintains the explicability of recommendations: the parallelized switching hybrid, the monolithic feature-augmenting hybrid, and the pipelined cascade hybrid.56 These three options are the only acceptable hybrid choices because we are not allowed to combine the prediction results from different recommendation methods through either the use of mathematical operations, e.g., by averaging or weighting, or the tight interweaving of several individual recommendation methods. These types of combinations would produce recommendation results that were dissociated from the explanations that could be provided by the "original" recommendation method; therefore, these combinations would not generate hybrids that could satisfactorily explain the reasoning behind a recommendation. At this point, the development of the CB component of our method has been completed, and we dispose of the "technical" knowledge that is required for judging the suitability of different hybridization schemes for our recommendation methods; thus, we can now make our final choice of a hybridization design.

It can be determined that the pipelined cascade scheme is unsuitable for hybridization in our situation because our CB algorithm and an item-based CF method lack the capability to refine each other's results. In particular, it is unclear how the results of the CB algorithm should be processed by a CF method to calculate the similarities between pairs of items, and the CF rating predictions are not meaningful for the subsequent extraction of part-worths by the CB algorithm (because a pipelined cascade would simply determine the "preferences among CF preference predictions"). Moreover, as indicated above, these two approaches address different groups of users, who form their preferences in distinct ways. Thus, it makes little sense to refine the results of one RS component through the use of another RS component that is intended to "serve" other users.

Analogously, the monolithic feature-augmenting hybridization approach is not reasonable for our purposes. Although we can augment the user-item rating matrix with the user attribute part-worths that are produced by our CB algorithm, the subsequent application of the item-based CF technique to the resulting matrix would not generate enhanced results. Because item-based CF bases its recommendations on the similarities between pairs of items rather than on similarities between pairs of users, the attribute part-worths of a user would not provide any useful information for the item-based CF approach. It can be argued, however, that the user-item matrix could instead be augmented with the vectors of movie characteristics, which would provide additional information that could improve the calculations of the similarities between pairs of items. From the perspective of hybridization, however, this type of approach would only enhance the calculations within the item-based CF component of the hybrid and would still require a rule to connect the results of the augmented CF with the CB component of the method. Although this type of enhancement of item-based CF is interesting and may eventually prove fruitful (see Section 2.1.4.2), we leave this topic for future research for the following reasons. First, this thesis is focused on the development and testing of the quality of our new CB recommendation algorithm, which is intended to achieve the main objectives of providing accurate recommendations and effective explanations of these recommendations. In this context, the hybridization of this algorithm with a CF approach causes this algorithm to assume a subordinate role of helping to ensure that all of the users of an RS are provided with recommendations.

56 For a description of different hybridization designs, see Section 2.1.4.

Furthermore, we argued that the pros-and-cons explanation style that is provided by our CB algorithm is more beneficial for users than the influence explanation style of item-based CF (see Sections 2.2.2 and 2.3.1.3). Thus, we would like to maximize the number of users who receive recommendations from our algorithm rather than from the CF component of the hybrid. By contrast, relative to an approach that only utilized our CB algorithm, the use of this CB algorithm to enhance a CF approach might demonstrate improved prediction accuracy but would produce lower explanation quality (because of the inferiority of the influence explanation style to the pros-and-cons explanation style). Moreover, an increase in the prediction accuracy of item-based CF would not increase the quality of the explanation style of this approach because the reason for providing recommendations (namely, similarities between pairs of items) would remain unchanged. Finally, as stated above, the enhancement of CF would not clarify how an appropriate hybridization should be accomplished. Thus, in balancing the three concurrent aims of our thesis, we place greater value on finding a method of hybridizing our CB algorithm with a method that is already known to provide reasonably good recommendations than on increasing the prediction accuracy of an established method.

The last hybridization design that remains from the three designs that were proposed as conceptually suitable alternatives is the parallelized switching hybrid. For a given user, this scheme delivers as final ratings the "raw" predictions of whichever individual component of the hybrid performs best in terms of prediction accuracy. Thus, in our case, users whose preferences can be captured better by our CB model than by the CF method will receive the recommendations that are produced by our model (as specified in expression (3.5)), accompanied by explanations in the pros-and-cons style, which is the most effective of the individual explanation styles that have been examined. Users for whom our CB algorithm fails to produce accurate recommendations will receive recommendations that are predicted by the item-based CF component of our hybrid; these recommendations will be explained in the influence explanation style, which is the best alternative to the pros-and-cons style. In other words, this hybridization scheme allows both recommendations and actionable explanations to be provided to users who form their preferences based on movie attributes. Moreover, this hybridization approach includes a "fallback" method that is capable of providing recommendations with explanations to users who form their preferences based on movie properties other than movie attributes. Thus, we can state that the switching hybridization scheme balances the three concurrent aims of our thesis and is suitable for the implementation of our hybrid recommendation method.

To compare the accuracy of the two methods that constitute our hybrid, we suggest utilizing the holdout set of six randomly drawn ratings per user that we used to determine the stop point of the optimization process (see Section 3.2.2). In particular, we chose to utilize the exact same holdout set that was employed in the optimization process, although one might reasonably argue that our model of user preferences has already been trained on these data. For the following reasons, we argue that this consideration is of relatively low importance with respect to comparisons of the accuracy of the two methods of the hybridized approach. First, the holdout set was employed to increase the generalizability of the estimated model parameters and to prevent the overfitting of the model to the training data. Because the ratings of the holdout set were excluded from the actual optimization procedure and used to "externally" calculate the value of the loss function with respect to the training data, the optimization process ensures that the prediction accuracy of the model for unseen data should be very similar to the prediction accuracy of this model for the holdout set. Second, if our model of user preferences overfits the holdout data so that the "combining" algorithm prefers our model's predictions to the CF predictions for the generation of final recommendations, the effect of this preference would simply be a decrease in the overall accuracy of the hybrid method. This effect provides our hybrid method with no advantage for the purpose of the comparisons of the prediction accuracy of different recommendation algorithms that will be provided in Section 4.4. Finally, although we admit that the use of a separate holdout set that is not employed in any calculations other than the comparison of the predictive accuracy of the hybridized methods would be methodologically desirable, the conditions of our empirical study do not allow us to construct a holdout set in this manner. In particular, for the empirical validation of our algorithm and for the comparison of the predictive accuracy of our model with other recommendation techniques, we must utilize other holdout sets, which we suggest should also contain six ratings each (see Section 4.1). For the comparison to be valid, these ratings should be excluded from each user's profile, i.e., these ratings should not be used except for the purposes of comparison; this exclusion avoids the prospect that these data can be "learned" by any of the algorithms that are compared.

Thus, in our empirical study, we actually require three holdout sets, which would constitute a total of 18 ratings per user. Given that the median number of ratings per user is 25 in the MoviePilot dataset (see Table 4.1) and that we require another six ratings per user to estimate the parameters of our model, we risk exhausting most of the available data for the purpose of constructing holdout sets. In this situation, users who have only a few ratings would be underrepresented in our empirical study, which would raise questions about the generalizability of the study's results. This issue is less problematic in practical settings, which do not require a validation holdout set; there, the reduction of the number of holdout sets by one provides an RS with the advantage of a shorter "warm-up" stage, enabling it to generate recommendations after receiving as few as 12 ratings from a user (6 ratings for the holdout set and another 6 ratings to permit the estimation of part-worths). Moreover, in practical settings, new data constantly enter an RS. Hence, an RS can be designed that "produces" an independent holdout set once enough ratings from the user have been acquired; this process could potentially increase the accuracy of the hybridized approach as a whole. However, this possibility is not feasible in our experiments, which analyze static data. Thus, we decide to accept potentially inferior accuracy for the CB component of our hybrid in exchange for the benefits of increased generalizability of the study results and greater attractiveness of our method for practitioners. It can be argued, moreover, that if our hybrid method exhibits superior performance relative to other recommenders under the conditions of the current study, then this method will achieve even higher accuracy if an independent holdout set is utilized in practice.

In summary, for the hybrid of our proposed algorithm and the item-based CF approach, we suggest employing the switching hybridization design, which involves generating predictions of future ratings using whichever of the two methods performs best on the same holdout set that is used in the optimization procedure that is described in Section 3.2.2. To determine the method with the better performance, we propose utilizing Student's t-test for paired samples. Whichever of these two methods exhibits a significantly lower prediction error on the holdout set is considered to be the more appropriate approach and is used for future predictions. However, if the difference between the errors is not significant, we will use the predictions of model (3.5), even if this model produces greater errors than the item-based CF approach on the holdout set. By adopting this decision rule, we trade formal accuracy for the prospect of a more effective explanation.
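The switching rule can be sketched as follows; the .05 significance level is an assumption, and the two arrays are taken to hold the per-rating prediction errors of the CB model (3.5) and of the item-based CF component on the same six-rating operation holdout.

import numpy as np
from scipy import stats

def choose_component(err_cb, err_cf, alpha=0.05):
    """Switching rule of the hybrid, evaluated on the operation holdout.

    Returns "CF" only if item-based CF is *significantly* more accurate;
    ties and insignificant differences go to the CB model, trading formal
    accuracy for the more effective pros-and-cons explanations.
    """
    t_stat, p_value = stats.ttest_rel(err_cb, err_cf)  # paired t-test
    if p_value < alpha and np.mean(err_cf) < np.mean(err_cb):
        return "CF"
    return "CB"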


The discussion in this section concludes the description of our proposed conceptual framework for a hybrid recommender system that allows for effective explanations of recommendations. In the next chapter, we present an empirical study that evaluates our proposed method and compares it with key recommendation algorithms.


Chapter 4

Empirical Study

In the previous chapters, we built theoretical foundations and conceptually developed a recommendation method that achieves our objectives. This recommendation method is capable of providing both (accurately predicted) recommendations and actionable explanations of the reasoning underlying these recommendations; moreover, this method can appropriately align the recommendation process with user preferences. This alignment of the recommendation process with user preferences is an inherent feature of the design of the method (see esp. Sections 2.3 and 3.1), and the developed method's ability to provide actionable explanations is justified theoretically (see esp. Sections 2.2, 3.1 and 3.3.2); however, the statement that the predictions of the developed model are accurate requires proof. In fact, our method suggests the estimation of a considerable number of parameters, a number that often exceeds the number of data points that are available for the estimation procedure (see Section 3.2). This fact may create doubt as to whether the estimates of this method are capable of reliably predicting user preferences and whether these estimates are sufficiently accurate relative to established recommendation methods. Thus, proof that our method is applicable in real-world recommendation systems and provides advantages for these systems is required.

Through the hybridization approach of our proposed method, we ensured that the hybrid method's predictions will be at least as accurate as the predictions of its item-based CF component (see Section 3.3.2). Thus, another question of interest with respect to our hybrid method is how frequently our proposed preference estimation approach will be applied relative to the frequency with which the CF component will be used for the generation of the final rating predictions.


The relative frequency of use of our proposed preference estimation algorithm will also determine the relative frequency with which users receive pros-and-cons explanations, which are the most effective type of explanations, rather than influence explanations, which are the second most effective type of explanations.

To answer these questions, we conduct an empirical study that tests different recommendation techniques on real-world rating data. In particular, we examine a dataset from MoviePilot.com, a German movie recommendation system, and a dataset from Netflix.com, a US-based online DVD rental service. The use of two datasets for our tests ensures that the comparison results are generalizable to other instances of movie recommendations and demonstrates the potential portability of our method to various recommendation domains. We compare the accuracy of our proposed method with the accuracy of the key collaborative recommendation techniques that were described in detail in Section 2.1: user-based CF, item-based CF and the matrix factorization method. Because matrix factorization is known to provide one of the best predictive accuracies among "pure" (i.e., non-hybridized) recommendation algorithms (e.g., Funk 2006; Paterek 2007; Bell, Koren, and Volinsky 2007b, 2008; Koren 2009), we suggest that the comparison of our method with this algorithm will be the most informative comparison for assessing our method's relative prediction accuracy.

The comparison is performed through the use of holdout data that are not employed in the training procedure of either algorithm. To establish this holdout set, we removed the six newest ratings of each user's profile from the training datasets, thus ensuring that neither of the compared recommenders can fit its underlying model to these ratings. The task of the algorithms therefore consists of accurately predicting the user ratings for these holdout data. The differences between the predicted and actual ratings are then utilized to calculate the accuracy metrics that form the basis of the comparison results. Further details regarding the comparison procedure, the data that are examined, and the results of the empirical study are provided in the following sections of this thesis.


4.1 The Examined Datasets and Their Properties

As mentioned above, two real-world datasets are utilized in our study. We chose to use the Netflix dataset because this dataset has been employed in the majority of recent recommender research; this phenomenon reflects the interest of researchers in the Netflix Prize competition, which promised a prize of one million dollars to an individual or a team that proposed a recommendation algorithm that exceeded the prediction accuracy of Netflix's own recommender by 10% with respect to RMSE. Thus, the use of the Netflix dataset to assess the accuracy of our algorithm makes our results comparable to the findings for a variety of other recommendation methods that have been discussed in recently published investigations.

The MoviePilot dataset was used in this investigation for the following reasons. First, the current research has been performed within the context of a research project that is funded by the German Research Foundation (Deutsche Forschungsgemeinschaft; DFG) and in which MoviePilot acts as a cooperation partner. This cooperation allowed us to obtain full access to various types of information that could have influenced the rating data, such as changes in the labeling of the rating scale or updates to user interfaces. This information is not available for the Netflix data, although it is known that Netflix has altered its scale labels in the past (Koren 2009). However, no exact details about the type of alteration and the date when this alteration was implemented have ever been published. From his analysis of the Netflix data, Koren (2009) infers that this alteration might have occurred in early 2004 because the mean movie rating increases sharply at this time in a manner that would otherwise be difficult to explain. Furthermore, Netflix has only released a subset of its rating data; the firm has stated that these released data were randomly drawn from its complete rating dataset. However, as a commercial provider that is funding a considerable prize, Netflix could have "integrated" certain artifacts into the published dataset. For example, one of the users in the dataset has rated over 17,000 movies. If the average duration of a movie is assumed to be 90 minutes, this individual must have been watching movies constantly and without any breaks for nearly three years. If this user spent only eight hours each day watching movies, s/he would have required more than eight years to watch all of the movies that s/he has rated. This prospect appears rather unrealistic. Netflix has not provided any comments regarding this or other artifacts that might have been artificially introduced into its dataset. In contrast to the Netflix dataset, the MoviePilot dataset that we examine is a complete set of all of the ratings that have been provided to the MoviePilot recommender system by its users.

Each dataset represents a relational database with two tables. The first table contains four fields: "user_id", "movie_id", "timestamp" and "rating"; thus, each row of the table assigns a rating to a specific user and a specific movie as well as to the exact date and time at which the rating was recorded by the system. The second table consists of two fields: "movie_id" and "movie_title". To reduce ambiguity, movie titles are supplemented with the year of production of each movie.

The ratings in the Netflix dataset are represented on a 5-point scale with 1-point steps; on this scale, 1 denotes the worst rating for a movie and 5 indicates the best rating for a film. In Netflix's user interface, the scale points correspond to the number of stars that a user gave to a movie (see Figure 4.1a). MoviePilot presents its users with an 11-point scale that ranges from 0 ("Hated This Movie") to 10 ("My Favorite Movie") with steps of .5 points; however, the user ratings are saved in the database as values from 0 to 100 that correspond to ten times the rating that was directly provided by the user (i.e., a rating of 7.5 points is stored as 75). In the MoviePilot interface, the ratings are obtained from the users through the use of a horizontal scale bar that supports a gradient fill effect and changes the caption text to reflect the number of points that have been selected (see Figure 4.1b).
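For illustration, the two tables can be mirrored by the following schema; only the field names are taken from the description above, whereas the SQL types, the SQLite backend, and the conversion helper are assumptions made for the example.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ratings (
    user_id   INTEGER,
    movie_id  INTEGER,
    timestamp TEXT,    -- date and time at which the rating was recorded
    rating    INTEGER  -- Netflix: 1..5; MoviePilot: 0..100 (ten times the scale points)
);
CREATE TABLE movies (
    movie_id    INTEGER,
    movie_title TEXT   -- supplemented with the year of production
);
""")

def moviepilot_to_scale_points(stored_value: int) -> float:
    """Convert a stored MoviePilot rating (0..100) back to scale points."""
    return stored_value / 10.0  # e.g., 75 -> 7.5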

Figure 4.1: The rating scales of the user interfaces of the recommender systems. (a) The Netflix rating scale, which includes the following captions for ratings from 1 to 5 stars: "hated it", "didn't like it", "liked it", "really liked it", and "loved it". (b) The MoviePilot rating scale, which includes the following captions for each rating interval of 3.5 points: "hated the movie", "not interested", "average", "good", and "my favorite movie".


Table 4.1 presents the descriptive statistics for the raw MoviePilot and Netflix datasets.

Table 4.1: Descriptive statistics for the raw rating datasets

                              MoviePilot                    Netflix
General characteristics
  Number of ratings           1,389,749                     100,480,507
  Number of users             14,528                        480,189
  Number of movies            12,762                        17,770
  Scale interval              0 – 10 (0 – 100)              1 – 5
  Scale step size             .5 (5)                        1
  Time range                  19-AUG-2006 – 04-APR-2008     11-APR-1999 – 31-DEC-2005
Ratings per user
  Min                         1                             1
  Max                         6,687                         17,653
  Mean                        95                            209
  Median                      25                            96
  SD                          214.17                        302.33
Ratings per movie
  Min                         1                             3
  Max                         6,546                         232,944
  Mean                        108                           5,654
  Median                      13                            561
  SD                          345.62                        16,909.67
Ratings per day
  Min                         1                             5
  Max                         78,164                        737,570
  Mean                        2,583                         46,049
  Median                      1,548                         15,499
  SD                          4,498.33                      58,558.61

However, to perform our tests, both datasets were reduced in the following ways. We removed the six newest ratings of each user as a holdout set for out-of-sample predictions and for the computation of accuracy measures for the different recommender algorithms (in the following, we refer to this holdout set as the "validation set"). Another six ratings were drawn randomly from each user's rating profile to construct a holdout set for assessing the operation of our proposed algorithm (this set is referred to as the "operation holdout"; see Sections 3.2.2 and 3.3.2). Users for whom there was insufficient data to generate both holdout sets were discarded from the examined datasets, as were users for whom fewer than six ratings remained after both holdout sets had been isolated. The descriptive statistics for the resulting datasets are summarized in Table 4.2.
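Before turning to those statistics, the holdout construction just described can be sketched as follows. This is a minimal illustration, assuming the ratings are held in a pandas DataFrame with the columns introduced above; it is not a reproduction of the study's actual code, and the combined discarding threshold follows the rules stated in the text.

    import pandas as pd

    def split_holdouts(ratings: pd.DataFrame, n_holdout: int = 6, min_train: int = 6):
        """Per-user holdout construction as described in the text (illustrative sketch)."""
        train, operation, validation = [], [], []
        for _, user_ratings in ratings.groupby("user_id"):
            # Discard users who cannot supply both holdouts plus min_train training ratings.
            if len(user_ratings) < 2 * n_holdout + min_train:
                continue
            user_ratings = user_ratings.sort_values("timestamp")
            newest = user_ratings.tail(n_holdout)              # validation set: six newest ratings
            remaining = user_ratings.drop(newest.index)
            random_six = remaining.sample(n=n_holdout, random_state=0)  # operation holdout
            train.append(remaining.drop(random_six.index))
            operation.append(random_six)
            validation.append(newest)
        return pd.concat(train), pd.concat(operation), pd.concat(validation)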

Table 4.2: Descriptive statistics for the datasets that are employed in the study

                        MoviePilot                                     Netflix
                        training       operation      validation      training       operation      validation
                        set            holdout        set             set            holdout        set
General characteristics
  Number of ratings     1,140,577      47,610         47,610          93,170,314     2,570,310      2,570,310
  Number of users       7,935          7,935          7,935           428,385        428,385        428,385
  Number of movies      12,246         5,052          5,037           16,543         16,241         16,212
  Time range            19-AUG-2006 -  20-AUG-2006 -  20-AUG-2006 -   11-NOV-1999 -  06-JAN-2000 -  05-JAN-2000 -
                        04-APR-2008    04-APR-2008    04-APR-2008     31-DEC-2005    31-DEC-2005    31-DEC-2005
Ratings per user
  Min                   1              6              6               8              6              6
  Max                   6,535          6              6               16,419         6              6
  Mean                  143            6              6               217            6              6
  Median                59             6              6               101            6              6
  SD                    250.16         0              0               304.50         0              0
Ratings per movie
  Min                   1              1              1               2              1              1
  Max                   4,543          802            677             213,367        15,816         12,354
  Mean                  93             9              9               5,623          158            158
  Median                12             3              2               544            20             18
  SD                    262.90         36.23          35.72           16,305.89      624.28         603.76
Ratings per day
  Min                   1              1              1               5              1              1
  Max                   56,194         3,413          3,629           703,924        27,936         17,202
  Mean                  2,120          106            104             42,631         1,283          1,242
  Median                1,293          50             52              15,167         38             61
  SD                    3,451.80       201.44         206.64          53,378.06      3,423.97       2,820.47

Data regarding movie attributes (genres, star actors, directors, writers, production companies, budgets, admissions, box-office grosses, years of production, countries of origin, and certifications; see Section 2.3.1.2 for the derivation of these attributes and Appendix B for a detailed list of the specific attributes) were obtained from IMDb under the restrictions of a limited, non-commercial license.57 These data are provided as a set of text files that maintain the connections between a particular movie title and a list of specific types of attributes (e.g., actors or countries of origin). We converted the text files to a database format that is more convenient for our calculation purposes. In particular, we constructed a data table in which each row represents a movie and each column represents a specific attribute. Nominal attributes (such as the presence of particular actors, the involvement of particular directors, and a film's country of origin) were coded as binary variables that each take a value of 1 if a particular attribute is present in a movie's characteristics and 0 otherwise. Metric attributes (admissions, budget, box-office gross, and year of production) were recoded as follows. Movie budgets and box-office grosses were converted to a common currency (US dollars) to unify the measurement units and thus to increase the consistency of the estimation of the corresponding parameters of model (3.5). The movies' years of production were recoded as the number of years before the current year (2011). This recoding reduces the magnitude of the variable's values by roughly three orders of magnitude (e.g., the year 2009 is recoded as 2), simplifying comparisons between the production year's effect on a user's preference and the effects of the nominal attributes during "manual" inspections of parameter values. The recoding also alters the interpretation of the corresponding parameter: negative values now indicate a preference for newer movies, whereas positive values reveal a preference for older films. Thus, the meaning of the parameter becomes a "preference for older movies". Because this recoding is an affine transformation of the data (a sign reversal combined with the addition of a constant), it has no effect on either the estimations or the predictions of our algorithm; it strictly functions as a method of making visual inspections of part-worth values more convenient. Because admissions are already scaled in common measurement units (the number of tickets that have been sold at movie theaters), the values of this attribute were not modified.
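A minimal sketch of this coding scheme follows; the column names and the structure of the raw inputs are hypothetical, while the reference year 2011 is the one stated above.

    import pandas as pd

    REFERENCE_YEAR = 2011  # years of production are recoded as years before 2011

    def build_attribute_table(movies: pd.DataFrame, nominal_attributes: dict) -> pd.DataFrame:
        """Code nominal attributes as 0/1 dummies and recode the metric attributes."""
        table = pd.DataFrame(index=movies.index)
        # Nominal attributes (actors, directors, genres, countries, ...): 1 if the
        # attribute is present in a movie's characteristics, 0 otherwise.
        for name, movie_ids in nominal_attributes.items():
            table[name] = movies["movie_id"].isin(movie_ids).astype(int)
        # Metric attributes: budgets and grosses are assumed to be converted to USD
        # upstream; admissions (tickets sold) are left unmodified.
        for column in ("budget_usd", "gross_usd", "admissions"):
            table[column] = movies[column]
        # Production year becomes "age in years": negative part-worths then indicate
        # a preference for newer movies, positive part-worths for older ones.
        table["age"] = REFERENCE_YEAR - movies["production_year"]
        return table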

After the aforementioned conversions were completed, the IMDb data were merged with the MoviePilot and Netflix datasets by matching the titles and the corresponding years of production of the movies that are contained in all of the examined datasets. This step finalizes the preparation of the data for the actual study and concludes the description of the study data. The next section of the thesis introduces the measures that we employ for comparing the prediction accuracies of different recommender algorithms.

57 Copyright message: "Information courtesy of The Internet Movie Database (http://www.imdb.com). Used with permission". Licensing information can be obtained at http://www.imdb.com/licensing/ (for commercial use) and http://www.imdb.com/help/show_leaf?usedatasoftware (for non-commercial and personal use).

4.2 Measures of Prediction Accuracy

Prediction accuracy measures evaluate how close the predicted ratings of a recommender algorithm are to actual user ratings (Herlocker et al. 2004). Two established accuracy measures that are employed by the majority of studies in the research field of recommendation systems are the mean absolute error (MAE) and the root mean squared error (RMSE). Formally, these measures are defined as follows:

MAE = \frac{1}{|T|} \sum_{(u,i) \in T} \left| \hat{r}_{ui} - r_{ui} \right|    (4.1)

RMSE = \sqrt{\frac{1}{|T|} \sum_{(u,i) \in T} \left( \hat{r}_{ui} - r_{ui} \right)^2}    (4.2)

where \hat{r}_{ui} denotes the rating predicted for user u and movie i, r_{ui} denotes the user's true rating, and T denotes the set of user-movie pairs in the holdout set.

MAE measures the average absolute deviation between a predicted rating \hat{r}_{ui} and a user's true rating r_{ui}, whereas RMSE weights large deviations more heavily by squaring each error before summing these errors. For instance, in RMSE, an error of one point increases the error sum by one, whereas an error of two points increases this sum by four. Through its emphasis on large errors, RMSE can produce equal accuracy assessments for an algorithm that consistently produces moderate rating errors and an algorithm that predicts ratings fairly well in most instances but errs greatly in certain cases. As may be observed from equations (4.1) and (4.2), RMSE can never be smaller than MAE. RMSE equals MAE in only one specific case, namely, if all predictions involve an error of constant magnitude, i.e., if |\hat{r}_{ui} - r_{ui}| = c for all (u,i) \in T.

The meaning of MAE and RMSE can also be interpreted in statistical terms. Because MAE is defined as the mean of the absolute errors, this metric represents the first moment of the absolute-error distribution, i.e., the expected magnitude of the error that an algorithm produces. RMSE is formally the square root of the second moment of the algorithm's errors about zero. Thus, RMSE corresponds to the standard deviation of the errors from zero and therefore provides information about the "width" of the error distribution. In particular, assuming that the errors are normally distributed around zero, approximately 68% of the errors lie in the interval bounded by ±RMSE, approximately 95% lie in the interval bounded by ±2·RMSE, and approximately 99.7% lie in the interval bounded by ±3·RMSE. In other words, MAE and RMSE are informative about the distribution of the prediction errors. Therefore, it appears sensible to report both measures for the purpose of evaluating the predictive accuracy of various algorithms.

However, both MAE and RMSE depend on the scale that is used to obtain ratings from users. Although these measures allow for comparisons of the predictive accuracies of different algorithms, such comparisons remain informative only if the algorithms are tested on the same dataset or if the datasets utilize the same rating scale. These conditions do not initially hold in our situation because MoviePilot and Netflix utilize different rating scales (see Table 4.1). To overcome this limitation and ensure that predictions that are performed on different datasets are comparable, these two error measures are frequently normalized with respect to the range of rating values (Herlocker et al. 2004; Goldberg et al. 2001). The formal definitions of the normalized mean absolute error (NMAE) and the normalized root mean squared error (NRMSE) are as follows:

NMAE = \frac{MAE}{r_{\max} - r_{\min}}    (4.3)

NRMSE = \frac{RMSE}{r_{\max} - r_{\min}}    (4.4)

where r_{\min} and r_{\max} denote the minimum and the maximum ratings, respectively, of the rating scale of a particular recommendation system. In Section 4.4, which presents the results of our empirical study, we will report all four of the accuracy measures introduced above. The normalized accuracy measures allow us to compare the predictive accuracy of different algorithms across different datasets in a consistent manner, whereas the raw (i.e., non-normalized) accuracy measures allow the reader to compare our results with the findings of other published or unpublished studies of recommendation systems.
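The four measures are direct transcriptions of equations (4.1)-(4.4); a minimal sketch in Python:

    import numpy as np

    def mae(predicted, actual):
        return float(np.mean(np.abs(np.asarray(predicted) - np.asarray(actual))))

    def rmse(predicted, actual):
        return float(np.sqrt(np.mean((np.asarray(predicted) - np.asarray(actual)) ** 2)))

    def nmae(predicted, actual, r_min, r_max):
        return mae(predicted, actual) / (r_max - r_min)

    def nrmse(predicted, actual, r_min, r_max):
        return rmse(predicted, actual) / (r_max - r_min)

    # Normalization makes error values comparable across rating scales:
    print(nrmse([7.0, 3.5], [9.0, 3.0], r_min=0, r_max=10))  # MoviePilot-style scale
    print(nrmse([4.0, 2.5], [5.0, 2.0], r_min=1, r_max=5))   # Netflix-style scale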


Before the results of our study are presented, certain details regarding the algorithms that are employed in this study must be discussed. This discussion occurs in the next section of this thesis.

4.3 The Employed Algorithms and Benchmarks

To provide an informative report about the predictive accuracy of our proposed method, we performed a series of accuracy tests of individual recommender algorithms. Specifically, we examined the accuracy of pure user-based and item-based collaborative filters. Two variants of each approach were assessed; these variants differed with respect to the similarity measure that they employed. In particular, one variant of each approach utilized Pearson's correlation coefficient as its similarity measure, whereas the other variant employed the cosine similarity metric (see Sections 2.1.1.1 and 2.1.1.2 for details). To assess these collaborative filters, we used a fixed neighborhood size k because this neighborhood size provided the best accuracy over all of the examined datasets during the course of preliminary analyses.
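For reference, the two similarity measures differ only in mean-centering: Pearson's correlation coefficient is the cosine similarity of mean-centered rating vectors. A minimal sketch, assuming NumPy arrays that are already restricted to co-rated items (an illustration, not the study's implementation):

    import numpy as np

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        """Cosine of the angle between two rating vectors over co-rated items."""
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def pearson_similarity(a: np.ndarray, b: np.ndarray) -> float:
        """Pearson's correlation coefficient: cosine similarity after mean-centering."""
        return cosine_similarity(a - a.mean(), b - b.mean())

In the user-based variant, a and b are the rating vectors of two users over their co-rated movies; in the item-based variant, they are the rating vectors of two movies over the users who rated both.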

Another algorithm that is employed in our study is the matrix factorization algorithm by Funk (2006), which is similar to singular value decomposition; this algorithm has provided the foundation for all matrix factorization recommenders that have been assessed in the recent literature. Matrix factorization is known to provide one of the best predictive accuracies among single algorithms; thus, we suggest that a comparison of our recommendation method with this foundational matrix factorization algorithm should prove informative. However, our preliminary prediction runs indicated that this approach is highly sensitive to its parameters, such as the number of iterations, the regularization parameters, the learning rate and the number of factors.58 The optimal values of these parameters depend on the underlying data and should therefore be determined individually for each dataset to achieve optimal results. These assertions are supported by the findings of recent research (e.g., Paterek 2007; Koren 2009; Koren, Bell, and Volinsky 2009; Koren and Bell 2011). Therefore, in our comparisons, we used differently parameterized versions of Funk's algorithm and report the best results from these algorithm variants for each examined dataset; the optimal values of the number of factors, the learning rate, the regularization parameter, and the number of iterations were determined separately for the MoviePilot and the Netflix datasets.

58 For an explanation of the meanings of these parameters, see Section 2.1.1.3.
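A minimal sketch of the stochastic gradient descent underlying Funk-style matrix factorization is shown below. The hyperparameter defaults are illustrative, not the tuned values used in the study, and the loop updates all factors simultaneously, a common simplified variant, whereas Funk's original implementation trains one factor at a time.

    import numpy as np

    def funk_mf(ratings, n_users, n_items, n_factors=40,
                learning_rate=0.005, reg=0.02, n_iterations=30, seed=0):
        """Funk-style matrix factorization trained by stochastic gradient descent.

        `ratings` is a list of (user_index, item_index, rating) triples. The
        hyperparameter defaults are illustrative, not the study's tuned values.
        """
        rng = np.random.default_rng(seed)
        P = rng.normal(0.0, 0.1, (n_users, n_factors))  # user factor matrix
        Q = rng.normal(0.0, 0.1, (n_items, n_factors))  # item factor matrix
        for _ in range(n_iterations):
            for u, i, r in ratings:
                error = r - P[u] @ Q[i]                 # prediction error for this rating
                pu = P[u].copy()
                # Gradient steps with L2 regularization on both factor vectors.
                P[u] += learning_rate * (error * Q[i] - reg * pu)
                Q[i] += learning_rate * (error * pu - reg * Q[i])
        return P, Q  # predicted rating for (u, i): P[u] @ Q[i]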

To better assess the relative accuracy improvements that are provided by different algorithms, we introduce two "benchmarks". The first benchmark is a simple heuristic that "predicts" the global average of a dataset as the value of all future ratings for all users. Obviously, this benchmark represents the absolute lowest level of accuracy that a recommender system should provide to its users. A recommender algorithm that exhibits a lower level of prediction accuracy than the "global average" method should not be employed by a recommender system because it can be outperformed by a simple heuristic. The second benchmark is the result of the algorithm that won the Netflix Prize. This algorithm achieved an RMSE of .8712 (Bell, Koren, and Volinsky 2008), which was an improvement of more than 10% relative to the RMSE of the algorithm that Netflix had previously utilized; thus, this algorithm can be regarded as the most accurate recommendation algorithm that is currently known. Therefore, we suggest that the comparison of our method with this benchmark should be informative. However, test runs of this algorithm on our data are impeded by the fact that the algorithm essentially represents the result of blending the predictions of more than 100 recommendation algorithms (Bell, Koren, and Volinsky 2008). Testing this algorithm on our data would therefore require implementing all of its constituent methods and subsequently blending their results. This process would not only require a great deal of time and resources but would also expose the current study to potential criticism regarding any implementation errors in our results. Thus, we instead use the reported RMSE value of this algorithm for the Netflix dataset, together with the corresponding NRMSE value of .2178, which can be readily calculated from equation (4.4). In Table 4.3 and Table 4.4, which present the results of our study, these RMSE and NRMSE values are denoted as the accuracy values of the "Netflix Prize winner". Unfortunately, the authors of the prize-winning algorithm do not report its MAE. Nevertheless, we consider comparisons of the NRMSE values of different algorithms with the NRMSE value of this benchmark to be informative for assessing improvements in prediction accuracy. In the prior sections of this thesis, we provided a conceptual description of the design of our study and of the methods that are employed in this investigation; additional insights regarding the details of the technical implementation of these methods and the execution of our tests can be found in Appendix C. The next section presents the results of this study.

4.4 Results

This section presents the results and discusses the findings of our empirical study. The two main questions that are addressed in this section are (i) how well our proposed method predicts future user ratings and (ii) what proportion of the users of our proposed recommendation method receive explanations of their recommendations in the pros-and-cons explanation style, which is the most effective type of explanation that can be provided by an RS. Each of these questions is addressed in a separate subsection of this portion of the thesis.

4.4.1 Comparisons of Prediction Accuracies

The results of the prediction runs of the different algorithms are summarized in Table 4.3 (for the MoviePilot dataset) and Table 4.4 (for the Netflix dataset). We describe the accuracy of our proposed method in three rows of these tables. First, the "Estimation step" row provides the results for the predictions of model (3.5), using parameter values that were obtained in the estimation step of our algorithm (see Section 3.2.1). Second, the "Optimization step" row reports the accuracy of the predictions of the same model, using part-worth values that were refined in the optimization step, with the estimates from the estimation step serving as the starting point of the optimization (see Section 3.2.2). Finally, the "Hybrid" row indicates the accuracy results for the predictions that were obtained by hybridizing two of the examined methods, namely our optimized solution and the item-based CF approach, as described in Section 3.3.2.

The columns of these tables present the four accuracy measures that were introduced in Section 4.2 and the percentage of improvement that is achieved by a particular algorithm with respect to both the global average and the Netflix Prize winner benchmarks. For the reasons detailed in the previous section, the improvement relative to the Netflix Prize algorithm is only reported for the RMSE and NRMSE metrics. To simplify the comparison, two additional columns display the rank order of each compared algorithm for the particular accuracy metrics; in these columns, lower ranks correspond to better accuracy.

From Table 4.3, which displays the results of the predictions of the different algorithms for the MoviePilot dataset, it may be observed that all of the examined recommendation methods outperform the global average benchmark with respect to both the MAE and the RMSE. Thus, all of these methods are more effective at capturing the variance in user ratings than the mean of the overall rating distribution, which is the bottom-level benchmark (see Section 4.2). Among the collaborative methods, the item-based method that uses Pearson's similarity metric outperforms the other methods with respect to the MAE but is only the second-best method with respect to the RMSE. Conversely, the user-based approach that uses Pearson's similarity metric is the best approach with respect to the RMSE and only the second-best approach with respect to the MAE. Thus, for the MoviePilot data, the item-based Pearson method is superior to its user-based analog at providing lower errors for most users, whereas the user-based Pearson method is better at minimizing the overall error variance. However, in both cases, the difference between the two methods is marginal and totals less than one tenth of one percent. For both the MAE and the RMSE, the following rank order is consistently observed for the remaining collaborative filters: item-based with cosine similarity, user-based with cosine similarity, and matrix factorization.

Table 4.3: A comparison of the prediction accuracies of different algorithms for the MoviePilot dataset

Algorithm                 MAE       NMAE      Impr. vs.    Rank       RMSE      NRMSE     Impr. vs.    Impr. vs.       Rank
                                              global       (MAE/                          global       Netflix Prize   (RMSE/
                                              avg. (%)     NMAE)                          avg. (%)     winner (%)      NRMSE)
Benchmark methods
  Global average          21.55641  0.21555   0.0          9          26.34466  0.26345   0.0          -21.0           10
  Netflix Prize winner*   n/a       n/a       n/a          n/a        n/a       0.21780*  17.3         0.0             2
Collaborative filtering methods
  User-based, Pearson     16.92135  0.16921   21.5         3          22.15778  0.22158   15.9         -1.7            3
  User-based, Cosine      17.37269  0.17373   19.4         5          22.55114  0.22551   14.4         -3.5            6
  Item-based, Pearson     16.80697  0.16807   22.0         2          22.17444  0.22174   15.8         -1.8            4
  Item-based, Cosine      17.21427  0.17214   20.1         4          22.52112  0.22521   14.5         -3.4            5
  Matrix factorization    17.56650  0.17567   18.5         6          22.65087  0.22651   14.0         -4.0            7
Proposed method
  Estimation step         18.18684  0.18187   15.6         8          24.28305  0.24283   7.8          -11.5           9
  Optimization step       18.16394  0.18164   15.7         7          24.17754  0.24178   8.2          -11.0           8
  Hybrid                  16.19231  0.16192   24.9         1          20.66475  0.20665   21.6         5.1             1

*The MAE and RMSE of this method are not available for this dataset. The NRMSE is derived from the RMSE reported by Bell, Koren, and Volinsky (2008); see also Section 4.3.

It can be observed that although the predictions of our attribute preference model exhibit substantial accuracy improvements (of more than 15%) with respect to the bottom-line benchmark of the global average estimator, this model is not among the most accurate approaches in the table. Moreover, the results of the optimization step do not differ substantially from the results of the estimation step. The former demonstrate only a marginal improvement (of less than 1%) over the latter with respect to both the MAE (an improvement of .12%) and the RMSE (an improvement of .43%). However, the improvement in the RMSE is approximately four times greater than the improvement in the MAE. This result indicates that the optimized part-worth values reduce the magnitude of prediction errors in specific cases instead of reducing the errors of all predictions. All but one of the methods in the table fail to surpass the benchmark of the Netflix Prize algorithm. The only method that outperforms this benchmark is our proposed hybrid approach. With respect to the RMSE, our hybrid achieves a rather impressive improvement of 5.1% compared with the Netflix Prize algorithm. Before we interpret these results more comprehensively, we examine whether these findings for the MoviePilot dataset are consistent with the results of the predictions for the Netflix dataset, which are displayed in Table 4.4.

From Table 4.4, it is evident that the results of the prediction accuracy tests on the Netflix dataset are largely consistent with the findings for the MoviePilot dataset. For the Netflix dataset, similarly to the MoviePilot dataset, all of the examined algorithms outperform the bottom-level benchmark of the global average estimator. However, within the group of collaborative methods, we observe a slightly different rank order of prediction accuracies for the Netflix dataset than for the MoviePilot dataset. In the Netflix dataset, the item-based method that uses Pearson's similarity metric is consistently ranked first with respect to both the MAE and the RMSE. However, different rank orders for the MAE and the RMSE can be observed between the second-best and the third-best prediction methods, which are the item-based cosine similarity and the user-based Pearson methods. In particular, the former approach outperforms the latter with respect to the MAE, whereas the latter dominates the former with respect to the RMSE. The ordering of the remaining collaborative methods is consistent with respect to both the MAE and the RMSE and does not differ from the order for the MoviePilot dataset; in particular, in both datasets, the user-based cosine similarity algorithm remains dominant over the matrix factorization method.

Table 4.4: A comparison of the prediction accuracies of different algorithms for the Netflix dataset

Algorithm                 MAE       NMAE      Impr. vs.    Rank       RMSE      NRMSE     Impr. vs.    Impr. vs.       Rank
                                              global       (MAE/                          global       Netflix Prize   (RMSE/
                                              avg. (%)     NMAE)                          avg. (%)     winner (%)      NRMSE)
Benchmark methods
  Global average          0.93609   0.22815   0.0          9          1.10899   0.27725   0.0          -27.3           10
  Netflix Prize winner*   n/a       n/a       n/a          n/a        0.87120*  0.21780*  21.4         0.0             2
Collaborative filtering methods
  User-based, Pearson     0.67940   0.16985   25.6         4          0.87921   0.21980   20.7         -0.9            4
  User-based, Cosine      0.69548   0.17387   23.8         5          0.89502   0.22376   19.3         -2.7            6
  Item-based, Pearson     0.67726   0.16932   25.8         2          0.87714   0.21929   20.9         -0.7            3
  Item-based, Cosine      0.67911   0.16978   25.6         3          0.87975   0.21994   20.7         -1.0            5
  Matrix factorization    0.70521   0.17630   22.7         6          0.90324   0.22581   18.6         -3.7            7
Proposed method
  Estimation step         0.70718   0.17680   22.5         8          0.91189   0.22797   17.8         -4.7            9
  Optimization step       0.70610   0.17653   22.6         7          0.90760   0.22690   18.2         -4.2            8
  Hybrid                  0.64053   0.16013   29.8         1          0.82220   0.20555   25.9         5.6             1

*The MAE of this method is not available. The RMSE is obtained from the report of Bell, Koren, and Volinsky (2008), and the NRMSE is derived from the RMSE; see also Section 4.3.

Consistent with the results achieved for the MoviePilot data, the prediction accuracy of our preference-based model outperforms only the global average benchmark. Similarly to the MoviePilot dataset, for the Netflix dataset, the improvement in the accuracy of this model from the estimation step to the optimization step is only .14% with respect to the MAE and .44% with respect to the RMSE. Thus, once again, the optimization of part-worth values reduces the variance rather than the magnitude of prediction errors. For the Netflix data, the benchmark of the Netflix Prize winner is superior to all but one of the other examined methods. In particular, consistent with the result of the predictions for the MoviePilot dataset, our proposed hybrid method outperforms the Netflix Prize benchmark with respect to the RMSE. The magnitude of this improvement is 5.6% in the Netflix dataset, which is .5 percentage points greater than the magnitude of the corresponding improvement in the MoviePilot dataset.

After examining the performance of the prediction methods on the MoviePilot and Netflix datasets, one important observation is that the rank order of the different methods with respect to accuracy remains largely persistent across both datasets. We interpret this fact as an indicator that, at least with respect to the rank order of the examined algorithms, the summarized results that are provided in Table 4.3 and Table 4.4 are generalizable. In other words, we assert that the obtained results are descriptive of the accuracy that would be observed for the examined algorithms on other datasets. A further discussion of the results regarding the accuracy of our proposed method is provided based on these assertions; in particular, we consider the differences and commonalities between the two sets of results and the conclusions that may be derived from these findings.

First, it is evident that differences between the MoviePilot and Netflix datasets impact the magnitudes of the observed accuracy measures. The normalized accuracy measures (NMAE and NRMSE) are all greater for the Netflix dataset than for the MoviePilot dataset. Among the examined prediction methods, the accuracy of the global average predictions, our bottom-level benchmark, is impacted the most by the shift from one dataset to the other. This result is consistent with Koren's observation that a sudden jump in the mean movie rating occurred in the Netflix dataset in early 2004; this phenomenon may be attributed to alterations in the labels of the Netflix rating scale that may have been implemented at this time (Koren 2009). In fact, this type of increase in the mean rating should produce both an increase in the mean error and an increase in the error variance; these effects appear to be reflected in the higher NMAE and NRMSE values, respectively, of the predictions for the Netflix dataset relative to the predictions for the MoviePilot dataset. Although these alterations in the Netflix ratings will, by definition, impact the predictions of the global average method59, the other prediction methods that are examined in this study exhibit a surprising robustness to these changes in ratings. In particular, the difference in NMAE between the predictions for the two datasets becomes noticeable only at the fourth decimal place for the majority of the assessed methods. The NRMSE values are affected approximately an order of magnitude more strongly than the NMAE values, as evidenced by the fact that the differences between the two datasets with respect to NRMSE can be observed at the third position after the decimal point for most of the examined prediction methods. The percentage of accuracy improvement with respect to our benchmarks is also impacted by these issues, which decrease not only the consistency of the corresponding values across the two datasets but also the level of information that these values provide for the purpose of drawing comparisons.

59 Recall that the global average is defined as the mean rating of a dataset.

Another reason that the normalized accuracy measures may be higher for the Netflix dataset than for the MoviePilot dataset may simply be the fact that the former dataset is larger than the latter. Conventional wisdom tells us that the chances of higher prediction errors increase with the size of the dataset because an algorithm must predict more user ratings for validation purposes for a large dataset than for a small dataset. Regardless of the precise contributions of each of these two considerations to the higher error measures in the Netflix dataset relative to the MoviePilot dataset, it should be recognized that the differences between these two datasets influence the calculated accuracy metrics. Thus, the comparison of accuracy improvements between these two datasets should be approached with caution and should account for the aforementioned circumstances.

Nonetheless, it can be noted that these differences between the two datasets produce a considerably smaller effect on the accuracy metrics of our proposed method than on the accuracy metrics of the other examined approaches. We explain this phenomenon by noting that, in contrast to these other methods, our preference model incorporates temporal effects. Thus, the time-varying component of the rating variance is likely captured to a greater extent by our preference model than by the competing methods that were examined. This phenomenon emphasizes the importance of accounting for temporal changes within recommendation algorithms.

Interestingly, in contrast to the statements of Jannach et al. (2011), Pearson's correlation coefficient consistently outperformed the cosine similarity measure in the current study (see also Section 2.1.1.1). In addition, contrary to our expectations, the matrix factorization approach performed worst among all of the considered collaborative approaches with respect to both accuracy measures, despite the fact that we parameterized the optimization procedures of this approach to provide the best possible results for the given datasets (see Section 4.3). These inconsistencies with prior findings might have occurred because many previously published studies have held out only one rating per user for the purpose of calculating error metrics (e.g., Herlocker et al. 2004; Adomavicius and Tuzhilin 2005; Koren 2009; Koren and Bell 2011); by contrast, for validation purposes, we utilized holdout sets that consist of six ratings per user. Thus, in our study, the examined algorithms were faced with a slightly different challenge than the typical task that is utilized in contemporary RS research; in particular, these algorithms had to predict not only the next user rating but also a sequence of future ratings for each user. Given that new ratings are incorporated into RS data on an irregular basis, i.e., different users provide new ratings at diverse time intervals, the temporal effects of shifting user preferences might be reflected in our comparison results. In other words, although an algorithm may be well suited for predicting the next rating of a user, its ability to predict future user tastes beyond this rating could be impeded by temporal effects. These effects can cause shifts in the rank orders of algorithms with respect to prediction accuracy; our results may indicate the presence of such shifts. Nevertheless, we argue that deeper insights into the performance of prediction algorithms may be obtained through the use of a sequence of future user ratings in the holdout set rather than only the final user rating; in particular, the use of a sequence of future ratings allows us to assess whether an algorithm's predictions are systematically accurate and reduces the influence of chance on the accuracy measures (because the probability of "guessing" one user rating accurately is much higher than the probability of accurately "guessing" multiple ratings for each user).


Another interesting observation is that neither component of our hybrid, namely the item-based CF that uses the Pearson similarity metric and model (3.5) loaded with optimized part-worth values, is individually the best approach for approaching the benchmark that has been established by the Netflix Prize winner. An even more interesting observation is that the proposed hybridization of our method with the item-based CF using the switching hybridization design (see Sections 2.1.4.1 and 3.3.2) produces a dramatic improvement in accuracy that allows the aggregated method to outperform all of its competitors, including the algorithm that won the Netflix Prize. These observations lead us to the following two conclusions (a sketch of the switching design follows below):

First, the superior accuracy of our hybrid relative to both of its components indicates that our model (3.5) does not capture all of the variance in user ratings. In other words, the attribute-based model of user preferences fails to describe the preference formation of certain users in the examined datasets. For these users, the item-based CF produces predictions that are closer to their true ratings than the predictions of our method; as discussed later in this thesis, this phenomenon occurred for approximately 30% of the user base in both of the examined datasets. Thus, item-based CF captures certain movie characteristics that are "hidden" from the attribute-based preference model and extend beyond formal attributes. These characteristics may include various traits, such as the depth of character development in a film, the presence of an enthralling story line or the overall atmosphere of a movie. In other words, for users who base their preferences on these types of movie characteristics, which are difficult to formalize, an analysis of item similarities is capable of revealing the relationships among movies with respect to these characteristics. By contrast, there are also a substantial number of users whose preference structures are described better by our attribute-based preference model than by the other examined approaches. Our switching hybrid therefore exhibits superior performance relative to the various other recommendation methods because it provides each of these groups of users with predictions from the constituent method that better describes their preferences (see Section 3.3.2). This reasoning implies that the poor performance of our (non-hybrid) preference model is not primarily caused by calculation errors but instead reflects the inability of this model to capture the movie preferences of a certain group of users who form their preferences based on factors other than movie attributes.
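A minimal sketch of the per-user switch, assuming that each component has already been scored on each user's operation holdout (see Section 4.1); the function and variable names are illustrative, not the study's implementation:

    def switching_hybrid(users, model_error, cf_error, model_pred, cf_pred):
        """Serve each user from the component with the smaller operation-holdout error.

        model_error/cf_error map a user to that component's holdout error; the
        *_pred mappings hold each component's rating predictions for that user.
        """
        predictions, explanation_style = {}, {}
        for user in users:
            if model_error[user] <= cf_error[user]:
                predictions[user] = model_pred[user]
                explanation_style[user] = "pros-and-cons"  # attribute-based model
            else:
                predictions[user] = cf_pred[user]
                explanation_style[user] = "influence"      # item-based CF
        return predictions, explanation_style

The same per-user assignment determines which explanation style a user receives, which is why the switching design also drives the proportions reported later in Table 4.7.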


Second, we can conclude that the estimation step of our algorithm provides fairly accurate estimates of our model parameters because the shift from the parameters obtained in the estimation step to the parameters obtained in the optimization step produces only a moderate improvement in prediction accuracy. This conclusion follows from the reasoning below. If our model's predictions outperform the predictions of the item-based CF for a substantial group of users, and if the lower accuracy for the remaining users is simply caused by the model's inadequacy for describing the preference formation of these users (as revealed by the fact that combining the pure predictions of both components of the hybrid produces superior overall results), then the initial model parameters must have been estimated reliably enough to produce low levels of overall prediction error. The optimization procedure produced an improvement in the RMSE that was approximately four times greater than the improvement in the MAE, indicating that this procedure primarily reduces the error variance rather than the expected value of the prediction errors. Therefore, the adjustment of the point estimates in the optimization step results in a slightly better fit of the model to user preferences, whereas the model bias (i.e., the expected value of the error) remains relatively unaffected by the optimization procedure. Thus, the initial interval estimates of the estimation step were obtained reliably.

The dramatic accuracy improvement that is caused by the hybridization of our preference model and the item-based CF approach merits a closer examination. Consider Table 4.5, which summarizes the distribution parameters of the absolute error after the optimization step. To obtain a better sense of this distribution, see also Figure 4.2, which depicts a frequency histogram of the error values of the predictions of the attribute-based preference model (3.5) after the optimization step for the MoviePilot dataset.60

Table 4.5: The distribution parameters for the absolute prediction error of the optimization step

Dataset      Min  Max  Mean   SD     Mode  Kurtosis  SE of kurtosis  25th pct.  50th pct.  75th pct.
MoviePilot   0    100  18.16  16.36  0     2.434     .022            6.03       13.60      25.48
Netflix      0    5    .706   .624   0     2.527     .018            .2315      .5368      1.44

60 A similar error distribution is observed for the Netflix dataset.
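The statistics in Table 4.5 can be reproduced from the vector of absolute prediction errors along the following lines. This is a sketch using SciPy; note that SciPy reports excess (Fisher) kurtosis by default, and whether the table uses the raw or the excess convention is not stated in the text.

    import numpy as np
    from scipy import stats

    def error_distribution_summary(absolute_errors):
        """Distribution parameters analogous to those reported in Table 4.5."""
        e = np.asarray(absolute_errors, dtype=float)
        return {
            "min": e.min(),
            "max": e.max(),
            "mean": e.mean(),
            "sd": e.std(ddof=1),
            # SciPy's default is excess (Fisher) kurtosis: 0 for a normal
            # distribution, positive values indicate a peaked distribution.
            "kurtosis": stats.kurtosis(e),
            "p25": np.percentile(e, 25),
            "p50": np.percentile(e, 50),
            "p75": np.percentile(e, 75),
        }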

It can be observed that for both datasets, our algorithm exhibits relatively high positive kurtosis values (over 2) and a relatively low standard deviation (relative to the mean error value). Both facts indicate that the error distribution of our algorithm is highly peaked, i.e., most error values are concentrated around a particular point instead of being spread across a wide interval. The analysis of the quantiles of the two distributions (specifically, the 25th, 50th and 75th percentiles) reveals that these error distributions are positively skewed, i.e., the peaks of these distributions are situated closer to zero error than to the mean error. The peakedness and positive skew of these distributions are also confirmed by the fact that the prediction error's standard deviation around the mean (SD) is lower than the value of the RMSE.61 Furthermore, it may be noted that the absolute prediction error remains below the value of the standard deviation for roughly 70% of the examined cases and exceeds it for only approximately 30% of the cases (as demonstrated by the values of the 50th and 75th percentiles of the prediction error distributions). This implies that the error measures are primarily driven by a small number of points with large deviations rather than a large number of points with nearly equal deviations. In combination, these facts provide evidence that our model generally predicts user ratings fairly accurately but is rather inaccurate for a relatively small share of cases (approximately 30%). In particular, for users whose preferences are not predicted accurately by the proposed model, the magnitude of the prediction errors may range from approximately 25% to 100% of the interval of an RS's rating scale. One possible explanation for this high magnitude could be that the errors occur in a systematic way. Systematic errors may be produced by a wide range of factors, such as problems with a model's quality, calculation errors, and unexpected patterns in user or item ratings. To test the conjecture that systematic errors are occurring in our algorithm and to identify the source of these errors, we inspected the ratings for which our algorithm produced large errors and found that large errors are consistently demonstrated by the same group of users. This result supports our hypothesis that systematic errors occur and allows us to attribute these errors to particular users of the RS. However, we were unable to find patterns that would permit the a priori identification of users with high prediction errors. This group of users does not exhibit any noticeable patterns with respect to the source data, such as a low number of ratings, a tendency to rate specific movies, or other particular features of their rating distributions, that would allow us to differentiate these users from the individuals whose ratings our algorithm predicts accurately. The only sensible explanation for this phenomenon is that these "problematic" users must form their movie preferences based on information that is not captured by the preference function of equation (3.5). This observation supports our previously stated suggestion that the attribute-based preference model is unable to capture the movie preferences of a certain group of users; this inability to accurately represent the preferences of certain users with the attribute-based model alone motivates the use of hybridization.

61 Recall that the RMSE designates the standard deviation of the error distribution around the value of zero (see Section 4.2).

Figure 4.2: The distribution of the absolute prediction errors of the optimization step for the MoviePilot dataset (histogram; x-axis: absolute prediction error, y-axis: frequency). The dashed lines indicate the MAE (18.16) and the RMSE (24.17) of the optimization step.


To further justify the hybridization of our method with the item-based CF, we performed the Kolmogorov-Smirnov test for the equality of distribution functions. The results of this test revealed that the error distribution of the item-based CF approach differs significantly from the error distribution of the attribute-based preference model (for both the MoviePilot and the Netflix datasets). Consistent with this result, the two approaches produced significantly unequal errors for most users at the single-user level (Student's t-test for the equality of means).
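The two-sample test can be sketched as follows; the error arrays here are synthetic stand-ins, since the study's per-user errors are not reproduced in this text.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # Synthetic stand-ins for the two absolute error samples; in the study these
    # would come from the preference model and from the item-based CF.
    errors_model = rng.gamma(shape=1.5, scale=12.0, size=10_000)
    errors_cf = rng.gamma(shape=2.0, scale=9.0, size=10_000)

    # Two-sample Kolmogorov-Smirnov test for the equality of distribution functions.
    ks_stat, p_value = stats.ks_2samp(errors_model, errors_cf)
    print(f"KS statistic = {ks_stat:.3f}, p = {p_value:.3g}")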

These results confirm that the two approaches capture different types of variance in user ratings and that each approach is well suited for describing the preference formation of a different type of user. Thus, the hybridization scheme that chooses the better of the individual predictions of the two approaches, as described in Section 3.3.2, is a sensible prediction technique that generates substantial improvements in predictive accuracy relative to either approach alone. Table 4.6 summarizes the increases in accuracy that our proposed hybrid method provides relative to its individual components (the individual predictions of the optimization step and the item-based CF approach) and to the benchmark methods (the global average algorithm and the algorithm that won the Netflix Prize). It is evident that the improvements in accuracy from hybridization are substantial and consistent with respect to both (N)MAE and (N)RMSE for both the MoviePilot and the Netflix datasets. Moreover, our proposed hybrid method outperforms all of the other examined prediction algorithms, including the highly accurate algorithm that won the Netflix Prize. This finding allows us to state that one of our initial objectives, i.e., the development of an accurate recommendation algorithm, has been achieved.

Table 4.6: The accuracy improvements produced by the hybrid method
The values in each row indicate the percentage by which the accuracy of the hybrid method exceeds the accuracy of the listed algorithm. *The MAE of this method is not available; the NRMSE values are derived from the RMSE values reported by Bell, Koren, and Volinsky (2008); see also Section 4.3.

Algorithm                  MoviePilot              Netflix
                           (N)MAE    (N)RMSE       (N)MAE    (N)RMSE
Global average             24.88%    21.56%        29.81%    25.86%
Optimization step          10.85%    14.52%        9.28%     9.41%
Item-based CF, Pearson     3.65%     6.80%         5.42%     6.26%
Netflix Prize winner*      n/a       5.12%*        n/a       5.62%*
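For example, these improvement percentages follow directly from the error measures of Tables 4.3 and 4.4; the hybrid's reported gains over the Netflix Prize winner can be reproduced from the NRMSE values:

    def improvement(baseline_error, new_error):
        """Percentage by which new_error improves on baseline_error."""
        return 100.0 * (baseline_error - new_error) / baseline_error

    # NRMSE values taken from Tables 4.3 and 4.4:
    print(improvement(0.21780, 0.20665))  # MoviePilot hybrid vs. Netflix Prize winner: ~5.12
    print(improvement(0.21780, 0.20555))  # Netflix hybrid vs. Netflix Prize winner: ~5.62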


4.4.2 The Provided Explanation Styles

In the previous section, our hybrid method was shown to outperform all other examined methods with respect to predictive accuracy. However, the predictions of the hybrid are produced as a combination of the predictions of its individual component approaches, and each of these approaches provides its own explanation style with a particular level of effectiveness for a user's decision-making process. In particular, the predictions of the attribute-based model can be justified in the pros-and-cons explanation style, which is the most effective type of explanation, whereas the item-based predictions can be justified in the influence explanation style, which is slightly less effective than the pros-and-cons style (see Section 3.3.2). The item-based CF component provides substantial contributions to the predictive accuracy of the hybrid method; thus, it is not immediately clear what proportion of users receive final recommendations from the hybrid approach that are produced by our user preference model. In other words, what percentage of users receive recommendations that can be justified in the pros-and-cons explanation style?

Table 4.7 answers this question by summarizing the number of users in each examined dataset that would receive a particular type of explanation if the hybrid prediction method were utilized. The users who would receive justifications in the pros-and-cons explanation style would have been provided with final recommendations generated by the user preference model of this study (expression (3.5)). The users who would receive justifications in the influence explanation style would have been provided with final recommendations generated by the item-based CF component of the hybrid.

Table 4.7: The explanation styles provided to users

                     MoviePilot                       Netflix
Explanation style    Number of users  Percentage     Number of users  Percentage
Pros-and-cons        5,194            65.31%         290,146          67.73%
Influence            2,759            34.69%         138,239          32.27%
Total                7,953            100%           428,385          100%


We can see that the results are consistent for both examined datasets; no substantial differences between the datasets are observed. Accordingly, if the hybrid method had been applied to these datasets, the item-based CF and its associated influence explanation style would have been employed for approximately one third of the users. The majority of the users (approximately two thirds) would have received justifications for their recommendations in the pros-and-cons explanation style, the most detailed and effective approach. Although our hybrid method cannot ensure that explanations are provided in the most detailed explanation style for all users, all of the users would be provided with explanations in one of the two most effective styles (see Sections 2.2.2 and 2.3.1.3). Thus, we can state that our second objective, i.e., the provision of effective and actionable explanations, was achieved.

As discussed in the previous section, the attribute-based preference model of expression (3.5) cannot capture the preference structure of certain users because those users form their preferences based on factors other than movie attributes. Thus, recommendation explanations in terms of movie attributes would not be informative for these users and therefore would not increase the effectiveness of their choices. Because the item-based component of our hybrid substantially increases the predictive accuracy for these users, it appears to capture the "correct" aspects of the rating variance for them. Thus, relative to attribute-based explanations, the influence-based explanation style that emphasizes the similarities between movies will be more informative and more effective for these users (whose preferences are better described by the item similarity model of the item-based CF than by the attribute-based model developed in this study). If the hybrid method is employed, the remaining approximately two thirds of the user base in both datasets will be provided with detailed explanations of their recommendations based on an attribute preference model that effectively captures their preferences. In other words, for each group of users, our hybrid method provides explanations of its recommendations in the most effective style possible for that type of user.

This assertion is supported by the conflicting research results of Bilgic and Mooney (2005) and Symeonidis, Nanopoulos, and Manolopoulos (2008): Among other things, both studies compared the effectiveness of and user satisfaction with the influence and keyword explanation styles within the experimental framework of a single recommendation algorithm. In Bilgic and Mooney's study, the keyword explanation style dominated the influence style, whereas the investigation of Symeonidis, Nanopoulos, and Manolopoulos produced the opposite result (see Section 2.2.2). However, in both studies, the difference between the two explanation styles was insignificant. Recall that our pros-and-cons explanation style is derived from the keyword explanation style and represents an extension of this approach. Thus, the results of the current study, which reveal the existence of two user groups that form their preferences in different ways, help explain and resolve the conflict between the conclusions of the two studies mentioned above. Because both of the user groups that have been identified in this study are substantial in size, both types of users would most likely be represented in the "keyword" and "influence" groups of studies that examine explanation styles. However, the exact proportions in which the users of both types were represented in different experimental groups could vary. This variance would cause a difference in the "mean" judgments of the experimental groups in different investigations of explanation styles. However, these differences may not be significant because both the "keyword" and "influence" groups would include comparable numbers of both user types (specifically, certain users who prefer keyword explanations and other users who prefer influence-based explanations). Although this explanation requires empirical verification, we leave this question to future research. For the moment, we suggest that this explanation is both plausible and consistent with the findings of our study.

To summarize the above discussion, we argue that in our empirical study, our proposed method clearly demonstrated an ability to provide actionable recommendations that increase the effectiveness of the recommendations received by all users of an RS. The consistency of the results on two different datasets indicates the generalizability of these findings and allows us to assert that the second and the third objectives of the current thesis are achieved. Thus, we may conclude the development of our proposals. The next section provides a brief summary of the findings of the empirical study.


4.5 Summary

The purpose of the current chapter was to test our theoretically developed algorithm for providing recommendations, and explanations of these recommendations, in an empirical setting. These tests were intended to demonstrate the portability of our suggested approach to the real-world operating environment of recommender systems and the compliance of the proposed method with the declared objectives of the current thesis. To complete these tests, we conducted an empirical study that employs datasets of user ratings for movies obtained from two real-world recommendation systems. Using these datasets, our proposed recommendation method was compared with other important recommendation algorithms with respect to prediction accuracy, i.e., the ability to generate reliable recommendations. Further, the ability of the various prediction methods to provide effective and actionable explanations to users was examined.

The results of this study indicate that our proposed hybrid recommendation method outperforms not only the collaborative filtering approaches but also the state-of-the-art algorithm that won the Netflix Prize with respect to predictive accuracy; moreover, our method provides all users with explanations of the reasoning underlying its recommendations. If our method had been applied to the examined databases, the majority of the users in these databases (approximately two thirds) would have received explanations in the pros-and-cons explanation style, which provides detailed, comprehensible, and actionable explanations that increase the efficiency of users' choices. A smaller fraction of users (approximately one third of the user base) would have received explanations at a lower level of detail than the pros-and-cons explanation style. This occurs because our theoretically founded multi-attribute preference model does not capture the variance in the ratings of these users; instead, these users base their movie preferences on factors other than the information that is contained in formalizable movie attributes. However, the item-based CF method is capable of producing reliable rating predictions for these users, and the similarities between movies can serve as a reliable descriptor of their preference formation process. Thus, to be effective, the recommendation explanations for these users should be provided in the style that is most suitable for their preference functions, i.e., the influence explanation style. The use of our hybrid method would therefore allow both types of users to receive explanations that effectively support them in their decision-making processes. Because these two user groups would receive recommendations from an algorithm that is well suited to each group's preference functions, it can be argued that the hybrid approach of this study allows each user to receive recommendations that are generated by a recommendation process that aligns with the user's preferences. Thus, we can assert that our developed method also satisfies the aspect of our objectives that involves aligning recommendations with user preferences.

Although our method requires each user to provide at least 12 movie ratings before recommendations can be generated, this requirement is realistic and does not appear to overburden users. Moreover, this condition is consistent with the contemporary practices of recommender systems; for example, the current MoviePilot RS requires users to rate 20 movies before it begins to provide recommendations. We can therefore state that our method is practicable for contemporary commercial recommender systems and that the third objective of this thesis, which was to ensure the suitability of our method for applications in real-world settings, has been achieved.

Thus, in our empirical study, our proposed recommendation method proved to be capable of providing all of its users with both accurately predicted recommendations and actionable explanations of the reasoning underlying these recommendations. Moreover, this method aligned the recommendation process with user preferences. The results are consistent for both of the datasets that were examined in this study, and no significant discrepancies between the results for the two datasets were observed. Because both of these datasets exhibit unique characteristics, the consistency of the results indicates that the findings of this study may generalize not only to the entire domain of movie recommendations but also to recommender systems as a whole.


Chapter 5

Conclusions and Future Work

In this chapter, we summarize our research and its findings, discuss the implications of these findings, and provide suggestions for future work. The first subsection of this chapter provides a brief recapitulation of the course of our analysis and the development of our algorithm; this subsection also summarizes our contributions to the RS literature. The second subsection emphasizes the main implications of our findings for the providers and developers of recommendation systems. Finally, the third subsection concludes our thesis by discussing ways to improve our proposed recommendation method and various avenues for future research.

5.1 Research Summary, Key Findings and Contributions

The objective of the current thesis was to develop a recommendation method that not only aligns the recommendation process with user preferences but also provides both accurately predicted recommendations and actionable explanations of the reasons underlying these recommendations. Our development also had to balance these goals with the constraint of ensuring that the recommendation method possessed practical applicability in the domain of motion pictures. Thus, our recommendation method should be capable of eliciting attribute-based user preferences regarding motion pictures to provide both recommendations and explanations of these recommendations; these explanations should be conveyed in terms that users can comprehend and should provide meaningful assistance to users' decision-making processes. Moreover, all users of an RS that employs our method should be able to receive effective recommendations with actionable explanations; in particular, this condition should be satisfied for users who form their movie preferences based on factors other than movie attributes.

To provide a foundation for the development of our proposed method and to ensure the novelty of our approach, we presented a brief discussion of the extant theoretical work relevant to our objectives. This discussion included the following aspects: (i) an overview of key recommendation techniques and the principles underlying these approaches; (ii) a theoretical discussion of why explanations of recommendations should be an integral part of recommender systems and how these explanations should be provided to users; and (iii) an analysis of the movie research literature that derived the movie attributes that are relevant for the preference formation and decision-making processes of users.

We demonstrated that explanations of the reasoning underlying recommendations can increase users' acceptance of, trust in, and loyalty to recommendation systems. Moreover, explanations can contribute to the effectiveness of users' choices among recommended items and increase their levels of satisfaction with their choices. However, these advantages can only be realized if explanations are understandable and actionable to users. These traits are fulfilled by the keyword explanation style, which emphasizes the movie attributes that may be important to a user, and the influence explanation style, which emphasizes the movies that were most influential in the provision of a recommendation. Furthermore, a recommendation algorithm itself should operate in accordance with the characteristics that users employ to judge choice alternatives. This operational design not only allows an algorithm to provide actionable explanations to users but also ensures that the recommendations produced by this algorithm are effective, i.e., that these recommendations help users make optimal choices.

This reasoning led us to choose a multi-attribute utility (MAU) model and the weighted additive decision rule (WADD) as the basis for our recommendation algorithm. The latter choice is supported by the work of Aksoy et al. (2006), who emphasize the importance of eliciting user attribute part-worths for both providing effective recommendations and explaining the reasons underlying these recommendations.
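For concreteness, the following minimal sketch illustrates how a WADD prediction and its pros-and-cons reading could look in code. All attribute names and part-worth values are hypothetical illustrations and do not correspond to parameters estimated in this thesis; note how the positive terms naturally supply the "pros" and the negative terms the "cons" of an explanation.

    # Minimal sketch of a weighted additive (WADD) rating prediction.
    # Attribute names and part-worth values are hypothetical.

    def predict_rating(part_worths, movie_attributes):
        """Sum the user's part-worths over the attributes the movie possesses."""
        return sum(part_worths.get(attr, 0.0) for attr in movie_attributes)

    # A user who likes westerns and Clint Eastwood but dislikes long runtimes:
    user_part_worths = {
        "baseline": 3.0,               # the user's average rating tendency
        "genre:western": 0.9,          # a "pro" in a pros-and-cons explanation
        "actor:clint_eastwood": 0.6,   # another "pro"
        "runtime:>150min": -0.8,       # a "con"
    }

    movie = ["baseline", "genre:western", "actor:clint_eastwood", "runtime:>150min"]
    print(predict_rating(user_part_worths, movie))  # 3.0 + 0.9 + 0.6 - 0.8 = 3.7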


The derivation of preference-relevant movie attributes that can be employed within the context of our recommendation algorithm was challenged by the lack of extant research on this particular topic. The existing theoretical discussions of the preference relevance of movie attributes have not been accompanied by empirical evidence and may not provide a complete list of preference-relevant movie attributes. Thus, we not only appropriately adapted existing findings regarding these attributes but also extended our list by considering movie attributes that are examined in research on movie success factors. However, this stream of research considers movie attributes on an aggregate level and does not analyze the relevance of movie attributes for the preference formation of an individual consumer. We argued that a factor's relevance to aggregate choices is associated with its relevance on an individual level; in addition, we discussed the suitability of various movie success factors for describing individual preferences. This discussion produced a list of 374 movie attributes for consideration in our recommendation algorithm.

However, the number of attributes for which user attribute part-worths can be estimated algebraically is limited by the number of ratings that a user has provided to an RS; for most users of a recommendation system, this number of ratings is relatively low. Consequently, extant recommender algorithms can utilize only a fraction of the preference-relevant movie attributes and are therefore unable to capture a substantial portion of the variance in a user's ratings. As a result, these algorithms are unable to produce reliable recommendations for the majority of a recommender system's users.

To address these considerations, we proposed a novel two-step algorithm that estimates user attribute preferences through the use of statistical techniques. The first step of this technique utilizes auxiliary regressions, which provide interval estimates of a single attribute part-worth for a single user. To ensure the reliability and validity of these estimates for the generation of predictions, the estimates are then corrected for omitted variable bias and for multicollinearity. The second step of the algorithm optimizes the bias-corrected estimates to further increase their fit to the available data and thereby reduce prediction errors. This optimization is accomplished via the conjugate gradient method, which was modified such that the value of each parameter is only allowed to vary inside the confidence interval for this parameter that was obtained during the first step of the algorithm.
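As an illustration of the second step's box-constrained optimization, consider the following simplified sketch. It substitutes plain projected gradient descent for the modified conjugate gradient method that is actually employed, and the data, initial estimates, and interval bounds are synthetic assumptions; the point is only to show how each parameter is kept inside its first-step confidence interval.

    import numpy as np

    # Simplified sketch: refine initial part-worth estimates while confining
    # each parameter to the confidence interval obtained in the first step.
    # Projected gradient descent stands in for the thesis's modified
    # conjugate gradient method; the box constraint is enforced in the same
    # spirit by clipping after every update.

    def refine_part_worths(X, y, w0, lower, upper, lr=0.01, n_iter=500):
        """X: movie-attribute matrix; y: observed ratings;
        w0: bias-corrected initial estimates;
        lower/upper: per-parameter confidence-interval bounds."""
        w = np.clip(w0, lower, upper)
        for _ in range(n_iter):
            grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of squared error
            w = np.clip(w - lr * grad, lower, upper)  # project back into the CIs
        return w

    # Synthetic underdetermined example: more parameters than data points.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(6, 20))     # 6 ratings, 20 attribute part-worths
    y = X @ rng.normal(size=20)
    w0 = np.zeros(20)
    refined = refine_part_worths(X, y, w0, w0 - 1.0, w0 + 1.0)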


The aforementioned estimation procedure is then applied to the refined model of user attribute preferences, which separates the basic effects of movie-user interactions, i.e., the actual movie attribute preferences, from rating variations that can be attributed either solely to movies, such as a movie's general popularity, or solely to users, such as a user's handling of the rating scale or a user's reaction to mainstream trends. Furthermore, this model was modified to account for temporal changes involving three different types of effects. This model development process eventually resulted in a model of user preferences that contains 708 parameters, each of which must be estimated individually for each user.

Although our refined model of attribute-based user preferences can capture the preferences of the majority of users, the hedonic nature of motion pictures causes certain users to judge movies using criteria other than movie characteristics, such as the nebulous traits of overall impression or entertainment value. Our model is unable to fully capture the preferences of these users. We therefore suggested hybridizing our algorithm with the item-based collaborative filtering method, which does not address movie attributes. This method bases its recommendations on more general rating patterns and is therefore able to reveal user preferences that extend beyond movie characteristics. Furthermore, this item-based approach provides the second-best explanation style with respect to potentially increasing the effectiveness of user choices. Thus, the hybrid method provides all users with one of the two best types of recommendation explanations. To ensure that the hybrid method is capable of providing relevant and actionable explanations, we chose a switching hybridization design that does not algebraically combine the predictions of its individual component methods; instead, this design utilizes the "raw" prediction of its best-performing individual component as its final recommendation in each situation (a compact sketch of this switching logic follows below).

To test our proposed method and to compare it with the existing family of contemporary recommendation algorithms, we conducted an empirical study. This investigation examined rating datasets from two distinct real-world recommender systems. The study compared the predictive accuracy of the developed hybrid method with the accuracy levels of various key recommendation methods, including the highly accurate algorithm that won the Netflix Prize. The results are consistent for both datasets, indicating the generalizability of the empirical findings not only for the domain of motion pictures but also for the domain of recommendation systems as a whole.
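As noted above, the switching design selects, for each user, the component whose raw prediction is used. The following sketch shows one way such a rule could be stated; the component objects with predict and explain methods and the per-user holdout set are hypothetical assumptions for illustration, not our exact implementation.

    # Sketch of the switching hybridization design: for each user, emit the
    # raw prediction of whichever component performs better on that user's
    # holdout ratings, so the component's native explanation style is kept.

    def choose_component(user, cb_model, cf_model, holdout):
        """Return the component with the lower mean absolute holdout error."""
        def error(model):
            return sum(abs(model.predict(user, m) - r) for m, r in holdout) / len(holdout)
        return cb_model if error(cb_model) <= error(cf_model) else cf_model

    def recommend(user, movie, cb_model, cf_model, holdout):
        component = choose_component(user, cb_model, cf_model, holdout)
        # The "raw" prediction of the selected component is used directly,
        # together with that component's inherent explanation style.
        return component.predict(user, movie), component.explain(user, movie)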


In particular, the study revealed that two groups of users exist. The larger of these two groups (approximately two thirds of users) can be characterized well by our proposed attribute-based model of user preferences. For these users, explicit preference modeling outperforms the CF component of our hybrid method; thus, the attribute-based component of the hybrid provides precise rating predictions and generates recommendation explanations in the pros-and-cons style, which is the most effective explanatory approach. The second, smaller group of users (approximately one third of users) appears to form its movie preferences based on factors other than movie attributes. For this group, item-based CF provides more reliable rating predictions than the model developed in this study, i.e., item similarity is more descriptive of these users' preferences than attribute-based considerations. For these users, an emphasis on the similarity of recommended movies to previously viewed films is the most productive approach for recommendations and explanations. Thus, the hybrid approach of this study ensures that each user receives the most precisely predicted recommendations for his or her user type and that these recommendations are supported by the most effective explanations for the user in question.

On the whole, our hybrid method outperformed all of the examined collaborative filtering techniques with respect to predictive accuracy; moreover, as mentioned above, this method consistently provides each user of an RS with recommendation explanations in the most effective explanation style for the user in question. Notably, the prediction accuracy of our hybrid method was greater than that of the algorithm that won the Netflix Prize; this algorithm is the most accurate recommendation algorithm that has been published but possesses no inherent capability to provide explanations for its recommendations. This finding regarding the accuracy and appropriateness of the hybrid approach constitutes the main contribution of the current thesis to the literature in the field of recommendation systems.

To recapitulate the above discussion, our results and contributions to research can be briefly summarized as follows:

(i) We extended the keyword explanation style by integrating negative cues into this style and theoretically established that the resulting pros-and-cons explanation style should be more effective than the keyword style with respect to justifying recommendations in a manner that improves a user's decision-making process.


(ii) We developed a content-based recommendation algorithm for the domain of multimedia products, i.e., for generating recommendations of motion pictures. This algorithm outperforms the key recommendation algorithms for the majority of users and is capable of providing users with recommendation explanations that effectively support their decision-making processes.

(iii) We developed a novel statistical approach for the estimation of highly underdetermined regression models. This approach employs a set of auxiliary regressions that estimate one regression parameter at a time. The initial estimates are then corrected for omitted variable bias and multicollinearity and subsequently optimized to further reduce prediction errors.

(iv) We revealed the existence of two substantially large groups of users of movie recommender systems; these two types of users form their preferences in different ways. Providing each group of users with recommendations through the method that captures its preferences more appropriately can produce a substantial increase in the prediction accuracy of a recommender system.

(v) We demonstrated that a carefully designed content-based hybrid recommendation method can outperform collaborative filtering algorithms with respect to prediction accuracy.

(vi) We provided empirical support for the finding of previous research that "[recommendation] agents should think like the people they are attempting to help" (Aksoy et al. 2006, p. 310), together with empirical procedures that detail how this objective can be achieved.

The next section discusses the implications of our findings.

5.2 Discussion and Implications

Even the most accurate recommendation algorithm is subject to prediction errors. Thus, recommendation systems that attempt to help users make better choices should account for factors that extend beyond the current considerations of rating prediction and broaden their horizons to encompass not only various aspects of the recommendation process but also explanation facilities that further increase users' choice efficiency.


Recommender system providers should attempt to increase their understanding of the criteria that users employ to reach decisions, and they should modify their RS algorithms to align the process of recommendation generation with these criteria. Because different users base their choices on different criteria, a recommender system should employ various recommendation processes that match the decision-making processes of individual users and incorporate each user's underlying choice-making criteria into a personalized recommendation process. In other words, instead of employing one algorithm for all generated recommendations, a recommendation system should handle its users in an individually tailored fashion. To accomplish this objective, a recommender system should be a hybrid of several recommendation methods, each aligned with the choice-making criteria of a specific user group; this type of RS can thus provide recommendations to a user via the constituent recommendation method that is most reflective of that user's decision criteria.

In addition, initiatives should be implemented to increase user comprehension of RS recommendations. In particular, an explanation facility should become an integral part of recommender systems. However, this facility should be tightly coupled with the recommender algorithm. If recommendations for different users are produced differently and appropriately (as discussed above), the explanations of these recommendations should also reflect the underlying process of recommendation generation and emphasize the aspects of recommendations that are relevant for a user's decision-making processes. This approach can increase a user's choice effectiveness and compensate for algorithmic prediction errors by allowing users to assess the quality and suitability of recommendations before they commit to their choices.

Furthermore, recommendation explanations provide additional decision-supporting information that allows users to better address the context in which a decision is made and to evaluate other fine-grained aspects of a decision's implications. In other words, explanations can allow users to consider aspects of a decision that are difficult for an automated recommendation agent to address. For instance, if Thorsten, who enjoys westerns, is choosing a movie to watch after dining with his spouse Claudia, he will be unlikely to choose a protracted Clint Eastwood classic for this occasion. Instead, he will appreciate a recommendation that hints at an entertaining or romantic aspect of a western movie, because this type of recommendation will allow him to choose a movie that is best suited for his specific decision context, namely, a movie that both he and his wife will enjoy.

The alignment of explanations with a user's decision-relevant characteristics can increase the user's confidence in the recommendations of an RS and the user's choice efficiency. In addition, the provision of explanations that are understandable and actionable to a user can increase the user's trust in, acceptance of, and loyalty to a particular recommendation system as a whole.

The next section discusses methods for improving the proposals of the current thesis and suggests directions for future research.

5.3 Limitations and Avenues for Future Research

No study can completely address all of the facets and nuances of a topic, and all investigations, including this thesis, involve certain limitations. In the following paragraphs, we discuss the limitations of our research and illustrate ways in which the work of this thesis may be improved and extended.

First, because our hybrid method utilizes the switching hybridization design and therefore uses the raw predictions of an individual recommendation approach as its final predictions, the recommendations that are provided by this hybrid method remain subject to the strengths and weaknesses of the individual approaches (see Section 2.1.3, esp. Table 2.6). In other words, the hybridization itself does not address the limitations of each individual recommendation technique that is a constituent of the hybrid, because all of the recommendations of our hybrid approach are directly provided by these individual techniques. However, through the incorporation of temporal effects in the preference model of the CB component of this hybrid approach, we have ensured that this approach is sensitive to changes in user preferences over time. Thus, it can be argued that these temporal considerations substantially reduce the susceptibility of the CB component of our hybrid method to the "stability vs. plasticity" problem.


Furthermore, it can be argued that the switching property of our hybrid also helps mitigate the problems of its individual component methods. For instance, the use of the CF method addresses the "new user" issue of the CB component of the hybrid in the following manner. Our CB component requires users to provide at least 12 ratings (see Sections 3.2.2 and 3.3.2) before it begins generating recommendations; therefore, for newer users, the hybrid switches to its item-based CF component, which does not have this limitation and can therefore supply a user with recommendations immediately after receiving a single rating from the user in question. After the system has obtained sufficient data from a new user, its predictions are switched to the method that best captures this user's preference structure. However, it must be noted that, depending on which recommendation approach is best for a particular user, relatively low-quality recommendations and explanations may be provided to certain new users until adequate quantities of data have been obtained from these users.

Similarly, the "new item" problem of item-based CF is addressed in a natural manner within the context of our hybrid approach. In particular, a new item cannot be recommended by CF because of its lack of ratings; however, this item can be recommended by the CB component of the hybrid, thus ensuring an inflow of ratings for such items from the users who are served by the CB recommendation model. The "new item" problem can also be mitigated further by augmenting the movie-user matrix with vectors of movie features, as suggested in Section 3.3.2 (a sketch of this idea follows below). This type of augmentation would allow the CF method to determine which items are similar to a new item based on the composition of movie properties, thereby reducing the importance of the CB-driven rating flow for new items.

The "sparsity" problem of item-based CF can be addressed through the CB method of the hybrid approach. In particular, the hybrid will produce CB recommendations until enough data are acquired to establish the requisite overlap of item rating profiles and permit CF-based recommendations of certain items. However, if the users in this situation only rate recommended items, the "overspecialization" problem of the CB approach may propagate through the entire recommendation system. In other words, to function appropriately in a practical setting, our hybrid method requires the implementation of a mechanism that forces users to rate items beyond the "specialization profile" of the CB component, particularly in the early stages of a recommendation system's life cycle, when the number of ratings that have been acquired from users is low.
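To illustrate the feature-augmentation idea mentioned above, the following sketch computes an item-to-item similarity over rating vectors concatenated with movie feature vectors, so that a brand-new movie with no ratings can still be matched to similar items. The weighting factor, dimensions, and all vectors are hypothetical assumptions, not the specification from Section 3.3.2.

    import numpy as np

    # Sketch: cosine similarity over the concatenation of an item's rating
    # column and its feature vector; alpha balances the two parts.

    def augmented_similarity(ratings_a, ratings_b, features_a, features_b, alpha=0.5):
        va = np.concatenate([(1 - alpha) * ratings_a, alpha * features_a])
        vb = np.concatenate([(1 - alpha) * ratings_b, alpha * features_b])
        denom = np.linalg.norm(va) * np.linalg.norm(vb)
        return float(va @ vb / denom) if denom else 0.0

    # A new movie has no ratings yet, but feature overlap still yields a
    # nonzero similarity to an established movie.
    new_movie_ratings = np.zeros(5)
    old_movie_ratings = np.array([4.0, 0.0, 3.5, 5.0, 0.0])
    new_features = np.array([1.0, 0.0, 1.0])   # e.g., western, not comedy, Eastwood
    old_features = np.array([1.0, 0.0, 1.0])
    print(augmented_similarity(new_movie_ratings, old_movie_ratings,
                               new_features, old_features))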


The remaining problems of the individual methods, such as the potential "overspecialization" of our CB method and the vulnerability of the item-based CF approach to "shilling attacks", "starvation", and "stability vs. plasticity" problems, may persist and surface for any particular user, depending on the method that the switching rule of the hybrid approach has determined to be the most appropriate way to predict the user's preferences. Thus, we encourage researchers to search for ways to counteract these issues. However, we stress that the solutions to these problems should not impede the ability of the individual methods to provide explanations. That is, they should not violate a method's inherent concept of providing recommendations, because such a violation would alter the actual reason for the emergence of a recommendation and would thus break the connection between the recommendation process and its inherent explanation style.

Another research direction that is raised by our thesis relates to verifying the effectiveness of our proposed pros-and-cons explanation style. Although the utility of this explanation style, its effectiveness for the decision-making processes of a user, and the ability of the proposed algorithm to provide this style of explanation were theoretically established in previous chapters of this thesis, we cannot quantify the degree to which explanations presented in our proposed style actually increase choice effectiveness. This improvement may be substantial or only marginal. Similarly, it can be argued that the effectiveness of the pros-and-cons explanation style may depend on the nuances of explanation formulation. These nuances include the following considerations: the optimal number of attributes that should be incorporated into an explanation; the valence of and the balance between the positive and negative cues that are included in an explanation; the wording and the length of an explanation; the location and design of an explanation in the user interface of a recommender system; and a variety of other issues. These topics were not addressed in the current thesis; however, additional studies that address these subjects could provide empirical verification of our theoretically founded propositions and would increase our understanding of the nature of effective explanations for recommendation systems.

Our proposed recommendation method provided consistent results for two real-world datasets with different properties and therefore appears to be a generalizable approach that may be utilized throughout the domain of movie recommendations. We encourage further studies that test the suitability, applicability, and effectiveness of our method in other real-world applications, including contexts that involve the recommendation of other types of items and products.


We would also like to pursue various research directions that would enrich the modeling aspects of our method, such as extending the list of item attributes, adding interaction effects to the suggested approach, and accounting for non-linear attribute preference functions and/or non-linear temporal changes in user preferences. The exploration of these factors could increase the explanatory power of the multi-attribute preference model, which could improve the prediction accuracy of our algorithm and allow it to capture the preferences of greater proportions of users, including users whose preferences were not adequately captured by our proposed model during the course of the current study. For instance, short-term fluctuations in user preferences and the seasonality of user and item biases could be addressed through an analysis of the residual distribution of the predictions of our content-based method. If regular features in this residual distribution can be revealed and attributed to particular types of effects, then appropriate cyclic parameters could be incorporated into our attribute preference model, enhancing its predictive power. For example, if the parameter estimation algorithm captures an increase in a user's preference for the genre "family" that has occurred each December for the past decade, then the model could account for these fluctuations and confidently recommend Christmas movies to the user in question each December.

The proposed algorithm could also be improved by enriching the representation of user profiles through the imputation of part-worths via similarity-based techniques. This type of imputation could increase the number of users for whom our CB algorithm can provide reliable rating predictions by revealing attribute preferences that are initially "hidden" from this algorithm. Furthermore, this improvement would also increase the number of items that could potentially be recommended to a user. For example, if a user who likes both action movies and westerns has only rated westerns, our content-based algorithm will not be able to deduce the user's preference for action movies because of a lack of appropriate data. Therefore, this user would never receive a recommendation for an action movie. In this situation, a user-based CF could determine that other users with similar ratings provide high ratings to action movies. This information could then be used to impute the part-worths for the "action" genre and for other attributes that are included in the "source" user profiles (e.g., actors, directors, and movie budgets) into the incomplete profile of the active user. Obviously, this type of imputation would require a very cautious approach to ensure that the enriched user profile remains descriptive of the active user's preferences and balanced with respect to the relative importance of different attributes. One possible approach to addressing this concern is to rescale the imputed part-worths based on the values of the known part-worths. Another possible approach to this type of imputation could be based on the similarities or correlations of the known part-worths among the profiles of different users.
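A minimal sketch of such a rescaled, similarity-weighted imputation is shown below; the profile structure, donor profiles, and similarity weights are hypothetical illustrations rather than a specification of the proposed extension.

    # Sketch: impute missing part-worths from similar users' profiles and
    # rescale them to the magnitude of the active user's known part-worths.

    def impute_part_worths(active, donors, weights):
        """active: {attribute: part-worth} with known attributes only;
        donors: complete profiles of similar users;
        weights: one similarity weight per donor profile."""
        known = set(active)
        # Relative magnitude of the active user's part-worths versus the
        # donors' part-worths on the attributes they share.
        overlap = [abs(d[a]) for d in donors for a in known if a in d]
        donor_scale = sum(overlap) / len(overlap) if overlap else 1.0
        active_scale = sum(abs(v) for v in active.values()) / max(len(active), 1)
        rescale = active_scale / donor_scale if donor_scale else 1.0

        imputed = dict(active)
        missing = {a for d in donors for a in d if a not in known}
        for attr in missing:
            num = sum(w * d[attr] for d, w in zip(donors, weights) if attr in d)
            den = sum(w for d, w in zip(donors, weights) if attr in d)
            if den:
                imputed[attr] = rescale * num / den  # similarity-weighted mean
        return imputed

    # A westerns-only profile borrows an "action" part-worth from similar users:
    active = {"genre:western": 0.8}
    donors = [{"genre:western": 0.6, "genre:action": 0.5},
              {"genre:western": 1.0, "genre:action": 0.7}]
    print(impute_part_worths(active, donors, weights=[0.9, 0.6]))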


Furthermore, our empirical study revealed the existence of two large user groups that form their preferences in different ways. The larger of these two groups could be reasonably well described by our multi-attribute preference model and therefore received recommendations that were predicted by this model; for the second group of users, we used predictions from item-based CF. Although the predictions from the item-based CF approach allowed for substantial improvements in the prediction accuracy of our hybrid method, this approach does not necessarily provide the best description of the underlying preference structures for all of the users in this second group. It is possible that the users in this group may be further differentiated with respect to either the criteria that they utilize for their movie choices or the methods that best predict their movie ratings. We argue that further analysis of the users in this second group, together with the application of recommendation methods that more accurately capture their preferences, may be a fruitful means of increasing both the accuracy of predicted ratings and the effectiveness of explanations for the hybrid method. Thus, we encourage researchers to examine this issue more deeply, and we suggest that recommender system providers combine several different recommendation techniques into their recommendation systems instead of constructing a system around the single algorithm that produces the best overall performance.

Finally, a "joint product" of the current thesis is the mathematical core of our algorithm, which specifies a method for estimating the parameters of underdetermined regression models. As discussed, the parameters that were obtained in the estimation step of the algorithm provided reasonably accurate estimations of user ratings. Notably, in many cases, the estimation of 708 parameters was accomplished on the basis of as few as 6 data points. The utilization of 6 additional data points as a holdout set during the optimization process improved the resulting prediction accuracy of the algorithm by 1% and 5% with respect to MAE and RMSE, respectively. We suggest that these results are notable and that the estimation method itself merits attention from other research fields that involve the estimation of many parameters from a small number of data points. Thus, we are eager to expand the applications of our estimation method to other types of problems beyond item recommendations; we regard this direction as a highly promising field for our future endeavors.
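For reference, the two error metrics reported above can be computed as follows; the example predictions and ratings are arbitrary illustrations.

    import math

    # Mean absolute error (MAE) and root mean squared error (RMSE) over
    # held-out ratings, the accuracy measures used throughout this thesis.

    def mae(predicted, actual):
        return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

    def rmse(predicted, actual):
        return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

    print(mae([3.7, 4.1], [4.0, 4.0]), rmse([3.7, 4.1], [4.0, 4.0]))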


In addition to the aforementioned specific research topics that result from the work of the current thesis, we would also like to inspire researchers to pursue more general research directions, particularly with respect to investigations of the applications of recommendation systems in real-world business environments. As observed in the background chapter of this thesis, an analysis of the literature on recommendation systems reveals that the overwhelming majority of the published works on RSs have originated from the field of computer science. This phenomenon is somewhat unsurprising because recommendation systems originally emerged from this field and have been developed in this context for twenty years (since the first publication on collaborative filtering by Goldberg et al. (1992)). However, at present, recommender systems are widely present throughout commercial applications (e.g., online stores; travel agencies; and restaurant, movie, and music recommenders), yet this topic has largely been ignored by researchers in economics and in marketing. In fact, we were able to identify fewer than ten publications on recommender systems in A-ranked marketing journals. Thus, the following fundamental question must be addressed: Are recommender systems a valuable instrument of the marketing mix? Our answer to this question is a definite "yes". We therefore encourage researchers to devote greater effort to studying the roles, properties, potentials, capabilities, values, advantages, problems, and consequences that recommender systems produce for businesses in general and for marketing-related initiatives in particular.

One topic that we suggest exploring is the qualification and quantification of the up-selling and cross-selling potentials of RSs and the effects of RSs on a firm's revenue and turnover. Conventional wisdom suggests that RSs should demonstrate both types of potential because recommendations make consumers aware of products that are highly likely to match their interests and/or needs; without the influence of RSs, consumers may not realize that these products even exist. Thus, the recommendation of these items increases the probability that they will be purchased. However, this reasoning raises the question of which products should be recommended by RSs: for instance, one potential answer might be products that are complementary to items that a customer owns or knows; alternatively, perhaps RSs should recommend products that are typically consumed in combination or items that extend the use of a "main" product.


The answers to these questions may vary depending on whether a firm is focused on its revenue, its stock, consumer satisfaction, customer value, or a combination of these factors. These considerations could all be addressed through further research. Moreover, another issue that could be explored is the question of whether RSs should merely aid customers in making optimal choices or whether these systems can and should persuade consumers to purchase suboptimal items.

Another interesting research direction relates to consumer behavior and the influence of RSs on this behavior. In other words, how do consumers react to the presence of a recommendation system on an e-commerce web site? Do they alter their search and buying behaviors because of RSs? Do RSs help customers find the goods that they were initially seeking, or do they cause consumers to alter their intentions and purchase a different product? How can firms utilize alternative RS designs to achieve their monetary and non-monetary measures of success? What influence do RSs have on customer retention rates and customer lifetime value? To what extent do RSs impact the quality of the relationship between a firm and its customers? What properties of an RS impact consumers' trust in, loyalty towards, and commitment to an RS provider?

Given that recommendation systems can pursue different goals and optimize different objective functions, the question of whether users are sensitive to the objectives of an RS is also an interesting topic to pursue. In other words, can consumers "uncover" the goals that a particular recommender follows? Do consumer perceptions of the objectives of an RS impact their disposition towards the recommendation system itself and the RS provider? If so, does it make sense, from a marketing point of view, to establish an independent institution that certifies RSs and labels them according to their respective goals or other properties, such as the algorithms that they employ or the error metrics that they use? This type of certification process could be implemented to increase user trust in recommendations and user loyalty to an RS provider. Alternatively, it might be sensible for e-commerce providers to employ an RS from an independent RS provider and label this RS appropriately to communicate the impartiality of the system's recommendations.


Furthermore, we suggest exploring ways in which RS data and approaches can be used in applications beyond the provision of recommendations. For instance, marketers could utilize user preference data from RSs to acquire deeper insights into the properties of their customer base and for other marketing processes, such as market segmentation, product development, the optimization of a firm's stock, product bundling, the creation of marketing collateral, or the initiation of a direct mailing campaign. Recommender algorithms could also be used to analyze scanner data and to optimize the loyalty programs of conventional retailers, among other functions.

We also believe that research on RSs should be more interdisciplinary. In other words, RS researchers should devote greater attention to and utilize knowledge from adjacent disciplines during the course of their work. During our work on this thesis, we observed that research in the fields of computer science (CS) and marketing tends to follow a "silo" approach: the concepts and methods of marketing are ignored in CS studies and vice versa. For instance, marketing offers not only a variety of models of utility and preference but also methods to measure and analyze consumer choice-making, such as multi-attribute utility theory, conjoint analysis, self-explicated models, preference-based segmentation, and the various compensatory and non-compensatory decision rules that are employed by consumers under different conditions. CS could employ this readily available knowledge regarding consumer preferences, consumer behaviors, and the methods of addressing these topics when developing recommendation algorithms and techniques. Instead, we observe that CS is primarily focused on predicting final ratings with numeric methods and largely neglects to exert any effort to understand consumers, who are the ultimate users of recommender algorithms. Other marketing concepts, including loyalty, trust, satisfaction, commitment, acceptance, credibility, consumer retention rates, and lifetime value, could also be used to evaluate the quality of RSs; these concepts are more important to the businesses that RSs "serve" than the predictive accuracy of the RS algorithms. Thus, as discussed earlier, to maintain consumer loyalty to and trust in an RS, it is more important to provide consumers with an instrument to address recommendation errors than with recommendations that are accurate much of the time but occasionally erroneous.

However, marketing researchers should also exert the effort to acquire a better understanding of the algorithmic issues of RSs so as to be able to correctly address the relevant marketing properties of RSs in their own research.


As shown in this thesis, different recommendation algorithms inherently provide different levels of explanatory detail, and these details are directly related to consumer loyalty, satisfaction, trust, and various other traits. It could also be argued that the applicability of different recommendation methods may be constrained by the nature of the problem domain to which these methods are applied. In other words, the suitability of different algorithms for fulfilling a particular recommendation task or achieving a particular marketing objective can vary depending on the recommendation setting, the properties of the items that must be recommended, the data that are available, and the ability of an algorithm to process this type of data; this variance can produce deviations in certain marketing metrics. Thus, marketing researchers should regard their results cautiously, particularly if they attempt to generalize their findings to the entire RS domain.

In summary, both the marketing and CS research streams should recognize how deeply their respective research issues are interwoven and interconnected; each of these fields should therefore rely on the expertise and methods of the other during the course of conducting RS research. We encourage CS researchers to address marketing concepts, such as anticipated emotions, in their research. For instance, these concepts could be considered when developing algorithms that account for different decision contexts, such as watching a movie alone on a weekday, watching a movie during the weekend in a large family group, or watching a movie on a Friday evening with one's girlfriend. Moreover, we encourage marketing researchers to carefully account for the properties of different recommendation algorithms, such as the suitability of particular algorithms for fulfilling a specific recommendation task, prior to conceptualizing a research project on a particular marketing issue, such as the ability of a group RS to help customers make better choices.

Bibliography


Adomavicius, Gediminas and Alexander Tuzhilin (2005), "Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions", in IEEE Transactions on Knowledge and Data Engineering, pp. 734-749.
Adomavicius, Gediminas, Ramesh Sankaranarayanan, Shahana Sen, and Alexander Tuzhilin (2005), "Incorporating Contextual Information in Recommender Systems Using a Multidimensional Approach", in ACM Transactions on Information Systems (TOIS), Vol. 23, Issue 1, pp. 103-145.
Adomavicius, Gediminas and Alexander Tuzhilin (2008), "Context-Aware Recommender Systems", in Proceedings of the 2008 ACM Conference on Recommender Systems (RecSys '08), pp. 335-336.
Ardissono, Liliana, Anna Goy, Giovanna Petrone, Marino Segnan, and Pietro Torasso (2003), "Intrigue: Personalized Recommendation of Tourist Attractions for Desktop and Handset Devices", in Applied Artificial Intelligence, Vol. 17, pp. 687-714.
Aksoy, Lerzan, Paul N. Bloom, Nicholas H. Lurie, and Bruce Cooil (2006), "Should Recommendation Agents Think Like People?", in Journal of Service Research, Vol. 8, No. 4, pp. 297-315.
Aksoy, Lerzan, Bruce Cooil, and Nicholas H. Lurie (2011), "Decision Quality Measures in Recommendation Agents Research", in Journal of Interactive Marketing, Vol. 25, pp. 110-122.
Allan, James, Jaime Carbonell, George Doddington, Jonathan Yamron, and Yiming Yang (1998), "Topic Detection and Tracking Pilot Study Final Report", in Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, pp. 194-218.


Alba, Joseph W. and Howard Marmorstein (1987), "The Effects of Frequency Knowledge on Customer Decision Making", in Journal of Consumer Research, Vol. 14, pp. 14-26.
Alspector, Joshua, Aleksander Kolcz, and Nachimuthu Karunanithi (1998), "Comparing Feature-Based and Clique-Based User Models for Movie Selection", in Proceedings of the Third ACM Conference on Digital Libraries, Pittsburgh, PA, pp. 11-18.
Anand, Sarabjot S. and Bamshad Mobasher (2005), "Intelligent Techniques for Web Personalization", in Mobasher, Bamshad and Sarabjot Anand (eds.), "Intelligent Techniques for Web Personalization", Lecture Notes in Computer Science, Vol. 3169, Springer, Heidelberg, Berlin, pp. 1-36.
Andersen, Stig K., Kristian G. Olesen, and Finn V. Jensen (1990), "HUGIN - A Shell for Building Bayesian Belief Universes for Expert Systems", Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
Anderson, Chris (2004), "The Long Tail", in Wired, Vol. 12, Issue 10, pp. 170-177.
Ansari, Asim, Skander Essegaier, and Rajeev Kohli (2000), "Internet Recommendation Systems", in Journal of Marketing Research, Vol. 37 (August), pp. 363-376.
Ariely, Dan (2000), "Controlling the Information Flow: Effects on Consumers' Decision Making and Preferences", in Journal of Consumer Research, Vol. 27 (2), pp. 233-248.
Ariely, Dan, John G. Lynch Jr., and Manuel Aparicio IV (2004), "Learning by Collaborative and Individual-Based Recommendation Agents", in Journal of Consumer Psychology, Vol. 14 (1&2), pp. 81-95.
Augistin, Vernon E. (1927), "Motion Pictures Preferences", in Journal of Delinquency, Vol. 7, pp. 206-209.
Austin, Bruce A. (1981), "Film Attendance: Why College Students Chose to See Their Most Recent Film", in Journal of Popular Film and Television, Vol. 9, pp. 43-49.
Austin, Bruce A. (1989), "A Factor Analysis Study of Attitudes Toward Motion Pictures", in Journal of Social Psychology, Issue 117, pp. 211-217.
Austin, Bruce A. (1989), "Immediate Seating: A Look at Movie Audiences", Wadsworth, Inc.


Avery, Christopher and Richard Zeckhauser (1997), "Recommender Systems for Evaluating Computer Messages", in Communications of the ACM, Vol. 40, Issue 3, pp. 88-89.
Baeza-Yates, Ricardo and Berthier Ribeiro-Neto (1999), "Modern Information Retrieval", Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.
Balabanovic, Marko and Yoav Shoham (1997), "Fab: Content-Based, Collaborative Recommendation", in Communications of the ACM, Vol. 40, No. 3, pp. 66-72.
Baltrunas, Linas (2008), "Exploiting Contextual Information in Recommender Systems", in Proceedings of the 2008 ACM Conference on Recommender Systems (RecSys '08), pp. 295-298.
Baltrunas, Linas and Francesco Ricci (2009), "Context-Dependent Items Generation in Collaborative Filtering", in ACM Workshop on Context-Aware Recommender Systems (CARS 2009), pp. 295-298.
Basu, Chumki, Haym Hirsh, and William Cohen (1998), "Recommendation as Classification: Using Social and Content-Based Information in Recommendation", in AAAI '98/IAAI '98: Proceedings of the Fifteenth National/Tenth Conference on Artificial Intelligence/Innovative Applications of Artificial Intelligence, pp. 714-720.
Baudisch, Patrick (1999), "Joining Collaborative and Content-Based Filtering", in Proceedings of the ACM Conference on Human Factors in Computing Systems, pp. 1-5.
Bell, Robert and Yehuda Koren (2007), "Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights", in Proceedings of the 2007 Seventh IEEE International Conference on Data Mining (ICDM'07), pp. 43-52.
Bell, Robert, Yehuda Koren, and Chris Volinsky (2007), "Modeling Relationships at Multiple Scales to Improve Accuracy of Large Recommender Systems", in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '07), pp. 95-104.


Bell, Robert, Yehuda Koren, and Chris Volinsky (2007b), "The BellKor Solution to the Netflix Prize", http://www2.research.att.com/~volinsky/netflix/ProgressPrize2007BellKorSolution.pdf [retrieved on 20.06.2011].
Bell, Robert, Yehuda Koren, and Chris Volinsky (2008), "The BellKor 2008 Solution to the Netflix Prize", http://www2.research.att.com/~volinsky/netflix/Bellkor2008.pdf [retrieved on 20.06.2011].
Bennett, James and Stan Lanning (2007), "The Netflix Prize", in Proceedings of KDD Cup and Workshop, August 12, 2007.
Bettman, James R., Eric J. Johnson, and John W. Payne (1991), "Consumer Decision Making", in Thomas S. Robertson and Harold H. Kassarjian (eds.), "Handbook of Consumer Behavior", Prentice Hall, pp. 50-84.
Bilgic, Mustafa and Raymond J. Mooney (2005), "Explaining Recommendations: Satisfaction vs. Promotion", in Proceedings of Beyond Personalization 2005: The Workshop on the Next Stage of Recommender Systems Research at the 2005 International Conference on Intelligent User Interfaces (IUI'05), pp. 1-6.
Billsus, Daniel and Michael J. Pazzani (1999), "A Personal News Agent that Talks, Learns and Explains", in Proceedings of the 3rd ACM Annual Conference on Autonomous Agents (AGENTS'99), pp. 268-275.
Billsus, Daniel and Michael J. Pazzani (2000), "User Modeling for Adaptive News Access", in User Modeling and User-Adapted Interaction, Vol. 10 (2-3), pp. 147-180.
Billsus, Daniel, Michael J. Pazzani, and James Chen (2000), "A Learning Agent for Wireless News Access", in Proceedings of the 5th ACM International Conference on Intelligent User Interfaces (IUI'00), pp. 33-36.
Bodapati, Anand V. (2008), "Recommendation Systems with Purchase Data", in Journal of Marketing Research, Vol. 45 (1), pp. 77-93.
Breese, John S., David Heckerman, and Carl Kadie (1998), "Empirical Analysis of Predictive Algorithms for Collaborative Filtering", in Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence (UAI-98), San Francisco, July 24-26, pp. 43-52.


Brézillon, Patrick J. and Jean-Charles Pomerol (1996), "Misuse and Nonuse of Knowledge-Based Systems: The Past Experiences Revisited", in Patrick Humphreys, Liam Bannon, Andrew McCosh, Piero Migliarese, and Jean-Charles Pomerol (eds.), "Implementing Systems for Supporting Management Decisions", Chapman and Hall, pp. 44-60.
Buchanan, Bruce G. and Edward H. Shortliffe (1984), "Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project", Addison-Wesley, Reading, MA.
Burke, Robin D. (2002), "Hybrid Recommender Systems: Survey and Experiments", in User Modeling and User-Adapted Interaction, Vol. 12, Issue 4, pp. 331-370.
Burke, Robin D., Kristian J. Hammond, and Benjamin C. Young (1997), "The FindMe Approach to Assisted Browsing", in IEEE Expert, Vol. 12, pp. 32-40.
Canny, John (2002), "Collaborative Filtering with Privacy via Factor Analysis", in Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'02), pp. 238-245.
Carroll, J. Douglas and Paul E. Green (1995), "Psychometric Methods in Marketing Research: Part I, Conjoint Analysis", in Journal of Marketing Research, Vol. 32 (4), pp. 385-391.
Chakrabarti, Soumen (2002), "Mining the Web: Discovering Knowledge from Hypertext Data", 1st edition, Morgan Kaufmann Publishers, San Francisco.
Chakravarti, Dipankar and John G. Lynch (1983), "A Framework for Examining Context Effects on Consumer Judgment and Choice", in R. P. Bagozzi and Alice M. Tybout (eds.), "Advances in Consumer Research", Vol. 10, Ann Arbor, MI: Association for Consumer Research, pp. 289-297.
Chen, Li (2009), "Adaptive Tradeoff Explanations in Conversational Recommenders", in Proceedings of the Third ACM Conference on Recommender Systems, ACM, pp. 225-228.
Claypool, Mark, Anuja Gokhale, Tim Miranda, Pavel Murnikov, Dmitry Netes, and Matthew Sartin (1999), "Combining Content-Based and Collaborative Filters in an Online Newspaper", in Proceedings of the ACM SIGIR'99 Workshop on Recommender Systems: Algorithms and Evaluation, pp. 1-8.
Cooke, Alan D. J., Harish Sujan, Mita Sujan, and Barton A. Weitz (2002), "Marketing the Unfamiliar: The Role of Context and Item-Specific Information in Electronic Agent Recommendations", in Journal of Marketing Research, Vol. 39, pp. 488-497.
Cooper-Martin, Elizabeth (1991), "Consumers and Movies: Some Findings on Experiential Products", in Advances in Consumer Research, Vol. 18, pp. 372-378.
Cooper-Martin, Elizabeth (1992), "Consumers and Movies: Information Sources for Experiential Products", in Advances in Consumer Research, Vol. 19, pp. 756-761.
Cosley, Dan, Shyong K. Lam, Istvan Albert, Joseph A. Konstan, and John Riedl (2003), "Is Seeing Believing? How Recommender System Interfaces Affect Users' Opinions", in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI'03), ACM, New York, NY, USA, pp. 585-592.
Corner, James L. and Craig W. Kirkwood (1991), "Decision Analysis Applications in the Operations Research Literature, 1970-1989", in Operations Research, Vol. 39, Issue 2, pp. 206-219.
Cotter, Paul and Barry Smyth (2000), "PTV: Intelligent Personalised TV Guides", in Proceedings of the 17th National Conference on Artificial Intelligence and 12th Conference on Innovative Applications of Artificial Intelligence, pp. 957-964.
Cramer, Henriette, Vanessa Evers, Satyan Ramlal, Maarten van Someren, Lloyd Rutledge, Natalia Stash, Lora Aroyo, and Bob Wielinga (2008), "The Effects of Transparency on Trust in and Acceptance of a Content-Based Art Recommender", in User Modeling and User-Adapted Interaction, Vol. 18 (5), pp. 455-496.
Czarkowski, Marek (2006), "A Scrutable Adaptive Hypertext", PhD thesis, University of Sydney.
Das, Abhinandan S., Mayur Datar, Ashutosh Garg, and Shyam Rajaram (2007), "Google News Personalization: Scalable Online Collaborative Filtering", in Proceedings of the 16th International Conference on World Wide Web (WWW'07), ACM, New York, pp. 271-280.


Delgado, Joaquin and Naohiro Ishii (1999), "Memory-Based Weighted-Majority Prediction for Recommender Systems", in Proceedings of the ACM SIGIR'99 Workshop on Recommender Systems: Algorithms and Evaluation, pp. 1-5.
De Vany, Arthur and David Walls (1999), "Uncertainty in the Movie Industry: Does Star Power Reduce the Terror of the Box Office?", in Journal of Cultural Economics, Vol. 23, pp. 285-318.
Diehl, Kristin, Laura J. Kornish, and John G. Lynch Jr. (2003), "Smart Agents: When Lower Search Costs for Quality Information Increase Price Sensitivity", in Journal of Consumer Research, Vol. 30 (June), pp. 56-71.
Dick, Alan S. and Kunal Basu (1994), "Customer Loyalty: Toward an Integrated Conceptual Framework", in Journal of the Academy of Marketing Science, Vol. 22, Issue 2, pp. 99-113.
Ding, Yi and Xue Li (2005), "Time Weight Collaborative Filtering", in Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pp. 485-492.
Doyle, Dónal, Alexey Tsymbal, and Pádraig Cunningham (2003), "A Review of Explanation and Explanation in Case-Based Reasoning", Technical Report, Department of Computer Science, Trinity College, Dublin.
Edwards, Ward (1954), "The Theory of Decision Making", in Psychological Bulletin, Vol. 51, pp. 380-417.
Edwards, Ward (1961), "Behavioral Decision Theory", in Annual Review of Psychology, Vol. 12, pp. 473-498.
El Helou, Sandy, Christophe Salzmann, Stéphane Sire, and Denis Gillet (2009), "The 3A Contextual Ranking System: Simultaneously Recommending Actors, Assets, and Group Activities", in RecSys '09: Proceedings of the Third ACM Conference on Recommender Systems, pp. 373-376.
Fishburn, Peter C. (1967), "Methods of Estimating Additive Utilities", in Management Science, Vol. 13 (7), pp. 435-453.


Fishburn, Peter C. (1968), "Utility Theory", in Management Science, Vol. 14 (5), pp. 335-378.
Fishburn, Peter C. (1970), "Utility Theory for Decision Making", Wiley, New York.
Fishburn, Peter C. (1988), "Nonlinear Preference and Utility Theory", Johns Hopkins University Press, Baltimore.
Fitzsimons, Gavan J. and Donald R. Lehmann (2004), "Reactance to Recommendations: When Unsolicited Advice Yields Contrary Responses", in Marketing Science, Vol. 23 (1), pp. 82-94.
Funk, Simon (2006), "Netflix Update: Try This at Home", http://sifter.org/~simon/journal/20061211.html [retrieved on 04.06.2011].
Gershoff, Andrew D., Ashesh Mukherjee, and Anirban Mukhopadhyay (2003), "Consumer Acceptance of Online Agent Advice: Extremity and Positivity Effects", in Journal of Consumer Psychology, Vol. 13, pp. 161-170.
Gigerenzer, Gerd, Peter M. Todd, and the ABC Research Group (1999), "Simple Heuristics That Make Us Smart", New York: Oxford University Press.
Goldberg, David, David Nichols, Brian M. Oki, and Douglas Terry (1992), "Using Collaborative Filtering to Weave an Information Tapestry", in Communications of the ACM, Vol. 35 (12), pp. 61-70.
Goldberg, Ken, Theresa Roeder, Dhruv Gupta, and Chris Perkins (2001), "Eigentaste: A Constant Time Collaborative Filtering Algorithm", in Information Retrieval, Vol. 4, No. 2, pp. 133-151.
Golub, Gene H. and William Kahan (1965), "Calculating the Singular Values and Pseudo-Inverse of a Matrix", in Journal of the Society for Industrial and Applied Mathematics, Series B: Numerical Analysis, Vol. 2, No. 2, pp. 205-224.
Green, Paul E., Yoram Wind, and Arun K. Jain (1972), "Preference Measurement of Item Collections", in Journal of Marketing Research, Vol. 9, pp. 371-377.
Green, Paul E. and Yoram Wind (1973), "Multiattribute Decisions in Marketing: A Measurement Approach", Hinsdale, IL.


Green, Paul E. and V. Srinivasan (1990), "Conjoint Analysis in Marketing: New Developments with Implications for Research and Practice", in Journal of Marketing, October 1990, pp. 3-15.
Grudin, Jonathan (1988), "Why CSCW Applications Fail: Problems in the Design and Evaluation of Organizational Interfaces", in Proceedings of the 1988 ACM Conference on Computer-Supported Cooperative Work (CSCW '88), pp. 85-93.
Gunawardana, Asela and Christopher Meek (2009), "A Unified Approach to Building Hybrid Recommender Systems", in RecSys '09: Proceedings of the Third ACM Conference on Recommender Systems, pp. 117-124.
Hennig-Thurau, Thorsten, Christian Friege, Sonja Gensler, Lara Lobschat, Arvind Rangaswamy, and Bernd Skiera (2010), "The Impact of New Media on Customer Relationships", in Journal of Service Research, Vol. 13, No. 3, pp. 311-330.
Hennig-Thurau, Thorsten, Mark B. Houston, and Gianfranco Walsh (2006), "Differing Roles of Success Drivers Across Sequential Channels: An Application to the Motion Picture Industry", in Journal of the Academy of Marketing Science, Vol. 34 (4), pp. 559-575.
Hennig-Thurau, Thorsten, Mark B. Houston, and Gianfranco Walsh (2007), "Determinants of Motion Picture Box Office and Profitability: An Interrelationship Approach", in Review of Managerial Science, Vol. 1 (1), pp. 65-92.
Hennig-Thurau, Thorsten and Alexander Klee (1997), "The Impact of Customer Satisfaction and Relationship Quality on Customer Retention: A Critical Reassessment and Model Development", in Psychology & Marketing, Vol. 14, pp. 737-764.
Hennig-Thurau, Thorsten, André Marchand, and Paul Marx (2012), "Can Automated Group Recommender Systems Help Consumers Make Better Choices?", in Journal of Marketing, accepted for publication and forthcoming.
Hennig-Thurau, Thorsten, Gianfranco Walsh, and Oliver Wruck (2001), "An Investigation into the Success Factors of Motion Pictures", in Academy of Marketing Science Review (at amsreview.org/amsrev/theory/hennig06-01.html).


Herlocker, Jonathan, Joseph A. Konstan, Al Borchers, and John T. Riedl (1999), "An Algorithmic Framework for Performing Collaborative Filtering", in SIGIR '99: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 230-237.
Herlocker, Jonathan L., Joseph A. Konstan, and John T. Riedl (2000), "Explaining Collaborative Filtering Recommendations", in Proceedings of the 2000 ACM Conference on Computer Supported Cooperative Work, ACM, New York, NY, USA, pp. 241-250.
Herlocker, Jonathan L., Joseph A. Konstan, and John T. Riedl (2002), "An Empirical Analysis of Design Choices in Neighborhood-Based Collaborative Filtering Algorithms", in Information Retrieval, Vol. 5, No. 4, pp. 287-310.
Herlocker, Jonathan L., Joseph A. Konstan, Loren G. Terveen, and John T. Riedl (2004), "Evaluating Collaborative Filtering Recommender Systems", in ACM Transactions on Information Systems, Vol. 22 (1), ACM, New York, NY, USA, pp. 5-53.
Hill, Will, Larry Stead, Mark Rosenstein, and George Furnas (1995), "Recommending and Evaluating Choices in a Virtual Community of Use", in Proceedings of the ACM CHI'95 Conference on Human Factors in Computing Systems, pp. 194-201.
Hirschman, Elizabeth C. and Morris B. Holbrook (1982), "Hedonic Consumption: Emerging Concepts, Methods and Propositions", in Journal of Marketing, Vol. 46 (Summer), pp. 92-101.
Holbrook, Morris B. and Elizabeth C. Hirschman (1982), "The Experiential Aspects of Consumption: Consumer Fantasies, Feelings, and Fun", in Journal of Consumer Research, Vol. 9 (September), pp. 132-140.
Horvitz, Eric, John Breese, and Max Henrion (1988), "Decision Theory in Expert Systems and Artificial Intelligence", in International Journal of Approximate Reasoning, Special Issue on Uncertainty in Artificial Intelligence, Vol. 2 (3), pp. 247-302. Also Stanford CS Technical Report KSL-88-13.
Ito, Tiffany A., Jeff T. Larsen, N. Kyle Smith, and John T. Cacioppo (1998), "Negative Information Weighs More Heavily on the Brain: The Negativity Bias in Evaluative Categorizations", in Journal of Personality and Social Psychology, Vol. 75, No. 4, pp. 887-900.
Jacoby, Jacob, Donald E. Speller, and Carol Kohn Berning (1974), "Brand Choice Behavior as a Function of Information Load: Replication and Extension", in Journal of Consumer Research, Vol. 1, pp. 33-42.
Jannach, Dietmar, Markus Zanker, Alexander Felfernig, and Gerhard Friedrich (2011), "Recommender Systems: An Introduction", Cambridge University Press, New York.
Johnson, Harry and Peter Johnson (1993), "Explanation Facilities and Interactive Systems", in IUI '93: Proceedings of the 1st International Conference on Intelligent User Interfaces, ACM, New York, NY, USA, pp. 159-166.
Johnston, Jack and John DiNardo (1997), "Econometric Methods", 4th edition, McGraw-Hill, New York.
Kahneman, Daniel and Amos Tversky (1984), "Choices, Values, and Frames", in American Psychologist, Vol. 39, pp. 341-350.
Kmenta, Jan (1971), "Elements of Econometrics", Macmillan, New York.
Kanouse, David E. and Reid L. Hanson (1972), "Negativity in Evaluations", in Edward E. Jones and David E. Kanouse (eds.), "Attribution: Perceiving the Causes of Behavior", Hillsdale, NJ: Lawrence Erlbaum Associates, pp. 47-62.
Keefer, Donald L., Craig W. Kirkwood, and James L. Corner (2002), "Summary of Decision Analysis Applications in the Operations Research Literature 1990-2001", Technical Report, Department of Supply Chain Management, Arizona State University [retrieved at http://www.informs.org/content/download/14833/178547/file/DAAppsSummaryTechReport.pdf, 30.06.2011].
Kim, Dohyun and Bong-Jin Yum (2005), "Collaborative Filtering Based on Iterative Principal Component Analysis", in Expert Systems with Applications, Vol. 28, pp. 823-830.


Klein, Noreen M. and Manjit S. Yadav (1989), "Context Effects on Effort and Accuracy in Choice: An Inquiry into Adaptive Decision Making", in Journal of Consumer Research, Vol. 15 (4), pp. 411-421.

Komarek, Paul (2004), "Logistic Regression for Data Mining and High-dimensional Classification", Doctoral Dissertation, Carnegie Mellon University, Pittsburgh, PA, USA [retrieved at http://www.autonlab.org/autonweb/14709/version/4/part/5/data/komarek:lr_thesis.pdf, on 05.07.2011].

Konstan, Joseph A., Bradley N. Miller, David Maltz, Jonathan L. Herlocker, Lee R. Gordon, and John Riedl (1997), "GroupLens: Applying Collaborative Filtering to Usenet News", in Communications of the ACM, Vol. 40, No. 3, pp. 77-87.

Konstan, Joseph A., John Riedl, Al Borchers, and Jonathan L. Herlocker (1998), "Recommender Systems: A GroupLens Perspective", in Recommender Systems: Papers from the 1998 Workshop (AAAI Technical Report WS-98-08), pp. 60-64.

Koren, Yehuda (2008), "Factorization Meets the Neighborhood: A Multifaceted Collaborative Filtering Model", in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 426-434.

Koren, Yehuda (2009), "Collaborative Filtering with Temporal Dynamics", in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 447-456.

Koren, Yehuda (2010), "Factor in the Neighbors: Scalable and Accurate Collaborative Filtering", in ACM Transactions on Knowledge Discovery from Data (TKDD), Vol. 4, No. 1, pp. 1-24.

Koren, Yehuda, Robert Bell, and Chris Volinsky (2009), "Matrix Factorization Techniques for Recommender Systems", in Computer, Vol. 42, Issue 8, IEEE Computer Society Press, Los Alamitos, CA, USA, pp. 42-49.

Koren, Yehuda and Robert Bell (2011), "Advances in Collaborative Filtering", in Ricci, Francesco, Lior Rokach, Bracha Shapira, and Paul B. Kantor [eds.], "Recommender Systems Handbook", Springer Science+Business Media LLC, pp. 145-186.


Lacave, Carmen and Francisco J. Díez (2004), "A Review of Explanation Methods for Heuristic Expert Systems", in The Knowledge Engineering Review, Vol. 19, pp. 133-146.

Lam, Shyong K. and John Riedl (2004), "Shilling Recommender Systems for Fun and Profit", in Proceedings of the 13th International Conference on World Wide Web (WWW '04), pp. 393-402.

Lange, Kenneth (2010), "Optimization (Springer Texts in Statistics)", Springer Verlag New York LLC.

Linden, Greg, Brent Smith, and Jeremy York (2003), "Amazon.com Recommendations: Item-to-Item Collaborative Filtering", in IEEE Internet Computing, Vol. 7, No. 1, pp. 76-80.

Lops, Pasquale, Marco de Gemmis, and Giovanni Semeraro (2011), "Content-based Recommender Systems: State of the Art and Trends", in Ricci, Francesco, Lior Rokach, Bracha Shapira, and Paul B. Kantor [eds.], "Recommender Systems Handbook", Springer Science+Business Media LLC, pp. 73-105.

Luce, R. Duncan (1992), "Where Does Subjective Expected Utility Fail Descriptively?", in Journal of Risk and Uncertainty, Vol. 5, pp. 5-27.

Lutz, Richard J. (1975), "Changing Brand Attitudes through Modification of Cognitive Structure", in Journal of Consumer Research, Vol. 1 (March), pp. 49-59.

Maimon, Oded and Lior Rokach [eds.] (2005), "The Data Mining and Knowledge Discovery Handbook", Springer Science+Business Media Inc.

Majchrzak, Ann and Les Gasser (1991), "On Using Artificial Intelligence to Integrate the Design of Organizational and Process Change in US Manufacturing", in AI & Society, Vol. 5, pp. 321-338.

Marx, Paul, Thorsten Hennig-Thurau, and André Marchand (2010), "Increasing Consumers' Understanding of Recommender Results: A Preference-based Hybrid Algorithm with Strong Explanatory Power", in Proceedings of the Fourth ACM Conference on Recommender Systems (RecSys '10), pp. 297-300.


McNee, Sean M., Shyong K. Lam, Joseph A. Konstan, and John Riedl (2003), "Interfaces for Eliciting New User Preferences in Recommender Systems", in Proceedings of the 9th International Conference on User Modeling (UM 2003), pp. 178-187.

McCarthy, Kevin, James Reilly, Lorraine McGinty, and Barry Smyth (2004), "Thinking Positively - Explanatory Feedback for Conversational Recommender Systems", in Proceedings of the European Conference on Case-Based Reasoning (ECCBR 2004) Explanation Workshop.

McGinty, Lorraine and Barry Smyth (2004), "Extending Comparison-Based Recommendation: A Review", poster at the British Computer Society's Specialist Group on Artificial Intelligence (AI-03).

McSherry, David (2005), "Explanation in Recommender Systems", in Artificial Intelligence Review, Vol. 24, Issue 2, pp. 179-197.

Mehta, Bhaskar, Thomas Hofmann, and Wolfgang Nejdl (2007), "Robust Collaborative Filtering", in Proceedings of the 2007 ACM Conference on Recommender Systems, pp. 49-56.

Melville, Prem, Raymond J. Mooney, and Ramadass Nagarajan (2002), "Content-boosted Collaborative Filtering", in Proceedings of the 18th National Conference on Artificial Intelligence (AAAI-2002), pp. 187-192.

Mild, Andreas and Martin Natter (2002), "Collaborative Filtering or Regression Models for Internet Recommendation Systems?", in Journal of Targeting, Measurement and Analysis for Marketing, Vol. 10, Issue 4, pp. 304-313.

Miller, Christopher A. and Raymond Larson (1992), "An Explanatory and 'Argumentative' Interface for a Model-based Diagnostic System", in Proceedings of the 5th Annual ACM Symposium on User Interface Software and Technology (UIST '92), ACM, pp. 43-52.

Mladenic, Dunja (1999), "Text-learning and Related Intelligent Agents: A Survey", in IEEE Intelligent Systems, Vol. 14, No. 4, pp. 44-54.

Mobasher, Bamshad, Robin Burke, Runa Bhaumik, and Chad Williams (2007), "Towards Trustworthy Recommender Systems: An Analysis of Attack Models and Algorithm Robustness", in ACM Transactions on Internet Technology, Vol. 7, No. 2, pp. 23-60.


Moon, Sangkil, Paul K. Bergey, and Dawn Iacobucci (2010), "Dynamic Effects Among Movie Ratings, Movie Revenues, and Viewer Satisfaction", in Journal of Marketing, Vol. 74, pp. 108-121.

Mooney, Raymond J. and Loriene Roy (1999), "Content-based Book Recommending Using Learning for Text Categorization", in Proceedings of the ACM SIGIR '99 Workshop on Recommender Systems: Algorithms and Evaluation.

Mooney, Raymond J. and Loriene Roy (2000), "Content-based Book Recommending Using Learning for Text Categorization", in Proceedings of the Fifth ACM Conference on Digital Libraries, San Antonio, TX, pp. 195-204.

Moore, Carolyn A., David Bednall, and Stewart Adam (2005), "Genre, Gender and Interpretation of Movie Trailers: An Exploratory Study", in ANZMAC 2005: Broadening the Boundaries, Conference Proceedings, ANZMAC, Dunedin, N.Z., pp. 124-130.

Moore, Johanna D. and William R. Swartout (1988), "Explanation in Expert Systems: A Survey", Research Report RR-88-228, University of Southern California, Marina Del Rey, CA.

Myers, James H. (1996), "Segmentation and Positioning for Strategic Marketing Decisions", American Marketing Association, Chicago, IL, USA.

Nakamura, Atsuyoshi and Naoki Abe (1998), "Collaborative Filtering Using Weighted Majority Prediction Algorithms", in ICML '98: Proceedings of the 15th International Conference on Machine Learning, pp. 395-403.

Nanopoulos, Alexandros, Miloš Radovanović, and Mirjana Ivanović (2009), "How Does High Dimensionality Affect Collaborative Filtering?", in RecSys '09 Proceedings of the Third ACM Conference on Recommender Systems, pp. 293-296.

Neumann, Andreas W. (2009), "Recommender Systems for Information Providers", Physica-Verlag, Heidelberg.

O'Donovan, John and Barry Smyth (2005), "Trust in Recommender Systems", in IUI '05 Proceedings of the 10th International Conference on Intelligent User Interfaces, ACM New York, NY, USA, pp. 167-174.


O'Sullivan, Derry, Barry Smyth, and David C. Wilson (2004), "Preserving Recommender Accuracy and Diversity in Sparse Datasets", in International Journal on Artificial Intelligence Tools, Vol. 13, Issue 1, pp. 219-236.

O'Sullivan, Derry, Barry Smyth, David C. Wilson, Kieran McDonald, and Alan Smeaton (2004), "Improving the Quality of the Personalized Electronic Program Guide", in User Modeling and User-Adapted Interaction, Vol. 14, Issue 1, pp. 5-36.

Park, Seung-Taek and Wei Chu (2009), "Pairwise Preference Regression for Cold-start Recommendation", in RecSys '09 Proceedings of the Third ACM Conference on Recommender Systems, pp. 21-28.

Paterek, Arkadiusz (2007), "Improving Regularized Singular Value Decomposition for Collaborative Filtering", in Proceedings of the KDD Cup Workshop at SIGKDD '07, 13th ACM International Conference on Knowledge Discovery and Data Mining, pp. 39-42.

Payne, John W., James R. Bettman, and Eric J. Johnson (1988), "Adaptive Strategy Selection in Decision Making", in Journal of Experimental Psychology: Learning, Memory, and Cognition, Vol. 14 (July), pp. 534-552.

Payne, John W., James R. Bettman, and Eric J. Johnson (1993), "The Adaptive Decision Maker", Cambridge University Press, Cambridge, UK.

Pazzani, Michael J. (1999), "A Framework for Collaborative, Content-Based, and Demographic Filtering", in Artificial Intelligence Review - Special Issue on Data Mining on the Internet, pp. 393-408.

Pazzani, Michael J. and Daniel Billsus (2007), "Content-based Recommendation Systems", in The Adaptive Web, pp. 325-341.

Prag, Jay and James Casavant (1994), "An Empirical Study of the Determinants of Revenues and Marketing Expenditures in the Motion Picture Industry", in Journal of Cultural Economics, Vol. 18, pp. 217-235.

Press, William H., Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery (2007), "Numerical Recipes: The Art of Scientific Computing", 3rd edition, Cambridge University Press.


Pu, Pearl and Li Chen (2006), "Trust Building with Explanation Interfaces", in Proceedings of the 11th International Conference on Intelligent User Interfaces (IUI '06), ACM New York, NY, USA, pp. 93-100.

Rashid, Al Mamunur, Istvan Albert, Dan Cosley, Shyong K. Lam, Sean M. McNee, Joseph A. Konstan, and John Riedl (2002), "Getting to Know You: Learning New User Preferences in Recommender Systems", in Proceedings of the International Conference on Intelligent User Interfaces, pp. 127-134.

Resnick, Paul, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, and John Riedl (1994), "GroupLens: An Open Architecture for Collaborative Filtering of Netnews", in Proceedings of ACM CSCW '94 Conference on Computer Supported Cooperative Work, pp. 175-186.

Resnick, Paul and Rahul Sami (2007), "The Influence Limiter: Provably Manipulation-resistant Recommender Systems", in Proceedings of the 2007 ACM Conference on Recommender Systems (RecSys '07), pp. 25-32.

Ricci, Francesco, Lior Rokach, and Bracha Shapira (2011), "Introduction to Recommender Systems Handbook", in Ricci, Francesco, Lior Rokach, Bracha Shapira, and Paul B. Kantor [eds.], "Recommender Systems Handbook", Springer Science+Business Media LLC, pp. 1-35.

Rutkowski, Anne-Francoise, Alea Fairchild, and John B. Rijsman (2004), "Group Decision Support Systems and Patterns of Interpersonal Communication to Improve Ethical Negotiation in Dyads", in European Journal of Social Psychology, Vol. 9, pp. 11-30.

Salakhutdinov, Ruslan, Andriy Mnih, and Geoffrey Hinton (2007), "Restricted Boltzmann Machines for Collaborative Filtering", in Proceedings of the 24th International Conference on Machine Learning, pp. 791-798.

Salton, Gerard, Anita Wong, and Chung-Shu Yang (1975), "A Vector Space Model for Automatic Indexing", in Communications of the ACM, Vol. 18, No. 11, pp. 613-620.


Salton, Gerard and Christopher Buckley (1988), "Term-weighting Approaches in Automatic Text Retrieval", in Information Processing and Management, Vol. 24, No. 5, pp. 513-523.

Sandvig, J. J., Bamshad Mobasher, and Robin Burke (2007), "Robustness of Collaborative Recommendation Based on Association Rule Mining", in Proceedings of the 2007 ACM Conference on Recommender Systems, pp. 105-111.

Sarwar, Badrul M., George Karypis, Joseph A. Konstan, and John T. Riedl (2000), "Application of Dimensionality Reduction in Recommender System - A Case Study", in ACM WebKDD 2000 Web Mining for E-Commerce Workshop, pp. 285-289.

Sarwar, Badrul M., George Karypis, Joseph Konstan, and John T. Riedl (2001), "Item-Based Collaborative Filtering Recommendation Algorithms", in WWW '01 Proceedings of the 10th International Conference on World Wide Web, ACM New York, NY, USA, pp. 285-295.

Sarwar, Badrul M., George Karypis, Joseph Konstan, and John T. Riedl (2002), "Incremental Singular Value Decomposition Algorithms for Highly Scalable Recommender Systems", in ICCIT '02 Proceedings of the 5th International Conference on Computer and Information Technology, pp. 399-404.

Sawhney, Mohanbir S. and Jehoshua Eliashberg (1996), "A Parsimonious Model of Forecasting Gross Box-Office Revenues of Motion Pictures", in Marketing Science, Vol. 15, Issue 2, pp. 113-131.

Schafer, Ben J., Joseph A. Konstan, and John Riedl (1999), "Recommender Systems in E-Commerce", in Proceedings of the First ACM Conference on Electronic Commerce, Denver, CO, pp. 158-166.

Schafer, Ben J., Joseph A. Konstan, and John Riedl (2001), "E-Commerce Recommendation Applications", in Data Mining and Knowledge Discovery, Vol. 5 (1-2), pp. 115-153.

Schafer, Joseph L. and John W. Graham (2002), "Missing Data: Our View of the State of the Art", in Psychological Methods, Vol. 7, No. 2, pp. 147-177.

Schwab, Ingo, Alfred Kobsa, and Ivan Koychev (2001), "Learning User Interests through Positive Examples Using Content Analysis and Collaborative Filtering", in User Modeling and User-Adapted Interaction.

Senecal, Sylvain and Jacques Nantel (2004), "The Influence of Online Product Recommendations on Consumers' Online Choices", in Journal of Retailing, Vol. 80 (2), pp. 159-169.

Seyerlehner, Klaus, Arthur Flexer, and Gerhard Widmer (2009), "On the Limitations of Browsing Top-N Recommender Systems", in Proceedings of the Third ACM Conference on Recommender Systems, pp. 321-324.

Shardanand, Upendra and Patti Maes (1995), "Social Information Filtering: Algorithms for Automating 'Word of Mouth'", in Proceedings of ACM CHI '95 Conference on Human Factors in Computing Systems, pp. 210-217.

Shortliffe, Edward H. and Bruce G. Buchanan (1975), "A Model of Inexact Reasoning in Medicine", in Mathematical Biosciences, Vol. 23 (3-4), pp. 351-379.

Simon, Herbert A. (1982), "Models of Bounded Rationality", MIT Press, Cambridge, MA.

Sinha, Rashmi and Kirsten Swearingen (2002), "The Role of Transparency in Recommender Systems", in Conference on Human Factors in Computing Systems, ACM New York, NY, USA, pp. 830-831.

Soboroff, Ian M. and Charles Nicholas (1999), "Combining Content and Collaboration in Text Filtering", in Proceedings of the IJCAI-99 Workshop on Machine Learning for Information Filtering, Vol. 99, pp. 86-91.

Sørmo, Frode, Jörg Cassens, and Agnar Aamodt (2005), "Explanation in Case-Based Reasoning: Perspectives and Goals", in Artificial Intelligence Review, Vol. 24 (2), Kluwer Academic Publishers, pp. 145-161.

Symeonidis, Panagiotis, Alexandros Nanopoulos, and Yannis Manolopoulos (2007), "Feature-weighted User Model for Recommender Systems", in UM '07 Proceedings of the 11th International Conference on User Modeling, pp. 97-106.


Symeonidis, Panagiotis, Alexandros Nanopoulos, and Yannis Manolopoulos (2008), "Providing Justifications in Recommender Systems", in IEEE Transactions on Systems, Man, and Cybernetics, Vol. 38, No. 6, pp. 1262-1272.

Symeonidis, Panagiotis, Alexandros Nanopoulos, and Yannis Manolopoulos (2009), "MoviExplain: A Recommender System with Explanations", in Proceedings of the Third ACM Conference on Recommender Systems, pp. 317-320.

Takács, Gábor, István Pilászy, Bottyán Németh, and Domonkos Tikk (2007), "Major Components of the Gravity Recommendation System", in SIGKDD Explorations, Vol. 9, No. 2, pp. 80-84.

Tang, Tiffany Ya, Pinata Winoto, and Keith C. C. Chan (2003), "On the Temporal Analysis for Improved Hybrid Recommendations", in WI '03 Proceedings of the 2003 IEEE/WIC International Conference on Web Intelligence, pp. 214-220.

Terveen, Loren, Jessica McMackin, Brian Amento, and Will Hill (2002), "Specifying Preferences Based on User History", in Proceedings of the Conference on Human Factors in Computing Systems, pp. 315-322.

Thompson, Clive (2008), "If You Liked This, You're Sure to Love That", in The New York Times, November 23rd, 2008, http://www.nytimes.com/2008/11/23/magazine/23Netflix-t.html.

Thompson, Cynthia A., Mehmet H. Goker, and Pat Langley (2004), "A Personalized System for Conversational Recommendations", in Journal of Artificial Intelligence Research, Vol. 24, pp. 393-428.

Tintarev, Nava (2007), "Explanations of Recommendations", in Proceedings of the 2007 ACM Conference on Recommender Systems (RecSys '07), Minneapolis, MN, pp. 203-206.

Tintarev, Nava and Judith Masthoff (2007), "Effective Explanations of Recommendations: User-Centered Design", in Proceedings of the 2007 ACM Conference on Recommender Systems, ACM New York, NY, USA, pp. 153-156.

Tintarev, Nava and Judith Masthoff (2007a), "A Survey of Explanations in Recommender Systems", in Proceedings of the 2007 IEEE 23rd International Conference on Data Engineering Workshop (ICDEW '07), IEEE Computer Society, Washington, DC, USA, pp. 153-156.


Tintarev, Nava and Judith Masthoff (2007b), "The Effectiveness of Personalized Movie Explanations: An Experiment Using Commercial Meta-data", in AH '08 Proceedings of the 5th International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems, pp. 204-213.

Tintarev, Nava and Judith Masthoff (2011), "Designing and Evaluating Explanations for Recommender Systems", in Ricci, Francesco, Lior Rokach, Bracha Shapira, and Paul B. Kantor [eds.], "Recommender Systems Handbook", Springer Science+Business Media LLC, pp. 479-510.

Tran, Thomas and Robin Cohen (2000), "Hybrid Recommender Systems for Electronic Commerce", in Knowledge-Based Electronic Markets, Papers from the AAAI Workshop, AAAI Technical Report WS-00-04, AAAI Press, pp. 78-83.

Tran-Le, Esther (2010), "NYC Pandora Listener Meet Up", blog entry, March 22, 2010, http://esthertranle.com/wordpress/2010/03/23/nyc-pandora-listener-meet-up, retrieved on 15.06.2011.

Tsymbal, Alexey (2004), "The Problem of Concept Drift: Definitions and Related Work", Technical Report TCD-CS-2004-15, Trinity College Dublin.

Tversky, Amos (1967), "Additivity, Utility, and Subjective Probability", in Journal of Mathematical Psychology, Vol. 4, pp. 175-201.

Uchyigit, Gulden and Matthew Y. Ma [eds.] (2008), "Personalization Techniques and Recommender Systems: Series in Machine Perception and Artificial Intelligence Vol. 70", World Scientific Publishing Co. Pte. Ltd.

von Winterfeldt, Detlof and Ward Edwards (1986), "Decision Analysis and Behavioral Research", Cambridge University Press, New York.

Wärnestål, Pontus (2005), "User Evaluation of a Conversational Recommender System", in Proceedings of the 4th Workshop on Knowledge and Reasoning in Practical Dialogue Systems, pp. 32-39.


Wasfi, Ahmad M. Ahmad (1999), "Collecting User Access Patterns for Building User Profiles and Collaborative Filtering", in Proceedings of the 4th International Conference on Intelligent User Interfaces (IUI '99, Los Angeles), pp. 57-64.

Wei, Chang-Ping, Michael J. Shaw, and Robert F. Easley (2002), "A Survey of Recommendation Systems in Electronic Commerce", in Roland T. Rust and P. K. Kannan [eds.], "e-Service: New Directions in Theory and Practice", M.E. Sharpe, Armonk, New York.

Weiss, Jie W., David J. Weiss, and Ward Edwards (2009), "A Descriptive Multi-attribute Utility Model for Everyday Decisions", in Theory and Decision, Vol. 68, Issues 1-2, pp. 101-114.

Wright, Peter (1974), "The Harassed Decision Maker: Time Pressures, Distractions, and the Use of Evidence", in Journal of Applied Psychology, Vol. 59 (October), pp. 555-561.

Ying, Yuanping, Fred Feinberg, and Michel Wedel (2006), "Leveraging Missing Ratings to Improve Online Recommendation Systems", in Journal of Marketing Research, Vol. XLIII (August), pp. 355-365.

Zanker, Markus, Markus Aschinger, and Markus Jessenitschnig (2007), "Development of a Collaborative and Constraint-based Web Configuration System for Personalized Bundling of Products and Services", in Proceedings of the 8th International Conference on Web Information Systems Engineering (WISE '07), pp. 273-284.

Zanker, Markus, Sergiu Gordea, Markus Jessenitschnig, and Michael Schnabl (2006), "A Hybrid Similarity Concept for Browsing Semi-structured Product Items", in Proceedings of the 7th International Conference on Electronic Commerce and Web Technologies (EC-Web), Springer 2006 (LNCS 4082), pp. 21-30.

Zaslow, Jeffrey (2002), "If TiVo Thinks You Are Gay, Here's How to Set It Straight", in Wall Street Journal - Eastern Edition, November 26, 2002, Vol. 240, Issue 105, p. A1.

Zhan, Sinan, Fengrong Gao, Chunxiao Xing, and Lizhu Zhou (2006), "Addressing Concept Drift Problem in Collaborative Filtering Systems", in Proceedings of the 17th European Conference on Artificial Intelligence, pp. 34-39.


Zhang, Yi, Jamie Callan, and Thomas Minka (2002), "Novelty and Redundancy Detection in Adaptive Filtering", in Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '02), pp. 81-88.

Zhang, Mi (2009), "Enhancing Diversity in Top-N Recommendation", in Proceedings of the Third ACM Conference on Recommender Systems, pp. 397-400.

Zhao, Yangchang, Chengqi Zhang, and Shichao Zhang (2005), "A Recent-biased Dimension Reduction Technique for Time Series Data", in Proceedings of the 9th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD '05), pp. 751-758.

Ziegler, Cai-Nicolas, Sean M. McNee, Joseph A. Konstan, and Georg Lausen (2005), "Improving Recommendation Lists Through Topic Diversification", in Proceedings of the International World Wide Web Conference (WWW '05), pp. 22-32.


Appendix A: Sources of Error in Recommender Systems

In essence, automated recommender systems are stochastic processes: they generate recommendations by applying numerical algorithms that heuristically approximate human decision processes, and they perform these computations on extremely sparse and incomplete data. Under these two conditions, the resulting recommendations are often correct and reliable but can occasionally be severely wrong; in other words, the suggestions generated by RSs are subject to errors. According to Herlocker, Konstan, and Riedl (2000), the sources of these errors can be broadly grouped into two categories: model/process errors and data errors. We agree with this classification and elaborate on both categories of errors in the paragraphs below.

MODEL/PROCESS ERRORS

Model or process errors occur if the computational process that an RS employs to generate recommendations does not appropriately reflect a user's intrinsic decision process and therefore does not match the user's requirements. These errors can arise in the following ways:

Multiattribute preferences. Multiattribute utility (MAU) models have an extensive history in decision-making and marketing research (e.g., Edwards 1954; Tversky 1967; Green, Wind, and Jain 1972; Green and Wind 1973; Luce 1992). According to MAU theory, individuals make choices using an intrinsic utility function that sums their attribute-related preferences over the items contained in the evoked set of choice alternatives; among these alternatives, the item with the highest utility for a given consumer has the highest probability of being chosen. Although research on motion picture success factors has demonstrated that various movie attributes, such as actors, directors, genres, budgets, country of origin, awards, and diverse other factors, significantly influence a movie's success by shaping consumer preferences (Hennig-Thurau, Houston, and Walsh 2006), contemporary movie recommender systems nonetheless fail to adequately consider these attributes and to account for attribute-related consumer preferences in the recommendation process. The reason is the limited ability of information-processing algorithms to automatically extract meaningful attributes that accurately describe multimedia content (Wei, Shaw, and Easley 2002; Pazzani and Billsus 2007; Lops, de Gemmis, and Semeraro 2011). Existing studies that have incorporated attribute-related preferences for movies have chosen attributes based on information availability rather than on a thorough study of relevant attributes (e.g., Ying, Feinberg, and Wedel 2006), or have used such attributes only for the postprocessing of recommendations that had already been generated (e.g., Symeonidis, Nanopoulos, and Manolopoulos 2009). Thus, RSs have failed to completely model users' attribute-related preferences for movies, and existing recommendation processes may therefore generate erroneous recommendations.
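To make the MAU logic concrete, the additive form underlying the models cited above can be written as follows. This is the generic textbook formulation, not the specific model developed in this thesis:

```latex
% Generic additive multiattribute utility (textbook form, not the specific
% model of this thesis). U_j is the utility of choice alternative j,
% w_k the importance weight of attribute k, and x_{jk} the consumer's
% preference for the level that attribute k takes in alternative j.
U_j = \sum_{k=1}^{K} w_k \, x_{jk},
\qquad
\Pr(\text{alternative } j \text{ is chosen}) \text{ increases with } U_j .
```

Under this view, a recommender that ignores the attribute-level terms x_{jk}, as most collaborative filtering systems do, necessarily approximates U_j without its structural components, which is one source of the model errors discussed here.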

Concept or interest drift. It is relatively common for individuals to change their interests, and this type of shift is often observed in the domain of movie recommendations: movies may wax and wane in popularity, and users may adopt new views of actors, genres, directors, and other facets of a movie. In the RS literature, this phenomenon is referred to as "concept drift" (Billsus and Pazzani 2000) or "interest drift" (Burke 2002). Traditional RSs, however, do not account for interest drift and are therefore unable to reflect changes in user preferences (Zhan et al. 2006). To our knowledge, very few studies have focused on this problem. Tang, Winoto, and Chan (2003) suggested that a movie's production year reflects the situational environment in which the movie was filmed and therefore might significantly affect users' feature preferences. For this reason, they propose discounting user preferences for older movies and increasing the influence of newer movies in the recommendation process by assigning higher weights to user ratings for these newer movies. Similarly, other works have suggested using the date on which ratings were collected as the basis for weighting: greater weights are assigned to recent data, whereas the influence of older data is either decayed or completely removed from the computational process (Terveen et al. 2002; Zhao, Zhang, and Zhang 2005; Ding and Lee 2005). Zhan et al. (2006) propose an iterative data weighting method that can capture recurring user interests. However, these weighting methods remain susceptible to model and process errors caused by interest drift, because they rely strictly on time as the descriptor of interest drift at the aggregate level and do not consider potential changes in attribute preferences.
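To illustrate the time-based weighting idea shared by these approaches, the following C# sketch assigns each rating an exponentially decaying weight based on its age; the half-life parameterization and all identifiers are illustrative assumptions rather than any of the cited algorithms:

```csharp
using System;

// Minimal sketch of time-decayed rating weights in the spirit of the
// time-weighting approaches cited above. The half-life parameterization and
// all identifiers are illustrative assumptions, not the cited algorithms.
static class InterestDriftWeighting
{
    // A rating's weight halves every 'halfLifeDays' days:
    // w = 0.5 ^ (age / halfLife). A rating given today has weight 1.
    public static double RatingWeight(DateTime ratedAt, DateTime now, double halfLifeDays)
    {
        double ageDays = Math.Max(0.0, (now - ratedAt).TotalDays);
        return Math.Pow(0.5, ageDays / halfLifeDays);
    }
}
```

Any neighborhood- or average-based predictor can then multiply each rating by this weight before aggregating, so that older evidence fades smoothly instead of being discarded outright. As noted above, however, such purely time-based schemes cannot tell which attribute preferences have actually changed.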


Contextual factors. Traditional RSs assume the homogeneity of recommendation contexts; in other words, the decision regarding what to recommend does not depend on when a recommendation is requested (Adomavicius et al. 2005). However, behavioral research in marketing has revealed that consumers' decision-making processes are highly dependent on the decision context: the same consumer may prefer different products or brands, or even utilize different decision-making strategies, in different circumstances (Chakravarti and Lynch 1983; Klein and Yadav 1989; Bettman, Johnson, and Payne 1991). Given the tremendous variety of imaginable user contexts, however, it appears impossible for an RS to collect all of the data that would be required to appropriately account for contextual considerations. Although the implicit collection of context information is possible, the contextual information that can be acquired in this manner is restricted to information that is automatically available to an RS or obtainable from real-time databases, such as the time of day, season, weather conditions, traffic situation, or GPS coordinates. Actively querying users for additional information about their contexts would contradict one of the main principles of RSs, namely, that RSs are supposed to simplify users' decision-making processes by minimizing user-system interactions (instead of overwhelming users with long questionnaires).62 Although RS concepts that incorporate contextual information have been examined in the recent RS literature, these approaches either have not gone beyond the concept level (e.g., Adomavicius et al. 2005; Adomavicius and Tuzhilin 2008) or have utilized only very limited quantities of contextual information (e.g., Baltrunas 2008; Baltrunas and Ricci 2008; El Helou et al. 2009). Many of the contextual factors that influence the decision-making process, such as motives, the anticipated complexity of the decision task, the need to justify a decision to others, the desire to account for another individual's preferences, time pressure, and prior knowledge (Bettman, Johnson, and Payne 1991), can hardly be formalized for either explicit or implicit data collection and therefore cannot be properly accounted for in RS models. Consequently, the computational process of an RS typically fails to fully reflect a user's context, opening the door to recommendation errors, particularly in situations in which contextual factors dominate user preferences.

62 This view is supported by early studies in the CSCW research area, which revealed that individuals are typically not ready to explicitly express their preferences and priorities and perceive such actions as chores that require additional effort and are extrinsic to their actual task (e.g., Grudin 1988).
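One way the limited, implicitly collectable context mentioned above could at least be exploited is contextual pre-filtering in the spirit of Adomavicius et al. (2005): restrict the rating data to the target context before running a conventional, context-free algorithm. The following is a minimal sketch, in which all types and identifiers are illustrative assumptions:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal sketch of contextual pre-filtering in the spirit of Adomavicius
// et al. (2005): before running any collaborative filtering algorithm, the
// rating data are restricted to those ratings that were collected in a
// context matching the current request. All names are illustrative.
record Rating(int UserId, int ItemId, double Value, string Context);

static class ContextualPreFiltering
{
    // Keeps only ratings whose context matches the target context (e.g.,
    // "weekend evening"); the reduced data set is then handed to a
    // conventional, context-free recommendation algorithm.
    public static List<Rating> FilterByContext(
        IEnumerable<Rating> ratings, string targetContext)
    {
        return ratings.Where(r => r.Context == targetContext).ToList();
    }
}
```

The obvious cost of this design is sparsity: each contextual slice contains far fewer ratings, which aggravates the insufficient-data problem discussed below.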


Scale granularity. RSs typically make use of discrete, integer-valued rating scales for collecting user preferences toward the items (e.g., movies) contained in a catalog. Alternatively, RSs utilize a binary 0/1 scale for the implicit collection of purchase acts or other events (such as clicking on a hyperlink or reading an article) that represent meaningful data input for the recommendation process in certain item domains. This approach raises two problems that may produce errors during the computational process. First, users may not perceive scale points identically, so two users may not use the same point on a rating scale to express the same quantity of preference. For instance, if two persons find a certain movie equally good, one of them may rate the movie 5 out of 5 points, whereas the other may give the film only 4 out of 5 points. In this situation, the recommendation process cannot determine that the same amount of preference was intended in both cases; an RS would likely treat the two scores differently, interpreting them according to its internal representation of the meaning of each scale point. The resulting differences between users' ratings can thus introduce errors into the recommendation process (a common normalization remedy is sketched at the end of this subsection). Second, as described in Chapter 2.1, the algorithms employed in RSs typically operate on rational numbers. For this reason, the results of the averaging or weighting steps employed within the recommendation process will also often be rational numbers. However, these results may need to be represented as integers at various points, such as for the evaluation of prediction accuracy. This requirement either introduces potential rounding errors into the RS process or renders accuracy evaluations error-prone.

Algorithmic processing errors. Finally, the computational procedure itself represents a potential source of errors. Even a perfect model of user choices will remain error-prone in its numerical implementation due to the possibility of overfitting, rounding errors, and other types of miscalculation. Moreover, data quality determines the outcome of the calculations; thus, data quality issues can also result in errors.
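As a concrete illustration of the first problem, and of the per-user normalization that the CF literature commonly applies against it (e.g., the mean-centered deviations used by Herlocker et al. 1999), consider the following C# sketch; all identifiers are illustrative:

```csharp
using System;
using System.Linq;

// Minimal sketch of per-user rating normalization, a common remedy in the
// CF literature for users who anchor the same preference at different
// scale points. All identifiers are illustrative assumptions.
static class RatingNormalization
{
    // Subtracts the user's mean rating, so that "above average for this
    // user" becomes comparable across users.
    public static double[] MeanCenter(double[] userRatings)
    {
        double mean = userRatings.Average();
        return userRatings.Select(r => r - mean).ToArray();
    }

    // Additionally divides by the user's rating standard deviation, which
    // also aligns users who use a wider or narrower range of the scale.
    public static double[] ZScore(double[] userRatings)
    {
        double mean = userRatings.Average();
        double sd = Math.Sqrt(
            userRatings.Select(r => (r - mean) * (r - mean)).Average());
        return sd > 0
            ? userRatings.Select(r => (r - mean) / sd).ToArray()
            : userRatings.Select(r => 0.0).ToArray();
    }
}
```

After mean-centering, the 5-of-5 rater and the 4-of-5 rater from the example above both contribute "above personal average" signals of comparable meaning; remaining differences in how widely users spread their ratings motivate the z-score variant.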

DATA ERRORS

Data errors result from problems with the data that are employed in the calculation of recommendations. These data problems typically fall into three categories: insufficient data, poor or bad data, and high-variance data (Herlocker, Konstan, and Riedl 2000).


Insufficient data. RSs base their computations on extremely sparse and incomplete data; indeed, if complete data were available, there would be no missing data points for an RS to predict. The estimation of missing data can itself pose computational challenges and produce errors (e.g., Schafer and Graham 2002). In the context of RSs, this problem is exacerbated for items and users that have recently entered the system; this issue was addressed earlier in this thesis as the "new item" and "new user" problems.

Poor or bad data. Even when considerable quantities of data about users and items are available, the available data may contain errors. Such errors may result from erroneous user inputs or may even be fraudulently generated through shilling attacks by malicious web robots that seek to favor or disfavor a particular item (Mobasher et al. 2007; Sandvig, Mobasher, and Burke 2007). Another source of low-quality data points is the natural variability in users' perceptions of a numerical scale (Hill et al. 1995; Herlocker et al. 2004), which manifests itself when a single user rates an item differently at different times or when different users associate different ratings with the same quantity of preference.

High-variance data. High-variance data are not necessarily bad data for recommendation algorithms, but they can nonetheless cause recommendation errors (Herlocker, Konstan, and Riedl 2000). Such errors are particularly prevalent for interest-polarizing items, such as the comedy movie "Napoleon Dynamite", which can "be either loved or despised" (Thompson 2008); the high variance of the rating data for such items makes it difficult to predict a particular user's preference rating. In these situations, an RS will most likely predict an average rating for the item in question, despite the fact that the polarizing nature of the item implies that an average rating is unlikely to actually occur (Herlocker, Konstan, and Riedl 2000).

As described above, many factors can produce misleading recommendations. The prospect of receiving an erroneous recommendation can impair users' acceptance of and trust in an RS. Explanations of the reasoning behind an RS's recommendations can provide users with indications of the level of trust that should be assigned to a particular recommendation. Explanations thus give users an instrument for coping with recommendation errors and contribute to the recovery of users' trust in and acceptance of an RS after an erroneous recommendation (Herlocker, Konstan, and Riedl 2000).


Appendix B: A List of Preference-Relevant Attributes

Genres (26): Action, Adult, Adventure, Animation, Biography, Comedy, Crime, Documentary, Drama, Family, Fantasy, Film-Noir, History, Horror, Music, Musical, Mystery, News, Reality-TV, Romance, Sci-Fi, Short, Sport, Thriller, War, Western

Actors (87): Affleck, Ben; Allen, Tim; Bale, Christian; Banderas, Antonio; Black, Jack; Bleibtreu, Moritz; Bloom, Orlando; Broderick, Matthew; Cage, Nicolas; Caine, Michael; Carrey, Jim; Chan, Jackie; Clooney, George; Connery, Sean; Costner, Kevin; Craig, Daniel; Crowe, Russell; Cruise, Tom; Cusack, John; Damon, Matt; De Niro, Robert; Depp, Johnny; DiCaprio, Leonardo; Diesel, Vin; Douglas, Michael; Downey Jr., Robert; Dreyfuss, Richard; Eastwood, Clint; Farrell, Colin; Ford, Harrison; Foxx, Jamie; Fraser, Brendan; Freeman, Morgan; Gere, Richard; Gibson, Mel; Grant, Hugh; Gyllenhaal, Jake; Hanks, Tom; Hartnett, Josh; Hoffman, Dustin; Hopkins, Anthony; Ice Cube; Jackman, Hugh; Jackson, Samuel L.; Kutcher, Ashton; LaBeouf, Shia; Law, Jude; Lawrence, Martin; Ledger, Heath; Maguire, Tobey; Marsden, James; Martin, Steve; McConaughey, Matthew; McGregor, Ewan; McKellen, Ian; Murphy, Eddie; Murray, Bill; Myers, Mike; Newman, Paul; Nicholson, Jack; Norton, Edward; Owen, Clive; Pacino, Al; Phoenix, Joaquin; Pitt, Brad; Quaid, Dennis; Redford, Robert; Reeves, Keanu; Reynolds, Ryan; Russell, Kurt; Sandler, Adam; Schwarzenegger, Arnold; Schweiger, Til; Scott, Seann William; Smith, Will; Snipes, Wesley; Stallone, Sylvester; Statham, Jason; Stiller, Ben; Travolta, John; Tucker, Chris; Waalkes, Otto; Wahlberg, Mark; Washington, Denzel; Williams, Robin; Willis, Bruce; Wilson, Owen; Wood, Elijah

Actresses (46): Adams, Amy; Aniston, Jennifer; Barrymore, Drew; Berry, Halle; Blanchett, Cate; Bullock, Sandra; Curtis, Jamie Lee; Diaz, Cameron; Dunst, Kirsten; Fonda, Jane; Foster, Jodie; Hathaway, Anne; Hawn, Goldie; Hewitt, Jennifer Love; Hudson, Kate; Hunt, Helen; Johansson, Scarlett; Jolie, Angelina; Keaton, Diane; Kidman, Nicole; Knightley, Keira; Lopez, Jennifer; Moore, Demi; Moore, Julianne; Paltrow, Gwyneth; Pfeiffer, Michelle; Portman, Natalie; Potente, Franka; Riemann, Katja; Roberts, Julia; Russo, Rene; Ryan, Meg; Ryder, Winona; Sarandon, Susan; Stiles, Julia; Streep, Meryl; Streisand, Barbra; Swank, Hilary; Theron, Charlize; Thurman, Uma; Weaver, Sigourney; Weisz, Rachel; Winslet, Kate; Witherspoon, Reese; Zellweger, Renée; Zeta-Jones, Catherine

Directors (106): Abrahams, Jim; Allen, Woody; Amiel, Jon; Anderson, Paul W. S.; Annaud, Jean-Jacques; Apted, Michael; Bay, Michael; Besson, Luc; Boyle, Danny; Brest, Martin; Brooks, James L.; Burton, Tim; Cameron, James; Campbell, Martin; Carpenter, John; Coen, Joel; Cohen, Rob; Columbus, Chris; Coppola, Francis Ford; Craven, Wes; Crowe, Cameron; Dante, Joe; Davis, Andrew; de Bont, Jan; Demme, Jonathan; del Toro, Guillermo; De Palma, Brian; DeVito, Danny; Dörrie, Doris; Donner, Richard; Dugan, Dennis; Eastwood, Clint; Emmerich, Roland; Ephron, Nora; Farrelly, Peter; Farrelly, Bobby; Fincher, David; Forster, Marc; Gilliam, Terry; Gosnell, Raja; Gray, F. Gary; Hallström, Lasse; Hanson, Curtis; Harlin, Renny; Herek, Stephen; Hoblit, Gregory; Howard, Ron; Jackson, Peter; Johnston, Joe; Lee, Ang; Lee, Spike; Levant, Brian; Levinson, Barry; Levy, Shawn; Lucas, George; Lyne, Adrian; Mann, Michael; Marshall, Garry; Marshall, Penny; McTiernan, John; Miller, George; Newell, Mike; Nichols, Mike; Nolan, Christopher; Noyce, Phillip; Oz, Frank; Petersen, Wolfgang; Pollack, Sydney; Raimi, Sam; Ramis, Harold; Ratner, Brett; Reiner, Rob; Reitman, Ivan; Reynolds, Kevin; Roach, Jay; Rodriguez, Robert; Russell, Chuck; Schumacher, Joel; Scorsese, Martin; Scott, Ridley; Scott, Tony; Segal, Peter; Shadyac, Tom; Shankman, Adam; Shyamalan, M. Night; Singer, Bryan; Singleton, John; Smith, Kevin; Soderbergh, Steven; Sommers, Stephen; Sonnenfeld, Barry; Spielberg, Steven; Stone, Oliver; Tarantino, Quentin; Thomas, Betty; Turteltaub, Jon; Tykwer, Tom; Verbinski, Gore; Vilsmaier, Joseph; Weir, Peter; Woo, John; Wortmann, Sönke; Zemeckis, Robert; Zucker, David; Zucker, Jerry; Zwick, Edward

Producers (4): Apatow, Judd; Bruckheimer, Jerry; Rudin, Scott; Silver, Joel

Writers (5): Crichton, Michael; Curtis, Richard; Dick, Philip K.; Grisham, John; King, Stephen

Production Firms (6): Imagine, Nickelodeon, Pixar, Revolution, Section, Spyglass

Countries of Origin (38): Argentina, Australia, Austria, Belgium, Brazil, Canada, China, Czech Republic, Czechoslovakia, Denmark, East Germany, Finland, France, Germany, Hong Kong, Iceland, India, Ireland, Israel, Italy, Japan, Mexico, Netherlands, New Zealand, Norway, Poland, Russia, South Africa, South Korea, Soviet Union, Spain, Sweden, Switzerland, Thailand, Turkey, UK, USA, West Germany

Certifications in the USA (21): Approved, E, E10, G, GP, M, NC-17, Not Rated, Open, Passed, PG, PG-13, R, T, TV-14, TV-G, TV-MA, TV-PG, TV-Y7, Unrated, X

Certifications in Germany (8): 6, 12, 16, 18, (Banned), BPjM Restricted, Not Rated, o.Al.

Languages (23): Afrikaans, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Icelandic, Italian, Japanese, Korean, Mandarin, Norwegian, Polish, Portuguese, Russian, Slovenian, Spanish, Swedish, Swiss German, Turkish


Appendix C: The Technical Details of Prediction Accuracy Tests

Although we provide conceptual descriptions of our tests of predictive accuracy in Chapter 4 of this thesis, in this appendix, we provide details of the technical implementation and execution of these tests. This information enables a critical reader to verify the methodological soundness of the process by which the results of this thesis were obtained, to acquire a deeper understanding of the courses of action implemented in this study and, if necessary, to replicate our results. Moreover, these details allow readers to utilize our procedure for their own studies and for the construction of their own recommendation systems.

To calculate the prediction accuracy of the global average and of all variants of the user-based and item-based collaborative filtering algorithms, we utilized 'MyMediaLite',63 an open-source library of recommender system algorithms. This library was recommended for use in real-world recommender systems and for research purposes at the 4th ACM Conference on Recommender Systems, RecSys 2010 64 (personal communication with Francesco Ricci, Gediminas Adomavicius, and Xavier Amatriain). The matrix factorization algorithm was implemented based on Funk's (2006) description 65 of his approach, which brought him to the fourth position on the Netflix Prize leaderboard in the fall of 2006. The surprising performance of Funk's algorithm attracted considerable attention from the Netflix Prize community and caused the matrix factorization approach to become popular in the field of RS research. Although Funk's approach has never been published in an academic journal, his blog entry describing his stochastic gradient descent method for matrix factorization has been widely cited in the recent literature and serves as the basis for all published matrix factorization approaches (e.g., Paterek 2007; Koren 2009; Linden 2009; Koren and Bell 2011).

63 http://www.ismll.uni-hildesheim.de/mymedialite/index.html
64 http://recsys.acm.org/2010/
65 http://sifter.org/~simon/journal/20061211.html
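To illustrate the core of this method, the following C# sketch shows the stochastic gradient descent update rules in the form commonly attributed to Funk's description (with simultaneous updates of all latent factors, as presented, e.g., by Koren and Bell 2011). The learning rate, regularization weight, and all identifiers are illustrative assumptions, not the exact configuration used for the tests reported in this thesis:

```csharp
// Minimal sketch of Funk-style matrix factorization via stochastic
// gradient descent: each rating r is approximated by the dot product of a
// user factor vector p[u] and an item factor vector q[i]. Learning rate,
// regularization, and identifiers are illustrative assumptions.
static class FunkSgdSketch
{
    public static void TrainEpoch(
        (int U, int I, double R)[] ratings, // (user, item, rating) triples
        double[][] p,                       // user factors, p[u].Length == k
        double[][] q,                       // item factors, q[i].Length == k
        double learnRate,
        double regularization)
    {
        foreach (var (u, i, r) in ratings)
        {
            // Prediction error for the current rating.
            double predicted = 0.0;
            for (int f = 0; f < p[u].Length; f++)
                predicted += p[u][f] * q[i][f];
            double e = r - predicted;

            // Gradient step with L2 regularization on both factor vectors.
            for (int f = 0; f < p[u].Length; f++)
            {
                double pu = p[u][f], qi = q[i][f];
                p[u][f] += learnRate * (e * qi - regularization * pu);
                q[i][f] += learnRate * (e * pu - regularization * qi);
            }
        }
    }
}
```

Running several such epochs with a small learning rate (Funk's blog entry reports values on the order of 0.001) and stopping early based on a held-out probe set yields the factor matrices used for prediction.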


The program for our approach, which is described in Chapter 4, was implemented using source code snippets from Press et al. (2007), a widely acknowledged source of numerical methods for scientific computing. Table C.1 provides an overview of the employed procedures, short descriptions of these procedures, and their roles in our algorithm.

Table C.1: An overview of the employed source code snippets from Press et al. 2007

Fitab: An object for fitting a straight line to a set of points, either with or without available errors. Role in our algorithm: performing tests for significance, Section 3.2.1.

invxlogx, Erf, Normaldist:Erf, Lognormaldist:Erf, Gauleg18, Beta:Gauleg18, Gamma:Gauleg18, Studenttdist:Beta, Fdist:Beta: Classes and functions providing distributional statistics and statistical tests for the beta, gamma, Gauss, logarithmic, Student-t, and F-distributions. Role: performing tests for significance, Section 3.2.1.

SVD: An object for the singular value decomposition of a matrix. Role: solving regression problems for a single regression parameter, Section 3.2.1.

SVD::solve: Solves an equation system for a vector using the pseudoinverse of a matrix. Role: correction for omitted variable bias; solving equation system (3.20), Section 3.2.1.3.

Bracketmethod: Base class for one-dimensional minimization routines; provides a routine to bracket a minimum and several utility functions. Role: optimizing initial parameter values, Section 3.2.2.

Brent:Bracketmethod: Isolates the minimum using Brent's method. Role: optimizing initial parameter values, Section 3.2.2.

F1dim: Performs one-dimensional minimization. Role: optimizing initial parameter values, Section 3.2.2.

Linemethod: Base class for line-minimization algorithms. Role: optimizing initial parameter values, Section 3.2.2.

Frprmn:Linemethod: Multidimensional minimization by the Fletcher-Reeves-Polak-Ribiere method. Role: optimizing initial parameter values, Section 3.2.2.

All of the algorithms employed in our study were implemented in the programming language C#. The tests were performed on an Intel® QuadCore Duo™ Q9400 2.67 GHz machine with 8 GB RAM running 64-bit Windows Server® 2008 Standard Edition with Service Pack 2.
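For reference, the accuracy figures produced by such tests are typically summarized as the root mean squared error (RMSE) and, in some studies, the mean absolute error (MAE) over held-out ratings. The following is a minimal, self-contained sketch of these two standard metrics; the identifiers are illustrative:

```csharp
using System;

// Minimal sketch of the standard accuracy metrics used in prediction
// accuracy tests: RMSE and MAE over pairs of predicted and actual
// (held-out) ratings. All identifiers are illustrative.
static class AccuracyMetrics
{
    public static double Rmse(double[] predicted, double[] actual)
    {
        double sum = 0.0;
        for (int i = 0; i < predicted.Length; i++)
        {
            double e = predicted[i] - actual[i];
            sum += e * e;
        }
        return Math.Sqrt(sum / predicted.Length);
    }

    public static double Mae(double[] predicted, double[] actual)
    {
        double sum = 0.0;
        for (int i = 0; i < predicted.Length; i++)
            sum += Math.Abs(predicted[i] - actual[i]);
        return sum / predicted.Length;
    }
}
```

RMSE, the criterion used in the Netflix Prize, penalizes large individual errors more heavily than MAE does.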