Implementing a Scalable Music Recommender System - CSC - KTH

Implementing a Scalable Music Recommender System

ERIK BERNHARDSSON

Master of Science Thesis Stockholm, Sweden 2009

Master’s Thesis in Computer Science (30 ECTS credits)
at the School of Engineering Physics
Royal Institute of Technology, year 2009
Supervisor at CSC was Stefan Nilsson
Examiner was Stefan Arnborg

TRITA-CSC-E 2009:071
ISRN-KTH/CSC/E--09/071--SE
ISSN-1653-5715

Royal Institute of Technology
School of Computer Science and Communication
KTH CSC
SE-100 44 Stockholm, Sweden
URL: www.csc.kth.se

Abstract

We describe the implementation of a highly scalable recommender system, with excellent algorithmic complexity, capable of delivering personalized recommendations in real time. The system was built to be used at the Swedish company Spotify, which offers a novel application where users can listen to a vast collection of several million tracks, streamed over the internet. In the report, we discuss how the logs can be efficiently analyzed by a computer cluster in order to find structures and patterns corresponding to different musical tastes among the users. We introduce several algorithms that are highly capable of suggesting relevant new music to users.

Referat (Swedish abstract, translated)

Personalized music recommendations in real time from enormous amounts of data

We describe the implementation of a scalable recommender system whose algorithmic complexity makes it applicable to very large data sets, while it can still present relevant music recommendations in real time. The system was developed for the Swedish company Spotify, which has built a client for streaming music over the internet from a catalogue of several million tracks. We touch on how the system can efficiently find structures and patterns in users' musical taste by analyzing log files on a computer cluster. We present several algorithms that can present new, relevant music to users in a scalable and fast way.

Contents

1 Introduction
  1.1 Recommender systems
  1.2 What is Spotify?
  1.3 Existing recommender systems
    1.3.1 Spotify
    1.3.2 Other services
  1.4 Use cases
  1.5 Input
2 Framework
  2.1 Data
  2.2 Playlist extension
  2.3 Recommendations
    2.3.1 Predictions and recommendations
    2.3.2 Representation
    2.3.3 Methodology
  2.4 Models
    2.4.1 Collaborative filtering algorithms
    2.4.2 Item-oriented and user-oriented algorithms
    2.4.3 Contextual algorithms
    2.4.4 Audio-based algorithm
    2.4.5 The ensemble method
  2.5 Metrics
  2.6 Training algorithms
    2.6.1 Testing protocol
    2.6.2 Likelihood
    2.6.3 Mean rank
3 Algorithms
  3.1 PLSA Algorithm
    3.1.1 Term weighting
    3.1.2 Complexity
    3.1.3 On-line mode
  3.2 Graph algorithm
    3.2.1 Sieving out a small set of pairs
    3.2.2 Derive edge weights
    3.2.3 Heuristic transitional probabilities
    3.2.4 Model-based transitional probabilities
    3.2.5 Final step: decreasing the graph size
    3.2.6 Generating recommendations
    3.2.7 Online mode
  3.3 Hu-Koren-Volinsky SVD
  3.4 Proxy algorithm
  3.5 Merging algorithms
  3.6 Creating fast top lists
4 Implementation
  4.1 Environment
  4.2 Complexity
  4.3 Algorithm dependencies
  4.4 Scalability and fault tolerance
  4.5 Generating recommendations for artists and albums
5 Results
  5.1 Test sets
  5.2 Baseline
  5.3 Results
  5.4 Summary
6 Discussion
  6.1 Interpretation of results
  6.2 Complexity
  6.3 Metric
  6.4 Justifying heuristical methods
  6.5 User experience
  6.6 Lessons learned
7 Current and future work
  7.1 Graph algorithm
  7.2 Time dependency
  7.3 Feature-based algorithms
  7.4 Diversity
  7.5 Merging algorithms
  7.6 User-specific algorithms
  7.7 Self-reinforcing feedback
  7.8 Radio
  7.9 Other tools
  7.10 Human tests
Bibliography

Chapter 1

Introduction

1.1 Recommender systems

A recommender system provides a user with intelligent suggestions about which items to buy, download or explore, based on data about this user and data about items. There are many large implementations within e-commerce web sites, movie rental web sites, online services dedicated to music, and many more. Partly thanks to the advent of the internet, companies have aggregated very large user logs, and this data can often reveal relevant patterns on a large scale.

For a service with a large number of users and items, the role of the recommender system is to present the user with items the user did not know about or did not think about. Providing a user with tailor-made recommendations of what the user might like has often led to a notable increase in sales volume and in the time a user spends on a web site.

Many recommender systems draw conclusions about user behavior by examining properties of the user (such as age, sex, location, etc.) and the properties of items. Other recommender systems analyze structures of user preference and try to group similar users and items together. Arguably the first major implementation, and still the most famous, is Amazon’s “Users who bought this product also bought” [15]. An early implementation focusing on numerical scores and their mathematical treatment was MovieLens [10], where users assigned 1-5 stars to movies and were given recommendations based on perceived similarity between items and users, where the similarity was based solely on the ratings themselves.

1.2 What is Spotify?

Spotify is a service for streaming music. The catalogue of music can be accessed by downloading the Spotify client, an application developed to be small and very responsive to the user. Features include:
• An assortment of several million tracks
• An easy-to-use interface
• Streaming music from central servers, caching music on the client computers and relaying it between clients using peer-to-peer routing

Figure 1.1. Screenshot of the Spotify client. The current recommender system is shown in the lower right side of the window.

The business model is to reimburse the record companies according to the number of times each track is played. In the Spotify Free service, the cost of playing the tracks is offset by interrupting music with ads (similar to commercial radio stations), while in the Spotify Premium service, users pay a subscription fee for an ad-free service (currently 1 Euro per day, 9 Euro per month, 99 Euro per year).

1.3 Existing recommender systems

1.3.1 Spotify

The current recommender system used at Spotify (shown in Figure 1.1) is based solely on artist data. Other artists are recommended by combining data about related artists from All Music Guide with the history of played tracks. This algorithm is very fast and easy to implement, but suffers from several drawbacks:
• The data from All Music Guide covers only a minor part of the set of artists in Spotify and is purely editorial data, which cannot be extended automatically to new items.
• The system draws no conclusions of its own regarding the structure of artist similarities; it relies completely on the data from All Music Guide.
• Recommendations are always artists, never albums or tracks.

Figure 1.2. Screenshot of Last.fm

1.3.2 Other services

A comparative study of Pandora and Last.fm is presented in [16], which sheds some light on the relative accuracy of different implementations. Other well-known implementations of similar systems are:
• Last.fm is arguably the current state-of-the-art collaborative filtering implementation for music, having its roots in the software Audioscrobbler. It seems to be entirely artist-based, not looking at individual tracks. A screenshot is presented in Figure 1.2.
• Pandora bases its recommendations on an extremely ambitious human-aided feature extraction. This feature extraction goes under the name “Music Genome Project”, where trained people employed by Pandora classify each track in terms of several hundred variables. A screenshot is presented in Figure 1.3.
• Genius is the music recommendation service in iTunes, the music software from Apple. It is not clear whether it is based on collaborative filtering or feature classifications, but most likely the former. A screenshot is shown in Figure 1.4.
• Other systems worth mentioning are Mufin, a feature-based recommender system developed by the Fraunhofer Institute, also known as the inventors of the MP3 format; ZuKool; and Musicovery, which has an unorthodox control panel where users are asked to input their mood as a point on a two-dimensional surface.
• Amazon [15]. Although not a service dedicated to streaming music, Amazon pioneered the field of automated recommendations and is worth mentioning. They took advantage of existing market basket data when creating their feature “Users who bought this item also bought...” (Figure 1.5). This was one of the first major implementations of a recommender system and was also built upon large amounts of log data. Following Amazon’s success, many other e-commerce sites have developed similar systems.

Figure 1.3. Screenshot of Pandora

Figure 1.4. Screenshot of Genius, the recommender system in iTunes by Apple

Figure 1.5. Screenshot of Amazon

1.4 Use cases

The goals of developing an automatic recommender system at Spotify include providing the following features to the user:
• Generating recommendations for users based on the user’s previous history of tracks (“Artists you might like”, etc.).
• Providing automatic radio stations: initially, an artist radio playing tracks by and similar to a given artist, a personal radio station playing tracks based on the log history of the user, and a user-defined radio station, created from any set of tracks, which is then automatically extended.
• Presenting similar artists and similar tracks as a way of browsing the collection.
• Automatically extending playlists with more tracks.

We will see how these features can be formalized and expressed in a common set of operations working on sets of tracks (see 2.2).

1.5 Input

The most important design criterion was to develop a system which handles the existing data at Spotify and does not require the collection of new data. The bulk of this data consists of the log data gathered since launch, which can be regarded as a set of (user, track, timestamp) tuples, each denoting that some track was listened to by some user at some point in time. We count a track as played if the user has listened to it for more than 60 seconds.

Some other data is also available, most notably the track metadata, consisting of album and artist information for each track. We use this data in order to generate recommendations not only for tracks, but also for albums and artists. Additionally, there is some data we chose not to look at within the scope of this project, such as all user-created playlists, demographic data for users (age, sex) and the events of users skipping to the next track.

The system is expected to handle very large logs in the future. Rough figures for the approximate size of these logs are $10^8$ users, $10^7$ tracks and $10^{11}$ log entries. It is also worth keeping in mind that this is a real-time system, in which the user expects immediate feedback. Thus, we aim to limit the time it takes to generate recommendations to no more than 100 ms.
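As a minimal sketch of this input format, the following Python snippet turns raw playback events into (user, track, timestamp) tuples, keeping only plays longer than 60 seconds. The `PlayEvent` record and its `seconds_played` field are hypothetical names chosen for illustration; the actual log schema is not described here.

```python
from typing import Iterable, List, NamedTuple, Tuple

MIN_PLAY_SECONDS = 60  # threshold stated in the text


class PlayEvent(NamedTuple):
    """Hypothetical raw log record; the field names are assumptions."""
    user: str
    track: str
    timestamp: int          # e.g. Unix time of the playback start
    seconds_played: float


def played_tracks(events: Iterable[PlayEvent]) -> List[Tuple[str, str, int]]:
    """Keep events listened to for more than 60 seconds as (user, track, timestamp)."""
    return [
        (e.user, e.track, e.timestamp)
        for e in events
        if e.seconds_played > MIN_PLAY_SECONDS
    ]


events = [
    PlayEvent("alice", "track_a", 1000, 215.0),
    PlayEvent("alice", "track_b", 1300, 12.5),   # stopped early, not counted
    PlayEvent("bob", "track_a", 1400, 61.0),
]
log = played_tracks(events)
```

At the stated scale of roughly $10^{11}$ entries, such a filter would presumably run as a streaming or distributed job rather than on an in-memory list; the list here only keeps the example short.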

Chapter 2

Framework

2.1 Data

As mentioned in 1.5, we only have access to log entries. This type of data is often referred to as implicit feedback. Explicit feedback, on the other hand, refers to users actively rating items (typically by assigning 1-5 “stars”). Most previous work in collaborative filtering assumes that such numerical scores are available. This is likely due to the popular Netflix Prize contest [3], which released large data sets of this kind for research purposes. However, we reiterate that no such information is available to this system, and different methods have to be employed.

2.2 Playlist extension

In order to simplify the treatment of the user logs, we make the assumption that the time stamps of the tracks are irrelevant, and that rewriting the user logs as a set of (user, track, count) tuples describes the past user behavior. (We will return to this in 7.2.)

Deriving recommendations for a user can be viewed as a special case of the more general problem of extending any set of tracks, which we also refer to as a playlist, disregarding the ordering of the latter. Since a user is uniquely defined by the tracks the user has listened to, recommending new tracks is equivalent to extending an unordered playlist. By focusing on the general primitive operation of playlist extension, we can support these two operations in an analogous fashion.

Instead of explicitly working with users, we work with lists. Despite the name, lists are unordered sets of (item, count) pairs. Such a set can represent a user’s history or a playlist that we would like to extend. From now on, we will refer to users and lists interchangeably.
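The rewriting of the log described above can be sketched in a few lines of Python. This is an illustration under the stated assumption that time stamps are ignored; the function name `to_lists` and the sample data are made up for the example.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def to_lists(log: Iterable[Tuple[str, str, int]]) -> Dict[str, Counter]:
    """Collapse (user, track, timestamp) entries into per-user track counts."""
    lists: Dict[str, Counter] = {}
    for user, track, _timestamp in log:   # the time stamp is deliberately ignored
        lists.setdefault(user, Counter())[track] += 1
    return lists


log = [
    ("alice", "track_a", 1000),
    ("alice", "track_a", 2000),
    ("alice", "track_b", 2500),
    ("bob", "track_a", 1400),
]
lists = to_lists(log)

# A playlist to extend has the same shape as a user's history.
playlist = Counter({"track_b": 1, "track_c": 1})
```

Because a user's history and a playlist share the same (item, count) representation, any routine that extends one can extend the other without modification.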

2.3 Recommendations

2.3.1 Predictions and recommendations

What we refer to as recommendations corresponds to the normal usage of the term. We are concerned with how much user u would like track i if he/she were to be presented with the track. Factors that influence this include, among others, whether the track is good enough, whether there is an element of novelty, and whether the track is different from other recommended tracks. Many of these aspects are hard to estimate, and in order to simplify the problem setting, we have to make some fundamental assumptions about what we are trying to achieve.

To simplify the problem, we will treat predictions instead of recommendations. The predictions can be seen as probabilities estimated by the system, i.e. what is the probability that the user u will listen to track i next time? Instead of directly trying to find the best recommendations, we try to predict what the user will listen to in the future. The idea is that by predicting this, we can present these results to the user immediately, and these tracks will hopefully be good and relevant recommendations.

Whether predictions and recommendations are equivalent is a fundamental question. Treating predictions as recommendations enables us to analyze our methods, though there are many subtle differences. We will come back to this later (see 6.5).

2.3.2 Representation

To treat the problem numerically, we first define a common way for the algorithms to express predictions, in order to ensure consistency across all algorithms and to provide a way to compare the outputs of several algorithms. As mentioned in 2.2, the time stamps of log entries are not taken into account. We view a user’s collected history of tracks in no particular order: the most recently played track does not affect predictions more than any other track. We will briefly discuss time dependency later (see 7.2).

We define $P(N_{ui})$ as the probability that the next track the user $u$ will listen to is track $i$. By definition we obtain the normalization criterion

$$\sum_i P(N_{ui}) = 1$$

for all users $u$, i.e. the probability of any track being played as the next track is 1.0 (ignoring the very small possibility that the user has actually left Spotify and will never come back).
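To make the normalization criterion concrete, here is a small Python sketch that rescales the output of an algorithm into next-track probabilities for one user. It assumes, purely for illustration, that the algorithm emits arbitrary non-negative scores per track; the function name `normalize` is hypothetical.

```python
from typing import Dict


def normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale non-negative scores so that they sum to 1 over all tracks."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("at least one score must be positive")
    return {track: s / total for track, s in scores.items()}


raw_scores = {"track_a": 3.0, "track_b": 1.0, "track_c": 0.0}
p_next = normalize(raw_scores)   # plays the role of P(N_ui) for one user u
```

Putting every algorithm's output on this common scale is what makes the outputs of different algorithms comparable.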

2.3.3 Methodology

Assuming predictions are good recommendations (see 2.3.1), recommending tracks to a user amounts to:
• Predict a set of tracks the user is most likely to play next.
• Remove all tracks the user has already played, since recommending these will in general be less relevant.
• Recommend the top n remaining tracks to the user.

Having introduced the next-track probability, we can summarize the main objectives of the recommender system. Given a set of tracks that the user has listened to, together with the number of times each was played:
• What is the probability that the user will listen to track i as the next track?
• Find the n tracks having the largest such probability.

How to actually predict these probabilities will be discussed in the next section, where the models are introduced.
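The three steps above can be sketched as follows, assuming the next-track probabilities for one user are already available as a dictionary. The function name `recommend` and the toy probabilities are illustrative only.

```python
import heapq
from typing import Dict, List, Set


def recommend(p_next: Dict[str, float], played: Set[str], n: int) -> List[str]:
    """Return the n unplayed tracks with the largest predicted probability."""
    candidates = ((p, track) for track, p in p_next.items() if track not in played)
    return [track for _p, track in heapq.nlargest(n, candidates)]


p_next = {"track_a": 0.40, "track_b": 0.30, "track_c": 0.20, "track_d": 0.10}
top = recommend(p_next, played={"track_a"}, n=2)
```

Using a bounded heap instead of a full sort keeps the selection at roughly O(m log n) for m candidate tracks, which matters when m is in the millions and n is small.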

2.4 Models

2.4.1 Collaborative filtering algorithms

Collaborative filtering (CF) is a technique that looks at (user, item) tuples in the log and finds relationships. An example of a relationship might be two tracks which often co-occur among users, suggesting that these tracks are “similar”. No contextual information about the item is taken into account, i.e. the algorithms are unaware of what the items represent.

Collaborative filtering most often considers explicit feedback, which we do not have at hand. Outside the field of collaborative filtering, another field which often poses similar problems in terms of implicit feedback is Information Retrieval (IR). In IR, the input is given as a set of documents, which in the bag-of-words model are simply unordered sets of words. For various queries, the problem is to return the most relevant documents. By considering a slightly different type of “queries”, we have firm ground to start from.

Looking at available algorithms within CF and IR, a rough classification of the most popular methods could be presented as follows:
• Clustering. These methods are often simple to implement and very fast, but not as competitive in terms of prediction accuracy.
  – Item clustering clusters similar tracks and recommends tracks in the same clusters as the tracks you like.
  – User clustering clusters similar users and recommends the top tracks of the users in the same clusters as you.
• Neighborhood methods or graph methods. These methods analyze pairwise relationships between items or users and build a model based on a neighborhood graph. One of the currently best performing methods is presented in a paper by the current Netflix leaders [14].
  – Item-item methods construct neighborhood graphs with edges between similar tracks.
  – User-user methods construct neighborhood graphs with edges between similar users.
• Factorization methods and dimensionality reduction methods construct a low-rank approximation of the incidence matrix and use it for predictions. Many of these methods are special cases of the general problem of Non-negative matrix factorization (NMF), an area that has received much attention. We focus on the methods developed in IR, of which some of the most prominent include:
  – Latent Semantic Indexing (LSI), which constructs a Singular Value Decomposition of the matrix $M$, where $M_{dw}$ is defined as the number of occurrences of word $w$ in document $d$.
  – Probabilistic Latent Semantic Analysis [11] (PLSA), also known as Probabilistic Latent Semantic Indexing (PLSI), improves upon some shortcomings of Latent Semantic Indexing and can be applied in a straightforward fashion. (Side note: a later paper [9] showed that PLSA is actually LSI using the Kullback-Leibler divergence instead of the Euclidean norm.)
  – Latent Dirichlet Allocation [7] was in turn developed to address and solve some shortcomings of Probabilistic Latent Semantic Analysis.
  – Restricted Boltzmann Machines [21] is another method that claims to overcome certain issues with Probabilistic Latent Semantic Analysis.
• Other approaches include Bayesian probabilistic matrix factorization using Markov chain Monte Carlo [20], Semantic Hashing [19], ...
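As a simple illustration of the kind of relationship collaborative filtering looks for, the sketch below counts how many users have played both tracks of each pair. It is a naive example of item-item co-occurrence, not the graph algorithm developed later in this report; the function name and sample histories are made up.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Set


def cooccurrence(histories: Dict[str, Set[str]]) -> Counter:
    """Count, for each unordered track pair, the users who played both tracks."""
    pairs: Counter = Counter()
    for tracks in histories.values():
        # Sorting gives every pair a canonical (a, b) order with a < b.
        for a, b in combinations(sorted(tracks), 2):
            pairs[(a, b)] += 1
    return pairs


histories = {
    "alice": {"track_a", "track_b"},
    "bob": {"track_a", "track_b", "track_c"},
    "carol": {"track_c"},
}
pairs = cooccurrence(histories)
```

Note that nothing about the tracks themselves enters the computation, which is exactly the sense in which the method is unaware of what items represent. The pair enumeration is quadratic in the length of each history, so this form is only practical for small examples.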

2.4.2 Item-oriented and user-oriented algorithms

Most algorithms (the notable exception being factorization methods) handle either item-item similarities or user-user similarities. For instance, clustering methods usually cluster either similar users or similar items together. It is often possible to handle both cases in the same way by simply flipping the roles of users and tracks in the input and output data. However, due to the inability of most user-user methods to handle new users on the fly, we will only focus on item-item methods in this report. This also means that we will not consider the insertion of new items without having to rebuild the models.

2.4.3 Contextual algorithms

Another class of algorithms is the metadata models. These models try to analyze the content of items and users, drawing conclusions about how different aspects of the content attract different users. These methods are usually referred to as contextual algorithms, or feature-based algorithms. A good example of a system mostly built around this is Pandora (see 1.3.2), which uses a vast database of track classifications made by humans. Some examples of what we could do include:
• Finding structures in whether users from specific countries like songs in specific languages (users from Germany are more likely to listen to German lyrics, for instance)
• Finding correlations between music tastes and demographic data (such as age, sex, etc.)
• Using editorial data on e.g. similar artists to produce recommendations (like the current system at Spotify)
• Analyzing the audio contents of all tracks [16] (see 2.4.4)

2.4.4 Audio-based algorithm

Audio-based methods are likely to be the best feature-based methods, having prediction accuracy on par with collaborative filtering methods [16]. Using a feature-extraction tool, we can produce a feature vector for each track. The details are omitted here, but the interested reader can find more information in, for instance, [6] and [16]. Ideally, the feature vectors should capture human perception as closely as possible. Common values that can be extracted are:
• Rhythm, most often quantified as tempo in BPM (beats per minute)
• Major/minor key
• Rhythm complexity (regularity/irregularity, etc.)
• Timbre, which is an umbrella term for many different aspects of audio
• Mel-Frequency Cepstral Coefficients (MFCC), which are automatically extracted coefficients corresponding to timbre

2.4.5 The ensemble method

Instead of developing one algorithm to produce predictions, we will develop an array of them, the rationale being that different algorithms best handle different aspects of data and data scalability. These predictions can be combined to produce one final set of predictions. This method is commonly referred to as the ensemble method. We will later return to exactly how these combinations are made (see 3.5).

2.5 Metrics

A metric is needed when developing a recommender system in order to automatically evaluate the performance of the system. The metric is a score of how well the system performs. Ideally, a metric should fulfill the following:
• It reflects all aspects of the user’s experience (i.e. the better the value of the metric, the happier the user).
• It is efficiently computable and optimizable (i.e. there is a single attainable global maximum).

In broad terms, the aim of a metric is to link human perception and the mathematical representation of the problem. Defining a metric essentially turns the entire work of building a recommender system into an optimization problem. However, defining a metric which is entirely based on the happiness of the users is a difficult and tedious task. We can approximate this by letting users submit immediate feedback when being presented with recommendations, and/or by using human test subjects. On the other hand, to actually get feedback from users, we need to obtain recommendations in the first place, giving us a challenging task within reinforcement learning. This and many other interesting issues unfortunately fall outside the scope of this project. Instead, we will look at much simpler metrics in 2.6.2 and 2.6.3, which are easy to define and to optimize, and which hopefully correspond to human experience well enough.

2.6 Training algorithms

In order to have a fully automated way of evaluating our algorithms, we split the input data into a training set and a test set, where the latter corresponds to only a small fraction of the complete data. The idea is to hide the test set from the system during training and to see how well the system is able to reproduce it. We use the training set as input data to the algorithms. When training is done, we let the algorithms produce recommendations. How well the recommendations match what the users actually listened to gives us a good picture of how competent our system is at predicting future events.

We split the data into a training set and a test set not only for evaluation. In some cases, we also do it in order to train unknown parameters automatically.

2.6.1 Testing protocol

As mentioned in 2.6, we need to remove a “hidden set” of data for comparison purposes in order to evaluate the system. We evaluate the algorithms using the following approach:
• Pick a “handful” (a few hundred) of users randomly. We call these users “test users” and remove their data altogether from the training set.
• Train the algorithms using the remaining data.
• For each user in the set of test users:
  – Create a new list in the system. Add all of the user’s played tracks to the list except the 10 most recently played tracks, which we keep hidden.
  – Additionally, remove from the hidden tracks those already played by the user earlier, so that there are only novel tracks among the hidden tracks.
  – Derive the likelihood or mean rank (see below) of the remaining hidden tracks, i.e. a measure of the ability of the system to reproduce the right hidden tracks.
• The average of these mean ranks is interpreted as the performance of the system.

To pick test users in a stable and scalable fashion, we use a hash function to map user names to a real value within [0, 1), and take out all users which map to a value less than a threshold $\epsilon$. This way, $\epsilon$ defines a unique and consistent sample of users.
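The hash-based selection of test users could look like the following sketch. The choice of SHA-256 and the function names are assumptions made for this example; the text does not say which hash function was actually used.

```python
import hashlib


def user_fraction(user_name: str) -> float:
    """Map a user name to a reproducible value in [0, 1)."""
    digest = hashlib.sha256(user_name.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big") >> 11   # keep 53 bits, exact in a float
    return value / 2 ** 53


def is_test_user(user_name: str, threshold: float) -> bool:
    """A user belongs to the test set if its hash value falls below the threshold."""
    return user_fraction(user_name) < threshold


users = ["alice", "bob", "carol", "dave"]
test_users = [u for u in users if is_test_user(u, threshold=0.5)]
```

Because membership depends only on the user name and the threshold, every machine in a cluster can decide independently whether a log entry belongs to a test user, and the sample stays the same between runs.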

2.6.2 Likelihood

The likelihood denotes the probability that a model produces a given output. In this context, we can define it in terms of predictions. If we let $T$ denote the test set, i.e. a set of real log entries, we can define the likelihood as

$$\prod_{(u,i) \in T} P(N_{ui})$$

In the expression above, $P(N_{ui})$ denotes the predicted probabilities, as defined in 2.3.2. A great advantage of the likelihood is that it is continuous and differentiable in $P(N_{ui})$, and that its logarithm is concave.
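In practice a product of many small probabilities underflows quickly, so one would typically compute the logarithm of the likelihood instead. The sketch below does this for a toy test set; the dictionary layout and function name are assumptions for illustration, and predictions are assumed to be strictly positive.

```python
import math
from typing import Dict, Iterable, Tuple


def log_likelihood(
    predictions: Dict[Tuple[str, str], float],
    test_set: Iterable[Tuple[str, str]],
) -> float:
    """Sum of log P(N_ui) over the (user, track) pairs of the test set."""
    return sum(math.log(predictions[pair]) for pair in test_set)


predictions = {("alice", "track_a"): 0.5, ("alice", "track_b"): 0.25}
test_set = [("alice", "track_a"), ("alice", "track_b")]

ll = log_likelihood(predictions, test_set)
likelihood = math.exp(ll)   # equals the product 0.5 * 0.25
```

Since the logarithm is monotonic, maximizing the log-likelihood and maximizing the likelihood select the same model.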

2.6.3 Mean rank

This metric is due to [12]. For each user, we derive the numerical preferences for all tracks and sort them in decreasing order. Then, for each item in the user’s hidden set, we find its rank among all tracks as a percentage. The average over all users of the averages over all hidden items forms the final value, which we denote mean rank. A very low mean rank denotes that almost all of the hidden items were found near the top of the list, so that in general, the lower the value of the mean rank, the better. Note that a completely randomized algorithm would distribute all hidden tracks evenly over the ranked list, so that on average it would yield a mean rank of 50%.
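A minimal sketch of the mean rank computation is given below. It assumes one particular percentile convention, position / (number of tracks - 1), so that the top of the list is 0 and the bottom is 1; the exact convention used in the report is not stated, and the function names are made up.

```python
from typing import Dict, List


def percentile_ranks(scores: Dict[str, float], hidden: List[str]) -> List[float]:
    """Rank of each hidden track in the list sorted by decreasing score, in [0, 1]."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    position = {track: i for i, track in enumerate(ordered)}
    last = max(len(ordered) - 1, 1)   # avoids division by zero for a single track
    return [position[t] / last for t in hidden]


def mean_rank(per_user: List[List[float]]) -> float:
    """Average over users of the average rank of each user's hidden tracks."""
    user_means = [sum(r) / len(r) for r in per_user if r]
    return sum(user_means) / len(user_means)


scores = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1, "e": 0.0}
ranks_u1 = percentile_ranks(scores, hidden=["a", "c"])   # positions 0 and 2 of 0..4
ranks_u2 = percentile_ranks(scores, hidden=["e"])        # last position
result = mean_rank([ranks_u1, ranks_u2])
```

Here the first user contributes an average of 0.25 and the second 1.0, giving a mean rank of 0.625 (62.5%), which is worse than the 50% expected from a random ordering.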

Chapter 3

Algorithms

3.1 PLSA Algorithm

Probabilistic Latent Semantic Analysis, or PLSA for short [11], derives a generative model for how tracks are listened to, based on latent classes. In particular, we assume there are $k$ latent classes with probabilities $P(z)$. Users belong to each latent class with probabilities $P(y \mid z)$, and items belong to each latent class with probabilities $P(x \mid z)$. Under this model, the next track played by any user is generated as follows:
• Pick a random class $z$ with probability $P(z)$.
• Pick a random item $x$ with probability $P(x \mid z)$.
• Pick a random user $y$ with probability $P(y \mid z)$.
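The generative process above can be simulated directly. The following sketch draws (class, item, user) triples from toy parameters; all numbers are made up for illustration, and the function name `sample_play` is hypothetical.

```python
import random
from typing import List, Tuple


def sample_play(
    p_z: List[float],
    p_x_given_z: List[List[float]],
    p_y_given_z: List[List[float]],
    rng: random.Random,
) -> Tuple[int, int, int]:
    """Draw one (class, item, user) triple from the generative model."""
    z = rng.choices(range(len(p_z)), weights=p_z)[0]
    x = rng.choices(range(len(p_x_given_z[z])), weights=p_x_given_z[z])[0]
    y = rng.choices(range(len(p_y_given_z[z])), weights=p_y_given_z[z])[0]
    return z, x, y


# Toy parameters (made up): k = 2 classes, m = 3 items, n = 2 users.
p_z = [0.6, 0.4]
p_x_given_z = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]   # row z holds P(x|z)
p_y_given_z = [[0.9, 0.1], [0.2, 0.8]]             # row z holds P(y|z)

rng = random.Random(0)
plays = [sample_play(p_z, p_x_given_z, p_y_given_z, rng) for _ in range(1000)]
```

Given the class, the item and the user are drawn independently of each other; this conditional independence is the structural assumption that training exploits when fitting the parameters to observed (user, item) pairs.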

For a given user $y$, the probability of each item $x$ being the next track is given by

$$P(x \mid y) = \sum_z P(x \mid z)\, P(z \mid y),$$

where the last factor is obtained from the known values $P(y \mid z)$ and $P(z)$ by Bayes’ theorem, $P(z \mid y) = P(y \mid z) P(z) / \sum_{z'} P(y \mid z') P(z')$. The three unknown entities sought are:
• $P(z)$ for $z = 1 \ldots k$
• $P(x \mid z)$ for $z = 1 \ldots k$ and $x = 1 \ldots m$ (where $m$ is the number of items)
• $P(y \mid z)$ for $z = 1 \ldots k$ and $y = 1 \ldots n$ (where $n$ is the number of users)

It has been shown [9] that PLSA is actually Singular Value Decomposition using the Kullback-Leibler divergence instead of the Euclidean norm, and it is often helpful to think of PLSA as a distant relative of the much more frequently occurring SVD. See Figure 3.1 for a simple example of the results of PLSA.

The optimal conditional probabilities can be derived by the EM (Expectation Maximization) algorithm. [11] presented this and also suggested using a slightly modified variant of the EM algorithm known as the TEM (Tempered Expectation

CHAPTER 3. ALGORITHMS

0.014 Coldplay ! Speed of Sound Coldplay ! Fix You 0.012

Coldplay ! Clocks

Coldplay ! The Scientist

0.01 Coldplay ! Talk Coldplay ! Square One Coldplay ! Trouble Coldplay ! What If Coldplay ! White Shadows

Oh Laura ! Release M

0.008 Coldplay ! X&Y Coldplay ! A Message Coldplay ! The Hardest Coldplay ! Fix You Part Coldplay ! In My PlaceU2 ! Stuck In A Moment (Acoustic) Massive Attack ! Teardrop Coldplay ! Warning Sign Coldplay !Low Don’t Panic 0.006 Coldplay in the Sea Lykke!LiSwallowed ! Dance, Dance, Dance José González Heartbeats Coldplay ! God ’til Kingdom Come Coldplay ! Put a ! Smile Upon Your Face Coldplay Kleerup ! Longing Lullabies Coldplay ! ! Green Politik Eyes Coldplay Logic Robyn !For With Every Heartbeat Lykke Li !!!!Twisted Melodies & Desires Radiohead !Good, Fake Plastic Trees (Acoustic) Keane Somewhere Only We Know Foo Fighters ! Everlong Lykke Li I’m I’m Gone Jack Johnson ! Breakdown Live ! Heaven (Acoustic) Lykke Li ! Little Bit Lykke Li ItFoo FallFighters ! Times Like These(Acoustic) Lykke Li !! My Love Coldplay !Let Yellow Gavin DeGraw ! I Don’t Want To Be (Acoustic) James Blunt ! High Coldplay ! Daylight Maroon 5 ! She Will Be Loved (Acoustic) Lykke Li ! Tonight Coldplay Shiver John Mayer ! Daughters Foo Fighters My Hero Counting !of Holiday In Coldplay ! Crows A Håkan Rush Blood to!!Spain the Head RobynLång ! Cobrastyle 0.004 Coldplay Hellström Tro Och Håkan Hellström ! Me För EnTvivel Lång Tid ! Amsterdam Nickelback !Laura How Remind (Acoustic) Oh !You A Call to Arms Håkan Hellström ! Kärlek ettOn Brev Skickat Tusen Gånger Fighters ! Är Monkey Wrench Coldplay ! A!Whisper Laura ! Put Black n’ Blue Corinne Rae !Foo Your Records KT Tunstall Heal Over Coldplay Spies Lykke Li !! Oh Hanging High Lykke Li Bailey Breaking It Up Coldplay Sparks Håkan Hellström ! Kär I En Ängel Håkan Hellström ! För För Edelweiss Håkan Hellström ! Känn ingen Oh Laura ! It Ain’t Enough The Killers ! When You Were Jason YoungMraz ! I’m Yours (original demo) Robyn !Sent Be Håkan Hellström !Line Jag VetMine! Inte Vem Jag Ärsorg Menför jagmig VetGöteborg Att Jag Är Din Oh Laura ! 
Fine Joss Stone In Love With !A!Learn Boy (Acoustic) Gabriel Rios!!Fell Broad Daylight Håkan Hellström NuSpringsteen kan du så Fighters toSpringsteen Flyfå mig Bruce ! lätt Streets ofRiver Philadelphia Bruce Bruce ! Born to!Run The Håkan Hellström ! in Zigenarliv Dreamin Oh Laura !Foo Raining New York Håkan Hellström midsommarnattsdröm Håkan Hellström Jag hatar attSpringsteen jagSpringsteen älskar dig och jag !älskar så mycket jag hatar mig The Stripes Nation Army Bruce ! Dancing inSeven the dig Dark Takida ! Curly Sue Robyn !!!En Konichiwa Bitches Jack Johnson ! White Better Together Amyatt Winehouse ! Rehab Bruce Springsteen ! Hungry Heart Amy Winehouse ! Back to Black Lars Winnerbäck ! Om du lämnade mig nu Jason Mraz ! I’ll Do Anything Håkan Hellström ! Know!How Ramlar Kent Mannen i den With vita hatten Feist ! (feat. Feist) Kings of Convenience ! I’d Rather Dance You (16 år senare) The Smashing Pumpkins !! 1979 The Shins ! New SlangYou Killers ! When Were Young Snow Patrol !Cassidy Chasing Eva !Cars Fields Of Gold The Smashing Pumpkins ! The Tonight, Tonight 0.002 Massive !Waiting Unfinished Sympathy Andreas Kleerup with Robyn With Every Heartbeat (Radio Ve Zero 7slag !Attack In the Line Peter Smiths Bjorn and !400 There John Is ! Young a Light Folks That ! Never Goes Amy Winehouse ! You Know I’m No GoodOut The Smiths This Charming Man Ryan Adams Snow Patrol !The Wonderwall !!The Open Your Eyes Kent ! Suede ! The Beautiful Ones Verve !! Bitter Sweet Symphony Shins ! Caring Is Creepy Kent Utan dina andetag Kent Ingenting Kleerup ! Tower Of Trellick Kent ! Den döda vinkeln Presley Suspicious Minds Champagne Kent ! !Snurrar Don’t ! 747 Panic !Supernova Wonderwall AneColdplay Brun !Elvis To Let Myself Go Lars Winnerbäck ! Elegi House mix) Jeff Buckley ! Hallelujah Kleerup With Every Daft Punk ! Around (Kleerup theRemix) World Familjen !Oasis Det I!Oasis Min Skalle !Heartbeat Shoreline Massive Attack !Johnny Angel Rufus Wainwright ! Hallelujah Nina Simone ! 
Bob Sinnerman Dylan !Daft (Felix Lay Lady Da Housecat’s Heavenly !Lay Harder, Better, Faster, Stronger Moby !Anna Extreme Ways Norah Jones Sunrise Stevie Wonder !Ternheim Signed, sealed, delivered (i’m yours) Cash !Punk Hurt Frou Frou Let Go Jeff Rufus Buckley !! Nick Drake !!Winnerbäck One ofUniverse Things First The Killers Somebody Told Me Moby Ooh Yeah Wainwright !Lars Across the ! En tätort på Living en slättBoy in New York Iron &!Wine Robbie !!!Such Williams Great !These She’s Heights Madonna Kent !!Halleluljah Chans Simon ! Only Kent Max 500 Oasis Little by Little Nina Brun Simone ! The ! My Treehouse Baby Just Song Cares For Me Jason Mraz ! Life IsHimmelen U2 ! Ane One Robyn ! With Säkert! Every !& Heartbeat ViGarfunkel kommer (Tong attThe dö & samtidigt Spoon Wonderland Remix) !Wonderful Nothing Else Matters Kent VinterNoll2 R.E.M. ! Losing My Religion Kaiser Chiefs !Just Ruby Daft Punk !Metallica One More Time Timbuktu ! Alla Vill Till Men In The Killers !Slagsmålsklubben Mr Brightside and Linkin ! Numb/Encore Session Orchestra ! Enter Michael Jackson !Feel ! Sponsored Billie Jean by destiny KentJay!Z !London Spökstad Colin Hay ! IPark Don’t Think I’ll Ever Get Over You Depeche Mode ! Enjoy the Silence Anna Ternheim ! To Be Gone Timbuktu ! Det löser sig Metallica ! Sandman Bo Kaspers Orkester ! I samma bil The Cure ! Friday I’m in Love Daft Punk ! Around the World / Harder, Better, Faster, Stronger The Chemical Brothers !Let Hey Boy Hey Girl Bowie ! Life On Mars? Bob Dylan !Dylan Hurricane Creedence Bob All Revival Along ! the Fortunate Watchtower Son David Bowie !Clearwater Chine Girl Simon &Want Garfunkel The Boxer Robbie Williams !!Son Me Entertain You Bob Dylan ! Knockin’ on Heaven’s Door Queen !David Under Pressure Säkert! ! Någon gång måste du bli själv Rihanna ! Umbrella Dusty Springfield ! of a Preacher Man Anna Ternheim ! My Secret Wolfgang Amadeus Mozart ! Concerto No. 21 in C major for Piano, K. 467 "Elvira Madigan": II. Andante Queen ! 
I to Breack Free Ebba Grön ! Die Mauer Ted !! Sol vind och vatten Michael Jackson ! Billie Jean (original 12" version) Rihanna !Floyd Shut Up And Drive (Radio) (133 bpm) Pet Shop Boys Always on My Mind Queen !Gärdestad Don’t Stop Me Now LiRhapsody !!brick Ba ba ba Miss Li Oh Johnny Cash Ring of Fire Depeche Mode ! Personal Jesus Queen ! Bohemian Pink ! Another inBoy wall (part 2) Miss Li !Miss I’m Sorry, He’s Mine Hello Saferide !the The Quiz Alphaville ! Big in Japan Miss Li ! Let Her Go Depeche Mode ! Just Can’t Get Enough New Order ! Blue Monday Ultravox ! Dancing With Tears in My Eyes Pet Shop Boys ! Go West Soft Cell !Friend Tainted Love The Trashmen ! Surfin’ Bird Hello Saferide !On My Best a!ha !Alphaville Take Me 0 Howard Jones !League What Is Love Eurythmics ! Sweet Dreams (Are Made This) Dr. Alban ! Sing Hallelujah ! In My Mind ! Sounds a(Jag Melody Blümchen ! Heut’ ist mein Tag The ! You Want Me Kraftwerk !!Don’t Basic Model Element The Promise Man Scatman John ! Scatman Duran Snap! Duran Vengaboys !Human The Power !The Culture ! Boom on Film Beat Boom !lille Mr Boom Vain Boom!!! Yazoo ! Go Sash! ! Ecuador DJ BoBo Lena Cappella Wisborg !Antiloop Somebody !Girls Move ! Idas Dance on Sommarvisa Baby Me gör så att(från blommorna blommar) Inger Ronny Nilsson 2 Rooster Paradisio Inger Unlimited Haddaway !Real Nilsson Sjörövar!Fabbe & ! hans ! Bailando No !Don’t vänner Här Limit What kommer ! (från Is Köttbullelåten Pippi ’Pippi Långstrump Långstrump påofde sjupå haven’) MC Sabine, Cool Jonna Scooter Herbie Sar Aqua Jan James Lars Liljendahl & Corona Annie, Ohlsson The Berghagen Right La Roses Move & Black Bouche Erika, ! & The ! Type Your McCoy Are Du Liv Teacher Kerstin Rhythm käre Alsterlund of Red !Ass Teddybjörnen Be Mood !With Run My &!Like of Sophie Dr. snickarbo Lover the Away !Love Feelgood Pilutta!visan Night Fredriksson ! Imse vimse spindel ’Madicken Junibacken’) 0 0.005 0.01 0.015

Figure 3.1. An example of a simplified PLSA algorithm with only 2 latent

classes, represented by the x- and y-axes. There are 5000 users and 400 of the most common tracks in Spotify are presented as coordinates. Note that the metric is not Euclidian. Actually, the angular difference from the origin between two tracks corresponds to the similarity, while the distance from the origin corresponds to the popularity.

Maximization) algorithm. This modifications yields a cumbersome-looking convergence graph presented in Figure 3.2, where the sudden discontinuities in the derivative are caused by adaptively changing the parameter β (see [11] for the details). The EM and TEM algorithms are well suited to be implemented on a cluster, since the main loop simply sums values over tracks and users. The number of rounds needed (in practice around 30-50) means that the algorithm needs relatively much time to run. On the other hand, memory overhead is small and complexity is good, providing for easy extension to very large data sets. Google News (see [8]) use PLSA in order to generate personalized news to users on a very large scale. 16
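To make the structure of the main loop concrete, here is a minimal single-machine sketch of plain EM for the PLSA model in this section (not the tempered TEM variant, and not the clustered implementation). The function names, the `(user, item) -> count` input layout and the random initialisation are assumptions made for the illustration.

```python
import random


def plsa_em(counts, k, rounds=50, seed=0):
    """Fit P(z), P(x|z), P(y|z) by EM.

    counts: dict mapping (user, item) -> play count (assumed layout).
    Returns (p_z, p_x_z, p_y_z); the last two are lists of dicts, one per class.
    """
    rng = random.Random(seed)
    users = sorted({y for y, _ in counts})
    items = sorted({x for _, x in counts})

    def random_dist(keys):
        weights = {key: rng.random() + 0.1 for key in keys}
        total = sum(weights.values())
        return {key: w / total for key, w in weights.items()}

    p_z = [1.0 / k] * k
    p_x_z = [random_dist(items) for _ in range(k)]
    p_y_z = [random_dist(users) for _ in range(k)]

    for _ in range(rounds):
        new_z = [0.0] * k
        new_x = [dict.fromkeys(items, 0.0) for _ in range(k)]
        new_y = [dict.fromkeys(users, 0.0) for _ in range(k)]
        for (y, x), n in counts.items():
            # E-step: posterior of each class for this (user, item) pair.
            joint = [p_z[z] * p_x_z[z][x] * p_y_z[z][y] for z in range(k)]
            total = sum(joint)
            if total == 0.0:
                continue
            for z in range(k):
                share = n * joint[z] / total
                # The main loop only accumulates sums over tracks and users.
                new_z[z] += share
                new_x[z][x] += share
                new_y[z][y] += share
        # M-step: renormalise the accumulated sums.
        grand_total = sum(new_z)
        p_z = [v / grand_total for v in new_z]
        p_x_z = [{x: v / (new_z[z] or 1.0) for x, v in new_x[z].items()}
                 for z in range(k)]
        p_y_z = [{y: v / (new_z[z] or 1.0) for y, v in new_y[z].items()}
                 for z in range(k)]
    return p_z, p_x_z, p_y_z


def next_track_probability(model, user, item):
    """P(x|y) = sum_z P(x|z) P(z|y), with P(z|y) obtained by Bayes' theorem."""
    p_z, p_x_z, p_y_z = model
    weights = [p_z[z] * p_y_z[z][user] for z in range(len(p_z))]
    norm = sum(weights)
    return sum(p_x_z[z][item] * weights[z] for z in range(len(p_z))) / norm
```

Because each round only adds per-pair contributions into three accumulators, the same loop maps naturally onto a map-reduce job, which is the property the text relies on.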


Figure 3.2. Convergence of PLSA

3.1.1

Term weighting

We experimented with different heuristics of replacing track counts with some monotonic function thereof. The main ones we experimented with were:

• NONE: f(x) = x (no weighting)
• SQRT: f(x) = √x
• LOGP1: f(x) = log(x + 1)

Although hard to justify from a theoretical point of view, these weightings gave significantly better results. We attribute this to their ability to reduce noise, where otherwise a few rare large numbers can affect the accuracy detrimentally. Feeding all frequencies into a term weighting function dampens the effect of these, while still being approximately linear around x = 1.
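The three weightings are simple to express in code. The sketch below assumes play counts are stored in a dictionary keyed by (user, track); the names `TERM_WEIGHTS` and `apply_weighting` are hypothetical.

```python
import math

# The three term weightings listed above, keyed by the names used in the text.
TERM_WEIGHTS = {
    "NONE": lambda x: float(x),   # f(x) = x
    "SQRT": math.sqrt,            # f(x) = sqrt(x)
    "LOGP1": math.log1p,          # f(x) = log(x + 1)
}


def apply_weighting(counts, name):
    """Replace every play count by f(count) for the chosen weighting."""
    weight = TERM_WEIGHTS[name]
    return {key: weight(value) for key, value in counts.items()}
```

For a user who played one track 100 times and another once, SQRT reduces the ratio between the two from 100:1 to 10:1, which illustrates how a few very large counts lose influence.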

3.1.2

Complexity

The algorithmic complexity of PLSA is O(P r), with P the number of steps and r the number of ratings. In practice, convergence is reached after P = 50 steps irrespective of the size of the data set.

3.1.3

On-line mode

The PLSA model as presented in [11] does not provide a model for handling new users (or documents, using the original terminology) on the fly without deriving all weights from scratch. Some work has been done to extend this, e.g. [22]. We use a simple scheme presented in [8].


Some of the criticism of PLSA has attacked exactly this deficiency, and later models (Latent Dirichlet Allocation [7], Restricted Boltzmann Machines [21]) were presented to derive a proper generative model for new users.

3.2

Graph algorithm

We experimented with many varieties of the same algorithm. The basic idea is to build a generative model for how users choose tracks by deriving a state transition graph, where nodes are tracks and directed edges between tracks denote the probability of listening to track B after track A. Methods of this type are often referred to as neighborhood methods, since the set of adjacent edges can be thought of as a “neighborhood” of similar items. An example of a subset of a very small graph generated from actual data is presented in Figure 3.3.

3.2.1

Sieving out a small set of pairs

This step is a heuristic pre-processing step. Our goal is to implement an algorithm which, given tracks X and Y, can tell us very quickly whether X and Y are “roughly” related. The idea is to single out about O(n) possibly relevant track-track relations (out of the $O(n^2)$ possible pairs) before proceeding to the next step. We do this by using the PLSA results in order to quickly find the k closest tracks for a given track, since PLSA can calculate the similarity in O(1). Finding the k closest tracks for all tracks can be done by brute force in $O(n^2)$ or by using KD-trees in O(n log n). Even faster but approximate results can be obtained by using Locality Sensitive Hashing. Perhaps surprisingly, brute force is relatively fast, so we used this method.
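A brute-force version of this sieving step could look like the sketch below. It assumes that each track is represented by its vector of latent-class weights from PLSA and that similarity is the cosine of the angle between two such vectors (in line with the caption of Figure 3.1); the exact similarity used in the thesis is not spelled out here, and the function names are hypothetical.

```python
import heapq
import math


def cosine(a, b):
    """Cosine of the angle between two latent-class vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm if norm else 0.0


def k_closest(vectors, k):
    """For every track, return its k most similar other tracks (brute force).

    vectors: dict track -> list of latent-class weights (assumed layout).
    Runs in O(n^2) similarity evaluations for n tracks.
    """
    result = {}
    for x, vx in vectors.items():
        candidates = ((cosine(vx, vy), y) for y, vy in vectors.items() if y != x)
        result[x] = [y for _, y in heapq.nlargest(k, candidates)]
    return result
```

With k fixed, the output contains O(n) candidate pairs, which is the size the next step is designed to handle.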

3.2.2

Derive edge weights

Once we have singled out a reasonable number of pairs, we proceed and derive the transitional probabilities, the “strength” of each connection. We do this by looping over all users in parallel and considering the set of tracks for each user. Our goal is to create a transition graph between tracks, i.e. deriving P(Y|X) for each X and Y. We had a few different approaches to this. First, we tried to simply define P(Y|X) as some function of the users that have listened to tracks X and Y and how many times, respectively. Without deriving any model, and disregarding any risks of statistical insignificance, we experimented with many different heuristics on how to define this P(Y|X). The complexity of this step is $O(m \bar{r}_u^2)$ (where m is the number of users and $\bar{r}_u$ is the average number of ratings per user).


3.2.3

Heuristic transitional probabilities

Starting with the ad-hoc way of deriving transitional probabilities, we tried the ones listed below. We denote the defined entities $C_{XY}$.

• PROD: $C_{XY} \propto \sum_U N_{UX} N_{UY}$
• PRODSQRT: $C_{XY} \propto \sum_U \sqrt{N_{UX} N_{UY}}$
• COSINE: $C_{X \to Y} \propto \dfrac{\sum_U N_{UX} N_{UY}}{\sqrt{\sum_U N_{UX}^2} \, \sqrt{\sum_U N_{UY}^2}}$
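The heuristics above can be computed in a single pass over the users. The sketch below assumes that $N_{UX}$ is the number of times user U played track X and, for brevity, considers every pair in a user's history rather than only the pairs that survived the sieving step; `pair_weights` and its input layout are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def pair_weights(plays, heuristic="PRODSQRT"):
    """Unnormalised weights C_XY for each unordered track pair (X, Y), X < Y.

    plays: dict user -> dict track -> play count (assumed meaning of N_UX).
    """
    if heuristic not in ("PROD", "PRODSQRT", "COSINE"):
        raise ValueError("unknown heuristic: %s" % heuristic)
    sums = defaultdict(float)
    squares = defaultdict(float)
    for tracks in plays.values():
        for x, n in tracks.items():
            squares[x] += n * n
        # Quadratic in the number of tracks per user, as noted in the text.
        for (x, nx), (y, ny) in combinations(sorted(tracks.items()), 2):
            product = nx * ny
            sums[(x, y)] += math.sqrt(product) if heuristic == "PRODSQRT" else product
    if heuristic == "COSINE":
        return {(x, y): s / math.sqrt(squares[x] * squares[y])
                for (x, y), s in sums.items()}
    return dict(sums)
```

The per-user loop is independent of all other users, so the partial sums can be computed in parallel and added together afterwards.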

3.2.4

Model-based transitional probabilities

Apart from the heuristic attempts of defining transitional probabilities, we also implemented a method dubbed PAIRS. This method derives P(Y|X) by contingency tables, but also removes statistically insignificant pairs (X, Y). We assume that P(X) can be accurately determined by its global frequency. Given that we have k observed pairs (X, Y), and a total of N pairs, the likelihood of this event assuming that X and Y are independent is:

$$\binom{N}{k} \left( P(X)P(Y) \right)^k \left( 1 - P(X)P(Y) \right)^{N-k}$$

This means that we can obtain a p-value by summing the binomial distribution from 0 to k. Setting a fixed certainty of e.g. 0.99 gives us an easy way of rejecting or accepting the hypothesis. Note that the acceptance of false positives is not fatal, so we can use a relatively low certainty level. After having accepted a set of pairs, the pairs are sorted decreasingly by the quantity

$$C_{XY} = P(Y|X)P(X) + P(X|Y)P(Y)$$
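The significance test can be sketched with the standard library alone. The code below sums the binomial distribution from 0 to k and accepts a pair when that sum reaches the chosen certainty, which is one reading of the procedure above; the function names are hypothetical, and the probabilities are assumed to lie strictly between 0 and 1.

```python
import math


def binom_log_pmf(i, n, p):
    """Logarithm of the binomial probability of i successes in n trials."""
    return (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p))


def binom_cdf(k, n, p):
    """Sum of the binomial distribution from 0 to k (inclusive)."""
    return min(1.0, sum(math.exp(binom_log_pmf(i, n, p)) for i in range(k + 1)))


def is_significant(k, n, p_x, p_y, certainty=0.99):
    """Accept the pair if k co-occurrences are unlikely under independence."""
    return binom_cdf(k, n, p_x * p_y) >= certainty
```

Working with logarithms through `math.lgamma` avoids overflow in the binomial coefficient when N is large. With N = 1000 and P(X) = P(Y) = 0.1, about 10 co-occurrences are expected under independence, so 30 observed pairs are accepted while 5 are not.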

3.2.5

Final step: decreasing the graph size

We used a simple scheme for cutting down on the number of edges of the graph. Pruning is needed to further reduce the graph to a manageable size. Whereas our first rough sieving is meant to limit the number of edges per vertex to a few thousand, this final pruning reduces it to at most a hundred edges per vertex. We also try to produce a graph that is equally dense over all vertices, meaning that we have the same degree for each vertex. Our pruning algorithm proceeds as follows:

• Collect all tuples $(X, Y, C_{XY})$ and sort them in decreasing order by $C_{XY}$.
• Iterate through the list. For each tuple:
  – If X and Y so far both have fewer than D edges: add an edge with weight $C_{XY}$ from X to Y and one with weight $C_{XY}$ from Y to X.

To obtain the transitional probabilities between vertices, we normalize the weights of all edges so that the sum of weights of the out-edges of any vertex is 1.
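The pruning and normalization steps above can be sketched as follows. The function name `prune_graph` is hypothetical, and the sketch assumes that every unordered pair appears at most once in the input and that X differs from Y.

```python
def prune_graph(tuples, max_degree):
    """Greedy pruning followed by normalisation of the out-edge weights.

    tuples: iterable of (x, y, c_xy), each unordered pair listed once (assumed).
    Returns dict vertex -> dict neighbour -> transition probability.
    """
    degree = {}
    edges = {}
    # Strongest pairs first.
    for x, y, c in sorted(tuples, key=lambda t: t[2], reverse=True):
        if degree.get(x, 0) < max_degree and degree.get(y, 0) < max_degree:
            # Add the edge in both directions with the same weight.
            edges.setdefault(x, {})[y] = c
            edges.setdefault(y, {})[x] = c
            degree[x] = degree.get(x, 0) + 1
            degree[y] = degree.get(y, 0) + 1
    # Normalise so that the out-edges of every vertex sum to 1.
    graph = {}
    for x, out in edges.items():
        total = sum(out.values())
        graph[x] = {y: w / total for y, w in out.items()}
    return graph
```

Although the unnormalised weights are symmetric, the resulting probabilities generally are not, since each vertex is normalised by its own total.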

3.2.6

Generating recommendations

To generate recommendations, we model the user as a probability distribution over the vertices of the graph, where the probability distribution is simply given by the log entries of the user. If the user has listened twice to track A, and three times to track B, we assign vertex A the probability 0.4, and B the probability 0.6. Recommendations are now taken as the probability distribution obtained when performing a one step random walk along the edges.
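As an illustration of the one step random walk, the sketch below builds the user's distribution from play counts and pushes it along the out-edges of the graph. The graph layout (vertex to neighbour to transition probability) and the function names are assumptions made for the example.

```python
def user_distribution(play_counts):
    """Turn play counts into a probability distribution over vertices."""
    total = sum(play_counts.values())
    return {track: count / total for track, count in play_counts.items()}


def one_step_walk(graph, start):
    """Distribution after one random-walk step from `start`.

    graph: dict vertex -> dict neighbour -> transition probability (assumed).
    """
    result = {}
    for track, probability in start.items():
        for neighbour, weight in graph.get(track, {}).items():
            result[neighbour] = result.get(neighbour, 0.0) + probability * weight
    return result
```

For the example in the text, a user with two plays of A and three of B starts at {A: 0.4, B: 0.6}; if A leads only to C and B leads to C and D with equal probability, the recommendations are C with 0.7 and D with 0.3.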

3.2.7

Online mode

The probability distribution attained by taking a one step random walk from the user’s probability distribution can be kept in memory cheaply. We let D denote the number of edges per vertex as above. The size of the graph is O(N D) (with N the number of tracks) and the memory footprint of any user is O(nD) where n is the number of tracks which the user has listened to. At any time, adding a new track is O(D), since it amounts to adding a constant to D values, namely the D adjacent vertices to the track added. For all purposes, D is at most 100, and can be regarded as a constant.
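The O(D) update can be illustrated by keeping unnormalised scores per user and dividing by the total play count only when a prediction is requested. The class below is a hypothetical sketch of that bookkeeping, not the interface of the actual system.

```python
class OnlineUser:
    """Keeps the one-step random-walk scores of a single user up to date."""

    def __init__(self, graph):
        # graph: dict vertex -> dict neighbour -> transition probability (assumed).
        self.graph = graph
        self.scores = {}   # unnormalised score per reachable track
        self.total = 0     # total number of plays seen so far

    def add_rating(self, track, count=1):
        """O(D): add a constant to the scores of the D neighbours of `track`."""
        self.total += count
        for neighbour, weight in self.graph.get(track, {}).items():
            self.scores[neighbour] = self.scores.get(neighbour, 0.0) + count * weight

    def prediction(self, track):
        """O(1): normalise on demand by the total play count."""
        if self.total == 0:
            return 0.0
        return self.scores.get(track, 0.0) / self.total
```

Storing unnormalised values is what keeps the update local: a new play changes only the D neighbouring entries, while the normalisation constant is a single counter.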

3.3

Hu-Koren-Volinsky SVD

In [12], a concise and fast algorithm that decomposes the feedback matrix was presented by the same authors that were at the time leaders of the Netflix Prize (e.g. [14]). In the original paper, no name is given to the algorithm, so we will use HKV-SVD to refer to it. The algorithm differentiates between preference values $p_{ui}$ and confidence values $c_{ui}$. These two quantities are defined from the log counts $r_{ui}$ as

$$p_{ui} = \begin{cases} 1 & \text{if } r_{ui} > 0 \\ 0 & \text{otherwise} \end{cases} \qquad c_{ui} = 1 + \alpha r_{ui}$$

with α typically 40. We then try to find f-sized vectors $x_u$ for each user and $y_i$ for each item so that the global quantity

$$\sum_{u,i} c_{ui} \left( p_{ui} - x_u^T y_i \right)^2 + \lambda \left( \sum_u |x_u|^2 + \sum_i |y_i|^2 \right)$$

is minimized. Note that the first term sums over all values, including those corresponding to $r_{ui} = 0$. The value λ controls how much the norm of the vectors is penalized, and in our setting, λ = 1000 worked fine. The $x_u$’s and $y_i$’s are found by alternately fixing one of the types and optimizing the other, where the optimal values can be given as a closed-form expression. Typically, the vectors converge after no more than 10-20 steps, and the total complexity of each step is $O(N f^2 + (n + m) f^3)$, where N is the total number of ratings and f is the dimensionality, which we set to 50. The full derivation is given in [12]. We note that it is also relatively straightforward to run this algorithm on a computer cluster.

A disadvantage of this algorithm is its lack of any generative model. The conversion from the predicted $x_u^T y_i$ to $P(N_{ui})$ is also non-trivial, since the former quantity is not guaranteed to be positive, and the sum of all predictions is not equal to 1. We define $P(N_{ui}) \propto \max(x_u^T y_i, 0)$ and perform the conversion by some quite messy normalization that goes outside the scope of this report. However, despite this weak point, the algorithm performs on par with our other algorithms, and improves the final combined score significantly.

[Figure 3.3: graph with artists as vertices and weighted edges between related artists.]

Figure 3.3. A zoomed in version of a simplified graph created by the Graph algorithm running on artist data. There are 5000 users in this example, and we limit the edges per vertex to 5 for clarity. Self-edges from a vertex to itself are not shown in the figure.
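To make the HKV-SVD objective of this section concrete, the sketch below evaluates it for given user and item vectors. It only computes the quantity being minimized and does not implement the alternating closed-form updates from [12]; the function name and the dictionary layouts are hypothetical.

```python
def hkv_objective(r, x, y, alpha=40.0, lam=1000.0):
    """Evaluate the regularised, confidence-weighted squared error.

    r: dict (user, item) -> log count r_ui (missing pairs count as 0)
    x: dict user -> list of f floats, y: dict item -> list of f floats
    All three layouts are assumptions made for this sketch.
    """
    loss = 0.0
    # The sum runs over ALL (user, item) pairs, including those with r_ui = 0.
    for u, x_u in x.items():
        for i, y_i in y.items():
            r_ui = r.get((u, i), 0)
            preference = 1.0 if r_ui > 0 else 0.0
            confidence = 1.0 + alpha * r_ui
            prediction = sum(a * b for a, b in zip(x_u, y_i))
            loss += confidence * (preference - prediction) ** 2
    penalty = (sum(v * v for x_u in x.values() for v in x_u)
               + sum(v * v for y_i in y.values() for v in y_i))
    return loss + lam * penalty
```

Evaluating the objective this way is quadratic in the number of users and items and is only meant for checking small examples; the per-step complexity quoted in the text relies on the algebraic rearrangement described in [12].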

3.4

Proxy algorithm

As we will return to in 4.5, we can also produce album and artist recommendations in a relatively straightforward way. These recommendations provide data that could be useful for track recommendations as well. To enhance track recommendations, we therefore use the artist and album recommendations as track recommendations. We develop a meta-model named proxy. This algorithm, when asked to produce a recommendation for a (user, track) pair, maps the track to its artist or album, obtains the recommendation from that ensemble, and returns the value. For tracks with several artists, a simple average is taken over those artists. When returning a top list of tracks, the proxy algorithm first obtains a top list of artists or albums, then replaces each artist or album with the k most popular tracks by that artist or on that album.
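The proxy idea can be sketched for the artist case as below. All names are hypothetical: `artist_prediction` stands in for whatever the artist ensemble returns, `track_to_artists` for the track-to-artist mapping, and `popular_tracks` for a precomputed popularity ordering per artist.

```python
def proxy_prediction(user, track, track_to_artists, artist_prediction):
    """Predict a track by averaging the predictions for its artists."""
    artists = track_to_artists[track]
    values = [artist_prediction(user, artist) for artist in artists]
    return sum(values) / len(values)


def proxy_top_list(top_artists, popular_tracks, k):
    """Replace each artist in a top list by its k most popular tracks.

    top_artists:    artists in recommendation order for one user
    popular_tracks: dict artist -> tracks sorted by decreasing popularity
    """
    tracks = []
    for artist in top_artists:
        tracks.extend(popular_tracks.get(artist, [])[:k])
    return tracks
```

The album case is analogous, with each track mapping to a single album so that no averaging is needed.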

3.5

Merging algorithms

Our goal is to combine the different values of $P(N_{ui})$ of the algorithms linearly in order to obtain a final value. We do this by deriving weights $\lambda_1 \ldots \lambda_k$, where k is the total number of algorithms. If the algorithms 1 . . . k suggest probabilities $P(N_{ui})^{(1)} \ldots P(N_{ui})^{(k)}$, we form the final weighted value using

$$P(N_{ui})_{\text{final}} = \lambda_1 P(N_{ui})^{(1)} + \ldots + \lambda_k P(N_{ui})^{(k)}$$

To enforce consistent values of $P(N_{ui})_{\text{final}}$, we also demand that $\lambda_1 + \lambda_2 + \ldots + \lambda_k = 1$.


Optimizing arbitrary metrics is a hard problem. Some research has been done, and [17] contains a comparative study. We believe that in this setting, spending lots of effort on optimizing metrics is an artificial task, as there is no conclusive evidence of any metric being superior to any other. Instead of directly optimizing the mean rank, we optimize the likelihood as follows:

• For each user u in the regression set, create a playlist consisting of each track in their history, except the 10 most recently listened tracks.
• For each user, remove the tracks in the set of the 10 most recently listened tracks that the user at some point listened to earlier.
• For each user u and each of the remaining recently listened tracks i:
  – Derive $P(N_{ui})^{(1)} \ldots P(N_{ui})^{(k)}$.
  – Insert i into the playlist.
• Find the λ’s maximizing the likelihood using gradient descent.

The advantage of this method is that the problem is “well-posed” in the sense that the target function is concave and continuously differentiable, so that the global maximum can be found easily and very fast. Though outside the scope of this project, we experimented with several other ways of optimizing metrics, none of which was significantly superior to any other.
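The last step of the procedure above, finding the λ's, can be sketched as gradient ascent on the average log-likelihood. The softmax re-parametrisation used here to keep the weights non-negative and summing to 1 is a choice made for this sketch and is not stated in the text; the function names, step count and learning rate are likewise assumptions.

```python
import math


def softmax(theta):
    """Map unconstrained parameters to weights that are positive and sum to 1."""
    largest = max(theta)
    exps = [math.exp(t - largest) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def fit_merge_weights(samples, steps=500, learning_rate=0.5):
    """Find weights maximising the log-likelihood of the held-out events.

    samples: list of tuples; each tuple holds the probabilities that the
    k algorithms assigned to one held-out (user, track) event. At least one
    entry per tuple is assumed to be positive.
    """
    k = len(samples[0])
    n = len(samples)
    theta = [0.0] * k
    for _ in range(steps):
        lam = softmax(theta)
        # Gradient of the log-likelihood with respect to each weight.
        grad = [0.0] * k
        for probs in samples:
            mix = sum(l * p for l, p in zip(lam, probs))
            for a in range(k):
                grad[a] += probs[a] / mix
        # Chain rule through the softmax, averaged over the samples.
        theta = [t + learning_rate * lam[a] * (grad[a] / n - 1.0)
                 for a, t in enumerate(theta)]
    return softmax(theta)
```

If one algorithm consistently assigns higher probability to the tracks that were actually played, its weight grows towards 1; if two algorithms agree everywhere, the weights stay where they started.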

3.6

Creating fast top lists

Forming a top list of the n top predictions can obviously be done by generating predictions for all tracks, though this is not a practical approach. Instead, we use the ability of some of the algorithms to generate top lists, even though not all of them have this ability (e.g. PLSA does not). Looking at top recommendations from different algorithms as candidates for the final top list, we poll candidates from algorithms, merge the candidates and obtain a nearly optimal top list:

• Query each algorithm for the n top items of user u.
• Take the union of all these lists.
• Calculate $P(N_{ui})$ for each item i in the union by using the weights $\lambda_1, \ldots, \lambda_k$.
• Output the n entries with largest $P(N_{ui})$.
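The four steps above can be sketched as follows. `TableAlgorithm` is a hypothetical stand-in for a real algorithm, backed by a fixed score table; algorithms that cannot produce top lists (as the text notes for PLSA) still contribute to the combined score but propose no candidates.

```python
class TableAlgorithm:
    """Hypothetical stand-in for an algorithm, backed by a fixed score table."""

    def __init__(self, scores, can_top=True):
        self.scores = scores      # dict user -> dict item -> P(N_ui)
        self.can_top = can_top    # whether the algorithm can produce top lists

    def predict(self, user, item):
        return self.scores[user].get(item, 0.0)

    def top(self, user, n):
        table = self.scores[user]
        return sorted(table, key=table.get, reverse=True)[:n]


def merged_top_list(user, algorithms, weights, n):
    """Poll candidates from each algorithm, then rank them by combined score."""
    candidates = set()
    for algorithm in algorithms:
        if algorithm.can_top:
            candidates.update(algorithm.top(user, n))

    def combined(item):
        return sum(w * a.predict(user, item) for w, a in zip(weights, algorithms))

    # Sort by decreasing combined score; break ties by item name.
    return sorted(candidates, key=lambda item: (-combined(item), item))[:n]
```

The result is only nearly optimal because an item that is ranked just outside the top n by every polling algorithm never becomes a candidate, even if its combined score would be high.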


Chapter 4

Implementation

4.1

Environment

For the purpose of developing a prototype rapidly, most of the system was written in Python. Since most of the calculations are run on a cluster, they are I/O-bound and thus suffer little from Python’s slower execution speed. We rewrote some small parts in Java and are currently porting some of the online code to C++. For the clustered algorithms, we used the framework provided by Hadoop [2], which is a publicly available software toolkit for running very large map-reduce jobs, similar to MapReduce developed at Google. Hadoop is an open source project developed in Java, with large contributions from Yahoo. Furthermore, for some RPC tasks we used Thrift [5], and we used XML as the format for configuration files. We also used SciPy [4] for mathematics in Python, and Boost.Python [1] for binding Python with C++.

The data sets used for experiments were real data sets containing all tracks that had been played during 2008. However, we removed the rarest tracks and users.

4.2

Complexity

The system is required to support instantaneous feedback from new ratings. Since most of the algorithms mentioned above have very long execution times (several hours or even days), we have to split all calculations into an offline step and an online step. The offline step is performed in the background. The system works with a snapshot of the logs. Typically, the offline jobs are run on a clustered platform. After the offline step, the system stores its results, which are designed to fit easily within the working memory of the online nodes.

Working with the concept of lists (as defined in 2.2) moves the responsibility of creating user recommendations outside the system. At any time, a system that is connected to a live feed of tracks being played can create a new list and insert the entire history of the user. As the user listens to new tracks, those tracks can be dynamically inserted into the list. We summarize the complexities of the operations in the table below.

Operation                               Complexity
NewList(list-id)                        O(1)
AddRating(list-id, item-id, count)      O(1)
GetPrediction(list-id, item-id)         O(1)
GetTopList(list-id, size) (optional)    O(n) (n is the number of ratings of the user)

[Figure 4.1: flow chart of the build process, from log_extractor() and split_data() through jobs such as plsa(track, 25, sqrt), koren(track, 50) and graph_build_graph(track, pairs, limit, 100) to ensemble_train(track) and ensemble_evaluate(track), together with the files and online interfaces they produce.]

Figure 4.1. A flow chart of all the objects. Ellipses denote jobs, documents denote files, and the rectangles with two pegs at the left side denote interfaces.


4.3

Algorithm dependencies

Though admittedly a bit outside the scope of this report, we will mention some aspects of a build system we developed to define algorithm dependencies. By defining recurring “patterns” mapping one or more input files and input interfaces to one or more output files and output interfaces, we could express the execution process in a very flexible and compact way. Patterns could be given in parametrized form, i.e. stated as implicit rules mapping inputs to outputs, and also partially specialized. An additional advantage was that if execution terminated prematurely, it could be resumed from its last results at any time. The development was inspired by the functional nature of a Makefile, a standard way of compiling applications on UNIX systems. The resulting dependency graph of the final build process is presented in Figure 4.1.

4.4

Scalability and fault tolerance

The system was developed with scalability and fault tolerance in mind at all times, although the mechanisms for distributing load and handling failures are not implemented so far. However, the system is designed to be extended with these features in a straightforward way. Some design features of the system that greatly facilitate scalability and fault tolerance are:

• In the online mode, users are independent of each other and defined by the set of their ratings, which implies that
  – Users can be distributed evenly across available computers
  – Users can be hosted by several computers in case of computer failure, using the least loaded computer to calculate predictions
  – Users can be efficiently transferred from one computer to another
• The results of the offline steps are at most a few hundred megabytes, so they can be efficiently loaded into working memory by online computers
• When new calculation results from the offline mode are ready, the online nodes can swap the new results into memory with minimal interruptions.

4.5 Generating recommendations for artists and albums

While we have mentioned how the algorithms can generate track recommendations, it is also possible to generate artist and album recommendations in a similar fashion. We clone the ensemble into three copies: the track ensemble, running on the original log data, and the album and artist ensembles, running on log data where all entries are mapped to the corresponding album or artist.
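The mapping step can be sketched in a few lines. The log format used here, a sequence of (user, track) pairs, and the lookup tables are assumptions made for the example only.

```python
def map_entries(entries, item_of_track):
    """Rewrite (user, track) log entries as (user, album-or-artist) entries."""
    return [(user, item_of_track[track]) for user, track in entries]


# Hypothetical lookup tables and log entries, for illustration only.
ALBUM_OF = {"t1": "album-x", "t2": "album-x", "t3": "album-y"}
ARTIST_OF = {"t1": "artist-p", "t2": "artist-p", "t3": "artist-p"}
LOG = [("u1", "t1"), ("u1", "t2"), ("u2", "t3")]

album_log = map_entries(LOG, ALBUM_OF)    # input for the album ensemble
artist_log = map_entries(LOG, ARTIST_OF)  # input for the artist ensemble
```

Several tracks collapse onto the same album or artist, so the mapped logs contain fewer distinct items while the number of entries stays the same.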

Chapter 5

Results

5.1 Test sets

We used three test sets for testing purposes, named SMALL-SPARSE, SMALL-DENSE and LARGE, respectively. The smaller test sets were used as “toy” sets, in order to quickly evaluate algorithms before running them on LARGE. The results on the various data sets also indicate some of the scalability properties of the algorithms. All test sets were extracted from the real logs from July 2008 to December 2008, and represent smaller subsets where the rarest items and users have been removed. A summary of the sizes is presented below.

                                SMALL-SPARSE   SMALL-DENSE        LARGE
Tracks                                 8 056         9 281      262 209
Users                                 17 007        13 963      155 741
Log entries                        4 948 359    14 312 971   86 559 885
Distinct (user, track) pairs       2 146 579     4 395 561   44 353 610

5.2 Baseline

For comparison purposes, we followed [12] and constructed a “dummy” algorithm that produces a list of recommendations by presenting the set of tracks in decreasing order of global popularity. This gives a baseline value to compare against.
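A popularity baseline of this kind can be sketched as below. The sketch assumes that the log is a sequence of (user, track) pairs and that popularity is measured as the number of log entries per track; counting distinct listeners instead would be an equally plausible reading. The function name is hypothetical.

```python
from collections import Counter


def popularity_baseline(log_entries, size):
    """Return the `size` most played tracks, most popular first."""
    counts = Counter(track for _user, track in log_entries)
    return [track for track, _count in counts.most_common(size)]


# Toy log for illustration: t1 is played three times, t2 twice, t3 once.
LOG = [("u1", "t1"), ("u2", "t1"), ("u3", "t1"),
       ("u1", "t2"), ("u2", "t2"), ("u3", "t3")]
print(popularity_baseline(LOG, 2))
```

Every user receives the same list, which is what makes it a useful lower bar for the personalized algorithms.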

5.3 Results

We ran a more extensive test on SMALL-SPARSE and SMALL-DENSE, and tried more algorithms. See Figure 5.1 and Figure 5.2 for the complete results. We selected a smaller subset of the algorithms above before running on LARGE, in particular since the PLSA algorithm takes a very long time to run. The ones we chose to use were

• Graph: 25 edges
• PLSA: 25 latent classes
• HKV-SVD: 12 latent classes