MusicSense: Contextual Music Recommendation using Emotional Allocation Modeling

Rui Cai, Chao Zhang, Chong Wang, Lei Zhang, and Wei-Ying Ma
Microsoft Research Asia, 49 Zhichun Road, Beijing 100080, P.R. China

{ruicai, v-chaozh, chwang, leizhang, wyma}@microsoft.com

ABSTRACT

In this paper, we present a novel contextual music recommendation approach, MusicSense, which automatically suggests music while users read Web documents such as Weblogs. MusicSense matches music to a document's content in terms of the emotions expressed by both the document and the songs. To achieve this, we propose a generative model, Emotional Allocation Modeling, in which a collection of word terms is treated as generated from a mixture of emotions. The model integrates knowledge discovered from a Web-scale corpus with guidance from psychological studies of emotion. Music songs are also described using textual information extracted from their metadata and relevant Web pages. Thus, both music songs and Web documents can be characterized as distributions over emotion mixtures through emotional allocation modeling. For a given document, the songs with the best-matched emotion distributions are finally selected as the recommendations. Preliminary experiments on Weblogs show promising results on both emotion allocation and music recommendation.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Retrieval models, search process; G.3 [Mathematics of Computing]: Probability and Statistics—Statistical computing

General Terms: Algorithms, Performance, Experimentation

Keywords: MusicSense, Emotional Allocation Modeling, Contextual Music Recommendation, Moods

1. INTRODUCTION

With the growth of the Internet, automatic recommendation has become important in music sales. Many commercial systems, such as Last.fm [2] and Pandora [4], have developed sophisticated approaches for music recommendation, through either collaborative filtering [2] or content similarity [4]. However, in most current application scenarios, recommendation is still passive: consumers have to send queries to a service to obtain suggestions. This misses many opportunities for music recommendation while users surf the Web. In contrast, Google AdSense [1] delivers contextual advertisements matched to a website's content in a more active way, increasing the chances of ad clicks and product sales. Inspired by AdSense, in this paper we propose a novel approach, MusicSense, for contextual music recommendation. The main idea is to automatically deliver music pieces (or their thumbnails) that are relevant to the context of a Web page while users read it. This requires a way to properly measure the contextual relevance between music songs and Web pages. In this paper, we choose emotion as the bridge for such relevance matching, as music is fundamentally about conveying composers' emotions, and many Web pages, such as Weblogs, likewise express the sentiments of their writers.

Many research efforts have been reported in the literature on both music and text emotion classification. For example, Lu et al. utilized Gaussian mixture models to classify songs into four emotion categories, using acoustic content features such as intensity, timbre, and rhythm [12]. For text documents, Cui et al. [9] comparatively studied various supervised classifiers for labeling online product reviews as positive or negative opinions, and Leshed et al. [11] categorized Weblogs into the ten most frequently used moods with support vector machines (SVMs). However, for the purpose of MusicSense, these works have some limitations. First, most can handle only a few mood categories [3, 12], which may be insufficient and inflexible for our purpose. Second, most rely on supervised algorithms for mood classification, so their effectiveness depends heavily on the quality of the training data; in our situation, it is difficult to collect enough high-quality training data for all possible emotions. Moreover, cross-modal emotion mapping is still an open problem.

To address these problems and provide a more natural basis for MusicSense, we propose a probabilistic model called Emotional Allocation, which characterizes songs and Web pages as distributions in a common emotion space. The model leverages both statistics from a large-scale Web corpus and guidance from psychological studies, and it retains the inference ability of generative models. For relevance matching, songs and Web documents are each represented by a collection of word terms, from which their emotion distribution parameters are optimized iteratively. These distributions are then compared using the Kullback–Leibler divergence, and the closest document–song pairs are finally selected for MusicSense recommendation.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MM'07, September 23–28, 2007, Augsburg, Bavaria, Germany. Copyright 2007 ACM 978-1-59593-701-8/07/0009 ...$5.00.


Figure 1: The framework of the proposed approach. [Figure: music songs and Weblogs pass through Web-based description generation and salient word extraction, then through Emotional Allocation Modeling (built on a Web-scale corpus and an emotion vocabulary), and finally through relevance matching.]

Figure 2: Graphical model representation of the Emotional Allocation Modeling. The box denotes a "plate" representing replicates; solid circles denote random variables, while dashed squares denote hyperparameters. [Figure: hyperparameters λ and β; θ ~ p(θ; λ); for n = 1..N, e_n ~ p(e_n | θ) and w_n ~ p(w_n | e_n; β).]

The rest of this paper is organized as follows. The framework of our approach is introduced in Section 2. In Section 3, we describe the details of the emotional allocation modeling. Evaluation and discussion are presented in Section 4. Finally, we conclude our work in Section 5.

2. FRAMEWORK OF OUR APPROACH

The framework of the proposed approach is illustrated in Fig. 1. It mainly consists of three steps: (i) emotional allocation modeling; (ii) Web-based music description generation and Web document analysis; and (iii) probability inference and relevance matching.

In our modeling, we assume that, given a language and its vocabulary, different emotions have different distributions over the terms of the vocabulary; in other words, a term's frequency differs under different emotions. Then, given a collection of terms (e.g., a document), we can suppose it is generated by sampling from a mixture of emotions, as the terms in the collection can be regarded as governed by different emotions. The parameters of such a sampling process can be computed in a maximum-likelihood manner. In this way, a term collection obtains a certain allocation of emotions in the form of a probability distribution; we therefore call this modeling Emotional Allocation. Moreover, as shown in Fig. 1, our modeling also refers to knowledge from psychology [8] to obtain a relatively complete and well-structured emotion vocabulary. And to achieve more accurate model estimation, we learn the term–emotion relations by collecting statistics over a very large set of Web pages, instead of relying on a limited training set. The implementation details of the modeling process and the final relevance matching are introduced in Section 3.

In the following, we give some more explanation of our implementation of step (ii). First, in this paper we mainly adopt Web-based information to describe music songs,¹ as current content-based music analysis technologies still cannot handle classification with tens of moods well [12], while on the Web there is relatively abundant information, such as lyrics and reviews, describing the semantics of a song. Thus, we can use search engines to retrieve information from the Web to characterize songs. Here, we prepare two queries of the form "title + lyrics" and "title + reviews", respectively. The first page returned by the first query and the top 20 pages returned by the second query are then used to generate descriptions. That is, the retrieved pages are merged into a virtual document after removing HTML tags and stop words. Then, for each term in this document, the well-known term frequency–inverse document frequency (tf × idf) [6] is computed as its weight.² Finally, the top N terms with the highest weights are selected as the description; a similar idea can be found in [10]. In this paper, we experimentally set N = 100, to keep the most informative terms while balancing the computational complexity of the subsequent probability inference. Web documents are processed in a similar way. In this paper, we mainly focus on Weblogs, where bloggers write about their feelings, opinions, and emotions. For each blog post, the top 100 terms with the highest tf × idf are likewise kept as salient words for further inference.
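As a concrete illustration, the salient-word extraction just described can be sketched as follows. This is a minimal sketch, not the paper's implementation: the function name and the toy idf table are invented for illustration, and in the paper the idf values would be estimated from a Web-scale corpus.

```python
import re
from collections import Counter

def salient_words(text, idf, top_n=100):
    """Rank a document's terms by tf x idf and keep the top_n as its
    salient-word description (the paper keeps N = 100)."""
    # Tokenize; HTML-tag and stop-word removal are assumed to have
    # happened upstream, as described in the paper.
    terms = re.findall(r"[a-z']+", text.lower())
    tf = Counter(terms)
    # idf maps term -> inverse document frequency (here a toy table).
    weights = {t: count * idf.get(t, 0.0) for t, count in tf.items()}
    return sorted(weights, key=weights.get, reverse=True)[:top_n]

# Toy idf table standing in for Web-scale statistics (illustrative only).
idf = {"heart": 2.1, "love": 1.8, "the": 0.01, "ship": 2.5}
print(salient_words("The heart will go on, love the ship", idf, top_n=2))
# -> ['ship', 'heart']
```

For a real song, `text` would be the virtual document merged from the retrieved lyrics and review pages.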

3. EMOTIONAL ALLOCATION MODELING

In this section, we describe the details of the proposed emotional allocation modeling, including the model construction, parameter inference, and relevance matching.

3.1 A Generative Model

The graphical model representation of the emotional allocation is illustrated in Fig. 2. In this model, we assume there are K emotions, each of which can be represented as a multinomial distribution over all the terms of a vocabulary W = {w_1, ..., w_M}:

p(w = w_m | e = e_k) = β_km  (1 ≤ m ≤ M, 1 ≤ k ≤ K),  (1)

where for each k, Σ_{m=1}^{M} β_km = 1, as shown in Fig. 2. In other words, conditioned on an emotion e_k, each term w_m is generated with probability β_km. In addition, to characterize the generation of a series of terms, the emotion variable e is repeatedly sampled from another multinomial distribution p(e = e_k | θ) = θ_k, which is controlled by a hyper-variable θ. As in latent Dirichlet allocation (LDA) [7], we assume θ follows a Dirichlet distribution, since it is the conjugate prior of the multinomial distribution in Bayesian statistics. The probability density of a K-dimensional Dirichlet distribution is defined as:

Dir(θ; λ) = [Γ(Σ_{i=1}^{K} λ_i) / Π_{i=1}^{K} Γ(λ_i)] Π_{i=1}^{K} θ_i^{λ_i − 1},  (2)

where λ = (λ_1, ..., λ_K) is the parameter of this density. Thus, all the parameters of this model are λ and β. The K emotions {e_1, ..., e_K} can be selected manually according to suggestions from psychological studies; for example, the forty basic emotions defined in the Basic English Emotion Vocabulary [8] are adopted in our experiments. With so many emotions, it becomes very difficult to collect and appropriately label sufficient training data for learning the emotion–term relations (i.e., β).
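The generative process above can be illustrated with a small sampling sketch. This is an assumption-laden toy: the vocabulary, the two emotions, and the β values are invented for illustration, whereas the paper uses forty emotions [8] and a Web-scale estimate of β.

```python
import random

def sample_dirichlet(lam, rng):
    """theta ~ Dir(lam), drawn as normalized Gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in lam]
    s = sum(g)
    return [x / s for x in g]

def generate_terms(lam, beta, vocab, n_terms, seed=0):
    """Generative process of Eqs. (1)-(2): draw theta ~ Dir(lam); for
    each term, draw an emotion k from Multinomial(theta), then a word
    from that emotion's term distribution beta[k]."""
    rng = random.Random(seed)
    theta = sample_dirichlet(lam, rng)
    terms = []
    for _ in range(n_terms):
        k = rng.choices(range(len(lam)), weights=theta)[0]
        terms.append(rng.choices(vocab, weights=beta[k])[0])
    return terms

# Two toy emotions over a four-word vocabulary (illustrative values only).
vocab = ["tears", "grief", "smile", "sunshine"]
beta = [[0.5, 0.4, 0.05, 0.05],   # a 'sad'-like emotion
        [0.05, 0.05, 0.5, 0.4]]   # a 'happy'-like emotion
print(generate_terms([1.0, 1.0], beta, vocab, n_terms=5))
```

Inference, described next, runs this process in reverse: given the observed terms, it recovers the posterior over θ.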

¹ It should be noted that our modeling approach is still capable of integrating knowledge from acoustic content analysis. We will discuss this in Section 3.3.

² The idf is estimated based on a Web-scale corpus.

Thus, in this paper, β is estimated by collecting statistics over a Web-scale corpus (around 100 million Web pages in our current experiments). The main assumption is that β_km should be proportional to the co-occurrence frequency of the term w_m and the emotion e_k when the corpus is large enough. The detailed implementation is as follows:

1. Expand each emotion with its synonyms looked up in WordNet [5]. For example, the emotion "happy" is expanded with words such as blessed, blissful, and glad. The M_k typical synonyms of the emotion e_k are denoted w^k_i, 1 ≤ i ≤ M_k (for efficiency, M_k is less than 10 in the experiments).

2. For each pair (w_m, w^k_i), count its co-occurrences N_{w_m, w^k_i} over the whole corpus. Here, two terms are counted as co-occurring once if they appear in the same paragraph. This is because a paragraph is a block with relatively consistent semantics and proper length, whereas a sentence is too short to provide sufficient statistics, and a whole document is too long and may contain multiple semantics.

3. Define the co-occurrence of the term w_m and the emotion e_k as N_{w_m, e_k} = Σ_{i=1}^{M_k} N_{w_m, w^k_i} × idf_{w_m} × idf_{w^k_i}, where idf_{w_m} and idf_{w^k_i} are inverse document frequencies estimated on the same corpus, used here to penalize overly popular terms.

4. Set β_km = N_{w_m, e_k} / Σ_{i=1}^{M} N_{w_i, e_k}.

3.2 Parameter Inference

As introduced in Section 2, the goal of the modeling is to infer the underlying emotion allocation of a set of terms, which is controlled by the variable θ in this model. From the graph structure in Fig. 2, the conditional probability of a collection of N terms w = <w_1, ..., w_N> given θ is:

p(w | θ; λ, β) = Π_{n=1}^{N} Σ_{e_n=1}^{K} p(w_n | e_n; β) p(e_n | θ; λ).  (3)

Conversely, by Bayes' theorem, the posterior distribution of θ given the collection w is:

p(θ | w; λ, β) = p(w | θ; λ, β) p(θ; λ) / ∫_θ p(w | θ; λ, β) p(θ; λ) dθ,  (4)

which is unfortunately computationally intractable. However, variational inference [13] provides a close approximation to the model parameters, denoted λ*, through the iterative process shown in Table 1.

Table 1: Variational Inference of the Parameters
1. λ*_i = N/K, (1 ≤ i ≤ K).
2. for n = 1 to N do
3.   for i = 1 to K do φ_ni = β_{i,w_n} exp(Ψ(λ*_i)),^a
4.   normalize φ_n1, ..., φ_nK to sum to 1.
5. end.
6. for i = 1 to K do λ*_i = ε + Σ_{n=1}^{N} φ_ni end.^b
7. if convergence, exit; else go to line 2.
^a Ψ(x) is the first derivative of log Γ(x), and can be approximated via Taylor expansion.
^b ε is a small positive constant for parameter smoothing.

In the first step of Table 1, λ* is by default initialized uniformly, assuming each mood has an equal prior probability for the word collection. However, as mentioned in Section 2, λ* can also be initialized specifically when other knowledge, such as the result of acoustic analysis, is available, to achieve a more reasonable inference result. For example, if a song is classified as sad by content-based analysis, the corresponding element of λ* can be initialized with a higher prior weight. Through this inference, each term collection is finally represented as a Dirichlet distribution over the mixture of emotions, with the optimized posterior parameter λ*.

3.3 Relevance Matching

In our modeling, two term collections are considered relevant if they have similar distributions, i.e., similar allocations of emotions. In the MusicSense scenario, the songs most relevant to a Weblog are selected as its recommendations. A natural way to measure the similarity of two distributions is the Kullback–Leibler (KL) divergence. The KL divergence between two K-dimensional Dirichlet distributions Dir(θ; λ^p) and Dir(θ; λ^q) is:

KL_Dir(λ^p; λ^q) = log [Γ(Σ_{i=1}^{K} λ^p_i) / Γ(Σ_{i=1}^{K} λ^q_i)] + Σ_{i=1}^{K} log [Γ(λ^q_i) / Γ(λ^p_i)] + Σ_{i=1}^{K} (λ^p_i − λ^q_i) [Ψ(λ^p_i) − Ψ(Σ_{j=1}^{K} λ^p_j)].  (5)

As the KL divergence is asymmetric, the distance between two term collections w^p and w^q is finally defined as the symmetrized form:

Dist(w^p; w^q) = (1/2) [KL_Dir(λ^p; λ^q) + KL_Dir(λ^q; λ^p)],  (6)

where a smaller distance means higher relevance.

4. EVALUATIONS AND DISCUSSIONS

Evaluating music recommendation is not a trivial task; to the best of our knowledge, there is still no established methodology for such an evaluation. In this paper, we compare our recommendation results with subjective preferences, to find out how close our approach can come to an ideal system. For the experiments, we collected 100 songs and 50 Weblogs.³ The songs and Weblogs were selected from various themes, aiming to cover as many emotions as possible. The descriptions of all the songs were retrieved from the Web, as introduced in Section 2. Five college students were then invited to label the ground truth, as follows:

• Each labeler was asked to listen to each song and tag it with one or more words from the forty emotions in the Basic English Emotion Vocabulary [8]. The Weblog posts were tagged in the same way.

• For each Weblog, each labeler was asked to pick, from all 100 songs, the 3–5 songs he or she considered the most ideal candidates to listen to while reading that blog post.

Accordingly, the following evaluations consist of two parts: emotion allocation and music recommendation.

³ This data set is somewhat small due to the expensive labeling effort.

4.1 Emotion Allocation

We first investigate the effectiveness of the proposed modeling on music emotion allocation. As introduced in Section 3.2, each song is represented by a Dirichlet distribution parameterized by λ*. By the properties of the Dirichlet distribution, E(θ_i | λ*) = λ*_i / Σ_{k=1}^{K} λ*_k, which can be taken as the "weight" of the i-th emotion in this song.
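The update loop of Table 1, together with the expected emotion weights just described, can be sketched as follows. This is a minimal sketch with invented toy numbers: the Python standard library has no digamma, so a standard recurrence-plus-asymptotic-series approximation of Ψ(x) is included, and the β values are illustrative rather than Web-scale estimates.

```python
import math

def digamma(x):
    """Psi(x), the first derivative of log Gamma(x); computed via the
    recurrence Psi(x) = Psi(x+1) - 1/x plus an asymptotic series."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))

def infer_lambda(words, beta, K, eps=1e-3, iters=100):
    """Variational inference of Table 1 for one term collection.
    words: vocabulary indices of the collection; beta[k][m] = p(w_m | e_k)."""
    N = len(words)
    lam = [N / K] * K                              # step 1: uniform init
    for _ in range(iters):
        new_lam = [eps] * K                        # smoothing constant (step 6)
        for w in words:                            # steps 2-5
            phi = [beta[i][w] * math.exp(digamma(lam[i])) for i in range(K)]
            s = sum(phi)
            new_lam = [nl + p / s for nl, p in zip(new_lam, phi)]
        if max(abs(a - b) for a, b in zip(new_lam, lam)) < 1e-9:
            return new_lam                         # step 7: converged
        lam = new_lam
    return lam

def emotion_weights(lam):
    """E(theta_i | lam*) = lam*_i / sum_k lam*_k, the per-emotion weights
    used in Section 4.1."""
    s = sum(lam)
    return [x / s for x in lam]

# Toy example: two emotions over four terms; the collection leans 'sad'.
beta = [[0.5, 0.4, 0.05, 0.05],   # emotion 0, e.g. 'sad'
        [0.05, 0.05, 0.5, 0.4]]   # emotion 1, e.g. 'happy'
weights = emotion_weights(infer_lambda([0, 1, 0, 2], beta, K=2))
print(weights)  # the weight of emotion 0 dominates
```

Seeding `lam` non-uniformly here would correspond to initializing λ* from acoustic analysis, as discussed above.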

Figure 3: The five most prominent emotions of the song "My Heart Will Go On", tagged by our Emotional Allocation and by manual labeling, respectively.

Figure 4: The average precisions and recalls of the Top 1, Top 3, Top 5, and Top 10 recommendations.
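As a concrete sketch of the relevance matching of Section 3.3, Eqs. (5) and (6) can be implemented as follows. The helper names are illustrative, and the digamma approximation mirrors the one needed for Table 1.

```python
import math

def digamma(x):
    """Psi(x) = d/dx log Gamma(x), via recurrence plus asymptotic series
    (no digamma in the Python stdlib)."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))

def kl_dirichlet(lp, lq):
    """KL(Dir(lp) || Dir(lq)) between two K-dimensional Dirichlets, Eq. (5)."""
    sp, sq = sum(lp), sum(lq)
    val = math.lgamma(sp) - math.lgamma(sq)
    for p, q in zip(lp, lq):
        val += math.lgamma(q) - math.lgamma(p)
        val += (p - q) * (digamma(p) - digamma(sp))
    return val

def distance(lp, lq):
    """Symmetrized KL distance between two term collections, Eq. (6);
    a smaller distance means higher relevance."""
    return 0.5 * (kl_dirichlet(lp, lq) + kl_dirichlet(lq, lp))

# A collection matched against itself has distance 0; dissimilar
# emotion allocations yield a strictly positive distance.
print(distance([2.0, 1.0], [2.0, 1.0]))      # 0.0
print(distance([2.0, 1.0], [1.0, 2.0]) > 0)  # True
```

Recommendations for a blog post are then the songs whose λ* vectors have the smallest symmetric distance to the post's λ*.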

For manual labeling, such a weight for each emotion of each song can be approximated by averaging the tag counts over all the labelers. Fig. 3 gives an example of our estimation result on the famous song "My Heart Will Go On", compared with the labeled ground truth. From Fig. 3, it is satisfying that our result agrees with the ground truth on the first two major emotions, loving and sad, and the weights of these two emotions are also close. Moreover, the remaining three estimated emotions are also somewhat related to the song's semantics. We further measured the correlation coefficient between the weight vectors generated by our approach and by human labeling; it reaches around 0.71 on this song. The average correlation coefficient over all 100 songs is about 0.48. The emotion allocation of blogs was evaluated in the same way, with an average correlation coefficient of about 0.42. These results indicate that, for emotion allocation, there does exist a positive correlation between our results and the ground truth.

4.2 Music Recommendation

To evaluate recommendation performance, for each blog post we first merged the suggestions from all the labelers as the ground truth; on average, there are around 5.75 such suggestions per post. The labeling consensus is relatively high because of the small scale of the music collection in the experiment. Then, as introduced in Section 3.3, the distances between all the songs and blog posts are computed and sorted in ascending order for each post. The algorithm is very efficient in practice: going through all 100 songs took less than one second on a PC with a 3.2 GHz Intel Pentium 4 CPU and 1 GB of memory. Finally, the top N ranked songs are selected as recommendations, and the average recalls and precisions over all the blogs are shown in Fig. 4 for N = 1, 3, 5, and 10. For each blog post and a given N, recall and precision are defined as recall = N_c / N_s and precision = N_c / N, where N_c is the number of songs from the subjective suggestions covered in the top N candidates ranked by our approach, and N_s is the total number of subjective suggestions for that blog. Fig. 4 shows that, as N increases, precision stays relatively stable at around 45%, while recall grows from below 10% (N = 1) to above 70% (N = 10). This indicates that about half of the candidates recommended by our approach are consistent with the subjective opinions, and that when N becomes large enough, most preferred songs can be retrieved by our modeling. This is quite a promising result for the MusicSense application scenario.

5. CONCLUSIONS AND FUTURE WORK

In this paper, we have formally defined the problem of contextual music recommendation (MusicSense) and proposed a new probabilistic model, Emotional Allocation Modeling, to solve it. With this model, we can reasonably treat each song (or Weblog) as generated from a distribution over a mixture of emotions, effectively integrate knowledge discovered from a Web-scale corpus with guidance from psychological studies, and retain the inference ability of generative models. In this way, emotion acts as a bridge for relevance matching between blogs and songs. Preliminary experiments indicate that the model works effectively: both the emotion estimation and the music recommendation match subjective preferences closely. In future work, we will investigate the current implementation details more deeply to improve performance, and will also try to utilize more information besides emotion to measure the relevance between music and documents. Moreover, we will carry out more user studies to design an ideal UI for delivering contextual music recommendations.

6. REFERENCES

[1] Google AdSense. http://www.google.com/adsense/.
[2] Last.fm. http://www.last.fm/.
[3] Musicovery. http://www.musicovery.com/.
[4] Pandora Internet Radio. http://www.pandora.com/.
[5] WordNet. http://wordnet.princeton.edu/.
[6] R. A. Baeza-Yates and B. A. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.
[7] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.
[8] R. Cowie, E. Douglas-Cowie, et al. Emotion recognition in human-computer interaction. IEEE Signal Processing Magazine, 18(1):33–80, 2001.
[9] H. Cui, V. Mittal, and M. Datar. Comparative experiments on sentiment classification for online product reviews. In Proc. AAAI'06, Boston, 2006.
[10] P. Knees, T. Pohle, M. Schedl, and G. Widmer. A music search engine built upon audio-based and Web-based similarity measures. In Proc. SIGIR'07, Amsterdam, July 2007.
[11] G. Leshed and J. Kaye. Understanding how bloggers feel: recognizing affect in blog posts. In Proc. ACM CHI'06, pages 1019–1024, Montréal, April 2006.
[12] L. Lu, D. Liu, and H.-J. Zhang. Automatic mood detection and tracking of music audio signals. IEEE Trans. Audio, Speech and Language Processing, 14(1):5–18, January 2006.
[13] M. E. Tipping and C. M. Bishop. Mixtures of probabilistic principal component analyzers. Neural Computation, 11(2):443–482, 1999.