Mixing It Up: Recommending Collections of Items

Derek L. Hansen
College of Information Studies, University of Maryland, College Park, MD
[email protected]

Jennifer Golbeck
College of Information Studies, University of Maryland, College Park, MD
[email protected]
ABSTRACT
Recommender systems traditionally recommend individual items. We introduce the idea of collection recommender systems and describe a design space for them including 3 main aspects that contribute to the overall value of a collection: the value of the individual items, co-occurrence interaction effects, and order effects including placement and arrangement of items. We then describe an empirical study examining how people create mix tapes. The study found qualitative and quantitative evidence for order effects (e.g., first songs are rated higher than later songs; some songs go poorly together sequentially). We propose several ideas for research in this space, hoping to start a much longer conversation on collection recommender systems.

Author Keywords
Recommender Systems, Collections, Collection Recommender Systems, Collaborative Filtering, Playlist, Mix Tape, Music, Automatic Playlist Generation

ACM Classification Keywords
H.3.4 Information Storage and Retrieval: Systems and Software - Performance Evaluation (efficiency and effectiveness)

INTRODUCTION
A good compilation tape, like breaking up, is hard to do. You’ve got to kick off with a corker, to hold the attention...and then you’ve got to up it a notch, or cool it a notch, and you can’t have white music and black music together, unless the white music sounds like black music, and you can’t have two tracks by the same artist side by side, unless you’ve done the whole thing in pairs and...oh, there are loads of rules. - Rob (Character in High Fidelity)
As any good DJ will tell you, success is about far more than playing the most popular songs. As Rob from “High Fidelity” illustrates in the quote above, there are “loads of rules” about how to create a good mix tape: rules about what songs fit (or don’t fit) in the same collection, rules about what songs go in what position, and rules about the order songs should be placed in. These rules are important because the overall effect of a collection of songs can be quite different from the effect of each individual song on its own.

Mix tapes are hardly unique in this way. Financial investments should be evaluated as part of an investment portfolio. A necktie only looks good when placed next to a matching shirt and pair of trousers. The style of a painting may be highlighted when placed next to a complementary painting in an exhibit. Articles, photographs, recipes, software configuration choices, events, and even people can be greatly affected by the other “items” they are grouped with.

Despite our recognition of the importance of collections, little research has explored how to effectively recommend collections of items. Nearly all recommender systems focus exclusively on recommending individual items. Even when recommended items are displayed as part of a group (e.g., Amazon shows 5 recommended books at a time), the items are typically not selected in a way that considers their relationship to one another. Currently, most meaningful collections are created by hand (e.g., Amazon’s Listmania! collections) without any support from recommender systems. This isn’t all bad. The act of creating some types of collections, such as mix tapes, can be an important form of self-expression [16] that is unlikely to be replaced by algorithms. However, not all collections need the same tender love and care. A system that recommends a collection of compatible software configuration choices is hardly an expression of one’s identity (except for, perhaps, a few of the geekiest among us). Even in cases where collections require the human touch, recommender systems may augment human judgment and make the creation process more fun.

This paper is an invitation to the research community to develop effective methods for recommending collections of items. For us, a collection recommender system is a system that recommends compilations of 2 or more items, where each compilation is assessed as a whole.
We begin by reviewing the relatively sparse research on systems that recommend collections of items. We then describe a design space for recommending collections of items and map a few examples of hypothetical collection recommender systems onto the design space. This is followed by our presentation of an empirical study examining how people create mix tapes and what this means for collection recommender systems. Finally, we provide several ideas on future areas of research related to collection recommender systems.

PRIOR RESEARCH
Recommender system research has advanced at a staggering pace over the past decade and a half, with thousands of articles written on the topic. This is partly driven by the successful use of recommender systems by companies such as Amazon, Netflix, Pandora, and TiVo. Research has explored novel recommendation algorithms including item-based, collaborative filtering, and hybrid algorithms [1], methods for evaluating recommendations [7], user interface designs [10], and incorporating preferences of groups of people (e.g., [11]). These have been explored in a variety of domains ranging from entertainment goods to vacation packages to social networks [6]. Despite the impressive volume and variety of recommender system research [1], nearly all of it is based on a common, yet often unspoken assumption: recommender systems are about recommending individual items.

Despite this overarching trend, there have been a handful of articles that consider various aspects of recommending collections of items, although not in a comprehensive framework. Several researchers have noticed that top-N recommender algorithms produce results that are often too similar to one another for a user’s liking [2, 3, 17]. For example, all 5 Amazon book recommendations may be authored by the same person. Although each item has a high probability of being liked by the user of the system, it may not satisfy the user’s desire to be exposed to new material. Ali and van Stam call this the “portfolio effect” and argue that we should develop methods to maximize the value of the entire portfolio that is being presented, not each item [2]. Ziegler et al. flesh out this idea in the context of recommending books [17]. They develop an algorithm that balances diversity (based on book classification information) with accuracy, as well as a metric for measuring intra-list similarity [17]. Their user study found that item-based algorithms benefited from a small boost in diversity, while user-based algorithms did not. In summary, their study provided “empirical evidence that lists are more than mere aggregations of single recommendations” because they “bear an intrinsic, added value” ([17], p. 31). This important finding argues for the need to study collections in their own right, the focus of the current article. An important step in this direction is the recent development of a language, DDPREF, for expressing set-based (i.e., collection-based) preferences, as well as a technique for learning these preferences from prior sets [5].
Combining these strategies with recommender system approaches such as collaborative filtering seems particularly promising for the development of collection recommender systems.

A growing body of literature examines the automatic creation of collections, particularly in the music domain where automatic playlist generators such as iTunes Genius, The Filter, and MusicIP have already hit the mainstream. In these tools, the user generally provides a seed song and the system produces a playlist of related songs from the user’s collection. These systems typically base their playlists on content-based approaches that measure the similarity of songs on various dimensions (e.g., rhythm, genre, artist) (e.g., [12]) (see ISMIR proceedings for other articles). Most often, these playlists mirror the top-N algorithms discussed above, showing the top 10 or 20 similar songs, perhaps with a few of the most dissimilar songs thrown in at the end of the list (e.g., [9]). More advanced systems like PATS try to balance a desire for coherence (i.e., similarity of songs) and variation (diversity of songs) in different contexts-of-use by assuring that the same song is not recommended multiple times [13]. Results from a user study of PATS showed that it outperformed randomly assembled playlists [13]. A user study of an automatic playlist generator running on a mobile device showed that there was significant interest in such tools and that there is a need to group or spread out songs that are overly similar (e.g., from the same artist) [9], a parallel argument to the one made by Ziegler et al. that songs presented sequentially should be diverse [17].

Users have complained that automatic playlist generators remove the fun of creating playlists and don’t provide enough possibilities for customizing playlists [9]. One approach to overcome this problem is to create a semi-automatic playlist generator such as SatisFly [14]. SatisFly guides users through the process of building playlists by helping them specify constraints such as the number of songs, variety of genres and artists, tempo, etc. As the user defines rules, the system uses constraint satisfaction to produce a list of songs that meet the user’s requirements for which song should come next. Thus, it explicitly recognizes that selecting the right songs in the right order is critical to making a good mix. The authors also distinguish between unitary (single item placement), binary (item-item interactions), and global constraints (entire collection interactions), ideas that could easily be applied to other types of collection recommender systems. More generally, we believe there is a need to develop tools like SatisFly that augment human creation of collections, rather than replace it. Designing appropriate user interfaces and social practices around the use of these systems, issues central to the HCI community, is particularly important in such endeavors.

As designers, we should explicitly recognize the highly social nature of collecting and sharing music [16, 4]. Empirical studies of user interaction with music collections
can inform the design of music-related collection recommender systems, the focus of our user study. While space does not permit a comprehensive literature review, we highlight a few important points. Lee and Downie demonstrated the popularity of creating music collections when they found that 89% of those surveyed search for music in order to build a collection, with over 60% doing so 2 or more times a month [8]. Cunningham et al. study how people organize their personal music collections, finding that music is organized by its intended use (reminiscent of the context-of-use suggested by [13]), as well as other factors such as genre, performer, etc. [4]. A study by Voida et al. examining sharing of iTunes libraries found that people were interested in learning about new music from others, but only those they had enough in common with, suggesting the need for “increased scaffolding for the exploration of new music” [16]. They describe a music sharing design space with two dimensions: one ranging from intimacy to anonymity and another ranging from shared musical interests to divergent musical interests. Like iTunes music sharing, collection recommender systems can be used to cover more of this space, allowing for the creation of mix tapes not only by intimate friends with shared musical interests (Figure 2 in [16], p. 193). Although these and similar studies can provide some design guidance for collection recommender systems, they have not focused enough on the mechanics of how small collections such as mix tapes are created - information that is essential to the development of successful mix tape recommender systems and the focus of our user study.
DESIGN SPACE FOR RECOMMENDING COLLECTIONS

To recommend collections of items, we must have a way of evaluating and comparing collections, not just items. In other words, we need some value function V(S) that maps information about the set of items S in the collection (e.g., ratings; item interaction effects) onto a single value. The functional form will vary depending on the nature of the data and the interactions of items. Rather than provide specific functions in this paper, we describe some analytically distinct aspects that can be captured in various value functions. We take this general approach to spur researchers from a variety of backgrounds (e.g., optimization, artificial intelligence, economics) to consider the problems raised by recommending collections of items. In order to discuss collection recommender systems generally, we have purposefully avoided the inclusion of contextual factors in the design space. These will clearly be important depending upon the domain of interest, as evidenced by the success of systems that make recommendations based on the context-of-use (e.g., listening to music while working out versus at a dance) [13, 1].

In this section, we first describe 3 key aspects that can affect the value function. Although the aspects we discuss are analytically distinct, they may be inseparable in a particular functional form. The 3 aspects we discuss are: 1) individual item values, 2) co-occurrence interaction effects, and 3) order interaction effects. We then describe the distinction between global and local recommender algorithms and how they relate to the 3 key aspects. Finally, we describe some hypothetical recommender systems for collections and how they relate to the overall framework.

1. Individual Item Values
Collections are made up of individual items. A collection recommender system will typically consider the characteristics and ratings of the included items. For example, a mix tape recommender system may consider each song’s length, tempo, genre, etc. Collection recommender systems may also consider user preferences for items as revealed explicitly through ratings, implicitly through past behavior, or as predicted by a traditional recommender system. For example, a mix tape recommender system may consider how an individual has rated certain songs, which songs have been purchased in the past, and/or how likely it is that the user will like an “unknown” song based on a collaborative filtering algorithm. The value function typically ascribes a value to each of the items in a collection based on each item’s characteristics and/or ratings. We would expect a mix tape with many highly rated songs to outperform a mix tape with many poorly rated songs (as demonstrated by [13]). Thus, a simple value function may sum up the individual item ratings to get the collection’s aggregate value. A more complex algorithm may weight each item’s value based on some other criteria (e.g., genre; ratings by “similar” users).

In addition to ascribing a value to each item in a collection, a collection recommender may limit what items are included in collections. For example, a playlist recommender system may limit the playing time to 1 hour. This global constraint would limit the possible combinations of items whose value can be included in the aggregate value function.

2. Co-occurrence Interaction Effects
A collection recommender system may consider the interdependencies of individual items. Items may have positive or negative interaction effects when they co-occur with other items. Positive interaction effects are “synergies” where the aggregate value of a group of items is greater than the sum of the individual values. Negative interaction effects occur when the aggregate value of two or more items is less than the sum of the individual items. Imagine a system that recommends outfits (i.e., collections of clothing and accessories). If a particular tie and shirt look good together, you would have a positive interaction effect that may turn two mediocre items into an outstanding collection. Likewise, if a tie clashes with a shirt, you may turn two individually outstanding items into a poor collection.
Co-occurrence interaction effects need not be restricted to item-item (i.e., pairwise) dependencies, as was the case in our tie and shirt example. The value function may encapsulate some overall guiding principles (i.e., goals) or global constraints that create dependencies between items [13, 5]. For example, a system that recommends a collection of books for a book club may balance two potentially competing goals: include highly rated books (goal #1) and maximize the diversity of books (goal #2). To help meet the second goal, the value function may ascribe a higher value to collections that span more genres than to comparably rated collections drawn from a single genre. The co-occurrence interaction happens because a given book’s value changes based on its relation to the other books in the collection (i.e., it is of a different genre than the other books). In this scenario the interaction effects are created because the value function favors overall diversity, not because two books fit particularly well together. Indeed, one could imagine a system that considered both aggregate and item-item interaction effects by favoring overall diversity, but making exceptions for books in the same trilogy.

3. Order Interaction Effects
A collection recommender system may consider the order of the individual items, an area that is particularly understudied. Some collections, such as a music album, a book of edited articles, or a seminar series, may have sequential importance. Listening to the Beatles’ “Abbey Road” on “shuffle” mode (i.e., in random order) is not nearly as enjoyable as listening to it in the original order. Including a summary article as the first chapter in a book may make later chapters more understandable. In these examples, the overall collection value is influenced by the order in which items occur. Two types of ordering effects exist. The relative order of items may be important: for example, Song B following Song A may provide higher value than Song A following Song B. Or, the placement position of an item may be important: for example, Song A is a great “starter song” for an album. Combinations of these order effects are also possible. For example, Song B follows nicely after Song A, but only if Song A is the first song on the album.

Local vs. Global Recommendations
Some systems, such as Amazon’s “Customers who bought this item also bought” feature, make global recommendations that do not differ for individuals. When making global recommendations, the recommendation algorithm and its inputs and outputs are identical for all individuals. In contrast, systems such as collaborative filtering systems (e.g., MovieLens) provide local (i.e., personalized) recommendations tailored to an individual or subset of the group at large. Outputs of these recommendation algorithms differ for different individuals or subgroups based on differences in algorithm inputs (i.e., personal ratings), the algorithm itself, or both.
In collection recommender systems, local and global recommendations may apply to any of the 3 aspects discussed above. Table 1 shows a matrix of the possible combinations with examples in each cell. A great deal of research has explored algorithms that provide global and local recommendations for individual items (shown in row 1 of Table 1). However, little research has explored global and local algorithms that deal with interaction effects (rows 2 and 3 in Table 1). Indeed, for most domains of interest, our understanding of how people create and evaluate collections is so minimal that we do not yet know whether global or local algorithms would be more effective.

A single recommender system need not fit into a single cell in the table. A recommender system could generate a set of items to be included in a collection using a Local-Individual Item Values algorithm, but then use a Global algorithm to decide what order to place the items in. Notice that each of the cells in the table requires different data. For example, a Global-Order Interaction Effect algorithm could be based on a relatively small sample of expert-created collections that have been ordered. In contrast, a Local-Order Interaction Effect algorithm may require that each person order enough collections to accurately identify their type (i.e., match them up with others who order things similarly). Alternatively, the system could identify a person’s type by some other criteria, such as their answers to a quiz or their demographic characteristics.
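To make the design space more concrete, the sketch below shows one way the three aspects could be folded into a single value function V(S). It is a minimal illustration rather than an algorithm from this paper: the input structures and their names (item_value, pair_bonus, order_bonus) are hypothetical placeholders for whatever item ratings, co-occurrence scores, and ordering rules a particular system learns or elicits.

```python
from itertools import combinations

def collection_value(items, item_value, pair_bonus=None, order_bonus=None):
    """Illustrative value function V(S) for an ordered candidate collection.

    items       -- ordered list of item identifiers (the candidate collection S)
    item_value  -- dict: item -> predicted or observed rating (aspect 1)
    pair_bonus  -- dict: frozenset({a, b}) -> co-occurrence adjustment,
                   positive for synergies, negative for clashes (aspect 2)
    order_bonus -- function (position, item, previous_item) -> adjustment
                   capturing placement and sequence effects (aspect 3)
    """
    # Aspect 1: sum of individual item values
    value = sum(item_value.get(i, 0.0) for i in items)

    # Aspect 2: pairwise co-occurrence interaction effects
    if pair_bonus:
        for a, b in combinations(items, 2):
            value += pair_bonus.get(frozenset((a, b)), 0.0)

    # Aspect 3: placement and relative-order interaction effects
    if order_bonus:
        prev = None
        for pos, item in enumerate(items):
            value += order_bonus(pos, item, prev)
            prev = item

    return value
```

A recommender built around such a function would search over candidate collections, subject to global constraints such as total playing time, for the set and ordering that maximize V(S). Local recommendations correspond to learning the inputs per user or user type; global recommendations correspond to learning them once for everyone.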
Collection Recommender System Examples
The examples below show how the framework discussed above applies to various types of hypothetical collection recommender systems.

Mix Tapes / Playlists
1. Individual Item Values: A mix tape recommender system would likely choose songs based on a local algorithm, since music taste is highly personal. A global constraint on the number of songs (or mix tape length) may be imposed.
2. Co-occurrence Interaction Effects: Certain mix tapes may favor songs of a similar genre (or artist), while others may favor higher diversity.
3. Order Interaction Effects: Some songs may be good beginning or ending songs. Alternating between fast tempo and slow tempo may be important, as well as not including two songs in a row by the same artist.
Table 1. Global vs. Local Matrix

1. Individual Item Values
Global: Item values are the same for everyone. E.g., item values are predicted based on item-based filtering such as occurs with Amazon’s “Customers who bought this item also bought” feature.
Local: Item values differ based on individuals or people types. E.g., item values are predicted based on a collaborative filtering algorithm.

2. Co-occurrence Interaction Effects
Global: Co-occurrence interactions are the same for everyone. E.g., cowboy boots and baggy trousers don’t go together.
Local: Co-occurrence interactions differ based on individuals or people types. E.g., cowboy boots and baggy trousers go together for some people, but not for others.

3. Order Interaction Effects
Global: Order interactions are the same for everyone. E.g., everyone likes to start a mix tape with a fast song.
Local: Order interactions differ based on individuals or people types. E.g., some people like to start a mix tape with a fast song; others like to start with a slow song.
Collection of Readings (e.g., Book of Edited Articles; Books to be Read by a Book Group)
1. Individual Item Values: Readings may be recommended using a global or a local algorithm. The total number of readings, the price of readings, and the source of readings may be constraints.
2. Co-occurrence Interaction Effects: A global algorithm may be used with the goal of balancing several factors such as popularity, rating, diversity, and recency.
3. Order Interaction Effects: A global algorithm for ordering the readings may be useful in some cases (e.g., book of edited articles), while ordering may not be important in other cases (e.g., book group). The order of articles may be important when an article builds upon (i.e., cites) a prior article.

Apparel
1. Individual Item Values: Since taste is highly personal, a local algorithm is likely preferable. The overall collection of items (e.g., clothes, shoes, accessories) may be limited by total cost.
2. Co-occurrence Interaction Effects: A large set of individual item-item interactions may be used to determine what items look good (or bad) together. While a global algorithm would likely work for this, a local algorithm may make sense (i.e., some people may like red and orange together, while others may hate it).
3. Order Interaction Effects: Not applicable.

We have considered several other collection recommender systems, including ones that recommend investment portfolios, project teams, photo albums, recipes, software configurations, and events. Due to space limitations we do not include our preliminary analysis of them here. However, even the few examples that we provide demonstrate a few important points. First, not all of the 3 aspects apply to every collection (e.g., order effects don’t matter when selecting an outfit to purchase). Second, it is not always clear whether a global or a local algorithm is best for the various aspects, especially the interaction aspects. Only empirical studies can help us know.
Finally, although we have separated out the three aspects for understanding, in practice a single value function will combine all of the aspects and balance all of the competing demands.

CASE STUDY: SONG ORDER IN MIX TAPES
The order interaction effects outlined in the design space are poorly understood. To shed light on this topic we examined the creation of mix tapes (i.e., playlists) by users. There are a number of public websites where users can create and share mixes (e.g., Mixwit, http://mixwit.com). While these data provide insights into how users create mixes with total freedom in an uncontrolled environment, the sparsity of the data makes it difficult to draw meaningful conclusions about ordering interaction effects. To investigate factors that affect the order of songs in a playlist, we conducted a controlled study that limited the number of songs that could be included in a mix; we also performed a qualitative analysis of responses to open-ended questions about mix tape creation. The experiment was conducted online and a convenience sample of subjects was recruited primarily from the University of Maryland, College Park community. In this section, we present the details of the experiment and our subjects.

We tested three main hypotheses about how users would order songs:
H1: Some songs will be significantly more likely to appear first or last in a mix.
H2: The user’s preference for a song will affect its position.
H2a: Songs a user likes more will appear closer to the beginning of a mix.
H2b: Songs a user likes more will appear in the first and/or last positions of a mix.
H3: Some song pairings will be considerably more (or less) common than other song pairings.

Methodology
Subjects first completed a demographic survey covering their age, gender, education level, and music habits.
Next, they were shown two sample mixes and a general description of what a mix is. Participants were then given instructions on how to use the experimental interface and unlimited time to practice dragging, dropping, and changing the order of songs.

Participants then created up to 4 mix tapes (i.e., mixes). To do this, each participant was shown 4 different song lists, one at a time (see Table 2 for an example). Each list had 15 possible songs, only 10 of which could be included in a mix. Both the order of the 4 song lists and the order of the 15 songs within each list were randomized to prevent any ordering bias. A Pearson correlation coefficient shows no correlation between the initial ordering of songs and the final order (ρ = 0.03). Each song included a link to iTunes and Amazon.com where the subjects could listen to a preview, although not all songs were available. To create each mix tape, subjects dragged and dropped exactly 10 songs from the 15 potential songs into a box. Songs could be rearranged or removed. If subjects were unfamiliar with many of the songs, they could skip the list entirely and move on to the next one. The interface is shown in Figure 1. We recorded which songs were chosen, their order, and the amount of time it took to complete each mix tape.

After completing the mixes, subjects assigned a 1 to 5 star rating to all 60 potential songs or indicated that they did not know the song. Finally, participants answered qualitative questions including: “In your opinion, what makes a good mix tape/CD?”, “What things do you consider when making a mix tape/CD?”, and “Did you encounter any technical problems while making the mix tapes? If so, please specify.” Most people did not mention any technical problems. Some people mentioned that certain songs were not found through iTunes or Amazon; a few, but not all, of those people mentioned finding the songs elsewhere. This may have made some songs less likely to be included in mixes; however, we do not expect that it changed our primary results, which concern song order rather than song selection.
Figure 1. This shows the interface for creating a mixtape in our experiment. Users drag and drop songs from the right into the “My Mix Tape” box on the left.
Table 2. One of the four song lists from which subjects created mix tapes.
Song List 3
Nat King Cole - The Christmas Song
Dean Martin - Winter Wonderland
Eartha Kitt - Santa Baby
Bing Crosby - White Christmas
Frank Sinatra - Let It Snow
Bing Crosby - Mele Kalikimaka
Perry Como - Home For The Holidays
Boston Pops - Sleigh Ride
Elvis Presley - Blue Christmas
Bobby Helms - Jingle Bell Rock
Brenda Lee - Rockin’ Around The Christmas Tree
Burl Ives - Holly Jolly Christmas
Boris Karloff - Whoville Carols
Judy Garland - Have Yourself a Merry Little Christmas
Barenaked Ladies with Sarah McLachlan - God Rest Ye Merry Gentlemen / We Three Kings
Subjects
We had 52 subjects who completed the study, 62% of whom were Female. Their average age was 29, with a standard deviation of 7.5. Their education levels varied. There were 5 PhDs, 10 with a Master’s Degree, 23 with a Bachelor’s degree, and 14 with less than a Bachelor’s degree.
Responses to questions about musical taste and habits are presented in Table 3. Questions are worded as in the questionnaire, although the final question on music genre taste only shows the most popular 4 out of 23 possible music genres. Three of our song lists included songs from the Rock, Alternative, and Pop genres (see Table 3), suggesting that these were familiar genres to most of our subjects.

Data Completion
Among the 208 possible mixes (52 subjects * 4 song lists each), 33 were skipped and another 5 were excluded because a majority of the included songs were later classified as unknown by their creators, making their ordering data suspect. This resulted in 170 complete mixes. Most users (30) completed all 4 mixes, 11 completed 3 mixes, 6 completed 2 mixes, and 5 completed just 1 mix. Each of the 4 song lists was used either 42 or 43 times.
From the 3,120 song ratings (52 subjects * 60 songs), participants chose “Don’t Know” for a song 449 times (an average of 8.6 songs per subject). Only 45 of these unknown songs were included in mixes (2.6% of all 1,700 included songs), suggesting that in nearly all 170 mixes participants knew all of the songs they were including (either from prior knowledge or after listening to a sample on iTunes, Amazon, or elsewhere). The average song rating of all known songs was 3.4 with a standard deviation of 1.2.
Table 3. Subject Music Experience (n=52)
How often do you listen to music a day (on average)?
  Never: 0;  Less than 1 hour: 17;  Between 1 hour and 3 hours: 21;  Over 3 hours: 14
Approximately how many mix tapes/CDs have you created in your lifetime?
  None: 3;  Under 10: 8;  Between 10 and 50: 27;  More than 50: 14
How often do you use online music websites or software, like iTunes, Mixwit.com, ArtOfTheMix.org, etc.?
  Never: 5;  Less than once a month on average: 11;  2-4 times a month on average: 6;  5 or more times a month on average: 30
What types of music do you like? (top 4 of 23 genres shown)
  Rock: 47;  Classical: 39;  Alternative: 37;  Pop: 37
Timing
Subjects spent an average of 225 seconds making a mix (stdev 192). This excluded mixes that users chose to skip. Note that we also excluded an outlier of a user who took approximately 10 hours to complete a mix, a case where he or she likely was interrupted while completing the task. There was no statistically significant difference between the time it took to complete each of the 4 mixes.

EXPERIMENTAL RESULTS
H1: First and Last Songs
Subjects often commented that song order was important to them, particularly choosing the first and last song (see qualitative section below). We tested whether certain songs were particularly well suited for those important positions. The most common first song appeared in that position 21% of the time. The most common last song appeared in that position 26% of the time. While these numbers may suggest characteristic first and last songs at first glance, further examination of the data shows this distribution is found for many positions. For example, the most common third song appeared in that position 26% of the time. Three other positions saw their most common song appear 19%-21% of the time and the lowest positions saw their most common song appear 14% of the time. If the frequency
with which songs appear in a given rank follows a normal distribution, we would expect one song to show up more often than others simply because it is at the end of the tail. We did not perform a statistical test to determine whether the percentages for the first, third, and last positions were significantly higher than those of other song positions, but it seems unlikely that there is a strong difference. Thus, we did not find strong support for the hypothesis that one song is particularly suited to a first or last rank, although further work with a larger dataset may confirm such a hypothesis.
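The per-position analysis above can be reproduced in outline with a few lines of code. The sketch below is illustrative only and is not the analysis code used in the study; it assumes the completed mixes are available as equal-length ordered lists of song identifiers.

```python
from collections import Counter

def modal_song_share_by_position(mixes):
    """For each rank, the share of mixes in which that rank's single most
    common song actually occupies it (e.g., 0.21 for the first position).
    mixes: list of ordered song lists, all of the same length."""
    shares = {}
    for pos in range(len(mixes[0])):
        counts = Counter(mix[pos] for mix in mixes)
        _, freq = counts.most_common(1)[0]
        shares[pos + 1] = freq / len(mixes)
    return shares
```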
H2: Position Rank and Song Preference
We first considered whether or not the subject’s rating of a song would be correlated with its position in a mix, e.g., do subjects place songs they like closer to the beginning or end? The Pearson correlation between position (first, second, ..., last) and rating was weak (-0.12). This suggests that the higher the rating of a song, the closer it is to the beginning of the mix. However, this small correlation is explained by the higher rating of first songs, as described below. Our results suggest that order and position overall are not as important as the first position.

Table 4 shows the average rating of songs in each position. An ANOVA test shows significant differences in the average ratings for each rank (F(9,1690) = 6.640, p < 0.01). A Student’s t-test shows that the average rating for songs that appeared first was significantly higher than the average rating for any other rank. Also interesting is that there was no difference in the rating of songs among the rest of the ranks: an ANOVA for all positions excluding the first songs showed no significant difference in the average song rating (F(8,1521) = 1.739, p = 0.09). Thus, songs in the final position are not rated higher than songs in the other positions.

Not surprisingly, subjects’ mixes tended to leave off songs they did not like. They also left out songs they did not know. Each song list contained 15 songs, and subjects had to choose 10 for the mix. Songs that were included had an average rating of 3.47 while songs that were excluded had an average rating of 3.30. A Student’s t-test shows this difference was significant at p < 0.01. These averages were computed over songs that the subjects knew and assigned ratings to. Of the 2,550 songs (170 mixes * 15 potential songs) that could have been included, 238 were rated as “Unknown.” Of these, only 45 songs (19%) were added to mixes, much lower than the expected 159 songs (66.7% of all 238 unknown songs) that would have been included had the unknown songs been randomly inserted. This finding does not suggest that people purposefully avoided new music when making mix tapes (see qualitative section below). Many of the rated songs included in the mixes had been recently listened to (and likely learned) by participants on iTunes or Amazon.
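The statistical tests reported above can be sketched with standard routines. The code below is illustrative and not the authors' analysis: it assumes the ratings are grouped into one list per rank, and it pools all non-first ranks for the t-test, which simplifies the per-rank comparisons described in the text.

```python
from scipy import stats

def position_rating_tests(ratings_by_position):
    """ratings_by_position: list of 10 lists, the ratings of songs placed
    at each rank (position 1 first)."""
    # One-way ANOVA across all ranks
    f_all, p_all = stats.f_oneway(*ratings_by_position)
    # First-position ratings vs. all other ratings (pooled)
    rest = [r for ranks in ratings_by_position[1:] for r in ranks]
    t_first, p_first = stats.ttest_ind(ratings_by_position[0], rest)
    # ANOVA excluding the first position
    f_rest, p_rest = stats.f_oneway(*ratings_by_position[1:])
    return {"anova_all": (f_all, p_all),
            "first_vs_rest": (t_first, p_first),
            "anova_excluding_first": (f_rest, p_rest)}
```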
Table 4. Average rating of songs based on their rank in the mix

Song Rank:       1     2     3     4     5     6     7     8     9     10
Average Rating: 4.32  3.87  3.95  3.77  3.72  3.79  3.62  3.66  3.72  3.86
Unknown songs were included only once in the 1st position, twice in the 4th position, 3 times in the 3rd and last position, and between 4 and 8 times for the other positions. This suggests that unknown songs are less likely to be selected and also less likely to be included in the important first position.

H3: Song Pairings
We hypothesized that some songs would work better than others as pairs that are adjacent to one another in the mix. There are 15 songs to choose from for each mix and subjects selected 10 songs. This leads to 105 possible pairs regardless of order (6 + 7 + ... + 15 = 105). If all songs were equally likely to be chosen, the probability for each pair would be given by (1/15) * (1/14) * 2. However, songs are not equally likely to be chosen. Thus, the probability P of seeing a pair (A, B) is given by P(A) * P(B) * 2, where P(A) represents the probability that song A would be chosen at a given position, based on the frequency at which song A appears in mixes. We computed these probabilities for each pair and compared them to the observed frequencies for each pair.

Several pairs appeared at least 2 times more often than expected. The most frequent combination - “Winter Wonderland” by Dean Martin and “Let It Snow” by Frank Sinatra from Song List 3 - appeared adjacent to one another in 12 of the 43 mixes, more than twice the expected 5.4 occurrences. On the other end, a number of song pairs never appeared adjacent to one another, even though they were often included in the same mixes. An example was “Santa Baby” by Eartha Kitt and “White Christmas” by Bing Crosby from Song List 3. Our theoretical probability calculation suggests that the pair would be adjacent an average of 3.2 times; however, the two songs never appeared next to each other.
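The comparison of observed and expected adjacent pairs can be sketched as follows. This is an illustration, not the study's analysis code: in particular, the estimate of P(A) used here (song A's share of all filled mix slots) is an assumption, since the exact estimator is not spelled out above.

```python
from collections import Counter

def observed_adjacent_pairs(mixes):
    """Counts of unordered adjacent song pairs across all mixes.
    mixes: list of ordered song lists."""
    observed = Counter()
    for mix in mixes:
        for a, b in zip(mix, mix[1:]):
            observed[frozenset((a, b))] += 1
    return observed

def expected_adjacent_count(mixes, song_a, song_b):
    """Expected adjacent-pair count under the independence approximation
    P(A) * P(B) * 2, summed over all adjacent slots in all mixes."""
    slot_counts = Counter(song for mix in mixes for song in mix)
    total_slots = sum(len(mix) for mix in mixes)
    adjacent_slots = sum(len(mix) - 1 for mix in mixes)
    p_a = slot_counts[song_a] / total_slots
    p_b = slot_counts[song_b] / total_slots
    return p_a * p_b * 2 * adjacent_slots
```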
Qualitative Responses
We asked participants two open-ended questions to get a feel for the variables that people consider when evaluating and creating mix tapes. Forty-nine of the 52 participants (94%) responded to the following questions: “In your opinion, what makes a good mix tape/CD?” and “What things do you consider when making a mix tape/CD?” Answers were independently coded into various categories by two raters. Their agreement was acceptable (Cohen’s Kappa values ranged from 0.60 to 0.75). Answers from the two questions were combined for analysis. Comments are organized below according to the design space discussed earlier in the paper.
1. Individual Item Values. Over half of the respondents mentioned that mix tapes must include songs the listener likes. They used phrases like “favorite songs,” “songs I like to hear repeatedly,” “familiar songs,” “no boring songs,” and “songs you know all the words to.” Many of these comments considered this the most important evaluation criterion (e.g., “most importantly, it has to be songs I like”), although a couple of people were open to having songs of lower quality (“They don’t have to be all favourites, and in fact, it helps to have things that aren’t top favourites, because it refreshes the palate, and reminds us of other songs we like–and stops our attention from drifting.”). Importantly, approximately 40% of participants talked about making mix tapes for other people. Those that did so often mentioned that songs should be ones that the intended audience would like. Several individuals expressed a desire to be exposed to new songs (e.g., “if someone has made it for me, it will hopefully be songs that i haven’t heard before, introducing me to new bands.”). This is significant because recommender systems may be able to mix in songs that are unknown, but have a high probability of being liked.

2. Co-occurrence Interaction Effects. Approximately 70% of participants mentioned that mix tape songs should include some common theme. Only a few people explicitly mentioned that some mix tapes don’t need a common theme. Themes discussed by participants related to a particular mood, activity (e.g., exercise, long drive), time period, a certain “sound”, genre, beat, place, person, or story. Approximately one third of participants mentioned that they want variety. These two were not mutually exclusive, as evidenced by the following quotes describing what makes a good mix tape: “a good variety of songs that still holds a theme together, usually in subject, but also in sound”; “One that balances variety with not trying to be all things to all people. There must be some specificity, or else I wonder what the hell these songs are all doing on the same CD/playlist/tape together.” A few people also mentioned item-item interaction effects. Some mention positive interaction effects (e.g., “certain artists I like to put together”), while others describe the need to avoid negative interaction effects (e.g., mixing “Rock/Punk” with a “country love song”; avoiding songs that “throw the listener out of the mood”).

3. Order Interaction Effects. A full two thirds of participants mentioned order effects as important when creating a good mix tape. They discuss how songs “transition”, “blend”, or “flow” into one another (e.g., a mix tape should have “songs in just that right order where the songs flow into one another and are not too jarring
or too random.”). Participants describe a wide range of strategies for ordering songs, including “spreading the best songs in the mix throughout,” building up to a “climax” and then “taking you back down,” and “not putting too many slow/fast songs together.” Some people mentioned that rhythm and beat were important to successful transitions, suggesting that content-based systems such as Pandora may be able to use their Music Genome information to help order songs in addition to recommending them. A few participants agreed with Rob of High Fidelity that two songs by the same artist shouldn’t be included next to one another. Many people mentioned first and last songs being of special importance (e.g., “interesting opener, strong closer”; “start with something uptempo” and “end a cd with something meaningful”). The wide range of strategies described suggests that there may be different “types” of people who like to order songs differently, making a Local-ordering algorithm potentially more beneficial than a Global-ordering algorithm, although some commonalities seemed fairly persistent (e.g., the importance of the first/last songs).

CONCLUSION AND DISCUSSION
The idea of developing systems that recommend collections of items is not new. As discussed, it has already hit mainstream markets through tools like Apple’s Genius playlist generation software. However, current methods for generating collections are clearly in their infancy, with much room for improvement. Current systems fail to account for many of the variables that empirical studies, such as the mix tape study we describe in this paper, have demonstrated are important. Efforts to further our understanding of collection recommender systems are scattered across research communities and talked about using different terminology. This makes it hard to transfer knowledge or even assess the amount of progress we are making, let alone coordinate our efforts in any meaningful way. We hope this paper will start a much longer conversation about the nature of collection recommender systems and methods for improving them.

We have outlined a preliminary design space for collection recommender systems and provided examples of how music playlists and other collections can map onto that space. Specifically, we argue that three main aspects influence the overall value of a collection: (1) the value of the individual items that make up the collection, (2) co-occurrence interaction effects (including item-item and global interaction effects), and (3) order interaction effects including placement and sequential arrangement of items. Different recommendation algorithms (e.g., local versus global) can be used for each aspect, depending on the nature of the collection. For example, it is feasible to consider local algorithms that recommend song order based on the behavior of other “collection builders” who order songs like you do. We hope this characterization inspires researchers to consider aspects of
collection recommendations that they would not otherwise have considered, as well as providing some common ground for researchers coming at this problem from different perspectives.

Our experiment provided quantitative and qualitative evidence that a collection is truly more than the items that make it up. We focused on identifying order interaction effects since they have not been explored sufficiently in prior research. Users self-reported many of them, including the importance of choosing good first and last songs, the flow of the songs on the mix, the interaction between songs, and how much they like the songs. The controlled results showed that first songs were rated higher than songs in other positions. The pairwise interaction between songs was also important: some pairs appeared together many times more often than would be expected, while other pairs never occurred despite a probabilistic expectation that they would be together several times. While we did not find strong evidence that specific songs were ideal first or last songs, this is likely because the choice depends on the user and is more a local preference than a global feature. It may also be that we had collections that included many equally good “starter” songs, making it less likely that any one of them would make it into the sole “starter” position. This highlights the complex nature of the interaction effects that can arise in collection recommender systems. Rob from High Fidelity, of course, would not be surprised that we were able to identify “loads of rules” about creating playlists. Of course, this was only a first attempt to understand some of the “rules” important to the creation of playlists, and collections more generally. Future studies may look at qualities of the music itself to determine placement effects (e.g., last songs may fade out gradually or end with a bang) or order effects (e.g., songs in the same key may go better together). They may also include more individuals in order to identify different “types” of users.

Opportunities for future work on collection recommender systems are abundant. Clearly, new algorithms are needed that take into consideration the various interaction effects unique to collections. The high complexity of collection-based problems makes efficient algorithm development particularly challenging. Algorithms that take into consideration the order of items need to be adapted to a recommender system framework. In the realm of music playlists, one promising strategy would be to identify the musical characteristics [15] of songs found in collections created by humans in order to identify underlying “rules” about song order and placement. Finally, strategies and functions that balance competing goals must be developed (e.g., [17]). But algorithm development is really only the beginning. Preference elicitation, evaluation techniques, and user interface design all change dramatically when dealing
with collections rather than items. For example, some algorithms will require that we elicit data from users on how they order items or which items don’t fit well together. We may consider when to collect preferences on individual items versus entire collections and how they relate to one another. New evaluation metrics may be needed. And we will certainly need new user interfaces that allow us to effectively compare, mash up, and rate collections.

In closing, it is important to recognize that collection recommender system research can lead to different types of applications. Automatic collection generation is an important application, but not necessarily the end goal. Creating a collection can be an important form of self-expression and a social act that would lose its meaning if completely automated. Thus, tools that help humans create collections [14] may be more useful, especially if they rely on useful collection recommender system algorithms. Furthermore, collection recommender systems need not always recommend entire collections. Often, what is needed is a single song to insert into a nearly complete mix tape, or a necktie to go with an otherwise complete outfit. Collection recommender algorithms can be used to recommend the most desirable and compatible item (or handful of items) to be added to an existing collection. These and other applications have great potential to improve recommender systems in a host of domains.

REFERENCES
1. Adomavicius, G., and Tuzhilin, A. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering (2005), 734–749.
2. Ali, K., and van Stam, W. TiVo: Making show recommendations using a distributed collaborative filtering architecture. In KDD ’04 (New York, NY, USA, 2004), ACM, pp. 394–401.
3. Bradley, K., and Smyth, B. Improving recommendation diversity. In AAAI ’01 (2001).
4. Cunningham, S., Jones, M., and Jones, S. Organizing digital music for use: An examination of personal music collections. In Proceedings of the 5th International Conference on Music Information Retrieval (ISMIR 2004), pp. 447–454.
5. desJardins, M., Eaton, E., and Wagstaff, K. L. Learning user preferences for sets of objects. In ICML ’06: Proceedings of the 23rd International Conference on Machine Learning (New York, NY, USA, 2006), ACM, pp. 273–280.
6. Golbeck, J., and Hendler, J. FilmTrust: Movie recommendations using trust in web-based social networks. In IEEE CCNC (2006).
7. Herlocker, J. L., Konstan, J. A., Terveen, L. G., and Riedl, J. T. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems 22, 1 (2004), 5–53.
8. Lee, J. H., and Downie, J. S. Survey of music information needs, uses, and seeking behaviours: Preliminary findings. In ISMIR (2004), pp. 441–446.
9. Lehtiniemi, A., and Seppänen, J. Evaluation of automatic mobile playlist generator. In Mobility ’07: Proceedings of the 4th International Conference on Mobile Technology, Applications, and Systems and the 1st International Symposium on Computer Human Interaction in Mobile Technology (New York, NY, USA, 2007), ACM, pp. 452–459.
10. Miller, B. N., Albert, I., Lam, S. K., Konstan, J. A., and Riedl, J. MovieLens unplugged: Experiences with an occasionally connected recommender system. In IUI ’03 (New York, NY, USA, 2003), ACM, pp. 263–266.
11. O’Connor, M., Cosley, D., Konstan, J. A., and Riedl, J. PolyLens: A recommender system for groups of users. In ECSCW ’01 (Norwell, MA, USA, 2001), Kluwer Academic Publishers, pp. 199–218.
12. Pampalk, E., Flexer, A., and Widmer, G. Hierarchical organization and description of music collections at the artist level. In ECDL ’05 (2005), pp. 37–48.
13. Pauws, S., and Eggen, B. PATS: Realization and user evaluation of an automatic playlist generator. In Proceedings of the 3rd International Conference on Music Information Retrieval (2002), pp. 222–230.
14. Pauws, S., and van de Wijdeven, S. Evaluation of a new interactive playlist generation concept. In ISMIR: International Conference on Music Information Retrieval (2005), pp. 638–643.
15. Tzanetakis, G. Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing (2002), 293–302.
16. Voida, A., Grinter, R. E., Ducheneaut, N., Edwards, W. K., and Newman, M. W. Listening in: Practices surrounding iTunes music sharing. In CHI ’05: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (New York, NY, USA, 2005), ACM, pp. 191–200.
17. Ziegler, C.-N., McNee, S. M., Konstan, J. A., and Lausen, G. Improving recommendation lists through topic diversification. In WWW ’05 (New York, NY, USA, 2005), ACM, pp. 22–32.