Domain Ranking For Cross Domain Collaborative Filtering

Amit Tiroshi and Tsvi Kuflik
The University of Haifa, Haifa, Israel
{tsvikak;atiroshi}@is.haifa.ac.il

Abstract. In recommendation systems, a variation of the cold-start problem is a situation where the target user has few or no item ratings in the target domain (e.g., movies) on which to base recommendations. One way to overcome this is to base recommendations on items from different domains, for example recommending movies based on the target user's book ratings. This technique is called cross-domain recommendation. When recommendations are based on a source domain that differs from the target domain, a question arises: from which domain should items be chosen? Is there a source domain that is a better predictor for each target domain? Do books better predict a user's taste in movies, or do music preferences? In this study we present initial results of work in progress that ranks and maps pairs of domains based on the ability to create recommendations in one domain using ratings of items from the other. The recommendations are made using cross-domain collaborative filtering and evaluated on the social networking profiles of 2148 users. Initial results show that information that is freely available in social networks can be used for cross-domain recommendation, and that source domains differ in the quality of the recommendations they yield.

Keywords: cross-domain recommendation, cold-start problem, collaborative filtering

1 Introduction

Information overload nowadays prevents us from building information systems following the “one-fits-all” paradigm. In order to provide users with easy access to information, information systems must be adaptable; they should tailor the information they serve to the personal preferences and needs of their users. One approach for adapting content to users' preferences is Collaborative Filtering (CF). The CF approach [1] is based on similarity of users' preferences. It assumes that users who agreed in the past on items they liked will probably agree on more items in the future. For example, taking one user's bookshelf and cross-checking it against the shelves of other users to find those with similar books will yield several possible book recommendations for that user. To carry out this approach, information about preferences and interests needs to be collected from the target user and from a large number of other users.

Having no details at all, or an insufficient amount of them, regarding the target user's interests and preferences is defined as the “cold-start problem” [2]. A specific case of the cold-start problem occurs when a CF system for a dedicated target domain does not have the user's preferences in that domain, but does have the user's preferences in other domains. For example, users ask a CF system to recommend movies (the target domain), but the system only has their book or music preferences. One way to overcome this problem is Cross-Domain Collaborative Filtering (CDCF) [3]. In CDCF the system finds users with preferences similar to those of the target user based on domains in which that information is available (“source domains”). Then items from the target domain that were preferred by those users are filtered and recommended to the target user. When considering a CDCF application, a major question is how to decide which source domain to use for which target domain.

2 Background

In order to overcome the sparsity of user-to-item rating matrices in CF recommendation systems, [3] suggest mediating ratings across domains. Cross-domain mediation works by calculating similarity among users in domains other than the target domain (using methods such as k-nearest neighbors) and generating recommendations based on those similar users' taste in the target domain. In that work the mediated recommendations were evaluated against regular CF results; separate domains were mimicked by taking a single domain, movies, and splitting it into sub-domains based on genres. For example, to recommend an action movie to a target user who had too few ratings for movies in that genre, the system would take the user's ratings from other genres (e.g., comedy, romance), find similar users based on their ratings for movies in those genres, and return those users' most liked movies in the action genre. Results showed that the cross-domain recommendations performed even better than the regular CF ones.

To the best of our knowledge, very little work has been done on domain mapping for the purpose of cross-domain recommendation. One example is [4], in which the authors mapped between domains using a user study of 144 university students. In their initial analysis they searched for correlation between domains based on shared items, for example “If a user liked the book ‘The Devil Wears Prada’, did they also like the movie based on it?”, or whether users enjoyed movies in which a singer they like performed. The results of this analysis showed high correlation between song items and movies that were related to each other, and between items that involved singers who are also actors. In their second evaluation, domains were mapped using category similarity; for example, do users who enjoy video games that belong to the action genre also enjoy action movies? The results of this evaluation showed that users who liked books of a certain genre also enjoyed TV series from the same genre. Their last evaluation method tried to make CF recommendations for a target domain based on items from multiple source domains; for example, making movie and TV series recommendations based on the target user's set of movies, TV series, and CD items combined. The results showed that recommendations ranked last for every combination of source domain items that did not contain items from the target domain.

In another example [5], a mapping approach based on Information Retrieval (IR) techniques was suggested and evaluated for measuring the similarity of domains based on Google Directory and Open Directory (web directories aggregate and categorize websites belonging to the same domain). A vector of term frequencies was calculated using TF-IDF for several domains in both web directories, and the cosine similarity between the vector representations of the domains was measured. The results showed that distances did differ between different domains, and that these results were consistent across the two directories. The authors concluded that inter-domain similarity may help in selecting domains for cross-domain recommendation.

Their approach and the one proposed in this work both suggest an automatic way to generate domain mappings. The mapping presented in this work is based on a larger number of users (thousands) from more heterogeneous backgrounds (age, employment, education). We too use CF recommendations and the target user's own set of items from the target domain to evaluate our results; however, we do so for every source-to-target domain combination, creating a full map between the available domains. Our results are not refined by domain sub-categories, since the dataset lacks that information, but we intend to add it in future work. This evaluation is complemented by an additional comparison to the recommendations that would have been generated for the dataset using regular CF, as done in [3].
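The directory-based similarity measure of [5] summarized above can be illustrated with a minimal sketch. It assumes each domain is represented by a bag of terms; the toy term lists, the function names `tfidf_vectors` and `cosine`, and the particular TF-IDF weighting (relative term frequency times log inverse document frequency) are illustrative assumptions and are not taken from [5].

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build a sparse TF-IDF vector (dict term -> weight) per domain.

    docs maps a domain name to its list of terms. The weighting used here
    (relative term frequency * log(N / document frequency)) is one common
    variant, chosen for illustration only.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))
    vectors = {}
    for name, terms in docs.items():
        counts = Counter(terms)
        vectors[name] = {
            term: (count / len(terms)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical term lists standing in for directory content per domain.
domains = {
    "movies": ["actor", "director", "plot", "soundtrack"],
    "tv": ["actor", "director", "episode", "season"],
    "books": ["author", "plot", "chapter", "publisher"],
}
vectors = tfidf_vectors(domains)
movies_tv = cosine(vectors["movies"], vectors["tv"])
movies_books = cosine(vectors["movies"], vectors["books"])
```

With these toy term lists, movies and TV share two terms while movies and books share one, so `movies_tv` comes out larger than `movies_books`; real directory data would of course be needed to draw any conclusion about actual domains.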

3 Cross Domain Recommendation Using SNS data

The suitability of different domains for CDCF is based on the evaluation of 2148 Facebook¹ profiles, which contained items (“likes”) in four domains: Music (artists/bands), Movies, TV shows, and Books. The dataset originally contained 6370 user profiles. However, only 33.72% of them contained at least a single item that belongs to at least one of the domains that were evaluated. The total number of items in each domain, and the average number of items each user liked from each domain, are detailed in Table 1.

Table 1. Dataset Statistics

                                   Music    Movies   TV      Books
Total Amount of Items              7481     5470     3310    4140
Average Amount of Items Per User   6.49     4.68     3.24    2.42
Standard Deviation                 13.39    7.78     4.68    3.80

¹ http://www.facebook.com

The dataset was collected using the platform's application API, based on users' consent to participate in an online experiment. All profiles and items were loaded into a graph-based database system called Neo4j²; users and items are nodes connected to each other by “Likes” edges, which are labeled with the domain name. A generic cross-domain graph walk was then implemented; the walk receives a source domain and a target domain as parameters and returns the measurements described below. The walk was then performed for every combination of source and target domain among the existing domains.

In order to find out which domains are more influential for recommending in other domains, we measured the precision of CDCF recommendations using different source and target domains. The process we followed was to find the K-nearest neighbors in a source domain and to use these “neighborhoods” to create recommendations in a target domain. To compare the results we defined two precision measures. The first (PR#1) was defined as the percentage of items recommended by the system for a user in the target domain that were included in the items the user actually rated/liked in that domain. For example, if music recommendations were made based on movie items, then PR#1 represents the percentage of each user's own music likes that appeared in the music recommendations generated by the system.
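The neighborhood-based process and PR#1 described above can be sketched as follows. This is a minimal in-memory illustration, not the Neo4j graph walk used in the study: the Jaccard similarity, the values of `k` and `top_n`, the toy `likes` data, and all function names are assumptions, since the text does not specify them. PR#1 is computed here as in the worked example, i.e., as the share of the user's own target-domain likes that appear among the recommendations.

```python
from collections import Counter

# Hypothetical toy data: likes[user][domain] is the set of liked items.
likes = {
    "u1": {"movies": {"Alien", "Heat"}, "music": {"Queen", "Muse"}},
    "u2": {"movies": {"Alien", "Heat"}, "music": {"Queen", "Muse", "Blur"}},
    "u3": {"movies": {"Alien"}, "music": {"Queen"}},
    "u4": {"movies": {"Shrek"}, "music": {"Abba"}},
}

def jaccard(a, b):
    """Jaccard similarity of two item sets (an assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def nearest_neighbors(likes, user, source, k):
    """Return up to k users most similar to `user` in the source domain."""
    own_items = likes[user].get(source, set())
    scored = []
    for other, profile in likes.items():
        if other == user:
            continue
        similarity = jaccard(own_items, profile.get(source, set()))
        if similarity > 0:
            scored.append((similarity, other))
    # Highest similarity first; ties broken by user id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:k]]

def recommend(likes, user, source, target, k=2, top_n=10):
    """Recommend target-domain items liked by the user's source-domain neighbors.

    The user's own target-domain likes are deliberately not consulted,
    so that PR#1 can check how many of them are recovered.
    """
    votes = Counter()
    for neighbor in nearest_neighbors(likes, user, source, k):
        votes.update(likes[neighbor].get(target, set()))
    ranked = sorted(votes.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_n]]

def pr1(recommended, liked):
    """Return (hits, share of the user's liked target items that were recommended)."""
    hits = len(set(recommended) & liked)
    return hits, (hits / len(liked) if liked else 0.0)

recs = recommend(likes, "u1", source="movies", target="music")
hits, share = pr1(recs, likes["u1"]["music"])
```

In this toy run, user `u1` has two movie-based neighbors whose music likes cover both of `u1`'s own music likes, so `hits` is 2 and `share` is 1.0.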
The second measure (PR#2) is the percentage of items in the top 10 recommended by CDCF that also appeared in the set of recommendations generated for the target domain using regular (non-cross-domain) collaborative filtering; naturally, the latter did not contain the user's own likes in the domain. To continue the previous example of music recommendations, music recommendations were generated for each user using both CDCF and regular CF, the regular CF results did not contain the user's original music preferences, and the two sets were intersected.

For PR#1 we excluded from the measurements users who did not have items belonging to the source domain (since no recommendations could be made for them) or who did not have items belonging to the target domain (since there was nothing to compare to). For PR#2 we excluded users who did not have any recommendation results from regular CF (since those serve as the basis for the comparison). For both metrics, PR#1 and PR#2, we also filtered out users with fewer than 5 items in each domain, in order to enable a minimal level of accuracy. On average, across the various evaluation combinations, 18% of the 2148 users had at least 5 items in the source domain and 1 item in the target domain.

Table 2 presents the results of the CDCF recommendation experiment; the rows are grouped by target domain and internally sorted by PR#1. The PR#1 column contains both the number of “hits” and their percentage, averaged over all users who had items in that target domain. As can be seen, although the absolute numbers are low, for most target domains the leading source domain identified more than 40% of the items a user liked. For example, generating movie recommendations based on similarity in TV series preferences yielded 5.31 matching items on average, which is 43.56% of the target users' preferred movies.
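The PR#2 overlap and the user filtering described above can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, PR#2 is returned as a raw overlap count (the PR#2 values in Table 2 appear to be average counts rather than percentages), and the eligibility rule follows the "at least 5 source items and 1 target item" reading of the filtering description.

```python
def pr2(cdcf_recs, cf_recs, top_n=10):
    """Count how many of the top-N CDCF items also appear in the regular CF results.

    Returns None when regular CF produced no recommendations, mirroring the
    exclusion of such users from the PR#2 measurements.
    """
    if not cf_recs:
        return None
    return len(set(cdcf_recs[:top_n]) & set(cf_recs))

def is_eligible(profile, source, target, min_source_items=5, min_target_items=1):
    """Check whether a user profile has enough items to be evaluated.

    profile maps a domain name to the set of items the user liked.
    The thresholds are assumptions based on the filtering described in the text.
    """
    has_source = len(profile.get(source, ())) >= min_source_items
    has_target = len(profile.get(target, ())) >= min_target_items
    return has_source and has_target

# Hypothetical example: two of the three CDCF items also appear in the CF list.
overlap = pr2(["Queen", "Blur", "Muse"], ["Muse", "Queen", "Abba"])
rich_profile = {"movies": {"a", "b", "c", "d", "e"}, "music": {"Queen"}}
sparse_profile = {"movies": {"a", "b"}, "music": {"Queen"}}
```

Here `overlap` is 2, `rich_profile` passes the filter for a movies-to-music evaluation, and `sparse_profile` does not.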
As may be expected, there are noticeable differences between the different domains; it seems that recommending music items based on other domains is more accurate than recommending books. PR#2 shows a similar general behavior: there are domains where CDCF performs better than others when compared with classical CF.

² http://neo4j.org/

Table 2. Cross-Domain Collaborative Filtering Domain Ranking

Source Domain   Target Domain   PR #1            PR #2
TV              Music           6.73 (42.59%)    3.25
Movies          Music           6.45 (38.98%)    2.92
Books           Music           2.24 (14.02%)    1.51
TV              Movies          5.31 (43.56%)    3.02
Music           Movies          4.45 (39.84%)    2.64
Books           Movies          2.51 (18.49%)    1.23
Movies          TV              3.37 (45.03%)    5.48
Music           TV              3.21 (46.51%)    5.48
Books           TV              1.90 (24.52%)    3.07
TV              Books           1.02 (18.19%)    1.60
Movies          Books           1.00 (16.41%)    1.48
Music           Books           0.77 (13.63%)    0.92

The standard deviations of PR#1 and PR#2 were in all cases higher than the values of the metrics themselves. It is worth noting that, as can be expected, when users with fewer than 5 items in the source domain (i.e., too little information for prediction) are not filtered out, the successful prediction percentage drops to 30% for the leading source domains in each group.
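Turning the measurements into a per-target ranking of source domains, which is how the rows of Table 2 are ordered, can be sketched as follows. The function name `rank_sources` and the data layout are assumptions; the three sample rows are the Movies target group, with source labels as read from Table 2, and sorting uses the average number of PR#1 hits.

```python
from collections import defaultdict

# (source domain, target domain, average PR#1 hits) for the Movies target group.
results = [
    ("Books", "Movies", 2.51),
    ("TV", "Movies", 5.31),
    ("Music", "Movies", 4.45),
]

def rank_sources(results):
    """Group rows by target domain and order source domains by descending score."""
    by_target = defaultdict(list)
    for source, target, score in results:
        by_target[target].append((score, source))
    return {
        target: [source for _, source in sorted(pairs, key=lambda pair: -pair[0])]
        for target, pairs in by_target.items()
    }

ranking = rank_sources(results)
```

For the Movies target this yields the order TV, Music, Books. Ranking by hit percentage or by PR#2 instead would only require passing a different score column.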

4 Discussion and Conclusions

This work contributes a way of mapping the similarity of domains for the purpose of cross-domain collaborative filtering, using freely available, domain-rich user information. It is the first time such an evaluation has been done on a large-scale and heterogeneous collection of users and items. Two precision measures were used to evaluate the resulting ranking; they assess the ranking by comparing recommendations both to users' previously seen items and to unseen ones (i.e., items recommended by a community of users in the target domain, applying classical collaborative filtering).

In the future, we intend to further investigate the above mapping. We plan to better understand the differences between domains, in order to be able to suggest how these differences may be taken into account for defining the uncertainty in the CDCF process. Moreover, we plan to investigate the impact of the number of ratings in the source domain on the accuracy of the recommendation in the target domain. Taking a closer look at the dataset, we noticed that users sometimes chose to list all the items they liked as a single long string (there is an option to enter open text on Facebook). We are currently unable to process this data, but simple parsing may help resolve this problem and further increase the performance of the recommendations. We also intend to enrich the dataset using domain knowledge (e.g., genres) and evaluate its effect on the domain ranks.

5 References

1. Shardanand, U., Maes, P.: Social Information Filtering: Algorithms for Automating "Word of Mouth". In: CHI, pp. 210-217 (1995)
2. Schein, A. I., Popescul, A., Ungar, L. H., Pennock, D. M.: Methods and metrics for cold-start recommendations. In: ACM, pp. 253-260 (2002)
3. Berkovsky, S., Kuflik, T., Ricci, F.: Cross-domain mediation in collaborative filtering. User Modeling 2007, 355-359 (2007)
4. Winoto, P., Tang, T.: If you like the Devil Wears Prada the book, will you also enjoy the Devil Wears Prada the movie? A study of cross-domain recommendations. New Generation Computing 26(3), 209-225 (2008)
5. Berkovsky, S., Goldwasser, D., Kuflik, T., Ricci, F.: Identifying Inter-Domain Similarities Through Content-Based Analysis of Hierarchical Web-Directories. In: ECAI, pp. 789-790 (2006)