
Hybrid Tag Recommendation for Social Annotation Systems

Jonathan Gemmell, Thomas Schimoler, Bamshad Mobasher, Robin Burke
Center for Web Intelligence, School of Computing, DePaul University
Chicago, Illinois, USA

{jgemmell, tschimo1, mobasher, rburke}@cdm.depaul.edu

ABSTRACT

Social annotation systems allow users to annotate resources with personalized tags and to navigate large and complex information spaces without the need to rely on predefined hierarchies. These systems help users organize and share their own resources, as well as discover new ones annotated by other users. Tag recommenders in such systems assist users in finding appropriate tags for resources and help consolidate annotations across all users and resources. But the size and complexity of the data, as well as the inherent noise and inconsistencies in the underlying tag vocabularies, have made the design of effective tag recommenders a challenge. Recent efforts have demonstrated the advantages of integrative models that leverage all three dimensions of a social annotation system: users, resources and tags. Among these approaches are recommendation models based on matrix factorization. However, these models tend to lack scalability and often hide the underlying characteristics, or “information channels”, of the data that affect recommendation effectiveness. In this paper we propose a weighted hybrid tag recommender that blends multiple recommendation components drawing separately on complementary dimensions, and evaluate it on six large real-world datasets. In addition, we attempt to quantify the strength of the information channels in these datasets and use these results to explain the performance of the hybrid. We find our approach is not only competitive with the state-of-the-art techniques in terms of accuracy, but also has the added benefits of being scalable to large real-world applications, extensible to incorporate a wide range of recommendation techniques, easily updateable, and more scrutable than other leading methods.

General Terms

Algorithms, Experimentation, Performance

Categories and Subject Descriptors

H.2 [Database Management]: H.2.8 Database application—Data mining; H.3 [Information Storage and Retrieval]: H.3.3 Information Search and Retrieval—Search process

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CIKM’10, October 26–30, 2010, Toronto, Ontario, Canada. Copyright 2010 ACM 978-1-4503-0099-5/10/10 ...$10.00.

Keywords

Social Annotation, Information Channels, Hybrid Recommenders, Recommender Systems

1. INTRODUCTION

In social annotation systems, information access functions such as search, navigation and resource sharing are supported by annotations: arbitrary tags applied to resources by individual users. Delicious (delicious.com) supports users as they bookmark URLs. Citeulike (citeulike.org) enables researchers to manage scholarly references. Bibsonomy (bibsonomy.org) allows users to annotate both. Social annotation systems are quickly becoming ubiquitous in a variety of domains. For example, Amazon (amazon.com) and others have incorporated social annotations into their web sites.

The popularity of social annotation systems is driven in part by the low entry barrier and the freedom to annotate resources with any tag. The aggregated connections between users, resources and tags provide a rich information space for users to explore. However, the benefits of social annotation systems do not come without a cost. The size, noise and dimensionality of the data make navigation and information access difficult. Recommender systems are therefore a critical component of these applications. In this work we focus on tag recommenders, which assist users during the annotation process by recommending tags for a selected resource.

Recent efforts in tag recommendation have shown that integrative models that leverage all three dimensions of a social annotation system (users, resources, tags) produce superior results. Graph-based models [7] and a variety of latent variable techniques [13, 14, 18, 19] have been investigated. These approaches tend to be computationally intensive and scale poorly. Previous work on hybrid tag recommenders [3, 4] that combine several components, each exploiting different dimensions of the data, has been shown to offer competitive results while maintaining the simplicity, computational efficiency and explanatory capacity of the component recommenders. However, these results have focused on hybrid models specifically designed for a particular dataset or as a means to augment another integrative technique.

This paper proposes a framework for constructing linear weighted hybrids that combines component recommenders into a single integrated model. No individual component is required to cover all dimensions of the data, but when taken together the components complement one another. The hybrid is therefore able to produce results superior to what the components produce alone. To help understand our experimental results, we explore the notion of an information channel: the power one dimension possesses in predicting or modeling another dimension of the social annotation system. To quantify the strength of these information channels, we develop a family of metrics based on conditional entropy. These metrics reveal marked differences in the characteristics of the datasets, which are reflected in the performance of the recommendation components. The rest of the paper is organized as follows. In Section 2 we present related work. Section 3 introduces tag recommendation. Results of the recommendation techniques are provided in Section 4. In Section 5 we introduce the notion of information channels and use them to evaluate the results of the tag recommenders. Finally, we conclude the paper with a summary of our findings.

2. RELATED WORK

One of the first techniques to demonstrate the value of an integrative approach for tag recommendation in social annotation systems was a graph-based variant [7] of the well-known PageRank algorithm. The computational burden of computing the PageRank values of each user, resource and tag for every recommendation makes the algorithm ill-suited for large-scale deployment.

Tensor factorization is another integrative technique for making tag recommendations. Tucker decomposition is one such example that factors the three-dimensional tagging data into three feature spaces and a core residual tensor [18, 19]. Unlike the graph-based model, online computation of recommendations is highly efficient. However, the offline computation required to build the model is not scalable to the demands of real-world applications.

A pair-wise interaction tensor factorization model has also been proposed, which offers far more reasonable run times in both the construction of the model and the generation of recommendations [13, 14]. It optimizes the ranking of tags given user-resource pairs in the training data. Tags may then be recommended for a new user-resource pair. This approach represents the current state-of-the-art in tag recommendation, providing both a high degree of accuracy and computational efficiency.

Our previous work in tag recommendation has demonstrated the benefits of hybrid recommenders [3, 4]. One approach demonstrated that the graph-based models may be improved by incorporating item-based collaborative filtering. Another effort resulted in a hybrid recommender for Bibsonomy in the context of the PKDD-ECML 2009 challenge [10]. In this paper we extend those efforts, proposing a more general framework for constructing linear weighted hybrid tag recommenders. The hybrid is constructed from component recommenders and produces results competitive with state-of-the-art techniques.

3. TAG RECOMMENDATION

This section begins with a discussion of the data models for a social annotation system. We then present our proposed framework for the linear weighted hybrid tag recommender and discuss the individual components that may be incorporated into the hybrid. For comparative purposes we also describe the state-of-the-art pair-wise interaction tensor factorization algorithm.

3.1 Data Model

The foundation of a social annotation system is the annotation: the record of a user labeling a resource with one or more tags. A collection of annotations results in a complex network of interrelated users, resources and tags [11]. A social annotation system can be described as a four-tuple $\langle U, R, T, A \rangle$, where U is a set of users; R is a set of resources; T is a set of tags; and A is a set of annotations. An annotation contains a user, a resource and all tags the user applied to the resource. A social annotation system can also be viewed as a three-dimensional matrix, URT, in which an entry URT(u, r, t) is 1 if u tagged r with t.

Aggregate projections of the data can be constructed, reducing the dimensionality but sacrificing information [12]. For example, the relation between resources and tags can be defined as RT(r, t). In this work, we calculate RT(r, t) as the number of users that have applied t to r. This notion strongly resembles the “bag-of-words” vector space model [15] and is analogous to the idea of term frequency common in information retrieval. A similar two-dimensional projection can be constructed for UT, in which an entry contains the number of times a user has applied a tag to any resource. Finally, UR is a binary matrix indicating whether or not a user has annotated a resource. An alternative approach would be to define an entry in the matrix as the number of tags a user has applied to a resource. Our previous work and continued experimentation have shown that the binary model for UR produces better results.

Each resource, r, may be modeled as a vector over the multi-dimensional space of tags, where a weight, $w(t_i)$, in dimension i corresponds to the importance of a particular tag, $t_i$:

$$\vec{r}_t = \langle w(t_1), w(t_2), \ldots, w(t_{|T|}) \rangle \tag{1}$$

Similarly, a resource can be modeled as a vector over the space of users, where each weight, $w(u_i)$, corresponds to the importance of a particular user, $u_i$, to produce $\vec{r}_u$. Analogous vector models can be constructed for users ($\vec{u}_r$, $\vec{u}_t$) and tags ($\vec{t}_u$, $\vec{t}_r$). We draw the weights directly from the previously constructed aggregate projections UR, UT and RT. The model of a user, resource or tag is defined as a row or column taken from one of the projections.
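As an illustration of the projections described above, the following minimal Python sketch builds RT, UT and UR from a list of annotations. The function name `build_projections`, the nested-dictionary layout and the toy data are assumptions made for this example and are not taken from the original system.

```python
from collections import defaultdict

def build_projections(annotations):
    """Build the RT, UT and UR projections from annotations.

    `annotations` is an iterable of (user, resource, tags) records.
    The nested-dict layout is an illustrative choice, not the paper's.
    """
    rt = defaultdict(lambda: defaultdict(int))  # rt[r][t]: users who applied t to r
    ut = defaultdict(lambda: defaultdict(int))  # ut[u][t]: times u applied t
    ur = defaultdict(dict)                      # ur[u][r] = 1 if u annotated r
    seen = set()                                # guard against duplicate triples
    for user, resource, tags in annotations:
        ur[user][resource] = 1
        for tag in set(tags):
            if (user, resource, tag) in seen:
                continue
            seen.add((user, resource, tag))
            rt[resource][tag] += 1
            ut[user][tag] += 1
    return rt, ut, ur

# Hypothetical toy data.
annotations = [
    ("alice", "r1", ["python", "web"]),
    ("bob", "r1", ["python"]),
    ("alice", "r2", ["python"]),
]
rt, ut, ur = build_projections(annotations)

# A resource modeled as a vector over tags is simply its row in RT.
resource_as_tags = dict(rt["r1"])
print(resource_as_tags)
```

Each row of a projection can be used directly as the vector model of the corresponding user or resource, which is how the component recommenders below would consume it.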

3.2 Linear Weighted Hybrid

Our proposed framework aggregates the results of several component recommenders in linear combination [1]. The components are freed from the burden of covering all the available dimensions of the data and instead specialize in only a few. A successful hybrid creates a synergistic blend of its component parts producing results superior to what they could achieve alone.

We can view each component of a tag recommendation system as a function $\psi : U \times R \times T \to \mathbb{R}$, which, given a user $u \in U$ and a resource $r \in R$, produces a real-valued result p as the predicted relevance of a tag t for that particular user-resource pair: $\psi(u, r, t) = p$. In the most common settings tag recommenders are used to produce a ranked list of suggested tags for a particular user and a specific resource. To do so using the above formulation, for a given user u and resource r, we iterate over all tags, sort them by their corresponding relevance scores, and return the top n tags:

$$rec(u, r) = \mathrm{TOP}^{n}_{t \in T}\; \psi(u, r, t) \tag{2}$$

In our proposed hybrid framework the relevance score for a tag is calculated using several component tag recommenders. These scores are then combined in a linear model. Specifically, given a set of component tag recommenders C, a linear weighted hybrid tag recommender will accept a user u and resource r. It will then query each of its component recommenders, $c \in C$, for a tag, t, and combine the results in the linear model:

$$\psi_h(u, r, t) = \sum_{c \in C} \alpha_c \, \psi_c(u, r, t) \tag{3}$$

where $\psi_h(u, r, t)$ is the linear weighted relevance score of the tag and $\alpha_c$ is the weight given to the component c. It should be noted that the scores from one component may be drawn from a different distribution than those of the other components. In order to ensure that the relevance scores for all component recommenders are on the same scale, we normalize the scores so that each $\psi_c(u, r, t)$ falls in the interval [0, 1].

As additional recommenders are added to the hybrid, its complexity grows. The challenge then becomes how to ascertain the correct α for each component in order to maximize the effectiveness of the hybrid. We use a hill climbing technique because of its speed and simplicity. The α vector is initialized with random positive numbers constrained such that the sum of the vector equals 1. The vector is then randomly modified and tested against a holdout set to ascertain whether it achieves better results. The holdout set may be evaluated for recall, precision or F-measure. In this work we rely on the F-measure since it incorporates both recall and precision. If the result is improved, the change is accepted; otherwise it is usually rejected. Occasionally a change to the α vector is accepted even when it does not improve the results in order to more fully explore the α space. Modifications continue until the vector stabilizes. To reduce the risk of settling on a poor local maximum, the experiment is repeated 20 times from different random starting points.

With this integrative model any tag recommender can be incorporated into the hybrid. We focus on relatively simple component recommenders due to their speed and scrutability. We now present those components.
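The combination in Equation (3) and the weight search described above can be sketched in Python as follows. This is a minimal illustration under several assumptions that the text does not specify: min-max scaling is used as the normalization, the random modification is a small uniform perturbation followed by renormalization, the search runs for a fixed number of steps instead of testing for stabilization, and the names `normalize`, `hybrid_scores`, `top_n` and `hill_climb` are hypothetical. The `evaluate` callback stands in for the F-measure computed on the holdout fold.

```python
import random

def normalize(scores):
    """Min-max scale a {tag: score} dict into [0, 1] (assumed normalization)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {t: (1.0 if hi > 0 else 0.0) for t in scores}
    return {t: (s - lo) / (hi - lo) for t, s in scores.items()}

def hybrid_scores(component_scores, alphas):
    """Equation (3): weighted sum of normalized component scores for one (u, r)."""
    combined = {}
    for alpha, scores in zip(alphas, component_scores):
        for tag, s in normalize(scores).items():
            combined[tag] = combined.get(tag, 0.0) + alpha * s
    return combined

def top_n(scores, n):
    """Equation (2): the n highest-scoring tags (ties broken alphabetically)."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:n]

def hill_climb(evaluate, num_components, restarts=20, steps=200,
               step_size=0.1, accept_worse=0.05, seed=0):
    """Search for component weights that maximize evaluate(alphas)."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(restarts):
        # Random positive starting point, scaled to sum to 1.
        alphas = [rng.random() + 1e-9 for _ in range(num_components)]
        total = sum(alphas)
        alphas = [a / total for a in alphas]
        current = evaluate(alphas)
        if current > best_score:
            best, best_score = alphas, current
        for _ in range(steps):
            cand = [max(0.0, a + rng.uniform(-step_size, step_size)) for a in alphas]
            total = sum(cand)
            if total == 0.0:
                continue
            cand = [a / total for a in cand]
            score = evaluate(cand)
            # Accept improvements; occasionally accept a worse vector to explore.
            if score > current or rng.random() < accept_worse:
                alphas, current = cand, score
                if score > best_score:
                    best, best_score = cand, score
    return best, best_score

# Hypothetical scores from two components for a single user-resource pair.
component_scores = [{"python": 4, "web": 1}, {"web": 0.9, "code": 0.1}]
recommended = top_n(hybrid_scores(component_scores, [0.5, 0.5]), 2)

# Toy objective standing in for F-measure on the holdout fold.
target = [0.7, 0.2, 0.1]
evaluate = lambda a: -sum((x - y) ** 2 for x, y in zip(a, target))
alphas, score = hill_climb(evaluate, 3)
```

Tracking the best vector separately from the current one means that the occasional acceptance of a worse vector never loses the best weights found so far.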

3.2.1 Popularity Models

Perhaps the simplest recommendation strategy is merely to recommend the most commonly used tags in the system. Alternatively, given a user-resource pair a recommender may ignore the user and recommend the most popular tags for that particular resource. This strategy is strictly resource dependent and does not take into account the tagging habits of the user. We define $\psi(u, r, t)$ for the resource-based popularity recommender, $pop_r$, as:

$$\psi(u, r, t) = \sum_{v \in U} \theta(v, r, t) \tag{4}$$

We define $\theta(v, r, t)$ as 1 if v tagged r with t and 0 otherwise. In a similar fashion a recommender may ignore the resource and recommend the most popular tags for that particular user. While such an algorithm would include tags frequently applied by the user, it does not consider the resource information and may recommend tags irrelevant to the current resource. We define $\psi(u, r, t)$ for the user-based popularity recommender, $pop_u$, as:

$$\psi(u, r, t) = \sum_{s \in R} \theta(u, s, t) \tag{5}$$

While popularity models are not necessarily the most effective techniques, they do serve as a baseline and may benefit the hybrid. Popularity based recommenders require little online computation. They are easily built offline and can be incrementally updated.
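The two popularity recommenders amount to simple counts over the annotation triples. The sketch below is one possible rendering; the function names `pop_r` and `pop_u` mirror the labels used in the text, while the flat list of (user, resource, tag) triples and the toy data are assumptions made for this example.

```python
from collections import Counter

def pop_r(triples, resource):
    """Equation (4): for each tag, count the users who applied it to `resource`."""
    return Counter(t for (v, s, t) in set(triples) if s == resource)

def pop_u(triples, user):
    """Equation (5): for each tag, count the resources `user` applied it to."""
    return Counter(t for (v, s, t) in set(triples) if v == user)

# Hypothetical (user, resource, tag) triples.
triples = [
    ("alice", "r1", "python"),
    ("bob", "r1", "python"),
    ("bob", "r1", "web"),
    ("alice", "r2", "python"),
    ("alice", "r2", "code"),
]
resource_popularity = pop_r(triples, "r1")
user_popularity = pop_u(triples, "alice")
```

In a deployed system these counts would more plausibly be maintained incrementally as annotations arrive, consistent with the observation above that popularity models are easy to update.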

3.2.2 User-Based Collaborative Filtering

User-based collaborative filtering [5, 9, 17] works under the assumption that users who have agreed in the past are likely to agree in the future. A neighborhood, $N_r$, of the k users most similar to u is identified through a similarity metric, with the restriction that all the neighbors have tagged r. For any given resource the weighted sum can then be calculated as:

$$\psi(u, r, t) = \sum_{v \in N_r} \sigma(u, v)\, \theta(v, r, t) \tag{6}$$

where $\sigma(u, v)$ is the similarity between the users u and v. In this work we rely on cosine similarity of the user models. As before, $\theta(v, r, t)$ is 1 if v has annotated r with t and 0 otherwise. When users are modeled as vectors over resources we call this approach $KNN_{ur}$. When users are modeled as vectors over tags we call this technique $KNN_{ut}$.

Since the algorithm will only populate the neighborhood with users that have annotated r, the number of similarities to calculate can be quite small. The popularity of resources in social annotation systems follows a power law, so the great majority of resources will benefit from this reduced computation, while a few will require additional computational effort. As a result the algorithm scales well with large datasets. Similarities may even be computed offline.

This approach relies on the collaboration of other users. It may be the case that an appropriate tag cannot be recommended because it does not appear in a neighbor’s profile. While the personalization offered by user-based filtering is an important benefit, it lacks the ability to reflect the habits and patterns of the larger crowd.
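A minimal sketch of the user-based scoring in Equation (6) is given below, using cosine similarity over sparse user vectors. The helper names `cosine` and `knn_user_scores`, the dictionary keyed by (user, resource) pairs and the toy data are assumptions made for this illustration; the default of k = 30 matches the value reported in the experiments.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[key] for key, w in a.items() if key in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_user_scores(user, resource, user_models, tags_by_user_resource, k=30):
    """Equation (6): similarity-weighted votes from the k nearest users
    among those who have annotated `resource`."""
    candidates = [v for (v, r) in tags_by_user_resource
                  if r == resource and v != user]
    sims = {v: cosine(user_models[user], user_models[v]) for v in candidates}
    neighbors = sorted(candidates, key=lambda v: (-sims[v], v))[:k]
    scores = {}
    for v in neighbors:
        for tag in tags_by_user_resource[(v, resource)]:
            scores[tag] = scores.get(tag, 0.0) + sims[v]
    return scores

# Hypothetical user models over tags (the KNN_ut variant).
user_models = {
    "alice": {"python": 2, "web": 1},
    "bob": {"python": 1},
    "carol": {"cooking": 3},
}
tags_by_user_resource = {
    ("bob", "r9"): {"python", "code"},
    ("carol", "r9"): {"recipes"},
}
scores = knn_user_scores("alice", "r9", user_models, tags_by_user_resource, k=1)
```

Passing user vectors drawn from UR instead of UT would give the $KNN_{ur}$ variant without any change to the scoring function.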

3.2.3 Item-Based Collaborative Filtering

Item-based collaborative filtering [2, 16] relies on discovering similarities among resources rather than among users. We may model the resources as a vector over the user space. We call this model $KNN_{ru}$. When relying on tags, the vector contains the frequency with which a resource has been annotated with the tags. We call this model $KNN_{rt}$. We define $N_u$ as the k nearest resources to r drawn from the user profile, u, and then define the relevance score of a tag for a user-resource pair as:

$$\psi(u, r, t) = \sum_{s \in N_u} \sigma(r, s)\, \theta(u, s, t) \tag{7}$$

If a user has annotated resources similar to r with t then ψ(u, r, t) will be high. Otherwise the relevance score will be correspondingly low. The strength of this approach is that it can draw the most relevant tags from the user profile. Its weakness is that it cannot recommend tags from outside the user profile. Similarity metrics need only be calculated with resources in the user profile. If the user profile is not exceptionally large, this computation can be quickly done in real time. Otherwise, similarities can be calculated offline.
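The item-based scoring in Equation (7) can be sketched in the same style. The names `cosine` and `knn_item_scores`, the data layout and the toy values are assumptions made for this example, and the sketch assumes that a vector model is available for the query resource.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[key] for key, w in a.items() if key in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_item_scores(user, resource, resource_models, tags_by_user_resource, k=30):
    """Equation (7): similarity-weighted votes from the k resources in the
    user's profile that are most similar to the query resource."""
    profile = [s for (u, s) in tags_by_user_resource
               if u == user and s != resource]
    sims = {s: cosine(resource_models[resource], resource_models[s])
            for s in profile}
    neighbors = sorted(profile, key=lambda s: (-sims[s], s))[:k]
    scores = {}
    for s in neighbors:
        for tag in tags_by_user_resource[(user, s)]:
            scores[tag] = scores.get(tag, 0.0) + sims[s]
    return scores

# Hypothetical resource models over tags (the KNN_rt variant).
resource_models = {
    "r1": {"python": 3, "web": 1},
    "r2": {"python": 2},
    "r3": {"cooking": 4},
}
tags_by_user_resource = {
    ("alice", "r2"): {"python", "tutorial"},
    ("alice", "r3"): {"recipes"},
}
scores = knn_item_scores("alice", "r1", resource_models, tags_by_user_resource, k=1)
```

Because only resources in the user's own profile are candidates, every recommended tag comes from that profile, which reflects both the strength and the weakness noted above.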

3.3 Pair-wise Interaction Tensor Factorization

For the sake of comparison, we have chosen a tag recommender based on pair-wise interaction tensor factorization [14], which formed the basis for the winning submission of the PKDD 2009 Tag Recommendation Challenge [10]. This model-based approach generates a set of factor matrices which resembles a special case of the Tucker decomposition of a tensor. The tensor itself is not directly induced by the data (this could be achieved by regarding each (u, r, t) triple as a binary cell of a tensor), but rather reflects a ranking over the tags for each user-resource pair.

The model is built by first considering observations in the data of the form $(u, r, t^+, t^-)$, where $(u, r, t^+)$ is a triple which is found in the data (a positive example of tag selection) and $(u, r, t^-)$ is a triple not found in the data (a negative example of tag selection). An iterative gradient-descent algorithm is employed to optimize a ranking function (based on Bayesian conditionals) that prefers positive examples in the data over negative ones. Each of four related matrices is updated until convergence is found. The matrices represent the factor-reduced components of the specialized tensor factorization $M = U_k T_k^U + R_k T_k^R$, where $U_k$ is the user factor matrix, $R_k$ is the resource factor matrix, $T_k^U$ is the tag factor matrix with respect to users, $T_k^R$ is the tag factor matrix with respect to resources, k is the selected number of factors, and M is the personalized tag-ranking tensor.

Generating a tag recommendation for a given user u and resource r is simply a matter of referring to the appropriate user-resource column of the ranking tensor M. The relevance score of a tag given a user-resource pair is calculated as:

$$\psi(u, r, t) = \sum_{i=1}^{k} \left( U_k[u][i]\, T_k^U[t][i] + R_k[r][i]\, T_k^R[t][i] \right) \tag{8}$$
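To make the scoring step of Equation (8) concrete, the sketch below evaluates it for already-trained factor vectors. Only the scoring is shown: the gradient-descent training is omitted, and the function name `pitf_score` and the factor values are hypothetical and chosen purely for illustration.

```python
def pitf_score(user_factors, resource_factors, tag_user_factors, tag_resource_factors):
    """Equation (8): sum over the k factors of the user-tag interaction
    plus the resource-tag interaction."""
    return sum(
        u * tu + r * tr
        for u, tu, r, tr in zip(user_factors, tag_user_factors,
                                resource_factors, tag_resource_factors)
    )

# Hypothetical factors with k = 2.
U = {"alice": [1.0, 0.5]}                         # user factor rows
R = {"r1": [0.0, 2.0]}                            # resource factor rows
TU = {"python": [0.2, 0.4], "web": [0.1, 0.0]}    # tag factors w.r.t. users
TR = {"python": [1.0, 0.1], "web": [0.5, 0.05]}   # tag factors w.r.t. resources

scores = {t: pitf_score(U["alice"], R["r1"], TU[t], TR[t]) for t in TU}
ranked = sorted(scores, key=lambda t: (-scores[t], t))
```

The score decomposes into a user-tag term and a resource-tag term with no direct user-resource interaction, so once the factors are learned a recommendation costs only a pair of dot products per candidate tag.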

4. EXPERIMENTAL EVALUATION

In this section we describe the methods used to gather and pre-process our six datasets. Following an outline of our methodology, we examine the results of our proposed linear weighted hybrid tag recommender along with its components and the pair-wise interaction tensor factorization model. Finally we draw some general conclusions.

4.1 Datasets

Our experiments are conducted using data from six large real-world social annotation systems. On all datasets we generate p-cores [8]. Users, resources and tags are removed from the dataset in order to produce a residual dataset that guarantees each user, resource and tag occurs in at least p annotations. We define an annotation to include a user, a resource, and every tag the user has applied to the resource. For the larger datasets we use 20-cores. In the smaller datasets 5-cores are used.

Several reasons exist to construct p-cores. By eliminating infrequent items, the size of the data is dramatically reduced, allowing the application of recommendation techniques that would otherwise be computationally impractical. By removing rarely occurring users, resources or tags, noise in the data can be dramatically reduced. Because of their scarcity, these are the very items likely to confound recommenders. Recommendation in the so-called long tail is a valid area of exploration, but it lies outside the scope of this paper.

Bibsonomy enables its users to annotate both URL bookmarks and journal articles. The dataset was gathered on 1 January 2009, encompassing the entire system. This dataset has been made available online by the system administrators [6]. They have pre-processed the data to remove anomalies. A 5-core was taken. It contains 13,909 annotations with 357 users, 1,738 resources and 1,573 tags.

Citeulike is a popular online tool used by researchers specifically to manage and catalog journal articles. The site owners make their dataset freely available to download. We use a snapshot taken as of 17 February 2009. Once a 5-core was computed, the remaining dataset contains 2,051 users, 5,376 resources, 3,343 tags and 105,873 annotations.

MovieLens is a dataset gathered from the corresponding MovieLens Web site and is administered by the GroupLens research lab at the University of Minnesota. It contains users, ratings of movies, and tags. A 5-core was generated from the data, resulting in 35,366 annotations with 819 users, 2,445 resources and 2,309 tags.

Delicious is a popular Web site in which users annotate URLs. On 19 October 2008, 198 of the most popular tags were taken from the user interface and the site was recursively explored. From 20 October to 15 December, the complete profiles of 524,790 users were collected. Due to memory and time constraints, 10% of the user profiles were randomly selected, and a 20-core was taken for experiments. The dataset is our largest, containing 7,665 users, 15,612 resources and 5,746 tags. It contains 720,788 annotations.

Amazon is one of the world’s largest retailers. The site includes a myriad of ways for users to express and discover opinions of the products: ratings, editorial reviews, customer reviews, product details, and customer purchasing habits. Recently, Amazon has added social tagging to this list. Beginning on 1 July 2009 we recursively explored the site to gather 1.5 million user profiles. Many users had extremely small profiles or used idiosyncratic tags. After taking a 20-core of the data it contained 498,217 annotations with 8,802 users, 10,679 resources and 5,559 tags.

LastFM users upload their music profile, create playlists and share their musical tastes online. We selected 100 random users from the system and recursively explored the “friend” network. Only about 20% of the users had annotated a resource. Users have the option to tag songs, artists or albums. The tagging data here is limited to album annotations. Experimentation on artist and song data reveals similar trends. A p-core of 20 was drawn from the data. It contains 2,368 users, 2,350 resources, 1,141 tags and 172,177 annotations.
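The p-core pruning described at the start of this section can be sketched as an iterative filter that repeats until no user, resource or tag falls below the threshold. This is an illustrative reading of the definition and not the authors' implementation; in particular, the function name `p_core` is hypothetical, and dropping an annotation once all of its tags have been pruned is an assumption.

```python
from collections import Counter

def p_core(annotations, p):
    """Iteratively prune until every user, resource and tag occurs in at
    least p annotations. `annotations` holds (user, resource, tags) records."""
    current = [(u, r, frozenset(ts)) for u, r, ts in annotations]
    while True:
        users = Counter(u for u, _, _ in current)
        resources = Counter(r for _, r, _ in current)
        tags = Counter(t for _, _, ts in current for t in ts)
        pruned = []
        for u, r, ts in current:
            if users[u] < p or resources[r] < p:
                continue  # drop annotations of infrequent users or resources
            kept = frozenset(t for t in ts if tags[t] >= p)
            if kept:      # assumption: an annotation with no tags left is dropped
                pruned.append((u, r, kept))
        if pruned == current:
            return pruned
        current = pruned

# Hypothetical toy data; a 2-core removes user c, resource r3 and tags y and z.
annotations = [
    ("a", "r1", {"x", "y"}),
    ("a", "r2", {"x"}),
    ("b", "r1", {"x"}),
    ("b", "r2", {"x", "z"}),
    ("c", "r3", {"x"}),
]
core = p_core(annotations, 2)
```

The loop is needed because removing one infrequent item can push others below the threshold, so a single filtering pass is not guaranteed to produce a valid core.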

4.2 Methodology

Each user’s annotations were divided equally among five folds. Four folds were used as training data to build the recommenders. The fifth was used to train the model parameters and ascertain the optimal weights of the components in the hybrids. The results of the fifth fold were then discarded and we performed four-fold cross validation on the remaining folds. The results were averaged over each user, then over the final four folds.

The recommenders are evaluated on their ability to recommend tags given a user-resource pair. The user and resource for each annotation were submitted to the recommenders and the recommenders returned a set of tags, $T_r$. These were then evaluated against the tags in the holdout annotation, $T_h$. Recall is a common metric for evaluating the utility of recommendation algorithms. It measures the percentage of items in the holdout set that appear in the recommendation set. Recall is a measure of completeness and is defined as:

$$recall = \frac{|T_h \cap T_r|}{|T_h|} \tag{9}$$

Precision is another common metric for measuring the usefulness of recommendation algorithms. It measures the percentage of items in the recommendation set that appear in the holdout set. Precision measures the exactness of the recommendation algorithm and is defined as:

$$precision = \frac{|T_h \cap T_r|}{|T_r|} \tag{10}$$

The recall and precision will vary depending on the size of the recommendation set. In the following experiments we present the metrics with recommendation sets of size one through ten.
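Equations (9) and (10) translate directly into set operations. The sketch below also includes an F-measure, which the text uses for weight tuning but does not define; the balanced harmonic mean (F1) is assumed here. The function names and toy data are hypothetical.

```python
def recall(holdout, recommended):
    """Equation (9): fraction of holdout tags that were recommended."""
    holdout, recommended = set(holdout), set(recommended)
    return len(holdout & recommended) / len(holdout) if holdout else 0.0

def precision(holdout, recommended):
    """Equation (10): fraction of recommended tags found in the holdout set."""
    holdout, recommended = set(holdout), set(recommended)
    return len(holdout & recommended) / len(recommended) if recommended else 0.0

def f_measure(holdout, recommended):
    """Balanced F-measure (assumed to be F1): harmonic mean of the two."""
    p, r = precision(holdout, recommended), recall(holdout, recommended)
    return 2 * p * r / (p + r) if p + r else 0.0

# Hypothetical holdout tags and a ranked recommendation list.
holdout = {"python", "web", "tutorial"}
ranked = ["python", "code", "web", "misc"]

# Evaluate at several recommendation set sizes, as in the experiments.
for n in range(1, len(ranked) + 1):
    top = ranked[:n]
    print(n, recall(holdout, top), precision(holdout, top), f_measure(holdout, top))
```

Sweeping n in this way shows the trade-off reported in the experiments: larger recommendation sets can only maintain or raise recall, while precision tends to fall.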

4.3 Experimental Results

In this section we offer some general observations about the experimental results reported in Figure 1. We then examine each dataset individually before offering a summary of our conclusions.

After tuning the variables we chose a k of 30 for all collaborative filtering techniques. The trend was for the recall and precision to increase steadily as k was increased and then to suffer from diminishing returns. PITF, the pair-wise interaction tensor factorization model, was built with 64 features and a learning rate of 0.03 [14]. It was trained until convergence. We experimented with 10 to 100 features. The results exhibited a sharp increase and then leveled out as the number of features approached 50.

The hybrid reported in Figure 1 is composed of the two popularity-based recommenders and four collaborative filtering recommenders. We have purposely constructed the hybrid with simple recommenders in order to permit insights into the datasets that might otherwise be obscured. By observing the importance of a component to the hybrid, we may infer the importance of the dimensions covered by that component. The composition of the hybrids is reported in Table 1.

The hybrids do not draw upon PITF. A motivation of this paper is to demonstrate that hybrid recommenders can integrate multiple dimensions of the data by exploiting simple components. If PITF had been included in the hybrid it would not be clear whether the success of the hybrid was owed to PITF or to the ability of the hybrid to produce a synergistic blend of its constituent parts. Instead, we report the PITF results because it represents the state-of-the-art tag recommender and therefore offers an important point of comparison. While not evaluated in this paper, experimentation has revealed that incorporating PITF into the hybrid produces a small improvement over both PITF and the linear weighted hybrid.

In all six datasets the hybrid outperforms its constituent parts, revealing that a linear weighted hybrid can exploit multiple dimensions of the data through its components. These components are not individually required to cover all dimensions of the data, and may instead focus on a particular dimension such as the relationship between tags and resources. When aggregated into a single framework, the components provide complementary information while maintaining their simplicity, speed and insights into the data.

The hybrid is competitive with PITF, often surpassing it. In MovieLens PITF proves marginally better. In Bibsonomy, Citeulike and LastFM the results are very similar. In Delicious and Amazon the hybrid is clearly superior. The difference between Delicious and Amazon versus the other datasets is the diversity of the user profiles. Citeulike and Bibsonomy users focus on their area of expertise. MovieLens and LastFM users gravitate toward particular genres of music and movies. In Delicious, however, the users are able to tag web pages from across the entire Internet. Consequently, the user profiles often contain numerous unrelated topics. Similarly, Amazon users do not restrict their annotations to particular categories. The user profiles reflect the diversity one might expect of a consumer visiting the world’s largest online retailer.

These diverse user profiles are difficult to characterize with a feature space model, the foundation of PITF. When recommending tags, PITF cannot draw upon particular features while ignoring others. PITF may recommend a tag not relevant to the particular context. In contrast, user-based and item-based collaborative filtering are able to focus on the most relevant parts of the user profile. User-based collaborative filtering only recommends tags applied to the query resource, narrowing the focus of the recommendation regardless of the diversity in the user profile. Item-based collaborative filtering techniques construct a neighborhood of resources from the user profile most similar to the query resource, effectively ignoring parts of the user profile that are not relevant to the immediate recommendation task. Our proposed linear weighted hybrid inherits the capacity to focus on specific aspects of the user profile.

The hybrid offers additional benefits. When constructed from simple yet fast components, the hybrid itself maintains these properties, offering a highly scalable and easily updatable solution for tag recommendation. It is possible to explain the results from the component recommenders and consequently the hybrid itself. In contrast, PITF is a black box with little explanatory capacity.

[Figure 1: experimental results. Recoverable panel titles: Bibsonomy, Citeulike. Axes are scaled in percent; recoverable legend entries: pop_u, pop_r.]