2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology

Tag Normalization and Prediction for Effective Social Media Retrieval

Ming-Hung Hsu
Department of Computer Science and Information Engineering
National Taiwan University
[email protected]

Hsin-Hsi Chen
Department of Computer Science and Information Engineering
National Taiwan University
[email protected]

Abstract

In this paper, we propose a tag normalization algorithm to unify users’ annotations. We also explore some general phenomena in a social annotation system and propose a supervised tag prediction model that predicts the stabilized tag set of a resource from a small number of user annotation records. The experiments show that a large portion of the stabilized tag set can be predicted, so it is feasible to reduce the number of user annotations required in applications of social annotations.

1. Introduction

With the progress of Web 2.0, a variety of services have turned web users from traditional information receivers into influential information propagators and even information sources. Social bookmarking sites [6] such as del.icio.us [11] and Flickr [12] provide such Web 2.0 services. Here we use social annotation to denote the annotation of web resources with freely chosen, open-ended strings as vocabulary. Strings used for annotation are referred to as tags. Each annotation of a resource by a user may introduce one or more tags for the resource. With such services, users can organize resources through their own tags; meanwhile, they can share interesting resources with each other through the same tag(s). In general, sharing and interaction between users occur implicitly, that is, without the users being aware of them.

Recently, applications and analyses of social annotations have attracted increasing attention. Dmitriev et al. [3] utilized social annotations for enterprise search. Russell [8] introduced a system to visualize variations of tags over time. Bao et al. [1] utilized the structural property and the keyword property to estimate the similarity between two terms and the quality of websites. Wu et al. [10] explored social annotations for semantic web construction. Golder and Huberman [4] analyzed the usage patterns in a social annotation system, based on popularly annotated URLs in del.icio.us. They found that the frequency of a tag is nearly a fixed proportion of the frequency sum of all tags after the first 100 or so annotations. Halpin et al. [5] further hypothesized a generative model to investigate how the tag distribution stabilizes as more and more annotations emerge. Their study showed that the distribution of high-frequency tags is a stabilized power law distribution.

When using social annotations, the main limitation is that sufficient annotations are required to attain the stabilized tag set. In particular, new resources need sufficient time to attract users to annotate them. Prediction of a stabilized tag set requires a complex model to capture how users infer tags from a resource. Some studies [5] indicate three important characteristics of a social annotation system: i) keywords in a resource are usually selected as tags by annotators; ii) similar resources are usually annotated with similar tags; iii) shared background knowledge and imitative annotation between users influence users’ annotation activities.

In this paper, we aim to predict the stabilized tag set for a resource from a small number of early user annotations. Besides the availability of stabilized tags, flexible tag naming also decreases the usability of users’ annotations. Hence unification of tags is necessary.

This paper is organized as follows. In Section 2, we show a semi-automatic procedure to normalize the tag strings. In Section 3, we propose our tag prediction algorithm. The algorithm constructs a tag-correlation graph from the training URLs, selects candidate tags from the text content of a target URL, combines and weights the selected candidate tags with a small number of early user-annotated tags, performs spreading activation [9] over the tag-correlation graph, and finally reports a predicted tag set. Section 4 shows the experimental setup. Section 5 discusses the results. Section 6 concludes the work.

978-0-7695-3496-1/08 $25.00 © 2008 IEEE DOI 10.1109/WIIAT.2008.92

2. Tag normalization

The major difference between social annotation systems and formal ontology-based classification systems is the greater malleability and adaptability of the former in organizing information. However, the flexible use of freely chosen, open-ended tag strings also causes polysemy, i.e., one tag may represent multiple meanings, and paraphrase, i.e., different tag strings refer to the same tag term. For example, two tag strings, web_2.0 and web2.0, refer to exactly the same tag term web 2.0. Here we formulate several rules to deal with the paraphrase problem.

2.1 Naming rules for tag terms

In our study, the annotation data are crawled from del.icio.us. When users make annotations, multiple tags are always separated by a blank space, so a tag string never contains a space. When a phrase, e.g., web development, is used as a tag, users employ the following naming rules to transform the phrase into a tag string.

A. Concatenate words by punctuation marks. For example, this rule transforms the phrase web development into web-development or web.development.
B. Direct concatenation. For example, the phrase web development is transformed into webdevelopment.
C. Abbreviation of terms combined with rule-A or rule-B. For example, this rule transforms the phrase web development into web-dev or webdev.
D. Exclusiveness between rule-A and rule-B. The two rules are almost never used concurrently in a tag string.

Besides the above naming rules, we also find a common behavior of web users: when a tag string transformed from a phrase is used for annotation, the words in the original phrase are usually used to annotate the target URL too, especially when the target URL is annotated by sufficiently many users.

2.2 A semi-automatic normalization procedure

The semi-automatic normalization procedure tries to group different tag strings denoting the same phrase under one identical tag term, called their canonical form. We transform tag strings into their canonical forms by applying the above tag naming rules in reverse. Some incorrect groupings may be generated when dealing with various abbreviations. We remove the incorrect groupings manually.

2.2.1 An automatic normalization procedure

The following procedure reversely transforms a tag string into its canonical form.

I. A tag string is split into chunks at punctuation marks.
II. For each chunk of a tag string, we check whether it is a known word in WordNet [7]. If there is any unrecognized chunk, the tag string is marked as unidentified. The canonical form of an identified tag string is the string of its words separated by blank spaces.
III. If a tag string that is not identified yet has only one chunk, we check whether it can be split into words by looking up WordNet. If such words can be found, the tag string is marked as identified and its canonical form is the found words separated by blank space(s).
IV. If the tag string ti is not yet identified by the above steps, we check whether the character sequence in ti is a subsequence of the canonical form of another tag tj. If it is, ti is considered an abbreviation of tj. If there is more than one possible tj, only the shortest one is considered. If tj is not identified yet (refer to step V), ti is used to cut tj into smaller chunks, whose concatenation by blank spaces is the new canonical form of tj. The canonical form of ti is then set to that of tj, and ti is marked as identified. For example, tag string 2.0 is a substring of web2.0. If web2.0 is not identified, we cut web2.0 into the chunks web and 2.0, and regard web 2.0 as the canonical form of both web2.0 and 2.0. This step is a reverse transformation of rule-C. However, incorrect grouping may occur occasionally. For example, wme might be grouped with windowsmediaplayer because the former is a subsequence of the latter.
V. If string ti is not identified yet, its canonical form is temporarily set to itself. Because it is marked as unidentified, its canonical form may be updated later at step IV.

We perform the above procedure on each annotated URL, and group together tag strings with the same canonical form. Two groups of tag strings annotated on different URLs are merged if the canonical form of one group is included in the other group. If a tag string occurs in more than one group, it is removed from all the groups in which it occurs, because it is an ambiguous tag string.

2.2.2 Manual filtering

We filter the incorrect groupings manually at the last step of tag normalization. This process effectively reduces the damage caused by the paraphrase problem without increasing the polysemy of tags. After filtering, tag strings in a group are treated as identical in our experiments. Table 1 shows several grouping examples.

Table 1. Groups of tag strings along with the corresponding canonical forms

| Canonical Form | Group of Tag Strings |
|---|---|
| common lisp | Commonlisp, clisp, common_lisp |
| drag and drop | drag-and-drop, draganddrop, dragndrop |
| online game | Onlinegames, online_game, olg |
| web 2.0 | web2.0, web_2.0, web2, web20, web_20 |
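Steps I to IV of the automatic procedure in Section 2.2.1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a small hand-written vocabulary (`VOCAB`) stands in for the WordNet lookup, all function names are hypothetical, and the chunking regex does not keep tokens such as 2.0 together, so only alphabetic examples are shown.

```python
import re

# Hypothetical stand-in for the WordNet lookup used in steps II and III.
VOCAB = {"web", "development", "drag", "and", "drop", "common", "lisp"}


def split_chunks(tag):
    """Step I: split a tag string into chunks at punctuation marks."""
    return [c for c in re.split(r"[^0-9a-z]+", tag.lower()) if c]


def segment(chunk, vocab):
    """Step III helper: split one chunk into known words, or return None."""
    if not chunk:
        return []
    # Try the longest known prefix first and backtrack if the rest fails.
    for end in range(len(chunk), 0, -1):
        if chunk[:end] in vocab:
            rest = segment(chunk[end:], vocab)
            if rest is not None:
                return [chunk[:end]] + rest
    return None


def canonical_form(tag, vocab=VOCAB):
    """Steps I-III: return the canonical form, or None if unidentified."""
    chunks = split_chunks(tag)
    if chunks and all(c in vocab for c in chunks):  # step II
        return " ".join(chunks)
    if len(chunks) == 1:  # step III
        words = segment(chunks[0], vocab)
        if words:
            return " ".join(words)
    return None  # left for step IV (abbreviation check) or step V


def is_subsequence(short, long):
    """Step IV test: do the characters of `short` occur in order in `long`?"""
    remaining = iter(long)
    return all(ch in remaining for ch in short)
```

In this sketch, `web-development`, `web.development` and `webdevelopment` all map to `web development`, while `webdev` stays unidentified until the subsequence test of step IV links it to that canonical form. The same test also accepts `wme` against `windowsmediaplayer`, which is exactly the kind of incorrect grouping that the manual filtering step is meant to remove.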

2.3 Preliminary statistics

Our dataset includes 122,157 unique URLs, 11,313,261 annotations and 694,033 unique tag strings. After the automatic normalization procedure, we collected 3,692 groups of tag strings (excluding tag strings not grouped with any others). Manual filtering took a graduate student majoring in computer science 10 to 60 seconds per group. Eventually, 3,212 groups remain after the manual filtering.

We define a common tag as a tag string used by two or more users. In our corpus, 15,934 URLs are in English and are sufficiently annotated, i.e., they received more than 100 annotations. For each of them, we count how many common tags there are in its annotations, and how many of these common tags occur in the text content of the URL; the latter are denoted as content tags hereafter. A group of tag strings along with its canonical form is treated as a content tag if any string in the group occurs in the URL’s content. The coverage ratio is the percentage of common tags that are content tags. The statistics are shown in Table 2. With normalization, the number of common tags decreases and the coverage ratio increases. A larger coverage ratio indicates that it is more feasible to select keywords in the URL text content as tags in the stabilized tag set.

Table 2. Coverage of common tags by the URL text

| | Without normalization | With normalization |
|---|---|---|
| Average # of common tags | 41.23 | 38.01 (-7.81%) |
| Average # of content tags | 16.01 | 16.32 (+1.94%) |
| Average coverage by URL text | 43.66% | 47.82% (+9.53%) |
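The coverage ratio of a single URL can be illustrated with a short Python sketch. The data layout (a mapping from a canonical form to its group of tag strings), the function name, the toy page text and the plain case-insensitive substring match are all assumptions made for illustration; the paper does not specify how occurrences in the URL text are matched.

```python
def coverage_ratio(common_tags, content):
    """Fraction of common tags with at least one surface string in the content.

    `common_tags` maps a canonical form to its group of tag strings
    (hypothetical layout). Matching is a case-insensitive substring
    search, which is an assumption of this sketch.
    """
    if not common_tags:
        return 0.0
    text = content.lower()
    hits = 0
    for canonical, group in common_tags.items():
        surface_forms = {canonical, *group}
        if any(form.lower() in text for form in surface_forms):
            hits += 1
    return hits / len(common_tags)


# Toy page text, invented for illustration.
page = "An introduction to Java programming for Web 2.0 sites."

# Without normalization each tag string stands alone.
raw = {"web2.0": set(), "java": set(), "reference": set(), "programming": set()}

# With normalization, web2.0 belongs to the group whose canonical form is "web 2.0".
normalized = {
    "web 2.0": {"web2.0", "web_2.0"},
    "java": set(),
    "reference": set(),
    "programming": set(),
}

raw_ratio = coverage_ratio(raw, page)
normalized_ratio = coverage_ratio(normalized, page)
```

In this toy case the string `web2.0` does not occur in the page, but its canonical form `web 2.0` does, so grouping raises the coverage, in line with the direction of the change reported in Table 2.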

Figure 1. Framework of the proposed algorithm

3. Tag prediction algorithm

3.1 Framework

Figure 1 shows the framework of our tag prediction algorithm. The prediction algorithm is composed of two parts. In the first part (Section 3.2), we select candidate tags from the terms in the text content of a testing URL. In the second part (Sections 3.3 and 3.4), we use the training data to compute the correlations for tag pairs and construct a tag-correlation graph. Early annotations for the testing URL are combined with the candidate tags selected in the first part. Spreading activation then proposes the tags highly correlated to those in the combined set as the predicted stabilized tags. Section 3.5 shows a restricted variation of this model.

3.2 Content tag selection

Equation 1 is employed to score terms in a document (URL text content) for content tag selection. The basic idea of the scoring function follows the statistical translation model in information retrieval [2], which estimates the probability that a query would be generated as a translation of a document.

$$\mathrm{CTScr}(t_i \mid d) = \sum_{term_j \in d} \log\left( N \times \max\left( P_C(t_i \mid term_j), \frac{1}{N} \right) \right) \qquad (1)$$

PC(ti|termj) denotes the probability of ti being a stabilized tag, i.e., a tag included in the stabilized tag set, when ti and termj co-occur in the same document. N denotes the number of documents in the training data. PC(ti|termj) is estimated by Maximum Likelihood Estimation (MLE) over the training data. Assume Dj is the set of documents where termj occurs and Dj,i is the set of documents in Dj with ti as a stabilized tag. Let PC(ti|termj) = |Dj,i| / |Dj|. If PC(ti|termj) is not 0, then it is equal to or larger than 1/N, so the maximum in Equation 1 only replaces zero probabilities, and the corresponding terms contribute log 1 = 0 to the score. All terms in the document are ranked according to their scores.

3.3 Construct tag-correlation graph

A tag-correlation graph is a network in which nodes represent tags and the edge(s) between two nodes represent their correlation. When a tag ti usually co-occurs with a tag tj in the stabilized tag sets of the training data, the reverse condition, i.e., tj usually co-occurs with ti, does not necessarily hold. Equation 2 defines PT(ti|tj), an asymmetric correlation representing the strength of the edge from tj to ti in the directed tag-correlation graph. PT(ti|tj) is estimated by MLE over the training data.

$$P_T(t_i \mid t_j) = \frac{\#\text{ of stabilized tag sets containing } t_i \text{ and } t_j}{\#\text{ of stabilized tag sets containing } t_j} \qquad (2)$$

3.4 Spreading activation (SA)

Graph algorithms for information retrieval such as spreading activation (SA) [9] are common and effective for finding highly correlated entities when a graph describing the correlation between entities is available. The basic idea of SA can be explained by a natural phenomenon. When we drop a stone into a pond, the oscillation on the surface transfers energy to its neighborhood, and its amplitude becomes smaller and smaller due to water resistance. Equation 3 and Equation 4 show how a node (i.e., a tag) ti gains energy from its neighbors.

$$E(t_i) = \lambda \times E(t_i) + (1 - \lambda) \times \sum_{t_j \in Neighbor(t_i)} \left( E(t_j) \times W(t_j, t_i) \right) \qquad (3)$$

$$W(t_j, t_i) = \frac{P_T(t_i \mid t_j)}{\sum_{t_k \in Neighbor(t_j)} P_T(t_k \mid t_j)} \qquad (4)$$
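Equations 1 to 4 can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: the function names, the data layout (sets of terms and sets of stabilized tags per training document), the toy training data and the synchronous update of all energies within one iteration are assumptions. Candidate tags are taken to be the terms of the document itself, following the statement that all terms in the document are ranked.

```python
import math
from collections import defaultdict


def content_tag_scores(doc_terms, train_docs):
    """Equation 1: score each term of the document as a candidate tag.

    `train_docs` is a list of (terms, stabilized_tags) pairs of sets.
    """
    n = len(train_docs)
    scores = {}
    for tag in doc_terms:
        total = 0.0
        for term in doc_terms:
            d_j = [tags for terms, tags in train_docs if term in terms]
            p_c = sum(tag in tags for tags in d_j) / len(d_j) if d_j else 0.0
            total += math.log(n * max(p_c, 1.0 / n))
        scores[tag] = total
    return scores


def tag_correlation(stabilized_sets):
    """Equation 2: p_t[tj][ti] = #sets with ti and tj / #sets with tj."""
    count = defaultdict(int)
    pair = defaultdict(lambda: defaultdict(int))
    for tags in stabilized_sets:
        for tj in tags:
            count[tj] += 1
            for ti in tags:
                if ti != tj:
                    pair[tj][ti] += 1
    return {tj: {ti: c / count[tj] for ti, c in nbrs.items()}
            for tj, nbrs in pair.items()}


def spread(energy, p_t, lam=0.8, iterations=6):
    """Equations 3 and 4, with all nodes updated synchronously (assumption)."""
    # Equation 4: normalize the outgoing correlations of each node tj.
    w = {tj: {ti: p / sum(nbrs.values()) for ti, p in nbrs.items()}
         for tj, nbrs in p_t.items()}
    e = dict(energy)
    for _ in range(iterations):
        nxt = defaultdict(float)
        for tj, value in e.items():
            nxt[tj] += lam * value  # energy kept by tj
            for ti, weight in w.get(tj, {}).items():
                nxt[ti] += (1.0 - lam) * value * weight  # energy sent to ti
        e = dict(nxt)
    return e


# Toy training data, invented for illustration.
train_docs = [
    ({"java", "code"}, {"java", "programming"}),
    ({"java", "web"}, {"java", "programming", "web"}),
    ({"web", "css"}, {"web", "design"}),
]
scores = content_tag_scores({"java", "code"}, train_docs)
p_t = tag_correlation([tags for _, tags in train_docs])
energy = spread({"java": 1.0}, p_t, lam=0.8, iterations=1)
```

On the toy data, PT(design|web) is 0.5 while PT(web|design) is 1.0, which shows the asymmetry of the correlation. Because the weights of Equation 4 sum to 1 over the neighbors of each node, the update of Equation 3 redistributes energy without changing its total as long as every node has at least one neighbor.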

At each iteration, node tj keeps a portion (λ) of its energy, propagates the remaining portion (1 − λ) to its neighbors, and gains some energy from its neighbors too. We perform SA for a fixed number of iterations, and the tags with the highest energies are eventually proposed as the predicted stabilized tags. A tag is an activation origin if it is a selected candidate content tag or an early user-annotated tag. The energy of every tag other than the activation origins is initialized to 0. For each candidate content tag ti selected from the document d, its initial energy E(ti) is its content selection score CTScr(ti|d) normalized by the highest content selection score of the terms co-occurring with ti in d. For each early user-annotated tag, the initial energy is the total number of times the early users annotated the URL with it. For a content tag also annotated by early users, the initial energy is the sum of the two values.

3.5 A restricted variation

A general perspective on Equation 1 is that the selection score of ti is contributed by the context termj. Without any restriction, PC(ti|termj) in Equation 1 would favor a tag ti over another tag tk if ti is more frequently annotated in the training URLs, even though tk may be more semantically and significantly related to termj than ti. For example, let ti be web, tk be java and termj be programming. A similar bias also occurs in Equation 2. Suppose P(ti) is the probability that tag ti occurs, estimated by the frequency of ti in the stabilized tag sets of the training data. As a variation of our model that avoids this bias, a threshold α·P(ti) restricts the selection score CTScr(ti|d) to contributions from significantly correlated terms. That is, if the value of PC(ti|termj) estimated by MLE is not higher than α·P(ti), then PC(ti|termj) is treated as 0 in Equation 1. Similarly, the same threshold is used for PT(ti|tj) in Equation 2. α is set to 2 in our experiments.

4. Experiment setup

As mentioned in Section 2.3, a total of 15,934 URLs form the sufficiently annotated set. A total of 2,000 URLs are randomly selected from this set as the testing data, and another 1,000 randomly selected URLs form the validation set for parameter tuning. The remaining sufficiently annotated URLs are used for training. We enlarge the training data with 45,156 more URLs annotated by at least 20 users. There are 114,787 unique tag strings in the training data before tag normalization, and 33,484 tag terms remain after normalization. For each URL in the training and testing sets, the stabilized tag set consists of the 25 common tags with the highest frequencies. We use MAP, R-Precision and Recall to evaluate the performance of our algorithm. In the experiments, the 50 tags with the highest scores are proposed as the stabilized tags.

5. Experiment results

5.1 Performance of the tag prediction

5.1.1 Performance of combined tag set

Table 3 shows the performance of combining the result of content tag selection with the result of 5 early user annotations. The combination is equivalent to the state just before SA. CTScr denotes content tag selection only and CTScr-V denotes the restricted variation of CTScr. User-5 denotes the tag sets of the earliest 5 user annotations.

Table 3. Performance of combining the content tag selection function with early user annotations

| Method | R-Precision | Recall | MAP |
|---|---|---|---|
| CTScr | 0.2687 | 0.3569 | 0.1768 |
| CTScr-V | 0.2164 | 0.3026 | 0.1333 |
| User-5 | 0.2359 | 0.2343 | 0.2077 |
| CTScr + User-5 | 0.3972 | 0.4733 | 0.3439 |
| CTScr-V + User-5 | 0.3448 | 0.4184 | 0.2958 |

The preliminary statistics in Table 2 show that the upper bound of CTScr in R-Precision is 0.4782. The proposed content tag selection method, at 0.2687, achieves approximately 56% of the upper bound, while the variation model achieves only 45%. Though we consider the tags selected by CTScr-V more suitable to represent the implicit topics of the target URL, the result shows that favoring tags that are frequent in the training data achieves better performance when tags are selected from the URL text only. Moreover, combining CTScr with early user annotations significantly improves the performance, which shows that CTScr complements the early user annotations. CTScr tends to give high scores to candidate tags that frequently occur in the training data. Users, especially those who are eager to make early annotations, tend to annotate the target URL with: i) impressive and singular terms, which may seldom occur in the training data; or ii) tags that do not occur in the content but capture the implicit topic of the target URL. The results in Table 3 also indicate that a large portion of the stabilized tag set can be predicted with a small number of early annotations, i.e., around 40% is predictable with only 5 user annotations.
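The evaluation measures used in the experiments (R-Precision, Recall and MAP) can be sketched as follows for a single URL. The paper does not spell out their definitions, so this sketch assumes the standard information retrieval definitions, with the stabilized tag set as the relevant set and the ranked list of proposed tags as the retrieved list; the function names and toy data are hypothetical. MAP is then the mean of `average_precision` over all testing URLs.

```python
def r_precision(ranked, relevant):
    """Precision within the top R proposed tags, where R = len(relevant)."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(tag in relevant for tag in ranked[:r]) / r


def recall(ranked, relevant, k=50):
    """Fraction of the stabilized tags found among the top k proposed tags."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def average_precision(ranked, relevant):
    """Mean of the precision values at the ranks of the correct tags."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, tag in enumerate(ranked, start=1):
        if tag in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Toy example, invented for illustration.
proposed = ["java", "web", "programming", "design"]
stabilized = {"java", "programming"}

rp = r_precision(proposed, stabilized)
rc = recall(proposed, stabilized)
ap = average_precision(proposed, stabilized)
```

Under the experimental setup, `relevant` would hold the 25 most frequent common tags of a URL and `ranked` the 50 highest-scoring proposed tags, so R-Precision looks at the first 25 proposals while Recall looks at all 50.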

5.1.2 Performance of spreading activation

Figure 3 shows the performance in R-Precision on the validation set under different parameter settings, i.e., λ in Equation 3 and the number of iterations in SA. Here the activation origins are initialized by User-5 combined with CTScr or with CTScr-V. λ=0.5-V represents the performance of combining User-5 with CTScr-V with λ set to 0.5, and λ=0.8-V is interpreted similarly. Figure 3 shows that for the same model, the peak performances with different values of λ are very similar. For example, the peak performances for CTScr-V are 0.5113 and 0.5102 when λ is set to 0.5 and 0.8, respectively. The curve for a larger λ changes more gradually. For CTScr with λ set to 0.8, the peak performance is obtained after 6 iterations. For CTScr-V with λ set to 0.8, the peak performance is obtained after 8 iterations. These parameter settings are later applied to the experiments on the testing data.

Figure 3. Performance of tag prediction with different parameters on the validation set. (Plot of R-Precision, from 0.2 to 0.55, against the number of SA iterations, from 0 to 15, with one curve each for λ=0.5, λ=0.8, λ=0.5-V and λ=0.8-V.)

In the experiments on the testing set, we set λ to 0.8. The performance in R-Precision of CTScr is improved from 0.3972 to 0.4459 after 6 iterations of SA. For CTScr-V, the performance in R-Precision is improved from 0.3448 to 0.5070 after 8 iterations, a 47% relative improvement. With SA, CTScr-V significantly outperforms CTScr, indicating that selected content tags that are more semantically related to the implicit topic of the target URL are more appropriate as activation origins for finding correlated tags.

5.2 Result analysis and discussion

Further investigating the predicted and the stabilized tags, we categorize the tags into three types according to the relations between the tags and the target URL.

The first type is called topic-description, since a large portion of the stabilized tags describe the implicit topic of the target URL, e.g., tags such as ‘programming’ and ‘java’ for a URL introducing java programming. Tags of this type are highly correlated to the terms in the URL content, and correlated to some other tags of the same type too. A majority of the tags correctly predicted by our method belong to this type.

The second type is called function-related, since tags of this type describe, or are conceptually related to, the function of the target URL. For example, consider a URL that provides a service to rank professors in universities. One of its stabilized tags is ‘web 2.0’, which does not occur in the text content. Many users annotate this website with the term because the website provides services for sharing resources and information between users, one of the important ideas introduced by web 2.0. Tags of this type are difficult to predict because, although some of them are highly correlated to other tags, most of them have no evident correlation with terms in the text content.

The third type is called personal-use, because tags of this type are more user-dependent than resource-dependent. Some tags of this type seem to occur in the stabilized tag set arbitrarily. For example, ‘read’ and ‘reference’ are two very commonly used tags that remind users to read or to refer to the target URL. Tags of this type frequently occur in the training data and have imprecise correlations with other terms or tags. Therefore, they might be incorrectly proposed and decrease the performance.

6. Conclusions and future works

In this paper, we propose a prediction model for social annotation. The experiment results show that our tag prediction model is able to predict a considerably large portion of the stabilized tag set with only 5 user annotations. As a preliminary work on the prediction of social annotations, we also categorized the tags into three types, each of which indicates a direction for improving our model.

7. Acknowledgement

Research of this paper was partially supported by Excellent Research Projects of National Taiwan University (97R0062-04) and Microsoft Research Asia.

8. References

[1] Bao, S. et al. Optimizing Web Search Using Social Annotations. In Proceedings of the 16th International Conference on World Wide Web. ACM Press, New York, 2007, pp. 501-510.
[2] Berger, A., Lafferty, J. Information Retrieval as Statistical Translation. In Proceedings of the 22nd Annual International ACM SIGIR. ACM Press, New York, 1999, pp. 222-229.
[3] Dmitriev, P.A. et al. Using Annotations in Enterprise Search. In Proceedings of the 15th International Conference on World Wide Web. ACM Press, New York, 2006, pp. 811-817.
[4] Golder, S.A., Huberman, B.A. Usage Patterns of Collaborative Systems. Journal of Information Science, 32(2), 2006, pp. 198-208.
[5] Halpin, H. et al. The Complex Dynamics of Collaborative Tagging. In Proceedings of the 16th International Conference on World Wide Web. ACM Press, New York, 2007, pp. 211-220.
[6] Hammond, T. et al. Social Bookmarking Tools (I): A General Review. D-Lib Magazine, 11(4), 2005.
[7] Miller, G.A. WordNet: An On-line Lexical Database. Special Issue of International Journal of Lexicography, 1990.
[8] Russell, T. Folksonomy over Time. In Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries. ACM Press, New York, 2006, pp. 364-364.
[9] Salton, G., Buckley, C. On the Use of Spreading Activation Methods in Automatic Information Retrieval. In Proceedings of the 11th Annual International ACM SIGIR. ACM Press, New York, 1988, pp. 147-160.
[10] Wu, X. et al. Exploring Social Annotations for the Semantic Web. In Proceedings of the 15th International Conference on World Wide Web. ACM Press, New York, 2006, pp. 417-426.
[11] http://del.icio.us/
[12] http://www.flickr.com/
