2009 World Congress on Computer Science and Information Engineering

On the Effectiveness of Collaborative Tagging Systems for Describing Resources

Jinsheng Xu¹, Christo Dichev², Albert Esterline¹, Darina Dicheva², Jinghua Zhang²
¹ Department of Computer Science, North Carolina A&T State University, {jxu, esterlin}@ncat.edu
² Department of Computer Science, Winston-Salem State University, {dichevc, dichevad, zhangji}@wssu.edu

978-0-7695-3507-4/08 $25.00 © 2008 IEEE DOI 10.1109/CSIE.2009.465

Abstract

This article investigates the effectiveness of community-generated tags as social descriptors of resources that community members annotate without coordination. Our goal is to demonstrate practically that the aggregated tags applied to resources by the entire community define resource meaning reasonably well. This would allow using them for calculating semantic distance between resources. To test our hypothesis, we analyzed a large amount of data downloaded from del.icio.us. To this end, we developed an algorithm for searching for ‘similar’ URLs based on the similarity of their aggregated tag vectors, which allowed us to identify clusters of similar resources. Our experimental findings demonstrate that massive tagging of resources leads to resource meanings that are defined bottom-up, and they prove the effectiveness of collaborative tagging systems for describing resources.

1. Introduction

Tagging systems are a popular means for sharing resources. An important characteristic of this kind of repository is the freedom in choosing the vocabulary that can be used to tag resources. The primary use of tags is for organizing and collocating information resources. Tags can also be seen as a means for social interaction, helping users to share and discover new resources. Although tagging is an individual, transparent process for organizing and categorizing resources, it results in an implicit social process of indexing and knowledge construction. Users share their resources through their tags, forming an aggregated tag-index known as a folksonomy. Despite the fact that users tag their resources in an uncoordinated, anarchic, and noisy fashion [1], the collective activity of tagging generates essentially composite meanings associated with the tagged resources. Through tagging, users collectively provide an implicit link between resources.

While the early research on collaborative tagging focused on structuring, dynamics, and vocabulary evolution [2, 3, 4], recent publications [1, 5, 6, 7] indicate interest in studying semantic relations among resources. There are, however, few reports on the pragmatic aspects of these implicit semantic relationships and their utilization. In this paper, we demonstrate practically that meanings can be attached to resources based on the aggregated tags applied to them by the user community. The insight is that, if such meaning proves practical, it can be exploited as a semantic distance between resources. If two resources are close, then they are consistently tagged with shared tags by many users, and thus they are commonly and collectively classified in the same or related implicit categories. Our experimental findings also show that tagging is not only a method for organizing content for those who create it but also a powerful mechanism for users to discover similar or related resources. Despite reports [8] that users choose tags in a pattern consistent with personal information management goals rather than as a result of social influence, our study indicates that massive tagging of resources leads to extremely useful bottom-up meanings for resources.

Tagging and the creation of folksonomies in general can be viewed as the bottom-up development of ontologies. In the tradition of the Semantic Web [10], an ontology is a conceptualization of a domain specified in a particular modeling language with a particular set of terms. The modeling languages include most notably RDF and the sublanguages of OWL. These languages use URIs as identifiers, and commonly the URIs are URLs of resources accessible on the web. Thus, ontologies formulated in these languages also associate meanings with URLs. Such top-down ontologies, however, come with a framework of classes, subclasses, properties, and relations, and they are typically formulated as a standard by an organization. Tagging, in contrast, is completely unstructured and ungoverned, yet it involves meaning since it groups resources into classes and associates tags in ways predictive of inclusion relations among the classes. Semantic lexicons, such as FrameNet and WordNet, have been used as references in developing top-down ontologies and might give insight into bottom-up ontologies insofar as words with their normal meanings are used as tags. In fact, however, tagging can escape the bounds of established language, requires no organization but only the commitment of the users, and allows meaning to emerge to fit the task at hand.

Li et al. [5] studied whether tags capture the main concepts associated with resources. Their method was to compare the tags with the tf and tf×idf keywords. They found that, for 90% of 7000 randomly sampled resources, tags cover more than 90% of the top 40 tf×idf keywords. Brooks et al. [9] measured the effectiveness of tags by clustering blog articles based on tags. They measured the similarity of clusters using tf×idf keyword-based comparison and showed that tag-based clustering performs only a little better than random clustering. In these articles, tf×idf keyword-based methods are used as standards for measuring the effectiveness of tags. Although such methods do provide important insights into tag effectiveness, they put tags at a disadvantage. Tags and extracted keywords are totally different types of data, so they should be compared fairly.

In this paper, we compare the performance of similar-URL search based on tags with keyword-based similar-page search. Based on several cases, we found that tag-based search generates much more accurate results. The reason is likely to be the human ability to extract abstract concepts from resources, and the results of such collaborative and collective efforts are much better in describing the resources than automated keyword extraction.

The remainder of the paper is organized as follows. In Section 2, we describe how we collected data and the method we used in comparing URLs. In Section 3, we present the results of similar URL search. Finally, in Section 4, we describe key statistics in our data set.

2. Method

2.1 Data Collection

The data set for our study consists of summaries of URLs downloaded from del.icio.us. Each summary includes the top ten tags for the URL and the number of times each tag is used for this URL. Starting from a single popular URL, our crawler gets information about the users who tagged that URL and then downloads the URL list of each user. Repeating this process, we downloaded about 764,000 URLs with their summaries. All data, including information about users, are stored in a MySQL database that contains four tables with over 15 million records.

2.2 Algorithm for Searching Similar Resources

We define a tag space $T$ and a resource space $R$ as a set of tags and a set of resources, respectively:

$$T = \{\, t_i \mid 1 \le i \le n \,\}, \qquad R = \{\, r_i \mid 1 \le i \le m \,\}$$

We further define a Resource-Tag matrix $M_{R\text{-}T}$ as follows:

$$M_{R\text{-}T} := (b_{i,j})_{m \times n}$$

where the value of $b_{i,j}$ represents the number of times resource $r_i$ is tagged by tag $t_j$. The values of $b_{i,j}$ are acquired from the data set and are non-negative integers. The tag vector for resource $k$ is the $k$-th row of this matrix:

$$V_k = [\, b_{k,1}, b_{k,2}, \ldots, b_{k,n} \,]^T$$

The similarity of two resources is computed as the dot product of their normalized tag vectors, that is, their cosine similarity. Thus the similarity of resource $i$ and resource $j$ is:

$$S_{ij} = \frac{V_i \cdot V_j}{|V_i|\,|V_j|}$$

The similarity value $S_{ij}$ ranges from 0 to 1 because the tag counts are non-negative and the vectors are normalized. A larger value means greater similarity. To find the resources most similar to resource $k$, i.e., the resources most related to it, we simply need to calculate and sort $(S_{k1}, S_{k2}, \ldots, S_{km})$ and pick a certain number of resources from the top of the sorted list. In practice, building the matrix of all URLs and tags requires too much memory. Instead, we queried the database and selected only those resources that share one or more tags with resource $k$. Resources that share no tag with resource $k$ have similarity 0, so this restriction does not change the top of the ranking.

3. Experiments

3.1 Search for URLs similar to http://www.shopping.com/

Table 1 shows the tag vector for this website, which consists of its top 10 tags. Table 2 shows the top 10 similar URLs found in our database with the corresponding scores, which are the dot products of normalized tag vectors. Each of these top 10 websites is indeed very similar to http://www.shopping.com/, which is a web site about searching for and comparing prices for shopping. Many more closely related URLs are found in the database. For example, http://www.mpire.com/ ranks about 30th, with a score of 0.90. For comparison, we ran a similar-page search for the website on google.com. It returned only about 25 web pages, including websites that were not so similar, such as several eBay sites and USA Weekend magazine. Some of the top URLs in our search result do not appear in Google’s results. The main eBay website, http://www.ebay.com/, ranks only 399th in our similar search for http://www.shopping.com/, with a score of 0.84. The top tags for eBay are shopping, auction, shop, ebay, online, sell, auctions, deals, business, and compras.
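The tag-vector search can be illustrated with a minimal Python sketch. It assumes tag vectors are stored as sparse dictionaries (tag to count) rather than as rows of a full Resource-Tag matrix, and it uses an in-memory inverted index in place of the database query that selects resources sharing a tag with the query. The shopping.com counts are the ones reported in Table 1; the three `example.com` URLs and their counts are hypothetical and exist only to make the sketch runnable.

```python
import math
from collections import defaultdict


def cosine_similarity(u, v):
    """Cosine similarity of two sparse tag vectors (dict: tag -> count)."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    dot = sum(count * v.get(tag, 0) for tag, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query_url, tag_vectors, top_k=10):
    """Rank the resources that share at least one tag with query_url."""
    # Inverted index: tag -> set of URLs carrying that tag.
    index = defaultdict(set)
    for url, vec in tag_vectors.items():
        for tag in vec:
            index[tag].add(url)

    query = tag_vectors[query_url]
    candidates = set()
    for tag in query:
        candidates |= index[tag]
    candidates.discard(query_url)

    scored = [(cosine_similarity(query, tag_vectors[u]), u) for u in candidates]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_k]


TAG_VECTORS = {
    # Counts taken from Table 1.
    "http://www.shopping.com/": {
        "shopping": 448, "comparison": 153, "search": 84, "compare": 81,
        "deals": 72, "shop": 67, "price": 49, "review": 32,
        "bargain": 32, "ecommerce": 18,
    },
    # Hypothetical resources, for illustration only.
    "http://example.com/price-compare": {
        "shopping": 120, "comparison": 40, "price": 25, "deals": 10,
    },
    "http://example.com/auctions": {"shopping": 50, "auction": 60, "ebay": 30},
    "http://example.com/css-tips": {"css": 90, "webdesign": 45},
}

for score, url in most_similar("http://www.shopping.com/", TAG_VECTORS):
    print(f"{score:.2f}  {url}")
```

Because a resource with no shared tag has a zero dot product with the query, skipping it changes nothing in the ranking; the inverted index only avoids computing scores that would be 0.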

| Tag | Count |
|---|---|
| shopping | 448 |
| comparison | 153 |
| search | 84 |
| compare | 81 |
| deals | 72 |
| shop | 67 |
| price | 49 |
| review | 32 |
| bargain | 32 |
| ecommerce | 18 |

Table 1. Top 10 tags for http://www.shopping.com/

| Rank | URL | Score |
|---|---|---|
| 1 | http://www.shopzilla.com/ | 0.99 |
| 2 | http://www.mysimon.com/ | 0.99 |
| 3 | http://www.pricegrabber.com/ | 0.98 |
| 4 | http://www.bizrate.com/ | 0.98 |
| 5 | http://www.nextag.com/ | 0.97 |
| 6 | http://www.pricescan.com/ | 0.97 |
| 7 | http://www.pronto.com/ | 0.96 |
| 8 | http://www.dealtime.com/ | 0.96 |
| 9 | http://www.pricespider.com/ | 0.95 |
| 10 | http://www1.bottomdollar.com/ | 0.95 |

Table 2. Top 10 related URLs for http://www.shopping.com/

3.2 Similar URL search for http://arxiv.org/ftp/cs/papers/0508/0508082.pdf

Our next example is a search for similar URLs for a research paper entitled “The Structure of Collaborative Tagging Systems”. Table 3 shows the top 10 tags for the URL. Resources ranked 3, 4, and 6 in Table 4 are tagged fewer than 10 times, so their tag vectors may not describe them accurately. Google returned 21 similar URLs, and some of them appeared in our top 500 similar URLs. For example, Google’s top result, http://en.wikipedia.org/wiki/Folksonomy, ranks 34th, with a score of 0.72; its 2nd result, http://www.adammathes.com/academic/computermediated-communication/folksonomies.html, ranks 58th, with a score of 0.65; and its 3rd result, http://rashmisinha.com/2005/09/27/a-cognitiveanalysis-of-tagging/ (corrected link), ranks 64th, with a score of 0.64.

| Tag | Count |
|---|---|
| tagging | 61 |
| folksonomy | 27 |
| collaborative | 21 |
| web 2.o | 20 |
| research | 16 |
| delicious | 14 |
| paper | 13 |
| internet | 9 |
| social | 7 |
| del.icio.us | 7 |

Table 3. Top 10 tags for http://arxiv.org/ftp/cs/papers/0508/0508082.pdf

| Rank | URL | Score |
|---|---|---|
| 1 | http://lsj.lishost.org/index.php/lsj/article/viewArticle/45/58 | 0.89 |
| 2 | http://www.danah.org/papers/Hypertext2006.pdf | 0.88 |
| 3 | http://km.aifb.unikarlsruhe.de/ws/swkm2008/yeungetal.pdf | 0.87 |
| 4 | http://theshiftedlibrarian.com/archives/2008/01/30/tag-clouds-arent-just-forfolksonomies-anymore.html | 0.86 |
| 5 | http://eprints.rclis.org/archive/00008315/01/KippCampbellASIST.pdf | 0.86 |
| 6 | http://www.educationau.edu.au/jahia/webdav/site/myjahiasite/shared/papers/arkhayman.pdf | 0.84 |
| 7 | http://www.pewinternet.org/pdfs/PIP_Tagging.pdf | 0.84 |
| 8 | http://www.asis.org/Bulletin/Oct07/neal.html | 0.83 |
| 9 | http://eprints.rclis.org/archive/00005703/ | 0.83 |
| 10 | http://infotangle.blogsome.com/2005/12/07/the-hive-mind-folksonomiesand-user-based-tagging | 0.82 |

Table 4. Top 10 related URLs for http://arxiv.org/ftp/cs/papers/0508/0508082.pdf

3.3 Similar URL search based on Tag Vectors and Extracted Keyword Vectors

We extracted keywords from a large sample of websites and compared the similarity scores of a large number of pairs of URLs based on tags and extracted keywords. The correlation of similarity scores generated by tag vectors and keyword vectors on the same set of web pages is only 0.19. This weak correlation shows that it is unlikely that both are good measures for comparing web pages. Table 5 shows five of the most similar pairs of web pages discovered by keyword vectors.

| Web Page 1 | Web Page 2 | Keyword-Based Score | Tag-Based Score |
|---|---|---|---|
| http://www.clazh.com/top-best-free-wordpressthemes-templates/ | http://www.dailywp.com/ | 0.68 | 0.95 |
| http://www.allgraphicdesign.com/graphicsblog/2007/11/21/cool-flyersposters-leaflets-greatflyer-design-inspiration/ | http://www.posttypography.com/ | 0.58 | 0.57 |
| http://www.methods.co.nz/popup/popup.html | http://robertsabuda.com/popmakesimple.asp | 0.57 | 0.11 |
| http://www.pagat.com/ | http://www.mostfungames.com/ | 0.57 | 0.7 |
| http://www.faqts.com/knowledge_base/index.phtml/fid/199 | http://www.pythonchallenge.com/ | 0.57 | 0.83 |

Table 5. Top 5 most similar web pages based on keywords
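Section 3.3 compares two lists of similarity scores computed for the same URL pairs. The sketch below shows how such lists can be summarized by a correlation coefficient, a mean, and a standard deviation. It assumes the Pearson coefficient and the population standard deviation, since the text does not say which variants were used. The five score pairs are the ones listed in Table 5; they only illustrate the calculation and are not the sample behind the reported correlation of 0.19.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / math.sqrt(var_x * var_y)


# Keyword-based and tag-based scores of the five URL pairs in Table 5.
keyword_scores = [0.68, 0.58, 0.57, 0.57, 0.57]
tag_scores = [0.95, 0.57, 0.11, 0.70, 0.83]

print("correlation:", round(pearson(keyword_scores, tag_scores), 3))
for name, scores in (("keyword", keyword_scores), ("tag", tag_scores)):
    print(name,
          "mean:", round(statistics.mean(scores), 3),
          "std:", round(statistics.pstdev(scores), 3))
```

A coefficient near 0 on a large sample would mean that knowing the keyword-based score of a pair says little about its tag-based score, which is the sense in which the two measures disagree.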


Table 6 shows five of the most similar web page pairs discovered by tag vectors.

| Web Page 1 | Web Page 2 | Keyword-Based Score | Tag-Based Score |
|---|---|---|---|
| http://www.sidereel.com/_home | http://www.watchtvsitcoms.com/gateway.html?s=http%3A//www.watchtvsitcoms.com/index.html | 0.14 | 0.98 |
| http://www.clazh.com/top-best-free-wordpressthemes-templates/ | http://www.dailywp.com/ | 0.68 | 0.95 |
| http://www.adrianjohnson.org.uk/home/ | http://www.adamcruickshank.com/ | 0.03 | 0.93 |
| http://getxpad.com/ | http://journler.com/ | 0.06 | 0.91 |
| http://www.mosso.com/index.jsp | http://www.hostgator.com/ | 0.22 | 0.9 |

Table 6. Top 5 most similar web pages based on tags

Our experience is that tag-vector-based web page comparison is more faithful than keyword-based comparison. The mean similarity of the sample is 0.028 for tag-based comparison and 0.047 for keyword-based comparison. However, tag-based comparison has a much greater standard deviation than the keyword-based method (0.08 versus 0.047). Tag-based comparison also generates more high scores than keyword-based comparison.

3.4 Search for similar URLs for URLs with a small number of posts

When the number of posts is small, the tag vector may not represent the resource accurately. This is especially true when the resource has many aspects (e.g., the URL of a web site in contrast to a single web document). University web sites are usually not tagged extensively by the del.icio.us community. For example, the NC A&T State University website (http://www.ncat.edu/) has 16 posts with three tags: “education”, “northcarolina”, and “school”. This tag vector obviously does not fully represent the resource. Not surprisingly, the search for similar URLs does not generate good results. On the other hand, Google’s similar page search returns URLs that include other state universities in North Carolina. If there were more posts for that university’s website with tags like “HBCU”, the tag-based search could have generated much better results.

4. Statistics of Data Collected from del.icio.us

4.1. Distribution of URLs by number of posts

Figure 1 shows the cumulative distribution of the number of URLs in relation to the number of posts. The total number of URLs is 764,000 and the total number of posts is 73 million. The figure shows that most of the URLs have a small number of posts. Fifty percent of the URLs have five or fewer posts. About 20% have 50 posts or more, about 13% have 100 or more posts, and only about 8% have 200 or more posts. The average number of posts per URL is about 96. The big disparity between the mean and median number of posts (96 versus 5) shows that there is a small number of very popular URLs but a large number of very unpopular URLs. The most posted URL in our collection is http://www.flickr.com/, which has been posted more than 58,000 times.

Figure 1. Distribution of the number of posts

4.2. Distribution of the tag size

By “tag size” we mean the total number of times a tag is used for tagging any URL. For example, ‘university’ is used 23,554 times for 1,327 different URLs, so the tag size for ‘university’ is 23,554. About 88,000 distinct tags are found, and the sum of tag sizes of these tags is about 155 million. Therefore, the average tag size is about 1,763. Figure 2 shows the cumulative distribution of the number of tags on tag size. About 50% of the tags are used 8 times or less, about 17% are used more than 100 times, and about 5% are used more than 500 times. The disparity between the mean and median tag size (1,763 versus 8) is much greater than in the distribution on the number of posts. Figure 3 shows the cumulative percentage of total tag usage. The top 1,000 tags provide nearly 90% of total tag usage. The top 72 tags, fewer than 0.1% of all tags, provide half of the total tag usage. The 10 largest tags are “design”, “tools”, “webdesign”, “software”, “reference”, “web2.0”, “blog”, “css”, “web”, and “free”. The largest tag, “design”, has been used 4.7 million times. The average number of tag usages per URL is about 200. There is roughly a 2-to-1 ratio of tag usages to posts, so, on average, a user applies two tags to a URL.
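Section 4 characterizes skewed distributions by comparing the mean with the median and by reporting the share of total usage held by the largest items. The short sketch below shows how these summary figures can be computed from a list of raw counts. The counts in `tag_sizes` are hypothetical toy values, not data from the study, and the helper names are introduced here for illustration only.

```python
import statistics


def cumulative_share(sizes, top_n):
    """Fraction of the total held by the top_n largest values."""
    ordered = sorted(sizes, reverse=True)
    total = sum(ordered)
    if total == 0:
        return 0.0
    return sum(ordered[:top_n]) / total


def fraction_at_most(values, threshold):
    """Fraction of values that are less than or equal to threshold."""
    return sum(1 for v in values if v <= threshold) / len(values)


# Hypothetical tag sizes: one very large tag and a long tail of small ones.
tag_sizes = [4700, 900, 300, 40, 12, 8, 8, 5, 3, 2, 1, 1]

print("mean:", round(statistics.mean(tag_sizes), 1))
print("median:", statistics.median(tag_sizes))
print("share of top 1 tag:", round(cumulative_share(tag_sizes, 1), 3))
print("fraction used 8 times or less:", fraction_at_most(tag_sizes, 8))
```

A mean far above the median, as in this toy list, is the signature of a few very large items dominating the total, which is the pattern reported for both posts per URL and tag sizes.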


Figure 2. Distribution of tag size

Figure 3. Cumulative percentage of top tags

5. Discussion and Conclusion

In social tagging systems, users freely choose tags to annotate resources. This informal tagging behavior causes a variety of tagging problems, including polysemy, synonymy, plurals, and depth of tagging. Some researchers suggest that, as a tagging community grows, the social tagging system becomes less effective in organizing and describing resources. In this article, we investigated the effectiveness of collaborative tagging systems for describing resources. We have strong grounds to believe that, when a reasonably large number of similar resources are tagged with similar sets of tags, collaborative tagging systems offer an effective way of describing resources. To check our hypothesis, we downloaded from del.icio.us over 764,000 URLs together with their tag vectors and developed an algorithm for searching for ‘similar’ URLs based on the similarity of their tag vectors. The results of applying the proposed algorithm for finding similar resources in our dataset are surprisingly good, in many cases better than the results of Google’s similar URL search. Our research shows that collaborative tagging systems are more effective in describing resources when the resources are tagged many times.

References

[1] Cattuto C., Baldassarri A., Servedio V. D. P., and Loreto V. Emergent community structure in social tagging systems. In Proceedings of the European Conference on Complex Systems, Dresden, Germany, October 2007.
[2] Golder S. A. and Huberman B. A. The structure of collaborative tagging systems. Journal of Information Science, 32(2):198–208, 2006.
[3] Halpin H., Robu V., and Shepherd H. The complex dynamics of collaborative tagging. In WWW '07: Proceedings of the 16th international conference on World Wide Web, pages 211–220, New York, NY, USA, 2007. ACM.
[4] Sen S., Shyong, Lam T. K., Cosley D., Rashid A. M., Frankowski D., Harper F., Osterhouse J., and Riedl J. Tagging, community, vocabulary, evolution. In 20th anniversary conference on Computer supported cooperative work, pages 181–190. ACM Press, 2006.
[5] Li X., Guo L., and Zhao Y. E. Tag-based Social Interest Discovery. In Proc. of the 17th Intl. World Wide Web Conference, 2008.
[6] Wu X., Zhang L., and Yu Y. Exploring social annotations for the semantic web. In Proc. of the 15th Intl. Conference on World Wide Web, pages 417–426, New York, NY, USA, 2006.
[7] Zanardi V. and Capra L. Social Ranking: Uncovering Relevant Content Using Tag-based Recommender Systems. To appear in the 2nd ACM International Conference on Recommender Systems, October 2008, Switzerland.
[8] Rader E. and Wash R. Influences on Tag Choices in del.icio.us. To appear in the Proceedings of CSCW 2008, San Diego, California.
[9] Brooks C. and Montanez N. Improved annotation of blogosphere via autotagging and hierarchical clustering. In Proc. of ACM WWW, pages 625–631, May 2006.
[10] Berners-Lee T., Hendler J., and Lassila O. The Semantic Web. Scientific American Magazine, May 17, 2001.
