Proceedings of the 43rd Hawaii International Conference on System Sciences - 2010
Metadata Effectiveness: A Comparison between User-Created Social Tags and Author-Provided Metadata

Caimei Lu, Jung-ran Park, Xiaohua Hu, Il-Yeol Song
College of Information Science and Technology, Drexel University, Philadelphia, PA 19104 USA
{caimei.lu, songiy}@drexel.edu, {jung-ran.park, tony.hu}@ischool.drexel.edu
Abstract

This paper investigates the additional information value provided by user-created social tags and author-provided metadata, as well as their effectiveness in facilitating web clustering and discovery. We collected a data set of web pages that includes both social tags from the del.icio.us website and author-provided metadata crawled from the internet. Based on this data set, we first checked the overlap of user-created tags and author-provided metadata with the titles and content of the annotated web pages. Then, we experimented with two clustering methods that incorporate tags and author-provided metadata. The results show that both tags and author-provided metadata add valuable information to existing page content, and that social tags are more effective than author-created metadata for enhancing web clustering performance, either as an independent information source or as links connecting topically related pages.
1. Introduction

Metadata is essential for organizing and searching information resources. Traditionally, metadata is created by professionals based on metadata standards or controlled vocabularies. Although professionally created metadata is considered to be of high quality, it is costly to produce and difficult to scale. The enormous volume of digital resources on the World Wide Web makes alternative means of metadata generation a critical need. Automatic or semi-automatic metadata generation tools and techniques have been developed to facilitate metadata production; however, the capabilities of current tools and techniques are still limited, especially for generating metadata elements like "subject", which require human intellectual discretion [21]. Accompanying the transition from Web 1.0 to Web 2.0, another class of metadata creators has emerged: web users who annotate web resources through
social tagging systems. Social tagging, also known as social annotation or collaborative tagging, is one of the major characteristics of Web 2.0. Social tagging systems allow users to annotate resources with freeform tags. The resources can be of any type or in any format, such as web pages (e.g., del.icio.us 1), videos (e.g., YouTube 2), photos (e.g., Flickr 3), academic papers (e.g., CiteULike 4), and so on. Despite focusing on different types of resources, all social tagging systems share the common purpose of helping users share, store, organize and retrieve the resources they are interested in. The social tags created by users provide a special type of metadata that can be utilized for classifying and searching for the resources. Vander Wal coined the term "folksonomy" (a portmanteau of folk and taxonomy) to describe the conceptual structures generated by social tagging systems [29]. Compared to professionally created metadata, social tagging lowers the entrance threshold of metadata creation. Web users, as long as they are familiar with the content of the resources, can be taggers. They do not have to master specific metadata standards or indexing rules in order to tag. Moreover, new words and phrases continuously emerge in every domain as social culture and technology evolve. Folksonomies can quickly adapt to these changes in vocabulary, whereas controlled vocabularies and large-scale ontologies react slowly to new terms and phrases because of high maintenance costs. Before the appearance of social tagging systems, authors were seen as an appropriate alternative to professionals as metadata creators [10, 18]. Compared to professionally created metadata, author-provided metadata is much less costly in terms of time and effort. Moreover, it is assumed that authors are most familiar with their work and are thus able to describe the resources they create precisely [10].
1 http://delicious.com/
2 http://www.youtube.com/
3 http://www.flickr.com/
4 http://www.citeulike.com/
978-0-7695-3869-3/10 $26.00 © 2010 IEEE
However, researchers have pointed out that authors are not necessarily better at generating metadata than others [18, 28]. Although authors have intimate knowledge of their work and expect their work to be discovered and used, they may have a very different understanding of the work they create than its users do. As stated in [18], "author created metadata may help with the scalability problems in comparison to professional metadata, but both approaches share a basic problem: the intended and unintended eventual users of the information are disconnected from the process". On the contrary, in social tagging systems, taggers are indexers and searchers at the same time. From the point of view of information retrieval, it is easier to attain indexer-searcher consistency in social tagging systems, which is a prerequisite of effective retrieval [8]. Both author-generated metadata and user-produced metadata have been explored in digital projects and organizations with the purpose of saving costs or enhancing the quality of services. For instance, the Networked Digital Library of Theses and Dissertations (NDLTD) 5 and the Synthesis Coalition's National Engineering Education Delivery System (NEEDS) 6 digital library for engineering education both support author-generated metadata. The libraries at the University of Pennsylvania developed a social bookmarking tool called PennTags, which allows users to save, annotate and share cataloged books, journals, individual articles, and even query results, web pages, and images [2]. The Ann Arbor District Library (AADL) also developed a social tagging application called SOPAC to let users create and manage tags for individual library resources [8]. However, more extensive and advanced application of author-generated metadata or user-created folksonomies in digital projects and organizations must be grounded in evaluation studies of the effectiveness of the metadata created by users and authors.
This is because both user-created social tags and author-generated metadata are often considered to be of poor quality. Folksonomy (i.e., user-generated metadata) has been criticized for imprecision and semantic ambiguity due to its uncontrolled nature [9, 18, 27]. Authors are also considered to lack expert indexing skills and to be incapable of producing high-quality metadata [30]. Digital projects and information organizations need to investigate whether users or authors are qualified metadata creators before involving them in metadata production to improve resource organization and discovery on their web sites. Furthermore, when choosing between user-created
5 http://www.ndltd.org
6 http://www.needs.org/engineering/
social tags and author-created metadata, organizations and digital projects need to investigate which of these two types of metadata is more effective as an alternative or complement to professionally produced metadata. This paper addresses these issues by investigating and comparing the quality, or effectiveness, of user-generated social tags and author-created metadata. Guy et al. state that "high quality metadata supports the functional requirements of the system it is designed to support, which can be summarized as quality is about fitness for purpose" [11]. In this paper, we focus on two major functions of metadata: (1) providing additional information about the resource it describes; (2) facilitating the discovery of relevant information by bringing similar resources together and distinguishing dissimilar resources [19]. Specifically, in this paper, high-quality or effective metadata refers to metadata that provides additional information about the topic of web pages and groups topically similar web pages into clusters. For evaluation, we collected two datasets. The first dataset comprises social tags applied to a set of URLs by a group of users in the del.icio.us system. The second dataset consists of the attributes ("keywords" and "description") embedded in the HTML or XHTML documents of the same pages. The user-generated tags and author-provided keywords and descriptions are compared along several dimensions. First, in order to investigate whether tags, keywords or descriptions provide extra information beyond web content, we examine their intersection with the title and textual content of the web pages (section 4). Second, we assess the effectiveness of user-created tags and author-provided metadata for web clustering. We experiment with two clustering methods (K-means and Link K-means) to see whether clustering performance can be enhanced by incorporating tags or author-provided keywords and descriptions into the clustering process.
The clustering results are evaluated against a user-maintained web directory (the Open Directory Project, ODP) based on several quality metrics (sections 5 and 6).
2. Related work

Although the quality of professionally created metadata has been investigated by many researchers [11, 14, 20, 33], the quality of author-provided metadata has rarely been studied in previous work. One study evaluating author-provided metadata was conducted by Greenberg et al. [10]. They examined eleven author-generated metadata records using the National Institute of Environmental Health Sciences
Dublin Core schema. The results of the study indicate that a simple web form can assist authors in producing metadata of acceptable quality. In the rest of this section, we primarily review studies on social tagging. Since the first social tagging system, del.icio.us, was created in 2004, social tagging has received continued attention in the literature. Researchers have focused on various aspects of social tagging, such as the usage patterns of social tagging systems [9, 12], the semantic value and quality of social annotations [1, 6, 13, 27], and the applications of social annotations [4, 23, 32, 33, 35].
2.1. Usage patterns of social tagging systems

In [9], Golder and Huberman discovered several interesting usage patterns of social tagging based on a del.icio.us dataset. They found that a consensus about the tags used for describing a webpage forms after the webpage has been annotated by a certain number of users. Halpin et al. examined the tagging history of the del.icio.us site to investigate why and how the power-law distribution of tag usage frequency emerges over time in a mature social tagging system [12].
2.2. Semantic values and qualities of social annotations

In an early study on social tagging [18], Mathes examines the limitations and strengths of folksonomy as user-generated metadata vis-à-vis professionally and author-created metadata. In order to measure the semantic value of social tags, Al-Khalifa and Davis compared social tags with keywords automatically extracted using the Yahoo Term Extractor software package and concluded that social tags are semantically richer than automatically extracted keywords [1]. To examine the semantics of social tags, Suchanek et al. checked the meaning of a sample of tags collected from del.icio.us against two lexical resources, YAGO and WordNet [27]. They found that the most popular tags of a webpage are mostly semantically meaningful, in the sense that these tags are registered in those resources. Based on a probabilistic generative model, Zhang et al. identify the conceptual structures of folksonomy and show that a hierarchical relationship can be derived from the concepts identified in folksonomy [34]. In order to examine whether social tags are semantically meaningful and relevant to the annotated objects, Heymann et al. had a group of graduate students evaluate a sample of posts collected from del.icio.us [13]. The results of the study show that most
tags are relevant and objective. Bischoff et al. also compared social tags with metadata created by experts [6]. They analyzed the overlap between social tags collected from Last.fm and music reviews extracted from Google results for the same set of music tracks. The results showed that 73.01% of the track tags can be found inside the review pages. They also compared track tags from Last.fm to the expert-created reviews from www.allmusic.com for the same music tracks or albums and found that 46.4% of the tags were present in the Allmusic review pages. In order to find out whether social annotation provides an additional information source of value for web search, some researchers have analyzed the intersection of social tags with other information sources such as the titles and textual content of web resources. Based on a sample of the del.icio.us data set, Heymann et al. find that social tags occur in the content of half of the pages and in the titles of 16% of the pages they annotate [13]. Bischoff et al. perform a similar analysis on del.icio.us tags and find that 44.85% of the tags appear in the web page text [6]. Another requirement for social tags to be useful for web search is that they should overlap substantially with the query terms used for searching for the tagged objects. Heymann et al. find a significant intersection between queries and tags by analyzing the overlap between popular query terms from an AOL query data set and del.icio.us tags [13]. Bischoff et al. counted the percentage of queries containing tags on three systems: del.icio.us, Flickr and Last.fm [6]. They found that 71.22% of Delicious queries contained at least one tag and 30.61% of the queries consisted entirely of tags. For Flickr and Last.fm, the corresponding percentages are 64.54% and 12.66%, and 58.43% and 6%, respectively.
Rather than measuring the overlap, Suchanek et al. calculated the similarity between the frequency distributions of tags, page contents and queries based on two statistics: cosine similarity and NDCG (normalized discounted cumulative gain) [27]. The cosine similarity and NDCG were found to be 16.76% and 12.64% for tags and content, and 21.61% and 25.07% for tags and queries. In our study, in addition to the overlap between tags and page content, we also examine the intersection between tags and author-provided metadata to investigate the difference between user-created tags and author-provided keywords and descriptions.
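As an illustration of the cosine statistic used in such comparisons, the similarity between two term-frequency distributions can be computed as below. This is a minimal sketch; the frequency distributions shown are hypothetical toy data, not values from any of the studies cited.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(freq_a, freq_b):
    """Cosine similarity between two term-frequency distributions,
    each given as a mapping from term to raw frequency."""
    dot = sum(freq_a[t] * freq_b[t] for t in freq_a.keys() & freq_b.keys())
    norm_a = sqrt(sum(v * v for v in freq_a.values()))
    norm_b = sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical tag and content frequency distributions for one page
tag_freq = Counter({"python": 10, "web": 5, "tutorial": 3})
content_freq = Counter({"python": 7, "web": 2, "code": 6})
similarity = cosine_similarity(tag_freq, content_freq)
```

A similarity near 1 indicates that the two vocabularies emphasize the same terms with similar proportions; completely disjoint vocabularies score 0.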
2.3. Applications of social annotation

Social tagging systems have been exploited for various application purposes such as information retrieval [4, 23, 31-33, 37], clustering [5, 17, 22], automatic tag recommendation [7, 15, 24, 25], and so
on. Here we focus on studies on social tagging for web search and clustering. One purpose of social annotation-based web search is to use social tagging to improve the language model for information retrieval. Xu et al. developed a language model for information retrieval based on the metadata property of social tags and their relationship to the annotated documents [32]. However, this model only considers the bipartite structure over documents and tags. Zhou et al. propose an extended generative language model based on LDA (Latent Dirichlet Allocation) for information retrieval [37]. This model incorporates not only the topical background of documents and social tags but also users' domain interests. Based on the tripartite structure of social tagging systems, Wu et al. propose a probabilistic generative model in which the three entities of a social tagging system, namely tags, objects and users, are mapped to a common conceptual space represented by a multidimensional vector, with each dimension corresponding to a knowledge category [31]. The values of each entity's conceptual vector are estimated through an EM process. The generated conceptual vectors are used for developing various search models. Several efforts have been made to explore social tagging for clustering. The clustered objects can be users, tags or resources. In [5], a tag graph is built based on the co-occurrence of tags in annotated resources. A spectral bisection method is adopted to cluster the tag graph. The identified tag clusters are used for finding semantically related tags. In [17], the authors use association rule algorithms to identify frequent tag co-occurrence patterns, which are viewed as topics of user interest. Users and URLs are clustered under different topics based on their relation to the tags included in each topic. Although this approach can cluster both users and URLs at the same time, it has some limitations.
First, the number of clusters depends greatly on the number of frequent tag co-occurrence patterns. Second, active users and popular URLs can belong to many clusters, while new or inactive users and emerging or unpopular URLs may not be assigned to any cluster. A comprehensive study of social tagging-based clustering was conducted by Ramage et al. [22]. The authors incorporate social tags into two clustering methods: K-means and a generative clustering method. The clustering results are also evaluated against the ODP web directory. In our paper, we also apply K-means clustering to compare the effectiveness of user-generated tags and author-created metadata as document features for web clustering. We additionally experiment with another clustering method called Link K-means, which relies on both the content of resources and the links among them for clustering. We apply the Link K-means method to examine
whether tags or keywords can be used as bridges connecting topically relevant web pages.
3. Dataset for experiment

For this study, we used a real-world social tagging dataset crawled from the del.icio.us web site during January and February 2009. The original data set contains 3,246,424 posts for 1,731,780 URLs created by 4,784 users. Each post is a bookmark to a web page annotated by a user with one or more tags. Owing to the lack of ground truth for evaluating clustering results, following the approach adopted in [22] we use the web categories of the Open Directory Project (ODP) 7 as the clustering standard. ODP is a human-edited hierarchical web directory containing 17 top-level categories. We keep only 14 top-level categories as the clustering standard (see Table 1). By intersecting the URLs from the del.icio.us dataset with the URLs listed under the 14 ODP categories, we obtained 45,462 URLs appearing in 208,437 posts. For these URLs, we crawled the content of the "keywords" and "description" attributes provided within the meta tags in the head section of their HTML or XHTML documents. The meta elements are used by web page authors to provide structured metadata about their pages; their two most common attributes are "description" and "keywords". We found that 23,188 of the 45,462 URLs contained both "keywords" and "description" attributes. We also crawled the titles and textual content of the URLs at the same time. Table 1 lists the distribution of the 23,188 URLs among the 14 ODP categories. Note that a URL may belong to more than one category.

Table 1. The number of URLs under 14 ODP categories
  Art              3783      News            399
  Business         2806      Recreation     1352
  Computer         6731      Reference      1274
  Games             828      Science        1716
  Health            531      Shopping       2170
  Kids and Teens    811      Society        2547
  Home              957      Sports          575
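The intersection step described above can be sketched as follows. Function and variable names are illustrative, not taken from the paper's actual implementation, and the posts shown are toy data; note how a URL may receive more than one category label.

```python
def build_eval_set(posts, odp_listings):
    """Keep only posts whose URL appears under one of the ODP
    categories; a URL may belong to more than one category."""
    url_categories = {}
    for category, urls in odp_listings.items():
        for url in urls:
            url_categories.setdefault(url, set()).add(category)
    kept_posts = [(user, url, tags) for (user, url, tags) in posts
                  if url in url_categories]
    kept_urls = {url for (_, url, _) in kept_posts}
    return kept_posts, {u: url_categories[u] for u in kept_urls}

# Toy example: two of the three bookmarked URLs are listed in ODP
posts = [("u1", "http://a.example", {"art"}),
         ("u2", "http://b.example", {"science"}),
         ("u3", "http://c.example", {"misc"})]
odp = {"Art": ["http://a.example"],
       "Science": ["http://b.example", "http://a.example"]}
kept, labels = build_eval_set(posts, odp)
```

The resulting category sets serve as the gold-standard labels against which clustering output is later scored.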
In our data set, the 23,188 URLs are associated with 19,934 different tag terms and 90,827 different keyword terms. The number of different keyword terms is thus about five times the number of different tags. This is because the keywords are generated by the authors of each individual URL
7 http://www.dmoz.org/
while the tags are created by a much smaller number (about 4,700) of users. In order to reveal the usage patterns of the keywords and tags applied to the URLs, we analyze their frequency distributions. Figure 1 and Figure 2 show the distributions of the frequency of tag usage and of keyword usage, respectively. Since both axes of the figures are in log scale, power-law distributions are clearly visible. This indicates that a relatively small number of tags are frequently used by users, while most tags are only applied to a few objects. Similarly, a very small portion of keyword terms are frequently used by authors to describe the web pages they create, while most keyword terms are used to describe only a limited number of web pages.
Figure 1. The distribution of the frequency of tag usage (number of tags vs. frequency of tag usage, both axes in log2 scale)

Figure 2. The distribution of the frequency of keyword usage (number of keywords vs. frequency of keyword usage, both axes in log2 scale)
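The quantity plotted in Figures 1 and 2 can be reproduced in outline as follows; this is a sketch on toy posts assumed for illustration, not the paper's crawl data.

```python
from collections import Counter

def usage_distribution(annotations):
    """Count each term's usage frequency, then count how many terms
    share each frequency (the quantity plotted in Figures 1 and 2)."""
    term_freq = Counter(term for terms in annotations for term in terms)
    freq_of_freq = Counter(term_freq.values())
    return term_freq, freq_of_freq

# Toy posts: each inner list is the tag set of one bookmark
posts = [["web", "design"], ["web", "css"], ["web"], ["python"]]
term_freq, dist = usage_distribution(posts)
# "web" is used 3 times; three terms (design, css, python) are used once
```

Plotting `dist` with both axes on a log scale makes a power law appear as a roughly straight line, which is how the distributions in Figures 1 and 2 are recognized.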
4. Exploring the additional value of social tags and author-provided keywords

In this section, we explore whether social tags and author-provided keywords and descriptions add new information to existing web page content. Currently, most web search engines rely primarily on page content and link structure for indexing, searching and ranking. For social tags and author-provided metadata to have a potential impact on improving the performance of web search or other web mining tasks, they should contain useful information which cannot be extracted
from the page content directly. To investigate whether social tags and author-provided keywords contain additional information value, we examine their occurrence in the title and content of the annotated web pages.
• Tag vs. Title. Among the 19,934 different tags associated with the examined 23,188 URLs, 4,854 (24%) appear in the titles of the pages. 12,482 (54%) URLs have at least one tag occurring in the title; 2,024 URLs have half or more of their tags occurring in the title.
• Keywords vs. Title. Among the 90,827 different keyword terms, 14,965 (16%) come from the titles of the web pages they describe. In addition, 19,857 (85.6%) URLs have at least one keyword term from the page title; 3,344 (14%) URLs have at least half of their keywords from the title.
• Tag vs. Content. Among all tags, 8,430 (42%) are present in the page text. About 80% (18,568) of the annotated pages have at least one tag present in the page text.
• Keywords vs. Content. 42,438 (47%) of the keyword terms can be found in the textual content of the pages they describe. In addition, for 22,178 (96%) of the URLs, at least one of the keyword terms is present in the page text.
Based on this analysis, we can see that less than half of the tags or keywords are present in the title or content of the pages. This indicates that both tags and keywords contain additional information beyond the title and page content. At the term level, tags are more likely than keywords to be present in web page titles (24% vs. 16%). This may be because some authors deliberately avoid choosing terms already used in page titles as keywords. At the URL level, however, 85.6% of the URLs have author-provided keywords containing at least one term from the title, while only about half of the URLs are annotated with at least one tag present in the page title. In terms of content, authors are more likely than users to adopt terms appearing in the page content to annotate the pages.
Almost all (96%) of the URLs are described by at least one keyword term from the page content. Users tend to adopt terms different from author-provided keywords or descriptions to annotate web pages. To confirm this, we examined whether the terms used as tags by users also appear as keywords or descriptions for the same pages.
• Tag vs. Keywords. 6,447 terms (32% of tags, 7% of keywords) are applied both as tags by users and as keywords by authors to describe the same pages. 16,113 (70%) URLs have at least one tag that is also adopted by the authors as a keyword describing the pages.
• Tag vs. Description. 5,634 (28%) tags appear in the descriptions provided by the authors for the same pages. 15,161 (65%) URLs are annotated with at least one tag that is present in the description attributes of the pages.
Only a small number of terms are applied by both users and authors for annotating and describing the same web pages. However, tags overlap more with keywords than with descriptions, even though descriptions generally contain more terms. This may be because tags are, by their nature, more akin to keywords: both are terms considered important by users and authors for summarizing the topics of web pages. Interestingly, a greater overlap exists between the more popular tags and keywords. Table 2 lists the 30 most frequent tags and keyword terms in our data set.

Table 2. 30 most frequent tags and keywords
Tags: reference, software, tools, design, web, shopping, free, programming, blog, art, music, news, resources, education, webdesign, technology, online, business, development, research, science, web2.0, internet, tutorial, games, computer, inspiration, how-to, fun, search
Keywords: free, software, web, online, design, news, music, art, management, business, video, games, home, internet, books, education, digital, search, development, photography, computer, information, world, training, travel, school, windows, reviews, research, science
In Table 2, the terms are ordered by their frequency of usage. The bold terms are those used both by users as tags and by authors as keywords. We can see that more than half (17) of the 30 terms are shared. The eight most frequently used keyword terms are all also used as social tags. This indicates that a consensus exists, to some extent, between users and authors on the terms for describing web resources. However, unlike keywords, which include only topical terms, tags also contain some subjective and personal terms such as "fun".
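The per-URL overlap figures reported in this section reduce to a simple set-intersection computation. A sketch with hypothetical data (the real analysis would run over the 23,188 crawled URLs):

```python
def overlap_stats(url_tags, url_reference):
    """For each URL, test whether at least one tag appears in a
    reference term set (title words, keywords, description words)."""
    urls_with_hit = 0
    shared_terms = set()
    for url, tags in url_tags.items():
        hits = tags & url_reference.get(url, set())
        if hits:
            urls_with_hit += 1
        shared_terms |= hits
    return urls_with_hit, shared_terms

# Hypothetical per-URL tag sets and author keyword sets
tags = {"u1": {"python", "web"}, "u2": {"music"}}
keywords = {"u1": {"python", "tutorial"}, "u2": {"jazz"}}
hit_count, shared = overlap_stats(tags, keywords)
# one URL (u1) has an overlapping term; the shared vocabulary is {"python"}
```

The same routine, applied with title words or page-content words as the reference sets, yields the "Tag vs. Title" and "Tag vs. Content" style statistics above.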
5. Clustering social tags and author-provided metadata

An important function of metadata is to facilitate information organization (e.g., web clustering). As shown in the previous section, a large portion of the tags and keyword terms are not present in the titles and textual content of the pages. Although this indicates that both tags and author-provided keywords contain additional information beyond page content, it remains unknown whether this additional information is useful for web organization and searching.
In this section, we try to answer this question by examining whether social tags and author-provided metadata are useful for enhancing the performance of web clustering. We experiment on two clustering methods: K-means and Link K-means [3, 35]. The former is a content-based clustering method, while the latter relies on both the content of resources and the links among the resources.
5.1. Clustering standard and quality metrics

Because there is no ground truth for the web clusters, we adopt the 14 ODP top-level categories mentioned earlier as the gold standard for evaluating the clustering results. Cluster quality is evaluated by three metrics: F-score [16], purity [36] and normalized mutual information (NMI) [26]. The F-score combines precision and recall. Purity assumes that all samples of a cluster are predicted to be members of the actual dominant class for that cluster. NMI is defined as the mutual information between the cluster assignments and a pre-existing labeling of the data set, normalized by the arithmetic mean of the maximum possible entropies of the empirical marginals, i.e.,

    NMI(X, Y) = I(X; Y) / ((log k + log c) / 2),

where X is a random variable for cluster assignments, Y is a random variable for the pre-existing labels on the same data, k is the number of clusters and c is the number of pre-existing classes. All three metrics range from 0 to 1, with higher values indicating better clustering quality.
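The NMI definition above can be computed directly from the two label lists. A minimal pure-Python sketch using natural logarithms (the label lists are toy data):

```python
from collections import Counter
from math import log

def nmi(clusters, classes):
    """NMI(X, Y) = I(X; Y) / ((log k + log c) / 2), where k and c
    are the numbers of clusters and pre-existing classes."""
    n = len(clusters)
    count_x = Counter(clusters)
    count_y = Counter(classes)
    count_xy = Counter(zip(clusters, classes))
    # Empirical mutual information between assignments and labels
    mutual_info = sum((nxy / n) * log(n * nxy / (count_x[x] * count_y[y]))
                      for (x, y), nxy in count_xy.items())
    denom = (log(len(count_x)) + log(len(count_y))) / 2
    return mutual_info / denom if denom > 0 else 0.0

# A perfect clustering scores 1.0; an uninformative one scores 0.0
perfect = nmi([0, 0, 1, 1], ["a", "a", "b", "b"])
uninformative = nmi([0, 1, 0, 1], ["a", "a", "b", "b"])
```

Because the denominator uses log k and log c rather than the empirical entropies, the score is exactly 1 only when the clustering matches the labels and both are uniformly distributed, consistent with the normalization stated above.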
5.2. K-means

K-means is a simple yet efficient and highly scalable clustering method. It iteratively calculates the cluster centroids and reassigns each document to the closest cluster until no document can be reassigned. Traditional K-means models documents as word vectors. In our experiment, web documents are modeled not only as vectors of their content words but also as vectors of the tags applied to them and as vectors of the keywords and description words used to describe them. Specifically, we experiment with K-means based on the following vector space models:
• Word vector: a web document is represented only by the vector of its content words. K-means based on this vector model is used as the baseline clustering method.
• Tag vector: a web document is represented only by the social tags applied to it.
• Keyword vector: a web document is represented only by the keyword terms used by the author to describe it.
• Description vector: a web document is represented only by the description words generated by the author.
• (Word+Tag) vector: tags applied to a web document are viewed as additional words of the document and combined with its original words to form one vector.
• (Word+Keyword) vector: keywords used to describe a web document are viewed as additional words of the document and combined with its original words to form one vector.
• (Word+Description) vector: the words contained in the descriptions of a web document are viewed as additional words of the document and combined with its original words to form one vector.
• Word vector + Tag vector: each document is represented by two independent vectors, a word vector and a tag vector. During the clustering process, the distance from a document to a cluster centroid is calculated as the linear combination of the distance based on the word vector and the distance based on the tag vector.
• Word vector + Keyword vector: each document is represented by two independent vectors, a word vector and a keyword vector. The distance from a document to a cluster centroid is calculated as the linear combination of the distances based on the word vector and the keyword vector.
• Word vector + Description vector: each document is represented by two independent vectors, a word vector and a description vector. The distance from a document to a cluster centroid is calculated as the linear combination of the distances based on the word vector and the description vector.
Another issue in K-means clustering is how to weight the features. In our experiment, we adopt the tf·idf weighting function for all types of vector models.
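The two-vector models above score documents against centroids with a linearly combined distance. A sketch, assuming cosine distance over sparse term-weight dictionaries and an illustrative mixing weight alpha (the paper does not specify its weighting scheme here, so both choices are assumptions):

```python
from math import sqrt

def cosine_distance(a, b):
    """1 - cosine similarity over sparse term-weight dictionaries."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def combined_distance(word_vec, tag_vec, centroid_words, centroid_tags,
                      alpha=0.5):
    """Linear combination of word-vector and tag-vector distances,
    as in the 'Word vector + Tag vector' model; alpha is an assumed
    mixing weight."""
    return (alpha * cosine_distance(word_vec, centroid_words)
            + (1 - alpha) * cosine_distance(tag_vec, centroid_tags))
```

During each K-means iteration, a document would be assigned to the centroid minimizing this combined distance; the same scheme applies to the keyword-vector and description-vector variants.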
The tf·idf value of a term is determined by both its term frequency and its document frequency in the entire collection. The number of clusters is set to 14, equal to the number of ODP categories. Because the clustering algorithm relies on random initialization, we run the algorithm 10 times and use the mean of each quality metric across the 10 runs as the final score. For each run, the number of iterations is set to 20. The clustering based on the word vector is viewed as the baseline. All words are
lemmatized. Stop words and rare words with a document frequency of less than five are filtered out. Table 3 lists the clustering results of K-means based on the different vector space models. Note that in Table 3, ** indicates that the improvement over the baseline is significant according to the paired-sample t-test at the level of p