An Epistemic Dynamic Model for Tagging Systems - CiteSeerX

14 downloads 0 Views 1MB Size Report
ABSTRACT. In recent literature, several models were proposed for re- producing ... 1. INTRODUCTION. During the last few years, collaborative tagging systems.
An Epistemic Dynamic Model for Tagging Systems Klaas Dellschaft

Steffen Staab

Department of Computer Science Universität Koblenz-Landau Koblenz, Germany

Department of Computer Science Universität Koblenz-Landau Koblenz, Germany

[email protected]

[email protected]

ABSTRACT In recent literature, several models were proposed for reproducing and understanding the tagging behavior of users. They all assume that the tagging behavior is influenced by the previous tag assignments of other users. But they are only partially successful in reproducing characteristic properties found in tag streams. We argue that this inadequacy of existing models results from their inability to include user’s background knowledge into their model of tagging behavior. This paper presents a generative tagging model that integrates both components, the background knowledge and the influence of previous tag assignments. Our model successfully reproduces characteristic properties of tag streams. It even explains effects of the user interface on the tag stream.

Categories and Subject Descriptors H.5.3 [Group and Organization Interfaces]: Collaborative computing; H.5.3 [Group and Organization Interfaces]: Theory and models

General Terms Experimentation, Human Factors

Keywords tagging, complex systems, stochastic modeling, temporal evolution

1. INTRODUCTION During the last few years, collaborative tagging systems like Flickr1 , Del.icio.us2 or Bibsonomy3 got more and more popular because they allow users to easily upload resources 1

http://www.flickr.com/ http://del.icio.us/ 3 http://www.bibsonomy.org/ 2

Copyright ACM, 2008. This is the author’s version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in Proceedings of the 19th ACM Conference on Hypertext and Hypermedia HT’08, http://doi.acm.org/10.1145/1379092.1379109

like photos, bookmarked URLs and BibTeX entries and to share them with other users. Additionally, the users can organize their resources by assigning tags or keywords to them. A tag assignment constitutes a link between the respective resource, the user and the assigned tags. The links can be used for retrieving specific resources or for navigating through the large set of resources by following the implicit links that are given by the co-occurrence of tags at the same resource. Over time, one can observe the emergence of a loose categorization system for resources which is frequently called a folksonomy [8]. Folksonomies constitute intriguing dynamic systems constructed by the collaboration and interaction of their users. They offer new possibilities for finding resources. But at the same time they constitute a challenge for existing models of categorization and retrieval of resources because the usage of tags at the micro-level of the individual user and at the macro-level of groups of users and of the complete user community has neither been understood nor has been put in a relationship with each other. Recent research has brought forward an interesting temporal perspective on the understanding of folksonomies by viewing them as dynamic stochastic systems with memory (cf. Cattuto et al. [3]). While this perspective is novel and has been seminal for relating the micro- and the macrobehavior of the system, the model by Cattuto et al. exhibits several short-comings. It abstracts away the background knowledge common to folksonomy users putting too much emphasis on imitation of other users and random generation of vocabulary. In the following, we advocate the hypothesis that both components, i. e. the background knowledge and the imitation, are needed for explaining and understanding the tagging behavior of users. We will integrate them into a dynamic generative model of folksonomies. It better approximates behavior found in actual tagging systems and it thus gives us more meaningful insights into the tagging process. For example, it helps us to distinguish between effects in the tagging system caused by a common natural language of users (e. g. English) and effects that are specific to the user interface of tagging systems. We will start the paper by giving an overview of what we understand by the term folksonomy and how we can define a co-occurrence and a resource tag stream as a view on a folksonomy. In Section 3, we will identify the characteristic properties of the co-occurrence and resource streams that also have to be reproduced by a successful generative model of folksonomies. But as we will show in Section 4, all previ-

ous models known in the literature fail to reproduce at least one of these properties. Then, in Section 5 we will describe our model that integrates background knowledge and imitation. We will show in Section 6 that both components are required for reproducing the observed characteristic properties of tag streams. The model gives us thus insights into the behavior of real users in tagging systems.

It is the objective of the tagging model described in Section 5 to better understand the tagging process and to explain how it may lead to the emergence of the described characteristic properties. It will help us to understand the link between the tagging behavior and the natural language behavior of users.

2. STREAM VIEW OF FOLKSONOMIES

A co-occurrence tag stream only contains tags that cooccur with another specific tag in the same posting. For example, for the experiments described in Section 6 we extracted three different streams of tags from a larger Del.icio.us dataset. We used the three tags that were already used in [3] but extracted their co-occurrence streams from an overall larger Del.icio.us crawl. The first stream contains all tags from tag assignments that co-occur with ajax in the same posting. The other two streams contain the tag assignments co-occurring with blog and xml. The statistics of the three streams are available in Tab. 1. More detailed information about the data sets are available in Appendix A.

In social resource sharing systems users can upload resources and assign arbitrary words, the so-called tags. The tags can later be used for retrieving specific resources from the system. The different systems can be distinguished by the kind of resources which can be uploaded. For example, Del.icio.us allows the sharing of bookmarks with other users, in Flickr the user can upload photos and the Bibsonomy system allows users to share bookmarks and BibTeX entries. The collection of all users, resources and tags as well as the assignments of tags to resources are called folksonomy. We formally define a folksonomy as follows (cf. [10] and [1]): Definition 1. A folksonomy F is a tuple F := (U, T, R, Y, pt) where • U , T , and R are finite sets, whose elements are called users, tags and resources, resp., and • Y is a ternary relation between them, i. e., Y ⊆ U × T × R, called tag assignments (TAS for short) • pt is a function pt : Y → n which assigns to each tag assignment of Y a temporal marker n ∈ N. It corresponds to the time at which a user assigned a tag to the resource. The tag assignments can be grouped into several postings. A posting contains all tag assignments made by the same user to the same resource at the same time. The temporal marker of the post is equal with the temporal marker of any of the contained tag assignments. The temporal marker associated with the tag assignments or posts allows their temporal ordering into a stream of tagging events. The analysis of the streams allows for making interesting observations of the ongoing dynamics in tagging systems and the underlying mechanisms. It helps for better understanding the potential and the limits of social tagging systems. Examples of tag stream analysis are amongst others available in [6], [7] and [3]. For example, in [6] Golder and Huberman analyze how the number of different tags used by specific users increases as they add further postings to Del.icio.us and if there exists a typical number of postings to a single resource after which the relative proportions of tags stabilize.

3. PROPERTIES OF TAG STREAMS The tag streams extracted from tagging systems have several characteristic properties which will be described in the following. We will distinguish between the characteristic properties of co-occurrence tag streams and of resource tag streams. They will be compared to properties of the human language that are already known from the analysis of natural language texts.

3.1

Co-occurrence Tag Streams

Tag ajax blog xml

|Y | 2,949,614 6,098,471 974,866

|U | 88,526 158,578 44.326

|T | 41,898 186,043 31,998

|R| 71,525 557,017 61,843

Table 1: Statistics of the used co-occurr. streams There are two characteristic properties of co-occurrence tag streams on which we will concentrate our analysis: Tag Growth The first characteristic property which can be observed in co-occurrence streams is the growth of the number of distinct tags. In Fig. 1 one can see a sub-linear growth rate for the three tag streams from Tab. 1, i. e. at the beginning the growth rate is higher and decays with the age of the stream but doesn’t stop at a certain limit. This growth pattern can also be observed in the global system as well as on the level of resource streams and reminds of a power-law [2]. Frequency-Rank Distribution Another characteristic property of tag streams is the distribution of the occurrence probabilities of the distinct tags in a tag stream. Fig. 2 shows the frequency-rank distribution for our three tag streams. The graph is shown in log-log scale. One can see that again they remind of a power-law decay for the medium to less frequently used tags and a flattened slope for the most frequently used tags (cf. [3]). Golder and Huberman analyzed how the occurrence probabilities of the tags develop over the age of the stream. They observed that approximately after 100 postings the occurrence probabilities of the most frequently used tags show a stable pattern [6]. It remains the question in how far the slope of the frequencyrank distribution of tags in a tag stream are characteristic for the tagging systems or whether it reflects a property that is already known from natural language texts. For example, for large text corpora it is known that their word frequencies exhibit a power-law behavior for all words, not only for the medium and less frequently used words. For example, in [9] they analyzed a large corpus comprising 2,606 books

Ajax Blog XML

0.01

Relative Tag Frequency

1e+06

1e+05

# of distinct tags

10000

0.0001

1e-06

1000 1e-08

100 1

Ajax Blog XML

10

1

1

10

100

1000 10000 # of tag assignments

1e+05

0.1 Ajax Blog XML

Relative Tag Frequency

0.001

1e-04

1e-05

1e-06

1e-07

10

100

1000 Tag Rank

10000

1000 10000 Tag Rank

1e+05

1e+06

1e+07

Figure 3: Frequency-rank distribution of words contained in three topic centric corpora.

in English. The resulting frequency-rank graph of the words showed a power law behavior. In [4], they also showed the power-law property for the word frequencies in the novel Moby Dick. But it seems that the strict power-law behavior is restricted to general text corpora only. If one analyzes corpora which are centered around a certain topic one can observe a behavior similar to that of co-occurrence tag streams. This is demonstrated in Fig. 3 that shows the frequency-rank distribution of text corpora crawled from the Web. They only contain texts related to one of the three topics of our cooccurrence tag streams, i. e. they were crawled by downloading all resource URLs that are contained in the streams. For example, the ajax Web corpus contains the 71,525 resources from the ajax tag stream. As one can see in Fig. 3, the frequency-rank distributions of the three Web corpora show a slope for the most frequently used words that is similar to the slope in Fig. 2. The difference is that in the Web corpora the flattened slope can be observed for words with ranks between 1 and 1,000 while for tag streams the flattened slope is typically observable for tags with ranks between 1 and 100. It thus seems to be characteristic for tag streams that the flattened slope occurs at much lower ranks compared to text corpora. Furthermore, one can observe an overall lower number of different tags in the co-occurrence streams than in the corresponding web corpus. For example, the ajax stream contains 41,898 different tags while the ajax web corpus contains approximately 290,000 different words.

3.2 1

100

1e+06

Figure 1: Growth of the number of distinct tags in the three tag streams from Tab. 1 in dependency on the age of the stream.

0.01

10

100000

Figure 2: Frequency-rank distribution of tags in the three tag streams from Tab. 1.

Resource Tag Streams

Until now, we have considered the properties of co-occurrence tag streams. In the following, we will look at the frequencyrank distributions of streams which only contain the tags assigned to a specific resource. In Fig. 4 one can see the frequency-rank distributions of three different resources in the Del.icio.us data set. For http://www.netvibes.com/ and http://www.googleguide.com/ one can see a sharp drop in the frequencies of tags between rank 7 and 10. For the resource stream of http://www.pandora.com/ this drop is not

http://www.googleguide.com/ http://www.pandora.com/ http://www.netvibes.com/

Relative Tag Frequency

0.1

0.01

0.001

0.0001

1

10 Tag Rank

100

Figure 4: Frequency-rank distribution for the 100 most frequently assigned tags for three resource streams from a Del.icio.us data set. as sharp but still present. The statistics of the three resource tag streams are shown in Tab. 2. Pandora and Netvibes are among the most heavily tagged resources in Del.icio.us while Googleguide represents a less famous web site in Del.icio.us. Resource pandora.com netvibes.com googleguide.com

|Y | 59,962 43,590 2,927

|U | 21,244 13,937 967

|T | 2,740 2,857 303

Table 2: Statistics of the used resource streams In [7], this drop artifact was observed for almost all sites in a data set of 500 heavily tagged sites from Del.icio.us. Thus, it is highly improbable that the drop only represents noise in the data set although there is a wider variety between the slopes of the single resource streams than what we observed for the co-occurrence streams. Instead of noise it is likely “a consistent effect of the way tagging is performed” (cf. [7]). Two possible explanations are offered in [7]: (1) It may be related to a cognitive effect during tagging (e. g. based on the average number of tags contained in a posting) or (2) it may be an artifact of the user interface specific to Del.icio.us. In Section 6.2.2 we will show, that with the help of our model we can explain the effect as being likely an artifact of the Del.icio.us user interface.

4. RELATED WORK Previous work has already developed models aimed at simulating the dynamic behavior of folksonomies. In order to come up with models close to the reality of observed folksonomies, the authors of this line of research have formalized assumptions about the tagging behavior of users into dynamic stochastic models. The qualities of these models were judged by comparing their fit to one or more of the previously described properties of tag streams. One of the earliest work in this field is the work of Golder and Huberman in [6]. They assume that there are two basic factors that influence users during assigning tags to a re-

source: On the one hand, they assume that a user of course selects a tag based on his background knowledge and the content of the resource. On the other hand, he is exposed to previous tag assignments of other users that he may simply imitate in order to reduce the effort required for assigning own tags. Nevertheless, Golder and Huberman only simulated the imitation of previous tag assignments in their model and have not modeled the influence coming from the background knowledge. Their model is a variation of the stochastic Polya urn model originally described in [5]. The model is best explained by the metaphor of an urn containing balls with different colors. In each step of the simulation, a ball is selected from the urn and then it is put back together with a second ball of the same color. After a large number of draws the fraction of the balls with the same color stabilizes but the fractions converge to random limits in each run of the simulation. Transfered to the simulation of tag streams the balls with different colors correspond to the distinct tags in a stream. The model proposed by Golder and Huberman successfully explains the emergence of stable tag frequencies which they observed for resource tag streams. But it has several shortcomings: For example, it ignores the influence coming from the background knowledge of users. Furthermore, it does not reproduce the characteristic frequency-rank distributions of co-occurrence and resource tag streams that we observed in Section 3. Instead, it leads to a plain power-law distribution of the tag frequencies without the deviations for the most frequently used tags. Finally, the Polya urn model does not explain the typical growth of the set of distinct tags. While in the original Polya urn model no new tags are invented there also exists a modification in which at each simulation step a new tag is invented and added to the tag stream with a low probability of p. The modified model corresponds to the Simon model described in [11]. But obviously, this leads to a linear growth of the distinct tags and not to the typical continuous, but declining growth that is shown in Section 3 in Fig. 1. In [3], Cattuto et al. propose a further variation of the Simon model. It takes the order of the tags in the stream into account. Like the previous models, it simulates the imitation of previous tag assignments but instead of imitating all previous tag assignments with the same probability it introduces a kind of long-term memory. It provides a fat-tailed access to the previous tag assignments, i. e. the probability of selecting a tag assignment located x steps into the past is given by a function Qt (x) that returns a powerlaw distribution of the probabilities. The Yule-Simon model with long term memory from [3] successfully reproduces the characteristic slope of the frequency-rank distribution of cooccurrence tag streams but it fails to explain the distribution in resource tag streams as well as the decaying growth of the set of distinct tags because it leads to a linear growth. The most recent model for simulating the evolution of tag streams is proposed in [7]. It is the first model which does not only simulate the imitation of previous tag assignments but it also selects tags based on their information value. The information value of a tag is 1 if it can be used for only selecting appropriate resources. A tag has an information value of 0 if it either leads to the selection of no or all resources in a tagging system. In [7], Halpin et al. empiri-

Polya urn (cf. [5, 6]) Simon model (cf. [11]) YS memory model (cf. [3]) Halpin et al. model (cf. [7]) Our model (cf. Section 5)

frequency rank co-occ. stream res. stream tag growth ◦ ◦ fixed size ◦



linear

+



linear





linear

+

+

sub-linear

Parameter I

Description Probability that a user imitates a previous tag assignment from the stream.

BK

Probability that a user selects a word from his active vocabulary. Number of the most popular tags in the previous tag stream that can be imitated. For simulating resource streams, its value corresponds to the number of visible popular tags in the Del.icio.us tagging interface. The maximal number of previous tag assignments the system uses to determine rankings for the n distinct tags. The probability that a user selects the word w from his active vocabulary associated with the topic t or the resource r. For example, p(w|t) may be approximated with the probability of observing w in a text corpus crawled for t (cf. Section 3.1).

n

Table 3: Rating of tag stream simulation models and how they reproduce the different properties of tag streams (+ = good, ◦ = medium). More details are available in the text. cally estimate the information value of a tag by retrieving the number of webpages that are returned by a search in Del.icio.us with the tag. Besides of the selection based on the information value, the model also simulates the imitation of previous tag assignments. It corresponds to a preferential attachment or Polya urn model. Overall, the proposed model leads to a plain power-law distribution of the tag frequencies and to a linear growth of the set of distinct tags. It thus only partially reproduces the frequency-rank distributions in co-occurrence and resource tag streams and it is not successful in reproducing the decaying tag growth. In Tab. 3 the different models described in this section are summarized and it is shown in how far they are able to reproduce the characteristic properties from Section 3. The basic assumption in all models is that users imitate the previous tag assignments of other users in one or the other variation of a Polya urn or Simon model. With exception of the YuleSimon model with memory all other models can only explain a power-law like distribution of the tag frequencies and not the deviation for the most frequent tags in co-occurrence and resource tag streams. Only the Yule-Simon model with memory can at least explain the deviation for co-occurrence streams. But the most obvious flaw of all previous model is their inability to explain the characteristic sub-linear tag growth. Instead of the continuous, but decaying growth of the set of distinct tags they all lead to a linear growth or, even worse, assume a fixed vocabulary size.

5. TAGGING MODEL In the following, we will propose a dynamic stochastic model which simulates the evolution of tag streams based on assumptions about the factors that influence users assigning tags. By modeling the influence factors at the micro-level of the single tag assignments we try to understand how the previously described characteristic properties on the macroscopic level of tag streams emerge. In [6] and [3] it has been observed that during tag assignment, a user is exposed to the content of the resource but also to the previous tag assignments of other users. In the following, we will present our model that consists of two components, each simulating one of the two factors that have an influence on the tag assignments of a user. We will motivate and discuss the five parameters of our simulation model

h

p(w|t) or p(w|r)

Table 4: Parameters of our tagging model

Figure 5: Tagging interface of Del.icio.us.

in detail. For ease of readability, the result of the discussion is summarized in Tab. 4. The evaluation of our model will be given in Section 6.

5.1

Imitation

The simulation of a tag stream always starts with an empty stream. Then, in each step of the simulation, it is simulated with probability I that one of the previous tag assignments is imitated. With probability BK, the user selects an appropriate tag from his background knowledge about the resource (see Section 5.2). Usually, not the whole previous tag stream is accessible. For example during posting a new resource to Del.icio.us (see Fig. 5), the user sees a set of recommended tags which is the intersection between the user’s tags and all tags already assigned to the resource (recommended tags). Furthermore, a user can see all tags that he previously attached to other resources (your tags) and the 7 most popular tags of the resource (popular tags). By clicking on any of the tags the user can easily assign it to the resource. For keeping our model as simple as possible, we will restrict it to only simulating the imitation of the most popular tags. For this purpose, we introduce two further parameters n and h which can be used for restricting the access to the previous tag stream:

5.2 Background Knowledge With probability BK, it is simulated that a tag is selected from the background knowledge about the content of the resource. It corresponds to selecting an appropriate natural language word from the active vocabulary of the user. Each word has assigned a certain probability with which it gets selected. Obviously, it is not possible to get the active vocabulary of each individual user. Instead we will simulate the active vocabulary of an average user with the help of the Web corpora (see Section 3 and Fig. 3). The probability of selecting a specific word t corresponds to the probability with which t occurs in the Web corpus. For example, for simulating the ajax tag stream we used the probabilities found in the ajax Web corpus. The background knowledge will influence the frequencyrank distribution as well as the tag growth. The simulation of the background knowledge adds tags with a certain probability (i. e. the frequency-rank distribution is influenced) but it may also add previously unknown tags if a tag gets selected the first time from the background knowledge (i. e. the tag growth is influenced).

6. EXPERIMENTAL RESULTS In this section, we will evaluate our model proposed in Section 5. For this purpose, we investigate how well our model approximates the observed distributions described in Section 3. To determine how well our model fits to the real life tag streams, we investigate to which extent the 5 parameters of our model may be fixed in a way such that our model can consistently predict the behavior of the various observed distributions. All the simulated tag streams shown in this paper as well as the software used for doing the simulations can be obtained in the web. More detailed information are available in Appendix A.

6.1 Tag Growth In this subsection, we will verify two of the assumptions made in our model: We will verify whether the selection from the background knowledge plays a role during tag assignment and whether the selection probabilities of words from

1e+05

I=0.3, BK=0.7, n=50, h=1000 I=0.6, BK=0.4, n=50, h=1000 I=0.9, BK=0.1, n=50, h=1000 Ajax Delicious Stream

10000

# of distinct tags

The parameter n represents the number of popular tags a user has access to. In case of simulating resource streams, n will correspond to the number of popular tags shown by the Del.icio.us interface (i. e. n = 7). In case of co-occurrence streams n will be larger because the union of the popular tags of all resources that are aggregated in the co-occurrence stream will be depicted over time. Furthermore, the parameter h can be used for restricting the number of previous tag assignments which are used for determining the n most popular tags. For example, for n = 7 and h = 300 only the 7 most popular tags during the last 300 tag assignments can be imitated. The probability of selecting the concrete tag t from the n tags is then proportional to how often t was used during the last h tag assignments. By combining n and h we get an effect that is comparable to the fat-tailed access of the Yule-Simon model with memory (cf. [3] and Section 4). Obviously, the imitation of previous tag assignments will never add a previously unknown tag to the stream. Thus, it will only have an effect on the frequency-rank distribution but not on the tag growth.

1000

100

10

1

1

10

100

1000 10000 # of tag assignments

1e+05

1e+06

Figure 6: Comparison between the growth rates of simulated streams and the ajax tag stream. the background knowledge have a frequency-rank distribution similar to the word occurrence probabilities in text corpora. Only if both assumptions are correct, we will be able to reproduce the characteristic tag growth rate in real tag streams with the help of our model because only these two components have an influence on the simulated tag growth rate. For testing our hypothesis we have simulated several tag streams with the help of our model and compared their tag growth with the tag growth in the real tag streams from Tab. 1. During the experiments, we fixed p(w|t) to the word occurrence probabilities in the ajax web corpus. Furthermore, we set n to a value of 50 and h to a value of 1000.4 Finally, we tested several values for I and BK. Fig. 6 shows the resulting growth rates for a value of BK between 10% and 70%. One can see that the growth rates of the simulated tag streams and the ajax stream have the same exponent. These findings can also be confirmed by looking at the growth rates of the other two tag streams from Tab. 1 (see Fig. 7). It can be seen that the growth rates of all three real tag streams are similar to the growth rates in the simulated streams. This confirms that background knowledge is suitable for modeling the tag growth rate in real tag streams. Furthermore, it confirms that the word occurrence frequencies in text corpora are a suitable approximation of the tag selection probabilities from the background knowledge.

6.2

Frequency-Rank Distributions

In this section, we will verify in how far we have realistically modeled the imitation of previous tag assignments and how it is influenced by the Del.icio.us tagging interface. Without the influence of imitating previous tag assignments and having only the influence of the background knowledge we would otherwise observe in our model frequency-rank distributions for streams that are comparable to that of p(w|t) or p(w|r). Thus, only if the imitation is correctly modeled 4 These could have been also any other arbitrary values because they do not influence the tag growth.

1e+06 0.1

0.6 < I < 0.9, 0.4 > BK > 0.1 Ajax Blog XML

1e+05

0.01

Relative Tag Frequency

# of distinct tags

10000

1000

0.001

0.0001

100

1e-05

10

1e-06

1

1e-07

1

10

100

1000 10000 # of tag assignments

Ajax Tag Stream I=0.9, BK=0.1, n=50, h=500 I=0.9, BK=0.1, n=50, h=1000 I=0.9, BK=0.1, n=50, h=10000

1e+05

1e+06

1

10

100 Tag Rank

1000

10000

Figure 7: Comparison between the growth of the set of distinct tags in simulated tag streams and the growth in the ajax, blog and xml tag stream. The lower bound of the gray area corresponds to a growth simulated with I = 0.9 and the upper bound corresponds to a simulated growth with I = 0.6.

Figure 8: Frequency-rank distributions in simulated tag streams with 90% probability of imitating a previous tag assignment and varying maximal number of tag assignments used for determining the rankings. The distribution of the ajax tag stream is shown in gray.

we will be able to reproduce the characteristic frequencyrank distributions of co-occurrence and resource tag streams. This would show that imitation also plays an important role for real users during tag assignment.

that it contains approximately 100 tags that are perceived by users as those tags characterizing the topic. In our case, this number can be used for approximating how many tags will be usually contained in the union of the top ranked tags of the different resource streams aggregated in the co-occurrence stream of blog. Because ajax and xml are narrower topics we will use for their simulation n = 50.

6.2.1

Co-occurrence Tag Streams

In the following, we will start with using our model for reproducing the characteristic frequency-rank distributions of co-occurrence tag streams (see Section 3.1). Based on our previous findings about the tag growth we can already predict the values for I, BK and p(w|t) that will lead to realistic frequency-rank distributions. For predicting appropriate values for n we will use the findings in [3] about the semantic breadth of the ajax, blog and xml co-occurrence streams. All in all, based on the assumptions in our model and previous findings we would predict that the following parameter values will lead to frequency-rank distributions of simulated co-occurrence streams that are similar to that of real streams: • I and BK: In order to reproduce the characteristic frequency-rank distribution and the correct tag growth rate of the co-occurrence streams we will have to use the parameter values that we have identified in Section 6.1. Thus, for simulating the blog and the xml stream we will have to use I = 0.6 while we will have to use I = 0.9 for simulating the ajax stream (i. e. BK has to be set to 0.4 and 0.1 respectively). • p(w|t): For each of the streams that should be simulated we will have to use the word occurrence probabilities from the appropriate web corpus, e. g. for simulating the blog stream we will use p(w|blog). • n: In [3], they experimentally estimated the semantic breadth of the three co-occurrence streams we are using. For example, for the blog stream they estimate

• h: For this parameter value, we do not have any hint that would be a realistic value according to the Del.icio.us tagging interface. But in order to verify its influence on the frequency-rank distribution we will use two good guesses for values that seem to be appropriate if such an influence exists. For this purpose, we will use h = 500 and h = 1000. Furthermore, with h = 10000 we will use a very high value that corresponds to seeing more or less the complete previous tag stream because after so many tag assignments the top ranked tags have reached a stable frequency and thus further increasing h will not lead to another ranking of tags. Thus, if the influence of h exists we would expect for h = 500 and h = 1000 to get realistic frequency-rank distributions and for h = 10000 a frequency-rank distribution that is significantly different to that of real co-occurrence tag streams. In order to be comparable with the frequency-rank distributions of the real co-occurrence streams we will also have to simulate the same number of tag assignments, e. g. for comparing the simulation with the real ajax stream we will also have to simulate a stream with 2.95 million tag assignments (cf. Tab. 1). In Fig. 8 and 9, one can see the frequency-rank distributions of the streams simulated with the above mentioned parameters and varying value for h. One can see that a

0.1

Ajax Tag Stream I=0.6, BK=0.4, n=50, h=500 I=0.6, BK=0.4, n=50, h=1000 I=0.6, BK=0.4, n=50, h=10000 0.1

0.001

0.0001

1e-05

1e-06

0.001

0.0001

1e-05

1

10

100 Tag Rank

1000

1e-06

10000

Figure 9: Frequency-rank distributions in simulated tag streams with 60% probability of imitating a previous tag assignment and varying maximal number of tag assignments used for determining the rankings. The distribution of the ajax tag stream is shown in gray. high value of h leads to a sharp drop in the frequency-rank distributions while with lowering h it gets closer to the distribution in the background knowledge. By comparing Fig. 8 and 9 one can see that the effect of h is increased with a higher probability of imitating a previous tag assignment. The results confirm that the parameter h is relevant for explaining the frequency-rank distributions because if there isn’t such an influence we would get a slope that is similar to that for h = 10000, i. e. there would be a sharp drop. Furthermore, according to the slopes in Fig. 8 it seems to be more appropriate to simulate co-occurrence streams with h = 1000 instead of h = 500. In Fig. 10 and 11, we show the influence of the parameter n on the slope of the frequency-rank distribution, i. e. how it is influenced by a different semantic breadth represented in the simulated co-occurrence stream. One can see that with an increased semantic breadth we predict more flattened slopes of the frequency-rank distribution. This effect is further increased with a higher probability of imitating a previous tag assignment. In Fig. 12, we show the frequency-rank distribution of simulations that were done with the previously predicted parameter values. Each of the simulated slopes is directly compared with the corresponding frequency-rank distribution of the real co-occurrence streams from Tab. 1. In all three cases, we have successfully reproduced the frequencyrank distribution of the real co-occurrence streams with the help of the parameter values that we predicted based on the assumptions in our model. Furthermore, we used for each of the distributions the same parameter values for BK and p(w|t) with which we also reproduced their tag growth rate in Section 6.1.

1e-07

1

10

100 Tag Rank

1000

10000

Figure 10: Frequency-rank distributions in simulated tag streams with 90% probability of imitating a previous tag assignment and varying number of top-ranked tags the user can see from the previous tag assignments. The distribution of the ajax web corpus is shown in gray.

0.1

Ajax Tag Stream I=0.6, BK=0.4, n=50, h=1000 I=0.6, BK=0.4, n=100, h=1000 I=0.6, BK=0.4, n=250, h=1000

0.01

Relative Tag Frequency

1e-07

Ajax Tag Stream I=0.9, BK=0.1, n=50, h=1000 I=0.9, BK=0.1, n=100, h=1000 I=0.9, BK=0.1, n=250, h=1000

0.01

Relative Tag Frequency

Relative Tag Frequency

0.01

0.001

0.0001

1e-05

1e-06

1e-07

1

10

100 Tag Rank

1000

10000

Figure 11: Frequency-rank distributions in simulated tag streams with 60% probability of imitating a previous tag assignment and varying number of top-ranked tags the user can see from the previous tag assignments. The distribution of the ajax web corpus is shown in gray.

0.1 Blog (I=0.6, n=100) Blog Delicious Stream Ajax (I=0.9, n=50) Ajax Delicious Stream XML (I=0.6, n=50) XML Delicious Stream

0.01

0.01 Relative Tag Frequency

Relative Tag Frequency

0.001

0.0001

1e-05

http://www.netvibes.com/ I=0.6, BK=0.4, n=7, h=300 I=0.9, BK=0.1, n=7, h=300

0.1

0.001

0.0001

1e-06

1e-05

1e-07

1e-08

1e-06

1

10

100

1000 Tag Rank

10000

1e+05

Figure 12: Comparison between the frequencyrank distributions of the three co-occurrence tag streams from Tab. 1 and corresponding simulated tag streams. I was set to 60% probability for simulating the blog and xml stream to take their higher tag growth rate into account (cf. Fig. 7). In order to be able to show all six curves in one graph, we shifted down the relative tag frequencies of the ajax and xml stream by respectively one and two decades so that they don’t overlap each other.

6.2.2

Resource Tag Streams

For the simulation of resource tag streams we will have to use other parameter values. According to our assumptions made in the model and the previous findings of Section 6.1 and 6.2.1 we would predict that the following parameter values have to be used for simulating resource tag streams: • I and BK: Like for the tag growth and the frequencyrank distribution in co-occurrence streams we would expect that for I values in the range between 60% and 90% are also appropriate for simulating resource tag streams because the probabilities of this basic user decision during tag assignment are independent from cooccurrence or resource streams. • p(w|r): In opposite to p(w|t), we will not get a good approximation of p(w|r) by the word occurrence probabilities in r because the content of a single resource r will not contain enough words for getting stable frequencies and ranks. Thus, we will in the following approximate p(w|r) with an appropriate topic specific distribution p(w|t). • n: For this parameter we will use n = 7 because in the Del.icio.us tagging interface the user sees the 7 most popular tags for imitation. • h: In a single resource stream less previous tag assignments will be used for determining the rankings than in co-occurrence streams because the latter are aggregations of several resource streams. Thus, we would expect a value for h that is less than that used for

1

10

100

1000

Tag Rank

Figure 13: Comparison between the frequencyrank distributions of the resource tag stream for http://www.netvibes.com/ and of tag streams simulated with n = 7. As values for I we used 60% and 90% probability. The distribution p(w|netvibes.com) is approximated by p(w|ajax). simulating co-occurrence streams. Thus, in our experiments we will use h = 300. Because of the unexact approximation of p(w|r) and because of the wider variance in the frequency-rank distributions in real resource streams we will not be able to exactly reproduce the frequency-rank distribution of a single resource stream. Instead, we can only expect to reproduce the most characteristic property of resource streams, namely the sharper drop between rank 7 and 10 in the frequencyrank distribution that we observed in Section 3.2 and that was also observed by Halpin et al. in [7]. In Fig. 13, we show the frequency-rank distribution that we predicted with our model for the tag stream of the resource http://www.netvibes.com/. The website of Netvibes.com makes heavy use of AJAX technology and two most often used tags are web2.0 and ajax. Thus, we approximated the distribution of p(w|netvibes.com) with the distribution of p(w|ajax). As it was expected, we do not get an overlap between the simulated and the real tag stream that is as good as for the co-occurrence tag streams in Section 6.2.1 but we can still see the most characteristic sharp drop in the frequency rank distribution between rank 7 and 10. Thus, it is likely that this sharp drop can be explained with the restriction of the Del.icio.us tagging interface where the user only sees the 7 most frequent tags during posting a new bookmark. Nevertheless, only further experiments with a more accurate approximation of p(w|r) may bring further evidence for this hypothesis.

7.

CONCLUSIONS AND FUTURE WORK

This paper has presented a dynamic, generative model of folksonomies that can be used for simulating the evolution of tag streams. It distinguishes between two basic ways a user may assign tags: (1) he may imitate a previous tag

assignment from the stream or (2) he may choose a word from his active vocabulary that is related to the content of the resource. In an evaluation, we have successfully shown that both factors play an important role for real users during tag assignment. The evaluation suggests that the imitation rate during tag assignment will be between 60% and 90%. Furthermore, with the tag growth and the frequency-rank distribution in co-occurrence streams as well as with the frequency-rank distribution in resource streams we have identified three different characteristic properties of tag streams. During our evaluation we have also shown that our model is the first one known in the literature that can be used for consistently predicting all three of the characteristic properties (cf. Tab. 3). Especially, it is the first model that successfully reproduces the sub-linear tag growth. In the future, we will do a more in depth evaluation of the model and of the characteristic properties of tag streams. For example, it still has to be evaluated in how far the observed frequency-rank distributions really obey to a powerlaw or whether they are better explained by other distributions with a long tail like log-normal distributions (cf. [4]). Furthermore, we plan to extend the evaluation to additional tag streams in order to get more generalizable results. There is a lot of ongoing research in which the co-occurrence probabilities of tags are exploited, e. g. for finding relationships between tags or for learning tag clusters. But our research suggests that the co-occurrence probabilities do not only express semantics coming from the background knowledge of users but that they are also influenced by the random process of imitating previous tag assignments. Especially the restriction to imitating only the top-N tags makes the co-occurrence probabilities in a tagging system dependent on the previous state of the system. In this case, our model predicts that the tag co-occurrence probabilities for the same resource (e. g. the URL http://www.netvibes.com/) in two independent tagging systems with comparable user communities will not converge to the same values due to the random imitation process. This will also influence the ranking of a tag attached to the same resource in independent but comparable tagging systems, i. e. there will be a deviation of the concrete rankings in one tagging system from the mean rankings in all tagging systems. Our model can be used for calculating the expected typical deviation in dependency on the imitation rate.

8. ACKNOWLEDGMENTS This work has been supported by the European project Semiotic Dynamics in Online Social Communities (TAGora, FP6-2005-34721).

9. REFERENCES [1] A. Baldassarri, C. Cattuto, K. Dellschaft, V. Loreto, V. Servedio, and G. Stumme. Theoretical tools for modeling and analyzing collaborative social tagging systems – a stream view. Deliverable D4.1, TAGora Project, 2007. http://www.tagora-project.eu.

[2] C. Cattuto, A. Baldassarri, V. Servedio, and V. Loreto. Vocabulary growth in collaborative tagging systems. Arxiv e-print, 2007. http://arxiv.org/abs/0704.3316. [3] C. Cattuto, V. Loreto, and L. Pietronero. Semiotic dynamics and collaborative tagging. Proceedings of the National Academy of Sciences, 104(5):1461, 2007. [4] A. Clauset, C. R. Shalizi, and M. E. J. Newman. Power-law distributions in empirical data. Arxiv e-print, 2007. http://arxiv.org/abs/0706.1062. ¨ [5] F. Eggenberger and G. Polya. Uber die Statistik verketteter Vorg¨ ange. Zeitschrift f¨ ur Angewandte Mathematik und Mechanik, 1:279–289, 1923. [6] S. Golder and B. Huberman. The structure of collaborative tagging systems. Journal of Information Science, 32(2):198–208, 2006. [7] H. Halpin, V. Robu, and H. Shepard. The complex dynamics of collaborative tagging. In Proceedings of 16th World Wide Web Conference WWW-2007, 2007. [8] A. Mathes. Folksonomies – cooperative classification and communication through shared metadata, December 2004. http://www.adammathes.com/academic/computermediated-communication/folksonomies.html. [9] M. Montemurro and D. Zanette. Frequency-rank distribution of words in large text samples: Phenomenology and models. Glottometrics, 4:87–98, 2002. [10] C. Schmitz, A. Hotho, R. J¨ aschke, and G. Stumme. Mining association rules in folksonomies. In Proc. IFCS 2006 Conference, 2006. [11] H. Simon. On a class of skew distribution functions. Biometrika, 42(3/4):425–440, 1955.

APPENDIX A.

DATA SETS

At http://isweb.uni-koblenz.de/Research/Tagdataset we provide the co-occurrence tag streams from Del.icio.us that we used in our experiments and that we extracted from a larger data set this larger data set. The larger data set contains information about approximately 140M tag assignments, 19M resources, 2.5M tags and 700K users. The tag streams are provided in an aggregated form in which the information about the date and the user are removed from the tag assignments. The correct order of the tag assignments is preserved in the data set and we replaced all resource URLs with artificial identifiers. Furthermore, we provide on the web site a list of the URLs that were crawled for the three web corpora used in this paper. For each of the web corpora, we also provide the word occurrence probabilities that are shown in Fig. 3. Finally, the web site also contains for download all the simulated tag streams that were used throughout this paper and the software used for generating the streams.

Suggest Documents