Random graph generative model for Folksonomy network structure ...

2 downloads 0 Views 349KB Size Report
Most social structures are represented as unipartite ... An example of such structure is a Folksonomy: a tuple of connections among users, resources and tags.
Procedia Computer Science

ProcediaComputer Computer Science Science 00 (2009) 000±000 Procedia 1 (2012) 1683–1688

www.elsevier.com/locate/procedia www.elsevier.com/locate/procedia

International Conference on Computational Science, ICCS 2010

Random graph generative model for Folksonomy network structure approximation 6]\PRQ&KRMQDFNL0LHF]\VáDZ.áRSRWHN* Institute of Computer Science, Polish Academy of Sciences, ul. Ordona 21, 01-237 Warsaw, Poland

Abstract The analysis of social networks has received much attention in recent years. Most social structures are represented as unipartite graphs or bipartite affiliation networks. However, more complex topologies are becoming popular within social networking community. An example of such structure is a Folksonomy: a tuple of connections among users, resources and tags. An intuitive way to represent a Folksonomy is a three-mode hypergraph. It has been shown that in such graphs a clustering coefficient decreases slowly over time at very high level and this property is unachievable for simple random hypergraphs. In this article we represent a Folksonomy as a tripartite graph. This small change of perspective enables us to divide graph generation process into two steps and adapt algorithms used for bipartite graph generation at each step. As a result we obtain iteratively graphs that reflect both dynamics and high level of clustering coefficient. c 2012 Published by Elsevier Ltd. Open access under CC BY-NC-ND license. ⃝ Keywords: social netowrks, random graphs, Folksonomy

1. Introduction Tagging digital resources is an important part of mass collaboration. Users of Web 2.0 portals participate in content creation by bookmarking and labeling digital resources (e.g. photos, videos, web sites, blogs or publications). We are interested in a situation when labeling is limited to assigning tags to resources. There are numerous knowledge management portals which core functionality is to store and share tagged resources (e.g. Delicious, Simpy, CiteULike or BibSonomy). Clouds of tags can be seen on virtually any website, but we can name it a Folksonomy only if the tags may be personalized. The interest of machine learning community in this kind of data was fuelled by last two ECML/PKDD Discovery Challenges [1]. On one hand supervised predictive and classification algorithms enable us to solve well defined problems and easily compare results. On the other hand it is appealing for a researcher to understand descriptive features of analyzed data and build a theoretical model that has similar properties as those observed in reality. The confluence of these two streams has been proposed [2] to strengthen models simulating graph evolution.

* Corresponding author. Tel.: +48-22-836-2841; fax: +48-22-837-6564. E-mail address: [email protected].

c 2012 Published by Elsevier Ltd. Open access under CC BY-NC-ND license. 1877-0509 ⃝ doi:10.1016/j.procs.2010.04.188

1684

S. Chojnacki, M. Kłopotek / Procedia Computer Science 1 (2012) 1683–1688 Author name / Procedia Computer Science 00 (2010) 000±000

The results of exploratory analysis in the field of social networks have lead to introduction of random algorithms that create networks with observed features. Early work in this area dates back to 1967 and a small diameter property of social networks [3]. More recent results reveal additional characteristics of social networks: heavy tailed degree distributions [4], densification of links within a network [5] or existence of a middle region [6]. An attempt to reflect relations observed in real data has lead to creation of new algorithms: preferential attachment model [4], forest fire model [5], Kronecker graph model [7], edge copying model [8] or winners do not take all model [9]. The article is organized as follows. In the second section we describe the structure and the content of a dataset as well as definition of a clustering coefficient (which is referred to as cliquishness coefficient in graphs induced from folksonomies [10]). In the third section we describe in detail our algorithm and compare it to other algorithms used in this domain. We also point out modifications made to original bipartite graph generation algorithms [11]. The fourth section contains empirical results obtained from our algorithm. In the final section we discuss both advantages of our approach and its limitations. 2. Dataset In order to explore network structure of a Folksonomy we used data from BibSonomy portal delivered as a training set to ECML/PKDD Discovery Challenge 2009. The data contained two types of resources i.e. websites and BibTeX entries. We filtered the data and kept only tag assignments done to websites manually by users. We also removed one outlier user. The data contained DVQDSVKRWRIDOODFWLYLW\RI%LE6RQRP\XVHUV¶XQWLOWKHHQGRI After the transformations we had 191 367 nodes in the dataset (2 658 users, 138 107 websites and 50 602 tags) and 621 749 tag assignments. This type of data structure (containing assignments of tags by users to resources) is usually named a Folksonomy. Formally, a Folksonomy is a tuple F : (U , T , R , Y ) , where U (users), I (tags) and R (resources) are finite sets [12]. Y stands for a ternary relation between them Y Ž U u T u R . According to this model we can we can express certain properties of folksonomies. The number of tag assignment (tas) of a given user u is: Y ˆ {u } u T u R , the number of users which have tagged resource r with tag t is: Y ˆ U u {t } u { r } . If we define a set of tags attached by user u to resource r by Tur : {t  T | (u , t , r )  Y } , then a post is (u , Tur , r ) .

Fig. 1. Graph representation of a Folksonomy: hypergraph (left), tripartite graph (right)

A hypergraph and a tripartite graphs contain the same set of vertices V U ‰ T ‰ R . The difference between two representations is visible for edges (Fig. 1.) An edge in a hypergraph has three endings (one of each type: user, tag, resource) and is called a hyperedge. In case of our tripartite graph G (V , E ) comprising a set of nodes V and a set of edges E, we have an edge connecting two nodes if there exists a post containing the two nodes e.g. {u , r }  E if  tT (u , t , r )  Y . Degree distributions calculated for a tripartite graph have an intuitive interpretation (Fig. 2.) and belong to a family of highly skewed distribution. On average a node of type tag is connected to 12.05 nodes of type resource (one frequent tag is assigned to 15 794 websites). One tag is used by 321 users, but expected degree value is 2.82. Average degree values for URL vs. tag, URL vs. user, user vs. tag and user vs. URL are 4.04, 1.15, 54.53 and 65.97 respectively. For example value 54.53 means that usually one user has used over 54 different tags to label bookmarked resources. We also present a degree distribution of pairs (user and URL) vs. tag, which has an average degree value of 3.88, which means that on average almost four tags are used to label a URL by one user. This distribution is used in second step of our graph generative algorithm.

1685

S. Chojnacki, M. Kłopotek / Procedia Computer Science 1 (2012) 1683–1688 Author name / Procedia Computer Science 00 (2010) 000±000 Degree distribution of TAG type nodes vs URL type nodes

Degree distribution of TAG type nodes vs USER type nodes

100000

100000 35349

27472 10000 number of nodes (log scale)

number of nodes (log scale)

10000

1000

100

10

1000

100

10

15794

1 1

10

100

1000

10000

321

1 100000

1

10

node's degree (log scale)

100

1000

node's degree (log scale)

Degree distribution of URL type nodes vs TAG type nodes

Degree distribution of URL type nodes vs USER type nodes

100000

1000000 31682

number of nodes (log scale)

number of nodes (log scale)

140563

100000

10000

1000

100

10000

1000

100

10

10 62

101

1 1

10

1

100

1000

1

10

node's degree (log scale)

100

node's degree (log scale)

Degree distribution of USER type nodes vs TAG type nodes

Degree distribution of USER type nodes vs URL type nodes

1000

10000

number of nodes (log scale)

number of nodes (log scale)

326

100

10

1

10

100

1000

10000

100

10

15371

10451

1

1217

1000

1 100000

1

10

node's degree (log scale)

100

1000

10000

100000

node's degree (log scale)

Degree distribution of USER-URL pairs vs TAG type nodes 100000 31199

number of nodes (log scale)

10000

1000

100

10

1 1

10

100

1000

node's degree (log scale)

Fig. 2. Degree distributions of nodes and one virtual node in a tripartite graph. Tag vs. resource/URL (top left), tag vs. user (top right), resource vs. tag (second row left), resource vs. user (second row right), user vs. tag (third row left), user vs. resource (third row right), virtual node of a pair (user and URL) vs. tag (bottom). The single value at the left side of each chart gives the number of nodes that have degree one, the value at the right gives the maximum degree.

1686

S. Chojnacki, M. Kłopotek / Procedia Computer Science 1 (2012) 1683–1688 Author name / Procedia Computer Science 00 (2010) 000±000

It is important to mention the fact that we compare values of a clustering coefficient for two different graph representations. It could be a questionable approach if we were comparing e.g. path lengths, but we fallow a definition of a clustering coefficient (called cliquishness) proposed in [10] that is transparent for both representations. Clustering coefficient is a measure based on an analysis of a triadic census ± characteristic for exponential random graphs. For every node it gives a probability that if two other nodes are neighbors of a selected node than they are also neighbors. In unipartite graphs the coefficient is calculated by dividing a number of full triads by n ˜ ( n  1) / 2 (where n is a number of neighbors) [13]. However, in case of a tripartite graph the number of possible pairs between neighbors is lower and a slightly different denumerator is used. Later on we give formal definition of a clustering coefficient (cliquishness) for a node of type tag. Lets define a set of all users that used tag t as U t {u  U ;  r : (t , u , r )  Y } and a set of resources annotated with tag t as R t {r  R ;  u : (t , u , r )  Y } , lets also define a set of pairs of user and resource that are connected together and to a tag t as ur t {( u , r )  T u U ; ( t , u , r )  Y } . Following this notation we calculate clustering coefficient as in Eq. 1. J (t )

ur t U t ˜ Rt

(1)

The coefficient gets values in closed interval [0,1]. However the way we construct a graph restricts the values of the coefficient to left-side open interval (0,1]. 3. Algorithm Two algorithms were considered for random Folksonomy resembling hypergraph generation in [10]. The first DOJRULWKP FDOOHG %LQRPLDO DOJRULWKP ZDV EDVHG RQ FODVVLF UDQGRP JUDSK JHQHUDWLRQ SURSRVHG E\ (UG|V [13]. The number of three types of nodes (|U|, |R| and |T|) and edges |Y| was equal to real numbers in BibSonomy. Ends of hyperedges were obtained from a uniform distribution. The second algorithm, called Permuted, contained LQIRUPDWLRQ DERXW QRGHV¶ GHJUHe distributions. The random graph was constructed from a real Folksonomy by permuting independently nodes between edges with accordance to Knuth Shuffle method [14]. The average clustering coefficients calculated for graphs obtained from both algorithms were significantly lower than in real data. The coefficient for Binomial graph was at the level of 30%, the coefficient for Permuted graph was at the level of 60%, whereas the coefficient in a real network is at the level exceeding 90%. Moreover we could observe a decreasing trend in real data over time, which was not met by the two algorithms. The Binomial algorithm generated growing trend, in case of Permuted algorithm decreasing values of the coefficient at the beginning of graph generation were followed by a period of increasing trend. Our algorithm consists of two steps. In each step we follow a procedure used for bipartite graph generation [11]. In the first step we add iteratively one node of type resource to the graph. Then we sample degree distribution for this node from the resource versus user distribution (Fig. 2.). For each edge we draw a random value from uniform distribution and if it is lower than Ȝ1 we join an edge with a new user, otherwise we pick a user from already existing nodes following preferential attachment principle [4]. In the second step we perceive every pair of user and resource generated in the first step as one object (virtual node) and repeat the procedure against the nodes of type tag i.e. we sample a degree from user-URL versus tag distribution (Fig. 2.) and for every triangle generate random number to FRPSDUH LW ZLWK Ȝ2 and decide whether to add a new tag or join an existing one based on preferential attachment principle. We follow the original preferential attachment principle, in which a probability that a node will be selected equals to its degree divided by the number of all edges in a network. We do not need to perform Lagrange smoothing, as our network does not contain isolate nodes. We modified the formula to calculate Ȝ ,Q RULJLQDO approach [10] it would be calculated in our first step as 1  R / U , instead we used 1 / E (deg U R ) , where E (deg U R ) stands for expected value of degree distribution of user node versus nodes of type resource. This modification guarantees that the proportion among the number of different nodes will be preserved over time. We PRGLILHGWKHZD\ZHFDOFXODWHGȜ2 analogously. Both steps of our algorithm are presented in Fig. 3.

1687

S. Chojnacki, M. Kłopotek / Procedia Computer Science 1 (2012) 1683–1688 Author name / Procedia Computer Science 00 (2010) 000±000

Fig. 3. One iteration of random graph generation. A new resource is added and its degree sampled (left), for every edge decision whether to join an existing node or a new user is made (middle), each pair (user-resource) from step one is interpreted as a virtual node and is joined to new or existing tags.

The value of Ȝ1 was equal to 0.015159 (=1/65.97). As we did not calculate explicit distribution for tag versus user-UHVRXUFHSDLUVZHREWDLQHGH[SHFWHGYDOXHRIWDJ¶VGHJUHHYHUVXVSDLUXVHU-resource as a solution of Eq. 2. UR ˜ E (deg UR T )

(2)

T ˜ E (deg TUR )

In the above equation |UR| represents the number of user-resource pairs (=175 405), E (deg UR T ) stands for average degree of a user-resource versus tag distribution (=3.88), |T| stands for the number of tags (=50 602). As a result E (deg TUR ) equals to 13.08 and Ȝ2 was at the level of 0.0764. 4. Experiments

100%

100%

80%

80%

Clustering Coefficient

Clustering Coefficient

We ran the above described graph generation algorithm twenty times. In every simulation we did 10 000 iterations. After every one thousand iterations we calculated average clustering coefficient for all nodes in a current state of a graph and also separately for three types of nodes (users, resources and tags). The results of these experiments show (Fig. 4.) that our algorithm creates graphs that meet two main observations in a real dataset: average clustering coefficient for the whole network is above the level of 90% and this value decreases over time.

60%

40%

20%

60%

40%

20%

01-2009

06-2008

01-2008

06-2007

01-2007

06-2006

01-2006

06-2005

01-2005

06-2004

01-2004

0% 0% 0

1

2

3

4

5

6

7

8

9

10

11

Iteration * 1 000

USERS

RESOURCES

TAGS

ALL NODES USERS

RESOURCES

TAGS

ALL NODES

avg(USERS)

avg(TAGS)

Fig. 4. Average values of a clustering coefficient over time in real data (left). Average values of a clustering coefficient over a batch of 1 000 iterations for 20 independent simulations (right).

Calculating the clustering coefficient separately for every type of the nodes enabled us to observe a new counter intuitive feature in the data. It turns out that an average clustering coefficient is decreasing over time only in case of

1688

S. Chojnacki, M. Kłopotek / Procedia Computer Science 1 (2012) 1683–1688 Author name / Procedia Computer Science 00 (2010) 000±000

two types of nodes, in case of users the coefficient is increasing. The feature was reflected by our algorithm, but except for empirical results we are currently unable to present a theoretical explanation for this property. 5. Conclusions In the article we have focused on one of the most important measures used to describe structural properties of social networks ± a clustering coefficient. The empirical observations confirmed earlier results that the clustering coefficient is at a very high level in graphs induced from a Folksonomy created from the data in the BibSonomy bookmarking portal. The value of the coefficient decreases over time. We proposed a new algorithm to generate random tripartite graph, the properties of an artificial graph resemble the properties of a real one. An important drawback of our approach is a fact that we need to utilize information about degree distributions from a benchmark graph. As a result we obtain iteratively a graph with desired characteristics, but when the graph becomes similar to the referenced one we do not observe any evolution that would enable us to predict what might happen with the structure of the referenced graph in the future.

References )(LVWHUOHKQHU$+RWKR5-lVFKNH HGV (&0/3.'D Discovery Challenge 2009, CEUR-WS.org, (2009). 2. J. Leskovec, Dynamics of large networks, PhD Dissertation, Machine Learning Department, Carnegie Mellon University, Technical report CMU-ML-08-111, (2008). 3. S. Milgram, The small-world problem. Psychology Today, 2:60±67, (1967). 4. A.-L. Barabasi and R. Albert, Emergence of scaling in random networks. Science, 286:509±512, (1999). 5. J. Leskovec, J. Kleinberg, C. Faloutsos, Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), (2005). 6. R. Kumar, J. Novak, and A. Tomkins, Structure and evolution of online social nHWZRUNV,Q.''¶ Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 611±617, (2006). 7. J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, Realistic, Mathematically Tractable Graph Generation and Evolution, Using Kronecker Multiplication, European Conference on Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), (2005). 8. R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal, Stochastic models for WKHZHEJUDSK,Q)2&6¶ Proceedings of the 41st Annual Symposium on Foundations of Computer Science, page 57, (2000). 9. '03HQQRFN*:)ODNH6/DZUHQFH(-*ORYHUDQG&/*LOHV:LQQHUVGRQ¶WWDNHDOO&KDUDFWHUL]LQJWKHFRPSHWLWLRn for links on theWeb. Proceedings of the National Academy of Sciences, 99(8): 5207±5211, (2002). 10. C. Schmitz, M. Grahl, A. Hotho, G. Stumme, et al, Network Properties of Folksonomies, ,Q:::¶ Proceedings of the 16th International conference on World Wide Web, pages 568±576, (2007). 11. J-L. Guillaume and M. Latapy, Bipartite structure of all complex networks, Information Processing Letters 90 (2004) 215±221. 12. R -lVFKNHHWDO7DJ5HFRPPHQGDWLRQVLQ)RONVRQRPLHV/1&6(2007) 69. 13. D. J. Watts and S. 6WURJDW]&ROOHFWLYHG\QDPLFVRI¶VPDOO-ZRUOG¶QHWZRUNV1DWXUH 393:440±442, June (1998). 14. D. E. Knuth. The Art of Computer Programming, Volume II: Seminumerical Algorithms, 2nd Edition. Addison-Wesley, (1981).

Suggest Documents