FCA-based Approach for Mining Contextualized Folksonomy

Hak Lae Kim
Digital Enterprise Research Institute, IDA Business Park, Galway, Ireland

Suk Hyung Hwang
Div. of Computer & Information Science, Sun Moon University, 100 Kal-San-Ri, Tangjeong-myeon, Asan-si, Chungnam, Korea

Hong Gee Kim
Biomedical Knowledge Engineering Laboratory, Seoul National University, Yeongeon-dong, Jongro-gu, Seoul, Korea


ABSTRACT

We present a novel approach to building contextualized folksonomies and concept hierarchies from the tags of the blogosphere, based on Formal Concept Analysis. Our approach rests on the assumption that if a blog has relationships with other blogs, they will use a similar set of tags. We collect sample data from the blogosphere at random and then build the concept hierarchies on the basis of the inclusion relations (tags) between the extensions (bloggers). We propose a formalization of the contextualized folksonomy in terms of Formal Concept Analysis and show how our approach can be used to create a contextualized folksonomy for the blogosphere. We evaluate our approach on existing tags from the blogosphere.

Categories and Subject Descriptors

D.3.1 [Formal Definitions and Theory]: Semantics, Syntax; H.5.4 [Hypertext/Hypermedia]: Theory; J.4 [SOCIAL AND BEHAVIORAL SCIENCES]: Sociology

General Terms

Folksonomy, Clustering, Tagging etc

Keywords

Folksonomy, Formal Concept Analysis, Tagging

1. INTRODUCTION

The blogosphere has recently become one of the most dynamic areas of the Web [2]: blogs have emerged as a means of decentralized personal publishing and of social sharing within communities. Much research has focused on the community perspective. It also addresses new issues relevant to tagging and folksonomy, which are a recently popular phenomenon in the blogosphere. Tags are keywords attached to resources such as bookmarks, images, and blog entries. The process of attaching them is known as tagging. Tags attached to blog entries are easy to find on many popular blogs. Technorati, a blog search engine, offers a folksonomy service that provides the popular tags on general topics. It does not, however, support domain-specific vocabularies.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SAC’07 March 11-15, 2007, Seoul, Korea Copyright 2007 ACM 1-59593-480-4 /07/0003 ...$5.00.

Several studies report that patterns of tag usage reflect a user's mental model [4, 8, 9, 13]. Christopher and Nancy [12] propose a formal model of folksonomies as a set of triples and demonstrate results on a folksonomy dataset obtained with an association rule mining technique. They describe the associations between the tags assigned to resources. Christopher and Nancy [2] analyze the effectiveness of tags for classifying articles by gathering the top 350 tags from Technorati. They show that tags are useful for grouping articles into broad categories, but less effective in indicating the particular content of an article. So we can conjecture that if blogger B_A has a tag collection in her blog similar to that of blogger B_B, the two might be interested in similar topics. In that sense, they can share their tag collections together with their resources and can be connected to each other as a group based on tags.

Our approach in this paper is to build contextualized folksonomies that provide a shared meaning of tags and guide the use of tags. The term contextualized folksonomy refers to a shared collection of tags extracted from blogs that belong to the same group, rather than from centralized social systems. It provides context-centric tag usage to grouped communities, and it recommends to the members of the group which shared terms to use when tagging. If the members can refer to the group folksonomy, they can avoid ambiguous words, such as synonyms and underscore-joined words, when describing resources.

The main experiments on folksonomies and related topics are currently carried out on centralized online resource sharing systems such as del.icio.us [3], Technorati [14], or Flickr [5]. Those systems provide their own APIs for access. Use of those APIs is, however, restricted, e.g., to the top 100 tags. In our approach it is much more difficult to obtain sample data than in previous research, since most bloggers use different management tools and blogs are inherently decentralized. We visited each blog and collected its tags manually. As far as we know, the ideas presented in this paper have not been explored before, although there is a lot of recent work dealing with folksonomies. The main contributions of this work are as follows. First, we propose a formal model of group folksonomies. Second, we analyze the popular tags through an in-depth study of tag frequency and the patterns produced by blogs. Third, we build the group folksonomies.

This paper is organized as follows. In Section 2 we present the strengths and weaknesses of folksonomies and discuss why group folksonomies are important in the blogosphere. In Section 3 we briefly describe the definition of FCA and discuss the methods applied in our experiments. In Section 4 we describe the experiments conducted on the collected data and report on their results. Finally, we sum up our findings and future work in Section 5.

2. BACKGROUND

2.1 State of the Art of Folksonomy

In this section, we define what exactly a folksonomy means. A folksonomy is a user-generated classification, emerging through bottom-up consensus [15], and provides an approach to addressing Web-specific classification issues. The first use of the term folksonomy, a fusion of the words folks and taxonomy, has been attributed to Thomas Vander Wal. In general terms, it is a set of tags, each consisting of one or more keywords. Users can instantly add terms to the folksonomy as they become necessary for a single unit of content. The interesting observation is that when users tag in a public space, the collection of their keyword/value associations becomes a useful source for navigation. The most commonly cited folksonomies in action are web sites such as Flickr, del.icio.us, and Furl. Those sites allow people to store their digital images and bookmarks and to share them with friends or the general public. The principle of folksonomy is sharing: without a socially shared environment, tags created by a user are just flat keywords. Emanuele [11] describes the key points of folksonomy: it "is explicit, can be aggregated, produces benefits to user, and is relevant to the purpose of a site."

However, free-form tagging and folksonomies have problems. There are no relationships and no formal inclusion criteria [7] between the terms. In a controlled vocabulary such as a thesaurus, a taxonomy, or even an ontology, by contrast, the relationships among terms are defined explicitly. A thesaurus has three types of relationships: narrower, broader, and related terms. A taxonomy has collections, classes, and instances as relationships. A controlled vocabulary also has formal rules that define the relationships of objects. A folksonomy has only groups of instances labeled with a tag, without a formal definition of, or relationships among, terms. So we cannot identify relationships among the groups. Furthermore, there are critical issues in sharing folksonomies [10]. All folksonomies exist in vendor-specific formats and are stored on vendor sites. There is currently no format for sharing or even expressing one's explicit understanding of the meaning of his/her own folksonomy tags. These problems also occur in the blogosphere at present, and are even worse there. Although blogs have implicit or explicit social relationships with each other and bloggers want to share their tags, there is no way to do so. In the following section, we discuss how we can solve this problem of the blogosphere using the group folksonomy.

2.2 Motivation

Applying social network analysis methods to the blogosphere has revealed interesting findings about how individuals share information and how they interact socially in online environments [1]. A blog has three types of social relationships: blogroll links, citation links, and comment links. The linking patterns of blogs can be used to predict paths of information flow through the blogosphere. A blogroll is a collection of links to other blogs and ranges from matters of common interests¹. In addition, a user's tag collection might be a starting point for connecting bloggers who handle similar topics in their blogs. Bloggers' contributions have produced an overwhelming number of tags in the blogosphere. Many bloggers create tags to organize their content when they publish items. The result can be seen as a folksonomy, a network of concepts within the same namespace. However, a folksonomy does not provide specific vocabularies for a certain domain, and it is of limited use for categorizing a group.

What is the problem with forming a group of blogs using a folksonomy? First of all, we have no methods or standards for sharing the tags of blogs. All folksonomies exist on vendor sites. For example, we store our bookmarks in del.icio.us and our photos in Flickr. Technorati builds a folksonomy for its subscribers, but it does not support any way to group the participants. This leads to our research question: how can a folksonomy help bloggers to build a group? We hypothesize that if a blog has a relationship with others, they will use a similar set of tags, or, conversely, that if multiple bloggers use the same tags in their blog entries, they are interested in similar subjects.

Let a blogger b1 tag a resource r1 using a set of tags s1 = {t1, t2, t3}, a blogger b2 tag a resource r2 using a set of tags s2 = {t1, t4, t5}, and a blogger b3 tag a resource r3 using a set of tags s3 = {t1, t4, t6}, and let r1, r2, and r3 be about similar topics². The bloggers b1, b2, and b3 all assign tag t1 to similar topics, and b2 and b3 both assign tags t1 and t4 to the resources r2 and r3. So we can imagine the contextualized folksonomies (CF) as follows: CF({b1, b2, b3}) = {t1}, CF({b2, b3}) = {t1, t4}. This means that b1, b2, and b3 share a common pattern with tag t1, and that b2 and b3 share an even closer common pattern with tags t1 and t4 when they are tagging. Figure 1 shows the contextualized folksonomies.

Figure 1: An example of contextualized folksonomies
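The CF example of Section 2.2 can be reproduced with a few lines of Python. This is only a minimal sketch under the assumption that CF of a group is the set of tags used by every blogger in that group; the names `TAGS_BY_BLOGGER` and `contextualized_folksonomy` are hypothetical and do not come from the paper or from any tool.

```python
# Tag sets of the three bloggers from the example in Section 2.2.
TAGS_BY_BLOGGER = {
    "b1": {"t1", "t2", "t3"},
    "b2": {"t1", "t4", "t5"},
    "b3": {"t1", "t4", "t6"},
}


def contextualized_folksonomy(bloggers, tags_by_blogger=TAGS_BY_BLOGGER):
    """Return the tags shared by every blogger in a non-empty group.

    ASSUMPTION: CF(group) is read here as the plain intersection of the
    members' tag sets, which matches the two values given in the text.
    """
    tag_sets = [tags_by_blogger[b] for b in bloggers]
    return set.intersection(*tag_sets)


cf_all = contextualized_folksonomy(["b1", "b2", "b3"])
cf_pair = contextualized_folksonomy(["b2", "b3"])
print(sorted(cf_all))
print(sorted(cf_pair))
```

The smaller group shares more tags than the larger one, which is the inverse relation between group size and shared vocabulary that the concept lattice of Section 3 makes explicit.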

The contextualized folksonomy aids the user in choosing the tags that are most commonly and properly used in the group. So it can be used as a recommender system. In addition, it can be applied not only to centralized systems but also to decentralized systems. Many people tagging the same (or similar) object in a common system collectively build a folksonomy, or "people's taxonomy", for the object. These objects can be URLs, photos, movies, blog entries, or just about anything else on the web. We decided to use Formal Concept Analysis methods to build contextualized folksonomies from tags in the blogosphere. An FCA-based approach has previously been applied to the Web. The concept lattice of FCA can provide a convenient hierarchical description of blogs and their sets of tags.

Table 1: Formal context C

C  | t1 | t2 | t3 | t4 | t5 | t6
---|----|----|----|----|----|----
b1 | ×  | ×  | ×  |    |    |
b2 | ×  |    |    | ×  | ×  |
b3 | ×  |    |    | ×  |    | ×

Figure 2: Concept lattice for the context of Table 1

¹ http://en.wikipedia.org/wiki/Blogroll
² Folksonomies work best when a large number of users all describe the same piece of information. Therefore, in this paper, we assume that all resources are treated as similar (or the same) topics.

3. FORMAL CONCEPT ANALYSIS

Formal Concept Analysis (FCA) is a method mainly used for the analysis of data, i.e. for investigating and processing explicitly given information [6]. Such data are structured into units which are formal abstractions of concepts of human thought allowing meaningful comprehensible interpretation.

3.1 Basic definitions

FCA starts with a formal context that is comprised of a set of objects, a set of attributes, and a relation describing which objects possess which attributes. In the formal definition, the set of objects is denoted by $\mathcal{O}$ and the set of attributes by $\mathcal{A}$.

DEFINITION 1. A formal context is a triple $(\mathcal{O}, \mathcal{A}, \mathcal{R})$, where $\mathcal{O}$ is a set of objects, $\mathcal{A}$ is a set of attributes, and $\mathcal{R} \subseteq \mathcal{O} \times \mathcal{A}$ is a binary relation between $\mathcal{O}$ and $\mathcal{A}$.

In order to express that an object $o$ is in relation with an attribute $a$, we write $(o, a) \in \mathcal{R}$ and read it as "the object o has the attribute a". Table 1 shows a formal context based on the set of objects $\mathcal{O} = \{b_1, b_2, b_3\}$ and the set of attributes $\mathcal{A} = \{t_1, t_2, t_3, t_4, t_5, t_6\}$; the incidence relation $\mathcal{R}$ is given by the cross table.

Let $(\mathcal{O}, \mathcal{A}, \mathcal{R})$ be a context. For $O \subseteq \mathcal{O}$, we define $\mathrm{intent}(O) := \{a \in \mathcal{A} \mid \forall o \in O : (o, a) \in \mathcal{R}\}$, and, dually, for $A \subseteq \mathcal{A}$, we define $\mathrm{extent}(A) := \{o \in \mathcal{O} \mid \forall a \in A : (o, a) \in \mathcal{R}\}$. The function intent maps a set of objects to the set of attributes common to the objects in $O$ ($\mathrm{intent} : 2^{\mathcal{O}} \to 2^{\mathcal{A}}$), whereas extent is the dual for attribute sets ($\mathrm{extent} : 2^{\mathcal{A}} \to 2^{\mathcal{O}}$). These two functions form a Galois connection between the objects and attributes of the context.

The central notion of FCA is the (formal) concept. Objects from a context share a set of common attributes and vice versa. Concepts can be imagined as maximal rectangles in the context table. If we ignore the sequence of the rows and columns we can identify even more concepts. The formal definition of a concept is given in the following:

DEFINITION 2. Let $C = (\mathcal{O}, \mathcal{A}, \mathcal{R})$ be a context. A formal concept is a pair $(O, A)$, where $O \subseteq \mathcal{O}$ is called the extension, $A \subseteq \mathcal{A}$ is called the intension, and $(O = \mathrm{extent}(A)) \wedge (A = \mathrm{intent}(O))$. The set of all concepts of the context $C$ is denoted by $\mathcal{B}(C)$, i.e., $\mathcal{B}(C) = \{(O, A) \in 2^{\mathcal{O}} \times 2^{\mathcal{A}} \mid O = \mathrm{extent}(A) \wedge A = \mathrm{intent}(O)\}$.

Concepts are pairs of object sets and attribute sets which are synonymous and thus characterize each other. In other words, a concept is a pair consisting of a set of objects and a set of attributes which are mapped into each other by the Galois connection. For example, all concepts of the context of Table 1 are as follows: $\mathcal{B}(C) = \{(\{b_1, b_2, b_3\}, \{t_1\}), (\{b_2, b_3\}, \{t_1, t_4\}), (\{b_1\}, \{t_1, t_2, t_3\}), (\{b_2\}, \{t_1, t_4, t_5\}), (\{b_3\}, \{t_1, t_4, t_6\}), (\{\}, \{t_1, t_2, t_3, t_4, t_5, t_6\})\}$.

The set of formal concepts is organized by the partial ordering relation $\le$ (to be read as "is a subconcept of") as follows: for a formal context $C = (\mathcal{O}, \mathcal{A}, \mathcal{R})$ and two concepts $c_1 = (O_1, A_1), c_2 = (O_2, A_2) \in \mathcal{B}(C)$, the subconcept-superconcept relation is given by $(O_1, A_1) \le (O_2, A_2) \Leftrightarrow O_1 \subseteq O_2 \ (\Leftrightarrow A_1 \supseteq A_2)$. A concept $c_1 = (O_1, A_1)$ is a subconcept of a concept $c_2 = (O_2, A_2)$ iff the set of its objects is a subset of the objects of $c_2$ or, equivalently, iff the set of its attributes is a superset of the attributes of $c_2$. That is, a subconcept contains fewer objects and more attributes than its superconcept. The set of all formal concepts of a context $C$ with the subconcept-superconcept relation $\le$ is always a complete lattice, called the (formal) concept lattice of $C$ and denoted by $\mathcal{L} := (\mathcal{B}(C), \le)$.

Figure 2 shows the concept lattice for the context $C$ of Table 1. A concept lattice can be represented graphically using line diagrams (or Hasse diagrams). These structures are composed of nodes and links. Each node represents a concept with its associated extent and intent. The links connecting nodes represent the subconcept-superconcept relation among them. This relation indicates that the parent's extension is a superset of each child's extension. Attributes propagate along the edges to the bottom of the diagram and, dually, objects propagate to the top of the diagram. Therefore, more abstract or general nodes occur higher in the hierarchy, whereas more specific ones occur at lower levels.
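The definitions of this section can be checked on the context of Table 1 with a short Python sketch. It enumerates concepts by brute force, closing every subset of objects with intent and extent; this is feasible only for tiny contexts and is not claimed to be the algorithm used by any FCA tool. The names `CONTEXT`, `intent`, `extent`, `concepts`, and `is_subconcept` are chosen here for illustration.

```python
from itertools import combinations

# Formal context of Table 1: object -> set of attributes it possesses.
CONTEXT = {
    "b1": {"t1", "t2", "t3"},
    "b2": {"t1", "t4", "t5"},
    "b3": {"t1", "t4", "t6"},
}
ATTRIBUTES = set().union(*CONTEXT.values())


def intent(objects):
    """Attributes common to all given objects (all attributes for the empty set)."""
    common = set(ATTRIBUTES)
    for obj in objects:
        common &= CONTEXT[obj]
    return common


def extent(attributes):
    """Objects that possess all given attributes."""
    return {obj for obj, attrs in CONTEXT.items() if attributes <= attrs}


def concepts():
    """Brute-force enumeration: close every subset of objects."""
    found = set()
    names = sorted(CONTEXT)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            attrs = intent(subset)
            found.add((frozenset(extent(attrs)), frozenset(attrs)))
    return found


def is_subconcept(c1, c2):
    """(O1, A1) <= (O2, A2) iff O1 is a subset of O2."""
    return c1[0] <= c2[0]


all_concepts = concepts()
for ext, itt in sorted(all_concepts, key=lambda c: (-len(c[0]), sorted(c[0]))):
    print(sorted(ext), sorted(itt))
```

Run on Table 1, the enumeration yields the six concepts listed above, and `is_subconcept` orders them as in the line diagram of Figure 2.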

3.2 Many-valued contexts and Scalings

FCA may also be applied to data in which objects have attributes with values. For example, the "color" attribute may have values such as "yellow", "green", or "red". We call such attributes many-valued attributes, in contrast to the one-valued attributes considered so far. In this case the basic data is stored in a many-valued context.

DEFINITION 3. A many-valued context $(\mathcal{O}, \mathcal{A}, \mathcal{V}, \mathcal{R})$ consists of a set of objects $\mathcal{O}$, a set of attributes $\mathcal{A}$, a set of possible values $\mathcal{V}$, and a ternary relation $\mathcal{R} \subseteq \mathcal{O} \times \mathcal{A} \times \mathcal{V}$³, where the following holds: $(o, a, v_1) \in \mathcal{R} \wedge (o, a, v_2) \in \mathcal{R} \Rightarrow v_1 = v_2$.

Like a one-valued context, a many-valued context can be represented as a matrix whose rows are labeled by the objects and whose columns are labeled by the attributes. Table 2 shows an example of a many-valued context for some fruits.

Table 2: An example of a many-valued context

        | kind       | color  | price
--------|------------|--------|------
fruit 1 | apple      | yellow | $1.15
fruit 2 | grapefruit | yellow | $7.25
fruit 3 | kiwi       | green  | $4.70
fruit 4 | apple      | red    | $2.15

In order to get concepts out of this many-valued context and draw the concept lattice, we have to transform the many-valued context into a one-valued context according to certain rules. The new one-valued context is called the derived context. The concepts of the derived one-valued context are interpreted as the concepts of the many-valued context. The process of transformation is called conceptual scaling. This process is not uniquely determined, but depends on the transformation rules. Formally, a many-valued context is transformed by constructing a scale for each attribute. The scales are used to construct a one-valued context for each attribute, and these are then combined or joined to form a one-valued context which represents the original many-valued context.

To make this clearer, consider the many-valued context in Table 2 and take the scale contexts S_color and S_price in Table 3 for the attributes color and price, respectively. Now we can do plain scaling with the scale contexts. As shown in Table 4, we can derive the one-valued contexts C_color and C_price from the original many-valued context according to the scales. When transforming a many-valued context $(\mathcal{O}, \mathcal{A}, \mathcal{V}, \mathcal{R})$, there will be a one-valued context for each attribute of $\mathcal{A}$. Therefore, if $|\mathcal{A}| > 1$, the multiple contexts need to be combined to form one unified context. Table 5 shows the context table that results from combining the derived contexts obtained from the scale contexts and the many-valued context of Table 2. Figure 3 shows the concept lattice generated from the combined context of Table 5.

Table 3: Some examples of scale contexts. S_color relates the values yellow, green, and red to the attributes yellow, green, and red; S_price relates the value ranges 0 ≤ price < 2, 2 ≤ price < 4, 4 ≤ price < 6, and 6 ≤ price < 8 to the attributes cheap, mid-range, and expensive.

Table 4: Derived contexts C_color and C_price. Objects: fruit 1 to fruit 4; attributes: yellow, green, red (C_color) and cheap, mid-range, expensive (C_price).

Table 5: Context table for the combination of derived contexts. Objects: fruit 1 to fruit 4; attributes: Apple, GrapeFruit, Kiwi, Yellow, Green, Red, Cheap, Mid-range, Expensive.

Figure 3: Concept lattice for the context of Table 5.

³ $(o, a, v) \in \mathcal{R}$ indicates that the object $o$ has the value $v$ for the attribute $a$.

Given this understanding of formal concept analysis, we now describe FCAWIZARD, a formal concept analysis tool developed as part of a project for medical data analysis and knowledge discovery. Based on ConExp⁴, FCAWIZARD combines multiple functions in one tool:

• Editing one-valued and many-valued contexts: FCAWIZARD supports editing and storing one-valued and many-valued contexts in tabular format.

• Extracting concepts and displaying the concept lattice: FCAWIZARD allows extracting concepts and displaying the lattice graphically.

• Scaling: FCAWIZARD provides various scaling methods for many-valued contexts. Users can also edit or customize their scale contexts for their own purposes. In particular, FCAWIZARD allows deriving the scaled contexts.

Figure 4 shows some screenshots of FCAWIZARD.
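The plain scaling of Section 3.2 can be illustrated independently of FCAWIZARD with the following Python sketch, which derives a combined one-valued context from the many-valued context of Table 2. The scales for kind and color are nominal. The assignment of the four price ranges to cheap, mid-range, and expensive is an assumption made for this example only, and all function and variable names are hypothetical.

```python
# Many-valued context of Table 2: object -> {attribute: value}.
FRUITS = {
    "fruit 1": {"kind": "apple", "color": "yellow", "price": 1.15},
    "fruit 2": {"kind": "grapefruit", "color": "yellow", "price": 7.25},
    "fruit 3": {"kind": "kiwi", "color": "green", "price": 4.70},
    "fruit 4": {"kind": "apple", "color": "red", "price": 2.15},
}


def nominal_scale(value):
    """Nominal scale: every value becomes its own one-valued attribute."""
    return {value}


def price_scale(price):
    """Scale for price.

    ASSUMPTION: the cut points below are illustrative; they are not taken
    from the scale context S_price.
    """
    if price < 2:
        return {"cheap"}
    if price < 6:
        return {"mid-range"}
    return {"expensive"}


SCALES = {"kind": nominal_scale, "color": nominal_scale, "price": price_scale}


def derive_context(many_valued, scales):
    """Scale each attribute separately and join the results per object."""
    derived = {}
    for obj, row in many_valued.items():
        attrs = set()
        for attribute, value in row.items():
            attrs |= scales[attribute](value)
        derived[obj] = attrs
    return derived


derived = derive_context(FRUITS, SCALES)
for obj in sorted(derived):
    print(obj, sorted(derived[obj]))
```

Because each many-valued attribute is scaled on its own and the results are joined per object, changing one scale (for example the price cut points) changes only the corresponding columns of the combined context.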

4. EXPERIMENT

In this section, we report on some experiments that use formal concept analysis to extract common tags and to mine aspectual views of tagging patterns ("use of tags") from a given group of bloggers. The aim of the experiments is to evaluate the feasibility of the FCA-based approach when it is applied to build a group folksonomy for bloggers and to share it within a group of bloggers.

4.1 Test Data

The test data used in the following experiments was gathered by hand, at random, from the blogosphere between August 5th and August 18th, 2006. To obtain better coverage of tag frequencies, we collected blogs that provided tag cloud services or displayed their tag frequency. We collected 9 blogs (Channy⁵, mEmOpAd⁶, Prak⁷, Weblognara⁸, Market-trend⁹, Oreilly radar¹⁰, @hof¹¹, Channy2¹², Guichanist¹³) and gathered about 320 tags from them. Examining this list immediately points out several challenges for users of tags and designers of tagging systems. Firstly, there are a number of cases where synonyms, pluralization, or even misspelling has introduced the same tag more than once. For example, many bloggers use "Web2.0", "web2.0", or "Web2". Secondly, many of the tags are used in both English and Korean. Thirdly, it seems clear that many bloggers use tags simply as a means to organize their interests: there are tags such as "life", "book", or even the blogger's initials. In the following, we present an exploratory analysis of the tags and blogs.

Figure 4: Screenshots of FCAWIZARD

Figure 5: Concept lattice of the first experiment

⁴ See http://sourceforge.net/projects/conexp

4.2 Identifying the Common Tags

As a first experiment, to identify common (shared) tags among the given bloggers, we started with a one-valued context that captures the tagging information for each blogger. That is, the one-valued context is composed of 9 bloggers, 3 tags they used, and a binary relation that indicates which tag is used by whom. By applying FCAWIZARD to the one-valued context, we can extract the concepts and display them as the concept lattice in Figure 5. From the concept lattice, we can identify some common (shared) tags among the bloggers. However, the resulting concept lattice is too large to be displayed completely at once. To analyze it in more detail, we restrict the set of objects and/or attributes and visualize only the corresponding part of the concept lattice. Figure 6 shows the simplified lattice obtained by restricting the attribute set to {Web2.0, Google, Blog}. We can identify 4 concepts that group bloggers and tags in ways that are meaningful for identifying common tags. In Figure 5, for example, the node labelled Blog with hof and Weblognara denotes the concept ({hof, Weblognara}, {Blog, Google})¹⁴. It means that the bloggers hof and Weblognara have both used the tags Blog and Google. Of course, this does not mean that their frequencies of use for these tags are the same. Therefore, we need to take the frequency of use of tags into account.

Figure 6: Simplified concept lattice

⁵ http://www.creation.net/blog
⁶ http://www.linuxstudy.pe.kr/~kebie/2005/blog/tt/index.php
⁷ http://www.fortytwo.co.kr/tt
⁸ http://weblognara.com/
⁹ http://www.marketcast.co.kr/blog
¹⁰ http://radar.oreilly.com/
¹¹ http://www.hof.pe.kr/
¹² http://channy.tisory.com
¹³ http://anihil.cafe24.com/
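The concept named in Section 4.2 can be recomputed from the frequencies reported in Table 6. The sketch below is not FCAWIZARD; it assumes that a blogger counts as using a tag whenever the reported frequency is greater than zero, and the names `FREQUENCIES`, `context`, `extent`, and `intent` are chosen for this example.

```python
TAGS = ("Blog", "Google", "Web2.0")

# Frequencies as reported in Table 6 (row labels as printed there).
FREQUENCIES = {
    "Channy": (0, 5.2, 6.6),
    "mEmOpAd": (0, 0, 6.5),
    "Prak": (0, 7.1, 44.1),
    "Weblognara": (15.6, 4.2, 0),
    "trend": (0, 6.4, 14.1),
    "Oreilly": (0, 9.5, 10.9),
    "hof": (26.4, 27, 0),
    "Channy2": (0, 15.9, 11.4),
    "Guichain": (0, 8.6, 8.6),
}

# ASSUMPTION: the one-valued context marks a tag as used when its frequency > 0.
context = {
    blogger: {tag for tag, freq in zip(TAGS, row) if freq > 0}
    for blogger, row in FREQUENCIES.items()
}


def extent(tags):
    """Bloggers who use all of the given tags."""
    return {b for b, used in context.items() if tags <= used}


def intent(bloggers):
    """Tags used by all of the given bloggers."""
    common = set(TAGS)
    for b in bloggers:
        common &= context[b]
    return common


blog_users = extent({"Blog"})
shared_tags = intent(blog_users)
print(sorted(blog_users), sorted(shared_tags))
```

Closing the single tag Blog in this way gives the pair of bloggers and the pair of tags that the text reads off the lattice node.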

4.3 Mining the Weighted Common Tags

The aim of the second experiment is not only to identify the common tags, but also to extract ordinal information from the bloggers' common tags based on their frequency of use. Table 6 shows a many-valued context of the frequency of use of tags for a group of bloggers. The number in each cell of Table 6 denotes the frequency of use $F_t$ of a given tag $t$, calculated by the following formula:

$$F_t = \frac{\text{number of times tag } t \text{ is used}}{\text{total number of posts within the blog}}$$

In order to obtain a more fine-grained view, we apply conceptual scaling to the many-valued context. In the conceptual scaling, each attribute ("tag") is treated as a separate formal context with the values ("frequency of use of tag") as attributes associated with each of the original objects ("bloggers"). We use the ordinal scale context of Table 7 in the conceptual scaling. That is, the ordinal scale can be used to interpret each tag whose values ("frequency of use of tag")

¹⁴ In the concept lattice, attributes are inherited downwards and dually objects propagate to the top of the diagram.

Table 6: Many-valued context for the frequency of use of tags

           | Blog | Google | Web2.0
-----------|------|--------|-------
Channy     | 0    | 5.2    | 6.6
mEmOpAd    | 0    | 0      | 6.5
Prak       | 0    | 7.1    | 44.1
Weblognara | 15.6 | 4.2    | 0
trend      | 0    | 6.4    | 14.1
Oreilly    | 0    | 9.5    | 10.9
hof        | 26.4 | 27     | 0
Channy2    | 0    | 15.9   | 11.4
Guichain   | 0    | 8.6    | 8.6

Table 7: Ordinal scale context for mining the weighted common tags

VL L M H VH
1≤f
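An ordinal scale of the kind used in Section 4.3 can be sketched as follows. The level names VL, L, M, H, and VH are those of Table 7, but the lower bounds used here are hypothetical placeholders, since the bounds of Table 7 are not available in this text. In an ordinal scale a value receives every level whose bound it reaches, so higher frequencies imply all lower levels.

```python
# HYPOTHETICAL lower bounds per level; only the level names come from Table 7.
LEVELS = [("VL", 1.0), ("L", 5.0), ("M", 10.0), ("H", 20.0), ("VH", 40.0)]


def ordinal_scale(tag, frequency):
    """One-valued attributes for one tag frequency under an ordinal scale.

    A frequency gets every level whose lower bound it reaches, so the
    attribute sets of increasing frequencies are nested.
    """
    return {f"{tag}:{name}" for name, bound in LEVELS if frequency >= bound}


def scale_blogger(frequencies):
    """Join the scaled attributes of all tags of one blogger."""
    attrs = set()
    for tag, freq in frequencies.items():
        attrs |= ordinal_scale(tag, freq)
    return attrs


# Two rows of Table 6, used as sample input.
prak = scale_blogger({"Blog": 0, "Google": 7.1, "Web2.0": 44.1})
hof = scale_blogger({"Blog": 26.4, "Google": 27, "Web2.0": 0})
print(sorted(prak))
print(sorted(hof))
```

With nested attribute sets, a blogger who uses a tag very often falls under every concept that groups less frequent users of the same tag, which is what lets the resulting lattice express weighted common tags.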
