Clustering the Web 2.0

Katharina Morik, Michael Wurst
Artificial Intelligence Unit, Technische Universität Dortmund, Germany

January 16, 2009

Abstract

Ryszard Michalski was a pioneer of Machine Learning. His conceptual clustering focused on the understandability of clustering results, a key requirement if Machine Learning is to serve users successfully. In this chapter, we present two approaches to clustering in the scenario of the Web 2.0, with special concern for understandability in this new context. In contrast to semantic web approaches, which advocate ontologies as a common semantics for homogeneous user groups, the Web 2.0 aims at supporting heterogeneous user groups, where users annotate and organize their content without reference to a common schema. Hence, the semantics is not made explicit. It can, however, be extracted by Machine Learning, thus providing users with new services.
1 Introduction
From its very beginning, Machine Learning has aimed at helping people do their job more effectively. Ryszard Michalski always stressed the service to people as a motivation for Machine Learning. Early on, classifier learning received a lot of attention [18], and subsequently eased the development of knowledge-based systems that support experts [15], [17]. Since the need for classified training examples could become a bottleneck for applications, clustering approaches became attractive, because there no expert is needed to classify the observations. However, statistical clustering approaches often lack understandable clusters. The use of logic expressions for clustering turned precise cluster descriptions into easily understandable conditions for an observation to belong to a cluster [14]. The understandability of learning results is a strong factor in their success and a necessity for many applications. There is a large variety of understandable representations: depending on a user's education and school of thinking, logical, visual, or numerical representations ease understanding. Since interpretations depend on the user's background, it is quite difficult to supply heterogeneous user groups with one adequate representation. Instead, the representation has to cope with the heterogeneity of users, presenting information to a user in the way he or she prefers.
In contrast to systems supporting a user in professional performance, the World Wide Web (WWW) offers documents, music, and videos to a large variety of users, no longer restricted to one profession but offered to and by the most diverse cultures and communities. The approach of organizing collections according to ontologies (the semantic web approach) has failed possibly just because of this heterogeneity of users. The view of the WWW as a space for collaboration among diverse users who not only seek but also deliver information (the Web 2.0 approach) takes into account that users do not want to obey a given semantics but use their own without explicit declaration. Integrating Machine Learning capabilities into the WWW hence demands coping with diverse representations. The same annotation of an object can no longer be considered to have the same meaning: for different users the same label may have completely different meanings. Turning it the other way around, different annotations may mean the same. Machine Learning approaches normally use a fixed feature set and – in supervised learning – a fixed set of class labels. In the Web 2.0 context, this is no longer appropriate. Instead, the intensional meaning of a feature or label can only be determined on the basis of its extension regarding one particular user. We call this the locality requirement. The similarity of extensions can then be used to determine the similarity of features or labels given by different users, and, hence, the similarity of users. We shall investigate this issue in Section 2; the work on collaborative clustering is based on the PhD thesis of Michael Wurst [22].

Moreover, the heterogeneity of users also challenges the design of user interfaces. Users differ in the way they like to look at a collection of items (i.e., pictures, music, ...). Some prefer to navigate in a step-wise deepening manner with a small number of selectable nodes at each level. Others prefer to have many choices and few levels. For the developer of an interface it is not at all clear how to best organize a collection. Here, Machine Learning can be of help, if it constructs all optimal structures and offers these to the user, who can then choose the one he or she likes. The construction of all optimal structures in one learning run technically means learning the Pareto-optimal structures. Methodologically, our learning algorithm can be classified as an instance of multistrategy learning. It exploits frequent set mining and evolutionary optimization in order to cluster a collection on the basis of annotations (tag sets) [13]. We shall describe our approach to this problem in Section 3; the work on tag set clustering is based on the diploma thesis of Andreas Kaspari [12].
2 Collaboratively Structuring Collections
Media collections on the internet have become a commercial success. Companies provide users with texts, music, and videos for downloading and even offer to store personal collections at the company's site. The structuring of large media collections has thus become an issue. Where general ontologies are used to structure the central archive, personal media collections are locally structured in very different ways by different users. The level of detail, the chosen categories,
and the extensions even of categories with the same name can differ completely from user to user. In a one-year project, a collection of music was structured by our students [10]. We found categories of mood, time of day, instruments, occasions (e.g., "when working" or "when chatting"), and memories (e.g., "holiday songs from 2005" or "songs heard with Sabine"). The level of detail depends on the interests of the collector. Where some students structure instruments into electronic and unplugged, others carefully distinguish between string quartet, chamber orchestra, symphony orchestra, requiem, and opera. A specialist in jazz designs a hierarchical clustering of several levels, each with several nodes, where a lay person considers jazz just one category. Where the most detailed structure could become a general taxonomy, from which less finely grained, local structures can easily be computed, categories under headings like "occasions" and "memories" cannot become a general structure for all users. Such categories depend on the personal attitude and life of a user. They are truly local to the user's collection. Moreover, the classification into a structure is far from being standardized. This is easily seen when thinking of a node "favorite songs". Several users' structures show completely different songs under this label, because different users have different favorites. The same is true for the other categories. We found that even the general genre categories can vary extensionally among users, the same song being classified, e.g., as "rock'n roll" in one collection and "blues" in another one. Hence, even if (part of) the collections' structure looks the same, their extensions can differ considerably [6]. In summary, structures for personal collections differ in the level of detail, the chosen categories, and the extensions of even the same labels.

Can Machine Learning be of help also for structuring personal collections? Since users do not want to have their hand-made structures overwritten, one could deny the benefit of automatic structuring. While users like to classify some songs into their own structure, they would appreciate it if a learning system cleaned up their collection "accordingly". Moreover, users like to exchange songs (or pictures, or videos) with others. The success of Amazon's collaborative recommendations shows that users appreciate sharing preferences. A structure of another user might serve as a blueprint for refining or enhancing one's own structure. The main objective seems to be that users are given a choice among alternatives instead of being provided with just one result. To summarize, the requirements of a learning approach are that it should

• not overwrite hand-made structures,
• not aim at a global model, but enhance a local one,
• add structure where a category's extension has become too large,
• take structures of others into account in a collaborative manner,
• recommend objects which fit nicely into the local structure, and
• deliver several alternatives among which the user can choose.
2.1 The Learning Task of Localized Alternative Cluster Ensembles
The requirements listed above actually pose a new learning task. We characterize learning tasks by their inputs and outputs. Let X denote the set of all possible objects. A function ϕ : S → G maps objects S ⊆ X to a (finite) set G of groups. We denote the domain of a function ϕ by D_ϕ. In cases where we have to deal with overlapping and hierarchical groups, we denote the set of groups as 2^G. The input for a learning task is a finite set of functions I ⊆ {ϕ | ϕ : S → G}. The same holds for the output O ⊆ {ϕ | ϕ : S → G}. We consider the structuring of several users u_i, each described by ϕ_i : S_i → G_i. A user with the problem of structuring her left-over objects S might now exploit the cluster models of other users in order to enhance her own structure. Cluster ensembles are almost what we need [3], [19], [20]. However, there are three major drawbacks: first, for cluster ensembles, all input clusterings must be defined at least on S. Second, the consensus model of cluster ensembles does not take the locality of S into account. Finally, merging several heterogeneous user clusterings by a global consensus does not preserve the user's hand-made structuring. Hence, we have defined a new learning task [23].

Definition 1 (Localized Alternative Cluster Ensembles) Given a set S ⊆ X, a set of input functions I ⊆ {ϕ_i : S_i → G_i}, and a quality function

    q : 2^Φ × 2^Φ × 2^S → R    (1)

with R being partially ordered, localized alternative clustering ensembles deliver the output functions O ⊆ {ϕ_i | ϕ_i : S_i → G_i} such that q(I, O, S) is maximized and for each ϕ_i ∈ O it holds that S ⊆ D_{ϕ_i}.

Note that, in contrast to cluster ensembles, the input clusterings can be defined on any subset S_i of X. Since for all ϕ_i ∈ O it must hold that S ⊆ D_{ϕ_i}, all output clusterings must at least cover the objects in S.
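To make this setting concrete, the following minimal Python sketch models each user clustering ϕ_i as a mapping from objects to group labels and checks the output condition S ⊆ D_{ϕ_i}. All names are ours and purely illustrative; this is not the original implementation.

```python
# Illustrative data model behind Definition 1 (hypothetical names, not the authors' code).
# Each user clustering phi_i is a partial function from objects to group labels,
# modelled here as a plain dict; its domain S_i is the set of keys.
from typing import Dict, Hashable, Set

Obj = Hashable                  # an object (song, photo, ...), any hashable key
Group = str                     # a group label chosen by the user
Clustering = Dict[Obj, Group]   # phi_i : S_i -> G_i, with S_i = clustering.keys()

def domain(phi: Clustering) -> Set[Obj]:
    """D_phi: the set of objects on which the function is defined."""
    return set(phi.keys())

def covers_query_set(phi: Clustering, S: Set[Obj]) -> bool:
    """Output condition of Definition 1: phi must be defined at least on S."""
    return S <= domain(phi)

# Example: two users structure overlapping song collections differently.
phi_a = {"song1": "rock", "song2": "rock", "song3": "jazz"}
phi_b = {"song2": "dancefloor", "song4": "dancefloor"}
print(covers_query_set(phi_a, {"song1", "song3"}))   # True
print(covers_query_set(phi_b, {"song1"}))            # False
```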
2.2 The LACE Algorithm
The algorithm LACE derives a new clustering from existing ones by extending and combining them such that each covers a subset of the objects in S. We need two more definitions in order to describe the algorithm.

Definition 2 (Extended function) Given a function ϕ_i : S_i → G_i, the function ϕ′_i : S′_i → G_i is the extended function for ϕ_i, if S_i ⊂ S′_i and ∀x ∈ S_i : ϕ_i(x) = ϕ′_i(x).

Definition 3 (Bag of clusterings) Given a set I of functions, a bag of clusterings is a function

    ϕ_i(x) = ϕ′_{i1}(x)   if x ∈ S′_{i1}
             ⋮
             ϕ′_{ij}(x)   if x ∈ S′_{ij}
             ⋮
             ϕ′_{im}(x)   if x ∈ S′_{im}        (2)

where each ϕ′_{ij} is an extension of a ϕ_{ij} ∈ I and {S′_{i1}, ..., S′_{im}} is a partitioning of S.
Now we can define the quality of the output, i.e., the objective function of our bag-of-clusterings approach to localized alternative clustering ensembles.

Definition 4 (Quality of an output function) The quality of an individual output function is measured as

    q*(I, ϕ_i, S) = Σ_{x ∈ S} max_{x′ ∈ S′_{ij}} sim(x, x′)   with j = h_i(x)    (3)

where sim is a similarity function sim : X × X → [0, 1] and h_i assigns each example to the corresponding function in the bag of clusterings, h_i : S → {1, ..., m} with

    h_i(x) = j ⇔ x ∈ S′_{ij}.    (4)

The quality of a set of output functions now becomes

    q(I, O, S) = Σ_{ϕ_i ∈ O} q*(I, ϕ_i, S).    (5)
Besides this quality function, we want to cover the set S with a bag of clusterings that contains as few clusterings as possible. The main task is to cover S by a bag of clusterings ϕ. The basic idea of this approach is to employ a sequential covering strategy. In a first step, we search for a function ϕ_i in I that best fits the set of query objects S. For all objects not sufficiently covered by ϕ_i, we search for another function in I that fits the remaining points. This process continues until either all objects are sufficiently covered, a maximal number of steps is reached, or there are no input functions left that could cover the remaining objects. All objects that could not be covered are assigned to the input function ϕ_j containing the object which is closest to the one to be covered. Alternative clusterings are produced by performing this procedure several times, such that each input function is used at most once. When is an object sufficiently covered by an input function, so that it can be removed from the query set S? We define a threshold-based criterion for this purpose; a small sketch follows below. Let Z_ϕ be the set of objects delivered by ϕ.

Definition 5 A function ϕ sufficiently covers an object x ∈ S (written as x @_α ϕ), iff

    x @_α ϕ :⇔ max_{x′ ∈ Z_ϕ} sim(x, x′) > α.
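The coverage test of Definition 5 can be written down in a few lines; the sketch below is illustrative only, with names of our own choosing.

```python
# Sketch of the coverage test of Definition 5 (illustrative, not the authors' code):
# an object x in S is sufficiently covered by phi if phi's representative objects
# Z_phi contain at least one object whose similarity to x exceeds alpha.
from typing import Callable, Hashable, Iterable

Obj = Hashable
Sim = Callable[[Obj, Obj], float]       # sim : X x X -> [0, 1]

def sufficiently_covers(x: Obj, Z_phi: Iterable[Obj], sim: Sim, alpha: float) -> bool:
    """x @_alpha phi  :<=>  max_{x' in Z_phi} sim(x, x') > alpha."""
    return max((sim(x, z) for z in Z_phi), default=0.0) > alpha
```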
O = ∅
I′ = I
while (|O| < max_alt) do
    S′ = S
    B = ∅
    step = 0
    while ((S′ ≠ ∅) ∧ (I′ ≠ ∅) ∧ (step < max_steps)) do
        ϕ_i = arg max_{ϕ ∈ I′} q_f*(Z_ϕ, S′)
        I′ = I′ \ {ϕ_i}
        B = B ∪ {ϕ_i}
        S′ = S′ \ {x ∈ S′ | x @_α ϕ_i}
        step = step + 1
    end while
    O = O ∪ {bag(B, S)}
end while

Figure 1: The sequential covering algorithm finds bags of clusterings in a greedy manner. max_alt denotes the maximum number of alternatives in the output, max_steps denotes the maximum number of steps performed during sequential covering. The function bag constructs a bag of clusterings by assigning each object x ∈ S to the function ϕ_i ∈ B that contains the object most similar to x. Z_ϕ is the set of objects delivered by ϕ.

This threshold allows balancing the quality of the resulting clustering against the number of input clusters. A small value of α allows a single input function to cover many objects in S. This, on average, reduces the number of input functions needed to cover the whole query set. Turning it the other way around: when do we consider an input function to fit the objects in S well? First, it must contain at least one similar object for each object in S. This is essentially what is stated in the quality function q*. Second, it should cover as few additional objects as possible. This condition follows from the locality demand. Using only the first condition, the algorithm would not distinguish between input functions which span a large part of the data space and those which only span a small local part. This distinction, however, is essential for treating locality appropriately. The situation we are facing is similar to that in information retrieval. The target concept S – the ideal response – is approximated by ϕ delivering a set of objects – the retrieval result. If all members of the target concept are covered, the retrieval result has the highest recall. If no objects in the retrieval result are not members of S, it has the highest precision. We want to apply precision and recall to characterize how well ϕ covers S. We can define

    precision(Z_{ϕ_i}, S) = (1/|Z_{ϕ_i}|) Σ_{z ∈ Z_{ϕ_i}} max{sim(x, z) | x ∈ S}    (6)

and

    recall(Z_{ϕ_i}, S) = (1/|S|) Σ_{x ∈ S} max{sim(x, z) | z ∈ Z_{ϕ_i}}.    (7)

Please note that using a similarity function which maps identical objects to 1 (and 0 otherwise) leads to the usual definition of precision and recall. The fit between an input function and a set of objects now becomes

    q_f*(Z_{ϕ_i}, S) = ((β² + 1) · recall(Z_{ϕ_i}, S) · precision(Z_{ϕ_i}, S)) / (β² · recall(Z_{ϕ_i}, S) + precision(Z_{ϕ_i}, S)).    (8)
Recall directly optimizes the quality function q*, while precision ensures that the result captures local structures adequately. The fitness q_f*(Z_{ϕ_i}, S) balances the two criteria. Deciding whether ϕ_i fits S or whether an object x ∈ S is sufficiently covered requires computing the similarity between an object and a cluster. Remember that Z_{ϕ_i} is the set of objects delivered by ϕ_i. If the cluster is represented by all of its objects (Z_{ϕ_i} = S_i, as usual in single-link agglomerative clustering), this central step becomes inefficient. If the cluster is represented by exactly one point (|Z_{ϕ_i}| = 1, a centroid as in k-means clustering), the similarity calculation is very efficient, but sets of objects with irregular shape, for instance, cannot be captured adequately. Hence, we adopt the representation by "well scattered points" Z_{ϕ_i} as the representation of ϕ_i [8], where 1 < |Z_{ϕ_i}| < |S_i|. These points are selected by stratified sampling according to G. We can now compute the fitness q_f* of all Z_{ϕ_i} ∈ I with respect to a query set S in order to select the best ϕ_i for our bag of clusterings. The whole algorithm works as depicted in Figure 1. We start with the initial set of input functions I and the set S of objects to be clustered. In a first step, we select an input function that maximizes q_f*(Z_{ϕ_i}, S). ϕ_i is removed from the set of input functions, leading to a set I′. For all objects S′ that are not sufficiently covered by ϕ_i, we select a function from I′ with maximal fit to S′. This process is iterated until either all objects are sufficiently covered, a maximal number of steps is reached, or there are no input functions left that could cover the remaining objects. All input functions selected in this process are combined into a bag of clusterings, as described above. Each object x ∈ S is assigned to the input function containing the object most similar to x. Then, all input functions are extended accordingly (cf. Definition 2). We start this process anew with the complete set S and the reduced set I′ of input functions until the maximal number of alternatives is reached.
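To make the procedure of Figure 1 concrete, the following Python sketch implements the greedy covering with the precision/recall fit of Equations (6)–(8), assuming each input clustering is represented by its set of well scattered points and a similarity function sim on objects is given. All identifiers are ours, the final bag assignment of the objects in S is omitted for brevity, and the sketch is not the original implementation.

```python
# Hedged sketch of the LACE sequential covering (Figure 1); our own naming.
from typing import Callable, Dict, Hashable, List, Set

Obj = Hashable
Sim = Callable[[Obj, Obj], float]

def precision(Z: Set[Obj], S: Set[Obj], sim: Sim) -> float:
    # Eq. (6): average best match of a representative point against the query set
    return sum(max(sim(x, z) for x in S) for z in Z) / len(Z)

def recall(Z: Set[Obj], S: Set[Obj], sim: Sim) -> float:
    # Eq. (7): average best match of a query object against the representatives
    return sum(max(sim(x, z) for z in Z) for x in S) / len(S)

def fit(Z: Set[Obj], S: Set[Obj], sim: Sim, beta: float = 1.0) -> float:
    # Eq. (8): F_beta-style combination of recall and precision
    r, p = recall(Z, S, sim), precision(Z, S, sim)
    return (beta ** 2 + 1) * r * p / (beta ** 2 * r + p) if (r + p) > 0 else 0.0

def lace(S: Set[Obj], inputs: Dict[str, Set[Obj]], sim: Sim,
         alpha: float = 0.1, beta: float = 1.0,
         max_alt: int = 3, max_steps: int = 5) -> List[List[str]]:
    """Return up to max_alt bags of clusterings; each bag is a list of input ids.
    inputs maps an input-clustering id to its set of well scattered points Z_phi."""
    alternatives: List[List[str]] = []
    remaining_inputs = dict(inputs)                      # I'
    while len(alternatives) < max_alt and remaining_inputs:
        uncovered = set(S)                               # S'
        bag: List[str] = []                              # B
        step = 0
        while uncovered and remaining_inputs and step < max_steps:
            # greedily pick the input whose representatives best fit S'
            best = max(remaining_inputs,
                       key=lambda k: fit(remaining_inputs[k], uncovered, sim, beta))
            Z = remaining_inputs.pop(best)               # each input used at most once
            bag.append(best)
            # drop all objects that are now sufficiently covered (Definition 5)
            uncovered = {x for x in uncovered
                         if max(sim(x, z) for z in Z) <= alpha}
            step += 1
        alternatives.append(bag)
    return alternatives
```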
2.3 Results of the LACE Algorithm
The LACE algorithm has been successfully applied to the collection of student structures for the music collection. Leaving one clustering out, the learning task was to structure the now unstructured music again. We compared our approach with single-link agglomerative clustering using the cosine measure, with top-down divisive clustering based on recursively applying kernel k-means [5] (K k-means), and with random clustering. Localized alternative cluster ensembles were applied using cosine similarity as the inner similarity measure. The parameters for all algorithms were chosen globally optimal. In our experiments we used α = 0.1. For β, the optimal value was 1. Kernel k-means and random clustering were started five times with different random initializations.

Method                 Correlation   Absolute distance   FScore
LACE                   0.44          0.68                0.63
K k-means audio        0.19          2.2                 0.51
K k-means ensemble     0.23          2.5                 0.55
single-link audio      0.11          9.7                 0.52
single-link ensemble   0.17          9.9                 0.60
random                 0.09          1.8                 0.5
Table 1: The results for different evaluation measures.

Table 1 shows the results. As can be seen, the localized alternative cluster ensembles approach LACE performs best (see [23] for more details on the evaluation). The application opportunities of LACE are, on the one hand, to structure collections automatically but in a personalized way and, on the other hand, to recommend items tailored to a user's needs. If all not yet classified instances fit into a hand-made structure, the structure is not changed. If, however, some instances do not fit or a cluster has become too large, a set S is formed and other structures ϕ_i are exploited. This leads to some new structures and is accompanied by recommendations. Let us look at some examples for illustration.

• In the cluster "pop" of user A there might be too many instances. Now, a structure ϕ_B dividing similar songs into "rock", "metal", and "disco" might be found in the collection of user B under a different title, say "dancefloor". Integrating these clusters is accompanied by new instances from user B. These recommendations are specific to user A's "pop" cluster (i.e., local).

• Another example starts with several pieces of music S which user A did not find the time to structure. He might receive a structure ϕ_1 according to instruments and another one, ϕ_2, with funny titles like "songs for Sabine", "poker nights", and "lonely sundays". When listening to the music in the latter clusters, user A might like them and change the tags to "romance", "dinner background", and "blues".

• Of course, memory categories (e.g., "holidays 2005") will almost never become tasks for automatic structuring, because their extension is precisely determined and cannot be changed. However, for somebody else, also these categories can be useful and become (under a different name) a cluster for a larger set of instances.
[Figure 2: Tag cloud of an excerpt of the Del.icio.us tags (screenshot of http://del.icio.us/tag/, retrieved May 2007). More frequently used tags are printed in a bolder style.]
The localized alternative cluster ensembles offer new services to heterogeneous user groups. The approach opens up applications in the Web 2.0 context ranging from content sharing to recommendations and community building, where the locality requirement is essential. Machine Learning, here, does not support professional performance but the use of computers for leisure. It supports the many small groups of users which, taken together, are larger than the mainstream group. Both from a commercial and an ethical point of view, this is a challenging service.
3 Structuring Tagged Collections
Collaborative tagging of resources has become popular. Systems like Del.icio.us (http://del.icio.us/), Last.fm (http://last.fm/), or FlickR (http://flickr.com) allow users to annotate all resources by freely chosen tags, without a given obligatory taxonomy of tags (an ontology). Since the users all together develop the "taxonomy" of the Web 2.0 by their tags, the term "folksonomy" came up.

Definition 6 (Folksonomy) A folksonomy is a tuple F = (U, T, R, Y), where U is the set of users, T the set of tags, and R the set of resources. Y ⊆ U × T × R is a relation of taggings. A tuple (u, t, r) ∈ Y means that user u has annotated resource r with tag t.

The popular view of folksonomies is currently the tag cloud, like the one shown in Figure 2. Tags are understandable although not precisely defined. While tag clouds support users in hopping from tag to tag, inspecting the resources is cumbersome. When selecting a tag, the user finds all resources annotated with this tag. There is no extensionally based hierarchy guiding the exploration of resources. For instance, a user cannot move from all photos to those being tagged
as "art", as well. Hence, navigation in folksonomies is particularly restricted. However, the data for a more appropriate structure are already given – they just need to be used. A resource, which has been tagged as, e.g., {art, photography} is linked with the termsets {art}, {photography}, {art, photography} by a function g : T × R → N. The user may now refine the selection of resources, e.g., from all photos to those being annotated as "art", as well. As is easily seen, the power set of tags, together with union and intersection forms a lattice (see Figure 3). If we choose cluster sets within this lattice, these no longer need to form a lattice. Only quite weak assumptions about valid cluster sets are necessary: Definition 7 (Frequent Termset Clustering Conditions) A cluster set C ⊆ P(T ) is valid, if it fulfills the following constraints: ∅∈C
(9a)
∀D ∈ C with D 6= ∅ : ∃C : C ≺ D
(9b)
∀C ∈ C : ∃r ∈ R : r∇C.
(9c)
Condition (9a) states that the empty set must be contained in each cluster set. Condition (9b) ensures that there is a path from each cluster to the empty set (thus the cluster set is a connected graph). Condition (9c) ensures that each cluster contains at least one resource.
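The sketch below makes Definition 6 and these conditions concrete: a folksonomy is stored as a set of (user, tag, resource) triples, and a candidate cluster set is checked against conditions (9a)–(9c). It is our own illustration with invented names; in particular, the relation ≺ is read here simply as the proper-subset relation, which may differ in detail from the hierarchy used in the original system.

```python
# Illustrative folksonomy data model and validity check (not the authors' code).
from typing import Dict, FrozenSet, Set, Tuple

Tagging = Tuple[str, str, str]          # (user u, tag t, resource r) in Y
Cluster = FrozenSet[str]                # a termset, i.e. a set of tags

def tags_per_resource(Y: Set[Tagging]) -> Dict[str, Set[str]]:
    tags_of: Dict[str, Set[str]] = {}
    for _, t, r in Y:
        tags_of.setdefault(r, set()).add(t)
    return tags_of

def extension(cluster: Cluster, tags_of: Dict[str, Set[str]]) -> Set[str]:
    """All resources r with r nabla C: every tag of the cluster is assigned to r."""
    return {r for r, tags in tags_of.items() if cluster <= tags}

def is_valid_cluster_set(clusters: Set[Cluster], Y: Set[Tagging]) -> bool:
    tags_of = tags_per_resource(Y)
    if frozenset() not in clusters:                        # (9a) the root must be present
        return False
    for D in clusters:
        if D and not any(C < D for C in clusters):         # (9b) a predecessor towards the root
            return False
        if not extension(D, tags_of):                      # (9c) non-empty extension
            return False
    return True

# Example with three taggings from two users:
Y = {("u1", "art", "r1"), ("u1", "photography", "r1"), ("u2", "art", "r2")}
C = {frozenset(), frozenset({"art"}), frozenset({"art", "photography"})}
print(is_valid_cluster_set(C, Y))    # True
```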
[Figure 3: The frequent termsets, with the lattice of possibly frequent termsets grayed. The example shows the power set lattice over four tags a, b, c, d.]
The link to the resources is given by the cover relation ∇ ⊆ R × P(T), based on the function g, for which the following holds:

    r ∇ C ≡ ∀t ∈ C : g(t, r) > 0    (10)

A resource is covered by a termset if all terms in the termset are assigned to the resource. The support of a termset is defined as the fraction of resources it covers. The frequency of a tag set can be defined either as the number of resources it covers, as the number of users annotating resources with it, or as the number of tuples in U × R. Our method works with any of these frequencies. Using FP-growth [9], we find the frequent sets, which only need to obey the clustering conditions (9a), (9b), and (9c).
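As an illustration of the cover relation and of support, the following naive enumeration computes frequent tag sets directly from the definitions, using resource-based frequency. The chapter itself uses FP-growth [9], which avoids enumerating all candidate termsets; the code below is only a didactic sketch with invented names.

```python
# Naive frequent tag set mining (didactic sketch only; the chapter uses FP-growth [9]).
from itertools import combinations
from typing import Dict, FrozenSet, Set

def frequent_tagsets(tags_of: Dict[str, Set[str]], min_support: float,
                     max_size: int = 3) -> Dict[FrozenSet[str], float]:
    """tags_of maps a resource to its tags; support = fraction of resources covered."""
    n = len(tags_of)
    all_tags = sorted(set().union(*tags_of.values()))
    frequent: Dict[FrozenSet[str], float] = {frozenset(): 1.0}   # the root covers everything
    for size in range(1, max_size + 1):
        for termset in map(frozenset, combinations(all_tags, size)):
            covered = sum(1 for tags in tags_of.values() if termset <= tags)  # r nabla C
            if covered / n >= min_support:
                frequent[termset] = covered / n
    return frequent

tags_of = {"r1": {"art", "photography"}, "r2": {"art"}, "r3": {"photography", "travel"}}
print(frequent_tagsets(tags_of, min_support=0.5))
# the root, {'art'}, and {'photography'} are frequent at this minimal support
```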
3.1 Learning Pareto-Optimal Clusterings from Frequent Sets
Having found all frequent tag sets, the task of structuring them according to the preference of a user is to select a cluster set 𝒞 among all possible valid clusterings Γ. Since we do not know the preference of the user, we do not tweak heuristics such that they select a preferred clustering, as done in [21], [1], [7]. In contrast, we decompose the selection criteria into two orthogonal criteria and apply multi-objective optimization. Given orthogonal criteria, multi-objective optimization [2] finds all trade-offs between the criteria such that a solution cannot be enhanced further with respect to one criterion without destroying the achievements with respect to the other criteria – it is Pareto-optimal. The user may then explore the Pareto-optimal solutions or select just one to guide her through a media collection. Several algorithms have been proposed for multi-objective optimization, almost all of them based on evolutionary computation [2, 24]. In this work, we use the genetic algorithm NSGA-2 [4] to approximate the set of Pareto-optimal solutions. We used the operators implemented within the RapidMiner system, formerly known as Yale [16]. The population consists of cluster sets. These individuals are represented by binary vectors. A mapping from cluster sets to binary vectors is defined such that each element of the set of frequent termsets corresponds to one position in the vector. The result of the initial frequent termset clustering corresponds to a vector where each component has the value 1. This hierarchy of frequent sets is traversed in breadth-first manner when mapping to the vector components. We are looking for optimal solutions which are part of the frequent set result, i.e., vectors where some of the components have the value 0. As an illustration, Figure 4 shows the binary encoding of a cluster set, where the frequent sets {a}, {b}, {c}, {a, b}, {b, c} had been found. In the course of mutation and cross-over, the NSGA-2 algorithm may create vectors that correspond to invalid cluster sets. These are repaired by deleting those clusters that are not linked to a path which leads to the root cluster. Hence, our cluster conditions are enforced by post-processing each individual. The algorithm approximates the set Γ* ⊆ Γ of Pareto-optimal cluster sets:

    Γ* = {𝒞 ∈ Γ | ∄ 𝒟 ∈ Γ : f⃗(𝒟) ≻ f⃗(𝒞)}    (11)

where ∄ 𝒟 ∈ Γ : f⃗(𝒟) ≻ f⃗(𝒞) states that there is no cluster set 𝒟 that Pareto-dominates the cluster set 𝒞 with respect to the fitness functions f⃗. Thus Γ* contains all non-dominated cluster sets.
[Figure 4: Binary encoding of a cluster set over the frequent termsets {}, {a}, {b}, {c}, {a,b}, {b,c}. Each cluster corresponds to a component of the vector; the ordering corresponds to breadth-first search in the result of the frequent termset clustering, i.e., the complete search space.]
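The sketch below illustrates the binary encoding, a possible repair step, two example objectives (coverage to be maximized, overlap to be minimized), and a plain non-dominated filter. The actual experiments use the NSGA-2 operators of RapidMiner [4], [16]; the concrete formulas for coverage and overlap and the reading of ≺ as an immediate-predecessor relation are our own illustrative assumptions.

```python
# Hedged sketch of encoding, repair, and objectives for the multi-objective search.
from typing import Dict, FrozenSet, List, Sequence, Set, Tuple

Cluster = FrozenSet[str]

def immediate_parents(D: Cluster, termsets: Sequence[Cluster]) -> List[Cluster]:
    """Proper subsets of D among the frequent termsets with nothing in between."""
    candidates = [C for C in list(termsets) + [frozenset()] if C < D]
    return [C for C in candidates if not any(C < E < D for E in candidates)]

def decode_and_repair(bits: Sequence[int], termsets: Sequence[Cluster]) -> Set[Cluster]:
    """Bit vector -> cluster set; drop clusters without a selected path to the root {}."""
    chosen = {ts for ts, b in zip(termsets, bits) if b}
    kept: Set[Cluster] = {frozenset()}                       # condition (9a)
    for ts in sorted(chosen, key=len):                       # bottom-up by size
        if ts and any(p in kept for p in immediate_parents(ts, termsets)):
            kept.add(ts)                                     # keeps (9b) satisfied
    return kept

def coverage(clusters: Set[Cluster], tags_of: Dict[str, Set[str]]) -> float:
    """Fraction of resources lying in at least one non-root cluster (maximize)."""
    return sum(any(c and c <= tags for c in clusters)
               for tags in tags_of.values()) / len(tags_of)

def overlap(clusters: Set[Cluster], tags_of: Dict[str, Set[str]]) -> float:
    """Average number of non-root clusters per resource (minimize)."""
    return sum(sum(1 for c in clusters if c and c <= tags)
               for tags in tags_of.values()) / len(tags_of)

def pareto_front(points: List[Tuple[float, float]]) -> List[int]:
    """Indices of (coverage, overlap) points not dominated by any other point."""
    front = []
    for i, (cov_i, ov_i) in enumerate(points):
        dominated = any(cov_j >= cov_i and ov_j <= ov_i and (cov_j > cov_i or ov_j < ov_i)
                        for j, (cov_j, ov_j) in enumerate(points) if j != i)
        if not dominated:
            front.append(i)
    return front

# Example: decode the bit vector (1, 0, 1) over three frequent termsets.
termsets = [frozenset({"art"}), frozenset({"photography"}), frozenset({"art", "photography"})]
tags_of = {"r1": {"art", "photography"}, "r2": {"art"}, "r3": {"travel"}}
clusters = decode_and_repair([1, 0, 1], termsets)
print(clusters, coverage(clusters, tags_of), overlap(clusters, tags_of))
```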
3.2 Resulting Navigation Structures
We have applied our multistrategy approach, which combines frequent sets with multi-objective optimization, to data from the BibSonomy system [11], a collaborative tagging system allowing users to annotate web resources and publications. Our data set contains the tag assignments of about 780 users. The number of resources tagged by at least one user is about 59,000. The number of tags used is 25,000 and the total number of tag assignments is 330,000. Since all Pareto-optimal clusterings are found in one learning run, it does not make sense to compare the found clusterings to one delivered by an alternative approach. Either that one clustering is part of the Pareto front, or it is not Pareto-optimal. The clusterings in Figure 5 show some navigation structures from a Pareto front minimizing overlap and maximizing coverage. In the picture, only the depth of a node is indicated. Each node is a set of resources labeled by (a set of) tags. Since users regularly navigate by tags, these labels are easily understood. Also the structure is easily understood, so that users can indeed select the structure they like. Other approaches combine the two criteria into one heuristic and deliver just one of the shown clusterings. A more detailed analysis of the individual results shows the following:

• Cluster sets that fulfill the overlap criterion well are quite narrow and do not cover many resources.

• Cluster sets that fulfill the coverage criterion well are quite broad and contain a lot of overlap. Note, however, that overlap might be desired for navigation: the same resource can then be retrieved using different tags or tag sets.

• All resulting cluster structures are very shallow, as neither of the criteria forces the selection of deep clusters. Both high coverage and low overlap can be achieved with clusters at level one.
Figure 5: Some cluster sets from the Pareto front optimizing overlap vs. coverage: starting with a small overlap (upper left), moving to a high coverage (lower left).

Multi-objective optimization requires the criteria to be orthogonal. Hence, it has to be verified for each pair of criteria that they are negatively correlated. On artificially generated data, the correlations of the criteria have indeed been investigated [12]. In addition to the usual clustering criteria of overlap and coverage, we have defined criteria which take the hierarchy of clusters and the given frequent tag sets into account, namely child count and completeness. Since navigation is often performed top-down, we use the number of child nodes at the root and at each inner node as an indicator of the complexity of a cluster set.
Definition 8 (Child count) Given a cluster set 𝒞, we define succ : 𝒞 → P(𝒞) as succ(C) = {D ∈ 𝒞 | C ≺ D} and thus the set 𝒞′ ⊆ 𝒞 as 𝒞′ = {C ∈ 𝒞 | |succ(C)| > 0}. Based on this, we can define

    childcount_max(𝒞) = max_{C ∈ 𝒞′} |succ(C)|    (12)
Thus the complexity of a cluster structure is given by its most complex node. The maximal child count will usually increase with increasing coverage, since covering more resources often means adding clusters. The criterion of completeness is similar to coverage, but tailored to the frequent termsets that are the starting point of our clustering. The idea of completeness is that the selected clusters should represent the given frequent termsets as completely as possible.

Definition 9 (Completeness) Given two cluster sets 𝒞 and 𝒞_ref, we assume 𝒞 ⊂ 𝒞_ref. Then the function complete : Γ × Γ → [0, 1] is defined as:

    complete(𝒞, 𝒞_ref) = |𝒞| / |𝒞_ref|    (13)
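For concreteness, a small sketch (our own code, not the original implementation) of the two hierarchy-aware criteria, reading C ≺ D as "C is an immediate predecessor of D within the cluster set":

```python
# Illustrative computation of the maximal child count (12) and completeness (13).
from typing import FrozenSet, Set

Cluster = FrozenSet[str]

def children(C: Cluster, clusters: Set[Cluster]) -> Set[Cluster]:
    """succ(C): clusters D with C < D and nothing from the set strictly in between."""
    bigger = {D for D in clusters if C < D}
    return {D for D in bigger if not any(C < E < D for E in bigger)}

def max_child_count(clusters: Set[Cluster]) -> int:
    counts = [len(children(C, clusters)) for C in clusters]
    return max((c for c in counts if c > 0), default=0)        # Eq. (12)

def completeness(clusters: Set[Cluster], reference: Set[Cluster]) -> float:
    return len(clusters) / len(reference)                       # Eq. (13), clusters within reference

selected  = {frozenset(), frozenset({"art"}), frozenset({"art", "photography"})}
reference = selected | {frozenset({"travel"}), frozenset({"photography"})}
print(max_child_count(selected), completeness(selected, reference))   # 1 0.6
```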
Thus, the more of the original frequent termsets are contained in the final clustering, the higher the completeness. This combines coverage and cluster depth in one straightforward criterion.

[Figure 6: Pareto front for child count vs. completeness for different minimal supports (supp = 6.0%, 7.5%, and 10%); completeness is plotted on the x-axis, the (negated) child count on the y-axis.]

Figure 6 shows the Pareto front when minimizing the child count and maximizing completeness. Figure 7 shows some clusterings along the Pareto front when optimizing child count vs. completeness. Inspecting the Pareto-optimal results more closely yields the following:
• Clusterings with a small maximum child count are narrow, but deep. This effect can be explained by the fact that deep clusterings yield, on average, a higher completeness.

• Clusterings with high completeness are broader, but still deep, and contain much of the heterogeneity contained in the original data. They also show a very high coverage.

In this way, we can actually optimize three criteria at once: the simplicity of the cluster structure, its level of detail in terms of cluster depth, and the coverage of resources. Furthermore, these criteria are not biased towards removing heterogeneity from the data, which is essential in many explorative applications.
Figure 7: Some cluster sets from the Pareto front optimizing child count vs. completeness: starting with a small child count (upper left), moving to a high completeness (lower left).

Tag set clustering by multi-objective optimization on the basis of frequent set mining is a multistrategy learning approach which supports personalized access to large collections in the Web 2.0. The decomposition of clustering criteria into orthogonal ones allows computing all Pareto-optimal solutions in one learning run. Again, the service is to heterogeneous user groups. Machine Learning helps to automatically build human-computer interfaces that correspond to a user's preferences and give him or her a choice.
4 Conclusion
In this chapter, two Machine Learning approaches serving the Web 2.0 have been presented. Both deliver sets of clusterings in order to allow users to choose. In both, clusters are labeled by user-given tags (or tag sets) without a formally defined semantics, understood in the way natural language is understood. In order to serve heterogeneous user groups, we have focused on locality in the LACE algorithm and on delivering all Pareto-optimal solutions in the tag set clustering. Ryszard Michalski was always open to new approaches and, at his last visit to Dortmund, was curious about our then emerging attempts to apply Machine Learning to the Web 2.0; we are sorry that we can no longer discuss these matters with him. Even though our methods differ from his, the impact of his ideas about understandability, the service of Machine Learning, and multistrategy learning is illustrated also by this modest work.
References

[1] F. Beil, M. Ester, and X. Xu. Frequent term-based text clustering. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD), 2002.

[2] C. A. Coello Coello. A comprehensive survey of evolutionary-based multiobjective optimization techniques. Knowledge and Information Systems, 1(3):129–156, 1999.

[3] Souptik Datta, Kanishka Bhaduri, Chris Giannella, Ran Wolff, and Hillol Kargupta. Distributed data mining in peer-to-peer networks. IEEE Internet Computing, special issue on distributed data mining, 2005.

[4] K. Deb, S. Agrawal, A. Pratab, and T. Meyarivan. A fast elitist non-dominated sorting genetic algorithm for multi-objective optimization: NSGA-II. In Proceedings of the Parallel Problem Solving from Nature Conference, 2000.

[5] Inderjit S. Dhillon, Yuqiang Guan, and Brian Kulis. Kernel k-means: spectral clustering and normalized cuts. In Proceedings of the Conference on Knowledge Discovery and Data Mining, 2004.

[6] Oliver Flasch, Andreas Kaspari, Katharina Morik, and Michael Wurst. Aspect-based tagging for collaborative media organization. In Bettina Berendt, Andreas Hotho, Dunja Mladenic, and Giovanni Semeraro, editors, From Web to Social Web: Discovering and Deploying User and Content Profiles. Springer, 2007.

[7] B. C. M. Fung, K. Wang, and M. Ester. Hierarchical document clustering using frequent itemsets. In Proceedings of the SIAM International Conference on Data Mining, 2003.

[8] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. CURE: an efficient clustering algorithm for large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 73–84, 1998.

[9] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2000.

[10] Helge Homburg, Ingo Mierswa, Bülent Möller, Katharina Morik, and Michael Wurst. A benchmark dataset for audio classification and clustering. In Proceedings of the International Conference on Music Information Retrieval, 2005.

[11] A. Hotho, R. Jäschke, C. Schmitz, and G. Stumme. BibSonomy: A social bookmark and publication sharing system. In Proceedings of the Conceptual Structures Tool Interoperability Workshop at the International Conference on Conceptual Structures, 2006.

[12] Andreas Kaspari. Maschinelle Lernverfahren für kollaboratives Tagging. Master's thesis, Technische Universität Dortmund, Computer Science, LS8, 2007.

[13] R. S. Michalski and K. Kaufman. Intelligent evolutionary design: A new approach to optimizing complex engineering systems and its application to designing heat exchangers. International Journal of Intelligent Systems, 21(12), 2006.

[14] R. S. Michalski, R. Stepp, and E. Diday. A recent advance in data analysis: Clustering objects into classes characterized by conjunctive concepts. In L. Kanal and A. Rosenfeld, editors, Progress in Pattern Recognition, pages 33–55. North-Holland, 1981.

[15] Ryszard S. Michalski and R. Chilausky. Knowledge acquisition by encoding expert rules versus computer induction from examples: A case study involving soybean pathology. International Journal for Man-Machine Studies, (12):63–87, 1980.

[16] Ingo Mierswa, Michael Wurst, Ralf Klinkenberg, Martin Scholz, and Timm Euler. YALE: Rapid prototyping for complex data mining tasks. In Lyle Ungar, Mark Craven, Dimitrios Gunopulos, and Tina Eliassi-Rad, editors, KDD '06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 935–940, New York, NY, USA, August 2006. ACM.

[17] Katharina Morik. Balanced cooperative modeling. In Ryszard Michalski and Gheorghe Tecuci, editors, Machine Learning – A Multistrategy Approach, pages 295–318. Morgan Kaufmann, 1994.

[18] Ryszard S. Michalski. A variable decision space approach for implementing a classification system. In Proceedings of the Second International Joint Conference on Pattern Recognition, Copenhagen, Denmark, pages 71–75, 1974.

[19] Alexander Strehl and Joydeep Ghosh. Cluster ensembles – a knowledge reuse framework for combining partitionings. In Proceedings of the AAAI, 2002.

[20] Alexander P. Topchy, Anil K. Jain, and William F. Punch. Combining multiple weak clusterings. In Proceedings of the International Conference on Data Mining, pages 331–338, 2003.

[21] K. Wang, C. Xu, and B. Liu. Clustering transactions using large items. In Proceedings of the International Conference on Information and Knowledge Management, 1999.

[22] Michael Wurst. Distributed Collaborative Structuring – A Data Mining Approach to Information Management in Loosely-Coupled Domains. PhD thesis, Technische Universität Dortmund, Computer Science, LS8, 2008.

[23] Michael Wurst, Katharina Morik, and Ingo Mierswa. Localized alternative cluster ensembles for collaborative structuring. In Johannes Fürnkranz, Tobias Scheffer, and Myra Spiliopoulou, editors, Proceedings of the European Conference on Machine Learning, pages 485–496. Springer, 2006.

[24] E. Zitzler and L. Thiele. Multiobjective evolutionary algorithms: a comparative case study and the strength Pareto approach. IEEE Transactions on Evolutionary Computation, 3(4):257–271, 1999.