Constructing and Mapping Fuzzy Thematic Clusters to Higher Ranks in a Taxonomy

Boris Mirkin, Susana Nascimento, Trevor Fenner, and Luís Moniz Pereira
Abstract— We present a method for mapping a structure to a related taxonomy in a thematically consistent way. The components of the structure are supplied with fuzzy profiles over the taxonomy. These are then generalized in two steps: first by fuzzy clustering, and then by mapping the clusters to higher ranks of the taxonomy. To be specific, we concentrate on the Computer Science area represented by the ACM Computing Classification System (ACM-CCS), but the approach is applicable to other taxonomies as well. We build fuzzy clusters of the taxonomy subjects according to the similarity between individual profiles. Clusters are extracted using an original additive spectral clustering method involving a number of model-based stopping conditions. The clusters are parsimoniously lifted to higher ranks of the taxonomy using an original recursive algorithm for minimizing a penalty function that involves "head subjects" at the higher ranks of the taxonomy along with their "gaps" and "offshoots". An example illustrating the method applied to real-world data is given.
I. INTRODUCTION

The last decade has witnessed an unprecedented rise of the concept of ontology as a computationally feasible tool for knowledge maintenance. For example, the usage of Gene Ontology [5] for the interpretation and annotation of various gene sets and gene expression data is becoming a matter of routine in bioinformatics (see, for example, [23] and references therein). To apply similar approaches to less digitalized domains, such as the activities of organizations in a field of science or the knowledge supplied by teaching courses in a university school, one needs to distinguish between different levels of data and knowledge, and to build techniques for deriving and transforming the corresponding bits of knowledge within a comprehensible framework. The goal of this paper is to present such a framework, utilizing a pre-specified taxonomy of the domain under consideration as the base. In general, a taxonomy is a rooted-tree-like structure whose nodes correspond to individual topics in such a way that the parental node's topic generalizes the topics of its children's nodes. We concentrate on the issue of representing an organization, or any other system under consideration, in terms of the taxonomy topics. We first build profiles for its constituent entities in terms of the taxonomy and then thematically generalize them.

To represent a functioning structure over a taxonomy is to indicate those topics in the taxonomy that most fully express the structure's working in its relation to the taxonomy. It may seem that conventionally mapping the system to all nodes related to topics involved in the profiles within the structure would do the job, but this is not the case: such a mapping typically produces a fragmentary set of many nodes without any clear picture of the thematic interrelations among them. Therefore, to make the representation thematically consistent and parsimonious, we propose a two-phase generalization approach. The first phase generalizes over the structure by building clusters of taxonomy topics according to the functioning of the system. The second phase generalizes over the clusters by parsimoniously mapping them to higher ranks of the taxonomy, determined by a special parsimonious "lift" procedure. Both the entity profiles and the thematic clusters derived at the first phase are fuzzy, in order to better reflect real-world objects, so the lifting method applies to fuzzy clusters. It should be pointed out that both building fuzzy profiles and finding fuzzy clusters are research activities well documented in the literature; yet the issues involved in this project led us to develop some original schemes of our own, including an efficient method for fuzzy clustering combining the approaches of spectral and approximation clustering [15]. We apply these constructions in two areas: (i) to visualize the activities of Computer Science research organizations; and (ii) to discern the complexes of mathematical ideas according to classes taught in regular teaching courses in a university department.

Boris Mirkin is with the School of Computer Science, Birkbeck University of London, London, WC1E 7HX, UK, and the Division of Applied Mathematics, Higher School of Economics, Moscow, RF (email: [email protected]). Susana Nascimento is with the Computer Science Department and Centre for Artificial Intelligence (CENTRIA), Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, Caparica, Portugal (email: [email protected]). Trevor Fenner is with the School of Computer Science, Birkbeck University of London, London, WC1E 7HX, UK (email: [email protected]). Luís Moniz Pereira is with the Computer Science Department and Centre for Artificial Intelligence (CENTRIA), Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, Caparica, Portugal (email: [email protected]).
We take the popular ACM Computing Classification System (ACM-CCS), a conceptual four-level classification of the Computer Science subject area, as the prespecified taxonomy for (i), and the three-layer Mathematics Subject Classification MSC2010, developed by the Mathematical Reviews and Zentralblatt Mathematics editors (see http://www.ams.org/mathscinet/msc/msc2010.html), for (ii). In what follows, the focus is mainly on application (i) to research organizations; an earlier stage of this project is described in [16]. The paper is organized according to the structure of our approach: section II describes an e-system we developed for eliciting fuzzy profiles from Computer Science researchers, section III describes our method for deriving fuzzy clusters from the profiles, and section IV presents our parsimonious lift method for generalizing to higher ranks in a taxonomy tree.
II. TAXONOMY-BASED PROFILES

A. Why represent over the ACM-CCS taxonomy?

The concept of a fuzzy profile has been utilized in [6], [18], [7]. In [20], it is proposed to automatically learn a user profile as a fuzzy tuple for a TV recommender system, and in [30] a multi-agent student profile system based on fuzzy logic is developed to provide dynamic learning plans for individual students in order to personalize web-based educational systems. Some studies involve an ontology or taxonomy. For example, in web mining, representations are extracted from domain ontologies: the ontologies are used to automatically characterize usage profiles by describing users' interests and preferences for web personalisation [28]. A recommender system for on-line academic research papers, which extracts user profiles according to an ontology of research topics, is described in [10]. Similar software can be designed for discerning Mathematics taxonomy topics in the syllabuses of mathematics-related modules in a university department and for building fuzzy profiles of the modules. In the case of an investigation of the activities of a university department or centre, a research team's profile can be defined as a fuzzy membership function on the set of leaf nodes of the taxonomy under consideration, such that the memberships reflect the extent of the team's effort put into the corresponding research topics. In this case, the ACM Computing Classification System (ACM-CCS) [1] is used as the taxonomy. ACM-CCS comprises eleven major partitions (first-level subjects) such as B. Hardware, D. Software, E. Data, G. Mathematics of Computing, H. Information Systems, etc. These are subdivided into 81 second-level subjects. For example, item I. Computing Methodologies consists of eight subjects including I.1 SYMBOLIC AND ALGEBRAIC MANIPULATION, I.2 ARTIFICIAL INTELLIGENCE, I.5 PATTERN RECOGNITION, etc.
These are further subdivided into third-layer topics; for instance, I.5 PATTERN RECOGNITION comprises seven topics including I.5.3 Clustering, I.5.4 Applications, etc. Taxonomy structures such as the ACM-CCS are used mainly as devices for the annotation of, and search for, documents or publications in collections such as that on the ACM portal [1]. The ACM-CCS tree has also been applied as: a gold standard for ontologies derived by web mining systems such as the CORDER engine [26]; a device for determining semantic similarity in information retrieval [11] and e-learning applications [29], [4]; and a device for matching software practitioners' needs and software researchers' activities [3]. Here we concentrate on a different application of ACM-CCS: a generalized representation of a Computer Science research organization. It should be mentioned that the very idea of representing the activities of a research organization as a whole may seem rather odd, because conventionally it is only the accumulated body of results that matters in the sciences, and these always have been and still are provided by individual efforts. The assumption of individual research efforts implicitly underlies systems for reviewing
and comparing different research departments in countries such as the United Kingdom, where scientific organizations are subject to regular comprehensive review and evaluation practices. The evaluation is based on the analysis of individual researchers' achievements, leaving the portrayal of a general picture to subjective declarations by the departments [21]. Such an evaluation exercise provides for the assessment of relative strengths among different departments, which is good for addressing funding issues but difficult to use for designing an integral portrayal of the developments. This latter aspect is important for decisions regarding long-term or wide-range issues of scientific development, such as national planning or addressing the so-called 'South-North divide' between developed and underdeveloped countries [27]. Obviously, an ACM-CCS visualized image can be used for overviewing the scientific subjects being developed in the organization, for assessing the scientific issues in which the character of activities in the organization does not fit well onto the classification (these can potentially be the growth points), and for helping to plan the restructuring of research and investment.

B. E-Screen survey tool

Fuzzy profiles are derived either from automatic analysis of documents posted on the web by the teams or by explicitly surveying the members of the department. The latter option is especially convenient in situations in which the web contents do not properly reflect the developments, for example, in non-English speaking countries with relatively underdeveloped internet infrastructures for the maintenance of research results. If such is the case, a tool for surveying the research activities of the members and teams is needed.
To tackle this, and also to allow for expert evaluations, we developed an interactive survey tool that provides two types of functionality: (i) collection of data about the ACM-CCS based research profiles of individual members; (ii) statistical analysis and visualization of the data and results of the survey at the level of a department. The respondent is asked to select up to six topics among the leaf nodes of the ACM-CCS tree and to assign each a percentage expressing the proportion of that topic in the total of the respondent's research activity over, say, the past four years. Figure 1 shows a screenshot of the baseline interface for a respondent who has chosen six ACM-CCS topics during her survey session. A deeper view can be taken with a different "research results" form that allows the respondent to make a more detailed assessment in terms of individual research results in several categories, such as refereed publications, funded projects, and theses supervised. These can be used for weighting research results with respect to the topics in ACM-CCS. The set of profiles supplied by the respondents forms an N × M matrix F, where N is the number of ACM-CCS topics involved in the profiles and M the number of respondents. Each column of F is a fuzzy membership function, rather sharply delineated because only six topics may have positive memberships in each column.
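As an illustration, the profile matrix F can be assembled from survey responses along these lines (a sketch: the respondent names, topic codes and percentages below are hypothetical examples, not data from the survey):

```python
import numpy as np

# Hypothetical survey responses: each respondent selects up to six
# ACM-CCS leaf topics and assigns percentages that sum to 100.
responses = {
    "respondent_1": {"I.5.3": 40, "I.5.4": 30, "I.2.6": 30},
    "respondent_2": {"I.5.3": 50, "G.1.6": 25, "G.2.2": 25},
}

topics = sorted({t for prof in responses.values() for t in prof})
F = np.zeros((len(topics), len(responses)))        # N x M profile matrix
for j, name in enumerate(sorted(responses)):
    for t, pct in responses[name].items():
        F[topics.index(t), j] = pct / 100.0        # memberships sum to 1

assert np.allclose(F.sum(axis=0), 1.0)             # each column is a fuzzy profile
```

Each column of F is then one respondent's fuzzy membership function over the taxonomy leaves, as described above.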
Since all the individual profile scores sum to unity, \sum_{t\in T} f_{tv} = 1, the scores of individuals bearing more topics tend to be smaller than those of individuals engaged in fewer topics. To compensate for this, we utilize a natural weighting factor: the ratio of the number of topics marked by individual v, n_v, to n_{max}, the maximum n_v over all v = 1, 2, ..., V.

Fig. 1. Screenshot of the interface survey tool for selection of ACM-CCS topics.
III. REPRESENTING A RESEARCH ORGANIZATION BY FUZZY CLUSTERS OF ACM-CCS TOPICS

A. Deriving similarity between ACM-CCS research topics

We represent a research organization by clusters of ACM-CCS topics to reflect thematic commonalities between the activities of members or teams working on these topics. The clusters are found by analyzing similarities between topics according to their appearances in the profiles. The more profiles contain a pair of topics i and j, and the greater the memberships of these topics, the greater the similarity score for the pair. There is a specific branch of clustering applied to similarity data, the so-called relational fuzzy clustering [2]. In the framework of 'usage profiling', relational fuzzy clustering is a widely used technique for discovering different interests and trends among users from Web log records. In bioinformatics, several clustering techniques have been successfully applied to the analysis of gene expression profiles and gene function prediction by incorporating gene ontology information into clustering algorithms [8]. In spite of the fact that many fuzzy clustering algorithms have already been developed, the situation remains rather uncertain because they all involve manually specified parameters, such as the number of clusters or a similarity threshold, which are subject to arbitrary choices. This is why we come up with a version of approximate clustering modified with the spectral clustering approach to make use of the Laplacian data transformation, which has proven an effective tool for sharpening the cluster structure hidden in the data [9]. This combination leads to a number of model-based parameters as aids for choosing the right number of clusters. Consider a set of V individuals (v = 1, 2, ..., V) engaged in research over some topics t ∈ T, where T is a prespecified set of scientific subjects.
The level of research effort by individual v in developing topic t is evaluated by the membership f_{tv} in profile f_v (v = 1, 2, ..., V). The similarity between topics t and t' can then be defined as the inner product of the profile vectors f_t = (f_{tv}) and f_{t'} = (f_{t'v}), v = 1, 2, ..., V, so that a_{tt'} = <f_t, f_{t'}> = \sum_{v=1}^{V} f_{tv} f_{t'v}. With the weighting factor n_v/n_{max} introduced above, the weighted similarity becomes

w_{tt'} = \sum_{v=1}^{V} (n_v / n_{max}) f_{tv} f_{t'v}.   (1)
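Equation (1) translates directly into a few lines of NumPy (a sketch; the toy profile matrix F below is a made-up example, with topics as rows and respondents as columns):

```python
import numpy as np

def weighted_similarity(F):
    """Topic-to-topic similarity of Eq. (1):
    w_tt' = sum_v (n_v / n_max) f_tv f_t'v, where n_v is the number of
    topics with positive membership in respondent v's profile."""
    n_v = (F > 0).sum(axis=0)          # topics marked by each respondent
    weights = n_v / n_v.max()          # n_v / n_max
    return (F * weights) @ F.T

# Toy profile matrix: 3 topics (rows) by 2 respondents (columns)
F = np.array([[0.4, 0.0],
              [0.3, 0.6],
              [0.3, 0.4]])
W = weighted_similarity(F)
assert np.allclose(W, W.T)             # similarity is symmetric
```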
To make the cluster structure in the similarity matrix sharper, we apply the spectral clustering approach to pre-process the similarity matrix W using the so-called Laplacian transformation [9]. First, an N × N diagonal matrix D is defined, with (t, t) entry equal to d_t = \sum_{t'\in T} w_{tt'}, the sum of t's row of W. Then the unnormalized Laplacian and the normalized Laplacian are defined by the equations L = D − W and L_n = D^{-1/2} L D^{-1/2}, respectively. Both matrices are positive semidefinite and have zero as their minimum eigenvalue. The minimum non-zero eigenvalues and corresponding eigenvectors of the Laplacian matrices are then utilized as relaxations of combinatorial partition problems [25], [9]. Of the two normalizations, the normalized Laplacian is generally considered superior [9]. Yet the Laplacian normalization by itself cannot be used in our approach below, because our method relies on maximum rather than minimum eigenvalues. To get around this, the authors of [19] utilized a complementary matrix M_n = D^{-1/2} W D^{-1/2}, which relates to L_n by the equation L_n = I − M_n, where I is the identity matrix. This means that M_n has the same eigenvectors as L_n, whereas the respective eigenvalues relate to each other as λ and 1 − λ, so that the matrix M_n could be utilized for our purposes as well. We prefer, however, the Laplacian pseudoinverse transformation, Lapin for short, defined by

L_n^{-}(W) = \tilde{Z} \tilde{\Lambda}^{-1} \tilde{Z}^T,

where \tilde{\Lambda} and \tilde{Z} are derived from the spectral decomposition L_n = Z \Lambda Z^T of the matrix L_n = D^{-1/2}(D − W)D^{-1/2}. To specify these matrices, first the set T' of indices corresponding to the non-zero elements of \Lambda is determined, after which the matrices are taken as \tilde{\Lambda} = \Lambda(T', T') and \tilde{Z} = Z(:, T'). The choice of the Lapin transformation can be explained by the fact that it leaves the eigenvectors of L_n unchanged while inverting the non-zero eigenvalues λ ≠ 0 to the eigenvalues 1/λ of L_n^{-}. The maximum eigenvalue of L_n^{-} is then the inverse of the minimum non-zero eigenvalue λ_1 of L_n, corresponding to the same eigenvector. The inverse 1/λ of a λ near 0 is quite large, which greatly separates it from the inverses of the other, larger eigenvalues of L_n. Therefore, the Lapin transformation generates larger gaps between the eigenvalues of interest, which suits well our one-by-one approach described in the next section. Thus, from here on we work with the transformed similarity matrix A = L_n^{-}(W).
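A minimal NumPy sketch of the Lapin transformation (the tolerance used to decide which eigenvalues count as zero is an implementation assumption; the example graph is made up):

```python
import numpy as np

def lapin(W, tol=1e-9):
    """Pseudoinverse of the normalized Laplacian Ln = D^(-1/2)(D - W)D^(-1/2):
    keep the eigenvectors of Ln, invert its non-zero eigenvalues."""
    d = W.sum(axis=1)                          # row sums d_t
    d_isqrt = np.diag(1.0 / np.sqrt(d))
    Ln = d_isqrt @ (np.diag(d) - W) @ d_isqrt
    lam, Z = np.linalg.eigh(Ln)                # spectral decomposition of Ln
    nz = lam > tol                             # indices of non-zero eigenvalues
    return Z[:, nz] @ np.diag(1.0 / lam[nz]) @ Z[:, nz].T

# Star graph on 3 nodes: node 0 linked to nodes 1 and 2
W = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])
A = lapin(W)
assert np.allclose(A, A.T)                     # Lapin matrix is symmetric
```

Note that every node must have a positive row sum d_t for D^{-1/2} to exist; isolated topics would have to be dropped beforehand.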
B. Additive fuzzy clusters using a spectral method

Assume that the thematic similarities a_{tt'} between topics are but manifested expressions of some hidden patterns within the organization, which can be represented by fuzzy clusters in exactly the same manner as the manifested scores in the definition of the similarity w_{tt'} (1). We assume that a thematic fuzzy cluster is represented by a membership vector u = (u_t), t ∈ T, such that 0 ≤ u_t ≤ 1 for all t ∈ T, and an intensity μ > 0 that expresses the extent of significance of the pattern corresponding to the cluster within the organization under consideration. With the introduction of the intensity, applied as a scaling factor to u, it is the product μu that is a solution rather than its individual co-factors: given a value of the product μu_t, it is impossible to tell which part of it is μ and which u_t. To resolve this, we follow a conventional scheme: we constrain the scale of the membership vector u to a constant level, for example by a condition such as \sum_t u_t^2 = 1 or \sum_t u_t = 1; the remaining factor then defines the value of μ. Our additive fuzzy clustering model involves K fuzzy clusters that reproduce the pseudo-inverted Laplacian similarities a_{tt'} up to additive errors, according to the following equations:

a_{tt'} = \sum_{k=1}^{K} μ_k^2 u_{kt} u_{kt'} + e_{tt'},   (2)
where u_k = (u_{kt}) is the membership vector of cluster k, and μ_k its intensity. The item μ_k^2 u_{kt} u_{kt'} expresses the contribution of cluster k to the similarity a_{tt'} between topics t and t', which depends on both the cluster's intensity and the membership values. The value μ_k^2 summarizes the contribution of the intensity and will be referred to as the cluster's weight. To fit the model in (2), we apply the least-squares approach, thus minimizing the sum of all e_{tt'}^2. Since A is positive semidefinite, its first K eigenvalues and corresponding eigenvectors form a solution to this if no constraints on the vectors u_k are imposed. On the other hand, if the vectors u_k are constrained to be just 1/0 binary vectors, model (2) becomes the so-called additive clustering model [24], [12]. A simplified version of model (2), involving a constant, not cluster-specific, intensity was considered in [22]; the authors proposed to fit that model using Newton's descent method, which involves many initialization parameters that need to be pre-specified.

We apply the one-by-one principal component analysis strategy for finding one cluster at a time. Specifically, at each step we consider the problem of minimizing the least-squares criterion reduced to a single fuzzy cluster,

E = \sum_{t,t'\in T} (b_{tt'} − ξ u_t u_{t'})^2,   (3)

with respect to the unknown positive weight ξ (so that the intensity μ is the square root of ξ) and the fuzzy membership vector u = (u_t), given the similarity matrix B = (b_{tt'}). At the first step, B is taken to be equal to A. Each found cluster changes B by subtracting the contribution of the found cluster (which is additive according to model (2)), so that the residual similarity matrix for obtaining the next cluster is B − μ^2 u u^T, where μ and u are the intensity and membership vector of the found cluster. In this way, A is indeed additively decomposed according to formula (2), and the number of clusters K can be determined in the process.

Let us specify an arbitrary membership vector u and find the value of ξ minimizing criterion (3) at this u by using the first-order condition of optimality:

ξ = \frac{\sum_{t,t'\in T} b_{tt'} u_t u_{t'}}{\sum_{t\in T} u_t^2 \sum_{t'\in T} u_{t'}^2} = \frac{u^T B u}{(u^T u)^2},   (4)

which is obviously non-negative if B is positive semidefinite. By putting this ξ into equation (3), we arrive at

E = \sum_{t,t'\in T} b_{tt'}^2 − ξ^2 \sum_{t\in T} u_t^2 \sum_{t'\in T} u_{t'}^2 = S(B) − ξ^2 (u^T u)^2,

where S(B) = \sum_{t,t'\in T} b_{tt'}^2 is the similarity data scatter. Let us denote the last item by

G(u) = ξ^2 (u^T u)^2 = \left( \frac{u^T B u}{u^T u} \right)^2,   (5)

so that the similarity data scatter is the sum

S(B) = G(u) + E   (6)

of two parts: G(u), which is explained by cluster (μ, u), and E, which remains unexplained. An optimal cluster, according to (6), maximizes the explained part G(u) in (5) or, equivalently, its square root

g(u) = ξ u^T u = \frac{u^T B u}{u^T u},   (7)

which is the celebrated Rayleigh-Ritz quotient, whose maximum value in the unconstrained problem is the maximum eigenvalue of the matrix B, reached at the corresponding eigenvector. This shows that the spectral clustering approach is appropriate for our problem. According to this approach, one should find the maximum eigenvalue λ and corresponding normed eigenvector z of B, [λ, z] = Λ(B), and take its projection onto the set of admissible fuzzy membership vectors. In the simplest case of unconstrained fuzzy memberships, the projection operator P(z) works component-wise as follows: P(z) = 0 if z ≤ 0; P(z) = z if 0 < z < 1; and P(z) = 1 if z ≥ 1.
If, however, one requires that the fuzzy clusters form a fuzzy partition, so that \sum_{k=1}^{K} u_{kt} = 1 for each t ∈ T, then after each cluster extraction step k, k = 1, 2, ..., K − 1, the cumulative belongingness α_{kt} = \sum_{l=1}^{k} u_{lt} should be taken into account: for each t, the unity in the definition of P(z) should be changed to 1 − α_{kt}, which warrants that \sum_{k=1}^{K} u_{kt} ≤ 1.

According to this approach, there can be a number of model-based criteria for halting the process of sequential extraction of fuzzy clusters:
1) The optimal value of ξ (4) for the spectral fuzzy cluster becomes negative.
2) The contribution of a single extracted cluster becomes too low, less than a pre-specified value τ > 0.
3) The residual scatter E becomes smaller than a pre-specified value, say less than 5% of the original similarity data scatter.

The one-by-one fuzzy thematic cluster extraction spectral algorithm, referred to as the ADDI-FS algorithm, is defined in Table I.
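The extraction loop with these stopping rules can be sketched in NumPy as follows (a sketch under the definitions above, not the authors' implementation; treating the thresholds eps and tau as fractions of the initial scatter, and the toy matrix A, are assumptions for illustration):

```python
import numpy as np

def project(z):
    """Projection P(z) onto unconstrained fuzzy memberships: clip to [0, 1]."""
    return np.clip(z, 0.0, 1.0)

def addi_fs(A, eps=0.05, tau=0.01, k_max=10):
    """One-by-one extraction of additive fuzzy clusters (intensity mu,
    membership u) from a Lapin-transformed similarity matrix A."""
    B = A.copy()
    S0 = np.sum(B * B)                     # initial data scatter S(A)
    clusters = []
    for _ in range(k_max):
        lam, Z = np.linalg.eigh(B)
        xi, z = lam[-1], Z[:, -1]          # maximum eigenvalue and eigenvector
        if xi <= 0:                        # residual matrix has no positive part
            break
        best = None                        # try P(z) and P(-z), keep larger G
        for u in (project(z), project(-z)):
            uu = u @ u
            if uu == 0:
                continue
            xi_k = (u @ B @ u) / uu ** 2   # optimal weight, Eq. (4)
            G = xi_k ** 2 * uu ** 2        # explained contribution, Eq. (5)
            if xi_k > 0 and (best is None or G > best[2]):
                best = (u, xi_k, G)
        if best is None or best[2] <= tau * S0:
            break                          # contribution too low
        u, xi_k, _ = best
        clusters.append((np.sqrt(xi_k), u))
        B = B - xi_k * np.outer(u, u)      # subtract cluster contribution
        if np.sum(B * B) <= eps * S0:      # residual scatter small enough
            break
    return clusters

# Toy run on a small positive definite similarity matrix
A = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.1],
              [0.1, 0.1, 0.2]])
clusters = addi_fs(A)
assert clusters and clusters[0][0] > 0
```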
TABLE I
ADDI-FS ALGORITHM

1  Set k = 0, B = A, ε > 0, τ > 0;
2  Repeat k = k + 1;
3    [ξ, z] = Λ(B);
4    If ξ > 0:
5      u_k = P(z) or u_k = P(−z), depending on which leads to a larger G_k;
6      ξ_k = u_k^T B u_k / (u_k^T u_k)^2;
7      G_k = ξ_k^2 (u_k^T u_k)^2;
8      B = B − ξ_k u_k u_k^T;
9      S(B) = Tr(B^T B);
10   Else: Halt;
11 Until (ξ_k ≤ 0 or G_k ≤ τ or S(B) ≤ ε or k == K_max).

More details on the ADDI-FS method, including its application to affinity and community structure data along with experimental comparisons with other fuzzy clustering methods, are described in [15].

IV. PARSIMONIOUS LIFTING METHOD

To generalize the contents of a thematic cluster, we lift it to higher ranks of the taxonomy, so that if all or almost all children of a node in an upper layer belong to the cluster, then the node itself is taken to represent the cluster at this higher level of the ACM-CCS taxonomy (see Fig. 2). Such lifting can be done differently, leading to different portrayals of the cluster on the ACM-CCS tree depending on the relative weights of the events taken into account. A major event is the so-called "head subject", a taxonomy node covering (some of) the leaves belonging to the cluster, so that the cluster is represented by a set of head subjects. The penalty of the representation, to be minimized, is proportional to the number of head subjects, so the smaller that number the better. Yet the head subjects cannot be lifted too high in the tree because of the penalties for the associated events, the cluster "gaps" and "offshoots".

Fig. 2. Two clusters of second-layer topics, presented with checked and diagonal-lined boxes, respectively. The checked box cluster fits within one first-level category (with one gap only), whereas the diagonal line box cluster is dispersed among two categories on the right. The former fits the classification well; the latter does not.

The gaps are a head subject's children topics that are not included in the cluster. An offshoot is a taxonomy leaf node that is a head subject (not lifted). It is not difficult to see that the gaps and offshoots are determined by the head subjects specified in a lift (see Fig. 3).

Fig. 3. Three types of features in the mapping of a subject cluster to the taxonomy: topics in the subject cluster, head subjects, gaps, and offshoots.

The total count of head subjects, gaps and offshoots, each weighted by both the penalties and the leaf memberships, is used for scoring the extent of the cluster misfit needed for lifting a grouping of research topics over the classification tree. The smaller the score, the more parsimonious the lift and the better the fit. Depending on the relative weighting of gaps, offshoots and multiple head subjects, different lifts can minimize the total misfit, as illustrated in Figs. 5 and 6 later. Altogether, the set of topic clusters together with their optimal head subjects, offshoots and gaps constitutes a parsimonious representation of the organization. Such a representation can be easily accessed and expressed. It can be further elaborated by highlighting those subjects in which members of the organization have been especially successful (e.g., publications in top journals or awards) or distinguished by a special feature (e.g., industrial use or inclusion in a teaching program). Multiple head subjects and offshoots, when they persist in the subject clusters of different organizations, may reveal tendencies in the development of the science that the classification has not yet taken into account.

A parsimonious lift of a subject cluster can be achieved by recursively building a parsimonious representation for each node of the ACM-CCS tree based on the parsimonious representations for its children. In this, we assume that any head subject is automatically present at each of the nodes it covers, unless they are gaps (as presented in Fig. 3). Our algorithm is set up as a recursive procedure over the tree, starting at the leaf nodes.
The procedure determines, at each node of the tree, sets of head gain and gap events, iteratively raising them to those of the parents under each of two different assumptions specifying the situation at the parental node. One assumption is that the head subject has been inherited at the parental node from its own parent; the other is that it has not been inherited but gained at the node itself. In the latter case the parental node is labeled as a head subject. Consider the parent-children system shown in Fig. 4, with each node assigned sets of gap and head gain events under the above two head subject inheritance assumptions. Let us denote the total penalty, to be minimized, under the inheritance and non-inheritance assumptions by p_i and p_n, respectively. A lifting result at a given node is defined by a pair of sets (H, G), representing the tree nodes at which events of head gains and gaps, respectively, have occurred in the subtree rooted at the node. We use (H_i, G_i) and (H_n, G_n) to denote lifting results under the inheritance and non-inheritance assumptions, respectively. The algorithm computes parsimonious representations for parental nodes according to the topology of the tree, proceeding from the leaves to the root, in a manner similar to that described in [14] for a mathematical problem in bioinformatics.
Fig. 4. Events in a parent-children system according to a parsimonious lift scenario.
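This bottom-up recursion, detailed below for crisp clusters, can be sketched in Python as follows (a sketch: the (name, children) tuple representation of the tree and the toy taxonomy are assumptions for illustration; the default penalties match those used in Section V):

```python
def lift(node, S, h=1.0, o=0.8, g=0.2):
    """Return ((H_i, G_i, p_i), (H_n, G_n, p_n)) for the subtree at `node`,
    i.e. head-gain sets, gap sets and penalties under the inheritance and
    non-inheritance assumptions. `node` is a (name, children) tuple."""
    name, children = node
    if not children:                       # leaf initialization
        if name in S:
            return ((set(), set(), 0.0), ({name}, set(), o))
        return ((set(), {name}, g), (set(), set(), 0.0))
    kids = [lift(c, S, h, o, g) for c in children]
    # (I) inheritance assumption: head lost at this node (a) vs. kept (b)
    lost = (set().union(*(k[1][0] for k in kids)),
            set().union(*(k[1][1] for k in kids)) | {name},
            sum(k[1][2] for k in kids) + g)
    kept = (set(),
            set().union(*(k[0][1] for k in kids)),
            sum(k[0][2] for k in kids))
    inh = lost if lost[2] < kept[2] else kept
    # (II) non-inheritance assumption: head gained here (a) vs. not (b)
    gained = (set().union(*(k[0][0] for k in kids)) | {name},
              set().union(*(k[0][1] for k in kids)),
              sum(k[0][2] for k in kids) + h)
    plain = (set().union(*(k[1][0] for k in kids)),
             set().union(*(k[1][1] for k in kids)),
             sum(k[1][2] for k in kids))
    non = gained if gained[2] < plain[2] else plain
    return (inh, non)

# Toy taxonomy: cluster {A.1, A.2, A.3} lifts to head subject A
tree = ("root", [("A", [("A.1", []), ("A.2", []), ("A.3", [])]),
                 ("B", [("B.1", [])])])
heads, gaps, penalty = lift(tree, {"A.1", "A.2", "A.3"})[1]
assert heads == {"A"} and gaps == set()
```

At the root, the non-inheritance result (H_n, G_n, p_n) is accepted, as in the description below.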
For the sake of simplicity, we present only the version of the algorithm for crisp clusters obtained by a defuzzification step. Given a crisp topic cluster S, and penalties h, o and g for being a head subject, offshoot and gap, respectively, the algorithm is initialized as follows. At each leaf l of the tree, either H_n = {l}, if l ∈ S, or G_i = {l}, otherwise; the other three sets are empty. The associated penalties are p_i = 0 and p_n = o if H_n is not empty, that is, if l ∈ S, and p_i = g, p_n = 0 otherwise. This is obviously a parsimonious arrangement at the leaf level. The recursive step applies to any node t whose children v ∈ V have already been assigned the two pairs of H and G sets (see Figure 4, in which V consists of three children): (H_i(v), G_i(v); H_n(v), G_n(v)), along with the associated penalties p_i(v) and p_n(v).

(I) To derive the pair H_i(t) and G_i(t) under the inheritance assumption, one of the following two cases is chosen, depending on the cost:
(a) The head subject has been lost at t, so that H_i(t) = ∪_{v∈V} H_n(v) and G_i(t) = ∪_{v∈V} G_n(v) ∪ {t}. (Note the different indices, i and n, in the latter expression.) The penalty in this case is p_i = Σ_{v∈V} p_n(v) + g; or
(b) The head subject has not been lost at t, so that H_i(t) = ∅ (under the assumption that no gain can happen after a loss) and G_i(t) = ∪_{v∈V} G_i(v), with p_i = Σ_{v∈V} p_i(v).
The case corresponding to the minimum of the two p_i values is then returned.

(II) To derive the pair H_n(t) and G_n(t) under the non-inheritance assumption, the one of the following two cases is chosen that minimizes the penalty p_n:
(a) The head subject has been gained at t, so that H_n(t) = ∪_{v∈V} H_i(v) ∪ {t} and G_n(t) = ∪_{v∈V} G_i(v), with p_n = Σ_{v∈V} p_i(v) + h; or
(b) The head subject has not been gained at t, so that H_n(t) = ∪_{v∈V} H_n(v) and G_n(t) = ∪_{v∈V} G_n(v), with p_n = Σ_{v∈V} p_n(v).

After all tree nodes t have been assigned their two pairs of sets, accept the H_n, G_n and p_n found at the root. This gives a full account of the events in the tree. This algorithm indeed leads to an optimal representation; its extension to a fuzzy cluster is achieved by using the cluster memberships in computing the penalty values.

V. AN APPLICATION TO A REAL WORLD CASE

Let us illustrate the approach using the data from a survey conducted at the Centre for Artificial Intelligence, Faculty of Science & Technology, Universidade Nova de Lisboa (CENTRIA-UNL). The survey involved 16 members of the academic staff of the Centre, who covered 46 topics of the third layer of the ACM-CCS. With the ADDI-FS algorithm applied to the 46 × 46 similarity matrix, two clusters were sequentially extracted, after which the residual matrix became negative definite (stopping condition 1). The cluster membership values, sorted in descending order, are given in Table II. The contributions of the two clusters total about 50%. It should be noted that the contributions reflect the tightness of the clusters rather than their relative importance. Cluster 1 is about pattern recognition and its applications to the physical sciences and engineering, including images and languages, with offshoots to general aspects of information systems.
In cluster 2, all major aspects of computational mathematics are covered, with an emphasis on reliability and testing, and with applications in the life sciences. Overall, these results are consistent with an informal assessment of the research conducted in the organization. Moreover, the sets of research topics chosen by individual members in the ESSA survey follow the cluster structure rather closely, falling mostly within one of the two clusters. Figure 5 shows the representation of CENTRIA's cluster 2 in the ACM-CCS taxonomy with penalties h = 1, o = 0.8 and g = 0.2. Decreasing the gap penalty g from 0.2 to 0.1 leads to the different parsimonious generalization shown in Fig. 6. In this case we obtain a more general scenario by further lifting some of the head subjects: I.2.0, I.2.3 and I.2.8 to I.2 Artificial Intelligence, and G.1.0, G.1.6, G.1.7, G.2.2 and G.3 to G Mathematics of Computing, with the consequent generation of 10 and 16 gaps, respectively, because they are much
TABLE II
ADDI-FS RESULTS FOR THE RESEARCH-TOPIC SIMILARITY DATA OF THE RESEARCH CENTER CENTRIA-UNL

Cluster 1: Eigenvalue 46.50, Contribution 35.2%, Intensity 5.57, Weight 31.04
Membership  Code    Topic
0.69911     I.5.3   Clustering
0.3512      I.5.4   Applications in I.5 PATTERN RECOGNITION
0.27438     J.2     PHYSICAL SCIENCES AND ENGINEERING (Applications in)
0.1992      I.4.9   Applications in I.4 IMAGE PROCESSING AND COMPUTER VISION
0.1992      I.4.6   Segmentation
0.19721     I.2.6   Learning
0.17478     H.5.2   User Interfaces
0.17478     I.6.4   Model Validation and Analysis in I.6 SIMULATION AND MODELING
0.16689     I.2.7   Natural Language Processing
0.16689     I.5.1   Models in I.5 PATTERN RECOGNITION
0.14453     I.5.2   Design Methodology (Classifiers)
0.13646     H.5.0   General in H.5 INFORMATION INTERFACES AND PRESENTATION
0.13646     H.0     GENERAL in H. Information Systems
0.13646     H.4.0   General in H.4 INFORMATION SYSTEMS APPLICATIONS
0.02867     I.2.11  Distributed Artificial Intelligence

Cluster 2: Eigenvalue 32.90, Contribution 15.2%, Intensity 4.52, Weight 20.41
Membership  Code    Topic
0.46756     J.3     LIFE AND MEDICAL SCIENCES (Applications in)
0.40619     I.2.8   Problem Solving, Control Methods, and Search
0.34435     F.2.1   Numerical Algorithms and Problems
0.32681     F.4.1   Mathematical Logic
0.30067     G.1.6   Optimization
0.25967     D.3.3   Language Constructs and Features
0.23748     G.2.2   Graph Theory
0.18722     G.3     PROBABILITY AND STATISTICS
0.17359     B.2.3   Reliability, Testing, and Fault-Tolerance
0.17359     B.7.3   Reliability and Testing
0.17203     I.2.0   General in I.2 ARTIFICIAL INTELLIGENCE
0.1537      G.1.0   General in G.1 NUMERICAL ANALYSIS
0.11827     I.2.3   Deduction and Theorem Proving
0.10195     G.1.7   Ordinary Differential Equations
0.06175     K.2     HISTORY OF COMPUTING
0.00726     D.1.6   Logic Programming

[Tree rendering omitted. Legend: Head Subject (penalty 1), Leaf Head subject (penalty 0.8), Gap (penalty 0.1); "..." marks parts of the acmccs98 tree not covered.]
Fig. 6. Mapping of CENTRIA cluster 2 onto the ACM-CCS tree with penalties h = 1, o = 0.8 and g = 0.1.

[Tree rendering omitted. Legend: Head Subject (penalty 1), Leaf Head subject (penalty 0.8), Gap (penalty 0.2); "..." marks parts of the acmccs98 tree not covered.]
Fig. 5. Mapping of CENTRIA cluster 2 onto the ACM-CCS tree with penalties h = 1, o = 0.8 and g = 0.2.
cheaper now. Notice that this last scenario would not emerge with a gap cost of 0.2, which would lead to higher costs in each of the subtrees. The latter mapping seems to fit the ACM-CCS tree better; however, the fine tuning of the penalty weights is an aspect still to be addressed.
VI. CONCLUSION
We have described a method for knowledge generalization that employs a taxonomy tree. The method constructs fuzzy profiles of the entities constituting the structure under consideration and then generalizes them in two steps. These steps are:
(i) fuzzy clustering of the research topics according to their thematic similarities, ignoring the topology of the taxonomy, and (ii) lifting the clusters mapped to the taxonomy to higher-ranked categories in the tree. These generalization steps thus cover both sides of the representation process: the empirical, related to the structure under consideration, and the conceptual, related to the taxonomy hierarchy.
This work is part of the research project Computational Ontology Profiling of Scientific Research Organization (COPSRO), whose main goal is to develop a method for representing a Computer Science organization, such as a university department, over the ACM-CCS classification tree. Our approach involves the following steps:
1) Surveying the members of the organization regarding the ACM-CCS topics they work on, in order to derive their research profiles; this may be supplemented with an indication of the degree of success achieved (publication in a good journal, a grant award, etc.);
2) Deriving thematic similarities between ACM-CCS topics according to the research profiles resulting from the survey, and then clustering the topics;
3) Mapping the fuzzy clusters to the ACM-CCS taxonomy and lifting them to higher ranks in a parsimonious way, by minimizing the penalty for the emerging head subjects, gaps and offshoots;
4) Aggregating the results from different clusters and, subsequently, from different organizations over the taxonomy;
5) Interpreting the results and drawing conclusions.
In principle, the approach can be extended to other areas of science or engineering, provided that the area has been systematized in the form of a comprehensive concept tree. Potentially, this approach could lead to a useful instrument for the comprehensive visual representation of developments in any field of organized human activity.
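As an illustration of step 3), the crisp-cluster recursion described earlier (two scenarios per node, "head inherited" and "head not inherited", each carrying its head-subject set, gap set and penalty) can be sketched in Python as follows. The tree encoding and all names here are ours, not taken from the original software.

```python
def lift(children, root, S, h=1.0, o=0.8, g=0.2):
    """Parsimoniously lift a crisp topic cluster S to higher taxonomy
    ranks.  children maps each node to its list of children ([] for
    leaves); returns (head subjects, gaps, total penalty) at the root."""
    def visit(t):
        kids = children[t]
        if not kids:  # leaf initialization
            if t in S:
                # inherit: covered for free; non-inherit: leaf offshoot
                return (set(), set(), 0.0), ({t}, set(), o)
            # inherit: leaf becomes a gap; non-inherit: nothing happens
            return (set(), {t}, g), (set(), set(), 0.0)
        results = [visit(v) for v in kids]
        inh = [r[0] for r in results]   # (H_i, G_i, p_i) per child
        non = [r[1] for r in results]   # (H_n, G_n, p_n) per child
        # (I) a head is inherited from above: it is either lost at t
        #     (t becomes a gap, children continue non-inherited) or kept.
        lost = (set().union(*(Hn for Hn, _, _ in non)),
                set().union(*(Gn for _, Gn, _ in non)) | {t},
                sum(pn for _, _, pn in non) + g)
        kept = (set(),
                set().union(*(Gi for _, Gi, _ in inh)),
                sum(pi for _, _, pi in inh))
        best_i = lost if lost[2] < kept[2] else kept
        # (II) no head inherited: it is either gained at t (pay h,
        #      children inherit it) or not gained at all.
        gained = ({t} | set().union(*(Hi for Hi, _, _ in inh)),
                  set().union(*(Gi for _, Gi, _ in inh)),
                  sum(pi for _, _, pi in inh) + h)
        absent = (set().union(*(Hn for Hn, _, _ in non)),
                  set().union(*(Gn for _, Gn, _ in non)),
                  sum(pn for _, _, pn in non))
        best_n = gained if gained[2] < absent[2] else absent
        return best_i, best_n
    _, (heads, gaps, penalty) = visit(root)   # root cannot inherit a head
    return heads, gaps, penalty
```

For instance, with penalties h = 1, o = 0.8 and g = 0.2, a cluster occupying both leaves of one two-leaf subtree is lifted to that subtree's root (one head subject, penalty 1.0) rather than kept as two offshoots (penalty 1.6).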
ACKNOWLEDGMENTS
The authors are grateful to the CENTRIA-UNL members who participated in the survey. Igor Guerreiro is acknowledged for developing the software for the ESSA tool. Rui Felizardo is acknowledged for developing the software for the lifting algorithm, whose interface is shown in Figures 5 and 6. This work has been supported by grant PTDC/EIA/69988/2006 from the Portuguese Foundation for Science & Technology. The support of the individual research project 09-01-0071 "Analysis of relations between spectral and approximation clustering" to BM by the "Science Foundation" Programme of the State University – Higher School of Economics, Moscow, RF, is also acknowledged.
REFERENCES
[1] ACM Computing Classification System, 1998, http://www.acm.org/about/class/1998. Cited 9 Sep 2008.
[2] J. Bezdek, J. Keller, R. Krishnapuram, T. Pal, Fuzzy Models and Algorithms for Pattern Recognition and Image Processing, Kluwer Academic Publishers, 1999.
[3] M. Feather, T. Menzies, J. Connelly, "Matching software practitioner needs to researcher activities", Proc. of the 10th Asia-Pacific Software Engineering Conference (APSEC'03), IEEE, p. 6, 2003, doi:10.1109/APSEC.2003.1254353.
[4] D. Gašević, M. Hatala, "Ontology mappings to improve learning resource search", British Journal of Educational Technology, 37(3), pp. 375-389, 2006.
[5] The Gene Ontology Consortium, "Gene Ontology: tool for the unification of biology", Nature Genetics, 25, pp. 25-29, 2000.
[6] R. Krishnapuram, A. Joshi, O. Nasraoui, L. Yi, "Low-complexity fuzzy relational clustering algorithms for web mining", IEEE Trans. on Fuzzy Systems, 9(4), pp. 595-607, 2001.
[7] N. Labroche, "Learning web users profiles with relational clustering algorithms", Workshop on Intelligent Techniques for Web Personalization, AAAI Conference, 2007.
[8] J. Liu, W. Wang, J. Yang, "Gene ontology friendly biclustering of expression profiles", Proc. of the IEEE Computational Systems Bioinformatics Conference, IEEE, pp. 436-447, 2004, doi:10.1109/CSB.2004.1332456.
[9] U. von Luxburg, "A tutorial on spectral clustering", Statistics and Computing, 17, pp. 395-416, 2007.
[10] S. Middleton, N. Shadbolt, D. De Roure, "Ontological user profiling in recommender systems", ACM Trans. on Information Systems, 22(1), pp. 54-88, 2004, doi:10.1145/963770.963773.
[11] S. Miralaei, A. Ghorbani, "Category-based similarity algorithm for semantic similarity in multi-agent information sharing systems", IEEE/WIC/ACM Int. Conf. on Intelligent Agent Technology, pp. 242-245, 2005, doi:10.1109/IAT.2005.50.
[12] B. Mirkin, "Additive clustering and qualitative factor analysis methods for similarity matrices", Journal of Classification, 4(1), pp. 7-31, 1987, doi:10.1007/BF01890073.
[13] B. Mirkin, Clustering for Data Mining: A Data Recovery Approach, Chapman & Hall/CRC Press, 276 p., 2005.
[14] B. Mirkin, T. Fenner, M. Galperin, E. Koonin, "Algorithms for computing parsimonious evolutionary scenarios for genome evolution, the last universal common ancestor and dominance of horizontal gene transfer in the evolution of prokaryotes", BMC Evolutionary Biology, 3:2, 2003, doi:10.1186/1471-2148-3-2.
[15] B. Mirkin, S. Nascimento, "Analysis of Community Structure, Affinity Data and Research Activities using Additive Fuzzy Spectral Clustering", Technical Report 6, School of Computer Science, Birkbeck University of London, 2009.
[16] B. Mirkin, S. Nascimento, L.M. Pereira, "Cluster-lift method for mapping research activities over a concept tree", in: J. Koronacki, S.T. Wierzchon, Z.W. Ras, J. Kacprzyk (eds.), Recent Advances in Machine Learning II, Computational Intelligence Series Vol. 263, Springer, pp. 245-258, 2010.
[17] S. Nascimento, B. Mirkin, F. Moura-Pires, "Modeling proportional membership in fuzzy clustering", IEEE Transactions on Fuzzy Systems, 11(2), pp. 173-186, 2003, doi:10.1109/TFUZZ.2003.809889.
[18] O. Nasraoui, "World Wide Web Personalization", in: J. Wang (ed.), Encyclopedia of Data Mining and Data Warehousing, Idea Group, 2005.
[19] A. Ng, M. Jordan, Y. Weiss, "On spectral clustering: analysis and an algorithm", in: T.G. Ditterich, S. Becker, Z. Ghahramani (eds.), Advances in Neural Information Processing Systems, 14, MIT Press, Cambridge, MA, pp. 849-856, 2002.
[20] A. Pigeau, G. Raschia, M. Gelgon, N. Mouaddib, R. Saint-Paul, "A fuzzy linguistic summarization technique for TV recommender systems", The 12th IEEE International Conference on Fuzzy Systems, vol. 1, pp. 743-748, 2003.
[21] RAE2008: Research Assessment Exercise, 2008, http://www.rae.ac.uk/ (Cited 9 Sep 2008).
[22] M. Sato, Y. Sato, L.C. Jain, Fuzzy Clustering Models and Applications, Physica-Verlag, Heidelberg, 1997.
[23] A. Skarman, L. Jiang, H. Hornshoj, B. Buitenhuis, J. Hedegaard, L. Conley, P. Sorensen, "Gene set analysis methods applied to chicken microarray expression data", BMC Proceedings, 3(Suppl 4):S8, 2009, doi:10.1186/1753-6561-3-S4-S8.
[24] R.N. Shepard, P. Arabie, "Additive clustering: representation of similarities as combinations of overlapping properties", Psychological Review, 86, pp. 87-123, 1979.
[25] J. Shi, J. Malik, "Normalized cuts and image segmentation", IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), pp. 888-905, 2000.
[26] C. Thorne, J. Zhu, V. Uren, "Extracting domain ontologies with CORDER", Tech. Report kmi-05-14, Open University, pp. 1-15, 2005.
[27] The United Nations Millennium Project Task Force, http://www.cid.harward.edu/cidtech (Cited 1 Sep 2006).
[28] S.M. Weiss, N. Indurkhya, T. Zhang, F.J. Damerau, Text Mining: Predictive Methods for Analyzing Unstructured Information, Springer Verlag, 237 p., 2005.
[29] L. Yang, M. Ball, V. Bhavsar, H. Boley, "Weighted partonomy-taxonomy trees with local similarity measures for semantic buyer-seller match-making", Journal of Business and Technology, Atlantic Academic Press, 1(1), pp. 42-52, 2005.
[30] D. Xu, H. Wang, K. Su, "Intelligent Student Profiling with Fuzzy Models", Proceedings of the 35th Annual Hawaii International Conference on System Sciences (HICSS'02), vol. 3, pp. 81-88, 2002.