Fuzzy Additive-Spectral Clustering Representing Research Activities over ACM-CCS Taxonomy Conclusion
Data Recovery Clustering for Similarity Data
Susana Nascimento (1), joint work with Boris Mirkin (2)
(1) Department of Computer Science and Centre for Artificial Intelligence (CENTRIA), FCT-Universidade Nova de Lisboa, Portugal
(2) Birkbeck University of London, UK, and Higher School of Economics, Moscow, RF
International Workshop “Clusters, orders, trees: Methods and applications” in honor of Professor Boris Mirkin, Moscow, December 12th-13th 2012
Outline
1. Fuzzy Additive Clustering using a Spectral Approach (FADDIS)
2. Representing Research Activities over ACM-CCS Taxonomy
3. Conclusion
Data Recovery Clustering for Similarity Data
– Crisp clusters: Additive clustering (Shepard and Arabie, 1979; Mirkin, 1976 (in Russian), 1987, 2005).
– Fuzzy clusters: Additive spectral fuzzy clustering with FADDIS (Mirkin and Nascimento, 2012).
Fuzzy Additive Clustering Model
Fuzzy Cluster Similarity
Observed pairwise similarity $B = (b_{ij})$, $i, j \in I$
Find fuzzy cluster $(u, \mu)$:
– Membership vector $u = (u_i)$, s.t. $0 \le u_i \le 1$ for all $i \in I$
– Intensity $\mu > 0$
Fuzzy cluster similarity $A = \mu^2 u u^T$, with entries $a_{ij} = (\mu u_i)(\mu u_j)$
– the membership expresses the propensity of an entity to contribute to the similarity index when coupled with another entity;
– the product $\mu^2 u_i u_j$ expresses the contribution of the cluster to the similarity $a_{ij}$ between entities $i$ and $j$.
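To make the rank-one structure of a single fuzzy cluster concrete, here is a minimal NumPy sketch. The function name `cluster_similarity` and the sample membership values are illustrative assumptions, not part of the slides.

```python
import numpy as np

def cluster_similarity(u, mu):
    """Return A = mu^2 * u u^T for membership vector u and intensity mu."""
    u = np.asarray(u, dtype=float)
    if np.any(u < 0) or np.any(u > 1):
        raise ValueError("memberships must lie in [0, 1]")
    if mu <= 0:
        raise ValueError("intensity must be positive")
    return (mu ** 2) * np.outer(u, u)

# Hypothetical three-entity cluster: full member, half member, non-member.
u = [1.0, 0.5, 0.0]
A = cluster_similarity(u, mu=2.0)
```

An entity with zero membership contributes nothing to any pairwise similarity, so the corresponding row and column of `A` are zero.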
Additive Fuzzy Clustering Model
Observed similarity B summarizes:
– K fuzzy clusters (K unknown)
– Residual similarity E
Additive Fuzzy Model: $B = A_1 + A_2 + \cdots + A_K + E$, with E to be minimized over unknown clusters
Method: Sequential Extraction of Clusters
Least-squares for finding one cluster at a time. Clustering criterion:
$$\min_{u,\xi} \sum_{i,j \in I} (b_{ij} - \xi u_i u_j)^2 \qquad (1)$$
with respect to unknown weight $\xi > 0$ ($\xi = \mu^2$) and fuzzy membership vector $u = (u_i)$, given residual similarity values $b_{ij}$.
Fitting Fuzzy Additive Clustering Model with Spectral Method
1. First-order optimality condition: minimizing criterion (1) over ξ for an arbitrary fixed u gives
$$\xi = \frac{u^T B u}{(u^T u)^2} \qquad (2)$$
which is non-negative if matrix B is positive semidefinite.
2. With ξ given by (2), minimizing clustering criterion (1) is equivalent to maximizing the square of the Rayleigh quotient
$$\max_{u \neq 0} \frac{u^T B u}{u^T u}$$
The maximum value of the Rayleigh quotient is the maximum eigenvalue of matrix B, which is reached at the corresponding eigenvector (Rayleigh-Ritz theorem).
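The two steps above can be checked numerically. The sketch below is illustrative only: the helper names `optimal_weight` and `criterion` and the random test matrix are assumptions. It confirms that, at the optimal ξ, criterion (1) equals the data scatter minus the squared Rayleigh quotient, and that the quotient never exceeds the largest eigenvalue.

```python
import numpy as np

def optimal_weight(B, u):
    """Equation (2): xi = u^T B u / (u^T u)^2 for a fixed nonzero u."""
    u = np.asarray(u, dtype=float)
    return float(u @ B @ u) / float(u @ u) ** 2

def criterion(B, u, xi):
    """Criterion (1): sum over i, j of (b_ij - xi * u_i * u_j)^2."""
    u = np.asarray(u, dtype=float)
    return float(np.sum((B - xi * np.outer(u, u)) ** 2))

# Hypothetical symmetric positive semidefinite similarity matrix.
rng = np.random.default_rng(0)
X = rng.random((5, 3))
B = X @ X.T

u = rng.random(5)
xi = optimal_weight(B, u)
rayleigh = float(u @ B @ u) / float(u @ u)

# At the optimal xi, criterion (1) equals sum(b_ij^2) - rayleigh^2.
lhs = criterion(B, u, xi)
rhs = float(np.sum(B ** 2)) - rayleigh ** 2

# The Rayleigh quotient is bounded above by the largest eigenvalue of B.
eigvals, eigvecs = np.linalg.eigh(B)  # eigenvalues in ascending order
lam_max, z = eigvals[-1], eigvecs[:, -1]
```

Note that the eigenvector `z` solves the unconstrained problem; the membership bounds $0 \le u_i \le 1$ still have to be enforced separately.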
Spectral Clustering approach
Λ(B) = [λ, z]
– λ: maximum eigenvalue of B
– z: corresponding normed eigenvector of B
Projection $P(z) = u / \|u\|$, with $u = (u_i)$:
$$u_i = \begin{cases} 0, & \text{if } z_i \le 0; \\ z_i, & \text{if } 0 < z_i < 1; \\ 1, & \text{if } z_i \ge 1. \end{cases}$$
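A minimal sketch of the projection step, reading it as componentwise clipping to [0, 1] followed by rescaling to unit Euclidean norm (this reading of the slide's formula, and the function name `project`, are assumptions):

```python
import numpy as np

def project(z):
    """Clip z to [0, 1] componentwise, then rescale to unit Euclidean norm."""
    u = np.clip(np.asarray(z, dtype=float), 0.0, 1.0)
    norm = np.linalg.norm(u)
    if norm == 0.0:
        raise ValueError("z has no positive component to project")
    return u / norm

# Hypothetical unit-norm eigenvector with one negative component.
z = np.array([0.8, -0.6, 0.0])
u = project(z)
```

Negative components are zeroed, so the entity with $z_i = -0.6$ gets no membership in the resulting cluster.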
Stop Conditions of Sequential Extraction of Fuzzy Clusters
S1: The eigenvalue for the spectral fuzzy cluster is negative (the model requires ξ > 0).
S2: The contribution $\xi_k^2$ of the current cluster has reached a pre-specified proportion.
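The sequential extraction with its stop conditions can be sketched as follows. This is a simplified illustration, not the authors' implementation. Several points are assumptions: the names `extract_cluster`, `faddis_sketch`, `min_share` and `max_clusters`; trying both signs of the eigenvector; updating the residual as B − ξuuᵀ; and reading S2 as "stop when the cluster's share ξ²/Σb² of the data scatter falls below a threshold".

```python
import numpy as np

def extract_cluster(B):
    """One spectral step: top eigenpair of B, then clip-and-normalise.

    Returns (xi, u), or None when no admissible cluster exists (S1).
    """
    eigvals, eigvecs = np.linalg.eigh(B)
    lam, z = eigvals[-1], eigvecs[:, -1]
    if lam <= 0:
        return None
    best = None
    for cand in (z, -z):  # eigenvector sign is arbitrary (assumption: try both)
        u = np.clip(cand, 0.0, 1.0)
        norm = np.linalg.norm(u)
        if norm == 0:
            continue
        u = u / norm
        xi = float(u @ B @ u)  # equation (2) with u^T u = 1
        if xi > 0 and (best is None or xi > best[0]):
            best = (xi, u)
    return best

def faddis_sketch(B, min_share=0.05, max_clusters=10):
    """Extract fuzzy clusters one at a time from the residual similarity."""
    B = np.asarray(B, dtype=float)
    scatter = float(np.sum(B ** 2))
    residual = B.copy()
    clusters = []
    for _ in range(max_clusters):  # hypothetical safety cap
        found = extract_cluster(residual)
        if found is None:
            break  # S1: no positive eigenvalue / weight
        xi, u = found
        share = xi ** 2 / scatter
        if share < min_share:
            break  # S2, as read in this sketch
        clusters.append({"xi": xi, "mu": np.sqrt(xi), "u": u, "share": share})
        residual = residual - xi * np.outer(u, u)
    return clusters

# Hypothetical similarity matrix with two disjoint groups of entities.
B = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5],
              [0.0, 0.0, 0.5, 0.5]])
clusters = faddis_sketch(B)
```

On this toy matrix the loop recovers one cluster per block and then stops, because the residual carries no further structure.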
Representing Research Activities over ACM-CCS Taxonomy
Computational Ontology Profiling of Scientific Research Organizations
Application of FADDIS to the problem of representing a Department’s research activities over the ACM Computing Classification System (ACM-CCS taxonomy, 1998):
– E-survey tool over ACM-CCS topics
– Similarity measure between ACM-CCS topics
– Additive, crisp or fuzzy, clustering
– Lifting of thematic clusters in ACM-CCS taxonomy for interpretation in its framework
Project COPSRO, Fundação para Ciência e Tecnologia, Ministry of Science and Technology (grant PTDC/EIA/69988/2006), Portugal
E-Survey Tool of Scientific Activities (ESSA)
– Selection of a set of topics among the leaf nodes of the ACM-CCS tree (3rd layer);
– Assignment to each topic of a percentage expressing the proportion of the topic in the respondent’s total research activity (past four years).
Generic ESSA output: Fuzzy membership
Fuzzy memberships that characterize the research profile of a member of the Department.
Similarity between ACM-CCS Topics
Similarity between topics $i, j = 1, 2, \ldots, |I|$:
$$b_{ij} = \sum_{v=1}^{|V|} \frac{n_v}{n_{\max}} f_{iv} f_{jv}$$
with $f_{iv}$ the membership assigned to topic $i$ by respondent $v$, $n_v$ the number of topics taken by respondent $v$, and $n_{\max}$ the maximum of $n_v$.
Similarity matrix $B = (b_{ij})$ is positive semidefinite.
This similarity measure is proportional to the number and importance of research activities in both topics.
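The similarity measure can be written as a weighted Gram matrix, which also makes its positive semidefiniteness easy to verify numerically. In the sketch below the function name `topic_similarity` and the membership data are hypothetical; rows of `F` are respondents and columns are topics.

```python
import numpy as np

def topic_similarity(F):
    """b_ij = sum over v of (n_v / n_max) * f_iv * f_jv.

    F[v, i] is the membership respondent v assigns to topic i;
    n_v is taken here as the number of nonzero entries in row v.
    """
    F = np.asarray(F, dtype=float)
    n = np.count_nonzero(F, axis=1)
    w = n / n.max()
    return F.T @ (w[:, None] * F)

# Hypothetical survey: three respondents, three topics.
F = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.3, 0.7],
              [0.2, 0.3, 0.5]])
B = topic_similarity(F)
smallest_eigenvalue = np.linalg.eigvalsh(B).min()
```

Because every weight $n_v / n_{\max}$ is non-negative, `B` is a sum of non-negatively scaled outer products, so its smallest eigenvalue is zero or positive up to rounding.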
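Similarity between ACM-CCS Topics: example
Table: Membership values for six ACM-CCS subjects A–F assigned by four individuals (nonzero values listed in column order; $n_v$ is the number of subjects chosen).

Respondent   Nonzero memberships       n_v
v1           0.6 0.4                   2
v2           0.2 0.3 0.5               3
v3           0.2 0.4 0.4               3
v4           0.2 0.2 0.2 0.2 0.2       5

– Similarity
$$a(A, B) = \frac{2}{5} \times 0.6 \times 0.4 + \frac{3}{5} \times 0 \times 0 + \frac{3}{5} \times 0 \times 0.2 + \frac{5}{5} \times 0.2 \times 0.2 = 0.136$$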
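The arithmetic of this example can be reproduced in a few lines. The memberships for subjects A and B are those appearing in the four terms of the sum; the variable names are illustrative.

```python
# Memberships of subjects A and B for respondents v1..v4, taken from the sum.
f_A = [0.6, 0.0, 0.0, 0.2]
f_B = [0.4, 0.0, 0.2, 0.2]
n_v = [2, 3, 3, 5]          # number of subjects chosen by each respondent
n_max = max(n_v)

sim_AB = sum(n / n_max * a * b for n, a, b in zip(n_v, f_A, f_B))
print(round(sim_AB, 3))     # 0.136
```

Only v1 and v4 contribute, since they are the only respondents who selected both subjects.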
Analysis of ESSA Surveys Data
ESSA 2009 Surveys

Organization   N. of Participating Respondents   N. of 3rd Layer ACM-CCS Covered Topics
CENTRIA-UNL    16                                46/318
CSIS-BKUL      22                                54/318
DEI-ISEP       30                                65/318
Analysis of CENTRIA ESSA Survey’09
FADDIS clusters

Cluster   Contribution (%)   λ1     Weight (ξ)   Intensity (μ)
C1        35.2               46.5   31.04        5.57
C2        15.2               32.9   20.41        4.52
FADDIS Stop condition S1: eigenvalues for third cluster are negative.
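As a consistency check on the reported values, the relation $\xi = \mu^2$ from the model definition links the weight and intensity columns (values rounded as reported):

```latex
\mu_1 = \sqrt{\xi_1} = \sqrt{31.04} \approx 5.57, \qquad
\mu_2 = \sqrt{\xi_2} = \sqrt{20.41} \approx 4.52
```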
FADDIS Thematic Cluster 1 at CENTRIA

Membership value   Code    ACM-CCS Topic
0.69911            I.5.3   Clustering
0.3512             I.5.4   Applications in I.5 PATTERN RECOGNITION
0.27438            J.2     PHYSICAL SCIENCES AND ENGINEERING (Applications in)
0.1992             I.4.9   Applications in I.4 IMAGE PROC. AND COMPUTER VISION
0.1992             I.4.6   Segmentation
0.19721            H.5.1   Multimedia Information Systems
0.17478            H.5.2   User Interfaces
0.17478            H.5.3   Group and Organization Interfaces
0.16689            H.1.1   Systems and Information
0.16689            I.5.1   Models in I.5 PATTERN RECOGNITION
0.16513            H.1.2   User/Machine Systems
0.14453            I.5.2   Design Methodology (Classifiers)
0.13646            H.5.0   General in H.5 INF. INTERFACES AND PRESENTATION
0.13646            H.0     GENERAL in H. Information Systems
Assessment of CENTRIA Cluster 1
5 members’ efforts overlap the cluster: 4 fall within the cluster; 1 overlaps other clusters: 50%.
FADDIS Thematic Cluster 2 at CENTRIA

Membership value   Code    ACM-CCS Topic
0.46756            J.3     LIFE AND MEDICAL SCIENCES (Applications in)
0.40619            I.2.8   Problem Solving, Control Methods, and Search
0.34435            F.2.1   Numerical Algorithms and Problems
0.32681            F.4.1   Mathematical Logic
0.30067            G.1.6   Optimization
0.25967            D.3.3   Language Constructs and Features
0.23748            G.2.2   Graph Theory
0.18722            G.3     PROBABILITY AND STATISTICS
0.17359            B.2.3   Reliability, Testing, and Fault-Tolerance
0.17359            B.7.3   Reliability and Testing
0.17203            I.2.0   General in I.2 ARTIFICIAL INTELLIGENCE
0.1537             G.1.0   General in G.1 NUMERICAL ANALYSIS
0.11827            I.2.3   Deduction and Theorem Proving
0.10195            G.1.7   Ordinary Differential Equations
Assessment of CENTRIA Cluster 2
8 members’ efforts overlap the cluster: 7 fall within the cluster; 1 overlaps other clusters: 10%.
Summary of FADDIS Results over ESSA Surveys
– The clusters found in the three research organizations show a clear meaning.
– These clusters are consistent with the informal assessment of the research conducted in those organizations.
– The sets of research topics chosen by individual members in the ESSA survey follow the found cluster structure rather closely.
– The number of clusters found in the Research Center is smaller than in the University departments: 2 versus 4 and 4.
Conclusion
Conclusion (1)
Two data recovery models for fuzzy clustering
Both have competitive fuzzy clustering algorithms:
– FCPM vs FCM
– FADDIS vs several Relational Fuzzy Clustering algorithms (NERFCM (Hathaway and Bezdek, 1994), FMFCM (Brouwer, 2009))
Some model-based features:
– Underlying type in FCPM
– FADDIS stop conditions that provide an indicator of the number of clusters
Conclusion (2)
Real Applications
– Mental disorders: keeping prototypes “extremal”/archetypal
– Clustering research topics according to the working of a research department, and representing its activities over a taxonomy
– Consistent results over diverse real data of University departments
Future Work
– Data modeling and experiments to analyse Cluster Structure Recovery and other specifics of the models
– Overcoming shortcomings that follow from the sequential nature of FADDIS
– Exploring real-world applications
Some References
1. B. Mirkin and S. Nascimento (2012). Additive Spectral Method for Fuzzy Cluster Analysis of Similarity Data Including Community Structure and Affinity Matrices. Information Sciences, 183(1), pp. 16-34. Elsevier.
2. S. Nascimento (2005). Fuzzy Clustering via Proportional Membership Model. Vol. 119 of Frontiers of Artificial Intelligence and Applications, IOS Press, 200 pp.
3. S. Nascimento, B. Mirkin, and F. Moura Pires (2003). Modeling Proportional Membership in Fuzzy Clustering. IEEE Transactions on Fuzzy Systems (IEEE-TFS), 11(2), pp. 173-186.