Data Recovery Clustering for Similarity Data

Fuzzy Additive-Spectral Clustering Representing Research Activities over ACM-CCS Taxonomy Conclusion

Susana Nascimento¹ (joint work with Boris Mirkin²)

¹ Department of Computer Science and Centre for Artificial Intelligence (CENTRIA), FCT-Universidade Nova de Lisboa, Portugal
² Birkbeck University of London, UK, and Higher School of Economics, Moscow, RF

International Workshop “Clusters, orders, trees: Methods and applications” in honor of Professor Boris Mirkin Moscow, December 12th-13th 2012


Outline

1. Fuzzy Additive Clustering using a Spectral Approach (FADDIS)
2. Representing Research Activities over ACM-CCS Taxonomy
3. Conclusion


Data Recovery Clustering for Similarity Data

– Crisp clusters: additive clustering (Shepard and Arabie, 1979; Mirkin, 1976 (in Russian), 1987, 2005).
– Fuzzy clusters: additive spectral fuzzy clustering with FADDIS (Mirkin and Nascimento, 2012).


Fuzzy Additive Clustering Model

Fuzzy Cluster Similarity

Observed pairwise similarity: $B = (b_{ij})$, $i, j \in I$

Find a fuzzy cluster $(u, \mu)$:
– Membership vector $u = (u_i)$, s.t. $0 \le u_i \le 1$ for all $i \in I$
– Intensity $\mu > 0$

Fuzzy cluster similarity: $A = \mu^2 u u^T$, i.e. $a_{ij} = (\mu u_i)(\mu u_j)$
– the membership expresses the propensity of an entity to contribute to the similarity index when coupled with another entity;
– the product $\mu^2 u_i u_j$ expresses the contribution of the cluster to the similarity $a_{ij}$ between entities $i$ and $j$.
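As a small illustration of the one-cluster similarity model, the sketch below builds the matrix with entries (μ u_i)(μ u_j) from a membership vector and an intensity. The function name `cluster_similarity` and the three-entity example values are hypothetical and chosen only for illustration.

```python
import numpy as np

def cluster_similarity(u, mu):
    """Return A = mu^2 * u u^T for membership vector u and intensity mu."""
    u = np.asarray(u, dtype=float)
    if np.any(u < 0) or np.any(u > 1):
        raise ValueError("memberships must lie in [0, 1]")
    if mu <= 0:
        raise ValueError("intensity must be positive")
    return (mu ** 2) * np.outer(u, u)

# Hypothetical cluster over three entities (illustrative values only)
u = [1.0, 0.5, 0.0]
mu = 2.0
A = cluster_similarity(u, mu)
print(A)
```

An entity with zero membership contributes nothing to any pairwise similarity, which is why the last row and column of `A` vanish in this example.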


Additive Fuzzy Clustering Model

Observed similarity $B$ summarizes:
– $K$ fuzzy clusters ($K$ unknown)
– Residual similarity $E$

Additive Fuzzy Model

$B = A_1 + A_2 + \cdots + A_K + E$

$E$ to be minimized over the unknown clusters.


Method: Sequential Extraction of Clusters

Least squares for finding one cluster at a time.

Clustering criterion:

$\min_{u,\xi} \sum_{i,j \in I} (b_{ij} - \xi u_i u_j)^2 \qquad (1)$

with respect to the unknown weight $\xi > 0$ ($\xi = \mu^2$) and the fuzzy membership vector $u = (u_i)$, given the residual similarity values $b_{ij}$.
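To make criterion (1) concrete, here is a minimal sketch that evaluates the sum of squared residuals for a given weight and membership vector. The function name `criterion`, the 3×3 matrix and the candidate cluster are hypothetical and not taken from the talk.

```python
import numpy as np

def criterion(B, u, xi):
    """Criterion (1): sum over i, j of (b_ij - xi * u_i * u_j)^2."""
    B = np.asarray(B, dtype=float)
    u = np.asarray(u, dtype=float)
    return float(np.sum((B - xi * np.outer(u, u)) ** 2))

# Hypothetical similarity matrix and candidate membership vector
B = np.array([[4.0, 2.0, 0.0],
              [2.0, 1.0, 0.0],
              [0.0, 0.0, 0.5]])
u = np.array([1.0, 0.5, 0.0])

fit_good = criterion(B, u, xi=4.0)  # weight reproduces the top-left block
fit_poor = criterion(B, u, xi=1.0)  # same memberships, weight too small
print(fit_good, fit_poor)
```

With the well-chosen weight only the entry not covered by the cluster is left unexplained; the same memberships with a badly chosen weight fit far worse.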


Fitting Fuzzy Additive Clustering Model with Spectral Method

1. The first-order optimality condition for criterion (1) gives the optimal $\xi$ for arbitrary $u$:

$\xi = \frac{u^T B u}{(u^T u)^2} \qquad (2)$

which is non-negative if matrix $B$ is positive semidefinite.

2. With $\xi$ given by (2), minimizing criterion (1) is equivalent to maximizing the square of the Rayleigh quotient:

$\max_{u \neq 0} \frac{u^T B u}{u^T u}$

The maximum value of the Rayleigh quotient is the maximum eigenvalue of matrix $B$, which is reached at the corresponding eigenvector (Rayleigh-Ritz theorem).
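The Rayleigh-Ritz statement can be checked numerically. The sketch below computes the weight from equation (2) and the Rayleigh quotient, then compares the quotient at the top eigenvector with its value at random vectors. The helper names and the 3×3 positive semidefinite matrix are hypothetical; `numpy.linalg.eigh` returns eigenvalues in ascending order, so the last one is the maximum.

```python
import numpy as np

def optimal_weight(B, u):
    """Equation (2): xi = u^T B u / (u^T u)^2."""
    u = np.asarray(u, dtype=float)
    return float(u @ B @ u) / float(u @ u) ** 2

def rayleigh_quotient(B, u):
    """u^T B u / u^T u for a nonzero vector u."""
    u = np.asarray(u, dtype=float)
    return float(u @ B @ u) / float(u @ u)

# Hypothetical symmetric positive semidefinite matrix
B = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 0.0],
              [0.0, 0.0, 0.5]])

eigenvalues, eigenvectors = np.linalg.eigh(B)  # ascending eigenvalues
lam_max = eigenvalues[-1]
z = eigenvectors[:, -1]                         # unit-norm top eigenvector

rng = np.random.default_rng(0)
random_quotients = [rayleigh_quotient(B, rng.normal(size=3))
                    for _ in range(1000)]

print(lam_max, rayleigh_quotient(B, z), max(random_quotients))
```

Because `z` has unit norm, the weight from (2) coincides with the eigenvalue at `z`; no random vector yields a larger quotient.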


Fitting Fuzzy Additive Clustering Model with Spectral Method

Spectral clustering approach: $\Lambda(B) = [\lambda, z]$
– $\lambda$: maximum eigenvalue of $B$
– $z$: corresponding normed eigenvector of $B$

Projection: $P(z) = \frac{u}{\|u\|}$, with $u = (u_i)$,

$u_i = \begin{cases} 0, & \text{if } z_i \le 0; \\ z_i, & \text{if } 0 < z_i < 1; \\ 1, & \text{if } z_i \ge 1. \end{cases}$
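A minimal sketch of one spectral step follows: take the top eigenpair, clip the eigenvector to [0, 1] componentwise, and rescale. Two points are assumptions rather than statements from the slides: the projection is taken to divide by the Euclidean norm, and the arbitrary sign of the eigenvector is resolved by choosing the orientation with the larger positive part. The function names and the example matrix are hypothetical.

```python
import numpy as np

def project(z):
    """Clip z to [0, 1] componentwise, then scale to unit Euclidean norm."""
    u = np.clip(np.asarray(z, dtype=float), 0.0, 1.0)
    norm = np.linalg.norm(u)
    return u / norm if norm > 0 else u

def spectral_fuzzy_cluster(B):
    """Return (lambda_max, membership vector u, weight xi) for matrix B."""
    B = np.asarray(B, dtype=float)
    eigenvalues, eigenvectors = np.linalg.eigh(B)   # ascending order
    lam, z = eigenvalues[-1], eigenvectors[:, -1]
    # Assumption: eigenvectors are defined up to sign, so keep the
    # orientation whose positive part carries more weight.
    if np.sum(z[z < 0] ** 2) > np.sum(z[z > 0] ** 2):
        z = -z
    u = project(z)
    xi = float(u @ B @ u) / float(u @ u) ** 2        # equation (2)
    return lam, u, xi

# Hypothetical similarity matrix whose top eigenvector has mixed signs
B = np.array([[2.0, 1.0, -0.5],
              [1.0, 2.0, -0.5],
              [-0.5, -0.5, 1.0]])

lam, u, xi = spectral_fuzzy_cluster(B)
print(lam, u, xi)
```

In this example the third entity has a negative eigenvector component, so its membership is set to zero and the resulting weight falls below the maximum eigenvalue.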


Stop Conditions of Sequential Extraction of Fuzzy Clusters

s1. The eigenvalue for the spectral fuzzy cluster is negative (the requirement $\xi > 0$ fails).
s2. The contribution $\xi_k^2$ of the current cluster has reached a pre-specified proportion.
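The sequential extraction with both stop conditions can be sketched as a loop over residual matrices. This is an illustration under stated assumptions, not the authors' implementation: s2 is read here as "stop when the cluster's share ξ² of the data scatter drops below a threshold", memberships are scaled to unit norm, and the names `orient`, `extract_clusters`, `min_share` and `max_clusters` are hypothetical.

```python
import numpy as np

def orient(z):
    """Flip the eigenvector if its negative part outweighs its positive part."""
    pos = np.sum(np.clip(z, 0.0, None) ** 2)
    neg = np.sum(np.clip(-z, 0.0, None) ** 2)
    return -z if neg > pos else z

def extract_clusters(B, min_share=0.05, max_clusters=10):
    """Sequentially extract (u, xi, share) triples from similarity matrix B."""
    residual = np.array(B, dtype=float)
    scatter = float(np.sum(residual ** 2))      # data scatter: sum of b_ij^2
    clusters = []
    for _ in range(max_clusters):
        eigenvalues, eigenvectors = np.linalg.eigh(residual)
        lam, z = eigenvalues[-1], orient(eigenvectors[:, -1])
        if lam <= 0:                            # s1: no positive eigenvalue
            break
        u = np.clip(z, 0.0, 1.0)
        u = u / np.linalg.norm(u)               # unit-norm memberships
        xi = float(u @ residual @ u)            # equation (2) with u^T u = 1
        if xi <= 0:                             # s1 after projection
            break
        share = xi ** 2 / scatter               # cluster's share of scatter
        if share < min_share:                   # s2, as read in this sketch
            break
        clusters.append((u, xi, share))
        residual = residual - xi * np.outer(u, u)
    return clusters

# Hypothetical data: two non-overlapping clusters with weights 6 and 3
u1 = np.array([1.0, 1.0, 0.0, 0.0]) / np.sqrt(2)
u2 = np.array([0.0, 0.0, 1.0, 1.0]) / np.sqrt(2)
B = 6.0 * np.outer(u1, u1) + 3.0 * np.outer(u2, u2)

clusters = extract_clusters(B)
for u, xi, share in clusters:
    print(np.round(u, 3), round(xi, 3), round(share, 3))
```

On this noise-free example the loop recovers both planted clusters and then stops because nothing substantial remains in the residual matrix.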


Representing Research Activities over ACM-CCS Taxonomy

Computational Ontology Profiling of Scientific Research Organizations

Application of FADDIS to the problem of representing a Department’s research activities over the ACM Computing Classification System (ACM-CCS taxonomy, 1998):
– E-survey tool over ACM-CCS topics
– Similarity measure between ACM-CCS topics
– Additive, crisp or fuzzy, clustering
– Lifting of thematic clusters in the ACM-CCS taxonomy for interpretation in its framework

Project COPSRO, Fundação para Ciência e Tecnologia, Ministry of Science and Technology (grant PTDC/EIA/69988/2006), Portugal


E-Survey Tool of Scientific Activities (ESSA)

– Selection of a set of topics among the leaf nodes of the ACM-CCS tree (3rd layer);
– Assignment to each topic of a percentage expressing its proportion in the respondent’s total research activity (past four years).


Generic ESSA output: Fuzzy membership

Fuzzy memberships that characterize the research profile of a member of the Department.


Similarity between ACM-CCS Topics

Similarity between topics $i, j = 1, 2, \ldots, |I|$:

$b_{ij} = \sum_{v=1}^{|V|} \frac{n_v}{n_{\max}} f_{iv} f_{jv}$

where $f_{iv}$ is the membership of topic $i$ assigned by respondent $v$, $n_v$ is the number of topics taken by respondent $v$, and $n_{\max}$ is the maximum of $n_v$.

The similarity matrix $B = (b_{ij})$ is positive semidefinite. This similarity measure is proportional to the number and importance of research activities in both topics.
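The similarity measure can be computed for all topic pairs at once. In the sketch below, rows of the hypothetical array `F` are respondents and columns are topics, and the number of topics taken by a respondent is assumed to be the count of nonzero memberships in that row. The function name `topic_similarity` and the data are illustrative only.

```python
import numpy as np

def topic_similarity(F):
    """B = F^T diag(n_v / n_max) F, with F[v, i] the membership f_iv."""
    F = np.asarray(F, dtype=float)
    n_v = np.count_nonzero(F, axis=1)        # topics taken by each respondent
    weights = n_v / n_v.max()                # n_v / n_max
    return F.T @ (weights[:, None] * F)

# Hypothetical survey: 3 respondents, 3 topics, each row sums to 1
F = np.array([[0.5, 0.5, 0.0],
              [0.0, 1.0, 0.0],
              [0.2, 0.3, 0.5]])

B = topic_similarity(F)
print(np.round(B, 4))
```

Since the respondent weights are non-negative, a matrix of this form has no negative eigenvalues, in line with the positive semidefiniteness stated above.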


Similarity between ACM-CCS Topics: example

Table: Membership values for six ACM-CCS subjects A–F assigned by four individuals.

Individual   A     B     Nonzero values among C–F   n_v
v1           0.6   0.4                              2
v2           0     0     0.2, 0.3, 0.5              3
v3           0     0.2   0.4, 0.4                   3
v4           0.2   0.2   0.2, 0.2, 0.2              5

– Similarity:

$a(A, B) = \frac{2}{5} \times 0.6 \times 0.4 + \frac{3}{5} \times 0 \times 0 + \frac{3}{5} \times 0 \times 0.2 + \frac{5}{5} \times 0.2 \times 0.2 = 0.136$
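The arithmetic of this example can be reproduced in a few lines. The tuples below hold, for each of the four individuals, the number of topics taken and the memberships for subjects A and B as they appear in the worked sum; the variable names are hypothetical.

```python
# (n_v, membership of A, membership of B) for individuals v1..v4
rows = [
    (2, 0.6, 0.4),
    (3, 0.0, 0.0),
    (3, 0.0, 0.2),
    (5, 0.2, 0.2),
]

n_max = max(n for n, _, _ in rows)
similarity_ab = sum(n / n_max * f_a * f_b for n, f_a, f_b in rows)
print(round(similarity_ab, 3))  # 0.136
```

Only the first and fourth individuals contribute, because the other two assign zero membership to subject A.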


Analysis of ESSA Surveys Data

ESSA 2009 Surveys   N. of Participating Respondents   N. of 3rd Layer ACM-CCS Covered Topics
CENTRIA-UNL         16                                46/318
CSIS-BKUL           22                                54/318
DEI-ISEP            30                                65/318


Analysis of CENTRIA ESSA Survey’09

FADDIS clusters

Cluster   Contribution (%)   λ1     Weight (ξ)   Intensity (μ)
C1        35.2               46.5   31.04        5.57
C2        15.2               32.9   20.41        4.52

FADDIS stop condition s1: the eigenvalues for the third cluster are negative.
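The reported figures can be cross-checked against the model definitions. The sketch below verifies that each intensity equals the square root of the weight (ξ = μ²) and, assuming unit-norm membership vectors so that a cluster's contribution is ξ² divided by the data scatter, that both rows imply roughly the same scatter. The dictionary layout and variable names are hypothetical; the numbers are copied from the table.

```python
import math

# Values copied from the CENTRIA results table
clusters = {
    "C1": {"contribution_pct": 35.2, "xi": 31.04, "mu": 5.57},
    "C2": {"contribution_pct": 15.2, "xi": 20.41, "mu": 4.52},
}

# Intensity should be the square root of the weight, up to rounding
mu_gaps = {name: abs(math.sqrt(c["xi"]) - c["mu"])
           for name, c in clusters.items()}

# Assumption: contribution = xi^2 / scatter, so each row implies a scatter
implied_scatter = {name: c["xi"] ** 2 / (c["contribution_pct"] / 100.0)
                   for name, c in clusters.items()}

print(mu_gaps)
print(implied_scatter)
```

Both rows imply a data scatter of about 2740, which agrees within the rounding of the table entries.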


FADDIS Thematic Cluster 1 at CENTRIA

Membership value   Code    ACM-CCS Topic
0.69911            I.5.3   Clustering
0.3512             I.5.4   Applications in I.5 PATTERN RECOGNITION
0.27438            J.2     PHYSICAL SCIENCES AND ENGINEERING (Applications in)
0.1992             I.4.9   Applications in I.4 IMAGE PROC. AND COMPUTER VISION
0.1992             I.4.6   Segmentation
0.19721            H.5.1   Multimedia Information Systems
0.17478            H.5.2   User Interfaces
0.17478            H.5.3   Group and Organization Interfaces
0.16689            H.1.1   Systems and Information
0.16689            I.5.1   Models in I.5 PATTERN RECOGNITION
0.16513            H.1.2   User/Machine Systems
0.14453            I.5.2   Design Methodology (Classifiers)
0.13646            H.5.0   General in H.5 INF. INTERFACES AND PRESENTATION
0.13646            H.0     GENERAL in H. Information Systems


Assessment of CENTRIA Cluster 1

5 members’ efforts overlap the cluster: 4 fall within the cluster; 1 overlaps other clusters: 50%.


FADDIS Thematic Cluster 2 at CENTRIA

Membership value   Code    ACM-CCS Topic
0.46756            J.3     LIFE AND MEDICAL SCIENCES (Applications in)
0.40619            I.2.8   Problem Solving, Control Methods, and Search
0.34435            F.2.1   Numerical Algorithms and Problems
0.32681            F.4.1   Mathematical Logic
0.30067            G.1.6   Optimization
0.25967            D.3.3   Language Constructs and Features
0.23748            G.2.2   Graph Theory
0.18722            G.3     PROBABILITY AND STATISTICS
0.17359            B.2.3   Reliability, Testing, and Fault-Tolerance
0.17359            B.7.3   Reliability and Testing
0.17203            I.2.0   General in I.2 ARTIFICIAL INTELLIGENCE
0.1537             G.1.0   General in G.1 NUMERICAL ANALYSIS
0.11827            I.2.3   Deduction and Theorem Proving
0.10195            G.1.7   Ordinary Differential Equations


Assessment of CENTRIA Cluster 2

8 members’ efforts overlap the cluster: 7 fall within the cluster; 1 overlaps other clusters: 10%.


Summary of FADDIS Results over ESSA Surveys

– The clusters found in the three research organizations have a clear meaning.
– These clusters are consistent with the informal assessment of the research conducted in those organizations.
– The sets of research topics chosen by individual members in the ESSA survey follow the found cluster structure rather closely.
– The number of clusters found in the research center (2) is smaller than in the university departments (4 and 4).


Conclusion

Conclusion (1)

Two data recovery models for fuzzy clustering. Both have competitive fuzzy clustering algorithms:
– FCPM vs FCM
– FADDIS vs several Relational Fuzzy Clustering algorithms (NERFCM (Hathaway and Bezdek, 1994), FMFCM (Brouwer, 2009)).

Some model-based features:
– Underlying type in FCPM
– FADDIS stop conditions that provide an indicator of the number of clusters


Conclusion (2)

Real Applications
– Mental disorders: keeping prototypes “extremal”/archetypal
– Clustering research topics according to the working of a research department, and representing its activities over a taxonomy
– Consistent results over diverse real data of university departments


Future Work

– Data modeling and experiments to analyse cluster structure recovery and other specifics of the models
– Overcoming shortcomings that follow from the sequential nature of FADDIS
– Exploring real-world applications


Some References

1. B. Mirkin and S. Nascimento (2012). Additive Spectral Method for Fuzzy Cluster Analysis of Similarity Data Including Community Structure and Affinity Matrices. Information Sciences, 183(1), pp. 16-34, Elsevier.
2. S. Nascimento (2005). Fuzzy Clustering via Proportional Membership Model. Vol. 119 of Frontiers in Artificial Intelligence and Applications, IOS Press, 200 pp.
3. S. Nascimento, B. Mirkin, and F. Moura Pires (2003). Modeling Proportional Membership in Fuzzy Clustering. IEEE Transactions on Fuzzy Systems (IEEE-TFS), 11(2), pp. 173-186.
