The MST-kNN with Paracliques

Introduction The MST-kNN with Paracliques Conclusion and Future Research Directions

Ahmed Shamsul Arefin, Carlos Riveros, Regina Berretta, Pablo Moscato*

The Priority Research Centre for Bioinformatics Biomarker Discovery and Information-based Medicine University of Newcastle {Ahmed.Arefin, Carlos.Riveros, Regina.Berretta, Pablo.Moscato}@newcastle.edu.au.

February 7, 2015

Presented by Ahmed S. Arefin

The MST-kNN + Paracliques — ACALCI 2015


Overview

1. Introduction: Background; The Problem
2. The MST-kNN with Paracliques: Proposed Solution; Implementation; Results
3. Conclusion and Future Research Directions: Conclusion; Future Research Directions


Introduction

Data clustering is perhaps the most common and widely used approach in data analytics. Over the years, a large number of methods have been developed for clustering. Among those, the graph-based approaches are well-known for their advantages in partitioning both real-world and artificial data [Jain et al., 1999].


Graph-based clustering

Graph-based methods generally take a distance matrix computed from the input and build a proximity graph G(V, E), where each vertex represents a data element, each edge represents the presence of a proximity relationship, and the weight of the edge represents, in some way, the degree of proximity of the pair of vertices [Anders, 2003]. This is followed by the computation of some subgraphs [Berkhin, 2006], e.g., the Minimum Spanning Tree (MST), the k-Nearest Neighbour Graph (k-NNG), the Relative Neighbourhood Graph (RNG) and so forth.
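As a minimal illustration of the two subgraphs used later in these slides, the sketch below builds an MST (Prim's algorithm) and a kNN graph from a small distance matrix. The function names `prim_mst` and `knn_graph` are hypothetical, and the kNN graph is assumed to be undirected, with an edge kept when either endpoint lists the other among its k nearest neighbours; the slides do not state which convention is used.

```python
import math

def prim_mst(dist):
    """Return the MST of a complete weighted graph as a set of (i, j) edges, i < j."""
    n = len(dist)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known cost of attaching each vertex to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = set()
    for _ in range(n):
        # Pick the cheapest vertex that is not yet in the tree.
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.add((min(u, parent[u]), max(u, parent[u])))
        # Relax the attachment cost of the remaining vertices.
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return edges

def knn_graph(dist, k):
    """Undirected kNN graph (assumption): keep (i, j) if j is among i's k nearest, or vice versa."""
    n = len(dist)
    edges = set()
    for i in range(n):
        nearest = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for j in nearest:
            edges.add((min(i, j), max(i, j)))
    return edges

# Toy data: five points on a line, forming two well-separated groups.
points = [0.0, 1.0, 2.0, 10.0, 11.0]
dist = [[abs(a - b) for b in points] for a in points]
mst = prim_mst(dist)
knn1 = knn_graph(dist, 1)
```

On this toy input the MST contains the long edge that bridges the two groups, whereas the 1-NN graph does not, which is the kind of disagreement a method combining both graphs can exploit.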


The MST-kNN

Among the various known graph clustering methods, the MST-kNN [Inostroza-Ponta et al., 2006] (see also [Gonzalez-Barrios and Quiroz, 2003]) is of interest for our work, as it does not require any ad hoc user-defined parameter. Further, in terms of homogeneity and separation index [Sharan et al., 2003], it has been shown that it performs better than the classical clustering algorithms such as K-Means and SOMs [Inostroza-Ponta et al., 2007].


The MST-kNN’s scalability and performance have been demonstrated in its external-memory variant [Arefin et al., 2011] as well as in its data-parallel variants [Arefin et al., 2012a] and [Arefin et al., 2012c]. Furthermore, it has been employed in the analysis of large-scale real-world data of various kinds, such as stock market time series data [Inostroza-Ponta et al., 2006], yeast gene expression data [Inostroza-Ponta et al., 2011], prostate cancer data [Capp et al., 2009], breast cancer data [Arefin et al., 2011], Alzheimer’s disease data [Arefin et al., 2012b] and so on.


Algorithm listing: The MST-kNN [Inostroza-Ponta et al., 2006] (see note 1).

Note 1: The method in [Gonzalez-Barrios and Quiroz, 2003] does not have recursion and automatic k.


The MST-kNN (Demonstration)

Figure: A complete graph formed by 16 Indo-European languages, extracted from the distance matrix of 84 Indo-European languages provided in [Dyen et al., 1992].


The Minimum Spanning Tree (MST)


The MST-kNN (Demonstration) - Contd.

Figure: Application of the MST-kNN on the 16 Indo-European languages. Note that

k = \min\{\lfloor \ln(n) \rfloor;\ \min k \mid G_{kNN} \text{ is connected}\} \qquad (1)
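Equation (1) can be read as: take the smaller of floor(ln(n)) and the smallest k for which the kNN graph is connected. The sketch below computes that value for a distance matrix. The helper names (`knn_edges`, `is_connected`, `choose_k`) are hypothetical, and the undirected "either endpoint" kNN convention is an assumption, not something the slides specify.

```python
import math

def knn_edges(dist, k):
    """Undirected kNN edges (assumption): (i, j) is kept if j is among i's k nearest, or vice versa."""
    n = len(dist)
    edges = set()
    for i in range(n):
        nearest = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        edges.update((min(i, j), max(i, j)) for j in nearest)
    return edges

def is_connected(n, edges):
    """Depth-first search from vertex 0; True if every vertex is reached."""
    adj = {v: set() for v in range(n)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for v in adj[u] - seen:
            seen.add(v)
            stack.append(v)
    return len(seen) == n

def choose_k(dist):
    """k = min(floor(ln n), smallest k whose kNN graph is connected), following Equation (1)."""
    n = len(dist)
    if n < 2:
        return 0  # no neighbours to choose from
    # k = n - 1 yields the complete graph, so a connecting k always exists.
    k_conn = next(k for k in range(1, n) if is_connected(n, knn_edges(dist, k)))
    return min(math.floor(math.log(n)), k_conn)

def line_distances(points):
    return [[abs(a - b) for b in points] for a in points]

# Three separated groups: the kNN graph only becomes connected at k = 3,
# so the floor(ln(8)) = 2 term decides.
k_grouped = choose_k(line_distances([0, 1, 2, 10, 11, 12, 20, 21]))
# Evenly spaced points: the 1-NN graph is already a connected path.
k_uniform = choose_k(line_distances([0, 1, 2, 3, 4, 5, 6, 7]))
```

For the two data sets shown in the slides, the logarithmic term alone would give floor(ln(16)) = 2 and floor(ln(84)) = 4.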


The MST-kNN (1st Iteration)

Figure: Application of the MST-kNN on the 84 Indo-European languages.


The Problem

The MST-kNN’s Limitation: The MST-kNN’s outcome does not provide insight into the interactions among the core vertices within each MST-kNN partition.


The MST-kNN with Paracliques

Proposed Solution: We propose a modified version, termed the MST-kNN with Paracliques. It adopts the working procedure of the MST-kNN, but uses an iterative approach and integrates paraclique structures into the MST-kNN’s outcome.


Definitions

A clique is a set of vertices in which every vertex has an edge to every other vertex in the set. A maximal clique is a clique that cannot be extended by adding another vertex. The maximum clique of a graph is a maximal clique that has the largest number of vertices and is arguably the most ‘natural’ cluster in a proximity graph [Ngomo, 2006]. However, the problem of finding the maximum clique is a well-known NP-hard problem. In contrast, the identification of paracliques [Chesler and Langston, 2006] (see note 2) provides a viable alternative.

Note 2: See also ‘quasi-cliques’ in [Abello et al., 2002].
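To make the distinction between maximal and maximum cliques concrete, here is a small sketch that enumerates maximal cliques with the Bron-Kerbosch algorithm (with pivoting). This is a standard textbook technique swapped in for illustration; the slides do not say which enumeration algorithm igraph's routine uses, and the names `maximal_cliques` and `adjacency` are hypothetical.

```python
def adjacency(edges):
    """Build {vertex: set_of_neighbours} from an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def maximal_cliques(adj):
    """Yield every maximal clique of the graph as a frozenset (Bron-Kerbosch with pivoting)."""
    def expand(r, p, x):
        # r: current clique, p: candidates that extend r, x: vertices already processed
        if not p and not x:
            yield frozenset(r)   # nothing can extend r, so it is maximal
            return
        # Pivot on the vertex covering the most candidates to prune the branching.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    yield from expand(set(), set(adj), set())

# A triangle {0, 1, 2}, a bridge 2-3, and a 4-clique {3, 4, 5, 6}.
edges = [(0, 1), (0, 2), (1, 2), (2, 3),
         (3, 4), (3, 5), (3, 6), (4, 5), (4, 6), (5, 6)]
cliques = set(maximal_cliques(adjacency(edges)))
maximum = max(cliques, key=len)
```

Here the triangle, the bridge edge and the 4-clique are all maximal, but only the 4-clique is the maximum clique. The worst-case cost of this enumeration grows exponentially with graph size, which is consistent with the hardness remark above.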


The MST-kNN with Paracliques (Contd.)

We identify paracliques via the identification of the maximal cliques of size 3 or higher present in the kNN graphs reconstructed from the MST-kNN components. In other words, we collect as paracliques the neighborhood networks that are present within the MST-kNN components but lack only a few edges to become cliques of a larger size. This results in more insightful networks among the core vertices in each MST-kNN partition than the ones portrayed by the MST alone.
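The idea of a vertex set that "lacks only a few edges" to be a larger clique can be illustrated with a simple growth rule: starting from a clique, repeatedly add any outside vertex that is missing at most a fixed number of edges to the current set. The sketch below is an illustrative assumption in the spirit of the paraclique notion of [Chesler and Langston, 2006]; it is not the procedure used in these slides, which works from maximal cliques of the kNN graphs. The function name `grow_paraclique` and the tolerance parameter `glom` are hypothetical.

```python
def grow_paraclique(adj, seed, glom=1):
    """Grow a clique `seed` by absorbing vertices that miss at most `glom` edges to the set.

    adj  : {vertex: set_of_neighbours} for an undirected graph
    seed : iterable of vertices forming a clique
    glom : hypothetical tolerance, the number of missing edges allowed per added vertex
    """
    members = set(seed)
    changed = True
    while changed:
        changed = False
        for v in sorted(set(adj) - members):
            missing = len(members - adj[v])   # members that v is NOT adjacent to
            if missing <= glom:
                members.add(v)
                changed = True
    return members

# A 4-clique {0, 1, 2, 3}; vertex 4 is adjacent to 0, 1 and 2 but not 3;
# vertex 5 hangs off vertex 4 only.
adj = {
    0: {1, 2, 3, 4},
    1: {0, 2, 3, 4},
    2: {0, 1, 3, 4},
    3: {0, 1, 2},
    4: {0, 1, 2, 5},
    5: {4},
}
relaxed = grow_paraclique(adj, {0, 1, 2, 3}, glom=1)
strict = grow_paraclique(adj, {0, 1, 2, 3}, glom=0)
```

With a tolerance of one missing edge, vertex 4 joins the group even though it is not adjacent to vertex 3; with a tolerance of zero the set stays a strict clique.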


The MST-kNN with Paracliques (Algorithm)

The Proposed Method (see lines 8 and 9)


The MST-kNN with Paracliques (Algorithm) - Contd.

The kNN Paracliques Method


The MST-kNN with Paracliques (Implementation)

The proposed method has been implemented in R using the igraph package [Csardi and Nepusz, 2006]. For computing the MST and kNN we use the minimum.spanning.tree (Prim’s) and graph.adjacency functions, respectively. For finding the maximal cliques we use the decompose.graph, maximal.cliques and induced.subgraph functions, and for retrieving the k maximal cliques we use an order function. For merging the graphs, we compute the symmetric graph differences.


The MST-kNN with Paracliques (Results)

The MST-kNN + Paracliques on the 84 Indo European Languages


The MST-kNN with Paracliques (Results) - Contd.

Figure: 168 Shakespearean era plays in [Marsden et al., 2013] (see also [Arefin et al., 2014]: 256 Shakespearean era plays and poems).


The MST-kNN with Paracliques (Results) - Statistical Significance

Table: Significance of clusterings by the MST-kNN and its paraclique variant.

Data                                   Method                     Scoring class             Wilcoxon test p-value   Kruskal-Wallis test p-value
84 Indo-European languages data set    MST-kNN                    9 language groups         1.04E-07                2.09E-07
84 Indo-European languages data set    MST-kNN with Paracliques   9 language groups         1.02E-07                2.04E-07
168 Shakespearean era plays data set   MST-kNN                    39 authors of the plays   1.13E-10                2.26E-10
168 Shakespearean era plays data set   MST-kNN with Paracliques   39 authors of the plays   8.10E-12                1.62E-11

*The Kruskal-Wallis test 2, on the original vs. the individual 1000 random permutations resulted in p-values close to 0.


Conclusion and Future Research Directions

We presented a variant of the MST-kNN method, termed the MST-kNN with Paracliques, which provides more insight into the inter-relations among the partitioned elements. We envision that the modified method will be a useful data clustering approach for the analysis of data sets in several areas, including bioinformatics, artificial intelligence, image and video analysis, creative arts, and finance.


Conclusion and Future Research Directions - Contd.

Issue 1: At the moment, on smaller data sets, our method’s running time is similar to that of the MST-kNN. At a larger scale, however, e.g., with a data set having more than 10,000 elements, it is at least 10 times slower, mainly due to its maximal clique finding component.

Plan: We aim to re-implement this part using a data-parallel approach, which we expect to give a better speedup.


Conclusion and Future Research Directions - Contd.

Issue 2: So far we have only compared our outcomes against the MST-kNN. This is because we initially aimed at enhancing the MST-kNN only, and the original method has already been shown to perform better than traditional clustering methods, such as CLICK and SOMs.

Plan: We aim to compare our outcomes against other data partitioning methods, such as DBSCAN for graphs, affinity propagation, spectral clustering, etc. This would also help us to identify the data types for which the proposed method is more appropriate.


Conclusion and Future Research Directions - Contd.

Issue 3: Currently the method is only available (beta) to the members of the CIBM research group at the CIBM website http://cibm.newcastle.edu.au. It is part of a local R tool called CIBM-RUtils.

Plan: We aim to publish it as a data clustering package for R on CRAN http://cran.r-project.org/ (once Issues 1 and 2 have been resolved).


Thanks + QA

Thanks:

1. Dr. Renato Vimieiro, Lecturer, Centro de Informatica, UFP, Brazil (former CIBM member).
2. All CIBM members, collaborators and users/testers of the CIBM-RUtils.


Thank you + QA?


References

Abello, J., Resende, M. G., and Sudarsky, S. (2002). Massive quasi-clique detection. In LATIN 2002: Theoretical Informatics, pages 598–612. Springer.

Anders, K.-H. (2003). A hierarchical graph-clustering approach to find groups of objects. In Proceedings 5th Workshop on Progress in Automated Map Generalization, pages 1–8.

Arefin, A., Riveros, C., Berretta, R., and Moscato, P. (2012a). kNN-Boruvka-GPU: A fast and scalable MST construction from kNN graphs on GPU. In Murgante, B., Gervasi, O., Misra, S., Nedjah, N., Rocha, A., Taniar, D., and Apduhan, B., editors, Computational Science and Its Applications ICCSA 2012, volume 7333 of Lecture Notes in Computer Science, pages 71–86. Springer Berlin Heidelberg.

Arefin, A. S., Inostroza-Ponta, M., Mathieson, L., Berretta, R., and Moscato, P. (2011). Clustering nodes in large-scale biological networks using external memory algorithms. In Xiang, Y., Cuzzocrea, A., Hobbs, M., and Zhou, W., editors, Algorithms and Architectures for Parallel Processing, volume 7017 of Lecture Notes in Computer Science, pages 375–386. Springer Berlin Heidelberg.

Arefin, A. S., Mathieson, L., Johnstone, D., Berretta, R., and Moscato, P. (2012b). Unveiling clusters of RNA transcript pairs associated with markers of Alzheimer’s disease progression. PLoS ONE, 7(9):e45535.

Arefin, A. S., Riveros, C., Berretta, R., and Moscato, P. (2012c). kNN-MST-Agglomerative: A fast and scalable graph-based data clustering approach on GPU. In Computer Science & Education (ICCSE), 2012 7th International Conference on, pages 585–590. IEEE.


Arefin, A. S., Vimieiro, R., Riveros, C., Craig, H., and Moscato, P. (2014). An Information Theoretic clustering approach for unveiling authorship affinities in Shakespearean era plays and poems. PLoS ONE, 9(10):e111445.

Berkhin, P. (2006). A survey of clustering data mining techniques. In Grouping multidimensional data, pages 25–71. Springer.

Capp, A., Inostroza-Ponta, M., Bill, D., Moscato, P., Lai, C., Christie, D., Lamb, D., Turner, S., Joseph, D., and Matthews, J. (2009). Is there more than one proctitis syndrome? A revisitation using data from the TROG 96.01 trial. Radiotherapy and Oncology, 90(3):400–407.

Chesler, E. and Langston, M. (2006). Combinatorial genetic regulatory network analysis tools for high throughput transcriptomic data. In Eskin, E., Ideker, T., Raphael, B., and Workman, C., editors, Systems Biology and Regulatory Genomics, volume 4023 of Lecture Notes in Computer Science, pages 150–165. Springer Berlin Heidelberg.

Csardi, G. and Nepusz, T. (2006). The igraph software package for complex network research. InterJournal, Complex Systems, 1695(5).

Dyen, I., Kruskal, J. B., and Black, P. (1992). An Indoeuropean classification: a lexicostatistical experiment. Transactions of the American Philosophical Society, pages iii–132.


Gonzalez-Barrios, J. M. and Quiroz, A. J. (2003). A clustering procedure based on the comparison between the k nearest neighbors graph and the minimal spanning tree. Statistics & Probability Letters, 62(1):23–34.

Inostroza-Ponta, M., Berretta, R., Mendes, A., and Moscato, P. (2006). An automatic graph layout procedure to visualize correlated data. In Artificial Intelligence in Theory and Practice, pages 179–188. Springer.

Inostroza-Ponta, M., Berretta, R., and Moscato, P. (2011). QAPgrid: A two level QAP-based approach for large-scale data analysis and visualization. PLoS ONE, 6(1):e14468.

Inostroza-Ponta, M., Mendes, A., Berretta, R., and Moscato, P. (2007). An integrated QAP-based approach to visualize patterns of gene expression similarity. In Progress in Artificial Life, pages 156–167. Springer.

Jain, A. K., Murty, M. N., and Flynn, P. J. (1999). Data clustering: a review. ACM Computing Surveys (CSUR), 31(3):264–323.

Marsden, J., Budden, D., Craig, H., and Moscato, P. (2013). Language individuation and marker words: Shakespeare and his Maxwell’s demon. PLoS ONE, 8(6):e66813.

Ngomo, A.-C. N. (2006). Clique-based clustering. Evaluation, 1:10.


Sharan, R., Maron-Katz, A., and Shamir, R. (2003). CLICK and EXPANDER: A system for clustering and visualizing gene expression data. Bioinformatics, 19(14):1787–1799.
