The MST-kNN with Paracliques


Ahmed Shamsul Arefin, Carlos Riveros, Pablo Moscato*, Regina Berretta

The Priority Research Centre for Bioinformatics, Biomarker Discovery and Information-based Medicine, University of Newcastle
{Ahmed.Arefin, Carlos.Riveros, Regina.Berretta, Pablo.Moscato}@newcastle.edu.au

February 7, 2015

Presented by Ahmed S. Arefin

The MST-kNN + Paracliques — ACALCI 2015


Overview

1 Introduction
    Background
    The Problem

2 The MST-kNN with Paracliques
    Proposed Solution
    Implementation
    Results

3 Conclusion and Future Research Directions
    Conclusion
    Future Research Directions





Introduction

Data clustering is perhaps the most common and widely used approach in data analytics. Over the years, a large number of methods have been developed for clustering. Among those, the graph-based approaches are well-known for their advantages in partitioning both real-world and artificial data [Jain et al., 1999].





Graph-based clustering

Graph-based methods generally take a distance matrix computed from the input and build a proximity graph G(V, E), where each vertex represents a data element, each edge represents the presence of a proximity relationship, and the weight of the edge represents, in some way, the degree of proximity of the pair of vertices [Anders, 2003]. This is followed by the computation of some subgraphs [Berkhin, 2006], e.g., the Minimum Spanning Tree (MST), the k-Nearest Neighbour Graph (k-NNG), the Relative Neighbourhood Graph (RNG) and so forth.





The MST-kNN

Among the various known graph clustering methods, the MST-kNN [Inostroza-Ponta et al., 2006] (see also [Gonzalez-Barrios and Quiroz, 2003]) is of interest for our work, as it does not require any ad hoc user-defined parameter. Further, in terms of homogeneity and separation index [Sharan et al., 2003], it has been shown that it performs better than the classical clustering algorithms such as K-Means and SOMs [Inostroza-Ponta et al., 2007].





The MST-kNN

The MST-kNN’s scalability and performance have been demonstrated in its external-memory variant [Arefin et al., 2011] as well as in its data-parallel variants [Arefin et al., 2012a] and [Arefin et al., 2012c]. Furthermore, it has been employed in the analysis of large-scale real-world data of various kinds, such as:

stock market time series data [Inostroza-Ponta et al., 2006],
yeast gene expression data [Inostroza-Ponta et al., 2011],
prostate cancer data [Capp et al., 2009],
breast cancer data [Arefin et al., 2011],
Alzheimer’s disease data [Arefin et al., 2012b],
and so on.





The MST-kNN

The MST-kNN [Inostroza-Ponta et al., 2006] (see footnote 1)

Footnote 1: The method in [Gonzalez-Barrios and Quiroz, 2003] has neither recursion nor an automatically chosen k.




The MST-kNN (Demonstration)

A complete graph formed by 16 Indo-European languages, extracted from the 84 Indo-European languages distance matrix provided in [Dyen et al., 1992].




The MST-kNN (Demonstration)

The Minimum Spanning Tree (MST)
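As a rough illustration of this step, the following Python sketch builds the MST of a small complete graph given as a distance matrix, using Prim's algorithm. The function name `prim_mst` and the toy distance values are assumptions made for this example; they are not taken from the slides or from the authors' implementation.

```python
def prim_mst(dist):
    """Return MST edges (u, v, weight) of a complete graph given as a
    symmetric distance matrix (list of lists). Illustrative only."""
    n = len(dist)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest known cost to attach each vertex
    parent = [-1] * n           # tree vertex that gives that cheapest cost
    best[0] = 0.0               # grow the tree from vertex 0
    edges = []
    for _ in range(n):
        # pick the cheapest vertex that is not yet in the tree
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        in_tree[u] = True
        if parent[u] != -1:
            edges.append((parent[u], u, dist[parent[u]][u]))
        # relax the attachment cost of the remaining vertices
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return edges


# Toy 4-element distance matrix (made-up values).
D = [
    [0, 1, 4, 5],
    [1, 0, 2, 6],
    [4, 2, 0, 3],
    [5, 6, 3, 0],
]
mst = prim_mst(D)
print(mst)  # [(0, 1, 1), (1, 2, 2), (2, 3, 3)]
```

A complete graph on n vertices always yields n - 1 MST edges, so the result can be checked quickly by counting them.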





The MST-kNN (Demonstration) - Contd.

Application of the MST-kNN on the 16 Indo-European languages.

Note that

$$k = \min\{\lfloor \ln(n) \rfloor ;\ \min k \mid G_{kNN} \text{ is connected}\} \qquad (1)$$
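The rule for k stated above can be sketched in Python as follows. This is a minimal reading of the formula, assuming that the kNN graph joins two vertices when either one is among the other's k nearest neighbours and that n is at least 2. The helper names `knn_graph`, `is_connected`, `choose_k` and `dist_1d`, and the 1-D toy data, are hypothetical.

```python
import math


def knn_graph(dist, k):
    """Undirected kNN graph as an adjacency dict (assumed 'either side' rule)."""
    n = len(dist)
    adj = {u: set() for u in range(n)}
    for u in range(n):
        others = sorted((v for v in range(n) if v != u), key=lambda v: dist[u][v])
        for v in others[:k]:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def is_connected(adj):
    """Depth-first search connectivity check."""
    start = next(iter(adj))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(adj)


def choose_k(dist):
    """k = min(floor(ln n), smallest k whose kNN graph is connected)."""
    n = len(dist)
    if n < 2:
        raise ValueError("need at least two elements")
    k_log = math.floor(math.log(n))
    # k = n - 1 gives the complete graph, so a connected k always exists.
    k_conn = next(k for k in range(1, n) if is_connected(knn_graph(dist, k)))
    return min(k_log, k_conn)


def dist_1d(points):
    """Distance matrix for points on a line (toy data only)."""
    return [[abs(a - b) for b in points] for a in points]


# Two well-separated groups: connectivity needs k = 3, floor(ln 8) = 2 wins.
print(choose_k(dist_1d([0, 1, 2, 10, 11, 12, 13, 14])))  # 2
# Evenly spaced points: already connected at k = 1.
print(choose_k(dist_1d([0, 1, 2, 3, 4, 5, 6, 7])))       # 1
```

The first case shows why the logarithmic bound matters: without it, a single distant group would force a larger k on every vertex.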




The MST-kNN (1st Iteration)

Application of the MST-kNN on the 84 Indo-European languages.





The Problem

The MST-kNN’s Limitation: The MST-kNN’s outcome does not provide insight into the interactions among the core vertices within the MST-kNN partitions.





The MST-kNN with Paracliques

Proposed Solution: We propose a modified version, termed the MST-kNN with Paracliques. It adopts the working procedure of the MST-kNN, but uses an iterative approach and integrates paraclique structures into the MST-kNN’s outcome.





Definitions

A clique is a set of vertices in which every vertex has an edge to every other vertex in the set. A maximal clique is a clique that cannot be extended by adding another vertex. The maximum clique of a graph is a maximal clique that has the largest number of vertices and is arguably the most ‘natural’ cluster in a proximity graph [Ngomo, 2006]. However, finding the maximum clique is a well-known NP-hard problem. In contrast, the identification of paracliques [Chesler and Langston, 2006] (see footnote 2) provides a viable alternative.

Footnote 2: See also ‘quasi-cliques’ in [Abello et al., 2002].
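To make the difference between maximal and maximum cliques concrete, the Python sketch below enumerates all maximal cliques of a small graph with the Bron-Kerbosch algorithm (with pivoting) and then picks the largest one. Bron-Kerbosch is used here as a stand-in chosen for this example; the slides do not state which enumeration algorithm the authors' tooling uses. The names `maximal_cliques` and `build_adj` and the toy edge list are assumptions.

```python
def build_adj(edges):
    """Adjacency dict (vertex -> neighbour set) from an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def maximal_cliques(adj):
    """Enumerate all maximal cliques with Bron-Kerbosch and pivoting."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates, x: already-processed vertices
        if not p and not x:
            cliques.append(sorted(r))   # r cannot be extended: it is maximal
            return
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)


# Toy graph: a triangle a-b-c, a bridge c-d, and a 4-clique d-e-f-g.
edges = [
    ("a", "b"), ("a", "c"), ("b", "c"),
    ("c", "d"),
    ("d", "e"), ("d", "f"), ("d", "g"),
    ("e", "f"), ("e", "g"), ("f", "g"),
]
cliques = maximal_cliques(build_adj(edges))
maximum = max(cliques, key=len)

print(cliques)  # [['a', 'b', 'c'], ['c', 'd'], ['d', 'e', 'f', 'g']]
print(maximum)  # ['d', 'e', 'f', 'g']
```

All three cliques are maximal, but only the last is the maximum clique. The number of maximal cliques can grow exponentially with graph size, which is one reason exhaustive enumeration becomes costly on large inputs.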




The MST-kNN with Paracliques (Contd.)

We identify paracliques via the identification of the maximal cliques of size 3 or higher present in the kNN graphs reconstructed from the MST-kNN components. In other words, we collect as paracliques the neighborhood networks that are present within the MST-kNN components but lack only a few edges to become cliques of a larger size. This results in more insightful networks among the core vertices in each MST-kNN partition than the ones portrayed by the MST alone.
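One consequence of the "size 3 or higher" restriction can be illustrated on its own. An edge of a kNN graph lies in some maximal clique of size 3 or more exactly when its two endpoints share a common neighbour, that is, when the edge is part of a triangle. The Python sketch below uses that fact to separate the kNN edges of one component into those that can take part in such cliques and those that cannot. It is a simplification of a single step, not the authors' kNN Paracliques procedure. The function names, the "either side" kNN rule and the 1-D toy data are assumptions.

```python
def knn_edges(dist, k):
    """Undirected kNN edge set; (u, v) stored with u < v (assumed rule:
    an edge exists if either endpoint lists the other among its k nearest)."""
    n = len(dist)
    edges = set()
    for u in range(n):
        nearest = sorted((v for v in range(n) if v != u),
                         key=lambda v: dist[u][v])[:k]
        for v in nearest:
            edges.add((min(u, v), max(u, v)))
    return edges


def edges_in_cliques_of_size_3_plus(edges):
    """Keep edges whose endpoints share at least one common neighbour."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return {(u, v) for u, v in edges if adj[u] & adj[v]}


# Toy component: two tight groups linked through one isolated middle point.
points = [0, 1, 2, 50, 97, 99, 100]
dist = [[abs(a - b) for b in points] for a in points]

edges = knn_edges(dist, k=2)
core = edges_in_cliques_of_size_3_plus(edges)

print(sorted(core))          # [(0, 1), (0, 2), (1, 2), (4, 5), (4, 6), (5, 6)]
print(sorted(edges - core))  # [(2, 3), (3, 4)]
```

In this toy case the two triangles survive while the two edges through the middle point are dropped, since that point's neighbours are not neighbours of each other.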





The MST-kNN with Paracliques (Algorithm)

The Proposed Method (see lines 8 and 9)





The MST-kNN with Paracliques (Algorithm) - Contd.

The kNN Paracliques Method





The MST-kNN with Paracliques (Implementation)

The proposed method has been implemented in R using the igraph package [Csardi and Nepusz, 2006]. For computing the MST and the kNN graph we use the minimum.spanning.tree (Prim’s) and graph.adjacency functions, respectively. For finding the maximal cliques we use the decompose.graph, maximal.cliques and induced.subgraph functions, and for retrieving the k maximal cliques we use an order function. For merging the graphs, we compute the symmetric graph differences.





The MST-kNN with Paracliques (Results)

The MST-kNN + Paracliques on the 84 Indo-European languages.





The MST-kNN with Paracliques (Results) - Contd.

168 Shakespearean era plays in [Marsden et al., 2013] (see also [Arefin et al., 2014] for 256 Shakespearean era plays and poems).




The MST-kNN with Paracliques (Results) - Statistical Significance

Table: Significance of clusterings by the MST-kNN and its paraclique variant.

Data | Method | Scoring class | Wilcoxon test p-value | Kruskal-Wallis test p-value
84 Indo-European languages data set | MST-kNN | 9 language groups | 1.04E-07 | 2.09E-07
84 Indo-European languages data set | MST-kNN with Paracliques | 9 language groups | 1.02E-07 | 2.04E-07
168 Shakespearean era plays data set | MST-kNN | 39 authors of the plays | 1.13E-10 | 2.26E-10
168 Shakespearean era plays data set | MST-kNN with Paracliques | 39 authors of the plays | 8.10E-12 | 1.62E-11

*The Kruskal-Wallis test on the original vs. each of the 1000 random permutations resulted in p-values close to 0.





Conclusion and Future Research Directions

We presented a variant of the MST-kNN method, termed the MST-kNN with Paracliques, which provides more insight into the inter-relations among the partitioned elements. We envision that the modified method will be a useful data clustering approach for the analysis of data sets in several areas, including bioinformatics, artificial intelligence, image and video analysis, creative arts, and finance.





Conclusion and Future Research Directions - Contd.

Issue 1: At the moment, on smaller data sets, our method’s time performance is similar to that of the MST-kNN. However, at a large scale, e.g., with a data set of more than 10,000 elements, it is at least 10 times slower, mainly due to its maximal clique finding component.

Plan: We aim to re-implement this part using a data-parallel approach, which we expect to give a better speedup.





Conclusion and Future Research Directions - Contd.

Issue 2: So far we have only compared our outcomes against the MST-kNN. This is because we initially aimed at enhancing the MST-kNN only, and the original method has already been shown to perform better than traditional clustering methods such as CLICK and SOMs.

Plan: We aim to compare our outcomes against other data partitioning methods, such as DBSCAN for graphs, affinity propagation, spectral clustering, etc. This would also help us to identify the data types for which the proposed method is more appropriate.





Conclusion and Future Research Directions - Contd.

Issue 3: Currently the method is only available (beta) to the members of the CIBM research group at the CIBM website http://cibm.newcastle.edu.au. It is part of a local R tool called CIBM-RUtils.

Plan: We aim to publish it as a data clustering package for R on CRAN http://cran.r-project.org/ (once Issues 1 and 2 have been resolved).





Thanks + QA

Thanks:

1 Dr. Renato Vimieiro, Lecturer, Centro de Informatica, UFP, Brazil (former CIBM member).
2 All CIBM members, collaborators and users/testers of the CIBM-RUtils.


Thank you + QA?




Abello, J., Resende, M. G., and Sudarsky, S. (2002). Massive quasi-clique detection. In LATIN 2002: Theoretical Informatics, pages 598–612. Springer.

Anders, K.-H. (2003). A hierarchical graph-clustering approach to find groups of objects. In Proceedings 5th Workshop on Progress in Automated Map Generalization, pages 1–8.

Arefin, A., Riveros, C., Berretta, R., and Moscato, P. (2012a). kNN-Boruvka-GPU: A fast and scalable MST construction from kNN graphs on GPU. In Murgante, B., Gervasi, O., Misra, S., Nedjah, N., Rocha, A., Taniar, D., and Apduhan, B., editors, Computational Science and Its Applications ICCSA 2012, volume 7333 of Lecture Notes in Computer Science, pages 71–86. Springer Berlin Heidelberg.

Arefin, A. S., Inostroza-Ponta, M., Mathieson, L., Berretta, R., and Moscato, P. (2011). Clustering nodes in large-scale biological networks using external memory algorithms. In Xiang, Y., Cuzzocrea, A., Hobbs, M., and Zhou, W., editors, Algorithms and Architectures for Parallel Processing, volume 7017 of Lecture Notes in Computer Science, pages 375–386. Springer Berlin Heidelberg.

Arefin, A. S., Mathieson, L., Johnstone, D., Berretta, R., and Moscato, P. (2012b). Unveiling clusters of RNA transcript pairs associated with markers of Alzheimer’s disease progression. PLoS ONE, 7(9):e45535.

Arefin, A. S., Riveros, C., Berretta, R., and Moscato, P. (2012c). kNN-MST-Agglomerative: A fast and scalable graph-based data clustering approach on GPU. In Computer Science & Education (ICCSE), 2012 7th International Conference on, pages 585–590. IEEE.





Arefin, A. S., Vimieiro, R., Riveros, C., Craig, H., and Moscato, P. (2014). An Information Theoretic clustering approach for unveiling authorship affinities in Shakespearean era plays and poems. PLoS ONE, 9(10):e111445.

Berkhin, P. (2006). A survey of clustering data mining techniques. In Grouping multidimensional data, pages 25–71. Springer.

Capp, A., Inostroza-Ponta, M., Bill, D., Moscato, P., Lai, C., Christie, D., Lamb, D., Turner, S., Joseph, D., and Matthews, J. (2009). Is there more than one proctitis syndrome? A revisitation using data from the TROG 96.01 trial. Radiotherapy and Oncology, 90(3):400–407.

Chesler, E. and Langston, M. (2006). Combinatorial genetic regulatory network analysis tools for high throughput transcriptomic data. In Eskin, E., Ideker, T., Raphael, B., and Workman, C., editors, Systems Biology and Regulatory Genomics, volume 4023 of Lecture Notes in Computer Science, pages 150–165. Springer Berlin Heidelberg.

Csardi, G. and Nepusz, T. (2006). The igraph software package for complex network research. InterJournal, Complex Systems, 1695(5).

Dyen, I., Kruskal, J. B., and Black, P. (1992). An Indoeuropean classification: a lexicostatistical experiment. Transactions of the American Philosophical Society, pages iii–132.





Gonzalez-Barrios, J. M. and Quiroz, A. J. (2003). A clustering procedure based on the comparison between the k nearest neighbors graph and the minimal spanning tree. Statistics & Probability Letters, 62(1):23–34.

Inostroza-Ponta, M., Berretta, R., Mendes, A., and Moscato, P. (2006). An automatic graph layout procedure to visualize correlated data. In Artificial Intelligence in Theory and Practice, pages 179–188. Springer.

Inostroza-Ponta, M., Berretta, R., and Moscato, P. (2011). QAPgrid: A two level QAP-based approach for large-scale data analysis and visualization. PLoS ONE, 6(1):e14468.

Inostroza-Ponta, M., Mendes, A., Berretta, R., and Moscato, P. (2007). An integrated QAP-based approach to visualize patterns of gene expression similarity. In Progress in Artificial Life, pages 156–167. Springer.

Jain, A. K., Murty, M. N., and Flynn, P. J. (1999). Data clustering: a review. ACM Computing Surveys (CSUR), 31(3):264–323.

Marsden, J., Budden, D., Craig, H., and Moscato, P. (2013). Language individuation and marker words: Shakespeare and his Maxwell’s demon. PLoS ONE, 8(6):e66813.

Ngomo, A.-C. N. (2006). Clique-based clustering. Evaluation, 1:10.





Sharan, R., Maron-Katz, A., and Shamir, R. (2003). CLICK and EXPANDER: A system for clustering and visualizing gene expression data. Bioinformatics, 19(14):1787–1799.


