Discovering larger network motifs: Network Motif clustering

Chen Li and Wooyoung Kim

Discovering Large Network Motifs

In this project, we aim to discover large network motifs. The main ideas are 1) combining smaller network motifs and extending them into larger network motifs, or 2) using clustering algorithms to find a more compact representation of the whole network and then applying existing or new algorithms for finding network motifs. In order to identify appropriate approaches, we reviewed related papers on clustering algorithms for bioinformatics applications [1] and on network motif algorithms [3].

1 A roadmap of clustering algorithms in bioinformatics applications [1]

This paper gives an overview of clustering algorithms that have been developed and tries to match the proper algorithms to various bioinformatics applications.

1.1 Desirable features of clustering algorithms to evaluate

Bioinformatics applications have various requirements for clustering results based on the purpose of clustering, and therefore there are also various clustering algorithms. To evaluate a clustering algorithm, the authors use the following general set of features:

• Scalability
• Robustness
• Order insensitivity
• Minimum user-specified input
• Mixed datatypes
• Arbitrary-shaped clusters
• Point proportion admissibility: duplicating data and re-clustering should not alter the result.

1.2 Various clustering algorithms

This paper separates the clustering algorithms into six categories: partitioning, hierarchical, grid-based, density-based, model-based and graph-based algorithms.

1.2.1 Partitioning clustering methods

Partitioning clustering methods are useful for bioinformatics applications, including gene expression data, where a fixed number of clusters is required. The partitioning methods are further divided into numerical methods and discrete methods. The K-means algorithm, the Farthest First Traversal k-center (FFT) algorithm, K-medoids or PAM (Partitioning Around Medoids), CLARA (Clustering Large Applications), CLARANS (Clustering Large Applications Based Upon Randomized Search) and Fuzzy K-means belong to the numerical methods. Discrete methods include K-modes, Fuzzy K-modes, Squeezer and COOLCAT. K-prototypes is a mix of discrete and numerical clustering methods. Apart from K-means itself, these algorithms are in fact modifications of the K-means algorithm for various purposes.
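To make the partitioning idea concrete, the following minimal Python sketch clusters a small synthetic expression matrix into a fixed number of clusters with K-means. The data, the choice of k = 3 and the use of scikit-learn are our own illustrative assumptions, not part of the reviewed paper.

```python
# Minimal sketch: partition a synthetic "gene expression" matrix into a fixed
# number of clusters with K-means. Data, k=3 and random seeds are illustrative.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 60 genes x 10 conditions, drawn around three artificial expression profiles
profiles = rng.normal(size=(3, 10))
expression = np.vstack([p + 0.3 * rng.normal(size=(20, 10)) for p in profiles])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(expression)
print("cluster sizes:", np.bincount(labels))
```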

1.2.2 Hierarchical clustering algorithms

Hierarchical clustering algorithms divide the data into a tree of nodes, where each node represents a cluster. They are often classified along two dimensions, based on their methods or purposes: agglomerative vs. divisive, and single vs. complete vs. average linkage. In bioinformatics applications, hierarchical clustering methods are more popular because biological data often contain several levels of subsets, such as protein family relationships. However, hierarchical methods are slow, they do not tolerate errors well, and information is commonly lost when moving between levels. Like partitioning methods, hierarchical methods consist of numerical methods and discrete methods: BIRCH, CURE and Spectral clustering are numerical methods, while ROCK, Chameleon and LIMBO are discrete methods.
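As an illustration of the agglomerative, average-linkage variant, the sketch below builds the merge tree with SciPy and cuts it into two clusters. The synthetic feature matrix and the cut level are illustrative assumptions.

```python
# Minimal sketch: agglomerative clustering with average linkage using SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=0, size=(15, 5)),
               rng.normal(loc=4, size=(15, 5))])

Z = linkage(X, method="average", metric="euclidean")   # bottom-up tree of merges
labels = fcluster(Z, t=2, criterion="maxclust")         # cut the tree into 2 clusters
print("cluster sizes:", np.bincount(labels)[1:])         # fcluster labels start at 1
```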

1.2.3 Grid-based clustering algorithms

Grid-based clustering forms a grid structure of cells from the input data, and each data point is then assigned to a cell of the grid. STING combines a numerical grid-based clustering method with a hierarchical method.
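The sketch below illustrates only the basic grid idea: 2-D points are binned into cells and the dense cells are flagged. The grid resolution and the density threshold are illustrative assumptions; STING itself additionally keeps summary statistics in a hierarchy of cells.

```python
# Minimal sketch of the grid-based idea: overlay a fixed grid on 2-D points and
# treat densely occupied cells as a coarse summary of the data.
import numpy as np

rng = np.random.default_rng(2)
points = rng.uniform(0, 1, size=(200, 2))

n_cells = 10                                   # 10 x 10 grid over the unit square
cells = np.floor(points * n_cells).astype(int).clip(0, n_cells - 1)
counts = np.zeros((n_cells, n_cells), dtype=int)
np.add.at(counts, (cells[:, 0], cells[:, 1]), 1)

dense = np.argwhere(counts >= 4)               # cells above a density threshold
print("dense cells:", len(dense))
```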

1.2.4 Density-based clustering algorithms

Density-based clustering algorithms use a local density criterion: clusters are dense subspaces separated by low-density regions. Examples of bioinformatics applications using density-based methods include finding the densest subspaces in interactome (protein-protein interaction) networks. DBSCAN, OPTICS, DENCLUE, WaveCluster and CLIQUE use numerical values for clustering; SEQOPTICS is used for sequence clustering. HIERDENC (Hierarchical Density-based Clustering), MULIC (Multiple Layer Incremental Clustering), projected (subspace) clustering, CACTUS, STIRR, CLICK and CLOPE use discrete values for clustering.
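The following sketch shows the density-based principle with DBSCAN: points lying in a dense region form a cluster, while isolated points are labelled as noise. The eps and min_samples values and the synthetic data are illustrative assumptions.

```python
# Minimal sketch: DBSCAN groups points in dense regions and marks sparse
# points as noise (label -1).
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(3)
dense_blob = rng.normal(loc=0.0, scale=0.1, size=(50, 2))
noise = rng.uniform(-1, 1, size=(20, 2))
X = np.vstack([dense_blob, noise])

labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(X)
print("clusters found:", len(set(labels) - {-1}),
      "noise points:", int(np.sum(labels == -1)))
```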

1.2.5 Model-based clustering algorithms

Model-based clustering uses a model that is often derived from a statistical distribution. These methods combine background information into gene expression, interactome and sequence analyses for bioinformatics applications. Self-Organizing Maps (SOMs) are an example of numerical model-based methods, and COBWEB is a discrete model-based clustering algorithm. BILCOM (Bi-level Clustering of Mixed Discrete and Numerical Biomedical Data), on the other hand, mixes numerical and discrete model-based clustering using an empirical Bayesian approach; gene expression clustering and protein sequence clustering are applications of this method. Other examples include AutoClass and SVM clustering methods.
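The sketch below illustrates the general model-based idea, fitting a two-component Gaussian mixture and assigning each observation to its most likely component. It is a generic example of the approach, not an implementation of SOM, COBWEB or BILCOM; the data and number of components are illustrative assumptions.

```python
# Minimal sketch of model-based clustering: fit a statistical model (here a
# Gaussian mixture) and assign each observation to its most likely component.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(loc=-2, size=(40, 3)), rng.normal(loc=2, size=(40, 3))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)
print("component sizes:", np.bincount(labels),
      "avg log-likelihood:", round(gmm.score(X), 3))
```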


1.2.6 Graph-based clustering algorithms

Graph-based clustering algorithms have been applied to interactomes for complex prediction and to sequence networks. MCODE (Molecular Complex Detection) detects subnetworks in an interactome. SPC (Super Paramagnetic Clustering) is similar to COOLCAT, and RNSC (Restricted Neighborhood Search Clustering) is similar to ROCK and Chameleon. MCL (Markov Clustering), which is similar to projected clustering, is used for interactomes and works by simulating a flow. Other methods include TribeMCL, SPC, CD-HIT, ProClust and the BAG algorithm.
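A minimal sketch of the flow simulation at the core of MCL is given below: expansion (matrix squaring) alternates with inflation (elementwise powering followed by renormalization) on a column-stochastic matrix until clusters emerge as attractors. The toy adjacency matrix and the inflation parameter r = 2 are illustrative assumptions, and a real MCL implementation adds pruning for sparsity.

```python
# Minimal sketch of the MCL flow simulation on a toy graph of two triangles
# joined by a single edge.
import numpy as np

A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
M = A + np.eye(6)                        # add self-loops
M = M / M.sum(axis=0)                    # column-stochastic transition matrix

for _ in range(20):
    M = M @ M                            # expansion
    M = M ** 2                           # inflation with r = 2
    M = M / M.sum(axis=0)

# nodes whose flow ends at the same attractor form a cluster
clusters = {}
for node in range(6):
    attractor = int(np.argmax(M[:, node]))
    clusters.setdefault(attractor, []).append(node)
print(list(clusters.values()))
```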

1.3 Bioinformatics applications

These methods are used across bioinformatics applications. For gene expression clustering, K-means, hierarchical clustering and SOMs have been used; depending on the problems encountered, alternative algorithms can be applied. For interactomes, AutoClass, SVM clustering, COBWEB or MULIC have been used. For sequence clustering, hierarchical clustering algorithms are the most appropriate.

2 A Review on Models and Algorithms for Motif Discovery in Protein-Protein Interaction Networks [3]

This paper presents two distinct definitions of a motif, based on frequency and on statistical significance. They are as follows.

Definition 1. A motif is a sub-graph that appears more than a threshold number of times.

Definition 2. A motif is a sub-graph that appears more often than expected by chance.

A motif according to Definition 2 is often said to be over-represented. There are several characteristics used to evaluate a motif.

Frequency: there are three frequency concepts, depending on whether occurrences may share nodes and edges arbitrarily (non-identical case), share only nodes (edge-disjoint case) or share nothing (edge- and vertex-disjoint case).

Statistical Significance: the obtained values of the frequencies for the observed and random networks are compared through either the Z-score or the abundance.
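A minimal sketch of the significance computation is shown below. The motif counts are made-up numbers, and the abundance is computed in the commonly used ratio form, which may differ slightly from the exact definition used in the reviewed paper.

```python
# Minimal sketch of the Z-score criterion behind Definition 2: compare the motif
# frequency in the real network with its frequencies in randomized networks.
import numpy as np

f_observed = 42.0                                  # motif count in the real network (made up)
f_random = np.array([20, 25, 18, 22, 27, 21, 24])  # counts in randomized networks (made up)

z_score = (f_observed - f_random.mean()) / f_random.std(ddof=1)
eps = 4.0  # small constant commonly added to avoid division by tiny totals
abundance = (f_observed - f_random.mean()) / (f_observed + f_random.mean() + eps)
print(f"Z-score = {z_score:.2f}, abundance = {abundance:.2f}")
```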

2.1 Models of Random Graphs

An important issue is how to model the random networks used for comparison and evaluation of motif occurrences. Pioneering work on random networks preserves the same degree distribution as the biological network, such as a power-law degree distribution. A random graph model preserving the degree sequence has been proposed for the search of n-node motifs. A new model of PPI networks has been proposed based on the notion of geometric random networks and a Poisson distribution of the degrees. A new research direction is improving the fit by incorporating node clustering into the model, to account for different patterns of connectivity in different network modules. A major problem in modeling arises from the incompleteness and noise of the PPI data; in this area, models of random link removal that reproduce incomplete networks have been analyzed. The model parameters show no significant dependence on the degree distribution, while motif frequencies seem to be affected by the strength of link removal.
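A standard way to realize such a degree-preserving null model is by repeated double edge swaps, as in the sketch below. The scale-free toy graph (a stand-in for a PPI network) and the number of swaps are illustrative assumptions.

```python
# Minimal sketch: build a degree-preserving random network by repeated edge swaps.
import networkx as nx

G = nx.barabasi_albert_graph(200, 2, seed=0)        # stand-in for a PPI network
R = G.copy()
nx.double_edge_swap(R, nswap=10 * G.number_of_edges(), max_tries=10**6, seed=1)

assert dict(G.degree()) == dict(R.degree())          # degree sequence is preserved
orig = {frozenset(e) for e in G.edges()}
rand = {frozenset(e) for e in R.edges()}
print("edges shared with the original:", len(orig & rand),
      "of", G.number_of_edges())
```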

2.2 Motif Discovery Algorithms

Exact algorithms for motifs with a small number of nodes

Exhaustive recursive search (ERS): the input network is represented by an adjacency matrix M. ERS does not scale to motifs of size bigger than 4.

ESU: starts with individual nodes and adds one node at a time until the required size k is reached. During construction, the algorithm keeps an auxiliary, dynamically updated list of nodes that are candidates for future additions. When combined with sampling, ESU allows the detection of sub-graphs of size up to 14; a minimal sketch of the enumeration appears after Figure 1.

Compact topological motifs: introduces a compact graph representation obtained by grouping together maximal sets of nodes that are 'indistinguishable'. In Figure 1, for example, the nodes of the set U1 are indistinguishable with respect to the set U2. A compact representation is more efficient than the enumeration of all occurrences, and the resulting algorithm achieves a drastic reduction of the size of the output motifs. See the papers [2, 6].


Figure 1: The graph on the left shows the sets U1 and U2 as compact nodes and U1-U2 as a compact edge.
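The sketch below illustrates the ESU enumeration described above on a toy graph: sub-graphs are grown one node at a time from a root node, with an extension list that guarantees every connected size-k sub-graph is generated exactly once. The cycle graph and k = 3 are illustrative assumptions; Rand-ESU would additionally explore each tree level only with some probability.

```python
# Minimal sketch of ESU: enumerate all connected size-k sub-graphs exactly once.
import networkx as nx

def esu(G, k):
    """Yield the node sets of all connected sub-graphs of G with exactly k nodes."""
    def extend(subgraph, extension, v):
        if len(subgraph) == k:
            yield frozenset(subgraph)
            return
        while extension:
            w = extension.pop()
            # exclusive neighbours of w: not in the sub-graph, not adjacent to it,
            # and with a label larger than the root v (avoids duplicate output)
            excl = {u for u in G[w]
                    if u > v and u not in subgraph
                    and all(u not in G[s] for s in subgraph)}
            yield from extend(subgraph | {w}, extension | excl, v)

    for v in G:
        ext = {u for u in G[v] if u > v}
        yield from extend({v}, ext, v)

G = nx.cycle_graph(5)                    # 5-node ring as a toy input network
occurrences = list(esu(G, 3))
print(len(occurrences), "connected sub-graphs of size 3")  # expect 5 paths
```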

2.3 Approximate Algorithms

Search algorithm based on sampling (MFINDER): the algorithm picks edges of the input graph at random until a set of k nodes is obtained, yielding a sampled sub-graph, and assigns weights to the samples to correct for the non-uniform sampling. It scales well with large networks, but does not scale well with large motifs.

Rand-ESU: proposed to overcome MFINDER's drawback of the extra time needed to compute the weights of all samples. ESU builds a tree whose leaves correspond to sub-graphs of size k, while internal nodes correspond to sub-graphs of size 1 up to k-1, depending on the tree level. Rand-ESU assigns to each level of the tree a probability that the nodes are further explored, so as to guarantee that all leaves are visited with uniform probability.

NeMoFINDER (applies to unlabelled undirected graphs): this algorithm combines approaches developed within the data mining and computational biology communities. It searches for repeated trees and extends them to sub-graphs. This leads to a reduction of the computation time for the discovery of larger motifs, but at the cost of missing some potentially interesting sub-graphs.

Sub-graph counting by scalar computation: characterizes a biological network by a set of measures based on scalars and functionals of the adjacency matrix A associated with the network. Its advantages are mathematical elegance and computational efficiency.

A-priori-based motif detection: the basic idea is that if a sub-graph is frequent, so are all of its sub-graphs. The algorithm builds candidate motifs of size k by joining motifs of size k-1 and then evaluating their frequency. Details are in the papers [4, 5].
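The sketch below illustrates the sampling idea behind MFINDER on a toy graph: a connected node set of size k is grown from a randomly chosen edge and the induced sub-graph type is tallied. Using a degree-sequence key instead of a full isomorphism test, and omitting the sample weights that correct the sampling bias, are illustrative simplifications of our own.

```python
# Minimal sketch of sampling-based sub-graph counting on a random toy graph.
import random
from collections import Counter
import networkx as nx

def sample_subgraph(G, k, rng):
    u, v = rng.choice(list(G.edges()))          # start from a random edge
    nodes = {u, v}
    while len(nodes) < k:
        frontier = {n for m in nodes for n in G[m]} - nodes
        if not frontier:
            return None                          # cannot grow further
        nodes.add(rng.choice(sorted(frontier)))
    return G.subgraph(nodes)

rng = random.Random(0)
G = nx.gnm_random_graph(50, 120, seed=0)
counts = Counter()
for _ in range(2000):
    sg = sample_subgraph(G, 3, rng)
    if sg is not None:
        counts[tuple(sorted(d for _, d in sg.degree()))] += 1

# (1, 1, 2) is an open path, (2, 2, 2) is a triangle
print(counts)
```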


References

[1] Bill Andreopoulos, Aijun An, Xiaogang Wang, and Michael Schroeder. A roadmap of clustering algorithms: finding a match for a biomedical application. Brief Bioinform, pages bbn058+, February 2009.

[2] Alberto Apostolico, Matteo Comin, and Laxmi Parida. Bridging lossy and lossless compression by motif pattern discovery. Electronic Notes in Discrete Mathematics, 21:219-225, 2005. General Theory of Information Transfer and Combinatorics.

[3] Giovanni Ciriello and Concettina Guerra. A review on models and algorithms for motif discovery in protein-protein interaction networks. Brief Funct Genomic Proteomic, 7(2):147-156, 2008.

[4] Jun Huan, Wei Wang, and Jan Prins. Efficient mining of frequent subgraphs in the presence of isomorphism. In Proceedings of the IEEE International Conference on Data Mining (ICDM), page 549, 2003.

[5] Michihiro Kuramochi and George Karypis. Finding frequent patterns in a large sparse graph. Data Mining and Knowledge Discovery, 11(3):243-271, November 2005.

[6] Laxmi Parida. Discovering topological motifs using a compact notation. Journal of Computational Biology, 14(3):300-323, 2007.
