Modularity and community detection in Semantic Similarity Networks through Spectral Based Transformation and Markov Clustering [Extended Abstract]
Pietro Hiram Guzzi∗, Simone Truglia, Marianna Milano, Pierangelo Veltri, Mario Cannataro
Dept Surgical and Medical Sciences, University of Catanzaro, Italy
ABSTRACT
Semantic Similarity Networks are currently used to model similarities among biological entities. The nodes of such networks are, for instance, proteins, while the weighted edges encode the semantic similarity scores between them. These networks are usually affected by noise. This paper presents an algorithm for de-noising them and shows that mining algorithms perform better on the processed networks.
Categories and Subject Descriptors H.4 [Information Systems Applications]: Miscellaneous; D.2.8 [Software Engineering]: Metrics—complexity measures, performance measures
General Terms Theory
Keywords Biological Networks, Semantic Similarity, Spectral Analysis.
∗Corresponding Author
Copyright is held by author/owner(s). BCB 13, September 22 - 25, 2013, Washington, DC. ACM 978-1-4503-2434-2/13/09.
ACM-BCB 2013
1. INTRODUCTION
Biological information about genes and proteins is structured around simple terms, also known as annotation terms, that are associated with each molecule. Terms are organized into biological ontologies [5, 3]. The Gene Ontology (GO) [6] organizes a set of annotations (referred to as GO terms) into three main taxonomies: Molecular Function (MF), Biological Process (BP), and Cellular Component (CC). The Gene Ontology Annotation (GOA) database [2] stores the associations between molecules and their GO terms. Semantic similarity measures (SSMs) have been introduced to quantify the similarity of two or more terms belonging to the same ontology: an SSM takes two or more ontology terms as input and outputs a value representing their similarity. Since each gene or protein is associated with a set of GO terms, researchers have explored the possibility of using these formal instruments to compare and analyze proteins and genes [5]. Many approaches have consequently been introduced [10]; here we focus on the building of semantic similarity networks [10]. A semantic similarity network of proteins (SSN) is an edge-weighted graph Gssu = (V, E), where V is the set of proteins and E is the set of edges; each edge has an associated weight that represents the semantic similarity between the pair of nodes it connects. These networks are constructed by computing a similarity value between genes or proteins and then linking the nodes whose similarity is greater than zero. Unfortunately, such networks are usually quasi-complete and may suffer from the biases of the measures [9], so using them directly as a framework for analysis is problematic. Defining a threshold on the edge weights that retains only the meaningful relationships is therefore a crucial step.
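The construction described above can be sketched in a few lines of Python. This is only an illustration: the protein names, the similarity scores, and the helper names `build_ssn`, `density`, and `global_threshold` are hypothetical and do not come from the paper, and the scores stand in for the output of whichever SSM is used.

```python
from itertools import combinations

# Hypothetical pairwise semantic similarity scores for four proteins
# (illustrative values, not produced by any real SSM).
SCORES = {
    ("P1", "P2"): 0.9, ("P1", "P3"): 0.2, ("P1", "P4"): 0.1,
    ("P2", "P3"): 0.3, ("P2", "P4"): 0.0, ("P3", "P4"): 0.8,
}

def build_ssn(proteins, sim):
    """Link every pair of proteins whose similarity is greater than zero."""
    edges = {}
    for u, v in combinations(sorted(proteins), 2):
        weight = sim(u, v)
        if weight > 0:
            edges[(u, v)] = weight
    return edges

def density(n_nodes, edges):
    """Fraction of all possible undirected edges that are present."""
    return len(edges) / (n_nodes * (n_nodes - 1) / 2)

def global_threshold(edges, tau):
    """Keep only the edges whose weight is at least tau."""
    return {pair: w for pair, w in edges.items() if w >= tau}

proteins = ["P1", "P2", "P3", "P4"]
ssn = build_ssn(proteins, lambda u, v: SCORES[(u, v)])
print(density(len(proteins), ssn))         # 5 of the 6 possible edges are present
print(sorted(global_threshold(ssn, 0.5)))  # only the two strongest edges survive
```

Even in this toy case the raw network is almost complete, which is why some form of thresholding is needed before mining.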
A high threshold may result in the loss of many significant relationships, while a low threshold may introduce a lot of noise. Many methods have been defined: for instance, the use of an arbitrary global threshold [4], the use of only a fraction of the highest relationships [1], or statistically based methods [11]. Nevertheless, the internal characteristics of SSMs (as investigated in [9]) do not suggest the use of global thresholds. In fact, small regions of relatively low similarity may be due to the characteristics of the measures even though the proteins or genes involved are highly similar. The use of local thresholds, i.e. retaining only the top k edges for each node [7, 8, 13], may therefore constitute an efficient alternative. Even so, this choice may be influenced by the presence of local noise and, in general, may introduce biases in different regions of the network. The choice of a correct threshold is a crucial step for the subsequent mining of the SSN. Starting from these considerations, we developed a novel hybrid method that merges local and global considerations. After thresholding we mine the networks, and we show a considerable improvement with respect to the raw networks. We propose the following strategy.

Building of Semantic Similarity Network: Initially we consider a dataset of proteins and calculate the semantic similarity for each pair of proteins; this produces an edge-weighted graph Gssu = (V, E), where V is the set of proteins and E is the set of edges, each with an associated weight that represents the semantic similarity between the pair of nodes it connects.

Pruning of Semantic Similarity Networks: The resulting Gssu is analyzed in order to eliminate meaningless edges on the basis of a local threshold. Briefly, for each node v we analyze its neighborhood and eliminate the edges whose weights are low on the basis of two considerations.

Determination of Fiedler's Vector: We calculate the spectrum of the Laplacian of the weighted adjacency matrix of the SSN. The standard Laplacian matrix is defined as L = D − A, where A is the weighted adjacency matrix and D is a diagonal matrix whose diagonal element Dii is the volume (vol) of node i, i.e. the sum of the weights of the edges incident to i.
Fiedler's vector Vf is the eigenvector of the standard Laplacian matrix corresponding to its second smallest eigenvalue. This vector is used to rescale all the edges of the original matrix.

Rewriting of the Original Weighted Adjacency Matrix: With each node i we associate Vf(i), i.e. the i-th element of Fiedler's vector, and with each edge (i, j) we associate the difference ∥Vf(i) − Vf(j)∥.
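The spectral steps above can be illustrated with a minimal, dependency-free Python sketch. It builds L = D − A, approximates Fiedler's vector, and rewrites each existing edge weight as the absolute difference of the Fiedler values of its endpoints. The function names and the toy graph are hypothetical. The eigenvector is approximated here by power iteration on a shifted Laplacian, which is a choice made only to keep the example self-contained; the paper does not say which eigensolver was used, and the sketch assumes a connected graph whose second smallest eigenvalue is simple.

```python
import math

def laplacian(adj):
    """Standard Laplacian L = D - A of a symmetric weighted adjacency matrix."""
    n = len(adj)
    lap = [[-adj[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        lap[i][i] += sum(adj[i])  # D_ii is the volume (weighted degree) of node i
    return lap

def fiedler_vector(adj, iters=2000):
    """Approximate the eigenvector of L for its second smallest eigenvalue.

    Power iteration on (shift*I - L) with the constant vector projected out;
    assumes a connected graph with a simple second smallest eigenvalue.
    """
    n = len(adj)
    lap = laplacian(adj)
    shift = 2.0 * max(lap[i][i] for i in range(n))  # upper bound on the eigenvalues of L
    x = [i - (n - 1) / 2.0 for i in range(n)]       # deterministic zero-mean start
    for _ in range(iters):
        y = [shift * x[i] - sum(lap[i][j] * x[j] for j in range(n)) for i in range(n)]
        mean = sum(y) / n
        y = [value - mean for value in y]           # stay orthogonal to the constant vector
        norm = math.sqrt(sum(value * value for value in y))
        x = [value / norm for value in y]
    return x

def rewrite_weights(adj):
    """Replace the weight of each existing edge (i, j) by |Vf(i) - Vf(j)|."""
    vf = fiedler_vector(adj)
    n = len(adj)
    return [[abs(vf[i] - vf[j]) if adj[i][j] > 0 else 0.0 for j in range(n)]
            for i in range(n)]

# Toy SSN: a path 0 - 1 - 2 - 3 with unit weights (hypothetical data).
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
W = rewrite_weights(A)
for row in W:
    print([round(w, 3) for w in row])
```

The sign of an eigenvector is arbitrary, but the rewritten weights do not depend on it because only absolute differences are used. For real networks a library eigensolver such as `numpy.linalg.eigh` would normally replace the power iteration.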
Soft Clustering of the Resulting Weighted Matrix: Finally, we mine the resulting networks through soft Markov clustering [12].

Results Evaluation: We evaluated the obtained results in terms of the functional coherence of the extracted modules. We define the functional coherence FC of a module M as the average of the semantic similarity values of all the pairs of nodes (i, j) composing the module. The results show a considerable improvement in functional coherence with respect to the original networks.
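The functional coherence defined above is an average over all unordered pairs of nodes in a module, which a short Python sketch makes concrete. The scores, the protein names, and the helper `functional_coherence` are hypothetical, and returning 0 for a module with fewer than two nodes is an assumption of this sketch rather than something stated in the paper.

```python
from itertools import combinations

# Hypothetical pairwise semantic similarity scores (illustrative values only).
SCORES = {
    ("P1", "P2"): 0.9, ("P1", "P3"): 0.2, ("P1", "P4"): 0.1,
    ("P2", "P3"): 0.3, ("P2", "P4"): 0.0, ("P3", "P4"): 0.8,
}

def sim(u, v):
    """Look up the similarity of a pair given in sorted order."""
    return SCORES[(u, v)]

def functional_coherence(module, sim):
    """Average semantic similarity over all unordered pairs of nodes in a module."""
    pairs = list(combinations(sorted(module), 2))
    if not pairs:
        return 0.0  # assumption: a module with fewer than two nodes scores 0
    return sum(sim(u, v) for u, v in pairs) / len(pairs)

fc_a = functional_coherence({"P1", "P2", "P3"}, sim)  # mean of 0.9, 0.2 and 0.3
fc_b = functional_coherence({"P3", "P4"}, sim)        # single pair with score 0.8
print(fc_a, fc_b)
```

Comparing this value for the modules extracted from the raw network and from the processed network is the kind of evaluation the text describes.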
2. REFERENCES
[1] U. Ala, R. Piro, E. Grassi, C. Damasco, L. Silengo, M. Oti, P. Provero, and F. Cunto. Prediction of human disease genes by human-mouse conserved coexpression analysis. PLoS Computational Biology, 4(3):e1000043, 2008.
[2] E. Camon, M. Magrane, D. Barrell, V. Lee, E. Dimmer, J. Maslen, D. Binns, N. Harte, R. Lopez, and R. Apweiler. The gene ontology annotation (goa) database: sharing knowledge in uniprot with gene ontology. Nucl. Acids Res., 32(suppl 1):D262–266, January 2004.
[3] M. Cannataro, P. H. Guzzi, and A. Sarica. Data mining and life sciences applications on the grid. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 3(3):216–238, 2013.
[4] T. Freeman, L. Goldovsky, M. Brosch, S. van Dongen, P. Maziere, R. Grocock, S. Freilich, J. Thornton, and A. Enright. Construction, visualization, and clustering of transcription networks from microarray expression data. PLoS Computational Biology, 3(10):e206, 2007.
[5] P. Guzzi, M. Mina, C. Guerra, and M. Cannataro. Semantic similarity analysis of protein data: assessment with biological features and issues. Briefings in Bioinformatics, 13(5):569–585, 2012.
[6] M. A. Harris et al. The gene ontology (go) database and informatics resource. Nucleic Acids Res, 32(Database issue):258–61, January 2004.
[7] H. Lee, A. Hsu, J. Sajdak, J. Qin, and P. Pavlidis. Coexpression analysis of human genes across many microarray data sets. Genome Res, 14:1085–1094, 2004.
[8] M. Moriyama, Y. Hoshida, M. Otsuka, S. Nishimura, N. Kato, T. Goto, H. Taniguchi, Y. Shiratori, N. Seki, and M. Omata. Relevance network between chemosensitivity and transcriptome in human hepatoma cells. Molecular Cancer Therapeutics, 2:199–205, 2003.
[9] G. P. and M. M. Investigating bias in semantic similarity measures for analysis of protein interactions. In Proceedings of 1st International Workshop on Pattern Recognition in Proteomics, Structural Biology and Bioinformatics (PR PS BB 2011), pages 71–80, 13th September 2011, 2012.
[10] C. Pesquita, D. Faria, A. O. Falcão, P. Lord, and F. M. Couto. Semantic similarity in biomedical ontologies. PLoS Computational Biology, 5(7):e1000443, July 2009.
[11] T. Rito, Z. Wang, C. M. Deane, and G. Reinert. How threshold behaviour affects the use of subgraphs for network comparison. Bioinformatics, 26(18):i611–i617, 2010.
[12] Y.-K. Shih and S. Parthasarathy. Identifying functional modules in interaction networks through overlapping markov clustering. Bioinformatics, 28(18):i473–i479, 2012.
[13] B. Voy, J. Scharff, A. Perkins, A. Saxton, B. Borate, E. Chesler, L. Branstetter, and M. Langston. Extracting gene networks for low-dose radiation using graph theoretical algorithms. PLoS Computational Biology, 2(7):e89, 2006.