Reconstruction of Genetic Networks in Yeast using Support Based Approach

S Roy
Dept. of Information Technology, North Eastern Hill University, Umshing, Shillong, Meghalaya, India

D K Bhattacharyya
Dept. of Computer Science & Engineering, Tezpur University, Napaam, Tezpur, Assam, India

Abstract—A network of co-regulated genes is crucial for understanding cell physiology. Existing techniques generally depend on proximity measures based on global similarity to infer relationships between genes. However, expression profiles have been observed to share local rather than global similarity. In this work, pair-wise expression profiles of genes are compared, local supports are computed, and a gene-gene network is constructed from them. The genes in a network with high pattern similarity form a coherent group that can encode proteins and take part in common biological processes. We applied the proposed approach to Yeast expression data to construct a Yeast gene-gene network. The biological significance of the results was evaluated against the GO annotation database and found satisfactory.

Keywords- genetic network; gene expression; pattern similarity; support; association network

I. INTRODUCTION

Gene expression technology, such as high-density DNA microarrays, allows us to monitor gene expression patterns at the genomic level. The advent of this technology brings the new challenge of extracting biologically relevant knowledge from such large gene expression data sets. As a result, data mining of gene expression data has become an important area of research for biologists. Suitable mining techniques give insight into gene-gene relationships, which may in turn reveal hidden facts about a species or microbe. Gene-gene relationships can be described through biological pathways. Biological pathways can in turn be represented as networks, broadly classified [1] as metabolic pathways, signal transduction pathways and gene regulatory networks or gene interaction networks. The most basic form of network is the gene-gene association network, which describes the inter-relationships between different genes. As shown in “Fig 1”, a gene association network can be defined mathematically as an undirected graph T = {G, E}, where G denotes the set of N genes (nodes) {G1, G2, …, GN} that participate in a common gene product formation process or biological process, and E is the set of edges {e1, e2, …, em} corresponding to the interrelationships between genes, each shown as an arc between two genes. Such an association network can be used later on to predict other complex networks, such as a gene interaction network. A network of co-regulated genes may form gene clusters that can encode proteins, which interact amongst themselves and take part in common biological processes.

In this study, we try to reconstruct the genetic association network from expression data by capturing local pair-wise similarity purely through pattern matching. We group genes closely by extracting a simple gene-gene association network based on local expression pattern similarity. In line with association mining techniques [2], we compute the support count of each pair of genes based on their matching profiles. Gene pairs showing high support, i.e. high pattern similarity, are used to construct the gene-gene association network. The genes participating in a network form a gene group with high co-regulation. We applied our approach to the Yeast Sporulation expression dataset and measured biological significance based on GO terms.

978-1-4244-9008-0/10/$26.00 ©2010 IEEE

Figure 1. Gene-gene association network: An undirected graph with five genes (nodes) and interrelationships between genes as edges of the graph.

The remainder of the paper is organized as follows. Section 2 discusses some existing work. Section 3 introduces the proposed method formally. The Results section (Section 4) describes the dataset used and reports experimental results; the biological significance of the results is evaluated and presented in the same section. Section 5 (Discussion and Conclusion) summarizes the work with concluding remarks.

II. RELATED WORK

Recently, gene interaction network or gene network construction has been gaining interest among the bioinformatics research community. A number of techniques have been proposed for such network construction [3], [4], [5], [6], [7], [8]. Existing techniques for finding gene-gene relationships can be broadly categorized as (i) computational approaches and (ii) literature based approaches.

Computational approaches mainly use statistical, machine learning or soft-computing techniques [3] as tools. A literature based approach, on the other hand, searches for available information on genes and their inter-relationships and constructs the network from such documented information. Literature based approaches are capable of building networks with high biological relevance but are computationally expensive. A biomedical literature search based technique has been used in [9] to construct a gene relation network (GRN) by mapping literature knowledge onto gene expression data.

measure the similarity in terms of angular deviation and regulation patterns. Given the angular deviations A = {a1, a2, …, aM-1} and regulation patterns R = {r1, r2, …, rM-1} of a gene, derived from its gene expression profile, the kth expression profiles of two genes are similar if the difference between their angular deviations over the same pair of adjacent conditions/samples is less than a given threshold and their regulation patterns are the same. Mathematically it can be defined as:

In [4], a bi-clustering based technique is proposed to extract a simple gene interaction network. A continuous-column multi-objective evolutionary bi-clustering is used to extract rank correlated gene pairs. Such pairs are then used to construct the gene network, relating a transcription factor to its target's expression level. Recently, mutual information [10], [11] or correlation coefficient [5], [6], [7], [8] based approaches have been proposed for extracting gene-gene interaction networks. Most of these computational techniques extract such networks based on global similarity patterns or literature mining, which is computationally expensive or may sometimes fail to give biologically relevant groups of genes or networks.

(2)
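The similarity condition labeled (2) and the count of Definition 1 can be written compactly. The following is only a sketch under stated assumptions: the symbol $\tau$ for the angular threshold is introduced here for illustration, $a_k^{(i)}$ and $r_k^{(i)}$ denote the $k$th angular deviation and regulation value of gene $G_i$, and the sum is assumed to run over the $M-1$ preprocessed columns.

```latex
% Assumed notation: \tau is a hypothetical symbol for the angular threshold.
\[
\mathrm{sim}_k(G_i, G_j) =
\begin{cases}
1, & \text{if } \bigl| a_k^{(i)} - a_k^{(j)} \bigr| < \tau \ \text{and}\ r_k^{(i)} = r_k^{(j)},\\[2pt]
0, & \text{otherwise,}
\end{cases}
\qquad
\text{Gene-Similarity}(G_i, G_j) = \sum_{k=1}^{M-1} \mathrm{sim}_k(G_i, G_j).
\]
```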

Definition 2. (Support): It is the ratio between the number of conditions or time points on which genes Gi and Gj are similar and the total number of conditions (N), i.e. Support(Gi, Gj) = Gene-Similarity(Gi, Gj) / N. (3)

Definition 3. (Connected): Two genes Gi and Gj are said to be connected (or to have an inter-relationship) if Support(Gi, Gj) meets a user-defined threshold.
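The support and connectivity definitions can be illustrated with a minimal Python sketch. It rests on several assumptions that are not fixed by the text: each preprocessed column is stored as a pair (angular deviation, regulation value), the denominator of the support is the number of preprocessed columns, and "meets the threshold" is read as greater than or equal. The function and parameter names are hypothetical.

```python
from typing import List, Tuple

# One preprocessed column: (angular deviation, regulation value in {1, 0, -1}).
Column = Tuple[float, int]


def gene_similarity(gi: List[Column], gj: List[Column], angle_threshold: float) -> int:
    """Count the columns on which two preprocessed profiles agree.

    Two columns agree when their angular deviations differ by less than
    angle_threshold and their regulation values are identical.
    """
    return sum(
        1
        for (ai, ri), (aj, rj) in zip(gi, gj)
        if abs(ai - aj) < angle_threshold and ri == rj
    )


def support(gi: List[Column], gj: List[Column], angle_threshold: float) -> float:
    """Fraction of columns on which the two genes agree (assumed denominator: column count)."""
    return gene_similarity(gi, gj, angle_threshold) / len(gi)


def connected(gi: List[Column], gj: List[Column],
              angle_threshold: float, min_support: float) -> bool:
    """Assumed reading of Definition 3: support at or above the user-defined threshold."""
    return support(gi, gj, angle_threshold) >= min_support


# Illustrative (made-up) profiles with four preprocessed columns each.
g1 = [(45.0, 1), (-30.0, -1), (10.0, 1), (0.0, 0)]
g2 = [(50.0, 1), (-28.0, -1), (40.0, 1), (0.0, 0)]
print(support(g1, g2, angle_threshold=10.0))  # three of four columns agree
```

Keeping the angular threshold and the support threshold as separate parameters mirrors the two user-defined settings that the method relies on.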

Next we present a support based method for constructing a gene-gene association network, or genetic network, from expression data.

Definition 4. (Gene-Gene Association Network): It is a subset of similar genes that are connected. In other words, for any two genes Gi, Gj ∈ T (a gene-gene association network), Gi and Gj are connected.

III. SUPPORT BASED GENETIC NETWORK CONSTRUCTION

Clustering based on global similarity measures like Euclidean distance or Pearson correlation may not always capture the true gene-gene relationship [4], [3], [12]. Most existing techniques also give little emphasis to pattern matching based on local similarity, although genes are commonly observed to share local rather than global functional similarity in their expression profiles. In this section, an approach based on local expression pattern similarity is presented for constructing a gene-gene association network. Below we give a theoretical representation of the proposed approach.

Lemma 1: For any two genes Gi, Gj, if Gi ∈ T, a gene-gene association network, and Gi is connected to Gj, then Gj ∈ T.

Proof: The lemma can be proved by contradiction. Assume Gi and Gj are two connected genes with Gi ∈ T but Gj ∉ T. By Definition 4, T is a subset of connected genes, and since Gi and Gj are connected, Gj ∈ T, which contradicts the assumption; hence the proof.

The following lemma is similarly trivial from Definitions 1-4 and Lemma 1.

Lemma 2: Let Gi and Gj be two genes and T1 and T2 be two gene-gene association networks. If Gi ∈ T1 and Gj ∈ T2, then Gi and Gj are not connected.

A. Terminology Used

Let G = {G1, G2, …, GN} be the set of N genes and R = {O1, O2, …, OM} be the set of M conditions or time points of a microarray dataset. The gene expression dataset D is represented as an N x M matrix DNxM, where each entry di,j corresponds to the logarithm of the relative abundance of the mRNA of a gene. The following definitions and lemmas provide the theoretical basis of the proposed method.

Lemma 3: Genes belonging to the same gene-gene association network are co-expressed or similar.

Proof: This lemma can also be proved by contradiction. Suppose two genes Gi, Gj ∈ T are not co-expressed. Since Gi and Gj are in the same network, they are connected (by Definitions 3 and 4). But any two connected genes are always similar, i.e. co-expressed (by Definitions 1 through 3), which contradicts the assumption; hence the proof.

Definition 1. (Gene Similarity): The gene similarity between a pair of genes Gi and Gj, i.e. Gene-Similarity(Gi, Gj), is defined as the total number of conditions or time points on which Gi and Gj match over DNxM. In other words, Gene-Similarity measures the total number of agreements for a gene pair over the conditions or time points and is defined as follows:

Similarly, the proof of the following lemma (the reverse case of Lemma 3) is trivial.

Lemma 4: Genes belonging to different gene networks are not co-expressed.

(1)

where sim(Gi, Gj) is the similarity between two gene expression profiles under a condition or time point. We


Next we discuss the preprocessing steps involved in capturing the angular deviation and regulation pattern of each expression profile.

C. Construction of association network

To compute the similarity between two target genes, we use both angular deviation and regulation pattern as matching criteria. Similarity values are used to calculate the support from which the association network is constructed. We read each row from the preprocessed database and check whether two target rows (genes), say Gi and Gj, are similar with respect to both angular deviation and regulation in a particular column. Following Definition 1, we compute the gene similarity value between the two genes and then the support between them. Using this support count, we check whether the two genes satisfy the given constraint for connectivity. This step is repeated for all pairs of genes. Based on all connected pairs, an adjacency matrix is computed as:

B. Preprocessing

To capture the pattern of each gene across conditions, a number of techniques adopt either the angle of the edge between every two conditions [13] or regulation patterns in terms of up- or down-regulation [14]. Angle or regulation pattern alone is ineffective in capturing the true expression pattern of a gene. We therefore compare two gene expression profiles in terms of both angular deviation and regulation pattern between two adjacent conditions, simultaneously. To capture both for each gene, we read a row of the original data with S expression values (conditions) and convert it into another row of (S-1) columns, each of which contains the angular deviation and regulation pattern of two adjacent conditions. We represent regulation information with the three values [1, 0, -1], denoting up-regulation, no change and down-regulation respectively. The regulation value in the kth column of a gene Gi, Gi(rk), based on two consecutive conditions (say, Ok-1 and Ok), can be calculated as:

(5)

where 1 indicates that a hypothetical relationship exists between the genes and 0 indicates the lack of a relation. A gene association network connecting the genes is constructed from this adjacency matrix. The above technique is applied to Yeast expression data to generate the Yeast genetic network. We hypothesize that a group of genes participating in a sub-network is responsible for a similar cellular function and process. The following sections discuss and examine this hypothesis.
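The adjacency matrix and the grouping of genes into sub-networks can be sketched as follows. This is an illustration only: the pairwise test is passed in as a hypothetical predicate `is_connected`, and sub-networks are taken to be the connected components of the resulting graph, which is one reading of Lemma 1 and not necessarily the exact grouping rule used in the experiments. Isolated genes are dropped, on the assumption that only linked genes are displayed.

```python
from typing import Callable, List


def adjacency_matrix(n: int, is_connected: Callable[[int, int], bool]) -> List[List[int]]:
    """Build a symmetric 0/1 matrix; entry [i][j] is 1 when genes i and j are connected."""
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if is_connected(i, j):
                adj[i][j] = adj[j][i] = 1
    return adj


def sub_networks(adj: List[List[int]]) -> List[List[int]]:
    """Return connected components with at least two genes (assumed grouping rule)."""
    n = len(adj)
    seen = [False] * n
    groups = []
    for start in range(n):
        if seen[start]:
            continue
        # Depth-first traversal from an unvisited gene.
        stack, group = [start], []
        seen[start] = True
        while stack:
            u = stack.pop()
            group.append(u)
            for v in range(n):
                if adj[u][v] and not seen[v]:
                    seen[v] = True
                    stack.append(v)
        if len(group) > 1:
            groups.append(sorted(group))
    return groups


# Illustrative (made-up) connectivity among six genes, indexed 0..5.
pairs = {(0, 1), (1, 2), (3, 4)}
adj = adjacency_matrix(6, lambda i, j: (i, j) in pairs)
print(sub_networks(adj))
```

All gene pairs are examined once, so the construction is quadratic in the number of genes, which is manageable for a filtered set of a few hundred genes.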

(4)

IV. RESULTS

This section details the experiments conducted, the dataset used and the biological validation of the results. We applied the proposed technique to real Yeast Sporulation gene expression data. The network was then visualized with the graph visualization tool GUESS (http://guess.wikispot.org). Since it is difficult to report all the networks, we present sample subnetworks generated from the Yeast dataset by the proposed approach.

For calculating angular deviation we have taken arc tangent between two adjacent expression levels as in [13]. Preprocessing steps are illustrated in “Fig 2, 3 & 4”. “Fig 2” shows a sample Yeast expression dataset. A profile plot of the sample has been provided in “Fig 3” for better understanding of the problem. Converted sample data after preprocessing is shown in “Fig 4”.
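The preprocessing step can be sketched in Python as below. Several details are assumptions made for illustration: adjacent conditions are treated as one unit apart on the horizontal axis, angles are reported in degrees, and "no change" means an exactly zero difference with no tolerance. The function name `preprocess` is hypothetical.

```python
import math
from typing import List, Tuple


def preprocess(profile: List[float]) -> List[Tuple[float, int]]:
    """Convert S expression values into S-1 (angular deviation, regulation) columns."""
    columns = []
    for prev, curr in zip(profile, profile[1:]):
        delta = curr - prev
        # Angle of the edge between two adjacent conditions (assumes unit spacing).
        angle = math.degrees(math.atan(delta))
        # Regulation value: 1 for up, 0 for no change, -1 for down.
        regulation = (delta > 0) - (delta < 0)
        columns.append((angle, regulation))
    return columns


# Illustrative (made-up) profile with four conditions, giving three columns.
print(preprocess([0.0, 1.0, 1.0, 0.0]))
```

Storing the angle and the regulation value together in each column lets the later comparison check both criteria in a single pass over the columns.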

We analyze the biological significance of the results in terms of the GO annotation database. Work is currently under way to apply the proposed technique to other datasets in order to generate different types of networks; preliminary results are satisfactory (details are beyond the scope of this article). Below we report on the Yeast dataset.

Figure 2. Sample expression data from Yeast Sporulation dataset

Yeast Sporulation: The sporulation data (http://cmgm.stanford.edu/pbrown/sporulation) is a collection of 6118 genes under 7 time points (0, 0.5, 2, 5, 7, 9 and 11.5 h) measured during the sporulation process of budding yeast. The data are log transformed. Among the genes in the original dataset, those whose expression levels did not change significantly during harvesting are ignored. This is determined with a threshold of 1.6 on the root mean square of the log2-transformed ratios. The resulting set consists of 474 genes. The dataset is normalized using Z-normalization [15], so that each row has mean 0 and variance 1. The normalized dataset is also available in [16].
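The filtering and normalization described above can be sketched as follows. The sketch assumes that rows already hold log2-transformed ratios, that a gene is kept when its root mean square is at or above the threshold, and that the population variance is used for Z-normalization; none of these choices is stated explicitly, and the function names are hypothetical.

```python
import math
from typing import Dict, List


def rms(values: List[float]) -> float:
    """Root mean square of a row of log2-transformed ratios."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def filter_genes(rows: Dict[str, List[float]], threshold: float = 1.6) -> Dict[str, List[float]]:
    """Keep genes whose RMS reaches the threshold (assumed comparison: >=)."""
    return {name: vals for name, vals in rows.items() if rms(vals) >= threshold}


def z_normalize(values: List[float]) -> List[float]:
    """Scale one row to mean 0 and variance 1 (population variance assumed)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return [0.0] * n  # a constant row carries no pattern
    return [(v - mean) / std for v in values]


# Illustrative (made-up) rows; the gene names are placeholders.
rows = {"geneA": [2.0, -2.0, 2.0], "geneB": [0.1, 0.2, 0.1]}
kept = filter_genes(rows)
normalized = {name: z_normalize(vals) for name, vals in kept.items()}
print(sorted(normalized))
```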

Figure 3. Profile plot of sample data

Figure 4. Sample data after preprocessing

The following section describes how the association network of connected genes is constructed.


indicated in terms of color displayed, are shown in “Fig 7”. It shows the branching of generalized molecular function into sub-functions such as structural molecule activity, RNA binding etc., and of cellular components into sub-components such as ribonucleoprotein complex, cytosol, macromolecular complex

A. Experimental Results

As mentioned in the previous sections, our experiments use the concept of support to draw links or interrelationships between genes. A gene pair satisfying the support criterion with respect to the user-defined threshold is considered connected, and we display only those genes that are linked to others with a support higher than the threshold. Out of the completely connected network of genes, we are left with networks of genes that are more strongly connected to each other. We display the association networks graphically using the graph visualization tool GUESS, with nodes representing genes and lines between nodes representing hypothetical associations of genes. Such networks are presented in “Fig 5”. The genes participating in an association network form a group of coherent or co-expressed genes. The groups are then evaluated using cluster profile plots, which show the normalized gene expression values of the genes of each co-expressed group with respect to the time points (Fig 6). The co-expressed groups are subsequently evaluated for their biological significance.

Figure 5. Sample Yeast sub networks.

etc., which are then clustered gene-wise to produce the final result. In other words, it displays the annotated genes in a sample association network that is enriched for GO categories.

Figure 6. Cluster profile plot of each subnetwork of (a) 18 genes and (b) 13 genes

In Table 1, the significant shared GO terms associated with three Yeast subnetworks consisting of 13, 14 and 16 genes are reported along with their cluster frequency (i.e. the number of genes involved in a particular cellular activity out of the total number of genes in the network) and p-values. Terms related to the Yeast data are given for the biological process, molecular function and cellular component ontologies. Out of the 13 genes from the first network, the genes (YJR145C, YPL198W, YML063W, YHR203C, YJR123W, YGL076C, YPL131W, YLR075w) are involved in the regulation of translation, while the genes (YJR145C, YPL090C, YLR075w, YPL198W, YHR203C, YML063W, YJR123W, YGL076C, YPL131W, YOR096W, YOL040C) are involved in the translation process. On the other hand, the genes (YNL069C, YOR096W, and YOL040C) are involved in structural molecule activity. Similarly, out of the 14 genes from the second network (constructed with the parameters =71 % and =10), the most significant processes are cell differentiation (YDR273W, YBR148W, YGL170C, YIL045W, YHR185C, YLR307W, YJL038C, YHR184W) and spore wall biogenesis (YDR273W, YHR185C, YLR307W, YHR184W); however, no significant GO terms are found for molecular function. A network of 16 genes is extracted with the above parameters. Out of which,

Biological interpretation of the results can be assessed by functional annotation of the genes participating in a cluster. We determined the biological relevance of the smaller groups, each comprising all the genes participating in a common association network from the Yeast Sporulation data, in terms of statistically significant GO terms validated using the GO annotation database (http://db.yeastgenome.org/cgibin/GO/goTermFinder). In this annotation database, genes are assigned to three structured, controlled vocabularies (ontologies) that describe gene products in terms of associated biological processes, components and molecular functions in a species-independent manner. Statistical significance is evaluated for the genes in each group by computing p-values, which signify how well they match the different GO categories. A smaller p-value (close to zero) indicates a better match, which in turn indicates a closer and more compact cluster structure. The significant GO terms (or parents of GO terms) for a set of 13 genes along with their p-values, with the significance being


Figure 7. GO terms and their parents from Yeast for Molecular Function & Cellular Process.


emphasis will be given to extend this method for reconstruction of a genetic interaction network in a computationally effective way.

majority of the genes (YMR242C, YPL081W, YLL045c, YNL301C, YIL052C, YER126c, YNL098C, YDR418W, YER074w, YNL119W, YER056c-a, YJL177W, YPL079W, YHR010W) are involved in gene expression and cellular protein metabolic process (YMR242C, YPL081W, YLL045c, YNL301C, YIL052C, YDR418W, YER074w, YNL119W, YER056c-a, YJL177W, YPL079W, YHR010W).


In the case of cellular components, genes from the first network (YJR145C, YPL090C, YLR075w, YPL198W, YHR203C, YML063W, YJR123W, YGL076C, YPL131W, YOR096W, YOL040C) belong to the ribosomal subunit, and genes (YJR145C, YML063W, YHR203C, YJR123W, YPL090C, YOR096W, YOL040C) are involved in the cytosolic small ribosomal subunit. Similarly, for the second network, the significant component is the prospore membrane (genes YDR273W, YBR148W, YHR184W). Genes (YMR242C, YPL081W, YLL045c, YNL301C, YIL052C, YER126c, YDR418W, YER074w, YER056c-a, YJL177W, YPL079W, YHR010W) from the third network belong to the intracellular non-membrane-bounded organelle and ribonucleoprotein complex.


V. DISCUSSION AND CONCLUSION

An effective support based algorithm for genetic network reconstruction from Yeast data has been reported in this paper. The technique finds biologically relevant gene pairs that may form a network of associated genes. The genes participating in a network are found to have similar functional behavior. The network is visualized using a graph visualization tool, and selected networks are biologically validated using the publicly available GO annotation database. The results support the claim that association networks based on simple gene-gene relations are capable of detecting biologically significant sets of genes, and show that the co-expressed groups formed from the network have high biological significance. They further establish that simple expression pattern matching can be helpful in finding biologically relevant genes. The proposed method has also been applied to other types of expression data to construct different genetic networks, with satisfactory results (not reported in this article). The gene-gene association network can be used further to predict more complex biological interaction networks. In future,


REFERENCES

[1] S. Tavazoie et al., “Systematic determination of genetic network architecture,” Nature Genetics, vol. 22, pp. 281–285, 1999.
[2] J. Han and M. Kamber, Data mining Concepts and Technique. San Francisco, CA: Morgan Kaufmann, 2006.
[3] S. Mitra, R. Das, and Y. Hayashi, “Genetic networks and soft computing,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2009.
[4] S. Mitra et al., “Gene interaction - an evolutionary biclustering approach,” Information Fusion, vol. 10, no. Special Issue on Natural Computing Methods in Bioinformatics, pp. 242–249, 2009.
[5] S. H. Jung and H. Cho, “Identification of gene interaction networks based on evolutionary computation,” in Proc of AIS ’2004, LNAI, Springer Verlag, Jan. 2005, pp. 428–439.
[6] A. Hin et al., “Global mapping of the yeast genetic interaction network,” Science, vol. 808, no. 303, 2004.
[7] A. Ozur, “Identifying gene-disease associations using centrality on a literature mined gene-interaction network,” Bioinformatics, vol. 24, pp. i277–i285, 2008.
[8] W. P. Kuo et al., “Functional relationships between gene pairs in oral squamous cell carcinoma,” in AMIA Symposium Proc., LNAI, Springer Verlag, 2003, pp. 371–375.
[9] T. Karopka, “Automatic construction of gene relation networks using text mining and gene expression data,” Informatics for Health and Social Care, vol. 29, pp. 169–183, 2004.
[10] M. B. Eisen et al., “Cluster analysis and display of genome-wide expression patterns,” in Proc of National Acad Sci, USA, 1998, pp. 14 863–14 868.
[11] A. J. Butte et al., “Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements,” in Pacific Symposium on Biocomputing 5, 2000, pp. 415–426.
[12] I. Prince et al., “Evaluation of gene-expression clustering via mutual information distance measure,” BMC Bioinformatics, vol. 8, no. doi: 10.1186/1471–2105/8/111, pp. 111–122, 2007.
[13] Z. Zhang et al., “Mining deterministic biclusters in gene expression data,” in Proc of BIBE’04, IEEE CS Press, 2004, pp. 283–290.
[14] A. Tanay and R. Shamir, “Discovering statistically significant biclusters in gene expression data,” Bioinformatics, vol. 18, no. Suppl. 1, pp. S136–S144, 2002.
[15] C. Cheadle et al., “Analysis of microarray data using z score transformation,” Informatics for Health and Social Care, vol. 5, no. 2, pp. 73–81, 2003.
[16] S. Bandyopadhyay et al., “An improved algorithm for clustering gene expression data,” Bioinformatics, vol. 23, pp. 2859–2865, 2007.
