Functional Associations in Regulatory Networks

Christoph Dieterich, Ho-Joon Lee, Thomas Manke, Martin Vingron

Data and Data Representation

Recent advances in functional genomics have produced vast amounts of relational data, with the aim of gaining insight into the architecture of regulatory networks and the molecular mechanisms which control most cellular processes. Bioinformatics can support this task both by organising the data and by generating hypotheses for dedicated experiments. In this context, information on protein-DNA binding has proven to be of particular value. On the one hand, bioinformatics has a long tradition of identifying functional sequence elements such as transcription factor binding sites; on the other hand, we have seen enormous progress in the experimental techniques for obtaining genome-wide binding data. For our purposes, the binding information can be represented as a Boolean binding matrix (after some appropriate discretisation), which is illustrated in Figure 1. Rather than merely storing the binding information, we believe that these data lend themselves to a simpler and more meaningful representation in terms of groups of genes which are co-regulated by groups of transcription factors. Such modules are the conceptual and functional building blocks of regulatory networks.

Figure 1: A Boolean binding matrix denotes the presence (black) or absence (white) of a transcription factor binding site in the promoter region of a gene. Biological prejudice motivates the decomposition of such a matrix into functional modules.
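As an illustration of the representation described above, the following Python sketch builds a small Boolean binding matrix from a list of (factor, gene) pairs. The function name, the factor and gene labels and the pairs themselves are hypothetical, and the discretisation step is assumed to have happened already.

```python
def build_binding_matrix(bindings, factors, genes):
    """Return a Boolean matrix with one row per gene and one column per factor."""
    col = {tf: j for j, tf in enumerate(factors)}
    row = {gene: i for i, gene in enumerate(genes)}
    matrix = [[False] * len(factors) for _ in genes]
    for tf, gene in bindings:
        # True marks a binding site of `tf` in the promoter of `gene`.
        matrix[row[gene]][col[tf]] = True
    return matrix


# Toy input: hypothetical labels and pairs, not taken from the study.
factors = ["TF_A", "TF_B", "TF_C"]
genes = ["gene1", "gene2", "gene3", "gene4"]
bindings = [
    ("TF_A", "gene1"), ("TF_B", "gene1"),
    ("TF_A", "gene2"), ("TF_B", "gene2"),
    ("TF_C", "gene4"),
]
matrix = build_binding_matrix(bindings, factors, genes)
```

In this toy matrix, TF_A and TF_B share the promoters of gene1 and gene2, which is the kind of dense block that a decomposition into modules would aim to pick out.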

Modules in Yeast - a simple model organism

Our work on the yeast regulatory network (1) has revealed that a number of transcription factors tend to form larger associations with common functionality and enhanced interactions among them (Fig. 2). This supports the notion that specific cellular programs are governed by synergistic control rather than by the action of individual transcription factors. Combinatorial control mechanisms are expected to be even more prevalent in multicellular organisms.
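One way to picture how pairwise associations between transcription factors could be derived from their global binding patterns is sketched below in Python. The text does not specify the similarity measure, so the Jaccard index used here is an assumption, as are the cutoff, the function names and the toy data.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets of bound genes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def threshold_graph(targets, cutoff):
    """Return edges (tf1, tf2, score) for all factor pairs with score >= cutoff."""
    edges = []
    for tf1, tf2 in combinations(sorted(targets), 2):
        score = jaccard(targets[tf1], targets[tf2])
        if score >= cutoff:
            edges.append((tf1, tf2, score))
    return edges


# Hypothetical binding patterns: each factor maps to the set of genes it binds.
targets = {
    "TF_A": {"g1", "g2", "g3"},
    "TF_B": {"g1", "g2", "g4"},
    "TF_C": {"g5", "g6"},
}
edges = threshold_graph(targets, cutoff=0.4)
```

With these toy sets only TF_A and TF_B are connected; clusters of factors would then correspond to connected groups in the resulting graph.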

Figure 2: Synergy network. This threshold graph shows pairwise associations between transcription factors and was obtained using the similarity between global binding patterns of transcription factors (columns of the binding matrix). Transcription factor pairs are assembled into larger clusters which often have a clearly defined functional role, as denoted by different colours.

Sequence Conservation - the road to higher organisms

As large-scale in-vivo experiments for protein-DNA binding are only just emerging for higher organisms, we mostly rely on sequence data to search for possible binding sites of transcription factors with known binding patterns. However, sequence motifs alone are not powerful enough to capture the full complexity of flexible and context-dependent regulation, and they often give rise to many false and many missed annotations. Additional constraints need to be invoked to distinguish functional binding sites from unspecific motif occurrences. One approach we pursue in our group is that of “phylogenetic footprinting”: sequence conservation among sufficiently distant species points to functional sequence elements, such as binding sites, which have been maintained throughout evolution. We have generated a large database of conserved non-coding regions for many vertebrate species (human, mouse, rat, dog, ...) and restricted the search space for transcription factor binding sites to the conserved promoter regions only (2). Where our predictions can be compared with experimental evidence (e.g. for E2F binding), this approach does indeed reduce the number of false positives, while the number of missed annotations remains small. To be useful as a predictive tool, however, further constraints (such as co-expression) ought to be imposed, as exemplified by our study of SRF targets among muscle-specific genes (3). In the following we employ a complementary constraint which invokes the concept of modularity and assumes that many functional groups of genes are regulated by groups of transcription factors.
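The restriction of the search space to conserved promoter regions can be pictured as a simple interval filter. The Python sketch below assumes that motif hits and conserved regions are given as half-open coordinate intervals on the same promoter. The function names, the coordinates and the rule that a hit must lie entirely inside one conserved region are illustrative assumptions, not a description of the database or pipeline used in (2).

```python
def in_conserved_region(hit_start, hit_end, regions):
    """True if the hit [hit_start, hit_end) lies entirely inside one conserved region."""
    return any(start <= hit_start and hit_end <= end for start, end in regions)


def filter_hits(hits, regions):
    """Keep only motif hits (factor, start, end) that fall into conserved regions."""
    return [hit for hit in hits if in_conserved_region(hit[1], hit[2], regions)]


# Hypothetical conserved regions and motif hits on one promoter (made-up coordinates).
conserved = [(100, 180), (400, 460)]
hits = [
    ("E2F", 120, 128),  # inside the first conserved region
    ("SRF", 175, 185),  # overlaps the region boundary, discarded
    ("E2F", 300, 308),  # outside any conserved region, discarded
    ("SRF", 405, 415),  # inside the second conserved region
]
kept = filter_hits(hits, conserved)
```

A stricter or looser overlap rule would trade missed annotations against false positives, which is the balance the comparison with experimental evidence is meant to assess.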


Biclustering - search for modularity

In the case of yeast, synergistic transcription factors could be revealed by simply clustering the global binding patterns associated with each regulator. For mammalian genomes, in contrast, the promoter regions are often poorly defined, and much larger intergenic regions result in much higher noise levels in binding site predictions, even after conservation has been invoked. Any global association measure would be dominated by noise. Moreover, we do not expect synergistic transcription factors to co-occur in all promoters, but rather in a small subset of promoters where their binding is functional. Biclustering algorithms formalize the task of identifying such subsets in noisy and heterogeneous data. The immediate benefit is that gene modules are defined directly with respect to subsets of (possibly overlapping) regulators, and vice versa. The algorithmic challenge is to develop heuristics which can tackle this computationally hard problem efficiently. We have rephrased the search for significant modules as a bipartite graph problem, where each factor is linked to all its associated promoter regions (Fig. 3). A simple heuristic is used to identify unexpectedly dense subgraphs which score highly with respect to random expectations. Such subgraphs (modules) may be suggestive of biological function (4).
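To make the idea of scoring subgraphs against random expectation concrete, the following Python sketch enumerates pairs of factors, takes the promoters they share as a candidate module and compares the observed overlap with the overlap expected if both factors bound promoters independently. This is a deliberately simplified stand-in restricted to factor pairs, not the heuristic of (4); the scoring ratio, the function name, the parameters and the toy data are all assumptions.

```python
from itertools import combinations


def candidate_modules(targets, n_genes, min_genes=2):
    """Score factor pairs by observed versus expected number of shared promoters."""
    modules = []
    for tf1, tf2 in combinations(sorted(targets), 2):
        shared = targets[tf1] & targets[tf2]
        if len(shared) < min_genes:
            continue
        # Expected overlap under independence is deg1 * deg2 / n_genes,
        # so the observed/expected ratio is shared * n_genes / (deg1 * deg2).
        ratio = len(shared) * n_genes / (len(targets[tf1]) * len(targets[tf2]))
        modules.append(((tf1, tf2), sorted(shared), ratio))
    # Highest-scoring (most unexpectedly dense) candidates first.
    return sorted(modules, key=lambda m: m[2], reverse=True)


# Hypothetical bipartite graph: each factor is linked to the promoters it binds.
targets = {
    "TF_A": {"g1", "g2", "g3", "g4"},
    "TF_B": {"g1", "g2", "g3", "g9"},
    "TF_C": {"g3", "g4", "g5", "g6", "g7", "g8"},
}
modules = candidate_modules(targets, n_genes=20)
```

A fuller heuristic would extend such seeds to larger factor sets and assess their scores against randomised graphs, which this sketch does not attempt.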

Figure 3: The binding matrix denotes the presence/absence of a transcription factor binding site in the promoter region of a target gene. It is thought to comprise a number of modules (dense submatrices). An equivalent description of this Boolean scenario invokes the notion of a bipartite graph in which one seeks to identify densely connected subgraphs which are unlikely to occur at random.

From statistical significance to biological relevance

The implicit assumption of many bioinformatics algorithms is that biologically meaningful patterns can be detected as statistically significant overrepresentation. The biclustering methodology discussed above attempts to extract associations that appear significant in the light of the underlying topology of the corresponding graph model. The question arises whether those modules are indeed biologically relevant. To answer this question, we compared the predicted modules systematically with a large functional catalogue from the Gene Ontology Consortium. For a given set of genes (from the derived modules) we consider only the functional category with the best overlap as quantified by the hypergeometric score. We then ask whether this overlap is significant if one assumes that genes are selected proportionally to the conserved size of their promoter region. This screen results in several modules for which a clear functional prevalence can be observed (see examples in Tab. 1). These associations are prime candidates for experimental validation as they are supported by 1) evolutionarily conserved binding sites, 2) unexpected associations in the bipartite graph model and 3) functional enrichment of putative target genes.

| TF-module | Target genes (GO category: genes) | Overlap |
|---|---|---|
| E2F Elk-1 TFIID | homeobox transcription factors, CNS development: ENSG00000164853 EVX2 HLX1 HOXA10 HOXD3 NKX2-3 NR4A2 POU4F1 SIX3 ZFHX1B | 9/10 |
| ATF δCREB NF-YA | organogenesis; neurogenesis and myogenesis: ANK2 NTNG1 GOLGA1 MGAT2 MOG MORF4L2 MYH7 NEUROG1 OSBP SIX1 ZFHX1B | 7/12 |
| ATF UBP-1 NP-TCII | retinoic acid receptor activity: ALX3 RARA RARG RXRG SOX1 ZFHX1B | 3/6 |

Table 1: Selected modules for which the functional overlap was significant (p < 0.0002). The last column denotes the fraction of putative target genes which belong to the specified GO category.
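The hypergeometric score for the overlap between a module and a functional category can be sketched in Python as an upper-tail probability. The overlap of 9 out of 10 genes is taken from Table 1, whereas the category size and the background size below are hypothetical placeholders. The sketch covers only the plain hypergeometric score and not the subsequent test in which genes are selected proportionally to the conserved size of their promoter region.

```python
from math import comb


def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which belong to the functional category."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return hits / total


# k and n follow the first row of Table 1 (9 of 10 module genes in the category).
# K (category size) and N (background size) are hypothetical placeholders.
p = hypergeom_tail(k=9, n=10, K=300, N=15000)
```

Smaller values indicate an overlap that is less likely to arise when the module genes are drawn uniformly at random from the background.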

Beyond Transcription Factor Binding - Data Integration

The validation of modules with functional annotations should only be considered a simple step in a much more general framework. Since no single experimental method (e.g. protein-DNA binding) can capture the full diversity of biological associations, we are seeking a combined approach which can incorporate diverse and complementary data from other functional genomics projects. In the case of yeast we have incorporated a large compendium of expression data (6000 genes × 1000 conditions) to screen regulated genes for functional enrichment. This is similar in spirit to the literature-based functional annotation, but more comprehensive and quantitative, as we can directly measure the degree of co-expression for putative modules in certain conditions. Many known associations are recovered, as exemplified by the binding of transcription factors Hir1 and Hir2. Their combination defines a group of 7 genes which are highly correlated over many conditions and include histone genes, which are known targets of Hir1 and Hir2 (Fig. 4).
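One plausible way to quantify the degree of co-expression of a putative module is the mean pairwise Pearson correlation of its genes over a set of conditions, sketched below in Python. The text does not state which measure was used, so this choice is an assumption, and the gene labels and expression values are made up.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def mean_pairwise_correlation(profiles):
    """Average Pearson correlation over all gene pairs of a module."""
    pairs = combinations(sorted(profiles), 2)
    return mean(pearson(profiles[g1], profiles[g2]) for g1, g2 in pairs)


# Hypothetical module: three genes measured under four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [1.0, 3.0, 2.0, 5.0],
}
score = mean_pairwise_correlation(profiles)
```

Restricting the profiles to a subset of conditions before calling the function would give a condition-specific score of the kind the text alludes to.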


Figure 4: Gene expression data can support modules obtained from binding data. In this example, the module defined by transcription factors Hir1 and Hir2 shows high correlation over several expression data sets.

The biclustering formalism introduced above also allows for a more direct and combined analysis of diverse data sources, such as protein-DNA binding data, compartmental localisation, membership assignment in protein complexes or other data from functional genomics. In such a setting biclusters will generally draw support from a combination of different sources and generate new hypotheses about the regulation of specific processes.

References

[1] T. Manke, R. Bringas, M. Vingron, Correlating protein-DNA and protein-protein interaction networks, J. Mol. Biol. 333 (1) (2003) 75–85.
[2] C. Dieterich, B. Cusack, H. Wang, K. Rateitschak, A. Krause, M. Vingron, Annotating regulatory DNA based on man-mouse genomic comparison, Bioinformatics 18 Suppl 2 (2002) S84–90.
[3] U. Philippar et al., The SRF target gene FHL2 antagonizes RhoA/MAL-dependent activation of SRF, Mol. Cell 16 (6) (2004) 867–880.
[4] T. Manke, C. Dieterich, M. Vingron, Detecting Functional Modules of Transcription Factor Binding Sites in the Human Genome, Lecture Notes in Computer Science 3318 (Springer 2005).
