Vol. 20 Suppl. 1 2004, pages i109–i115 DOI: 10.1093/bioinformatics/bth908
BIOINFORMATICS
Functional inference from non-random distributions of conserved predicted transcription factor binding sites
Christoph Dieterich∗, Sven Rahmann and Martin Vingron
Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestrasse 63-73, D-14195 Berlin, Germany
Received on January 15, 2004; accepted on March 1, 2004
1 INTRODUCTION
With our current knowledge of mammalian genome sequences as well as the plethora of experimental results on transcription factors, their binding sites and target genes, the hope is rising to uncover the genomic regulatory network systematically. Both new kinds of large-scale experiments (ChIP on a chip) and theoretical analyses [for a review see (Qiu, 2003)] are extending the traditional biochemical approaches. Several factors, however, are hampering these efforts, the theoretical ones in particular. Most prominently, although we know representative sequences for many transcription factor binding sites, it is still hard to use these to extrapolate and predict new binding sites. The traditional weight matrix approach is highly error prone, and its false positive rate in particular is a serious obstacle to generating useful predictions. To remedy this situation, one takes into account additional, ideally independent, information. Evolutionary conservation of transcription factor binding patterns is one indicator that is frequently used to obtain more specific predictions (Hardison, 2000). In earlier work, we reported the generation of a database of evolutionarily conserved upstream binding sites in the human and mouse genomes (Dieterich et al., 2003b). In Dieterich et al. (2003a), we utilized this information to determine possible regulatory mechanisms behind the co-expression of a group of genes as determined in a DNA-microarray experiment. Here, we propose exploiting the distribution of evolutionarily conserved, predicted binding sites over different groups of co-expressed genes as an indicator of the functionality of the predicted binding sites. The rationale is that when a factor plays a role in the co-expression of a group of genes, we ought to observe these functional binding sites on top of the random occurrences of predicted binding sites. A deviation from the random distribution for a particular factor should thus indicate a functional role for these binding sites. We will exemplify this for the human cell cycle data, with the genes that peak in a particular phase of the cell cycle taking the role of the co-expressed group.
∗ To whom correspondence should be addressed.
Bioinformatics 20(Suppl. 1) © Oxford University Press 2004; all rights reserved.
Gene expression data are taken from Whitfield et al. (2002) who studied expression levels of genes in cycling HeLa cells. Based on the expression levels, they identified genes that are periodically up-regulated and assigned each of them to one out of five expression clusters
ABSTRACT
Motivation: Our understanding of how genes are regulated in a concerted fashion is still limited. In particular, complex phenomena such as cell cycle regulation in multicellular organisms are poorly understood. We therefore investigated conserved predicted transcription factor binding sites (TFBSs) in man–mouse upstream regions of genes that can be associated with a particular cell cycle phase in HeLa cells. TFBSs were predicted from selected binding site motifs (represented by position weight matrices, PWMs) based on a statistical approach. A regulatory role for a transcription factor is more probable if its predicted TFBSs are enriched in upstream regions of genes that are associated with a subset of cell cycle phases. We tested for this association by computing exact P-values for the observed phase distributions under the null distribution defined by the relative amount of conserved upstream sequence of genes per cell cycle phase. We considered non-exonic and 5′-untranslated region (5′-UTR) binding sites separately and corrected for multiple testing by taking the false discovery rate into account.
Results: We identified 22 non-exonic and 11 5′-UTR significant PWM phase distributions, with one expected false discovery. Many of the corresponding transcription factors (e.g. members of the thyroid hormone/retinoid receptor subfamily) have already been associated with cell cycle regulation, proliferation and development. It appears that our method is a suitable tool for detecting putative cell cycle regulators in the realm of known human transcription factors.
Availability: Further details and supplementary data can be obtained from http://corg.molgen.mpg.de/cellcycle
Contact: [email protected]
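The test outlined in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: the null distribution is modelled as a multinomial whose cell probabilities are proportional to the conserved upstream sequence length per phase, the exact P-value is defined as the total probability of all outcomes that are no more probable than the observed one (one common definition of an exact multinomial test; the statistic actually used may differ), and the false discovery rate is controlled with the Benjamini-Hochberg step-up rule. The function names and all numbers are hypothetical.

```python
from itertools import combinations
from math import exp, lgamma, log


def multinomial_log_pmf(counts, probs):
    """Log-probability of `counts` under a multinomial with cell probabilities `probs`."""
    log_p = lgamma(sum(counts) + 1)
    for k, p in zip(counts, probs):
        if k > 0 and p == 0.0:
            return float("-inf")  # impossible outcome
        log_p -= lgamma(k + 1)
        if k > 0:
            log_p += k * log(p)
    return log_p


def compositions(n, cells):
    """Yield every way to distribute n hits over `cells` ordered cells (stars and bars)."""
    for bars in combinations(range(n + cells - 1), cells - 1):
        prev, counts = -1, []
        for b in bars:
            counts.append(b - prev - 1)
            prev = b
        counts.append(n + cells - 2 - prev)
        yield tuple(counts)


def exact_multinomial_p(observed, seq_lengths):
    """Exact P-value: summed probability of all outcomes no more probable than `observed`.

    Assumption: the null cell probabilities are the relative amounts of
    conserved upstream sequence per phase (`seq_lengths`).
    """
    total = sum(seq_lengths)
    probs = [length / total for length in seq_lengths]
    log_obs = multinomial_log_pmf(observed, probs)
    p_value = 0.0
    for counts in compositions(sum(observed), len(observed)):
        log_p = multinomial_log_pmf(counts, probs)
        if log_p <= log_obs + 1e-12:  # small tolerance for floating-point ties
            p_value += exp(log_p)
    return min(p_value, 1.0)


def benjamini_hochberg(p_values, fdr=0.05):
    """Indices rejected by the Benjamini-Hochberg step-up rule (an assumed FDR procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * fdr / m:
            cutoff = rank
    return sorted(order[:cutoff])


# Hypothetical example: hits of one PWM over five phases, and the amount of
# conserved upstream sequence (bp) per phase. The numbers are made up.
hits = (9, 1, 0, 1, 1)
conserved_bp = (20000, 25000, 30000, 22000, 18000)
p = exact_multinomial_p(hits, conserved_bp)
```

Full enumeration visits C(n + 4, 4) outcomes for n hits over five phases, which stays manageable for a few dozen hits per PWM; larger counts would call for a different strategy.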
2 METHODS
2.1 Detecting conserved binding sites
Upstream regions and TFBS predictions An upstream region encompasses 5′ genomic DNA extending from the start of translation. Homologous man/mouse upstream regions of at most 15 000 bp were inspected for TFBSs. This upper bound stems from the observation that most promoter regions are