Prediction of Nuclear Hormone Receptor Response Elements

0888-8809/05/$15.00/0 Printed in U.S.A.

Molecular Endocrinology 19(3):595–606 Copyright © 2005 by The Endocrine Society doi: 10.1210/me.2004-0101

Prediction of Nuclear Hormone Receptor Response Elements Albin Sandelin and Wyeth W. Wasserman Center for Genomics and Bioinformatics (A.S., W.W.), Karolinska Institutet, 113 37 Stockholm, Sweden; Centre for Molecular Medicine and Therapeutics (W.W.W.), British Columbia Children’s and Women’s Hospitals, and Department of Medical Genetics (W.W.W.), University of British Columbia, Vancouver, Canada V5Z 4H4 The nuclear receptor (NR) class of transcription factors controls critical regulatory events in key developmental processes, homeostasis maintenance, and medically important diseases and conditions. Identification of the members of a regulon controlled by a NR could provide an accelerated understanding of development and disease. New bioinformatics methods for the analysis of regulatory sequences are required to address the complex properties associated with known regulatory elements targeted by the receptors because the standard methods for binding site prediction fail to

reflect the diverse target site configurations. We have constructed a flexible Hidden Markov Model framework capable of predicting NHR binding sites. The model allows for variable spacing and orientation of half-sites. In a genome-scale analysis enabled by the model, we show that NRs in Fugu rubripes have a significant cross-regulatory potential. The model is implemented in a web interface, freely available for academic researchers, available at http://mordor.cgb.ki.se/NHR-scan. (Molecular Endocrinology 19: 595–606, 2005)

U

stitute established drug targets for combating disease (6), for instance in the treatment of diabetes [peroxisome proliferator-activated receptor (PPAR) ␥] (7), cardiovascular disease (estrogen receptor, PPAR ␥, and ligand X receptor) (8–10), and osteoporosis and autoimmune diseases (vitamin D receptor) (11). NR proteins are composed, in most cases, of several domains, including a DNA binding domain, a ligandbinding domain, and one or more activation domains (12). With certain exceptions, NRs operate as dimers—either as homodimers or in heterodimeric combinations (4, 5). The members of the NR family have been categorized based on function and evolution. The classification scheme suggested by Laudet et al. (13) divides NRs into seven groups (NR0NR6). Whereas most classes exhibit similar properties in DNA binding specificity, the NR0 and NR3 groups deserve special treatment. The NR0 group describes unusual NRs lacking either the ligand binding domain or DNA binding domain (13, 14). Because both domains are required for DNA binding by dimerization, these factors either cannot bind DNA or bind as monomers (15). The NR3 group (also known as nuclear steroid receptors) differs significantly in terms of both DNA binding specificity and functional ligands (16). The binding distinction of the NR0 and NR3 subclasses is not further addressed in this work. The NR1–2 and NR4–6 subclasses are collectively referred to as nuclear hormone receptors (NHRs). NHR dimers bind to regulatory sequences composed of two half-sites that are separated by variable spacing and can occur in different orientations (12, 17). Half-sites have

NDERSTANDING OF THE intricate biochemical systems regulating gene transcription is paramount for comprehension of the information flow from genes to proteins and, in extension, the dynamics of the cell (1). In response to perceived change in the state of a cell, expression of genes can be modulated by altered interactions of DNA binding transcription factors (TFs) with regulatory sequences. Although individual cis-regulatory sequences can be elucidated in laboratory studies, the process is laborious. Exhaustive characterization of the TF binding sites (TFBS) on a genome-scale using classical reporter gene methods is unrealistic. However, laboratory techniques can be complemented with computational methods to facilitate fast and efficient identification of candidate binding sites (2, 3). The nuclear receptor (NR) family constitutes a medically important group of TFs, involved in critical biological processes in all multicellular metazoans (4). The NR proteins are unique in directly linking regulation of gene transcription to the concentrations of key physiological signaling molecules, such as estrogen, testosterone, and glucocorticoids (4, 5). Thus, NRs conFirst Published Online November 24, 2004 Abbreviations: DR, Direct repeat; ER, everted repeat; GHMM, general HMM; HMM, Hidden Markov Model; IR, inverted repeat; NHR, nuclear hormone receptor; NR, nuclear receptor; PPAR, peroxisome proliferator-activated receptor; TF, transcription factor; TFBS, transcription factor binding site. Molecular Endocrinology is published monthly by The Endocrine Society (http://www.endo-society.org), the foremost professional society serving the endocrine community. 595

596

Mol Endocrinol, March 2005, 19(3):595–606

the consensus sequence AGGTCA, but most factors tolerate considerable variation from the consensus within functional binding sites. Combinations include the direct repeat (DR) AGGTCAnxAGGTCA (or the reverse complement TGACCTnxTGACCT), the inverted repeat (IR) AGGTCAnxTGACCT, and the everted repeat (ER) TGACCTnxAGGTCA (5), where the number of spacer nucleotides nx can vary from zero to eight nucleotides. For clarity, when we refer to a specific configuration, the repeat abbreviation (DR, ER, or IR), is followed by the number of spacers between half-sites (for example, a direct repeat with a 5-bp spacer is referred to as DR5). A few NHRs (e.g. retinoic acid receptor-related orphan receptor, nerve growth factor-induced clone B, and steroidogenic factor-1) are known to function in vivo as monomers, where the single half-site has an adenosine/thymidine-rich 5⬘ extension (4). This situation is not addressed in this work because single halfsites can be adequately modeled with well-established profile models termed position-weight matrices. Computational Prediction of NHR Target Sites Although the NHR class has been extensively studied over decades (5), no comprehensive computational prediction method has been available to aid experimentalists finding NHR regulatory elements. The lack of tools in this area is primarily due to the inability of the dominating framework for statistical models to describe the unique binding properties of NHRs (i.e. the variable half-site spacing and orientation). TF binding specificities are most commonly modeled using a position weight matrix—a statistical profile tabulating accepted nucleotide variation in each position of a protein-DNA interface (reviewed in Refs. 18 and 19). Reflecting the binding properties of TFs in vitro, matrix profiles generate a high rate of false-positive predictions. In practice, most predictions are assessed in combination with filtering approaches based on evolutionary selective pressure, phylogenetic footprinting (2, 20), or in composite models that identify functionally important combinations TFBS (21, 22). One of the disadvantages of the weight matrix model is its inability to represent variable spacing and inversions (23) of half-sites within TFBS. This inflexibility is usually not a concern because the structural constraints for monomeric protein-DNA interaction generally preclude such variation, but it is a substantial problem for the analysis of NHR target sites. Because NHR target sites display remarkable flexibility, a new bioinformatics framework is required. In this report, we describe a novel method for NHR binding site prediction using classification Hidden Markov Models (HMMs). HMMs are widely used in bioinformatics in problems ranging from the prediction of protein domains to the analysis of gene structure (23–26). Briefly, an HMM model consists of a set of states, where each state can emit symbols (nucleotides in this case) based on a

Sandelin and Wasserman • NHR Binding Site Detection

probability distribution. The emission probability for a certain nucleotide is specific for each state. States are connected in a chain-like structure, where the probability of moving from one state to another is termed a transition probability. Any specific path through the states for a given sequence will have a defined probability (effectively the product of all emission probabilities for respective nucleotides and all transitions probabilities for each move between states). It is possible to computationally calculate the optimal route through the chain for a given sequence (the Viterbi algorithm), or the total probability of the model producing a certain sequence (the Forward algorithm). Excellent textbooks and reviews are available discussing HMMs in detail (23, 24). Based on an extensive collection of verified binding sites, we have developed an HMM model capable of simultaneous recognition and classification of NHR binding sites. Analysis of the model’s performance indicates high sensitivity for classification of real sites, while maintaining selectivity comparable to weight matrix binding profiles for TFs outside the NR class. Application of the model to the genome of the pufferfish (Fugu rubripes) (27), identifies a high cross-regulatory potential between NR encoding genes. The bioinformatics method is implemented in a userfriendly web interface, available at http://mordor.cgb. ki.se/NHR-scan.

RESULTS A Curated Reference Collection of NHR Binding Sites The performance of predictive models is dependent on the quality and availability of reference data for training parameters. We collected 107 NHR binding sites from the biomedical literature, as described in Materials and Methods. Annotation was based on the unified NR nomenclature system (13). Although a few of the regulatory elements are derived from genes in insects and birds, most of the sites are mammalian— with 78% of all sites from human or rodent genes. The reference collection is available from the NHR-scan web site. Model Architecture The model architecture was chosen based on the biochemical characteristics of NHR binding and the volume of training data. Briefly, the model is a parallel classification-type HMM combined with circular local alignment architecture (23) (Fig. 1). From the background state B, it is possible to move to each type of match state: DR, ER, and IR, of which each consists of two half-site models separated by a spacer-state classifier. Direct repeats can be viewed in two directions; thus, a reverse complement version is required for an accurate description. In the


Mol Endocrinol, March 2005, 19(3):595–606 597

Fig. 1. Graphical Representation of the HMM Framework For greater understanding within the widest possible audience, we present the framework from a biological viewpoint (A) and as a standard HMM graph representation (B). Arrows denote possible transitions between states. From the background state, four principal transitions are possible: remain in the background state or move to any of the three match state chains corresponding to DRs, IRs, or ERs of half-sites. Each such chain of states consists of two half-site models (visualized by sequence logos) (A), separated by a set of spacer states that can generate different spacer configurations. At a more detailed level, DRs, IRs, and ERs are modeled by a pair of chains, corresponding to sites in forward and reverse strands (B).

case of ER and IR sites, one must consider the binding site characteristics closely. It would not be illogical to argue that a reverse complement model is duplicative, based on the palindromic pattern. However, imperfect palindromes are frequently ob-

served, resulting in a de facto orientation. Therefore, for each repeat class, we allow for two match states; one in each strand direction. As the default setting, the combined probability j of entering all match states is internally divided equally among the six

598


alternatives (three structures on each strand). This parameter can be adjusted. The half-site models are entirely consistent with position weight matrix models for TFBS because they only allow sequential transitions. The full model (including emission and transition probabilities for all states) can be retrieved from the web interface. Model Parameter Estimation Advanced machine learning algorithms, such as the Baum-Welch algorithm (28), are often employed for estimation of model parameters (i.e. emission and transition probabilities). However, if each nucleotide in the input sequences is labeled with respective state, emission parameter estimation is produced by simply counting each nucleotide found in each state in the input data (a maximum likelihood procedure). Transition probabilities can be estimated analogously (23). Because all nucleotides in the input sequences can be labeled based on the literature-derived data, training of emission and transmission probabilities is straightforward—comparable to constructing common binding site profiles by counting nucleotides in each column of an alignment. A pseudocount is often added to emission and transition counts to avoid overfitting the model to small data sets. We used a Laplace’s rule pseudocount procedure (i.e. in each state, add one to the count of observed nucleotides) (23) because the background nucleotide distribution in promoters is approximately uniform as judged from the Eukaryotic Promoter Database (29). Because DNA strands can be described in two directions, and it is not given that all experimentally confirmed sites are reported in the same strand, we used a Gibbs Sampling-based alignment algorithm [ANN-Spec (30)] to orient each reference site within each class (DR, IR, and ER). Because ANN-Spec does not produce gapped alignments, concatenated halfsites were aligned (i.e. the regions of variable spacing were removed). Gap state emissions (but not transitions) were estimated from a pool consisting of all gap-labeled nucleotides to avoid overfitting sparsely populated states. Scoring Algorithms To identify and classify candidate binding sites, the Viterbi algorithm was used to find the most probable state path through the model with a given sequence. Given the model, the Viterbi algorithm is applied to identify the most probable chain of states (labels) that are consistent with the observed sequence (23). A disadvantage with the Viterbi algorithm is that predicted binding sites cannot overlap. The alternative is to use the Forward algorithm on a fixed window that slides along the input sequence. The latter approach is computationally expensive but produces a probability score for the given subsequence and the model, corresponding to prediction scores using profile models.


We have implemented both methods in the web application, allowing the user to select between the faster Viterbi approach for long sequences and the slower Forward approach for detailed analysis. Model Evaluation Most models for binding site prediction are limited by inadequate supplies of confirmed binding sites for training and verification. We have evaluated the model sensitivity with the standard cross-validation (jackknife test) procedure and using two additional independent data sets (Fig. 1, A and B), as described in Materials and Methods. In a cross-validation test, one binding site is removed from the training set and subsequently assessed with the model created with the remaining sites. The procedure is repeated iteratively to cover every site in the training set. In any sensitivity test for an HMM, the outcome is strongly dependent on the probability j of entering match states initially. In the cross-validation test (Fig. 1A), a plateau is reached at approximately 85% correctly labeled nucleotides for DR and ER sites and approximately 75% for IR sites, whereafter less stringent values of j(⬎0.04) do not improve sensitivity substantially. To verify that the model is not overfitted to the original training data, we tested the method on two additional large data sets: response elements identified from in vitro selection studies (31, 32) and literature (33), respectively (Fig. 2B). The large majority of sites in both sets can be identified with low settings of j. Although almost all the in vitro sites are identified at stringent thresholds, the literature-derived sites are more varied (detectable with less-stringent settings). Nevertheless, the scores for the binding sites in both test sets are clearly distinct from the distribution of scores observed for the sequences in the randomly generated negative set (Fig. 1B). As in the previous test, a sensitivity plateau is reached after j ⬎ 0.04. Selectivity was measured as the rate of predictions on genomic sequence (Fig. 2C) (as described in Materials and Methods). For low values of j, the model on average predicts one site of each configuration per 3000–5000 bp, whereas higher j values results in a logarithmically decreased selectivity. The Limitations of Phylogenetic Footprinting for Detection of NHR Response Elements TFBS prediction is limited by poor selectivity (i.e. most predicted sites are not functional) (19, 34), due to the limited size of binding sites and the considerable sequence variation in binding sites tolerated by most TFs (reviewed in Refs. 18 and 19). Cross-species comparison (phylogenetic footprinting) is a viable method for addressing the selectivity problem (2). Such approaches involve the identification and alignment of promoter regions for orthologous genes in two separate species (for instance,



human and mouse), followed by analysis with TF binding profiles within preferentially conserved regions. The success of alignment-based phylogenetic footprinting is limited by two important alignment problems: 1) Sites have to be situated in regions that can be readily aligned. Regulatory regions located far from promoters (distal regions) tend to lack long continuous segments of conservation that are required for detection by global alignment algorithms, even if the binding sites in themselves are conserved. This is an important problem for the analysis of NHR TFBS because many of the sites are distally positioned (35–37). 2) Although regulatory properties of essential systems can remain intact over long periods of evolution, alignment-based phylogenetic footprinting lends itself best to moderate evolutionary distances such as that between human and mouse. Alignment-based analysis of gene promoters separated by longer evolutionary distances is often problematic because alignment methods often are unable to generate reliable alignments over the entire regulatory sequence due to the low level of sequence similarity (2). Analysis of NHR Binding Sites in the F. rubripes Genome The pufferfish F. rubripes (27) genome is a valuable reference, both for its compact size (⬃450 Mb) and its phylogenetic distance from humans (450 My). The F. rubripes and human genomes display remarkable similarity in the content and order of genes (38). In diverse studies, NR genes, in particular the ligand-binding and DNA-binding region, are remarkably well conserved over evolution in vertebrates (6, 39, 40). Successful regulatory analyses have been enabled by sequence comparisons between F. rubripes and mammals (41– 43), including cases that identified NR binding sites (40). Because the genes encoding NRs are well represented in F. rubripes (44), and in light of the importance of NRs in the regulation of critical genes, the F. rubripes genome is a powerful platform for computational prediction of regulatory elements. To motivate the use of the compact F. rubripes genome as a platform for prediction of NHR binding sites, we explored whether genes regulated by NHRs

Fig. 2. Model Evaluation A, Sensitivity (percentage correctly labeled nucleotides) as a function of the total probability j of entering match states, using cross-validation evaluation. B, Sensitivity evaluations using independent data sets: sets of 134 in vitro-selected sites and 69 literature-derived sites. Sensitivity is measured as the fraction of the known sites that are predicted as sites as a function of the cutoff j of entering match states. For comparison, we evaluated a set of randomly generated sites. C, Selectivity (average no. of base pairs/prediction) as a function of the total probability j of entering match states.

600



in man are likely to be similarly regulated in the pufferfish. We investigated whether a binding site configuration (e.g. DR1) active in human and rodent genes is more abundant than expected by chance in the promoters of homologous F. rubripes genes (as described in Materials and Methods). Homologs to NHR-regulated human and rodent genes were identified in the F. rubripes based on EnsEMBL orthology mappings. DR1 sites are significantly overrepresented in these F. rubripes promoters (P ⫽ 0.0015, cumulative binomial distribution (as explained in Materials and Methods), at cutoff 0.02). DR3, DR5, ER1, and ER6 configured sites were also significantly overrepresented (P ⬍ 0.05). The other site configurations, which are not populous in the reference collection, were not found to be significantly overrepresented.

searchers. The interface, available at http://mordor. cgb.ki.se/NHR-scan (Fig. 4), supports Viterbi and Forward algorithm scoring, user-defined thresholds (Fig. 4A), and multiple output formats (Fig. 4B). For convenience, gene sequences can be retrieved from the EnsEMBL (50) database by user-supplied genome coordinates for any species with EnsEMBL annotations. Because several research groups are focused on particular NRs and may have supplemental data, we have enabled the option to directly modify the HMM emission and transition parameters. For instance, custom half-site models and spacing requirements can be set within the interface (Fig. 4C). The online service is freely available without restriction.

Case Example: Evidence for Cross/Autoregulation of NHR Genes

DISCUSSION

Numerous NR genes in mammals are autoregulated or cross-regulated by other NRs (45–49). To demonstrate an application of the NHR binding site prediction model in a genome-scale application, we investigated the properties of cross/autoregulation. As discussed above, for genome analysis, the compact F. rubripes genome is a preferred target for analysis. Occurrence of NHR binding sites in upstream sequences of NR genes was compared with a background promoter set. Despite the inclusion of steroid NR genes (whose binding specificity is not addressed by the model) and the possible presence of atypical NRs (for instance, pseudogenes), we observed a clear overrepresentation (P ⬍ 0.05, cumulative binomial distribution) of certain site configurations: most notably the direct repeats DR1, DR3, DR4, DR5, and DR8. In a further analysis step, we investigated whether the six subclasses (NR1–6) of NRs had different cross/autoregulatory potential (Fig. 3). The same statistical analysis was performed for the six different classes of NRs found in the set. In general, cross/autoregulatory potential remains high in most groups; however, certain differences in site distributions are evident. The number of NR1, NR2, and NR3 class genes is sufficient to support analysis of site frequency (Fig. 3A). Although the regulatory potential of NR1 class is composed of a wide range of site configurations (DR4, DR5, DR8, ER1, and ER6 are all significantly overrepresented), the NR2 class clearly has a higher tendency for direct repeats (DR1 and DR3 are significantly overrepresented) (Fig. 3B). Interestingly, the NR3 (steroid receptors) class has a defined cross-regulatory potential, with an abundance of direct repeats that are unlikely to be bound by NR3 members (DR1 and DR3 are highly overrepresented). Web Interface We have implemented an Internet-based system to support efficient and user-friendly analyses by re-

Developed on the basis of a curated reference collection of NHR binding sites, the NHR-scan model produces accurate prediction and classification of potential binding sites. The HMM framework allows for a binding site model that tolerates both variable spacing between half-sites as well as variable orientation. NHR-scan will be a valued resource for the active community of researchers attempting to identify genes directly regulated by the biomedically important NHR transcription factors. Performance of the HMM Model Framework Sensitivity NHR-scan accurately describes the properties of NR binding sites. Even a small probability j of entering the match states confers high predictive sensitivity (Fig. 2A). At higher settings of j (⬎0.1) (i.e. less stringent), more than 90% of reference binding sites are identified, although some are mislabeled (data not shown). The latter is possibly due to 1) the few base pair changes required to transfer a weak site of one configuration type to another, and 2) potentially erroneous classifications of sites in the literature. Misclassifications were only observed for IR and ER sites, which, although consistent with the greater variation tolerated in these site formats, is most likely due the limited number of validated sites. The IR and ER site collections are each 12 sites deep, compared with 83 DR sites. The classification and detection of IR sites are less reliable than those produced by the DR and ER classifiers (Fig. 2A). This is entirely consistent with the sequence variability observed in one of the half-sites in the IR model—the IR sites are less informative than the others (Fig. 1). Because the data depth is low, we cannot estimate whether this is a sampling bias or whether it reflects a general property of IR sites. Clearly, both IR and ER models would benefit from incorporation of more functionally validated sites. Despite the limited data, the sensitivity tests using inde-



Fig. 3. Cross/Autoregulatory Potential of NRs in F. rubripes The promoters for NR genes of F. rubripes were analyzed for one or more occurrences of each site configuration. A–C, Radar plot visualization of how large fraction of the gene promoters (within NR classes I, II, and III, respectively) that contain one or more of a site configuration (e.g. DR1). D, Heat map representation showing each promoter analyzed, grouped by NR class and regulatory potential. Black or gray squares denote one or more occurrences of a given site configuration in the promoter of the gene. Color intensity of site configurations corresponds to statistical significance groupings within an NR class.

602



Fig. 4. NHR-scan Web Interface A, Sequence and threshold input interface; B, results page; C, optional model parameter modification interface.

pendent data sets confirm that the model is not overfitted. Selectivity The frequency of predicted binding sites is directly dependent on the user-specified probability of entering the match states. For predictive purposes, a structure for a NR binding site (i.e. a given spacing and orientation) is highly specific and thus generates few false predictions. Because all known half-site architectures are allowed by NHR-scan, predictions will be more frequent than those produced by a rigid model. For example, at threshold j ⫽ 0.01 we predict, on average, one site approximately every 2000 bp, while classifying approximately 80% of the labels correctly (Fig. 2A). Although the selectivity is higher than a corresponding analysis with a typical matrix model for TF binding sites (which produce predictions at an average rate of ⬃1/500 bp), the model is in itself not sufficiently selective for meaningful analysis of entire genomes. F. rubripes NRs Have a High CrossRegulatory Potential We have identified site configurations that occur preferentially in F. rubripes NR promoters, indicating a

considerable potential for cross/autoregulation of these genes. Furthermore, classes of receptors (13) exhibit differential preferences in terms of site configurations—most notably difference between the NR1, NR2, and NR3 classes. Because the NR3 class (steroid receptors) is functionally distinct from NR1 and NR2 (both in terms of ligand affinities and binding site preferences), it is not surprising that the regulatory sequences display distinct characteristics. The NR3 class proteins preferentially bind a distinct site configuration, consisting of an IR3 site with a different halfsite consensus (16). It is possible that some of the detected DR3 sites are nuclear steroid receptor binding sites; however, initial tests of the NHR-scan model on 43 validated nuclear steroid receptor sites at the same settings only detected six sites (data not shown). Supporting the hypothesis, several findings indicate that the different classes of receptors often are involved in cross-signaling (51, 52). For instance, in human, thyroid hormone receptor ␤ is known to be autoregulated by a proximal DR4 response element (47). Interestingly, thyroid hormone ␤ receptor (here labeled THB2_HUMAN) has one of the highest potentials for cross/autoregulation within the test set. In addition, recent studies indicate that the thyroid receptor ␤ gene is cross-regulated by retinoid X receptor (47). We


interpret the abundance of NHR binding sites to indicate potential for considerable cross/autoregulatory networks within the NR genes. Comparison to Similar Methods The only comparable published resource for NHR TFBS prediction is NUBI-scan (53). This service is based on an algorithm that scans sequences with a single half-site model [(using profile-based prediction methodology (18)] and subsequently evaluates pairs of half-sites with a chosen configuration (for example DR4) by the product of half-site scores. Thus, the user must directly specify possible site configurations for analysis to proceed. Large-scale sensitivity and selectivity performance data has not been published for NUBI-scan. Several characteristics distinguish NHRscan from NUBI-scan, including the depth of the training data (the NUBI-scan model is based on 11 reference sites), the model framework, and the assessment of predictions. In addition to obvious algorithmic differences, the statistical methods for evaluating hits are different. NHR-scan uses established HMM scoring algorithms, whereas NUBI-scan reports predictions exceeding 6–7 Z scores. These Z scores are derived by comparison to a distribution of all scores generated by the input sequence; NUBI-scan results are dependent on the length and composition of the sequence analyzed Roelet et al. (54) have developed a prediction method for the CCAAT-binding transcription factor/ nuclear factor I transcription factor that binds DNA with two half-sites with a variable spacing (but not orientation), based on a generalized profile concept (54). Although not stated as a HMM, the algorithm has certain similarities to NHR-scan but is focused on another biological problem. Future Directions Models for Each Type of NR Class (or Individual Factors). NHR-scan models the binding sites for all NHRs and is thus classified as a meta-model. To model individual factors, or subclasses of NRs, would be a logical extension of the present work. However, the body of experimentally validated in vivo sites is insufficient for most factors. Promising developments are underway to increase the collections of known sites, both through laboratory methods for chromatin immunoprecipitation of TFs bound to active sites (55) and through systematic evaluations of sites in vitro (56). Input Sequence Weighting. Because the binding sites in the reference collection cannot be regarded as fully representative (for example, the collection contains nine ligand X receptor sites but only three testicular orphan receptor-4 sites), implanting a system to weight the contribution of each site could have utility. Such a system could be based on the similarity between the binding domains of the NHRs or between


target genes and modified by the number of sites available. It is hard to evaluate weighting procedures given the data limitations. Because a weighting system would produce a more generalized model, sensitivity is likely to decline, and we would have no means of measuring potential benefits. Regulatory Modules. The focus on isolated binding sites of a single TF is biologically unrealistic. This is reflected in the limited selectivity of binding site predictions. In the recent years, cis-regulatory module detection (i.e. detection of functional clusters of binding sites) has proven a potent method for delineating regulatory regions on a genome scale (57, 58). NHR-scan can conceptually be integrated in the mathematical frameworks employed for such module discrimination (which usually are based on profile representations). For example, the liver regulatory model by Krivan and Wasserman (22), which includes a profile model for the HNF4 NHR, would likely be enhanced with an HMM model describing the variable spacing of HNF4 binding sites. Background Model. Background models for describing sequence characteristics can dramatically impact the performance of predictive algorithms. NHR-scan presently uses a simple nucleotide distribution. It is known that the intricacy of genomic DNA requires additional model complexity for rigorous representations, for instance as an n-state Markov chain (30, 59). This concept could be extended to include different background Markov chains for various genomic features: repeat regions, coding regions, and regions with regulatory potential (60). Conclusion We have presented a HMM framework for predicting NHR binding sites in genomic sequences. The model is sensitive, while maintaining respectable selectivity. A genome-scale application of the model highlighted the high cross-regulatory potential of F. rubripes NRs. The NHR-scan model provided through a user-friendly Internet interface will positively impact efforts to identify functional NHR binding sites.

MATERIALS AND METHODS Data Collection One hundred and seven experimentally verified binding sites for 18 known NHRs were collected from the biomedical literature. For a site to be accepted into the collection, we required convincing experimental evidence (at least validated for binding in vitro and demonstrated to mediate a response through transfection assays). A further requirement was a positive identification of the interacting NHR and an experimentally based identification of the half-site positions. We included the reported bound sequence and three flanking nucleotides in both directions. To avoid overfitting, we discarded sites for the same factors in promoters from orthologous genes. For example, only a known rat site for PPAR-␥ regulating the malic enzyme gene was incorporated, not the known site in the orthologous mouse gene.

604 Mol Endocrinol, March 2005, 19(3):595–606

Algorithms For scoring purposes, we used the general HMM (GHMM) C⫹⫹ library (http://ghmm.org/), where the model is described as an XML object using the XMLIO library (http:// xmlio.sourceforge.net/). The combination of libraries enables efficient central processing unit usage and human-readable models and outputs. It should be noted that the model can be used by any HMM software package that allows for custom state architectures. Cross-Comparison (Jack-Knife) Assessment of Sensitivity Both sensitivity and selectivity are dependent on both the quality of the individual match states and the probability of entering a match state. For these reasons, we performed the test with varying probabilities of entering match states. The value j describes the combined probability of entering any match state; thus, the probability of entering any of the match chain is j/6 because each match state is composed of a forward- and a reverse-strand chain. We measured performance as the fraction of correctly labeled nucleotides (background, DR, IR, ER, or spacer). Assessment of Sensitivity Using Independent Data Sets We collected two sets of independent binding site data: 1) 134 in vitro selected binding sites from the JASPAR database (31) and other sources (32); and 2) 69 known HNF-4 binding sites from literature sources, collected by Ellrot et al. (33), These data sets consists of binding sites of length approximately 15–20 bp. Unlike the training set, we have no information about the exact positions of binding sites, and thus the half-site configuration for a given site is not known. Thus, we evaluate sensitivity by measuring how many of the sequences are predicted to contain a site, given a cutoff j. Each site was extended with three randomly generated nucleotides because the model for technical reasons requires that the first and last nucleotide in any analyzed sequence must be labeled background. Selectivity Assessment For selectivity assessment, we analyzed the vertebrate representative subset of the Eukaryotic Promoter Database (29) for NHR sites, using different thresholds as above. Because most promoters are not expected to harbor a functional NHR site, we measure selectivity as the frequency with which we predict predicted sites (average number of base pairs between two sites). Coregulation F. rubripes/Human and Rodent Homologs to human and rodent NHR-targeted genes were located in F. rubripes genome (assembly 17.2.1) using Swissprot (61) identifiers and EnsEMBL (50) gene annotations. For each gene, we retrieved F. rubripes promoter sequence corresponding to ⫺1700 to ⫹300 relative to the predicted transcription start site. For an F. rubripes promoter to be classified as positive, we required that one or more sites of the same type and half-site configuration as found in the reference human/rodent gene were predicted, using a set cutoff. For clarity, this measure does not correspond to the quantity of sites in a single promoter because a promoter can only be positive (i.e. containing one or more sites) or negative (containing no sites). With a given cutoff j, for each site configuration, a certain number of positives are reported out of the number of genes suspected to be regulated by that particular site type. For example, nine genes in the reference


set (human/rodent) are regulated by DR1 sites. Setting the cutoff j to 0.02, we classify seven of the corresponding nine F. rubripes promoters as positive. We estimated the chance of classifying a single promoter as positive by analyzing nonoverlapping promoters of the same size from the vertebrate representative subset of the Eukaryotic Promoter Database (29) (in total 1,072,239 bp). With such an analysis, we have an assessment of the chance p of randomly drawing a positive promoter using a certain type of site and configuration. The chance of drawing r positive sites out of n trials (corresponding to the number of reference genes regulated by the same site configuration) follows a binomial distribution: n! P n,r ⫽ p r共 1 ⫺ p 兲 n ⫺ r r! 共 n ⫺ r 兲 for a given type of site and cutoff j, where p is the probability of finding one or more sites of the correct type in a background sequence of the same length. We report the significance as the cumulative density B, that is: the probability of more than r successes in n trials.

冘 n

B⫽

P n, y

y⫽r⫹1

For clarity, the probability j was divided equally among the match states (DR, ER, and IR) in this evaluation NR Cross/Autoregulatory Potential Using EnsEMBL (50) annotation, we retrieved the 57 genes containing the characteristic C4H4 NR DNA binding domain [INTERPRO (62) domain IPR001628]. This is comparable to the 68 genes identified using protein similarity searches in another study (44). As shown previously, we analyzed the region ⫺1700 to ⫹300 relating to the TSS, using the same statistical method. We divided the NRs into subclasses (as defined by Ref. 13), based on annotation or sequence similarity [(Blastp (63)]. Four of the 57 sequences could not be classified reliably. No promoter pair displayed any noteworthy similarity, based on ClustalW (64) analysis. Additional data and resources, including color versions of images, training data, and model framework in GHMM XML format, are available at the NHR-scan web page http://mordor.cgb.ki.se/ NHR-scan.

Acknowledgments We are grateful for many valuable discussions with Lukas Ka¨ll and Hui Gau (Center for Genomics and Bioinformatics, Department for Biosciences at Novum, Karolinska Institute).

Received March 9, 2004. Accepted November 17, 2004. Address all correspondence and requests for reprints to: Wyeth W. Wasserman, Centre for Molecular Medicine and Therapeutics, 3018-950 West 28th Avenue, Vancouver, British Columbia, Canada V5Z 4H4. E-mail: wyeth@cmmt. ubc.ca. This work was supported by funds from the Pharmacia Corporation to the Center for Genomics and Bioinformatics, as well as funds from the Canadian Institutes of Health Research (to W.W.W.).

REFERENCES 1. Davidson E 2001 Genomic regulatory systems. Development and evolution. San Diego: Academic Press


2. Lenhard B, Sandelin A, Mendoza L, Engstrom P, Jareborg N, Wasserman WW 2003 Identification of conserved regulatory elements by comparative genome analysis. J Biol 2:13 3. Brazma A, Jonassen I, Vilo J, Ukkonen E 1998 Predicting gene regulatory elements in silico on a genomic scale. Genome Res 8:1202–1215 4. Owen GI, Zelent A 2000 Origins and evolutionary diversification of the nuclear receptor superfamily. Cell Mol Life Sci 57:809–827 5. Mangelsdorf DJ, Thummel C, Beato M, Herrlich P, Schu¨tz G, Umesono K, Blumberg B, Kastner P, Mark M, Chambon P, Evans RM 1995 The nuclear receptor superfamily: the second decade. Cell 83:835–839 6. Maglich J, Sluder A, Willson T, Moore J 2003 Beyond the human genome: examples of nuclear receptor analysis in model organisms and potential for drug discovery. Am J Pharmacogenomics 3:345–353 7. Gurnell M, Savage DB, Chatterjee VK, O’Rahilly S 2003 The metabolic syndrome: peroxisome proliferator-activated receptor ␥ and its therapeutic modulation. J Clin Endocrinol Metab 88:2412–2421 8. Joseph SB, Tontonoz P 2003 LXRs: new therapeutic targets in atherosclerosis? Curr Opin Pharmacol 3:192–197 9. Francis GA, Annicotte JS, Auwerx J 2003 PPAR agonists in the treatment of atherosclerosis. Curr Opin Pharmacol 3:186–191 10. Ogita H, Node K, Kitakaze M 2003 The role of estrogen and estrogen-related drugs in cardiovascular diseases. Curr Drug Metab 4:497–504 11. Pinette KV, Yee YK, Amegadzie BY, Nagpal S 2003 Vitamin D receptor as a drug discovery target. Mini Rev Med Chem 3:193–204 12. Branden C, Tooze J 1999 Introduction to protein structure. 2nd ed. New York: Garland Publishing, Inc. 13. Laudet V, Auwerx J, Gustafsson J, Wahl W 1999 A unified nomenclature system for the nuclear receptor superfamily. Cell 97:161–163 14. Ruau D, Duarte J, Ourjdal T, Perriere G, Laudet V, Robinson-Rechavi M 2004 Update of NUREBASE: nuclear hormone receptor functional genomics. Nucleic Acids Res 32:D165–D167 15. Gronemeyer H, Laudet V 1995 Transcription factors 3: nuclear receptors. Protein Profile 2:1173–1308 16. Nelson CC, Hendy SC, Shukin RJ, Cheng H, Bruchovsky N, Koop BF, Rennie PS 1999 Determinants of DNA sequence specificity of the androgen, progesterone, and glucocorticoid receptors: evidence for differential steroid receptor response elements. Mol Endocrinol 13: 2090–2107 17. Rastinejad F, Perlmann T, Evans RM, Sigler PB 1995 Structural determinants of nuclear receptor assembly on DNA direct repeats. Nature 375:203–211 18. Stormo GD 2000 DNA binding sites: representation and discovery. Bioinformatics 16:16–23 19. Wasserman WW, Sandelin A 2004 Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet 5:276–287 20. Wasserman WW, Palumbo M, Thompson W, Fickett JW, Lawrence CE 2000 Human-mouse genome comparisons to locate regulatory sites. Nat Genet 26:225–228 21. Wasserman WW, Fickett JW 1998 Identification of regulatory regions which confer muscle-specific gene expression. J Mol Biol 278:167–181 22. Krivan W, Wasserman WW 2001 A predictive model for regulatory sequences directing liver-specific transcription. Genome Res 11:1559–1566 23. Durbin R, Eddy S, Krogh H, Mitchison G 1999 Biological sequence analysis. Cambridge, UK: Cambridge University Press 24. Eddy SR 1998 Profile hidden Markov models. Bioinformatics 14:755–763


25. Pedersen AG, Baldi P, Brunak S, Chauvin Y 1996 Characterization of prokaryotic and eukaryotic promoters using hidden Markov models. Proc Int Conf Intell Syst Mol Biol 4:182–191 26. Reese MG, Kulp D, Tammana H, Haussler D 2000 Genie—gene finding in Drosophila melanogaster. Genome Res 10:529–538 27. Aparicio S, Chapman J, Stupka E, Putnam N, Chia JM, Dehal P, Christoffels A, Rash S, Hoon S, Smit A, Gelpke MD, Roach J, Oh T, Ho IY, Wong M, Detter C, Verhoef F, Predki P, Tay A, Lucas S, Richardson P, Smith SF, Clark MS, Edwards YJ, Doggett N, Zharkikh A, Tavtigian SV, Pruss D, Barnstead M, Evans C, Baden H, Powell J, Glusman G, Rowen L, Hood L, Tan YH, Elgar G, Hawkins T, Venkatesh B, Rokhsar D, Brenner S 2002 Wholegenome shotgun assembly and analysis of the genome of Fugu rubripes. Science 297:1301–1310 28. Baum LE 1972 An equality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities 3:1–8 29. Praz V, Perier R, Bonnard C, Bucher P 2002 The Eukaryotic Promoter Database, EPD: new entry types and links to gene expression data. Nucleic Acids Res 30:322–324 30. Workman CT, Stormo GD 2000 ANN-Spec: a method for discovering transcription factor binding sites with improved specificity. Pac Symp Biocomput:467–478 31. Sandelin A, Alkema W, Engstrom P, Wasserman WW, Lenhard B2004 JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res 32:D91–D94 32. Feltkamp D, Wiebel FF, Alberti S, Gustafsson JA 1999 Identification of a novel DNA binding site for nuclear orphan receptor OR1. J Biol Chem 274:10421–10429 33. Ellrott K, Yang C, Sladek FM, Jiang T 2002 Identifying transcription factor binding sites through Markov chain optimization. Bioinformatics 18(Suppl 2):S100–S109 34. Fickett JW 1996 Quantitative discrimination of MEF2 sites. Mol Cell Biol 16:437–441 35. Wang H, Faucette S, Sueyoshi T, Moore R, Ferguson S, Negishi M, LeCluyse EL 2003 A novel distal enhancer module regulated by pregnane X receptor/constitutive androstane receptor is essential for the maximal induction of CYP2B6 gene expression. J Biol Chem 278: 14146–14152 36. Pedigo N, Zhang H, Koszewski NJ, Kaetzel DM 2003 A 5⬘-distal element mediates vitamin D-inducibility of PDGF-A gene transcription. Growth Factors 21:151–160 37. Sueyoshi T, Negishi M 2001 Phenobarbital response elements of cytochrome P450 genes and nuclear receptors. Annu Rev Pharmacol Toxicol 41:123–143 38. Elgar G 1996 Quality not quantity: the pufferfish genome. Hum Mol Genet 5(Spec No):1437–1442 39. Wentworth JM, Schoenfeld V, Meek S, Elgar G, Brenner S, Chatterjee VK 1999 Isolation and characterisation of the retinoic acid receptor-␣ gene in the Japanese pufferfish, F. rubripes. Gene 236:315–323 40. Abrahams BS, Mak GM, Berry ML, Palmquist DL, Saionz JR, Tay A, Tan YH, Brenner S, Simpson EM, Venkatesh B 2002 Novel vertebrate genes and putative regulatory elements identified at kidney disease and NR2E1/fierce loci. Genomics 80:45–53 41. Gilligan P, Brenner S, Venkatesh B 2002 Fugu and human sequence comparison identifies novel human genes and conserved non-coding sequences. Gene 294:35–44 42. Aparicio S, Morrison A, Gould A, Gilthorpe J, Chaudhuri C, Rigby P, Krumlauf R, Brenner S 1995 Detecting conserved regulatory elements with the model genome of the Japanese puffer fish, Fugu rubripes. Proc Natl Acad Sci USA 92:1684–1688 43. Bagheri-Fam S, Ferraz C, Demaille J, Scherer G, Pfeifer D 2001 Comparative genomics of the SOX9 region in human and Fugu rubripes: conservation of short regula-

606 Mol Endocrinol, March 2005, 19(3):595–606

44.

45.

46.

47.

48.

49.

50.

51.

52.

53.

tory sequence elements within large intergenic regions. Genomics 78:73–82 Maglich JM, Caravella JA, Lambert MH, Willson TM, Moore JT, Ramamurthy L 2003 The first completed genome sequence from a teleost fish (Fugu rubripes) adds significant diversity to the nuclear receptor superfamily. Nucleic Acids Res 31:4051–4058 Waxman DJ 1999 P450 gene induction by structurally diverse xenochemicals: central role of nuclear receptors CAR, PXR, and PPAR. Arch Biochem Biophys 369:11–23 Tata JR 1996 Metamorphosis: an exquisite model for hormonal regulation of post-embryonic development. Biochem Soc Symp 62:123–136 Sakurai A, Suzuki S, Katai M, Miyamoto T, Kobayashi H, Nakajima K, Ichikawa K, DeGroot LJ, Hashizume K 1995 Transcriptional regulation of human thyroid hormone receptor ␤ 1 gene expression: effect of human retinoid X receptor and identification of a transcriptional silencer region. Mol Cell Endocrinol 110:103–112 Camp HS, Whitton AL, Tafuri SR 1999 PPAR␥ activators down-regulate the expression of PPAR␥ in 3T3–L1 adipocytes. FEBS Lett 447:186–190 Grad JM, Dai JL, Wu S, Burnstein KL 1999 Multiple androgen response elements and a Myc consensus site in the androgen receptor (AR) coding region are involved in androgen-mediated up-regulation of AR messenger RNA. Mol Endocrinol 13:1896–1911 Brooksbank C, Camon E, Harris MA, Magrane M, Martin MJ, Mulder N, O’Donovan C, Parkinson H, Tuli MA, Apweiler R, Birney E, Brazma A, Henrick K, Lopez R, Stoesser G, Stoehr P, Cameron G 2003 The European Bioinformatics Institute’s data resources. Nucleic Acids Res 31:43–50 Pascussi JM, Gerbal-Chaloin S, Drocourt L, Maurel P, Vilarem MJ 2003 The expression of CYP2B6, CYP2C9 and CYP3A4 genes: a tangle of networks of nuclear and steroid receptors. Biochim Biophys Acta 1619:243–253 Houston KD, Copland JA, Broaddus RR, Gottardis MM, Fischer SM, Walker CL 2003 Inhibition of proliferation and estrogen receptor signaling by peroxisome proliferator-activated receptor ␥ ligands in uterine leiomyoma. Cancer Res 63:1221–1227 Podvinec M, Kaufmann MR, Handschin C, Meyer UA 2002 NUBIScan, an in silico approach for prediction of nuclear receptor response elements. Mol Endocrinol 16: 1269–1279


54. Roulet E, Bucher P, Schneider R, Wingender E, Dusserre Y, Werner T, Mermod N 2000 Experimental analysis and computer prediction of CTF/NFI transcription factor DNA binding sites. J Mol Biol 297:833–848 55. Shannon MF, Rao S 2002 Transcription. Of chips and ChIPs. Science 296:666–669 56. Roulet E, Busso S, Camargo AA, Simpson AJ, Mermod N, Bucher P 2002 High-throughput SELEX SAGE method for quantitative modeling of transcription-factor binding sites. Nat Biotechnol 20:831–835 57. Aerts S, Van Loo P, Thijs G, Moreau Y, De Moor B 2003 Computational detection of cis-regulatory modules. Bioinformatics 19(Suppl 2):II5–II14 58. Johansson O, Alkema W, Wasserman WW, Lagergren J 2003 Identification of functional clusters of transcription factor binding motifs in genome sequences: the MSCAN algorithm. Bioinformatics 19(Suppl 1):I169–I176 59. Bailey TL, Elkan C 1995 The value of prior knowledge in discovering motifs with MEME. Proc Int Conf Intell Syst Mol Biol 3:21–29 60. Elnitski L, Hardison RC, Li J, Yang S, Kolbe D, Eswara P, O’Connor MJ, Schwartz S, Miller W, Chiaromonte F 2003 Distinguishing regulatory DNA from neutral sites. Genome Res 13:64–72 61. Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O’Donovan C, Phan I, Pilbout S, Schneider M 2003 The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 31:365–370 62. Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Barrell D, Bateman A, Binns D, Biswas M, Bradley P, Bork P, Bucher P, Copley RR, Courcelle E, Das U, Durbin R, Falquet L, Fleischmann W, Griffiths-Jones S, Haft D, Harte N, Hulo N, Kahn D, Kanapin A, Krestyaninova M, Lopez R, Letunic I, Lonsdale D, Silventoinen V, Orchard SE, Pagni M, Peyruc D, Ponting CP, Selengut JD, Servant F, Sigrist CJ, Vaughan R, Zdobnov EM 2003 The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Res 31:315–318 63. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ 1997 Gapped BLAST and PSIBLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402 64. Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, Higgins DG, Thompson JD 2003 Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res 31:3497–3500

Molecular Endocrinology is published monthly by The Endocrine Society (http://www.endo-society.org), the foremost professional society serving the endocrine community.