Critical Reviews in Toxicology, 32(2):67–112 (2002)
In Silico Approaches to Mechanistic and Predictive Toxicology: An Introduction to Bioinformatics for Toxicologists Mark R. Fielden, Jason B. Matthews, Kirsten C. Fertuck, Robert G. Halgren, and Tim R. Zacharewski Department of Biochemistry and Molecular Biology, National Food Safety and Toxicology Center and Institute for Environmental Toxicology, Michigan State University, East Lansing, MI, U.S.A *
Corresponding author: Tim Zacharewski, Ph.D, Department of Biochemistry & Molecular Biology, 223 Biochemistry Building, Wilson Road, Michigan State University, East Lansing, MI 48824. Tel: 517-355-1607. Fax: 517-353-9334. E-mail:
[email protected]
ABSTRACT: Bioinformatics, or in silico biology, is a rapidly growing field that encompasses the theory and application of computational approaches to model, predict, and explain biological function at the molecular level. This information rich field requires new skills and new understanding of genome-scale studies in order to take advantage of the rapidly increasing amount of sequence, expression, and structure information in public and private databases. Toxicologists are poised to take advantage of the large public databases in an effort to decipher the molecular basis of toxicity. With the advent of high-throughput sequencing and computational methodologies, expressed sequences can be rapidly detected and quantitated in target tissues by database searching. Novel genes can also be isolated in silico, while their function can be predicted and characterized by virtue of sequence homology to other known proteins. Genomic DNA sequence data can be exploited to predict target genes and their modes of regulation, as well as identify susceptible genotypes based on single nucleotide polymorphism data. In addition, highly parallel gene expression profiling technologies will allow toxicologists to mine large databases of gene expression data to discover molecular biomarkers and other diagnostic and prognostic genes or expression profiles. This review serves to introduce to toxicologists the concepts of in silico biology most relevant to mechanistic and predictive toxicology, while highlighting the applicability of in silico methods using select examples. KEY WORDS: in silico, bioinformatics, toxicogenomics, microarray, cluster analysis, expressed sequence tags, SAGE, position weight matrix, ligand docking, molecular modeling, multiple sequence alignment, BLAST, single nucleotide polymorphisms.
1040-8444/02/$.50 © 2002 by CRC Press LLC
I. INTRODUCTION Emerging challenges of managing and interpreting large amounts of complex biological data have given rise to a quickly growing field of computational biology, or bioinformatics. While this field is not new, it is rapidly becoming indispensable to life scientists as biology evolves toward high throughput approaches to whole genome molecular analyses. Bioinformatics can be defined as ‘a scientific discipline that encompasses all aspects of biological information acquisition, processing, storage, distribution, analysis and interpretation’.1 This interdisciplinary field
attempts to link the theory and application of mathematics, biology, and computer science in order to understand the significance and relationships between quantitative and qualitative biological data at the cellular, biochemical, and molecular level. To this end, numerous public, commercial, and proprietary software applications and databases have been developed to allow researchers to analyze physical and genetic data to infer functionality in a variety of model organisms, as well as humans. Computational, or in silico, analyses lend substantial predictive power in addition to traditional methods of experimental biology. In silico analyses can provide alternative
and efficient means to generate new hypotheses, aid in designing appropriate experiments, and in interpreting large amounts of information from genome-scale studies. As a result, computational biology has the potential to influence all levels of biologically based research and increase the scope and efficiency of both basic and applied science. The field of toxicology is rapidly evolving in an effort to assess the genetic, molecular, and biochemical basis of toxicity for natural and anthropogenic chemicals, as well as therapeutic agents. Consequently, toxicologists must now adapt and equip themselves with new tools to leverage the increasing amount of biological information. The purpose of this review is to introduce the concepts behind contemporary computational methods applicable to the study of toxicology. The review focuses on those methods that take advantage of the large databases of publicly available sequence, expression, and structure data. A list of computational methods and public databases mentioned in this review can be found in Table 1, as well as a glossary of commonly used terms in the appendix. This review does not address knowledge-based systems that attempt to predict ADMET properties in silico (absorption, distribution, metabolism, excretion, and toxicity) based on physical and chemical descriptors.2,3 Predicting ADMET properties is desirable for lead candidate selection in high throughput drug screening programs; however, these methods are usually commercially or internally developed and rely on private databases of experimental data, thus limiting their use to commercial environments. Examples are also used where available to illustrate the applicability of in silico methods to a variety of toxicological interests.
II. IDENTIFICATION OF TARGET TISSUES Xenobiotics can preferentially target specific organs, tissues, or cell types due to their affinity for target proteins, which are often expressed in a tissue-specific manner or expressed at higher levels in the target tissue. Targeting of specific cell types can also occur by virtue of the lack of expression of proteins involved in detoxifying the foreign compound or by expression of proteins that bioactivate the foreign compound. Knowledge of the distribution, abundance, and ontogeny of a target protein in the whole animal can aid in the identification of susceptible tissues and help characterize toxicity. The expression of target proteins and their mRNAs has traditionally been examined experimentally using immunological, hybridization, or PCR-based detection methods. However, with the advent of expressed sequence tags (ESTs)4 in conjunction with the increased availability of cDNA libraries of diverse origin, it has become possible to qualitatively and quantitatively measure mRNAs of target genes in a tissue- and stage-specific manner by searching and counting ESTs in public databases.
A. Expressed Sequence Tags (ESTs) ESTs are partial DNA sequences of cDNA clone inserts, which were originally used to rapidly identify expressed genes and expedite gene discovery and genome mapping.4 ESTs are generated by first constructing a cDNA library and sequencing each derived clone in a single pass on the 5′ or 3′ end of the cDNA insert. The anonymous sequence reads are then compared, typically using the BLAST suite of programs, to large sequence databases to establish a putative clone identity (Figure 1). When the intent is to discover novel genes or a representative set of genes from a cDNA library, it is preferable to reduce the redundancy of the clone set. A cDNA library can be normalized in such a way as to decrease the relative abundance of very highly expressed mRNAs and increase the abundance of rare mRNAs. 5, 6 Libraries can also be generated by subtractive hybridization, which results in an enrichment of cDNAs that are differentially expressed between two mRNA populations, or a combination of subtraction and normalization techniques. 7-9 While normalized or subtracted cDNA libraries are useful for discovering novel or rare transcripts, they are unsuitable for estimating relative transcript abundance between two mRNA populations.
TABLE 1 Summary of Publicly Available Databases and Tools for In Silico Toxicology
FIGURE 1. Construction of cDNA libraries and generation of expressed sequence tags (ESTs) for gene identification. An mRNA sample from a tissue or cell line is converted into double-stranded cDNA, cloned into a vector, and transformed into a bacterial host. Individual clones are sequenced on the 5’ or 3’ end of the insert. Hundreds to thousands of single pass sequence reads from a library can be grouped into clusters to reduce the redundancy of the sequences. Each cluster represents a unique expressed transcript. The number of ESTs within a gene cluster, relative to the total number of ESTs sequenced from its respective cDNA library, represents the relative level of expression for that gene in the sample.
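The counting step described in the Figure 1 legend can be sketched in a few lines of Python. This is only an illustration: the cluster labels and read counts are invented, and the function name `relative_expression` is a hypothetical helper, not part of any database interface.

```python
from collections import Counter


def relative_expression(est_cluster_ids):
    """Map each cluster ID to (ESTs in cluster) / (total ESTs from the library)."""
    counts = Counter(est_cluster_ids)
    total = sum(counts.values())
    return {cluster: n / total for cluster, n in counts.items()}


# Hypothetical library: one entry per single-pass read, labeled with the
# cluster to which the read was assigned (invented data).
library_reads = ["cluster_A"] * 6 + ["cluster_B"] * 3 + ["cluster_C"]
expression = relative_expression(library_reads)

for cluster, fraction in sorted(expression.items()):
    print(cluster, fraction)
```

Such fractions are only meaningful for libraries that have not been normalized or subtracted, because those procedures deliberately distort transcript abundance.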
B. EST Clustering Public EST sequence data are curated in the Database of Expressed Sequence Tags (dbEST)10 (Table 1), which is maintained by NCBI, the National Center for Biotechnology Information. The Institute for Genomic Research (TIGR) also maintains a database, termed the Gene Index, of publicly available EST sequence data from dbEST and from EST sequence data produced by TIGR11 (Table 1). Both NCBI and TIGR cluster EST sequences within species, such that subsets of EST sequences that belong together, based on overlapping sequence, are grouped into a larger set, where each set theoretically represents a unique expressed gene from that species (Figures 1 and 2). Such sets are represented by clusters in the NCBI UniGene database12 and consensus sequences in the TIGR Gene Index11 (Figure 2). The advantage of the TIGR Gene Index is the conservative assembly of the EST sequences within each cluster to form a consensus sequence. UniGene does not attempt to assemble the ESTs within each cluster to produce a consensus sequence. As a result, it is possible to find several TIGR consensus sequences contained within one UniGene cluster. These multiple consensus sequences can represent different splice variants of the same gene. The methods used to cluster or assemble the sequences within UniGene (http://www.ncbi.nlm.nih.gov/UniGene/build.html) and the Gene Indices13 have been described previously in detail. While both methods ‘clean’ the EST data by removing vector contaminants, repeats, and low-complexity sequence, the quality of the data relies heavily on the quality of the original sequence data. Both databases allow query searches based on accession number or other unique identifier, gene name, tissue, and nucleotide or protein sequence similarity using BLAST. TIGR currently maintains Gene Indices for nine organisms, including human, mouse, rat, Drosophila, and zebrafish, as well as plant and microbial species.
Unigene currently clusters EST sequence data for human, mouse, rat, bovine, and zebrafish. Unigene clusters are updated frequently based on new sequence data; however, there is instability in the cluster assignments between revisions of the database as a result of the clustering algorithm used. Therefore, when the clustering process is repeated, ESTs
may be assigned to a different cluster such that clusters with lone ESTs can be orphaned and subsequently retired. GenBank accession numbers for each EST should be used, rather than cluster IDs, when tracking the identity of clones. TIGR has a more stable identification system due to the more conservative assembly algorithm, but this database is updated less frequently and may not contain the most recent EST sequence submissions. In addition to grouping EST sequences into unique clusters, TIGR and Unigene also provide annotation for each cluster or consensus sequence, including a putative gene name. The identity and corresponding annotation of each EST sequence is derived by comparing the EST sequence against known sequences already in a database. The annotation for each EST is tentative because it is based on the limited sequence information available for each cDNA clone and is susceptible to sequencing errors and errors based on the propagation of dubious annotations from the records of other homologs. Many ESTs are termed singletons, because they do not bear a sufficient sequence identity to any other sequence in the database and represent the only member of a cluster. Most ESTs are of unknown identity, having variable degrees of similarity to other known genes in the database.
C. Serial Analysis of Gene Expression SAGE (Serial Analysis of Gene Expression) is a high-throughput sequencing-based method for the generation of expression data.14 Like EST sequencing, this method provides a rapid means to identify and quantitate the number of transcripts in an mRNA population. The SAGE method is diagrammed in Figure 3. The short (9 to 14 base) tag sequence produced by the SAGE method theoretically provides enough information to uniquely identify a transcript, while the number of times a tag is observed is proportional to the expression level of the corresponding transcript. Like EST sequencing, this method sacrifices accuracy in the assignment of tags to genes, and in the quantitation of gene expression, in return for high throughput. Furthermore, a 9 to 14 base tag is not always sufficient to uniquely identify a
FIGURE 2. Example of a partial output of a TIGR Tentative Human Consensus (THC) sequence THC532369 (androgen receptor). (A) Sequence assembly of full-length cDNAs, protein translations, and ESTs from various public sources. Arrow indicates the direction of sequence in relation to the consensus sequence. The sequence was obtained from GenBank and other DNA and protein databanks. The THC 532369 represents the consensus sequence derived from the overlapping partial sequences. (B) Listed for each THC are sequence source (Src; e.g., GenBank), unique identifier (EST Id), GenBank accession (GB#), IMAGE clone ID (clone; another unique identifier), left and right position of the sequence within the THC, and the library mRNA source. Not all sequence annotations are shown for the partial sequences in A. DNA sequence, BLAST search results, tentative orthologous sequences for mouse and rat, and mapping data also available but not shown here. (C) Expression report details for the libraries from which each EST within THC532369 was derived, the number of times it was sequenced from that library, and a link to the library record (Cat#). Horizontal bars represent the level of expression in each library.
FIGURE 3. Schematic of the serial analysis of gene expression (SAGE) method. PolyA RNA is transcribed into double-stranded cDNA with a biotinylated oligo(dT) primer and cleaved with an anchoring enzyme (AE) that cleaves on average once every 256 bp (4^4). The 3’ portion of the cDNA is then bound to streptavidin beads, thus providing a unique site (i.e., the AE site) on each transcript closest to the polyA tail. The cDNA pool is then divided in half and ligated via the anchoring enzyme restriction site to one of two linkers containing a type IIS restriction site (TE) and a unique primer site (A or B). The type IIS restriction enzyme blunt end cleaves at a defined distance up to 20 bp away from the recognition site. Blunt ended cleavage by the tagging enzyme produces a short piece of cDNA containing the tagging and anchoring enzyme site, the primer site, and the cDNA tag 9 to 14 bases long, depending on the tagging enzyme used. The blunt ended products are pooled and ligated to each other to produce ditags flanked by the linkers. Ligated tags then serve as a template for PCR amplification with primers specific to each linker. This amplification step serves to amplify tag sequences and provide orientation. The resulting amplification products are cleaved with the anchoring enzyme allowing isolation of the ditags, which are then concatenated by ligation, cloned, and sequenced. Software identifies individual tags, makes tag to gene assignments, and quantitates the tags to generate expression levels for the genes.
gene because one or more genes may share the same tag, and one gene may be represented by more than one tag. Sequencing errors can also compound the problem of tag to gene assignments. The absence of an EST or SAGE tag from a cDNA library does not necessarily indicate that the gene is not expressed in that tissue. This is often the case when only a small fraction of the clones from a cDNA library are sequenced, or the library was normalized or subtracted. Nonetheless, due to the large size and complexity of the EST and SAGE databases, and the comprehensive representation of many tissues, it is possible to discover genes that are specifically or preferentially expressed in a particular tissue by comparing EST or tag counts between libraries.
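The tag-to-gene ambiguity described in this section can be made concrete with a small sketch. It assumes, for illustration only, an anchoring enzyme with the four-base recognition site CATG (the NlaIII site, which is not named in the text above) and a 10-base tag. The transcript sequences and the helper `extract_sage_tag` are hypothetical.

```python
from collections import defaultdict


def extract_sage_tag(transcript, anchor_site="CATG", tag_length=10):
    """Return the tag that follows the 3'-most anchoring enzyme site.

    Returns None when the site is absent or when fewer than `tag_length`
    bases remain downstream of it (a simplification of the real protocol).
    """
    pos = transcript.rfind(anchor_site)  # site closest to the polyA tail
    if pos == -1:
        return None
    start = pos + len(anchor_site)
    tag = transcript[start:start + tag_length]
    return tag if len(tag) == tag_length else None


# Invented transcript sequences (5' to 3', polyA tail omitted).
transcripts = {
    "gene_1": "GGATCCATGAAACCCGGGTTTACGTACGT",
    "gene_2": "TTTTCATGAAACCCGGGTTTGGGG",
    "gene_3": "ACGTACGTACGT",  # no anchoring site, so no tag is produced
}

tag_to_genes = defaultdict(list)
for gene, sequence in transcripts.items():
    tag = extract_sage_tag(sequence)
    if tag is not None:
        tag_to_genes[tag].append(gene)

for tag, genes in tag_to_genes.items():
    print(tag, genes)
```

In this toy data set two different genes yield the same tag, and one transcript yields no tag at all, which mirrors the two sources of error noted above.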
D. Other Expression Resources In addition to EST and SAGE data, gene expression information for the mouse is also available from the Jackson Laboratory Gene Expression Database (GXD)15 (Table 1). This resource curates gene expression information, including Northern Blot, RNase protection, Western Blot, and reverse transcriptase-polymerase chain reaction (RT-PCR) assays from mouse tissues obtained from various stages of development. These data are collected from a number of sources, including direct online submission and annotations from the literature by the curators of the database. This resource allows complex queries for genes based on accession number, tissue of interest, or developmental stage.
E. In Silico Northern Blots Determining whether a gene is expressed in a particular tissue involves querying the UniGene database or the TIGR Gene Index for the gene of interest after selecting an organism to search. The output returns expression information as a list of the cDNA library sources from which the EST or tag was derived. The UniGene output also includes a link to SAGE results if available, which not only provides a list of cDNA libraries containing a tag representative of the queried gene,
but also a computer-generated visual indication of the relative abundance of the tag in the cDNA library. The visual indicator resembles a Northern blot and is referred to as a virtual Northern. Automated comparison of EST counts (digital differential display, or DDD) was first described for the discovery of prostate-specific genes, in which genes were identified in dbEST that have many related ESTs from a prostate cDNA library, but have none or few ESTs in nonprostate libraries.16 A potential application is to identify tissue-specific proteins for targeted therapy of prostate cancer, but the approach is generally applicable to finding tissue-specific targets for other forms of therapy. When 7 of 15 putative prostate-specific genes identified in silico were examined experimentally for their tissue distribution, only three were found to be prostate specific. This indicates that cDNA libraries are generally incomplete and many more expressed sequences exist that have not been detected by EST sequencing. DDD of human cDNA libraries is available through the NCBI Unigene DDD (Table 1). Unigene clusters can be identified that are specifically or preferentially expressed within a cDNA library, or pool of libraries. The output returns a list of clusters, their relative expression level in the cDNA library or pool of libraries being compared, as well as significance level associated with their relative difference. The statistical significance of digital gene expression profiles based on tag frequency data has also been addressed by Audic and Claverie.17 In this manner, gene expression can be examined across many organs or between normal and diseased cells. SAGE can be used to quantitatively catalog expressed sequences and monitor differential gene expression between two mRNA populations. The virtual northern tool, SAGEMap vNortherns, allows an mRNA sequence to be queried for expression in SAGE libraries represented in SAGEMap, a database of SAGE results18 (Table 1). 
Representative tags are extracted from the query sequence and links are provided to SAGE libraries containing a representative of the tag. Following the tag hotlinks provides a list of the SAGE libraries containing the tag, as well as a virtual northern blot representing the relative level of expression of the tag in the SAGE library. Pooling and comparing different SAGE libraries can
be performed with SAGEMap xProfiler, a tag-based method of DDD (Table 1). Using this tool, SAGE libraries can be pooled based on characteristics, such as normal, diseased, or tissue type, and compared with another pool to identify transcripts that are differentially expressed between two pools. Fold difference factors and coefficient of variance cutoffs can be used to prioritize differentially expressed genes. The results, including library information, tag sequences, tag counts, and statistical p-values, can be downloaded in text format. In a similar manner to DDD, EST frequencies can be experimentally compared between two cDNA libraries from control and test samples to measure differential gene expression following chemical exposure. This approach has been used to identify changes in gene expression in mouse liver following exposure to dioxin.19 Based on the frequency of almost 5000 ESTs derived from male mice treated with dioxin or vehicle, the overall inventory of transcripts was altered by approximately 30%. In addition, 20 genes were identified to be significantly up- or down-regulated in response to dioxin treatment. One advantage of this approach is the generation of a set of cDNA clones from mouse liver. These clones were then used to construct tissue-specific cDNA microarrays to measure gene expression and verify the EST frequency results.
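To illustrate how a significance level can be attached to a difference in EST or tag counts between two libraries, the sketch below applies a two-sided Fisher exact test to a 2 x 2 table of counts. This is a generic stand-in chosen for simplicity. It is not the statistic of Audic and Claverie, nor necessarily the test used by the UniGene DDD or xProfiler tools, and the counts are invented.

```python
from math import comb


def fisher_exact_two_sided(x, n1, y, n2):
    """Two-sided Fisher exact p-value.

    x of n1 ESTs in library 1 and y of n2 ESTs in library 2 belong to the
    gene of interest. The null hypothesis is equal relative abundance.
    """
    hits, total = x + y, n1 + n2
    denominator = comb(total, n1)

    def prob(k):
        # Hypergeometric probability that k of the `hits` ESTs fall in library 1.
        return comb(hits, k) * comb(total - hits, n1 - k) / denominator

    lowest = max(0, n1 - (total - hits))
    highest = min(hits, n1)
    p_observed = prob(x)
    # Sum every outcome that is no more likely than the observed one.
    p_value = sum(
        p for p in (prob(k) for k in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


# Invented counts: 20 of 1000 ESTs in one library versus 0 of 1000 in another.
p_different = fisher_exact_two_sided(20, 1000, 0, 1000)
# Identical counts in both libraries give no evidence of a difference.
p_same = fisher_exact_two_sided(5, 1000, 5, 1000)
print(p_different, p_same)
```

Because many clusters are tested at once in a library comparison, p-values of this kind would normally be interpreted with some allowance for multiple testing.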
F. ESTs and Microarrays ESTs also provide the raw material from which to construct cDNA microarrays. As an initial step, EST data can be analyzed for tissue distribution to develop target-specific cDNA microarrays. This approach attempts to maximize the amount of useful expression information extracted from array experiments and increase the probability of identifying changes in gene expression (see below for an introduction to microarray analysis). This approach is desirable because a large proportion of genes represented on a microarray may not be expressed or detected in the tissue of study. For example, it was shown that a large proportion of differentially expressed genes in a melanoma cell line were not identified when analyzed using microarrays that were constructed from libraries
without origin bias, in contrast to microarrays that were constructed with ESTs from a melanocyte-derived library.20 The Database of Transcripts Expressed in Spermatogenesis and in Testis (dbTEST) is an example of how public expression information can be used to develop tissue-specific cDNA microarrays to study testicular gene expression in the mouse (http://www.bch.msu.edu/~zacharet/dbTestExp.html). Entries in dbTEST are derived from a number of sources, including the literature and the mouse GXD. The majority of the entries, however, are derived from information contained in dbEST. This takes advantage of the fact that EST sequence records are annotated with their library of origin. To compile dbTEST, the UniGene database was queried for a list of all clusters that contained at least one EST sequence derived from a mouse testis cDNA library. These testis libraries cover many stages of development, including pre- and postnatal stages. The use of public EST expression information has an advantage in that cDNA clones representative of each cluster are likely to be publicly available through licensed distributors of IMAGE (Integrated Molecular Analysis of Genomes and their Expression) Consortium cDNA clones.21 This strategy rapidly identifies a large pool of potential target genes, while concurrently identifying readily available cDNA clones for these genes. Several assumptions are inherent in the above approach. The first, and most important, assumption is that a gene can be considered to be expressed in a tissue if a single EST with similarity to that gene is identified within that tissue. This assumption is overly simplistic and likely results in the false identification of many genes due to erroneous clustering of tags. The utility of tag detection and counting methods also assumes that the database is representative of the true abundance and distribution of tags.
This assumption is certainly false, because the results of tag detection and quantitation methods are biased toward the libraries that are within the database. The libraries do not comprehensively represent all tissues or developmental stages, as most libraries are derived from normal and untreated control animals. Therefore, it is possible that genes expressed only under pathological conditions, or only following exposure to an inducing agent, are
underrepresented. However, tag databases such as dbEST and SAGEMap continue to grow in size, thus improving the utility of these methods. It is likely that with the increase in use of microarray technology for gene expression profiling, microarray databases will provide a more valuable source of expression data under a wider variety of experimental conditions.
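The selection rule used to compile a tissue-specific clone set, and the single-EST assumption criticized above, can be sketched as a simple filter. The record layout, cluster identifiers, and counts below are invented for illustration and do not reflect the actual UniGene or dbTEST formats.

```python
def clusters_expressed_in(cluster_est_counts, tissue, min_ests=1):
    """Return cluster IDs having at least `min_ests` ESTs from `tissue` libraries."""
    return sorted(
        cluster_id
        for cluster_id, counts in cluster_est_counts.items()
        if counts.get(tissue, 0) >= min_ests
    )


# Hypothetical records: cluster ID -> number of ESTs per tissue of origin.
cluster_est_counts = {
    "cluster_1": {"testis": 1, "liver": 12},
    "cluster_2": {"brain": 4},
    "cluster_3": {"testis": 7},
}

# A single testis-derived EST is enough to include a cluster.
permissive = clusters_expressed_in(cluster_est_counts, "testis")
# Requiring two or more ESTs trades sensitivity for fewer false inclusions.
stricter = clusters_expressed_in(cluster_est_counts, "testis", min_ests=2)
print(permissive, stricter)
```

Raising the hypothetical `min_ests` threshold is one simple way to explore how sensitive the resulting gene list is to the single-EST assumption.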
III. IDENTIFICATION OF TARGET GENES DNA or protein sequence data can be used to identify other, potentially novel, proteins of interest, such as other protein family members, or to identify related proteins in other species as a means of establishing an in vivo model system for study. While these tasks would have been exceedingly difficult if not impossible to address a decade ago, the rapid increase in the availability of DNA and protein sequence data and the development of tools to mine these data now allow for in silico identification and characterization of target proteins based on sequence data alone. These approaches can expedite the cloning of toxicologically relevant genes and aid in the interpretation and prediction of protein function and dysfunction.
A. Terminology Comparisons between homologous sequences provide a powerful means for predicting protein function based on sequence data alone. However, before considering the structural or evolutionary comparison of proteins within or between species, it is appropriate to first clarify protein terminology that is often misused or misunderstood. Two proteins that exhibit high amino acid sequence or structural similarity are often said to be homologous, which implies a common evolutionary origin. There is no strict rule that determines what degree of similarity between two protein sequences implies homology, because protein sequences evolve at different rates. The degree of similarity is often expressed as a percentage of sequence similarity or identity, two measures that are often mistaken for each other. Percent sequence similarity refers to the percentage of the amino acid pairs in
a sequence alignment that are of similar chemical property, such as glutamate and aspartate (i.e., both acidic). Percent sequence identity refers to the percentage of amino acid pairs in a sequence alignment that are identical. Because identical pairs are also counted as similar, sequence similarity is always at least as great as sequence identity. Homologous proteins can be considered either paralogs or orthologs. Paralogs refer to homologous proteins within a species that diverged as a result of an early gene duplication event. Although similar in amino acid sequence and protein structure, they often have unique but related functions and are often members of the same protein family. Orthologs refer to homologous proteins in different species that arose from a common ancestral gene during speciation. As a result, they usually serve the same function in both organisms and can often, but not always, genetically complement each other. When comparing two related protein sequences between species, the distinction of whether a protein of similar sequence is a true ortholog is often unclear. When dealing with incompletely sequenced genomes, there exists the possibility that the true ortholog (i.e., most similar protein) has not yet been sequenced, and that the most similar sequence is just a paralog that fulfills a related but distinct function.22 A reciprocal sequence comparison will quickly reveal if the most similar sequence (i.e., the putative ortholog) is also most similar to the query sequence in the original species.
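The difference between percent identity and percent similarity can be shown with a short calculation on a pre-aligned pair of sequences. The grouping of amino acids by chemical property used here is one illustrative choice among several, and the convention of counting only ungapped aligned pairs in the denominator is likewise an assumption. The two sequences are invented.

```python
# Illustrative groups of chemically similar residues (an assumption, not a standard).
SIMILAR_GROUPS = [set("DE"), set("KRH"), set("STNQ"), set("AVLIM"), set("FYW")]


def is_similar(a, b):
    """True if residues are identical or share one of the illustrative groups."""
    return a == b or any(a in group and b in group for group in SIMILAR_GROUPS)


def percent_identity_similarity(aligned_a, aligned_b):
    """Return (percent identity, percent similarity) over ungapped aligned pairs."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("alignment contains no ungapped pairs")
    identical = sum(a == b for a, b in pairs)
    similar = sum(is_similar(a, b) for a, b in pairs)
    return 100.0 * identical / len(pairs), 100.0 * similar / len(pairs)


# Invented alignment; "-" marks a gap.
identity, similarity = percent_identity_similarity("MKDE-LV", "MRDDALI")
print(identity, similarity)
```

In this example three of six aligned pairs are identical, while all six are identical or chemically similar, so similarity exceeds identity.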
B. Pairwise Sequence Alignments Aligning novel sequences with previously characterized proteins affords an efficient and powerful means to extract functional information from unknown proteins based on the evolutionary conservation of protein function. Sequences can be compared by local or global alignments, depending on the purpose of the comparison. The Needleman-Wunsch algorithm was first described for global sequence alignment in which an optimal forced alignment between two sequences is determined.23 The Smith-Waterman algorithm finds the optimal local alignment within two sequences.24 The method of choice will depend on whether the two sequences are presumed to be
related over their entire length or only within related domains within the sequence. The Smith-Waterman algorithm is generally the preferred approach because similarities between sequences are usually restricted to relatively short segments, such as motifs or active sites, rather than the entire length. Searching for sequence similarity between a query sequence and a large number of other sequences in a database using the Smith-Waterman approach is computationally intensive and time consuming. Therefore, a heuristic strategy, which makes use of approximations, has been utilized in the FASTA25 and BLAST26 programs (Table 1) to increase computational speed at the cost of sensitivity in detecting relevant matches. However, adjusting the parameters of the search can influence the heuristics and alter the balance between speed and sensitivity. When trying to identify homologous proteins, aligning amino acid sequences is preferred over aligning nucleotide sequences due to the degenerate nature of the genetic code. Both the FASTA and BLAST suites of programs support the virtual translation of DNA sequences such that a DNA query sequence can be compared to a protein database or a DNA database that has also been virtually translated in all six reading frames. Although FASTA supports this feature, we will concentrate on the BLAST suite of programs due to its popularity and speed. The choice of BLAST program will depend on the nature of the query sequence, the database being searched, and the purpose of the search. Table 2 outlines the family of BLAST programs available and when they should be used. To search for weaker, but perhaps biologically relevant, sequence similarities, a position-specific iterative form of BLAST (PSI-BLAST) has been introduced27 (Table 1). This method produces a position-specific scoring matrix (or profile) from statistically significant matches produced by the BLAST program.
A position-specific scoring matrix describes the position and frequency of amino acids in a sequence alignment. This matrix is then used as a query to search a protein database for matches in an iterative fashion until no new homologous sequences have been found. While not as sensitive as the best motif search programs (see Protein Domain Classification below for more detail on motif searches), its speed and ease of use make this program attractive.
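As a minimal sketch of what a position-specific scoring matrix contains, the code below builds a log-odds profile from a small gapless alignment and uses it to score candidate sequences. It assumes a uniform background frequency and a simple pseudocount, and it performs a single pass only. It does not reproduce the sequence weighting or iteration used by PSI-BLAST, and the aligned motif is invented.

```python
from math import log

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_seqs, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Return one {residue: log-odds score} dictionary per alignment column."""
    background = 1.0 / len(alphabet)  # uniform background (an assumption)
    pssm = []
    for column in zip(*aligned_seqs):
        total = len(column) + pseudocount * len(alphabet)
        pssm.append({
            residue: log(((column.count(residue) + pseudocount) / total) / background)
            for residue in alphabet
        })
    return pssm


def score_sequence(pssm, sequence):
    """Sum the column scores for a sequence of the same length as the profile."""
    return sum(column[residue] for column, residue in zip(pssm, sequence))


# Invented gapless alignment of a four-residue motif.
motif_alignment = ["CPHC", "CPYC", "CGHC", "CPHC"]
profile = build_pssm(motif_alignment)
print(score_sequence(profile, "CPHC"), score_sequence(profile, "AAAA"))
```

A sequence resembling the aligned motif scores above zero, whereas an unrelated sequence scores below zero, which is the basis for using the profile as a database query.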
Another advantage of this program is its ability to detect distant but biologically relevant relationships between proteins that were once only detectable by structural comparisons.28,29 It is likely of limited practical use for most sequence comparisons when the intent is to find homologs in closely related species or to predict function based on domain analysis. While the source of errors in PSI-BLAST is the same as BLAST, they are easily amplified during the iteration process. For example, deceptive alignments can easily be produced if proteins with highly biased amino acid composition are included in the profile.30
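The local alignment strategy discussed in this section can be sketched as follows. This is a teaching version of the Smith-Waterman recurrence that assumes a fixed match and mismatch score in place of a substitution matrix, and a linear gap penalty. The parameter values and test sequences are arbitrary choices for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned segment of a, aligned segment of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best local alignment score ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + pair, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + pair:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Two invented sequences that share only a short internal segment.
score, segment_a, segment_b = smith_waterman("WWWWCDEFGYYYY", "KKCDEFGPP")
print(score, segment_a, segment_b)
```

Resetting negative scores to zero is what allows the alignment to begin and end anywhere within the two sequences. A global (Needleman-Wunsch) alignment omits this reset and forces the alignment to span both sequences end to end.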
C. Evaluating Sequence Alignments
To judge the effectiveness of a sequence alignment, an alignment score between two sequences is calculated based on residue similarity and the presence of gaps. A substitution score is given to each pair of residues that can be aligned. A complete set of these substitution scores is called a substitution matrix. PAM31 and BLOSUM32 are the most widely used scoring matrices for protein sequence comparison. A scoring matrix is constructed from target frequencies, which represent the observed frequency of point mutations within related proteins that have been accepted during evolution because they do not seriously disrupt the function of the protein. When calculating a substitution score, a conservative substitution, such as a serine for a tyrosine (both polar), is given a higher score than a nonconservative substitution, such as a serine for a valine (polar to a nonpolar). The substitution scores in the matrix are proportional to the natural log of the ratio of target frequencies to background frequencies, which are dictated by the frequency of the different amino acids in nature. To estimate target frequencies in a PAM matrix, a set of very closely related sequences were used to collect mutation frequencies corresponding to 1 PAM (Point Accepted Mutation), where 1 PAM represents a unit of evolutionary divergence in which 1% of the amino acids have been changed. Target frequencies are then extrapolated to a distance of 250 PAMs. Note that 100 PAMs does not mean that 100% of the amino acids have changed, as the amino acids can change many times, as well as changing back
TABLE 2 Sequence Searches Using the BLAST Family of Programs
to the original amino acid. When comparing sequences from evolutionarily divergent species where the sequence similarity is expected to be weak, such as between mouse and yeast, best results are obtained when using higher PAM values, such as PAM200 or PAM250. When comparing sequences from related species where sequence similarity is expected to be strong, such as between human and rodents, a lower PAM value is appropriate, such as PAM30.33 The BLOSUM substitution matrix uses a different strategy to estimate target frequencies. Frequencies are derived from local multiple sequence alignments of distantly related sequences. The advantage of this approach is the empirical derivation of matrix scores, rather than by extrapolation. The BLAST program uses a BLOSUM-62 matrix as the default scoring matrix. Sequences in the multiple sequence alignment used in construction of the matrix having at least 62% identity are merged into a single sequence such that the target frequencies are influenced more by the divergent sequences in the alignment. As a result, BLOSUM matrices with high cutoffs, such as BLOSUM-90, are more appropriate for comparing closely related sequences. The BLOSUM-62 matrix is considered efficient over a broad range of evolutionary distances and is suitable for query lengths greater than 85 amino acids. Because short sequence alignments need to be strong to rise above background, higher BLOSUM matrices or lower PAM matrices are recommended. In addition to substitution scores, gaps can contribute negatively to the score of an alignment. While no theory exists for imposing or selecting gap costs, it is common to impose an initial fixed cost (G) for each gap and an additional gap penalty (L) proportional to the length of the gap. For practical purposes, the default settings of BLAST suffice. 
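The two ingredients of an alignment score described above, log-odds substitution scores and length-dependent gap costs, can be illustrated with a short Python sketch. The frequencies used here are hypothetical, the half-bit scaling is the convention commonly associated with BLOSUM-62, and the gap costs (G = 11, L = 1) are values often quoted as BLASTP defaults; all three should be read as assumptions for illustration.

```python
import math

# Assumption: scores expressed in half-bit units, i.e. 2 * log2(odds ratio).
HALF_BIT_SCALE = 2.0 / math.log(2)


def substitution_score(target_freq, bg_i, bg_j, scale=HALF_BIT_SCALE):
    """Log-odds score: scale * ln(q_ij / (p_i * p_j)), rounded to an integer."""
    return round(scale * math.log(target_freq / (bg_i * bg_j)))


def gap_cost(length, open_cost=11, extend_cost=1):
    """Fixed cost G for opening a gap plus a cost L for each gapped position."""
    return open_cost + extend_cost * length


# Hypothetical pair observed four times more often than expected by chance.
print(substitution_score(0.01, 0.05, 0.05))
# Cost of a single gap spanning three residues.
print(gap_cost(3))
```

A pair of residues aligned more often than chance predicts receives a positive score, and a pair aligned less often receives a negative one, which is the sense in which conservative substitutions score higher than nonconservative ones.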
The biological relevance of a match can only be ascertained experimentally; however, local alignment statistics can indicate which alignments are unlikely to have arisen by chance alone. While theory does not exist to describe the expected distribution of global alignments, the distribution of scores for optimal local ungapped alignments has been shown to follow an extreme value distribution.34 Although unproven, evidence suggests the same distribution holds for gapped alignments,35 which is supported by the current version of BLAST. BLAST programs
calculate a raw score (denoted S) for alignments based on the substitution scoring matrix and gap penalties imposed. Bit scores, which represent raw scores normalized for the statistical variables that define a given scoring system, are reported for all alignments. Bit scores are more informative and allow for accurate comparisons between different alignments, even when using different scoring matrices. To calculate the significance of the alignment, the observed alignment score (S) is compared with the expected extreme value distribution to generate an E (expect) value. An E value represents the number of distinct alignments with an equivalent or superior alignment score (S) that would be expected to occur purely by chance. From the E value, a P value can be calculated. E values less than 0.01 are essentially identical to P values and generally considered significant, or at least of interest. Local alignments with no gaps are referred to as High Scoring Pairs (HSPs). To calculate the significance of the ungapped alignment, the observed alignment score for the HSP is compared with the Poisson distribution, which describes the number of random HSP scores equal to or greater than S. This is the P value associated with S and describes the probability of finding an alignment with an equivalent or superior alignment score simply by chance. The probability of randomly generating an alignment is in part governed by the length of the query sequence and the size of the database searched.
While there is no strict rule for judging significance, a comparative analysis of sufficiently divergent human and mouse orthologs reveals that coding sequences as low as 36% identical at the protein level result in highly significant E values less than 2 × 10^-33 (BLOSUM-62, BLASTP) or less than 6 × 10^-58 at the DNA level (BLASTN).36 Ultimately, the decision as to whether a match is significant will depend on the species being compared, the query sequence, the database searched and an informed decision by the investigator to judge whether the alignment is of biological relevance.
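The relationship between raw scores, bit scores, E values, and P values described in this section can be sketched as follows, using the standard Karlin-Altschul form. The statistical parameters `lam` and `k` depend on the scoring system; the values in the example (0.267 and 0.041) are ones often quoted for gapped BLOSUM-62 searches and are used here purely as illustrative assumptions, as are the query and database lengths.

```python
import math


def bit_score(raw, lam, k):
    """Normalize a raw score S: S' = (lam * S - ln K) / ln 2."""
    return (lam * raw - math.log(k)) / math.log(2)


def expect_value(bits, query_len, db_len):
    """Expected number of chance alignments scoring at least this well."""
    return query_len * db_len * 2.0 ** (-bits)


def p_value(e):
    """Probability of at least one such chance alignment: 1 - exp(-E)."""
    return -math.expm1(-e)


# Hypothetical search: raw score 100, 250-residue query, 50 million residue database.
bits = bit_score(100, lam=0.267, k=0.041)
e = expect_value(bits, query_len=250, db_len=50_000_000)
print(bits, e, p_value(e))
```

Because 1 - exp(-E) is approximately E when E is small, the sketch also shows why E values below about 0.01 are essentially interchangeable with P values, and why the same raw score becomes less significant as the database grows.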
D. Identifying Novel and Orthologous Genes
Database similarity searching identifies which sequences in a database of hundreds of thousands are most similar to the particular sequence of
interest. These searches can often result in hundreds of full and partial matches that are returned as a ranked list in ascending order of E value. It is common for most matches to be partial sequences represented by ESTs, as they account for at least 70% of the sequences in GenBank (Release 120). The advantage of matching to an EST is that the cDNA clone from which the sequence was derived is likely available through the IMAGE Consortium and authorized vendors,21 thus providing a means to quickly isolate the gene of interest, as well as cDNA clones for homologs in other model species. This approach was taken to isolate novel members of the PAS Superfamily by Bradfield and co-workers.37 The PAS Superfamily is characterized by a 200 to 300 amino acid PAS domain that was defined originally by PER, Arnt, and SIM, the three founding members of the family. PAS domains are signaling motifs that function to sense oxygen, redox potential, light, and planar aromatics.38,39 In addition to their role as environmental sensors, they play important roles in development and cell lineage.40 Specificity in sensing arises from heterotypic interactions between different members of the Superfamily. In an effort to identify novel members of the PAS Superfamily that may act as binding partners to other PAS proteins, such as Arnt or AhR, a database search was used. The bHLH-PAS domain of the human AhR and Arnt, and the Drosophila SIM protein, as well as the PAS domain of the Drosophila PER protein, were used as query sequences in Blastn searches of EST sequences in GenBank. Candidate matches that exceeded an empirically chosen cutoff were selected for subsequent verification using Blastx. Only candidate ESTs that matched known bHLH-PAS proteins were then further characterized, cloned, expressed, and tested for functionality. This novel approach identified five new members of the PAS Superfamily, denoted MOPs 1-5. 
37 Subsequent coimmunoprecipitation studies, yeast two-hybrid analyses, and transient transfection experiments revealed that these proteins form multiple heterodimeric complexes with other PAS proteins, such as Arnt, CLOCK, Hypoxia-inducible factor 1 alpha (HIF1alpha), and HIF2alpha.37,41 Although nucleotide sequence similarity was used successfully to identify these novel PAS proteins,
a Tblastn search using a PAS protein as a query and searching EST sequences would have provided a more sensitive approach to identify a larger number of potential leads due to the large size of the EST database and the higher sensitivity of detecting homologous proteins compared to DNA. This would be advantageous when limiting the search to other species, because this approach often results in a large number of EST matches representing the query sequence itself, due simply to the size and redundancy of the EST database. While EST searching permits identification of novel genes, it has the further advantage that these novel genes are identified based on homology to genes of known function. While the function of the homolog may not be identical to the known gene used as the query sequence, this information can provide substantial insight into functional analysis of the novel gene. Such approaches are well suited toward finding human homologs of proteins characterized in other organisms, because functional information is often initially obtained in model organisms more amenable to genetic manipulation, such as Drosophila or yeast. Gene identification based on cross-species comparisons traditionally has relied on low-stringency hybridization of cDNA or genomic libraries, and PCR-based approaches using degenerate primers. This is technically difficult in unrelated species, such as mammals and yeast. However, the vast amount of sequence data available for many species now permits rapid in silico identification of related genes based on sequence comparisons. For example, Banfi et al.42 were able to identify 66 potential human orthologs of genes that cause mutant phenotypes in Drosophila. This was accomplished by comparing the protein sequence of the Drosophila genes to the human section of dbEST.
A similar approach was used by Rotig et al.,43 who compared yeast genes involved in disorders of mitochondrial oxidative phosphorylation to human EST sequences, thus identifying 102 potential human genes involved in mitochondrial diseases. One current limitation of the EST-based search approach is the poor representation of ESTs from toxicologically relevant model organisms, such as the rat, dog, or primate. Human EST sequences represent approximately 44% of the almost 8
million ESTs in GenBank as of May 2001, while the rat only represents about 3.5% (~285,000 EST sequences). However, mouse EST sequences are well represented in GenBank (~25%). There is a scarcity of EST sequences for the dog or primate. In time, however, sequence data for the mouse and rat will be complete,44 and a chimpanzee sequencing project is being considered.45 Despite the relative lack of sequence data for rodents, there exists a wealth of mouse/human and mouse/rat homology data by gene or chromosomal position curated by the Mouse Genome Database46 and the NCBI Human-Mouse Homology Maps (Table 1). While the power and sensitivity of homology searching continues to improve, it is important to realize that two sequences that share similarity will not necessarily share function. Due to the size of GenBank, any search will more than likely return a number of results that, while seemingly significant, may be seriously misleading without adequate consideration. For example, two proteins may align well in a putative transmembrane domain but have widely disparate functions due to dissimilarity in extramembrane domains. While computational techniques can allow for the rapid identification of protein targets, it is crucial to verify computational results with other in silico methods and ultimately experimental data. As the amount of information increases, assigning function based on sequence and identifying and comparing genes between model organisms will be facilitated with greater ease and confidence.
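The virtual translation that underlies protein-level searches of nucleotide databases, such as comparing a protein query against EST sequences translated in all six reading frames, can be sketched as follows. This is a minimal illustration assuming the standard genetic code and unambiguous DNA; the function names are hypothetical and it is not how any BLAST program is implemented internally.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return translations for frames +1..+3 and -1..-3 keyed by frame label."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frame_translations("ATGGCCTAA").items():
    print(frame, protein)
```

Each of the six conceptual translations can then be compared with a protein query, which is why protein-level searches of EST data remain sensitive even when the reading frame and strand of a single-pass read are unknown.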
IV. CHARACTERIZATION OF TARGET PROTEINS
Proteins are the functional units of the cell and represent a major site of drug and xenobiotic action.47 Protein-ligand interactions, which are critical for sensing and maintaining cellular homeostasis and facilitating adaptive responses, are mediated by conserved structural domains. Inadvertent protein-ligand interactions can result in adverse functional consequences and altered cellular homeostasis. Therefore, characterizing target proteins is fundamental to the elucidation of mechanism of action of drugs and xenobiotics. Unfortunately, characterizing proteins is labor intensive and investigators often begin without structural information for the protein. In some cases, novel genes may be genetically or biochemically isolated and subsequently cloned, and inferences on protein function must be made based on sequence data alone. Several in silico methods can help formulate hypotheses regarding potential functions, and thus expedite the process of protein characterization. Proteins can be characterized by their conserved functional domains, such as an enzyme catalytic site, or a ligand, RNA, DNA, or protein binding site. To characterize and ultimately predict function, a number of approaches can be used depending on the sequence or structural information available for the protein. For the majority of proteins, structural information is not available and inferences must be made with only sequence data and limited experimental data. Comparison of a protein sequence with multiple homologous sequences can aid in the identification of functionally conserved residues involved in binding and/or catalysis in the active site. Protein sequence data can also be used to identify larger functional domains within proteins based on similarity with conserved domains of known function. When structural information is available for homologous proteins, the structure of a novel protein can be predicted with reasonable accuracy, depending on the degree of similarity. Finally, with structural information, ligand-protein interactions can be predicted and modeled to characterize their role in toxicity.
A. Multiple Sequence Alignments
The alignment tools, BLAST and FASTA, discussed in the previous section are generally used to search databases for similar sequences or potential homologs to a query sequence. These programs are invaluable because they allow for the rapid identification of related sequences. By contrast, multiple sequence alignment (MSA) analyses allow for structural, functional, and evolutionary inferences to be made. Traditionally, MSAs have been used to identify characteristic motifs and conserved regions within protein families, evolutionary conservation among proteins, and to improve secondary and/or tertiary structure predictions. MSAs can also identify critical amino acid residues that play important structural
and functional roles, thus pointing to potential candidate residues for mutagenesis studies (e.g., Plate 1*). A MSA is based on the alignment of the closest related sequences followed by the successive alignment of more distant ones. Alignment programs can be classified based on the algorithm they use to generate the alignment. ClustalW48 and Multalign49 use a global alignment algorithm to construct an alignment over the entire length of the sequence, while Dialign,50 Match-Box,51 and PIMA52 use a local alignment algorithm. The majority of these alignment programs are available on the internet53,54 (Table 1), as well as being bundled within comprehensive commercially available sequence analysis software packages from the Genetics Computer Group (GCG) (Madison, WI) and DNAStar Inc. (Madison, WI), for example. Thorough comparisons of multiple sequence alignment programs have been published previously and are not discussed in detail here.55,56 Thompson et al.56 evaluated the alignment accuracy of 10 different multiple sequence alignment programs using the following criteria: sequence length, degree of sequence identity, and the presence of large insertions or deletions. The results of the comparison demonstrated that no single program was capable of producing reliable alignments under all criteria examined. Overall, global alignment programs were more accurate and reliable at aligning sequences of similar length, sequences of divergent families (closely related proteins of >25% identity), and orphan sequences within families (distant members of a family with 1%, then these changes are referred to as single nucleotide polymorphisms (SNPs). It has been shown that such genetic variation can significantly predispose individuals to disease and increase their susceptibility to certain xenobiotics.128 Therefore, identifying susceptible genotypes will increase our understanding of xenobiotic-gene interactions and the molecular basis of toxicity and disease. 
Large-scale SNP discovery projects, as well as EST databases, now allow for the in silico identification of genetic variants in the human population.
A. Occurrence and Implications
Recently, a map of 1.42 million SNPs distributed throughout the genome was described, using all SNPs publicly available as of November 2000. While the average frequency in the available sequence was one SNP in every 1.9 kb, it was estimated that 85% of exons are within 5 kb of the nearest SNP and only 4% are greater than 80 kb away from an SNP. Furthermore, it was estimated that the average gene contains approximately two exonic SNPs per gene, when both the coding and untranslated regions were considered.129 However, while these are predicted average values, the degree of diversity within an individual locus varies greatly depending on the gene under consideration. The mutability of specific genes may be influenced by several factors, including evolutionary selection, recombination, gene conversion, and local variability in mutation rate.127,130 While interest in detection and characterization of polymorphisms has certainly increased in recent years, SNP research has, in fact, been pursued for several decades. In the 1980s, polymorphisms were characterized by examining the differential ability of restriction enzymes to recognize specific loci from different sources. During the 1990s, this approach was overtaken by the increased use of microsatellites (long segments consisting of many repeats of a 2- to 4-base-long motif; also known as short tandem repeats, or STRs). The recent renewed interest in SNPs reflects a trend in research away from monogenic diseases in favor of complex multifactorial diseases, as well as an improvement in SNP detection methods.131 A cSNP is a SNP found in the coding region of a gene, and its presence is likely to affect protein function by virtue of an amino acid change. Phenotypic changes may occur for a variety of reasons, including the deletion or premature insertion of a stop codon, alteration in transcription rate, and interference with RNA maturation, RNA or protein stability, or protein functional groups. Many polymorphisms can occur outside of the coding region, some of which may positively or negatively impact gene regulation.127 Furthermore, SNPs that do not necessarily cause direct phenotypic effects can still be useful for linkage disequilibrium analyses, using the knowledge that DNA recombines in large blocks that tend to remain associated. It has been proposed that having a set of SNP markers distributed at approximately 100-kb intervals across the genome, a vast improvement over current low-density maps created using microsatellite markers, would provide sufficient power in association studies so that small genetic effects in a complex disease trait could be detected.132 Although SNPs are not the only type of genetic variation, it is anticipated that these will have the greatest utility for identifying genes associated with particular diseases, as well as making vital contributions to other fields such as functional genomics, pharmacogenomics, physical mapping, and evolutionary biology.133,134
B. SNP Marker Development
Large-scale efforts are currently underway to identify and catalog SNPs in the human genome. While private companies are developing proprietary repositories of SNP information, there are several publicly accessible databases of genetic variation (Table 1).135 To complement these resources, the National Center for Biotechnology Information (NCBI) in 1998 established its own SNP relational database, dbSNP, in collaboration with the National Human Genome Research Institute (NHGRI) (Table 1). This database, like several other public databases worldwide, will organize and store data describing hundreds of
thousands of SNPs.133 NHGRI and the SNP Consortium have cataloged over 300,000 SNPs to date and submissions to dbSNP approach 3 million, as of May of 2001, of which greater than 99.9% are human SNPs. A recently added feature is the ability to BLAST any query DNA sequence against dbSNP, thus identifying potentially functional SNPs within your sequence of interest. Generally, the steps involved in SNP marker development consist of obtaining the DNA sequence, developing sequence-tagged sites (STSs), identifying the sequence variants in the STSs and estimating their allele frequencies, and ascribing the marker to a unique chromosomal location.132,136 Many emerging technologies are being designed and refined for use in SNP discovery. Some of these include direct DNA sequencing, heteroduplex analysis, single-stranded conformation polymorphism analysis, and variant detector arrays, as well as several high-throughput technologies, many involving various fluorescence techniques and mass spectrometry.131 Other recently described methods include branch migration inhibition,137 reduced representation shotgun (RRS) sequencing,138 and microdevice electrophoresis and sequencing.139 Strategies that circumvent the screening and mapping stages are advantageous because they avoid the most arduous aspects of marker development. Gu et al.136 have described two such methods of discovery, the EST comparison approach and the genomic sequence comparison approach, which take advantage of the escalating amount of sequence data stored in public databases. The first method employs databases containing redundant EST data, while the second uses sequence data from genomic regions where large-insert clones overlap. Neither method requires sequence generation or polymorphism screening. 
While the EST comparison method identifies SNPs associated with expressed genes that are useful for association studies, the quality of single-pass EST sequence reads may be low, particularly in the 3' untranslated region, where most polymorphisms are found. In contrast, taking advantage of genomic sequence information from large-scale sequencing projects is beneficial because of higher-quality sequence and because the physical maps of the sequences are well characterized. It is hoped that these approaches, in
combination with complete human genome sequence data, will soon provide the desired >30,000 evenly distributed SNPs for analysis of complex genetic traits and other applications noted above. Investigators in any sector of the research community can submit sequence variation information to the dbSNP database through a web-based form. Although the database entries are not strictly limited to SNPs, nearly 98% of the variations in the database currently are SNPs, while only 2% consist of short insertions or deletions. The scope of the dbSNP database includes all species and variation at any level of frequency, as well as both functionally important and silent mutations.134,135 This database provides supplemental sequence information to that available in GenBank and also interfaces with the other NCBI resources, such as literature databases.
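The logic of the EST comparison approach, in which redundant reads covering the same transcript are compared to find positions where they disagree, can be illustrated with a simplified Python sketch. It is not the published procedure of Gu et al.: it assumes the reads have already been aligned to a common coordinate system, and the rule that each allele must be seen in at least two reads (to discount isolated single-pass sequencing errors) is an assumption introduced for illustration.

```python
from collections import Counter


def candidate_snps(aligned_reads, min_allele_count=2):
    """Return (position, {allele: count}) for columns with two or more supported alleles."""
    hits = []
    for pos in range(len(aligned_reads[0])):
        # Ignore gaps and ambiguous base calls such as 'N'.
        counts = Counter(r[pos] for r in aligned_reads if r[pos] in "ACGT")
        supported = {base: n for base, n in counts.items() if n >= min_allele_count}
        if len(supported) >= 2:
            hits.append((pos, supported))
    return hits


# Hypothetical, pre-aligned EST reads from the same gene.
reads = ["ACGTAC", "ACGTAC", "ACATAC", "ACATAC", "ACGTAT", "NCGTAC"]
print(candidate_snps(reads))
```

In this toy example the third column is reported because both alleles are seen in multiple reads, while the lone mismatch in the last column is ignored as a probable sequencing error. Any candidate found this way would still require experimental confirmation and allele frequency estimation.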
C. Gene-Environment Interactions
It is predicted that the aforementioned public databases will grow to include all of the common variants of human genes. By combining this with increasing information about the functions of individual genes, researchers may soon be able to establish associations between an individual’s genotype and inherent or environment-induced risk of developing specific diseases. Many functional variants of genes that encode drug and xenobiotic metabolizing enzymes have been identified and reviewed by Meyer140 and Nebert.141 For example, more than 70 variant alleles of the CYP2D6 locus have been described, many of which include one or several SNPs. Other polymorphisms exist in CYP2C9 and CYP2C19 that decrease enzyme activity. Studies of epoxide hydrolases (EH) have revealed ethnicity-linked SNPs and other polymorphisms in both the coding and noncoding regions of the microsomal and soluble forms of this enzyme,142 and it is likely that other SNPs or deletion mutants are ethnically linked.143 Following SNP detection, determining the functional significance of these SNPs will be a greater challenge. It is anticipated, however, that molecular and modeling studies will continue to make significant contributions toward understanding the functional implications of these polymorphisms. For example, computer modeling has been used to predict the effect of several polymorphisms within the soluble EH gene on enzyme dimerization.142 Future research will focus on genes encoding receptors, transporters, ion channels, and DNA repair enzymes, which can play important roles in drug efficacy and toxicant susceptibility.140
D. The Environmental Genome Project
dbSNP is integral to the Environmental Genome Project (EGP), which has two main objectives as outlined by NIEHS (National Institute of Environmental Health Sciences). The first is to identify functionally important polymorphisms that predict susceptibility or resistance to diseases triggered by environmental exposure. The second is to provide improved technologies and study designs for use in epidemiological studies probing the effects of environmental and genetic (‘ecogenetic’) interactions on the etiology of disease. More specifically, the EGP aims to catalog and resequence selected environmental response genes and establish methods of assessing functionally important polymorphisms in these genes, which may include assays of activity or expression levels, analysis of conserved sequence domains, and the examination of interaction domains in three-dimensional structures. The phases of the EGP are discussed in detail in Ref. 144. It is hoped that the information gained from these analyses will facilitate epidemiological studies of the role of gene-environment interactions in the etiology of diseases. Genes of interest to the EGP include those involved in cell cycling and death, DNA repair, metabolism of endogenous and exogenous agents, signal transduction systems, and nutritional, oxidative, and immune response pathways.145 The number is steadily increasing, and it is estimated that dbSNP may eventually contain 10 to 20 million records of genetic variation in the human genome alone.135 Research areas supported by the EGP currently consist of: DNA Sequencing; Functional Analysis; Biostatistics/Bioinformatics; Technology Development; Population-Based Studies; and Ethical, Legal, and Social Implications. It is anticipated that results generated from these studies will aid in improving current methods of risk assessment for the enhanced protection of sensitive subgroups.
This would represent a significant enhancement over current risk assessment procedures, which assume average individuals receiving average exposures. In accordance with the precedent set a decade ago by the Human Genome Project, the EGP intends to foster research addressing the ethical, legal, and social implications of the information generated by the EGP.146 It is hoped that these precautions will minimize negative outcomes and allow the public to enjoy the future benefits of SNP research, such as personalized medicine and protection of hypersensitive individuals.
VII. MINING MICROARRAY GENE EXPRESSION DATA
With the advent of cDNA microarrays, the expression level of hundreds to thousands of genes can be simultaneously measured in a cell, tissue, organ, or whole organism under a variety of normal, physiological, and pathological states.147 The technology behind microarrays has been reviewed extensively and is not discussed here.148 Microarrays are particularly suited to studying the molecular responses to drug or xenobiotic exposure in both human and model species, in vitro and in vivo, as changes in gene expression often precede toxicity. The utility and limitations of gene expression profiling in mechanistic and predictive toxicology have been reviewed.149,150 One of the greatest challenges in using microarray technology is also one of its most advantageous features. While a single microarray experiment can produce thousands of data points detailing the molecular behavior of a cell, tissue, or organ, the amount of information can be overwhelming and many new hypotheses suggested by the data will never be fully developed due to time or resource constraints. However, with the use of microarray databases, results from gene expression profiling experiments will be available so that the information can be reanalyzed and previously undiscovered features of the data explored. This will give the scientific community the opportunity to capitalize on an enormous amount of information regarding their favorite gene, cell type, tissue, process, or treatment. Although the potential exists for new hypotheses to be made and tested by analyzing gene expression data in silico, there are currently many obstacles that make obtaining,
storing, and comparing microarray data difficult. Methods of analyzing microarray data are also in their infancy, although many options exist. Due to the digital nature of microarray experiments, this review would be incomplete without considering the current and future possibilities that in silico analyses of gene expression data may hold. In time, such methods may be applicable to analyzing protein expression profiles, as the field of proteomics continues to advance rapidly.151 Metaanalysis of metabolic profiles from nuclear magnetic resonance (NMR) spectroscopic data (i.e., metabonomics)152 may also become possible. Currently, however, published data are not amenable to in silico analyses.
A. Microarray Databases
Not long after publications first described the application of microarrays to study gene expression in human, plant, and yeast models147,153-155 came the realization that these large data sets need to be stored and disseminated in an organized and extractable manner. Because publication and presentation of large data sets is not practical within the limitations of a journal, data need to be readily available in an electronic format for review. A public repository for microarray data would include: (1) an agreement on the standard type of information that should be reported for each experiment, (2) an extensible format to capture these data, (3) a suitable database to store the information, and (4) a set of tools to search and analyze the data.156 Standards for reporting and annotating microarray experiments based on agreed ontologies have been widely discussed (http://www.ebi.ac.uk/microarray/MGED/). Storage and mining of microarray data is still in the experimental stage; however, it is expected that in time, common data storage media and retrieval and analysis capabilities will be available similar to GenBank. It is also hoped that universally accepted microarray formats, controls, and methods of data normalization and/or analysis will result in microarray data being comparable across platforms, thus facilitating a more accurate metaanalysis of the data. As a result, a review of the current state of microarray databases is not appropriate at this time; however, a list of web sites
offering access to gene expression data can be found in Table 3. Concepts behind the most common analytical methodologies for microarray data and their current and potential applications will be discussed. Noncommercial and freely available array analysis software and databases are not reviewed in detail but summarized in Table 3. Also included is a list of some commercially available programs to analyze gene expression data.
B. Pairwise Conditional Gene Expression Analysis
As changes in gene expression following chemical exposure can precede and/or follow toxicity, gene expression profiling using microarrays has been recognized as a valuable tool to monitor the totality of effects on gene expression and thus potentially explain the molecular basis of toxicity.149,157 In the simplest experimental case, the RNA population of a control sample is compared to a test sample. In these instances, the interest is in determining which genes are significantly changed and how they are changed following treatment. In evaluating the significance of differential gene expression in microarray experiments comparing two RNA populations, an ad hoc threshold is usually applied to indicate the level of differential expression needed to be deemed significant. This threshold level is often chosen on the basis of observed variability in control vs. control hybridization experiments. For example, a gene that does not differ in expression between two samples will have a theoretical expression ratio of one. In practice, however, the observed expression ratios for >95% of the genes in a control vs. control experiment typically range from 0.5- to 2-fold, due to experimental error. As a result, threshold values ranging from ± 1.5- to 3-fold are typically applied, depending on the variability in the system. This results in many genes below the threshold level being disregarded and more confidence given to genes with the highest level of differential expression. This empirical method of determining significant changes in gene expression is in contrast with statistical methods of estimating significance levels (i.e., p values) for differential expression where genes are ranked in order of
confidence. In order to estimate p values, replicate experimental data are required to estimate experimental error on a gene by gene basis, because some genes exhibit more variability than others. In single experiments without replication, it is common practice to assume that a subset of genes do not change in expression, regardless of treatment. The variability in measurement of these control or housekeeping genes can then be used as a basis for estimating experimental error for the rest of the population of genes, thus allowing for significance levels to be calculated for the remaining genes.158 Selection of housekeeping genes is often based on historical or empirical evidence;154 however, many so-called housekeeping genes have been observed to violate the assumption of constancy.159,160 Nonetheless, this method is advantageous when replication is not possible for logistical or technical reasons, such as with tumor biopsies. In order to overcome any assumptions regarding the distribution of ratio measurements and constancy of housekeeping gene expression, estimating experimental error is best achieved empirically by measuring the variation in gene expression, on a gene by gene basis, across replicate control samples.161 In many cases, results of microarray experiments are not replicated, or are measured only in duplicate. It is important to realize that pooling of samples reduces the number of replicates to one and precludes estimation of experimental error. Pooling samples is advantageous and sometimes necessary when RNA is limited due to small tissue size, although advances in cDNA labeling technologies have overcome many of the requirements for large amounts of starting material, making pooling unnecessary. Ultimately, the number of replicates needed will depend on the variability of gene expression in the model system and the acceptable rate of false positives.
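The two approaches described above, an empirical fold-change threshold taken from control vs. control hybridizations and a gene-by-gene error estimate from replicates, can be sketched in Python as follows. This is a minimal illustration and not a published procedure: the function names, the 95% coverage default, and the example intensities are assumptions introduced here. The t statistic is shown without a p value, because converting it requires a t distribution from a statistics package.

```python
import math
import statistics


def log2_ratios(test, control):
    """Per-gene log2(test / control) expression ratios."""
    return [math.log2(t / c) for t, c in zip(test, control)]


def empirical_threshold(control_a, control_b, coverage=0.95):
    """Fold change that bounds `coverage` of the control vs. control ratios."""
    abs_logs = sorted(abs(x) for x in log2_ratios(control_a, control_b))
    # Smallest |log2 ratio| that covers the requested fraction of genes.
    idx = max(0, math.ceil(coverage * len(abs_logs)) - 1)
    return 2 ** abs_logs[idx]


def flag_changed(test, control, fold_threshold):
    """True for genes whose fold change exceeds the threshold in either direction."""
    cutoff = math.log2(fold_threshold)
    return [abs(x) > cutoff for x in log2_ratios(test, control)]


def t_statistic(replicate_log_ratios):
    """One-sample t statistic for a single gene, testing mean log2 ratio = 0."""
    n = len(replicate_log_ratios)
    mean = statistics.mean(replicate_log_ratios)
    sd = statistics.stdev(replicate_log_ratios)  # per-gene experimental error
    return mean / (sd / math.sqrt(n))


# Hypothetical intensities for four genes in two control hybridizations.
threshold = empirical_threshold([100, 100, 100, 100], [100, 50, 200, 125])
changed = flag_changed([300, 120, 40], [100, 100, 100], threshold)
```

Working on the log2 scale keeps induction and repression symmetric around zero, so a single cutoff serves both directions. The replicate-based statistic, by contrast, ranks each gene against its own observed variability.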
In addition to significance testing, other factors must be considered when analyzing changes in gene expression. One must make a distinction between the magnitude of differential expression and the significance level associated with a change in gene expression, because without formal proof there is no way to conclude that a 10-fold change in steady state mRNA level is any more biologically relevant than a 2-fold change. In fact, the value of fold changes themselves can be an inaccurate
TABLE 3 Microarray Databases and Analytical Tools
representation of the underlying changes. For example, if a gene is expressed at an arbitrary level of 50 transcripts per cell, a 2-fold induction would correspond to a net gain of 50 transcripts per cell. However, a 2-fold repression would correspond to a net loss of only 25 transcripts. A loss of 50 transcripts would correspond to an infinite fold-repression. Therefore, net increases or decreases in transcript abundance may need to be considered when interpreting the biological significance of an observed change. In the absence of robust statistical analysis of the data, confidence must rest on secondary verification of gene expression changes using alternative methods, such as RT-PCR or Northern blots. Finally, when trying to interpret the biological relevance of changes in mRNA, one must concede that mRNA levels do not always reflect protein abundance or activity.
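The asymmetry between fold induction and fold repression in the example above can be checked with a few lines of Python. The function names are hypothetical and the numbers are those used in the text (a baseline of 50 transcripts per cell).

```python
import math


def net_change(baseline, fold, direction):
    """Net change in transcripts per cell for a given fold induction or repression."""
    if direction == "induction":
        return baseline * fold - baseline
    if direction == "repression":
        return baseline / fold - baseline
    raise ValueError("direction must be 'induction' or 'repression'")


def fold_repression(baseline, loss):
    """Fold repression implied by losing `loss` transcripts from `baseline`."""
    remaining = baseline - loss
    # Losing every transcript leaves nothing to divide by: infinite repression.
    return math.inf if remaining == 0 else baseline / remaining


gain = net_change(50, 2, "induction")         # 2-fold induction
loss = net_change(50, 2, "repression")        # 2-fold repression
complete_loss = fold_repression(50, 50)       # all 50 transcripts lost
```

The same fold value therefore corresponds to different absolute changes depending on direction, which is why net transcript differences can be worth reporting alongside ratios.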
C. Multiconditional Gene Expression Analysis
Single pairwise analysis of gene expression data between two samples provides only a snapshot of the cell and does not describe the dynamic interactions that characterize cellular processes such as differentiation, development, and disease. Measuring changes in gene expression over time provides information on the kinetics and coordination of gene expression during these dynamic processes. Analyzing gene expression data across multiple samples can also reveal underlying similarities across the samples, thus allowing gene behavior to be correlated in order to predict and diagnose cellular responses based on expression profiles. In order to extract ordered subsets of information from disordered sets of multiconditional gene expression data, a number of multivariate methods have been applied. The most common method for clustering multiconditional gene expression data is hierarchical clustering, which was made popular by the work of Eisen et al.162 Principal component analysis and partitioning methods, such as k-means clustering and self-organizing maps (SOMs), have also been applied to multiconditional gene expression data sets.163 In general, these methods attempt to find order or trends among disordered data sets by grouping similar objects together. Grouping
genes based on similarity of expression is desirable for a number of reasons. For example, two genes of similar expression characteristics may be coordinately regulated and therefore involved in a similar function and/or under the same regulatory control. Based on this supposition, identifying sets of coordinately expressed genes has the potential to indicate gene function based on expression profile alone. This is a powerful means of assigning putative function to a large number of uncharacterized ESTs. In addition to grouping genes, samples can be clustered based on the expression of all genes on the array. This approach is particularly suited to predictive toxicology because it has been proposed that each chemical that acts through a particular mechanism of action will induce a unique and diagnostic gene expression profile under a given set of conditions.149 Gene expression profiles induced by mechanistically similar chemicals can be correlated under certain conditions.164 Therefore, it should be possible to predict a mechanism of action for a chemical of undefined toxicity based on correlated expression data alone. In contrast to statistics, these algorithms do not attempt to make any statistical inferences regarding the data, but rather are used to organize the data for more effective visualization and interpretation.
D. Hierarchical Cluster Analysis
The purpose of hierarchical cluster analysis is to place objects into groups or clusters, such that objects within a group are more similar to each other than to objects in other groups. Hierarchical clustering is an unsupervised and agglomerative (bottom up) form of clustering that makes no a priori assumption of the number of gene expression clusters. In contrast, partitioning methods such as k-means clustering and SOMs, although also unsupervised, require the data set to be divided into a specified number of clusters. The overall process of hierarchical cluster analysis begins with defining the objects to be clustered by a measure of similarity, and computing pairwise distances for all objects to be clustered. This results in a matrix of pairwise distances that can then be visualized using a dendrogram. By the process of clustering, objects are organized into branches, such that branches of the tree that are adjacent are more closely related, and
the length of the tree branch reflects the degree of similarity between objects. Finally, the branches of the tree are ordered and colored to indicate graphically the nature of the relationships within and between the branches (Plate 3*). In the case of multiconditional gene expression data, the objects to be clustered are gene expression profiles. A profile describes how a gene varies in expression among multiple conditions. The measure of expression is usually a ratio of test to control sample expression level. Prior to clustering, it is appropriate to log2-transform ratios so that increases or decreases of equal magnitude are treated as numerically equal but with opposite sign. A correlation metric provides a measure of similarity between gene expression profiles and is the common basis for grouping genes into clusters. The Pearson correlation coefficient, r, is a common metric that quantifies how closely two variables vary together. The correlation coefficient r is always between –1 and 1, where 1 indicates an identical profile, 0 indicates the two profiles are uncorrelated, and –1 indicates the profiles are exact opposites. The advantage of this statistic is that it captures similarity in shape without emphasis on magnitude as it is invariant to scale. Similarity measures (e.g., correlation coefficients) are then often transformed into Euclidean distances prior to clustering. The Euclidean distance is a measure of dissimilarity (i.e., distance = 1 - correlation). Euclidean distances are calculated as follows:
$$d_{ih} = \sqrt{\sum_{j} (x_{ij} - x_{hj})^2}$$
where $x_{ij}$ and $x_{hj}$ are the correlation coefficients in experiments i and h for gene j. This transformation accounts for similarity with all other genes, rather than by a single pairwise comparison. It is also less sensitive to random fluctuations in expression measurement such that two genes that exhibit poor pairwise correlation may still be close on the tree by virtue of their correlation with all other genes.
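The similarity and distance calculations described above can be sketched in Python. This is an illustrative reading rather than the procedure of any particular software package: the three expression profiles are invented, the function names are hypothetical, and the final step assumes that the Euclidean distance is taken between rows of the gene-by-gene correlation matrix, which is one way to make the distance reflect similarity with all other genes.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient r between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def euclidean(u, v):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical test/control ratios for three genes across four conditions.
ratios = {
    "gene_a": [1, 2, 4, 2],
    "gene_b": [1, 4, 16, 4],       # same shape as gene_a, larger magnitude
    "gene_c": [1, 0.5, 0.25, 0.5]  # mirror image of gene_a
}
# log2 transform: a 2-fold increase (+1) and a 2-fold decrease (-1) are symmetric.
profiles = {g: [math.log2(r) for r in v] for g, v in ratios.items()}

genes = list(profiles)
corr = [[pearson(profiles[i], profiles[j]) for j in genes] for i in genes]

# Distance between two genes based on their correlations with every gene.
dist = [[euclidean(corr[i], corr[h]) for h in range(len(genes))]
        for i in range(len(genes))]
```

Here gene_a and gene_b correlate perfectly despite differing in magnitude, which illustrates the scale invariance of r, while gene_a and gene_c are perfectly anticorrelated.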
* Plate 3 appears following page 94.
One situation where analyzing non-Euclidean distance data can be useful is for categorical or binary data (e.g., expressed or not expressed, up- or down-regulated). The resulting distance matrix can then be hierarchically clustered by a variety of methods. To begin clustering, each object (i.e., gene) is represented by a single cluster. The two most
similar objects are then merged into a new pseudo-object. A new distance measure is calculated for the pseudo-object, which is then merged with the next most similar object. This process is repeated in an iterative fashion until only one pseudo-object remains, which represents the root of the tree. The new distance measure to be calculated is the distance between an object and the next most similar object or pseudo-object to which it is merged. The variety of hierarchical clustering methods available differ in the way the distance measures between objects are calculated. For example, in average linkage clustering, the distance between two clusters is the average distance between pairs of observations in each cluster. In complete linkage, the distance between two clusters is the maximum distance between an object in one cluster and an object in another cluster. A number of other clustering methods exist and are generally biased toward finding clusters possessing certain characteristics, such as size (number of members), shape, and dispersion. Average linkage clustering, for example, tends to join clusters with small variances and is biased toward producing clusters with the same variance.165 In most instances, the true number, size, and dispersion of the clusters are unknown, and so like other in silico predictive or exploratory methods, a combination of approaches is desirable. Once clustering is complete, the branches of the tree must be ordered about their branch points, or nodes, to generate a linear ordering of objects. For any tree of N objects, there exist $2^{N-1}$ linear orderings that are consistent with the tree. Because objects can be flipped about their node without altering cluster membership, the ordering of clusters and objects within a cluster is somewhat arbitrary. This can lead to misinterpretation of clustering because it is the distance between objects that relates their similarity, not the linear order by which they are arranged on the tree.
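The merging procedure described above can be sketched as a short agglomerative routine in Python. It is a deliberately simple illustration that recomputes every cluster-to-cluster distance at each step, not an efficient or reference implementation; the function names and the four example positions are assumptions made here.

```python
from itertools import combinations


def cluster_distance(a, b, dist, linkage="average"):
    """Distance between clusters a and b (tuples of object indices)."""
    pairs = [dist[i][j] for i in a for j in b]
    if linkage == "complete":
        return max(pairs)           # farthest pair of members
    return sum(pairs) / len(pairs)  # average over all pairs of members


def agglomerate(dist, linkage="average"):
    """Merge the two closest clusters repeatedly until a single root remains.

    Returns the merges in order as (cluster_a, cluster_b, distance).
    """
    clusters = [(i,) for i in range(len(dist))]  # each object starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            d = cluster_distance(a, b, dist, linkage)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges


def leaf_orderings(n_objects):
    """Number of linear orderings consistent with a binary tree of n objects."""
    return 2 ** (n_objects - 1)


# Four hypothetical objects placed on a line; distances are absolute differences.
positions = [0, 1, 5, 6]
dist = [[abs(p - q) for q in positions] for p in positions]
merges = agglomerate(dist)
```

Switching `linkage` to "complete" changes only how the distance to a merged pseudo-object is computed, which is the sense in which the clustering variants differ. The count returned by `leaf_orderings` grows quickly with N, which is why the left-to-right order of leaves should not be over-interpreted.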
The objects on the tree are colored to indicate the magnitude of expression across the multiple conditions (e.g., red is up-regulated, green is down-regulated, and black is no change; Plate 3). When the samples represent independent conditions, such as different cell types, treatments, or tumor samples, both the genes and the experiments can be hierarchically clustered in a two-dimensional dendrogram.
In this manner, experiments that are more similar in expression profile across all genes are closer together on the tree in the first dimension, while genes that are more similar in expression profile across all experiments are closer together on the tree in the other dimension. This approach can be used to illustrate relationships between multiple conditions and gene expression levels so that subsets of genes can be used as predictors of cellular phenotype or response. When the experiments are time points or different doses, the conditions would be arranged in a linear order.
E. Other Clustering Methods
Hierarchical clustering has been suggested to be inappropriate for gene expression data because phylogenetic trees, or dendrograms, are best applied to situations of true hierarchical descent, such as evolution. However, gene expression does not follow a hierarchy, but rather is characterized by distinct mechanisms. When hierarchical clustering is applied, the data are forced to be joined at some level, regardless of the relationship between the data. As two branches of the tree are joined into one, they become less similar and eventually meaningless. As an alternative to imposing a hierarchical descent on the data, partitioning methods, such as SOMs and k-means clustering, can divide a data set into similar, but distinct groups. Hierarchical clustering can then be applied to each partitioned set of genes. The purpose of SOMs is similar to that of hierarchical clustering in that they divide the data into groups, while also allowing one to impose partial structure on the data. If we regard n points (i.e., the expression levels of n genes) in k-dimensional space, where k is the number of samples, then SOMs impose structure to the data as follows. One first chooses an initial geometry of nodes, such as a 3 × 2 grid (i.e., six clusters or reference vectors), and maps them into k-dimensional space. The nodes are mapped initially at random, and then adjusted iteratively over some 20,000 to 50,000 iterations. For each iteration, a data point n is randomly selected and the nodes are moved in its direction. The closest node is moved the most and the other nodes are moved by smaller amounts depending on their distance from the closest node in the initial
geometry. By this process, neighboring points in the initial geometry are mapped to nearby points in k-dimensional space. Neighboring nodes define similar clusters (i.e., correlated), while nodes on the opposing corners of the grid define opposite clusters (i.e., anticorrelative). Changing the initial geometry of the nodes imparts an alternative structure to the clusters. The selection of grid geometry and the number of clusters is still considered an arbitrary process because the number of significant patterns of gene expression for any one system is not predictable. Another drawback is the necessity to place genes into discrete and nonoverlapping clusters. The Plaid model for gene expression array data overcomes these limitations by allowing clusters of genes or samples to overlap. It also allows clusters of genes to be defined by only a subset of samples.166 SOMs were first applied to gene expression data by Tamayo et al.167 to cluster genes based on the yeast cell cycle and hematopoietic differentiation. The GENECLUSTER software package developed and used by Tamayo et al. produces and displays SOMs of gene expression data (Table 1). K-means clustering is a partitioning method similar to SOMs, in that the data are partitioned according to the number (k) of predefined reference vectors. Each data point n is mapped to its most similar reference vector. Each reference vector is then recalculated as the average of the points that mapped to it. These steps are repeated iteratively until all genes map to the same reference vector on consecutive iterations. This iterative reallocation of cluster members minimizes the overall within-cluster dispersion. In contrast to SOMs, k-means clustering does not position similar clusters adjacent to each other. Like SOMs, this method depends heavily on the choice of k. Tavazoie et al.168 applied k-means clustering to yeast mitotic cell cycle data. 
In this example, the number of reference vectors k was initially set at 10, 30, and 60, but 30 was ultimately chosen because it provided the best compromise between the number of clusters and separation between them. As with SOMs, k should be chosen empirically based on the size and separation of the resulting clusters, as well as the biology of the phenomenon being explored.
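The k-means procedure described above (map each point to its nearest reference vector, recompute each reference vector as the mean of its members, and repeat until assignments stop changing) can be sketched in Python. This is a minimal illustration with invented data; the initial reference vectors are supplied by the caller so that the result is reproducible, whereas practical implementations usually choose them at random.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, initial_vectors, max_iter=100):
    """Partition points around k reference vectors.

    Returns (assignment, vectors), where assignment[i] is the index of the
    reference vector that point i maps to.
    """
    vectors = [list(v) for v in initial_vectors]  # do not modify caller's data
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(vectors)), key=lambda c: squared_distance(p, vectors[c]))
            for p in points
        ]
        if new_assignment == assignment:  # same mapping on consecutive iterations
            break
        assignment = new_assignment
        for c in range(len(vectors)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous reference vector
                vectors[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, vectors


# Hypothetical expression profiles of four genes over two conditions, k = 2.
points = [[0, 0], [0, 1], [10, 10], [10, 11]]
assignment, vectors = kmeans(points, [[0, 0], [10, 10]])
```

Running the routine with several values of k and comparing cluster size and separation mirrors the empirical choice of k described in the text.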
Principal component analysis (PCA) can also partition gene expression data into groups; however, unlike SOMs and k-means clustering, k is unknown and dictated by the variability in the data set. Multivariate methods, such as hierarchical clustering, attempt to reduce the dimensionality of n objects in order to find similar groups within the data. PCA is a multivariate statistical technique that also attempts to reduce the dimensionality of the data in order to find subsets within the data that explain the most variation.169 This approach identifies those variables (i.e., samples) in the data that best explain the difference in the observations (i.e., gene expression levels). A set of n objects to be clustered, each measured under m conditions, can be thought to exist in m-dimensional space. Given n observations (i.e., genes) among m conditions, the goal of PCA is to find r new variables (termed principal components) that explain as much of the variance in the original n × m data as possible. The first principal component is a linear combination of gene expression profiles that explains the largest proportion of the variation within the data. The second principal component is another linear combination of gene expression profiles that explains the next largest proportion of the variation within the remaining objects, while remaining mutually uncorrelated and orthogonal (i.e., at a right angle) to the first principal component. Raychaudhuri et al.170 used PCA to analyze yeast sporulation data. In this example, the expression data for the sporulation time series could be summarized by the first two principal components, which accounted for over 90% of the total variability. The first PC indicated genes with increasing average expression across all time points, while the second PC identified genes with an increasing positive trend.
The application of PCA to the sporulation data set also illustrated how hierarchical clustering may be inappropriate for data sets with smoothly varying distributions, because PCA found many genes to be highly correlated with members of other clusters.
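As a rough illustration of how a first principal component and its share of the variance can be obtained, the following Python sketch centers the data, builds the covariance matrix between conditions, and extracts the leading eigenvector by power iteration. Power iteration is a choice made here for brevity and is not the method used in the cited study; the function name and the small data set are also assumptions.

```python
import math


def first_principal_component(data, iterations=200):
    """Leading principal component of an n-genes x m-conditions table.

    Returns (component, explained), where `explained` is the fraction of the
    total variance captured by the component.
    """
    n, m = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in data]
    # m x m covariance matrix between conditions.
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(m)]
           for a in range(m)]
    v = [1.0 / math.sqrt(m)] * m  # arbitrary unit-length starting vector
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Variance along v (Rayleigh quotient) relative to the total variance.
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(m))
                     for a in range(m))
    explained = eigenvalue / sum(cov[a][a] for a in range(m))
    return v, explained


# Hypothetical data: four genes whose second condition is twice the first.
data = [[1, 2], [2, 4], [3, 6], [4, 8]]
component, explained = first_principal_component(data)
```

Because the invented data lie exactly on one line, a single component accounts for all of the variance; real expression data would need further components, each found after removing the variance explained by the previous ones.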
F. Gene Expression Profiling in Mechanistic and Predictive Toxicology
One of the first applications demonstrating the utility of gene expression profiling to characterize toxicity was by Marton et al.171 Using cDNA microarrays containing essentially every open reading frame (ORF) in the yeast genome (6000+), Marton et al. were able to show that the gene expression profile induced by the immunosuppressant FK506 was highly correlated with the expression profile induced by the mechanistically similar immunosuppressant cyclosporin A. These expression signatures were not, however, correlated with the expression profile induced by other unrelated drugs. These experiments demonstrate the principle that chemicals can induce characteristic and unique gene expression profiles. Therefore, expression profiles may be used to identify and classify unknown chemicals with respect to mechanism of action. The use of yeast as a model system allowed Marton et al. to also verify the drug target (i.e., calcineurin) by correlating the expression profile of the immunosuppressants to the expression profile induced by genetic disruption of the target protein. In addition, it was demonstrated that off-target effects exist for FK506 since a distinct profile of gene expression was produced in a calcineurin mutant strain treated with the immunosuppressant. Although the toxicity of FK506 is primarily mechanism based, these effects point to potential causes of unwanted side effects. The correlative studies by Marton et al. predicted that gene expression profiles from a large number of mutant and chemically treated cells could be used to dissect cellular pathways, determine gene function, and identify mechanisms of action of unknown chemicals. Using a ‘compendium’ of yeast expression profiles from over 300 diverse mutants and chemical treatments, Hughes et al.161 were able to identify and subsequently confirm gene function for a number of uncharacterized ORFs. They were also able to identify a previously unknown drug target for the commonly used topical anesthetic dyclonine.
This was accomplished by matching gene expression profiles caused by uncharacterized perturbations (e.g., gene disruption or chemical inhibition) with a large set of reference profiles corresponding to disturbances of known cellular pathways. Hierarchical clustering and correlation measures were used to match similar expression profiles. When the dyclonine-induced expression profile was compared to the compendium, it was found to be correlated (r = 0.82)
with the profile resulting from genetic disruption of the ergosterol pathway, specifically erg2. The human gene with the greatest sequence similarity to the erg2 protein was found to be the sigma receptor, which is known to bind a number of neuroactive drugs and other inhibitory compounds that target both yeast erg2p and the human sigma receptor. Despite the use of yeast as a model system, a potential mechanism of action for dyclonine can be inferred for mammalian systems. Yeast has also been used successfully to identify signaling pathways that play a role in the transcriptional response to a variety of physical and chemical perturbations,172 including heat shock, osmotic shock, nitrogen and amino acid depletion, oxidative stress, and others. When the gene expression profiles induced by a panel of these environmental stresses were compared by hierarchical clustering, a set of ~900 genes was found to show a similar response to almost all of the environmental changes. Overall, transcriptional responses to environmental stress were transient, while the duration and amplitude of change varied with the magnitude of the stress. Their regulation, however, was dependent on many signaling systems that acted in a condition-specific and gene-specific manner, rather than being controlled by a single stress-sensing pathway. Additional clusters of genes were found to be altered only under specific conditions, while no two conditions produced identical expression patterns. To determine what factors governed the response to stress, the promoters of subclusters of responsive genes were analyzed for common regulatory elements. It was found that many stress-responding genes contained Msn2p and/or Msn4p binding sites. By examining the expression profile of yeast msn2 msn4 mutants, or yeast overexpressing MSN2 or MSN4, it was determined that these transcription factors play a role in regulating expression of a subset of responsive genes following environmental stress.
Other responsive genes were not affected by MSN2/4 and are thought to be under control of other independent signaling pathways. This elaborate set of studies demonstrates the complexity with which cells can detect and respond to unique forms of physical and chemical stressors and underscores the utility of transcriptional responses to diagnose cellular perturbations and decipher the mechanisms of action of chemicals.
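The compendium approach described above, in which the profile from an uncharacterized perturbation is matched against reference profiles by correlation, can be sketched in Python. Everything in this example is hypothetical: the pathway labels, the log ratios, and the function names are invented for illustration and do not come from the cited studies.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient r between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def best_match(query, compendium):
    """Reference perturbation whose profile correlates best with the query."""
    scores = {name: pearson(query, profile)
              for name, profile in compendium.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Invented log2 ratios for four genes under two reference perturbations.
compendium = {
    "pathway_A_disruption": [1.0, 2.0, -1.0, 0.0],
    "pathway_B_disruption": [-1.0, 0.5, 2.0, -0.5],
}
# Invented profile produced by a chemical of unknown mechanism.
query = [0.9, 2.1, -0.8, 0.1]

name, r = best_match(query, compendium)
```

A high correlation with one reference profile only suggests a shared mechanism; as in the work described above, the prediction still needs independent confirmation.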
cDNA microarrays have also been applied to mammalian systems, particularly human cell lines. While mammalian cell lines are not as amenable to genetic manipulation as yeast, they have proven to be useful for identifying genes responsive to stressors, such as irradiation-induced apoptosis,173,174 DNA-damaging agents and antiinflammatory drugs,164 the hepatotoxicant carbon tetrachloride,175 metals and combustion byproducts,176-178 and β-naphthoflavone.179 Whether induction or repression of genes by toxicant exposure is a direct or an indirect effect is not clear, but cotreatment with protein synthesis inhibitors can be used to distinguish between primary toxicant effects and delayed secondary effects. For example, Puga et al.180 discovered 310 genes that were up- or down-regulated at least 2.1-fold by 8 h exposure to dioxin. However, inhibition of protein synthesis by cycloheximide treatment resulted in only 108 genes being deregulated. The transcriptional responses to chemical stressors in mammalian cells are also complicated by cellular heterogeneity and genotype, which presents further challenges in interpreting gene expression changes in vivo in experimental animals and human populations. Nonetheless, gene expression profiling in combination with analytical approaches in silico will continue to revolutionize mechanistic and predictive toxicology.
VIII. CONCLUSIONS
The field of bioinformatics is rapidly changing, and more sophisticated and effective programs are being continually developed for a number of diverse applications. However, these new computational methods applicable to the study of molecular toxicology are not replacing, and will not replace, traditional methods of investigation, but are meant to offer an alternative and sometimes more efficient means of studying molecular processes. Understanding the concepts and their practical applications is essential to make full use of the public databases of sequence, expression, and structural information. However, the application of in silico methods must be exercised with caution, as digital information is just as susceptible to error as analog, or experimental, information. Not only do sequencing errors exist, but errors in sequence annotation, storage, tracking,
and retrieval can complicate the interpretation of results and lead to false positives. With the near completion of a number of mammalian genome sequences, such as human and mouse, investigating gene function will be of primary importance. While genetic perturbation using gene knockout technology in mice represents a powerful means of identifying gene function, chemical perturbation experiments represent another fruitful means to identify gene function and explore the basis of normal physiological and cellular processes, thus leading to an increased understanding of the molecular basis of toxicity.
APPENDIX A: GLOSSARY OF TERMS
Algorithm: A process or set of rules by which a calculation or process can be carried out (e.g., by a computer program).
Agglomerative clustering: A form of clustering that starts from the bottom and progresses upward by merging two objects together to form new objects, which are successively merged with similar objects, such that all objects are connected at the top by a single root.
Bioinformatics: A scientific discipline that encompasses all aspects of biological information acquisition, processing, storage, distribution, analysis, and interpretation.
Bit score: An alignment score (S′) calculated from the raw score (S). The bit score is calculated from the formula $S' = (\lambda S - \ln K)/\ln 2$, where $\lambda$ and K are parameters dependent on the substitution matrix and gap costs, respectively. The bit score represents the raw score normalized for the scoring system used (i.e., $\lambda$ and K), such that alignment scores from different searches can be compared.
BLAST: Basic Local Alignment Search Tool. A heuristic algorithm designed to rapidly search large DNA or protein sequence databases for sequences with optimal local alignment to a query sequence.
BLOSUM: Blocks Substitution Matrix. A substitution matrix representing the observed frequencies of amino acid mutations in local alignments of closely related proteins.
cDNA: Complementary Deoxyribonucleic Acid. DNA, either single or double stranded, which is complementary in nucleotide sequence to the template sequence from which it was synthesized. cDNA is typically synthesized by first- and second-strand synthesis from single-stranded messenger RNA (mRNA), thus making a copy of the protein coding sequence.
cDNA library: A collection of cloned DNA from an organism or tissue. mRNA is converted into double-stranded DNA, cloned into a vector, and transformed into a bacterial host for propagation. Each bacterial clone harbors a cloned cDNA from the organism or tissue from which the mRNA was obtained.
cDNA microarray: An experimental platform designed to measure the expression level for hundreds to thousands of genes in parallel in a single experiment. Microarrays typically refer to cDNA spotted at high density on glass slides, but cDNA or oligonucleotides may be spotted or synthesized in situ on other substrates, such as nylon.
Clone: Typically referring to a bacterium grown from a single colony and harboring a plasmid conferring antibiotic resistance and typically a cDNA insert.
Cluster: A group of DNA sequences related by identity (see Unigene Database in Table 1, and EST cluster). May also refer to a group of genes related by expression pattern across multiple conditions (see hierarchical clustering).
Consensus sequence: The sequence that is common to a collection of overlapping sequences, which are not necessarily identical.
Correlation coefficient: A measure of the degree to which variables vary together.
cSNP: Coding Single Nucleotide Polymorphism. A SNP found in the coding region of a gene, which may affect protein function by virtue of amino acid changes (see SNP).
Dendrogram: A graphical representation of the hierarchical relationship between variables. Resembles a tree, where variables (leaves) that are more closely related are closer together on the branch, and branches that are more closely related are closer together on the tree. The length of each branch represents the measure of correlation between variables.
Domain: Typically refers to a discrete functional subunit within a larger unit, such as a DNA binding domain within a larger transcription factor protein.
EST: Expressed Sequence Tag. A single sequence read of a cDNA clone from the 5′ or 3′ end.
EST clustering: An algorithm that bins identical or overlapping expressed sequence tags into discrete clusters, where each cluster represents a unique expressed gene.
Euclidean distance: The straight line distance between two points $P_1$ and $P_2$. In a plane with $P_1$ at $(x_1, y_1)$ and $P_2$ at $(x_2, y_2)$, the Euclidean distance is given by $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$.
E-value: Expectation value. The number of unique alignments with scores equivalent to or greater than S that are expected to occur in a database search by chance alone. The lower the E-value, the more significant the alignment.
Extreme value distribution: The distribution of expected scores for the local alignment of a pair of ungapped random amino acid sequences. Used to calculate the E (expect) value for a protein sequence alignment, which indicates the number of distinct alignments with an equivalent or superior score that would be expected to occur by chance alone.
FASTA: An early algorithm designed to search large DNA or protein sequence databases for sequences with optimal local alignment to a query sequence.
Functional genomics: The study of gene function on a whole or partial genome scale, which may include the study of gene expression using cDNA microarrays.
Global alignment: The alignment of two DNA or protein sequences over their entire length.
Heuristics: A method of making approximations to increase the computational speed of an algorithm.
Hidden Markov models: Probabilistic models for multiple sequence alignments, database searches, protein classification, gene finding, and pattern recognition.
Hierarchical clustering: A form of clustering that groups objects into smaller groups or clusters, such that objects within a group are more similar to each other than to members of other clusters. Clusters are often visualized with a dendrogram, which represents objects and their relationships to each other like branches of a tree.
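The Euclidean distance entry above gives the two-dimensional case. A minimal sketch of the same calculation, generalized to profiles of any length (an extension assumed here because expression profiles usually span more than two conditions), might look like this:

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    # Sum the squared coordinate differences, then take the square root
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
```

For two points in a plane, `euclidean_distance((0, 0), (3, 4))` reduces to the formula in the glossary entry. Smaller distances indicate more similar profiles, so this measure can serve as an alternative to correlation when clustering.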
Homologous: Implying an evolutionary relationship between similar sequences within and between species.
IMAGE Consortium: Integrated Molecular Analysis of Genomes and their Expression Consortium. An academic consortium with the goal of sharing and distributing cDNA libraries and sequence, map, and expression information in the public domain.
In silico: Computational-based method of study, particularly for biological systems.
K-means clustering: A form of clustering that groups objects into smaller groups or clusters, such that objects within a group are more similar to each other than to members of other clusters. The objects are binned into k clusters, where k is determined a priori. Unlike hierarchical clustering, the relationship between clusters is not determined.
Ligand docking: A computational method for docking substrates into the active sites of proteins of known or modeled structure.
Linkage disequilibrium analyses: The analysis of the association of allelic markers for estimating genetic linkage of disease genes or other genetic traits.
Local alignment: The alignment of some subsequence of two DNA or protein sequences.
Metabonomics: ‘The quantitative measurement of the dynamic multiparametric metabolic response of living systems to pathophysiological stimuli or genetic modification’ (152).
Microsatellite: Stretches of DNA that consist of tandem repeats of simple nucleotide sequence (e.g., TTA repeated 15 to 20 times in succession).
Molecular modeling: Computational methods used to predict the three-dimensional structure of proteins based on sequence similarity to proteins of known structure.
Multiple sequence alignment: The local or global alignment of three or more homologous sequences.
NCBI: National Center for Biotechnology Information.
Needleman-Wunsch algorithm: An algorithm used to find the optimal alignment between two sequences over their entire length.
Normalization: A method to reduce the redundancy of a cDNA library, such that rare sequences are enriched and abundant sequences are reduced.
Orthologous: Refers to a homologous sequence between species that arose from a common ancestral gene during speciation.
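To make the Needleman-Wunsch entry above more concrete, the sketch below fills the dynamic programming table for a global alignment and returns the optimal score. The scoring scheme (match +1, mismatch -1, linear gap penalty -2) is an assumption chosen for illustration; practical protein alignments would normally use a substitution matrix such as PAM or BLOSUM and often affine gap penalties.

```python
def needleman_wunsch_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score under a simple linear gap penalty."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            same = seq_a[i - 1] == seq_b[j - 1]
            diagonal = score[i - 1][j - 1] + (match if same else mismatch)
            gap_in_b = score[i - 1][j] + gap   # residue of seq_a aligned to a gap
            gap_in_a = score[i][j - 1] + gap   # residue of seq_b aligned to a gap
            score[i][j] = max(diagonal, gap_in_b, gap_in_a)
    # The bottom-right cell covers both sequences over their entire length
    return score[-1][-1]
```

Because every cell depends only on its three neighbors, the table is filled in time proportional to the product of the two sequence lengths. The Smith-Waterman local alignment differs mainly in resetting negative cells to zero and taking the highest cell anywhere in the table.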
PAM: Point Accepted Mutation. A unit representing the amount of evolutionary change in a protein sequence. Used for calculating PAM(x) substitution matrices for proteins with x amount of evolutionary change.
Paralogous: Refers to two homologous sequences within a species that arose by an early gene duplication event.
Phylogenetic tree: Graphical representation of the evolutionary relationships and lineage branching among organisms or their genomes.
Position weight matrix: A matrix describing the observed frequency of nucleotides or amino acids at each position of a sequence alignment.
Protein domain: A discrete functional amino acid sequence of a protein that is characterized by independent folding units within the larger protein structure.
Proteomics: The study of protein expression on a whole or partial genome scale.
QSAR: Quantitative Structure Activity Relationship. An attempt to correlate physical or chemical descriptors of compounds with chemical measurements or biological activities.
SAGE: Serial Analysis of Gene Expression. A high-throughput sequencing-based method for determining the gene expression level for thousands of expressed sequences.
Singleton: An expressed sequence tag that does not cluster with other sequences, or bear any similarity to other sequences in the database.
STS: Sequence-Tagged Sites. Short genomic landmark sequences (200 to 500 bp) that can be detected by PCR and used for physical mapping of a genome.
SOMs: Self-organizing maps. A form of clustering that reduces multidimensional data into discrete clusters by a method derived from neural networks. Data are mapped onto nodes, where nodes that are closer to each other in two-dimensional space are more similar to each other than nodes further away. Like k-means clustering, the number of clusters is determined a priori.
Smith-Waterman algorithm: An algorithm for finding the optimal local alignment between two sequences.
SNP: Single nucleotide polymorphism. A single nucleotide substitution, addition or deletion in the genome. The locus is considered polymorphic if at least two variants exist and the allele frequency of the least common variant is > 1%.
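The position weight matrix entry above can be illustrated with a short sketch that tabulates nucleotide frequencies from aligned binding sites and scores a candidate sequence. The aligned sites are hypothetical, and the pseudocount, the uniform background of 0.25, and the log-odds scoring are assumptions commonly paired with such matrices rather than details taken from this glossary.

```python
import math

def build_frequency_matrix(sites, alphabet="ACGT", pseudocount=1.0):
    """Per-position nucleotide frequencies from equal-length aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all aligned sites must have the same length")
    matrix = []
    for position in range(length):
        # The pseudocount keeps unobserved bases from getting zero frequency
        counts = {base: pseudocount for base in alphabet}
        for site in sites:
            counts[site[position]] += 1
        total = sum(counts.values())
        matrix.append({base: counts[base] / total for base in alphabet})
    return matrix

def log_odds_score(matrix, sequence, background=0.25):
    """Sum of log2(frequency / background) over the positions of the sequence."""
    if len(sequence) != len(matrix):
        raise ValueError("sequence length must match the matrix length")
    return sum(math.log2(matrix[i][base] / background)
               for i, base in enumerate(sequence))

# Hypothetical aligned binding sites, for illustration only
example_sites = ["TGCGTG", "TGCGTG", "CGCGTG", "AGCGTG"]
example_matrix = build_frequency_matrix(example_sites)
```

Sliding such a matrix along a promoter sequence and keeping windows whose score exceeds a chosen threshold is one way putative response elements can be flagged for follow-up.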
Substitution matrix: A matrix of values representing the estimated probabilities of amino acid mutations occurring through evolution.
Subtractive hybridization: A method for enriching a cDNA library for sequences that are differentially expressed between two mRNA populations.
Toxicogenomics: A scientific discipline that applies toxicology, functional genomics, bioinformatics and statistics to predictive, mechanistic, and discovery toxicology.
Unsupervised clustering: A form of clustering in which the algorithm is blind to the objects or the number of clusters to be formed, such that the clustering outcome is unbiased.
Virtual northern: A visual representation of the abundance of expressed sequences in a database. Expressed sequences may be derived from SAGE or cDNA libraries. The visual representation is similar to a Northern blot, where the intensity of the signal is proportional to the level of expression in the samples being compared.
REFERENCES
1. Benton, D., Bioinformatics - principles and potential of a new multidisciplinary tool. Trends Biotechnol. 14: 261–272, 1996.
2. Norris, D.A., Leesman, G.D., Sinko, P.J., and Grass, G.M., Development of predictive pharmacokinetic simulation models for drug discovery. J. Controlled Release 65: 55–62, 2000.
3. Blake, J.F., Chemoinformatics - predicting the physicochemical properties of ‘drug-like’ molecules. Curr. Opin. Biotechnol. 11: 104–107, 2000.
4. Adams, M.D., Kelley, J.M., Gocayne, J.D., Dubnick, M., Polymeropoulos, M.H., Xiao, H., Merril, C.R., Wu, A., Olde, B., Moreno, R.F., et al., Complementary DNA sequencing: expressed sequence tags and human genome project. Science 252: 1651–1656, 1991.
5. Soares, M.B., Identification and cloning of differentially expressed genes. Curr. Opin. Biotechnol. 8: 542–546, 1997.
6. Urmenyi, T.P., Bonaldo, M.F., Soares, M.B., and Rondinelli, E., Construction of a normalized cDNA library for the Trypanosoma cruzi genome project. J. Eukaryot. Microbiol. 46: 542–544, 1999.
7. Diatchenko, L., Lau, Y.F., Campbell, A.P., Chenchik, A., Moqadam, F., Huang, B., Lukyanov, S., Lukyanov, K., Gurskaya, N., Sverdlov, E.D., and Siebert, P.D., Suppression subtractive hybridization: a method for generating differentially regulated or tissue-specific cDNA probes and libraries. Proc. Natl. Acad. Sci. U.S.A. 93: 6025–6030, 1996.
8. Bonaldo, M.F., Lennon, G., and Soares, M.B., Normalization and subtraction: two approaches to facilitate gene discovery. Genome Res. 6: 791–806, 1996.
9. Gurskaya, N.G., Diatchenko, L., Chenchik, A., Siebert, P.D., Khaspekov, G.L., Lukyanov, K.A., Vagner, L.L., Ermolaeva, O.D., Lukyanov, S.A., and Sverdlov, E.D., Equalizing cDNA subtraction based on selective suppression of polymerase chain reaction: cloning of Jurkat cell transcripts induced by phytohemaglutinin and phorbol 12-myristate 13-acetate. Anal. Biochem. 240: 90–97, 1996.
10. Boguski, M.S., Lowe, T.M., and Tolstoshev, C.M., dbEST - database for expressed sequence tags. Nat. Genet. 4: 332–333, 1993.
11. Quackenbush, J., Liang, F., Holt, I., Pertea, G., and Upton, J., The TIGR gene indices: reconstruction and representation of expressed gene sequences. Nucleic Acids Res. 28: 141–145, 2000.
12. Wheeler, D.L., Chappey, C., Lash, A.E., Leipe, D.D., Madden, T.L., Schuler, G.D., Tatusova, T.A., and Rapp, B.A., Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 28: 10–14, 2000.
13. Liang, F., Holt, I., Pertea, G., Karamycheva, S., Salzberg, S.L., and Quackenbush, J., An optimized protocol for analysis of EST sequences. Nucleic Acids Res. 28: 3657–3665, 2000.
14. Velculescu, V.E., Zhang, L., Vogelstein, B., and Kinzler, K.W., Serial analysis of gene expression. Science 270: 484–487, 1995.
15. Ringwald, M., Mangan, M.E., Eppig, J.T., Kadin, J.A., and Richardson, J.E., GXD: a gene expression database for the laboratory mouse. The Gene Expression Database Group. Nucleic Acids Res. 27: 106–112, 1999.
16. Vasmatzis, G., Essand, M., Brinkmann, U., Lee, B., and Pastan, I., Discovery of three genes specifically expressed in human prostate by expressed sequence tag database analysis. Proc. Natl. Acad. Sci. U.S.A. 95: 300–304, 1998.
17. Audic, S. and Claverie, J.M., The significance of digital gene expression profiles. Genome Res. 7: 986–995, 1997.
18. Lash, A.E., Tolstoshev, C.M., Wagner, L., Schuler, G.D., Strausberg, R.L., Riggins, G.J., and Altschul, S.F., SAGEmap: a public gene expression resource. Genome Res. 10: 1051–1060, 2000.
19. Thomas, R., Rank, D., Penn, S., Zastrow, G., Gu, Y., Glover, E., Bunger, M., Jovanovich, S., and Bradfield, C., Dioxin induced changes in global expression as measured using EST frequency and cDNA microarrays: identification of a secondary gene battery. Toxicol. Sci. 54 Suppl. 1: 2000.
20. Loftus, S.K., Chen, Y., Gooden, G., Ryan, J.F., Birznieks, G., Hilliard, M., Baxevanis, A.D., Bittner, M., Meltzer, P., Trent, J., and Pavan, W., Informatic selection of a neural crest-melanocyte cDNA set for microarray analysis. Proc. Natl. Acad. Sci. U.S.A. 96: 9277–9280, 1999.
21. Miller, G., Fuchs, R., and Lai, E., IMAGE cDNA clones, UniGene clustering, and ACeDB: an integrated resource for expressed sequence information. Genome Res. 7: 1027–1032, 1997.
22. Yuan, Y.P., Eulenstein, O., Vingron, M., and Bork, P., Toward detection of orthologues in sequence databases. Bioinformatics 14: 285–289, 1998.
23. Needleman, S.B. and Wunsch, C.D., A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48: 443–453, 1970.
24. Smith, T.F. and Waterman, M.S., Identification of common molecular subsequences. J. Mol. Biol. 147: 195–197, 1981.
25. Pearson, W.R. and Lipman, D.J., Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. U.S.A. 85: 2444–2448, 1988.
26. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J., Basic local alignment search tool. J. Mol. Biol. 215: 403–410, 1990.
27. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25: 3389–3402, 1997.
28. Mushegian, A.R., Bassett, D.E., Jr., Boguski, M.S., Bork, P., and Koonin, E.V., Positionally cloned human disease genes: patterns of evolutionary conservation and functional motifs. Proc. Natl. Acad. Sci. U.S.A. 94: 5831–5836, 1997.
29. Huynen, M., Doerks, T., Eisenhaber, F., Orengo, C., Sunyaev, S., Yuan, Y., and Bork, P., Homology-based fold predictions for Mycoplasma genitalium proteins. J. Mol. Biol. 280: 323–326, 1998.
30. Wootton, J.C. and Federhen, S., Analysis of compositionally biased regions in sequence databases. Methods Enzymol. 266: 554–571, 1996.
31. Dayhoff, M., Schwartz, R., and Orcutt, B., A model of evolutionary change in proteins, in Atlas of Protein Sequence and Structure, Dayhoff, M., Ed., National Biomedical Research Foundation, Washington, DC, pp. 345–352, 1978.
32. Henikoff, S. and Henikoff, J.G., Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. U.S.A. 89: 10915–10919, 1992.
33. Altschul, S.F., Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 219: 555–565, 1991.
34. Karlin, S. and Altschul, S.F., Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. U.S.A. 87: 2264–2268, 1990.
35. Altschul, S.F. and Gish, W., Local alignment statistics. Methods Enzymol. 266: 460–480, 1996.
36. Makalowski, W., Zhang, J., and Boguski, M.S., Comparative analysis of 1196 orthologous mouse and human full-length mRNA and protein sequences. Genome Res. 6: 846–857, 1996.
37. Hogenesch, J.B., Chan, W.K., Jackiw, V.H., Brown, R.C., Gu, Y.Z., Pray-Grant, M., Perdew, G.H., and Bradfield, C.A., Characterization of a subset of the basic-helix-loop-helix-PAS superfamily that interacts with components of the dioxin signaling pathway. J. Biol. Chem. 272: 8581–8593, 1997.
38. Taylor, B.L. and Zhulin, I.B., PAS domains: internal sensors of oxygen, redox potential, and light. Microbiol. Mol. Biol. Rev. 63: 479–506, 1999.
39. Gu, Y.Z., Hogenesch, J.B., and Bradfield, C.A., The PAS superfamily: sensors of environmental and developmental signals. Annu. Rev. Pharmacol. Toxicol. 40: 519–561, 2000.
40. Crews, S.T., Control of cell lineage-specific development and transcription by bHLH-PAS proteins. Genes Dev. 12: 607–620, 1998.
41. Hogenesch, J.B., Gu, Y.Z., Jain, S., and Bradfield, C.A., The basic-helix-loop-helix-PAS orphan MOP3 forms transcriptionally active complexes with circadian and hypoxia factors. Proc. Natl. Acad. Sci. U.S.A. 95: 5474–5479, 1998.
42. Banfi, S., Borsani, G., Rossi, E., Bernard, L., Guffanti, A., Rubboli, F., Marchitiello, A., Giglio, S., Coluccia, E., Zollo, M., Zuffardi, O., and Ballabio, A., Identification and mapping of human cDNAs homologous to Drosophila mutant genes through EST database searching. Nat. Genet. 13: 167–174, 1996.
43. Rotig, A., Valnot, I., Mugnier, C., Rustin, P., and Munnich, A., Screening human EST database for identification of candidate genes in respiratory chain deficiency. Mol. Genet. Metab. 69: 223–232, 2000.
44. Collins, F.S., Patrinos, A., Jordan, E., Chakravarti, A., Gesteland, R., and Walters, L., New goals for the U.S. Human Genome Project: 1998–2003. Science 282: 682–689, 1998.
45. Pennisi, E., Genomics. Rat genome off to an early start. Science 289: 1267–1269, 2000.
46. Blake, J.A., Eppig, J.T., Richardson, J.E., and Davisson, M.T., The Mouse Genome Database (MGD): expanding genetic and genomic resources for the laboratory mouse. The Mouse Genome Database Group. Nucleic Acids Res. 28: 108–111, 2000.
47. Drews, J., Drug discovery: a historical perspective. Science 287: 1960–1964, 2000.
48. Thompson, J.D., Higgins, D.G., and Gibson, T.J., CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22: 4673–4680, 1994.
49. Barton, G.J. and Sternberg, M.J., A strategy for the rapid multiple alignment of protein sequences. Confidence levels from tertiary structure comparisons. J. Mol. Biol. 198: 327–337, 1987.
50. Morgenstern, B., Frech, K., Dress, A., and Werner, T., DIALIGN: finding local similarities by multiple sequence alignment. Bioinformatics 14: 290–294, 1998.
51. Depiereux, E., Baudoux, G., Briffeuil, P., Reginster, I., De Bolle, X., Vinals, C., and Feytmans, E., MatchBox_server: a multiple sequence alignment tool placing emphasis on reliability. Comput. Appl. Biosci. 13: 249–256, 1997.
52. Smith, R.F. and Smith, T.F., Pattern-induced multisequence alignment (PIMA) algorithm employing secondary structure-dependent gap penalties for use in comparative protein modelling. Protein Eng. 5: 35–41, 1992.
53. Gaskell, G.J., Multiple sequence alignment tools on the web. BioTechniques 29: 60–62, 2000.
54. Marti-Renom, M.A., Stuart, A.C., Fiser, A., Sanchez, R., Melo, F., and Sali, A., Comparative protein structure modeling of genes and genomes. Annu. Rev. Biophys. Biomol. Struct. 29: 291–325, 2000.
55. Briffeuil, P., Baudoux, G., Lambert, C., De Bolle, X., Vinals, C., Feytmans, E., and Depiereux, E., Comparative analysis of seven multiple protein sequence alignment servers: clues to enhance reliability of predictions. Bioinformatics 14: 357–366, 1998.
56. Thompson, J.D., Plewniak, F., and Poch, O., A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res. 27: 2682–2690, 1999.
57. Thompson, J.D., Plewniak, F., Thierry, J., and Poch, O., DbClustal: rapid and reliable global multiple alignments of protein sequences detected by database searches. Nucleic Acids Res. 28: 2919–2926, 2000.
58. Bairoch, A., PROSITE: a dictionary of sites and patterns in proteins. Nucleic Acids Res. 19 Suppl: 2241–2245, 1991.
59. Gribskov, M., McLachlan, A.D., and Eisenberg, D., Profile analysis: detection of distantly related proteins. Proc. Natl. Acad. Sci. U.S.A. 84: 4355–4358, 1987.
60. Bucher, P., Karplus, K., Moeri, N., and Hofmann, K., A flexible motif search technique based on generalized profiles. Comput. Chem. 20: 3–23, 1996.
61. Sonnhammer, E.L., Eddy, S.R., and Durbin, R., Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins 28: 405–420, 1997.
62. Attwood, T.K., Beck, M.E., Flower, D.R., Scordis, P., and Selley, J.N., The PRINTS protein fingerprint database in its fifth year. Nucleic Acids Res. 26: 304–308, 1998.
63. Pietrokovski, S., Henikoff, J.G., and Henikoff, S., The Blocks database - a system for protein classification. Nucleic Acids Res. 24: 197–200, 1996.
64. Schultz, J., Milpetz, F., Bork, P., and Ponting, C.P., SMART, a simple modular architecture research tool: identification of signaling domains. Proc. Natl. Acad. Sci. U.S.A. 95: 5857–5864, 1998.
65. Corpet, F., Gouzy, J., and Kahn, D., Recent improvements of the ProDom database of protein domain families. Nucleic Acids Res. 27: 263–267, 1999.
66. Apweiler, R., Protein sequence databases. Adv. Protein Chem. 54: 31–71, 2000.
67. Bernstein, F.C., Koetzle, T.F., Williams, G.J., Meyer, E.E., Jr., Brice, M.D., Rodgers, J.R., Kennard, O., Shimanouchi, T., and Tasumi, M., The Protein Data Bank: a computer-based archival file for macromolecular structures. J. Mol. Biol. 112: 535–542, 1977.
68. Rost, B., Twilight zone of protein sequence alignments. Protein Eng. 12: 85–94, 1999.
69. Szklarz, G.D., Ornstein, R.L., and Halpert, J.R., Application of 3-dimensional homology modeling of cytochrome P450 2B1 for interpretation of site-directed mutagenesis results. J. Biomol. Struct. Dyn. 12: 61–78, 1994.
70. Levitt, M., Accurate modeling of protein conformation by automatic segment matching. J. Mol. Biol. 226: 507–533, 1992.
71. Moult, J., Predicting protein three-dimensional structure. Curr. Opin. Biotechnol. 10: 583–588, 1999.
72. Szklarz, G.D., Graham, S.E., and Paulsen, M.D., Molecular modeling of mammalian cytochromes P450: application to study enzyme function. Vitam. Horm. 58: 53–87, 2000.
73. Sanchez-Ferrer, A., Nunez-Delicado, E., and Bru, R., Software for reviewing biomolecules in three dimensions on the Internet. Trends Biochem. Sci. 20: 286–288, 1995.
74. Sayle, R.A. and Milner-White, E.J., RASMOL: biomolecular graphics for all. Trends Biochem. Sci. 20: 374, 1995.
75. Kuntz, I.D., Meng, E.C., and Shoichet, B.K., Structure-based molecular design. Acc. Chem. Res. 27: 117–123, 1994.
76. Kuntz, I.D., Blaney, J.M., Oatley, S.J., Langridge, R., and Ferrin, T.E., A geometric approach to macromolecule-ligand interactions. J. Mol. Biol. 161: 269–288, 1982.
77. Rarey, M., Kramer, B., Lengauer, T., and Klebe, G., A fast flexible docking method using an incremental construction algorithm. J. Mol. Biol. 261: 470–489, 1996.
78. Leach, A.R. and Kuntz, I.D., Conformational analysis of flexible ligands in macromolecular receptor sites. J. Comp. Chem. 13: 730–748, 1992.
79. Jones, G., Willett, P., Glen, R.C., Leach, A.R., and Taylor, R., Development and validation of a genetic algorithm for flexible docking. J. Mol. Biol. 267: 727–748, 1997.
80. Leach, A.R., Ligand docking to proteins with discrete side-chain flexibility. J. Mol. Biol. 235: 345–356, 1994.
81. Morris, G.M., Goodsell, D.S., Huey, R., and Olson, A., Distributed automated docking of flexible ligands to proteins: parallel applications of AutoDock 2.4. J. Comput. Aided Mol. Des. 10: 293–304, 1996.
82. Wlodawer, A. and Erickson, J.W., Structure-based inhibitors of HIV-1 protease. Annu. Rev. Biochem. 62: 543–585, 1993.
83. Shoichet, B.K., Stroud, R.M., Santi, D.V., Kuntz, I.D., and Perry, K.M., Structure-based discovery of inhibitors of thymidylate synthase. Science 259: 1445–1450, 1993.
84. Gschwend, D.A., Sirawaraporn, W., Santi, D.V., and Kuntz, I.D., Specificity in structure-based drug design: identification of a novel, selective inhibitor of Pneumocystis carinii dihydrofolate reductase. Proteins 29: 59–67, 1997.
85. Matthews, J. and Zacharewski, T., Differential binding affinities of PCBs, HO-PCBs, and aroclors with recombinant human, rainbow trout (Oncorhynchus mykiss), and green anole (Anolis carolinensis) estrogen receptors, using a semi-high throughput competitive binding assay. Toxicol. Sci. 53: 326–339, 2000.
86. Poulos, T.L., Finzel, B.C., Gunsalus, I.C., Wagner, G.C., and Kraut, J., The 2.6-Å crystal structure of Pseudomonas putida cytochrome P-450. J. Biol. Chem. 260: 16122–16130, 1985.
87. Poulos, T.L., Finzel, B.C., and Howard, A.J., High-resolution crystal structure of cytochrome P450cam. J. Mol. Biol. 195: 687–700, 1987.
88. Williams, P.A., Cosme, J., Sridhar, V., Johnson, E.F., and McRee, D.E., Mammalian microsomal cytochrome P450 monooxygenase: structural adaptations for membrane binding and functional diversity. Mol. Cell 5: 121–131, 2000.
89. Williams, P.A., Cosme, J., Sridhar, V., Johnson, E.F., and McRee, D.E., Microsomal cytochrome P450 2C5: comparison to microbial P450s and unique features. J. Inorg. Biochem. 81: 183–190, 2000.
90. Schnecke, V. and Kuhn, L.A., Database screening for HIV protease ligands: The influence of binding-site conformation and representation on ligand selectivity, in Seventh International Conference on Intelligent Systems for Molecular Biology, AAAI Press, 1999.
91. Schnecke, V. and Kuhn, L.A., Virtual Screening with solvation and ligand-induced complementarity. Drug Discovery and Design 20: 171–190, 2000.
92. McKinney, J.D., Richard, A., Waller, C., Newman, M.C., and Gerberick, F., The Practice of Structure Activity Relationships (SAR) in Toxicology. Toxicol. Sci. 56: 8–17, 2000.
93. Waller, C.L., Minor, D.L., and McKinney, J.D., Using three-dimensional quantitative structure-activity relationships to examine estrogen receptor binding affinities of polychlorinated hydroxybiphenyls. Environ. Health Perspect. 103: 702–707, 1995.
94. Waller, C.L., Oprea, T.I., Chae, K., Park, H.K., Korach, K.S., Laws, S.C., Wiese, T.E., Kelce, W.R., and Gray, L.E., Jr., Ligand-based identification of environmental estrogens. Chem. Res. Toxicol. 9: 1240–1248, 1996.
95. Tong, W., Perkins, R., Xing, L., Welsh, W.J., and Sheehan, D.M., QSAR models for binding of estrogenic compounds to estrogen receptor alpha and beta subtypes. Endocrinology 138: 4022–4025, 1997.
96. Wiese, T.E., Polin, L.A., Palomino, E., and Brooks, S.C., Induction of the estrogen specific mitogenic response of MCF-7 cells by selected analogues of estradiol-17 beta: a 3D QSAR study. J. Med. Chem. 40: 3659–3669, 1997.
97. Waller, C.L., Juma, B.W., Gray, L.E.J., and Kelce, W.R., Three-dimensional quantitative structure-activity relationships for androgen receptor ligands. Toxicol. Appl. Pharmacol. 137: 219–227, 1996.
98. Waller, C.L. and McKinney, J.D., Three-dimensional quantitative structure-activity relationships of dioxins and dioxin-like compounds: model validation and Ah receptor characterization. Chem. Res. Toxicol. 8: 847–858, 1995.
99. Waller, C.L., Evans, M.V., and McKinney, J.D., Modeling the cytochrome P450-mediated metabolism of chlorinated volatile organic compounds. Drug Metab. Dispos. 24: 203–210, 1996.
100. Blackwood, E.M. and Kadonaga, J.T., Going the distance: a current view of enhancer action. Science 281: 61–63, 1998.
101. Blackwell, T.K. and Weintraub, H., Differences and similarities in DNA-binding preferences of MyoD and E2A protein complexes revealed by binding site selection. Science 250: 1104–1110, 1990.
102. Wingender, E., Chen, X., Hehl, R., Karas, H., Liebich, I., Matys, V., Meinhardt, T., Pruss, M., Reuter, I., and Schacherer, F., TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Res. 28: 316–319, 2000.
103. Kolchanov, N.A., Podkolodnaya, O.A., Ananko, E.A., Ignatieva, E.V., Stepanenko, I.L., Kel-Margoulis, O.V., Kel, A.E., Merkulova, T.I., Goryachkovskaya, T.N., Busygina, T.V., Kolpakov, F.A., Podkolodny, N.L., Naumochkin, A.N., Korostishevskaya, I.M., Romashchenko, A.G., and Overton, G.C., Transcription regulatory regions database (TRRD): its status in 2000. Nucleic Acids Res. 28: 298–301, 2000.
104. Cavener, D.R., Comparison of the consensus sequence flanking translational start sites in Drosophila and vertebrates. Nucleic Acids Res. 15: 1353–1361, 1987.
105. Claverie, J.M. and Audic, S., The statistical significance of nucleotide position-weight matrix matches. Comput. Appl. Biosci. 12: 431–439, 1996.
106. Perier, R.C., Praz, V., Junier, T., Bonnard, C., and Bucher, P., The eukaryotic promoter database (EPD). Nucleic Acids Res. 28: 302–303, 2000.
107. Prestridge, D.S. and Burks, C., The density of transcriptional elements in promoter and non-promoter sequences. Hum. Mol. Genet. 2: 1449–1453, 1993.
108. Frech, K., Quandt, K., and Werner, T., Software for the analysis of DNA sequence elements of transcription. Comput. Appl. Biosci. 13: 89–97, 1997.
109. Chen, Q.K., Hertz, G.Z., and Stormo, G.D., MATRIX SEARCH 1.0: a computer program that scans DNA sequences for transcriptional elements using a database of weight matrices. Comput. Appl. Biosci. 11: 563–566, 1995.
110. Quandt, K., Frech, K., Karas, H., Wingender, E., and Werner, T., MatInd and MatInspector: new fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucleic Acids Res. 23: 4878–4884, 1995.
111. Heinemeyer, T., Wingender, E., Reuter, I., Hermjakob, H., Kel, A.E., Kel, O.V., Ignatieva, E.V., Ananko, E.A., Podkolodnaya, O.A., Kolpakov, F.A., Podkolodny, N.L., and Kolchanov, N.A., Databases on transcriptional regulation: TRANSFAC, TRRD and COMPEL. Nucleic Acids Res. 26: 362–367, 1998.
112. Klingenhoff, A., Frech, K., Quandt, K., and Werner, T., Functional promoter modules can be detected by formal models independent of overall nucleotide sequence similarity. Bioinformatics 15: 180–186, 1999.
113. Karpichev, I.V. and Small, G.M., Global regulatory functions of Oaf1p and Pip2p (Oaf2p), transcription factors that regulate genes encoding peroxisomal proteins in Saccharomyces cerevisiae. Mol. Cell. Biol. 18: 6560–6570, 1998.
114. Geraghty, M.T., Bassett, D., Morrell, J.C., Gatto, G.J., Jr., Bai, J., Geisbrecht, B.V., Hieter, P., and Gould, S.J., Detecting patterns of protein distribution and gene expression in silico. Proc. Natl. Acad. Sci. U.S.A. 96: 2937–2942, 1999.
115. Kliewer, S.A., Lehmann, J.M., Milburn, M.V., and Willson, T.M., The PPARs and PXRs: nuclear xenobiotic receptors that define novel hormone signaling pathways. Recent Prog. Horm. Res. 54: 345–367, 1999.
116. Maloney, E.K. and Waxman, D.J., trans-Activation of PPARalpha and PPARgamma by structurally diverse environmental chemicals. Toxicol. Appl. Pharmacol. 161: 209–218, 1999.
117. Gray, L.E., Jr., Xenoendocrine disrupters: laboratory studies on male reproductive effects. Toxicol. Lett. 102–103: 331–335, 1998.
118. Wei, P., Zhang, J., Egan-Hafley, M., Liang, S., and Moore, D.D., The nuclear receptor CAR mediates specific xenobiotic induction of drug metabolism. Nature 407: 920–923, 2000.
119. Xie, W., Barwick, J.L., Downes, M., Blumberg, B., Simon, C.M., Nelson, M.C., Neuschwander-Tetri, B.A., Brunt, E.M., Guzelian, P.S., and Evans, R.M., Humanized xenobiotic response in mice expressing nuclear receptor SXR. Nature 406: 435–439, 2000.
120. Charles, G.D., Bartels, M.J., Zacharewski, T.R., Gollapudi, B.B., Freshour, N.L., and Carney, E.W., Activity of benzo[a]pyrene and its hydroxylated metabolites in an estrogen receptor-alpha reporter gene assay. Toxicol. Sci. 55: 320–326, 2000.
121. Itoh, K., Ishii, T., Wakabayashi, N., and Yamamoto, M., Regulatory mechanisms of cellular response to oxidative stress. Free Radic. Res. 31: 319–324, 1999.
122. Ishii, T., Itoh, K., Takahashi, S., Sato, H., Yanagawa, T., Katoh, Y., Bannai, S., and Yamamoto, M., Transcription factor Nrf2 coordinately regulates a group of oxidative stress-inducible genes in macrophages. J. Biol. Chem. 275: 16023–16029, 2000.
123. Stormo, G.D., Consensus patterns in DNA. Methods Enzymol. 183: 211–221, 1990.
124. Cui, Y., Wang, Q., Stormo, G.D., and Calvo, J.M., A consensus sequence for binding of Lrp to DNA. J. Bacteriol. 177: 4872–4880, 1995.
125. Stormo, G.D., Strobl, S., Yoshioka, M., and Lee, J.S., Specificity of the Mnt protein. Independent effects of mutations at different positions in the operator. J. Mol. Biol. 229: 821–826, 1993.
126. Lusska, A., Shen, E., and Whitlock, J.P., Jr., Protein-DNA interactions at a dioxin-responsive enhancer. Analysis of six bona fide DNA-binding sites for the liganded Ah receptor. J. Biol. Chem. 268: 6575–6580, 1993.
127. Cargill, M., Altshuler, D., Ireland, J., Sklar, P., Ardlie, K., Patil, N., Shaw, N., Lane, C.R., Lim, E.P., Kalayanaraman, N., Nemesh, J., Ziaugra, L., Friedland, L., Rolfe, A., Warrington, J., Lipshutz, R., Daley, G.Q., and Lander, E.S., Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nat. Genet. 22: 231–238, 1999.
128. Nebert, D.W., Pharmacogenetics and pharmacogenomics: why is this relevant to the clinical geneticist? Clin. Genet. 56: 247–258, 1999.
129. Gu, Z., Hillier, L., and Kwok, P.-Y., Single nucleotide polymorphism hunting in cyberspace. Hum. Mutat. 12: 221–225, 1998.
130. Zubritsky, E., SNP mining. The rush is on. Anal. Chem. 71: 683A–686A, 1999.
131. Picoult-Newberg, L., Ideker, T.E., Pohl, M.G., Taylor, S.L., Donaldson, M.A., Nickerson, D.A., and Boyce-Jacino, M., Mining SNPs from EST databases. Genome Res. 9: 167–174, 1999.
132. Gray, I.C., Campbell, D.A., and Spurr, N.K., Single nucleotide polymorphisms as tools in human genetics. Hum. Mol. Genet. 9: 2403–2408, 2000.
133. Lindblad-Toh, K., Winchester, E., Daly, M.J., Wang, D.G., Hirschhorn, J.N., Laviolette, J.-P., Ardlie, K., Reich, D.E., Robinson, E., Sklar, P., Shah, N., Thomas, D., Fan, J.-B., Gingeras, T., Warrington, J., Patil, N., Hudson, T.J., and Lander, E.S., Large-scale discovery and genotyping of single-nucleotide polymorphisms in the mouse. Nat. Genet. 24: 381–386, 2000.
134. Taillon-Miller, P., Gu, Z., Li, Q., Hillier, L., and Kwok, P.-Y., Overlapping genomic sequences: a treasure trove of single-nucleotide polymorphisms. Genome Res. 8: 748–754, 1998.
135. Smigielski, E.M., Sirotkin, K., Ward, M., and Sherry, S.T., dbSNP: a database of single nucleotide polymorphisms. Nucleic Acids Res. 28: 352–355, 2000.
136. Sherry, S.T., Ward, M., and Sirotkin, K., Use of molecular variation in the NCBI dbSNP database. Hum. Mutat. 15: 68–75, 2000.
137. Lishanski, A., Screening for single-nucleotide polymorphisms using branch migration inhibition in PCR-amplified DNA. Clin. Chem. 46: 1464–1470, 2000.
138. Altshuler, D., Pollara, V.J., Cowles, C.R., Van Etten, W.J., Baldwin, J., Linton, L., and Lander, E.S., An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature 407: 513–516, 2000.
Schmalzing, D., Belenky, A., Novotny, M.A., Koutny, L., Salas-Solano, O., El-Difrawy, S., Adourian, A., Matsudaira, P., and Ehrlich, D., Microchip electrophoresis: a method for high-speed SNP detection. Nucleic Acids Res. 28: E43, 2000. Meyer, U.A., Pharmacogenetics and adverse drug reactions. Lancet 356: 1667–1671., 2000. Nebert, D., Drug-metabolizing enzymes, polymorphisms and interindividual response to environmental toxicants. Clin. Chem. Lab. Med. 38: 857–861, 2000. M. Sandberg, C. Hassett, E. T. Adman, J. Meijer, and C. J. Omiecinski, Identification and functional characterization of human soluble epoxide hydrolase genetic polymorphisms. J. Biol. Chem. 275: 28873– 28881., 2000. S. Landi, Mammalian class theta GST and differential susceptibility to carcinogens: a review. Mutat. Res. 463: 247–283, 2000.
144. K. Olden and S. Wilson, Environmental health and genomics: visions and implications. Nature Reviews 1: 149–153, 2000. 145. G. S. Omenn, The genomic era: A crucial role for the public health sciences. Environ. Health Perspect. 108: A204–205, 2000. 146. R. R. Sharp and J. C. Barrett, The environmental genome project: ethical, legal, and social implications. Environ. Health Perspect. 108: 279–281, 2000. 147. M. Schena, D. Shalon, R. W. Davis, and P. O. Brown, Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270: 467–470, 1995. 148. N. G. Supplement, The chipping forecast. Vol. 21. 1999. 149. E. F. Nuwaysir, M. Bittner, J. Trent, J. C. Barrett, and C. A. Afshari, Microarrays and toxicology: the advent of toxicogenomics. Mol. Carcinog. 24: 153–159, 1999. 150. W. D. Pennie, J. D. Tugwood, G. J. Oliver, and I. Kimber, The principles and practice of toxigenomics: applications and opportunities. Toxicol. Sci. 54: 277– 283, 2000. 151. N. L. Anderson, A. D. Matheson, and S. Steiner, Proteomics: applications in basic and applied biology. Curr. Opin. Biotechnol. 11: 408–412., 2000. 152. J. K. Nicholson, J. C. Lindon, and E. Holmes, ‘Metabonomics’: understanding the metabolic responses of living systems to pathophysiological stimuli via multivariate statistical analysis of biological NMR spectroscopic data. Xenobiotica 29: 1181–1189., 1999. 153. M. Schena, D. Shalon, R. Heller, A. Chai, P. O. Brown, and R. W. Davis, Parallel human genome analysis: microarray-based expression monitoring of 1000 genes. Proc. Natl. Acad. Sci. U.S.A. 93: 10614– 10619, 1996. 154. J. DeRisi, L. Penland, P. O. Brown, M. L. Bittner, P. S. Meltzer, M. Ray, Y. Chen, Y. A. Su, and J. M. Trent, Use of a cDNA microarray to analyze gene expression patterns in human cancer. Nat. Genet. 14: 457–460, 1996. 155. J. L. DeRisi, V. R. Iyer, and P. O. Brown, Exploring the metabolic and genetic control of gene expression on a genomic scale. 
Science 278: 680–686, 1997. 156. A. Brazma, A. Robinson, G. Cameron, and M. Ashburner, One-stop shop for microarray data. Nature 403: 699–700, 2000. 157. S. Farr and R. T. Dunn, 2nd, Concise review: gene expression applied to toxicology. Toxicol. Sci. 50: 1– 9, 1999. 158. Y. Chen, E. R. Dougherty, and M. L. Bittner, Ratiobased decisions and the quantitative analysis of cDNA microarray images. J. Biomed. Optics. 2: 364–374, 1997. 159. T. D. Schmittgen and B. A. Zakrajsek, Effect of experimental treatment on housekeeping gene expression: validation by real-time, quantitative RT-PCR. J. Biochem. Biophys. Methods. 46: 69–81, 2000. 160. T. Suzuki, P. J. Higgins, and D. R. Crawford, Control selection for RNA quantitation. Biotechniques 29: 332–337, 2000.