MINI-REVIEW
Journal of Cellular Physiology
Bioinformatics

JASON H. MOORE1,2,3,4*

1 Computational Genetics Laboratory, Departments of Genetics and Community and Family Medicine, Dartmouth Medical School, Lebanon, New Hampshire
2 Department of Biological Sciences, Dartmouth College, Hanover, New Hampshire
3 Department of Computer Science, University of New Hampshire, Durham, New Hampshire
4 Department of Computer Science, University of Vermont, Burlington, Vermont
Bioinformatics is an interdisciplinary field that blends computer science and biostatistics with biological and biomedical sciences such as biochemistry, cell biology, developmental biology, genetics, genomics, and physiology. An important goal of bioinformatics is to facilitate the management, analysis, and interpretation of data from biological experiments and observational studies. The goal of this review is to introduce some of the important concepts in bioinformatics that must be considered when planning and executing a modern biological research study. We review database resources as well as data mining software tools. J. Cell. Physiol. 213: 365–369, 2007. © 2007 Wiley-Liss, Inc.
Bioinformatics is an interdisciplinary field that blends computer science and biostatistics with biological and biomedical sciences such as biochemistry, cell biology, developmental biology, genetics, genomics, and physiology. Bioinformatics emerged as an important discipline shortly after the development of high-throughput DNA sequencing technologies in the 1970s (Boguski, 1994). It was the momentum of the Human Genome Project that spurred the rapid rise of bioinformatics as a formal discipline. The word "bioinformatics" did not start appearing in the biomedical literature until around 1990 but quickly caught on as the descriptor of this important new field. An important goal of bioinformatics is to facilitate the management, analysis, and interpretation of data from biological experiments and observational studies. Thus, much of bioinformatics can be categorized as database development and implementation, data analysis and data mining, and biological interpretation and inference. The goal of this review is to cover each of these three areas and to provide some guidance on getting started with a bioinformatics approach to molecular investigations of cellular physiology. The need to interpret information from whole-genome sequencing projects in the context of biological information acquired in decades of research studies prompted the establishment of the National Center for Biotechnology Information (NCBI) as a division of the National Library of Medicine (NLM) at the National Institutes of Health (NIH) in the United States in November of 1988.
When the NCBI was established, it was charged with (1) creating automated systems for storing and analyzing knowledge about molecular biology, biochemistry, and genetics, (2) performing research into advanced methods of computer-based information processing for analyzing the structure and function of biologically important molecules and compounds, (3) facilitating the use of databases and software by biotechnology researchers and medical care personnel, and (4) coordinating efforts to gather biotechnology information worldwide (Benson et al., 1990). Since 1988, the NCBI has fulfilled many of these goals and has delivered a set of databases and computational tools that are essential for modern biomedical research in a wide range of different disciplines including molecular epidemiology. The NCBI and other international efforts such as the European Bioinformatics Institute (EBI), established in 1992 (Robinson, 1994), have played a very important role in inspiring and motivating the establishment of research groups and centers around the world that are dedicated to providing
bioinformatics tools and expertise. Some of these tools and resources will be reviewed here.

Database Resources
One of the most important pre-study activities is the design and development of one or more databases that can accept, store, and manage molecular physiology data. Haynes and Blach (2006) list eight steps for establishing an effective information management system. The first step is to develop the experimental plan for the laboratory, molecular, and sample information that will be collected. What are the specific needs for the database? The second step is to establish the information flow. That is, how does the information find its way from the laboratory to the database? The third step is to create a model for information storage. How are the data related? The fourth step is to determine the hardware and software requirements. How much data needs to be stored? How quickly will investigators need to access the data? What operating system will be used? Will a freely available database such as MySQL (http://www.mysql.com) serve the needs of the project or will a commercial database solution such as Oracle (http://www.oracle.com) be needed? The fifth step is to implement the database. The important consideration here is to define the database structure so that data integrity is maintained. The sixth step is to choose the user interface to the database. Is a web page portal to the data sufficient or does a specialized software application for the desktop need to be created? The seventh step is to determine the security requirements. Do HIPAA regulations (http://www.hhs.gov/ocr/hipaa) need to be followed? Most databases need to be password protected at a minimum. The eighth and final step outlined by Haynes and
Contract grant sponsor: National Institutes of Health; Contract grant numbers: LM009012, AI59694, HD047447, RR018787, HL65234.
*Correspondence to: Jason H. Moore, 706 Rubin Bldg, HB7937, Dartmouth-Hitchcock Medical Center, One Medical Center Dr., Lebanon, NH 03756. E-mail: [email protected]
Received 10 June 2007; Accepted 12 June 2007
DOI: 10.1002/jcp.21218
Blach (2006) is to select the software tools that will interface with the data for summary and analysis. Some of these tools will be reviewed below. Although most investigators choose to develop and manage their own database for security and confidentiality reasons, there are an increasing number of public databases for depositing data and making it widely available to other investigators. The tradition of making data publicly available soon after it has been analyzed and published can largely be attributed to the community of investigators using gene expression microarrays. Microarrays (Schena et al., 1995) represent one of the most revolutionary applications derived from the knowledge of whole genome sequences. The extensive use of this technology has led to the need to store and search expression data for all the genes in the genome acquired in different genetic backgrounds or in different environmental conditions. This has resulted in a number of public databases such as the Stanford Microarray Database (Sherlock et al., 2001; http://genome-www5.stanford.edu), the Gene Expression Omnibus (Barrett et al., 2005; Barrett and Edgar, 2006; http://www.ncbi.nlm.nih.gov/geo), ArrayExpress (Brazma et al., 2003; Brazma et al., 2006a; http://www.ebi.ac.uk/arrayexpress), and others that anyone can use to upload or download data. The nearly universal acceptance of the data sharing culture in this area has yielded a number of useful tools that might not have been developed otherwise. The need to define standards for the ontology and annotation of microarray experiments has led to proposals such as the minimum information about a microarray experiment (MIAME) (Brazma et al., 2001; Ball and Brazma, 2006; http://www.mged.org/Workgroups/MIAME/miame.html), a standard that greatly facilitates the storage, retrieval, and sharing of data from microarray experiments.
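The modeling and implementation steps above (steps three through five) can be sketched with a lightweight relational database. In this illustrative example, SQLite stands in for MySQL or Oracle, and the table and column names are hypothetical; the point is that declaring keys and constraints up front is what lets the database, rather than ad hoc scripts, maintain data integrity.

```python
import sqlite3

# Step 3: model the data (one sample can have many measurements).
# Step 5: implement so that integrity is enforced by the schema itself.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce referential integrity
conn.execute("""
    CREATE TABLE sample (
        sample_id INTEGER PRIMARY KEY,
        tissue    TEXT NOT NULL,
        collected TEXT NOT NULL
    )""")
conn.execute("""
    CREATE TABLE expression (
        sample_id INTEGER NOT NULL REFERENCES sample(sample_id),
        gene      TEXT NOT NULL,
        level     REAL NOT NULL,
        PRIMARY KEY (sample_id, gene)
    )""")
conn.execute("INSERT INTO sample VALUES (1, 'liver', '2007-06-01')")
conn.execute("INSERT INTO expression VALUES (1, 'TP53', 8.2)")

# Integrity is maintained: a measurement for an unknown sample is rejected.
try:
    conn.execute("INSERT INTO expression VALUES (99, 'TP53', 1.0)")
except sqlite3.IntegrityError:
    pass  # the bad row never enters the database
```

The same schema, written once, also answers the hardware question in step four: a few thousand samples fit comfortably in an embedded database, while genome-scale expression matrices are where a server-class system earns its keep.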
The MIAME standards provide an example for other types of data such as SNPs and protein mass spectrometry spectra. See Brazma et al. (2006b) for a comprehensive review of standards for data sharing in genetics, genomics, proteomics, and systems biology. The success of the different databases then depends on the availability of methods for easily depositing data and tools for searching the databases, often after data normalization. Despite the acceptance of data sharing in the genomics community, the same culture does not yet exist in genetics and other disciplines. One of the few such examples is the pharmacogenomics knowledge base PharmGKB (Hewett et al., 2002; Altman, 2007; http://www.pharmgkb.org). PharmGKB was established with funding from the NIH to store, manage, and make available molecular data in addition to phenotype data from pharmacogenetic and pharmacogenomic experiments and clinical studies. It is anticipated that similar databases for a variety of different disciplines will appear and gain acceptance over the next few years as the NIH and various journals start to require that data from publicly funded research be made available to the public. In addition to the need for a database to store and manage molecular data collected from experimental or observational studies, there are a number of database resources that can be very helpful for planning a study. Good starting points for database resources are those maintained at the NCBI (Wheeler et al., 2006; http://www.ncbi.nlm.nih.gov). The NCBI maintains the PubMed literature database with more than 15 million indexed abstracts from published articles in more than 4,700 life science journals. The PubMed Central database (http://www.pubmedcentral.nih.gov) is quickly becoming an indispensable tool with more than 400,000 full text articles from over 200 different journals.
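PubMed can also be queried programmatically through NCBI's E-utilities, which is useful when a literature survey needs to be repeated or scripted. The sketch below only constructs the query URL (the esearch endpoint and its db/term/retmax parameters are part of the real E-utilities interface; the search terms themselves are illustrative); fetching the URL returns matching PubMed IDs as XML.

```python
from urllib.parse import urlencode

# Base endpoint of the NCBI E-utilities esearch service.
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_search_url(term, retmax=20):
    """Build an esearch URL for a PubMed query.

    term   -- a PubMed query string, e.g. 'bioinformatics[Title]'
    retmax -- maximum number of IDs to return
    """
    return BASE + "?" + urlencode({"db": "pubmed", "term": term, "retmax": retmax})

url = pubmed_search_url("bioinformatics[Title] AND 2007[PDAT]")
```

In practice one would fetch this URL (e.g., with urllib.request), parse the returned ID list, and then retrieve abstracts with the companion efetch endpoint; batching and polite request rates are expected by NCBI.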
Rapid and free access to the complete text of published articles significantly enhances the planning, execution, and interpretation phases of any scientific study. The new Books database (http://www.ncbi.nlm.nih.gov/books) provides free access for the first time to electronic
versions of many textbooks and other resources such as the NCBI Handbook, which serves as a guide to the resources that NCBI has to offer. This is a particularly important resource for students and investigators who need to learn a new discipline such as genomics. One of the oldest databases provided by the NCBI is the GenBank DNA sequence resource (Benson et al., 1993, 2006; http://www.ncbi.nlm.nih.gov/Genbank). DNA sequence data for many different organisms have been deposited in GenBank for more than two decades, now totaling more than 100 Gb of data. GenBank is a common starting point for the design of PCR primers and other molecular assays that require specific knowledge of gene sequences. Curated information about genes, their chromosomal location, their function, their pathways, etc., can be accessed through the Entrez Gene database (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene), for example. Important emerging databases include those that store and summarize DNA sequence variations. NCBI maintains the dbSNP database (Sherry et al., 1999, 2001; http://www.ncbi.nlm.nih.gov/projects/SNP/) for single-nucleotide polymorphisms or SNPs. dbSNP provides a wide range of different information about SNPs including the flanking sequence primers, the position, the validation methods, and the frequency of the alleles in different populations. As with all NCBI databases, it is possible to link to a number of other datasets such as PubMed. In addition to databases for storing raw data, there are a number of databases that retrieve and store knowledge in an accessible form. For example, the Kyoto Encyclopedia of Genes and Genomes (KEGG) database stores knowledge on genes and their pathways (Kanehisa, 1997; Ogata et al., 1999; http://www.genome.jp/kegg). The Pathway component of KEGG currently stores knowledge on 42,937 pathways generated from 307 reference pathways.
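Whether the starting point is a GenBank record for primer design or a gene list cross-referenced against KEGG, sequence data usually arrive as plain FASTA text, and parsing it is a routine first step. A minimal, self-contained parser is shown below; the record contents are invented for illustration (real GenBank exports carry accession numbers in the header line).

```python
def parse_fasta(text):
    """Parse FASTA-formatted text (as exported from GenBank/Entrez)
    into a {header: sequence} dictionary."""
    records, header, chunks = {}, None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):          # header line starts a new record
            if header is not None:
                records[header] = "".join(chunks)
            header, chunks = line[1:], []
        elif line:                         # sequence lines may be wrapped
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records

# Toy two-line record; sequences in real records wrap at 60-80 columns.
fasta = ">example_gene partial cds\nATGGCGTACG\nTTAGC"
records = parse_fasta(fasta)
```

For production work a library parser is preferable, but the format is simple enough that a hand-rolled reader like this is often all a pipeline needs.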
While the Pathway component documents molecular interactions in pathways, the Brite database stores knowledge on higher order biological functions. One of the most useful knowledge sources is the Gene Ontology (GO) project that has created a controlled vocabulary to describe genes and gene products in any organism in terms of their biological processes, cellular components, and molecular functions (Ashburner et al., 2000; Gene Ontology Consortium, 2006; http://www.geneontology.org). GO descriptions and KEGG pathways are both captured and summarized in the NCBI databases. For example, the description of p53 in Entrez Gene includes KEGG pathways such as cell cycle and apoptosis. It also includes GO descriptions such as protein binding and cell proliferation. In general, a good place to start for information about available databases is the annual Database issue and the annual Web Server issue of the journal Nucleic Acids Research (http://nar.oxfordjournals.org/). These special issues include annual reports from many of the commonly used databases.

Data Analysis
Once the data are collected and stored in a database, an important goal is to identify biomolecular patterns that are associated with an observed phenotypic outcome. Data analysis in the biological sciences has moved away from traditional univariate statistical methods such as the t-test toward data mining and machine learning methods such as cluster analysis and support vector machines as the size and complexity of molecular datasets have grown. Machine learning and data mining methods have the advantage of having more power to detect nonlinear patterns in high-dimensional data that might more closely reflect the underlying hierarchical complexity of biological systems. Although significant power can be gained, these methods are typically more difficult to implement because they require knowledge of computer science and often require more computing power than is available using a single
processor on a desktop or laptop computer. An important development in the last few years is the availability of open-source and user-friendly software that brings many of these advanced computer science methods to the biologist. We briefly review several of these software packages below. For those looking for a good introduction to machine learning and data mining methods, we recommend Hastie et al. (2001) and Witten and Frank (2005) as good starting points.

Data mining using R
R is perhaps the one software package that everyone should have in their bioinformatics arsenal. R is an open-source and freely available programming language and data analysis and visualization environment that can be downloaded from http://www.r-project.org. According to the web page, R includes (1) an effective data handling and storage facility, (2) a suite of operators for calculations on arrays, in particular matrices, (3) a large, coherent, integrated collection of intermediate tools for data analysis, (4) graphical facilities for data analysis and display either on-screen or on hardcopy, and (5) a well-developed, simple, and effective programming language which includes conditionals, loops, user-defined recursive functions, and input and output facilities. A major strength of R is the enormous community of developers and users that ensures just about any analysis method you need is available. Perhaps the most useful contribution to R is the Bioconductor project (Reimers and Carey, 2006; http://www.bioconductor.org). According to the Bioconductor web page, the goals of the project are to (1) provide access to a wide range of powerful statistical and graphical methods for the analysis of genomic data, (2) facilitate the integration of biological metadata (e.g., PubMed, GO) in the analysis of experimental data, (3) allow the rapid development of extensible, scalable, and interoperable software, (4) promote high-quality and reproducible research, and (5) provide training in computational and statistical methods for the analysis of genomic data. Many of the available tools are specific to microarray data but some are more generally applicable to genetics and epidemiology. For example, there are tools in Bioconductor for accessing data from the KEGG database mentioned above. There are numerous packages for machine learning and data mining that are either part of the base R software or can be easily added.
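To give a flavor of what such packages do under the hood, consider k-means clustering, the algorithm behind R's kmeans function and a staple of expression analysis. Its core fits in a few lines; the sketch below uses plain Python rather than R so it stays self-contained, works in one dimension for clarity, and uses invented data values.

```python
def kmeans(points, centers, iterations=10):
    """Plain one-dimensional k-means: assign each point to its nearest
    center, then move each center to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Recompute each center; keep the old center if its cluster is empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Two well-separated groups of (invented) expression values;
# the initial centers are deliberately poor guesses.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.7]
centers, clusters = kmeans(points, centers=[0.0, 5.0])
```

Real implementations add multiple random restarts and convergence checks, and operate on vectors of thousands of genes rather than scalars, but the assign-then-average loop is the whole idea.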
For example, the neural package includes routines for neural network analysis (http://cran.r-project.org/src/contrib/Descriptions/neural.html). Others include arules for association rule mining (http://cran.r-project.org/src/contrib/Descriptions/arules.html), cluster for cluster analysis (http://cran.r-project.org/src/contrib/Descriptions/cluster.html), genalg for genetic algorithms (http://cran.r-project.org/src/contrib/Descriptions/genalg.html), som for self-organizing maps (http://cran.r-project.org/src/contrib/Descriptions/som.html), and tree for classification and regression trees (http://cran.r-project.org/src/contrib/Descriptions/tree.html). Many others are available. A full list of contributed packages for R can be found at http://cran.r-project.org/src/contrib/PACKAGES.html. The primary advantage of using R as your data mining software package is its power. However, the learning curve can be challenging at first since some computer programming in the R language is usually necessary. Fortunately, there is plenty of documentation available on the web and in published books. Several essential R books include those by Gentleman et al. (2005) and Venables and Ripley (2002).

Data mining using Weka
One of the most mature open-source and freely available data mining software packages is Weka (Witten and Frank, 2005; http://www.cs.waikato.ac.nz/ml/weka). Weka is written in Java and thus will run in any operating system (e.g., Linux, Mac, Sun,
Windows). Weka contains a comprehensive list of tools and methods for data processing, unsupervised and supervised classification, regression, clustering, association rule mining, and data visualization. Machine learning methods include classification trees, k-means cluster analysis, k-nearest neighbors, logistic regression, naive Bayes, neural networks, self-organizing maps, and support vector machines, for example. Weka includes a number of additional tools such as search algorithms and analysis tools such as cross-validation and bootstrapping. A nice feature of Weka is that it can be run from the command line, making it possible to run the software from Perl or even R (see http://cran.r-project.org/src/contrib/Descriptions/RWeka.html). Weka includes an experimenter module that facilitates comparison of algorithms. It also includes a knowledge flow environment for visual layout of an analysis pipeline. This is a very powerful analysis package that is relatively easy to use. Further, there is a published book that explains many of the methods and the software (Witten and Frank, 2005).

Data mining using Orange
Orange is another open-source and freely available data mining software package (Demsar et al., 2004; http://www.ailab.si/orange) that provides a number of data processing, data mining, and data visualization tools. What makes Orange different, and in some ways preferable to other packages such as R, is its intuitive visual programming interface. With Orange, methods and tools are represented as icons that are selected and dropped into a window called the canvas. For example, an icon for loading a dataset can be selected along with an icon for visualizing the data table. The file load icon is then "wired" to the data table icon by drawing a line between them. Double-clicking on the file load icon allows the user to select a data file. Once loaded, the file is then automatically transferred by the "wire" to the data table icon. Double-clicking on the data table icon brings up a visual display of the data. Similarly, a classifier such as a classification tree can be selected and wired to the file icon. Double-clicking on the classification tree icon allows the user to select the settings for the analysis. Wiring the tree viewer icon then allows the user to view a graphical image of the classification tree inferred from the data. Orange facilitates high-level data mining with minimal knowledge of computer programming. A wide range of different data analysis tools is available. A strength of Orange is its visualization tools for multivariate data (e.g., Leban et al., 2005). Recent additions to Orange include tools for microarray analysis and genomics such as heat maps and GO analysis (Curk et al., 2005). InforSense is an example of a powerful commercial software package that provides visual programming tools much like Orange (http://www.inforsense.com/).

Interpreting data mining results
Perhaps the greatest challenge of any statistical analysis or data mining exercise is interpreting the results. This is an important goal that is difficult to accomplish without a close working relationship between molecular biologists, for example, and statisticians and computer scientists. Fortunately, there are a number of emerging software packages that are designed with this in mind. GenePattern (http://www.broad.mit.edu/cancer/software/genepattern/), for example, provides an integrated set of analysis tools and knowledge sources that facilitates this process (Reich et al., 2006). Other tools such as the Exploratory Visual Analysis (EVA) database and software (http://www.exploratoryvisualanalysis.org/) are designed specifically for integrating research results with biological knowledge from public databases in a framework designed for biologists and epidemiologists (Reif et al., 2005). Several commercial packages for tying bioinformatics results to
pathways and gene function include Pathway Studio from Ariadne (http://www.ariadnegenomics.com/products/pathway-studio/) and Ingenuity Pathway Analysis from Ingenuity Systems (http://www.ingenuity.com/products/pathways_analysis.html). These tools and others will facilitate interpretation.

The Future
We have only scratched the surface of the numerous bioinformatics methods, databases, and software tools that are available to the biological and biomedical research communities. We have tried to highlight some of the important free software resources such as Weka and Orange that might not be covered in other reviews that focus on more traditional bioinformatics methods from biostatistics. While there are an enormous number of bioinformatics resources today, the software landscape is changing rapidly as new technologies for high-throughput biology emerge. Over the next few years, we will witness an explosion of novel bioinformatics tools for the analysis of high-dimensional biological data generated using high-throughput technologies such as protein mass spectrometry. Each of these new data types and their associated research questions will require special bioinformatics tools and perhaps special hardware such as faster computers with bigger storage capacity and more memory. Perhaps the biggest challenge moving forward is to facilitate communication at all levels among biomedical researchers, biostatisticians, and computer scientists. The key to successful bioinformatics is close face-to-face collaboration between the biologist and the bioinformaticist. This is not always possible but is critical for moving forward with a systems-based research agenda where large volumes of data and information are the norm. The best bioinformatics tools will be those that experts in each area can use jointly. We propose that one way to facilitate this close collaboration is to make available user-friendly software packages that can be used jointly by researchers with expertise in experimental biology and researchers with expertise in computer science (Moore et al., in press). That is, make available software that is intuitive enough for a biologist, for example, to use and powerful enough for a bioinformaticist to use.
To be intuitive to a biologist, the software needs to be easy to use and needs to provide output that is visual and easy to navigate. To be powerful, the software needs to provide the functionality that would allow a bioinformaticist the flexibility to explore the more theoretical aspects of the algorithm. The key, however, to the success of any such software package is the ability of the biologist and the bioinformaticist to sit down together at the computer and jointly carry out an analysis. This is important for several reasons. First, the biologist can help answer questions the bioinformaticist might have that are related to domain-specific knowledge. Such questions might only arise at the time of the analysis and might otherwise be ignored. Similarly, the biologist might have ideas about specific questions to address with the software that might only be feasible with the help of the bioinformaticist. For example, a question that requires multiple processors to answer might need the assistance of someone with expertise in parallel computing. The idea that biologists and bioinformaticists should work together is not new. Langley (2002) has suggested five lessons for the computational discovery process. First, traditional computational notations are not easily communicated to biologists. This is important because a computational model may not be interpretable by a biologist. Second, biologists often have initial models that should influence the discovery process. Domain-specific knowledge such as details about enzymatic reactions in a biochemical
pathway can be critical to the discovery process. Third, biological data are often rare and difficult to obtain. It often takes years to collect and process the data to be analyzed. As such, it is important that the analysis is carefully planned and executed. Fourth, biologists want models that move beyond description to provide explanation of data. Explanation and interpretation are paramount to the biologist. Finally, biologists want computational assistance rather than automated discovery systems. Langley (2002) suggests that practitioners want interactive discovery environments that help them understand their data while at the same time giving them or their collaborating bioinformaticist control over the modeling process. Collectively, these five lessons suggest that synergy between biologists and bioinformaticists is critical. This is because each has important insights that may not get expressed or incorporated into the discovery process if either carries out the analysis in isolation. Future bioinformatics databases and analysis tools that successfully integrate these lessons will prove to be the most useful for biological and biomedical discovery.
Literature Cited

Altman RB. 2007. PharmGKB: A logical home for knowledge relating genotype to drug response phenotype. Nat Genet 39:426.
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G. 2000. Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25:25–29.
Ball CA, Brazma A. 2006. MGED standards: Work in progress. OMICS 10:138–144.
Barrett T, Suzek TO, Troup DB, Wilhite SE, Ngau WC, Ledoux P, Rudnev D, Lash AE, Fujibuchi W, Edgar R. 2005. NCBI GEO: Mining millions of expression profiles—database and tools. Nucleic Acids Res 33:D562–D566.
Barrett T, Edgar R. 2006. Gene expression omnibus: Microarray data storage, submission, retrieval, and analysis. Methods Enzymol 411:352–369.
Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, Aach J, Ansorge W, Ball CA, Causton HC, Gaasterland T, Glenisson P, Holstege FC, Kim IF, Markowitz V, Matese JC, Parkinson H, Robinson A, Sarkans U, Schulze-Kremer S, Stewart J, Taylor R, Vilo J, Vingron M. 2001. Minimum information about a microarray experiment (MIAME)—toward standards for microarray data. Nat Genet 29:365–371.
Brazma A, Parkinson H, Sarkans U, Shojatalab M, Vilo J, Abeygunawardena N, Holloway E, Kapushesky M, Kemmeren P, Lara GG, Oezcimen A, Rocca-Serra P, Sansone SA. 2003. ArrayExpress—a public repository for microarray gene expression data at the EBI. Nucleic Acids Res 31:68–71.
Brazma A, Kapushesky M, Parkinson H, Sarkans U, Shojatalab M. 2006a. Data storage and analysis in ArrayExpress. Methods Enzymol 411:370–386.
Brazma A, Krestyaninova M, Sarkans U. 2006b. Standards for systems biology. Nat Rev Genet 7:593–605.
Benson D, Boguski M, Lipman DJ, Ostell J. 1990. The National Center for Biotechnology Information. Genomics 6:389–391.
Benson D, Lipman DJ, Ostell J. 1993. GenBank. Nucleic Acids Res 21:2963–2965.
Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. 2006. GenBank. Nucleic Acids Res 34:D16–D20.
Boguski MS. 1994. Bioinformatics. Curr Opin Genet Dev 4:383–388.
Curk T, Demsar J, Xu Q, Leban G, Petrovic U, Bratko I, Shaulsky G, Zupan B. 2005. Microarray data mining with visual programming. Bioinformatics 21:396–398.
Demsar J, Zupan B, Leban G. 2004. Orange: From experimental machine learning to interactive data mining, white paper (www.ailab.si/orange). Faculty of Computer and Information Science, University of Ljubljana.
Gene Ontology Consortium. 2006. The Gene Ontology (GO) project in 2006. Nucleic Acids Res 34:D322–D326.
Gentleman R, Carey VJ, Huber W, Irizarry RA, Dudoit S. 2005. Bioinformatics and computational biology solutions using R and Bioconductor. New York, NY: Springer.
Hastie T, Tibshirani R, Friedman J. 2001. The elements of statistical learning. New York, NY: Springer.
Haynes C, Blach C. 2006. Information management. In: Haines JL, Pericak-Vance MA, editors. Genetic analysis of complex disease. Hoboken, NJ: Wiley. pp 219–235.
Hewett M, Oliver DE, Rubin DL, Easton KL, Stuart JM, Altman RB, Klein TE. 2002. PharmGKB: The pharmacogenetics knowledge base. Nucleic Acids Res 30:163–165.
Kanehisa M. 1997. A database for post-genome analysis. Trends Genet 13:375–376.
Langley P. 2002. Lessons for the computational discovery of scientific knowledge. In: Proceedings of the First International Workshop on Data Mining Lessons Learned. Sydney.
Leban G, Bratko I, Petrovic U, Curk T, Zupan B. 2005. VizRank: Finding informative data projections in functional genomics by machine learning. Bioinformatics 21:413–414.
Moore JH, Barney N, White BC. Solving complex problems in human genetics using genetic programming: The importance of theorist-practitioner-computer interaction. In: Riolo R, Soule T, Worzel B, editors. Genetic programming theory and practice V. New York, NY: Springer, in press.
Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M. 1999. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res 27:29–34.
Reich M, Liefeld T, Gould J, Lerner J, Tamayo P, Mesirov JP. 2006. GenePattern 2.0. Nat Genet 38:500–501.
Reif DM, Dudek SM, Shaffer CM, Wang J, Moore JH. 2005. Exploratory visual analysis of pharmacogenomic results. Pac Symp Biocomput 2005:296–307.
Reimers M, Carey VJ. 2006. Bioconductor: An open source framework for bioinformatics and computational biology. Methods Enzymol 411:119–134.
Robinson C. 1994. The European Bioinformatics Institute (EBI)—open for business. Trends Biotechnol 12:391–392.
Schena M, Shalon D, Davis RW, Brown PO. 1995. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270:467–470.
Sherlock G, Hernandez-Boussard T, Kasarskis A, Binkley G, Matese JC, Dwight SS, Kaloper M, Weng S, Jin H, Ball CA, Eisen MB, Spellman PT, Brown PO, Botstein D, Cherry JM. 2001. The Stanford microarray database. Nucleic Acids Res 29:152–155.
Sherry ST, Ward M, Sirotkin K. 1999. dbSNP—database for single nucleotide polymorphisms and other classes of minor genetic variation. Genome Res 9:677–679.
Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. 2001. dbSNP: The NCBI database of genetic variation. Nucleic Acids Res 29:308–311.
Venables WN, Ripley BD. 2002. Modern applied statistics with S. New York, NY: Springer.
Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S, Geer LY, Helmberg W, Kapustin Y, Kenton DL, Khovayko O, Lipman DJ, Madden TL, Maglott DR, Ostell J, Pruitt KD, Schuler GD, Schriml LM, Sequeira E, Sherry ST, Sirotkin K, Souvorov A, Starchenko G, Suzek TO, Tatusov R, Tatusova TA, Wagner L, Yaschenko E. 2006. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 34:D173–D180.
Witten IH, Frank E. 2005. Data mining: Practical machine learning tools and techniques. Boston, MA: Elsevier.