In Proceedings of the 2nd IEEE International Symposium on Bioinformatics and Bioengineering, BIBE 2001

Applying Data Warehouse Concepts to Gene Expression Data Management

Victor M. Markowitz and Thodoros Topaloglou
Data Management Systems, Gene Logic Inc.
2001 Center Street, Berkeley CA 94704

Abstract

In this paper we present a method for applying data warehouse and on-line analytical processing concepts to gene expression data management. This method has been employed in developing the data management system that is used to host Gene Logic’s GeneExpress® database products.

1. Introduction

DNA microarray technologies allow the generation of large amounts of primary and derived (analyzed) gene expression data [5]. Effective exploration of these data has been hindered by the variety and heterogeneity of the data formats used by different microarray technology platforms. This problem has been addressed by several groups at organizations such as the European Bioinformatics Institute (EBI) (see http://www.ebi.ac.uk/arrayexpress), the National Center for Biotechnology Information (NCBI) (see http://www.ncbi.nlm.nih.gov/geo), and the National Center for Genome Resources (NCGR) (see http://www.ncgr.org/genex/index.html), with alternative proposals for organizing gene expression data as part of their efforts to establish public gene expression data repositories. A similar problem has been addressed by Affymetrix with its definition of the Affymetrix Analysis Data Model (AADM) used for its GeneChip LIMS relational database [1]. Recently, several organizations have formed the Microarray Gene Expression Database Group (MGED), with the goal of defining a common format for gene expression data that would enhance the ability of gene expression data repositories to share and exchange data (see http://www.mged.org).

Initially, gene expression data sets were relatively small and could be managed using files. Substantially larger data sets are now prevalent, the result of more powerful DNA microarray technologies that allow measuring expression levels for a large number of genes in a single experiment, combined with the ability of laboratories to carry out an increasing number of experiments on a regular basis. For example, Affymetrix’s GeneChip human genome HG_U95 set consists of five probe arrays containing probe-sets representing approximately sixty thousand full length genes and ESTs. Screening a single human sample using this probe array set involves five experiments that generate approximately 300 megabytes of data in GeneChip-specific data formats.

Gene Logic’s GeneExpress® database products contain gene expression data generated using primarily the Affymetrix GeneChip platform [4], with additional data sets generated using other gene expression platforms, such as QRT-PCR. GeneExpress products are available as DataSuites, such as BioExpress and ToxExpress, which contain data on disease- or tissue-specific samples (see http://www.genelogic.com/gexpress.htm). GeneExpress is notable for its size: it includes gene expression data on thousands of normal and diseased tissues, experimental animal model and cellular tissues, generated with tens of thousands of experiments. GeneExpress is also notable for its rich annotations: it contains detailed data on the samples and gene fragments involved in the microarray experiments, both essential for supporting comprehensive gene expression data analysis.

The definitions and standards for gene expression data proposed by various organizations provide useful guidelines for organizing data in systems such as GeneExpress, but are limited in their coverage of sample and gene annotations and in addressing data operations. For systems with the size and complexity of GeneExpress, we propose using data warehouse and on-line analytical processing (OLAP) concepts for data modeling and exploration, adapted to the characteristics of the gene expression application domain.

In this paper, we present a method for applying data warehouse and OLAP concepts to gene expression data management. The rest of the paper is organized as follows. In section 2, we review the basic data warehouse and OLAP concepts. In section 3, we show that comprehensive analysis of gene expression data requires three related databases for modeling the sample, gene annotation, and gene expression data spaces, respectively. In that section we also discuss the basic operations for manipulating data in each of these data spaces and show how these operations can be used for exploring gene expression data. In section 4, we describe how data warehousing and OLAP concepts have been employed for the development of the data management system that is used to host Gene Logic’s GeneExpress® database products. We conclude the paper with a summary in section 5.

2. Data warehouse basic concepts

Data management for gene expression data has to satisfy two major requirements: data acquisition and data analysis. The database technologies needed to address these requirements are substantially different.

Data acquisition has been a traditional application for databases: the design of such databases aims at optimizing update performance. Such databases, sometimes called operational databases [2], are characterized by rapid content changes, where new data replace old data, and need to support rapid data updates in real time.

Unlike operational databases, data warehouses are characterized by periodic (rather than real time) content accumulation, where new data are added to old data, and need to support rapid exploration of massive amounts of data. Data in data warehouses come from diverse, usually heterogeneous, data sources and therefore need to be integrated. Design of data warehouses aims at optimizing query performance for faster data access and for on-line analytical processing (OLAP).

At the core of a data warehouse is a primary measure attribute associated with a fact object, where the values for the measure attributes are analyzed using the warehouse directly or via an OLAP mechanism. The fact object is modeled in the context of different dimension objects, where each dimension is characterized by one or several category attributes. Category attributes in a dimension can be organized in a classification hierarchy. A typical example of a data warehouse application involves products sold in stores on certain dates, where: quantity sold is the measure object; product, store, and date are the associated dimensions; product is characterized by category (e.g., clothing, electronics); store is characterized by location (city, state); and date is characterized by time (year, month, day).

Data warehouses are usually structured using a star relational schema [2] such as that shown in figure 1, where each dimension is represented by a table and the fact table (Expression in figure 1) contains the main information about the measure object and its relationship to the dimension tables. Snowflake schemas extend the star schema by providing auxiliary tables for representing more complex dimension structures. An example of a snowflake schema is shown in figure 2.

Figure 1. A star schema example (fact table: Expression (Gene_Id, Experiment_Id, Analysis_Id, Expression_Call); dimension tables: Gene (Gene_Id, Gene_Name, Gene_Symbol), Analysis (Analysis_Id, Algorithm, Version), Experiment (Experiment_Id, Exp_Name, Exp_Date, Exp_File, Sample)).

OLAP applications view a data warehouse as a multidimensional data space where aggregation functions, such as summarization, can be applied on the measure values [2]. Other OLAP operations include (1) a combination of selection and projection called slice and dice, which combines a projection on the multidimensional space (slice) with a selection of ranges over the dimension that is projected out (dice); (2) aggregations (e.g., summarization) of the measure in a given dimension over one level of the classification hierarchy associated with that dimension; such operations are sometimes called rollup operations; and (3) disaggregation operations, also called drill down operations, which are the reverse of the aggregation operations. In the example above, projection (slice) can be applied in order to look at the data in a two-dimensional (e.g., location, date) space; selection (dice) can be used to look at products sold on certain days; and an aggregation operation can be used to summarize quantity sold for a given (e.g., electronics) product category.
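The slice, dice, and rollup operations described above can be sketched on the product/store/date example. This is only an illustrative sketch: the fact rows, the store-to-city and product-to-category mappings, and the function names are hypothetical and do not come from any actual warehouse.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, store, date, quantity_sold).
SALES = [
    ("shirt", "store_1", "2001-03-01", 5),
    ("shirt", "store_2", "2001-03-01", 2),
    ("radio", "store_1", "2001-03-02", 1),
    ("radio", "store_2", "2001-04-10", 4),
    ("tv", "store_1", "2001-04-10", 3),
]

# Hypothetical dimension data: category attributes of each dimension member.
PRODUCT_CATEGORY = {"shirt": "clothing", "radio": "electronics", "tv": "electronics"}
STORE_CITY = {"store_1": "Berkeley", "store_2": "Oakland"}


def slice_location_date(facts):
    """Slice: view the measure in the (location, date) space, summing over products."""
    cube = defaultdict(int)
    for _product, store, date, qty in facts:
        cube[(STORE_CITY[store], date)] += qty
    return dict(cube)


def dice_by_dates(facts, dates):
    """Dice: keep only the facts whose date is in the selected set of dates."""
    return [row for row in facts if row[2] in dates]


def rollup_by_category(facts):
    """Rollup: summarize quantity sold one level up the product hierarchy."""
    totals = defaultdict(int)
    for product, _store, _date, qty in facts:
        totals[PRODUCT_CATEGORY[product]] += qty
    return dict(totals)
```

A drill down is the reverse of `rollup_by_category`: it returns from the per-category totals to the per-product rows they were computed from.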

3. Gene expression data modeling

Unlike traditional data warehouse applications that deal with data representing relatively simple and precise real-world facts, such as product sales, scientific applications in general, and gene expression applications in particular, involve complex and often imprecise phenomena. Data in these applications may change over time as a reflection of the evolution of the underlying scientific methods used to generate data, and are often derived, that is, represent interpretations of experimental results using complex analytical methods. The complexity of the data involved in gene expression applications suggests using three modeling data spaces: the sample data space, the gene annotation data space, and the gene expression data space. In this section, we discuss the main modeling and operational characteristics of these data spaces.

3.1. Sample data space

Sample data form an independent data space for analytical processing. The fact object in the sample data space is the biosample, representing the biological material that is screened in a microarray experiment. A biosample can be of tissue, cell, or processed RNA type, and originates from a species-specific (e.g., human, animal) donor. Samples are associated with attributes that describe properties useful for gene expression analysis, such as sample structural and morphological characteristics (e.g., organ site, diagnosis, disease, stage of disease) and donor data (e.g., demographic and clinical record for human donors, or strain, genetic modification and treatment information for animal donors). Samples may also be involved in studies and therefore can be grouped into several time/treatment groups.

Samples come from a variety of sources, with sample information structured and encoded in heterogeneous formats. Format differences range from the type of data being captured to different controlled vocabularies used in order to represent anatomy, diagnoses, and medication. In order to provide support for capturing samples from different sources, the sample data space is modelled as an independent data warehouse, with a star or snowflake schema structure, depending on the complexity of the sample data space. Figure 2 illustrates a snowflake schema for modelling the sample space.

Figure 2. A sample schema example (entities shown: Biological Sample, Donor, Donor Demographic, Donor Clinical, Pathology, Study).

The sample category attributes can be organized in classification hierarchies implemented using controlled vocabularies or existing taxonomies such as SNOMED’s topography and morphology axes for sample organ and diagnosis, respectively. OLAP-like operations can be used for navigating the sample space along various taxonomies. For example, analyzing samples for a specific diagnosis may involve a selection of the diagnosis and a projection on the pathology dimension. Further, consider a classification of donor data using an Organ → Tissue hierarchy. Summarization of samples on tissue type would result in the total number of samples classified by tissue type; further summarization on organ would result in the total number of samples classified by organ (e.g., liver, brain).

3.2. Gene annotation data space

Similar to sample data, gene annotations are considered as a separate data space. The fact object in the gene annotation space is the gene fragment, representing the entity that is examined using a microarray. For GeneChip probe arrays, for example, the gene fragment represents the DNA sequence employed for synthesizing the oligonucleotide probes that are placed on the chips. Gene fragments can be organized across two main dimensions: microarray design and biological annotation. The microarray design describes physical characteristics of a chip type design, including the placement of sequence fragments on the array. This information is provided by the microarray manufacturer and is used to interpret the signal in a microarray experiment.

Figure 3. A gene annotation schema example (entities shown: Gene Fragments, Microarray Design, Sequence, Sequence Cluster, Known Gene, Pathways, Chromosome).

The annotation for a gene fragment consists of determining its biological context, including its associated primary DNA sequence entry in GenBank, membership in a gene-oriented cluster, association with a known gene (i.e., a gene that is recorded in an official nomenclature catalogue, such as the Human Gene Nomenclature Database; see http://www.gene.ucl.ac.uk/nomenclature/), and functional and pathway characterization. The gene fragment annotation involves integrating information from a variety of genomic data sources; therefore the gene annotation data space is also modelled as an independent data warehouse, with a star or snowflake schema structure, as illustrated by the example shown in figure 3.

An important aspect of the gene data space is the evolution of the science underlying recorded gene annotations. For example, the association of a gene fragment with a known gene or gene-oriented cluster may change because of the evolution of such clusters or amendments to the known genes (see [6] for a discussion of problems associated with gene nomenclature and identification). The evolution of gene data may affect the result of gene expression data analysis, and therefore needs to be tracked. However, note that gene data changes are different from historical data in traditional data warehouses. Thus, while historical data record changes of known indisputable facts (e.g., prices of products), the evolving gene data record changes in what is known about (i.e., interpretations of) scientific facts.

OLAP-like operations can be used for navigating the gene annotation space mainly along the biological annotation dimension. For example, examining gene fragments associated with metabolic pathways may involve a selection of (metabolic) pathways and a projection on the pathway dimension. Further, consider a classification of gene annotation data using the following hierarchy: Species → Chromosome → Known Gene. Summarization of the gene fragments on known genes would result in the total number of fragments classified by their association with a known gene; further summarization on chromosome would result in the total number of gene fragments classified by chromosome.
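Summarization along a classification hierarchy, as described for both the sample and the gene annotation data spaces, can be sketched as a count of fact objects at successive hierarchy levels. The example below is a minimal sketch over the Species, Chromosome, Known Gene hierarchy; the fragment identifiers, gene names, and the `summarize` function are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical annotations: fragment id mapped to (species, chromosome, known gene).
FRAGMENT_ANNOTATION = {
    "frag_01": ("human", "chr22", "GENE_A"),
    "frag_02": ("human", "chr22", "GENE_A"),
    "frag_03": ("human", "chr22", "GENE_B"),
    "frag_04": ("human", "chr1", "GENE_C"),
    "frag_05": ("mouse", "chr4", "GENE_D"),
}

# Levels of the classification hierarchy, from most general to most specific.
LEVELS = ("species", "chromosome", "known_gene")


def summarize(annotation, level):
    """Count gene fragments per member of the given hierarchy level.

    Each member is keyed by its full path from the top of the hierarchy,
    so that, e.g., chromosome 22 of different species is never merged.
    """
    depth = LEVELS.index(level) + 1
    return Counter(path[:depth] for path in annotation.values())
```

Calling `summarize(FRAGMENT_ANNOTATION, "known_gene")` and then `summarize(FRAGMENT_ANNOTATION, "chromosome")` corresponds to the two successive summarization steps in the text; the same function would apply to samples classified by an Organ and Tissue hierarchy.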

3.3. Gene expression data space

The fact object in the gene expression data space is the gene expression value. Gene expression data are defined at several granularity levels. The data generated by measurement instruments (scanners) are at the finest (most detailed) level of granularity. Analysis programs turn these data into quantitative gene expression measurements. For example, Affymetrix’s GeneChip platform involves (a) a cell averaging method that averages pixel intensities and computes cell-level intensities, where each cell corresponds to one probe on the probe array, followed by (b) a chip analysis method that generates gene expression values by “summarizing” the intensities of multiple probe pairs that correspond to each gene or EST fragment on the probe array. The GeneChip expression values include presence/absence (PA) calls and numeric absolute gene expression measurements. Other platforms, such as QRT-PCR, report an expression value per gene and per sample, relative to a reference sample.

We present below a multi-dimensional structure that could support representing gene expression values generated with different platforms or analysis methods. The four main dimensions in the gene expression data space are gene, sample, method and experiment, where gene and sample provide the connection to the gene annotation and sample data spaces, respectively. The gene expression data space is modelled as an independent data warehouse, with a star schema, such as that shown in figure 1, or a snowflake schema, such as that shown in figure 4.

Figure 4. A gene expression schema example (entities shown: Gene Expression, Gene Item, Biosample Item, Analysis Method, Experimental Data).

The experiment dimension links gene expression data to experimental parameters such as chip lot, experimental protocol, and software version, all of which are often needed in data analysis. The method dimension models the different gene expression values generated using different primary analysis methods. Secondary analysis methods are also used for further adjusting gene expression values.

Next, we show how summarisation and aggregation operations in the gene expression data space can be defined as variants of primitive OLAP operators. These operations can then be used to define more complex data analysis operations.

Figure 5. Gene expression data space (dimensions: Samples, Genes, Methods; values: absolute expression and PA calls).

Operations in the gene expression data space are often defined in terms of pre-selected sets of samples and/or genes. Sample selections are defined over the sample data space in order to extract sets of samples with a certain profile. For example, a sample set may consist of the male colon samples with adenocarcinoma, for donors in the age group 40-60 and without a smoking history. Gene selections are defined over the gene annotation data space in order to extract sets of genes with certain properties. For example, a gene set may consist of the genes on chromosome 22 whose protein products are involved in the estrogen metabolism pathway.

Note that analysing gene expression data over arbitrary sets of genes and samples may not always be biologically meaningful. For example, analysing gene expression across samples from different species may not yield biologically meaningful results. Consequently, gene and sample operations may need to be restricted in order to ensure that the resulting sets are consistent from a gene expression analysis point of view.

Consider a simplified gene expression data space with three dimensions, as shown in figure 5: Sample, Gene, and Methods, the latter involving two methods, E_PA and E_Abs. E_PA measurements are p (present), a (absent), and m (marginal, unknown) calls, while E_Abs measurements are numeric (absolute) gene expression values. Figure 6 illustrates a gene expression data space projected over method E_PA, where the samples are s1, s2, s3, and s4, and the gene fragments are g1, g2, g3, g4, g5, g6, g7, and g8.

        g1   g2   g3   g4   g5   g6   g7   g8
  s1    p    p    p    p    a    a    p    a
  s2    p    p    p    p    a    a    p    a
  s3    p    p    p    p    a    a    a    a
  s4    p    a    p    p    a    m    a    a

Figure 6. Gene expression value examples.

Gene expression summarization applies to E_PA and can be defined either over the entire sample and gene dimensions or over a set of genes specified using a gene selection and a set of samples specified using a sample selection.

Gene expression summarization on the sample dimension counts, for each gene in the gene set, the number of gene expression measures of a given measure type e (p, a, or m) over the samples in the sample set: given a gene set, G, and a sample set, S, the gene expression summarization on S for E_PA consists of expression summaries {σ(g, p, S), σ(g, a, S), σ(g, m, S)} for each gene g in G, where, given a gene g, σ(g, e, S) is the count of expression measures of type e over all the samples of S. For the example in figure 6, let S = {s1, s2, s3, s4}. Then the gene expression summarization on S for E_PA consists of triplets such as {σ(g2, p, S) = 3, σ(g2, a, S) = 1, σ(g2, m, S) = 0} for each gene in G.

Gene expression summarization on the gene dimension counts, for each sample in the sample set, the number of gene expression measures of a given measure type e (p, a, or m) over all the genes in the gene set: given a gene set, G, and a sample set, S, the gene expression summarization on G for E_PA consists of expression summaries {σ(s, p, G), σ(s, a, G), σ(s, m, G)} for each sample s in S, where, given a sample s, σ(s, e, G) is the count of expression measures of type e over all the genes of G. For the example in figure 6, let G = {g1, g2, g3, g4, g5, g6, g7, g8}. Then the gene expression summarization on G for E_PA consists of triplets such as {σ(s4, p, G) = 3, σ(s4, a, G) = 4, σ(s4, m, G) = 1} for each sample in S.

Gene expression aggregation applies to E_Abs, is defined over the sample dimension, and computes an aggregate (e.g., mean) gene expression value for each gene in the gene set: given a gene set, G, and a sample set, S, the gene expression value aggregation on S consists of aggregate expression values computed for each gene g of G over all the samples of S.
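The two summarization operations can be checked against the PA calls of figure 6 with a short sketch. The data below are transcribed from figure 6; the function and variable names are illustrative choices, not part of any GeneExpress interface.

```python
GENES = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
SAMPLES = ["s1", "s2", "s3", "s4"]

# PA calls from figure 6 (p = present, a = absent, m = marginal),
# one row per sample, columns in the order of GENES.
_ROWS = {
    "s1": "p p p p a a p a",
    "s2": "p p p p a a p a",
    "s3": "p p p p a a a a",
    "s4": "p a p p a m a a",
}
PA = {s: dict(zip(GENES, row.split())) for s, row in _ROWS.items()}


def sigma_gene(g, e, samples):
    """sigma(g, e, S): number of samples in S for which gene g has call e."""
    return sum(1 for s in samples if PA[s][g] == e)


def sigma_sample(s, e, genes):
    """sigma(s, e, G): number of genes in G that have call e in sample s."""
    return sum(1 for g in genes if PA[s][g] == e)


def summarize_on_samples(genes, samples):
    """Summarization on the sample dimension: one (p, a, m) summary per gene."""
    return {g: {e: sigma_gene(g, e, samples) for e in "pam"} for g in genes}


def summarize_on_genes(genes, samples):
    """Summarization on the gene dimension: one (p, a, m) summary per sample."""
    return {s: {e: sigma_sample(s, e, genes) for e in "pam"} for s in samples}
```

For example, `summarize_on_samples(GENES, SAMPLES)["g2"]` gives the triplet for g2 quoted in the text, and `summarize_on_genes(GENES, SAMPLES)["s4"]` gives the triplet for s4.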

3.4. Examples of gene expression operations

We present below examples of operations that can be used for exploring gene expression data and show how these operations can be defined in terms of the gene expression summarization operations above. Our goal is to illustrate the definition of an abstract data type for expression data associated with a basic set of operations that can be used to formulate higher-level gene expression exploration operations. In the rest of this section, we use card(S) to denote the cardinality of set S.

Consistently expressed gene operations are defined over a set of genes and a set of samples, and define the sets of consistently present and consistently absent genes in a sample set: given a gene set, G, and a sample set, S, the sets of consistently present (CPG) and consistently absent (CAG) genes in S are defined as follows:

  CPG(G, S) = {g_i | σ(g_i, p, S) = card(S), g_i ∈ G};
  CAG(G, S) = {g_i | σ(g_i, a, S) = card(S), g_i ∈ G}.

Consider the example in figure 6, and let S = {s1, s2, s3, s4} and G = {g1, g2, g3, g4, g5, g6, g7, g8}. Then CPG(G, S) = {g1, g3, g4}, and CAG(G, S) = {g5, g8}. The set of inconsistently expressed genes (IEG) is defined as:

  IEG(G, S) = G − CPG(G, S) − CAG(G, S).

Note that the sets CPG(G, S), CAG(G, S), and IEG(G, S) partition the set of genes G with regard to the way genes are expressed in sample set S, and that other operations can be defined using CPG, CAG, and IEG.

Similar operations define the subsets of samples in which the genes from a given gene set are either all present or all absent: given a gene set, G, and a sample set, S, the subsets of samples of S in which all the G genes are consistently present (CPS), consistently absent (CAS), or inconsistently expressed (IES) are defined as follows:

  CPS(G, S) = {s_i | σ(s_i, p, G) = card(G), s_i ∈ S};
  CAS(G, S) = {s_i | σ(s_i, a, G) = card(G), s_i ∈ S};
  IES(G, S) = S − CPS(G, S) − CAS(G, S).

For the example in figure 6, let S = {s1, s2, s3, s4} and G = {g1, g2, g3}. Then CPS(G, S) = {s1, s2, s3}.
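The set definitions above translate almost directly into set comprehensions over the figure 6 data. The following is a minimal sketch, assuming gene and sample sets are passed as lists without duplicates so that `len` plays the role of card; the lower-case function names are illustrative.

```python
GENES = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
SAMPLES = ["s1", "s2", "s3", "s4"]

# PA calls from figure 6, one row per sample, columns in the order of GENES.
_ROWS = {
    "s1": "p p p p a a p a",
    "s2": "p p p p a a p a",
    "s3": "p p p p a a a a",
    "s4": "p a p p a m a a",
}
PA = {s: dict(zip(GENES, row.split())) for s, row in _ROWS.items()}


def sigma_gene(g, e, samples):
    """sigma(g, e, S): number of samples in S for which gene g has call e."""
    return sum(1 for s in samples if PA[s][g] == e)


def sigma_sample(s, e, genes):
    """sigma(s, e, G): number of genes in G that have call e in sample s."""
    return sum(1 for g in genes if PA[s][g] == e)


def cpg(genes, samples):
    """Consistently present genes: present in every sample of S."""
    return {g for g in genes if sigma_gene(g, "p", samples) == len(samples)}


def cag(genes, samples):
    """Consistently absent genes: absent in every sample of S."""
    return {g for g in genes if sigma_gene(g, "a", samples) == len(samples)}


def ieg(genes, samples):
    """Inconsistently expressed genes: G minus CPG minus CAG."""
    return set(genes) - cpg(genes, samples) - cag(genes, samples)


def cps(genes, samples):
    """Samples of S in which every gene of G is present."""
    return {s for s in samples if sigma_sample(s, "p", genes) == len(genes)}


def cas(genes, samples):
    """Samples of S in which every gene of G is absent."""
    return {s for s in samples if sigma_sample(s, "a", genes) == len(genes)}
```

With the full gene and sample sets, the three gene sets are disjoint and their union is G, which is the partition property noted in the text.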

A variation of the CPG, CAG, CPS, and CAS operations involves using an additional threshold, T, for defining the gene expression consistency in terms of the minimum number of samples, out of the total number of samples in S, for which the genes are present or absent. For the example in figure 6, let S = {s1, s2, s3, s4} and G = {g1, g2, g3, g4, g5, g6, g7, g8}. Then CPG(G, S, T=3) = {g1, g2, g3, g4}, and CAG(G, S, T=3) = {g5, g6, g8}.

Derived operations can be used to contrast expressed genes in a set of samples with expressed genes in another set of samples. For example, given a gene set, G, and sample sets, S1 and S2, differentially expressed genes in S1 vs S2 can be computed as:
• CPG(G, S1) ∩ CAG(G, S2), which defines the set of G genes that are consistently present in samples of S1 and consistently absent in samples of S2;
• CAG(G, S1) ∩ CPG(G, S2), which defines the set of G genes that are consistently absent in samples of S1 and consistently present in samples of S2.

Similarly, common consistently expressed genes in S1 vs S2 can be computed as:
• CPG(G, S1) ∩ CPG(G, S2), which defines the set of G genes that are consistently present both in samples of S1 and in samples of S2;
• CAG(G, S1) ∩ CAG(G, S2), which defines the set of G genes that are consistently absent both in samples of S1 and in samples of S2.

Consider the example in figure 6, and let S1 = {s1, s2}, S2 = {s3, s4} and G = {g1, g2, g3, g4, g5, g6, g7, g8}. Then the set of differentially expressed genes in S1 vs S2 consists only of g7.

An example of an operation that involves gene expression aggregation on the sample dimension consists of computing, for each gene in a gene set, G, the ratio between the aggregated expression value over a sample set, S1, and the aggregated expression value over another sample set, S2 (this operation is part of a frequently used gene expression analysis method, called fold change).
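The thresholded and derived operations can be sketched in the same way. The PA calls are those of figure 6; the absolute expression values in `ABS` are invented purely to exercise the fold change computation, the use of the mean as the aggregate is one possible choice, and all function names are illustrative.

```python
from statistics import mean

GENES = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
SAMPLES = ["s1", "s2", "s3", "s4"]

# PA calls from figure 6, one row per sample, columns in the order of GENES.
_ROWS = {
    "s1": "p p p p a a p a",
    "s2": "p p p p a a p a",
    "s3": "p p p p a a a a",
    "s4": "p a p p a m a a",
}
PA = {s: dict(zip(GENES, row.split())) for s, row in _ROWS.items()}


def sigma_gene(g, e, samples):
    """sigma(g, e, S): number of samples in S for which gene g has call e."""
    return sum(1 for s in samples if PA[s][g] == e)


def cpg(genes, samples, t=None):
    """Genes present in at least t samples of S (t defaults to card(S))."""
    t = len(samples) if t is None else t
    return {g for g in genes if sigma_gene(g, "p", samples) >= t}


def cag(genes, samples, t=None):
    """Genes absent in at least t samples of S (t defaults to card(S))."""
    t = len(samples) if t is None else t
    return {g for g in genes if sigma_gene(g, "a", samples) >= t}


def differentially_expressed(genes, s1, s2):
    """Genes consistently present in one sample set and absent in the other."""
    return (cpg(genes, s1) & cag(genes, s2)) | (cag(genes, s1) & cpg(genes, s2))


def common_consistently_expressed(genes, s1, s2):
    """Genes consistently present in both sample sets, or absent in both."""
    return (cpg(genes, s1) & cpg(genes, s2)) | (cag(genes, s1) & cag(genes, s2))


# Hypothetical absolute expression values (not taken from the paper).
ABS = {"g1": {"s1": 200.0, "s2": 400.0, "s3": 100.0, "s4": 50.0}}


def fold_change(abs_expr, genes, s1, s2):
    """Ratio of the mean expression over S1 to the mean expression over S2."""
    return {
        g: mean(abs_expr[g][s] for s in s1) / mean(abs_expr[g][s] for s in s2)
        for g in genes
    }
```

A production fold change computation would also need to handle zero or negative aggregate values in the denominator; this sketch does not.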

4. An Application

In this section we describe the application of the data warehouse concepts presented in the previous sections to the development of the data management system used to host Gene Logic’s GeneExpress® database products. Gene Logic’s expression data are generated mainly using the Affymetrix GeneChip platform in a high-throughput production environment. Large-scale data processing requires data management facilities for acquiring, organizing, managing, integrating, and exploring massive amounts of data. These requirements are addressed by the GeneExpress Data Acquisition System (GXDAS) and the GeneExpress Data Warehouse (GXDW).

GXDAS consists of operational databases and laboratory information management system (LIMS) applications that support data acquisition and management of production data. GXDW contains summarized and curated gene expression data, integrated with sample and gene annotation data, and provides support for effective data exploration and mining.

4.1. GeneExpress Data Acquisition System

The sample component of GXDAS provides support for various sample data collection and quality control protocols, via data entry, data migration, and reporting tools. The system uses domain-specific vocabularies and taxonomies, such as SNOMED, to ensure consistency during data collection, and records the data in a database with a structure that is compatible with GXDW’s Sample Database.

The gene expression component of GXDAS provides support for Gene Logic’s high-throughput GeneChip-based production and seamless integration with the GeneChip LIMS. This system manages gene expression experiment, QC/QA, and process data. Gene expression experiment data are recorded in GeneChip-specific files containing: (i) the binary image of a scanned probe array; (ii) average intensities for the probes on the probe array; and (iii) expression values of gene fragments probed in the probe array. Using the GeneChip LIMS publish operation, the data in these files are converted into an AADM-based representation and recorded in the LIMS relational database.

The sample component of GXDAS is integrated seamlessly with the GeneChip LIMS and a Chip QC module, thus ensuring data consistency across, and efficient data flow through, individual data management systems. The Chip QC component is used for detecting chip image defects using both image analysis software and manual visual inspection, and for masking the probes affected by these defects. Further, GXDAS accelerates the rate of data generation by providing support for parallel publishing via multiple GeneChip LIMS systems, and thus provides support for performing hundreds of experiments per day. GXDAS migrates the gene expression data in relational (AADM) format and the QC data to the GXDW staging area, where the necessary data integration, transformation, validation, and correction are performed before loading the data into GXDW.

The gene annotation component of GXDAS provides support for assembling consistent annotations for the gene fragments underlying gene expression experiments, by acquiring, integrating, and curating data from various, mainly public, data sources.

Gene annotations allow assessing the biological meaning, redundancy, and ambiguity of GeneChip probe array gene fragments. For example, a gene fragment may be present on more than one probe array in an array-set, different gene fragments may correspond to the same known gene, and a gene fragment may correspond to multiple known genes. Gene fragments are organized into non-redundant classes based on UniGene, which provides a partitioning of GenBank sequences into non-redundant gene-oriented clusters (see http://www.ncbi.nlm.nih.gov/UniGene/), and are associated with known genes recorded in LocusLink, which provides curated sequence and descriptive information, such as official nomenclature, on genes (see http://www.ncbi.nlm.nih.gov/LocusLink/index.html). Note that with the rapid pace of sequencing the human and other genomes, and of identifying genes, the biological identity of gene fragments in general, and ESTs in particular, is likely to change over time [6].

Acquiring gene annotations from public data sources involves regularly querying these sources, parsing and interpreting the results, and identifying reliable associations between gene fragments and known genes. Gene fragments are further associated with gene products (e.g., from SwissProt, a curated protein sequence database that includes the description of the function of a protein, etc.; see http://www.expasy.ch/sprot/), pathways (e.g., metabolic, signalling pathways), SNPs, chromosome maps, cross-species homologies and sequence contigs. Gene fragments are also associated with terms from the Gene Ontology, a controlled vocabulary that can be used for classifying genes (see http://www.geneontology.org/).
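The three kinds of redundancy and ambiguity mentioned above (a fragment on several probe arrays, several fragments for one known gene, one fragment for several known genes) can be detected with simple grouping. The sketch below is only an illustration: the association records, identifiers, and the `report` function are hypothetical and do not reflect how GXDAS stores or curates annotations.

```python
from collections import defaultdict

# Hypothetical (fragment, probe array, known gene) associations.
ASSOCIATIONS = [
    ("frag_1", "array_A", "GENE_X"),
    ("frag_1", "array_B", "GENE_X"),  # same fragment on two probe arrays
    ("frag_2", "array_A", "GENE_X"),  # a different fragment for the same gene
    ("frag_3", "array_C", "GENE_Y"),
    ("frag_3", "array_C", "GENE_Z"),  # one fragment, two candidate known genes
]


def report(associations):
    """Group associations and list the redundant or ambiguous entries."""
    arrays_of = defaultdict(set)     # fragment: probe arrays it appears on
    genes_of = defaultdict(set)      # fragment: known genes it maps to
    fragments_of = defaultdict(set)  # known gene: fragments that map to it
    for fragment, array, gene in associations:
        arrays_of[fragment].add(array)
        genes_of[fragment].add(gene)
        fragments_of[gene].add(fragment)
    return {
        "on_multiple_arrays": sorted(f for f, a in arrays_of.items() if len(a) > 1),
        "redundant_genes": sorted(g for g, f in fragments_of.items() if len(f) > 1),
        "ambiguous_fragments": sorted(f for f, g in genes_of.items() if len(g) > 1),
    }
```

Because associations change as the underlying public sources evolve, a report of this kind would have to be regenerated after each refresh of the annotation data.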

4.2. GeneExpress Data Warehouse

The GeneExpress Data Warehouse (GXDW) contains very large amounts of data and has a structure that supports efficient gene expression exploration and analysis. GXDW is the integrated product of three component databases that materialize the sample, gene annotation, and gene expression data spaces discussed in section 3. GXDW is loaded with sample, gene annotation and expression data from a staging area where these data are integrated after passing data consistency and quality validation. The staging area has a transient database that provides a buffer between the GXDAS data sources and GXDW while data undergo various transformations. Gene expression data generated using the GeneChip platform are represented in the AADM relational format extended with GXDW-specific fields used for representing classification of experiments per sample and studies (e.g., toxicology studies), quality control information on samples and experiments, and so on.

In the AADM representation, the method dimension for the gene expression data space involves two analysis methods: cell averaging and chip analysis. The results of cell averaging and chip analysis are stored in two fact tables, the Measurement_Elem_Result (MER) and the Abs_Gene_Expr_Result (AGER) tables, respectively [1]. Given the size of GXDW, the management of both tables is problematic. For example, one human sample involves five experiments that result in 1.25 million rows in the MER table and 42,000 rows in the AGER table. In GXDW, AGER is explored using an OLAP-like multi-dimensional array (GXA), while MER is partitioned and archived. Experimental parameters such as protocol version, analysis software build, and analysis method are also recorded in GXDW.

GXDW also includes gene expression data generated using other gene expression platforms, such as READS (see http://www.genelogic.com/readstech.htm) and QRTPCR. Currently, gene expression data originating from different platforms are managed and structured independently, rather than using a common data format. Gene expression data generated using different platforms are correlated via common samples (i.e., samples that are run using different technologies) or common genes (i.e., gene fragments that are common to different technology platforms).

The multi-dimensional array used for exploring gene expression data, GXA (Gene Expression Array), follows the structure shown in figure 5, and supports a data representation that is independent of the underlying gene expression technology platform. GXA provides the framework for implementing gene expression operations such as those described in section 3, and for integrating advanced data analysis methods or tools. GXA is implemented as a collection of two-dimensional matrices, where in each matrix gene expression data items are associated with a (sample, gene) pair. Different matrices cover different types of gene expression data, where these data are classified according to species (e.g., human, mouse, rat), GeneChip probe array version (e.g., Hu95), or the method used to generate raw gene expression data or to normalize these data across experiments. Since individual methods are inherently imprecise, supporting multiple methods enhances the ability to cross-validate or refine analysis results.
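The organization of GXA as a collection of two-dimensional matrices can be sketched as follows. This is only an illustration under stated assumptions: the class name ExpressionMatrix, the use of a dictionary keyed by (species, probe array version, method), and the in-memory list-of-lists storage are all hypothetical, and say nothing about how GXA is actually stored or accessed.

```python
class ExpressionMatrix:
    """One two-dimensional matrix: rows are samples, columns are genes."""

    def __init__(self, samples, genes):
        # Map sample and gene identifiers to row and column positions.
        self.sample_index = {s: i for i, s in enumerate(samples)}
        self.gene_index = {g: j for j, g in enumerate(genes)}
        # None marks a (sample, gene) pair with no data item yet.
        self.values = [[None] * len(genes) for _ in samples]

    def set(self, sample, gene, value):
        self.values[self.sample_index[sample]][self.gene_index[gene]] = value

    def get(self, sample, gene):
        return self.values[self.sample_index[sample]][self.gene_index[gene]]

    def restrict(self, samples, genes):
        """Return the data items for a sample set and a gene set, keyed by (sample, gene)."""
        return {(s, g): self.get(s, g) for s in samples for g in genes}


# Hypothetical collection of matrices, one per (species, array version, method).
gxa = {
    ("human", "Hu95", "chip_analysis"): ExpressionMatrix(
        samples=["s1", "s2"], genes=["g1", "g2", "g3"]
    ),
}

matrix = gxa[("human", "Hu95", "chip_analysis")]
matrix.set("s1", "g2", 310.5)
subset = matrix.restrict(samples=["s1"], genes=["g1", "g2"])
```

Keeping one matrix per classification means that an analysis selects its matrix first and then works only with sample and gene sets, which is one way to keep the operations independent of the technology platform.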

4.3 GeneExpress Data Exploration

GeneExpress data are explored using GX™ Explorer [3], which provides support for constructing gene and sample sets, analysing gene expression data in the context of gene and sample sets, and managing user analysis results. A typical analysis task may involve three stages.

The first, pre-processing, stage often consists of defining the scope of an analysis task in terms of samples and genes of interest. In this stage, users can filter genes and samples based on multiple criteria, and can select a specific type of method used for generating or normalizing gene expression data targeted for analysis.

The second stage of an analysis task usually involves executing one or several analysis methods of choice. GX Explorer supports a variety of native analysis methods, whose description is outside the scope of this paper. An example of defining these methods in terms of the gene expression exploration operations presented in the previous section is provided by an analysis operation that involves identifying consistently expressed (either present or absent) genes from a gene set, G, over a sample set, S: the result of this operation consists of the pair {CPG(G,S), CAG(G,S)}. The native analysis tools supported by GX Explorer are implemented using an Analysis Engine based on GXA, and are based on sound statistical methods for minimising the potential bias introduced by variability sources such as individual probe arrays or samples. GX Explorer also supports commercial analysis packages, such as GeneSpring, Spotfire, Partek Pro 2000, and S+.

The third stage of a typical analysis task consists of interpreting the results of the analysis using visualisation tools in the context of comprehensive gene annotations, such as gene clusters, chromosome maps, and pathway maps provided by GXDW.
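The operation that returns the pair {CPG(G,S), CAG(G,S)} can be sketched as follows. The sketch assumes that each (sample, gene) pair carries an absolute call of "P" (present) or "A" (absent), and that "consistently" means the same call in every sample of S; the definition given in section 3 may differ (for example, by allowing a tolerance), and the function names are hypothetical.

```python
def consistently_called(calls, genes, samples, call):
    """Return the genes in `genes` whose call equals `call` in every sample of `samples`.

    `calls` maps (sample, gene) pairs to a call such as "P" or "A".
    Note that for an empty sample set every gene qualifies trivially.
    """
    return {g for g in genes if all(calls[(s, g)] == call for s in samples)}


def consistent_genes(calls, genes, samples):
    """Return the pair (CPG, CAG): consistently present and consistently absent genes."""
    cpg = consistently_called(calls, genes, samples, "P")
    cag = consistently_called(calls, genes, samples, "A")
    return cpg, cag


# Illustrative calls for three genes over two samples (made-up values).
calls = {
    ("s1", "g1"): "P", ("s2", "g1"): "P",
    ("s1", "g2"): "A", ("s2", "g2"): "A",
    ("s1", "g3"): "P", ("s2", "g3"): "A",
}
cpg, cag = consistent_genes(calls, genes=["g1", "g2", "g3"], samples=["s1", "s2"])
```

Here g1 is consistently present, g2 is consistently absent, and g3 falls in neither set because its call differs between the two samples.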

5. Summary

We presented in this paper a data warehouse method for managing and exploring gene expression data that involves modelling gene expression applications using separate sample, gene annotation, and gene expression multi-dimensional data spaces. We presented basic operations in these data spaces in terms of traditional OLAP dimension reduction and aggregation manipulations. We showed how these basic operations can be used for defining more complex gene expression exploration operations.

We described how the proposed gene expression data warehouse method has been applied to the development of the data management system that hosts Gene Logic’s GeneExpress® database products. By mid-2001, this system had been deployed at over twenty biotech and pharmaceutical companies and academic institutions. Data are delivered to customers as part of tissue-, disease-, or study-specific GeneExpress DataSuites. The GeneExpress Warehouse is hosted on an Oracle 8i database server back-end and is supplied with a continuous stream of data from the GeneExpress Data Acquisition System. Data warehouse management tools are used for ensuring data consistency, with process-specific consistency rules checking the correct execution of the data migration and integration processes, and domain-specific rules validating the sample, expression, and gene annotation data. The structure of gene expression data in GXDW is compatible with other gene expression data definitions that have been proposed by MGED and other organizations, while the gene annotation and sample data components of GXDW are not covered by these definitions.

Data exploration and analysis in GeneExpress is carried out using the GX Explorer client user interface, developed using the Java programming environment and employing a CORBA mechanism for communication in a three-tier (database server, analysis engine and query processor, user interface client) architecture. When required, Gene Logic’s GeneExpress databases can incorporate gene expression data from other gene expression, sample, and gene data sources, integrated with Gene Logic data using the GX™ Connect tool (see http://www.genelogic.com/gent.htm). Such enhanced databases provide users a richer, and therefore more valuable, context for their in-house gene expression data analysis projects.

Acknowledgements. We want to thank our colleagues at Gene Logic who have been involved in the development of the GeneExpress system for their outstanding work.

6. References

[1] Affymetrix. “Affymetrix Analysis Data Model”, http://www.affymetrix.com/support/aadm/aadm.html.
[2] S. Chaudhuri and U. Dayal. “An Overview of Data Warehousing and OLAP Technology”, SIGMOD Record, March 1997.
[3] Gene Logic. “GeneExpress® 1 User Manual”, Gene Logic Inc., 2001.
[4] D.J. Lockhart, H. Dong, M.C. Byrne, M.T. Follettie, M.V. Gallo, M.S. Chee, M. Mittmann, C. Wang, M. Kobayashi, H. Horton, and E.L. Brown. “Expression Monitoring by Hybridization to High-Density Oligonucleotide Arrays”, Nature Biotechnology, 14, pp. 1675-1680, 1996.
[5] D.J. Lockhart and A.E. Winzeler. “Genomics, Gene Expression, and DNA Arrays”, Nature, 405, pp. 827-836, 2000.
[6] H. Pearson. “Biology’s name game”, Nature, 417, pp. 631-632, 2001.