Innovative Computational Methods For ... - Semantic Scholar

3 downloads 0 Views 278KB Size Report
Mar 6, 2007 - The general tenet is that genes encoding proteins parti- cipating in a common ...... sequence homology to the obesity-associated protein tubby,.
The Computer Journal Advance Access published March 6, 2007 # The Author 2007. Published by Oxford University Press on behalf of The British Computer Society. All rights reserved.

For Permissions, please email: [email protected] doi:10.1093/comjnl/bxm003

Innovative Computational Methods For Transcriptomic Data Analysis: A Case Study in the Use Of FPT For Practical Algorithm Design and Implementation1 M ICHAEL A. L ANGSTON2,*, ANDY D. PERKINS2, ARNOLD M. SAXTON 3, J ON A. SCHARFF2 AND B RYNN H. VOY 4 2

Department of Computer Science, University of Tennessee, Knoxville, TN 37996-3450, USA 3 Department of Animal Science, University of Tennessee, Knoxville, TN 37996-4574. USA 4 Mammalian Genetics and Genomics, Oak Ridge National Laboratory, Oak Ridge, TN 37831-6124, USA *Corresponding author: [email protected] Tools of molecular biology and the evolving tools of genomics can now be exploited to study the genetic regulatory mechanisms that control cellular responses to a wide variety of stimuli. These responses are highly complex, and involve many genes and gene products. The main objectives of this paper are to describe a novel research program centered on understanding these responses by (i) developing powerful graph algorithms that exploit the innovative principles of fixed parameter tractability in order to generate distilled gene sets; (ii) producing scalable, high performance parallel and distributed implementations of these algorithms utilizing cutting-edge computing platforms and auxiliary resources; (iii) employing these implementations to identify gene sets suggestive of co-regulation; and (iv) performing sequence analysis and genomic data mining to examine, winnow and highlight the most promising gene sets for more detailed investigation.

As a case study, we describe our work aimed at elucidating genetic regulatory mechanisms that control cellular responses to low-dose ionizing radiation (IR). A low-dose exposure, as defined here, is an exposure of at most 10 cGy (rads). While the consequences of high doses of radiation are well known, the net outcome of low-dose exposures continues to be debated, with support in the literature for both detrimental and beneficial effects. We use genome-scale gene expression data collected in response to low-dose IR exposure in vivo to identify the pathways that are activated or repressed as a tissue responds to the radiation insult. The driving motivation is that knowledge of these pathways will help clarify and interpret physiological responses to IR, which will advance our understanding of the health consequences of low-dose radiation exposures. Keywords: transcriptomic data analysis; fixed-parameter tractability; graph algorithms Received 18 May 2006; revised 12 January 2007

1.

1

A preliminary version of a portion of this paper was presented at the ACM Symposium on Applied Computing, held in Dijon, France, in April 2006.

BACKGROUND AND SIGNIFICANCE

Understanding the health risks of exposure to low levels of radiation is critical if we are to protect the workforce while making the most effective use of our natural resources. The US National Academy of Sciences very recently issued a

THE COMPUTER JOURNAL, 2007

Page 2 of 13

M. A. LANGSTON et al.

report [1] summarizing the most current evidence of health effects of low levels of ionizing radiation (IR). This report supports the concept of a ‘linear-no-threshold’ risk model, which holds that even the smallest doses of IR have the potential to increase the risk of cancer. Over the next several decades, it is expected that the majority of radiation exposures associated with human activity will be low-dose in nature. These may arise from medical diagnostics, hazardous waste abatement, handling materials for nuclear weapons and power systems and even terrorist acts such as dirty bombs. The major type of exposures will be low-dose IR (primarily X-radiation and gamma-radiation) from fission products. Microarrays and other recent advances in technology provide a means to begin extracting such mechanisms and pathways without a priori knowledge of genes that might be involved. By focusing on gene co-regulation and putative network mechanisms, we seek to characterize radiation-induced perturbations of normal physiological processes. This should greatly enhance our understanding of low-dose radiation effects at all levels of biological organization, from genes to cells to tissues and finally to organisms.

2. BIOLOGICAL PATHWAY EXTRACTION FROM MICROARRAY DATA Microarrays provide a snapshot of the simultaneous expression levels for virtually every gene present in an organism. In a single slice, arrays can identify genes that are up-or down-regulated in a condition of interest. It is this analysis of differential expression that today drives the majority of array experiments. Yet, it is the next level of microarray analysis – the potential to extract meaningful relationships among multiple genes – that sets the power of arrays apart from traditional measures of gene expression. Guilt by association, the assumption that genes with similar expression patterns participate in common cellular functions, drives the emerging attempt to extract pathways from microarray data [2]. As noted in [3], ‘partitioning genes into closely related groups has thus become the key mathematical first step in practically all statistical analyses of microarray data’. The general tenet is that genes encoding proteins participating in a common pathway will display correlated expression levels when analysed at sufficient scale, and that the identities and known functions of these genes can be used to highlight the existing and assimilate new functional pathways. The appeal of this rationale is rather obvious, namely, that important pathways of interacting genes can be elucidated from expression data, and the cellular mechanisms that underlie a biological response will then be revealed. Initial attempts to validate this concept are starting to emerge from the literature [4 – 8]. To date, the computational methods to extract such patterns lag far behind the general agreement about their utility.

2.1.

Existing approaches

The traditional strategy for identifying subsets of genes with similar expression patterns from microarray experiments relies on various clustering algorithms. Clustering includes a wide variety of methods for organizing multivariate data into groups with approximately similar expression patterns, and a wealth of clustering approaches has been proposed [9–13]. The various methods build upon the correlation measure between expression levels of all pairwise combinations of genes, which is used to calculate a distance metric of similarity (or dissimilarity) of expression between each gene pair. The most common clustering algorithms are either hierarchical, in which all genes begin in their own clusters and are eventually merged into one, or centroid, in which genes are organized into a predefined number of clusters by iterative adjustments based on similarity [2]. There are several important limitations, however, to the vast majority of clustering algorithms that lie in contrast to the realities of biology. One such limitation is that the clusters these methods produce are disjoint, requiring that a gene be assigned to only one cluster. Although this simplifies the amount of data to be evaluated, it places an artificial limitation on the biology under study in that many genes play important roles in multiple but distinct pathways [14]. There are recent clustering techniques, for example those employing factor analysis [15], that do not require exclusive cluster membership for single genes. Unfortunately, these tend to produce biologically uninterpretable factors without the incorporation of prior biological information [16]. Another important limitation is that most of the measures of similarity used by current clustering algorithms do not permit the recognition of negative correlations, which are common and often equally meaningful from a biological perspective. In contrast, the notion of relevance networks [14, 17, 18] has been suggested as a means to overcome the limitations of traditional clustering methods. Relevance networks begin with a matrix of the correlation coefficients between all pairwise combinations of genes, and identify both positive and negative relationships. In a relevance network, genes are represented as vertices and correlations between them exceeding a defined threshold as edges. Additional types of data can be incorporated to recognize relationships between gene expression and other metrics, e.g. correlations between genes and cancer susceptibility [14, 18]. Unfortunately, without an algorithmic means to extract the aggregate relationships between multiple genes, many of the most interesting relationships – those with tight connections between multiple genes – remain embedded within the vast sea of correlations.

2.2.

Benefits of a graph theoretical approach

It is, therefore, necessary to develop more powerful tools to extract subsets of coordinately regulated genes from large aggregates of gene expression data in order to fulfill the

THE COMPUTER JOURNAL, 2007

INNOVATIVE COMPUTATIONAL METHODS

FOR

promise of coexpression-based data mining. The field of graph theory offers unique advantages to this problem. Many innovative graph algorithms are based on decades of basic research, and constitute a class of tools that can help elucidate relationships in highly complex data structures, in our case as matrices of correlations across thousands of genes. In this respect, graph algorithms offer a means to extract meaningful aggregates of genes from within the relevance network framework. Weighted graphs are produced from this type of data. They consist of vertices representing genes and edges whose weights are indicative of the correlation between each pair of vertices (genes). Given a suitable threshold, t, edges with weights less than t are discarded; edges with weights at least t are retained. This produces an unweighted graph, G, whose structural properties are of interest. Similar representations have already been shown to be suitable for probing and determining the structure of biological networks, including the extraction of evolutionarily conserved modules of co-expressed genes [19–21]. 2.3.

The clique problem and putative co-regulation

The challenge, once the graph is created, is not to study the graph in its entirety but rather to extract its embedded subgraphs, or small, tightly connected regions of the graph that represent subsets of genes with strong correlations between every pair of its members and thus likely to represent biologically significant interactions. In the most extreme case, in which a subgraph contains all possible edges between its vertices, this structure is called a clique. A clique on four vertices is illustrated in Fig. 1. The importance of clique lies in the fact that each and every pair of vertices is joined by an edge, from which we can infer some form of co-regulation among the corresponding genes. Clique is widely known for its application in a variety of combinatorial settings, a great number of which are relevant to computational molecular biology [22]. It is particularly useful in microarray analysis, because it addresses the previously noted shortcomings of traditional clustering algorithms. Note that cliques need not be disjoint. A vertex can be in more than one clique, just as a gene can be in more than one regulatory network. Moreover, negative correlations are easily handled in a variety of ways, for example, by temporarily taking the absolute value of correlation coefficients just prior to thresholding. In terms of gene expression, clique represents the most trusted potential for identifying a set of interacting genes [14]. This general methodology is placed in the broader context of clustering for microarray analysis in Fig. 2, with our approach illustrated in grey. Our work centers on solving immense instances of the clique problem and applying the solutions to low-dose IR data. Solving clique is a major computational bottleneck, however, and a classic graph-theoretic problem in its own right. A considerable amount of effort has been devoted

TRANSCRIPTOMIC DATA ANALYSIS

Page 3 of 13

FIGURE 1. A clique of size 4.

to solving clique efficiently [23]. There is also considerable interest in solving the dense k-subgraph problem [24]. Here the focus is a cluster’s edge density, also referred to as clustering coefficient, curvature and even cliquishness [25, 26]. In this respect, clique is the ‘gold standard.’ A cluster’s edge density is maximized with clique by definition. In particular, we seek to solve the maximum clique problem, whose goal is to find the largest k for which G contains a clique of size k. We hasten to point out that our use of thresholding and clique is distinct from a superficially similar approach employed in ‘signature algorithms,’ including those described in [27, 28]. Signature algorithms are designed to group genes into sets along with the conditions under which each set may be regulated. Randomization, iteration and scoring are key ingredients. Given a starting seed (genes called a ‘reference set’), each condition is scored depending upon how well it seems responsible for any co-regulation observed with that seed. Once conditions are scored, each gene is then scored by how well it appears to be regulated by the current set of conditions whose score exceeds a preset threshold. Results of this back and forth action are gene clusters called ‘transcription modules’ that, like maximal cliques, are allowed to overlap. There is no notion of requiring extreme edge density as there is with clique-based methods. Instead, modules derived with signature algorithms depend heavily upon the seeds chosen, the number of iterations employed and other factors. Likewise, the time required to run these algorithms depends on seeds, iterations and convergence criteria. Another interesting approach to coexpression discovery is presented in [29]. It differs from our approach on at least two fronts: partial correlations are used without thresholding, and clusters are identified by genes sharing partial correlations that are stronger within the cluster than surrounding correlations. This method was applied to 5000 yeast genes, and successfully identified known metabolic networks. However, the number of clusters was small when compared with what our clique-based approach generates. Moreover, it is not clear that this method will scale up to some 30 000 or so genes in mammals. In fact, there is no statistical basis for deciding if a cluster is significantly different from background, because a cluster could be based on average partial correlations of 0.10, compared with surrounding correlations of 0.03, for

THE COMPUTER JOURNAL, 2007

Page 4 of 13

M. A. LANGSTON et al.

FIGURE 2. An overview of clustering for microarray analysis. The approach used in this effort is shown in grey.

example, and would almost surely not be statistically different from unrelated genes. Although the idea of partial correlations is intriguing, there is also a danger that true relationships may be statistically adjusted away. For example, if three genes are correlated, partial correlations between any two, adjusted for the third, could be zero, and that cluster would disappear. 3. CLIQUE EXTRACTION The inputs to the standard decision version of clique are an undirected graph G with n vertices, and a parameter k  n. The question asked is whether G contains a clique of size k, that is, a subgraph isomorphic to Kk. Subgraph isomorphism, clique in particular, is NP-complete. From this, it follows that there is no known algorithm for deciding clique that runs in time polynomial in the size of the input. One could of course solve clique by generating and checking all candidate solutions. This brutal force approach requires O(n k) time, however, and is thus prohibitively slow, even for problem instances of only modest size. One might be tempted to try and solve clique approximately rather than exactly. Clique is so difficult, however, that guaranteeing solutions even to within only n 1 cannot be accomplished within polynomial time for any 1 . 0 unless P ¼ NP [30]. 3.1.

Novel algorithms for clique discovery

Dramatically better approaches are clearly required if clique is to be applied to huge microarray data sets. In this context, we employ fixed parameter tractability (FPT), whose roots can be

traced at least as far back as the work showing, via the Graph Minor Theorem, that a variety of parameterized problems are tractable when the relevant input parameter is fixed. See, for example, [31, 32]. A problem is FPT if it has an algorithm that runs in O(f(k)n c) time, where n is the problem size, k the input parameter and c a constant independent of both n and k. Clique is not FPT, however, unless the W hierarchy collapses [33]. The W hierarchy, whose lowest level is FPT, can be viewed as a fixed-parameter analog of the polynomial hierarchy, whose lowest level is P. Such a collapse is widely viewed as an exceedingly unlikely event, roughly on a par with the likelihood of the collapse of the polynomial hierarchy [34]. We, therefore, focus instead on clique’s complementary dual, the vertex cover problem. Consider G0 , the complement of G. (G0 has the same vertex set as G, but edges present in G are absent in G0 and vice versa.) The question now asked is whether G0 contains a set C of k vertices that covers every edge in G0 , where an edge is said to be covered if either or both of its endpoints are in C. Like clique, vertex cover is NP-complete. Unlike clique, however, vertex cover is FPT. The crucial observation here is this: a vertex cover of size k in G0 turns out to be exactly the complement of a clique of size n 2 k in G. Thus, we search for a minimum vertex cover in G0 , thereby finding the desired maximum clique in G. Currently, the fastest known vertex cover algorithms run in O(1.2759kk 1.5 þ kn) and very recently O(1.2740k þ kn) time [35, 36]. Contrast this with O(n k). The requisite

THE COMPUTER JOURNAL, 2007

INNOVATIVE COMPUTATIONAL METHODS

FOR

exponential growth (assuming P = NP) is, therefore, reduced to a mere additive term, making it realistic now to consider the search for cliques of huge sizes in immense microarray data sets. Recent work on this subject is featured in [37–42]. Codes have been installed in Clustal XP, a high-performance, parallel version of the popular Clustal W package [43]. 3.2.

Computational tractability

Key algorithmic factors in the success of an FPT-based approach are kernelization, in which a problem is reduced to its compute kernel; and branching, in which a search tree is employed to explore the solution space efficiently. Under some conditions these operations can be iterated, a process termed interleaving. The mathematical details upon which these techniques rely are highly complex. Our approach to solving clique efficiently is perhaps best illustrated with the following example. Here we ran comparisons using Mus musculus data obtained with Affymetrix U74v2 arrays. Each is represented by 12 422 vertices (probe sets, which are roughly equivalent to genes) and millions of edges (gene–gene correlations). Depending on the particular value of k employed, kernelization generally cuts the size of the graph by about half or so. On the reduced graph we used 32 processors, each running at 500 MHz. Sample computational results are listed in Table 1. For this set of runs the threshold was set at 0.5. Note the importance of processor load balancing. Without effective load balancing, parallel algorithms are quickly reduced to sequential methods due to the inherent nature of the branching process. The need for such balancing has required custom solutions due to the many intricate features of these algorithms [41]. Note also that solving the ‘no’ instance took a good deal longer than solving the ‘yes’ instance. This is typical, and reflects the fact that when no satisfying cover is present, the entire search space must be examined; there is no opportunity to finish early. It is probably safe to say that until the advent of FPT, problems as large as those listed in Table 1 were considered unsolvable by most researchers. With n set at 12 422, and k set at 12 422 2 12 053 ¼ 369, it is mind-boggling to imagine a straightforward O(n k) clique-find algorithm on a problem of that size. In addition to our work on clique extraction, we have also had to devise mathematically and biologically justifiable methods for threshold setting, and effective strategies for comparing discrete structures in control versus treatment data. We will discuss these methods in the context of data sets with only two levels of one condition, control versus IR. Our work is equally extensible, quite naturally, to multiple conditions.

TRANSCRIPTOMIC DATA ANALYSIS 4.

Page 5 of 13

THRESHOLD SELECTION

To form a graph from a correlation matrix, a cutoff value, or threshold, must be chosen above which an edge connecting two genes is created, and below which the two genes are considered unconnected. This decision is of critical importance, as clique results are highly dependent on the starting graph. We are currently employing three approaches to thresholding: ontological distance, statistical relevance and graph structure. In all cases, the criterion for successful thresholding is the ability to capture biologically meaningful cliques while minimizing cliques with little or no biological significance. This raises the question of how ‘biological meaning’ will be defined. This is an issue with which all current approaches to extracting co-regulated genes must deal, since too little is known about the comprehensive biology of any genome to establish a concrete, proof-of-principle data set, especially for higher eukaryotes. In lieu of that, the state-of-the-art is to determine whether predicted sets of co-regulated genes map to a common biological function, based on their gene ontology (GO) annotation. Therefore we score the biological meaning of cliques and other subgraphs by determining the degree to which they are enriched for specific GO categories. The biological accuracy of various thresholds is determined in part by the degree to which the thresholds produce cliques that are enriched for genes with common functional assignments. We use GOTreeMachine [44] to identify statistically significant GO enrichments from dense gene sets. GoTreeMachine returns the statistical probability that a set of genes is significantly enriched in one or more GO categories. We do not rule out the possibility that using more than one thresholding method may be necessary in practice. If methods disagree, this may indicate something about the experimental data that needs to be re-examined.

4.1.

Ontological distance

Consider the use of ontological distance, based on correlations between gene co-expression and functional similarity. The functional similarity between two genes can be evaluated using information theory based on GO and normalized to [0, 1]. Pearson correlation coefficients can be calculated for each gene pair in the data set. Gene pairs can then be ranked and plotted by their coefficients. A careful analysis of the resulting curve (e.g. by identifying inflection points, discontinuities or other mathematical means) can sometimes help identify optimal threshold(s) for gene co-expression network

TABLE 1. Comparisons of simple parallel and load-balanced branching times. Graph name mus74-50 mus74-50

Graph size 12 422 12 422

Cover size 12 053 12 052

Instance type Yes No

THE COMPUTER JOURNAL, 2007

Parallel branching 6 days 6 days

Balanced branching 2 h 8 min 3 h 57 min

Page 6 of 13

M. A. LANGSTON et al.

construction. We employ broken line regression for this purpose, which uses the non-linear model  Prob ¼

a þ b  Corr; Corr , X 0; a þ b  X 0 þ d  ðCorr  X 0Þ; Corr  X 0;

where X0 is the breakpoint. To verify that this model does generally occur, we tested the fit of this model on a wide variety of published data sets, and compared fits with other non-linear models. Of course, a disadvantage of this approach is the uneven quality or missing GO annotations for many genes.

4.2.

Statistical methods

We also rely on the theoretical distribution of correlation, specifically the normalized Z-transformation of Fisher. The supporting hypothesis is that correlation values statistically different from 1.0 represent noise, and thus should not be used to create graph edges. Advantages of this approach are that it considers the amount of data used to calculate each correlation, and the threshold value can change from gene to gene, increasing as data becomes more reliable. The statistical test is trivial to perform, simply a Z-test based on the normal distribution. Protection from type I errors, however, due to extensive multiple testing must be implemented. For example, arrays with over 20 000 genes will produce 400 million correlations for each treatment group. Bonferroni adjustments are known to be too conservative in situations like this where individual tests are correlated (share data). The false discovery rate (FDR) methodology [45] is also considered. Using simulation, we test whether this method provides the stated protection. It is appropriate for multiple comparisons among means, but sometimes requires mathematical modifications to apply to multiple comparisons of correlations. It is not always clear what level of FDR is acceptable for a given application. For example, an FDR of 5% may produce a threshold correlation of 0.99, which is much too high to produce a meaningfully connected graph.

4.3.

more input graphs. MCS is frequently solved with the use of an ‘association graph’, a form of product graph built from the inputs. Thus, we also use CliqueBranch, the fastest currently known MCS algorithm [46]. CliqueBranch relies on exploiting subtle structures within the association graph. THEOREM [46] Given two graphs of orders m and n, respectively, the CliqueBranch algorithm solves the MCS problem in O(m þ 1)n time. We are also interested in the potential offered by spectral graph theory [47], whose underlying mathematics has many known connections to pure and applied research, and includes both continuous and discrete applications. The eigenvalues and eigenvectors of the Laplacian of a graph’s adjacency matrix, for example, give insight into the structural properties of the graph. The spectrum describes such features as the number of components in the graph, the total number of walks of a specific length, whether the graph is bipartite as well as bounds on the chromatic number and diameter of the graph [48]. Spectral graph theory has already found utility in biological applications, specifically in the area of spectral clustering [49]. We are also aware of results that make use of the spectral decomposition of cell-graphs that appear to be highly accurate in the classification of cancerous and healthy tissue [50]. In the present context, we consider the eigenvectors and eigenvalues of graphs obtained by thresholding at various levels. The second smallest eigenvalue denotes the algebraic connectivity of a graph. As expected, we have observed that this eigenvalue increases as the threshold is lowered and microarray correlation graphs become denser. On real data these values seem to grow more or less linearly, so it is not clear to us whether we can exploit inflection points or discontinuities. The zeros are thus far more useful. The multiplicity of zero as an eigenvalue tells us the number of connected components in a graph. Zeros appear to fall out rather precipitously as the threshold is lowered, producing a thresholding metric that is appealing both biologically and mathematically.

5.

Graph structure theory

We seek innovative ways in which intrinsic graph structure can help us to identify biologically meaningful thresholds. An obvious candidate is with the use of maximal clique size and distribution profiles. Changes in such profiles may signal key thresholds. Another is through the use of graph width metrics (e.g. pathwidth, treewidth, cutwidth, bandwidth and so forth), which measure foundational graph attributes. Changes in them have the potential to reveal a great deal about graph structure and with it biological content. A more computationally demanding approach is based on the maximum common subgraph (MCS) problem. Here, the goal is to find the largest induced subgraph contained in two or

CORROBORATORY EVIDENCE OF CO-REGULATION

After cliques and other graph structures have been identified, we use ontology analysis to evaluate the potential biological significance of these structures and prioritize them for further study [51]. We would predict that cliques comprised of genes involved in core cellular functions would be abundant in both control and irradiated groups, especially given the relatively low doses of IR to which the mice were exposed (i.e. below the level of significant DNA and cellular damage). Consistent with this prediction, many of the maximal cliques in both groups displayed enrichment for ribosomal subunit proteins and for proteins involved in translation. In IR but not

THE COMPUTER JOURNAL, 2007

INNOVATIVE COMPUTATIONAL METHODS

FOR

control, many cliques were also enriched in genes involved in inflammatory and immune responses, congruent with the physiological response to low doses of radiation and with our results for differential expression. Another component of bioinformatic evaluation of graph structures exploits the axiom that co-regulated genes are likely to share at least some transcription factor binding sites [18]. Genes linked by physical interactions in a network, such as transcription factor – gene interactions, have more strongly correlated expression levels than those chosen at random [52]. For this, we use sequence analysis pipelines to determine if genes within a graph structure share putative cisregulatory elements. Such toolkits enable the extraction of genes and their upstream sequences from the available databases and utilize phylogenetic footprinting to delimit the search space to regions most likely to contain regulatory elements [53]. Many of these methods routinely use popular packages such as MEME [54] and MAST [55] as part of the sequence analysis, but other motif finding and searching methods are under development. These are but two forms of corroboratory evidence. Pathway matching with KEGG data and other tools are employed as well, with the end goal a high level of confidence in putative co-regulation before any time and money is invested in wet-lab verification.

6.

DIFFERENTIAL ANALYSIS FOR LOW-DOSE IR

Accumulating evidence suggests that the genes and pathways altered by lower doses are not merely a subset of those altered by higher doses, the hallmarks of which are DNA damage and its repair mechanisms [56]. Therefore, low-dose IR may exert its own unique set of cellular responses. We have focused on microarray data from the spleens of mice exposed to an acute low-dose of IR in vivo. Six standard inbred strains of mice (to mimic genetic variation seen in humans) were exposed to an acute dose of 10 cGy X-rays, and tissues were harvested at one and 3.5 h post-exposure. Gene expression profiles were produced at the 3.5 h time point. Four mice each from control and IR were randomly paired to create four biological replicates for each strain, and each biological replicate was hybridized in duplicate, incorporating a dye swap to control for dye-specific effects, for a total of 48 array hybridizations. The arrays platform consists of long (65-mer) oligonucleotides corresponding to 21 547 unique mouse genes (Compugen Mouse OligoLibrary, release 2.0TM). The raw microarray data are processed in a manner standard for differential expression and other analyses from 2-color microarray platforms. Spots that did not pass experimenter-established criteria for sufficient signal-above-background or for spot quality are not included in the input data set. Background is subtracted from signal,

TRANSCRIPTOMIC DATA ANALYSIS

Page 7 of 13

and data from each channel (cyanine 3 and 5) are normalized (their ratio is used, so not independent) using Lowess, a locally weighted linear regression method that adjusts for intensity-dependent noise effects [57]. Because, we treat the control and treatment data separately, we add a normalization step not necessary for differential expression analysis from 2-color arrays. Various technical parameters, such as the laser setting used to scan the array slides, can alter the hybridization signal independent of true changes in gene expression. To adjust for this, we perform a between slide normalization by median centering each slide. After normalization, we create correlation matrices for each treatment by calculating Pearson’s correlations on biological replicates (technical replicates such as dye swaps are averaged). Alternative correlation measures to Pearson’s have been suggested, for example, partial correlations [58], entropy-based [14, 18] and shrinkage-adjusted estimates [3], but Pearson’s uses equally weighted information, appropriate for normalized expression values with close to normal distributions, as has been our experience. To analyse this data, we have implemented our toolchain as illustrated in Fig. 3. We concentrate on the classic maximum clique problem. Of course, we also must handle the related problem of generating all maximal cliques once a suitable threshold has been chosen, which is often a function of maximum clique size itself. This can be memory intensive even on massively parallel supercomputers, as shown in [40]. There are a variety of other issues dealing with preand post-processing, which are for the most part quite easily handled and are dwarfed by the computational complexity of the fundamental clique problem at the heart of our method. We term this high-performance computational system our ‘clique compute engine.’ We employ this engine to extract subsets of genes with putatively shared regulation in response to the effects of radiation. Here, we use our IR data set and a somewhat arbitrary threshold of 0.85. We selected 0.85 for three reasons. First, it was the value we used (also somewhat arbitrarily) for the neurobiology work we reported in [59]. Second, it has been shown that pairs of yeast genes with a correlation of 0.85 or greater have a 50% chance of sharing a transcription factor binding site (thereby increasing the odds that they were actually co-regulated and not just co-expressed) [18]. With this threshold, we have produced an initial distribution of cliques in these data for control and IR samples. One thorny issue with microarrays in general is that noise arises due to the sensitivity of the detection methods. To determine the extent to which gene expression data in the IR data set correlated with noise, we included in the correlation matrix ‘signal’ values from control buffer spots printed on the arrays, treating these values as real data. The correlation of noise spots with expression data drops sharply at a threshold of 0.85, giving us yet a third reason to view this value as a reasonable starting choice for the threshold.

THE COMPUTER JOURNAL, 2007

Page 8 of 13

M. A. LANGSTON et al.

Figure 3. Clique extraction for control and low-dose IR data.

6.1.

Differential expression

6.2.

Once an acceptable threshold has been defined and a graph has been created from the data, we can ask a suite of biologically interesting questions from the graph, beginning with the identification of differentially expressed genes. These are genes whose expression levels increase or decrease in response to IR exposure. We have identified statistically significant, differentially expressed genes using mixed model ANOVA [60]. These sets are enriched in genes involved in apoptosis, regulation of transcription and many novel, uncharacterized genes. Unlike the response to higher doses of IR, we have not seen an over-expression of genes with known associations to DNA repair. Instead, a significant percentage (22%) of genes that are differentially expressed in at least one strain upon IR exposure encode products that mediate immunity and response to oxidative stress, suggesting that low doses of IR may initiate immune response mechanism(s). Moreover, the connectivity of differentially expressed genes in control is highly different than it is after treatment. A good example is illustrated in Fig. 4, depicting the vertices connected to gene 1602 in control and in dose. The degree of gene 1602, known to be involved in immune function, is only two in control but 235 in dose.

FIGURE 4. Connectivity of gene 1602. Each circle represents another gene. Most edges are obscured in dose due to density.

Differential correlation

Differential expression can deal only with individual genes (vertices). Thanks to the use of multiple conditions over many arrays, we are able to build a correlation matrix with which we can study what we call differential correlation. By this, we mean a significant relationship (set of edges) found in one condition (or more, if multiple conditions are available) but not another. For example, what relationships between gene pairs appear in dose but are not found in control? How are these correlations related to each other? Are there multiple, connected edges that appear or disappear, suggesting the activation or suppression of a network of genes? In Fig. 5, a

FIGURE 5. Differential correlations red is dose and blue is control.

THE COMPUTER JOURNAL, 2007

INNOVATIVE COMPUTATIONAL METHODS

FOR

light grey edge indicates a correlation above 0.85 found in dose but not even above 0.25 in control (and vice versa for light grey edges). With this mechanism, we are more easily able to identify the correlations that are dramatically altered by treatment. Topo3a for example, the gene denoted by the star with the most rays in this figure, seems to be a nexus for many gene pair differential correlations. Most of them represent gene – gene interactions not witnessed in control but now highly activated after exposure. 6.3.

TRANSCRIPTOMIC DATA ANALYSIS

Page 9 of 13

not both, yet the remainder of the clique is in both. For example, a core pathway related to signaling through a specific G-protein may be common to both control and dose, but the set of genes with which that core interacts may be completely unique to treatment. In this case, we are interested in the differential clique signatures, that is, the subset of genes within the cliques that are unique to treatment. Differential topology opens many new doors and helps frame a variety of testable new hypotheses. We now discuss a few compelling examples.

Differential topology and the search for biological meaning

Next we investigate differential topology. By this, we mean cliques or other network structures that appear or disappear according to condition. While differential expression is reflected at the vertex level, and differential correlation is seen at the edge level, differential topology is observed in changes at the subgraph level. For example, a clique that appears in dose but not in control would highlight a credible network of genes likely to play some role in the response to dose, and thus constitute a target for further study. Identifying these subgraph changes from among the vast sea of data and relationships requires mathematical and statistical innovation. After all, the real hope that drives the emerging field of network extraction from microarray data is to be able to identify pathways that underlie the response to specific treatments or conditions. An initial step in differential topology is a series of numeric profilings to assess the overall similarity of the graphs for control and IR. We compile and compare total number of cliques, the degree of overlap between genes comprising cliques, average edge weights and the size and numbers of maximum and maximal cliques to generate comprehensive numerical profiles. These parameters provide a basis for further, more sophisticated comparisons and the use of covariance matrices. For a gene to be a member of a clique in control, but not under treatment, at least one correlation dropped below the threshold needed to create an edge. Loss of one edge could be random chance, but if many correlations changed, a statistically significant difference between covariance matrices will occur. Such statistical tests have been described [61] and used [62], but not for comparing the millions of subgraphs generated from array data. Our experience suggests these tests are too sensitive, producing statistical significance for changes within the range of biological variability. We instead develop modifications that protect against multiple testing problems and make results address biological noise more realistically. An ideal situation is to find entire cliques that are treatment-specific, because these highlight what may be complete components of cell signaling or response that are activated or repressed by IR. Of course, we also expect many instances in which genes are present in a clique in either dose or control but

6.3.1. IR’s effect on protein synthesis We have found in a maximum control clique a set of 11 genes involved in translation and protein synthesis that are much more co-abundant in the spectrum of control maximal than in dose maximal cliques. This suggests a general pause in ongoing protein synthesis with IR, as the cell redirects its efforts toward more specific pathways. We have identified all control and dose maximal cliques that contain the largest possible subset (10 for control, 6 for dose) of these core genes. We have also identified six genes that are almost always associated with the core genes in control but never associated with them in dose. In turn, a separate set of six genes are always associated with the core in dose but never in control. A next step is to mine the literature to determine if the dose signature genes have a known relationship to protein synthesis that differs from that of the control signature set.

6.3.2. Carboxylesterase 2 and the use of clique anchors A relatively small set of genes constitute the majority of cliques. Certain genes appear much more frequently in cliques than others, and in some cases this follows a treatmentspecific pattern. Those genes with over-representation in cliques according to condition seem to serve as individual gene ‘anchors’ from which to probe clique membership. For example, gene 11920 is especially intriguing. It appears in over 30% of dose cliques but 0.7% of control cliques, with a degree of 363 in dose and 140 in control. In dose, there are six other genes that appear in at least 50% of 11920 cliques, but they are not frequently associated with 11920 in control cliques. Gene 11920 encodes carboxylesterase 2, a relatively novel product that has been implicated in the response to carcinogens [63]. We then examined the largest cliques (1 of size 43, 23 of size 42) that contain 11920 in dose, and many of the genes within these cliques are known to be involved in the immune pathway, i.e. a pathway in which many genes are differentially expressed according to differential expression analysis of our data set. This gene alone represents a good example of how using the structure of the initial graph can point to potential anchors around which to search for additional cliques of interest.

THE COMPUTER JOURNAL, 2007

Page 10 of 13

M. A. LANGSTON et al.

6.3.3. Core gene sets and nucleated cliques We have also studied what we term ‘nucleated cliques.’ These represent an extension of the anchored clique concept, in which the anchor is formed not only from one highly represented gene but from a set of genes that often appear together. We have developed systematic means to extract potential nuclei from the initial graph structures. Using these data, we have identified an example that illustrates the appeal of this approach, beginning with the clique maxima from control and dose. We first identified six genes that were found in the two dose maxima but were not present in any control maximal cliques, suggesting a treatment-specific distribution for these genes. We then searched for their distributions in dose and control and found that these six genes plus one additional gene appear together in 5765 dose cliques. In contrast, in control, the requirement has to be reduced to finding any two of the six together, which yields only 1653 cliques. These six genes thus suggest a nucleus for dosespecific cliques that are likely to represent a pathway of genes involved in the response to IR. Among these six, two are immune-related (again consistent with differential expression), two have unknown function and two have been loosely associated with radiation in the literature. Biologically, this nucleus and the genes that often appear in cliques with it represent an appealing composite candidate for further studies into IR. As an example of such a study, it would be interesting to determine if there is corroboratory evidence in the literature. This evidence might originate, for example, from protein – protein interaction or gene expression data from other types of stressors, or interactions between these genes as part of a pathway. 6.3.4. A surprising role for Tulp4 One outcome of our approach is that it points to new links between the IR response and the genes not previously associated with radiation. An excellent example of this is seen with the gene known as tubby-like protein 4 (Tulp4). We have frequently found Tulp4 in cliques with many genes related to immune function. Yet, Tulp4 is a member of the tubby family of genes, which are thought to function as membranebound transcription regulators that translocate to the nucleus in response to phosphoinositide hydrolysis, providing a direct link between G-protein signaling and the regulation of gene expression [64]. While Tulp4 is named based on its sequence homology to the obesity-associated protein tubby, little is known about its unique biological function. Thus, our work suggests a previously unknown link between the Tulp4 and the IR response, perhaps as a mediator of changes in the expression of immune-related genes. Along these same lines, we expect to ascribe potential functional roles to the large set of novel genes with unknown functions that are differentially expressed after IR in our data set. For example, if an unknown gene frequently appears in cliques enriched for GO terms related to oxidative stress, we can

logically hypothesize that the unknown gene has similar function. This can then serve as a starting point for experimental investigations to explore this possibility and to further define the roles of novel genes in the IR response.

7.

SUMMARY AND CONTINUING RESEARCH

We have developed a suite of novel graph-based methods to extract cliques and other dense subgraphs as a means to identify sets of genes that interact to control physiological responses. FPT has played a central role in this research. Our work advances efforts to identify gene regulatory networks that illuminate the basis for biological response and represent potential therapeutic targets for disease applications. While we have illustrated our approach using the response to low-dose IR, our methods are extensible to any biology of interest and to virtually any quantitative phenotypic measure. We have concentrated here on computational aspects. More information about experimental conditions, procedures, genes of interest and results can be found in [65]. We are extending our work along several fronts. 7.1.

Supercomputer implementations

We are currently porting our codes to supercomputers located at ORNL. This is an onerous task given the many novel features of our algorithms. Great care is required to manage processor and memory resources. Load balancing on such prototype machines can be especially problematic. Initial targets include a 512-node SGI Altix and a 256-node Cray X1. We have found the former particularly well-suited for our purposes thanks to its 2 Tb of shared memory [40]. The latter is also very appealing due to its superior speed and special built-in features for vector operations, which our codes should be able to exploit for graph manipulation via the unwrapping of a variety of ‘for’ loops. In the longer term, we aim to employ the tremendously more powerful machines now under construction and awarded to ORNL in the recent competition to build the Nation’s next leadership class computing facility for science [66]. We are confident that with our algorithms and these platforms we can solve problematic instances previously considered hopelessly out of reach. 7.2.

Dealing with noise

To account for the many sources and varieties of noise inherent in data generated with current microarray technology, we recently developed a clique relaxation technique to identify what we call ‘paracliques.’ Informally, a paraclique is an extremely densely connected subgraph, but one that may be missing a small number of edges and thus is not, strictly speaking, a clique. In our application, this

THE COMPUTER JOURNAL, 2007

INNOVATIVE COMPUTATIONAL METHODS

FOR

corresponds to a very highly intercorrelated group of putatively co-regulated genes whose transcript expression levels, as reflected in real and somewhat dirty microarray data, show highly significant but not necessarily perfect pairwise correlations. Let us illustrate, beginning with a clique C of size k, where k is perhaps the size of the largest clique in the list. We set a glom factor, g, at some value strictly less than k. We also set an edge weight bound, b, at some value strictly less than the threshold, t, used to build the correlation graph. We now augment C in a controlled manner to produce a paraclique P by requiring that any new vertex added be adjacent to at least g other vertices and that the weight of the correlation coefficient of any ‘missing’ edge is at least b (see [67] for details). We then remove P from the graph and iterate. Even if b is set to 0 we know the following: THEOREM [67] For any graph G, with g set to jCj 2 1 the edge density of P as computed by the paraclique algorithm is at least 50% as long as jPj  2jCj. This is of course merely a worst-case guarantee. On real data, we routinely see paracliques with edge densities in the range of 90 – 95%. Paraclique has already found utility in the study of both neurological gene co-expression data [59] and shotgun proteomic data [68]. 7.3.

[2] [3]

[4]

[5]

[6]

[7]

[8]

[9]

Complementary data analysis

Recall that we have used spleen tissue in the case study described here. We are now running arrays to perform a parallel analysis using data from other IR tissues, including skin, testis and thymus. Tissues exhibit different baseline levels of expression of stress response genes [69], which likely translates into tissue-specific responses to stressors. While there is certainly some level of tissue specificity to the IR response, we also predict that many core pathways will be activated or repressed by IR regardless of tissue type. We have already begun to see this in differential expression analysis of IR skin data, which like spleen implicates genes involved in immune response.

8.

TRANSCRIPTOMIC DATA ANALYSIS

[10]

[11] [12] [13]

[14]

ACKNOWLEDGEMENTS

This research has been supported in part by the US Department of Energy under contract DE-AC05-00OR22725, by the US National Science Foundation under grant CCR-0075792, by the US Office of Naval Research under grant N00014-01-1-0608, and by the US National Institutes of Health under grants 1-P01-DA-015027-1, 5-U01-AA-013512 and 1-R01-MH-074460-0.

REFERENCES

[15]

[16]

[17]

[18]

[1] N. R. C. Committee to Assess Health Risks from Exposure to Low Levels of Ionizing Radiation, (2005) ‘Health Risks from

Page 11 of 13

Exposure to Low Levels of Ionizing Radiation: BEIR VII Phase 2’. National Academies Press, Washington, DC. Slonim, D. K. (2005) From patterns to pathways: gene expression data analysis comes of age. Nat. Genet., 32, 502–508. Cherepinsky, V., Feng, J., Rejali, M. and Mishra, B. (2003) Shrinkage-based similarity metric for cluster analysis of microarray data. Proc. Natl. Acad. Sci. USA, 100, 9668–9673. Bhardwaj, N. and Lu, H. (2005) Correlation between gene expression profiles and protein–protein interactions within and across genomes. Bioinformatics, 21, 2730–2738. Ge, H., Liu, Z., Church, G. M. and Vidal, M. (2001) Correlation between transcriptsome and interactome mapping data from Saccharomyces cerevisiae. Nat. Genet., 29, 482–486. Grigoriev, A. (2001) A relationship between gene expression and protein interactions on the proteome scale: analysis of the bacteriophage T7 and the yeast Saccharomyces cerevisiae. Nucleic Acids Res., 29, 3513–3519. Yang, H. H., Hu, Y., Buetow, K. H. and Lee, M. P. (2004) A computational approach to measuring coherence of gene expression in pathways. Genomics, 84, 211–217. Jansen, R., Greenbaum, D. and Gerstein, M. (2002) Relating whole-genome expression data with protein–protein interactions. Genome Res., 12, 37–46. Bellaachia, A., Portnoy, D., Chen, Y. and Elkahloun, A. G. (2002) E-CAST: a data mining algorithm for gene expression data. Proc. Workshop on Data Mining in Bioinformatics, Alberta, Canada, 49–54. Ben-Dor, A., Bruhn, L., Friedman, N., Nachman, I., Schummer, M. and Yakhini, Z. (2000) Tissue classification with gene expression profiles. J. Comput. Biol., 7, 54 –64. Ben-Dor, A., Shamir, R. and Yakhini, Z. (1999) Clustering gene expression patterns. J. Comput. Biol., 6, 281–297. Hansen, P. and Jaumard, B. (1997) Cluster analysis and mathematical programming. Math. Program., 79, 191– 215. Hartuv, E., Schmitt, A., Lange, J., Meier-Ewert, S., Lehrachs, H. and Shamir, R. (1999) An algorithm for clustering cDNAs for gene expression analysis, Proc. RECOMB, Lyon, France, 188–197. Butte, A. J., Tamayo, P., Slonim, D., Golub, T. R. and Kohane, I. S. (2000) Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. Proc. Natl Acad. Sci. USA, 97, 12182–12186. Alter, O., Brown, P. O. and Botstein, D. (2000) Singular value decomposition for genome-wide expression data processing and modeling. Proc. Nat. Acad. Sci., 97, 10101 –10106. Girolami, M. and Breitling, R. (2004) Biologically valid linear factor models of gene expression. Bioinformatics, 20, 3021 – 3033. Patti, M. E., et al. (2003) Coordinated reduction of genes of oxidative metabolism in humans with insulin resistance and diabetes: potential role of PGC1 and NRF1. Proc. Natl. Acad. Sci. USA, 100, 8466–8471. Allocco, D. J., Kohane, I. S. and Butte, A. J. (2004) Quantifying the relationship between co-expression, co-regulation and gene function. BMC Bioinformatics, 5, 18.

THE COMPUTER JOURNAL, 2007

Page 12 of 13

M. A. LANGSTON et al.

[19] Alon, U. (2003) Biological networks: the tinkerer as an engineer. Science, 301, 1866–1867. [20] Barabasi, A.-L. and Oltvai, Z. N. (2004) Network biology: understanding the cell’s functional organization. Nat. Revi. Genet., 5, 101– 113. [21] Oltvai, Z. N. and Barabasi, A.-L. (2002) Systems biology. Life’s complexity pyramid. Science, 298, 763 –764. [22] Setubal, J. C. and Meidanis, J. (1997) Introduction to Computational Molecular Biology. PWS Publishing Company, Boston. [23] Bomze, I., Budinich, M., Pardalos, P. and Pelillo, M. (1999) ‘The maximum clique problem.’ In Du, D.-Z. and Pardalos, P. M. (eds), Handbook of Combinatorial Optimization, vol. 4, Kluwer Academic Publishers. [24] Feige, U., Peleg, D. and Kortsarz, G. (2001) The dense k-subgraph problem. Algorithmica, 29, 410 –421. [25] Rougemont, J. and Hingamp, P. (2003) DNA microarray data and contextual analysis of correlation graphs. BMC Bioinformatics, 4, 4–15. [26] Watts, D. J. and Strogatz, S. H. (1998) Collective dynamics of small-world networks. Nature, 393, 440–442. [27] Ihmels, J., Friedlander, G., Bergmann, S., Sarig, O., Ziv, Y. and Barka, N. (2002) Revealing modular organization in the yeast transcriptional network. Nat. Genet., 31, 370 –377. [28] Kloster, M., Tang, C. and Wingreen, N. S. (2005) Finding regulatory modules through large-scale gene-expression data analysis. Bioinformatics, 21, 1172–1179. [29] Magwene, P. M. and Kim, J. (2004) Estimating genomic coexpression networks using first-order conditional independence. Genome Biol., 5, R100. [30] Feige, U., Goldwasser, S., Lovasz, L., Safra, S. and Szegedy, M. (1991) Approximating the maximum clique is almost NP-complete, Proc. IEEE Symp. on the Foundations of Computer Science, San Juan, Puerto Rico, 2–12. [31] Fellows, M. R. and Langston, M. A. (1985) Nonconstructive tools for proving polynomial-time decidability. J. ACM, 35, 727 –739. [32] Fellows, M. R. and Langston, M. A. (1994) On search, decision and the efficiency of polynomial-time algorithms. J. Comput. Syst. Sci., 49, 769 –779. [33] Downey, R. G. and Fellows, M. R. (1999) Parameterized Complexity. Springer, New York. [34] Garey, M. R. and Johnson, D. S. (1990) Computers and Intractability; A Guide to the Theory of NP-Completeness. W. H. Freeman and Company. [35] Chandran, L. S. and Grandoni, F. (2004) Refined memorisation for vertex cover. Proc. Int. Workshop on Parameterized and Exact Computation, Bergen, Norway. [36] Chen, J., Kanj, I. A. and Xia, G. (2005) ‘Simplicity is beauty: improved upper bounds for vertex cover.’ Technical Report, Texas A&M University. [37] Abu-Khzam, F. N., Langston, M. A. and Suters, W. H. (2005) Effective vertex cover kernelization: a tale of two algorithms. Proc. ACS/IEEE Int. Conf. on Computer Systems and Applications, Cairo, Egypt.

[38] Abu-Khzam, F. N., Collins, R. L., Fellows, M. R., Langston, M. A., Suters, W. H. and Symons, C. T. (2004) Kernelization algorithms for the vertex cover problem: theory and experiment. Proc. Workshop on Algorithm Engineering and Experiments (ALENEX), New Orleans, Louisiana. [39] Abu-Khzam, F. N., Baldwin, N. E., Langston, M. A. and Samatova, N. F. (2005) On the relative efficiency of maximal clique enumeration algorithms, with application to high-throughput computational biology, Proc. Int. Conf. on Research Trends in Science and Technology, Beirut, Lebanon. [40] Zhang, Y., Abu-Khzam F. N. , Baldwin, N. E., Chesler, E. J., Langston, M. A. and Samatova, N. F. (2005) Genome-scale computational approaches to memory-intensive applications in systems biology. Proc. Supercomputing, Seattle, Washington. [41] Abu-Khzam, F. N., Langston, M. A., Shanbhag, P. and Symons, C. T. (2006) Scalable parallel algorithms for FPT problems. Algorithmica, 45, 269–284. [42] Cheetham, J., Dehne, F., Rau-Chaplin, A., Stege, U. and Taillon, P. J. (2003) Solving large FPT problems on coarse grained parallel machines. J. Comput. Syst. Sci., 67, 691–706. [43] Abu-Khzam, F. N., Cheetham, F., Dehne, F., Langston, M. A., Pitre, S., Rau-Chaplin, A., Shanbhag, P. and Taillon, P. J. ClustalXP http://ClustalXP.cgmlab.org/. [44] Zhang, B., Schmoyer, D., Kirov, S. and Snoddy, J. (2004) GOTree Machine (GOTM): a web-based platform for interpreting sets of interesting genes using gene ontology hierarchies. BMC Bioinformatics, 5, 16. [45] Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B, 57, 289–300. [46] Suters, W. H., Abu-Khzam, F. N., Zhang, Y., Symons, C. T., Samatova, N. F. and Langston, M. A. (2005) A new approach and faster exact methods for the maximum common subgraph problem. Proc. Int. Comput. Comb. Conf., Kunming, China. [47] Chung, F. R. K. (1997) Spectral Graph Theory. AMS Press. [48] Cvetkovic, D. M., Doob, M. and Sachs, H. (1998) Spectra of Graphs: Theory and Applications (3rd revision) Wiley, New York. [49] Verma, D. and Meila, M. (2003) A comparison of spectral clustering algorithms. Department of Computer Science and Engineering, University of Washington, Seattle, Washington. [50] Demir, C., Gultekin, S. H. and Yener, B. (2004) Spectral analysis of cell-graphs of cancer. Department of Computer Science, Rensselaer Polytechnic Institute, Troy, New York. [51] Baldwin, N. E., Chesler, E. J., Kirov, S., Langston, M. A., Snoddy, J. R., Williams, R. W. and Zhang, B. (2005) Computational, integrative and comparative methods for the elucidation of genetic co-expression networks. J. Biomed. Biotechnol., 2, 172–180.

THE COMPUTER JOURNAL, 2007

INNOVATIVE COMPUTATIONAL METHODS

FOR

[52] Ideker, T., Galitski, T. and Hood, L. (2001) A new approach to decoding life: systems biology. Annu. Revi. Genomics Hum. Geneti., 2, 343–372. [53] Wasserman, W. W., Palumbo, M., Thompson, W., Fickett, J. W. and Lawrence, C. E. (2000) Human–mouse genome comparisons to locate regulatory sites. Nat. Genet., 26, 225–258. [54] Bailey, T. L. and Elkan, C. (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Second Int. Conf. on Intelligent Systems for Molecular Biology, 28–36. Menlo Park, California. [55] Bailey, T. L. and Gribskov, M. (1998) Combining evidence using p-values: application to sequence homology searches. Bioinformatics, 14, 48–54. [56] Yin, E., Nelson, D. O., Coleman, M. A., Peterson, L. E. and Wyrobek, A. J. (2003) Gene expression changes in mouse brain after exposure to low-dose ionizing radiation. Int. J. Radiat. Biol., 70, 759 –775. [57] Smyth, G. K. and Speed, T. (2003) Normalization of cDNA microarray data. Methods, 31, 265 –273. [58] Magwene, P. M. and Kim, J. (2004) Estimating genomic coexpression networks using first-order conditional independence. Genome Biol, 5, R100. [59] Chesler, E. J. et al. (2005) Complex trait analysis of gene expression uncovers polygenic and pleiotropic networks that modulate nervous system function. Nat. Genet., 37, 233– 242. [60] Kerr, M. K., Martin, M. and Churchill, G. A. (2000) Analysis of variance for gene expression microarray data. J. Comput. Biol., 7, 819 –837. [61] Modarres, R. and Jernigan, R. W. (1992) Testing the equality of correlation matrices. Commun. Stat. – Theory Methods, 21, 2107–2125.

TRANSCRIPTOMIC DATA ANALYSIS

Page 13 of 13

[62] Kim, K. H. and Bentler, P. M. (2002) Tests of homogeneity of means and covariance matrices for multivariate incomplete data. Psychometrika, 67, 609– 623. [63] Thai, S. F., Allen, J. W., DeAngelo, A. B., George, M. H. and Fuscoe, J. C. (2003) Altered gene expression in mouse livers after dichloroacetic acid exposure. Mutat. Res., 543, 167–180. [64] Santagata, S., Boggon, T. J., Baird, C. L., Gomez, C. A., Zhao, J., Shan, W. S., Myszka, D. G. and Shapiro, L. (2001) G-protein signaling through tubby proteins. Science, 292, 2041–2050. [65] Voy, B. H., Scharff, J. A., Perkins, A. D., Saxton, A. M., Borate, B., Chesler, E. J., Branstetter, L. K. and Langston, M. A. (2006) Extracting gene networks for low-dose radiation using graph theoretical algorithms. PLoS Comput. Biol., 2, e89. [66] LinuxElectrons, http://www.linuxelectrons.com/article.php/200 40513022006839. [67] Chesler, E. J. and Langston, M. A. (2005) Combinatorial genetic regulatory network analysis tools for high throughput transcriptomic data. Proc. RECOMB Satellite Workshop on Systems Biology and Regulatory Genomics, San Diego. [68] Tabb, D. L., Thompson, M. R., Khalsa-Moyers, G. K., VerBerkmoes, N. C. and McDonald, W. H. (2005) MS2Grouper: group assessment and synthetic replacement of duplicate proteomic tandem mass spectra. J. Am. Soc. Mass Spectrom., 16, 1250 – 1261. [69] Tomascik-Cheeseman, L. M., Coleman, M. A., Marchetti, F., Nelson, D. O., Kegelmeyer, L. M., Nath, J. and Wyrobek, A. J. (2004) Differential basal expression of genes associated with stress response, damage control, and DNA repair among mouse tissues. Mutat. Res., 561. 1–21.

THE COMPUTER JOURNAL, 2007

Suggest Documents