ITR: Understanding Stress Resistance Mechanisms in Plants: Multimodal Models Integrating Experimental Data, Databases, and the Literature
Results from Relevant, Prior NSF Support
Next Generation Software: A Microarray Experiment Management System. Co-PIs: N. Ramakrishnan, R. G. Alscher, L. S. Heath, L. T. Watson, J. Weller. EIA-0103660, $600,000, 09/01/2001–08/31/2004. This project supports the design and implementation of Expresso, a system for microarray experiment management. The NSF NGS program specifically calls for innovative software designs for emerging applications. In this regard, Expresso provides novel algorithms (i) for model-based design and management of microarray experiments, (ii) for integrating experiment design and data analysis to ‘close the loop,’ and (iii) for supporting a lightweight data management system. The specific computational capabilities of Expresso are discussed at several places in this proposal, especially as they pertain to results from our microarray experiments and the choice of data analysis techniques.
1
Introduction
Plants have evolved to cope with a variety of environmental stresses, both abiotic — drought, heat, cold, and salt — and biotic — pathogens and insects. The imposition of stress on a plant marshals defense resources at the cellular level. The exact defensive resources marshaled exhibit both common features and divergent features among the various stressors. In addition, defense mechanisms can operate collaboratively or independently under different circumstances. We are actively pursuing functional genomic and bioinformatic approaches to investigate these defense mechanisms, particularly in the context of drought stress in loblolly pine (NSF grant # EIA-0103660). We propose to utilize biological information from multiple sources — experimental data (especially gene expression data); sequence, protein, and other databases; and the biological literature — to build a new generation of biological models, called multimodal networks, that can represent multiple aspects of our current state of biological knowledge and of biological systems themselves: responses over time; response variation by subcellular compartments; uncertainty (our lack of complete knowledge of cell state); evolution of the genome; and the dynamic changes in biological information due to the boom in biological data and knowledge. In addition to the pine drought stress experimental data that we are obtaining on an ongoing basis, we propose to pursue experiments taking advantage of the imminent availability of an Arabidopsis transcriptome chip (Chris Town, TIGR, see attached message) representing about 23,000 genes from the Arabidopsis thaliana genome. Because we can quickly generate experimental data using this chip, we will be able to rapidly develop and refine our multimodal models incorporating data from our drought stress experiments with Arabidopsis. 
We are in the first year of NSF grant # EIA-0103660, which supports the development of Expresso, a Next Generation Software system for microarray experiment management and data analysis. The proposed research leverages our momentum in bioinformatics and in the Expresso project. In the proposed project, we will develop explanatory and predictive models of phenomena occurring within plant cells in response to drought and other abiotic stresses. Biological phenomena that will play a role in our models include the following.
• Compartmentalization of genomic information and cellular processes among the nucleus, organelles, the cell wall, and the cytosol. (An organelle is a component of the cell that is bounded by a semi-permeable membrane, implying that the organelle maintains its own internal and unique environment.)
• The possibility of redundancy or near-redundancy in the function of a cell in the form of alternative pathways.
• Metabolic pathways involved in essential cell processes such as bioenergetics and the biosynthesis of macromolecules.
• Signaling pathways and attendant downstream events responding to environmental changes, including beneficial changes (such as an increase in available nutrients) as well as threatening changes (such as a decrease in available water).
• Defense mechanisms to successfully adapt to environmental stresses.
• Evolution of cellular defense mechanisms through mutation and especially acclimation to stressors within the environment over generations.
We incorporate existing biological knowledge as constraints on multimodal networks and employ techniques from data mining and information integration.
2
Research Objectives
The proposed research has the following objectives.
1. To develop a mathematical theory of the space of multimodal networks and operations on that space.
2. To create a library of computational models, based on multimodal networks (see Section 7.1), for cell and molecular biology phenomena.
3. To provide computational mechanisms for manipulating the library, including creation, combination, and evaluation of models.
4. To incorporate biological knowledge from multiple sources, namely experimental data (including our gene expression data for drought stress responses in loblolly pine and Arabidopsis), databases, and the literature, in the development of multimodal models.
5. To identify predictive models for drought stress responses in plants.
6. To identify Arabidopsis genes on microarrays that are expressed during recovery from different levels of drought stress.
7. To use the Expresso system (see Section 6.3) to categorize drought responsive genes into groups with respect to degree of osmotic stress imposed, particular signaling pathways (where possible), and functional categories.
8. To use vtc 1-1, an ascorbic acid deficient Arabidopsis mutant, to investigate the effects of increased reactive oxygen species on drought-mediated gene expression events.
See Section 7 for details, especially the biological hypotheses tested via Objectives 6–8.
3
Relation to Long-Term Goals
As a multidisciplinary team, we have been working together towards bioinformatics goals for more than two years. In the Expresso project, we are combining computation and experimentation in an intimate way that completely closes the experimental loop. At multiple points in the experimental loop (design and data mining, especially), we are able to make use of existing biological knowledge in novel ways to support the experimental process. We are also incorporating existing computational models of molecular processes into Expresso. Finally, we allow the biologist to visualize the information integrated within Expresso databases in flexible ways. The multimodal networks and information integration aspects of the proposed project yield additional biological knowledge within the experimental loop, provide a powerful new class of integrative models within Expresso, and support an additional class of information within Expresso for visualization. Analysis of the molecular and physiological data produced by the project will provide information for our long-term goal, the identification of mechanisms underlying stress resistance in higher plants. The application of computational tools to data revealing successive, interacting, global patterns of gene expression such as those seen on microarrays will uncover the relationship of specific downstream events to the signaling pathways that control drought stress responses in the model system Arabidopsis. The data analysis and data mining components of the Expresso microarray experiment management system have the ability to integrate data from a variety of experiments with biological knowledge provided by the life scientist. The incorporation of large and diverse data sources into multimodal models provides predictive power and the potential of improving drought stress tolerance in plants.
4
Current Knowledge of Drought Stress Responses in Plants
Genes controlling osmotic adjustment, protein stabilization, ROS detoxification, ion transport, membrane fluidity, gene activation, and signal transduction have all been implicated in osmotic stress adjustment responses in higher plants in separate experimental systems [14, 55].
Early events in responses to drought stress. A putative osmosensor, AtHK1, a membrane-based histidine kinase, is thought to be one component that relays changes in osmotic potential outside the cell to intracellular signal transduction pathways [15]. Other membrane sensors are also hypothesized to be present and may respond to increasing levels of stress. These sensors trigger discrete phospholipid-based signaling pathways which are involved in early events in drought stress responses, with different pathways proposed to respond to different stress levels [8]. Activation of osmotic stress signal transduction pathways is proposed to be associated with (in increasing order of osmotic stress) phospholipase D (PLD), phosphoinositide 3-kinase (PI3K), and phospholipases A2 and C; an additional pathway from PLD leads to a first wave of downstream events. These affect potassium channels (PLD), H+-ATPase (PI3K), inositol trisphosphate (IP3) production and calcium-controlled processes (phospholipase C, PLC), diacylglycerol (DAG) production, and protein kinases, including mitogen-activated protein kinases (MAPK) (PLC and PLD). PLD has been shown to respond to the stress hormone abscisic acid (ABA), acting as an intermediate between ABA and downstream events in gene expression [43]. An antisense Arabidopsis line depleted in the alpha isoform of PLD showed increased sensitivity to water stress and decreased responsiveness to applied ABA [52].
[Figure 1 diagram. Node labels: Oxidative Stress; Reactive Oxygen Species (ROS); Membrane receptors; Protein kinases; phosphatases; Redox sensitive transcription factors; Gene expression; Antioxidants; Metabolite Defense; Antioxidant Defense; Repair; Acclimation]
Figure 1: Redox regulation of cellular systems.
Later events in drought stress responses. The relationship of the initial drought-mediated changes described above to subsequent downstream events in gene expression is not yet well defined. Downstream activation of stress resistance genes associated with three distinct functions can occur as a consequence of exposure to drought:
1. Synthesis of molecules associated with specific resistance to drought stress, such as proline for osmotic adjustment to water stress, aquaporins for water movement across membranes, and extensins and proline-rich proteins for cell wall extensibility events (drought acclimation genes). (The involvement of ABA and/or calcium is unknown for genes in the second two categories. Proline biosynthesis is known to be stimulated by ABA [59].)
2. Activation of oxidative stress resistance processes, such as antioxidant-based mechanisms for sustained removal of toxic reactive oxygen species (antioxidant genes). Hydrogen peroxide (H2O2), a potentially toxic reactive oxygen species (ROS), is produced by cells under stress when, for example, oxygen acts as an alternative electron acceptor for electron transport chains that are damaged or that lack their normal electron acceptor (e.g., the photosynthetic electron transport chain in the absence of available carbon dioxide and NADP+). If ROS levels exceed the resistance capacity of the cell, damage to macromolecules occurs (see below). However, the presence of increased levels of ROS also causes a change in intracellular redox status, which is also sensed by the genome by an unknown mechanism (Figure 1). ROS themselves, perhaps in the form of H2O2, appear to play a role in this intracellular redox sensing, in the activation of antioxidant resistance mechanisms such as glutathione and ascorbate biosynthesis, and in ROS-scavenging pathways [2, 34, 40, 49, 67, 68].
Redox sensitive transcription factors have been identified in animal, bacterial, and plant cells [46]. H2O2 also mediates ABA signaling in guard cells, providing a direct link between drought responses and ROS-associated mechanisms. The relationship between ROS and drought-associated response mechanisms has not been fully elucidated, however.
3. Removal or repair of damaged macromolecules, such as the action of molecular chaperones on denatured proteins, or the enzymatic removal of lipids that have undergone peroxidation (removal and repair genes). Unless ROS are removed promptly, their action can cause protein unfolding, the inactivation of enzymes, DNA damage, mutagenesis, lipid peroxidation, and disruption of cell membrane function. A novel aldehyde reductase which acts to remove the products of drought-mediated lipid peroxidation has been reported [45]. Heat shock proteins/molecular chaperones are important players in resistance to oxidative stress [24, 25, 64]. Molecular chaperones interact to protect against damage to macromolecules through the repair of denatured proteins or through targeting irreversibly damaged proteins to the ubiquitin/proteasome pathway.
The relative drought responsiveness of members of these three groups has not been fully explored, nor has their relationship to known signal transduction pathways.
Preliminary Data. The Arabidopsis thaliana mutant, vtc1, contains a recessive, single-point mutation in the gene that encodes a GDP-mannose pyrophosphorylase [12]. This enzyme is involved in the conversion of D-mannose to L-galactose and represents one of the initial steps in the proposed L-ascorbate biosynthetic pathway in plants [65]. vtc 1-1 and 1-2 mutants contain only 30% of the foliar L-ascorbate levels found in wild type plants [13]. The mutants are ozone-sensitive and grow more slowly than wild type seedlings, but do not exhibit altered photosynthetic capacity nor increased chlorophyll fluorescence under excessive irradiance [63]. However, when wild type and vtc1 plants are subjected to severe moisture stress, sufficient to cause complete wilting of the foliage, vtc1 leaves become bleached white, whereas wild type leaves remain green (B. Chevone, unpublished observation). Upon watering, wild type plants recover and attain full turgor, but vtc1 plants fail to rehydrate and eventually die. We presume that the buildup of ROS in the absence of sufficient antioxidant results in increased cellular damage. However, any ROS-influenced signaling mechanisms should also be affected to a greater degree in the mutant plants.
5
A Key Biological Exemplar
Our current knowledge of plant response to osmotic stress is incomplete, but there are complementary views represented in the literature that provide an excellent start on network models for drought stress responses. For example, Figure 2 gives a network exemplifying response to osmotic stress adapted from Munnik and Meijer [43], suggesting alternative responses dependent on the level of stress imposed and the resulting perception by a variety of osmosensors (only one of which, ATHK1, has been identified [61]). An alternative example (Figure 3) gives a related network that emphasizes the role of ABA in drought stress responses, adapted from Shinozaki et al. [56]. These networks are excellent starting points for a more detailed model of drought stress responses. They each suggest the spatial flow of events from the cell membrane to the nucleus (gene expression), as well as the temporal flow of rapid response followed by slower adaptation. The networks are incompatible, however, as the nodes are different in each (reflecting the alternate perspectives of the two research efforts) and are even at different levels of abstraction within the same network. The two networks contain nodes representing individual molecular species (PI3K, PLD, etc., in Figure 2; ABA in Figure 3), as well as gene expression, an immensely complex process involving many genes and molecular mechanisms for transcription. It should also be noted that the kind of relationship represented varies from node to node and arc to arc.
[Figure 2 diagram. Node labels: Dehydration; Osmosensors∗; ATHK1; S2; S3; ...; SN; Signal transduction; PI3K; PLD; PLC; PLA2; PI(3,5)P2; PA; DAG; IP; L-PA; H+ ATPase?; K+ channel; Ca2+; Protein Kinase; MAPK; Gene Expression∗; Adaptive responses. Other labels: Rapid; Slow; Inhibits?; ?]
Figure 2: This network is adapted from Munnik and Meijer [43]. ∗ Different osmosensors and signaling pathways are proposed to respond to different levels of osmotic stress. We propose (Hypothesis I) that the quality and quantity of adaptive gene expression is also affected by the degree of stress imposed.
For our representation of these networks, it is essential that we include the lessons learned from these and similar networks in the biological literature. Nodes must be of multiple types and at multiple levels of abstraction. Arcs must be capable of representing multiple types of relationships between nodes. Moreover, our networks must be able to include arcs representing many-to-many relationships; this is seen in Figure 2 through the potential of multiple osmosensors to influence multiple signal transducers. (In the parlance of graph theory, our networks must be multigraphs.) Of course, future experiments may associate particular osmosensors with particular signal transducers. Operations on our networks must support refinement of networks to reduce a many-to-many relationship to, for example, numerous one-to-one relationships. As different networks representing related phenomena are typically incompatible (as discussed above), a combination of two or more networks is partial, in the sense that there will be only some nodes and arcs in common and the same process may be represented redundantly in the combination, though in alternate ways.
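The representation requirements above (typed nodes, typed arcs, many-to-many arcs, refinement, and partial overlap between networks) can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the data structure used in Expresso: the class names, the `kind` and `relation` labels, and the `refine` and `shared_nodes` operations are all hypothetical, and the one-to-one pairing used in the example is invented purely to show the operation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    name: str
    kind: str  # illustrative type label, e.g. "osmosensor" or "process"


@dataclass(frozen=True)
class Arc:
    sources: frozenset  # names of source nodes (one or many)
    targets: frozenset  # names of target nodes (one or many)
    relation: str       # illustrative relationship label, e.g. "influences"


@dataclass
class MultimodalNetwork:
    nodes: dict = field(default_factory=dict)  # name -> Node
    arcs: list = field(default_factory=list)   # a list, so parallel arcs are allowed

    def add_node(self, name, kind):
        self.nodes[name] = Node(name, kind)

    def add_arc(self, sources, targets, relation):
        missing = (set(sources) | set(targets)) - set(self.nodes)
        if missing:
            raise KeyError(f"unknown nodes: {sorted(missing)}")
        arc = Arc(frozenset(sources), frozenset(targets), relation)
        self.arcs.append(arc)
        return arc

    def refine(self, arc, pairs):
        """Replace a many-to-many arc by one-to-one arcs for the given pairs."""
        for s, t in pairs:
            if s not in arc.sources or t not in arc.targets:
                raise ValueError(f"({s}, {t}) is not covered by the arc")
        self.arcs.remove(arc)
        for s, t in pairs:
            self.add_arc([s], [t], arc.relation)

    def shared_nodes(self, other):
        """Node names two networks have in common (basis of a partial combination)."""
        return set(self.nodes) & set(other.nodes)


# Node names are taken from Figure 2; everything else is hypothetical.
net = MultimodalNetwork()
for name in ("ATHK1", "S2"):
    net.add_node(name, "osmosensor")
for name in ("PLD", "PLC"):
    net.add_node(name, "signal transducer")
many = net.add_arc(["ATHK1", "S2"], ["PLD", "PLC"], "influences")

# Hypothetical refinement: this pairing is NOT an experimental result.
net.refine(many, [("ATHK1", "PLD"), ("S2", "PLC")])

other = MultimodalNetwork()
other.add_node("PLD", "signal transducer")
other.add_node("ABA", "hormone")
overlap = net.shared_nodes(other)
```

Storing arcs in a list keeps parallel arcs distinct, and letting each arc carry a set of sources and a set of targets is one simple way to hold a many-to-many relationship until experiments justify refining it.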
6
Current Work
6.1
Effect of mild and severe stress on gene expression in Pinus taeda
With microarray design and data analysis provided by Expresso (Heath, Ramakrishnan, and Watson), Alscher and Chevone have identified genes and groups of genes whose expression is associated with successful adaptation to drought stress. Two years of drought experiments are summarized below (1999-2001). The first year was supported by in-house VT funds. The second was supported by NSF grant # EIA-0103660 (Ramakrishnan, PI; beginning date, 9/1/01). We are currently analyzing the results of the second year.
6.2
The Experimental System
Plant Material and Experimental Conditions (same for both years). We have investigated expression patterns of genes in needles of loblolly pine rooted cuttings (equivalent in size and development to one-year-old seedlings, but of identical genotype) from two different unrelated genotypes from the Atlantic Coastal Plain that had been exposed to cycles of mild drought conditions over a growing season. We have compared those results with those obtained from rooted cuttings of one of the genotypes exposed to more severe, nonadaptive, conditions over the same time period. Rooted cuttings were subjected to mild or severe drought stress for four (mild) or three (severe) cycles. “Mild” stress was defined as needles dried down to -10 bars, which produced little effect on growth and new flushes compared to control trees. “Severe” stress was defined as needles dried down to -17 bars, with growth retardation and markedly fewer new flushes compared to controls. Total RNA was isolated from the samples by the method of Chang et al. [9], modified in our laboratory, and used as a source of cDNA to probe the microarrays of loblolly cDNAs from the Pine Genome Sequencing Project (R. Sederoff, PI, NCSU).
[Figure 3 diagram. Node labels: Dehydration; Signal Transduction; ABA levels increase; ABA dependent; ABA independent; H2O2; PLDα; Stomatal closure; Antioxidant Activation Mechanism; Protein synthesis; bZIP-ABRE system; DREB2-DRE system; REB/CBF genes; Gene expression; Adaptive Responses. Arc labels: G1; G2; G3; G4; Rapid; Slow; ∗; ∗∗]
Figure 3: This network is adapted from Shinozaki and Yamaguchi-Shinozaki [56]. Four mechanisms for influencing gene expression are represented, as suggested by G1, G2, G3, and G4 on the arcs. ∗∗ DREB2/DRE system (drought responsive element binding protein and drought responsive element); REB/CBF (rice endosperm binding factor and C-repeat binding factor); ABRE (ABA responsive element). Our prediction is that when levels of PLDα are depleted, the expression of ABA-sensitive and ABA-insensitive, PLDα-controlled genes will be affected (∗ Hypothesis II). We also predict that in vtc 1-1, an ascorbic acid deficient mutant, higher levels of H2O2 will stimulate ABA-dependent gene expression (∗∗ Hypothesis III).
Choice of target cDNAs for the microarray. The number of ESTs sequenced by the Pine Genome Sequencing Project is now over 60,000 and will reach 85,000 in a few months. Many of these have a proposed functional annotation derived from a BLAST search of protein databases. In the first year, 384 clones of known function were printed. A 2103-clone unigene set was used on the microarrays in the second year. A system of functional categories was set up to include all the genes that were printed; see http://bioinformatics.cs.vt.edu/~ralscher/IPB_2002/. Using algorithms incorporated in Expresso [1], we identified genes and groups of genes involved in stress responses.
6.3
Expresso: a data management and analysis system
The data analysis and data mining components of the Expresso microarray experiment management system have the ability to integrate data from a variety of experiments with biological knowledge provided by the life scientist. Data mining via inductive logic programming reveals relationships among components of the hierarchy of regulatory responses. Network bioinformatics captures these relationships in a predictive computational unit.
Results from Expresso. Signal transduction, drought acclimation, photosynthesis, and protection/repair genes are up-expressed specifically in acclimated needles. Physiological and expression data were obtained for the control versus mild stress condition and for the control versus severe, nonacclimatory, condition. Net photosynthesis exhibited an acclimatory response under mild stress conditions in Cycles 2 and 3, whereas no acclimation was detected in Cycles 2 or 3 under severe stress (see Table 1). The numbers of genes showing increases in transcript abundance for Cycles 1, 2, and 3 under mild conditions were 70, 1284, and 370, respectively. Comparable values for severe conditions over the three cycles were 860, 765, and 612. Numbers of genes whose expression was affected under mild conditions, and not under severe conditions, were 38, 960, and 281 for Cycles 1, 2, and 3. Some of the categories into which these genes fell were transcription factors, drought acclimation, oxidative stress resistance, and protection and repair (Figure 4). We are in the process of analyzing these data further.
Condition  Cycle  Net Photosynthesis (µmol CO2 m−2 s−1)
                  Control  Stressed
Mild       1      4.28     2.48
Mild       2      3.54     3.82
Mild       3      4.75     3.28
Severe     1      3.67     0.88
Severe     2      3.0      0.19
Severe     3      2.9      0.77
Table 1: Effect of mild or severe drought stress on net photosynthesis in one year old loblolly pine rooted cuttings. Measurements were made on the first fully mature fascicle in each case, using the Li-Cor 6400. Three or four repeated measurements were made in each case.
At the final harvest under mild conditions in 1999, the expression of genes associated with drought acclimation, such as the dehydrins and aquaporins, was increased, with either negative or undetectable change for the severe stress condition. LP-3, an established water-stress inducible gene in loblolly pine, increased under mild but not under severe conditions. The same pattern was observed for glutathione-S-transferase (antioxidant function), proteases, receptor-like protein kinases (signal transduction), phosphoribulokinase, transketolase (chloroplast form), rubisco-binding proteins, protochlorophyllide reductase (photosynthesis), and genes encoding protection/repair proteins (HSPs) such as HSP70 (chloroplast-associated chaperone function [50]), HSP23 (LEA-like genes [17]), and HSP100 (thermotolerance [27]). The HSP result is in agreement with results obtained with hydrogen peroxide stress in Arabidopsis, although in that case, gene expression was not related to stress resistance [16]. In contrast, HSP80s (thought to be involved in chromatin organization [54]) did not respond. These data provide a first snapshot of the status of gene expression specifically associated with acclimation during the course of a month-long exposure to cycles of drought stress in a woody species.
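The mild-only, shared, and severe-only gene counts discussed above are the regions of a two-set Venn diagram. The following Python sketch shows one way such regions could be computed from two lists of up-regulated clones. The function name and the clone identifiers are hypothetical placeholders, not data from our experiments.

```python
def venn_partition(mild_up, severe_up):
    """Split two collections of up-regulated clones into Venn regions."""
    mild_up, severe_up = set(mild_up), set(severe_up)
    return {
        "mild_only": mild_up - severe_up,    # up under mild stress only
        "both": mild_up & severe_up,         # up under both conditions
        "severe_only": severe_up - mild_up,  # up under severe stress only
    }


# Hypothetical clone identifiers, for illustration only.
mild = {"clone_A", "clone_B", "clone_C"}
severe = {"clone_B", "clone_D"}

regions = venn_partition(mild, severe)
counts = {region: len(members) for region, members in regions.items()}
```

Applied once per cycle, the same function would yield the three numbers needed for each Venn diagram.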
7
Plan of Work
7.1
Multimodal Networks: General, Flexible, Extensible Models
We propose to use networks (directed graphs or hypergraphs) as our underlying models for representing the topological and dynamic aspects of molecular transformation and transportation within the plant cell in response to stress imposition. While networks are capable of representing time, topology, and causality in natural ways, the richness of information available in cell and molecular biology requires more than just a network. As a result, we propose to refine and extend networks to represent such information as hierarchical structure and uncertainty. We call these extended networks multimodal, as they incorporate diverse forms of information in a single framework.
Figure 4: Effect of mild or severe drought stress on one year old loblolly pine seedlings. Water was withheld until a given water potential was reached. Plants were then rewatered once and water was again withheld. Each cycle lasted 4-5 days for mild stress and 7-9 days for severe stress. Plants underwent 4 cycles of mild drought stress, while another group underwent 3 cycles of 7-9 days of severe stress. Venn diagrams are depicted for cycles 1, 2, and 3 showing the numbers of genes in loblolly pine that are up-regulated on microarrays under both mild and severe stress, as well as only one or the other condition.
Network Models. There is a rich collection of models in theoretical computer science and related areas for representing processes. These include automata models (e.g., finite state machines, Turing machines, random access machines); grammatical or rewriting models (e.g., context-free grammars, unrestricted grammars, string-rewriting systems, graph-rewriting systems); logical models (e.g., predicate logic, logic programming); linguistic models (e.g., functional programming languages); and graph, hypergraph, or network models (e.g., artificial neural networks, boolean circuits, Petri nets) [5, 28, 38, 57, 62]. These models are primarily discrete in nature, involving discrete concepts such as states, strings, logical formulae, graphs, and networks, though some are adorned with real numbers employed in a discrete fashion (e.g., artificial neural networks). As we will not be modeling components of a cell below the level of molecules (which will hence be regarded as discrete entities), discrete models are an appropriate starting point for our modeling efforts. In the biological literature, models of cell metabolism or signal transduction are typically expressed visually as pathways or networks. Nodes in such a network can represent chemical reactions involving one or more chemical inputs and zero or more enzymes, producing one or more reaction products, while arcs represent the reaction products flowing from one reaction to another. Alternatively, a node can represent a metabolite and an incoming arc can represent the required precursors and enzymes for the production of the metabolite. (See Polle [47] for a typical example. Also, see [19, 20].) Our first network model will be a formalization of a biological pathway as a network that expresses the dependencies among the components in the pathway, much like Figures 2 and 3. The important work is to extend this initial model to carefully address some additional aspects of the plant cell and of the nature of biological knowledge.
Regulation of Transcription and Translation. The transcription-translation transformation mechanism is delicately choreographed through biochemical repressors and inducers, whose interaction can be precisely described by a system of ordinary differential equations.
An extended model can represent the logic of this choreography, at least to the extent that the relevant biology is understood. N.B., explicit information on translation cannot be obtained from our microarray experiments, so networks representing translation can only be derived from sources in the biological literature.
Hierarchical Topology. Plant cells are hierarchically organized, with each cell containing organelles and finer levels of structure occurring within organelles. We call this organization the hierarchical topology of a cell. Significant details of the hierarchical topology can be derived from the biological literature and represented in multimodal networks, utilizing hierarchical connections among network nodes at different levels of cellular organization.
Temporal. Many reactions within the cell occur constantly and in parallel with other reactions. Other reactions occur in response to certain internal or external conditions but still in parallel with other reactions. Finally, there are reactions that can occur only after other reactions have produced the needed precursors or an information transfer has brought the needed precursors into spatial proximity. The dependence or independence of reaction sequencing can be represented implicitly or explicitly within an augmented network model, again as a precursor to a time-dependent ODE model. A related NIH project is developing a system to automate the conversion of augmented network models to ODE models. A very important and relevant example for our proposed research is the switch to defense metabolism in response to pathogen attack as described by Scheideler et al. [53]. We propose to explore the comparable switch to defense metabolism which occurs in response to drought stress. Scheideler et al. utilized a 13,000 cDNA microarray to identify changes in the transcriptome of Arabidopsis thaliana leaves over a time course from the onset of infiltration by Pseudomonas syringae pv. tomato. The resulting microarray intensities were analyzed by a custom-built statistical suite constructed using MATLAB. Through the use of careful statistical techniques, described in [4], Scheideler et al. were able to quantify the levels of confidence attained (see discussion of uncertainty below). In Expresso, we will be utilizing the more sophisticated statistical techniques of Wolfinger et al.
[66] and of Kerr, Martin, and Churchill [30, 31, 32] to attain levels of confidence for our microarray experiments on drought stress responses. Uncertainty. Probabilities, reflecting uncertainty, are naturally attached to the arcs in a multimodal network to yield a probabilistic, constrained network. Such a network generalizes such concepts as reliability in communication networks [11] and Markov chains [18], and can be analyzed using extensions of stochastic techniques to be developed as part of the mathematical theory of multimodal networks.
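To make the connection between probabilistic arcs and Markov chains concrete, the following Python sketch propagates probability mass along weighted arcs, one step at a time. It is a minimal illustration, assuming that the outgoing arc probabilities of each node sum to one. The node names are borrowed from Figure 2 only for readability, and every probability value is hypothetical and carries no biological meaning.

```python
def validate(arcs, tol=1e-9):
    """Check that the outgoing probabilities of every node sum to 1."""
    for state, outgoing in arcs.items():
        total = sum(outgoing.values())
        if abs(total - 1.0) > tol:
            raise ValueError(f"arcs leaving {state!r} sum to {total}, not 1")


def step(distribution, arcs):
    """One Markov-chain step: push probability mass along the weighted arcs."""
    nxt = {}
    for state, mass in distribution.items():
        for target, p in arcs[state].items():
            nxt[target] = nxt.get(target, 0.0) + mass * p
    return nxt


# Hypothetical arc probabilities (assumptions, not measured values).
arcs = {
    "dehydration": {"PLD": 0.6, "PLC": 0.4},
    "PLD": {"gene_expression": 1.0},
    "PLC": {"gene_expression": 0.5, "no_response": 0.5},
    "gene_expression": {"gene_expression": 1.0},  # absorbing state
    "no_response": {"no_response": 1.0},          # absorbing state
}
validate(arcs)

dist = {"dehydration": 1.0}
for _ in range(2):
    dist = step(dist, arcs)
```

After two steps all probability mass sits in the two absorbing states, so `dist` gives the chance of reaching each outcome under the assumed arc probabilities.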
7.2
Expresso and Experimental Data Analysis
Expresso is an innovative and integrated solution to microarray experiment management and data analysis that is being developed by an interdisciplinary research team under a highly competitive NSF Next Generation Software grant (NSF grant # EIA-0103660). As such, Expresso is an ideal tool for the bioinformatics needs of this project. Expresso integrates all phases of microarray experiments into one system, including experiment design (selection of clones, chip layout, specification of hybridizations to be performed); image analysis; statistical analysis; data management via a unique semi-structured database; data mining via inductive logic programming (ILP) [41]; and integration of biological information from diverse sources into the database, analyses, and data mining. An important aspect of the integrated nature of Expresso is that it organically provides support for closing the experimental loop, allowing the results of the analysis of data from previous experiments to feed directly into the design of subsequent experiments. The flexibility of Expresso is reflected in its support for multiple alternatives at each phase of the experiment (e.g., the statistical analysis discussed below), as opposed to the myriad stand-alone software systems that support a single alternative for a single phase. This project will extend Expresso in three significant ways: improved techniques for data analysis, enhanced information integration using ILP, and support for a database of biological networks.
Numerous statistical techniques for analyzing the rich datasets generated in microarray experiments have been proposed and implemented [3, 10, 21, 29, 30, 31, 32, 33, 37, 39, 44, 60, 66, 69]. Each technique aims to address one or more aspects of the complexity of microarray datasets, and none can by itself resolve every difficulty posed by the high-dimensional parameter space in which the datasets reside. The design of the Expresso system recognizes the value of having multiple statistical techniques available and, indeed, of applying diverse techniques to a data set. Confirmation of results from multiple analyses is analogous to confirmation via repetition of an experiment but entails only the marginal cost of some additional computation. Many statistical techniques used in the microarray literature assume normal distributions. However, our experience with microarray datasets has demonstrated that an assumption of normality is not universally justifiable and can lead to weak results. By assuming a binomial distribution, we have obtained classification results (up-expressed, down-expressed, unchanged) with high confidence that, in conjunction with the use of inductive logic programming, have yielded very useful biological results. Extending this technique to multiple classification levels (e.g., highly up-expressed, moderately up-expressed, etc.) will be done as part of this project and presents no conceptual difficulty. In addition, the Wolfinger et al. [66] mixed model analysis will be used to establish levels of confidence for microarray data.
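As an illustration of the binomial approach, the following minimal Python sketch classifies a gene from its replicate log ratios using a sign test. It is a sketch under stated assumptions and not the procedure implemented in Expresso: the function names, the use of replicate log ratios as input, and the significance level alpha are all hypothetical. Under the null hypothesis of no change, each informative replicate is assumed equally likely to be positive or negative, so the number of positive replicates follows a Binomial(n, 0.5) distribution.

```python
from math import comb


def binomial_tail(k, n, p=0.5):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def classify_gene(log_ratios, alpha=0.05):
    """Classify one gene from replicate log ratios (hypothetical input format).

    Replicates with a log ratio of exactly zero carry no sign information
    and are dropped before the test.
    """
    informative = [r for r in log_ratios if r != 0]
    n = len(informative)
    if n == 0:
        return "unchanged"
    ups = sum(1 for r in informative if r > 0)
    downs = n - ups
    # One-sided tests in each direction; at most one can succeed for alpha < 0.5.
    if binomial_tail(ups, n) <= alpha:
        return "up-expressed"
    if binomial_tail(downs, n) <= alpha:
        return "down-expressed"
    return "unchanged"


# Illustrative data only; these values are not experimental results.
example_up = classify_gene([0.8, 1.1, 0.4, 0.9, 1.3, 0.2, 0.7, 0.5])
example_mixed = classify_gene([0.8, -1.1, 0.4, -0.9, 1.3, -0.2, 0.7, -0.5])
```

With only four replicates, even unanimous agreement gives a tail probability of 1/16, which exceeds 0.05, so the sketch reports 'unchanged'; this is one way to see why replication matters under such a test.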
7.3
Data Mining and Information Integration
The stress experiments performed in this project will generate tremendous quantities of gene expression data. To discover gene groups and metabolic pathways essential for successful response (and adaptation) to stress, models at multiple levels of abstraction will be employed [48]. In the most abstract models, groups of genes that exhibit coordinated expression under experimental conditions will be mined, represented, and visualized as a network. At lower levels of abstraction, the expression of individual genes and proteins will be evaluated and related to the expression of other genes and proteins using clustering algorithms. A complete set of accurate models for the expression of the thousands of ESTs studied in our experiments is far beyond current computational or modeling technology. However, the models in this project have limited and tractable goals: to identify gene expression patterns relevant to stress resistance and to determine how they relate to the larger networks of inference occurring at higher abstraction levels.
Our bioinformatics approach is characterized by an emphasis on data mining as well as information integration. Data mining is used to suggest high-level descriptors from expression data, and information integration is the task of summarizing and consolidating data from multiple methodologies (expression data, literature, and hypothesized networks).
Two main approaches will be utilized in data mining. Inductive logic programming (ILP) [42] provides a structured approach to finding rules that associate the level of gene expression with experimental conditions (such as levels of stress). ILP uses the language of first-order predicate logic to encode experimental conditions, gene clusters, and other properties useful for forming high-level representations. For example, activation(expt1,gene-cluster1,0.5) asserts that genes in gene-cluster1 are moderately activated under the conditions of experiment expt1.
The rules produced by ILP specify the interactions between the various predicates and their parts. This also enables domain-specific background knowledge (such as the fact that the activation of a certain group of genes is known to be inversely correlated with the expression of a different group) to be incorporated into the data mining process [6]. In addition, the induced concept descriptions are easily comprehensible; the example rule:
activation(E2,G,-1) :-
activation(E1,G,-0.5), stresslevel(E1,S1), stresslevel(E2,S2), S2>S1+2.
expresses the mined pattern that genes in a cluster (G) go from ‘moderately repressed’ (in experiment E1) to ‘heavily repressed’ (in experiment E2) when the stress level increases (from S1 to S2) by more than two units (that is, by more than two orders of magnitude when stress levels are recorded on a logarithmic scale). Since the rules are Horn clauses (which contain at most one predicate in their consequent), they allow us to perform ‘What if’ analyses of various scenarios, such as reasoning about the effect of imposition of stress on particular pathways. To scale up ILP to the complexity of the experiments conducted in this proposal, we propose to incorporate both syntactic and semantic restrictions on the induction of rules [58]. In addition, the software architecture of our system will be augmented with a natural database query interface (using SQL on a database server based on Postgres or Oracle), so that we can utilize this aspect to provide meta-level patterns for rule generation (e.g., ‘find me a rule that connects predicates about the isoflavone reductases to the application of mild stress.’).
The second approach consists of attribute-value based techniques for clustering levels of gene expression, such as self-organizing maps, agglomerative techniques, and statistical co-occurrence models. Co-occurrence models overcome the disadvantages of traditional clustering algorithms (that use a similarity metric) by modeling the categorical and/or ordinal nature of data without information loss due to discretization [22]. Such techniques can be utilized to fine-tune a model obtained from ILP or can utilize ILP to form higher-level abstractions from lower-level patterns (notice the cluster gene-cluster1 in the description above). The networks mined by such techniques can represent temporal and causal relationships in only a simple form. To capture richer relationships, we can use Bayesian networks [26], which propagate conditional probabilities through a graphical representation and thus model causal relationships in a direct way.
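The example rule for activation(E2,G,-1) and the ‘What if’ analyses it supports can be made concrete with a small sketch. The Python code below applies that single rule once to a hand-written fact base; it is an illustration under stated assumptions, not an ILP system and not part of Expresso. The function name apply_rule, the experiment names, and the stress levels are hypothetical.

```python
def apply_rule(activation, stresslevel):
    """Apply the example rule once over a small fact base.

    activation: set of (experiment, cluster, level) facts.
    stresslevel: dict mapping experiment -> stress level.
    Returns the set of derived activation(E2, G, -1) facts.
    """
    derived = set()
    for e1, cluster, level in activation:
        # The rule body needs activation(E1,G,-0.5) and stresslevel(E1,S1).
        if level != -0.5 or e1 not in stresslevel:
            continue
        s1 = stresslevel[e1]
        for e2, s2 in stresslevel.items():
            if s2 > s1 + 2:  # the constraint S2 > S1 + 2
                derived.add((e2, cluster, -1))
    return derived


# Hypothetical facts, for illustration only.
activation = {("expt1", "gene-cluster1", -0.5)}
stresslevel = {"expt1": 1.0, "expt2": 2.0, "expt3": 4.0}

observed = apply_rule(activation, stresslevel)

# 'What if' scenario: a planned experiment expt4 at stress level 3.5.
what_if = apply_rule(activation, {**stresslevel, "expt4": 3.5})
```

Here the rule predicts heavy repression of gene-cluster1 in expt3 but not in expt2, and the ‘What if’ call extends that prediction to the planned experiment without any new measurement.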
Bayesian networks can also handle noise, hypothesize missing variables, and encode expert knowledge in a limited form. Learning Bayesian networks is NP-hard, but various approximation algorithms, such as the EM (expectation-maximization) approach, lend credibility to their use as an important tool in bioinformatics.
A final focus of modeling in this project involves information from databases and the literature on known stress response mechanisms, including pathways and metabolic processes. Combining the results from such resources with patterns from data mining will result in a system-wide model. The end result will be a graphical network with visual interpretations to aid in human understanding of the concepts, which, in turn, will identify promising areas of future data-driven exploration. This will aid in generalizing across the stress studies considered and in identifying commonalities and distinctions among responding pathways for various levels of stress.
Such an ambitious goal poses interesting challenges from the information integration point of view. The diversity of information resources is typically harnessed by remapping queries to originating sources, introducing a transparency layer of middleware between data sources, employing federated database schemas, or using other mediator-based schemes. Such simple approaches will be infeasible for this project since some of the primary sources of information involve results from data mining. Our overarching approach is akin to what has been referred to as a ‘truth maintenance system’ [51]. A database capturing the current state of knowledge is created (e.g., a relational schema corresponding to the graphical network) and the entries of this database (in this case, the edges) are annotated with explanations or justifications for their presence. For instance, we might annotate an entry with the justification that it was mined from ‘last year’s experiments.’ Or perhaps, the justification could be ‘claimed in reference No. 356.’ The goal of the end-system is to flexibly allow the retraction and addition of explanations and justifications. After every modification to the database, any conclusions originally made are revisited to see if they are still valid. Once again, maintaining consistency of such a network is NP-hard, but such an approach systematizes the process of representing and integrating biological knowledge. We will experiment with specific computational formulations that are tractable as well as interesting from the biological point of view. For instance, many constraints and assumptions can be encoded as prior background knowledge for ILP. For certain restrictions on the form of the background knowledge, ILP algorithms can be made extremely efficient (by propositionalizing the representations; see [36]). In addition, we will explore interfaces to the system that allow the biologist to flexibly query for subgraphs of the original network. For instance, the biologist could enquire ‘taking only the 2000 data into account, can I still make this conclusion?’ Such a query involves applying a restriction operator on the original network, propagating constraints, and seeing if the node in question is indeed influenced.
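The justification-annotated network and the restriction query described above can be sketched as follows. This is a deliberately simplified Python illustration, not the proposed system: it treats a conclusion as plain reachability between two nodes rather than full constraint propagation, and it makes no attempt at the general (NP-hard) consistency problem. The class name JustifiedNetwork, its methods, and the node names are hypothetical; the justification labels echo the examples in the text.

```python
from collections import defaultdict


class JustifiedNetwork:
    """Directed edges annotated with the justifications for their presence."""

    def __init__(self):
        # (source, target) -> set of justification labels
        self.justifications = defaultdict(set)

    def add(self, source, target, justification):
        self.justifications[(source, target)].add(justification)

    def retract(self, justification):
        """Withdraw a justification; edges left with none are removed."""
        for edge in list(self.justifications):
            self.justifications[edge].discard(justification)
            if not self.justifications[edge]:
                del self.justifications[edge]

    def restrict(self, allowed):
        """Restriction operator: keep only edges supported by `allowed`."""
        restricted = JustifiedNetwork()
        for (source, target), labels in self.justifications.items():
            for label in labels & set(allowed):
                restricted.add(source, target, label)
        return restricted

    def influences(self, source, target):
        """True if a path of justified edges leads from source to target."""
        successors = defaultdict(set)
        for s, t in self.justifications:
            successors[s].add(t)
        seen, stack = {source}, [source]
        while stack:
            node = stack.pop()
            if node == target:
                return True
            for nxt in successors[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        return False


# Hypothetical network, for illustration only.
net = JustifiedNetwork()
net.add("drought", "ABA", "2000 data")
net.add("ABA", "gene-cluster1", "2001 data")
net.add("ABA", "gene-cluster1", "reference No. 356")
net.add("drought", "gene-cluster2", "2001 data")

full_answer = net.influences("drought", "gene-cluster1")
only_2000 = net.restrict({"2000 data"}).influences("drought", "gene-cluster1")

net.retract("2001 data")
after_retraction_1 = net.influences("drought", "gene-cluster1")
after_retraction_2 = net.influences("drought", "gene-cluster2")
```

In this toy network the conclusion about gene-cluster1 does not hold on the 2000 data alone, yet it survives retraction of the 2001 data because a second justification from the literature remains; the conclusion about gene-cluster2 does not survive.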
7.4
Experiments and the Hypotheses Addressed
The proposed project concerns only gene expression and physiological events. Investigations at the metabolite and/or proteome level are not within its scope.
Hypothesis I. A hierarchy of downstream drought stress responses, related to the degree of stress imposed, occurs in Arabidopsis. Experimental conditions for drought stress imposition. The experimental design of Sang et al. [52] for drought stress will be followed, with some modifications and the addition of a recovery phase (see below for details). Drought stress and recovery from the stress will be monitored as water potential. Choice of conditions for RNA isolation for hybridization to Arabidopsis Gene Chips. RNA will be isolated from leaf samples at three time points corresponding to defined degrees of drought stress and recovery, and from leaves of control, unstressed plants at the same time points (see below). Microarrays: Arabidopsis Gene Chips will be utilized (see below). 6 slides will be used to test Hypothesis I.
Hypothesis II. Particular downstream events are associated with signaling pathways, such as those associated with phospholipase Dα (PLDα). Distinguishing genes that are associated with PLDα signaling. An antisense PLDα line will be made available to us by Dr. X. Wang (see attached message). PLDα has been implicated in ABA signaling in response to drought stress as well as in MAPK-associated pathways. The antisense PLDα plants should, therefore, show decreased expression of those categories of genes that are under ABA control and of any other genes whose expression is under ABA-independent, PLDα control. Gene expression events in the antisense plants following the imposition of drought stress will be compared with those in wild type plants. A comparison of the effects on gene expression events when ABA (10 micromolar) is sprayed on leaves of wild type and PLDα-depleted plants once a day during the drought period will also be carried out.
12 slides will be used to test Hypothesis II.
Hypothesis III. ROS and drought stress pathways interact, with the consequence that larger increases in ROS in the vtc1-1 mutant lead to increases in the response of drought acclimation pathways. Gene expression patterns associated with recovery from drought stress imposition in mutant and wild type plants will be compared (6 slides) by the methods described below. Although the mutant is known to be more drought sensitive, the expectation is that, at lower levels of stress, increases in any ROS-sensitive drought responsive pathways, compared to the wild type, will occur.
7.4.1
Materials and Methodologies
Physiology of Drought Stress Responses (Alscher and Chevone). Wild type and mutant Arabidopsis plants, which have been grown in a greenhouse for six weeks and watered regularly, will be used for the drought stress experiments. Plants will be subjected to drought stress by withholding water. The soil surface of each pot will be covered with plastic wrap to minimize evaporation. Water potential of leaves taken from 6 different plants will be determined at 6-hour intervals during daylight hours as the water potential falls. Water potential will be monitored with a Wescor HR-33T Dew Point Microvoltmeter. Immediately after each water potential measurement, 6 drought-stressed plants will be watered, and their water potential will be followed as recovery occurs. Water potential measurements will continue until values reach a plateau close to pre-stress values. Plants that have not recovered after 48 hours will be classified as “unrecovered.” The kinetics of recovery, and the conditions leading to non-recovery, will be determined for mutant and wild type plants. Once these values are known, time points corresponding to the maximum degree of water stress and to early and late stages of recovery will be chosen for RNA isolations for each experimental condition.
RNA Isolation (Alscher). RNA will be isolated by the method of Graham et al. [23].
Microarrays (Alscher). Either the 8200 gene version or the complete Arabidopsis transcriptome chip (Affymetrix) will be used, depending upon availability. Hybridizations will be carried out at the Virginia Bioinformatics Institute (VBI, see attached letter from Dr. R. Kruzelok, Core Facility, VBI). 24 slides will be used in total, with each comparison replicated once, i.e., there will be two separate hybridizations for each comparison.
Data Analysis (Heath, Ramakrishnan, and Watson). All data will be analyzed using the methods incorporated in Expresso, as described elsewhere in this proposal.
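The recovery criterion in the physiology protocol above (return to near pre-stress water potential within 48 hours of rewatering) could be recorded in an Expresso database with a rule along the following lines. This Python sketch is illustrative only: the function name, the units (MPa), and in particular the tolerance that defines "close to pre-stress values" are assumptions, since the protocol does not fix a numerical threshold, and the sketch does not check that a plateau has been reached.

```python
def classify_recovery(readings, pre_stress, tolerance=0.1, limit_hours=48):
    """Classify one plant from its post-rewatering water potential readings.

    readings: list of (hours_after_rewatering, water_potential) pairs.
    pre_stress: water potential measured before stress was imposed.
    tolerance: assumed allowable gap from the pre-stress value (hypothetical).
    Returns ("recovered", hours) for the first qualifying reading,
    or ("unrecovered", None) if none occurs within limit_hours.
    """
    for hours, potential in sorted(readings):
        if hours > limit_hours:
            break  # readings after the 48-hour limit do not count
        if abs(potential - pre_stress) <= tolerance:
            return ("recovered", hours)
    return ("unrecovered", None)


# Illustrative readings only; these are not measured values.
plant_a = classify_recovery(
    [(0, -2.0), (6, -1.4), (12, -0.9), (18, -0.55)], pre_stress=-0.5
)
plant_b = classify_recovery(
    [(0, -2.5), (24, -2.0), (48, -1.5), (54, -0.5)], pre_stress=-0.5
)
```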
8
Management Plan
The project will be conducted under the overall direction of Heath. Two bioinformatics graduate students will work full time with Heath, Ramakrishnan, and Watson on the computational aspects of this project. They will be members of the Expresso development team and will be responsible for the application of Expresso in all phases of the physiological experiments and microarray hybridizations in this project. Duties will include incorporating the physiological and hybridization data in an Expresso database, applying existing statistical and data mining components to analyze the experimental results, and closing the experimental loop. In addition, they will work with Heath and Ramakrishnan on the library of multimodal networks. Alscher will direct a plant biology graduate student in all laboratory manipulations related to microarrays. Physiological measurements will be made by the plant biology student under the direction of Alscher and Chevone.
9
Future Impacts
9.1
Scientific impact
The experimental and computational methods used in this research will make possible further microarray experimentation to address large numbers of complex, interrelated hypotheses as is proposed here. The microarray analysis techniques and the modeling capabilities that will be incorporated in the Expresso system will allow more rigorous analysis of complex experimental datasets, as well as the development of more sophisticated models that match the experimental results. The multimodal networks will have some predictive power that will enable biologists to explore hypotheses in silico, to obtain estimates of conditional probabilities associated with given drought stress conditions, to decide on future experiments based on the estimated likelihood of a large yield of information from the experiments, and to serve as the basis for ODE models that have more quantitative predictive power. The way will be paved for informed investigations into the behavior of the proteome under drought stress, and for an expanded role for Expresso in analyzing the relationship of transcriptome to proteome responses. The multimodal networks mined and catalogued by Expresso will be made available on a website; plant biologists will be able to navigate and query these networks for supporting data and references.
9.2
Recruitment of women into computer science
The dearth of women faculty in computer science departments is widely recognized as an inhibiting factor in attracting young women into careers in computer science. Research in bioinformatics provides a bridge between the biological sciences, where women have a larger representation, and computer science, where their representation is lower. According to statistics at http://www.awis.org/statistics/statistics.html, 40% of doctorates in biology and agriculture were awarded to women in 1997, and women made up 35.5% of assistant professors and 14.4% of full professors at 4-year colleges and universities. The corresponding numbers for CS in 1999 were 15%, 16.4%, and 7.6% [7, 35]. Beginning in Fall 2002, at Virginia Tech, we are offering a bioinformatics option for our graduate degrees in computer science, statistics, and the life sciences. We see this educational option, which necessarily involves students in interdisciplinary research, as a natural path for women and minorities to initiate careers in the computational sciences. Alscher has been involved in efforts to encourage women students to pursue careers in science for many years and has acted as a role model (see Biographical Sketch). Ramakrishnan is the recipient of a 2000 NSF CAREER grant; the educational component of this proposal leverages the NSF-sponsored ‘Learning in Networked Communities’ (LiNC) Blacksburg virtual school project to encourage women and minority high school students to pursue math and science careers. The immediate focus of this project is on conducting a series of workshops in local high schools that explain the role of computers as a means of studying physical phenomena.
Recruiting minorities into bioinformatics. Dr. Larry D. Moore is the Co-Director of the Office for Minority Academic Opportunities at Virginia Tech and a colleague in Alscher’s department. See attached message from Dr. Moore for a description of his program. Alscher has a strong record of encouraging minority students at VT.
Groups of minority students recruited by Dr. Moore from Historically Black Colleges and Universities (HBCU) have spent the summer in research labs on the VT campus for over a decade. Alscher has hosted several of these students in her laboratory over the years. Efforts will be made to recruit potential graduate students into bioinformatics from among these groups.
References
[1] R. G. Alscher, B. I. Chevone, L. S. Heath, and N. Ramakrishnan. Expresso: A problem solving environment for bioinformatics: Finding answers with microarray technology. In Proceedings of the High Performance Computing Symposium, Advanced Simulation Technologies Conference, pages 64–69, 2001.
[2] E. S. Arner and A. Holmgren. Physiological functions of thioredoxin and thioredoxin reductase. Eur J Biochem, 267(20):6102–9, 2000.
[3] P. Baldi and A. D. Long. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics, 17(6):509–19, 2001.
[4] T. Beissbarth, K. Fellenberg, B. Brors, R. Arribas-Prat, J. Boer, N. C. Hauser, M. Scheideler, J. D. Hoheisel, G. Schutz, A. Poustka, and M. Vingron. Processing and quality control of DNA array hybridization data. Bioinformatics, 16(11):1014–22, 2000.
[5] Ronald V. Book and Friedrich Otto. String-rewriting systems. Springer-Verlag, New York, 1993.
[6] I. Bratko and S. Muggleton. Applications of Inductive Logic Programming. Communications of the ACM, 38(11):65–70, November 1995.
[7] R. E. Bryant and M. J. Irwin. Current and future Ph.D. output will not satisfy demand for faculty. Computing Research News, 13(2):5–11, 2001.
[8] E. J. Calabrese, L. A. Baldwin, and C. D. Holland. Hormesis: a highly generalizable and reproducible phenomenon with important implications for risk assessment. Risk Anal, 19(2):261–81, 1999.
[9] S. Chang, J. Puryear, and J. Cairney. A simple and efficient method for isolating RNA from pine trees. Plant Molec. Biol. Reporter, 11:113–116, 1993.
[10] Y. Chen, E. R. Dougherty, and M. L. Bittner. Ratio-based decisions and the quantitative analysis of cDNA microarray images. Journal of Biomedical Optics, 2(4):364–374, 1997.
[11] Charles J. Colbourn. The Combinatorics of Network Reliability. Oxford University Press, New York, NY, 1987.
[12] P. L. Conklin, S. N. Norris, G. L. Wheeler, N. Smirnoff, and E. H. Williams. Genetic evidence for the role of GDP-mannose in plant ascorbic acid (Vitamin C) biosynthesis. Proc. Natl. Acad. Sci., 96:4198–4203, 1999.
[13] P. L. Conklin, J. E. Pallanca, R. L. Last, and N. Smirnoff. L-ascorbic acid metabolism in the ascorbate deficient mutant vtc1. Plant Physiology, 115:1277–1285, 1997.
[14] J. C. Cushman and H. J. Bohnert. Genomic approaches to plant stress tolerance. Curr Opin Plant Biol, 3(2):117–24, 2000.
[15] B. Degenhardt and H. Gimmler. Cell wall adaptations to multiple environmental stresses in maize roots. J Exp Bot, 51(344):595–603, 2000.
[16] R. Desikan, S. A-H-Mackerness, J. T. Hancock, and S. J. Neill. Regulation of the Arabidopsis transcriptome by oxidative stress. Plant Physiol, 127(1):159–72, 2001.
[17] J. Z. Dong and D. I. Dunstan. Characterization of three heat-shock-protein genes and their developmental regulation during somatic embryogenesis in white spruce [Picea glauca (Moench) Voss]. Planta, 200(1):85–91, 1996.
[18] Joseph L. Doob. Stochastic Processes. Wiley, New York, NY, 1953.
[19] J. S. Edwards and B. O. Palsson. The Escherichia coli MG1655 in silico metabolic genotype: Its definition, characteristics, and capabilities. Proceedings of the National Academy of Sciences of the United States of America, 97(10):5528–5533, May 9 2000.
[20] Jeremy S. Edwards and Bernhard O. Palsson. Metabolic flux balance analysis and the in silico analysis of Escherichia coli K-12 gene deletions. BMC Bioinformatics, 1(1):1–10, July 27 2000.
[21] N. Friedman, M. Linial, I. Nachman, and D. Pe’er. Using Bayesian networks to analyze expression data. J Comput Biol, 7(3-4):601–20, 2000.
[22] D. Gibson, J. Kleinberg, and P. Raghavan. Clustering Categorical Data: An Approach Based on Dynamical Systems. VLDB Journal, 8(3–4):222–236, 2000.
[23] I. A. Graham, K. J. Denby, and C. J. Leaver. Carbon catabolite repression regulates glyoxylate cycle gene expression in cucumber. Plant Cell, 6:761–772, 1994.
[24] N. Gustavsson, U. Harndahl, A. Emanuelsson, P. Roepstorff, and C. Sundby. Methionine sulfoxidation of the chloroplast small heat shock protein and conformational changes in the oligomer. Protein Sci., 8(11):2506–2512, 1999.
[25] U. Harndahl, B. P. Kokke, N. Gustavsson, S. Linse, K. Berggren, F. Tjerneld, W. C. Boelens, and C. Sundby. The chaperone-like activity of a small heat shock protein is lost after sulfoxidation of conserved methionines in a surface-exposed amphipathic alpha-helix. Biochim. Biophys. Acta, 1545(1–2):227–237, 2001.
[26] D. Heckerman. Bayesian Networks for Knowledge Discovery. In U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 273–306. AAAI/MIT Press, 1996.
[27] S. W. Hong and E. Vierling. Mutants of Arabidopsis thaliana defective in the acquisition of tolerance to high temperature stress. Proc Natl Acad Sci U S A, 97(8):4392–7, 2000.
[28] John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Boston, MA, 2001. Second Edition.
[29] T. Ideker, V. Thorsson, A. F. Siegel, and L. E. Hood. Testing for differentially-expressed genes by maximum-likelihood analysis of microarray data. J Comput Biol, 7(6):805–17, 2000.
[30] M. K. Kerr and G. A. Churchill. Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments. Proc Natl Acad Sci U S A, 98(16):8961–5, 2001.
[31] M. K. Kerr and G. A. Churchill. Statistical design and the analysis of gene expression microarray data. Genet Res, 77(2):123–8, 2001.
[32] M. K. Kerr, M. Martin, and G. A. Churchill. Analysis of variance for gene expression microarray data. J Comput Biol, 7(6):819–37, 2000.
[33] S. Kim, E. R. Dougherty, M. L. Bittner, Y. Chen, K. Sivakumar, P. Meltzer, and J. M. Trent. General nonlinear framework for the analysis of gene interaction via multivariate expression arrays. J Biomed Opt, 5(4):411–24, 2000.
[34] Y. Kovtun, W. L. Chiu, G. Tena, and J. Sheen. Functional analysis of oxidative stress-activated mitogen-activated protein kinase cascade in plants. Proc Natl Acad Sci U S A, 97(6):2940–5, 2000.
[35] Dexter Kozen and Jim Morris. 1997-1998 CRA Taulbee Survey. Computing Research News, 11(2):4–9, 1999.
[36] N. Lavrac and P. Flach. An Extended Transformation Approach to Inductive Logic Programming. ACM Transactions on Computational Logic, 2(4):458–494, October 2001.
[37] M. L. Lee, F. C. Kuo, G. A. Whitmore, and J. Sklar. Importance of replication in microarray gene expression studies: statistical methods and evidence from repetitive cDNA hybridizations. Proc Natl Acad Sci U S A, 97(18):9834–9, 2000.
[38] Harry R. Lewis and Christos H. Papadimitriou. Elements of the Theory of Computation. Prentice Hall, Upper Saddle River, NJ, 1998. Second Edition.
[39] A. D. Long, H. J. Mangalam, B. Y. Chan, L. Tolleri, G. W. Hatfield, and P. Baldi. Improved statistical inference from DNA microarray data using analysis of variance and a Bayesian statistical framework. Analysis of global gene expression in Escherichia coli K12. J Biol Chem, 276(23):19937–44, 2001.
[40] M. J. May, C. Vernoux, M. Leaver, M. Van Montagu, and D. Inze. Glutathione homeostasis in plants: implications for environmental sensing and plant development. J. Exp. Bot, 49:649–667, 1998.
[41] S. Muggleton. Scientific knowledge discovery using inductive logic programming. Communications of the Association for Computing Machinery, 42(11):42–64, 1999.
[42] S. Muggleton. Scientific Knowledge Discovery using Inductive Logic Programming. Communications of the ACM, 41(11):56–62, November 1999.
[43] Teun Munnik and Harold J. G. Meijer. Osmotic stress activates distinct lipid and MAPK signalling pathways in plants. FEBS Letters, 498:172–178, 2001.
[44] M. A. Newton, C. M. Kendziorski, C. S. Richmond, F. R. Blattner, and K. W. Tsui. On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. J Comput Biol, 8(1):37–52, 2001.
[45] A. Oberschall, M. Deak, K. Torok, L. Sass, I. Vass, I. Kovacs, A. Feher, D. Dudits, and G. V. Horvath. A novel aldose/aldehyde reductase protects transgenic plants against lipid peroxidation under chemical and drought stresses. Plant J, 24(4):437–46, 2000.
[46] G. M. Pastori and C. H. Foyer. Identifying oxidative stress responsive genes by transposon tagging. Journal of Experimental Botany, in press, 2002.
[47] Andrea Polle. Dissecting the superoxide dismutase-ascorbate-glutathione-pathway in chloroplasts by metabolic modeling. Computer simulations as a step towards flux analysis. Plant Physiology, 126:445–462, 2001.
[48] N. Ramakrishnan and A. Y. Grama. Mining Scientific Data. Advances in Computers, 55:119–169, 2001.
[49] S. Karpinski, H. Reynolds, B. Karpinska, G. Wingsle, G. Creissen, and P. Mullineaux. Systemic signaling and acclimation in response to excess excitation energy in Arabidopsis. Science, 284(5414):654–7, 1999.
[50] D. V. Rial, A. K. Arakaki, and E. A. Ceccarelli. Interaction of the targeting sequence of chloroplast precursors with HSP70 molecular chaperones. Eur J Biochem, 267(20):6239–48, 2000.
[51] S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, Upper Saddle River, NJ, 1995.
[52] Y. Sang, S. Zheng, W. Li, B. Huang, and X. Wang. Regulation of plant water loss by manipulating the expression of phospholipase Dα. Plant J., 28(2):135–44, 2001.
[53] M. Scheideler, N. L. Schlaich, K. Fellenberg, T. Beissbarth, N. C. Hauser, M. Vingron, A. J. Slusarenko, and J. D. Hoheisel. Monitoring the switch from housekeeping to pathogen defence metabolism in Arabidopsis thaliana using cDNA arrays. Journal of Biological Chemistry, in press, electronic publication ahead of print, 2001.
[54] T. Schnaider, J. Oikarinen, H. Ishiwatari-Hayasaka, I. Yahara, and P. Csermely. Interactions of HSP90 with histones and related peptides. Life Sci, 65(22):2417–26, 1999.
[55] K. Shinozaki and K. Yamaguchi-Shinozaki. Gene expression and signal transduction in water-stress response. Plant Physiol, 115:327–334, 1997.
[56] Kazuo Shinozaki and Kazuko Yamaguchi-Shinozaki. Molecular responses to dehydration and low temperature: Differences and cross-talk between two stress signaling pathways. Current Opinion in Plant Biology, 3:217–223, 2000.
[57] Michael Sipser. Introduction to the Theory of Computation. PWS Publishing Company, Boston, MA, 1997.
[58] A. Srinivasan and R. King. Feature Construction with Inductive Logic Programming: A Study of Quantitative Predictions of Biological Activity Aided by Structural Attributes. Data Mining and Knowledge Discovery, 3:37–57, 1999.
[59] N. Strizhov, Abraham, L. Okresz, S. Blickling, A. Zilberstein, J. Schell, C. Koncz, and L. Szabados. Differential expression of two P5CS genes controlling proline accumulation during salt-stress requires ABA and is regulated by ABA1, ABI1 and AXR2 in Arabidopsis. Plant J, 12(3):557–69, 1997.
[60] J. G. Thomas, J. M. Olson, S. J. Tapscott, and L. P. Zhao. An efficient and robust statistical modeling approach to discover differentially expressed genes using genomic expression profiles. Genome Res, 11(7):1227–36, 2001.
[61] T. Urao, B. Yakubov, R. Satoh, K. Yamaguchi-Shinozaki, M. Seki, T. Hirayama, and K. Shinozaki. A transmembrane hybrid-type histidine kinase in Arabidopsis functions as an osmosensor. Plant Cell, 11(9):1743–54, 1999.
[62] Jan van Leeuwen, editor. Handbook of Theoretical Computer Science. Vol. B. Elsevier Science Publishers B.V., Amsterdam, 1990.
[63] S. D. Veljovic-Jovanovic, C. Pignocchi, G. Noctor, and C. H. Foyer. Low ascorbic acid in the vtc-1 mutant of Arabidopsis is associated with decreased growth and intracellular redistribution of the antioxidant system. Plant Physiology, 127:426–435, 2001.
[64] N. Wehmeyer and E. Vierling. The expression of small heat shock proteins in seeds responds to discrete developmental signals and suggests a general protective role in desiccation tolerance. Plant Physiol., 122(4):1099–1108, 2000.
[65] G. L. Wheeler, M. A. Jones, and N. Smirnoff. The biosynthetic pathway of Vitamin C in higher plants. Nature, 393:365–369, 1998.
[66] R. D. Wolfinger, G. Gibson, E. D. Wolfinger, L. Bennett, H. Hamadeh, P. Bushel, C. Afshari, and R. S. Paules. Assessing gene significance from cDNA microarray expression data via mixed models. J Comput Biol, 8(6):625–37, 2001.
[67] C. Xiang and D. J. Oliver. Glutathione metabolic genes coordinately respond to heavy metals and jasmonic acid in Arabidopsis. Plant Cell, 10(9):1539–50, 1998.
[68] C. Xiang, B. L. Werner, E. M. Christensen, and D. J. Oliver. The biological functions of glutathione revisited in Arabidopsis transgenic plants with altered glutathione levels. Plant Physiol, 126(2):564–74, 2001.
[69] H. Zegzouti, B. Jones, C. Marty, J. M. Lelievre, A. Latche, J. C. Pech, and M. Bouzayen. ER5, a tomato cDNA encoding an ethylene-responsive LEA-like protein: characterization and expression in response to drought, ABA and wounding. Plant Mol Biol, 35(6):847–54, 1997.