Artificial Intelligence Review 20: 75–93, 2003. © 2003 Kluwer Academic Publishers. Printed in the Netherlands.


Large-Scale Computational Modeling of Genetic Regulatory Networks

M. STETTER1∗, G. DECO2 & M. DEJORI1,3

1 Siemens AG, Corporate Technology, Information and Communications, CT IC 4, 81730 Munich, Germany; 2 University Pompeu Fabra, Technology Department, 08003 Barcelona, Spain; 3 Department of Computer Science, Technical University of Munich, 85747 Garching, Germany (∗ author for correspondence, e-mail: [email protected])

Abstract. Perhaps the most important signaling network in living cells is constituted by the interactions of proteins with the genome: the regulatory genetic network of the cell. From a system-level point of view, the various interactions and control loops that form a genetic network represent the basis upon which the vast complexity and flexibility of life processes emerge. Here we review some efforts towards gaining a quantitative understanding of regulatory genetic networks by means of large-scale computational models. After a brief description of the biological principles of gene regulation, we summarize recent advances in massive gene-expression measurements by DNA microarrays, which form the most powerful data basis to date for models of genetic networks. One class of models, such as reaction-diffusion networks and nonlinear dynamical descriptions, is biased towards using explicit molecular biological knowledge. A second class, centered around machine learning approaches like neural networks and Bayesian networks, adopts a more data-driven approach and thereby makes massive use of the novel gene-expression measurement techniques.

Keywords: Bayesian networks, DNA-microarrays, genetic pathways, systems biology

1. Introduction

During their life span, living cells accomplish a vast multitude of different tasks, all of which are directly or indirectly controlled by their genome, encoded in the deoxyribonucleic acid (DNA) molecule. For example, cells must maintain their organization, ensure their nutrition, and synthesize and exchange the biomolecules they consist of by metabolic processes. In response to external chemical signals, they may grow and undergo cell division, differentiate into specialized cell types, start or stop secreting one or several substances, start proliferating, leave their place within the cellular matrix, or initiate their own death (Lodish et al., 2000). But cells can also vary their behavior without any direct external control: for example, specialized neurons underneath the optic nerve possess a machinery by which they produce oscillating concentrations of several proteins and thereby participate in encoding our wake-sleep or circadian rhythm (Hedges


and Kumar, 2003). According to one popular view, all these cellular life processes are coordinated and guided by the states of the underlying network of mutual interactions between genes, the genetic network. When cells are assembled to form a higher organism, these and other cellular processes are carried out in a concerted fashion, subtly coordinated by a mutual exchange of chemical signals. The resulting collective activities form the basis of the growth and shaping of organs and body parts by self-organizational processes. They also guide the maintenance of the structure of the organism: for example, the vast majority of the molecules of a human body are exchanged for other ones during a period of only two years, yet our body more or less retains its shape. In other words, our structure is maintained independently of the identity of the matter we are made of. Furthermore, the cellular machinery is involved in processes of adaptation to the environment, such as learning or muscular training, and guides repair mechanisms such as wound healing and DNA repair (Bohr, 2002) at the macroscopic and microscopic levels, respectively. In light of the central role of the cellular genetic network for the life of higher organisms, it is not too surprising that many severe diseases, including different kinds of cancer and Alzheimer’s disease, show a clear relationship to genetic disorders (Bohr, 2002). Further, fundamental restrictions on wound healing and repair exist for higher organisms (amputated limbs and removed organs do not regrow in humans), and these seem to have their roots in the inability of differentiated cells to re-use the embryonic genetic machinery of morphogenesis. Finally, changes in the cellular machinery related to impaired repair mechanisms also seem to be related to aging and death (Strehler, 1995; Vijg and Dolle, 2002).
These and other insights provide an enormous drive towards understanding both the principles and the details of the machinery that underlies the operation of living cells. There is a huge diversity of molecular processes, which have to be characterized and stored in databases and knowledge bases. Molecular biologists have been working for many years on the challenging task of collecting these data about genes (e.g., in the Human Genome Project (Human Genome Issue, 2001; http://www.genome.gov, 2003)), gene products, metabolites and other molecules. Biological diversity is enormous, as is the complexity of the data and knowledge to be stored, and existing databases and knowledge bases are on the verge of turning into an inextricable jungle. Therefore, collecting the data is not enough: we also need to reveal unifying principles of operation of life processes. Many researchers advocate the view that networking is one such principle. All processes in a living cell are directly or indirectly related to and guided by complex, recurrent and mutually interacting signaling chains. Proteins are produced from


genes, interact with each other and with smaller molecules, but also act back onto the DNA, where they regulate the production of other proteins. It is no longer sufficient to regard individual components of the cellular machinery in isolation; instead, its full understanding is inextricably coupled to understanding the concerted action of all molecules in the cell at a systems level. In light of this view, systems-level modeling of regulatory networks by methods of artificial intelligence represents a powerful tool on our way towards understanding life processes. This article provides an overview of some approaches towards systems-level computational modeling of cellular molecular networks. We focus on genetic networks, which are formed by the set of (regulatory) gene-protein interactions plus some protein-protein interactions within the cell. In light of the vast amount of work available in this field, we do not attempt to be comprehensive in any of the issues addressed. Instead, we aim to provide a coarse sketch of the fast-evolving field of systems biology of genetic networks. In Section 2, we briefly describe some key features of genetic networks and how snapshots of their instantaneous state can be measured by DNA microarray technology. Section 3 then summarizes some computational approaches to genetic network modeling, with a focus on neural networks and structural learning in Bayesian networks.

2. Genetic Networks and Gene Expression Profiling

In this section we briefly describe mechanisms of gene expression and gene regulation. In favor of the big picture we omit many details and focus on eukaryotic cells. For further in-depth reading see Lodish et al. (2000) and Brown (1999). Genetic information in living cells is encoded in the sequence of the nucleotide bases adenine (A), thymine (T), guanine (G) and cytosine (C), which form the double-stranded DNA. Human DNA consists of about 3 billion base pairs, the complete sequence of which has very recently been published (http://www.genome.gov, 2003). Genetic information is stored as a set of genes. Loosely speaking, a gene can be viewed as a continuous segment of varying length on one of the two DNA strands, the “sense-making” strand. Each gene encodes the amino acid sequence of one protein or a set of proteins. At each cell division, the DNA is replicated with very high accuracy, and consequently almost every cell of an organism contains a virtually identical DNA copy (Figure 1). Human DNA is estimated to carry roughly 30,000 genes, which together cover only 2–3% of the DNA double strand.


Figure 1. The genome and the proteome of a living cell. Transcription and translation are used to produce proteins from genes on the DNA. Proteins in turn regulate gene expression levels. For details see text.

When a gene is used, the cellular machinery transcribes the specific sequence of this gene into a single-stranded molecule of ribonucleic acid (RNA), the so-called messenger RNA (mRNA) of this gene. RNA is very similar to single-stranded DNA, except that thymine is replaced by uracil (U). In a second step, the mRNA is translated into a sequence of amino acids, which are chained together by ribosomes to form a certain protein. When proteins are produced from a gene, this gene is said to be expressed. It is important to note that the more strongly a gene is expressed, the more mRNA of this gene is generally present in the cytosol. There are also non-coding types of RNA, such as ribosomal RNA (rRNA) and transfer RNA (tRNA), which are not considered further here but can form a considerable percentage of the total RNA in the cell. The human genome is believed to encode about 1 million different proteins. Most interestingly, each cell of an organism contains only a subset of these proteins at any time. This subset is called its proteome. In other words, whereas the genome is (almost) identical for each cell, the proteome varies strongly across cell types, and also depends on the state of the cell (e.g., the phase of the cell-division cycle) and on the external signals imposed on the cell. This implies that cells have control over the use of their genome, i.e., they can change the proteome in a very flexible way. In fact, many proteins bind and act back on the DNA, either directly or indirectly, where they affect the expression level of genes. This mechanism is called gene regulation. Figure 2a illustrates some mechanisms of gene expression and gene regulation.
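The transcription step described above can be illustrated as a simple string operation. The following is only a minimal sketch: the function names and the example sequence are hypothetical, and splicing, regulation and translation are deliberately ignored.

```python
# Illustrative sketch only: transcription treated as a string operation.
# Function names and the example sequence are hypothetical.

DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def template_strand(coding_strand: str) -> str:
    """Return the complementary (template) strand, read in reverse order."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(coding_strand))


def transcribe(coding_strand: str) -> str:
    """Return the mRNA sequence: the coding strand with T replaced by U."""
    return coding_strand.replace("T", "U")


coding = "ATGGCCTAA"
print(template_strand(coding))  # TTAGGCCAT
print(transcribe(coding))       # AUGGCCUAA
```

The mRNA carries the same sequence as the coding strand, with uracil in place of thymine, which is why a probe complementary to the mRNA can identify the gene it came from.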


Figure 2. From cellular regulatory mechanisms to abstract genetic networks. (a) Transcriptional and translational mechanisms and control points for regulation. RR = regulatory region; PTM = post-translational modification (of proteins). (b) More schematic view of the interaction pathways between genes. (c) Abstract genetic network. Shaded boxes mark the part of the network measured by DNA microarray experiments.

2.1. Gene regulation and regulatory networks

Regulation adds a feedback step to the feed-forward process of gene expression (Figure 1). Gene expression controls protein concentrations, and proteins in turn, either directly or indirectly, regulate gene expression levels. Hence, the genome and proteome form a recurrent (and nonlinear) regulatory network, the genetic network. Figure 2a summarizes different steps of gene expression and regulatory feedback from proteins. The sequence of a gene can be divided into exons, which explicitly encode the amino-acid sequence of the corresponding proteins, and introns, which do not code for protein sequence but contain otherwise important information such as splice sites (Figure 2a top). A gene is always transcribed in one direction (from the so-called 3’ direction). Upstream of the first exon with respect to the direction of transcription, it contains a regulatory region. When a gene is expressed, its whole sequence except the regulatory region is transcribed into mRNA. In a second step, the non-coding intron regions are cut out in a process called RNA splicing. RNA fragments can be spliced in various different ways, which is one way to produce a whole family of proteins from one gene. The spliced RNA is then translated into amino acid chains, the raw protein. These chains are


subject to post-translational modifications, i.e., they are chemically modified by enzymes (for example, equipped with a phosphate group), and finally fold into their characteristic three-dimensional shape, which is crucial for their function. Many proteins form various aggregates and are only operative in these complexes. Due to splicing, post-translational modifications and aggregation, each gene may produce a whole family of operative protein structures. Most of the proteins interact with each other or with the genome and thereby participate in molecular signaling or reaction networks of the cell. Figure 2a summarizes some important mechanisms by which proteins can regulate gene expression. The first and most effective regulatory mechanism is the binding of a protein to the regulatory region of a gene. Proteins of this kind are called transcription factors. By binding to the regulatory region, transcription factors affect the initiation of transcription of that gene. Usually up to a few tens of transcription factors can act on the same regulatory region of a gene (Latchman, 1998). This first mechanism constitutes a gene-protein interaction. Transcription factors can enable, disable, enhance or repress gene expression, and they can do so in a highly nonlinear collective way (Yuh et al., 2001a). In turn, any given transcription factor can act on a few thousand different genes, cf. Brown (1999). A second regulation mechanism is formed by proteins that are involved in RNA splicing. They can control which gene product is translated, or whether a gene product is translated at all. Finally, proteins modify other proteins or attach to them. These latter mechanisms represent protein-protein interactions. The whole genomic/proteomic machinery is also controlled by extracellular signals, which form the interface between a cell and its environment.
External signals can regulate gene expression either by directly acting as transcription factors, or by modifying transcription factors (Brown, 1999). Figure 2b illustrates at a more abstract level the pathways by which a gene A can regulate the expression level of another gene B at three different levels. Finally, Figure 2c shows a fully abstracted graphical representation of a genetic network. Each gene may regulate many (up to thousands of) other genes, and its own expression level may be regulated by up to a few tens of other genes. Some genes form a source of transcriptional control: they have a high fan-out, or divergence, of edges (cf. gene A). Other genes have no regulatory action at all but may be regulated by others; they are characterized by a high fan-in, or convergence, of edges (cf. gene C). Gene expression is also affected by extracellular signals and by the interaction of proteins with non-protein molecules such as metabolites.
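The abstract network view and the notions of fan-in and fan-out can be made concrete with a small directed graph. The sketch below uses a hypothetical toy edge list (it is not the network of Figure 2c) and a hypothetical helper name; each pair (regulator, target) stands for one regulatory edge.

```python
from collections import defaultdict

# Hypothetical toy network: each pair (regulator, target) is one directed edge.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "C")]


def degree_counts(edge_list):
    """Return {gene: (fan_in, fan_out)} for every gene appearing in the edges."""
    fan_in = defaultdict(int)
    fan_out = defaultdict(int)
    for regulator, target in edge_list:
        fan_out[regulator] += 1  # one more gene regulated by `regulator`
        fan_in[target] += 1      # one more regulator acting on `target`
    genes = sorted({gene for edge in edge_list for gene in edge})
    return {gene: (fan_in[gene], fan_out[gene]) for gene in genes}


counts = degree_counts(edges)
for gene, (n_in, n_out) in counts.items():
    print(gene, "fan-in:", n_in, "fan-out:", n_out)
```

In this toy example gene A is a pure source of control (fan-in 0, fan-out 3), whereas gene C is only regulated (fan-in 3, fan-out 0).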


2.2. Gene-expression measurements by DNA microarrays

Computational models of genetic networks depend on the availability of data that reflect the state of the system. These data would ideally comprise the expression rates of all genes plus the state of the proteome, i.e., the types, concentrations and states of all proteins in the cell. To date, delineating the proteome as a whole is still difficult, because we lack massively parallel techniques for protein characterization. Present techniques are centered around two-dimensional gel electrophoresis followed by mass spectrometry to determine the protein sequences (Brown, 1999; Yates, 1998), although protein chip technologies for large-scale proteome measurements are being put forward (Phizicky et al., 2003; Hanash, 2003). During the last decade, techniques for the large-scale measurement of gene expression levels (referred to as gene expression profiling), which are based on DNA microarrays (DNA chips), have been developed (Fodor et al., 1993; Schena et al., 1995); for reviews see Brown and Botstein (1999), Schena (2000), and Baldi and Hatfield (2002). DNA microarrays measure in parallel the cellular mRNA concentrations for hundreds up to many thousands of genes. The shaded boxes in Figure 2 mark the level at which DNA microarray measurements operate. They make use of the fact that the amount of mRNA present for a certain gene reflects the level of expression of this gene. It should be noted that DNA microarray measurements cannot capture the post-transcriptional stages downstream of mRNA processing. This includes many protein-protein interactions, which can also form part of the regulatory mechanism within the genetic network (dashed arrows in Figure 2b).
However, direct regulatory mechanisms by transcription factors, as well as indirect regulatory processes involving the modification of transcription factors by protein-protein interactions, will in general be reflected in the gene-expression levels measured by DNA microarrays. Microarray measurements make use of the following facts: (i) Complementary strands of RNA and DNA (a pair of sequences where each A pairs with T/U and each G with C on the two strands) bind to each other, a process called hybridization. (ii) Hybridization is highly selective for the sequence and is most stable when the two sequences are complementary to each other. For longer sequences of a few hundred base pairs, virtually only complementary strands bind at an adequate temperature. This effect can be used to very selectively filter out RNA strands that contain a certain sequence, by means of a complementary DNA (cDNA) probe. The strength of hybridization is temperature-dependent, which is exploited in many molecular biological techniques. (iii) The mRNA sequence for each gene is unique; in other words, for the mRNA of any given gene a sequence can be found that appears only in this mRNA. (iv) Due to whole genome projects, the sequences of


Figure 3. Schematic sketch of differential gene-expression profiling with DNA microarrays. Left: each spot of a microarray contains single-stranded nucleotide sequences as probes, which are complementary to a sequence of one gene. Center: differently labeled RNA molecules (light and dark grey) from two samples are brought in contact with the array, and are hybridized. Confocal fluorescence microscopy is used to optically determine the relative fraction of RNA from each cell and for each spot (gene). Right: a microarray usually contains many thousands of sample spots. The spot size n ranges from 25–500 µm, depending on the type of microarray.

whole genomes, but also the loci and sequences of an increasing number of genes for a collection of different organisms, are becoming available (e.g., the bacterium Escherichia coli (Blattner et al., 1997), baker’s yeast (Goffeau et al., 1996), and humans (http://www.genome.gov, 2003)). A gene-expression measurement with microarrays is based on the selective hybridization of dissolved mRNA with probes of complementary sequence that are fixed on a substrate (Figure 3). The surface of a DNA microarray is divided into many spots. At each spot, a number of nucleotide strands with a sequence complementary to the mRNA sequence of one gene are fixed. Hence, a DNA microarray with N spots can measure N mRNA concentrations at the same time. Microarrays use either short oligonucleotide probes with 15–25 base pairs (Fodor et al., 1993), as for example manufactured by the company Affymetrix, or longer strands of a few hundred bases. The latter microarrays are more selective and can be used for differential gene-expression measurements as described in Brown and Botstein (1999), whereas the former microarrays are mostly used to measure expression levels from a single cell. Figure 3 illustrates schematically the procedure of a differential gene-expression measurement. mRNA from a control cell and from the cell to be measured is extracted, purified, processed and labeled with two different


dyes (often Cy3 and Cy5). They are brought together with the DNA probe spots on the microarray, where they compete for hybridization. At the end of hybridization, the ratio of bound mRNA from both cells on each spot reflects the ratio of concentrations of this mRNA in both cells. Optical readout, usually with a confocal laser scanning microscope, yields an image with colored spot patterns, where the color reflects the expression level of each gene in the measured cell relative to the control cell. Image processing algorithms can semi-automatically or automatically convert the spot pattern into numerical gene-expression values (Hegde et al., 2000; Angulo and Serra, 2003), which after normalization procedures (Hegde et al., 2000; Bilban et al., 2002; Baldi and Hatfield, 2002) are stored in an N-dimensional gene-expression vector x. There are many sources of noise in microarray experiments, including biological noise (variation of cellular states in a homogeneous strain, variations caused by RNA extraction procedures), finite-sample effects such as fluctuations in the number of hybridizing molecules, and optical readout noise. Consequently, the gene-expression vector is of a probabilistic nature and is often modeled as a random vector.
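As an illustration of how spot intensities might be turned into a gene-expression vector x, the sketch below computes per-spot log2 ratios of the two dye channels and centres them on their median. The text does not prescribe a particular normalization; median centring is assumed here purely as one simple choice, and the function name and intensity values are hypothetical.

```python
import math
from statistics import median


def log_ratios(sample, control):
    """Per-spot log2(sample / control), shifted so the median log-ratio is zero.

    Median centring is an assumed, simple normalization; real pipelines
    typically apply more elaborate procedures.
    """
    raw = [math.log2(s / c) for s, c in zip(sample, control)]
    shift = median(raw)
    return [value - shift for value in raw]


# Hypothetical background-corrected intensities for four spots (genes).
cy5 = [200.0, 400.0, 100.0, 1600.0]  # measured cell
cy3 = [100.0, 200.0, 100.0, 200.0]   # control cell

x = log_ratios(cy5, cy3)
print(x)  # [0.0, 0.0, -1.0, 2.0]
```

Working with log ratios makes a two-fold increase and a two-fold decrease symmetric around zero, which is convenient for the clustering and modeling steps discussed later.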

3. Computational Modeling of Genetic Networks

Genetic regulatory networks form complex, recurrent and nonlinear systems, which evolve dynamically according to their mutual interactions and in response to external signals. Because the brains of higher organisms also represent large-scale nonlinear recurrent signal processing systems, a large body of theoretical work on such systems has been accumulated by theoretical brain research. In this related field, biological networks of nerve cells were strongly abstracted in the mid-eighties into various kinds of artificial neural networks (Hertz et al., 1991; Haykin, 1994). During the following decade, many analogies were found between the biologically inspired theory of neural networks and statistical learning theory (Bishop, 1995). This link opened the gate for the unification of machine learning techniques and neural networks, and at the same time drew the attention of theoreticians towards the applicability of machine learning methods to the large-scale analysis of biological systems. Hence in brain research, an apparent step backwards, namely to more abstract artificial neural networks, has brought qualitatively new insights into the systems biology of the brain. Based on these, as a second step, more biological knowledge and diversity is nowadays taken into account in the younger discipline of computational neuroscience (Rolls and Deco, 2002; Stetter, 2002) to provide more detailed brain models. At present, genome research seems to be on the verge of the first abstraction step. Although first attempts to describe genetic networks at an abstract level


date back more than thirty years (Kauffman, 1969, 1977), the data basis to constrain the models was too sparse at that time. However, as described in the previous section, increasing knowledge about gene sequences and a growing body of DNA microarray technologies and data sets call for adequate computational large-scale modeling methods (Berrar et al. (eds.), 2002).

3.1. Biochemically driven models

Biochemically inspired models of genetic networks are based on the reaction kinetics between the different components of the genetic regulatory network within the different compartments of a cell. These models can be associated with the level of detail of Figure 2a. Reaction kinetics provides a framework by which the chemical reactions between molecular compounds of the network can be described (Cornish-Bowden, 1995; Kato et al., 2000). For example, if a transcription factor T, produced from a gene j, is brought together with the DNA sequence it is selective for, say in the regulatory region of gene i, it might react with a rate k1 to form a compound TD with the DNA, but might also dissociate from it with a rate k2. If this process follows first-order reaction kinetics, its time evolution is given by

$$\frac{d}{dt}\,TD(t) = -k_2\,TD(t) + k_1\,T(t); \tag{1}$$

$$\frac{d}{dt}\,T(t) = -k_1\,T(t) + k_2\,TD(t) + dT_{\mathrm{ext}}(t). \tag{2}$$
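The two rate equations above can be integrated numerically to see how binding approaches equilibrium. The following is a minimal forward-Euler sketch under the simplifying assumption that there is no external supply (the dT_ext term is set to zero); the function name, rate constants and step size are hypothetical illustration values.

```python
def simulate_binding(k1, k2, t0, td0, dt=0.001, steps=20000):
    """Forward-Euler integration of eqs. (1)-(2), assuming no external supply.

    k1: binding rate, k2: dissociation rate,
    t0: initial free transcription factor, td0: initial bound complex.
    """
    t, td = t0, td0
    for _ in range(steps):
        flux = k1 * t - k2 * td  # net rate of complex formation, eq. (1)
        td += dt * flux
        t -= dt * flux           # eq. (2) without the external supply term
    return t, td


free, bound = simulate_binding(k1=2.0, k2=1.0, t0=1.0, td0=0.0)
print(round(free, 4), round(bound, 4))  # 0.3333 0.6667
```

Without external supply the total amount T + TD is conserved, and the bound fraction settles at k1 / (k1 + k2), here two thirds.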

The term dT_ext denotes the net extra supply of transcription factor molecules at the site of gene i. This supply in turn is increased by a high expression rate of gene j, followed by diffusion towards the site of gene i, and it is decreased by spatial diffusion away from gene i and by other chemical reactions. If these diffusion processes across the intracellular space are explicitly taken into account, the models are referred to as reaction-diffusion models. Collective phenomena in reaction-diffusion systems were first described over 50 years ago (Turing, 1951). Biochemically inspired models have the advantage that they can be most directly related to biological processes, but they also suffer from a number of difficulties. First, most of the biochemically relevant reactions involving proteins do not follow linear reaction kinetics. Many proteins undergo conformational changes after reactions, which change their chemical behavior. In particular, in many regulatory DNA regions transcription factor binding can show cooperative or competitive effects, which are nonlinear and mostly unknown. Second, the full network of metabolic, enzymatic and regulatory reactions is very complex and hard to disentangle in a single step. To do so, the kinetic equations of all the different interactions (e.g., those in


Figure 2a) would have to be written down, but the types of reactions and their parameters are often unknown. At present, the data basis therefore does not seem sufficient to globally understand regulatory networks at this level of detail. However, there exist very well-examined regulatory sub-networks (Yuh et al., 1998, 2001a; de Jong et al., 2001), which are sufficiently well characterized to be modeled in some detail. Other approaches use approximations to reaction-kinetic formulations to arrive at systems of coupled differential equations for describing the time course of gene-expression levels (Chen et al., 1999; Kato et al., 2000; Sakamoto and Iba, 2001). In a differential equation approach, the temporal evolution of the expression level x_i of gene i is guided by the concerted influence of other gene products, which is described by a function F_i:

$$\tau\,\frac{dx_i}{dt} = -x_i + F_i(\mathbf{x}) \tag{3}$$

Differential equation models adopt a more abstract view of genetic regulatory networks. For example, in eq. (3) no clear specification is provided (nor needed) anymore about whether the quantities x_i denote mRNA concentrations or protein concentrations. These models act at the level of description shown in Figure 2c rather than Figure 2a. The shaded box in the figure illustrates that models at this level of description can be most powerfully constrained by microarray measurements of gene expression. The next section provides a brief overview of different genetic network models of the type of eq. (3).

3.2. Neural network models

The general form of eq. (3) represents a very rich family of networks. When modeling genetic networks, one major task is to specify the interaction functions F_i such that they are simple enough to be constrained by the existing data and at the same time account for biological evidence. One way to specify the interaction function is to assume that the effects of transcription factors j superimpose linearly when they bind to the regulatory region of a gene i, and that the expression of i is a nonlinear function f_i of the total regulatory input:

$$\tau\,\frac{dx_i}{dt} = -x_i + f_i\Big(I_{\mathrm{ext}} + \sum_j w_{ij}\,x_j\Big). \tag{4}$$

The weights w_ij describe the relative impact of each transcription factor, and I_ext represents an external cellular signal. Equation (4) is identical to the formulation of a continuous artificial neural network (Hertz et al., 1991; Haykin, 1994).
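To make eq. (4) more tangible, the sketch below integrates it with a forward-Euler scheme for a hypothetical two-gene network. Several choices are assumptions for illustration only and are not taken from the text: a logistic sigmoid is used for every f_i, each gene receives its own external input, and the weights, time constant and step size are arbitrary.

```python
import math


def sigmoid(u):
    """Logistic nonlinearity, assumed here as the transfer function f_i."""
    return 1.0 / (1.0 + math.exp(-u))


def simulate_network(weights, i_ext, x0, tau=1.0, dt=0.01, steps=5000):
    """Forward-Euler integration of tau * dx_i/dt = -x_i + f(I_ext_i + sum_j w_ij x_j)."""
    x = list(x0)
    for _ in range(steps):
        inputs = [
            i_ext[i] + sum(w * xj for w, xj in zip(weights[i], x))
            for i in range(len(x))
        ]
        x = [xi + dt / tau * (-xi + sigmoid(u)) for xi, u in zip(x, inputs)]
    return x


# Hypothetical two-gene loop: weights[i][j] is the influence of gene j on gene i.
weights = [
    [0.0, -2.0],  # gene 1 represses gene 0
    [2.0, 0.0],   # gene 0 activates gene 1
]
x_final = simulate_network(weights, i_ext=[1.0, -1.0], x0=[0.0, 0.0])
print(x_final)
```

For this small negative feedback loop the expression levels relax to a fixed point at which each x_i equals the sigmoid of its total regulatory input. With thousands of genes the same update applies, but the number of weights grows quadratically with the number of genes.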


As mentioned previously, many biochemical reactions do not act linearly. This observation can be accounted for by including higher-order terms in the formulation of the total input (Bishop, 1995):

$$\tau\,\frac{dx_i}{dt} = -x_i + f_i\Big(I_{\mathrm{ext}} + \sum_j w_{ij}\,x_j + \sum_{jk} w_{ijk}\,x_j x_k + \ldots\Big). \tag{5}$$

In this formulation, the functions F_i are specified as univariate nonlinear transforms of a Taylor-series expansion in their input arguments. It accounts for multiplicative nonlinearities in the molecular interactions, which are quite common in reaction kinetics. A special case of the multiplicative network in eq. (5) has been shown to arise as an approximation from a reaction-kinetic formulation of transcriptional regulation (Kato et al., 2000): when products of the activator genes j and inhibitor genes k bind to the regulatory region of gene i, the expression of gene i follows approximately

$$\frac{dx_i}{dt} = -a\,x_i + b \prod_j x_j^{h_j} - c \prod_{jk} x_j^{h_j}\,x_k^{h_k}, \tag{6}$$

where h_j and h_k are integer stoichiometric coefficients. The richness of the class of interaction functions can also be reduced by giving up their continuous nature. Gene expression states are then characterized by binary variables with the states ‘expressed’ or ‘not expressed’. In this case, the functions F_i are binary-valued with binary-valued arguments, i.e., they are Boolean functions. Boolean networks were among the earliest approaches towards genetic network modeling (Kauffman, 1969). The model formulations above have many free parameters. In fact, even the simplest neural network formulation, eq. (4), is controlled by all the weights w_ij, whose number is of the order of the square of the number of genes, plus further free parameters describing the nonlinear functions. Hence the number of weights can well exceed one million, whereas the number of microarray measurements usually ranges from tens up to a few hundred. Consequently, in practice these parameters cannot be estimated from the data. In addition, all these models assume the structure of the genetic network to be known. However, it is one of the major challenges of the post-genomic era to infer which genes might act in a regulatory way on which other genes, in other words, to infer the structure of the network from the data.

3.3. Data mining of transcription profiles

Therefore, another line of genetic network modeling has gone through yet another step of abstraction, and treats the task of modeling microarray data


as a data mining problem (Somogyi et al., 1997; D’haeseleer et al., 1997, 2000; Wang et al., 1999; Baldi and Brunak, 1998; Slonim, 2002; Berrar et al., 2003). The goal of data mining is to explore a data set and to discover regularities and structures in it. As opposed to hypothesis-driven approaches, which search for a particular, pre-defined pattern in the data, data mining approaches determine autonomously which patterns are present and important in the data: they are exploratory and data-driven. As gene expression data sets are noisy by nature, statistical methods play an important role in finding trends and patterns in the experimental results. In the case of genetic networks, the patterns to be inferred may be, for example, clusters of genes that are co-expressed under a given mode of the cell, or the structure of regulatory relationships between genes. Clustering algorithms aim at discovering sets of genes or gene expression patterns which are more similar to each other than to others (Eisen et al., 1998; Ben-Dor et al., 1999; Golub et al., 1999). The set of gene expression measurements is considered as a data matrix X, where the element x_ij represents the expression level of gene i in the j-th experiment (Eisen et al., 1998). Hence, each column of X contains the expression levels for all genes in one experiment, and each row contains the expression levels of one gene across the experiments. Hierarchical clustering of the rows of the matrix has been suggested in order to find clusters of genes that are co-expressed across the different experiments. One kind of hierarchical clustering successively merges pairs of clusters, starting out from the individual data vectors and proceeding in order of decreasing similarity. By this procedure, different families of cell-cycle regulated genes in the baker’s yeast Saccharomyces cerevisiae have been identified (Spellman et al., 1998).
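The agglomerative procedure described above can be sketched in a few lines. The version below is only an illustration under stated assumptions: it uses Pearson correlation as the similarity measure and average linkage between clusters, which is one common choice rather than a prescription from the text, and the function names and the small data matrix are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two expression profiles (rows must not be constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (norm_a * norm_b)


def cluster_genes(rows, n_clusters):
    """Average-linkage agglomerative clustering of the rows of X."""
    clusters = [[i] for i in range(len(rows))]  # start from single genes

    def similarity(c1, c2):
        pairs = [(i, j) for i in c1 for j in c2]
        return sum(pearson(rows[i], rows[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        # Merge the pair of clusters with the highest average similarity.
        a, b = max(
            ((p, q) for p in range(len(clusters)) for q in range(p + 1, len(clusters))),
            key=lambda pq: similarity(clusters[pq[0]], clusters[pq[1]]),
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return sorted(sorted(c) for c in clusters)


# Hypothetical data matrix X: rows are genes, columns are experiments.
X = [
    [1.0, 2.0, 3.0, 4.0],  # gene 0
    [2.0, 4.1, 6.0, 8.2],  # gene 1, rises together with gene 0
    [4.0, 3.0, 2.0, 1.0],  # gene 2
    [8.0, 6.1, 4.0, 2.1],  # gene 3, falls together with gene 2
]
print(cluster_genes(X, 2))  # [[0, 1], [2, 3]]
```

Recording the order of the merges, instead of stopping at a fixed number of clusters, would yield the full tree that hierarchical clustering is usually displayed as.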
If the experiments cover different conditions, clustering can also be applied in the space of column vectors in order to group together global gene-expression patterns. This method has been used to predict different subtypes of acute lymphoblastic leukemia (ALL) from the global expression patterns (Golub et al., 1999; Yeoh et al., 2002). Clustering can also be carried out along both dimensions, which is suitable for identifying clusters of genes that are co-expressed selectively under a given condition. Gene clusters selective for different subtypes of ALL have been identified using two-dimensional clustering (Yeoh et al., 2002).

3.4. Bayesian inference of genetic network structures

Clustering studies have revealed many extended clusters of genes, which collectively change their expression levels when a cell or tissue changes from one mode of life to another. These global patterns are thought to reflect the execution of specific genetic programs. One might now ask what it is that either stabilizes a genetic program or evokes a change to a new one. In other words, are there dominant genes or gene groups which are the underlying cause of a specific global gene-expression pattern? It is also important to know which global control functions might have failed when a pathological global gene expression pattern is observed in a disease. Knowing the basic origin of a disease might help to identify more potent and more selective drugs, and to do so in a shorter time. Due to these considerations, recent approaches have concentrated on inferring the structure of the underlying genetic regulatory networks from microarray data.

Figure 4. (a) The number of false positive (dashed line) and the number of false negative (solid stars) errors of a learning Bayesian network as a function of the number of data points used to infer the network structure. The crosses mark the total number of edges found. (b) Feature graph trained for the YPL256C genetic sub-network of the baker's yeast.

As structural learning of genetic regulatory pathways becomes an increasingly important component of post-genomic research, we will briefly summarize a case study for Bayesian inference of genetic networks (Friedman et al., 2000; Dejori and Stetter, 2003). In this approach, the set of measured gene expression vectors $x$ is considered to be drawn from a high-dimensional multivariate probability density function, which is modeled by a Bayesian network with adaptive network structure. Generally, a Bayesian network $B$ is a probabilistic model which describes the multivariate probability distribution $P$ for a set of variables $x = \{x_1, \ldots, x_N\}$, where each variable $x_i$ depends only on its parents $Pa_i$, so that $P$ decomposes into

$$P(x_1, \ldots, x_N) = \prod_{i=1}^{N} P(x_i \mid Pa_i) \qquad (7)$$
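As a small worked illustration of the factorization in Equation (7), the following sketch evaluates the joint probability of a three-gene network as a product of conditional probabilities. The network structure (gene A regulating genes B and C), the binary expression levels (0 = low, 1 = high) and all probability values are hypothetical and chosen only for the example.

```python
from itertools import product

# Hypothetical three-gene network: gene A is the only parent of B and of C.
parents = {"A": (), "B": ("A",), "C": ("A",)}

# cpt[gene][parent_states] = P(gene = 1 | parent states); made-up numbers.
cpt = {
    "A": {(): 0.3},
    "B": {(0,): 0.1, (1,): 0.8},
    "C": {(0,): 0.5, (1,): 0.2},
}

def joint_probability(state):
    """P(x_1, ..., x_N) as the product over genes of P(x_i | Pa_i)."""
    p = 1.0
    for gene, pa in parents.items():
        pa_state = tuple(state[q] for q in pa)
        p_high = cpt[gene][pa_state]
        p *= p_high if state[gene] == 1 else 1.0 - p_high
    return p

# Enumerate all 2^3 joint states; their probabilities must sum to one.
states = [dict(zip(parents, values)) for values in product((0, 1), repeat=3)]
total = sum(joint_probability(s) for s in states)

# P(A=1, B=1, C=0) = P(A=1) * P(B=1 | A=1) * P(C=0 | A=1) = 0.3 * 0.8 * 0.8
print(joint_probability({"A": 1, "B": 1, "C": 0}))
print(total)
```

The point of the decomposition is that only the small conditional tables need to be stored and estimated, rather than the full joint table over all genes.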

The basic idea is to display the associations among the variables, namely the conditional dependencies and independencies, by means of a directed acyclic graph (DAG) $G$. In the context of genetic pathway inference, each node of a Bayesian network is assigned to a gene and can assume the different expression levels of this gene throughout the set of measurements. Each edge between genes hints at a regulatory relationship between them. If this edge is directed, it can under certain assumptions be interpreted as a causal relationship: it can be inferred which gene controls another gene. Learning Bayesian networks (Hofmann, 2000; Steck, 2001; Chickering, 1996) use Bayesian statistics to find the network structure and the corresponding model parameters which best describe the probability distribution from which the dataset $X$ is drawn. In the class of score-based learning algorithms, the goodness of fit of a network $G$ with respect to the data set $X$ is assessed by assigning a score $S(G \mid X)$ by means of a statistically motivated scoring function $S$, for example the Bayesian score (Heckerman et al., 1995), which, based on Bayesian reasoning, is proportional to the posterior probability $P(G \mid X)$:

$$S(G \mid X) = \frac{P(X \mid G)\, P(G)}{P(X)} \qquad (8)$$
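To make the scoring idea concrete, the sketch below compares candidate structures by the logarithm of $P(X \mid G)$. It assumes discrete expression levels, the K2 form of the marginal likelihood (uniform Dirichlet parameter priors) and a uniform structure prior $P(G)$, so that $P(G)$ and $P(X)$ drop out when two structures are compared. These choices, the function name and the toy data are assumptions for illustration; the studies cited here may use a different variant of the Bayesian score.

```python
from collections import Counter
from math import lgamma

def log_marginal_likelihood(data, parents, arity=2):
    """log P(X | G) under the K2 metric for discrete variables.

    data:    list of dicts mapping gene name -> discrete level in range(arity)
    parents: dict mapping gene name -> tuple of parent gene names (the DAG G)
    """
    total = 0.0
    for node, pa in parents.items():
        counts = Counter()     # counts per (parent configuration, node value)
        pa_counts = Counter()  # counts per parent configuration
        for row in data:
            cfg = tuple(row[p] for p in pa)
            counts[(cfg, row[node])] += 1
            pa_counts[cfg] += 1
        # Unobserved parent configurations contribute a factor of one.
        for cfg, n_cfg in pa_counts.items():
            total += lgamma(arity) - lgamma(n_cfg + arity)
            for value in range(arity):
                total += lgamma(counts[(cfg, value)] + 1)
    return total

empty_graph = {"A": (), "B": ()}
edge_graph = {"A": (), "B": ("A",)}  # candidate structure A -> B

# Toy data set 1: B always follows A (strong dependence).
coupled = [{"A": a, "B": a} for a in (0, 1)] * 10
# Toy data set 2: all four combinations equally often (no dependence).
independent = [{"A": a, "B": b} for a in (0, 1) for b in (0, 1)] * 5

print(log_marginal_likelihood(coupled, edge_graph)
      > log_marginal_likelihood(coupled, empty_graph))
print(log_marginal_likelihood(independent, empty_graph)
      > log_marginal_likelihood(independent, edge_graph))
```

In this toy setting the edge is preferred only when the data support it, because the marginal likelihood automatically penalizes the additional parameters of the denser graph.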

Unfortunately, finding the best-scoring network structure is NP-hard, which is particularly limiting for high-dimensional data such as microarray datasets with thousands of expressed genes. The search space must therefore be limited, e.g., simply by learning small subnetworks (Dejori and Stetter, 2003b), or in a slightly more sophisticated way by restricting the maximum number of parents of each variable, as done in Friedman et al. (1999). Another problem might be the effect of small sample size. As shown in Dejori and Stetter (2003), a Bayesian network applied to small gene expression data sets will most likely not detect all gene interactions which are actually present in the metabolic network. However, the ones that are detected are relatively reliable even when the dataset is small. This is demonstrated in Figure 4a, which shows, for the Alarm network as a benchmark system, how the numbers of false positive and false negative edges increase as the data set used to train the network shrinks. To estimate the variability in the inference, many networks are trained, and their structures are combined into a feature graph, where the confidence of each feature $f$ given the data set $X$ is computed as

$$P(f \mid X) = \sum_{G} f(G)\, P(G \mid X) \qquad (9)$$

$P(G \mid X)$ is the posterior probability of the learned structure $G$, and $f(G)$ is 1 if $G$ contains feature $f$ and 0 otherwise. Here each network was trained 20 times and the resulting structures were then combined into a feature graph. Figure 4b shows the feature graph of the YPL256C subnetwork that was learned from the data of Spellman et al. (1998). Due to the small sample size and high noise levels, the confidence in the resulting fPDAG is not very high and the variance of the learned structures is considerable; nevertheless, there are two features with a confidence level of 0.8. The network shown represents the causal relationships between genes within a single gene cluster. Hence, learning Bayesian networks can in principle resolve the fine structure of regulatory relationships, which form the underlying hidden cause of the observed global and clustered gene-expression pattern.
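The combination of repeatedly trained networks into a feature graph can be sketched as follows. The example treats directed edges as features and, when no weights are supplied, approximates the posterior weighting of the structures by giving every trained network equal weight. The gene names, the five example structures and the 0.5 threshold are hypothetical and serve only to illustrate the computation.

```python
from collections import defaultdict

def feature_confidence(networks, weights=None):
    """Estimate P(f | X) = sum over G of f(G) * P(G | X) for edge features.

    networks: list of edge sets, one per learned structure G
    weights:  optional unnormalised weights standing in for P(G | X);
              uniform weights are assumed when omitted
    """
    if weights is None:
        weights = [1.0] * len(networks)
    norm = sum(weights)
    confidence = defaultdict(float)
    for edges, w in zip(networks, weights):
        for edge in edges:  # f(G) = 1 exactly for the edges contained in G
            confidence[edge] += w / norm
    return dict(confidence)

# Five hypothetical structures obtained from repeated training runs.
runs = [
    {("g1", "g2"), ("g2", "g3")},
    {("g1", "g2")},
    {("g1", "g2"), ("g3", "g2")},
    {("g1", "g2"), ("g2", "g3")},
    {("g2", "g3")},
]

conf = feature_confidence(runs)
# Keep only edges found in more than half of the runs.
feature_graph = {edge: c for edge, c in conf.items() if c > 0.5}
print(feature_graph)
```

An edge that appears in four of the five runs receives a confidence of 0.8, while an edge found only once receives 0.2 and is dropped from the thresholded graph.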

4. Summary

The genes of a genome are expressed to form the cell’s proteome, which in turn acts back on the genome in a multitude of regulatory processes. Characterizing the global principles of operation of these genetic regulatory networks represents one of the major challenges of the post-genomic era. Understanding genetic networks will help open the gate towards a quantitative understanding of morphogenesis and pathogenesis and towards the development of new tissue engineering techniques and drug discovery methods, to mention just a few. With high-throughput gene-expression profiling techniques such as DNA microarrays, a data basis has become available for data-driven modeling of genetic networks. We have provided an overview of different approaches from artificial intelligence to genetic network modeling. These techniques include artificial neural networks, machine learning and data mining techniques. To date, clustering algorithms and Bayesian networks seem to be the most suitable for modeling the currently available data basis of DNA microarray measurements, and have been successfully used to infer causal genetic regulatory relationships from gene expression profiles.

Acknowledgements

Mathäus Dejori gratefully acknowledges support through the Ernst-von-Siemens foundation.

References

Angulo, J. & Serra, J. (2003). Automatic Analysis of DNA Microarray Images Using Mathematical Morphology. Bioinformatics 19(5): 553–562.


Baldi, P. & Hatfield, G. W. (2002). DNA Microarrays and Gene Expression. Cambridge, MA: Cambridge University Press.
Baldi, P. & Brunak, S. (1998). Bioinformatics, the Machine Learning Approach. Cambridge, MA: MIT Press.
Ben-Dor, A., Shamir, R. & Yakhini, Z. (1999). Clustering Gene Expression Patterns. J. Comput. Biol. 6(3–4): 281–297.
Berrar, D., Dubitzky, W. & Granzow, M. (2002). A Practical Approach to Microarray Data Analysis. Boston/Dordrecht/London: Kluwer Academic Publishers.
Berrar, D. P., Downes, C. S. & Dubitzky, W. (2003). Multiclass Cancer Classification Using Gene Expression Profiling and Probabilistic Neural Networks. Pac. Symp. Biocomput.: 5–16.
Bilban, M., Buehler, L. K., Head, S., Desoye, G. & Quaranta, V. (2002). Normalizing DNA Microarrays. Curr. Issues Mol. Biol. 4: 57–64.
Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford: Clarendon Press.
Blattner, F. R., Plunkett, G., Bloch, C. A., Perna, N. T. & Burland, V. et al. (1997). The Complete Genome Sequence of Escherichia coli K-12. Science 277(5331): 1453–1474.
Bohr, V. A. (2002). DNA Damage and Its Processing, Relation to Human Disease. J. Inherit. Metab. Dis. 25(3): 215–222.
Brown, T. A. (1999). Genomes. Oxford: Bios Scientific Publishers.
Brown, P. O. & Botstein, D. (1999). Exploring the New World of the Genome with DNA Microarrays. Nature Genetics suppl. 21: 33–37.
Chen, T., Le, H. L. & Church, G. M. (1999). Modeling Gene Expression with Differential Equations. Proc. Pacific Symp. Biocomputing, Vol. 4, 29–40.
Chickering, D. M. (1996). Learning Bayesian Networks from Data. UCLA Cognitive Systems Laboratory, Technical Report (R-245), June 1996. Ph.D. Thesis.
Cornish-Bowden, A. (1995). Fundamentals of Enzyme Kinetics. Colchester: Portland Press.
DeGroot, M. H. (1970). Optimal Statistical Decisions. New York: McGraw-Hill.
Dejori, M. & Stetter, M. (2003). Bayesian Inference of Genetic Networks from Gene-Expression Data: Convergence and Reliability. In Arabnia, H. R., Joshua, R. & Mun, Y. (eds.) Proceedings of the 2003 International Conference on Artificial Intelligence (IC-AI’03), 321–327. CSREA Press.
Dejori, M. & Stetter, M. (2003b). Analysing Gene Expression Data with Bayesian Networks. Neurocomputing: Special Issue on Bioinformatics, submitted.
D’haeseleer, P., Liang, S. & Somogyi, R. (2000). Genetic Network Inference: From Co-Expression Clustering to Reverse Engineering. Bioinformatics 16: 707–726.
D’haeseleer, P., Wen, X., Fuhrman, S. & Somogyi, R. (1997). Mining the Gene Expression Matrix: Inferring Gene Relationships from Large-Scale Expression Data. In Holcombe, M. & Paton, R. (eds.) Information Processing in Cells and Tissues, 203–212. Plenum Press.
de Jong, H., Page, M., Hernandez, C. & Geiselmann, J. (2001). Qualitative Simulation of Genetic Regulatory Networks: Method and Application. In Nebel, B. (ed.) Proceedings of the 17th International Joint Conference on Artificial Intelligence IJCAI-01, 67–73. San Mateo, CA: Morgan Kaufmann.
Eisen, M. B., Spellman, P. T., Brown, P. O. & Botstein, D. (1998). Cluster Analysis and Display of Genome-Wide Expression Patterns. Proc. Natl. Acad. Sci. USA 95(25): 14863–14868.
Fodor, S. P., Rava, R. P., Huang, X. C., Pease, A. C., Holmes, C. P. & Adams, C. L. (1993). Multiplexed Biochemical Assays with Biological Chips. Nature 364: 555–556.
Friedman, N. (1997). Learning Bayesian Networks in the Presence of Missing Values and Hidden Variables. 14th International Conference on Machine Learning (ICML).


Friedman, N., Linial, M., Nachman, I. & Pe’er, D. (2000). Using Bayesian Networks to Analyze Expression Data. J. Comput. Biology 7: 601–620.
Friedman, N., Nachman, I. & Pe’er, D. (1999). Learning Bayesian Network Structure from Massive Datasets: The ‘Sparse Candidate’ Algorithm. Proc. Fifteenth Conf. on Uncertainty in Artificial Intelligence (UAI99).
Goffeau, A., Barrell, B. G., Davis, R. W., Dujon, B. & Feldman, H. et al. (1996). Life with 6000 Genes. Science 274(5287): 546, 563–567.
Golub, T. R., Slonim, D. K., Tamayo, P., Gaasenbeek, M., Mesirov, J. P. & Coller, H. et al. (1999). Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene-Expression Monitoring. Science 286(5439): 531–537.
Hanash, S. (2003). Disease Proteomics. Nature 422(6928): 226–232.
Haykin, S. (1994). Neural Networks. A Comprehensive Foundation. New York, Toronto, Oxford: MacMillan College Publishing Company.
Heckerman, D., Geiger, D. & Chickering, D. (1995). Learning Bayesian Networks: The Combination of Knowledge and Statistical Data. Machine Learning 20: 197–243.
Hegde, P., Qi, R., Abernathy, K., Gay, C. & Dharap, S. et al. (2000). A Concise Guide to cDNA Microarray Analysis. Biotechniques 29(3): 548–550, 552–554, 556.
Hedges, S. B. & Kumar, S. (2003). Genomic Clocks and Evolutionary Timescales. Trends Genet 19: 200–206.
Hertz, J., Krogh, A. & Palmer, R. G. (1991). Introduction to the Theory of Neural Computation. Redwood City, CA: Addison-Wesley.
Hofmann, R. (2000). Lernen der Struktur nichtlinearer Abhängigkeiten mit graphischen Modellen. Doctoral Thesis.
Hopfield, J. J. (1982). Neural Networks and Physical Systems with Emergent Collective Computational Abilities. Proc. Natl. Acad. Sci. U.S.A. 79(8): 2554–2558.
http://www.genome.gov
Human Genome Issue, 15. February 2001. Nature.
Jensen, F. V. (2001). Bayesian Networks and Decision Diagrams. New York: Springer.
Kato, M., Tsunoda, T. & Takagi, T. (2000). Inferring Genetic Networks from DNA Microarray Data by Multiple Regression Analysis. Genome Informatics 11: 118–128.
Kauffman, S. A. (1969). Homeostasis and Differentiation in Random Genetic Control Networks. Nature 224: 177–178.
Kauffman, S. A. (1977). Gene Regulation Networks: A Theory for Their Global Structures and Behaviors. In Other, A. N. (ed.) Current Topics in Developmental Biology, vol. 6, 145–182. New York: Academic Press.
Latchman, D. S. (1998). Eukaryotic Transcription Factors, 3rd edition. San Diego, CA: Academic Press.
Lodish, H., Berk, A., Zipursky, S. L., Matsudaira, P., Baltimore, D. & Darnell, J. (2000). Molecular Cell Biology. W. H. Freeman and Company.
Norsys Software Corp. (2003). Alarm Network. http://www.norsys.com/netlib/
Phizicky, E., Bastiaens, P. I., Zhu, H., Snyder, M. & Fields, S. (2003). Protein Analysis on a Proteomic Scale. Nature 422(6928): 208–215.
Rolls, E. T. & Deco, G. (2002). Computational Neuroscience of Vision. Oxford: Oxford University Press.
Sakamoto, E. & Iba, H. (2001). Inferring a System of Differential Equations for a Gene Regulatory Network by Using Genetic Programming. Congress on Evolutionary Computation: 720–726.
Schena, M. (2000). Microarray Biochip Technology. Natick: Eaton Publishing Co.


Schena, M., Shalon, D., Davis, R. & Brown, P. O. (1995). Quantitative Monitoring of Gene-Expression Patterns with a cDNA Microarray. Science 270: 467–470.
Slonim, D. K. (2002). From Patterns to Pathways: Gene Expression Data Analysis Comes of Age. Nature Genetics 32 Suppl: 502–508.
Somogyi, R., Fuhrman, S., Askenazi, M. & Wuensche, A. (1997). The Gene Expression Matrix: Towards the Extraction of Genetic Network Architectures. Nonlinear Analysis, Proc. of the Second World Congress of Nonlinear Analysis (WCNA96), Vol. 30, 1815–1824.
Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D. & Futcher, B. (1998). Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization. Molecular Biology of the Cell 9: 3273–3297.
Steck, H. (2001). Constraint-based Structural Learning in Bayesian Networks Using Finite Data Sets. Doctoral thesis, Munich, Germany.
Stetter, M. (2002). Exploration of Cortical Function. Dordrecht: Kluwer Academic Publishers.
Strehler, B. L. (1995). Deletional Mutations Are the Basic Cause of Ageing: Historical Perspectives. Mutat. Res 338(1–6): 3–17.
Tresp, V. & Hofmann, R. (2001). Nonlinear Time-Series Prediction with Missing and Noisy Data. In Jordan, M. I. & Sejnowski, T. J. (eds.) Graphical Models: Foundations of Neural Computation. Cambridge, MA: MIT Press.
Turing, A. M. (1951). The Chemical Basis of Morphogenesis. Phil. Trans. R. Soc. Lond. B 237: 37–72.
Vijg, J. & Dolle, M. E. (2002). Large Genome Rearrangements as Primary Cause of Ageing. Mech. Ageing Dev. 123(8): 907–915.
Wang, T. J. L., Shapiro, B. A. & Shasha, D. (1999). Pattern Discovery in Biomolecular Data: Tools, Techniques and Applications. Oxford: Oxford University Press.
Yates, J. R. (1998). Mass Spectrometry and the Age of the Proteome. J. Mass Spectrom. 33: 1–19.
Yeoh, E.-J., Ross, M. E., Shurtleff, S. A., Williams, W. K. & Patel, D. et al. (2002). Classification, Subtype Discovery, and Prediction of Outcome in Pediatric Acute Lymphoblastic Leukemia by Gene Expression Profiling. Cancer Cell 1: 133–143.
Yuh, C. H., Bolouri, H. & Davidson, E. H. (2001a). Cis-Regulatory Logic in the Endo 16 Gene: Switching from a Specification to a Differentiation Mode of Control. Development 128: 617–629.
Yuh, C. H., Bolouri, H. & Davidson, E. H. (2001b). Genomic Cis-Regulatory Logic: Experimental and Computational Analysis of a Sea Urchin Gene. Science 279: 1896–1902.
