The EM Algorithm and the Rise of Computational Biology

Statistical Science 2010, Vol. 25, No. 4, 476–491. DOI: 10.1214/09-STS312. © Institute of Mathematical Statistics, 2010

arXiv:1104.2180v1 [stat.ME] 12 Apr 2011

Xiaodan Fan, Yuan Yuan and Jun S. Liu

Abstract. In the past decade computational biology has grown from a cottage industry with a handful of researchers to an attractive interdisciplinary field, catching the attention and imagination of many quantitatively-minded scientists. Of interest to us is the key role played by the EM algorithm during this transformation. We survey the use of the EM algorithm in a few important computational biology problems surrounding the “central dogma” of molecular biology: from DNA to RNA and then to proteins. Topics of this article include sequence motif discovery, protein sequence alignment, population genetics, evolutionary models and mRNA expression microarray data analysis.

Key words and phrases: EM algorithm, computational biology, literature review.

Xiaodan Fan is Assistant Professor in Statistics, Department of Statistics, the Chinese University of Hong Kong, Hong Kong, China (e-mail: [email protected]). Yuan Yuan is Quantitative Analyst, Google, Mountain View, California, USA (e-mail: [email protected]). Jun S. Liu is Professor of Statistics, Department of Statistics, Harvard University, 1 Oxford Street, Cambridge, Massachusetts 02138, USA (e-mail: [email protected]).

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2010, Vol. 25, No. 4, 476–491. This reprint differs from the original in pagination and typographic detail.

1. INTRODUCTION

1.1 Computational Biology

Started by a few quantitatively minded biologists and biologically minded mathematicians in the 1970s, computational biology has been transformed in the past decades to an attractive interdisciplinary field drawing in many scientists. The use of formal statistical modeling and computational tools, the expectation–maximization (EM) algorithm, in particular, contributed significantly to this dramatic transition in solving several key computational biology problems. Our goal here is to review some of the historical developments with technical details, illustrating how biology, traditionally regarded as an empirical science, has come to embrace rigorous statistical modeling and mathematical reasoning.

Before getting into details of various applications of the EM algorithm in computational biology, we first explain some basic concepts of molecular biology. Three kinds of chain biopolymers are the central molecular building blocks of life: DNA, RNA and proteins. The DNA molecule is a double-stranded long sequence composed of four types of nucleotides (A, C, G and T). It has the famous double-helix structure, and stores the hereditary information. RNA molecules are very similar to DNAs, composed also of four nucleotides (A, C, G and U). Proteins are chains of 20 different basic units, called amino acids. The genome of an organism generally refers to the collection of all its DNA molecules, called the chromosomes. Each chromosome contains both the protein (or RNA) coding regions, called genes, and noncoding regions. The percentage of the coding regions varies a lot among genomes of different species. For example, the coding regions of the genome of baker’s yeast are more than 50%, whereas those of the human genome are less than 3%. RNAs are classified into many types, and the three most basic types are as follows: messenger RNA (mRNA), transfer RNA (tRNA) and ribosomal RNA (rRNA). An mRNA can be viewed as an intermediate copy of its corresponding gene and is used as a


template for constructing the target protein. tRNA is needed to recruit various amino acids and transport them to the template mRNA. mRNA, tRNA and amino acids work together with the construction machineries called ribosomes to make the final product, protein. One of the main components of ribosomes is the third kind of RNA, rRNA.

Proteins carry out almost all essential functions in a cell, such as catalysis, signal transduction, gene regulation, molecular modification, etc. These capabilities of the protein molecules depend on their 3-dimensional shapes, which, to a large extent, are uniquely determined by their one-dimensional sequence compositions. In order to make a protein, the corresponding gene has to be transcribed into mRNA, and then the mRNA is translated into the protein. The “central dogma” refers to the concerted effort of transcription and translation of the cell. The expression level of a gene refers to the amount of its mRNA in the cell.

Differences between two living organisms are mostly due to the differences in their genomes. Within a multicellular organism, however, different cells may differ greatly in both physiology and function even though they all carry identical genomic information. These differences are the result of differential gene expression. Since the mid-1990s, scientists have developed microarray techniques that can monitor simultaneously the expression levels of all the genes in a cell, making it possible to construct the molecular “signature” of different cell types. These techniques can be used to study how a cell responds to different interventions, and to decipher gene regulatory networks. A more detailed introduction of the basic biology for statisticians is given by Ji and Wong (2006).
With the help of the recent biotechnology revolution, biologists have generated an enormous amount of molecular data, such as billions of base pairs of DNA sequence data in the GenBank, protein structure data in PDB, gene expression data, biological pathway data, biopolymer interaction data, etc. The explosive growth of various system-level molecular data calls for sophisticated statistical models for information integration and for efficient computational algorithms. Meanwhile, statisticians have acquired a diverse array of tools for developing such models and algorithms, such as the EM algorithm (Dempster, Laird and Rubin (1977)), data augmentation (Tanner and Wong (1987)), Gibbs sampling (Geman and Geman (1984)), the Metropolis–Hastings algorithm (Metropolis and Ulam (1949); Metropolis et al. (1953); Hastings (1970)), etc.

1.2 The Expectation–Maximization Algorithm

The expectation–maximization (EM) algorithm (Dempster, Laird and Rubin, 1977) is an iterative method for finding the mode of a marginal likelihood function (e.g., the MLE when there is missing data) or a marginal distribution (e.g., the maximum a posteriori estimator). Let $Y$ denote the observed data, $\Theta$ the parameters of interest, and $\Gamma$ the nuisance parameters or missing data. The goal is to maximize the function

\[ p(Y \mid \Theta) = \int p(Y, \Gamma \mid \Theta)\, d\Gamma, \]

which typically cannot be maximized analytically. A basic assumption underlying the effectiveness of the EM algorithm is that the complete-data likelihood or the posterior distribution, $p(Y, \Gamma \mid \Theta)$, is easy to deal with. Starting with a crude parameter estimate $\Theta^{(0)}$, the algorithm iterates between the following Expectation (E-step) and Maximization (M-step) steps until convergence:

• E-step: Compute the Q-function: $Q(\Theta \mid \Theta^{(t)}) \equiv E_{\Gamma \mid \Theta^{(t)}, Y}[\log p(Y, \Gamma \mid \Theta)]$.
• M-step: Find the maximizer: $\Theta^{(t+1)} = \arg\max_{\Theta} Q(\Theta \mid \Theta^{(t)})$.
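As a concrete illustration of the two steps, the following Python sketch runs EM on a toy problem that is not taken from this paper: each observation is the number of heads in n tosses of one of two coins, the unobserved coin identities play the role of the missing data Γ, and Θ = (p_a, p_b) are the two head probabilities. The equal mixing weights, the data, the starting values and the function names are all assumptions made only for this example.

```python
import math

def log_binom_pmf(heads, n, p):
    """log P(heads | n tosses, head probability p)."""
    return (math.lgamma(n + 1) - math.lgamma(heads + 1) - math.lgamma(n - heads + 1)
            + heads * math.log(p) + (n - heads) * math.log(1.0 - p))

def em_two_coins(heads, n, theta0, n_iter=50):
    """EM for an equal-weight mixture of two binomials.

    heads  : observed data Y, the number of heads in each trial of n tosses
    theta0 : initial guess (p_a, p_b), both strictly between 0 and 1
    Returns the final (p_a, p_b) and the marginal log-likelihood trace.
    """
    p_a, p_b = theta0
    trace = []
    for _ in range(n_iter):
        # E-step: posterior probability that each trial used coin A,
        # i.e. the expectation over the missing labels given Theta^(t).
        resp, loglik = [], 0.0
        for h in heads:
            la = math.log(0.5) + log_binom_pmf(h, n, p_a)
            lb = math.log(0.5) + log_binom_pmf(h, n, p_b)
            top = max(la, lb)
            log_marginal = top + math.log(math.exp(la - top) + math.exp(lb - top))
            resp.append(math.exp(la - log_marginal))
            loglik += log_marginal
        trace.append(loglik)
        # M-step: the maximizer of Q is a responsibility-weighted
        # proportion of heads for each coin.
        weight_a = sum(resp)
        weight_b = len(heads) - weight_a
        p_a = sum(r * h for r, h in zip(resp, heads)) / (n * weight_a)
        p_b = sum((1.0 - r) * h for r, h in zip(resp, heads)) / (n * weight_b)
    return (p_a, p_b), trace

heads = [5, 9, 8, 4, 7]          # toy data, 10 tosses per trial
(p_a, p_b), trace = em_two_coins(heads, n=10, theta0=(0.6, 0.5))
```

Here `trace` records the marginal log-likelihood $\log p(Y \mid \Theta^{(t)})$ at each iteration, which is a convenient quantity to monitor when judging convergence.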

Unlike the Newton–Raphson and scoring algorithms, the EM algorithm does not require computing the second derivative or the Hessian matrix. The EM algorithm also has two attractive properties: the marginal likelihood is monotonically nondecreasing across iterations, and the iterates converge stably to a local mode (or a saddle point) under weak conditions. More importantly, the EM algorithm is constructed based on the missing data formulation and often conveys useful statistical insights regarding the underlying statistical model. A major drawback of the EM algorithm is that its convergence rate is only linear, proportional to the fraction of “missing information” about $\Theta$ (Dempster, Laird and Rubin (1977)). In cases with a large proportion of missing information, the convergence rate of the EM algorithm can be very slow. To monitor the convergence rate and the local mode problem, a basic strategy is to start the EM algorithm with multiple initial values. More sophisticated methods are available for specific problems, such as the “backup-buffering” strategy in Qin, Niu and Liu (2002).


1.3 Uses of the EM Algorithm in Biology

The idea of iterating between filling in the missing data and estimating unknown parameters is so intuitive that some special forms of the EM algorithm appeared in the literature long before Dempster, Laird and Rubin (1977) defined it. The earliest example on record is by McKendrick (1926), who invented a special EM algorithm for fitting a Poisson model to a cholera infection data set. Other early forms of the EM algorithm appeared in numerous genetics studies involving allele frequency estimation, segregation analysis and pedigree data analysis (Ceppellini, Siniscalco and Smith, 1955; Smith, 1957; Ott, 1979). A precursor to the broad recognition of the EM algorithm by the computational biology community is Churchill (1989), who applied the EM algorithm to fit a hidden Markov model (HMM) for partitioning genomic sequences into regions with homogeneous base compositions. Lawrence and Reilly (1990) first introduced the EM algorithm for biological sequence motif discovery. Haussler et al. (1993) and Krogh et al. (1994) formulated an innovative HMM and used the EM algorithm for protein sequence alignment. Krogh, Mian and Haussler (1994) extended these algorithms to predict genes in E. coli DNA data. During the past two decades, probabilistic modeling and the EM algorithm have become a more and more common practice in computational biology, ranging from multiple sequence alignment for a single protein family (Do et al., 2005) to genome-wide predictions of protein–protein interactions (Deng et al., 2002), and to single-nucleotide polymorphism (SNP) haplotype estimation (Kang et al. (2004)).

As noted in Meng and Pedlow (1992) and Meng (1997), there are too many EM-related papers to track. This is true even within the field of computational biology. In this paper we only examine a few key topics in computational biology and use typical examples to show how the EM algorithm has paved the road for these studies. The connection between the EM algorithm and statistical modeling of complex systems is essential in computational biology. It is our hope that this brief survey will stimulate further EM applications and provide insight for the development of new algorithms.

Discrete sequence data and continuous expression data are two of the most common data types in computational biology. We discuss sequence data analysis in Sections 2–5, and gene expression data analysis in Section 6. A main objective of computational biology research surrounding the “central dogma” is to study how the gene sequences affect the gene expression. In Section 2 we attempt to find conserved patterns in functionally related gene sequences as an effort to explain the relationship of their gene expression. In Section 3 we give an EM algorithm for multiple sequence alignment, where the goal is to establish “relatedness” of different sequences. Based on the alignment of evolutionarily related DNA sequences, another EM algorithm for detecting potentially expression-related regions is introduced in Section 4. An alternative way to deduce the relationship between gene sequence and gene expression is to check the effect of sequence variation within the population of a species. In Section 5 we provide an EM algorithm to deal with this type of small sequence variation. In Section 6 we review the clustering analysis of microarray gene-expression data, which is important for connecting the phenotype variation among individuals with the expression level variation. Finally, in Section 7 we discuss trends in computational biology research.

2. SEQUENCE MOTIF DISCOVERY AND GENE REGULATION

In order for a gene to be transcribed, special proteins called transcription factors (TFs) are often required to bind to certain sequences, called transcription factor binding sites (TFBSs). These sites are usually 6–20 bp long and are mostly located upstream of the gene. One TF is usually involved in the regulation of many genes, and the TFBSs that the TF recognizes often exhibit strong sequence specificity and conservation (e.g., the first position of the TFBSs is likely T, etc.). This specific pattern is called a TF binding motif (TFBM). For example, Figure 1 shows a motif of length 6. The motif is represented by the position-specific frequency matrix $(\theta_1, \ldots, \theta_6)$, which is derived from the alignment of 5 motif sites by calculating position-dependent frequencies of the four nucleotides.

In order to understand how genes’ mRNA expression levels are regulated in the cell, it is crucial to identify TFBSs and to characterize TFBMs. Although much progress has been made in developing experimental techniques for identifying these TFBSs, these techniques are typically expensive and time-consuming. They are also limited by experimental conditions, and cannot pinpoint the binding sites exactly. In the past twenty years, computational biologists and statisticians have developed


many successful in silico methods to aid biologists in finding TFBSs, and these efforts have contributed significantly to our understanding of transcription regulation.

Likewise, motif discovery for protein sequences is important for identifying structurally or functionally important regions (domains) and understanding proteins’ functional components, or active sites. For example, using a Gibbs sampling-based motif finding algorithm, Lawrence et al. (1993) were able to predict the key helix-turn-helix motif among a family of transcription activators. Experimental approaches for determining protein motifs are even more expensive and slower than those for DNAs, whereas computational approaches are more effective than those for TFBS predictions.

The underlying logic of computational motif discovery is to find patterns that are “enriched” in a given set of sequence data. Common methods include word enumeration (Sinha and Tompa (2002); Hampson, Kibler and Baldi (2002); Pavesi et al. (2004)), position-specific frequency matrix updating (Stormo and Hartzell (1989); Lawrence and Reilly (1990); Lawrence et al. (1993)) or a combination of the two (Liu, Brutlag and Liu, 2002). The word enumeration approach uses a specific consensus word to represent a motif. In contrast, the position-specific frequency matrix approach formulates a motif as a weight matrix. Jensen et al. (2004) provide a review of these motif discovery methods. Tompa et al. (2005) compared the performance of various motif discovery tools.

Traditionally, researchers have employed various heuristics, such as evaluating excessiveness of word counts or maximizing certain information criteria to guide motif finding. The EM algorithm was introduced by Lawrence and Reilly (1990) to deal with the motif finding problem.

As shown in Figure 1, suppose we are given a set of $K$ sequences $Y \equiv (Y_1, \ldots, Y_K)$, where $Y_k \equiv (Y_{k,1}, \ldots, Y_{k,L_k})$ and $Y_{k,l}$ takes values in an alphabet of $d$ residues ($d = 4$ for DNA/RNA and 20 for protein). The alphabet is denoted by $R \equiv (r_1, \ldots, r_d)$. Motif sites in this paper refer to a set of contiguous segments of the same length $w$ (e.g., the marked 6-mers in Figure 1). This concept can be further generalized via a hidden Markov model to allow gaps and position deletions (see Section 3 for HMM discussions).

Fig. 1. Transcription factor binding sites and motifs. (A) Each of the five sequences contains a TFBS of length 6. The local alignment of these sites is shown in the gray box. (B) The frequency of the nucleotides outside of the gray box is shown as $\theta_0$. The frequency of the nucleotides in the $i$th column of the gray box is shown as $\theta_i$.

The weight matrix, or Product-Multinomial motif model, was first introduced by Stormo and Hartzell (1989) and later formulated rigorously in Liu, Neuwald and Lawrence (1995). It assumes that, if $Y_{k,l}$ is the $i$th position of a motif site, it follows the multinomial distribution with the probability vector $\theta_i \equiv (\theta_{i1}, \ldots, \theta_{id})$; we denote this model as $PM(\theta_1, \ldots, \theta_w)$. If $Y_{k,l}$ does not belong to any motif site, it is generated independently from the multinomial distribution with parameter $\theta_0 \equiv (\theta_{01}, \ldots, \theta_{0d})$. Let $\Theta \equiv (\theta_0, \theta_1, \ldots, \theta_w)$. For sequence $Y_k$, there are $L'_k = L_k - w + 1$ possible positions at which a motif site of length $w$ may start. To represent the motif locations, we introduce the unobserved indicators $\Gamma \equiv \{\Gamma_{k,l} \mid 1 \le k \le K,\ 1 \le l \le L'_k\}$, where $\Gamma_{k,l} = 1$ if a motif site starts at position $l$ in sequence $Y_k$, and $\Gamma_{k,l} = 0$ otherwise.

As shown in Figure 1, it is straightforward to estimate $\Theta$ if we know where the motif sites are. The motif location indicators $\Gamma$ are the missing data, which makes the EM framework a natural choice for this problem. For illustration, we further assume that there is exactly one motif site within each sequence and that its location in the sequence is uniformly distributed. This means that $\sum_l \Gamma_{k,l} = 1$ for all $k$ and $P(\Gamma_{k,l} = 1) = 1/L'_k$.

Given $\Gamma_{k,l} = 1$, the probability of each observed sequence $Y_k$ is

\[ P(Y_k \mid \Gamma_{k,l} = 1, \Theta) = \theta_0^{h(B_{k,l})} \prod_{j=1}^{w} \theta_j^{h(Y_{k,l+j-1})}. \tag{1} \]
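Equation (1), together with the one-site-per-sequence assumption, is enough to run EM on this model: each candidate start position is weighted by its likelihood, and $\Theta$ is then re-estimated from the weighted letter counts. The Python sketch below illustrates that iteration on toy data. It is a minimal illustration rather than the implementation of Lawrence and Reilly (1990): the function name, the toy sequences, the small pseudo-count that keeps probabilities positive and the deliberately informative starting matrix are all assumptions of this example.

```python
import math

ALPHABET = "ACGT"

def motif_em(seqs, w, theta, theta0, n_iter=50, pseudo=1e-3):
    """EM for the one-site-per-sequence product-multinomial motif model.

    seqs   : list of strings over ALPHABET, each at least w letters long
    theta  : initial motif matrix, a list of w dicts letter -> probability
    theta0 : initial background distribution, a dict letter -> probability
    pseudo : small pseudo-count (an assumption of this sketch, not part of
             the model above) that keeps every probability strictly positive
    Returns the fitted theta, theta0 and, for each sequence, the posterior
    probability of every start position from the last E-step.
    """
    K, d = len(seqs), len(ALPHABET)
    n_background = sum(len(y) - w for y in seqs)
    letter_totals = {r: sum(y.count(r) for y in seqs) for r in ALPHABET}
    posteriors = []
    for _ in range(n_iter):
        site_counts = [{r: 0.0 for r in ALPHABET} for _ in range(w)]
        posteriors = []
        for y in seqs:
            # E-step. Equation (1) for start l, divided by the factor
            # prod_j theta0[y_j] that does not depend on l, is a product
            # of w odds ratios; work on the log scale for stability.
            log_wt = [sum(math.log(theta[j][y[l + j]] / theta0[y[l + j]])
                          for j in range(w))
                      for l in range(len(y) - w + 1)]
            top = max(log_wt)
            wt = [math.exp(v - top) for v in log_wt]
            total = sum(wt)
            post = [v / total for v in wt]   # the uniform prior on l cancels
            posteriors.append(post)
            for l, p in enumerate(post):
                for j in range(w):
                    site_counts[j][y[l + j]] += p
        # M-step: expected counts divided by their totals (plus pseudo-counts).
        theta = [{r: (c[r] + pseudo) / (K + d * pseudo) for r in ALPHABET}
                 for c in site_counts]
        background = {r: letter_totals[r] - sum(c[r] for c in site_counts)
                      for r in ALPHABET}
        theta0 = {r: (background[r] + pseudo) / (n_background + d * pseudo)
                  for r in ALPHABET}
    return theta, theta0, posteriors

# Toy data: the word TATAAT is planted once in each GC-only sequence.
seqs = ["GCCGTATAATGCGC", "TATAATCGGCGCCG", "CGCGGCCGTATAAT",
        "GGCTATAATCCGCG", "CCGGCGTATAATGC"]
consensus = "TATAAT"   # used only to build an artificially good starting point
theta_init = [{r: (0.4 if r == c else 0.2) for r in ALPHABET} for c in consensus]
theta0_init = {r: 0.25 for r in ALPHABET}
theta, theta0, post = motif_em(seqs, 6, theta_init, theta0_init)
starts = [max(range(len(p)), key=p.__getitem__) for p in post]  # 0-based
```

In practice the starting matrix would not be built from the answer; several random or data-derived starting points would be tried and the fit with the highest likelihood kept.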

In equation (1), $B_{k,l} \equiv \{Y_{k,j} : j < l \text{ or } j \ge l + w\}$ is the set of letters at nonsite positions of $Y_k$. The counting function $h(\cdot)$ takes a set of letter symbols as input and outputs the column vector $(n_1, \ldots, n_d)^T$, where $n_i$ is the number of letters of type $r_i$ in the input set. We define the vector power function as $\theta_i^{h(\cdot)} \equiv \prod_{j=1}^{d} \theta_{ij}^{n_j}$ for $i = 0, \ldots, w$. Thus, the complete-data likelihood function is the product of equation (1) for $k$ from 1 to $K$, that is,

\[ P(Y, \Gamma \mid \Theta) \propto \prod_{k=1}^{K} \prod_{l=1}^{L'_k} P(Y_k \mid \Gamma_{k,l} = 1, \Theta)^{\Gamma_{k,l}} = \theta_0^{h(B_\Gamma)} \prod_{i=1}^{w} \theta_i^{h(M_\Gamma^{(i)})}, \]

where $B_\Gamma$ is the set of all nonsite bases, and $M_\Gamma^{(i)}$ is the set of nucleotide bases at position $i$ of the TFBSs given the indicators $\Gamma$.

The MLE of $\Theta$ from the complete-data likelihood can be determined by simple counting, that is,

\[ \hat{\theta}_i = \frac{h(M_\Gamma^{(i)})}{K} \quad \text{and} \quad \hat{\theta}_0 = \frac{h(B_\Gamma)}{\sum_{k=1}^{K} (L_k - w)}. \]

The EM algorithm for this problem is quite intuitive. In the E-step, one uses the current parameter values $\Theta^{(t)}$ to compute the expected values of $h(M_\Gamma^{(i)})$ and $h(B_\Gamma)$. More precisely, for sequence $Y_k$, we compute its likelihood of being generated from $\Theta^{(t)}$ conditional on each possible motif location $\Gamma_{k,l} = 1$,

\[ w_{k,l} \equiv P(Y_k \mid \Gamma_{k,l} = 1, \Theta^{(t)}) = \left(\frac{\theta_1}{\theta_0}\right)^{h(Y_{k,l})} \cdots \left(\frac{\theta_w}{\theta_0}\right)^{h(Y_{k,l+w-1})} \theta_0^{h(Y_k)}. \]

Letting $W_k \equiv \sum_{l=1}^{L'_k} w_{k,l}$, we then compute the expected count vectors as

\[ E_{\Gamma \mid \Theta^{(t)}, Y}[h(M_\Gamma^{(i)})] = \sum_{k=1}^{K} \sum_{l=1}^{L'_k} \frac{w_{k,l}}{W_k}\, h(Y_{k,l+i-1}), \]

\[ E_{\Gamma \mid \Theta^{(t)}, Y}[h(B_\Gamma)] = h(\{Y_{k,l} : 1 \le k \le K,\ 1 \le l \le L_k\}) - \sum_{i=1}^{w} E_{\Gamma \mid \Theta^{(t)}, Y}[h(M_\Gamma^{(i)})]. \]

In the M-step, one simply computes

\[ \theta_i^{(t+1)} = \frac{E_{\Gamma \mid \Theta^{(t)}, Y}[h(M_\Gamma^{(i)})]}{K} \quad \text{and} \quad \theta_0^{(t+1)} = \frac{E_{\Gamma \mid \Theta^{(t)}, Y}[h(B_\Gamma)]}{\sum_{k=1}^{K} (L_k - w)}. \]

It is necessary to start with a nonzero initial weight matrix $\Theta^{(0)}$ so as to guarantee that $P(Y_k \mid \Gamma_{k,l} = 1, \Theta^{(t)}) > 0$ for all $l$. At convergence the algorithm yields both the MLE $\hat{\Theta}$ and predictive probabilities for candidate TFBS locations, that is, $P(\Gamma_{k,l} = 1 \mid \hat{\Theta}, Y)$.

Cardon and Stormo (1992) generalized the above simple model to accommodate insertions of variable lengths in the middle of a binding site. To overcome the restriction that each sequence contains exactly one motif site, Bailey and Elkan (1994, 1995a, 1995b) introduced a parameter $p_0$ describing the prior probability for each sequence position to be the start of a motif site, and designed a modified EM algorithm called the Multiple EM for Motif Elicitation (MEME). Independently, Liu, Neuwald and Lawrence (1995) presented a full Bayesian framework and Gibbs sampling algorithm for this problem. Compared with the EM approach, the Markov chain Monte Carlo (MCMC)-based approach has the advantages of making more flexible moves during the iteration and incorporating additional information such as motif location and orientation preference in the model.

The generalizations in Bailey and Elkan (1994) and Liu, Neuwald and Lawrence (1995) assume that all overlapping subsequences of length $w$ in the sequence data set are from a finite mixture model. More precisely, each subsequence of length $w$ is treated as an independent sample from a mixture of $PM(\theta_1, \ldots, \theta_w)$ and $PM(\theta_0, \ldots, \theta_0)$ [independent Multinomial($\theta_0$) in all $w$ positions]. The EM solution of this mixture model formulation then leads to the MEME algorithm of Bailey and Elkan (1994). To deal with the situation that $w$ may not be known precisely, MEME searches motifs of a range of different widths separately, and then performs model selection by optimizing a heuristic function based on the maximum likelihood ratio test. Since its release, MEME has been one of the most popular motif discovery tools cited in the literature. The Google scholar search gives a count of 1397 citations as of August 30th, 2009. Although it is 15 years old, its performance is still comparable to many new algorithms (Tompa et al., 2005).

3. MULTIPLE SEQUENCE ALIGNMENT

Multiple sequence alignment (MSA) is an important tool for studying structures, functions and the evolution of proteins. Because different parts of a protein may have different functions, they are subject to different selection pressures during evolution. Regions of greater functional or structural importance are generally more conserved than other regions. Thus, a good alignment of protein sequences can yield important evidence about their functional and structural properties.

Many heuristic methods have been proposed to solve the MSA problem. A popular approach is the progressive alignment method (Feng and Doolittle, 1987), in which the MSA is built up by aligning the most closely related sequences first and then adding more distant sequences successively. Many alignment programs are based on this strategy, such as MULTALIGN (Barton and Sternberg, 1987), MULTAL (Taylor, 1988) and, the most influential one, ClustalW (Thompson, Higgins and Gibson, 1994). Usually, a guide tree based on pairwise similarities between the protein sequences is constructed prior to the multiple alignment to determine the order for sequences to enter the alignment. Recently, a few new progressive alignment algorithms with significantly improved alignment accuracies and speed have been proposed, including T-Coffee (Notredame, Higgins and Heringa (2000)), MAFFT (Katoh et al., 2005), PROBCONS (Do et al., 2005) and MUSCLE (Edgar, 2004a, 2004b). They differ from previous approaches and each other mainly in the construction of the guide tree and in the objective function for judging the goodness of the alignment. Batzoglou (2005) and Wallace, Blackshields and Higgins (2005) reviewed these algorithms.

An important breakthrough in solving the MSA problem is the introduction of a probabilistic generative model, the profile hidden Markov model by Krogh et al. (1994). The profile HMM postulates that the $N$ observed sequences are generated as independent but indirect observations (emissions) from a Markov chain model illustrated in Figure 2. The underlying unobserved Markov chain consists of three types of states: match, insertion and deletion. Each match or insertion state emits a letter chosen from the alphabet $R$ (size $d = 20$ for proteins) according to a multinomial distribution. The deletion state does not emit any letter, but makes the sequence generating process skip one or more match states. A multiple alignment of the $N$ sequences is produced by aligning the letters that are emitted from the same match state.
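The generative process just described can be made concrete with a small simulation. The Python sketch below draws sequences from a toy profile HMM with two match states and shows how the match states induce alignment columns. The state names (B, M1, I1, D1, ..., E), the transition and emission probabilities and the helper functions are invented for illustration; this is not the model of Figure 2 or of Krogh et al. (1994).

```python
import random

def sample_profile_hmm(trans, emit, rng):
    """Draw one sequence and its state path from a toy profile HMM.

    trans : dict state -> dict next_state -> probability
    emit  : dict state -> dict letter -> probability (match/insert states only)
    rng   : a random.Random instance
    The path starts in the begin state "B" and stops on reaching "E".
    """
    state, path, letters = "B", [], []
    while True:
        successors = list(trans[state])
        state = rng.choices(successors,
                            weights=[trans[state][s] for s in successors])[0]
        if state == "E":
            return "".join(letters), path
        path.append(state)
        if state in emit:   # match or insertion state: emit one letter
            alphabet = list(emit[state])
            letters.append(rng.choices(
                alphabet, weights=[emit[state][r] for r in alphabet])[0])
        # deletion states are absent from `emit` and therefore silent

def aligned_row(path, seq, n_match):
    """Put the letter emitted by match state Mj in column j; '-' marks a deletion."""
    row, k = ["-"] * n_match, 0
    for s in path:
        if s[0] == "M":
            row[int(s[1:]) - 1] = seq[k]
        if s[0] in "MI":    # only emitting states consume a letter of seq
            k += 1
    return "".join(row)

# Hypothetical two-column model: M = match, I = insertion, D = deletion.
trans = {
    "B":  {"M1": 0.9, "D1": 0.1},
    "M1": {"M2": 0.8, "I1": 0.1, "D2": 0.1},
    "I1": {"I1": 0.3, "M2": 0.7},
    "D1": {"M2": 1.0},
    "M2": {"E": 1.0},
    "D2": {"E": 1.0},
}
emit = {
    "M1": {"C": 0.9, "A": 0.1},
    "I1": {"A": 0.5, "G": 0.5},
    "M2": {"H": 0.8, "Y": 0.2},
}

rng = random.Random(0)
samples = [sample_profile_hmm(trans, emit, rng) for _ in range(5)]
rows = [aligned_row(path, seq, 2) for seq, path in samples]
```

Each entry of `rows` has one character per match state, so stacking the rows gives the match columns of a multiple alignment; letters emitted by insertion states are left out of these columns.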
Fig. 2. Profile hidden Markov model. A modified toy example is adopted from Eddy (1998). It shows the alignment of five sequences, each containing only three to five letters. The first position is enriched with Cysteine (C), the fourth position is enriched with Histidine (H), and the fifth position is enriched with Phenylalanine (F) and Tyrosine (Y). The third sequence has a deletion at the fourth position, and the fourth sequence has an insertion at the third position. This simplified model does not allow insertion and deletion states to follow each other.

Let $\Gamma_i$ denote the unobserved state path through which the $i$th sequence is generated from the profile HMM, and $S$ the set of all states. Let $\Theta$ denote the set of all global parameters of this model, including emission probabilities in match and insertion states $e_{lr}$ ($l \in S$, $r \in R$), and transition probabilities among all hidden states $t_{ab}$ ($a, b \in S$). The complete-data log-likelihood function can be written as

\[ \log P(Y, \Gamma \mid \Theta) = \sum_{i=1}^{N} \left[ \log P(Y_i \mid \Gamma_i, \Theta) + \log P(\Gamma_i \mid \Theta) \right] = \sum_{i=1}^{N} \left[ \sum_{l \in S,\, r \in R} M_{lr}(\Gamma_i) \log e_{lr} + \sum_{a, b \in S} N_{ab}(\Gamma_i) \log t_{ab} \right], \]

where $M_{lr}(\Gamma_i)$ is the count of letter $r$ in sequence $Y_i$ that is generated from state $l$ according to $\Gamma_i$, and $N_{ab}(\Gamma_i)$ is the count of state transitions from $a$ to $b$ in the path $\Gamma_i$ for sequence $Y_i$. The E-step involves calculating the expected counts of emissions and transitions, that is, $E[M_{lr}(\Gamma_i) \mid \Theta^{(t)}]$ and $E[N_{ab}(\Gamma_i) \mid \Theta^{(t)}]$, averaging over all possible generating paths $\Gamma_i$. The Q-function is

\[ Q(\Theta \mid \Theta^{(t)}) = \sum_{i=1}^{N} \sum_{\Gamma_i} \frac{P(\Gamma_i, Y_i \mid \Theta^{(t)})}{P(Y_i \mid \Theta^{(t)})} \left[ \sum_{l \in S,\, r \in R} \log(e_{lr})\, M_{lr}(\Gamma_i) + \sum_{a, b \in S} \log(t_{ab})\, N_{ab}(\Gamma_i) \right]. \]
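The posterior expectations that enter the Q-function can be checked numerically on a small example. The Python sketch below uses an ordinary two-state HMM in which every state emits a letter; the silent deletion states and the begin/end states of a profile HMM would need extra bookkeeping that is omitted here. It computes the expected transition counts $E[N_{ab}]$ and emission counts $E[M_{lr}]$ with forward and backward recursions, and compares the transition counts with a direct sum over all paths weighted by $P(\Gamma_i, Y_i \mid \Theta^{(t)}) / P(Y_i \mid \Theta^{(t)})$. The initial distribution, all parameter values and the function names are assumptions made for illustration.

```python
from itertools import product

def forward_backward_counts(y, init, trans, emit):
    """Expected transition and emission counts for one sequence y.

    init[a] is P(first state = a), trans[a][c] = P(next = c | current = a),
    emit[a][r] = P(letter r | state a). No rescaling is done, so this
    version is only suitable for short sequences.
    """
    S, L = len(init), len(y)
    f = [[0.0] * S for _ in range(L)]   # forward probabilities
    b = [[1.0] * S for _ in range(L)]   # backward probabilities
    for a in range(S):
        f[0][a] = init[a] * emit[a][y[0]]
    for t in range(1, L):
        for c in range(S):
            f[t][c] = emit[c][y[t]] * sum(f[t - 1][a] * trans[a][c] for a in range(S))
    for t in range(L - 2, -1, -1):
        for a in range(S):
            b[t][a] = sum(trans[a][c] * emit[c][y[t + 1]] * b[t + 1][c] for c in range(S))
    p_y = sum(f[L - 1])                 # marginal likelihood P(y)
    n = [[sum(f[t][a] * trans[a][c] * emit[c][y[t + 1]] * b[t + 1][c]
              for t in range(L - 1)) / p_y
          for c in range(S)] for a in range(S)]
    m = [{r: sum(f[t][l] * b[t][l] for t in range(L) if y[t] == r) / p_y
          for r in emit[l]} for l in range(S)]
    return n, m, p_y

def brute_force_transition_counts(y, init, trans, emit):
    """Same expected transition counts by enumerating every state path."""
    S, L = len(init), len(y)
    n = [[0.0] * S for _ in range(S)]
    p_y = 0.0
    for path in product(range(S), repeat=L):
        joint = init[path[0]] * emit[path[0]][y[0]]      # P(path, y)
        for t in range(1, L):
            joint *= trans[path[t - 1]][path[t]] * emit[path[t]][y[t]]
        p_y += joint
        for t in range(1, L):
            n[path[t - 1]][path[t]] += joint
    return [[v / p_y for v in row] for row in n], p_y

# Hypothetical two-state model and a short observed sequence.
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [{"A": 0.5, "C": 0.5}, {"A": 0.1, "C": 0.9}]
y = "ACCAC"
n_fb, m_fb, p_fb = forward_backward_counts(y, init, trans, emit)
n_bf, p_bf = brute_force_transition_counts(y, init, trans, emit)
```

The enumeration visits $|S|^L$ paths, whereas the recursions need on the order of $|S|^2 L$ operations, which is why the dynamic-programming route is the practical one for real sequences.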

A brute-force enumeration of all paths is prohibitively expensive in computation. Fortunately, one can apply a forward–backward dynamic programming technique to compute the expectations for each sequence and then sum them all up.

In the M-step, the emission and transition probabilities are updated as the ratio of the expected event occurrences (sufficient statistics) divided by the total expected emission or transition events:

\[ e_{lr}^{(t+1)} = \frac{\sum_i \{ m_{lr}(Y_i) / P(Y_i \mid \Theta^{(t)}) \}}{\sum_i \{ m_l(Y_i) / P(Y_i \mid \Theta^{(t)}) \}}, \qquad t_{ab}^{(t+1)} = \frac{\sum_i \{ n_{ab}(Y_i) / P(Y_i \mid \Theta^{(t)}) \}}{\sum_i \{ n_a(Y_i) / P(Y_i \mid \Theta^{(t)}) \}}, \]

where

\[ m_{lr}(Y_i) = \sum_{\Gamma_i} M_{lr}(\Gamma_i)\, P(\Gamma_i, Y_i \mid \Theta^{(t)}), \qquad n_{ab}(Y_i) = \sum_{\Gamma_i} N_{ab}(\Gamma_i)\, P(\Gamma_i, Y_i \mid \Theta^{(t)}), \]

\[ m_l(Y_i) = \sum_{r \in R} m_{lr}(Y_i), \qquad n_a(Y_i) = \sum_{b \in S} n_{ab}(Y_i). \]

This method is called the Baum–Welch algorithm (Baum et al., 1970), and is mathematically equivalent to the EM algorithm. Conditional on the MLE $\hat{\Theta}$, the best alignment path for each sequence can be found efficiently by the Viterbi algorithm (see Durbin et al., 1998, Chapter 5, for details).

The profile HMM provides a rigorous statistical modeling and inference framework for the MSA problem. It has also played a central role in advancing the understanding of protein families and domains. A protein family database, Pfam (Finn et al., 2006), has been built using profile HMM and has served as an essential source of data in the field of protein structure and function research. Currently there are two popular software packages that use profile HMMs to detect remote protein homologies: HMMER (Eddy, 1998) and SAM (Hughey and Krogh, 1996; Karplus, Barrett and Hughey, 1999). Madera and Gough (2002) gave a comparison of these two packages.

There are several challenges in fitting the profile HMM. First, the size of the model (the number of match, insertion and deletion states) needs to be determined before model fitting. It is common to begin fitting a profile HMM by setting the number of match states equal to the average sequence length. Afterward, a strategy called “model surgery” (Krogh et al., 1994) can be applied to adjust the model size (by adding or removing a match state depending on whether an insertion or a deletion is used too often). Eddy (1998) used a maximum a posteriori (MAP) strategy to determine the model size in HMMER. In this method the number of match states is given a prior distribution, which is equivalent to adding a penalty term in the log-likelihood function.

Second, the number of sequences is sometimes too small for parameter estimation. When calculating the conditional expectation of the sufficient statistics, which are counts of residues at each state and state transitions, there may not be enough data, resulting in zero counts which could make the estimation unstable. To avoid the occurrence of zero counts, pseudo-counts can be added. This is equivalent to using a Dirichlet prior for the multinomial parameters in a Bayesian formulation.

Third, the assumption of sequence independence is often violated. Due to the underlying evolutionary relationship (unknown), some of the sequences may share much higher mutual similarities than others. Therefore, treating all sequences as i.i.d. samples may cause serious biases in parameter estimation. One possible solution is to give each sequence a weight according to its importance. For example, if two sequences are identical, it is reasonable to give each of them half the weight of other sequences. The weights can be easily integrated into the M-step of the EM algorithm to update the model parameters. For example, when a sequence has a weight of 0.5, all the emission and transition events contributed by this sequence will be counted by half. Many methods have been proposed to assign weights to the sequences (Durbin et al., 1998), but it is not clear how to set the weights in a principled way to best account for sequence dependency.

Last, since the EM algorithm can only find local modes of the likelihood function, some stochastic perturbation can be introduced to help find better modes and improve the alignment. Starting from multiple random initial parameters is strongly recommended. Krogh et al. (1994) combined simulated annealing into Baum–Welch and showed some improvement. Baldi and Chauvin (1994) developed a generalized EM (GEM) algorithm using a gradient ascent calculation in an attempt to infer HMM parameters in a smoother way.

Despite many advantages of the profile HMM, it is no longer the mainstream MSA tool. A main reason is that the model has too many free parameters, which render the parameter estimation very unstable when there are not enough sequences (fewer than 50, say) in the alignment. In addition, the vanilla EM algorithm and its variations developed by early researchers for the MSA problem almost always converge to suboptimal alignments. Recently, Edlefsen (2009) has developed an ECM algorithm for MSA that appears to have much improved convergence


properties. It is also difficult for the profile HMM to incorporate other kinds of information, such as 3D protein structure and guide tree. Some recent programs such as 3D-Coffee (O’Sullivan et al., 2004) and MAFFT are more flexible as they can incorporate this information into the objective function and optimize it. We believe that the Monte Carlo-based Bayesian approaches, which can impose more model constraints (e.g., to capitalize on the “motif” concept) and make more flexible MCMC moves, might be a promising route to rescue profile HMM (see Liu, Neuwald and Lawrence, 1995; Neuwald and Liu, 2004).

4. COMPARATIVE GENOMICS

A main goal of comparative genomics is to identify and characterize functionally important regions in the genome of multiple species. An assumption underlying such studies is that, because of functional constraints, functional regions in the genome evolve much more slowly than most nonfunctional regions (Wolfe, Sharp and Li, 1989; Boffelli et al., 2003). Regions that evolve more slowly than the background are called evolutionarily conserved elements.

Conservation analysis (comparing genomes of related species) is a powerful tool for identifying functional elements such as protein/RNA coding regions and transcriptional regulatory elements. It begins with an alignment of multiple orthologous sequences (sequences evolved from the same common ancestral sequence) and a conservation score for each column of the alignment. The scores are calculated based on the likelihood that each column is located in a conserved element. The phylogenetic hidden Markov model (Phylo-HMM) was introduced to infer the conserved regions in the genome (Yang, 1995; Felsenstein and Churchill, 1996; Siepel et al., 2005). The statistical power of Phylo-HMM has been systematically studied by Fan et al. (2007). Siepel et al. (2005) used the EM algorithm for estimating parameters in Phylo-HMM.
Their results, provided by the UCSC genome browser database (Karolchik et al., 2003), are very influential in the computational biology community. By August 2009, the paper of Siepel et al. (2005) had been cited 413 times according to the Web of Science database.

As shown in Figure 3, the alignment modeled by Phylo-HMM can be seen as generated in two steps. First, a sequence of $L$ sites is generated from a two-state HMM, with the hidden states being conserved or nonconserved sites. Second, a nucleotide is generated for each site of the common ancestral sequence and evolved to the contemporary nucleotides along all branches of a phylogenetic tree independently according to the corresponding phylogenetic model.

Let $\mu$ and $\nu$ be the transition probabilities between the two states, and let the phylogenetic models for nonconserved and conserved states be $\psi_n = (Q, \pi, \tau, \beta)$ and $\psi_c = (Q, \pi, \tau, \rho\beta)$, respectively. Here $\pi$ is the emission probability vector of the four nucleotides (A, C, G and T) in the common ancestral sequence $x_0$; $\tau$ is the tree topology of the corresponding phylogeny; $\beta$ is a vector of non-negative real numbers representing branch lengths of the tree, which are measured by the expected number of substitutions per site. The difference between the two states is characterized by a scaling parameter $\rho \in [0, 1)$ applied to the branch lengths of only the conserved state, which means fewer substitutions. The nucleotide substitution model considers a descendant nucleotide to have evolved from its ancestor by a continuous-time time-homogeneous Markov process with transition kernel $Q$, also called the substitution rate matrix (Tavaré, 1986). The transition kernels for all branches are assumed to be the same. Many parametric forms are available for the 4-by-4 nucleotide substitution rate matrix $Q$, such as the Jukes–Cantor substitution matrix and the general time-reversible substitution matrix (Yang, 1997). The nucleotide transition probability matrix for a branch of length $\beta_i$ is $e^{\beta_i Q}$.

Siepel et al. (2005) assumed that the tree topology $\tau$ and the emission probability vector $\pi$ are known. In this case, the observed alignment $Y = (y_{1\cdot}, y_{2\cdot}, y_{3\cdot}, y_{4\cdot})$ is a matrix of nucleotides. The parameter of interest is $\Theta = (\mu, \nu, Q, \rho, \beta)$. The missing information $\Gamma = (z, X)$ includes the state sequence $z$ and the ancestral DNA sequences $X$. The complete-data likelihood is written as

\[
P(Y, \Gamma \mid \Theta) = b_{z_1} P(y_{\cdot 1}, x_{\cdot 1} \mid \psi_{z_1}) \prod_{i=2}^{L} a_{z_{i-1} z_i} P(y_{\cdot i}, x_{\cdot i} \mid \psi_{z_i}).
\]

Here $y_{\cdot i}$ is the $i$th column of the alignment $Y$, $z_i \in \{c, n\}$ is the hidden state of the $i$th column, $(b_c, b_n) = \left(\frac{\mu}{\mu+\nu}, \frac{\nu}{\mu+\nu}\right)$ is the initial state probability of the HMM if the chain is stationary, and $a_{z_{i-1} z_i}$ is the transition probability (as illustrated in Figure 3). The EM algorithm is applied to obtain the MLE of $\Theta$. In the E-step, we calculate the expectation

Fig. 3. Two-state Phylo-HMM. (A) Phylogenetic tree: The tree shows the evolutionary relationship of four contemporary sequences (y1· , y2· , y3· , y4· ). They are evolved from the common ancestral sequence x0· , with two additional internal nodes (ancestors), x1· and x2· . The branch lengths β = (β0 , β1 , β2 , β3 , β4 , β5 ) indicate the evolutionary distance between two nodes, which are measured by the expected number of substitutions per site. (B) HMM state-transition diagram: The system consists of a state for conserved sites and a state for nonconserved sites (c and n, respectively). The two states are associated with different phylogenetic models (ψc and ψn ), which differ by a scaling parameter ρ. (C) An illustrative alignment generated by this model: A state sequence (z) is generated according to µ and ν. For each site in the state sequence, a nucleotide is generated for the root node in the phylogenetic tree and then for subsequent child nodes according to the phylogenetic model (ψc or ψn ). The observed alignment Y = (y1· , y2· , y3· , y4· ) is composed of all nucleotides in the leaf nodes. The state sequence z and all ancestral sequences X = (x0· , x1· , x2· ) are unobserved.

of the complete-data log-likelihood under the distribution $P(z, X \mid \Theta^{(t)}, Y)$. The marginalization of $X$, conditional on $z$ and other variables, can be accomplished efficiently site-by-site using the peeling or pruning algorithm for the phylogenetic tree (Felsenstein, 1981). The marginalization of $z$ can be done efficiently by the forward–backward procedure for HMMs (Baum et al., 1970; Rabiner, 1989). For the M-step, we can use the Broyden–Fletcher–Goldfarb–Shanno (BFGS) quasi-Newton algorithm. After we obtain the MLE $\hat\Theta$ of $\Theta$, a forward–backward dynamic programming method (Liu, 2001) can then be used to compute the posterior probability that a given hidden state is conserved, that is, $P(z_i = c \mid \hat\Theta, Y)$, which is the desired conservation score.
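The two marginalizations described above can be illustrated with a minimal Python sketch. It is not the implementation of Siepel et al. (2005): it assumes a Jukes–Cantor substitution model, treats all parameters as already estimated (the BFGS M-step is omitted), and uses a hypothetical four-leaf tree with equal branch lengths. The function names `jc_matrix`, `partial` and `conservation_scores`, and the convention that $\nu$ is the conserved-to-nonconserved transition probability and $\mu$ the reverse (chosen so that the stationary distribution matches the one stated in the text), are assumptions made for this illustration.

```python
import numpy as np

NUC = {"A": 0, "C": 1, "G": 2, "T": 3}

def jc_matrix(t):
    """Jukes-Cantor transition probabilities exp(tQ) for branch length t
    (expected substitutions per site)."""
    e = np.exp(-4.0 * t / 3.0)
    return np.full((4, 4), 0.25 * (1.0 - e)) + np.eye(4) * e

def partial(node, column, scale):
    """Pruning recursion: P(observed leaves below node | nucleotide at node)."""
    if isinstance(node, int):                 # leaf = row index in the alignment
        v = np.zeros(4)
        v[NUC[column[node]]] = 1.0
        return v
    out = np.ones(4)
    for child, length in node:                # internal node = ((child, length), ...)
        out *= jc_matrix(scale * length) @ partial(child, column, scale)
    return out

def conservation_scores(alignment, tree, pi, rho, mu, nu):
    """Posterior P(z_i = c | parameters, Y) for every alignment column."""
    L = len(alignment[0])
    emis = np.empty((L, 2))                   # columns ordered (c, n)
    for i in range(L):
        col = [seq[i] for seq in alignment]
        emis[i, 0] = pi @ partial(tree, col, rho)   # conserved: branches scaled by rho
        emis[i, 1] = pi @ partial(tree, col, 1.0)   # nonconserved
    # Assumed convention: nu = P(c -> n), mu = P(n -> c), which gives the
    # stationary distribution (mu, nu) / (mu + nu) quoted in the text.
    A = np.array([[1.0 - nu, nu], [mu, 1.0 - mu]])
    b = np.array([mu, nu]) / (mu + nu)
    fwd = np.empty((L, 2))
    bwd = np.ones((L, 2))
    fwd[0] = b * emis[0]
    fwd[0] /= fwd[0].sum()
    for i in range(1, L):                     # forward pass, rescaled per site
        fwd[i] = (fwd[i - 1] @ A) * emis[i]
        fwd[i] /= fwd[i].sum()
    for i in range(L - 2, -1, -1):            # backward pass, rescaled per site
        bwd[i] = A @ (emis[i + 1] * bwd[i + 1])
        bwd[i] /= bwd[i].sum()
    post = fwd * bwd
    return post[:, 0] / post.sum(axis=1)

# Hypothetical four-leaf tree shaped like Figure 3, all branch lengths 0.3.
x1 = ((0, 0.3), (1, 0.3))
x2 = ((2, 0.3), (3, 0.3))
tree = ((x1, 0.3), (x2, 0.3))
alignment = ["AAAAACGT", "AAAAAGTA", "AAAAATAC", "AAAAAACG"]
scores = conservation_scores(alignment, tree, np.full(4, 0.25), rho=0.3, mu=0.1, nu=0.1)
print(np.round(scores, 3))
```

In this toy alignment the first five columns are identical across the four sequences and the last three are fully divergent, so the scores should be high at the start and low at the end. Rescaling the forward and backward vectors at every site is a common way to avoid numerical underflow on long alignments.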

As shown in the Phylo-HMM example, the phylogenetic tree model is key to integrating multiple sequences for evolutionary analysis. This model is also used for comparing protein or RNA sequences. Due to its intuitive and efficient handling of the missing evolutionary history, the EM algorithm has always been a main approach for estimating parameters of the tree. For example, Felsenstein (1981) used the EM algorithm to estimate the branch length $\beta$, Bruno (1996) and Holmes and Rubin (2002) used the EM algorithm to estimate the residue usage $\pi$ and the substitution rate matrix $Q$, Friedman et al. (2002) used an extension of the EM algorithm to estimate the phylogenetic tree topology $\tau$, and Holmes (2005) used the EM algorithm for estimating insertion and deletion rates. Yang (1997) implemented some of the above algorithms in the phylogenetic analysis software PAML. A limitation of the Phylo-HMM model is the assumption of a good multiple sequence alignment, which is often not available.

5. SNP HAPLOTYPE INFERENCE

A Single Nucleotide Polymorphism (SNP) is a DNA sequence variation in which a single base is altered and that occurs in at least 1% of the population. For example, the DNA fragments CCTGAGGAG and CCTGTGGAG from two homologous chromosomes (the paired chromosomes of the same individual, one from each parent) differ at a single locus. This example is actually a real SNP in the human β-globin gene, and it is associated with sickle-cell disease. The different forms (A and T in this example) of a SNP are called alleles. Most SNPs have only two alleles in the population. Diploid organisms, such as humans, have two homologous copies of each chromosome. Thus, the genotype (i.e., the specific allelic makeup) of an individual may be AA, TT or AT in this example. A phenotype is a morphological feature of the organism controlled or affected by a genotype. Different genotypes may produce the same phenotype. In this example, individuals with genotype TT have a very high risk of sickle-cell disease.

A haplotype is a combination of alleles at multiple SNP loci that are transmitted together on the same chromosome. In other words, haplotypes are sets of phased genotypes. An example is given in Figure 4, which shows the genotypes of three individuals at four SNP loci. For the first individual, the arrangement of its alleles on two chromosomes must be ACAC and ACGC, which are the haplotypes compatible with its observed genotype data.

One of the main tasks of genetic studies is to locate genetic variants (mainly SNPs) that are associated with inheritable diseases. If we know the haplotypes of all related individuals, it will be easier to rebuild the evolutionary history and locate the disease mutations. Unfortunately, the phase information needed to build haplotypes from genotype information is usually unavailable because laboratory haplotyping methods, unlike genotyping technologies, are expensive and low-throughput.

The use of the EM algorithm has a long history in population genetics, some of which predates Dempster, Laird and Rubin (1977). For example, Ceppellini, Siniscalco and Smith (1955) invented an EM algorithm to estimate allele frequencies when there is no one-to-one correspondence between phenotype and genotype; Smith (1957) used an EM algorithm to estimate the recombination frequency; and Ott (1979) used an EM algorithm to study genotype-phenotype relationships from pedigree data. Weeks and Lange (1989) reformulated these earlier applications in the modern EM framework of Dempster, Laird and Rubin (1977). Most early works were single-SNP association studies. Thompson (1984) and Lander and Green (1987) designed EM algorithms for joint linkage analysis of three or more SNPs. With the accumulation of SNP data, more and more researchers have come to realize the importance of haplotype analysis (Liu et al., 2001). Haplotype reconstruction based on genotype data has therefore become a very important intermediate step in disease association studies.

The haplotype reconstruction problem is illustrated in Figure 4. Suppose we observed the genotype data $Y = (Y_1, \ldots, Y_n)$ for $n$ individuals, and we wish

Fig. 4. Haplotype reconstruction. We observed the genotypes of three individuals at 4 SNP loci. The 1st and 3rd individuals each have a unique haplotype phase, whereas the 2nd individual has two compatible haplotype phases. We pool all possible haplotypes together and associate with them a haplotype frequency vector (θ1 , . . . , θ6 ). Each individual’s two haplotypes are then assumed to be random draws (with replacement) from this pool of weighted haplotypes.

to predict the corresponding haplotypes $\Gamma = (\Gamma_1, \ldots, \Gamma_n)$, where $\Gamma_i = (\Gamma_i^+, \Gamma_i^-)$ is the haplotype pair of the $i$th individual. The haplotype pair $\Gamma_i$ is said to be compatible with the genotype $Y_i$, which is expressed as $\Gamma_i^+ \oplus \Gamma_i^- = Y_i$, if the genotype $Y_i$ can be generated from the haplotype pair. Let $H = (H_1, \ldots, H_m)$ be the pool of all distinct haplotypes and let $\Theta = (\theta_1, \ldots, \theta_m)$ be the corresponding frequencies in the population. The first simple model considered in the literature assumes that each individual’s genotype vector is generated by two haplotypes from the pool chosen independently with probability vector $\Theta$. This is a very good model if the region spanned by the markers in consideration is sufficiently short that no recombination has occurred, and if mating in the population is random. Under this model, we have

\[
P(Y \mid \Theta) = \prod_{i=1}^{n} \Biggl( \sum_{(j,k):\, H_j \oplus H_k = Y_i} \theta_j \theta_k \Biggr).
\]

If $\Gamma$ is known, we can directly write down the MLE of $\Theta$ as $\theta_j = n_j/(2n)$, where the sufficient statistic $n_j$ is the number of occurrences of haplotype $H_j$ in $\Gamma$. Therefore, in the EM framework, we simply replace $n_j$ by its expected value over the distribution of $\Gamma$ when $\Gamma$ is unobserved. More specifically, the EM algorithm is a simple iteration of

\[
\theta_j^{(t+1)} = \frac{E_{\Gamma \mid \Theta^{(t)}, Y}(n_j)}{2n},
\]

where Θ(t) is the current estimate of the haplotype frequencies, and nj is the count of haplotypes Hj that exist in Y. The use of the EM algorithm for haplotype analysis has been coupled with the large-scale generation of SNP data. Early attempts include Excoffier and Slatkin (1995), Long, Williams and Urbanek (1995), Hawley and Kidd (1995) and Chiano and Clayton (1998). One problem of these traditional EM approaches is that the computational complexity of the E-step grows exponentially as the number of SNPs in the haplotype increases. Qin, Niu and Liu (2002) incorporated a “partition–ligation” strategy into the EM algorithm in an effort to surpass this limitation. Lu, Niu and Liu (2003) used the EM for haplotype analysis in the scenario of case-control studies. Kang et al. (2004) extended the traditional EM haplotype inference algorithm by incorporating genotype uncertainty. Niu (2004) gave a review of general algorithms for haplotype reconstruction.
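The iteration above can be sketched in a few lines of Python under the random-mating model. This is only an illustration, not the code of any of the cited programs: the function names `compatible_pairs` and `haplotype_em` are hypothetical, the genotype encoding (one unordered allele pair per locus) is an assumption, and the three genotypes are made-up data rather than those of Figure 4, except that the first individual reproduces the ACAC/ACGC example from the text. The brute-force enumeration of compatible pairs also shows why the E-step cost grows exponentially with the number of heterozygous loci.

```python
from collections import defaultdict
from itertools import product

def compatible_pairs(genotype):
    """Enumerate ordered haplotype pairs (h_plus, h_minus) compatible with a genotype.

    genotype: one unordered allele pair per SNP locus, e.g. [("A", "A"), ("A", "G")].
    """
    options = [[(a, b)] if a == b else [(a, b), (b, a)] for a, b in genotype]
    return [("".join(x for x, _ in choice), "".join(y for _, y in choice))
            for choice in product(*options)]

def haplotype_em(genotypes, n_iter=50):
    """EM estimate of haplotype frequencies theta under random mating."""
    pairs = [compatible_pairs(g) for g in genotypes]
    pool = sorted({h for ps in pairs for pair in ps for h in pair})
    theta = {h: 1.0 / len(pool) for h in pool}     # uniform starting point
    n = len(genotypes)
    for _ in range(n_iter):
        expected = defaultdict(float)              # E-step: expected counts n_j
        for ps in pairs:
            weights = [theta[h1] * theta[h2] for h1, h2 in ps]
            total = sum(weights)
            for (h1, h2), w in zip(ps, weights):
                expected[h1] += w / total
                expected[h2] += w / total
        theta = {h: expected[h] / (2 * n) for h in pool}   # M-step: n_j / (2n)
    return theta

# Hypothetical data: three individuals typed at four SNP loci.
genotypes = [
    [("A", "A"), ("C", "C"), ("A", "G"), ("C", "C")],  # unique phase: ACAC / ACGC
    [("A", "A"), ("C", "T"), ("A", "G"), ("C", "C")],  # two compatible phases
    [("A", "A"), ("C", "C"), ("A", "A"), ("C", "C")],  # homozygous: ACAC / ACAC
]
theta = haplotype_em(genotypes)
for hap, freq in sorted(theta.items()):
    print(hap, round(freq, 4))
```

In this toy example the ambiguous second individual is resolved toward the phase that reuses the common haplotype ACAC, so the frequency of ATAC shrinks toward zero over the iterations.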

6. FINITE MIXTURE CLUSTERING FOR MICROARRAY DATA

In cluster analysis one seeks to partition observed data into groups such that coherence within each group and separation between groups are maximized jointly. Although this goal is subjectively defined (depending on how one defines “coherence” and “separation”), clustering can serve as an initial exploratory analysis for high-dimensional data. One example in computational biology is microarray data analysis. Microarrays are used to measure the mRNA expression levels of thousands of genes at the same time. Microarray data are usually displayed as a matrix $Y$. The rows of $Y$ represent the genes in a study and the columns are arrays obtained in different experiment conditions, in different stages of a biological system or from different biological samples. Cluster analysis of microarray data has been a hot research field because groups of genes that share similar expression patterns (clustering the rows of $Y$) are often involved in the same or related biological functions, and groups of samples having a similar gene expression profile (clustering the columns of $Y$) are often indicative of the relatedness of these samples (e.g., the same cancer type).

Finite mixture models have long been used in cluster analysis (see Fraley and Raftery, 2002 for a review). The observations are assumed to be generated from a finite mixture of distributions. The likelihood of a mixture model with $K$ components can be written as

\[
P(Y \mid \theta_1, \ldots, \theta_K; \tau_1, \ldots, \tau_K) = \prod_{i=1}^{n} \sum_{k=1}^{K} \tau_k f_k(Y_i \mid \theta_k),
\]

where $f_k$ is the density function of the $k$th component in the mixture, $\theta_k$ are the corresponding parameters, and $\tau_k$ is the probability that an observed datum is generated from this component model ($\tau_k \ge 0$, $\sum_k \tau_k = 1$). One of the most commonly used finite mixture models is the Gaussian mixture model, in which $\theta_k$ is composed of mean $\mu_k$ and covariance matrix $\Sigma_k$. Outliers can be accommodated by a special component in the mixture that allows for a larger variance or extreme values.

A standard way to simplify the statistical computation with mixture models is to introduce a variable indicating which component an observation $Y_i$ was generated from. Thus, the “complete data” can be expressed as $X_i = (Y_i, \Gamma_i)$, where $\Gamma_i = (\gamma_{i1}, \ldots, \gamma_{iK})$, and $\gamma_{ik} = 1$ if $Y_i$ is generated by the $k$th component and $\gamma_{ik} = 0$ otherwise. The complete-data log-likelihood function is

\[
\log P(Y, \Gamma \mid \theta_1, \ldots, \theta_K; \tau_1, \ldots, \tau_K) = \sum_{i=1}^{n} \sum_{k=1}^{K} \gamma_{ik} \log[\tau_k f_k(Y_i \mid \theta_k)].
\]

Since the complete-data log-likelihood function is linear in the $\gamma_{ik}$’s, in the E-step we only need to compute

\[
\hat\gamma_{ik} \equiv E(\gamma_{ik} \mid \Theta^{(t)}, Y) = \frac{\tau_k^{(t)} f_k(Y_i \mid \theta_k^{(t)})}{\sum_{j=1}^{K} \tau_j^{(t)} f_j(Y_i \mid \theta_j^{(t)})}.
\]

The Q-function can be calculated as

\[
Q(\Theta \mid \Theta^{(t)}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \hat\gamma_{ik} \log[\tau_k f_k(Y_i \mid \theta_k)]. \tag{2}
\]

The M-step updates the component probability $\tau_k$ as

\[
\tau_k^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} \hat\gamma_{ik},
\]

and the updating of $\theta_k$ would depend on the density function. In mixture Gaussian models, the Q-function is quadratic in the mean vector and can be maximized to achieve the M-step.

Yeung et al. (2001) are among the pioneers who applied the model-based clustering method in microarray data analysis. They adopted the Gaussian mixture model framework and represented the covariance matrix in terms of its eigenvalue decomposition $\Sigma_k = \lambda_k D_k A_k D_k^T$. In this way, the orientation, shape and volume of the multivariate normal distribution for each cluster can be modeled separately by eigenvector matrix $D_k$, eigenvalue matrix $A_k$ and scalar $\lambda_k$, respectively. Simplified models are straightforward under this general model setting, such as setting $\lambda_k$, $D_k$ or $A_k$ to be identical for all clusters or restricting the covariance matrices to take some special forms (e.g., $\Sigma_k = \lambda_k I$). Yeung and colleagues used the EM algorithm to estimate the model parameters. To improve convergence, the EM algorithm can be initialized with a model-based hierarchical clustering step (Dasgupta and Raftery, 1998).
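The E-step and M-step formulas above can be made concrete with a short Python sketch. To keep it small it makes several simplifying assumptions that are not in the original: the data are univariate (real expression profiles are vectors with covariance matrices $\Sigma_k$), the starting means are taken from data quantiles rather than from hierarchical clustering, and a small `var_floor` is added as a numerical guard against collapsing variances. The function name `gmm_em` and the simulated data are hypothetical.

```python
import numpy as np

def gmm_em(y, K, n_iter=200, var_floor=1e-6):
    """EM for a univariate Gaussian mixture; returns (tau, mu, var, gamma)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    tau = np.full(K, 1.0 / K)
    mu = np.quantile(y, (np.arange(K) + 0.5) / K)   # spread-out starting means
    var = np.full(K, y.var())
    for _ in range(n_iter):
        # E-step: gamma[i, k] = tau_k f_k(y_i) / sum_j tau_j f_j(y_i)
        dens = np.exp(-0.5 * (y[:, None] - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)
        gamma = tau * dens
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: closed-form maximizers of the Q-function
        nk = gamma.sum(axis=0)
        tau = nk / n
        mu = (gamma * y[:, None]).sum(axis=0) / nk
        var = (gamma * (y[:, None] - mu) ** 2).sum(axis=0) / nk + var_floor
    return tau, mu, var, gamma

# Simulated data: two well-separated groups of 200 observations each.
rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(6.0, 1.0, 200)])
tau, mu, var, gamma = gmm_em(y, K=2)
print(np.round(tau, 2), np.round(mu, 2), np.round(var, 2))
```

Each row of `gamma` holds the membership probabilities $\hat\gamma_{ik}$ of one observation, so a hard clustering can be read off with `gamma.argmax(axis=1)`.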

When $Y_i$ has some dimensions that are highly correlated, it can be helpful to project the data onto a lower-dimensional subspace. For example, McLachlan, Bean and Peel (2002) attempted to cluster tissue samples instead of genes. Each tissue sample is represented as a vector of length equal to the number of genes, which can be up to several thousand. Factor analysis (Ghahramani and Hinton, 1997) can be used to reduce the dimensionality, and can be seen as a Gaussian model with a special constraint on the covariance matrix. In their study, McLachlan, Bean and Peel used a mixture of factor analyzers, equivalent to a mixture Gaussian model, but with fewer free parameters to estimate because of the constraints. A variant of the EM algorithm, the Alternating Expectation–Conditional Maximization (AECM) algorithm (Meng and van Dyk, 1997), was applied to fit this mixture model.

Many microarray data sets are composed of several arrays in a series of time points so as to study biological system dynamics and regulatory networks (e.g., cell cycle studies). It is advantageous to model the gene expression profile by taking into account the smoothness of these time series. Ji et al. (2004) clustered the time course microarray data using a mixture of HMMs. Bar-Joseph et al. (2002) and Luan and Li (2003) implemented mixture models with spline components. The time-course expression data were treated as samples from a continuous smooth process. The coefficients of the spline bases can be either fixed effect, random effect or a mixture effect to accommodate different modeling needs. Ma et al. (2006) improved upon these methods by adding a gene-specific effect into the model:

\[
y_{ij} = \mu_k(t_{ij}) + b_i + \varepsilon_{ij},
\]

where $\mu_k(t)$ is the mean expression of cluster $k$ at time $t$, composed of smoothing spline components; $b_i \sim N(0, \sigma_{bk}^2)$ explains the gene-specific deviation from the cluster mean; and $\varepsilon_{ij} \sim N(0, \sigma^2)$ is the measurement error. The Q-function in this case is a weighted version of the penalized log-likelihood:

\[
\sum_{k=1}^{K} \left\{ -\sum_{i=1}^{n} \hat\gamma_{ik} \left( \sum_{j=1}^{T} \frac{(y_{ij} - \mu_k(t_{ij}) - b_i)^2}{2\sigma^2} + \frac{b_i^2}{2\sigma_{bk}^2} \right) - \lambda_k T \int [\mu_k''(t)]^2 \, dt \right\}, \tag{3}
\]

where the integral is the smoothness penalty term. A generalized cross-validation method was applied to choose the values for $\sigma_{bk}^2$ and $\lambda_k$.

An interesting variation on the EM algorithm, the rejection-controlled EM (RCEM), was introduced in Ma et al. (2006) to reduce the computational complexity of the EM algorithm for mixture models. In all mixture models, the E-step computes the membership probabilities (weights) for each gene to belong to each cluster, and the M-step maximizes a weighted sum function as in Luan and Li (2003). To reduce the computational burden of the M-step, we can “throw away” some terms with very small weights in an unbiased way using the rejection control method (Liu, Chen and Wong, 1998). More precisely, a threshold $c$ (e.g., $c = 0.05$) is chosen. Then, the new weights are computed as

\[
\tilde\gamma_{ik} =
\begin{cases}
\max\{\hat\gamma_{ik}, c\}, & \text{with probability } \min\{1, \hat\gamma_{ik}/c\}, \\
0, & \text{otherwise.}
\end{cases}
\]

The new weight $\tilde\gamma_{ik}$ then replaces the old weight $\hat\gamma_{ik}$ in the Q-function calculation in (2) in general, and in (3) more specifically. For cluster $k$, genes with a membership probability higher than $c$ are not affected, while the membership probabilities of other genes will be set to $c$ or 0, with probabilities $\hat\gamma_{ik}/c$ and $1 - \hat\gamma_{ik}/c$, respectively. By giving a zero weight to many genes with low $\hat\gamma_{ik}/c$, the number of terms to be summed in the Q-function is greatly reduced.

In many ways finite mixture models are similar to the K-means algorithm, and they may produce very similar clustering results. However, finite mixture models are more flexible in the sense that the inferred clusters do not necessarily have a spherical shape, and the shapes of the clusters can be learned from the data. Researchers such as Suresh, Dinakaran and Valarmathie (2009) tried to combine the two ways of thinking to make better clustering algorithms.

For cluster analysis, one intriguing question is how to set the total number of clusters. The Bayesian information criterion (BIC) is often used to determine the number of clusters (Yeung et al., 2001; Fraley and Raftery, 2002; Ma et al., 2006).
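Returning to the rejection-control step of RCEM described above, the reweighting rule is simple enough to sketch directly in Python. The function name `rejection_control` and the example weights are hypothetical; the sketch only covers the reweighting, not the spline-based M-step of Ma et al. (2006). The unbiasedness follows from $E(\tilde\gamma_{ik}) = \max\{\hat\gamma_{ik}, c\} \cdot \min\{1, \hat\gamma_{ik}/c\} = \hat\gamma_{ik}$, which the Monte Carlo average at the end checks empirically.

```python
import numpy as np

def rejection_control(gamma_hat, c=0.05, rng=None):
    """Replace E-step weights by sparse, unbiased rejection-controlled weights."""
    rng = np.random.default_rng() if rng is None else rng
    gamma_hat = np.asarray(gamma_hat, dtype=float)
    # Keep each weight with probability min(1, gamma_hat / c) ...
    keep = rng.random(gamma_hat.shape) < np.minimum(1.0, gamma_hat / c)
    # ... and raise kept weights below c up to c so the expectation is unchanged.
    return np.where(keep, np.maximum(gamma_hat, c), 0.0)

# Hypothetical membership probabilities of three genes for one cluster.
rng = np.random.default_rng(0)
gamma_hat = np.array([0.90, 0.20, 0.01])
draws = np.array([rejection_control(gamma_hat, 0.05, rng) for _ in range(20000)])
print(draws.mean(axis=0))   # Monte Carlo averages, close to gamma_hat
```

Weights at or above the threshold pass through unchanged, while a weight of 0.01 becomes 0.05 about one time in five and 0 otherwise, so most small terms drop out of the weighted sum in the M-step.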
A random subsampling approach for choosing the number of clusters is suggested by Dudoit, Fridlyand and Speed (2002). When external information about genes or samples is available, cross-validation can be used to determine the number of clusters.

7. TRENDS TOWARD INTEGRATION

Biological systems are generally too complex to be fully characterized by a snapshot from a single viewpoint. Modern high-throughput experimental techniques have been used to collect massive

amounts of data to interrogate biological systems from various angles and under diverse conditions. For instance, biologists have collected many types of genomic data, including microarray gene expression data, genomic sequence data, ChIP–chip binding data and protein–protein interaction data. Coupled with this trend, there is a growing interest in computational methods for integrating multiple sources of information in an effort to gain a deeper understanding of the biological systems and to overcome the limitations of divided approaches. For example, the Phylo-HMM in Section 4 takes as input an alignment of multiple sequences, which, as shown in Section 3, is a hard problem by itself. On the other hand, the construction of the alignment can be improved substantially if we know the underlying phylogeny. It is therefore preferable to infer the multiple alignment and the phylogenetic tree jointly (Lunter et al., 2005).

Hierarchical modeling is a principled way of integrating multiple data sets or multiple analysis steps. Because of the complexity of the problems, the inclusion of nuisance parameters or missing data at some level of the hierarchical models is usually either structurally inevitable or conceptually preferable. The EM algorithm and Markov chain Monte Carlo algorithms are often the methods of choice for these models due to their close connection with the underlying statistical model and the missing data structure.

For example, EM algorithms have been used to combine motif discovery with evolutionary information. The underlying logic is that motif sites such as TFBSs evolve more slowly than the surrounding genomic sequences (the background) because of functional constraints and natural selection. Moses, Chiang and Eisen (2004) developed EMnEM (Expectation–Maximization on Evolutionary Mixtures), which is a generalization of the mixture model formulation for motif discovery (Bailey and Elkan, 1994).
More precisely, they treat an alignment of multiple orthologous sequences as a series of alignments of length w, each of which is a sample from the mixture of a motif model and a background model. All observed sequences are assumed to evolve from a common ancestor sequence according to an evolutionary process parameterized by a Jukes–Cantor substitution matrix. PhyME (Sinha, Blanchette and Tompa, 2004) is another EM approach for motif discovery in orthologous sequences. Instead of modeling the common ancestor, they modeled one designated “reference species” using a two-state HMM (motif state

or background state). Only the well-aligned part of the reference sequence was assumed to share a common evolutionary origin with other species. PhyME assumes a symmetric star topology instead of a binary phylogenetic tree for the evolutionary process. OrthoMEME (Prakash et al., 2004) deals with pairs of orthologous sequences and is a natural extension of the EM algorithm of Lawrence and Reilly (1990) described in Section 2.

Steps have also been taken to incorporate microarray gene expression data into motif discovery (Bussemaker, Li and Siggia, 2001; Conlon et al., 2003). Kundaje et al. (2005) used a graphical model and the EM algorithm to combine DNA sequence data with time-series expression data for gene clustering. The basic logic is that co-regulated genes should show both similar TFBS occurrence in their upstream sequences and similar gene-expression time-series curves. The graphical model assumes that the TFBS occurrence and gene expression are independent, conditional on the co-regulation cluster assignment. Based on predicted TFBSs in promoter regions and cell-cycle time-series gene-expression data on budding yeast, this algorithm infers model parameters by integrating out the latent variables for cluster assignment. In a similar setting, Chen and Blanchette (2007) used a Bayesian network and an EM-like algorithm to integrate TFBS information, TF expression data and target gene expression data for identifying the combination of motifs that are responsible for tissue-specific expression. The relationships among different data are modeled by the connections of different nodes in the Bayesian network. Wang et al. (2005) used a mixture model to describe the joint probability of TFBS and target gene expression data. Using the EM algorithm, they provide a refined representation of the TFBS and calculate the probability that each gene is a true target.

As we show in this review, the EM algorithm has enjoyed many applications in computational biology.
This is partly driven by the need for complex statistical models to describe biological knowledge and data. The missing data formulation of the EM algorithm addresses many computational biology problems naturally. The efficiency of a specific EM algorithm depends on how efficiently we can integrate out unobserved variables (missing data/nuisance parameters) in the E-step and how complex the optimization problem is in the M-step. Special dependence structures can often be imposed on the unobserved variables to greatly ease the computational

burden of the E-step. For example, the computation is simple if latent variables are independent in the conditional posterior distribution, such as in the mixture motif example in Section 2 and the haplotype example in Section 5. Efficient exact calculation may also be available for structured latent variables, such as the forward–backward procedure for HMMs (Baum et al., 1970), the pruning algorithm for phylogenetic trees (Felsenstein, 1981) and the inside–outside algorithm for the probabilistic context-free grammar in predicting RNA secondary structures (Eddy and Durbin, 1994). As one of the drawbacks of the EM algorithm, the M-step can sometimes be too complicated to compute directly, such as in the Phylo-HMM example in Section 4 and the smoothing spline mixture model in Section 6, in which cases innovative numerical tricks are called for.

ACKNOWLEDGMENTS

We thank Paul T. Edlefsen for helpful discussions about the profile hidden Markov model, as well as Yves Chretien for polishing the language. This research is supported in part by the NIH Grant R01HG02518-02 and the NSF Grant DMS-07-06989. The first two authors should be regarded as joint first authors.

REFERENCES

Bailey, T. L. and Elkan, C. (1994). Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol. 2 28–36.
Bailey, T. L. and Elkan, C. (1995a). Unsupervised learning of multiple motifs in biopolymers using EM. Machine Learning 21 51–58.
Bailey, T. L. and Elkan, C. (1995b). The value of prior knowledge in discovering motifs with MEME. Proc. Int. Conf. Intell. Syst. Mol. Biol. 3 21–29.
Baldi, P. and Chauvin, Y. (1994). Smooth on-line learning algorithms for hidden Markov models. Neural Computation 6 305–316.
Bar-Joseph, Z., Gerber, G., Gifford, D., Jaakkola, T. and Simon, I. (2002). A new approach to analyzing gene expression time series data. In Proc. Sixth Ann. Inter. Conf. Comp. Biol. 39–48. ACM Press, New York.
Barton, G. and Sternberg, M. (1987). A strategy for the rapid multiple alignment of protein sequences. J. Mol. Biol. 198 327–337.
Batzoglou, S. (2005). The many faces of sequence alignment. Briefings in Bioinformatics 6 6–22.
Baum, L. E., Petrie, T., Soules, G. and Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statist. 41 164–171. MR0287613

Boffelli, D., McAuliffe, J., Ovcharenko, D., Lewis, K. D., Ovcharenko, I., Pachter, L. and Rubin, E. M. (2003). Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science 299 1391–1394.
Bruno, W. (1996). Modeling residue usage in aligned protein sequences via maximum likelihood. Mol. Biol. Evol. 13 1368–1374.
Bussemaker, H. J., Li, H. and Siggia, E. D. (2001). Regulatory element detection using correlation with expression. Nature Genetics 27 167–171.
Cardon, L. R. and Stormo, G. D. (1992). Expectation maximization algorithm for identifying protein-binding sites with variable lengths from unaligned DNA fragments. J. Mol. Biol. 223 159–170.
Ceppellini, R., Siniscalco, M. and Smith, C. A. B. (1955). The estimation of gene frequencies in a random-mating population. Annals of Human Genetics 20 97–115. MR0075523
Chen, X. and Blanchette, M. (2007). Prediction of tissue-specific cis-regulatory modules using Bayesian networks and regression trees. BMC Bioinformatics 8 (Suppl 10) S2.
Chiano, M. N. and Clayton, D. G. (1998). Fine genetic mapping using haplotype analysis and the missing data problem. Annals of Human Genetics 62 55–60.
Churchill, G. A. (1989). Stochastic models for heterogeneous DNA sequences. Bull. Math. Biol. 51 79–94. MR0978904
Conlon, E. M., Liu, X. S., Lieb, J. D. and Liu, J. S. (2003). Integrating regulatory motif discovery and genome-wide expression analysis. Proc. Natl. Acad. Sci. USA 100 3339–3344.
Dasgupta, A. and Raftery, A. (1998). Detecting features in spatial point processes with clutter via model-based clustering. J. Amer. Statist. Assoc. 93 294–302.
Dempster, A., Laird, N. and Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B 39 1–38. MR0501537
Deng, M., Mehta, S., Sun, F. and Chen, T. (2002). Inferring domain–domain interactions from protein–protein interactions. Genome Res. 12 1540–1548.
Do, C. B., Mahabhashyam, M. S. P., Brudno, M. and Batzoglou, S. (2005). Probcons: Probabilistic consistency-based multiple sequence alignment. Genome Res. 15 330–340.
Dudoit, S., Fridlyand, J. and Speed, T. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. J. Amer. Statist. Assoc. 97 77–87. MR1963389
Durbin, R., Eddy, S., Krogh, A. and Mitchison, G. (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge Univ. Press, Cambridge.
Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics 14 755–763.
Eddy, S. R. and Durbin, R. (1994). RNA sequence analysis using covariance models. Nucleic Acids Res. 22 2079–2088.
Edgar, R. (2004a). MUSCLE: A multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5 113.

EM IN COMPUTATIONAL BIOLOGY Edgar, R. (2004b). MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32 1792–1797. Edlefsen, P. T. (2009). Conditional Baum–Welch, dynamic model surgery, and the three Poisson Dempster–Shafer model. Ph.D. thesis, Dept. Statistics, Harvard Univ. Excoffier, L. and Slatkin, M. (1995). Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol. Biol. Evol. 12 921–927. Fan, X., Zhu, J., Schadt, E. and Liu, J. (2007). Statistical power of phylo-HMM for evolutionarily conserved element detection. BMC Bioinformatics 8 374. Felsenstein, J. (1981). Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol. 17 368–376. Felsenstein, J. and Churchill, G. A. (1996). A hidden Markov model approach to variation among sites in rate of evolution. Mol. Biol. Evol. 13 93–104. Feng, D. and Doolittle, R. (1987). Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25 351–360. ¨ ckler, B., GriffithsFinn, R., Mistry, J., Schuster-Bo Jones, S., Hollich, V., Lassmann, T., Moxon, S., Marshall, M., Khanna, A., Durbin, R., Eddy, S., Sonnhammer, E. and Bateman, A. (2006). Pfam: Clans, web tools and services. Nucleic Acids Res. Database Issue 34 D247–D251. Fraley, C. and Raftery, A. E. (2002). Model-based clustering, discriminant analysis, and density estimation. J. Amer. Statist. Assoc. 97 611–631. MR1951635 Friedman, N., Ninio, M., Pe’er, I. and Pupko, T. (2002). A structural EM algorithm for phylogenetic inference. J. Comput. Biol. 9 331–353. Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6 721–741. Ghahramani, Z. and Hinton, G. E. (1997). The EM algorithm for factor analyzers. Technical Report CRG-TR-961, Univ. Toronto, Toronto. Hampson, S., Kibler, D. and Baldi, P. (2002). 
Distribution patterns of over-represented k-mers in non-coding yeast DNA. Bioinformatics 18 513–528. Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57 97–109. Haussler, D., Krogh, A., Mian, I. S. and Sjolander, K. (1993). Protein modeling using hidden Markov models: Analysis of globins. In Proc. Hawaii Inter. Conf. Sys. Sci. 792–802. IEEE Computer Society Press, Los Alamitos, CA. Hawley, M. E. and Kidd, K. K. (1995). HAPLO: A program using the EM algorithm to estimate the frequencies of multi-site haplotypes. Journal of Heredity 86 409–411. Holmes, I. (2005). Using evolutionary expectation maximization to estimate indel rates. Bioinformatics 21 2294–2300. Holmes, I. and Rubin, G. M. (2002). An expectation maximization algorithm for training hidden substitution models. J. Mol. Biol. 317 753–764. Hughey, R. and Krogh, A. (1996). Hidden Markov models for sequence analysis. Extension and analysis of the basic method. Comput. Appl. Biosci. 12 95–107.

Jensen, S. T., Liu, X. S., Zhou, Q. and Liu, J. S. (2004). Computational discovery of gene regulatory binding motifs: A Bayesian perspective. Statist. Sci. 19 188–204. MR2082154 Ji, H. and Wong, W. H. (2006). Computational biology: Toward deciphering gene regulatory information in mammalian genomes. Biometrics 62 645–663. MR2247187 Ji, X., Yuan, Y., Sun, Z. and Li, Y. (2004). HMMGEP: Clustering gene expression data using hidden Markov models. Bioinformatics 20 1799–1800. Kang, H., Qin, Z. S., Niu, T. and Liu, J. S. (2004). Incorporating genotyping uncertainty in haplotype inference for single-nucleotide polymorphisms. American Journal of Human Genetics 74 495–510. Karolchik, D., Baertsch, R., Diekhans, M., Furey, T. S., Hinrichs, A., Lu, Y. T., Roskin, K. M., Schwartz, M., Sugnet, C. W., Thomas, D. J., Weber, R. J., Haussler, D. and Kent, W. J. (2003). The UCSC genome browser database. Nucleic Acids Res. 31 51–54. Karplus, K., Barrett, C. and Hughey, R. (1999). Hidden Markov models for detecting remote protein homologies. Bioinformatics 14 846–856. Katoh, K., Kuma, K., Toh, H. and Miyata, T. (2005). MAFFT version 5: Improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 33 511–518. Krogh, A., Brown, M., Mian, I. S., Sjolander, K. and Haussler, D. (1994). Hidden Markov models in computational biology applications to protein modeling. J. Mol. Biol. 235 1501–1531. Krogh, A., Mian, I. S. and Haussler, D. (1994). A hidden Markov model that finds genes in E. coli DNA. Nucleic Acids Res. 22 4768–4778. Kundaje, A., Middendorf, M., Gao, F., Wiggins, C. and Leslie, C. (2005). Combining sequence and time series expression data to learn transcriptional modules. IEEE/ACM Trans. Comp. Biol. Bioinfo. 2 194–202. Lander, E. S. and Green, P. (1987). Construction of multilocus genetic linkage maps in humans. Proc. Natl. Acad. Sci. USA 84 2363–2367. Lawrence, C. E. and Reilly, A. A. (1990). 
An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins 7 41–51. Lawrence, C. E., Altschul, S. F., Boguski, M. S., Liu, J. S., Neuwald, A. F. and Wootton, J. C. (1993). Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment. Science 262 208–214. Liu, J. S. (2001). Monte Carlo Strategies in Scientific Computing. Springer, New York. MR1842342 Liu, J. S., Chen, R. and Wong, W. H. (1998). Rejection control and sequential importance sampling. J. Amer. Statist. Assoc. 93 1022–1031. MR1649197 Liu, J. S., Neuwald, A. F. and Lawrence, C. E. (1995). Bayesian models for multiple local sequence alignment and Gibbs sampling strategies. J. Amer. Statist. Assoc. 90 1156–1170. Liu, J. S., Sabatti, C., Teng, J., Keats, B. J. and Risch, N. (2001). Bayesian analysis of haplotypes for linkage disequilibrium mapping. Genome Res. 11 1716–1724.

Liu, X. S., Brutlag, D. L. and Liu, J. S. (2002). An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nature Biotechnology 20 835–839. Long, J. C., Williams, R. C. and Urbanek, M. (1995). An E-M algorithm and testing strategy for multiple-locus haplotypes. American Journal of Human Genetics 56 799–810. Lu, X., Niu, T. and Liu, J. S. (2003). Haplotype information and linkage disequilibrium mapping for single nucleotide polymorphisms. Genome Res. 13 2112–2117. Luan, Y. and Li, H. (2003). Clustering of time-course gene expression data using a mixed-effects model with B-splines. Bioinformatics 19 474–482. Lunter, G., Miklos, I., Drummond, A., Jensen, J. and Hein, J. (2005). Bayesian coestimation of phylogeny and sequence alignment. BMC Bioinformatics 6 83. Ma, P., Castillo-Davis, C., Zhong, W. and Liu, J. (2006). A data-driven clustering method for time course gene expression data. Nucleic Acids Res. 34 1261–1269. Madera, M. and Gough, J. (2002). A comparison of profile hidden Markov model procedures for remote homology detection. Nucleic Acids Res. 30 4321–4328. McKendrick, A. G. (1926). Applications of mathematics to medical problems. Proceedings Edinburgh Mathematical Society 44 98–130. McLachlan, G. J., Bean, R. W. and Peel, D. (2002). A mixture model-based approach to the clustering of microarray expression data. Bioinformatics 18 413–422. Meng, X. and van Dyk, D. (1997). The EM algorithm—An old folk song sung to a fast new tune (with discussion). J. Roy. Statist. Soc. Ser. B 59 511–567. MR1452025 Meng, X.-L. (1997). The EM algorithm and medical studies: A historical link. Statistical Methods in Medical Research 6 3–23. Meng, X.-L. and Pedlow, S. (1992). EM: A bibliographic review with missing articles. In Proc. Stat. Comp. Sec. 24–27. Amer. Statist. Assoc., Washington, DC. Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A. and Teller, E. (1953).
Equation of state calculations by fast computing machines. Journal of Chemical Physics 21 1087–1092. Metropolis, N. and Ulam, S. (1949). The Monte Carlo method. J. Amer. Statist. Assoc. 44 335–341. MR0031341 Moses, A., Chiang, D. and Eisen, M. (2004). Phylogenetic motif detection by expectation–maximization on evolutionary mixtures. In Pacific Symposium on Biocomputing 324–335. World Scientific, Singapore. Neuwald, A. and Liu, J. (2004). Gapped alignment of protein sequence motifs through Monte Carlo optimization of a hidden Markov model. BMC Bioinformatics 5 157. Niu, T. (2004). Algorithms for inferring haplotypes. Genetic Epidemiology 27 334–347. Notredame, C., Higgins, D. and Heringa, J. (2000). T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302 205–217. O’Sullivan, O., Suhre, K., Abergel, C., Higgins, D. G. and Notredame, C. (2004). 3DCoffee: Combining protein sequences and structures within multiple sequence alignments. J. Mol. Biol. 340 385–395.

Ott, J. (1979). Maximum likelihood estimation by counting methods under polygenic and mixed models in human pedigrees. American Journal of Human Genetics 31 161–175. Pavesi, G., Mereghetti, P., Mauri, G. and Pesole, G. (2004). Weeder Web: Discovery of transcription factor binding sites in a set of sequences from co-regulated genes. Nucleic Acids Res. 32 W199–W203. Prakash, A., Blanchette, M., Sinha, S. and Tompa, M. (2004). Motif discovery in heterogeneous sequence data. In Pacific Symposium on Biocomputing 348–359. World Scientific, Singapore. Qin, Z. S., Niu, T. and Liu, J. S. (2002). Partition–ligation–expectation–maximization algorithm for haplotype inference with single-nucleotide polymorphisms. American Journal of Human Genetics 71 1242–1247. Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77 257–286. Siepel, A., Bejerano, G., Pedersen, J. S., Hinrichs, A. S., Hou, M., Rosenbloom, K., Clawson, H., Spieth, J., Hillier, L. W., Richards, S., Weinstock, G. M., Wilson, R. K., Gibbs, R. A., Kent, W. J., Miller, W. and Haussler, D. (2005). Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15 1034–1050. Sinha, S. and Tompa, M. (2002). Discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Res. 30 5549–5560. Sinha, S., Blanchette, M. and Tompa, M. (2004). PhyME: A probabilistic algorithm for finding motifs in sets of orthologous sequences. BMC Bioinformatics 5 170. Smith, C. A. B. (1957). Counting methods in genetical statistics. Annals of Human Genetics 35 254–276. MR0088408 Stormo, G. D. and Hartzell, G. W. I. (1989). Identifying protein-binding sites from unaligned DNA fragments. Proc. Natl. Acad. Sci. USA 86 1183–1187. Suresh, R. M., Dinakaran, K. and Valarmathie, P. (2009). Model based modified K-means clustering for microarray data.
In International Conference on Information Management and Engineering 271–273. IEEE Computer Society, Los Alamitos, CA. Tanner, M. A. and Wong, W. H. (1987). The calculation of posterior distributions by data augmentation (with discussion). J. Amer. Statist. Assoc. 82 528–540. MR0898357 Tavaré, S. (1986). Some probabilistic and statistical problems in the analysis of DNA sequences. In Some Mathematical Questions in Biology—DNA Sequence Analysis (New York, 1984). Lectures on Mathematics in the Life Sciences 17 57–86. Amer. Math. Soc., Providence, RI. MR0846877 Taylor, W. (1988). A flexible method to align large numbers of biological sequences. J. Mol. Evol. 28 161–169. Thompson, E. A. (1984). Information gain in joint linkage analysis. Math. Med. Biol. 1 31–49. Thompson, J., Higgins, D. and Gibson, T. (1994). CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22 4673–4680.

Tompa, M., Li, N., Bailey, T. L., Church, G. M., De Moor, B., Eskin, E., Favorov, A. V., Frith, M. C., Fu, Y., Kent, W. J., Makeev, V. J., Mironov, A. A., Noble, W. S., Pavesi, G., Pesole, G., Régnier, M., Simonis, N., Sinha, S., Thijs, G., van Helden, J., Vandenbogaert, M., Weng, Z., Workman, C., Ye, C. and Zhu, Z. (2005). Assessing computational tools for the discovery of transcription factor binding sites. Nature Biotechnology 23 137–144. Wallace, I. M., Blackshields, G. and Higgins, D. G. (2005). Multiple sequence alignments. Current Opinion in Structural Biology 15 261–266. Wang, W., Cherry, J. M., Nochomovitz, Y., Jolly, E., Botstein, D. and Li, H. (2005). Inference of combinatorial regulation in yeast transcriptional networks: A case study of sporulation. Proc. Natl. Acad. Sci. USA 102 1998–2003.

Weeks, D. E. and Lange, K. (1989). Trials, tribulations, and triumphs of the EM algorithm in pedigree analysis. Math. Med. Biol. 6 209–232. MR1052291 Wolfe, K. H., Sharp, P. M. and Li, W. H. (1989). Mutation rates differ among regions of the mammalian genome. Nature 337 283–285. Yang, Z. (1995). A space–time process model for the evolution of DNA sequences. Genetics 139 993–1005. Yang, Z. (1997). PAML: A program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13 555–556. Yeung, K. Y., Fraley, C., Murua, A., Raftery, A. E. and Ruzzo, W. L. (2001). Model-based clustering and data transformations for gene expression data. Bioinformatics 17 977–987.
