Learning Scoring Schemes for Sequence Alignment from Partial Examples

Eagu Kim and John Kececioglu

Abstract—When aligning biological sequences, the choice of scoring scheme is critical. Even small changes in gap penalties, for example, can yield radically different alignments. A rigorous way to learn parameter values that are appropriate for biological sequences is through inverse parametric sequence alignment. Given a collection of examples of biologically correct reference alignments, this is the problem of finding parameter values that make the scores of the reference alignments be as close as possible to those of optimal alignments of their sequences. We extend prior work on inverse parametric alignment to partial examples, which contain regions where the reference alignment is not specified, and to an improved formulation based on minimizing the average error between the scores of the reference alignments and the scores of optimal alignments. Experiments on benchmark biological alignments show we can learn scoring schemes that generalize across protein families, and that boost the accuracy of multiple sequence alignment by as much as 25 percent.

Index Terms—Inverse parametric sequence alignment, substitution score matrices, affine gap penalties, linear programming, cutting plane algorithms.
1 INTRODUCTION

A fundamental question that arises whenever biological sequences are aligned is to decide what parameter values to use for the alignment scoring scheme. For example, the standard scoring function for protein sequence alignment requires determining values for 210 substitution scores and two gap penalties. In practice, substitution scores are usually set by convention to a matrix from the PAM [5] or BLOSUM [17] families, while gap penalties are often set by hand through trial and error. Despite this practice, choosing appropriate values for these parameters is critical. Even small changes in gap penalties can yield radically different alignments, while aligning a specific family of sequences may require substitution scores that differ substantially from a conventional matrix.

A rigorous approach to determining these values is through inverse parametric sequence alignment [16], [31], which sets parameters using examples of biologically correct reference alignments. Informally, inverse alignment seeks parameter values for the alignment scoring function that make the examples be optimal alignments of their sequences. Put another way, it seeks a parameter choice at which the examples are optimal solutions to the sequence alignment problem. Note that if a parameter choice exists that makes every example be the unique-optimal alignment of its sequences, then any algorithm that computes optimal sequence alignments will perfectly recover the examples, given the example sequences and this parameter choice as inputs. In general, inverse alignment belongs to the broad area of inverse parametric optimization [9], where examples
of optimal solutions to a combinatorial optimization problem are used to set parameter values in its objective function.

In practice, a biologically useful parameter choice rarely exists that makes a collection of example reference alignments all be optimal [19] (let alone unique-optimal). Consequently, the problem becomes one of finding parameter values that make the examples score as close as possible to optimal. An important issue in formulating the problem is deciding what measure of error to minimize between the scores of the examples and the scores of optimal alignments, when assessing closeness to optimality.

Recently, Kececioglu and Kim [19] discovered a new method for inverse alignment based on linear programming that for the first time could quickly learn best possible values for all 212 parameters in the standard model for global protein sequence alignment from hundreds of examples of complete reference alignments. Their approach minimized the maximum relative error in scores across the examples. In this paper, we extend this work in several directions:

. to partial examples, which contain regions where the reference alignment is not specified,
. to an improved error model involving minimization of the average absolute error across the examples,
. to a stronger approach for eliminating degenerate solutions that are not biologically useful, and
. to an experimental study of the recovery rate of the learned scoring schemes when used for multiple sequence alignment.
The issue of partial versus complete examples is especially important when the reference alignments are drawn from suites of benchmark protein multiple alignments that have been determined by aligning the three-dimensional structures of the protein molecules [24], [1], [2], [33]. While such benchmarks provide the most accurate protein alignments currently available, the regions where these reference alignments are reliable usually consist of a series of aligned conserved blocks that rarely cover the entire sequences;
between these blocks, the alignment is often unreliable or unspecified. The improved approach to inverse alignment that results from these extensions, which minimizes average absolute error given partial examples, remains fast and achieves better recovery. To briefly summarize the results, the recovery of reference alignments by the learned scoring schemes improves upon conventional amino acid substitution matrices by as much as 25 percent when used for multiple alignment of specific sequence families. Learning a scoring scheme from more than 200 partial examples, whose reference alignments cover about 40 percent of each sequence on average, takes around 4 hours on a laptop.
1.1 Related Work

Inverse parametric sequence alignment was introduced in the seminal paper of Gusfield and Stelling [16]. They considered the problem for the case of two parameters and one example, and sought parameter values that made the example be an optimal scoring alignment of its sequences. For this form of inverse alignment, they presented an indirect approach that attempted to avoid computing the complete parametric decomposition [34], [15], [27] of the parameter space. Sun et al. [31] gave the first direct algorithm for inverse alignment for the case of three parameters and one example. Their algorithm, which is quite involved, finds values for the three parameters that make the example alignment score optimal in $O(n^2 \log n)$ time, for two strings of length $n$.

In a significant advance, Kececioglu and Kim [19], using a completely different approach based on linear programming, gave the first polynomial-time algorithm for arbitrarily many parameters and examples. Their approach is also flexible, and solves the unique-optimal and near-optimal variations of inverse alignment as well. For the near-optimal variation, their algorithm finds a parameter choice that makes the examples score as close as possible to optimal in terms of minimizing the maximum relative error of the examples. As they demonstrated, it is also fast in practice.

The approach to inverse alignment developed in [19] actually solves the more general problem of inverse parametric optimization. For a combinatorial optimization problem $P$, inverse parametric optimization seeks parameter values for the objective function of $P$ that make a given example solution to an instance $I$ of $P$ be an optimal solution to instance $I$. Their approach solves the inverse parametric problem in polynomial time for any problem $P$ where: 1) the objective function for $P$ is linear in its parameters, and 2) problem $P$ can be solved in polynomial time for any fixed choice of its parameters. This solves inverse parametric optimization for an extremely wide range of problems $P$, including many classic combinatorial problems such as sequence alignment, shortest paths, spanning trees, network flow, and maximum matchings.

The authors recently learned that Eppstein [9] discovered a similar approach to general inverse parametric optimization. Eppstein applied it in the context of minimum spanning trees to find edge weights that make an example tree be the unique optimal spanning tree. In the context of biological sequence alignment, this unique-optimal form of the inverse problem rarely has a solution [19].

Alternative approaches for determining alignment parameters have been recently proposed based on machine
learning. Do et al. [7] use discriminative training on conditional random fields to find parameter values for a hidden Markov model of sequence alignment. Their approach requires solving a convex nonlinear numerical optimization problem that becomes nonconvex in the presence of examples that are partial alignments, and is not guaranteed to run in polynomial time. Yu et al. [37] describe a support vector machine approach for learning parameters to align a protein sequence to a protein structure. Their approach involves solving a quadratic numerical optimization problem with linear constraints, and for the first time incorporates a measure of alignment recovery directly into the problem formulation. Both of these approaches require considerable computational resources, appearing to take days on large instances to compute an approximate solution. In contrast to these machine learning approaches, our method for inverse alignment only involves linear programming, for which an optimal solution can be computed quickly, even for very large instances. We also rigorously address for the first time the issue of input examples that are partial alignments.
1.2 Overview

The next section presents several variations of inverse alignment, with relative or absolute error, and with complete or partial examples. Section 3 reduces these variations to linear programming, and develops an iterative approach to partial examples. Finally, Section 4 presents results from experiments on recovering benchmark protein alignments when using learned parameters for both pairwise and multiple sequence alignment.
2 INVERSE ALIGNMENT AND ITS VARIATIONS
The conventional sequence alignment problem is: given a pair of sequences and a scoring function $f$ on alignments, find an alignment $A$ of the sequences that has an optimal score under $f$. The inverse alignment problem turns this around: given an example alignment $A$ of a pair of sequences, find parameter values for scoring function $f$ that make $A$ be an optimal alignment of its sequences. To learn parameter values that are useful in practice for biologists, this basic form of inverse alignment, which was first studied in [16] and [31], must be generalized considerably: to multiple examples, to suboptimal alignments, to partial examples, and to handle degeneracy, each of which we consider in turn.

When function $f$ has many parameters, many example alignments $A$ are needed to determine reliable values for the parameters. Accordingly, we take the input to inverse alignment to be a large collection of example alignments. In practice, there are usually no biologically useful parameter values that make all the example alignments have optimal score. Consequently, we consider finding parameter values that make the examples score as close as possible to optimal, and we examine two optimization criteria for measuring the error between the example scores and the optimal alignment scores: minimizing relative error or absolute error. Finally, the benchmark reference alignments that are available in practice for learning parameters often consist of regions where the alignment is specified, interspersed with regions where no alignment is specified. We call such a
reference alignment a partial example, since it is only a partial alignment of its sequences. When the reference specifies a complete alignment of its sequences, we call it a complete example. Our approach to inverse alignment from partial examples builds upon a solution to the problem with complete examples, which we discuss first.
2.1 Complete Examples

Inverse alignment from complete examples with arbitrarily many parameters was first considered by Kececioglu and Kim [19]. They examined the relative error criterion, which we review below.

Let $f$ be the alignment scoring function, which gives score $f(A)$ to alignment $A$. Typically, $f$ is a function of several parameters $p_1, p_2, \ldots, p_t$, which assign scores or penalties to various alignment features such as substitutions and gaps. (For example, the standard scoring model for aligning protein sequences has 210 substitution scores for all unordered pairs of amino acids, plus two gap penalties for opening and extending a gap, for a total of $t = 212$ parameters.) We view the entire set of parameters as a vector $p = (p_1, \ldots, p_t)$. When we want to emphasize the dependence of $f$ on its parameters $p$, we write $f_p$.

The input consists of many example alignments $A_i$, where each example aligns a corresponding set of sequences $S_i$. (Typically, the examples $A_i$ are induced pairwise alignments that come from a structural multiple alignment; in this case, each $S_i$ contains two sequences.) For scoring function $f$ and parameters $p$, we write $f_p^*(S_i)$ for the score of an optimal alignment of sequences $S_i$ under $f_p$. We assume that an optimal alignment minimizes $f$. The following defines inverse alignment under the relative error criterion assuming the input contains complete examples.

Definition 1 (Inverse Alignment under Relative Error). Inverse Alignment from complete examples under the relative error criterion is the following problem. Let the alignment scoring function be $f_p$ with parameter vector $p = (p_1, \ldots, p_t)$ drawn from domain $D$. The input is a collection of complete alignments $A_1, \ldots, A_k$ that respectively align the sets of sequences $S_1, \ldots, S_k$. The output is parameter vector

$$x^* := \arg\min_{x \in D} \; E_{\mathrm{rel}}(x),$$

where

$$E_{\mathrm{rel}}(x) := \max_{1 \le i \le k} \; \frac{f_x(A_i) - f_x^*(S_i)}{f_x^*(S_i)}.$$

In other words, the output vector $x^*$ minimizes the maximum relative error of the alignment scores of the examples.

An underlying issue in this formulation of inverse alignment is that the problem is degenerate. The trivial parameter choice $x = (0, \ldots, 0)$ makes every alignment score the same, thus making each example be an optimal alignment. This undesirable solution must be ruled out, though how this is done depends on the particular form of alignment being considered. Section 3.3 presents a new approach for eliminating a general family of degenerate solutions that applies to global sequence alignment [14]. This general family includes the trivial solution as a special case.
When scoring function $f_p$ is linear in its parameters $p$, inverse alignment under relative error can be solved in polynomial time [19] as long as an optimal alignment can be computed in polynomial time for any fixed parameter choice. We review the solution in Section 3, which uses a reduction to linear programming.

The above formulation considers the maximum error of the examples because, as we will see later in Section 3.1.1, minimizing the average relative error would lead to an optimization problem with nonlinear constraints. We also consider here a new model: minimizing the absolute error. This has the advantage that we can minimize the average error of the examples and still have a formulation that is efficiently solvable by linear programming. The following defines inverse alignment under the absolute error criterion, again assuming complete examples.

Definition 2 (Inverse Alignment under Absolute Error). Inverse Alignment from complete examples under the absolute error criterion is the following problem. The input is a collection of complete alignments $A_i$ of sequences $S_i$ for $1 \le i \le k$. The output is parameter vector

$$x^* := \arg\min_{x \in D} \; E_{\mathrm{abs}}(x),$$

where

$$E_{\mathrm{abs}}(x) := \frac{1}{k} \sum_{1 \le i \le k} \Big( f_x(A_i) - f_x^*(S_i) \Big).$$

Output vector $x^*$ minimizes the average absolute error of the alignment scores of the examples. Note that this formulation is also degenerate, which is addressed in Section 3.3. We next extend to input examples that are partial alignments.
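First, as a concrete illustration of the two criteria just defined, the following sketch (ours, with made-up scores, not data from the paper) evaluates $E_{\mathrm{rel}}$ and $E_{\mathrm{abs}}$ for a toy collection of examples, assuming the example scores $f_x(A_i)$ and optimal scores $f_x^*(S_i)$ have already been computed.

```python
# A minimal sketch (ours) of the two error criteria of Definitions 1 and 2.
# Scores are alignment costs: an optimal alignment minimizes f, so we assume
# f_x(A_i) >= f_x*(S_i) > 0 for each example.

def relative_error(example_scores, optimal_scores):
    """Maximum relative error E_rel over the examples (Definition 1)."""
    return max((a - o) / o for a, o in zip(example_scores, optimal_scores))

def absolute_error(example_scores, optimal_scores):
    """Average absolute error E_abs over the examples (Definition 2)."""
    k = len(example_scores)
    return sum(a - o for a, o in zip(example_scores, optimal_scores)) / k

# Three toy examples: costs of the references versus the optima.
print(relative_error([12.0, 8.0, 5.0], [10.0, 8.0, 4.0]))  # 0.25
print(absolute_error([12.0, 8.0, 5.0], [10.0, 8.0, 4.0]))  # 1.0
```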
2.2 Partial Examples

For inverse alignment of protein sequences, the best example alignments that are available come from multiple alignments of protein families that are determined by aligning the three-dimensional structures of family members. Several suites of such benchmark alignments are currently available [24], [1], [2], [33] and are widely used for evaluating the accuracy of software for multiple alignment of protein sequences. Almost all of these benchmarks, however, are partial alignments. The benchmark alignment has regions that are reliable and where the alignment is specified, but between these regions, the alignment is effectively left unspecified. These reliable regions are usually the core blocks of the multiple alignment, which are sections of the alignment where structure is conserved across the family. Core blocks are typically gap free, though they can sometimes contain gaps.

For our purposes, a partial example is an alignment $A$ of sequences $S$ where each column of $A$ is labeled as being either reliable or unreliable, as illustrated by the sketch below. A complete example is a partial example whose columns are all labeled reliable. Note that in this definition, a partial example may label a gap column as reliable. Consequently, partial examples do not simply specify which pairs of positions to align by substitutions, but may also specify which positions should be in gaps.
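To fix the data representation, the sketch below (our illustration, not the authors' code) stores a partial example as two gapped rows plus a per-column reliability label, and extracts the maximal unreliable runs that a completion is allowed to realign.

```python
# A minimal sketch (ours) of a partial example: a pairwise alignment with
# each column labeled reliable or unreliable. Unreliable runs are the only
# regions a completion may realign; reliable columns must be preserved.

from dataclasses import dataclass
from typing import Iterator, List, Tuple

@dataclass
class PartialExample:
    row1: str             # first sequence with '-' for gap positions
    row2: str             # second sequence with '-' for gap positions
    reliable: List[bool]  # reliable[j] is True if column j is reliable

    def unreliable_regions(self) -> Iterator[Tuple[int, int]]:
        """Yield (start, end) column ranges of maximal unreliable runs."""
        j, n = 0, len(self.reliable)
        while j < n:
            if not self.reliable[j]:
                start = j
                while j < n and not self.reliable[j]:
                    j += 1
                yield (start, j)
            else:
                j += 1

# Columns 3-5 are unreliable; a completion may realign only that region.
ex = PartialExample("ACDE-FG", "ACD-KFG",
                    [True, True, True, False, False, False, True])
print(list(ex.unreliable_regions()))  # [(3, 6)]
```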
When learning parameters by inverse alignment from partial examples, we treat the unreliable columns as missing information: such columns do not specify the alignment of the example sequences. Given a partial example $A$ for sequences $S$, a completion $\bar{A}$ is a complete example for $S$ that agrees with the reliable columns of $A$. In other words, a completion $\bar{A}$ can change $A$ on the substrings that are in unreliable columns, but must not alter $A$ in reliable columns. Note that in general, a partial example $A$ does not have a unique completion $\bar{A}$. We define inverse alignment from partial examples as the problem of finding the optimal parameter choice over all possible completions of the examples.

Definition 3 (Inverse Alignment from Partial Examples). Inverse Alignment from partial examples is the following problem. The input is a collection of partial alignments $A_i$ of sequences $S_i$ for $1 \le i \le k$. The output is parameter vector

$$x^* := \arg\min_{x \in D} \; \min_{\bar{A}_1, \ldots, \bar{A}_k} \; E(x),$$

where error function $E$ is either $E_{\mathrm{abs}}$ or $E_{\mathrm{rel}}$, and symbol $\bar{A}_i$ varies over all possible completions of $A_i$. In other words, vector $x^*$ minimizes the error of the example scores over all possible completions of the partial examples.

We make a few remarks on this definition. First, this formulation finds parameter values for which optimally aligning the unreliable regions yields a completion that scores as close as possible to an optimal unconstrained alignment of the example sequences. Second, even when the reliable regions are all gap free (which is common when partial examples are obtained from the standard suites of benchmark protein alignments), this formulation still learns useful values for gap penalties, as appropriate values for gaps in the unreliable regions are necessary for the reliable regions to be part of alignments that have a close-to-optimal score. Third, note that this formulation for partial examples contains the prior formulations on complete examples as special cases.

In the next section, we reduce inverse alignment from complete examples to linear programming, and tackle partial examples by solving a series of problems on complete examples.
3 SOLUTION BY LINEAR PROGRAMMING
When the alignment scoring function $f_p$ is linear in its parameters $p$, inverse alignment from complete examples under relative error can be reduced to linear programming [19], and a similar reduction applies to absolute error. We define a linear scoring function as follows: Suppose $f$ scores an alignment $A$ by measuring $t + 1$ features of $A$ through functions $g_0, g_1, \ldots, g_t$ and combines these measures into one score through a weighted sum involving parameter vector $p = (p_1, \ldots, p_t)$ by

$$f_p(A) := g_0(A) + \sum_{1 \le i \le t} p_i \, g_i(A).$$

Then, we say $f$ is linear in parameters $p_1, \ldots, p_t$. (The initial term given by function $g_0$ usually arises when some parameters in the scoring function are fixed, such as when learning gap penalties for a known substitution scoring
matrix [16], [31], [19], in which case $g_0(A)$ measures the total substitution score of alignment $A$. When all parameters are varying, the term for $g_0$ is typically zero.)

For example, in the standard scoring model for global alignment of protein sequences, there is a substitution score $\sigma_{ab}$ for every unordered pair $a, b$ of amino acids, a penalty $\gamma$ for opening a gap, and a penalty $\lambda$ for extending a gap. (A gap in an alignment is a maximal run of either insertion or deletion columns. A gap of $\ell$ columns has cost $\gamma + \lambda \ell$, which is often called an affine gap cost.) This gives a scoring function with 212 parameters $\sigma_{ab}$, $\gamma$, and $\lambda$ for the alphabet of 20 amino acids. For this scoring model, the functions $g_{\sigma_{ab}}$ count the number of substitutions of each type $a, b$ in $A$, and functions $g_\gamma$ and $g_\lambda$ respectively count the number of gaps and the total length of all gaps in $A$. We first describe the linear programming approach for complete examples, and then discuss its extension to partial examples.
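Before doing so, to make the feature functions concrete, the following sketch (ours, with hypothetical costs) counts the features $g_{\sigma_{ab}}(A)$, $g_\gamma(A)$, and $g_\lambda(A)$ of a pairwise alignment and evaluates the linear score $f_p(A)$ under this 212-parameter model.

```python
# A minimal sketch (ours) of the linear scoring function
# f_p(A) = sum_{a,b} sigma[a,b]*g_sigma_ab(A) + gamma*g_gamma(A) + lam*g_lambda(A),
# where a gap is a maximal run of insertion columns or of deletion columns.

def features(row1, row2):
    """Count substitutions per unordered pair, number of gaps, total gap length."""
    subs = {}              # g_sigma_ab: substitution counts
    gaps = 0               # g_gamma: number of maximal gap runs
    gap_cols = 0           # g_lambda: total length of all gaps
    prev = None            # current run type: 'ins', 'del', or None
    for c1, c2 in zip(row1, row2):
        if c1 == '-':                       # insertion column
            gap_cols += 1
            gaps += (prev != 'ins')
            prev = 'ins'
        elif c2 == '-':                     # deletion column
            gap_cols += 1
            gaps += (prev != 'del')
            prev = 'del'
        else:                               # substitution (or identity) column
            pair = tuple(sorted((c1, c2)))
            subs[pair] = subs.get(pair, 0) + 1
            prev = None
    return subs, gaps, gap_cols

def score(row1, row2, sigma, gamma, lam):
    """Alignment cost f_p(A); lower is better."""
    subs, gaps, gap_cols = features(row1, row2)
    return (sum(sigma[pair] * n for pair, n in subs.items())
            + gamma * gaps + lam * gap_cols)

# Hypothetical costs in [0, 1]: identities cheap, substitutions dearer.
sigma = {('A', 'A'): 0.0, ('C', 'C'): 0.0, ('A', 'C'): 0.6}
print(score("AC-A", "ACCA", sigma, gamma=0.5, lam=0.3))  # one gap: 0.8
```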
3.1 Complete Examples

As described in detail in [19], inverse alignment from complete examples with relative error can be reduced to solving a series of linear programs. We briefly review this solution for relative error, and then show how to modify it for absolute error.

For the standard model of protein sequence alignment, the linear programs have the following form. There is a variable for each parameter $\sigma_{ab}$, $\gamma$, and $\lambda$ in the alignment scoring function $f$. The domain $D$ of these parameters is described by the following domain inequalities. When the alignment problem is to minimize the score $f$ of a global alignment, parameter values can always be shifted so they are nonnegative and then scaled so the largest magnitude is 1, without changing the problem. Thus, domain $D$ has inequalities

$$(0, 0, 0) \;\le\; (\sigma_{ab}, \gamma, \lambda) \;\le\; (1, 1, 1). \qquad (1)$$

The description of $D$ also has the inequalities

$$\sigma_{aa} \;\le\; \sigma_{ab}, \qquad (2)$$

since an identity should cost no more than a substitution involving that letter. (As an aside, inequality (2) holds for PAM [5] and BLOSUM [17] similarity scores when they are transformed into substitution costs.) The remaining inequalities, which ensure that the examples score as close as possible to optimal alignments, differ for the relative and absolute error criteria.
3.1.1 Relative Error

For the relative error criterion, we first consider the problem assuming a fixed upper bound $\delta$ on the relative error. For a given bound $\delta$, we test whether there is a feasible solution $x$ with relative error at most $\delta$ by solving a linear program. We then find the smallest $\delta^*$, to within a specified precision, for which there is a feasible solution, using binary search on $\delta$. With precision $\epsilon$, this binary search for $\delta^*$ involves solving $O(\log(\delta^*/\epsilon))$ linear programs. The feasible solution $x^*$ found at bound $\delta^*$ is then an optimal solution to inverse alignment under relative error.

Beyond the domain inequalities, the remaining inequalities in each linear program enforce that the relative error of all examples is at most $\delta$. For each example $A_i$, and every alignment $B_i$ of sequences $S_i$, the linear program for $\delta$ has an inequality

$$f_x(A_i) \;\le\; (1 + \delta)\, f_x(B_i). \qquad (3)$$
Notice that for a fixed value of $\delta$, this is a linear inequality in parameters $x$, since function $f_x$ is linear in $x$. Example $A_i$ satisfies all these inequalities iff the inequality with $B_i = B^*$ is satisfied, where $B^*$ is an optimal-scoring alignment of $S_i$ under parameters $x$. In other words, the inequalities are all satisfied iff the score of $A_i$ has relative error at most $\delta$ under $f_x$. Finding the minimum $\delta$ for which this system of inequalities has a feasible solution $x$ corresponds to minimizing the maximum relative error of the example scores.

As an aside, one could attempt to minimize the average relative error across the examples by modifying this approach as follows: Each example $A_i$ would have a corresponding error variable $\delta_i$, and inequality (3) would be modified by replacing the global constant $\delta$ with the variable $\delta_i$. The objective function would be to minimize $\frac{1}{k} \sum_{1 \le i \le k} \delta_i$. Unfortunately, inequality (3) then becomes quadratic in variable $\delta_i$ and the vector of variables $x$, which is no longer solvable by linear programming.

The above linear program given by inequalities (1), (2), and (3) has an exponential number of inequalities of type (3), since for an example $A_i$, the number of alignments $B_i$ of $S_i$ is exponential in the lengths of the sequences [11]. Nevertheless, this program can be solved in polynomial time [19] using a far-reaching result from linear programming theory known as the equivalence of optimization and separation [13]. This equivalence result states that one can solve a linear program in polynomial time iff one can solve the separation problem for the linear program in polynomial time [12], [28], [18]. The separation problem is, given a possibly infeasible vector $\tilde{x}$ of parameter values, to report an inequality from the linear program that is violated by $\tilde{x}$, or to report that $\tilde{x}$ satisfies the linear program if there is no violated inequality.

We can solve the separation problem in polynomial time for the above linear program by the following algorithm. Given a vector $\tilde{x}$ of parameter values, for each example $A_i$ we compute an optimal-scoring alignment $B^*$ of $S_i$ using $f_{\tilde{x}}$. If inequality (3) is satisfied when $B_i = B^*$, then the inequalities are satisfied for all $B_i$; on the other hand, if inequality (3) is not satisfied for $B^*$, this gives the requisite violated inequality. For a problem with $k$ examples, solving the separation problem involves computing at most one optimal alignment for each example, for a total of at most $k$ optimal alignments, which runs in polynomial time. As a consequence, the equivalence of separation and optimization implies that the full linear program can be solved in polynomial time.

This should mainly be viewed, however, as an existence proof for a polynomial-time algorithm for inverse alignment. The equivalence theorem relies on the ellipsoid method for linear programming (which is slow in practice), and does not provide a good bound on the polynomial for the running time of the resulting algorithm [13]. In practice, a polynomial-time solution to the separation problem is typically leveraged in the following cutting plane algorithm [4] for solving a linear program with inequalities $L$:
1. Start with a small subset $P$ of the inequalities in $L$.
2. Compute an optimal solution $\tilde{x}$ to the linear program given by subset $P$. If no such solution exists, halt and report that $L$ is infeasible.
3. Call the separation algorithm for $L$ on $\tilde{x}$. If the algorithm reports that $\tilde{x}$ satisfies $L$, output $\tilde{x}$ and halt: $\tilde{x}$ is an optimal solution for $L$.
4. Otherwise, add to $P$ the violated inequality returned by the separation algorithm in step 3, and loop back to step 2.
While such cutting plane algorithms are not guaranteed to terminate in polynomial time, they can be fast in practice [19]. For inverse alignment, we start with subset $P$ initialized to the domain inequalities (1) and (2), together with the nondegeneracy inequality (5) described later in Section 3.3.
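The following skeleton (ours, not the authors' implementation) shows the shape of this loop for the relative-error program at a fixed bound $\delta$; `solve_lp` and `optimal_alignment` are hypothetical stand-ins for an LP solver over the current subset $P$ and for any polynomial-time aligner acting as the separation oracle.

```python
# A sketch (ours) of the cutting plane loop of steps 1-4. Assumed helpers:
#   solve_lp(P)               -> parameter vector x optimal for subset P, or None
#   optimal_alignment(S_i, x) -> an optimal-scoring alignment of S_i under f_x
#   score(A, x)               -> evaluates f_x(A)

def cutting_plane(examples, domain_inequalities, solve_lp,
                  optimal_alignment, score, delta, max_rounds=1000):
    """Find x feasible for the relative-error LP at bound delta, or None."""
    P = list(domain_inequalities)           # step 1: start with a small subset
    for _ in range(max_rounds):
        x = solve_lp(P)                     # step 2: optimize over P
        if x is None:
            return None                     # the full program L is infeasible
        cut = None
        for A_i, S_i in examples:           # step 3: separation oracle
            B = optimal_alignment(S_i, x)
            if score(A_i, x) > (1 + delta) * score(B, x):
                cut = (A_i, B)              # inequality (3) is violated for B
                break
        if cut is None:
            return x                        # x satisfies all of L; done
        P.append(cut)                       # step 4: add the cut and repeat
    raise RuntimeError("no convergence within max_rounds")
```

The outer binary search of Section 3.1.1 then calls such a routine $O(\log(\delta^*/\epsilon))$ times, shrinking the interval for $\delta$ according to feasibility.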
3.1.2 Absolute Error

For the absolute error criterion, we modify the linear program as follows: For each example $A_i$, we have an additional error variable $e_i$. Inequality (3) for each example $A_i$ is replaced by

$$f_x(A_i) \;\le\; f_x(B_i) + e_i. \qquad (4)$$

The objective function for the linear program is to minimize

$$\frac{1}{k} \sum_{1 \le i \le k} e_i.$$

Notice that minimizing this objective function forces each $e_i$ to have the value

$$e_i = f_x(A_i) - f_x^*(S_i).$$

Thus, an optimal solution $x^*$ to this linear program with inequalities (1), (2), and (4) gives a parameter vector that minimizes the average absolute error of the example scores. Again the program has exponentially many inequalities of type (4), but the same separation algorithm that computes an optimal alignment $B^*$ solves the separation problem in polynomial time, so in principle the linear program can be solved in polynomial time. In practice, we use a cutting plane algorithm as described above.
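To show what one round's linear program looks like under this criterion, the sketch below (ours, with made-up feature counts) builds the objective and the currently known cuts of type (4) and solves them with scipy's LP solver; the variables are the parameter vector $x$ followed by the error variables $e_i$.

```python
# A sketch (ours) of one round's LP for the absolute error criterion.
# Each cut (4) reads: g(A_i) . x <= g(B_i) . x + e_i, where g(.) is the
# feature-count vector of an alignment (here t = 3 toy parameters, k = 2).

import numpy as np
from scipy.optimize import linprog

t, k = 3, 2
g_A = np.array([[4.0, 1.0, 2.0], [3.0, 2.0, 3.0]])  # features of examples A_i
g_B = np.array([[3.0, 1.0, 1.0], [3.0, 1.0, 2.0]])  # features of known cuts B_i

# Objective: minimize (1/k) sum_i e_i; the parameters themselves cost nothing.
c = np.concatenate([np.zeros(t), np.full(k, 1.0 / k)])

# Cuts (4): (g(A_i) - g(B_i)) . x - e_i <= 0.
A_ub = np.zeros((k, t + k))
A_ub[:, :t] = g_A - g_B
A_ub[np.arange(k), t + np.arange(k)] = -1.0
b_ub = np.zeros(k)

# Domain (1): 0 <= x_j <= 1; error variables are nonnegative.
bounds = [(0.0, 1.0)] * t + [(0.0, None)] * k

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print(res.x[:t], res.x[t:])  # here the degenerate x = 0 wins: all errors 0,
                             # illustrating why a nondegeneracy cut such as
                             # inequality (5) of Section 3.3 must be added.
```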
3.2 Partial Examples

Inverse alignment from partial examples involves optimizing over all possible completions of the examples. While for partial examples we do not know how to efficiently find an optimal solution, we present a practical iterative approach which, as demonstrated in Section 4, finds a good solution.

We start with an initial completion $\bar{A}_i^{(0)}$ for each partial example $A_i$. This initial completion may be formed by optimally aligning each unreliable region of $A_i$ using a default parameter choice $x^{(0)}$; we call this the default initial completion. (In practice, for $x^{(0)}$ we use a standard substitution matrix [17] with appropriate gap penalties.) Alternatively, an initial completion may be obtained by simply using the complete alignment given by both the unreliable and reliable columns of the partial example; we call this the trivial initial completion.

We then iterate the following process for $j = 0, 1, 2, \ldots$. Compute an optimal parameter choice $x^{(j+1)}$ by solving the inverse alignment problem on the complete examples $\bar{A}_i^{(j)}$. Given $x^{(j+1)}$, form a new completion $\bar{A}_i^{(j+1)}$ of each example $A_i$ by:

1. computing an alignment of each unreliable region that is optimal with respect to parameters $x^{(j+1)}$, and
2. concatenating these alignments of the unreliable regions, alternating with the alignments given by the reliable regions, to form a new complete example $\bar{A}_i^{(j+1)}$.
Such a completion optimally stitches together the reliable regions of the partial example, using the current estimate for parameter values. This iterative scheme alternates a step of finding optimal parameters with a step of finding optimal completions, yielding a sequence of parameters and completions:

$$\bar{A}^{(0)} \;\mapsto\; x^{(1)} \;\mapsto\; \bar{A}^{(1)} \;\mapsto\; x^{(2)} \;\mapsto\; \cdots.$$
As the following shows, the error of successive parameter estimates decreases monotonically.

Theorem 1 (Error Convergence for Partial Examples). For the iterative scheme for inverse alignment from partial examples, denote the error in score for iteration $j \ge 1$ by

$$e_j := E\big(x^{(j)}, \bar{A}^{(j-1)}\big),$$

where the right-hand side measures error criterion $E_{\mathrm{abs}}$ or $E_{\mathrm{rel}}$ for the given parameters on the given completions. Then,

$$e_1 \;\ge\; e_2 \;\ge\; e_3 \;\ge\; \cdots \;\ge\; e^*,$$

where $e^*$ is the optimum error for inverse alignment from partial examples $A_i$ under criterion $E$.

Proof. Since $\bar{A}_i^{(j)}$ is an optimal-scoring completion of $A_i$ with respect to parameters $x^{(j)}$,

$$f_{x^{(j)}}\big(\bar{A}_i^{(j)}\big) \;\le\; f_{x^{(j)}}\big(\bar{A}_i^{(j-1)}\big).$$

In other words, under parameters $x^{(j)}$, the new completions $\bar{A}^{(j)}$ score just as close to optimal as the old completions $\bar{A}^{(j-1)}$. This implies that with respect to error,

$$E\big(x^{(j)}, \bar{A}^{(j)}\big) \;\le\; E\big(x^{(j)}, \bar{A}^{(j-1)}\big),$$

whether we consider relative or absolute error. So for the new completions $\bar{A}^{(j)}$, error $e_j$ is achievable, as witnessed by $x^{(j)}$. Since the optimum error for $\bar{A}^{(j)}$, given by the new parameters $x^{(j+1)}$, cannot be worse,

$$e_{j+1} \;\le\; e_j.$$

Furthermore, $e^*$ lower bounds the error for all completions. □
Theorem 1 implies that the error of this iterative scheme converges, as the error across iterations forms a nonincreasing sequence that is bounded from below. (The error may converge, however, to a value larger than the optimum error $e^*$.) As shown in Section 4, choosing a good initial completion can reduce the final error. In practice, we iterate
this scheme until the improvement in error becomes too small or a bound on the number of iterations is reached. As error improves across the iterations, recovery of the examples generally improves as well.
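A compact rendering of this alternation (ours; `learn_parameters`, `realign_unreliable`, and `error` are hypothetical stand-ins for the linear program of Section 3.1, a constrained aligner for the unreliable regions, and the criterion $E$):

```python
# A sketch (ours) of the iteration of Section 3.2. Assumed helpers:
#   learn_parameters(completions) -> x, inverse alignment on complete examples
#   realign_unreliable(ex, x)     -> a completion of partial example ex that
#                                    optimally realigns its unreliable regions
#   error(x, completions)         -> evaluates E_abs or E_rel

def learn_from_partial(examples, x0, learn_parameters, realign_unreliable,
                       error, max_iters=20, tol=1e-4):
    """Alternate parameter fitting with recompletion until the error stalls."""
    # Default initial completion: align unreliable regions under x0.
    completions = [realign_unreliable(ex, x0) for ex in examples]
    prev_err = float("inf")
    x = x0
    for _ in range(max_iters):
        x = learn_parameters(completions)         # fit x to current completions
        err = error(x, completions)               # the e_j of Theorem 1
        if prev_err - err < tol:                  # improvement too small: stop
            break
        prev_err = err
        completions = [realign_unreliable(ex, x)  # restitch each example
                       for ex in examples]
    return x
```

By Theorem 1, the successive errors computed in this loop form a nonincreasing sequence, so the stopping test is eventually met.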
3.3 Eliminating Degeneracy

In general, the preceding formulations of inverse alignment, for both relative and absolute error, when applied to global sequence alignment have a family $F$ of degenerate solutions:

$$(\sigma_{ab}, \gamma, \lambda) \;\in\; F \;:=\; \big\{ (2c,\, 0,\, c) \;:\; c \ge 0 \big\}.$$

Every parameter choice $x \in F$ from this family makes all global alignments of an example's sequences have the same score, namely, $c(m + n)$ for two sequences of lengths $m$ and $n$: each substitution column consumes two characters at cost $2c$, and each gap column consumes one character at cost $c$ (with gap-open penalty zero), so every alignment costs $c$ per character. Such an $x$ makes each example be an optimal scoring alignment, hence such an $x$ is an optimal solution for every instance of inverse alignment. Of course, these degenerate solutions are not of biological interest.

To eliminate them, we use the following approach. Let nondegeneracy threshold $\vartheta$ be the difference between the expected cost of a random substitution (of different letters) and the expected cost of a random identity, measured for a default substitution scoring matrix. For a sound scoring scheme that distinguishes between substitutions and identities, this difference should be positive. The value of $\vartheta$ for the commonly used BLOSUM [17] and PAM [5] matrices for standard amino acid frequencies is around 0.4 when substitution scores are translated and scaled to costs in the range [0, 1]. Values of $\vartheta$ for a wide range of BLOSUM matrices are shown in the next section in Fig. 2. Given a value for threshold $\vartheta$, we add to the linear program the nondegeneracy inequality

$$\frac{\sum_{a < b} 2\, q_a q_b\, \sigma_{ab}}{\sum_{a < b} 2\, q_a q_b} \;-\; \frac{\sum_{a} q_a^2\, \sigma_{aa}}{\sum_{a} q_a^2} \;\ge\; \vartheta, \qquad (5)$$

where $q_a$ denotes the background frequency of amino acid $a$: the expected cost of a random substitution must exceed the expected cost of a random identity by at least $\vartheta$. Every member of family $F$ violates this inequality for $\vartheta > 0$, since under $F$ substitutions and identities have the identical cost $2c$.
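The threshold itself is straightforward to compute from a default matrix; the sketch below (ours, on a toy two-letter alphabet) evaluates the left-hand side of inequality (5), that is, the expected cost of a random substitution of different letters minus the expected cost of a random identity.

```python
# A sketch (ours) of computing the nondegeneracy threshold of inequality (5),
# given substitution costs already shifted and scaled into [0, 1] and
# background letter frequencies q.

def nondegeneracy_threshold(cost, q):
    """cost[(a, b)] is the cost of unordered pair a <= b; q[a] is a frequency."""
    letters = sorted(q)
    sub_mass = sub_cost = id_mass = id_cost = 0.0
    for i, a in enumerate(letters):
        id_mass += q[a] * q[a]
        id_cost += q[a] * q[a] * cost[(a, a)]
        for b in letters[i + 1:]:
            w = 2.0 * q[a] * q[b]   # pair a < b drawn in either order
            sub_mass += w
            sub_cost += w * cost[(a, b)]
    return sub_cost / sub_mass - id_cost / id_mass

# Toy alphabet: identities cost 0.1, the one substitution costs 0.6.
cost = {('A', 'A'): 0.1, ('C', 'C'): 0.1, ('A', 'C'): 0.6}
q = {'A': 0.6, 'C': 0.4}
print(nondegeneracy_threshold(cost, q))  # 0.6 - 0.1 = 0.5
```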
Every parameter choice x 2 F from this family makes all global alignments of an example’s sequences have the same score, namely, cðm þ nÞ for two sequences of lengths m and n. Such an x makes each example be an optimal scoring alignment, hence such an x is an optimal solution for every instance of inverse alignment. Of course, these degenerate solutions are not of biological interest. To eliminate them, we use the following approach. Let nondegeneracy threshold ' be the difference between the expected cost of a random substitution (of different letters) and the expected cost of a random identity, measured for a default substitution scoring matrix. For a sound scoring scheme that distinguishes between substitutions and identities, this difference should be positive. The value of ' for the commonly used BLOSUM [17] and PAM [5] matrices for standard amino acid frequencies is around 0.4 when substitution scores are translated and scaled to costs in the range [0, 1]. Values of ' for a wide range of BLOSUM matrices are shown in the next section in Fig. 2. Given a value for threshold ', we add to the linear program the nondegeneracy inequality P P 2 a