A simple iterative approach to parameter optimization

Alexander Zien, Ralf Zimmer, Thomas Lengauer

GMD - German National Research Center for Information Technology, Institute for Algorithms and Scientific Computing (SCAI), Schloß Birlinghoven, D-53754 Sankt Augustin, Germany, {Alexander.Zien,Ralf.Zimmer,Thomas.Lengauer}@gmd.de
Abstract

Various bioinformatics problems require optimizing several different properties simultaneously. For example, in the protein threading problem, a linear scoring function combines the values for different properties of possible sequence-to-structure alignments into a single score to allow for unambiguous optimization. In this context, an essential question is how each property should be weighted. As the native structures are known for some sequences, the implied partial ordering on optimal alignments may be used to adjust the weights. To resolve the arising interdependence of weights and computed solutions, we propose a novel approach: iterating the computation of solutions (here: threading alignments) given the weights and the estimation of optimal weights of the scoring function given these solutions via a systematic calibration method. We show that this procedure converges to structurally meaningful weights that also lead to significantly improved performance on comprehensive test data sets, as measured in different ways. The latter indicates that the performance of threading can be improved in general.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. RECOMB 2000 Tokyo Japan. Copyright ACM 2000 1-58113-186-0/00/04 $5.00

Introduction

Many bioinformatics optimization approaches involve empirical scoring schemes that combine different terms in a linear function. Usually, the different terms (called scoring contributions) model different aspects of the optimization problem on a phenomenological basis. Often it is unclear how the different contributions should be weighted. In cases where a suitable measure that could serve as a model for the score is absent, regression methods are not applicable. Instead, a given preference of some possible solutions (candidates) over others may be used to adjust the scoring parameters in order to reflect the implied partial ordering as closely as possible. We developed methods [21] that can be used in such situations. They extend a basic idea by Maiorov and Crippen [12], namely to alter the parameterization in order to obey the following rule: a (good) reference solution should score higher than any (less good) alternative solution candidate. However, this basic rule may be formulated in different problem specifications. We developed methods for two problem variants. Both require as input a set of solution candidates for a problem instance, where each candidate is characterized by some multi-dimensional descriptor c. The methods determine a weighting vector w such that the induced scoring wc results in a ranking of the data that best approximates (i) the preference of some distinct data points over the rest, or, alternatively, (ii) the preference of any data point from a subset over the rest. The first problem variant is closely related [4] to the pattern recognition problem (e.g. [8, 18]). Based on a linear programming (LP) method for pattern recognition [5], we developed a method for this variant (called VALP, for Violated Inequality Minimization Approximation LP). A similar linear programming formulation was suggested by Akutsu and Tashimo and also applied to protein threading [1]. For the other problem variant, we developed an approach (called CALP, for Cone Intersection Maximization Approximation LP) based on polyhedral intersections. We know of no other methods for the latter problem.

In this paper, we discuss the application of the calibration methods to the threading program 123D [3]. Here, we have the problem that the input to the calibration methods depends on the setting of the weights w to be optimized, establishing a circular dependency. In order to resolve this recursion, we propose an iterative procedure that alternates estimating optimal weights for given alignments and computing corresponding solutions for given weights. Our goals are, on one hand, to describe a general parameter optimization method for bioinformatics problems and, on the other hand, to improve the performance and sensitivity of the threading method 123D in particular, thereby demonstrating the efficacy of the calibration procedure. In the following section we introduce our novel protocol to apply data generation and calibration iteratively in a cycle. Section 3 provides a short review of our calibration methods; further details are given in the appendix. In Section 4 we briefly describe the threading method with the relevant parameters and data sets. The results of the calibration are evaluated for two different threading protocols and discussed in Section 5.
2 An iterative calibration/optimization approach
Here, we describe a general approach which, for a given optimization problem together with a method to compute solution candidates for it (application), simultaneously computes optimized weights w for the scoring function (score) and the corresponding solutions. The scoring function of the application problem needs to follow the general form score = wc. Thus, score is a linear combination of the scoring contributions c = (c_1, ..., c_d) and their weights w = (w_1, ..., w_d). The iterative calibration method requires as input merely the knowledge of preferences of some solution candidates over others (standard-of-truth). During the iterations, it aims at maximizing a measure of correspondence (solution rate) between the computed solutions and the given standard-of-truth. For our purposes here, the standard-of-truth is derived from the SCOP classification of protein structures. Here, the preferences take the following form: an alignment of a sequence with a structure from the same structural class should score better than an alignment of that sequence with an unrelated structure. The success can be measured with respect to different threading protocols: In fold recognition, for each sequence the single best-scoring fold is selected from a representative database of folds. Here, the success rate is the number of sequences that are assigned to a fold similar (as classified by SCOP) to their native fold. In related pair recognition, the aim is to detect for each sequence all related structures in a set of proteins. Here, the success rate is the number of correctly found related pairs. The iterative procedure to optimize the weight parameters w and the solutions is outlined in Figure 1. In a cycle, the execution of the application program and the calibration alternate. This illustrates a kind of dual view on the scoring function: On the one hand, w is the vector of parameters for the application that computes solutions, each described by a vector c.
On the other hand, w is the solution of the calibration based on those contribution vectors c. The procedure works as follows: starting from an
Figure 1: Overview of the iterative calibration/optimization procedure.
initial setting of weights w as input, the application computes solution candidates (either optimal or approximately optimal wrt. score) together with the corresponding vectors c of scoring contributions. Each candidate is classified as good or bad based on the external standard-of-truth. The classified contribution vectors serve as input for the calibration. The goal is to determine weights w such that the distinction between good and bad is accounted for as well as possible by the scoring function. In the next iteration the newly determined w is used to recompute the solution candidates, which are in general different from those of the previous iteration. Note that, for the new candidates, the previously optimized weights w are not necessarily well suited for the distinction between good and bad candidates. The solution rate of w and the implied ranking of candidates c with respect to the standard-of-truth may be measured at two distinct states within the iteration cycle: first, for each iteration, taking w after the calibration (given the candidates c) and, second, taking c after executing the application (given the weighting w). During the iterations both measures should converge independently towards a common limit, thereby improving on the solution rate.
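The cycle just described can be sketched in a few lines. The following is a toy illustration only, not the actual 123D/VALP implementation: the candidate pools, the perceptron-style weight update (standing in for the LP-based calibration step), and all function names are our assumptions.

```python
import numpy as np

def application(w, pools):
    """Application step: rank each instance's candidate pool by score = w . c."""
    return [sorted(pool, key=lambda sc: -np.dot(w, sc[1])) for pool in pools]

def calibration(ranked, w, lr=0.1):
    """Toy stand-in for the calibration (VALP): whenever a bad candidate does
    not score worse than the highest-ranked good one, move w toward the
    difference vector x = c(ref) - c(s), i.e. toward satisfying w . x > 0."""
    w = w.copy()
    for pool in ranked:
        ref = next(c for good, c in pool if good)   # re-selected reference candidate
        for good, c in pool:
            if not good and np.dot(w, ref - c) <= 0:  # violated inequality
                w += lr * (ref - c)
    return w

def iterate(w, pools, n_iter=10):
    """Alternate computing solutions given w and weights given solutions."""
    for _ in range(n_iter):
        ranked = application(w, pools)   # solutions given the weights
        w = calibration(ranked, w)       # weights given the solutions
    return w

# Two instances with 2-dimensional descriptors c; True marks 'good' candidates.
pools = [[(True, np.array([2.0, 0.0])), (False, np.array([0.0, 2.0]))],
         [(True, np.array([1.5, 0.5])), (False, np.array([0.5, 1.5]))]]
w = iterate(np.array([0.0, 0.0]), pools)
top = [pool[0][0] for pool in application(w, pools)]  # is the top candidate good?
```

After a few iterations the top-ranked candidate of each instance is a good one, which is exactly the solution-rate improvement the procedure aims for.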
3 Calibration methods
In this section, we briefly describe the basic calibration methods that are utilized in each iteration. More details are given in the Appendix. Consider the space of all possible solutions for a certain optimization problem. In threading, this is the set of all alignments of one sequence against a set of structures. Let every such solution candidate s be represented by a descriptor c = c(s), comprising a vector of d real-valued contributions
to the score. Assuming a linear scoring function, the score of a candidate can be written as

    score(s) = wc = Σ_{p=1..d} w_p c_p.    (1)

Here, w_p is the weight for the p-th contribution with value c_p. In essence, the goal of our methods is to determine weights such that 'good' candidates score better than 'bad' candidates. We confine our algorithms to samples of the respective solution candidate spaces, since complete coverage is usually infeasible¹. Let S^i be a sample of solution candidates for a problem instance i (in threading, i is a sequence and a set of structures) out of a training set. For each training instance, an external standard-of-truth must be known, so that each candidate (i.e. alignment) s ∈ S^i can be classified as 'good' or 'bad' (i.e., structurally similar or not). Details on how to choose instances, samples and standard-of-truth are given in the application section. The ultimate goal of any calibration is to let the ranking of candidates implied by their scores exactly reflect their quality. However, this is usually impossible with realistic data from our applications. The reason is the general inability of an oversimplified scoring function to accurately represent the complex situation of structural similarity. In order to deal with this situation, we consider two different relaxed measures of success of the calibration, resulting in two problem specifications:

Violated Inequality Minimization (VIM): Minimize the number of candidates s ∈ S^i that score better than a particular good candidate, called the reference candidate ŝ^i, summed over the instances i.

Cone Intersection Maximization (CIM): Maximize the number of problem instances for which a good candidate scores best.

By formula (1), each inequality score(ŝ^i) > score(s) regarded in VIM can be written as wx > 0, with x := c(ŝ^i) − c(s). Each vector x is normal to a hyperplane that divides the d-dimensional space of weight vectors into two half-spaces. Considering a set of inequalities further dissects the weight space into areas within which the same subsets of inequalities hold. These areas are the intersections of the respective positive and negative half-spaces. Thus, they have the shape of polyhedral cones emerging from the origin (cones, for short). In order to satisfy all inequalities simultaneously, the weight vector must be chosen from the intersection of all positive half-spaces (the solution cone). Frequently, the inequality systems derived from realistic application data are inconsistent, i.e. the solution cone is empty. Minimizing the number of violated inequalities (VIM) then amounts to finding a cone that consists of as many positive half-spaces as possible. However, since the number of cones grows exponentially in d, it is infeasible to examine all cones individually. In fact, it can be shown that VIM is NP-hard, and the same holds for CIM [4, 21]. We therefore present the approximation algorithms VALP for VIM and CALP for CIM, respectively. A detailed description can be found in the Appendix.

4 Threading

The application we apply our iterative calibration procedure to is the threading program 123D [3]. Threading maps protein sequences onto known protein structures in order to find the most compatible fold using empirical objective functions. These objective functions (similarity measures) are statistically derived from known protein structures via the inverse Boltzmann law [16], using established compatibilities of protein sequences with these structures. Many proposals for such empirical potentials have been made during the last years [10]. General agreement has not yet been reached, neither about the scoring function nor about the threading method to optimize it. It is apparent, however, that such scoring functions have to involve several contributions, the exact balancing of which influences the outcome and the performance of the respective threading experiment in non-trivial ways. Here, we show that the careful choice of such weighting parameters is crucial by applying calibration methods which significantly improve the fold recognition rate. For an orthogonal view on the parameter estimation problem and alternative approaches to optimal parameter settings, see the literature on parametric alignment [9, 19, 20, 22]. Unfortunately, for more than two parameters, the parametric alignment methods are still much too slow to allow for optimal parameter selection. Therefore, we use the calibration methods described above (Section 3) for this purpose.

4.1 123D
Many methods have been proposed for threading [10]. In order to perform a systematic analysis of relevant parameters and parameter settings for the calibration methods we need a very fast program. Therefore, we chose the 123D [3] method. This program, as implemented in the ToPLign package [13], is extremely fast and optimizes an interesting empirical scoring potential. Moreover, improving its sensitivity would further its large-scale application to real-world problems. 123D uses a dynamic programming algorithm to compute, for a given sequence seq and given weights w, a ranked list of a set of folds F. The ranking is derived from the scores of the corresponding optimal alignments. The program has been developed to optimize a new type of scoring potentials (called contact capacity potentials, CCP [3]), but also exploits standard scoring contributions, i.e. sequence and secondary structure preferences. One goal of the experiment described here is to evaluate a CCP potential and to derive an optimal weighting of this contribution in order to make the best possible use of it in similarity searches. 123D threads a sequence of length n = 150 amino acids in about 15 minutes CPU time on a current workstation/PC against the entire set of about 13,000 chains in the PDB [6]. Thus, a threading run against the representative set of 251 proteins used in this study requires only about 20 seconds.

¹ Available approaches to complete coverage of candidate spaces of exponential size in polynomial time [2] are restricted to the consistent case.

4.2 Relevant parameters

For the scope of this paper we consider six terms of the objective function used in 123D: a sequence score (seqp), a secondary structure preference (ssp), a 'local' contact capacity potential (ccpl), a 'global' contact capacity potential (ccpg), and an affine gap penalty function with gap insertion costs (gi) and gap extension costs (ge) [3]. These terms are evaluated independently for three types of secondary structure elements: alpha helix (H), beta strand/extended (E), and all other conformations/loops (L). The rationale behind this is the different degree of structural conservation, and the corresponding level of confidence, in secondary structures, structural cores, and loops. This results in 18 parameters relevant for the following scoring function used in our threading experiment:

    score_A(seq, f) = Σ_{e ∈ {H,E,L}} Σ_{t ∈ {seqp,gi,ge,ssp,ccpl,ccpg}} w_{e,t} c^A_{e,t}(seq, f)

For any setting of the weight parameters w, 123D is guaranteed to compute optimal sequence-structure alignments, i.e. an alignment A of the sequence seq with the fold f ∈ F with the maximal score score(seq, f) = max_A score_A(seq, f).² Hence, for any threading instance with given weights w, 123D computes, for each sequence seq, |F| optimal alignments and the corresponding scoring contributions, which are used as the sample of candidates in the calibration procedure.

² The respective scoring contributions c_A for a given alignment A are defined by summing the respective terms over the positions of the given alignment A.

4.3 Selection of Test Data

A test (TS) and training (TR) set for parameter calibration for fold recognition should fulfill two criteria: First, we should have reasonably sized sets of positive and negative examples, i.e. pairs of structurally compatible and non-related proteins, disjoint for training and test sets. Second, the examples should be difficult, i.e. the sequence identity of the pairs should be below 30% and thus presumably hard to detect with sequence methods alone. For our calibration purposes here we used the sets described in [11] and [17], consisting of 251 single-domain structures, 80 of which are classified into 11 families according to structural criteria.

4.4 Calibration
For the calibration approach described here, we regard the following kind of fold recognition experiment. Given a problem instance consisting of defined weights w, a sequence seq and a set of representative folds F, we first compute the set of optimal alignments of seq with each fold from F with respect to score. A 'similar' fold f with sufficiently strong 'confidence' (high score) in the similarity may be used to predict a putative fold for seq and in some cases allows for using f as a template structure to derive a 3D model for seq. For the fold recognition problem, in addition, a partition of F into disjoint fold families is given. We say that a sequence seq is recognized by its family F_seq if any member of F_seq (except for the native fold f_seq) scores best. We have to make sure that seq is in F in order to provide the standard-of-truth needed for both the calibration (training) and the subsequent evaluation (test). In this setup, the training set TR is the union of certain families and a test set TS is another union of families disjoint from TR. The candidates for the calibration of weight parameters are the alignments computed via the threading of all sequences from TR. Good candidates are all aligned sequence-structure pairs of the same family and bad candidates are all other alignments. This situation perfectly fits the CIM problem definition. The goal of the calibration, as defined by CIM, is to find weights w such that the number of recognized folds is maximized. In order to apply the algorithms developed for VIM to the fold recognition problem, we need to select a reference solution f̂_seq for every problem instance seq. This can be the optimal alignment with any similar structure (any f ∈ F_seq). Here, we choose the most similar structure from F_seq according to structural superposition.

Although VIM aims at scoring the reference structures better than all others, it may happen that, after the weight optimization, different family members score better for some sequences. Accordingly, in each iteration, we re-select the reference solution as follows: taking the ranking of the current threading result, f̂_seq is set to that member of F_seq which ranks highest. In our experiments, this strategy appears to converge quickly, both with respect to the recognition rate and the resulting weights.
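The VIM-style calibration step can be approximated with an off-the-shelf LP solver. The following sketch is a simplified stand-in for VALP, not the published algorithm: for difference vectors x_j = c(reference) − c(s_j) it minimizes the total slack needed to satisfy w·x_j ≥ 1, where the right-hand side 1 is an arbitrary normalization exploiting the scaling invariance of the inequalities; the use of scipy and the function name are our assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def valp_sketch(X):
    """LP relaxation of VIM: minimize the sum of slacks xi_j >= 0 subject to
    w . x_j + xi_j >= 1 for every difference vector x_j (row of X).
    A consistent system is solved with all slacks at zero; an inconsistent
    one yields a w that softly minimizes the total violation."""
    X = np.asarray(X, dtype=float)
    m, d = X.shape
    cost = np.concatenate([np.zeros(d), np.ones(m)])   # objective: sum of slacks
    A_ub = np.hstack([-X, -np.eye(m)])                 # -x_j . w - xi_j <= -1
    b_ub = -np.ones(m)
    bounds = [(None, None)] * d + [(0, None)] * m      # w free, slacks nonnegative
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:d]

# Two consistent inequalities: any w with sufficiently large w1 - w2 works.
w = valp_sketch([[2.0, -2.0], [1.0, -1.0]])
```

For a consistent toy system like the one above, the returned w satisfies every inequality with zero slack; on real threading data the systems are typically inconsistent, and the slack variables absorb the unavoidable violations.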
5 Results

For the calibration of threading weights we have performed several experiments, each consisting of an iteration of parameter calibration with VALP and CALP, respectively, followed by the recomputation of the threading alignments (see Section 4). Table 1 shows the recognition rates using the different calibration methods during the iterations. The results are given for both the training and test sets. Whereas the procedure does not work for the CALP calibration (as the objective function used to approximate the maximum number of candidates scoring best is not good enough in our current version, see equation 4 in the Appendix), for VALP the procedure works well (although VALP is not directly tailored towards the fold recognition objective) and we observe a significant increase of the recognition rate.

Figure 2: Values of the 18 weights after the first (a) and the last (b) iteration. The six groups of three bars each represent (left to right) seqp, gi, ge, ssp, ccpl, ccpg, subdivided by L (coil), H (helix) and E (sheet).

Figure 2 shows the parameter settings for the 18 parameters calibrated during the iterations; the top part shows the values after iteration 1, the bottom part the final values after iteration 7. Starting from uniform weights, we observe a remarkable change in the weights already in the first step (leading to a slightly improved recognition rate from 35 to 36). The weights of the three sequence parameters are quite high as compared to most other parameters. Interestingly, the weights for gap insertion are high in strands (E) but not in loops (L) and helices (H), and for both loops and strands they are much higher than for helices. Gap extension weights, on the other hand, are universally more important in helices. The secondary structure preference (ssp) is first set to zero for loops and helices and to high weights in strands. In the following iterations, ssp becomes, and consistently remains, important only for H and vanishes completely for L and E (peak at parameter position 11, ssp-H). For the contact capacity potential weightings we start with a heavy weight only for the local CCP for helices (ccpl-H), which nicely corresponds with the intuition behind the local potentials. In the following iterations, we see a consistent and converging pattern, keeping the ccpl-H parameter almost at the same weight and increasing the weight for the global CCP in strands (ccpg-E), again exactly as expected for the local/global CCPs. In general, we observe a setting of the parameters which, within certain intervals, fluctuates around reasonable parameter weightings, thereby increasing the recognition rate by about 40%. The fluctuations can be explained by the still too small data sets, the heuristic sampling strategy, the local optima computed in the calibrations, and by the small number of iterations. Maybe, however, such fluctuations are inherent in the procedure due to the dualistic nature of the problem: iterating heuristic (numerical) parameter fitting and the computation of optimal (discrete) alignments.

5.1 Fold recognition
Figure 3 visualizes the development of the fold recognition rate during the iterations (for a detailed discussion see [11]). An analysis of the individual recognized sequences over the iterations also reveals that the procedure is well behaved, i.e. when improved weights lead to additional recognized sequences, only very few previously recognized ones get lost. Here, we evaluate the improvements on the fold recognition performance for an extended benchmark set. We use the set of protein structures defined by Brenner et al. [7] (PDB40), which contains 1407 protein structures with at most 40% mutual sequence identity. This set is comprehensive with respect to the known protein structures. We define a pair of those proteins as superfamily related or fold related if they are in the same superfamily or fold class according to the SCOP [14] classification, respectively. With the optimized parameters we improve the correct superfamily assignments from 42% to 52% as compared to the original parameters, if we assign all sequences to a superfamily. However, 14% of the sequences do not have a related pair in PDB40 on the superfamily level, and for optimal score thresholds below which sequences are not assigned to a superfamily, the improvement is from 43% to 54%. For the fold level of the SCOP classification, we observe similar improvements on the overall performance.

Table 1: Number of correctly recognized folds as defined in the text from the training set (upper part) and test set (lower part). This measure of success is shown for the standard weights (iteration zero) and for the following iteration cycles for the different algorithms. As a point of reference, a purely sequence based approach recognizes 31 of 81 folds for the training set and 17 of 74 for the test set.

    TR    zero  1st      2nd      3rd       4th       5th       6th       7th
    CALP  35    34 (-1)  36 (+1)  30 (-5)   30 (-5)   30 (-5)   31 (-4)   28 (-7)
    VALP  35    36 (+1)  44 (+9)  46 (+11)  49 (+14)  48 (+13)  46 (+11)  48 (+13)

    TS    zero  1st      2nd      3rd       4th       5th       6th       7th
    CALP  26    28 (+2)  24 (-2)  17 (-9)   19 (-7)   15 (-11)  16 (-10)  17 (-9)
    VALP  26    29 (+3)  31 (+5)  29 (+3)   31 (+5)   30 (+4)   30 (+4)   30 (+4)

Figure 3: Fold recognition rate (in percent) initially (0) and for seven iterations (1-7). Training set: recognition rate after calibration (dark) and after recomputation (light); test set: rate after alignment computation (medium).

5.2 Recognition of related pairs

In this section we discuss the performance of our calibrated threading parameters for the detection of all related structures for a sequence. The optimized parameters are obtained as described above for the fold recognition test, without regard to the current benchmark, neither the set nor the evaluation protocol. The recognition of related pairs profits from the calibration only via the improved performance of the threading procedure. Thus, here, not only is the test set far more extensive than the training set (of which 50 sequences and 153 structures re-appear among the 1407 proteins of PDB40), but the recognition protocol is also different. First, we threaded all 1407 sequences of the PDB40 set against all 1407 structures, computing 1.96 million alignments both for the original and the optimized parameters. Then, we produced the plots shown in Figures 4 and 5 by successively decreasing the score threshold for recognition in order to adjust either the percentage of assigned folds or the specificity (TP/(TP+FP), where TP = True Positives and FP = False Positives). In Figure 4 we compare the error rates (rate of false positives) vs. the sensitivity (TP/(TP+FN), where FN = False Negatives) of the two experiments. We could increase the sensitivity from 5% to almost 12% for a rate of 0.05% false positives. In Figure 5 we plot the specificity against the sensitivity. Here, we could improve the sensitivity, e.g. for a specificity of 50%, from 3% fold or 6% superfamily related pairs to 6% and 12%, respectively. The two figures clearly demonstrate that we could more than double the performance of 123D with optimized parameters even for a different benchmark protocol.

6 Conclusions

We applied novel calibration techniques to the parameter adjustment of empirical scoring functions of the threading tool 123D. Significant improvements could be achieved for two different applications of this tool, despite the fact that the calibration has been done for the first application only. This demonstrates that we improved the threading performance in general. Our conclusion is strengthened by additional observations: (i) the calibrated weights are structurally meaningful; (ii) the performance improves for local alignments, while the calibration process was performed for global alignments only; (iii) preliminary analyses show that the alignment accuracy, measured as the number of structurally correctly aligned residues, increases by 18% on average on the TR/TS benchmark.
Figure 4: Rate of false positives for related pair recognition on the superfamily level of the SCOP hierarchy with optimized (a) and original (b) parameter settings.

Figure 5: Sensitivity-specificity plot for related pair recognition with optimized (a) and original (b) parameter settings. The lines represent the recognition of the native structure (medium), the superfamily members (dark) and the fold members (light) according to SCOP.
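The curves of Figures 4 and 5 are obtained by sweeping the score threshold downwards. The bookkeeping behind such a sweep can be sketched as follows (toy scores; the pair format and all names are our invention), pairing specificity as used above, TP/(TP+FP), with sensitivity, TP/(TP+FN):

```python
def sweep(pairs):
    """pairs: (score, is_related) with is_related in {0, 1}.
    Lower the threshold one prediction at a time and record, at each step,
    (specificity, sensitivity) = (TP/(TP+FP), TP/(TP+FN))."""
    pairs = sorted(pairs, reverse=True)         # decreasing score threshold
    total_pos = sum(rel for _, rel in pairs)    # TP + FN is constant
    tp = fp = 0
    curve = []
    for score, related in pairs:
        tp += related
        fp += 1 - related
        curve.append((tp / (tp + fp), tp / total_pos))
    return curve

# Five toy score/label pairs; real runs sweep 1.96 million alignments.
curve = sweep([(9.1, 1), (8.5, 1), (7.2, 0), (6.8, 1), (5.0, 0)])
```

As the threshold drops, sensitivity can only grow while specificity typically decays, which is exactly the trade-off plotted in Figure 5.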
Acknowledgements

We thank our former colleague Christian Lemmen for discussions on an earlier version of this paper and Theo Mevissen for help on implementation issues. Part of this work has been funded by the BMBF under contract no. TargId 0311615.
Solving the HLI amounts to finding a normal vector w such t h a t its positive half-space completely contains the set of vectors, i.e., X C H > (w). Figure 6 illustrates the problem for the two-dimensional case. Both the HLI and the associated set of cones are invariant to vector scaling. More precisely, each DP, as well as the weight vector, can be scaled independently by arbitrary positive amounts without changing the set of satisfied inequalities. We will make use of this property in our algorithms. In out optimization problems, each candidate solution s corresponds to a description vector of real numbers. Let Cp(S) denote the unweighted contribution of property p to the score for a candidate s. Suboptimal solutions s E S in the sample should score less t h a n ~, the reference solution. This induces the following syst e m of linear inequalities:
Appendix Every vector x ~ R d \ { 0 } defines a
hype~laneH(x) :=
{ v I vx = 0 } through the origin; x is called a normal of H(x). H(x) divides the space into a posztwe hall-space and a negative hal/-space as follows H >(x):={vlvx>O},
g r ) }
,d: w p = w p / z p
}
X^{i,k} := { c(s^{i,k}) − c(s) | s ∈ S^i, s ≠ s^{i,k} }    (3)

The respective solution cone C^{i,k} consists of exactly those weight vectors w which let the (good) solution s^{i,k} score best for instance i. This implies that, for each instance, at most one good candidate can occupy the first rank. Thus, the solution cones of the HLI belonging to the same instance are disjoint. We will exploit this fact for the formal problem definition. Compared to VIM, the simultaneous consideration of alternative good solutions relaxes the problem: for every instance i, only one of the alternative HLI needs to be solved in order to move a good candidate to the first rank. Nevertheless, our data usually remain inconsistent. Accordingly, we want to determine w such as to maximize the number of instances i with a good candidate at the highest rank. This quantity equals the total number of HLI X^{i,k} solved by a vector w, since w cannot lie within more than one of the disjoint solution cones for each instance i. Thus, the problem can be formulated as Cone Intersection Maximization (CIM): determine a weight vector w that lies within a maximum number of the solution cones.

We now describe the main idea of the CALP algorithm (Figure 9). Let an error value s_l(w) be defined as 0 if w ∈ C_l, and as positive otherwise. For technical reasons, we define s_l(w) to be the distance of w to the most distant hyperplane H(x), x ∈ X_l, corresponding to a violated inequality. This allows us to stay within the framework of linear programming:

s_l(w) := max{ 0, max_{x ∈ VDP(w, X_l)} (−wx) } = min{ t ≥ 0 | ∀x ∈ X_l : t ≥ −wx }
If w lies inside the cone C_l, all dot products wx are positive, resulting in zero error. Otherwise, wx is negative for some x ∈ X_l, reflected by the corresponding positive error value. Solving CIM is equivalent to minimizing the number of non-zero error variables. Unfortunately, this cannot be realized with a linear objective function. CIM can be solved exactly using quadratic programming [21]. However, this is likely to be infeasible even if we reduce the amounts of data as described above. Therefore, we simply minimize the sum ∑_l s_l of the error values, similar to the LP approach to VIM. This results in a linear program formulated as in line (6) in Figure 9. Steps (3)-(5) and (7) comprise balancing of components and normalizing, analogously to the VALP algorithm.
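A minimal sketch of this error-sum LP, using `scipy.optimize.linprog` on made-up DP sets. To rule out the trivial solution w = 0 (which the paper handles through its balancing and normalization steps), the sketch exploits the scaling invariance of the HLI and requires wx ≥ 1 − s_l instead of wx ≥ −s_l:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical DP sets X_l, one per solution cone C_l (d = 2 properties).
cones = [
    np.array([[1.0, 0.5], [2.0, -1.0]]),
    np.array([[0.5, 1.5], [-0.2, 1.0]]),
]
d, m = 2, len(cones)

# Variables z = (w_1..w_d, s_1..s_m): the weights are free, and one
# error variable s_l >= 0 per cone is shared by all DPs x in X_l.
# Each inequality w.x >= 1 - s_l becomes -w.x - s_l <= -1 for linprog.
# (The margin 1 is our assumption standing in for the normalization.)
A_ub, b_ub = [], []
for l, X in enumerate(cones):
    for x in X:
        row = np.zeros(d + m)
        row[:d] = -x        # -w.x term
        row[d + l] = -1.0   # -s_l term
        A_ub.append(row)
        b_ub.append(-1.0)

c = np.concatenate([np.zeros(d), np.ones(m)])   # minimize sum of errors
bounds = [(None, None)] * d + [(0, None)] * m   # w free, s_l >= 0
res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, bounds=bounds)

w, s = res.x[:d], res.x[d:]
print("w =", w, " errors =", s)
```

For this toy data a single w satisfies all inequalities, so the optimal error sum is zero; with inconsistent data, some s_l stay positive and their sum is what the LP trades off.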
References

[1] T. Akutsu and H. Tashimo. Linear programming based approach to the derivation of a contact potential for protein threading. In Proceedings of the Pacific Symposium on Biocomputing '98, pages 413-424, 1998.
[2] T. Akutsu and M. Yagiura. On the complexity of deriving score functions from examples for problems in molecular biology. In 25th International Colloquium on Automata, Languages, and Programming (ICALP'98), 1998.
[3] N. Alexandrov, R. Nussinov, and R. Zimmer. Fast protein fold recognition via sequence to structure alignment and contact capacity potentials. In Pacific Symposium on Biocomputing '96, pages 53-72. World Scientific Publishing, 1996.
[4] E. Amaldi and V. Kann. The complexity and approximability of finding maximum feasible subsystems of linear relations. Theoretical Computer Science, 147:181-210, 1995.
[5] K. P. Bennett and O. L. Mangasarian. Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, 1:23-34, 1992.
[6] F. C. Bernstein, T. F. Koetzle, G. J. B. Williams, E. F. Meyer Jr., M. D. Brice, J. R. Rodgers, O. Kennard, T. Shimanouchi, and M. Tasumi. The Protein Data Bank: a computer based archival file for macromolecular structures. Journal of Molecular Biology, 112:535-542, 1977.
[7] S. Brenner, C. Chothia, and T. Hubbard. Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. PNAS, 95(11):5857-5864, May 1998.
[8] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis, chapter 5: Linear Discriminant Functions, pages 130-188. John Wiley & Sons, New York, USA, 1973.
[9] D. Gusfield, K. Balasubramanian, and D. Naor. Parametric optimization of sequence alignment. In Proceedings of the 3rd ACM-SIAM Symposium on Discrete Algorithms. ACM/SIAM, 1992.
[10] E. E. Lattman, editor. Critical Assessment of Techniques for Protein Structure Prediction (CASP2). Supplement 1 to Proteins: Structure, Function, and Genetics, 1997.
[11] C. Lemmen, A. Zien, R. Zimmer, and T. Lengauer. Application of parameter optimization to molecular comparison problems. In Proceedings of the Pacific Symposium on Biocomputing '99, volume 4, pages 482-493, 1999.
[12] V. N. Maiorov and G. M. Crippen. Contact potential that recognizes the correct folding of globular proteins. Journal of Molecular Biology, 227:876-888, 1992.
[13] H. Mevissen, R. Thiele, R. Zimmer, and T. Lengauer. The ToPLign software environment - Toolbox for protein alignment. In Bioinformatik '94. Jena, IMB - Institut für molekulare Biotechnologie, 1994.
[14] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia. SCOP: A structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247:536-540, 1995.
[15] F. Rosenblatt. The perceptron - a perceiving and recognizing automaton. Technical report, Cornell Aeronautical Laboratory, 1957.
[16] M. Sippl. Calculation of conformational ensembles from potentials of mean force: An approach to the knowledge-based prediction of local structures in globular proteins. Journal of Molecular Biology, 213:859-883, 1990.
[17] R. Thiele, R. Zimmer, and T. Lengauer. Protein threading by recursive dynamic programming. Journal of Molecular Biology, 290(3):757-779, 1999.
[18] J. T. Tou and R. C. Gonzalez. Pattern Recognition Principles. Addison-Wesley, Massachusetts, USA, 1974.
[19] M. Vingron and M. S. Waterman. Sequence alignment and penalty choice. Journal of Molecular Biology, 235(1):1-12, 1994.
[20] G. Vogt, T. Etzold, and P. Argos. An assessment of amino acid exchange matrices in aligning protein sequences: The twilight zone revisited. Journal of Molecular Biology, 249:816-831, 1995.
[21] A. Zien. Optimierungsmethoden zur Kalibrierung empirischer Kostenfunktionen. Master's thesis, Friedrich-Wilhelms-Universität Bonn, Germany, 1997.
[22] R. Zimmer and T. Lengauer. Fast and numerically stable parametric alignment of biosequences. In First Annual International Conference on Computational Molecular Biology (RECOMB'97), pages 344-353. ACM Press, 1997.