
Genome Informatics 16(2): 114–124 (2005)

Comparison of Protein Structures by Multi-Objective Optimization

Luonan Chen1 [email protected]
Ling-Yun Wu2 [email protected]
Ruiqi Wang1,∗ [email protected]
Yong Wang1 [email protected]
Shihua Zhang2 [email protected]
Xiang-Sun Zhang2 [email protected]

1 Department of Electrical Engineering and Electronics, Osaka Sangyo University, Daito, Osaka 574-8530, Japan
2 Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100080, China

Abstract We propose a novel method for solving the structure comparison problem for proteins, based on a decomposition technique. We define the structure alignment as a multi-objective optimization problem with both discrete and continuous variables, i.e., maximizing the number of aligned atoms and minimizing their root mean square distance. By controlling a single distance-related parameter, theoretically we can obtain a variety of optimal alignments corresponding to different optimal matching patterns, i.e., from a large matching portion to a small portion. The number of variables in our algorithm increases with the number of atoms of protein pairs in almost a linear manner. The software is available upon request, or from http://zhangroup.aporc.org/bioinfo/samo/.

Keywords: protein structure, multi-objective optimization, circular permutation

1 Introduction

Pairwise structure alignment is a difficult computational problem: it optimally superimposes two structures and finds the regions of closest overlap in three-dimensional space. Many algorithms have been developed so far for the structure alignment problem [10, 12], mainly based on either distance-based methods, such as iterative dynamic programming [1, 7], fuzzy matching [3] and mean field equation approximation [5, 17], or vector-based methods, such as DALI (simulated annealing) [9], CE (combinatorial extension) [14] and genetic algorithms [15]. Recently, Kolodny and Linial [10] proved that there is an ε-approximate polynomial-time algorithm for this problem under a commonly used scoring function, although they noted that the work should be viewed mostly as a theoretical result rather than an everyday tool. Despite this relative success, there is much room for improvement in terms of quality and computational efficiency. At the same time, an accurate and efficient algorithm is always in demand in molecular biology [17], where it can be applied to fold family classification, motif finding, phylogenetic tree reconstruction and even protein docking [8].

From the viewpoint of optimization, there are two criteria for distance-based algorithms, i.e., maximizing the number of aligned atoms and minimizing their distance. These two objectives clearly have a trade-off relation [5], i.e., a closer matching usually comes with a shorter aligned chain. In other words, the solutions of such an alignment problem form a Pareto set, which can be computed more exactly by a multi-objective optimization technique.

∗ Current address: Aihara Complexity Modelling Project, ERATO, JST, Shibuya-ku, Tokyo 151-0064, Japan.


In this paper, we propose a novel method to solve the structure comparison problem for homologous or structurally similar proteins in the framework of multi-objective optimization. We define the structure alignment as a two-objective optimization problem with both discrete and continuous variables, i.e., maximizing the number of aligned atoms and minimizing their root mean square distance (rms). The discrete variables represent the matching relation between the structures, whereas the continuous variables comprise a translation vector and a rotation matrix, with each protein structure treated as a rigid body. By exploiting the special structure of the protein alignment problem, we decompose the original problem into two subproblems: a linear programming subproblem (LPS) for the protein matching and a weighted least square subproblem (LSS) for the coordinate transformation. The LPS is solved by an efficient bipartite matching technique, whereas the LSS is solved by singular value decomposition (SVD). By controlling a single distance-related parameter, theoretically we can obtain a variety of optimal alignments corresponding to different optimal matching patterns, all of which belong to the Pareto set. In other words, depending on how closely we require a pair of proteins to match, we can obtain an optimal alignment ranging from a large matching portion to a small one.

The main features of this paper are summarized as follows. (1) A novel formulation is proposed to accurately align structures of homologous proteins in the framework of multi-objective optimization. (2) A fast and accurate algorithm based on bipartite matching is developed by exploiting the special structure of the problem. (3) Convergence of the computation is numerically stable and is also theoretically ensured by the Benders decomposition. To improve the quality of the solution, an annealing procedure is adopted to expand the search region. In addition, using the information in the matching matrix, the algorithm is able to identify circular permutations [5, 17]. Furthermore, no heuristic parameter, such as a gap penalty, is required in the formulation.

2 Formulation of Protein Alignment

In this section, we formulate the pairwise structure alignment problem as a mixed integer programming (MIP) problem, adopting notation similar but not identical to that of [3, 5, 17]. Let $n_x$ and $n_y$ be the numbers of atoms of two proteins $X = (X_1, ..., X_{n_x})$ and $Y = (Y_1, ..., Y_{n_y})$ to be structurally aligned, where $X_i = (x_{i,1}, x_{i,2}, x_{i,3})$ and $Y_j = (y_{j,1}, y_{j,2}, y_{j,3}) \in \mathbb{R}^3$ ($i = 1, ..., n_x$; $j = 1, ..., n_y$) are the atom coordinates of the two protein chains. They correspond to Cα atoms along the backbones in this paper, although Cβ or side-chain atoms can be considered in the same way. A square distance metric between the chain atoms is adopted, i.e., $d_{ij}^2 = |X_i - Y_j|^2 = \sum_{k=1}^{3} (x_{i,k} - y_{j,k})^2$ is the square distance between atom $i$ in $X$ and atom $j$ in $Y$.

Each protein chain is viewed as a rigid geometric body. The coordinate transformation of a rigid body is generally expressed by a translation vector $A \in \mathbb{R}^3$ and a rotation matrix $R \in \mathbb{R}^{3 \times 3}$, i.e., $\hat{X}_i = A + R X_i$ for atom $i$ of the chain $X$, where the translation vector and the rotation matrix each have three independent variables due to the rigid body transformation. For a pairwise structure alignment, we fix the coordinates of the second protein chain $Y$, which is assumed to be longer than the first chain $X$. Therefore, after coordinate transformation, the square distance between atom $i$ in $X$ and atom $j$ in $Y$ is

$$d_{ij}^2 = |A + R X_i - Y_j|^2. \tag{1}$$

We define a matching matrix $S$ with binary elements $s_{ij}$ to describe the matching of two atoms for $i = 1, ..., n_x$; $j = 1, ..., n_y$:

$$s_{ij} = \begin{cases} 1 & \text{if atom } i \text{ in the chain } X \text{ matches atom } j \text{ in the chain } Y, \\ 0 & \text{otherwise.} \end{cases} \tag{2}$$

Therefore $S$ is an $n_x \times n_y$ matrix with only binary elements.

[Figure 1: An example for two protein chains and their assignment matrix $S$ with $n_x = 6$ and $n_y = 7$.]

Since each atom in one chain can match at most one atom in the other, the following conditions are satisfied:

$$\sum_{i=1}^{n_x} s_{ij} \le 1 \ \text{ for } j = 1, ..., n_y; \qquad \sum_{j=1}^{n_y} s_{ij} \le 1 \ \text{ for } i = 1, ..., n_x. \tag{3}$$

Figure 1 is a simple example illustrating the matches for $S$, where a row or a column with all zeros means a gap. Then, the total square distance $D$ and the total number $m$ of aligned atoms between the two proteins are respectively expressed as:

$$D(S, A, R) = \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} s_{ij} |A + R X_i - Y_j|^2; \qquad m(S) = \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} s_{ij}. \tag{4}$$
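As a concrete illustration of (3) and (4), the following minimal Python sketch evaluates $D$ and $m$ for a toy matching matrix. The helper names (`transform`, `total_sq_distance`, `num_aligned`, `satisfies_constraints`) and the toy coordinates are assumptions made for illustration only; they are not taken from the authors' software.

```python
def transform(A, R, x):
    """Apply the rigid-body map x -> A + R x to a 3-D point."""
    return tuple(A[k] + sum(R[k][l] * x[l] for l in range(3)) for k in range(3))


def total_sq_distance(S, A, R, X, Y):
    """D(S, A, R) of (4): sum of s_ij * |A + R X_i - Y_j|^2."""
    D = 0.0
    for i, row in enumerate(S):
        xi = transform(A, R, X[i])
        for j, s in enumerate(row):
            if s:
                D += sum((a - b) ** 2 for a, b in zip(xi, Y[j]))
    return D


def num_aligned(S):
    """m(S) of (4): number of matched atom pairs."""
    return sum(sum(row) for row in S)


def satisfies_constraints(S):
    """Check (3): every row sum and every column sum is at most 1."""
    rows_ok = all(sum(row) <= 1 for row in S)
    cols_ok = all(sum(col) <= 1 for col in zip(*S))
    return rows_ok and cols_ok


# Toy data (hypothetical): 2 atoms in X, 3 atoms in Y.
X = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
Y = [(1.0, 0.0, 0.0), (2.0, 0.0, 1.0), (5.0, 5.0, 5.0)]
S = [[1, 0, 0],
     [0, 1, 0]]        # the third column is all zeros: atom 3 of Y is a gap
A = (1.0, 0.0, 0.0)    # translation vector
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # identity rotation

D = total_sq_distance(S, A, R, X, Y)
m = num_aligned(S)
rms = (D / m) ** 0.5   # root mean square distance of the aligned pairs
```

Only the pairs with $s_{ij} = 1$ contribute to $D$, so the unmatched third atom of $Y$ has no influence on the result.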

Generally, there is a trade-off relation [5, 17] between the distance and the number of aligned atoms. Hence, the pairwise structure alignment problem can be formulated as a multi-objective optimization problem with discrete variables $S$ and continuous variables $(A, R)$:

$$\text{vector-minimize}_{(S,A,R)} \ \{D(S, A, R), \ -m(S)\}, \quad \text{subject to (3) with } s_{ij} \in \{0, 1\}, \tag{5}$$

where the first objective is to minimize the total square distance of the aligned atoms, and the second one is to maximize the total number of aligned atoms for the two proteins (because of the negative sign of $m$). The optimal solutions of the two-objective optimization problem form a Pareto set, which can be computed by transforming the two objectives of (5) into a single objective. One typical technique is the ε-method, which varies a positive scalar parameter $\lambda$ to obtain the Pareto set, with the following formulation:

$$\text{minimize}_{(S,A,R)} \ \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} s_{ij} \left( |A + R X_i - Y_j|^2 - \lambda^2 \right), \quad \text{subject to (3) with } s_{ij} \in \{0, 1\}, \tag{6}$$

where the objective function is $D(S, A, R) - \lambda^2 m(S)$. Theoretically, by changing the parameter $\lambda$ in the single-objective optimization problem (6), we can obtain all optimal solutions belonging to the Pareto set. Clearly, $\lambda$ converts the number $m$ of aligned atoms into an equivalent square distance in (6), and controls the balance between $D$ and $m$. Notice that $|A + R X_i - Y_j|^2 - \lambda^2 = d_{ij}^2 - \lambda^2$ in (6), which implies that $\lambda$ has the same physical meaning and scale as the distance $d_{ij}$. We will exploit this property to drastically simplify the computation in the LPS below. When $\lambda$ is small, the optimal alignment has a short aligned chain (small $m$) but a tight matching (small $D$). Conversely, for a large $\lambda$, we obtain a long aligned chain but a rough matching. Therefore, rather than one solution, we can obtain a set of optimal solutions for different pairs of $(D, m)$ by changing $\lambda$. Besides being exact in form, with no heuristic gap parameters, the objective function of (6) is linear in $S$, and the number $m$ of aligned atoms is paired directly with the square distance $D$ through the multi-objective transformation.


2.1 Decomposition by LSS and LPS

Problem (6) is a mixed integer program for a given $\lambda$, but it has a special structure: no term in the constraints (3) involves the continuous variables $(A, R)$. Due to this special structure, (6) can be decomposed into a weighted least square subproblem (LSS), which finds the best coordinate transformation for the protein $X$, and an integer linear programming subproblem (LPS), which finds the best matching for the protein pair:

• LSS for solving $(A, R)$ with fixed $(S, \lambda)$:

$$\text{minimize} \ \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} s_{ij} |A + R X_i - Y_j|^2. \tag{7}$$

• LPS for solving $S$ with fixed $(A, R, \lambda)$:

$$\text{maximize} \ -\sum_{i=1}^{n_x} \sum_{j=1}^{n_y} s_{ij} \left( |A + R X_i - Y_j|^2 - \lambda^2 \right), \quad \text{subject to (3) with } s_{ij} \in \{0, 1\}. \tag{8}$$

Notice that for the LSS, the constraints (3) play no role and the term $\lambda^2 \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} s_{ij}$ is constant because $(S, \lambda)$ is fixed; it has no effect on the optimization and is therefore eliminated from the objective function in (7). It is easy to show that the decomposition (7)-(8) is also consistent with the Benders decomposition [6] because of the special structure, which ensures the computational convergence.
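One elementary way to see why alternating between the two subproblems cannot increase the objective of (6) is the following chain of inequalities. It is an illustrative sketch rather than the Benders argument of [6]; the shorthand $F$ and the iteration superscripts are notation introduced here, and it assumes that each subproblem is solved exactly for a fixed $\lambda$.

```latex
F(S, A, R) = D(S, A, R) - \lambda^2 m(S)

% LSS step: S^{(t)} is fixed and (A, R) is re-optimized
F(S^{(t)}, A^{(t+1)}, R^{(t+1)}) \le F(S^{(t)}, A^{(t)}, R^{(t)})

% LPS step: (A^{(t+1)}, R^{(t+1)}) is fixed and S is re-optimized
F(S^{(t+1)}, A^{(t+1)}, R^{(t+1)}) \le F(S^{(t)}, A^{(t+1)}, R^{(t+1)})

% F is bounded below because D \ge 0 and m \le \min\{n_x, n_y\}
F(S, A, R) \ge -\lambda^2 \min\{n_x, n_y\}
```

Under these assumptions the objective values form a non-increasing sequence that is bounded below, so they converge, although not necessarily to a global optimum.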

2.2 Solving LSS

The LSS of (7) is a weighted least square problem for two 3-D chains, which can actually be solved analytically [2, 3, 5]. Numerically, $R$ and $A$ can also be obtained by singular value decomposition (SVD), as shown in Appendix A.1 of [5]. There are six independent variables in the LSS. For fixed $(S, \lambda)$, the LSS pulls the protein $X$ closer to the protein $Y$ by computing the optimal rotation matrix $R$ and translation vector $A$. Notice that only coordinate pairs $(X_i, Y_j)$ with $s_{ij} = 1$ for $i = 1, ..., n_x$ and $j = 1, ..., n_y$ are used in the LSS according to (7). In other words, the LSS is not affected by those coordinate pairs $(X_i, Y_j)$ with $s_{ij} = 0$, most of which are actually known before the computation. This property is exploited in the next section to drastically simplify the computation of the LPS.
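For reference, one standard closed-form SVD solution of a least square fit such as (7), in the style of Arun et al. [2], can be written as follows. This is a sketch under the assumption that at least three non-collinear pairs are matched; the symbols $\bar{X}_S$, $\bar{Y}_S$, $H$, $U$, $\Sigma$ and $V$ are introduced here for illustration, and the derivation in Appendix A.1 of [5] may use different notation.

```latex
% centroids of the matched atoms, with m = \sum_{i,j} s_{ij}
\bar{X}_S = \frac{1}{m} \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} s_{ij} X_i, \qquad
\bar{Y}_S = \frac{1}{m} \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} s_{ij} Y_j

% 3 x 3 cross-covariance matrix of the centred matched pairs and its SVD
H = \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} s_{ij} (X_i - \bar{X}_S)(Y_j - \bar{Y}_S)^{T} = U \Sigma V^{T}

% optimal rotation (the diagonal factor guards against a reflection) and translation
R = V \, \mathrm{diag}\bigl(1, 1, \det(V U^{T})\bigr) \, U^{T}, \qquad
A = \bar{Y}_S - R \bar{X}_S
```

The translation follows from the rotation, which is why the six independent variables of the LSS can be found with a single SVD of a $3 \times 3$ matrix.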

2.3 Solving LPS

The LPS of (8) is an integer linear programming problem because of the binary variables $S$, but it can be solved exactly in polynomial time. In fact, the LPS of (8) is a maximum weighted bipartite matching problem [13], which has the integrality property, i.e., the optimal solution is guaranteed to be integral even without the constraint $s_{ij} \in \{0, 1\}$. In other words, the discrete optimal solution of the LPS can be obtained directly by any linear programming algorithm, such as the simplex algorithm or an interior-point method, after relaxing the binary variables to continuous variables $0 \le s_{ij} \le 1$. However, there exists a more efficient algorithm based on the Hungarian method [13] for the maximum weighted bipartite matching problem, with computational complexity $O(\bar{n}(\hat{n} + \bar{n} \log \bar{n}))$, where $\bar{n} = n_x + n_y$ and $\hat{n} = n_x \times n_y$ in the LPS. For large problems, such as proteins with several hundred amino acids each, $O(\bar{n}(\hat{n} + \bar{n} \log \bar{n}))$ is too high for fast structure alignment.

The algorithm for the LPS can be further improved by exploiting its special structure. Note that the objective function of (8) is to maximize $-\sum_{i=1}^{n_x} \sum_{j=1}^{n_y} s_{ij} (d_{ij}^2 - \lambda^2)$ for the fixed $(d_{ij}, \lambda)$. Therefore, for $i = 1, ..., n_x$ and $j = 1, ..., n_y$, if $d_{ij} \ge \lambda$, then $s_{ij} = 0$ can be assumed at the optimal solution, since such a pair contributes $-(d_{ij}^2 - \lambda^2) \le 0$ to the objective and dropping it never violates (3), as illustrated in Figure 2. In other words, we can eliminate all $s_{ij}$ corresponding to $d_{ij} \ge \lambda$ from both the objective function and the constraints of (8). Such a manipulation significantly simplifies the LPS, and reduces the total number of variables $\hat{n}$ from $n_x \times n_y$ to $|\{d_{ij} : d_{ij} < \lambda\}| = O(\lambda^2 \min\{n_x, n_y\})$, which is discussed in Section 2.4. The detailed procedure to solve the LPS based on the Hungarian method with the reduced variables is given in Appendix A.

[Figure 2: Reducing variables based on $d_{ij}$ and $\lambda$. $\lambda$ corresponds to the radius of the search region. For atom $i$ of $X$, only $s_{i,j}$ and $s_{i,j+1}$ may be nonzero because $d_{i,j}$ and $d_{i,j+1}$ are less than $\lambda$.]
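The variable reduction can be sketched in a few lines of Python. The function name `candidate_pairs` and the toy chains (two straight chains with 3.8 Å spacing, offset by 1 Å) are assumptions for illustration; the sketch simply keeps the pairs with $d_{ij} < \lambda$ together with their coefficient $d_{ij}^2 - \lambda^2$ from (8).

```python
def candidate_pairs(X, Y, lam):
    """Return {(i, j): d_ij^2 - lam^2} for all pairs with d_ij < lam.

    X is assumed to be already transformed, i.e. it holds A + R X_i.
    Every returned coefficient is negative, so each kept pair can only
    improve the objective of (8).
    """
    pairs = {}
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            d2 = sum((a - b) ** 2 for a, b in zip(x, y))
            if d2 < lam * lam:
                pairs[(i, j)] = d2 - lam * lam
    return pairs


# Hypothetical toy chains: 5 atoms each on a line, 3.8 apart, Y shifted by 1.0.
X = [(3.8 * i, 0.0, 0.0) for i in range(5)]
Y = [(3.8 * j + 1.0, 0.0, 0.0) for j in range(5)]

reduced = candidate_pairs(X, Y, lam=3.0)
full_size = len(X) * len(Y)      # 25 variables before the reduction
reduced_size = len(reduced)      # 9 variables after the reduction
```

In this toy case each atom of $X$ keeps at most two candidate partners, which mirrors the situation drawn in Figure 2.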

2.4 Algorithm

For a given $\lambda$, the detailed algorithm is stated as follows:

• Initialization:
  1. Set $\lambda$ and the convergence criterion $\varepsilon$, which are both positive numbers. Set the initial values of all variables $s_{ij}$. Let the iteration index $t = 1$.
  2. Assuming $n_x \le n_y$, fix the coordinates of protein chain $Y$, and move protein chain $X$ to their common center of mass by the translation $\sum_{i=1}^{n_y} Y_i / n_y - \sum_{i=1}^{n_x} X_i / n_x$.
• Step-1: Solve the LSS of (7) for $(A, R)$ with the fixed $S$ according to the SVD algorithm of Appendix A.1 of [5].
• Step-2: Solve the LPS of (8) for $S$ with the fixed $(A, R)$ according to the procedure of Appendix A.
• Checking convergence: If $|D^{(t)} - D^{(t-1)}| \le \varepsilon$ is satisfied, terminate the computation and output rms and $m$. Otherwise, let $t \leftarrow t + 1$ and go to Step-1.

Notice that for Initialization, the original centers of mass of proteins $X$ and $Y$ are $\bar{X} = \sum_{i=1}^{n_x} X_i / n_x$ and $\bar{Y} = \sum_{i=1}^{n_y} Y_i / n_y$ respectively. rms is defined as $\sqrt{D/m}$, where $D$ and $m$ are expressed in (4).

The protein structure alignment is an iterative computation of the LSS and the LPS in succession. As discussed in Section 2.3, $s_{ij}$ can be 1 at the optimal solution only if $d_{ij} < \lambda$. If the distance between two atoms $i$ and $j$ is larger than $\lambda$, no matching of these two atoms is considered in the LPS. In other words, only atom pairs with a distance less than $\lambda$ are further considered in the LSS for the translation and rotation operation, because $s_{ij} = 0$ for any atom pair with $d_{ij} > \lambda$ in the LPS. As a result, the aligned rms in the LSS is less than $\lambda$. We can use this property to obtain an alignment for a specific rms, without computing the whole Pareto set, by setting an appropriate $\lambda$. Empirically, we can obtain an optimal solution with rms $= r$ by setting $\lambda$ between $2r$ and $3r$, where rms is expected to be $r = 0 \sim 3$ because an alignment with rms $> 3$ is generally not considered a good matching. In other words, we can give a list of solutions that covers all the reasonable alignments by varying $\lambda$ from 0 to 9, because the range of rms for those solutions generally includes the range of $0 \sim 3$.

Considering that the distance is approximately 3.8 Å between two consecutive Cα atoms or Cβ atoms of two amino acids in a protein chain, the reduced LPS generally has fewer than $\min\{n_x, n_y\}(\lambda + 3.8)^2 / 3.8^2$ variables, which is a much smaller LP than the original LP with $n_x \times n_y$ variables. For example, there are approximately fewer than $400 \times (6 + 3.8)^2 / 3.8^2 \approx 2660$ variables for a pair of proteins both with 400 amino acids and $\lambda = 6$, compared with 160000 variables in the original LP. The right panel of Figure 3 shows that this estimate is also fairly accurate.

[Figure 3: Reducing variables of LPS. The left figure indicates the reduced sizes in terms of the number of variables before and after the implementation of variable reduction for different protein pairs ($\lambda = 6.0$). The right figure is a comparison of experimental and predicted reduced sizes ($\lambda = 6.0$).]
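The alternation of Step-1 and Step-2 can be illustrated on a deliberately simplified toy problem. The sketch below is not the authors' implementation and rests on several stated assumptions: the points are one-dimensional, the rotation is dropped so that the LSS reduces to a mean translation, the LPS is solved by brute-force enumeration instead of the Hungarian method, and the initial matching is taken from an LPS solve at the common center of mass. All function names and data are hypothetical.

```python
def solve_lps(xs, ys, lam):
    """Brute-force LPS: the matching minimizing sum(d^2 - lam^2) over pairs with d < lam."""
    best = {"cost": 0.0, "match": {}}

    def extend(i, used, cost, match):
        if i == len(xs):
            if cost < best["cost"]:
                best["cost"], best["match"] = cost, dict(match)
            return
        extend(i + 1, used, cost, match)              # leave x_i unmatched (gap)
        for j, y in enumerate(ys):
            d2 = (xs[i] - y) ** 2
            if j not in used and d2 < lam * lam:      # variable reduction: d < lam only
                match[i] = j
                extend(i + 1, used | {j}, cost + d2 - lam * lam, match)
                del match[i]

    extend(0, frozenset(), 0.0, {})
    return best["match"]


def solve_lss(x, y, match, fallback):
    """Translation-only LSS: the mean offset over the matched pairs."""
    if not match:
        return fallback
    return sum(y[j] - x[i] for i, j in match.items()) / len(match)


def align_1d(x, y, lam, eps=0.01, max_iter=50):
    shift = sum(y) / len(y) - sum(x) / len(x)         # move X to the common center of mass
    match = solve_lps([xi + shift for xi in x], y, lam)
    d_prev = None
    for t in range(1, max_iter + 1):
        shift = solve_lss(x, y, match, shift)                  # Step-1
        match = solve_lps([xi + shift for xi in x], y, lam)    # Step-2
        d = sum((x[i] + shift - y[j]) ** 2 for i, j in match.items())
        if d_prev is not None and abs(d - d_prev) <= eps:      # convergence check
            break
        d_prev = d
    return shift, match, d, t


shift, match, d, t = align_1d([0.0, 1.0, 2.0], [5.0, 6.0, 7.0, 20.0], lam=4.0)
```

On this toy input the far-away point 20.0 is never matched, and the iteration settles on the translation that maps the three points of the first chain exactly onto 5.0, 6.0 and 7.0.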

2.5 Improving Quality

As indicated in Figure 2, $\lambda$ geometrically corresponds to the radius of the search region during the optimization process: $s_{ij}$ can be 1 at the optimal solution only if $d_{ij} < \lambda$. If the distance between two atoms $i$ and $j$ is larger than $\lambda$, no matching of these two atoms is considered in the LPS. Consequently, the information on these two atoms cannot be passed to the subsequent LSS because the corresponding $s_{ij} = 0$. In other words, for a small $\lambda$, the search region is small and the optimization may depend significantly on the initial relative locations of the two proteins. Needless to say, the iterative computation of the LSS and the LPS in succession changes the relative locations of the two proteins by translation and rotation, but only to a limited extent. To alleviate this problem and improve the quality of the solution, we adopt an annealing technique [4] in (8), i.e., in Step-2 of the algorithm, by setting $\lambda(t) = \lambda + \bar{\lambda} \gamma^t$ during the iterations, where $\lambda$ is the target value, $\bar{\lambda} > 0$, $1 > \gamma > 0$ (a cooling coefficient for annealing), and $t$ is the iteration index. That is, we first set a large initial $\lambda(0) = \lambda + \bar{\lambda}$ so that the algorithm performs a global search over a large region to find a better matching in the earlier iterations. Then the excess term $\bar{\lambda} \gamma^t$ in $\lambda(t) = \lambda + \bar{\lambda} \gamma^t$ shrinks by the factor $\gamma$ at each iteration, narrowing the search region until convergence. Although the annealing process requires additional computation, it enlarges the search region, which improves the alignment quality. The annealing process is only activated when the quality of the alignment is not satisfactorily high.
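The annealing schedule is simple enough to tabulate directly. The short sketch below evaluates $\lambda(t) = \lambda + \bar{\lambda} \gamma^t$ for the settings reported in Section 3.5 ($\bar{\lambda} = 10\lambda$, $\gamma = 0.4$) and the target $\lambda = 4.5$ used in Figure 6; the function name `annealed_lambda` and the choice of eleven iterations are assumptions for illustration.

```python
def annealed_lambda(lam, lam_bar, gamma, t):
    """Search radius lambda(t) = lam + lam_bar * gamma**t at iteration index t >= 0."""
    return lam + lam_bar * gamma ** t


target = 4.5
schedule = [annealed_lambda(target, 10 * target, 0.4, t) for t in range(11)]

# The first radius is lam + lam_bar = 11 * lam; later radii shrink toward the target.
initial_ratio = schedule[0] / target
```

With these settings the radius starts at 49.5 and is within 0.01 of the target value after ten iterations, so the wide search only affects the early iterations.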

3 Numerical Simulation

3.1 Alignment

We adopt the same benchmark examples as those of [3, 5, 14] from the Protein Data Bank (http://www.rcsb.org/pdb/) for numerical simulation, and compare with several existing methods, i.e., Dali [9] (http://www.ebi.ac.uk/dali) and CE [14] (http://cl.sdsc.edu/ce/ce_align.html). Since our program does not consider the ordering of the aligned segments along the sequence, we also apply a simple post-processing step to the results by appropriate local swaps of residues. The Cα representation is adopted for each protein chain. Unless otherwise stated, all simulations are conducted without the annealing process. The convergence criterion is ε = 0.01 for all examples.

Table 1: Comparisons of structure alignment algorithms with rms and m.

                         Our Method          Post Process   Dali          CE
Protein Pairs            rms   m     λ       rms   m        rms   m       rms   m
Reductases
  1DHFa - 8DFR           0.7   182   6.0     0.7   182      0.7   182     0.7   182
  1DHFa - 4DFRa          1.8   156   6.0     1.9   156      2.0   154     2.0   154
  1DHFa - 3DFR           1.6   159   6.0     1.6   158      1.7   158     1.7   158
  8DFR - 4DFRa           1.9   157   6.0     1.9   157      2.0   155     2.0   155
  8DFR - 3DFR            1.6   159   6.0     1.7   159      1.8   159     1.8   158
  4DFRa - 3DFR           1.5   156   6.0     1.5   156      1.5   154     1.5   155
Globins
  2HHBa - 2HHBb          1.4   139   6.0     1.4   139      1.4   138     1.5   139
  2HHBa - 1MBD           1.5   141   6.0     1.5   141      1.5   139     1.6   141
  2HHBa - 2HBG           1.6   140   6.0     1.5   138      1.7   138     1.7   136
  2HHBa - 1ECD           2.2   131   6.0     2.2   130      2.3   129     2.6   128
  1MBD - 2HBG            2.0   141   8.0     1.9   140      2.2   140     2.1   140
  2HHBb - 1MBD           1.6   145   6.0     1.6   145      1.6   145     1.6   144
  2HHBb - 2HBG           1.7   137   6.0     1.6   136      2.0   135     1.9   134
  2HHBb - 1MBA           2.2   140   6.0     2.2   138      2.3   138     2.4   139
  2LHB - 1MBD            1.4   137   6.0     1.4   136      1.4   135     1.6   137
  2LHB - 2HBG            1.9   133   6.0     1.8   131      2.0   128     2.1   130
  1MBD - 1MBA            1.9   143   6.0     1.9   143      1.9   142     1.8   141
  1MBA - 1ECD            1.9   136   8.0     1.9   136      1.9   133     2.0   134
  2HBG - 1ECD            2.4   129   6.5     2.4   129      2.6   129     2.6   125
Ten 'difficult' structures
  1FXIa - 1UBQ           2.5   70    6.0     2.6   62       2.6   60      3.8   64
  1TEN - 3HHRb           1.7   87    6.0     1.7   86       1.9   86      1.9   87
  3HLAb - 2RHE           2.9   87    6.0     4.0   61       3.0   75      3.4   84
  2AZAa - 1PAZ           2.5   82    4.5     2.6   35       2.5   81      2.9   84
  1CEWi - 1MOLa          2.3   83    6.5     2.2   82       2.3   81      2.3   81
  1CID - 2RHE            2.3   98    6.5     2.4   96       3.2   97      2.9   97
  1CRL - 1EDE            3.1   281   6.0     3.6   74       3.5   211     3.8   219
  2SIM - 1NSBa           2.9   322   6.0     3.6   246      3.3   291     3.0   275
  1BGEb - 2GMFa          3.3   110   7.5     3.9   91       3.3   94      3.9   107
  1TIE - 4FGF            2.4   115   6.0     2.5   111      3.1   114     2.9   116

The simulation results are shown in Table 1, where Dihydrofolate reductases and Globins are considered easy to align while the other ten protein pairs are thought to be very difficult to align [14]. For any protein pair, our method gives a list of solutions corresponding to different values of $\lambda$, from small to large, all of which belong to the Pareto set. Since a different $\lambda$ gives an optimal solution with a different $m$ and rms, we list the results of our algorithm, together with the corresponding $\lambda$, that are comparable to those of the other methods. "Post process" denotes the results after simple local swaps of residues, which makes them comparable with the other methods. According to Table 1, the alignments of homologous proteins by our method are almost consistently better than the others, and our method typically produces alignments with lower rms distances or longer chains. However, for protein pairs with little structural similarity, such as some of the ten most difficult protein pairs [14], our algorithm performs less well after the post-processing of sequence ordering. These results indicate that the improvements in accuracy of our method are due not only to the new optimization algorithm but also to the relaxation of permutations. In addition, we also aligned protein pairs from different folds and different classes and compared the results with other methods. The results indicate that our algorithm can obtain an alignment with a larger matching portion and a better rms for those protein pairs.

3.2 Convergence

The decomposition of the algorithm is consistent with the Benders decomposition, which actually ensures the local convergence. Numerical simulation for each example in Table 1 (without annealing) typically requires 4-8 iterations, and the convergence is also fairly stable from the numerical computation viewpoint.

[Figure 4: Convergent process and trade-off relation. The left figure is the convergent process of the aligned residue number ($m$) and the root-mean-square distance (rms) for a pair of proteins 2LHB-2HBG at $\lambda = 8.0$ and $\bar{\lambda} = 0.0$. The right figure is the trade-off relation between the aligned residue number ($m$) and the root-mean-square distance (rms) for a pair of proteins 2HBG and 1MBD at $\bar{\lambda} = 0.0$.]

3.3 Trade-Off Relation

Generally, there is a trade-off relation between the rms distance and the number of aligned atoms $m$, which is governed by $\lambda$ in the algorithm. In other words, depending on how closely we require a pair of proteins to match, we can obtain an optimal alignment by varying the scalar $\lambda$, from a large matching portion to a small one. Therefore, rather than one solution, we obtain a set of optimal solutions for different pairs of (rms, $m$) by changing $\lambda$. The right panel of Figure 4 shows this trade-off relation for the alignment of the protein chains 2HBG - 1MBD, which is the Pareto set of the multi-objective optimization. It is easy to see from the right panel of Figure 4 that, generally, the larger $\lambda$ is, the larger the number of aligned atoms $m$. Conversely, a closer matching has a shorter aligned chain. Therefore, for a given requirement on rms, the proposed algorithm can produce the longest alignment by changing $\lambda$, even for a case with a very small matching portion.

Figure 5 demonstrates a small portion matching and a large portion matching for the pair of proteins 2HBG-1MBD. When $\lambda$ is small, we obtain an accurate local matching for a small portion of the two proteins with $m = 46$ and rms $= 0.68$, as indicated in (a). However, when $\lambda$ increases to 8, we obtain a global alignment with $m = 141$ and rms $= 2$, which is shown in (c). These results imply that the proposed algorithm is able to obtain accurate optimal alignments for different matching sizes, which can be applied to screening studies of protein docking or superfamily classification.

As indicated in the right panel of Figure 4, rms and $m$ of the aligned protein pairs increase with $\lambda$. Generally this tendency is monotonic and stable, but there may be non-monotonic behavior when $\lambda$ is very small due to the small search region and non-convexity, i.e., $m$ or rms may decrease as $\lambda$ increases. However, this problem is generally alleviated when the annealing procedure is adopted. Although the proposed algorithm can give different matching sizes for a protein pair, it is still a global alignment method. In other words, it is not a local alignment algorithm [8], because the aligned atom pairs may be distributed over a wide area and are not always restricted to a local area of a protein.
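The role of $\lambda$ in selecting a point on the trade-off curve can be checked with a small calculation. The sketch below takes the three ($m$, rms) pairs reported for 2HBG-1MBD in Figure 5 as candidate alignments, recovers $D = m \cdot \text{rms}^2$ from the definition of rms, and evaluates the objective $D - \lambda^2 m$ of (6) at each reported $\lambda$. It is only an illustration: the reported values are rounded, only these three candidates are compared, and the names `reported`, `scalarized_objective` and `preferred` are assumptions.

```python
# (m, rms) pairs reported for 2HBG-1MBD in Figure 5, keyed by the lambda
# at which each alignment was obtained.
reported = {1.15: (46, 0.68), 1.5: (69, 0.95), 8.0: (141, 2.0)}


def scalarized_objective(m, rms, lam):
    """Objective of (6), D - lam^2 * m, with D recovered as m * rms^2."""
    return m * rms ** 2 - lam ** 2 * m


def preferred(lam):
    """Among the three reported alignments, the one minimizing (6) at this lam."""
    return min(reported.values(),
               key=lambda p: scalarized_objective(p[0], p[1], lam))


choices = {lam: preferred(lam) for lam in reported}
```

For each of the three values of $\lambda$, the alignment reported at that $\lambda$ has the lowest objective among the three candidates, which is consistent with a small $\lambda$ favoring a tight, short alignment and a large $\lambda$ favoring a long, looser one.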

[Figure 5: A small and a large portion matching for a pair of proteins 2HBG and 1MBD at $\bar{\lambda} = 0.0$. Only the matched alignments detected by our algorithm are shown. (a): $m = 46$ with rms $= 0.68$ at $\lambda = 1.15$; (b): $m = 69$ with rms $= 0.95$ at $\lambda = 1.5$; (c): $m = 141$ with rms $= 2.0$ at $\lambda = 8.0$.]

[Figure 6: Improvement of quality by annealing for two protein chains 2HBG and 2LHB. The right two figures are an example at $\lambda = 4.5$, where the number of aligned residues $m$ is 105 with rms $= 2.58$ without annealing, shown in (a), and $m = 131$ with rms $= 1.78$ with annealing, shown in (b), at $\bar{\lambda} = 10\lambda$ and $\gamma = 0.4$. The left figure (c) shows the results for different $\lambda$ without and with annealing in terms of $m$ and rms.]

3.4 Variable Reduction

One significant feature of the algorithm is that the size of problem (8) is greatly reduced by exploiting the geometrical meaning of $\lambda$, so that even a large-scale alignment problem remains computationally tractable. The left panel of Figure 3 shows how much the size of (8), in terms of the number of variables, is reduced for different protein pairs by the variable reduction, i.e., the procedure of Appendix A. For instance, the number of variables is reduced from 33852 to 1092 for a pair of proteins with 182 and 186 amino acids when $\lambda = 6$. In this case, the CPU times are 2 seconds without and 0.5 seconds with the variable reduction, which demonstrates the effectiveness of the variable reduction procedure of Appendix A.
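The numbers quoted above can be cross-checked against the size estimate $\min\{n_x, n_y\}(\lambda + 3.8)^2 / 3.8^2$ of Section 2.4. The following sketch does this arithmetic; the function names and the constant name `CA_SPACING` are assumptions for illustration.

```python
CA_SPACING = 3.8  # approximate distance (in angstroms) between consecutive C-alpha atoms


def original_size(nx, ny):
    """Number of variables s_ij in the unreduced LPS."""
    return nx * ny


def reduced_size_bound(nx, ny, lam):
    """Estimated upper bound on the number of variables after the reduction."""
    return min(nx, ny) * (lam + CA_SPACING) ** 2 / CA_SPACING ** 2


full = original_size(182, 186)                  # 33852, as quoted in the text
bound = reduced_size_bound(182, 186, 6.0)       # roughly 1210
example = reduced_size_bound(400, 400, 6.0)     # roughly 2660, the example of Section 2.4
```

The reported reduced size of 1092 variables lies below the estimated bound of about 1210, so the quoted instance is consistent with the estimate.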

3.5 Improving Quality by Annealing

Notice that $\lambda$ geometrically corresponds to the radius of the search region. For a small $\lambda$, the search region is small and the optimization may depend significantly on the initial relative locations of the two proteins. In such a case, the annealing procedure is effective in enlarging the search region, thereby improving the quality of the solution. In our program, we set $\bar{\lambda} = 10\lambda$ and $\gamma = 0.4$ for annealing when rms is unsatisfactorily large. In other words, with annealing the radius of the initial search region is $\lambda(0) = \lambda + \bar{\lambda} = 11\lambda$, i.e., 11 times the target radius. Figure 6 shows the effect of annealing for a pair of proteins 2HBG-2LHB. Clearly, the quality of alignment is poor without annealing, in particular when $\lambda$ is small. With annealing, the computation converges to a much better solution due to the enlarged search region, as indicated in the left figure. The right two figures of Figure 6 show the cases with and without annealing when the target $\lambda$ is set to 4.5. Clearly, with the annealing process (i.e., $\bar{\lambda} > 0$), the quality of alignment is considerably improved, as indicated in (a) and (b). Generally, when rms is more than 2.5, it is better to start the annealing procedure, which may give a better solution. However, when one wants an accurate, high-quality alignment or a detailed analysis of protein structures, annealing is always recommended although it requires a higher computation cost.

The algorithm was implemented in C++. The simulation for each structure alignment requires on average only a few seconds (most take less than one second) on an IBM ThinkPad T23 (Pentium III-M 1.20 GHz CPU), which is considered very fast.

4 Conclusion

We developed an effective and accurate method to solve the structure alignment problem for homologous proteins. The proposed algorithm is quite general and treats the structure alignment in a more exact way, with implicit complete exploration of the entire space. The original protein alignment problem is formulated as a multi-objective optimization problem, and further decomposed into an LSS with 6 continuous variables and an LPS with binary variables. A very efficient algorithm with a numerically stable convergent process is developed for optimizing the LPS and the LSS successively. We show that the number of variables increases linearly with the number of atoms of the protein pair. By controlling a single distance-related parameter, theoretically we can obtain a variety of optimal alignments corresponding to different optimal matching patterns, i.e., from a large matching portion to a small matching portion. In addition, by directly checking the matching matrix, the algorithm is able to identify permutations. The software is available at http://zhangroup.aporc.org/bioinfo/samo/.

References

[1] Akutsu, T., Protein structure alignment using dynamic programming and iterative improvement, IEICE Trans. Inf. Syst., 12:1629–1636, 1996.
[2] Arun, K. S., Huang, T. S., and Blostein, S. D., Least-squares fitting of two 3-D point sets, IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 9:698–701, 1987.
[3] Blankenbecler, R., Ohlsson, M., Peterson, C., and Ringner, M., Matching protein structures with fuzzy alignments, Proc. Natl. Acad. Sci. USA, 100:11936–11940, 2003.
[4] Chen, L. and Aihara, K., Chaotic simulated annealing by a neural network model with transient chaos, Neural Networks, 8:915–930, 1995.
[5] Chen, L., Zhou, T., and Tang, Y., Protein structure alignment by deterministic annealing, Bioinformatics, 21:51–62, 2005.
[6] Geoffrion, A. M. and Graves, G. W., Multicommodity distribution system design by Benders decomposition, Management Sci., 20:822–844, 1974.
[7] Gerstein, M. and Levitt, M., Using iterative dynamic programming to obtain accurate pairwise and multiple alignments of protein structures, ISMB, 59–67, 1996.
[8] Hiroike, T. and Toh, H., A local structural alignment method that accommodates with circular permutation, Chem-BioInformatics J., 1:103–114, 2001.
[9] Holm, L. and Sander, C., Protein structure comparison by alignment of distance matrices, J. Mol. Biol., 233:123–138, 1993.
[10] Kolodny, R. and Linial, N., Approximate protein structural alignment in polynomial time, Proc. Natl. Acad. Sci. USA, 101:12201–12206, 2004.
[11] Mizuguchi, K., Deane, C. M., Blundell, T. L., and Overington, J. P., HOMSTRAD: A database of protein structure alignments for homologous families, Protein Sci., 7:2469–2471, 1998.
[12] Orengo, C. A. and Taylor, W. R., SSAP: Sequential structure alignment program for protein structure comparison, Methods Enzymol., 266:617–635, 1996.
[13] Schrijver, A., Combinatorial Optimization: Polyhedra and Efficiency, vol. A, Springer, 2003.
[14] Shindyalov, I. and Bourne, P., Protein structure alignment by incremental combinatorial extension (CE) of the optimal path, Protein Engineering, 11:739–747, 1998.
[15] Szustakowski, J. and Weng, Z., Structural alignment using a genetic algorithm, Proteins: Structure, Function and Genetics, 38:438–440, 2000.
[16] Uliel, S., Fliess, A., Amir, A., and Unger, R., A simple algorithm for detecting circular permutations in proteins, Bioinformatics, 15:930–936, 1999.
[17] Zhou, T., Chen, L., Tang, Y., and Zhang, X., Aligning multiple protein structures by deterministic annealing, J. Bioinform. Comp. Biol., 3:837–860, 2005.

A Algorithm for Maximum Weighted Bipartite Matching Problem

The procedure to solve the LPS of (8) based on the Hungarian method with the reduced variables is as follows.

1. Construct a directed graph $(V, E)$ with $n_x + n_y + 2$ vertices: $V = \{s, t, u_1, u_2, \cdots, u_{n_x}, v_1, v_2, \cdots, v_{n_y}\}$, where $V$ and $E$ represent the vertex set and the edge set of the graph respectively. Add the edges $(s, u_i)$ for $i = 1, \cdots, n_x$ and $(v_j, t)$ for $j = 1, \cdots, n_y$, each with zero weight. For all $i$ and $j$ such that $d_{ij} < \lambda$, join $u_i$ to $v_j$ by an edge $(u_i, v_j)$ with weight $w_{u_i v_j} = d_{ij} - \lambda$. Let $\bar{F} = 0$.
2. Find the shortest path from $s$ to $t$ by using Dijkstra's algorithm [13], and compute the shortest distance $F(v)$ from $s$ to every vertex $v$. If no such path exists or $F(t) > \bar{F}$, then terminate the computation.
3. Modify the weights of all edges: each weight $w_{ij}$ is replaced by $w'_{ij} = w_{ij} + F(i) - F(j)$.
4. Reverse the directions of the edges along the shortest path.
5. Replace $\bar{F}$ by $\bar{F} - F(t)$. Go back to Step 2 and repeat.
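The following Python sketch illustrates the idea of successive shortest augmenting paths on the reduced graph. It is a simplified stand-in and not the procedure above verbatim: it runs Bellman-Ford-style relaxations directly on the residual bipartite graph instead of Dijkstra's algorithm with re-weighted edges, it uses the edge costs $d_{ij}^2 - \lambda^2$ from (8), and it stops when the cheapest augmenting path no longer has negative cost. The function name `min_cost_matching` and the toy costs are assumptions for illustration.

```python
def min_cost_matching(nx, ny, cost):
    """Minimize the sum of cost[(i, j)] over a matching of any size.

    cost maps candidate pairs (i, j) to d_ij^2 - lambda^2 (negative after the
    variable reduction). Returns match_x, where match_x[i] is the partner of
    x-atom i, or None for a gap.
    """
    INF = float("inf")
    match_x = [None] * nx
    match_y = [None] * ny
    while True:
        # Shortest alternating-path distances, starting from all free x vertices.
        dist_x = [0.0 if match_x[i] is None else INF for i in range(nx)]
        dist_y = [INF] * ny
        pred_y = [None] * ny
        for _ in range(nx + ny):
            changed = False
            for (i, j), c in cost.items():          # forward (unmatched) edges
                if match_x[i] != j and dist_x[i] + c < dist_y[j]:
                    dist_y[j], pred_y[j], changed = dist_x[i] + c, i, True
            for j in range(ny):                     # backward (matched) edges
                i = match_y[j]
                if i is not None and dist_y[j] - cost[(i, j)] < dist_x[i]:
                    dist_x[i], changed = dist_y[j] - cost[(i, j)], True
            if not changed:
                break
        free = [j for j in range(ny) if match_y[j] is None and dist_y[j] < INF]
        if not free:
            break
        end = min(free, key=lambda j: dist_y[j])
        if dist_y[end] >= 0:
            break                                   # no augmenting path improves (8)
        j = end
        while j is not None:                        # flip the edges along the path
            i = pred_y[j]
            j_next = match_x[i]
            match_x[i], match_y[j] = j, i
            j = j_next
    return match_x


# Hypothetical reduced costs: pairing 0-1 and 1-0 (total -8) beats 0-0 alone (-5).
costs = {(0, 0): -5.0, (0, 1): -4.0, (1, 0): -4.0}
matching = min_cost_matching(2, 2, costs)
total = sum(costs[(i, j)] for i, j in enumerate(matching) if j is not None)
```

The toy instance shows why a greedy choice of the single best pair is not enough: the second augmentation re-routes the first match to reach a lower total cost.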