Pergamon. PII: S0097-8485(96)00011-3. Computers Chem. Vol. 21, No. 1, pp. 13-23, 1997. Copyright © 1996 Elsevier Science Ltd. Printed in Great Britain.
HIERARCHIC INERTIAL PROJECTION: A FAST DISTANCE MATRIX EMBEDDING ALGORITHM

ANDRAS ASZODI* and WILLIAM R. TAYLOR

Division of Mathematical Biology, National Institute for Medical Research, The Ridgeway, Mill Hill, London NW7 1AA, U.K.

(Received 27 November 1995; accepted in final form 5 March 1996)
Abstract--We have designed an improved method for solving the embedding problem, which consists in generating molecular conformations satisfying prescribed distance restraints. The problem was broken up into smaller subproblems by carrying out separate embeddings of subsets of the original point set. The relative orientations of the subsets were then determined by an additional embedding, and the final coordinates of the full point set were obtained by rigid-body translations and rotations. The new approach was found to be considerably faster than the traditional method, and produced high-quality results when built into DRAGON, a Distance Geometry-based protein modelling tool developed in our laboratory. The method has a number of promising applications, including the fast generation of model conformations from a set of distance restraints and macromolecular docking simulations. Copyright © 1996 Elsevier Science Ltd

Key words: Distance geometry, protein modelling.
*Corresponding author. Fax: + 181 913 8545.

1. INTRODUCTION

Macromolecules are usually represented in simulation studies by point sets in which the individual points correspond to constituent atoms or, if a lower resolution is desired, chemical moieties or even whole residues. The list of point coordinates in a Cartesian coordinate system provides a complete geometric description of the model point set that can be manipulated easily by the powerful tools of analytical geometry and vector algebra. In a number of applications, however, the actual position and orientation of the molecule is not important, and the structure can be represented by "inner coordinates", which are invariant under translation and rotation. Suitably chosen inner coordinates (such as the "torsional angle space representation" for polypeptides) describe the point set with minimal degrees of freedom, by removing the redundancy inherent in the Cartesian representation. This feature is often exploited in energy minimization studies, where each additional degree of freedom markedly increases the computational cost of the simulation.

Another widely used inner coordinate system is the Distance Geometry representation, whereby a set of points is described in terms of their mutual distances. This system is often redundant (N points can be represented by N(N - 1)/2 interpoint distances or 3N Cartesian coordinates in a three-dimensional Euclidean space, the latter being more economical for N > 7), and is invariant under inversion as well as translation and rotation, rendering mirror images indistinguishable. However, these shortcomings are compensated for by a number of major advantages. First, Distance Geometry has the remarkable property that every geometric feature invariant under transformations in the Euclidean group, such as angles, areas etc., can be expressed in terms of distances alone (Blumenthal, 1961), and consequently Distance Geometry provides an alternative but equally powerful system for the description of geometric objects. Second, the strength of most interactions in physical and chemical systems depends on the distances of the interacting entities, and therefore Distance Geometry offers a very convenient way of representing these interactions in terms of interpoint distances (Taylor, 1993; Aszódi & Taylor, 1994a, b).

While it is straightforward to calculate the interpoint distances of a point set represented by Cartesian coordinates, the inverse problem, often called the Fundamental Problem in Distance Geometry (Crippen & Havel, 1988), is more difficult. This mathematical problem consists in finding an arrangement of points in space such that the interpoint distances correspond to prescribed values. In macromolecular modelling applications, the goal is to generate three-dimensional conformations that satisfy a set of distance restraints obtained from experiments and/or from theoretical considerations. A variety of algorithms have been devised over the years that solve the Fundamental Problem, e.g. the "Method of Alternating Projections (MAP)" (Glunt et al., 1990) or the "Spectral Gradient" (Glunt et al., 1992), but in the present study we concentrate on the now classic technique known as Multidimensional Scaling in the multivariate statistics literature (for an introduction, see the books by Torgerson (1958) or Krzanowski (1988), and references therein) or Molecular Embedding in computational chemistry (Crippen & Havel, 1988; Kuntz et al., 1989; Havel, 1991). The embedding method relies on matrix diagonalization, a computationally expensive operation. We designed a modified embedding algorithm, the Hierarchic Inertial Projection, to increase the efficiency of the original method. The new algorithm was built into a Distance Geometry-based protein modelling program (Aszódi et al., 1995) to test its suitability for macromolecular simulations.

2. METHODS

The Hierarchic Inertial Projection (HIP) employs the "divide and conquer" principle, which has often proved useful in the design of fast numerical algorithms. The method is based on the classic embedding approach (Crippen & Havel, 1988). To distinguish it from the HIP method, which is also an embedding algorithm, the original method will henceforth be referred to as the "traditional projection".
2.1. Traditional projection

The algorithm consists of the following steps.

1. From the $N \times N$ matrix of squared interpoint distances $D = [d_{ij}^2]$, obtain the squared distances of each point from the centroid using Lagrange's Theorem (Lagrange, 1870; Flory, 1969):

$$ d_{i0}^2 = \frac{1}{N} \sum_{k=1}^{N} d_{ik}^2 - \frac{1}{N^2} \sum_{j=2}^{N} \sum_{k=1}^{j-1} d_{jk}^2. \tag{1} $$

[...]

where $\Lambda$ is the diagonal matrix of the eigenvalues of $M$ and the eigenvectors of $M$ are the columns of the modal matrix $W$, the matrix of the position vectors can be expressed as

$$ X = \Lambda^{1/2} W^{T} \tag{4} $$

with the individual coordinates

$$ x_{ki} = \lambda_k^{1/2} w_{ik} \tag{5} $$

where $\lambda_k$ is the $k$th eigenvalue and $w_{ik}$ is the $i$th coordinate of the corresponding $k$th eigenvector. The $D$ largest eigenvalues are used for embedding the points in a $D$-dimensional space. If the distance matrix contains incompatible non-metric entries that do not obey the triangle (or higher-order) inequalities, then some of the eigenvalues will be negative. These non-metric entries are usually filtered out by appropriate preconditioning procedures (Havel, 1991; Aszódi & Taylor, 1994a).

The embedding method generates the positions of the points in a rectangular coordinate system centred at the centroid of the point set, with the coordinate axes corresponding to the axes of inertia. The $(k, l)$-th element of the inertial tensor $T = [t_{kl}]$ is defined as

$$ t_{kl} := \sum_{i=1}^{N} x_{ki} x_{li}, \tag{6} $$

and by substituting the expression for the point coordinates (equation (5)), we obtain

$$ t_{kl} = \sum_{i=1}^{N} x_{ki} x_{li} = \sum_{i=1}^{N} \lambda_k^{1/2} w_{ik} \lambda_l^{1/2} w_{il} = (\lambda_k \lambda_l)^{1/2} \sum_{i=1}^{N} w_{ik} w_{il}. \tag{7} $$
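The traditional projection can be illustrated with a minimal NumPy sketch. It is not the DRAGON implementation. The function name `embed` is hypothetical, and the metric matrix is built with the standard cosine-rule construction of classical embedding, which is an assumption here because that step is not reproduced in this section. Input distances are assumed to be exact and metric, so no preconditioning is attempted.

```python
import numpy as np

def embed(sq_dist, dim=3):
    """Classical metric-matrix embedding of a squared-distance matrix.

    sq_dist: (N, N) symmetric array of squared interpoint distances.
    Returns a (dim, N) array with one column of coordinates per point.
    """
    d2 = np.asarray(sq_dist, dtype=float)
    n = d2.shape[0]
    # Equation (1): squared distance of each point from the centroid.
    # The full symmetric sum counts every pair twice, hence the factor 2.
    d0 = d2.sum(axis=1) / n - d2.sum() / (2.0 * n * n)
    # Metric matrix from the cosine rule (assumed standard construction).
    metric = 0.5 * (d0[:, None] + d0[None, :] - d2)
    evals, evecs = np.linalg.eigh(metric)      # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:dim]        # indices of the largest ones
    lam = np.clip(evals[top], 0.0, None)       # guard against tiny negatives
    # Equation (4): X = Lambda^(1/2) W^T.
    return np.sqrt(lam)[:, None] * evecs[:, top].T
```

For exact distances, the returned coordinates reproduce the input distance matrix, are centred on the centroid, and give a diagonal inertial tensor `x @ x.T`, in line with equation (7). The signs of the axes are arbitrary, so the result may be a mirror image of the original point set.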
Target structure   Number of restraints   HIP           Traditional   Speed gain (%)
3ICB               0                      22.8 ± 1.7    37.1 ± 0.7    147
                   5                      23.4 ± 1.4    37.0 ± 0.7    143
                   10                     23.3 ± 1.5    37.0 ± 0.5    144
3AIT               0                      20.8 ± 3.3    35.0 ± 0.3    155
                   5                      20.5 ± 0.9    35.2 ± 0.4    150
                   10                     20.5 ± 1.1    35.4 ± 0.5    150
2TRX               0                      30.6 ± 0.7    47.5 ± 0.5    162
                   20                     30.2 ± 0.7    47.4 ± 0.5    159
                   30                     28.4 ± 0.7    48.4 ± 0.5    170
Table 2. DRAGON-4 model quality. Averages and standard deviations of the individual RMS values are tabulated as a function of the number of simulated NOE restraints for models produced by the HIP and traditional embedding methods. The figures are based on the statistical analysis of 25 runs for each restraint set. Entries which are significantly (P < 0.01) lower than their counterparts are in boldface.

                                          Average RMS [Å]
Target structure   Number of restraints   HIP            Traditional
3ICB               0                      10.4 ± 2.2     9.60 ± 1.99
                   5                      6.63 ± 1.08    6.84 ± 1.42
                   10                     4.95 ± 0.71    5.32 ± 0.91
3AIT               0                      9.82 ± 0.81    10.3 ± 1.1
                   5                      7.59 ± 1.02    6.97 ± 1.76
                   10                     5.41 ± 0.37    5.61 ± 1.65
2TRX               0                      9.55 ± 1.89    10.9 ± 0.93
                   20                     7.19 ± 1.59    10.7 ± 1.56
                   30                     4.18 ± 0.45    6.63 ± 3.58
auxiliary transformations are ignored. The relative execution time is

$$ \frac{t(c, N)}{t(N)} \approx \frac{1}{c^2} + \frac{k_{sk} c^3}{k_{emb} N^3} \tag{40} $$
from which it is evident that the raw gain represented by the first term is offset as c approaches N, and that there is an optimal value of c at which the ratio in equation (40) is minimal. While it would be theoretically possible to establish an optimal number and layout of clusters for a given embedding problem, this might prove difficult or cumbersome in practical applications. We recommend starting with a few larger clusters rather than many smaller ones and finding the optimal cluster number by experimentation.

The considerations above apply to single-processor computers. On a system with several CPUs the HIP algorithm could be implemented so that the cluster embeddings are performed in parallel on different processors, thus achieving a further speed gain.

4.1.2. Projection quality. The success of a Distance Geometry-based structural modelling approach depends on the ability of the program to handle inconsistent or inaccurate input distances. Since in the HIP algorithm the clusters were represented by their centroids and inertial axes in the skeleton (i.e. overall positions and overall orientations at a "lower resolution"), the method involved extensive averaging of individual scalar products (and hence, indirectly, distances), as evidenced by the preponderance of summation symbols in Section 2.3. Numerical methods based on averaging usually have the capability of filtering out noise to some extent, and therefore the HIP algorithm could be expected to be robust. A rigorous theoretical analysis of error propagation within such a complicated algorithm would be a formidable if not impossible task, but our simulation results obtained with DRAGON-4 indicate that the new projection is at least as reliable as the traditional approach. Moreover, applied to the largest protein in the test set, the HIP algorithm
performed significantly better, showing its suitability for larger-scale simulations.

Projection quality depends on the cluster layout if the input data are noisy: in the extreme, if no reliable intercluster distances are available at all, the HIP algorithm will not be able to deduce the mutual position of the clusters, and handedness consistency between clusters cannot be guaranteed. Since the flexibility of the algorithm facilitates experimentation with different cluster layouts, we recommend choosing cluster boundaries so that the less reliable distances are distributed evenly within and between clusters.

An additional attractive feature of the new approach was that fewer embedding cycles were required to reach 3D when the HIP algorithm was used in DRAGON-4 (Table 1). The reason for this increase in efficiency was that the cluster embeddings performed a partial regularization of the input distances, and the skeleton embedding started from preconditioned data containing fewer non-metric distances.

4.2. Applications

The HIP algorithm can replace traditional embedding in all Distance Geometry applications, especially when the number of points to be embedded is large. Although the algorithm presented in this paper used only one level of hierarchy for the sake of clarity, it is possible to extend it so that the cluster embeddings themselves would be performed by HIP embeddings of subclusters. The projections of very large point sets, structured as a tree of hierarchic clusters, could thus become computationally tractable.

Another powerful application of the algorithm is the embedding of rigid bodies. If the Euclidean coordinates of the points in the clusters are known, the cluster embeddings are not necessary; only the cluster inertial axes and moments have to be pre-computed once.
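The pre-computation for a rigid cluster can be sketched as follows. This is a minimal illustration, not the authors' code. The function name `inertial_frame` is hypothetical, and the "inertial tensor" is taken in the sense of equation (6), i.e. the sum of coordinate products about the centroid.

```python
import numpy as np

def inertial_frame(coords):
    """Centroid, moments and inertial axes of a rigid point cluster.

    coords: (n, 3) array of known Cartesian coordinates.
    Returns (centroid, moments, axes); moments are sorted in descending
    order and the columns of `axes` are the matching unit axes.
    """
    pts = np.asarray(coords, dtype=float)
    centroid = pts.mean(axis=0)
    centred = pts - centroid
    # Tensor of equation (6): t_kl = sum_i x_ki * x_li about the centroid.
    tensor = centred.T @ centred
    moments, axes = np.linalg.eigh(tensor)     # eigenvalues ascending
    order = np.argsort(moments)[::-1]
    return centroid, moments[order], axes[:, order]
```

The sign of each axis returned by the eigensolver is arbitrary, so a consistent handedness convention would have to be imposed separately before the axes are used.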
Afterwards, their mutual orientation is determined by a single skeleton embedding, for which the metric matrix elements are obtained by a procedure similar to the one described in Section 2.3, using the intercluster distances as input data. This modification provides an elegant way to assemble models of large proteins from smaller fragments of known structure (such as α-helices and β-sheets), and to perform macromolecular docking simulations. Work is currently under way in our laboratory to explore these specialized applications.

REFERENCES
Aszódi A. & Taylor W. R. (1994a) Biopolymers 34, 489-505.
Aszódi A. & Taylor W. R. (1994b) Protein Engng. 7, 633-644.
Aszódi A., Gradwell M. J. & Taylor W. R. (1995) J. Mol. Biol. 251, 308-326.
Billeter M., Schaumann T., Braun W. & Wüthrich K. (1990) Biopolymers 29, 695-706.
Blumenthal L. M. (1961) A Modern View of Geometry. Dover, New York.
Crippen G. M. & Havel T. F. (1988) Distance Geometry and Molecular Conformation. Chemometrics Research Studies Press, Wiley, New York.
Crippen G. M., Smellie A. S. & Richardson W. W. (1992) J. Comput. Chem. 13, 1262-1274.
Flory P. J. (1969) Statistical Mechanics of Chain Molecules. Wiley-Interscience, New York.
Glunt W., Hayden T. L., Hong S. & Wells J. (1990) SIAM J. Matrix Anal. Appl. 11, 589-600.
Glunt W., Hayden T. L. & Raydan M. (1992) J. Comput. Chem. 14, 114-120.
Green B. F. (1952) Psychometrika 17, 429-440.
Havel T. F. (1991) Prog. Biophys. Mol. Biol. 56, 43-78.
Katti S. K., LeMaster D. M. & Eklund H. (1990) J. Mol. Biol. 212, 167-184.
Krzanowski W. J. (1988) Principles of Multivariate Analysis: A User's Perspective. Clarendon Press, Oxford.
Kuntz I. D., Thomason J. F. & Oshiro C. M. (1989) Meth. Enzymol. 177, 159-204.
Lagrange J. L. (1870) Oeuvres, volume 5. Paris.
McLachlan A. D. (1979) J. Mol. Biol. 128, 49-79.
Press W. H., Teukolsky S. A., Vetterling W. T. & Flannery B. P. (1992) Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, Cambridge.
Sibson R. (1978) J. Roy. Statist. Soc. B. 40, 234-238.
Szebenyi D. M. E. & Moffat K. (1986) J. Biol. Chem. 261, 8761-8777.
Taylor W. R., Thornton J. M. & Turnell W. G. (1983) J. Mol. Graphics 1, 30-38.
Taylor W. R. (1988) J. Molec. Evol. 28, 161-169.
Taylor W. R. (1993) Protein Engng. 6, 593-604.
Torgerson W. S. (1958) Theory and Methods of Scaling. Wiley, London.
Young G. & Householder A. S. (1938) Psychometrika 3, 19-22.