Proc. Natl. Acad. Sci. USA Vol. 92, pp. 3288-3292, April 1995 Biophysics

A sampling problem in molecular dynamics simulations of macromolecules

(protein flexibility/x-ray diffuse scattering/crystallography/phase space/dynamical systems)

JAMES B. CLARAGE†‡§, TOD ROMO†‡, B. KIM ANDREWS†‡¶, B. MONTGOMERY PETTITT†¶, AND GEORGE N. PHILLIPS, JR.†‡

†Keck Center for Computational Biology and ‡Department of Biochemistry and Cell Biology, Rice University, Houston, TX 77251-1892; and ¶Department of Chemistry, University of Houston, Houston, TX 77204-5641

Communicated by Donald L. D. Caspar, Florida State University, Tallahassee, FL, December 23, 1994 (received for review October 17, 1994)

ABSTRACT Correlations in low-frequency atomic displacements predicted by molecular dynamics simulations on the order of 1 ns are undersampled for the time scales currently accessible by the technique. This is shown with three different representations of the fluctuations in a macromolecule: the reciprocal space of crystallography using diffuse x-ray scattering data, real three-dimensional Cartesian space using covariance matrices of the atomic displacements, and the 3N-dimensional configuration space of the protein using dimensionally reduced projections to visualize the extent to which phase space is sampled.

The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. §1734 solely to indicate this fact.

§To whom reprint requests should be addressed at: Department of Biochemistry and Cell Biology, P.O. Box 1892, Rice University, Houston, TX 77251.

Experimental Tests of Molecular Dynamics

Classical molecular dynamics is a computational technique to simulate the behaviors, both thermodynamic and dynamic, of an atomic model. In the long time limit, the sampled trajectory yields detailed information about the approximate model Hamiltonian. Shorter trajectories yield incomplete information and confound comparison of the model with experiment. Relatively short trajectories of biological macromolecules, on the order of 1 ns, have been used with reasonable success for a number of experimental comparisons, for example, crystallographic B values (1) from x-ray scattering, incoherent structure factors from neutron scattering (2, 3), and order parameters from NMR spectroscopy (4). Because these experimental techniques probe the movement of single atoms, we can conclude that the amplitude and frequency of displacement for particular individual atoms in the protein are being simulated accurately. However, on the basis of such local results we cannot also conclude that these same molecular dynamics calculations accurately simulate, or converge on, the correlation between two or more atoms in a protein (e.g., whether one part of a protein tends to act concertedly with another). The state of our present models, then, is not completely tested by our experimental methods. In this article we consider a particular test set to measure the convergence of two-body properties in protein simulations.

An experimental technique that is sensitive to statistics between pairs of atoms, and thus to collective behavior in macromolecules, is diffuse x-ray scattering (5, 6). All x-ray intensity on a detector besides the sharp Bragg peaks is typically ignored in structural studies. However, this remaining intensity,

\begin{align}
I_{\mathrm{Diffuse}} &= I_{\mathrm{Total}} - I_{\mathrm{Bragg}} \tag{1} \\
&= \langle |\mathrm{FT}\,\rho|^{2} \rangle - |\mathrm{FT}\,\langle \rho \rangle|^{2}, \tag{2}
\end{align}

the difference between the total and Bragg scattering, contains valuable information about variations from the average structure. Note that whereas Bragg diffraction is the Fourier spectrum of the average, or mean, electron density, diffuse diffraction is the variance of the electron density's Fourier components. In a sense, then, study of diffuse scattering from a crystal is study of a higher moment in the statistical distribution of matter in a macromolecule. Standard Bragg analysis yields the mean positions, $\langle r_i \rangle$, and displacements,

$$I_{\mathrm{Bragg}} \rightarrow \langle \delta_i^{2} \rangle, \tag{3}$$

where $\delta_i$ is the displacement in the atomic coordinate $r_i$ ($i = 1, \ldots, 3N$; $N$ = number of atoms) relative to its average value $\langle r_i \rangle$. Diffuse intensity contains information on the mean positions and the higher-order moment in the displacements,

$$I_{\mathrm{Diffuse}} \rightarrow \langle \delta_i \delta_j \rangle, \tag{4}$$

the cross-correlation between displacements in atomic coordinates $i$ and $j$.

Diffuse scattering experiments on several proteins and nucleic acids (7-10) demonstrate that correlations in displacements of neighboring atoms fall off roughly exponentially with the distance between the atoms, with a characteristic length of 4-8 Å. That is, $\langle \delta_i \delta_j \rangle \propto \exp(-|r_i - r_j|/\gamma)$, where the correlation length $\gamma$ is less than the dimension of the protein molecule. Thus, in analogy with liquid structures, where the correlations in atomic positions decay approximately exponentially with pair separation, the motion in proteins is often called liquid-like, in the sense that correlations in atomic displacements fall off rapidly with distance.

As a test of whether the pair correlations in atomic movements predicted by molecular dynamics are reliable over a particular time scale, one can compute the diffuse scattering from a trajectory using the set of simulated conformations for the ensemble average in Eq. 2; this model intensity can then be compared with the actual data. Though this test was proposed several years ago by Caspar et al. (7), it was not until recently that it was carried out (11). The results were not encouraging. Diffuse scattering intensity computed from a 600-ps dynamics trajectory on lysozyme does not reproduce the observed intensity from this system. Neither is there agreement between theory and experiment for a 150-ps trajectory of fully solvated myoglobin (12). Fig. 1 shows the comparison for myoglobin.

There are many ways one might explain the discrepancy. First, the calculations on lysozyme and myoglobin were not performed in the crystalline environment but on a single molecule. Thus interactions and correlations between proteins, and therefore the diffuse scattering, will be physically different in the simulated system from the actual one. Also,


FIG. 1. Comparison of observed (a) and calculated (b) x-ray diffuse scattering from P6 myoglobin crystals (a = b = 91.20 Å, c = 45.87 Å, α = β = 90°, γ = 120°). The patterns are shown as two-dimensional Ewald sections through three-dimensional reciprocal space since this is how the data were actually recorded (on imaging plates with 1.54-Å radiation). Both patterns are identically oriented down the 6-fold c* axis of the crystal. Only the asymmetric sextant of each pattern is displayed here, out to a resolution of 0.4 Å⁻¹, corresponding to 2.5-Å spacing. The reciprocal lattice constants normal to the 6-fold have magnitude 0.011 Å⁻¹, so that roughly 10 reciprocal lattice points would fit along each of the arrows denoting a* and b* if the Bragg reflections were included in these patterns. The observed diffuse scattering was processed to remove Bragg peaks and halos (7, 12) and thus to isolate the diffuse component arising from intramolecular protein motions. The calculated diffuse scattering was generated from the molecular dynamics trajectory using Eq. 2 and the Ewald section methods described in refs. 7 and 12. Since the myoglobin simulation was performed on only one molecule, crystal symmetry had to be imposed on the theoretical pattern to produce a meaningful comparison with the data; this was achieved by P6 averaging the diffuse intensities from the isolated molecule, which dynamically amounts to the approximation that correlations across crystal neighbors are negligible. The details of the molecular dynamics simulation appear in ref. 13.

since the diffuse scattering is determined not only by correlations but also by the underlying average structure (8), the diffuse patterns may not match because the crystal environment will enforce a somewhat different average structure (typically 1-2 Å rms) from the simulated structure of an isolated molecule. Yet another possible source of discrepancy between the model and actual system is that today's typical molecular dynamics calculations have been carried out for less than nanosecond time scales, while diffuse scattering experiments take seconds to minutes of laboratory integration. Since the rms differences are relatively small, the problem of sampling time is worth looking into. We will consider various measures of sampling in the next section.

Sampling Reciprocal Space. Fig. 2 a and b show the calculated diffuse scattering from the two halves of a 150-ps simulation of solvated myoglobin (13). This theoretical diffuse intensity, $I_{\mathrm{Diffuse}} = \langle |\mathrm{FT}\,\rho|^{2} \rangle - |\mathrm{FT}\,\langle \rho \rangle|^{2}$, was computed from the trajectory for two separate samples: the first and second halves of the trajectory after equilibration. Each sample was 50 ps in length and consisted of 1000 myoglobin configurations equally spaced by 50 fs. Care was taken to omit the initial portion (first 50 ps) of the simulation, when the atomic positions were relaxing from their starting structure, so that spurious correlations are not included in the diffuse scattering. The two intensity distributions are different. Evidently, the diffuse scattering has not converged to a unique result for this simulation. There are at least two ways scattered intensity might fail to converge: either the structure does not have a well-defined average and is drifting, or the system has not adequately sampled correlations and atomic displacements.


FIG. 2. Planar slices through the origin of the variance in the molecular transform for myoglobin, calculated from the first half of the dynamics trajectory (a) and the second half (b). The displayed vectors a* and b* are the axes reciprocal to the periodic cell in which the simulation was performed (a = b = 56.32 Å, c = 44.45 Å, α = β = γ = 90°). The poor correspondence between these two patterns implies that atomic displacement correlations have not converged for the simulation. From the standpoint of x-ray scattering measurements, the diffuse component of the diffraction has not yet converged. (c and d) Computed for the same conditions as a and b except using the average of the molecular transform. Though the two patterns are not identical, there is significant agreement between features in both patterns, implying that local properties of the average structure have converged for this molecular dynamics simulation. In experimental terms, the Bragg diffraction has converged.

Care was taken to only use the latter part of the trajectory where the rms coordinate deviations from the starting structure had converged (13). As an independent check we computed the average structure factors, $I_{\mathrm{Bragg}} = |\mathrm{FT}\,\langle \rho \rangle|^{2}$, for the two halves of the equilibrated trajectory. Fig. 2 c and d clearly show that this quantity has converged. Since average structure factors are functions of the mean positions and displacements (crystallographic B values) of individual atoms, this 150-ps dynamics simulation is adequate to calculate the lowest moment of the atomic distributions. The sampling problem is with higher moments, such as pair correlations.
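The split-half test described in this section can be illustrated with a minimal Python sketch. It is not the authors' code: the point-atom molecular transform with unit scattering factors, the function names, and the synthetic trajectory (fast local noise plus one slow collective drift) are all assumptions made for the example. It computes the Bragg term $|\mathrm{FT}\langle\rho\rangle|^2$ and the diffuse term $\langle|\mathrm{FT}\,\rho|^2\rangle - |\mathrm{FT}\langle\rho\rangle|^2$ for each half of a trajectory and correlates the halves.

```python
import numpy as np


def structure_factors(traj, qs):
    """Point-atom molecular transform F(q) for every frame.

    traj: (n_frames, n_atoms, 3) coordinates in angstroms.
    qs:   (n_q, 3) scattering vectors in inverse angstroms.
    Unit scattering factors are an assumption of this sketch.
    """
    phase = 2.0 * np.pi * np.einsum("fad,qd->faq", traj, qs)
    return np.exp(1j * phase).sum(axis=1)  # shape (n_frames, n_q)


def bragg_intensity(F):
    """|<F>|^2: squared modulus of the ensemble-averaged transform."""
    return np.abs(F.mean(axis=0)) ** 2


def diffuse_intensity(F):
    """<|F|^2> - |<F>|^2: variance of the transform over the ensemble."""
    return (np.abs(F) ** 2).mean(axis=0) - bragg_intensity(F)


def split_half_correlation(F, intensity):
    """Pearson correlation of an intensity computed from each trajectory half."""
    half = F.shape[0] // 2
    first, second = intensity(F[:half]), intensity(F[half:])
    return np.corrcoef(first, second)[0, 1]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_frames, n_atoms = 400, 30
    mean_pos = rng.uniform(0.0, 20.0, size=(n_atoms, 3))
    # Hypothetical motion: fast local noise plus one slow collective drift.
    fast = rng.normal(scale=0.3, size=(n_frames, n_atoms, 3))
    mode = rng.normal(size=(n_atoms, 3))
    slow = np.linspace(-1.0, 1.0, n_frames)[:, None, None] * mode
    traj = mean_pos + fast + slow
    qs = rng.uniform(-0.2, 0.2, size=(200, 3))
    F = structure_factors(traj, qs)
    print("Bragg split-half r:  ", split_half_correlation(F, bragg_intensity))
    print("diffuse split-half r:", split_half_correlation(F, diffuse_intensity))
```

The split-half correlation reduces each comparison to a single number; the visual comparison of Fig. 2 a and b against Fig. 2 c and d plays the same role in the article.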


Real Space. Instead of focusing on diffuse scattering, or the Fourier spectrum of atomic displacement correlations, the sampling problem can also be viewed directly in real space using the full atomic displacement covariance matrix (14),

$$\sigma_{ij} = \langle \delta_i \delta_j \rangle. \tag{5}$$

This matrix contains the equal-time pair correlations between the coordinate displacements of all atoms in the molecule, averaged over the computed trajectory. Fig. 3 shows the covariance matrix for each of the two halves of the trajectory. Note that the two matrices differ. This difference quantifies the sampling problem for higher-order properties; the pair correlations have not converged. The simulation has yet to sample enough of the Boltzmann distribution to determine moments beyond the mean. Since the effective initial conditions for the two halves of the trajectory are different, this result may be stated equivalently as: two short, independent dynamics runs can produce different answers for questions pertaining to correlations in atomic displacements. The Fourier and real space results suggest that only by running much longer molecular dynamics simulations will the calculations yield unique, and therefore meaningful and testable, predictions for the higher moments of the distribution.

Configuration Space. As demonstrated, the lowest-order moments of protein fluctuations, $\langle \delta_i^{2} \rangle$, converge for modest length molecular dynamics simulations given some care in preparing a thermodynamic state. This is related to the rough correspondence between crystallographic and calculated mean-square displacements (1). For the moment, let us consider the sampling problem in terms of the dynamics of single atoms. Fig. 4a shows the path of one particular Cα atom during half of the simulation. The mean-square displacement (i.e., the variance of this distribution) appears quite well defined; that is, if one integrated this atom's trajectory further it would most likely not alter the resulting statistics. By sweeping out a representative sample of its own allowable states, the atom's distribution and therefore thermodynamic averages are convergent. In essence, the particle trajectory has located and filled its attractor. Obviously, the completeness of the sampling is a consequence of the low dimensionality (d = 3) of a single particle's configuration space.

How far is the entire protein from filling in its attractor? To address this we move to the 3N-dimensional configuration space of the macromolecule, where each state of the N atoms in the system is represented by a single point. The global trajectory of the protein is then a path in this hyperspace, in complete analogy with a single particle's trajectory being a path in 3-space as in Fig. 4a. Cuts, phase portraits, and other reductions of the trajectory are routinely used in analysis of strange attractors in dynamical systems (15), where momenta may also be included so that 6N degrees of freedom can be studied. The problem that arises in trying to construct representations for proteins is the sheer size of the phase space.

To solve this problem, we apply principal component analysis, as used in multivariate statistics (16) and which we have applied to visualize conformational clustering in time-average ensembles (17, 18). Given a quantity that is a function of M variables, an observation can be considered a point in an M-dimensional space; a set of observations is then a swarm of points in this space. By diagonalizing the covariance matrix of all pairs of observations, one obtains the principal components (orthogonal eigenvectors) of the distribution or, equivalently, the moments of inertia of the distribution (19). Projecting the set of observations from their canonical M-dimensional space down to a subspace spanned by the largest few of these principal components then gives a dimensionally reduced picture, or shadow, of the data. In general, projecting a high-dimensional pattern onto a lower dimension produces a tangled mess, owing to overlapping of features. But as long as some moments dominate the distribution (i.e., the swarm tends to populate some directions more than others because of correlations), principal component projection gives a relatively interpretable representation.

To construct such a configuration portrait for a protein we treat each point on the trajectory as an observation in a 3N-dimensional configuration space. If we choose the origin of space as the centroid of the distribution (corresponding to each atom at its average position), then the covariance matrix of the observations is simply the atomic displacement correlation matrix in Eq. 5. Diagonalizing this matrix and projecting the trajectory along the three most dominant eigenvectors gives a dimensionally reduced view of protein dynamics. The configuration portrait for the molecular dynamics of myoglobin is shown in Fig. 4b. Note how little of configuration space the trajectory has sampled, in particular how the trajectory has failed to close back on itself. Compared with the single atom portrait in Fig. 4a, it appears that the protein simulation as a whole has sampled but a section of arc on its underlying attractor.
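The configuration portrait described in the text (centering the trajectory on its centroid, diagonalizing the displacement covariance matrix, and projecting onto a few eigenvectors) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `configuration_portrait`, its `components` parameter, and the use of a dense `numpy.linalg.eigh` decomposition are choices made for the example and are practical only for small 3N.

```python
import numpy as np


def configuration_portrait(traj, components=(0, 1, 2)):
    """Project a trajectory onto selected principal components.

    traj: (n_frames, n_atoms, 3) coordinates.
    components: indices into the eigenvectors, ranked by decreasing
        eigenvalue (0 is the most dominant direction).
    Returns the (n_frames, len(components)) projection and all eigenvalues
    in decreasing order.
    """
    x = traj.reshape(traj.shape[0], -1)       # each frame is a point in 3N-space
    delta = x - x.mean(axis=0)                # origin at the centroid
    cov = delta.T @ delta / x.shape[0]        # sigma_ij = <delta_i delta_j>
    evals, evecs = np.linalg.eigh(cov)        # eigh returns ascending eigenvalues
    order = np.argsort(evals)[::-1]           # re-rank so index 0 dominates
    evals, evecs = evals[order], evecs[:, order]
    return delta @ evecs[:, list(components)], evals


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical stand-in for a trajectory: a random walk in 3N dimensions.
    steps = rng.normal(scale=0.05, size=(500, 10, 3))
    traj = np.cumsum(steps, axis=0)
    dominant, evals = configuration_portrait(traj, components=(0, 1, 2))
    minor, _ = configuration_portrait(traj, components=(17, 18, 19))
    print("extent along dominant components:", np.ptp(dominant, axis=0))
    print("extent along 18th-20th components:", np.ptp(minor, axis=0))
```

Plotting `dominant` as a three-dimensional curve gives the kind of portrait discussed for Fig. 4b; passing `components=(17, 18, 19)` selects the 18th through 20th components, which requires 3N of at least 20.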


FIG. 3. Atomic displacement covariance matrix for the first half (a) and the second half (b) of the myoglobin trajectory. The difference between the two matrices shows directly that pair correlations between atomic displacements have not converged for the simulation. (Positive matrix elements are shown as gray shades, increasing correlation values coded from white to black.)
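The comparison shown in Fig. 3 can be sketched as follows. This is a hedged illustration rather than the procedure actually used for the figure: the function names are hypothetical, each half is centered on its own mean structure (an assumption), and agreement is summarized by a Pearson correlation over off-diagonal elements, a metric the article does not itself use.

```python
import numpy as np


def displacement_covariance(traj):
    """Equal-time covariance <delta_i delta_j> of all 3N coordinates.

    traj: (n_frames, n_atoms, 3) coordinates; returns a (3N, 3N) matrix.
    """
    x = traj.reshape(traj.shape[0], -1)
    delta = x - x.mean(axis=0)            # displacements from the mean structure
    return delta.T @ delta / x.shape[0]


def half_trajectory_covariances(traj):
    """Covariance matrices of the first and second halves of a trajectory."""
    half = traj.shape[0] // 2
    return displacement_covariance(traj[:half]), displacement_covariance(traj[half:])


def offdiagonal_agreement(c1, c2):
    """Pearson correlation between the off-diagonal (pair) elements."""
    iu = np.triu_indices_from(c1, k=1)
    return np.corrcoef(c1[iu], c2[iu])[0, 1]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical trajectory: independent Gaussian fluctuations per coordinate.
    traj = rng.normal(size=(200, 20, 3))
    c1, c2 = half_trajectory_covariances(traj)
    print("diagonal agreement:    ", np.corrcoef(np.diag(c1), np.diag(c2))[0, 1])
    print("off-diagonal agreement:", offdiagonal_agreement(c1, c2))
```

Separating diagonal from off-diagonal agreement mirrors the article's distinction between single-atom mean-square displacements and pair correlations.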


Fig. 4b also shows the trajectory projected along less dominant principal components, those with the 18th, 19th, and 20th highest eigenvalues as opposed to the three highest. This portrait (and in general any random projection along three arbitrary orthogonal directions) appears as a tiny ball of knotted string compared to the previous projection. Geometrically, this corresponds to visualizing the trajectory from a viewpoint where most of the distribution lies in directions outside the embedding space, leading to overlap of the trajectory's most dominant features. Dynamically, since the principal components are the eigenvectors of the displacement covariance matrix, they correspond to the quasi-harmonic modes (20, 21) of the system; thus, the contrast between the two perspectives of the trajectory embodies the fact that some modes contribute more than others to the atomic displacement amplitudes (18, 22, 23). These two portraits also make clear that the higher-frequency modes are not the cause of the sampling problem; they appear to fill fairly densely their contribution to the attractor, just as the single particle portrait does in Fig. 4a, which is itself yet another projection of the same trajectory, this time onto the subspace defined by the particle's own orthogonal Cartesian coordinates.

FIG. 4. (a) Trajectory of an arbitrary Cα atom in the myoglobin simulation, showing that the particle adequately samples its own local three-dimensional configuration space. (b) Trajectory of the entire protein in 3N-dimensional configuration space, viewed using two different configuration portraits as described in the text. The portrait that appears as a single meandering arc is a projection onto the three-dimensional subspace that best stretches out features in the trajectory; the trajectory's sparse filling of configuration space and failure to close back on itself illustrate the sampling problem in present nanosecond time-scale simulations of protein dynamics. The tinier closed path in the center is the trajectory projected onto a less optimal three-dimensional subspace from the standpoint of visualization, one where features in the trajectory are poorly separated and overlapping.

How Long Is Long Enough?

We have investigated what effect increasing the simulation length has on the phase portrait, using a 500-ps trajectory of solvated myoglobin run with CHARMm version 23f3 and current parameters (24). This trajectory does not begin closing back on itself any more than the previous trajectory computed some time ago with AMBER, though there is now more obvious clustering along the arc, giving the overall appearance of beads on a string. Given more time, the protein simply explores new regions of phase space. This result might have been anticipated, since fluctuations upon which the correlations are based are known to converge as the inverse square root of the trajectory length (25).

The ergodic theorem tells us that eventually a simulation will achieve a Boltzmann distribution. It does not tell us how long this might take. Evidently, 150 ps and 500 ps are not enough. By examining the literature we have found evidence that even much longer simulations fall far short. For example, in an effort to separate the dynamics of a nanosecond lysozyme simulation into its harmonic and anharmonic components, Amadei et al. (23) project atomic coordinates along what are essentially various quasi-harmonic modes for the trajectory. Though the authors were concerned with looking for modes that dominate the molecular dynamics, if we reexamine their results in light of the configuration portraits developed above, we find that the published projection onto the lowest dominant modes (e.g., Amadei's figure 9a) traces out but a short arc, implying nanosecond calculations are still far from converged for pair correlations.

Discussion

We have used three different arguments to elucidate a sampling problem in present molecular dynamics simulations on single macromolecules evolving for nanosecond time scales. Although statistics on single particle fluctuations are reasonably predicted (e.g., mean-square displacements of individual atoms), higher-order properties arising from collective phenomena are not well determined. At present, different short simulations give different answers.

Given this sampling dilemma, it may be that normal-mode approaches to macromolecular dynamics would provide better answers to questions concerning long-term fluctuations, if only for the reason that such methods are analytic and thus suffer no sampling problem. That is, even though the harmonic potential function may be a poor approximation to reality, perhaps this error would be made up for by the complete sampling possible. There is some evidence for this conjecture. Recent crystallographic results (11) on lysozyme indicate that normal modes give a better fit to diffuse scattering than does a 600-ps molecular dynamics run. Perhaps this superior fit to the experimental data is a consequence of the perfect sampling, or precision, of normal modes and not so much their underlying accuracy.

The better fit by normal modes to diffuse scattering data may not imply that atomic fluctuations are correlated across the entire protein molecule. Karplus and co-workers (26) convincingly showed that normal modes can give rise to liquid-like correlations in atomic displacements. They calculated the radially averaged displacement correlations from both molecular dynamics and normal modes, i.e., $\langle \delta_i \delta_j \rangle$ as a function only of the radial separation between atoms $i$ and $j$. There is significant scatter in these plots (possibly a result of sampling, for the normal mode plot has much less scatter), but it appears that fitting a curve to the data would give correlations that fall off fairly rapidly ($\gamma$ = 4-8 Å) with pair separation between atoms in the structure. That is, though individual normal modes are perfectly self-correlated, the random-phase superposition of a spectrum of modes can give rise to liquid-like disorder in the atomic movements. This result implies that at least in an "averaged" sense, molecular dynamics is capable of reproducing the correct qualitative character of displacement correlations, namely that they tend to decay with a correlation length of 4-8 Å. It is the details of correlation between a specific pair of atoms where the simulation methods appear to fall short, owing to inadequate sampling.

Although this study points out a sampling problem in molecular dynamics calculations on macromolecules confined to the sub-nanosecond regime, it does not by itself tell us how much longer to simulate in order to achieve convergent answers for collective behaviors of the molecule (arising from the higher moments such as pair correlations). In light of other experimental probes of displacement correlations, we might speculate 100 ns as an upper bound for the convergence time. Time-resolved Mössbauer diffraction studies from protein crystals (27) indicate that the movements giving rise to the liquid-like component of the diffuse scattering have relaxation times under 100 ns. In terms of configuration portraits, then, this time might be the period needed for a protein trajectory to orbit back to a region of phase space near the initial condition and thus begin defining a Boltzmann distribution of conformations.

This discussion of course assumes that the trajectory remains constrained to some bounded region of phase space. Though this is true for the experimental systems studied through x-ray diffraction experiments, and implicitly the case for all normal mode calculations, in general macromolecules populate an equilibrium of folded and unfolded states.
Even near physiological conditions, where the equilibrium lies far toward the folded state, there is still some probability that a protein or nucleic acid is unfolded. Without access to these long time-scale folding processes, so critical for the definition of the system's structural moments, molecular dynamics is not viewing a converged ensemble. Though in some respects trivial, this argument is often unappreciated by both theorists and experimenters trying to make connections between predictions of molecular dynamics and laboratory results.

We thank Mike Clarage for some insightful discussions. This work was supported by National Institutes of Health Grant AR40252, the Robert A. Welch Foundation, and the W. M. Keck Center for Computational Biology. J.B.C. is the recipient of National Research Service Award Postdoctoral Fellowship GM-13945 from the National Institutes of Health. Further media associated with this document and others by the authors are available in hypertext via the World Wide Web at http://www-bioc.rice.edu/Bioch/Phillips/Papers.

1. Phillips, G. N. (1990) Biophys. J. 57, 381-383.
2. Doster, W., Cusack, S. & Petry, W. (1989) Nature (London) 337, 754-756.
3. Smith, J. C. & Kneller, G. R. (1993) Mol. Simul. 10, 363-375.
4. Eriksson, M. A. L., Berglund, H., Hard, T. & Nilsson, L. (1993) Proteins 17, 375-390.
5. James, R. W. (1948) The Optical Principles of the Diffraction of X-Rays (Bell, London).
6. Boylan, D. & Phillips, G. N. (1986) Biophys. J. 49, 76-78.
7. Caspar, D. L. D., Clarage, J., Salunke, D. M. & Clarage, M. (1988) Nature (London) 332, 659-662.
8. Clarage, J., Clarage, M., Sweet, R., Phillips, W. & Caspar, D. L. D. (1992) Proteins 12, 145-157.
9. Chacko, S. & Phillips, G. N., Jr. (1992) Biophys. J. 61, 1256-1266.
10. Kolatkar, A., Clarage, J. B. & Phillips, G. N., Jr. (1994) Acta Crystallogr. Sect. D 50, 210-218.
11. Faure, P., Micu, A., Perahia, D., Doucet, J., Smith, J. C. & Benoit, J. P. (1994) Nat. Struct. Biol. 1, 124-128.
12. Clarage, J. B. & Phillips, G. N., Jr. (1995) Methods Enzymol., in press.
13. Findsen, L. A., Subramaniam, S., Lounnas, V. & Pettitt, B. M. (1991) Principles of Molecular Recognition, ed. Buckingham, A. D. (Chapman & Hall, London).
14. McCammon, A. J. & Harvey, S. C. (1986) Dynamics of Proteins and Nucleic Acids (Cambridge Univ. Press, Cambridge, U.K.).
15. Eckmann, J. P. (1981) Rev. Mod. Phys. 53, 643-654.
16. Kendall, M. & Stuart, A. (1966) The Advanced Theory of Statistics (Hafner, New York), Vol. 3, Chap. 43.
17. Clarage, J. B. & Phillips, G. N., Jr. (1994) Acta Crystallogr. Sect. D 50, 24-36.
18. Romo, T., Clarage, J. B., Sorensen, D. C. & Phillips, G. N., Jr. (1995) Proteins, in press.
19. Born, M. & Huang, K. (1954) Dynamical Theory of Crystal Lattices (Clarendon, Oxford).
20. Levitt, M., Sander, C. & Stern, P. (1985) J. Mol. Biol. 181, 423-447.
21. Garcia, A. E. (1992) Phys. Rev. Lett. 68, 2696-2699.
22. Diamond, R. (1990) Acta Crystallogr. Sect. A 46, 425-435.
23. Amadei, A., Linssen, A. B. M. & Berendsen, H. J. C. (1993) Proteins 17, 412-425.
24. McKerrel, A. & Karplus, M. (1994) J. Am. Chem. Soc., in press.
25. Allen, M. P. & Tildesley, D. J. (1987) Computer Simulations of Simple Liquids (Clarendon, Oxford).
26. Ichiye, T. & Karplus, M. (1991) Proteins 11, 205-217.
27. Nienhaus, G. U., Heinzl, J., Huenges, E. & Parak, F. (1989) Nature (London) 338, 665-666.