Statistical Inference for Big Data Problems in Molecular ... - CiteSeerX

Statistical Inference for Big Data Problems in Molecular Biophysics

Arvind Ramanathan (1), Andrej Savol (2,4), Virginia Burger (2,4), Shannon Quinn (2,4), Pratul K. Agarwal (3), Chakra Chennubhotla (4)

(1) Computational Data Analytics Group, Computer Science and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN 37830
(2) Joint Carnegie Mellon University-University of Pittsburgh Ph.D. Program in Computational Biology
(3) Computational Biology Institute, Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN 37830
(4) Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA 15260

Abstract

We highlight the role of statistical inference techniques in extracting biological insights from long time-scale molecular simulation data. Technological and algorithmic improvements in computation have brought molecular simulations to the forefront of techniques for investigating the basis of living systems. While these longer and increasingly complex simulations, now reaching petabyte scales, promise a detailed view into microscopic behavior, teasing out the important information has become a challenge in its own right. Mining this data for important patterns is critical to automating therapeutic intervention discovery, improving protein design, and fundamentally understanding the mechanistic basis of cellular homeostasis.

1 Molecular Biophysics

Over the last 30 years, biophysicists have taken advantage of advances in computing power to run increasingly detailed simulations of biomolecules in order to investigate the mechanistic basis of their function. The structure, dynamics and function of biological macromolecules such as proteins, deoxyribonucleic and ribonucleic acids (DNA/RNA), carbohydrates and lipids control cellular function, and thus life. Proteins, the workhorses of the cell, are long polymers of amino-acid residues which fold into three-dimensional structures to perform their function. The biological function controlled by the dynamical interactions between various biomolecules can occur at multiple time-scales, from femtoseconds up to microseconds, milliseconds, seconds and beyond, spanning more than 15 orders of magnitude. Molecular dynamics (MD) simulations provide insights into the dependence of biological function on interactions at multiple length and time scales. In this paper, we focus on fully atomistic simulations of proteins and other biomolecules in solution, as they best represent the cellular environment.

MD simulations are governed by a potential energy function that includes both bonded and non-bonded interaction terms. The negative gradient of this energy function defines the force applied to every atom in the molecule. At each time step, Newton's equations of motion are integrated numerically to extend the trajectory. A time step on the order of a femtosecond ($10^{-15}$ s) is necessary for capturing the smallest vibrations of interest, whereas biologically interesting events typically occur at microsecond ($10^{-6}$ s) and longer time scales. With improvements in sampling techniques and available hardware resources, MD simulations have successfully crossed the millisecond ($10^{-3}$ s) time-scale barrier [1] and have provided novel insights into the functioning of biomolecular systems.
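The integration step described above can be illustrated with a minimal sketch. It assumes a single particle in a one-dimensional harmonic potential and uses the velocity Verlet scheme; the potential, the parameter values and all function names are illustrative assumptions, not the force field or integrator of any particular MD package.

```python
import math


def harmonic_force(x, k=1.0):
    """Force as the negative gradient of the toy potential U(x) = 0.5 * k * x**2."""
    return -k * x


def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate Newton's equations of motion; returns a list of (x, v) per step."""
    trajectory = [(x, v)]
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # complete the velocity update
        trajectory.append((x, v))
    return trajectory


def total_energy(x, v, mass=1.0, k=1.0):
    """Kinetic plus potential energy for the toy harmonic system."""
    return 0.5 * mass * v * v + 0.5 * k * x * x


# Reduced (dimensionless) units: start at x = 1 with zero velocity.
traj = velocity_verlet(x=1.0, v=0.0, force=harmonic_force,
                       mass=1.0, dt=0.01, n_steps=1000)
x_end, v_end = traj[-1]
energy_drift = abs(total_energy(x_end, v_end) - total_energy(*traj[0]))
```

In a real simulation the same loop runs over every atom with a far more complex force evaluation, which is why femtosecond time steps make millisecond trajectories so expensive to generate and so large to store.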

Assuming that a typical protein has O(1000) atoms, representing it in fully atomistic detail in Cartesian space (x, y, z) requires at least 3 × O(1000) single-precision numbers per conformation. Even if one stores the output from an MD simulation only at 100 ps ($10^{-10}$ s) intervals, and assuming regular access to millisecond-long simulations in the near future, typical datasets would have somewhere between $10^{12}$ and $10^{15}$ conformations, which could easily occupy several terabytes of storage. Indeed, datasets of this order are being made available online by several research groups, and one can certainly expect more of them in the near future. Such large-scale (and potentially large-volume) datasets from molecular simulations pose a significant big-data challenge.

The purpose of MD simulations is to reveal the mechanistic basis of protein function. Traditionally, biophysicists have relied on the availability of order parameters (e.g., experimental observables such as dihedral angles, distance constraints between different parts of the protein or biomolecule, or other thermodynamic measurements) as a means of analyzing these simulations. These 'knowledge-based parameters' are often difficult to obtain a priori, and can be considered a biological filter applied to the large dataset, from which only a small number of functionally relevant dimensions are drawn. Given the experts' knowledge, these parameters were sufficient to analyze shorter time-scale simulations. However, with the growth in MD datasets (reaching petabyte scale), there is a need to develop automated tools that can discover potential order parameters as well as reveal novel (hitherto unknown) features of the complex energy landscape. Machine learning and statistical inference techniques offer new avenues for elucidating novel relationships in the conformational landscape. Hence, our goal here is to bridge machine learning with molecular biophysics in the hope of discovering new biology.

2 Computational Challenges

We ask how statistical inference tools can help sift through petascale simulation data to reveal the organizing principles of conformational landscapes. The challenges include: (1) building statistical insights into the time-dependent structural changes that the protein undergoes in the course of a simulation; (2) exploiting these structural dependencies to build a biophysically/biochemically relevant low-dimensional representation of the simulation data; (3) using the low-dimensional representations to generate kinetically and energetically coherent conformational sub-states; and finally (4) drawing causal relations from time-dependent structural changes within each conformational sub-state to implicate functionally relevant residues. We discuss the first three issues in more detail below.

Building statistical insights into the time-dependent structural changes that a protein undergoes in the course of a simulation

Naturally occurring ensembles, such as images, sounds and videos, have been shown to possess scale-invariant statistics. However, such invariances may be hard to find in molecular simulation data, because protein data is more likely to resemble an object-specific ensemble, such as a dataset of face images, which are known to exhibit multi-scale behavior but not in a convenient scale-invariant form. Thermal fluctuations allow the molecule to cross over energy barriers and tumble to places far removed from the starting configuration (Fig. 1). Moreover, recent experimental evidence suggests that these rare, low-probability excursions from the mean conformation may have a significant bearing on biological function, including protein folding, enzyme catalysis and molecular recognition. Describing these rare excursions and their properties algorithmically, and building rich representations from the simulation data, remains a hard problem.

McCammon's group published an early result pointing out the long-tailed behavior of the positional fluctuation data from picosecond-long simulation trajectories [4]. More recently, we documented this long-tail property in the positional data of ubiquitin and adenylate kinase simulations using microsecond-long simulation trajectories [3, 5]. We observed that the long tails give rise to non-orthogonal couplings between the various portions of the biomolecule. While the long tails hint at the use of independent component analysis algorithms, as with natural datasets of images and sounds, we also observed that respecting the non-orthogonal correlations intrinsic to the data is the key to discovering energetically coherent clusters and building low-dimensional, biophysically relevant projections.
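As a rough illustration of how long tails can be quantified, the sketch below computes the kurtosis (the fourth standardized moment, which equals 3 for a Gaussian) of two synthetic samples. The Laplace-distributed sample is only a stand-in for long-tailed positional deviations; it is an assumption made for illustration and is not derived from any simulation data.

```python
import random


def kurtosis(samples):
    """Fourth standardized moment m4 / m2**2 (equals 3 for a Gaussian)."""
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((s - mean) ** 2 for s in samples) / n
    m4 = sum((s - mean) ** 4 for s in samples) / n
    return m4 / (m2 * m2)


rng = random.Random(0)

# Harmonic (Gaussian) reference fluctuations.
gaussian = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

# Synthetic long-tailed stand-in: the difference of two exponential
# variates is Laplace-distributed, with a theoretical kurtosis of 6.
long_tailed = [rng.expovariate(1.0) - rng.expovariate(1.0)
               for _ in range(100_000)]

k_gauss = kurtosis(gaussian)
k_tail = kurtosis(long_tailed)
```

A kurtosis well above 3 indicates heavier tails than a Gaussian fit would predict, which is the sense in which fluctuations are described as anharmonic here.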

Figure 1: Biophysical insights gained from a statistical characterization of the simulation data. (A) Time-dependent structural changes are anharmonic, as shown by the long tails in the positional fluctuations from 0.5 µs of simulation data for the protein ubiquitin. Log-histograms of the positional deviations from the mean conformation for (i) backbone carbon alpha atoms (red, kurtosis κ = 6.3), (ii) all atoms of the protein (blue, kurtosis κ = 8.2) and (iii) the best-fitting Gaussian to the carbon alpha data (dotted, kurtosis κ = 3). (B) Non-orthogonal or anharmonic coupling in the positional deviations of the carbon alpha atoms of residues 31 and 45. PCA (black arrows) imposes orthogonality and misses the intrinsic orientation of this data, but a variant of JADE [2] (red arrows) that deconvolves fourth-order dependencies successfully captures the intrinsic anharmonic directions. (C) An emergent behavior from using higher-order statistics is the ability to discover energetically coherent clusters of conformations. (D) Biological insights derived from (C), implicating motions of different regions in the protein ubiquitin as being important for recognizing diverse substrates (2D3G and 2G45). Figure modified from [3].

Learning biophysically/biochemically relevant low-dimensional representations

For a molecular system with N atoms and t simulation frames, a full conformational description requires 3Nt variables, that is, the x, y, z Cartesian coordinates of every atom in every frame. Dimensionality reduction methods are concerned with describing the complexities of molecular motions with far fewer variables, such that important structural shifts can be visualized and interpreted. Conceptually, such approaches map each 3N-dimensional snapshot to a data point on a lower-dimensional manifold, but the nature and complexity of such a mapping are nontrivial, and no current method can claim optimality. Designing this embedding, or set of reaction coordinates, such that biophysically relevant transitions are observed has actually become more challenging, because longer simulations necessarily access more structural diversity and rare transitions. A truly useful mapping must thus (1) filter out thermal fluctuations, (2) be sensitive to the non-Gaussian/anharmonic character of conformational shifts, and (3) capture rare (i.e., outlier) transitions.

Existing methods primarily identify basis vectors (in 3N-dimensional conformational space) that align with high-variance directions. These so-called principal component analysis (PCA) methods, adapted from the factor analysis statistics literature, require strong assumptions about the nature of correlated motions within the studied biomolecules. We have shown that two such assumptions, linearity and orthogonality, are not valid for MD simulations. These findings are consistent with the mechanistic interpretation of intramolecular forces, where bonded and non-bonded interactions promote inherently anharmonic coupling between different parts of the biomolecule.

We further emphasize two important drawbacks of existing dimensionality reduction methods. Firstly, current reaction coordinate formulations are not protein-specific, meaning that neither the inherent properties of molecular interactions nor the restrictions on protein structure are considered. Secondly, existing methods scale poorly, requiring matrix operations that become intractable for systems of biological interest. Substantial atomistic simulations have become available primarily through hardware speedups. Useful interpretation of molecular simulations, on the other hand, requires a physics-centric approach to analyzing collective atomic motions. Our experience in highlighting potential weaknesses of existing methods and suggesting alternatives will be critical in expanding the set of tools researchers can apply to molecular systems.

Clustering simulation data with biological and thermodynamical relevance

A second class of analysis techniques concerns grouping simulation frames that share important structural or kinetic features. Clustering conformer snapshots presents a considerable challenge in that the data has a natural ordering, namely the temporal relationship between successive frames, that many other datasets lack. Clustering approaches that account for both the temporal associations among snapshots and their structural relationships can provide insight into how protein motions facilitate function or dysfunction. Specifically, clustering enables the determination of a small number of parameters (referred to as order parameters) that are highly relevant for estimating rate kinetics and thermodynamics, both of which are directly measurable via experiments. Although many clustering techniques are regularly applied to MD simulations, the challenge is to determine which techniques provide adequate insight into the biological underpinnings of function and relate it back seamlessly to experimental observables such as rate kinetics and thermodynamics.
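The two analysis steps discussed above, projecting each 3N-dimensional frame onto a few basis vectors and then grouping frames, can be sketched with standard baselines. The example assumes NumPy, synthetic frames drawn around two hypothetical conformations, PCA for the projection and plain k-means for the grouping. These baselines are stand-ins chosen for brevity: PCA imposes exactly the orthogonality criticized above, and k-means ignores the temporal ordering of frames, so this is not the higher-order approach of [3, 5].

```python
import numpy as np


def pca_project(frames, n_components):
    """Project (t, 3N) frames onto the top-variance orthogonal directions."""
    centered = frames - frames.mean(axis=0)
    cov = centered.T @ centered / (frames.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues ascending
    order = np.argsort(eigvals)[::-1][:n_components]  # largest variance first
    basis = eigvecs[:, order]
    return centered @ basis, basis


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; treats frames as unordered points."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1), centers


# Synthetic "trajectory": 200 frames near one hypothetical conformation,
# followed by 200 frames near another (N = 10 atoms, so 3N = 30).
rng = np.random.default_rng(1)
n_atoms = 10
state_a = rng.normal(size=3 * n_atoms)
state_b = state_a + 3.0 * rng.normal(size=3 * n_atoms)
frames = np.vstack([
    state_a + 0.1 * rng.normal(size=(200, 3 * n_atoms)),
    state_b + 0.1 * rng.normal(size=(200, 3 * n_atoms)),
])

projected, basis = pca_project(frames, n_components=2)
labels, centers = kmeans(projected, k=2)
```

On well-separated synthetic states like these, even the baselines recover the two groups; the difficulties raised in the text arise when the couplings are non-orthogonal, the fluctuations are long-tailed, and the covariance matrix is too large to decompose directly.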

3 Solution Space and Outlook

We have outlined the current state of the art in developing statistical inference approaches for big-data problems in molecular biophysics. However, with the astronomical increase in the size of datasets from molecular simulations, it is clear that there is a need to develop integrated machine learning toolkits that provide for:
• handling access to and analysis of big simulation data in a distributed fashion, for example using Hadoop;
• online or near real-time analytics of simulations as they progress, to facilitate anomaly detection and the identification of large-scale, biophysically relevant motion signatures [6, 7];
• toolkits particularly suited to exploiting heterogeneous computing resources such as graphics processing units (GPUs) for molecular biophysics applications [8].
The availability of packages such as HiMach [9], MDAnalysis [10] and HOST4MD [5] will certainly facilitate the development of the computing infrastructure required to tackle the big-data challenges from molecular biophysics.
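To make the idea of online analytics concrete, the following sketch keeps a running mean and standard deviation of a per-frame scalar using Welford's single-pass update, and flags frames that deviate strongly from the statistics seen so far. The scalar observable, the threshold, the warm-up length and all names are hypothetical choices for illustration; this does not reproduce the event-detection methods of [6, 7].

```python
import math


class RunningStats:
    """Welford's online algorithm for mean and sample standard deviation."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def flag_anomalies(stream, threshold=4.0, warmup=20):
    """Return indices of frames far from the running mean seen so far."""
    stats = RunningStats()
    flagged = []
    for i, x in enumerate(stream):
        # Compare against earlier frames only, then absorb the new frame.
        if (stats.n >= warmup and stats.std > 0.0
                and abs(x - stats.mean) > threshold * stats.std):
            flagged.append(i)
        stats.update(x)
    return flagged


# Hypothetical per-frame observable (for example a distance or an RMSD)
# with small periodic fluctuations and one large excursion at frame 60.
values = [1.0 + 0.01 * ((i * 7) % 11 - 5) for i in range(100)]
values[60] = 2.0
flagged = flag_anomalies(values)
```

Because each frame is processed once and only three numbers of state are kept per observable, an update of this kind can in principle run alongside a simulation without storing the full trajectory.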

References

[1] David E. Shaw, Ron O. Dror, John K. Salmon, J. P. Grossman, Kenneth M. Mackenzie, Joseph A. Bank, Cliff Young, Martin M. Deneroff, Brannon Batson, Kevin J. Bowers, Edmond Chow, Michael P. Eastwood, Douglas J. Ierardi, John L. Klepeis, Jeffrey S. Kuskin, Richard H. Larson, Kresten Lindorff-Larsen, Paul Maragakis, Mark A. Moraes, Stefano Piana, Yibing Shan, and Brian Towles. Millisecond-scale molecular dynamics simulations on Anton. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09, pages 39:1–39:11, New York, NY, USA, 2009. ACM.
[2] Jean-Francois Cardoso. High-order contrasts for independent component analysis. Neural Comput., 11(1):157–192, 1999.
[3] A. Ramanathan, A. J. Savol, C. J. Langmead, P. K. Agarwal, and C. S. Chennubhotla. Discovering conformational sub-states relevant to protein function. PLoS ONE, 6(1):e15827, 2011.
[4] S. H. Northrup, M. R. Pearl, J. D. Morgan, J. A. McCammon, and M. Karplus. Molecular dynamics of ferrocytochrome c: magnitude and anisotropy of atomic displacements. J. Mol. Biol., 153:1087–1111, 1981.
[5] A. Ramanathan, A. J. Savol, P. K. Agarwal, and C. S. Chennubhotla. Event detection and sub-state discovery from biomolecular simulations using higher-order statistics: Application to enzyme adenylate kinase. Proteins: Struct., Funct., and Bioinform., 80(11):2536–2551, 2012.
[6] Willy Wriggers, Kate A. Stafford, Yibing Shan, Stefano Piana, Paul Maragakis, Kresten Lindorff-Larsen, Patrick J. Miller, Justin Gullingsrud, Charles A. Rendleman, Michael P. Eastwood, Ron O. Dror, and David E. Shaw. Automated event detection and activity monitoring in long molecular dynamics simulations. Journal of Chemical Theory and Computation, 5(10):2595–2605, 2009.
[7] A. Ramanathan, J.-O. Yoo, and C. J. Langmead. On-the-fly identification of conformational substates from molecular dynamics simulations. J. Chem. Theory Comput., 7(3):778–789, 2011.
[8] D. Brandt. Investigation of GPGPU for use in Processing of EEG in Real-time. PhD thesis, Kate Gleason College of Engineering, 2010.
[9] Tiankai Tu, Charles A. Rendleman, David W. Borhani, Ron O. Dror, Justin Gullingsrud, Morten Ø. Jensen, John L. Klepeis, Paul Maragakis, Patrick Miller, Kate A. Stafford, and David E. Shaw. A scalable parallel framework for analyzing terascale molecular dynamics simulation trajectories. In Proceedings of the 2008 ACM/IEEE conference on Supercomputing, SC '08, pages 56:1–56:12, Piscataway, NJ, USA, 2008. IEEE Press.
[10] Naveen Michaud-Agrawal, Elizabeth J. Denning, Thomas B. Woolf, and Oliver Beckstein. MDAnalysis: A toolkit for the analysis of molecular dynamics simulations. Journal of Computational Chemistry, 32(10):2319–2327, 2011.
