Int. J. Bioinformatics Research and Applications, Vol. 9, No. 5, 2013
An interval-based algorithm to represent conformational states of experimentally determined polypeptide templates and fast prediction of approximated 3D protein structures

Márcio Dorn
LABIO, PPGCC, Faculdade de Informática, Pontifícia Universidade Católica do Rio Grande do Sul, Av. Ipiranga, 6681, Prédio 32, Sala 602 90619-900 – Porto Alegre – RS – Brasil
and
Institute of Informatics, Federal University of Rio Grande do Sul, Av. Bento Gonçalves 9500, Prédio 67, Sala 202 91501-970 – Porto Alegre – RS – Brasil
E-mail: [email protected]

Osmar Norberto de Souza∗
LABIO, PPGCC, Faculdade de Informática, Pontifícia Universidade Católica do Rio Grande do Sul, Av. Ipiranga, 6681, Prédio 32, Sala 602 90619-900 – Porto Alegre – RS – Brasil
E-mail: [email protected]
∗Corresponding author

Abstract: Predicting the three-dimensional (3-D) structure of a protein that has no templates in the Protein Data Bank (PDB) is a very hard, and so far unsolved, task. Computational prediction methods have been developed in recent years, but the problem remains challenging. In this paper we present a new strategy based on Interval Arithmetic to store structural information obtained from experimental protein templates and to predict native-like approximate three-dimensional structures of proteins. Our objective is to perform the prediction very quickly and to predict native-like structures that can be used as starting point structures for ab initio methods. We illustrate the efficacy of our method in five case studies of polypeptides.

Keywords: structural bioinformatics; 3-D protein structure prediction; interval arithmetic; pattern recognition; data mining.
Copyright © 2013 Inderscience Enterprises Ltd.
Reference to this paper should be made as follows: Dorn, M. and Norberto de Souza, O. (2013) ‘An interval-based algorithm to represent conformational states of experimentally determined polypeptide templates and fast prediction of approximated 3-D protein structures’, Int. J. Bioinformatics Research and Applications, Vol. 9, No. 5, pp.462–486.

Biographical notes: Márcio Dorn received an MSc (2008) in Computer Science from the Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS), Porto Alegre, Brasil. He is currently a PhD student in Computer Science at the Federal University of Rio Grande do Sul, Porto Alegre, Brazil. In 2009 he spent 12 months as research associate at the University of Karlsruhe, Germany. His main research interests include bioinformatics, structural bioinformatics, machine learning, optimisation and high performance computing.

Osmar Norberto de Souza received his PhD in Science, with emphasis in computational molecular biophysics, from the University of London, London, England. He is a Senior Lecturer at the Faculties of Computer Science and Biosciences in the Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS), Porto Alegre, Brazil. His primary research interest is the development and application of computational methods to the sequence-structure-dynamics-function relationship in biological molecules. He is mostly interested in protein structure prediction, protein dynamics, and bioinformatics applied to neglected diseases.
1 Introduction

A polypeptide or protein (herein these terms are synonyms) molecule is a covalent chain of amino acid residues that, in physiological conditions, or native environment, adopts a unique 3-D structure. This native structure dictates the biochemical function of the protein in vivo (Baxevanis and Quellette, 2001; Branden and Tooze, 1998). During the last 40 years a number of computational methods and algorithms have been proposed as solutions to the Protein Structure Prediction (PSP) problem, as reviewed in Osguthorpe (2000), Moult (2005), Tramontano (2006) and Bujnicki (2006). These methods and algorithms can be classified into four groups:

• Ab initio methods without database information (Bujnicki, 2006; Floudas et al., 2006)
• First principle methods with database information (Srinivasan and Rose, 1995; Rohl et al., 2004)
• Comparative modelling methods (Stuart et al., 2000)
• Fold recognition methods (Jones et al., 1992).

Ab initio methods without database information are founded on thermodynamics and are based on the fact that the native structure of a protein corresponds
to the global minimum of its free energy (Bujnicki, 2006; Floudas et al., 2006). Ab initio structure prediction methods aim at predicting the native conformation of a protein considering only the amino acid sequence. Structural templates from a database such as the PDB are not allowed. This class of methods simulates the protein conformational space using an energy function, which describes the internal energy of the protein and its interactions with the environment in which it is inserted. The goal is to find a global minimum of free energy that corresponds to the native or functional state of the protein. The main advantage of this class of methods is that it can predict new folds, because it is not limited to templates from the PDB. However, these methods are limited by the size of the conformational search space (Ngo et al., 1997). AMBER (Case et al., 2005) and GROMACS (van der Spoel et al., 2005) are examples of packages that have been used in ab initio methods without database information.

In first principle methods with database information, general rules of protein structure are extracted from protein databases and used to build starting point 3-D protein structures (Srinivasan and Rose, 1995; Rohl et al., 2004). These methods do not compare a target sequence to a known structure; instead, they compare fragments, i.e., short amino acid sub-sequences of the target, against fragments of known protein structures. The conformation of a protein is seen as a set of fragments of amino acid sequences representing various structural motifs that are combined to form the 3-D protein structure (Tramontano, 2006). When homologous fragments are identified, they are assembled into a structure through a fragment assembly procedure with the purpose of finding the structure with the lowest potential energy.
ROSETTA (Rohl et al., 2004; Simons et al., 1999), FRAGFOLD (Jones et al., 1992; Jones, 2001), I-TASSER (Zhang, 2007, 2008, 2009) and LINUS (Srinivasan and Rose, 1995, 2002) are four examples of first principle methods that use database information. They have been the most successful predictors in recent years, as revealed by the biennial Critical Assessment of protein Structure Prediction (CASP – http://predictioncenter.org/) experiments (Moult, 2005; Raman et al., 2009; Zhang, 2009).

In comparative modelling methods, a target sequence of amino acid residues is aligned against the amino acid sequence of another protein whose structure is known and stored in a structural database (Stuart et al., 2000). If the target sequence is similar to the sequence of the template protein, the structural information obtained from the known structure is used for modelling the target protein. Comparative modelling by homology can be applied whenever it is possible to detect an evolutionary relationship between the target protein and a template protein whose 3-D structure is known (Sánchez and Sali, 1997). The structures of these proteins are similar in the sense that amino acid residues with identical physicochemical properties occupy the same position in homologous proteins. Comparative homology modelling can only predict structures of protein sequences that are similar or nearly identical to other protein sequences of known structure. It does not predict new protein folds. SWISS-MODEL (Arnold et al., 2006) and MODELLER (Eswar et al., 2006) are examples of comparative modelling methods.

Fold Recognition methods are motivated by the notion that structure is more conserved than sequence, i.e., proteins with no similar sequences could have similar folds. Threading methods use structural information such as residue-residue contact patterns and solvent accessibility in a structure to predict 3-D structures
of proteins (Jones et al., 1992). After identifying the structural similarities, which cannot be detected solely from the similarities between the amino acid sequences, the predicted structural models are constructed (Bryant and Altschul, 1995). Fold Recognition methods focus on predicting the 3-D folded structure of protein amino acid sequences for which comparative methods provide no reliable predictions. Fold recognition via threading is limited to the fold library derived from the PDB (Kolinski, 2004). GENTHREADER (Jones, 1999) is an example of a threading method.

Both methodologies must deal with some limitations. Comparative homology modelling can only predict structures of protein sequences that are similar or nearly identical to other protein sequences of known structure. It does not predict new protein folds. Fold recognition via threading is limited to the fold library derived from the PDB. Only ab initio or de novo predictions can obtain novel structures with new folds. However, the complexity and high dimensionality (Ngo et al., 1997) of the search space, even for a small protein molecule, still make the problem intractable (Levinthal’s Paradox) (Levinthal, 1968), despite the current availability of high performance computing platforms.

In this paper we present a new strategy to store conformational information obtained from conserved residues from PDB templates. This information is used to predict an initial, approximate, native-like protein 3-D structure for a target amino acid sequence. Initially we extract protein template information from the PDB using the CReF method (Dorn and Norberto de Souza, 2008, 2010a). The conformational information is then represented as intervals of conformational states based on an Interval Arithmetic strategy (Moore, 1959, 1966, 1999; Moore and Yang, 1959; Alefeld and Herzberger, 1983). The structural information is used to predict approximate 3-D structures of polypeptides.
This predicted approximate 3-D structure is expected to be good enough to be further refined by means of molecular mechanics methods such as Molecular Dynamics (MD) simulations (van Gunsteren and Berendsen, 1990). In such a refinement step, global interactions between all atoms in the molecule are evaluated and deviations in the torsion angles can be corrected (Dorn et al., 2008). Furthermore, starting from the approximate 3-D structure greatly reduces the conformational search space around the native structure and, consequently, demands much less computational effort in the refinement steps.

Section 2 contextualises the 3-D Protein Structure Prediction problem and introduces the basic concepts used in this paper. Section 3 introduces the new strategy to acquire and store structural information from the PDB and a new algorithm that uses this information to predict approximate structures of polypeptides. Section 4 reports several results illustrating the effectiveness of our method. Section 5 concludes and points out directions for further research.
2 Preliminaries

2.1 Protein structure and representation

A peptide is a molecule composed of two or more amino acid residues chained by a peptide bond (Figure 1). An amino acid is a molecule containing an amine and a carboxyl functional group and an organic substituent group R (side chain).
The amino acids differ in which side chain (R group) is attached to their alpha carbon (Hovmöller et al., 2002). Larger peptides are generally referred to as polypeptides or proteins (Creighton, 1990).

Figure 1 Schematic representation of a model peptide illustrating a duplet (φ, ψ) of main-chain torsion angles. N is nitrogen, C and Cα are carbons, O is oxygen, H is hydrogen and Ri is an arbitrary side-chain

A peptide has three main-chain torsion angles, namely phi (φ), psi (ψ) and omega (ω). In the model peptide (Figure 1) the bonds between N and Cα, and between Cα and C, are free to rotate. These rotations are described by the φ and ψ torsion angles, respectively. The angle ω is either close to 0° (cis) or 180° (trans), with the latter value being the preferred one (Branden and Tooze, 1998). The main-chain φ and ψ torsion angles are chiefly responsible for determining the conformation or fold of a polypeptide (Lesk, 2000; Branden and Tooze, 1998). A polypeptide chain can be represented in many ways; the most common representations are:

• Cartesian coordinates of all atoms
• Cartesian coordinates of the heavy atoms only
• Cartesian coordinates of main-chain atoms and side-chain centroids
• Cartesian coordinates of the Cα
• Main-chain and side-chain torsion angles.

In this work we represent a protein structure C in the form of a vector C = [xi, xi+1, ..., xn], where xi is a duplet of torsion angles (φ, ψ) for each amino acid residue in the primary structure of a protein (Figure 1). This representation is supported by the fact that the set of consecutive torsion angles represents the internal rotations of a protein main-chain (Branden and Tooze, 1998).

2.2 Energy function

An energy function describes the internal energy of the protein and its interactions with the environment in which it is inserted. In ab initio methods, energy functions are used with the goal of finding a global minimum of free energy that corresponds to the native or functional state of the protein (Osguthorpe, 2000; Tramontano, 2006). A potential energy function incorporates two types of terms: bonded and non-bonded. The bonded terms (bonds, angles and torsions) describe atoms that are covalently linked.
The bonded terms constrain bond lengths and angles near their equilibrium values. The bonded terms also include a torsional potential (torsion) that models the periodic energy barriers encountered during bond rotation. The non-bonded potential includes ionic bonds, hydrophobic interactions, hydrogen bonds, van der Waals forces and dipole-dipole interactions. The energy function can be used to calculate the potential energy of any conformation of a given protein, defined by its Cartesian coordinate vector. The lower the energy value, the better the conformation should be. In this work, in order to evaluate the conformation of a protein, we use the AMBER94 energy function (Cornell et al., 1995). It is a composite sum of several molecular mechanics equations grouped into two major types: bonded (bonds, angles, torsions) and non-bonded (van der Waals, electrostatics). The AMBER94 energy function is described by equation (1).

\[
\mathrm{Energy} = \sum_{\mathrm{bonds}} \frac{1}{2} K_b (b - b_0)^2
+ \sum_{\mathrm{angles}} \frac{1}{2} K_\theta (\theta - \theta_0)^2
+ \sum_{\mathrm{torsions}} \frac{1}{2} K_\eta \left(1 + \cos(\eta\omega - \gamma)\right)
+ \sum_{j=1}^{N-1} \sum_{i=j+1}^{N} \left\{ \epsilon_{i,j} \left[ \left(\frac{R_{0ij}}{r_{ij}}\right)^{12} - 2 \left(\frac{R_{0ij}}{r_{ij}}\right)^{6} \right] + \frac{q_i q_j}{4\pi\epsilon_0 r_{ij}} \right\}
\tag{1}
\]

where Bonds represent the energy between covalently bonded atoms. Angles represent the energy due to the geometry of electron orbitals involved in covalent bonding. Torsions represent the energy for twisting a bond due to bond order (e.g., double bonds) and neighbouring bonds or lone pairs of electrons. The last term represents the non-bonded energy between all atom pairs, which can be decomposed into van der Waals and electrostatic energies.
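As a rough illustration of how the four groups of terms in equation (1) combine, the following minimal Python sketch evaluates each term for explicitly supplied parameters. It is not the AMBER implementation: the function names, the tuple layout of the inputs and the `coulomb_const` parameter (standing in for the unit-dependent factor 1/(4πε0)) are assumptions made only for this example.

```python
import math


def harmonic_term(k, x, x0):
    """Bond or angle term of equation (1): (1/2) * K * (x - x0)^2."""
    return 0.5 * k * (x - x0) ** 2


def torsion_term(k_eta, eta, omega, gamma):
    """Torsion term of equation (1); omega and gamma are in radians."""
    return 0.5 * k_eta * (1.0 + math.cos(eta * omega - gamma))


def nonbonded_pair(eps_ij, r0_ij, r_ij, q_i, q_j, coulomb_const=1.0):
    """Van der Waals (12-6) plus electrostatic energy of one atom pair.

    coulomb_const stands in for 1 / (4 * pi * eps0); its value depends on
    the unit system and is left as a parameter (assumption of this sketch).
    """
    ratio6 = (r0_ij / r_ij) ** 6
    van_der_waals = eps_ij * (ratio6 ** 2 - 2.0 * ratio6)
    electrostatic = coulomb_const * q_i * q_j / r_ij
    return van_der_waals + electrostatic


def total_energy(bonds, angles, torsions, pairs, coulomb_const=1.0):
    """Sum the four groups of terms.

    Each argument is a list of parameter tuples (hypothetical layout):
    bonds/angles: (K, value, equilibrium); torsions: (K, eta, omega, gamma);
    pairs: (eps, R0, r, q_i, q_j).
    """
    energy = sum(harmonic_term(k, b, b0) for k, b, b0 in bonds)
    energy += sum(harmonic_term(k, t, t0) for k, t, t0 in angles)
    energy += sum(torsion_term(k, n, w, g) for k, n, w, g in torsions)
    energy += sum(nonbonded_pair(e, r0, r, qi, qj, coulomb_const)
                  for e, r0, r, qi, qj in pairs)
    return energy
```

Note that the 12-6 term reaches its minimum value of −ε when r equals R0, which is a convenient sanity check for any implementation of the last sum.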
2.3 Metrics: root mean square deviation

Metrics are a set of tools to evaluate the similarity of an experimental and a predicted protein structure. In this work we employ the Root Mean Square Deviation (RMSD) to calculate the similarity between predicted and experimental structures. Equation (2) calculates the RMSD between two structures, where r_{ai} and r_{bi} are, respectively, the positions of atom i of structure a and of atom i of structure b, after the superposition of structures a and b has been optimised. The RMSD value is calculated with the McLachlan algorithm (McLachlan, 1982) using the program ProFit (McLachlan, 1982).

\[
\mathrm{RMSD}(a, b) = \sqrt{\frac{\sum_{i=1}^{n} \left| r_{ai} - r_{bi} \right|^2}{n}}
\tag{2}
\]
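Equation (2) can be illustrated with a short Python sketch. It assumes the two structures are already optimally superimposed and have the same atoms in the same order; the fitting step itself (performed by ProFit in this work) is not reproduced, and the function name `rmsd` is chosen only for this example.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD of equation (2) for two equally long lists of (x, y, z) tuples.

    Assumes the structures were superimposed beforehand; no fitting is done.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("structures must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared distance between atom i of structure a and atom i of b.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```

For instance, two single-atom structures at (0, 0, 0) and (3, 4, 0) are 5 units apart, so their RMSD is 5.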
2.4 Extracting structural information from the PDB using CReF

In order to reduce the complexity and the high dimensionality of the conformational search space inherent to ab initio methods, information about
structural motifs found in known protein structures can be used to construct approximate conformations. Dorn and Norberto de Souza (2008, 2010a) presented a method (CReF) to acquire structural information from the PDB and predict approximate structures. In CReF the target amino acid sequence is split into contiguous fragments, which are used to search for their homologues in the PDB (Berman et al., 2000). Only the central amino acid residue (φ, ψ pair of torsion angles) of each template fragment is considered. The φ, ψ pairs of all template fragments are clustered and the resulting information is used to build the approximate predicted models for the target sequence. In this work we employ the CReF strategy to acquire the structural information from the PDB that will be used to build intervals of structural information.
3 From intervals to approximate 3-D structures of polypeptides

We developed a new computational strategy to extract and store structural information from protein templates obtained from the PDB. First, we analyse fragment templates to obtain structural information. Second, we use this information to construct intervals of torsion angles and represent a protein main-chain conformation as intervals. Finally, we reduce these intervals with the objective of finding the smallest closed interval that contains the structure with the lowest potential energy. By representing a protein main-chain as intervals, we believe that we can deal with the uncertainties that arise from PDB template information.
3.1 Building protein structures represented as φ, ψ intervals

Below we describe the algorithm we developed, which uses interval arithmetic to represent the protein main-chain and to predict its 3-D structure. Steps 1–4 are similar to those of the CReF method (Dorn and Norberto de Souza, 2008, 2010a).

Target sequence fragmentation – Step 1: In this step a target sequence K is fragmented into many short contiguous fragments si with l amino acids each. A set S of contiguous fragments, representing all possible fragments of length l, is created and represented as S = {si, si+1, ..., sn}, where si and sn are the first and the last fragment, respectively. Each si fragment has an odd value for l because we only consider the φ, ψ torsion angles of the fragment’s central amino acid. An odd l value guarantees that the central amino acid is flanked by an equal number of neighbouring residues. This type of fragmentation has been used before by Dorn and Norberto de Souza (2008, 2010a) and Zhang et al. (1989).

Obtaining protein templates from PDB – Step 2: Each of the si fragments of size l obtained above is used to search the PDB for templates, using the web version of BLASTp (http://blast.ncbi.nlm.nih.gov/Blast.cgi) (Altschul et al., 1997). The final result of this step is a list of templates with their PDB accession codes (PDB ID) for each si.
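The fragmentation of Step 1 can be sketched as a sliding window over the target sequence. This is only an illustration under the assumption that consecutive fragments are shifted by one residue; the function name `fragment_sequence` and the returned (fragment, central index) pairs are hypothetical choices for this example.

```python
def fragment_sequence(sequence, length):
    """Return all contiguous fragments of odd `length` from `sequence`.

    Each entry is (fragment, index_of_central_residue_in_sequence), so the
    central amino acid has the same number of neighbours on both sides.
    """
    if length < 1 or length % 2 == 0:
        raise ValueError("fragment length l must be a positive odd number")
    half = length // 2
    fragments = []
    for start in range(len(sequence) - length + 1):
        fragments.append((sequence[start:start + length], start + half))
    return fragments
```

With l = 5, the six-residue stretch FNMQCQ yields the two fragments FNMQC and NMQCQ, whose central residues are M and Q. Residues closer than (l − 1)/2 positions to either end of the sequence are never central in any fragment.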
Calculating pairs of torsion angles – Step 3: Pairs of φ, ψ torsion angles for every central amino acid of all templates (Figure 3 in Dorn and Norberto de Souza (2008)) associated with an si fragment are calculated using the program Torsions (by Dr. Andrew C.R. Martin, UCL-London). Each pair of torsion angles is represented as one 2-tuple ti = (φ, ψ). Each target si is represented as a set of 2-tuples si = {ti, ti+1, ..., tp}, where ti and tp are the first and the last 2-tuple of the central amino acid of a template, respectively. Hence, we may represent the set S of si fragments and our ti elements as S = {si = {ti, ti+1, ..., tp}, si+1 = {ti, ti+1, ..., tp}, ..., sn = {ti, ti+1, ..., tp}} (Dorn and Norberto de Souza, 2008).

Clustering template torsion angles – Step 4: All ti 2-tuples belonging to one si fragment are clustered using the probabilistic Expectation Maximisation (EM) method (Chapman and Chang, 2000). This task identifies the region(s) where the ti 2-tuples are concentrated in the Ramachandran plot (Figure 2). A complete description of the clustering step can be found in our previous work (Dorn and Norberto de Souza, 2008, 2010a). EM considers the different probability distributions of each individual cluster in order to identify which set of clusters is more favourable for a given set of data. It begins by clustering the ti 2-tuples with the k-means algorithm to obtain an initial solution. The mean value m(ki) of the 2-tuples tj of a cluster ki, j = 1, 2, ..., n, is obtained by equation (3).

\[
m(k_i) = \frac{1}{n} \sum_{j=1}^{n} t_j
\tag{3}
\]
After the identification of the f clusters ki for ti ∈ si, the average and the estimated standard deviation (ESD) of all tj 2-tuples of ki are calculated. The average m(ki, θ) of all θ angles of ki is calculated by equation (4), in which θ represents either of the two torsion angles (φ or ψ) and n is the number of 2-tuples associated with a ki cluster.

\[
m(k_i, \theta) = \frac{1}{n} \sum_{j=1}^{n} t[\theta]_j
\tag{4}
\]
The ESD of all θ angles of ki is calculated from the deviations of their tj 2-tuples from the arithmetic average (equation (5)). A high ESD value indicates that the analysed elements (2-tuples) are far from the average m(ki, θ). A low ESD value indicates that the analysed elements are close to the average m(ki, θ).

\[
\sigma(k_i, \theta) = \sqrt{\frac{1}{n} \sum_{j=1}^{n} \left( t[\theta]_j - m(k_i, \theta) \right)^2}
\tag{5}
\]
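Equations (4) and (5) can be sketched in a few lines of Python for the 2-tuples of a single cluster. The sketch follows the equations literally, i.e., it uses a plain arithmetic mean and a population standard deviation, and does not treat the ±180° periodicity of torsion angles specially. The function name and the dictionary layout of the result are assumptions made for this example.

```python
import math


def cluster_mean_and_esd(tuples):
    """Per-angle average (eq. 4) and ESD (eq. 5) of the (phi, psi) 2-tuples
    of one cluster. Returns {"phi": (mean, esd), "psi": (mean, esd)}.
    """
    n = len(tuples)
    if n == 0:
        raise ValueError("a cluster must contain at least one 2-tuple")
    stats = {}
    for index, name in enumerate(("phi", "psi")):
        values = [t[index] for t in tuples]
        mean = sum(values) / n
        # Population form (division by n), as written in equation (5).
        esd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
        stats[name] = (mean, esd)
    return stats
```

For a cluster holding the two tuples (−60°, −40°) and (−70°, −50°), this gives an average of (−65°, −45°) and an ESD of 5° for both angles.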
We define f = 4 as the number of clusters to be identified. This number was chosen because the φ, ψ pair of torsion angles is restricted to four major regions of conformations in the Ramachandran plot (Ramachandran and Sasisekharan, 1968; Hovmöller et al., 2002). The first describes the α-helix region. The second describes the β-sheet region. The third describes the left-handed α-helix region.
The fourth region represents the amino acid residues occurring in coil regions of the polypeptide (Hovmöller et al., 2002). At the end of the clustering step we have four ki clusters. Each of them has an associated 2-tuple ki = (mφ, mψ), so that si = {ki = (mφ, mψ), ki+1 = (mφ, mψ), ..., kf = (mφ, mψ)}, with i = 1, 2, ..., f, where mφ and mψ represent the average of φ and ψ, respectively, of the ith cluster ki. The WEKA (http://www.cs.waikato.ac.nz/ml/weka/) (Witten and Frank, 2006) data-mining package was used for clustering. Figure 2 illustrates the identification of f = 4 ki clusters in the set of ti ∈ si. The average and the ESD values are used in the following steps to build intervals of angular variation that will represent each ki.

Figure 2 Schematic representation of the clustering step: Ramachandran plots with the ti 2-tuples obtained from the central residue of the template fragments retrieved from the PDB for the targets (A) si = FNMQC and (B) si = NMQCQ. Ellipses delimit the identified clusters and ki is the cluster identifier

Representation of a torsion angles interval – Step 5: Here starts the main contribution of this paper. In this step torsion angle intervals are built from each ki cluster. A closed interval [X] of real numbers is represented as [X] = [$\underline{x}$, $\overline{x}$], where $\underline{x}$ and $\overline{x}$ represent, respectively, the lower and the upper bound of the interval [X] (Moore, 1959, 1999; Moore and Yang, 1959; Moore, 1966; Alefeld and Herzberger, 1983). Intervals of torsion angles are built from the average m(ki, θ) and the estimated standard deviation σ(ki, θ) values obtained from the template duplets belonging to a ki cluster. A closed interval of a dihedral angle is represented as [θ] = [$\underline{\theta}$, $\overline{\theta}$], where θ represents a torsion angle (φ or ψ). The lower bound $\underline{\theta}$ of an interval [θ] from a cluster ki is the difference between the average m(ki, θ) and the estimated standard deviation σ(ki, θ) (equation (6)).

\[
\underline{\theta} = m(k_i, \theta) - \sigma(k_i, \theta)
\tag{6}
\]

The upper bound $\overline{\theta}$ of an interval [θ] is obtained as the sum of the average m(ki, θ) and the estimated standard deviation σ(ki, θ) (equation (7)).

\[
\overline{\theta} = m(k_i, \theta) + \sigma(k_i, \theta)
\tag{7}
\]
The width w of a closed interval [θ] is obtained by equation (8).

\[
w([\theta]) = \overline{\theta} - \underline{\theta}
\tag{8}
\]

The midpoint c of a closed interval [θ] is obtained by equation (9).

\[
c([\theta]) = \underline{\theta} + \frac{w([\theta])}{2}
\tag{9}
\]
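Equations (6) to (9) translate directly into a small data structure. The following Python sketch is one possible way to hold a torsion angle interval; the class name `Interval` and the constructor `from_mean_esd` are hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed interval [lower, upper] of a torsion angle, in degrees."""
    lower: float
    upper: float

    @classmethod
    def from_mean_esd(cls, mean, esd):
        # Equations (6) and (7): bounds are mean - ESD and mean + ESD.
        return cls(mean - esd, mean + esd)

    @property
    def width(self):
        # Equation (8): upper bound minus lower bound.
        return self.upper - self.lower

    @property
    def midpoint(self):
        # Equation (9): lower bound plus half of the width.
        return self.lower + self.width / 2.0
```

For a cluster with an average φ of −65° and an ESD of 5°, `Interval.from_mean_esd(-65.0, 5.0)` gives the interval [−70°, −60°], with width 10° and midpoint −65°. Because the interval is symmetric around the average, its midpoint always coincides with m(ki, θ) until the interval is reduced in a later step.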
Intervals of φ, ψ torsion angles are built for each ki cluster belonging to a fragment si. From this point on, each si is represented as a set of φ, ψ torsion angle intervals: si = {ki = ($\underline{\phi}$, $\overline{\phi}$, $\underline{\psi}$, $\overline{\psi}$), ki+1 = ($\underline{\phi}$, $\overline{\phi}$, $\underline{\psi}$, $\overline{\psi}$), ..., kf = ($\underline{\phi}$, $\overline{\phi}$, $\underline{\psi}$, $\overline{\psi}$)}, where ($\underline{\phi}$, $\overline{\phi}$) and ($\underline{\psi}$, $\overline{\psi}$) are, respectively, the lower and upper limits for φ and ψ, and f represents the number of ki clusters. The lower and upper bounds of each dihedral angle delimit the area of variation of the template tuples in the Ramachandran plot (Figure 3).

Figure 3 Interval representation in the Ramachandran plot. P1 = ($\underline{\phi}$, $\underline{\psi}$) and P2 = ($\overline{\phi}$, $\overline{\psi}$) represent, respectively, the lower and the upper bounds of a torsion angles interval in the Ramachandran plot

Labelling the torsion angles intervals – Step 6: After representing each ki ∈ si as intervals, we assign labels to them. Labels map a ki cluster (now represented as intervals) to the region it occupies in the Ramachandran plot. In this step all ki ∈ S are labelled. We have built a library that represents the most favourable regions of the Ramachandran plot. This library is based on the works by Thornton and collaborators (Laskowski et al., 1993; Morris et al., 1992), which divide the Ramachandran plot into 12 regions (Table 1). However, for simplification, we combine these regions into eight states: ‘A’, ‘B’, ‘L’, ‘a’, ‘b’, ‘l’, ‘p’ and ‘c’, where ‘c’ denotes the coil region and represents the regions ‘∼a’, ‘∼b’, ‘∼l’, ‘∼p’ and the remaining area of the Ramachandran plot (disallowed regions). The midpoint c([θ]) of an interval of ki is calculated for each torsion angle (φ and ψ) and submitted to a mapping function, based on the
developed library. The function assigns a label to each ki cluster. Labelled clusters have the form ki : rot, where rot can be either ‘A’, ‘B’, ‘L’, ‘a’, ‘b’, ‘l’, ‘p’ or ‘c’.

Table 1 Conformational regions of the Ramachandran plot obtained from Thornton and co-workers (Lesk, 2000; Morris et al., 1992)

Region      Region preferences    Description
A           Most favourable       α-helix
B           Most favourable       β-sheet
L           Most favourable       Left-handed α-helix
a           Additional allowed    α-helix
b           Additional allowed    β-sheet
l           Additional allowed    Left-handed α-helix
p           Additional allowed    Extension of α-helix
∼a          Generously allowed    α-helix
∼b          Generously allowed    β-sheet
∼l          Generously allowed    Left-handed α-helix
∼p          Generously allowed    Extension of α-helix
remaining   Disallowed            Except GLY

Each si ∈ S is represented as si = {k1 : rot, k2 : rot, k3 : rot, k4 : rot}. After labelling, the clusters ki are ordered according to the number of tj 2-tuples they contain: si = {ki : rot > ki+1 : rot > ... > kj : rot}. A large number of ti ∈ si located in a ki cluster indicates that the φ, ψ pair of torsion angles of the central amino acid residue in the templates occurs more frequently in the interval ki = ($\underline{\phi}$, $\overline{\phi}$, $\underline{\psi}$, $\overline{\psi}$) and that the torsion angles of the central amino acid residue of the target fragment si are more likely to occupy this interval.

Prediction of secondary structure – Step 7: In this step the secondary structure for the target sequence K is predicted. We use the prediction results to help assign each amino acid residue i ∈ K to the region that it will probably occupy in the Ramachandran plot. Some methods are better at predicting α-helices while others are better at predicting β-sheets. Therefore, to avoid as much as possible any bias towards these regular secondary structures, we use the consensus of different prediction methods: DSC (King and Sternberg, 1996), PHD (Rost and Sander, 1993), GOR (Garnier et al., 1996, 1978; Gibrat et al., 1987), SIMPA96 (Stuart et al., 2000), DPM (Deleage and Roux, 1987), SOPMA (Geourjon and Deleage, 1995), SOPM (Geourjon and Deleage, 1994), MLRC (Guermeur et al., 1999), and PREDATOR (Guermeur et al., 1996) at the NPS@ (Network Protein Sequence Analysis) Consensus Secondary Structure Prediction server (Combet et al., 2000) and SCRATCH (Cheng et al., 2005) at the SCRATCH Secondary Structure Prediction server. The secondary structure predictors use up to eight states to model a target protein secondary structure. We further simplified this model, reducing it to only three states: ‘H’ (α-helix region), ‘B’ (β-sheet region) and ‘C’ (coil region). Table 2 presents the eight-state model used in the NPS@ and SCRATCH prediction servers and its correspondence to our three-state model.
Table 2 NPS@ and SCRATCH server conformational states and our corresponding three-state model: ‘H’ represents the α-helix region, ‘B’ represents the β-sheet region and ‘C’ represents the coil region

NPS@       SCRATCH    Description   3-State model
H, G, I    H, G, I    α-helix       H
B, E       B, E       β-sheet       B
T, S, C    T, S, C    coil          C

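The reduction in Table 2 amounts to a lookup table. The Python sketch below applies it to a predicted state string and, as an additional assumption that is not taken from the servers themselves, combines several predictions by a per-residue majority vote. The names `THREE_STATE`, `to_three_state` and `consensus` are hypothetical.

```python
from collections import Counter

# Table 2: eight-state symbols and their three-state equivalents.
THREE_STATE = {
    "H": "H", "G": "H", "I": "H",   # alpha-helix
    "B": "B", "E": "B",             # beta-sheet
    "T": "C", "S": "C", "C": "C",   # coil
}


def to_three_state(prediction):
    """Map a string of eight-state symbols to the three-state model."""
    return "".join(THREE_STATE[symbol] for symbol in prediction)


def consensus(predictions):
    """Per-residue majority vote over several predictions (assumed rule).

    Ties are resolved in favour of the state seen first at that position.
    """
    reduced = [to_three_state(p) for p in predictions]
    if len({len(p) for p in reduced}) != 1:
        raise ValueError("predictions must be non-empty and of equal length")
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*reduced))
```

For example, `to_three_state("HGIBETSC")` gives `"HHHBBCCC"`. An unknown symbol raises a `KeyError`, which makes unexpected predictor output visible instead of silently mapping it to coil.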
Building the initial, approximate 3-D structures – Step 8: With the identified and labelled ki clusters and the predicted secondary structure of the target sequence K, we build its initial structure. The duplet of torsion angles of each amino acid residue in the target sequence is represented as an interval. Now, the structure C of a protein is represented as a vector C = {[Xi], [Xi+1], ..., [Xn]}, where [Xi] is a pair of torsion angles represented as intervals, i.e., [Xi] = {[$\underline{\phi}$, $\overline{\phi}$], [$\underline{\psi}$, $\overline{\psi}$]}. All ω torsion angles are set to 180°. Figure 4 illustrates a duplet interval of a model peptide.

Figure 4 Schematic representation of a model peptide illustrating a duplet of main-chain torsion angles represented as intervals

To build the initial structure represented as intervals we make use of the information about the ki clusters of si to represents the ith amino acid residue of the target sequence K. We chose the interval ki = (φφ, ψψ) of si according to two rules: •
Rule 1: Find in the si set, the cluster(s) whose rot is equal to the label identified for the ith amino acid residue in the consensus predicted secondary structure. From these results the cluster ki with the largest number of 2-tuples ti is selected. In the Step 7 we had discussed that, in secondary structure prediction is utilised a 3-state model (‘H’, ‘B’,‘C’) to represent the secondary structure of a protein, but in step 6 we utilise a 8-state model (‘A’, ‘B’, ‘L’, ‘a’, ‘b’, ‘l’, ‘p’ and ‘c’) for labelling the identified clusters ki . Then, for satisfying the rule 1 we created a function that selects the ki cluster that best describe the state model of the target amino acid residue based on their templates. This function is further based in three conditions: 1
For amino acid residues that in the secondary structure prediction are in the α-helix state (‘H’): firstly are selected the clusters labelled with ‘A’
474
M. Dorn and O.N. de Souza (most favourable region), if it does not exist, then is selected the ki cluster labelled as ‘a’ (additional allowed region).
•
2
For amino acid residues that in the secondary structure prediction are in the β-sheet state (‘B’): firstly are selected the cluster labelled with ‘B’ (most favourable region), if it not exists then is selected the ki cluster labelled as ‘b’ (additional allowed region).
3
For amino acid residues whose secondary structure are in the coil state (‘C’): firstly are selected the cluster labelled as ‘c’ (‘a’, ‘b’, ‘l’, ‘p’, and the remaining area), followed by those labelled as ’L’ (most favourable region), followed by those labelled as ‘l’ (additional allowed region), followed by those labelled as ‘p’ (additional allowed region), followed by those labelled as ‘a’ (additional allowed region), followed by those labelled as ‘b’ (additional allowed region), followed by those labelled as ‘B’ (most favourable region), followed by those labelled as ‘A’ (most favourable region). The clusters labelled as ‘A’, ‘B’, ‘a’ and ‘b’ are also used to represent the coil region because the coil state is an irregular structure which can also be made of amino acid residues alternating between regular secondary structures regions of the Ramachandran plot (Zhirong and Jiang, 1996).
Rule 2: If the ith amino acid residue is identified as ‘H’ or ‘B’ in the secondary structure prediction step and no cluster labelled, respectively, ‘A’ or ‘a’, or ‘B’ or ‘b’ exists, we calculate the mean value of the torsion angles of amino acid residues i − 1 and i + 1, taken from their corresponding clusters ki. The obtained value is assigned to the ith amino acid residue. This decision is based on the fact that, when regular structures such as α-helices and β-sheets are present, the main-chain dihedral angles follow regular patterns.
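Rules 1 and 2 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tuple layout of a cluster, the names `LABEL_PRIORITY`, `select_cluster` and `neighbour_mean`, and the use of a single representative (φ, ψ) pair per cluster are assumptions made only for illustration. The priority lists are transcribed from the three conditions of Rule 1.

```python
from typing import Dict, List, Optional, Tuple

# Priority of cluster labels for each predicted secondary-structure state,
# transcribed from the three conditions of Rule 1.
LABEL_PRIORITY: Dict[str, List[str]] = {
    "H": ["A", "a"],
    "B": ["B", "b"],
    "C": ["c", "L", "l", "p", "a", "b", "B", "A"],
}

# Hypothetical cluster layout: (label, number of 2-tuples, (phi, psi)).
Cluster = Tuple[str, int, Tuple[float, float]]


def select_cluster(state: str, clusters: List[Cluster]) -> Optional[Cluster]:
    """Rule 1: take the first label in priority order that has clusters,
    then the cluster with the largest number of 2-tuples."""
    for label in LABEL_PRIORITY[state]:
        candidates = [c for c in clusters if c[0] == label]
        if candidates:
            return max(candidates, key=lambda c: c[1])
    return None  # no suitable cluster: the caller falls back to Rule 2


def neighbour_mean(prev: Tuple[float, float],
                   nxt: Tuple[float, float]) -> Tuple[float, float]:
    """Rule 2: arithmetic mean of the (phi, psi) pairs of residues i-1 and i+1.
    Assumption: a plain mean is used, with no wrap-around handling at +/-180."""
    return ((prev[0] + nxt[0]) / 2.0, (prev[1] + nxt[1]) / 2.0)
```

For a residue predicted as ‘H’ with clusters labelled ‘a’ and ‘A’, the sketch returns the largest ‘A’ cluster; for a residue predicted as ‘B’ with no ‘B’ or ‘b’ cluster it returns `None`, which is the situation handled by Rule 2.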
The intervals of the cluster $k_i = (\underline{\phi}, \overline{\phi}, \underline{\psi}, \overline{\psi})$ of a set si represent the pair of torsion angles of the corresponding amino acid residue in the target sequence K. Figure 5 depicts how we build a protein structure from φ, ψ torsion angle intervals.
Figure 5 Building a protein structure from φ, ψ torsion angle intervals. K is the target sequence and SS is the predicted secondary structure
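The interval representation of a cluster can be sketched in Python as follows. The class names `Interval` and `TorsionCluster` are hypothetical, and the sketch assumes the standard interval-arithmetic definitions of the midpoint, c = (lower + upper) / 2, and of the width, w = upper − lower.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Interval:
    """Closed interval [lower, upper] of a torsion angle, in degrees."""
    lower: float
    upper: float

    def __post_init__(self) -> None:
        if self.lower > self.upper:
            raise ValueError("lower bound must not exceed upper bound")

    def centre(self) -> float:
        # Midpoint c([x]), assumed to be the usual (lower + upper) / 2.
        return (self.lower + self.upper) / 2.0

    def width(self) -> float:
        # Width w([x]) = upper - lower.
        return self.upper - self.lower


@dataclass(frozen=True)
class TorsionCluster:
    """Hypothetical container for the four bounds of a cluster k_i."""
    phi: Interval
    psi: Interval

    def central_angles(self) -> Tuple[float, float]:
        # Value used for residues in regular secondary structures.
        return (self.phi.centre(), self.psi.centre())
```

A cluster with φ in [−80, −50] and ψ in [−60, −20] then has a φ width of 30 and central angles (−65, −40).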
Optimisation of the coil regions – Step 9: In this step the torsion angle intervals of the protein main-chain are reduced. We trim only the intervals of the amino acid residues located in coil regions of the protein. This choice is based on two factors: (1) α-helix and β-sheet regions are regular structures and show regularities in their dihedral angles; (2) coil regions are key to the fold (the arrangement of the secondary structures in 3-D space) that a protein molecule will adopt (Fiser et al., 2000). The intervals of the regions of regular structure are not reduced; for these we use the midpoint c of the interval (equation (9)). To reduce the intervals of the coil regions we created an algorithm composed of six major steps, as follows.
1. Coil regions are identified from the consensus predicted secondary structure of the target sequence K. A coil region is formed by at least two consecutive amino acid residues in that conformational state (‘C’ state). Figure 6 illustrates this step. Only one coil region is analysed at a time.
Figure 6 Schematic representation of coil regions from the consensus predicted secondary structure of the target sequence K
2. Pairs of torsion angles are selected from the intervals of the coil region. We generate a set of pseudo-random numbers, uniformly distributed between 0 and 1, with the linear congruential method (Graybeal and Pooch, 1980). This method is described in equation (10), where a is an integer between 1 and M, and M is a prime number. The equation multiplies a by Z_{k−1} and divides the result by M; the remainder is stored in Zk, and the algorithm runs until a set of N random numbers is obtained. In the developed algorithm we adopt as defaults a = 25,717 and M = 2,147,483,647.
$$Z_k = (a \times Z_{k-1}) \bmod M \qquad (10)$$
The random numbers form the set result = {Z0, Z1, ..., ZN}, where result is a vector of N random numbers Zk. Based on the result vector, an angle is selected from an interval [θ] with lower bound $\underline{\theta}$ and width w([θ]) (equation (11)).
$$\eta(\theta) = \underline{\theta} + (\mathrm{result}[i] \times w([\theta])) \qquad (11)$$
3. After the torsion angles of all amino acid residues in a coil region have been selected, the protein structure is built. For amino acid residues in α-helix and β-sheet regions, the central value of the interval is used (equation (9)). The torsion angle information is processed by the teLeap module of AMBER 7 (Case et al., 2005), which generates the protein structure.
4. The side-chains of the protein structure built in Step 3 are optimised with SCAP (Jacobson et al., 2002; Xiang and Honig, 2001). SCAP adjusts the side-chains using Dunbrack’s rotamer library (Dunbrack and Karplus, 1993).
5. We utilise the TINKER package (Kundrot et al., 2004; Pappu et al., 1998; Ponder and Richards, 2007; Ren and Ponder, 2003) with the AMBER94 force field (Section 2.2) to calculate the potential energy of the predicted protein structure. The energy value is employed to further reduce the torsion angle intervals in the next step.
6. Based on the structures with the lowest energy, we reduce the φ, ψ intervals of the amino acid residues in the coil region. The arithmetic mean of each torsion angle of each amino acid residue in the coil region is calculated over the δ structures selected in the current iteration of the algorithm (equation (12)).
$$m(\theta) = \frac{1}{\delta} \sum_{i=1}^{\delta} \sigma_i \qquad (12)$$
A torsion angle interval [θ] (φ or ψ) of an amino acid residue i, in a coil region of the target sequence K, is reduced in the following way:
a. The centre c([θ]) of the torsion angle interval is calculated.
b. The mean value of the analysed angle in the σ conformations with the lowest energy is calculated.
c. From c([θ]), and considering a threshold value ν of 10% around the interval's midpoint, we verify whether the mean value m(θ) of the structures with the lowest potential energy lies close to the interval's lower or upper bound. If it is close to the lower bound, the method removes a fraction ν of the interval, starting from the upper bound. Conversely, if it is close to the upper bound, the interval is reduced in the same way, starting from the lower bound. When the mean value lies within the threshold ν/2 of the interval's width, the interval is reduced from both the lower and the upper bound (Figure 7).
Figure 7 Schematic representation of an interval showing the interval's limits and their reduction (see text above)
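Steps 1 and 2 of the coil optimisation described above can be sketched in Python as follows. This is an illustrative sketch only. The function names are hypothetical, the multiplier and modulus are the defaults quoted in the text, and two points are assumptions not stated explicitly there: that the generator output is divided by M to obtain numbers between 0 and 1, and that an angle is drawn as the lower bound of the interval plus the random number times the interval width.

```python
from typing import Iterator, List, Tuple

A = 25_717          # multiplier a quoted in the text
M = 2_147_483_647   # prime modulus M quoted in the text (2**31 - 1)


def coil_regions(ss: str, min_len: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, for every run of 'C'
    with at least min_len residues in the predicted secondary structure."""
    regions: List[Tuple[int, int]] = []
    start = None
    for i, state in enumerate(ss + "#"):  # sentinel closes a trailing run
        if state == "C":
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                regions.append((start, i))
            start = None
    return regions


def lcg(seed: int, a: int = A, m: int = M) -> Iterator[float]:
    """Multiplicative linear congruential generator Z_k = (a * Z_{k-1}) mod m.
    Assumption: Z_k is divided by m to map it into the open interval (0, 1)."""
    z = seed
    while True:
        z = (a * z) % m
        yield z / m


def sample_angle(lower: float, upper: float, r: float) -> float:
    """Draw an angle from [lower, upper] using a random number r in (0, 1):
    lower bound plus r times the interval width."""
    return lower + r * (upper - lower)
```

With the string "CCHHHHCHHCCC" the sketch reports two coil regions, residues 0 to 1 and 9 to 11; the isolated ‘C’ at position 6 is ignored because a coil region needs at least two consecutive residues.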
In each iteration of the algorithm that optimises the coil regions, the interval of every main-chain torsion angle of every amino acid residue in a coil region of the protein is reduced. The width of each interval is then recalculated, and the algorithm stops when w([θ]) of all intervals in the coil regions is less than or equal to a limit ε. In this work we adopt a default limit of ε = 10.
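The interval reduction and its stopping criterion can be sketched in Python as follows. This is a sketch under stated assumptions rather than the authors' code. The function names are hypothetical. The text does not fully specify how much is removed in each case, so the sketch assumes that a fraction ν of the current width is removed from the bound opposite to the mean, and that ν/2 of the width is removed from each bound when the mean lies within ν/2 of the width around the midpoint. In the full method the mean would be recomputed from newly generated structures in every iteration; here it is held fixed to keep the example short.

```python
from typing import List, Tuple

NU = 0.10        # threshold nu (10%) quoted in the text
EPSILON = 10.0   # default stopping width epsilon quoted in the text


def mean_angle(values: List[float]) -> float:
    """Arithmetic mean of one torsion angle over the selected structures."""
    return sum(values) / len(values)


def reduce_interval(lower: float, upper: float, m: float,
                    nu: float = NU) -> Tuple[float, float]:
    """One reduction of [lower, upper] guided by the mean value m.
    Assumption: the amount removed is nu times the current width."""
    width = upper - lower
    centre = (lower + upper) / 2.0
    cut = nu * width
    if abs(m - centre) <= cut / 2.0:
        # Mean close to the midpoint: trim both bounds.
        return lower + cut / 2.0, upper - cut / 2.0
    if m < centre:
        # Mean nearer the lower bound: trim from the upper bound.
        return lower, upper - cut
    # Mean nearer the upper bound: trim from the lower bound.
    return lower + cut, upper


def shrink_until(lower: float, upper: float, m: float,
                 epsilon: float = EPSILON) -> Tuple[float, float, int]:
    """Repeat the reduction until the width is at most epsilon.
    Assumption: m is kept fixed, unlike in the full iterative method."""
    steps = 0
    while upper - lower > epsilon:
        lower, upper = reduce_interval(lower, upper, m)
        steps += 1
    return lower, upper, steps
```

Because every reduction in this sketch removes 10% of the current width, an interval of width 100 needs 22 reductions to reach a width of at most 10.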
4 Experiments and results
In this section we present the results obtained with the developed method. We predicted the approximate 3-D structure of five proteins: 1ZDD (Starovasnik et al., 1997), 1ROP (Banner et al., 1987), 1K43 (Pastor et al., 2002), 1GB1 (Gronenborn et al., 1991) and 1GAB (Mace and Agard, 1995). Table 3 summarises the information about these proteins, which cover the α, β, and α/β classes. Ribbon representations of the experimental structures (black) can be seen in Figure 8.
Table 3 Number of amino acid residues of the target proteins and the number of target fragments obtained in the fragmentation step (with a window size of 5 residues)
PDB ID    Num. amino acids    Num. fragments
1ZDD      34                  30
1K43      14                  12
1ROP      54                  50
1GB1      56                  52
1GAB      53                  49
Figure 8 Ribbon representation of the Cα trace of experimental (black) and predicted protein structures (grey). Experimental and predicted 3-D structure of proteins with PDB ID = 1ZDD (A), 1K43 (B), 1ROP (C), 1GB1 (D) and 1GAB (E). Side-chains are not shown for clarity
4.1 Algorithm parameters
We tested the method in the prediction of the approximate 3-D structure of five proteins (Table 3). In the fragmentation step we use a window size of five (l = 5 amino acid residues). In each case study we remove all PDB entries whose sequences are 50% or more identical to the target protein. We run PROCHECK (Laskowski et al., 1993) to analyse the stereo-chemical quality of the predicted structures. The secondary structure is calculated with DSSP (Kabsch and Sander, 1983) and PROMOTIF (Hutchinson and Thornton, 1996). Structure illustrations were prepared with PyMOL (Delano, 2002). RMSD values (Section 2.3) are calculated with PROFIT (McLachlan, 1982). We do not consider the first (N-terminus) and the last (C-terminus) 2 amino acid residues. Their torsion angles are fixed at 180°. The algorithm parameters were set as follows for the prediction tests:
• Number of initial structures (ι): ι = 1,000 for each coil region in the first step of the algorithm.
• Number of conformations after the first step (ι): ι = 100 conformations are generated for each coil region in each subsequent step of the algorithm.
• Number of structures considered to determine the type of interval reduction (from the upper or the lower bound), δ: δ = 10% of the structures are used to determine the type of interval reduction in a coil region.
• Interval threshold (ν): a threshold of plus or minus 10% relative to the midpoint of the interval.
• Minimal size of an interval: the interval reduction stops at the limit width w([θ]) = 10.
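The parameter settings listed above can be collected in a small configuration object, sketched here in Python. The class and field names are hypothetical. The sketch also assumes that δ = 10% is applied to the number of structures generated in the current step, which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredictionParameters:
    """Default values transcribed from Section 4.1; names are illustrative."""
    window_size: int = 5               # fragment length l
    identity_cutoff: float = 0.50      # templates at or above this are removed
    initial_structures: int = 1000     # per coil region, first step
    structures_per_step: int = 100     # per coil region, later steps
    selected_fraction: float = 0.10    # delta
    threshold: float = 0.10            # nu, relative to the interval midpoint
    min_interval_width: float = 10.0   # stopping width w([theta])

    def n_selected(self, first_step: bool) -> int:
        # Assumption: delta is a fraction of the structures of the current step.
        pool = self.initial_structures if first_step else self.structures_per_step
        return max(1, round(pool * self.selected_fraction))
```

Under this assumption, 100 structures would guide the reduction in the first step and 10 in each later step.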
Secondary structure predictions are performed at the NPS@ (Combet et al., 2000) and SCRATCH (Cheng et al., 2005) servers. All programs have been implemented using the Python and C++ programming languages and executed in a Linux environment of a PC Intel Core 2 Duo E6400 2.4 GHz 2MB Cache and 2GB RAM.
4.2 Results
In this section we describe and analyse the results obtained by the proposed method. Table 4 presents the Cα root mean square deviation (RMSD) of the predicted approximate structures.
Table 4 Cα RMSD of the predicted approximate 3-D structures with respect to the experimental, NMR or X-ray structures. For example, 1ZDD is the experimental structure while 1ZDD-P represents the predicted structure
Predicted protein ID    RMSD (Å)
1ZDD-P                  5.00
1K43-P                  1.28
1ROP-P                  6.69
1GB1-P                  11.18
1GAB-P                  11.91
As can be seen in Table 4 and Figure 8, the first three proteins adopt folds very similar to the equivalent experimental data. The RMSD values are somewhat large, but characteristic for this type of prediction. The other two test cases have a far larger RMSD because their regular secondary structures did not come close enough together to produce the expected final packing. However, when we compare segments of the predicted approximate 3-D structures of each test case against their experimental counterparts, we obtain improved RMSD values, as reported in Table 5.
Table 5 Cα RMSD of segments of regular secondary structures of the predicted approximate 3-D structures with respect to the experimental, NMR or X-ray structures
Protein ID    Amino acid residues (i − j)    RMSD (Å)
1ZDD-P        4–14                           0.60
–             20–30                          0.40
1K43-P        3–6                            0.90
–             9–12                           1.64
1ROP-P        3–28                           0.90
–             32–54                          0.80
1GB1-P        3–8                            0.22
–             18–33                          4.77
–             42–45                          0.20
–             50–54                          1.06
1GAB-P        4–20                           3.15
–             27–33                          0.36
–             37–52                          1.05
The RMSDs of the segments indicate that the individual helices and sheets are well formed, except for one segment in 1GB1-P and another in 1GAB-P. Their individual β strands seem to be well formed, but they did not pack close together to form a true β-sheet. The same happened to the helices in 1GAB. Table 6 shows the secondary structure contents of the predicted approximate structures. The results obtained with DSSP (Kabsch and Sander, 1983) and PROMOTIF (Hutchinson and Thornton, 1996) corroborate the previous results in that the helices in the predicted structures are well formed and similar to the experimental ones.
Table 6 Analysis of secondary structure and coil contents of the predicted approximate 3-D structures
Protein ID    β-sheet%    α-helix%    3₁₀-helix%    Others%
1ZDD-P        0.00        76.50       0.00          23.50
1ZDD          0.00        73.50       0.00          26.50
1K43-P        28.60       0.00        21.40         78.60
1K43          42.90       0.00        0.00          57.10
1ROP-P        0.00        89.30       0.00          10.70
1ROP          0.00        89.30       0.00          10.70
1GB1-P        0.00        28.60       54.40         66.10
1GB1          35.70       23.20       0.00          41.10
1GAB-P        0.00        73.60       0.00          26.40
1GAB          0.00        66.00       7.50          26.40
We ran PROCHECK (Laskowski et al., 1993) for the predicted and experimental structures to verify and compare the stereo-chemical quality of the predicted structures (Table 7). We notice that the percentage of residues in the preferred regions of the Ramachandran plot decreases with increasing complexity of the test proteins. These results are somewhat expected, given that 1GAB-P and 1GB1-P have a more complex folding pattern than the other proteins.
Table 7 Ramachandran plot preferences of the amino acid residues in the experimental and predicted 3-D structures. Most: most favourable region; Addit.: additional allowed region; Gener.: generously allowed region; Disal.: disallowed region
Protein ID    Most %    Addit. %    Gener. %    Disal. %
1ZDD-P        87.10     12.90       0.00        0.00
1ZDD          87.10     12.90       0.00        0.00
1K43-P        100.00    0.00        0.00        0.00
1K43          66.70     33.30       0.00        0.00
1ROP-P        94.40     3.70        1.90        0.00
1ROP          98.10     1.90        0.00        0.00
1GB1-P        66.00     24.00       2.00        8.00
1GB1          84.00     16.00       0.00        0.00
1GAB-P        79.60     10.20       6.10        4.10
1GAB          80.00     18.00       2.00        0.00
4.3 Execution time and general evaluation of the proposed method
The developed method can be classified as a first principle method that uses database information (Srinivasan and Rose, 1995; Rohl et al., 2004). We extract rules of 3-D protein structure from the PDB. The acquired structural information is represented as intervals of torsion angles which, in a second stage, are reduced in order to find the conformation with the lowest potential energy. As we stated in the Introduction, our objective is to obtain approximate 3-D structures for protein sequences of unknown structure, which, in turn, can be refined by state-of-the-art MM methods. These approximate 3-D structures can be used as starting conformations in ab initio methods, hence considerably reducing the conformational space to be searched. The total time of ab initio methods, which usually start from a fully extended conformation, will be greatly reduced (Breda et al., 2007). We emphasise that our method is designed to predict an approximate 3-D conformation in a very fast manner. Using interval arithmetic we could represent more accurately the torsion angle information obtained from PDB templates. Each prediction reported in Section 4.2 takes about 24–48 hours of CPU time in a Linux environment on a PC with an Intel Core 2 Duo 2.4 GHz processor, 2MB cache and 2GB RAM. This time depends on the length of the amino acid sequence of the target protein. When compared with other prediction methods classified as first principle methods that use database information, our method presents advantages in terms of the time required to produce native-like approximate 3-D structures of proteins. ROSETTA (Rohl et al., 2004; Simons et al., 1999), FRAGFOLD (Jones et al., 1992; Jones, 2001), I-TASSER (Zhang, 2007, 2008, 2009) and LINUS (Srinivasan and Rose, 1995, 2002) have been the most successful predictors in recent years, as revealed by the biennial Critical Assessment of protein Structure Prediction (CASP – http://predictioncenter.org/) experiments (Moult, 2005; Raman et al., 2009; Zhang, 2009). Clearly, they present the best results; however, the proposed strategy is a novel approach to predicting approximate 3-D structures of proteins. Refined structures are expected to be obtained in a refinement step using classical molecular mechanics methods, such as MD simulations.
5 Conclusion
We introduced a new method to predict approximate 3-D structures of proteins. This method is based on CReF (Dorn and Norberto de Souza, 2008, 2010a), which was modified to build structures whose main-chain φ, ψ torsion angles are represented as intervals. We do not use entire fragments, but only the φ, ψ pair information of the central residue in the fragment. The predicted consensus secondary structure is used to guide the construction of the approximate 3-D structure. The developed method does not utilise techniques for fragment assembly and optimisation. This makes it a very fast method. The five case studies showed correct prediction of secondary structures and, in three cases, a partially correct overall fold. For larger proteins the approximate structure is less accurate. Complete packing of the secondary structures into super-secondary and tertiary structures could not be achieved at this stage of the method development. This is likely to be due to the poorer definition of coil regions such as turns and loops. However, we expect these deviations to be adjusted or corrected by properly designed protocols of refinement by MM methods, particularly MD simulations. This would in turn reduce the total time that ab initio methods, which usually start from a fully extended conformation of a protein, would take to fold a sequence of unknown structure. We see the main contributions of this work in
• the proposal of an approach for the generation of an approximate 3-D structure of proteins,
• using φ, ψ torsion angle information of the central residue in contiguous fragments of a target sequence,
• and clustering the torsion angles according to a 3-state secondary structure model,
• the development of a strategy to represent and construct a protein structure based on interval analysis,
• the development of a strategy to reduce the torsion angle intervals with the objective of finding the conformation with the lowest potential energy.
Acknowledgements
We thank Professor Dalcidio Moraes Claudio for inspiring the use of Interval Arithmetic in this work. This research was supported by grants 302641/2009-2 and 559917/2010-4 from CNPq to ONS. ONS is a CNPq Research Fellow. MD was supported by a CNPq M.Sc. scholarship. Currently MD is supported by a CNPq PhD scholarship (142158/2010-0).
References
Alefeld, G. and Herzberger, J. (1983) Introduction to Interval Computations, 1st ed., Academic Press, New York, USA.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. (1997) ‘Gapped blast and psi-blast: a new generation of protein database search programs’, Nucleic Acids Res., Vol. 25, No. 17, pp.3389–3402.
Arnold, K., Bordoli, L., Kopp, J. and Schwede, T. (2006) ‘The swiss-model workspace: A web-based environment for protein structure homology modelling’, Bioinformatics, Vol. 22, No. 2, pp.195–201.
Banner, D.W., Kokkinidis, M. and Tsernoglou, D. (1987) ‘Structure of the cole1 rop protein at 1.7 a resolution’, J. Mol. Biol., Vol. 196, No. 3, pp.657–675.
Baxevanis, A.D. and Ouellette, B.F. (2001) Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, 3rd ed., John Wiley and Sons Inc., New Jersey, USA.
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N. and Bourne, P.E. (2000) ‘The protein data bank’, Nucleic Acids Res., Vol. 28, No. 1, pp.235–242.
Branden, C. and Tooze, J. (1998) Introduction to Protein Structure, 2nd ed., Garland Publishing Inc., New York, USA.
Breda, A., Santos, D.S., Basso, L.A. and Norberto de Souza, O. (2007) ‘Ab initio 3-d structure prediction of an artificially designed three-a-helix bundle via all-atom molecular dynamics simulations’, Genet. Mol. Res., Vol. 6, No. 4, pp.901–910.
Bryant, S.H. and Altschul, S. (1995) ‘Statistics of sequence-structure threading’, Curr. Opin. Struct. Biol., Vol. 5, No. 2, pp.236–244.
Bujnicki, J.M. (2006) ‘Protein structure prediction by recombination of fragments’, ChemBioChem, Vol. 7, No. 1, pp.19–27.
Case, D.A., Cheatham, T.E., Darden, T., Gohlke, H., Luo, R., Merz, K.M. and Onufriev, A. (2005) ‘The amber biomolecular simulation program’, J. Comput. Chem., Vol. 26, No. 16, pp.1668–1688.
Chapman, B. and Chang, J. (2000) ‘Biopython: python tools for computational biology’, ACM SIGBIO Newsl., Vol. 20, No. 2, pp.15–19.
Cheng, J., Randall, A., Sweredoski, P. and Baldi, M. (2005) ‘Scratch: a protein structure and structural feature prediction server’, Nucleic Acids Res., Vol. 33, No. 2, pp.72–76.
Combet, C., Blanchet, C., Geourjon, C. and Deleage, G. (2000) ‘Nps: Network protein sequence analysis’, Trends Biochem. Sci., Vol. 25, No. 3, pp.147–150.
Cornell, W.D., Cieplak, P., Bayly, C.I., Gould, I.R., Merz Jr., K.M., Ferguson, D.M., Spellmeyer, D.C., Fox, T., Caldwell, J.W. and Kollman, P.A. (1995) ‘A second generation force field for the simulation of proteins, nucleic acids, and organic molecules’, J. Am. Chem. Soc., Vol. 117, No. 19, pp.5179–5197.
Creighton, T.E. (1990) ‘Protein folding’, Biochem. J., Vol. 270, pp.1–16.
Delano, W.L. (2002) The PyMOL Molecular Graphics System, Delano Scientific, San Carlos, CA, USA.
Deleage, G. and Roux, B. (1987) ‘An algorithm for protein secondary structure prediction based on class prediction’, Protein Eng., Vol. 1, No. 4, pp.289–294.
Dorn, M., Breda, A. and Norberto de Souza, O. (2008) ‘A hybrid method for the protein structure prediction problem’, Lect. Notes Bioinf., Vol. 5167, pp.47–56.
Dorn, M. and Norberto de Souza, O. (2008) ‘Cref: A central-residue-fragment-based method for predicting approximate 3-d polypeptides structures’, Proceedings of the ACM Symposium on Applied computing, Fortaleza, Brazil, pp.1261–1267.
Dorn, M. and Norberto de Souza, O. (2010a) ‘Mining the protein data bank with cref to predict approximate 3-d structures of polypeptides’, Int. J. Data Min. Bioin., Vol. 4, No. 3, pp.281–299.
Dunbrack Jr., R.L. and Karplus, M. (1993) ‘Backbone-dependent rotamer library for proteins: application to side-chain prediction’, J. Mol. Biol., Vol. 230, No. 2, pp.543–574.
Eswar, N., Martí-Renom, M.A., Webb, B., Madhusudhan, M.S., Eramian, D., Shen, M., Pieper, U. and Sali, A. (2006) ‘Comparative protein structure modelling with modeller’, Curr. Protoc. Bioinf., Vol. 15, No. 56, pp.1–30.
Fiser, A., Do, R.K.G. and Sali, A. (2000) ‘Modelling of loops in protein structure’, Protein Sci., Vol. 9, No. 9, pp.1753–1773.
Floudas, C.A., Fung, H.K., McAllister, S.R., Mönnigmann, M. and Rajgaria, R. (2006) ‘Advances in protein structure prediction and de novo protein design: A review’, Chem. Eng. Sci., Vol. 61, No. 3, pp.966–988.
Garnier, J., Gibrat, J-F. and Robson, B. (1996) ‘Gor secondary structure prediction method version iv’, Methods Enzymol., Vol. 266, pp.540–553.
Garnier, J., Osguthorpe, D.J. and Robson, B. (1978) ‘Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins’, J. Mol. Biol., Vol. 120, No. 1, pp.97–120.
Geourjon, C. and Deleage, G. (1994) ‘Sopm: a self-optimised method for protein secondary structure prediction’, Protein Eng., Vol. 7, No. 2, pp.157–164.
Geourjon, C. and Deleage, G. (1995) ‘Sopma: significant improvements in protein secondary structure prediction by consensus prediction from multiple alignments’, Comput. Appl. Biosci., Vol. 11, No. 6, pp.681–684.
Gibrat, J.F., Garnier, J. and Robson, B. (1987) ‘Further developments of protein secondary structure prediction using information theory. new parameters and consideration of residue pairs’, J. Mol. Biol., Vol. 198, No. 3, pp.425–443.
Graybeal, W.J. and Pooch, U.W. (1980) Simulation: Principles and Methods, 1st ed., Cambridge Winthrop Publishers Inc., Cambridge, UK.
Gronenborn, A.M., Filpula, D.R., Essig, N.Z., Achari, A., Whitlow, M., Wingfield, P.T. and Clore, G.M. (1991) ‘A novel, highly stable fold of the immunoglobulin binding domain of streptococcal protein g’, Science, Vol. 253, No. 5020, pp.657–661.
Guermeur, Y., Geourjon, C., Gallinari, P. and Deleage, G. (1996) ‘Incorporation of non-local interactions in protein secondary structure prediction from the amino acid sequence’, Protein Eng., Vol. 9, No. 2, pp.133–142.
Guermeur, Y., Geourjon, C., Gallinari, P. and Deleage, G. (1999) ‘Improved performance in protein secondary structure prediction by inhomogeneous score combination’, Bioinformatics, Vol. 15, No. 5, pp.413–421.
Hovmöller, T.Z., Zhou, T.Z. and Ohlson, T. (2002) ‘Conformation of amino acids in protein’, Acta Crystallogr., Sect. D: Biol. Crystallogr., Vol. 58, No. 5, pp.768–776.
Hutchinson, E.G. and Thornton, J.M. (1996) ‘Promotif: A program to identify and analyse structural motifs in proteins’, Protein Sci., Vol. 5, No. 2, pp.212–220.
Jacobson, M.P., Friesner, R.A., Xiang, Z. and Honig, B. (2002) ‘On the role of the crystal environment in determining protein side-chain conformations’, J. Mol. Biol., Vol. 320, No. 3, pp.597–608.
Jones, D.T., Taylor, W.R. and Thornton, J.M. (1992) ‘A new approach to protein fold recognition’, Nature, Vol. 358, No. 6381, pp.86–89.
Jones, D. (1999) ‘Genthreader: an efficient and reliable protein fold recognition method for genomic sequences’, J. Mol. Biol., Vol. 287, No. 4, pp.797–815.
Jones, D. (2001) ‘Predicting novel protein folds by using fragfold’, Proteins, Vol. 45, No. S5, pp.127–132.
Kabsch, W. and Sander, C. (1983) ‘Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features’, Biopolymers, Vol. 22, No. 12, pp.2577–2637.
King, R.D. and Sternberg, M.J. (1996) ‘Identification and application of the concepts important for accurate and reliable protein secondary structure prediction’, Proteins, Vol. 5, No. 11, pp.2298–2310.
Kolinski, A. (2004) ‘Protein modelling and structure prediction with a reduced representation’, Acta Biochim. Pol., Vol. 51, pp.349–371.
Kundrot, C.E., Ponder, J.W. and Richards, F.M. (2004) ‘Algorithms for calculating excluded volume and its derivatives as a function of molecular conformation and their use in energy minimisation’, J. Comput. Chem., Vol. 12, No. 3, pp.402–409.
Laskowski, R.A., MacArthur, M.W., Moss, D.S. and Thornton, J.M. (1993) ‘Procheck: a program to check the stereochemical quality of protein structures’, J. Appl. Crystallogr., Vol. 26, No. 2, pp.283–291.
Lesk, A.M. (2000) Introduction to Protein Architecture: the Structural Biology of Proteins, 1st ed., Oxford University Press, Cambridge, UK.
Levinthal, C. (1968) ‘Are there pathways for protein folding?’, J. Chem. Phys., Vol. 65, pp.44–45.
Mace, J.E. and Agard, D.A. (1995) ‘Kinetic and structural characterisation of mutations of glycine 216 in alpha-lytic protease: a new target for engineering substrate specificity’, J. Mol. Biol., Vol. 254, No. 4, pp.720–736.
Stuart, A., Fiser, A., Sanchez, R., Mello, F., Martì-Renom, M.A. and Sali, A. (2000) ‘Comparative protein structure modelling of genes and genomes’, Annu. Rev. Biophys. Biomol. Struct., Vol. 29, No. 16, pp.291–235.
McLachlan, A.D. (1982) ‘Rapid comparison of protein structures’, Acta Crystallogr., Vol. A38, pp.871–873.
Moore, R.E. and Yang, C.T. (1959) ‘Interval analysis’, Technical Report, Lockheed Missiles and Space Co., Sunnyvale, CA, USA, Technical Report Space Div. Report LMSD285875.
Moore, R.E. (1959) Automatic Error Analysis in Digital Computation, Technical report, Lockheed Missiles and Space Co., Sunnyvale, CA, USA, Technical Report Space Div. Report LMSD84821.
Moore, R.E. (1966) Interval Analysis, 1st ed., Prentice-Hall, New Jersey, USA.
Moore, R.E. (1999) ‘The dawning’, Reliab. Comput., Vol. 5, pp.423–424.
Morris, A.L., MacArthur, M.W., Hutchinson, E.G. and Thornton, J.M. (1992) ‘Stereochemical quality of protein structure coordinates’, Proteins, Vol. 12, No. 4, pp.345–364.
Moult, J.A. (2005) ‘Decade of casp: progress, bottlenecks and prognosis in protein structure prediction’, Curr. Opin. Struct. Biol., Vol. 15, pp.285–289.
Ngo, J.T., Marks, J. and Karplus, M. (1997) The Protein Folding Problem and Tertiary Structure Prediction, chapter Computational complexity, protein structure prediction and the Levinthal Paradox, Boston, MA, USA.
Osguthorpe, D.J. (2000) ‘Ab initio protein folding’, Curr. Opin. Struct. Biol., Vol. 10, No. 2, pp.146–152.
Pappu, R.V., Hart, R.K. and Ponder, J.W. (1998) ‘Analysis and application of potential energy smoothing for global optimisation’, J. Phys. Chem. B, Vol. 105, pp.9725–9742.
Pastor, M.T., Lopez de la Paz, M., Lacroix, E., Serrano, L. and Perez-Paya, E. (2002) ‘Combinatorial approaches: a new tool to search for highly structured beta-hairpin peptides’, Proc. Natl. Acad. Sci. USA, Vol. 99, No. 2, pp.614–619.
Ponder, J.W. and Richards, F.M. (2007) ‘An efficient newton-like method for molecular mechanics energy minimisation of large molecules’, Comput. Chem., Vol. 8, pp.1016–1024.
Ramachandran, G.N. and Sasisekharan, V. (1968) ‘Conformation of polypeptides and proteins’, Adv. Protein Chem., Vol. 23, pp.238–437.
Raman, S., Vernon, R., Thompson, J., Tyka, M., Sadreyev, R., Pei, J., Kim, D., Kellogg, E., Dimaio, F., Lange, O., Kinch, L., Sheffler, W., Kim, B., Das, R., Grishin, N. and Baker, D. (2009) ‘Structure prediction for casp8 with all-atom refinement using rosetta’, Proteins, Vol. 77, pp.89–99.
Ren, P. and Ponder, J.W. (2003) ‘Polarisable atomic multipole water model for molecular mechanics simulation’, J. Phys. Chem. B, Vol. 107, No. 24, pp.5933–5947.
Rohl, C.A., Strauss, C., Misura, K. and Baker, D. (2004) ‘Protein structure prediction using rosetta’, Methods Enzymol., Vol. 383, pp.66–93.
Rost, B. and Sander, C. (1993) ‘Prediction of protein secondary structure at better than 70% accuracy’, J. Mol. Biol., Vol. 232, pp.584–599.
Sánchez, R. and Sali, A. (1997) ‘Advances in comparative protein-structure modelling’, Curr. Opin. Struct. Biol., Vol. 7, No. 2, pp.206–214.
Simons, K.T., Bonneau, R., Ruczinski, I. and Baker, D. (1999) ‘Ab initio protein structure prediction of casp iii targets using rosetta’, Proteins, Vol. 3, pp.171–176.
Srinivasan, R. and Rose, G.D. (1995) ‘Linus – a hierarchic procedure to predict the fold of a protein’, Proteins, Vol. 22, No. 2, pp.81–99.
Srinivasan, R. and Rose, G.D. (2002) ‘Ab initio prediction of protein structure using linus’, Proteins, Vol. 47, No. 4, pp.489–495.
Starovasnik, M.A., Braisted, A.C. and Wells, J.A. (1997) ‘Structural mimicry of a native protein by a minimised binding domain’, Proc. Natl. Acad. Sci. USA, Vol. 94, pp.10080–10085.
Tramontano, A. (2006) Protein Structure Prediction, 1st ed., John Wiley and Sons Inc., Weinheim, Germany.
van der Spoel, D., Lindahl, E., Hess, B., Groenhof, G., Mark, A. and Berendsen, H. (2005) ‘Gromacs: fast, flexible, and free’, J. Comput. Chem., Vol. 26, No. 16, pp.1701–1718.
van Gunsteren, W.F. and Berendsen, H.J.C. (1990) ‘Computer simulation of molecular dynamics: methodology, applications, and perspectives in chemistry’, Angew. Chem., Int. Ed. Engl., Vol. 29, No. 9, pp.992–1023.
Witten, I.H. and Frank, E. (2006) Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., Morgan Kaufmann Series in Data Management Systems, Oxford, UK.
Xiang, Z. and Honig, B. (2001) ‘Extending the accuracy limits of prediction for side-chain conformations’, J. Mol. Biol., Vol. 311, No. 2, pp.421–430.
Zhang, X., Waltz, D. and Mesirov, J. (1989) ‘Protein structure prediction by a data-level parallel algorithm’, Proceedings of the ACM/IEEE Conference on Supercomputing, Reno, NV, USA, pp.215–223.
Zhang, Y. (2007) ‘Template-based modelling and free modelling by i-tasser in casp7’, Proteins, Vol. 8, pp.108–117.
Zhang, Y. (2008) ‘I-tasser server for protein 3d structure prediction’, BMC Bioinf., Vol. 9, No. 40, pp.1–8.
Zhang, Y. (2009) ‘I-tasser: Fully automated protein structure prediction in casp8’, Proteins, Vol. 77, No. S9, pp.100–113.
Zhirong, S. and Jiang, B. (1996) ‘Patterns and conformations of commonly occurring supersecondary structures (basic motifs) in protein data bank’, J. Protein Chem., Vol. 15, No. 7, pp.675–690.