The African Review of Physics (2015) 10:0056


Move-biased Monte Carlo Simulation Method for Amino Acid Gyration in Protein Native Structure Prediction

Samson O. Aisida1,* and Oluwole E. Oyewande2
1 Department of Physics and Astronomy, University of Nigeria, Nsukka, Nigeria
2 Department of Physics, University of Ibadan, Ibadan, Nigeria

Proteins, the workhorses of living organisms, are linear polymers of amino acid (AA) sequences. They execute the genetic code inscribed in the DNA of organisms, and their functionality depends on their native structure (NS). Predicted structures vary considerably with the choice of energy function and structure-mapping space. In our simplified approach to ab initio protein structure prediction, the structure is evaluated with the hydrophobic-polar (HP) energy model and mapped onto a two-dimensional square lattice. To build the hydrophobic core using the HP energy model, we develop a Move-Biased Monte Carlo (MBMC) simulation algorithm for the gyration of amino acid sequences that simplifies the complexity of existing Monte Carlo (MC) methods and makes them consistent for native structure prediction (NSP). The MBMC method was developed by designing and coupling a moves algorithm that implements diagonal-pull moves on the lowest-energy self-avoiding walk (SAW) conformation. The method was tested on a set of benchmark protein sequences and the results were compared with some existing methods. We found that this model is very effective and gave the lowest energy in all cases with a short computation time, fewer algorithmic steps and simpler simulation procedures for isotropic gyration.

1. Introduction

Protein generally refers to a complete biological molecule in a stable conformation. This biopolymer is composed of 20 basic types of amino acids as building blocks and adopts a unique folded three-dimensional (3D) conformation, the native structure, which has the lowest possible free energy and allows it to function optimally. The amino acid sequence of each protein contains all the essential information needed to determine a unique 3D structure, which is the minimal free energy state [1]. A protein can be unfolded, or denatured, into a flexible chain that has lost its natural shape by applying denaturing conditions such as elevated temperature. Improper protein folding is the basis of many age-related diseases such as Alzheimer's disease, Parkinson's disease and cancer-related illnesses [2-5]. Protein structure prediction (PSP) is computationally intractable even for short lattice chains [6-8]. The PSP problem is to obtain the 3D structure from a given amino acid sequence such that the overall potential energy among the non-bonded amino acid residues is minimized. A successful method for obtaining this structure would have far-reaching implications in protein science, for example in drug design. Considering the throughput of sequencing, it is not productive to rely exclusively on experimental methods, which are time consuming and laborious and leave far more known sequences than solved structures. To circumvent these limitations of X-ray crystallography and NMR, scientists have used discretized lattice-based energy models for PSP [9-12].

This paper addresses the protein folding problem (PFP), which consists of predicting the native structure of a protein from its sequence of amino acids; the problem involves conflicting constraints as well as a rugged energy landscape (as shown in Fig. 3) [13,14]. The folded structure must be stable and must fold within a reasonable time interval [15]. This study was designed to develop a move-biased MC (MBMC) simulation algorithm with a straightforward self-avoiding walk (SAW) for the gyration of protein sub-units (amino acids) that simplifies the complexity of existing MC methods and makes them consistent for NSP. The method is tested within the framework of a 2-D square lattice backbone-only model for chains with up to 85 monomers. The remainder of this paper is structured as follows: Sec. 2 describes the methodology, the results and discussion of the computational experiments are presented in Sec. 3, and the conclusion is given in Sec. 4.


2. Methodology

2.1. Computational methods for NSP

The computational approaches for NSP can be classified broadly into two main categories: comparative modeling and the ab initio approach [16-19,12]. Comparative modeling (a combination of homology modeling and threading methods) depends on an existing database of experimentally determined protein structures as starting points. Consequently, its results may become unrealistic and less accurate, especially for longer sequences. The ab initio (i.e., from the origin) approach is based on the physical principles governing the interaction of amino acids in a polypeptide chain and the surrounding solvent. When no suitable templates can be found, ab initio prediction is the only choice: it infers the protein structure from primary sequence information, based on the intrinsic properties (hydrophobic and polar) of amino acids. This method is saddled with three major responsibilities: model design, definition of an energy function that can effectively differentiate between compact and non-compact structures, and design of a search algorithm that can efficiently find the minimal-energy conformation. Ab initio folding is based on Anfinsen's thermodynamic hypothesis [20], which assumes that the native fold of a protein populates its global energy minimum, and the Levinthal paradox [21], i.e., proteins fold into their specific 3-D conformations in a time span far shorter than would be needed for protein molecules to actually search the entire conformation space for the lowest energy state [18,19,12,22-24]. The most prevalent abstraction among the ab initio methods is the hydrophobic-polar (HP) model by Dill [25]. In this work, we used the HP energy model for conformation evaluation and the 2D square lattice for conformation mapping, with the move-biased self-avoiding Monte Carlo simulation method (MBMC). This method incorporates a neighborhood search strategy (the diagonal-pull move).
This stochastic search method is heuristic, like Monte Carlo, genetic algorithms, tabu search, ant colony algorithms and simulated annealing, which have been prominent for the PSP problem. Iterative Monte Carlo methods based on local search have been at the forefront of the search for the lowest energy conformation. Among these is the replica exchange Monte Carlo algorithm (REMC) of Thachuk et al. [26], a classical Monte Carlo search method coupled with a random walk for PSP. It samples conformations according to the Boltzmann distribution in energy space and employs VSHD moves (a combination of three moves) and the pull-move neighborhood search for both the 2-D and 3-D HP lattice models on the benchmark sequences (BMS), reaching ground-state structures when compared with previous state-of-the-art results. Unger and Moult [27] used a genetic algorithm (GA) for the 2-D lattice model. This is an extension of the MC method that includes information exchange between a set of parallel simulations. They concluded that GA is superior to conventional MC in terms of search effectiveness for the PFP. Similarly, Kerson [28] used a genetic algorithm on optimal secondary structure (GAOSS), improving on the evolutionary Monte Carlo algorithm for PSP in the 2D HP model. Their results showed that GAOSS obtains the conformation faster and paves the way for more ground-state conformations. Besides MC search, Jacek et al. [29] used the tabu search strategy (TSS), with conformational motifs as problem-domain knowledge, to find the optimal conformations of the 2D BMS. Meanwhile, Mahmood et al. [12] used a tabu-based spiral search local method on the 3D FCC lattice; their algorithm employs a novel H-core directed guidance that squeezes the structure around a dynamic hydrophobic-core centre with the application of a random walk, which employs pull moves coupled with a relay-restart technique to enhance the H-core and prevent early convergence. Moreover, Alena and Holger [24] used an ant colony optimization algorithm (ACO) for both the 2-D and 3-D HP models to obtain the lowest energy when compared with previous state-of-the-art algorithms. Also, Guo et al. [30] designed a hybrid elastic net algorithm (ENL) coupled with local search strategies, which ameliorates the multi-mapping problem of the original elastic net algorithm, to produce the minimal energy for the BMS.
Presently, none of the aforementioned heuristic algorithms appears to completely dominate the others in terms of solution quality and run-time when applied to both the 2-D and 3-D lattice HP models.

2.2. H-P lattice energy model

To deal with the complexity of PSP, researchers have used discretized lattice-based structures and simplified energy models, since deterministic approaches are not efficient in identifying the ground-state energy conformations. The HP lattice model by Dill [25] is a standard model from the perspective of statistical mechanics; it shows rich thermodynamic behavior while abstracting from real proteins. This model is the most frequently used lattice model, and it is based on the observation that the hydrophobic interaction between amino acids is the main driving force for protein folding. In the HP model, amino acids are represented by a reduced alphabet of H and P according to the hydrophobicity of each amino acid. The Hs form the protein core, while the Ps, which have an affinity for water, tend to remain on the outer surface. Folding a protein in this model means that the amino acids are embedded in the 2-D lattice such that residues adjacent in the sequence occupy adjacent grid points in the lattice and no grid point is occupied by more than one residue, a process known as the self-avoiding walk. The optimization objective is to maximize the number of contacts between hydrophobic residues (H-H contacts); when this is achieved, the protein reaches its ground state [15]. The energy $E(\zeta)$ of a given conformation $\zeta$ is defined as the negative of the number of topological neighboring (TN) contacts, i.e., contacts between Hs that are adjacent on the lattice but non-consecutive in the sequence, with $E_{HH} = -1$ and $E_{HP} = E_{PP} = 0$. A conformation is denoted as $\zeta = \zeta_1, \zeta_2, \ldots, \zeta_n$ with $\zeta_i \in \{H, P\}$ and $i \in \{1, 2, \ldots, n\}$, where $\zeta_i$ is H if the ith amino acid is hydrophobic and P if it is polar. Therefore, if a conformation has $\lambda$ such H-H TN contacts, its energy is $E(\zeta) = \lambda(-1)$. The energy evaluation focuses on hydrophobicity only, and the sequence alphabet is reduced to two letters, hydrophobic (H) or polar (P), i.e., $\sigma_i \in \{H, P\}$. Hence, a conformation with the highest number of H-H contacts is a conformation with the lowest free energy.
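As a rough illustration of this contact-counting energy, the following Python sketch evaluates a conformation given as a list of lattice coordinates. The function name `hp_energy` and the coordinate-list representation are assumptions made here for illustration; they are not taken from the authors' implementation.

```python
def hp_energy(sequence, coords):
    """Return the HP energy of a 2-D square-lattice conformation.

    sequence: string over {"H", "P"}; coords: list of (x, y) lattice points,
    one per residue, in chain order. Each H-H pair that is adjacent on the
    lattice but not consecutive in the chain contributes -1.
    """
    if len(sequence) != len(coords):
        raise ValueError("sequence and coords must have the same length")
    if len(set(coords)) != len(coords):
        raise ValueError("conformation is not self-avoiding")

    index_at = {pos: i for i, pos in enumerate(coords)}
    energy = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only to the right and upward so each lattice pair is counted once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbour)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                energy -= 1
    return energy


# Hypothetical 4-residue example folded around a unit square:
# residues 0 and 3 are lattice neighbours but not chain neighbours.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(hp_energy("HPPH", square))
```

The dictionary lookup keeps the evaluation linear in the chain length, which matters when the energy is recomputed after every trial move.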

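$$E(\sigma, \zeta) = \sum_{1 \le i < j \le N} \epsilon_{\sigma_i \sigma_j}\, \Delta(q_i - q_j) \qquad (1)$$

where the interaction between monomers $i$ and $j$, located at positions $q_i$ and $q_j$, respectively, is defined by

$$\Delta(q_i - q_j) = \begin{cases} 1 & \text{if } |q_i - q_j| = 1 \text{ without covalent bond} \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

and

$$\epsilon_{\sigma_i \sigma_j} = \begin{cases} -1 & \text{if } \sigma_i \text{ and } \sigma_j \text{ are both hydrophobic (H)} \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$

2.3. Monomer gyration move-biased (MGMB) model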

We model the gyration of the protein residues due to the interaction of non-bonded monomers as the directional probabilities of a SAW. To obtain a stable conformation with a unique ground-state energy minimum in this model, the gyration plays a vital role. The gyration occurs when the protein residues turn in any of the four lattice directions (i.e., up, down, right and left), and it must be guided to obtain the native structure. A SAW is a random walk that is prohibited from revisiting a previously visited site. The SAW was first proposed about five decades ago as a standard model of a long-chain linear polymer molecule in a good solvent [31-33]. Much of the software used for protein structure prediction iterates on SAW subsets, but our model differs in including a mapping process to account for the fact that different protein-solvent systems correspond, in our model, to different sets of directional probabilities and SAW lengths. Our model also includes a post-SAW search mechanism in which multiple coupled moves (CM) are used repeatedly to build the hydrophobic core. The CM (diagonal-pull) is employed because it is complete, local and reversible. A CM on the 2-D square lattice is feasible only when at least one neighboring site is vacant (as shown in Fig. 1). A successful CM never generates an infeasible conformation. We carry out this process by randomly choosing a vertex from the chain of length n, finding a free lattice point adjacent to either the predecessor or the successor of the vertex in the chain, and then moving the vertex to this free lattice point. This perturbation continues until a valid conformation is formed.
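The directional SAW described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `biased_saw`, the restart-on-trap strategy, and the treatment of the four gyrations as relative weights (rather than normalized probabilities) are choices made here, not details confirmed by the text.

```python
import random

# Unit steps on the 2-D square lattice.
STEPS = {"up": (0, 1), "down": (0, -1), "right": (1, 0), "left": (-1, 0)}


def biased_saw(n, weights, rng, max_restarts=10000):
    """Grow an n-site self-avoiding walk starting at (0, 0).

    weights maps each direction name to a non-negative relative bias.
    At every step one of the free neighbouring sites is drawn in
    proportion to its weight. If the walk traps itself, it is discarded
    and regrown (an assumed strategy, chosen for simplicity).
    """
    for attempt in range(max_restarts):
        walk = [(0, 0)]
        occupied = {(0, 0)}
        while len(walk) < n:
            x, y = walk[-1]
            free = [(name, (x + dx, y + dy))
                    for name, (dx, dy) in STEPS.items()
                    if (x + dx, y + dy) not in occupied]
            bias = [weights[name] for name, site in free]
            if not free or sum(bias) <= 0:
                break  # dead end: restart from the origin
            name, site = rng.choices(free, weights=bias, k=1)[0]
            walk.append(site)
            occupied.add(site)
        if len(walk) == n:
            return walk
    raise RuntimeError("no self-avoiding walk found within max_restarts")


# Illustrative weights only; equal values give an unbiased walk.
example_weights = {"up": 0.25, "down": 0.25, "right": 0.25, "left": 0.25}
walk = biased_saw(20, example_weights, random.Random(0))
print(len(walk), len(set(walk)))
```

Passing the random generator explicitly keeps runs reproducible, which is convenient when comparing several independent realizations.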

Fig. 1: Move-biased method, comprising a series of coupled move (diagonal-pull) operators for H-core formation from energy (-1) to energy (-4): (a) E = -1, (b) E = -2, (c) E = -4. The polar residues, hydrophobic residues and empty sites are denoted as blue, red and white, respectively.

In this paper, we analyze the influence of varying the gyrations of the four possible directions that a SAW may take on the sequence length (i.e., the length of the simulated protein). We simulate a protein conformation on a square lattice to model the folding process and use the coupled moves to predict the native structure. We let

$$\vartheta = \frac{\xi_u}{\xi_d}, \quad \text{where } \xi_u = 0.005, \ldots, 1/2 \text{ and } \xi_d = 1/4, \text{ so that } 0.02 \le \vartheta \le 2.0 \qquad (4)$$

$$\rho = \frac{\xi_r}{\xi_l}, \quad \text{where } \xi_r = 1/2, \ldots, 0.005 \text{ and } \xi_l = 1/4, \text{ so that } 2.0 \ge \rho \ge 0.02 \qquad (5)$$

$$\varpi = \frac{\vartheta}{\rho} \qquad (6)$$

where $\xi_u$, $\xi_d$, $\xi_r$ and $\xi_l$ are the gyrations of up, down, right and left steps, respectively. From Fig. 2a, large $\varpi$ represents very low $\rho$ relative to $\vartheta$ (since the maximum of $\vartheta$ is 2.0); $\varpi$ increases as $\rho$ tends to 0.02, which means that two directions emerge for high $\varpi$: the gyration of up and that of down (see Fig. 1). For low $\rho$, the gyration of right is lower than that of left; that is why the sequence length may be short, because two directions are favored. The fluctuation in the sequence length as a function of $\vartheta$ in Fig. 2b is due to the other factors (directions). If all other gyrations are fixed, the fluctuations give way to a steadier variation. A run with such parameters revealed that some variations still remain, which can be explained as a competition between three directions. Fig. 2c shows a gradual decrease in the sequence length as $\vartheta$ increases, which confirms that the variation in the sequence length stabilizes with reduction in the competition between the directions (for this figure, we used $\xi_d = 0.02$, $\xi_r = 0.02$ and $\xi_l = 0.25$ to simulate a situation in which only the left direction is dominant). Fig. 2d shows the variation of the walker with respect to the sequence length and the directional gyrations.
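The gyration ratios of Eqs. (4)-(6) can be checked numerically with a short Python sketch. The function name `gyration_ratios` is hypothetical, and reading the parameter ranges as values of the up and right gyrations between 0.005 and 1/2 (with the down and left gyrations fixed at 1/4) is an assumption based on the bounds quoted with the equations.

```python
def gyration_ratios(xi_u, xi_d, xi_r, xi_l):
    """Return (theta, rho, omega) for the four directional gyrations."""
    theta = xi_u / xi_d   # Eq. (4): up relative to down
    rho = xi_r / xi_l     # Eq. (5): right relative to left
    omega = theta / rho   # Eq. (6): vertical ratio relative to horizontal ratio
    return theta, rho, omega


# Assumed end points of the ranges: smallest up gyration, largest right gyration.
theta, rho, omega = gyration_ratios(xi_u=0.005, xi_d=0.25, xi_r=0.5, xi_l=0.25)
print(theta, rho, omega)
```

With these end points the ratios reproduce the stated bounds of 0.02 and 2.0; swapping the roles of the two varying gyrations gives the opposite extreme.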

Fig. 2: The directional probabilities of the monomers: (a) sequence length (N) against $\varpi$ at 5 realizations; (b) and (c) sequence length (N) against $\vartheta$; (d) sequence length (N) against $\rho$.

3. The Algorithms Procedures

We put forward an improved MC method called move-biased Monte Carlo simulation (MBMC), based on the self-avoiding walk (SAW) and the local neighborhood search strategy (diagonal-pull moves). The improved method is developed for the PFP in the HP lattice model. The calculating procedure is as follows.

The method generates an initial conformation $\zeta$ following a SAW on the square lattice. It places the first amino acid at (0, 0), followed by a random selection of a basis vector to place the next amino acid at a neighboring free lattice point. The mapping proceeds until a SAW is found for the whole protein sequence.

We compute the energy $E(\zeta)$ of the SAW conformation on the square lattice.

We let i = 1 and execute coupled (diagonal-pull) moves for all legal move positions of the ith amino acid of the current conformation $\zeta$. If the coupled move is executed successfully, we compute the energies of the corresponding legal conformations obtained by coupled moves and pick the conformation with the lowest energy as a newly updated conformation of $\zeta$, expressed as $\zeta^{\otimes}$.

We compute $E(\zeta^{\otimes})$, and if $E(\zeta^{\otimes}) < E(\zeta)$, then we let $\zeta = \zeta^{\otimes}$, $E(\zeta) = E(\zeta^{\otimes})$ and go to the last procedure; otherwise we go to the previous procedure.

A random number $r$ is selected from $0 < r \le 1$. If $r < \exp\{[E(\zeta) - E(\zeta^{\otimes})]/k_B T\}$, then we let $\zeta = \zeta^{\otimes}$, $E(\zeta) = E(\zeta^{\otimes})$ and go to the next step; otherwise we go to the previous step.

From the current conformation $\zeta$, we produce the new conformation $\zeta^{\otimes}$ by the coupled move search strategy. If $\zeta^{\otimes}$ is a legal conformation, then we update the current conformation $\zeta$ with $\zeta^{\otimes}$, i.e., we let $\zeta = \zeta^{\otimes}$ and $E(\zeta) = E(\zeta^{\otimes})$. We stop if the move is ergodic; otherwise we go to the second step.
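The acceptance rule in the procedure above is the usual Metropolis criterion, which can be sketched in Python as follows. The names `metropolis_accept` and `accept_reject_search`, the default of setting the Boltzmann constant to 1, and the generic `propose_fn` callback (standing in for the coupled diagonal-pull move, which is not implemented here) are all assumptions made for illustration.

```python
import math
import random


def metropolis_accept(e_current, e_new, temperature, rng, k_b=1.0):
    """Accept a lower-energy candidate outright; otherwise accept with
    probability exp((E_current - E_new) / (k_B * T))."""
    if e_new < e_current:
        return True
    r = 1.0 - rng.random()  # uniform in (0, 1], matching 0 < r <= 1
    return r < math.exp((e_current - e_new) / (k_b * temperature))


def accept_reject_search(conformation, energy_fn, propose_fn, temperature, steps, rng):
    """Generic accept/reject loop that tracks the lowest-energy state seen.

    propose_fn(conformation, rng) must return a legal neighbouring
    conformation, or None when no legal move exists.
    """
    energy = energy_fn(conformation)
    best, best_energy = conformation, energy
    for _ in range(steps):
        candidate = propose_fn(conformation, rng)
        if candidate is None:
            continue
        e_new = energy_fn(candidate)
        if metropolis_accept(energy, e_new, temperature, rng):
            conformation, energy = candidate, e_new
            if energy < best_energy:
                best, best_energy = conformation, energy
    return best, best_energy
```

Keeping the move generator as a callback separates the acceptance logic from the lattice geometry, so the same loop can be exercised on a toy problem before a real move set is plugged in.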





3.1. Results and Discussion

To compare MBMC with methods for the 2D PFP described in the literature, we tested it on a number of standard benchmark (BM) instances, as shown in Table 1. Experiments on these standard BMS were conducted by performing up to 500 independent runs for each instance in 2-D. Our code was compiled with the Silverfrost FTN95 compiler and run on a PC with an Intel Pentium Dual-Core CPU (2.30 GHz) and 4.00 GB of RAM; run-time was measured in terms of CPU time. The results obtained on the standard 2-D BMS, shown in Table 2, indicate that MBMC is competitive with the conventional Monte Carlo (CMC), genetic algorithm (GA), evolutionary Monte Carlo (EMC) and ant colony optimization (ACO) methods described in the literature. CMC and EMC report the number of valid conformations scanned during the search, but their run time is not accounted for. We therefore compared the run time of our method with that of the others (GA and ACO) that reported it. MBMC works very well on BMS of sizes up to 60 AA and gave high-quality optimal configurations for the longest sequences considered here (64 and 85 AA). On average, MBMC requires less CPU time than the others to find the best known conformations for sequences of 36 ≤ N ≤ 85 residues. In comparison with the other methods (CMC, GA, EMC and ACO) for the 2-D HP PSP considered here, MBMC generally shows very good performance on the standard BMS, with optimal conformations as shown in Fig. 4.

Table 1: Experimental results of the benchmark sequences for the 2D HP model

Ins.  Length  Sequence (H-hydrophobic, P-polar)                   E*    MBMCa
1     20      (HP)2PH2PHP2HPH2P2HPH                               -9    -9
2     24      H2P2HP2HP2HP2HP2HP2HP2H2                            -9    -9
3     25      P2HP2H2P4H2P4H2P4H2                                 -8    -8
4     36      P3H2P2H2P5H7P2H2P4H2P2HP2                           -14   -14
5     48      P2HP2H2P2H2P5H10P6H2P2H2P2HP2H5                     -23   -23
6     60      P2H3PH8P3H9(HP)2P2H12P3H6PH(HP)2                    -36   -35
7     64      H12(PH)2(P2H2)2P2HP2H2PPH2P2HP2(H2P2)2(HP)2H12      -42   -42
8     85      H4P4H12P6(H12P3)3HP2(H2P2)2HPH                      -53   -52

* The putative energy value
a The minimum energy value obtained by MBMC
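The compact notation used for the benchmark sequences in Table 1 (a letter followed by a repeat count, with parenthesized groups repeated as a whole) can be expanded with a small parser. The following Python sketch is illustrative only; the function name `expand_hp` and the regular expression are assumptions, not part of the original work.

```python
import re

# Either a parenthesized group with a repeat count, or a single letter
# with an optional repeat count.
TOKEN = re.compile(r"\(([HP0-9]+)\)(\d+)|([HP])(\d*)")


def expand_hp(compact):
    """Expand compact HP notation, e.g. '(HP)2PH2' -> 'HPHPPHH'."""
    parts = []
    pos = 0
    while pos < len(compact):
        match = TOKEN.match(compact, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        if match.group(1) is not None:
            # Groups may themselves contain counts, so expand recursively.
            parts.append(expand_hp(match.group(1)) * int(match.group(2)))
        else:
            parts.append(match.group(3) * int(match.group(4) or 1))
        pos = match.end()
    return "".join(parts)


print(expand_hp("(HP)2PH2PHP2HPH2P2HPH"))
```

Checking the expanded length against the Length column is a quick way to catch transcription errors in a sequence before running a simulation on it.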

Fig. 3: The energy landscape (folding funnel) for the 85-mer, indicating the aggregation-prone and the native conformations.

The energy landscape of instance 8 with a randomly chosen order of amino acids, shown in Fig. 3, is very rugged and has been smoothed to resemble a funnel, with many high-energy and few low-energy conformations. This funnel topology makes predicting the mechanism of folding easy once the structure is known. The intermediate conformations, which constitute the high-energy (non-compact) structures, are essential stepping-stones that guide a protein through the folding process to the native state. These intermediates are also the critical species in mis-folding processes (i.e., aberrations from the native state) that lead to aggregation and diseases, because they expose sticky interfaces that are normally buried in the native state. The most common one is the 'molten globule', i.e., a state possessing native-like secondary structure elements but lacking the tightly packed tertiary structure of the native state. Fig. 4 shows typical conformations of the lowest-energy states obtained by MBMC. It is clear from the figure that each of these conformations possesses a compact hydrophobic core.


Fig. 4: The 2-D ground state conformations found by the MBMC algorithm on the 2D square HP lattice model for the benchmark sequences, with minimal energies of -9, -9, -8, -14, -23, -35, -42 and -52, respectively. The symbols “ ”, “ ” and “ ” indicate hydrophobic monomers, polar monomers and contacts between hydrophobic monomers, respectively, while + and – denote the first and the last monomers.

Table 2: Performance comparison of MBMC with four other heuristic methods (CMC, GA, EMC and ACO) in terms of the lowest-energy conformation found. The number in each cell is the minimum energy obtained by the corresponding method for the respective HP sequence, while the numbers in parentheses are the reported CPU times and the numbers of valid conformations scanned before the lowest-energy values were found. Missing entries indicate cases where the respective method has not been tested on a given instance.
Length

E*

MBMC

20

-9

-9 (8.90 sec.) (474)

-9 (292443)

-9 (5.60s) (30,492)

-99 (