National University of Kaohsiung, Kaohsiung, Taiwan

2009 17th Conference on Fuzzy Theory and Its Applications

An Efficient Artificial Bee Colony Algorithm for 3D Protein Folding Simulation

Cheng-Jian Lin (1), Chi-Yung Lee (2)

(1) Department of Computer Science and Information Engineering, National Chin-Yi University of Technology. E-mail: [email protected]
(2) Department of Computer Science and Information Engineering, Nankai University of Technology. E-mail: [email protected]

Abstract

In computational biology, predicting a protein's conformation from its amino acid sequence is one of the most prominent problems, and determining protein structure is central to the study of proteins. The hydrophobic-hydrophilic (HP) model is a highly simplified model that can nevertheless represent behavioral properties of real-world proteins. The protein folding problem in the HP lattice model is the problem of finding the conformation with the lowest free energy. To improve the performance of protein structure prediction, we propose a modified artificial bee colony (MABC) algorithm that determines the protein structure from the sequence. We demonstrate that our algorithm can be applied successfully to the protein folding problem based on the hydrophobic-hydrophilic lattice model. Simulation results indicate that our approach performs better than existing evolutionary algorithms.

Keywords: 3D protein structure prediction, HP lattice model, artificial bee colony algorithm, swarm intelligence, evolution strategies.

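Abstract (translated from the Chinese)

In computational biology, predicting protein structure from the amino acid sequence is an important problem. It is commonly studied with the HP model, which classifies each amino acid by its hydrophobicity as either H (hydrophobic or non-polar) or P (hydrophilic or polar). The protein folding problem seeks the conformation with the lowest free energy in the HP lattice model. To achieve better performance in protein structure prediction, we propose a modified artificial bee colony (MABC) algorithm that can accurately predict the protein structure from the amino acid sequence. Simulation results show that our method outperforms existing evolutionary algorithms.

Keywords: 3D protein structure prediction, HP model, artificial bee colony algorithm, swarm intelligence.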

1. INTRODUCTION

The prediction of protein structure from the amino acid sequence is one of the most prominent problems in computational biology. A protein's function depends mainly on its tertiary structure, which in turn depends on its primary structure. Mistakes in the folding process create proteins with abnormal shapes, which cause diseases such as cystic fibrosis, Alzheimer's disease, and mad cow disease [1]. If we could predict the tertiary structures of proteins from their sequences, we would be able to treat these diseases better. Knowledge of protein tertiary structures also has other applications, such as structure-based drug design [2].

Currently, protein structures are primarily determined by techniques such as NMR (nuclear magnetic resonance) spectroscopy and X-ray crystallography, which are expensive in terms of equipment, computation, and time. Additionally, these techniques require isolation, purification, and crystallization of the target protein. Computational approaches to protein structure prediction are therefore very attractive.

Some researchers have used Monte Carlo methods [3] to solve the protein folding problem. Monte Carlo methods are a class of computational algorithms that rely on repeated random sampling to compute their results. Many researchers [4]-[7] have used evolutionary algorithms, such as genetic algorithms (GA), to solve the protein folding problem. Genetic algorithms are stochastic search techniques based on the mechanism of natural selection, which require information to search effectively in a large or poorly understood search space.

Other researchers have applied swarm intelligence and related methods to protein structure prediction, such as ant colony optimization (ACO) [8], immune algorithms (IA) [9][10], and particle swarm optimization. Ant colony optimization is a population-based approach to solving combinatorial optimization problems that is inspired by the foraging behavior of ant colonies. The immune algorithm is based on the clonal selection principle, with an aging operator and memory B cells. At each generation, the best individuals of the population are selected based on their affinity measures (how good they are as solutions to the problem). The selected individuals are cloned, giving rise to a temporary population of clones. The clones are submitted to a hypermutation operator, whose rate is proportional (or inversely proportional) to the affinity between the antibody and the antigen (the problem to be solved).

The remainder of this paper is structured as follows. Section 2 gives the preliminaries and the formal definition of the protein folding problem in the HP lattice model. Section 3 reviews the artificial bee colony algorithm and describes in detail our approach, the proposed modified artificial bee colony (MABC) algorithm. The 3D simulation results obtained by our method and by other methods are compared in Section 4. Finally, the conclusion is given in Section 5.

2. THE HP PROTEIN MODEL

This section describes the HP protein folding problem and how the free energy is calculated.

2.1 The HP Protein Model

The HP model is based on the observation that the hydrophobic interaction between amino acid residues is the driving force for protein folding and for the development of the native state in proteins [13][14]. Dill [15] proposed the hydrophobic-hydrophilic model. Lattice proteins are made to resemble real proteins by introducing an energy function [16], a set of conditions that specify the energy of interaction between neighboring beads, usually taken to be those occupying adjacent lattice sites. The energy function mimics the interactions between amino acids in real proteins, which include hydrophobic and hydrogen bonding effects. The beads are divided into types, and the energy function specifies the interactions depending on the bead type, just as different types of amino acid interact differently.

Figure 1 shows an instance of the HP lattice model [18]. The black squares denote the hydrophobic amino acids and the white squares denote the hydrophilic ones. The dotted lines denote the H-H contacts in the conformation, each of which is assigned an energy value of -1. The free energy is therefore at its minimum when the number of H-H contacts is at its maximum. In the three-dimensional case, Figure 1 shows a protein structure with 11 H-H contacts (energy = -11). Since the native state of a protein generally corresponds to its lowest free energy state, the optimal conformation in the HP model is the one with the maximum number of H-H contacts, which gives the lowest energy value.


Figure 1: An optimal conformation for the sequence "(HP)2PH(HP)2(PH)2HP(PH)2" in the 3D HP lattice model.

2.2 Calculating the Free Energy

For any sequence in any particular structure, the free energy can be rapidly calculated from the free energy function. For the simple HP model, this is simply an enumeration of all the contacts between H residues that are adjacent in the structure but not in the chain. Most researchers consider a lattice protein sequence protein-like only if it possesses a single structure with an energetic state lower than in any other structure. This is the energetic ground state, or the native state. The relative positions of the beads in the native state constitute the lattice protein's tertiary structure. Lattice proteins do not have genuine secondary structure, although some researchers have claimed that they can be extrapolated to real protein structures, which do include secondary structure, by appealing to the same law by which the phase diagrams of different substances can be scaled onto one another.

By varying the free energy function and the bead sequence of the chain (the primary structure), effects on the native state structure and the kinetics (rate) of folding can be explored, which may provide insights into the folding of real proteins. In particular, lattice models have been used to investigate the free energy landscapes of proteins, i.e. the variation of their internal free energy as a function of conformation. We present the minimum free energy function of the HP lattice model, with its calculation conditions, as follows, where n is the length of the protein sequence:

(1)

where f is a mapping function onto {0, 1}: f(i) = 0 represents a hydrophilic residue and f(i) = 1 represents a hydrophobic residue. d = {d1, d2, ..., dn} is a vector set, where dj denotes a projection onto the Cartesian coordinates. If each residue is connected to its sequence neighbor on an adjacent lattice site, then S(i,j) = 1; otherwise S(i,j) ≠ 1. A conformation is called valid if each lattice site is occupied by at most one amino acid residue.
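The contact-counting rule described above can be illustrated with a short Python sketch. It is not the paper's equation (1); it assumes the standard HP energy (minus the number of H-H pairs that are lattice neighbors but not chain neighbors), a simple cubic lattice, and an absolute-direction encoding whose axis assignment (`MOVES`) is chosen here for illustration. The function names `decode`, `is_valid`, and `hp_energy` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Coord = Tuple[int, int, int]

# Unit steps for the six absolute fold directions.
# The axis assignment is an assumption made for this sketch.
MOVES: Dict[str, Coord] = {
    "R": (1, 0, 0), "L": (-1, 0, 0),
    "F": (0, 1, 0), "B": (0, -1, 0),
    "U": (0, 0, 1), "D": (0, 0, -1),
}


def decode(directions: str) -> List[Coord]:
    """Place residue 0 at the origin and walk the lattice along the directions."""
    coords: List[Coord] = [(0, 0, 0)]
    for d in directions:
        dx, dy, dz = MOVES[d]
        x, y, z = coords[-1]
        coords.append((x + dx, y + dy, z + dz))
    return coords


def is_valid(coords: List[Coord]) -> bool:
    """A conformation is valid if no lattice site is occupied twice."""
    return len(set(coords)) == len(coords)


def hp_energy(sequence: str, directions: str) -> Optional[int]:
    """Return -(number of H-H contacts), or None for an invalid conformation."""
    if len(directions) != len(sequence) - 1:
        raise ValueError("need exactly one direction per bond")
    coords = decode(directions)
    if not is_valid(coords):
        return None
    index_at = {c: i for i, c in enumerate(coords)}
    contacts = 0
    for i, (x, y, z) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Check only the three positive neighbors so each pair is counted once.
        for dx, dy, dz in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
            j = index_at.get((x + dx, y + dy, z + dz))
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return -contacts
```

For example, `hp_energy("HPPH", "RFL")` folds four residues around a unit square, so the two H residues at the chain ends become lattice neighbors and the energy is -1.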


3. THE PROPOSED ARTIFICIAL BEE COLONY ALGORITHM

In recent years, swarm intelligence has become a research interest for many scientists in related fields. Bonabeau has defined swarm intelligence as "any attempt to design algorithms or distributed problem-solving devices inspired by the collective behavior of social insect colonies and other animal societies" [17]. Bonabeau et al. focused their viewpoint on social insects alone, such as termites, bees, wasps, and different ant species.

3.1 Artificial Bee Colony Algorithm

In this work, a particular intelligent behavior of a honey bee swarm, foraging behavior, is considered, and a new artificial bee colony (ABC) algorithm simulating this behavior of real honey bees is described for solving multidimensional and multimodal optimization problems. In the model, the colony of artificial bees consists of three groups of bees: employed bees, onlookers, and scouts. The first half of the colony consists of the employed artificial bees and the second half consists of the onlookers. For every food source, there is only one employed bee. In other words, the number of employed bees is equal to the number of food sources around the hive. An employed bee whose food source has been exhausted becomes a scout. The search carried out by the artificial bees can be summarized as follows:

• Employed bees determine a food source within the neighborhood of the food source in their memory.
• Employed bees share their information with onlookers within the hive, and the onlookers then select one of the food sources.
• Onlookers select a food source within the neighborhood of the food sources they have chosen.
• An employed bee whose source has been abandoned becomes a scout and starts to search for a new food source randomly.

The main steps of the algorithm are given below:

Send the scouts onto the initial food sources
REPEAT
    Send the employed bees onto the food sources and determine their nectar amounts
    Calculate the probability value of the sources with which they are preferred by the onlooker bees
    Stop the exploitation process of the sources abandoned by the bees
    Send the scouts into the search area for discovering new food sources, randomly
    Memorize the best food source found so far
UNTIL (requirements are met)

In the case of real honey bees, the recruitment rate represents a "measure" of how quickly the bee swarm locates and exploits a newly discovered food source. The artificial recruiting process could similarly represent the "measurement" of the speed with which the feasible or optimal solutions of difficult optimization problems can be discovered. The survival and progress of the real bee swarm depend upon the rapid discovery and efficient utilization of the best food resources. Similarly, the optimal solution of difficult engineering problems is connected to the relatively fast discovery of "good solutions", especially for problems that need to be solved in real time.

Each cycle of the search consists of three steps: moving the employed and onlooker bees onto the food sources; calculating their nectar amounts; and determining the scout bees and directing them onto possible food sources. A food source position represents a possible solution to the problem to be optimized. The amount of nectar of a food source corresponds to the quality of the solution represented by that food source. Onlookers are placed on the food sources by a probability-based selection process. As the nectar amount of a food source increases, the probability with which that food source is preferred by onlookers increases, too.

Every bee colony has scouts, which are the colony's explorers. The explorers do not have any guidance while looking for food. They are primarily concerned with finding any kind of food source. As a result of such behavior, the scouts are characterized by low search costs and a low average food source quality. Occasionally, the scouts can accidentally discover rich, entirely unknown food sources. In the case of artificial bees, the artificial scouts could have the fast discovery of the group of feasible solutions as a task. In this work, one of the employed bees is selected and classified as the scout bee. The selection is controlled by a control parameter called "limit". If a solution representing a food source is not improved within a predetermined number of trials, then that food source is abandoned by its employed bee and the employed bee is converted to a scout. The number of trials for releasing a food source is equal to the value of "limit", which is an important control parameter of ABC.

In a robust search process, exploration and exploitation must be carried out together. In the ABC algorithm, the onlookers and employed bees carry out the exploitation process in the search space, while the scouts control the exploration process.

3.2 The Modified Artificial Bee Colony Algorithm

First, we describe the steps of the artificial bee colony algorithm in detail:

1. Initialize the population of solutions x_{i,j}
2. Evaluate the population
3. cycle = 1
4. Repeat
5. Produce new solutions (food source positions) υ_{i,j} in the neighborhood of x_{i,j} for the employed bees using formula (2) and evaluate them

υ_{i,j} = x_{i,j} + Φ_{i,j} (x_{i,j} - x_{k,j})    (2)

where k is a solution in the neighborhood of i and Φ is a random number in the range [-1, 1].

6. Apply the greedy selection process between x_i and υ_i
7. Calculate the probability values P_i for the solutions x_i by means of their fitness values using equation (3)

(3)

8. Produce the new solutions (new positions) υ_i for the onlookers from the solutions x_i, selected depending on P_i, and evaluate them
9. Apply the greedy selection process for the onlookers between x_i and υ_i
10. Determine the abandoned solution (source), if it exists, and replace it with a new randomly produced solution x_i for the scout using equation (4)

(4)

11. Memorize the best food source position (solution) achieved so far
12. cycle = cycle + 1
13. Until cycle = Maximum Cycle Number (MCN)

To speed up the exploration capability of the artificial bee colony algorithm, we modify equation (2) as shown in equation (5):

(5)

where θ is a random number in the range [-1, 1] and x_{best,j} is the best solution found so far.
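The numbered steps above can be sketched in Python for a continuous minimization problem. This is an illustration under stated assumptions, not the authors' implementation: onlooker selection is taken to be proportional to a fitness of 1 / (1 + cost) and scouts are re-initialized uniformly within the bounds (the usual ABC choices), and the modified update is assumed to add a term θ (x_best,j - x_i,j) to equation (2). The function `mabc` and all of its parameter names are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple


def mabc(cost: Callable[[Sequence[float]], float], dim: int,
         bounds: Tuple[float, float], n_sources: int = 20, limit: int = 30,
         max_cycles: int = 200, use_best_term: bool = True,
         seed: int = 0) -> Tuple[List[float], float]:
    """Minimize a non-negative cost with an ABC loop (sketch, see assumptions above)."""
    rng = random.Random(seed)
    lo, hi = bounds

    def random_source() -> List[float]:
        # Assumed scout rule: uniform re-initialization inside the bounds.
        return [lo + rng.random() * (hi - lo) for _ in range(dim)]

    sources = [random_source() for _ in range(n_sources)]   # steps 1-2
    costs = [cost(s) for s in sources]
    trials = [0] * n_sources
    start = min(range(n_sources), key=costs.__getitem__)
    best, best_cost = sources[start][:], costs[start]

    def try_neighbour(i: int) -> None:
        nonlocal best, best_cost
        k = rng.choice([m for m in range(n_sources) if m != i])
        j = rng.randrange(dim)
        step = rng.uniform(-1, 1) * (sources[i][j] - sources[k][j])  # eq. (2)
        if use_best_term:
            # Assumed form of the modification: pull toward the best solution.
            step += rng.uniform(-1, 1) * (best[j] - sources[i][j])
        cand = sources[i][:]
        cand[j] = min(hi, max(lo, sources[i][j] + step))
        c = cost(cand)
        if c < costs[i]:                      # greedy selection (steps 6, 9)
            sources[i], costs[i], trials[i] = cand, c, 0
            if c < best_cost:                 # step 11
                best, best_cost = cand[:], c
        else:
            trials[i] += 1

    for _ in range(max_cycles):               # steps 4, 12, 13
        for i in range(n_sources):            # employed bees (step 5)
            try_neighbour(i)
        # Onlookers pick sources with probability proportional to fitness.
        weights = [1.0 / (1.0 + c) for c in costs]
        for i in rng.choices(range(n_sources), weights=weights, k=n_sources):
            try_neighbour(i)
        stale = max(range(n_sources), key=trials.__getitem__)
        if trials[stale] > limit:             # scout phase (step 10)
            sources[stale] = random_source()
            costs[stale] = cost(sources[stale])
            trials[stale] = 0
            if costs[stale] < best_cost:
                best, best_cost = sources[stale][:], costs[stale]
    return best, best_cost
```

Calling `mabc(lambda x: sum(v * v for v in x), dim=3, bounds=(-5.0, 5.0))` minimizes a simple sphere function; setting `use_best_term=False` falls back to the plain update of equation (2), which makes it easy to compare the two variants on the same seed.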

To increase the convergence speed, we add a term in x_{best,j} to equation (2); this term draws the solutions toward the global minimum and speeds up convergence. The flowchart of the artificial bee colony algorithm is shown as follows.

4. PROTEIN FOLDING SIMULATION USING THE MABC

In the 3D case, the symbols R, L, F, B, U, and D denote the fold directions right, left, forward, backward, up, and down, respectively, in the encoding scheme. An initial population is generated randomly in an (n-1)-dimensional space within a fixed range. As shown in Figure 3, the scheme adopted for representing internal movements uses absolute directions.

Figure 3: The schemes were represented by internal coordinates (the black cube represents the current location).

A local search can perform an intensive search for a new and better solution. It is similar to the mutation operation but differs from it in its rules: a local search follows systematic rules and effectively finds a local solution. The 3D local search contains two parts:

1. Opposite motion
Figure 4 shows the motion of a local structure in the 3D HP model. We can change the local structure between two randomly determined sequence positions. In the example, the residue directions right, down, backward, up, backward, right become left, up, forward, down, forward, left. The inversion method advances in the opposite direction, which is the direction of repulsion.

Figure 4: All the residues move in the opposite direction.

2. Rotation motion
In the 3D case, the structure can rotate clockwise (CW) or counterclockwise (CCW) relative to the local structure, as shown in Figure 5. Figures 5(a) and 5(b) represent clockwise and counterclockwise rotation, respectively. In each illustration, the transformations from left to right are those with the x-axis, the y-axis, and the z-axis fixed, respectively. From these we can build Table 1, which lists the relationship between the original directions and the transformed directions, i.e. the residue fold directions under the local search of the 3D HP model.

We choose the best direction among the three methods in a local search. The new folding direction is adopted only if it is superior to the original direction; if it is not better, the original direction is left unchanged.


Table 2: The 3D HP benchmarks.

Seq.  Length  Protein sequence                                 Energy (3D)
1     20      (HP)2PH(HP)2(PH)2HP(PH)2                         -11
2     24      H2P2(HP2)6H2                                     -13
3     25      P2HP2(H2P4)3H2                                   -9
4     36      P(P2H2)2P5H5(H2P2)2P2H(HP2)2                     -18
5     48      P2H(P2H2)2P5H10P6(H2P2)2HP2H5                    -29
6     50      H2(PH)3PH4PH(P3H)2P4(HP3)2HPH4(PH)3PH2           -26
7     60      P(PH3)2H5P3H10PHP3H12P4H6PH2PHP                  -49
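The benchmark sequences in Table 2 are written in a compact run-length notation. As a small aid for reading them, the following Python sketch expands that notation into a plain string of H and P characters; the function name `expand_hp` is hypothetical, and the sketch assumes that every parenthesized group is followed by a repeat count, as is the case in Table 2.

```python
import re


def expand_hp(compact: str) -> str:
    """Expand run-length HP notation, e.g. 'H2P2(HP2)6H2', into an H/P string."""
    text = compact.replace(" ", "")
    group = re.compile(r"\(([HP0-9]+)\)(\d+)")
    # First repeat each parenthesized group: (HP2)3 -> HP2HP2HP2
    while True:
        text, replaced = group.subn(lambda m: m.group(1) * int(m.group(2)), text)
        if replaced == 0:
            break
    # Then repeat single residues that carry a count: H10 -> HHHHHHHHHH
    return re.sub(r"([HP])(\d+)", lambda m: m.group(1) * int(m.group(2)), text)


# Sequences as read from Table 2, keyed by sequence number.
BENCHMARKS = {
    1: "(HP)2PH(HP)2(PH)2HP(PH)2",
    2: "H2P2(HP2)6H2",
    3: "P2HP2(H2P4)3H2",
    4: "P(P2H2)2P5H5(H2P2)2P2H(HP2)2",
    5: "P2H(P2H2)2P5H10P6(H2P2)2HP2H5",
    6: "H2(PH)3PH4PH(P3H)2P4(HP3)2HPH4(PH)3PH2",
    7: "P(PH3)2H5P3H10PHP3H12P4H6PH2PHP",
}

if __name__ == "__main__":
    for number, compact in BENCHMARKS.items():
        print(number, len(expand_hp(compact)), expand_hp(compact))
```

Expanding each entry and checking its length against the Length column is a quick way to catch transcription errors in the sequences.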

Figure 5: (a) The clockwise rotation motion; (b) the counterclockwise rotation motion.

Table 1: In the 3D case, the residue fold directions under local search.

Direction      Opposite   z-axis fixed    y-axis fixed    x-axis fixed
                          CW     CCW      CW     CCW      CW     CCW
Right (R)      L          B      F        D      U        R      R
Left (L)       R          F      B        U      D        L      L
Forward (F)    B          R      L        F      F        U      D
Backward (B)   F          L      R        B      B        D      U
Up (U)         D          U      U        R      L        B      F
Down (D)       U          D      D        L      R        F      B
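The mappings in Table 1 can be applied to a segment of a direction string as sketched below in Python. The dictionaries transcribe Table 1; the helpers `apply_move` and `local_search_step`, the half-open segment convention, and the idea of passing the energy as a callable that returns None for invalid conformations are assumptions made for this illustration. The sketch tries the opposite motion and all six rotations and keeps a candidate only if it is strictly better, following the acceptance rule described in the text.

```python
from typing import Callable, Dict, Optional, Tuple

OPPOSITE: Dict[str, str] = {"R": "L", "L": "R", "F": "B", "B": "F", "U": "D", "D": "U"}

# Keys are (fixed axis, rotation sense); values map old direction to new direction.
ROTATE: Dict[Tuple[str, str], Dict[str, str]] = {
    ("z", "CW"):  {"R": "B", "L": "F", "F": "R", "B": "L", "U": "U", "D": "D"},
    ("z", "CCW"): {"R": "F", "L": "B", "F": "L", "B": "R", "U": "U", "D": "D"},
    ("y", "CW"):  {"R": "D", "L": "U", "F": "F", "B": "B", "U": "R", "D": "L"},
    ("y", "CCW"): {"R": "U", "L": "D", "F": "F", "B": "B", "U": "L", "D": "R"},
    ("x", "CW"):  {"R": "R", "L": "L", "F": "U", "B": "D", "U": "B", "D": "F"},
    ("x", "CCW"): {"R": "R", "L": "L", "F": "D", "B": "U", "U": "F", "D": "B"},
}


def apply_move(directions: str, start: int, end: int, mapping: Dict[str, str]) -> str:
    """Transform the directions in the half-open segment [start, end)."""
    segment = "".join(mapping[d] for d in directions[start:end])
    return directions[:start] + segment + directions[end:]


def local_search_step(directions: str, start: int, end: int,
                      energy: Callable[[str], Optional[int]]) -> str:
    """Return the best candidate, or the original string if nothing is better."""
    best, best_e = directions, energy(directions)
    for mapping in [OPPOSITE, *ROTATE.values()]:
        cand = apply_move(directions, start, end, mapping)
        e = energy(cand)
        # Skip invalid candidates; accept only strict improvements.
        if e is not None and (best_e is None or e < best_e):
            best, best_e = cand, e
    return best
```

Applying `OPPOSITE` to the whole string "RDBUBR" gives "LUFDFL", which reproduces the opposite-motion example in the text (right, down, backward, up, backward, right becoming left, up, forward, down, forward, left).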

For the 3D protein structure problem, our algorithm is compared with the standard genetic algorithm, Backtracking-EA [18], Aging-AIS [9], and ClonalgI [10]. The 7 HP instances listed in Table 2 are standard benchmarks used to test the searching ability of the algorithms. Sequences 1 through 7 were introduced in [19] and have been used as the benchmark for the HP model. Fifty independent runs of the algorithms were performed. The population size was 100 for sequences 1 through 4 and sequence 6, and 300 for sequences 5 and 7. For all sequences, our algorithm was run for 20,000 iterations. The structures obtained by our algorithm for the 7 protein sequences are shown in Figure 6.

Figure 6: Structures obtained for the 7 protein sequences, panels (a) to (g).

The results are listed in Table 3, which compares the performance of the various existing algorithms. In Backtracking-EA [18], the experiments were done with an elitist generational EA (population size = 100, crossover rate = 0.9, mutation rate = 0.01) using linear ranking selection (η = 2.0). A maximum of 10^5 evaluations was enforced. Aging-AIS used the standard parameter values k = 10, dup = 2, and c = 0.4, as described in [9]. B cells had the aging parameter τB = 5, memory B cells had τBm = 10, and the maximum number of evaluations was 10^5. ClonalgI used a population of 10 individuals. The duplication rate was 4, the mutation rate was 0.6, and the termination criterion was 10^5 evaluations.

Table 3: The simulation results obtained from the proposed algorithm compared with the methods given in the literature.

Seq.  Length  E*    MABC  GA    Backtracking-EA  Aging-AIS  ClonalgI
1     20      -11   -11   -11   -11              -11        -11
2     24      -13   -13   -13   -13              -13        -13
3     25      -9    -9    -9    -9               -9         -9
4     36      -18   -18   -18   -18              -18        -18
5     48      -29   -29   -25   -25              -29        -29
6     50      -26   -26   -23   -23              -23        -26
7     60      -49   -49   -37   -39              -41        -48
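To make the comparison in Table 3 easier to read, the following Python sketch tallies, for each algorithm, how many of the seven benchmarks reach the reference energy E* and how large the total energy gap is. The numbers are transcribed from Table 3; the helper names `optima_reached` and `total_gap` are hypothetical, and the summary statistics are an illustration rather than part of the reported experiments.

```python
from typing import Dict, List

# Reference energies E* and best energies per algorithm, transcribed from Table 3.
E_STAR: List[int] = [-11, -13, -9, -18, -29, -26, -49]
RESULTS: Dict[str, List[int]] = {
    "MABC":            [-11, -13, -9, -18, -29, -26, -49],
    "GA":              [-11, -13, -9, -18, -25, -23, -37],
    "Backtracking-EA": [-11, -13, -9, -18, -25, -23, -39],
    "Aging-AIS":       [-11, -13, -9, -18, -29, -23, -41],
    "ClonalgI":        [-11, -13, -9, -18, -29, -26, -48],
}


def optima_reached(results: Dict[str, List[int]], e_star: List[int]) -> Dict[str, int]:
    """Count the benchmarks on which each algorithm matches E*."""
    return {name: sum(e == t for e, t in zip(energies, e_star))
            for name, energies in results.items()}


def total_gap(results: Dict[str, List[int]], e_star: List[int]) -> Dict[str, int]:
    """Sum of (energy found - E*) over all benchmarks; 0 means every optimum was hit."""
    return {name: sum(e - t for e, t in zip(energies, e_star))
            for name, energies in results.items()}


if __name__ == "__main__":
    hits = optima_reached(RESULTS, E_STAR)
    gaps = total_gap(RESULTS, E_STAR)
    for name in RESULTS:
        print(f"{name:16s} optima reached: {hits[name]}/7, total gap: {gaps[name]}")
```

Read this way, the table shows that all methods agree on the four shortest sequences and that the differences appear only on sequences 5 through 7.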

5. CONCLUSION

We present a modified artificial bee colony algorithm for the simple HP lattice model. Through numerical simulation, we show that the modified artificial bee colony algorithm is more effective than the traditional artificial bee colony algorithm, so it can be used efficiently for multivariable, multimodal function optimization. We demonstrated that our algorithm can be applied successfully to the protein folding problem based on the 3D hydrophobic-hydrophilic lattice model. The simulation results indicate that our approach performs at least as well as the existing evolutionary algorithms on all seven benchmark sequences and finds a lower energy than any of them on the longest sequence.

REFERENCES

[1] F. E. Cohen and J. W. Kelly, "Therapeutic Approaches to Protein-misfolding Diseases," Nature, December 2003, vol. 426, pp. 905-909.
[2] T. N. Bui and G. Sundarraj, "An efficient genetic algorithm for predicting protein tertiary structures in the 2D HP model," Proceedings of the 2005 Conference on Genetic and Evolutionary Computation (GECCO'05), 2005, pp. 385-392.
[3] F. M. Liang and W. H. Wong, "Evolutionary Monte Carlo for protein folding simulations," Journal of Chemical Physics, 2001, vol. 115, no. 7, pp. 3374-3380.
[4] N. Krasnogor, W. E. Hart, J. Smith, and D. A. Pelta, "Protein structure prediction with evolutionary algorithms," Proceedings of the Genetic and Evolutionary Computation Conference, Orlando, FL, USA, Morgan Kaufmann, July 1999, pp. 1596-1601.
[5] A. L. Patton, W. F. Punch III, and E. D. Goodman, "A standard GA approach to native protein conformation prediction," Proceedings of the Sixth International Conference on Genetic Algorithms, Morgan Kaufmann, 1995, pp. 574-581.
[6] J. Pedersen and J. Moult, "Protein folding simulations with genetic algorithms and a detailed molecular description," Journal of Molecular Biology, June 1997, vol. 269, no. 2, pp. 240-259.
[7] N. Krasnogor, D. Pelta, P. M. Lopez, P. Mocciola, and E. de la Canal, "Genetic algorithms for the protein folding problem: a critical view," in C. F. E. Alpaydin, ed., Proc. Engineering of Intelligent Systems, ICSC Academic Press, 1998.
[8] A. Shmygelska, R. Aguirre-Hernandez, and H. H. Hoos, "An ant colony optimization algorithm for the 2D HP protein folding problem," Proc. Int. Workshop Ant Algorithms, Brussels, Belgium, Sep. 2002, pp. 40-52.
[9] V. Cutello, G. Morelli, G. Nicosia, and M. Pavone, "Immune Algorithms with Aging Operators for the String Folding Problem and the Protein Folding Problem," Evolutionary Computation in Combinatorial Optimization (EvoCOP), May 2005, pp. 80-90.
[10] C. P. de Almeida, R. A. Gonçalves, and M. R. Delgado, "A Hybrid Immune-Based System for the Protein Folding Problem," Evolutionary Computation in Combinatorial Optimization (EvoCOP), 2007, pp. 13-24.
[11] A. Berger and K. Linderstrom-Lang, "Deuterium exchange of poly-DL-alanine in aqueous solution," Arch. Biochem. Biophys., 1957, pp. 106-118.
[12] N. G. Hunt, L. M. Gregoret, and F. E. Cohen, "The origins of protein secondary effects on packing density and hydrogen bonding studied by a fast conformational search," J. Mol. Biol., 1994, vol. 241, no. 2, pp. 214-225.
[13] K. F. Lau and K. A. Dill, "A Lattice Statistical Mechanics Model of the Conformational and Sequence Spaces of Proteins," Macromolecules, Oct. 1989, vol. 22, no. 10, pp. 3986-3997.
[14] F. M. Richards, "Areas, Volumes, Packing, and Protein Structure," Annual Review of Biophysics and Bioengineering, 1977, pp. 151-176.
[15] K. A. Dill, "Theory for the folding and stability of globular proteins," Biochemistry, 1985, vol. 24, no. 6, pp. 1501-1509.
[16] O. Takahashi, H. Kita, and S. Kobayashi, "Protein folding by a hierarchical genetic algorithm," The 4th Int. Symp. on Artificial Life and Robotics (AROB), 1999.
[17] E. Bonabeau, M. Dorigo, and G. Theraulaz, "Swarm Intelligence: From Natural to Artificial Systems," New York, NY: Oxford University Press, 1999.
[18] C. Cotta, "Protein structure prediction using evolutionary algorithms hybridized with backtracking," Proc. of the 7th International Work-Conference on Artificial and Natural Neural Networks, Lecture Notes in Computer Science, 2003, vol. 2687, pp. 321-328.
[19] T. Jiang, Q. Cui, G. Shi, and S. Ma, "Protein folding simulations for the hydrophobic-hydrophilic model by combining tabu search with genetic algorithms," Journal of Chemical Physics, August 2003, vol. 119, no. 8, pp. 4592-4596.
