problem is proved to be NP-hard, several computational and heuristic methods have addressed the problem seeking feasible answers. In this paper, harmony ...
2011 First International Conference on Informatics and Computational Intelligence
Using Harmony Search for Solving a Typical Bioinformatics Problem
Saman Poursiah navi1, Ehsan Asgarian2, Hossein Moeinzadeh3 ,Vahid Chahkandi4 1
Department of Computer Engineering, Islamic Azad University Quchan Branch, Quchan, Iran. Department of Engineering, Quchan Institute of Engineering and Technology, Quchan, Iran. 3 Department of Computer Engineering, Iran University of Science, Tehran, Iran. 4 Department of Computer Engineering, Islamic Azad University Mashad Branch, Mashad, Iran. 2
named Harmony Search (HS) [7, 10]. In this paper, we propose an algorithm based on HS, for haplotype reconstruction problem in minimum error correction model. To demonstrate the effectiveness and speed of HS, we have applied HS algorithms on standard SNP fragments database and got very good results compared to the GA and PSO.
Abstract— Recently, there has been great interest in Bioinformatics among researches from various disciplines such as computer science, mathematics, statistics and artificial intelligence. Bioinformatics mainly deals with solving biological problems at molecular levels. One of the classic problems of bioinformatics which has gain a lot attention lately is Haplotyping, the goal of which is categorizing SNP-fragments into two clusters and deducing a haplotype for each. Since the problem is proved to be NP-hard, several computational and heuristic methods have addressed the problem seeking feasible answers. In this paper, harmony search (HS) is considered as a clustering approach. Extensive computational experiments indicate that the designed HS algorithm achieves a higher accuracy than the genetic algorithm (GA) and particle swarm optimization (PSO) to the MEC model in most cases.
II. FORMULATIONS AND PROBLEM DEFINITIONS Suppose that there are m SNP fragments from a pair of haplotypes. M=mij is defined as a matrix of fragments, which each entry mij has value ‘A’, ‘B’ or ‘-’ (‘-’ is missing or skipped SNP site which is called gap). We use partition P(C1,C2) (C1 and C2 are two classes) to formulate the problem. P as an exact algorithm or clustering method divides fragments into C1 and C2. Each haplotype is reconstructed from the members of one of the classes with voting function. The function is so defined: (NiA(M) (or NiB(M)) denotes the number of ‘A’s (or 'B's) in jth column of matrix M). A N Aj (C i ) ! N Bj (C i ) V ij ® , i {1, 2},0 d j d n otherwise ¯B Reconstruction rate (shortly RR) is a very simple and popular mean to compare the results of designed algorithms on existing datasets. RR, which is based on Hamming Distance (HD), is the degree of similarity between the original haplotypes (h = (h1, h2)) and reconstructed ones (h' = (h'1 , h'2)). Hamming distance of two fragments HD(fi, fj) and RR(h ,h') are formulated as:
Keywords-Harmony Search; Haplotyp; SNP fragments
I.
INTRODUCTION
It is generally accepted that all human beings share about 99% identity at the DNA level and only some regions of differences in DNA sequences are responsible for genetic diseases [14, 15]. Every nucleotide in a SNP site is called an allele. Almost all SNPs have two different alleles, known here as 'A' and 'B'. The SNP sequence on each copy of a pair of chromosomes in a diploid genome is called a haplotype which is a string over {'A', 'B'}. SNP fragments are composed of Gaps and errors. MEC is a standard model for haplotype reconstruction, which is fed by fragments as an input to infer the best pair of haplotypes with the minimum error to be corrected. For MEC model two different procedures can be employed to resolve the problem [1-3]. First, Partitioning and clustering methods can be designed to divide the SNP fragments into two classes. In this approach, each class corresponds to one haplotype. The second approach is based on inferring haplotypes directly from SNP fragments and correcting simultaneously the errors [11]. A meta-heuristic algorithm, mimicking the improvisation process of music players, has been recently developed and
d ( hij , hkj )
1 ® ¯0
( mij z mkj z ) Otherwise
n
HD ( hi , hk )
¦ d (h
ij
, hkj )
j 1
rij
HD ( hi , hcj )
RR ( h , hc )
1
i, j
^1, 2`
min( r11 r22 , r12 r21 ) 2n
HD1 and HD2 are considered as two distances obtained 978-0-7695-4618-6/11 $26.00 © 2011 IEEE DOI 10.1109/ICI.2011.13
18
‘improvisation’ [10]. In the memory consideration, the value of the first decision variable (x’) for the new vector is chosen from any of the values in the specified HM range ( x 1c1 x 1cHMS ). Values of the other decision variables
from comparison of fi and the two other fragments (f1 and f2). III. RELATED WORK A. Genetic Algorithm (GA) To resolve haplotype assembly [16] have proposed a genetic algorithm to cluster the fragments. The chromosomes are defined as a binary string of length m (number of fragments). When chromi is equal to 0 (or 1), it means that the ith fragment is considered to be one of the first (or the second) class members. There is a fitness function recommended for evaluating the individuals by [16], which computes the distance of all fragments with their class centers (the class centers in this problem are computed by a voting function method).
( x 2c ,..., x mc ) are chosen in the same manner. The HMCR, which varies between 0 and 1, is the rate of choosing one value from the historical values stored in the HM, while (1 – HMCR) is the rate of randomly selecting one value from the possible range of values. xc {xi1 , xi2 ,..., xiHMS } with probability HMCR, xic m ® i with probability (1 HMCR). ¯ xic X i Every component obtained by the memory consideration is examined to determine whether it should be pitchadjusted. This operation uses the PAR parameter, which is the rate of pitch adjustment as follows:
B. Particle Swarm Optimization (PSO) Particle swarm optimization method is much like genetic algorithm. [13] Used this method for haplotype reconstruction problem in MEC model in which the particles are coded the same as the described genetic algorithm.
Yes xic m ® ¯ No
PAR, (1 PAR ).
The value of (1 – PAR) sets the rate of doing nothing. If the pitch adjustment decision for x’ is YES, x’ is replaced as follow: x ic m x ic ¬« rand() *2¼»
C. K-means A heuristic clustering method (based on MEC model) has been published by [17]. First, two fragments are selected as the primitive centers. So the other fragments are clustered according to hamming distance of them and specified centers. In iterations, the centers are updated according to newly constructed clusters and voting function.
Where rand() is a random number between 0 and 1. In this Step, HM consideration, pitch adjustment or random selection is applied to each variable of the new harmony vector in turn. C. Update harmony memory If the new harmony vector, is better than the worst harmony in the HM, judged in terms of the objective function value, the new harmony is included in the HM and the existing worst harmony is excluded from the HM.
IV. PROPOSED APPROACHES In this paper we introduce a HS to solve MEC model [9]. The steps in the procedure of harmony search are described in the next four subsections.
D. Check stopping criterion If the stopping criterion (maximum number of improvisations) is satisfied, computation is terminated. Otherwise, Steps 2 and 3 are repeated
A. Initialize algorithm parameters In the light of the goal of the MEC model, the goodness and badness of an individual is dependent on the number of error corrections. Hence, we design the following objective function: Maximize f (x 1 , x 2 ,..., x m )
with probability with probability
V. EXPERIMENTAL RESULTS
mn E ( P {x 1 , x 2 ,..., x m }) , x i ^0,1` mn
There are some simulation (SIM0, SIM50) and real biological (ACE, Daly) datasets available for haplotype reconstruction problem. Our approaches are implemented using Visual C#.Net 4.0 and executed on all of the datasets. All datasets have 12 different gap and error rates (Error Rate = 0.1, 0.2, 0.3 and 0.4 and Gap rate = 0.25, 0.50 and 0.75). To show the performance of our algorithm in Daly dataset, the reconstruction rates subtraction of our method and Kmeans and also subtraction of our approach and GA in MEC model is presented (Figure1). The X-axis and Y-axis of this chart present different error and gap rates and reconstruction rate subtraction consecutively.
Where m is length of SNP fragment; n is number of SNP fragments; xi is ith SNP of current SNP fragment; P{x1, x2,…, xm} is a partitioning of {x1, x2,…, xm} and E(P{x1, x2,…, xm}) is the corresponding error correction in comparison with their own center, i.e. the distance between center and each fragment. Then, the HM matrix is filled randomly. B. Improvise a new harmony A new harmony vector, is generated based on three rules: (1) memory consideration, (2) pitch adjustment and (3) random selection. Generating a new harmony is called
19
VI. CONCLUSION In this paper, two new approaches were proposed to solve haplotype reconstruction problem in MEC model. HS was used to cluster data of our problem. Some other improvements were performed in different parts of HS. In this approach, HS was used to cover almost all solution space and improve the accuracy of the solutions.
Figure. 1. Reconstruction rate subtraction (Improvement of HS from K-means, GA and PSO).
TABLE 1: RECONSTRUCTION RATE ON DALY, ACE, SIM0 AND SIM50 DATASETS FOR DIFFERENT GAP AND ERROR RATE (MEC MODEL)
[10] Mahdavi, M., Fesanghary, M., and Damangir, E. An improved harmony search algorithm for solving optimization problems. Applied Mathematics and Computation, 188, 2 (May 2007), 1567-1579. [11] Moeinzadeh, M-Hossein, et.al. Neural Network Based Approaches, Solving Haplotype Reconstruction in MEC and MEC/GI Models. In Second Asia International Conference on Modelling and Simulation (2008), IEEE, 934-939 [12] Panconesi, Alessandro and Sozio, Mauro. Fast Hare: A Fast Heuristic for Single Individual SNP Haplotype Reconstruction. In Algorithms in Bioinformatics ( 2004), LNCS Springer-Verlag, 266-277. [13] Qian, Weiyi, Yang, Yingjie, Yang, Ningning, and Li, Chun. Particle swarm optimization for SNP haplotype reconstruction problem. Applied Mathematics and Computation, 196 (2007), 266-272. [14] Terwilliger, Joseph. and Weiss, KKenneth. Linkage disquilibrium mapping of complex disease: Fantasy and reality? Current Opinion in Biotechnology, 9, 6 (1998), 578–594. [15] Venter, J. Craig, Adams, M.D., and al., et. The sequence of the human genome. Science, 291, 5507 (February 16, 2001), 1304–1351. [16] Wang, Rui-Sheng, Wu, Ling-Yun, Li, Zhen-Ping, and Zhang, XiangSun. Haplotype reconstruction from SNP fragments by Minimum Error Correction. Bioinformatics, 21, 10 (2005), 2456-2462. [17] Wang, Ying, Feng, Enmin, and Wang, Ruisheng. A clustering algorithm based on two distance functions for MEC model. Computational Biology and Chemistry, 31, 2 (2007), 148-150. [18] Zhang, X., Wang, R., Wu, L., and Zhang, W. Minimum conflict individual Haplotyping from SNP fragments and related Genotype. Bioinformatics Oxford Journal (2006), 271-280.
REFERENCES [1]
[2]
[3]
[4] [5]
[6]
[7]
[8]
[9]
Asgarian, Ehsan, Moeinzadeh, Mohamad Hoesin, Habibi, Jafar, Rasooli, A., and Moaven, Sh. Solving Haplotype Reconstruction Problem in MEC Model with Hybrid Information Fusion. Australian Journal of Basic and Applied Sciences, 3, 1 (2009), 277-282. Asgarian, Ehsan, et.al. Solving MEC model of haplotype reconstruction using information fusion, single greedy and parallel clustering approaches. In International Conference on Computer Systems and Applications ( 2008), IEEE/ACS, 15 - 19. Bonizzoni, Paola, Vedova, Gianluca Della, Dondi, Riccardo, and Li, Jing. The Haplotyping problem: An overview of computational models and solutions. Journal of Computer Science and Technology, 18, 6 (2003), 675-688 Chakravarti, Aravinda. It's raining SNPs, hallelujah? Nature Genetics, 19 (1998), 216 - 217. Coello, Carlos A. Coello. Constraint-Handling using an Evolutionary Multiobjective Optimization Technique. Civil Engineering and Environmental Systems , 17 (2000), 319–346. Geem, Z. W., Tseng, C., and Park, Y. Harmony search for generalized orienteering problem: best touring in China. In Lect Notes Comput Sci ( 2005), Springer, 741–750. Geem, Z. W., Kim, J. H., and Loganathan, G V. Harmony search optimization: application to pipe network design. International Journal of Modelling and Simulation, 22, 2 (2002), 125–133. Greenberg, Harvey J., Hart, William E., and Lancia, Giuseppe. Opportunities for Combinatorial Optimization in Computational Biology. INFORMS Journal on Computing, 16, 3 (2004), 211-231. Lee, Kang Seok and Geem, Zong Woo. A new meta-heuristic algorithm for continues engineering optimization: harmony search theory and practice. Computer Methods in Applied Mechanics and Engineering, 194, 36-38 (September 2005), 3902–3933.
20