Improvement of Genetic Algorithm for Classifying SNP Fragments

Saman Poursiah navi¹, Vahid Chahkandi²

¹ Department of Computer Engineering, Islamic Azad University Quchan Branch, Quchan, Iran.
² Department of Computer Engineering, Islamic Azad University Mashad Branch, Mashad, Iran.



Abstract-Haplotype reconstruction in the MEC (Minimum Error Correction) model is an important clustering problem that focuses on inferring two haplotypes from SNP (Single Nucleotide Polymorphism) fragments containing gaps and errors. Mutated forms of the human genome are responsible for genetic diseases, and these mutations mostly occur at SNP sites. In this paper, a genetic algorithm (GA) is considered as a classifier of diploid genomes, and a new encoding approach is used to improve GA efficiency. In the previous GA approach, evaluated by reconstruction rate, all bits of the chromosome represented the cluster state of the SNP fragments. In our proposed method, the values of the final haplotypes are based on the centers of the SNP fragment clusters. Finally, the two approaches are executed on four standard datasets (ACE, Daly, SIM0 and SIM50), and the results show the efficiency of our proposed approach.

Keywords-Genetic Algorithm; Haplotype; classification; pre-processing; SNP fragments; genotype information

I. INTRODUCTION

The availability of the complete genome sequence for human beings makes it possible to investigate genetic differences and to associate genetic variations with complex diseases [1]. It is generally accepted that all humans share about 99% identity at the DNA level and that only some regions of difference in DNA sequences are responsible for genetic diseases [2, 3]. Single Nucleotide Polymorphisms (SNPs), single DNA bases that vary from one individual to another, are believed to be the most frequent form of genetic difference [4]. They are found approximately every 1000 base pairs in the human genome and have turned out to be promising tools for disease association studies.

Every nucleotide at an SNP site is called an allele. Almost all SNPs have two different alleles, denoted here as 'A' and 'B'. The SNP sequence on each copy of a pair of chromosomes in a diploid genome is called a haplotype, which is a string over {'A', 'B'}. A genotype is the conflated information of a pair of haplotypes on homologous chromosomes. Although haplotypes carry more information for disease association than individual SNPs or genotypes, determining haplotypes experimentally is substantially more difficult than determining genotypes or individual SNPs. Hence, computational methods that can reduce the cost of determining haplotypes have become attractive alternatives.

SNP fragments contain gaps and errors. One question arising from this is how the distribution of gaps and errors in the input data affects computational complexity. MEC, LHR, MEC/GI and some other models have been discussed for haplotype reconstruction [5-7]. Minimum Error Correction (MEC), a standard model, is the focus of our research. It takes SNP fragments as input and infers the pair of haplotypes that requires the minimum number of error corrections. In the MEC/GI model, genotype information is additionally used to improve the efficiency of the methods [1, 8, 9].

For the MEC model, two different procedures can be used to solve the problem. In the first, a partitioning or clustering method is employed to divide the SNP fragments into two classes, each corresponding to one haplotype; the haplotypes are then inferred from each partition by a separate function (described later). The second approach infers the haplotypes directly from the SNP fragments and corrects the errors simultaneously. Haplotype reconstruction in the MEC model has been proved to be NP-hard [10-12]. Thus, researchers are concerned both with reducing running time and with obtaining acceptable results with minimum error. Heuristic algorithms and different clustering methods have been designed to achieve these goals [1, 8, 13, 14].

To solve haplotype reconstruction in the MEC model, Wang et al. [9] proposed a genetic algorithm to cluster the SNP fragments. Each chromosome is defined as a binary string with $Ch_i \in \{0, 1\}$, of length m (the number of fragments). $Ch_i = 0$ (or 1) means that the ith fragment belongs to the first (or the second) class. The fitness of an individual must be assigned based on the number of error corrections required, but this encoding does not itself provide the cluster centers. Wang [9] therefore recommends a fitness function that computes the distance of all fragments to their class centers, where the class centers are computed by a voting function.
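As an illustration of the partition-based encoding described above, the following Python sketch decodes a binary chromosome into two classes, derives the class centers by majority voting, and counts the corrections needed. It is a minimal sketch rather than the implementation of Wang et al.: the function names are hypothetical, and resolving voting ties (including all-gap columns) toward 'A' is an assumption.

```python
from typing import List, Tuple

GAP = "-"


def hamming(a: str, b: str) -> int:
    """Count sites where both strings are non-gap and carry different alleles."""
    return sum(1 for x, y in zip(a, b) if x != GAP and y != GAP and x != y)


def vote(cluster: List[str], n: int) -> str:
    """Majority vote per SNP site; ties and all-gap columns go to 'A' (assumption)."""
    center = []
    for j in range(n):
        count_a = sum(1 for frag in cluster if frag[j] == "A")
        count_b = sum(1 for frag in cluster if frag[j] == "B")
        center.append("A" if count_a >= count_b else "B")
    return "".join(center)


def partition_fitness(chromosome: List[int], fragments: List[str]) -> Tuple[int, str, str]:
    """Return (number of corrections, center of class 0, center of class 1)."""
    n = len(fragments[0])
    classes: Tuple[List[str], List[str]] = ([], [])
    for bit, frag in zip(chromosome, fragments):
        classes[bit].append(frag)  # bit i selects the class of fragment i
    h1, h2 = vote(classes[0], n), vote(classes[1], n)
    errors = sum(hamming(f, h1) for f in classes[0])
    errors += sum(hamming(f, h2) for f in classes[1])
    return errors, h1, h2
```

For example, with fragments `["AAB-", "A-BB", "BBAA", "BB-A", "ABAA"]` and chromosome `[0, 0, 1, 1, 1]`, the class centers are `AABB` and `BBAA`, and one correction is needed (the first site of the last fragment).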

II. FORMULATION AND PROBLEM DEFINITION

Every SNP fragment is assumed to be a string of length n whose entries may be 'A', 'B' or '-', where '-' denotes a missing or skipped site in the sequence, called a gap. Suppose we have m SNP fragments $F = \{f_1, f_2, \ldots, f_m\}$ (the input of the haplotyping problem), corresponding to a pair of haplotypes of length n. To formulate the problem, we use a partition P that divides the SNP fragments into two classes $C_1$ and $C_2$, each class pertaining to one haplotype. To obtain the pair of haplotypes from the two classes, the SNP fragments of every cluster are combined by an operation called voting. For every class $C_i$, $N_A^j(C_i)$ ($N_B^j(C_i)$) denotes the number of fragments in $C_i$ whose jth entry is 'A' ('B'). The voting function is formulated as:

$$h_{ij} = \begin{cases} A & N_A^j(C_i) \ge N_B^j(C_i) \\ B & \text{otherwise} \end{cases} \qquad i = 1, 2 \ \text{and} \ 1 \le j \le n$$

To compute the degree of similarity between the original haplotypes $h = (h_1, h_2)$ and the reconstructed ones $h' = (h'_1, h'_2)$, the reconstruction rate (RR for short) is defined as a measure for comparing the efficiency of different algorithms. It is based on the Hamming distance (HD), a function for computing the distance between two SNP fragments or haplotypes, which is defined as:

$$HD(m_i, m_k) = \sum_{j=1}^{n} d(m_{ij}, m_{kj}),$$

$$d(m_{ij}, m_{kj}) = \begin{cases} 1 & m_{ij} \ne \text{'-'},\ m_{kj} \ne \text{'-'},\ m_{ij} \ne m_{kj} \\ 0 & \text{otherwise} \end{cases}$$

A second distance becomes necessary when the Hamming distances between one fragment and two other fragments are equal. It is defined as follows:

$$HD'(m_i, m_k) = \sum_{j=1}^{n} d'(m_{ij}, m_{kj}),$$

$$d'(m_{ij}, m_{kj}) = \begin{cases} 1 & m_{ij} \ne \text{'-'},\ m_{kj} \ne \text{'-'},\ m_{ij} \ne m_{kj} \\ -1 & m_{ij} = m_{kj} \ne \text{'-'} \\ 0 & \text{otherwise} \end{cases}$$

Once we have reconstructed the pair of haplotypes, RR is computed as:

$$r_{ij} = HD(h_i, h'_j), \qquad i, j \in \{1, 2\}$$

$$RR(h, h') = 1 - \frac{\min(r_{11} + r_{22},\ r_{12} + r_{21})}{2n}$$

III. PROPOSED APPROACHES

In this paper we introduce an improved genetic algorithm clustering approach (IGA) for the haplotype reconstruction problem, based on the MEC model. The algorithm and its details are discussed in the following. In the proposed algorithm, a chromosome does not represent a state of clustering. Rather, it is formed by concatenating the two solution haplotypes (each bit holds the value of the corresponding SNP site).

[Figure 1: IGA chromosome. Positions 1 to m hold Haplotype 1 and positions m+1 to 2m hold Haplotype 2; each position holds an allele, 'A' or 'B'.]

Therefore, in this problem the objective function is defined as:

$$\text{Minimize} \quad f(x_1, x_2, \ldots, x_{2m}) = \sum_{i=1}^{n} \min_{j \in \{1, 2\}} HD(h_j, m_i)$$

where $h_1$ and $h_2$ are extracted from the chromosome and $x_i \in \{0, 1\}$. Here m is the length of an SNP fragment, n is the number of SNP fragments (note that these two symbols are used the other way round in Section II), $x_i$ is the ith SNP value encoded in the chromosome, and $HD(h_j, m_i)$ is the distance between the ith SNP fragment and the jth haplotype.
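To make the objective concrete, here is a minimal Python sketch of how a chromosome of length 2m could be decoded into two haplotypes and scored. The function names are hypothetical, and the mapping of bit 0 to 'A' and bit 1 to 'B' is an assumption, since the paper gives only $x_i \in \{0, 1\}$.

```python
from typing import List, Sequence, Tuple

GAP = "-"
ALLELES = "AB"  # assumption: bit 0 encodes 'A', bit 1 encodes 'B'


def hamming(a: str, b: str) -> int:
    """Count sites where both strings are non-gap and carry different alleles."""
    return sum(1 for x, y in zip(a, b) if x != GAP and y != GAP and x != y)


def decode(chromosome: Sequence[int]) -> Tuple[str, str]:
    """Split a chromosome of length 2m into the two haplotypes it concatenates."""
    m = len(chromosome) // 2
    first = "".join(ALLELES[bit] for bit in chromosome[:m])
    second = "".join(ALLELES[bit] for bit in chromosome[m:])
    return first, second


def iga_objective(chromosome: Sequence[int], fragments: List[str]) -> int:
    """Sum, over all fragments, of the distance to the nearer haplotype."""
    h1, h2 = decode(chromosome)
    return sum(min(hamming(h1, f), hamming(h2, f)) for f in fragments)


def assign_clusters(chromosome: Sequence[int], fragments: List[str]) -> List[int]:
    """Label each fragment with the index (0 or 1) of its nearer haplotype.

    Ties go to the first haplotype here; the second distance HD' is not used.
    """
    h1, h2 = decode(chromosome)
    return [0 if hamming(h1, f) <= hamming(h2, f) else 1 for f in fragments]
```

With the chromosome `[0, 0, 1, 1, 1, 1, 0, 0]` (haplotypes `AABB` and `BBAA`) and the fragments `["AAB-", "A-BB", "BBAA", "BB-A", "ABAA"]`, the objective value is 1. Because the haplotypes are read directly from the chromosome, no voting step is needed to evaluate an individual.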

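The operators IGA combines (roulette wheel selection, single-point and uniform crossover, single-point and swap mutation) can be sketched in Python in their generic textbook forms. The paper does not specify how the operators are combined, what their rates are, or how the minimized objective is turned into selection weights, so the function names, the explicit position arguments and the choice of weights are all assumptions.

```python
import random
from typing import List, Sequence, Tuple


def roulette_select(weights: Sequence[float], rng: random.Random) -> int:
    """Return an index with probability proportional to its non-negative weight."""
    pick = rng.random() * sum(weights)
    running = 0.0
    for i, w in enumerate(weights):
        running += w
        if pick < running:
            return i
    return len(weights) - 1  # guard against floating-point round-off


def single_point_crossover(p1: List[int], p2: List[int], point: int) -> Tuple[List[int], List[int]]:
    """Exchange the tails of two parents after position `point`."""
    return p1[:point] + p2[point:], p2[:point] + p1[point:]


def uniform_crossover(p1: List[int], p2: List[int], rng: random.Random) -> Tuple[List[int], List[int]]:
    """Swap each gene between the parents independently with probability 0.5."""
    c1, c2 = list(p1), list(p2)
    for i in range(len(c1)):
        if rng.random() < 0.5:
            c1[i], c2[i] = c2[i], c1[i]
    return c1, c2


def point_mutation(chromosome: List[int], index: int) -> List[int]:
    """Flip the bit at `index` and return a new chromosome."""
    mutated = list(chromosome)
    mutated[index] = 1 - mutated[index]
    return mutated


def swap_mutation(chromosome: List[int], i: int, j: int) -> List[int]:
    """Exchange the genes at positions i and j and return a new chromosome."""
    mutated = list(chromosome)
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated
```

Since the objective is minimized, one possible weight for the roulette wheel is `1 / (1 + error)`, so that individuals needing fewer corrections are selected more often; this particular transformation is an illustrative choice, not taken from the paper.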
There are several kinds of selection operators, e.g. tournament selection, rank selection and roulette wheel selection, among which roulette wheel selection is very popular. We adopt both tournament selection and roulette wheel selection in different parts of our GA. In particular, following the principle of survival of the fittest, we use the roulette wheel selection operator to select individuals for crossover and to generate offspring. A combination of single-point crossover and uniform crossover is adopted. Similarly, a combination of single-point mutation and swap mutation is adopted in our IGA.

IV. EXPERIMENTAL RESULTS

Several simulated and real biological datasets are available for the haplotyping problem. In this paper ACE (Angiotensin Converting Enzyme) [15], Daly [16], SIM0 and SIM50 were chosen. Our approaches were implemented in Visual C++ 6.0 and executed on all of the datasets. All datasets have 12 different combinations of error and gap rates (error rate = 10%, 20%, 30%, 40% and gap rate = 25%, 50%, 75%). The reconstruction rates obtained by applying IGA and the method of Wang et al. for all error/gap rates are shown in Figure 1. The ACE dataset includes 23 different test cases for each error rate (92 test cases). The experimental results of the new and previous approaches for the MEC model are compared in Figure 2. We executed our method on 1864 files from different datasets, the properties of which are shown in Table 1. This table also gives the number of SNP fragments and the length of each fragment. Table 2 shows the results of the two methods on all datasets. The experiments show that our approach outperforms the best previous approach.

Figure 2: Comparison of the results (RR(IGA) - RR(Wang)) of the approaches at different gap rates

V. CONCLUSION

In this paper, we present an improved genetic algorithm (IGA) to solve haplotype reconstruction in the MEC model. We define a new chromosome encoding for the GA to improve efficiency. The SNP fragments are clustered according to the two haplotypes in each chromosome, and the haplotypes are inferred from the clusters. The algorithm was run on four datasets (ACE, SIM0, SIM50 and DALY; 1854 files), and the numerical results show that the IGA method based on MEC has higher accuracy than the best previous result of Wang [9].

Table 1: Datasets properties

Table 2: Reconstruction rate on the ACE, DALY, SIM0 and SIM50 datasets for different gap and error rates
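The reconstruction rate reported in these experiments is the RR measure of Section II. As a worked illustration, the following Python sketch computes RR for one pair of original haplotypes and one pair of reconstructed haplotypes. The function names and the example strings are hypothetical and are not taken from the datasets.

```python
from typing import Tuple

GAP = "-"


def hamming(a: str, b: str) -> int:
    """Count sites where both strings are non-gap and carry different alleles."""
    return sum(1 for x, y in zip(a, b) if x != GAP and y != GAP and x != y)


def reconstruction_rate(original: Tuple[str, str], reconstructed: Tuple[str, str]) -> float:
    """RR = 1 - min(r11 + r22, r12 + r21) / (2n), with n the haplotype length."""
    h1, h2 = original
    g1, g2 = reconstructed
    n = len(h1)
    straight = hamming(h1, g1) + hamming(h2, g2)  # r11 + r22
    crossed = hamming(h1, g2) + hamming(h2, g1)   # r12 + r21
    return 1 - min(straight, crossed) / (2 * n)
```

For the original pair `("AABB", "BBAA")` and the reconstructed pair `("BBAA", "AABA")`, the crossed pairing gives one mismatch over 2n = 8 sites, so RR = 0.875. Taking the minimum over both pairings makes the measure independent of the order in which the two reconstructed haplotypes are listed.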

REFERENCES

[1] X. S. Zhang, et al., "Minimum conflict individual haplotyping from SNP fragments and related genotype," Evolutionary Bioinformatics Online, vol. 2, p. 261, 2006.
[2] J. D. Terwilliger and K. M. Weiss, "Linkage disequilibrium mapping of complex disease: fantasy or reality?," Current Opinion in Biotechnology, vol. 9, pp. 578-594, 1998.
[3] J. C. Venter, et al., "The sequence of the human genome," Science, vol. 291, p. 1304, 2001.
[4] A. Chakravarti, "It's raining SNPs, hallelujah?," Nature Genetics, vol. 19, p. 216, 1998.
[5] P. Bonizzoni, et al., "The haplotyping problem: an overview of computational models and solutions," Journal of Computer Science and Technology, vol. 18, pp. 675-688, 2003.
[6] H. J. Greenberg, et al., "Opportunities for combinatorial optimization in computational biology," INFORMS Journal on Computing, vol. 16, pp. 211-231, 2004.
[7] A. Panconesi and M. Sozio, "Fast hare: A fast heuristic for single individual SNP haplotype reconstruction," Algorithms in Bioinformatics, pp. 266-277, 2004.
[8] M. H. Moeinzadeh, et al., "Three Heuristic Clustering Methods for Haplotype Reconstruction Problem with Genotype Information," in International Conference on Innovations in Information Technology, 2007, pp. 402-406.
[9] R. S. Wang, et al., "Haplotype reconstruction from SNP fragments by minimum error correction," Bioinformatics, vol. 21, p. 2456, 2005.
[10] V. Bafna, et al., "Polynomial and APX-hard cases of the individual haplotyping problem," Theoretical Computer Science, vol. 335, pp. 109-125, 2005.
[11] R. Cilibrasi, et al., "On the complexity of several haplotyping problems," Algorithms in Bioinformatics, pp. 128-139, 2005.
[12] R. Cilibrasi, et al., "The complexity of the single individual SNP haplotyping problem," Algorithmica, vol. 49, pp. 13-36, 2007.
[13] M. H. Moeinzadeh, et al., "Neural network based approaches, solving haplotype reconstruction in MEC and MEC/GI models," in International Conference on Modeling & Simulation, 2008, pp. 934-939.
[14] R. Wang, et al., "A Markov chain model for haplotype assembly from SNP fragments," Genome Informatics Series, vol. 17, p. 162, 2006.
[15] M. J. Rieder, et al., "Sequence variation in the human angiotensin converting enzyme," Nature Genetics, vol. 22, pp. 59-62, 1999.
[16] M. J. Daly, et al., "High-resolution haplotype structure in the human genome," Nature Genetics, vol. 29, pp. 229-232, 2001.
