GENETIC ALGORITHMS AND THE MULTIPLE SEQUENCE ALIGNMENT PROBLEM IN BIOLOGY

Kosmas Karadimitriou and Donald H. Kraft
Department of Computer Science
Louisiana State University
Baton Rouge, Louisiana 70803
{kosmas, kraft}@bit.csc.lsu.edu
ABSTRACT

Multiple Sequence Alignment is an important problem in molecular biology, where it is used for constructing evolutionary trees from DNA sequences and for analyzing protein structures to help design new proteins. To date, most multiple alignment methods are based on a dynamic programming approach. This approach, however, results in exponential time complexity, since it requires time proportional to the product of the sequence lengths. Tree-based algorithms, which combine results from pairwise alignments, have also been proposed. However, these algorithms depend on the existence of a tree that describes the relations between the sequences, and this tree cannot always be obtained. In general, Multiple Sequence Alignment belongs to a class of hard optimization problems called combinatorial problems. One of the methods developed recently to solve this type of problem is Genetic Algorithms. Genetic Algorithms create a “population” of random solutions and then use the concepts of natural selection, crossover and mutation to improve these solutions. Genetic Algorithms have been used successfully in a wide variety of application areas to find solutions for hard optimization problems. They offer the advantage of operating on several solutions simultaneously, combining exploratory search through the solution space with exploitation of current results. In this study we show how Genetic Algorithms can be used to solve the Multiple Sequence Alignment problem. Our results suggest that optimal or near-optimal solutions can be obtained with Genetic Algorithms faster than with dynamic programming methods. Also, Genetic Algorithms are inherently parallel and can therefore be implemented very efficiently on parallel computers.
Published in: Proceedings of the Second Annual Molecular Biology and Biotechnology Conference, February 1996, Baton Rouge, LA.
1. INTRODUCTION

Multiple sequence alignment is an optimization problem that appears in many diverse scientific fields. During the last decade, there has been increasing interest in the biosciences in methods that can efficiently solve this problem for sequences of biological macromolecules such as DNA and proteins. To date, most of these methods follow either the dynamic programming approach or a tree-based approach. However, multiple sequence alignment is a combinatorial problem with exponential time complexity; therefore, there is no good analytical method that can solve it efficiently. Genetic algorithms is a fairly new, non-analytical optimization technique that can give solutions to hard optimization problems that traditional techniques fail to solve. It is based on simulated evolution, where processes such as crossover, mutation and survival of the fittest help to “evolve” good solutions to a given problem. In this study, we show how genetic algorithms can be used to solve the problem of multiple sequence alignment.
2. MULTIPLE SEQUENCE ALIGNMENT (MSA)

Multiple sequence alignment (MSA) refers to the problem of optimally aligning three or more sequences of symbols, with or without inserting gaps between the symbols. The objective is to maximize the number of matching symbols between the sequences while keeping gap insertion to a minimum, if gaps are permitted. This problem appears in several fields, such as molecular biology, geology, and computer science. In biology it is especially important for constructing evolutionary trees based on DNA sequences and for analyzing protein structures to help design new proteins.

Multiple sequence alignment belongs to a class of optimization problems with exponential time complexity, called combinatorial problems. It exhibits O(L^N) time complexity, where L is the mean length of the sequences to be aligned and N is the number of sequences [Carrillo88]. In biology, the sequences can have lengths on the order of hundreds (proteins), thousands (RNA), or millions of units (DNA). This results in unacceptably long running times, even when aligning only a few sequences [Sankoff83].

To compare different alignments, a fitness function is defined based on the number of matching symbols and the number and size of gaps. In biology, this fitness function is referred to as a cost function and is given biological meaning by using different weights for different types of matching symbols and by assigning gap costs when gaps are used [Altschul89].
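To give a feel for the O(L^N) growth quoted above, the following minimal sketch simply evaluates L^N for a fixed mean length and a few values of N. The function name `dp_table_cells` and the reading of L^N as the number of cells in an N-dimensional dynamic-programming table (constant factors ignored) are assumptions made for illustration only.

```python
def dp_table_cells(mean_length: int, num_sequences: int) -> int:
    """Return L**N, taken here as a rough proxy for the work needed to
    align N sequences of mean length L (constant factors ignored)."""
    return mean_length ** num_sequences


# Assumed example: protein-sized sequences of about 100 symbols each.
for n in (2, 3, 4, 5):
    print(f"N = {n}: about {dp_table_cells(100, n):,} table cells")
```

Each additional sequence multiplies the estimate by L, which is why even a handful of sequences becomes impractical.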
3. GENETIC ALGORITHMS

Genetic algorithms is an optimization technique that was formulated in the early 1970s by John Holland [Holland75]. This technique is useful for finding optimal or near-optimal solutions to combinatorial optimization problems that traditional methods fail to solve efficiently. The genetic algorithms approach is based on the assumption that simulating an evolutionary process in a population of potential solutions can eventually “evolve” good solutions.

Biological terms are conveniently used to describe this process: chromosomes are the potential solutions. Every chromosome is composed of several genes, the solution parameters. Many chromosomes form a population. Successive populations are referred to as generations. Crossover is the exchange of genes (solution parameters) between two chromosomes (solutions). Mutation is the random change of one or more genes in a chromosome. Offspring are the new chromosomes created from two parent chromosomes by crossover.

The genetic algorithms process starts with an initial population composed of random chromosomes, which form the first generation. Crossover is used to combine genes from the existing chromosomes and create new ones. Then, the best chromosomes are selected to form the next generation. This selection is based on a fitness function which assigns a fitness value to every chromosome. The ones with the best fitness values “survive” to give offspring for the new generation, and the process is repeated until satisfactory solutions evolve.

The main advantage of genetic algorithms over other optimization methods is that there is no need to provide a particular algorithm to solve a given problem. Only a fitness function is needed to evaluate the quality of different solutions. Also, since it is an implicitly parallel technique, it can be implemented very effectively on powerful parallel computers to solve exceptionally demanding large-scale problems.
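The process described above can be sketched as a short loop. This is a minimal illustration and not the implementation used in this study: the function name `evolve`, the single cut point, the way parents are paired, the pooling of parents and offspring before selection, and the toy fitness function are all assumptions.

```python
import random


def evolve(fitness, num_genes, gene_max, pop_size=50, generations=30,
           mutation_rate=0.01, seed=0):
    """Toy genetic algorithm over integer chromosomes.

    Returns the best chromosome found and the best fitness per generation.
    """
    rng = random.Random(seed)
    # First generation: random chromosomes.
    population = [[rng.randint(0, gene_max) for _ in range(num_genes)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        offspring = []
        while len(offspring) < pop_size:
            mother, father = rng.sample(population, 2)
            cut = rng.randint(1, num_genes - 1)  # single cut point (assumption)
            child = mother[:cut] + father[cut:]
            # Mutation: each gene has a small chance of being replaced at random.
            child = [rng.randint(0, gene_max) if rng.random() < mutation_rate else g
                     for g in child]
            offspring.append(child)
        # Selection: the fittest chromosomes among parents and offspring survive.
        pool = population + offspring
        population = sorted(pool, key=fitness, reverse=True)[:pop_size]
        history.append(fitness(population[0]))
    return population[0], history


# Toy fitness, purely illustrative: count the genes equal to 3.
best, history = evolve(lambda c: sum(g == 3 for g in c), num_genes=8, gene_max=5)
print(best, history[-1])
```

Because the survivors are drawn from parents and offspring together, the best fitness in this sketch can never decrease from one generation to the next.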
Natural Evolution and Genetic Algorithms

Evolution in Nature
Finds good solutions to the problem of surviving in the physical world.
Primary processes:
• mixing of genes in the population (sexual reproduction)
• small percentage of random changes to genetic material (mutations)
• survival of the fittest (natural selection)
Surviving members reproduce and die. The new generation repeats the process.

Genetic Algorithms
Finds good solutions to combinatorial optimization problems.
Main parts:
• combination of existing solutions to create new ones (crossover)
• small percentage of random changes to the solution parameters (mutations)
• evaluation of current solutions and selection of the best ones to continue
Start with a “population” of random solutions. Repeat with newer “generations” of solutions until satisfactorily good solutions evolve.
The “crossover” operation

PARENTS                     OFFSPRING
   cut   cut
1 | 6 4 | 5 3               1 0 3 5 3
4 | 0 3 | 2 0               4 6 4 2 0
(the genes between the two cuts are exchanged)
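The crossover figure can be reproduced with a few lines of code. This is a sketch only: the function name `two_point_crossover` and the convention that a cut position counts the genes to its left are assumptions; the parent and offspring values are those shown in the figure.

```python
def two_point_crossover(parent_a, parent_b, cut1, cut2):
    """Exchange the genes lying between the two cut points."""
    child_a = parent_a[:cut1] + parent_b[cut1:cut2] + parent_a[cut2:]
    child_b = parent_b[:cut1] + parent_a[cut1:cut2] + parent_b[cut2:]
    return child_a, child_b


# Parents from the figure, cut after the first and after the third gene.
offspring = two_point_crossover([1, 6, 4, 5, 3], [4, 0, 3, 2, 0], cut1=1, cut2=3)
print(offspring)  # ([1, 0, 3, 5, 3], [4, 6, 4, 2, 0])
```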
Mutation: introduces new “genetic material” into a population.
How? By including a small probability that an “error” will occur in a crossover.
Why? To provide insurance against the development of a uniform population incapable of further evolution (technically, it keeps the algorithm from getting “stuck” at a less-than-optimal solution).
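A mutation step along these lines might look as follows. This is a hedged sketch: the function name `mutate`, the per-gene mutation probability, and the choice to replace a mutated gene with a uniformly random value between 0 and `gene_max` are assumptions, since the text only states that a small chance of error is introduced.

```python
import random


def mutate(chromosome, rate, gene_max, rng):
    """Return a copy of the chromosome in which each gene is replaced by a
    random value in [0, gene_max] with probability `rate`."""
    return [rng.randint(0, gene_max) if rng.random() < rate else gene
            for gene in chromosome]


rng = random.Random(0)
child = [1, 0, 3, 5, 3]
print(mutate(child, rate=0.01, gene_max=6, rng=rng))
```

With a rate of 1% most offspring pass through unchanged, so mutation adds variety to the population without destroying the solutions found so far.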
4. APPLYING GENETIC ALGORITHMS TO MSA

a) MSA without gaps

Every possible alignment can be represented as a chromosome which is composed of N genes, where N is the number of sequences to be aligned. Every gene stores the translation of its corresponding sequence from left to right.

Example:
Chromosome (sequence translations): 4 0 3 2 0
Sequence numbers:                   1 2 3 4 5
[Figure: corresponding alignment of the five sequences, each translated by the amount stored in its gene.]
The fitness function can be as simple as counting the total number of matching symbols, or as complicated as considering the type of symbols aligned, their location in the sequences, their neighboring symbols, etc. In this implementation, we use the simplest possible fitness function, which is to count the total number of matches and then assign 1 point for each match:

fitness = (total matches) * 1.0
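The gap-free representation and its fitness function can be sketched as follows. The helper names `place` and `fitness_no_gaps` are hypothetical, and two points are assumptions rather than details given in the text: a gene is read as the number of positions its sequence is shifted to the right, and a "match" is counted once for every pair of identical symbols that share a column.

```python
from collections import Counter


def place(sequences, offsets):
    """Shift each sequence right by its offset, padding with spaces."""
    width = max(off + len(seq) for seq, off in zip(sequences, offsets))
    return [(" " * off + seq).ljust(width) for seq, off in zip(sequences, offsets)]


def fitness_no_gaps(sequences, offsets):
    """fitness = (total matches) * 1.0, counting matching pairs per column."""
    matches = 0
    for column in zip(*place(sequences, offsets)):
        counts = Counter(sym for sym in column if sym != " ")
        # k identical symbols in one column form k*(k-1)/2 matching pairs.
        matches += sum(k * (k - 1) // 2 for k in counts.values())
    return matches * 1.0


# Toy sequences, purely illustrative.
print(fitness_no_gaps(["ABC", "BC"], [0, 1]))  # 2.0
print(fitness_no_gaps(["ABC", "BC"], [0, 0]))  # 0.0
```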
b) MSA with gaps

In this case every alignment is represented by a chromosome composed of G*N genes where G is the maximum number of gaps allowed in a sequence and N is the number of input sequences to be aligned. Every gene stores the position in which a gap appears in its corresponding sequence. The location of the gene in the chromosome denotes the sequence with which it is associated.

Example:
Chromosome: 3 6 6 0 0 6 0 7 0 0 0 4
[Figure labels: gap positions in sequence 1, gap positions in sequence 2]
Corresponding alignment: - - D C E - A D F - - - C - - - - E C A D - A E F -
For MSA with gaps, the fitness function is modified to include some penalty points for the gaps:

fitness = (total matches) * 1.0 - (gap penalties)

Each matching pair of symbols adds 1 point to the fitness value, whereas 4 points are subtracted for every group of consecutive gaps. Also 0.4 points are subtracted for each individual gap. This distinction between “gap groups” and “individual gaps” helps to impose a start-up penalty for introducing new gaps.

gap penalties = (gap groups) * 4.0 + (total number of gaps) * 0.4
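The two formulas above can be written out directly. The sketch below assumes the alignment is already available as equal-length rows with "-" marking gaps; the function names are hypothetical, and treating leading and trailing runs of gaps like any other gap group is an assumption the text does not settle.

```python
import re
from collections import Counter


def gap_penalties(rows):
    """(gap groups) * 4.0 + (total number of gaps) * 0.4"""
    gap_groups = sum(len(re.findall(r"-+", row)) for row in rows)
    total_gaps = sum(row.count("-") for row in rows)
    return gap_groups * 4.0 + total_gaps * 0.4


def fitness_with_gaps(rows):
    """(total matches) * 1.0 - (gap penalties), one point per matching pair."""
    matches = 0
    for column in zip(*rows):
        counts = Counter(sym for sym in column if sym != "-")
        # k identical symbols in one column form k*(k-1)/2 matching pairs.
        matches += sum(k * (k - 1) // 2 for k in counts.values())
    return matches * 1.0 - gap_penalties(rows)


# Toy alignment: 3 matching pairs, 1 gap group, 1 gap.
print(fitness_with_gaps(["AB-C", "ABDC"]))
```

Under these weights a run of three consecutive gaps costs 4.0 + 1.2 = 5.2 points, while three isolated gaps cost 12.0 + 1.2 = 13.2 points, which is the start-up penalty at work.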
Experiment 1: MSA without gaps

Total population: 1000
Mutations: 1%
Generations: 16

Final alignment:
RQWERQWERQWERQWERWERWQQR
RQEQWERQWERQWERQWERWQERW
WQERQWERQWRQWERQWRQWERWE
EWQERQWERWQERWERQWERQWER
EWRQWRQWERQWERQWEWREQWER
WERQERQWRWERQWERQWERQQWE

Total number of different possible alignments: 190,000,000
Number of alignments the genetic algorithm needed to produce this solution: 16,000
Total computer time: 18 seconds
Experiment 2: MSA with gaps

Maximum total number of gaps: 60
Total population: 1000
Mutations: 0.05%-1% (adapted dynamically)
Generations: 700

Final alignment:
KLGQGCFGEVWMGTWNGT--TRVAIKTLKPGTM--SPEAFLQEAQVMKKLRHEKLVQLYAV----VSEEPIYIVTEYMSKGSLLDFLK-
KLGGGQYGEVYEGVWKKYSLT-VAVKTLKEDTMEVEE--FLKEAAVMKEIKHPNLVQLLGV---CTREPPFYI--ITEFMTYGNL-LDY
ELGQGSFGMVYEGNA-RDIIKGEA-ETRVAVKT-VNESASLRERIEFLNEAS---VMKGFTCHHVV---RLLGVVSKGQPTLVVMELMA
LLGSGAFGEVYEGTAV-DILGVGSGEIKVAVKTLKKGSTD-----QEKIE-FLKEAHLMSKFNH-PNILKQLGVCLLNEPQYIILELM-
ILGRGVSSVVRRCIHKPT-CKEYAVKIID-VTG---GGSFSAEEVQELREATLKEVDILRKVSGHPNIIQLK--DTYETNT--FFFLVF
LLGSGGFGSVYSGIRVSDNL-PVAIKH---VEKDRI--SDWGELPNG--TRVPMEVVLLKKVSSGFSGVIRLL-DWFERPDSFVLILER-
(each sequence is part of a protein kinase that includes the ATP-binding site)

Total number of different possible alignments: 530,000,000,000
Number of alignments the genetic algorithm needed to produce this solution: 700,000
Total computer time: 10 minutes
5. CONCLUSION

Multiple sequence alignment is very useful in many scientific fields, including biology. However, it is a combinatorial optimization problem with exponential time complexity. Genetic algorithms is a fairly new optimization technique that is effective for this type of problem. In this study we describe the genetic algorithms methodology and demonstrate how it can be implemented to produce optimal or near-optimal solutions to the MSA problem. Two different types of alignments are considered: alignments with and without gaps. In both cases, genetic algorithms produce reasonably good solutions using only a small amount of computer resources.
REFERENCES

[Altschul89] Altschul S.F., “Gap costs for multiple sequence alignment”, J. Theoretical Biol., vol. 138, pp. 297-309, 1989.
[Carrillo88] Carrillo H., and Lipman D., “The multiple sequence alignment problem in biology”, SIAM J. Appl. Math., vol. 48, no. 5, pp. 1073-1082, October 1988.
[Holland75] Holland J.H., “Adaptation in Natural and Artificial Systems”, Ann Arbor: The University of Michigan Press, 1975.
[Holland92] Holland J.H., “Genetic Algorithms”, Scientific American, pp. 66-72, July 1992.
[Sankoff83] Sankoff D., and Kruskal J.B., eds., “Time warps, string edits, and macromolecules: the theory and practice of sequence comparison”, Addison-Wesley, 1983.