Improvement of Clustal-Derived Sequence Alignments with Evolutionary Algorithms

René Thomsen
Dept. of Computer Science, University of Aarhus, Ny Munkegade, Bldg. 540, DK-8000 Aarhus C, Denmark

Gary B. Fogel
Natural Selection, Inc., 3333 N. Torrey Pines Ct., Suite 200, La Jolla, CA 92037, USA

Thiemo Krink
Dept. of Computer Science, University of Aarhus, Ny Munkegade, Bldg. 540, DK-8000 Aarhus C, Denmark

Abstract- Multiple sequence alignment (MSA) is a central problem in bioinformatics. In this study, we extended previous efforts using evolutionary algorithms (EAs) for MSA. Candidate solutions in the initial population were derived from the well-known alignment program Clustal X. Evolutionary computation was then used to evolve increasingly appropriate solutions. Three new alignment operators were introduced and tested within the framework of protein sequence alignment. Statistics on alignment quality were generated with respect to selected alignment benchmarks from the BAliBASE database using the BLOSUM62 substitution matrix. Our results indicate the degree to which EAs can enhance the results of Clustal X. Moreover, the experimental results show that the commonly used sum-of-pairs scoring scheme sometimes fails to associate higher-scoring alignments with an increase in alignment quality in terms of the BAliBASE sum-of-pairs score.

Keywords: Clustal X, Multiple Sequence Alignment, Evolutionary Algorithms, BAliBASE

1 Introduction
The ability to capture molecular sequence information in biology has long surpassed our degree of understanding of that information. The wealth of DNA, RNA, and protein sequence information currently available demands better means for data interpretation. This interpretation is made significantly easier when viewing the sequences by comparison rather than in isolation. Multiple sequence alignment (MSA) of nucleic or amino acid sequences continues to play a central role in advancing our understanding of molecular biology. Sequence alignments can be used to (i) determine evolutionary distances between organisms and infer phylogenetic relationships, (ii) discover conserved motifs that might be important at the levels of transcription, translation, and/or structure, and (iii) improve our understanding and prediction of molecular structures. The size of the MSA problem space increases dramatically with the number of sequences in the alignment and their length. Indeed, MSA problems are NP-hard [14]. So far, the most popular approach for the solution of MSA problems has been to use dynamic programming (DP) [4]. Although DP can guarantee a mathematically optimal alignment, it can only tackle MSA problems with a small number of short sequences. To overcome this problem, a number of heuristics have been introduced to reduce the computational complexity of the problem and provide approximate solutions with DP. For instance, the progressive alignment method [3] gradually builds an alignment by first estimating the evolutionary distance between all sequences to be aligned and then aligning the sequences in order of decreasing similarity. The widely used series of Clustal algorithms utilize progressive alignment methods. Although these DP-based heuristics are fast, progressive alignment methods suffer from entrapment in local optima because they optimize the alignment only in a pairwise manner, without taking the entire alignment into account. To alleviate these problems, a number of iterative, stochastic approaches for MSA have been offered that make use of simulated annealing [7] or evolutionary computation [1], [2], [6], [8]. These stochastic methods can be used to efficiently search large solution spaces. However, these same methods have suffered from long runtimes (when compared to the DP methods above), as they generally start with random initialization of candidate solutions and spend a great deal of time gradually improving these random solutions before reaching an alignment quality comparable to Clustal. We previously outlined an approach that utilized the output from Clustal V as an initial seed to an evolutionary algorithm (EA) [12]. This resulted in a marked reduction in the overall time to completion and also demonstrated that the locally optimal Clustal V outputs could be improved by as much as 10% (in terms of sum-of-pairs score) by the EA in as little as 2 minutes.
In this study, we revised this approach to make use of Clustal X (a graphical front-end to Clustal W). Clustal W [10] and Clustal X [9] are the more widely accepted versions of the Clustal algorithm; they make use of position-dependent gap penalties and weights in the sum-of-pairs (SP) scoring function. The commonly used SP scoring function was used to evaluate the candidate alignments, and we note similar improvement trajectories to our previous work. In addition, several new variation operators were introduced. The efficacy of this approach was evaluated with respect to the BAliBASE [11] sequence alignment database. The BAliBASE database contains several multiple sequence alignments, which are manually refined from the known 3D structures of proteins. This makes it possible to evaluate the quality of MSA algorithms with respect to their ability to derive true alignments. The experimental results show that it is possible to refine the Clustal solutions in all cases. However, the comparison of the commonly used SP scoring scheme to the BAliBASE SP score (SPS) reveals that improvements in SP do not always correlate with good SPS values. This finding indicates that more investigation of the choice of substitution matrices, gap penalties, etc. is needed in order to improve alignment algorithms in general.
2 Methods
2.1 Representation
In all experiments, we represented the MSA problems as an array of molecular sequence information, where each sequence was encoded as an array of characters over the alphabet {C, S, T, P, A, G, N, D, E, Q, H, R, K, M, I, L, V, F, Y, W, −, X}. The symbols C to W represent the 20 amino acids for use with protein sequence information. The symbol "−" refers to a gap in the alignment as the result of an insertion or deletion of an amino acid residue over evolutionary history. The symbol X indicates an undefined amino acid and can occasionally be encountered in protein sequence data. The n sequences to be aligned are generally of different lengths l_1, l_2, ..., l_n. Further, the maximum number of columns in the matrix was limited to w = 1.2 × l_max, where l_max = max{l_1, l_2, ..., l_n}. The arbitrary choice of 1.2 as a scaling factor [2] allowed the alignment to be 20% longer than the longest sequence.
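As an illustration of this representation, the following minimal Python sketch pads a gapped alignment with trailing gaps up to the width limit w. The function name `pad_alignment` and the use of integer floor rounding for w are assumptions made for this example; the text does not state how w is rounded.

```python
GAP = "-"


def pad_alignment(rows):
    """Return the alignment as a list of character lists, padded to width w.

    `rows` is a list of equal-length gapped strings (for example, a seed
    alignment). The width limit follows w = 1.2 * l_max from the text;
    floor rounding via integer arithmetic is an assumption of this sketch.
    """
    # l_max is the length of the longest ungapped sequence.
    l_max = max(len(row.replace(GAP, "")) for row in rows)
    w = (6 * l_max) // 5  # 1.2 * l_max without floating-point error
    if max(len(row) for row in rows) > w:
        raise ValueError("alignment is wider than the allowed width w")
    # Fill every row with trailing gaps so the matrix has exactly w columns.
    return [list(row.ljust(w, GAP)) for row in rows]
```

For two rows whose longest ungapped sequence has 10 residues, the sketch yields a matrix with 12 columns, the last two of which contain only gaps.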
2.2 Population Initialization
For all experiments below, the solution obtained from Clustal X was used to seed all individuals in the initial EA population (i.e., all individuals were identical). Since the candidate alignment matrix could be wider than the seed alignment (see Section 2.1), gaps were added to the end of each sequence to fill up the matrix. It may seem peculiar not to initialize with randomly generated alignments, but preliminary experiments indicated that randomly created alignments did not resemble typically observed alignments. Furthermore, different crossover operators were tested, but they were not able to improve the candidate solutions (data not shown). However, having a population of identical seed alignments allowed the EA to take different paths toward better alignments from the initial starting point.

2.3 Variation Operators
During the evolutionary run, all individuals were subjected to different variation operators in order to alter the candidate alignment solutions. The variation operators are described below. The LocalShuffle and BlockShuffle operators are similar to those previously introduced (see [2] and [12]).

The LocalShuffle operator picks a random amino acid from a randomly chosen row (sequence) in the alignment and checks whether one of its neighbors is a gap. If this is the case, the algorithm swaps the selected amino acid with the neighboring gap. If both neighbors are gaps, one of them is picked randomly.

The DirectedLocalShuffle operator is a directed version of LocalShuffle. The only difference is that, when both neighbors are gaps, the swap yielding the better overall fitness score is chosen.

The BlockShuffle operator is very similar to LocalShuffle. First, a random block of consecutive amino acids is picked from a randomly chosen row (sequence). If a position adjacent to the block contains a gap, the block is moved by one position to the left or right (depending on which side contains the gap). If the block has gaps on both sides, the direction of movement is picked randomly.

The PassGaps operator identifies all gap regions in a randomly chosen sequence and selects one of them. The amino acid on the left- or right-hand side of the chosen gap region (picked at random) is then moved to the end of the gap region (or to its beginning, if moving in the other direction), thus bypassing all the gaps.

The RandomMoveGap operator randomly chooses a sequence and identifies all gaps (ignoring terminal gaps). A randomly chosen gap from this set is moved to a randomly chosen position in the sequence between the first and last occurring amino acids.

2.4 Fitness Evaluation
Prior to the evaluation of candidate alignments, columns containing only gaps were moved to the far right-hand side of the alignment. During fitness evaluation, all fully gapped columns at the terminal ends of the alignment were ignored. The optimization task at hand is defined as a maximization problem, such that matching amino acids over columns of the matrix are rewarded and gaps are penalized:

$$\mathit{fitness} = \mathit{SPscore} - \mathit{GapPenalty} \qquad (1)$$
where SPscore is the sum of all pairwise symbol matches (also referred to as the sum-of-pairs function, see equation 2), which is specified by a substitution matrix (such as the accepted point mutation (PAM) matrix [4] or the blocks substitution (BLOSUM) matrix [5]).

$$\mathit{SPscore} = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \mathrm{BLOSUM62}(l_i, l_j) \qquad (2)$$
These substitution matrices contain scores for all possible amino acid symbol matches and mismatches based on the frequency of occurrence of these changes over evolutionary time, as represented in protein sequence databases. Amino acids that are identical between two sequences at a particular location receive a high score, whereas an unlikely amino acid mismatch (substitution) at a given position receives a low score. We applied a score of zero for all pairs of amino acids containing an X (undefined amino acid). For the experiments described here, the BLOSUM62 [5] matrix was used, since it is generally considered to be a good substitution matrix for MSA. The GapPenalty specifies the penalty associated with the introduction of gaps in the alignment and is calculated for all n sequences. We used an affine gap cost penalty:

$$\mathit{GapPenalty} = \mathit{GOP} + \mathit{GAPS} \times \mathit{GEP} \qquad (3)$$
where GOP was a fixed gap opening penalty, GAPS was the number of consecutive gaps under consideration, and GEP was the gap extension penalty. Terminal gaps were not penalized. The values for GOP and GEP are shown in Section 3.3.

2.5 The MSAEA
In this study, a simple EA denoted MSAEA was used (see Figure 1). All individuals were initialized and evaluated according to the fitness function (see Sections 2.2 and 2.4). Each individual had a probability of exposure to variation as long as the termination criterion was not fulfilled, i.e., as long as the current number of fitness evaluations remained below the maximum number of fitness evaluations allowed. µ offspring were created from randomly chosen parents (λ) using one of the five mutation operators. The mutation operator used to create each offspring was randomly chosen among the five available operators (with equal probability). Further, individuals that were altered due to mutation were reevaluated using the fitness function. Finally, (λ + µ)-selection was applied to select the best λ individuals (from the pool of λ + µ individuals) as parents in the new population P(t + 1).
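The fitness function of Section 2.4 (equations 1 to 3) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the substitution scores are supplied through a caller-provided function `score(a, b)` standing in for a BLOSUM62 lookup, and pairs involving a gap are assumed to contribute zero to the SP score (the text only specifies a zero score for pairs containing X). The function names are hypothetical.

```python
from itertools import combinations, groupby

GAP = "-"


def sp_score(rows, score):
    """Sum substitution scores over all symbol pairs in every column."""
    total = 0
    for column in zip(*rows):
        for a, b in combinations(column, 2):
            # Pairs with X score zero (from the text); pairs with a gap
            # score zero as well (an assumption of this sketch).
            if GAP in (a, b) or "X" in (a, b):
                continue
            total += score(a, b)
    return total


def gap_penalty(rows, gop, gep):
    """Affine penalty GOP + GAPS * GEP for every internal gap run."""
    penalty = 0
    for row in rows:
        core = "".join(row).strip(GAP)  # terminal gaps are not penalized
        for symbol, run in groupby(core):
            if symbol == GAP:
                penalty += gop + len(list(run)) * gep
    return penalty


def fitness(rows, score, gop, gep):
    """Equation (1): fitness = SPscore - GapPenalty."""
    return sp_score(rows, score) - gap_penalty(rows, gop, gep)
```

Because terminal gaps are stripped and gap pairs score zero, fully gapped columns at the right-hand end of the matrix do not change the result, which matches the statement that such columns are ignored during evaluation.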
procedure MSAEA
begin
    t = 0
    initialize population P(t) with Clustal X's seeds
    evaluate P(t)
    while (not termination-condition) do
    begin
        mutate individuals in P(t)
        evaluate P(t)
        select P(t+1) from P(t)
        t++
    end
end

Figure 1: Pseudocode of the MSAEA. P(t) refers to the population at generation t.
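The loop in Figure 1 can be sketched in Python as shown below. This is an illustrative sketch only: the function names, the default population sizes, and the evaluation budget are hypothetical, the fitness function is passed in by the caller, and only the LocalShuffle operator of Section 2.3 is implemented as an example operator.

```python
import random

GAP = "-"


def local_shuffle(rows, rng):
    """LocalShuffle: swap a random residue with a neighboring gap, if any."""
    child = [list(r) for r in rows]  # copy, so the parent stays unchanged
    row = rng.choice(child)
    residues = [i for i, s in enumerate(row) if s != GAP]
    if not residues:
        return child
    i = rng.choice(residues)
    gap_neighbors = [j for j in (i - 1, i + 1)
                     if 0 <= j < len(row) and row[j] == GAP]
    if gap_neighbors:
        j = rng.choice(gap_neighbors)  # random pick if both sides are gaps
        row[i], row[j] = row[j], row[i]
    return child


def msaea(seed, fitness, operators, n_parents=10, n_offspring=10,
          max_evals=1000, rng=None):
    """Evolve copies of `seed`; return the best alignment and its score."""
    rng = rng or random.Random()
    # All individuals start as identical copies of the seed alignment.
    population = [[list(r) for r in seed] for _ in range(n_parents)]
    scores = [fitness(ind) for ind in population]
    evals = n_parents
    while evals < max_evals:
        # Each offspring: random parent, random operator (equal probability).
        offspring = [rng.choice(operators)(rng.choice(population), rng)
                     for _ in range(n_offspring)]
        offspring_scores = [fitness(child) for child in offspring]
        evals += n_offspring
        # Plus-selection: keep the best n_parents of parents and offspring.
        pool = sorted(zip(scores + offspring_scores, population + offspring),
                      key=lambda pair: pair[0], reverse=True)[:n_parents]
        scores = [s for s, _ in pool]
        population = [ind for _, ind in pool]
    return population[0], scores[0]
```

Since parents compete with their offspring for survival, the best score in this sketch never falls below the score of the seed alignment, and the operator only moves gaps, so the residues of every sequence are preserved.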
3 Experiments
3.1 Alignment Test Cases
Table 1 shows the protein sequence data sets that we used in our experiments. All eight data sets were selected from the first reference set of the BAliBASE database (version 2, http://bess.u-strasbg.fr/BioInfo/BAliBASE2/) [11], which is a publicly available suite of alignment benchmarks.

Data set   NSEQ   LSEQ (min,max,avg)   SEQID
1taq       5      (806,928,865.2)      >35%
1pii       4      (247,259,251.5)      20-40%
1pfc       5      (108,117,112)        20-40%
1hfh       5      (116,132,121.2)      20-40%
451c       5      (70,87,80)           20-40%
kinase     5      (263,276,270.2)
1aboA      5      (49,80,63.6)
1tvxA      4      (54,70,61.75)