MULTIPLE INDEXING SEQUENCE ALIGNMENT FOR GROUP FEATURE IDENTIFICATION∗
WEI-YAO CHOU*, TUN-WEN PAI*$, JIM ZONE-CHANG LAI*, WEN-SHYONG TZOU#
*Department of Computer Science and Engineering & Center for Marine Bioscience and Biotechnology, #Institute of Bioscience and BioTechnology, National Taiwan Ocean University, No. 2, Pei-Ning Rd., Keelung, Taiwan 20224, Republic of China
MARGARET DAH-TSYR CHANG, HAO-TENG CHANG, WEI-YI CHOU, TAN-CHI FAN
Institute of Molecular and Cellular Biology & Department of Life Science, National Tsing Hua University, No. 101, Sec. 2, Kuang Fu Rd., Hsinchu, Taiwan 30013, Republic of China
A novel scheme for identifying combinatorial patterns and exclusive group features is proposed in this paper. It employs multiple indexing sequence alignment (MISA) based on an interval-jumping searching algorithm and hierarchical clustering techniques. The interval-jumping searching algorithm transforms sequences into sets of numbers in order to find consensus motifs, and it supports approximate matching. The consensus motifs found, which tolerate a limited number of mismatched residues, are labeled and used to build a scoring matrix that clusters the input sequences into several subgroups before the proposed multiple indexing sequence alignment is performed. To extract the features that distinguish the clustered groups, the proposed system applies various combinations of fundamental bitwise operations. MISA has been applied to real biological data and shown to be practical for finding the combinatorial patterns of each subgroup; the features that distinguish each subgroup from the others are also identified for further analysis. Comparisons with other existing algorithms are presented to demonstrate the superior performance of the proposed system.
1
Introduction
Multiple sequence alignment (MSA) is one of the important tasks in computational biology [1]. It reveals conserved regions and evolutionary relationships by aligning sequences directly. Given a set of k sequences (k >= 2), the MSA problem is to align similar subsequences in the same regions. The optimal alignment of two sequences can be computed in O(n_1*n_2) time, where n_1 and n_2 are the lengths of the two sequences. Unfortunately, when k is strictly greater than two, finding an optimal alignment that matches as many residues as possible across the k sequences becomes an NP-hard problem [2,3].
∗ This work is supported by Grants NSC 94-2627-B-007-003, NSC 94-3112-B-007-004-Y, VGHUST 94-G2-02-0, and a grant from the Center for Marine Bioscience and Biotechnology, NTOU, Taiwan, R.O.C.
$ To whom correspondence should be addressed. E-mail: [email protected]
Therefore, many approximate and heuristic algorithms have been developed to address this problem. Some show guaranteed performance, and others perform well in practical cases [4,5,6,7]. A natural heuristic for MSA is based on computing optimal pairwise alignments of the k sequences. Several MSA algorithms attempt to combine compatible subsets of optimal pairwise alignments into a multiple alignment. However, they may misalign residues globally by matching residues locally in pairs. In general, such heuristic algorithms align one residue against one position at a time. Unfortunately, during evolution some residues may mutate through transition and transversion, which may in turn lead to inaccurate alignment. Although a substitution matrix such as one of the BLOSUM series can be incorporated to improve performance, no criterion has yet proved able to distinguish “genuine” (biologically relevant) similarities from chance similarities (noise). Failure to align the truly relevant residues may arise from mistaking a noise residue for the target residue.
In this paper, we propose an efficient and effective algorithm to address the above problems. Initially, consensus motifs are identified according to three parameters (pattern length, number of tolerant residues, and occurrence rate) by employing the Ladder-like Interval Jumping Searching Algorithm (LIJSA) [8]. Every subsegment of an input sequence with the specified pattern length is scanned, encoded, and allocated to an appropriate numerical interval in sequential order. After the search, each consensus motif found under a given parameter setting possesses a unique indexing number and will be matched in the same interval.
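The scan-and-encode step described above can be illustrated with a minimal sketch. It is not LIJSA itself: the base-20 encoding, the function names `encode_window` and `find_consensus_windows`, and the parameter `min_rate` are assumptions made for illustration. Only exact matches are handled, so the tolerant-residue parameter and the interval-jumping search are left out.

```python
from collections import defaultdict

# Assumed alphabet: the 20 standard amino acids, each mapped to a digit 0..19.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CODE = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode_window(window):
    """Encode a window of residues as a base-20 integer (assumed encoding)."""
    value = 0
    for residue in window:
        value = value * len(AMINO_ACIDS) + CODE[residue]
    return value


def find_consensus_windows(sequences, length, min_rate):
    """Return {code: window} for windows found in >= min_rate of the sequences."""
    seen_in = defaultdict(set)  # code -> indices of sequences containing it
    text = {}                   # code -> residue string, for readable output
    for idx, seq in enumerate(sequences):
        for start in range(len(seq) - length + 1):
            window = seq[start:start + length]
            # Skip windows containing symbols outside the assumed alphabet.
            if any(residue not in CODE for residue in window):
                continue
            code = encode_window(window)
            seen_in[code].add(idx)
            text[code] = window
    needed = min_rate * len(sequences)
    return {code: text[code]
            for code, members in seen_in.items()
            if len(members) >= needed}


toy_sequences = ["ACDEFG", "KKACDE", "ACDWWW"]
motifs = find_consensus_windows(toy_sequences, length=3, min_rate=1.0)
print(motifs)  # {22: 'ACD'}
```

In this toy input, only the window ACD occurs in all three sequences, and its code 22 plays the role of the indexing number. Identical windows always receive the same number, which is the property the indexing notation relies on.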
Therefore, a sequence in residue representation can be transformed into an indexing notation composed of the previously labeled consensus segments, with the non-matched subsegments temporarily filtered out for the subsequent alignment procedures. The MSA problem is thus converted into a MISA problem, and extracting combinatorial patterns becomes the main goal of the proposed system. To this end, we assume that the more consensus motifs two transformed indexing sequences share, the higher their similarity. The system can therefore select as the central sequence the one that shares the most consensus motifs with all other sequences. Pairwise indexing alignments between the central sequence and every other sequence are performed, and the pairwise results are merged to obtain an approximate multiple indexing sequence alignment (MISA). The details are described in the following sections.
2
Preliminaries
Let Σ be the set of characters (residues), and S = {S_1, S_2, …, S_k} be a set of k sequences with a maximal length of n over Σ. Let S_i[x…y] denote the substring of S_i where 1