Global pairwise sequence alignment using Hidden Markov Models applied through different scoring schemes

Duran M., Bucak İ.Ö., IEEE Member

Proceedings of the IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI 2012), Hong Kong and Shenzhen, China, 2-7 Jan 2012

Abstract— The Hidden Markov Model has been very popular in Bioinformatics since it was first proposed for sequence analysis. This statistical method can be used for tasks ranging from pairwise sequence alignment to database search. In this study, a global pairwise sequence alignment and a database search using Hidden Markov Models are implemented. Although the problem can be solved by Dynamic Programming, the latter has a weakness that eventually leads to excessive memory usage once all the possibilities are tried. Two different models are used to build the Hidden Markov Model: the first one is an untrimmed model and the second is a trimmed model. Additionally, these models are compared through different scoring schemes.

I. INTRODUCTION

DNA research has been established for years. Scientists have discovered the structures of DNA and proteins, and as an outcome of these studies the DNA sequences of many living organisms have been determined [1]. In time, DNA analysis and protein analysis have become important. The first studies on DNA analysis were carried out through manual calculations; because of the problem's massive data, these calculations take a very long time. With the development of computer software and Artificial Intelligence, computational biology has emerged as a very popular discipline [2]. These developments have helped biologists shorten the time-consuming calculations.

Sequence alignment requires a comparison of two or more sequences. The purpose of this comparison is to observe the similarity between the sequences; similarity indicates structures that have the same functions [3]. Additionally, aligning sequences helps in constructing phylogenetic trees [4]. There are many applications of sequence alignment in Computational Biology; in fact, biological sequence alignment is at the core of Bioinformatics.

In this paper, a Hidden Markov Model will be built from well-aligned sequences, and the model will be tested based on certain scoring schemes. Inherent to the problem structure, the alignment can be computed by Dynamic Programming. However, this method has a weakness that eventually leads to excessive memory usage once all the possibilities are tried [5]. The stochastic methods, on the other hand, evaluate the results according to the model. In this study, the Hidden Markov Model and its applications will be used, and different calculation schemes will also be compared.

Mustafa DURAN is with Fatih University, Buyukcekmece, ISTANBUL 34500 TURKIYE (phone: +90-212-866 33 00 Ext: 2452; e-mail: [email protected]). İhsan Ömür BUCAK is with Mevlana University, Selcuklu, KONYA 42003 TURKIYE (e-mail: [email protected]).

978-1-4577-2177-9/12/$25 (C) 2012 IEEE

II. PROBLEM DESCRIPTION

Biological Sequence Analysis has been developing since the eighties. At the very beginning of the research, scientists analyzed sequences experimentally in the laboratory [6]. With the exponential growth of biological data, scientists needed computational support: it is far easier to explore the structure of DNA or a protein on the computer than to determine its function or structure experimentally. Sequence alignment is simply the comparison of two or more sequences of letters, or residues; from the computational perspective, the goal is to maximize the similarity between these sequences [7]. The process of aligning sequences using a Hidden Markov Model involves statistics, algorithms and mathematical calculations. In this paper, these issues will be explained.

A. Pairwise Sequence Alignment

Pairwise sequence alignment is the most widely used alignment for extracting information from protein and DNA sequences, as well as the most basic sequence analysis method. The goal of pairwise sequence alignment is to find the best possible alignment of two sequences [8]. Given a scoring system for matches, mismatches and gaps, the best possible alignment is defined as the alignment that optimizes the total score [9]. The key issues for this process are: What kind of alignment will be used? Which scoring scheme will be used for ranking alignments? Which algorithm will be used for finding optimal-scoring alignments? What sort of statistical methods will be used for calculating the significance of an alignment score?

Pairwise sequence alignment is a useful tool and has found many applications in biology. For example, the similarity of sequences can be used to find an evolutionary common ancestry [10]. Alternatively, the similarity of sequences can be used to predict the structure and functions of protein domains. Genome assembly is another use of pairwise sequence alignment, in which sequences are aligned to find overlaps among the shorter pieces of the sequenced DNA so that long stretches of sequence can be formed.
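As a concrete illustration, the scoring-based global alignment described above can be sketched with the classic dynamic-programming recurrence; the match/mismatch/gap values below are illustrative assumptions, not the scores used in this study.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score via dynamic programming."""
    n, m = len(a), len(b)
    # F[i][j] = best score for aligning a[:i] with b[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap          # a[:i] aligned against leading gaps
    for j in range(1, m + 1):
        F[0][j] = j * gap          # b[:j] aligned against leading gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          F[i - 1][j] + gap,     # gap in b
                          F[i][j - 1] + gap)     # gap in a
    return F[n][m]
```

Filling the whole (n+1) × (m+1) table is precisely what drives the memory usage of the Dynamic Programming approach mentioned in the Introduction.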

B. Multiple Sequence Alignment

Multiple sequence alignment is one of the most important tasks of Biological Sequence Analysis. Usually the first step in Bioinformatics is to align more than two homologous sequences [11], and the development of multiple sequence alignment algorithms is among the most studied areas of the field. A set of sequences evolves through the processes of residue substitution, insertion and deletion [12]. The input data of a multiple sequence alignment algorithm are a number of homologous sequences, and in reality the objective is to discover how close those sequences are. After the alignment is completed, a table is formed in which each row corresponds to an input sequence and each column to a position in the alignment. Dashes in the table represent gaps, which also indicate the delete positions for residues that have been inserted or deleted [13]. Pairwise alignments usually rank lower than multiple alignments, as multiple alignments utilize a larger amount of data. However, multiple alignment requires great effort both computationally and conceptually [14]. Moreover, multiple sequence alignment techniques can also align pairs of sequences, and many multiple alignment methods are constructed from pairwise alignment methods.

C. Local and Global Alignment

Usually the global scoring method is used; in some cases, however, local scoring gives much more accurate results [15]. In these particular cases, the sequences are very similar at certain parts, but they are not similar enough over all the residues. For instance, if one compares twins in terms of biometric features such as fingerprints, voices or retinas, the conclusion may be, in definite terms, that they are two different people. However, if they are compared based on their faces or general appearance, the conclusion may be that they are two identical people.

D. Viterbi Algorithm

The Viterbi algorithm was first introduced in 1967 by Andrew Viterbi [16]. It is still used widely in digital wireless technology, especially in GSM networks, as well as in speech recognition, biological sequence analysis, and many other applications of Hidden Markov Models [17]. The role of the Viterbi algorithm is to find the most probable path through the Hidden Markov Model. The algorithm is as follows:

Initialization:

  V_0^M(0) = 0    (Begin state is M_0, and End state is M_{L+1})

Recursion:

  V_j^M(i) = log( e_{M_j}(x_i) / q_{x_i} ) + max { V_{j-1}^M(i-1) + log a_{M_{j-1} M_j},
                                                   V_{j-1}^I(i-1) + log a_{I_{j-1} M_j},
                                                   V_{j-1}^D(i-1) + log a_{D_{j-1} M_j} }

  V_j^I(i) = log( e_{I_j}(x_i) / q_{x_i} ) + max { V_j^M(i-1) + log a_{M_j I_j},
                                                   V_j^I(i-1) + log a_{I_j I_j},
                                                   V_j^D(i-1) + log a_{D_j I_j} }

  V_j^D(i) = max { V_{j-1}^M(i) + log a_{M_{j-1} D_j},
                   V_{j-1}^I(i) + log a_{I_{j-1} D_j},
                   V_{j-1}^D(i) + log a_{D_{j-1} D_j} }

Termination: the final score is V_{L+1}^M(n), calculated using the top (match-state) recursion.

The definitions of the notations above are given in full detail in reference [12].
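The idea behind the recursion can be sketched for a generic HMM in log space as follows; this is a minimal sketch, not the full profile-HMM version with match/insert/delete states, and the two-state model used in the usage example is a toy assumption.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path through an HMM, computed in log space."""
    # initialization
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    # recursion: for each state keep the best-scoring predecessor
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t - 1][p] + math.log(trans_p[p][s]))
            V[t][s] = (V[t - 1][prev] + math.log(trans_p[prev][s])
                       + math.log(emit_p[s][obs[t]]))
            back[t][s] = prev
    # termination: trace back from the best final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path)), V[-1][last]
```

Working in log space here is the same device that makes log-odds scoring robust against the underflow discussed in the Analysis section.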

III. ANALYSIS

Tests for this study have been run on 79 sequences, each of which has 702 residues. The test data consists of 19 sequences, each of which also has 702 residues. All experimental runs have been conducted on a laptop computer with a 2.0 GHz Intel Core 2 Duo CPU and 2 GB of RAM, running Windows 7. The well-aligned sequences have been used to increase the accuracy of the model. After the model is built, different scoring schemes such as product scoring, log-odds scoring and adaptive log-odds scoring have been applied. Furthermore, two different inputs have been used for the experiment: the first one has consecutive delete states for each sequence, and the other one starts with an amino acid.

TABLE 1. PRODUCT AND LOG-ODDS SCORES.

Sequence              Product Score            Log-odds Score
--------------------  -----------------------  ----------------------
Consensus Sequence    6.12244597482838e-30     3.702188618844783e+03
Sequence 1            1.25898111826206e-237    0.597837000755620
Sequence 2            …                        0.158470340971775
Sequence 3            3.06738889497494e-234    0.158470340971775
Sequence 4            3.06738889497494e-234    0.158470340971775
Sequence 5            2.08494632555500e-265    -0.213093215460708
Sequence 6            8.90270914893291e-243    0.158470340971775
Sequence 7            9.14686306441277e-253    0.158470340971775
Sequence 8            5.60574504783892e-261    0.597837000755620
Sequence 9            1.84795703551402e-245    0.158470340971775
Sequence 10           2.62909196013500e-247    0.158470340971775
Sequence 11           1.54967049847689e-242    0.158470340971775
Sequence 12           1.02534586445265e-245    0.158470340971775
Sequence 13           2.83445065308080e-238    0.158470340971775
Sequence 14           1.48759267081086e-262    0.597837000755620
Sequence 15           7.26782218738664e-240    0.158470340971775
Sequence 16           0                        0.158470340971775
…                     …                        …
Sequence 73           0                        -0.164303051291276
Sequence 74           0                        -0.213093215460708
Sequence 75           0                        -0.213093215460708
Sequence 76           0                        0.346522572474714
Sequence 77           0                        -0.906240396020654
Sequence 78           0                        0.597837000755620
Sequence 79           0                        -0.213093215460708
Exceptional Sequence  …                        -2.24484700338574e+03
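The log-odds figures in Table 1 compare the probability of a sequence under the model with its probability under a background (null) model. A minimal sketch for independent per-position match-state emissions follows; the emission and background numbers in the usage example are illustrative assumptions, not the model of this study.

```python
import math

def log_odds(seq, emit, background):
    """Sum over positions of log( P(residue | model) / P(residue | background) ).

    emit[i] maps each residue to its emission probability at position i;
    background maps each residue to its null-model probability.
    """
    return sum(math.log(emit[i][c] / background[c]) for i, c in enumerate(seq))
```

A positive total means the model explains the sequence better than the background, while a negative total (as for sequences 5 and 73-75 in Table 1) means the background fits better.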

A. Product and Log-Odds Score Analysis

Sequence alignment methods mostly involve scoring matrices for the alignment process. The stochastic methods, on the contrary, do not require substitution matrices, because they use their own matrices obtained from the input sequences. According to probability theory, the probability of a sequence is the product of the probabilities of its elements; the difference in the HMM case is the states, because the probability of the transition from one state to another must also be included in the evaluation. The weakness of this scoring scheme is underflow error. There are a great many elements, and their probabilities all lie between 0 and 1, so the product over this massive number of elements is eventually rounded to 0. This effect is shown in Table 1: after sequence 73, all the sequences were calculated as 0. Because of this, the product scoring scheme is not a proper calculation method for the problem.

Another drawback is that the results for the sequences come out close to each other. In addition, the scores are too small to be represented by the computer; for example, the largest one is for sequence 3, and its score is only about 3 × 10^-234. For most of the sequences the computer could not evaluate the score and hence produced 0 as the score value. As an outcome, the sequences cannot be distinguished from each other. This situation can be observed very clearly in Figure 1.

Fig. 1. Product scores of the given sequences.
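The underflow described above is easy to reproduce: multiplying a few hundred probabilities drops below the smallest representable double, while summing their logarithms stays finite. The per-residue probability of 0.3 is an illustrative assumption; the length 702 matches the sequences used in this study.

```python
import math

probs = [0.3] * 702          # one probability per residue of a 702-residue sequence

product = 1.0
for p in probs:
    product *= p             # raw product: underflows to exactly 0.0

log_score = sum(math.log(p) for p in probs)   # log-space sum: stays finite

print(product)     # 0.0
print(log_score)   # about -845.2
```

This is exactly why the product column of Table 1 collapses to zeros while the log-odds column still discriminates between sequences.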

B. Input Sequence Analysis

In this study, two different input sequences have been used to build the model. In fact, the two are identical sequences; the difference between them is the number of delete states. The numbers of residues and dashes for model 1 and model 2 are shown in Figure 2 and Figure 3, respectively.

Fig. 2. Amino acid distribution for the Model 1.

Fig. 3. Amino acid distribution for the Model 2.

The vast number of delete states at the beginning of the sequences affects model 1. Because of the massive runs of dashes at the beginning of each sequence, the results come out close to each other, and only a few sequences differ. The reason is that the maximum values returned for most of the sequences are very close to each other; as a consequence, they have produced the same score values in the Viterbi algorithm. This effect is shown in Figure 4.

Fig. 4. Log-odds scores for the Model 1.

On the other hand, the results are quite different for model 2, which has well-distributed residues. The log-odds scores of this model are shown in Figure 5. Contrary to model 1, model 2 has produced much better results: mostly, the scores for the sequences are different from each other. This is very important when aligning sequences, as the conclusion will be drawn based on these scores. For example, according to model 1 the results are the same from sequence 1 to 70, which makes one think that those sequences are either identical or from the same family; but, as can clearly be seen from model 2, they are not exactly the same. This situation is very similar to the contrast between global and local scoring.

Fig. 5. Log-odds scores for the Model 2.
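The contrast between the two input models can be made concrete with a small sketch of how match states might be chosen and their emission probabilities estimated from an alignment. The 50% gap threshold, the DNA alphabet and the pseudocount are illustrative assumptions, not the exact procedure of this study.

```python
def match_columns(alignment, max_gap_frac=0.5):
    """Columns with at most max_gap_frac gaps become match states."""
    n = len(alignment)
    return [j for j in range(len(alignment[0]))
            if sum(row[j] == "-" for row in alignment) / n <= max_gap_frac]

def emission_probs(alignment, cols, alphabet="ACGT", pseudocount=1.0):
    """Pseudocount-smoothed emission probabilities for the chosen match columns."""
    probs = []
    for j in cols:
        counts = {a: pseudocount for a in alphabet}
        for row in alignment:
            if row[j] in counts:
                counts[row[j]] += 1
        total = sum(counts.values())
        probs.append({a: c / total for a, c in counts.items()})
    return probs
```

Under such a rule, long runs of leading dashes (as in model 1's input) either inflate delete transitions or drop columns from the match path entirely, which is consistent with the flattened scores seen in Figure 4.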

C. Adaptive Log-odds Score Analysis

When one compares the sequences, it is possible to obtain bad results even for sequences that are similar enough to the protein family. To avoid this, the adaptive log-odds scoring scheme is used in this study. Similarly to a neural-network weighting algorithm, the weak results are improved. To be able to determine a limit for the weak results, a threshold value needs to be specified, and the log-odds scores are calculated according to this threshold. The comparison of the three different log scoring schemes is given in Figure 6.

Fig. 6. A comparison of different scoring schemes.
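The paper does not spell out the adaptive update rule, so the following is only a hypothetical sketch of threshold-based score adjustment in the spirit described above; the threshold and rate values are assumptions.

```python
def adaptive_log_odds(scores, threshold=0.0, rate=0.5):
    """Hypothetical adaptive scheme: pull scores below the threshold toward it,
    in the manner of a neural-network-style weight update."""
    return [s + rate * (threshold - s) if s < threshold else s for s in scores]
```

With rate = 0.5, a weak score of -2.0 against a threshold of 0.0 is lifted to -1.0, while scores already at or above the threshold are left unchanged.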

IV. CONCLUSION

Biology research becomes more computer-dependent day by day, and the necessity of computer support has given rise to a new discipline: Computational Biology. HMMs can handle many Computational Biology problems; one can use them for tasks from pairwise alignment to database searches. In this paper, the Hidden Markov Model and its applications have been introduced. Furthermore, the Viterbi algorithm has been used to find the most probable path through the Hidden Markov Model. Two different inputs have been used to build Hidden Markov Models, and each model has been tested with the product scores and the log-odds scores. Moreover, three different score calculation schemes have been compared; the differences amongst them are shown in the table and the figures. A database search was also performed and its results are shown.

In recent years, scientists have developed hybrid solutions for Hidden Markov Models. In some studies, the Hidden Markov Model was trained by quantum-behaved particle swarm optimization [18]. Another approach is to modify the Markov model itself; reduced state-space Hidden Markov Models and segmental semi-Markov models are examples [19, 20]. Furthermore, different state structures of the Hidden Markov Model, such as the 9-state hidden Markov model, are used for bioinformatics tasks [21]. In conclusion, Hidden Markov Models are very powerful tools for Computational Biology and its applications, and new approaches and studies will develop this newly arisen discipline in the future.

REFERENCES

[1] Engelbert Buxbaum, Fundamentals of Protein Structure and Function, Springer, 2007.
[2] Barrett, C., Hughey, R., and Karplus, K., Computer Applications in the Biosciences 13, pp. 191-199, 1997.
[3] Anna Tramontano, The Ten Most Wanted Solutions in Protein Bioinformatics, Chapman & Hall/CRC, 2005.
[4] C. A. Orengo, D. T. Jones and J. M. Thornton, Bioinformatics: Genes, Proteins & Computers, BIOS, 2005.
[5] Bucher, P., Karplus, K., Moeri, N., and Hofmann, K., Computers and Chemistry 20, pp. 3-24, 1996.
[6] Eddy, S. R., Mitchison, G., and Durbin, R., "Maximum discrimination hidden Markov models of sequence consensus", J. Comput. Biol. 2, pp. 9-23, 1995.
[8] A. Krogh, "An introduction to hidden Markov models for biological sequences", Computational Methods in Molecular Biology, Elsevier, pp. 45-63, 1998.
[9] Arthur M. Lesk, Introduction to Protein Architecture, Oxford University Press, 2003.
[10] Carl Branden and John Tooze, Introduction to Protein Structure, Garland Publishing, New York, 1999.
[11] Eddy, S. R., "Multiple alignment using hidden Markov models", Proc. of Third Int. Conf. on Intelligent Systems for Molecular Biology, volume 3, pp. 114-120, Menlo Park, CA, AAAI Press, 1995.
[12] R. Durbin, S. Eddy, A. Krogh, G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, Cambridge, 2003.
[13] Durbin, R. M., Eddy, S. R., Krogh, A., and Mitchison, G., Biological Sequence Analysis, Cambridge University Press, 1998.
[14] S. R. Eddy, "Profile hidden Markov models", Bioinformatics, vol. 14, pp. 755-763, 1998.
[15] Rabiner, L. R., "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proc. IEEE 77, pp. 257-286, 1989.
[16] Viterbi, A. J., "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm", IEEE Transactions on Information Theory 13(2), pp. 260-269, 1967.
[17] M. S. Ryan and G. R. Nudd, The Viterbi Algorithm, Technical Report, University of Warwick, 1993.
[18] Sun J., Wu X., Fang W., Ding Y., Long H., and Xu W., "Multiple sequence alignment using Hidden Markov Model trained by an improved quantum-behaved particle swarm optimization", Information Sciences 182, Elsevier, pp. 93-114, 2010.
[19] Lampros C., Papaloukas C., Exarchos K., Fotiadis D. I., "Improving the protein fold recognition of a reduced state-space hidden Markov model", Computers in Biology and Medicine 39, Elsevier, pp. 907-914, 2009.
[20] Bidargaddi N. P., Chetty M., Kamruzzaman J., "Combining segmental semi-Markov models with neural networks for protein secondary structure prediction", Neurocomputing 72, Elsevier, pp. 3943-3950, 2009.
[21] Lee, S. Y., Lee, J. Y., Jung, K. S., and Ryu, K. H., "A 9-state hidden Markov model using protein secondary structure information for protein fold recognition", Computers in Biology and Medicine 39, Elsevier, pp. 527-534, 2009.