DNA & Protein Sequence Comparison

Pharm 207 / Bio 207, Lecture 2
Kutbuddin Doctor

Sequence Comparison

- Sequence is often known early in analysis
- Protein sequence confers more information
- Alignment between sequences
    - Physical similarity between sequences
- Alignments
    - Various algorithms for different needs
- Comparison to annotated sequences for:
    - Structural clues
    - Evolutionary relationship
    - Functional inference
- NB: often annotation itself is inferred from prior sequence analysis

Homology

- Similarity due to evolutionary relationship
- Common ancestral gene
- Either homologous or NOT

Similarity

- Analysis of similarity gives evidence for/against the hypothesis of homology
- Understanding the statistical and biological significance of similarity is critical to deciding on homology

Amino Acid Structure

The 20 Amino Acids

1-letter  3-letter  Name
G         Gly       Glycine
P         Pro       Proline
A         Ala       Alanine
V         Val       Valine
L         Leu       Leucine
I         Ile       Isoleucine
M         Met       Methionine
C         Cys       Cysteine
F         Phe       Phenylalanine
Y         Tyr       Tyrosine
W         Trp       Tryptophan
H         His       Histidine
K         Lys       Lysine
R         Arg       Arginine
Q         Gln       Glutamine
N         Asn       Asparagine
E         Glu       Glutamic acid
D         Asp       Aspartic acid
S         Ser       Serine
T         Thr       Threonine

[Figures: amino acid side-chain structures with charges marked + or -. Panel labels: Hydrophobic, ~Hydrophilic, Hydrophilic, Acid-amide, Basic, Aromatic. From Creighton, TE, “Proteins”, 1993.]

See web page!

Manual Sequence Alignments
Based on AA properties

[Slide examples: hand alignments of short toy sequences such as ANALYSIS, ANALYZED, ASCDEFG and ATCEEFG, with identical residues marked "|" and similar residues marked ":".]

Local, not global alignment.

T K T Y F P H F - D L S H - - - - - G S A Q V K G H G K K V
|     : |     |   | | |             | :     | | : | | | | |
T Q R F F E S F G D L S T P D A V M G N P K V K A H G K K V
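Marking up a finished alignment by amino acid properties can be sketched in a few lines of Python. This is an illustration only: the property groups below and the function name `match_line` are assumptions made for the example, not the grouping used on the slides, so the marks it produces may differ from the hand-drawn ones.

```python
# Hypothetical property groups, chosen for illustration only.
PROPERTY_GROUPS = (
    frozenset("AVLIMC"),  # hydrophobic
    frozenset("FYW"),     # aromatic
    frozenset("KRH"),     # basic
    frozenset("DE"),      # acidic
    frozenset("NQ"),      # acid-amide
    frozenset("ST"),      # hydroxyl
)


def match_line(seq_a, seq_b):
    """Return the marker line for two already-aligned sequences.

    '|' marks identical residues, ':' marks residues from the same
    property group, and ' ' marks everything else, including gaps.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    marks = []
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            marks.append(" ")
        elif a == b:
            marks.append("|")
        elif any(a in group and b in group for group in PROPERTY_GROUPS):
            marks.append(":")
        else:
            marks.append(" ")
    return "".join(marks)


if __name__ == "__main__":
    top, bottom = "ASCDEFG", "ATCEEFG"
    print(top)
    print(match_line(top, bottom))
    print(bottom)
```

Changing the groups changes which substitutions count as conservative, which is exactly the judgment a manual alignment relies on.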


Homology / Homoplasy from Similarity

- Homology:
    - Similarity due to common ancestor
- Homoplasy:
    - Similarity due to convergence
    - Common constraint pushing independent evolution towards a common feature
        - e.g. lack of legs in snakes, although they are reptiles
- Homology strongly suggests similar structure and function
- Homoplasy is a rare event

More on Homology

- Direct evidence usually not available
    - Ancestral common organisms are no longer present: extinct or very changed over millions of years
- Homology is an inference
    - Based on evidence of similarity + statistical model
    - Your evaluation of the statistics gives a degree of certainty
    - Actual homology is black or white
- Significant similarity is often directly interpreted as homology
    - Caveat: horizontal transfer of genes between organisms

Sequence Alignment Algorithms

- Gribskov lecture: BIMM140 (17th April)
    - http://www.sdsc.edu/~gribskov/bimm140/
- Simplified discussion here

Global vs Local alignments

- Alignment is a measure of similarity
- Global:
    - Attempts to span the full length of the sequences
    - Used when aligning domains
- Local:
    - Will try to find all the sub-sequences with good scores
    - Dotplots can reveal local alignments
    - BLAST finds high-scoring local alignments against a large database of proteins

[Figure: Human FADD vs cFLIP]


Dynamic Programming

- SPEED! Searching for optimal alignments across a database of millions of sequences.
- See diagrams: http://www.sdsc.edu/~gribskov/bimm140/lectures/2002c4.pdf (page 10 of PDF, page 37 in text)
- Score between pairing of each AA (more detail later)
- Choosing the optimal path between sequences => optimal alignment
- Cumulative score of the path
- “Dynamic programming” is a faster method to find the optimal global path (Needleman and Wunsch, 1970), looking at only “positive” paths.
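The idea of accumulating pair scores along a path and then choosing the best path can be sketched as a small Needleman-Wunsch style global alignment. This is a minimal illustration, not the lecture's implementation: the function name and the match, mismatch and gap values are assumptions, and a single linear gap cost is used for simplicity.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming (illustrative scores).

    Returns (score, aligned_a, aligned_b).
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best cumulative score aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    # Trace the optimal path back from the bottom-right corner.
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and a[i - 1] == b[j - 1]
        pair = match if same else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(needleman_wunsch("ANALYSIS", "ANALYZED"))
```

Each cell is filled once from its three neighbours, so the work grows with the product of the two sequence lengths rather than with the number of possible paths.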

Scoring Matrices

- Scoring function is critical to finding good alignments
- DNA: absolute match or mismatch
- Protein: absolute, conservative or mismatch
    - Property based (hydrophobic, polar, basic, etc)
    - Based on experimental data (PAM, BLOSUM)
        - Expected via the process of evolution (cumulative point mutations in DNA)
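The difference between a two-level DNA score and a three-level, property-based protein score can be sketched as follows. The groups and numeric values are assumptions for illustration only; real searches would normally use an empirical matrix such as PAM or BLOSUM instead of this hand-made scheme.

```python
# Hypothetical property groups and score values, for illustration only.
CONSERVATIVE_GROUPS = (
    frozenset("AVLIMC"),  # hydrophobic
    frozenset("FYW"),     # aromatic
    frozenset("KRH"),     # basic
    frozenset("DE"),      # acidic
    frozenset("NQ"),      # acid-amide
    frozenset("ST"),      # hydroxyl
)


def dna_score(x, y, match=1, mismatch=-1):
    """DNA: a position is either an absolute match or a mismatch."""
    return match if x == y else mismatch


def protein_score(x, y, identical=2, conservative=1, mismatch=-1):
    """Protein: identical, conservative (same property group) or mismatch."""
    if x == y:
        return identical
    if any(x in group and y in group for group in CONSERVATIVE_GROUPS):
        return conservative
    return mismatch


def ungapped_score(seq_a, seq_b, pair_score):
    """Sum the pair scores over two equal-length, ungapped sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(pair_score(x, y) for x, y in zip(seq_a, seq_b))


if __name__ == "__main__":
    print(ungapped_score("GATTACA", "GATCACA", dna_score))
    print(ungapped_score("ASCDEFG", "ATCEEFG", protein_score))
```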

Global vs Local alignments

- Needleman-Wunsch (1970)
- Global alignment
    - Dynamic programming allows one to look at only “sub-optimal” paths to find the optimal global alignment.
    - “Sub-optimal” alignments are extended to reach each end.
- Local alignments simply free the last constraint => stop the alignment when the cumulative score becomes 0.
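Freeing the end constraint can be sketched by letting every cell fall back to 0 and by starting the traceback at the best-scoring cell; this variant is commonly known as Smith-Waterman local alignment. The function name and score values below are assumptions for illustration, and only one best local alignment is reported.

```python
def local_alignment(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment by dynamic programming (illustrative scores).

    Returns (best_score, aligned_a, aligned_b) for one best local alignment.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,                           # start a fresh alignment here
                score[i - 1][j - 1] + pair,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell and stop when the score reaches 0.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(local_alignment("XXANALYSISXX", "QQANALYSISQQ"))
```

The unrelated flanking residues never enter the result, because the cumulative score there cannot rise above 0.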

Gaps

- Gaps represent insertions and deletions in one or the other of the sequences being aligned.
- Too large a gap or too many gaps may represent an implausible set of mutations, or perhaps loss of function.
- Separate gap penalties in the score for creation and extension.
- Empirical rationale that gap creation should cost more than extension.
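Separate creation and extension penalties can be sketched as an affine gap cost. The values 10 and 1 and the function names are assumptions for illustration, and conventions differ between programs on whether the first gap position also pays the extension cost; here it pays the creation cost only.

```python
from itertools import groupby


def affine_gap_penalty(length, gap_open=10, gap_extend=1):
    """Cost of one gap: gap_open for the first position,
    gap_extend for each additional position (illustrative values)."""
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0
    return gap_open + gap_extend * (length - 1)


def total_gap_penalty(aligned_seq, gap_open=10, gap_extend=1):
    """Sum the affine penalties over every run of '-' in an aligned sequence."""
    total = 0
    for char, run in groupby(aligned_seq):
        if char == "-":
            total += affine_gap_penalty(len(list(run)), gap_open, gap_extend)
    return total


if __name__ == "__main__":
    # One gap of six positions is far cheaper than six separate gaps.
    print(affine_gap_penalty(6))      # 15
    print(6 * affine_gap_penalty(1))  # 60
    print(total_gap_penalty("TKTYFPHF-DLSH-----GSAQVKGHGKKV"))
```

With these values a single long insertion or deletion is penalized much less than many scattered ones, which reflects the rationale that creating a gap should cost more than extending it.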


Statistical Significance of Alignment

- Local alignments have good statistical scoring; global alignments are more difficult.
- What is the likelihood that a given alignment could have occurred by chance (not due to common ancestry)?
- We expect scores to have an extreme value distribution (as opposed to a normal distribution).
- Scores well outside of the expected distribution should not be occurring by chance. These may be due to HOMOLOGY!
- p-value:
    - Probability that the score would occur by chance
- E-value: expectation value
    - With the score you observe (or better), how many alignments against this database do you expect by chance?
- Lower E-value => more likely homology than chance.

Database Searching

- One query sequence, aligned pair-wise to all in the database
- Heuristics and parallel computing
- Heuristic methods make assumptions at the risk of missing some alignments.
- FASTA and BLAST use heuristics and can also be implemented in parallel.
- Word-based comparisons rather than individual AAs.

Filtering

- Both protein and DNA sequences contain low-complexity regions, and DNA sequences in particular contain repetitive elements.
- Filtering attempts to remove the confusion such regions can cause to a database search.
- This can be done either by first checking the query sequence or by filtering the results.

Dotplots

- 2 sequences compared across 2 axes
- Similarity is plotted as dots between the sequences
- A dotplot using identical sequences will have one straight diagonal across.
- Repetitive elements are visible
- Gaps are jumps between diagonal elements
- Regions (patches) of higher similarity are visible
- Exercise 1: Analyze the self-dotplot for your protein
    - Comment on repetitive elements.
    - Comment on the effect of window sizes and scales.
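A text-mode dotplot with an adjustable window can be sketched as below, which may help when thinking about the window-size part of the exercise. The function name, the `min_matches` threshold and the plotting characters are assumptions for illustration; the class tools may define the window and scoring differently.

```python
def dotplot(seq_x, seq_y, window=1, min_matches=None):
    """Return a list of text rows: '*' where the windows starting at
    (column in seq_x, row in seq_y) share at least min_matches identical
    positions, '.' elsewhere. By default every position must match."""
    if window < 1:
        raise ValueError("window must be at least 1")
    if min_matches is None:
        min_matches = window
    rows = []
    for i in range(len(seq_y) - window + 1):
        row = []
        for j in range(len(seq_x) - window + 1):
            matches = sum(seq_y[i + k] == seq_x[j + k] for k in range(window))
            row.append("*" if matches >= min_matches else ".")
        rows.append("".join(row))
    return rows


if __name__ == "__main__":
    # Self-dotplot: the main diagonal plus off-diagonal dots for the repeat.
    for line in dotplot("ABAB", "ABAB", window=2):
        print(line)
```

A larger window removes isolated chance matches at the cost of blurring short similar regions, and lowering `min_matches` below the window size lets imperfect diagonals show through.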


BLAST & PSI-BLAST

- Link available on class web page.
- Use BLAST to find homologs to your query sequence.
- Use E-values to differentiate homologs from non-homologs.
- PSI-BLAST:
    - Iterative BLAST in which the scoring matrix is “customized” for every position in your query sequence based on found “homologs”.
    - A conserved functional His across many homologs for this protein will have a higher penalty for replacement than other His at other positions in the sequence.
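The E-values used to separate homologs from non-homologs can be sketched with the Karlin-Altschul form E = K * m * n * exp(-lambda * S), where m and n are the query and database lengths and S is the raw score. This is a simplified illustration: the function names are made up for the example, and `lam` and `k` depend on the scoring matrix and gap penalties, so the values in the usage lines are placeholders rather than the parameters of any particular BLAST run.

```python
import math


def expect_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def p_value(e_value):
    """Probability of at least one chance alignment with such a score,
    assuming the number of chance hits is Poisson distributed."""
    return -math.expm1(-e_value)


if __name__ == "__main__":
    # Placeholder parameters, for illustration only.
    for score in (40, 60, 80):
        e = expect_value(score, query_len=300, db_len=1_000_000_000,
                         lam=0.3, k=0.1)
        print(score, e, p_value(e))
```

Because the score sits in the exponent, a modest increase in score lowers the E-value by orders of magnitude, while a larger database raises it in direct proportion.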
