Sequence Comparison
Pharm 207 / Bio 207 Lecture 2
Kutbuddin Doctor

DNA & Protein Sequence Comparison
- Sequence is often known early in analysis.
- Protein sequence confers more information.
- Alignment between sequences
  - Various algorithms for different needs
- Comparison to annotated sequences for:
  - Structural clues
  - Evolutionary relationship
  - Functional inference
- NB: often annotation itself is inferred from prior sequence analysis.

Amino Acid Structure
[Figure: amino acid structure]

Sequence Comparison
- Similarity
  - Physical similarity between sequences
  - Alignments
- Homology
  - Similarity due to evolutionary relationship.
  - Common ancestral gene.
  - Either homologous or NOT.
- Analysis of similarity gives evidence for/against the hypothesis of homology.
- Understanding the statistical and biological significance of similarity is critical to deciding on homology.

The 20 Amino Acids

1-letter  3-letter  Name
G         Gly       Glycine
P         Pro       Proline
A         Ala       Alanine
V         Val       Valine
L         Leu       Leucine
I         Ile       Isoleucine
M         Met       Methionine
C         Cys       Cysteine
F         Phe       Phenylalanine
Y         Tyr       Tyrosine
W         Trp       Tryptophan
H         His       Histidine
K         Lys       Lysine
R         Arg       Arginine
Q         Gln       Glutamine
N         Asn       Asparagine
E         Glu       Glutamic acid
D         Asp       Aspartic acid
S         Ser       Serine
T         Thr       Threonine

[Figures: amino acid side chains grouped by property: Hydrophobic, ~Hydrophilic, Hydrophilic, Aromatic, Basic, Acid-amide; charged groups are marked + and -. From Creighton, TE, “Proteins”, 1993.]
See web page!

Manual Sequence Alignments
Based on AA properties

[Figure: hand alignments of short example strings such as ANALYSIS and ANALYZED; "|" marks identical residues and ":" marks similar residues.]

  ASCDEFG
  |:|:|||
  ATCEEFG

Local, not global alignment.

Manual Sequence Alignments
Based on AA properties

  TKTYFPHF-DLSH-----GSAQVKGHGKKV
  |  :|  | |||      |: :|| |||||
  TQRFFESFGDLSTPDAVMGNPKVKAHGKKV
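
The marking convention used in the manual alignments can be sketched in a few lines of Python. This is only an illustration: the property groups in `PROPERTY_CLASSES` are an assumed, simplified grouping and may not match the classes shown on the slides, and `match_line` is a hypothetical helper name.

```python
# Assumed, simplified property classes (not taken from the slides).
PROPERTY_CLASSES = [
    set("GAVLIMPC"),  # hydrophobic / small
    set("FYW"),       # aromatic
    set("KRH"),       # basic
    set("DE"),        # acidic
    set("NQ"),        # acid amides
    set("ST"),        # hydroxyl
]


def match_line(a, b):
    """Return the marker line for two already-aligned sequences.

    '|' = identical residues, ':' = same assumed property class,
    ' ' = mismatch or a gap ('-') in either sequence.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    marks = []
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            marks.append(" ")
        elif x == y:
            marks.append("|")
        elif any(x in group and y in group for group in PROPERTY_CLASSES):
            marks.append(":")
        else:
            marks.append(" ")
    return "".join(marks)


if __name__ == "__main__":
    top, bottom = "ASCDEFG", "ATCEEFG"
    print(top)
    print(match_line(top, bottom))
    print(bottom)
```

With these assumed groups, `match_line("ASCDEFG", "ATCEEFG")` gives `|:|:|||`, since S/T and D/E fall in the same class. A different grouping would move some ":" marks.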

Homology / Homoplasy from Similarity
- Homology:
  - similarity due to common ancestor
- Homoplasy:
  - similarity due to convergence
  - Common constraint pushing independent evolution towards a common feature
    - Example: lack of legs in snakes, although they are reptiles.
- Homology strongly suggests similar structure and function
- Homoplasy is a rare event.

More on Homology
- Direct evidence usually not available
  - Ancestral common organisms are no longer present.
    - extinct or very changed over millions of years
- Homology is an inference
  - Based on evidence of similarity + statistical model
  - Your evaluation of the statistics gives a degree of certainty
  - Actual homology is black or white
- Significant similarity is often directly interpreted as homology
- Caveat: horizontal transfer of genes between organisms

Sequence Alignment Algorithms
- Gribskov lecture: BIMM140 (17th April)
  - http://www.sdsc.edu/~gribskov/bimm140/
- Simplified discussion..

Global vs Local alignments
- Alignment is a measure of similarity
- Global:
  - attempts to span full length of sequences
  - Used when aligning domains
- Local:
  - Will try to find all the sub-sequences with good scores.
  - Dotplots can reveal local alignments
  - BLAST finds high-scoring local alignments against a large database of proteins

[Figure: Human FADD vs cFLIP]

Dynamic Programming
- SPEED! Searching for optimal alignments across a database of millions of sequences.
- See diagrams: http://www.sdsc.edu/~gribskov/bimm140/lectures/2002c4.pdf (page 10 of PDF, page 37 in text)
- Score between pairing of each AA (more detail later)
- Choosing the optimal path between sequences => optimal alignment
- Cumulative score of the path
- “Dynamic programming” is a faster method to find the optimal global path (Needleman and Wunsch, 1970), looking at only “positive” paths.
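
The idea of building the optimal path from cumulative scores can be sketched as a small Needleman-Wunsch style routine. This is a minimal illustration rather than the lecture's own implementation: the match, mismatch and gap values are arbitrary assumptions, the gap penalty is linear, and the function name is hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment.

    Scores are illustrative assumptions; a linear gap penalty is used.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Each cell keeps the best cumulative score of any path reaching it.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + s,  # pair a[i-1] with b[j-1]
                score[i - 1][j] + gap,    # gap in b
                score[i][j - 1] + gap,    # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal path.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

The table has (len(a)+1) x (len(b)+1) cells and each is filled once, which is why this is so much faster than enumerating every possible path.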

Scoring Matrices
- Scoring function is critical to finding good alignments
- DNA: absolute match or mismatch
- Protein: absolute, conservative or mismatch
  - Property based (hydrophobic, polar, basic, etc)
  - Based on experimental data (PAM, BLOSUM)
    - Expected via the process of evolution (cumulative point mutations in DNA).
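
The difference between the two-level DNA score and the three-level protein score can be illustrated as follows. The property groups and the numeric values are assumptions made up for this sketch; an empirical matrix such as PAM or BLOSUM would replace `protein_score` in practice.

```python
# Assumed property groups and score values, for illustration only.
GROUPS = ("GAVLIMPC", "FYW", "KRH", "DE", "NQ", "ST")


def dna_score(x, y, match=1, mismatch=-1):
    """DNA: a pair either matches exactly or it does not."""
    return match if x == y else mismatch


def protein_score(x, y, identical=2, conservative=1, mismatch=-1):
    """Protein: identical, conservative (same assumed group) or mismatch."""
    if x == y:
        return identical
    if any(x in group and y in group for group in GROUPS):
        return conservative
    return mismatch


def ungapped_score(a, b, pair_score):
    """Sum the pair scores of two equal-length, gap-free sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(pair_score(x, y) for x, y in zip(a, b))


if __name__ == "__main__":
    print(ungapped_score("ACGT", "ACGA", dna_score))
    print(ungapped_score("ASCDEFG", "ATCEEFG", protein_score))
```

Passing the pair-scoring function as an argument keeps the alignment logic independent of the scoring scheme, so a table lookup could be swapped in without other changes.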

Global vs Local alignments
- Needleman Wunsch (1970) Global alignment
  - Dynamic programming allows one to look at only “sub-optimal” paths to find the optimal global alignment.
  - “Sub-optimal” alignments are extended to reach each end.
- Local alignments simply free the last constraint => stop the alignment when the cumulative score becomes 0.
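
Dropping the "reach each end" constraint amounts to never letting a cumulative score fall below zero, which is the Smith-Waterman approach to local alignment. The sketch below assumes simple match, mismatch and linear gap scores chosen for illustration; the function name is hypothetical.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, local_a, local_b) for one best local alignment.

    Scores are illustrative assumptions; a linear gap penalty is used.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 option lets an alignment restart instead of going negative.
            H[i][j] = max(
                0,
                H[i - 1][j - 1] + s,
                H[i - 1][j] + gap,
                H[i][j - 1] + gap,
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell and stop when the score reaches 0.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("XXXACGTXXX", "YYACGTYY"))
```

Only the shared ACGT core is reported in this example; the unrelated flanking residues never enter the alignment because extending into them would lower the score.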

Gaps
- Gaps represent insertions and deletions in one or the other of the sequences being aligned.
- Too large a gap or too many gaps may represent an implausible set of mutations, or perhaps loss of function.
- Separate gap penalties in the score for creation and extension.
- Empirical rationale that gap creation should cost more than extension.
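
Separate creation and extension penalties are usually called an affine gap penalty. The sketch below assumes the convention that a gap of length k costs `gap_open + (k - 1) * gap_extend`; other tools charge `gap_open + k * gap_extend`. The penalty values and the function name are placeholders.

```python
import re


def affine_gap_penalty(aligned, gap_open=10, gap_extend=1):
    """Total penalty for all gap runs ('-') in one aligned sequence.

    Assumed convention: the first gap position costs gap_open and each
    further position in the same run costs gap_extend.
    """
    total = 0
    for run in re.finditer(r"-+", aligned):
        length = run.end() - run.start()
        total += gap_open + gap_extend * (length - 1)
    return total


if __name__ == "__main__":
    # One single-residue gap and one five-residue gap.
    print(affine_gap_penalty("TKTYFPHF-DLSH-----GSAQVKGHGKKV"))
```

Under these assumed values the two gaps cost 10 + 14 = 24, whereas charging 10 for every gap position would cost 60. This reflects the rationale that one long insertion or deletion is more plausible than many independent ones.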

Statistical Significance of Alignment
- Local alignment scores have well-understood statistics; global alignments are more difficult to assess.
- What is the likelihood that a given alignment could have occurred by chance (not due to common ancestry)?
- We expect scores to have an extreme value distribution (as opposed to a normal distribution).
- Scores well outside of the expected distribution are unlikely to occur by chance.
- These may be due to HOMOLOGY!
- p-value:
  - probability that the score would occur by chance
- E-value: expectation value
  - With the score you observe (or better), how many alignments against this database do you expect by chance?
- Lower E-value => more likely homology than chance.

Database Searching
- One query sequence, aligned pair-wise to all in the database
- Heuristics and parallel computing
  - Heuristic methods make assumptions at the risk of missing some alignments.
  - FASTA and BLAST use heuristics and can also be implemented in parallel.
  - Word-based comparisons rather than individual AAs.

Filtering
- Both protein and DNA sequences contain low-complexity regions, and DNA sequences in particular contain repetitive elements.
- Filtering attempts to remove the confusion such regions can cause to a database search.
- This can be done either by first checking the query sequence or by filtering the results.

Dotplots
- Two sequences compared across two axes
- Similarity is plotted as dots between the sequences
- A dotplot using identical sequences will have one straight diagonal across.
- Repetitive elements are visible
- Gaps are jumps between diagonal elements
- Regions (patches) of higher similarity are visible
- Exercise 1: Analyze the self-dotplot for your protein
  - Comment on repetitive elements.
  - Comment on the effect of window sizes, scales
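
A text-only dotplot is enough to see the diagonal, the effect of repeats and the effect of window size. The sketch below is an assumption-laden toy: it counts identities only (no scoring matrix), and the `window` and `threshold` parameters and the function name are hypothetical, not those of any particular dotplot program.

```python
def dotplot(a, b, window=1, threshold=None):
    """Return the plot as a list of strings, one row per start position in a.

    A '*' at row i, column j means the windows starting at a[i] and b[j]
    share at least `threshold` identical positions (default: all of them).
    """
    if threshold is None:
        threshold = window
    rows = []
    for i in range(len(a) - window + 1):
        row = []
        for j in range(len(b) - window + 1):
            matches = sum(a[i + k] == b[j + k] for k in range(window))
            row.append("*" if matches >= threshold else ".")
        rows.append("".join(row))
    return rows


if __name__ == "__main__":
    for line in dotplot("ABAB", "ABAB"):
        print(line)
```

For a self-comparison the main diagonal is always filled; the repeated unit in the example adds parallel off-diagonal lines. Raising `window` removes isolated chance dots at the price of blurring short features.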

BLAST & PSI-BLAST
- Link available on class web page.
- Use BLAST to find homologs to your query sequence.
- Use E-values to differentiate homologs from non-homologs
- PSI-BLAST:
  - Iterative BLAST in which the scoring matrix is “customized” for every position in your query sequence based on found “homologs”.
  - A conserved functional His across many homologs for this protein will carry a higher penalty for replacement than His at other positions in the sequence.
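
When reading E-values from a search, it can help to see how they relate to a raw score and to a p-value. The sketch below uses the standard Karlin-Altschul form E = K * m * n * exp(-lambda * S) and the relation P = 1 - exp(-E). The default `K` and `lam` values are placeholders assumed for illustration; real values depend on the scoring matrix and gap penalties, and BLAST additionally applies length corrections that are not modelled here.

```python
import math


def expect_value(score, query_len, db_len, K=0.041, lam=0.267):
    """Expected number of chance alignments scoring >= score.

    K and lam are assumed placeholder parameters, not values from the
    lecture; no edge-effect length correction is applied.
    """
    return K * query_len * db_len * math.exp(-lam * score)


def p_value(e_value):
    """Probability of at least one chance alignment, given the E-value."""
    return -math.expm1(-e_value)  # equals 1 - exp(-E), accurate for small E


if __name__ == "__main__":
    for s in (40, 80, 120):
        e = expect_value(s, query_len=300, db_len=1_000_000)
        print(s, e, p_value(e))
```

For small E the two numbers are nearly identical, which is why low E-values are often read directly as probabilities. The E-value also grows in proportion to database size, so the same score is less significant against a larger database.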