Sep 19, 2012 - Dynamic programming. (Smith-Waterman). FASTA. BLAST. Multiple. Dynamic ... E -1 0 0 2 -4 2 5 -2 0 -3 -3 1
Challenging algorithms in bioinformatics Torbjørn Rognes
[email protected] Department of Informatics, UiO 19 September 2012
Definition:
What is bioinformatics?
Bioinformatics is the development and use of computational and mathematical methods to gather, process and interpret molecular biological data. Aim of research: To increase our understanding of the connections between biological processes at different levels while developing better theories and methods in computer science and statistics. An interdisciplinary subject: Computer science/statistics + biology/medicine
Bioinformatics at many levels in biology
Molecules DNA Genomics Assembly Gene finding
Organ
Neuroinformatics
RNA
Protein
Transcriptomics
Sequence analysis
Systems biology
RNomics
Structural biology Proteomics
Microarrays
Sequence analysis
RNA folding
Annotation
RNA-Seq
Read mapping
Cell
Drug design MS analysis
Organism
Population
Personal medicine
Population genetics
Cellular simulations
Evolutionary biology
SNP analysis
Cancer biology
Metagenomics Phylogenomics
ChIP-Seq All levels: mining the literature (natural language processing)
Biobank data analysis
Some basic molecular biology
The genome is our genetic material The genome consists of DNA
DNA is built from nucleotides
4 nucleotides or bases: A, C, G, U(racil)
RNA is translated into proteins based on codons (3 bases) Proteins consist of amino acids (on average 350 aa) There are 20 different natural amino acids:
4 nucleotides or bases: A(denine), C(ytosine), G(uanine), T(hymine)
Genes are pieces of DNA DNA is transcribed into RNA when genes are expressed RNA is similar to DNA, but with a different backbone and uracil
From ~2 to ~3000 million nucleotides
A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y
Proteins form 3D structures that act as molecular machines or as structural building blocks 4
Important challenges in bioinformatics Examples of important computational challenges in bioinformatics (hardest problems first):
Protein structure prediction Whole-genome de novo sequence assembly Multiple sequence alignment (with many divergent sequences) Sequence database searching Sequence read mapping
5
Protein structure prediction
Hardest problem (“Holy grail”): predict 3D protein structure directly from sequence
Protein secondary structure prediction (easier)
“ab initio“ “homology modelling” “threading” Predict helixes, strands and loops Not 3D
“Folding@Home”
6
Protein 3D structure and design
MPARALLPRRMGHRT LASTPALWASIPCPR SELRLDLVLPSGQSF RWREQSPAHWSGVLA DQVWTLTQTEEQLHC TVYRGDKSQASRPTP DELEAVRKYFQLDVT LAQLYHHWGSVD...
Structure prediction
Protein design
Whole genome sequence assembly
Genome sequencing results in millions of small pieces of the full genome The challenge is to puzzle these together in the right order Genome size ranging from 2Mbp (bacteria) to 3Gbp (human) Read size from 30 bp to 1000 bp Sequencing errors Natural variation (allels) Repeats and similar regions
Whole genome sequence assembly
All the pieces must be puzzled together
Two genome assembly strategies
The overlap-layout-consensus approach
Alignment variants and algorithms Global
Pairwise
Dynamic programming (Needleman-Wunsch)
Local
Dynamic programming (Smith-Waterman) FASTA BLAST
Multiple
Dynamic programming
Dynamic programming
ClustalW T-Coffee MUSCLE MAFFT
MEME
Note: dynamic programming gives an optimal solution, but is extremely space and time demanding with many sequences (n > 2) 13
Pairwise sequence alignment
E.coli AlkA
Hollis et al. (2000) EMBO J. 19, 758-766 (PDB ID 1DIZ)
Human OGG1
Source: Bruner et al. (2000) Nature 403, 859-866 (PDB ID 1EBM)
E.c. AlkA 127 SVAMAAKLTARVAQLYGERLDDFPE--YICFPTPQRLAAADPQA-LKALGMPLKRAEALI 183 ++| + |+ | +| || + | ||+ | || + +| |+ ||+ || + H.s. OGG1 151 NIARITGMVERLCQAFGPRLIQLDDVTYHGFPSLQALAGPEVEAHLRKLGLGY-RARYVS 209 E.c. AlkA 184 HLANAALE-----GTLPMTIPGDVEQAMKTLQTFPGIGRWTANYFAL | | || | |+| | | ||+| |+ | H.s. OGG1 210 ASARAILEEQGGLAWLQQLRESSYEEAHKALCILPGVGTKVADCICL
225 256
Common alignment scoring system Substitution score matrix
Score for aligning any two residues to each other Identical residues have large positive scores Similar residues have small positive scores Very different residues have large negative scores
Gap penalties
BLOSUM62 amino acid substituition score matrix
A R N D C Q E G H I L K M F P S T W Y V
A
R
N
D
C
Q
E
G
H
I
L
K
M
F
P
S
T
W
Y
V
4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0
-1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3
-2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3
-2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3
0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
-1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2
-1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2
0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3
-2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3
-1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3
-1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1
-1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2
-1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1
-2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1
-1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2
1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2
0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0
-3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3
-2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1
0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4
Penalty for opening a gap in a sequence (Q) Penalty for extending a gap (R) Typical gap function: G = Q + R * L, where L is length of gap Example: Q=11, R=1
E.c. AlkA 127 SVAMAAKLTARVAQLYGERLDDFPE--YICFPTPQRLAAADPQA-LKALGMPLKRAEALI 183 ++| + |+ | +| || + | ||+ | || + +| |+ ||+ || + H.s. OGG1 151 NIARITGMVERLCQAFGPRLIQLDDVTYHGFPSLQALAGPEVEAHLRKLGLGY-RARYVS 209 E.c. AlkA 184 HLANAALE-----GTLPMTIPGDVEQAMKTLQTFPGIGRWTANYFAL | | || | |+| | | ||+| |+ | H.s. OGG1 210 ASARAILEEQGGLAWLQQLRESSYEEAHKALCILPGVGTKVADCICL
225 256
How to find the best alignment(s)?
There are too many possible alignments of two sequences to enable examination of every possible alignment individually There is an algorithm to identify the best (optimal) alignment(s), i.e the one(s) with the highest score The algorithm is of the dynamic programming (DP) type The algorithm was initially described by Needleman and Wunsch in 1970 Two steps:
First, identify the highest possible score using DP Then, identify the alignment(s) with the highest score (using temporary results from the initial step)
Dynamic programming:
General method for solving recursive problems by storing temporary results from smaller problems along the way Used to solve many problems in bioinformatics
16
Dynamic programming Assume that we are aligning the sequences x and y. We may denote the cells using coordinates: φ
y1
y2
y3
y4
φ
(0,0)
(0,1)
(0,2)
(0,3)
(0,4)
x1
(1,0)
(1,1)
(1,2)
(1,3)
(1,4)
x2
(2,0)
(2,1)
(2,2)
(2,3)
(2,4)
x3
(3,0)
(3,1)
(3,2)
(3,3)
(3,4)
x4
(4,0)
(4,1)
(4,2)
(4,3)
(4,4)
• Any path to (s,t) corresponds to a global alignment of (x1:s, y1:t). • Any such alignment (w,z) has a score S(w,z). • Let B(s,t) be the highest score (= optimal score) among these 17
Dynamic programming Assume a score function S(...) with gap penalty g. We are going to fill in the optimal score in each cell: φ
y1
y2
y3
y4
φ
B(0,0)
B(0,1)
B(0,2)
B(0,3)
B(0,4)
x1
B(1,0)
B(1,1)
B(1,2)
B(1,3)
B(1,4)
x2
B(2,0)
B(2,1)
B(2,2)
B(2,3)
B(2,4)
x3
B(3,0)
B(3,1)
B(3,2)
B(3,3)
B(3,4)
x4
B(4,0)
B(4,1)
B(4,2)
B(4,3)
B(4,4)
Optimal score for the alignment of x and y
B(0, 0) = 0 B(i, 0) = S ( x1 , φ ) + ... + S ( xi , φ ) = −ig B(0, j ) = S (φ , y1 ) + ... + S (φ , y j ) = − jg B(i − 1, j − 1) + S ( xi , y j ) B(i,= j ) max B (i − 1, j ) + S ( xi , φ ) B(i, j − 1) + S (φ , y ) j 18
Dynamic programming Score function: • S(a,a) = 2 • S(a,b) = 0 for a ≠ b • S(a,-) = -1 Sequences: x = TDWAX y = ZTDX Computing B(i,j) values:
Z
φ
0
-1
-2
D -3
T
-1
0
1
0
-1
D
-1
0
3
W
-2 -3
-2
-1
2
2 3
A
-4
2
-5
-2 -3
1
X
-3 -4
0
3
φ
T
X -4
19
Dynamic programming Z
φ
0
-1
-2
D -3
T
-1
0
1
0
-1
D
-1
0
3
W
-2 -3
-2
-1
2
2 3
A
-4
2
-5
-2 -3
1
X
-3 -4
0
3
φ
T
X -4
ZTD--X -TDWAX S = (-1) + 2 + 2 + (-1) + (-1) + 2 = 3
20
Algorithmic complexity
Assume that we are aligning two sequences of length m and n Memory: O(nm) A fixed number of tables (one or two) with n*m cells: constant * nm A fixed number of additional variables: constant Little memory needed if we are only interested in the best score Time: O(nm) Calculate B(i,j) and P(i,j) for n*m cells in the table: constant * nm Perform traceback: ≤ constant * (n+m)
21
Multiple sequence alignment
Align three or more sequences Show corresponding amino acids in the different proteins Place gaps at correct positions Impossible to solve optimally by brute force for more than a few short sequences
22
Finding the best multiple alignment
To find the best multiple sequence alignments the MSA programs will try to find the one with the highest score The score is usually the sum-of-pairs-score or similar Corresponds approximately to the sum of all pairwise alignment scores For the alignment A of m sequences s1 til sm we have the sum-of-pairs score S(A):
S(a,b) is the pairwise score of a and b, and s-i is the projection of si, that is, si with inserted gaps
The sum-of-pairs score M
Q
P
I
L
L
L
M
L
R
-
L
L
-
M
K
-
I
L
L
L
M
P
P
V
L
I
L
score(k) = S(P,R) + S(P,-) + S(P,P) + S(R,-) + S(R,P) + S(-,P) score for column k = 3
We have S(-,-) = 0
Total score = score(1) + score(2) + .... + score(N)
Clustal W
One of the most commonly used and wellknown tools for multiple sequence alignment. Now somewhat outdated and surpassed by other tools. Uses a progressive algorithm: Always starts with the most similar sequences and then aligns less similar sequences with each other.
MUSCLE
MUSCLE = Multiple Sequence Comparison by Log Expectation Iterative procedure: improves the alignment gradually until good enough by introducing random changes in the alignment Very high quality of alignments Much faster than Clustal W
26
Searching sequence databases • Goal: Identify which sequences in a database are significantly similar to a given DNA, RNA or protein sequence. • How: The query sequence is compared (aligned) with each of the database sequences, and the amount of similarity is determined for each database sequence. Example: Query sequence:
acgatcgattagcca
Database sequences: Identical (trivial): Very similar (easy): Similar (moderate): Very diverged (hard):
acgatcgattagcca acgaccgatgagcca atgacggatgagcga atgacgggatgagcga 27
Searching databases
Based on local alignments of query sequence with every database sequence Exhaustive / optimal / brute-force: Smith-Waterman Heuristic: BLAST, FASTA, PARALIGN, SSAHA, PatternHunter, BLAT, ... Speed difference: about 50X Sensitivity Specificity
28
Heuristic search algorithms
Many heuristic algorithms creates an index into either the query sequence or the database sequences The index / hash key is short for sensitive algorithms and longer for quicker methods Allows fast lookup of matching positions Software:
BLAST (w/variants), FASTA, SSAHA, BLAT, PatternHunter
Index key size
2-7 amino acids (BLASTP) 1-2 amino acids (FASTA) 1-6 nucleotides (FASTA) 7-28 nucleotides (BLASTN, MEGABLAST) 10 nucleotides (SSAHA) 11 nucleotides (BLAT)
29
Short read mapping Input data: 10-100 million reads, each 50-150bp Sequencing errors (typ. 1% error rate) Reference genome: E.g. human genome, 3 Gbp Genome variants
ACGTACGTACGTACGTACGTACGTACGTACGTACGT
Output: 0, 1, or more potential genomic locations of each read Mapping quality assignment Requirements: Speed, sensitivity, specificity
TCAGTGCATTGCACCCTGGGTGTGATCGTGCATGTG
CACGTACGTGGTTTTGCAACGTGCTGATGCTAGCAT
Short read mapping algorithms
Q-gram filtering
“Spaced seeds”
At least one relatively long match A few mismatches allowed Multiple short matches Multiple mismatches allowed
Suffix array / Burrows Wheeler Transform (BW) / FM-index
Exact match
31
Burrows-Wheeler
The entire reference genome is initially transformed according to the BurrowsWheeler Transform and compressed into the FM-index. Align base by base of the read from the end (suffix). As the read is traversed, all potential locations are reported. If no match is found when a new base is aligned, then back up and try a base substitution.
Burrows-Wheeler transform