Bioinformatics

INTRODUCTION TO BIOINFORMATICS
Robert van Engelen

Overview
- Part I: Algorithms on Strings, Trees, and Sequences
  - Why are computers used in Biology and what is the role of Computer Science in Biology?
- Part II: Neural Nets and Genetic Algorithms
  - How can we use nature’s biological computing mechanisms to solve complex problems in Computer Science?

Part I
Algorithms on Strings, Trees, and Sequences
- Why are computers used in Biology and what is the role of Computer Science in Biology?
  - Growth of data such as DNA sequence data
  - Pattern search and pattern analysis

Strings, Trees, and Sequences
- Many molecular biology problems on sequences can be formulated as string matching problems
  - Storing, retrieving, and comparing DNA strings
  - Comparing two or more strings for similarities
  - Searching databases for related strings and substrings
  - Defining and exploring different notions of string relationships
  - Looking for new or ill-defined patterns occurring frequently in DNA

Strings, Trees, and Sequences (cont’d)
- Reconstructing long strings of DNA from overlapping string fragments
- Determining the physical and genetic maps from probe data under various experimental protocols
- Looking for structural patterns in DNA and protein, and determining the secondary (2D) structure of RNA
- Finding conserved but faint patterns in many DNA and protein sequences
- And much more...

Matching and Alignment of Strings and Sequences
- Exact string matching
  - Knuth-Morris-Pratt and Boyer-Moore
- Exact matching with a set of patterns
  - Aho-Corasick
- Inexact matching
  - Edit distance and dynamic programming
- Sequence alignment problems
- Multiple alignment problems

What is a String?
- Definitions
  - A string S is an ordered list of characters of a given alphabet written contiguously from left to right
  - S(i) denotes the character at position i in string S
  - |S| denotes the length of string S
  - S[i..j] is the contiguous substring of S starting at position i and ending at position j

Example
- Let
  - Alphabet = {a,b,c,1,2,3,#,$}
  - S = a1#33$
- S(1) = a
- S(3) = #
- |S| = 6
- S[4..5] = 33
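The 1-based notation above differs from the 0-based indexing of most programming languages. A minimal sketch of the mapping, assuming Python strings; the helper names `char_at` and `substring` are illustrative and not part of the slides:

```python
def char_at(s: str, i: int) -> str:
    """Return S(i), the character at 1-based position i."""
    if not 1 <= i <= len(s):
        raise IndexError(f"position {i} is outside 1..{len(s)}")
    return s[i - 1]


def substring(s: str, i: int, j: int) -> str:
    """Return S[i..j], 1-based and inclusive at both ends."""
    if i > j:
        return ""  # by convention an inverted range is the empty string
    return s[i - 1:j]


S = "a1#33$"
print(char_at(S, 1), char_at(S, 3), len(S), substring(S, 4, 5))
```

Python's half-open slice `s[i - 1:j]` covers exactly the 1-based positions i through j, which is why only the start index is shifted.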

String ≠ Sequence
- A string is not the same as the concept of a (sub)sequence in biology!
- (Sub)sequences in the biological literature refer to strings that might be interspersed with other characters, such as gaps

What are Prefixes, Suffixes, and Substrings?
- Definitions
  - S[1..i] is a prefix of string S
  - S[i..|S|] is a suffix of string S
  - S[i..j] is an empty string if i>j
  - A proper prefix, suffix, or substring of a string is a prefix, suffix, or substring that is neither the entire string nor the empty string

Example
- Let S = abcd
- S[1..2] = ab is a proper prefix of S
- S[2..3] = bc is a proper substring of S
- S[2..4] = bcd is a proper suffix of S
- S[1..4] = abcd is a prefix, suffix, and substring of S
- S[4..3] is empty

Exact String Matching
- We call string P the pattern of length n=|P|
- We call the string T the text of length m=|T|
- The exact matching problem: find all occurrences of P in T (if any)
- Let P = aba
- Let T = bbabaxababay
- Then P occurs in T at locations 3, 7, and 9

Exact String Matching: the Naïve Method
- Slide P along T and for each alignment compare characters from left to right
- Let P = aab and T = aaaaaaaaab

  aaaaaaaaab
  aab            No match
   aab           No match
    aab          No match
     aab         No match
      aab        No match
       aab       No match
        aab      No match
         aab     MATCH!

- The search requires n*(m-n+1) = 24 comparisons

What is the Worst Case Running Time?
- In the worst case, this algorithm requires n*(m-n+1) comparisons to find a match of pattern P in text T, where n=|P| and m=|T|
- We say that the worst case running time of this algorithm is in the order of n*m computational steps
- Asymptotic notation: Θ(n*m)
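The naïve method and its comparison count can be sketched as follows. This is an illustrative Python version, not code from the slides; the function name `naive_match` and the returned comparison counter are assumptions added to make the worst-case count visible.

```python
def naive_match(p: str, t: str):
    """Slide p along t, comparing left to right at each alignment.

    Returns (1-based start positions of p in t, number of character comparisons).
    """
    n, m = len(p), len(t)
    positions = []
    comparisons = 0
    for start in range(m - n + 1):      # one alignment per start offset
        k = 0
        while k < n:
            comparisons += 1
            if p[k] != t[start + k]:    # stop at the first mismatch
                break
            k += 1
        if k == n:                      # all n characters matched
            positions.append(start + 1)
    return positions, comparisons


print(naive_match("aba", "bbabaxababay"))  # positions [3, 7, 9]
print(naive_match("aab", "aaaaaaaaab"))    # ([8], 24)
```

For P = aab and T = aaaaaaaaab every one of the 8 alignments costs 3 comparisons, which gives the 3*(10-3+1) = 24 stated above.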

Asymptotic Notation
- Definition
  Θ(g(n)) = { f(n): there exist positive constants c1, c2, and n0 such that 0 < c1g(n) < f(n) < c2g(n) for all n > n0 }
- We write f(n) = Θ(g(n)) to denote that Θ(g(n)) is a tight asymptotic bound of f(n)
- Let f(n) = 2 + 0.7*n, then f(n) = Θ(n)
  - n0=10, c1=0.7, c2=1

Asymptotic Notation
- Definition
  O(g(n)) = { f(n): there exist positive constants c and n0 such that 0 < f(n) < cg(n) for all n > n0 }
- We write f(n) = O(g(n)) to denote that O(g(n)) is an asymptotic upper bound of f(n)

- [...] then shift P right by n-L'(i) positions

  _________xSc___
  bc______ySc_
  bc______ySc_

- If no such shift is possible, shift P n places to the right
- Define N(j) = the length of the longest suffix of P[1..j] that is also a suffix of P
- Let P = cabdabdab then N(3) = 2 and N(6) = 5

  for i := 1 to n do L'(i) := 0
  for j := 1 to n-1 do
    i := n-N(j)+1
    L'(i) := j
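The N(j) definition and the L'(i) loop can be checked with a short Python sketch. The function names are illustrative, N(j) is computed by direct comparison (quadratic time) rather than by a linear-time preprocessing method, and the extra slot at index n+1 is an assumption added so that the assignment is harmless when N(j) = 0.

```python
def suffix_match_lengths(p: str):
    """Return a list N where N[j] is N(j) for j = 1..n (index 0 is unused).

    N(j) = length of the longest suffix of P[1..j] that is also a suffix of P.
    """
    n = len(p)
    N = [0] * (n + 1)
    for j in range(1, n + 1):
        k = 0
        # Compare P[1..j] and P backwards from their right ends.
        while k < j and p[j - 1 - k] == p[n - 1 - k]:
            k += 1
        N[j] = k
    return N


def good_suffix_table(p: str):
    """Return a list L where L[i] is L'(i) for i = 1..n, following the loop above."""
    n = len(p)
    N = suffix_match_lengths(p)
    L = [0] * (n + 2)           # slot n+1 absorbs the writes made when N(j) = 0
    for j in range(1, n):       # j = 1 .. n-1
        i = n - N[j] + 1
        L[i] = j
    return L


N = suffix_match_lengths("cabdabdab")
print(N[3], N[6])               # 2 5
```

For P = cabdabdab the loop sets L'(8) = 3 and L'(5) = 6, and every other L'(i) with i ≤ n stays 0.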

Exact Matching of Multiple Patterns
- Boyer-Moore is faster than Knuth-Morris-Pratt in practice
- The KMP algorithm forms the basis for the Aho-Corasick algorithm for matching multiple patterns
- Multiple pattern matching in O(n+m+k) time where k = the number of occurrences in T of any of the patterns

Application: Sequence Tagged Site (STS)
- An STS is a DNA string of length 200-300 nucleotides whose right and left ends, of length 20-30 nucleotides each, occur only once in the entire genome
- Each STS occurs uniquely in the DNA of interest
- Hundreds of thousands of STSs in databases
- Problem: find which STSs are contained in anonymous DNA
- Use Aho-Corasick to find STSs in newly sequenced DNA and thereby determine its map locations
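A compact sketch of Aho-Corasick multiple-pattern matching, assuming Python dictionaries for the keyword-tree transitions. The names `build_automaton` and `find_all` are illustrative; this is one common way to write the algorithm, not necessarily the formulation used in the lecture.

```python
from collections import deque


def build_automaton(patterns):
    """Build a keyword tree with failure links for the given patterns."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # indices of the patterns recognized at each state
    for idx, pat in enumerate(patterns):
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(idx)
    queue = deque(goto[0].values())   # depth-1 states fail to the root
    while queue:                      # breadth-first: shallow states first
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]  # inherit shorter matches
    return goto, fail, out


def find_all(patterns, text):
    """Return (1-based start position, pattern) for every occurrence in text."""
    goto, fail, out = build_automaton(patterns)
    hits = []
    state = 0
    for pos, ch in enumerate(text, start=1):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for idx in out[state]:
            hits.append((pos - len(patterns[idx]) + 1, patterns[idx]))
    return hits


print(sorted(find_all(["aba", "bab", "x"], "bbabaxababay")))
```

The text is scanned once, and the failure links play the same role as the KMP shift: on a mismatch the automaton falls back to the longest suffix that is still a prefix of some pattern.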

Exact Matching With Wildcards
- Zinc Finger DNA transcription factor:
  CYS??CYS?????????????HIS??HIS
- If the number of wildcards ? is limited and can be bounded by a fixed constant, a linear time O(n+m) algorithm exists
- If the number of wildcards is unbounded, it is not known if the problem can be solved in linear time

Regular Expression Matching
- A regular expression (RE) is
  - A character from the alphabet
  - The “empty” symbol ε
  - Concatenation of two REs, written as R1R2
  - Alternation of two REs, written as R1+R2
  - Repetition of an RE, written as R*
- O(n*m) time

Regular Expression Example
- (a+b)yk(pp+ε)q*
  - aykqqqq is a string that matches
  - bykppq is a string that matches
  - byk is a string that matches
  - yk is a string that does not match

Inexact Matching, Sequence Comparison and Alignment
- Some types of errors are acceptable in valid matches
- Sequence data may contain errors
- The characters in a subsequence embedded in a string need not be contiguous
- Comparison of similar sequences
- Sequence alignment allows mismatches
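The regular expression example can be tried with Python's standard `re` module. The translation is an assumption about notation: the slide's `+` (alternation) becomes `|`, and ε becomes an empty alternative. Python's engine is a backtracking matcher, so this sketch illustrates the language being matched rather than the O(n*m) algorithm.

```python
import re

# (a+b)yk(pp+ε)q* in Python syntax: '+' -> '|', ε -> empty alternative.
PATTERN = re.compile(r"(a|b)yk(pp|)q*")


def matches(s: str) -> bool:
    """True if the whole string s is generated by the regular expression."""
    return PATTERN.fullmatch(s) is not None


for s in ["aykqqqq", "bykppq", "byk", "yk"]:
    print(s, matches(s))
```

`fullmatch` is used instead of `search` because the slide asks whether the entire string matches, not whether it contains a matching substring.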

First Fact of Biological Sequence Analysis
- “In biomolecular sequences (DNA, RNA, or amino acid sequences), high sequence similarity usually implies significant functional or structural similarity”

Edit Distance
- Edit distance = minimum number of edit operations needed to transform the first string into the second
  - Insert character (in second string)
  - Delete character (from first string)
  - Replace character

  R I   D   D     I    edit transcript
  v   i n t n e r      first string
  w r i   t   e r s    second string

Alignment and Edit Distance
- Global string alignment

  q a c _ d b d
  q a w x _ b _

- A string alignment can be converted into an edit transcript (edit operations) and vice versa
- Alignment displays string relationship as the product of evolutionary events
- Edit distance emphasizes mutational events as a process

Dynamic Programming
- Use Dynamic Programming methodology to find the minimum number of edit operations
- O(n*m) time algorithm where n is the length of the first string and m is the length of the second string
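The dynamic programming computation of edit distance can be sketched as below, assuming unit cost for each insert, delete, and replace operation. The function name `edit_distance` is illustrative.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insert/delete/replace operations turning s1 into s2."""
    n, m = len(s1), len(s2)
    # d[i][j] = edit distance between s1[:i] and s2[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # delete all i characters
    for j in range(m + 1):
        d[0][j] = j                      # insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            replace_cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,                 # delete from s1
                d[i][j - 1] + 1,                 # insert into s1
                d[i - 1][j - 1] + replace_cost,  # match or replace
            )
    return d[n][m]


print(edit_distance("vintner", "writers"))  # 5
```

The table has (n+1)*(m+1) cells and each is filled in constant time, which is where the O(n*m) bound comes from.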

Local Alignment

  p q r a x a b _ c s t v q
    x y a x _ b a c s l l

- The two aligned substrings (axabcs and axbacs) have maximum similarity
- Local alignment problem: find substrings whose similarity is maximum over all pairs of substrings
- O(n*m) time algorithm where n is the length of the first string and m is the length of the second string

Gaps
- A gap is a consecutive series of spaces in a string
- Gaps result from mutational events that delete, copy, and insert long pieces of DNA
  - Unequal cross-over in meiosis
  - DNA slippage during replication
  - Jumping genes and translocations
  - Insertions of DNA by retrovirus
- Need to use weighted edit distance or similarity measure to compare strings for alignment
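A sketch of the O(n*m) local alignment computation in the style of Smith-Waterman. The scoring weights (match +2, mismatch -2, space -1) are assumptions chosen for illustration, since the slides do not state them; the function name is also illustrative.

```python
def local_alignment_score(s1: str, s2: str, match=2, mismatch=-2, space=-1) -> int:
    """Best similarity over all pairs of substrings of s1 and s2."""
    n, m = len(s1), len(s2)
    prev = [0] * (m + 1)        # scores for the previous row of the table
    best = 0
    for i in range(1, n + 1):
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            pair = match if s1[i - 1] == s2[j - 1] else mismatch
            cur[j] = max(
                0,                   # start a fresh local alignment here
                prev[j - 1] + pair,  # align s1[i] with s2[j]
                prev[j] + space,     # s1[i] against a space
                cur[j - 1] + space,  # s2[j] against a space
            )
            best = max(best, cur[j])
        prev = cur
    return best


print(local_alignment_score("pqraxabcstvq", "xyaxbacsll"))
```

Under these assumed weights the alignment of axabcs with axbacs shown on the slide scores 5 matches * 2 - 2 spaces * 1 = 8, so the function returns at least 8 for that pair of strings.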

Alignment With Gaps
- Find local alignment with maximum similarity using
  - Small penalties for gaps (low gap weights)
  - Relatively high penalties for mismatches
- cDNA matching

  ----------------------------------   long string
    ---      ----        -- ---        string with gaps

Multiple String Comparison and Multiple Alignment
- Extracting and representing biologically important, yet faint or widely dispersed, commonalities from a set of strings
  - Discover evolutionary history
  - Discover critical conserved motifs
  - Discover 2D and 3D molecular structure
  - Discover common biological function
  - Characterize families of proteins
- Said to be the “holy grail” of molecular biology

Alignments With Profiles

  a b c _ a
  a b a b a
  a c c b _
  c b _ b c

       C1   C2   C3   C4   C5
  a   .75       .25       .50
  b        .75       .75
  c   .25  .25  .50       .25
  _             .25  .25  .25

- Determine if a string belongs to a family of proteins represented by a profile

  a a b b c

- Suppose alignment a-a = 2 and a-c = -3, then first column score = 0.75*2 - 0.25*3 = 0.75

Sequence Databases
- Fast growing industry
- You have gmail!
- Success stories
  - Discovery of similarities of oncogenes and their link to growth factors
  - Linking protein sequence database SwissProt to Riley database led to discovery of 102 biochemical roles of 1007 genes from 1734 coding regions of DNA strings
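The profile and the first-column score can be reproduced with a short Python sketch. The function names are illustrative, and the score table contains only the two values given on the slide (a-a = 2, a-c = -3); any other pair would have to be supplied by a real scoring scheme.

```python
from collections import Counter


def build_profile(rows):
    """Return one dict per alignment column: character -> relative frequency."""
    profile = []
    for column in zip(*rows):            # iterate over alignment columns
        counts = Counter(column)
        profile.append({ch: c / len(column) for ch, c in counts.items()})
    return profile


def column_score(ch, column, score):
    """Frequency-weighted score of aligning character ch with a profile column."""
    return sum(freq * score[(ch, other)] for other, freq in column.items())


rows = ["abc_a", "ababa", "accb_", "cb_bc"]
profile = build_profile(rows)
score = {("a", "a"): 2, ("a", "c"): -3}      # the two values from the slide
print(profile[0])                            # {'a': 0.75, 'c': 0.25}
print(column_score("a", profile[0], score))  # 0.75
```

Each column score is an average of pairwise scores weighted by how often each character occurs in that column of the family alignment.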

Real Sequence Database Search
- Compare the new sequence to PROSITE and BLOCKS databases for sequence motifs
- Search GenBank DNA archive or Swiss-Prot for highly similar sequences (local similarity)
  - Use BLAST and FASTA for first heuristic approximation
  - Compute optimal similarity (weighted edit distance) based on dynamic programming
- Try PAM and/or BLOSUM scoring matrix when amino acid substitutions are employed

Part II
- How can nature’s biological computing mechanisms be used to solve problems in Computer Science?
  - Neural networks
  - Evolutionary programming

Neural Nets and Genetic Algorithms
- Neural nets
  - Pattern recognition
  - Constraint satisfaction
- Genetic algorithms
  - Search and optimization

Neural Nets
- Parallel, distributed information processing structure
  - Processing elements (nodes)
  - Connections between nodes
  - Single output per node
  - Local computations

Neural Nets
- Hopfield networks
- Boltzmann networks
- Kohonen networks
- Feedforward networks
- Backpropagation networks

Hopfield Networks
- Fully connected
- Weighted connections w_ij
- State of ith node: s_i = -1 or s_i = +1
- Node i changes state s_i to +1 if Σ_j w_ij s_j > 0
- Node i changes state s_i to -1 if Σ_j w_ij s_j < 0

Pattern Recognition with Hopfield Networks
- Assume m patterns x^1,…,x^m
- Each pattern x is a vector over {-1,+1}
- Set w_ij = Σ_k x_i^k x_j^k
- Given an input pattern, the states of the nodes of the Hopfield network converge to one of the patterns that is similar to the input pattern

Boltzmann Networks
- Generalization of Hopfield networks for solving combinatorial optimization problems
- Combined state of nodes forms the solution
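The weight rule and the node update rule can be combined into a small pattern-recall sketch. Assumptions not stated on the slides: self-connections are set to zero (w_ii = 0), nodes are updated one at a time in index order, and a node keeps its state when the weighted sum is exactly 0. The function names and the two example patterns are illustrative.

```python
def train(patterns):
    """Set w[i][j] to the sum over patterns of x[i] * x[j], with w[i][i] = 0."""
    n = len(patterns[0])
    w = [[0] * n for _ in range(n)]
    for x in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:          # assumption: no self-connections
                    w[i][j] += x[i] * x[j]
    return w


def recall(w, state, max_sweeps=10):
    """Update nodes one by one until no node changes (or max_sweeps is reached)."""
    s = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(s)):
            h = sum(w[i][j] * s[j] for j in range(len(s)))
            new = 1 if h > 0 else -1 if h < 0 else s[i]
            if new != s[i]:
                s[i] = new
                changed = True
        if not changed:
            break
    return s


x1 = [1, 1, 1, 1, -1, -1, -1, -1]
x2 = [1, -1, 1, -1, 1, -1, 1, -1]
w = train([x1, x2])
noisy = [-1] + x1[1:]               # x1 with its first component flipped
print(recall(w, noisy) == x1)       # True
```

Here the input differs from x1 in a single position, and the first update of that node already restores it, after which the network state is stable.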

Kohonen Networks
- Model the brain
- Self organization

Feedforward Networks
- Multilayer
  - Input nodes
  - Hidden nodes
  - Output nodes
- Nodes compute
  - s_i = +1 if Σ_j w_ij s_j > φ
  - s_i = -1 otherwise

Backpropagation Networks
- Backpropagation of error
- Adjusts the weights by propagating errors back into the network upon mismatch of network’s output and true answer
- Supervised training

Applications of Neural Nets
- Pattern analysis (images, sound, …)
- Modeling the auditory cortex of a bat
- Traveling salesman problem
- Modeling the somatotopic map of the body surface

Genetic Algorithms
- A family of computational models inspired by evolution
- Encode a potential solution of a specific problem in a simple chromosome-like data structure
- Apply selection, reproduction, and recombination operators to a population of chromosomes to evolve new solutions

Optimization Problems
- Find suitable problem encoding in binary “chromosome” (string of 0 and 1)
- Define an evaluation function (a.k.a. fitness function)

Selection and Reproduction
- A population forms a pool of partial solutions
- Calculate fitness values for all chromosomes in the population
- A selected group of individuals with larger than average fitness value is allowed to survive and make copies into a new population

Recombination
- Crossover (one-point, uniform)
  - Form new solution by combining two solutions
- Mutations with low probability
  - Insert/destroy material

Example
- Find x on [0,31] such that f(x) = x^2 is maximized
- Encode x as string of 5 bits (2^5 = 32)
- Fitness function f
- Generate random population of size 4

  01101  11000  01000  10011

Example (Cont’d)

  Pool        f(x)    %    %/ave   #copies
  1  01101     169   14       58         1
  2  11000     576   49      197         2
  3  01000      64    6       22         0
  4  10011     361   31      123         1

Example (Cont’d)

  Pool       Mate   X-Site   New Pool   f(x)
  0110 | 1      2        4      01100    144
  1100 | 0      1        4      11001    625
  11 | 000      4        2      11011    729
  10 | 011      3        2      10000    256

Applications of GA
- Machine learning
- Traveling salesman problem
- Robot trajectory generation
- Parametric design (e.g. pipelines, aircraft, …)
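The fitness evaluation and one-point crossover from the worked example can be reproduced in a few lines of Python. The function names are illustrative; mates and crossover sites are taken from the table rather than drawn at random, so the sketch covers one fixed generation only.

```python
def fitness(chromosome: str) -> int:
    """f(x) = x^2, where x is the integer encoded by the bit string."""
    return int(chromosome, 2) ** 2


def percent_of_average(pool):
    """Fitness of each chromosome as a percentage of the pool average."""
    values = [fitness(c) for c in pool]
    average = sum(values) / len(values)
    return [round(100 * v / average) for v in values]


def crossover(a: str, b: str, site: int):
    """One-point crossover: swap the tails of a and b after position `site`."""
    return a[:site] + b[site:], b[:site] + a[site:]


pool = ["01101", "11000", "01000", "10011"]
print([fitness(c) for c in pool])          # [169, 576, 64, 361]
print(percent_of_average(pool))            # [58, 197, 22, 123]
print(crossover("01101", "11000", 4))      # ('01100', '11001')
print(crossover("11000", "10011", 2))      # ('11011', '10000')
```

After this single round of selection and crossover the best fitness in the pool rises from 576 to 729.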
Questions?