INTRODUCTION TO BIOINFORMATICS

Robert van Engelen

Overview

- Part I: Algorithms on Strings, Trees, and Sequences
  - Why are computers used in Biology and what is the role of Computer Science in Biology?
- Part II: Neural Nets and Genetic Algorithms
  - How can we use nature’s biological computing mechanisms to solve complex problems in Computer Science?

Part I

- Why are computers used in Biology and what is the role of Computer Science in Biology?
  - Growth of data such as DNA sequence data
  - Pattern search and pattern analysis

Algorithms on Strings, Trees, and Sequences

- Many molecular biology problems on sequences can be formulated as string matching problems
  - Storing, retrieving, and comparing DNA strings
  - Comparing two or more strings for similarities
  - Searching databases for related strings and substrings
  - Defining and exploring different notions of string relationships
  - Looking for new or ill-defined patterns occurring frequently in DNA

Strings, Trees, and Sequences (cont’d)

- Reconstructing long strings of DNA from overlapping string fragments
- Determining the physical and genetic maps from probe data under various experimental protocols
- Looking for structural patterns in DNA and protein
- Determining secondary (2D) structure of RNA
- Finding conserved but faint patterns in many DNA and protein sequences
- And much more...

Matching and Alignment of Strings and Sequences

- Exact string matching
  - Knuth-Morris-Pratt and Boyer-Moore
- Exact matching with a set of patterns
  - Aho-Corasick
- Inexact matching
  - Edit Distance and dynamic programming
- Sequence alignment problems
- Multiple alignment problems

What is a String?

- Definitions
  - A string S is an ordered list of characters of a given alphabet written contiguously from left to right
  - S(i) denotes the character at position i in string S
  - |S| denotes the length of string S
  - S[i..j] is the contiguous substring of S starting at position i and ending at position j

Example

- Alphabet = {a,b,c,1,2,3,#,$}
- Let S = a1#33$
  - S(1) = a
  - S(3) = #
  - |S| = 6
  - S[4..5] = 33
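The string notation above counts positions from 1, while most programming languages count from 0. The following is a minimal sketch of the notation in Python; the helper names `char_at` and `substring` are hypothetical, chosen only for this illustration.

```python
def char_at(s, i):
    """S(i): the character at 1-based position i (hypothetical helper)."""
    return s[i - 1]


def substring(s, i, j):
    """S[i..j]: 1-based, inclusive substring; the empty string if i > j."""
    return s[i - 1:j] if i <= j else ""


S = "a1#33$"
print(char_at(S, 1))       # S(1)
print(char_at(S, 3))       # S(3)
print(len(S))              # |S|
print(substring(S, 4, 5))  # S[4..5]
```

The only design decision is the shift by one: S[i..j] becomes the Python slice `s[i-1:j]`, because Python slices exclude the end index.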

String ≠ Sequence

- A string is not the same as the concept of a (sub)sequence in biology!
- (Sub)sequences in the biological literature refer to strings that might be interspersed with other characters, such as gaps

What are Prefixes, Suffixes, and Substrings?

- Definitions
  - S[1..i] is a prefix of string S
  - S[i..|S|] is a suffix of string S
  - S[i..j] is an empty string if i>j
  - A proper prefix, suffix, or substring of a string is a prefix, suffix, or substring that is neither the entire string nor the empty string
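The prefix and suffix definitions can be enumerated directly. This is a small sketch, assuming Python slices stand in for S[1..i] and S[i..|S|]; the function names are hypothetical and the input string is only an illustration.

```python
def prefixes(s):
    """All non-empty prefixes S[1..i] for i = 1..|S|."""
    return [s[:i] for i in range(1, len(s) + 1)]


def suffixes(s):
    """All non-empty suffixes S[i..|S|] for i = 1..|S|."""
    return [s[i:] for i in range(len(s))]


def is_proper(part, s):
    """Proper means neither the empty string nor the entire string."""
    return part != "" and part != s


S = "abcd"
proper_prefixes = [p for p in prefixes(S) if is_proper(p, S)]
proper_suffixes = [x for x in suffixes(S) if is_proper(x, S)]
print(proper_prefixes)
print(proper_suffixes)
```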

Example

- Let S = abcd
  - S[1..2] = ab is a proper prefix of S
  - S[2..3] = bc is a proper substring of S
  - S[2..4] = bcd is a proper suffix of S
  - S[1..4] = abcd is a prefix, suffix, and substring of S
  - S[4..3] is empty

Exact String Matching

- We call string P the pattern of length n=|P|
- We call the string T the text of length m=|T|
- The exact matching problem: find all occurrences of P in T (if any)
- Let P = aba
- Let T = bbabaxababay
- Then P occurs in T at locations 3, 7, and 9

Exact String Matching: the Naïve Method

- Slide P along T and for each alignment compare characters from left to right
- Let P = aab and T = aaaaaaaaab

    aaaaaaaaab
    aab            No match
     aab           No match
      aab          No match
       aab         No match
        aab        No match
         aab       No match
          aab      No match
           aab     MATCH!

- The search requires n*(m-n+1) = 24 comparisons

What is the Worst Case Running Time?

- In the worst case, this algorithm requires n*(m-n+1) comparisons to find a match of pattern P in text T, where n=|P| and m=|T|
- We say that in the worst case this algorithm takes on the order of n*m computational steps to complete
- Asymptotic notation: Θ(n*m)
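The naïve method and its comparison count can be sketched as follows. This is an illustrative implementation rather than code from the slides; the function name `naive_search` and the returned comparison counter are assumptions made for the example.

```python
def naive_search(pattern, text):
    """Slide pattern along text; return 1-based match positions and the
    number of character comparisons performed."""
    n, m = len(pattern), len(text)
    occurrences = []
    comparisons = 0
    for start in range(m - n + 1):
        matched = True
        for k in range(n):          # compare left to right
            comparisons += 1
            if text[start + k] != pattern[k]:
                matched = False
                break               # stop at the first mismatch
        if matched:
            occurrences.append(start + 1)
    return occurrences, comparisons


positions, count = naive_search("aab", "aaaaaaaaab")
print(positions, count)  # [8] 24
```

For P = aab and T = aaaaaaaaab every alignment needs all three comparisons, which is why the count reaches n*(m-n+1) = 3*8 = 24.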

Asymptotic Notation

- Definition: Θ(g(n)) = { f(n): there exist positive constants c1, c2, and n0 such that 0 < c1*g(n) < f(n) < c2*g(n) for all n > n0 }
- We write f(n) = Θ(g(n)) to denote that Θ(g(n)) is a tight asymptotic bound of f(n)
- Let f(n) = 2 + 0.7*n, then f(n) = Θ(n)
- Definition: O(g(n)) = { f(n): there exist positive constants c and n0 such that 0 < f(n) < c*g(n) for all n > n0 }
- We write f(n) = O(g(n)) to denote that O(g(n)) is an asymptotic upper bound of f(n)
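The Θ claim for f(n) = 2 + 0.7*n can be spot-checked numerically. The sketch below assumes the constants c1 = 0.7, c2 = 1, and n0 = 10 that accompany this example; the upper limit of the tested range is arbitrary, and a finite check is an illustration, not a proof.

```python
def f(n):
    return 2 + 0.7 * n


def within_theta_bounds(n, c1=0.7, c2=1.0):
    """Check 0 < c1*g(n) < f(n) < c2*g(n) for g(n) = n."""
    return 0 < c1 * n < f(n) < c2 * n


n0 = 10
# Test every integer n > n0 up to an arbitrary limit.
holds = all(within_theta_bounds(n) for n in range(n0 + 1, 10_001))
print(holds)
```

The algebra behind the check: 2 + 0.7*n < n exactly when n > 20/3, so any n0 of 7 or more works with these constants.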

(For the example f(n) = 2 + 0.7*n: n0 = 10, c1 = 0.7, c2 = 1)

[...]

- If L'(i) > 0 then shift P right by n-L'(i) positions
- (Figure: pattern bc______ySc_ aligned under text _________xSc___ and shifted right)
- If no such shift is possible, shift P n places to the right
- Define N(j) = the length of the longest suffix of P[1..j] that is also a suffix of P
- Let P = cabdabdab then N(3) = 2 and N(6) = 5
- Computing L'(i):

    for i := 1 to n do L'(i) := 0
    for j := 1 to n-1 do
      i := n-N(j)+1
      L'(i) := j
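The definitions of N(j) and L'(i) above can be tried out in Python. This is a sketch under stated assumptions: N(j) is computed by direct character comparison, which takes quadratic time and is chosen here only for clarity, and the lists are padded so that indices match the 1-based notation. The function names are hypothetical.

```python
def suffix_match_lengths(p):
    """N[j] = length of the longest suffix of P[1..j] that is also a
    suffix of P (1-based; N[0] is unused)."""
    n = len(p)
    N = [0] * (n + 1)
    for j in range(1, n + 1):
        k = 0
        while k < j and p[j - 1 - k] == p[n - 1 - k]:
            k += 1
        N[j] = k
    return N


def good_suffix_table(p):
    """L[i] = L'(i) following the pseudocode (1-based; L[0] is unused)."""
    n = len(p)
    N = suffix_match_lengths(p)
    L = [0] * (n + 2)   # slot n+1 absorbs the case N(j) = 0
    for j in range(1, n):
        i = n - N[j] + 1
        L[i] = j
    return L


P = "cabdabdab"
N = suffix_match_lengths(P)
print(N[3], N[6])   # 2 5
L = good_suffix_table(P)
print(L[5], L[8])   # 6 3
```

When N(j) = 0 the pseudocode yields i = n+1, which lies outside 1..n; the extra list slot is an assumption of this sketch to keep the loop identical to the pseudocode.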

Exact Matching of Multiple Patterns

- Boyer-Moore is faster than Knuth-Morris-Pratt in practice
- The KMP algorithm forms the basis for the Aho-Corasick algorithm for matching multiple patterns
- Multiple pattern matching in O(n+m+k) time, where n = the total length of the patterns, m = the length of the text T, and k = the number of occurrences in T of any of the patterns

Application: Sequence Tagged Site (STS)

- An STS is a DNA string of length 200-300 nucleotides whose right and left ends, of length 20-30 nucleotides each, occur only once in the entire genome
- Each STS occurs uniquely in the DNA of interest
- Hundreds of thousands of STSs in databases
- Problem: find which STSs are contained in anonymous DNA
- Use Aho-Corasick to find STSs in newly sequenced DNA in order to find the map locations
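A compact sketch of Aho-Corasick matching for a set of patterns follows. It is a textbook-style illustration, not code from the slides: the automaton is stored in plain lists and dictionaries, the function names are hypothetical, and the short test patterns stand in for real STS strings.

```python
from collections import deque


def build_automaton(patterns):
    """Build the keyword tree with failure links and output lists."""
    goto, fail, out = [{}], [0], [[]]
    for idx, pat in enumerate(patterns):
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(idx)
    queue = deque(goto[0].values())      # depth-1 states fail to the root
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]   # inherit shorter matches
    return goto, fail, out


def find_all(patterns, text):
    """Return (1-based start position, pattern) for every occurrence."""
    goto, fail, out = build_automaton(patterns)
    hits, state = [], 0
    for pos, ch in enumerate(text, start=1):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for idx in out[state]:
            hits.append((pos - len(patterns[idx]) + 1, patterns[idx]))
    return hits


print(sorted(find_all(["aba", "bab", "x"], "bbabaxababay")))
```

The text is scanned once regardless of how many patterns there are, which is what makes the approach attractive when searching for hundreds of thousands of STSs.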

Exact Matching With Wildcards

- Zinc Finger DNA transcription factor: CYS??CYS?????????????HIS??HIS
- If the number of wildcards ? is limited and can be bounded by a fixed constant, a linear time O(n+m) algorithm exists
- If the number of wildcards is unbounded, it is not known if the problem can be solved in linear time

Regular Expression Matching

- A regular expression (RE) is
  - A character from the alphabet
  - The “empty” symbol ε
  - Concatenation of two REs, written as R1R2
  - Alternation of two REs, written as R1+R2
  - Repetition of an RE, written as R*
- RE matching takes O(n*m) time
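Returning to exact matching with wildcards: the simplest approach treats ? as a character that matches anything and checks every alignment. The sketch below is this naïve O(n*m) method, not the linear-time algorithm mentioned for a bounded number of wildcards; the function name and the test strings are illustrative assumptions.

```python
def wildcard_search(pattern, text, wildcard="?"):
    """Return 1-based positions where pattern occurs in text, with the
    wildcard character matching any single character."""
    n, m = len(pattern), len(text)
    hits = []
    for start in range(m - n + 1):
        window = text[start:start + n]
        if all(p == wildcard or p == t for p, t in zip(pattern, window)):
            hits.append(start + 1)
    return hits


print(wildcard_search("a?a", "bbabaxababay"))
```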

Regular Expression Example

- (a+b)yk(pp+ε)q*
  - aykqqqq is a string that matches
  - bykppq is a string that matches
  - byk is a string that matches
  - yk is a string that does not match

Inexact Matching, Sequence Comparison and Alignment

- Some types of errors are acceptable in valid matches
- Sequence data may contain errors
- The characters in a subsequence embedded in a string need not be contiguous
- Comparison of similar sequences
- Sequence alignment allows mismatches
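The regular expression example (a+b)yk(pp+ε)q* can be checked with Python's `re` module. The translation is an assumption of this sketch: alternation + is written as |, and the alternative ε becomes an empty branch in the group.

```python
import re

# (a+b)yk(pp+ε)q*  translated to Python syntax
PATTERN = re.compile(r"(a|b)yk(pp|)q*")


def matches(s):
    """True if the whole string s matches the regular expression."""
    return PATTERN.fullmatch(s) is not None


for s in ["aykqqqq", "bykppq", "byk", "yk"]:
    print(s, matches(s))
```

`fullmatch` is used because the slide asks whether the entire string matches, not whether a match occurs somewhere inside it.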

First Fact of Biological Sequence Analysis

- “In biomolecular sequences (DNA, RNA, or amino acid sequences), high sequence similarity usually implies significant functional or structural similarity”

Edit Distance

- Edit distance = minimum number of edit operations needed to transform the first string into the second
  - Insert character (in second string)
  - Delete character (from first string)
  - Replace character

    R I   D   D     I      edit transcript
    v   i n t n e r        first string
    w r i   t   e r s      second string

Alignment and Edit Distance

- Global string alignment

    q a c _ d b d
    q a w x _ b _

- A string alignment can be converted into an edit transcript (edit operations) and vice versa
- Alignment displays string relationship as the product of evolutionary events
- Edit distance emphasizes mutational events as a process

Dynamic Programming

- Use Dynamic Programming methodology to find the minimum number of edit operations
- O(n*m) time algorithm where n is the length of the first string and m is the length of the second string
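The dynamic programming computation of edit distance can be sketched as below. This assumes unit cost for each insert, delete, and replace operation, as in the definition of edit distance given earlier; the function name is hypothetical.

```python
def edit_distance(first, second):
    """Minimum number of insert, delete, and replace operations needed
    to transform first into second, in O(n*m) time."""
    n, m = len(first), len(second)
    # dist[i][j] = edit distance between first[:i] and second[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i              # delete all i characters
    for j in range(m + 1):
        dist[0][j] = j              # insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            replace_cost = 0 if first[i - 1] == second[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # delete
                dist[i][j - 1] + 1,                 # insert
                dist[i - 1][j - 1] + replace_cost,  # match or replace
            )
    return dist[n][m]


print(edit_distance("vintner", "writers"))
print(edit_distance("qacdbd", "qawxb"))
```

Each table cell depends only on its three neighbours, so filling the (n+1) by (m+1) table gives the O(n*m) bound.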

Local Alignment

    p q r a x a b _ c s t v q
      x y a x _ b a c s l l

- The two substrings a x a b _ c s and a x _ b a c s have maximum similarity
- Local alignment problem: find substrings whose similarity is maximum over all pairs of substrings
- O(n*m) time algorithm where n is the length of the first string and m is the length of the second string

Gaps

- A gap is a consecutive series of spaces in a string
- Gaps result from mutational events that delete, copy, and insert long pieces of DNA
  - Unequal cross-over in meiosis
  - DNA slippage during replication
  - Jumping genes and translocations
  - Insertions of DNA by retrovirus
- Need to use weighted edit distance or similarity measure to compare strings for alignment
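For the local alignment problem, the O(n*m) computation can be sketched with the standard dynamic programming recurrence in which a score is never allowed to drop below zero. The scoring values (match +2, mismatch -2, space -1) are assumptions chosen for this illustration and are not given on the slides; the function name is hypothetical.

```python
def local_alignment_score(first, second, match=2, mismatch=-2, space=-1):
    """Best similarity over all pairs of substrings of first and second."""
    n, m = len(first), len(second)
    best = 0
    prev = [0] * (m + 1)
    for i in range(1, n + 1):
        curr = [0] * (m + 1)
        for j in range(1, m + 1):
            pair = match if first[i - 1] == second[j - 1] else mismatch
            curr[j] = max(
                0,                      # start a new local alignment here
                prev[j - 1] + pair,     # align the two characters
                prev[j] + space,        # space in the second string
                curr[j - 1] + space,    # space in the first string
            )
            best = max(best, curr[j])
        prev = curr
    return best


print(local_alignment_score("pqraxabcstvq", "xyaxbacsll"))
```

Under these assumed weights the alignment of axabcs with axbacs shown above scores 5*2 - 2*1 = 8, so the function returns at least 8 for the two example strings.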

Alignment With Gaps

- Find local alignment with maximum similarity using
  - Small penalties for gaps (low gap weights)
  - Relatively high penalties for mismatches
- cDNA matching

    ----------------------------------    long string
    ---      ----        -- ---           string with gaps

Multiple String Comparison and Multiple Alignment

- Extracting and representing biologically important, yet faint or widely dispersed, commonalities from a set of strings
  - Discover evolutionary history
  - Discover critical conserved motifs
  - Discover 2D and 3D molecular structure
  - Discover common biological function
  - Characterize families of proteins
- Said to be the “holy grail” of molecular biology

Alignments With Profiles

    a b c _ a
    a b a b a
    a c c b _
    c b _ b c

         C1    C2    C3    C4    C5
    a    .75         .25         .50
    b          .75         .75
    c    .25   .25   .50         .25
    _                .25   .25   .25

- Determine if a string belongs to a family of proteins represented by a profile
- String to align: a a b b c
- Suppose alignment a-a = 2 and a-c = -3, then first column score = 0.75*2 - 0.25*3 = 0.75

Sequence Databases

- Fast growing industry
- You have gmail!
- Success stories
  - Discovery of similarities of oncogenes and their link to growth factors
  - Linking protein sequence database Swiss-Prot to Riley database led to discovery of 102 biochemical roles of 1007 genes from 1734 coding regions of DNA strings
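The profile and the first-column score can be reproduced with a few lines of Python. This is a sketch: the alignment rows are written as strings with _ for a space, only the two pair scores named on the slide (a-a = 2, a-c = -3) are defined, and the function names are hypothetical.

```python
from collections import Counter


def build_profile(alignment):
    """For each column, the fraction of rows holding each character."""
    rows = len(alignment)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({ch: counts[ch] / rows for ch in counts})
    return profile


def column_score(char, column_profile, score):
    """Frequency-weighted score of aligning char against one column."""
    return sum(freq * score[(char, other)]
               for other, freq in column_profile.items())


alignment = ["abc_a", "ababa", "accb_", "cb_bc"]
profile = build_profile(alignment)
score = {("a", "a"): 2, ("a", "c"): -3}   # only the pairs given on the slide
print(profile[0])
print(column_score("a", profile[0], score))
```

Scoring a full string against the profile would need a score for every character pair, including spaces; those values are not given here, so the sketch stops at the first column.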

Real Sequence Database Search

- Compare the new sequence to PROSITE and BLOCKS databases for sequence motifs
- Search GenBank DNA archive or Swiss-Prot for highly similar sequences (local similarity)
  - Use BLAST and FASTA for first heuristic approximation
  - Compute optimal similarity (weighted edit distance) based on dynamic programming
- Try PAM and/or BLOSUM scoring matrix when amino acid substitutions are employed

Part II

- How can nature’s biological computing mechanisms be used to solve problems in Computer Science?
  - Neural networks
  - Evolutionary programming

Neural Nets and Genetic Algorithms

- Neural nets
  - Pattern recognition
  - Constraint satisfaction
- Genetic algorithms
  - Search and optimization

Neural Nets

- Parallel, distributed information processing structure
  - Processing elements (nodes)
  - Connections between nodes
  - Single output per node
  - Local computations

Neural Nets

- Hopfield networks
- Boltzmann networks
- Kohonen networks
- Feedforward Networks
- Backpropagation networks

Hopfield Networks

- Fully connected
- Weighted connections wij
- State of ith node: si = -1 or si = +1
- Node i changes state si to +1 if Σj wij*sj > 0
- Node i changes state si to -1 if Σj wij*sj < 0

Pattern Recognition with Hopfield Networks

- Assume m patterns x1,…,xm
- Each pattern x is a vector over {-1,+1}
- Set wij = Σk xi^k * xj^k, where xi^k is the ith component of pattern xk
- Given an input pattern, the states of the nodes of the Hopfield network converge to one of the patterns that is similar to the input pattern

Boltzmann Networks

- Generalization of Hopfield networks for solving combinatorial optimization problems
- Combined state of nodes forms the solution
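Pattern recall with a Hopfield network can be sketched in a few lines. The assumptions of this illustration: no self-connections (wii = 0), nodes are updated one at a time in index order, a node keeps its state when the weighted sum is exactly 0, and the two stored patterns are made up for the example.

```python
def train(patterns):
    """Set wij = sum over patterns of xi*xj, with wii = 0 (assumption)."""
    n = len(patterns[0])
    w = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                w[i][j] = sum(x[i] * x[j] for x in patterns)
    return w


def recall(w, state, max_sweeps=100):
    """Update nodes one by one until no node changes state."""
    s = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(s)):
            h = sum(w[i][j] * s[j] for j in range(len(s)))
            new = 1 if h > 0 else -1 if h < 0 else s[i]
            if new != s[i]:
                s[i] = new
                changed = True
        if not changed:
            break
    return s


p1 = [1, 1, 1, 1, -1, -1, -1, -1]
p2 = [1, -1, 1, -1, 1, -1, 1, -1]
w = train([p1, p2])
noisy = [-1, 1, 1, 1, -1, -1, -1, -1]   # p1 with its first component flipped
print(recall(w, noisy) == p1)
```

The input differs from the first stored pattern in one position, and the update rule pulls the network state back to that pattern.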

Kohonen Networks

- Model the brain
- Self organization

Feedforward Networks

- Multilayer
  - Input nodes
  - Hidden nodes
  - Output nodes
- Nodes compute
  - si = +1 if Σj wij*sj > φ
  - si = -1 otherwise
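The node computation of a feedforward network can be illustrated with a tiny multilayer network of threshold units. The weights and thresholds below are hand-picked assumptions, chosen so that two hidden nodes and one output node compute exclusive-or over inputs in {-1,+1}; they do not come from the slides.

```python
def threshold_unit(weights, inputs, phi):
    """si = +1 if the weighted input sum exceeds the threshold phi, else -1."""
    total = sum(w * s for w, s in zip(weights, inputs))
    return 1 if total > phi else -1


def xor_network(x1, x2):
    """Input layer -> two hidden nodes -> one output node."""
    h_or = threshold_unit([1, 1], [x1, x2], -1)    # +1 unless both inputs are -1
    h_and = threshold_unit([1, 1], [x1, x2], 1)    # +1 only if both inputs are +1
    return threshold_unit([1, -1], [h_or, h_and], 1)


for a in (-1, 1):
    for b in (-1, 1):
        print(a, b, xor_network(a, b))
```

Exclusive-or is a common illustration because a single threshold unit cannot compute it, while one hidden layer suffices.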

Backpropagation Networks

- Backpropagation of error
- Adjusts the weights by propagating errors back into the network upon mismatch of network’s output and true answer
- Supervised training

Applications of Neural Nets

- Pattern analysis (images, sound, …)
- Modeling the auditory cortex of a bat
- Traveling salesman problem
- Modeling the somatotopic map of the body surface

Genetic Algorithms

- A family of computational models inspired by evolution
- Encode a potential solution of a specific problem in a simple chromosome-like data structure
- Apply selection, reproduction, and recombination operators to a population of chromosomes to evolve new solutions

Optimization Problems

- Find suitable problem encoding in binary “chromosome” (string of 0 and 1)
- Define an evaluation function (a.k.a. fitness function)

Selection and Reproduction

- A population forms a pool of partial solutions
- Calculate fitness values for all chromosomes in the population
- A selected group of individuals with larger than average fitness value is allowed to survive and make copies into a new population

Recombination

- Crossover (one-point, uniform)
  - Form new solution by combining two solutions
- Mutations with low probability
  - Insert/destroy material

Example

- Find x on [0,31] such that f(x) = x^2 is maximized
- Encode x as string of 5 bits (2^5=32)
- Fitness function f
- Generate random population of size 4
  - 01101 11000 01000 10011

Example (Cont’d)

    Pool       f(x)   % of total   % of average   #copies
    1 01101    169    14            58            1
    2 11000    576    49           197            2
    3 01000     64     6            22            0
    4 10011    361    31           123            1
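The fitness column and the comparison with the population average in the table above can be recomputed as follows. This is a minimal sketch; the variable names are illustrative, and the rule for turning the ratio into a number of copies is not spelled out on the slide, so it is left out.

```python
def fitness(chromosome):
    """Decode a 5-bit string to x and return f(x) = x^2."""
    x = int(chromosome, 2)
    return x * x


pool = ["01101", "11000", "01000", "10011"]
values = [fitness(c) for c in pool]
average = sum(values) / len(values)
percent_of_average = [round(100 * v / average) for v in values]

print(values)
print(average)
print(percent_of_average)
```

Chromosomes scoring above 100% of the average are the ones favoured for reproduction, which is consistent with the second chromosome receiving two copies and the third receiving none.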

Example (Cont’d)

    Pool        Mate   X-Site   New Pool   f(x)
    0110 | 1    2      4        01100      144
    1100 | 0    1      4        11001      625
    11 | 000    4      2        11011      729
    10 | 011    3      2        10000      256

Applications of GA

- Machine learning
- Traveling salesman problem
- Robot trajectory generation
- Parametric design (e.g. pipelines, aircraft, …)
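The one-point crossover step of the example can be reproduced as follows. This sketch takes the mating pool after selection (one copy of 01101, two copies of 11000, one copy of 10011) together with the mates and crossover sites listed in the table; the function name is hypothetical.

```python
def one_point_crossover(a, b, site):
    """Swap the tails of two chromosomes after the given bit position."""
    return a[:site] + b[site:], b[:site] + a[site:]


def fitness(chromosome):
    x = int(chromosome, 2)
    return x * x


mating_pool = ["01101", "11000", "11000", "10011"]
child1, child2 = one_point_crossover(mating_pool[0], mating_pool[1], 4)
child3, child4 = one_point_crossover(mating_pool[2], mating_pool[3], 2)
new_pool = [child1, child2, child3, child4]

print(new_pool)
print([fitness(c) for c in new_pool])
```

The best fitness rises from 576 in the initial population to 729 after a single round of selection and crossover.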

Questions?
