Graphical Models in Computational Molecular Biology

2 downloads 66 Views 9MB Size Report
Graphical Models in. Computational ... Probabilistic Model of Evolution. Random .... R m m+1. H. H. Felsenstein and Churchill, MBE 1996. Phylogenetic HMMs.




The Age of Genomes

Graphical Models in Computational Molecular Biology Nir Friedman Hebrew University Bacteria 1.6Mb 1600 genes

.

Includes slides by: Yoseph Barash, Nebojsa Jojic, Tommy Kaplan, Daphne Koller, Iftach Nachman, Dana Pe’er, Tal Pupko, Aviv Regev, Eran Segal

95

Biology  Molecular Biology Organism

Organs

Tissues

Eukaryote Animal 13Mb 100Mb ~6000 genes ~20,000 genes 96

97

98

Human 3Gb ~30,000 genes?

99

00

01

02

03

Macro Molecules as Sequences

Cells

Building blocks

Larger units

SUGARS

POLYSACCHARIDES

FATTY ACIDS

FATS, LIPIDS, MEMBRANES

AMINO ACIDS

PROTEINS

NUCLEOTIDES

NUCLEIC ACIDS (DNA, RNA)

Molecules

The Central Dogma

Transcription

DNA



Translation

RNA

Protein

• 1





What Biologists Want?

Philosophy

Identify Components • Find genes in DNA sequences  Associate them with function • Understand what a gene does  Interactions between components • Which genes work together?  Dynamics of systems • When genes are activated, and by what?  How did we get here • How did genes evolve, and why this way? (And many other things) 

Identify biological problem

Formulate model

Learning/Inference procedures

Data curation, apply procedure, get results

Biological interpretation

Evolutionary History

Outline Sequence evolution  Protein families  Transcriptional regulation  Gene expression  Discussion

Donkey Horse Indian rhino W hite rhino Grey seal Harbor seal Dog Cat Blue whale Fin whale Sperm whale Hippopotamus Sheep Cow Alpaca Pig Little red flying fox Ryukyu flying fox Horseshoe bat Japanese pipistrelle Long-tailed bat Jamaican fruit-eating bat Asiatic shrew Long-clawed shrew Mole Small Madagascar hedgehog Aardvark Elephant Armadillo Rabbit Pika Tree shrew Bonobo Chimpanzee Man Gorilla Sumatran orangutan Bornean orangutan Common gibbon Barbary ape Baboon W hite-fronted capuchin Slow loris Squirrel Dormouse Cane-rat Guinea pig Mouse Rat Vole Hedgehog Gymnure Bandicoot W allaroo Opossum Platypus



Phylogeny

Ape

QEPGGLVVPPTDA

Monkey

QEPGGMVVPPTDA

Lemur Cat

Perissodactyla Carnivora Cetartiodactyla Chiroptera Moles+Shrews Afrotheria Xenarthra Lagomorpha + Scandentia Primates Rodentia 1 Rodentia 2 Hedgehogs

Related Proteins Conserved Position

Primate evolution

New discoveries

Insertion

Deletion

QEPGGLVIPP REPQGLIVPPTEG Globins http://www.sanger.ac.uk/cgi-bin/Pfam/getacc?PF00042



• 2





Evolution

Challenges 

Evolution = Neutral Variation + Selection

Ape

Neutral variation • Random changes Mutation, duplication, deletions, rearrangements, …

Selection • Preference for variants with better fit

From sequences to phylogenies QEPGGLVVPPTDA

Monkey

QEPGGMVVPPTDA

Lemur

QEPGGLVIPPTDA

Cat

REPQGLIVPPTEG



Understand selection Selection  Function & Structure



Reconstruct history of evolutionary events

Probabilistic Model of Evolution

Parameterization of Phylogenies Assumptions:  Positions (columns) are independent of each other  Each branch is a reversible continuous time discrete state Markov process P (a  c |t + t ') =  P (a  b |t )P (b  c |t ') b

P (a )P (a  b |t ) = P (b )P (b  a |t ) Aardvark

Bison Chimp Dog

Elephant

Random variables – sequence at current day taxa or at ancestors Potentials/Conditional distribution – represent the probability of evolutionary changes along each branch

Computational Tasks Likelihood computation, inference of ancestral states  Inference (dynamic programming, belief propagation)

governed by a rate matrix Q Parameterization: Rate matrix + tree topology + branch lengths



Maximum Likelihood Reconstruction Observed data: (D )  

N sequences of length M Each position: an independent sample from the marginal distribution over N current day taxa

Likelihood:

Branch length estimation: Parameter estimation (EM)





Given a tree (T,t) :

l (T ,t : D ) = log P (D |T ,t ) =  log P (x[m1,K,N ] |T ,t )

Reconstruction:  Structure learning

M

Goal: Felsenstien, JME 1981





Find a tree (T,t) that maximizes l(T,t:D) .

• 3





SEMPHY: Structural EM Phylogenetic Reconstructoin

Current Approaches 

Perform search over possible topologies T3

T1

T2

Parameter space

Borrow the idea of Structural EM: Use parameters found for current topology to help evaluate new topologies.

Parametric optimization

Problem (EM) • Such procedures are computationally expensive! • Computation of optimal parameters, per Local Maxima candidate, requires non-trivial optimization step. Tn T • Spend 4non-negligible computation on a candidate, even if it is a low scoring one.

Outline: 

Perform search in (T, t) space, using EM-like iterations: • E-step: use current solution to compute expected sufficient statistics for all topologies • M-step: select new topology based on these expected sufficient statistics

Friedman et al, JCB 2002

Algorithm Outline

Algorithm Outline

0 0 Compute: E [S(i , j ) (a , b ) | D ,T ,t ]

0 0 Compute: E [S(i , j ) (a , b ) | D ,T ,t ]

Weights: wi , j = maxt F (t , E [Si , j ])

Weights: wi , j = maxt F (t , E [Si , j ]) Find: T '= arg maxT

w

(i , j )T

i,j

Original Tree (T0,t0)

Pairwise weights

Unlike standard EM for trees, we compute all possible pairwise statistics Time: O(N2M)

Friedman et al, JCB 2002

Algorithm Outline

Friedman et al, JCB 2002

Algorithm Outline

0 0 Compute: E [S(i , j ) (a , b ) | D ,T ,t ]

0 0 Compute: E [S(i , j ) (a , b ) | D ,T ,t ]

Weights: wi , j = maxt F (t , E [Si , j ])

Weights: wi , j = maxt F (t , E [Si , j ])

Find: T '= arg maxT

Find: T '= arg maxT

w

(i , j )T

i,j

Construct bifurcation T1

Max. Spanning Tree Fast greedy procedure to find tree By construction: Q(T’,t’)  Q(T0,t0) Thus, l(T’,t’)  l(T0,t0)

Friedman et al, JCB 2002



This stage also computes the branch length for each pair (i,j)

w

(i , j )T

i,j

Construct bifurcation T1

Fix Tree Remove redundant nodes Add nodes to break large degree This operation preserves likelihood l(T1,t’) =l(T’,t’)  l(T0,t0) Friedman et al, JCB 2002

• 4





Algorithm Outline

Beyond Vanilla Flavor

0 0 Compute: E [S(i , j ) (a , b ) | D ,T ,t ]

Rate variation:  Different positions evolve at different rate

Weights: wi , j = maxt F (t , E [Si , j ]) Find: T '= arg maxT

w

(i , j )T

i,j

Construct bifurcation T1

R X7

New Tree

Thm: l(T1,t1)  l(T0,t0)

X5

These steps are then repeated until convergence

X6

X1

X3

X2

X4

Friedman et al, JCB 2002

Yang, JME 1994

Application: Map Conservation

Beyond Vanilla Flavor Dependencies between Positions:  Dependent rate  Dependent state

H

Calculate rates

m

http://consurf.tau.ac.il/

h10

x11

x12

h12

x13

x14



h22

x23

Product of trees

hJ0

h20

x24

q2 (h20 , h22 )

h10

h20

x11

x12

h12

qJ (hJ0 , hJ2 )

hJ0



h22 x23

x14

x1J

x24

hJ2

x

3 J

x J4

hJ2

x J3

h10

x J4

Product of chains

Each substitution depends on the substitution at the previous position  This structure captures context specific effects during evolution Problem:  Resulting model is intractable

x11



Siepel & Haussler, MBE 2004



q1 (h10 , h12 )

x13

x1J

m+1

Felsenstein and Churchill, MBE 1996

Variational approximations

Phylogenetic HMMs

H R

R

x12

h12 x13

q (h10 )

x14

x J4

q (hJ0 ) hJ0

q (h21 )

h22 x23

q1 (h11 , h21 ,..., h1J )

x J3

q (h20 )

x12

q 0 (h10 , h20 ,..., hJ0 ) hJ2

x

h20

h12 x13

1 J

x24

q (h12 )

x11



h22 x23

x14

h10

Mean field (product of “nodes”)

hJ0

h20

x24



q (hJ2 )

x1J

hJ2

x J3

x J4

Jojic et al, ISMB 2004

• 5





Tracking bounds after each step of exact EM

Computational cost of likelihood improvements using best variational approximation and exact EM

Jojic et al, ISMB 2004

Jojic et al, ISMB 2004

Remaining Challenges To learn phylogenies, need to align sequences For good alignment, need to know which positions are conserved Can we do both at the same time?

Outline Sequence evolution Protein families  Transcriptional regulation  Gene expression  Discussion









Requires Dealing with insertions + deletions  Different ways of shifting each sequence Hard problem 

Proteins

Protein Families Idea:  Use knowledge about function of known proteins  Sequence conservation  similar structure & function

Sequence  Structure  Function Major question:  How to annotate structure and function of new protein sequences? Hemoglobin 



Protein Family: A collection of proteins sequences that have a common  Evolutionary ancestor  Structure  Function

• 6





Profile HMMs

Protein Family D

D

D

D

I

I

I

I

I

Start

M1

M2

M3

M4

Multiple Sequence alignment  detect new proteins Need to represent  Commonalities  Variations    

Conserved stretches Potential insertions Potential deletions Preferred letters at each position

End

Match: aDelete: conserved a deletion, position with silent state Insert: an insertion withageneral (background) emission probability specialized without emission any emission probability

I

L

M

W

K

E

ILMWKE LMWKE ILWK

Using Profile HMMs Training set of curated proteins (domains) from the same family

PFAM Database

Manual curation: 20-200 seq

Profile HMM

Annotation of new protein sequences + alignment with original sequences

Search throughout all known protein sequences: 106 -- 108

The Central Dogma

Outline Sequence evolution Protein families  Transcriptional regulation • Transcription factor binding sites • Combinatorial regulation  Gene expression  Discussion  

Transcription

DNA



Translation

RNA

Protein

• 7





Transcription Factor Binding Sites

Transcriptional Regulation  DNA binding proteins Non-coding region



Gene 1

Activator



Repressor

Gene 2Binding sites

RNA transcript

Coding region (transcribed)

(specific sequences)

Gene regulatory proteins contain structural elements that can “read” DNA sequence “motifs” The amino acid – DNA recognition is not straightforward Experiments can pinpoint binding sites on DNA

Gene 3

Gene 4

Zinc finger

Modeling Binding Sites

How to model binding sites ? P(X1 X2 X3 X4 X5 ) = ? represents a distribution of binding sites

Given a set of (aligned) binding sites …  

Consensus sequence Probabilistic model NNGGGGCNGGGC (profile of a binding site) A

4

3

1

1

0

0

1

0

0

1

0

1

C

1

3

0

0

0

0

13

6

0

0

1

9

G

5

5

13

13

14

14

0

8

14

12

13

1

T

4

3

0

0

0

0

0

0

0

1

0

3

Is this sufficient?

GCGGGGCCGGGC TGGGGGCGGGGT AGGGGGCGGGGG TAGGGGCCGGGC TGGGGGCGGGGT TGGGGGCCGGGC ATGGGGCGGGGC GTGGGGCGGGGC AAAGGGCCGGGC GGGAGGCCGGGA GCGGGGCGGGGC GAGGGGACGAGT CCGGGGCGGTCC ATGGGGCGGGGC

X1

X2

X3

X4

X5

Profile:

X1

X2

X3

X4

X5

Tree: Direct dependencies

X1

X2

X3

X4

X5

X4

X5

T

X1

X2

Arabidopsis ABA binding factor 1 Mixture of Profiles

Tree X4

X5

X6

X7

X8

X9 X10 X11 X12

Test LL per instance -18.47 (+1.46) (improvement in likelihood > 2.5-fold)

5T

Barash et al, RECOMB 2002



Both types of dependencies



11

T

2 2

3 3

3 4 1

5 4

5

3

Barash et al, RECOMB 2002

TRANSFAC: 95 aligned data sets 128 64 32 16 8 4 2 1 0.5

Test LL per instance -18.70 (+1.23) (improvement in likelihood > 2-fold)

Mixture of Trees:

X3

Likelihood improvement over profiles

Fold-change in likelihood

24%

Global dependencies

P(X XX = =) = P(T)P(X | T)P(X )P(X | T, XT)P(X T)P(X X3)) P(X L P(T)P(X T)P(X | T)P(X P(X L P(X )P(X )4 |4)P(X 11L 5 )X 1)P(X 2| 3| 1 )P(X 5| |)P(X XT, X )P(X X |T, X 5)  1 | T)P(X 23|)P(X 3 |)P(X 5 | T)

76%

Test LL per instance -19.93

Independency model

Mixture of Profiles:

T

1

Profile

Leucine zipper

Helix-Turn-Helix

Zin

cf

ing

er

bZ

IP

bH

LH

lix x He Heli rn u T

S

he

et

ers

oth

??

?

Barash et al, RECOMB 2002

• 8





Motif finding problem Input: A set of potentially co-regulated genes Output: A common motif in their promoters

Sources of data: Gene annotation (e.g. Hughes et al, 2000)  Gene expression (e.g. Spellman et al, 1998; Tavazoie et al, 2000)  ChIP (e.g. Simon et al, 2001; Lee et al, 2002) 

Example 

Upstream regions from yeast Sacharomyces cerevisiae genes (300-600bp)

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT

…HIS7

5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG

…ARO4

5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT

…ILV6

5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC

…THR4

5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA

…ARO1

5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA

…HOM2

5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA

…PRO3

Probabilistic Model

Learning models: unaligned data EM (MEME)  Identify

Motif model

Background probability

 Learn

binding site positions a dependency model

Background probability: given  Motif model – parameters being learned

X1

X2

X3

X4

X5

X1

X2

X3

X4

X5

X1

X2

X3



T X4

X5

X4

X5

T X1

Hidden variable: Location of motif within each sequence

Learning models: unaligned data

Gibbs Sampling (AlignACE)

X3

Identify sites

Model Learning



EM (MEME)

X2

Challenges Small training sets  10-500 sequences (out of 1000s genes) Short motifs within long sequences Motifs are 6-20bp, promoters are 500-5000bps



Discriminative Sampling • Find motif that best separates positive examples from rest of promoter sequences



Motifs are not perfect words Mismatches are allowed



• 9





Comparative Genomics Functional areas should be conserved

Modeling Conserved Motifs  

Parallel search in orthologous promoters Model evolution at each position H

H

H R

Position in the motif

R

R

Ancestral sequence Observed sequence 1

3

2

Kellis et al. Nature 2003

Outline Sequence evolution Protein families  Transcriptional regulation • Transcription factor binding sites • Combinatorial regulation  Gene expression  Discussion

Combinatorial Regulation

 

Context A Basal expression level

transcript level Upstream region of target gene

Context B Activator induces expression

ac tivator

activator binding site

Context C Activator + repressor decrease expression

repressor

repressor binding site

Effect at a Distance



ac tivator

activator binding site

Eukaryotic Regulation

• 10



The Life Cycle of a Drosophila Fly



Segmentation Achieved by Spatial patterning of genes

Bicoid –Blue

Gradual Refinement of Expression Domains

Eve Stripped-Red

Kruppel-Yellow

Partial gene network of the segmentation

Coarse

Fine

Each Stripe has its own “Enhancer”

Example: Eve Stripe 2

Control of even-skipped expression



• 11





Eve Gene 

Build (generalized HMM)

within-cluster gap

Searching for “dense” windows

A+ AB+ B-

-R

between-cluster gap

Spacer

States:  Gap (background) • Long gaps (between clusters) • Short gap (in cluster)  Motifs • Emit motif sequence • Either positive strand or negative one

Cluster



Probabilistic Model

Search all sites 1MB around eve gene.

Parse long genomic sequences by Viterbi’s algorithm Berman et al, PNAS 2002

Frith et al, NAR, 2002; Rajewsky et al, BMC Bioinf, 2002; Baily & Noble, Bioinf, 2003

Outline

Transcriptional Regulation

Sequence evolution Protein families  Transcriptional regulation • Transcription factor binding sites • Combinatorial regulation  Gene expression • General methods • Unified models of transcriptional regulation • Regulatory networks  Discussion  

High Throughput Gene Expression

Expression Data Conditions

1000s of genes 10-100s of arrays Possible designs  Biopsies from different patient populations  Time course  Different perturbations  

Transcription

Genes

Translation

Extract

Microarray

RNA expression levels of 10,000s of genes in one experiment Gasch et al. Mol. Cell 2001



• 12





Expression: Unsupervised

Expression: Supervised

Feature selection + Classification (Naïve Bayes)

Classifier confidence

Labeled samples

P-value =< 0.027

Similarly, we can classify genes

Eisen et al. PNAS 1998; Alter et al, PNAS 2000

Relational Approach

Expression +Annotations

Key Idea: Expression level is “explained” by properties of gene and properties of experiment

Array annotations: Tissue type, Clinical conditions, Treatments, Prognosis

Gene Properties

Array Properties

Gene annotations: Function, Process, Regulatory regions Cellular location, protein family

Expression Level

Relational models! Segal et al, ISMB 2001

Probabilistic Relational Models Gene

Adding Heterogeneous Data

Experiment

Gene

Experiment

Gene Cluster

GCN4 Gene Cluster

Exp. type HSF

Exp. cluster

Lipid Exp. cluster

Endoplasmatic

Level

GCluster E Cluster 1 2

μ 

P(Level)

P(Level)

Level -0.7



Expression

0.8 1.2 -0.7 0.6



1 1

Level

Expression

CPD

Level 0.8



Annotations



Binding sites



Experimental details

Segal et al, ISMB 2001

• 13





Semantics Gene

Outline

Experiment

Gene Cluster

Sequence evolution Protein families  Transcriptional regulation • Transcription factor binding sites • Combinatorial regulation  Gene expression • General methods • Unified models of transcriptional regulation • Regulatory networks  Discussion

Exp. type

GCN4 HSF

Lipid Exp. cluster

Endoplasmatic

Level

Expression

GCN41



+



Exp. type1

Gene Cluster1

Exp. type2 Exp. cluster1

HSF 1

Lipid1

Endoplasmatic1

GCN42

Lev el1, 2

Gene Cluster2

HSF 2

Lipid2

Endoplasmatic2

GCN43

Exp. cluster2

Lev el1, 1

Lev el2,

Lev el2,

1

2

Gene Cluster3

HSF 3

Lipid3

Endoplasmatic3

Lev el3,

Lev el3,

1

2

Goal

TF to Expression ACTAGTGCTGA

Key Question:  Can we explain changes in expression?

+

CTATTATTGCA CTGATGCTAGC

General model: Transcription factor binding sites in promoter region should “explain” changes in transcription

t2 Motif

R(t1 )

R(t2 )



t1 Motif

AGCTAGCTGAGACTGCACACTGATCGAG CCCCACCATAGCTTCGGACTGCGCTATA TAGACTGCAGCTAGTAGAGCTCTGCTAG TAGACTGCAGCTAGTAGAGCTCTGCTAG TAGACTGCAGCTAGTAGAGCTCTGCTAG AGCTCTATGACTGCCGATTGCGGGGCGT AGCTCTATGACTGCCGATTGCGGGGCGT CTGAGCTCTTTGCTCTTGACTGCCGCTT CTGAGCTCTTTGCTCTTGACTGCCGCTT CTGAGCTCTTTGCTCTTGACTGCCGCTT TTGATATTATCTCTCTGCTCGTGACTGC TTTATTGTGGGGGGACTGCTGATTATGC TGCTCATAGGAGAGACTGCGAGAGTCGT CGTAGGACTGCGTCGTCGTGATGATGCT CGTAGGACTGCGTCGTCGTGATGATGCT GCTGATCGATCGGACTGCCTAGCTAGTA GCTGATCGATCGGACTGCCTAGCTAGTA GATCGATGTGACTGCAGAAGAGAGAGGG GATCGATGTGACTGCAGAAGAGAGAGGG TTTTTTCGCGCCGCCCCGCGCGACTGCT CGAGAGGAAGTATATATGACTGCGCGCG CGAGAGGAAGTATATATGACTGCGCGCG CCGCGCGCACGGACTGCAGCTGATGCAT CCGCGCGCACGGACTGCAGCTGATGCAT GCATGCTAGTAGACTGCCTAGTCAGCTG GCATGCTAGTAGACTGCCTAGTCAGCTG CGATCGACTCGTAGCATGCATCGACTGC AGTCGATCGATGCTAGTTATTGGACTGC GTAGTAGTGCGACTGCTCGTAGCTGTAG

Segal et al, RECOMB 2002, ISMB 2003

“Classical” Approach 

Flow of Information Sequence

Control regions

I II III CTAGAGCTCTATGACTGCCG CTAGAGCTCTATGACTGCCG Gene IV ATTGCGGGGCGTCTGAGCTC Gene V TTTGCTCTTGACTGCCGCTT TTTGCTCTTGACTGCCGCTT Gene VI

CCAAT ACGATGCTAGTGTAGCTGATGCTGATCGATCGTACGTGCTAGCTAGCTAGCTAGCTAGCTAGCT TCGACTG GATAC CCAAT AGCAGCTAGCTCGACTGCTTTGTGGGGCCTTGTGTGCTCAAACACACACAACACCAAATGTGCT C CCAAT TTGTGGTACTGATGATCGTAGTAACCACTGTCGATGATGCTGTGGGGGGTATCGATGCATACC TCGACTG GATACTCGACTG CCAAT ACCCCCCGCTCGATCGATCGTAGCTAGCTAGCTGACTGATCAAAAACACCATACGCCCCCCGTCG C GATAC CTGCTCGTAGCATGCTAGCTAGCTGATCGATCAGCTACGATCGACTGATCGTAGCTAGCTACTT C GCAGTT CCAAT TTTTTTTTTTGCTAGCACCCAACTGACTGATCGTAGTCAGTACGTACGATCGTGACTGATCGCT TCGACTG GCAGTT CCAAT CGTCGTCGATGCATCGTACGTAGCTACGTAGCATGCTAGCTGCTCGCAAAAAAAAAACGTCGTC TCGACTG C GATAC CCAAT GCAGTT GATCGTAGCTGCTCGCCCCCCCCCCCCGACTGATCGTAGCTAGCTGATCGATCGATCGATCGTAG C CTGAATTATATATATATATATACGGCG

Genes



Cluster gene expression profiles Search for motifs in control regions of clustered genes AGCTAGCTGAGACTGCACAC AGCTAGCTGAGACTGCACAC Gene TTCGGACTGCGCTATATAGA TTCGGACTGCGCTATATAGA Gene

clustering

GACTGCAGCTAGTAGAGCTC GACTGCAGCTAGTAGAGCTC Gene

Procedural • Apply separate method to each type of data Motif GACTGC • Use output of one method as input to the next • Unidirectional information flow



Motifs

TCGACTGC

Motif Profiles

TCGACTGC + GATAC

GATAC

CCAAT

CCAAT

GCAGTT GCAGTT + CCAAT

Expression Profiles

• 14





Unified Probabilistic Model Sequence

Sequence S1

S2

S3

S4

S1

Motifs

Motif Profiles

Gene

Expression Segal et al, RECOMB 2002, ISMB 2003

S2

S3

Expression Segal et al, RECOMB 2002, ISMB 2003

Probabilistic Model Observed

S4

Experiment

Gene

Expression Profiles

Sequence

Sequence S1

Motifs

Regulatory Modules

Sequence S1

S2

S3

S4

Motif profile

Motifs R1 R2 R3

Expression Profiles

S4

Module

Unified Probabilistic Model

Motif Profiles

S3

R1 R2 R3

Experiment

Expression Profiles

Sequence

S2

Motifs R1 R2 R3

Motif Profiles

Sequence

genes

Sequence

Unified Probabilistic Model

Expression profile

R1 R2 R3

Experiment

Module

Motif Profiles

ID

Gene Level

Observed

Expression

Segal et al, RECOMB 2002, ISMB 2003

Promoters Revisited 

Promoter organization matters



Different logic for promoters • AND, OR, NAND, XOR, ….

Expression Profiles

Experiment

Module

ID

Gene Level

Expression Segal et al, RECOMB 2002, ISMB 2003

More Refined Models

Beer & Tavazoie, Cell, 2004



• 15





More Refined Models

Outline Sequence evolution Protein families  Transcriptional regulation • Transcription factor binding sites • Combinatorial regulation  Gene expression • General methods • Unified models of transcriptional regulation • Regulatory networks  Discussion  

Beer & Tavazoie, Cell, 2004

Causal Reconstruction for Gene Expression

Goal: Reconstruct Cellular Networks Conditions

Gene A

Genes

Gene C

Gene B

Gene D

Gene E 

Use language of Bayesian networks to reconstruct causal connections

Biocarta. http://www.biocarta.com/ Friedman et al, JCB 2000

Expression data

Guided K-means Discretization

Gene Interaction Maps

Preprocess



Bayesian Network Learning Algorithm, + Bootstrap

Learn model



300 deletion knockout inYLR343W yeast [Hughes et al 2000] 0.75 – 1.0 0.5 – 0.75 600 genes 0.35 – 0.5

YLR334C

Edge

Separator

Reconstruct Sub-Networks

Regulator

Feature extraction

Feature assembly

Global network Local features  Sub-network Friedman et al, JCB 2000; Pe’er et al, ISMB 2001



Sst2 Ste6

Ndj1

Fus1

Kss1 Bar1

Kar4

Signaling

Markov

Tec1

Mfa2

Transcription factor

Aga1

Prm1

Aga2

Dow nstream effector

Far1 Mfa1

Fus3

Fig1

Ste2 Ste12

“Mating response” Substructure STE12 (main TF), Fus3 (Main MAPK) are marginal

Pe’er et al, ISMB 2001

• 16





From Networks to Modules Gene1

CPD 2

SLT2

RLM1 Regulation Function 1

CRH1

Gene partition

YPS3

Regulation program learning

PTP2

One common regulation function Regulation Function 2

CPD 1

Module 1

MAPK of cell wall integrity pathway

Regulation Function 3

Module 2 Gene reassignment to modules

Gene2

Functional modules

Regulation Function 4

Gene4 Gene3

CPD 3

Idea: enforce common regulatory program  Statistical robustness: Regulation programs are estimated from m*k samples  Organization of genes into regulatory modules: Concise biological description

Gene5

Gene6

Module 3

Segal et al, Nat Gen 2003

Module 25 (59 genes) Tpk1

Kin82

Nrg1

Msn4

Oxid. Phosphorylation (26, 5x10 - 35) Mitochondrion (31, 7x10-32) Aerobic Respiration (12, 2x10-13)

HAP4 Motif

Hap4

Learned Network (fragment)

STRE (Msn2/4)

-500

Module 2 (64 genes) Tpk2

Msn4

-400

-300

-200

-100

Gene set coherence (GO, MIPS, KEGG)

Usv1

Module 4 (42 genes) Hap4

Match between regulator and targets

Mth1

Match between regulator and promoters Module 1 (55 genes) Atp1

Atp3

Match between regulator and condition/logic

Atp16

Segal et al, Nat Gen 2003

Segal et al, Nat Gen 2003

A Major Assumption

Realistic Regulation Modeling 

mRNA

mRNA

TF

G

mRNA

mRNA

TF

G

TF

protein

TF

active protein

G

mRNA tr. rate

mRNA degradation



Model the closest connection

Active protein levels are not measured  Transcript rates are computed from expression data and mRNA decay rates 

mRNA

mRNA

TF

G

TF

protein

TF

active protein

G

mRNA tr. rate

mRNA degradation

• 17





New Proposed Scheme Known

General Two Regulator Function

regulation function f(TF) + noise model

biology data

I. Compute distribution of promoter states

TF1

Location

a

TF2

1

G1

G2

Model Learning

G3

P(promoter state)

State Equations

G4

S1 + tf1

a

S1tf1

S2 + tf2

G c

S2tf2 c

TF1

Expression data

TF2

b

2

+

TF1

b

d

TF2

d 0

0.2

0.4

0.6

+ Gi Transcription rates

mRNA decay rates

Kinetic parameters

f(TF1 , TF2 ) should describe mean transcription rate of G

Nachman et al, ISMB 2004

Nachman et al, ISMB 2004

Example: One Activator Function

General Two Regulator Function II. Assign activation level to each state

a=0

a P(promoter state)

1

a

S1tf1

S1 + tf1

b

b=1

c

c=0

d

d =0

b

2

S2tf2

S2 + tf2

[S ] +[S tf] 1=

b  d

 = 250

TF2

d 0

0.2

0.4

0.6

G

+ b X

f = ( a X

+ c X

 = 20

=4 =1

f (tf ) =  TF activity level

) X 

+ d X

Transcription rate f(tf)

TF1

tf

c

Nachman et al, ISMB 2004

Adding a Temporal Aspect

 = 250

tf 1 + tf

f(tf, )

State Equations

ê b [S ][tf] =ê d [S tf]

 = 20

=4 =1 Time Nachman et al, ISMB 2004

Caulobacter CtrA regulon

For time series – add explicit time modeling t-1

t

TF1

Time Persistence

TF1

TF1

TF2 G2

t+1

G3

TF2 G4

G2

G3

TF2 G4

G2

G3

G4 Is There a Second Regulator?



• 18





Caulobacter CtrA regulon

Multiple Regulon Experiments

BIC score: 62.16 (log likelihood: 134.69) 1.4

Can we describe the cell transcriptome using a small number of hidden regulators?

1.8

1.2

Input r

1

1.6

0.8 0.6

1.2

1.4

1

1.2

Pred. r

1

0.8

0.8 0.6

G1 G2

0.4

predicted rates

TF1

1.4

TF1

0.4

input rates

G3

G4

few TFs

0.6

TF1

TF2

TF3

0.4

0.2

D

0.1 0

error

−0.1

E

0.2

0

−0.2

1

2

3

4

7

8

9

10

many genes

G1

G2

G3

G4

G5

G6

G7

2

1.4

TF1

1.5

1.2 1 0.8

TF1

0.6 0.4 input rates

TF2

1

0.5

0

1.4

Pred. r

6 1

BIC score: 252.36 (log likelihood: 344.47)

Input r

5

Time

−0.3 delta (input − predicted)

1

2

3

4

5

6

7

8

9

10

1

1.2

•“Realistic” dimensionality reduction

1 2.5

0.8 0.6 0.4 predicted rates

G1 G2

G3 G4

D

E

0.2 0.1 0

error

−0.1 −0.2

2

1 0.5 0

−0.3

TF2

1.5

1

2

3

4

5

6

7

8

9

10

•Allows prediction of target gene dynamics

2

delta (input − predicted)

Time

Cell Cycle Experiment actual rates

Outline Sequence evolution Protein families  Transcriptional regulation • Transcription factor binding sites • Combinatorial regulation  Gene expression • General methods • Unified models of transcriptional regulation • Regulatory networks  Discussion 

predicted rates



Philosophy

Recap 

Identify biological problem

Models of evolution • Sequence evolution • Comparative models

Formulate model

D



Sequence Models



Transcription Factors

• Profile HMMs Learning/Inference procedures

I

I

S ta tr

M1

D

D

D

I

I

I

M2

M3

M4

E n d

• Binding sites • CIS-regulatory modules Data curation, apply procedure, get results



Gene Expression



Combination of subsets of these

• Clustering, supervised, and beyond Biological interpretation



New discoveries

• 19





Additional Areas Gene finding • Extended HMMs + evolutionary models  Analysis of genetic traits and diseases • Linkage analysis • SNPs, haplotypes, and recombination  Interaction networks (protein-DNA, protein-protein) • Relational models  Protein structure • 2nd-ary and 3rd-ary structure, molecular recognition 

Take Home Message 

Graphical models as a methodology • Modeling language • Foundations & algorithms for learning • Allows to incorporate prior knowledge about biological mechanisms • Learning can reveal “structure” in data



Exploring unified system models • Learning from heterogeneous data  Not simply combining conclusions • Combine weak evidence from multiple sources  detect subtle signals • Get closer to mechanistic understanding of the signal

What About Biology? We skimmed over many details  Layers of regulation in gene expression  Large scale evolutionary events  Confounding effects in gene expression measurements • Heterogeneity in cell population • Noise and artifacts • Difference between mRNA level and transcription rate …



The END

• 20