Graphical Models in. Computational ... Probabilistic Model of Evolution. Random
.... R m m+1. H. H. Felsenstein and Churchill, MBE 1996. Phylogenetic HMMs.
•
•
The Age of Genomes
Graphical Models in Computational Molecular Biology Nir Friedman Hebrew University Bacteria 1.6Mb 1600 genes
.
Includes slides by: Yoseph Barash, Nebojsa Jojic, Tommy Kaplan, Daphne Koller, Iftach Nachman, Dana Pe’er, Tal Pupko, Aviv Regev, Eran Segal
95
Biology Molecular Biology Organism
Organs
Tissues
Eukaryote Animal 13Mb 100Mb ~6000 genes ~20,000 genes 96
97
98
Human 3Gb ~30,000 genes?
99
00
01
02
03
Macro Molecules as Sequences
Cells
Building blocks
Larger units
SUGARS
POLYSACCHARIDES
FATTY ACIDS
FATS, LIPIDS, MEMBRANES
AMINO ACIDS
PROTEINS
NUCLEOTIDES
NUCLEIC ACIDS (DNA, RNA)
Molecules
The Central Dogma
Transcription
DNA
•
Translation
RNA
Protein
• 1
•
•
What Biologists Want?
Philosophy
Identify Components • Find genes in DNA sequences Associate them with function • Understand what a gene does Interactions between components • Which genes work together? Dynamics of systems • When genes are activated, and by what? How did we get here • How did genes evolve, and why this way? (And many other things)
Identify biological problem
Formulate model
Learning/Inference procedures
Data curation, apply procedure, get results
Biological interpretation
Evolutionary History
Outline Sequence evolution Protein families Transcriptional regulation Gene expression Discussion
Donkey Horse Indian rhino W hite rhino Grey seal Harbor seal Dog Cat Blue whale Fin whale Sperm whale Hippopotamus Sheep Cow Alpaca Pig Little red flying fox Ryukyu flying fox Horseshoe bat Japanese pipistrelle Long-tailed bat Jamaican fruit-eating bat Asiatic shrew Long-clawed shrew Mole Small Madagascar hedgehog Aardvark Elephant Armadillo Rabbit Pika Tree shrew Bonobo Chimpanzee Man Gorilla Sumatran orangutan Bornean orangutan Common gibbon Barbary ape Baboon W hite-fronted capuchin Slow loris Squirrel Dormouse Cane-rat Guinea pig Mouse Rat Vole Hedgehog Gymnure Bandicoot W allaroo Opossum Platypus
Phylogeny
Ape
QEPGGLVVPPTDA
Monkey
QEPGGMVVPPTDA
Lemur Cat
Perissodactyla Carnivora Cetartiodactyla Chiroptera Moles+Shrews Afrotheria Xenarthra Lagomorpha + Scandentia Primates Rodentia 1 Rodentia 2 Hedgehogs
Related Proteins Conserved Position
Primate evolution
New discoveries
Insertion
Deletion
QEPGGLVIPP REPQGLIVPPTEG Globins http://www.sanger.ac.uk/cgi-bin/Pfam/getacc?PF00042
•
• 2
•
•
Evolution
Challenges
Evolution = Neutral Variation + Selection
Ape
Neutral variation • Random changes Mutation, duplication, deletions, rearrangements, …
Selection • Preference for variants with better fit
From sequences to phylogenies QEPGGLVVPPTDA
Monkey
QEPGGMVVPPTDA
Lemur
QEPGGLVIPPTDA
Cat
REPQGLIVPPTEG
Understand selection Selection Function & Structure
Reconstruct history of evolutionary events
Probabilistic Model of Evolution
Parameterization of Phylogenies Assumptions: Positions (columns) are independent of each other Each branch is a reversible continuous time discrete state Markov process P (a c |t + t ') = P (a b |t )P (b c |t ') b
P (a )P (a b |t ) = P (b )P (b a |t ) Aardvark
Bison Chimp Dog
Elephant
Random variables – sequence at current day taxa or at ancestors Potentials/Conditional distribution – represent the probability of evolutionary changes along each branch
Computational Tasks Likelihood computation, inference of ancestral states Inference (dynamic programming, belief propagation)
governed by a rate matrix Q Parameterization: Rate matrix + tree topology + branch lengths
Maximum Likelihood Reconstruction Observed data: (D )
N sequences of length M Each position: an independent sample from the marginal distribution over N current day taxa
Likelihood:
Branch length estimation: Parameter estimation (EM)
Given a tree (T,t) :
l (T ,t : D ) = log P (D |T ,t ) = log P (x[m1,K,N ] |T ,t )
Reconstruction: Structure learning
M
Goal: Felsenstien, JME 1981
•
Find a tree (T,t) that maximizes l(T,t:D) .
• 3
•
•
SEMPHY: Structural EM Phylogenetic Reconstructoin
Current Approaches
Perform search over possible topologies T3
T1
T2
Parameter space
Borrow the idea of Structural EM: Use parameters found for current topology to help evaluate new topologies.
Parametric optimization
Problem (EM) • Such procedures are computationally expensive! • Computation of optimal parameters, per Local Maxima candidate, requires non-trivial optimization step. Tn T • Spend 4non-negligible computation on a candidate, even if it is a low scoring one.
Outline:
Perform search in (T, t) space, using EM-like iterations: • E-step: use current solution to compute expected sufficient statistics for all topologies • M-step: select new topology based on these expected sufficient statistics
Friedman et al, JCB 2002
Algorithm Outline
Algorithm Outline
0 0 Compute: E [S(i , j ) (a , b ) | D ,T ,t ]
0 0 Compute: E [S(i , j ) (a , b ) | D ,T ,t ]
Weights: wi , j = maxt F (t , E [Si , j ])
Weights: wi , j = maxt F (t , E [Si , j ]) Find: T '= arg maxT
w
(i , j )T
i,j
Original Tree (T0,t0)
Pairwise weights
Unlike standard EM for trees, we compute all possible pairwise statistics Time: O(N2M)
Friedman et al, JCB 2002
Algorithm Outline
Friedman et al, JCB 2002
Algorithm Outline
0 0 Compute: E [S(i , j ) (a , b ) | D ,T ,t ]
0 0 Compute: E [S(i , j ) (a , b ) | D ,T ,t ]
Weights: wi , j = maxt F (t , E [Si , j ])
Weights: wi , j = maxt F (t , E [Si , j ])
Find: T '= arg maxT
Find: T '= arg maxT
w
(i , j )T
i,j
Construct bifurcation T1
Max. Spanning Tree Fast greedy procedure to find tree By construction: Q(T’,t’) Q(T0,t0) Thus, l(T’,t’) l(T0,t0)
Friedman et al, JCB 2002
•
This stage also computes the branch length for each pair (i,j)
w
(i , j )T
i,j
Construct bifurcation T1
Fix Tree Remove redundant nodes Add nodes to break large degree This operation preserves likelihood l(T1,t’) =l(T’,t’) l(T0,t0) Friedman et al, JCB 2002
• 4
•
•
Algorithm Outline
Beyond Vanilla Flavor
0 0 Compute: E [S(i , j ) (a , b ) | D ,T ,t ]
Rate variation: Different positions evolve at different rate
Weights: wi , j = maxt F (t , E [Si , j ]) Find: T '= arg maxT
w
(i , j )T
i,j
Construct bifurcation T1
R X7
New Tree
Thm: l(T1,t1) l(T0,t0)
X5
These steps are then repeated until convergence
X6
X1
X3
X2
X4
Friedman et al, JCB 2002
Yang, JME 1994
Application: Map Conservation
Beyond Vanilla Flavor Dependencies between Positions: Dependent rate Dependent state
H
Calculate rates
m
http://consurf.tau.ac.il/
h10
x11
x12
h12
x13
x14
…
h22
x23
Product of trees
hJ0
h20
x24
q2 (h20 , h22 )
h10
h20
x11
x12
h12
qJ (hJ0 , hJ2 )
hJ0
…
h22 x23
x14
x1J
x24
hJ2
x
3 J
x J4
hJ2
x J3
h10
x J4
Product of chains
Each substitution depends on the substitution at the previous position This structure captures context specific effects during evolution Problem: Resulting model is intractable
x11
Siepel & Haussler, MBE 2004
•
q1 (h10 , h12 )
x13
x1J
m+1
Felsenstein and Churchill, MBE 1996
Variational approximations
Phylogenetic HMMs
H R
R
x12
h12 x13
q (h10 )
x14
x J4
q (hJ0 ) hJ0
q (h21 )
h22 x23
q1 (h11 , h21 ,..., h1J )
x J3
q (h20 )
x12
q 0 (h10 , h20 ,..., hJ0 ) hJ2
x
h20
h12 x13
1 J
x24
q (h12 )
x11
…
h22 x23
x14
h10
Mean field (product of “nodes”)
hJ0
h20
x24
…
q (hJ2 )
x1J
hJ2
x J3
x J4
Jojic et al, ISMB 2004
• 5
•
•
Tracking bounds after each step of exact EM
Computational cost of likelihood improvements using best variational approximation and exact EM
Jojic et al, ISMB 2004
Jojic et al, ISMB 2004
Remaining Challenges To learn phylogenies, need to align sequences For good alignment, need to know which positions are conserved Can we do both at the same time?
Outline Sequence evolution Protein families Transcriptional regulation Gene expression Discussion
Requires Dealing with insertions + deletions Different ways of shifting each sequence Hard problem
Proteins
Protein Families Idea: Use knowledge about function of known proteins Sequence conservation similar structure & function
Sequence Structure Function Major question: How to annotate structure and function of new protein sequences? Hemoglobin
•
Protein Family: A collection of proteins sequences that have a common Evolutionary ancestor Structure Function
• 6
•
•
Profile HMMs
Protein Family D
D
D
D
I
I
I
I
I
Start
M1
M2
M3
M4
Multiple Sequence alignment detect new proteins Need to represent Commonalities Variations
Conserved stretches Potential insertions Potential deletions Preferred letters at each position
End
Match: aDelete: conserved a deletion, position with silent state Insert: an insertion withageneral (background) emission probability specialized without emission any emission probability
I
L
M
W
K
E
ILMWKE LMWKE ILWK
Using Profile HMMs Training set of curated proteins (domains) from the same family
PFAM Database
Manual curation: 20-200 seq
Profile HMM
Annotation of new protein sequences + alignment with original sequences
Search throughout all known protein sequences: 106 -- 108
The Central Dogma
Outline Sequence evolution Protein families Transcriptional regulation • Transcription factor binding sites • Combinatorial regulation Gene expression Discussion
Transcription
DNA
•
Translation
RNA
Protein
• 7
•
•
Transcription Factor Binding Sites
Transcriptional Regulation DNA binding proteins Non-coding region
Gene 1
Activator
Repressor
Gene 2Binding sites
RNA transcript
Coding region (transcribed)
(specific sequences)
Gene regulatory proteins contain structural elements that can “read” DNA sequence “motifs” The amino acid – DNA recognition is not straightforward Experiments can pinpoint binding sites on DNA
Gene 3
Gene 4
Zinc finger
Modeling Binding Sites
How to model binding sites ? P(X1 X2 X3 X4 X5 ) = ? represents a distribution of binding sites
Given a set of (aligned) binding sites …
Consensus sequence Probabilistic model NNGGGGCNGGGC (profile of a binding site) A
4
3
1
1
0
0
1
0
0
1
0
1
C
1
3
0
0
0
0
13
6
0
0
1
9
G
5
5
13
13
14
14
0
8
14
12
13
1
T
4
3
0
0
0
0
0
0
0
1
0
3
Is this sufficient?
GCGGGGCCGGGC TGGGGGCGGGGT AGGGGGCGGGGG TAGGGGCCGGGC TGGGGGCGGGGT TGGGGGCCGGGC ATGGGGCGGGGC GTGGGGCGGGGC AAAGGGCCGGGC GGGAGGCCGGGA GCGGGGCGGGGC GAGGGGACGAGT CCGGGGCGGTCC ATGGGGCGGGGC
X1
X2
X3
X4
X5
Profile:
X1
X2
X3
X4
X5
Tree: Direct dependencies
X1
X2
X3
X4
X5
X4
X5
T
X1
X2
Arabidopsis ABA binding factor 1 Mixture of Profiles
Tree X4
X5
X6
X7
X8
X9 X10 X11 X12
Test LL per instance -18.47 (+1.46) (improvement in likelihood > 2.5-fold)
5T
Barash et al, RECOMB 2002
•
Both types of dependencies
11
T
2 2
3 3
3 4 1
5 4
5
3
Barash et al, RECOMB 2002
TRANSFAC: 95 aligned data sets 128 64 32 16 8 4 2 1 0.5
Test LL per instance -18.70 (+1.23) (improvement in likelihood > 2-fold)
Mixture of Trees:
X3
Likelihood improvement over profiles
Fold-change in likelihood
24%
Global dependencies
P(X XX = =) = P(T)P(X | T)P(X )P(X | T, XT)P(X T)P(X X3)) P(X L P(T)P(X T)P(X | T)P(X P(X L P(X )P(X )4 |4)P(X 11L 5 )X 1)P(X 2| 3| 1 )P(X 5| |)P(X XT, X )P(X X |T, X 5) 1 | T)P(X 23|)P(X 3 |)P(X 5 | T)
76%
Test LL per instance -19.93
Independency model
Mixture of Profiles:
T
1
Profile
Leucine zipper
Helix-Turn-Helix
Zin
cf
ing
er
bZ
IP
bH
LH
lix x He Heli rn u T
S
he
et
ers
oth
??
?
Barash et al, RECOMB 2002
• 8
•
•
Motif finding problem Input: A set of potentially co-regulated genes Output: A common motif in their promoters
Sources of data: Gene annotation (e.g. Hughes et al, 2000) Gene expression (e.g. Spellman et al, 1998; Tavazoie et al, 2000) ChIP (e.g. Simon et al, 2001; Lee et al, 2002)
Example
Upstream regions from yeast Sacharomyces cerevisiae genes (300-600bp)
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT
…HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG
…ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT
…ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC
…THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA
…ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA
…HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA
…PRO3
Probabilistic Model
Learning models: unaligned data EM (MEME) Identify
Motif model
Background probability
Learn
binding site positions a dependency model
Background probability: given Motif model – parameters being learned
X1
X2
X3
X4
X5
X1
X2
X3
X4
X5
X1
X2
X3
T X4
X5
X4
X5
T X1
Hidden variable: Location of motif within each sequence
Learning models: unaligned data
Gibbs Sampling (AlignACE)
X3
Identify sites
Model Learning
EM (MEME)
X2
Challenges Small training sets 10-500 sequences (out of 1000s genes) Short motifs within long sequences Motifs are 6-20bp, promoters are 500-5000bps
Discriminative Sampling • Find motif that best separates positive examples from rest of promoter sequences
•
Motifs are not perfect words Mismatches are allowed
• 9
•
•
Comparative Genomics Functional areas should be conserved
Modeling Conserved Motifs
Parallel search in orthologous promoters Model evolution at each position H
H
H R
Position in the motif
R
R
Ancestral sequence Observed sequence 1
3
2
Kellis et al. Nature 2003
Outline Sequence evolution Protein families Transcriptional regulation • Transcription factor binding sites • Combinatorial regulation Gene expression Discussion
Combinatorial Regulation
Context A Basal expression level
transcript level Upstream region of target gene
Context B Activator induces expression
ac tivator
activator binding site
Context C Activator + repressor decrease expression
repressor
repressor binding site
Effect at a Distance
•
ac tivator
activator binding site
Eukaryotic Regulation
• 10
•
The Life Cycle of a Drosophila Fly
•
Segmentation Achieved by Spatial patterning of genes
Bicoid –Blue
Gradual Refinement of Expression Domains
Eve Stripped-Red
Kruppel-Yellow
Partial gene network of the segmentation
Coarse
Fine
Each Stripe has its own “Enhancer”
Example: Eve Stripe 2
Control of even-skipped expression
•
• 11
•
•
Eve Gene
Build (generalized HMM)
within-cluster gap
Searching for “dense” windows
A+ AB+ B-
-R
between-cluster gap
Spacer
States: Gap (background) • Long gaps (between clusters) • Short gap (in cluster) Motifs • Emit motif sequence • Either positive strand or negative one
Cluster
Probabilistic Model
Search all sites 1MB around eve gene.
Parse long genomic sequences by Viterbi’s algorithm Berman et al, PNAS 2002
Frith et al, NAR, 2002; Rajewsky et al, BMC Bioinf, 2002; Baily & Noble, Bioinf, 2003
Outline
Transcriptional Regulation
Sequence evolution Protein families Transcriptional regulation • Transcription factor binding sites • Combinatorial regulation Gene expression • General methods • Unified models of transcriptional regulation • Regulatory networks Discussion
High Throughput Gene Expression
Expression Data Conditions
1000s of genes 10-100s of arrays Possible designs Biopsies from different patient populations Time course Different perturbations
Transcription
Genes
Translation
Extract
Microarray
RNA expression levels of 10,000s of genes in one experiment Gasch et al. Mol. Cell 2001
•
• 12
•
•
Expression: Unsupervised
Expression: Supervised
Feature selection + Classification (Naïve Bayes)
Classifier confidence
Labeled samples
P-value =< 0.027
Similarly, we can classify genes
Eisen et al. PNAS 1998; Alter et al, PNAS 2000
Relational Approach
Expression +Annotations
Key Idea: Expression level is “explained” by properties of gene and properties of experiment
Array annotations: Tissue type, Clinical conditions, Treatments, Prognosis
Gene Properties
Array Properties
Gene annotations: Function, Process, Regulatory regions Cellular location, protein family
Expression Level
Relational models! Segal et al, ISMB 2001
Probabilistic Relational Models Gene
Adding Heterogeneous Data
Experiment
Gene
Experiment
Gene Cluster
GCN4 Gene Cluster
Exp. type HSF
Exp. cluster
Lipid Exp. cluster
Endoplasmatic
Level
GCluster E Cluster 1 2
μ
P(Level)
P(Level)
Level -0.7
•
Expression
0.8 1.2 -0.7 0.6
…
1 1
Level
Expression
CPD
Level 0.8
Annotations
Binding sites
Experimental details
Segal et al, ISMB 2001
• 13
•
•
Semantics Gene
Outline
Experiment
Gene Cluster
Sequence evolution Protein families Transcriptional regulation • Transcription factor binding sites • Combinatorial regulation Gene expression • General methods • Unified models of transcriptional regulation • Regulatory networks Discussion
Exp. type
GCN4 HSF
Lipid Exp. cluster
Endoplasmatic
Level
Expression
GCN41
+
Exp. type1
Gene Cluster1
Exp. type2 Exp. cluster1
HSF 1
Lipid1
Endoplasmatic1
GCN42
Lev el1, 2
Gene Cluster2
HSF 2
Lipid2
Endoplasmatic2
GCN43
Exp. cluster2
Lev el1, 1
Lev el2,
Lev el2,
1
2
Gene Cluster3
HSF 3
Lipid3
Endoplasmatic3
Lev el3,
Lev el3,
1
2
Goal
TF to Expression ACTAGTGCTGA
Key Question: Can we explain changes in expression?
+
CTATTATTGCA CTGATGCTAGC
General model: Transcription factor binding sites in promoter region should “explain” changes in transcription
t2 Motif
R(t1 )
R(t2 )
t1 Motif
AGCTAGCTGAGACTGCACACTGATCGAG CCCCACCATAGCTTCGGACTGCGCTATA TAGACTGCAGCTAGTAGAGCTCTGCTAG TAGACTGCAGCTAGTAGAGCTCTGCTAG TAGACTGCAGCTAGTAGAGCTCTGCTAG AGCTCTATGACTGCCGATTGCGGGGCGT AGCTCTATGACTGCCGATTGCGGGGCGT CTGAGCTCTTTGCTCTTGACTGCCGCTT CTGAGCTCTTTGCTCTTGACTGCCGCTT CTGAGCTCTTTGCTCTTGACTGCCGCTT TTGATATTATCTCTCTGCTCGTGACTGC TTTATTGTGGGGGGACTGCTGATTATGC TGCTCATAGGAGAGACTGCGAGAGTCGT CGTAGGACTGCGTCGTCGTGATGATGCT CGTAGGACTGCGTCGTCGTGATGATGCT GCTGATCGATCGGACTGCCTAGCTAGTA GCTGATCGATCGGACTGCCTAGCTAGTA GATCGATGTGACTGCAGAAGAGAGAGGG GATCGATGTGACTGCAGAAGAGAGAGGG TTTTTTCGCGCCGCCCCGCGCGACTGCT CGAGAGGAAGTATATATGACTGCGCGCG CGAGAGGAAGTATATATGACTGCGCGCG CCGCGCGCACGGACTGCAGCTGATGCAT CCGCGCGCACGGACTGCAGCTGATGCAT GCATGCTAGTAGACTGCCTAGTCAGCTG GCATGCTAGTAGACTGCCTAGTCAGCTG CGATCGACTCGTAGCATGCATCGACTGC AGTCGATCGATGCTAGTTATTGGACTGC GTAGTAGTGCGACTGCTCGTAGCTGTAG
Segal et al, RECOMB 2002, ISMB 2003
“Classical” Approach
Flow of Information Sequence
Control regions
I II III CTAGAGCTCTATGACTGCCG CTAGAGCTCTATGACTGCCG Gene IV ATTGCGGGGCGTCTGAGCTC Gene V TTTGCTCTTGACTGCCGCTT TTTGCTCTTGACTGCCGCTT Gene VI
CCAAT ACGATGCTAGTGTAGCTGATGCTGATCGATCGTACGTGCTAGCTAGCTAGCTAGCTAGCTAGCT TCGACTG GATAC CCAAT AGCAGCTAGCTCGACTGCTTTGTGGGGCCTTGTGTGCTCAAACACACACAACACCAAATGTGCT C CCAAT TTGTGGTACTGATGATCGTAGTAACCACTGTCGATGATGCTGTGGGGGGTATCGATGCATACC TCGACTG GATACTCGACTG CCAAT ACCCCCCGCTCGATCGATCGTAGCTAGCTAGCTGACTGATCAAAAACACCATACGCCCCCCGTCG C GATAC CTGCTCGTAGCATGCTAGCTAGCTGATCGATCAGCTACGATCGACTGATCGTAGCTAGCTACTT C GCAGTT CCAAT TTTTTTTTTTGCTAGCACCCAACTGACTGATCGTAGTCAGTACGTACGATCGTGACTGATCGCT TCGACTG GCAGTT CCAAT CGTCGTCGATGCATCGTACGTAGCTACGTAGCATGCTAGCTGCTCGCAAAAAAAAAACGTCGTC TCGACTG C GATAC CCAAT GCAGTT GATCGTAGCTGCTCGCCCCCCCCCCCCGACTGATCGTAGCTAGCTGATCGATCGATCGATCGTAG C CTGAATTATATATATATATATACGGCG
Genes
Cluster gene expression profiles Search for motifs in control regions of clustered genes AGCTAGCTGAGACTGCACAC AGCTAGCTGAGACTGCACAC Gene TTCGGACTGCGCTATATAGA TTCGGACTGCGCTATATAGA Gene
clustering
GACTGCAGCTAGTAGAGCTC GACTGCAGCTAGTAGAGCTC Gene
Procedural • Apply separate method to each type of data Motif GACTGC • Use output of one method as input to the next • Unidirectional information flow
•
Motifs
TCGACTGC
Motif Profiles
TCGACTGC + GATAC
GATAC
CCAAT
CCAAT
GCAGTT GCAGTT + CCAAT
Expression Profiles
• 14
•
•
Unified Probabilistic Model Sequence
Sequence S1
S2
S3
S4
S1
Motifs
Motif Profiles
Gene
Expression Segal et al, RECOMB 2002, ISMB 2003
S2
S3
Expression Segal et al, RECOMB 2002, ISMB 2003
Probabilistic Model Observed
S4
Experiment
Gene
Expression Profiles
Sequence
Sequence S1
Motifs
Regulatory Modules
Sequence S1
S2
S3
S4
Motif profile
Motifs R1 R2 R3
Expression Profiles
S4
Module
Unified Probabilistic Model
Motif Profiles
S3
R1 R2 R3
Experiment
Expression Profiles
Sequence
S2
Motifs R1 R2 R3
Motif Profiles
Sequence
genes
Sequence
Unified Probabilistic Model
Expression profile
R1 R2 R3
Experiment
Module
Motif Profiles
ID
Gene Level
Observed
Expression
Segal et al, RECOMB 2002, ISMB 2003
Promoters Revisited
Promoter organization matters
Different logic for promoters • AND, OR, NAND, XOR, ….
Expression Profiles
Experiment
Module
ID
Gene Level
Expression Segal et al, RECOMB 2002, ISMB 2003
More Refined Models
Beer & Tavazoie, Cell, 2004
•
• 15
•
•
More Refined Models
Outline Sequence evolution Protein families Transcriptional regulation • Transcription factor binding sites • Combinatorial regulation Gene expression • General methods • Unified models of transcriptional regulation • Regulatory networks Discussion
Beer & Tavazoie, Cell, 2004
Causal Reconstruction for Gene Expression
Goal: Reconstruct Cellular Networks Conditions
Gene A
Genes
Gene C
Gene B
Gene D
Gene E
Use language of Bayesian networks to reconstruct causal connections
Biocarta. http://www.biocarta.com/ Friedman et al, JCB 2000
Expression data
Guided K-means Discretization
Gene Interaction Maps
Preprocess
Bayesian Network Learning Algorithm, + Bootstrap
Learn model
300 deletion knockout inYLR343W yeast [Hughes et al 2000] 0.75 – 1.0 0.5 – 0.75 600 genes 0.35 – 0.5
YLR334C
Edge
Separator
Reconstruct Sub-Networks
Regulator
Feature extraction
Feature assembly
Global network Local features Sub-network Friedman et al, JCB 2000; Pe’er et al, ISMB 2001
•
Sst2 Ste6
Ndj1
Fus1
Kss1 Bar1
Kar4
Signaling
Markov
Tec1
Mfa2
Transcription factor
Aga1
Prm1
Aga2
Dow nstream effector
Far1 Mfa1
Fus3
Fig1
Ste2 Ste12
“Mating response” Substructure STE12 (main TF), Fus3 (Main MAPK) are marginal
Pe’er et al, ISMB 2001
• 16
•
•
From Networks to Modules Gene1
CPD 2
SLT2
RLM1 Regulation Function 1
CRH1
Gene partition
YPS3
Regulation program learning
PTP2
One common regulation function Regulation Function 2
CPD 1
Module 1
MAPK of cell wall integrity pathway
Regulation Function 3
Module 2 Gene reassignment to modules
Gene2
Functional modules
Regulation Function 4
Gene4 Gene3
CPD 3
Idea: enforce common regulatory program Statistical robustness: Regulation programs are estimated from m*k samples Organization of genes into regulatory modules: Concise biological description
Gene5
Gene6
Module 3
Segal et al, Nat Gen 2003
Module 25 (59 genes) Tpk1
Kin82
Nrg1
Msn4
Oxid. Phosphorylation (26, 5x10 - 35) Mitochondrion (31, 7x10-32) Aerobic Respiration (12, 2x10-13)
HAP4 Motif
Hap4
Learned Network (fragment)
STRE (Msn2/4)
-500
Module 2 (64 genes) Tpk2
Msn4
-400
-300
-200
-100
Gene set coherence (GO, MIPS, KEGG)
Usv1
Module 4 (42 genes) Hap4
Match between regulator and targets
Mth1
Match between regulator and promoters Module 1 (55 genes) Atp1
Atp3
Match between regulator and condition/logic
Atp16
Segal et al, Nat Gen 2003
Segal et al, Nat Gen 2003
A Major Assumption
Realistic Regulation Modeling
mRNA
mRNA
TF
G
mRNA
mRNA
TF
G
TF
protein
TF
active protein
G
mRNA tr. rate
mRNA degradation
•
Model the closest connection
Active protein levels are not measured Transcript rates are computed from expression data and mRNA decay rates
mRNA
mRNA
TF
G
TF
protein
TF
active protein
G
mRNA tr. rate
mRNA degradation
• 17
•
•
New Proposed Scheme Known
General Two Regulator Function
regulation function f(TF) + noise model
biology data
I. Compute distribution of promoter states
TF1
Location
a
TF2
1
G1
G2
Model Learning
G3
P(promoter state)
State Equations
G4
S1 + tf1
a
S1tf1
S2 + tf2
G c
S2tf2 c
TF1
Expression data
TF2
b
2
+
TF1
b
d
TF2
d 0
0.2
0.4
0.6
+ Gi Transcription rates
mRNA decay rates
Kinetic parameters
f(TF1 , TF2 ) should describe mean transcription rate of G
Nachman et al, ISMB 2004
Nachman et al, ISMB 2004
Example: One Activator Function
General Two Regulator Function II. Assign activation level to each state
a=0
a P(promoter state)
1
a
S1tf1
S1 + tf1
b
b=1
c
c=0
d
d =0
b
2
S2tf2
S2 + tf2
[S ] +[S tf] 1=
b d
= 250
TF2
d 0
0.2
0.4
0.6
G
+ b X
f = ( a X
+ c X
= 20
=4 =1
f (tf ) = TF activity level
) X
+ d X
Transcription rate f(tf)
TF1
tf
c
Nachman et al, ISMB 2004
Adding a Temporal Aspect
= 250
tf 1 + tf
f(tf, )
State Equations
ê b [S ][tf] =ê d [S tf]
= 20
=4 =1 Time Nachman et al, ISMB 2004
Caulobacter CtrA regulon
For time series – add explicit time modeling t-1
t
TF1
Time Persistence
TF1
TF1
TF2 G2
t+1
G3
TF2 G4
G2
G3
TF2 G4
G2
G3
G4 Is There a Second Regulator?
•
• 18
•
•
Caulobacter CtrA regulon
Multiple Regulon Experiments
BIC score: 62.16 (log likelihood: 134.69) 1.4
Can we describe the cell transcriptome using a small number of hidden regulators?
1.8
1.2
Input r
1
1.6
0.8 0.6
1.2
1.4
1
1.2
Pred. r
1
0.8
0.8 0.6
G1 G2
0.4
predicted rates
TF1
1.4
TF1
0.4
input rates
G3
G4
few TFs
0.6
TF1
TF2
TF3
0.4
0.2
D
0.1 0
error
−0.1
E
0.2
0
−0.2
1
2
3
4
7
8
9
10
many genes
G1
G2
G3
G4
G5
G6
G7
2
1.4
TF1
1.5
1.2 1 0.8
TF1
0.6 0.4 input rates
TF2
1
0.5
0
1.4
Pred. r
6 1
BIC score: 252.36 (log likelihood: 344.47)
Input r
5
Time
−0.3 delta (input − predicted)
1
2
3
4
5
6
7
8
9
10
1
1.2
•“Realistic” dimensionality reduction
1 2.5
0.8 0.6 0.4 predicted rates
G1 G2
G3 G4
D
E
0.2 0.1 0
error
−0.1 −0.2
2
1 0.5 0
−0.3
TF2
1.5
1
2
3
4
5
6
7
8
9
10
•Allows prediction of target gene dynamics
2
delta (input − predicted)
Time
Cell Cycle Experiment actual rates
Outline Sequence evolution Protein families Transcriptional regulation • Transcription factor binding sites • Combinatorial regulation Gene expression • General methods • Unified models of transcriptional regulation • Regulatory networks Discussion
predicted rates
Philosophy
Recap
Identify biological problem
Models of evolution • Sequence evolution • Comparative models
Formulate model
D
Sequence Models
Transcription Factors
• Profile HMMs Learning/Inference procedures
I
I
S ta tr
M1
D
D
D
I
I
I
M2
M3
M4
E n d
• Binding sites • CIS-regulatory modules Data curation, apply procedure, get results
Gene Expression
Combination of subsets of these
• Clustering, supervised, and beyond Biological interpretation
•
New discoveries
• 19
•
•
Additional Areas Gene finding • Extended HMMs + evolutionary models Analysis of genetic traits and diseases • Linkage analysis • SNPs, haplotypes, and recombination Interaction networks (protein-DNA, protein-protein) • Relational models Protein structure • 2nd-ary and 3rd-ary structure, molecular recognition
Take Home Message
Graphical models as a methodology • Modeling language • Foundations & algorithms for learning • Allows to incorporate prior knowledge about biological mechanisms • Learning can reveal “structure” in data
Exploring unified system models • Learning from heterogeneous data Not simply combining conclusions • Combine weak evidence from multiple sources detect subtle signals • Get closer to mechanistic understanding of the signal
What About Biology? We skimmed over many details Layers of regulation in gene expression Large scale evolutionary events Confounding effects in gene expression measurements • Heterogeneity in cell population • Noise and artifacts • Difference between mRNA level and transcription rate …
•
The END
• 20