Modeling and Optimization of Scientific Workflows
Daniel Zinn
Department of Computer Science, University of California at Davis
March 25th, 2008
Outline
1. Scientific Workflows
2. Collection-Oriented Modeling and Design (COMAD)
3. Research Questions
4. Dataflow Analysis and Optimization
5. Experimental Evaluation
6. Conclusion
Natural Sciences and E-Science Research
[Figure: Earth Sciences, Physical Sciences, and Life Sciences, characterized as data-intensive, compute-intensive, metadata-intensive, and structurally & semantically intensive.]
Workflows: An Introduction
Workflow Basics
- Dataflow network represented as a graph
- Actor: represents a computational unit (legacy components)
- Channel: represents dataflow between these units
- Types on the channels
- Form of visual programming
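The actor/channel vocabulary can be made concrete with a small sketch. It assumes a purely linear graph and a naive scheduler that lets each actor drain its input before the next one runs. The names `Channel`, `Actor`, and `run_pipeline` are hypothetical and are not taken from any workflow system mentioned in the talk.

```python
from collections import deque


class Channel:
    """FIFO queue that carries tokens from one actor to the next."""

    def __init__(self):
        self._tokens = deque()

    def put(self, token):
        self._tokens.append(token)

    def get(self):
        return self._tokens.popleft()

    def __len__(self):
        return len(self._tokens)


class Actor:
    """Wraps a computational unit; fires once per token on its input channel."""

    def __init__(self, func, inp, out):
        self.func = func
        self.inp = inp
        self.out = out

    def fire_all(self):
        while len(self.inp) > 0:
            self.out.put(self.func(self.inp.get()))


def run_pipeline(funcs, tokens):
    """Connect the given functions into a linear dataflow graph and run it."""
    channels = [Channel() for _ in range(len(funcs) + 1)]
    actors = [Actor(f, channels[i], channels[i + 1]) for i, f in enumerate(funcs)]
    for token in tokens:
        channels[0].put(token)
    for actor in actors:  # upstream actors run to completion first (assumed schedule)
        actor.fire_all()
    results = []
    while len(channels[-1]) > 0:
        results.append(channels[-1].get())
    return results


def double(x):
    return 2 * x


def increment(x):
    return x + 1


print(run_pipeline([double, increment], [1, 2, 3]))  # [3, 5, 7]
```

A real system would run actors concurrently and attach types to the channels; the sketch only shows how actors and channels compose into a graph.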
Scientific Workflow Modeling & Design
Sometimes: Brittle and Ugly
Desiderata
Scientific workflow systems should ...
(1) support high-level design and maintenance of workflows and data (save scientists' brain cycles)
(2) support automatic optimization of scientific workflows on parallel/distributed systems (save machine cycles)
Collection-Oriented Workflows [McPhillips2005]
COMAD: adopting an assembly-line metaphor
- Data is organized in nested collections
- Actors "pick up" only relevant data (read scope) and put results back
- Actors ignore (pass through) what's outside the scope
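The assembly-line behavior can be illustrated with a minimal sketch. It assumes a nested collection is a `(label, children)` tuple and a data item is a `(label, value)` tuple. The function `comad_actor` and the sample labels are hypothetical; the sketch only mimics the read-scope and pass-through idea and is not the COMAD implementation.

```python
def comad_actor(node, scope_label, func):
    """Return a copy of the tree in which every collection labeled
    scope_label has func(children) appended; everything else passes through."""
    label, children = node
    if not isinstance(children, list):
        return node  # data item: outside any scope, passed through unchanged
    new_children = [comad_actor(child, scope_label, func) for child in children]
    if label == scope_label:
        # The actor "picks up" the collection and puts its result back.
        new_children.append(func(children))
    return (label, new_children)


def count_items(children):
    """Hypothetical scientific function: count the items in a collection."""
    return ("count", len(children))


stream = ("project", [
    ("seqs", [("dna", "ACGT"), ("dna", "ACGA")]),
    ("notes", [("text", "keep me")]),
])

result = comad_actor(stream, "seqs", count_items)
print(result)
```

The `notes` collection is outside the read scope, so it comes out untouched, while `seqs` gains a `count` item next to its original data.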
COMAD vs. Conventional Workflows
Advantages*
- Mostly linear workflow design (easier for the scientist to understand)
- Easier to reuse (change resilience: actors can usually be added, removed, or swapped out without breaking the pipeline)
- More robust to input changes
- Can be automatically optimized
* Fine print: complexity is moved to the configuration layer
Resilience to Input Changes
[Figure: a conventional workflow and a collection-oriented workflow over actors A ($\alpha \to \beta$) and B ($\beta \to \gamma$), shown for three inputs: $\alpha$, $[\alpha]$, and $[\alpha \mid \phi]$. In the later builds the conventional workflow contains additional * and S nodes.]
Three-Layer Architecture
[Figure: three layers: workflow graph, configurations (white-box part), and scientific functions (black-box part). Labels recovered from the figure: Data A, Data B, Merge, Count Alignments, Update Statistics, Sink; Align DNA Sequences [ClustalW], Refine Alignment [Gblocks], Infer Set of PhylTrees [DNAPARS], Compute a Consensus Tree [CONSENSE], Display DNA Sequences and Inferred Tree; configurations "s : DNASeq+ → append f(s)" and "a : Alignment → append PhylTrees[q(a)]".]
- Graph as a clean representation of the scientific process
- White-box layer for data management
- Legacy/scientific functions as black boxes
- Analyze and optimize based on the white-box layer
Research Questions
In a nutshell
- Define a good white-box layer ...
- with a concrete language,
- with a solid theoretical basis,
- with an appropriate type system
- ... and show how the workflow desiderata can be achieved
Desiderata
- Support workflow design!
- Support workflow optimization!
Specific Research Questions
Resilience and robustness
- Characterize input schema changes that do not affect the workflow
- Characterize actor additions, removals, and replacements that do not affect the workflow
Modeling support
- Infer the output schema
- Check whether all actors are active
- Infer a canonical input schema
Automatic optimization
- Reduce shipping
- Detect and exploit parallelism
Results
Adopted type system for the channels
- Tree grammars to describe the schema on channels (XML Schema)
- Type-level signatures for actors
- Type propagation through actors
Shipping optimization
- Based on dependency analysis
- Reduces the amount of data shipped
- Improved execution time
Typing in COMAD
Conventional actor: $\tau \longrightarrow A : \alpha \to \omega \longrightarrow \tau'$
COMAD actor: $\tau \longrightarrow \Delta_A : \tau_\alpha \to \tau_\omega \longrightarrow \tau'$
[Figure: the signature $\Delta_A$ annotated with context paths, "matched" fragments, and "replaced" fragments.]
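Channel Type System
Definitions
- Type declaration: $T ::= \langle t \rangle\, R$, for example $A ::= \langle a \rangle\, D^+ \mid E^*$
- Schema $\tau$: set of type declarations with induced labels $L_\tau$ and types $T_\tau = C_\tau \,\dot\cup\, \{Z\}$
Example schema $\tau$ (from the figure)
- $S ::= \langle s \rangle\, (A \mid B)^*$
- $A ::= \langle a \rangle\, D^+ \mid E^*$
- $B ::= \langle b \rangle\, C$
- $C ::= \langle c \rangle\, Z$
- $D ::= \langle d \rangle\, F^* G$
- $E ::= \langle e \rangle\, H^*$
- $F ::= \langle f \rangle\, Z$, $G ::= \langle g \rangle\, Z$, $H ::= \langle h \rangle\, Z$
- $T_\tau = \{S, A, \ldots, Z\}$, $L_\tau = \{\langle s \rangle, \langle a \rangle, \ldots, \langle h \rangle\}$
Restrictions and conventions
- Non-ambiguous, non-recursive tree grammars
- Leaf nodes hold the actual data ($Z$)
Example instance $s \in [\![\tau]\!]$: s = s[b[c]a[d[ffg]d[gg]]a[e[hhh]]b[c]b[c]]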
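A non-recursive tree grammar of this kind can be checked with ordinary regular expressions over the types of a node's children. The sketch below assumes single-letter type names, so each content model is itself a valid Python regular expression, and assumes that every label belongs to exactly one type. The schema is written in the style of the slide's example; the helper names (`SCHEMA`, `child_type`, `conforms`) and the sample trees are hypothetical.

```python
import re

# type name -> (label, content model over child types); Z stands for base data
SCHEMA = {
    "S": ("s", "(A|B)*"),
    "A": ("a", "D+|E*"),
    "B": ("b", "C"),
    "C": ("c", "Z"),
    "D": ("d", "F*G"),
    "E": ("e", "H*"),
    "F": ("f", "Z"),
    "G": ("g", "Z"),
    "H": ("h", "Z"),
}
LABEL_TO_TYPE = {label: name for name, (label, _) in SCHEMA.items()}


def child_type(node):
    """Type letter of a node: 'Z' for base data, '?' for an unknown label."""
    if isinstance(node, str):
        return "Z"
    return LABEL_TO_TYPE.get(node[0], "?")


def conforms(node, expected="S"):
    """Check that node is a valid instance of the type named expected."""
    if isinstance(node, str):
        return expected == "Z"
    if expected not in SCHEMA:
        return False
    label, children = node
    expected_label, content_model = SCHEMA[expected]
    if label != expected_label:
        return False
    word = "".join(child_type(child) for child in children)
    if re.fullmatch(content_model, word) is None:
        return False
    return all(conforms(child, child_type(child)) for child in children)


good = ("s", [
    ("b", [("c", ["data"])]),
    ("a", [("d", [("f", ["data"]), ("g", ["data"])])]),
    ("a", [("e", [("h", ["data"]), ("h", ["data"])])]),
])
bad = ("s", [("c", ["data"])])  # a c-collection may not sit directly under s

print(conforms(good), conforms(bad))
```

Because the grammar is non-recursive, the recursion depth of `conforms` is bounded by the schema, not by the size of the data.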
Actor Type-level Signatures
[Figure: the signature $\Delta_A$ annotated with context paths, "matched" fragments, and "replaced" fragments, plus two pipeline diagrams, (a) and (b), over actors $A_1$ to $A_4$.]
Formal signature: $\Delta_A = \langle \sigma, \tau_\alpha, \tau_\omega \rangle$
- $\tau_\alpha$: input selection schema
- $\tau_\omega$: new output schema parts
- $\sigma$: set of match rules, each of the form $X \to R$, with $X \in \tau_\alpha$ and $R$ a regular expression over types in $\tau_\alpha \cup \tau_\omega$
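Example
[Figure: schema trees for $\tau$ (before actor $A$) and $\tau'$ (after actor $A$); the content model $(BEG^*)^*G$ in $\tau$ becomes $(BE(E \mid B)^*)^*(E \mid B)$ in $\tau'$.]
$\Delta_A = \langle \sigma, \tau_\alpha, \tau_\omega \rangle$ with
- $\sigma : \{G \to E \mid B\}$
- $\tau_\alpha : \{G ::= \langle g \rangle\, Z\}$
- $\tau_\omega : \{E ::= \langle e \rangle\, H^*,\ B ::= \langle b \rangle\, C,\ H ::= \langle h \rangle\, Z,\ C ::= \langle c \rangle\, Z\}$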
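At the level of a single content model, applying a match rule $X \to R$ amounts to substituting $R$ for every occurrence of $X$. The sketch below shows only that string-level substitution, assuming single-letter type names and a rule that applies to every occurrence. The functions `apply_rule` and `propagate` are hypothetical; the type-propagation algorithm of the talk works on whole schemas and is not reproduced here.

```python
def apply_rule(content_model, symbol, replacement):
    """Substitute replacement for every occurrence of the type symbol,
    adding parentheses when the replacement is more than one symbol."""
    wrapped = f"({replacement})" if len(replacement) > 1 else replacement
    return "".join(wrapped if ch == symbol else ch for ch in content_model)


def propagate(content_model, rules):
    """Apply a sequence of (symbol, replacement) match rules in pipeline order."""
    for symbol, replacement in rules:
        content_model = apply_rule(content_model, symbol, replacement)
    return content_model


after_a1 = apply_rule("F*G", "F", "BEG*")
after_a2 = apply_rule(after_a1, "G", "E|B")
print(after_a1)  # (BEG*)*G
print(after_a2)  # (BE(E|B)*)*(E|B)
```

Starting from `F*G`, the rules `F → BEG*` and `G → E | B` reproduce the content models `(BEG*)*G` and `(BE(E|B)*)*(E|B)` that appear in the slides.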
Initial Contributions
[Figure: type propagation through a three-actor pipeline from input $i \in \tau$ to output $o$, with $\sigma_{A_1} = \{F \to BEG^*\}$, $\sigma_{A_2} = \{G \to (E \mid B)\}$, and $\sigma_{A_3} = \{E \to X\}$. The content model $F^*G$ is shown as $(BEG^*)^*G$ after $A_1$, $(BE(E \mid B)^*)^*(E \mid B)$ after $A_2$, and $(BX^*(X \mid B)^*)^*(X^* \mid B)$ after $A_3$; $D^+ \mid E^*$ is shown as $D^+ \mid X^*$ after $A_3$.]
- Type propagation (workflow design support!)
- Detect non-active actors (workflow design support!)
- Shipping optimization (workflow optimization!)
Workflow Example: Shipping optimized (part)
[Figure: the schema $\tau$, node $D_0$, and actor $A_1$, with parts of the schema replaced by markers $\circ_{02}$, $\circ_{03}$, $\circ_{04}$ and channels $d_{01}$ to $d_{04}$.]
Workflow Example: Shipping optimized (complete)
[Figure: the complete shipping-optimized workflow from input $i \in \tau$ to output $o$, with nodes $D_0$, $D_1$, $D_2$, actors $A_1$, $A_2$, $A_3$, nodes $M_2$, $M_3$, $M_4$, and channels $d_{01}$ to $d_{24}$; the per-channel schemas are not recoverable from the extracted text.]
Shipping Savings
Shipping optimality
- There is no unnecessary shipping of base data*
* Of course, the signature might be too coarse, etc.
O(shipping savings)?
- Savings are linear in the bypassed data size
- Savings are linear in the number of bypassed actors
Experimental Evaluation
Experimental environment
- Cluster of 40 Linux nodes
- 100 Mbit/s network
- Shared-nothing architecture
Parallel workflow system's specs
- Workflow system written in C++/Perl
- PVM as distribution "library"
- Each actor runs as a single process
- PVM for passing XML tokens
- Data stored as local files; scp for passing data tokens
Experimental Setup
Sample workflow
- Chosen to study the impact of data-shipping savings alone!
- A data-intensive but not CPU-intensive pipeline of three actors: $A_1$, $A_2$, $A_3$
- $\Delta_{A_1} : \sigma = \{A \to B \mid U\}$
- $\Delta_{A_2} : \sigma = \{B \to C \mid V\}$
- $\Delta_{A_3} : \sigma = \{C \to W\}$
One Workflow, Different Behaviors
Scenario     | Actors                                                                                                                               | Input data
(a) Parallel | $A_1 : \langle a \rangle \mapsto \langle u \rangle$, $A_2 : \langle b \rangle \mapsto \langle v \rangle$, $A_3 : \langle c \rangle \mapsto \langle w \rangle$ | s[ (a[z] b[z] c[z] w[z]) * i ]
(b) Serial   | $A_1 : \langle a \rangle \mapsto \langle b \rangle$, $A_2 : \langle b \rangle \mapsto \langle c \rangle$, $A_3 : \langle c \rangle \mapsto \langle w \rangle$ | s[ a[z] * i ]
(c) Mixed    | $A_1 : \langle a \rangle \mapsto \langle b \rangle$, $A_2 : \langle b \rangle \mapsto \langle v \rangle$, $A_3 : \langle c \rangle \mapsto \langle w \rangle$ | s[ (a[z] * i) (c[z] * i) ]
[Figure: for each scenario, the input workflow $A_1$, $A_2$, $A_3$ and the actual dataflow between $D_0$, the actors, and $M_4$.]
Experimental Analysis
Scenario     | Data shipped (MB)      | Exec. time (sec)
             | orig. | opt. | change  | orig.  | opt.   | change
(a) Parallel | 80i   | 35i  | −56%    | ≈ 3.6i | ≈ 1.1i | −69%
(b) Serial   | 80i   | 80i  | 0%      | ≈ 3.6i | ≈ 2.6i | −28%
(c) Mixed    | 80i   | 50i  | −38%    | ≈ 3.6i | ≈ 2.2i | −39%
[Figure: input data, input workflow, and actual dataflow for each scenario, as on the previous slide.]
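The reported percentages follow from the original and optimized values as a relative change; the factor i cancels, so the values per unit of i are enough. The sketch below recomputes them. Pairing each group of numbers with a scenario is an assumption based on the order in which the scenarios are listed, and the names `relative_change` and `MEASUREMENTS` are hypothetical.

```python
def relative_change(orig, opt):
    """Relative change from orig to opt, in percent (negative means savings)."""
    return 100.0 * (opt - orig) / orig


# Values per unit of i; the scenario assignment is assumed from the listing order.
MEASUREMENTS = {
    "parallel": {"shipped_mb": (80, 35), "time_sec": (3.6, 1.1)},
    "serial": {"shipped_mb": (80, 80), "time_sec": (3.6, 2.6)},
    "mixed": {"shipped_mb": (80, 50), "time_sec": (3.6, 2.2)},
}

for scenario, row in MEASUREMENTS.items():
    shipped = round(relative_change(*row["shipped_mb"]))
    runtime = round(relative_change(*row["time_sec"]))
    print(f"{scenario}: data shipped {shipped}%, execution time {runtime}%")
```

Rounded to whole percent, the computed values agree with the percentages shown in the measurements.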
Runtime Measurements
[Figure: three panels (Parallel, Serial, Mixed), each plotting individual and average values for the original and the optimized workflow; the horizontal axis runs from 0 to 20 and the vertical axis from 0 to 80.]
Conclusion
Scientific Workflows & COMAD
- COMAD: a model of computation with many advantages
- Resilience to input changes
- Supporting workflow evolution
- Powerful configuration layer needed
Contributions
- Expressive type system
- Defined actor signatures
- Algorithm for type propagation
- Analysis of actor dependency
- Dataflow optimization
Bibliography
A. Brüggemann-Klein, M. Murata, and D. Wood. Regular tree and regular hedge languages over unranked alphabets. Unpublished, 2001.
H. Comon, M. Dauchet, R. Gilleron, F. Jacquemard, D. Lugiez, S. Tison, and M. Tommasi. Tree automata techniques and applications. http://www.grappa.univ-lille3.fr/tata, 1997.
H. Hosoya, J. Vouillon, and B. C. Pierce. Regular expression types for XML. ACM Transactions on Programming Languages and Systems (TOPLAS), 2005.
E. A. Lee and T. Parks. Dataflow process networks. Proceedings of the IEEE, 83(5):773–799, May 1995.
T. M. McPhillips and S. Bowers. An approach for pipelining nested collections in scientific workflows. SIGMOD Record, 34(3):12–17, 2005.
Thank you.
Questions?