Variant Pipeline Tools reborn as
Script of Scripts
Bo Peng, Ph.D.
Department of Bioinformatics and Computational Biology
The University of Texas, MD Anderson Cancer Center
Mar. 30th, 2016. Updated on Apr. 7th for SoS release 0.5.7
Why workflow management?
Bioinformatics analyses usually involve the application of various tools and scripts.
Workflow management systems handle
• Creation and management of workflows (viewer, editor, GUI, repository, …)
• Execution of workflows under different environments (cluster, docker, cloud, …)
• Execution management (resource management, suspension, resume, …)
• Logging and data provenance (monitor, logging, reproducibility, …)
There are already big players in the field
Why did we start VPT?
• Embed into Variant Tools (VT) to provide VT pipelines through the VT repository
• Reasonably powerful and flexible
• But easy to read, write, and share
• No suitable tool for this task
  – Snakemake: a Make-style system does not work for VT, where commands change existing files
  – CWL, Galaxy, etc.: too bulky for our purposes
History of Variant Pipeline Tools
• 2013: VPT was created as a small Variant Tools add-on to execute internal Variant Tools commands
• 2014: VPT was expanded to execute Variant Tools related pipelines
• 2015: VPT was expanded to execute more bioinformatics pipelines, and Variant Simulation Tools
• 2016: VPT was redesigned and rewritten as Script of Scripts
Lessons learned
• Repository is good; being tied to VT is bad
• Simple format is good; too simple can be limiting
• Flexibility is good; too much flexibility can be dangerous
Script of Scripts
Script of Scripts (SoS) is a lightweight workflow system that helps you organize your commands and scripts in different languages into readable workflows.
• Input-driven workflow system
• Based on Python
• Script and user friendly
Basic SoS
• Basic format
• Command line argument
• Input and output files
• Parallelization
A bioinformatics workflow
• A shell script to align reads to the reference genome
• An R script to analyze the results
Script of Scripts
Annotated example script (figure), with these parts labeled:
• Script of Scripts header
• Workflow description
• Steps with logical order of execution
• Shell script
• R script
Execute the script
Command line arguments
• Parameters
• String interpolation
Specify input and output files
• Output of step 1 (no input)
• Input, output and dependent files of step 2
• Output of step 3 (input is the output of step 2)
• Use of SoS variables input and output
Ignore executed steps
• Reuse existing results
• Re-execute with different input files
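The idea of reusing existing results and re-executing only when the input files change can be sketched in plain Python. This is an illustration of the concept, not how SoS implements it: the function names `signature` and `run_if_needed`, the SHA-256 content hash, and the separate record file are all assumptions made for this example.

```python
import hashlib
from pathlib import Path


def signature(paths):
    """Hash the names and contents of the input files (illustrative scheme)."""
    digest = hashlib.sha256()
    for p in sorted(str(p) for p in paths):
        digest.update(p.encode())
        digest.update(Path(p).read_bytes())
    return digest.hexdigest()


def run_if_needed(inputs, output, action, record):
    """Run action unless output exists and the inputs are unchanged.

    Returns True if the action ran and False if the step was skipped.
    `record` is a hypothetical file that stores the last input signature.
    """
    output, record = Path(output), Path(record)
    current = signature(inputs)
    if output.exists() and record.exists() and record.read_text() == current:
        return False  # reuse the existing result
    action(inputs, output)
    record.write_text(current)
    return True
```

Calling `run_if_needed` twice with unchanged inputs runs the action only the first time; editing an input file, or passing a different one, triggers re-execution.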
Execute Workflow in Parallel
• Input files are sent one by one to _input
• Input files are paired with sample_type
• Each _input has a corresponding _sample_type
• Execute in parallel
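A minimal Python sketch of the pairing described here: each input file is handed to the step body together with its matching sample type, and the calls run concurrently. The file names, the `process` body and the use of a thread pool are assumptions for illustration and do not reflect SoS internals.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical inputs: file names and a parallel list of sample types.
input_files = ["a.bam", "b.bam", "c.bam"]
sample_type = ["tumor", "normal", "tumor"]


def process(_input, _sample_type):
    """Stand-in for the body of a step; here it only builds a label."""
    return f"{_input}:{_sample_type}"


def run_step(files, types, max_workers=2):
    """Send files one by one to the step body, each with its paired value."""
    if len(files) != len(types):
        raise ValueError("paired variable needs one value per input file")
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order even if calls finish out of order.
        return list(pool.map(process, files, types))


results = run_step(input_files, sample_type)
print(results)
```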
Summary
• SoS scripts consist of (meaningful) comments, scripts and commands, and SoS-specific syntax
• Command sos with subcommands show, dryrun and run
• Workflows are defined in logical order; all steps will be executed
• Scripts in different languages can be included verbatim
• Scripts can be customized using user- and SoS-defined variables and command line arguments
• Specification of input and output files is not required, but is helpful
• Workflows can be executed in parallel, and can be resumed while skipping executed steps
Intermediate SoS
• String interpolation
• Shared variables
• Step process
• Execution
String Interpolation
Arbitrary Python expressions and statements are allowed.
• : for format specification
• !r for object representation
• !q for shell quote
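The three modifiers have close analogues in plain Python, sketched below. The `:` format specification and the `!r` conversion are standard `str.format` features. Python has no `!q` conversion, so `shlex.quote` is used here as a stand-in for shell quoting. The variable names and the `wc -l` command are made up for the example.

```python
import shlex

filename = "my sample.txt"  # hypothetical file name containing a space
ratio = 0.123456

# ':' applies a format specification.
formatted = "{:.2f}".format(ratio)

# '!r' inserts the object representation (repr) of the value.
represented = "{!r}".format(filename)

# Shell quoting keeps the file name a single argument on the command line.
quoted = shlex.quote(filename)
command = "wc -l {}".format(quoted)

print(formatted)
print(represented)
print(command)
```

Without quoting, a shell would split the file name at the space and pass two arguments to the command.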
Step input, output, depends
• By default a step gets its input from the output of its previous step
• Strings, variables, expressions are allowed
• Wildcard characters are expanded
• Step input defines variable input
• Step output defines variable output
• Step depends defines variable depends
Variables, expressions and statements
• sorted_bam is a parameter or global variable
• output can be derived from input
• Option alias exposes step variables to later steps (read-only)
• Arbitrary Python statements can be used in a step
SoS actions and step process
• check_command and run are both SoS actions
• check_command is executed in both dryrun and run modes
• run is executed only in run mode
• process starts a step process that accepts runtime options (e.g. concurrent=True)
• action: options script is a shortcut for process: options action('script')
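The difference between an action that runs in both modes and one that runs only in run mode can be illustrated with a small Python sketch. The functions below borrow the names `check_command` and `run` for readability, but their signatures, the `action` decorator and the use of a Python snippet in place of a shell script are assumptions for this example, not the SoS API.

```python
import functools
import shutil
import subprocess
import sys


def action(modes):
    """Mark a function as an action that only fires in the given modes."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(mode, *args):
            if mode not in modes:
                return None  # the action is skipped in this mode
            return func(*args)
        return wrapper
    return decorator


@action(modes=("dryrun", "run"))
def check_command(cmd):
    """Report whether an executable named cmd can be found on PATH."""
    return shutil.which(cmd) is not None


@action(modes=("run",))
def run(script):
    """Execute a script; a Python snippet stands in for a shell script."""
    return subprocess.run([sys.executable, "-c", script]).returncode


for mode in ("dryrun", "run"):
    print(mode, check_command(mode, "python3"), run(mode, "print('hello')"))
```

In dryrun mode the check still happens while the script itself is not executed, which lets missing tools be reported before any real work starts.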
Execution of simple workflows
Workflow with no defined input and output files: Step 1, Step 2, Step 3, Step 4
• Workflow steps are executed sequentially
• Steps are executed in separate processes
• Step loops can be executed in parallel
Execution of complex workflows
Workflow with defined input and output
[Figure: Logical View and Processing View of a workflow with Steps 1 to 8, each step annotated with its input and output files]
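One way a processing order could be derived from declared input and output files is a topological sort over "who produces what", sketched below. The four steps and their files are hypothetical and do not reproduce the figure, the sketch says nothing about how SoS schedules steps, and it assumes Python 3.9 or later for the standard `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical steps: name -> (input files, output files).
steps = {
    "step1": ([], ["a1"]),
    "step2": (["a1"], ["a2"]),
    "step3": (["a1"], ["a3"]),
    "step4": (["a2", "a3"], ["a5"]),
}


def processing_order(steps):
    """Order steps so each one runs after the steps that produce its inputs."""
    producer = {}
    for name, (_, outputs) in steps.items():
        for path in outputs:
            producer[path] = name
    # Map each step to the set of steps it depends on.
    graph = {name: set() for name in steps}
    for name, (inputs, _) in steps.items():
        for path in inputs:
            if path in producer:
                graph[name].add(producer[path])
    return list(TopologicalSorter(graph).static_order())


order = processing_order(steps)
print(order)
```

Steps that do not depend on each other, such as `step2` and `step3` here, may appear in either order and could run at the same time.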
Advanced SoS
• Step options
• Multiple workflows
• Sub- and combined-workflows
• Nested workflows
group_by, for_each and paired_with
Without group_by and for_each, paired_with='var'
• Step input: input = _input, var = _var; input is sent all at once as _input
• Step output: _output = output; _output becomes output
With group_by and/or for_each, paired_with='var'
• Step input: subsets of input become _input with matching subsets of var (_var)
• Step output: each _input, _var produces an _output; output is the collection of _output
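The grouping behavior can be mimicked in a few lines of Python: the input list is cut into subsets, the paired variable is cut the same way, the step body runs once per subset, and the overall output is the collection of the per-group outputs. The integer group size, the function names and the output naming are assumptions for illustration; the values that group_by itself accepts are not shown on the slide.

```python
def split_groups(items, size):
    """Cut a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_step(step_input, var, size, body):
    """Call body once per (_input, _var) group and collect every _output."""
    if len(step_input) != len(var):
        raise ValueError("paired variable needs one value per input item")
    output = []
    groups = zip(split_groups(step_input, size), split_groups(var, size))
    for _input, _var in groups:
        _output = body(_input, _var)
        output.extend(_output)  # output is the collection of _output
    return output


def body(_input, _var):
    """Hypothetical step body: name one result after the group and its label."""
    return ["_".join(_input) + "." + _var[0]]


step_input = ["a1", "a2", "b1", "b2"]
var = ["x", "x", "y", "y"]
print(run_step(step_input, var, 2, body))
print(run_step(step_input, var, 4, body))
```

With a group size equal to the number of inputs, the whole input is sent at once, which corresponds to the case without grouping.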
Input option group_by
Input option for_each
Input and output option pattern
• Input option pattern matches a pattern against input file names
• It creates variables name and par that are paired with variable input
• Output option pattern creates output filenames from list variables
• Loop variables _name, _par and _output can also be used in case of input loops
• Specifying partial output (_output) allows SoS to have finer control of the execution
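The pattern mechanism can be approximated in Python with a regular expression built from a `{field}` template: matching it against the input file names yields one list per field, and the same lists can fill an output template. The template syntax, the helper names and the file names below are assumptions for illustration, not the SoS pattern syntax.

```python
import re


def pattern_to_regex(pattern):
    """Turn a template such as '{name}_{par}.txt' into a compiled regex."""
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        # Odd positions are field names, even positions are literal text.
        regex += f"(?P<{part}>.+?)" if i % 2 else re.escape(part)
    return re.compile(regex)


def extract(pattern, filenames):
    """Return one list per field, aligned with the list of file names."""
    regex = pattern_to_regex(pattern)
    fields = {name: [] for name in regex.groupindex}
    for filename in filenames:
        match = regex.fullmatch(filename)
        if match is None:
            raise ValueError(f"{filename} does not match {pattern}")
        for name, value in match.groupdict().items():
            fields[name].append(value)
    return fields


def expand(pattern, fields):
    """Create one output file name per position of the field lists."""
    names = list(fields)
    rows = zip(*(fields[name] for name in names))
    return [pattern.format(**dict(zip(names, row))) for row in rows]


fields = extract("{name}_{par}.txt", ["a_1.txt", "b_2.txt"])
outputs = expand("{name}_{par}_result.csv", fields)
print(fields)
print(outputs)
```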
Multiple workflows
• Single default workflow
• Single named workflow
• Multiple named workflows
• Default and named workflow
• Shared workflow steps
• Wildcard workflow steps
Shared workflow step
• step_name can be either mouse_20 or human_20
• The statement determines which reference genome to use from step_name
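A statement of this kind could look like the following Python sketch, which takes the workflow name from the part of step_name before the underscore and looks up a reference genome. The dictionary and its file names are hypothetical placeholders.

```python
# Hypothetical mapping from workflow name to reference genome file.
REFERENCES = {
    "mouse": "mouse_reference.fa",
    "human": "human_reference.fa",
}


def reference_for(step_name):
    """Pick the reference genome from a step name such as 'mouse_20'."""
    workflow = step_name.split("_")[0]
    return REFERENCES[workflow]


print(reference_for("mouse_20"))
print(reference_for("human_20"))
```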
Sub-workflow and Combined workflow
• A sub-workflow consists of one or more consecutive steps of a workflow
• A combined workflow consists of one or more sub-workflows
• sos commands accept regular, sub-, and combined workflows
Nested workflow
A nested workflow is a workflow executed within a SoS step by action sos_run
• Customized input files from output of two previous steps
• Customized output file generator
• A nested workflow from three sub-workflows
• One or more sub-workflows defined in another script
• Execute complete workflows 100 times with different random seeds
Status of SoS
• Hosted at https://github.com/BoPeng/SOS/
• Requires Python 3.3 or higher
• Install using command pip3 install sos
• Most features have been implemented and well-tested
• Pending (but not in a hurry) features:
  – Native Docker support (build and run docker containers)
  – Auxiliary step (snakemake-like rules)
  – Dynamic DAG (directed acyclic graphs)
  – Celery task management (cluster systems)
  – Resource management (CPU, RAM, etc)
  – Execution monitor (monitor status of jobs)
  – Post-execution analysis (command sos summary)
  – …
• Version 0.5.9, expect changes and bugs, but you know how to reach me for help
Acknowledgements
• Dr. Gao Wang
• Dr. Paul Scheet
• Dr. Suzanne Leal
• Dr. John Weinstein
• and others
• Grant 1R01HG005859 (Dr. Paul Scheet)
• The Michael and Susan Dell Foundation
• The Chapman Foundation