Multi-granularity parallelization for scientific workflow management

Michael J. Pan, Arthur W. Toga
University of California Los Angeles
{mjpan,toga}@loni.ucla.edu

Abstract

Exponential growth in the complexity and computational requirements of scientific research necessitates the synergy of scientific workflow management systems and distributed and grid computing solutions. This paper explores strategies for parallelism at multiple levels of workflow granularity in order to optimize workflow execution. These strategies are discussed in the context of the Pipeline[16], a scientific workflow management system that employs them to optimize workflow execution in distributed and grid computing environments.

1 Introduction

Exponential growth in the complexity and computational requirements of scientific research necessitates the synergy of scientific workflow management systems and distributed and grid computing solutions. This paper explores strategies for parallelism at multiple levels of workflow granularity in order to optimize workflow execution in distributed and grid computing environments. A workflow is a collection of steps, data, and related information that define the paths required to complete a task. It includes information on the task structure, the relative task order, task synchronization, and so forth. Workflows generally fall into one of two categories: business process workflows and scientific workflows. Business workflows are formal descriptions of the structure of business processes and role hierarchies in the organization within which these processes will be executed. The environments that manage these workflows are concerned with the electronic documentation of the paper trail required in business processes such as student course registration, order processing, and document publication. Research in the field began in the early 1970s with the goal of office automation of business processes. Some examples of past research into business workflows include FlowMark[1], INCAS[3] and InConcert[17]. Today there is almost no research into business workflow management;

the field is generally considered entirely a commercial endeavor. Scientific workflows are a more recent phenomenon, the result of the ever growing availability of algorithms, data, and data collections in daily scientific research. Scientific workflow management systems, in addition to managing processes like business workflow management systems, must also be data oriented. That is, the goal of executing a workflow is not just the execution of its processes, but also the management (production, storage, query, and validation) of source and derived data. The simultaneous management of both data and processes, coupled with ever increasing computational requirements, introduces complexity and scalability issues into scientific in silico experiments. Scientific workflow management systems were developed to manage these issues of complexity and scalability. Examples of active research into scientific workflow management can be found in the Kepler[2], Taverna[15], Triana[5], GridDB[14], Pipeline[16], SCIRun[12], and Pegasus[7] projects. Due to the requirements of scientific workflows, integration with distributed and grid computing environments and an ability to optimize workflow execution parallelism within them are critical to the success of a scientific workflow management system. In this paper, we examine strategies for parallelism at increasingly finer granularities of a workflow. Figure 1 shows how the same workflow can be viewed at different levels of encapsulation; parallelization strategies need to address all of these levels. The following sections start with a discussion of the parallelization of workflows (section 3), followed by module parallelization (section 4), and finally dataflow parallelization (section 5). Additionally, in section 6 we investigate how control flow structures (i.e., loops and branches) interact with the implementation of these strategies. These strategies are discussed in the context of the Pipeline[16], a scientific workflow management system that employs them to optimize workflow execution in distributed and grid computing environments.

Figure 1. Varying levels of workflow encapsulation

2 The Pipeline

The Pipeline[16] is a scientific workflow management system that has its roots in the neuroimaging domain. It is built upon a modular framework, with a clear separation between the data, execution, and user interface components. It is also grid enabled, supported by grid computing infrastructures including, but not limited to, Condor[18], Globus[9], and DRMAA (http://drmaa.org) enabled grid engines. In the following sections we outline the data model and the execution monitors that manage the workflow components used in the Pipeline. The data model characterizes the data structures and the relationships between these structures which are necessary for the refinement and execution of workflows. The execution monitors are responsible for all tasks associated with the execution of the workflow (or portions thereof) to which they are assigned.

2.1 Data model

Function definitions are the building blocks of a workflow; they describe the functions available in the system for constructing workflows. A definition includes both a specification of the functionality and the parameters of the described function, and can be one of two types: atomic functions and composite functions. An atomic function definition describes a single function. It generally describes a system executable that can be run on behalf of the user, but it can also describe a script to be submitted to another program, including programming environments such as Perl, Python, and Tcl, as well as mathematical applications such as Matlab, R, and Octave. A composite function provides an abstraction of a set of functions (its child functions) as a single function. All internal details of the composite function (including child functions and their parameters) are encapsulated; access to the internals is available only through the composite function's parameters, which expose selected child function parameters. Module definitions instantiate defined functions. Modules, just like function definitions, can be either atomic or composite, depending upon the function being instantiated. Each module definition specifies the parameters of the function which it activates. Required parameters of the function are always activated, while optional parameters may or may not be activated. The module definition also specifies the binding of variables to the activated function parameters. The variables contain the values to be passed in the function call. A variable can be a single value or a list of values. If it is the latter, the module becomes a form of parameter sweep application (PSA). PSAs are a class of applications that enable data analysis over large parameter spaces. A module that is a PSA will, upon execution, spawn multiple tasks, each with different values for its input parameters.
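As a concrete illustration of this data model, the following sketch (in Python, with hypothetical class and field names that stand in for whatever representation the Pipeline actually uses) captures atomic and composite function definitions, module instantiation, variable binding, and the way a multi-valued binding makes a module a PSA:

# Hypothetical sketch of the data model described above; names are illustrative.
from dataclasses import dataclass, field
from itertools import product
from typing import Dict, List, Union

@dataclass
class AtomicFunction:
    """Describes a single executable or script available for use in workflows."""
    name: str
    executable: str                              # e.g. a system binary, or a Perl/Python/R script
    parameters: List[str] = field(default_factory=list)

@dataclass
class CompositeFunction:
    """Abstracts a set of child functions behind a selected set of exposed parameters."""
    name: str
    children: List[Union[AtomicFunction, "CompositeFunction"]]
    exposed: List[str] = field(default_factory=list)

@dataclass
class Module:
    """Instantiates a function and binds variables to its activated parameters."""
    function: Union[AtomicFunction, CompositeFunction]
    bindings: Dict[str, List[str]]               # parameter name -> single value or list of values

    def is_parameter_sweep(self) -> bool:
        # Any multi-valued binding makes this module a PSA.
        return any(len(v) > 1 for v in self.bindings.values())

    def tasks(self) -> List[Dict[str, str]]:
        # One task per combination of bound values; a PSA therefore spawns multiple tasks.
        names = list(self.bindings)
        return [dict(zip(names, combo)) for combo in product(*self.bindings.values())]

For example, binding n input files and m parameter values to a module yields n x m tasks, matching the behavior described for PSAs.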

Figure 2. An unsorted workflow graph

3 Workflow parallelization

In this section we briefly discuss workflow parallelization. The ability to support the execution of multiple workflows in parallel is almost an assumed requirement. Within a single research environment, there may be multiple researchers and multiple workflows executed by each researcher. To support multiple workflows, each workflow, when submitted for execution, is assigned its own execution monitor. The monitor is responsible for all tasks associated with the execution of that particular workflow, including file staging, synchronization, execution, and user notification. In the following sections, this basic implementation of workflow monitoring is extended to support parallelization at finer granularities.
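A minimal sketch of this per-workflow assignment, assuming each monitor runs as its own thread (the class and method names are assumptions, not the Pipeline's actual interfaces):

# Sketch: each submitted workflow receives its own execution monitor.
import threading

class WorkflowMonitor(threading.Thread):
    def __init__(self, workflow):
        super().__init__(daemon=True)
        self.workflow = workflow

    def run(self):
        self.stage_files()      # file staging for the workflow's modules
        self.execute()          # synchronization and execution
        self.notify_user()      # user notification upon completion

    def stage_files(self): ...
    def execute(self): ...
    def notify_user(self): ...

def submit_workflow(workflow):
    monitor = WorkflowMonitor(workflow)
    monitor.start()             # workflows from different users run concurrently
    return monitor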

4 Module level parallelism

In this section we describe the implementation of module level parallelism in workflow execution, which addresses parallelization within a workflow. Given any workflow, a scientific workflow management system needs to maximize parallelization to optimize the execution of the modules of the workflow. Support for parallelization of workflow modules can either require explicit specification on the part of users, or be implicit and automatically determined on their behalf. AGWL[8] requires users to designate parallel paths in a workflow. Others require users to specify parallelization with respect to PSAs. In this section, we discuss only the former (support of parallel paths in workflows) and leave the discussion of the latter (parallelization with respect to PSAs) to section 5. Requiring users to explicitly specify parallel portions of a workflow suffers from two issues. First, users may not realize all possible parallel paths that exist within the workflow.

Figure 3. Topologically sorted workflow graph

As such, depending only upon the user to specify possible parallelizations may not result in an optimized execution, and the system should be able to automatically deduce as many parallelizations as possible. Second, it is tedious for users to have to specify the parallel paths when the system can do so on their behalf. In order to optimize parallelization and relieve users of this specification task, our data model does not require the user to indicate parallel components. While approaches that require explicit specification assume serial execution, we take the opposite approach and assume parallelism unless otherwise specified. We note that within a workflow, two modules can be simultaneously executed unless one is dependent upon the other, where module B is dependent upon module A if B requires as input an output of A. In the construction of their workflows, users specify this dependency when they specify data connections between modules. As such, unconnected modules are assumed to be parallelizable. For example, in figure 2, modules A and C are not dependent upon each other and are therefore determined to be able to execute in parallel. Once the user has completed the workflow refinement process, we build a dependence graph1 from the workflow using the definition of data dependence described above. A dependence graph is a directed acyclic graph (DAG) G = (V, E) where V, the vertices, is the set of modules, and E, the edges, is the set of dependencies. Each edge e is a dependency such that if e = (u,v) where u,v ∈ V, then v depends on u. The following sections discuss various strategies for module parallelism given this dependence graph.
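A sketch of how such a dependence graph can be derived from the user-specified data connections (the representation and the edges shown for figure 2 are inferred from the discussion and are illustrative):

# Build a dependence DAG G = (V, E) from data connections:
# an edge (u, v) means module v consumes an output of module u, i.e. v depends on u.
from collections import defaultdict

def build_dependence_graph(modules, connections):
    """modules: module ids; connections: (producer, consumer) pairs from data links."""
    successors = defaultdict(set)
    for producer, consumer in connections:
        successors[producer].add(consumer)
    return {m: successors[m] for m in modules}

# Edges consistent with the workflow of figure 2: A and C share no dependency,
# so they may execute in parallel.
graph = build_dependence_graph(
    ["a", "b", "c", "d", "e"],
    [("a", "b"), ("b", "d"), ("c", "d"), ("d", "e")],
)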

4.1 A naive implementation

In this section we present a naive strategy for module parallelization, shown as pseudocode in Algorithm 1.

1 We discuss graph cycles in section 6.

Figure 4. Optimal sortings of Figure 2 for Algorithm 1

Having constructed the workflow's dependence graph, a topological sorting of the graph determines the order in which the modules are to be executed. Given a DAG, a topological sorting of the graph is a linear ordering of all its nodes such that for any edge (u,v), u appears before v in the ordering. This sorting is well suited to linearly ordering event precedence and event dependencies. We refer the reader to [6] for more information on topological sorting. Figure 2 is a simple workflow dependence graph prior to the application of a topological sorting. Having topologically sorted the workflow modules (figure 3), the composite module monitor progresses once through the list to determine whether the head of the list can be submitted for execution. If all the predecessors of the head have executed, then the head can be submitted. Otherwise, the CMM waits until the predecessors of the head module have executed.

Algorithm 1 Naive module parallelization
1: L ← topologicalSort(Workflow)
2: while L is not empty do
3:   H ← dequeue(head(L))
4:   P ← predecessors(H)
5:   repeat
6:     wait
7:   until completed(P)
8:   submit H
9: end while

While topological sorting guarantees correctness in the order of execution, the ordering is not necessarily optimal. For example, in the workflow shown in figure 2, the execution of module C is not conditional on any other module in the workflow, so C should be able to start as soon as the workflow is submitted for execution and run parallel to module A. Yet in the sorting shown in figure 3, module C does not execute until module B is submitted for execution and runs parallel to module B.

If the execution time of C is less than that of B, then all is well. However, if C takes longer to execute, then it would have been more efficient to start C along with A. This example illustrates why a blocking algorithm such as Algorithm 1 is non-optimal: the parent monitor only watches the head of the sorted list to determine if the next module can execute, when other modules that are ready to execute may be queued behind the head of the list. Figure 4 shows two optimal sortings for the example workflow. However, for an arbitrary workflow, even if we were to run the sort many times from different starting locations, the sort does not guarantee the ability to generate all possible sortings. As such, the optimal sort may never be generated, and this implementation will not be able to maximize module parallelism.
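Transcribed into Python, the naive strategy might look as follows; topological_sort, predecessors, completed, and submit are assumed helper functions corresponding to the pseudocode:

# Blocking (naive) module parallelization, following Algorithm 1.
from collections import deque
import time

POLL_INTERVAL = 1.0

def run_naive(graph):
    order = deque(topological_sort(graph))          # linear ordering consistent with the DAG
    while order:
        head = order.popleft()
        while not all(completed(p) for p in predecessors(graph, head)):
            time.sleep(POLL_INTERVAL)               # block: modules queued behind the head must
                                                    # wait, even if they are already ready to run
        submit(head)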

4.2 A non-blocking implementation

In this section we refine Algorithm 1 to address the issues with the naive implementation. Because of its blocking nature, the naive implementation required the topological sorting; otherwise, there may be a deadlock in the workflow execution, as the head of the list and one of its dependency modules (located later in the list) circularly wait for each other. The non-blocking implementation is shown in Algorithm 2.

Algorithm 2 Non-blocking module parallelization
1: L ← circularQueue(modules(Workflow))
2: while L is not empty do
3:   H ← head(L)
4:   P ← predecessors(H)
5:   if completed(P) then
6:     submit H
7:     dequeue(H)
8:   else
9:     head(L) ← next(L)
10: end if
11: end while

To begin, we first build a circular queue from the set of child modules. Whether this list is topologically sorted is of no importance, as the monitor will continuously iterate over all the elements of the queue. For every module, we check whether the predecessors of the module have completed. If so, the module is submitted for execution. Otherwise, the algorithm proceeds to the next module. In this manner, no module ready for execution has to wait for any module which is not its predecessor. We note that a more efficient variant of the non-blocking algorithm iterates over all the modules only once, starting those with no predecessors; as each of those completes execution, the monitor then checks only that module's successors.
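A Python rendering of the non-blocking strategy, again with predecessors, completed, and submit as assumed helpers:

# Non-blocking module parallelization, following Algorithm 2: iterate a circular
# queue and submit any module whose predecessors have all completed.
from collections import deque

def run_non_blocking(graph):
    queue = deque(graph)                            # order is unimportant; no deadlock is possible
    while queue:
        head = queue[0]
        if all(completed(p) for p in predecessors(graph, head)):
            submit(head)
            queue.popleft()                         # remove the submitted module
        else:
            queue.rotate(-1)                        # move on instead of blocking on this module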

4.3 Local module coordination

In this section we change the coordination of module execution from a global to a local model. Until now, the modules have been globally coordinated; that is, all modules of a workflow are coordinated by a single monitor. As such, there is an inherent bottleneck in this system, as the determination (and pre/post processing, i.e., staging, synchronization, execution, and notification) of which module(s) can be submitted is performed serially. To solve this issue, we move from a global coordination paradigm (where there is only one execution monitor for all the child modules) to a local coordination paradigm (where every child module gets its own monitor). We note that in our data model, a workflow is simply a composite module. The basic workflow monitor introduced in section 3 is modified to become a module monitor (MM), responsible solely for tasks related to its assigned module, leaving child module related tasks to their corresponding monitors. Each MM determines when the module for which it is responsible can be submitted for execution. In turn, each module monitor is managed by its parent module monitor. A module that is ready to be executed will now be submitted for execution immediately by its monitor, and will not have to wait for its predecessor in an arbitrary partial ordering to execute. There are three possible implementations of local coordination: push, pull, and hybrid. Each provides a different answer to how monitors are informed of the status of preceding monitors and determine when they can submit themselves for execution. The following sections discuss these three implementations and their support of module level parallelism.

4.3.1 Push paradigm

In this section, we describe the basic implementation of local module execution monitoring, which is a push implementation, shown in Algorithm 3.

Algorithm 3 Monitor push implementation
1: Preprocess
2: Execute module
3: Postprocess
4: Start next module

In this implementation, a parent monitor traverses an ordering (not necessarily topologically sorted) of its child module monitors and starts the ones which have no predecessors in the workflow. Each child module monitor which is started is then responsible for the preprocessing, execution, and postprocessing of its assigned module, just as before. Additionally, once it has completed, it queries the parent monitor for its dependent monitors, i.e. the monitors whose input depends upon its output, and starts them for execution. A small modification to the module monitors is that each monitor needs to be able to query its parent monitor for information about the dependence graph at the level of its siblings, as well as answer those same questions at the level of its children. Alternatively, it can inform the parent monitor of its completion, which then informs the dependent module monitors in turn.

4.3.2 Pull paradigm

In this section, we describe the pull implementation of local module execution monitoring, in which every monitor is responsible for querying its predecessors to see if they have finished and their data is available.

Algorithm 4 Monitor pull implementation
1: P ← predecessors(self)
2: for p ∈ P do
3:   while p not completed do
4:     wait
5:   end while
6: end for
7: Preprocess
8: Execute module
9: Postprocess

In section 4.3.1 we discussed the push paradigm for execution, whereby execution of modules is pushed from a completing monitor to the next module by that monitor. Alone, the push implementation is insufficient in its ability to handle all possible scenarios, as it works only if each module has a single input. For example, in figure 2, if module C finishes before B finishes, and C starts module D, an exception occurs, because the data that D needs from B is not yet available. To address the possibility that a module may have multiple inputs and cannot be started by any single predecessor module, we discuss an alternative to the pushing of execution, in which each monitor employs a pull paradigm to determine its own execution time. This process is shown in Algorithm 4. In this implementation, the parent module monitor traverses all its child module monitors and starts all of them. We note here that, unlike in previous implementations, starting a monitor does not imply executing its module. Once started, a module monitor queries its parent for its predecessors. This is similar to the push implementation's query for successor modules, but in the opposite direction. Having obtained the set of its preceding monitors, the monitor then polls each of them and waits until all of them have completed. Once all predecessors have completed, the monitor executes the module. In this manner, modules are guaranteed to be executed in the correct order, and only when all their dependencies have completed execution.

Figure 5. Workflows requiring AM level parallelism

4.3.3 Hybrid paradigm

In this section, we describe the hybrid implementation of local module execution monitoring. In sections 4.3.1 and 4.3.2 we discussed two contrasting paradigms for execution: pushing and pulling. While the pull implementation addresses the fundamental weakness of the push implementation, it suffers from a drawback of its own, namely high resource usage. The continuous polling for the completion status of its predecessors (line 3) unnecessarily consumes processor cycles, and is not scalable in the case of large workflows with many monitors. Scalability issues become even more pronounced when a workflow's components are distributed across a network with limited bandwidth. In this section, we introduce a hybrid approach, which employs both push and pull implementations to combine their strengths and overcome their individual weaknesses.

Algorithm 5 Monitor hybrid implementation
1: if P is undefined then
2:   P ← predecessors(self)
3: end if
4: p ← predecessor that just notified self
5: remove p from P
6: if P is not empty then
7:   return
8: end if
9: Preprocess
10: Execute module
11: Postprocess
12: Notify next monitor

The algorithm, shown in Algorithm 5, is as follows. The parent module monitor traverses the set of modules and starts all of them. Upon instantiation, a module monitor initializes a collection containing its predecessors. Once it has been determined that a predecessor has completed, the predecessor is removed from this collection. This is the "pull" portion of the algorithm. If this collection of predecessors is not empty, the monitor goes back to sleep. We note, however, that there is no polling upon "sleep"; the monitor simply does nothing, and it wakes only when another predecessor completes and notifies it. Once this collection is empty, i.e. all predecessors have completed, the monitor executes the module. After completion, the monitor "pushes" execution to the next monitors, notifying them to begin. The hybrid implementation successfully combines the correctness of the pull implementation with the low resource usage of the push approach.

Discussions in the rest of the paper will be based upon this hybrid approach, whereby monitors notify their dependent monitors, but each monitor waits until it is notified by all predecessors before starting.
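A condensed sketch of the hybrid behavior; the class is illustrative and omits the pre/post processing details:

# Hybrid monitor following Algorithm 5: hold the set of predecessors, sleep
# without polling, and run only once the set has emptied; then push onward.
import threading

class HybridMonitor:
    def __init__(self, module, predecessors, successors):
        self.module = module
        self.pending = set(predecessors)    # "pull" side: predecessors not yet completed
        self.successors = successors        # "push" side: monitors to notify on completion
        self.lock = threading.Lock()

    def start(self):
        # Called by the parent monitor; a module with no predecessors runs immediately.
        if not self.pending:
            self.notify(None)

    def notify(self, predecessor):
        # Called by a completing predecessor; no cycles are consumed while waiting.
        with self.lock:
            self.pending.discard(predecessor)
            if self.pending:
                return                      # go back to sleep until the next notification
        self.preprocess()
        self.execute()
        self.postprocess()
        for successor in self.successors:   # push execution to the dependent monitors
            successor.notify(self)

    def preprocess(self): ...
    def execute(self): ...
    def postprocess(self): ...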

4.4 Atomic module parallelization

In this section we discuss the changes necessary to support parallelism at the atomic module level. Given the encapsulation of internal modules provided by composite modules, the support for module parallelism thus far does not efficiently support nested composite modules. Each module is started only when all its predecessors have completed. As such, if a module is composite, its child modules are started only after it has started. An atomic module which is ready to begin should not have to wait until all its sibling modules are ready to begin. As an example, in figure 5, module B of workflow A should begin immediately, and not wait until all inputs to the CM are complete (i.e., until module A has completed). Similarly, if a composite module immediately precedes an atomic module, the atomic module should begin when all its predecessors have completed, and not have to wait until its predecessor (the CM) and siblings (the child MMs of the CM) have completed. In workflow B of figure 5, module A should start when module C has completed, instead of waiting until its actual preceding module, the CM, has finished execution, which requires waiting until module D has finished. These examples highlight the need for a finer granularity of parallelization than the module level. To enable parallelization at this finer level, we split the MMs, whose generic implementation has so far managed execution of all types of modules, into composite module monitors (CMM) and atomic module monitors (AMM), a change that reflects the data model more closely. Modifying the hybrid implementation, each CMM will start once it has been notified by any of its predecessors of their completion. It will also, in turn, notify all of its child monitors which are waiting for that data.

Figure 6. A sample PSA workflow

The AMMs behave as the MMs did previously, starting only when all their inputs have completed. Additionally, we need to modify the queries that each monitor submits regarding its dependencies. For example, in figure 1, if an MM (either CMM or AMM) only queries its parent for its preceding monitor, the CMM which manages the overall workflow will return module D to module E. As such, module E's queries for its predecessors will only find the monitor assigned to module D, and not c3. This is technically correct, given the encapsulation, but still does not provide us with atomic module parallelism. Once again, we look to the data model for the solution. We note that scientific workflows are essentially dataflow structures. Shifting our focus from the status of the monitors which output the data to the data itself resolves our predicament. To support true atomic module level parallelism and composite function and module encapsulation, even in the presence of recursively nested modules, we modify the queries which both CMMs and AMMs submit regarding the status of their predecessors. Instead of querying its parent for its preceding monitors, and then querying those monitors for their status, a module monitor queries for the status of the variable bound as argument to its input parameter. A query by module E for the status of its input data can be translated by module D as a query about the output of module C, which in turn translates it into a query regarding c3's output. The AMM for c3, c3 being the actual module that outputs the data, provides the answer, which is then propagated through the chain back to the AMM for module E. MMs can now query for variable dependencies regardless of how deeply nested they or their predecessors are. As a result, atomic module parallelism is enabled without breaking the encapsulation provided by composite functions and modules.
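A sketch of this shift from monitor-status queries to data-status queries; the recursive delegation through output bindings is the essential point, and the class shapes are assumptions:

# Query the status of a variable rather than of a preceding monitor, so that the
# nesting depth of composite modules is invisible to the asking monitor.
class AtomicModuleMonitor:
    def __init__(self, module):
        self.module = module
        self.finished_outputs = set()       # variables whose values this module has produced

    def data_ready(self, variable):
        return variable in self.finished_outputs

class CompositeModuleMonitor:
    def __init__(self, output_bindings):
        # Maps each exposed output variable to the (possibly composite) child monitor
        # that produces it, e.g. module D's output -> module C -> c3.
        self.output_bindings = output_bindings

    def data_ready(self, variable):
        producer = self.output_bindings[variable]
        return producer.data_ready(variable)    # delegate until an atomic monitor answers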

5 Dataflow parallelization

In this section, we extend our implementation to parallelize at the finer granularity of processes and data, to support the use of parameter sweep applications (PSAs). PSAs are commonly used in scientific research, with two distinct groups of researchers using PSAs in two distinct manners. The first group, the algorithm and workflow developers, use PSAs to determine the valid and/or optimal parameter values and value ranges for their algorithms or workflows. The second group are domain researchers who use PSAs to apply workflows with predetermined parameter values to their datasets. Research on the scheduling and parallelization of PSAs (e.g., [4], [11]) has not addressed the parallelization of PSAs in the context of a workflow, where a PSA may be preceded and/or followed by other PSAs, as in the workflow shown in figure 6. In scientific workflow management systems research, there has been some work on support for PSAs. However, there has not been much focus on supporting both automatic parameter space management and dataflow parallelization. Furthermore, the proposed data models, e.g. AGWL[8] of ASKALON[19], the Functional Data Model with Relational Cover (FDM/RC) of GridDB[14], and the Web Services Flow Language (WSFL)[13] from IBM and its extension, the Service WorkFlow Language (SWFL)[10] used by Taverna[15], all require a user to explicitly specify PSA activity. The Pipeline[16] is designed with both functionalities to adequately support PSAs as workflow components. Additionally, its data model does not require users to explicitly specify any parallelization; possible parallelism is automatically deduced by the system.

5.1 PSAs as workflow components

A workflow management system's support for PSAs should include both parallelism at the dataflow level within the workflow and automatic parameter space management. In section 4.4, no atomic module monitor is submitted for execution until its predecessors have completed execution. This means that no process of an atomic module begins until all the processes of all preceding atomic module(s) have completed execution. It also means that no AM is considered to have executed until all its processes have executed. However, given an example workflow similar to the one shown in figure 6, processes in A3 should not have to wait until all processes in A1 and A2 have completed to begin execution. Each process in A3 is dependent only upon a single process in each of A1 and A2; when those two have completed, the corresponding process in A3 should begin execution immediately. A related issue to (and prerequisite of) dataflow parallelism is parameter space management. When PSAs are connected to form a workflow, it becomes infeasible for researchers to define the parameter space of non-initial modules. Modules A1 and A2 have inputs specified by the user, but A3 and A4 have inputs whose values need to be automatically managed by the system, since the values of D2 and D3 are not known until A1 and A2 have executed.

Figure 7. A workflow with a function which combines many files into one

Even once the actual cardinality of D3 is known, its values need to be virtualized for any module where multiple processes take the same value (e.g., A3 and A5). Additionally, as shown by A3 and A5, this virtualization needs to be adaptable to the module which follows, as their cardinalities may not be the same. The capability for contextually dependent parameter space management enables the system to truly support dataflow parallelization without waiting for user input on these generated data.
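One way to read the cardinality bookkeeping of figure 6 is sketched below; this is an illustration of the automatic management described above, under the assumption that producer and consumer cardinalities either match, are one, or are reconciled by padding:

# Sketch: present a module's output values with the cardinality expected by the
# module that follows (cf. D3 feeding A3, and the upstream outputs feeding A4).
def virtualize(values, consumer_processes):
    if consumer_processes == 1:
        return [values]                      # a single process consumes the whole collection
    if len(values) == 1:
        return values * consumer_processes   # pad one value so every process receives it
    assert len(values) == consumer_processes, "producer/consumer cardinalities must agree"
    return values                            # one value per downstream process

For example, the single D3 output could be presented as n x m values to A3 with virtualize(d3, n * m), while presenting the same output to A5 with a different process count yields x values.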

5.2 A naive dataflow parallelization

One approach to dataflow parallelism would be simply to create multiple parallel workflows, one for each parameter set. For example, to support the nxm processes of A1, nxm workflows would be created, each with a parameter set (D1i, P1j) where 1 ≤ i ≤ n and 1 ≤ j ≤ m. This naive strategy suffers from major inadequacies: it lacks both correctness and efficiency. Spawning multiple workflows fails correctness when all the processes and outputs of a single module need to be synchronized on one process, termed a join operation in WSFL[13]; the arithmetic sum and product are examples of join operations. In figure 7, a single Reunite[20] process takes n files as input and outputs a single file, while the preceding module, Crop[20], requires n processes for the n files. If a separate workflow is spawned for each of the n inputs, then Reunite will be executed n times, once for each input. The naive strategy also suffers in efficiency, as the same process may be run in more than one of the spawned workflows. In figure 6, given that A1 and A3 require nxm processes and A4 requires one, spawning multiple workflows would also create an instance of A4 for each workflow, thereby running A4 nxm times. Additionally, the strategy creates more monitors than necessary given our monitor implementation: the workflow in figure 7, ignoring correctness, would require n CMMs for the overall workflow, n AMMs for the Crop module, and n AMMs for the Reunite module.

5.3 An optimized dataflow parallelization

This section presents a strategy, implemented in the Pipeline, that resolves the incorrectness and inefficiencies of the naive strategy. We note that the root of the dataflow parallelism problem is the AMM implementation, which intricately ties the monitoring of the atomic module to the monitoring of the underlying processes. This union prevents an AMM from completing until all its processes have completed, and thereby blocks all successor AMMs from starting. The naive strategy overlooks this relationship, and its attempt to separate the processes (by spawning multiple workflows and assigning an AMM to each separate process) breaks the correspondence between the execution model and the data model, resulting in its inadequacies. We introduce a further level in the hierarchy of monitors, the process monitors (PM), to monitor individual processes submitted for execution. Each AMM, instead of managing multiple processes, now manages multiple PMs. And just as the CMMs already do, the AMMs need to start as soon as any predecessor has a PM which has completed execution, and inform those of its child monitors (the PMs) which are waiting for the particular data input. Since AMMs already have the ability to notify monitors waiting for variables bound to their outputs, we modify them slightly to also notify the same monitors of the status of the values of those variables, i.e. the actual output data. Each PM, upon completion, notifies its parent AMM, which then notifies the AMMs which follow it. Using figure 6 as an example, when PM A1i (1 ≤ i ≤ n) completes, AMM A1 is notified. AMM A1 in turn notifies A3, which then in turn notifies all PMs A3ij (∀ j). This separation has two effects. First, separating process monitoring from atomic module monitoring enables the rapid integration of the Pipeline with various distributed and grid computing environments, because the execution environment is abstracted away from the workflow management process. Second, true dataflow parallelism is now possible.
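A sketch of the resulting monitor hierarchy, where an AMM manages one process monitor per task and forwards each individual completion downstream (names and shapes are assumptions):

# Sketch: process monitors (PMs) under an AMM; a completed process is reported
# downstream immediately rather than after the whole atomic module finishes.
class ProcessMonitor:
    def __init__(self, parent, task):
        self.parent, self.task = parent, task

    def on_complete(self, output):
        self.parent.process_completed(self.task, output)

class AtomicModuleMonitor:
    def __init__(self, tasks, successors):
        self.process_monitors = [ProcessMonitor(self, t) for t in tasks]  # one PM per process
        self.successors = successors

    def process_completed(self, task, output):
        # Notify the following AMMs of this single output; any of their PMs that
        # were waiting only on this value may now start.
        for successor in self.successors:
            successor.input_available(task, output)

    def input_available(self, task, output):
        ...                                   # release the PMs that depend on this value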

6 Control flow and parallelism

This section identifies and resolves issues regarding how overlaying control flow structures on top of scientific workflow management environments, which are primarily dataflow oriented, affects parallelism within those environments. Integrating control flow structures into the workflow data model allows workflow designers to manage workflows dynamically during execution, without having to write wrapper scripts for each control block they wish to insert.

Figure 8. Branch module encapsulation

Figure 9. Loop module encapsulation

While wrapper scripts work, they are inefficient (as they need to be installed on all machines where the control flow block may run) and inadequate in certain situations. For example, distributed and grid environments are generally heterogeneous, which means that a single wrapper script for each block will not suffice; instead, a wrapper script must be written for every execution environment for every control flow block, and each installed on the corresponding machines. As such, it is desirable to support control flow structures within our workflow management system. In the following sections, we discuss the implementation of two types of control flow constructs, branches and loops, in the context of parallelization.

6.1 Branches

In this section we discuss the support of execution branches in workflows in the context of multiple levels of parallelism. Support for branching structures should maintain the encapsulation provided by composite functions and modules. A module outside of a branch (either preceding or following it) should not have to be aware of the existence of the branch. Module monitors outside of the branch should not have to query all branches to determine when they can run, as this breaks the encapsulation built into composite modules. Nor is it a scalable solution, as requiring external MMs to query all branches increases the number of necessary queries by the branching factor. The implementation is rather simple, given what we already have. To provide an encapsulation of all possible branches to the modules which follow, we introduce into our data model a specific type of composite function: the branch function. Branch functions describe the possible branches and the conditions for taking any branch. Branch functions, like other composite functions, encapsulate their internals and are abstracted as a single function. We also introduce a corresponding branch module which encapsulates its branches and is abstracted as a single composite module. Additionally, we introduce a branch module monitor (BMM) which handles the encapsulation. As CMMs already provide an encapsulation of their internal monitors and data, the BMM is simply a modified CMM which is capable of dynamically determining which monitors and data to instantiate upon input.2 In all other situations, including parallelism support, the querying for data status, and its query responses, it behaves just like a CMM.

2 Unlike various optimization strategies employed in compiler research, execution of all branches is not suitable, as the workflow management system does not know the number of computations or the time required to execute each branch. The unknown variability in execution time, together with the unknown range of the data that will be available when the branch executes, makes it impractical to employ a strategy such as predictive branching.
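A sketch of such a BMM; condition evaluation and the factory that instantiates the chosen branch's monitors are assumed helpers:

# Sketch: a BMM behaves like a CMM but decides at run time which branch's
# monitors and data to instantiate, keeping the branches invisible from outside.
class BranchModuleMonitor:
    def __init__(self, branches, successors):
        self.branches = branches            # list of (condition, composite_branch_module) pairs
        self.successors = successors

    def notify(self, input_value):
        for condition, branch in self.branches:
            if condition(input_value):
                monitor = instantiate_monitors(branch, self.successors)  # assumed factory
                monitor.notify(input_value)  # only the selected branch executes
                return
        raise ValueError("no branch condition matched the input")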

6.2 Loops

This section discusses the interaction of the loop control flow construct with the various levels of workflow parallelism. The main obstacle faced with the addition of loop control flow constructs is the introduction of cycles into the workflow graph, which conflicts with the acyclic requirement of the workflow dependence graph described in section 4. Fortunately, our data model (section 2.1) supports composite function abstractions. To satisfy the DAG requirement, we apply a two step transformation to the graph cycles. First, each iteration of a loop module is encapsulated as a single composite module. Second, the set of all single-iteration composite modules (i.e. the graph cycle) is abstracted as a single function. We call these functions and their instantiations loop functions and loop modules, respectively. Like branch functions and branch modules (and other composite functions and modules), the loop functions

and loop modules provide an encapsulation of their internals. This two step transformation is depicted in figure 9. Having applied this transformation, we turn our attention to how to support loop constructs in atomic module parallelization (section 4.4). We introduce loop module monitors (LMM), delegated the task of monitoring the loop module to which they are assigned. Each LMM manages a collection of CMMs, and each CMM manages a single iteration of the loop. These CMMs manage the parallelism for their child modules as discussed in previous sections. Additionally, during execution the behavior of an LMM is generally the same as that of a CMM when it follows other modules, and that of an AMM when it precedes other modules, with the exception of the last iteration of the loop, where it behaves like a CMM again. When an LM follows other modules, since the corresponding LMM is the parent monitor of the CMMs for every iteration, the LMM supports atomic module parallelism by querying their predecessor monitors on their behalf. In the first iteration, that query is issued to the preceding monitors, just as in section 4.4. In subsequent iterations, the LMM's behavior deviates from that of a CMM: it queries its child monitors at iteration n to answer a query from iteration n+1. When the LMM precedes other modules, it behaves as an AMM towards those modules for most of its lifetime, to ensure that they do not begin execution prematurely. Once the last iteration of the loop has started execution, however, the LMM behaves as a CMM to support both atomic module and dataflow parallelization as described in the previous sections. With these changes, we now have a way of supporting any workflow graph, cyclic or acyclic, in our execution model.

7 Conclusion

In this paper we have described how the Pipeline, a scientific workflow management system, implements parallelization strategies at multiple workflow granularities in order to maximize parallelism and optimize workflow execution. The implementation includes both inter-workflow parallelism (to support multiple users and multiple workflows) and intra-workflow parallelism (to support PSAs). Additionally, we discussed the support for control flow structures in the context of the various parallelization strategies. This parallelism support, coupled with a large grid computing infrastructure, has dramatically reduced the time necessary to run experiments within our laboratory. While no quantitative numbers are available, many users report reductions of compute time from days to mere hours.

8 Acknowledgements

Acknowledgements will go here.

References

[1] FlowMark - Managing your workflow. Technical Report SH19-8243-00, IBM, March 1995.
[2] I. Altintas, C. Berkley, E. Jaeger, M. Jones, B. Ludaescher, and S. Mock. Kepler: An extensible system for design and execution of scientific workflows. In Proceedings of the 16th International Conference on Scientific and Statistical Database Management (SSDBM'04), June 2004.
[3] D. Barbara, S. Mehrotra, and M. Rusinkiewicz. INCAS: A computation model for dynamic workflows in autonomous distributed environments. Technical report, Matsushita Information Technology Laboratory, May 1994.
[4] H. Casanova, G. Obertelli, F. Berman, and R. Wolski. The AppLeS Parameter Sweep Template: User-level middleware for the grid. In Proceedings of the Super Computing Conference, pages 75–76, 2000.
[5] D. Churches, G. Gombas, A. Harrison, J. Maassen, C. Robinson, M. Shields, I. Taylor, and I. Wang. Programming scientific and distributed workflow with Triana services. In Grid Workflow 2004 Special Issue of Concurrency and Computation: Practice and Experience, 2004.
[6] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press, 1990.
[7] E. Deelman, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, A. Lazzarini, A. Arbree, R. Cavanaugh, and S. Koranda. Mapping abstract complex workflows onto grid environments. Journal of Grid Computing, 1(1):9–23, 2003.
[8] T. Fahringer, J. Qin, and S. Hainzer. Specification of grid workflow applications with AGWL: An abstract grid workflow language. In Proceedings of the IEEE International Symposium on Cluster Computing and the Grid 2005 (CCGrid 2005), Cardiff, UK, May 9-12 2005. IEEE Computer Society Press.
[9] I. Foster and C. Kesselman. Globus: A metacomputing infrastructure toolkit. The International Journal of Supercomputer Applications and High Performance Computing, 11(2):115–128, Summer 1997.
[10] Q. Huang and Y. Huang. Workflow engine with multi-level parallelism support. In Proceedings of the UK e-Science All Hands Meeting 2005, 2005.
[11] E. Huedo, R. S. Montero, and I. M. Llorente. Experiences on adaptive grid scheduling of parameter sweep applications. In 12th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP'04), 2004.
[12] C. R. Johnson, R. S. MacLeod, S. G. Parker, and D. M. Weinstein. Biomedical computing and visualization software environments. Communications of the ACM, 47(11):64–71, 2004.
[13] F. Leymann. Web Services Flow Language (WSFL 1.0). Technical report, IBM, May 2001.
[14] D. T. Liu and M. J. Franklin. GridDB: A data-centric overlay for scientific grids. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB), 2004.
[15] T. Oinn, M. Addis, J. Ferris, D. Marvin, M. Senger, M. Greenwood, T. Carver, K. Glover, M. R. Pocock, A. Wipat, and P. Li. Taverna: A tool for the composition and enactment of bioinformatics workflows. Bioinformatics Journal, 20(17):3045–3054, November 2004.
[16] D. E. Rex, J. Q. Ma, and A. W. Toga. The LONI Pipeline Processing Environment. NeuroImage, 19:1033–1048, 2003.
[17] S. K. Sarin. Workflow and data management in InConcert. In Proceedings of the Twelfth International Conference on Data Engineering, pages 497–499, 1996.
[18] D. Thain, T. Tannenbaum, and M. Livny. Condor and the grid. In F. Berman, G. Fox, and T. Hey, editors, Grid Computing: Making the Global Infrastructure a Reality. John Wiley & Sons Inc., December 2002.
[19] M. Wieczorek, R. Prodan, and T. Fahringer. Scheduling of scientific workflows in the ASKALON grid environment. ACM SIGMOD Record, 35(3), 2005. http://dps.uibk.ac.at/~marek/publications/acm-sigmod2005.pdf.
[20] R. P. Woods, S. R. Cherry, and J. C. Mazziotta. Rapid automated algorithm for aligning and reslicing PET images. Journal of Computer Assisted Tomography, 16:620–633, 1992.
