On the Difficulties of Using Workflow Tools to Express Parallelism and Distribution – An Application Case Study in Geological Sciences

Luís Assunção (1,2); Carlos Gonçalves (1,2); José C. Cunha (2)

(1) Instituto Superior de Engenharia de Lisboa
(2) CITI/Dept. Informática, Fac. Ciências e Tecnologia, Universidade Nova de Lisboa

Abstract

We discuss the parallelization and distribution of a workflow-based application. We present experimental results and discuss lessons learned. The experimentation is based on a scientific application case study – Sequential Simulation applied to Geostatistics. As a result, we identify a list of issues that are not well supported in the most widely referenced scientific workflow tools and environments, namely Kepler, and outline relevant research directions.

1 Introduction

In the recent past there have been significant developments in the infrastructures used to support parallel and distributed execution, based on multiprocessors, clusters and grids. However, existing platforms still do not provide an adequate level of unification of the concepts and mechanisms required to support heterogeneity, distinct granularities of computation, and a diversity of scales and dynamic behaviors. There has also been intense research on how to express scientific applications using workflows. A workflow defines a graph of dependencies which specifies an execution order for multiple tasks, some of them running in parallel. More recently, scientific workflows have been used for developing complex applications in Science and Engineering [1], [2], [3], [4]. The benefits of this approach are twofold. On one hand, it provides an adequate level of transparency to the end user, promoting a logical problem specification and decomposition. On the other hand, it naturally promotes a clear separation of concerns between the workflow specification and the actual execution of tasks in parallel or distributed environments.

This is illustrated, for example, by problem-solving environments [5], with high-level workflow-based interfaces, which allow running applications in clusters and grids. However, existing workflow tools lack support for expressing large-scale computations, adequate control- and data-flow models, and dynamic management of workflow components and instances. Such lack of expressiveness precludes a scientist from specifying non-trivial applications and managing complex scientific experiments, for instance those involving multiple workflow instances with complex and dynamic dependencies. Furthermore, current workflow tools still need to provide a clearer separation between the workflow specification and its mapping to the execution environments. We are exploring the above issues in an ongoing project – GeoInfo – whose goal is to enable large-scale experimentation in Earth, Sea, and Space sciences. As a part of this project, we are involved in a collaboration between the CITI (Centre for Informatics and IT) and the CICEGe (Centre for Geological Sciences) of our University. In this paper we describe our experimentation conducted at two levels: i) the execution platforms (Condor [6] and Web services); ii) the workflow specification (Kepler [7]). The selected application case study was Sequential Simulation applied to Geostatistics. Two main problem-solving approaches were considered: i) parameter sweep; and ii) algorithmic decomposition. We evaluate these approaches and identify the difficulties of their implementation. This paper is organized as follows: Section 2 presents the application case study and related work. Section 3 describes the above problem-solving approaches. Section 4 presents experimental results and lessons learned related to the execution of multiple simulations (parameter sweep).

Section 5 presents our strategy for the workflow-based decomposition of the sequential geostatistics algorithm. Section 6 presents conclusions and directions for future work.

2 Case study – Sequential Simulation

Sequential simulation is used throughout the natural resources industry, for instance in the oil industry to construct numerical models for oil reservoir description [9], [10]. There are several approaches and theories describing the simulation process: Sequential Indicator Simulation (SIS), Sequential Gaussian Simulation (SGS) and Direct Sequential Simulation (DSS). Following [9], the simulation process can be described as a workflow with four basic steps:
1. Choose the stationary domain (data set of samples and parameters);
2. Define a random path to visit every location;
3. For each node (location):
   a. Search to find nearby data and previously simulated values;
   b. Calculate the conditional distribution;
   c. Perform Monte Carlo simulation to obtain a single value from the distribution;
4. Repeat step 3 until every node has been visited.
At steps 3a and 3b there are dependencies between nodes. In fact, the related theory [10], [11] determines that the result for node k influences the results of the subsequent nodes k+1, k+2, …, N; i.e., to process node k we need the results of the previous nodes k−1, k−2, …, 1.
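The feedback dependency is easier to see in code. The following Java sketch mirrors the four steps above; the node search and the "kriging" step are drastic simplifications (placeholders only), so the sketch is meant to show where node k consumes the values produced for nodes 1..k−1, not to reproduce any of the cited SIS/SGS/DSS implementations.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Illustrative sketch of the sequential simulation loop (steps 1-4 above).
// Node, searchNearby and krige are hypothetical names; the "kriging" here is a
// crude placeholder that only exposes the data dependency between nodes.
public class SequentialSimulationSketch {

    static class Node {
        final double x, y;
        double value;
        boolean simulated;
        Node(double x, double y) { this.x = x; this.y = y; }
    }

    // Step 3a: nearby conditioning data among the already simulated nodes.
    static List<Node> searchNearby(Node target, List<Node> all, int max) {
        List<Node> known = new ArrayList<>();
        for (Node n : all) if (n.simulated) known.add(n);
        known.sort(Comparator.comparingDouble(
                n -> Math.hypot(n.x - target.x, n.y - target.y)));
        return known.subList(0, Math.min(max, known.size()));
    }

    // Step 3b: placeholder for building and solving the kriging system;
    // here reduced to the mean and variance of the neighbouring values.
    static double[] krige(List<Node> neighbours) {
        if (neighbours.isEmpty()) return new double[] {0.0, 1.0};
        double mean = neighbours.stream().mapToDouble(n -> n.value).average().orElse(0.0);
        double var = neighbours.stream()
                .mapToDouble(n -> (n.value - mean) * (n.value - mean))
                .average().orElse(1.0);
        return new double[] {mean, Math.max(var, 1e-6)};
    }

    public static void simulate(List<Node> nodes, Random rng) {
        Collections.shuffle(nodes, rng);                       // step 2: random visiting path
        for (Node current : nodes) {                           // steps 3-4: visit every node
            List<Node> neighbours = searchNearby(current, nodes, 16);               // 3a
            double[] meanVar = krige(neighbours);                                   // 3b
            current.value = meanVar[0] + Math.sqrt(meanVar[1]) * rng.nextGaussian(); // 3c
            current.simulated = true;        // node k now conditions nodes k+1..N
        }
    }

    public static void main(String[] args) {
        List<Node> grid = new ArrayList<>();
        for (int i = 0; i < 100; i++)
            for (int j = 0; j < 100; j++) grid.add(new Node(i, j));
        simulate(grid, new Random(42));
        System.out.println("simulated " + grid.size() + " nodes");
    }
}
```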

2.1 Parallel processing requirements

The traditional approach to implementing this kind of application is to use a sequential processing algorithm, where the dependencies between nodes are simple to resolve because we can start at node 1 and finish at node N. However, considering that one simulation can have millions of nodes, it is desirable to explore the parallelization and distribution of the processing in step 3. In Section 2.2 we review related work on using parallelization techniques to increase the number of simulated nodes. The requirement to execute a large number of simulations while varying some parameters (the parameter sweep paradigm) is a common characteristic of other application scenarios as well [12].

2.2 Related work

The parallel implementation of the sequential simulation algorithm to improve scalability and performance is a non-trivial task. Some authors [13], [14] have proposed such a parallelization using an approach that partitions the domain space into regions and assigns them to parallel processes executing on multi-processor machines. Despite reporting satisfactory results, namely a significant reduction (49%) of the processing time, this approach can produce inaccurate results due to the processing of adjacent nodes in different regions. Another disadvantage is the lack of flexibility, as the generation and subsequent processing of regions is strictly limited by the number of processors. In order to exploit multi-core architectures, a related experiment [15] describes a multithreaded version of the same algorithm in a Windows application using the Win32 API. These authors report a speedup of 65-70% when parallelizing the algorithm for 4 processors. However, this approach has scalability limitations for large numbers of nodes. Firstly, by increasing the number of threads (up to the number of nodes), the context-switching overhead increases significantly, contributing to the degradation of the total execution time. Secondly, the approach is limited by the maximum number of threads per process available in current operating systems. In order to allow the processing of large regions, and to scale in the number of iterations, we are exploring the decomposition of the algorithm through the use of scientific workflow tools, and the parallelization of the execution in distributed computing environments. The main aim of our work is to explore the scientific workflow paradigm in order to find suitable solutions for supporting parallelism and distribution in applications with characteristics similar to the Sequential Simulation algorithm.

3 Parallel and distributed approaches

We discuss two approaches to exploiting parallel and distributed execution: i) the execution of multiple simulations by parameter sweep on a distributed infrastructure; and ii) the decomposition of the algorithm, which can also be executed on the same distributed infrastructure. Firstly (in Section 4), we explore the scale dimension of multiple simulations based on the parameter sweep approach.

The following architectural scenarios were used to launch jobs: i) using Condor [6] (Section 4.1); and ii) using a Web-services-based environment (Section 4.2). In order to explore the invocation of Web services to execute a large number of simulations, we first developed an end-user interface, and then we experimented with two widely used scientific workflow tools [7], [8]. For all scenarios we present and compare performance results and discuss lessons learned. Secondly (Section 5), we discuss the decomposition of the algorithm into four steps, using the workflow paradigm. This allowed us to evaluate the possibility of substantially increasing the problem size and subsequently distributing the simulation across many machines.

4 Multiple simulations – parameter sweep

This scenario is required when we need to run multiple simulations with different parameters and data samples. Instead of running all simulations on the same machine, the simulations are run in parallel on several machines, with post-mortem analysis of the results. In our preliminary research we evaluated a C++ application based on the Windows API (Win32), implemented by Nunes [15]. We converted the application to the Java language in order to run it in heterogeneous environments, including Linux and Windows. Table 1 shows the execution times obtained for one execution instance of the Win32 and Java applications. The target hardware was a dual-core 2.0 GHz CPU with 3 GB RAM running Windows XP. The problem size consists of 2,000 points on the X axis, 2,000 points on the Y axis and 1 point on the Z axis, i.e. 4,000,000 nodes to process.

Table 1 - Comparing execution times

           Execution Time (minutes)
Threads    Java (1.6)    C++ (Win32)
   1          3.74           4.78
   2          2.29           2.78

The Java version performs better because the Just-In-Time compiler of the Java Virtual Machine 1.6 exploits the actual capabilities of the CPU's instruction set, while C++ compilers targeting Win32 generate code compatible with older x86 CPUs. Figure 1 illustrates the visualization of the result data, where geological scientists can analyze the characteristics of the simulated subsoil regions, e.g. the distribution of heavy metals such as cadmium.

Figure 1 – Data visualization for one execution (using the GeoView tool [16])

To support our experiments with the parallel approaches, a PC-based cluster with 7 nodes was used, with the following configuration: one head node with a dual-core 2.0 GHz CPU and 3 GB RAM; and 6 working nodes based on Pentium III CPUs with speeds ranging from 600 MHz to 900 MHz and 512 MB RAM, interconnected by 100 Mbit Ethernet. All nodes run Linux Fedora Core 8, Condor Scheduler 7.0.5, Apache Tomcat 6.0.18 and Java 1.6.0.10. The head node is mainly responsible for authentication, Tomcat load balancing and the Condor Central Manager. All nodes share a storage device managed by an NFS file system. The execution time of one instance of the Java application on a cluster node is 20.86 minutes, which is significantly higher than the results in Table 1. This is due to the lower performance of each cluster node. Additionally, in the cluster there is an overhead incurred by creating and writing the results file to the shared storage managed by NFS.

4.1 Using Condor

To run multiple simulations, the first approach was based on Condor, a robust and widely used distributed job scheduler [6]. In order to evaluate the overall execution time for multiple simulations, we submitted 6 Condor jobs, each running on one working node of the cluster. Assuming that all files, including parameter files, exist on the shared storage, we obtained an execution time of 22.52 minutes for the execution of the 6 jobs. As expected, the execution time for 6 simulations was approximately identical to the time for 1 execution (20.86 minutes). This result clearly shows the potential of distributed processing on multiple machines under the parameter sweep approach.
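For reference, a sweep of this kind can be described in a Condor submit file and submitted programmatically. The sketch below generates a 6-job submit description and hands it to condor_submit; the jar name, parameter-file naming and shared paths are hypothetical and do not correspond to the files actually used in the experiments.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch: generate a Condor submit description for a 6-job parameter sweep and
// hand it to condor_submit. The jar, parameter files (params_$(Process).txt)
// and shared paths are hypothetical placeholders.
public class CondorSweepSketch {
    public static void main(String[] args) throws IOException, InterruptedException {
        String submit = String.join("\n",
                "universe   = vanilla",
                "executable = /usr/bin/java",
                "arguments  = -jar /mnt/nfs/simulator.jar /mnt/nfs/params_$(Process).txt",
                "output     = /mnt/nfs/logs/sim_$(Process).out",
                "error      = /mnt/nfs/logs/sim_$(Process).err",
                "log        = /mnt/nfs/logs/sweep.log",
                "queue 6",          // one job per working node; $(Process) is 0..5
                "");
        Path file = Paths.get("sweep.sub");
        Files.write(file, submit.getBytes());

        // Submit the sweep and wait for condor_submit (not for the jobs) to finish.
        Process p = new ProcessBuilder("condor_submit", file.toString())
                .inheritIO().start();
        System.exit(p.waitFor());
    }
}
```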

4.2 Using Web services

In order to provide more generic and flexible execution environments for computational science simulations, other approaches rely on architectures organized as a collection of geographically distributed Web services [17]. To enable experimentation with a service-oriented architecture and its interaction with higher-level workflow tools, we developed an architecture based on Web services. We then used it to evaluate the flexibility provided by scientific workflow tools such as Kepler [7] and Triana [8] concerning distributed service invocation. An advantage of this architecture is its support for heterogeneity, because Web services are based on a set of widely adopted standards; consequently our architecture can be used by a large number of tools in different environments. The architecture presented in Figure 2 is composed mainly of two types of Web services:
• Executor Web service - available on each node of the cluster; it accepts and launches tasks, waits for task completion and returns results or exception errors;
• Broker Web service - used to register the available processing nodes and, when requested, to perform load balancing across the existing nodes.
Some end-user tools, for instance Kepler and Triana, only support synchronous calls. This limitation penalizes performance because it does not allow simultaneous calls to the Executor Web services. Consequently, an important role of the Broker Web service is to support asynchronous calls to the Executor Web services, allowing the submission of multiple tasks. Another role is to manage the state of task execution, decoupling the execution platform from the end-user tools. In the current implementation, this Web service is developed in .NET technology and runs on a Windows workstation. The reason for this choice was the simplicity of implementing Web service state management and asynchronous calls with completion callbacks.
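As a rough illustration of the two service roles described above, the following JAX-WS style interfaces sketch what an Executor and a Broker contract could look like. The operation names and signatures are our own illustration (the actual Broker was implemented in .NET, and its WSDL is not reproduced here).

```java
import javax.jws.WebMethod;
import javax.jws.WebService;

// Hypothetical service contracts only: operation names and signatures are
// illustrative and do not reproduce the WSDL of the services used in the paper.

@WebService
interface ExecutorWebService {
    // Launch a task (e.g. one simulation run); returns a task identifier.
    @WebMethod
    String launch(String executablePath, String parameterFilePath);

    // Poll for completion; a plain integer keeps the types simple for
    // workflow tools that struggle with complex data types.
    @WebMethod
    int status(String taskId);
}

@WebService
interface BrokerWebService {
    // Executors register themselves so the Broker can perform load balancing.
    @WebMethod
    void registerExecutor(String endpointUrl);

    // Submit a batch of simulations; the Broker fans them out to Executors.
    @WebMethod
    String submitSimulations(String parameterFilePattern, int count);

    // Number of jobs of a previously submitted batch that have completed.
    @WebMethod
    int completedJobs(String batchId);
}
```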

Figure 2 - Architecture based on Web services (end-user interface invoking the Broker Web service, which dispatches to Executor Web services #1..#N on the cluster nodes; data and executable files reside on shared storage, to which users upload/download files by FTP or remote SSH sessions)

Figure 3 shows the end-user interface of a desktop application developed in C# .NET that provides access to the Web services architecture. It allows an end-user to launch multiple simulations in a transparent way. In order to start the simulations, the user defines the simulation parameters in the input dialog boxes and then only needs to click on the button. The desktop application relies on asynchronous calls with callbacks to the Executor Web services, whose endpoints (URLs) are obtained from the Broker Web service. We also assume that all files, including parameter files, exist on the shared storage.

All Executor Web services share files on shared storage managed by the NFS file system. End-users upload and download files to and from this shared storage via FTP or SSH sessions.

Figure 3 - End-user interface to launch simulations

The execution time for running the same 6 jobs previously launched on Condor is 23.1 minutes.

This is similar to the performance obtained with Condor (22.52 minutes), but this end-user interface is more user-friendly. In order to provide access to the described Web services architecture, we experimented with two widely used scientific workflow tools, Kepler [7] and Triana [8]. Figure 4 illustrates a Kepler workflow, designed using Kepler built-in actors, that launches multiple simulations. The number of simulations is controlled by a Ramp actor that increments the job number up to a given limit.

Figure 5 – Kepler workflow to get completion status

We have also experimented with the Triana [8] tool and concluded that we can easily define similar workflows to invoke the Broker Web service. However, the Triana units used to invoke Web services make it difficult to define parameters and results when complex data types are involved. Therefore, the creation of data-flow based workflows involving several Web services is not easy.

4.3 Lessons learned

Figure 4 – Kepler workflow using the architecture

The experimentation has shown some limitations of the Kepler actors. Unlike the C# .NET desktop application, we cannot invoke the Broker Web service to dynamically obtain an Executor Web service location (URL), because the Web service actor in Kepler does not support changing the Web service endpoint at execution time. Consequently, it is difficult to explore load balancing scenarios with Kepler workflows. Another limitation of Kepler, due to the lack of asynchronous call functionality, is that the Web service actor does not fire again while the first call has not terminated. Therefore, in order to execute this workflow while exploiting parallelism, we changed the Broker Web service to act as an intermediary between Kepler and the Executor Web services. Thus the Kepler workflow requests the simulations from the Broker Web service, which asynchronously invokes the Executor Web services according to a load balancing policy. Later, the end-user activates the workflow shown in Figure 5 to get completion status information.
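The change essentially moves the fan-out of calls into the Broker. A minimal Java sketch of that behaviour is shown below; the real Broker was a stateful .NET service, so the thread-pool fan-out, the round-robin policy and the names used here are only illustrative.

```java
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the Broker acting as an asynchronous intermediary: the workflow makes
// one synchronous call to submitSimulations(), the Broker fans the work out to the
// registered Executor endpoints, and the status workflow later polls completedJobs().
public class BrokerSketch {

    private final List<String> executorUrls;                          // registered Executors
    private final ExecutorService pool = Executors.newCachedThreadPool();
    private final Map<String, AtomicInteger> done = new ConcurrentHashMap<>();

    public BrokerSketch(List<String> executorUrls) {
        this.executorUrls = executorUrls;
    }

    public String submitSimulations(int count) {
        final String batchId = UUID.randomUUID().toString();
        done.put(batchId, new AtomicInteger(0));
        for (int i = 0; i < count; i++) {
            final String url = executorUrls.get(i % executorUrls.size()); // round-robin load balancing
            final int jobNumber = i;
            pool.submit(() -> {
                callExecutor(url, jobNumber);          // blocking call, but off the caller's thread
                done.get(batchId).incrementAndGet();
            });
        }
        return batchId;                                // returns immediately to the workflow
    }

    public int completedJobs(String batchId) {         // polled by the status workflow (Figure 5)
        return done.get(batchId).get();
    }

    private void callExecutor(String endpointUrl, int jobNumber) {
        // Placeholder for the actual Executor Web service invocation.
    }
}
```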

Assuming that we have a large cluster, we can conclude that exploiting the distribution of executions allows a great increase of scalability in parameter sweep scenarios. Although we can use workflow tools to execute multiple simulations, we found important limitations in Kepler and Triana, such as:
• Neither supports asynchronous Web service invocation;
• Both reveal difficulties in working with complex data types. For instance, in Triana there is a special unit to generate and display complex data types, but it is impossible to extract one data field to be submitted as input to another unit. Although Kepler is more user-friendly, the user sometimes needs to fight with low-level XML tokens when handling complex or aggregate data types;
• Both use a data-flow model, but we concluded that sometimes control-flow, or a mix of both, is necessary.

5 Workflow for algorithm decomposition

We believe that applying the scientific workflow paradigm is the right way for end-users (scientists) to express the sequential simulation problem in their specific domains. In Figure 6 we present an abstract workflow describing the sequential simulation problem in four steps.

This abstract workflow is executed as many times as there are nodes to simulate. Steps S1, S2, S3 and S4 are as follows:
S1 - Definition of the locations of the relevant nodes that influence the current node;
S2 - Generation and resolution of the kriging system;
S3 - Estimation of the conditional distribution;
S4 - Result of the Monte Carlo simulation.

S1 → S2 → S3 → S4 (with a feedback dependency from S4 of one instance into S3 of the next)

Figure 6 - Sequential simulation workflow

The execution of each node corresponds to an instance of the workflow. Thus, millions of instances of the workflow must be generated. It is also important to understand that the execution of step S3 for node k (instance k) has an input with a feedback dependency on the result of step S4 of node k−1 (instance k−1). An interesting characteristic is that steps S1 and S2 are independent between instances, and at the same time they are the most demanding in terms of processing. So we can conclude that it is important to support the massively parallel execution of steps S1 and S2 for a large number of nodes. Using the workflow paradigm in Kepler (Figure 7), each step is an actor that invokes a Web service, so we can distribute the execution of each step of each instance. The Nodes 1..N actor, based on the Kepler built-in Ramp actor, generates the sequence of nodes 1 to N. The Dependency Feedback actor is based on the Kepler built-in Sample Delay actor, which has the expected behavior: for the first instance this actor fires a default value, and for the following instances it carries the value of its input port (the result of the Step4 actor) to the output port connected to the Step3 actor.

Figure 7 – Kepler sequential simulation workflow
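The dependency structure just described - S1 and S2 independent per instance, S3 of instance k fed by the S4 result of instance k−1 - can be made explicit with a thread pool and futures. The sketch below only shows which steps may run in parallel; the step bodies are trivial placeholders, not the Web services of Figure 7.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of the per-instance dependency structure of Figures 6 and 7:
//  - S1 and S2 of every instance can run in parallel (no cross-instance dependency);
//  - S3 of instance k needs the S4 result of instance k-1 (feedback dependency).
public class InstanceDependencySketch {

    static double s1s2(int node)               { return node * 0.001; } // placeholder: search + kriging
    static double s3(double kriged, double fb) { return kriged + fb;  } // placeholder: conditional distribution
    static double s4(double conditional)       { return conditional;  } // placeholder: Monte Carlo draw

    public static void main(String[] args) throws Exception {
        int n = 1000;
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());

        // Phase 1: S1+S2 of all instances, fully parallel.
        List<Future<Double>> kriged = new ArrayList<>();
        for (int k = 0; k < n; k++) {
            final int node = k;
            Callable<Double> task = () -> s1s2(node);
            kriged.add(pool.submit(task));
        }

        // Phase 2: S3 -> S4 chained sequentially through the feedback value.
        double feedback = 0.0;                      // default value for the first instance
        for (int k = 0; k < n; k++) {
            double conditional = s3(kriged.get(k).get(), feedback);  // waits for that instance's S1+S2
            feedback = s4(conditional);             // feeds S3 of instance k+1
        }

        pool.shutdown();
        System.out.println("last feedback value: " + feedback);
    }
}
```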

Although we can design the workflow, we found significant problems in its implementation. The first problem is related to the data flow between steps. In sequential simulation we handle large matrices; for instance, a region with 4 million points manipulates matrices of up to 50 MB, making it impractical to pass this kind of data between actors in Kepler. In order to support such a data flow, the Web services supporting the workflow steps access data files stored on shared storage. We then needed to simulate control-flow between the step actors by returning an integer value from the Web services, which is used to fire the next actors. Another issue is the execution control of the workflow. In Kepler, Directors control the workflow execution. The most efficient and robust director, the SDF Director, is not suitable because it is sequential, so, contrary to what is expected, the Step actors cannot fire in parallel; thus we cannot have several instances executing concurrently. The Process Network (PN) Director allows actors to fire in parallel. However, as the Kepler engine runs in the context of a single operating system process and uses one thread per enabled actor, this raises scalability problems: when the number of instances increases substantially, unexpected exceptions are frequently raised. Given the recognition that service-oriented architectures are a good approach for implementing large distributed applications, we consider that a significant effort is necessary to improve tools like Kepler to support Web service invocation. Our application case study clearly has the characteristics to exploit the workflow paradigm. Despite all the powerful functionality already provided by Kepler, our experimentation shows that there is still a need for improvements, in particular flexibility to express workflows, more robustness, and support for defining flows (control, data or a mix of both) between tasks.
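To make the file-based data flow described above concrete, the following sketch shows one workflow step exposed as a Web service: the large matrices stay on the shared NFS storage and only an integer status token is returned to the workflow. The paths, file naming scheme and operation name are hypothetical.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import javax.jws.WebMethod;
import javax.jws.WebService;

// Sketch of one workflow step exposed as a Web service. The large matrices stay
// on the shared NFS storage; only an integer status token is returned, so the
// data-flow link between Kepler actors effectively becomes a control-flow trigger.
// Paths, file names and the operation name are hypothetical.
@WebService
public class Step2Service {

    private static final Path SHARED = Paths.get("/mnt/nfs/simulation");

    @WebMethod
    public int runStep2(int instanceId) {
        try {
            Path input  = SHARED.resolve("step1_" + instanceId + ".dat"); // written by Step 1
            Path output = SHARED.resolve("step2_" + instanceId + ".dat"); // consumed by Step 3
            byte[] matrix = Files.readAllBytes(input);    // tens of MB: never crosses the workflow engine
            byte[] result = solveKrigingSystem(matrix);   // placeholder for the real computation
            Files.write(output, result);
            return 0;                                     // 0 = success; fires the next actor
        } catch (Exception e) {
            return 1;                                     // non-zero = failure; the workflow can branch on it
        }
    }

    private byte[] solveKrigingSystem(byte[] data) {
        return data;                                      // placeholder
    }
}
```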

6 Conclusions

From our experimentation with two representative workflow tools, Kepler [7] and Triana [8], we concluded that the ability to combine multiple flow patterns is critical for enabling a large class of applications. Both Triana and Kepler have shown limitations in the geological sciences case study. Kepler provides a user-friendly interface and a richer set of functionalities, but lacks generic support for complex data types, even for the invocation of Web services. Furthermore, trying to simulate hybrid flow patterns is difficult and in some cases requires advanced programming skills to develop new Kepler actors.

Concerning the workflow engine, although both the Triana and Kepler architectures are based on a conceptual design that separates the workflow intermediate representation from the actual enactment engine, such a clear separation is not supported in the available implementations. A recent initiative, the Hydrant project [18], is addressing this issue by providing a Web portal as a front-end to a remote Kepler engine. The above experimentation led to our current research, grounded on the identification of several open issues concerning both the workflow high-level abstractions and the execution mechanisms:
• Lack of expressiveness for specifying forms of control-flow instead of data-flow, or combinations of the two;
• Support for executing a particular workflow multiple times (workflow instances), possibly with different parameters;
• Difficulties in expressing loops and feedback dependencies among activities in different instances;
• Limitations in the supported data types: most existing systems support the invocation of external Web services, but preclude the use of complex data types;
• Limitations with respect to storage and data sharing between workflow activities;
• Limitations in workflow scale, in terms of the number of concurrent or parallel execution threads: the workflow engine is often executed as a single, monolithic operating system process, thus limiting the number of tasks that can be executed in parallel;
• Lack of naming, location transparency and dynamic binding in the specification of the required Web services.
The above issues have also been identified in related studies [1], [19], and several ongoing initiatives are trying to overcome them [20]. Work is under way to define models and enhance the prototype platform in order to overcome the above open issues, namely in the following directions: i) expressiveness of the workflow specification; ii) the corresponding mappings to the execution platforms; and iii) the granularity of the executable units (processes, components, objects, Grid/Web services) and their interactions.

7 References

[1] Ian J. Taylor, Ewa Deelman, et al., "Workflows for e-Science", Springer-Verlag, 2007.
[2] Jia Yu and Rajkumar Buyya, "A Taxonomy of Workflow Management Systems for Grid Computing", Journal of Grid Computing, Volume 3, Numbers 3-4, Springer-Verlag, 2006.
[3] Dieter Cybok, "A Grid Workflow Infrastructure", Concurrency and Computation: Practice and Experience, Volume 18, Issue 10, pp. 1243-1254, 2005.
[4] Nick Russell, W.M.P. van der Aalst, et al., "Workflow Resource Patterns: Identification, Representation and Tool Support", Queensland University of Technology, Brisbane, Australia / Eindhoven University of Technology, Netherlands, 2004.
[5] José C. Cunha, Omer F. Rana, "Grid Computing: Software Environments and Tools", Springer, 2006.
[6] The Condor project: http://www.cs.wisc.edu/condor/, accessed 2008-12-15.
[7] The Kepler project: http://kepler-project.org/, accessed 2008-12-15.
[8] The Triana project: http://www.trianacode.org/, accessed 2008-12-15.
[9] Stefan Zanon, Oy Leuangthong, "Implementation Aspects of Sequential Simulation", Quantitative Geology and Geostatistics, Volume 14, Geostatistics Banff 2004, pp. 543-548, Springer Netherlands, 2005.
[10] Amilcar Soares, "Direct Sequential Simulation and Cosimulation", Mathematical Geology, Vol. 33, No. 8, November 2001.
[11] J. Caers, H. Gross, and A. R. Kovscek, "A Direct Sequential Simulation Approach to Streamline-Based History Matching", Quantitative Geology and Geostatistics, Vol. 14, Springer, Netherlands, 10881086, 2005.
[12] S.D. Olabarriaga, A.J. Nederveen, B.O. Nuallain, "Parameter Sweeps for Functional MRI Research in the 'Virtual Laboratory for e-Science' Project", Seventh IEEE International Symposium on Cluster Computing and the Grid, pp. 685-690, 14-17 May 2007.
[13] H. Vargas, H. Caetano, and M. Filipe, "Parallelization of Sequential Simulation Procedures", geoENV VI - Geostatistics for Environmental Applications, 2006, and EAGE Petroleum Geostatistics, 2007.
[14] H.S. Vargas, H. Caetano and H. Mata-Lima, "A New Parallelization Approach for Sequential Simulation", Quantitative Geology and Geostatistics, geoENV VI, pp. 489-496, Springer Netherlands, 2008.
[15] Rúben Filipe Martins Nunes, "Paralelização dos Algoritmos Simulação Sequencial Gaussiana, Indicatriz Directa", MSc Thesis, FCT-UNL, January 2008.
[16] Site of CMRP - Centre for Modelling Petroleum Reservoirs: http://cmrp.ist.utl.pt/index.php?lg=2, accessed 2008-12-15.
[17] Keshav Pingali and Paul Stodghill, "A Distributed System Based on Web Services for Computational Science Simulations", ICS '06: Proceedings of the 20th Annual International Conference on Supercomputing, pp. 297-306, ACM, 2006.
[18] Hydrant project: http://www.hpc.jcu.edu.au/hydrant/, accessed 2008-12-15.
[19] Yolanda Gil, Ewa Deelman, et al., "Examining the Challenges of Scientific Workflows", IEEE Computer, December 2007, pp. 24-32.
[20] Grid Workflow Forum: http://www.gridworkflow.org, accessed 2008-12-15.
