2013 IEEE International Conference on Big Data

SciFlow: A Dataflow-Driven Model Architecture for Scientific Computing using Hadoop

Pengfei Xuan, Yueli Zheng
School of Computing, Clemson University, Clemson, SC, USA
{pxuan, yzheng}@clemson.edu

Sapna Sarupria
Chemical and Biomolecular Engineering, Clemson University, Clemson, SC, USA
[email protected]

Amy Apon
School of Computing, Clemson University, Clemson, SC, USA
[email protected]

Abstract—Many computational science applications utilize complex workflow patterns that generate an intricately connected set of output files for subsequent analysis. Some types of applications, such as rare event sampling, additionally require guaranteed completion of all subtasks for analysis, and place significant demands on the workflow management and execution environment. SciFlow is a user interface built over the Hadoop infrastructure that provides a framework to support the complex process and data interactions and guaranteed completion requirements of scientific workflows. It provides an efficient mechanism for building a parallel scientific application with dataflow patterns, and enables the design, deployment, and execution of data-intensive, many-task computing applications on a Hadoop platform. The design principles of this framework emphasize simplicity, scalability and fault-tolerance. A case study using the forward flux sampling rare event simulation application validates the functionality, reliability and effectiveness of the framework.

Keywords—dataflow; Big Data; scientific computing; Hadoop; forward flux sampling rare event simulation; many-task computing; dataflow-driven design patterns

I. INTRODUCTION

Behaviors of scientific computing applications are highly heterogeneous. Scientific applications can be characterized as data-intensive, computationally-intensive, analysis-intensive, and visualization-intensive, with resulting bottlenecks at CPU performance, I/O throughput, memory capacity, network bandwidth, and network latency. The availability of inexpensive computational resources and emerging software techniques for the analysis of large-scale data are creating new opportunities for scientific discovery. However, the combination of commodity computing technologies with computationally-intensive applications exposes gaps in scientific application development, deployment, and execution that have not been fully met by traditional software infrastructures and programming models. For example, the generally lower reliability of commodity hardware does not fully support workflows that require the guaranteed completion of all of the tasks in a workflow. We have implemented the SciFlow model architecture, which uses dataflow-driven design patterns, to address the needs of applications that require the guaranteed completion of all of the tasks in the workflow. In SciFlow, the computational workflow is represented by dataflow, and the dataflow drives the underlying computational scientific application. SciFlow enables the implementation of a many-task computing (MTC) [1] application in which each of the tasks executes for a short time.

SciFlow is implemented using the Hadoop platform. It leverages the broad adoption of Hadoop and applies it to the problem of scientific computation. We utilize a standard Hadoop distribution without changing the components within the Hadoop environment, while fully supporting the completion and other requirements of the scientific application. Our software layer implements seven primitive dataflow patterns that can be used to build complex data-intensive applications.

Our choice to use Hadoop is motivated by its features and by its broad community support. The Hadoop stack is an open source, robust, and fully productive environment, and is widely adopted as an industry standard. Hadoop runs on a commodity cluster and supports the MapReduce framework [2]. The Hadoop platform hides many of the complexities of the underlying distributed system. It decouples the complexity of parallel and distributed computing from the application and allows run-time tasks such as scheduling, load balancing, failure recovery, inter-machine communication, and distributed partitioning of data to be automatically handled by the Hadoop platform. A feature of Hadoop that is particularly important for our rare event sampling application is that the Hadoop environment ensures that all initiated tasks complete and write an output file at the end of completion. For example, with forward flux sampling (FFS), while the subtasks can be executed in a different order, the loss of a single output file can affect the correctness of the entire simulation. Many supporting technologies are built over Hadoop for data analysis and computation [3–8]. Overall, the appropriate leveraging of the powerful components in the Hadoop environment offers significant potential to the scientific computing community for combining computation with data analytics as compared to traditional HPC-based computing environments.

Two major contributions are presented in this paper. First, we describe the SciFlow model architecture for data-intensive MTC and our dataflow-driven design patterns for building data-intensive MTC applications on a standard Hadoop platform. Second, we validate the design of SciFlow with the implementation of a large-scale rare event simulation application. Our application integrates GROMACS, a molecular dynamics (MD) program, with a customized FFS algorithm [9] using SciFlow. We also evaluate the performance and demonstrate the scalability of our implementation.


To the best of our knowledge, there is no publicly available parallel FFS application that can achieve the scale of our implementation at this time. The current version of FFS using the SciFlow framework represents a major improvement over our previous script version of FFS, which manages the execution of a few thousand runs of the MD simulation program using a shared file system and frequently runs into I/O challenges. Script implementations in general lack the supporting environment for managing failures of various types, and as a result do not save the state of execution in a way that supports isolation of erroneous MD trajectories. In these cases, the entire FFS algorithm must be re-executed, leading to significant computational overhead. Our implementation easily runs on Palmetto, the Clemson University campus high-performance computing cluster. We have run FFS with about 460K MD simulation runs, managing the reliable completion of more than 1.1M tasks, the analysis of more than 460K output files, and a total write workload of more than 6.6TB.

The remainder of this paper is organized as follows: Section II presents the dataflow-driven design patterns implemented in the SciFlow framework. Section III describes our case study for the implementation of rare event simulation. Section IV provides the baseline benchmark and evaluates the performance of the FFS implementation. Section V is a short survey of related work. Section VI concludes with future research plans using the SciFlow model architecture.

II. DATAFLOW-DRIVEN SCIENTIFIC COMPUTING IN SCIFLOW

One goal of the SciFlow model architecture is to maximize reusability and compatibility between scientific computing applications and the standard Hadoop platform. Hadoop has a very robust environment that is rapidly evolving. Any customized solutions or modifications to the current version of Hadoop will have difficulty keeping in step with the frequent updates of the Hadoop components. Based on this observation, we use design principles that emphasize simplicity, scalability and fault-tolerance over the underlying computing infrastructures by building a new layer over the standard Hadoop components.

A. Design Patterns for Scientific Computing Application

We define a dataflow of a scientific computing application as a directed acyclic graph (DAG) assembled by using a series of data processing flows and computational flows. In the dataflow DAG, a node denotes either a data processing operation or a computational simulation in a scientific computing application. Each individual node has an incoming input source and an outgoing output sink, and each edge indicates a dataflow path and direction. Fig. 1 illustrates a simple dataflow DAG, where the yellow arrows point to a computational flow and a data processing flow, respectively. Each of them has a data source for containing the ingress stream and a data sink for the egress stream.

Figure 1. Two basic dataflows in SciFlow. Computational flow is used to distribute the computational model and collect the results from scientific applications. Data processing flow is used to process streaming data.

There are many different types of patterns for message passing, workflow and parallel computing [10–13]. In SciFlow, we utilize seven primitive dataflow patterns (Fig. 2) that provide high-level abstractions for the construction of user-designed algorithms and representation of scientific workflows. The seven primitive dataflow patterns are:
• Split: A single data stream is split into two or more streams by operations in data processing. The operations can be applied to each element in the data stream or to a group of elements based on their properties. An example of split is the selection or classification of different data sets from the data stream under the given data filter conditions.
• Merge: The inverse of the split pattern. Multiple incoming data streams with identical data schemas are combined and stacked into one data stream. Examples of merge include input data preparation for further computational tasks and data aggregation among different data stream paths.
• Join: Similar to the merge pattern, but has the ability to combine incoming streams with different data schemas by using their common field values.
• Map-only: A collection of data sets and parameters is distributed to a number of subsets, and a copy of the scientific application runs on each of these subsets concurrently. The deployment and execution of these scientific applications are implemented under this pattern with completely independent operations.
• Reduce: This pattern is used to collect results produced by the scientific application in the Map-only pattern. The user can specify post-processing to filter or refine a subset of the output.
• Iteration: A repetition of sub-dataflows by assigning new data sources and data sinks. The standard Hadoop MapReduce programming model does not support iteration operations, so dataflows must be acyclic. However, we can implement this pattern by using a number of equivalent replications of sub-dataflows, assigning a new source and a new sink for each of them.
• Pipeline: Data processing flows and computational flows are connected by their common data source and data sink. To simplify implementation and also improve data-processing efficiency, we can concatenate multiple data processing operations together by eliminating shared sources and sinks.

Figure 2. Seven primitive dataflow patterns in SciFlow.

A dataflow can always be decomposed into its sub-dataflows until only primitive dataflow patterns remain. Then an equivalent dataflow can be reconstructed by tracking back along the decomposition path, using the primitive dataflow patterns (a composition sketch is given below). In SciFlow the underlying scientific application automatically inherits the features of MapReduce such as fault-tolerance, implicit parallelism and load balancing. To process the computational results we can construct a data analysis pipeline using data processing patterns and apply it after the corresponding computational flow.
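As a concrete illustration, the following minimal sketch assembles a small dataflow DAG from the primitive patterns. The pattern constructors mirror the SciFlow listings in Section III.B; the flow names, wrapper commands, HDFS paths, field names, and the two-argument MergePattern call are hypothetical and shown only to indicate how patterns compose.

Flow simA = new MapOnlyAndReducePattern("simA",
        "perl cachedir_jar/runSim.pl",   // mapper: wrapped scientific application
        "cat",                           // reducer: collect results unchanged
        "input/configsA", "output/resultsA");
Flow simB = new MapOnlyAndReducePattern("simB",
        "perl cachedir_jar/runSim.pl", "cat",
        "input/configsB", "output/resultsB");

// Merge: the two result streams share the same schema, so they stack into one.
Flow allResults = new MergePattern(simA, simB);

// Split: classify the merged results into two branches on a numeric field.
Flow classified = new SplitPattern(allResults,
        "score", Integer.TYPE,
        "score >= 100",                  // branch 1 condition
        "score < 100");                  // branch 2 condition
Flow accepted = classified.getTails()[0];
Flow rejected = classified.getTails()[1];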

B. Model Architecture

The SciFlow model architecture provides a deployment, execution and management environment for scientific computing applications. It has a four-layered architecture using MapReduce as the interpretation interface to a Hadoop platform. Fig. 3 illustrates the four-layered framework design.

Figure 3. SciFlow model architecture and its components.

The top layer, the Dataflow layer, is a SciFlow application built in three logical components:
• Dataflow DAG is a high-level implementation of a scientific workflow created by assembling a series of dataflow patterns with the underlying scientific applications.
• Controllers are used to control user-specified performance and resource constraints by passing application-native parameters to the underlying application wrapper, dynamically setting parallel granularity based on the runtime environment, and adjusting the workload submitted to the Hadoop cluster (a sketch of such a controller is given below).
• Resource Manager tracks data sources and data sinks for each of the data flows, reserves new sources and sinks for each new flow, and works with the iteration pattern to create a new data source and data sink for the upcoming sub-dataflow. In addition, the management of data serialization for each of the flows can also be implemented in this component.
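The paper does not show the controller code, so the following is a hypothetical sketch of a granularity controller under the assumptions that task counts are derived from the cluster size and that they are passed to Hadoop through the standard Hadoop 1.x job configuration properties; the class and method names are placeholders.

import org.apache.hadoop.mapred.JobConf;

// Hypothetical granularity controller: pick task counts from the cluster size
// and pass them to the Hadoop job configuration (Hadoop 1.x property names).
public class GranularityController {
    /** Suggest one map task per configuration, capped by the total map slots. */
    public static int mapTasksFor(int numConfigs, int machines, int mapSlotsPerMachine) {
        return Math.min(numConfigs, machines * mapSlotsPerMachine);
    }

    /** Apply the suggested granularity to a job configuration. */
    public static void apply(JobConf conf, int numConfigs, int machines, int slots) {
        conf.setInt("mapred.map.tasks", mapTasksFor(numConfigs, machines, slots));
        // Small result streams (e.g., time series data) need only a few reducers.
        conf.setInt("mapred.reduce.tasks", Math.max(1, machines / 8));
    }
}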

The dataflow-driven design patterns simplify the programming complexity for task-parallel applications on large-scale distributed systems and provide a natural way to match the original workflow description. The dataflow-based implementation has very high reusability for adding new features or recomposing as a sub-dataflow in a new development. The parallelism allows ordinary applications to be composed into MapReduce jobs that can scale to the distributed computing environment with its additional inherent fault-tolerance feature.

In the second tier, the Hadoop platform is used as the dataflow execution engine to support, manage and serve data-intensive MTC. It uses MapReduce to express a high-level, dataflow-based implementation while hiding explicit parallel operations and communication patterns. The file system of the Hadoop environment, the Hadoop Distributed File System (HDFS), partitions data into blocks and distributes them over the computing cluster. The design of HDFS provides important fundamental guarantees, such as reliability, scalability, availability and performance, to support its upper parallel data processing model from the file system's perspective.

Hadoop Streaming, the third tier, is a standard utility in the Hadoop distribution, and is used to support integration between Hadoop and external applications. In this tier, scientific applications are scheduled to their corresponding computing nodes and are executed from the localized task directory that is created by the Hadoop runtime system. Since the localized task directory is a POSIX-like file system, there is no need to modify the applications to adapt to the HDFS file system. To exchange information between the Hadoop runtime system and the localized system on which external applications run, a wrapper script is used to stage input and output files as well as to redirect or rewrite system stdin and stdout on each of the computing nodes. For scientific applications that do not support output through stdout, we can either parse the output file in the wrapper or instrument the application source code to obtain the required outputs.

The bottom layer, or the fourth tier, of the SciFlow model architecture is Hadoop-compatible hardware infrastructure. This may be, for example, a Hadoop cluster provisioned locally on a campus, or a Hadoop resource available through a cloud provider such as Amazon EC2.

The parallelization of SciFlow is achieved by implicit multiple levels of parallelism: job-level parallelism comes from the execution of independent MapReduce jobs; task-level parallelism comes from running the application using the Map-only pattern driven by the Hadoop Streaming job; data-level parallelism comes from the fine-grained granularity of data block replicas in HDFS; and hardware-level parallelism is possible by using accelerators and multiple cores. SciFlow automatically parallelizes the scientific applications on each machine using all available cores, running as a data-intensive many-task application.

C. Application Deployment and Execution

A SciFlow application can be deployed on any computing environment that is compatible with Hadoop MapReduce. To start a scientific computing application, the user only needs to package the implementation (Java class files) and the underlying scientific application dependencies (such as binary scientific programs, runtime libraries, and application configuration files) into two JAR files, and upload them to the Hadoop runtime system (HDFS file system). The dataflow of the scientific computing application is interpreted as separate MapReduce jobs which are assigned to the JobTracker with a DAG arrangement. The JobTracker breaks MapReduce jobs into map tasks and reduce tasks and distributes them in parallel over the entire cluster by allocating tasks to TaskTrackers on the worker machines. Before executing a task, the TaskTracker extracts the corresponding job JAR from the MapReduce workflow and copies it to the local filesystem. The application dependencies are also copied from the Distributed Cache to the current task work directory on the local disk. After that, an instance of TaskRunner is created to launch and run the task. When a task is completed, the Hadoop runtime system automatically recycles the related local resources, including intermediate files and directories.
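Since SciFlow builds its patterns on the Cascading [18] APIs, one plausible way to wire and submit the resulting flows as a DAG of MapReduce jobs is Cascading's cascade connector. The paper does not show this step, so the sketch below is an assumption: it presumes the pattern classes return standard Cascading Flow objects, as the listings in Section III.B suggest.

import cascading.cascade.Cascade;
import cascading.cascade.CascadeConnector;
import cascading.flow.Flow;

// Hypothetical driver: the CascadeConnector orders the flows by their shared
// sources and sinks and submits the resulting MapReduce jobs to the cluster.
public class SciFlowDriver {
    public static void run(Flow... flows) {
        Cascade cascade = new CascadeConnector().connect(flows);
        cascade.complete();   // blocks until every job in the DAG has finished
    }
}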

III. CASE STUDY – LARGE-SCALE RARE EVENT SIMULATION

Simulating rare events in detailed atomistic molecular simulations to obtain a statistically relevant number of transition trajectories is often computationally prohibitive. FFS [9] provides a promising route to sample rare events in equilibrium and non-equilibrium systems, but can become computationally challenging for sophisticated models with complex free-energy landscapes when optimized interface placement is not straightforward. Examples of rare events include nucleation-driven phase transitions (e.g., ice nucleation), genetic switch flipping, and (lattice) protein folding [14]. The fundamental idea of FFS is to generate trajectories from an initial to a final state by propagating smaller trajectories through a series of interfaces between the initial and final states, as shown in Fig. 4. The initial and final states and the interfaces are distinguished based on the values of an order parameter. The order parameter represents a physical quantity that distinguishes between the initial and final states of the system. For example, in a liquid water-to-ice transition, the order parameter could be the number of ice-like water molecules.

Figure 4. A schematic view of the forward flux sampling method [9].

In our study, we focus on liquid-to-solid transitions in water-methane solutions using an atomistically detailed model for water (TIP4P/Ice) [15]. At low temperature and high pressure, water-methane solutions can form solid crystalline structures called hydrates. We use the sum of the number of water molecules in the three largest solid-like clusters as our order parameter. Solid-like water molecules are those that have structures consistent with those found in ice and hydrate structures. A cluster includes all solid-like water molecules that are within a specified cut-off distance of each other (we use a value of 0.3 nm as our cut-off distance). For further details of the simulation system and order parameter calculations see [16]. This system is expected to have a complex energy landscape and, therefore, sampling the liquid-to-solid transition has proven to be rather challenging, even with FFS. We have correspondingly modified the original method. For example, we incorporated automatic placement of interfaces during execution instead of deciding the interfaces a priori as suggested in the original algorithm. It is also noteworthy that the studies made possible with SciFlow were impossible to achieve through our previous bash-script based implementation.
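To make the cluster-based order parameter described above concrete, the following toy sketch clusters solid-like molecules that lie within the 0.3 nm cutoff of each other and sums the sizes of the three largest clusters. It is illustrative only and not the authors' analysis code (see [16]): it assumes the solid-like classification and coordinates are already available and it ignores periodic boundary conditions.

import java.util.*;

// Toy sketch of the order parameter: BFS over the cutoff-distance neighbor
// graph of solid-like molecules, then sum the three largest cluster sizes.
public class OrderParameter {
    static final double CUTOFF = 0.3; // nm

    /** coords: positions (nm) of molecules already classified as solid-like. */
    public static int compute(double[][] coords) {
        int n = coords.length;
        boolean[] visited = new boolean[n];
        List<Integer> clusterSizes = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            if (visited[i]) continue;
            int size = 0;
            Deque<Integer> queue = new ArrayDeque<>();
            queue.add(i);
            visited[i] = true;
            while (!queue.isEmpty()) {
                int a = queue.poll();
                size++;
                for (int b = 0; b < n; b++) {
                    if (!visited[b] && dist(coords[a], coords[b]) <= CUTOFF) {
                        visited[b] = true;
                        queue.add(b);
                    }
                }
            }
            clusterSizes.add(size);
        }
        // Order parameter: sum of the three largest cluster sizes.
        clusterSizes.sort(Collections.reverseOrder());
        int sum = 0;
        for (int k = 0; k < Math.min(3, clusterSizes.size()); k++) sum += clusterSizes.get(k);
        return sum;
    }

    static double dist(double[] p, double[] q) {
        double dx = p[0] - q[0], dy = p[1] - q[1], dz = p[2] - q[2];
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }
}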

A. Big Data Challenges for FFS Simulation

The execution of the FFS application creates hundreds of thousands of output files and many terabytes of data. Managing these data and the analysis of the resulting rare events over the massive simulation data sets require infrastructure-level support for the application. In addition, the FFS implementation presents a variety of configuration, runtime, data management, and task management challenges when run at scale. Program exceptions or system failures can introduce statistical errors and can result in incorrect results. The guaranteed computation of millions of MD simulation tasks needs a reliable data processing and computing infrastructure. Our execution of the FFS simulations at the scale of thousands of CPU cores generates a high data velocity, significant aggregate I/O and task throughput, and a very large (TBs) data volume on the underlying distributed computing system.

Algorithm 1. Modified FFS and its dataflow-driven design pattern
1   set V_0 ← initial configurations
2   set K ← number of trial runs
3   set T ← threshold of λ selection
4   set Δ ← space variation
5   set G_beyond,0 ← ∅
6   while (V_i ≠ ∅) do
7       for each c ∈ V_i do
8           for all k ← 1 to K do
9               G_short ⇐ MDSim_short(c, rand_k)
10          end for
11      end for
12      count ← CountClusterSize(G_short)
13      λ_{i+1} ← Findλ_{i+1}(count, T)
14      (G_cross, G_inc) ⇐ Identify(G_short, λ_i, λ_{i+1})
15      for each c′ ∈ G_inc do
16          G_long ⇐ MDSim_long(c′)
17      end for
18      G′_cross ⇐ Identify(G_long, λ_i, λ_{i+1})
19      G_union ⇐ Merge(G_cross, G′_cross, G_beyond,i)
20      (G_beyond,i+1, G_Δ) ⇐ Collect(G_union, λ_next, Δ)
21      for each g ∈ G_Δ do
22          V_{i+1} ⇐ GenerateConfiguration(g)
23      end for
24      if Count(G_union) = Count(V_i) × K
25          then exit
26  end while
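Line 13 of Algorithm 1 places the next interface automatically from the trial-run statistics rather than a priori. The paper does not give this routine, so the sketch below is only one plausible realization, under the assumption that T is the number of trial trajectories that should lie at or beyond the new interface; the class and parameter names are hypothetical.

import java.util.Arrays;

// Hypothetical sketch of automatic interface placement (Line 13 of Algorithm 1):
// choose lambda_{i+1} as the T-th largest maximum cluster size observed among
// the short trial runs, so roughly T trajectories cross the new interface.
public class FindLambda {
    public static int next(int[] maxClusterSizes, int T) {
        int[] sorted = maxClusterSizes.clone();
        Arrays.sort(sorted);                       // ascending order
        int k = Math.max(0, sorted.length - T);    // index of the T-th largest value
        return sorted[k];
    }
}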

B. FFS Implementation by Using Dataflow-driven Design Patterns

The modified FFS algorithm and its implementation using the concept of dataflow-driven design are shown in Algorithm 1. There are three pairs of Map-only and Reduce dataflow patterns (shown in yellow) that represent the underlying concurrent MD simulations and the conversion between trajectory files and molecular structure files. In our current design, we use GROMACS [17], a freely available MD simulation package, for our simulations. We use the Cascading [18] APIs and the Gradle build environment to implement the primitive dataflow patterns. The implementations of these computational flows are mapped to the underlying Hadoop MapReduce jobs by using dataflow pattern API calls. For example, the following statement implements the short MD simulations (first yellow area, Lines 7−11 in Algorithm 1) using SciFlow:

Flow shortMDSimFlow = new MapOnlyAndReducePattern(
        flowName,                         // flow name
        "perl cachedir_jar/runMD.pl",     // mapper
        "cat",                            // reducer
        runMDConfigShortPath,             // source data
        runMDResultShortPath);            // sink data

Lines 12−13 in Algorithm 1 count the cluster size from the output of the first MD simulation, and then calculate the value of the next interface λ. This is implemented using the standard Count operation in Cascading and the Pipeline pattern defined in SciFlow. Line 14 in Algorithm 1 identifies the trajectories that cross the next interface by comparing their cluster size with the next λ. We decompose this statement into two primitive dataflow patterns:

Flow joinFlow = new JoinPattern(
        lambdaFlow,                       // dataflow containing the λ_{i+1} value
        "ffs_id, lambda_value",           // schema for λ data
        shortMDSimFlow,                   // simulation output dataflow
        "config, time, … , cluster_size"); // schema for MD output

Flow identifyFlow = new SplitPattern(
        joinFlow,                         // connect with joinFlow
        "cluster_size",                   // comparison field
        Integer.TYPE,                     // comparison type
        "cluster_size >= λ_{i+1}",        // split condition 1
        "λ_i < cluster_size < λ_{i+1}");  // split condition 2

Flow crossFlow = identifyFlow.getTails()[0];   // branch flow 1
Flow incomFlow = identifyFlow.getTails()[1];   // branch flow 2

The Join dataflow pattern adds the next λ information to the output of the short MD simulation. In the next step, the Split dataflow pattern identifies the crossed and incomplete trajectories by comparing the cluster size with the λ_i and λ_{i+1} values. The long MD simulation (Lines 15−18 in Algorithm 1) can be implemented using a similar approach. In Line 19, the Merge dataflow pattern is used to combine the crossed trajectories from the beyond, short and long MD simulation dataflows:

Flow crossMergeFlow = new MergePattern(   // merged flow
        crossShortFlow,                   // flow 1
        crossLongFlow,                    // flow 2
        crossBeyondFlow);                 // flow 3

We use a similar approach to implement the remaining parts of the algorithm, including the outer iteration over interfaces; a sketch of that iteration is given below.
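The Iteration pattern (Section II.A) replays the per-interface sub-dataflow with fresh sources and sinks, since Hadoop dataflows must stay acyclic. The paper does not show the iteration driver, so the following is a hypothetical illustration of how the while loop of Algorithm 1 might be expressed; the helper names are placeholders, not SciFlow API.

// Hypothetical driver for the Iteration pattern: each interface i gets a fresh
// copy of the sub-dataflow wired to iteration-specific HDFS paths. Helper names
// (buildInterfaceSubDataflow, runFlows, countNewConfigurations) are placeholders.
int interfaceIndex = 0;
while (true) {
    String source = "ffs/interface-" + interfaceIndex + "/configs";
    String sink   = "ffs/interface-" + (interfaceIndex + 1) + "/configs";

    // Assemble the per-interface sub-dataflow (short MD, identify, long MD,
    // merge, collect) against this iteration's source and sink.
    Flow[] subDataflow = buildInterfaceSubDataflow(source, sink);
    runFlows(subDataflow);

    // Stop when the data processing flow produced no configurations for the
    // next interface, i.e., V_{i+1} = ∅ in Algorithm 1.
    if (countNewConfigurations(sink) == 0) {
        break;
    }
    interfaceIndex++;
}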

IV. PERFORMANCE EVALUATION

A. Cluster Hardware and Software Configuration

We evaluate the performance of the SciFlow-based FFS implementation using two different computing clusters. Our experimental configuration uses the original Hadoop 1.1.2 distribution from Apache. To improve JobTracker performance for processing a large number of tasks, we increase the concurrent task limit in the capacity scheduler and enable out-of-band heartbeats in mapred-site.xml. Table I gives detailed hardware configuration information for the two cluster systems. Cluster-A is a 128-machine cluster with 8×2.5GHz CPU cores and a Gigabit Ethernet network. Cluster-B is a 64-machine cluster with 16×2.40GHz CPU cores and a 40Gbps InfiniBand network. In this experimental platform Hadoop runs on IP over InfiniBand. Each CPU core of Cluster-B has roughly two times the flops performance of a CPU core of Cluster-A. The nodes in both clusters have a single local disk, which is a common configuration in a commodity cluster computing environment. Cluster-A is representative of a computing environment on a cloud or grid with cost-effective processors and a slower interconnect. Cluster-B is representative of a campus high-performance cluster environment with a dedicated scientific computing environment and network. Evaluation on these two hardware configurations allows the exploration of the impact of various cluster configurations on the performance of the SciFlow-based FFS implementation.

TABLE I. CLUSTER HARDWARE CONFIGURATIONS

| Name      | CPU                            | HDD               | RAM            | Network                   |
|-----------|--------------------------------|-------------------|----------------|---------------------------|
| Cluster-A | Intel Xeon L5420, 8×2.50GHz    | 73GB 10000RPM SAS | 32GB DDR2-667  | 1 Gb Ethernet             |
| Cluster-B | Intel Xeon E5-2665, 16×2.40GHz | 1TB 7200RPM SATA  | 64GB DDR3-1600 | 10/40Gb InfiniBand FDR/EN |

B. Baseline Performance Benchmark

The baseline performance measures the maximum aggregate task and I/O throughput provided by the two clusters, which are two key factors that affect the performance of dataflow patterns and scientific computing applications. Fig. 5 provides the detailed performance measurements on the Hadoop clusters over a range of numbers of machines.

Figure 5. Baseline benchmarks on two Hadoop clusters.

Task Throughput: To measure the maximum possible task throughput rate reached on the Hadoop platform, we run an "empty-workload" Hadoop Streaming job at a variety of scales on both clusters. The input file for each benchmark is a list of empty strings with a size of 20 times the count of map slots times the count of machines. The total number of map tasks for the Hadoop Streaming job is set to the same as the list size. The mapper executes sleep(0) for a zero-duration run. We estimate the peak value of task throughput as the total number of map tasks divided by the running time of the map phase.

I/O Throughput: We obtain aggregate read and write I/O performance by using TestDFSIO from the Hadoop distribution. TestDFSIO is a distributed I/O benchmark program that produces a separate map task to read or write files on each of the DataNodes in parallel. To minimize the effect of filesystem buffers on the benchmark, we let each map task write and read 20 times on 20 separate files. The aggregate I/O throughput is calculated as the total number of map tasks multiplied by the average throughput reported by TestDFSIO.

The measurement results of the benchmark characterize the baseline of the two clusters and provide a performance reference. Cluster-A has better aggregate write performance than Cluster-B due to its larger number of machines and local hard disks. Concurrently writing many files to the local disks is a bottleneck of Cluster-B. However, the high bandwidth and the low latency of the InfiniBand network in Cluster-B give higher aggregate read throughput and task throughput as the count of cores increases. The switch backplane capacity and CPU performance of Cluster-A are a major bottleneck and limit its aggregate read throughput and task throughput.
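The two baseline estimates above reduce to simple arithmetic; the short sketch below restates them in code to make the formulas explicit. The numeric inputs are illustrative placeholders, not measurements from the paper.

// Restating the two baseline estimates described above. All numbers below are
// hypothetical examples, not values reported in the paper.
public class BaselineEstimates {
    /** Peak task throughput: total map tasks / duration of the map phase (s). */
    static double peakTaskThroughput(long totalMapTasks, double mapPhaseSeconds) {
        return totalMapTasks / mapPhaseSeconds;
    }

    /** Aggregate I/O throughput: map tasks × average per-task MB/s from TestDFSIO. */
    static double aggregateIoThroughput(int mapTasks, double avgThroughputMBps) {
        return mapTasks * avgThroughputMBps;
    }

    public static void main(String[] args) {
        // Example: 64 machines × 8 map slots × 20 empty tasks each, map phase of 40 s.
        long tasks = 64L * 8 * 20;
        System.out.println(peakTaskThroughput(tasks, 40.0) + " tasks/s");
        System.out.println(aggregateIoThroughput(128, 3.0) + " MB/s");
    }
}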

C. Performance Evaluation for FFS Implementation

To better understand the scaling and performance behaviors of SciFlow applications, we test the FFS implementation on both clusters with a range of counts of CPU cores. We perform the calculations in both experimental environments using the same compiled FFS program and a fixed problem size as input. We measure the total execution time and analyze the average computational resources used in processing the first interface of the FFS simulation to compare the two clusters. Since the number of cores in the biggest cluster is an order of magnitude larger than in the smallest one, we define a problem size large enough to generate sufficient workload on each testing set. For our performance testing, we use 32 starting configurations as the input (corresponding to the red circles in Fig. 4) and 1024 runs of GROMACS on each of them. In order to observe I/O traffic beyond the MapReduce program itself (i.e., from GROMACS), we add additional I/O counters to the Hadoop Streaming job to capture and count all possible read and write operations on HDFS or the local file system.

The performance model of the FFS implementation is characterized by execution time, average task throughput and average aggregate I/O throughput. To obtain precise measurements, all performance-related quantities are calculated after the FFS run by analyzing the Hadoop job history log, minimizing the measurement overhead on the SciFlow application. Fig. 6 provides the experimental results of the FFS implementation at different scales. Somewhat surprisingly, the total execution time decreases almost linearly as the number of CPU cores is increased. The FFS implementation based on dataflow-driven design patterns using SciFlow is very efficient, and we do not find any obvious overhead at the current testing scale. For example, the FFS runtime environment over 1024 CPU cores requires an average 13.78 task/s throughput and produces an average 195.13MB/s aggregate I/O throughput on Cluster-A, and 25.86 task/s and 366.25MB/s on Cluster-B. The average task and I/O throughputs of the FFS simulation are well within the performance capacity of the Map-only pattern, as shown in our previous baseline benchmark in Fig. 5.

Figure 6. Execution time, task throughput, average read and write I/O throughput at different scales.

To explore the future peak performance requirements introduced by SciFlow applications, we profile the FFS implementation along the three performance factors. Fig. 7 shows the performance profiling results from the FFS simulation for the first five interfaces over 1024 CPU cores on Cluster-B. We find that the basic performance pattern of the FFS simulation is the interaction between task throughput and I/O throughput. In this experiment, the simulation starts from a list of inputs, including 47 configurations, and generates 256 GROMACS simulation tasks for each of the inputs. A very large number of MD simulations are executed. These are distributed using the Map-only pattern over the whole cluster. The start and completion of each MD simulation produces internal streaming data (time series data) and external output files (MD trajectory files) on the Hadoop cluster. At the end of the computational flow, the Reduce pattern collects and aggregates the computational results for the next data processing flow, which is a composition of the Merge, Split and Join patterns. Because the size of the time series data is not very large, we set up a granularity controller to assign a small number of mappers and reducers to handle those data sets. Finally, the short data processing phase generates the configurations to continue new simulations at the next interface.

Figure 7. Profiling of the FFS implementation over the first five interfaces of the simulation.

TABLE II. AGGREGATE PEAK THROUGHPUT ON FIRST FIVE FFS INTERFACES

| Timestamp (sec) | Task Rate (Task/s) | Read (MB/s) | Write (MB/s) |
|-----------------|--------------------|-------------|--------------|
| 560             | 308.75             | 386.16      | 432.34       |
| 1180            | 288.85             | 333.87      | 391.83       |
| 1880            | 232.55             | 320.19      | 374.08       |
| 2940            | 125.55             | 355.77      | 381.06       |
| 3740            | 171.30             | 350.58      | 391.77       |

By studying the profile of the FFS application, we can identify its peak demand on the underlying hardware resources. Table II shows the timestamps of the five highest task rates along with their I/O throughputs. We find that a high task throughput always corresponds to high I/O activity. In the FFS algorithm, the beginning of the second MD simulation (represented by Line 16 in Algorithm 1) is the point in the code that generates the highest task and I/O throughput. The test code runs about 230K MD simulation runs, with about 500K tasks, more than 230K output files, and a total write workload of about 3.3TB.

V. RELATED WORK

Two decades ago, Beowulf clusters provided a cost-effective high-performance computing (HPC) infrastructure by combining a large number of commodity computing devices, and were widely adopted by both industry and academia [19, 20]. Beowulf cluster computing has become a standard for high performance computing, and many distributed software infrastructures have been built to support provisioning, workflow, scheduling, and resource management on Beowulf clusters, such as Swift [21], CometCloud [22], Kepler [23], Taverna [24], Wings [25], Dryad [26] and more [27, 28]. However, none of these software infrastructures has been as widely adopted as the Beowulf cluster hardware infrastructure [29]. We discuss the workflow systems that are most closely related to our approach.

Swift [21] and the Karajan workflow engine [28] form a performance-oriented MTC workflow system designed for composing application programs into parallel applications using a scripting-language-based description. Swift allows computational scientists to assemble a complicated scientific workflow and easily scale their own applications on large-scale distributed systems. To improve task processing performance, the Turbine execution model [30] is designed to replace Swift's centralized evaluator with an asynchronous distribution mechanism. There is no filesystem-level fault-tolerance in the current Swift system, which could cause problems in the handling of a large number of data sets. In contrast, SciFlow leverages the built-in block replicas of HDFS to support very large numbers of tasks and corresponding output files.

Skywriting and CIEL [31] use a master/worker computation model in which a job is represented as a dynamic task graph. The programming language can describe task-level parallelism with iterative and recursive support. However, CIEL cannot support the same level of high task throughput as SciFlow, which limits its scalability for MTC applications.

CometCloud [22] emphasizes platform as a service (PaaS) and a hybrid HPC + Cloud infrastructure to enable dynamic and on-demand federation of computing infrastructures, as well as the deployment and robust execution of applications on these federated environments. CometCloud lacks a high-level programming model to support the incorporation of scientific applications. However, because standard MapReduce is supported on CometCloud, it should be straightforward to deploy and execute a SciFlow application over underlying infrastructure resources that may be managed by CometCloud.

Nebula and Dryad [26] form a data-parallel solution from Microsoft Research that is very similar to Hadoop. They extend the MapReduce programming model to a more relaxed model in which a user can specify an arbitrary communication DAG rather than a sequence of map/shuffle/reduce operations. To our knowledge, neither Nebula nor Dryad is being actively developed at present.

There are GUI- and web-portal-based workflow systems, such as Kepler [23], Taverna [24], Galaxy [32] and Wings [25], that focus on the ease of use of a computational pipeline. Their graphical representation and centralized architecture can be inefficient or problematic when faced with complex data-intensive MTC applications as compared to scripting-based workflow systems [28, 33].

A few scientific computing projects have explored the Hadoop environment for combining computation with large-scale data analytics. Dede et al. [34] give a preliminary evaluation of representing patterns of scientific ensembles on Hadoop. They analyze and discuss many of the gaps and challenges in mapping common scientific ensemble patterns to the MapReduce programming model. The Apache Oozie project [35] provides a production workflow management system for Hadoop, which can also be used to support MapReduce-based scientific workflows. Many cloud providers and HPC and grid computing services provide direct support for MapReduce, enabling applications to be deployed on a variety of computational infrastructures.

VI. CONCLUSIONS AND FUTURE WORK

We have presented an approach and the SciFlow model architecture for easily building parallel scientific computing applications through the use of dataflow-driven design patterns. In the SciFlow model architecture, we adopt the standard Hadoop distribution as the core of its data processing, which enables SciFlow applications to leverage the data processing, analytics, and mining tools built into the Hadoop stack. The dataflow-driven design patterns for SciFlow are based on a MapReduce programming model that has been widely adopted by industry and academia. Our approach allows applications to run on a high-performance computing cluster without requiring knowledge of the underlying computing infrastructures. The SciFlow model architecture leverages Hadoop and its underlying ecosystem to automatically parallelize data processing, and to manage dataflow dependences, synchronization, and data serialization and transfer across the whole cluster. The design principles of this framework emphasize simplicity, scalability and fault-tolerance. We implement a software layer over advanced components from the Hadoop environment and leverage the extensive services of the Hadoop software stack. We validate the implementation and study its performance via a case study of a rare event simulation application using FFS.

The SciFlow model architecture opens doors for further exploration in this area. In future work, we plan to port several data-intensive MTC applications to SciFlow. We also plan to further explore and characterize the performance behaviors of SciFlow in areas such as scalability, fault-tolerance, and load balancing, and using various hardware accelerators (e.g., MIC, GPU) and computing environments (e.g., cloud, grid).


ACKNOWLEDGMENTS

This research is sponsored in part by NSF awards CNS-1228312 and OCI-1212680, and used HPC resources of Clemson Computing and Information Technology. We gratefully acknowledge editing assistance by Randy Apon.

REFERENCES

[1] I. Raicu, "Many-task computing: bridging the gap between high-throughput computing and high-performance computing," University of Chicago, Chicago, IL, USA, 2009.
[2] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[3] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, "Pig latin: a not-so-foreign language for data processing," in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 2008, pp. 1099–1110.
[4] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy, "Hive - a petabyte scale data warehouse using Hadoop," in 2010 IEEE 26th International Conference on Data Engineering (ICDE), 2010, pp. 996–1005.
[5] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, "Bigtable: A Distributed Storage System for Structured Data," ACM Trans. Comput. Syst., vol. 26, no. 2, pp. 4:1–4:26, Jun. 2008.
[6] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: cluster computing with working sets," in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, 2010, p. 10.
[7] S. Owen, R. Anil, T. Dunning, and E. Friedman, Mahout in Action. Greenwich, CT, USA: Manning Publications Co., 2011.
[8] S. Seo, E. J. Yoon, J. Kim, S. Jin, J.-S. Kim, and S. Maeng, "HAMA: An Efficient Matrix Computation with the MapReduce Framework," in 2010 IEEE Second International Conference on Cloud Computing Technology and Science (CloudCom), 2010, pp. 721–726.
[9] R. J. Allen, D. Frenkel, and P. R. ten Wolde, "Simulating rare events in equilibrium or nonequilibrium stochastic systems," The Journal of Chemical Physics, vol. 124, no. 2, p. 024102, Jan. 2006.
[10] W. M. P. van der Aalst, A. H. M. ter Hofstede, B. Kiepuszewski, and A. P. Barros, "Workflow Patterns," Distributed and Parallel Databases, vol. 14, no. 1, pp. 5–51, Jul. 2003.
[11] S. Gorlatch, "Send-receive considered harmful: Myths and realities of message passing," ACM Trans. Program. Lang. Syst., vol. 26, no. 1, pp. 47–56, Jan. 2004.
[12] M. McCool, J. Reinders, and A. Robison, Structured Parallel Programming: Patterns for Efficient Computation. Elsevier, 2012.
[13] C. Pautasso and G. Alonso, "Parallel computing patterns for Grid workflows," in Workshop on Workflows in Support of Large-Scale Science (WORKS '06), 2006, pp. 1–10.
[14] R. J. Allen, C. Valeriani, and P. Rein ten Wolde, "Forward flux sampling for rare event simulations," Journal of Physics: Condensed Matter, vol. 21, no. 46, p. 463102, Nov. 2009.
[15] J. L. F. Abascal, E. Sanz, R. García Fernández, and C. Vega, "A potential model for the study of ices and amorphous water: TIP4P/Ice," The Journal of Chemical Physics, vol. 122, no. 23, p. 234511, Jun. 2005.
[16] S. Sarupria and P. G. Debenedetti, "Homogeneous Nucleation of Methane Hydrate in Microsecond Molecular Dynamics Simulations," J. Phys. Chem. Lett., vol. 3, no. 20, pp. 2942–2947, Oct. 2012.
[17] E. Lindahl, B. Hess, and D. Van Der Spoel, "GROMACS 3.0: a package for molecular simulation and trajectory analysis," Molecular Modeling Annual, vol. 7, no. 8, pp. 306–317, 2001.
[18] "Cascading | Application Platform for Enterprise Big Data." [Online]. Available: http://www.cascading.org/.
[19] G. Bell and J. Gray, "What's next in high-performance computing?," Communications of the ACM, vol. 45, no. 2, pp. 91–95, 2002.
[20] "Development over Time | TOP500 Supercomputer Sites." [Online]. Available: http://www.top500.org/statistics/overtime/.
[21] M. Wilde, M. Hategan, J. M. Wozniak, B. Clifford, D. S. Katz, and I. Foster, "Swift: A language for distributed parallel scripting," Parallel Computing, vol. 37, no. 9, pp. 633–652, 2011.
[22] H. Kim and M. Parashar, "CometCloud: An Autonomic Cloud Engine," in Cloud Computing, R. Buyya, J. Broberg, and A. Goscinski, Eds. John Wiley & Sons, Inc., 2011, pp. 275–297.
[23] B. Ludäscher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger, M. Jones, E. A. Lee, J. Tao, and Y. Zhao, "Scientific workflow management and the Kepler system," Concurrency and Computation: Practice and Experience, vol. 18, no. 10, pp. 1039–1065, 2006.
[24] T. Oinn, M. Addis, J. Ferris, D. Marvin, M. Senger, M. Greenwood, T. Carver, K. Glover, M. R. Pocock, A. Wipat, and P. Li, "Taverna: a tool for the composition and enactment of bioinformatics workflows," Bioinformatics, vol. 20, no. 17, pp. 3045–3054, Nov. 2004.
[25] Y. Gil, V. Ratnakar, J. Kim, J. Moody, E. Deelman, P. A. González-Calero, and P. Groth, "Wings: Intelligent workflow-based design of computational experiments," IEEE Intelligent Systems, vol. 26, no. 1, pp. 62–72, 2011.
[26] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: distributed data-parallel programs from sequential building blocks," in Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems, New York, NY, USA, 2007, pp. 59–72.
[27] D. Talia, "Workflow Systems for Science: Concepts and Tools," ISRN Software Engineering, vol. 2013, pp. 1–15, 2013.
[28] E. Deelman, D. Gannon, M. Shields, and I. Taylor, "Workflows and e-Science: An overview of workflow system features and capabilities," Future Generation Computer Systems, vol. 25, no. 5, pp. 528–540, May 2009.
[29] S. Jha, M. Cole, D. S. Katz, M. Parashar, O. Rana, and J. Weissman, "Distributed computing practice for large-scale science and engineering applications," Concurrency and Computation: Practice and Experience, pp. 1559–1585, 2012.
[30] J. M. Wozniak, T. G. Armstrong, K. Maheshwari, E. L. Lusk, D. S. Katz, M. Wilde, and I. T. Foster, "Turbine: A distributed-memory dataflow engine for extreme-scale many-task applications," 2012.
[31] D. G. Murray, M. Schwarzkopf, C. Smowton, S. Smith, A. Madhavapeddy, and S. Hand, "CIEL: a universal execution engine for distributed data-flow computing," in Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, 2011, p. 9.
[32] J. Goecks, A. Nekrutenko, J. Taylor, and the Galaxy Team, "Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences," Genome Biology, vol. 11, no. 8, p. R86, 2010.
[33] I. Raicu, I. Foster, Y. Zhao, A. Szalay, P. Little, C. Moretti, A. Chaudhary, and D. Thain, "Towards data intensive many-task computing," in Data Intensive Distributed Computing: Challenges and Solutions for Large-Scale Information Management. Hershey, PA, USA: IGI Global, 2009.
[34] E. Dede, M. Govindaraju, D. Gunter, and L. Ramakrishnan, "Riding the elephant: managing ensembles with Hadoop," in Proceedings of the 2011 ACM International Workshop on Many Task Computing on Grids and Supercomputers, New York, NY, USA, 2011, pp. 49–58.
[35] M. Islam, A. K. Huang, M. Battisha, M. Chiang, S. Srinivasan, C. Peters, A. Neumann, and A. Abdelnur, "Oozie: towards a scalable workflow management system for Hadoop," in Proceedings of the 1st ACM SIGMOD Workshop on Scalable Workflow Execution Engines and Technologies, New York, NY, USA, 2012, pp. 4:1–4:10.
Abdelnur, “Oozie: towards a scalable workflow management system for Hadoop,” in Proceedings of the 1st ACM SIGMOD Workshop on Scalable Workflow Execution Engines and Technologies, New York, NY, USA, 2012, pp. 4:1– 4:10.