USFD: A Unified Storage Framework for SOAR HPC Scientific Workflows

Grant Mackey [email protected]
Jun Wang [email protected]
Saba Sehrish [email protected]
John Bent [email protected]
Christopher Mitchell [email protected]
Meghan Wingate [email protected]

University of Central Florida, 4000 Central Florida Blvd., Orlando, FL 32816
Los Alamos National Lab, P.O. Box 1663, Los Alamos, NM 87545

ABSTRACT Emerging scientific workflows in HPC focus more on analysis than on simulation. Simulation output is so dense with information that copious amounts of analysis must be performed on a single output to understand the results of that simulation. We identify this repetitive analysis as a new application type: Simulate Once Analyze Repeatedly (SOAR) computing. Current scientific HPC platforms, when extended to SOAR computing, result in excessive data migration between compute and storage resources. For a workflow bound by file I/O, a large data migration overhead is unacceptable. We propose a framework, called USFD, which uses a data-intensive storage cluster coupled with an interoperability layer. USFD is a Unified Storage Framework designed to better support SOAR HPC scientific workloads through enhanced file I/O support and co-located storage and analysis. In this work we analyze the performance of USFD and other traditional HPC approaches for SOAR scientific workloads. Our results show that SOAR workflows which use USFD complete analysis with a 7.5x performance increase over other approaches with QCD and a 4x performance increase with FLASH.

1. INTRODUCTION

High Performance Computing (HPC) is currently at the peta-scale and is already designing and planning for exascale architectures [2]. As these supercomputers continue to deliver increasingly fine resolution simulations, they also produce correspondingly larger datasets. Only a few years ago, world-class supercomputers were generating terabytes of data. Now, simulations running on current supercomputers [24] can generate petabytes of data [11, 12], and exabytes of data are in our near future. This new generation of computing gives rise to new usage patterns.


Whereas older scientific workflows required only one cycle of simulation and one cycle of analysis to obtain a result, more recent simulations require more cycles of analysis for each simulation to extract knowledge [21, 37]. Their output is so information dense that multiple types of analyses must be performed in order to understand the simulation results. We identify this single-simulation, multi-cycle analysis workload as a new type of computing: Simulate Once Analyze Repeatedly (SOAR) computing. The number of analysis cycles can range anywhere from tens to hundreds depending on what is required to yield the results of interest [27, 49]. The exact number for any particular simulation is defined by, but not limited to: (1) the number of team members exploring the results of the simulation, (2) the number of variables that need to be examined individually and the number of different subsets of variables that need to be examined together, and (3) the number of different analysis algorithms, such as various statistical analyses and visualizations. In addition, simulations that generate a time series showing the simulation's evolution add another level of complexity to the analysis process, as subsets (up to and including the entire series) of the time steps are examined together. Thus, the number of rounds of analysis (N) can be summarized by:

F_M(N) = \sum_{i=1}^{M} F_i(N) = (\text{\# of Analysis Ops Available}) \times \left\lceil \frac{\text{Sizeof Output}}{\text{Sizeof Mem}} \right\rceil

where M is the number of users who might perform data analysis on a single simulation output, and F(N) is a function expressing the number of times data analysis will be performed on a single simulation result (a worked example appears at the end of this section).

There are several new technical challenges in Simulate Once Analyze Repeatedly computing. SOAR workloads focus more on high, sustained file I/O bandwidth during execution. While SOAR workloads may perform complex calculation, the majority of the runtime is spent doing file I/O. An underlying storage architecture needs to support both compute- and I/O-bound HPC workloads, with I/O as the first priority. Existing HPC platforms, when extended to support SOAR computing (Figures 1A and 1B), bear limitations which make them inappropriate for servicing the new requirements of SOAR computing.

1) Petabyte/Exabyte-scale data movement becomes a performance bottleneck in SOAR. Data migration is the constant movement of files from one computing resource to another via a parallel file system (PFS).
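Returning to the analysis-count formula F_M(N) above, here is a worked example with hypothetical numbers (the team size, operation count, and data sizes are illustrative assumptions, not measurements from this paper):

```latex
% Hypothetical: M = 4 analysts, 25 analysis operations available,
% a 2 TB simulation output, and 256 GB of aggregate analysis memory.
F_M(N) = \sum_{i=1}^{4} F_i(N)
       = 4 \times 25 \times \left\lceil \frac{2\,\mathrm{TB}}{256\,\mathrm{GB}} \right\rceil
       = 4 \times 25 \times 8 = 800 \text{ analysis passes over the output}
```

Even modest assumptions of this kind put N well into the hundreds, which is the regime SOAR targets.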

Figure 1: Ways to support both compute-intensive and data-intensive workloads in HPC (dotted lines represent data migration): (A) extension of a traditional approach, (B) extension of a hybrid approach, (C) USFD. Approaches A and B are extensions of current HPC platforms; USFD is our contribution.

To perform the analysis for SOAR with existing HPC approaches, data must be constantly migrated between compute and storage resources. For many years, data migration was inconsequential. Data sizes were trivial in comparison to the amount of bandwidth that I/O devices could provide [1, 41]. However, with the advent of petascale computing, simulations have grown larger and their analytics needs greater. Frequent data migration between computing resources is expensive in terms of I/O time [25]. Existing storage architectures for PFSs cannot support the throughput demands of SOAR computing without a prohibitively expensive expansion of equipment. Additionally, it is not clear that the software of existing PFSs will continue to scale to extremely large numbers of components. The stark reality of HPC is that most of the fastest parallel file systems are still on the order of gigabits/second transfer rates [17, 18, 33, 48], making file system I/O a large bottleneck in these supercomputers [28].

2) A SOAR workflow will combine applications with many different file I/O semantics. Any existing file system must be modified to properly support both compute- and data-intensive file I/O semantics. Hence, in a SOAR environment where the focus of the workload is on analyzing data, utilizing a data-intensive file system (DIFS) which has native support for the common file I/O semantics of analysis programs is the clear choice. We define data-intensive file systems (DIFS) as those which cater to large contiguous reads/writes to files and co-locate computation with storage, with less focus on small non-contiguous I/O (e.g., GoogleFS and HadoopFS).

It is not straightforward to find a unified storage system that can support both simulation output workloads and data-intensive analysis workloads. Existing PFSs designed for absorbing simulation output are not well suited for data-intensive applications due to their network architectures. Typical large supercomputers such as ORNL's Jaguar and LANL's Roadrunner have two networks: a fast interconnect between compute nodes, and a slower storage network. A small number of I/O nodes are dual-connected to both networks. Data transferred between compute nodes and the storage system is routed through these I/O nodes. This architecture allows the storage infrastructure to be shared by multiple supercomputers [36, 43]. The disadvantage of this approach is that the additional cost of the network makes it much more expensive than a comparable amount of storage attached directly to the compute nodes, as is the case in DIFS. On the other hand, existing DIFS do not support the semantics required for simulation output; specifically, they lack support for massive concurrent access to shared objects.

In this paper, we present our work developing a new framework to support Simulate Once Analyze Repeatedly workloads. We propose a framework, called USFD, which uses a data-intensive storage cluster coupled with an interoperability layer. USFD is a Unified Storage Framework designed to better support SOAR HPC scientific workloads. Utilizing a data-intensive storage resource with a new support framework removes constant data migration between compute and storage resources because it allows data analysis locally on the storage resource without sacrificing the semantics required for simulation output. Importantly, we provide POSIX and MPI-IO support for our chosen scientific SOAR workloads. We use two representative SOAR workloads in our experiments, FLASH and QCD, with their respective analysis applications. We have observed a 7.5x performance increase over extended HPC approaches with QCD and a 4x performance increase with FLASH.

2. DESIGN AND IMPLEMENTATION OF THE USFD FRAMEWORK USFD is designed to meet the challenges imposed by SOAR applications. In this section, we explain the characteristics of these workflows and our approach to address their challenges. We will also discuss how to extend existing HPC approaches to support SOAR as well as our detailed design of USFD with relevant use cases.

2.1 Scientific Application Workflow In scientific computing, there are applications which simply need to run calculations and return a straightforward answer to the user. They focus on processing and very rarely need to access the file system. Likewise, there are other applications that are computationally expensive but ultimately file I/O bound. We define a scientific workflow as a combination of both application types which results in discerning useful information from complex simulation results. A data-intensive workflow has been a silent trend in HPC that, only recently, has become a hot topic of conversation in the HPC community [5]. Scientists have created larger and larger simulations, which have resulted in ever larger data analysis demands. So large, in fact, that scientists have spent more time analyzing the output of their simulations than they have actually running them [47], hence the name data-intensive. In current data-intensive workflows, the focus is on analysis. The analysis is performed once, maybe twice, on a particular dataset and is very I/O intensive.

                          POSIX   MPI-IO   Other
  Simulation  FLASH         X       X        X
              QCD           X       X
  Analysis    ParaView      X                X
              ADAT          X

Table 1: File Semantics of Typical HPC Applications

However, simulations are still growing in scale. We have observed from experience [28, 39] that, to fully understand next generation simulations, many large data analysis phases must be executed, which is why we now introduce Simulate Once Analyze Repeatedly (SOAR) computing. SOAR is a new type of scientific application workflow that focuses on data analysis, but to an even higher degree than current data-intensive HPC workloads. This is because next generation simulations generate complex answers which must be analyzed in many different ways. As this is a new type of workload pattern, it has not yet been studied in depth. QCD is a physics simulation, and many different types of post-simulation data analysis are performed on its output [23]. QCD's analysis application, ADAT, provides a suite of more than 50 different types of analysis applications which can be run on the result of a single QCD run. While 50 different analysis applications may seem excessive, it is not uncommon for high energy physicists to use many of them for their analysis [14, 30, 34, 45]. Together, QCD and ADAT represent a concrete example of a SOAR workflow. Yet another SOAR workflow is FLASH [6]. Unlike QCD, FLASH is an application which can be analyzed by many different types of applications. Most commonly, FLASH is analyzed through visualization. What makes FLASH, and other simulations which require visualization, SOAR workloads is their scale. As the simulations get larger and larger, more visual inspection of the data must be done. That is, the data will be visualized at different levels of detail, filters, subsets, etc. All of these operations result in reads from the file system. Hence, the more a data set is visualized, the more data analysis is performed, making this workload another very good example of Simulate Once Analyze Repeatedly scientific computing.

2.2 Design Challenges In this section we discuss the design challenges for USFD, which include supporting different I/O semantics and file I/O patterns of different applications.

2.2.1 USFD: Supporting Different I/O Semantics The two application types which make up a SOAR workload, simulation and analysis, have very different file semantics. Simulations tend to use established file I/O semantics when reading or writing data. Most scientific simulations still rely on POSIX to handle their file I/O, while some rely on MPI-IO, HDF, NetCDF, and so on. Conversely, while some analysis applications do utilize these I/O semantics, more and more have their own specialized file semantics which provide their high-performance read bandwidths. When combining different application types together, the storage system must be able to handle multiple file I/O semantics. Otherwise, that storage system will not be able to properly support the intended application usage. In Table 1 we present a small list of applications which, when combined, form a SOAR workflow. The first workflow is the quantum chromodynamics suite from USQCD [23].

For this workflow, the simulation component, QCD, uses POSIX as its default means of file I/O. However, QCD can also use MPI-IO as its file I/O layer, as it is not dependent on POSIX for file I/O support. QCD's analysis application, ADAT, is strictly a POSIX-supported analysis operation. Another example of a SOAR workflow is the FLASH application [6]. FLASH differs from QCD in that FLASH is meant to be visualized to better understand the simulation results. There are many different types of visualization environments available for FLASH; we have chosen ParaView as a representative visualization application [20]. There are various supported file semantics for both components of this SOAR workflow. In simulation, FLASH supports the use of POSIX and MPI-IO, as well as other write-optimized file semantics, when performing checkpoint writes. Likewise, ParaView supports a multitude of read file semantics [13]. The multitude of possible file semantics makes it a challenge for a storage system to aptly support both read- and write-intensive operations such as checkpointing and data analysis.

2.2.2 USFD: Performance Related File I/O Patterns In addition to file semantics support, we recognize the difference in file I/O patterns of computation-intensive workloads and data-intensive workloads. They may be reads or writes, sequential or random, big or small, but an HPC storage system must be able to handle the varied I/O patterns these two workloads generate.

• Computation-intensive I/O: Compute-intensive file I/O is important to SOAR workflows. Without it there is no data to analyze, but it is not the focus of SOAR computing. HPC simulations focus on complex computation and high-speed message passing between processes. A compute-intensive I/O pattern is usually a very bursty, high-throughput I/O operation followed by periods of file system idleness. An example of this common I/O pattern is checkpointing. In an environment where failure is a typical occurrence, checkpointing is used to periodically save a simulation's state, thereby saving the user from losing all of the computed simulation [26]. When writing a checkpoint, all compute nodes involved in the calculation must write their state to a file system concurrently [42]. From an I/O perspective, a dormant file system suddenly has thousands of requests it must service simultaneously [38]. Hence, when a checkpoint file is written to or read from the file system, users need the data transfer to be fast and resilient [40].

• Data-intensive I/O: The focus of data-intensive computing is on continuous and highly sustained file system I/O bandwidth. There are many different forms of data-intensive I/O, such as data mining of large databases, pattern matching in large simulation data, visualization of simulation output, et cetera [20, 32, 35, 44]. Data-intensive computing follows a concept similar to Single Instruction Multiple Data computing. In a typical data-intensive workload, a large amount of file data is read in by N parallel processes, computation is performed on the data, and then the results are output to a file. Data-intensive file I/O is the dominant I/O pattern for a SOAR workload. Hence, for USFD to function appropriately, it is important to choose a file system which can support data-intensive I/O properly.

In addition, one could argue that compute-intensive systems are not very appropriate for these workload patterns, not because they are unable to perform the workload task, but because compute-intensive systems are generally designed around heavy levels of inter-node process communication (4x QDR IB networks) and high FLOPS. For an embarrassingly parallel application, the strengths of a compute-intensive system are not being utilized properly. Hence, a computer system with local disks, commodity processors, and a network of moderate latency (Ethernet / 1x DDR IB) performs well for a wide variety of data-intensive workloads.

Figure 2: Application Flow of USFD with Interoperability Layer

2.3 Straightforward Solutions By Extending Current HPC Platforms This section identifies two HPC approaches which are currently used for Data-Intensive computing. These two approaches are represented as Cases A and B from Figure 1. In this section, we extend their purpose to deal with SOAR workflows. We discuss how Cases A and B behave from a file I/O perspective by providing a runtime model of the data path for SOAR workloads. A full explanation of the models for these approaches can be found at our group website [16]. Following that, we discuss the strengths of each approach and show that the current approaches have limitations in regards to SOAR computing.

2.3.1 Extending Case A - Traditional HPC As shown in Figure 1A, the following I/O operations are performed for SOAR HPC workflows:
1. X_A: The diskless compute-intensive cluster outputs its application results to a parallel file system.
2. Y_A: The diskless cluster then uses the PFS to read simulation output for the data-intensive analysis application.
3. Z_A: The diskless compute-intensive resource writes the output of the data-analysis application to the PFS.

F_A(N) = X_A + N*Y_A + N*Z_A represents the total I/O time for a SOAR workload as the number of analysis application runs (N) on a simulation output increases. More specifically,

F_A(N) = \frac{SO}{\min(BW_{PFS-W}, BW_{NET})} + N \sum_{i=1}^{M} \frac{SO_i}{\min(BW_{PFS-R}, BW_{NET})} + N \sum_{i=1}^{M} \frac{AO_i}{\min(BW_{PFS-W}, BW_{NET})}

Data migration occurs in the last two steps. Steps (2) and (3) are repeated every time a data analysis application is launched. This is due to the fact that most output from HPC simulations will not fit into main memory and hence cannot be cached, requiring that it be fetched again from the file system. The strength of Case A is step (1); that is, parallel file systems (PFS) are well tuned for checkpoint reads and writes. However, next generation simulations generate much larger and more informationally complex datasets. When analyzing that data, it all must be read back in over a network in order to be processed. This approach is strong for traditional scientific workloads because the time to reread output once or twice is not that significant. However, SOAR workloads may perform hundreds of data analysis operations. Rereading PBs of information over a network for every data analysis operation is not a trivial amount of time for a networked parallel file system.

2.3.2 Extending Case B - Adding a Data-Intensive Resource to an HPC System In Figure 1B, we show a second existing approach with the following I/O operations:
1. X_B: The diskless compute-intensive cluster outputs its application results to a PFS.
2. Y1_B: Simulation output is copied to the local drives in a data-intensive cluster.
3. Z_B: Simulation data is read from the local drives by the analysis application.
4. Y2_B: Output of the data-intensive application is copied from the data-intensive cluster to a PFS.

F_B(N) = X_B + Y1_B + N*Z_B + N*Y2_B. More specifically,

F_B(N) = \frac{SO}{\min(BW_{PFS-W}, BW_{NET})} + \sum_{i=1}^{M} \frac{SO_i}{\min(BW_{PFS-R}, BW_{NET}, BW_{DIFS-W})} + N \sum_{i=1}^{M} \frac{SO_i}{BW_{DIFS-R}} + N \cdot \frac{AO}{\min(BW_{DIFS-R}, BW_{NET}, BW_{PFS-W})}

Similar to Case A, in this approach data is migrated between computing resources to perform analysis. However, data is not migrated as much as in approach A. There is at minimum one occurrence of step (2) for data analysis. But, unlike Case A, because the data-intensive cluster has local hard drives, all data analysis (step 3) can occur with only one iteration of step (2). In Case A, every instance of data analysis results in a repeat of A's step (2). Case B has the additional overhead of step (4). Step (4), like Case A's step (3), occurs every time data analysis is run and is unavoidable with Case B. For Case B, the data-intensive cluster is not intended to be a final resting place for data, but rather a volatile resource used for data analysis. The storage capacity of the data-intensive cluster in this approach is not as large as the PFS, and hence data from the analysis must be written back to the PFS, else the data-intensive cluster will not have disk space for other analysis programs and their output. Case B accounts for the utilization factor of compute-intensive systems by adding a data-intensive cluster into the data path. While mitigating system usage issues, this approach also removes most of the overhead of reading in massive amounts of data from the parallel file system for every data analysis operation. Unlike A, the user must only wait once for simulation data to be transferred for N rounds of data analysis. However, the drawback of Case B is that some data must still be migrated to the DIFS. Data that will be stored permanently must be moved back to the PFS because the DIFS is not apportioned to hold multiple large datasets or output results from analysis applications [19, 22]. Therefore, if a user finishes a group of tests, i.e., their requested cluster time runs out, later data analysis jobs will require the simulation output to be migrated again to the DIFS from the PFS.

2.4 USFD: Our Proposed Solution In this section, we present the flow of I/O operations performed in USFD. We then discuss the design considerations and challenges in adopting USFD for scientific workflows as well as the strength of USFD in SOAR workflows. Finally, we describe the modifications made to these applications in order to overcome the challenges presented in Section 2.3.

2.4.1 Dataflow of USFD USFD is similar to a data-intensive cluster in that it has local disks in every node of the cluster. From Figure 1:
1. X_USFD: The diskless compute-intensive cluster outputs its application results to local disks on USFD.
2. Z1_USFD: Simulation data is read from local drives by the analysis application.
3. Z2_USFD: Analysis results are written to local drives on USFD.

F_USFD(N) = X_USFD + N*Z1_USFD + N*Z2_USFD. More specifically,

F_{USFD}(N) = \frac{SO}{\min(BW_{DIFS-W}, BW_{NET})} + N \sum_{i=1}^{M} \frac{SO_i}{BW_{DIFS-R}} + N \sum_{i=1}^{M} \frac{AO_i}{BW_{DIFS-W}}

After (1), the simulation data is present on USFD and available for data analysis. Diskless compute intensive clusters connect the same way to USFD as they do to a PFS as in A and B. However networked PFSs do not have support for local data analysis like USFD does. Hence, data analysis applications can exploit the data locality provided by USFD and immediately begin working. Therefore, (2) and (3) do not result in any data migration for USFD.
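To illustrate how the three data-path models compare once concrete bandwidths are plugged in, the sketch below encodes the simplified forms of F_A, F_B and F_USFD from Sections 2.3.1, 2.3.2 and 2.4.1. All bandwidths and data sizes in main() are hypothetical placeholders chosen for illustration, not measurements from our clusters.

```cpp
#include <algorithm>
#include <cstdio>
#include <initializer_list>

// Simplified I/O-time models (seconds) for a SOAR workflow.  SO = total
// simulation output, AO = total analysis output per pass; bandwidths are
// aggregate GB/s.  All values used here are illustrative assumptions.
struct Bandwidths {
  double pfs_w, pfs_r;    // parallel file system write/read
  double difs_w, difs_r;  // data-intensive file system write/read (local disks)
  double net;             // network between compute cluster and storage
};

double f_a(int n, double so, double ao, const Bandwidths& bw) {
  // Case A: write SO to the PFS once; for each of N analyses, re-read SO over
  // the network from the PFS and write AO back to the PFS.
  return so / std::min(bw.pfs_w, bw.net)
       + n * so / std::min(bw.pfs_r, bw.net)
       + n * ao / std::min(bw.pfs_w, bw.net);
}

double f_b(int n, double so, double ao, const Bandwidths& bw) {
  // Case B: write SO to the PFS, migrate it once to the DIFS, analyze from
  // local disks N times, and copy each analysis output back to the PFS.
  return so / std::min(bw.pfs_w, bw.net)
       + so / std::min({bw.pfs_r, bw.net, bw.difs_w})
       + n * so / bw.difs_r
       + n * ao / std::min({bw.difs_r, bw.net, bw.pfs_w});
}

double f_usfd(int n, double so, double ao, const Bandwidths& bw) {
  // USFD: write SO once to the DIFS; all analysis reads and writes stay local.
  return so / std::min(bw.difs_w, bw.net)
       + n * so / bw.difs_r
       + n * ao / bw.difs_w;
}

int main() {
  const Bandwidths bw{10.0, 12.0, 4.0, 20.0, 2.0};  // hypothetical GB/s
  const double so = 1024.0, ao = 256.0;             // 1 TB output, 256 GB per analysis
  for (int n : {1, 5, 10, 20, 100})
    std::printf("N=%3d  A=%8.0fs  B=%8.0fs  USFD=%8.0fs\n",
                n, f_a(n, so, ao, bw), f_b(n, so, ao, bw), f_usfd(n, so, ao, bw));
  return 0;
}
```

Under assumptions like these, F_A grows fastest with N because every analysis pass re-reads the output over the network, which matches the qualitative argument above.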

2.4.2 Supporting File Semantics with USFD Several issues must be addressed in our proposed USFD design because it combines two I/O patterns onto a DISC system [9]. As discussed in Section 2.2, both I/O types have very different demands of their file systems. Simulations tend to have a standard set of file semantics, while analytics tend to have a more diverse set of semantics, all of which must be supported by USFD. That is, while many file semantics for simulations harken back to POSIX, many analysis applications have no similarities to POSIX. POSIX file semantics are a mature standard for computing; without at least some support for POSIX, a file system cannot be useful to HPC. Another set of file semantics important to HPC is MPI-IO and the functionality that it provides. Finally, beyond these well-known file semantics, many others exist, and USFD must be able to adapt to support new types of semantics. Table 1 lists the file system semantics of two representative SOAR workloads.

2.4.3 How USFD Solves Data Migration USFD accounts for the utilization factors of computing resources and eliminates the problem of data migration. The benefit of combining the two I/O patterns as shown in Figure 2 is that now only one type of storage resource exists to service an HPC center. USFD's interoperability layer allows multiple different I/O patterns and file I/O semantics to be serviced by one underlying storage system (in this case a DIFS). That is, compute-intensive clusters still see a mount point for checkpointing, and their requests are still serviced in the same manner as if a parallel file system were still attached. However, the difference now is that once the data is stored it can be analyzed in situ, removing unnecessary data movement. Data migration will occur for storage approaches which do not use our unified storage framework. For some HPC workloads, data migration occurs infrequently and does not result in much overhead. However, a Simulate Once Analyze Repeatedly environment only exacerbates the problem of data migration. When repeatedly running analysis applications which require frequent access to disk, any other approach which uses a network-attached storage system will experience some form of data migration. While analysis time is constant as N increases, the excess data migration (a sort of data pre-loading) is large. While there are ways to improve network latency and bandwidth, providing sustained bandwidth comparable to that of a DIFS (which instead utilizes the aggregate bandwidth of local disks) is impractical and costly.

2.4.4 Modifying Applications for USFD We chose two separate workloads to evaluate USFD, Quantum Chromodynamics (QCD) and FLASH. Both of these workflows are highly relevant in today's scientific high performance computing. QCD is a particle and nuclear physics simulation with a purely data processing/data analysis component (ADAT). FLASH is a high-energy density physics simulation code which uses visualization for its data analysis component. In order to have these workloads function on the file system we chose for USFD, the Hadoop Distributed File System (HDFS) [31], some modifications were needed. In particular, HDFS does not have support for MPI/MPI-IO, which required writing new code for both QCD and FLASH. The analysis applications (ParaView and ADAT) had to be modified as well, for similar reasons. Coding file semantic support for HDFS resulted in 36k lines of new code between QIO, FLASH and ParaView [39]. While we did not make HDFS POSIX or MPI-IO compliant for general application use, we implemented specific code for our chosen applications. This code helps form the base of a more generalized interoperability layer for USFD.

2.4.5 The QCD Suite We used the Lattice Quantum Chromodynamics (QCD) suite provided by USQCD [23], which provides support for QCD from simulation to data analysis, for our validation application. From this suite of codes, we use QIO and ADAT. When combined, these two applications form a complete scientific workflow of a computation-intensive simulation and a data-intensive analysis application.

• Modifying QIO: The I/O Driver: QIO is the I/O kernel for the Chroma suite provided by USQCD. QIO provides a test suite, qio-test(1-8), which consists of several different types of I/O patterns. We have used qio-test5. This test performs a sequential N-1 file access pattern. Qio-test5's write access pattern averages 262,144 bytes per I/O operation, with a phase of 8-byte writes. Because HDFS does not have direct support for common file semantics, such as POSIX, an I/O driver must exist to properly interface the application with a non-POSIX DIFS. In order to have QIO semantics support for HDFS, we utilized the overloader functions provided by the c-lime library in QIO to write our own HDFS I/O driver. QIO calls its write functions (i.e., QIO_write()) and our I/O driver embedded in c-lime inserts POSIX-compliant HDFS commands. In this driver, we utilize the libhdfs C bindings to support open, close, read, write, seek and tell [7].
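As a rough illustration of what such a driver layer looks like, the sketch below wraps a few libhdfs calls behind POSIX-style helpers. This is a minimal sketch, not the actual QIO/c-lime driver: the usfd_* wrapper names are ours, and error handling is reduced to return-code checks.

```cpp
#include <fcntl.h>   // O_WRONLY
#include <hdfs.h>    // libhdfs C API shipped with Hadoop

// Minimal POSIX-style wrappers over libhdfs, in the spirit of the QIO driver
// described above.  The usfd_* names are illustrative, not part of QIO.
struct UsfdFile {
  hdfsFS fs;
  hdfsFile file;
};

bool usfd_open_for_write(UsfdFile& f, const char* path) {
  f.fs = hdfsConnect("default", 0);   // use the configured default namenode
  if (!f.fs) return false;
  // Zeros request the default buffer size, replication factor, and block size.
  f.file = hdfsOpenFile(f.fs, path, O_WRONLY, 0, 0, 0);
  return f.file != nullptr;
}

long usfd_write(UsfdFile& f, const void* buf, int len) {
  return hdfsWrite(f.fs, f.file, buf, len);   // bytes written, or -1 on error
}

long usfd_tell(UsfdFile& f) {
  return hdfsTell(f.fs, f.file);              // current offset in the file
}

// hdfsSeek is only valid on files opened for reading; a driver that must
// "seek on write" (as QIO does) has to emulate it, e.g. by buffering.
bool usfd_close(UsfdFile& f) {
  int rc = hdfsCloseFile(f.fs, f.file);
  hdfsDisconnect(f.fs);
  return rc == 0;
}
```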

It is important to note that these bindings are currently somewhat limited in their abilities. For example, in HDFS, seek is only properly supported when a file is opened in read-only mode; also, files cannot be appended to once closed. However, our I/O driver allows for seek on write, hence this limitation of HDFS does not affect our testing with QIO. This modified seek-on-write file support incurs large time penalties in HDFS. There are means to mitigate this overhead, but for these tests we wish to show that even with large performance penalties, USFD's other strengths compensate for the longer write times. Ultimately, we must incorporate I/O optimizations in future USFD work.

• ADAT and the ADAT Kernel: ADAT is the analysis suite for the output of the QCD simulation code and analyzes its XML files (in the form of key, value pairs). It provides a variety of analysis operations, but we only use the hadron spec strip (HSS) analysis for our data-intensive workload. HSS analyzes multiple XML files and generates a set of output files. The output data is approximately 1/4th of the input data for ADAT. HSS is a serial code written in C++, hence we implemented a parallel ADAT code (parallelizing the code was a simple for-loop unravelling). However, in order to better utilize our proposed system, this analytics application is ported to a data processing abstraction that can run on HDFS. We implemented an ADAT I/O kernel for HDFS using MapReduce; it reads/writes the same data pattern and size as the hadron spec strip analytics code. We split the input files among multiple map tasks such that each map task generates the identical set of output files for its given input file. The reduce phase reads all the intermediate output files from the map tasks and combines all the output files with the same properties. We discuss the performance of the ADAT I/O kernel in the evaluation section.
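For reference, the sketch below is a single-process, conceptual emulation of the map and reduce decomposition just described; the type and function names are placeholders, not the actual Hadoop kernel's API.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Conceptual emulation of the ADAT I/O kernel's MapReduce decomposition.
using Property = std::string;   // e.g. one Wilson hadron measurement type
using Record   = std::string;   // one measurement serialized as text

// "Map": parse one XML time-step file and emit (property, record) pairs,
// one group per output file the serial hadron spec strip code would produce.
std::vector<std::pair<Property, Record>> map_one_file(const std::string& /*xml_path*/) {
  // Placeholder body: the real map task parses the XML input here.
  return {};
}

// "Reduce": merge records with the same property, regardless of which map
// task (input file) they came from, into one combined output per property.
std::map<Property, std::vector<Record>>
reduce_all(const std::vector<std::pair<Property, Record>>& emitted) {
  std::map<Property, std::vector<Record>> combined;
  for (const auto& kv : emitted) combined[kv.first].push_back(kv.second);
  return combined;
}
```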

2.4.6 FLASH and ParaView We developed a trace replay of a 16-node FLASH I/O run using trace data made available by Los Alamos National Laboratory (LANL) [10]. The FLASH trace is an N-1 strided pattern, more complex than the qio-test5 application, containing many small writes. We implemented MPI-IO optimizations for HDFS which sidestep the seek-on-write problem when replaying this checkpoint dump. We apply these MPI optimizations for the PFS that we run on as well. The focus of this test is to show that, with some very basic effort on the part of a DIFS, the large checkpoint overheads which we claim will occur with a USFD approach become very minimal. Hence, we wrote two equivalent replay codes, one with PVFS2 calls and one with HDFS calls. To visualize the FLASH output data, the ParaView application was used [20]. ParaView's file I/O routines do not natively support access to HDFS. In a previous work by our research group, we wrote I/O methods for HDFS to support ParaView file semantics [39]. The data we used in this analysis was also obtained from LANL [15]. This test emphasizes the strength of USFD over other approaches for data analysis. ParaView is an interactive program which constantly demands the attention of the file system. Our previous work shows this sort of workload over a conventional PFS is impractical. The thesis of VisIO was that the parallel file system cannot properly support real-time visualization because the networks it relies on are too slow.
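For reference, the sketch below shows the general shape of an N-1 strided checkpoint write with MPI-IO, similar in spirit to the replay codes described earlier in this subsection. The file name, block size, and stride count are illustrative; the actual replay follows the LANL FLASH trace rather than a synthetic pattern.

```cpp
#include <mpi.h>
#include <vector>

// Minimal N-1 strided checkpoint write: every rank writes its block into one
// shared file at rank-dependent offsets, using a collective MPI-IO call.
int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0, nprocs = 1;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  const int block = 256 * 1024;                  // 256 KB per rank per stride (illustrative)
  const int strides = 4;                         // number of strided write phases
  std::vector<char> buf(block, static_cast<char>(rank));

  MPI_File fh;
  MPI_File_open(MPI_COMM_WORLD, "checkpoint.chk",
                MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

  for (int s = 0; s < strides; ++s) {
    // Offsets interleave ranks within each stride: |r0|r1|...|rN-1| repeated.
    MPI_Offset off = static_cast<MPI_Offset>(s) * nprocs * block
                   + static_cast<MPI_Offset>(rank) * block;
    MPI_File_write_at_all(fh, off, buf.data(), block, MPI_CHAR, MPI_STATUS_IGNORE);
  }

  MPI_File_close(&fh);
  MPI_Finalize();
  return 0;
}
```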

Table 2: Cluster Configurations

UCF CASS Testing: 45 nodes, 12 TB disk space
  15x Dell PowerEdge 1950
    CPU: 2x Intel Xeon 5140, dual core, 2.33 GHz
    RAM: 4.0 GB DDR2, PC2-5300, 667 MHz
    Internal HD: 2x SATA 500 GB (7200 RPM) or 2x SAS 147 GB (15K RPM)
    Network: Intel Pro/1000 NIC
    OS: Rocks 5.4 (CentOS 5.5), kernel 2.6.18-194.17.4.el5
  30x Sun V20z
    CPU: 2x AMD Opteron 242 @ 1.6 GHz
    RAM: 2 GB registered DDR1/333 SDRAM
    Internal HD: 1x 146 GB Ultra320 SCSI HD
    Network: 1x 10/100/1000 Ethernet connection
    OS: Rocks 5.4 (CentOS 5.5), kernel 2.6.18-194.17.4.el5

LANL DISC Testing: 60 nodes, 120 TB disk space
  60x Intel EM64T nodes
    CPU: 2x AMD Opteron 64-bit CPUs
    RAM: 6 GB
    Internal HD: 2x 1 TB SATA HDs
    Network: Gigabit Ethernet connection to each node
    OS: Fedora 10

VisIO reinforces our claim that, to properly support Simulate Once Analyze Repeatedly workloads, a data-intensive file system must be used for our unified storage framework.

3. EVALUATION We designed the experiments in this work to examine the workflow execution time using both existing approaches and our unified approach, USFD. As previously mentioned, the test applications are QCD and FLASH. We present the results by Simulate Once Analyze Repeatedly (SOAR) scientific workflow, then by the test cluster the workflow was run on. We compare the I/O time of running these SOAR workflows with PVFS2 as our representative parallel file system and HDFS as our representative data-intensive file system (DIFS). In these results, for both traditional cases and USFD, we use the same computing resources. That is, one third of the cluster was utilized for our diskless compute nodes, one third behaved as our network-attached parallel file system, and the last third was used as our data-intensive environment. Our test clusters are shown in Table 2. The combination of QIO and ADAT comprises our first SOAR scientific workflow. Our second SOAR workflow comprises a 16-node FLASH checkpoint replay with ParaView (v3.8.0) as the analysis/visualization application. We used version 0.20.2 of the Hadoop distributed file system and PVFS2 version 2.8.2. We used MPICH 1.2.1 for running our two simulation applications, QCD and FLASH, as well as for one of our analysis applications, ParaView. We used the current version of the QCD suite, utilizing QIO specifically as our simulation code and ADAT as analysis.

3.1 Quantum Chromodynamics Testing In this section we describe the experiments and results using QCD as a test application on our local CASS cluster and LANL cluster. The testing includes comparing I/O time for data migration using all three approaches.

3.1.1 CASS Testing For this SOAR workflow, we run QCD for simulation and ADAT for analysis. We measured 1) simulation I/O time, 2) data migration I/O time, and 3) data analysis I/O time of the workflow. The results are shown in Figure 3a.

Figure 3: (a) CASS 1 TB QCD; (b) CASS QCD breakdown, 1 time step; (c) CASS analysis writes, 1 time step; (d) LANL QCD testing; (e) CASS 1 TB FLASH; (f) CASS FLASH breakdown, 1 time step; (g) CASS FLASH validation. From Sections 2.3.1, 2.3.2 and 2.4.1, the equations for the file I/O times of each approach in 3a, 3d, 3e and 3g are expressed as F_A(N) = X_A + N*Y_A + N*Z_A; F_B(N) = X_B + Y1_B + N*Z_B + N*Y2_B; F_USFD(N) = X_USFD + N*Z1_USFD + N*Z2_USFD.

For Figure 3a we test with qio-test5 generating 1 TB of data. In this test we extend the number of data analysis runs (N) out to N=20. We justify this number as the lower end of the spectrum of analysis runs for a QCD workflow, based on our research of the application usage [14, 34, 45]. Within the first two iterations of data analysis (N), Case B emerges as the fastest performing approach. After 10 iterations of data analysis, USFD becomes the best performing case for this SOAR workflow. USFD performs better than Case B as data analysis increases because, unlike Case B, USFD has no data migration overhead. There is a difference in the rate at which USFD improves over the two traditional cases. From Figure 3a, the lines representing I/O performance for Cases A and B have a steep positive slope.

Despite its higher initial I/O time, USFD's overall I/O performance for this SOAR workload intersects with Case A's as N continues to grow. We relate this to the equation below for when F_A(N) = F_USFD(N):

\frac{1}{BW_{PFS-W}} + \frac{N}{BW_{NET}} = \frac{1}{BW_{DIFS-W}} + \frac{N}{BW_{DIFS-R}}

This equation shows that Case A is equal to USFD in performance when the time to write the simulation data to a PFS plus the time it takes to read that simulation data across a network for N analysis phases is equal to the time to write simulation data to the DIFS plus the time it takes to read the simulation output in a DIFS for N analysis phases. (A complete solution is available at [16].) The graph clearly shows that after five analysis operations, USFD performs better than Case A.
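Solving that simplified crossover condition for N gives the break-even number of analysis passes between Case A and USFD (per unit of simulation output, using the same simplification as in the text):

```latex
\frac{1}{BW_{PFS-W}} + \frac{N}{BW_{NET}}
  = \frac{1}{BW_{DIFS-W}} + \frac{N}{BW_{DIFS-R}}
\;\Longrightarrow\;
N^{*} = \frac{1/BW_{DIFS-W} \;-\; 1/BW_{PFS-W}}
             {1/BW_{NET} \;-\; 1/BW_{DIFS-R}}
```

The numerator is positive because DIFS simulation writes are slower than PFS writes, and the denominator is positive because reads over the storage network are slower than local DIFS reads, so USFD overtakes Case A once N exceeds N*; in Figure 3a this happens after roughly five analysis operations.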

We show below that the time for a DIFS to write simulation data is much larger than the time for a PFS to write simulation data. We determine that 1) the time to migrate data for Case A is much larger than the time to read simulation output for analysis with USFD; and 2) the time to write simulation data to a DIFS is not so large that it dwarfs the time savings of using a DIFS for data analysis. Therefore, USFD improves over Case A as a function of data migration overhead and data analysis overhead. Since it is the nature of a SOAR workflow to have a large value of N, it follows that USFD is better suited for this type of workload as long as data migration and analysis are a bottleneck for Case A.

Additionally, we examine the crossover point of Case B and USFD for the QCD SOAR workload. Unlike Case A, which used a PFS for data analysis, Case B uses a DIFS for data analysis just as USFD does. The difference in I/O time between the two cases is that, unlike USFD, Case B still uses a parallel file system for simulation writes, which gives rise to data migration when performing analysis for the SOAR workload. The equation below shows the relation for F_B(N) = F_USFD(N):

\frac{1}{BW_{PFS-W}} + \frac{N}{BW_{NET}} = \frac{1}{BW_{DIFS-W}}

Again, this is a drastic simplification; the full relation can be found at [16]. The I/O performance of Case B is equal to USFD's when the time to write simulation data to a PFS plus the time to migrate it to a DIFS N times is equal to the time to write simulation data to a DIFS. After ten analysis operations, USFD performs better than Case B for file I/O. As shown in the graph, once USFD begins to perform better than Case B, the two lines diverge significantly. We can determine from this that the overhead that Case B experiences with data migration over a SOAR workflow is larger than the overhead that USFD experiences with its longer simulation write times when using a DIFS. These results reinforce that the time to migrate data across a network is a large overhead.

Figure 3b further reinforces the equations presented with Figure 3a. As a scientific workflow becomes a SOAR workflow, the initial performance gains of the traditional approaches are dwarfed by the overhead they introduce for repeated data-intensive analysis. In Figure 3b, we analyze a single 3 GB time step from the 1 TB QCD workflow. We observe that for one time step's worth of the SOAR workflow, approach A is the fastest in terms of I/O speed, with Case B second and USFD the slowest. Writing the output of qio-test5 with HDFS takes almost twice as long as with PVFS2. However, we see that the time it takes for Case A to perform analysis is three times as long as for Cases B or USFD. Hence, as the number of analysis runs (N) increases, Case A will result in a larger overall I/O time as compared to Case B and USFD. Likewise, while Case B initially performs better than USFD for a single time step of the SOAR workflow, it too has a performance penalty which USFD does not share. Unlike USFD, Case B must migrate data from the parallel file system to a DIFS for the analysis. This overhead is large enough to make USFD, which has slower I/O performance for simulation writes and equal performance for analysis, faster for the QCD SOAR workflow. We further break down Figure 3b into the following items, simulation performance, data migration, and analysis, to better understand the I/O times of the QCD SOAR workflow.

Copy data from PVFS2 to HDFS
  PVFS2 Fuse to HDFS Fuse: 347.66667 seconds
  PVFS2 to HDFS Fuse:       83.0 seconds
  HDFS to PVFS2 Fuse:       71.0 seconds

Copy data from HDFS to PVFS2
  HDFS Fuse to PVFS2 Fuse: 258.42133 seconds
  PVFS2 to HDFS Fuse:      244.032 seconds
  HDFS to PVFS2 Fuse:      221.776 seconds

Table 3: Migrating one time step of simulation output with a variety of mechanisms: the HDFS and PVFS2 fuse drivers, PVFS2 shell commands, and HDFS shell commands.

• X_{A,B,USFD}: Simulation Write Times with Qio-test5: Figure 3c shows that write performance with USFD suffers when writing qio-test5 simulation output. The purpose of Figure 3c is to show that there is room to improve the I/O performance of USFD. The first set of tests compares the write time of the MPI QIO code on PVFS2 and on HDFS. The data set generated in this initial test was one 3 GB time step of the total 1 TB qio-test5 output. In qio-test5, the write pattern is N-1 segmented, with the bulk of the writes being 256 KB in size. Qio-test5 has four write phases per time step, only two of which were significant; their I/O times are presented in Figure 3c. More importantly, "real" and "su3" are similar in the amount of data generated; however, HDFS takes much longer to write "su3" than "real". This shows very plainly the difficulty HDFS has in supporting checkpoints with small I/O, because in the "su3" write phase, qio-test5 is writing many 8-byte variables. Figure 3c shows that, for the write pattern of qio-test5, the overhead associated with such small writes to HDFS roughly doubles its I/O time when compared to PVFS2. We attribute these poor performance times to the large chunk size of HDFS blocks (64 MB) and the replication factor (3) of HDFS. This illustrates the need for a data-intensive file system (DIFS) to support both compute- and data-intensive workloads if it is to function as a unified storage resource. The results below will illustrate that even though this particular DIFS has a drawback in relation to compute-intensive I/O, USFD makes up the performance penalty by exploiting its framework, removing the unavoidable data migration overhead associated with approaches A and B.

• Y_{A,B}: Data Migration: We test with multiple means of migrating data to and from PVFS2 and HDFS in order to find the quickest method of moving data between file systems. The methods that we used in our testing are what a user would expect to find available; no new copy tools were created for this testing. We start with moving a single time step of the qio-test5 output between resources to get a baseline I/O time. As presented in Table 3, out of the available copy tools, using HDFS shell commands with a PVFS2 fuse instance is the fastest method to use when migrating data between the PFS and the DIFS. The results presented in Table 3 reflect the migration overhead seen in Figure 3b for Case B. In Case A, migration overhead becomes apparent later, when examining its analysis performance. We will discuss these timings next when analyzing Case A's data analysis performance. Again, there is no data migration with USFD.

                       ADAT      ADAT I/O Kernel
Open time (seconds)     2.025      0.04
Read time (seconds)   135.68      50.708
Write time (seconds)    1.6875     8.738

Table 4: Comparing ADAT analysis time and MapReduce ADAT for one time step (3 GB) of the QCD dataset.

• Z_{A,B,USFD}: Analysis With ADAT: The output of the MPI QCD program appears in the form of key, value pairs written to XML files. Because ADAT uses a different I/O stack than QIO, we chose not to implement another POSIX-to-HDFS I/O driver for this work. Instead, we wrote an ADAT I/O kernel in MapReduce to run one (hadron spec strip) of the many analysis applications in ADAT. Hadron spec strip (like the other provided strip applications) reads each XML input file into a single nested structure sequentially and creates multiple output files after analyzing different Wilson hadron measurements (such as forward propagation headers, forward propagation correlations, various currents, etc.). Running ADAT on a single 3 GB time step from qio-test5 generates approximately 760 MB of output. Essentially, the simulation output and analysis output scale at the same rate, where the analysis output will be equal to approximately 25 percent of the simulation input for an ADAT application run. Table 4 shows the file open, read, and write time for the ADAT (C++) application and the ADAT I/O kernel with 3 GB input files. Table 4 reveals the implicit data migration overhead for Case A (which we briefly mentioned in the discussion of Table 3). In Table 4, Case A's ADAT read takes 43 times longer than for Cases B and USFD, where the data is local to the nodes. The reason why analysis takes so much longer for Case A than with Case B or USFD is the network transfers taking place while reading in the dataset for analysis. Case A's average bandwidth of 1.41 MB/s, as opposed to the 61 MB/s bandwidth of Cases B and USFD, clearly illustrates the benefit of in-place data analysis instead of analysis over a network.

3.1.2 Los Alamos National Laboratory Testing The purpose of this testing was to experiment with a larger dataset for our QCD SOAR workflow. Figure 3d is compiled from the results of multiple micro-benchmarks run on the LANL DISC cluster to validate our model. As seen in Figure 3d, the time spent moving data across a network increases linearly with the data for multiple runs of the same analysis application. We chose the value of N to be 100 for this graph to represent running all of the analysis applications available in ADAT twice. Again, from our research of QCD and how it is analyzed, it is not uncommon for 40 to 50 different analysis applications to be run on one simulation output. Hence, if two people were to run the same analyses of a single QCD simulation output, N's value would be 100. With this large dataset (64 TB), Case A is only appropriate if the data is going to be analyzed very few times, after which Case B is the better approach to use. For QCD datasets which are of TB scale, approximately 80 data analysis operations need to occur for USFD to equal Case B's I/O time. As more data analysis is performed, our approach steadily improves over the best performing traditional case, Case B, reaching a 7.5x improvement by the time N=100. As shown in earlier results, the HDFS checkpoint write performance is low.

If the output of the analysis application increases, the bulk of the I/O workload shifts further towards analysis. Hence, the number of data analysis operations (N) needed for USFD's total I/O time to be equivalent to Case A's or B's decreases linearly. This is because a greater percentage of the workload is now focused on analysis, the strength of a DIFS.

3.2 FLASH Testing With CASS We run FLASH for simulation and ParaView for analysis in this workflow. We measured 1) simulation I/O time, 2) data migration I/O time, and 3) data analysis I/O time of the workflow. The results are shown in Figure 3e, which highlights the impact of data migration on SOAR workflows. The more analysis operations a user executes on a simulation output, the longer (over the lifetime of the data set) the user has had to wait for analysis results. We have found in cosmology work that generally an average of 10 to 15 visualizations are performed on a single FLASH output [27]. Hence, we initially ran 11 analysis runs to be within the range of our findings for this SOAR workflow. After only 10 data analysis operations, Case A has already spent 5 times longer on I/O than USFD, while Case B has spent almost three times as long as USFD. Because the time to perform data migration is immutable for A and B, the performance gap between A and B and USFD will only continue to grow. Again, we reference our equations for F_A(N) = F_USFD(N) and F_B(N) = F_USFD(N):

\frac{1}{BW_{PFS-W}} + \frac{N}{BW_{NET}} = \frac{1}{BW_{DIFS-W}} + \frac{N}{BW_{DIFS-R}}   (1)

\frac{1}{BW_{PFS-W}} + \frac{N}{BW_{NET}} = \frac{1}{BW_{DIFS-W}}   (2)

Figure 3e shows that, for this SOAR workflow, USFD performs better than Cases A and B after only one round of data analysis (N). From this figure, we can relate back to our equations. For Case A, we see that the time to write simulation data to PVFS2 plus the time to perform data analysis with ParaView over the network only once takes longer than the time USFD takes to write the FLASH replay data to HDFS plus the time to run one analysis operation with ParaView on local disks. For Case B, we see that the time to write FLASH replay data to PVFS2 and migrate it once to HDFS takes longer than HDFS takes to write out the FLASH replay data.

Figure 3f shows the I/O time for one 1 GB time step of the full 1 TB simulation output of FLASH with ParaView. These results show that, on average, USFD actually performs better than Cases A and B for one run of simulation and only one run of data analysis. From the shaded regions, we see that PVFS2 with MPI-IO writes the FLASH checkpoint only marginally faster than HDFS with MPI-IO semantics. However, performing data analysis using a DIFS as opposed to a parallel file system takes 1/16th of the time. Our approach therefore performs faster than A and B even for a single run of simulation and analysis. That is, for Case A, while performing the checkpoint write is faster with PVFS2 than HDFS, the time to perform analysis using ParaView and PVFS2 is much slower than using HDFS. Case B is faster than Case A because it uses PVFS2 for the checkpoint write and then HDFS for analysis; however, it is not faster than USFD. This is because of the explicit data migration that must take place for Case B to use both a PFS and a DIFS. The time savings that Case B gained by using a PFS and a DIFS was overshadowed by the time it took to move data onto and off of the DIFS.

3.2.1 Scalability of N In order to show the scalability of N, we conducted a series of experiments to validate the equations presented in Sections 2.3.1, 2.3.2, and 2.4.1 for the FLASH SOAR workflow. These results are presented in Figure 3g. As our testbed is not very large, we wish to show that the equations presented in those sections are valid regardless of system or value of N. For this test, we chose N=100 to express a limit of possible analysis for one FLASH dataset, based on our experience with the workflow. From the graph, we see that our modeled equations for file I/O time deviate by only 5% or less from our experimental results for all approaches presented in Figure 1. Figure 3g gives us confidence that, while we do not have a large-scale system to run USFD on, its file I/O time for a SOAR workload can be accurately modeled for any system by using our equations.

4. RELATED WORK

To fully support a SOAR scientific workflow, a framework is needed which addresses the issues of data migration across storage networks, multiple different application file semantics, and support of varied I/O patterns. There are many lines of research in the HPC community converging upon the concept of SOAR workflows; these ideas, when combined, coincide with the concepts presented in the USFD framework. There are projects that try to address the immediate concerns of data migration, projects seeking to combine existing parallel file systems with large-scale data processing abstractions like MapReduce, and interoperability layers which focus on supporting a multitude of different file semantics and I/O patterns.

There is an existing approach that, like USFD, argues that a data-intensive cluster should be used to perform analysis of HPC data, reducing data migration. Zazen [47] uses an analysis cluster in a manner similar to Case B in Section 2.3.2. Zazen caches a copy of simulation output files on the local disks of an analysis cluster and uses a novel task-assignment protocol to co-locate data access with computation. Zazen is similar to our USFD concept for analysis; however, it does not consider the entire SOAR workflow. Zazen resolves the issue of immediate data migration, but it must inevitably write simulation and analysis data to a parallel file system for storage. In doing so, future analysis operations, which occur in a SOAR workflow, will experience data migration penalties when Zazen performs analysis.

Other research provides functionality to conventional parallel file systems that allows them to support different types of programming frameworks, like MapReduce. Most notable are the PVFS2 shim layer and the GPFS project [8, 29]. These two projects decouple the MapReduce framework from a distributed file system and map the data locality relations to regions in a parallel file system. Both of these approaches put a data-locality-dependent programming framework onto a data-independent parallel file system, whereas USFD seeks to take data-independent programming frameworks and get them to exploit the data locality of an underlying distributed file system. USFD removes data migration by allowing for in-situ data analysis from local disk, whereas the PVFS2 and GPFS MapReduce methods generally rely on a diskless compute cluster to perform data-intensive analysis jobs across a network. These projects seem orthogonal to our intent for USFD and SOAR workflows. However, mechanics aside, they show that there is a strong interest from the HPC community in using new, data-intensive programming interfaces for analysis of HPC data.

Finally, there are existing works which function like our proposed interoperability layer for USFD, most notably ADIO and ADIOS [3, 46]. ADIO is designed to work solely with MPI-IO, focusing on interacting with various underlying file systems, and ADIOS is a more general layer, focusing on file semantics. When the core ideas of these two research works are combined, the result is USFD's interoperability layer. USFD's interoperability layer focuses on file semantics and I/O methods to support SOAR workloads across any file system that exposes data placement information to the application layer. All of these related works share a common component with USFD in support of SOAR workflows. This trend gives us confidence that the HPC community is ultimately heading towards a USFD-like framework to support next generation HPC workloads. These trends show us that SOAR workflows are a new and interesting HPC problem, which will become even more prevalent as the size and scale of scientific simulation and instrumentation increases.

5. CONCLUSION AND FUTURE WORK HPC has proven through its long history that it knows how to do computation. The greater challenge now is how to do data, specifically data analysis. In this paper we have proposed a framework to utilize a data-intensive storage cluster for a new type of HPC scientific workflow, Simulate Once Analyze Repeatedly (SOAR) computing. The motivation for our Unified Storage Framework to support SOAR workloads, USFD, is to remove the time consuming data migration between compute and storage resources, which occurs frequently with SOAR computing. We achieve this by providing file semantic support which enables HPC compute intensive applications to write data to a unified data intensive file system. Once written, the data is analyzed in-place, removing data migration. We present two scientific workflows, the QCD Lattice application suite and FLASH with ParaView, which perform both compute and data intensive operations. In tests we have observed that, as the amount of analysis increases for a scientific workflow, the time to migrate data between resources becomes inordinately large. That is, as a scientific workflow becomes a SOAR workload, traditional HPC approaches garner considerable overhead due to data migration between resources. Our results show that SOAR scientific workflows utilizing USFD complete analysis at a 7.5x performance increase over traditional approaches with QCD and 4x performance increases with FLASH as data analysis is repeated. Furthermore, this performance is a linear function of the time it takes to migrate data between resources in traditional approaches. As shown in our results, data intensive file systems generally do not outperform traditional parallel file systems for compute-intensive I/O. Parallel file systems were designed around providing functionality to compute-intensive workloads. However, data intensive file systems were designed around servicing I/O to data-intensive applications, and provide better functionality than PFSs in this regard. For the

As shown in our results, data-intensive file systems generally do not outperform traditional parallel file systems for compute-intensive I/O. Parallel file systems were designed to provide functionality to compute-intensive workloads, whereas data-intensive file systems were designed to service I/O for data-intensive applications and provide better functionality than PFSs in that regard. For the types of scientific workloads presented here, the analysis portion of the workflow dominates file I/O time. Our proposed system, USFD, is designed to support SOAR computing by providing an interoperability layer for compute-intensive file I/O while keeping its focus on data-intensive file I/O.

It is important for the file system to provide high availability for checkpointing without hindering the progress of other data-intensive analytics that may be running. In short, USFD must provide storage-system availability to both compute- and data-intensive applications, while preserving the fairness and reliability that both application types currently experience. In this work, however, we focus on the problem of data migration and the challenges of a data-intensive file system (DIFS) servicing both compute- and data-intensive file semantics. A scheduler to guarantee fairness and availability for USFD is left to future work.

6. REFERENCES

[1] Kryder's law. scientificamerican.com/article.cfm?id=kryders-law.
[2] Super Computing 2008 Exascale Workshop. http://www.lbl.gov/CS/html/SC08ExascalePowerWorkshop/index.html.
[3] http://adiosapi.org/index.php5?title=Publications.
[4] http://cmulargescalelunch.kyloo.net/files/hdfspvfs-pdlslides.ppt.
[5] http://exascaleresearch.labworks.org/ascr2011/index/materials.
[6] http://flash.uchicago.edu.
[7] http://hadoop.apache.org/common/docs/current/libhdfs.html.
[8] http://institute.lanl.gov/isti/irhpit/projects/hdfspvfs.pdf.
[9] http://institutes.lanl.gov/isti/disc/.
[10] http://institutes.lanl.gov/plfs/maps/.
[11] http://iopscience.iop.org/1742-6596/180/1/012019/pdf/1742-6596_180_1_012019.pdf.
[12] http://lanl.gov/roadrunner/rropenscience.shtml.
[13] http://paraview.org/Wiki/ParaView/Users_Guide/List_of_readers.
[14] http://super.bu.edu/~brower/workshop/edwards_bos06.ppt.
[15] http://t8web.lanl.gov/people/heitmann/arxiv/data.html.
[16] http://www.eecs.ucf.edu/~gmackey/papers/USFD-report.pdf.
[17] http://www.hpcuserforum.com/presentations/Tucson/SUN%20%20Lustre_Update-080615.pdf.
[18] http://www.lustre.org/.
[19] http://www.nersc.gov/nusers/systems/.
[20] http://www.paraview.org.
[21] http://www.scidacreview.org/0902/html/ultravis.html.
[22] http://www.tacc.utexas.edu/resources/.
[23] http://www.usqcd.org/usqcd-software/.
[24] Roadrunner. http://top500.org/system/9707.
[25] www.cercs.gatech.edu/iucrc06/material/klasky.ppt.
[26] Fault Tolerance: Principles and Practice. Prentice-Hall International, 1981.

[27] J. Ahrens, K. Heitmann, M. Petersen, J. Woodring, S. Williams, P. Fasel, C. Ahrens, Chung-Hsing Hsu, and B. Geveci. Verifying scientific simulations via comparative and quantitative visualization. Computer Graphics and Applications, IEEE, 30(6):16-28, 2010.
[28] James Ahrens. Personal correspondence at Los Alamos National Lab.
[29] Rajagopal Ananthanarayanan, Karan Gupta, Prashant Pandey, Himabindu Pucha, Prasenjit Sarkar, Mansi Shah, and Renu Tewari. Cloud analytics: Do we really need to reinvent the storage stack? In HotCloud '09: Workshop on Hot Topics in Cloud Computing, in conjunction with the 2009 USENIX Annual Technical Conference, 2009.
[30] K. D. Born, E. Laermann, T. F. Walsh, and P. M. Zerwas. Spin dependence of the heavy-quark potential: a QCD lattice analysis. Physics Letters B, 329(2-3):332-337, 1994.
[31] Dhruba Borthakur. The Hadoop Distributed File System: Architecture and Design.
[32] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. Bigtable: A distributed storage system for structured data. In OSDI '06: Proceedings of the 7th Symposium on Operating Systems Design and Implementation, pages 205-218, Berkeley, CA, USA, 2006. USENIX Association.
[33] Jason Cope, Theron Voran, Matthew Woitaszek, Adam Boggs, Michael Oberg, and Henry M. Tufo. Experiences deploying a 10 gigabit ethernet computing environment to support regional computational science, 2007.
[34] Jozef J. Dudek, Robert G. Edwards, Michael J. Peardon, David G. Richards, and Christopher E. Thomas. Toward the excited meson spectrum of dynamical QCD. April 2010.
[35] Bin Fu, Kai Ren, Julio Lopez, Eugene Fink, and Garth Gibson. DiscFinder: A data-intensive scalable cluster finder for astrophysics. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, HPDC '10, pages 348-351, New York, NY, USA, 2010. ACM.
[36] G. Grider, H. Chen, J. Nunez, S. Poole, R. Wacha, P. Fields, R. Martinez, P. Martinez, S. Khalsa, A. Matthews, and G. Gibson. PaScal - a new parallel and scalable server IO networking infrastructure for supporting global storage/file systems in large-size Linux clusters. In Performance, Computing, and Communications Conference, 2006. IPCCC 2006. 25th IEEE International, pages 10 pp.-340, April 2006.
[37] Tony Hey, Stewart Tansley, and Kristin Tolle, editors. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, Redmond, Washington, 2009.
[38] Michael P. Kasick, Jiaqi Tan, Rajeev Gandhi, and Priya Narasimhan. Black-box problem diagnosis in parallel file systems. In 8th USENIX Conference on File and Storage Technologies (FAST), pages 43-56. USENIX Association, 2010.
[39] Christopher Mitchell, James Ahrens, and Jun Wang. VisIO: Enabling interactive visualization of ultra-scale, time series data via high-bandwidth distributed I/O systems. In IEEE International Parallel and Distributed Processing Symposium (to appear), May 2011.
[40] A. Moody. The Scalable Checkpoint/Restart (SCR) library: Approaching file I/O bandwidth of 1 TB/s. In TeraGrid Fault Tolerance Workshop, 2009.
[41] G. E. Moore. Cramming more components onto integrated circuits. Electronics, 38(8):114-117, April 1965.
[42] Sriram Sankaran, Jeffrey M. Squyres, Brian Barrett, Vishal Sahay, and Andrew Lumsdaine. The LAM/MPI checkpoint/restart framework: System-initiated checkpointing. International Journal of High Performance Computing Applications, 19(4), 2005.
[43] Galen M. Shipman, David A. Dillow, Sarp Oral, and Feiyi Wang. The Spider center wide file system: From concept to reality. In Cray User Group 2009 Proceedings, 2009.
[44] Alexander S. Szalay, Peter Z. Kunszt, Ani Thakar, Jim Gray, Don Slutz, and Robert J. Brunner. Designing and mining multi-terabyte astronomy archives: The Sloan Digital Sky Survey. SIGMOD Rec., 29(2):451-462, 2000.
[45] T. T. Takahashi, H. Suganuma, Y. Nemoto, and H. Matsufuru. Detailed analysis of the three-quark potential in SU(3) lattice QCD. Phys. Rev. D, 65:114509, 2002.
[46] Rajeev Thakur and Ewing Lusk. An abstract-device interface for implementing portable parallel-I/O interfaces. In Proceedings of the 6th Symposium on the Frontiers of Massively Parallel Computation, pages 180-187. IEEE Computer Society Press, 1996.
[47] Tiankai Tu, Charles A. Rendleman, Patrick J. Miller, Federico D. Sacerdoti, Ron O. Dror, and David E. Shaw. Accelerating parallel analysis of scientific simulation data via Zazen. In USENIX Conference on File and Storage Technologies, pages 129-142, 2010.
[48] Brent Welch, Marc Unangst, Zainul Abbasi, Garth Gibson, Brian Mueller, Jason Small, Jim Zelenka, and Bin Zhou. Scalable performance of the Panasas parallel file system. In FAST '08: Proceedings of the 6th USENIX Conference on File and Storage Technologies, pages 1-17, Berkeley, CA, USA, 2008. USENIX Association.
[49] J. Woodring, K. Heitmann, J. Ahrens, P. Fasel, C.-H. Hsu, S. Habib, and A. Pope. Analyzing and visualizing cosmological simulations with ParaView. ArXiv e-prints, October 2010.
