Structured Streams: Data Services for Petascale Science Environments

Patrick Widener, Matthew Wolf, Jack Pulikottil, Greg Eisenhauer, Patrick G. Bridges, Hasan Abbasi, Matthew Barrick, Jay Lofstead, Ada Gavrilovska, Scott Klasky, Ron Oldfield, Arthur B. Maccabe, Karsten Schwan

Abstract

The challenge of meeting the I/O needs of petascale applications is exacerbated by an emerging class of data-intensive HPC applications that requires annotation, reorganization, or even conversion of their data as it moves between HPC computations and data end users or producers. For instance, data visualization can present data at different levels of detail. Further, actions on data are often dynamic, as new end-user requirements introduce data manipulations unforeseen by original application developers. These factors are driving a rich set of requirements for future petascale I/O systems: (1) high levels of performance and, therefore, flexibility in how data is extracted from petascale codes; (2) the need to support 'on demand' data annotation – metadata creation and management – outside application codes; (3) support for concurrent use of data by multiple applications, like visualization and storage, including associated consistency management and scheduling; and (4) the ability to flexibly access and reorganize physical data storage. We introduce an end-to-end approach to meeting these requirements: Structured Streams, streams of structured data with which methods for data management can be associated whenever and wherever needed. These methods can execute synchronously or asynchronously with data extraction and streaming, they can run on the petascale machine or on associated machines (such as storage or visualization engines), and they can implement arbitrary data annotations, reorganization, or conversions. The Structured Streaming Data System (SSDS) enables high-performance data movement or manipulation between the compute and service nodes of the petascale machine and between/on service nodes and ancillary machines; it enables the metadata creation and management associated with these movements through specification instead of application coding; and it ensures data consistency in the presence of anticipated or unanticipated data consumers. Two key abstractions implemented in SSDS, I/O graphs and Metabots, provide developers with high-level tools for structuring data movement as dynamically-composed topologies. A lightweight storage system avoids traditional sources of I/O overhead while enforcing protected access to data. This paper describes the SSDS architecture, motivating its design decisions and intended application uses. The utility of the I/O graph and Metabot abstractions is illustrated with examples from existing HPC codes executing on Linux Infiniband clusters and Cray XT3 supercomputers. Performance claims are supported with experiments benchmarking the underlying software layers of SSDS, as well as application-specific usage scenarios.

1 Introduction

Large-scale HPC applications face daunting I/O challenges. This is especially true for complex coupled HPC codes like those in climate or seismic modeling, and also for emerging classes of data-intensive HPC applications. Problems arise not only from large data volumes but also from the need to perform activities such as data staging, reorganization, or transformation [24]. Coupled simulations, for instance, may require data staging and conversion, as in multi-scale materials modeling [7], or data remeshing or changes in data layout [17, 14]. Emerging data-intensive applications have additional requirements, such as those derived from their continuous monitoring [23]. Their online monitoring and the visualization of monitoring data require data filtering and conversion, in addition to the basic requirements of low-overhead, flexible extraction of such data [35]. Similar needs exist for data-intensive applications in the sensor domain, where sensor data interpretation requires actions like data cleaning or remeshing [22]. Addressing these challenges presents technical difficulties including:

• scaling to large data volumes and large numbers of I/O clients (i.e., compute nodes), given limited I/O resources (i.e., a limited number of nodes in I/O partitions),
• avoiding excessive overheads on compute nodes (e.g., I/O buffers and compute cycles used for I/O),
• balancing bandwidth utilization across the system, as mismatches will slow down the computational engines, either through blocking or through over-provisioning in the I/O subsystem, and
• offering additional functionality in I/O, including on-demand data annotation, filtering, or similar metadata-centric I/O actions.

Structured Streams, and the Structured Streaming Data System (SSDS) that implements them, are a new approach to petascale I/O that encompasses a number of new I/O techniques aimed at addressing the technical issues listed above.

Data taps are flexible mechanisms for extracting data from or injecting data into HPC computations; efficiency is gained by making it easy to vary I/O overheads and costs in terms of buffer usage and CPU cycles spent on I/O, and by controlling I/O volumes and frequency. Structured data exchanges between all stream participants make it possible to enrich I/O by annotating or changing data, both synchronously and asynchronously with data movement.

I/O graphs explicitly represent an application's I/O tasks as configurable topologies of the nodes and links used for moving and operating on data. I/O graphs start with lightweight data taps on computational nodes, traverse arbitrary additional task nodes on the petascale machine (including compute and I/O nodes, as desired), and end on storage or visualization engines. Using I/O graphs, developers can flexibly and dynamically partition I/O tasks and concurrently execute them, across petascale machines and the ancillary engines supporting their use. Enhanced techniques dynamically manage I/O graph execution, including their scheduling and the I/O costs imposed on petascale applications.

Metabots are tools for specifying and implementing operations on the data moved by I/O graphs. Metabot specifications include the nature of operations (annotation, organization, modification of data to meet dynamic end-user needs) as well as implementation and interaction details (such as application synchrony, data consistency requirements, or metabot scheduling).

Lightweight storage separates fast-path data movements from machine to disk from metadata-based operations like file consistency, while preserving access control. Metabots operating on disk-resident data are one method for asynchronously (outside the data fast path) determining the file properties of data extracted from high performance machines.

SSDS is being implemented for leadership-class machines residing at sites like Oak Ridge National Laboratory. Its realization for Cray XT3 and XT4 machines runs data taps on their compute nodes, using Cray's Catamount kernel, and it executes full-featured I/O graphs utilizing nodes, metabots, and lightweight storage both on the Cray XT3/XT4 I/O nodes and on secondary service machines. SSDS also runs on Linux-based clusters using Infiniband RDMA transports in place of the Sandia Portals [3] communication construct. Enhanced techniques for automatically managing I/O graphs' cost vs. performance have not yet been realized, but measurements shown in this paper demonstrate the basic performance properties of SSDS mechanisms and the cost/performance tradeoffs made possible by their use.
In the remainder of this paper, Section 2 describes the basic I/O structure of the HPC systems targeted by SSDS. Section 3 follows with a description of the Structured Streams abstraction, how it addresses the emerging I/O challenges in the systems described in Section 2, and its implementation in SSDS. Section 4 presents experimental results from our prototype SSDS implementation, illustrating how the basic SSDS abstractions that implement structured streams provide a powerful and flexible I/O system for addressing emerging HPC I/O challenges. Section 5 then describes related work, and Section 6 presents conclusions and directions for future research.

2 Background

Figure 1 depicts a representative machine structure for handling the data-intensive applications targeted by our work, derived from our interactions with HPC vendors and with scientists at Sandia, Los Alamos, and Oak Ridge National Laboratories. These architectures have four principal components: (1) a dedicated storage engine with limited attached computational facilities for data mining; (2) a large-scale MPP with compute nodes for application computation and I/O staging nodes for buffering and data reorganization; (3) miscellaneous local compute facilities such as visualization systems; and (4) remote data sources and sinks such as high-bandwidth sensor arrays and remote collaboration sites. The storage engine provides long-term storage, its primary client being the local MPP system. Nodes in the MPP are connected by a high-performance interconnect such as the Cray SeaStar or 4x Infiniband with effective data rates of 40GB/sec or more, while this MPP is connected to other local machines, including the data archive, using a high-speed commodity interconnect such as 10GigE. Because these interconnects have lower peak and average bandwidths than the MPP's high-performance interconnect, some nodes inside the MPP (i.e., service or I/O nodes) are typically dedicated to impedance matching by buffering, staging, and reordering data as it flows into and out of the MPP.

[Figure 1. System Hardware Architecture: a local MPP (compute and I/O nodes) connected to a storage engine, visualization systems, remote sensors, and remote clients.]

Systems with remote access to the storage engine have lower bandwidths, even when using high-end networks like TeraGrid or DOE's UltraScience network, but they can still produce and consume large amounts of data. Consider, for example, data display or visualization clusters [16], sensor networks, satellites or other telemetry sources, or ground-based sites such as the Long Wavelength Array of radio telescopes. For all such cases, the supercomputer may be either the source (checkpoints or processed simulation results) or sink (analysis of data across data sets) for the large data sets resident in the pool of storage.

3 Structured Streams

3.1 Using Structured Streams

To address the I/O challenges in systems and applications such as those described in Section 2, we propose to manage all data movement as structured streams. Conceptually, a structured stream is a sequence of annotations, augmentations, and transformations of typed, structured data; the stream describes how data is conceptually transformed as it moves from data source to sink, along with desired performance, consistency, and quality of service requirements for the stream. An SSDS-based application describes its data flows in terms of structured streams. Data generation on an MPP cluster or retrieval from a storage system, for example, may be expressed as one or more structured streams, each of which performs data manipulation according to downstream application requirements. A structured streaming data system (SSDS), then, maps these streams onto runtime resources. In addition to meeting the application's I/O needs, this mapping can be constructed with MPP utilization in mind or to meet overhead constraints. A structured stream is a dynamic, extensible entity; incorporation of new application components into a structured stream is implemented as their attachment to any node in the graph. Such new components can attach as pure sinks (as simple graph endpoints, termed data taps) or include specific additional functionality in more complex subgraphs. A resulting strength of the structured stream model is that data generators need not be modified to accommodate new consumers.
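To make the graph-of-components view concrete, the sketch below models a structured stream as a small node graph in C and shows a new consumer attaching as a pure sink to an existing node while the data generator stays untouched. This is an illustration only; the structure and function names are ours, not the SSDS API.

```c
#include <stdio.h>
#include <stdlib.h>

/* Illustrative model of a structured-stream topology: each node forwards
 * events to its downstream links; sinks just consume. Hypothetical names. */
#define MAX_LINKS 4

typedef struct node {
    const char *name;
    void (*on_event)(struct node *self, const void *data, size_t len);
    struct node *down[MAX_LINKS];
    int ndown;
} node_t;

static void forward(node_t *n, const void *data, size_t len) {
    for (int i = 0; i < n->ndown; i++)
        n->down[i]->on_event(n->down[i], data, len);
}

static void source_op(node_t *self, const void *data, size_t len) {
    forward(self, data, len);                 /* data generator: just emits */
}

static void sink_op(node_t *self, const void *data, size_t len) {
    (void)data;
    printf("%s consumed %zu bytes\n", self->name, len);
}

/* Attach a new consumer as a pure sink to any existing node in the graph;
 * the generator's code is not modified.                                   */
static void attach_sink(node_t *at, node_t *sink) {
    if (at->ndown < MAX_LINKS) at->down[at->ndown++] = sink;
}

int main(void) {
    node_t source  = { "simulation-tap", source_op, {0}, 0 };
    node_t storage = { "storage-sink",   sink_op,   {0}, 0 };
    node_t viz     = { "viz-sink",       sink_op,   {0}, 0 };

    attach_sink(&source, &storage);           /* original topology          */
    double step[3] = { 1.0, 2.0, 3.0 };
    source.on_event(&source, step, sizeof step);

    attach_sink(&source, &viz);               /* a new consumer joins later */
    source.on_event(&source, step, sizeof step);
    return 0;
}
```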

3.2 Example Structured Streams

Structured streams carry data of known structure, but they also offer rich facilities for the runtime creation of additional metadata. For example, metadata may be created to support direct naming, reference by relation, and reference by arbitrary content. Application-centric metadata like the structure or layout of data events can be stated explicitly as part of the extended I/O graph interface offered by SSDS (i.e., data taps), or it may be derived from application programs, compilers, or I/O marshaling interfaces.

Consider the I/O tasks associated with the online visualization of data in multi-scale modeling codes. Here, even within a single domain like materials design, different simulation components will use different data representations (due to differences in length or time scales); they will have multiple ways of storing and archiving data; and they will use different programs for analyzing, storing, and displaying data. For example, computational chemistry aspects can be approached either from the viewpoint of chemical physics or physical chemistry. Although researchers from both sides may agree on the techniques and base representations, there can be fundamental differences in the types of statistics which these two groups may gather. The physicists may be interested in wavefunction isosurfaces and the HOMO/LUMO gap, while the chemists may be more interested in types of bonds and estimated relative energies. More simply, some techniques require the data structures to be reported in wave space, while others require real space. Similar issues arise for many other HPC applications.

A brief example demonstrates how application-desired structures and representations of data can be specified within the I/O graphs used to realize structured streams. Consider, for example, the output of the Warp molecular dynamics code [26], a tool developed by Steve Plimpton. It and its descendants have been used by numerous scientists for exploring materials physics problems in chemistry, physics, mechanical, and materials engineering. The output is composed of arrays of three real coordinates, representing the x, y, and z values for an individual atom's location, coupled with additional attributes for the run, including the type of atom, velocities of the atoms and/or temperatures of the ensemble, total energies, and so on. Part of the identification of the atomic type is coupled to the representation of the force interaction between atoms within the code – iron atoms interact differently with other iron atoms than they do with nickel. Using output data from one code to serve as input for another involves not only capturing the simple positions, but also the appropriate changes in classifications and indexing that the new code may require. I/O graphs use structured data representations to simply record the ways in which such data is laid out in memory (and/or in file blocks) and then maintain and use the translations to other representations. The current implementation of SSDS uses translations that are created manually, although at a higher level than the Perl or Python scripting that is standard practice in the field. In ongoing work related (but not directly linked) to the SSDS effort, we are producing automated tools for creating and maintaining these translations.
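As an illustration of the kind of layout description and translation discussed above, the sketch below declares a per-atom record, records its in-memory layout with offsetof, and remaps atom-type indices from one code's classification to another's. The record definition, field table, and iron/nickel mapping are hypothetical stand-ins, not the Warp format or the SSDS metadata interface.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical per-atom output record: x, y, z coordinates plus attributes. */
typedef struct {
    double x, y, z;      /* atom position               */
    double vx, vy, vz;   /* atom velocity               */
    int    type;         /* producer's atom-type index  */
} atom_t;

/* A simple field-description table recording how the record is laid out in
 * memory, the kind of metadata an I/O graph keeps about structured events. */
typedef struct { const char *name; size_t offset; size_t size; } field_desc_t;

static const field_desc_t atom_layout[] = {
    { "x",    offsetof(atom_t, x),    sizeof(double) },
    { "y",    offsetof(atom_t, y),    sizeof(double) },
    { "z",    offsetof(atom_t, z),    sizeof(double) },
    { "vx",   offsetof(atom_t, vx),   sizeof(double) },
    { "vy",   offsetof(atom_t, vy),   sizeof(double) },
    { "vz",   offsetof(atom_t, vz),   sizeof(double) },
    { "type", offsetof(atom_t, type), sizeof(int)    },
};

/* Translation used when one code's output feeds another code's input:
 * remap producer type indices (e.g., 1 = Fe, 2 = Ni) to consumer indices. */
static int remap_type(int producer_type) {
    static const int producer_to_consumer[] = { -1, 26, 28 }; /* hypothetical */
    return producer_to_consumer[producer_type];
}

int main(void) {
    atom_t a = { 0.1, 0.2, 0.3, 0.0, 0.0, 0.0, 1 };
    a.type = remap_type(a.type);
    for (size_t i = 0; i < sizeof atom_layout / sizeof atom_layout[0]; i++)
        printf("%-4s at offset %zu (%zu bytes)\n",
               atom_layout[i].name, atom_layout[i].offset, atom_layout[i].size);
    printf("remapped atom type: %d\n", a.type);
    return 0;
}
```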

3.3 Realizing Structured Streams

The concept of a structured stream only describes how data is transformed. To actually realize structured streams, we have decomposed their specification and implementation into two concrete abstractions implemented in the Structured Streaming Data System (SSDS) that describe when and where data is transformed: I/O graphs and metabots. Based on the application, the performance and consistency needs of a structured stream, and available system resources, a structured stream is specified as a series of in-band synchronous data movements and transformations across a graph of hosts (an I/O graph) and some number of asynchronous, out-of-band data annotations and transformations (metabots), resulting in a complete description of how, where, and when data will move through a system. The mapping of a structured stream to an I/O graph and a set of metabots in SSDS can change as application developers desire or as run-time conditions dictate. For instance, data format conversion may be performed strictly in an I/O graph for a limited number of consumers, but broken out (according to data availability deadlines) into a combination of I/O graph and metabot actions as the number of consumers grows. This ability to customize structured streams, and their potential for coordination and scheduling of both in-band and out-of-band data activity, makes them a powerful tool for application composition.

3.3.1 I/O graphs

I/O graphs are the data-movement and lightweight transformation engines of structured streams. The nodes of an I/O graph exchange data with each other over the network, receiving, routing, and forwarding as appropriate. I/O graphs are also responsible for annotating data and/or for executing the data manipulations required to filter data or, more generally, 'make data right' for recipients, and to carry out data staging, buffering, or similar actions performed on the structured data events traversing them. Stated more precisely, in an I/O graph, each operation uses one or more specific input data form(s) to determine onward routing and produce potentially different output data form(s). We identify three types of such operations:

• inspection, in which an input data form and its contents are read-only and used as input to a yes/no routing, forwarding, or buffering decision for the data;
• annotation, in which the input data form is not changed, but the data itself might be modified before the routing determination is made; and
• morphing, in which the input data form (and its contents) is changed to a different data form before the routing determination is made.

Regardless of which operations are being performed, I/O graph functionality is dynamically mapped onto MPP compute and service nodes, and onto the nodes of the storage engine and of ancillary machines. The SSDS design also anticipates the development of QoS- and resource-aware methods for data movement [5] and for SSDS graph provisioning [18, 30] to assist in deployment or evolution of I/O graphs.

While the I/O graphs shown and evaluated in this paper were explicitly created by developers, higher level methods can be used to construct I/O graphs. For example, consider a data conversion used in the interface of a high performance molecular modeling code to a visualization infrastructure. This data conversion might involve a 50% reduction in height and width of a 2-dimensional array of, say, a bitmap image (25% of original data size). Precise descriptions of these conversions can be a basis for runtime generation of binary codes with proper looping and other control constructs to apply the operation to the range of elements specified. Our previous work has used XML specifications from which automated tools generate the binary codes that implement the necessary data extraction and transport actions across I/O graph nodes.
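The bitmap-reduction example above can be made concrete with a small morphing operation. The sketch below (plain C, not the runtime-generated filter code itself) halves the height and width of a 2-D array by averaging 2x2 blocks, yielding 25% of the original data volume.

```c
#include <stdio.h>

/* Morphing operation: reduce a width x height grayscale image to
 * (width/2) x (height/2) by averaging each 2x2 block, i.e. the 50%
 * height/width (25% data size) reduction described in the text.      */
static void downsample2x(const unsigned char *in, int w, int h,
                         unsigned char *out /* (w/2)*(h/2) bytes */) {
    for (int y = 0; y < h / 2; y++)
        for (int x = 0; x < w / 2; x++) {
            int sum = in[(2*y)   * w + 2*x] + in[(2*y)   * w + 2*x + 1]
                    + in[(2*y+1) * w + 2*x] + in[(2*y+1) * w + 2*x + 1];
            out[y * (w / 2) + x] = (unsigned char)(sum / 4);
        }
}

int main(void) {
    enum { W = 8, H = 8 };
    unsigned char in[W * H], out[(W / 2) * (H / 2)];
    for (int i = 0; i < W * H; i++) in[i] = (unsigned char)i;

    downsample2x(in, W, H, out);
    printf("reduced %d bytes to %d bytes\n", W * H, (W / 2) * (H / 2));
    return 0;
}
```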


3.3.2 Metabots

Metadata in MPP Applications. The particulars of stream structure may not be readily available from MPP applications. One reason is that current applications use traditional file systems, which force metadata to be generated in-line with computational results and thereby reduce effective computational I/O bandwidth. As a result, minimal metadata is available on storage devices for use by other applications, and the metadata that is present is frequently only useful to the application that generated it. Additionally, such applications are extremely sensitive to the integrity of their metadata, ruling out modifications which might benefit other clients. Recovering metadata at a later time for use by other applications can reduce to inspection of source codes in order to determine data structure.

As noted earlier, this situation is caused by the difference between MPP internal bandwidth and I/O bandwidth. One approach is to recover I/O bandwidth by removing metadata operations and other semantic operations from parallel file systems. However, semantics such as file abstractions, name mapping, and namespaces are commonly relied upon by MPP applications, so removing such metadata permanently is not an option. In fact, we consider the generation, management, and use of these and other types of metadata to be of increasing importance in the development of future high performance applications. Our approach, therefore, is not to require all metadata management to be in the fast "data path" of these applications, but instead, to move as much metadata-related processing as appropriate out of this path. The metabots described below realize this goal.

Metabots and I/O graphs. Metabots provide a specification-based means of introducing asynchronous data manipulation into structured streams. A basic I/O graph emphasizes metadata generation as data is captured and moved (e.g., data source identification). By default, generation is performed "in-band", a process analogous to current HPC codes that implicitly create metadata during data storage. Metabots make it possible to move these tasks "out-of-band", that is, outside the fast path of data transfer. Specifically, metabots can coordinate and execute these tasks at times less likely to affect applications, such as after a data set has been written to disk. In particular, metabots provide additional functionality for metadata creation and data organizations unanticipated by application developers but required by particular end-user needs. Colloquially, metabots are metadata agents whose run-time representations crawl storage containers, generating metadata in user-defined ways. For example, in many scientific applications, data may need to be analyzed for both spatial and temporal features. In examining the flux of ions through a particular region of space in a fusion simulation, the raw data (organized by time slice) may need to be transformed so that it is organized as a time series of data within a specific spatial bounding box. This can involve both computationally intensive (e.g., bounding box detection) and I/O-bound phases (e.g., appending data fragments to the time series). The metabot run-time abstraction allows these to occur out-of-band or in-band, as appropriate.
Metabots differ from traditional workflow concepts in two ways: (1) they are tightly coupled to the synchronous data transmission, and need to be flexibly reconfigurable based on what the run-time has or has not completed, and (2) they are confined to specific meta-data and data-organizational tasks. As such, the streaming data and metabot run-times could be integrated in future work to serve as a self-tuning actor within a more general workflow system like Kepler [20].
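As a concrete sketch of the fusion example above, the code below reorganizes particle records stored by time slice into a per-time-step series restricted to a spatial bounding box; a metabot could run exactly this kind of pass either in-band or out-of-band. The record layout and bounding-box test are hypothetical, chosen only to illustrate the reorganization.

```c
#include <stdio.h>

/* Hypothetical particle record as stored per time slice. */
typedef struct { int step; double x, y, z; double flux; } particle_t;

typedef struct { double xmin, xmax, ymin, ymax, zmin, zmax; } bbox_t;

static int in_box(const particle_t *p, const bbox_t *b) {
    return p->x >= b->xmin && p->x <= b->xmax &&
           p->y >= b->ymin && p->y <= b->ymax &&
           p->z >= b->zmin && p->z <= b->zmax;
}

/* Reorganization pass: scan raw, time-slice-ordered records and append the
 * ones inside the bounding box to a per-box time series (here, stdout).   */
static void build_time_series(const particle_t *raw, int n, const bbox_t *b) {
    for (int i = 0; i < n; i++)
        if (in_box(&raw[i], b))
            printf("step %d: flux %.3f at (%.2f, %.2f, %.2f)\n",
                   raw[i].step, raw[i].flux, raw[i].x, raw[i].y, raw[i].z);
}

int main(void) {
    particle_t raw[] = {
        { 0, 0.1, 0.2, 0.3, 1.5 },
        { 0, 2.0, 2.0, 2.0, 0.7 },   /* outside the box */
        { 1, 0.4, 0.4, 0.4, 1.8 },
    };
    bbox_t box = { 0.0, 1.0, 0.0, 1.0, 0.0, 1.0 };
    build_time_series(raw, 3, &box);
    return 0;
}
```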

3.4 A Software Architecture for Petascale Data Movement

The data manipulation and transmission mechanism of SSDS leverages extensive prior work with high performance data movement. Key technologies realized in that research and leveraged for this effort include: (1) efficient representations of meta-information about data structure and layout, enabling (2) high performance and 'structure-aware' manipulations on SSDS data 'in flight', carried out by dynamically deployed binary codes and using higher level tools with which such manipulations can be specified, termed XChange [1]; (3) a dynamic overlay optimized for efficient data movement, where data fast path actions are strongly separated from the control actions necessary to build, configure, and maintain the overlay [30]; and (4) a lightweight object storage facility (LWFS [21]) that provides flexible, high-performance data storage while preserving access controls on data. LWFS implements back-end metadata and data storage. PBIO [10] is an efficient binary runtime representation of metadata. High performance data paths are realized with the EVPath data movement and manipulation infrastructure [9], and the XChange tool provides I/O graph mapping and management support [30]. Knowledge about the structure and layout of data is integrated into the base layer of I/O graphs. Selected data manipulations can then be directly integrated into I/O graph actions, in order to move only the data that is currently required and (if necessary and possible) to manipulate data to avoid unnecessary data copying due to send/receive data mismatches. The role of CM (Connection Manager) is to manage the communication interfaces of I/O graphs. Also part of EVPath and CM are the control methods and interfaces needed to configure I/O graphs, including deploying the operations that operate on data, linking graph nodes, deleting them, etc. Potential overheads from high-level control semantics will not affect data fast path performance, and alternative realizations of such semantics become possible. These will be important enablers for integrating actions like file system consistency or conflict management with I/O graphs, for instance. Finally, the metadata used in I/O graphs can be provided by applications, but it can also be derived automatically, by the metabots described in Section 3.3.2.

[Figure 2. SSDS software architecture: ECho pub/sub, XChange, LWFS, the Metabot manager, and the I/O graph manager, layered over CM, EVPath, PBIO, and the CM transports.]

I/O graph nodes run as daemon processes on the MPP's service nodes, on the storage engine, and on secondary machines like visualization servers or remote sensor machines. In addition, selected I/O graph nodes may run on the MPP's compute engines, to provide to such applications the extended I/O interfaces offered by SSDS. For all I/O graph nodes, operator deployment can use runtime binary code generation techniques to optimize data manipulation for current application needs and platform conditions. Additional control processes not shown in the figure run tasks like optimization of code generation across multiple I/O graph nodes and machines, and I/O graph mapping and management actions.

Structured streams do not replace standard back-end storage (i.e., file systems) or transports (i.e., network subsystems). Instead, they extend and enhance the I/O functionality seen by high performance applications. The structured stream model of I/O inherently utilizes existing (and future) high performance storage systems and data storage models, but offers a data-driven rather than connection- or file-driven interface for the HPC developer. In particular, developers are provided with enhanced I/O system interfaces and tools to define data formats and the operations to be performed on data 'in flight' between simulation components and I/O actions. As an example, the current implementation of SSDS leverages existing file systems (ext3 and Lustre) and protocols (Portals and IB RDMA). Since the abstraction presented to the programmer is inherently asynchronous and data-driven, however, the run-time can perform data object optimizations (like message aggregation or data validation) in a more efficient way than the corresponding operation on a file object. In contrast, the successful paradigm of MPI-IO [32], particularly when coupled with a parallel file system, heavily leverages the file nature of the data target and utilizes the transport infrastructure as efficiently as possible within that model. However, that inherently means the underlying file system concepts of consistency, global naming, and access patterns will be enforced at the higher level as well. By adopting a model that allows for the embedding of computations within the transport overlay, it is possible to delay execution of or entirely eliminate those elements of the file object which the application does not immediately require. If a particular algorithm does not require consistency (as is true of some highly fault-tolerant algorithms), then it is not necessary to enforce it from the application perspective. Similarly, if there is an application-specific concept of consistency (such as validating a checkpoint file before allowing it to overwrite the previous checkpoint), that could be enforced as well, in addition to the more application-driven specifications mentioned earlier.
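The checkpoint example above admits a simple sketch: validate a newly written checkpoint before letting it replace the previous one via an atomic rename. The file names and the trivial size check standing in for an application-specific test are illustrative only, not part of SSDS.

```c
#include <stdio.h>
#include <sys/stat.h>

/* Application-specific consistency check: stand-in validation that the new
 * checkpoint is non-empty. A real test might verify headers or checksums. */
static int checkpoint_valid(const char *path, long expected_min_bytes) {
    struct stat st;
    if (stat(path, &st) != 0) return 0;
    return st.st_size >= expected_min_bytes;
}

int main(void) {
    const char *fresh = "checkpoint.new";   /* hypothetical file names */
    const char *live  = "checkpoint.dat";

    /* Only overwrite the previous checkpoint once the new one validates;
     * rename() is atomic on POSIX file systems, so readers never see a
     * half-written checkpoint under the live name.                      */
    if (checkpoint_valid(fresh, 1))
        rename(fresh, live);
    else
        fprintf(stderr, "keeping previous checkpoint: %s failed validation\n",
                fresh);
    return 0;
}
```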

3.4.1 Implementation

Datatap. The datatap is implemented as a request-read service designed to accommodate the orders-of-magnitude difference in available memory between the I/O and service nodes and the compute partition. We assume the existence of a large number of compute nodes producing data (we refer to them as datatap clients) and a smaller number of I/O nodes receiving the data (we refer to them as datatap servers). The datatap client issues a data-available request to the datatap server, encodes the data for transmission, and registers this buffer with the transport for remote read. For very large data sizes, the cost of encoding data can be significant, but it will be dwarfed by the actual cost of the data transfer [12, 11, 4]. On receipt of the request, the datatap server issues a read call. Due to the limited amount of memory available on the datatap server, the server only issues a read call if there is memory available to complete it. The datatap server issues multiple read requests to reduce the request latency as much as possible; it is performance-bound by the available memory, which restricts the number of concurrent requests and determines the request service latency. The datatap server acts as a data feed into the I/O graph overlay. The I/O graph can replicate the functionality of writing the output to a file (see Section 4.4), or it can be used to perform "in-flight" data transformations (see Section 4.4).
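The request-read protocol can be sketched as follows. This is a simplified, self-contained model, not the Portals or Infiniband code: clients announce available buffers, and the server issues reads only while it has memory left, queuing the remaining requests.

```c
#include <stdio.h>

/* Simplified model of the datatap server's memory-bounded scheduling:
 * pending "data available" requests are served only while server-side
 * buffer memory remains; the rest wait, which is the source of the
 * request-service latency measured in Section 4.3. Sizes are made up.  */
#define SERVER_MEM   (64L * 1024 * 1024)   /* hypothetical 64MB I/O-node pool */
#define NUM_CLIENTS  8

typedef struct { int client; long bytes; int served; } request_t;

int main(void) {
    request_t pending[NUM_CLIENTS];
    long free_mem = SERVER_MEM;

    for (int i = 0; i < NUM_CLIENTS; i++)
        pending[i] = (request_t){ i, 16L * 1024 * 1024, 0 };  /* 16MB each */

    /* One scheduling pass: issue a (simulated) RDMA read per request that
     * fits in the remaining memory; others stay queued for a later pass. */
    for (int i = 0; i < NUM_CLIENTS; i++) {
        if (pending[i].bytes <= free_mem) {
            free_mem -= pending[i].bytes;
            pending[i].served = 1;            /* read issued immediately */
        }
    }
    for (int i = 0; i < NUM_CLIENTS; i++)
        printf("client %d: %s\n", pending[i].client,
               pending[i].served ? "read issued" : "queued (no memory)");
    return 0;
}
```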

We currently have two implementations of the datatap, using Infiniband user-level verbs and the Sandia Portals interface. We needed multiple implementations in order to support both our local Linux clusters and the Cray XT3 platform. The two implementations have a common design and hence common performance, except in one regard: the Infiniband user-level verbs do not provide a reliable datagram (RD) transport, which increases the time spent in issuing a data-available request (see Figure 8).

I/O graph implementation. Actual implementation of I/O graph data transport and processing is accomplished via a middleware package designed to facilitate the dynamic composition of overlay networks for message passing. The principal abstraction in this infrastructure is 'stones' (as in 'stepping stones'), which are linked together to compose a data path. Message flows between stones can be both intra- and inter-process, with inter-process flows being managed by special output stones. The taxonomy of types of stones is relatively broad, but includes: terminal stones, which implement data sinks; filter stones, which can optionally discard data; transform stones, which modify data; and split stones, which implement data-based routing decisions and may redirect incoming data to one or more other stones for further processing. I/O graphs are highly efficient because the underlying transport mechanism performs only minimal encoding on the sending side and uses dynamic code generation to perform highly efficient decoding on the receiving side. The functions that filter and transform data are represented in a small C-like language. These functions can be transported between nodes in source form, but when placed in the overlay we use dynamic code generation to create a native version of these functions. This dynamic code generation capability is based on a lower-level package that provides for dynamic generation of a virtual RISC instruction set. Above that level, we provide a lexer, parser, semantic analyzer, and code generator, making the equivalent of a just-in-time compiler for the small language. As such, the system generates native machine code directly into the application's memory without reference to an external compiler. Because we do not rely upon a virtual machine or other sand-boxing technique, these filters can run at roughly the speed of unoptimized native code and can be generated considerably faster than forking an external compiler.
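The filter and transform functions mentioned above are written in a small C-like language and compiled to native code at runtime; the plain-C sketch below shows the kind of predicate a filter stone might apply. The event structure and threshold are ours, for illustration, not part of the actual filter language or any SSDS event type.

```c
#include <stdio.h>

/* Hypothetical structured event carried through the overlay. */
typedef struct { int timestep; double max_flux; } diag_event_t;

/* Filter-stone predicate: return nonzero to forward the event downstream,
 * zero to discard it. In SSDS this logic would be expressed in the small
 * C-like filter language and dynamically code-generated at deployment.   */
static int interesting(const diag_event_t *ev) {
    return ev->max_flux > 0.5;   /* illustrative threshold */
}

int main(void) {
    diag_event_t events[] = { { 1, 0.2 }, { 2, 0.9 }, { 3, 0.6 } };
    for (int i = 0; i < 3; i++)
        printf("event %d: %s\n", events[i].timestep,
               interesting(&events[i]) ? "forwarded" : "discarded");
    return 0;
}
```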

4 Experimental Evaluation

4.1 Overview

To evaluate the effectiveness of our prototype SSDS implementation, we conducted a variety of experiments to understand its performance characteristics. In particular, we conducted experiments using a combination of HPC-oriented I/O benchmarks that test individual portions of SSDS, including I/O graphs, the datatap, and Metabots, and a prototype full-system SSDS application benchmark using a modified version of the GTC [23] HPC code. As an experimental testbed, we utilized a cluster of 53 dual-processor 3.2GHz Intel EM64T nodes, each with 6GB of memory, running Redhat Enterprise Linux AS release 4 with kernel version 2.6.9-42.0.3.ELsmp. Nodes were connected by a non-blocking 4x Infiniband interconnect using the IB TCP/IP implementation. I/O services in the cluster are provided by dedicated cluster nodes containing 73GB 10k RPM Seagate ST373207LC Ultra SCSI 320 disks. Underlying SSDS I/O service was provided by a prototype implementation of the Sandia Lightweight File System (LWFS) [21] communicating using the user-level Portals TCP/IP reference implementation [3]. Note that the Portals reference implementation, unlike the native Portals implementation on Cray SeaStar-based systems, is a largely unoptimized communication subsystem. Because this communication infrastructure currently constrains the absolute performance of SSDS on this platform, our experiments focus on relative comparisons instead of absolute performance numbers.

4.2 Metabot Evaluation

Overview. To understand the potential performance benefits of Metabots in SSDS, we ran several metadata manipulation benchmarks with various SSDS configurations, using different setups of a benchmark based on the LLNL fdtree benchmark [13], which we shall call fdtree-prime. The benchmark creates a file hierarchy parametrized on the depth of the hierarchy, the number of directories at each depth, and the number and size of files in each directory. In particular, we focused on comparing the benchmark performance with metadata manipulation inline and with metadata manipulation moved out-of-band to a Metabot.

In situations where applications create large file hierarchies which are accessed at a later point in time, in-band file creation can incur significant metadata manipulation overhead. Since the application programmer generally knows the structure of this hierarchy, including file names, numbers, and sizes, it may frequently be possible to move namespace metadata manipulations out-of-band.

[Figure 3. Out-of-band Metadata Reconstruction: benchmark time in seconds for the naming, raw, and reconstruct configurations, (a) scaling the number of files and (b) scaling the directory depth (number of levels).]

We have implemented this optimization using SSDS Metabots, where the application can write directly to the resulting storage targets without in-band metadata manipulation costs to application execution. Subsequently, a Metabot creates the needed filename-to-object mappings out-of-band.

Setup. To understand the potential performance benefits of Metabots, we built a Metabot that performs specified file hierarchy metadata manipulations according to a specification after the actual raw I/O writes done by fdtree-prime. We then compared the time taken to run fdtree-prime setups with metadata manipulation in-band and metadata manipulation out-of-band, as well as the amount of time needed for out-of-band metadata construction. For these tests, we used two different fdtree-prime setups: one which created an increasing number of files of size 4KB in a single directory, and one that created the same-sized files in an increasingly deep directory hierarchy in which each directory contained 100 files and 2 subdirectories. Experiments were run on 4 nodes of the cluster described in Section 4.1. One node executed the benchmark itself, while three nodes ran the LWFS authorization, naming, and storage servers.

Results. In our first experiment, we see that the advantage of raw writes over inline metadata creation grows with the number of files. Even with a flat directory structure, at 65,536 files, the write performance with inline metadata creation is 70% slower than a raw write. In the second experiment, the performance gain is even more apparent: with a depth of 5 levels and 2 directories per level, the write performance with inline metadata creation is 9.7 times slower than a raw write. In both cases, the metadata construction Metabot takes about the same time as the inline-metadata benchmark.

Analysis. While the above results demonstrate how moving metadata creation out-of-band can significantly increase the performance of in-band activity, the sum total of raw-write time and construction Metabot time is still greater than that of inline metadata creation. This is because the construction Metabot has to read from a raw object stored on the LWFS storage server and carry out the same operations as the inline-metadata benchmark. In the current LWFS API, the creation of a file is accompanied by the creation of a new LWFS object containing the data for the new file; hence the construction Metabot has to repeat the workload of the inline-metadata benchmark. A more efficient implementation of the API would allow the construction Metabot to avoid file data copies and new object creations and simply create filesystem metadata for the object that was created during the raw-write benchmark. Performance could also have been better had the construction Metabot been deployed on the storage server rather than on a remote node, as was done for this series of experiments. Another limitation is that we do not have metrics for comparing write performance with other parallel file system implementations, primarily due to constraints of platform availability; however, LWFS performance characteristics are comparable to Lustre [21].
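The optimization evaluated here can be sketched in two phases. The object identifiers and manifest below are stand-ins, not the LWFS API: the application's fast path writes raw objects and logs their IDs, and a later metabot pass builds the filename-to-object mapping from that log.

```c
#include <stdio.h>

#define MAX_OBJS 4

/* Phase 1 (fast path): the application writes raw objects and records only
 * (object id, intended path) pairs; no naming/metadata service is touched. */
typedef struct { int obj_id; const char *path; } manifest_entry_t;

/* Phase 2 (out-of-band metabot): walk the manifest later and create the
 * filename-to-object mappings, off the application's critical path.       */
static void metabot_build_namespace(const manifest_entry_t *m, int n) {
    for (int i = 0; i < n; i++)
        printf("map %-24s -> object %d\n", m[i].path, m[i].obj_id);
}

int main(void) {
    manifest_entry_t manifest[MAX_OBJS] = {
        { 101, "/run42/dir0/file_0000" },
        { 102, "/run42/dir0/file_0001" },
        { 103, "/run42/dir1/file_0000" },
        { 104, "/run42/dir1/file_0001" },
    };
    metabot_build_namespace(manifest, MAX_OBJS);
    return 0;
}
```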

4.3 DataTap Evaluation

Overview. The datatap serves as a lightweight, low-overhead data extraction system. As such, the datatap replaces the remote filesystem services offered by the I/O nodes of large MPP machines. The datatap is designed to require a minimum level of synchronicity in extracting the data, thus allowing a large overlap between the application's computational kernel and the data movement. The adverse performance impact of extracting data from an application can be broken down into two parts: the non-asynchronous parts of the data extraction protocol (i.e., the time for the remote node to accept the data transfer request) and the blocking wait for the transfer to complete (e.g., if the transfer time is not fully overlapped by computation) both have an impact on the total computation time of the application. To reduce this overhead we designed the datatap in SSDS to have a minimum blocking overhead.


Setup. We have implemented two versions of the datatap, (1) using the low-level Infiniband verbs layer and (2) using the Sandia Portals interface. The Portals interface is optimized for the Cray XT3 environment, but it also offers a TCP/IP module. However, the performance of the Portals-over-TCP/IP datatap is orders of magnitude worse than either Infiniband or Portals on the Cray XT3.

Results. We tested the Infiniband datatap on our local Linux cluster described above. The Portals datatap was tested offsite on a Cray XT3 at Oak Ridge National Laboratory. The results demonstrate the feasibility, scalability, and limitations of inserting datataps in the I/O fast path.

[Figure 4. Bandwidth during data consumer read ("server read data bandwidth"), as a function of the number of processors, for data sizes of 4MB–256MB: (a) IB-RDMA data tap on Linux cluster; (b) Portals data tap on Cray XT3. This bandwidth is less than the maximum available at higher numbers of processors because of multiple overlapping reads.]

First we consider the bandwidth observed for an RDMA read (or Portals "get"). The results are shown in Figure 4. The maximum bandwidth is available when the data size is large and the number of processors is small. This is due to the increasing number of concurrent read requests as the number of processors increases. For both the Infiniband and the Portals versions, the read bandwidth reaches a minimum value for a specific data size. This occurs when the maximum amount of allocatable memory is reached, forcing the datatap data consumer to schedule requests. The most significant aspect of this metric is that as the number of processors increases, the maximum number of outstanding requests reaches a plateau. Increasing the amount of memory available to the consumer (or server) will result in a higher level of concurrency.

Evaluation. First we look at the time taken to complete a data transfer, shown in Figure 6. The Portals version shows a much higher degree of variability, but the overall pattern is the same for both Infiniband and Portals. For larger numbers of processors the time to complete a transfer increases proportionally. This increase is caused by the increasing number of outstanding requests on the consumer as the total transferred data increases beyond the maximum amount of memory allocated to the consumer. For small data sizes (the 4MB transfer, for example) the time to completion stays almost constant, because all the nodes can be serviced simultaneously. The higher latency in the Infiniband version is due to the lack of a connectionless reliable transport. We had to implement a lightweight retransmission layer to address this issue, which invariably limits the performance and scalability observed in our Infiniband-based experiments. Another notable feature is that for the Portals datatap the time to complete for large data sizes is almost the same. This is because the Cray SeaStar is a high-bandwidth but also high-latency interface. Once the total data size to be transferred increases beyond the total available memory, the performance becomes bottlenecked by the latency.

The design of the datatap is such that only a limited number of data producers can be serviced immediately. This is due to the large imbalance between the combined memory of the compute partition and the combined memory of the I/O partition. Hence, an important performance metric is the average service latency, i.e., the time taken before a transfer request is serviced by the datatap server.

[Figure 5. Bandwidth observed by a single client ("client observed egress bandwidth"), as a function of the number of processors, for data sizes of 4MB–256MB: (a) IB-RDMA data tap on Linux cluster; (b) Portals data tap on Cray XT3.]

[Figure 6. Total time to complete a single data transfer, as a function of the number of processors, for data sizes of 4MB–256MB: (a) IB-RDMA data tap on Linux cluster; (b) Portals data tap on Cray XT3.]

The request service latency also determines how long the computing application will wait on the blocking send. Figure 7 shows the latency with increasing numbers of clients. The latency increase is almost directly proportional to the number of nodes, as the datatap server becomes memory-bound once the number of processors increases beyond a specific limit. The shape of the graphs is different for Infiniband and Portals, but the conclusion is the same: request service latency can be improved by allocating more memory to the datatap servers (for example, by increasing the number of servers). The impact of the memory bottleneck on the aggregate bandwidth observed by the datatap server is shown in Figure 9. The results demonstrate that the aggregate bandwidth increases with increasing numbers of nodes, but reaches a maximum when the server becomes memory-limited. Our ongoing work focuses on understanding how to best schedule outstanding service requests so as to minimize the impact of these memory mismatches on the ongoing computation.

Time spent in issuing a data transfer request will be the cause of the most significant overhead on the performance of the application. This is because the application only blocks when issuing the send and when waiting for the completion of the data transfer; the actual data transfer is overlapped with the computation of the application. Unfortunately, the Infiniband version of the datatap blocks for a significant period of time (up to 2 seconds for 64 nodes and a transfer of 256 MB/node) (see Figure 8(a)). This performance bottleneck is caused by the lack of a reliable connectionless transport layer in the current OpenFabrics distribution.


(a) Number of ions = 582410

Run Parameters    Time for 100 iterations (s)
GTC/No output     213.002
GTC/LUSTRE        231.86
GTC/Datatap       219.65

(b) Number of ions = 1164820

Run Parameters    Time for 100 iterations (s)
GTC/No output     422.33
GTC/LUSTRE        460.90
GTC/Datatap       434.53

Table 1. Comparison of GTC run times on the ORNL Cray XT3 development machine for two input sizes using different data output mechanisms

Thus, as the request service latency increases, the time taken to complete a send also increases. We are currently looking at ways to bypass this bottleneck. In contrast, the Portals datatap has very low latency, and the latency stays almost constant for increasing numbers of nodes. The bulk of the time is spent in marshaling the data, and we believe that this can also be optimized further. This demonstrates the feasibility of the datatap approach for environments with efficiently implemented transport layers.

4.4 Application-level Structured Stream Demonstration

Overview. The power of the structured stream abstraction is to provide a ready interface for programmers to overlap computation with I/O in the high performance environment. In order to demonstrate both the capability of the interface and its performance, we have chosen to implement a structured stream to replace the bulk of the output data from the Gyrokinetic Turbulence Code, GTC [23]. GTC is a particle-in-cell code for simulating fusion within tokamaks, and it is able to scale to multiple thousands of processors. In its default I/O pattern, the dominant cost is from each processor writing out the local array of particles into a separate file. This corresponds to writing out something close to 10% of the memory footprint of the code, with the write frequency chosen so as to keep the average overhead of I/O to within a reasonable percentage of total execution. As part of the standard process of accumulating and interpreting this data, these individual files are then aggregated and parsed into time series, spatially-bounded regions, etc., as per the needs of the following annotation pipeline.

Run-time comparisons. For this experiment, we replace the aforementioned bulk write with a structured stream publish event. We ran GTC with two sets of input parameters, with 528,410 ions and 1,164,820 ions, and compared the run-times for three different configurations. In Table 1, GTC/No output is the GTC configuration with no data output, GTC/LUSTRE outputs data to a per-process file on the Lustre filesystem, and GTC/Datatap uses SSDS's lightweight datatap functionality for data output. We compare the application run-times on the Cray XT3 development cluster at Oak Ridge National Laboratory. We observe a significant reduction in the overhead caused by the data output (from about 8% on Lustre to about 3% using the datatap). This decrease in overhead is also observed when we double the data size (by increasing the number of ions).
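As a cross-check, the overhead percentages quoted above follow directly from the Table 1 run times; the short calculation below (a sketch using only those values) reproduces them.

```c
#include <stdio.h>

/* Relative output overhead = (T_config - T_no_output) / T_no_output.
   Values are the Table 1 run times (seconds for 100 iterations).      */
int main(void) {
    const double no_out[2]  = {213.002, 422.33};   /* GTC/No output */
    const double lustre[2]  = {231.86,  460.90};   /* GTC/LUSTRE    */
    const double datatap[2] = {219.65,  434.53};   /* GTC/Datatap   */

    for (int i = 0; i < 2; i++)
        printf("input size %d: Lustre overhead %.1f%%, datatap overhead %.1f%%\n",
               i + 1,
               100.0 * (lustre[i]  - no_out[i]) / no_out[i],
               100.0 * (datatap[i] - no_out[i]) / no_out[i]);
    return 0;   /* prints roughly 8.9%/3.1% and 9.1%/2.9% */
}
```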

[Figure 7. Average latency in request servicing ("average observed latency for request completion"), as a function of the number of processors, for data sizes of 4MB–256MB: (a) IB-RDMA data tap on Linux cluster; (b) Portals data tap on Cray XT3.]

[Figure 8. Time to issue data transfer request, as a function of the number of processors, for data sizes of 4MB–256MB: (a) IB-RDMA data tap on Linux cluster; (b) Portals data tap on Cray XT3.]

I/O graph evaluation. The structured stream is configured with a simple I/O graph: datataps are placed in each of the GTC processes, feeding out asynchronously to an I/O node. From the I/O node, each of the messages is forwarded to a graph node where the data is partitioned into different bounding boxes, and then copies of both the whole data and the multiple small partitioned data sets are forwarded on to the storage nodes. In the first implementation, once the data is received by the datatap server we filter based on the bounding box and then transfer the data for visualization. The time taken to perform the bounding box computation is 2.29s and the time to transfer the filtered data is 0.037s. In the second implementation we transfer the data first and run the bounding box filter after the data transfer. The time taken for the bounding box filter is the same (2.29s), but the time taken to transfer the data increases to 0.297s. In the first implementation the total time taken to transfer the data and run the bounding box filter is lower, but the computation is performed on the datatap server, which places a higher load on the datatap and results in higher request service latency. In the second implementation the computation is performed on a remote node, thereby reducing the impact on the datatap.

5 Related Work

A number of different systems (among them NASD [15], Panasas [25], PVFS [19], and Lustre [6]) provide high-performance parallel file systems. Unlike these systems, SSDS provides a more general framework for manipulating data moving to and from storage. In particular, the higher-level semantic metadata information available in Structured Streams allows it to make more informed scheduling, staging, and buffering decisions than these systems. Each of these systems could, however, be used as an underlying storage system for SSDS, in a way similar to how SSDS currently uses LWFS [21].

[Figure 9. Aggregate bandwidth observed by the data consumer ("server observed ingress bandwidth"), as a function of the number of processors, for data sizes of 4MB–256MB: (a) IB-RDMA data tap on Linux cluster; (b) Portals data tap on Cray XT3.]

Previous work with Active Disks [29] is somewhat similar in spirit to SSDS, in that it provides a way for executable code to be hosted very close to the drive as a way to enhance performance for data analysis operations. Its main limitation is that it relies on manipulating data stored on the drive. Our approach focuses on pulling that functionality into the I/O graph and metabots. With I/O graphs, SSDS can manipulate the data before it reaches storage, while metabots provide similar functionality to Active Disks but with explicit consistency management for interaction with SSDS-generated I/O graphs.

Scientific workflow management systems like Kepler [20], Pegasus [8], Condor/G [31], and others [36] are also closely related to the general Structured Streams/SSDS approach described in this paper. Similarly, the SRB [28] project is developing a Data Grid Language (DGL) [33] to describe how to route data within a grid environment, coupled with transformations on the metadata associated with file data. Unlike the system described in this paper, these systems focus on coarse-grained application scheduling, metadata manipulation as opposed to file data manipulation, and wide-area Grid environments as opposed to tightly-coupled HPC systems. In addition, current workflow systems are not tightly coupled to the synchronous data transmission and cannot be reconfigured based on what the run-time has or has not completed. Our approach, in contrast, focuses on fine-grained scheduling, buffering, and staging issues in large-scale HPC systems, and allows the data itself to be annotated and manipulated both synchronously and asynchronously, all while still meeting application and system performance and consistency constraints. Because of the usefulness of such workflow systems, we are currently examining how the streaming data and metabot run-times could be integrated in future work to serve as a self-tuning actor within a more general workflow system like Kepler [20].

Our approach also shares some research goals with DataCutter [2, 34], which delivers client-specific data visualizations to multiple end points using data virtualization abstractions. DataCutter, unlike our system, requires end users to write custom data filters in C++ and limits automatically generated filters to a flat SQL-style approach for data selection. Our approach, in contrast, uses comparatively richer descriptions for both filter and transformation operations, which provides SSDS more optimization opportunities. In addition, DataCutter has no analogue to the asynchronous data manipulation provided by metabots. Finally, previous research with ECho [11] and Infopipes [27] provides the ability to dynamically install application-specific stream filters and transformers. Neither of these systems, however, dealt with the more general scheduling, buffering, and asynchronous data manipulation problems that we seek to address using Structured Streams.

6 Conclusions and Future Work

The structured stream abstraction presented here is a new way of approaching I/O for petascale machines and beyond. Through microbenchmarks and integration with a production HPC code, we have shown that it is possible to achieve high performance while also providing an extensible capability within the I/O system. The layering of asynchronous lightweight datataps with a high-performance data transport and computation system provides SSDS with a flexible and efficient mechanism for building an online data manipulation overlay. This enables us to address the needs of next generation leadership applications. Additionally, the integration of offline metadata annotation facilities provides a mechanism for shifting the embedded computation between online and offline execution. This allows the system to address run-time quality-of-service trade-offs such as the size of memory buffers in I/O nodes vs. computational demands vs. ingress/egress bandwidth.

As next steps, enriching the specification and scheduling capabilities for both online and offline data manipulation will further improve runtime performance. As a further extension of this, we intend to investigate utilizing autonomic techniques for performing tradeoffs relevant to application concepts of data utility. On the metabot side, the specific issue of data consistency models and verification will be a major driver. For datatap and I/O graph development, we will work to further enrich the model of embedding computation into the overlay to better exploit concurrency in transport and processing. Exploiting concurrency wherever possible, including in the I/O system, will be key to the widespread deployment and adoption of petascale applications.

References

[1] H. Abbasi, M. Wolf, K. Schwan, G. Eisenhauer, and A. Hilton. XChange: Coupling parallel applications in a dynamic environment. In Proc. IEEE International Conference on Cluster Computing, 2004.
[2] M. Beynon, R. Ferreira, T. M. Kurc, A. Sussman, and J. H. Saltz. DataCutter: Middleware for filtering very large scientific datasets on archival storage systems. In IEEE Symposium on Mass Storage Systems, pages 119–134, 2000.


[3] R. Brightwell, T. Hudson, R. Riesen, and A. B. Maccabe. The Portals 3.0 message passing interface. Technical Report SAND99-2959, Sandia National Laboratories, December 1999.
[4] F. E. Bustamante, G. Eisenhauer, K. Schwan, and P. Widener. Efficient wire formats for high performance computing. In Proc. Supercomputing 2000 (SC 2000), Dallas, Texas, November 2000.
[5] Z. Cai, V. Kumar, and K. Schwan. Self-regulating data streams for predictable high performance across dynamic network overlays. In Proc. 15th IEEE International Symposium on High Performance Distributed Computing (HPDC 2006), Paris, France, June 2006.
[6] Lustre: A scalable, high-performance file system. Cluster File Systems Inc. white paper, version 1.0, November 2002. http://www.lustre.org/docs/whitepaper.pdf.
[7] J. Clayton and D. McDowell. A multiscale multiplicative decomposition for elastoplasticity of polycrystals. International Journal of Plasticity, 19(9):1401–1444, 2003.
[8] E. Deelman, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, K. Blackburn, A. Lazzarini, A. Arbree, R. Cavanaugh, and S. Koranda. Mapping abstract complex workflows onto grid environments. J. Grid Comput., 1(1):25–39, 2003.
[9] G. Eisenhauer. The EVPath library. http://www.cc.gatech.edu/systems/projects/EVPath.
[10] G. Eisenhauer. Portable binary input/output. http://www.cc.gatech.edu/systems/projects/PBIO.
[11] G. Eisenhauer, F. Bustamante, and K. Schwan. Event services for high performance computing. In Proceedings of High Performance Distributed Computing (HPDC-2000), 2000.
[12] G. Eisenhauer and L. K. Daley. Fast heterogeneous binary data interchange. In Proceedings of the Heterogeneous Computing Workshop (HCW 2000), May 3–5, 2000. http://www.cc.gatech.edu/systems/papers/Eisenhauer00FHB.pdf.
[13] fdtree. http://www.llnl.gov/icc/lc/siop/downloads/download.html. Last visited: April 16, 2007.
[14] CCA Forum. MxN parallel data redistribution @ ORNL. http://www.csm.ornl.gov/cca/mxn/, January 2004.
[15] G. A. Gibson, D. P. Nagle, K. Amiri, F. W. Chang, E. Feinberg, H. Gobioff, C. Lee, B. Ozceri, E. Riedel, and D. Rochberg. A case for network-attached secure disks. Technical Report CMU-CS-96-142, Carnegie Mellon University, June 1996.
[16] O. V. T. Group. Exploratory visualization environment for research in science and technology (EVEREST). http://www.csm.ornl.gov/viz/.
[17] X. Jiao and M. T. Heath. Common-refinement-based data transfer between nonmatching meshes in multiphysics simulations. International Journal for Numerical Methods in Engineering, 61(14):2402–2427, December 2004.
[18] V. Kumar, B. F. Cooper, Z. Cai, G. Eisenhauer, and K. Schwan. Resource-aware distributed stream management using dynamic overlays. In Proceedings of the 25th IEEE International Conference on Distributed Computing Systems (ICDCS-2005), 2005.
[19] R. Latham, N. Miller, R. Ross, and P. Carns. A next-generation parallel file system for Linux clusters. LinuxWorld, 2(1), January 2004.
[20] B. Ludäscher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger, M. Jones, E. A. Lee, J. Tao, and Y. Zhao. Scientific workflow management and the Kepler system: Research articles. Concurr. Comput.: Pract. Exper., 18(10):1039–1065, 2006.
[21] R. A. Oldfield, A. B. Maccabe, S. Arunagiri, T. Kordenbrock, R. Riesen, L. Ward, and P. Widener. Lightweight I/O for scientific applications. In Proc. 2006 IEEE Conference on Cluster Computing, Barcelona, Spain, September 2006.
[22] R. A. Oldfield, D. E. Womble, and C. C. Ober. Efficient parallel I/O in seismic imaging. The International Journal of High Performance Computing Applications, 12(3):333–344, Fall 1998.
[23] L. Oliker, J. Carter, M. Wehner, A. Canning, S. Ethier, A. Mirin, G. Bala, D. Parks, P. Worley, S. Kitawaki, and Y. Tsuda. Leading computational methods on scalar and vector HEC platforms. In Proceedings of Supercomputing 2005, 2005.
[24] Interagency Working Group on High End Computing. HEC-IWG file systems and I/O R&D workshop. http://www.nitrd.gov/subcommittee/hec/workshop/20050816_storage/.
[25] Object-based storage architecture: Defining a new generation of storage systems built on distributed, intelligent storage devices. Panasas Inc. white paper, version 1.0, October 2003. http://www.panasas.com/docs/.
[26] S. Plimpton. Fast parallel algorithms for short-range molecular dynamics. Journal of Computational Physics, 117(1):1–19, 1995. http://lammps.sandia.gov/index.html.
[27] C. Pu, K. Schwan, and J. Walpole. Infosphere project: System support for information flow applications. SIGMOD Record, 30(1):25–34, 2001.
[28] A. Rajasekar, M. Wan, R. Moore, W. Schroeder, G. Kremenek, A. Jagatheesan, C. Cowart, B. Zhu, S.-Y. Chen, and R. Olschanowsky. Storage Resource Broker—managing distributed data in a Grid. Computer Society of India Journal, Special Issue on SAN, 33(4):42–54, October 2003.
[29] E. Riedel, C. Faloutsos, G. A. Gibson, and D. Nagle. Active disks for large-scale data processing. IEEE Computer, 34(6):68–74, June 2001.
[30] K. Schwan, B. F. Cooper, G. Eisenhauer, A. Gavrilovska, M. Wolf, H. Abbasi, S. Agarwala, Z. Cai, V. Kumar, J. Lofstead, M. Mansour, B. Seshasayee, and P. Widener. AutoFlow: Autonomic information flows for critical information systems. In M. Parashar and S. Hariri, editors, Autonomic Computing: Concepts, Infrastructure and Applications. CRC Press, 2006.
[31] D. Thain, T. Tannenbaum, and M. Livny. Distributed computing in practice: the Condor experience. Concurrency - Practice and Experience, 17(2-4):323–356, 2005.
[32] R. Thakur, W. Gropp, and E. Lusk. Data sieving and collective I/O in ROMIO. Technical Report ANL/MCS-P723-0898, Mathematics and Computer Science Division, Argonne National Laboratory, August 1998.


[33] J. Weinberg, A. Jagatheesan, A. Ding, M. Faerman, and Y. Hu. Gridflow description, query, and execution at SCEC using the SDSC Matrix. In HPDC '04: Proceedings of the 13th IEEE International Symposium on High Performance Distributed Computing (HPDC'04), pages 262–263, Washington, DC, USA, 2004. IEEE Computer Society.
[34] L. Weng, G. Agrawal, U. Catalyurek, T. Kurc, S. Narayanan, and J. Saltz. An approach for automatic data virtualization. In HPDC, pages 24–33, 2004.
[35] M. Wolf, Z. Cai, W. Huang, and K. Schwan. SmartPointers: Personalized scientific data portals in your hand. In Proceedings of Supercomputing 2002, November 2002. http://www.sc-2002.org/paperspdfs/pap.pap304.pdf.
[36] J. Yu and R. Buyya. A taxonomy of scientific workflow systems for grid computing. SIGMOD Rec., 34(3):44–49, 2005.

