Processing Large-Scale Multidimensional Data in Parallel and Distributed Environments *

Michael Beynon†, Chialin Chang†, Umit Catalyurek‡, Tahsin Kurc‡, Alan Sussman†, Henrique Andrade†, Renato Ferreira†, Joel Saltz‡

† Dept. of Computer Science, University of Maryland, College Park, MD 20742
‡ Dept. of Biomedical Informatics, The Ohio State University, Columbus, OH 43210

{beynon,chialin,umit,als,hcma,[email protected]
{kurc-1,[email protected]
Abstract

Analysis of data is an important step in understanding and solving a scientific problem. Analysis involves extracting the data of interest from all the available raw data in a dataset and processing it into a data product. However, in many areas of science and engineering, a scientist's ability to analyze information is increasingly becoming hindered by dataset sizes. The vast amount of data in scientific datasets makes it a difficult task to efficiently access the data of interest, and to manage potentially heterogeneous system resources to process the data. Subsetting and aggregation are common operations executed in a wide range of data-intensive applications. We argue that common runtime and programming support can be developed for applications that query and manipulate large datasets. This paper presents a compendium of frameworks and methods we have developed to support efficient execution of subsetting and aggregation operations in applications that query and manipulate large, multi-dimensional datasets in parallel and distributed computing environments.

Key words: data-intensive applications, multi-dimensional datasets, parallel processing, distributed computing, runtime systems.
* This research was supported by the National Science Foundation under Grants #ACI-9619020 (UC Subcontract #10152408) and #ACI-9982087, the Office of Naval Research under Grant #N6600197C8534, Lawrence Livermore National Laboratory under Grant #B500288 (UC Subcontract #10184497), and the Department of Defense, Advanced Research Projects Agency, USAF, AFMC through Science Applications International Corporation under Grant #F30602-00-C-0009 (SAIC Subcontract #4400025559).
1 Introduction

There is a large body of research devoted to developing high-performance architectures and algorithms for efficient execution of large-scale scientific applications. Moreover, it is becoming increasingly efficient to use collections of high-performance machines for application execution, because of the availability of faster networks and of tools for discovery, allocation, and management of distributed resources. As a result, long-running, large-scale simulations [20,46,56,58] are producing unprecedented amounts of data. In addition, advanced sensors attached to instruments, such as earth-orbiting satellites and medical instruments [3,62], are generating very large datasets that must be made available to a wider audience.

Looking at available technology, disk space has become plentiful and relatively inexpensive. Using off-the-shelf components, it is currently possible to build a disk-based storage cluster with about 1 Terabyte of storage space, consisting of six Pentium III PCs, each with two 80GB EIDE disks, for about $10,000. The availability of such low-cost systems, built from networks of commodity computers and high-capacity disks, has greatly enhanced a scientist's ability to store large-scale scientific data. However, the primary goal of gathering data is better understanding of the scientific problem at hand, and data analysis is key to this understanding. The vast amount of data available in scientific datasets makes it an onerous task for a scientist both to efficiently access the data, and to manage the system resources required to process it.

A growing set of data-intensive applications query and analyze collections of very large multi-dimensional datasets. Examples of such applications include satellite data processing [24,27,62], full-scale water contamination studies and surface/subsurface petroleum reservoir simulations [44,66], visualization and processing of digitized microscopy images [3], visualization of large-scale data [5,8,29,42,61], and data mining [4,7,34,68]. Although the datasets used for analysis and the data products generated by applications that manipulate those datasets may differ in many ways, a close look at many data-intensive applications [17,21,31,42,44] reveals that there exist commonalities in their data access patterns and processing structures. Analysis requires extracting the data of interest from the dataset, and processing and transforming it into a new data product that can be more efficiently consumed by another program or analyzed by a human. Subsetting of data is often done through range queries, and aggregation (reduction) operations are commonly executed in the data processing step of a wide range of applications.

We argue that frameworks and methods can be developed that will provide common programming and runtime support for a wide range of applications that make use of large scientific datasets. In this paper, we present an overview
of the methods and frameworks we have developed for efficient execution of applications that query and manipulate large, multi-dimensional datasets. The algorithms and runtime systems presented in this paper target architectures that range from tightly coupled distributed-memory parallel machines with attached disk farms to heterogeneous collections of high-performance machines and storage systems in a distributed computing environment.
2 Overview

In this section we briefly describe several data-intensive applications that have motivated the design and implementation of the algorithms and frameworks presented in this paper. We also discuss data access and processing patterns commonly observed in these applications.

2.1 Motivating Applications
Satellite Data Processing. Earth scientists study the earth by processing remotely-sensed data continuously acquired from sensors attached to satellites. A typical analysis processes satellite data for ten days to a year (for the AVHRR sensor, ten days of data is about 4 GB) and generates one or more composite images of the area under study [1,24]. Generating a composite image requires projection of the region of interest onto a two-dimensional grid; each pixel in the composite image is computed by selecting the "best" sensor value that maps to the associated grid point. An earth scientist specifies the projection that best suits her needs.
Analysis of Microscopy Data: Virtual Microscope. The Virtual Microscope [3,17] provides a realistic digital emulation of a high power light microscope. The raw data for such a system can be captured by digitally scanning collections of full microscope slides under high power. The size of a slide with a single focal plane can be up to several gigabytes, uncompressed. Hundreds of such digitized slides can be produced in a single day in a large hospital. The processing for the Virtual Microscope requires projecting high resolution data in the region of interest on the slide onto a grid of suitable resolution (governed by the desired magnification) and appropriately compositing the pixels mapping onto a single grid point, to avoid introducing spurious artifacts into the displayed image.
Coupling of Environmental Codes: Water Contamination Studies. Powerful simulation tools are crucial to understand and predict the transport and reaction of chemicals in bays and estuaries [44]. Such tools include a hydrodynamics simulator [46], which simulates the flow of water in the domain of interest, and a chemical transport simulator [20], which simulates the reactions between chemicals in the bay and the transport of these chemicals. For a complete simulation system, the hydrodynamics simulator needs to be coupled to the chemical transport simulator, since the latter uses the output of the former to simulate the transport of chemicals within the domain. As the chemical reactions have little effect on the circulation patterns, the fluid velocity data can be generated once and used for many contamination studies. The output data from a large grid at a single time step may be several megabytes, and thousands of time steps may need to be simulated for a particular scenario. The grids used by the chemical simulator may be different from the grids the hydrodynamics simulator employs, and the chemical simulator usually uses coarser time steps. Therefore, running a chemical transport simulation requires retrieving the hydrodynamics output in the region of interest (i.e., a region of the grid over a specified time period) from the appropriate hydrodynamics dataset, averaging the hydrodynamics outputs over time, and projecting them onto the grid used by the chemical transport simulator.
Visualization of Simulation Datasets: Iso-surface Rendering. The study and characterization of ground waterways and oil reservoirs involve simulation of the transport and reaction of various chemicals over many time steps on a three-dimensional grid that represents the region of interest. In a typical analysis of datasets generated by a simulation, a scientist examines the transport of one or more chemicals in the region being studied over several time steps [42]. Visualization is key to understanding the results of the simulation, and iso-surface rendering is a well-suited method to visualize the density distributions of chemicals in a region. Given a three-dimensional grid with scalar values at grid points and a user-defined scalar value, called the iso-surface value, an iso-surface rendering algorithm extracts the surface on which the scalar value is equal to the iso-surface value. The extracted surface (iso-surface) is rendered to generate an image. In general, the iso-surface is approximated by a list of polygons [45], and a polygon rendering algorithm (e.g., the z-buffer algorithm) is employed to produce the output image [65].
Mining Interesting Patterns: Decision Tree Construction. The goal of data mining is to discover interesting and useful, but a priori unknown, patterns from large databases. Classification is one of the important problems in data mining, and has applications in many fields, such as financial analysis and medical diagnosis [55,68]. In classification, we are given a subset of all records in the dataset, called the training set, in which each record consists of several fields, referred to as attributes. An attribute can be either a numerical attribute or a categorical attribute. If the values of an attribute belong to an ordered domain, the attribute is called a numerical attribute (e.g., income, age). A categorical attribute, on the other hand, has values from an unordered domain (e.g., type of car, house, job, department name). One of the categorical attributes is designated as the classification attribute; its values are called class labels. The goal of classification is to create a concise model of the classification attribute based on the other attributes. Once such a model is constructed, future records, which are not in the training set, can be classified using the model. A decision-tree classifier builds a tree by dividing the training set into partitions so that all or most of the records in a partition (leaf node) have the same class label [55]. A tree is grown by splitting each leaf node at the current level, starting from the root node, which contains the entire training set, into child nodes. The records associated with a tree node are split into partitions based on the split condition. For each numerical attribute, two histograms showing the class distribution of the attribute values are computed, and for each categorical attribute a count table is created, which stores the class distribution of the records for that attribute. The histograms and count tables are used to find the best split point at a tree node [55]. Computing them involves counting the class values of records and adding the counts to the respective tables and histograms.
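To make the counting step concrete, the following minimal C++ sketch (with hypothetical record and histogram layouts) tallies the class distribution on either side of a candidate split point for a numerical attribute; this is the kind of per-attribute counting that split selection consumes.

#include <vector>

// Hypothetical training record: one numerical attribute and a class label.
struct Record { double age; int classLabel; };

// Class-distribution histograms for records below and above a split point.
void countClasses(const std::vector<Record> &records, double splitPoint,
                  int numClasses,
                  std::vector<long> &below, std::vector<long> &above) {
    below.assign(numClasses, 0);
    above.assign(numClasses, 0);
    for (const Record &r : records) {
        if (r.age <= splitPoint) below[r.classLabel]++;
        else                     above[r.classLabel]++;
    }
}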
2.2 Data Access and Processing Structure: Range Queries and Reduction Operations

Most of the datasets accessed and manipulated by the applications described in the previous sections are multi-dimensional. That is, data items are associated with points in a multi-dimensional attribute space. The data dimensions can be spatial coordinates and time, or varying conditions, such as temperature, velocity or chemical concentration values in an environmental simulation application, or income and age in a data mining application. Oftentimes, the data of interest can be described by a range query. A range query defines a multi-dimensional bounding box in the underlying multi-dimensional attribute space of the dataset(s). Only the data items whose associated coordinates fall within the multi-dimensional box are retrieved.

Performing reduction-type operations is one of the common processing patterns observed in applications that analyze large datasets. Figure 1 shows the high-level pseudo-code of the basic data processing loop. The function Select(...) identifies the set of data items in a dataset that intersect a given range query. An intermediate data structure, referred to here as an accumulator, can be used to hold intermediate results during processing. For example, a z-buffer can be used as an accumulator to hold color and distance values in an iso-surface rendering application [54]. In data mining, histograms and count tables can be used to hold the distribution of attribute values for building decision trees [55]. Accumulator items are allocated and initialized during the initialization phase (steps 3-6).
1.  DU ← Select(Output Dataset, Range Query)
2.  DI ← Select(Input Dataset, Range Query)
    (* Initialization *)
3.  foreach ue in DU do
4.      read ue
5.      ae ← Initialize(ue)
6.  endfor
    (* Reduction *)
7.  foreach ie in DI do
8.      read ie
9.      SA ← Map(ie)
10.     foreach ae in SA do
11.         ae ← Aggregate(ie, ae)
12.     endfor
    endfor
    (* Output *)
13. foreach ae do
14.     ue ← Finalize(ae)
15.     write ue
16. endfor

Fig. 1. The basic data processing loop.
The processing steps consist of retrieving data items that intersect the range query (step 8), mapping the retrieved input items to the corresponding output items (step 9), and aggregating, in some application-specific way, all the input items that map to the same output data item (steps 10-11). The mapping function, Map(ie), is an application-specific function, which may map an input item to a set of output items. The aggregation function, Aggregate(ie, ae), aggregates the value(s) of an input item ie with the intermediate results stored in the accumulator item ae that corresponds to one of the output items that ie maps to. Finally, the intermediate results stored in the accumulator are post-processed to produce the final results for the output dataset (steps 13-16). Steps 1 and 4 are needed when the processing of data updates an already existing dataset, and data items are needed to initialize accumulator elements. The output can be stored on disks in the system (step 15) or can be consumed by another program (e.g., displayed by a client program in a visualization application). The output dataset is usually much smaller than the input dataset, hence steps 7-12 are called the reduction phase of the processing. Aggregation functions in the reduction phase are usually commutative and associative, i.e., the correctness of the output data values does not depend on the order in which input data items are aggregated.
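As a concrete illustration of the user-defined functions in Figure 1, the sketch below shows what Initialize and Aggregate might look like for the satellite compositing application of Section 2.1, where the "best" sensor value is kept for each grid point. The structure and function names are hypothetical and do not correspond to an actual ADR interface; note that the aggregation is commutative and associative, as the loop requires.

// Hypothetical sketch: one accumulator cell of the composite image and one
// input sensor reading.  "Best" value selection here is highest-NDVI style.
struct SensorValue { double lat, lon, ndvi; unsigned char band[5]; };
struct GridCell    { double best_ndvi; unsigned char band[5]; bool valid; };

// Initialize(ue): start with an empty cell.
void initialize(GridCell &cell) {
    cell.valid = false;
    cell.best_ndvi = -1.0;
}

// Aggregate(ie, ae): keep the "best" (highest NDVI) reading seen so far.
// Readings may arrive in any order without changing the result.
void aggregate(const SensorValue &in, GridCell &cell) {
    if (!cell.valid || in.ndvi > cell.best_ndvi) {
        cell.best_ndvi = in.ndvi;
        for (int b = 0; b < 5; b++) cell.band[b] = in.band[b];
        cell.valid = true;
    }
}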
3 Supporting Reduction Operations on Distributed-Memory Parallel Machines

The implementation of aggregation operations on a parallel machine requires distribution of data and computations among disks and processors to make efficient use of the aggregate storage space and computing power, and careful scheduling of data retrieval, computation and network operations to keep all resources (i.e., disks, processor memory, network, and CPU) busy without overloading any of them. We have developed a framework, called the Active Data Repository (ADR) [21,31], that provides support for applications that perform range queries with user-defined aggregation operations on multi-dimensional datasets, to be executed on a distributed-memory parallel machine with an attached disk farm. In this section, we briefly describe the framework, and present algorithms and optimization techniques developed in the ADR framework.

3.1 Active Data Repository Framework

3.1.1 Storing Datasets
A dataset is partitioned into and stored as a set of data chunks. A data chunk contains a subset of the data items in the dataset. The dataset is partitioned into data chunks by the application developer, and data chunks in a dataset can have different sizes. Since data is accessed through range queries, it is desirable to place data items that are close to each other in the multi-dimensional space in the same data chunk. A data chunk is the unit of data retrieval; that is, it is retrieved as a whole during processing. Retrieving data in chunks instead of as individual data items reduces I/O overheads (e.g., seek time), resulting in higher application-level I/O bandwidth. As every data item is associated with a point in a multi-dimensional attribute space, every data chunk is associated with a minimum bounding rectangle (MBR). The MBR of a data chunk is the smallest box in the underlying multi-dimensional space that encompasses the coordinates of all the items in the data chunk.

Data chunks are distributed across the disks in the system to fully utilize the aggregate storage space and disk bandwidth. To take advantage of the data access patterns exhibited by range queries, data chunks that are close to each other in the underlying attribute space should be assigned to different disks. In the ADR framework, we employ a Hilbert curve-based declustering algorithm [28] to distribute the chunks across the disks. Hilbert curve algorithms are fast and exhibit good clustering and declustering properties. Other declustering algorithms, such as those based on graph partitioning [47], can also be
used. Each chunk is assigned to a single disk, and is read and written only by the local processor to which the disk is attached. If a chunk is required for processing by one or more remote processors, it is sent to those processors as a whole by the local processor via interprocessor communication. After data chunks are assigned to disks, a multi-dimensional index is constructed using the MBRs of the chunks. The index on each processor is used to quickly locate the chunks with MBRs that intersect a given range query. Efficient spatial data structures, such as R-trees and their variants [9], can be used for indexing and accessing multi-dimensional datasets.
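Index lookup ultimately reduces to testing whether a chunk's MBR overlaps the query box; a minimal sketch of that test is shown below, with a hypothetical fixed-dimension MBR standing in for whatever metadata the index actually stores.

// Hypothetical d-dimensional MBR; lo[i] <= hi[i] for every dimension i.
const int NDIM = 3;
struct MBR { double lo[NDIM], hi[NDIM]; };

// A chunk's MBR intersects a range query iff the intervals overlap in
// every dimension.  Chunks failing this test are never read from disk.
bool intersects(const MBR &chunk, const MBR &query) {
    for (int i = 0; i < NDIM; i++) {
        if (chunk.hi[i] < query.lo[i] || chunk.lo[i] > query.hi[i])
            return false;
    }
    return true;
}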
3.1.2 Query Processing

The processing of a range query is accomplished in two steps: a query plan is computed in the query planning step, and the actual data retrieval and processing is carried out in the query execution step according to the query plan.

Query planning is carried out in three phases: index lookup, tiling and workload partitioning. In the index lookup phase, the indices associated with the datasets are used to identify all the chunks that intersect the query. If the accumulator data structure is too large to fit entirely in memory, it must be partitioned into output tiles, each of which contains a disjoint subset of accumulator elements. Partitioning is done in the tiling phase so that the size of a tile is less than the amount of memory available for the accumulator. A tiling of the accumulator implicitly results in a tiling of the input dataset. Each input tile contains the input chunks that map to the corresponding output tile. Since an input element may map to multiple accumulator elements, the corresponding input chunk may appear in more than one input tile if the accumulator chunks are assigned to different tiles. During query execution, input chunks placed in multiple input tiles are retrieved multiple times, once per output tile. Therefore, care should be taken to minimize the boundaries of the output tiles so as to reduce the number of such input chunks. In the workload partitioning phase, the workload associated with a tile is partitioned among processors.

In the query execution step, the processing of an output tile is carried out according to the query plan. A tile is processed in four phases; a query iterates through these phases repeatedly until all tiles have been processed and the entire output has been computed.

(1) Initialization. Accumulator elements for the current tile are allocated space in memory and initialized.
(2) Reduction. Each processor retrieves data chunks stored on its local disks. Data items in a data chunk are aggregated into the accumulator elements allocated in each processor's memory during phase 1.
(3) Global Combine. If necessary, partial results computed by each
processor in phase 2 are combined across the processors via inter-processor communication to compute the final results for the accumulator.
(4) Output Handling. The final output for the current tile is computed from the corresponding accumulator values computed in phase 3. The output is either sent back to a client or stored back to the disks.
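The per-tile execution loop can be pictured with the following minimal C++ sketch. The class and method names are hypothetical; they only mirror the four phases above and the customization-by-inheritance style of the ADR class library described in the next section, not its actual signatures.

// Hypothetical skeleton of per-tile query execution.  An application
// customizes the virtual methods; the framework drives the loop.
class TileProcessor {
public:
    virtual void initialize(int tile) = 0;      // phase 1: allocate/init accumulator
    virtual void localReduction(int tile) = 0;  // phase 2: aggregate local chunks
    virtual void globalCombine(int tile) = 0;   // phase 3: merge partial results
    virtual void outputHandling(int tile) = 0;  // phase 4: finalize and ship/store
    virtual ~TileProcessor() {}
};

void executeQuery(TileProcessor &tp, int numTiles) {
    for (int tile = 0; tile < numTiles; tile++) {
        tp.initialize(tile);
        tp.localReduction(tile);   // overlaps disk reads, communication, compute
        tp.globalCombine(tile);    // inter-processor communication if needed
        tp.outputHandling(tile);
    }
}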
3.2 An Implementation of the ADR framework

We have developed an implementation of the ADR framework as a set of modular services, implemented as a C++ class library, and a runtime system [21,31]; the software and user's manual can be downloaded from http://www.cs.umd.edu/projects/adr. Several of the services allow customization for user-defined processing. A unified interface is provided for customizing these services via C++ class inheritance and virtual functions. To implement application-specific processing of out-of-core data with ADR, an application developer has to provide accumulator data structures and functions that operate on in-core data.

An ADR application consists of one or more clients, a front-end process, and a customized back-end. The front-end interacts with clients, translates client requests into queries, and sends one or more queries to the parallel back-end. Since the clients can connect and generate queries in an asynchronous manner, the existence of a front-end relieves the back-end from being interrupted by clients during the processing of queries. The back-end is responsible for storing datasets and carrying out application-specific processing of the data on the parallel machine. The back-end runtime system provides support for common operations such as index lookup, management of system memory, and scheduling of data retrieval and processing operations across the parallel machine. During the processing of a query, the runtime system tries to overlap disk operations, network operations and processing as much as possible. Overlap is achieved by maintaining explicit queues for each kind of operation (data retrieval, message sends and receives, data processing) and switching between queued operations as required. Pending asynchronous I/O and communication operations in the operation queues are polled and, upon their completion, new asynchronous operations are initiated when more work is required and memory buffer space is available. Data chunks are therefore retrieved and processed in a pipelined fashion.

We have developed several applications [18,21,42] using the ADR framework implementation. In the following section, we describe the implementation of the Virtual Microscope (VM) (see Section 2) as an example application and present experimental performance results. Figure 2 illustrates the VM client graphical user interface.
Fig. 2. The Virtual Microscope client.

3.2.1 The Virtual Microscope using ADR
In the Virtual Microscope (VM) application, the digitized image from a slide is essentially a three-dimensional dataset, because each slide may consist of multiple focal planes. In other words, each digitized slide consists of several stacked two-dimensional images. However, the portion of the entire image that must be retrieved to provide a view into the slide for any given set of query parameters (i.e., area of interest, magnification and focal plane) is two-dimensional. Therefore, to optimize performance, each two-dimensional image (a focal plane) can be considered separately for partitioning into chunks and for declustering chunks across disks. Most queries require processing only a small portion of the image. Hence the chunks must be big enough to efficiently use the disk subsystem, but not so big that too much unneeded data is retrieved and processed. An image chunk is used as the unit of data storage and retrieval in the implementation of VM using ADR. That is, a chunk and its associated metadata (the position of the chunk in the whole image and its size) are stored as a single chunk in a data file. The chunks are distributed across the system disks using a Hilbert curve-based algorithm [28].

It is also clear that the images should be stored in a compressed form. In the current implementation we selected JPEG compression as the default compression method because of the availability of fast and stable compression/decompression libraries [37]. The aggregation function in the VM implementation therefore first performs data decompression on a retrieved chunk, then carries out clipping and subsampling operations on the uncompressed data chunk to produce lower resolution images as needed (data is only stored at the highest resolution). The index implementation exploits the fact that the
chunks are non-overlapping and that the slides are fully rectangular images without holes. The chunks are numbered in row-major order, and their location and size in the corresponding data file are stored in a two-dimensional matrix, each element of which corresponds to a data chunk in a focal plane. Given the location of a data chunk in the overall image, the index maps it to the corresponding entry in the matrix, so each chunk that intersects the query can be located very quickly. If it becomes necessary to store datasets with holes in the images, or to store images that are not rectangular, an R-tree based indexing method can be employed.
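A minimal sketch of this lookup is shown below. The chunk dimensions, matrix layout and entry type are hypothetical; the point is simply that the query rectangle, given in pixel coordinates, converts directly to a range of row-major matrix indices, so no tree traversal is needed.

#include <vector>

// Hypothetical per-chunk metadata stored in the row-major index matrix.
struct ChunkEntry { long fileOffset; int byteSize; };

// Return the index-matrix entries whose chunks intersect a pixel-space query.
// chunkW/chunkH: chunk dimensions in pixels; cols: chunks per image row.
// Assumes the query rectangle lies within the image bounds.
std::vector<ChunkEntry> lookup(const std::vector<ChunkEntry> &matrix,
                               int cols, int chunkW, int chunkH,
                               int qx0, int qy0, int qx1, int qy1) {
    std::vector<ChunkEntry> hits;
    int c0 = qx0 / chunkW, c1 = qx1 / chunkW;   // column range of the query
    int r0 = qy0 / chunkH, r1 = qy1 / chunkH;   // row range of the query
    for (int r = r0; r <= r1; r++)
        for (int c = c0; c <= c1; c++)
            hits.push_back(matrix[r * cols + c]);   // row-major numbering
    return hits;
}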
3.2.2 Experimental Results

We present experimental performance results on a Linux PC cluster. The PC cluster consists of one front-end node and five processing nodes, with a total of 800GB of disk storage. Each processing node has an 800MHz Pentium III CPU, 128MB main memory, and two 5400RPM Maxtor 80GB EIDE disks. The processing nodes are interconnected via 100Mbps switched Ethernet. The front-end node is also connected to the same switch.

We have used the driver program described in [10] to emulate the behavior of multiple simultaneous end users (clients). The implementation of the client driver is based on a workload model that was statistically generated from traces collected from real experienced users. Interesting regions are modeled as points in the slide, and provided as an input file to the driver program. When a user pans near an interesting region, there is a high probability that a request will be generated. The driver adds noise to requests to avoid multiple clients asking for the same region. In addition, the driver avoids having all the clients scan the slide in the same manner; the slide is swept through in either an up-down fashion or a left-right fashion. In the experiments we use a slide consisting of 32336x27840 3-byte pixels (that slide, and 22 others, can be accessed from the Johns Hopkins Medical Institutions Virtual Microscope web page located at http://vmscope.jhmi.edu).

Performance results for the VM data server using different chunk sizes are displayed in Figure 3(a). In this figure, the 400x, 200x, 100x and 50x bars show the average response time of the VM data server to queries at different resolutions, where 400x is the highest resolution data that is actually stored in the server. The overall bar displays the average response time of the VM system to the queries at all resolutions. As seen in the figure, a chunk size of 256x256 produces the best response time at each resolution, and therefore the best overall average for a 512x512 output image. Both 128x128 and 512x512 chunk sizes result in response times that are approximately 33% higher.
Fig. 3. (a) Performance results for the ADR VM server running on 5 processors for varying image chunk sizes: average response time of the server for queries that produce a 512x512 output image. (b) Performance figures for the server on varying numbers of processors to produce 512x512 output images. Each client submits 100 queries to the server.
Increasing the chunk size decreases system performance because with too large a chunk size all of the processing nodes in the data server cannot be efficiently utilized, especially for queries requesting a relatively small output image. As chunk size increases, the number of chunks that intersect a fixed-size user request decreases. For example, with a chunk size of 2048x2048, a query requesting an output of size 512x512 at the highest resolution intersects either 1, 2 or 4 chunks. It is highly probable that most such queries will intersect only 1 chunk because of the large chunk size. In that case four out of the five processors in the data server will be idle.

Figure 3(b) displays the average response time for queries generated by multiple concurrently running clients. Since a 256x256 chunk size gave the best response time for a single client query, we selected 256x256 as the default chunk size for this experiment. Each client is an instance of the driver program described earlier in this section and generates 100 queries (note that each client will generate a somewhat different set of queries due to the design of the client driver). The generated query set contains queries at different resolutions, hence some of the queries (those at lower resolutions) require processing more data at the VM data server, since the stored data is at the highest resolution. For example, a query at 50x magnification requires processing 64 times more data than a query requesting an output image at 400x magnification. The response times shown in the figures are the average response time for a single query. As seen in the figure, the performance of the ADR version of the VM server scales well as the number of clients increases. For example, with 5 clients, the speedup for five server processors is 3.6 compared to a one-processor server.
3.3 Query Processing Strategies
Workload partitioning and tiling have significant effects on the performance of an application implemented using the ADR framework. We have evaluated several potential strategies [22,23,43] that use different workload partitioning and tiling schemes. To simplify the presentation, we assume that the target range query involves only one input and one output dataset. Both the input and output datasets are assumed to be already partitioned into data chunks and declustered across the disks in the system. In the following discussion we assume that an accumulator chunk is allocated in memory for each output chunk to hold the partial results, and that an accumulator chunk is the same as an output chunk. Therefore, output chunk and accumulator chunk are used interchangeably in this section.

In all of the algorithms discussed in this section, we employ Hilbert space-filling curves [28] in the tiling phase. Our goal is to minimize the total length of the boundaries of the tiles, by assigning chunks that are spatially close in the multi-dimensional attribute space to the same tile, to reduce the number of input chunks crossing one or more boundaries. The advantage of using Hilbert curves is that they have good clustering properties [47], since they preserve locality. In our implementation, the mid-point of the bounding box of each output chunk is used to generate a Hilbert curve index. The chunks are sorted with respect to this index, and selected in this order for tiling. The current implementation, however, does not take into account the distribution of input chunks in the output attribute space, so for some distributions of the input data in its attribute space there can still be many input chunks intersecting multiple tiles, despite a small boundary length.
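For reference, the following sketch shows one standard way to compute a 2-D Hilbert index for a chunk's MBR mid-point, so that chunks sorted by this key can be assigned to tiles in curve order. The routine is the classic xy-to-distance conversion for an order-n curve and is only illustrative of the tiling order, not the exact code used in ADR.

#include <algorithm>
#include <utility>
#include <vector>

// Classic 2-D Hilbert curve: map grid point (x, y) on an n x n grid
// (n a power of two) to its distance d along the curve.
unsigned long hilbert_d(unsigned long n, unsigned long x, unsigned long y) {
    unsigned long d = 0;
    for (unsigned long s = n / 2; s > 0; s /= 2) {
        unsigned long rx = (x & s) ? 1 : 0;
        unsigned long ry = (y & s) ? 1 : 0;
        d += s * s * ((3 * rx) ^ ry);
        // Rotate/reflect the quadrant so the lower-order bits line up.
        if (ry == 0) {
            if (rx == 1) { x = n - 1 - x; y = n - 1 - y; }
            std::swap(x, y);
        }
    }
    return d;
}

struct Chunk { unsigned long midX, midY; /* ... chunk metadata ... */ };

// Sort output chunks by the Hilbert index of their MBR mid-points, then
// assign them to tiles in this order until each tile's memory budget fills.
void orderForTiling(std::vector<Chunk> &chunks, unsigned long gridSize) {
    std::sort(chunks.begin(), chunks.end(),
              [gridSize](const Chunk &a, const Chunk &b) {
                  return hilbert_d(gridSize, a.midX, a.midY) <
                         hilbert_d(gridSize, b.midX, b.midY);
              });
}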
3.3.1 Fully Replicated Accumulator (FRA) Strategy.
In the FRA strategy, each processor performs the processing associated with its local input chunks. The accumulator is partitioned into tiles, each of which fits into the available local memory of a single processor. This scheme effectively replicates all of the accumulator chunks in a tile on each processor. During the reduction phase, each processor generates partial results for the accumulator chunks using only its local input chunks. Replicated accumulator chunks are then forwarded to the processors that own the corresponding accumulator chunks during the global combine phase to produce the complete intermediate result.
3.3.2 Sparsely Replicated Accumulator (SRA) Strategy.
The FRA strategy replicates each accumulator chunk on every processor, even if no input chunks on some of those processors will be aggregated into the accumulator chunk. This causes unnecessary overhead in the initialization phase of query execution, and extra communication and computation in the global combine phase. The available memory in the system is also not efficiently used, because of the unnecessary replication. Such replication may result in more tiles being created than necessary, which may cause a large number of input chunks to be retrieved from disk more than once. In the SRA strategy, a replicated accumulator chunk is allocated only on processors owning at least one input chunk that maps to that accumulator chunk.

3.3.3 Distributed Accumulator (DA) Strategy.
In this scheme, every processor is responsible for all processing associated with its local output chunks. Tiling is done by selecting, for each processor, local output chunks from that processor until the memory space allocated for the corresponding accumulator chunks on the processor is filled. Since no accumulator chunks are replicated by the DA strategy, no replicated chunks are allocated. This allows DA to make more effective use of memory and produce fewer tiles than the other two schemes. As a result, fewer input chunks are likely to be retrieved for multiple tiles. Furthermore, DA avoids interprocessor communication for accumulator chunks during the initialization phase and for replicated chunks during the global combine phase, and also requires no computation in the global combine phase. On the other hand, it introduces communication in the reduction phase for input chunks; all the remote input chunks that map to the same output chunk must be forwarded to the processor that owns the output chunk. Since a projection function may map an input chunk to multiple output chunks, an input chunk may be forwarded to multiple processors.

3.3.4 A Hypergraph-based Strategy
In this strategy [22], workload partitioning is formulated as a hypergraph partitioning problem. A hypergraph is a generalization of a graph in which each hyperedge (also called a net) can connect more than two vertices. We first describe the tiling algorithm, and then the workload partitioning algorithm.
Tiling: The memory requirement of an output tile, which determines how many output chunks can fit in an output tile, depends on how many output chunks are replicated on processors, and that information is only available after a workload partitioning has been computed for the output tile.
Fig. 4. (a) An example mapping between input chunks a-e and output chunks x and y. (b) The aggregation hypergraph for the example mapping in (a).
To circumvent the dependency between the tiling process and the workload partitioning process, we employ the following tiling algorithm.

(1) Create output tiles, assuming each tile is replicated on all processors. This conservative approach guarantees that for any possible workload partitioning solution computed by the workload partitioning algorithm, no single output tile can use more processor memory than is available, although more tiles than necessary may be generated. We refer to each tile generated in this step as a small tile.
(2) Apply the workload partitioning algorithm to each small tile, and compute the actual memory requirement for each small tile.
(3) Merge the small tiles to form the final tiles, based on the actual memory requirements of the small tiles, without violating the memory constraint on any processor.
Workload Partitioning: For a given output tile, the hypergraph-based algorithm (HG) uses a hypergraph, referred to as an aggregation hypergraph, to model the aggregation operations involving pairs of corresponding input and output chunks. In an aggregation hypergraph, an aggregation vertex is created for each input-output chunk pair. One processor vertex is also created for each processor in the target machine. Every input and every output chunk is represented as a net that connects the vertices corresponding to the aggregation operations that need the data chunk. The net also connects to the processor vertex for the processor that owns the input or output chunk. Figure 4(b) shows the aggregation hypergraph for the example mapping in Figure 4(a), with circles representing vertices and lines representing nets. In the figure, each aggregation vertex is labeled with an input-output pair, each processor vertex is labeled with a processor id, and each net is labeled with a data chunk.

HG assigns weights to the vertices and nets of an aggregation hypergraph to model I/O, communication and computation time for aggregation operations. The weight of a processor vertex is the total time required for the processor to read its local input and output chunks. The weight of an aggregation vertex is the time required to perform the aggregation operation involving the input-output chunk pair. For a net that corresponds to an input chunk, the weight is the time to send the input chunk to a remote processor. For a net that corresponds to an output chunk, the weight is the time to (1) send the output chunk from its owner to a remote processor, (2) initialize the replicated output chunk on the remote processor, (3) send the replicated output chunk back to the owner processor, and (4) combine the replicated output chunk with the local output chunk on the owner processor.

A P-way cut of a hypergraph partitions the vertices into P disjoint partitions. The weight of a partition is defined as the sum of the weights of all vertices assigned to that partition. The connectivity of a net for a cut is defined to be the number of partitions that are connected by the net. The HG algorithm computes a workload partitioning that minimizes execution time by solving the following optimization problem: Given a threshold ε and an aggregation hypergraph with P processors, compute a P-way cut C such that (1) each partition contains exactly one processor vertex, (2) the difference between the weights of any two partitions does not exceed ε, and (3) the following cost function is minimized:
    Σ_{each net e} [connectivity(e, C) − 1] × weight(e)        (1)

where connectivity(e, C) returns the connectivity of the net e for the cut C. A P-way cut that satisfies the first two constraints of the problem definition corresponds to a partitioning of the workload such that the aggregation operations assigned to a partition are performed by the processor in that partition. The first constraint ensures that each partition is assigned to only one processor. The second ensures that the computational load imbalance between any two processors does not exceed ε. For example, Figure 5 shows a cut for the hypergraph in Figure 4. The connectivity of a net e corresponds to the number of processors that either own or require the data chunk e. Therefore, connectivity(e, C) − 1 is the number of remote processors that the data chunk e must be sent to. For example, net d in Figure 5 spans two partitions, thus the input chunk d must be sent to processor P2. Net x in Figure 5 spans all three partitions; the output chunk x must therefore be replicated on all three processors. The cost function in Eqn. (1) computes the total overhead incurred from sending input chunks to remote processors and from replicating output chunks on multiple processors, as required by the workload partitioning induced by a cut. Minimizing the cost function results in the cut that incurs the minimum communication overhead.
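Evaluating Eqn. (1) for a candidate cut is straightforward; the sketch below (with hypothetical data structures) counts the distinct partitions each net touches and charges the net's weight once for every partition beyond the first.

#include <set>
#include <vector>

// Hypothetical net: the vertices it connects and its communication weight.
struct Net { std::vector<int> vertices; double weight; };

// part[v] gives the partition (processor) assigned to vertex v by the cut.
double cutCost(const std::vector<Net> &nets, const std::vector<int> &part) {
    double cost = 0.0;
    for (const Net &net : nets) {
        std::set<int> parts;                       // distinct partitions touched
        for (int v : net.vertices) parts.insert(part[v]);
        // Eqn. (1): (connectivity - 1) * weight for this net.
        cost += (static_cast<double>(parts.size()) - 1.0) * net.weight;
    }
    return cost;
}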
Fig. 5. A cut for the aggregation hypergraph shown in Figure 4.
3.3.5 Experimental Results
We compare the performance of HG to that of the DA and RA strategies on a 48-node PC cluster running Linux. We present experimental results using datasets derived from the VM application [43,64]. Each node in the cluster has two 450MHz Pentium II processors, 500MB memory, and one local disk. The nodes are interconnected via both Myrinet (120MB/sec max.) and Fast Ethernet (100Mb/sec max.) networks. In the experiments, only one process was executed on each node. For HG, we use a hypergraph partitioning tool, called PaToH [16], which has been shown to generate good partitions quickly. In the figures, F, D, S, and P stand for the fully replicated accumulator, distributed accumulator, sparsely replicated accumulator, and PaToH hypergraph partitioning strategies, respectively. In the experiments, we assume the output is divided into regular rectangular regions and distributed across the disks. The assignment of both input and output chunks to the disks was done using a Hilbert curve based declustering algorithm.

Figure 6 shows the performance of the strategies for VM using Myrinet and Fast Ethernet. The execution times shown in the figures are the processing times for queries at the server running on the parallel machine. The goal of this experiment is to show the performance of the strategies under different communication bandwidth capabilities. The number of input chunks, each of which is 192KB, is fixed at 8192. The number of output chunks is 256, and the size of each chunk is 192KB. On average, each input chunk maps to one output chunk and each output chunk is mapped to by 32 input chunks. As seen in the figure, when Fast Ethernet is used, the RA strategies perform better than DA for small numbers of processors, whereas DA achieves better performance on larger numbers of processors. In all cases, the overall execution time of HG is close to that of the better of the DA and RA strategies. HG results in low interprocessor communication volume by sending some of the input chunks as in the DA strategy and replicating some of the output chunks as in the RA strategies. As the number of processors increases, HG switches from replicating output chunks to sending a mixture of input and output chunks.
Fig. 6. The performance of the strategies for the VM dataset (total execution time on Myrinet and Fast Ethernet, total communication volume, and per-processor computation time). F, D, S, and P stand for the fully replicated, distributed, and sparsely replicated accumulator, and the PaToH hypergraph partitioning strategies, respectively.
4 Supporting Reduction Operations in Distributed, Heterogeneous Environments

In the previous section, we presented a framework and algorithms for efficient execution of data subsetting and reduction operations on tightly-coupled parallel computer systems. With the help of faster networks and tools to discover and allocate distributed resources, it is becoming increasingly cost-effective to use collections of archival storage and computing systems in a distributed environment to store and manipulate large datasets. A networked collection of storage and computing systems provides a powerful environment, yet introduces many unique challenges for applications. Such a setting requires access to and processing of data in a distributed, heterogeneous environment. Both computational and storage resources can be at locations distributed across the network. Also, the overall system may present a heterogeneous environment to the application: (1) the characteristics, capacity and power of resources, including storage, computation, and network, can vary widely, (2) space availability may require sub-optimal placement of datasets within a system (i.e., across the disks in the system) and across systems, causing non-uniform data access costs, and (3) the distributed resources can be shared by
other applications, which results in varying resource availability.

These characteristics have several implications for developing efficient applications. An application should be flexible enough to accommodate the heterogeneous nature of the environment. Moreover, the application should be optimized in its use of shared resources and be adaptive to changes in their availability. For instance, it may not be efficient or feasible to perform all processing at a data server when its load becomes high. In that case, the efficiency of the application depends on its ability to perform application processing on the data as it progresses from the data source(s) to the client, and on the ability to move all or part of its computation to other machines that are well suited for the task.

There is a large body of research on building computational grids and providing support for enabling execution of applications in a Grid environment [32]. There is also hardware and software research on archival storage systems, including distributed parallel storage systems [39], file systems [59], and remote I/O [57]. However, providing support for efficient subsetting and processing of very large scientific datasets stored in archival storage systems in a distributed environment remains a challenging research issue.

Component-based programming models are becoming widely accepted [19,33,38,50,53] for developing applications in distributed, heterogeneous environments. In this model, the processing structure of an application is represented as multiple objects that interact with each other by moving data and control information. We have developed a component-based framework, called DataCutter [11,12], for developing data-intensive applications in a distributed environment. The framework is built upon prior work in our Active Disks [2,63] and Active Data Repository projects. As described in the previous section, the ADR framework aims to realize performance gains by executing application-specific data subsetting and aggregation operations at the server where the data is stored. The Active Disks project investigated the potential performance benefits of pushing application-specific data processing to disks, turning these passive system components into active devices. A stream-based programming model was described for programming Active Disks. The filter-stream programming model employed in DataCutter adapts and extends the programming model of Active Disks to heterogeneous distributed environments.

Both ADR and Active Disks target homogeneous, tightly coupled systems. DataCutter is designed to support subsetting and reduction operations (the query processing loop presented in Section 3.1.2), as does ADR. However, with DataCutter, we are extending the functionality of ADR to distributed, heterogeneous environments by allowing decomposition of application-specific reduction operations into a set of interacting components, which we refer to as filters. The goal is to achieve performance improvements by providing the flexibility to (1) place components among storage and compute nodes in a system [11], and (2) instantiate and run multiple copies of a group of components or copies of individual components in parallel [13]. The middleware we have developed
provides two core services: an indexing service for subsetting of datasets via range queries, and a filtering service for instantiating and executing application components. In the following sections we briefly describe the framework and middleware, and present experimental results for the Virtual Microscope application.
4.1 Multi-level Indexing for Subsetting Very Large Datasets
One of our goals is to provide support for subsetting very large datasets (up to petabytes in size). We require that a scientific dataset contain both a set of data files and a set of index files. Data files contain the data elements of a dataset, and can be distributed across multiple storage systems. As in the ADR framework, each data file is viewed as consisting of a set of data chunks, each of which contains a subset of all the data items in the dataset and is associated with an MBR in the underlying multi-dimensional space.

Efficient spatial data structures have been developed for indexing and accessing multi-dimensional datasets, such as R-trees and their variants [9]; ADR uses the R-tree as its default indexing method. However, storing very large datasets may result in a large set of data files, each of which may itself be very large. Therefore a single index for an entire dataset could be very large, and it may be expensive, both in terms of memory space and CPU cycles, to manage the index and to perform a search to find intersecting data chunks using a single index file. Assigning an index file to each data file in a dataset could also be expensive, because it is then necessary to access all the index files for a given search.

To alleviate some of these problems, we have developed a multi-level hierarchical indexing scheme implemented via summary index files and detailed index files. The elements of a summary index file associate metadata (i.e., an MBR) with one or more data chunks and/or detailed index files. Detailed index file entries themselves specify one or more data chunks. Each detailed index file is associated with some set of data files, and stores the index and other metadata for all data chunks in those data files. There are no restrictions on which data files are associated with a particular detailed index file for a dataset. Data files can be organized in an application-specific way into logical groups, and each group can be associated with a detailed index file for better performance. For example, in satellite datasets, each data file may store data for one week. A detailed index file can be associated with data files grouped by month, and a summary index file can contain pointers to the detailed index files for the entire range of data in the dataset. An R-tree is used as the indexing method for both the summary and detailed index files.
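The two-level lookup can be pictured as follows. The structures and the rtreeSearch helper are hypothetical stand-ins for the actual index format; the point is the flow: prune with the summary index first, then search only the detailed index files whose MBRs overlap the query.

#include <string>
#include <vector>

// Minimal 3-D bounding box and overlap test (same idea as in Section 3.1.1).
struct MBR { double lo[3], hi[3]; };
static bool overlaps(const MBR &a, const MBR &b) {
    for (int i = 0; i < 3; i++)
        if (a.hi[i] < b.lo[i] || a.lo[i] > b.hi[i]) return false;
    return true;
}

// Hypothetical index entries.
struct SummaryEntry { MBR mbr; std::string detailedIndexFile; };
struct ChunkRef     { MBR mbr; std::string dataFile; long offset; long size; };

// Assumed helper: open a detailed index file and run an R-tree range search.
std::vector<ChunkRef> rtreeSearch(const std::string &detailedIndexFile,
                                  const MBR &query);

// Two-level lookup: prune with the summary index, then search only the
// detailed index files whose MBRs overlap the query.
std::vector<ChunkRef> findChunks(const std::vector<SummaryEntry> &summary,
                                 const MBR &query) {
    std::vector<ChunkRef> result;
    for (const SummaryEntry &e : summary) {
        if (!overlaps(e.mbr, query)) continue;
        std::vector<ChunkRef> hits = rtreeSearch(e.detailedIndexFile, query);
        result.insert(result.end(), hits.begin(), hits.end());
    }
    return result;
}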
4.2 Processing of Data: Filters and Streams
In the filter-stream programming model, the processing structure of a data-intensive application is represented as a set of interacting components, called filters. Data exchange between any two filters is described via streams, which are uni-directional pipes that deliver data in fixed size buffers. Filters are location-independent, because stream names are used to specify filter-to-filter connectivity rather than endpoint locations on specific hosts. This allows the placement of filters on different hosts in a distributed environment, so processing, network and data copying overheads can be minimized by placing filters appropriately. A filter is a user-defined object with methods to carry out application-specific processing on data. Currently, filter code is expressed using a C++ language binding by sub-classing a filter base class. This provides a well-defined interface between the filter code and the filtering service. The interface for filters consists of an initialization function, a processing function, and a finalization function:

class ApplicationFilter : public DC_Filter_Base_t {
public:
    int init(int argc, char *argv[]) { ... };
    int process(stream_t st[]) { ... };
    int finalize(void) { ... };
};
A stream is an abstraction used for all filter communication, and specifies how filters are logically connected. A stream is the means of uni-directional data flow between two filters, from an upstream filter to a downstream filter. Bi-directional data exchange is achieved by creating two streams in opposite directions. All transfers to and from streams are through a provided buffer abstraction. A buffer represents a contiguous memory region containing useful data. Streams transfer data in fixed size buffers. The size of a buffer is determined in the init call; a filter discloses a minimum and an optional maximum value for each of its streams. The actual size of the buffer allocated by the filtering service is guaranteed to be at least the minimum value; the optional maximum value is a preferred buffer size hint to the filtering service. The size of the data in a buffer can be smaller than the size of the buffer, so the buffer contains a pointer to the start of the data, the length of the portion containing useful data, and the maximum size of the buffer. In the current prototype implementation we use TCP for stream communication, but any point-to-point communication library could be added.

Filter operations progress as a sequence of cycles, with each cycle handling a single application-defined unit-of-work. An example of a unit-of-work would be a spatial query for an image processing application that describes a region within an image to retrieve and process. A work cycle starts when the filtering
service calls the filter init function, which is where any required resources, such as memory or disk scratch space, are pre-allocated. Next the process function is called to continually read data arriving on the input streams in buffers from the sending filters. A special marker is sent after the last buffer to mark the end of the current unit-of-work. The finalize function is called after all processing is finished for the current unit-of-work, to allow release of allocated resources such as scratch space. When a work cycle is completed, these interface functions may be called again by the runtime system to process another unit-of-work.
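As an illustration of a work cycle, the sketch below outlines a simple subsampling filter. Only the init/process/finalize interface and the base class come from the filtering service code shown above; the buffer layout and the read/write/subsample helpers are hypothetical assumptions made for the sketch.

#include <cstdlib>

// Hypothetical stand-ins for the filtering service's buffer abstraction and
// helper calls; DC_Filter_Base_t and the three-method interface are taken
// from the class shown earlier in the text.
struct stream_t { int handle; };
struct buffer_t { unsigned char *data; int length; int maxSize; };
bool read_buffer(stream_t &s, buffer_t &buf);        // false at end-of-work marker
void write_buffer(stream_t &s, const buffer_t &buf);
buffer_t subsample(const buffer_t &in, int zoom);    // keep every zoom-th pixel

class ZoomFilter : public DC_Filter_Base_t {
public:
    int init(int argc, char *argv[]) {
        zoom = (argc > 0) ? std::atoi(argv[0]) : 1;   // per-work-cycle setup
        return 0;
    }
    int process(stream_t st[]) {
        buffer_t buf;
        while (read_buffer(st[0], buf)) {             // until end-of-work marker
            buffer_t out = subsample(buf, zoom);
            write_buffer(st[1], out);
        }
        return 0;
    }
    int finalize(void) { return 0; }                  // nothing to release here
private:
    int zoom;
};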
4.3 Parallel Filters

Parallel filters target reduction operations in a distributed environment [13]. A reduction operation can be realized by a filter group that implements transformation, mapping, and aggregation operations and encapsulates the accumulator data structure. We are developing support for two classes of parallel filters, 1-filter and n-filter, which are differentiated based on the granularity of the mapping between application tasks and filters.

1-filter parallel filters represent an entire parallel program as a single filter instance. The goal is to allow the use of optimized parallel implementations of user-defined mapping, transformation, filtering and aggregation functions on a particular machine configuration (i.e., as in ADR). For instance, coupling a hydrodynamics simulator to a chemical transport simulator requires a series of transformation and aggregation operations on the output of the hydrodynamics simulator to create the input for the chemical transport simulator. A common transformation operation is the projection of flow values computed at points on one grid to flux values at faces of another mesh for chemical transport calculations. The projection requires solving linear systems of equations, and efficient parallel solvers exist for both distributed-memory and shared-memory platforms. In this case, a parallel implementation of the projection operation can be a 1-filter parallel filter in the group of filters that implement the operations needed to couple the two simulators.
The experimental results in this section concentrate on the execution of n-filter parallel filters, which are represented as concurrent instances of the same filter. For this class, the filter code itself is the unit of parallelism, and is replicated across a set of host machines. The runtime performance optimizations target the combined use of ensembles of distributed-memory systems and SMP machines. Note that pipelining works well when all stages are balanced, both in terms of the relative processing time of the stages and in terms of the time of each stage compared to the communication cost between stages.
Fig. 7. P, F, C filter group instantiated using parallel filters.
This imbalance, and the resulting performance penalty, is addressed by using transparent copies, in which a filter is unaware that it has been replicated. We define a copy set to be all transparent copies of a given filter that are executing on a particular host. The filter runtime system maintains the illusion of a single logical point-to-point stream for communication between a logical producer and a logical consumer in the filter group. When the logical producer and/or logical consumer has transparent copies, the system must decide, for each producer, which consumer copy set to send a stream buffer to. For example, in Figure 7, if P1 issues a buffer write operation to the logical stream that connects P to F, the choice is to send it to the copy set on host3 or the copy set on host4. Each copy set shares a single buffer queue, so there is perfect demand-based balance between copies within a single host. For distribution between copy sets (on different hosts), we have designed and implemented several policies: (1) Round Robin (RR) distribution of buffers among copy sets, (2) Weighted Round Robin (WRR) among copy sets, weighted by the number of copies on each host, and (3) a Demand Driven (DD) sliding window mechanism based on buffer consumption rate. Not all filters will operate correctly in parallel as transparent copies, because of internal filter state. For example, a filter that attempts to compute the average size of all buffers processed for a unit-of-work will not arrive at the correct answer, because only a subset of the total set of buffers for the unit-of-work is visible at any one copy. Such filters can be annotated to prevent the filtering service from using transparent copies. In the applications we have implemented, very few filters exhibit this type of behavior, and in some of those cases an additional filter can be inserted to combine the distributed filter state from the transparent copies into a single coherent state.
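A minimal sketch of how a producer-side writer might choose a destination copy set under the three policies follows; the CopySet fields and the function names are illustrative only, not the actual DataCutter implementation.

#include <cstddef>
#include <vector>

// One copy set: all transparent copies of a filter on a single host.
struct CopySet {
    int         host_id;
    int         num_copies;   // copies sharing the host's buffer queue
    std::size_t outstanding;  // buffers sent but not yet acknowledged (for DD)
    std::size_t window;       // DD sliding-window limit for this copy set
};

// Round Robin: rotate through copy sets regardless of capacity.
std::size_t pick_rr(const std::vector<CopySet>& sets, std::size_t& next) {
    std::size_t chosen = next % sets.size();
    next = (next + 1) % sets.size();
    return chosen;
}

// Weighted Round Robin: a copy set with k copies receives k buffers per
// round, so hosts running more copies get proportionally more work.
std::size_t pick_wrr(const std::vector<CopySet>& sets, std::size_t& round_pos) {
    std::size_t total = 0;
    for (const CopySet& s : sets) total += static_cast<std::size_t>(s.num_copies);
    std::size_t slot = round_pos % total;
    round_pos = (round_pos + 1) % total;
    for (std::size_t i = 0; i < sets.size(); ++i) {
        if (slot < static_cast<std::size_t>(sets[i].num_copies)) return i;
        slot -= static_cast<std::size_t>(sets[i].num_copies);
    }
    return 0;  // not reached
}

// Demand Driven: send to the copy set with the most room left in its
// sliding window, i.e. the one acknowledging (consuming) buffers fastest.
std::size_t pick_dd(const std::vector<CopySet>& sets) {
    std::size_t best = 0, best_room = 0;
    for (std::size_t i = 0; i < sets.size(); ++i) {
        std::size_t room = sets[i].window > sets[i].outstanding
                               ? sets[i].window - sets[i].outstanding : 0;
        if (room > best_room) { best_room = room; best = i; }
    }
    return best;
}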
4.4 Application: The Virtual Microscope

In the original Virtual Microscope system, the processing of a query is carried out entirely at the parallel server. During query processing, the chunks that intersect the query region are read from local disks in each node.
Fig. 8. Virtual Microscope decomposition: read_data, decompress, clip, zoom, view.
As a data chunk is stored in compressed form (JPEG format), it must first be decompressed. It is then clipped to the query region. Afterwards, each clipped chunk is subsampled to achieve the zoom level (magnification) specified in the query. The resulting image blocks are assembled and displayed at the client. The filter decomposition used for the Virtual Microscope system is shown in Figure 8. The figure only depicts the main data flow path of image data through the system; other low-volume streams related to the client-server protocol are omitted for clarity. The thickness of the stream arrows indicates the relative volume of data that flows on the different streams. In this implementation, each of the main processing steps in the server is a filter (a sketch of this decomposition as a filter group follows the list):
- read data (R): Full-resolution data chunks that intersect the query region are read from disk and written to the output stream.
- decompress (D): Image blocks are read individually from the input stream. Each block is decompressed using JPEG decompression and converted into a 3-byte RGB format. The image block is then written to the output stream.
- clip (C): Uncompressed image blocks are read from the input stream. Portions of a block that lie outside the query region are removed, and the clipped image block is written to the output stream.
- zoom (Z): Image blocks are read from the input stream, subsampled to achieve the magnification requested in the query, and then written to the output stream.
- view (V): Image blocks are received for a given query, collected into a single reply, and sent to the client using the standard Virtual Microscope client/server protocol.
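Purely as an illustration of how this decomposition could be declared as a filter group with streams and host placement, consider the following sketch; FilterGroup, addFilter, addStream, and the host names hostA/hostB are hypothetical, not the actual DataCutter interface.

#include <string>
#include <vector>

// Hypothetical declarative description of a filter group and its streams.
struct FilterSpec { std::string name; std::string host; int copies; };
struct StreamSpec { std::string from; std::string to; };

struct FilterGroup {
    std::vector<FilterSpec> filters;
    std::vector<StreamSpec> streams;
    void addFilter(const std::string& name, const std::string& host, int copies = 1) {
        filters.push_back({name, host, copies});
    }
    void addStream(const std::string& from, const std::string& to) {
        streams.push_back({from, to});
    }
};

// The Virtual Microscope pipeline: R -> D -> C -> Z -> V.
FilterGroup virtual_microscope_group() {
    FilterGroup g;
    g.addFilter("read_data",  "hostA");     // reads chunks from local disk
    g.addFilter("decompress", "hostB", 2);  // e.g. two transparent copies on an SMP host
    g.addFilter("clip",       "hostB");
    g.addFilter("zoom",       "hostB");
    g.addFilter("view",       "hostB");
    g.addStream("read_data",  "decompress");
    g.addStream("decompress", "clip");
    g.addStream("clip",       "zoom");
    g.addStream("zoom",       "view");
    return g;
}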
4.4.1 Experimental Results

Multi-level Indexing. The first experiment isolates the impact of organizing the dataset into multiple files and using the multi-level indexing scheme. In this experiment we use a 4GB 2D compressed JPEG image dataset (90GB uncompressed), created by stitching together smaller digitized microscopy images. This dataset is equivalent to a digitized slide with a single focal plane of 180K x 180K RGB pixels. The 2D image is regularly partitioned into 200 x 200 data chunks and stored in a set of data files in the IBM HPSS archival storage system at the University of Maryland [11]. The HPSS setup has 10TB of tape storage space and 500GB of disk cache, and is accessed through a 10-node IBM SP.
Fig. 9. (a) 2D dataset and query regions for the multi-level indexing experiments. (b) Query execution time (response time, in seconds) with the dataset organized into 1x1, 2x2, 4x4, and 10x10 files. Load shows the time to open and access the files that contain data chunks intersecting a query. Computation shows the sum of the execution time for searching for data chunks that intersect a query and for processing the retrieved data via filters.
One node of the IBM SP is used to run the filter that carries out the index lookup, and the client was run on a Sun workstation connected to the SP node through the department Ethernet. The server host, which runs the read data filter, is the machine containing the dataset. While these experiments were conducted using HPSS, we use HPSS only as an example of a high-capacity archival storage system. We defined five possible queries, each of which covers 5x5 chunks of the image (see Figure 9(a)). The execution times we show are response times seen by the visualization client, averaged over 5 repeated runs. Figure 9(b) shows the results when the 2D image is partitioned into 1x1, 2x2, 4x4, and 10x10 rectangular regions, with all data chunks in each region stored in a single data file. Figure 9(a) illustrates the partitioning of the dataset into 1x1 (entire rectangle), 2x2 (solid lines), and 4x4 (dashed lines) files. Each data file is associated with a detailed index file, and there is one summary index file covering all the detailed index files for each partitioning. As seen in the figure, the load time decreases as the number of files is increased. This is because HPSS loads an entire file onto the disks used as the HPSS cache when the file is opened. When there is a single file, the entire 4GB file is accessed from HPSS for each of the queries; in these experiments, all data files are purged from the disk cache after each query is processed. When the number of files increases, only a subset of the detailed index files and data files is accessed via the multi-level hierarchical indexing scheme, decreasing the time to access data chunks. Note that the load time for query 5 in the 2x2 case is substantially larger than that of the other queries, because query 5 intersects data chunks from each of the four files (Figure 9(a)), hence the same volume of data is loaded into the disk cache as in the 1x1 case. The load time for that query is also larger than in the 1x1 case because of the overhead of seeking, loading, and opening four files instead of a single file. The computation time, on the other hand, remains almost the same, except for the 10x10 case, where it increases slightly due to the overhead of opening many files. These results demonstrate that applications can take advantage of the multi-level hierarchical indexing scheme by organizing a dataset into an appropriate set of files. However, having too many files may increase computation time, potentially decreasing overall efficiency when multiple similar queries are executed on the same dataset. Overall, the conclusions to be drawn are that the organization of data chunks into files can significantly affect performance, and that the use of hierarchical indexing techniques can greatly improve overall performance.
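A minimal sketch of the two-level lookup implied by this organization is given below; the types (Box, SummaryEntry, DetailedIndex) and the load_detailed_index helper are stand-ins for the actual index file formats, which are not shown here.

#include <string>
#include <vector>

// Hypothetical 2D bounding box with an intersection test.
struct Box {
    double lo[2], hi[2];
    bool intersects(const Box& o) const {
        return lo[0] <= o.hi[0] && o.lo[0] <= hi[0] &&
               lo[1] <= o.hi[1] && o.lo[1] <= hi[1];
    }
};

// Summary index: one entry per detailed index file (i.e. per data file).
struct SummaryEntry { Box extent; std::string detailed_index_file; };

// Detailed index: one entry per data chunk in the corresponding data file.
struct ChunkEntry    { Box extent; long offset; long length; };
struct DetailedIndex { std::string data_file; std::vector<ChunkEntry> chunks; };

// Stand-in for reading a detailed index file from archival storage.
DetailedIndex load_detailed_index(const std::string& path);

// Range query: consult the summary index first, and open only the detailed
// index files (and hence data files) whose extent intersects the query.
std::vector<ChunkEntry> lookup(const Box& query,
                               const std::vector<SummaryEntry>& summary) {
    std::vector<ChunkEntry> result;
    for (const SummaryEntry& s : summary) {
        if (!s.extent.intersects(query)) continue;  // skip whole files
        DetailedIndex di = load_detailed_index(s.detailed_index_file);
        for (const ChunkEntry& c : di.chunks)
            if (c.extent.intersects(query)) result.push_back(c);
    }
    return result;
}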
Placement of Filters and Parallel Filters. These experiments address the performance implications of filter placement, and of replicating particular filters, to better utilize multiple processors on a multiprocessor node and processors on multiple nodes. All the experiments were performed on a Linux PC cluster with five hosts, consisting of four single-processor and one dual-processor 800MHz Pentium III machines, interconnected via 100Mbit Ethernet. The performance results are shown in Table 1. For each configuration, the same 50 queries at various magnifications were processed, with all queries producing the same size output image (512x512 pixels). The column labeled Average shows the average response time over all 50 queries, while the columns labeled with magnifications show the average response times for the subset of queries at that magnification. Note that queries at lower magnifications retrieve more data than those at higher magnifications, because the data is only stored at the highest magnification (400x). For this set of filters, the decompress filter (D) is the most computationally expensive, and is therefore a good candidate for replication. In all configurations, the read filter (R) reads image data from a local disk. For configuration 8, the demand-driven (DD) writer policy was used to distribute buffers among the consumer filter copies. Several conclusions can be drawn from the results. First, running the read filter on a different host from all the processing filters, at least for a single-processor host, results in a significant performance increase, as seen by comparing configurations 1 and 2. This is mainly a result of the asynchronous disk I/O that this structure permits. Second, replicating the bottleneck filter on a relatively unloaded machine also results in performance gains, as seen by comparing configurations 2 and 3, where the decompression filter was replicated on the dual-processor host. Also note, by comparing configurations 3 and 4, that replicating a non-bottleneck filter does not increase performance. Excessive replication of even a bottleneck filter is not effective, as is shown by adding two more decompression copies in configuration 5 as compared to configuration 4.
                                          Response Time (seconds)
Configuration   R-D-C-Z-V              Average   400x    200x    100x    50x
 1              h-h-h-h-h              2.096     0.382   0.725   1.734   6.952
 2              h-g-g-g-g              1.489     0.367   0.621   1.271   4.600
 3              h-g(2)-g-g-g           1.153     0.393   0.501   0.953   3.413
 4              h-g(2)-g(2)-g-g        1.145     0.369   0.491   0.947   3.432
 5              h-g(4)-g(2)-g-g        1.171     0.394   0.500   0.957   3.500
 6              h-g-g-b-b              1.874     0.437   0.740   1.501   5.996
 7              h-g(2)-g(2)-b-b        1.679     0.454   0.680   1.271   5.341
 8              h-g(2)-g(2)-b,l,m-b    1.659     0.518   0.727   1.254   5.101
 9              g-g-g-g-g              1.436     0.333   0.575   1.263   4.464
10              g-g(2)-g-g-g           1.076     0.325   0.451   0.920   3.236

Table 1. Performance of 50 Virtual Microscope queries on a PC cluster, using various configurations. Each configuration is described by the placement of its read (R), decompress (D), clip (C), zoom (Z), and view (V) filters on hosts g, h, b, l, and m. Host g has two processors, and in some configurations ran two copies of a filter, denoted g(2). In configuration 8, the Z filter was replicated on three hosts (b, l, and m).
Configurations 6-8 show that, for this set of filters, distributing the computational load across multiple hosts does not improve performance, because the computations are not expensive enough to overcome the additional communication costs of moving stream data between hosts. Finally, configurations 9 and 10 show the benefits of running filters on a powerful host. Host g is a dual-processor machine with a large amount of memory that can easily accommodate the processing requirements of all the filters, but to take full advantage of the host in terms of response time, the decompression filter should be replicated. Another overall conclusion is that the overhead introduced by placing filters on multiple hosts is not very large. In particular, the performance difference between configurations 2 and 9 is very small, showing that it is feasible to read data on one host and process it elsewhere, so long as the required communication bandwidth can be supported (which holds in this case for the local area network used).
Dynamic Buffer Scheduling Policies. To explore the performance effects of different policies for scheduling buffers among the copies of a filter, we have implemented an emulator-based extended version of the Virtual Microscope application. The Virtual Microscope emulator includes an extended processing stage after the desired image data has been constructed (i.e., after decompress, clip, and zoom). For example, such processing might perform a complex content-based classification of the cells present in the slide image. Such algorithms are computationally expensive, and provide a more complex and heterogeneous application for experimentation.
Fig. 10. The emulated filters in the Virtual Microscope with additional computationally intensive processing: read_data, dcz, chew, view.
The real application was executed on various nodes of the Linux cluster in isolation (each filter was executed on a separate node with no other user processes running on the node), and detailed timings were collected. These timings were used to parameterize a generic filter emulator. Based on the results in the previous section, we decided to combine the functionality of the decompress, clip, and zoom filters into a single emulated dcz filter. In addition, we created a new chew filter that performs significant processing in comparison to the other filters. Figure 10 shows the resulting emulator-based application.

The filter emulator abstracts the processing and data handling we have observed while implementing various data-intensive filter-based applications. The advantage of using a filter emulator is that we can easily adjust application characteristics to fully explore the large space of potential application filters. The emulator assumes a simple dataflow model of filter operation: the filter (1) blocks to read sufficient input on all its input streams, (2) performs computation on the input in proportion to the size of the input, and (3) generates some amount of output data to write to all its output streams. All input and output operations are performed using fixed-size buffers. The sizes of the input and output data, the computation time required per unit of input, and the amount of scratch working memory needed are the parameters that must be set to emulate a real filter. Note that the computation time parameter values were collected from experiments on each of the host types in the Linux cluster, and the appropriate values are used in each of the following experiments.

The experimental setup is a heterogeneous Linux cluster with four classes of nodes: rogue - single-processor Pentium III 650MHz nodes with 128MB memory and multiple large attached EIDE disks; blue - dual-processor Pentium III 550MHz nodes with 1GB memory; red - dual-processor Pentium II 450MHz nodes with 256MB memory; and one 8-processor Pentium III 550MHz node with 4GB memory. The stored data for all experiments is local to one of the storage-class (rogue) nodes. The interconnect shared by all nodes is switched 100Mbit Ethernet. All results shown are for a single 512x512 Virtual Microscope query at 100x zoom. Table 2 shows the basic behavior of the filters, including computational requirements and data transfer times, when each of the four filters is run on its own rogue node. The chew filter is by far the most computationally expensive, and is a good candidate for multiple transparent copies.
Filter       Compute Time   Write Time   Output
read data    0.18 s         128.61 s     2.1 GB
dcz          2.41 s         218.13 s     853 KB
chew         231.01 s       0.02 s       853 KB
view         0.03 s         n/a          n/a

Table 2. Behavior of emulated filters on isolated rogue nodes.
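A rough sketch of one emulated work cycle under the dataflow model described above follows; the parameter names (bytes_out_per_byte_in, compute_seconds_per_byte, scratch_bytes) are illustrative, not the emulator's actual configuration interface.

#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Illustrative parameters an emulated filter would be configured with.
struct EmulatedFilterParams {
    std::size_t buffer_size;              // fixed stream buffer size
    double      bytes_out_per_byte_in;    // output volume relative to input volume
    double      compute_seconds_per_byte; // computation cost per unit of input
    std::size_t scratch_bytes;            // scratch working memory to allocate
};

// One emulated work cycle: read all input, "compute" in proportion to the
// input size, then emit a proportional amount of output in fixed-size buffers.
void run_emulated_filter(const EmulatedFilterParams& p,
                         const std::vector<std::vector<unsigned char>>& input_buffers,
                         std::vector<std::vector<unsigned char>>& output_buffers) {
    std::vector<unsigned char> scratch(p.scratch_bytes);  // pre-allocated scratch space

    std::size_t bytes_in = 0;
    for (const auto& buf : input_buffers) bytes_in += buf.size();

    // Emulate computation proportional to the input size.
    std::this_thread::sleep_for(
        std::chrono::duration<double>(p.compute_seconds_per_byte * bytes_in));

    // Emit output in fixed-size buffers.
    std::size_t bytes_out =
        static_cast<std::size_t>(p.bytes_out_per_byte_in * bytes_in);
    while (bytes_out > 0) {
        std::size_t n = bytes_out < p.buffer_size ? bytes_out : p.buffer_size;
        output_buffers.emplace_back(n, 0);
        bytes_out -= n;
    }
}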
                                 Response Time (seconds)
Configuration                    RR        WRR      DD
3 red(1), 1 8cpu(6)              89.596    41.014   43.157
1 rogue(3), 1 8cpu(6)            117.242   76.787   50.813

Table 3. Write policy impact on performance under two specific configurations. N host(C) denotes that N nodes from the host class are used to execute C copies of the chew filter.
The goal is to allow the copies to execute in parallel to offset the computational imbalance between filters. With one copy of chew, the filters earlier in the pipeline spend most of their time stalled trying to perform stream write operations. As described in Section 4.3, multiple copies imply that each producer must decide which copy set to send to. In Table 3, we compare two cases designed to illustrate the differences between the write policies. The first configuration is designed so that the weighted round robin (WRR) policy performs best, because sending one buffer to each of the 3 red nodes and 6 buffers to the 8cpu node should create a reasonably balanced workload. The demand driven (DD) policy performs slightly worse for that configuration, since the acknowledgment messages required by the algorithm add overhead while producing a write distribution very similar to WRR. The round robin (RR) policy is the worst, since the filter copies on the 8cpu node end up mostly idle. The second configuration places 3 copies on a single-processor rogue node, and again 6 copies on the 8cpu node. In this case, the 3 copies on the rogue node contend for the single processor, and effectively run at 1/3 the rate of a single copy in the first configuration. Here, WRR is not the correct choice, and DD performs best, because the rate of acknowledgment messages from the rogue node is reduced, so fewer buffers are sent there. Overall, DD performs well in both cases, provided the additional acknowledgment traffic is not a problem. The results from this experiment also apply to situations where the load on a host increases at runtime, effectively reducing the number of processors available for transparent copies of a filter. In such situations DD should perform well, outperforming the other write policies.
5 Related Work

Reduction operations have long been recognized as an important source of parallelism for many scientific applications [26,35,36,67]. Most techniques for optimizing parallel reductions have been developed for scenarios where data fits into processor memory, and the main goal is to partition the iterations among processors to achieve good load balance with low induced interprocessor communication overhead. Brezany et al. [14] have extended the inspector-executor approach [51] for out-of-core irregular applications. Recently, Yu and Rauchwerger [67] developed a decision-tree based system for shared-memory machines that selects a reduction algorithm from a library of algorithms according to the measured characteristics of the program's data reference pattern. The strategies presented in Section 3.3 are inspired by these previously developed techniques. DA adopts the "owner computes" rule, while FRA and SRA make use of replicated buffers. Our strategies extend those approaches to out-of-core multi-dimensional datasets and provide a unified framework for efficient execution. We also developed a hypergraph-based strategy, which takes into account the pre-existing distribution of input and output datasets across processors.

Several run-time support libraries and file systems have been developed to support efficient I/O in a parallel environment [25,41,48,60]. These systems mainly focus on supporting regular strided access to uniformly distributed datasets, such as images, maps, and dense multi-dimensional arrays. Our work, however, has focused on efficiently supporting parallel aggregation operations over subsets of irregular, spatially indexed datasets specified by range queries. User-defined computation is an integral part of the frameworks presented in this paper. Similar work was done independently by Goil and Choudhary [34], who developed an infrastructure, called PARSIMONY, that provides support for online analytical processing (OLAP) and data mining operations.

Several researchers have concurrently and independently explored the concept of Active Disks (alternate names include Intelligent Disks and Programmable Disks), which allows processing to be performed within the disk subsystem. Research in this area can be roughly divided into two categories: application processing in Active Disks and system-level processing in Active Disks. Riedel et al. [52] investigated the performance of Active Disks for data mining and multimedia algorithms and developed an analytical performance model. The ISTORE project [15] uses the IDISK [40] architecture as a building block to create a meta-appliance: a storage infrastructure that can be tailored for specific applications. Acharya et al. [2] introduced a stream-based programming model for disklets and their interaction with host-resident peers, and restructured a wide range of data-intensive applications within that model.
There are also a number of research projects that focus on component-based models for developing applications in a distributed environment. The ABACUS framework [6] addresses the automatic and dynamic placement of functions in data-intensive applications between clients and storage servers. This work is closely related to DataCutter in that application components are placed to improve performance, but ABACUS only supports applications that are structured as a chain of function calls, and the only placement possibilities are the client or the server. MOCHA [53] is a database middleware system designed to interconnect data sources distributed over a wide area network. MOCHA operates in the highly structured relational database world, and can automatically deploy implementations of new data types to hosts for execution of queries. That work shows how an optimizer customized to deal with "data-inflating" and "data-reducing" operators can improve performance. MOCHA can leverage complete knowledge about query selectivities stored in its catalog, whereas DataCutter deals with arbitrary application code for which no such information is available. Armada [49] is a flexible parallel file system framework being developed to enable access to remote and distributed datasets through a network (armada) of application objects, called ships. The system provides authorization and authentication services, and runtime support for the collection of application-specific ships to run on I/O nodes, compute nodes, and other nodes on the network.
6 Conclusions and Future Work

We have presented an overview of the frameworks and methods we have developed to support applications that analyze and explore large multi-dimensional scientific datasets. The ADR framework targets optimized execution of data-intensive applications on distributed-memory architectures with a disk farm. The DataCutter and filter-stream programming framework extends this work from tightly-coupled, homogeneous systems to distributed, heterogeneous collections of computational and storage systems. ADR enables execution of user-defined functions at the storage system where the data is stored. The filter-stream programming model provides flexibility by allowing applications to be composed from interacting components, which allows applications to achieve good performance on many platforms and under varying resource availability. We are now examining new strategies and algorithms within these frameworks to further improve the performance of data-intensive applications. A compiler frontend is also being developed for the ADR framework [30]. In that work, application developers implement application-specific processing using a Java dialect or XML, and the compiler creates a customized instance of ADR from the Java or XML code. We have also initiated an effort to develop a framework for optimizing multiple simultaneous queries for the analysis and exploration of large scientific datasets. This work is motivated by the fact that data analysis is sometimes employed in a collaborative environment, where co-located clients access the same datasets and perform similar processing on them. In such an environment, commonalities among data, data access patterns, and processing functions can be exploited to provide significant performance benefits when executing multiple simultaneous queries.
Acknowledgments

We are grateful to the Albuquerque High Performance Computing Center for providing access to their Linux clusters and for all the necessary support for some of the ADR experiments.
References

[1] A. Acharya, M. Uysal, R. Bennett, A. Mendelson, M. Beynon, J. Hollingsworth, J. Saltz, and A. Sussman. Tuning the performance of I/O-intensive parallel applications. In Proceedings of the Fourth ACM Workshop on I/O in Parallel and Distributed Systems, May 1996.
[2] A. Acharya, M. Uysal, and J. Saltz. Active disks: Programming model, algorithms and evaluation. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VIII), pages 81-91. ACM Press, Oct. 1998. ACM SIGPLAN Notices, Vol. 33, No. 11.
[3] A. Afework, M. D. Beynon, F. Bustamante, A. Demarzo, R. Ferreira, R. Miller, M. Silberman, J. Saltz, A. Sussman, and H. Tsang. Digital dynamic telepathology - the Virtual Microscope. In Proceedings of the 1998 AMIA Annual Fall Symposium. American Medical Informatics Association, Nov. 1998.
[4] R. Agrawal, T. Imielinski, and A. Swami. Database mining: A performance perspective. IEEE Transactions on Knowledge and Data Engineering, 5(6):914-925, Dec. 1993.
[5] J. Ahrens, K. Brislawn, K. Martin, B. Geveci, C. C. Law, and M. Papka. Large-scale data visualization using parallel data streaming. IEEE Computer Graphics and Applications, 21(4):34-41, July/August 2001.
[6] K. Amiri, D. Petrou, G. Ganger, and G. Gibson. Dynamic function placement in active storage clusters. Technical Report CMU-CS-99-140, Carnegie Mellon University, Pittsburgh, PA, June 1999.
[7] H. Andrade, T. Kurc, A. Sussman, and J. Saltz. Decision tree construction for data mining on clusters of shared-memory multiprocessors. Technical Report CS-TR-4203 and UMIACS-TR-2000-78, University of Maryland, Department of Computer Science and UMIACS, Dec. 2000.
[8] C. L. Bajaj, V. Pascucci, D. Thompson, and X. Y. Zhang. Parallel accelerated isocontouring for out-of-core visualization. In Proceedings of the 1999 IEEE Symposium on Parallel Visualization and Graphics, pages 97-104, San Francisco, CA, USA, Oct. 1999.
[9] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The R*-tree: An efficient and robust access method for points and rectangles. In Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data (SIGMOD90), pages 322-331, Atlantic City, NJ, May 1990.
[10] M. Beynon, A. Sussman, and J. Saltz. Performance impact of proxies in data intensive client-server applications. In Proceedings of the 1999 International Conference on Supercomputing. ACM Press, June 1999.
[11] M. D. Beynon, R. Ferreira, T. Kurc, A. Sussman, and J. Saltz. DataCutter: Middleware for filtering very large scientific datasets on archival storage systems. In Proceedings of the Eighth Goddard Conference on Mass Storage Systems and Technologies/17th IEEE Symposium on Mass Storage Systems, pages 119-133. National Aeronautics and Space Administration, Mar. 2000. NASA/CP 2000-209888.
[12] M. D. Beynon, T. Kurc, A. Sussman, and J. Saltz. Optimizing execution of component-based applications using group instances. In Proceedings of CCGrid2001: IEEE International Symposium on Cluster Computing and the Grid, pages 56-63. IEEE Computer Society Press, May 2001.
[13] M. D. Beynon, A. Sussman, U. Catalyurek, T. Kurc, and J. Saltz. Performance optimization for data intensive grid applications. In Proceedings of the Third Annual International Workshop on Active Middleware Services (AMS2001), Aug. 2001.
[14] P. Brezany, A. Choudhary, and M. Dang. Parallelization of irregular codes including out-of-core data and index arrays. In Proceedings of Parallel Computing 1997 - PARCO'97, pages 132-140. Elsevier, Sept. 1997.
[15] A. Brown, D. Oppenheimer, K. Keeton, R. Thomas, J. Kubiatowicz, and D. Patterson. ISTORE: Introspective storage for data-intensive network services. In Proceedings of the 7th Workshop on Hot Topics in Operating Systems (HotOS-VII), Mar. 1999.
[16] U. Catalyurek and C. Aykanat. Hypergraph-partitioning based decomposition for parallel sparse-matrix vector multiplication. IEEE Transactions on Parallel and Distributed Systems, 10(7):673-693, 1999.
[17] U. Catalyurek, T. Kurc, A. Sussman, and J. Saltz. Improving the performance and functionality of the Virtual Microscope. Archives of Pathology & Laboratory Medicine, 125(8), Aug. 2001.
[18] U. Catalyurek, T. Kurc, A. Sussman, and J. Saltz. Improving the performance and functionality of the Virtual Microscope. Archives of Pathology & Laboratory Medicine, 125(8), Aug. 2001.
[19] Common Component Architecture Forum. http://www.cca-forum.org.
[20] C. F. Cerco and T. Cole. User's guide to the CE-QUAL-ICM three-dimensional eutrophication model, release version 1.0. Technical Report EL-95-15, US Army Corps of Engineers Water Experiment Station, Vicksburg, MS, 1995.
[21] C. Chang, R. Ferreira, A. Sussman, and J. Saltz. Infrastructure for building parallel database systems for multi-dimensional data. In Proceedings of the Second Merged IPPS/SPDP Symposiums. IEEE Computer Society Press, Apr. 1999.
[22] C. Chang, T. Kurc, A. Sussman, U. Catalyurek, and J. Saltz. A hypergraph-based workload partitioning strategy for parallel data aggregation. In Proceedings of the Eleventh SIAM Conference on Parallel Processing for Scientific Computing. SIAM, Mar. 2001.
[23] C. Chang, T. Kurc, A. Sussman, and J. Saltz. Optimizing retrieval and processing of multi-dimensional scientific datasets. In Proceedings of the Third Merged IPPS/SPDP (14th International Parallel Processing Symposium & 11th Symposium on Parallel and Distributed Processing). IEEE Computer Society Press, May 2000.
[24] C. Chang, B. Moon, A. Acharya, C. Shock, A. Sussman, and J. Saltz. Titan: A high performance remote-sensing database. In Proceedings of the 1997 International Conference on Data Engineering, pages 375-384. IEEE Computer Society Press, Apr. 1997.
[25] P. F. Corbett and D. G. Feitelson. The Vesta parallel file system. ACM Transactions on Computer Systems, 14(3):225-264, Aug. 1996.
[26] R. Das, M. Uysal, J. Saltz, and Y.-S. Hwang. Communication optimizations for irregular scientific computations on distributed memory architectures. Journal of Parallel and Distributed Computing, 22(3):462-479, Sept. 1994.
[27] H. Fallah-Adl, J. JaJa, S. Liang, J. Townshend, and Y. J. Kaufman. Fast algorithms for removing atmospheric effects from satellite images. IEEE Computational Science & Engineering, 3(2):66-77, Summer 1996.
[28] C. Faloutsos and P. Bhagwat. Declustering using fractals. In Proceedings of the 2nd International Conference on Parallel and Distributed Information Systems, pages 18-25, Jan. 1993.
[29] R. Farias and C. T. Silva. Out-of-core rendering of large, unstructured grids. IEEE Computer Graphics and Applications, 21(4):42-50, July/August 2001.
[30] R. Ferreira, G. Agrawal, and J. Saltz. Compiling object-oriented data intensive applications. In Proceedings of the 2000 International Conference on Supercomputing, pages 11-21. ACM Press, May 2000.
[31] R. Ferreira, T. Kurc, M. Beynon, C. Chang, A. Sussman, and J. Saltz. Object-relational queries into multi-dimensional databases with the Active Data Repository. Parallel Processing Letters, 9(2):173-195, 1999.
[32] I. Foster and C. Kesselman. The GRID: Blueprint for a New Computing Infrastructure. Morgan-Kaufmann, 1999.
[33] Global Grid Forum. http://www.gridforum.org.
[34] S. Goil and A. Choudhary. PARSIMONY: An infrastructure for parallel multidimensional analysis and data mining. Journal of Parallel and Distributed Computing, 61(3):285-321, March 2001.
[35] M. Hall, S. Amarasinghe, B. Murphy, S. Liao, and M. Lam. Detecting coarse-grain parallelism using an interprocedural parallelizing compiler. In Proceedings of Supercomputing '95, San Diego, CA, Dec. 1995.
[36] H. Han and C.-W. Tseng. Improving compiler and run-time support for irregular reductions. In Proceedings of the 11th Workshop on Languages and Compilers for Parallel Computing, Aug. 1998.
[37] The Independent JPEG Group's JPEG software, http://www.ijg.org, March 1998.
[38] C. Isert and K. Schwan. ACDS: Adapting computational data streams for high performance. In 14th International Parallel & Distributed Processing Symposium (IPDPS 2000), pages 641-646, Cancun, Mexico, May 2000. IEEE Computer Society Press.
[39] W. E. Johnston and B. Tierney. A distributed parallel storage architecture and its potential application within EOSDIS. In NASA Mass Storage Symposium, Mar. 1995.
[40] K. Keeton, D. A. Patterson, and J. M. Hellerstein. A case for intelligent disks (IDISKs). ACM SIGMOD Record, 27(3):42-52, Sept. 1998.
[41] D. Kotz. Disk-directed I/O for MIMD multiprocessors. In Proceedings of the 1994 Symposium on Operating Systems Design and Implementation, pages 61-74. ACM Press, Nov. 1994.
[42] T. Kurc, U. Catalyurek, C. Chang, A. Sussman, and J. Saltz. Visualization of large datasets with the Active Data Repository. IEEE Computer Graphics and Applications, 21(4):24-33, July/August 2001.
[43] T. Kurc, C. Chang, R. Ferreira, A. Sussman, and J. Saltz. Querying very large multi-dimensional datasets in ADR. In Proceedings of the 1999 ACM/IEEE SC99 Conference. ACM Press, Nov. 1999.
[44] T. M. Kurc, A. Sussman, and J. Saltz. Coupling multiple simulations via a high performance customizable database system. In Proceedings of the Ninth SIAM Conference on Parallel Processing for Scientific Computing. SIAM, Mar. 1999.
[45] W. Lorensen and H. Cline. Marching cubes: A high resolution 3D surface reconstruction algorithm. Computer Graphics, 21(4):163-169, 1987.
[46] R. A. Luettich, J. J. Westerink, and N. W. Scheffner. ADCIRC: An advanced three-dimensional circulation model for shelves, coasts, and estuaries. Technical Report 1, Department of the Army, U.S. Army Corps of Engineers, Washington, D.C. 20314-1000, December 1991.
[47] B. Moon and J. H. Saltz. Scalability analysis of declustering methods for multidimensional range queries. IEEE Transactions on Knowledge and Data Engineering, 10(2):310-327, March/April 1998.
[48] N. Nieuwejaar and D. Kotz. The Galley parallel file system. In Proceedings of the 1996 International Conference on Supercomputing, pages 374-381. ACM Press, May 1996.
[49] R. Oldfield and D. Kotz. Armada: A parallel file system for computational grids. In Proceedings of CCGrid2001: IEEE International Symposium on Cluster Computing and the Grid, Brisbane, Australia, May 2001. IEEE Computer Society Press.
[50] B. Plale and K. Schwan. dQUOB: Managing large data flows using dynamic embedded queries. In IEEE International High Performance Distributed Computing (HPDC), August 2000.
[51] R. Ponnusamy, J. Saltz, A. Choudhary, Y.-S. Hwang, and G. Fox. Runtime support and compilation methods for user-specified irregular data distributions. IEEE Transactions on Parallel and Distributed Systems, 6(8):815-831, Aug. 1995.
[52] E. Riedel, C. Faloutsos, and G. Gibson. Active storage for large-scale data mining and multimedia applications. In Proceedings of VLDB'98, 1998.
[53] M. Rodriguez-Martinez and N. Roussopoulos. MOCHA: A self-extensible database middleware system for distributed data sources. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD00), pages 213-224. ACM Press, May 2000. ACM SIGMOD Record, Vol. 29, No. 2.
[54] W. Schroeder, K. Martin, and B. Lorensen. The Visualization Toolkit: An Object-Oriented Approach to 3D Graphics. Prentice Hall, 2nd edition, 1997.
[55] J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In The 22nd VLDB Conference, pages 544-555, Bombay, India, Sept. 1996.
[56] P. H. Smith and J. van Rosendale, editors. Data and Visualization Corridors: Report on the 1998 DVC Workshop Series. Technical Report CACR-164, California Institute of Technology, Sept. 1998.
[57] SRB: The Storage Resource Broker. http://www.npaci.edu/DICE/SRB/index.html.
[58] T. Tanaka. Configurations of the solar wind flow and magnetic field around the planets with no magnetic field: Calculation by a new MHD. Journal of Geophysical Research, 98(A10):17251-62, Oct. 1993.
[59] M. Teller and P. Rutherford. Petabyte file systems based on tertiary storage. In the Sixth NASA Goddard Space Flight Center Conference on Mass Storage Systems and Technologies, Fifteenth IEEE Symposium on Mass Storage Systems, 1998.
[60] R. Thakur, A. Choudhary, R. Bordawekar, S. More, and S. Kuditipudi. Passion: Optimized I/O for parallel applications. IEEE Computer, 29(6):70-78, June 1996.
[61] S.-K. Ueng, K. Sikorski, and K.-L. Ma. Out-of-core streamline visualization on large unstructured meshes. IEEE Transactions on Visualization and Computer Graphics, 3(4):370-380, Dec. 1997.
[62] U.S. Geological Survey. Land satellite (LANDSAT) thematic mapper (TM). http://edcwww.cr.usgs.gov/nsdi/html/landsat tm/landsat tm.
[63] M. Uysal, A. Acharya, and J. Saltz. Evaluation of active disks for decision support databases. In Proceedings of the 6th International Symposium on High-Performance Computer Architecture. IEEE Computer Society Press, Jan. 2000.
[64] M. Uysal, T. M. Kurc, A. Sussman, and J. Saltz. A performance prediction framework for data intensive applications on large scale parallel machines. In Proceedings of the Fourth Workshop on Languages, Compilers and Run-time Systems for Scalable Computers, pages 243-258. Springer-Verlag, May 1998.
[65] A. Watt. Fundamentals of Three-Dimensional Computer Graphics. Addison Wesley, 1989.
[66] M. F. Wheeler, W. Lee, C. N. Dawson, D. C. Arnold, T. Kurc, M. Parashar, J. Saltz, and A. Sussman. Parallel computing in environment and energy. In J. Dongarra, I. Foster, G. Fox, K. Kennedy, L. Torczon, and A. White, editors, CRPC Handbook of Parallel Computing. Morgan Kaufmann Publishers, Inc., 2001.
[67] H. Yu and L. Rauchwerger. Adaptive reduction parallelization techniques. In Proceedings of the 14th ACM International Conference on Supercomputing, pages 66-77, Santa Fe, New Mexico, May 2000.
[68] M. J. Zaki, C.-T. Ho, and R. Agrawal. Parallel classification for data mining on shared-memory multiprocessors. In IEEE International Conference on Data Engineering, pages 198-205, Sydney, Australia, Mar. 1999.