Efficient Manipulation of Large Datasets on Heterogeneous Storage Systems

Michael D. Beynon, Tahsin Kurc, Umit Catalyurek, Alan Sussman, Joel Saltz

Dept. of Computer Science, University of Maryland, College Park, MD 20742
{beynon,als}@cs.umd.edu

Dept. of Biomedical Informatics, The Ohio State University, Columbus, OH 43210
{kurc.1,catalyurek.1,saltz.1}@osu.edu
Abstract

In this paper we are concerned with the efficient use of a collection of disk-based storage systems and computing platforms in a heterogeneous setting for retrieving and processing large scientific datasets. We demonstrate, in the context of a data-intensive visualization application, how heterogeneity affects performance, and show a set of optimization techniques that can be used to improve performance in a component-based framework. In particular, we examine the application of parallelism via transparent copies of application components in the pipelined processing of data.

(This research was supported by the National Science Foundation under Grants #ACI-9619020 (UC Subcontract #10152408) and #ACI-9982087, Lawrence Livermore National Laboratory under Grant #B500288 (UC Subcontract #10184497), and the Department of Defense, Advanced Research Projects Agency, USAF, AFMC through Science Applications International Corporation under Grant #F30602-00-C-0009 (SAIC Subcontract #4400025559). Michael D. Beynon is currently at MIT Lincoln Laboratory.)

1 Introduction

Component-based programming models are becoming widely popular [3, 5, 11, 18, 19, 26, 27] for developing applications in distributed, heterogeneous environments. In this model, the processing structure of an application is represented as multiple objects that interact with each other by moving data and control information. Component-based models allow several degrees of freedom to improve application performance. The placement of components affects performance: components with affinity to data sources can be placed near those sources to minimize communication volume on slow links, and computationally intensive components can be placed on less loaded or faster hosts. Another degree of freedom is parallelism, obtained by executing copies of individual components, or groups of components, across storage and computing systems.

In this paper, we investigate parallelism in the pipelined processing of data via transparent copies of individual components, within the context of a component-based framework [6, 8]. We reported preliminary performance results using a simplified version of an application for browsing digitized microscopy images in [8]. In this paper, we extend our previous work and present performance results for another scientific visualization application, isosurface rendering, on a cluster of machines at the University of Maryland.

In a heterogeneous environment, exploiting these degrees of freedom requires dynamic scheduling of components and adaptation of data flow between multiple copies of producers and consumers to balance workload and maximize bandwidth [3, 5, 30]. Amiri et al. [3] investigated the performance impact of automatic and dynamic partitioning of a stack of components between clients and storage servers. Rodríguez-Martínez and Roussopoulos [29] investigated optimizers for operator placement between clients and servers in relational databases. Parallelism in the form of replicated component copies has not been studied in such work. Scheduling of tasks and dynamic assignment of data among tasks have also been investigated by several researchers. Schnekenburger [30] examines demand-driven assignment of data to processes for external sorting in heterogeneous systems. Arpaci-Dusseau et al. [5] present the notion of distributed queues, which implement a credit-based scheme for adaptive data flow between producer and consumer components, and evaluate their scheme for two SPMD-style data-movement applications with relatively low processing requirements, namely external sort and external hash-join, on a network of workstations. Our work differs from previous work in that (1) we examine multiple policies for scheduling data flow across homogeneous and heterogeneous configurations, (2) we experimentally evaluate the performance impact of placement, parallelism, and scheduling of data flow in a unified environment, and (3) the isosurface application is an example of applications in which some of the components maintain internal state (i.e., data structures that hold intermediate, partial results). This requires a final filter in the pipeline to combine the results from filter copies, so the final output is consistent regardless of how many copies of various filters are instantiated at other pipeline stages. We present experimental results using two different implementations of the application. The first implementation illustrates an application that has multiple phases of processing with synchronization between the phases. The second implementation is fully pipelined, so it is representative of applications that do not require synchronization.

Figure 1. Support for copies. A P, F, C filter group is instantiated using transparent copies distributed across hosts host1 through host5. If a copy of P issues a buffer write operation to the logical stream that connects filter P to filter F, the buffer can be sent to the copy set of F on any of the hosts running copies of F.

2 A Component-based Framework

We have developed a component framework, called DataCutter [6], designed to support subsetting and processing of large datasets. Application components are called filters, and communicate via streams. A filter is a user-defined component that performs application-specific processing on data, and conforms to a simple interface for method callbacks. The interface that a filter writer must implement consists of an initialization function (init), a processing function (process), and a finalization (finalize) function. A stream is a simple unidirectional (producer-consumer) pipe abstraction used for all filter communication, and specifies how filters are logically connected. All transfers to and from streams are through fixed size buffers, similar to message passing. The size of a buffer is determined in the init call, where a filter discloses a minimum and an optional maximum buffer size for each of its streams, and the runtime system chooses the actual size. The current prototype implementation uses TCP for stream communication, but any point-to-point communication library could be used. Filter operations progress as a sequence of application-specific cycles, referred to as a unit-of-work (UOW). An example of a UOW would be rendering of a simulation dataset from a particular viewing direction. A work cycle starts when the filtering service calls the filter init function, which is where any required resources such as memory can be pre-allocated. Next the process function is called to read from any input streams, work on data buffers received, and write to any output streams. The function is expected to return when an end-of-work marker is received (i.e., no more data buffers can be read from the input streams). The finalize function is called after all processing is finished for the current UOW, to allow release of allocated resources such as scratch space.
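As a concrete illustration of this callback interface, the sketch below shows one possible shape for a user-defined filter in C++. The type and method names (StreamBuffer, InputStream, OutputStream, ExampleFilter) are assumptions made for illustration; they are not the actual DataCutter API, which also negotiates per-stream buffer sizes with the runtime.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the runtime's buffer and stream handles;
// DataCutter's real types and signatures may differ.
struct StreamBuffer { std::vector<unsigned char> data; };

struct InputStream {
    // Returns false once the end-of-work marker has been received.
    bool read(StreamBuffer&) { return false; }
};

struct OutputStream {
    void write(const StreamBuffer&) {}
};

// A filter is unaware of how many transparent copies of it are running.
class ExampleFilter {
public:
    // init: disclose a minimum buffer size and pre-allocate resources
    // for the current unit-of-work (UOW).
    void init(std::size_t& minBufferBytes) {
        minBufferBytes = 64 * 1024;
        scratch_.reserve(1 << 20);
    }

    // process: consume buffers until end-of-work, writing results downstream.
    void process(InputStream& in, OutputStream& out) {
        StreamBuffer buf;
        while (in.read(buf)) {       // false means no more buffers for this UOW
            transform(buf);          // application-specific work on the buffer
            out.write(buf);
        }
    }

    // finalize: release per-UOW resources such as scratch space.
    void finalize() { scratch_.clear(); }

private:
    void transform(StreamBuffer&) { /* application-specific processing */ }
    std::vector<unsigned char> scratch_;
};
```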

Transparent Filter Copies. Once the application processing structure has been decomposed into a set of filters, it is possible to use multiple filters for implementing a pipeline of processing on data as it is retrieved from the storage system and transmitted to clients. The processing in filter-based applications is often not well balanced. The imbalance and resulting performance penalty can be addressed using parallelism, by executing multiple copies of a single filter across a set of host machines. This runtime performance optimization targets the use of collections of distributed-memory systems and multiprocessor machines. In DataCutter, we provide support for transparent copies, in which the filter is unaware of the concurrent filter replication. We define a copy set to be all transparent copies of a given filter that are executing on a single host. In the current implementation, the application developer determines (1) the application processing structure via decomposition into filters, (2) the placement of the application filters on hosts in the system, and (3) how many transparent copies of a filter to execute. (We are in the process of examining various mechanisms to automate some of these steps.) The filter runtime system maintains the illusion of a single logical point-to-point stream for communication between a logical producer filter and a logical consumer filter. When the logical producer or logical consumer is transparently copied, the system must decide, for each producer and each buffer, which consumer copy set to send the stream buffer to (see Figure 1). Each copy set shares a single buffer queue, which provides demand-based balance within a single (potentially multiprocessor) host. For distribution between copy sets (on different hosts), we investigate several policies: (1) Round Robin (RR) distribution of buffers among copy sets, (2) Weighted Round Robin (WRR) among copy sets based on the number of copies on each host, and (3) a Demand Driven (DD) sliding window mechanism based on buffer consumption rate. If a filter keeps intermediate results to compute the final output and transparent copies of this filter are executed, an additional application-specific combine filter may be appended to the pipeline to merge partial results into the final output.
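To picture how all transparent copies of a filter on one host can share a single buffer queue, the following sketch shows a minimal thread-safe queue from which idle copies pull buffers on demand. The class name and the use of an empty optional as the end-of-work signal are assumptions for this sketch, not details of the DataCutter runtime.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <optional>

// One queue per copy set: all transparent copies of a filter on a host pop
// from the same queue, so faster copies naturally consume more buffers
// (demand-based balance within a single, possibly multiprocessor, host).
template <typename Buffer>
class CopySetQueue {
public:
    void push(Buffer buf) {
        {
            std::lock_guard<std::mutex> lk(m_);
            q_.push_back(std::move(buf));
        }
        cv_.notify_one();
    }

    // Signal that no more buffers will arrive for this unit-of-work.
    void close() {
        { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
        cv_.notify_all();
    }

    // Each filter copy blocks here until a buffer is available, or returns an
    // empty optional once the queue is closed and drained (end-of-work).
    std::optional<Buffer> pop() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;
        Buffer b = std::move(q_.front());
        q_.pop_front();
        return b;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<Buffer> q_;
    bool closed_ = false;
};
```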


Round Robin and Weighted Round Robin are zero-overhead policies. RR sends buffers in cyclic fashion, one per host containing copies; WRR sends one buffer per filter copy on each host in each cycle, so the number of buffers sent to a host for a given filter is linearly proportional to the number of copies of the filter running on that host. The Demand Driven policy, on the other hand, is designed to send buffers to the filter that will result in the fastest processing. To approximate this, we send each buffer to the filter that has shown good recent performance. When a consumer filter processes a data buffer received from a producer, it sends back an acknowledgment message to the producer indicating that the buffer is now being processed. The producer sends each data buffer to the consumer filter with the least number of unacknowledged buffers. In the event of a tie, a local colocated copy is chosen. The effect is to direct more buffers to faster consumers, at the cost of extra acknowledgment message traffic.
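The selection rules for WRR and DD can be sketched directly from the description above. The bookkeeping structures below are hypothetical, but the logic follows the text: WRR hands out buffers in proportion to the number of copies on each host, and DD picks the consumer copy set with the fewest unacknowledged buffers, preferring a colocated copy set on a tie.

```cpp
#include <cstddef>
#include <vector>

// Per-consumer bookkeeping kept by a producer for one logical stream.
struct ConsumerState {
    std::size_t unacked = 0;   // buffers sent but not yet acknowledged
    bool colocated = false;    // copy set lives on the producer's host
};

// Demand Driven choice: fewest unacknowledged buffers wins; on a tie,
// prefer a colocated copy set to avoid network traffic. Assumes at least
// one consumer is registered.
std::size_t chooseConsumerDD(const std::vector<ConsumerState>& consumers) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < consumers.size(); ++i) {
        const ConsumerState& c = consumers[i];
        const ConsumerState& b = consumers[best];
        if (c.unacked < b.unacked ||
            (c.unacked == b.unacked && c.colocated && !b.colocated)) {
            best = i;
        }
    }
    return best;
}

// Weighted Round Robin: within each cycle, a host receives as many buffers
// as it has filter copies, so buffer counts stay proportional to copy counts.
std::size_t chooseConsumerWRR(const std::vector<std::size_t>& copiesPerHost,
                              std::size_t buffersSentSoFar) {
    std::size_t total = 0;
    for (std::size_t copies : copiesPerHost) total += copies;
    std::size_t slot = buffersSentSoFar % total;
    for (std::size_t host = 0; host < copiesPerHost.size(); ++host) {
        if (slot < copiesPerHost[host]) return host;
        slot -= copiesPerHost[host];
    }
    return 0;  // unreachable when total > 0
}
```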

3 Case Study: Isosurface Rendering

Figure 2. Isosurface rendering and filter implementation. (a) Isosurface rendering of chemical densities in a reactive transport simulation. (b) The application modeled as filters: read dataset (R), isosurface extraction (E), shade + rasterize (Ra), and merge / view (M).

Isosurface rendering is a technique for extracting and visualizing surfaces within a 3D volume [23]. It is a widely used visualization method in many application areas. For instance, it is well suited for visualizing the density distributions of chemicals in a region of an environmental simulation. Figure 2(a) shows a rendering of the output from a reactive transport simulation. In isosurface rendering, the input consists of a three-dimensional grid with scalar values at grid points and a user-defined scalar value, called the isosurface value. The visualization algorithm consists of two main steps. The first step extracts the surface on which the scalar value is equal to the isosurface value. The second step renders the extracted surface (the isosurface) to generate an image. Isosurface rendering belongs to a class of applications that carry out transformation, mapping, and reduction operations on large multi-dimensional datasets for analysis and exploration purposes [13]; these operations are collectively referred to as generalized reduction operations. In this class of applications, input data elements are usually accessed by a range query, which defines a multi-dimensional bounding box in the input space. The input elements that intersect the range query are mapped to the multi-dimensional space of the output and aggregated to produce the output elements. Correctness of the output data values does not depend on the order in which input data items are aggregated; that is, the mapping and aggregation operations are commutative and associative. A user-defined data structure, called an accumulator, is used to maintain intermediate results. For instance, the isosurface extraction step can be viewed as a transformation operation. The rendering step carries out mapping operations from a 3-dimensional input space to a 2-dimensional


output space and the output data values are computed by performing hidden-surface removal operations on the input elements. A z-buffer data structure can be used to hold intermediate results (i.e., partially rendered images) [33]. The processing structure of such applications can be represented as a pipeline of operations on the data. However, it is possible that the stages of the pipeline are not balanced in terms of both the volume of data moved between stages and the computational load of each stage. Moreover, the execution time of a stage may vary greatly based on the characteristics of the input data.
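A short sketch of the generalized reduction structure described above is given below. The element and accumulator types are illustrative, and summation stands in for whatever commutative and associative aggregation the application defines; for rendering, the aggregation is a minimum-distance (z-value) update rather than a sum.

```cpp
#include <cstddef>
#include <vector>

// Illustrative types: the real applications operate on multi-dimensional
// chunks selected by a range query, not flat vectors.
struct InputElement { double value; int outIndex; };   // already mapped to the output space
struct Accumulator  { std::vector<double> cells; };    // user-defined intermediate results

// Because aggregation is commutative and associative, elements can be
// processed in any order, and partial accumulators from different filter
// copies can be merged at the end.
void generalizedReduction(const std::vector<InputElement>& queryResult,
                          Accumulator& acc) {
    // assumes acc.cells has been sized to cover the output space
    for (const InputElement& e : queryResult) {
        acc.cells[static_cast<std::size_t>(e.outIndex)] += e.value;  // aggregate (summation as an example)
    }
}

// Combine two partial accumulators produced by different copies.
void mergeAccumulators(Accumulator& into, const Accumulator& from) {
    for (std::size_t i = 0; i < into.cells.size(); ++i)
        into.cells[i] += from.cells[i];
}
```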

3.1 Implementation

Our implementation consists of a total of four filters (Figure 2(b)): A read (R) filter reads the volume data from local disk, and writes the 3D voxel elements to its output stream. An extract (E) filter reads voxels, produces a list of triangles from the voxels, and writes the triangles (the isosurface) to its output stream. A raster (Ra) filter reads triangles from its input stream, renders the triangles into a two-dimensional image from a particular viewpoint, and writes the image to its output stream. During processing, multiple transparent copies of the computationally expensive raster filter can be executed. In that case, an image containing a portion of the rendered triangles will be generated by each copy of the raster filter. Therefore a merge (M) filter is used to composite the partial results to form the final output image. The merge filter also sends the final image to the client for display.
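One way to make the resulting pipeline concrete is as a small declarative description of filters, copies, and streams, as in the sketch below. The structures and host names are hypothetical, since the paper does not show DataCutter's configuration interface, but they capture the R, E, Ra, M pipeline and the fact that only a single Merge copy is run.

```cpp
#include <string>
#include <utility>
#include <vector>

// Purely illustrative description of the filter pipeline and its placement.
struct FilterSpec {
    std::string name;                 // "R", "E", "Ra", "M"
    std::vector<std::string> hosts;   // one transparent copy per listed host
};

struct PipelineSpec {
    std::vector<FilterSpec> filters;                          // in pipeline order
    std::vector<std::pair<std::string, std::string>> streams; // producer -> consumer
};

PipelineSpec isosurfacePipeline() {
    PipelineSpec p;
    p.filters = {
        {"R",  {"node1", "node2"}},            // read: near the disks holding the data
        {"E",  {"node1", "node2"}},            // extract: colocated with read
        {"Ra", {"node3", "node3", "node4"}},   // raster: extra copies on faster hosts
        {"M",  {"node3"}},                     // merge: exactly one copy
    };
    p.streams = { {"R", "E"}, {"E", "Ra"}, {"Ra", "M"} };
    return p;
}
```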

3.1.1 Extract Filter

We use the marching cubes algorithm [23], which is commonly used for extracting isosurfaces from a rectilinear mesh. In this algorithm, voxels are visited one at a time. At each voxel, the scalar values at the corners of the voxel are compared to the isosurface value. Using the scalar values at the corners, the algorithm generates a set of triangles that approximate the surface passing through the voxel. The extract filter receives data from the read filter in fixed-size buffers. A buffer contains a subset of voxels in the dataset. Note that each voxel is processed independently in the marching cubes algorithm. Therefore, the extract filter can carry out isosurface extraction on the voxels in a buffer as soon as the buffer is read from its input stream. Moreover, multiple copies of the extract filter can be instantiated and executed concurrently. The data transfer between the extract filter and the raster filter is also performed using fixed-size buffers. The extract filter writes the triangles extracted from each voxel in the current input buffer to the output buffer. When the output buffer is full or the entire input buffer has been processed, the output buffer is sent to the raster filter. This organization allows the processing of voxels to be pipelined.
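The buffer handling in the extract filter amounts to packing triangles into fixed-size output buffers and flushing a buffer downstream whenever it fills or the current input buffer is exhausted. The sketch below shows that pattern with hypothetical types; the marching cubes case tables themselves are omitted.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Illustrative triangle record; the real filter emits the triangles produced
// by the marching cubes case tables for each voxel.
struct Triangle { float v[9]; };  // three (x, y, z) vertices

// Packs triangles into fixed-size output buffers and hands each full buffer
// to `send` (a stand-in for the runtime's stream write), so downstream Ra
// copies can start rasterizing before the whole input buffer is processed.
class TriangleBufferWriter {
public:
    TriangleBufferWriter(std::size_t capacity,
                         std::function<void(const std::vector<Triangle>&)> send)
        : capacity_(capacity), send_(std::move(send)) { buf_.reserve(capacity); }

    void add(const Triangle& t) {
        buf_.push_back(t);
        if (buf_.size() == capacity_) flush();
    }

    // Called when the current input buffer (or the unit-of-work) is done.
    void flush() {
        if (!buf_.empty()) { send_(buf_); buf_.clear(); }
    }

private:
    std::size_t capacity_;
    std::function<void(const std::vector<Triangle>&)> send_;
    std::vector<Triangle> buf_;
};
```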

3.1.2 Raster Filter

The raster filter performs rendering of the triangles received in buffers from the extract filter. First, the triangles in the current input buffer are transformed from world coordinates to viewing coordinates (with respect to the viewing parameters). The transformed triangles are projected onto a 2-dimensional image plane (the screen), and clipped to the screen boundaries. Then the filter carries out hidden-surface removal and shading of triangles to produce a realistic image. Hidden-surface removal determines which polygon is visible at a pixel location on the image plane. We now briefly discuss two hidden-surface removal methods: z-buffer rendering [33] and active pixel rendering [22]. More detailed descriptions of these algorithms can be found in [7]. For the following discussion, a pixel location (x, y) on the image plane is said to be active if at least one pixel is generated for that location. Otherwise, it is called an inactive pixel location.

Z-buffer Rendering: In this method, a 2-dimensional array, called a z-buffer, is used for hidden-surface removal. Each z-buffer entry corresponds to a pixel in the image plane, and stores a color value (an RGB value) and a z-value. The (z-value, color) pair stores the distance and color of the foremost polygon at a pixel location. A full z-buffer is allocated and initialized in the init method of each instantiated copy of the raster filter. As the raster filter reads data buffers from its input stream, the triangles in each buffer are rasterized to generate pixel values in the z-buffer. When the end-of-work marker is received from the input stream, the raster filter sends the contents of the entire z-buffer to the merge filter in fixed-size buffers. The merge filter also allocates and initializes a z-buffer when it is instantiated. As data is received from the raster filter, the received z-buffer values are merged with the values stored in the local z-buffer, so that the final z-buffer contains pixel values with the smallest distance to the screen. After all buffers are received and processed, the merge filter extracts color values from the z-buffer and generates an image for the client. We can view the z-buffer rendering algorithm as having two phases: (1) a local rendering phase, in which input buffers are processed in the raster filter, and (2) a pixel merging phase, in which z-buffers are merged in the merge filter. The end-of-work marker behaves as a synchronization point between the two phases and among all copies of the raster filter, since all the copies wait until receiving this marker to start sending data to the merge filter. Moreover, z-buffer rendering may introduce high communication overhead in the pixel merging phase, because pixel information for inactive pixel locations is also transmitted.
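A minimal sketch of the pixel merging phase is shown below, assuming a flat array of (z-value, color) entries and chunked transfers whose starting offset is known to the merge filter; buffer framing and the final color extraction are omitted.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// One z-buffer entry per pixel: distance of the foremost polygon and its color.
struct ZEntry {
    float z = std::numeric_limits<float>::max();  // "infinitely far" initially
    unsigned char r = 0, g = 0, b = 0;
};

using ZBuffer = std::vector<ZEntry>;  // width * height entries

// Merge a z-buffer chunk received from a raster copy into the merge filter's
// local z-buffer, keeping the entry closest to the screen at each pixel.
// `offset` is the pixel index where this chunk starts; carrying it in a buffer
// header is an assumption of this sketch.
void mergeZBuffer(ZBuffer& local, const ZBuffer& received, std::size_t offset) {
    for (std::size_t i = 0; i < received.size(); ++i) {
        if (received[i].z < local[offset + i].z) {
            local[offset + i] = received[i];
        }
    }
}
```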

Active Pixel Rendering: This algorithm utilizes a modified z-buffer scheme to store the foremost pixels efficiently in consecutive memory locations. Active pixel rendering essentially uses a sparse z-buffer. Two arrays are used for hidden-surface removal. The first array is the Winning Pixel Array (WPA), which is used to store the foremost (winning) pixels. Each entry in this array contains the position on the screen, distance to the screen, and color (RGB) value of a foremost pixel. In our implementation, fixed-size buffers for the output stream of the raster filter are used for the WPA. The second array, the Modified Scanline Array (MSA), whose size is the x-resolution of the screen, is an integer array and serves as an index into the WPA for the scanline being processed. During the processing of a triangle, when a pixel is generated for a position in the current scanline, the MSA entry for that position is used to access the WPA; the value stored in the MSA entry is used as an index into the WPA. The WPA is sent to the merge filter when it is full or when all triangles in the current input buffer have been processed. In this way, processing of triangles in the raster filter is overlapped with merging operations in the merge filter. Also, unlike the z-buffer algorithm, there is no synchronization point between the local rendering phase and the pixel merging phase, allowing pipelining of all filters, including the merge.
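The active pixel bookkeeping can be sketched as follows. The field layouts and the use of -1 as the inactive marker are assumptions, but the structure mirrors the text: the MSA maps a position in the current scanline to an entry in the WPA, and only active pixels are stored and later shipped to the merge filter.

```cpp
#include <cstddef>
#include <vector>

// Sketch of the active pixel scheme: only pixels that were actually generated
// (active pixels) are stored.
struct WinningPixel {
    int x, y;               // screen position
    float z;                // distance to the screen
    unsigned char r, g, b;  // color
};

class ActivePixelScanline {
public:
    explicit ActivePixelScanline(std::size_t xResolution)
        : msa_(xResolution, -1) {}        // -1 marks an inactive pixel location

    // Record a candidate pixel for column x (0 <= x < x-resolution) of the
    // current scanline, keeping only the foremost (smallest z) pixel.
    void addPixel(int x, int y, float z, unsigned char r, unsigned char g, unsigned char b) {
        int idx = msa_[static_cast<std::size_t>(x)];
        if (idx < 0) {                                    // first pixel at this location
            msa_[static_cast<std::size_t>(x)] = static_cast<int>(wpa_.size());
            wpa_.push_back({x, y, z, r, g, b});
        } else if (z < wpa_[static_cast<std::size_t>(idx)].z) {
            wpa_[static_cast<std::size_t>(idx)] = {x, y, z, r, g, b};
        }
    }

    // The WPA is what gets packed into fixed-size buffers for the merge filter.
    const std::vector<WinningPixel>& winningPixels() const { return wpa_; }

    void resetScanline() { msa_.assign(msa_.size(), -1); }  // move to the next scanline

private:
    std::vector<int> msa_;             // Modified Scanline Array: index into WPA
    std::vector<WinningPixel> wpa_;    // Winning Pixel Array: active pixels only
};
```

In the actual filter, the WPA contents are packed into the raster filter's fixed-size output buffers and sent whenever a buffer fills or the current input buffer has been fully processed, which is what allows the merge filter to run concurrently.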








4 Experimental Results

We now evaluate policies for scheduling data flow and demonstrate the implications of structuring the isosurface rendering application in different ways. We consider three causes of heterogeneity: (1) CPU, disk, and memory resources are non-uniform, perhaps caused by multiple purchases of equipment over time; (2) space availability can dictate sub-optimal placement of the application dataset on disks, causing non-uniform data retrieval costs; and (3) shared resources can result in varying resource availability. We carried out the experiments on a heterogeneous collection of Linux PC clusters at the University of Maryland. This collection consists of one cluster (Red) with 8 2-processor Pentium II 450MHz nodes, each of which has 256MB memory and one 18GB SCSI disk, one cluster (Deathstar) with one 8-processor Pentium III 550MHz node and 4GB memory, one cluster (Blue) with 8 2-processor Pentium III 550MHz nodes, each having 1GB memory and two 18GB SCSI disks, and one cluster (Rogue) with 8 1-processor Pentium III 650MHz nodes, each of which has 128MB memory and two 75GB IDE disks. The nodes of the Red and Blue clusters are interconnected via Gigabit Ethernet, whereas those of Rogue are interconnected via Switched Fast Ethernet (100Mbit). The Red, Blue, and Rogue clusters are connected to each other via Gigabit Ethernet, and Deathstar is connected to the other clusters via Fast Ethernet. We experimented using datasets generated by the parallel environmental simulator ParSSim [4], developed at the Texas Institute for Computational and Applied Mathematics (TICAM) at the University of Texas. For the experiments in Section 4.1, we used a 1.5GB dataset that contains the simulation output for fluid flow and the transport of four chemical species over 10 time steps on a rectilinear grid. The grid at each time step was partitioned into 1536 equal sub-volumes in three dimensions. For the experiments in the other sections, a 25GB dataset was used. This dataset contains output over 10 time steps on a larger rectilinear grid. The grid at each time step is roughly 2.5GB and was partitioned into 24,576 equal sub-volumes in three dimensions. Both datasets were declustered across 64 data files using a Hilbert curve-based declustering algorithm [14], and these files were distributed across the disks in different experimental configurations. We rendered a single isosurface into a 512x512 or 2048x2048 RGB image. The timing values presented are the average of rendering five consecutive timesteps from the dataset.

4.1 Baseline Experiments

Tables 1 and 2 show the results for a baseline experiment where each of the four filters is isolated and executed on a separate host in pipeline fashion, using the first dataset. The output is a 2048x2048 RGB image in these experiments. As is seen from the tables, the active pixel version sends many more buffers, but they have a smaller total size. Raster is by far the most expensive filter. One of the degrees of freedom in a component-based implementation is deciding how to decompose the application processing structure into components. We have experimented with three different configurations of the isosurface application filters described in Section 3, where multiple Read, Extract, and Raster filters are combined into a single filter (Figure 3). In all configurations, the Merge filter is always a separate filter, and only one copy of Merge is executed on one of the nodes in all experiments.

4.2 Comparison with Active Data Repository

The Active Data Repository (ADR) [12, 15] is a highly parallel framework and runtime system, designed to efficiently support parallel applications that perform generalized reduction operations (see Section 3) on a homogeneous parallel computer or cluster. Great care is taken to partition and distribute datasets, to maintain an optimal number of active asynchronous disk I/O calls, and to overlap as much computation and communication as possible. The key weakness of ADR is the expectation of resource homogeneity and the impact of static partitioning on load balance. In contrast, we expect the component-based framework to allay some of the load balance problems when there is heterogeneity. This set of experiments is designed to quantify these effects. We use the Z-buffer rendering algorithm for the ADR implementation, since Z-buffer better matches the programming model of ADR. Active Pixel rendering, on the other hand, better matches the component-based implementation. For all experiments, we present response times for rendering timesteps of the stored simulation runs. For the component-based implementations, we used the RE–Ra–M configuration (see Figure 3). For the first experiment, the setup is designed to fit what the ADR version expects, namely homogeneity. We carried out this experiment on the Rogue cluster. The experiments render a single isosurface into a 512x512 or 2048x2048 RGB image. The timing values presented are the average of five consecutive timesteps. In all experiments, the dataset is uniformly partitioned over the nodes in use. Between runs, the disk cache was cleared by unmounting and remounting the local filesystems where the datasets are stored, which invalidates all disk cache entries. For the component-based version, we present results for both the Z-buffer rendering algorithm and the Active Pixel algorithm.

                 R -> E    E -> Ra    Ra -> M (Z-buffer)    Ra -> M (Active Pixel)
number            443       470             16                     469
volume (MB)       38.6      11.8            32.0                   28.5

Table 1. Number of buffers and data volume (in MB) transferred between filters for the Z-buffer and Active Pixel isosurface rendering implementations.

                   R                    E             Ra            M      Sum
z-buffer       0.68   5.3   1.65   13.0    9.43   74.5   0.90   7.1   12.66
active pixel   0.64   4.3   1.64   11.2   11.67   79.5   0.73   5.0   14.68

Table 2. Processing times for the isosurface rendering filters (in seconds).

Figure 3. Experimental configurations with different combinations of Read, Extract, and Raster as filters: (a) the RERa–M configuration, (b) the RE–Ra–M configuration, and (c) the R–ERa–M configuration.


Figure 4 displays a graph with the overall behavior using a set of dedicated Rogue nodes. As expected, the ADR (Z-buffer) version performs better than the component-based Z-buffer implementation in some cases, primarily because ADR is tuned for exactly this type of accumulator-based application. The component-based Z-buffer and ADR versions perform essentially the same algorithm, and in the worst case the component-based Z-buffer version is slower. For 2 or more nodes, the Active Pixel version is about the same as or faster than the ADR version, with the best case occurring at 8 nodes. The main point here is that the component-based versions are competitive with the original ADR version, even when using the Z-buffer algorithm, which is known to stall the pipeline.

Figure 4. Isosurface rendering times (absolute) for the original ADR implementation and the two component-based versions, as the number of homogeneous nodes is varied. Two output image sizes (512x512 and 2048x2048) are shown. "DC Z-buffer" and "DC Active Pixel" denote the DataCutter implementations of Z-buffer and component-based Active Pixel, respectively.

The component-based framework is designed to deal with heterogeneity; therefore, the next experiment is set up to compare the original ADR version with the same component-based versions in a more heterogeneous environment. We achieve heterogeneity in two ways. First, we use half Rogue nodes and half Blue nodes for each experiment. Second, we execute a varying number of background user processes that are at the same priority as the isosurface application, causing competition for CPU time. The background jobs are run on every Rogue node, and the Blue

nodes are dedicated to the isosurface application. We performed three sets of experiments for the heterogeneous case using a total of 4, 8, and 16 nodes (half Rogue and half Blue). The execution times are displayed in Figure 5. Each graph's bars are normalized to the original ADR value, because the execution times vary greatly. These graphs clearly show that, regardless of the total number of CPUs, the component-based versions work well when there is heterogeneity resulting from background jobs on the Rogue nodes. The original ADR version does outperform the component-based versions for the cases with low background load and a larger number of total nodes. This is because ADR is designed and highly tuned to support SPMD accumulator-based applications; a low number of background jobs is essentially similar to the homogeneous case, so ADR does well. In contrast, the component-based framework is designed and tuned for pipelining and heterogeneity, but it can still outperform ADR in several key cases. The ADR version performs poorly as the number of background jobs increases from 0 to 16. Moreover, the performance degradation is more pronounced for the 2048x2048 case. This is because the larger output image size means the Raster filter has more work to do, and since the ADR version cannot offload work from the heavily loaded Rogue nodes to the dedicated Blue nodes, the resulting load imbalance is more pronounced. Second, both component-based versions exhibit stable behavior as background load increases. Table 3 shows the effect of the Demand Driven (DD) writer policy on the average number of buffers received by the Raster filter on each type of node (Rogue is heavily loaded, and Blue is unloaded). As the number of background jobs increases, the number of buffers sent to the Raster filters on Rogue nodes decreases, because they are instead sent to the Blue node Raster filters, which exhibit faster processing rates. This balances the load and results in overall reduced application execution times. Both component-based versions can transparently and effectively deal with the heterogeneity. In many cases, coexistence with other applications and loads is inevitable, and getting good performance under unexpected loads without user intervention is a great benefit. The results show that overly tuned systems that require a dedicated homogeneous parallel machine are not practical for a heterogeneous environment.

4.3 Writer Policies with Background Load

In this experiment, we examine the performance of the application when there is computational load imbalance among the nodes in the system due to other user processes, for the RR and DD load balancing strategies. One copy of each filter (excluding Merge) was executed on each of eight nodes of the Rogue cluster. Background load was added to 4 of the nodes by executing a user-level job that consumes CPU time at the same priority as the filter code. The node running the Merge filter does not have a background job. As shown in Table 4, increasing the number of background jobs on a subset of the 8 nodes increases the execution time. Since only half the nodes are actually experiencing the background load, there is load imbalance between the two sets of nodes. For the round-robin policy (RR), the combined filter that contains Read sends one buffer to every consuming filter. For the Demand Driven (DD) policy, when a consumer filter processes a data buffer received from a producer, it sends back an acknowledgment message (indicating that the buffer has been consumed) to the producer. The producer chooses the consumer filter with the least number of unacknowledged buffers to send a data buffer. As is seen from the table, the DD policy deals best with the load imbalance, by dynamically scheduling buffers among the filters. The main tradeoff with DD is the overhead of the acknowledgment messages for each buffer versus the benefits of adjusting to load imbalance. Our results show that in this machine configuration the overhead is relatively small. The DD policy achieves better performance than RR in some cases, even when background load is low. We attribute this result to the fact that buffers from a read filter may be directed by DD to the local copies of the other filters on the same host, thus reducing the communication overhead. The acknowledgment message for a buffer sent to a remote copy is received only after the buffer has been consumed by the remote copy. This mechanism allows DD to implicitly take communication overhead into account. The RERa–M configuration does not show much improvement using DD, because a single combined filter allows no demand-driven distribution of buffers among copies, and the output of each filter goes to the single Merge copy. In most cases, the RE–Ra–M configuration shows the best performance, since the raster filter is the most compute-intensive filter among all the filters, and the volume of data transfer between RE and Ra is less than that between R and ERa in these experiments.

4.4 Writer Policies with Compute Nodes

In this experiment, we use the 8-way multiprocessor machine as a compute node and vary the number of 2-processor Red nodes (Figure 6). The output is a 2048x2048 image in these experiments. We look at the performance of different filter configurations using the active pixel algorithm. Merge is executed on the 8-processor node along with 7 ERa or Ra filters. One copy of each filter except Merge is run on every 2-processor node for each configuration. The dataset is partitioned among the disks on the 2-processor nodes. Note that while interprocessor communication is done over Gigabit Ethernet among the 2-processor nodes, all communication to the 8-processor node goes over slower 100Mbit Ethernet.

Figure 5. Isosurface rendering times (normalized to original) for the original ADR implementation and the two component-based versions, as the number of Rogue background jobs is varied: (a) 2 Rogue + 2 Blue nodes, (b) 4 Rogue + 4 Blue nodes, (c) 8 Rogue + 8 Blue nodes. Two output image sizes (512x512 and 2048x2048) are shown. "DC Z-buffer" and "DC Active Pixel" denote the DataCutter implementations of Z-buffer and component-based Active Pixel, respectively.


Average number of E-Ra buffers received per node class
                           512x512 Output Image            2048x2048 Output Image
                        DC Z-buffer    DC A.Pixel        DC Z-buffer    DC A.Pixel
  #bg                 rogue   blue    rogue   blue      rogue   blue    rogue   blue
2 Rogue + 2 Blue nodes
  0                    3516   3874     3564   3826       3520   3870     3585   3805
  1                    3533   3857     3559   3831       3358   4032     3300   4090
  4                    3279   4111     3120   4270       2308   5082     1928   5462
  16                   1704   5686     1551   5839        741   6649      806   6584
4 Rogue + 4 Blue nodes
  0                    1936   1758     2079   1616       2043   1652     2043   1652
  1                    1932   1763     2056   1639       1709   1986     1696   1998
  4                    1825   1870     1856   1840       1082   2613     1189   2506
  16                    921   2774     1184   2511        376   3319      497   3198
8 Rogue + 8 Blue nodes
  0                     910    937      939    909        897    951      870    977
  1                     910    938      929    918        801   1046      804   1044
  4                     874    973      865    983        518   1330      570   1277
  16                    617   1230      580   1267        185   1662      328   1520

Table 3. Average number of buffers received per Raster filter per node class over the E-Ra stream. "DC Z-buffer" and "DC A.Pixel" denote the DataCutter implementations of Z-buffer and component-based Active Pixel, respectively.

Figure 6. Heterogeneous collection of nodes experimental setup, varying the number of 2-processor nodes.

As is seen in Table 5, in all cases the RE–Ra–M configuration performs better than R–ERa–M because of its lower communication volume. When the number of 2-processor nodes is high, we do not obtain performance benefits from using the 8-processor node. The main reason is the slow network connection to and from the 8-processor node, which causes too much communication overhead. Nevertheless, use of the 8-processor node improves performance when data is partitioned on a small number of nodes. Decomposing the application into an appropriate set of filters allows us to move part of the computation to the 8-processor node, resulting in better parallelism. Our results also show that WRR achieves the best performance among all the DataCutter policies in this experiment. Note that there are no background jobs running on the nodes. Hence, DD should behave similarly to WRR, but acknowledgment messages from filters running on the 8-processor node incur high overhead in DD because of the slow network connection.

4.5 Skewed Distribution of Data

In this experiment, we used two Blue and two Rogue nodes and varied the distribution of the 25GB dataset between the two clusters. The output is a 2048x2048 image. The balanced case denotes the even partitioning of data files across all the nodes. For the skewed cases, we moved P percent (where P ranged from 25 to 75) of the files from the Blue nodes to the Rogue nodes and distributed them evenly across the Rogue nodes. As we see from Figure 7, the RERa–M configuration is the most sensitive to the skewed data distribution, because all the processing is done in SPMD fashion and the total time is equal to the execution time of the slowest node, which in this case is the one with the most data.

                               512x512 Output Image              2048x2048 Output Image
                            Active Pixel      Z-buffer         Active Pixel      Z-buffer
# Bg Jobs  Configuration     RR      DD      RR      DD         RR      DD      RR      DD
0          RERa–M           8.68    7.97    8.37    8.39       12.55   11.81   24.41   24.41
           RE–Ra–M          6.79    6.08    7.38    6.38       10.33    6.08   31.08   26.97
           R–ERa–M          9.68    7.33   10.35    7.43       13.72    8.37   34.94   28.68
1          RERa–M           8.67    8.02    8.35    8.39       12.49   11.85   24.61   24.66
           RE–Ra–M          6.84    5.86    7.29    6.49       10.30    6.88   31.25   27.81
           R–ERa–M          9.77    7.43   10.20    7.80       14.35    9.04   34.86   29.60
4          RERa–M          10.61   10.24   10.29   10.26       15.51   14.51   27.48   27.45
           RE–Ra–M          8.32    6.91    9.12    7.74       12.89    8.25   33.89   29.09
           R–ERa–M         11.84    9.39   13.06    9.42       17.13   11.13   38.44   31.87
16         RERa–M          35.92   33.87   34.68   34.68       52.68   48.66   64.08   64.07
           RE–Ra–M         24.28   20.05   23.89   19.90       38.94   20.56   60.42   40.72
           R–ERa–M         35.71   23.56   32.92   21.87       50.00   24.04   73.77   43.71

Table 4. Comparison of execution time (seconds) for filter configurations with background jobs. A total of 8 nodes in the Rogue cluster are used, with 7 nodes executing 1 copy of each of the filters except the merge filter (and background jobs on 4 of those nodes), and 1 node executing 1 copy of all filters including the merge filter. Load balancing policies are round robin (RR) and demand driven (DD).

                 1 data node          2 data nodes         4 data nodes         8 data nodes
Configuration   RR    WRR   DD       RR    WRR   DD       RR    WRR   DD       RR    WRR   DD
RE–Ra–M        7.26  3.02  3.04     5.69  2.62  2.99     4.37  2.98  3.50     3.75  2.89  3.83
R–ERa–M        8.22  3.97  4.22     6.47  4.08  3.99     5.10  4.07  4.11     4.00  3.69  4.20

Table 5. Comparison of execution time (seconds) for filter configurations using the active pixel algorithm while varying the number of nodes storing the dataset. The 8-processor node is a compute node.

By decoupling all the processing from the retrieval of data, the R–ERa–M configuration achieves better performance by allowing the data retrieved on slow nodes to be processed on other nodes. The demand driven policy helps improve the performance further by enabling data buffers to be processed on the fastest nodes. The decoupled RE–Ra–M configuration has the same properties as the R–ERa–M configuration, but it sends less data over the network, and thus obtains the best performance among all the filter configurations.

Figure 7. Rendering times for balanced and skewed (25%, 50%, 75%) distributions of data, for each filter configuration under the RR, WRR, and DD policies with active pixel rendering, using a collection of two nodes in the Blue cluster and two nodes in the Rogue cluster.

5 Related Work

The ABACUS framework [3] addresses the automatic and dynamic placement of functions in data-intensive applications between clients and storage servers. ABACUS supports applications that are structured as a chain of function calls, and can partition the functions between the client and server. MOCHA [29] is a database middleware system designed to interconnect data sources distributed over a network. It can dynamically deploy user-defined operators, implemented in Java, to hosts for execution of queries. Placement is based on the expected size of data transfers between operators. The work shows how an optimizer customized to deal with "data-inflating" and "data-reducing" operators can improve performance. MOCHA can leverage total knowledge about query selectivities stored in its catalog, whereas DataCutter deals with arbitrary application code with no such useful information. ACDS (Adapting Computational Data Streams) [19] is a framework that addresses construction and adaptation of computational data streams in an application. A computational object performs filtering and data manipulation, and data streams characterize data flow from data servers or from running simulations to the clients of the application. The dQUOB (dynamic QUery OBjects) system [27] consists of a runtime and compiler environment that allows an application to insert computational entities, called quoblets, into data streams. The data streams targeted by the dQUOB system are those used in large-scale visualizations, in video streaming to a large number of end users, and in large volumes of business transactions. Dv [2] is a framework for developing applications


for distributed visualization of large scientific datasets on a computational grid. The Dv framework is based on the notion of active frames. An active frame is an application-level mobile object that contains application data, called frame data, and a frame program that processes the data. Active frames are executed by active frame servers running on the machines at the client and at remote sites. Parallel file systems and I/O libraries have been a widely studied research topic, and many systems and libraries have been developed [10, 21, 24, 32]. These systems aim to achieve high I/O bandwidth by supporting regular strided access to uniformly distributed datasets, such as images, maps, and dense multi-dimensional arrays. Several research projects have also focused on providing data access in a Grid environment [16], such as the Storage Resource Broker (SRB) [31], which provides uniform UNIX-like I/O interfaces and meta-data management services to access collections of distributed data resources. However, providing support for efficient processing of data is not a part of these systems. Armada [26] is a flexible parallel file system framework being developed to enable access to remote and distributed datasets through a network (armada) of application objects, called ships. The system provides authorization and authentication services, and runtime support for the collection of application specific ships to run on I/O nodes, compute nodes, and other nodes on the network.

Several researchers have concurrently and independently explored the concept of Active Disks (alternate names include Intelligent Disks and Programmable Disks), which allows processing to be performed within the disk subsystem [1, 20, 28]. Research in this area can be roughly divided into two categories: application processing in Active Disks and system-level processing in Active Disks. Riedel et al. [28] investigated the performance of Active Disks for data mining and multimedia algorithms and developed an analytical performance model. The ISTORE project [9] uses the IDISK [20] architecture as a building block to create a meta-appliance: a storage infrastructure that can be tailored for specific applications. Gibson et al. [17] are investigating the use of disk processors for performing file-system and security-related processing on network-attached disks. Van Meter et al. [25] investigated mechanisms for securely accessing shared network-attached peripherals in an untrusted network. Acharya et al. [1] introduced a stream-based programming model for disklets and their interaction with host-resident peers. Restructured versions of a wide range of data-intensive applications (from databases, data mining, data warehousing, and image databases) have also been developed in that work.


6 Conclusions

Our results show that the component-based implementation of isosurface rendering allows a flexible implementation for achieving good performance in a heterogeneous environment and when the load on system resources changes. Partitioning the application into filters and executing multiple copies of bottleneck filters results in parallelism and better utilization of resources. A demand driven buffer distribution scheme among multiple filter copies achieves better performance than the other policies when the bandwidth of the interconnect is reasonably high and the system load changes dynamically. However, the extra communication required for acknowledgment messages introduces too much overhead when the network is slow. We plan to further investigate methods to reduce the communication overhead in DD and to look at other dynamic strategies for buffer distribution.

The active pixel algorithm is a better alternative to the z-buffer algorithm in the component-based implementation of isosurface rendering. It makes better use of system memory, and allows the rasterization and merging operations to be pipelined and overlapped. We note that as the number of copies of other filters or the number of nodes increases, the merge filter becomes a bottleneck. The current approach replicates the image space across the raster filters. Alternatively, we could partition the image space into subregions among the raster filters, thus eliminating the merge filter. However, this would cause load imbalance among raster filters if the amount of data for each subregion is not the same, or if it varies over time. We plan to investigate a hybrid strategy that combines image partitioning and image replication in future work.

References

[1] A. Acharya, M. Uysal, and J. Saltz. Active disks: Programming model, algorithms and evaluation. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VIII), pages 81–91. ACM Press, Oct. 1998. ACM SIGPLAN Notices, Vol. 33, No. 11. [2] M. Aeschlimann, P. Dinda, J. Lopez, B. Lowekamp, L. Kallivokas, and D. O'Hallaron. Preliminary report on the design of a framework for distributed visualization. In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'99), pages 1833–1839, Las Vegas, NV, June 1999. [3] K. Amiri, D. Petrou, G. R. Ganger, and G. A. Gibson. Dynamic function placement for data-intensive cluster computing. In the USENIX Annual Technical Conference, San Diego, CA, June 2000. [4] T. Arbogast, S. Bryant, C. Dawson, and M. F. Wheeler. ParSSim: The parallel subsurface simulator, single phase. http://www.ticam.utexas.edu/ arbogast/parssim.

[5] R. H. Arpaci-Dusseau, E. Anderson, N. Treuhaft, D. E. Culler, J. M. Hellerstein, D. A. Patterson, and K. Yelick. Cluster I/O with River: Making the fast case common. In IOPADS ’99: Input/Output for Parallel and Distributed Systems, May 1999. [6] M. Beynon, T. Kurc, A. Sussman, and J. Saltz. Design of a framework for data-intensive wide-area applications. In Proceedings of the 9th Heterogeneous Computing Workshop (HCW2000), pages 116–130. IEEE Computer Society Press, May 2000. [7] M. D. Beynon, T. Kurc, U. Catalyurek, A. Sussman, and J. Saltz. A component-based implementation of iso-surface rendering for visualizing large datasets. Technical Report CS-TR-4249 and UMIACS-TR-2001-34, University of Maryland, Department of Computer Science and UMIACS, May 2001. [8] M. D. Beynon, A. Sussman, U. Catalyurek, T. Kurc, and J. Saltz. Performance optimization for data intensive grid applications. In Proceedings of the Third Annual International Workshop on Active Middleware Services (AMS2001), Aug. 2001. [9] A. Brown, D. Oppenheimer, K. Keeton, R. Thomas, J. Kubiatowicz, and D. Patterson. ISTORE: Introspective storage for data-intensive network services. In Proceedings of the 7th Workshop on Hot Topics in Operating System (HotOS-VII), Mar 1999. [10] P. H. Carns, W. B. Ligon, R. B. Ross, and R. Thakur. PVFS: A parallel file system for Linux clusters. In Proceedings of the 4th Annual Linux Showcase and Conference, pages 317– 327, Oct. 2000. [11] Common Component Architecture Forum. http://www.ccaforum.org. [12] C. Chang, R. Ferreira, A. Sussman, and J. Saltz. Infrastructure for building parallel database systems for multidimensional data. In Proceedings of the Second Merged IPPS/SPDP Symposiums. IEEE Computer Society Press, Apr. 1999. [13] C. Chang, T. Kurc, A. Sussman, and J. Saltz. Optimizing retrieval and processing of multi-dimensional scientific datasets. In Proceedings of the Third Merged IPPS/SPDP (14th International Parallel Processing Symposium & 11th Symposium on Parallel and Distributed Processing). IEEE Computer Society Press, May 2000. [14] C. Faloutsos and P. Bhagwat. Declustering using fractals. In Proceedings of the 2nd International Conference on Parallel and Distributed Information Systems, pages 18–25, Jan. 1993. [15] R. Ferreira, T. Kurc, M. Beynon, C. Chang, A. Sussman, and J. Saltz. Object-relational queries into multi-dimensional databases with the Active Data Repository. Parallel Processing Letters, 9(2):173–195, 1999. [16] I. Foster and C. Kesselman. The GRID: Blueprint for a New Computing Infrastructure. Morgan-Kaufmann, 1999. [17] G. Gibson, D. Nagle, K. Amiri, F. W. Chang, E. M. Feinberg, H. Gobioff, C. Lee, B. Ozceri, E. Riedel, D. Rochberg, and J. Zelenka. File server scaling with network-attached secure disks. In Proc. of the ACM International Conference on Measurement and Modeling of Computer Systems (Sigmetrics ’97), 1997. [18] Global Grid Forum. http://www.gridforum.org.


[19] C. Isert and K. Schwan. ACDS: Adapting computational data streams for high performance. In 14th International Parallel & Distributed Processing Symposium (IPDPS 2000), pages 641–646, Cancun, Mexico, May 2000. IEEE Computer Society Press.


[20] K. Keeton, D. A. Patterson, and J. M. Hellerstein. A case for intelligent disks (IDISKS). ACM SIGMOD Record, 27(3):42–52, Sept. 1998. [21] D. Kotz. Disk-directed I/O for MIMD multiprocessors. In Proceedings of the 1994 Symposium on Operating Systems Design and Implementation, pages 61–74. ACM Press, Nov. 1994.


[22] T. Kurc, C. Aykanat, and B. Ozguc. Object-space parallel polygon rendering on hypercubes. Computers & Graphics, 22(4):487–503, 1998. [23] W. Lorensen and H. Cline. Marching cubes: a high resolution 3d surface reconstruction algorithm. Computer Graphics, 21(4):163–169, 1987.


[24] J. M. May. Parallel I/O for High Performance Computing. Morgan Kaufmann Publishers, 2000. [25] R. V. Meter, S. Hotz, and G. Finn. Derived virtual devices: A secure distributed file system mechanism. In Proceedings of the Fifth NASA Goddard Space Flight Center Conference on Mass Storage Systems and Technologies, Sept. 1996.


[26] R. Oldfield and D. Kotz. Armada: A parallel file system for computational. In Proceedings of CCGrid2001: IEEE International Symposium on Cluster Computing and the Grid, Brisbane, Australia, May 2001. IEEE Computer Society Press. [27] B. Plale and K. Schwan. dQUOB: Managing large data flows using dynamic embedded queries. In IEEE International High Performance Distributed Computing (HPDC), August 2000.


[28] E. Riedel, C. Faloutsos, and G. Gibson. Active Storage for Large-Scale Data Mining and Multimedia Applications. In Proceedings of VLDB'98, 1998. [29] M. Rodríguez-Martínez and N. Roussopoulos. MOCHA: A self-extensible database middleware system for distributed data sources. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD00), pages 213–224. ACM Press, May 2000. ACM SIGMOD Record, Vol. 29, No. 2. [30] T. Schnekenburger. External sorting for databases in distributed heterogeneous systems. In Conference on Economical Parallel Processing, pages 9–16, Vienna, Austria, Feb 1993. [31] SRB: The Storage Resource Broker. http://www.npaci.edu/DICE/SRB/index.html. [32] R. Thakur, A. Choudhary, R. Bordawekar, S. More, and S. Kuditipudi. Passion: Optimized I/O for parallel applications. IEEE Computer, 29(6):70–78, June 1996. [33] A. Watt. Fundamentals of three-dimensional computer graphics. Addison Wesley, 1989.

Biographies

Michael D. Beynon is a research scientist at MIT Lincoln Laboratory in the Advanced Networks and Applications group. He received his Ph.D. from the Department of Computer Science at the University of Maryland. His research interests are in the development of programming models, runtime systems, and infrastructure to support the execution of data-intensive applications in high performance clustered, parallel, and distributed environments.

Tahsin Kurc is an assistant professor in the Department of Biomedical Informatics at the Ohio State University. His research interests include algorithms and systems software for high-performance and data-intensive computing, and high-performance computing in scientific visualization.

Umit Catalyurek is an assistant professor in the Department of Biomedical Informatics at the Ohio State University. His research interests include graph and hypergraph partitioning algorithms, and software systems for high-performance and data-intensive computing.

Alan Sussman is an assistant professor in the Computer Science Department at the University of Maryland. His research interests include high performance runtime and compilation systems for data- and compute-intensive applications.

Joel Saltz is the chair of the Department of Biomedical Informatics and a professor of Computer Science at the Ohio State University. His research interests are in the development of systems software, databases, and compilers for the management, processing, and exploration of very large datasets.