Toward Automatic Parallelization of Spatial Computation for Computing Clusters
Baoqiang Yan
Philip J. Rhodes
Department of Computer and Information Science University of Mississippi University, MS 38677 (01) 662-9155023
Department of Computer and Information Science University of Mississippi University, MS 38677 (01) 662-9157082
[email protected]
[email protected]

ABSTRACT
High performance parallel computing infrastructures, such as computing clusters, have recently become freely available for scientific researchers to solve problems of unprecedented scale through data parallelization. However, scientists are not necessarily skilled in writing efficient parallel code, especially when dealing with spatial datasets. Two important performance issues are heavy I/O costs and communication overhead. To address these issues, we are developing a scheme that helps scientists realize I/O friendly and scalable data parallelization for spatial computation.

Built upon our iteration aware spatial prefetching and caching techniques, this data parallelization scheme takes an explicit specification of data dependency, identifies the best feasible access patterns while applying a set of I/O efficiency rules, and then wraps them in separate spatial data iterators for efficient cache loading and data partitioning respectively. This scheme prioritizes but reconciles the I/O costs in the different stages of a data intensive cluster application to achieve the best overall I/O performance while maintaining fair computational scalability.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications — Scientific Databases; D.1.3 [Programming Techniques]: Concurrent Programming — Parallel Programming

General Terms
Design, Experimentation, Performance

Keywords
Spatial Data, I/O optimization, Caching, Dependency, Locality, Data Parallelization, Access Pattern, Cluster Computing

1. INTRODUCTION
Due to massive gains in processor performance over past decades and the wide availability of parallel computing resources such as computing clusters, scientists are now able to approach problems of unprecedented scale through data parallelization. Unfortunately, gains in processing power have not been matched with similar gains in I/O performance due to disk and network latencies. Access to large datasets stored on local or remote disk remains a major performance bottleneck in many important scientific applications. Similarly, the latency introduced by communication between compute nodes hinders application performance from scaling well as the number of nodes is increased.

Figure 1. Ray casting with viewing direction rotating around a major axis. (a) Ray casting. Dots indicate the blending steps along a ray. Numbers represent nodes. (b) Dependencies for the view direction in (a). (c) Dependencies for a new view direction.

Figure 2. 3D head images generated using our parallelization method for different view directions. The head data is courtesy of Siemens Medical Systems, Inc.

The manner in which a dataset is partitioned among the compute nodes of a cluster is an important performance factor. If the partitioning process is informed by the dependencies inherent in the application, communication costs can be reduced and performance increased. For some applications, the pattern of dependencies may change when application specific parameters change. For example, ray casting is an important visualization technique for rendering transparent volumes, as demonstrated in figures 1 and 2. Given a view direction, the blending operation is performed in a predefined increment along each ray. However, if the data volume is split into blocks for rendering in parallel, most
rays need to pass through multiple blocks before they can finally reach the image plane. This means that the rendering on one job block needs data from its neighboring blocks which might be assigned to a different node. It is also obvious that when the view direction changes, the dependencies may also change significantly.
We present here a feasibility study for a system underlying a future API that would determine an efficient partitioning in response to changes in an application's dependency pattern. Such a system is particularly valuable for the visualization of datasets that are much larger than the aggregate storage capacity of a cluster. Although visualization clusters often have significant disk storage attached to compute nodes, clusters composed of less expensive nodes such as blades trade storage for increased computational power. In addition, as the size of simulation data extends beyond the terabyte range, there is a need for a system that is not bound by the storage capacity of the cluster.

As dataset sizes increase, the ability to efficiently access subsets of interest over a network becomes more important. Rather than transferring entire datasets on multiple disks via FedEx, a scientist should be able to visualize or otherwise process a subset of interest via remote access. In addition to determining an effective partitioning strategy for an application, our system aggregates cluster I/O requests to minimize the effect of disk and network latency on performance [31].

This aggregation of cluster I/O is based upon the iteration aware prefetching scheme developed in Granite, a middleware system for efficient access to large spatial scientific datasets [26, 27]. In addition to linear (i.e. "plane-by-plane, row-by-row") file formats, Granite can be used on top of chunking [29]. The work in this paper concentrates on the linear file format, but can be extended to the others, and to additional formats such as space filling curves [1, 10]. It can also be extended to work with declustering [2, 11, 12, 22] for parallel I/O and with multiple aggregators, although we currently use a single dedicated machine to aggregate cluster I/O requests.

The envisioned system would take advantage of Granite's ability to construct a cache that is tuned using information contained in an iterator. Since the iterator completely specifies the access pattern, Granite caches can effectively prefetch data, and never have to load data more than once [27]. They also aggregate a large number of small requests into a small number of large requests. Such aggregation is particularly valuable for remote access to large datasets, since it reduces the number of times that network latency costs are incurred.

We give a brief discussion of background in the next section, followed by a description of the Granite spatial prefetching and caching technique and its preliminary application to cluster computing in section 3. Section 4 introduces our dependency aware caching and job splitting for spatial data parallelization, while section 5 describes how we decide on the parameters for that mechanism, accompanied by corresponding experimental results. Section 6 examines the computational scalability of our technique. Section 7 presents application performance results, followed by concluding remarks and future work in section 8.

2. BACKGROUND
Efficient data parallelization has been an active area of research for many years. Research in the area began with compiler based techniques [5, 6, 7, 13, 15, 16, 21, 28] targeted toward early parallel machines, which were later applied to cluster computing environments. However, they require compiler support and only deal with in-core loop data parallelization without directly addressing disk I/O. In GRID computing environments, data and code are partitioned and placed in a manner that uses resources most efficiently. However, recent techniques require considerable user involvement, which can be onerous for the scientist-programmer.

2.1 Data parallelization for I/O intensive applications on GRID
Network latency is a major performance bottleneck in distributed environments. Unlike compiler based data parallelization that targets in-core data splitting and processing, GRID computing gives the same priority to I/O performance as to data processing while trying to reap the abundant idle storage and computing resources. This is mostly realized through the careful placement of data files, middleware code, or filters to reduce network traffic [4, 20, 24].

Multiple clients and multiple I/O access points are essential for higher bandwidth [8]. Collective I/O can be used to aggregate or co-schedule multiple data requests to data sources on the same physical storage medium, reducing the number of separate I/O requests. On the data source side, files are partitioned and distributed to utilize free storage resources on the GRID and to increase I/O bandwidth. Data declustering and replication techniques provide opportunities for simultaneous I/O operations, increasing I/O bandwidth via parallelism. Replicas can also be placed close to compute nodes to improve data query satisfaction time. However, none of these methods are performed with the dependencies of a particular computation in mind.

For data filtering and processing, data-reducing operators are moved closer to the data sources and data-inflating operators closer to the clients [4, 20], saving network bandwidth. I/O performance is also improved by replicating or caching data at places closer to compute nodes. A node could be both an I/O node and a compute node. Although such techniques reduce the volume of transmitted data, they do not directly attack the problem of latency because they do not necessarily reduce the number of transactions.

Many systems, e.g. ADR, use a dataflow [18], blueprints [24] or something similar to describe the relationship between the data sources and filters. Code placement is performed using data flow information explicitly provided by the user, who in turn designs the logical dataflow with the application specific dependency pattern in mind.

With spatial applications, especially visualization, dependencies may change when parameters such as view direction change. There is no guarantee that a static data flow or partitioning is ideal or even suitable for a new spatial dependency. If repartitioning does not incur excessive overhead, it is advantageous to automatically recompute a dataflow and partitioning in response to a change in dependency pattern. Kurc, et al. [18] describe a cluster based ray casting volume visualizer. The cluster nodes had significant local storage, managed by ADR, allowing the cluster to store the entire dataset. Although data dependencies for this application change with view direction, they use a single static partitioning for all directions, due to the cost that repartitioning the data would incur.
Our own work addresses the rendering or processing of datasets that are larger than a cluster's aggregate storage capacity. This scenario may occur when a cluster has little or no storage on the compute nodes, or when the dataset is extremely large. In either case, the data must be fetched all over again for each rendering or processing. Therefore, using a new partitioning that is well suited to application specific dependencies adds no new cost, and can greatly improve performance.

The cost of re-fetching the data from a source outside the cluster makes our approach unattractive for fully interactive visualization without the use of multiresolution or other data-reducing techniques. However, by removing the limitation imposed by cluster storage, we allow even modest clusters to visualize large datasets, and also extend the reach of better equipped facilities.

2.2 Spatial data and non-contiguous I/O
There are several current systems that attack the problem of storing and accessing spatial data, such as HDF5 [17] and DataCutter [3, 4, 18]. NetCDF [23] and HDF5 are two popular self-describing scientific data formats and high level application I/O libraries used to store and access multidimensional array data. Both of them (pNetCDF [19] for NetCDF datasets) support parallel I/O through the low level MPI-IO interface [9]. Spatial datasets present special difficulty because a subvolume in the data space maps to a large number of non-contiguous I/O requests in the one dimensional file.

Data sieving tries to amortize the data access cost over multiple noncontiguous requests at the cost of extra data loading. It does not work well for cases where the gaps between the requests are large, which can be remedied when combined with collective I/O such as in [30]. Granite can effectively take advantage of data sieving while performing gapped spatial queries, and can aggregate I/O when applied in cluster computing [31]. Chunking [29] reorganizes a dataset into multidimensional subblocks that are stored linearly in the file. This turns what would otherwise be many non-contiguous I/O requests into a single large unit of data retrieval. However, the spatial locality preserved by a chunking is only suitable for a particular access pattern and would incur huge storage overhead if the access pattern changes often. As mentioned earlier, our scheme can be used on top of chunking, although it does not require it.

2.3 Granite datasource component
In Granite, a datasource is conceptually an n-dimensional array containing a set of sample points [27], as shown in figure 3. The array indices define the index space, also called a data volume. Each index space location has a collection of associated data values, called a datum. Datasources must handle two basic queries. A datum query specifies a single index space location, and is satisfied by the return of a single datum. A subblock query specifies an n-rectangular region of the index space, and is satisfied by the return of a data block, which is conceptually an n-dimensional array of datums. The subblock query is essentially a spatial range query commonly supported by I/O systems that provide subset operations on spatial datasets.

Figure 3. Granite datasource component. (a) A 3D data volume stored in {0,1,2} ordering. (b) Datum and subblock queries.

A datasource satisfies these spatial queries by reading data from a one dimensional file. Therefore it must be able to map its index space to file offsets. This mapping is done with the help of an axis ordering. An axis ordering is simply a ranking of axes from outermost to innermost. "Innermost" and "outermost" suggest positions in a set of nested for loops used to access the file in its storage order on disk. We call the axis ordering that maps a datasource to its file a storage ordering. The innermost (or rightmost) axis of a storage ordering is known as the rod axis, where rods are series of elements that are contiguous in both the data volume and the one dimensional file or memory array. For example, the data volume in figure 3a is stored in {0,1,2} ordering, meaning that only the neighboring datums along axis 2 are physically stored contiguously. The number and length of rods in a query determine the cost to satisfy the query. Besides the communication overhead, they are also the key factors in determining how the data should be loaded and partitioned for parallelization.

These ideas comprise the Granite rod storage model, which also applies to chunking [29]. Other techniques, such as space filling curves [10], require a different storage model. Storage models provide information about the layout of data on disk that can be used to read data efficiently and to choose access patterns that maximize performance. The key during this process is to minimize the number of required physical I/O requests. Granite does this in a measurable and predictable way by taking advantage of prior knowledge of the future access pattern to perform efficient spatial prefetching and caching on data stored on either local disk or a remote server [27].
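As an illustration of the rod storage model, the sketch below computes the linear file offset of an index space location under a given storage ordering and identifies the rod axis. The class and method names are hypothetical, not Granite's actual API; the example only assumes a linear (non-chunked) layout in which the last axis of the ordering is the rod axis.

```java
/**
 * Minimal sketch of mapping an n-dimensional index to a linear file offset
 * under a storage ordering. Names are illustrative, not Granite's actual API.
 */
public class StorageOrdering {
    private final int[] order;  // axes ranked from outermost to innermost, e.g. {0,1,2}
    private final int[] dims;   // length of each axis of the data volume

    public StorageOrdering(int[] order, int[] dims) {
        this.order = order.clone();
        this.dims = dims.clone();
    }

    /** The rod axis is the innermost (rightmost) axis of the ordering. */
    public int rodAxis() {
        return order[order.length - 1];
    }

    /** Linear element offset of an index space location; multiply by datum size for bytes. */
    public long offset(int[] index) {
        long off = 0;
        for (int axis : order) {           // outermost to innermost
            off = off * dims[axis] + index[axis];
        }
        return off;
    }

    public static void main(String[] args) {
        // A 4x4x4 volume stored in {0,1,2} ordering: axis 2 is the rod axis,
        // so neighbors along axis 2 are contiguous in the file.
        StorageOrdering so = new StorageOrdering(new int[]{0, 1, 2}, new int[]{4, 4, 4});
        System.out.println(so.rodAxis());                    // 2
        System.out.println(so.offset(new int[]{0, 0, 1}));   // 1 (contiguous neighbor)
        System.out.println(so.offset(new int[]{0, 1, 0}));   // 4 (start of the next rod)
    }
}
```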
3. GRANITE SPATIAL PREFETCHING AND CACHING
In Granite, an access pattern is represented as an iterator object that contains information such as the iteration space, iteration block shape, iteration ordering and steps. The iteration ordering, which is also a kind of axis ordering, determines the direction in which an iterator proceeds through the iteration space.

While doing prefetching and caching, it is desirable that the iteration ordering in an access pattern match the storage ordering of the datasource to achieve the best disk I/O performance. We have recently been examining the problem of using the application access pattern to choose from several remote replicas stored differently on disk [25]. The network protocol we employ is UDT, an efficient application level data transport protocol for emerging distributed data intensive applications over wide area high-speed networks [14].

3.1 Iteration and cache shapes
Using the information about the access pattern provided by an iterator object, and the fact that the iterator progresses along one of the principal axes, Granite can calculate a cache block shape that reduces the number of reads from the file and contains all the data needed in the current and future iteration steps. The time required to perform the iteration is greatly reduced because the number of disk read operations is reduced and these cache blocks are well formed, meaning that they will be loaded only once during the iteration.
Figure 4. Different cache shapes and their cache block dimensions for a 1024*2048*4096 sub-volume of storage ordering {0,1,2} using 128MB cache memory. Thick lines denote rods in the cache block. Cache blocks consist of some number of planes orthogonal to axis 0, where each plane consists of a number of rods that are stored nearby on disk. The distance between neighboring rods in adjacent planes is much greater than the distance between adjacent rods within a plane. For this reason, shape 0 has the best performance, since the number of planes is small (16) and the rods are long. Shape 1 has the next best performance, since the rods are long, but it has more planes (1024) than shape 0. Shape 2 is worst, since the number of planes is large but the rods are shortest, requiring more read operations to fill the volume.

In the work described here, we assume that we have enough memory to form cache blocks that span at least two axes of the data volume, as shown in figure 4. The remaining axis is called the cache iteration axis, since it is in this direction that the cache iterator will proceed as the entire data volume is processed. In this paper, we will refer to the three cache shapes shown in figure 4 by the number of the cache iteration axis. Granite keeps the storage ordering of the cache block in memory the same as the storage ordering of the physical data source on disk. We refer to the rod axis for the iterator cache block as the cache rod axis.

A cache iterator can go in either a forward (positive) or backward (negative) direction along the cache iteration axis. Our experimental results show that with the same amount of cache memory, cache iterators of the same shape bring about the same cache loading performance for iterators in both directions, making the cache shape the dominant factor determining the suitability of a cache iterator. The suitability of a cache iterator is a measure of how fast it can bring data from disk to memory for a given storage ordering. As mentioned in section 2.3, this is mainly determined by the number and length of the rods contained in the cache shape. We restrict our analysis in this paper to forward iteration.

3.2 Pattern converter
If the application doesn't list all axes of the index space in its iteration ordering, the corresponding access pattern is called an incomplete access pattern¹. An incomplete access pattern must eventually be turned into a complete one through a pattern converter, but it leaves more chances for Granite to choose an access pattern that best suits a file's storage ordering.

In a cluster environment, the pattern converter is responsible for choosing a proper cache iterator and job iterator for cache loading and splitting respectively. By using a cache iterator, the iteration aware spatial data distribution system [31] reduces both disk and network latency by transforming a large number of small requests into a small number of large requests that fill an n-dimensional collective cache block on the cluster head node. The job iterator is responsible for job extraction out of the cache and job distribution to compute nodes for data parallelization. The SPMD (Single Program Multiple Data) parallel execution model suits the Granite data iterator well since the number of chunks contained in a single data iterator block can be determined before the program starts. The combination of cache and job iterators essentially results in a BLOCK_CYCLIC partitioning for data parallelism in SPMD.

To facilitate the identification of I/O costs for performance enhancement, we divide a data intensive cluster application into five stages, as shown in figure 5. Stage zero selects from among replicas with different storage orderings the ones that can best be used with the application access pattern. Stage one constructs and uses the cache iterator to read data from a datasource to the head node memory. Stage two constructs and uses a job block iterator to split the cache block into job blocks and distribute them to compute nodes for data processing. In stage three, the compute nodes perform the application task, perhaps requiring communication between compute nodes due to data dependency. The final result transfer and combination are done in stage four, the output stage. In this paper, we focus on stages 1, 2, and 3.
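To make the cache iterator / job iterator relationship concrete, the following sketch shows how job blocks carved from successive cache blocks could be assigned to compute nodes in the BLOCK_CYCLIC fashion described above. It is a simplified illustration with hypothetical names, not the Granite implementation; it only assumes that the number of job blocks per cache block is known before the program starts.

```java
/**
 * Simplified illustration of the BLOCK_CYCLIC assignment that results from
 * nesting a job iterator inside a cache iterator. Names are hypothetical.
 */
public class BlockCyclicAssignment {
    public static void main(String[] args) {
        int numCacheBlocks = 4;     // cache iterator steps along the cache iteration axis
        int jobsPerCacheBlock = 8;  // job iterator splits each cache block into job blocks
        int numNodes = 8;           // compute nodes participating in the SPMD computation

        for (int c = 0; c < numCacheBlocks; c++) {
            // Stage one: the cache iterator loads cache block c into head node memory.
            for (int j = 0; j < jobsPerCacheBlock; j++) {
                // Stage two: the job iterator extracts job block j and sends it to a node.
                int node = j % numNodes;  // cyclic assignment of jobs within each cache block
                System.out.printf("cache block %d, job block %d -> node %d%n", c, j, node);
            }
        }
    }
}
```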
Figure 5. Five stages of a planned cluster computing system for large datasets. CN - Compute Node.

¹ We used the term indefinite access pattern in [31].
Figure 6. Example computation and comparison of the costs to split among 8 nodes a 16*2048*4096 cache block stored in {0,1,2} ordering. An arrow means dependency exists at least on the corresponding axis. (a)(b)(c) Single axis splitting along a major axis, denoted as S0, S1 or S2 in our text. Each shaded area represents the cross-node job border area and unit communication size. Their differences leave opportunities to choose a splitting with the least communication volume. (d) Double axis splitting along axes 1 and 2, denoted as S12 (the other two double axis splitting options are S01 and S02). These finer splittings raise the number of job blocks and thus the job distribution and, if any, inter-node communication costs. Finer splittings are necessary when a strict job size constraint exists on compute nodes due to memory availability. If both split axes have dependency, the choice of the final major split axis with less communication overhead can be made based upon comparison of their cross-node job border areas, shown shaded in (d). In all cases, splitting rod axis 2, as in (c) and (d), degrades job extraction performance.
4. DEPENDENCY AWARE CACHING AND JOB SPLITTING FOR SPATIAL DATA PARALLELIZATION
In order for our technique to choose a proper combination of cache iterator and job block iterator to realize data parallelism with the best I/O performance, we must perform several tasks. First, we must generate a dependency descriptor based on application specified dependency constraints to represent the computation order over a data volume. Second, we must use our cache splitter to appropriately shape the job blocks and construct a job iterator to bring the best overall I/O performance under certain memory constraints.

4.1 Dependency descriptor
The application needs to tell Granite about the dependency constraints. We represent the dependency axes and directions using two bitsets: Be for existence and Bd for direction. In each bitset, a bit corresponds to an axis of the data space, with the rightmost bit denoting the dependency state on axis 0. A 1 bit in Be denotes dependency existence and a 0 bit denotes absence for the corresponding axis. A 1 bit in Bd denotes a positive dependency direction and a 0 bit denotes negative. We use an asterisk in Bd to denote the case where a bit may be either 0 or 1. A positive direction indicates that values with smaller coordinates for that axis must be visited before values with larger coordinates. Our current implementation assumes the dependency is unidirectional, namely that dependency exists in at most one direction along each axis of an iteration space. For example, for a 3D iteration space, the combination of {0,1,1} for existence and {*,0,1} for direction means the application has a positive dependency on axis 0 and a negative dependency on axis 1. There is no dependency on axis 2.

If the bits in Be are all 1's, it represents a complete dependency pattern. All other cases are incomplete dependency patterns, which means the computation order does not matter on dependency-free axes. Counting the number of 1-bits in Be yields the dimensionality of a dependency pattern. The work described in [31] applied to 1D dependencies. Our current work concentrates on 2D dependencies, and also applies to the 3D case.
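As a concrete illustration, the sketch below encodes the dependency descriptor from the example above using java.util.BitSet. The class and method names are hypothetical and do not reflect Granite's actual API; the point is only to show how the Be/Bd convention (bit i corresponds to axis i, asterisk bits are simply ignored for dependency-free axes) can be queried.

```java
import java.util.BitSet;

/**
 * Illustrative encoding of a dependency descriptor: Be marks axes that carry a
 * dependency, Bd marks the direction (set = positive) for those axes only.
 * Hypothetical names, not Granite's actual API.
 */
public class DependencyDescriptor {
    private final BitSet existence;  // Be: bit i set => dependency exists on axis i
    private final BitSet direction;  // Bd: bit i set => positive direction on axis i
    private final int dims;

    public DependencyDescriptor(BitSet existence, BitSet direction, int dims) {
        this.existence = existence;
        this.direction = direction;
        this.dims = dims;
    }

    public boolean hasDependency(int axis) { return existence.get(axis); }

    /** Positive means smaller coordinates must be visited before larger ones. */
    public boolean isPositive(int axis) {
        if (!hasDependency(axis)) throw new IllegalArgumentException("axis " + axis + " is dependency-free");
        return direction.get(axis);
    }

    /** Dimensionality of the dependency pattern: the number of 1-bits in Be. */
    public int dimensionality() { return existence.cardinality(); }

    public boolean isComplete() { return existence.cardinality() == dims; }

    public static void main(String[] args) {
        // Example from the text: Be = {0,1,1}, Bd = {*,0,1} for a 3D iteration space.
        BitSet be = new BitSet(3);
        be.set(0);  // dependency on axis 0
        be.set(1);  // dependency on axis 1 (axis 2 is dependency-free)
        BitSet bd = new BitSet(3);
        bd.set(0);  // positive direction on axis 0; axis 1 stays clear => negative

        DependencyDescriptor dep = new DependencyDescriptor(be, bd, 3);
        System.out.println(dep.dimensionality());   // 2 (a 2D dependency pattern)
        System.out.println(dep.isComplete());       // false
        System.out.println(dep.isPositive(0));      // true
        System.out.println(dep.isPositive(1));      // false
    }
}
```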
4.2 Augmented cache splitter
The cache splitter is a key part of our effort to automate the distribution of spatial data on a cluster. It extends the pattern converter described in section 3.2 and performs the same tasks. First, it chooses the shape of the block used by the cache iterator on the head node to access the disk. Second, it splits this cache block into n-dimensional job blocks which will be distributed to the cluster compute nodes.

While the older pattern converter took an incomplete access pattern as input, the cache splitter now accepts a dependency description and determines the best access pattern for the user. It further allows the data dependency to exist on more than one axis of an iteration space and does not require that job splitting be done only along dependency-free axes, making it necessary to take into account (or avoid completely) the cost of inter compute node communication when constructing the job block iterator.

4.3 Space-time cost analysis
Given a data dependency pattern, the cache splitter must perform a space-time cost analysis to make the right choice of cache and job block iterator. Analysis of space cost is motivated by the memory constraints on the head node and compute nodes. The memory available on the head node determines the size and number of cache blocks to be loaded. Within this constraint, the cache block dimensions are determined by the cache shape and iteration space dimensions, as shown in figure 4. In addition, there are cases in which the length of a cache block is not sufficient for it to be divided into a number of job blocks matching the number of compute nodes. We will discuss this issue in detail in section 5. Lastly, we must guarantee that the final job block size does not exceed the memory limit on compute nodes.

The analysis of time costs is more complex. The list below identifies the time costs corresponding to the three cluster application stages that we focus on in this paper.

• Stage one - disk I/O and remote data transfer
• Stage two - job extraction and job distribution
• Stage three - inter compute node communication

The full impact of these costs is partially hidden using threaded I/O so that stages run at least partially concurrently.
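The space cost analysis above can be illustrated with a small calculation: given the iteration space dimensions, a cache shape (identified by its cache iteration axis), a datum size, and the head node memory budget, we can compute how many planes the cache block may span and check the resulting job block size against the compute node memory limit. The sketch below is a simplified model consistent with figure 4; the datum size, per-node memory limit, and all names are assumptions for illustration, and the real cache splitter performs a fuller analysis.

```java
/**
 * Back-of-the-envelope space cost check for the cache splitter.
 * Hypothetical, simplified model: a cache block spans the full extent of every
 * axis except the cache iteration axis, along which it is "thickness" planes deep.
 */
public class SpaceCostCheck {
    public static void main(String[] args) {
        long[] dims = {1024, 2048, 4096};   // iteration space dimensions (figure 4)
        int cacheIterationAxis = 0;         // cache shape 0
        long datumBytes = 1;                // assumed size of one datum
        long headNodeBytes = 128L << 20;    // 128MB of cache memory on the head node
        long computeNodeBytes = 64L << 20;  // per-node job block memory limit (assumed)
        int numNodes = 8;

        // Bytes in a single plane orthogonal to the cache iteration axis.
        long planeBytes = datumBytes;
        for (int axis = 0; axis < dims.length; axis++) {
            if (axis != cacheIterationAxis) planeBytes *= dims[axis];
        }

        // How many planes fit in head node memory (the cache block thickness).
        long thickness = Math.min(headNodeBytes / planeBytes, dims[cacheIterationAxis]);
        System.out.println("cache block thickness = " + thickness + " planes");  // 16 for shape 0

        // A single-axis splitting among the nodes must respect compute node memory.
        long jobBlockBytes = thickness * planeBytes / numNodes;
        System.out.println("job block fits on a compute node: " + (jobBlockBytes <= computeNodeBytes));
    }
}
```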
5. EFFICIENCY RULES IN DATA PARALLELIZATION
We have developed several rules to guide the partitioning of the data for the best overall performance, with a couple of exceptions. To verify the effectiveness of these rules, we ran a series of tests using forward cache iterators and different partitionings. Due to resource limits, these tests were performed on two different clusters. One is a 12-node Orion cluster with a Seagate Barracuda drive that has an 11.5 ms average seek time. The other is Mimosa at the Mississippi Center for Supercomputing Research (MCSR), a 253-node Intel cluster. Due to resource and time limitations, we were only able to use up to 16 nodes as compute nodes, and without network connectivity outside of the cluster. The remote tests were done between the Orion cluster at the University of Mississippi and a Linux box at the University of New Hampshire. Both run the Linux operating system. The datasource we used is a 4GB file viewed as a 3D data volume with dimensions 1024x1024x4096 and a {0,1,2} storage ordering. The iteration is done on a 1024x1024x2048 subset of this data volume using 256MB of cache memory. All tests were carried out using the same data source and iteration space. We analyze the results separately for the different stages.

5.1 Cache loading cost
For data intensive cluster applications, the I/O in stage one dominates the whole application performance and thus is the first factor to consider for significant performance improvement. It is done collectively for all participating compute nodes and its time cost is mainly determined by the shape and number of the cache blocks to be loaded. For a given amount of cache memory and iteration space, the cache shape becomes the only element that brings a major performance difference in stage one. Figure 7 shows that for a given storage ordering, the cache shape that best matches the datasource storage layout (cache shape 0 for {0,1,2} storage ordering) has the best suitability for both local and remote data accesses, because it contains the fewest, longest rods with the best locality, and the iteration can be done with the smallest number of disk reads and the least seek time.

Figure 7. The cache loading times of the 2GB subset of a 4GB file using 256MB of cache memory and different cache shapes. Note that the remote cache loading using cache shape 2 is even quicker than the local cache loading on Orion. That is because the remote disk on Rebel is much faster than the disk on Orion and the bandwidth between Rebel and Orion is high, about 80-90 Mbps.

Although this result is not surprising, we experimentally compared cache loading costs with the other costs incurred by cluster computation. We found that even if choosing cache shape 0 incurred more job extraction, distribution, or cluster communication costs in subsequent stages, I/O costs were always more significant. The significance of first stage I/O and the importance of cache shape in cache loading suggest our first efficiency rule.

R1: Always use cache blocks that result in the fewest reads from disk.

In [27], we describe a notation for the various storage orderings that are possible when mapping an n-dimensional dataset to a one dimensional array or file using linear or chunked storage. In this paper, we restrict our attention to a single ordering for which cache shape 0 is the best match. Symmetric arguments can be made for other orderings. However, in section 5.6, we suggest that the ability to select a different copy of the same data with a different storage ordering may be beneficial in some cases.
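R1 can be approximated by counting the read operations each candidate cache shape requires. If we approximate each rod as one contiguous read and each plane along the outermost storage axis as one long seek, the shape with the fewest, longest rods and the fewest planes wins. The following sketch, with hypothetical names and a deliberately coarse cost model, ranks the three shapes of figure 4 on that basis; it is an illustration, not Granite's cost model.

```java
/**
 * Simplified suitability estimate for the three cache shapes of figure 4.
 * Each rod is treated as one contiguous read and each plane along the outermost
 * storage axis as one long seek; real costs also depend on the disk and network,
 * so this is only an illustrative ranking, not Granite's code.
 */
public class CacheShapeRanking {
    public static void main(String[] args) {
        long[] dims = {1024, 2048, 4096};  // data volume, storage ordering {0,1,2}
        int rodAxis = 2;                   // innermost axis of the storage ordering
        int outermostAxis = 0;             // outermost axis of the storage ordering
        long memoryBytes = 128L << 20;     // cache memory on the head node
        long datumBytes = 1;               // assumed datum size

        for (int shape = 0; shape < dims.length; shape++) {   // shape = cache iteration axis
            long planeBytes = datumBytes;
            for (int axis = 0; axis < dims.length; axis++) {
                if (axis != shape) planeBytes *= dims[axis];
            }
            long thickness = Math.min(memoryBytes / planeBytes, dims[shape]);

            long rodLength = (shape == rodAxis) ? thickness : dims[rodAxis];
            long blockVolume = thickness * (planeBytes / datumBytes);
            long rodCount = blockVolume / rodLength;           // ~ number of reads (R1)
            long planeCount = (shape == outermostAxis) ? thickness : dims[outermostAxis];

            System.out.printf("shape %d: %d rods of length %d, %d planes%n",
                    shape, rodCount, rodLength, planeCount);
        }
        // The output ranks shape 0 best (fewest planes), shape 1 next, and shape 2 worst
        // (the most and shortest rods), matching the discussion of figure 4 and rule R1.
    }
}
```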
5.2 Job extraction cost
After the collective cache blocks are loaded into memory, they need to be partitioned properly because the partitioning affects all time costs of stage 2 and thereafter. The partitioning can be performed along more than one axis. The axis along which the jobs are distributed among different compute nodes is called the major split axis (MSA), while the others are called secondary split axes (SSA).

The job extraction cost in stage two is mainly determined by the number of calls to the System.arraycopy function that is used to fill the job blocks with data out of the cache. The job extraction cost can be minimized by shaping the job blocks to suit the storage layout of the multidimensional cache block in memory. This is determined by the relationship between the split axes and the cache rod axis. Recall that the cache rod axis is the axis along which elements are contiguously stored in the one dimensional memory array. It is the only format-specific concept in the partitioning algorithm, making our work extensible to other file formats with different storage models such as space-filling curves. If the splitting is done along the cache rod axis, the job blocks cut the rods into smaller segments, increasing the number of calls to System.arraycopy, and thus performance degrades. For example, if we split a cache block into 8 job blocks, the splitting in figure 6c would require 2048*16 function calls to fill a job block with rods of length 512, while the splittings in figures 6a and b only require 2048*2 calls to fill a job block with rods of length 4096. This leads us to formulate our second rule:

R2: Avoid splitting the cache rod axis.

Figure 8. Job extraction time from a single 256MB cache block of different shapes using different job splittings. S0 - split along axis 0. S01 - split along axes 0 and 1.

Figure 8 shows the time to extract all jobs from the 256MB cache block constructed using 3 forward cache iterators with different cache shapes. Each cache block is split either along a single axis or along two. It is obvious from the results that for all cache block shapes, job extraction takes longer if the rod axis is split during the partitioning process, especially for cache blocks of shape 2, because the already short rods in cache blocks of shape 2 are broken into even smaller segments. This again verifies our claim of R1: always use the cache iterator of the shape that best matches the data storage layout.
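The effect described above can be seen directly in the extraction loop itself: each rod segment costs one System.arraycopy call, so a job block is filled with (job volume / rod segment length) calls. The sketch below is a hypothetical, simplified extractor for a 3D cache block in {0,1,2} ordering; it is not the Granite implementation, and it uses a scaled-down block, but it reproduces the way splitting the rod axis multiplies the number of copies.

```java
/**
 * Simplified job extraction from a 3D cache block stored in {0,1,2} ordering,
 * i.e. elements contiguous along axis 2 (the cache rod axis). One
 * System.arraycopy call copies one rod segment into the job block.
 * Hypothetical sketch, not Granite's actual extractor.
 */
public class JobExtraction {
    /** Copies the region [o0,o0+l0) x [o1,o1+l1) x [o2,o2+l2) out of the cache block. */
    static long extract(byte[] cache, int d1, int d2,
                        byte[] job, int o0, int l0, int o1, int l1, int o2, int l2) {
        long arrayCopyCalls = 0;
        int dst = 0;
        for (int i = 0; i < l0; i++) {
            for (int j = 0; j < l1; j++) {
                // Each (i,j) pair contributes one contiguous rod segment of length l2.
                int src = ((o0 + i) * d1 + (o1 + j)) * d2 + o2;
                System.arraycopy(cache, src, job, dst, l2);
                dst += l2;
                arrayCopyCalls++;
            }
        }
        return arrayCopyCalls;
    }

    public static void main(String[] args) {
        // Scaled-down 16x64x128 stand-in for the 16*2048*4096 cache block of figure 6.
        int d0 = 16, d1 = 64, d2 = 128;
        byte[] cache = new byte[d0 * d1 * d2];

        // Split along axis 0 into 8 jobs: long rods, few arraycopy calls per job.
        byte[] jobA = new byte[(d0 / 8) * d1 * d2];
        System.out.println(extract(cache, d1, d2, jobA, 0, d0 / 8, 0, d1, 0, d2));   // 2*64 = 128 calls

        // Split along the rod axis 2 into 8 jobs: rods are cut, many more calls.
        byte[] jobC = new byte[d0 * d1 * (d2 / 8)];
        System.out.println(extract(cache, d1, d2, jobC, 0, d0, 0, d1, 0, d2 / 8));   // 16*64 = 1024 calls
    }
}
```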
5.3 Job distribution cost
Job distribution transfers job blocks over the network from the head node to compute nodes. Its cost is determined by the number of jobs and the job size. The number of jobs a cache block is split into and the resulting job size are reciprocal. Normally, for coarse-grained parallelization, a small number of large network data transfers is preferable to a large number of small transfers. Unless prevented by the job size limit, our cache splitter chooses to split along as few axes as possible. For example, the single axis splittings in figures 6a, b and c are preferable to the splitting along two axes shown in figure 6d. Our next rule reads as follows:

R3: Split along as few axes as allowed by the memory on the compute nodes.

Comparing single against double axis splitting for a 2GB iteration space on Orion, we found that the job distribution time for a single axis partitioning took a total of 66 seconds, versus 70 seconds for a double axis partitioning. The difference is not significant because of the high performance network and our optimization through parallel distribution. The performance difference would be more evident in a GRID environment where network latency is higher.

5.4 Communication overhead
The communication overhead in stage three was not taken into consideration in our previous work. Now, with the increase of the dimensionality of the dependency pattern, it must be factored in, and it should be possible to quantify it for iterator selection.

Communication overhead is incurred if the major split axis happens to be a dependency axis; it is even unavoidable when the application has a complete dependency pattern, meaning there exists dependency on all axes of the data volume.

R4: Avoid partitioning a dependency axis among compute nodes. If unavoidable, and R6 does not apply, choose the partitioning with less unit communication (LUC) size.

The first half of R4 is obviously true for all data parallelization algorithms since it is critical for the avoidance of cross-node true dependency and thus the communication. However, when necessary, a quantification method to measure the communication overhead is needed to select the right job iterator with a minimal communication requirement. We assume the communication cost to be proportional to the area of the border between neighboring jobs that have a dependency between them but are assigned to different compute nodes. For example, the shaded areas in figures 6a, b and c show the unit communication costs for different single axis cache splittings. The difference between these border areas provides an opportunity for the cache splitter to select a job splitting that minimizes the size of communication. We do not consider cases where data compression or value-based transformations are employed and might cause unpredictable communication volumes for jobs with the same shape and size. However, even in such cases we expect the communication cost to be proportional to the job block border area.
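Because the communication volume is modeled as proportional to the cross-node job border area, choosing the major split axis among the allowed candidates reduces to comparing those border areas. The sketch below, with hypothetical names, scores each candidate axis of a cache block by the border area its splitting would create (zero if the axis is dependency-free) and picks the smallest, in the spirit of R4; the candidate set is assumed to have already been filtered by the other rules (for example, R2 excluding the rod axis).

```java
/**
 * Choosing the major split axis by comparing cross-node job border areas, in the
 * spirit of rule R4. Illustrative sketch with hypothetical names.
 */
public class MajorSplitAxisChooser {
    static int choose(long[] dims, boolean[] hasDep, boolean[] candidate) {
        int best = -1;
        long bestArea = Long.MAX_VALUE;
        for (int axis = 0; axis < dims.length; axis++) {
            if (!candidate[axis]) continue;
            // Border between adjacent jobs split on this axis: the product of the other
            // dimensions. A dependency-free axis incurs no communication at all.
            long area = 0;
            if (hasDep[axis]) {
                area = 1;
                for (int other = 0; other < dims.length; other++) {
                    if (other != axis) area *= dims[other];
                }
            }
            if (area < bestArea) { bestArea = area; best = axis; }
        }
        return best;
    }

    public static void main(String[] args) {
        long[] dims = {16, 2048, 4096};              // the cache block of figure 6
        boolean[] notRodAxis = {true, true, false};  // R2: keep rod axis 2 unsplit

        // Dependency on axes 0 and 1: both candidates need communication, so pick the
        // one with the smaller border (LUC): 16*4096 for axis 1 beats 2048*4096 for axis 0.
        System.out.println(choose(dims, new boolean[]{true, true, false}, notRodAxis)); // 1

        // Dependency on axis 0 only: axis 1 is dependency-free, so splitting it avoids
        // communication entirely (the NC case).
        System.out.println(choose(dims, new boolean[]{true, false, false}, notRodAxis)); // 1
    }
}
```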
Table 1. The combined time in seconds of stages 2 and 3, without job extraction, using single axis splitting. Dep{Be}{Bd} indicates the dependency pattern. S0/S1/S2 indicates single axis splitting along axis 0/1/2.

Table 1 shows the combined time of stages two and three, without job extraction, under different single axis splittings using cache shape 0, which has the best suitability for storage ordering {0,1,2}. Since the job distribution costs are the same in all cases, the performance differences reflect the differences in communication cost. However, since these costs are overlapped with threads, we cannot simply subtract job distribution costs. We can see that for any 2D dependency pattern, the performance is best when the splitting is done along the dependency-free axis (marked by the asterisks). The performance is worst when the splitting is done along the cache iteration axis when that axis also has a dependency on it (marked by the crosses). We will further discuss the cause of these worst cases in section 5.5.

Table 2. The combined time in seconds of stages two and three, without job extraction, using double axis splittings. Dep{Be}{Bd} indicates the dependency pattern. S01/S02/S12 indicates double axis splitting along axes 0,1 / 0,2 / 1,2. Choosing the major split axis while doing double axis splitting can bring different cross-node job border areas and thus different communication overhead, which is denoted after the splitting notations as follows. NC/YC indicates choosing a major split axis that does not/does require communication. LUC/MUC indicates, if communication is unavoidable on both axes, choosing a major split axis that incurs a smaller/larger cross-node job border area, namely Less/More Unit Communication size.

While doing double axis splitting as needed, the cache splitter has the option to choose different major split axes that bring different communication overhead. If one of the split axis candidates is dependency-free, we can choose splittings that either do (YC) or do not (NC) incur inter compute node communication. If a dependency exists on both candidate split axes, we can minimize the total communication volume by choosing the major split axis that brings less unit communication size (LUC), based upon the border area comparison exemplified in figure 6d. The results in table 2 show that splitting without incurring communication (NC) is better, and if communication is unavoidable, the splitting that incurs less unit communication cost (LUC) is better.

The performance differences are small because the inter compute node communication only involves the border area, which is quite small compared with the whole volume of a job block. We expect the differences to be more significant with an increase in unit communication size.

5.5 Avoiding Serialization
Recall that the cache iterator loads successive slices of the data volume from disk or a remote server by iterating along the cache iteration axis. Each cache block is then partitioned into job blocks. When a partitioning splits only one axis that is both the cache iteration axis and a dependency axis, the application will
become serialized among the compute nodes, as shown in figure 9a. That is, the work on job block (i+1) is dependent upon the result of processing job block i, which means inter compute node communication cannot occur until job block i has been processed. There is always only one job block being processed at any moment among all compute nodes. This leads us to our next rule.

R5: For a single-axis partitioning, if the split axis and cache iteration axis are the same axis k, then only use this partitioning if k is not a dependency axis.

Figure 9. Example application serialization among compute nodes and its avoidance. Numbers indicate compute nodes. The shading on the top face represents different cache blocks loaded one by one, left to right, from disk or remote server. The front face shows that each cache block is then split into job blocks and distributed to the compute nodes according to the specified node numbers. Curved arrows on the front face denote cross-node true dependencies requiring communication. Job blocks with the same shade of gray on the front face, if diagonally contiguous, are processed simultaneously. (a) Full serialization. (b) Partially parallelized with finer jobs. (c) Partially parallelized with coarser jobs, the best case.

The situation in figure 9a can be mitigated by splitting along a secondary axis within each cache block, as shown in figure 9b. Although the job blocks are smaller, raising job distribution cost, a partial "pipeline style" parallelism is realized. For example, node 1 can begin work as soon as node 0 has completed its first small block.

The case shown in figure 9c is better still. Here, the cache block is split using a different major split axis, avoiding the serialization while the job distribution cost is kept about the same as or even less than in the first case.

5.6 A conflict resolution rule

Figure 10. Cost evaluation of different job splitting cases based upon the relations between the split axes, cache rod axis and the axes with dependency. SA indicates all split axes. MSA∈SA is always true. ∃A∈SA: A==CRA means that the rod axis is split. ∀A∈SA: A!=CRA means that the rod axis is not split. MSA∈DA means the splitting is done along a dependency axis.

Figure 10 summarizes the job extraction and communication costs under different splittings. As we can see in cases II and III, R2 and R4 sometimes give conflicting advice. That is, we must either split the cache rod axis, or split the dependency axis. In this case, it is usually beneficial to follow R4, avoiding communication, since these costs are typically larger than extraction costs. However, in cases where the job blocks are sufficiently short along the cache rod axis, extraction costs can overwhelm cluster communication. This happens when the following two conditions exist at the same time:

1. The ratio of cache rod length to the number of compute nodes is small. To avoid communication, we must cut the rods into very small pieces, incurring a very large number of non-contiguous memory accesses. Thus, the job extraction is poor.

2. The cache splitting that we would normally avoid under R4, if used instead, actually results in a relatively small number of inter-node communications.

In this scenario, it does not make sense to avoid a relatively small communication cost with a large job extraction cost.

Figure 11. Combined time in seconds of stages 2 and 3 including job extraction time. Dep{Be}{Bd} indicates the dependency pattern. The case indicated by the triangle is a splitting without communication, but its performance is worse (taking longer to finish) than the two cases to its left that have communication overhead.

To check the tradeoff between job extraction and communication costs, we combined stages 2 and 3 for a set of single axis splitting tests using a cache shape with short rods, without disk I/O. As indicated by the triangle in figure 11, the partitioning that splits the rods without any communication cost is outperformed by the partitioning that does incur communication without cutting the rods. This leads us to formulate rule 6:

R6: If job blocks for a candidate partitioning are sufficiently short along the cache rod axis, choose a new partitioning even if it incurs communication.

In other words, for this exceptional case, we must give precedence to R2 over R4. Since the cache block is split into job blocks, the ratio of cache rod length to the number of compute nodes determines when this exceptional case occurs. Although "sufficiently short" depends on the characteristics of the particular cluster, R6 is sufficient if the number of participating nodes is fixed, and there is no other choice of dataset. However, if another copy of the dataset with a different rod axis is available, using that dataset may give better performance. This suggests a basis for replica selection, where replicas with different storage organization are available.
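Taken together, rules R1 through R6 suggest a simple decision procedure for the cache splitter. The sketch below is our own condensation of the rules as stated in this section, using hypothetical names and a coarse cost model; the minRodSegment threshold stands in for the cluster-dependent notion of "sufficiently short", and the actual cache splitter performs a fuller space-time analysis.

```java
/**
 * Condensed decision procedure reflecting rules R1-R6 for a 3D cache block.
 * Hypothetical sketch; the real cache splitter performs a fuller space-time analysis.
 */
public class CacheSplitterRules {
    /** Returns the chosen major split axis for a cache block of the given dimensions. */
    static int chooseMajorSplitAxis(long[] cacheDims, int rodAxis, int cacheIterationAxis,
                                    boolean[] hasDep, int numNodes, long minRodSegment) {
        int best = -1;
        long bestCost = Long.MAX_VALUE;
        for (int axis = 0; axis < cacheDims.length; axis++) {
            // R5: never split an axis that is both the cache iteration axis and a dependency axis.
            if (axis == cacheIterationAxis && hasDep[axis]) continue;

            // R6 exception to R2: splitting the rod axis is acceptable only if the resulting
            // rod segments stay long enough; otherwise extraction cost dominates.
            if (axis == rodAxis && cacheDims[rodAxis] / numNodes < minRodSegment) continue;

            // R4: communication cost is proportional to the cross-node border area,
            // and is zero for a dependency-free axis.
            long border = 0;
            if (hasDep[axis]) {
                border = 1;
                for (int other = 0; other < cacheDims.length; other++) {
                    if (other != axis) border *= cacheDims[other];
                }
            }
            // R2 as a tie-breaking penalty: mildly discourage splitting the rod axis.
            long cost = border + (axis == rodAxis ? 1 : 0);
            if (cost < bestCost) { bestCost = cost; best = axis; }
        }
        return best;
    }

    public static void main(String[] args) {
        // Cache shape 0 block from figure 6 (R1: shape 0 best matches {0,1,2} ordering),
        // with dependencies on axes 0 and 1 and 8 compute nodes.
        long[] cacheDims = {16, 2048, 4096};
        int axis = chooseMajorSplitAxis(cacheDims, 2, 0, new boolean[]{true, true, false}, 8, 512);
        System.out.println("major split axis: " + axis);  // 2: dependency-free, and rods stay 512 long
    }
}
```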
6. SCALABILITY
Figure 12 shows the execution times for job extraction, distribution, and a simulated computation that is assumed to be proportional to the job block volume. To demonstrate the computational scalability, we excluded I/O costs, but still extracted job blocks from a 128MB cache block. It is clear from the figure that running on 8 nodes produces the best results, and that for 16 nodes, extraction, distribution, and communication costs were greater than the computational benefit. The reason for this result is that as the number of nodes increases, the size of the job blocks decreases, making the computation more fine-grained. This implies that with a sufficiently large cache block, even 16 or more nodes will show a performance improvement. We will evaluate this when certain limitations in our hardware are addressed.

Figure 12. Execution time without disk I/O on an increasing number of nodes for different partitioning strategies and dependency patterns. A 128MB disk cache was used. Times are in milliseconds. Dep{Be}{Bd} indicates the dependency pattern. Splitting options such as S12LUC are the same as described in the caption of Table 2.

The other possibility is to reduce the number of nodes used for the computation. This strategy would at least allow other users to use those extra nodes, increasing overall cluster throughput.

7. RAY CASTING RESULTS
As described in the introduction, ray casting with arbitrary view directions around the major axes provides a good application for evaluating the effect of our system toward an automatic data parallelization API.

We explored the same test cases as discussed above. It is clear from the results in figure 13 that for all the 2D dependency patterns we tested, a cache iterator with shape 0 brings the best application performance in most cases, unless the chosen job iterator splitting causes serialized inter compute node communication in the application. A cache iterator with shape 1 has comparable performance, but it is probably benefiting from the file system cache to some extent.

Figure 13. Ray casting application performance results on the Orion cluster using 8 nodes and the same set of other parameters as the tests shown previously. Times are in seconds. Dep{Be}{Bd} indicates the dependency pattern. Splitting options such as S12LUC are the same as described in the caption of Table 2.

8. CONCLUSION AND FUTURE WORK
We have described the workings of a tool that determines an efficient partitioning for a cluster given a dependency pattern provided by the application. This tool will be used to develop an API that will save scientific programmers from many of the details of efficient cluster computing. In addition, the work described here pays special attention to I/O costs, aggregating compute node requests into a small number of large requests. This scheme will be particularly attractive for visualization, since the dependencies inherent in such applications are often determined by parameters such as view direction.

We have many opportunities for future work. We plan to investigate the observations made in section 5.6 that suggest choosing from multiple copies of a dataset with different storage organizations may improve performance. We will also further examine the scalability of our approach on larger datasets with larger caches and different processing times for the same amount of data. Choosing an optimal number of nodes during the parallelization for a given set of parameters may also prove interesting.

9. ACKNOWLEDGMENTS
This work was supported in part by the National Science Foundation under grant CCF-0541239.

10. REFERENCES
[1] Asano, T., Ranjan, D., Roos, T., Welzl, E. and Widmayer, P. 1997. Space filling curves and their use in the design of geometric data structures. Theoretical Computer Science, volume 181, No. 1, 3-15, 1997.

[2] Atallah, M.J. and Prabhakar, S. 2000. (Almost) optimal parallel block access for range queries. Proceedings of the ACM Symposium on Principles of Database Systems, 205-215, May, 2000.

[3] Beynon, M., Ferreira, R., Kurc, T. M., Sussman, A. and Saltz, J. H. 2000. DataCutter: Middleware for filtering very large scientific datasets on archival storage systems. IEEE Symposium on Mass Storage Systems, 119-134, 2000.

[4] Beynon, M., Chang, C., Catalyurek, U., Kurc, T., Sussman, A., Andrade, H., Ferreira, R. and Saltz, J. 2002. Processing Large-Scale Multidimensional Data in Parallel and Distributed Environments. Parallel Computing, 28(5): 827-859.

[5] Blelloch, G. 1995. NESL: A Nested Data-Parallel Language (3.1). CMU-CS-95-170.
[6] Bourzoufi, H., Sidi-Boulenouar, B. and Andonov, R. 1992. Tiling and processors allocation for three dimensional iteration space. Journal of Parallel and Distributed Computing, 16:108-120.

[7] Chatterjee, S., Gilbert, J.R., Schreiber, R. and Sheffler, T.J. 1994. Array Distribution in Data-Parallel Programs. Languages and Compilers for Parallel Computing, 76-91.

[8] Ching, A., Coloma, K. and Choudhary, A. 2006. Challenges for Parallel I/O in GRID Computing. Engineering the Grid: Status and Perspective, American Scientific Publishers.

[9] Corbett, P., Feitelson, D., Fineberg, S., Hsu, Y., Nitzberg, B., Prost, J.P., Snir, M., Traversat, B. and Wong, P. 1995. Overview of the MPI-IO Parallel I/O Interface. Proc. of the Third Workshop on I/O in Parallel and Distributed Systems, Santa Barbara, CA.

[10] Kamel, I. and Faloutsos, C. 1994. Hilbert R-tree: An improved R-tree using fractals. Proceedings of the Twentieth International Conference on Very Large Databases.

[11] Fan, C., Gupta, A.K. and Liu, J. 1994. Latin cubes and parallel array access. IPPS: 8th International Parallel Processing Symposium, IEEE Computer Society Press.

[12] Ferhatosmanoglu, H., Tosun, A. Ş., Canahuate, G. and Ramachandran, A. 2006. Efficient parallel processing of range queries through replicated declustering. Distributed and Parallel Databases, volume 20, 117-147, 2006.

[13] Goumas, G., Drosinos, N., Athanasaki, M. and Koziris, N. 2002. Compiling Tiled Iteration Spaces for Clusters. Proceedings of the 2002 IEEE International Conference on Cluster Computing.

[14] Gu, Y. and Grossman, R. L. 2007. UDT: UDP-based Data Transfer for High-Speed Wide Area Networks. Computer Networks (Elsevier), Volume 51, Issue 7, May 2007.

[15] Gupta, M. and Banerjee, P. 1991. Automatic Data Partitioning on Distributed Memory Multiprocessors. Sixth Distributed Memory Computing Conference, Portland, OR, Apr. 1991.

[16] Hatcher, P.J., Quinn, M.J., Lapadula, A.J., Anderson, R.J. and Jones, R.R. 1991. Dataparallel C: A SIMD Programming Language for Multicomputers. Proceedings of the Distributed Memory Computing Conference, 28 Apr - 1 May 1991, 91-98.

[17] HDF5 webpage [online]. Available from: http://hdf.ncsa.uiuc.edu/hdf5/index.html (last accessed 4/2008).

[18] Kurc, T., Catalyurek, U., Chang, C., Sussman, A. and Saltz, J. 2001. Exploration and visualization of very large datasets with the Active Data Repository. Technical Report CS-TR-4208, University of Maryland, 2001. Available from: citeseer.ist.psu.edu/kurc01exploration.html.

[19] Li, J., Liao, W., Choudhary, A., Ross, R., Thakur, R., Gropp, W., Latham, R., Siegel, A., Gallagher, B. and Zingale, M. 2003. Parallel netCDF: a high-performance scientific I/O interface. Proceedings of Supercomputing 2003, ACM Press.

[20] Rodriguez-Martinez, M. and Roussopoulos, N. 2000. MOCHA: a self-extensible database middleware system for distributed data sources. SIGMOD '00: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, Texas, ACM Press (2000), 213-224.

[21] Chavarria-Miranda, D., Mellor-Crummey, J. and Sarang, T. 2001. Data-Parallel Compiler Support for Multipartitioning. European Conference on Parallel Computing (Euro-Par 2001), Manchester, United Kingdom, August 2001.

[22] Moon, B., Acharya, A. and Saltz, J. 1996. Study of scalable declustering algorithms for parallel Grid files. CS-TR-3589, University of Maryland, 1996.

[23] NetCDF webpage [online]. Available from: http://www.unidata.ucar.edu/software/netcdf (last accessed 4/2008).

[24] Oldfield, R. and Kotz, D. 2001. Armada: a parallel file system for computational grids. Proceedings of the 1st IEEE/ACM International Symposium on Cluster Computing and the Grid, Brisbane, Australia, IEEE Computer Society (2001), 194-201.

[25] Ramakrishnan, S. and Rhodes, P. J. 2006. Multidimensional Replica Selection in the Data Grid. Proc. HPDC 2006 (2 pages).

[26] Rhodes, P.J., Bergeron, R.D. and Sparr, T.M. 2001. A Data Model for Distributed Multisource Scientific Data. Hierarchical and Geometrical Methods in Scientific Visualization, Springer-Verlag, Heidelberg, 2001.

[27] Rhodes, P. J., Tang, X., Bergeron, R. D. and Sparr, T. M. 2005. Iteration Aware Prefetching for Large Multidimensional Scientific Datasets. Proc. SSDBM '05.

[28] Richardson, H. 1996. High Performance Fortran: history, overview and current developments. Tech. Rep. TMC-261, Thinking Machines Corporation, April 1996.

[29] Sarawagi, S. and Stonebraker, M. 1994. Efficient organizations of large multidimensional arrays. Proc. of the Tenth International Conference on Data Engineering, Feb. 1994.

[30] Thakur, R., Gropp, W. and Lusk, E. 2002. Optimizing noncontiguous accesses in MPI-IO. Parallel Computing 28 (2002), 83-105.

[31] Yan, B. and Rhodes, P.J. 2006. An Iteration Aware Multidimensional Data Distribution Prototype for Computing Clusters. Proc. of IEEE Cluster 2006, Barcelona, Spain, 2006.