Parallelizing Spatial Databases on Shared-Memory Multiprocessors

Shashi Shekhar, Sivakumar Ravada, Vipin Kumar, Douglas Chubb, Greg Turner

Abstract
Several emerging visualization applications such as flight simulators, distributed interactive simulation (DIS), and virtual reality use geographic information systems (GISs) for high-fidelity representation of actual terrains. These applications impose stringent performance and response-time restrictions which require parallelization of the GIS, and shared-memory multiprocessors (SMPs) are well suited for this task. Currently, we are developing a high-performance GIS on an SMP (Silicon Graphics "Power Challenge") as a part of the virtual GIS project of the Army Research Laboratories. Our experimental results with the range-query operation show that data-partitioning is an effective approach towards achieving high performance in GIS. As partitioning extended spatial objects is difficult, special techniques such as systematic declustering are needed to parallelize the range-query operation. Systematic declustering methods outperform random declustering methods for range-query problems. Even random declustering coupled with dynamic load-balancing (DLB) is outperformed by static load-balancing (SLB) methods with systematic static declustering. In addition, the performance of DLB methods can be improved by using the declustering methods to determine the subsets of polygons to be transferred at run-time.
Keywords: Declustering Methods, Geographic Information Systems, High Performance, Load-Balancing, Range Query, Shared-Memory Multiprocessors.
Authors: Shashi Shekhar, Sivakumar Ravada, Vipin Kumar, Douglas Chubb, Greg Turner

Addresses:
Department of Computer Science, University of Minnesota, 4-192 EE/CS, 200 Union St. SE, Minneapolis, MN 55455
Research and Technology Division, U.S. Army CECOM, RDEC, IEWD, Vint Hill Farms Station, Warrenton, VA 22186-5100
Army Research Laboratory, Adelphi, MD

E-mail: [email protected], [email protected], [email protected], [email protected]

Phone/Fax: (612) 624-8307 / (612) 625-0572
This work was supported by the Army High Performance Computing Research Center under agreement #DAAH04-95-2-0003/contract #DAAH04-95-C-0008.
1 Introduction

A high-performance geographic information system (HPGIS) is a central component of many interactive applications like real-time terrain visualization, situation assessment, and spatial decision making. The geographic information system (GIS) [9] often contains large amounts of spatial data (geometric and feature data) represented as large sets of points, chains of line-segments, and polygons. The spatial data sets required to model a geographic application are usually very large and require large data structures to represent them for use by a computer program. Since interactive and real-time applications require short response times, the GIS has to process large data sets in a very short time. The existing sequential methods for supporting GIS operations do not meet the real-time requirements imposed by many interactive applications. Hence, parallelization of the GIS is essential to meeting the high performance requirements of several HPGIS applications.

A GIS operation can be parallelized either on a shared-memory multiprocessor or on a distributed-memory multiprocessor. However, a bus-based shared-memory computer has several advantages for GIS problems when compared to a distributed-memory machine. Parallelizing GIS problems requires run-time work-transfers to achieve good load-balance, as the spatial data objects have varying sizes and complex shapes. A bus-based shared-memory multiprocessor facilitates these run-time work-transfers, as the entire data is equally accessible to all the processors. In the case of a distributed-memory machine, data has to be replicated at different processors to reduce the cost of run-time work-transfers, as the communication cost incurred in work-transfers is comparable to the processing cost [12] for many GIS operations. This data replication in distributed-memory machines reduces the total main memory available for storing the spatial data.
Another advantage of bus-based shared-memory machines for this kind of problem is that the results of the computation reside in a shared address space. Hence, results from one stage of the system are available to the next stage without any communication. In the case of distributed-memory machines, processing in the next stage of the system may require movement of data, as the results of the computation are left in the local memory of each processor.
1.1 Application Domain: Real-Time Terrain Visualization

A real-time terrain-visualization system [11] is a virtual environment [5], like a distributed interactive simulation system [1], which lets the users navigate and interact with a three-dimensional computer-generated geographic environment in real time. This type of system has three major components: interaction, 3-D graphics, and GIS. Figure 1 shows the different components of a terrain-visualization system for a typical flight simulator. The HPGIS component of the system contains a secondary storage unit for storing the entire geographic database and a main memory for storing the data related to the current location of the simulator. The graphics engine receives the spatial data from the HPGIS component and transforms these data into 3-D objects which are then sent to the display unit. As the user moves over the terrain, the part of the map that is visible to the user changes over time, and the graphics engine has to be fed with the visible subset of spatial objects for the given location and user's viewport. The graphics engine transforms the user's viewport into a range query and sends it to the HPGIS unit. The HPGIS unit retrieves the visible subset of spatial data from the main memory, computes its geometric intersection with the current viewport of the user, and sends the results back to the graphics engine. For example, Figure 1 shows a polygonal map and a range query. Polygons in the map are shown with dotted lines. The range query is represented by the rectangle, and the result of the range query is shown in solid lines. The frequency of this operation depends on the speed at which the user is moving over the terrain. For example, in the terrain visualization of a flight simulator, a new range query may be generated twice a second, which leaves less than half a second for intersection computation. In order to meet such response-time constraints, the HPGIS often caches a subset of spatial data in main memory. The main-memory database may in turn query the secondary-storage database to get a subset of data to be cached. The frequency of this operation should be very low for the caching to be effective.
[Figure: block diagram — the Display is fed (30/sec) by the Graphics & Analysis Engine, which exchanges a range-query and a set of polygons (2/sec, with feedback) with the Main-Memory Terrain Database; the Main-Memory Terrain Database in turn exchanges a range-query and a set of polygons with the Secondary-Storage Terrain Database. The two databases form the High Performance GIS Component.]

Figure 1: Components of the Terrain-Visualization system with details of a range query.
1.2 Problem Formulation

The range-query problem for the GIS can be stated as follows: Given a rectangular query box B and a set SP of extended spatial objects (e.g., polygons, chains of line segments), the result of a range query over SP is given by the set {x | x = Pi ∩ B and Pi ∈ SP}, where ∩ gives the geometric intersection of two extended objects. We call this problem the GIS-range-query problem. Note that this problem is different from the traditional range query, where the objects in the given range are retrieved from secondary memory (disk) to main memory without clipping the objects; it is instead similar to the polygon-clipping problem in computer graphics.

The existing sequential solutions for the range-query problem cannot always be directly used as a solution to the GIS-range-query problem, due to the high performance requirements of many applications. For example, the limit on response time (i.e., half a second, as shown in Figure 1) for solving the GIS-range-query problem allows the processing of maps with no more than 1500 polygons (or 100,000 edges) on many of the latest processors available today, like the IBM RS6000/590 and DEC-Alpha (150 MHz) processors. However, the maps used in many HPGIS applications are at least an order of magnitude larger than these simple maps. Hence we need to consider parallel processing to deliver the required performance. The goal of parallelization is to achieve the minimum possible response time for a set of range queries.
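Computing Pi ∩ B for a rectangular box B is essentially polygon clipping. The paper does not give its implementation; the following is a minimal sketch of the GIS-range-query using the standard Sutherland–Hodgman algorithm for clipping against an axis-aligned box. The names `clip_polygon` and `range_query` are hypothetical, not from the paper.

```python
def clip_polygon(poly, box):
    """Clip a polygon (list of (x, y) vertices) to the axis-aligned box
    (xmin, ymin, xmax, ymax) with Sutherland-Hodgman clipping."""
    xmin, ymin, xmax, ymax = box
    # Each clip edge: an "inside" predicate and an edge-intersection function.
    edges = [
        (lambda p: p[0] >= xmin,
         lambda a, b: (xmin, a[1] + (b[1] - a[1]) * (xmin - a[0]) / (b[0] - a[0]))),
        (lambda p: p[0] <= xmax,
         lambda a, b: (xmax, a[1] + (b[1] - a[1]) * (xmax - a[0]) / (b[0] - a[0]))),
        (lambda p: p[1] >= ymin,
         lambda a, b: (a[0] + (b[0] - a[0]) * (ymin - a[1]) / (b[1] - a[1]), ymin)),
        (lambda p: p[1] <= ymax,
         lambda a, b: (a[0] + (b[0] - a[0]) * (ymax - a[1]) / (b[1] - a[1]), ymax)),
    ]
    out = list(poly)
    for inside, intersect in edges:
        if not out:
            break
        inp, out = out, []
        for i, cur in enumerate(inp):
            prev = inp[i - 1]          # wraps to the last vertex for i == 0
            if inside(cur):
                if not inside(prev):   # entering: emit the crossing point
                    out.append(intersect(prev, cur))
                out.append(cur)
            elif inside(prev):         # leaving: emit the crossing point only
                out.append(intersect(prev, cur))
    return out

def range_query(polygons, box):
    """GIS-range-query: geometric intersection of every polygon with box B,
    keeping only non-empty results (cf. {x | x = Pi intersect B, Pi in SP})."""
    clipped = (clip_polygon(p, box) for p in polygons)
    return [c for c in clipped if c]
```

For a polygon completely inside the box, every vertex passes all four predicates and the polygon is returned unchanged, which matches the paper's later observation that such cases need only four comparisons per vertex.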
1.3 Scope of the Paper In this paper, we focus on parallelizing the GIS-range-query problem on a bus-based shared-memory multiprocessor. Even though the parallel formulations discussed in this paper do not strictly require a symmetric multiprocessor, the availability of cache-coherency eases the implementation of the dynamic load-balancing routines. In the rest of the paper, the word SMP refers to a cache-coherent shared-memory multiprocessor [10]. We also focus on a data-partitioning approach to parallelizing the GIS-range-query as opposed to a CREW PRAM [2] approach. The simple approach of dividing the data statically between processors without using any run-time work-transfers is referred to as the static load-balancing (SLB) approach. The approach of using work-transfers during run-time to improve the load-balance is referred to as dynamic load-balancing (DLB).
2 Parallel Formulation for GIS-range-query

The GIS-range-query problem has three main components: (i) approximate filtering at the polygon level, (ii) intersection computation, and (iii) polygonization of the result. (See [12] for a detailed discussion of a sequential algorithm.) Figure 2(a) shows the different data structures in the main memory and the communication pattern of the processors with the main memory for the GIS-range-query problem.
A search structure is used to approximately filter out non-interesting polygons from the set of input polygons. The query box is then intersected with each of the remaining polygons, and the output is obtained as a set of polygons by polygonizing the results of the intersection computation.

The GIS-range-query operation can be parallelized either by function-partitioning [2] or by data-partitioning [12]. Function-partitioning uses specialized data structures (e.g., distributed data structures) and algorithms which may be different from their sequential counterparts. Data-partitioning techniques divide the data among different processors and independently execute the sequential algorithm on each processor. Data-partitioning in turn is achieved by declustering [3] the spatial data. If the static declustering methods fail to equally distribute the load among different processors, the load-balance may be improved by redistributing parts of the data to idle processors using dynamic load-balancing (DLB) techniques. In this paper, we focus on parallelizing the GIS-range-query problem using the data-partitioning approach.

Figure 2 shows the different stages of the parallel formulation. In the following discussion, let P be the number of processors used in the system. Initially, all the data is partitioned into two sets, a local data set Sa and a shared data set Sb. The data in Sa is statically declustered into P sets Sai, for i = 0 … (P−1), and Sai is mapped to the local address space of processor Pi. The purpose of the local data set Sa is to reduce the synchronization overhead of accessing the data. The set Sb is shared among processors for DLB, and we call the set Sb the shared pool of data.
[Figure: per-processor pipeline — after the static partition, each processor takes the next range query and performs approximate filtering, intersection computation, and polygonization on its local set, then enters DLB mode and synchronizes with the other processors.]

Figure 2: Different stages of the parallel formulation.

When a bounding box for the next range query is received, each processor Pi performs the approximate polygon-level filtering, retrieves the candidate polygons from its local data set Sai, and places the result in a set Li. In addition, one of the processors performs the approximate filtering for the data from the set Sb and keeps the resulting object IDs in shared address space. Each processor Pi then independently works on data from the set Li until no more objects are left in this set. When a processor Pi finishes the work on the data from Li, it goes into the DLB mode. In this mode, only the data from the dynamic set are used for dynamic load-balancing. The algorithm terminates when the DLB method terminates.
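The two-phase structure above (process the statically declustered local set, then switch to DLB on the shared pool) can be sketched with threads standing in for processors. This is an illustrative sketch, not the paper's code: `process_query` and `bbox_overlaps` are hypothetical names, the intersection step is elided, and the shared pool is modeled as pre-filtered chunks claimed under a lock.

```python
import threading

def bbox_overlaps(poly, box):
    # Approximate filter: does the polygon's bounding box touch the query box?
    xs = [p[0] for p in poly]; ys = [p[1] for p in poly]
    xmin, ymin, xmax, ymax = box
    return min(xs) <= xmax and max(xs) >= xmin and min(ys) <= ymax and max(ys) >= ymin

def process_query(local_sets, shared_chunks, box):
    """Run one range query with P = len(local_sets) workers: each worker
    filters and processes its local set S_a[i] (building L_i), then enters
    DLB mode, pulling chunks of the shared pool S_b until none remain."""
    P = len(local_sets)
    results = [[] for _ in range(P)]
    next_chunk = [0]                 # shared index into the pool
    lock = threading.Lock()

    def worker(pid):
        # Phase 1: approximate filtering + processing of the local set
        for poly in local_sets[pid]:
            if bbox_overlaps(poly, box):
                results[pid].append(poly)      # intersection step elided
        # Phase 2 (DLB mode): claim shared-pool chunks under a lock
        while True:
            with lock:
                i = next_chunk[0]
                if i >= len(shared_chunks):
                    break                      # DLB terminates: pool empty
                next_chunk[0] += 1
            for poly in shared_chunks[i]:
                if bbox_overlaps(poly, box):
                    results[pid].append(poly)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(P)]
    for t in threads: t.start()
    for t in threads: t.join()
    return [poly for r in results for poly in r]
```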
2.1 Static Declustering of Spatial Data

The goal of declustering is to partition the data so that each partition imposes exactly the same load for any range query. Intuitively, polygons close to each other are scattered among different processors so that for each range query, every processor has an equal amount of work. Polygonal maps can be declustered at the polygonal level or at the sub-polygonal level. We focus only on polygonal-level declustering to avoid the additional work of merging sub-polygons.

Optimal declustering of extended spatial data (like polygons) is difficult to achieve due to the non-uniform distribution and variable sizes of polygons. In addition, the load imposed by a polygon for each range query is a function of the size and location of the query. Since the location of the query is not known a priori, it is hard to develop a declustering strategy that is optimal for all range queries. Moreover, the declustering problem for spatial data is shown to be NP-hard [12]. Since the problem is NP-hard, heuristic methods are used for declustering spatial data. A heuristic declustering method has to address the following four issues:

1. The work-load metric is the load imposed by an object for a range query, and it is a function of the shape and extent of the object. In the case of polygons, the work-load can be approximated by the number of edges or the area of the polygon.

2. The spatial extent R(A) of an object A is defined as the region of space affected by the object A, i.e., if a query overlaps R(A), the work required to process that query is influenced by the object A.

3. The load density for the spatial extent of an object A is the distribution of the work-load over the extent R(A). For example, the load imposed by A may be uniformly distributed over the region R(A). But typically, for extended spatial data, this load is non-uniform over R(A).

4. The algorithm of choice also affects the performance of declustering.
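Issues 1 and 2 reduce to small helper computations. The following sketch (hypothetical names, not from the paper) uses the choices the paper later adopts: edge count as the work-load metric and the minimum bounding rectangle as the spatial extent.

```python
def work_load(poly):
    """Issue 1: approximate an object's load by its number of edges
    (for a closed polygon, the edge count equals the vertex count)."""
    return len(poly)

def spatial_extent(poly):
    """Issue 2: R(A) as the smallest axis-aligned rectangle
    enclosing the object, returned as (xmin, ymin, xmax, ymax)."""
    xs = [p[0] for p in poly]; ys = [p[1] for p in poly]
    return (min(xs), min(ys), max(xs), max(ys))

def extent_overlaps(extent, query_box):
    """A query influences object A only if it overlaps R(A)."""
    ax0, ay0, ax1, ay1 = extent
    qx0, qy0, qx1, qy1 = query_box
    return ax0 <= qx1 and ax1 >= qx0 and ay0 <= qy1 and ay1 >= qy0
```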
Local load-balance and similarity-graph-based methods are two popular algorithms for declustering point data. In our experiments, we extend these algorithms to polygonal data. Figure 3 shows a set of polygons being declustered among 4 processors. The processor number to which a polygon is assigned is shown inside each polygon. The processor assignments are shown for all the polygons except for polygon A. Now consider the problem of assigning this polygon to a processor. Using the local load-balance method with area as the load metric, polygon A is assigned to processor 3, as that processor has the least total area over the polygons assigned to it. Similarly, the local load-balance method with the number of polygons as the load metric would assign polygon A to processor 4, as this processor has the least number of polygons associated with it. On the other hand, the similarity-graph-based declustering method would assign the polygon to processor 1, as that processor has the least load in the vicinity of polygon A. A random choice would assign polygon A to any of the four processors.
Figure 3: Different polygon assignments by different declustering algorithms.

In our experiments, we use the number of edges as the work-load metric, with the smallest rectangular box enclosing each object as the spatial extent of that object. Further, we assume that the work-load is uniformly distributed over the extent of the object. In addition, we compare the performance of the two algorithms for declustering the spatial data.
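The local load-balance heuristic can be sketched as below. This is an illustrative greedy variant (assigning heavy objects first), not the paper's exact algorithm; `llb_decluster` is a hypothetical name, and the `load` parameter lets the same code reproduce either the edge-count or the area metric from the Figure 3 discussion.

```python
def llb_decluster(polygons, P, load=len):
    """Local load-balance heuristic: considering the heaviest objects
    first, assign each polygon to the processor with the least
    accumulated load so far. With load=len the metric is the edge
    count; pass an area function for the area-based variant."""
    partitions = [[] for _ in range(P)]
    totals = [0] * P
    for poly in sorted(polygons, key=load, reverse=True):
        target = min(range(P), key=totals.__getitem__)  # least-loaded processor
        partitions[target].append(poly)
        totals[target] += load(poly)
    return partitions
```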
2.2 Dynamic Load-Balancing (DLB) Techniques

If static declustering methods fail to equally distribute the load among different processors, the load-balance may be improved by transferring some spatial objects to idle processors using dynamic load-balancing techniques. A typical dynamic load-balancing technique for a shared-memory computer addresses two issues: (i) which processor an idle processor should ask for more work, and (ii) the granularity of work transfer and the data-partitioning method.
Which Processor Should an Idle Processor Ask for More Work?

Methods to decide which processor an idle processor should ask for more work are discussed and analyzed in [8]. These methods can be divided into two categories: (1) In a pool-based method (PBM), a fixed processor has all the available work, and an idle processor asks this fixed processor for more work. (2) In a peer-based method, all the work is initially distributed among the processors, and an idle processor selects a peer processor as the work donor using random polling, nearest neighbor, global round robin (GRR), or asynchronous (local) round robin (ARR).

In a pool-based method, a single lead processor keeps all the remaining work in a shared pool, and an idle processor asks this lead processor for more work. The structure of the GIS-range-query problem imposes a limitation on the amount of work that can be kept in the shared pool. If the lead processor has too big a pool, some of the other processors might finish processing their local data before the lead processor can perform the approximate filtering step on the pool data. This forces the other processors to wait for the lead processor to finish the filtering step on the pool data. On the other hand, too small a
pool can also lead to processor idling, due to the static load-imbalance in the rest of the data. These two situations are shown in Figure 4.

[Figure: two timelines — (a) with a small pool, ApprxFil(Pool) finishes quickly but static imbalance among the local sets S_1 … S_P leaves processors idle in DLB mode; (b) with a large pool, processors finish their local sets and idle while waiting for ApprxFil(Pool) to complete before DLB can proceed.]
Figure 4: A small pool may result in a high static load-imbalance. A large pool may result in processor idling.
Granularity of Transfers and Data-Partitioning Method

Several strategies like self-scheduling [4], factoring scheduling [6], and chunk scheduling [7] exist for determining the amount of work to be transferred during DLB. The first two scheduling strategies are mostly used in pool-based DLB methods, while chunk scheduling is applicable in both peer-based and pool-based DLB methods. The cost of synchronization affects the performance of these scheduling algorithms, as several processors have to share the knowledge about the remaining work. Small chunks can be used when the synchronization cost is low. In the case of chunks of more than one object, it is desirable to keep a comparable amount of work in each chunk, so that the load-imbalance can be kept low. Since work estimation is hard in the case of extended spatial-data objects, the simple scheduling strategies are not suitable for GIS applications. However, if the work can be divided into chunks of equal work, any of the scheduling strategies can be applied, treating each chunk as the smallest piece of work that can be transferred. We note that this problem of dividing work into chunks of equal work is similar to the declustering problem. We hypothesize that the load-balance of any DLB method can be improved by using a systematic declustering method for dividing the work into chunks. Since the declustering operation is expensive, this chunking can be done statically. As loading a new data set into main memory is done relatively infrequently (less than once in 5 minutes), there is sufficient time to perform this operation without interrupting the actual range-query operation.
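Dividing the pool into chunks of roughly equal work can be sketched as a simple greedy pass; this is a minimal illustration (hypothetical `make_chunks`, edge count as the work metric), not the systematic declustering the paper hypothesizes about, which would additionally consider spatial proximity.

```python
def make_chunks(polygons, target_edges):
    """Greedily pack polygons into chunks of roughly target_edges total
    work, so a DLB scheduler can treat each chunk as the smallest
    transferable unit of work."""
    chunks, current, work = [], [], 0
    for poly in polygons:
        current.append(poly)
        work += len(poly)              # edge count as the work metric
        if work >= target_edges:       # chunk full: start a new one
            chunks.append(current)
            current, work = [], 0
    if current:                        # leftover polygons form a final chunk
        chunks.append(current)
    return chunks
```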
3 Experimental Evaluation

We experimentally evaluate the proposed parallel formulation on a shared-bus multiprocessor with 16 processors and 1 GByte of main memory (SGI Power Challenge). We use spatial vector data for Killeen, Texas, for this experimental study. This data is divided into seven themes representing the attributes slope, vegetation, surface material, hydrology, etc. We used the "slope" attribute-data map with 1458 polygons and 82324 edges as the base map in our experiments (this is denoted by the 1X map). In the rest of the discussion, a kX map denotes a map that is k times bigger than the base 1X map. Table 1 shows the details of the maps and the range queries.

Table 1: Maps and range queries used in our experiments

  Map   #Objects          #edges     range-query size   number
  1X     1458 polygons     82324     25%                75
  2X     2916 polygons    164648     25%                75
  4X     5832 polygons    329296     25%                75
  8X    11664 polygons    658592     25%                75
  16X   23328 polygons   1317184     25%                75

Figure 5 shows our experimental methodology. The number of different options we tried for each parameter is shown in parentheses, and the number of possible combinations after each module is also shown in the figure. In our experiments, we only measure and analyze the cost per range query and exclude any preprocessing cost. This preprocessing cost includes the cost of loading the data into main memory and the cost of declustering the data among the processors. Note that this preprocessing cost is paid only once for each data set, and the frequency of this operation is less than once in 5 minutes.
[Figure: experimental pipeline — a map generator produces 5 maps (base map scaled 1X, 2X, 4X, 8X, 16X) and a work-load generator produces 75 range queries of a given size; the Decluster module applies a partitioning/declustering method (llb, sim, random); the parallel HPGIS is then run for P = 1, 2, 4, 8, 16, pool sizes 0.0–1.0, DLB methods (GRR, ARR, PBM), and 1–30 polygons per chunk (450 options), yielding 90000 measurements for analysis.]

Figure 5: Experimental method for evaluating DLB methods for the parallel GIS-range-query.
3.1 Evaluation of Static Declustering Methods

We compare the performance of the different declustering methods: LLB, similarity-graph, and random. For simplicity, the work-load metric is fixed to be the number of edges, and the spatial extent is assumed to be a point. The load density over the spatial extent is assumed to be uniform. Figure 6 gives the speedups for these three methods for two different maps. The x-axis gives the number of processors, and the y-axis gives the average speedups for 75 range queries.

[Figure: two speedup-vs-processors plots, (a) Map=2X and (b) Map=16X, each comparing "sim", "llb", and "ran".]

Figure 6: Speedups for different static declustering methods.

The main trends observed from these graphs are: (i) Bigger maps lead to better speedups for most schemes, probably due to the improved load-balance. (ii) The similarity-graph method gives the best speedups among the different methods. (iii) Random declustering methods provide inferior speedups beyond 8 processors.
3.2 Evaluation of the Granularity of Work-Allocation in DLB

We compare the effect of different chunk sizes, with the number of polygons per chunk ranging from 1 to 100, using PBM as the DLB method and the similarity-graph method for declustering. The experiment is conducted using the 16X map for P = 4 … 16 with a 20% pool. Figure 7 shows the results of this experiment. As can be seen from this graph, chunks with multiple polygons give better speedups than single-polygon chunks. This is mainly due to the very small amount of time required to solve the range-query problem for a large number of small polygons. For example, when the polygon is completely inside the range query, it takes only 4 comparisons to solve the problem. Also, as expected, the best chunk size increases with an increasing number of processors.
3.3 Declustering for DLB Methods

In this experiment, the effect of chunking based on systematic declustering is compared to that of random declustering for the DLB method. We used PBM as the DLB method and compared the random, similarity-graph, and LLB methods of declustering. The dynamic data is declustered with 10 polygons per chunk. Figure 8 shows the experimental results. The x-axis gives the number of processors, and the y-axis gives the average speedup over 75 queries.

[Figure: speedup vs. polygons per chunk (1–100) for P=4, 8, 16.]

Figure 7: Speedups for different granularities of work transfers for the GRR method for Map=16X.

From this data, it is clear that random declustering of data is not as effective as systematic declustering for achieving a good load-balance for the GIS-range-query problem. This shows that systematic declustering of data improves the load-balance. Also note that SLB with systematic declustering (as in Figure 6(b)) is better than SLB coupled with DLB using random declustering (as in Figure 8).
[Figure: speedup vs. number of processors comparing "sim", "LLB", and "RAN" declustering of the dynamic data.]

Figure 8: Speedups for different declustering methods for the GRR method for Map=16X.
3.4 Effect of the Pool Size

We evaluate the effect of the pool size using PBM as the DLB method for a varying number of processors and varying data files. The number of processors is varied from 4 to 16, and the data files are varied from the 1X map to the 16X map. The pool size is varied from 0 to 100% of the total data. Note that a 0% pool refers to static declustering with no DLB. Figure 9 shows the results of this experiment. The x-axis gives the size of the pool as a percentage of the total data, and the y-axis gives the average speedups over 75 range queries.
[Figure: speedup vs. pool size; (a) P=16 for Map=1X, 2X, 4X, 8X, 16X, (b) Map=16X for P=4, 8, 16.]
Figure 9: Speedups for different pool sizes for the pool-based method.

As expected, the speedups increase as we increase the pool size, up to a point, and then they start decreasing. The initial increase in speedup may be due to the improved load-balance. The decrease in speedup after achieving a maximum value is due to the non-parallelizable overhead of the approximate filtering. Note that this decrease is greater as P increases. This is due to the increase in the non-parallelizable overhead with increasing P and the increase in the synchronization overhead. The maximum speedup occurs at different pool sizes for different numbers of processors and for different data sets. Also note that the best speedups are achieved at smaller pool sizes as the map size increases. This may be due to the improved static load-balance resulting from the increased total work.
3.5 Comparison of DLB Methods

We compare the performance of the three DLB methods (GRR, ARR, and PBM). The number of processors is varied from 4 to 16, and the 16X map is used as the input data. The number of polygons per chunk is 10, and the similarity-graph method is used for declustering the data. Figure 10 shows the speedups for these three methods. The x-axis gives the number of processors, and the y-axis gives the average speedups over 75 range queries. Both PBM and ARR have comparable performance, while GRR performs worse than these two methods.
[Figure: speedup vs. number of processors for "PBM", "ARR", and "GRR"; (a) Map=2X, (b) Map=16X.]

Figure 10: Speedups for different DLB methods for Map=2X and Map=16X.

4 SMP Programming Models for the GIS-range-query Problem

A parallel algorithm for the range-query problem can be developed on a CREW PRAM model of computation or on a message-passing model of computation. In the case of PRAM models, the synchronization and processor communication are assumed to be free of cost. Hence PRAM models often result in inferior speedups compared to what is predicted by the algorithm. On the other hand, a message-passing algorithm will usually result in an implementation that conforms to the performance predicted by the algorithm. Hence, if there is a message-passing algorithm which shows sufficient locality of data, that algorithm is better suited for implementation on an SMP than a CREW PRAM algorithm. This approach is very useful when the application exhibits data parallelism. It is also useful when there is no obvious PRAM method that would outperform any other message-passing parallel formulation, and when the available PRAM algorithms are hard to implement. Since the GIS-range-query problem exhibits all these characteristics, we chose to port a message-passing algorithm to a bus-based SMP.

A message-passing algorithm can be ported to an SMP using PVM/MPI-type packages. Traditionally, PVM/MPI-type packages are used to port a distributed-memory implementation to a shared-memory environment and vice versa. Even though these packages are very convenient for porting parallel programs, the performance achieved using such packages did not meet the high performance requirements of the GIS-range-query for terrain visualization. Since each call to PVM/MPI is treated as a system call, the PVM/MPI implementations incurred run-time overhead which can otherwise be avoided. We manually ported the message-passing algorithm to the SMP using the transformations given in Table 2. One advantage of this manual transformation is that it reduces the overhead incurred in a PVM/MPI-based implementation. Another advantage is the identification of replicated data in the message-passing algorithm, which is converted to shared data on the SMP.
This address-space mapping is not done automatically by PVM/MPI-type systems. As can be seen from the experiments, this address mapping improves the capacity of the system to solve larger problems.

Table 2 gives the mapping from the message-passing model to the shared-memory features. In a message-passing model, the address space is either local or remote, while in an SMP, the address space is either local or global. Similarly, processors in a message-passing environment communicate by exchanging messages, while in an SMP, the shared address space is used for inter-processor communication. In message-passing systems, data replication is often used to facilitate data sharing between processors. This is mostly used when the cost of data transfer is very high compared to the cost of data processing. This distinction between the different kinds of address space is shown in Table 2. Data partitioning is used in a message-passing environment to increase the locality of data so as to reduce the communication cost. Similarly, in the case of SMPs, data partitioning can be used to increase locality so as to reduce the synchronization overhead. In a distributed-memory machine, explicit message transmission is required to detect the termination of a process. In SMPs, barrier synchronization and spin locks can be used for this purpose.

Table 2: Porting a Distributed-Memory program to an SMP

  Distributed-Memory program           →  Shared-Memory program

  Address Space Mapping:
  local data (R/W) (non-replicated)    →  local data
  replicated data (read-only)          →  global data without locks
  remote data (R/W)                    →  global data with locks

  Data-Transfer/Data-Sharing:
  communication                        →  memory read/write
  synchronization with messages        →  synchronization with memory R/W
  send message M                       →  lock(M); write(M); unlock(M)
  recv message M                       →  lock(M); read(M); unlock(M)
  one-to-all broadcast                 →  write by one; read by others (with locks)
  all-to-all broadcast                 →  read/write by all (with locks)
  termination detection                →  barrier synchronization/spin-locks

  Data-Partitioning and DLB:
  declustering method M                →  declustering method M
  DLB method D                         →  DLB method D′
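The send/recv rows of Table 2 can be illustrated with a small shared-memory channel: a send becomes lock–write–unlock and a recv becomes lock–read–unlock. This is a sketch in Python threads (hypothetical `Channel` class, not the paper's C implementation); a semaphore stands in for the message-arrival notification, and the single slot holds one message at a time.

```python
import threading

class Channel:
    """Shared-memory stand-in for a message channel, following Table 2:
    send -> lock(M); write(M); unlock(M) and recv -> lock(M); read(M);
    unlock(M). Single-slot: a second send before a recv overwrites."""
    def __init__(self):
        self._lock = threading.Lock()
        self._ready = threading.Semaphore(0)   # signals "message available"
        self._value = None

    def send(self, msg):              # replaces an explicit message send
        with self._lock:
            self._value = msg
        self._ready.release()

    def recv(self):                   # replaces an explicit message receive
        self._ready.acquire()
        with self._lock:
            return self._value
```

In the same spirit, a one-to-all broadcast becomes a single locked write followed by locked reads by all processors, and termination detection becomes a barrier rather than an explicit termination message.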
5 Discussion and Conclusions

Data partitioning is an effective approach towards achieving high performance in GIS. We parallelize the GIS range-query problem using data-partitioning and dynamic load-balancing techniques. We use data partitioning to reduce the synchronization overhead, as opposed to its use on a message-passing system, where it reduces the communication overhead. Partitioning extended spatial-data maps is difficult, due to the varying sizes and extents of the polygons and the difficulty of estimating the work load. Hence, special techniques such as systematic declustering are needed to parallelize the GIS range-query problem. We show that the traditional declustering methods for multi-dimensional point data need significant extensions to be applicable to extended spatial data. We also show that neither declustering nor DLB alone is sufficient for achieving good speedups beyond 8 processors.

We show that bus-based SMPs are well suited for parallelizing GIS operations due to the large volume of data required to represent the spatial data. We also show that the main memory is better utilized on a bus-based SMP than on a distributed-memory machine for solving the GIS range-query problem. For example, by going from a distributed-memory machine to a shared-memory machine with the same amount of total main memory, we are able to solve problems 8 times larger on the shared-memory machine. We experimentally rank different data-partitioning (declustering) methods for extended spatial data by evaluating their performance for the range-query operation. The experimental results show that systematic declustering methods outperform random declustering methods for range-query problems. Even random declustering coupled with DLB is outperformed by SLB methods with systematic static declustering. DLB methods can be used to improve the speedups obtained by SLB methods with systematic declustering. In addition, the performance of DLB methods can be improved by using the declustering methods to determine the subsets of polygons to be transferred at run-time. Among DLB methods, PBM outperforms peer-based methods for P ≤ 16, but with increasing P, peer-based methods start to dominate PBM.
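The distinction between systematic and random declustering can be sketched in a few lines of Python. This is a hypothetical illustration, not one of the declustering methods evaluated in this paper: it uses a Z-order (Morton) space-filling curve with round-robin assignment as a stand-in for systematic declustering, so that each processor receives a spatially scattered subset and the load of any contiguous range query spreads evenly; random assignment gives no such guarantee for a particular query window.

```python
import random

def z_order(x, y, bits=8):
    """Interleave the bits of integer coordinates (Morton code)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def systematic_decluster(polygons, nprocs):
    """Sort polygons by the Z-order of a representative point, then deal
    them round-robin so each processor holds a spatially scattered subset."""
    ordered = sorted(polygons, key=lambda p: z_order(p["cx"], p["cy"]))
    return [ordered[i::nprocs] for i in range(nprocs)]

def random_decluster(polygons, nprocs, seed=0):
    """Assign each polygon to a processor uniformly at random."""
    rng = random.Random(seed)
    parts = [[] for _ in range(nprocs)]
    for p in polygons:
        parts[rng.randrange(nprocs)].append(p)
    return parts

# Toy map: 64 unit "polygons" on an 8x8 grid.
polygons = [{"cx": x, "cy": y} for x in range(8) for y in range(8)]
parts = systematic_decluster(polygons, 4)
print([len(p) for p in parts])   # each processor holds 16 polygons

# A range query over the lower-left quadrant touches 16 polygons;
# systematic declustering spreads them evenly over the 4 processors.
window = [p for p in polygons if p["cx"] < 4 and p["cy"] < 4]
counts = [sum(1 for p in part if p in window) for part in parts]
print(counts)                    # → [4, 4, 4, 4]: the query load is balanced

rparts = random_decluster(polygons, 4)
rcounts = [sum(1 for p in part if p in window) for part in rparts]
print(rcounts)                   # typically uneven for a given window
```

The key design point is that declustering deliberately scatters nearby polygons, the opposite of clustering for locality, because a range query concentrates its work in one region of space.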
Acknowledgments

This work is sponsored by the Army High Performance Computing Research Center under the auspices of the Department of the Army, Army Research Laboratory cooperative agreement number DAAH04-95-2-0003/contract number DAAH04-95-C-0008, ARO contract number DA/DAAH04-95-1-0538, the content of which does not necessarily reflect the position or the policy of the government, and no official endorsement should be inferred. This work is also supported by NSF grant #CDA9414015. We would like to thank the AHPCRC and the Department of Computer Science, University of Minnesota, for providing us with access to their computing facilities. We would also like to thank Minesh Amin and Christiane McCarthy for improving the readability and technical accuracy of this paper.