Int. J. Reasoning-based Intelligent Systems, Vol. 5, No. 1, 2013
High-performance computing in GIS: techniques and applications

Natalija Stojanovic* and Dragan Stojanovic

Department of Computer Science, Faculty of Electronic Engineering, University of Niš, Niš, Serbia
Email: [email protected]
Email: [email protected]
*Corresponding author

Abstract: In this paper, the application of High-Performance Computing (HPC) techniques to the processing of geospatial data in a Geographic Information System (GIS) is presented. We evaluate two parallel/distributed architectures and programming models, Message Passing Interface (MPI) over a Network of Workstations (NoW) and Compute Unified Device Architecture (CUDA) on a Graphics Processing Unit (GPU), on two well-known problems in GIS: map matching and slope computation. A distributed application for map-matching computation over large spatial data sets consisting of moving points and a road network was implemented using MPI and experimentally evaluated. Slope computation over large digital elevation model data was performed on a GPU using CUDA, which enables hundreds of threads to run concurrently on the multiprocessors of a graphics card. Experimental evaluation indicates improvements in performance and shows the feasibility of using NoW and the multiprocessors of a graphics card for HPC in GIS.

Keywords: high-performance processing; NoW; network of workstations; MPI; GPU; CUDA; spatio-temporal data; map matching; GIS; slope.

Reference to this paper should be made as follows: Stojanovic, N. and Stojanovic, D. (2013) ‘High-performance computing in GIS: techniques and applications’, Int. J. Reasoning-based Intelligent Systems, Vol. 5, No. 1, pp.42–49.

Biographical notes: Natalija Stojanovic is an Assistant Professor at the Computer Science Department, Faculty of Electronic Engineering, University of Niš, Serbia. She received her PhD, MSc and BSc degrees in Computer Science from the University of Niš in 2009, 2003 and 1999, respectively. Her current research interests include high-performance parallel and distributed computing architectures and programming models, data-intensive applications and cloud computing. She has successfully participated in several international and national projects in these and related domains.

Dragan Stojanovic is an Associate Professor at the Computer Science Department, Faculty of Electronic Engineering, University of Niš, Serbia. He received his PhD, MSc and BSc degrees in Computer Science from the University of Niš in 2004, 1998 and 1993, respectively. His research and development interests include context-aware and location-based mobile services, mobility data management, geographic information systems, spatio-temporal databases, and mobile and ubiquitous computing. He has published widely on these and related topics, and has successfully participated in several international and national R&D projects in cooperation with academic partners and industry.

This paper is a revised and expanded version of a paper entitled ‘High-performance processing of geospatial data on network of workstations’ presented at the Jubilee 10th International Conference on Telecommunications in Modern Satellite, Cable and Broadcasting Services (TELSIKS’2011), Niš, Serbia, 5–8 October 2011.
1 Introduction
Many of today’s computing-intensive applications require more processing power than ever before. Computer games, database searching, web search engines, financial and economic forecasting, climate modelling, medical imaging and many other application domains demand acceleration and performance
improvements (Patel and Hwu, 2008). One approach to achieving performance improvements, used over the last decades, has been to decrease the size of the components used to build computers; technological progress has made it possible to put billions of transistors on a single chip. However, raising the clock frequency, long the major driver of computer performance gains, has become limited by energy consumption and heat dissipation. These limitations have led to alternative strategies for creating more powerful computers and have made parallelism one of the crucial means of accelerating applications. Application software must therefore be able to make effective use of the parallelism present in the available hardware resources. There are many ways to parallelise applications in order to gain high performance, and just as there are several different classes of parallel hardware, there are different models of parallel programming.

Multicore processors have made each desktop or laptop computer a small parallel system. OpenMP (Open Multi-Processing) was developed to enable writing parallel programs for shared-memory multiprocessor platforms. OpenMP is a set of compiler directives, library routines and environment variables that lets the programmer tell the compiler which instructions to execute in parallel, and how to distribute them among the threads that will run the code.

The Message Passing Interface (MPI) has become the major model for programming distributed-memory applications. MPI is a specification, not an implementation, of library routines that help users write portable message-passing programs in C/C++ and Fortran. MPI is a library of functions designed to allow execution of multiple processes on different processors; the processes work concurrently, using messages to communicate and collaborate with each other.

Graphics Processing Units (GPUs) have recently emerged as powerful co-processors for general-purpose computation. Compared with commodity CPUs, GPUs have an order of magnitude higher computational power, as well as memory bandwidth. They provide general parallel processing capabilities and general-purpose programming languages such as NVIDIA CUDA (Kirk and Hwu, 2010). The Compute Unified Device Architecture (CUDA) comprises a programming model and a hardware architecture that support data-parallel implementations. The CUDA C/C++ compiler, libraries and runtime software enable programmers to use this data-parallel computation model to develop and accelerate data-intensive applications.

Many modern software applications, developed to model the real world, process large amounts of data and thus exhibit long execution times on today’s computers. One application domain that could significantly benefit from parallel processing is the Geographic Information System (GIS), which includes processing and analytics of massive geospatial data (Big Geospatial Data).
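As a brief illustration of the directive-based OpenMP model described above (a minimal sketch, not from the original paper; the array names and size are arbitrary):

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N], c[N];

    /* The directive tells the compiler to split the loop iterations
       among the threads of a shared-memory team. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("done, up to %d threads available\n", omp_get_max_threads());
    return 0;
}
```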
2 GIS and high-performance computing
Advanced geospatial data collection and mobile positioning technologies (remote sensing, wireless sensor networks, satellite positioning, etc.) available today enable the collection of large quantities of geographically referenced data at low cost. Such growth in geospatial database size is accompanied by growing complexity and computational intensity of the analysis and processing of such data required in different GIS application domains, such as environmental monitoring, climate change observation, traffic management and optimisation, fleet management, and Location-Based Services (LBS). High-Performance Computing (HPC) can be used to
overcome the computational intractability of large, complex spatio-temporal data sets and of the spatio-temporal analysis and processing over such data that have large computational requirements. The application of HPC principles and technologies in GIS dates from the mid-1990s (Armstrong, 1995) but nowadays, with the increasing volume of geospatial data required for computational- and data-intensive problem solving in different GIS application domains, it has emerged as a prominent research area (Clematis et al., 2003; Shekhar, 2010). HPC methodologies and techniques are successfully used in processing voluminous raster geospatial data produced by Earth-observation satellites in terabytes (Akhter et al., 2010; Shekhar, 2010; Zhang, 2010), as well as in spatial query processing and overlay computation over huge amounts of spatial vector data (Wang et al., 2009; Park et al., 2010; Shi, 2010). Spatial queries and spatial overlay computations, as methods for combining information between different GIS layers, are crucial capabilities of geospatial analysis in various GIS applications. Both spatial join query processing and spatial overlay computation are time-consuming when handling large-scale, voluminous spatio-temporal data, and thus are excellent candidates for the application of HPC. While geospatial data collection reaches unprecedented levels, and HPC architectures such as computer clusters, cloud computing platforms and, recently, massively parallel GPUs become increasingly available, there is still a lack of parallel GIS algorithms, application libraries and toolkits on these architectures.

There are several types of high-performance architectures based on parallel and distributed processing principles. Some of them employ parallel processing within a single computing node, while others are built from collections of networked, possibly heterogeneous computers/workstations. Because of their scalability and availability, these networked environments represent a promising solution for low-cost, high-performance GIS computing. High-performance processing of geospatial data can be performed using the following existing HPC paradigms:
•  parallel computers: multi-CPU and multicore CPU
•  cluster computing using a Network of Workstations (NoW)
•  cloud and grid computing
•  general-purpose GPU (GPGPU) computing.
Several recent research works employ high-performance processing techniques on geospatial data. Akhter et al. (2010) develop a methodology that enables the GRASS GIS software to run on HPC systems; different implementations of parallel/distributed GRASS modules on three programming platforms (MPI, Ninf-G and OpenMP) are presented, and their performance is compared and reviewed. Zhang (2010) considers a new HPC framework for processing geospatial data in a personal computing environment. He argues that modern personal computers equipped with multicore CPUs and many-core GPUs provide excellent support for spatial data processing compared with cluster computing using MPI and the newly emerged cloud computing using the MapReduce framework.
Shi (2010) identifies and discusses some fundamental research challenges in the application of HPC techniques in a service-oriented GIS. Wang et al. (2009) design a framework for retrieving, indexing, accessing and managing spatial data in the cloud environment. They developed an interoperable spatial-data object model, redesigned spatial indexing structures and algorithms for the cloud computing environment, and implemented the proposed model as prototype software based on Google App Engine. Park et al. (2010) proposed cloud computing for 3D image processing and visualisation, using the MapReduce methodology and the Hadoop open-source software framework to perform massively parallel processing of 3D geospatial data. They also present a performance comparison with an MPI-based cluster computing implementation, finding that the MPI implementation is faster than MapReduce/Hadoop, while the latter demonstrated better fault-tolerance and more stability in their experiments. Ma et al. (2009) presented a new framework for query processing over trajectory data based on MapReduce. They extended traditional data partitioning, indexing and query processing techniques to apply them to highly parallel, large-scale clusters utilising the MapReduce framework. A large trajectory data set was partitioned into subsets according to two strategies, based on spatial extent and on object identifiers. These subsets were distributed to different nodes in the cluster, and spatio-temporal range queries and trajectory queries were processed on them according to the MapReduce methodology. Their preliminary experiments showed that the proposed framework is efficient and scalable in processing trajectory data sets. Bouillet and Ranganathan (2010) showed the application of System S for high-performance stream processing to accelerate the real-time computation of map matching, using a grid-based decomposition approach and distributing the computation over several processing nodes (a cluster of workstations). Their performance evaluations show that the approach can perform real-time map matching of GPS data arriving at a rate of 1 million points per second onto a map with 1 billion links in an acceptable time.

He et al. (2008) designed a small set of data-parallel primitives for relational join processing on the GPU, tuned to fully utilise the architectural features of the GPU. These primitives are used in the implementation of representative relational join algorithms, which achieve a significant performance improvement with respect to their optimised CPU-based counterparts. Fang et al. (2009) presented two implementations of the Apriori algorithm for frequent itemset mining: the first stores itemsets in a bitmap and runs entirely on the GPU, while the second utilises a trie data structure to store itemsets and adopts a GPU-CPU co-processing scheme. Both implementations are up to two orders of magnitude faster than optimised CPU-based implementations. In the work of Bakkum and Skadron (2010), acceleration of database operations, particularly SELECT queries, was performed. Using SQL represents an improvement over the paradigm of previous research, where GPU queries were driven through the use of operational primitives such as map, reduce or sort. The obtained results showed that, depending on the size of the result set, the minimum speed-up was 20x. Parallel processing for fast retrieval of a large set of spatial data contained in a given query region was presented by Oh (2012). The author proposed a method that achieves high speed by efficiently utilising the parallelism of the GPU and by reducing transfer latency between the GPU and the CPU. To measure the speed-up achieved by the proposed method, its execution time was compared with a sequential solution based on an R*-tree loaded in CPU main memory; the proposed method ran from 1.81 to 186.47 times faster than the main-memory-based R*-tree. In the work of van der Merwe and Meyer (2009), automatic Digital Surface Model (DSM) generation from satellite imagery was performed on a GPU; according to their preliminary results, the GPU algorithm decreased processing time by 900%. Beutel et al. (2010) considered construction of a grid Digital Elevation Model (DEM) from 3D point clouds generated by LiDAR equipment. Their results showed that using a GPU for this type of GIS application can significantly speed up the computation.
3 Distributed processing of geospatial data on network of workstations
3.1 MPI

MPI has become the major model for programming parallel distributed-memory applications. MPI is a specification, not an implementation, of library routines that help users write portable message-passing programs in C/C++ and Fortran. There are different implementations of MPI: MPICH, OpenMPI, LAM, WMPI, Cray MPI, etc. The most popular MPI implementation is MPICH2 (http://www.mcs.anl.gov/research/projects/mpich2/). The major goal of MPI, as with most standards, is a degree of portability across different machines. MPI has the ability to run on heterogeneous systems, i.e. groups of processors with distinct architectures (Snir et al., 1998). MPI therefore provides a computing model that hides many architectural differences: it is not necessary to know whether the code is sending messages between processors of the same or of different architectures. The MPI implementation automatically performs any necessary data conversion and utilises the correct communication protocol (Figure 1). This means that the same source code can be executed on a variety of machines as long as the MPI library is available. Message passing works by creating processes which are distributed among a group of processors. The basic assumption behind MPI is that multiple processes work concurrently, using messages to communicate and collaborate with each other. MPI is a library of functions designed to allow execution of multiple processes on different processors. One or more processes can execute on each participating processor, depending
on the number of available processors. MPI provides a default communicator, MPI_COMM_WORLD, which consists of all the processes that participate in the execution. When an MPI program is running, all processes execute the same code. MPI always identifies processes by their relative rank in a group, that is, by consecutive integers in the range 0, …, number_of_processes − 1. Each process uses its own rank (process identifier) to determine which part of the program instructions it is supposed to run.

Figure 1  Overall architecture of network of workstations-MPI (see online version for colours)
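As a minimal illustration of this rank-based SPMD model (a sketch, not code from the paper):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);                 /* start the MPI runtime             */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id: 0..size-1      */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of participating processes */

    /* Every process runs the same code; the rank selects its share of work. */
    printf("process %d of %d is handling its part of the data set\n", rank, size);

    MPI_Finalize();
    return 0;
}
```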
3.2 Distributed spatial join using MPI

With the advances in positioning technologies and the development of LBS, large volumes of spatio-temporal data representing trajectories of moving objects have emerged. However, efficient and scalable spatial join query processing and spatial overlay computation over spatio-temporal trajectory data remain a big challenge. A trajectory data set is usually large in volume and exceeds the computation and storage capacities of a single computer. Hence, it seems natural to split the trajectory data set and the processing tasks across a number of different computers (processing nodes) and then aggregate their partial results into the final one. Regarding trajectory (moving object) data sets, the crucial problem in GIS and LBS applications that deal with such data is how to enrich raw trajectory data (a stream of spatio-temporal points) with appropriate semantic data provided by relevant spatial data sources (road network, POI, etc.). Concerning vehicle movement over a road network, the process of relating raw trajectory data (a stream of GPS points) to the underlying road network is called map matching, and it represents a special case of spatial join as a method for combining data between GIS layers. Mobile positioning technologies, particularly GPS, enable collection of vehicle-tracking data by sampling positions at regular intervals. To improve the accuracy of such data, map matching matches the trajectories, i.e. vehicle positions given as series of (x, y, t) triplets, to the road network, giving semantics to the trajectories as well. In this paper, we consider spatial join processing over a large trajectory data set and a road network data set for the purpose of map matching. The map-matching process aims to produce a new trajectory data set consisting of points referenced to segments of the road network.

We consider two strategies in which the spatial relation between objects from the first data set (the moving point data set) and objects from the second data set (the road network data set) is analysed. The first strategy is spatial splitting, i.e. splitting the data (work) among processes according to spatial allocation: each process is responsible for the data that lie in the corresponding geographical region, where the area of interest is divided by a regular grid consisting of rectangular regions of equal dimensions along the x- and y-axes. The second strategy is uniform splitting, where the distribution depends on the size of the data set: the data set is evenly distributed among processes, so each process is responsible for a similar chunk of data. We have implemented both strategies. Data distribution among processes can be performed in different ways: either through point-to-point communication between processing nodes (MPI_Send/MPI_Recv), through collective communication using message broadcast (MPI_Bcast), or using MPI file I/O, as sketched below.
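To make the two strategies concrete, the following is a minimal sketch (the function names, bounding-box parameters and GRID constant are ours, not from the paper):

```c
#define GRID 4  /* a 4 x 4 grid of regions, as in the evaluation below */

/* Spatial splitting: map a point to the process owning its grid cell.
   (minx, miny, maxx, maxy) is the bounding box of the area of interest. */
int owner_spatial(double x, double y, double minx, double miny,
                  double maxx, double maxy) {
    int col = (int)((x - minx) / (maxx - minx) * GRID);
    int row = (int)((y - miny) / (maxy - miny) * GRID);
    if (col == GRID) col--;            /* points on the max border */
    if (row == GRID) row--;
    return row * GRID + col;           /* rank in 0..GRID*GRID-1 */
}

/* Uniform splitting: the i-th of the points goes to process i % size. */
int owner_uniform(long i, int size) {
    return (int)(i % size);
}
```

With uniform splitting, every process also needs the entire segment set, which can be delivered with a single collective call, e.g. MPI_Bcast(segments, nbytes, MPI_BYTE, 0, MPI_COMM_WORLD).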
3.3 Experimental evaluation
In order to accelerate the sequential execution of a map-matching application we implemented two solutions. In the first solution, network segment distribution is performed by geographically splitting the segment data set into regions (spatial data splitting). Each process is responsible only for segments that lie in a particular geographical region. Point distribution is performed in a similar way: all points that spatially belong to the same region are assigned to the same process. The second solution uses a different point distribution: the point data set is evenly distributed among the participating processes (uniform data splitting). Allocation of processes to processors is performed in the same way for both solutions. For evaluating the proposed solutions we used the Network-based Generator of Moving Objects (Brinkhoff, 2002) to generate a massive trajectory data set that simulates objects moving over the road network of the city of Oldenburg, Germany. Two data sets are used and stored in corresponding files. The first data set represents the set of segments (each polyline consists of straight line segments), where each record contains a segment identifier and the (x, y) coordinates of the segment endpoints; it contains 7035 records. The second data set contains the set of points, where each record contains a point identifier, a sequence number, the timestamp, the (x, y) coordinates of the point and the current speed; it consists of 329,273 records. For estimating the performance of the proposed solutions we used commodity computers (nodes) with an Intel Celeron processor running at 2.6 GHz, 256 MB RAM and a 100 Mb/s Ethernet connection. For distributed application development Visual Studio 2008 was used with the MPICH2 implementation of MPI. We performed experiments for the sequential and the two proposed distributed solutions. For spatial data splitting we split the given data space into a 4 × 4 grid, i.e. 16 separate regions (Figure 2).
Figure 2  Spatial data splitting (see online version for colours)
We compared the performance of the application on 2, 4, 8 and 16 processors. In order to evaluate the proposed distributed solutions we used the speed-up and the efficiency as measures. Speed-up ($S_P$) is used to compare the execution time of the map-matching computation on p processors ($T_P$) to the execution time on one processor ($T_1$), i.e.:

$S_P = \frac{T_1}{T_P}$   (1)

Efficiency is used to estimate how well-utilised the processors are in solving the given problem and is defined as:

$E_P = \frac{S_P}{p}$   (2)
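As a worked instance of equations (1) and (2), take the p = 16 measurement for uniform data splitting from Table 2 below:

$S_{16} = \frac{T_1}{T_{16}} = \frac{28.0465}{1.858132} \approx 15.094, \qquad E_{16} = \frac{S_{16}}{16} \approx 0.943$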
The results obtained for the proposed solutions are presented in Tables 1 and 2. Table 1 shows experimental results for the spatial data splitting with respect to the sequential solution for a variable number of processors (p = 2, 4, 8, 16). Times T1 and TP are measured in seconds. Table 2 shows experimental results for the uniform data splitting with respect to the sequential solution for a variable number of processors (p = 2, 4, 8, 16).

Table 1  Experimental results for the spatial data splitting

p     2          4          8          16
T1    28.0465    28.0465    28.0465    28.0465
TP    17.23657   14.96973   10.15531   9.558329
SP    1.627151   1.873548   2.761758   2.934247
EP    0.813575   0.468387   0.34522    0.18339

Table 2  Experimental results for the uniform data splitting

p     2          4          8          16
T1    28.0465    28.0465    28.0465    28.0465
TP    14.70279   7.680966   3.685875   1.858132
SP    1.907563   3.651429   7.609185   15.09392
EP    0.953782   0.912857   0.951148   0.94337
Both splitting strategies and map-matching solutions give a speed-up greater than 1 with respect to the sequential solution, which qualifies them for accelerating the sequential solution, and the speed-up increases with an increasing number of processors. When uniform data splitting is used, all processes receive an even data distribution, while with spatial splitting this is not the case: the time consumed for data processing in a given process depends on the volume of data allocated to it. This fact explains the obtained results. Although the solution based on uniform data splitting gives better speed-up and efficiency than the solution based on spatial data splitting, spatial splitting preserves the locality of geospatial processing and thus saves network bandwidth and local storage. A shortcoming of uniform data splitting is that each process has to receive the whole road network (all segments) for its processing, because it is not known a priori where a point will be located in space; this requires larger communication and storage resources at each computing node. In contrast, spatial data splitting needs only the corresponding part of the road network, and therefore the time for data distribution is much shorter than with uniform data splitting. Also, with the spatial data splitting strategy, each process is assigned geospatial data in a particular area which are highly related: sequences of (x, y, t) points represent sub-trajectories of the objects in that area moving over connected road segments. As such, the aggregation of the partial results calculated by all processing nodes can be performed easily and efficiently, for example by gathering them at a root process as sketched below. Furthermore, by employing spatial data splitting, different spatio-temporal analysis and processing tasks can be assigned to different processing nodes. As this experimental evaluation shows, efficient spatial join processing on a massive trajectory data set (in the context of map matching) over an HPC framework represents a big challenge, yet yields satisfactory improvements and acceleration.
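A possible shape for that aggregation step is sketched below (hedged: the paper does not prescribe a specific collective, and the function and buffer names are ours):

```c
#include <mpi.h>
#include <stdlib.h>

/* Gather variable-sized arrays of matched-point records at rank 0.
   'local' holds this process's matched points encoded as bytes. */
void aggregate(char *local, int local_bytes, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int *counts = NULL, *displs = NULL;
    char *all = NULL;
    if (rank == 0) counts = malloc(size * sizeof(int));

    /* First collect how many bytes each node contributes... */
    MPI_Gather(&local_bytes, 1, MPI_INT, counts, 1, MPI_INT, 0, comm);

    if (rank == 0) {
        displs = malloc(size * sizeof(int));
        int total = 0;
        for (int i = 0; i < size; i++) { displs[i] = total; total += counts[i]; }
        all = malloc(total);
    }

    /* ...then gather the partial results themselves. */
    MPI_Gatherv(local, local_bytes, MPI_BYTE,
                all, counts, displs, MPI_BYTE, 0, comm);

    /* rank 0 now owns the aggregated map-matching result */
    if (rank == 0) { free(counts); free(displs); free(all); }
}
```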
4 Parallel processing in GIS on graphics processing unit
General-Purpose computing on GPU (GPGPU) represents a new paradigm built on tightly coupled, massively parallel computing units, and it achieves significant performance gains. The tremendous computing capability of GPGPU has been deployed in many fields. Its broad range of applications includes signal and image processing (Rumpf and Strzodka, 2001; Fung et al., 2002), global illumination (Foley and Sugerman, 2005), geometric computing (Kruger and Westermann, 2005), databases and data mining (Govindaraju et al., 2004; Govindaraju and Manocha, 2005; Bakkum and Skadron, 2010), weather prediction (Michalakes and Vachharajani, 2008), bioinformatics (Manavski and Valle, 2008), etc. In this paper, the usage of GPGPU in GIS analysis is illustrated.
4.1 CUDA programming model

GPGPU is a technique for performing general-purpose computations on the GPU using an appropriate framework, an API and a programming model, such as OpenCL, Microsoft’s DirectCompute, or NVIDIA’s CUDA (Kirk and Hwu, 2010). The CUDA programming model is
based on a logical representation of grids, blocks and threads. A grid is organised as a 2D array of blocks, and each block is organised into a 3D array of threads (Figure 3).

Figure 3  CUDA programming model (see online version for colours)

The exact organisation of a grid is determined by the execution configuration provided at kernel launch. The first parameter of the execution configuration specifies the number of blocks; the second specifies the dimensions of each block in terms of the number of threads. All blocks in a grid have the same dimensions, and the total size of a block is limited. The kernel function contains instructions in the C programming language that are executed by each individual thread created when the kernel is launched at runtime. Since all threads in a grid execute the same kernel function, they rely on unique coordinates to distinguish themselves from each other and to identify the appropriate part of the data to process. Threads may access data from different memories in the CUDA memory hierarchy. All threads of a grid have access to the same global memory, while a single block’s threads have access to shared memory visible only to the threads of that block. Threads within a block can cooperate among themselves by sharing data and synchronising their execution through shared memory, which is much faster than global memory. Programmers need to code the data partitioning in order to achieve maximum acceleration of algorithms. A minimal launch-configuration sketch is given below.
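For illustration (a hedged sketch, not from the paper; the kernel body is a placeholder and all names are ours), a kernel processing a W × H raster with 16 × 16 thread blocks could be launched as follows:

```cuda
#include <cuda_runtime.h>

__global__ void kernel(const float *in, float *out, int W, int H) {
    /* Each thread derives unique raster coordinates from its block/thread ids. */
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < W && y < H)
        out[y * W + x] = in[y * W + x];   /* placeholder per-cell work */
}

void launch(const float *d_in, float *d_out, int W, int H) {
    dim3 block(16, 16);                        /* threads per block        */
    dim3 grid((W + block.x - 1) / block.x,     /* ceil(W/16) blocks in x   */
              (H + block.y - 1) / block.y);    /* ceil(H/16) blocks in y   */
    kernel<<<grid, block>>>(d_in, d_out, W, H);
}
```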
4.2 Parallel slope computation using CUDA

Numerous empirical and process-based models have been developed in earth science and GIS analysis to predict environmental phenomena such as soil erosion and sediment deposition. Slope steepness is a fundamental parameter in most such models, and in order to apply them with maximum effectiveness over large areas of complex terrain, integration with GIS is required. Slope has applications in many other fields: hydrology, route planning, avalanche prediction, fire behaviour and many more. We need to know the value/percentage of a slope for many reasons, such as building a mountain cabin on a flat spot, finding a large flat area to build an airport, or locating places to clear beginner, intermediate and advanced ski runs. In a GIS environment, the most efficient method to determine slope is through the use of a DEM. A DEM represents topography by a series of regular grid points with assigned elevation values. Slope is derived by analysing the target point’s elevation value relative to its neighbours and writing the output to the centre point. Several algorithms have been developed to calculate percent slope and degree of slope. The simplest and most common is called the neighbourhood method. The neighbourhood algorithm estimates percent slope in cell e by comparing the elevations of the neighbouring grid cells (3 × 3), as shown in Figure 4. The rate of change in the x direction for cell e is calculated as:

$\frac{dz}{dx} = \frac{(c + 2f + i) - (a + 2d + g)}{8 \times x\_cell\_size}$   (3)

The rate of change in the y direction for cell e is calculated as:

$\frac{dz}{dy} = \frac{(g + 2h + i) - (a + 2b + c)}{8 \times y\_cell\_size}$   (4)

The slope ratio in percent is given by:

$Slope(\%) = \sqrt{\left[\frac{dz}{dx}\right]^2 + \left[\frac{dz}{dy}\right]^2} \times 100$   (5)

Figure 4  Slope calculation (see online version for colours)
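The following sketch transcribes equations (3)–(5) directly (the cell labels a–i follow Figure 4, with e the centre cell; the function name and signature are ours):

```c
#include <math.h>

/* Percent slope at the centre of a 3x3 window
       a b c
       d e f
       g h i
   following equations (3)-(5). */
double percent_slope(double a, double b, double c,
                     double d, double e, double f,
                     double g, double h, double i,
                     double x_cell_size, double y_cell_size) {
    double dzdx = ((c + 2*f + i) - (a + 2*d + g)) / (8 * x_cell_size);
    double dzdy = ((g + 2*h + i) - (a + 2*b + c)) / (8 * y_cell_size);
    (void)e;  /* the centre elevation itself does not enter the formula */
    return sqrt(dzdx * dzdx + dzdy * dzdy) * 100.0;
}
```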
In this paper, we consider acceleration of the sequential execution of the percent slope computation applied to a DEM. The nature of this algorithm makes it suitable for a CUDA parallel implementation using shared memory: it exhibits inherent data parallelism and the capability for data reuse in shared memory. To provide high-performance computation of slope, we implemented a parallel solution which can significantly speed up the computation, depending on how many times the data in shared memory can be reused. To make use of shared memory, each thread block must perform the following steps:
•  load the corresponding part of the data from global memory to shared memory
•  synchronise the threads
•  operate on the data in shared memory in parallel
•  write the data back from shared to global memory.
To realise the first step, we create a shared memory array bigger than the actual block size. For example, if a thread block is
BD × BD, we need to add one row/column on each side of the shared memory array, i.e. the shared memory size per block in this case will be (BD + 2) × (BD + 2). The internal BD × BD portion of the shared memory is then filled with the corresponding data values from global memory, while the boundaries are padded with the adjacent values. The thread block for the case BD = 16 is shown in Figure 5.

Figure 5  Visual representation of the shared memory with expansion by one row/column (see online version for colours)
Every thread in a thread block copies one data value from its global index into the translated shared-memory index, with the exception of the boundary threads. The boundary threads of a BD × BD block also pad the additional halo data along with their own data value: threads in the four corners of the thread block copy three data values from global to shared memory, the remaining boundary threads copy two data values, and all other threads copy only one data value. Once the data are loaded into shared memory, we can operate on them in parallel in order to achieve maximum performance. Pseudo-code for the kernel that implements loading of shared memory, the slope computation performed by each thread in parallel, and writing the slope values back to global memory is given in Figure 6.

Figure 6  Kernel pseudocode
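Since the pseudocode figure itself is not reproduced here, the following CUDA sketch is our reconstruction of a kernel matching the description above; it is not the authors’ code, and the names, the clamped border handling and the BD constant are assumptions. Note that in this sketch the corner threads also fetch the diagonal halo cell (four loads), a slight variation on the three loads described above.

```cuda
#include <cuda_runtime.h>
#include <math.h>

#define BD 16  /* thread block dimension, as in the evaluation below */

__global__ void slope_kernel(const float *dem, float *slope,
                             int W, int H, float csx, float csy) {
    __shared__ float tile[BD + 2][BD + 2];

    int x = blockIdx.x * BD + threadIdx.x;            /* global column     */
    int y = blockIdx.y * BD + threadIdx.y;            /* global row        */
    int tx = threadIdx.x + 1, ty = threadIdx.y + 1;   /* shifted tile index */

    /* clamped read: returns the nearest in-raster elevation */
    #define DEM(i, j) dem[min(max((j), 0), H - 1) * W + min(max((i), 0), W - 1)]

    /* Step 1: load the interior cell plus, for boundary threads, the halo. */
    tile[ty][tx] = DEM(x, y);
    if (threadIdx.x == 0)      tile[ty][0]      = DEM(x - 1, y);
    if (threadIdx.x == BD - 1) tile[ty][BD + 1] = DEM(x + 1, y);
    if (threadIdx.y == 0)      tile[0][tx]      = DEM(x, y - 1);
    if (threadIdx.y == BD - 1) tile[BD + 1][tx] = DEM(x, y + 1);
    /* corner threads pad the diagonal halo cells */
    if (threadIdx.x == 0 && threadIdx.y == 0)           tile[0][0]           = DEM(x - 1, y - 1);
    if (threadIdx.x == BD - 1 && threadIdx.y == 0)      tile[0][BD + 1]      = DEM(x + 1, y - 1);
    if (threadIdx.x == 0 && threadIdx.y == BD - 1)      tile[BD + 1][0]      = DEM(x - 1, y + 1);
    if (threadIdx.x == BD - 1 && threadIdx.y == BD - 1) tile[BD + 1][BD + 1] = DEM(x + 1, y + 1);
    #undef DEM

    /* Step 2: make the whole tile visible to every thread in the block. */
    __syncthreads();

    if (x >= W || y >= H) return;

    /* Step 3: equations (3)-(5) on the 3x3 neighbourhood a..i around e. */
    float dzdx = ((tile[ty - 1][tx + 1] + 2.0f * tile[ty][tx + 1] + tile[ty + 1][tx + 1])
                - (tile[ty - 1][tx - 1] + 2.0f * tile[ty][tx - 1] + tile[ty + 1][tx - 1]))
                / (8.0f * csx);
    float dzdy = ((tile[ty + 1][tx - 1] + 2.0f * tile[ty + 1][tx] + tile[ty + 1][tx + 1])
                - (tile[ty - 1][tx - 1] + 2.0f * tile[ty - 1][tx] + tile[ty - 1][tx + 1]))
                / (8.0f * csy);

    /* Step 4: write the result back to global memory. */
    slope[y * W + x] = sqrtf(dzdx * dzdx + dzdy * dzdy) * 100.0f;
}
```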
4.3 Experimental evaluation

In order to evaluate the proposed parallel solution of the slope algorithm we use the speed-up as a measure. Speed-up (S) is used to compare the execution time of the slope computation on the GPU ($T_{GPU}$) to the execution time on the CPU ($T_{CPU}$), i.e.:

$S = \frac{T_{CPU}}{T_{GPU}}$   (6)
Table 3 shows experimental results for the percent slope computation on the GPU with respect to the sequential solution for variable sizes of input data. Times $T_{CPU}$ and $T_{GPU}$ are measured in milliseconds. We use test cases including DEMs with 1201 × 1201, 2402 × 2402 and 4804 × 4804 dimensions. The CUDA-based solution is executed on an NVIDIA GeForce GT 525M GPU, using CUDA version 4.0. The CPU host system is equipped with an Intel(R) Core i7 running at 2.2 GHz with 4 GB RAM. The size of the thread block is chosen to be 16 × 16, and the size of the grid is chosen depending on the input data size and the size of the thread block.
Table 3  Experimental results for the slope computation

Input size   1201 × 1201   2402 × 2402   4804 × 4804
TCPU         109.2         483.4         1888.1
TGPU         8.9           35.7          142.8
S            12.26966      13.54062      13.22199
As can be seen from this experimental evaluation, the GPU solution gives a speed-up greater than 13 on average across the variable input sizes with respect to the sequential solution, which qualifies it for accelerating slope computation. This is achieved by exploiting the data-parallel nature of the slope algorithm, which fits the set of problems that general-purpose GPUs are designed to solve. Moreover, the capability for data reuse in shared memory significantly contributes to the speed-up. In GIS algorithms where the result value requires more neighbouring values for its computation (e.g. a 5 × 5 window), even greater performance improvement is expected due to the increased opportunity for data reuse.
5 Conclusion and future work
With advances in remote sensing, sensor networks and mobile positioning, the generation of massive spatio-temporal data has exploded in recent years, leading to rising interest in the processing and analysis of such Big Data, which is often spatio-temporal in nature. This paper shows that performing distributed processing over NoWs using MPI, as well as employing the multiprocessors on a graphics card using CUDA, can significantly improve the performance of various computation- and data-intensive GIS algorithms. It definitely shows that using parallelisation and HPC in GIS represents a promising research and development field. Future research will consider the usage of other HPC techniques and platforms in GIS, such as cloud computing (MapReduce/Hadoop), and the adaptation of various GIS algorithms to cloud infrastructure.
Acknowledgements

Research presented in this paper is funded by the Ministry of Education, Science and Technological Development of the Republic of Serbia within the project ‘Environmental Protection and Climate Change Monitoring and Adaptation’ – III-43007.
References

Akhter, S., Aida, K. and Chemin, Y. (2010) ‘GRASS GIS on high performance computing with MPI, OpenMP and Ninf-G programming framework’, Proceedings of ISPRS 2010, Japan.

Armstrong, M.P. (1995) ‘Is there a role for high performance computing in GIS?’, Journal of the Urban and Regional Information Systems Association, Vol. 7, No. 1, pp.7–10.

Bakkum, P. and Skadron, K. (2010) ‘Accelerating SQL database operations on a GPU with CUDA’, Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, pp.94–103.

Beutel, A., Mølhave, T. and Agarwal, P.K. (2010) ‘Natural neighbor interpolation based grid DEM construction using a GPU’, Proceedings of the 18th ACM SIGSPATIAL International Symposium on Advances in Geographic Information Systems, pp.172–181.

Bouillet, E. and Ranganathan, A. (2010) ‘Scalable, real-time map-matching using IBM’s System S’, Mobile Data Management, pp.249–257.

Brinkhoff, T. (2002) ‘A framework for generating network-based moving objects’, GeoInformatica, Vol. 6, No. 2, pp.153–180.

Clematis, A., Mineter, M.J. and Marciano, R. (2003) ‘High performance computing with geographical data’, Parallel Computing, pp.1275–1279.

Fang, W., Lu, M., Xiao, X., He, B. and Luo, Q. (2009) ‘Frequent itemset mining on graphics processors’, Proceedings of the 5th International Workshop on Data Management on New Hardware, pp.34–42.

Foley, T. and Sugerman, J. (2005) ‘KD-tree acceleration structures for a GPU raytracer’, Graphics Hardware, pp.15–22.

Fung, J., Tang, F. and Mann, S. (2002) ‘Mediated reality using computer graphics hardware for computer vision’, 6th International Symposium on Wearable Computing, pp.83–89.

Govindaraju, N.K., Lloyd, B., Wang, W., Lin, M. and Manocha, D. (2004) ‘Fast computation of database operations using graphics processors’, Proceedings of the ACM SIGMOD International Conference on Management of Data, pp.215–226.

Govindaraju, N.K. and Manocha, D. (2005) ‘Efficient relational database management using graphics processors’, Proceedings of the ACM SIGMOD Workshop on Data Management on New Hardware, pp.29–34.

He, B., Yang, K., Fang, R., Lu, M., Govindaraju, N., Luo, Q. and Sander, P. (2008) ‘Relational joins on graphics processors’, Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pp.511–524.

Kirk, D. and Hwu, W.M. (2010) Programming Massively Parallel Processors: A Hands-on Approach, Elsevier.

Kruger, J. and Westermann, R. (2005) ‘GPU simulation and rendering of volumetric effects for computer games and virtual environments’, Computer Graphics Forum, Vol. 24, No. 3, pp.685–693.

Ma, Q., Yang, B., Qian, W. and Zhou, A. (2009) ‘Query processing of massive trajectory data based on MapReduce’, Proceedings of CloudDB, pp.9–16.

Manavski, S. and Valle, G. (2008) ‘CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment’, BMC Bioinformatics, Vol. 9, Suppl. 2, S10.

Michalakes, J. and Vachharajani, M. (2008) ‘GPU acceleration of numerical weather prediction’, Parallel Processing Letters, Vol. 18, No. 4, pp.531–548.

Oh, B. (2012) ‘A parallel access method for spatial data using GPU’, International Journal on Computer Science and Engineering, Vol. 4, No. 3, pp.492–500.

Park, J.W., Lee, Y.W., Yun, C.H., Park, H.K., Chang, S.I., Lee, I.P. and Jung, H.S. (2010) ‘Cloud computing for online visualization of GIS applications in ubiquitous city’, CLOUD COMPUTING 2010: The 1st International Conference on Cloud Computing, GRIDs, and Virtualization, Lisbon, Portugal.

Patel, S. and Hwu, W.M. (2008) ‘Accelerator architectures’, IEEE Micro, Vol. 28, No. 4, pp.4–12.

Rumpf, M. and Strzodka, R. (2001) ‘Level set segmentation in graphics hardware’, Proceedings of the IEEE International Conference on Image Processing (ICIP ’01), Vol. 3, pp.1103–1106.

Shekhar, S. (2010) ‘High performance computing with spatial datasets’, Invited Talk, HPDGIS 2010: International Workshop on High Performance and Distributed Geographic Information Systems, San Jose, California.

Shi, X. (2010) ‘High performance computing: fundamental research challenges in service oriented GIS’, ACM SIGSPATIAL HPDGIS 2010: International Workshop on High Performance and Distributed Geographic Information Systems, San Jose, California, pp.31–34.

Snir, M., Otto, S., Huss-Lederman, S., Walker, D. and Dongarra, J. (1998) MPI: The Complete Reference, The MIT Press, Cambridge, MA.

van der Merwe, D. and Meyer, J. (2009) ‘Automatic digital surface modeling using a graphics processing unit’, Proceedings of the Southern Africa Telecommunication Networks and Applications Conference (SATNAC) 2009.

Wang, Y., Wang, S. and Zhou, D. (2009) ‘Retrieving and indexing spatial data in the cloud computing environment’, CloudCom 2009: 1st International Conference on Cloud Computing, Beijing, China, pp.322–331.

Zhang, J. (2010) ‘Towards personal high-performance geospatial computing (HPC-G): perspectives and a case study’, ACM SIGSPATIAL HPDGIS 2010: International Workshop on High Performance and Distributed Geographic Information Systems, San Jose, California, pp.3–10.