International Journal of Geo-Information
Article

A Hybrid Process/Thread Parallel Algorithm for Generating DEM from LiDAR Points

Yibin Ren 1,2, Zhenjie Chen 3, Ge Chen 1,2, Yong Han 1,2,* and Yanjie Wang 1,2

1 Qingdao Collaborative Innovation Center of Marine Science and Technology, College of Information Science and Engineering, Ocean University of China, No. 238, Songling Road, Qingdao 266100, China; [email protected] (Y.R.); [email protected] (G.C.); [email protected] (Y.W.)
2 Laboratory for Regional Oceanography and Numerical Modeling, Qingdao National Laboratory for Marine Science and Technology, No. 1, Wenhai Road, Qingdao 266237, China
3 Jiangsu Provincial Key Laboratory of Geographic Information Science and Technology, Nanjing University, No. 163, Xianlin Avenue, Nanjing 210023, China; [email protected]
* Correspondence: [email protected]; Tel.: +86-0532-6678-6365

Received: 10 August 2017; Accepted: 24 September 2017; Published: 28 September 2017
Abstract: Airborne Light Detection and Ranging (LiDAR) is widely used in digital elevation model (DEM) generation. However, the very large volume of LiDAR datasets brings a great challenge for the traditional serial algorithm. Using parallel computing to accelerate the efficiency of DEM generation from LiDAR points has been a hot topic in parallel geo-computing. Generally, most of the existing parallel algorithms running on high-performance clusters (HPC) were in process-paralleling mode, with a static scheduling strategy. The static strategy would not respond dynamically according to the computation progress, leading to load unbalancing. Additionally, because each process has independent memory space, the cost of dealing with boundary problems increases obviously with the increase in the number of processes. Actually, these two problems can have a significant influence on the efficiency of DEM generation for larger datasets, especially for those of irregular shapes. Thus, to solve these problems, we combined the advantages of process-paralleling with the advantages of thread-paralleling, forming a new idea: using process-paralleling to achieve a flexible schedule and scalable computation, using thread-paralleling inside the process to reduce boundary problems. Therefore, we proposed a hybrid process/thread parallel algorithm for generating DEM from LiDAR points. Firstly, at the process level, we designed a parallel method (PPDB) to accelerate the partitioning of LiDAR points. We also proposed a new dynamic scheduling strategy to achieve better load balancing. Secondly, at the thread level, we designed an asynchronous parallel strategy to hide the cost of LiDAR points' reading. Lastly, we tested our algorithm with three LiDAR datasets. Experiments showed that our parallel algorithm had no influence on the accuracy of the resultant DEM. At the same time, our algorithm reduced the conversion time from 112,486 s to 2342 s when we used the largest dataset (150 GB). The PPDB was parallelizable and the new dynamic scheduling strategy achieved a better load balancing. Furthermore, the asynchronous parallel strategy reduced the impact of LiDAR points reading. When compared with the traditional process-paralleling algorithm, the hybrid process/thread parallel algorithm improved the conversion efficiency by 30%.

Keywords: parallel geo-computing; hybrid process/thread paralleling; DEM generation; LiDAR points
ISPRS Int. J. Geo-Inf. 2017, 6, 300; doi:10.3390/ijgi6100300
1. Introduction

1.1. Background

Grid digital elevation models (DEMs) are very important in geographic information science (GIS) and are widely used in hydrological analysis, visibility analysis and terrain analysis [1,2]. With the progress of data acquisition technology and the development of spatial analysis methods, the demand for DEMs at different spatial scales has been increasing in recent years. Airborne Light Detection and Ranging (LiDAR) has the advantages of wide scanning range, fast data acquisition speed and high point density, making it very suitable for DEM generation, especially for large areas [3–5]. However, the volume of a LiDAR dataset is very large. For example, a full scanning project of airborne LiDAR can generate a raw dataset with hundreds of millions of points, whose volume is dozens of gigabytes (GB) [6,7]. Thus, it is a great challenge for a traditional computer to make an effective conversion from LiDAR points to DEM grids.

Parallel computing based on multi-core computers provides a new way to solve this challenge. Compared with traditional serial algorithms, parallel algorithms can accelerate the processing procedure by making full use of multiple cores. Thus, using parallel computing to improve the processing efficiency of spatial data has been a hot topic in the field of geo-computing. A series of parallel geo-computing algorithms has been proposed, such as spatial data conversion algorithms [8], remote sensing image processing algorithms [9], and raster computing algorithms [10–12]. These provide important references for LiDAR dataset processing; we can also use parallel computing to accelerate the conversion from LiDAR points to DEM. Therefore, designing a high-performance parallel algorithm for generating DEM from LiDAR points is of great significance.

1.2. Related Work

In recent years, a series of parallel algorithms for generating DEM from LiDAR points has been designed by researchers. According to the differences between parallel strategies, these algorithms can be grouped into three categories: (1) thread-paralleling algorithms; (2) process-paralleling algorithms; and (3) heterogeneous-paralleling algorithms. A representative of the first category is study [13], where the authors partitioned the LiDAR dataset into many square blocks and scheduled them statically with a pipeline model by Open Multi-Processing (OpenMP) on a multi-core computer. Each thread completed four steps, including inputting, indexing, interpolating and outputting, to generate DEM independently; all threads were synchronized in time. In the second category, typically, a cluster with several computation nodes is selected as the hardware environment, and the Message Passing Interface (MPI) is selected to implement the process-paralleling programming. Huang et al. [14] implemented the basic master-slave model by MPI on a high-performance cluster (HPC), where the master distributed raw data blocks to slaves, gathered computation results and merged them into the final DEM. Han et al. [15] implemented a process-paralleling algorithm on a PC-cluster, using MPI to pass the values of points located at the boundary of each data block to solve the data barrier existing between different processes. The last category is the heterogeneous-paralleling algorithms combining Central Processing Unit (CPU) programming with Graphics Processing Unit (GPU) programming.
These algorithms can leverage the floating-point computing power of the GPU to accelerate the computation of spatial interpolation [16,17]. As is well known in parallel programming, the process is the unit of computation resource allocation; a parallel algorithm designed with multiple processes can run on an HPC consisting of a series of computation nodes. Its advantages are the flexibility of scheduling and the scalability of computation [18]; however, the memory of each process is independent, which makes it difficult for processes to share data with each other. In contrast, threads have no independent memory space, which means they all share the memory of the same process [19]. Compared with processes, threads require fewer operating system resources, which can maximize the power of a hyper-threading computer [20]. However, an algorithm implemented by threads cannot run across multiple computation nodes,
which makes them more suitable for running on a single computer with multiple cores. Generally, the above two programming modes are used in parallel programming on multiple CPUs. In recent years, GPU parallel programming has developed rapidly, as its powerful floating-point computing capability is suitable for handling computationally-intensive problems [21]. However, the memory capacity limitations of the GPU and the gap between computation and communication can negatively influence the performance of the algorithm [22]. Additionally, compared with CPU clusters, GPU clusters have more expensive hardware, and the associativity between software and hardware requires improvement [23]. Therefore, considering these factors, most geo-computing parallel algorithms use the HPC with multiple CPUs as the parallel environment to accommodate the variety of characteristics of GIS computing tasks [11,24,25]. Thus, the aim of this research was to design a high-performance parallel algorithm for generating DEM from LiDAR points on the HPC with multiple CPUs.

1.3. Problems

The general parallel procedure of generating DEM from LiDAR points is shown in Figure 1. To achieve a high-performance parallel algorithm, we must solve two key problems. The first one is how to achieve load balancing between computation units (Figure 1b). Load balancing is one of the most important parts of parallel programming and can affect the performance of the whole algorithm [26]. Most existing algorithms use the static scheduling strategy, which partitions the LiDAR dataset and assigns the blocks to computation units before the algorithm runs [13,27,28]. This is useful when the shape of the LiDAR dataset is regular. However, if the shape is irregular, the static strategy will not respond dynamically according to the computation progress, leading to load unbalancing. To achieve more flexible scheduling, a dynamic strategy implemented in master-slave mode was proposed by [14]. However, this basic dynamic strategy had a limitation: if the data block with the largest computation quantity was assigned last, the running time of the computation unit that processed this data block would be much longer than that of the others, which would affect the efficiency of the whole algorithm. Additionally, since the number of LiDAR points is very large, we should parallelize the data partitioning procedure as much as possible to minimize the partitioning time.
Figure 1. General parallel procedure of generating a digital elevation model (DEM) from Light Detection and Ranging (LiDAR) points: (a) Partition the LiDAR dataset; (b) Load balancing problem; and (c) Boundary problem.
In addition to load balancing, the other key point was how to minimize the boundary problem, especially for algorithms running on the HPC. Existing algorithms running on the HPC to generate DEM from LiDAR points have been designed in multiple-process mode. As we all know, the core computation of this procedure is spatial interpolation [29]: when a grid cell near the boundary
of the data block is interpolated, points that fall within the search radius of this cell but were read by other processes cannot be accessed directly, forming the boundary problem (Figure 1c). Usually, there are two methods to solve this problem: (1) sending the values of the points to the processes that need them [15]; and (2) adding a data buffer to each data block [13]. For the first method, if there are too many processes or data blocks, the volume of data passed between processes will be large, which will obviously affect the efficiency of the algorithm. Compared to the first method, the cost of adding a data buffer is relatively small, which has been adopted by most algorithms. However, data buffers between adjacent data blocks overlap each other, generating redundant data reading. Especially for a larger dataset on the HPC, hundreds of data buffers form a large amount of redundant data, increasing the unnecessary data reading time. Furthermore, the parallel processing of LiDAR data is data-intensive and computation-intensive [30], which means the cost of data reading cannot be ignored. When different processes read points falling in the same area simultaneously, data reading competition is generated [31]. Generally, there are hundreds of processes on an HPC reading LiDAR points in parallel, generating strong competition, which further increases the data reading cost. In summary, aiming to achieve a more efficient conversion from LiDAR points to DEM, we need to design a more flexible scheduling strategy and do our best to minimize the boundary problem. Thus, it is of great significance to propose a new, higher-performance parallel algorithm for generating DEM from LiDAR points.

1.4. Our Idea

As stated before, the advantages of process-paralleling are the flexible capability of scheduling and the scalability of computation across multiple nodes, whereas the advantage of thread-paralleling is memory sharing. Thus, if we read a partitioned data block with its data buffer from the LiDAR file into memory by a process, divide it into a series of small data blocks and create multiple threads to perform the spatial interpolation, we will not need to add data buffers to these small data blocks, since all threads belonging to the same process can access all points directly from memory. In comparison with the process-paralleling-only strategy, this two-level hybrid paralleling strategy can obviously reduce the number of boundary problems. At the same time, compared with thread-paralleling only, process-paralleling at the first level can offer a more flexible scheduling strategy and more scalable computation. Additionally, threads need fewer operating system resources than processes; using thread-paralleling inside the computation node can maximize the power of hyper-threading. Based on the above analyses, we made full use of the advantages of both parallel modes and proposed a hybrid process/thread parallel algorithm for generating DEM from LiDAR points.
There were four novel contributions in this research: (1) for the first time, we proposed a hybrid strategy combining processes with threads for parallel DEM generation from LiDAR points; (2) we designed a parallel data partitioning method to reduce the partitioning time of the LiDAR dataset; (3) we designed a hybrid process/thread dynamic scheduling strategy, including a new dynamic scheduling strategy based on computation quantity at the process level to achieve load balancing and an asynchronous parallel strategy between threads to hide the cost of LiDAR points' reading; and (4) we tested the hybrid parallel algorithm on a HPC using datasets of different volumes (4 GB, 30 GB, and 150 GB), and conducted a series of comparative experiments and comprehensive analyses.

The rest of this paper is organized as follows. Section 2 introduces the overall architecture of the hybrid process/thread parallel algorithm. Section 3 introduces the parallel data partitioning at the process level and the logical data partitioning at the thread level. The hybrid process/thread dynamic scheduling strategy is introduced in Section 4. The performance experiments and comparative analyses are introduced in Section 5. Then, in Section 6, we discuss the applicability of our algorithm to different terrains and different LiDAR applications. The last section gives our conclusions and future research directions.
2. Overall Design of the Hybrid Process/Thread Parallel Algorithm

The hybrid process/thread parallel algorithm for generating DEM from LiDAR points was based on the classic master-slave strategy. The slave was the unit that read the partitioned data blocks of the LiDAR dataset and the unit that communicated with the master. In addition to the master, all slaves needed to create multiple threads to perform the computation of DEM generation. The dynamic scheduling strategy based on computation quantity was implemented by communications between the master and the slaves. The overall design of the hybrid process/thread parallel algorithm for generating DEM from LiDAR points is shown in Figure 2, including eight steps:

1. All processes partitioned the LiDAR dataset in parallel. We named this procedure the Parallel Partitioning Method Considering Data Buffer (PPDB) (detailed in Section 3.1).
2. The master created a task queue based on the partitioned results. Each task was a data block waiting to be processed.
3. The master assigned one task number to each slave.
4. Each slave read LiDAR points into the memory based on the task number.
5. Each slave logically partitioned the data block stored in the memory into small blocks.
6. Threads used an asynchronous parallel strategy (detailed in Section 4.2) to perform the DEM generation computation for each small data block.
7. The slave requested a new task from the master when it finished the computation of the current data block.
8. If the task queue was not empty, a new task number would be sent back to the slave, looping this procedure until the task queue was empty.

Generally, a parallel algorithm consists of two parts: data partitioning and scheduling strategy. Therefore, aiming to explain our algorithm more clearly, we grouped these eight steps into two parts: parallel data partitioning for the LiDAR dataset (steps 1, 4, 5) and the hybrid process/thread dynamic scheduling strategy (steps 2, 3, 6, 7, 8). In the following, we will detail these two parts.

Figure 2. Overall design of the hybrid process/thread parallel algorithm in DEM generation.
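To make the slave-side flow of steps 3–8 easier to follow, the outline below sketches one slave's main loop in C++; the helper functions (readStrip, partitionLogically, generateDEM, requestNewTask) are hypothetical placeholders for the components detailed in Sections 3 and 4, not code from the paper.

```cpp
// Hypothetical slave-side outline of steps 3-8 of Figure 2.
struct StripData;                                   // points of one strip held in memory
extern StripData readStrip(int taskId);             // step 4: read LiDAR points of the strip
extern void partitionLogically(StripData& strip);   // step 5: logical partition for threads
extern void generateDEM(const StripData& strip);    // step 6: multi-threaded interpolation
extern int  requestNewTask();                       // steps 7-8: ask the master; -1 means "queue empty"

void runSlave(int firstTask) {
    int task = firstTask;                           // step 3: initial task from the master
    while (task != -1) {
        StripData strip = readStrip(task);
        partitionLogically(strip);
        generateDEM(strip);
        task = requestNewTask();                    // loop until the task queue is empty
    }
}
```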
3. Parallel Data Partitioning for LiDAR Dataset

3.1. Parallel Partitioning Method Considering Data Buffer (PPDB) at the Process Level

We used a horizontal or vertical strip as the spatial unit to partition the whole LiDAR dataset at the process level. In this way, the data buffer of one strip only needed to extend in one direction (X or Y), which was important for paralleling the partitioning procedure. The direction of the strip was determined by the extent of the LiDAR dataset. As shown in Figure 3, if the horizontal span (X direction) was bigger than the vertical span (Y direction), we would use the vertical strip as the partitioning unit; otherwise the horizontal strip would be used. The main idea of the PPDB was to obtain the coordinate of each LiDAR point by traversing the dataset in parallel and calculating which strip the point belonged to. This consisted of three steps (Figure 4); we use the horizontal strip as an example to detail the procedure of the PPDB. A series of variables is defined as below: N is the number of strips, related to the volume of internal memory and the volume of the dataset; B is the size of the data buffer; M is the number of total points in the LiDAR file; Np is the number of all processes; Pr (0 ≤ r < Np) is the process number; Ps is the number of bytes for one point in the file; Pstart is the starting position of the first point in the LiDAR file; Xmax, Xmin, Ymax, Ymin are the four extents of the target DEM.

Figure 3. Strip for data partitioning: (a) Horizontal strips; and (b) Vertical strips.

1. Initial Position Calculation for Each Process

To obtain the points' coordinates, parallel processes need to read data from different positions of the LiDAR file. Thus, we needed to calculate the initial position of the data reading pointer for each process. First, we calculated the starting LiDAR point number i for each process based on Equation (1). Next, the initial data reading position Pi in the file for each process was calculated by Equation (2) (Figure 4a).

i = Pr × (M/Np) (1)

Pi = Pstart + i × Ps (2)

2. Strip Span Calculation and Spatial Mapping

Calculating the average span of each strip in the Y direction, Dy, was based on Equation (3). If strips were vertical, Equation (4) was used. This step was executed by the master and the results were broadcast to all processes. All processes read points in parallel and obtained the Y coordinate Yi of each point. Based on Yi, each process calculated the strip Ni that the point belonged to according to Equation (5) and Figure 4b. Next, we calculated whether the point belonged to the data buffer of the adjacent strips as follows: if the point satisfied Equation (6), it belonged to the buffer of strip Ni−1 and should be added to Ni−1; if it satisfied Equation (7), it belonged to the buffer of strip Ni+1.

Dy = (Ymax − Ymin)/N (3)

Dx = (Xmax − Xmin)/N (4)

Ni = (Yi − Ymin)/Dy + 1 (5)

(Ni ≠ 1) & (Yi ≤ ((Ni − 1) × Dy + B)) (6)

(Ni ≠ N) & (Yi ≥ (Ni × Dy − B)) (7)

3. Writing Results in Turn

Resultant files were created by the master and each strip corresponded to a resultant file. All processes wrote their partitioned results into the files in turn when they finished traversing. As LiDAR points are stored continuously by point number, we stored points based on run-length encoding: for a series of consecutive points, only the start point number and the end point number were recorded. In this way, we not only improved the writing speed, but also saved storage space.

In the above-mentioned procedure, data partitioning was done in parallel by all processes and each process only needed to read M/Np points. When processes read data for computing, they can obtain the points located in one strip directly by the point numbers stored in the partition file and do not need to judge the spatial relationship between points and strips. Point numbers stored in the same file were recorded from small to large; thus, data reading was sequential I/O instead of random I/O, which improved the reading speed.

Figure 4. Procedure of the Parallel Partitioning Method Considering Data Buffer (PPDB): (a) Initial position calculation; (b) Strip span and spatial mapping; and (c) Writing results.
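As a concrete reading of Equations (1)–(7), the following C++ fragment sketches how a process's initial read position and each point's strip and buffer membership could be computed for horizontal strips; variable names are illustrative, and the Y coordinate is taken relative to Ymin in the buffer tests for consistency with Equation (5).

```cpp
#include <cmath>
#include <cstdint>

// Illustrative partitioning parameters (horizontal strips).
struct PartitionParams {
    int64_t N;            // number of strips
    double  B;            // buffer size, in coordinate units
    int64_t M;            // total number of points in the LiDAR file
    int64_t Np;           // number of processes
    int64_t Ps;           // bytes per point record
    int64_t Pstart;       // byte offset of the first point record
    double  Ymin, Ymax;   // Y extent of the target DEM
};

// Equations (1) and (2): first point index and initial file position for process Pr.
int64_t firstPointIndex(const PartitionParams& p, int64_t Pr) { return Pr * (p.M / p.Np); }
int64_t initialFilePosition(const PartitionParams& p, int64_t i) { return p.Pstart + i * p.Ps; }

// Equation (3): average strip span in the Y direction.
double stripSpanY(const PartitionParams& p) { return (p.Ymax - p.Ymin) / static_cast<double>(p.N); }

// Equation (5): 1-based strip number of a point with coordinate Yi.
int64_t stripOf(const PartitionParams& p, double Yi, double Dy) {
    return static_cast<int64_t>(std::floor((Yi - p.Ymin) / Dy)) + 1;
}

// Equations (6) and (7): membership in the buffer of the adjacent strips.
bool inBufferOfPrevious(const PartitionParams& p, double Yi, int64_t Ni, double Dy) {
    double yRel = Yi - p.Ymin;                          // relative to Ymin, as in Equation (5)
    return (Ni != 1) && (yRel <= (Ni - 1) * Dy + p.B);  // Equation (6)
}
bool inBufferOfNext(const PartitionParams& p, double Yi, int64_t Ni, double Dy) {
    double yRel = Yi - p.Ymin;
    return (Ni != p.N) && (yRel >= Ni * Dy - p.B);      // Equation (7)
}
```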
3.2. Logical Partition at the Thread Level

After one slave reads all points of one strip into memory, it needs to partition the data stored in the memory for threads. As all threads belonging to the same process share memory space, they can access any point directly. Therefore, we partitioned the data block logically: the object to be partitioned was not the data itself, but the space range corresponding to the strip. We partitioned it into multiple small rectangles logically along the long edge, and numbered the rectangles as the partitioned results (Figure 5). Points stored in the memory were not partitioned physically, which saved partitioning time. Compared with the whole dataset, the shape of one data strip was more regular and the point density was also more even. Thus, we assigned all tasks to threads circularly, using multiple loops to balance the computation load between threads, which also saved communication cost across threads. Each thread computed the DEM values of all cells located in one rectangle at a time until it finished all rectangles assigned to it. In the interior of each process, the partition procedure was a serial operation; however, there were multiple slaves doing the partition at the same time, which meant that it was also a parallel procedure.

Figure 5. Logical partitioning at the thread level.
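A minimal sketch of the circular assignment described above: logical rectangles numbered 0..Nt-1 are dealt out to threads in round-robin fashion, so each thread receives several rectangles spread over multiple loops. The function name and return layout are illustrative, not taken from the paper's code.

```cpp
#include <vector>

// Deal logical rectangles to threads circularly: thread t processes rectangles
// t, t + numThreads, t + 2 * numThreads, ..., which balances the load over multiple loops.
std::vector<std::vector<int>> assignRectanglesRoundRobin(int numRectangles, int numThreads) {
    std::vector<std::vector<int>> perThread(numThreads);
    for (int r = 0; r < numRectangles; ++r)
        perThread[r % numThreads].push_back(r);
    return perThread;
}
```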
4. Hybrid Process/Thread Dynamic Scheduling Strategy

To balance the computation loads between processes, we proposed a new dynamic scheduling strategy based on computation quantity at the process level. As mentioned previously, the parallel processing of LiDAR points is computation- and data-intensive. To minimize the influence of the data inputting cost, we proposed an asynchronous parallel strategy to hide the reading time of LiDAR points.

4.1. Dynamic Scheduling Strategy Based on Computation Quantity

The master scheduled tasks to slaves dynamically based on the computation progress in each slave; tasks with larger amounts of computation were assigned early and the smaller ones were assigned later. It consisted of three steps: task queue creating, initial assigning and dynamic distributing.

1. Task Queue Creating

Generally, the frequency of the laser is fixed during scanning, so the density of the LiDAR points is relatively uniform. Thus, the more points a strip contains, the longer its computing time. We counted the number of points in each strip when we partitioned the data at the process level. Next, all strips were sorted in descending order by the number of points they contained, forming the task queue, which stored the numbers of the strips.

2. Initial Assigning

The master took out the first n tasks (n is the number of slaves) and sent them to the slaves. Slaves received the tasks and read points from the LiDAR file based on the strip number. Next, threads in each slave generated DEM from LiDAR points in parallel (Figure 6a).

3. Dynamic Assigning

A slave requested a new task from the master when the preprocessing of the current data block was completed. Next, the master retrieved the task queue. If the queue was not empty, the next task would be sent to the slave; otherwise, the NULL flag would be sent. The above procedure was repeated until the task queue was empty, as shown in Figure 6b.
Figure 6. Dynamic scheduling based on computation quantity: Ti is the task number; Pi is the process number; (a) Step 2, initial assigning; and (b) Step 3, dynamic assigning.
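The C++/MPI fragment below sketches the master side of this scheduling strategy: strips are sorted by point count in descending order, the first n tasks are sent out, and each subsequent request is answered with the next task or the NULL flag (encoded here as -1). The tags, the message layout and the assumption that slave ranks are 1..numSlaves (with at least as many strips as slaves) are illustrative, not taken from the paper's code.

```cpp
#include <mpi.h>
#include <algorithm>
#include <deque>
#include <vector>

struct Strip { int id; long long pointCount; };

// Master-side dynamic scheduling: tasks with more points are assigned first.
// Assumes rank 0 is the master, ranks 1..numSlaves are slaves, and strips.size() >= numSlaves.
void runMaster(std::vector<Strip> strips, int numSlaves) {
    // Step 1: task queue, sorted by computation quantity (point count), descending.
    std::sort(strips.begin(), strips.end(),
              [](const Strip& a, const Strip& b) { return a.pointCount > b.pointCount; });
    std::deque<int> queue;
    for (const Strip& s : strips) queue.push_back(s.id);

    const int TASK_TAG = 1, REQUEST_TAG = 2, NULL_TASK = -1;

    // Step 2: initial assignment, one task per slave.
    for (int slave = 1; slave <= numSlaves; ++slave) {
        int task = queue.front(); queue.pop_front();
        MPI_Send(&task, 1, MPI_INT, slave, TASK_TAG, MPI_COMM_WORLD);
    }

    // Step 3: dynamic distribution until every slave has received the NULL flag.
    int finishedSlaves = 0;
    while (finishedSlaves < numSlaves) {
        int dummy;
        MPI_Status status;
        MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, REQUEST_TAG, MPI_COMM_WORLD, &status);
        int reply = NULL_TASK;
        if (!queue.empty()) { reply = queue.front(); queue.pop_front(); }
        else                { ++finishedSlaves; }
        MPI_Send(&reply, 1, MPI_INT, status.MPI_SOURCE, TASK_TAG, MPI_COMM_WORLD);
    }
}
```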
4.2. Asynchronous Parallel Between Threads

Threads in the process executed the conversion computing from LiDAR points to DEM. As all threads in the same process share all points of one strip, we designed an asynchronous parallel strategy to hide the points' reading time.

The main computation of the conversion is spatial interpolation. To speed up the neighborhood points search during interpolation, we established a grid index as the main thread read points. Threads in each process needed to execute five tasks: data reading, index building, data partitioning for threads, spatial interpolation and result writing. We grouped them into two steps: data preprocessing and DEM generation. Data preprocessing contained the first three of the five tasks, where data reading was the main task; the computational complexity of this step was relatively small. Thus, this step was executed serially by the main thread. DEM generation contains spatial interpolation and result writing. Therefore, this step was executed in parallel by multiple threads. We made these two steps run asynchronously (Figure 7a): (1) the main thread created a computing thread and the computing thread was blocked; (2) the main thread preprocessed the first data strip and the computing thread was unblocked when the main thread completed its work; and (3) the computing thread created multiple threads to generate DEM from LiDAR points, while simultaneously, the main thread preprocessed the next data strip. In this way, the preprocessing time of the next data strip was hidden behind the computing procedure of the current data. As shown in Figure 7b, compared with the synchronous parallel strategy, the preprocessing time of all data strips (except the first one) was hidden behind the processing of DEM generation.
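A condensed sketch of this asynchronous pattern using a Pthreads condition variable, as in the paper's implementation; the actual preprocessing and interpolation are reduced to placeholders, and the queue-based hand-off of strips between the two threads is an assumption about how the exchange could be organized.

```cpp
#include <pthread.h>
#include <queue>

struct Strip;                       // preprocessed strip (points + grid index), opaque here
extern Strip* preprocess(int task); // data reading, index building, logical partitioning
extern void   generateDEM(Strip*);  // spawns computation threads, interpolates, writes results

static pthread_mutex_t mtx  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static std::queue<Strip*> ready;    // strips already preprocessed by the main thread
static bool done = false;           // set when no more tasks will arrive

// Computing thread: blocks until a preprocessed strip is available, then generates DEM.
void* computingThread(void*) {
    for (;;) {
        pthread_mutex_lock(&mtx);
        while (ready.empty() && !done)
            pthread_cond_wait(&cond, &mtx);        // blocked until the main thread signals
        if (ready.empty() && done) { pthread_mutex_unlock(&mtx); break; }
        Strip* s = ready.front(); ready.pop();
        pthread_mutex_unlock(&mtx);
        generateDEM(s);                            // step 2: DEM generation for strip i
    }
    return nullptr;
}

// Main thread of a slave: preprocesses strip i+1 while strip i is being interpolated.
void mainThreadLoop(const int* tasks, int numTasks) {
    pthread_t worker;
    pthread_create(&worker, nullptr, computingThread, nullptr);
    for (int i = 0; i < numTasks; ++i) {
        Strip* s = preprocess(tasks[i]);           // step 1: preprocessing for data i
        pthread_mutex_lock(&mtx);
        ready.push(s);
        pthread_cond_signal(&cond);                // unblock the computing thread
        pthread_mutex_unlock(&mtx);
    }
    pthread_mutex_lock(&mtx);
    done = true;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&mtx);
    pthread_join(worker, nullptr);
}
```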
Figure 7. Asynchronous parallel between threads: (a) Procedure flow of the asynchronous parallel strategy; and (b) The running time comparison of the two strategies.

4.3. Hybrid Process/Thread Communicating Procedure

We used MPI to implement the process programming and used POSIX threads (Pthreads) to implement the thread programming. Communication was the key point of the hybrid process/thread parallel algorithm. The main thread in the slave not only communicated with the master, but also communicated with the computation threads in the same slave. Communication with the master was implemented by MPI messages; communication between the threads was implemented by the condition variables of Pthreads. We used the function named "MPI_Init_thread" to initialize the program, which allowed functions of Pthreads to be called together with MPI. The main communication steps were as follows (Figure 8): (1) the master sent the task by MPI_Send and the main thread in the slave received the task by MPI_Recv; (2) the main thread released the condition variable by pthread_cond_signal to unblock the computation thread (blocked by pthread_cond_wait) once it completed preprocessing the first data strip; (3) the main thread requested a new task from the master by MPI_Send; (4) the master fed information back to the slave by MPI_Send and the main thread received it by MPI_Recv; and (5) if the feedback was NULL, the main thread modified the end tag and released the condition variable by pthread_cond_signal.
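A short fragment showing how the MPI environment could be initialized so that Pthreads and MPI calls coexist, and how a slave's main thread requests a new task, in the spirit of steps (3)-(4) above. The paper names MPI_Init_thread but not the requested thread level; MPI_THREAD_FUNNELED, the tags and the message layout are assumptions.

```cpp
#include <mpi.h>

// Initialize MPI so the process may be multithreaded. MPI_THREAD_FUNNELED is
// assumed because only the main thread of each slave talks to the master.
void initHybrid(int* argc, char*** argv) {
    int provided = 0;
    MPI_Init_thread(argc, argv, MPI_THREAD_FUNNELED, &provided);
}

// Main thread of a slave: ask the master (rank 0) for a new task and receive the answer.
// A reply of -1 is interpreted here as the NULL flag that ends the loop.
int requestNewTask(int myRank) {
    const int REQUEST_TAG = 2, TASK_TAG = 1;
    MPI_Send(&myRank, 1, MPI_INT, 0, REQUEST_TAG, MPI_COMM_WORLD);
    int task = -1;
    MPI_Recv(&task, 1, MPI_INT, 0, TASK_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    return task;
}
```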
Figure 8. Communications in the scheduling procedure.
5. Experiments and Discussions

5.1. Environment and Experimental Datasets

The hybrid process/thread parallel algorithm for generating DEM from LiDAR points was implemented with standard C++ in Visual Studio 2015; Open MPI v2.0.1 was selected for the implementation of MPI; data reading of the LiDAR file was implemented by libLAS 1.2; and DEM writing was implemented by the Geospatial Data Abstraction Library (GDAL 2.1.0). The parallel environment was a high-performance computing cluster that consisted of five nodes; each node had two CPUs (Intel Xeon E5-2640 v2, 2.0 GHz) with a memory of 16 GB; and each CPU contained 12 logical cores. The operating system of the cluster was CentOS Linux 7.0 with the parallel file system Lustre. We used three LiDAR datasets to test our algorithm (Table 1). Data1 had 1.2 billion points; its volume was 30 GB, as shown in Figure 9a. Data2 was a part extracted from Data1, with 0.16 billion points and a volume of 4 GB (Figure 9b). Data3 was far bigger than the other two datasets, with 6.0 billion points; its volume was 150 GB (Figure 9c).

Table 1. Details of the experimental datasets.
Name  | Volume | Number of Points | Width | Length
Data1 | 30 GB  | 1.2 billion      | 15 km | 19 km
Data2 | 4 GB   | 0.16 billion     | 8 km  | 4 km
Data3 | 150 GB | 6.0 billion      | 77 km | 141 km
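Since DEM writing relies on GDAL, the hedged fragment below shows one plausible way a resultant grid could be written as a GeoTIFF with the GDAL C++ API; the file layout, resolution handling and function name are illustrative, not taken from the paper's code.

```cpp
#include <gdal_priv.h>
#include <vector>

// Write a DEM grid (row-major float values) to a GeoTIFF. Illustrative sketch only.
bool writeDem(const char* path, const std::vector<float>& cells,
              int width, int height, double originX, double originY, double cellSize) {
    GDALAllRegister();
    GDALDriver* driver = GetGDALDriverManager()->GetDriverByName("GTiff");
    if (driver == nullptr) return false;

    GDALDataset* ds = driver->Create(path, width, height, 1, GDT_Float32, nullptr);
    if (ds == nullptr) return false;

    // Geotransform: top-left corner, square cells, north-up grid.
    double gt[6] = { originX, cellSize, 0.0, originY, 0.0, -cellSize };
    ds->SetGeoTransform(gt);

    CPLErr err = ds->GetRasterBand(1)->RasterIO(
        GF_Write, 0, 0, width, height,
        const_cast<float*>(cells.data()), width, height, GDT_Float32, 0, 0);

    GDALClose(ds);
    return err == CE_None;
}
```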
Figure 9. Locations of experimental datasets (the background basic map was from OpenStreetMap): (a) Location of Data1; (b) Location of Data2; (c) Location of Data3; (d) Resultant DEM of Data3, generated by our algorithm; (e) Resultant DEM of Data1, generated by our algorithm; (f) Resultant DEM of Data1 generated by ArcGIS; and (g) Resultant data of (e) subtract (f).
A series of parameters was set before our experiments: the resolution of the target DEM was 1 m; the spatial interpolation method used in this research was the Inverse Distance Weighted (IDW) [32] method with an initial searching radius of 10 m; the minimum number of points involved in interpolation was 10 and the maximum was 30; if the number of points initially searched was less than 10, the searching radius increased in increments of 10 m until more than 10 points were found; the maximum search radius was set to 30 m.
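A simplified sketch of the IDW rule with the expanding search radius described above (initial radius 10 m, at least 10 and at most 30 neighbors, radius grown in 10 m steps up to 30 m); the neighbor search is abstracted behind a hypothetical grid-index query, and the power parameter is an assumption.

```cpp
#include <cmath>
#include <vector>

struct Point { double x, y, z; };

// Hypothetical grid-index query: returns all points within `radius` of (cx, cy).
extern std::vector<Point> searchNeighbors(double cx, double cy, double radius);

// IDW value of one DEM cell using the adaptive search radius from Section 5.1.
double idwCellValue(double cx, double cy, double power = 2.0) {
    std::vector<Point> pts;
    for (double radius = 10.0; radius <= 30.0; radius += 10.0) {
        pts = searchNeighbors(cx, cy, radius);
        if (pts.size() >= 10) break;               // enough neighbors found
    }
    if (pts.size() > 30) pts.resize(30);           // keep at most 30 points (a full
                                                   // implementation would keep the 30 nearest)
    double num = 0.0, den = 0.0;
    for (const Point& p : pts) {
        double d = std::hypot(p.x - cx, p.y - cy);
        if (d < 1e-9) return p.z;                  // cell coincides with a point
        double w = 1.0 / std::pow(d, power);
        num += w * p.z;
        den += w;
    }
    return den > 0.0 ? num / den : -9999.0;        // -9999 as an illustrative no-data value
}
```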
5.2. Accuracy Analysis

In order to analyze whether our parallel algorithm would affect the accuracy of the resultant DEM, we compared the DEM generated by our parallel algorithm with the one generated by ArcGIS. We used Data1 as the experimental dataset. For our parallel algorithm, the DEM was generated by the experiment detailed in Section 5.3.2, with 10 slaves and 12 threads in each slave. The interpolation parameters were set as in Section 5.1. We also used IDW as the interpolation method in ArcGIS, and all interpolation parameters were the same as in our algorithm. The resultant DEM of our parallel algorithm is shown in Figure 9e; Figure 9f is the resultant DEM of ArcGIS. To analyze the difference between them, we used the DEM of our algorithm to subtract the one of ArcGIS, obtaining the result shown in Figure 9g. All values in the resultant dataset are 0 m, which means there was no difference between these two datasets. Although we partitioned the LiDAR dataset into multiple blocks in our parallel algorithm, DEM cells near the boundaries of each data block were not influenced. This was because the data buffer allowed each DEM cell to search all adjacent points. Thus, the hybrid parallel algorithm proposed in this paper had no influence on the accuracy of the DEM.
5.3. Performance Analysis

We conducted experiments to analyze the performance of the parallel algorithm proposed in this paper. Before we analyzed the optimal performance, we discussed two important factors that may affect the results: the number of partitioned data blocks, and the combination manner between processes and threads. For the first one, we conducted experiments at the thread level and the process level separately to find the optimal number of data blocks. For the second one, we fixed the total number of threads on the HPC, then changed the number of computation processes (slaves) and the number of threads in each process to analyze the parallel performance.

5.3.1. Optimal Number of Data Blocks

1. Blocks' Number at the Thread Level

At the thread level, to find the optimal number, we tuned the number of logically partitioned blocks to test the performance of the algorithm. To eliminate the effect of multiple loops, we used Data2, which could be processed by a single node at one time. We used one CPU of node0 as the experimental environment, and created a process which had twelve computation threads, binding the process to the CPU by the command "mpirun -1 --bind-to-socket --bysocket"; the interpolation parameters were set as described in Section 5.1. We used Nt to represent the number of data blocks, setting Nt as 12, 24, 36, ..., 96, respectively, and gathering the running time of the algorithm. The result is shown in Figure 10.
Figure 10. Running time with different Nt.
From Figure 10, it can be seen that when Nt was 12, the loop of the assigning task only executed once, leading to load unbalancing between different threads, making the running time the longest;
with the increase of Nt, the task assigning looped multiple times, which balanced the computation load, so the running time reduced gradually; when Nt was 48, it reached the shortest value; when Nt was greater than 60, the running time increased gradually. This was because too many blocks affected the continuity of computing and data accessing, thus slightly increasing the running time. Therefore, when the number of data blocks at the thread level was 4–5 times as many as the number of computation threads, the algorithm could achieve better performance.

2. Strips' Number at the Process Level

At the process level, we explored the relationship between the number of data strips (named Np) and the performance of our algorithm. For a dataset with a volume larger than the memory capacity of the HPC, the number of strips was determined by the memory capacity. Thus, in this experiment, we focused on the dataset whose volume was close to or less than the memory capacity. We used Data1 as the test data, setting the number of processes as six, including one master and five slaves (each slave corresponded to one node); the number of threads in each slave was 24, as there were 24 logical cores in each node. We set Np as 10, 20, 40, 80; the number of data blocks at the thread level was fixed at 96. Interpolation parameters were set as described in Section 5.1.

The result is shown in Figure 11 (Pi is the number of the process, i = 0, 1, ..., 5): when Np was 20, the running time was the shortest; when Np was 10, the algorithm ran longer. This was because the number of strips was too small, leading to fewer loops during the dynamic scheduling, so the computation loads between different processes were unbalanced. When Np was larger than 20, the number of loops increased and the size of the data blocks decreased, which was useful for balancing the computational loads. However, the number of data buffers, the volume of redundant data, and the partitioning time also increased. Thus, the running time did not decrease, but increased. Finally, we reached the following conclusion: for a dataset with a volume close to or less than the memory capacity, the dynamic strategy proposed in this paper could achieve better performance when the number of strips was about four times as many as the number of slaves.
Figure 11. Running time with different Np.
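A minimal MPI sketch of the on-demand strip scheduling examined above (the tags, constants, and placeholder functions are assumptions, not the authors' implementation): the master hands out one strip index per request until all Np strips are consumed, so faster slaves automatically receive more strips.

```cpp
#include <mpi.h>

enum { TAG_REQUEST = 1, TAG_ASSIGN = 2 };
const int NO_MORE_STRIPS = -1;

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int np_strips = 20;                       // Np, number of data strips
    if (rank == 0) {                                // master: dynamic scheduler
        int next_strip = 0, finished = 0;
        while (finished < size - 1) {
            int dummy;
            MPI_Status st;
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQUEST, MPI_COMM_WORLD, &st);
            int assign = (next_strip < np_strips) ? next_strip++ : NO_MORE_STRIPS;
            if (assign == NO_MORE_STRIPS) ++finished;
            MPI_Send(&assign, 1, MPI_INT, st.MPI_SOURCE, TAG_ASSIGN, MPI_COMM_WORLD);
        }
    } else {                                        // slave: request strips until done
        for (;;) {
            int dummy = 0, strip;
            MPI_Send(&dummy, 1, MPI_INT, 0, TAG_REQUEST, MPI_COMM_WORLD);
            MPI_Recv(&strip, 1, MPI_INT, 0, TAG_ASSIGN, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (strip == NO_MORE_STRIPS) break;
            // read_strip(strip); generate_dem(strip);  // placeholders for the real work
        }
    }
    MPI_Finalize();
}
```

Because assignment happens at run time, a slow strip is absorbed by whichever slave happens to be free, which is why a moderate Np (about four strips per slave) already balances the load.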
5.3.2. Optimal Combination Manner

In this experiment, we focused on finding an optimal combination manner between processes (excluding the master, and consisting of all slaves) and threads for our hybrid parallel algorithm. We fixed the total number of threads on the HPC to 120, changing the combination manner of processes and threads (we used P and T to represent the number of slaves and threads, respectively, and used P × T to represent their combination): 5 × 24, 10 × 12, 20 × 6, 30 × 4. We used Data1 as the test data. According to the experiment above, at the process level, we set Np at four times as many as the number of slaves; we also set Nt to four times as many as the number of computation threads. Interpolation parameters remained unchanged.

The result is shown in Figure 12. When the combination manner was 10 × 12, the running time was the shortest. With other combination manners, the algorithm ran longer, as there were 10 CPUs on the HPC. When P was less than 10 (5 × 24), the process switching between CPUs in one node affected the efficiency; when P was larger than 10, although there was no switching cost, the increase in processes led to an increase in the number of strips, which made data buffers and redundant data readings increase, leading to a longer running time. Thus, for our hybrid parallel strategy, we obtained better parallel results when the number of computation processes was equal to the number of CPUs.
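A rough skeleton of one P × T configuration (the values below assume the 10 × 12 case; a sketch under those assumptions, not the authors' code): every slave rank initializes MPI with funneled thread support and spawns T computation threads, so a 10 × 12 run is launched with eleven MPI ranks (one master plus ten slaves), each slave owning twelve threads.

```cpp
#include <mpi.h>
#include <thread>
#include <vector>

int main(int argc, char** argv) {
    // Funneled support: only the main thread of each process calls MPI.
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED)
        MPI_Abort(MPI_COMM_WORLD, 1);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int T = 12;                         // threads per slave (the "T" in P x T)
    if (rank != 0) {                          // slaves do the computation
        std::vector<std::thread> workers;
        for (int t = 0; t < T; ++t)
            workers.emplace_back([] {
                // each worker would claim logical blocks and interpolate them
            });
        for (auto& w : workers) w.join();
    }
    // rank 0 (the master) would run the dynamic scheduling loop sketched earlier.

    MPI_Finalize();
}
```

Binding each slave process to a single CPU would then avoid the cross-CPU process switching that affected the 5 × 24 configuration.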
Figure 12. Running time of different combinations of slaves and threads.
5.3.3. Performance Analysis with the Largest Experimental Dataset

After finding the optimal number of data blocks and the optimal combination manner between processes and threads, we tested the performance of our hybrid parallel algorithm by using it to generate DEM from the largest experimental LiDAR dataset (Data3). In this experiment, we set the number of slaves as 10; the initial number of threads in each slave was 2 and increased by 1 each time. Considering the memory capacity and the experiment of Section 5.3.1, we partitioned the data into 50 strips; the volume of each strip was approximately 3 GB. The data buffer was 40 m and the interpolation parameters were set as per Section 5.1. Besides the running time, we used the speedup [14] as the other evaluation criterion for the algorithm. It is expressed as the ratio of Ts (the running time of the serial algorithm) to Tp (the running time of the parallel algorithm), Equation (8):

speedup = Ts / Tp (8)

The resultant DEM is shown in Figure 9d. The running time in the serial state was 112,486 s (not shown in Figure 13, as the column would be too high). It decreased with an increasing number of threads, and the speedup increased gradually (Figure 13). The execution time reached its minimum value with 120 threads (2342 s) and the speedup reached its maximum value (48; 112,486 s / 2342 s ≈ 48). With more than 120 threads, the running time tended to increase gradually due to the computation limitation of the hardware (a total of 120 logical cores in our HPC). This indicated that the parallel algorithm showed good scalability with an increase in computation resources.
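As a side note on how Tp can be obtained in practice, a generic MPI pattern (not the authors' measurement code) is to bracket the whole parallel run with MPI_Wtime on synchronized processes; the serial time Ts (112,486 s for Data3) then gives the speedup directly.

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    MPI_Barrier(MPI_COMM_WORLD);          // start timing on all processes together
    double t0 = MPI_Wtime();

    // ... run the parallel DEM generation ...

    MPI_Barrier(MPI_COMM_WORLD);          // wait for the slowest process
    double tp = MPI_Wtime() - t0;

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        std::printf("Tp = %.1f s, speedup = %.1f\n", tp, 112486.0 / tp);
    MPI_Finalize();
}
```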
Figure 13. Running time and speedup of Data3.
5.4. Comparative Experiments

5.4.1. Comparison with the Static Scheduling Strategy

To analyze the effectiveness of the new dynamic load balancing strategy proposed in this paper, we statistically obtained the running time of each process in the experiment described in Section 5.3.3 when the speedup was at its peak. We also statistically obtained the number of points processed by each process. We compared them with the values obtained from the static scheduling strategy. The static scheduling strategy assigned tasks cyclically as follows: strips whose numbers were 1, 11, 21, . . . , 51 were assigned to process 1; strips 2, 12, 22, . . . , 52 were assigned to process 2; and all strips were assigned before the algorithm was run. Other parameters remained unchanged.

The results are shown in Figure 14. In the static algorithm, the longest execution time was 2656 s (No. 5 process), with the most points (0.873 billion); the shortest execution time was 1804 s (No. 1 process). The time difference was about 852 s. In the dynamic strategy, the longest time was 2342 s (No. 0 process) and the shortest time was 2010 s (No. 1 process). The time difference was 332 s, which was smaller than that of the static strategy. Furthermore, the time curve of the dynamic strategy was more stable than the static one; the same was true of the point quantity. This indicated that the dynamic scheduling strategy based on computation quantity could achieve a better load balancing than the traditional static strategy.
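For reference, the static assignment described above is a fixed cyclic mapping that can be written in a single function (assumed 1-based strip and slave numbering); a strip always goes to the same slave no matter how long earlier strips take, which is what produces the wider spread of per-process times in Figure 14.

```cpp
#include <cstdio>

// Static round-robin mapping, fixed before the run (assumed 1-based numbering).
int static_owner(int strip, int num_slaves) {
    return ((strip - 1) % num_slaves) + 1;
}

int main() {
    // With 10 slaves: strips 1, 11, 21, ... -> slave 1; strips 2, 12, 22, ... -> slave 2; etc.
    for (int strip = 1; strip <= 52; ++strip)
        std::printf("strip %2d -> slave %d\n", strip, static_owner(strip, 10));
}
```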
Figure 14. Load balancing information of two scheduling strategies.
5.4.2. Comparison with the Synchronous Parallel Strategy

We compared our parallel algorithm with the synchronous algorithm. The procedure of the synchronous algorithm is shown in Figure 7b; its preprocessing step did not overlap with the DEM-generating step. For our asynchronous parallel algorithm, we used the experiment in Section 5.3.3. For the synchronous algorithm, as it does not need to cache another data strip while the DEM is being generated, we partitioned Data3 into 40 strips, which made the volume of each strip slightly bigger. The strategy at the process level and the interpolation parameters remained unchanged.

The results are shown in Figure 15. The running time of the asynchronous algorithm was longer than that of the synchronous one when the number of threads was small, as the main thread in the asynchronous algorithm did not participate in the computation. With an increase in threads, the time difference decreased, which is where the advantage of the asynchronous strategy appeared. Finally, the minimum time of the synchronous algorithm was 2878 s; for the asynchronous algorithm, it was 2342 s, and the efficiency was improved by 18.6%. Compared with the synchronous algorithm, although our algorithm added 10 strips, which generated more data buffers and redundant data, their influence was almost negligible because the preprocessing time was hidden. This proved that our asynchronous parallel strategy was effective.
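The asynchronous behaviour compared here can be sketched as a simple two-buffer prefetch (the stub functions and timings below are assumptions, not the authors' code): while the computation threads interpolate the current strip, the main thread reads the next strip, so the reading time is hidden whenever reading is faster than interpolation.

```cpp
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

using Strip = std::vector<float>;                 // stand-in for a strip of LiDAR points

// Stub I/O: pretend to read strip i (done by the main thread in the asynchronous scheme).
Strip read_strip(int i) {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    return Strip(1000, static_cast<float>(i));
}

// Stub computation: pretend the worker threads interpolate the strip into DEM cells.
void interpolate(const Strip& s) {
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    std::printf("interpolated strip with %zu points\n", s.size());
}

int main() {
    const int n_strips = 5;
    Strip current = read_strip(0);                // prime the first buffer synchronously
    for (int i = 0; i < n_strips; ++i) {
        Strip next;
        std::thread prefetch;
        if (i + 1 < n_strips)                     // prefetch strip i+1 ...
            prefetch = std::thread([&next, i] { next = read_strip(i + 1); });

        interpolate(current);                     // ... while strip i is interpolated

        if (prefetch.joinable()) prefetch.join();
        current = std::move(next);                // swap buffers for the next iteration
    }
}
```

In the synchronous variant, the read of strip i + 1 would only start after interpolate(current) returns, which is exactly the preprocessing time that the asynchronous strategy hides.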
Figure 15. Running time of two strategies at the thread level.
5.4.3. Comparison with the Process Parallel Algorithm

We made a comparative experiment between the hybrid process/thread parallel algorithm (HPA) proposed in this paper and the traditional process parallel algorithm (PPA). The data partitioning methods and scheduling strategies at the process level of the two algorithms remained the same. However, for PPA, each process was the unit that read data and generated the DEM. The number of processes in the PPA was equal to the number of all threads in the HPA. For HPA, the initial number of threads in each process was 2, which was increased by a step of 1. For PPA, the initial number of processes was 21 (No. 0 is the master) and it increased by 10 each time. The number of strips in HPA was fixed at 50. For PPA, it varied dynamically, maintaining four times as many as the number of slaves, which ensured load balancing while minimizing the number of strips.

As shown in Figure 16, the running time of HPA was shorter than that of PPA, and the speedup of HPA increased faster than that of PPA. There were two reasons for this phenomenon. First, HPA improved the efficiency of the input of LiDAR points by reducing the number of data buffers. Figure 17 shows the input time of LiDAR points in the two algorithms, extracted from the whole time of each algorithm. It contained two parts: the time of parallel data partitioning by PPDB, and the time of data reading by each process. As Figure 17a shows, the overall data inputting time was about 7000 s in the serial state, with each part about 3500 s. When the number of threads increased to 20, it decreased to approximately 1000 s, with each part about 500 s (Figure 17c). After that, the two parts remained stable and did not increase with an increase in threads. However, for PPA, the time of both parts increased with the increase in processes (Figure 17b). This is because the number of data buffers in PPA increased obviously with the increase in data strips, which caused a lot of redundant data reading. Furthermore, when different processes read LiDAR points stored in the same file simultaneously, data reading competition increased the reading time, especially for a larger number of processes. The data partitioning time increased with the increase in strips and the increase in reading requests. However, in the HPA, the number of data buffers remained unchanged, and there were constantly no more than 10 parallel inputting requests, which made the input time remain steady. Thus, the speedup of HPA increased faster than that of PPA. Second, the asynchronous parallel strategy in HPA effectively hid the time of reading points, as the time of reading points in Figure 17c overlapped with the computation time, which further increased the speedup of HPA.

Finally, the minimum execution time of PPA was 3347 s; for HPA, it was 2342 s. The efficiency was improved by 30%. Compared with PPA, the HPA proposed in this research could achieve a more efficient conversion from LiDAR points to DEM; in particular, with an increase in computing resources, the HPA showed better computational scalability than the PPA.
Figure 16. Running time and speedup of two algorithms.
Figure 17. LiDAR input time in two algorithms: (a) the overall input time of the hybrid process/thread parallel algorithm (HPA) and the process parallel algorithm (PPA); (b) the detailed time of PPA; and (c) the detailed time of HPA.
6. Discussion

This section primarily discusses the feasibility of the proposed hybrid parallel algorithm for different types of datasets and applications. On one hand, our test datasets included multiple terrain types: Data1 was representative of a hilly area and Data3 was a mixed terrain of flat and hilly areas. This indicated that our hybrid parallel algorithm was well suited for DEM generation from different
types of LiDAR datasets. On the other hand, parallel generation of DEMs from LiDAR points is a typical representative of the data-intensive and computation-intensive processing of a LiDAR dataset. It included points' partitioning, data reading, task scheduling, main computing, and result-writing. Generally, most of these steps can be reused in other applications of LiDAR datasets, such as parallel Triangulated Irregular Network (TIN) construction and parallel LiDAR point filtering. When we apply our hybrid parallel strategy to other LiDAR applications, we only need to replace the spatial interpolation with the main computing steps of those applications. Thus, the hybrid process/thread parallel strategy proposed in this research will be useful for people who want to accelerate the processing efficiency of their LiDAR datasets. In the future, we will package all steps of the hybrid parallel strategy, forming a set of tools that can be easily used by more LiDAR applications.

7. Conclusions

This research analyzed the parallel load balancing and boundary problems existing in generating DEM from LiDAR points. To solve these two problems, a hybrid process/thread parallel algorithm was proposed. At the process level, we designed a parallel data partitioning method (PPDB) and a dynamic scheduling strategy based on computation quantity to achieve load balancing. At the thread level, we partitioned the data block stored in memory logically. To further reduce the impact of data reading, we proposed an asynchronous parallel strategy between threads to hide the reading time of LiDAR points. Finally, we used OpenMPI 2.0.1 to implement the programming at the process level and used Pthreads to implement the programming at the thread level.

On the HPC, we tested our hybrid parallel algorithm using three LiDAR datasets of different volumes. We discussed the relationship between the number of partitioned data blocks and the performance of our algorithm and also analyzed how combining processes with threads could achieve the best performance. After that, we used a larger dataset to test the performance of our algorithm; furthermore, we analyzed the effect of the dynamic scheduling strategy based on computation quantity and the effect of the asynchronous parallel strategy between threads, respectively. Finally, we compared the performance of the hybrid process/thread parallel algorithm (HPA) proposed in this paper with that of the traditional process parallel algorithm (PPA) and analyzed the differences between them, explaining the reasons in detail.

Through the above-mentioned experiments, we reached three conclusions: (1) compared with PPA, the HPA proposed in this research improved the conversion efficiency from LiDAR points to DEM by making full use of the advantages of process-paralleling and thread-paralleling; (2) the PPDB was parallelizable, and the dynamic scheduling strategy based on computation quantity combined with PPDB achieved better load balancing; and (3) the asynchronous parallel strategy between threads further reduced the impact of reading LiDAR points on the algorithm's efficiency.

Acknowledgments: This work is supported by the Science and Technology Project of Qingdao (Grant No. 16-6-2-61-NSH).

Author Contributions: Yibin Ren designed and implemented the algorithm, did the experiments, and wrote the paper. Associate Professor Zhenjie Chen and Professor Yong Han conducted the experiments and helped Yibin Ren to analyze the experimental results.
Professor Ge Chen read and modified the paper. Yanjie Wang drew the result figures of the experiments.

Conflicts of Interest: The authors declare no conflict of interest.
References

1. Jones, K.H. A comparison of algorithms used to compute hill slope as a property of the DEM. Comput. Geosci. 1998, 24, 315–323.
2. Floriani, L.D.; Magillo, P. Digital Elevation Models. In Encyclopedia of Database Systems; Liu, L., Özsu, M.T., Eds.; Springer: Boston, MA, USA, 2009; pp. 817–821.
3. Lloyd, C.D.; Atkinson, P.M. Deriving ground surface digital elevation models from LiDAR data with geostatistics. Int. J. Geogr. Inf. Sci. 2006, 20, 535–563.
4. White, S.A.; Wang, Y. Utilizing DEMs derived from LIDAR data to analyze morphologic change in the North Carolina coastline. Remote Sens. Environ. 2003, 85, 39–47.
5. Ma, R.J. DEM generation and building detection from Lidar data. Photogramm. Eng. Remote Sens. 2005, 71, 847–854.
6. Hu, X.Y.; Li, X.K.; Zhang, Y.J. Fast Filtering of LiDAR Point Cloud in Urban Areas Based on Scan Line Segmentation and GPU Acceleration. IEEE Geosci. Remote Sens. Lett. 2013, 10, 308–312.
7. Ma, H.C.; Wang, Z.Y. Distributed data organization and parallel data retrieval methods for huge laser scanner point clouds. Comput. Geosci. 2011, 37, 193–201.
8. Wang, Y.; Chen, Z.; Cheng, L.; Li, M.; Wang, J. Parallel scanline algorithm for rapid rasterization of vector geographic data. Comput. Geosci. 2013, 59, 31–40.
9. Chen, C.; Chen, Z.; Li, M.; Liu, Y.; Cheng, L.; Ren, Y. Parallel relative radiometric normalisation for remote sensing image mosaics. Comput. Geosci. 2014, 73, 28–36.
10. Guan, Q.; Kyriakidis, P.C.; Goodchild, M.F. A parallel computing approach to fast geostatistical areal interpolation. Int. J. Geogr. Inf. Sci. 2011, 25, 1241–1267.
11. Zhao, L.; Chen, L.; Ranjan, R.; Choo, K.R.; He, J. Geographical information system parallelization for spatial big data processing: A review. Clust. Comput. 2016, 19, 139–152.
12. Liu, J.; Zhu, A.X.; Liu, Y.; Zhu, T.; Qin, C.Z. A layered approach to parallel computing for spatially distributed hydrological modeling. Environ. Modell. Softw. 2014, 51, 221–227.
13. Guan, X.; Wu, H. Leveraging the power of multi-core platforms for large-scale geospatial data processing: Exemplified by generating DEM from massive LiDAR point clouds. Comput. Geosci. 2010, 36, 1276–1282.
14. Huang, F.; Liu, D.; Tan, X.; Wang, J.; Chen, Y.; He, B. Explorations of the implementation of a parallel IDW interpolation algorithm in a Linux cluster-based parallel GIS. Comput. Geosci. 2011, 37, 426–434.
15. Han, S.H.; Heo, J.; Sohn, H.G.; Yu, K. Parallel Processing Method for Airborne Laser Scanning Data Using a PC Cluster and a Virtual Grid. Sensors 2009, 9, 2555–2573.
16. Danner, A.; Breslow, A.; Baskin, J.; Wilikofsky, D. Hybrid MPI/GPU interpolation for grid DEM construction. In Proceedings of the International Conference on Advances in Geographic Information Systems, Redondo Beach, CA, USA, 6–9 November 2012; pp. 299–308.
17. Huang, F.; Bu, S.; Tao, J.; Tan, X. OpenCL Implementation of a Parallel Universal Kriging Algorithm for Massive Spatial Data Interpolation on Heterogeneous Systems. ISPRS Int. J. Geo-Inf. 2016, 5, 96.
18. Cappello, F.; Richard, O.; Etiemble, D. Understanding performance of SMP clusters running MPI programs. Future Gener. Comput. Syst. 2001, 17, 711–720.
19. Yang, C.T.; Huang, C.L.; Lin, C.F. Hybrid CUDA, OpenMP, and MPI parallel programming on multicore GPU clusters. Comput. Phys. Commun. 2011, 182, 266–269.
20. Wu, X.; Taylor, V. Performance modeling of hybrid MPI/OpenMP scientific applications on large-scale multicore supercomputers. J. Comput. Syst. Sci. 2013, 79, 1256–1268.
21. Prasad, S.K.; McDermott, M.; Puri, S.; Shah, D.; Aghajarian, D.; Shekhar, S.; Zhou, X. A vision for GPU-accelerated parallel computation on geo-spatial datasets. SIGSPATIAL Spec. 2015, 6, 19–26.
22. Lee, V.W.; Kim, C.; Chhugani, J.; Deisher, M.; Kim, D.; Nguyen, A.D.; Satish, N.; Smelyanskiy, M.; Chennupaty, S.; Hammarlund, P. Debunking the 100X GPU vs. CPU myth: An evaluation of throughput computing on CPU and GPU. ACM SIGARCH Comput. Archit. News 2010, 38, 451–460.
23. Zuo, Y.; Wang, S.; Zhong, E.; Cai, W. Research Progress and Review of High-Performance GIS. J. Geo-Inf. Sci. 2017, 19, 437.
24. Huang, F.; Zhou, J.; Tao, J.; Tan, X.; Liang, S.; Cheng, J. PMODTRAN: A parallel implementation based on MODTRAN for massive remote sensing data processing. Int. J. Digit. Earth 2016, 9, 819–834.
25. Chatzimilioudis, G.; Costa, C.; Zeinalipour-Yazti, D.; Lee, W.C.; Pitoura, E. Distributed in-memory processing of All K Nearest Neighbor queries. In Proceedings of the IEEE International Conference on Data Engineering, Helsinki, Finland, 2016; pp. 1490–1491.
26. Dong, B.; Li, X.; Wu, Q.; Xiao, L.; Li, R. A dynamic and adaptive load balancing strategy for parallel file system with large-scale I/O servers. J. Parallel Distrib. Comput. 2012, 72, 1254–1268.
27. Qian, C.; Dou, W.; Yang, K.; Tang, G. Data Partition Method for Parallel Interpolation Based on Time Balance. Geogr. Geo-Inf. Sci. 2013, 29, 86–90.
28. Qi, L.; Shen, J.; Guo, L.; Zhou, T. Dynamic Strip Partitioning Method Oriented Parallel Computing for Construction of Delaunay Triangulation. J. Geo-Inf. Sci. 2012, 14, 55–61.
29. Ismail, Z.; Khanan, M.F.A.; Omar, F.Z.; Rahman, M.Z.A.; Salleh, M.R.M. Evaluating Error of LiDAR Derived DEM Interpolation for Vegetation Area. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2016, XLII-4/W1, 141–150.
30. Guan, X.; Wu, H.; Li, L. A Parallel Framework for Processing Massive Spatial Data with a Split-and-Merge Paradigm. Trans. GIS 2012, 16, 829–843.
31. Zou, Y.; Xue, W.; Liu, S. A case study of large-scale parallel I/O analysis and optimization for numerical weather prediction system. Future Gener. Comput. Syst. 2014, 37, 378–389.
32. Lu, G.Y.; Wong, D.W. An adaptive inverse-distance weighting spatial interpolation technique. Comput. Geosci. 2008, 34, 1044–1055.

© 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).