J Supercomput (2013) 65:630–644 DOI 10.1007/s11227-013-0949-0
Efficient application of GPGPU for lava flow hazard mapping

Donato D'Ambrosio · Giuseppe Filippone · Davide Marocco · Rocco Rongo · William Spataro
Published online: 24 April 2013 © Springer Science+Business Media New York 2013
Abstract  The identification of areas that are more likely to be impacted by new events in volcanic regions is of fundamental relevance for mitigating possible consequences, both in terms of loss of human lives and of material properties. For this purpose, lava flow hazard maps are increasingly used to evaluate, for each point of a map, the probability of being impacted by a future lava event. Typically, these maps are computed by relying on adequate knowledge of the volcano, assessed through an accurate analysis of its past behavior, together with the explicit simulation of thousands of hypothetical events, performed by a reliable computational model. In this paper, General-Purpose computing on Graphics Processing Units (GPGPU) is applied, in conjunction with the SCIARA lava flow Cellular Automata model, to the process of building lava invasion maps. Using different GPGPU devices, the paper illustrates different implementation strategies and discusses numerical results obtained for a case study at Mt. Etna (Italy), Europe's most active volcano.

Keywords  GPGPU · Cellular Automata · Lava simulation · Hazard maps

D. D'Ambrosio · G. Filippone () · R. Rongo · W. Spataro
University of Calabria, Cosenza, Italy
e-mail: [email protected]

W. Spataro
e-mail: [email protected]

D. Marocco
University of Plymouth, Plymouth, UK
e-mail: [email protected]

1 Introduction

The use of thematic maps of volcanic hazard is of fundamental relevance to support policy makers and administrators in taking the most appropriate land use planning
decisions and the proper actions required during an emergency phase. In particular, hazard maps are a key tool for emergency management, describing the threat that future eruptions can be expected to pose at a given location. At Mt. Etna (Southern Italy), the most active volcano in Europe, the majority of the events that occurred in the last four centuries caused damage to human lives and material properties in numerous towns on the volcano flanks [1]. Moreover, the susceptibility of the Etnean area to lava invasion has increased in recent decades due to continued urbanization [2], with the inevitable consequence that new eruptions may involve even greater risks.

During past eruptive episodes, different countermeasures based on embankments or channels were adopted to halt or deflect lava (e.g., [3]). Such interventions are generally performed while the eruption is in progress, inevitably endangering the safety of the persons involved. Current efforts for hazard evaluation and contingency planning in volcanic areas draw heavily on hazard maps and numerical simulations (e.g., [4–6]), for the purpose of identifying affected areas in advance. For instance, in 2001 the path of the eruption that threatened the town of Nicolosi on Mt. Etna was correctly predicted by means of a lava flow simulation model [7], providing at that time useful information to local Civil Defense authorities. Nevertheless, in order to be efficiently and correctly applied, the above approaches require prior knowledge of the degree of exposure of the areas surrounding the volcano, to allow both for the realization of preventive countermeasures and for more rational land use planning. However, because of the high number of explicit lava flow simulations required, even using the most optimized algorithms, the building of simulation-based maps is often a highly intensive computational process.

In recent years, parallel computing has undergone a significant revolution with the introduction of GPGPU (General-Purpose computing on Graphics Processing Units), a technique that uses the graphics processing unit (GPU) for purposes other than graphics. Currently, GPUs outperform CPUs on floating point performance and memory bandwidth, both by a factor of roughly 100 [8], easily reaching computational power in the order of teraFLOPS. In this paper, GPGPU is applied, in conjunction with SCIARA, a robust and reliable Cellular Automata (CA) lava flow simulation model adopted for mitigation and assessment purposes, to the process of lava flow hazard map computation. In particular, the paper shows different GPGPU strategies that give rise to different overall execution times. In the following sections, after a brief description of the considered version of the SCIARA-fv2 model for lava flows, a quick overview of the GPGPU paradigm, together with the Compute Unified Device Architecture (CUDA) framework, is presented. Subsequently, the specific GPGPU implementations and their performance in the simulations are reported, while conclusions and an outlook are given at the end of the paper.
2 Cellular Automata and the SCIARA-fv2 model

Cellular Automata (CA) [9] are parallel computational models widely utilized in Computational Science for solving complex scientific problems. CA are dynamical systems, discrete in space and time; space is generally subdivided into cells of uniform size, and the overall dynamics of the system emerges as the result of the simultaneous application,
at discrete time steps, of a transition function to each cell, which takes as input the states of the cells belonging to its neighborhood. Since the system is composed of a set of independent cells, each influenced only by a small set of neighboring cells, the implementation of CA on parallel computers is straightforward (e.g., [10, 11]).

Macroscopic Cellular Automata (MCA) [12] introduce some extensions to the classical CA formal definition. In particular, the state Q of the cell is decomposed into r substates, Q1, Q2, ..., Qr, each one representing a particular feature of the phenomenon to be modeled (e.g., for lava flow models, cell temperature, lava content, outflows, etc.). The overall state of the cell is thus obtained as the Cartesian product of the considered substates: Q = Q1 × Q2 × · · · × Qr. A set of parameters, P = {p1, p2, ..., pp}, is furthermore considered, which allows one to "tune" the model for reproducing different dynamical behaviors of the phenomenon of interest (e.g., for lava flow models, the Stefan–Boltzmann constant, lava density, lava solidification temperature, etc.). As the set of states is split into substates, the state transition function τ is also split into elementary processes, τ1, τ2, ..., τs, each one describing a particular aspect that rules the dynamics of the considered phenomenon. Finally, G ⊂ R is a subset of the cellular space subject to external influences (e.g., for lava flow models, the crater cells), specified by the supplementary function γ. External influences are introduced in order to model features that cannot easily be described in terms of local interactions.

In the MCA approach, by opportunely discretizing the surface on which the phenomenon evolves, the dynamics of the system can be described in terms of flows of some quantity from one cell to the neighboring ones. Moreover, as the cell dimension is constant throughout the cellular space, it is possible to express characteristics of the cell (i.e., substates) typically given in terms of volume (e.g., lava volume) in terms of thickness.

For the construction of lava flow hazard maps, the latest release of the SCIARA CA model for simulating lava flows was adopted. Specifically, a Bingham-like rheology has been introduced for the first time as part of the Minimization Algorithm of the Differences [12], which is applied for computing lava outflows from the generic cell towards its neighbors. Moreover, the hexagonal cellular space adopted in the previous releases [7] of the model for mitigating the anisotropic flow direction problem has been replaced by a square one, which nevertheless produces an even better solution to the anisotropic effect. The model has been calibrated by considering three important real cases of study, the 1981, 2001 and 2006 lava flows at Mt. Etna (Italy), and on ideal surfaces in order to evaluate the magnitude of anisotropic effects. As a matter of fact, notwithstanding a non-exhaustive calibration, the model proved to be more accurate than its predecessors, providing the best results ever obtained in the simulation of the considered real cases of study. While further details of this advanced model can be found in [13], we briefly outline its main specifications here. In formal terms, the SCIARA-fv2 model is defined as

SCIARA-fv2 = ⟨R, L, X, Q, P, τ, γ⟩    (1)

where:
– R is the set of square cells covering the bidimensional finite region where the phenomenon evolves;
– L ⊂ R specifies the lava source cells (i.e., craters);
– X = {(0, 0), (0, 1), (−1, 0), (1, 0), (0, −1), (−1, 1), (−1, −1), (1, −1), (1, 1)} is the set that identifies the pattern of cells influencing the cell state change (i.e., the cell itself and its 8 adjacent cells, forming the Moore neighborhood);
– Q = Qz × Qh × QT × Q_f^8 is the finite set of states, considered as the Cartesian product of "substates." The SCIARA-fv2 substates, which describe relevant quantities representing particular features of the phenomenon to be modeled, are: cell elevation a.s.l., cell lava thickness, cell lava temperature, and lava thickness outflows from the central cell towards its eight neighbors, respectively;
– P = {w, t, Tsol, Tvent, rTsol, rTvent, hcTsol, hcTvent, δ, ρ, ε, σ, cν} is the finite set of parameters (invariant in time and space) that affect the transition function (please refer to [13] for their specifications);
– τ : Q^9 → Q is the deterministic cell transition function, applied to each cell at each time step, which describes the dynamics of lava flows, such as cooling, solidification and lava outflows from the central cell towards the neighboring ones. In this version of SCIARA, the so-called elementary processes [12] composing the cell's transition function are: (i) τ1, which determines lava outflows by means of an opportune version of the Minimization Algorithm of the Differences; (ii) τ2, which computes the resulting lava thickness; (iii) τ3, which determines lava temperature; and (iv) τ4, which handles possible lava solidification;
– γ : Qh × N → Qh specifies the lava thickness, h, emitted from the source cells at each step k ∈ N (N is the set of natural numbers).

2.1 The methodology for defining hazard maps

Volcanic hazard maps are fundamental for determining locations that are subject to eruptions and their related risk. Typically, a volcanic hazard map divides the volcanic area into a certain number of zones that are classified differently on the basis of the probability of being impacted by a specific volcanic event in the future. Mapping both the physical threat and the exposure and vulnerability of people and material properties to volcanic hazards can help local authorities to guide decisions about where to locate critical infrastructures (e.g., hospitals, power plants, railroads, etc.) and human settlements, and to devise mitigation measures that might be appropriate. This can be useful for preventing the development of inhabited zones in high-risk areas, thus informing land use planning decisions.

A lava flow simulation model can represent an effective instrument for analyzing volcanic risk in an area by simulating possible single episodes with different characteristics (e.g., vent locations, effusion rates) [14]. However, the methodology presented here for defining highly detailed hazard maps is based on the application of the SCIARA lava flow computational model for simulating a large number of events on present topographic data. In particular, the methodology requires the analysis of the past behavior of the volcano, for the purpose of classifying the events that historically affected the region. In such a way, a meaningful database of plausible simulated lava flows can be obtained, by characterizing the study area both
in terms of areal coverage and of lava flow typologies. Data are subsequently processed by considering a proper evaluation criterion. A first solution could consist in considering the overlapping of lava flows, assigning a greater hazard to those areas affected by a higher number of simulations. However, such a choice could be misleading since, depending on the particular volcanological characteristics of the events (e.g., location of the main crater, duration and amount of emitted lava, or effusion rate trend), different events could occur with different probabilities, which should be taken into account when evaluating the actual contribution of the performed simulations to the definition of the overall hazard of the study area. In most cases, such probabilities can be properly inferred from the statistical analysis of past eruptions, allowing for the definition of a more refined evaluation criterion. Accordingly, rather than a simple hitting frequency, a measure of lava invasion hazard can be obtained in probabilistic terms.

The methodology, described in detail in [23] and [24], consists of an elaborate approach to the numerical simulation of Etnean lava flows, based on the results of a large number of simulations of flows erupted from a grid of hypothetical vents covering the entire volcano. Firstly, based on the documented past behavior of the volcano, the probability of new vents forming was determined, resulting in a characterization (a probability density function map, pdf map) of the study region into areas that represent different probabilities of new vents opening. Then, flank eruptions of Etna since 1600 AD were classified according to duration and lava volume, and a representative effusion rate trend was considered to characterize the lava temporal distribution of the representative eruptions, reflecting the mean effusive behavior of Etnean lava flows. An overall probability of occurrence for each lava event, pe, was thus defined as the product of the individual probabilities of its main parameters:

pe = ps · pc · pt

where ps denotes the probability of eruption from a given location (i.e., based on the pdf map), pc the probability related to the event's membership class (i.e., emitted lava and duration), and pt the probability related to its effusion rate trend.

Once representative lava flows were devised as above, a set of simulations was planned to be executed in the study area by means of the SCIARA lava flow simulation model. For instance, in [24] a grid composed of 4290 craters, spaced 500 m apart, was defined as a covering for Mt. Etna, from which the simulations were carried out. This choice allowed the study area to be covered both adequately and uniformly, while keeping the number of craters relatively small. Specifically, a subset of event classes defining 6 different effusion rate probabilities, derived from the historical events considered in [10], was taken into account for each crater, thus resulting in a total of 25,740 different simulations to be carried out. Owing to the large number of SCIARA simulations to be executed, parallel computing was necessary, and each scenario was simulated for each of the vents of the grid. Lava flow hazard was then punctually (i.e., for each cell) evaluated by considering the contributions, in terms of probability of occurrence, of all the simulations that affected a generic cell.
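To make the per-cell evaluation concrete, the following minimal C-style sketch shows how the probabilistic criterion described above could be implemented; all names (EventProb, accumulate_hazard, invaded, hazard) are illustrative and are not taken from the SCIARA code.

    #include <stddef.h>

    typedef struct {
        double ps;  /* probability of eruption at the vent location (pdf map) */
        double pc;  /* probability of the event's membership class            */
        double pt;  /* probability of the event's effusion rate trend         */
    } EventProb;

    /* Adds the contribution of one simulated event to the hazard map:
       each invaded cell accumulates the event probability pe = ps*pc*pt. */
    static void accumulate_hazard(double *hazard, const unsigned char *invaded,
                                  size_t n_cells, EventProb ev)
    {
        const double pe = ev.ps * ev.pc * ev.pt;
        for (size_t i = 0; i < n_cells; ++i)
            if (invaded[i])
                hazard[i] += pe;
    }

Applied over all the simulations of the database, such an accumulation yields, for each cell, a probabilistic measure of lava invasion hazard rather than a simple hit count.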
3 The NVIDIA CUDA programming approach

As an alternative to standard parallel paradigms (e.g., distributed and shared memory programming models), the term GPGPU (General-Purpose computing on Graphics Processing Units) refers to the use of the GPU as a parallel device for purposes other than graphics processing. A GPU can be seen as a computing device capable of executing a very large number of independent threads in parallel. For this reason, this kind of architecture, whose computational power can currently easily exceed a teraFLOP, is well suited to fine-grained parallelism, where the same computation is carried out independently on different parts of a data set. The widespread success and diffusion of GPGPU applications is largely due to the CUDA programming model [8]. CUDA provides three key abstractions: the hierarchy with which threads are organized, the memory organization, and the functions that are executed in parallel, called kernels. In this context, the GPU (called the device in CUDA terminology) can be thought of as an additional co-processor of the main CPU (called the host). In typical GPU applications, the data-parallel portions of the main application are executed on the device by calling a kernel performed by many threads, while the remaining parts are performed serially on the host.

Threads can access different memory locations during execution. Each thread has its own private memory, each block has a "close" (and limited) shared memory that is visible to all threads in the same block, and finally all threads have access to global memory, which is on the device board but outside the computing chip. The device global memory is slower than shared memory but can still deliver significantly (e.g., one order of magnitude) higher memory bandwidth than traditional host (CPU) memory; however, in most hardware configurations, accessing host memory from the GPU can be more than 20 times slower in terms of bandwidth (i.e., GB/s) than accessing global memory [15]. Therefore, data transfers between CPU and GPU should be kept to a minimum. In order to mitigate memory latency, the massively threaded architecture of recent GPUs automatically switches among threads: in case of a memory fetch stall, another scheduled thread is activated. As a consequence, the use of a large number of independently executing blocks is desirable [15].

In the following section, we illustrate how the concepts outlined above are applied to different GPGPU-based parallelizations of the procedure for building the described lava flow hazard maps.
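As a minimal illustration of the host/device pattern just described (not taken from the SCIARA code; names and sizes are arbitrary), the following CUDA fragment copies data to device global memory, launches a kernel executed by many threads, and copies the results back:

    #include <cuda_runtime.h>
    #include <stdlib.h>

    /* Each thread scales one independent element of the array. */
    __global__ void scale(float *data, float k, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread id */
        if (i < n)
            data[i] *= k;
    }

    int main(void)
    {
        const int n = 1 << 20;
        float *h = (float *)malloc(n * sizeof(float));
        for (int i = 0; i < n; ++i) h[i] = 1.0f;

        float *d;
        cudaMalloc(&d, n * sizeof(float));
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

        scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);  /* 256-thread blocks */

        cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d);
        free(h);
        return 0;
    }

Here the two cudaMemcpy calls are the only host–device transfers, in line with the recommendation of minimizing CPU–GPU traffic.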
4 GPGPU-based lava flow risk assessment

A natural approach for dealing with the high computational effort related to the construction of hazard maps, due to the large number of simulations to be executed, is the use of parallel computing techniques. As stated previously, GPGPU has recently attracted researchers due to its efficacy in exploiting the computational power provided
by graphics cards through a fine-grained data-parallel approach, especially when the same computation can be carried out independently on different elements of a data set, as in the case of CA models. CA models can be straightforwardly implemented on parallel computers due to their underlying parallel nature. In fact, since CA require only next-neighbor interaction, they are very suitable for, and can be efficiently implemented on, GPUs. This results in: (i) computing the next state of all the cells in parallel; (ii) accessing only the current neighbor states during each cell's update, thereby giving the chance to increase the efficiency of memory accesses.

As in the sequential case, the typical parallel CA implementation involves two memory regions, here called CAcur and CAnext, representing the current and next states of the cells, respectively. For each CA step, the neighboring values from CAcur are read by the local transition function, which performs its computation and writes the new state value into the appropriate element of CAnext. In several GPGPU parallel implementations of the CA-based procedure illustrated above, most of the CA data were stored in the GPU global memory (e.g., [16–20]). This involves: (i) initializing the current state through a CPU–GPU memory copy operation (i.e., from host to device global memory) before the beginning of the simulation, and (ii) retrieving the final state of the CA at the end of the simulation through a GPU–CPU copy operation (i.e., from device global memory to host memory). At the end of each CA step, a device-to-device memory copy operation is used to re-initialize the CAcur values with the CAnext values.

In order to speed up memory access, the CA data in device global memory should be organized so as to allow coalesced access. To this purpose, a best practice recommended by nVidia is to use structures of arrays rather than arrays of structures when organizing the memory storage of the cells' properties [15]. Even if further memory optimizations are possible, a completely contiguous structure access is difficult to achieve due to the presence of cell neighbors. As a compromise, an array with a size corresponding to the total number of cells was allocated in the CPU memory for each of the CA substates and, for every simultaneous run, allocated in the GPU memory, together with some additional auxiliary arrays (e.g., for storing the neighborhood structure and the model parameters). Since all of the neighboring cells have to access each other's substates, a CA step involves repeated accesses to the same portions of the global memory. While the faster (but limited) shared memory could be exploited to efficiently cache the CA substates needed by all the threads of the same block (e.g., [22]), the automatic caching levels (L1 and L2) for global memory access available in recent nVidia hardware were here adopted instead of an explicit shared memory implementation (e.g., [21]).

A key step in the parallelization of a sequential code for the GPU architecture, according to the CUDA approach, consists of identifying all the sets of instructions (e.g., the CA transition function) that can be executed independently of each other on different elements of a data set (e.g., on the different cells of the CA). As mentioned in Sect. 3, such sequences of instructions are grouped into CUDA kernels, each transparently executed in parallel by the GPU threads.
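The double-buffered update and the structure-of-arrays layout can be sketched as follows (an illustrative CUDA fragment with a simplified, hypothetical transition function, not the actual SCIARA one; hCur and hNext play the roles of CAcur and CAnext for a single substate):

    /* One thread per cell: read neighbors from the current buffer,
       write the new state into the next buffer only. */
    __global__ void caStep(const float *hCur, float *hNext, int W, int H)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x <= 0 || y <= 0 || x >= W - 1 || y >= H - 1) return;

        int i = y * W + x;
        /* von Neumann stencil shown for brevity; SCIARA-fv2 uses the
           Moore neighborhood and several substates per cell */
        float avg = 0.25f * (hCur[i - 1] + hCur[i + 1] +
                             hCur[i - W] + hCur[i + W]);
        hNext[i] = hCur[i] + 0.1f * (avg - hCur[i]);
    }

Since each substate is stored in its own array, threads of the same warp access consecutive elements of hCur and hNext, which favors coalesced global memory transactions.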
However, a critical aspect of the GPGPU parallelization that is the object of this study is that only the transition function of the active CA cells (i.e., cells containing lava) performs actual computation.
Table 1  Major characteristics of the GPGPU hardware adopted in the experiments

                      GTX 480    GTX 580    GTX 680    Tesla C2075
  SM count                 14         16          8             14
  CUDA cores              448        512       1536            448
  Clock rate [MHz]       1215       1544       1006           1150
  Bandwidth [GB/s]      133.9      192.4      192.3          144.0
  GFLOPS               1088.6     1581.1     3090.4         1030.4
Hence, launching one thread for each of the CA cells would waste a certain amount of the GPU computational power. For this reason, besides the straightforward parallel algorithm in which the above-mentioned kernels are mapped to the whole CA, a second, more optimized implementation has been developed. In the following, both approaches are described in detail, and their computational performance is empirically investigated using a significant test case (for benchmark purposes) and four recent GPU devices.

4.1 A case study and performed simulations

To evaluate the implementation strategies, a landscape benchmark case study was initially considered, modeled through a Digital Elevation Model composed of 200 × 318 square cells with a side of 10 m. Over the area, a regular grid of 38 × 60 craters was superimposed, leading to a total of 2280 simulations to be performed. From each crater of the grid, a typical lava flow with a 20 m3/s emission rate was considered, for a duration of 5000 CA steps. Two parallelization strategies and different graphics hardware were adopted in the experiments reported in the following. In particular, four CUDA devices were used: a nVidia Tesla C2075 and three nVidia GeForce graphics cards, namely the GTX 480, GTX 580 and GTX 680. The latter belongs to nVidia's new Kepler GPU architecture, while the others are endowed with the older Fermi GPUs. Some relevant characteristics of the adopted devices are reported in Table 1. Also, in order to quantify the achieved parallel speedup, sequential versions (in the C language) of the algorithms parallelized for the GPU were run on a workstation equipped with two quad-core Intel Xeon E5472 (3.00 GHz) CPUs.

4.2 A naïve implementation: the whole cellular space implementation

A first, straightforward parallel implementation, labeled WCSI (Whole Cellular Space Implementation), was considered, in which the CUDA kernels operate on the whole CA. Since during a lava flow simulation only the transition function of the currently active cells performs significant computation, performing only one simulation at a time would imply a high percentage of uselessly scheduled threads. In addition, given the small size of most simulations (on average, 20 % of the cells of the entire CA are active during a single simulation), the number of active threads would be too low to allow the GPU to effectively activate the latency-hiding mechanism described in Sect. 3. For these reasons, in the WCSI approach more than a single lava episode is simultaneously executed.
Algorithm 1: The general GPGPU procedure for building lava hazard maps

 1  nc ← 0;
 2  nt ← n;
 3  ns ← x;
 4  while nc < nt do
 5      configureCA(CAcur, ns);
 6      CAstep ← 0;
 7      while notTerminated(CAstep) do
 8          vent_{1,ns}(CAcur, CAnext, emittedLava);
 9          transitionFunctionInParallel_{grid,ns}(CAcur, CAnext);
10          updateNextfromCurrent_{grid,ns}(CAcur, CAnext);
11          CAstep ← CAstep + 1;
12      nc ← nc + ns;
This means that the main CUDA kernel is executed over a number of simulations that are advanced at the same CA step. In particular, each performed simulation is mapped onto a different value of the z coordinate of a grid of threads composed of 16 × 16 blocks. That is, the grid of threads used for the CA transition function is three-dimensional, with the base representing the considered CA space and the vertical dimension corresponding to the simulations.

According to the general GPGPU lava hazard computation procedure of Algorithm 1, in the hazard map computation phase, clusters of ns simulations are executed over the entire area under study until the total number of required simulations (nt) is covered. Before starting the hazard map construction (from line 4), a sequential pre-processing phase takes place to compute the kernel thread grid on the basis of the dimensions of the blocks and of the CA space. When the computation begins, configureCA resets the ns simulations at the beginning of each cluster computation phase, and the time duration of each simulation is initialized. Afterwards, the function transitionFunctionInParallel (line 9) activates the kernel implementing the simulation propagation mechanism on a 3D grid as described above; this kernel iterates over the simultaneously propagated simulations. Subsequently, a device-to-device memory copy operation is used to re-initialize the CAcur values of the substates with the values of CAnext (line 10). At the end of each iteration, the current CA time step CAstep is updated (line 11).
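A possible CUDA rendering of the WCSI launch configuration described above is sketched below; kernel and variable names are illustrative and not taken from the actual code (W and H denote the CA dimensions, ns the number of simultaneous simulations):

    /* One thread per cell per simulation; blockIdx.z selects the event. */
    __global__ void transitionFunctionInParallel(const float *hCur,
                                                 float *hNext, int W, int H)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        int sim = blockIdx.z;               /* which concurrent simulation */
        if (x >= W || y >= H) return;

        size_t i = (size_t)sim * W * H + (size_t)y * W + x;
        /* ... apply the SCIARA elementary processes to cell i,
               reading hCur and writing hNext ... */
        hNext[i] = hCur[i];  /* placeholder: identity update */
    }

    /* Hypothetical host-side helper: a 3D grid whose base covers the
       W x H cellular space with 16x16 blocks and whose z dimension
       indexes the ns simultaneous simulations. */
    void launchTransition(float *caCur, float *caNext, int W, int H, int ns)
    {
        dim3 block(16, 16, 1);
        dim3 grid((W + block.x - 1) / block.x,
                  (H + block.y - 1) / block.y,
                  ns);
        transitionFunctionInParallel<<<grid, block>>>(caCur, caNext, W, H);
    }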
For a fair comparison, a sequential version of the same algorithm was used; clearly, only one simulation at a time was propagated, since simulating multiple events simultaneously brings no advantage in the sequential case. Here, the elapsed time achieved by the CPU was 61,312 s. Using the adopted GPU devices, the algorithm was run with the WCSI approach and a variable number of simultaneous lava simulations. According to the results shown in Fig. 1, the GTX 680 achieved the lowest elapsed time of 1529 s, concurrently simulating 228 lava events. The gain provided by the parallelization in terms of computing time was significant and corresponded to a parallel speedup of 40 over the reference CPU.
Fig. 1 Elapsed time as a function of simultaneous lava events using the WCSI approach on the different GPGPU hardware considered
As can be seen, the simultaneous execution of several lava simulations was beneficial, since it allowed a computing time decrement ranging from about 27 % for the GTX 580, through 31 % for the GTX 680 and 36 % for the Tesla C2075, up to nearly 40 % for the GTX 480. However, even if computing times decrease as the number of concurrent simulations increases, for all the GPUs running more than about 25 simulations simultaneously did not lead to significant further advantages. In fact, once the number of computationally relevant threads is sufficient to enable latency mitigation, a further increment in the number of simulations is unnecessary. Interestingly, even though they are characterized by substantially different peak performance in single-precision floating point operations, the best elapsed times of the GTX 580 and GTX 680 were quite similar. This is due to the fact that the used kernels are definitely memory bound and that the two GPUs have the same memory bandwidth (see Table 1). In fact, especially considering that most substates are pre-computed once in the pre-processing phase, the CA transition function is characterized by a relatively low computational load; in addition, to apply the lava flow spreading mechanism, each cell must access its own substates and most of the substates of its neighboring cells.

4.3 The rectangular bounding box strategy: the dynamic grid implementation

A critical aspect of CA implementations that can improve performance, which also applies to sequential CA versions (e.g., [25]), is that the application of the transition function can be restricted to the active cells only, i.e., to where computation is actually taking place. When considering a phenomenon (i.e., a lava flow) that is topologically connected (i.e., a simulation starts from few active cells and evolves by activating neighboring cells), the CA space can be confined within a rectangular bounding box (RBB). This optimization drastically reduces execution times, since the sub-rectangle is usually considerably smaller than the original CA space, as sketched below for the sequential case. In the WCSI CA GPGPU implementation described above, this weakness is even more evident, due to the difficulty of maintaining a high percentage of computationally active threads in the CUDA grid.
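For the sequential case, the RBB optimization simply amounts to restricting the scan of the cellular space to the current bounding box; a minimal C-style sketch (with assumed names and a deliberately crude box update) is the following:

    typedef struct { int xmin, ymin, xmax, ymax; } BBox;

    /* Applies the transition function only within the bounding box and
       then enlarges the box by one cell per side, since a topologically
       connected flow can only activate neighbors of active cells. */
    static void rbbStep(const unsigned char *active, int W, int H, BBox *bb,
                        void (*applyTransition)(int x, int y))
    {
        for (int y = bb->ymin; y <= bb->ymax; ++y)
            for (int x = bb->xmin; x <= bb->xmax; ++x)
                if (active[y * W + x])
                    applyTransition(x, y);

        if (bb->xmin > 0)     bb->xmin--;   /* keep the box inside the CA */
        if (bb->ymin > 0)     bb->ymin--;
        if (bb->xmax < W - 1) bb->xmax++;
        if (bb->ymax < H - 1) bb->ymax++;
    }

In practice, a tighter box can be recomputed from the actually activated cells at each step, which is precisely what the GPU version does with atomic operations, as described next.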
Fig. 2 Mapping of the CA transition function onto a CUDA grid of threads (right) in the case of two simultaneous lava flows (left). As can be seen, computation takes place only for threads that are within the CRBB enclosing the active cells of the two lava events
For these reasons, an alternative approach was developed in which the grid of threads is dynamically computed during the simulation in order to keep the number of computationally irrelevant threads low. In this approach, labeled DGI (Dynamic Grid Implementation), a number of lava flow simulations are simultaneously executed as in the WCSI procedure. In addition, at each CA step the procedure involves the computation of the smallest common rectangular bounding box (CRBB) that includes all the active cells of every concurrent simulation. As shown in Fig. 2, all the kernels required by the CA step in Algorithm 1 are then mapped onto such a CRBB, reducing the number of useless threads and significantly improving the computational performance of the algorithm. Since the present study exploits relatively recent GPU devices, the CRBB computation was efficiently carried out using the atomicMin and atomicMax CUDA primitives within the same kernel implementing the transition function, as sketched below.
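An illustrative fragment of how such a per-step CRBB update could look is given below; the bbox layout (xmin, ymin, xmax, ymax) and all names are assumptions, not the actual SCIARA-fv2 code:

    /* Global bounding box, reset by the host (or a tiny kernel) before
       each CA step to (INT_MAX, INT_MAX, -1, -1). */
    __device__ int bbox[4];  /* xmin, ymin, xmax, ymax */

    /* Called from the transition-function kernel by every thread whose
       cell turns out to be active: the CRBB is extended atomically. */
    __device__ void extendCRBB(int x, int y)
    {
        atomicMin(&bbox[0], x);
        atomicMin(&bbox[1], y);
        atomicMax(&bbox[2], x);
        atomicMax(&bbox[3], y);
    }

At the end of the step, the host reads the four values back and sizes the grid of threads of the next step accordingly.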
A further enhancement introduced in the DGI approach concerns the device-to-device memory copy operation used to re-initialize CAcur with CAnext (line 10 of Algorithm 1). In particular, instead of involving the entire CA, a specific kernel was developed to re-initialize the substates of each concurrent simulation only for the cells lying within the CRBB.

In order to perform a fair comparison, an analogous strategy based on the bounding box was developed for the sequential version of the DGI algorithm. Using the reference CPU, such a sequential procedure required 47,521 s for the adopted case study. Figure 3 shows the corresponding times taken by the parallel DGI approach as a function of the number of concurrent simulations. As can be seen, the GTX 680 achieved the lowest elapsed time of 617.348 s, corresponding to a parallel speedup of 77. As expected, the simultaneous execution of many simulations was much more beneficial than in the WCSI case for all the considered hardware.
Fig. 3 Elapsed time as a function of simultaneous lava events using the DGI approach on the different GPGPU hardware considered
For example, with the GTX 680 card the best elapsed time is less than half of that obtained by carrying out only one simulation at a time. The reason is that, when only a relatively limited number of concurrent simulations is available, the percentage of active cells is higher in a small CRBB than in the entire CA. However, contrary to the WCSI approach, there is a critical value of the number of simultaneous lava episodes to be executed (about 20 for all the adopted hardware), beyond which the overall computing time increases. At the basis of this behavior is the CRBB mechanism, in which a certain number of inactive cells are present for each simulation. As a consequence, when the number of computationally active threads is high enough for the latency hiding mechanism mentioned in Sect. 3, any further increment of simultaneous lava events leads to the unnecessary scheduling of inactive threads, producing an overall decline in computational performance.

The techniques applied in the DGI approach for building lava flow hazard maps were applied to the area of the Mt. Etna volcano (Italy) as in [24]. A subset of the original grid of vents (1006 craters in total) was considered, and only one event typology, with a lava volume of 32 · 10^6 m3 emitted in 15 days, was taken into account, corresponding to the most common type of eruption recorded in the past 400 years of activity of the volcano (see [23] for further details). This gave rise to 1006 simulations, instead of the over 25,000 carried out in [24]. Figure 4 shows the map obtained by the application of the DGI procedure. In particular, the GTX 680 hardware was adopted, considering 20 concurrent simulations (i.e., the ns parameter of Algorithm 1). The overall computing time was 805 s, while the map obtained with the same configuration (i.e., the same grid and number of simulations), executed as in [24] with standard MPI Message Passing techniques on an 80-core cluster, was processed in 23,432 s. As regards the numerical comparison between this latter MPI approach and the GPGPU implementations, results were satisfactory, notwithstanding that the MPI version was performed using double-precision variables and the GPU version in single precision.
Fig. 4 Lava-flow invasion scenario at Mount Etna resulting from the simulation of 1006 lava flows originating from a uniform square grid, considering one simulation per vent. The legend indicates different probability classes for areas that could be affected by a lava event in the future
The areal extents of the simulations were the same, except for a few approximation errors in a limited number of cells: differences at the third significant digit were reported for only 4 % of the cells, while differences were smaller for the remaining cells.
5 Conclusions

Starting from the problem of lava hazard map computation, this study discussed some approaches for executing a large number of concurrent lava simulations using GPGPU. One of the advantages of the presented solutions lies in enabling the building of hazard maps for large areas, which otherwise may not be feasible with traditional sequential computation. The parallel speedups attained through the proposed approaches on GPGPU hardware were indeed significant. In addition to the construction of lava risk maps, the presented strategies may also be adopted for carrying out extensive sensitivity analyses of model parameters or for
the automatic planning of risk-mitigation interventions. Also, the fast execution of many lava simulations opens up the possibility of running combined rendering and simulation at interactive frame rates. Future work will also concern the adoption of multiple kernels, permitted by the recent Fermi architecture, where different streams of kernels can be activated simultaneously, evidently allowing a better management of the simulations. Finally, the presented techniques are currently being exploited in the application of evolutionary algorithms, which require an extensive simulation phase due to the numerous evaluations of candidate solutions, for the morphological evolution of protection works for lava flow stream deviation.

Acknowledgements This work was partially funded by the European Commission—European Social Fund (ESF) and by the Regione Calabria (Italy).
References

1. Behncke B, Neri M (2003) Cycles and trends in the recent eruptive behaviour of Mount Etna (Italy). Can J Earth Sci 40:1405–1411
2. Dibben C (2008) Leaving the city for the suburbs—the dominance of 'ordinary' decision making over volcanic risk perception in the production of volcanic risk on Mt Etna. J Volcanol Geotherm Res 172:288–299
3. Barberi F, Brondi F, Carapezza ML, Cavarra L, Murgia C (2003) Earthen barriers to control lava flows in the 2001 eruption of Mt Etna. J Volcanol Geotherm Res 123:231–243
4. Ishihara K, Iguchi M, Kamo K (1990) Numerical simulation of lava flows on some volcanoes in Japan. In: IAVCEI proceedings in volcanology, pp 174–207
5. Del Negro C, Fortuna L, Herault A, Vicari A (2008) Simulations of the 2004 lava flow at Etna volcano using the MAGFLOW Cellular Automata model. Bull Volcanol 70:805–812
6. Avolio MV, Crisci GM, Di Gregorio S, Rongo R, Spataro W, D'Ambrosio D (2006) Pyroclastic flows modeling using cellular automata. Comput Geosci 32:897–911
7. Crisci GM, Rongo R, Di Gregorio S, Spataro W (2004) The simulation model SCIARA: the 1991 and 2001 lava flows at Mount Etna. J Volcanol Geotherm Res 132:253–267
8. NVIDIA (2010) CUDA C Programming Guide, v. 3.2
9. von Neumann J (1966) Theory of self-reproducing automata. University of Illinois Press, Champaign. Edited and completed by A Burks
10. D'Ambrosio D, Spataro W (2007) Parallel evolutionary modeling of geological processes. Parallel Comput 33(3):186–212
11. Setoodeh S, Adams DB, Gürdal Z, Watson LT (2006) Pipeline implementation of cellular automata for structural design on message-passing multiprocessors. Math Comput Model 43:966–975
12. Di Gregorio S, Serra R (1999) An empirical method for modeling and simulating some complex macroscopic phenomena by cellular automata. Future Gener Comput Syst 16:259–271
13. Spataro W, Avolio MV, Lupiano V, Trunfio GA, Rongo R, D'Ambrosio D (2010) The latest release of the lava flows simulation model SCIARA: first application to Mt Etna (Italy) and solution of the anisotropic flow direction problem on an ideal surface. In: Proceedings of the international conference on computational science 2010. Procedia computer science, vol 1, pp 17–26
14. Crisci GM, Di Gregorio S, Nicoletta F, Rongo R, Spataro W (1999) Analysing lava risk for the Etnean area: simulation by cellular automata methods. Nat Hazards 20:215–229
15. NVIDIA (2012) CUDA C Best Practices Guide
16. Zuo W, Chen Q (2010) Fast and informative flow simulations in a building by using fast fluid dynamics model on graphics processing unit. Build Environ 45(3):747–757
17. Riegel E, Indinger T, Adams NA (2009) Implementation of a Lattice–Boltzmann method for numerical fluid mechanics using the nVIDIA CUDA technology. Comput Sci Res Dev 23:241–247
18. Preis T (2011) GPU computing in econophysics and statistical physics. Eur Phys J Spec Top 194:87–119
19. Roberts M, Sousa MC, Mitchell JR (2010) A work-efficient GPU algorithm for level set segmentation. In: SIGGRAPH 2010 conference, vol 53
20. Filippone G, Spataro W, Spingola G, D'Ambrosio D, Rongo R, Perna G, Di Gregorio S (2011) GPGPU programming and cellular automata: implementation of the SCIARA lava flow simulation code. In: Proceedings of the 23rd European Modeling and Simulation Symposium (EMSS), Rome, Italy, 12–14 September 2011, pp 696–702
21. Bilotta G, Rustico E, Hérault A, Vicari A, Russo G, Del Negro C, Gallo G (2011) Porting and optimizing MAGFLOW on CUDA. Ann Geophys 54(5)
22. D'Ambrosio D, Filippone G, Rongo R, Spataro W, Trunfio GA (2012) Cellular automata and GPGPU: an application to lava flow modeling. Int J Grid High Perform Comput 4(3):30–47
23. Crisci GM, Avolio MV, Behncke B, D'Ambrosio D, Di Gregorio S, Lupiano V, Neri M, Rongo R, Spataro W (2010) Predicting the impact of lava flows at Mount Etna. J Geophys Res 115(B4):1–14
24. Rongo R, Avolio MV, Behncke B, D'Ambrosio D, Di Gregorio S, Lupiano V, Neri M, Spataro W, Crisci GM (2011) Defining high-detail hazard maps by a cellular automata approach: application to Mount Etna (Italy). Ann Geophys 54:568–578
25. Walter R, Worsch T (2004) Efficient simulation of CA with few activities. In: ACRI 2004. LNCS, vol 3305, pp 101–110