Advances in Water Resources 33 (2010) 1456–1467


ParBreZo: A parallel, unstructured grid, Godunov-type, shallow-water code for high-resolution flood inundation modeling at the regional scale

Brett F. Sanders, Jochen E. Schubert, Russell L. Detwiler

Department of Civil and Environmental Engineering, University of California, Irvine, United States; UC Center for Hydrologic Modeling, Irvine, United States

Article history: Received 12 May 2010; Received in revised form 23 July 2010; Accepted 25 July 2010; Available online 21 September 2010

Keywords: Flood inundation model; Parallel computing; Unstructured grid; Dam-break flood; Hurricane Katrina; Storm surge

Abstract

Topographic data are increasingly available at high resolutions ...

Sf = (0, cD u V, cD v V)^T    (5)

where h = depth, u = x-component of velocity, v = y-component of velocity, z is the ground elevation, g is the gravitational constant, V = (u^2 + v^2)^(1/2), and cD = dimensionless drag coefficient, which can be modeled in several ways but is commonly expressed using a Manning coefficient nm as cD = g nm^2 h^(-1/3). The discrete solution introduces several additional parameters such as the computational cell area Aj, the length of grid cell edges Δsk, and the velocity normal to the grid edge, u⊥ = u cos φ + v sin φ, where φ is the angle between the outward normal of the edge and the x-axis.

The free surface height η and the discharges per unit width p = uh and q = vh are assumed to be cell-wise constant. Considering both steady and unsteady flood modeling test problems, a first-order scheme on a fine grid achieved a better use of limited computational resources than a second-order scheme on a coarse grid [4]. Grid coarsening is very effective at reducing sequential execution times because doubling the cell size reduces computational effort by a factor of 2^3. A porosity model has been incorporated into BreZo to support coarsening by parameterizing the storage and conveyance effects of sub-grid scale surface texture [33], as has been done elsewhere [24,36,47]. This supports use of a relatively coarse grid compared to building-resolving grids (e.g., [5,15,35]). However, grid coarsening can only be taken so far before important terrain features are lost or truncation errors become too large [3,12,13,46].

Time-stepping is another factor that affects execution time. BreZo now uses a time-wise accurate Local Time Stepping (LTS) scheme to further reduce model execution time without reducing accuracy [32]. Given a base time step Δt that satisfies the Courant–Friedrichs–Lewy (CFL) condition globally, individual cells may adopt a larger time step such as 2Δt, 4Δt or 8Δt so long as it satisfies the CFL condition locally. By using a larger time step, cell data are updated less frequently, which results in a more efficient algorithm. A careful sequencing of flux calculations and solution updates maintains the conservation properties of the LTS scheme.
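To make the LTS idea concrete, a minimal sketch (in Fortran, the language the code is compiled in) of how a single cell might select its local time-step multiple; the cell values, the Courant limit of 0.7, and the wave-speed estimate are illustrative assumptions, not details taken from the scheme of [32].

```fortran
! Sketch: pick a local time-step multiplier for LTS, assuming the base
! step dt already satisfies the CFL condition globally. Illustrative only;
! not the authors' implementation (see [32]).
program lts_sketch
  implicit none
  real :: dt, dx, h, u, g, c, dt_local
  integer :: m
  g  = 9.81
  dt = 0.625                           ! base time step satisfying CFL everywhere
  dx = 7.9                             ! local cell length scale, e.g. sqrt(cell area)
  h  = 0.2                             ! a shallow, slow cell far from the wave front
  u  = 0.05
  c  = abs(u) + sqrt(g*h)              ! local gravity-wave speed
  dt_local = 0.7*dx/max(c, 1.0e-6)     ! largest step allowed locally (Cr = 0.7 assumed)
  m = 1
  do while (2*m*dt <= dt_local .and. m < 8)
     m = 2*m                           ! adopt 2*dt, 4*dt or 8*dt if CFL allows
  end do
  print *, 'cell updates every', m, 'base steps (local step =', m*dt, 's)'
end program lts_sketch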

2.2. ParBreZo design

ParBreZo implements the Single Program Multiple Data (SPMD) paradigm for distributed-memory parallelism using a static domain decomposition. The model domain is partitioned into np subdomains, and individual processes, ip = 0, ..., np − 1, are executed for each subdomain, each with its own input and output. Further, each process communicates with neighboring processes using MPI so the solution in subdomain boundary cells can be updated using Eq. (3). The SPMD paradigm should be distinguished from the task-farming (or master/slave) paradigm, where one processor (master) orchestrates the tasks of the others (slaves) and serves


as a gateway for input and output [23]. An important advantage of SPMD is that hp = 1 in Eq. (2), so in principle a high level of efficiency can be maintained with increasing np. In contrast, the task-farming approach dictates hp < 1, which stands to limit performance as np increases.

Distributed-memory parallelism using MPI was chosen for ParBreZo for portability and scalability reasons. That is, execution of ParBreZo on a variety of architectures of all sizes is of interest, including clusters with combinations of shared and distributed memory compute nodes. The Metis library of graph partitioning tools is used for static domain decomposition [19], and a simple strategy of uneven weighting to improve performance is explored in this study, as discussed earlier. This involves a preliminary run of the model to identify the distribution of wet and dry cells, data that are subsequently used for weighting purposes. A final note on Metis is that it permits partitioning based either on the vertices or the cells of the grid. Metis considers the grid of vertices connected by cell edges to be the mesh, and the grid of cell centers linked to neighboring cell centers to be the dual mesh. Dual-mesh partitioning is used here because cells represent the computational element of BreZo, and preliminary experiments showed that mesh partitioning leads to ragged interfaces between subdomains.

ParBreZo relies on a pre-processing algorithm to load a whole-domain grid, subdivide it into np separate subdomain grids based on the Metis partitioning, and prepare separate input files for each process. The whole-domain grid is precisely the grid used by BreZo, which adopts the file formats defined by Triangle [34], a constrained Delaunay grid generation package. Triangle grids are routinely used by BreZo, but other grid generation software may also be used so long as the output files are placed in Triangle format. Triangle files include: .node (coordinates of vertices), .ele (nodes defining each triangle), .neigh (cells neighboring each cell) and .edge (nodes defining each edge in the grid).

Note that the term grid is used in this study wherever possible to describe both structured and unstructured clusters of computational cells or elements that span a modeling domain. Related terminology includes mesh, which is adopted by Triangle, and graph, used by Metis. Furthermore, partition is generally used in connection with graph to imply a subdivision into many smaller parts, but it can also be used in connection with grid, as we do here, to imply domain decomposition.

To partition the whole-domain grid, the Metis executable kmetis is used because it supports both unweighted and weighted partitioning. The input to kmetis is a .graph file of the dual mesh, so that grid cells rather than nodes are partitioned. The Triangle .ele file can easily be reformatted into a Metis .mesh file, and the Metis utility mesh2dual can be used to generate the required .graph file. The output of kmetis is a .part file which lists a process assignment, ip ∈ [0, np − 1], for each computational cell of the grid. After the cells of the whole-domain grid are flagged with a process value, the remaining task is to create np sets of input files for ParBreZo, one for each subdomain.
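As an illustration of this workflow, a minimal sketch of how a dual-mesh .graph file with wet/dry vertex weights might be assembled directly from a Triangle .neigh file, rather than via the mesh2dual route described above. The file names (including the irvine.wet flag file), the 2:1 weighting, and the assumed .neigh and Metis graph layouts are stated assumptions, not details of the ParBreZo pre-processor.

```fortran
! Sketch: build a Metis dual-graph file (cells as vertices) from a Triangle
! .neigh file, adding 2:1 wet:dry vertex weights for weighted partitioning.
! Assumes .neigh lists three neighbor indices per triangle (-1 = boundary)
! and the Metis graph header "n m fmt" with fmt=010 for vertex weights.
program neigh_to_graph
  implicit none
  integer :: nc, ncols, i, j, k, nedges, ios
  integer, allocatable :: nbr(:,:), wgt(:)
  logical, allocatable :: wet(:)

  open(10, file='irvine.neigh', status='old')
  read(10,*) nc, ncols                    ! ncols is 3 for triangles
  allocate(nbr(3,nc), wgt(nc), wet(nc))
  do i = 1, nc
     read(10,*) k, nbr(1,i), nbr(2,i), nbr(3,i)   ! -1 marks a missing neighbor
  end do
  close(10)

  wet = .false.                           ! wet/dry flags from a preliminary run (hypothetical file)
  open(11, file='irvine.wet', status='old', iostat=ios)
  if (ios == 0) then
     do i = 1, nc
        read(11,*) k
        wet(i) = (k == 1)
     end do
     close(11)
  end if
  where (wet)
     wgt = 2                              ! wet cells cost roughly twice as much
  elsewhere
     wgt = 1
  end where

  nedges = 0
  do i = 1, nc
     do j = 1, 3
        if (nbr(j,i) > 0) nedges = nedges + 1
     end do
  end do
  nedges = nedges/2                       ! each interior adjacency counted twice

  open(12, file='irvine.graph', status='replace')
  write(12,*) nc, nedges, '010'           ! fmt=010: vertex weights present
  do i = 1, nc
     write(12,*) wgt(i), pack(nbr(:,i), nbr(:,i) > 0)
  end do
  close(12)
end program neigh_to_graph
```

Running kmetis on the resulting .graph file with the desired number of parts then yields the .part file described above.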
Assuming that the whole-domain input files include .node, .ele, .neigh, and .edge, and that vertex-based ground elevations are stored in .bed (single column of values) and cell-based resistance parameters are stored in .rough (single column of values), then the pre-processing algorithm generates np sets of similarly named files, one set per subdomain, named according to ip as follows: ..node, ..ele, ..neigh, ..edge, and ..rough. Note that ground elevation data are added to the ..node file as an additional column of data. Each subdomain utilizes a local node, element, and edge numbering system.

The subdomain grid includes an additional layer of cells beyond the cell partition defined by kmetis. This layer, or halo, is needed to support the exchange of data between subdomains, and is


characterized by cells that share at least one node with the main partition, as shown in Fig. 2. Recognizing that all subdomains extend outwards by one layer (except at boundaries of the global domain), there exist two layers of cells along each subdomain boundary that participate in inter-process communication. From the perspective of a single process, this corresponds to an interior layer of data that must be sent to neighboring processes at each time step, and an exterior layer of data that must be received from neighboring processes at each time step.

To guide the sending and receiving of data between processes (or subdomains), the pre-processing algorithm prepares files listing the cell numbers to be exchanged between neighboring subdomains, using the appropriate local numbering system. Separate lists for sending and receiving data are prepared to account for differences in local numbering systems. The lists are written as a set of files named .cellsend.. and .cellrecv.., respectively, where sendr = 0, ..., np − 1 and recvr = 0, ..., np − 1 (recvr ≠ sendr). Note that these files are only prepared when there are data to be shared between processes, i.e., when the subdomains are neighbors.

A brief example is presented to clarify the file naming conventions. Consider a set of Triangle grid files, irvine.node, irvine.ele, irvine.neigh, irvine.edge, a ground elevation file irvine.bed, and a resistance parameter file irvine.rough. Assuming np = 2, two sets of grid files are created, with .node files (for simplicity) given by irvine.0.node and irvine.1.node. All other grid files are similarly named. Four files listing boundary cell numbers are also generated: irvine.cellsend.0.1, irvine.cellrecv.0.1, irvine.cellsend.1.0 and irvine.cellrecv.1.0. When ParBreZo is implemented, process 0 will read irvine.cellsend.0.1 to find the list of cells to send to process 1, and data will be sent in the order defined by the list. Process 1 will then receive this data and pass it into cells in the order prescribed by irvine.cellrecv.0.1. Conversely, process 1 will read irvine.cellsend.1.0 to find the list of cells to send to process 0, and this data will be read by process 0 into cells defined by irvine.cellrecv.1.0.

Finally, we note that ParBreZo output uses a similarly indexed file naming convention. This leads to a distributed set of output files that can readily be loaded by common visualization packages such as Tecplot (Tecplot Inc., Bellevue, WA), VisIt (Lawrence Livermore National Laboratory, Livermore, CA), and ArcGIS (ESRI, Redlands, CA). Moreover, use of distributed output keeps files to a manageable size and allows the modeler to easily work with a subset of the total model output. If a single global output file is needed, then a separate utility can be developed to merge the data, although the need for this did not arise in this study.

2.3. ParBreZo algorithm

The beauty of the SPMD approach is that the ParBreZo code differs very little from the BreZo code, principally because each process has its own input and output files and coordination between processes is only needed to update data along subdomain boundaries. As in any MPI implementation, ParBreZo is coded to first identify np and ip values using the MPI_COMM_SIZE and MPI_COMM_RANK directives, respectively. The appropriate input files are subsequently loaded. The only significant difference beyond this is a routine to exchange cell-based halo data after each time step.
The main steps of the ParBreZo update procedure are shown below, and the new procedure to share data appears as Step 4:


1. A sweep over edges of the mesh to compute fluxes, F⊥ in Eq. (3).
2. A sweep over cells to compute source terms, So and Sf in Eq. (3).
3. A sweep over cells to advance the cell-based solution, U in Eq. (3).
4. A routine which sends and receives boundary data.

Note that the edge and cell sweeps do not include the edges and cells associated only with the exterior halo, because the solution there is simply imported from a neighboring process.
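A schematic of how these four sweeps might sit in the main time loop is given below; the subroutine names are placeholders for illustration and are not taken from the ParBreZo source.

```fortran
! Schematic time loop for the four-step update; all subroutine names are
! placeholders, not routines from the ParBreZo source.
program update_skeleton
  implicit none
  integer :: n, nsteps
  nsteps = 270000                 ! e.g., the length of the Baldwin Hills run
  do n = 1, nsteps
     call edge_sweep_fluxes()     ! Step 1: F_perp on each edge (halo-only edges skipped)
     call cell_sweep_sources()    ! Step 2: So and Sf in each interior cell
     call cell_sweep_update()     ! Step 3: advance U in each interior cell
     call exchange_halo_data()    ! Step 4: send interior-layer cells, receive exterior halo
  end do
contains
  subroutine edge_sweep_fluxes()
  end subroutine
  subroutine cell_sweep_sources()
  end subroutine
  subroutine cell_sweep_update()
  end subroutine
  subroutine exchange_halo_data()
  end subroutine
end program update_skeleton
```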

The exchange of boundary data is handled with a combination of non-blocking and blocking communications, specifically MPI_ISEND and MPI_RECV calls [29]. The data exchanged between processes include a Boolean (integer) variable that tracks the wet/dry status of cells and six double precision variables corresponding to: η, h, u, v, uh and vh. Prior to calling MPI_ISEND, these are packed into integer and double precision buffer arrays in accordance with the lists of halo cells described earlier. The data are similarly unpacked from buffers after completing MPI_RECV calls.

The number of messages depends on the number of subdomains surrounding a given subdomain. This is variable, as it depends on the details of the triangulation and the patchwork of subdomains as shown in Fig. 1. On uniform grids, six neighbors is typical, but up to 11 neighbors were noted in a problem with variable grid resolution. In principle, the computations and communications could be rearranged so that only three double precision variables are exchanged between processes: h, uh and vh. However, all six were passed for consistency between ParBreZo and BreZo.
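The following condensed sketch illustrates such an exchange with a single neighboring subdomain, assuming the send and receive cell lists have already been read into index arrays; the argument names, buffer layout, and message tags are illustrative, and a wait on the send requests is added for completeness.

```fortran
! Sketch of the Step-4 halo exchange with one neighboring subdomain.
! Index arrays, buffer layout and names are illustrative; only the wet/dry
! flag and the six real variables named in the text are exchanged.
subroutine exchange_with_neighbor(nbr, nsend, isend_cell, nrecv, irecv_cell, &
                                  wetdry, eta, h, u, v, uh, vh)
  use mpi
  implicit none
  integer, intent(in) :: nbr, nsend, nrecv
  integer, intent(in) :: isend_cell(nsend), irecv_cell(nrecv)  ! from .cellsend/.cellrecv lists
  integer, intent(inout) :: wetdry(*)
  double precision, intent(inout) :: eta(*), h(*), u(*), v(*), uh(*), vh(*)

  integer :: i, j, ierr, req_i, req_d, status(MPI_STATUS_SIZE)
  integer :: ibuf_s(nsend), ibuf_r(nrecv)
  double precision :: dbuf_s(6*nsend), dbuf_r(6*nrecv)

  ! Pack the interior-layer cells in the order given by the send list.
  do i = 1, nsend
     j = isend_cell(i)
     ibuf_s(i) = wetdry(j)
     dbuf_s(6*(i-1)+1:6*i) = (/ eta(j), h(j), u(j), v(j), uh(j), vh(j) /)
  end do

  ! Non-blocking sends, blocking receives (MPI_ISEND / MPI_RECV).
  call MPI_ISEND(ibuf_s, nsend,   MPI_INTEGER,          nbr, 1, MPI_COMM_WORLD, req_i, ierr)
  call MPI_ISEND(dbuf_s, 6*nsend, MPI_DOUBLE_PRECISION, nbr, 2, MPI_COMM_WORLD, req_d, ierr)
  call MPI_RECV (ibuf_r, nrecv,   MPI_INTEGER,          nbr, 1, MPI_COMM_WORLD, status, ierr)
  call MPI_RECV (dbuf_r, 6*nrecv, MPI_DOUBLE_PRECISION, nbr, 2, MPI_COMM_WORLD, status, ierr)

  ! Unpack into the exterior halo cells in the order given by the receive list.
  do i = 1, nrecv
     j = irecv_cell(i)
     wetdry(j) = ibuf_r(i)
     eta(j) = dbuf_r(6*(i-1)+1);  h(j)  = dbuf_r(6*(i-1)+2)
     u(j)   = dbuf_r(6*(i-1)+3);  v(j)  = dbuf_r(6*(i-1)+4)
     uh(j)  = dbuf_r(6*(i-1)+5);  vh(j) = dbuf_r(6*(i-1)+6)
  end do

  call MPI_WAIT(req_i, status, ierr)
  call MPI_WAIT(req_d, status, ierr)
end subroutine exchange_with_neighbor
```

In practice this routine would be called once per neighboring subdomain, with the non-blocking sends posted first so that mutually dependent processes cannot deadlock.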

3. Performance testing

3.1. Hardware

Three high performance computing clusters were used to test ParBreZo, referred to here as alpha, beta and gamma (arbitrary names for convenience), with hardware profiles shown in Table 1. These systems offer a shared memory architecture for up to eight processes, and a distributed memory environment for up to 48 (alpha), 64 (beta) and 512 (gamma) processes, based on the computing privileges available to the authors at the time of this study. Both the alpha and beta clusters consist of Intel Xeon processors, although alpha is newer and faster, while the gamma cluster uses AMD Opteron processors. Focusing on communications, the alpha cluster uses a Gigabit ethernet switch while both the beta and gamma clusters use a much faster InfiniBand switch. The random access memory per node varies from 8 (alpha) to 12 (beta) to 16 (gamma) Gigabytes.

3.2. Long wave tank

Fig. 1. Metis partitioning of a Triangle grid into 64 subdomains.

The first test problem is designed with excellent load balancing, so an upper bound on the parallel efficiency of ParBreZo can be identified. The domain consists of a 10 km by 10 km area which is initially flooded to a depth of 1.0 m and 0.5 m in its western and eastern half, respectively. Flow resistance is modeled using a spatially uniform Manning nm = 0.02 m^(-1/3) s. The discontinuity in depth initiates dam-break type wave action which begins a sloshing type motion that, under the influence of friction, is damped and therefore decays in amplitude over time. Note that for this example we have avoided introducing any dry or partially wetted cells, which create load balancing problems that have been shown to negatively impact parallel efficiency [20,26]. Hence, an evenly weighted domain decomposition is used and can be assumed to be optimal.

Testing focuses on parallel efficiency as a function of np and nc. The former is of interest for scaling reasons, while the latter affects the volume of data that is exchanged between processes, and previous studies have shown that this can degrade parallel efficiency [44].

Table 1. Compute clusters used for model testing. Compilation performed using PGI Fortran in all cases.

Fig. 2. Subdomain grids include an exterior layer of cells, or halo. Data in these cells are passed from neighboring processes before the solution is advanced to the next time level. Similarly, each process collects data from a layer of cells along the interior of its boundary and passes this to neighboring processes.

Cluster   Node CPUs           CPU speed (GHz)   Cores per node   Node RAM (GB)   Available nodes   Communication switch
alpha     Intel Xeon E5472    3.0               8                8               6                 Gb Ethernet
beta      Intel Xeon E5420    2.5               8                12              8                 InfiniBand
gamma     AMD Opteron 8216    2.4               8                16              64                InfiniBand


Note that preliminary checks were made to ensure correctness of the parallel output by comparison to sequential output.

Three grids were generated by Triangle to study the impact of nc on performance.

Fig. 3. Bin-tree grids generated by Triangle for the long wave test problem. Heavy and light lines correspond to Grids 1 and 3, resulting from an area constraint of 1000 and 62.5 m^2, respectively. The intermediate resolution grid is not shown, but follows the same pattern of vertices with either 4 or 8 connecting edges. (Axes: Easting (m) versus Northing (m), 0–120 m.)

Grid generation was based on a minimum angle constraint of 30° and a maximum area constraint of either 1000, 250, or 62.5 m^2. This resulted in Grids 1, 2, and 3 with 131,072, 524,288, and 2,097,152 cells, respectively. Fig. 3 shows that the resulting grids are regular: every triangle in the grid has one 90° and two 45° angles, and every interior vertex has either 4 or 8 connecting edges. These grids are sometimes called "4–8" meshes for this reason. Another name is "bin-tree" meshes, because refinement can proceed by dividing each triangle from the midpoint of its hypotenuse into two triangles of equal size, similar to "quadtree" meshes that result from dividing a Cartesian cell into four cells of equal size.

Time steps of Δt = 2.5, 1.25, and 0.625 s were used on Grids 1, 2, and 3, respectively, with a total integration period of 12,500 s in all cases. This resulted in Cr ≈ 0.7. Run times were measured from the start to the end of time integration, so the time required to read and write files and perform pre- and post-processing calculations is not considered.

Fig. 4 shows the speedup S and efficiency E achieved on the alpha cluster using np = 1, 4, 8, 16, 24, 32, 40 and 48 processes. Also shown are a node-based speedup S* and efficiency E*, where the np = 8 case is used to normalize run times. A couple of trends are revealed. First, the model performs exceptionally well on Grid 1, with efficiencies (E) exceeding 100% in some instances. Here, the data volumes are so small that the cost of buffering boundary data and exchanging it between processes is negligible. However, Grids 2 and 3 show that as the density of the computational mesh increases, and larger packets of data are exchanged between processes, there is an initial decrease in parallel efficiency followed by a subsequent increase. This creates a parallel efficiency minimum around np = 8. The drop in efficiency occurs on a single node, where processes communicate using the system bus, which is quite fast compared to the gigabit ethernet that governs inter-node communication on the alpha cluster.
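For reference, a small sketch of how these measures can be computed from wall-clock times, assuming the conventional definitions S = T1/Tnp and E = S/np together with the node-based S* = T8/Tnp and E* = S*/(np/8); the timing values in the sketch are placeholders, not measurements from this study.

```fortran
! Sketch: speedup and efficiency measures as plotted in Figs. 4 and 5,
! assuming S = T1/Tnp, E = S/np, S* = T8/Tnp, E* = S*/(np/8).
! Wall-clock times below are hypothetical, not values from the study.
program speedup_sketch
  implicit none
  integer, parameter :: nruns = 4
  integer :: np(nruns) = (/ 8, 16, 32, 48 /)
  real    :: t(nruns)  = (/ 760.0, 350.0, 180.0, 110.0 /)   ! seconds (hypothetical)
  real    :: t1 = 4800.0                                     ! np = 1 run time (hypothetical)
  real    :: s, e, sstar, estar, t8
  integer :: i
  t8 = t(1)                      ! run using one fully tasked 8-core node
  do i = 1, nruns
     s = t1/t(i)
     e = 100.0*s/np(i)
     sstar = t8/t(i)
     estar = 100.0*sstar/(real(np(i))/8.0)
     print '(a,i3,4(a,f7.1))', 'np=', np(i), '  S=', s, '  E%=', e, &
                               '  S*=', sstar, '  E*%=', estar
  end do
end program speedup_sketch
```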


Fig. 4. Speedup and parallel efficiency of ParBreZo in the long wave tank problem, which offers perfect load balancing. Computing hardware corresponds to the alpha cluster: eight-way, 3.0 GHz Intel Xeon E5472 processing nodes with 8 GB RAM per node and gigabit inter-connect. Curves correspond to nc = 131,072, 524,288, and 2,097,152, with Amdahl's Law shown for reference.


Therefore, the cost of transferring data between processors is unlikely to be the driving factor here; rather, the performance loss could be due to the operations used to package boundary data in buffers before and after each data transfer. Another explanation is the hardware parallelism afforded by the Intel Xeon chipset. That is, a compute task that only requests a single processing core might benefit from automatic hardware parallelizations that engage other cores when those resources are idle.

The node-based speedup S* and efficiency E* shown in Fig. 4 suggest that ParBreZo scales extremely well, in fact faster than Amdahl's Law would predict (i.e., superlinearly), when increasing numbers of compute nodes are used. The superlinear scaling could be due to a decrease in the amount of explicit buffering per process as the number of processes increases. That is, with nc fixed, the number of boundary cells per process is reduced as the number of processes increases. The highly distributed nature of the message passing resulting from the SPMD algorithm design surely helps in this regard, in comparison to a task-farming design where a single process coordinates activities and stands to become a communication bottleneck. A second factor to consider is the system bus, which may become a bottleneck in an SPMD implementation when multiple processes simultaneously try to access memory. And thirdly, it is possible that as np increases and the memory required for each process is reduced below the size of the L2 cache, memory flow improves.

The long wave tank problem is also used to compare the performance of ParBreZo on a range of computing architectures, those listed in Table 1, using a larger number of compute nodes. The previous test problem involving Grid 3 (2,097,152 cells) was repeated on the beta cluster using np = 1, 4, 8, 16, 24, 32, 40, 48, 56, and 64 and on the gamma cluster using np = 1, 8, 32, 64, 72, 128, 256, and 512. Fig. 5 shows speedups and efficiencies resulting from these tests, along with the node-based speedup S* and efficiency E* as before.

The gamma cluster results show that ParBreZo has excellent scaling properties, as it maintains an efficiency close to 100% up to the maximum number of processes tested (512). Note that an efficiency of ca. 100% is achieved using np = 8 on the AMD-based gamma cluster, while the Intel-based alpha and beta clusters achieve ca. 70% and 60% efficiencies, respectively. These differences point to the significance of hardware parallelism within Intel Xeon chipsets, which is able to exploit idle computing resources to expedite sequential jobs. Additionally, the lower efficiency on the beta cluster (compared to the alpha cluster) is attributed to its slower system bus.

Overall, the parallel performance of ParBreZo appears very good, with standard efficiencies between 60% and 110% over all variations in problem size (nc), computing resources (np), and computing architectures considered. Moreover, on the gamma cluster there is nearly perfect scaling of the model (efficiencies of 100%) up to the maximum number of processors available (512), while on the alpha and beta clusters, there is a trend towards increased efficiency with increasing np. This trend compares favorably with the OpenMP and MPI-based parallel flood inundation models recently reported by Neal et al. [26], which showed a decrease in parallel efficiency with increasing processors. The improvements may be attributed to the SPMD algorithm design over the task-farming approach.

3.3. Dam-break problem

Previous studies have shown that partially wetted or dry cells introduce a load balancing challenge to parallel implementations of flood inundation models [20,25,26]. The inability to balance the computational effort across processors leads to an overall reduction in parallel efficiency, as processes spend time waiting for others to finish.


Fig. 5. Speedup and parallel efficiency of ParBreZo in long wave tank problem using alpha, beta and gamma compute clusters. S* and E* are normalized by the np = 8 run time, which corresponds to a fully tasked compute node.


Here, the weighting capability of the Metis libraries is utilized for improved performance over evenly weighted partitions. The Baldwin Hills dam-break problem serves as a case study [9]. The site is urbanized, flow is highly unsteady, and the flood zone changes with time, so this represents a rigorous test relative to load balancing.

Baldwin Hills Reservoir in Los Angeles, California failed in December 1963 due to piping and subsequent erosion of an earthen embankment. Fig. 6 shows the resulting flood extent, which inundated a highly urbanized region consisting mainly of single family homes and apartment buildings [42]. Many of these structures were destroyed by the high velocity flood water and there was extensive damage to roads and drainage infrastructure. The sequential version of BreZo was recently applied to simulate flood inundation, using high resolution LiDAR topography, spatially variable Manning nm for resistance, and a simple storm drain model, and the model was found to accurately predict both flood extent and the flash flood hydrograph in Ballona Creek, the main drainage channel [9].

The model setup for this test is very similar but not identical to the aforementioned study [9], so details are reported here. The whole-domain mesh consists of nc = 374,414 computational cells. The mesh was generated by Triangle and includes two different levels of refinement. The finest level (maximum area constraint of 2.8 m^2) surrounds the reservoir and flow path immediately below the breach, as well as the residential area where structures were damaged. The remainder of the mesh is covered by the second, coarser refinement level (maximum area constraint of 11.3 m^2).


A small trapezoidal breach was assumed present at time zero to initiate flow, and 9 min into the flood the breach was assumed to linearly transition, over a period of one minute, to the final breach shape defined by a post-failure topographic survey. The transition is based on photographic documentation presented in a California Department of Water Resources report [39]. Gallegos et al. [9] used the same initial and final geometry, but performed a restart after 10 min with new topography so the topographic change was instantaneous. ParBreZo was executed for 270,000 time steps using Δt = 0.04 s, a total of 3 h. This resulted in a variable Cr with a maximum value of 0.9. For parallel performance analysis, ParBreZo was implemented as 1, 4, 8, 16, 24, 32, 40 and 48 processes on the alpha cluster. The solution was saved every 10 min for post-processing purposes.

The performance of ParBreZo was examined first based on an evenly weighted partitioning of the computational grid. Secondly, performance was examined using a weighted partitioning that was based on output from the first run. Wet cells require more computational effort than dry cells, so this is the basis for weighting. A static load balance such as this will obviously not be optimal at all times because the load is dynamic, but it may offer a good compromise. Indeed, a 2:1 weighting ratio for wet:dry cells was found to yield the best performance in comparison to 3:2, 3:1, and 5:1 weightings.

Fig. 7 shows the flood progression across the study area, in comparison to the observed flood extent (red outline), and across a 48-way partitioned computational grid.

Fig. 6. Location and flood extent from the Baldwin Hills dam failure in 1963, from [9], used with permission.


Fig. 7. Partitioning of Baldwin Hills domain into 48 subdomains based on 1:1 and 2:1 wet:dry weighting of computational cells, and progression of flood across study area. Variable subdomain sizes in 1:1 case result from localized grid refinement. Red outline represents observed flood extent. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Two different partitions, based on 1:1 and 2:1 wet:dry weighting, are shown. Note that 2:1 weighting leads to smaller subdomains in the flood zone, and larger subdomains where terrain remains dry throughout the simulation.

Fig. 8 shows speedup and parallel efficiency for the 1:1 and 2:1 weightings of the grid partition. The results show that parallel efficiency quickly drops with increasing np but then levels off and is relatively insensitive to np thereafter. The initial drop in efficiency for np ≤ 8 is similar to what was observed in the previous test problem (long wave tank), which had perfect load balancing. The differences here include a larger initial drop in efficiency, and an efficiency which no longer improves but rather remains constant or perhaps decreases slightly with increasing np. Though reduced efficiency compared to the fully-wet, long wave tank simulation was to be expected, the absence of a degradation in efficiency with increasing number of processes is encouraging.


Fig. 8. Speedup and parallel efficiency of ParBreZo in Baldwin Hills test problem on the alpha cluster. Weighting ratio corresponds to wet:dry cells.

This suggests that ParBreZo will scale well to considerably larger computing systems. These results also show that a simple approach to load balancing can have a measurable influence on run times. A 2:1 weighting of wet:dry cells during grid partitioning achieves about a 10% increase in parallel efficiency, compared to 1:1 weighting. However, in comparison to the previous test problem, parallel efficiency is about 20% less even after the weighting. This is attributed to using static grid partitioning when the load distribution is actually changing with time. Dynamic grid partitioning could be considered in the future as an alternative to static partitioning to address this problem, as discussed earlier.

3.4. High resolution storm surge modeling

Hurricane storm surge poses significant flood risk along the Gulf and Atlantic coasts of the United States. As hurricane forecasting systems continue to improve, and high resolution topographic data are increasingly mapped, emergency management efforts may be supported by high-resolution inundation forecasts. Here we examine the potential for ParBreZo to map inundation much faster than real time, which is a requirement for effective forecasting. We consider a ca. 40 km stretch of the Gulf coast roughly between Pass Christian and Ocean Springs, MS, where Hurricane Katrina delivered a storm surge over 8 m high [17].

A 1/3 arc second (10 m) raster DTM was prepared by merging bathymetric and topographic point datasets and applying an inverse distance weighted interpolation to estimate elevation in each grid cell. Bathymetric points were obtained as soundings from the NOAA Office of Coast Survey (200 m intervals), and topographic points (5 m intervals) were based on classified LiDAR point cloud data recorded pre-Katrina in March 2004 for Harrison county and February 2005 for Hancock and Jackson counties (http://www.csc.noaa.gov/lidar/, accessed April 2010). The LiDAR classification included non-ground and bare-earth returns.



Fig. 9. Predicted and observed Mississippi coastline flood extent from Hurricane Katrina. Detailed patterns of flood inundation are resolved by the 10 m resolution inundation model.

To prepare the DTM, the bare-earth points were augmented by non-ground returns aligned with embanked railway lines and roads, as characterized by the US National Transportation Atlas database [6] and the US Census Bureau Tiger/Line® database [43], respectively. All elevations were referenced to NAVD88 based on the tidal and geodetic datum at Gulfport Harbor, Mississippi Sound, MS (NOAA station id: 8745557). The stated vertical accuracy for the onshore topography is 18.5 cm root mean square error (RMSE) for Harrison county and 12 cm RMSE for Hancock and Jackson counties [18]. The vertical accuracy of the soundings depends on the depth and is approximately 0.5 m for a depth range of 1–10 m [27].

A bin-tree computational mesh, similar to Fig. 3 but with a spatially variable resolution, was prepared from the 1/3 arc second DTM and placed in Triangle format. The bin-tree mesh aligns vertices with the DTM elevation points, so no interpolation is required. The mesh depicts all terrain between 0 and 15 m (NAVD) at 1/3 arc second for maximum flood extent precision. However, for computational efficiency, the coastal bathymetry grid spacing was gradually increased by up to a factor of 64 in the deepest water (ca. −4.0 m NAVD), and by a factor of up to 128 for high relief topography (>15 m). The resulting mesh consists of nc = 7,823,921 triangles.

A Katrina-like storm surge was simulated by specifying a time-dependent water elevation uniformly along the off-shore boundary of the model domain. As our intention is to examine only the computational demands of high-resolution storm surge inundation modeling, we do not focus on a precise depiction of the Katrina surge. All other boundaries were treated as free-slip walls. The water level was specified using the following equation for a solitary wave:

η(t) = Ho + a sech^2((t − t1)/t2)  (NAVD)      (6)

where Ho = 0.2 m, a = 8.1 m, t1 = 6 h, and t2 = 2 h. These parameter values approximate the height and duration of the surge along this stretch of the coastline based on data reported by the National Hurricane Center [17]. Similarly, a simple resistance parameterization was used. Resistance is scaled by a spatially uniform Manning nm = 0.02 m^(-1/3) s, which corresponds to a sandy surface, but is low compared to the vegetated terrain that is common to the region.
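A small sketch evaluating this forcing over the 12 h simulation period, using the parameter values quoted above; the unit of Ho (m, NAVD) is an assumption.

```fortran
! Sketch: evaluate the solitary-wave boundary forcing of Eq. (6) at hourly
! intervals over the 12 h simulation. Parameter values are those quoted in
! the text; the unit of Ho (m NAVD) is assumed.
program surge_forcing
  implicit none
  real :: Ho, a, t1, t2, t, eta
  integer :: k
  Ho = 0.2    ! offshore water level before the surge (m NAVD, assumed unit)
  a  = 8.1    ! surge amplitude (m)
  t1 = 6.0    ! time of peak (h)
  t2 = 2.0    ! duration scale (h)
  do k = 0, 12
     t = real(k)
     eta = Ho + a/cosh((t - t1)/t2)**2     ! sech(x) = 1/cosh(x)
     print '(a,i2,a,f6.2,a)', 't = ', k, ' h: eta = ', eta, ' m NAVD'
  end do
end program surge_forcing
```

The peak of about 8.3 m NAVD at t = 6 h is consistent with the surge height cited from [17].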

The resistance parameterization, boundary forcing and potential for wind, wave and streamflow effects would need to be re-examined for model validation purposes, because the approach used here is likely to be overly simplistic for accuracy purposes.

ParBreZo was executed for a period of 12 h using a time step of 0.2 s, which corresponds to Cr ≈ 0.8. Using the maximum number of available processors on the beta cluster (np = 64), a wall clock time of 4.0 h expired during the simulation. Using the gamma cluster, wall clock times of 9.1, 5.1, 2.9 and 1.6 h expired using np = 64, 128, 256 and 512, respectively. The 1.6 h simulation offers the greatest forecast lead time (10.4 h) for emergency management efforts. Also, note that the total run time on the beta cluster with 64 processors was less than half that measured on the gamma cluster, suggesting that a 512-node cluster with nodes similar to those on the beta cluster would likely result in run times considerably shorter than 1 h.

Fig. 9 presents the Mississippi coastline subject to inundation modeling, and a comparison of the predicted and observed inundation extents. The agreement is surprisingly good considering the simplicity of the boundary forcing and resistance parameterization, which points to the critical importance of topographic variability.

4. Discussion

The preceding test problems show that an SPMD implementation of BreZo, ParBreZo, scales well from a parallel efficiency perspective up to np = 512, the maximum number of processors available to the authors at the time of this study. Moreover, the observed trends suggest that a high level of efficiency can be maintained as np is increased further. This is highly promising for high-resolution inundation modeling, because problems involving substantive spatial and temporal scales can be addressed by engineers and hydrologists in a cost-effective manner, assuming computing costs continue to decline. However, this will require a move away from hydraulic modeling software designed for local, desktop execution and towards software designed with a local interface for interactivity and links to remote servers for compute-intensive number crunching.


ANSYS FLUENT (Ansys Inc., Canonsburg, PA), the well-known computational fluid dynamics package, now offers precisely this capability, so individual users need not maintain their own high performance computing infrastructure. Similarly, this paradigm could be implemented with GPUs, particularly after a wider range of programming languages and data types are supported.

It is important that researchers continue to advance modeling technology for inundation dynamics because there is increasing interest in accurate flood mapping and flood forecasting. The recent National Research Council report Mapping the Zone calls for a national effort to obtain high-resolution topographic data, as well as increasing use of multi-dimensional riverine and coastal flood models where warranted by flow complexity [28]. Interest in real-time flood mapping and flood forecasting is also growing. For example, the Advanced Hydrologic Prediction Service of the National Weather Service now provides real-time inundation mapping at a number of sites across the USA, in Texas and North Carolina in particular. At present these maps are based on simplistic river stage/elevation comparisons, but hydraulic modeling will surely be needed to address the range of complex flow scenarios seen nationally, including the analysis of levee breach scenarios which accompany extreme flooding in many areas.

5. Conclusions

An SPMD implementation of an unstructured grid, Godunov-type, shallow-water flood inundation code is found to scale well on several high performance computing clusters with eight shared memory compute cores per node and up to 64 nodes (512 processors). The clusters varied considerably in components, using either Intel Xeon or AMD Opteron chips, gigabit ethernet or InfiniBand communications, and from 8 to 16 GB RAM per node. A high level of performance is attributed to the explicit solution update procedure, which requires minimal inter-process communication, as well as the SPMD design. An efficiency of approximately 70% is achieved in a practical test problem with extensive wetting and drying, and an efficiency ranging from 70% to 110% is achieved in an idealized test problem using 4 ≤ np ≤ 48 and nc values exceeding 2 million. While performance on an AMD Opteron cluster closely tracks Amdahl's Law for all np tested, results on two different Intel Xeon clusters show a drop in efficiency from np = 1 to 8, the number of cores per node. This is attributed to the hardware including, possibly, hardware parallelism, which exploits idle computing resources on a node when np < 8, or the system bus, which controls access to memory and may become a bottleneck.

In a dam-break flood inundation test problem characterized by extensive wetting and drying, uneven load balancing also acts to restrict parallel efficiency because wet cells require more computational effort than dry cells. This finding echoes previous findings drawn from flood inundation modeling with diffusive wave schemes [20,25]. However, for all cases tested, the efficiency remained relatively constant for runs with more than eight processes up to 512 processes. This suggests that ParBreZo will scale well to much larger computing platforms.

The Metis graph partitioning libraries provide an excellent mechanism for unstructured grid domain decomposition, and weighted partitioning features can be used to enhance the parallel efficiency of a static decomposition by assigning extra weight to wetted computational cells. This requires a preliminary run to identify the wetted cells, but modeling studies usually involve a matrix of simulations to study sensitivities and the propagation of uncertainty. Hence, all but the first simulation can benefit from uneven weighting. A 2:1 weighting of wet:dry cells achieved a 10% increase in efficiency compared to the unweighted (1:1) grid. In this application, this yielded the best performance compared to runs based on 3:2, 3:1 and 5:1 weighting.

A hurricane storm surge inundation test problem shows that a 12 h forecast for a 40 km length of coastline at 10 m resolution can be completed in 1.6 h using np = 512, offering a 10.4 h lead time. Moreover, the Baldwin Hills dam-break flood test problem shows that advanced, high-resolution (ca. 3 m) simulations of urban inundation dynamics can be completed in minutes to support dam safety efforts.

Acknowledgments

This project was made possible by a grant from the National Science Foundation (CMMI-0825165), whose support is gratefully acknowledged. The authors also thank J. Famiglietti, J. Farran, J. Lowengrub and Lawrence Livermore National Laboratory for computing support, D. Hargreaves for providing information about ANSYS FLUENT, and the anonymous reviewers for offering many comments that improved the paper.

References

[1] Amdahl GM. Validity of the single-processor approach to achieving large scale computing capabilities. In: Proceedings, AFIPS SJCC, Reston, VA; 1967.
[2] Begnudelli L, Sanders BF. Unstructured grid finite volume algorithm for shallow-water flow and transport with wetting and drying. J Hydraul Eng 2006;132(4):371–84.
[3] Begnudelli L, Sanders BF. Simulation of the St. Francis dam-break flood. J Eng Mech 2007;133(11):1200–12.
[4] Begnudelli L, Sanders BF, Bradford SF. An adaptive Godunov-based model for flood simulation. J Hydraul Eng 2008;134(6):714–25.
[5] Brown JD, Spencer T, Moeller I. Modeling storm surge flooding of an urban area with particular reference to modeling uncertainties: a case study of Canvey Island, United Kingdom. Water Resour Res 2007;43:W06402. doi:10.1029/2005WR004597.
[6] Bureau of Transportation Statistics. National Transportation Atlas Database; 2009 [accessed February 2010].
[7] Cowles GW. Parallelization of the FVCOM coastal ocean model. Int J High Perform Comput Appl 2008;22(2):177–93.
[8] Federal Emergency Management Agency. HAZUS-HM: FEMA's methodology for predicting potential losses from disasters. Software available on-line [accessed April 2010].
[9] Gallegos HA, Schubert JE, Sanders BF. Two-dimensional, high-resolution modeling of urban dam-break flooding: a case study of Baldwin Hills, California. Adv Water Resour 2009;32:1323–35.
[10] Guinot V. Godunov-type schemes (an introduction for engineers). Elsevier Science; 2003. 508 p.
[11] Hervouet JM. A high resolution 2-D dam-break model using parallelization. Hydrol Process 2000;14:2211–30.
[12] Horritt MJ, Bates PD. Effects of spatial resolution on a raster based model of flood flow. J Hydrol 2001;253:239–49.
[13] Horritt MJ, Bates PD, Mattinson MJ. Effects of mesh resolution and topographic representation in 2D finite volume models of shallow water fluvial flow. J Hydrol 2006;329:306–14.
[14] Hunter NM, Bates PD, Horritt MS, Wilson MD. Simple spatially-distributed models for predicting flood inundation: a review. Geomorphology 2007;90:208–25.
[15] Hunter NM, Bates PD, Neelz S, Pender G, Villanueva I, Wright NG, et al. Benchmarking 2D hydraulic models for urban flood simulations. ICE J Water Manage 2008;161(1):13–30.
[16] Hydraulic Engineering Center, US Army Corps of Engineers. HEC-FDA flood damage reduction analysis users manual; November 2008.
[17] Knabb RB, Rhome JR, Brown DP. Tropical cyclone report, Hurricane Katrina, 23–30 August 2005. National Hurricane Center; 20 December 2005.
[18] Meredith A, Carter J, Pollack C. 2005 Mississippi LiDAR data validation report. I.M. Systems Group, Inc. [accessed February 2010].
[19] Karypis G, Kumar V. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J Sci Comput 1999;20(1):359–92.
[20] Lamb R, Crossley A, Waller S. A fast 2D floodplain inundation model. Water Manage (Proc Inst Civil Eng) 2009;162(6):363–70.
[21] Marks K, Bates PD. Integration of high-resolution topographic data with floodplain flow models. Hydrol Process 2000;14:2109–22.
[22] Mason DC, Horritt MS, Hunter NM, Bates PD. Use of fused airborne scanning laser altimetry and digital map data for urban flood modelling. Hydrol Process 2007;21:1436–47.
[23] Moura e Silva L, Buyya R. Parallel programming models and paradigms. In: Buyya R, editor. High performance cluster computing: programming and applications, vol. 2. Upper Saddle River, NJ: Prentice-Hall; 1999. p. 4–27.

[24] McMillan HK, Brasington J. Reduced complexity strategies for modelling urban floodplain inundation. Geomorphology 2007;90:226–43.
[25] Neal J, Fewtrell T, Trigg M. Parallelisation of storage cell flood models using OpenMP. Environ Model Software 2009;24:872–7.
[26] Neal JC, Fewtrell TJ, Bates PD, Wright NG. A comparison of three parallelisation methods for 2D flood inundation models. Environ Model Software 2010;25:398–411.
[27] National Oceanic and Atmospheric Administration. NOS hydrographic surveys specifications and deliverables; April 2009 [accessed February 2010].
[28] National Research Council. Mapping the zone: improving flood map accuracy. Washington, DC: The National Academies Press; 2009.
[29] Pacheco PS. Parallel programming with MPI. San Francisco: Morgan Kaufmann Publishers, Inc.; 1997.
[30] Pau JC, Sanders BF. Performance of parallel implementations of an explicit finite-volume shallow-water model. J Comput Civil Eng 2006;20(2):99–110.
[31] Samuels P, Klijn F, Dijkman J. An analysis of the current practice of policies on river flood risk management in different countries. Irrigation Drainage 2006;55:S141–50.
[32] Sanders BF. Integration of a shallow-water model with a local time step. J Hydraul Res 2008;46(8):466–75.
[33] Sanders BF, Schubert JE, Gallegos HA. Integral formulation of shallow-water equations with anisotropic porosity for urban flood modeling. J Hydrol 2008;362:19–38.
[34] Shewchuk JR. Triangle: engineering a 2D quality mesh generator and Delaunay triangulator. In: Lin MC, Manocha D, editors. Applied computational geometry: towards geometric engineering. Lecture notes in computer science, vol. 1148. Springer-Verlag; 1996. p. 203–22.
[35] Schubert JE, Sanders BF, Smith MJ, Wright NG. Unstructured mesh generation and landcover-based resistance for hydrodynamic modeling of urban flooding. Adv Water Resour 2008;31:1603–21.
[36] Soares-Frazão S, Lhomme J, Guinot V, Zech Y. Two-dimensional shallow-water model with porosity for urban flood modelling. J Hydraul Res 2008;46(1):45–64.


[37] Smith LC. Emerging applications of interferometric synthetic aperture radar (InSAR) in geomorphology and hydrology. Annals Assoc Am Geograph 2002;92(3):385–98.
[38] Smith MJ, Edwards EP, Priestnall G, Bates PD. Exploitation of new data types to create digital surface models for flood inundation modeling. FRMRC Research Report UR3, FRMRC, UK; 2006.
[39] State of California, Department of Water Resources. Investigation of failure, Baldwin Hills Reservoir; 1964.
[40] Toro EF. Shock-capturing methods for free-surface shallow flows. Chichester, UK: J. Wiley & Sons; 2001.
[41] Transportation Research Board. Criteria for selecting hydraulic models. National Cooperative Highway Research Program Web-Only Document 106; 2006. 720 p.
[42] United States Army Corps of Engineers, Los Angeles District. Report on flood damage and disaster assistance; 1964.
[43] US Census Bureau. Tiger/Line® database [accessed February 2010].
[44] Villanueva I, Wright NG. An efficient multiprocessor solver for the 2D shallow water equations. Hydroinformatics 2006, Nice, France. Additional performance data provided to the authors by N. Wright.
[45] Westerink JJ, Luettich Jr RA, Feyen JC, Atkinson JH, Dawson C, Powell MD, et al. A basin to channel scale unstructured grid hurricane storm surge model as implemented for Southern Louisiana. Monthly Weather Rev 2008;136(3):833–64.
[46] Yu D, Lane SN. Urban fluvial flood modelling using a two-dimensional diffusion-wave treatment, Part 1: Mesh resolution effects. Hydrol Process 2005;20(7):1541–65.
[47] Yu D, Lane SN. Urban fluvial flood modelling using a two-dimensional diffusion-wave treatment, Part 2: Development of a sub-grid-scale treatment. Hydrol Process 2005;20(7):1567–83.
