Parallel Constrained Delaunay Meshing

L. Paul Chew*, Nikos Chrisochoides†, and Florian Sukup‡

Dept. of Computer Science, Cornell University, Ithaca, NY 14853-3801
Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556

Technical Report TR-97-17
December 28, 1997
Abstract
We present a parallel unstructured grid generation method based on the Constrained Delaunay Triangulation (CDT). The Parallel Constrained Meshing Algorithm uses certain edges of the initial mesh as constraints that, without compromising grid quality, help to minimize communication overhead and to eliminate synchronization overhead. By combining the CDT with its data-centric task-parallel implementation we produce a meshing algorithm that requires almost no synchronization. Moreover, experiments show that using the CDT for meshing cuts communication time by a factor of about seven when compared to a similar meshing algorithm that does not use the CDT.

Keywords: parallel grid generation, constrained Delaunay triangulation, runtime systems, data-centric, threads.
* This work was supported by the US-Israel Binational Science Foundation under Grant 94-00279, by DARPA under contract N00014-96-1-0699, and by the Cornell Theory Center, which receives funding from its Corporate Research Institute, NSF, New York State, DARPA, NIH, and IBM Corporation.
† This work was supported by the University of Notre Dame, ND No. 43106, and by the Cornell Theory Center, which receives major funding from the National Science Foundation, IBM Corporation, New York State, and members of its Corporate Research Institute.
‡ This work was supported by the Kurt-Gödel fellowship from the Austrian Ministry of Science and Research and in part by the Cornell Theory Center.
1 Introduction

Delaunay triangulation algorithms have been used very successfully for unstructured grid generation on sequential machines. Delaunay-based algorithms generate unstructured grids by sequentially adding new points and modifying the existing triangulation by means of purely local operations. These algorithms can be viewed as iterative procedures; each iteration performs four basic operations: (i) point creation, where a new point is created using an appropriate spatial distribution technique; (ii) point location, where a triangle that contains the new point is identified; (iii) cavity computation, where those existing triangles whose interaction with the new point violates the Delaunay property are removed; and (iv) element creation, where new triangles are built by properly connecting the newly created point with old points, so that the resulting triangulation satisfies certain geometric properties [PS85]. This type of incremental construction of a Delaunay triangulation is sometimes referred to as the Bowyer-Watson (BW) algorithm (a concrete sketch of one such insertion appears at the end of this section).

In a parallel implementation of the BW algorithm, two new points cannot be added concurrently if their corresponding cavities overlap. In such a case synchronization is required among the processors to produce a unique and valid Delaunay triangulation. The cost of synchronizing the participating processors is the main source of inefficiency in the parallel BW algorithm. Löhner et al. [LJM90] and Cignoni et al. [CLMPS] try to solve this problem by precomputing interprocessor regions which are used as buffer zones to guarantee the uncoupling of the interior regions of the processors. (Löhner defines inter-subdomain regions as the union of elements that share either edges or vertices with elements of another processor.) In this case all processors need to synchronize before they enter the gridding phase of the interprocessor regions. For general domains it is impossible to precompute the interprocessor regions without computing the influence regions (cavities) of the points to be inserted.

For data-parallel Delaunay triangulation [LJM90, CLMPS] the synchronization overhead is exacerbated by the fact that the computation costs associated with the Delaunay triangulation of a region are variable and unpredictable. Therefore, even if the domain is decomposed into subdomains with equal numbers of points or elements, the computational load of the processors might not be evenly distributed. This type of imbalance forces a number of processors to wait idle until all processors reach the synchronization barrier.

In [CS96] we presented a data-centric task-parallel implementation of the Bowyer-Watson algorithm that hides the synchronization overhead without using buffer zones. Figure 1 shows example meshes for the data-parallel approach and the data-centric task-parallel approach [CS96]. The data-parallel approach here is treated as a special case of the data-centric task-parallel approach in which any nonlocal cavity expansions are delayed until all local activities are completed. In this way we break the grid generation into two phases as is done in [LJM90, CLMPS]; the difference here is that the interprocessor regions are not defined a priori.

The algorithm we present in this paper (Section 3) minimizes communication for point location operations even further and eliminates the need for synchronization. The new algorithm, the Parallel Constrained Meshing Algorithm, is essentially a combination of the circumcenter method [Ch93, Ru93] and the Constrained Delaunay Triangulation (CDT).
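For concreteness, the following Python sketch (our own illustrative code, not from the paper) spells out steps (iii) and (iv) of a BW insertion; point location (step ii) is reduced to a brute-force scan over all triangles, so the sketch trades efficiency for brevity.

    from itertools import combinations

    def ccw(tri):
        # Return tri with its three (x, y) vertices in counterclockwise order.
        (ax, ay), (bx, by), (cx, cy) = tri
        if (bx - ax) * (cy - ay) - (by - ay) * (cx - ax) < 0:
            return (tri[0], tri[2], tri[1])
        return tri

    def in_circumcircle(tri, p):
        # Standard 3x3 incircle determinant; positive iff p is strictly
        # inside the circumcircle of a counterclockwise triangle.
        (ax, ay), (bx, by), (cx, cy) = ccw(tri)
        px, py = p
        adx, ady = ax - px, ay - py
        bdx, bdy = bx - px, by - py
        cdx, cdy = cx - px, cy - py
        det = ((adx * adx + ady * ady) * (bdx * cdy - cdx * bdy)
             + (bdx * bdx + bdy * bdy) * (cdx * ady - adx * cdy)
             + (cdx * cdx + cdy * cdy) * (adx * bdy - bdx * ady))
        return det > 0

    def bowyer_watson_insert(triangles, p):
        # One BW iteration on a set of triangles (each a tuple of three points).
        # (iii) cavity computation: triangles whose circumcircle contains p.
        cavity = {t for t in triangles if in_circumcircle(t, p)}
        # The cavity boundary: edges belonging to exactly one cavity triangle.
        count = {}
        for t in cavity:
            for e in combinations(sorted(t), 2):
                count[e] = count.get(e, 0) + 1
        boundary = [e for e, c in count.items() if c == 1]
        # (iv) element creation: connect p to every boundary edge.
        return (triangles - cavity) | {(e[0], e[1], p) for e in boundary}

Starting from a super-triangle that encloses all input points and calling bowyer_watson_insert once per point yields a Delaunay triangulation; production implementations use triangle adjacency to grow the cavity locally instead of scanning.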
Figure 1: Top-Left: The initial boundary-conforming triangulation distributed over two processors. Top-Right: Final mesh using the data-centric task-parallel approach. Bottom-Left: Mesh after the first phase of the data-parallel approach. Bottom-Right: Final mesh using the data-parallel approach, after gridding of the inter-subdomain regions.
Intuitively, the CDT is as close as one can get to the Delaunay triangulation given that one is stuck with certain edges. This idea can be made mathematically precise (see for instance [Ch89a]), leading to Delaunay-based grid generators that allow internal boundaries. It is possible to show that these internal boundaries do not affect the quality of the resulting grid, although an observant user might infer the presence of such boundaries in the resulting grid by noting the way in which triangle edges are aligned.

Using the idea of a CDT, we can introduce artificial boundaries into the region to be meshed, creating one subregion for each processor. Note that, by the definition of the CDT, points on one side of a boundary have no effect on triangles on the other side of the boundary; thus, no synchronization is required during the element creation process. In addition, interprocess communication is tremendously simplified: the only message between adjacent processes is of the form, "Split this constraint edge." Since constraint edges are always split exactly in half, no additional information needs to be communicated.

We have implemented two versions of our meshing algorithm, one which uses the CDT and one which does not. Experimental results for these algorithms are described in Section 4. Our experiments show that the use of the CDT saves on communication costs by a factor of about seven, while maintaining the same quality guarantees for the resulting grid.

One might suppose that sequential grid generation algorithms are sufficient for all practical use, since meshing is usually much less time-consuming than the iterative process used for analysis on the resulting mesh; but for many important problems, sequential mesh generation is a significant bottleneck. Scalable parallel grid generation is of strategic importance for the scalability of existing parallel adaptive field solvers, since the bottleneck there is the I/O for reading sequentially generated meshes.
2 A Parallel Meshing Algorithm

In [CS96] our main objective was to identify a programming model that leads to a very efficient implementation of the element creation step on distributed memory multicomputers. The algorithm in [CS96], since it paid no attention to the point insertion strategy, generated unstructured meshes with "bad" triangles: triangles which are too large or have an angle that is too small. In this section we revisit that algorithm and improve the grid quality by adding the circumcenters of such bad triangles. We refer to this technique as the circumcenter method. It has been shown [Ch93, Ru93] that this strategy, in combination with some other techniques for handling the boundaries, leads to meshes in which all triangles have angles between 30 and 120 degrees, except for those angles required by small angles along the specified boundary. There are some minor restrictions on the class of regions that can be meshed with this method. In particular, small angles along the region's external boundary can cause problems. Such small angles can be "lopped off" before meshing begins; after meshing, the lopped-off portions can be returned in a postprocessing step. This strategy has been used by Ruppert [Ru93].
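The driving test of the circumcenter method can be sketched in a few lines of Python; this is our reading of [Ch93, Ru93], and the density bound max_edge and all helper names are our own assumptions rather than the paper's interface.

    import math

    def circumcenter(a, b, c):
        # Circumcenter of triangle abc, standard closed form
        # (assumes a nondegenerate triangle, i.e. d != 0).
        ax, ay = a; bx, by = b; cx, cy = c
        d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
        ux = ((ax * ax + ay * ay) * (by - cy) + (bx * bx + by * by) * (cy - ay)
              + (cx * cx + cy * cy) * (ay - by)) / d
        uy = ((ax * ax + ay * ay) * (cx - bx) + (bx * bx + by * by) * (ax - cx)
              + (cx * cx + cy * cy) * (bx - ax)) / d
        return (ux, uy)

    def min_angle_deg(a, b, c):
        # Smallest angle of triangle abc in degrees (law of cosines).
        la, lb, lc = math.dist(b, c), math.dist(a, c), math.dist(a, b)
        cosines = ((lb * lb + lc * lc - la * la) / (2 * lb * lc),  # at a
                   (la * la + lc * lc - lb * lb) / (2 * la * lc),  # at b
                   (la * la + lb * lb - lc * lc) / (2 * la * lb))  # at c
        return min(math.degrees(math.acos(max(-1.0, min(1.0, v))))
                   for v in cosines)

    def is_bad(a, b, c, max_edge):
        # 'Bad' triangle: smallest angle under 30 degrees, or too large
        # (max_edge stands in for whatever density control is in force).
        too_big = max(math.dist(a, b), math.dist(b, c), math.dist(a, c)) > max_edge
        return min_angle_deg(a, b, c) < 30.0 or too_big

The algorithm repeatedly picks a triangle for which is_bad holds and inserts its circumcenter, subject to the boundary handling described in Section 3.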
As in [CS96] the mesh is treated as a distributed object that is built upon a hierarchy of data objects. The data objects themselves are distributed, with one exception: the vertices. Vertices are scalar objects entirely "owned" by a single processor. The distribution of newly created objects to processors is guided by a set of distribution rules, D-rules [CS96]. The processor that "owns" a vertex, or is responsible for inserting a newly created point (vertex), performs most of the computation and communication associated with the modification of the original mesh (the data-centric approach). This processor coordinates (synchronization overhead) with other processors that "own" elements or vertices whose status needs to be updated in order to guarantee a single conforming mesh. The coordination phase is implemented using a task-parallel programming paradigm: each time a processor needs to update, modify, or request information about objects that it does not "own", it spawns a task (or a remote service request) that will perform the necessary action.

The data-centric task-parallel programming model treats the computation as a collection of computational tasks that can be created, deleted, suspended, and scheduled at runtime. These tasks are classified with respect to their state as blocking or ready. Blocking tasks are waiting for the completion of pending requests to other processors. They do not block the processor's CPU; instead, they are stored in the blocking queue, freeing the processor's resources for the execution of other tasks. A suspended task is moved from the blocking queue to the ready queue when all of its remote requests have been serviced. The ready queue is organized in FIFO order.

The main loop of the Parallel Meshing Algorithm is given below. An example mesh produced by this algorithm appears at the Top-Right in Figure 1.

    REPEAT
        Complete suspended local tasks whose pending remote requests
        have been serviced, and service the remaining remote requests
        WHILE (there are `bad' triangles in the mesh) DO
            Insert new point associated with next `bad' triangle
            Compute the cavity for the new point and add
            uncompleted tasks to the blocking queue
            Delete the triangles within the cavity
            Create new triangles by connecting the new point
            to the edges of the cavity
            Service new remote requests that were accumulated in the
            ready queue while local computation was performed
            Complete suspended local tasks whose pending remote requests
            have been serviced, and service the remaining remote requests
            /* This completion includes the creation of new triangles,
               updates of local data, etc. */
        ENDWHILE
        Check termination conditions for the processor
        and set termination flag to true if all local work is
        completed or no more remote requests are on the way
        /* synchronization */
    UNTIL DONE
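The blocking-queue/ready-queue discipline used by this loop can be summarized with a small scheduler sketch; this is our own simplification of the data-centric task-parallel model, not the actual runtime interface.

    from collections import deque

    class Task:
        # A resumable unit of work; 'step' is a generator that yields the
        # number of remote service requests (RSRs) it has just issued.
        def __init__(self, step):
            self.step = step
            self.pending = 0           # outstanding remote requests

    class Scheduler:
        def __init__(self):
            self.ready = deque()       # ready queue, FIFO order as in the paper
            self.blocked = set()       # tasks parked on pending remote requests

        def spawn(self, task):
            self.ready.append(task)

        def run_one(self):
            # Run the next ready task until it blocks or completes.
            if not self.ready:
                return False
            task = self.ready.popleft()
            try:
                issued = next(task.step)       # run until it issues RSRs
                if issued > 0:
                    task.pending = issued
                    self.blocked.add(task)     # parked; the CPU stays free
                else:
                    self.ready.append(task)    # voluntary yield; still runnable
            except StopIteration:
                pass                           # task ran to completion
            return True

        def on_reply(self, task):
            # Called when one remote request for 'task' has been serviced.
            task.pending -= 1
            if task.pending == 0 and task in self.blocked:
                self.blocked.remove(task)
                self.ready.append(task)        # all replies in: ready again

A cavity-expansion task, for example, would yield the number of remote triangles it has requested and would be re-enqueued by on_reply once the last reply arrives.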
Figure 2: Left and right columns depict the computations performed by processors P0 and P1, respectively. Both processors start at time T0 with the initial mesh in the top row. Processor P0 at time T1 starts a cavity expansion for the point O0. Before that cavity expansion is completed by processor P0, processor P1 starts a new cavity expansion for the point O1 at time T2 and tries to expand its cavity by acquiring triangles that it owns. Some of these triangles are locked (as in use) by P0.

The main problems for an algorithm like this are: (i) synchronization overheads and (ii) setbacks in the progress of the algorithm due to the locking of objects by concurrent processes. We addressed the synchronization problem in [CS96] by using the data-centric task-parallel programming model: tasks that wait for completion are swapped out until they become ready, and in the meantime other ready tasks use the CPU. The second problem, setbacks due to locking, is more difficult; an example is described in Table 1. Locking occurs when a processor attempts to expand a cavity using triangles that are marked as in use by another processor; Figure 2 depicts such an instance.
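Schematically, the locking discipline behind these setbacks works as follows: a processor must acquire every triangle of a cavity, and on encountering a triangle held by a concurrent expansion it must release everything acquired so far, discarding that work. A single-address-space Python sketch (all names are ours):

    def try_expand_cavity(candidates, locked):
        # Attempt to lock every triangle needed by a cavity. If any triangle
        # is already in use by a concurrent expansion, release all locks
        # acquired so far (the 'setback' of Table 1) and give up.
        acquired = []
        for tri in candidates:
            if tri in locked:          # held by another cavity expansion
                for t in acquired:     # undo: release everything taken so far
                    locked.discard(t)
                return None            # wasted work; the caller must retry
            locked.add(tri)
            acquired.append(tri)
        return acquired                # whole cavity locked; safe to retriangulate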
Table 1: An example showing the flow of control for a cavity expansion that initiates at processor 2. After a number of remote service requests the cavity fails to expand, due to objects locked by other cavity expansions taking place at the same time on different processors. This is an example of a setback in the progress of the algorithm, resulting in wasted resources and time. "rem cav exp" stands for a remote cavity expansion; time flows top to bottom.

    Processor 2:  cav exp start, initiate rem cav exp to 0
    Processor 0:  rem cav exp (from 2), initiate rem cav exp to 2 and 3
    Processor 2:  rem cav exp (from 0)
    Processor 3:  rem cav exp (from 0), initiate rem cav exp to 1
    Processor 1:  rem cav exp (from 3), initiate rem cav exp to 0 and 3
    Processor 0:  rem cav exp (from 1), initiate process for collecting data to 2
    Processor 3:  rem cav exp (from 1), cav exp stopped, initiate process for
                  collecting data to 2
    Processor 2:  process collect data (from 3), can not be completed
    Processor 2:  process collect data (from 0) completed, initiate rem release
                  to 0, 1 and 3
    Processors 0, 1, 3:  rem release (from 2)

To solve this problem we use the Parallel Constrained Meshing Algorithm described in the next section.
3 Parallel Constrained Meshing Algorithm

We use a Constrained Delaunay Triangulation (CDT) to generate the mesh. As mentioned in the introduction, at an intuitive level a CDT is as close as one can get to a Delaunay triangulation given that one is stuck with certain edges. In this case, the edges we are stuck with are those that correspond to boundaries of the region to be meshed and those that correspond to boundaries between processors.

The boundaries between processors must be created during a preprocessing step. The problem is not trivial, since the boundaries must satisfy certain properties:

- The boundaries should be placed in such a way that the processors each get about the same amount of work.
- Boundaries should be well-spaced. A small gap between boundaries leads to small triangles, possibly unnecessarily small.
- The boundaries should not form small angles with each other or with the region boundaries. For the meshing algorithm to work correctly, there should be no small angles between constraint edges.
- Interprocessor boundaries do not have to be simple line segments. The boundaries can be polylines or even curves, although line segments are probably sufficient for most situations. Our examples have used simple line segments.

For the examples we have done, it has been relatively easy to create interprocessor boundaries by hand. For larger and more complex meshes this should be automated. We expect that some of our previous work [CHR94, CHHPKR] on the problem of decomposing geometries for mapping PDE computations onto distributed memory machines will be useful here. Note that the problem of balancing processor work loads cannot be entirely solved with a priori domain decomposition: when doing adaptive remeshing, processor loads change at runtime. One framework for easy and efficient balancing of processor loads at runtime appears in [CH97].

The Parallel Constrained Meshing Algorithm is most closely related to a meshing algorithm due to Ruppert [Ru93] which is, in turn, based on earlier work of Chew [Ch89b]. In Ruppert's method each boundary edge has a neighborhood around it; this neighborhood is defined to be the circle for which the boundary edge is a diameter. If a point site is within this neighborhood then the site is said to encroach upon the boundary edge. When an edge is encroached upon, the meshing algorithm splits the edge in half. Note that our implementation of the preceding algorithm (the Parallel Meshing Algorithm) also uses this encroachment technique for edges that are external boundaries, although that detail was suppressed in the pseudo-code given above.

Here is an outline of the main loop of the Parallel Constrained Meshing Algorithm. Figure 3 shows a mesh that results from this algorithm. Note that the only kind of remote service request is `Split a Constraint Edge'. An outline of the code for `Split a Constraint Edge' follows the main loop.

    REPEAT
        Split any constraint edges waiting to be split
        WHILE (there are `bad' triangles in the mesh) DO
            Choose a `bad' triangle
            Create triangle's circumcenter as a new point
            Compute the cavity of the new point
            IF new point does not encroach on any constraint edges THEN
                Delete the triangles within the cavity
                Create new triangles by connecting the new point
                to the edges of the cavity
            ELSE /* new (circumcenter) point is ignored */
                Split the constraint edge
                /* This can generate two `split' requests: one for this
                   processor and a remote service request for the
                   adjacent processor (if there is one) */
            Split any constraint edges waiting to be split
        ENDWHILE
        Check termination conditions for the processor
        and set termination flag to true if all local work is
        completed or no more remote requests are on the way
        /* synchronization */
    UNTIL DONE

    Split a Constraint Edge (edge, midpoint):
        Compute the cavity of the midpoint
        Delete the cavity triangles
        Create new triangles by connecting the midpoint
        to the edges of the cavity
        IF the midpoint encroaches on some other constraint edge THEN
            Split that constraint edge
            /* Generates up to 2 `split' requests: one local, one remote */
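The encroachment test in the loop above has a compact form: a point p lies strictly inside the circle having segment ab as a diameter exactly when the angle apb is obtuse, that is, when the dot product (a - p) . (b - p) is negative. A sketch of this diametral-circle rule (our formulation; a robust implementation would use exact arithmetic predicates):

    def encroaches(p, a, b):
        # True if point p lies strictly inside the diametral circle of
        # segment ab, i.e. the circle for which ab is a diameter.
        return ((a[0] - p[0]) * (b[0] - p[0]) +
                (a[1] - p[1]) * (b[1] - p[1])) < 0.0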
Figure 3: Top: Initial boundary-conforming mesh. Middle: Treated as two independent meshes. Bottom: Treated as a single mesh with a constraint boundary between the two regions.

Note that for the Parallel Constrained Meshing Algorithm, tasks are not blocked or suspended; each task can run to completion. Blocking occurs in the simpler Parallel Meshing Algorithm when a cavity expands across processor boundaries. That does not happen here, because processor boundaries are constraint edges; thus, cavities always stop at processor boundaries. Note that when a new circumcenter point is too close to a processor boundary we do not suspend the process; instead, we throw away the circumcenter point, initiating up to two requests to split the constraint edge.

During the meshing activity, adjacent processors can disagree about how many times a shared boundary has been split. This occurs, for instance, when one processor still has split-requests in its queue of remote service requests while the adjacent processor has already completed the corresponding split-requests. Fortunately, this does not cause any synchronization problems: since splits are never undone, the processors eventually come into agreement with each other. Note, though, that adjacent processors can each make separate, independent decisions to split a particular constraint edge; in this situation, it is important that the two processors agree on the identity of the midpoint and that, once a constraint edge is split, an additional request to split the same edge is ignored rather than treated as an error.
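The agreement property just described reduces to two small invariants: midpoints are computed deterministically from the edge endpoints, and split requests are idempotent. A Python sketch of the bookkeeping (our own data structure, not the paper's):

    class ConstraintLedger:
        # Records which constraint edges have been split, so that a duplicate
        # split request (one local, one from the adjacent processor) is a
        # no-op rather than an error.
        def __init__(self):
            self.split_edges = set()

        def handle_split(self, edge):
            key = tuple(sorted(edge))     # same canonical identity on both sides
            if key in self.split_edges:
                return None               # already split: silently ignore
            self.split_edges.add(key)
            (ax, ay), (bx, by) = key
            # The exact midpoint; both processors compute the identical point.
            return ((ax + bx) / 2.0, (ay + by) / 2.0)

The caller inserts the returned midpoint via `Split a Constraint Edge', replaces the edge by its two halves, and forwards a split request for the remote half when the edge is shared.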
4 Preliminary Performance Data

The implementations of the Parallel Meshing Algorithm (based on the Parallel Bowyer-Watson Algorithm [CS96]) and the Parallel Constrained Meshing Algorithm were done on the IBM 9076 SP2. The SP2 is a distributed-memory MIMD machine that bridges the gap between workstation clusters and supercomputers by using a High Performance Switch (HPS) that enables high-bandwidth, low-latency communication between very powerful processors; consequently, our implementation can be ported onto clusters of workstations with few modifications.

For message passing we used the Data-Movement and Control Substrate (DMCS) [CKP97]. DMCS consists of three modules: (i) a threads module, (ii) a communication module, and (iii) a control module. The communication module, implemented on top of Active Messages [vCGS92], provides the necessary support for the implementation of a global address space over both shared and distributed memory machines. The control module provides support for remote procedure invocation, also known as remote service requests (RSRs). Remote procedures are either system/compiler procedures or user-defined handlers; these handlers can be threaded or nonthreaded. The costs of some primitive operations of DMCS are: (i) thread creation time = 12 µs, (ii) context switch time = 5.5 µs, (iii) peak data-transfer bandwidth = 33.6 MBytes/sec, (iv) one-way latency for a 0-byte message = 29 µs, and (v) time elapsed for a nonthreaded null remote service request = 31 µs.

The use of nonlocal data objects introduces two types of overhead: (i) communication overhead, due to the creation of RSRs and their release from the CPU to the network, and due to the retrieval of RSRs from the network and their storage in the ready queue of tasks; and (ii) undo-computation overhead, due to the release of cavities when some local or nonlocal requested object happens to be locked through its participation in other cavity creations taking place concurrently on remote processors. For the Parallel Meshing Algorithm with input geometry as shown in Figure 1, the overheads for a mesh of about 65.5K sites and 129.4K elements on four processors are about 27% due to communication overhead and about 1% due to undo-computation overhead.

We have not yet been able to perform as many experiments as we would like; for instance, we would like to examine the effects of more complex geometries, different types of partitions, and larger numbers of processors. We are working on this type of analysis, which will appear elsewhere.

Tables 2 and 3 give the execution times (in seconds) and the number of RSRs for the Parallel Meshing Algorithm and the Parallel Constrained Meshing Algorithm on an 8-node SP2 multicomputer using "Thin" nodes. These execution times are from an example with a very imbalanced (two-to-one) work load among the processors; an improvement in balancing would improve the times for both algorithms. As expected, this data shows that the algorithm using the constraint edges is much better for generating grids on multicomputers.
Table 2: Total execution time in seconds and number of RSRs for the Parallel Meshing Algorithm on an 8-node SP2, using up to 8 nodes, for a 2-dimensional mesh with 22K sites and 43K elements.

    Number of Nodes     1      2      4      8
    Execution Time     8.51   4.38   2.89   1.91
    Number of RSRs     0      516    807    866

Table 3: Total execution time in seconds and number of RSRs for the Parallel Constrained Meshing Algorithm on an 8-node SP2, using up to 8 nodes, for a 2-dimensional mesh with 22K sites and 43K elements.

    Number of Nodes     1      2      4      8
    Execution Time     8.51   3.59   2.29   1.48
    Number of RSRs     0      63     126    126

On 8 nodes the RSR counts, 866 versus 126, account for the factor of about seven in communication cited earlier (866/126 is roughly 6.9).
5 Conclusions

The development and implementation of efficient parallel algorithms for unstructured grids are very challenging problems. It is difficult to develop high-quality meshing algorithms even in the absence of parallelism; for parallel meshing, one must handle communication and synchronization among the processors in addition to the geometric meshing requirements. Moreover, the need for varying mesh densities means that computation and communication costs are variable and unpredictable across subregions of the mesh.

We have presented the Parallel Constrained Meshing Algorithm, a parallel algorithm based on the Constrained Delaunay Triangulation (CDT) which simplifies and minimizes the communication required for the parallel generation of unstructured meshes. Preliminary experiments show that, by using the CDT, we reduce communication costs by a factor of seven over the costs for a similar parallel meshing algorithm that does not use the CDT. Moreover, use of the CDT eliminates the synchronization required to create a conforming mesh. As expected, the CDT-based algorithm is significantly faster than the non-CDT algorithm; this speedup is achieved without compromising the high quality of the resulting mesh.
6 Future Work
Automating the creation of artificial boundaries. We plan to develop and analyze methods for creating the artificial boundaries needed between processor regions. The goal is to find ways to do this so that work is balanced without introducing new small angles or small mesh elements.

Dynamic load balancing. We cannot always tell, a priori, where artificial boundaries should be placed for the best load balance. Indeed, when doing adaptive remeshing the mesh density changes with time. Thus, it is desirable to shift the artificial mesh boundaries as the algorithm runs. We need to determine how best to do this.

Three dimensions. The goal is to create a 3D analog of the Parallel Constrained Meshing Algorithm. The major difficulty here is that the Constrained Delaunay Triangulation does not generalize to three dimensions: there are simple examples for which it is impossible to create a valid 3D triangulation that respects the given boundaries. It is possible, however, by subdividing the initial boundary surfaces, to make use of a pseudo-CDT. We are exploring how this use of a pseudo-CDT affects the generalization of the Parallel Constrained Meshing Algorithm to 3D.
References

[Ch89a] L. Paul Chew, Constrained Delaunay Triangulations, Algorithmica, 4 (1989), 97-108.

[Ch89b] L. Paul Chew, Guaranteed-Quality Triangular Meshes, Department of Computer Science Tech Report TR 89-983, Cornell University, 1989.

[Ch93] L. Paul Chew, Guaranteed-Quality Mesh Generation for Curved Surfaces, Proceedings of the Ninth Symposium on Computational Geometry (1993), ACM Press, 274-280.

[CH97] Nikos Chrisochoides and Chris Hawblitzel, Mobile Object Layer: A Framework for Dynamic Load Balancing Parallel Adaptive Computations. To be submitted to the Journal of Parallel and Distributed Computing, Special Issue on Dynamic Load Balancing.

[CHHPKR] Nikos Chrisochoides, C.E. Houstis, E.N. Houstis, P. Papachiou, S.K. Kortesis, and J.R. Rice, DOMAIN DECOMPOSER: A Software Tool for Mapping PDE Computations to Parallel Architectures, Proceedings of the 4th International Symposium on Domain Decomposition Methods, Moscow, USSR, May 1990.

[CHR94] Nikos Chrisochoides, Elias Houstis, and John Rice, Mapping Algorithms and Software Environment for Data Parallel PDE Iterative Solvers, Journal of Parallel and Distributed Computing, Special Issue on Data-Parallel Algorithms and Programming, Vol. 21, No. 1, April 1994, 75-95.

[CKP97] Nikos Chrisochoides, Induprakas Kodukula, and Keshav Pingali, Data Movement and Control Substrate for Parallel Scientific Computing, Proceedings of the Workshop on Communication and Architectural Support for Network-Based Parallel Computing, February 1997.

[CLMPS] P. Cignoni, D. Laforenza, C. Montani, R. Perego, and R. Scopigno, Evaluation of Parallelization Strategies for an Incremental Delaunay Triangulator in E³, Concurrency: Practice and Experience, 7(1) (1995), 61-80.

[CS96] Nikos Chrisochoides and Florian Sukup, Task Parallel Implementation of the Bowyer-Watson Algorithm, Proceedings of the Fifth International Conference on Numerical Grid Generation in Computational Fluid Dynamics and Related Fields (1996), 773-782.

[LJM90] Rainald Löhner, Jose Camberos, and Marshal Merriam, Parallel Unstructured Grid Generation, in Unstructured Scientific Computation on Scalable Multiprocessors, ed. Piyush Mehrotra and Joel Saltz, MIT Press, 1990, 31-64.

[PS85] F. Preparata and M. Shamos, Computational Geometry: An Introduction, Springer-Verlag, 1985.

[Ru93] J. Ruppert, A New and Simple Algorithm for Quality 2-Dimensional Mesh Generation, Proceedings of the 4th ACM-SIAM Symposium on Discrete Algorithms (1993), 83-92.

[vCGS92] Thorsten von Eicken, David E. Culler, Seth Copen Goldstein, and Klaus Erik Schauser, Active Messages: A Mechanism for Integrated Communication and Computation, Proceedings of the 19th International Symposium on Computer Architecture, ACM Press, May 1992.