Chinese Journal of Oceanology and Limnology Vol. 28 No. 6, P. 1340-1349, 2010 DOI: 10.1007/s00343-010-9937-x
Parallel hydrodynamic finite element model with an N-Best refining partition scheme*
ZHANG Zhenchang (张振昌)1, HONG Huasheng (洪华生)2, WAI Onyx Winghong3, JIANG Yuwu (江毓武)2,**, ZHOU Changle (周昌乐)1
1 Department of Computer Science, Xiamen University, Xiamen 361005, China
2 State Key Laboratory of Marine Environmental Science (Xiamen University), Xiamen 361005, China
3 Department of Civil and Structural Engineering, the Hong Kong Polytechnic University, Hong Kong, China
Received Oct. 10, 2009; revision accepted Dec. 14, 2009 © Chinese Society for Oceanology and Limnology, Science Press, and Springer-Verlag Berlin Heidelberg 2010
Abstract We enhance a robust parallel finite element model for coastal and estuarine applications by using an N-Best refinement algorithm in a multilevel partitioning scheme. Graph partitioning is an important step in constructing the parallel model, in which computation speed is a major concern. The partitioning strategy divides the research domain into several semi-equal-sized sub-domains while minimizing the sum of the weights of the edges between different sub-domains. Multilevel schemes for graph partitioning are divided into three phases: coarsening, partitioning, and uncoarsening. For the uncoarsening phase, many refinement algorithms have been proposed previously, such as KL, Greedy, and Boundary refinements. In this study, we propose an N-Best refinement algorithm and show its advantages in a case study of Xiamen Bay. Compared with the original partitioning algorithm in previous models, the N-Best algorithm speeds up the computation by 1.9 times, and the simulation results match the in-situ data well.
Keywords: N-Best; graph partitioning; parallel computation; finite element method (FEM); domain decomposition
1 INTRODUCTION
Parallel computation is widely used in modern oceanographic models, which tend to handle larger domains with finer grids. The domain decomposition method separates the computation region into several sub-domains, each of which can be handled independently by a processor. Therefore, the entire domain can be run efficiently either on parallel computers/clusters or on a single computer with multiple cores/processors by using the Message Passing Interface (MPI). This adaptability appeals to many researchers who want to make their oceanography models run faster. Graph partitioning plays a key role in the domain decomposition method and can speed up the total computation significantly. The objective of graph partitioning for ocean models is to divide the finite element mesh into k semi-equal-sized sub-domains, each handled independently by a different processor, while keeping the sum of the weights of the edges between different sub-domains to a minimum. In this way, the communication cost between computation processors is also minimized. A graph partitioning problem can be defined as follows: given a graph G=(V, E) with |V|=n (where V is the set of nodes, n is the total number of nodes, and E is the set of edges), find a partition P that divides V into k balanced sub-domains V_1, V_2, …, V_k such that
$$V_i \cap V_j = \emptyset \ \ (i \neq j), \qquad |V_i| = n/k, \qquad \sum_{i=1}^{k} |V_i| = |V|,$$
and the number of edges in E whose nodes belong to different sub-domains is minimized. The objective can be written as:
$$\arg\min_{P}\ \text{edge-cut} = \arg\min_{P}\ \left| \{ e_{i,j} \mid e_{i,j} \in E,\ v_i \in V_m,\ v_j \in V_n,\ m \neq n \} \right|$$
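The two quantities in this definition, the edge-cut and the sub-domain sizes, are easy to evaluate for a given partition. The following is a minimal sketch on a hypothetical 2 x 4 grid mesh with unit edge weights; the function names and the example mesh are illustrative assumptions, not part of WLJPM or METIS.

```python
from collections import Counter

def edge_cut(edges, part):
    """Number of edges whose two end nodes lie in different sub-domains."""
    return sum(1 for u, v in edges if part[u] != part[v])

def sub_domain_sizes(part, k):
    """Return |V_1|, ..., |V_k| for a node -> sub-domain mapping."""
    counts = Counter(part.values())
    return [counts.get(i, 0) for i in range(k)]

# Hypothetical 2 x 4 grid mesh: nodes 0-3 form the top row, 4-7 the bottom row.
edges = [(0, 1), (1, 2), (2, 3), (4, 5), (5, 6), (6, 7),
         (0, 4), (1, 5), (2, 6), (3, 7)]

# Two balanced two-way partitions of the same mesh.
by_rows = {v: 0 if v < 4 else 1 for v in range(8)}
by_columns = {v: 0 if v % 4 < 2 else 1 for v in range(8)}

print(edge_cut(edges, by_rows), sub_domain_sizes(by_rows, 2))
print(edge_cut(edges, by_columns), sub_domain_sizes(by_columns, 2))
```

Both partitions are perfectly balanced, but splitting by columns cuts 2 edges while splitting by rows cuts 4, so the column split would require less inter-processor communication.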
∗ Supported by the National Natural Science Foundation of China (Nos. 40406005, 41076001, 40440420596) ∗∗ Corresponding author:
[email protected]
No.6
ZHANG et al.: Parallel hydrodynamic finite element model with N-Best refining partition scheme
Given a partition P, the number of edges whose nodes belong to different sub-domains is called the edge-cut of the partition. Therefore, the communication between the Master and Slave processes can be reduced when the partition algorithm minimizes the edge-cut. In this way, the efficiency of the parallel finite element ocean model improves as the number of edges between different sub-domains is reduced. Meanwhile, the number of nodes in each sub-domain, which represents the computation load, is kept balanced. Graph partitioning is an NP-complete problem. Kernighan et al. (1970) indicated that there are
$$\frac{1}{k!}\binom{n}{p}\binom{n-p}{p}\cdots\binom{2p}{p}\binom{p}{p} \qquad (\text{where } kp = n)$$
ways of choosing the subsets. Nevertheless, many algorithms have been developed to achieve reasonably good partitions. Farhat (1988) presented a simple and efficient non-numerical algorithm to automatically decompose a finite element domain into a specified number of balanced sub-domains. Compared with the non-numerical method, the spectral partitioning algorithm can generate better partitions. Pothen et al. (1990; 1992) presented such a method, which involves computing an eigenvector of the Laplacian matrix associated with a graph and bisects the domain at each stage of the recursive decomposition. This recursive decomposition algorithm was enhanced by Hendrickson et al. (1995b), whose approach uses multiple eigenvectors and can divide a computation domain into four or eight parts at each stage. Although this spectral method can generate a domain partition leading to fast parallel computations, it is time consuming because the eigenvectors of the matrix must be computed. In the 1990s, graph theory was applied to domain partitioning: the finite element mesh was converted into a graph, and the structure of the graph was analyzed for domain decomposition. Hsieh et al. 
(1995) presented a recursive spectral partitioning algorithm with a graph partitioning approach, which used the recursive spectral bisection (RSB) algorithm to divide the computing domain into a number of partitions equal to an integer power of two. Jones et al. (1994) presented another heuristic partitioning approach, unbalanced recursive bisection (URB), which involves geometric information. The algorithm was developed from the orthogonal recursive bisection (ORB) algorithm proposed by Berger et al. (1987). The ORB algorithm recursively makes orthogonal cuts to divide the domain into two
rectangles of equal size, whereas the URB algorithm instead chooses an unequal partition with less edge-cut (the number of edges in the graph whose nodes belong to different sub-domains). Both algorithms can lead to balanced partitions. Barnard et al. (1993) and Hendrickson et al. (1995c) introduced the multilevel schemes, which have three phases: coarsening, partitioning, and uncoarsening. Karypis et al. (1995a; 1996; 1998) and Baños et al. (2004) adopted the multilevel technique to reduce computational time. Schloegel et al. (2001) presented two adaptive repartitioning methods to improve the performance of the multilevel schemes. In this paper, we focus on the uncoarsening phase and present an N-Best refinement algorithm. Our N-Best algorithm is a variant of the KL refinement algorithm (Kernighan et al., 1970; Fiduccia et al., 1982; Hendrickson et al., 1995c) that records the n best partitions in memory. This modification gives the algorithm a greater ability to escape local optima and find better partitions. Although multilevel schemes already achieve quite reasonable partitions, the N-Best algorithm improves the partitions further. To validate the partitions of the N-Best refinement algorithm, we apply them to the WLJPM model (Wai et al., 2000; Jiang et al., 2003; Jiang et al., 2005) and obtain a more efficient model of hydrodynamic processes in Xiamen Bay.
2 THE PRELIMINARIES
2.1 WLJPM
WLJPM is a three-dimensional finite element sigma coordinate model for estuarine and coastal regions. The model uses a parallelizable hybrid operator splitting technique to discretize the governing equations. Within one computation time step for solving the momentum conservation equation, an explicit Eulerian-Lagrangian scheme deals with the advection term, a finite element method (FEM) handles the horizontal diffusion term, and a finite difference method (FDM) handles the vertical diffusion term. The continuity equation is solved by the implicit FEM to achieve a more stable simulation (Lu et al., 1998). Without reducing stability or accuracy, the model is parallelized by a mixed strip/patch domain decomposition and a Master-Slave communication approach using MPI (Wai et al., 2000). Jiang et al. (2003) solved the two large-scale finite element linear systems of the model with the preconditioned
CHIN. J. OCEANOL. LIMNOL., 28(6), 2010
conjugate gradient method (PCG) instead of the Gauss-Jacobi iteration method (GJI) (Kamath et al., 1984; Saad et al., 1985; Anderson, 1988; Aykanat et al., 1988; Kim et al., 1991). Besides the numerical methods mentioned above, model development has focused on sediment deposition/erosion (Wai et al., 2004b), wetting and drying (Jiang et al., 2005), and wave-current coupling (Wai et al., 2004a). In this paper, the mixed strip/patch domain mesh partition method presented by Wai et al. (2000) is referred to as the original model for comparison. In the next section, a graph partitioning algorithm is presented that divides the finite elements into k sub-domains with a minimum boundary, greatly reducing WLJPM communication costs.
2.2 Multilevel graph partition
In this section, the multilevel graph partition schemes proposed by Bui et al. (1993), Cong et al. (1993) and Barnard et al. (1993) are described. Following the multilevel schemes, many partition tools have been developed, such as Chaco (Hendrickson et al., 1995a), METIS (Karypis et al., 1995b), JOSTLE (Walshaw et al., 1999), PARTY (Preis et al., 1996) and SCOTCH (Pellegrini, 1996). The present experiment is based on METIS. Although graph partitioning is an NP-complete problem, the basic idea of multilevel schemes is quite simple. When the number of vertices is large, partitioning the original graph directly is time consuming. Thus, the graph is first coarsened down to a coarsest graph with considerably fewer vertices. A bisection of this coarsest graph is computed and then projected back to the original graph. The multilevel partition scheme is illustrated in Fig.1. Formally, the multilevel graph partition algorithm consists of three phases. In the coarsening phase, the original graph G0=G is coarsened into a sequence of smaller graphs (G1, G2, …, Gm) such that |V0|>|V1|>|V2|>…>|Vm|. Graph coarsening can be achieved in various ways. Generally, a set of vertices of Gi is combined into one vertex of the coarser graph Gi+1. The combined vertex is called a multinode. The weight of this vertex in Gi+1 is set to the sum of the weights of the corresponding vertices in the finer graph Gi. Likewise, the edges of Gi+1 are the union of the corresponding edges in Gi. The weight of a union edge is set to the
Vol.28
Fig.1 Demonstration of multilevel partition scheme (after Karypis et al., 1998)
sum of the weights of all the corresponding edges. Thus, the coarser graph is a good representation of the original graph in both vertices and connectivity. During the coarsening phase, edge collapsing can be defined in terms of matchings. A matching of a graph is a set of edges, no two of which share a vertex. Thus, the coarser graph Gi+1 of the next level is constructed from Gi by finding a matching of Gi and collapsing the matched vertices into multinodes. Many matching methods have been proposed. Random Matching (RM) was used by Barnard et al. (1993) and Karypis et al. (1998). The random matching algorithm proceeds as follows. First, all the vertices are marked as unmatched. The vertices are then visited in random order. If a vertex u is unmatched, one of its unmatched adjacent vertices v is randomly selected, the edge (u, v) is added to the matching, and vertices u and v are marked as matched. If no unmatched adjacent vertex exists, vertex u remains unmatched. Heavy Edge Matching (HEM; Karypis et al., 1998) is similar to RM in that all vertices are initialized as unmatched and visited in random order. However, instead of randomly selecting an unmatched adjacent vertex, vertex u is matched with the unmatched adjacent vertex v for which the weight of edge (u, v) is maximal. SHEM (Sorted Heavy Edge Matching) is an extension of HEM in which the edges are visited in a globally sorted order based on their weights. SHEM is the default matching method in METIS and was used in this experiment. Light Edge Matching (LEM; Karypis et al., 1998) is the inverse of HEM: LEM matches vertex u with the vertex v for which the weight of edge (u, v) is minimal. While HEM tries to minimize the number of coarsening levels, LEM maximizes this number.
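One coarsening level, as described above, can be sketched in a few lines. This is a minimal illustration of heavy edge matching followed by collapsing matched pairs into multinodes; it is not the METIS implementation, and the function names, the dictionary-based graph layout, and the four-vertex example graph are assumptions made for the sketch.

```python
import random

def heavy_edge_matching(adj, seed=0):
    """adj: {u: {v: weight}} (symmetric). Returns match[u] = partner, or u itself."""
    rng = random.Random(seed)
    order = list(adj)
    rng.shuffle(order)                      # vertices are visited in random order
    match = {}
    for u in order:
        if u in match:
            continue
        # Unmatched neighbours of u, as (edge weight, vertex) pairs.
        candidates = [(w, v) for v, w in adj[u].items() if v not in match]
        if candidates:
            _, v = max(candidates)          # heaviest edge wins
            match[u], match[v] = v, u
        else:
            match[u] = u                    # u stays unmatched
    return match

def collapse(adj, vweight, match):
    """Build the coarser graph: vertex and edge weights are summed."""
    cmap, next_id = {}, 0
    for u in adj:
        if u not in cmap:
            cmap[u] = cmap[match[u]] = next_id
            next_id += 1
    cvweight = [0] * next_id
    for u in adj:
        cvweight[cmap[u]] += vweight[u]
    cadj = {c: {} for c in range(next_id)}
    for u in adj:
        for v, w in adj[u].items():
            cu, cv = cmap[u], cmap[v]
            if cu != cv:                    # edges inside a multinode disappear
                cadj[cu][cv] = cadj[cu].get(cv, 0) + w
    return cadj, cvweight, cmap

# Hypothetical 4-cycle with two heavy edges (0,1) and (2,3).
adj = {0: {1: 5, 3: 1}, 1: {0: 5, 2: 1}, 2: {1: 1, 3: 5}, 3: {2: 5, 0: 1}}
vweight = {v: 1 for v in adj}
match = heavy_edge_matching(adj)
cadj, cvweight, cmap = collapse(adj, vweight, match)
print(match, cadj, cvweight)
```

In this example every visiting order pairs 0 with 1 and 2 with 3, so the coarse graph has two multinodes of weight 2 joined by one edge of weight 2 (the two light edges summed). The edge-cut of a partition of the coarse graph therefore equals the edge-cut of its projection onto the finer graph.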
Heavy Clique Matching (HCM; Karypis et al., 1998) is another method used. A clique is defined as a fully connected sub-graph of graph G. Given a graph G=(V, E), G is a clique if and only if the ratio 2|E|/(|V|(|V|-1)) is equal to 1. This ratio is referred to as the edge density. If this ratio is small, G is far from being a clique. HCM finds a matching by collapsing vertices with high edge density. The partitioning phase can use any partition algorithm directly. Typically, a two-way partition algorithm divides the coarsest graph Gm into two parts, each containing half of the vertices in G0. The two parts are connected with a minimum edge-cut. Two major types of methods in this phase are listed below. Spectral bisection, used by Pothen et al. (1990; 1992) and Hendrickson et al. (1995b), computes the eigenvector y of the second smallest eigenvalue of the Laplacian matrix (Q=D−A) associated with a graph. The eigenvector y is named the Fiedler vector (Barnard et al., 1993). A is the adjacency matrix of the graph G. The element avw of A is set to one if (v, w) ∈ E; otherwise it is set to zero. D is a diagonal matrix with dvv=d(v), where d(v) denotes the degree of the vertex v. The vertices are partitioned into two parts as follows. Given the eigenvector y, let yi be the ith element of y, and choose r as the weighted median of the values yi. All vertices whose corresponding yi≤r belong to one part, and the other vertices are assigned to the other part (Pothen et al., 1990; 1992). A more novel idea was proposed by Hendrickson and Leland (Hendrickson et al., 1995b), who use multiple eigenvectors. This method can divide computation domains into four or eight parts at once. Although spectral methods can generate partitions with a smaller edge-cut, leading to fast parallel computations, these methods are time consuming because the eigenvectors of the matrix must be computed. 
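The spectral bisection step can be illustrated with a small, dependency-free sketch. It approximates the Fiedler vector (the eigenvector of the second smallest eigenvalue of Q = D − A) by power iteration on cI − Q with the constant vector projected out, and then splits at the median. The choice of power iteration, the shift c, the starting vector, and the unweighted median (unit vertex weights) are all assumptions of this sketch; production codes use Lanczos-type eigensolvers instead.

```python
import math
from statistics import median

def fiedler_vector(n, edges, iters=2000):
    """Approximate the Fiedler vector of Q = D - A for a connected graph."""
    nbrs = [[] for _ in range(n)]
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    deg = [len(a) for a in nbrs]
    c = 2 * max(deg)                      # upper bound on the eigenvalues of Q
    # Start orthogonal to the constant vector (eigenvalue 0 of Q).
    y = [i - (n - 1) / 2 for i in range(n)]
    for _ in range(iters):
        # z = (c*I - Q) y = (c - d(v)) * y_v + sum of neighbouring y_w
        z = [(c - deg[v]) * y[v] + sum(y[w] for w in nbrs[v]) for v in range(n)]
        mean = sum(z) / n                 # remove any constant component
        z = [x - mean for x in z]
        norm = math.sqrt(sum(x * x for x in z))
        y = [x / norm for x in z]
    return y

def spectral_bisection(n, edges):
    """Vertices with y_i <= median go to part 0, the rest to part 1."""
    y = fiedler_vector(n, edges)
    r = median(y)
    return [0 if yi <= r else 1 for yi in y]

# Hypothetical path graph 0-1-2-3-4-5.
path = [(i, i + 1) for i in range(5)]
part = spectral_bisection(6, path)
print(part)
```

For the path graph the Fiedler vector varies monotonically along the path, so the median split separates the first three vertices from the last three and cuts a single edge.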
Growing heuristic bisection includes methods which use graph-growing heuristics. For example, GGP (Graph Growing Partition; George et al., 1981; Geohring et al., 1994; Ciarlet et al., 1996) starts from a random vertex and grows a region around it in breadth-first fashion until half of the vertex weight is reached. GGGP (Greedy Graph Growing Partition; Karypis et al., 1995a; Karypis et al., 1996) also starts from a random vertex, but it expands the region in a greedy fashion, leading to a smaller increase in the edge-cut. This approach was used in the present experiment.
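A greedy graph-growing bisection in the spirit of GGGP can be sketched as follows. The sketch assumes unit vertex and edge weights, takes the start vertex as an argument rather than choosing it at random, and breaks ties toward the lowest vertex number; these choices and all names are illustrative assumptions rather than details of the METIS implementation.

```python
def greedy_graph_growing(adj, start):
    """Grow a region from `start` until it holds half of the vertices.

    adj: {u: [neighbours]}. At each step the frontier vertex whose addition
    decreases the edge-cut most (or increases it least) joins the region.
    """
    region = {start}
    while len(region) < len(adj) // 2:
        frontier = {v for u in region for v in adj[u]} - region
        if not frontier:
            break                          # disconnected graph: nothing to grow into

        def gain(v):
            inside = sum(1 for w in adj[v] if w in region)
            return inside - (len(adj[v]) - inside)

        region.add(max(sorted(frontier), key=gain))
    return region

# Hypothetical 2 x 4 grid mesh: nodes 0-3 form the top row, 4-7 the bottom row.
adj = {0: [1, 4], 1: [0, 2, 5], 2: [1, 3, 6], 3: [2, 7],
       4: [0, 5], 5: [1, 4, 6], 6: [2, 5, 7], 7: [3, 6]}
region = greedy_graph_growing(adj, start=0)
cut = sum(1 for u in adj for v in adj[u] if u < v and (u in region) != (v in region))
print(sorted(region), cut)
```

Starting from corner node 0, the region grows to the left half of the grid, {0, 1, 4, 5}, which cuts only 2 edges.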
In the uncoarsening phase, the partition of the coarsest graph Gm is projected back to the original graph G0 by going through the intermediate graphs Gm-1, Gm-2, …, G1. The uncoarsening process can be divided into two parts: (a) projection of the partition of the coarser graph onto the finer one according to the matching information, where the edge-cuts of the two partitions are the same; and (b) refinement on the finer graph, which has more vertices and edges. Even though Pi+1 is a locally optimal partition of Gi+1, the projected partition Pi' is not necessarily the partition with the minimal local edge-cut with respect to Gi. Since the finer graph has more degrees of freedom, Pi' can be improved to a locally optimal partition Pi by refinement heuristics. For this reason, a partition refinement algorithm is used to decrease the edge-cut after projecting a partition. The basic idea of a refinement algorithm is to determine two sub-sets of vertices, one from each part of the partition. By exchanging the sub-sets between the two parts, the edge-cut of the partition can be decreased. A series of algorithms that produce reasonably good exchange results are based on the Kernighan-Lin algorithm (Kernighan et al., 1970; George et al., 1981; Fiduccia et al., 1982). Our N-Best refinement algorithm is a modification of the KL refinement algorithm. Other algorithms are listed below. The Kernighan-Lin Refinement algorithm (KL algorithm) is an iterative method. Given an initial partition, it finds the two sub-sets of vertices mentioned above. The classic KL algorithm (Kernighan et al., 1970) introduced the concept of gain. For each vertex, the gain is defined as the decrease in the edge-cut if this vertex moves to the other part of the partition. The algorithm repeatedly searches for sub-sets of vertices whose exchange leads to less edge-cut. After these vertices are swapped, the gains of the adjacent vertices are updated accordingly. Karypis et al. 
(1995a) extended the termination criteria of the algorithm such that if the edge-cut does not decrease after x swaps, the algorithm undoes the unused swaps and terminates. Boundary Refinement (Karypis et al., 1998) builds on the KL algorithm. In the KL algorithm, the gains of all vertices are computed, and most of this computation is wasted because most of the swapped vertices are boundary vertices between the two parts. In the boundary refinement algorithm, only the gains of the boundary vertices are computed. After swapping a vertex, the gains of its adjacent vertices are updated. If any adjacent vertices are no longer
boundary vertices due to swapping, they are removed from the boundary list. Conversely, adjacent vertices that become boundary vertices are added to the boundary list, and their gains are calculated. The N-Best Refinement is proposed in this paper as an algorithm to improve the partition. This algorithm is a variant of the KL refinement algorithm and is discussed fully in the next section.
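The two ingredients shared by these refinement algorithms, the gain of a vertex and the list of boundary vertices, can be written down directly from the definitions above. The sketch below assumes a two-way partition stored as a vertex-to-part dictionary; the names and the small path-graph example are illustrative assumptions.

```python
def gain(adj, part, v):
    """Decrease in edge-cut if v moves to the other part (two-way partition)."""
    external = sum(w for u, w in adj[v].items() if part[u] != part[v])
    internal = sum(w for u, w in adj[v].items() if part[u] == part[v])
    return external - internal

def boundary_vertices(adj, part):
    """Vertices with at least one neighbour in the other part."""
    return {v for v in adj if any(part[u] != part[v] for u in adj[v])}

# Hypothetical path graph 0-1-2-3-4-5 with unit edge weights.
adj = {v: {} for v in range(6)}
for u in range(5):
    adj[u][u + 1] = adj[u + 1][u] = 1

# Vertex 2 sits in part 1 although both of its neighbours are in part 0.
part = {0: 0, 1: 0, 2: 1, 3: 0, 4: 1, 5: 1}

print(gain(adj, part, 2), gain(adj, part, 0))
print(sorted(boundary_vertices(adj, part)))
```

Moving vertex 2 would remove two cut edges (gain 2), whereas moving vertex 0 would create one (gain −1). Only vertices 1 to 4 touch the interface, so a boundary refinement would evaluate gains for those four vertices only.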
3 N-BEST REFINEMENT ALGORITHM
As mentioned above, the KL algorithm terminates when the edge-cut does not decrease after x swaps. In this manner, the KL algorithm has some ability to escape local optima. The N-Best refinement algorithm records the N best partitions (the N-Best partition list) in each iteration until the partition list no longer changes, and thus enhances the ability to escape local optima. At the beginning, the N-Best partition list is initialized with only the projected partition, Pi'. The algorithm repeatedly selects each partition P from the N-Best list in order until the list does not change. For a partition P, the m vertices with the largest gains are selected from the larger part, and each is moved to the other part. These m movements yield m new partitions with large gains. The new partitions are inserted into the N-Best partition list, which is sorted by edge-cut. Partitions with a smaller edge-cut are retained in the list, and the partition with the largest edge-cut is removed. In brief, the N-Best algorithm can be described by the following procedure:
1. Given the projected partition P', insert P' into the N-Best partition list.
2. Mark the N-Best list as unchanged.
3. Select each partition P from the list in order.
3.1 Select the m vertices with the largest gains from the larger part.
3.2 For each selected vertex, generate a new partition with large gain from partition P. Insert the generated partition into the N-Best partition list according to its edge-cut. Discard the new partition if it is already in the partition list.
3.3 Mark the N-Best list as changed if any partition is inserted into the list.
4. If the N-Best list has changed, repeat the process from step 2.
5. Select the best partition from the N-Best partition list as the refined partition, and terminate.
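The steps above can be sketched for a two-way partition as follows. This is one possible reading of the procedure, not the authors' code: the tuple representation of partitions, the tie-breaking rule when both parts have equal size (part 0 is treated as the larger one), and the rule that a full list only accepts a partition with a strictly smaller edge-cut than its worst entry are assumptions introduced here so that the sketch is well defined and terminates.

```python
def edge_cut(adj, part):
    """Total weight of edges whose end vertices lie in different parts."""
    return sum(w for u in adj for v, w in adj[u].items()
               if u < v and part[u] != part[v])

def gain(adj, part, v):
    """Decrease in edge-cut if v moves to the other part."""
    return sum(w if part[u] != part[v] else -w for u, w in adj[v].items())

def n_best_refine(adj, projected, n_best=20, m=20):
    """Return (edge_cut, partition) of the best partition found."""
    best = [(edge_cut(adj, projected), tuple(projected))]      # step 1
    changed = True
    while changed:                                             # steps 2 and 4
        changed = False
        for _, part in list(best):                             # step 3
            larger = 0 if part.count(0) >= part.count(1) else 1
            movable = [v for v in adj if part[v] == larger]
            movable.sort(key=lambda v: gain(adj, part, v), reverse=True)
            for v in movable[:m]:                              # steps 3.1, 3.2
                new = list(part)
                new[v] = 1 - larger
                cand = (edge_cut(adj, new), tuple(new))
                if cand in best:
                    continue                                   # already listed
                if len(best) < n_best or cand[0] < best[-1][0]:
                    best.append(cand)
                    best.sort()                                # sorted by edge-cut
                    del best[n_best:]                          # drop the worst
                    changed = True                             # step 3.3
    return best[0]                                             # step 5

# Hypothetical graph: triangle {0,1,2} and cluster {3,4,5,6} joined by edge (2,3).
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (5, 6), (3, 6), (3, 5)]
adj = {v: {} for v in range(7)}
for u, v in edges:
    adj[u][v] = adj[v][u] = 1

projected = (0, 0, 0, 0, 1, 1, 1)       # vertex 3 projected onto the wrong side
cut, refined = n_best_refine(adj, projected)
print(edge_cut(adj, projected), cut, refined)
```

In this example the projected partition has an edge-cut of 3, and the refinement moves vertex 3 across the interface to reach the partition that cuts only the bridging edge (2, 3). The defaults n_best=20 and m=20 mirror the values reported for the experiments in this paper.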
From the description above, we can conclude that the N-Best refinement algorithm improves the partition at the cost of additional space and time. In the Xiamen Bay case, the graph partition is a preparation process, which can be performed before the parallel hydrodynamic model runs. Given the number of processors and the FEM grid, the partition algorithm is performed only once regardless of how many times the parallel hydrodynamic model runs. Therefore, the running time of the partition algorithm is not an important factor. Compared with the KL algorithm, the N-Best refinement algorithm records the n best partitions instead of a single best partition. For each partition P in the N-Best list, m new partitions are generated. In theory, the N-Best algorithm is about m×n times slower than the KL algorithm and requires n times more space.
4 RESULT
There are many good algorithms and software packages for partitioning graphs, such as Chaco (Hendrickson et al., 1995a), METIS (Karypis et al., 1995b), JOSTLE (Walshaw et al., 1999), PARTY (Preis et al., 1996), and SCOTCH (Pellegrini, 1996). In our experiment, METIS with the boundary refinement algorithm is set as the baseline. We evaluate the performance of the N-Best refinement algorithm and compare it with that of METIS using the boundary refinement algorithm. The coarsening and partitioning methods are the same as in METIS; only the refinement algorithm differs. METIS is a family of programs for partitioning unstructured graphs and hypergraphs, and for computing fill-reducing orderings of sparse matrices. The underlying algorithms used by METIS are based on the state-of-the-art multilevel paradigm, which has been shown to produce high quality results and to scale to large problems. The algorithms in METIS are based on the multilevel graph partitioning described by Karypis et al. (1995a; 1995b; 1996; 1998). Experiments on a large number of graphs arising in various domains, including finite element methods, linear programming, VLSI, and transportation, show that METIS produces partitions that are consistently better than those produced by
other widely used algorithms. The N-Best refinement algorithm is developed based on the METIS program. The partition results are compared with those of the boundary refinement in Fig.2. In our experiments, the values of m and n are both set to 20. In this figure, most of the N-Best refinement edge-cuts are smaller than those of the boundary refinement (the decrease in
edge-cut is larger than zero). When the number of sub-domains is less than 10, the N-Best edge-cut decreases by a larger percentage because the edge-cut of the boundary refinement is itself small. The decrease is between -5% and 5% when the number of sub-domains is larger than 10. On average, the N-Best edge-cut is decreased by about 3%.
Fig.2 Edge-cut comparison between boundary refinement and N-Best refinement
In the remainder of this section, the computation speedup of WLJPM is demonstrated as applied to Xiamen Bay. Xiamen Bay (Fig.3) is located west of the Taiwan Strait, between longitudes 117.8° and 118.6° and between latitudes 23.2° and 23.8°, in Fujian Province, China. There are two seaward openings to the east and a river outlet to the west. The two openings are separated by Jinmen Island. The current is forced by a typical semi-diurnal tide with a mean tidal range of 3.79 m and a maximum range of 6.24 m, which dominates Xiamen Bay. In this study, the N-Best refinement algorithm was adopted by the WLJPM model for application in Xiamen Bay. The research domain was discretized with 8,681 six-node triangular finite elements and 18,984 nodes in each horizontal layer. There were 21 layers in the vertical direction, and each layer was 1/21 of the depth in the sigma coordinate system. The multilevel scheme with the N-Best refinement algorithm was used to decompose the research domain into different numbers of sub-domains. Figs.4 & 5 compare the partitions produced by the multilevel scheme with the N-Best refinement algorithm and by the original method used in WLJPM, which divides the nodes according to their distance to the domain centre. Partitions with 6 and 8 sub-domains are presented separately. The N-Best partitions seem more reasonable, as the interfaces between different sub-domains are mostly located in narrow areas of the research domain.
In this study, WLJPM is run on a workstation-cluster parallel computer consisting of 10 workstations, each with two Intel Xeon 2.8 GHz CPUs. A 1 000 MB/s switch connects these workstations. The average size of the 8 681 six-node triangular finite elements is 1×10^5 m^2, the elements in the inner bay are refined, and the minimum element size is 1×10^5 m^2. The time step of the model is 120 s, and the model simulates 15-day hydrodynamic processes with an error tolerance of 10^-8 m for the water elevation calculation in the PCG method. The computational time costs for the 15-day hydrodynamic processes with different numbers of sub-domains/processors and different partition algorithms are compared in Fig.6. Compared with the original partition algorithm, the N-Best algorithm
Fig.3 Map of Xiamen Bay and the monitor stations
Fig.4 Graph partition for Xiamen Bay with 6 sub-domains with multilevel schemes with N-Best refinement method (left) and the mixed strip/patch method (right)
Fig.5 Graph partition for Xiamen Bay with 8 sub-domains with multilevel scheme with N-Best refinement method (left) and the mixed strip/patch method (right)
roughly doubles the speed of the model on average when the sub-domain number is larger than 10. It is apparent that the parallel computing is fastest with 15 sub-domains and slows down with more sub-domains. This is because the communication time becomes the bottleneck of the computing speed when there are too many sub-domains. Fig.6 also shows that the optimal sub-domain number with the original strategy is 8. We note that although the total computation time was least with both 8 and 14 sub-domains, the lower sub-domain number is preferable. The optimal processor number extended to 15 when adopting the N-Best algorithm, showing that the N-Best algorithm can enlarge the optimal sub-domain number and extend the parallelism capability of WLJPM. The least computational time obtained with the N-Best algorithm was 89 minutes with 15 sub-domains, while it was 197 minutes with 8 sub-domains using the original strategy. In Fig.6, the computing speed of the original strategy decreased from 8 to 13 sub-domains, and this tendency ended abruptly at 14 sub-domains. Meanwhile, with the N-Best partition, the computation sped up with more sub-domains/processors, except for 9 and 10 sub-domains, until there were 15 sub-domains. These results indicate that the N-Best partitions are more stable and effective than those of the original method.
Fig.6 The time required to simulate 15-day processes with different numbers of sub-domains
Because the interface between sub-domains is reduced, the computation cost is reduced accordingly. The time cost of each sub-process is listed in Table 1. The table shows that the N-Best partition sped up WLJPM in four aspects: the advection term, the horizontal diffusion term, the vertical diffusion term, and the continuity equation. This is because these four aspects need message passing among different sub-domains.
Table 1 The time cost comparison in WLJPM with different partitioning (18,984 nodes, 12 sub-domains, 120 s time step)
Process | Time needed with strip/patch algorithm (s) | Time needed with N-Best refinement algorithm (s) | Time reduction (%)
Explicit Eulerian-Lagrangian method for advection term | 0.217 | 0.125 | 42.4
Implicit FEM for horizontal diffusion term | 0.140 | 0.074 | 47.1
Implicit FDM for vertical diffusion term | 0.238 | 0.178 | 25.2
Implicit FEM for continuity equation, tolerance=10^-8 m/s | 0.693 | 0.294 | 57.6
Total cost for one time step (including other processes) | 1.30 | 0.68 | 47.7
The continuity equation accounts for over half of the total computational time, since it involves a linear system with a large sparse matrix that needs to be solved by the PCG method. The N-Best partition reduces this cost from 0.693 to 0.294 s (57.6% cost reduction). Meanwhile, the advection and horizontal diffusion terms achieved more than 40% reduction. In addition, other processes are also sped up, and the total modeling speed is increased by 1.9 times (47.7% time reduction) with the N-Best partitions, compared to the original method. In summary, the N-Best algorithm greatly shortens the computational time compared to the original model with the same number of sub-domains. Furthermore, the model’s optimal sub-domain number can extend to a larger value with the N-Best partitions.
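The percentages and the speedup factor quoted here follow directly from the per-step timings in Table 1. The short check below recomputes them; the dictionary keys are shorthand labels chosen for this sketch.

```python
# Per-time-step costs in seconds from Table 1: (strip/patch, N-Best).
timings = {
    "advection": (0.217, 0.125),
    "horizontal diffusion": (0.140, 0.074),
    "vertical diffusion": (0.238, 0.178),
    "continuity": (0.693, 0.294),
    "total": (1.30, 0.68),
}

def reduction_pct(old, new):
    """Time reduction in percent: 100 * (old - new) / old."""
    return 100.0 * (old - new) / old

def speedup(old, new):
    """Speedup factor: old time divided by new time."""
    return old / new

for name, (old, new) in timings.items():
    print(f"{name}: {reduction_pct(old, new):.1f}% reduction, "
          f"{speedup(old, new):.2f}x speedup")
```

A 47.7% reduction in time per step corresponds to a speedup of 1.30/0.68 ≈ 1.9, which is how the two figures in the text relate to each other.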
5 MODEL VERIFICATION
An oceanographic hydrodynamic simulation can be obtained by applying the N-Best partitioning results to Xiamen Bay. Fig.7 compares the computed and observed tidal elevations at Huli Shan Station over 10 days starting at 0:00 on April 12, 2003. Velocity comparisons, including magnitude and direction in the middle layer, at Houyu Station from 0:00 on April 20, 2003 to 0:00 on April 28, 2003 are shown in Fig.8. The averaged velocity magnitude relative error calculated
Fig.7 The computed tidal elevation and the corresponding observed values at Huli Shan
in this station was 25.1% when the magnitude is larger than 0.3 m/s. In Fig.9, five transects are used to show the salinity distribution patterns. One transect is a latitudinal-vertical transect along the west-east direction of the Jiulong River estuary. An additional four longitudinal-vertical transects are placed at a constant interval from west to east in the Jiulong River estuary. From these transects, it is seen that salinity in the east is higher than that in the west. This is because saline water intrudes into the estuary from the east as a bottom layer, and the fresh water from the western river outlets flows to the sea along the surface (above the heavier saline water). The interaction of these two types of water results in strong stratification in the Jiulong River estuary. The maximum salinity
Fig.8 Velocity (magnitude and direction in the middle layer) comparisons between computed and observed data
Fig.9 Salinity and current patterns in the Jiulong River estuary
difference in the vertical direction is about 20 practical salinity units (psu). These patterns fit the observed salinity characteristics described by Wang et al. (2000).
6 CONCLUSION
In this paper, an N-Best refinement algorithm (multilevel scheme) for the domain partition method is proposed for the parallelization of finite element modeling. The application of WLJPM in Xiamen Bay shows that the N-Best algorithm can reduce the communication time between computation processors. Compared to the original partitioning algorithm, the N-Best algorithm can speed up the total computation by 1.9 times. Furthermore, the N-Best algorithm can extend the optimal sub-domain number to a greater value. Compared to METIS with the boundary refinement algorithm, which has been shown to produce high quality partitions, the N-Best edge-cut is decreased by about 3%.
References
Anderson E. 1988. Parallel implementation of preconditioned conjugate gradient methods for solving sparse systems of linear equations. Technical Report 805, Centre for Supercomputing Research and Development, University of Illinois, Urbana, IL.
Aykanat C, Özgüner F, Ercal F, Sadayappan P. 1988. Iterative algorithms for solution of large sparse systems of linear equations on hypercubes. IEEE Transactions on Computers, 37(12): 1 554-1 567.
Barnard S T, Simon H D. 1993. A fast multilevel implementation of recursive spectral bisection for partitioning unstructured problems. In: Sincovec R F, Keyes D E, Leuze M R, Petzold L R, Reed D E eds. Proceedings of the 6th SIAM Conference on Parallel Processing for Scientific Computing. Society for Industrial and Applied Mathematics. Philadelphia, United States. p. 711-718.
Baños R, Gil C, Ortega J, Montoya F G. 2004. A parallel multilevel metaheuristic for graph partitioning. Journal of Heuristics, 10(3): 315-336.
Berger M J, Bokhari S H. 1987. A partitioning strategy for nonuniform problems on multiprocessors. IEEE Transactions on Computers, 36(5): 570-580.
Bui T, Jones C. 1993. A heuristic for reducing fill in sparse matrix factorization. In: Sincovec R F, Keyes D E, Leuze M R, Petzold L R, Reed D E eds. Proceedings of the 6th SIAM Conference on Parallel Processing for Scientific Computing. Society for Industrial and Applied Mathematics, Philadelphia, United States. p. 445-452.
Ciarlet J P, Lamour F. 1996. On the validity of a front-oriented approach to partitioning large sparse graphs with a connectivity constraint. Numerical Algorithms, 12: 193-214.
Cong J, Smith M L. 1993. A parallel bottom-up clustering algorithm with applications to circuit partitioning in VLSI design. In: Dunlop A E (Chairman). Proceedings of the 30th International Conference on Design Automation. Dallas, Texas, United States. p. 755-760.
Farhat C. 1988. A simple and efficient automatic FEM domain decomposer. Computers and Structures, 28(5): 579-602.
Fiduccia C M, Mattheyses R M. 1982. A linear-time heuristic for improving network partitions. In: Friendenson R A, Breiland J R, Thompson T J eds. Proceedings of the 19th IEEE Design Automation Conference. IEEE Press, Piscataway, NJ, USA. p. 175-181.
Geohring T, Saad Y. 1994. Heuristic algorithms for automatic graph partitioning. Technical Report UMSI-94-29, Department of Computer Science, University of Minnesota, Minneapolis.
George A, Liu J W. 1981. Computer solution of large sparse positive definite systems. Prentice-Hall, Englewood Cliffs, New Jersey.
Hendrickson B, Leland R. 1995a. The Chaco User's Guide, Version 2.0. Technical Report SAND95-2344, Sandia Natl. Lab., Albuquerque, NM, 87185.
Hendrickson B, Leland R. 1995b. An improved spectral graph partitioning algorithm for mapping parallel computations. SIAM Journal on Scientific Computing, 16(2): 452-469.
Hendrickson B, Leland R. 1995c. A multilevel algorithm for partitioning graphs. Proceedings of Supercomputing. p. 626-657.
Hsieh S H, Paulino G H, Abel J F. 1995. Recursive spectral algorithms for automatic domain partitioning in parallel finite element analysis. Computer Methods in Applied Mechanics and Engineering, 121(1): 137-162.
Jiang Y W, Wai O W H, Li Y S. 2003. 3D parallel estuary model for cohesive sediment transport in large tidal flats. Coastal Sediments '03, Fifth International Symposium on Coastal Engineering and Science of Coastal Sediment Processes. Florida. ISTP. ISBN: 981-238-422-7.
Jiang Y W, Wai O W H. 2005. Drying-wetting approach for 3D finite element sigma coordinate model for estuaries with large tidal flats. Advances in Water Resources, 28(8): 779-792.
Jones M T, Plassman P E. 1994. Parallel algorithms for the adaptive refinement and partitioning of unstructured meshes. Proceedings of the Scalable High Performance Computing Conference. p. 478-485.
Kamath C, Sameh A H. 1984. The preconditioned conjugate gradient algorithm on a multiprocessor. In: Vichnevetsky R, Stepleman R S eds. Advances in Computer Methods for Partial Differential Equations, New York, IMACS.
Karypis G, Kumar V. 1995a. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1): 359-392.
Karypis G, Kumar V. 1995b. METIS: Unstructured graph partitioning and sparse matrix ordering system. Technical Report, Department of Computer Science, University of Minnesota.
Karypis G, Kumar V. 1996. Parallel multilevel graph partitioning. Proceedings of the 10th International Parallel Processing Symposium. p. 314-319.
Karypis G, Kumar V. 1998. A parallel algorithm for multilevel graph partitioning and sparse matrix ordering. Journal of Parallel and Distributed Computing, 48(1): 71-95.
Kernighan B W, Lin S. 1970. An efficient heuristic procedure for partitioning graphs. The Bell System Technical Journal, 49(2): 291-308.
Kim S K, Chronopoulos A. 1991. A class of Lanczos-like algorithms implemented on parallel computers. Parallel Computing, 17: 763-778.
Lu Q M, Wai O W H. 1998. An efficient operator splitting scheme for three-dimensional hydrodynamic computations. International Journal for Numerical Methods in Fluids, 26(7): 771-789.
Pellegrini F. 1996. SCOTCH 3.1 User's Guide. Technical Report 1137-96, LaBRI, University Bordeaux, France.
Pothen A, Simon H D, Liou K P. 1990. Partitioning sparse matrices with eigenvectors of graphs. SIAM Journal on Matrix Analysis and Applications, 11(3): 430-452.
Pothen A, Simon H D, Wang L. 1992. Towards a fast implementation of spectral nested dissection. In: Werner Robert ed. Proceedings of the 1992 ACM/IEEE Conference on Supercomputing. Minneapolis, Minnesota, United States. p. 42-51.
Preis R, Diekmann R. 1996. The PARTY Partitioning-Library, User Guide-Version 1.1. Technical Report Tr-rsfb-96-024, University of Paderborn, Paderborn, Germany.
Saad Y, Schultz M H. 1985. Parallel implementations of preconditioned conjugate gradient methods. Technical Report, Yale University, Department of Computer Science.
Schloegel K, Karypis G, Kumar V. 2001. Wavefront diffusion and LMSR: algorithms for dynamic repartitioning of adaptive meshes. IEEE Transactions on Parallel and Distributed Systems, 12(5): 451-466.
Wai O W H, Lu Q M. 2000. An efficient parallel model for coastal transport process simulation. Advances in Water Resources, 23(7): 747-764.
Wai O W H, Chen Y, Li Y S. 2004a. A 3-D wave-current driven coastal sediment transport model. Coastal Engineering Journal, 46(4): 385-424.
Wai O W H, Wang C H, Li Y S, Li X D. 2004b. The formation mechanisms of turbidity maximum in the Pearl River estuary, China. Marine Pollution Bulletin, 48(5): 441-448.
Walshaw C, Cross M, Everett M. 1999. Mesh partitioning and load-balancing for distributed memory parallel systems. In: Topping B H V ed. Proceedings Parallel and Distributed Computing for Computational Mechanics: Systems and Tools, Saxe-Coburg Publications, Edinburgh. p. 110-123.
Wang W Q, Zhang Y H, Huang Z Q. 2000. Characteristics of salinity front in Jiulongjiang Estuary Xiamen Harbour. Journal of Oceanography in Taiwan Strait, 19(1): 82-88. (in Chinese with English abstract).