49th AIAA Aerospace Sciences Meeting, 4-7 Jan 2011, Orlando World Center Marriott, Orlando, Florida

SOME USEFUL STRATEGIES FOR UNSTRUCTURED EDGE-BASED SOLVERS ON SHARED MEMORY MACHINES

R. Aubry, G. Houzeaux and M. Vázquez, Barcelona Supercomputing Center (BSC-CNS), C/Gran Capita, 2-4, Edificio Nexus I, 08034 Barcelona, Spain

Three strategies for shared memory parallel edge-based solvers are proposed which guarantee that nodes belonging to one thread are not accessed by other threads for vertex-centered discretizations (replace nodes by cells in the case of cell-centered discretizations). The algorithms reorder the edges into groups so that the parallelization takes place at the edge level, possibly through multiple passes, where the bulk of the work of an edge-based solver is performed. These strategies are presented in increasing order of programming effort and their performance is compared. Various renumbering algorithms are considered. Results and timings are given for a classical Computational Fluid Dynamics compressible edge-based solver and a Numerical Weather Prediction compressible dynamics solver for dry air, as well as computational details that illustrate the efficiency of the proposed approach. The influence of the point renumbering on the final edge grouping and efficiency is also studied through numerical results.

Keywords: Edge-based solver, shared memory machines, renumbering schemes, cc-NUMA, profile and bandwidth minimizers.

I. INTRODUCTION

With the new version of Moore's law asserting that the number of cores per chip will double every 18 months, taking advantage of the still decreasing transistor size while scaling the power consumption only linearly, shared memory machines are nowadays of crucial importance. Even with hybrid OpenMP/MPI implementations, it may be forecast that new exaflop machines will be composed of large multicore processors accessing the same memory locally while being connected through an interconnection network and communicating with message passing. On commodity PCs, the number of cores is increasing drastically, hence software must take advantage of these architectures. Due to the ever increasing size of the applications targeted in Computational Fluid Dynamics (CFD), efficiency through special data structures and reordering has been studied extensively [1, 5, 8, 12, 19, 25, 30] from both a serial and a parallel viewpoint. In a CFD code, the heavy work is performed at the edge, face, or element level, depending on whether the code is based on an edge, face, or element data structure. A typical CFD loop [25, 30] for an unstructured code gathers some information at the point level, performs work, and scatters the result back at the point level, as illustrated in Figure I.1 for a shared memory OpenMP implementation. This scatter may become the bottleneck for parallelization, and it is the main concern of this paper. A solution presented in [25] consists in using a domain decomposition approach at the algebraic level. In order for the points to be accessed by the same thread, edges are renumbered in groups of approximately the same size, whose point extents are non-overlapping, while trying to minimize cache misses at the same time. However, few details are given in [25]. The organization of this article is as follows. Section II reviews the serial and parallel hardware constraints the programmer must be aware of in order to exploit the hardware efficiently. In Section III, the three strategies for shared memory parallel edge-based solvers are presented.


!Loop on edge macrogroups of size nthread
!All the groups inside a macrogroup must be disjoint
do ithread=1,npar,nthread
   !Get pointers of this edge macrogroup
   ithread0=ithread
   ithread1=min(npar,ithread0+nthread-1)
   !Parallel loop for each thread
!$omp parallel do private(jthread,iedg0,iedg1,iedge,ip1,ip2,redge)
   do jthread=ithread0,ithread1
      !Get pointers of this edge group
      iedg0=edpar(jthread)
      iedg1=edpar(jthread+1)-1
      do iedge=iedg0,iedg1
         !Serial work at edge level
         ip1=ledge(1,iedge)
         ip2=ledge(2,iedge)
         redge=cedge(iedge)*(var(ip2)-var(ip1))
         rhs(ip1)=rhs(ip1)+redge
         rhs(ip2)=rhs(ip2)-redge
      enddo
   enddo
enddo

Figure I.1. A parallel shared memory edge-based loop for a scalar variable. In this loop, var represents the array of variables under consideration and rhs the point right-hand side. The array cedge contains the geometric information associated with the Partial Differential Equation under study, ledge contains the end points of each edge, edpar gives the beginning and end of the edge group for each thread, and npar represents the number of iterations on macrogroups times the number of threads nthread.
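For contrast, the following fragment (an illustrative sketch, not taken from the paper, reusing the arrays defined in Figure I.1) shows the naive parallelization that the edge grouping is designed to avoid: if all the edges are simply distributed over the threads, two threads may scatter into the same entry of rhs, so every update must be protected, for instance with atomics, and the scatter is effectively serialized.

!Naive parallel edge loop (sketch): two edges sharing a node may be treated
!by different threads, so the scatter into rhs must be made atomic, which
!degrades performance and motivates the edge grouping of Figure I.1.
!$omp parallel do private(iedge,ip1,ip2,redge)
do iedge=1,nedge
   ip1=ledge(1,iedge)
   ip2=ledge(2,iedge)
   redge=cedge(iedge)*(var(ip2)-var(ip1))
!$omp atomic
   rhs(ip1)=rhs(ip1)+redge
!$omp atomic
   rhs(ip2)=rhs(ip2)-redge
enddo
!$omp end parallel do

The grouping of Figure I.1 removes the need for atomics by construction, since the point extents of the groups executed concurrently inside a macrogroup are disjoint.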


In the first strategy, a simple algorithm for load balancing the edge groups and its parallel implementation is described. In the second strategy, a data domain decomposition is applied on the node graph. The ease of the previous strategy is maintained, while allowing a better control of the data access pattern. Finally, the third strategy enforces a stricter data decomposition by relying on the edge graph. Some coding complexity is added, at the benefit of an almost perfect load balancing and data access pattern. Some implementation details are also given. Finally, numerical results are presented in Section V for the case of a cc-NUMA shared memory machine. This work relies on an implementation based on the OpenMP API for the sake of code readability, ease and portability.

II. SERIAL AND PARALLEL HARDWARE CONSTRAINTS

In this section, the physical constraints in the hardware that the programmer must take into account for efficiency are reviewed. The first part deals with serial efficiency while the second part concentrates on parallel efficiency.

II.A. Serial hardware constraints

For a uniprocessor context, various concepts must be taken into account to achieve good performance on cache based machines. The most useful and well-known are [1, 5, 8, 19, 25, 30]:

• spatial locality
• temporal locality
• data alignment
• cache thrashing

These concepts are sketched in Figure II.1 for the sake of clarity.

Figure II.1. Illustration of temporal and spatial locality, data alignment and cache thrashing.

The programmer is responsible for the first two categories. Spatial and temporal locality may be exploited through data reordering so that physically close quantities also lie close in memory. Therefore points, edges, faces and elements must be relocated to comply with this requirement. In order to reduce cache misses, points are first renumbered to minimize bandwidth or profile. Numerous algorithms are available, such as the frontal renumbering, the Cuthill-McKee renumbering [9] and its reverse variant, the Gibbs-Poole-Stockmeyer renumbering [17], the King renumbering [21], the Levy renumbering [22] or, more recently, the Sloan renumbering [32].


Originally these renumbering schemes were designed for direct solvers, the first three to minimize bandwidth and the last three to minimize profile, since the fill-in, and hence the performance, is drastically affected by the renumbering. Once the points are renumbered, the elements, faces and edges are renumbered so that they are accessed as smoothly in memory as possible, as discussed in [23] with a two-pass algorithm that reorders entities based on their maximal and minimal points. Optionally, points may also be reordered inside each element and face.
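As an illustration of such a two-pass reordering, the sketch below (not the implementation of [23]; the routine name, array names and the counting-sort strategy are assumptions) orders the edge array lexicographically with a stable counting sort on the larger endpoint followed by a stable counting sort on the smaller endpoint, so that edges touching the same low-numbered points end up contiguous in memory.

!Two-pass edge reordering sketch: a stable counting sort by the larger
!endpoint followed by a stable counting sort by the smaller endpoint leaves
!the edges in lexicographic (smaller point, larger point) order, with the
!smaller point stored first. Names are illustrative.
subroutine reorder_edges(nedge,npoin,ledge)
   implicit none
   integer, intent(in)    :: nedge,npoin
   integer, intent(inout) :: ledge(2,nedge)
   integer :: lwork(2,nedge),lcount(npoin+1)
   integer :: iedge,ip,ipmin,ipmax,ipass

   do ipass=1,2
      lcount=0
      do iedge=1,nedge                  !Count the edges of each key
         ipmin=min(ledge(1,iedge),ledge(2,iedge))
         ipmax=max(ledge(1,iedge),ledge(2,iedge))
         if(ipass==1)then
            ip=ipmax                    !First pass: sort by the larger endpoint
         else
            ip=ipmin                    !Second pass: sort by the smaller endpoint
         endif
         lcount(ip+1)=lcount(ip+1)+1
      enddo
      lcount(1)=1
      do ip=2,npoin+1                   !Prefix sum gives the first slot of each key
         lcount(ip)=lcount(ip)+lcount(ip-1)
      enddo
      do iedge=1,nedge                  !Stable scatter into the new ordering
         ipmin=min(ledge(1,iedge),ledge(2,iedge))
         ipmax=max(ledge(1,iedge),ledge(2,iedge))
         if(ipass==1)then
            ip=ipmax
         else
            ip=ipmin
         endif
         lwork(1,lcount(ip))=ipmin      !Store the edge with its smaller point first
         lwork(2,lcount(ip))=ipmax
         lcount(ip)=lcount(ip)+1
      enddo
      ledge=lwork                       !Copy back before the next pass
   enddo
end subroutine reorder_edges

The same two-pass idea carries over to faces and elements by replacing the edge array with the corresponding connectivity array.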

II.B. Parallel hardware constraints

When porting an application to shared memory, new difficulties arise and add to the serial list. The most common and important ones are:

• memory contention
• false sharing
• memory placement in NUMA systems
• thread placement in NUMA systems
• load balancing

These concepts are sketched in Figure II.2 for the sake of clarity.

Figure II.2. Illustration of memory contention, false sharing, memory placement, load balancing (where a nose geometry has been adequately partitioned by METIS, together with a trace in which one process spends more time than the others due to a complex topology), and thread placement.
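On cc-NUMA systems, memory placement can be influenced from OpenMP alone through the first-touch policy of most operating systems: a page is usually mapped to the memory of the socket whose thread first writes it. The fragment below is a minimal sketch (assuming the arrays of Figure I.1 and a static point decomposition that roughly matches the later edge groups) of a parallel initialization used to obtain a favourable placement.

!First-touch initialization sketch: writing the point arrays with the same
!static thread decomposition used later in the edge loops tends to place
!each memory page close to the thread that will update it.
!$omp parallel do schedule(static) private(ipoin)
do ipoin=1,npoin
   rhs(ipoin)=0.0d0
   var(ipoin)=0.0d0
enddo
!$omp end parallel do

Thread placement is usually handled outside the code, for instance by pinning threads to cores through the runtime or the batch system.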

III. PARALLEL STRATEGIES

In this section, the three strategies proposed in this work to achieve an efficient parallelization of an edge-based solver on shared memory machines are presented in order of increasing complexity and coding time. They are characterized by the property that nodes belonging to one thread are not accessed by other threads. The three strategies rely closely on a domain decomposition method. In this context, various approaches have been followed with success [18, 25]:

• geometric bisection
• simulated annealing
• spectral recursive bisection
• space filling curve decomposition
• multilevel decomposition
• diffusion based decomposition

It is obvious that a classical domain decomposition strategy with message passing would also run on shared memory. Nevertheless, the implementation with message passing is much more involved than with a shared memory paradigm. Furthermore, it does not leave the freedom to parallelize only parts of the code when needed, in a progressive manner, as the whole code must be designed in terms of domains from the very beginning. However, it is interesting to study how the shared memory paradigm can mimic the distributed approach for efficiency, what implementation effort is needed, and what the main differences are at a conceptual level. It will be seen that the differences with a pure domain decomposition approach decrease as the strategies are described.

III.A. Algebraic domain decomposition

In this section, the first strategy to parallelize an edge-based solver on shared memory is presented in detail. It relies on a reordering of the edges based on their end points, as proposed in [25], and a simple load balancing algorithm to achieve a similar number of edges in each group. First the serial implementation is described. Then, the parallel implementation is commented on.

III.A.1. Serial algebraic domain decomposition

As input of the first strategy, it is assumed that the edges are ordered lexicographically: the first point of an edge is smaller than the first point of the next one, or, if both first points are the same, the second point of the first edge is smaller than the second point of the second edge. Apart from providing an optimal reordering for serial applications, as seen before, this ordering greatly simplifies the implementation. The initialization phase is the same as the one presented in [24, 25]. First, the point range [ipmin:ipmax] of the edges to be renumbered is computed. If all the edges have to be renumbered, the point range varies from 1 to the last point npoin. Then, the initial point extent nppg of each edge group is obtained by dividing the point range by nthread, the number of threads. Therefore:

nppg = (ipmax - ipmin + 1) / nthread        (III.1)
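For instance, with ipmin = 1, ipmax = 40000 and nthread = 4 (an illustrative example, not taken from the paper), equation (III.1) gives nppg = 10000, so the initial point boundaries are 1, 10001, 20001 and 30001, and an edge is kept in this first pass only if both of its endpoints fall inside the same one of the four resulting point ranges.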

The initial point ranges are obtained by marching from 1 to npoin in steps of nppg. The multiples of nppg define the boundaries of the edge points to be assigned to each thread. At this point, a possibly large variation in the number of edges per group may be obtained, due to the topology of the underlying mesh. In order to equidistribute the load, it is sufficient to adjust the boundaries of each edge group to move edges from one group to another. As the boundaries are point values, boundary point values must be increased or decreased, depending on the local balance of neighboring edge groups. Figure III.1 illustrates the algorithm schematically. For an efficient implementation, it is recommended to have previously computed the graph of the points surrounding the points. Then, each time a boundary point value is lowered, it is an easy matter to find which edges were counted in the group with more edges, and which edges must now be counted in the other group. As a variation in one domain will influence the neighboring domain, a global optimum is sought iteratively. It is not obvious how to devise a good convergence criterion, as this criterion may not be met depending on the topology. In this paper, the last four iterations are compared to check whether the same configuration is obtained, coupled with a maximum number of iterations. The full algorithm is given in Figure III.5. The array lgrou, of size nthread+1, contains the point indices that bound each edge group, or point boundary group. The array edpar, of size nthread, contains the number of edges in each group. The edges are stored in an array ledge(2,nedge). The array lrenu first contains the thread number. After the load balancing, a second loop over the edges uses edpar to give the new edge number. Figure III.5 only shows the first loop. After this first pass, not all the edges are picked, as their endpoints may not be included in an edge group point extent, so several passes are necessary to enforce that the same group of points is accessed by the same thread during one parallel pass. This may be seen as the overhead associated with the parallelization, which in a classical domain decomposition takes the form of local communications.

Figure III.1. Illustration of the edge groups created for three threads. The point range is divided in three non-overlapping regions, [1:nppg], [nppg+1:2nppg] and [2nppg+1:npoin]. Only the edges whose endpoints lie within one of these three regions are renumbered for this pass.
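A possible construction of the points-surrounding-points graph recommended above, sketched below with assumed array and routine names (the paper does not give its implementation), builds the compressed sparse row arrays psup2 (pointer) and psup1 (neighbour lists) directly from the edge array in two passes over the edges:

!Build the points-surrounding-points graph in compressed sparse row form
!from the edge list: psup1(psup2(ipoin):psup2(ipoin+1)-1) lists the
!neighbours of point ipoin. Names are illustrative.
subroutine ptopgraph(nedge,npoin,ledge,psup2,psup1)
   implicit none
   integer, intent(in)  :: nedge,npoin
   integer, intent(in)  :: ledge(2,nedge)
   integer, intent(out) :: psup2(npoin+1)
   integer, intent(out) :: psup1(2*nedge)
   integer :: iedge,ipoin,ip1,ip2

   psup2=0
   do iedge=1,nedge                       !Count the neighbours of each point
      ip1=ledge(1,iedge)
      ip2=ledge(2,iedge)
      psup2(ip1+1)=psup2(ip1+1)+1
      psup2(ip2+1)=psup2(ip2+1)+1
   enddo
   psup2(1)=1
   do ipoin=2,npoin+1                     !Prefix sum to get the CSR pointer
      psup2(ipoin)=psup2(ipoin)+psup2(ipoin-1)
   enddo
   do iedge=1,nedge                       !Fill the neighbour lists
      ip1=ledge(1,iedge)
      ip2=ledge(2,iedge)
      psup1(psup2(ip1))=ip2
      psup2(ip1)=psup2(ip1)+1
      psup1(psup2(ip2))=ip1
      psup2(ip2)=psup2(ip2)+1
   enddo
   do ipoin=npoin+1,2,-1                  !Shift the pointer back after the fill
      psup2(ipoin)=psup2(ipoin-1)
   enddo
   psup2(1)=1
end subroutine ptopgraph

With this structure, the edges touching a boundary point ipoin are recovered by visiting its neighbours psup1(psup2(ipoin):psup2(ipoin+1)-1), which is exactly what the balancing loop of Figure III.5 needs each time a boundary point value is lowered or raised.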

Edge groups should then be agglomerated in macrogroups of size nthread until all the edges have been considered, in niter iterations. All the unmarked edges are then checked through their point extent, and the same approach could be followed iteratively. However, supposing that there exists an edge joining the first mesh node with the last node, the aforementioned algorithm will never converge. An idea investigated at the beginning was to lower the group number by one at each iteration. At the last iteration, the extent of points is divided by one, and the edges that could not be considered in previous iterations are finally renumbered in this last iteration. However, this strategy gave rise to a large number of iterations with few edges in each group. To avoid this pitfall, a clue is obtained by representing the range of edges left over by the algorithm, as shown in Figures III.2 and III.3. As the previous algorithm relies on decomposing the point data in equally sized chunks, all the edges not considered in the first pass are the boundary edges between the point subdomains. This will motivate the strategy of the next section. For the moment, let us recall that the node partition through the node array, creating groups of size nppg, was sought in order to minimize cache misses and is not necessary to avoid memory contention. A classical colouring algorithm would provide such a property at the expense of increasing cache misses. To ensure convergence, once the ratio of edges renumbered by the previous algorithm becomes lower than a given tolerance, the node partitioning is given up, allowing a group of edges to access points belonging to another domain. As the edges not yet taken into account are boundary edges between two or more threads, the graph of the subdomains is built up, the boundary edges between pairs of subdomains are counted in a first pass, and the first half is assigned to the first subdomain and the second half to the second subdomain of the pair, if the endpoints have not already been touched by other subdomains. An upper bound on the number of iterations is clearly given by the number of different threads associated with the points surrounding a point, which is classical in renumbering schemes for vectorization [23]. A small example is displayed in Figure III.4 to illustrate the different stages of the algorithm.
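A minimal sketch of this final clean-up is given below, with assumed array names (lthread(ipoin) holding the thread that owns point ipoin, lrenu(iedge)==0 flagging a still unassigned edge, and npair an nthread by nthread work array set to zero beforehand). For brevity it alternates the assignment between the two threads of each pair instead of performing the explicit two-pass half split described above, and it omits the check on endpoints already touched by other subdomains.

!Simplified clean-up sketch: each remaining boundary edge is assigned to one
!of the two threads owning its endpoints, alternating so that every pair of
!threads splits its boundary edges roughly in half.
do iedge=1,nedge
   if(lrenu(iedge)==0)then              !Edge not assigned by the node partition
      it1=lthread(ledge(1,iedge))       !Thread owning the first endpoint
      it2=lthread(ledge(2,iedge))       !Thread owning the second endpoint
      if(it1>it2)then                   !Order the pair of threads
         itmp=it1
         it1=it2
         it2=itmp
      endif
      npair(it1,it2)=npair(it1,it2)+1   !Running count of boundary edges of the pair
      if(mod(npair(it1,it2),2)==1)then
         lrenu(iedge)=it1               !Odd occurrences go to the first thread
      else
         lrenu(iedge)=it2               !Even occurrences go to the second thread
      endif
   endif
enddo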

Figure III.2. Illustration of the edges not considered for the first pass of the renumbering with the three threads considered in Figure III.1.

III.B. Domain decomposition on the node graph

As seen in the previous sections, the main aim of the edge renumbering algorithm for shared memory is to produce edge groups that are as large as possible for the first iteration and as well balanced as possible, in order to equidistribute the workload. As will be seen in Section V, various point renumbering schemes are tested to assess their influence on the grouping of edges. Some of these renumbering schemes try to minimize the bandwidth, while others minimize the profile of the point graph.

Figure III.3. Geometrical illustration of the edges not considered for the first pass of the renumbering with the three threads considered in Figure III.1. The six edges represented correspond to the six edges in the algebraic illustration given in Figure III.2.

Figure III.4. Illustration of the edge renumbering for shared memory on a small example.


subroutine renushared()
   !Get the point range [ipmin:ipmax]
   !Get the number of points per group: nppg=(ipmax-ipmin+1)/nthread
   !Initialize lgrou:
   lgrou(1)=1
   do ithread=2,nthread+1
      lgrou(ithread)=lgrou(ithread-1)+nppg
   enddo
   !Get an initial edge distribution
   do iedge=1,nedge
      ip1=ledge(1,iedge)
      ip2=ledge(2,iedge)
      !Find group igrou containing edge iedge
      ipfirst=lgrou(igrou)
      iplast=lgrou(igrou+1)-1
      if(ip1>=ipfirst .and. ip2<=iplast)then
         edpar(igrou)=edpar(igrou)+1    !Count the edge in group igrou
         lrenu(iedge)=igrou             !Mark the edge with its thread number
      endif
   enddo
   !Balance the load by moving the group boundaries iteratively
   do iter=1,itmax                      !Loop on balancing iterations
      do ithread1=1,nthread-1           !Loop on pairs of neighboring groups
         ithread2=ithread1+1
         nedg1=edpar(ithread1)          !Edges currently in the first group
         nedg2=edpar(ithread2)          !Edges currently in the second group
         if(nedg1>nedg2)then            !Has ithread1 too many edges?
            ipfirst1=lgrou(ithread1)    !Get lower bound
            iplast1=lgrou(ithread2)-1   !Get upper bound
            ipoin=lgrou(ithread2)       !Get lower bound
            iplast2=lgrou(ithread2+1)-1 !Get upper bound
            do                          !Loop on boundary point value
               if(ipoin==ipfirst1)exit  !Check bound
               ipoin=ipoin-1            !Move ipoin
               do                       !Loop on edges iedge surrounding ipoin
                  ip1=ledge(1,iedge)
                  ip2=ledge(2,iedge)
                  if(ip1>=ipfirst1 .and. ip2<=iplast1)then
                     ...
