A Comparison of Parallel Algorithms for Multi-dimensional Integration∗

T. L. Freeman† and J. M. Bull†
Abstract A central feature of adaptive algorithms for the numerical approximation of definite integrals is the list containing the sub-intervals and corresponding error estimates. Fundamentally different parallel algorithms result depending on whether the list is maintained as a single shared data structure accessible to all processors, or else as the union of non-overlapping sublists, each private to a processor. We describe several variants of these approaches, and compare numerical performances of the algorithms on multi-dimensional problems.
1  Introduction
In this paper, we consider the problem of approximating the definite multi-dimensional integral
$$I = \int_{a_1}^{b_1} \int_{a_2}^{b_2} \cdots \int_{a_n}^{b_n} f(x_1, x_2, \ldots, x_n)\, dx_1\, dx_2 \ldots dx_n$$
to a specified absolute accuracy ǫ, and focus on the design of algorithms that are suitable for parallel execution. In recent years, motivated by different programming models, two approaches to parallel numerical integration have emerged. One is based on adapting the ideas of sequential globally adaptive algorithms to the parallel context by selecting a number of subregions of the range of integration, rather than simply the one with the largest associated error estimate; see, for example, [2], [3], [7], [10] and [14]. The other approach proceeds by imposing an initial static partitioning of the range (region) of integration and treats the resulting subproblems as independent, and therefore capable of concurrent solution. The more sophisticated of these latter algorithms include a mechanism for detecting load imbalance and for redistributing work to neighbouring processors; see, for example, [1], [4], [13] for the one-dimensional case, and [5], [6], [12] for the multi-dimensional case.

The key difference between these approaches is that the former effectively maintains a single list of subregions and is particularly suitable for implementation in a shared-data programming model, whereas in the latter, which is better suited to a message-passing programming model, each processor maintains an independent list of the subregions with which it is dealing; it is the union of these independent lists that corresponds to the single list of the former algorithms. We refer to the two types of algorithm as single list and multiple list algorithms respectively.

∗ Both authors acknowledge the support of the EEC Esprit Basic Research Action Programme, Project 6634 (APPARC); the second author acknowledges the support of NATO Collaborative Research Grant 920037.
† Centre for Novel Computing, University of Manchester, Manchester, M13 9PL, U.K.
We base the multi-dimensional algorithms in this paper on the routine D01FCF in the NAG library [15]. The underlying cubature rule pair of this routine is that due to Genz and Malik [9] (see also [8]); it requires $2^d + 2d^2 + 2d + 1$ evaluations of the integrand to estimate an integral over a d-dimensional hyper-rectangle. The routine also returns an estimate of the error in the approximation, and identifies the dimension of the hyper-rectangle in which the integrand is most badly behaved by estimating fourth divided differences of the integrand, so that any further bisections of the hyper-rectangle can concentrate on this dimension.
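As a small illustration of how the cost of each rule application grows with the dimension d, the following Python fragment tabulates this point count (our own sketch; the function name is hypothetical and is not part of D01FCF or [9]):

    def genz_malik_points(d: int) -> int:
        # Evaluations required by one application of the Genz--Malik
        # rule pair over a d-dimensional hyper-rectangle.
        return 2**d + 2 * d**2 + 2 * d + 1

    if __name__ == "__main__":
        for d in range(1, 7):
            print(f"d = {d}: {genz_malik_points(d)} evaluations")
        # d = 5 gives 93 evaluations per subregion; d = 6 gives 149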
2  Parallel Algorithms
In this section we describe in detail the two classes of algorithms identified in Section 1.
2.1  Single List Algorithms
Parallel globally adaptive (single list) algorithms were suggested in the papers of Genz [7] and (in the context of vector processors) Gladwell [10]; they are referred to as dynamic synchronous (DS) algorithms in Bull and Freeman [2]. The essence of the algorithms is a synchronised region selection phase, during which a number of subregions that require further refinement are identified, followed by the concurrent application of the numerical integration (cubature) rule to the identified subregions. A generic version of the algorithm is described by the following pseudocode, where p is the number of processors:

Algorithm SL:
    p-sect region [a1, b1] × [a2, b2] × · · · × [an, bn]
    do par
        apply integration rule to subregions
    end do par
    do while (error > ǫ) and (number of rule evaluations ≤ Nmax)
        Subregion Selection
        do par
            compute new subregion limits
            apply cubature rule to new subregions
        end do par
        do par
            remove old subregions from list
            add new subregions to list
            update integral approximation and error estimate
        end do par
    end do

For one-dimensional problems there is a unique way to achieve the initial p-secting of the region, but in several dimensions there is a choice. We have found that the optimal way of generating the initial p-secting, in terms of total execution time to solve the problem, is to allow an initial sequential execution of the algorithm, until p subregions have been generated. In this phase, parallelism may be exploited in concurrent evaluations of the integrand, if desired.
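The following Python sketch (ours, not the NAG code) illustrates the single list structure under deliberately crude assumptions: a toy midpoint rule, with the difference from a corner-average estimate as its error estimate, stands in for the Genz and Malik rule pair; regions are bisected along their longest edge rather than the worst-behaved dimension; and selection simply takes the m regions with the largest error estimates:

    import itertools
    import math
    from concurrent.futures import ThreadPoolExecutor

    def toy_rule(f, lo, hi):
        # Stand-in for the Genz--Malik rule pair of [9]: a midpoint estimate,
        # with the difference from a corner-average estimate as a crude
        # error estimate.  A real code would use the rule pair itself.
        vol = math.prod(b - a for a, b in zip(lo, hi))
        mid = [(a + b) / 2 for a, b in zip(lo, hi)]
        v_mid = f(mid) * vol
        corners = list(itertools.product(*zip(lo, hi)))
        v_cor = vol * sum(f(list(c)) for c in corners) / len(corners)
        return v_mid, abs(v_mid - v_cor)

    def bisect(lo, hi):
        # Bisect along the longest edge; D01FCF would instead use the
        # dimension in which the integrand is most badly behaved (Section 1).
        k = max(range(len(lo)), key=lambda i: hi[i] - lo[i])
        m = (lo[k] + hi[k]) / 2
        return [(lo[:], hi[:k] + [m] + hi[k + 1:]),
                (lo[:k] + [m] + lo[k + 1:], hi[:])]

    def integrate_sl(f, lo, hi, eps, m=4, max_evals=100_000, workers=4):
        # Algorithm SL with the simplest selection strategy: take the m
        # regions with the largest error estimates.  ThreadPoolExecutor
        # only illustrates the 'do par' structure; a pure-Python integrand
        # gains little real speedup because of the GIL.
        def evaluate(pair):
            rlo, rhi = pair
            val, err = toy_rule(f, rlo, rhi)
            return (err, val, rlo, rhi)

        with ThreadPoolExecutor(workers) as pool:
            regions = [evaluate((list(lo), list(hi)))]
            evals = 1
            while True:
                total = sum(r[1] for r in regions)
                err = sum(r[0] for r in regions)
                if err <= eps or evals >= max_evals:
                    return total, err
                regions.sort(key=lambda r: r[0], reverse=True)  # selection
                selected, regions = regions[:m], regions[m:]
                halves = [h for (_, _, rlo, rhi) in selected
                          for h in bisect(rlo, rhi)]
                regions += list(pool.map(evaluate, halves))     # 'do par'
                evals += len(halves)

    # e.g. integrate_sl(lambda x: math.cos(sum(x)) ** 2,
    #                   [0.0, 0.0], [1.0, 1.0], 1e-4)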
The objective of the selection strategy is to select sufficient regions with large error estimates to keep all the processors usefully busy. For one-dimensional problems we find that, when only a few intervals are selected, it is worthwhile dividing the selected intervals into several subintervals to generate sufficient work to keep the processors usefully busy; for multi-dimensional problems, however, we find that, except in very special circumstances, this strategy is not worthwhile, and thus we adopt a bisection-only approach for our multi-dimensional algorithms. Note that the globally adaptive algorithms implemented in sequential libraries, such as the NAG library [15] and QUADPACK [16], select a single subregion at each stage—that with the largest error estimate.

In [7], Genz proposes simply selecting (and bisecting) the p/2 regions with the largest error estimates. Gladwell in [10] suggests a region selection strategy that selects all regions with error estimates above a certain value, chosen such that the sum of the error estimates of the unselected regions is as large as possible without exceeding ǫ. This strategy, unlike that of Genz [7], is thus independent of the number of processors on the machine. However, both Genz's and Gladwell's strategies are inherently sequential, as they require the list of regions (which may become very long in the solution of multi-dimensional problems) to be heap-sorted by error estimates. In [2], we proposed selection strategies for one-dimensional quadrature which avoid the use of sorting information, but these strategies are not so well suited to the multi-dimensional case. Napierala and Gladwell [14] propose their “Level” algorithm as a means of obtaining similar results to Gladwell's strategy, while reducing the cost of ranking the list of regions and allowing the selection strategy itself to be parallelised. A similar strategy, here referred to as the Adaptive Level Selection Strategy, is suggested in Bull and Freeman [3]. In the Adaptive Level Selection Strategy the regions are sorted into levels (or bins) according to their error estimates—the aim is to identify sets of regions that have error estimates of approximately equal magnitudes:

1. Search the list of regions for the largest and smallest error estimates ($E_{\max}$ and $E_{\min}$).

2. Divide the range $[E_{\min}, E_{\max}]$ into b exponentially spaced levels. (For our experiments we use b = 50.) The ith level, i = 1, 2, ..., b, is given by
$$[\exp(\log(E_{\min}) + (i-1)l),\; \exp(\log(E_{\min}) + il)],$$
where $l = (\log(E_{\max}) - \log(E_{\min}))/b$.

3. Determine the number of regions $n_i$ whose error estimates lie in the ith level.

4. Find r such that
$$\sum_{i=1}^{r-1} M_i n_i < \epsilon \quad\text{and}\quad \sum_{i=1}^{r} M_i n_i \geq \epsilon,$$
where $M_i = \exp(\log(E_{\min}) + (i - 1/2)l)$ is the exponential midpoint of the ith level.
5. Select all regions with error estimates greater than $\exp(\log(E_{\min}) + (r-1)l)$, that is, all regions whose error estimates lie in levels r, r+1, ..., b. In other words, we select all the regions with larger error estimates so that the (approximate) sum of the error estimates associated with the remaining regions is less than the required tolerance.

Note that, because of the approximate nature of the selection, it is possible that
$$\sum_{i=1}^{b} M_i n_i < \epsilon$$
even though the sum of the error estimates is greater than ǫ. In this case we select all the regions in the largest non-empty level. Note also that the selection strategy requires three passes through the list of regions: one to find $E_{\max}$ and $E_{\min}$, one to determine $n_i$, i = 1, 2, ..., b, and a final one to find the indices of the selected regions. By allowing separate processors to operate on separate sublists, it is possible to execute each of the three passes in parallel, with a synchronisation point after each pass.

Napierala and Gladwell [14] suggest the use of levels with fixed endpoints; this has the advantage that each region needs to be classified only once. However, this imposes additional storage requirements, and there is a strong possibility that all new regions lie in a small number of levels, reducing the parallel efficiency of the selection phase.
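For concreteness, the strategy can be sketched in Python as a single sequential routine operating on a list of error estimates (our own transcription of steps 1–5; in practice the three passes would be parallelised over sublists as just described):

    import math

    def adaptive_level_select(errors, eps, b=50):
        # Steps 1-5 above, written sequentially.  Error estimates are
        # assumed positive.  Levels are indexed from 0 here, so level i
        # corresponds to level i+1 in the text.
        e_min, e_max = min(errors), max(errors)
        if e_max <= e_min:
            return list(range(len(errors)))      # degenerate: select all
        l = (math.log(e_max) - math.log(e_min)) / b
        n = [0] * b                              # pass 2: count n_i per level
        for e in errors:
            i = min(int((math.log(e) - math.log(e_min)) / l), b - 1)
            n[i] += 1
        partial = 0.0                            # step 4: find the cut level r
        for r in range(b):
            mid = math.exp(math.log(e_min) + (r + 0.5) * l)  # midpoint M_{r+1}
            if partial + mid * n[r] >= eps:
                break
            partial += mid * n[r]
        else:
            # approximate sum never reaches eps: select only the regions
            # in the largest non-empty level
            r = max(i for i in range(b) if n[i] > 0)
        # step 5: select everything at or above the lower endpoint of level r
        threshold = math.exp(math.log(e_min) + r * l)
        return [j for j, e in enumerate(errors) if e >= threshold]

    # e.g. adaptive_level_select([1e-3, 5e-4, 2e-6, 1e-7], eps=1e-5)
    # selects the regions with the two largest error estimates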
2.2  Multiple List Algorithms
The alternative approach to parallel numerical integration is to base an algorithm on an initial static distribution of the work; for example, for a machine with p processors, the region of integration is divided into p subregions and each sub-integral is assigned to a separate processor. As in Section 2.1, this initial p-secting of the region of integration is best achieved by initial (sequential) applications of the integration rule until a list of p subregions has been generated. For most integrals, an initial static subdivision of the region of integration alone would lead to an unbalanced computational load and consequently an inefficient parallel computation. To obviate this difficulty a load rebalancing strategy is usually incorporated into the algorithm; see, for example, de Doncker and Kapenga ([5] and [6]) and Lapegna and D'Alessio ([12]). In this section we introduce algorithms of this type and describe a new load rebalancing strategy. As noted in Section 1, we refer to these algorithms as multiple list algorithms, since each processor maintains its own, independent, list of regions. The basic multiple list algorithm is described by the following pseudocode:

Algorithm ML:
    p-sect region [a1, b1] × [a2, b2] × · · · × [an, bn] and assign each subregion to a different processor
    do par
        do while (error > ǫ) and (number of rule evaluations ≤ Nmax/p)
            apply k steps of globally adaptive numerical integration
            LOAD REBALANCING
        end do
    end do par

Note that the same global error convergence condition is used to terminate both the single list and multiple list algorithms. For the multiple list algorithms, this requires synchronisation of the processors, and is therefore implemented during the synchronised
section of code required by the load rebalancing. Other authors have suggested using a local convergence condition on each processor. However, in the presence of load rebalancing, it is not safe for a processor to terminate until all processors have satisfied their local condition, so that no further transfer of regions can take place. Hence the global condition requires no more synchronisation than the local condition, and as the global condition will generally terminate earlier, there is no reason to prefer the local condition.

2.2.1  Load Rebalancing Strategy

This load rebalancing strategy is based on that suggested in [12], but we use the total error estimate in a processor's sublist, rather than the largest error estimate, as the estimator of the load on a processor. The processors are logically arranged in a 2-dimensional torus; each processor has exactly four neighbours, labelled North, East, South and West. This processor topology is not intended to reflect the network topology of any machine. For example, on a machine such as the KSR-1 with a single ring:0, there is no concept of processor locality; see Section 3 for further details of the architecture of the KSR-1. Indeed, for most of the latest generation of parallel architectures, details of the network topology can safely be ignored.

Each processor maintains a heap-sorted list of regions (the order being determined by the magnitudes of the corresponding error estimates). After each even step of the globally adaptive algorithm, alternate processors compare their sum of error estimates with that of their Easterly neighbour and, if necessary, send the subregions with the larger error estimates to their neighbour until the sums of error estimates on the two processors are as near equal as possible. After each odd step of the globally adaptive algorithm the same exchange algorithm is executed, but this time the comparison is with the Northerly neighbour. At every step the global error estimate is updated, and used to determine termination.

It remains to specify the choice of k, the number of steps of adaptive quadrature to apply between each load rebalancing stage. In Section 4, we present results for k = 1, and for the following scheme, which reduces the amount of synchronisation required by increasing k as the algorithm progresses. We let k = max{1, [γs]}, where s is the number of steps executed up to and including the one on which the last synchronisation occurred, and [s] denotes the integer closest to s. For example, for γ = 0.1, load rebalancing and termination detection occur at each of the first 20 stages, then at every other stage for the next 10 stages, then at every third stage, and so on. It can be shown that the number of subregions processed increases by at most a fraction γ, but the complexity of the rebalancing and termination detection is considerably reduced. For the results of Section 4 a value of γ = 0.05 is used.
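The two ingredients of this strategy can be sketched as follows (our own Python transcription; sublists are represented as max-heaps of (-error, region id) pairs, and Python's round, which rounds halves to even, stands in for the nearest-integer function [·]):

    import heapq

    def balance_pair(a, b):
        # Pairwise exchange between two neighbouring processors: move the
        # largest-error regions from the heavier sublist to the lighter one
        # until a further move would no longer reduce the imbalance in the
        # two sums of error estimates.
        def load(h):
            return -sum(neg for neg, _ in h)
        donor, receiver = (a, b) if load(a) >= load(b) else (b, a)
        while donor:
            e = -donor[0][0]                    # largest error in donor
            gap = load(donor) - load(receiver)  # current imbalance
            if abs(gap - 2 * e) >= gap:         # moving e would not help
                break
            heapq.heappush(receiver, heapq.heappop(donor))

    def sync_steps(gamma, last_step):
        # Yield the step numbers at which load rebalancing and termination
        # detection occur, using k = max(1, round(gamma * s)).
        s = 0
        while s < last_step:
            s += max(1, round(gamma * s))
            yield s

    # Example: with gamma = 0.1, the early steps synchronise at every
    # stage, and the interval between synchronisations then grows steadily.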
3  The Kendall Square Research KSR-1
We have implemented the algorithms described in Section 2 on the 32-processor Kendall Square Research KSR-1 computer installed at the University of Manchester. Although no longer in production, the KSR-1 remains a very advanced parallel computing system and its non-uniform memory access architecture is typical of many current distributed memory machines, in that it supports virtual shared memory in hardware. Each processor has a peak 64-bit floating point performance of 40 Mflop/s and 32 Mbytes of memory. Up to 32 processors are connected by a uni-directional slotted ring:0 network with a bandwidth of 1 Gbyte/s. The memory system, called ALLCACHE, is a directory-based system which supports full cache coherency. Data movement is request driven; a memory read operation which
cannot be satisfied by a processor's own memory generates a request which traverses the ring:0 and returns a copy of the data item to the requesting processor; a memory write request which cannot be satisfied by a processor's own memory results in that processor obtaining exclusive ownership of the data item, and a message traverses the network invalidating all other copies of the item. The unit of data transfer in the system is a subpage, which consists of 128 bytes (16 8-byte words). Parallel programming is supported by extensions to Fortran consisting of directives and library calls, in a shared memory (process synchronisation) paradigm [11].
4  Numerical Results
In this section we compare the performances of the implementations described in Section 2 on the following two test problems, which are based on those given in [17].

Problem 1:
$$I_1 = \int_0^1 \int_0^1 \cdots \int_0^1 \cos^2\!\left(\sum_{i=1}^{5} i x_i\right) dx_1\, dx_2 \ldots dx_5, \qquad \epsilon = 1.11 \times 10^{-5}.$$
Problem 2:
$$I_2 = \int_0^1 \int_0^1 \cdots \int_0^1 \prod_{i=1}^{6} \left(0.36 + (x_i - 0.3)^2\right)^{-1} dx_1\, dx_2 \ldots dx_6, \qquad \epsilon = 1.47 \times 10^{-4}.$$
The integrand of Problem 1 is oscillatory in nature, while that of Problem 2 has a single large peak. In each case the required absolute accuracy is chosen so that a conventional sequential adaptive quadrature algorithm based on the Genz and Malik rule pair (as implemented in D01FCF) requires approximately $10^7$ integrand evaluations.
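For reference, the two integrands transcribe directly into Python (our transcription of the formulas above; x is a sequence of coordinates in [0, 1]^d):

    import math

    def f1(x):
        # Problem 1 integrand: cos^2(sum_i i * x_i), oscillatory, d = 5
        return math.cos(sum(i * xi for i, xi in enumerate(x, start=1))) ** 2

    def f2(x):
        # Problem 2 integrand: prod_i (0.36 + (x_i - 0.3)^2)^(-1),
        # a single large peak at x_i = 0.3, d = 6
        p = 1.0
        for xi in x:
            p *= 0.36 + (xi - 0.3) ** 2
        return 1.0 / p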
[Figure 1: two panels plotting the number of function evaluations and the temporal performance (solutions/second) against the number of processors, for SL Genz, SL Gladwell, SL AdapLev, ML FullSync, ML RedSync and Ideal.]

Fig. 1. Problem 1: No. of function evaluations and temporal performance on KSR-1.
Figures 1 and 2 show temporal performance (defined as the reciprocal of the execution time) and the number of integrand function evaluations for the parallel algorithms on both of the test problems. We present results for five parallel algorithms: those of [7] (SL Genz) and [10] (SL Gladwell), the Adaptive Level algorithm of Section 2.1 (SL AdapLev), and the multiple list algorithm of Section 2.2 with synchronisation after every stage (ML FullSync) and with reduced synchronisation (ML RedSync). The ‘Ideal’ number of function evaluations is simply the number of function evaluations required by the sequential algorithm. The ‘Ideal’ value of temporal performance is computed as the number of processors p times the temporal performance of the sequential algorithm (a version of D01FCF modified to use heap sorting rather than the much less efficient list insertion of the original code).

[Figure 2: two panels plotting the number of function evaluations and the temporal performance (solutions/second) against the number of processors, for the same six curves as Figure 1.]

Fig. 2. Problem 2: No. of function evaluations and temporal performance on KSR-1.
5  Discussion and Conclusions
On both test problems we find that both Genz's and Gladwell's algorithms require similar numbers of function evaluations to the sequential algorithm. In terms of temporal performance, however, Genz's algorithm performs less well—this is because it requires the larger number of synchronisation points. Gladwell's algorithm does somewhat better, but the selection phase is still a substantial sequential bottleneck. With full synchronisation, the multiple list algorithm also requires a similar number of function evaluations to the sequential algorithm, but its temporal performance is better than that of either Genz's or Gladwell's algorithm. The multiple list algorithm has concurrent region selection, but, like Genz's algorithm, it requires a large number of synchronisation points.

On both test problems, the temporal performances of the Adaptive Level single list algorithm and the reduced synchronisation multiple list algorithm are very similar, and markedly better than those of any of the other algorithms. Except for the Adaptive Level algorithm on Problem 1, this improved performance is gained at the expense of some additional function evaluations, though these never exceed 5% of the number required by the sequential algorithm.

We have shown that efficient and scalable parallel algorithms for globally adaptive quadrature can result from both the single list and multiple list approaches. The major advantage of the multiple list approach is that it can be readily implemented in either a message passing or a shared memory programming paradigm, whereas the single list approach is suitable only for shared memory. The multiple list approach also makes better use of data locality, and thus has lower communication overheads than the single list approach; this is particularly important for problems where the list of subregions grows very large. On the other hand, obtaining good performance from the multiple list algorithm is strongly dependent on choosing a good value of γ. By comparison, the Adaptive Level algorithm is relatively insensitive to the choice of b. Another advantage of the single list algorithms is that if we use bisection only for subdividing selected regions, the result is independent of the number of processors, which is not the case for the multiple list algorithms.
References

[1] J. Berntsen and T. O. Espelid, A Parallel Global Adaptive Quadrature Algorithm for Hypercubes, Parallel Computing, 8 (1988), pp. 313–323.
[2] J. M. Bull and T. L. Freeman, Parallel Globally Adaptive Quadrature on the KSR-1, Advances in Computational Mathematics, 2 (1994), pp. 357–373.
[3] J. M. Bull and T. L. Freeman, Parallel Algorithms for Numerical Integration in One and Several Dimensions, Applied Numerical Mathematics, 19 (1995), pp. 3–16.
[4] K. Burrage, An Adaptive Numerical Integration Code for a Chain of Transputers, Parallel Computing, 16 (1990), pp. 305–312.
[5] E. de Doncker and J. Kapenga, Parallel Systems and Adaptive Integration, Contemporary Mathematics, 115 (1990), pp. 33–51.
[6] E. de Doncker and J. Kapenga, Parallel Cubature on Loosely Coupled Systems, pp. 317–327 of T. O. Espelid and A. Genz (eds.), Numerical Integration, Kluwer Academic Publishers, Dordrecht, 1992.
[7] A. Genz, The Numerical Evaluation of Multiple Integrals on Parallel Computers, pp. 219–229 of G. Fairweather and P. M. Keast (eds.), Numerical Integration: Recent Developments, Software and Applications, NATO ASI Series C203, D. Reidel, Dordrecht, 1987.
[8] A. Genz, Subregion Adaptive Algorithms for Multiple Integrals, Contemporary Mathematics, 115 (1990), pp. 23–31.
[9] A. C. Genz and A. A. Malik, Remarks on Algorithm 006: An Adaptive Algorithm for Numerical Integration over an N-dimensional Rectangular Region, J. Comput. Appl. Math., 6 (1980), pp. 295–302.
[10] I. Gladwell, Vectorisation of One Dimensional Quadrature Codes, pp. 230–238 of G. Fairweather and P. M. Keast (eds.), Numerical Integration: Recent Developments, Software and Applications, NATO ASI Series C203, D. Reidel, Dordrecht, 1987.
[11] K.S.R., KSR Fortran Programming, Kendall Square Research, Waltham, Mass., 1991.
[12] M. Lapegna and A. D'Alessio, A Scalable Parallel Algorithm for the Adaptive Multidimensional Quadrature, pp. 933–936 of R. F. Sincovec, D. E. Keyes, M. R. Leuze, L. R. Petzold and D. A. Reed (eds.), Proceedings of the Sixth SIAM Conference on Parallel Processing, SIAM, Philadelphia, 1993.
[13] V. A. Miller and G. J. Davis, Adaptive Quadrature on a Message-Passing Multiprocessor, Journal of Parallel and Distributed Computing, 14 (1992), pp. 417–425.
[14] M. A. Napierala and I. Gladwell, Reducing Ranking Effects in Parallel Global Adaptive Quadrature, pp. 647–651 of Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, SIAM, Philadelphia, 1995.
[15] N.A.G., NAG Fortran Library Manual, Mark 15, N.A.G. Ltd., Oxford, 1991.
[16] R. Piessens, E. de Doncker, C. Überhuber and D. Kahaner, QUADPACK: A Subroutine Package for Automatic Integration, Springer-Verlag, New York, 1983.
[17] I. H. Sloan and S. Joe, Lattice Methods for Multiple Integration, Oxford University Press, Oxford, 1994.