PARALLEL ALGORITHMS FOR GLOBALLY ADAPTIVE QUADRATURE

A thesis submitted to the University of Manchester for the degree of Doctor of Philosophy in the Faculty of Science and Engineering
January 1997
By Jonathan Mark Bull
Department of Mathematics
Contents

Abstract
Declaration
Copyright
The Author
Acknowledgements
1 Introduction
2 Parallel Computing
  2.1 Why use parallel computers?
  2.2 Architecture of parallel computers
    2.2.1 Hardware
    2.2.2 Software
  2.3 Performance measurement
3 Quadrature Algorithms
  3.1 Quadrature in one dimension
    3.1.1 Quadrature rules
    3.1.2 Gauss rules
    3.1.3 Composite rules
    3.1.4 Error estimation
    3.1.5 Automatic quadrature
    3.1.6 Extrapolation
  3.2 Multidimensional quadrature
    3.2.1 Adaptive quadrature
    3.2.2 Other quadrature techniques
  3.3 Parallel algorithms
    3.3.1 Single list algorithms
    3.3.2 Multiple list algorithms
4 The Kendall Square Research KSR-1
  4.1 Hardware architecture
  4.2 Programming model
5 Parallel Single List Algorithms for One-Dimensional Quadrature
  5.1 Algorithms
    5.1.1 Selection strategies
    5.1.2 Other algorithms
  5.2 Implementation issues
  5.3 Numerical experiments
  5.4 Analysis
  5.5 Discussion
6 Parallel Multiple List Algorithms for One-Dimensional Quadrature
  6.1 Algorithms
  6.2 Implementation issues
  6.3 Numerical experiments
  6.4 Analysis
  6.5 Discussion
7 Parallel Algorithms for Singular Integrands in One Dimension
  7.1 Algorithms
  7.2 Numerical experiments
  7.3 Analysis
  7.4 Discussion
8 Parallel Single List Algorithms for Multi-Dimensional Quadrature
  8.1 Algorithms
    8.1.1 Selection strategies
    8.1.2 Parallel selection and updating
  8.2 Implementation issues
  8.3 Numerical experiments
  8.4 Analysis
  8.5 Discussion
9 Parallel Multiple List Algorithms for Multi-Dimensional Quadrature
  9.1 Algorithms
  9.2 Implementation issues
  9.3 Numerical experiments
  9.4 Analysis
  9.5 Discussion
10 Conclusions
Bibliography
List of Tables

5.1 Performance of Algorithm CP
5.2 Performance of Algorithms CPH and CPL
8.1 Performance of Algorithm CPH
8.2 Total number of function evaluations required
8.3 Number of selection stages
9.1 Percentage of execution time due to remote accesses
List of Figures

5.1 Pseudo-code for Algorithm SS
5.2 Pseudo-code for the subdivision strategy
5.3 Pseudo-code for Algorithm AS
5.4 Problem 1: Temporal performance of Algorithm SSAL on KSR-1
5.5 Problem 2: Temporal performance of Algorithm SSAL on KSR-1
5.6 Problem 3: Temporal performance of Algorithm SSAL on KSR-1
5.7 Problem 4: Temporal performance of Algorithm SSAL on KSR-1
5.8 Problem 5: Temporal performance of Algorithm SSAL on KSR-1
5.9 Problem 1: Temporal performance of Algorithms SS, AS and LL on KSR-1
5.10 Problem 1: Total number of function evaluations performed by Algorithms SS, AS and LL on KSR-1
5.11 Problem 2: Temporal performance of Algorithms SS, AS and LL on KSR-1
5.12 Problem 2: Total number of function evaluations performed by Algorithms SS, AS and LL on KSR-1
5.13 Problem 3: Temporal performance of Algorithms SS, AS and LL on KSR-1
5.14 Problem 3: Total number of function evaluations performed by Algorithms SS, AS and LL on KSR-1
5.15 Problem 4: Temporal performance of Algorithms SS, AS and LL on KSR-1
5.16 Problem 4: Total number of function evaluations performed by Algorithms SS, AS and LL on KSR-1
5.17 Problem 5: Temporal performance of Algorithms SS, AS and LL on KSR-1
5.18 Problem 5: Total number of function evaluations performed by Algorithms SS, AS and LL on KSR-1
6.1 Pseudo-code for Algorithm ML
6.2 Problem 1: Temporal performance of Algorithms ML and SS on KSR-1
6.3 Problem 1: Total number of function evaluations performed by Algorithms ML and SS on KSR-1
6.4 Problem 2: Temporal performance of Algorithms ML and SS on KSR-1
6.5 Problem 2: Total number of function evaluations performed by Algorithms ML and SS on KSR-1
6.6 Problem 3: Temporal performance of Algorithms ML and SS on KSR-1
6.7 Problem 3: Total number of function evaluations performed by Algorithms ML and SS on KSR-1
6.8 Problem 4: Temporal performance of Algorithms ML and SS on KSR-1
6.9 Problem 4: Total number of function evaluations performed by Algorithms ML and SS on KSR-1
6.10 Problem 5: Temporal performance of Algorithms ML and SS on KSR-1
6.11 Problem 5: Total number of function evaluations performed by Algorithms ML and SS on KSR-1
7.1 Problem 1: Temporal performance of Algorithms SSESXF, SSESXC, SSESGM, SSES, CPL and CPLX on KSR-1
7.2 Problem 1: Total number of function evaluations performed by Algorithms SSESXF, SSESXC, SSESGM, SSES, CPL and CPLX on KSR-1
7.3 Problem 2: Temporal performance of Algorithms SSESXF, SSESXC, SSESGM, SSES, CPL and CPLX on KSR-1
7.4 Problem 2: Total number of function evaluations performed by Algorithms SSESXF, SSESXC, SSESGM, SSES, CPL and CPLX on KSR-1
7.5 Problem 3: Temporal performance of Algorithms SSESXF, SSESXC, SSESGM, SSES, CPL and CPLX on KSR-1
7.6 Problem 3: Total number of function evaluations performed by Algorithms SSESXF, SSESXC, SSESGM, SSES, CPL and CPLX on KSR-1
8.1 Problem 1: Total number of function evaluations performed by Algorithm SSGL using different subdivision strategies
8.2 Problem 2: Total number of function evaluations performed by Algorithm SSGL using different subdivision strategies
8.3 Problem 3b: Total number of function evaluations performed by Algorithm SSGL using different subdivision strategies
8.4 Problem 4b: Total number of function evaluations performed by Algorithm SSGL using different subdivision strategies
8.5 Problem 1: Temporal performance of Algorithm SSAL on KSR-1 using 16 processors
8.6 Problem 2: Temporal performance of Algorithm SSAL on KSR-1 using 16 processors
8.7 Problem 3b: Temporal performance of Algorithm SSAL on KSR-1 using 16 processors
8.8 Problem 4b: Temporal performance of Algorithm SSAL on KSR-1 using 16 processors
8.9 Problem 1: Temporal performance of Algorithms SSAB and SSNG on KSR-1 using 16 processors
8.10 Problem 1: Total number of function evaluations performed by Algorithms SSAB and SSNG
8.11 Problem 2: Temporal performance of Algorithms SSAB and SSNG on KSR-1 using 16 processors
8.12 Problem 2: Total number of function evaluations performed by Algorithms SSAB and SSNG
8.13 Problem 3b: Temporal performance of Algorithms SSAB and SSNG on KSR-1 using 16 processors
8.14 Problem 3b: Total number of function evaluations performed by Algorithms SSAB and SSNG
8.15 Problem 4b: Temporal performance of Algorithms SSAB and SSNG on KSR-1 using 16 processors
8.16 Problem 4b: Total number of function evaluations performed by Algorithms SSAB and SSNG
8.17 Problem 1: Temporal performance of Algorithms SS and LL on KSR-1
8.18 Problem 2: Temporal performance of Algorithms SS and LL on KSR-1
8.19 Problem 3a: Temporal performance of Algorithms SS and LL on KSR-1
8.20 Problem 3b: Temporal performance of Algorithms SS and LL on KSR-1
8.21 Problem 3c: Temporal performance of Algorithms SS and LL on KSR-1
8.22 Problem 4a: Temporal performance of Algorithms SS and LL on KSR-1
8.23 Problem 4b: Temporal performance of Algorithms SS and LL on KSR-1
8.24 Problem 4c: Temporal performance of Algorithms SS and LL on KSR-1
9.1 Problem 1: Predicted efficiency of Algorithm MLGBR on KSR-1
9.2 Problem 2: Predicted efficiency of Algorithm MLGBR on KSR-1
9.3 Problem 3b: Predicted efficiency of Algorithm MLGBR on KSR-1
9.4 Problem 4b: Predicted efficiency of Algorithm MLGBR on KSR-1
9.5 Problem 1: Temporal performance of Algorithms ML and SS on KSR-1
9.6 Problem 1: Total number of function evaluations performed by Algorithms ML and SS on KSR-1
9.7 Problem 2: Temporal performance of Algorithms ML and SS on KSR-1
9.8 Problem 2: Total number of function evaluations performed by Algorithms ML and SS on KSR-1
9.9 Problem 3a: Temporal performance of Algorithms ML and SS on KSR-1
9.10 Problem 3a: Total number of function evaluations performed by Algorithms ML and SS on KSR-1
9.11 Problem 3b: Temporal performance of Algorithms ML and SS on KSR-1
9.12 Problem 3b: Total number of function evaluations performed by Algorithms ML and SS on KSR-1
9.13 Problem 3c: Temporal performance of Algorithms ML and SS on KSR-1
9.14 Problem 3c: Total number of function evaluations performed by Algorithms ML and SS on KSR-1
9.15 Problem 4a: Temporal performance of Algorithms ML and SS on KSR-1
9.16 Problem 4a: Total number of function evaluations performed by Algorithms ML and SS on KSR-1
9.17 Problem 4b: Temporal performance of Algorithms ML and SS on KSR-1
9.18 Problem 4b: Total number of function evaluations performed by Algorithms ML and SS on KSR-1
9.19 Problem 4c: Temporal performance of Algorithms ML and SS on KSR-1
9.20 Problem 4c: Total number of function evaluations performed by Algorithms ML and SS on KSR-1
Abstract

Globally adaptive quadrature is a well-established technique for computing a numerical approximation to a definite integral in one or more dimensions. For some integrals, the technique makes heavy demands on computational resources, both in terms of execution time and data storage requirements. Parallel computers offer the possibility of meeting such demands, provided that an effective parallel algorithm can be employed. The standard sequential algorithm for globally adaptive quadrature lacks coarse-grained parallelism, and is therefore not well suited to the majority of current parallel computer architectures. Hence there is a requirement for alternative algorithms which are better able to exploit the power of parallel computing.

In this thesis, existing parallel algorithms are critically examined and compared. New parallel algorithms are developed which give significant performance improvements over the existing algorithms, as well as satisfying other criteria which make them suitable for general purpose use. We present experimental results for a variety of test problems, for both one-dimensional and multidimensional quadrature. These results are obtained on a virtual shared memory computer (a Kendall Square Research KSR-1), which for the first time allows the direct comparison of parallel algorithms designed for traditional shared memory architectures with those designed for traditional distributed memory architectures.
Declaration

No portion of the work referred to in this thesis has been submitted in support of an application for another degree or qualification of this or any other university or other institute of learning.
Copyright

Copyright in text of this thesis rests with the Author. Copies (by any process) either in full, or of extracts, may be made only in accordance with instructions given by the Author and lodged in the John Rylands University Library of Manchester. Details may be obtained from the Librarian. This page must form part of any such copies made. Further copies (by any process) of copies made in accordance with such instructions may not be made without the permission (in writing) of the Author.

The ownership of any intellectual property rights which may be described in this thesis is vested in the University of Manchester, subject to any prior agreement to the contrary, and may not be made available for use by third parties without the written permission of the University, which will prescribe the terms and conditions of any such agreement.

Further information on the conditions under which disclosures and exploitation may take place is available from the head of Department of Mathematics.
The Author
The author graduated from Cambridge University in July 1988 with a BA(Hons) in Mathematics. From September 1988 to August 1990 he was engaged in research on numerical models of the atmospheric boundary layer at the Meteorological Office. During the following year he completed the MSc course in Numerical Analysis and Computing at the University of Manchester, for which he submitted a thesis entitled `Asynchronous Jacobi Iterations on Local Memory Parallel Computers'. Since October 1991 he has held the position of Research Associate in the Department of Computer Science at the University of Manchester, with research interests in algorithms, applications and programming techniques for high-performance parallel computers.
Acknowledgements

The author would like to thank the following:

Len Freeman, for being a very helpful and understanding supervisor,
My colleagues in the Centre for Novel Computing, for much enlightening discussion,
Ian Gladwell, for his technical guidance,
My parents, John and Tosca, for all their support,
and Gillian Duncan, for being there.
Chapter 1
Introduction

Globally adaptive quadrature is a popular and robust technique for the numerical approximation of definite integrals. For difficult integrals, however, it can be demanding both in terms of execution time, and in terms of the amount of storage required. To meet these demands, it is necessary to exploit high performance computers, and since most such machines gain their power from parallelism, we need parallel or parallelisable algorithms. Unfortunately, the algorithm which is most commonly used for globally adaptive quadrature is not amenable to efficient parallelisation. The main purpose of this thesis is therefore to develop alternative algorithms which are inherently parallel, and can therefore efficiently exploit the power of parallel computers.

The history of parallel globally adaptive quadrature algorithms has been strongly influenced by the presence of two different programming paradigms for parallel computers: the shared variable paradigm and the message passing paradigm. Each of these paradigms owes its existence to a different style of computer architecture. Shared variables is a natural paradigm for shared memory architectures, while message passing arose with the development of distributed memory machines.

The primary data structure used by globally adaptive quadrature algorithms
is a list of subintervals which cover the interval of integration. In the shared variable paradigm, it is natural to treat the list of subintervals as a shared data object to which each processor has read/write access. In the message passing paradigm, such an approach proves difficult to implement and highly inefficient. Instead, it is usual to use disjoint sublists, one per processor, whose union is the whole list. Thus we have two families of parallel algorithms for globally adaptive quadrature: single list algorithms and multiple list algorithms. The two families have been developed almost completely independently of one another, probably because it has been unusual for workers in the field to have access to more than one type of parallel machine. In our work, however, we use a computer whose style of architecture (a physically distributed memory supporting a shared variable programming paradigm) allows us to compare and contrast members of both families.

The existence of the two families of algorithm is an example of a more general dichotomy in parallel algorithm design. Parallel algorithms are typically either task oriented or data oriented. In the former case, computational tasks are distributed amongst processors, and much of the difficulty in designing a good algorithm is centred on minimising the communication of data between processors. In the latter case, the data is distributed amongst processors, and tasks are performed on the processor that owns the relevant data. Here, design issues are often focussed on balancing the computational loads on the different processors.

In a sequential setting there is very little difference, aside from the underlying quadrature rules, between the globally adaptive quadrature algorithms used for one-dimensional and multi-dimensional integrals. For parallel algorithms, though, there are important differences between the two cases, especially for the single list family. Thus it is necessary to consider the one-dimensional and multi-dimensional cases separately when designing parallel algorithms.
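To make the central data structure concrete, the Python sketch below is a minimal sequential version of the globally adaptive strategy: a single list of subintervals covering the interval of integration, from which the subinterval with the largest error estimate is repeatedly selected and subdivided. It is purely illustrative; the quadrature rules, error estimates and parallel variants actually studied are defined in later chapters, and the crude Simpson-versus-Trapezium error estimate used here is an assumption made only to keep the example self-contained.

import heapq

def adaptive_quad(f, a, b, tol, max_intervals=10000):
    # Globally adaptive quadrature: maintain a single list of subintervals
    # covering [a, b]; repeatedly subdivide the subinterval with the largest
    # error estimate until the total estimated error meets the tolerance.
    def rule(lo, hi):
        # Crude (estimate, error estimate) pair: Simpson value versus
        # Trapezium value on the subinterval (illustrative only).
        mid = 0.5 * (lo + hi)
        trap = 0.5 * (hi - lo) * (f(lo) + f(hi))
        simp = (hi - lo) / 6.0 * (f(lo) + 4.0 * f(mid) + f(hi))
        return simp, abs(simp - trap)

    q0, e0 = rule(a, b)
    intervals = [(-e0, a, b, q0)]      # heap ordered by error estimate, largest first
    total_error = e0
    while total_error > tol and len(intervals) < max_intervals:
        neg_e, lo, hi, _ = heapq.heappop(intervals)   # select the worst subinterval
        total_error += neg_e                          # remove its error contribution
        mid = 0.5 * (lo + hi)
        for lo_s, hi_s in ((lo, mid), (mid, hi)):     # subdivide and update the list
            q_s, e_s = rule(lo_s, hi_s)
            heapq.heappush(intervals, (-e_s, lo_s, hi_s, q_s))
            total_error += e_s
    return sum(item[3] for item in intervals), total_error

print(adaptive_quad(lambda x: x ** 0.5, 0.0, 1.0, 1e-8))   # approximately (2/3, <= 1e-8)

In these terms, a single list algorithm parallelises the selection and subdivision steps over one shared interval list, while a multiple list algorithm gives each processor its own such list and must balance the load between them.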
Since the primary objective of this thesis is to design new parallel algorithms, it is necessary to state some guidelines by which we can evaluate the quality of the algorithms we generate. The following properties are those which we deem to be desirable of parallel algorithms:
Stability The parallel algorithm should ideally have error characteristics as good as those of the sequential algorithms. For adaptive quadrature algorithms, we find that this requirement is satisfied, as the error analysis for the sequential algorithm can be applied directly to the parallel algorithms.
Performance The parallel algorithm should give significant reductions in execution time compared to the best sequential algorithms running on the same computer architecture. If this is not the case then there is no merit in using a parallel method. The principal factors contributing to performance are sequential competitiveness and scalability. The parallel algorithm running on a single processor should be competitive with the best sequential algorithms. For quadrature algorithms, this means that the parallel algorithm should not require significantly more integrand function evaluations. Loosely speaking, an algorithm is scalable if it can be implemented such that useful reductions in execution time can be obtained by adding extra processors, even when the number of processors is already large.
Robustness The algorithm should be applicable, stable and efficient across a wide range of input problems. Since, for quadrature algorithms, applicability and stability issues are no different from the sequential case, we will be chiefly concerned with robustness in terms of efficiency.
Insensitivity to parametrisation If the definition of an algorithm includes adjustable parameters, then there should be as few parameters as possible,
and performance should not depend too strongly on the value of the parameters. It is undesirable for the performance of an algorithm to depend on careful tuning for specific problems or problem types.
Determinism A sequential program implies a strict ordering of actions on data items. This is no longer the case in parallel programs, where, in general, actions on different processors may occur in any order, except where ordering is enforced by synchronisation between processors. This raises the possibility of non-deterministic (sometimes also called asynchronous) parallel algorithms. Such an algorithm may guarantee that the solution it generates satisfies some error bounds, but the numerical value of the solution may vary from one run to the next. Such algorithms can be attractive from a performance point of view, as they require less synchronisation, but they also have a number of undesirable properties: results are not repeatable, debugging code is especially difficult and code may behave very differently on different architectures. Thus, except where useful for illustrative purposes, we will insist that algorithms should always return the same result when applied to the same problem using the same number of processors. A stricter requirement would be that the result should also be independent of the number of processors used. Unfortunately the nature of parallel quadrature algorithms is such that this can rarely be enforced without severely compromising performance.

All these factors will therefore be taken into account when judging the quality of the parallel algorithms.

The rest of this thesis is organised as follows. Chapter 2 introduces parallel computers and discusses the reasons why they exist and the potential benefits of utilising them. We then survey the various styles of parallel architecture, both in terms of the hardware design and the programming models that are implemented
on top of that hardware. We also discuss the difficult issues that can arise in measuring and presenting performance data for parallel programs.

Chapter 3 introduces quadrature methods, starting with the very simplest quadrature rules for the one-dimensional case. From here we progress through composite rules to automatic quadrature methods, including various forms of adaptive quadrature. We then examine the use of extrapolation methods in conjunction with automatic quadrature. Multi-dimensional quadrature is then introduced, and we show how adaptive methods may be used in the multi-dimensional case. We also give a brief survey of other methods for multi-dimensional quadrature, including Monte Carlo, lattice, and recursive methods. Finally in this Chapter we survey existing parallel algorithms for globally adaptive quadrature, both in the single list and in the multiple list families.

Chapter 4 describes the parallel architecture on which we will perform our numerical experiments, the Kendall Square Research KSR-1. We discuss both the hardware architecture (in particular the novel features of the memory system) and the programming model in which parallel Fortran 77 programs are implemented.

Chapter 5 presents single list algorithms for one-dimensional quadrature. Existing algorithms are described as members of a general algorithmic structure, and new members are introduced. We then discuss the implementation of these algorithms on the KSR-1. A set of test problems is defined, and the performance of the algorithms on these problems is reported and analysed. The Chapter concludes with a discussion of the relative merits of the algorithms.

Chapter 6 does the same for multiple list algorithms as Chapter 5 does for single list algorithms. In the numerical results, analysis and discussion sections, we make comparisons between the two families of algorithms.

A common type of difficult one-dimensional integrand is one containing singularities in the interval of integration. Sequential algorithms may be accelerated
for such problems by extrapolation methods. Chapter 7 addresses the problems of using such methods to accelerate the parallel single list algorithms.

Chapters 8 and 9 describe the application of respectively single and multiple list algorithms to multi-dimensional quadrature. Differences from the one-dimensional case are emphasised, with particular attention to the choices of adjustable parameters. A set of multi-dimensional test problems is defined, and the performance of the algorithms on these problems is reported and analysed.

Chapter 10 summarises our findings and draws conclusions.
Chapter 2
Parallel Computing

2.1 Why use parallel computers?

The fundamental driving force behind the use of parallel computers is a seemingly endless desire for computational cycles in the scientific and engineering community. From the earliest days of computer technology, scientists and engineers have used computers to simulate the physical world, because in many cases such simulation proves faster and less costly than performing real world experiments. However, simulating a continuous physical process on a discrete, digital computer always requires at least some degree of approximation, as the continuous equations have no analytic solution except in the simplest of cases. Approximation occurs both in the mathematical models, for example by eliminating terms which can be shown by scale analysis to be relatively insignificant, and also in the discretisation of the continuous equations to give a numerical approximation which is amenable to solution by a computer algorithm. Both these sources of approximation result in the computer simulation being less than perfect, so there is always a need for better and better simulations, of increasingly more complex physical processes. Such improvements can be attained not only by improving the numerical algorithms, but also by increasing the amount of computation devoted
to the problem. Despite the steady advances in computational speed (approximately doubling every 2 years), users of supercomputers have no difficulty in postulating experiments requiring many orders of magnitude more cycles than is currently feasible.

Any computer simulation is subject to time constraints, ranging from a year or more for some quantum chromodynamics calculations, through a few hours for weather forecasting, to less than a second for problems requiring human interaction and control. In all these cases parallel computers can help bridge the gap between what is possible and what is desirable. By utilising many processors to solve a problem, it is possible, depending on the nature of the computation, to reduce the time required by a factor of one, two, and in some cases, even three orders of magnitude.

Another important factor is storage requirements. There are limitations to the amount of memory which can be usefully attached to a single processor. The more memory, the longer it takes to find a given memory location and extract or modify its contents. In a parallel system, however, it is possible not only to utilise many processors, but also to have more memory available. Thus many applications, while possibly not excessively time consuming, can only be run on machines with sufficient storage, and a parallel computer can often supply a suitable solution, as many memory accesses can be performed concurrently.
2.2 Architecture of parallel computers

2.2.1 Hardware

The hardware architecture of parallel computers has undergone rapid development over the last 25 years, supported by considerable research efforts and intense commercial pressure. This pace of development shows no signs of slacking, with
manufacturers continuing to appear and disappear every year. A number of attempts have been made to develop a formal taxonomy of parallel architectures (see, for example, [25] and [27]), though only that of Flynn [15] has gained universal acceptance. Flynn distinguishes two primary classes of parallel computer, Single Instruction Multiple Data (SIMD) machines and Multiple Instruction Multiple Data (MIMD) machines. As the names imply, SIMD machines apply the same instruction concurrently to many data items, while MIMD machines have the ability to apply different instructions concurrently to different data items.

SIMD machines can be subdivided into two types, depending on how parallelism is exploited. Vector processors typically have a small number of arithmetic function units, each of which consists of multiple stages, organised as a pipeline. Parallelism is achieved by allowing different pipeline stages to be active on different data items concurrently. In contrast, array processors have large numbers of function units, all operating under the same flow of control in the program, each of which can perform arithmetic operations concurrently on separate data items.

Vector processors were amongst the earliest parallel architectures, and continue to be produced in large numbers today, though they are now often found as processing units within MIMD type machines. Examples currently in production include the Cray C90, the Fujitsu VPP500 and the NEC SX-4. Their popularity is due in large measure to the development of compiler technology which can effectively exploit the architecture without the intervention of the programmer. However, their applicability remains limited to programs with regular and well structured data access patterns, such as the indexed array operations found in many, but not all, scientific Fortran programs. The degree of parallelism attainable is limited by the number of stages in the pipelines, which typically does not exceed 100, by the high cost of starting a set of pipelined operations, and by the
fact that portions of a program not suitable for pipelined execution must be executed on a separate, slower, scalar processing unit. Nevertheless, given suitable applications, very high performance can be attained on these architectures.

Array processors, following a brief period of popularity in the 1980s, have largely fallen out of favour in the current parallel computing market. Examples of this type of machine include the MasPar MP-1 and -2, the Thinking Machines CM200 and the AMT (formerly ICL) DAP. This type of machine offers true massive parallelism (over 1000 processors), but the computational model is rather limited. Efficiency depends strongly on being able to map computation onto a two-dimensional array and apply the same computation to all the array elements. This lack of flexibility, combined with the fact that these machines cannot be built from commodity microprocessors, has rendered them uneconomic to manufacture.

MIMD machines have been sub-classified by Johnson [27] depending on the memory system and the communication/synchronisation mechanism. Memory systems are classified as either global memory (GM) or distributed memory (DM). GM machines possess a central main memory which is accessible by all processors, whereas in DM systems the memory is physically distributed, so that each processor is tightly coupled to its own main memory. The communication/synchronisation mechanism can either be message passing (MP) or shared variable (SV). In a message passing system, each processor has its own address space, and all communication between these local address spaces is by sending and receiving messages containing the appropriate data. In an SV system there is a global address space which all processors may access. Communication between processors occurs through reading and writing data in the shared address space, with synchronisation mechanisms to ensure correct order of execution. It is worth noting that this classification is applied at the hardware level: it is possible, in
software, both to implement message passing on an SV machine and to implement shared variables on an MP machine. Any such software layer will introduce inefficiencies in communication.

Until recently, almost all MIMD machines were of either the classical message passing (DMMP), or else the classical shared memory (GMSV) design, and indeed both types of architecture are well represented in the current marketplace. Examples of DMMP machines include the Fujitsu VPP series, IBM SP2, Intel Paragon, nCUBE 2 and Parsytec GC/PowerPlus. Networks of workstations running message passing protocols such as PVM or MPI (see Section 2.2.2) can also be considered as DMMP architectures. GMSV machines include the Cray C90 and NEC SX-4 (utilising vector processors) and multiprocessor workstation/server machines manufactured by Silicon Graphics (Challenge/Power Challenge series), Sun and DEC.

Both these types of system have their respective limitations. GMSV architectures suffer from the memory bottleneck problem: a central memory is unable to service requests from an unlimited number of processors without contention or high latencies. Such systems are limited in the number of processors they can support, typically no more than 32. Such architectures are said not to be scalable to large numbers of processors. DMMP designs are in general more scalable, in some cases to over 1000 processors, but message passing is a less flexible programming paradigm than shared variables.

Attempts to overcome both these difficulties simultaneously have led to a number of architectures in the DMSV category. Supporting shared variables in distributed memory is the subject of a great deal of current research, and a number of solutions have been proposed. The differences between solutions are mainly concerned with how they solve the cache coherency problem. In order to reduce the number of accesses to remote processors' memory, it is advantageous to
retain a copy of any data accessed remotely, in local cache memory. The problem then arises that there may be many copies of a data item in the system. When one copy is modified, it is necessary to have some mechanism for ensuring that all other copies are either updated or marked as no longer valid, so that there is no ambiguity in the value stored in a given memory location.

The Meiko CS2 avoids the cache coherency problem by only supporting writes to remote memory locations. Reads are not supported, so no copies of data are cached. This can be viewed as a one-sided message passing mechanism. The Cray T3D architecture supports both reads and writes to remote memory. Cache coherency is achieved by flushing the caches at synchronisation points. Both the CS2 and T3D use remote memory accesses as a means of implementing message passing, and they are most commonly used in this mode, that is as DMMP machines.

The other common mechanism for maintaining cache coherency is to use a write-invalidate policy. In this mechanism a processor wishing to write to a memory location first ensures that all copies of that location are invalidated before the write occurs. Since the writing processor must be able to determine where copies exist, this policy is normally supported by a system of directories which store information about the status and position of copies of data. Examples of DMSV machines using this mechanism are the Sequent NUMA-Q, the Convex Exemplar, the Silicon Graphics Origin 2000 and the Kendall Square Research KSR-1, which is described in more detail in Chapter 4.

It is worth noting that GMSV architectures which employ caches (as opposed to, say, vector registers) also suffer from the cache coherency problem, and use similar mechanisms to solve it, though the presence of a global memory does make the task somewhat less complicated. A number of schemes have been proposed to support shared variables in software, running on top of DMMP architectures.
The future direction of commercial parallel computer architecture remains unclear. Parallelism in the workstation and personal computer markets is likely to increase, in the form of small scale GMSV architectures. For larger systems, DMSV architectures are beginning to prove popular, with many leading manufacturers either producing or planning to produce this type of machine. In the longer term, however, processor speeds are increasing faster than network and memory speeds, with the result that memory latencies will become longer (in terms of CPU cycles). DMSV machines are particularly vulnerable to this trend, since the cache coherency mechanisms tend to generate more communication between processors than for the same program running in a message passing environment. A number of alternative architectures which attempt to hide memory latency (as opposed to reducing it via caching) have been proposed in the research literature, though as yet none have made it to the commercial market:
Prefetching architectures attempt (by one means or another) to predict which addresses a program will require to load in the near future. Load requests for these addresses can be issued in the hope that the relevant data will be cached by the time they are required by the program.
Decoupled architectures separate addressing and control instructions from arithmetic instructions and run the two sets on different processors. By allowing the address and control processor to run ahead of the arithmetic processor as much as is possible, significant reductions in memory latency can be achieved.
Multithreaded architectures run many threads of control on each processor. When a thread issues a memory request, a very fast context switch is made to another thread. If a sufficient number of threads is scheduled on the processor in a round-robin fashion, then by the time the original thread is switched in again, the memory request will have completed. An extreme
form of multithreading, where threads consist of single instructions, and conventional memory is largely dispensed with, is known as dataflow.

Although all these ideas have shown promise, in the short and medium term, the commercial market seems likely to be based on high volume processor chips with conventional caching.
2.2.2 Software

As hardware architecture has evolved, so have the paradigms and environments used to program them. Although many languages have been designed explicitly for parallel machines, the most widely used paradigms consist of extensions to popular sequential languages such as Fortran and C. These extensions are of two types: directives, which are additional statements placed in the code and interpreted by the compiler on the parallel architecture (and normally appear as comments to other compilers), and libraries of routines which perform functions associated with parallelism such as communication, synchronisation and scheduling.

Language extensions for vector architectures are relatively simple because the compilers have sufficiently good analysis techniques to extract parallelism from sequential code. A few directives which assist the compiler in the cases where its analysis does fail are normally sufficient. Array architectures have employed special language constructs and syntax to permit operations on arrays, as well as directives and library calls. CM Fortran, developed for the Thinking Machines CM200 series, was the forerunner of the HPF (High Performance Fortran) language [24]. With the decline of array architectures in the marketplace, HPF has been implemented on MIMD machines, though its origins mean that it, too, is limited in its applicability.

For MIMD machines, different paradigms have been developed for the MP
and SV classes of architectures. MP machines normally run a message passing protocol. This typically consists of a library of communication routines. Because each processor has a separate address space, all communication is explicit. The sending processor calls a message send routine, passing the data to be sent and the receiving processor(s) as arguments. This must be matched by a corresponding message receive call in the receiving processors, which must specify the memory location into which the data in the message should be placed. Although it is possible to run different codes on different processors, the usual mode of operation is to run a copy of the same code on every processor, with its action distinguished from that of other copies via knowledge of the processor number on which it is running. Standardisation of the message passing paradigm has taken a long time, but there now exist two de facto standards, PVM (Parallel Virtual Machine) [17] and the more recent, but increasingly popular, MPI (Message Passing Interface) [35].

For SV architectures, the language extensions can consist of either directives or library calls, and sometimes a mixture of the two, forming a shared memory paradigm. The most commonly used directives are the parallel loop directive, which indicates that the iterations of a loop should be shared out amongst the processors, and the parallel region directive, which indicates that a section of code be run on every processor. A single copy of the code is executed, with any statements not enclosed by a directive being run sequentially on one processor. Alternatively a threads library may be used, consisting of calls which start or stop a new thread of execution, indicate what code threads should run, and perform synchronisation actions (for example locks, semaphores and barriers) between the processors. Since the directives are often translated into thread library calls by the compiler, a combination of directives and library calls may often be used.
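As a simple illustration of the shared variable style (not an example from the thesis, which uses the KSR Fortran directives described in Chapter 4), the sketch below uses a Python threads library to share the iterations of a loop amongst threads and to protect updates to a shared variable with a lock.

import threading

total = 0.0                  # shared variable, visible to all threads
lock = threading.Lock()

def work(tid, nthreads, data):
    global total
    partial = sum(data[tid::nthreads])   # each thread takes its share of the iterations
    with lock:                           # synchronise updates to the shared variable
        total += partial

data = [x * 0.5 for x in range(1000)]
threads = [threading.Thread(target=work, args=(t, 4, data)) for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(total)    # 249750.0, independent of the interleaving of the threads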
A shared memory paradigm of this type (KSR Fortran) is described in more detail in Chapter 4. SV machines have a single address space, and hence all communication is implicit. Compared to message passing, standardisation is far less advanced. POSIX threads are reasonably widely supported, but most manufacturers utilise their own form of directives. This lack of standardisation poses portability and software maintenance difficulties, and may explain some of the popularity currently enjoyed in industry by PVM.

A major drawback of message passing is the need for the explicit placement of both send and receive calls for every data transfer. This requires the programmer to know the source and sink of every inter-processor data dependency in the program, and encode them explicitly in both sending and receiving processors. Programs which have dynamic and/or input-dependent memory access patterns can therefore be difficult to encode efficiently in a message passing style. Although, as previously mentioned, it is possible to build a shared memory paradigm in software on MP machines, efficiency tends to be unacceptably low.

A paradigm intermediate between message passing and shared memory, sometimes referred to as direct memory access (DMA) or one-sided messages, is gaining popularity for DM architectures. In this paradigm, a processor is able to read or write an address in another processor's memory without the need for contacting the receiving processor itself. This style of programming is available on the Cray T3D, and is supported in the increasingly popular BSP (Bulk Synchronous Parallel) paradigm [47]. This paradigm is still not as flexible as shared memory: the onus is still on the programmer to distinguish remote memory accesses from local ones explicitly, and the location of the remote address must be specified.

As we have intimated already, it is possible to implement just about any programming paradigm on any MIMD architecture, though the closer the paradigm matches the underlying hardware mechanisms, the more efficient it tends to be.
Convergence of programming paradigms seems unlikely in the near future.
2.3 Performance measurement

The reporting of performance results in the field of high performance computing has acquired something of a poor reputation in recent years, which extends to scientific publications as well as material produced by vendors. Bailey [1] gives a useful critique of practices in this area, and sets out a number of guidelines designed to improve standards of performance reporting, to which we attempt to adhere in our experiments.

The fundamental metric for the measurement of the performance of a program running on a parallel computer is execution time. This is the real world (wall clock) time taken for all or part of the program to execute. We denote this quantity by $T_p$, where $p$ is the number of processors used by the parallel program. A typical performance experiment consists of measuring $T_p$ for a range of values of $p$, to demonstrate how the execution time depends on the number of processors used. However, measuring $T_p$ in isolation gives no information about how the parallel program is performing with respect to a sequential program solving the same computational problem. It is therefore considered good practice to also measure the execution time of a well-optimised implementation of the best (or widely used) sequential algorithm for the problem, running on a single processor of the parallel computer. This quantity is denoted by $T_s$. Note that in almost all cases, $T_s < T_1$, the execution time of the parallel program on a single processor (if not, then we have not used the best sequential implementation). Even if the algorithms used are the same, the parallel code will normally contain additional code associated with communication, synchronisation and scheduling which are necessary for a parallel implementation, and will therefore have a longer execution time.
In order to determine the quality of a parallel program it is natural to compare its execution time with the best we could reasonably expect. The simplest possible estimate of this best time is obtained by assuming that $p$ processors should be able to solve the problem $p$ times faster than one processor. We should therefore use the naive ideal execution time, defined as $T_s/p$, as the basis for such comparisons. Although perfectly adequate in tabular form, this comparison unfortunately does not lend itself well to graphical representation. A plot of $T_p$ versus $p$ tends to compress the information for large values of $p$ into a small area of the graph, with a resulting loss of visual accuracy. For graphical purposes it has become widespread practice to plot the speedup $S_p$ of a parallel program versus the number of processors, where speedup should properly be defined as
$$S_p = \frac{T_s}{T_p} \qquad (2.1)$$
with the advantage that the naive ideal execution time is represented by the straight line $S_p = p$. This technique has a number of pitfalls and disadvantages, however:

1. Execution time cannot be retrieved from the plot without knowledge of $T_s$.

2. Comparing the performance of a program on different parallel architectures using speedup can disguise the fact that one implementation may be executing many times faster than another.

3. Information relating to small numbers of processors is compressed into a small area and is difficult to interpret.

4. It is tempting to use self-speedup defined by $T_1/T_p$, which ignores the potentially significant differences between $T_1$ and $T_s$ outlined above.

5. Some authors use scaled speedup, obtained by keeping either the amount of computation, or data, per processor fixed as $p$ is increased. This can give
valuable information, but great care must be taken to explain exactly what experiment is being conducted.

Points 1 and 2 in the above list can be addressed by using temporal performance, $R_p$, defined by
$$R_p = \frac{1}{T_p}, \qquad (2.2)$$
instead of speedup. This metric is advocated by Hockney [26], and is adopted for our experiments. The naive execution time is represented by the straight line $R_p = p/T_s$. Point 3 can be addressed by using yet another derived metric, efficiency, $E_p$, defined by
$$E_p = \frac{T_s}{pT_p}. \qquad (2.3)$$
The naive execution time is now represented by the constant line $E_p = 1$, though some authors prefer to express efficiency as a percentage. A further useful concept is that of parallel overhead, $O_p$, defined by
$$O_p = T_p - \frac{T_s}{p}, \qquad (2.4)$$
that is the difference between observed and naive ideal execution times. One advantage of using parallel overhead is that it can be additively decomposed into categories reflecting the source of the overhead. For example, we can write
$$O_p = O_C + O_U + O_L + O_S \qquad (2.5)$$
where $O_C$, $O_U$, $O_L$ and $O_S$ are the times lost by the parallel program through communication, unparallelised code, load imbalance and scheduling respectively. Such categories do not contribute multiplicatively to speedup in an unambiguous way. Note that this list of overhead sources is neither complete, nor the most appropriate for all applications. Overheads are typically more useful for performance debugging, rather than for performance reporting: for a fuller discussion
of overhead classification see [4]. The relative overhead, obtained by dividing $O_p$ by $T_p$, is the proportion of execution time due to overheads and is related to efficiency:
$$\frac{O_p}{T_p} = 1 - \frac{T_s}{pT_p} = 1 - E_p. \qquad (2.6)$$
Relative overhead, therefore, is equivalent to loss of efficiency.

It is worth noting that the naive ideal execution time $T_s/p$ is not, in general, a lower bound on the execution time of a parallel program, because by increasing the number of processors we inevitably increase the amount of fast memory (registers and cache) available to the program, thus potentially reducing the average cost of memory accesses. This can result in so-called superlinear speedup ($S_p > p$) [23], and negative overheads ($O_p < 0$).

We will find that, for parallel adaptive quadrature, as in many other numerical problems, a parallel implementation of the standard sequential algorithm gives poor results. We will therefore need to develop new algorithms for efficient parallel execution. We would thus like to be able to distinguish between the effects of the change of algorithm and the effects of the overheads incurred by parallel execution. We can do this by defining a fundamental unit of execution for the problem concerned. For many applications, such as linear algebra, the appropriate unit is the floating point operation (flop), usually defined as a floating point add or multiply instruction. We can then compare the parallel algorithm to a sequential algorithm in terms of the total number of flops executed, and we can measure the performance of the parallel implementation in terms of the number of flops executed per second. Note that many authors report only flop rates, and neglect to mention the difference in flop counts between sequential and parallel algorithms. This is another example of poor practice cited in [1] and [26].

For quadrature problems, flop counts are inappropriate as they depend strongly on the nature of the integrand. A much more suitable unit of execution is the
number of integrand evaluations. Thus we can measure the algorithms' effectiveness in terms of the total number of integrand evaluations $I_p$, and the effectiveness of the implementations in terms of the number of integrand evaluations performed per second, $I_p/T_p$. Note that this second quantity is simply the product of $I_p$ with the temporal performance $R_p$ defined above. We retain the subscript on $I_p$ since in general, and in particular for many of our quadrature algorithms, there is a dependence on the number of processors used.

To design performance experiments, we need to define a suitable set of test problems. This is not easy for quadrature since the shape of the integrand function, the required accuracy and the computational cost of integrand evaluation are all important factors in determining $R_p$. $I_p$ depends on the shape of the integrand function and the required accuracy, but not on the cost of integrand evaluation. Measuring both $I_p$ and $R_p$ allows us to extrapolate our results for functions with the same numerical behaviour but different computational cost, thus restricting the space of test problems that needs to be considered. We are therefore able to use a small set of test functions, by choosing the shape of the function and the required accuracy of the result.
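As a small worked illustration of these definitions (not part of the thesis), the Python sketch below computes the derived metrics of this section from a pair of measured times; all the numbers in the example are hypothetical and chosen only for the illustration.

def performance_metrics(T_s, T_p, p, I_p=None):
    # Derived metrics of Section 2.3, from measured wall clock times in seconds:
    # T_s is the best sequential time, T_p the parallel time on p processors.
    S_p = T_s / T_p                 # speedup, equation (2.1)
    R_p = 1.0 / T_p                 # temporal performance, equation (2.2)
    E_p = T_s / (p * T_p)           # efficiency, equation (2.3)
    O_p = T_p - T_s / p             # parallel overhead, equation (2.4)
    rel_overhead = O_p / T_p        # relative overhead = 1 - E_p, equation (2.6)
    evals_per_second = I_p / T_p if I_p is not None else None
    return {"S_p": S_p, "R_p": R_p, "E_p": E_p, "O_p": O_p,
            "relative_overhead": rel_overhead, "evals_per_second": evals_per_second}

# Hypothetical measurements: T_s = 10 s, T_p = 1.5 s on p = 8 processors,
# with I_p = 1.2e6 integrand evaluations performed by the parallel run.
print(performance_metrics(10.0, 1.5, 8, I_p=1.2e6))
# S_p ~ 6.67, E_p ~ 0.83, O_p = 0.25 s, relative overhead ~ 0.17, 8e5 evaluations/s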
Chapter 3
Quadrature Algorithms

3.1 Quadrature in one dimension

3.1.1 Quadrature rules

A quadrature rule is a method for approximating the numerical value of the definite integral
$$I = \int_a^b f(x)\,dx, \qquad (3.1)$$
in one dimension, or more generally in $d$ dimensions,
$$I = \int_{a_1}^{b_1} \int_{a_2}^{b_2} \cdots \int_{a_d}^{b_d} f(x_1, x_2, \ldots, x_d)\,dx_1\,dx_2 \cdots dx_d. \qquad (3.2)$$
In this section we will restrict our discussion to one dimension, and will return to the multi-dimensional case in Section 3.2. The approximation $Q$ takes the form of a weighted sum of integrand evaluations:
$$Q = \sum_{i=0}^{m} \omega_i f(x_i) \qquad (3.3)$$
where $\omega_i,\ i = 0, 1, \ldots, m,$ are called the weights of the rule, and $x_i,\ i = 0, 1, \ldots, m,$ the abscissae. The error $E$ associated with a quadrature rule is given by
$$E = |I - Q|. \qquad (3.4)$$
For one-dimensional integrals, a very simple rule is the Trapezium rule:

I \approx Q = \frac{(b-a)}{2} (f(a) + f(b)).   (3.5)

It can be shown that, provided f is twice continuously differentiable, the error associated with the Trapezium rule is (b-a)^3 f''(\xi)/12 for some \xi \in (a, b). A quadrature rule is said to have degree of precision q if the error is identically zero for all polynomials of degree at most q. We can see from the error term that the Trapezium rule is exact for all functions with zero second derivative (that is, linear functions). Thus the Trapezium rule has degree of precision one. Another simple rule is Simpson's rule:

I \approx Q = \frac{(b-a)}{6} (f(a) + 4 f((a+b)/2) + f(b)),   (3.6)

which has error (b-a)^5 f^{(4)}(\xi)/2880 for sufficiently smooth f, and degree of precision three. Both the Trapezium rule and Simpson's rule are members of the Newton-Cotes family of quadrature rules. These rules may be derived by approximating f(x) with a Lagrange interpolating polynomial. Given a set of nodes x_0, \ldots, x_m, the Lagrange polynomials are defined by

L_i(x) = \prod_{j=0, j \neq i}^{m} \frac{(x - x_j)}{(x_i - x_j)}.   (3.7)

We form the polynomial interpolant

f(x) \approx P_m(x) = \sum_{i=0}^{m} f(x_i) L_i(x)   (3.8)

and integrate:

I \approx Q = \sum_{i=0}^{m} \omega_i f(x_i),   (3.9)

where

\omega_i = \int_a^b L_i(x) \, dx, \quad i = 0, 1, \ldots, m.

The nodes of the interpolant form the abscissae of the quadrature rule. For the Newton-Cotes family, equally spaced nodes are used. If the end points a and b
are used as nodes, then we have the closed Newton-Cotes rules, which include the Trapezium and Simpson's rule. Otherwise if a and b are not nodes, we have the open Newton-Cotes rules. Open rules have the advantage that they can be applied where the integrand has a singularity at an end-point of the interval.
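To make the two rules above concrete, the following short sketch (a Python illustration of our own, not part of the thesis codes, which are written in Fortran) applies the Trapezium and Simpson's rules to a single interval and checks their degrees of precision on a cubic.

def trapezium(f, a, b):
    # Degree of precision one: exact for linear integrands.
    return (b - a) / 2.0 * (f(a) + f(b))

def simpson(f, a, b):
    # Degree of precision three: exact for cubics.
    return (b - a) / 6.0 * (f(a) + 4.0 * f((a + b) / 2.0) + f(b))

# x^3 on [0, 2] has exact integral 4; Simpson's rule recovers it exactly,
# the Trapezium rule does not.
f = lambda x: x ** 3
print(trapezium(f, 0.0, 2.0))   # 8.0
print(simpson(f, 0.0, 2.0))     # 4.0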
3.1.2 Gauss rules

The Newton-Cotes rules described above have degree of precision m if m is odd, and degree of precision m + 1 if m is even. The Newton-Cotes rules use equally spaced abscissae; by dropping this restriction it is possible to design rules with higher degrees of precision. Gauss rules are optimal in the sense that, for a given number of points, they have the highest possible degree of precision. Given that a rule with m points has 2m degrees of freedom (m weights and m abscissae), we cannot hope to integrate exactly polynomials of degree greater than 2m - 1 (which possess 2m degrees of freedom). However, it is possible to find rules which have precisely this optimal degree of precision. These optimal rules are closely related to sequences of orthogonal polynomials. A sequence of polynomials P_0(x), P_1(x), P_2(x), \ldots, where P_n(x) is of exact degree n, is said to be orthogonal on [a, b] with respect to a weight function w(x) \geq 0 if

\int_a^b w(x) P_j(x) P_k(x) \, dx = 0, \quad j \neq k.   (3.10)

The Legendre polynomials

P_n(x) = \frac{1}{2^n} \sum_{m=0}^{[n/2]} (-1)^m \binom{n}{m} \binom{2n-2m}{n} x^{n-2m}, \quad n = 0, 1, 2, \ldots   (3.11)

(where [x] denotes the integer part of x) are orthogonal on the interval [-1, 1] with respect to the weight function w(x) \equiv 1. If we use the zeros of these polynomials as abscissae, then it can be shown (see, for example, Section 4.7 of [8]) that we can find weights which produce a sequence of rules of optimal degree
of precision. These rules are sometimes called the Gauss-Legendre rules, or simply Gauss rules. The error associated with the m-point Gauss rule approximation to \int_{-1}^{1} f(x) \, dx is given by

E_m = \frac{2^{2m+1} (m!)^4}{(2m+1) [(2m)!]^3} f^{(2m)}(\xi)   (3.12)

for some \xi \in (-1, 1), provided that f^{(2m)}(x) is continuous. The rule can be applied to any closed interval [a, b] by applying the obvious linear transformation to [-1, 1]. Other types of Gauss rules can be derived using polynomials which are orthogonal with respect to different weight functions w(x), and on different intervals. These give exact integration, over the appropriate interval, of functions of the form w(x) P(x), where P(x) is a polynomial.
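To illustrate how a Gauss rule is used in practice (an illustration of our own, assuming the NumPy routine leggauss for the Gauss-Legendre abscissae and weights), the following sketch maps the m-point rule from [-1, 1] to a general interval [a, b] by the linear transformation just mentioned.

import numpy as np

def gauss_legendre(f, a, b, m):
    """Approximate the integral of f over [a, b] with the m-point Gauss rule."""
    x, w = np.polynomial.legendre.leggauss(m)   # abscissae and weights on [-1, 1]
    t = 0.5 * (b - a) * x + 0.5 * (a + b)       # linear map to [a, b]
    return 0.5 * (b - a) * np.sum(w * f(t))     # the Jacobian is (b - a)/2

# The 3-point rule has degree of precision 5, so x^5 on [0, 1] is integrated
# exactly (the true value is 1/6).
print(gauss_legendre(lambda t: t ** 5, 0.0, 1.0, 3))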
3.1.3 Composite rules

For difficult integrands, simply increasing the number of points in a quadrature rule until the desired accuracy is attained is not a realistic approach. Deriving the weights and abscissae for high order rules is computationally expensive, and the oscillatory nature of high degree polynomials means that inaccuracies will occur. A more satisfactory approach is to divide the interval [a, b] into a number of subintervals, and apply low order quadrature rules piecewise to each subinterval. Thus instead of increasing the order of the rule to obtain higher accuracy, we can increase the number of subdivisions of the interval. For example, if we divide [a, b] into n subintervals [x_{i-1}, x_i], i = 1, 2, \ldots, n, where a = x_0 < x_1 < x_2 < \cdots < x_n = b, and apply the Trapezium rule to each subinterval, we obtain the Composite Trapezium rule:

I \approx Q = \frac{h}{2} \left( f(a) + f(b) + 2 \sum_{j=1}^{n-1} f(x_j) \right),   (3.13)

where h = (b-a)/n and x_j = a + jh, j = 0, 1, \ldots, n. This rule has an error of (b-a) h^2 f''(\xi)/12 for some \xi \in (a, b). The same approach can be applied to any quadrature rule. There is no restriction on the number, size or distribution of subintervals that may be used. Of course there will be a trade-off between the number of points in the underlying quadrature rule and the number of subdivisions required to obtain a given accuracy. Simple rules like the Trapezium rule may require very large numbers of subintervals to obtain a given accuracy. Using a higher order rule, we may require fewer subintervals, and even though the number of function evaluations per subinterval is higher, the total number of evaluations may be lower than for the Composite Trapezium rule.
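The Composite Trapezium rule (3.13) is easily sketched; the small experiment below (our own illustration) shows the error falling by roughly a factor of four each time n is doubled, as the (b-a)h^2 f''(\xi)/12 error term predicts.

import math

def composite_trapezium(f, a, b, n):
    """Composite Trapezium rule (3.13) with n equal subintervals."""
    h = (b - a) / n
    interior = sum(f(a + j * h) for j in range(1, n))
    return h / 2.0 * (f(a) + f(b) + 2.0 * interior)

# Integral of sin(x) over [0, pi] is exactly 2.
for n in (8, 16, 32):
    print(n, abs(composite_trapezium(math.sin, 0.0, math.pi, n) - 2.0))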
3.1.4 Error estimation

An estimate of the value of an integral is of little worth unless some indication of its quality can be obtained. A great deal of work has been done to analyse the error characteristics of quadrature rules; see, for example, Chapter 4 of [11]. The errors in approximating an integral I by a quadrature rule Q are of two types: first, the truncation error E, arising from the fact that Q is only an approximation to I; and second, the round-off error, arising from the fact that we have computed the weighted sum of function evaluations of (3.3) in finite precision arithmetic. For most practical purposes the round-off error will be insignificant compared to the truncation error. However, if the number of function evaluations becomes very large, we may need to take care when we form their weighted sum in order to minimise the round-off error. Almost all theoretical error bounds for rules involve bounds on the higher order derivatives of the integrand, and are therefore of no use in practical error estimation. Practical error estimation using a finite number of function values can give no guarantee of success, but most real world problems are sufficiently well behaved for error estimation to give useful information about the quality of the approximation. The most common error estimation technique is to compare the approximations to an integral generated by two different quadrature rules, say Q_1 and Q_2, where Q_1 is in some sense more accurate than Q_2. For example, Q_1 may have a higher degree of precision than Q_2. Then |Q_1 - Q_2| may be used as an estimate of the error E_1 = |I - Q_1|. However, if Q_1 is indeed a better approximation to I than Q_2, then |Q_1 - Q_2| tends to be a good approximation to E_2 = |I - Q_2|, and an overestimate of E_1. Overestimation may cause a quadrature routine to appear to fail, in that the error estimate is larger than the requested error even though the value returned (Q_1) is sufficiently accurate. Additional robustness can be achieved by comparing Q_1 with more than one less accurate rule. For composite rules, error estimation for the whole interval is achieved by simply summing the error estimates for each subinterval. There are many possible ways of choosing the pairs of rules for error estimation. For our experiments we use Gauss rules with the so-called Kronrod extensions. The objection to simply using Gauss rules with different numbers of points is that no two Gauss rules share any abscissae (other than 0), so that no function evaluations can be reused. In [29], Kronrod shows that it is possible, starting with the n-point Gauss rule, to add a further n + 1 abscissae and 2n + 1 new weights such that polynomials of maximal degree (namely 3n + 1 if n is even and 3n + 2 if n is odd) are integrated exactly. This new quadrature rule is known as the Kronrod extension to the Gauss rule. To achieve a comparable degree of precision simply by using a higher order Gauss rule would require at least an additional 3n/2 function evaluations, compared with n + 1 for the Kronrod extension. In our experiments we use the 30 point Gauss rule together with its 61 point Kronrod extension.
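The mechanics of this pairwise error estimate can be sketched as follows. This is our own illustration using two Gauss rules of different order rather than the 30/61-point Gauss-Kronrod pair actually used in the thesis (the Kronrod abscissae are not reproduced here); note that, unlike a Gauss-Kronrod pair, this pair shares no abscissae, which is precisely the inefficiency the Kronrod extension avoids.

import numpy as np

def gauss(f, a, b, m):
    x, w = np.polynomial.legendre.leggauss(m)
    t = 0.5 * (b - a) * x + 0.5 * (a + b)
    return 0.5 * (b - a) * np.sum(w * f(t))

def rule_with_error(f, a, b):
    """Return (approximation, error estimate) for one subinterval.

    Q1 is the more accurate rule; |Q1 - Q2| serves as a (usually
    pessimistic) estimate of the error in Q1.
    """
    q1 = gauss(f, a, b, 10)   # 'accurate' rule
    q2 = gauss(f, a, b, 5)    # 'coarse' rule
    return q1, abs(q1 - q2)

q, e = rule_with_error(lambda t: np.exp(-t * t), 0.0, 2.0)
print(q, e)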
3.1.5 Automatic quadrature

The goal of an automatic quadrature algorithm is to find an approximation to a given integral to within some predefined error tolerance ε, using as few function evaluations as possible. To do this the algorithm generates a sequence of rules using successively more and more function evaluations (and hopefully reusing as many existing ones as possible) until the error estimate for the rule is less than ε. Automatic quadrature algorithms are of two principal types: adaptive and non-adaptive. Adaptive algorithms take account of the behaviour of the function when generating the next rule in the sequence, whereas non-adaptive algorithms evaluate a predefined sequence of rules. Typically, non-adaptive algorithms either use a sequence of single rules of increasing degree, or use a composite rule with increasing numbers of subintervals. The former approach is limited in that it is difficult to compute and store the weights and abscissae for rules of very high degree. The latter approach tends to lead to excessive numbers of function evaluations, because subintervals are refined across the whole interval, regardless of the nature of the integrand.

Adaptive quadrature

Adaptive algorithms typically use composite rules, but attempt to refine subintervals only in regions where the integrand is badly behaved. Bad behaviour is detectable since an error estimate is available for every subinterval. There are many possible mechanisms for deciding how to refine subintervals. There are essentially two strategies which can be adopted in an adaptive quadrature scheme; they are referred to as local and global schemes (see [33]). A local scheme terminates when every subinterval satisfies some error condition which can be determined by examining only that subinterval. Any subinterval not satisfying the error condition is further subdivided, typically by bisection, and the quadrature rule(s) are applied to the new subintervals. Whenever a subinterval does satisfy the error condition, its contribution to the integral and total error estimate are accumulated and the subinterval may then be discarded. If we denote the estimated error in the quadrature rule approximation to \int_{x_{i-1}}^{x_i} f(x) \, dx by e_i, where a = x_0 < x_1 < x_2 < \cdots < x_n = b, the usual local error condition is

e_i \leq \epsilon \, \frac{x_i - x_{i-1}}{b - a},   (3.14)

which guarantees that

\sum_{i=1}^{n} e_i \leq \epsilon \sum_{i=1}^{n} \frac{x_i - x_{i-1}}{b - a} = \epsilon.   (3.15)

A global scheme, in contrast, terminates as soon as

\sum_{i=1}^{n} e_i \leq \epsilon.   (3.16)
At each stage, the subinterval with the largest error estimate is selected for bisection. This algorithm, attributed to Cranley and Patterson [10], can be written in pseudo-code as follows:
Algorithm CP:
  apply quadrature rule to the interval [a, b]
  initialise list of subintervals with [a, b]
  do while (error estimate > ε) and (no. of rule evaluations ≤ Nmax)
    bisect the subinterval with largest error estimate
    apply quadrature rule to the two new subintervals
    remove `old' subinterval from list
    add `new' subintervals to list
    update integral approximation and error estimate
  end do
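A minimal serial rendering of Algorithm CP might look like the sketch below (our own illustration: the rule pair is a crude Simpson/Trapezium stand-in for the Gauss-Kronrod pair used later, and the helper names are ours). Python's heapq module is a min-heap, so error estimates are negated to obtain the required max-heap behaviour.

import heapq

def rule_with_error(f, a, b):
    # Simpson's rule, with |Simpson - Trapezium| as the error estimate.
    trap = (b - a) / 2.0 * (f(a) + f(b))
    simp = (b - a) / 6.0 * (f(a) + 4.0 * f((a + b) / 2.0) + f(b))
    return simp, abs(simp - trap)

def algorithm_cp(f, a, b, eps, nmax=10000):
    q, e = rule_with_error(f, a, b)
    heap = [(-e, a, b, q)]            # negate errors: heapq is a min-heap
    total_q, total_e, evals = q, e, 1
    while total_e > eps and evals <= nmax:
        neg_e, lo, hi, q_old = heapq.heappop(heap)
        mid = 0.5 * (lo + hi)
        # Bisect the subinterval with the largest error estimate.
        for sub_lo, sub_hi in ((lo, mid), (mid, hi)):
            q_new, e_new = rule_with_error(f, sub_lo, sub_hi)
            heapq.heappush(heap, (-e_new, sub_lo, sub_hi, q_new))
            total_q += q_new
            total_e += e_new
            evals += 1
        # Remove the contributions of the `old' subinterval.
        total_q -= q_old
        total_e += neg_e              # neg_e is -e_old
    return total_q, total_e, evals

print(algorithm_cp(lambda x: x ** 0.5, 0.0, 1.0, 1e-6))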
Generally, globally adaptive quadrature is more efficient than locally adaptive quadrature, in the sense that the global accuracy requirement is satisfied with fewer applications of the quadrature rule (and hence fewer function evaluations). On the other hand, the local scheme is much more efficient in terms of memory requirements, since only those subintervals which have not yet satisfied the error criterion need be stored, whereas for the global algorithm a full set of subintervals covering [a, b] must be retained. Furthermore the global scheme requires a mechanism for finding the subinterval with largest error estimate, either by searching, or by maintaining a (partially) ordered list. On balance, globally adaptive quadrature is preferred (see [33]) and is, for example, the strategy used in the QUADPACK routine QAG [38] and the NAG library routine D01AKF [36]. The mechanism for finding the subinterval with largest error estimate deserves some discussion: although it often does not contribute significantly to the execution time for problems with up to a few thousand rule applications, it is generally asymptotically more expensive than the function evaluation, and this may be of significance for the very hard problems where parallelism will be of most benefit. A simple mechanism, used in QAG and D01AKF, is to maintain a fully ordered list of error estimates. Each time a new error estimate is to be added, the list is searched in order (from either the top or the bottom) until the correct place for the new error estimate is found. If the total number of rule applications required is N (and thus the cost of function evaluation is O(N)), then this mechanism for maintaining an ordered list has complexity O(N^2). It is possible to improve on this by maintaining a partially ordered structure often referred to as a heap. A heap consists of a binary tree of values with the property that the value at any node is greater than or equal to the values at its children. The largest value is therefore always at the root node. The two operations we require to perform for our application are removing the largest value from the heap, and adding an arbitrary new value to the heap. By performing comparisons between nodes and their children or parent, both these operations can be done with complexity O(log H), where H is the number of nodes currently in the heap. In the context of Algorithm CP this mechanism has complexity O(N log N). The requirement of the globally adaptive algorithm that we always select the largest error estimate is somewhat over-restrictive: as long as we select large error estimates, and the largest is not ignored for very long, then the convergence rate of the algorithm will not be significantly affected. The use of bins (variously called boxes or buckets) was suggested by Rice in [39]. By defining a set of bins [0, y_1), [y_1, y_2), \ldots, [y_{B-1}, y_B), [y_B, \infty) with 0 < y_1 < y_2 < \ldots < y_B, we can classify an error estimate according to the bin in which it lies. At each stage, we pick a subinterval from the largest non-empty bin. If the y_i are fixed, each subinterval need only be classified once, and hence this method has complexity O(N) + O(B). Napierala and Gladwell [37] have shown how this idea can be efficiently implemented, and we will return to it in the context of parallel algorithms.
3.1.6 Extrapolation

Automatic quadrature routines generate a sequence of (hopefully) increasingly accurate approximations to the integral I. If this sequence has a particular form, then convergence may be accelerated by applying extrapolation methods. If we take the Composite Trapezium rule and successively double the number of subintervals, then we have a suitable sequence. By taking linear combinations of consecutive terms in the sequence, we can eliminate the first term in the Euler-Maclaurin expansion of the error E. Repeated application of this process allows elimination of further terms. The elimination is called Richardson extrapolation, and its application to the Composite Trapezium rule is known as Romberg integration. A detailed discussion of Romberg integration, its error analysis and some variants is given in Section 6.3 of [11]. Because adaptive quadrature routines also produce a sequence of approximations to the integral, they too are amenable to acceleration by extrapolation. In particular the method due to Shanks [44], which is a generalisation of Aitken's \Delta^2 method, and whose implementation is known as Wynn's epsilon algorithm [46], proves useful. Given a sequence s_n, n = 0, 1, 2, \ldots of the form

s_n = s + \sum_{i=1}^{p} a_i q_i^n   (3.17)

with |q_i| < 1 and converging to the limit s, define the following array

e_{-1}^{(0)}   e_0^{(0)}   e_1^{(0)}   e_2^{(0)}   \ldots
e_{-1}^{(1)}   e_0^{(1)}   e_1^{(1)}   \ldots
e_{-1}^{(2)}   e_0^{(2)}   \ldots
e_{-1}^{(3)}   \ldots
  \vdots                                            (3.18)

by the relations

e_{-1}^{(j)} = 0, \quad e_0^{(j)} = s_j, \quad e_{k+1}^{(j)} = e_{k-1}^{(j+1)} + \left( e_k^{(j+1)} - e_k^{(j)} \right)^{-1}.   (3.19)

For each new s_j, we add the entries e_{-1}^{(j+1)} and e_0^{(j)}, and then compute the rest of the j-th anti-diagonal, e_i^{(j-i)}, i = 1, 2, \ldots, j. It can be shown that e_{2p}^{(0)} = s, and in general e_k^{(0)}, k = 0, 1, 2, \ldots converges to s faster than the s_j. If we set s_n to be the successive approximations generated by an automatic quadrature algorithm, then even if (3.17) is not satisfied exactly, we may still obtain some acceleration of convergence. This approach is used in routine QAGS from QUADPACK, which also appears in the NAG library as D01AJF.
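A compact sketch of the epsilon algorithm is given below (our own illustration, building the table column by column rather than by anti-diagonals). It is applied here to the partial sums of the slowly convergent alternating series for ln 2, purely as a demonstration of the acceleration; only the even-numbered columns are estimates of the limit, the odd-numbered columns being intermediate quantities.

import math

def epsilon_algorithm(s):
    """Wynn's epsilon algorithm, following the relations (3.19)."""
    n = len(s)
    prev_col = [0.0] * (n + 1)      # column e_{-1}^{(j)}, identically zero
    curr_col = list(s)              # column e_0^{(j)} = s_j
    accelerated = [curr_col[0]]
    for k in range(n - 1):
        next_col = []
        for j in range(len(curr_col) - 1):
            diff = curr_col[j + 1] - curr_col[j]
            next_col.append(prev_col[j + 1] + 1.0 / diff)
        prev_col, curr_col = curr_col, next_col
        if k % 2 == 1:              # columns e_2, e_4, ... estimate the limit
            accelerated.append(curr_col[0])
    return accelerated

# Partial sums of 1 - 1/2 + 1/3 - ... converge to ln 2 very slowly.
partial_sums, total = [], 0.0
for i in range(1, 12):
    total += (-1) ** (i + 1) / i
    partial_sums.append(total)
print(partial_sums[-1] - math.log(2.0))                       # error of the raw sum
print(epsilon_algorithm(partial_sums)[-1] - math.log(2.0))    # accelerated error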
3.2 Multidimensional quadrature

In passing from one-dimensional quadrature to multi-dimensional quadrature, significant additional difficulties are encountered. The behaviour of the integrand function may be considerably more complex (for example, it can have singularities on arbitrary manifolds) and the integrand function will often be more expensive to evaluate. The domain of integration can be an arbitrary region of d-dimensional space, and as the dimension increases, the number of function evaluations necessary to obtain even modest accuracy tends to increase rapidly. The variety of possible regions of integration is dealt with by transforming them to one of a small set of standard regions by a non-singular transformation; see Section 5.4 of [11]. The standard regions are the d-dimensional hypercube, the d-dimensional hypersphere, the surface of the d-dimensional hypersphere, and the d-dimensional simplex. One possible way of generating multidimensional rules is by taking the Cartesian product of one-dimensional rules. Let

Q_1 = \sum_{i=0}^{m} \omega_i f(x_i) \approx \int_{R_1} f(x) \, dx   (3.20)

and

Q_2 = \sum_{j=0}^{n} \nu_j f(y_j) \approx \int_{R_2} f(y) \, dy   (3.21)

be one-dimensional quadrature rules; then

Q_1 \times Q_2 = \sum_{i=0}^{m} \sum_{j=0}^{n} \omega_i \nu_j f(x_i, y_j) \approx \int_{R_1 \times R_2} f(x, y) \, dx \, dy   (3.22)

gives a two-dimensional quadrature rule. This concept can be extended to arbitrary dimension in the obvious way. Unfortunately, for a p-point one-dimensional rule, the number of points in the corresponding d-dimensional product rule is p^d, which grows very rapidly. One way of reducing the number of points required is due to Sag and Szekeres [43]. The region of integration is transformed to the unit
hypersphere in such a way that the integrand and its derivatives are zero on the surface of the hypersphere. Points near the surface contribute very little to the integral, and so may be ignored, greatly reducing the number of points required in high dimensions. In order to generate rules with fewer points, rules which are exact for monomials of a certain degree can be sought. A monomial of degree q is of the form \prod_{i=1}^{d} x_i^{a_i} where a_i \in \mathbb{Z}^+ and \sum_{i=1}^{d} a_i \leq q. A multidimensional rule is said to be of degree q if it integrates exactly all monomials of degree less than or equal to q, but not all monomials of degree q + 1. Note that the product of a one-dimensional rule of degree q with itself gives a rule of degree q for any dimension d, but it will be of much higher precision than other rules of degree q, since it will integrate exactly all monomials \prod_{i=1}^{d} x_i^{a_i} such that each a_i \leq q, not just those with \sum_{i=1}^{d} a_i \leq q. It can be shown (see Section 5.7 of [11]) that for a rule of degree q in d dimensions, the number of points required n lies within the bounds

\binom{d + [q/2]}{[q/2]} \leq n \leq \binom{d + q}{q}.   (3.23)
For our experiments, we use the family of embedded monomial rules on the hypercube [-1, 1]^d due to Genz and Malik [21]. The rule of degree seven, Q_7, requires 2^d + 2d^2 + 2d + 1 function evaluations. Error estimation is done by also computing the corresponding rule of degree five, Q_5, which utilises a subset of the abscissae. Here

Q_5 = w_0 f(0, 0, \ldots, 0) + w_1 \sum f(\lambda_1, 0, \ldots, 0) + w_2 \sum f(\lambda_2, \lambda_2, \ldots, 0)   (3.24)

and

Q_7 = \hat{w}_0 f(0, 0, \ldots, 0) + \hat{w}_1 \sum f(\lambda_1, 0, \ldots, 0) + \hat{w}_2 \sum f(\lambda_2, \lambda_2, \ldots, 0) + \hat{w}_3 \sum f(\lambda_3, \lambda_3, \ldots, \lambda_3),   (3.25)

where the sums are over all permutations of co-ordinates including sign changes. The \lambda_i are independent of d and satisfy 0 < \lambda_i < 1. The w_j and \hat{w}_j are simple functions of d and are not always positive. An adaptive routine based on this rule pair, and on the algorithm described by Genz and Malik in [21], appears in the NAG library as D01FCF. The routine DCUHRE, described in [3], utilises the degree nine rule from the same family, together with some higher degree rules for the special cases of two and three dimensions.
3.2.1 Adaptive quadrature

Multidimensional adaptive quadrature can be achieved via a simple extension of the one-dimensional algorithm. The only additional decision to be made is the direction in which the subregion should be bisected. (Note that we will use the term interval loosely as a synonym for region, so that we do not need to distinguish between the one- and multi-dimensional cases unnecessarily.) In the routine ADAPT [20], Genz and Malik use fourth divided differences (which can be easily computed from the function evaluations required for the degree seven rule described above) as an indicator of the direction in which the integrand is most badly-behaved. This direction is therefore the one in which subdivision is performed. The treatment of singularities is much more difficult in more than one dimension, since the singularity can appear on any arbitrary manifold of the region of integration. For certain types of singularity (for example, on boundaries of the region) extrapolation methods can be used, though the number of special cases that have to be treated increases with the number of dimensions.

3.2.2 Other quadrature techniques

Although the globally adaptive algorithm is generally robust, it is not always the best choice, particularly if high accuracy is not required, or the dimension of the problem is large (say ≳ 10). The globally adaptive algorithm requires large amounts of memory for difficult integrands (O(Nd) words, where N is the number of quadrature rule evaluations). Furthermore, for difficult integrands it is good practice to apply at least two different methods and compare the results. We now briefly review some other multidimensional quadrature algorithms.
Monte Carlo methods

Monte Carlo methods for integration are based on simple sampling techniques. If we take a set of n random points x_1, x_2, \ldots, x_n in the region of integration, and compute the average value of the integrand at these points, then this average converges to the required integral in the sense that

P\left( \lim_{n \to \infty} \frac{V}{n} \sum_{i=1}^{n} f(x_i) = I \right) = 1,   (3.26)

where V is the volume of the region of integration, and P is a probability function. In practice, obtaining true random numbers is very difficult, and pseudo-random numbers are used instead. These are produced by a deterministic algorithm, but will pass certain statistical tests for randomness. Error estimation is achieved by computing the variance of the sample, and deriving confidence limits for the error. These limits are of the form c \sigma n^{-1/2}, where c depends on the confidence level and \sigma^2 is the sample variance. Although the rate of convergence with increasing sample size is slow, it is independent of the number of dimensions, except that \sigma tends to increase with d. Various methods can be used to reduce the variance; see Section 5.9 of [11]. Adaptive Monte Carlo strategies are possible, in which more points are added in subregions of high variance until error conditions are satisfied. Such an algorithm is described in [32] and appears in the NAG library as routine D01GBF. Monte Carlo methods are not memory intensive, and can be used for problems of high dimensionality. However their poor convergence rate makes them suitable
only when low accuracy is acceptable.
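A plain (non-adaptive) Monte Carlo estimator with the variance-based error bound described above can be sketched as follows (our own illustration; the integrand and region are arbitrary examples).

import numpy as np

def monte_carlo(f, lower, upper, n, seed=0):
    """Monte Carlo estimate of an integral over a hyper-rectangle.

    Returns the estimate and a one-standard-error bound of the form
    V * sigma / sqrt(n), where sigma^2 is the sample variance.
    """
    rng = np.random.default_rng(seed)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    volume = np.prod(upper - lower)
    x = rng.uniform(lower, upper, size=(n, lower.size))
    values = f(x)
    estimate = volume * values.mean()
    error = volume * values.std(ddof=1) / np.sqrt(n)
    return estimate, error

# exp(-|x|^2) over [0, 1]^6: the error bound shrinks like n^(-1/2),
# independently of the dimension.
f = lambda x: np.exp(-np.sum(x * x, axis=1))
for n in (10000, 40000):
    print(n, monte_carlo(f, [0.0] * 6, [1.0] * 6, n))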
Lattice methods

Better convergence rates can be obtained by using deterministic sequences of points which are equidistributed in the region of integration. The fraction of points of such a sequence that lies in a subregion is asymptotically proportional to the volume of the subregion. Lattice rules can be considered as generalisations of the Composite Trapezium rule to the multi-dimensional case. A lattice rule on the unit d-dimensional cube can be written as

Q = \frac{1}{N} \sum_{j_1=0}^{n_1-1} \cdots \sum_{j_t=0}^{n_t-1} f\left( \left\{ \frac{j_1}{n_1} z_1 + \cdots + \frac{j_t}{n_t} z_t \right\} \right),   (3.27)

where N = n_1 n_2 \cdots n_t is the number of integration points, z_1, \ldots, z_t are integer vectors, and \{x\} denotes the fractional part of x. The rank of a lattice rule is the smallest value of t required to write the rule in this form. The rank can take values between 1 and d inclusive. The Cartesian product of a one-dimensional rule is an example of a lattice rule of rank d. Lattice rules can be constructed so that rules of lower rank are embedded in them, allowing easy error estimation. The z_i are chosen to minimise the theoretical error bound over a family of test functions. However, convergence theory for lattice rules shows that they perform well only for periodic functions, so it is normal practice for any non-periodic function to first have a non-linear periodising transformation applied to it. Rank-1 rules, also called number-theoretic rules, were first introduced by Korobov [28], and an implementation of his method appears in the NAG library as D01GCF. Sloan and Joe [45] give a detailed treatment of lattice rules of higher rank. Their numerical evidence suggests that lattice rules are competitive with adaptive methods, particularly when the number of function evaluations is large (more than 10^6, say). The disadvantages of lattice rules are that the number
of function evaluations N must be fixed in advance, so they cannot be used as the basis for automatic quadrature; the choice of the z_i for a given N can be computationally demanding; and the non-linear periodising transformation can increase the difficulty of the integrand.
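To make formula (3.27) concrete for the rank-1 (number-theoretic) case, the following sketch (our own illustration) evaluates a rank-1 lattice rule on the unit cube; the generating vector z used here is an arbitrary example, not one of the optimised vectors referred to above.

import numpy as np

def rank1_lattice(f, z, n):
    """Rank-1 lattice rule: Q = (1/n) * sum_{j=0}^{n-1} f({j z / n})."""
    z = np.asarray(z, dtype=float)
    j = np.arange(n).reshape(-1, 1)
    points = np.mod(j * z / n, 1.0)     # fractional parts {j z / n}
    return np.mean(f(points))

# A smooth periodic integrand on [0, 1]^3 with exact integral 1.
f = lambda x: np.prod(1.0 + 0.5 * np.sin(2.0 * np.pi * x), axis=1)
z = [1, 17, 41]                          # illustrative generating vector
for n in (101, 1009):
    print(n, rank1_lattice(f, z, n))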
Recursive methods

Recursive methods utilise one-dimensional adaptive methods in a recursive fashion, by thinking of

I = \int_{a_1}^{b_1} \int_{a_2}^{b_2} \cdots \int_{a_d}^{b_d} f(x_1, x_2, \ldots, x_d) \, dx_1 \, dx_2 \ldots dx_d   (3.28)

as

I = \int_{a_1}^{b_1} F(x_1) \, dx_1,   (3.29)

with

F(x_1) = \int_{a_2}^{b_2} \cdots \int_{a_d}^{b_d} f(x_1, x_2, \ldots, x_d) \, dx_2 \ldots dx_d,   (3.30)
and applying a one-dimensional algorithm in the x_1 direction. The function F(x_1) is then in turn evaluated by the same procedure. Practical difficulties arise in determining the appropriate error tolerances for the inner routines, and in the fact that functions computed as numerical integrals will not be smooth. These issues are discussed in detail in [16]. As with product rules, the number of integrand evaluations required grows rapidly with the dimension, and thus this approach is only feasible for low-dimension problems.
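The recursive structure can be sketched in a few lines using a standard one-dimensional adaptive routine (here SciPy's quad, purely to illustrate the structure; the inner tolerances are set naively, which is exactly the practical difficulty mentioned above).

from scipy.integrate import quad

def recursive_quadrature(f, limits, outer_args=(), tol=1e-8):
    """Integrate f(x1, ..., xd) over the box given by limits, one axis at a time.

    limits is a list of (a_i, b_i) pairs; the outermost variable is x1.
    """
    (a, b), rest = limits[0], limits[1:]
    if not rest:
        value, _ = quad(lambda x: f(*outer_args, x), a, b, epsabs=tol)
        return value
    # F(x1) in (3.30) is itself evaluated by a (recursive) quadrature.
    inner = lambda x: recursive_quadrature(f, rest, outer_args + (x,), tol)
    value, _ = quad(inner, a, b, epsabs=tol)
    return value

# Three-dimensional example with exact value 1.
print(recursive_quadrature(lambda x, y, z: 8.0 * x * y * z, [(0.0, 1.0)] * 3))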
3.3 Parallel algorithms

In this section we review existing work on parallel algorithms for globally adaptive quadrature. Note that there is no need to distinguish between one-dimensional and multi-dimensional quadrature: all the algorithms are applicable to either case, though of course some may work better than others. First, though, we discuss the potential parallelism within the standard algorithm (CP). Within a single function evaluation, there may be some parallelism present. However, this is entirely dependent on the nature of the function and how it is evaluated. If the function is an algebraic expression, the parallelism will be very limited, and very fine-grained. Within one application of a quadrature rule, the function evaluations may be executed in parallel. This may be acceptable if the function evaluation is expensive, and the number of processors is small. However, the number of function evaluations is fixed, precluding the use of large numbers of processors. Furthermore, if we exploit the parallelism only at this level, all the other stages in the algorithm are sequential, including data structure manipulation and searching for the subinterval(s) with largest error estimate. If function evaluation is not very expensive, but the number of rule applications is large, then this sequential fraction may become significant. For MIMD architectures, other sources of overhead, such as synchronisation and communication, may also be a problem for this approach, as we shall see in Chapter 5. Parallelism at this level is, however, generally well suited to vector architectures (see [22]) and array processors (see [18] and [34]), and variants of library routines (such as D01AUF in the NAG library) are designed to exploit this. For efficient parallel algorithms on MIMD machines, it is necessary to exploit coarser grained parallelism. To do this, Algorithm CP is modified by, at each stage, identifying a set of subintervals which can profitably be subdivided. This subdivision should give rise to enough new subintervals to keep p processors busy. There are two fundamental approaches to this problem, which are distinguished by the organisation of the data structures containing the current list of subintervals. In the first approach, the single list of Algorithm CP is preserved. In the second, each processor maintains a list of its own, with each subinterval being contained in the list of exactly one processor. We will refer to these approaches as single list and multiple list algorithms respectively. The two approaches have developed more or less independently, since single list algorithms are the more natural choice for classical shared memory (GMSV) machines, while multiple list algorithms are the obvious choice for classical message passing (DMMP) machines. Many of the existing multiple list algorithms directly reflect the network topology of their target architecture. Both types of algorithm can be readily implemented on DMSV architectures, however, which will allow us to compare them directly. We now give a brief survey of existing algorithms of both types. We have implemented a number of them, and these will be described in more detail in subsequent Chapters.
3.3.1 Single list algorithms

The simplest single list algorithm is due to Genz [19]. Instead of selecting a single subinterval for bisection, the p/2 subintervals with the largest error estimates are selected and bisected, giving one subinterval for each of the p processors, assuming p is even. This algorithm is implemented in DCUHRE [3], using heap sorting as the mechanism for finding the largest error estimates. No attempt is made to parallelise sorting and data structure management. In [22], Gladwell describes a single list algorithm in the context of vector machines, but it is equally applicable to the parallel setting. The strategy for selecting subintervals is:

1. rank the list of intervals by error estimates so that e_1 ≤ e_2 ≤ \ldots ≤ e_s,

2. calculate r such that

\sum_{i=1}^{r-1} e_i < \epsilon \quad \mathrm{and} \quad \sum_{i=1}^{r} e_i \geq \epsilon,

3. select all the subintervals with error estimates ≥ e_r.

The rationale behind this algorithm is that all these subintervals, and possibly some others, will require subdivision in order to satisfy the error tolerance. It can be shown that, under certain assumptions, this algorithm will produce exactly the same final list of subintervals as Algorithm CP. A proof can be found in Section 8.1. At first sight, this algorithm appears to require a fully ranked list, which is expensive to maintain. In Chapter 5, however, we show that this algorithm can be efficiently implemented using a heap. In [37], Napierala and Gladwell suggest the use of bins to further reduce the ranking cost. By using bins, some ranking information is lost, and the sums in step 2 are over the midpoints of bins weighted by the number of subintervals in each bin. Hence, this modified algorithm only approximates Gladwell's original algorithm, and the quality of the approximation is determined by the number and distribution of the bins. These issues will be discussed in more detail in Chapters 5 and 8.
3.3.2 Multiple list algorithms

The simplest possible multiple list algorithm is described by Burrage in [9] for the one-dimensional problem. The interval of integration is divided into p equal subintervals, and Algorithm CP applied to each subinterval on a different processor, using ε/p as the error tolerance. This algorithm requires no communication between processors other than to distribute the initial subintervals and to accumulate the final result. For integrands with difficulties that are not spread uniformly throughout the interval, this algorithm performs poorly, since the loads on the various processors will be unequal, and some processors will finish before others. Other multiple list algorithms use the same basic principle, but attempt to detect, and correct, load imbalance as the algorithm progresses. Note that the load on a processor cannot be measured, as it cannot be predicted at a given stage in the algorithm how many more stages will be necessary. The load, therefore, must be estimated from known quantities. A variety of different tests for satisfying the global error tolerance are also possible. The first work in this area is due to Rice [40], [41], [42]. This work is purely theoretical, and is not strictly a parallel version of Algorithm CP, since subintervals with small error estimates are discarded and cannot therefore be further subdivided. The subintervals which are not discarded are stored in queues, with newly generated subintervals put at the backs of the queues. Thus the next subinterval selected for subdivision is not always that with largest error estimate. The length of each processor's queue is used as the estimate of load, and the load is balanced by transferring intervals to neighbouring processors in a ring configuration until queue lengths are equalised. In [2], Berntsen and Espelid describe a parallel version of Algorithm CP for an Intel iPSC/1 hypercube. No details of their load balancing strategies are available, as they found that no performance gain resulted. However, this was almost certainly due to inefficient communication on the target machine and the need to explicitly route messages between non-neighbouring processors, rather than any failing in the algorithm. Their algorithm differs from that of Burrage in that one processor is reserved for maintaining the global error sum via corrections sent by other nodes, and for detecting convergence. Thus a true global termination criterion (Σ e_i < ε) is implemented. De Doncker and Kapenga [12], [13], [14] also describe algorithms intended for hypercubes. The error condition used is the same as Burrage's (the total error on each processor is less than ε/p). Load on each processor is measured by counting the number of subintervals whose error estimate exceeds the maximum of ε/p and some (unspecified) fraction of the current local error. Load information is sent asynchronously to neighbouring processors only when the load changes
significantly. A processor is considered overloaded if its load is greater than the average A of its load l and that of its neighbours in the hypercube topology. Such a processor distributes intervals to underloaded neighbours (those with load less than A). An underloaded neighbour with load \tilde{l} receives

\left[ \frac{(l - \tilde{l}) \, \min(l - A, \; l/2)}{(\hat{l} - 1) \, l} \right]

subintervals, where \hat{l} is the sum of the loads of the underloaded neighbours. The authors also suggest more complicated, probabilistic, load prediction mechanisms, and distributed termination schemes. In their discussions it is clear that they are using adaptive quadrature as a test case for the more general task pool distribution problem, and some of these ideas are probably unnecessarily complicated for the specific case. Their algorithms are asynchronous in nature, meaning that the execution path through the code is not repeatable from one run to the next, as it will depend on the relative timing of events such as message arrival. Lapegna and D'Alessio [31] describe an algorithm based on a two-dimensional grid topology. The largest error estimate is used as the load estimator, and each processor compares its load with its neighbour in alternating directions (North or East) on successive stages of the algorithm. If the ratio of the loads is sufficiently large, then the heavily loaded processor sends its subinterval with largest error estimate to its neighbour. Termination occurs when the local error estimate on each processor is less than ε multiplied by the fraction of the volume (length in one dimension) of the whole interval which is currently in that processor's list.
Chapter 4

The Kendall Square Research KSR-1

4.1 Hardware architecture

The Kendall Square Research KSR-1 is¹ a distributed shared memory machine; it has a physically distributed memory, but there is extensive hardware support for a single logical address space. Each cell consists of a 20 MHz processor with a peak 64-bit floating point performance of 40 Mflop/s and a 32 Mbyte main memory, which is also organised as cache memory. The cells are connected by a two level, uni-directional, slotted ring network. At the lower level, each ring can accommodate up to 32 cells and has a bandwidth of 1 Gbyte/s. The upper level ring has a bandwidth of 4 Gbyte/s, and can connect up to 32 lower level rings. (N.B. Some of the experiments we report were performed on a 32 cell, single ring system, others on a two ring system holding variously 40 and 64 cells. The configuration of the machine for each experiment is described where results are presented.) The machine has a Unix-compatible multi-user distributed operating system.

¹ Production of the KSR-1 ceased in 1994.
The memory system is a directory-based system which supports fully cache coherent shared virtual memory (SVM) through an invalidate policy. Except for a small amount of `held' memory (reserved for the operating system), the main memory of every cell is configured as cache memory. Such a system is often referred to as a cache-only memory architecture (COMA). Data movement is request driven; a memory read operation which cannot be satisfied by a processor's own memory generates a request which traverses the network and returns a copy of the data item to the requesting processor; a memory write request which cannot be satisfied by a processor's own memory results in that processor obtaining exclusive ownership of the data item, and a message traverses the network invalidating all other copies of the item. The unit of data transfer in the system is a subpage, which consists of 128 bytes (equivalent to 16 8-byte words). The operating system manages page migration and fault handling in units of 16 Kbyte. The KSR-1 processor has a level 1 cache, known as the subcache. The subcache is 0.5 Mbyte in size, split equally between instructions and data. The data subcache is two-way set associative with a random replacement policy. The cache line of the data subcache is 64 bytes (half a subpage). There is a 2 cycle pipeline from the subcache to registers. A request satisfied within the main cache of a cell results in the transfer of half a subpage to the subcache with a latency of 18 cycles (0.9 µs). A request satisfied remotely from the main cache of another cell results in the transfer of a whole subpage with a latency of around 150 clock cycles (7.5 µs). A request for data not currently cached in any cell's memory results in a traditional, high latency, page fault to disk. In order for a thread to access data on a subpage, a copy of the page in which the subpage resides must be present in the cache of the processor on which the
thread executes. If the page is not present, a page miss occurs and the operating system and memory system combine to make a copy of the page present. If a new page causes an old page in the cache to be displaced, the old page is moved to the cache of another cell if possible. If no room can be found for the page in any cache, the page is displaced to disk. Moving a page to the cache of another cell is much cheaper than paging to disk.
4.2 Programming model

Parallel execution of programs is achieved by allowing a number of threads to participate in the execution. Each thread is a flow of control within a process. By default, the threads are scheduled by the operating system, and may timeshare on a cell with other threads, or be re-scheduled from one cell to another during program execution. However the allocate cells command can be used to reserve a number of cells for the execution of a program. Provided the number of threads required does not exceed the number of cells allocated, no time-sharing or movement of threads will occur. Threads are managed in a Fortran program via a set of extensions to Fortran 77 consisting of compiler directives and library calls. For full details see [30]. The two most important directives are parallel region and tile. The parallel region directive encloses a section of code thus:

c*ksr* parallel region ([options])
      .
      [section of code]
      .
c*ksr* end parallel region
The enclosed code is executed by a number of threads, which is specified as one of the options to the directive. In addition it is possible to declare scalar variables as private variables: each thread will then have its own copy of these variables. All other variables are shared between threads. In order for threads to identify themselves, the integer function ipr_mid() is provided, which returns a thread ID in the range 0, ..., p - 1 when there are p threads executing the parallel region. Synchronisation, by semaphores, between threads can be achieved at the most basic level by calls to gspwt (get subpage wait) and rsp (release subpage) which, respectively, attain and release atomic ownership status on a specified subpage. If a subpage is in atomic state, a thread will block on a gspwt call until atomic status is released by another thread using rsp. Library routines are available which use these constructs to implement mutual exclusion (mutex) locks, condition variables and barrier synchronisation. Barrier synchronisation also occurs at the beginning and end of a parallel region. The parallel region, together with the synchronisation mechanisms described above, is a very powerful parallel construct, but it requires careful management by the programmer and may necessitate significant code changes for problems requiring complicated scheduling of parallel work. Since loop-based parallelism is very common, a separate directive (tile) is supplied which applies only to perfect rectangular loop nests, including, of course, single loops. This reduces programmer effort, and the code changes required, for a class of common parallel constructs. The tile directive takes the following form:

c*ksr* tile (index_list, [options])
      .
      [loop nest]
      .
c*ksr* end tile

This divides the iteration space of the loop nest into a number of rectangular pieces (tiles). These tiles are then scheduled to be executed in parallel. The index list allows the programmer to specify which iterators are to be tiled. The options allow specification of the number of threads to be used, and a choice of scheduling strategies. There are two strategies which are of interest in this experiment: slice and mod. The slice strategy divides the iteration space into p roughly equally sized tiles. The mod strategy divides the iteration space into more than p tiles (where possible), and schedules them on p threads in a modulo fashion. For either strategy the size of the tiles can be fixed by the programmer, or determined at run time. In the latter case the tile size will normally be chosen as a multiple of 16 to help avoid false sharing of subpages. False sharing is said to occur when two or more threads write to different words on the same subpage, causing unnecessary data movement. The options also allow scalar variables to be declared as private variables, or as reduction variables. In the latter case results are accumulated in local copies of the variable, and code is generated which reduces these to a single variable at the end of the tiled loop. A further mechanism for avoiding false sharing is the subpage directive, which forces a scalar variable, or an element of an array, to be aligned on a subpage boundary. In order to minimise data movement, it is sometimes advantageous to force several different tiled loop nests which have common iterators to be tiled with the same strategy, so that any value of the iterator is always assigned to the same thread. This facility is provided by the affinity region directive.
Chapter 5

Parallel Single List Algorithms for One-Dimensional Quadrature

In this Chapter we develop algorithms for one-dimensional quadrature based on the single list approach described in Section 3.3. Recall that this approach relies on maintaining a single list of subintervals, as in Algorithm CP. In order to obtain coarse-grained parallelism we allow more flexibility both in selecting subintervals for further subdivision, and in how the subdivision is performed.
5.1 Algorithms

The principal existing work in this area, due to Genz, Gladwell and Napierala/Gladwell, was described in Section 3.3.1. These algorithms can all be considered as members of a family of algorithms with a common structure. The general member of this family is described in pseudo-code in Figure 5.1, where do all denotes a loop which may be executed in parallel. The abbreviation SS denotes synchronous selection: the selection process takes place as part of a global synchronisation point which includes updating the list of subintervals and the global error estimate. Thus selection is performed with full knowledge of the current state of the computation.
Algorithm SS:
  p-sect interval [a, b]
  do all
    apply quadrature rule to each subinterval
  end do all
  initialise list of subintervals
  do while (error > ε) and (no. of rule evaluations ≤ Nmax)
    select subintervals for further subdivision
    subdivide selected subintervals
    do all
      apply quadrature rule to subintervals
    end do all
    remove `old' subintervals from list
    add `new' subintervals to list
    update integral approximation and error estimate
  end do
Figure 5.1: Pseudo-code for Algorithm SS

Within this family of algorithms, we have yet to define the selection and subdivision procedures. We will discuss subdivision first, since choices here are more limited. Assuming that the selection process can select any number of subintervals in the range 1, 2, \ldots, s, where s is the current list length, what way of subdividing them will make best use of the p available processors? Clearly each subinterval selected should be at least bisected, but on the other hand we should not create too many new subintervals. Excessive subdivision might lead us to process intervals that would not later be selected for subdivision, and thus we might do some unnecessary work. Given these constraints, the following strategy seems sensible: the number of new subintervals should be the smallest multiple of p such that all selected subintervals are at least bisected. More explicitly, if the number of subintervals selected is k, then the subdivision strategy becomes that shown in
Figure 5.2. Note that ⌊x⌋ denotes the largest integer not greater than x, and ⌈x⌉ the smallest integer not less than x.

Subdivision strategy:
  if k ≤ p/2
    divide p - k⌊p/k⌋ subintervals into ⌊p/k⌋ + 1 pieces
    divide the remaining subintervals into ⌊p/k⌋ pieces
  else if p/2 < k < p
    divide 2p - k⌊2p/k⌋ subintervals into ⌊2p/k⌋ + 1 pieces
    divide the remaining subintervals into ⌊2p/k⌋ pieces
  else if k ≥ p
    divide p⌈2k/p⌉ - 2k subintervals into 3 pieces
    divide the remaining subintervals into 2 pieces
  end if
Figure 5.2: Pseudo-code for the subdivision strategy

Assuming that the integrand function takes the same time to evaluate for any subinterval, this strategy ensures that load balancing the main do all loop in Algorithm SS is trivial, as the number of new subintervals is a multiple of p. If the integrand function evaluation time is dependent on the subinterval, then this strategy will still work well, but the do all loop will require more sophisticated scheduling techniques. In the general case, the strategy divides some subintervals into one more piece than others. If the selection strategy orders the subintervals on their error estimates, then it is clear that the extra subdivisions should be performed on subintervals with larger error estimates. In practice, the choice of which intervals have the extra subdivisions makes little difference, and if ordering information is not available it is not worth the additional expense of obtaining it solely for this purpose.
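The arithmetic of Figure 5.2 can be transcribed directly, as in the sketch below (our own Python illustration; the thesis codes themselves are written in Fortran). The function returns the number of pieces into which each of the k selected subintervals should be divided; the total is always a multiple of p.

def subdivision_counts(k, p):
    """Pieces per selected subinterval, following the strategy of Figure 5.2."""
    if k <= p // 2:
        small = p // k
        n_big = p - k * small                   # total = p
    elif k < p:
        small = 2 * p // k
        n_big = 2 * p - k * small               # total = 2p
    else:
        small = 2
        n_big = p * (-(-2 * k // p)) - 2 * k    # -(-x // p) is ceil(x/p); total = p*ceil(2k/p)
    return [small + 1] * n_big + [small] * (k - n_big)

for k in (2, 5, 7, 13):
    pieces = subdivision_counts(k, 8)
    assert sum(pieces) % 8 == 0
    print(k, pieces, sum(pieces))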
5.1.1 Selection strategies

A selection strategy for Algorithm SS has to balance two conflicting objectives. On one hand, it should select as many subintervals as possible, so that the number of times the selection strategy is invoked is minimised, and the impact of parallel overheads associated with synchronising and scheduling each do all loop is minimised. On the other hand, we do not wish to select intervals whose error estimates are sufficiently small that subdividing them will not make significant progress towards the goal of satisfying the error criterion. Although we cannot predict the effect of subdividing an individual subinterval, it is a reasonable assumption that subdividing intervals with large error estimates will tend to reduce the global error estimate more than subdividing intervals with small error estimates. If we adhere to this principle, then strategies will have the property that, for some value C, only subintervals with error estimate greater than C will be selected. All but one of the selection strategies described here have this property. We now describe the selection strategies used for our numerical experiments, some of which are based on the work of other authors.
Strategy GZ (Genz)

Genz's algorithm [19] almost corresponds to Algorithm SS with the following selection strategy:

1. Select the p/2 intervals with largest error estimates.

The difference is in the handling of odd numbers of processors: Genz suggests selecting the ⌊p/2⌋ intervals with largest error estimates, but we utilise the extra processor by trisecting the interval with largest error estimate.
Strategy GL (Gladwell)

This is the strategy suggested in [22] for vector architectures.
1. Rank the list of intervals by error estimates so that e_1 ≤ e_2 ≤ \ldots ≤ e_s,

2. calculate r such that

\sum_{i=1}^{r-1} e_i < \epsilon \quad \mathrm{and} \quad \sum_{i=1}^{r} e_i \geq \epsilon,

3. select all the intervals with error estimates ≥ e_r.

The rationale behind this strategy is that it is certain that at least this many subintervals must be subdivided to satisfy the global error criterion.
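An illustrative, serial transcription of Strategy GL is given below (our own sketch using a full sort; the implementation described later in this chapter uses a heap instead).

def select_gl(errors, eps):
    """Strategy GL: indices of the subintervals selected for subdivision.

    The subintervals with the smallest error estimates, whose errors sum to
    less than eps, are left alone; everything with error >= e_r is selected.
    """
    order = sorted(range(len(errors)), key=lambda i: errors[i])   # ascending
    running = 0.0
    for i in order:
        running += errors[i]
        if running >= eps:
            e_r = errors[i]
            return [j for j, e in enumerate(errors) if e >= e_r]
    return []   # the global error criterion is already satisfied

print(select_gl([1e-3, 5e-6, 2e-4, 7e-3, 1e-6], eps=1e-4))   # -> [0, 2, 3]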
Strategy AL

This strategy was suggested in [5] and explored further in [6]:

1. Search the list of intervals for the largest error estimate, say e_max, and

2. select all the intervals with error estimates > α e_max, for some α satisfying 0 < α ≤ 1.

The choice of a reasonable value for α will be addressed in Section 5.3. Note that this strategy does not require any ranking information for the error estimates.
Strategy ES

This is another strategy that avoids ranking information: the intervals for subdivision are identified as those with error estimates greater than a fraction of the global error target, with the fraction decreasing as the algorithm progresses. A natural way of achieving this is to take the reciprocal of the current number of intervals as the fraction, since this guarantees that at least one interval is always selected.

1. Assume that currently there are s intervals in the list of intervals,

2. select all intervals which have error estimates ≥ ε/s.
Strategy AB

This strategy is similar to that suggested in [37]. It is motivated by the desire to reduce the cost of ranking intervals in Strategy GL.

1. Search the list of subintervals for the largest and smallest error estimates (e_max and e_min).

2. Divide the range [e_min, e_max] into B bins, where the i-th bin, i = 1, 2, \ldots, B, is given by

[\exp(\log(e_{min}) + (i-1)l), \; \exp(\log(e_{min}) + il)],

where l = (\log(e_{max}) - \log(e_{min}))/B.

3. Determine the number of subintervals n_i whose error estimates lie in the i-th bin.

4. Find r such that

\sum_{i=1}^{r-1} M_i n_i < \epsilon \quad \mathrm{and} \quad \sum_{i=1}^{r} M_i n_i \geq \epsilon,

where

M_i = \exp(\log(e_{min}) + (i - 1/2)l)

is the exponential of the midpoint of the i-th bin.

5. Select all subintervals with error estimates greater than

\exp(\log(e_{min}) + (r-1)l),

that is, all subintervals whose error estimates lie in bins r, r+1, \ldots, B. In other words, we select all the subregions with larger error estimates
so that the sum of the (approximate) error estimates associated with the remaining subregions is less than the required tolerance. Note that it is possible that the approximation error may mean that

\sum_{i=1}^{B} M_i n_i < \epsilon

even though the true sum of error estimates is greater than \epsilon. If this occurs, we select all the subintervals in the largest non-empty bin. Napierala and Gladwell suggest the use of bins with fixed end points; this has the advantage that when a subinterval is classified, its index can be added to a list of subintervals for each bin. Each subinterval therefore only has to be classified once. On the other hand, there is a risk that all the subintervals lie in a small number of bins, which can result in a poor approximation to Strategy GL. These issues turn out to be of little consequence for one-dimensional quadrature. However we will revisit them in the context of multidimensional problems in Chapter 8.
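A sketch of the bin-based selection follows (our own illustration of Strategy AB with B logarithmically spaced bins; bin indices here run from 0 for the smallest errors, rather than from 1 as in the text).

import math

def select_ab(errors, eps, n_bins=16):
    """Strategy AB: approximate Strategy GL using logarithmically spaced bins."""
    e_min, e_max = min(errors), max(errors)
    if e_min == e_max:
        return list(range(len(errors)))          # everything lies in one bin
    width = (math.log(e_max) - math.log(e_min)) / n_bins

    def bin_of(e):
        b = int((math.log(e) - math.log(e_min)) / width)
        return min(b, n_bins - 1)                # e_max falls in the last bin

    counts = [0] * n_bins
    for e in errors:
        counts[bin_of(e)] += 1
    # Find the first bin r at which the approximate running sum (bin midpoint
    # times count) reaches eps; select bins r, r+1, ..., n_bins - 1.  If the
    # sum never reaches eps, only the last (largest non-empty) bin is selected.
    running, r = 0.0, n_bins - 1
    for b in range(n_bins):
        midpoint = math.exp(math.log(e_min) + (b + 0.5) * width)
        running += midpoint * counts[b]
        if running >= eps:
            r = b
            break
    return [i for i, e in enumerate(errors) if bin_of(e) >= r]

print(select_ab([1e-3, 5e-6, 2e-4, 7e-3, 1e-6], eps=1e-4))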
Strategy LE

This strategy selects all intervals which fail the usual locally adaptive error criterion.

1. Assume that the i-th interval has the range [a_i, b_i] and the error estimate e_i.

2. Select all subintervals for which e_i > ε(b_i - a_i)/(b - a).

Note that this strategy does not select all subintervals with error estimates greater than some value C.
5.1.2 Other algorithms

Besides Algorithm SS with the various selection strategies, we also implement two other parallel algorithms for comparison. The first of these, Algorithm LL, exploits the low-level parallelism present in the application of a quadrature rule. This algorithm is just Algorithm CP, with the 122 function evaluations associated with the application of the 30/61-point Gauss-Kronrod rule to the two new subintervals coalesced into a single do all loop. In Algorithm AS we implement an asynchronous selection procedure, whereby each processor selects the subinterval with the largest error estimate which is not currently being subdivided by some other processor. Pseudo-code for this algorithm is given in Figure 5.3.
Algorithm AS:
  p-sect interval [a, b]
  do all
    apply quadrature rule to subintervals
  end do all
  mark all subintervals inactive
  in parallel, each processor executes:
  do while (error > ε)
    find interval with largest error estimate which is inactive
    mark it active
    bisect it
    apply quadrature rules to both subintervals
    remove `old' interval from list
    add two `new' intervals to list
    mark them inactive
    update the approximation to the integral and the error estimate
  end do
Figure 5.3: Pseudo-code for Algorithm AS

Note that Algorithm AS is non-deterministic in the sense that both the result
returned and the number of function evaluations vary between different runs with the same input. As we have already discussed, this can be considered undesirable from a software engineering point of view: debugging and testing are made considerably more difficult by this property.
5.2 Implementation issues
When implementing Algorithm CP, the major design decision concerns the storage and maintenance of ordering information about the error estimates. In Section 3.1.5 we discussed the relative merits of using a fully ordered list or a heap. For our experiments we implement Algorithm CP both ways. For the fully ordered list version we store an integer valued order array in which the ith element contains the index of the subinterval with the ith largest error estimate. When we bisect the subinterval with largest error estimate, we overwrite its entry in the list of subintervals with one of the two new subintervals; the other new subinterval is appended to the list. To find the new values in the order array, we start with the first new subinterval and compare its error estimate with other subintervals in descending order, swapping order values until the new subinterval reaches its correct position in the list. We then use the same procedure for the second new subinterval, but traversing the list in ascending order. This operation has complexity O(s) in the number of comparisons, where s is the current length of the list. Thus the overall complexity of maintaining the fully ordered list is $O\left(\sum_{s=1}^{N/2} s\right)$, or $O(N^2)$, comparisons, where N is the total number of applications of the quadrature rule.
For the heap version, we also use an integer valued array. We implicitly map a fully populated binary tree onto this array by numbering the tree in a breadth-first, left-to-right fashion. The ith entry in the array then has entry $\lfloor i/2 \rfloor$ as its parent in the tree and entries 2i and 2i + 1 as its children. To maintain a heap
we must ensure that the value at each node of the tree is greater than or equal to both of its children. To push a new subinterval onto the heap, we place its index in the first free array element. We then interchange it with its parent until the heap property is satisfied. To pop the largest value from the heap we remove the first entry from the array. We then take the last entry in the array and place it in the first entry (the tree root). We then interchange it with the larger of its children until the heap property is again satisfied. To select the interval with largest error estimate we simply pop it from the heap. After bisection, we push the two new subintervals onto the heap. Since each push and pop operation has complexity $O(\log s)$, the total complexity is $O\left(\sum_{s=1}^{N/2} \log s\right)$, or $O(N \log N)$, comparisons.
For the implementation of Algorithm SS on the KSR-1, the storage of the subinterval list assumes some importance. Since communication takes place on the level of 128 byte subpages, it proves efficient to store all the data associated with one subinterval (two endpoints, error estimate and integral approximation) on the same subpage. This is most conveniently achieved by using an array of inner dimension four and outer dimension the maximum allowed list length. Parallelism is obtained by the use of a tile directive on the loop over new subintervals containing the call to the quadrature rule. Since the subdivision strategy ensures that the number of new subintervals is a multiple of p, we can use slice scheduling to achieve a load-balanced partition.
Both Strategies GZ and GL can be efficiently implemented using a heap. In each case the selection process consists of pushing all newly generated subintervals onto the heap, then popping off the appropriate number. For GZ, we pop p intervals; for GL, we keep popping until the sum of the remaining error estimates is less than ε. The overall complexity of these strategies is the same as for Algorithm CP with a heap, namely $O(N \log N)$, since each subinterval processed has to be pushed onto and popped from the heap.
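A minimal Python sketch of this implicit heap follows (indices stored in a 1-based array and keyed on the error estimates). It is only an illustration of the data structure described above, not the KSR-1 code; the class and attribute names are assumptions.

```python
# Illustrative implicit max-heap: heap[k] holds the index of a subinterval,
# ordered so that the error estimate at a node is >= those of its children.
# The parent of position k is k//2 and its children are 2k and 2k+1 (1-based).
class ErrorHeap:
    def __init__(self, errors):
        self.errors = errors          # errors[i] = error estimate of subinterval i
        self.heap = [None]            # dummy entry so the root sits at position 1

    def push(self, idx):
        self.heap.append(idx)
        k = len(self.heap) - 1
        while k > 1 and self.errors[self.heap[k]] > self.errors[self.heap[k // 2]]:
            self.heap[k], self.heap[k // 2] = self.heap[k // 2], self.heap[k]
            k //= 2

    def pop(self):                    # assumes the heap is non-empty
        top = self.heap[1]
        last = self.heap.pop()
        if len(self.heap) > 1:
            self.heap[1] = last
            k = 1
            while 2 * k < len(self.heap):
                c = 2 * k             # sift down, following the larger child
                if c + 1 < len(self.heap) and self.errors[self.heap[c + 1]] > self.errors[self.heap[c]]:
                    c += 1
                if self.errors[self.heap[c]] >= self.errors[self.heap[k]]:
                    self.heap[k], self.heap[c] = self.heap[c], self.heap[k]
                    k = c
                else:
                    break
        return top
```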
All the other strategies require inspection (one or more times) of the error estimate for every subinterval in the current list. The complexity of all these strategies is therefore O(s), though the constant of proportionality will differ between strategies. The overall complexity, however, depends on the number of subintervals selected at each stage. In the best case, where all subintervals in the list are selected, the complexity is O(N) (no subinterval is left undivided, so each subinterval processed is examined by the selection procedure exactly once). In the worst case, where only one interval is selected at each stage, the complexity is $O(N^2/p)$.
Proof: Suppose the selection procedure is called S times, and always selects just one interval and subdivides it into p pieces. The total number of subintervals processed N is therefore p(S + 1). The list length is initially p, and at each stage it increases by p − 1. The complexity of selection is therefore
$p + (2p - 1) + (3p - 2) + \cdots + (Sp - S + 1) = \sum_{i=1}^{S} (ip - i + 1) = (p - 1)\frac{S(S+1)}{2} + S$   (5.1)

But S = N/p − 1, so the complexity is

$(p - 1)\frac{(N/p - 1)(N/p)}{2} + N/p - 1$   (5.2)
which, provided N ≫ p, is $O(N^2/p)$.
Because the selection strategies only examine one field in the list (the error estimate), poor spatial locality results from the storage method for the list described above, as only every fourth word in memory is accessed. It proves efficient in terms of execution time, though wasteful in terms of memory, to keep a copy of the error estimates in a one-dimensional array, and use this copy in the selection strategy. For Strategy AB, much of the computational expense is in computing the logarithms of the error estimates. Since comparison operations can be equally well performed on the logarithms of values as on the values themselves, it is better
to store the logarithms of the error estimates in the one-dimensional array than the estimates themselves.
The natural way to implement the subdivision strategy is as a loop over selected subintervals containing a loop over the number of subdivisions applied to each subinterval. If, however, we implement it as a single loop over new subintervals, this loop can then be fused with the loop over applications of the quadrature rule (see the sketch below). Thus subdivision can be parallelised with no additional cost in terms of synchronisation and scheduling.
In Algorithm AS, all accesses to shared data structures (the list, the heap, the global error estimate and the global result) must be restricted to one processor at a time. On the KSR-1 this is achieved by using a lock. The use of heap sorting is important, as performance is strongly dependent on the length of the critical sections between the acquiring and releasing of the lock.
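The flattened loop referred to above might look as follows in outline; it corresponds to the single loop over new subintervals to which the tile directive is applied. This is an illustrative Python sketch, with `selected`, the piece counts n_i and `rule` all assumed names.

```python
# Sketch of the fused subdivision loop: a single flat loop over all new
# subintervals replaces the nested loop over selected subintervals and pieces.
# `selected` is assumed to be a list of ((a_i, b_i), n_i) pairs, where n_i is
# the number of pieces the ith selected subinterval is cut into, and `rule`
# applies the quadrature rule to one subinterval, returning (approx, error).
def subdivide_and_integrate(selected, rule):
    starts, total = [], 0
    for _, n in selected:                  # prefix sums map flat index -> parent
        starts.append(total)
        total += n
    results = [None] * total
    for k in range(total):                 # this is the loop that is parallelised
        parent = max(i for i, s in enumerate(starts) if s <= k)
        (a, b), n = selected[parent]
        piece = k - starts[parent]
        lo = a + piece * (b - a) / n
        hi = a + (piece + 1) * (b - a) / n
        results[k] = ((lo, hi), rule(lo, hi))
    return results
```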
5.3 Numerical experiments
To examine the performance of the algorithms described above, we test them on the following set of problems:

Problem 1: $\int_0^1 \cos(50000x)\,dx$, $\epsilon = 10^{-10}$

Problem 2: $\int_0^1 x^{-0.9}\,dx$, $\epsilon = 10^{-10}$

Problem 3: $\int_0^1 x^{-0.8}\cos(10000x)\,dx$, $\epsilon = 10^{-10}$

Problem 4: $\int_{0.0001}^1 \frac{\sin(1/x)}{x}\,dx$, $\epsilon = 10^{-10}$

Problem 5: $\int_0^1 \sum_{k=1}^{8} |x^2 + (k+1)x - 2|^{-0.2}\,dx$, $\epsilon = 10^{-9}$
Problem 1 is a purely oscillatory problem, and Problem 2 has a simple algebraic singularity at the origin. Problem 3 has features of both Problems 1 and 2: oscillatory behaviour and an end point singularity. Problem 4 is also oscillatory, but the amplitude and period of the oscillations vary significantly over the interval. Problem 5 has eight internal singularities. Note that ε is the absolute, rather than relative, tolerance. Although a relative tolerance is more commonly used in practice, it can lead to incorrect early termination of the adaptive quadrature algorithms if the initial estimates of the value of the integral are poor, thus making the design of robust test problems more difficult.
We use the 30/61-point Gauss-Kronrod rule as the underlying quadrature rule. Note that this is not the best rule to use for all the test problems: a rule with fewer points will typically require fewer function evaluations for a singular integrand such as that of Problem 2. The 30/61-point rule, however, favours the fine-grained parallelism exploited by Algorithm LL, and thus provides a more rigorous test of the coarse-grained parallel algorithms. The test problems themselves are also chosen to be very challenging for parallel quadrature algorithms because, although they require a significant number of
function evaluations, the function evaluation is inexpensive, making sources of overhead (sequential sections, synchronisation and communication) important in determining the temporal performance.
The timing of programs on the KSR-1 requires care. A non-trivial operating system housekeeping overhead (which includes the initialisation of page tables) is incurred for the first access to data on any page. To eliminate this housekeeping overhead from the timing results we proceed as follows. We make two consecutive, identical, calls to the parallel quadrature subroutine; the first call includes the housekeeping overhead, and the second call, which we use for timing purposes, does not include any 'first touches' of data. This approach can be justified if it is assumed that most programs using the routine would make several calls to it, and we are therefore more interested in the average case behaviour than in the special case of the first call.
Our first experiment concerns the choice of α in Selection Strategy AL. Figures 5.4 to 5.8 show the temporal performance Rp obtained by applying Algorithm SSAL (Algorithm SS with Strategy AL) to Problems 1 to 5 respectively. We observe that for all five problems, the temporal performance is insensitive to the value of α in the range 10^-5 to 10^-1. For values above 10^-1, and excepting Problem 2, too few subintervals are selected at each stage, resulting in additional calls to the selection procedure, and for values close to 1.0, additional function evaluations when the number of subintervals selected falls below p/2. For values less than 10^-5, and excepting Problem 1, too many subintervals are selected, especially in the later stages, resulting in additional function evaluations. Based on these results, we will use a value of α = 10^-3 for the rest of the experiments in this Chapter.
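The double-call timing convention described above can be expressed as a short sketch; the routine `quad` is a hypothetical placeholder for the parallel quadrature subroutine, which is not shown here.

```python
import time

# Sketch of the timing convention described in the text: call the routine twice
# and time only the second call, so that first-touch page overheads are excluded.
def timed_quadrature(quad, f, a, b, eps):
    quad(f, a, b, eps)                     # warm-up call absorbs the first touches
    t0 = time.perf_counter()
    result = quad(f, a, b, eps)
    elapsed = time.perf_counter() - t0
    return result, elapsed
```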
[Figure 5.4: Problem 1: Temporal performance of Algorithm SSAL on KSR-1. Performance (solutions/second) plotted against α for 1, 2, 4, 8, 16 and 24 processors.]

[Figure 5.5: Problem 2: Temporal performance of Algorithm SSAL on KSR-1. Performance (solutions/second) plotted against α for 1, 2, 4, 8, 16 and 24 processors.]
[Figure 5.6: Problem 3: Temporal performance of Algorithm SSAL on KSR-1. Performance (solutions/second) plotted against α for 1, 2, 4, 8, 16 and 24 processors.]

[Figure 5.7: Problem 4: Temporal performance of Algorithm SSAL on KSR-1. Performance (solutions/second) plotted against α for 1, 2, 4, 8, 16 and 24 processors.]
[Figure 5.8: Problem 5: Temporal performance of Algorithm SSAL on KSR-1. Performance (solutions/second) plotted against α for 1, 2, 4, 8, 16 and 24 processors.]

To each problem we apply the following algorithms: CP with a fully ranked list and with a heap, denoted CPL and CPH respectively; LL; AS; and SS with selection strategies GZ, GL, AL, ES, AB and LE (denoted SSGZ, SSGL, etc.).

Problem     N      I       S
1           2047   124867  1024
2           723    44103   362
3           853    52033   427
4           397    24217   199
5           565    34465   283

Table 5.1: Performance of Algorithm CP

Table 5.1 shows the number of quadrature rule applications N, the total number of function evaluations I and the number of subinterval selection stages S for Algorithm CP on the five test problems. These are independent of whether a fully ranked list or a heap is used. Note that since we start with a single interval and each selection stage results in two new subintervals (counting the initial application of the rule to the whole interval as stage 1), N = 2S − 1, and since each application of the quadrature rule requires 61 function evaluations, I = 61N.
Problem  Algorithm  T      R      IR
1        CPL        1.626  0.615  76799
1        CPH        1.314  0.761  94992
2        CPL        0.697  1.435  63294
2        CPH        0.711  1.406  62000
3        CPL        1.109  0.902  46925
3        CPH        1.093  0.915  47620
4        CPL        0.297  3.363  81448
4        CPH        0.292  3.426  82964
5        CPL        3.295  0.304  10461
5        CPH        3.291  0.304  10471

Table 5.2: Performance of Algorithms CPH and CPL

Table 5.2 shows the execution time T, the temporal performance R and the number of function evaluations per second IR attained by Algorithms CPL and CPH on the five test problems. Figures 5.9 to 5.18 show temporal performance Rp and the number of integrand function evaluations Ip (as described in Section 2.3) for the parallel algorithms on each of the five test problems. The 'Ideal' value of Ip is simply ICP, the corresponding value for Algorithm CP. The 'Ideal' value for Rp is computed as p times the corresponding value for Algorithm CPH.
5.4 Analysis
Let us begin our analysis of the results of the numerical experiments by comparing the performance of Algorithms CPL and CPH. On Problem 1, CPH is significantly faster than CPL. This problem generates the most subintervals, and hence the relative importance of maintaining the list or heap is greater than for
the other problems. For Problem 2, however, CPL is slightly faster than CPH. A possible explanation for this is that CPL actually needs to do very little work to maintain the list for this particular problem. Each time we subdivide the subinterval containing the singularity, we generate two new subintervals. One of these contains the singularity and will have an error estimate larger than any other subinterval on the list, while the other subinterval, in which the integrand is well behaved, will have a very small error estimate due to its short length. By placing the first of these subintervals at the head of the list and the second at the tail, it is likely that the resulting list is already ordered correctly. On Problems 3 to 5, CPH is faster than CPL, but only by a narrow margin. These problems generate fewer subintervals than Problem 1, so the list of intervals does not grow long enough for the effect of the differing complexities to be significant.
We now turn our attention to the performance of the parallel algorithms. On Problem 1, Algorithm LL gives no improvement over Algorithm CPH. The cost of function evaluation is sufficiently small that any gains from distributing function evaluations over many processors are outweighed by the synchronisation and scheduling costs incurred in parallelising the do all loop. For up to six processors, Algorithm AS gives useful performance increases. Beyond this, performance decreases. Figure 5.10 shows that this is not due to significant increases in the number of function evaluations. The likely cause is therefore contention for access to the critical sections. With up to six processors, any processor trying to acquire the lock will find it available most of the time. As the number of processors increases, the length of time a processor has to wait for the lock also increases, and the number of subintervals that are being processed concurrently remains approximately constant.
The stepped temporal performance profile for all the variants of Algorithm SS is due to the total number of function evaluations required being quite strongly dependent on the number of processors.
[Figure 5.9: Problem 1: Temporal performance of Algorithms SS, AS and LL on KSR-1. Performance (solutions/second) plotted against the number of processors.]

[Figure 5.10: Problem 1: Total number of function evaluations performed by Algorithms SS, AS and LL on KSR-1, plotted against the number of processors.]
[Figure 5.11: Problem 2: Temporal performance of Algorithms SS, AS and LL on KSR-1. Performance (solutions/second) plotted against the number of processors.]

[Figure 5.12: Problem 2: Total number of function evaluations performed by Algorithms SS, AS and LL on KSR-1, plotted against the number of processors.]
[Figure 5.13: Problem 3: Temporal performance of Algorithms SS, AS and LL on KSR-1. Performance (solutions/second) plotted against the number of processors.]

[Figure 5.14: Problem 3: Total number of function evaluations performed by Algorithms SS, AS and LL on KSR-1, plotted against the number of processors.]
[Figure 5.15: Problem 4: Temporal performance of Algorithms SS, AS and LL on KSR-1. Performance (solutions/second) plotted against the number of processors.]

[Figure 5.16: Problem 4: Total number of function evaluations performed by Algorithms SS, AS and LL on KSR-1, plotted against the number of processors.]
[Figure 5.17: Problem 5: Temporal performance of Algorithms SS, AS and LL on KSR-1. Performance (solutions/second) plotted against the number of processors.]

[Figure 5.18: Problem 5: Total number of function evaluations performed by Algorithms SS, AS and LL on KSR-1, plotted against the number of processors.]
This is a consequence of the uniform nature of the oscillations in the integrand. To satisfy the error criterion, all subintervals must be less than a certain length. The total number of subdivisions required is therefore sensitive to the initial subdivision into p subintervals.
Of the selection strategies, GZ performs least well, a result of both the cost of maintaining the heap and of synchronisation (the number of synchronisation points is one to two orders of magnitude greater than for the other strategies). For the other strategies both the number of function evaluations and the number of synchronisation points vary only slightly. GL avoids the additional synchronisation points, but the cost of heap maintenance remains. Of the other four, it is hard to distinguish between AL, ES and LE. AB is slightly slower, a result of a more expensive selection strategy.
For Problem 2, no algorithm achieves more than a five-fold increase in temporal performance over CPH. The integrand has a strong singularity at the origin but is otherwise smooth, so at each stage of any of the algorithms there is only one subinterval (containing the singularity) which can profitably be subdivided. Algorithm AS is most disadvantaged by this feature, as it has no mechanism for using more than one processor to process the important subinterval. Algorithm LL is unaffected by this issue, but again its performance is dominated by synchronisation overheads. Algorithm SS is able to subdivide the most important subinterval into p pieces, and hence obtain some limited parallelism. If we assume that to resolve the singularity only one of the subintervals at any stage requires subdivision, the self-speedup (T1/Tp) of Algorithm SS is bounded above by 2 log₂ p.
Proof: Suppose that the initial interval is [a, b] and that the interval containing the singularity needs to be shorter than δ(b − a) in order to satisfy the global error
tolerance. With a single processor this requires d stages of the algorithm, where

$2^{-d} = \delta$,   (5.3)
$d = -\log_2 \delta$,   (5.4)

and, assuming that each subinterval requires u time units to process, the overall time of the algorithm is $-2u \log_2 \delta$ time units. With p processors, where at each stage the selected subinterval is divided into p subintervals, the algorithm has d stages, where

$p^{-d} = \delta$,   (5.5)
$d = -\log_2 \delta / \log_2 p$,   (5.6)

and the overall time of the algorithm is $-u \log_2 \delta / \log_2 p$. Hence the self-speedup is bounded above by $(-2u \log_2 \delta)/(-u \log_2 \delta / \log_2 p) = 2 \log_2 p$.
Algorithm SSLE is unable to solve this problem in a reasonable time: it generates many small intervals with error estimates that are also small, but not small enough to prevent them being selected. The list therefore grows very rapidly, while progress in reducing the global error estimate is slow. The algorithm was terminated after 5 × 10⁶ function evaluations, and in no case was the error criterion satisfied.
Of the other selection strategies, GZ again is the poorest. For this problem, however, GZ selects too many intervals rather than too few, and hence wastes much time in unprofitable subdivisions. All the other selection strategies correctly select only the interval containing the singularity. The differences in performance are therefore due to the time taken to make the selection. GL is fastest since just a single pop of the heap is required, followed by ES (one pass through the list) and AL (two passes). AB is significantly slower (four passes plus the time to compute logarithms of each error estimate).
Problem 3 has features of both Problems 1 and 2. Overall performance is better than for Problem 2, and the stepped profiles of Problem 1 are evident, but the presence of the singularity determines the relative performance of the different algorithms. Algorithm AS shows some modest performance increase for small numbers of processors, but once again this tails off rapidly. Algorithm LL obtains a twofold improvement in execution time, while SSLE is again unable to solve the problem using fewer than 5 × 10⁶ function evaluations. The other SS variants' relative behaviour is very similar to that for Problem 2, as it is largely determined by the relative costs of selecting a single interval. Algorithm SS goes through two distinct phases for this problem. In the first phase almost all current subintervals are selected: this accounts for about half the function evaluations, but only a few selection stages. In the second phase only the subinterval containing the singularity is selected. This phase contains most of the selection stages.
On Problem 4, Algorithm LL once again gives little performance increase over the sequential version. Algorithm AS, characteristically, does well for small numbers of processors, but performance subsequently declines. For this problem, the number of subintervals worth subdividing at any stage remains small, and can be fewer than the number of processors. SSGZ therefore processes too many subintervals. The other SS variants, including SSLE, all give rather similar performance, with their relative performance depending on the number of processors, but without any clear pattern. This is partly due to the number of subintervals processed depending on the number of processors, but also partly due to the very short execution time being subject to small random effects, and resulting in some noise in the measured execution time. SSAB and SSGL tend to perform fewer function evaluations than the others, but the selection stages are more costly. These effects approximately cancel each other out for this particular integrand.
The integrand of Problem 5 is significantly more expensive to evaluate than
the other integrands. Thus Algorithm LL is not so badly affected by synchronisation overheads, and attains approximately a six-fold decrease in execution time. The integrand has eight singularities, and only subintervals containing them can be profitably subdivided. As a result Algorithm AS performs well for up to eight processors. Beyond this, additional processors are not able to find additional subintervals to profitably subdivide. Algorithm SSGZ, selecting p/2 subintervals, performs well for up to 16 processors. SSLE can only solve this problem by processing large numbers of subintervals. Only selected points are shown in Figure 5.17: for other values of p more than 5 × 10⁶ function evaluations are required and the algorithm was terminated once this limit was reached without satisfying the error criterion. Between the other selection strategies there is little to choose. SSAL performs best, making near optimal selection decisions at least cost. SSGL and SSAB also make near optimal decisions, but at higher cost. SSES, although having cheap selection, tends to select more subintervals at each stage and, since some of these are not worth subdividing, requires more function evaluations than the others.
5.5 Discussion
Overall, we see that Algorithm LL performs poorly on our test problems. However, the integrands we have chosen are cheap to evaluate, and with more expensive integrands it would perform better. On the other hand, the 61-point quadrature rule pair represents the most accurate rule pair in common usage. A lower degree rule pair would provide reduced parallelism for LL, and would result in poorer performance. Algorithm AS also performs poorly. In some cases (Problem 1 for example) this can be attributed to lock contention, but where singularities are present, AS lacks the ability to allow more than one processor to work on an important subinterval. Thus processors tend to do useless work.
Of the selection strategies for Algorithm SS, LE is the least successful, as might be predicted from the observation that it does not select all subintervals with error estimates greater than some value C. In particular, it performs very badly on problems with singularities. Strategy GZ also gives rather poor performance, either through selecting too many subintervals when there are only a few important ones, and hence performing unnecessary function evaluations, or else through selecting too few subintervals, with the result that the selection procedure is called too often. The remaining strategies all perform well. There is some evidence to suggest that the more expensive strategies (AB and GL) make slightly better selection decisions than the cheaper ones (AL and ES). For one-dimensional problems, Strategy AB is not significantly cheaper than GL, especially if the number of subintervals selected is small. However, we will see in Chapter 8 that this is no longer the case for multi-dimensional problems.
For problems with fewer than p/2 singularities, the temporal performance of Algorithm SS has a log(p) dependence, so it cannot make efficient use of large numbers of processors. If the function evaluation is sufficiently expensive, then Algorithm LL may be worth considering. If it is known that singularities are present, then as we saw in Section 3.1.6, extrapolation methods can usefully accelerate adaptive quadrature. Furthermore, it is usually more efficient to use a lower degree rule pair in such cases. We will address the issue of combining parallel algorithms and extrapolation methods in Chapter 7. If it is not known that a singularity or sharp peak is present, then one possible strategy to adopt in the event that fewer than p/2 subintervals are selected is to combine Algorithms SS and LL. We can achieve this by replacing p in the subdivision strategy of Figure 5.2 by some p′ < p, and dividing the function evaluations associated with each new subinterval amongst ⌊p/p′⌋ processors. However, determining the best value for p′ is not easy, and it is difficult to exploit numbers of
processors without suitable factors. Furthermore, implementation could be awkward and/or inefficient in either shared variable or message passing paradigms. Support for dynamically reconfigurable nested parallelism in shared variable programming environments is uncommon, and the necessary message passing code would be quite complicated.
Chapter 6
Parallel Multiple List Algorithms for One-Dimensional Quadrature

In this Chapter we develop algorithms for one-dimensional quadrature based on the multiple list approach described in Section 3.3. In these algorithms each processor maintains a separate list of subintervals, with any given subinterval appearing in the list of exactly one processor.
6.1 Algorithms
The principal existing work in this area, due to Burrage, Berntsen and Espelid, De Doncker and Kapenga, and Lapegna and D'Alessio, was described in Section 3.3.2. Like the single list algorithms, multiple list algorithms can also all be considered as members of a family of algorithms with a common structure. The general member of this family is described in pseudo-code in Figure 6.1. To define a specific member of the family, we must specify the termination criterion, and a strategy for redistributing subintervals which attempts to balance the load across processors. We have already discussed in earlier Chapters the disadvantages of non-deterministic algorithms.
Algorithm ML:
    p-sect interval [a, b]
    in parallel, each processor i = 1, ..., p executes:
        initialise local list with subinterval i
        do while not (TerminationCriterion)
            bisect the local subinterval with largest error estimate
            apply quadrature rule to two new subintervals
            remove 'old' subinterval from local list
            add 'new' subintervals to local list
            update local integral approximation and error estimate
            Redistribute subintervals
        end do

Figure 6.1: Pseudo-code for Algorithm ML

Many choices of termination criterion and redistribution strategy will lead to a non-deterministic algorithm. Even though such algorithms have potential performance advantages, we will not consider them further in this Chapter.
To be strictly faithful to the definition of globally adaptive quadrature, the termination criterion $\sum_{j=1}^{p} E_j < \epsilon$, where $E_j$, j = 1, 2, ..., p, is the total error estimate for all the subintervals in the jth processor's list, should be used. Forming this sum requires global synchronisation and communication at each stage of the algorithm, where a stage is an iteration of the do while loop in Figure 6.1. We will refer to this criterion as the true global termination criterion.
Another possible termination criterion is that each processor stop as soon as E_j < ε/p. We refer to this as the processor local criterion. In the absence of load balancing (for example in Burrage's algorithm) this criterion requires no communication between processors. However, once load balancing is introduced, and it is possible to transfer subintervals between processors, this criterion can
fail, because a processor which satisfies the criterion at some stage may subsequently be sent subintervals by another processor, so that its total local error would then exceed ε/p again. No processor can safely terminate until all processors satisfy E_j < ε/p. In [31] Lapegna and D'Alessio suggest a variation of this criterion, namely $E_j < \epsilon v_j / \sum_{k=1}^{p} v_k$, where v_j is the total d-dimensional volume of the subintervals in the local list of processor j. Global communication/synchronisation is required to implement these modified criteria, which are thus just as expensive to implement as the true global criterion, but are less strict. Thus there is no reason to prefer these criteria over the true global criterion.
One way of reducing the amount of communication/synchronisation associated with the termination criterion without introducing non-determinism is to check the criterion every ŝ stages of the algorithm rather than at every stage. This approach is suggested in [13]. The penalty for doing this is that additional subdivisions might be performed before termination is detected. To restrict the additional subdivisions, we let ŝ = max(1, [γs]), where s is the total number of stages executed up to and including the one where the criterion was last checked, γ is a positive constant, and [x] denotes the nearest integer to x. For example, if γ = 0.1, the criterion will be checked at each of the first 20 stages, then every other stage for the next 10, then every third stage for the subsequent 10, and so on. If checking the criterion at every stage resulted in N subintervals being processed in total, then this choice of ŝ means that no more than (1 + γ)N subintervals will be processed using reduced checking. The total number of synchronisation points Y is approximately given by
$Y = \frac{1}{\gamma} + \frac{\log(\gamma S)}{\log(1 + \gamma)}$,
where S = (N + 1)/2p is the total number of stages required, and thus the cost of load balancing and error checking is reduced from O(N/p) to O(log(N/p)).
Proof: For s ≤ 1/γ, max(1, [γs]) = 1, so synchronisation occurs on each
of the first 1/γ stages. For s > 1/γ, s increases by a factor of (1 + γ) between consecutive synchronisations. Thus if Y′ synchronisations are required to increase the number of stages from 1/γ to the final value of S, then we have the relation
$S = \frac{1}{\gamma}(1 + \gamma)^{Y'}$,
from which we deduce that
$Y' = \frac{\log(\gamma S)}{\log(1 + \gamma)}$,
and hence that
$Y = \frac{1}{\gamma} + \frac{\log(\gamma S)}{\log(1 + \gamma)}$.
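The checking schedule and the estimate for Y can be compared with a few lines of Python; γ = 0.1 and S = 500 below are illustrative values only, and Python's round() stands in for the nearest-integer operation [x].

```python
import math

# Sketch: enumerate the stages at which the termination criterion is checked
# under the reduced-frequency rule s_hat = max(1, [gamma*s]), and compare the
# number of checks with the estimate Y = 1/gamma + log(gamma*S)/log(1+gamma).
def check_stages(S, gamma):
    checks, s = [], 1
    while s <= S:
        checks.append(s)
        s += max(1, round(gamma * s))      # round() plays the role of [x]
    return checks

gamma, S = 0.1, 500                        # assumed values for illustration
checks = check_stages(S, gamma)
estimate = 1.0 / gamma + math.log(gamma * S) / math.log(1.0 + gamma)
print(len(checks), "checks; estimate is about", round(estimate))
```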
If we consider Burrage's algorithm, it is clear that it is liable to suffer from load imbalance: processors which initially have easier subintervals will satisfy the processor local termination criterion before others. However, if we apply the true global termination criterion, then we no longer have load imbalance in the normal sense, since all processors continue to subdivide subintervals until the global criterion is satisfied. In general, the larger the error estimate of a subinterval, the larger the reduction in error estimate resulting from its subdivision. Hence the global criterion will tend to be satisfied sooner if processors with small error estimates 'help out' processors with larger error estimates, rather than simply continue to process their own list of subintervals.
Since it is not possible to predict the effect on the total error estimate of the subdivision of any given subinterval, it is not clear what quantity should be balanced between processors. Lapegna and D'Alessio suggest using the largest error estimate in each processor's list as the load estimator. This choice has the merit that, given that load balancing is successful, the set of subintervals chosen at any stage of the algorithm will be a good approximation to the set with the p largest error estimates. This results in a distributed selection process that has similar properties
to Genz's algorithm (SSGZ), which selects the subintervals with the p/2 largest error estimates. However, since subintervals are only transferred in one direction at a time, once a processor has transferred its largest error estimate subinterval to another processor, this criterion would consider the load imbalance between these two processors to have been removed. Thus at any stage of the algorithm, it allows only one subinterval to be transferred between any pair of processors.
In [13], de Doncker and Kapenga suggest using the number of subintervals with error estimates greater than ε/p as the load estimator. In [14] this is modified to the number of subintervals in processor j with error estimates greater than max(ε/p, cE_j), where E_j is the current local error estimate in processor j and 0 < c < 1. The value of c used for the experiments in [14] is unfortunately omitted. This estimate requires all subintervals to be compared with max(ε/p, cE_j) every time a load estimate is computed. Also given in [14] is the outline of a load prediction scheme which attempts to estimate the load in a processor at some time in the future.
One obvious choice of load estimator not previously suggested is simply the current local error estimate E_j itself. A general observation suggests that the error E resulting from the application of adaptive quadrature to many types of integrand behaves as $E \sim N^{-k}$ for some value of k depending on the integrand, where N is the number of function evaluations (see, for example, Chapter 11 of [45]). This suggests that the current error estimate might form a reasonable estimate for the number of function evaluations that will be required to reduce the error estimate to a given level.
For all these load estimators, there are clearly circumstances under which they could fail to give a good estimate of the relative load on different processors. For example, the number of subintervals with error estimate greater than ε/p might be equal on a pair of processors, but the actual values of the error estimates,
and hence the number of function evaluations required to achieve a given error tolerance, might be very different. Alternatively, the total error estimate might be equal on a pair of processors, one of which has a single subinterval with a large error estimate while the other has many subintervals with small error estimates. Again, the numbers of function evaluations required to achieve a given error tolerance could be very different. However, if the subproblems (each consisting of a set of subintervals) contained on different processors have broadly similar characteristics in terms of the number of subintervals, the error estimate distribution and integrand behaviour, then we might hope that the load estimators will be a fair reflection of the true loads. Furthermore, if the load balancing mechanisms act in such a way as to produce similar subproblems on different processors, then the algorithm as a whole should be successful.
As an illustration, consider a load balancing scheme which balanced the local error estimates E_j in such a way that some processors have a small number of intervals each with large error estimates, while others have large numbers of subintervals each with small error estimates. These two sets of processors would be less likely to exhibit similar subsequent error estimate behaviour than if all processors had similar numbers of subintervals with similar distributions of error estimates. This observation leads us to conjecture that a successful load balancing mechanism need not explicitly balance any particular quantity. Any redistribution mechanism that acts to remove systematic differences in error estimate behaviour between processors has a reasonable chance of success.
We can also consider reducing the frequency of load balancing in the same way as reducing the frequency of checking the termination criterion. A natural choice is to load balance only on stages where the termination criterion is checked.
In choosing a load balancing algorithm, several factors must be taken into consideration. Although Algorithm ML has similarities to the general problem of
balancing workloads that are generated dynamically, there are some important differences. In Algorithm ML, termination is not determined by the criterion that there are no outstanding tasks. Furthermore, the result returned by ML will depend on the load balancing scheme used, and if the load balancing scheme is non-deterministic, then Algorithm ML will be also. Another feature specific to ML is that a heap structure is used to store the subintervals. Subintervals to be transferred from a processor must be popped from the heap, which means that only the subintervals with largest error estimates are available for transfer.
The load balancing schemes used in the algorithms are based on that described in [31]. The processors are arranged in a virtual two-dimensional toroidal grid, so that each processor has four neighbours, labelled North, East, South and West. Note that for small numbers of processors, the number of distinct neighbours will be less than four. We describe four schemes:
A  The original scheme described in [31].
1. Choose a neighbour to compare loads with: if the stage number is odd, choose North; if even, choose East.
2. If the largest error estimate is more than c times the neighbour's largest error estimate, transfer the subinterval with largest error estimate to the neighbour.
Note that if the second largest error estimate is smaller than the neighbour's largest error estimate, then the load imbalance (estimated by the difference in largest error estimates) between the two processors will actually increase. Hence this scheme does not guarantee that the load imbalance between a pair of processors will either remain the same or be reduced.
B  This scheme balances the total error estimates, and does ensure that the load balance between a pair of processors is improved.
1. Choose a neighbour to compare loads with: if the stage number is odd, choose North; if even, choose East.
2. Transfer the r subintervals with largest error estimates to the neighbour such that the difference in total error estimate is minimised (note that r may be zero).
C  An extension of scheme B which iterates to convergence.
1. Let D_1, D_2, ..., D_ν be a set of directions defining the ν distinct neighbours of each processor (e.g. for a 2 × 2 grid, ν = 2 and D_1 = North, D_2 = East is a valid set).
2. Let j = 1.
3. Transfer the r subintervals with largest error estimates to the neighbour in direction D_j such that the difference in total error estimate is minimised.
4. If no transfers between any processors have occurred in the last ν steps, then stop; otherwise let j = (j + 1) mod ν and go to step 3.
D  A simple scheme which does not attempt to load balance any quantity, but acts as a redistribution mechanism.
1. Choose a neighbour to compare loads with: if the stage number is odd, choose North; if even, choose East.
2. Transfer the subinterval with largest error estimate to the neighbour.
Of these schemes, only scheme C is iterative in nature, but it can be shown that it will always terminate.
Proof: Let the loads on the processors be $l_1, l_2, \ldots, l_p$, and note that the total load $\sum_{i=1}^{p} l_i$ is constant. Let W be the sum of the squares of the loads, that is
$W = \sum_{i=1}^{p} l_i^2$.
Whenever a transfer of subintervals takes place, we replace a pair of load values, say x, y, where since ordering is unimportant we may assume that x < y, with a new pair of values, $x_{new}$, $y_{new}$, $x_{new} < y_{new}$, with the following properties:
$x + y = x_{new} + y_{new}$, $x_{new} > x$ and $y_{new} < y$.
The resulting change in W is given by
$\Delta W = (x_{new}^2 + y_{new}^2) - (x^2 + y^2) = \left( (x_{new} + y_{new})^2 - 2 x_{new} y_{new} \right) - \left( (x + y)^2 - 2xy \right) = 2(xy - x_{new} y_{new})$.
Now if we let
$m = \frac{x + y}{2} = \frac{x_{new} + y_{new}}{2}$,
we can show that
$xy = m^2 - (x - m)^2$ and $x_{new} y_{new} = m^2 - (x_{new} - m)^2$.
Hence
$\Delta W = 2\left( (x_{new} - m)^2 - (x - m)^2 \right)$,
and since $m > x_{new} > x$, $\Delta W < 0$.
Note that the $l_i$ are the sums of a finite collection of loads (one sum per processor), so if a transfer occurs, the transferred load can be bounded below by some η > 0. Now W ≥ 0, and any transfer lowers the value of W by at least $2\eta^2$.
Hence W must reach its minimum value after a finite number of transfers, thus terminating the scheme.
We denote the various possible combinations of termination and load balancing by the termination criterion (local (L) or global (G)), the load balancing scheme (A, B, C or D) and whether reduced frequency is used (R). The algorithms which we have implemented are as follows:

Algorithm MLL  Burrage's algorithm of [9].
Termination criterion: single processor local. Load balancing scheme: none.
Algorithm MLGA Lapegna and D'Alessio's algorithm of [31], with a global error criterion.
Termination criterion: true global. Load balancing scheme: A, with c = 1.5.
Algorithm MLGB This algorithm attempts to balance the local error estimates. Termination criterion: true global. Load balancing scheme: B.
Algorithm MLGBR As for MLGB, but with reduced frequency of both load balancing and termination checking.
Termination criterion: true global at reduced frequency, γ = 0.05. Load balancing scheme: B at reduced frequency, γ = 0.05.
Algorithm MLGC As for MLGB, but with load balancing iterated to convergence at every stage.
Termination criterion: true global.
Load balancing scheme: C.
Algorithm MLGD  A simple redistribution scheme. Termination criterion: true global. Load balancing scheme: D.
Note that there are many other possible multiple list algorithms. The above set, however, proves sufficient to gain substantial insight into their behaviour.
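As an illustration of the pairwise transfer decision used by scheme B (and hence by Algorithms MLGB and MLGBR), the following Python sketch chooses how many of the largest local error estimates to hand to a neighbour; the argument lists are an assumed representation of the two local heaps.

```python
# Sketch of scheme B between one pair of processors: choose r, the number of
# largest-error subintervals to transfer, so that the difference in total
# error estimate between the two processors is minimised (r may be zero).
def scheme_b_transfer_count(my_errors, neighbour_errors):
    largest_first = sorted(my_errors, reverse=True)
    my_total, their_total = sum(my_errors), sum(neighbour_errors)
    best_r, best_diff = 0, abs(my_total - their_total)
    moved = 0.0
    for r, e in enumerate(largest_first, start=1):
        moved += e
        diff = abs((my_total - moved) - (their_total + moved))
        if diff < best_diff:
            best_r, best_diff = r, diff
    return best_r
```

If the caller's total error is already smaller than the neighbour's, every transfer widens the gap and the function returns zero, which matches the remark above that r may be zero.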
6.2 Implementation issues
In our implementation of Algorithm ML, each processor executes Algorithm CPH, for which the implementation details were discussed in Section 5.2. Algorithm ML is implemented on the KSR-1 using a parallel region directive. Scalars and small arrays that need to be replicated on each processor but do not need to be accessed by other processors are declared as private variables. For arrays that do need to be accessed by other processors, we add an extra (outer) dimension and index them by processor number. For scalars that need to be accessed by other processors, we convert them to two-dimensional arrays. The outer dimension is indexed by processor number, while the inner dimension, of size 16, acts as padding to prevent false sharing. Only the first element in the inner dimension is used, and therefore all elements accessed are guaranteed to lie on distinct subpages.
The synchronisation required for computing the termination criterion and for the load balancing is implemented using barriers. To send a set of subintervals from processor A to processor B, processor A pops them from its heap and stores the indices. After barrier synchronisation, processor B accesses the indices and copies the subintervals from processor A's list, adding them to the end of its own list. It then pushes them onto its heap. Note that the list entries in A become redundant and are not reclaimed.
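The pop/copy/push sequence described above can be sketched with ordinary Python containers standing in for the shared arrays, per-processor heaps and barriers; heapq provides a min-heap, so error estimates are negated to obtain a max-heap. All names are assumptions made for the sketch.

```python
import heapq

# Sketch of the transfer step; each "processor" is a dict holding its
# subinterval list and a max-heap of (negated error, index) pairs.
def new_processor():
    return {"list": [], "heap": []}

def push(proc, interval):                       # interval = (a, b, approx, err)
    proc["list"].append(interval)
    heapq.heappush(proc["heap"], (-interval[3], len(proc["list"]) - 1))

def pop_largest(proc):
    _, idx = heapq.heappop(proc["heap"])
    return proc["list"][idx]                    # the list entry becomes redundant

def transfer(sender, receiver, r):
    staged = [pop_largest(sender) for _ in range(r)]   # sender pops and stages
    # ... barrier synchronisation between the two steps would go here ...
    for interval in staged:                            # receiver copies and re-pushes
        push(receiver, interval)
```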
6.3 Numerical experiments
The set of problems we use for our experiments in this Chapter is the same as in Chapter 5 with one exception. Problem 2 in Chapter 5 contains a single singularity. Algorithm ML has no mechanism for allowing more than one processor to work on the only important subinterval, and thus none of the ML variants can obtain any performance improvement over Algorithm CPH. We replace Problem 2 with a mildly load-imbalanced oscillatory problem, which gives useful insights into the relative efficiencies of the ML algorithms. The set of test problems is therefore:

Problem 1: $\int_0^1 \cos(50000x)\,dx$, $\epsilon = 10^{-10}$

Problem 2: $\int_0^1 \cos(50000x^3)\,dx$, $\epsilon = 10^{-10}$

Problem 3: $\int_0^1 x^{-0.8}\cos(10000x)\,dx$, $\epsilon = 10^{-10}$

Problem 4: $\int_{0.0001}^1 \frac{\sin(1/x)}{x}\,dx$, $\epsilon = 10^{-10}$

Problem 5: $\int_0^1 \sum_{k=1}^{8} |x^2 + (k+1)x - 2|^{-0.2}\,dx$, $\epsilon = 10^{-9}$
As in Chapter 5, ε is an absolute tolerance, and we use the 30/61-point Gauss-Kronrod pair as the underlying quadrature rule. To each problem we apply the six variants of ML listed at the end of Section 6.1, and for comparison with single list algorithms, we also apply Algorithms SSAL and SSGZ. Figures 6.2 to 6.11 show temporal performance Rp and the number of integrand function evaluations Ip (as described in Section 2.3) for the parallel algorithms on each of the five test problems. The 'Ideal' value of Ip is simply the corresponding value for Algorithm CP. The 'Ideal' value for Rp is computed as p times the corresponding value for Algorithm CPH.
The use of a two-dimensional grid topology for the load balancing schemes means that it is difficult to exploit numbers of processors that do not have suitable factors. Thus results are presented for 1, 2, 4, 6, 9, 12, 16, 20, 25 and 30 processors, since these numbers can form square or nearly square (of the form n × (n − 1)) two-dimensional arrays.
6.4 Analysis
For Problem 1, the difficulties are equally distributed throughout the interval, so no load balancing is necessary. Thus Algorithm MLL performs very well (better even than SSAL) as it incurs no synchronisation or communication overheads. There is little to choose between any of the algorithms in terms of the number of intervals processed. MLGA, MLGB, MLGBR and MLGD all have similar performance; better than SSGZ but not as good as SSAL. MLGC is worse than SSGZ, a result of the extra work required to iterate load balancing to convergence.
Problem 2 is mildly load imbalanced: MLL gives performance similar to SSGZ in this case. As for Problem 1, MLGA, MLGB, MLGBR and MLGD all give similar temporal performance, but there are larger differences in the number of intervals processed. These four algorithms all show a tendency for Ip to increase with p. Of the four, MLGB has the most effective, but also the slowest, load balancing.
[Figure 6.2: Problem 1: Temporal performance of Algorithms ML and SS on KSR-1. Performance (solutions/second) plotted against the number of processors.]

[Figure 6.3: Problem 1: Total number of function evaluations performed by Algorithms ML and SS on KSR-1, plotted against the number of processors.]
[Figure 6.4: Problem 2: Temporal performance of Algorithms ML and SS on KSR-1. Performance (solutions/second) plotted against the number of processors.]

[Figure 6.5: Problem 2: Total number of function evaluations performed by Algorithms ML and SS on KSR-1, plotted against the number of processors.]
[Figure 6.6: Problem 3: Temporal performance of Algorithms ML and SS on KSR-1. Performance (solutions/second) plotted against the number of processors.]

[Figure 6.7: Problem 3: Total number of function evaluations performed by Algorithms ML and SS on KSR-1, plotted against the number of processors.]
[Figure 6.8: Problem 4: Temporal performance of Algorithms ML and SS on KSR-1. Performance (solutions/second) plotted against the number of processors.]

[Figure 6.9: Problem 4: Total number of function evaluations performed by Algorithms ML and SS on KSR-1, plotted against the number of processors.]
[Figure 6.10: Problem 5: Temporal performance of Algorithms ML and SS on KSR-1. Performance (solutions/second) plotted against the number of processors.]

[Figure 6.11: Problem 5: Total number of function evaluations performed by Algorithms ML and SS on KSR-1, plotted against the number of processors.]
MLGBR processes more subintervals, but the reduced frequency of load balancing makes it faster overall. MLGA generally processes more subintervals than MLGD, but this is compensated by faster load balancing. MLGC again has poor overall performance, but the number of subintervals processed does not grow with p, and is comparable with the SS variants.
The singularity in the integrand of Problem 3 means that all the ML variants perform poorly, mostly worse even than SSGZ. There is no way for Algorithm ML to subdivide the subinterval containing the singularity so that the work required is performed on more than one processor. Furthermore, the processor currently owning the singularity always appears overloaded, thus inducing transfer of the subinterval, to no benefit.
Problem 4 does not contain a singularity, but is much more load imbalanced than Problem 2, so MLL achieves no useful performance increase at all. None of the ML variants balance the load successfully: Ip increases rapidly with p for all of them. Once again, MLGA, MLGB, MLGBR and MLGD all give similar temporal performance, with those algorithms processing additional subintervals compensating by faster load balancing. MLGC has the most effective load balancing, but this is outweighed by the time taken to execute it.
On Problem 5, the ML variants, except MLL, are competitive with SS for up to eight processors (the number of singularities). Thereafter, they are unable to extract any more parallelism, and performance decreases with additional processors. Algorithm MLL is poor for small p, but performance gradually improves until, at p = 30, each singularity is contained in a different initial subinterval. It then outperforms the other ML variants, as it requires no synchronisation.
6.5 Discussion
The primary observation we can make about multiple list algorithms is that they do not perform as well as the single list algorithms for one-dimensional problems. There are two main reasons for this. Most importantly, many one-dimensional problems are difficult because the 'hard' part of the integrand is confined to a small portion of the interval of integration, thus restricting the amount of parallelism available. Unlike Algorithm SS, Algorithm ML is unable to allow more than one processor to work on the subdivision of one subinterval. Thus, where opportunities for useful parallelism are restricted to the subdivision of a small number of subintervals, Algorithm ML is uncompetitive. Secondly, it is apparent that, even in cases where load imbalance is relatively mild such as Problem 2, considerable computational effort must be expended to prevent the total number of function evaluations growing with the number of processors. These features outweigh any of the benefits of multiple list algorithms, such as reduced communication costs and parallel subinterval selection.
Of the variants of Algorithm ML, Burrage's algorithm (MLL) is highly effective in cases where no load balancing difficulties arise, but gives very poor performance where load imbalance is significant. Of the schemes which do transfer subintervals between processors, there is little to choose between Lapegna and D'Alessio's algorithm (MLGA) and the simple scheme MLGD. This supports our conjecture that any scheme which transfers sufficient intervals between processors to remove systematic differences between the p lists of subintervals will give reasonable results. Using the total local error as the load estimator (MLGB) gives slightly better performance than MLGA in terms of the number of subintervals processed, at the price of transferring more subintervals between processors in order to balance the load. For the one-dimensional problem, reducing the frequency of load balancing and termination detection has little impact on overall
performance. Any gains from reduced synchronisation are offset by the cost of additional function evaluations. In problems without singularities, the number of subintervals processed can be reduced by iterating the load balancing scheme to convergence (MLGC). For the very cheap integrand evaluation of our experiments, this is not worthwhile, but for problems with expensive function evaluation and mild load imbalance, MLGC would be competitive with the best single list algorithms.
In a shared variable programming environment, there is no reason to prefer multiple list algorithms over single list algorithms for the one-dimensional problem: the latter clearly give the better performance. However, we will show in Chapter 9 that this is not the case for the multi-dimensional problem. The main point of interest of multiple list algorithms in one dimension is that they are easier to implement efficiently than single list algorithms in a message passing programming paradigm, and it is in this context that they might prove useful.
Chapter 7

Parallel Algorithms for Singular Integrands in One Dimension

In this Chapter we consider the problem of using parallel algorithms in combination with extrapolation methods to obtain higher performance for integrands with singularities.
7.1 Algorithms

The use of Wynn's extrapolation algorithm to accelerate the convergence of adaptive quadrature when applied to singular integrands was described in Section 3.1.6. The problem we wish to address here is how to incorporate extrapolation into the parallel algorithms already described. The principal ideas of this Chapter were introduced in [7].

First, let us note that in implementations of Wynn's algorithm, such as QAGS from QUADPACK which is based on Algorithm CPL, extrapolation is only performed once the presence of a singularity has been detected. There are a number of possible criteria that may be used to diagnose the presence of a singularity, but the most commonly used is the following:
A singularity is present if the subinterval with the largest error estimate is also the subinterval of shortest length.

We refer to the algorithm in QAGS as Algorithm CPLX (CPL with singularity detection and extrapolation). Extrapolation can be used in conjunction with Algorithm LL with no difficulty, but as was noted in Chapter 5, performance improvements are limited, especially if function evaluation is cheap. Extrapolation could also be added to Algorithms AS and ML, but since neither algorithm can readily allow more than one processor to work on the interval containing the singularity, there would be no performance gain over Algorithm CPLX.

If we add extrapolation to Algorithm SS in a straightforward manner, testing the criterion for the existence of a singularity and performing the extrapolation at every synchronisation point, then we find that convergence is not accelerated nearly as successfully as in Algorithm CPLX. The reason for this is that the interval containing the singularity is often the only subinterval selected for subdivision, and hence is divided into p new subintervals. Thus, the number of error estimates available to the extrapolation routine, during a given reduction in the size of the subinterval containing the singularity, is smaller than in the sequential algorithm, and convergence is delayed. One way to overcome this difficulty is to switch to Algorithm LL whenever a singularity is detected. We will refer to this algorithm as SSssXF (extrapolation with fine-grain parallelism), where ss is the selection strategy used. Like QAGS, Algorithm SSssXF is able to deal with a singularity anywhere in the range of integration.

In many cases, the location of any singularities in the integrand is known. The problem can then be easily transformed to a set of problems, each of which has a range of integration of the form [0, b0] (for some value of b0), with one singularity located at the origin. Indeed, this is advantageous in a sequential setting, as the high density of floating point numbers close
to the origin makes resolution of the singularity without encountering precision difficulties more likely. Routine D01ALF in the NAG library provides precisely this functionality.

If we accept the restriction to a single end-point singularity, then coarser grained parallelism is exploitable. Given that a singularity has been detected at the origin, it is possible to predict the course of Algorithm CPLX with reasonable certainty: it is likely that the interval containing the origin will be repeatedly bisected. With p processors, having detected an end-point singularity, we generate all the subintervals required to repeatedly bisect the subinterval [0, h] (which contains the singularity). The number of bisections used is ⌊p/2⌋, which generates p new subintervals if p is even. If p is odd, one processor remains idle. For example, if p = 6 we generate the subintervals [0, h/2], [h/2, h], [0, h/4], [h/4, h/2], [0, h/8] and [h/8, h/4], and apply the quadrature rules to them in parallel. We then generate integral approximations using the following combinations of these subintervals:
{[0, h/2], [h/2, h]},   {[0, h/4], [h/4, h/2], [h/2, h]}   and   {[0, h/8], [h/8, h/4], [h/4, h/2], [h/2, h]}.

These three approximations are then added in turn to the extrapolation procedure. We refer to this algorithm as SSssXC (extrapolation with coarse-grain parallelism).
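To make the construction concrete, the following sketch (in Python rather than the Fortran used for the thesis implementations; the function names are illustrative only) generates the ⌊p/2⌋ levels of bisection of [0, h] and the successive covers of [0, h] whose quadrature approximations are fed, in turn, to the extrapolation procedure.

    def xc_subintervals(h, p):
        """Subintervals generated by SS..XC after an end-point singularity at 0
        is detected: floor(p/2) repeated bisections of [0, h], giving p new
        subintervals when p is even (one processor idle when p is odd)."""
        levels = p // 2
        pieces = []
        right = h
        for _ in range(levels):
            mid = right / 2.0
            pieces.append((0.0, mid))     # left half, still containing the singularity
            pieces.append((mid, right))   # right half, free of the singularity
            right = mid
        return pieces

    def xc_combinations(h, p):
        """The successive covers of [0, h] whose quadrature approximations are
        added, in turn, to the extrapolation procedure."""
        levels = p // 2
        covers = []
        right = h
        tail = []                         # pieces away from the singularity, coarsest last
        for _ in range(levels):
            mid = right / 2.0
            tail.insert(0, (mid, right))
            covers.append([(0.0, mid)] + tail)
            right = mid
        return covers

    if __name__ == "__main__":
        # For p = 6 and h = 1 this reproduces the subintervals and covers listed above.
        print(xc_subintervals(1.0, 6))
        print(xc_combinations(1.0, 6))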
A quite different approach, which does not use extrapolation, is possible in the case of an end-point singularity. When a singularity is detected, we use a graded mesh to give a non-uniform multi-section of the end subinterval, and continue to apply Algorithm SS. If this graded mesh is not sufficiently fine to resolve the singularity then Algorithm SS simply requires further subdivision of some subintervals.

Let E(f; 0, h) denote the error in the quadrature rule approximation to ∫_0^h f(x) dx. Experimental evidence suggests that

\[ E(f; 0, h) \approx A h^{\alpha} \]

for constants A and α depending on f(x) and the quadrature rules. Thus from pairs of successive values of h and E(f; 0, h), we can estimate A and α. When consecutive estimates of A and α differ by less than some tolerance (in the results quoted in the next section we require them to differ by less than 10%) we define the graded subdivision of [0, h] into the n + 1 subintervals

\[ [0, h_0],\; [h_0, \lambda h_0],\; [\lambda h_0, \lambda^2 h_0],\; \ldots,\; [\lambda^{n-1} h_0, h], \]

where h_0 is such that A h_0^α ≈ ε/s (s being the current number of subintervals), λ is a constant and n is such that λ^n h_0 = h. Our motivation is this: the finest subdivision [0, h_0] is chosen so that the quadrature rule should resolve the singularity to sufficient accuracy; h_0 is also constrained to be such that n + 1 is a multiple of p so that the resulting workload is balanced. The remaining subintervals reflect the expected improving behaviour of the integrand as we move away from the singularity at x = 0. Unfortunately there seems to be no reasonable way to predict this behaviour. Clearly the grading of the subdivision is determined by the constant λ of the geometric progression. Experiments suggest that λ = 30 often represents a reasonable choice for this constant. We refer to this algorithm as SSssGM (for graded mesh).
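The following sketch outlines the graded-mesh construction (Python; the names are illustrative, the fit uses just two successive (h, E) pairs, and the constraint that n + 1 be a multiple of p is omitted). It assumes the relation E(f; 0, h) ≈ A h^α and the sizing rule A h_0^α ≈ ε/s as reconstructed above.

    import math

    def estimate_A_alpha(h1, E1, h2, E2):
        """Fit E(f;0,h) ~ A*h**alpha through two successive (h, E) pairs."""
        alpha = math.log(E1 / E2) / math.log(h1 / h2)
        A = E1 / h1 ** alpha
        return A, alpha

    def graded_mesh(h, A, alpha, eps, s, lam=30.0):
        """Breakpoints of the graded subdivision of [0, h]: the finest piece
        [0, h0] is sized so that A*h0**alpha is about eps/s, and successive
        pieces grow geometrically by the factor lam until h is reached."""
        h0 = (eps / (s * A)) ** (1.0 / alpha)
        points = [0.0, h0]
        while points[-1] * lam < h:
            points.append(points[-1] * lam)
        points.append(h)
        return points

    if __name__ == "__main__":
        # Illustrative values only, of the kind an x**-0.5 end-point singularity might give.
        A, alpha = estimate_A_alpha(0.5, 2.0e-3, 0.25, 1.4e-3)
        print(A, alpha)
        print(graded_mesh(1.0, A, alpha, eps=1.0e-10, s=20))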
7.2 Numerical experiments

We present results for algorithms SSESXF, SSESXC and SSESGM on the KSR-1, together with SSES, CPL and CPLX for comparison. Selection strategy ES was chosen as it is both effective and computationally inexpensive. We apply these
algorithms to the following set of problems:

Problem 1: \[ \int_0^1 x^{-0.5}\,dx, \qquad \epsilon = 10^{-10} \]

Problem 2: \[ \int_0^1 \log^{6.0} x\,dx, \qquad \epsilon = 10^{-10} \]

Problem 3: \[ \int_0^1 x^{-0.8} \cos(10000x)\,dx, \qquad \epsilon = 10^{-10} \]
Problem 1 has an algebraic singularity at x = 0, while Problem 2 has a logarithmic singularity at x = 0. Problem 3 has an algebraic singularity at x = 0, but in addition has oscillatory behaviour in the whole interval.

Figures 7.1 to 7.6 show temporal performance R_p and the number of integrand function evaluations I_p (as described in Section 2.3) for the parallel algorithms on each of the three test problems. The performance of the sequential algorithms CPL and CPLX is shown as horizontal lines; `Ideal' values are not shown because it is difficult to define them unambiguously. To facilitate comparison with algorithms described in previous Chapters, we again use the 30/61-point Gauss-Kronrod rule pair, even though this would not normally be the rule pair of choice for singular integrands (see Section 7.4). Note that although Problem 3 is the same as in Chapters 5 and 6, the results are not directly comparable. The reason for this is that the experiments in this Chapter were performed with an older release of the Fortran libraries for the KSR-1, resulting in longer function evaluation times. However, the inclusion of Algorithm SSES in the experiments of this Chapter provides the relevant comparisons.
[Figure 7.1: Problem 1: Temporal performance of Algorithms SSESXF, SSESXC, SSESGM, SSES, CPL and CPLX on KSR-1.]
[Figure 7.2: Problem 1: Total number of function evaluations performed by Algorithms SSESXF, SSESXC, SSESGM, SSES, CPL and CPLX on KSR-1.]
[Figure 7.3: Problem 2: Temporal performance of Algorithms SSESXF, SSESXC, SSESGM, SSES, CPL and CPLX on KSR-1.]
[Figure 7.4: Problem 2: Total number of function evaluations performed by Algorithms SSESXF, SSESXC, SSESGM, SSES, CPL and CPLX on KSR-1.]
[Figure 7.5: Problem 3: Temporal performance of Algorithms SSESXF, SSESXC, SSESGM, SSES, CPL and CPLX on KSR-1.]
[Figure 7.6: Problem 3: Total number of function evaluations performed by Algorithms SSESXF, SSESXC, SSESGM, SSES, CPL and CPLX on KSR-1.]
7.3 Analysis

Extrapolation is highly effective on Problem 1: CPLX finds the solution 10 times faster than CPL. SSES is about four times faster than CPL on 16 processors, but is always slower than CPLX. SSESGM outperforms CPLX for 4, 8, 16 and 24 processors, but not significantly. SSESXF and SSESXC outperform CPLX, but never by more than a factor of 1.8. Of the two, SSESXC is slightly the better. All three parallel algorithms with singularity detection have peak performance at 16 processors, with additional processors resulting in subsequent slow-down. For all three of these algorithms the total number of function evaluations increases with increasing p, though not as rapidly as in the case of Algorithm SSES.

Extrapolation proves less effective on Problem 2. CPLX finds the solution only about twice as fast as CPL. On four or more processors, SSES outperforms CPLX. The three parallel algorithms with singularity detection all show significant performance gains over CPLX. SSESXF is the slowest of the three, despite processing the fewest subintervals. On two and four processors SSESGM is the fastest, but its performance curve then flattens markedly with increasing p, as the number of function evaluations rises. SSESXC shows better scalability, with a peak performance on 24 processors some six times faster than CPLX.

On Problem 3, CPLX is only about 1.6 times faster than CPL. This is to be expected, as the two difficulties of resolving the singularity and handling the rapid oscillations of the integrand are equally demanding. As noted in Section 5.4, SSES gives reasonable performance gains, but these are limited by the 2 log_2 p bound when resolving the singularity. SSESXF, SSESXC and SSESGM all scale much better than SSES, with SSESXC and SSESXF using fewer function evaluations, and hence giving better performance, than SSESGM. Note that the unexpectedly high performance on 24 processors is a reflection of the stepped profile of Figure 5.13, which is not well resolved here.
7.4 Discussion

These results show that it is possible for useful performance gains to be obtained by combining Algorithm SS with extrapolation methods. For integrands where the singularity is the only difficulty, the number of processors that can usefully be employed is limited: in our experiments this limit is around 16. With more expensive function evaluation, scalability would improve, but would ultimately be limited by the same lack of parallelism as in Algorithm LL.

Algorithm SSssXF requires the fewest function evaluations, and is applicable to singularities at any location in the interval of integration. For integrands that are cheap to evaluate, however, the additional overheads incurred by using fine-grained parallelism can mean that SSssXC is more efficient. SSssXC, however, is restricted to end-point singularities.

In our experiments we have used the 30/61-point Gauss-Kronrod rule pair which was also used in the experiments of the previous Chapters. However, quadrature routines which are designed specifically for singular integrands often use rules of lower degree. For example, QAGS uses the 10/21-point Gauss-Kronrod pair. The lower degree rules may require more subintervals to be processed, but the total number of function evaluations will often be smaller. Using a lower degree rule would further increase the overheads associated with SSssXF, and load imbalance could become important, since the number of function evaluations required at each stage of the algorithm may not be much larger than the number of processors.

Algorithm SSssGM is uncompetitive with the extrapolation methods. Although computationally efficient, it requires more function evaluations, largely because the choice of λ is somewhat arbitrary, and thus the algorithm is not as adaptive to integrand behaviour.
Chapter 8

Parallel Single List Algorithms for Multi-Dimensional Quadrature

In this Chapter we consider the application of Algorithm SS of Chapter 5 to multi-dimensional quadrature.
8.1 Algorithms

In applying adaptive quadrature algorithms, there is a limit to the total number of subdivisions (and hence function evaluations) which can usefully be applied to a given problem, beyond which loss of precision restricts any further gains in the accuracy of the result. This may occur because either error estimates (computed as the difference of integral approximations by each of the pair of rules), or distances between adjacent abscissae, approach the machine precision. For the one-dimensional quadrature rules used in Chapters 5 to 7, this limit depends on the range of integration and on the nature of the integrand, but is typically of the order of 10^5 function evaluations. For multi-dimensional problems, however,
the lower accuracy of the quadrature rules and the effect of increasing dimension mean that it is not at all difficult to define problems to which 10^8 or more function evaluations can usefully be applied. This means that the list of subintervals can grow much larger than in the one-dimensional case, and algorithmic steps such as interval selection and list updating can become significant sequential bottlenecks. In Section 8.1.2, therefore, we describe how these steps may also be parallelised.

The other important difference between the one- and multi-dimensional cases is the subdivision strategy. With more than one dimension, there are more possibilities for subdividing an interval. Conventionally, the sequential algorithm (as implemented in D01FCF and DCHURE, for example) bisects the subinterval with largest error estimate in the dimension in which fourth divided differences of the integrand (computed at quadrature rule points) are greatest. To apply Algorithm SS to the multi-dimensional problem, it seems natural to apply the subdivision strategy of Figure 5.2, dividing a selected subinterval in the dimension with largest fourth divided differences. However, we find that allowing multisection (i.e. division into three or more pieces) of subintervals in the multi-dimensional case leads to increased total numbers of function evaluations, even when problems display no lack of useful parallelism. This surprising observation is illustrated in the results of Section 8.3. The effect is particularly noticeable where a sharp peak or singularity is present in the integrand. However, unlike the one-dimensional case, where such a feature results in small numbers of selected subintervals, the number of selected intervals in the multi-dimensional case typically remains large (in the hundreds or thousands) except, of course, in the initial stages where the total number of subintervals is still small. Restricting the subdivision scheme to bisection while the total number of subintervals is still small (less than p/2, for example) reduces the effect but does not eliminate it. We therefore find that adopting the simple strategy of bisecting all selected subintervals is more efficient, since the loss
of parallelism and load imbalance incurred are more than compensated by the reduction in the total number of function evaluations.

It is possible for a multi-dimensional problem to exhibit behaviour more like that observed in the one-dimensional case, resulting in only a few selected intervals. For this to occur, however, the peak or singularity must be located on a manifold of dimension one less than the region of integration (for example a line in two dimensions, or a plane in three). This manifold must be aligned with the co-ordinate axes, and the integrand must be well behaved in the direction perpendicular to the manifold. If these circumstances occur, then the dimension with largest divided difference will always be perpendicular to the manifold, and subdivision will always occur in this direction. If the manifold is of lower dimension, or is not perpendicular to the co-ordinate axes, then subdivision in more than one dimension will be required, and the number of selected subintervals will tend to be large. In this special case where subdivision is limited to one dimension, the subdivision strategy of Figure 5.2, including multisection, would be preferable.

A secondary advantage of using bisection only is that, for all the selection strategies except Strategy GZ, the total number of function evaluations (and hence the returned result and error estimate) does not depend on p. This makes the choice of parameters (for example α in Strategy AL) and comparisons between algorithms more straightforward, as well as being desirable from a software engineering point of view.

A further effect of the bisection only strategy is that, under certain assumptions, we can guarantee that the number of function evaluations required by Algorithm SSGL is the same as that required by Algorithm CP.

Proof: The set of subintervals processed by Algorithm CP forms a binary tree B where the nodes of the tree correspond to subintervals and the child nodes of a
subinterval are the two subintervals generated from it by bisection. We define a rooted binary sub-tree of B to be a binary sub-tree of B which contains the root node of B, and a proper rooted binary sub-tree of B to be a rooted binary sub-tree whose nodes are a proper subset of the nodes of B. We now make the following assumptions:

1. Algorithm CP and Algorithm SSGL make the same arbitrary choices between subintervals with identical error estimates.

2. There does not exist a proper rooted binary sub-tree of B whose leaf nodes satisfy the termination criterion.

Let B~ be a proper rooted binary sub-tree of B. If we apply Algorithm CP to B~, then it follows from Assumption 2 that the final tree generated will be B. Now let B~_GL be the binary tree generated by applying one stage of Algorithm SSGL to the set of subintervals represented by the leaf nodes of B~. Then for every leaf node i in B~ which is selected by Strategy GL, there exists a set of leaf nodes S in B~ such that

\[ \sum_{j \in S} e_j < \epsilon \le e_i + \sum_{j \in S} e_j \quad \text{and} \quad e_j \le e_i \;\; \forall j \in S. \]

Thus if we apply Algorithm CP to B~, eventually obtaining B, node i will be selected for subdivision, since Algorithm CP must select node i before any node in S, and unless node i is selected the total error cannot be less than ε. By Assumption 1, this holds when e_j = e_i for some j in S. Hence B~_GL is a rooted binary sub-tree of B, and thus, starting from the trivial one-node initial sub-tree, all trees produced by Algorithm SSGL are rooted binary sub-trees of B. Thus the final tree generated by Algorithm SSGL is a rooted binary sub-tree of B, but by Assumption 2 it cannot be a proper rooted sub-tree, so it must be B itself. Hence the number of subintervals generated (and consequently the number of function evaluations required) by Algorithm SSGL is the same as for Algorithm CP.
Note that a proper rooted binary sub-tree of B whose leaf nodes satisfy the termination criterion can only arise if subdividing a subinterval causes an increase in the total error estimate. For all the test problems in Section 8.3, we find that the numbers of function evaluations required by the two algorithms are indeed equal.
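Before turning to selection strategies, a minimal sketch of the bisection step itself may be helpful (Python; the subinterval representation and the crude five-point divided-difference estimate are illustrative only, and the actual implementation evaluates the differences at the quadrature rule points). A hyper-rectangle is bisected in the dimension with the largest estimated fourth divided difference.

    def fourth_divided_difference(f, centre, widths, dim, delta=0.25):
        """Crude estimate of the fourth divided difference of f along one
        dimension, using five equally spaced points through the centre of
        the subinterval."""
        x = list(centre)
        vals = []
        for k in (-2, -1, 0, 1, 2):
            x[dim] = centre[dim] + k * delta * widths[dim]
            vals.append(f(x))
        # Central fourth difference: f(-2) - 4 f(-1) + 6 f(0) - 4 f(+1) + f(+2)
        return abs(vals[0] - 4 * vals[1] + 6 * vals[2] - 4 * vals[3] + vals[4])

    def bisect_worst_dimension(f, lower, upper):
        """Bisect the hyper-rectangle [lower, upper] in the dimension with the
        largest estimated fourth divided difference; return the two halves."""
        d = len(lower)
        centre = [(lo + up) / 2.0 for lo, up in zip(lower, upper)]
        widths = [up - lo for lo, up in zip(lower, upper)]
        worst = max(range(d),
                    key=lambda i: fourth_divided_difference(f, centre, widths, i))
        left_upper = list(upper)
        left_upper[worst] = centre[worst]
        right_lower = list(lower)
        right_lower[worst] = centre[worst]
        return (lower, left_upper), (right_lower, upper)

    if __name__ == "__main__":
        f = lambda x: (x[0] * x[1] * x[2]) ** -0.8 if min(x) > 0 else 0.0
        print(bisect_worst_dimension(f, [0.01, 0.01, 0.01], [1.0, 1.0, 1.0]))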
8.1.1 Selection strategies

The selection strategies employed in our experiments are those of Chapter 5, with the exception of Strategy LE, which is also found to be uncompetitive in the multi-dimensional case. In addition, a further strategy is included, that of Naperiala and Gladwell described in [37]. This is similar to Strategy AB in that subintervals are sorted into bins. However, the bins have fixed endpoints throughout the algorithm, the advantage being that a subinterval only has to be classified once, so selection has complexity O(N). This does however lead to some difficult implementation issues, which are discussed below.
Strategy NG

1. Divide the range [10^-K ε, ε] into B exponentially spaced bins. The ith bin, i = 1, 2, ..., B, is given by

\[ \left[\, \exp\!\left(\log(10^{-K}\epsilon) + (i-1)l\right),\; \exp\!\left(\log(10^{-K}\epsilon) + il\right) \,\right], \]

where l = (log(ε) − log(10^-K ε))/B.

2. Determine the number of subintervals n_i whose error estimates lie in the ith bin. Subintervals with error estimate greater than ε are assigned to bin B + 1, and those with error estimate less than 10^-K ε are assigned to bin 1.

3. Find r such that

\[ \sum_{i=1}^{r-1} M_i n_i < \epsilon \le \sum_{i=1}^{r} M_i n_i, \]

where

\[ M_i = \exp\!\left(\log(10^{-K}\epsilon) + (i - 1/2)l\right) \]

is the exponential midpoint of the ith bin.

4. If r ≤ B, then select all subintervals with error estimates greater than exp(log(10^-K ε) + (r − 1)l), that is, all subintervals whose error estimates lie in bins r, r + 1, ..., B. If r = B + 1, then find r' such that

\[ \sum_{i=1}^{B} M_i n_i + \sum_{j=1}^{r'-1} e_j^{(B+1)} < \epsilon \le \sum_{i=1}^{B} M_i n_i + \sum_{j=1}^{r'} e_j^{(B+1)} \]

(where e_1^(B+1), ..., e_{n_{B+1}}^(B+1) are the error estimates of intervals in bin B + 1 with e_1^(B+1) ≤ e_2^(B+1) ≤ ... ≤ e_{n_{B+1}}^(B+1)), and select all subintervals with error estimates ≥ e_{r'}^(B+1).

It is possible that

\[ \sum_{i=1}^{B} M_i n_i + \sum_{j=1}^{n_{B+1}} e_j^{(B+1)} < \epsilon \]

even though the true sum of error estimates is greater than ε (this may be true even if n_{B+1} = 0). If this occurs, we select all the subintervals in the largest non-empty bin. In [37], K = B = 16 are suggested values for the one-dimensional problem. For the multi-dimensional case, however, K = 10 is normally sufficient, but B = 10 is
inadequate to obtain a reasonable approximation to Strategy GL. In Section 8.3 we address the issue of the choice of B for both Strategies AB and NG. In order to take advantage of the property that each subinterval generated needs only to be classified once, it is necessary to store a list of the indices of the subintervals lying in each bin.

For the multi-dimensional problem we make a minor modification to Strategy AL. Instead of simply selecting all the intervals with error estimates > αe_max, we select those with error estimate > min(ε, αe_max). The motivation for this comes from the choice of the best value for α, which will be addressed in detail in Section 8.3. We will see that values of α close to 1 are the most suitable, and hence in the early stages of the algorithm, αe_max is often greater than ε. We can be certain, though, that any subinterval with error estimate greater than ε requires subdivision, so this modified strategy increases parallelism without affecting the total number of function evaluations.

Algorithm LL also differs slightly from the one-dimensional case. In the multi-dimensional setting, we only exploit parallel function evaluations within the application of the quadrature rule to one new subinterval at a time, rather than across the pair of new subintervals. This is largely a matter of convenience, since weights and abscissae are generated from the ξ_i, w_j and ŵ_j of (3.24) for every subinterval.
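Returning to Strategy NG, the following sketch (Python; sequential, with illustrative names, and omitting the special handling of the case r = B + 1) shows the classification of error estimates into exponentially spaced bins and the selection of bins r, r + 1, ..., as reconstructed above.

    import math

    def strategy_ng_select(errors, eps, K=10, B=50):
        """Sketch of Strategy NG: classify error estimates into B exponentially
        spaced bins on [eps*10**-K, eps] and select every subinterval whose
        error lies in bin r or above, where r is the first bin at which the
        cumulative bin-midpoint estimate reaches eps."""
        lo = math.log(eps) - K * math.log(10.0)     # log of the bottom of bin 1
        l = (math.log(eps) - lo) / B                # bin width in log space

        def bin_index(e):
            if e >= eps:
                return B + 1
            if e <= math.exp(lo):
                return 1
            return int((math.log(e) - lo) / l) + 1

        counts = [0] * (B + 2)                      # counts[i] = n_i for bins 1..B+1
        for e in errors:
            counts[bin_index(e)] += 1

        total, r = 0.0, B + 1
        for i in range(1, B + 1):
            total += counts[i] * math.exp(lo + (i - 0.5) * l)   # midpoint M_i
            if total >= eps:
                r = i
                break

        threshold = math.exp(lo + (r - 1) * l)
        return [j for j, e in enumerate(errors) if e > threshold]

    if __name__ == "__main__":
        errs = [1e-12, 3e-9, 5e-7, 2e-6, 4e-6, 9e-6]
        print(strategy_ng_select(errs, eps=1e-5))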
8.1.2 Parallel selection and updating

Due to the large numbers of subintervals which can be generated by multi-dimensional problems, list updating and interval selection can become sequential bottlenecks. List updating is easily parallelisable, as the number of new subintervals is known to all processors, and hence the position in the list in which each
new subinterval should be placed can be computed independently. Note that because list entries are reused, synchronisation is required between the application of the quadrature rule and updating, to avoid overwriting of old entries. Parallelisation is possible for some of the selection strategies described in Section 5.1.1. Strategies GZ and GL are inherently sequential because selecting a subinterval involves updating the heap structure. It is not possible to push onto, or pop from, the heap more than one subinterval at a time. The other strategies can be parallelised as follows:
Strategy AL

To find e_max in parallel, the list is divided up into p pieces. Each processor finds the largest error estimate in its part of the list. The master processor then finds the maximum of these. Each processor then identifies the intervals in its part of the list whose error estimates exceed min(ε, αe_max). (A sketch of this parallel reduction and selection is given after these strategy descriptions.)
Strategy ES

To parallelise this strategy, we again divide the list into p pieces. Each processor then identifies the intervals in its part of the list whose error estimates exceed ε/s.
Strategy AB

This strategy is a little more complicated. We find e_max and e_min in the same way as e_max is found in Strategy AL. Each processor then determines the number of subintervals in its part of the list which lie in each bin. These are then aggregated into global totals sequentially. Determining r is also performed sequentially. Finally, each processor selects those intervals in its part of the list whose error estimates exceed exp(log(e_min) + (r − 1)l).
Strategy NG

We first divide the set of new subintervals generated at the previous stage of
Algorithm SS into p subsets. Each processor then classifies each subinterval in its subset according to the bin in which it lies, and stores the index of the subinterval in a local list. These indices are then added to global lists containing the indices of all subintervals with error estimates in bin i. This latter process can be parallelised by assigning different bins to different processors, though some load imbalance is likely since the number of indices to be added varies from bin to bin. Unfortunately, most of the newly generated subintervals tend to lie in just a few bins, so this process does not scale well with increasing numbers of processors. Steps 3 and 4 (determining r and, if necessary, r') are performed sequentially.
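As an illustration of the parallel selection phase, the sketch below parallelises Strategy AL along the lines just described, using a Python thread pool as a stand-in for the KSR-1 shared-variable constructs (the chunking and names are illustrative only): each worker finds the largest error in its piece of the list, these maxima are reduced to e_max, and each worker then selects from its own piece against the threshold min(ε, αe_max).

    from concurrent.futures import ThreadPoolExecutor

    def parallel_al_select(errors, eps, alpha, p):
        """Parallel Strategy AL sketch: p workers scan disjoint pieces of the
        list of error estimates; the per-piece maxima are reduced to e_max and
        each worker then returns the indices in its piece whose error exceeds
        min(eps, alpha * e_max)."""
        n = len(errors)
        chunk = (n + p - 1) // p
        pieces = [range(i, min(i + chunk, n)) for i in range(0, n, chunk)]

        with ThreadPoolExecutor(max_workers=p) as pool:
            # Phase 1: local maxima, then a sequential reduction to e_max.
            local_max = list(pool.map(lambda piece: max(errors[j] for j in piece),
                                      pieces))
            e_max = max(local_max)
            threshold = min(eps, alpha * e_max)

            # Phase 2: each worker selects from its own piece of the list.
            selected = list(pool.map(
                lambda piece: [j for j in piece if errors[j] > threshold], pieces))
        return [j for part in selected for j in part]

    if __name__ == "__main__":
        errs = [2e-7, 5e-9, 8e-7, 1e-6, 3e-8, 9e-7, 4e-10, 6e-7]
        print(parallel_al_select(errs, eps=1e-5, alpha=0.8, p=4))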
8.2 Implementation issues

Most implementation details remain unchanged from the one-dimensional case, as described in Section 5.2. The data to be stored for each subinterval comprises 2d end-points, the error estimate, the integral approximation and the `worst' dimension (that with largest fourth divided differences). This means that two subpages (32 8-byte words) can store all the data for d ≤ 14. This is more than adequate, since the quadrature rule is only defined for 2 ≤ d ≤ 10. Note that the `worst' dimension is an integer, but we store it as a real value to avoid the use of EQUIVALENCE, or non-standard data structures.

In Strategy NG, we need to store a list of subintervals for each bin. This requires a two-dimensional integer array of inner dimension the maximum allowed list length, and outer dimension the number of bins B + 1, since it is possible that all subintervals could lie in one bin. To implement the strategy efficiently in parallel, each processor requires a local copy of this array, though the inner dimension need only be the maximum allowed list length divided by the number of processors p. (The alternative to using local copies is to update the global
copy directly, using a lock for each bin, but this is likely to generate significant lock contention overheads.) These arrays dominate the memory requirements, but much of the space remains unused. Although most systems support virtual memory, the size of these arrays can cause difficulties, either because paging to disk may occur, or because the operating system may impose a limit on the amount of virtual memory available to a process.
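As a minimal illustration of the storage layout described above (a Python stand-in for the Fortran work arrays; the field ordering is illustrative), each subinterval occupies 2d + 3 reals, with the `worst' dimension held as a real value.

    def pack_subinterval(lower, upper, error, approx, worst_dim):
        """Pack one d-dimensional subinterval into a flat list of 2d+3 reals;
        the 'worst' dimension index is stored as a real value."""
        return list(lower) + list(upper) + [error, approx, float(worst_dim)]

    def unpack_subinterval(record, d):
        """Inverse of pack_subinterval."""
        lower = record[:d]
        upper = record[d:2 * d]
        error, approx, worst = record[2 * d], record[2 * d + 1], record[2 * d + 2]
        return lower, upper, error, approx, int(worst)

    if __name__ == "__main__":
        rec = pack_subinterval([0.0, 0.0, 0.0], [1.0, 0.5, 1.0], 3.2e-4, 0.173, 1)
        print(len(rec), unpack_subinterval(rec, 3))   # 9 reals for d = 3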
8.3 Numerical experiments

As we increase the number of dimensions, so the range of possible types of integrand behaviour grows rapidly. We attempt to select a small number of test problems which nevertheless illustrate the main performance differences between the parallel algorithms. To describe the problems used we first define the following functions:

\[ F_1 = \frac{4 x_1 x_3^2 \exp(2 x_1 x_3)}{(1 + x_2 + x_4)^2} \]

\[ F_2 = \prod_{i=1}^{d} \left( 0.36 + (x_i - 0.3)^2 \right)^{-1} \]

\[ F_3 = \cos\!\left( c \sum_{i=1}^{d} i x_i \right) \]

\[ F_4 = \prod_{i=1}^{d} x_i^{-0.8} \]
F1 is the standard test integrand for the NAG routine D01FCF. It has no noteworthy features. F2 has a large peak, and is the intended test function in [31], where it is misprinted. It is also a member of Family 2 (Product Peak) in Chapter 11 of [45]. F3 is oscillatory, and is a member of Family 1 (Oscillatory) in Chapter 11 of [45]. F4 has a strong singularity at the origin, an important feature not represented in the test problem set of [45] because lattice rules cannot be applied to such non-periodisable problems.
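For reference, the four test integrands, as reconstructed above, can be written as follows (Python; F3 takes the constant c as a parameter, and the summation index i is taken to run from 1 to d).

    import math

    def f1(x):
        """F1: the D01FCF example integrand (d = 4)."""
        return 4.0 * x[0] * x[2] ** 2 * math.exp(2.0 * x[0] * x[2]) \
            / (1.0 + x[1] + x[3]) ** 2

    def f2(x):
        """F2: product peak centred at x_i = 0.3."""
        prod = 1.0
        for xi in x:
            prod *= 1.0 / (0.36 + (xi - 0.3) ** 2)
        return prod

    def f3(x, c):
        """F3: oscillatory integrand; c controls the difficulty."""
        return math.cos(c * sum((i + 1) * xi for i, xi in enumerate(x)))

    def f4(x):
        """F4: product of algebraic singularities at the origin."""
        prod = 1.0
        for xi in x:
            prod *= xi ** -0.8
        return prod

    if __name__ == "__main__":
        print(f1([0.5] * 4), f2([0.3] * 6), f3([0.1] * 5, 2.0), f4([0.5] * 3))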
For all the problems the domain of integration is the d-dimensional unit cube. The problems are then defined by the choice of integrand F, the number of dimensions d, and a target number of function evaluations I_t. We will then choose values of ε such that the number of function evaluations required by Algorithm CP is as close as possible to I_t. In the case of F_3, increasing d dramatically increases the difficulty of the problem, which proves inconvenient when defining stable test problems, so we also specify the constant c to offset this effect.

Problem 1:  F = F_1, d = 4, I_t = 10^6
Problem 2:  F = F_2, d = 6, I_t = 10^7
Problem 3a: F = F_3, c = 100, d = 2, I_t = 10^7
Problem 3b: F = F_3, c = 2, d = 5, I_t = 10^7
Problem 3c: F = F_3, c = 0.5, d = 8, I_t = 10^7
Problem 4a: F = F_4, d = 3, I_t = 10^5
Problem 4b: F = F_4, d = 3, I_t = 10^6
Problem 4c: F = F_4, d = 3, I_t = 10^7

Problems 3a to 3c are chosen to illustrate the effects of increasing the dimension while keeping the number of function evaluations constant. Problems 4a to 4c are chosen to illustrate the effects of increasing the number of function evaluations performed while keeping the dimension constant.
As in the one-dimensional case, ε is an absolute rather than a relative tolerance. The underlying quadrature rule is that used by the NAG routine D01FCF, the rule of degree seven from the family of monomial rules due to Genz and Malik described in Section 3.2.

Table 8.1 describes the performance of Algorithm CPH on the above problems, showing the actual values of ε used, the number of function evaluations I, the number of quadrature rule applications N, the execution time T in seconds, the temporal performance R and the number of function evaluations per second IR. Note that due to the large values of N required, Algorithm CPL is very uncompetitive as the O(N^2) cost of selection dominates the execution time: results for CPL are therefore not reported.

Problem   ε             N        I          T        R            IR
1         2.811×10^-9   17469    995733     11.845   8.44×10^-2   8.41×10^4
2         1.469×10^-4   67105    9998645    100.64   9.94×10^-3   9.94×10^4
3a        7.084×10^-9   588237   10000029   201.94   4.95×10^-3   4.95×10^4
3b        1.114×10^-5   107529   10000197   124.06   8.06×10^-3   8.06×10^4
3c        7.624×10^-5   24939    10000539   120.09   8.33×10^-3   8.33×10^4
4a        8.350×10^0    3033     100089     1.937    5.16×10^-1   5.17×10^4
4b        8.742×10^-1   30305    1000065    19.894   5.03×10^-2   5.03×10^4
4c        1.198×10^-2   303031   10000023   207.10   4.83×10^-3   4.83×10^4

Table 8.1: Performance of Algorithm CPH
We now illustrate the effects of the choice of subdivision strategy on Algorithm SS. We apply Algorithm SSGL to Problems 1, 2, 3b and 4b using three different subdivision strategies:
Full multisection: the strategy used for one-dimensional problems, described in Figure 5.2.
Restricted multisection: as above, but restricted to bisection until there are ⌈p/2⌉ subintervals in the list.
Bisection only: always bisect all selected subintervals.

Figures 8.1 to 8.4 show the total number of function evaluations required, I_p, by each of the subdivision strategies.

[Figures 8.1 to 8.4: Total number of function evaluations performed by Algorithm SSGL using the different subdivision strategies, for Problems 1, 2, 3b and 4b respectively.]

Note that with the bisection only subdivision strategy, Algorithm SSGL requires exactly the same number of function evaluations as Algorithm CP. We observe that the full multisection strategy almost always requires more function evaluations than the bisection only strategy. This additional number ranges from a few percent of the total for Problem 3b up to over an order of magnitude more on Problem 2. The restricted multisection strategy in most cases requires fewer evaluations than the full multisection strategy, but more than the bisection only strategy. This behaviour is also observed with other selection strategies. It is important to note that we are not observing the same effect that is responsible for I_p increasing with p as in Problem 2 of Chapter 5.
In the cases in Figures 8.1 to 8.4, the mean number of subintervals selected per stage of the algorithm is significantly larger than p, even when a singularity is present, as in Problem 4b. Based on these observations, we conclude that in the majority of cases (that is, excluding those problems which can behave like one-dimensional problems) multisection is best avoided, and so for the remainder of this Chapter the bisection only subdivision strategy is employed in all the experiments.

We now turn our attention to finding suitable values of parameters: α for Strategy AL and B for Strategies AB and NG. Figures 8.5 to 8.8 show the temporal performance R_16 obtained by applying Algorithm SSAL (Algorithm SS with Strategy AL) to Problems 1, 2, 3b and 4b respectively, using 16 processors, for a range of values of α. Since we are using the bisection only subdivision strategy, the number of subintervals processed is independent of p, and so the performance profiles for other numbers of processors have very similar shapes.
[Figure 8.5: Problem 1: Temporal performance of Algorithm SSAL on KSR-1 using 16 processors, against α.]
[Figure 8.6: Problem 2: Temporal performance of Algorithm SSAL on KSR-1 using 16 processors, against α.]
[Figure 8.7: Problem 3b: Temporal performance of Algorithm SSAL on KSR-1 using 16 processors, against α.]
[Figure 8.8: Problem 4b: Temporal performance of Algorithm SSAL on KSR-1 using 16 processors, against α.]
The major difference between the one- and multi-dimensional cases is that multi-dimensional problems are much more sensitive to the choice of α. Problems 1, 2 and 3b exhibit typical behaviour: for small values of α performance is poor. As α increases, there is a range of values over which the performance is highly irregular. This is a consequence of the total number of function evaluations fluctuating between a value near that required by Algorithm CP and much larger ones. The number of data points recorded (40 per order of magnitude) does not resolve all the detail in this part of the profiles. For values of α between 0.5 and 1.0, there is less irregularity, with the number of function evaluations staying close to that required by Algorithm CP. As α approaches very close to 1.0, performance decreases as the number of selection stages, and hence the total cost of selection, increases rapidly. Problem 4b exhibits smoother behaviour: as α decreases from 1.0, performance improves slightly until α ≈ 0.08, below which performance decreases in a series of steps. Based on these results, we will use a value of α = 0.8 for the rest of the experiments in this Chapter.

Figures 8.9 to 8.16 show the total number of function evaluations required I and the temporal performance on 16 processors R_16 obtained by applying Algorithms SSAB and SSNG to Problems 1, 2, 3b and 4b respectively, for a range of values of B. Again, we are using the bisection only subdivision strategy, so the number of function evaluations is independent of p, and thus the temporal performance profiles for other numbers of processors have very similar shapes. The `Ideal' temporal performance is computed as 16 times the performance of Algorithm CPH.
[Figure 8.9: Problem 1: Temporal performance of Algorithms SSAB and SSNG on KSR-1 using 16 processors, against the number of bins B.]
[Figure 8.10: Problem 1: Total number of function evaluations performed by Algorithms SSAB and SSNG, against the number of bins B.]
[Figure 8.11: Problem 2: Temporal performance of Algorithms SSAB and SSNG on KSR-1 using 16 processors, against the number of bins B.]
[Figure 8.12: Problem 2: Total number of function evaluations performed by Algorithms SSAB and SSNG, against the number of bins B.]
[Figure 8.13: Problem 3b: Temporal performance of Algorithms SSAB and SSNG on KSR-1 using 16 processors, against the number of bins B.]
[Figure 8.14: Problem 3b: Total number of function evaluations performed by Algorithms SSAB and SSNG, against the number of bins B.]
[Figure 8.15: Problem 4b: Temporal performance of Algorithms SSAB and SSNG on KSR-1 using 16 processors, against the number of bins B.]
[Figure 8.16: Problem 4b: Total number of function evaluations performed by Algorithms SSAB and SSNG, against the number of bins B.]
We observe that all four Problems display similar features. The total number of function evaluations I for both algorithms initially tends to decrease as the number of bins B increases, until B has a value of about 50. Thereafter, as B increases from 50 to 500, there are some fluctuations in I, which remains within a few percent of the value for Algorithm CP, but there is no clear downward trend. For the smaller values of B, the decreasing I values are reflected in increasing temporal performance. As B increases from 50 to 500, however, performance tends to decrease, as the selection strategies (which both have a sequential search for r of complexity O(B)) become more expensive. For the experiments in the remainder of this Chapter, therefore, we use B = 50 for both algorithms. For Problems where function evaluation is more expensive, and thus the cost of selection is less important, larger values of B might prove more suitable.

We now compare the performance of all the algorithms on the eight test problems. Tables 8.2 and 8.3 show the total number of function evaluations I and the number of selection stages S respectively required by each of the Algorithms on the eight test problems. In the case of Algorithm SSGZ, the number of function evaluations depends on p, so the figure for I given here is the maximum number of function evaluations required for p ≤ 30. The number of selection stages for Algorithm SSGZ also depends on p and is given by N/p, where N is the total number of subintervals processed. Figures 8.17 to 8.24 show the temporal performance R_p for the parallel algorithms on each of the eight test problems. The `Ideal' value for R_p is computed as p times the corresponding value for Algorithm CPH. Note that for Problems 3a and 4c, results for small numbers of processors are not given for Algorithm SS. This is because the memory requirements exceed that available on these numbers of processors, and although the programs will execute, timing results are unreliable.
8.4 Analysis

We begin our analysis of the numerical experiments by comparing the algorithmic performance of the various algorithms, as illustrated in Tables 8.2 and 8.3.
Problem
2 3a 3b 9998645 10000029 10000197 9998645 10000029 10000197 10008479 10005503 10002243 10512695 10654835 10343739 14302659 14554091 16174095 10366079 11424221 10004847 10285619 11906205 10244415 Problem Algorithm 3c 4a 4b 4c CP/LL 10000539 100089 1000065 10000023 SSGL 10000539 100089 1000065 10000023 SSGZ 10009361 101013 1000923 10000287 SSAL 10087155 108801 1020855 10075989 SSES 14807727 177837 1219185 10700415 SSAB 10092769 105435 998547 9954087 SSNG 10191415 107745 1007721 9255015 Table 8.2: Total number of function evaluations required
Algorithm 1 CP/LL 8735 SSGL 25 SSAL 64 SSES 16 SSAB 19 SSNG 17
1 995733 995733 1000407 1003599 1410123 1068579 1063335
149
Problem 2 3a 3b 3c 4a 4b 4c 33553 294119 53765 12470 1517 15153 151516 54 25 32 34 80 108 159 91 82 111 101 53 79 129 18 20 18 16 27 49 80 27 20 22 23 41 65 101 32 29 22 20 43 69 123 Table 8.3: Number of selection stages
[Figure 8.17: Problem 1: Temporal performance of Algorithms SS and LL on KSR-1.]
[Figure 8.18: Problem 2: Temporal performance of Algorithms SS and LL on KSR-1.]
[Figure 8.19: Problem 3a: Temporal performance of Algorithms SS and LL on KSR-1.]
[Figure 8.20: Problem 3b: Temporal performance of Algorithms SS and LL on KSR-1.]
[Figure 8.21: Problem 3c: Temporal performance of Algorithms SS and LL on KSR-1.]
[Figure 8.22: Problem 4a: Temporal performance of Algorithms SS and LL on KSR-1.]
[Figure 8.23: Problem 4b: Temporal performance of Algorithms SS and LL on KSR-1.]
[Figure 8.24: Problem 4c: Temporal performance of Algorithms SS and LL on KSR-1.]
First, we note that because we are using the bisection only subdivision strategy, Algorithm SSGL requires exactly the same number of subdivisions as Algorithm CP. However, the number of selection stages required is very much smaller, indicating that there is a high degree of parallelism available in these problems. Note, though, that particularly for Problems 4a to 4c, which have singularities, the number of selection stages is larger than for Strategies ES, AB and NG. This arises from the behaviour of Strategy GL in the later stages, as the total error estimate approaches ε, when the number of selected intervals typically becomes very small.

Strategy GZ also behaves well with respect to the total number of function evaluations: in the worst case (Problem 4a) the number of additional evaluations is just over 1% of that required by Algorithm CP. However, the number of selection stages is much larger than for any of the other selection strategies.

Strategy ES is clearly the worst strategy in terms of its algorithmic performance. The number of additional function evaluations ranges from 7% on Problem 4c to 78% on Problem 4a.

On the grounds of numbers of function evaluations, there is little to choose between Strategies AL, AB and NG, as no one of them outperforms the others consistently across the different test problems. Strategy AL, however, requires significantly more selection stages than the other two. As we observed in the previous section, a high value of α is required to prevent over-selection, particularly in the later stages of the Algorithm. The penalty for this is that the earlier stages select fewer subintervals, and hence more stages are required.

Let us now examine the temporal performance realised by the parallel algorithms on the KSR-1. We first note that Algorithm LL performs very poorly on all problems, giving no useful performance improvement.

On Problem 1, Strategy AB gives the best performance, with Strategy NG close behind. Strategy AL processes fewer subintervals than either AB or NG,
but it has a higher number of selection stages, resulting in higher overall selection costs and poorer performance. Strategy ES processes 40% more subintervals than Strategy AL, but has fewer and cheaper selection stages. Strategy ES therefore scales better than Strategy AL, and outperforms it on 30 processors. Strategies GL and GZ both scale poorly, due to their sequential selection phases. GL performs better than GZ, however, since the number of global synchronisations is far fewer.

Problem 2 shows very similar behaviour to Problem 1, indicating that the presence of a sharp peak does not significantly restrict the available parallelism.

On Problem 3a, the number of subintervals processed is large, and hence the selection stages form a significant fraction of the sequential execution time. Strategies GL and GZ have very poor performance, as their selection stages remain sequential. Strategy AB is again the best, but Strategy AL is better than Strategy NG for this problem, because the total number of function evaluations is lower. Once again Strategy ES scales well, but up to 25 processors the large number of function evaluations means it takes longer than AB, NG or AL.

As we progress from Problem 3a to Problem 3c, the dimension increases, the number of subintervals processed decreases, and selection becomes less important in terms of sequential execution time. Thus the Strategies with sequential selection give progressively better performance. As the dimension increases, so the differences in algorithmic performance between Strategies AB, NG and AL disappear, and their temporal performances become closer. Strategy ES, on the other hand, remains algorithmically inefficient.

On Problem 4a, the presence of the singularity and the low number of function evaluations mean that the average number of subintervals selected per stage is small compared to previous Problems. Thus the total cost of selection stages
determines the relative performance of the Algorithm SS variants. Up to 20 processors, the best performance is attained by Strategy AL. Above 20 processors, however, Strategy ES is the fastest, since, despite processing many more subintervals, it requires fewer selection stages, and thus has more available parallelism.

As we progress from Problem 4a to Problem 4c the average number of subintervals selected per stage increases rapidly. As the number of function evaluations increases, so the singularity becomes better resolved, and it becomes profitable to subdivide subintervals not close to the singularity, resulting in increased useful parallelism. For all the Strategies (except of course GL), the algorithmic performance (compared with that of Algorithm CP) improves, and in the case of Strategies AB and NG actually betters that of Algorithm CP. On Problem 4b, Strategies AL and AB are joint leaders. Strategy ES is initially slower than NG, but overtakes it on more than 16 processors. On Problem 4c, Strategy NG is marginally the fastest, thanks to its surprisingly low number of function evaluations, but all four Strategies with parallel selection perform well.
8.5 Discussion

The poor performance of Algorithm LL is partly due to high synchronisation overheads (as in the one-dimensional case), but also partly because the computation of weights and abscissae forms a significant sequential bottleneck.

Of the selection strategies, we observe a clear distinction between those with parallel selection and those without. The two Strategies with sequential selection (GZ and GL) have good algorithmic properties, but lack scalability, especially when the list of subintervals grows large. The O(N log N) cost of selection then dominates the O(N/p) function evaluation cost.

The tendency of Strategy ES to process more subintervals than other Strategies was noted in the one-dimensional case in Section 5.4, but the effect is much
greater here, and renders this Strategy uncompetitive on most problems. In one dimension, as a result of using a very high degree rule, there tends to be a clear distinction between subintervals which are well resolved by the rule (and have error estimates of the order of the machine precision), and those which are not. Thus in one dimension, the tendency of Strategy ES to give low values of C (the error estimate value above which subintervals are selected) does not necessarily translate into a tendency to select too many subintervals. In the multi-dimensional case, however, the degree of the rule is much lower, and there is no clear distinction between well resolved and unresolved subintervals. In this case, low values of C do imply that too many subintervals are selected. However, this tendency to over-select means there is plenty of parallelism to be exploited, and in some cases (Problem 4a for example) the resulting scalability means that it is the best choice of strategy for large values of p.

Strategy AL gives reasonable algorithmic performance, but the penalty for achieving this is a tendency to under-select in the middle stages of the algorithm. This means that, especially on problems without a singularity, parallelism is restricted and more selection stages are required. Although the selection strategy is cheaper than Strategy AB, the extra selection stages required normally mean that overall performance is worse than Algorithm SSAB.

The similarity between Strategies AB and NG is reflected in generally similar performance, in terms of both I and R_p. Although Strategy NG has the theoretically lower complexity in the selection phase, this is in practice offset by additional memory access costs and less efficient parallelism within the selection strategy. Strategy NG is also more sensitive to the choice of the number of bins B. Strategy AB is therefore the Strategy of choice in most situations.
Chapter 9

Parallel Multiple List Algorithms for Multi-Dimensional Quadrature

In this Chapter we apply the algorithms based on the multiple list approach and described in Chapter 6 to multi-dimensional quadrature.
9.1 Algorithms

Since the multiple list algorithms are already highly parallel in their one-dimensional version, they can be applied to the multi-dimensional case with very few changes. The major change concerns the initial subdivision of the interval. In the one-dimensional case, we simply divide the interval into p equal pieces. However, as noted in Section 8.1, simply dividing the initial interval into p pieces in the dimension in which fourth divided differences of the integrand are greatest is not advisable. Instead, since the total number of subintervals processed will normally be very large, we simply run Algorithm CP on the initial interval until the list of subintervals has p entries. We then distribute these p subintervals
to the processors, and proceed with Algorithm ML (a sketch of this initial phase is given after the list of variants below). This does imply a short sequential phase, though if function evaluation were expensive enough, we could use Algorithm LL instead of Algorithm CP for this initial phase. For the test problems of Chapter 8, however, this is not worthwhile.

The algorithms which we have implemented for the multi-dimensional case are as follows:

Algorithm MLL Burrage's algorithm of [9].
Termination criterion: single processor local. Load balancing scheme: none.
Algorithm MLGB This algorithm attempts to balance the local error estimates. Termination criterion: true global. Load balancing scheme: B (see Section 6.1).
Algorithm MLGBR As for MLGB, but with reduced frequency of both load balancing and termination checking.
Termination criterion: true global at reduced frequency, γ = 0.05 (see Section 6.1 for the definition of γ).
Load balancing scheme: B at reduced frequency, γ = 0.05.
Algorithm MLGC As for MLGB, but with load balancing iterated to convergence at every stage.
Termination criterion: true global. Load balancing scheme: C (see Section 6.1).
Algorithm MLGCR As for MLGC, but with reduced frequency of both load balancing and termination checking.
Termination criterion: true global at reduced frequency, γ = 0.05. Load balancing scheme: C at reduced frequency, γ = 0.05.

The choice of the value of γ is justified in Section 9.3. Algorithms MLGA and MLGD were not implemented in the multi-dimensional case. In Section 6.1 it was shown that Algorithm MLGA can cause load imbalance to increase, and it has no advantages over Algorithm MLGB. Algorithm MLGD transfers far more subintervals than is necessary to obtain load balance, and was only intended to demonstrate that load balance can be attained with very simple schemes.
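Returning to the initial distribution step described at the start of this section, the following sketch (Python; the error-estimation routine is supplied by the caller as a stand-in for the quadrature rule pair, and for brevity the split is made in the widest dimension rather than the fourth-divided-difference choice of Chapter 8) runs a CP-style loop until there are p subregions and then deals one to each processor's local list.

    def initial_distribution(lower, upper, p, err_fn):
        """CP-style initial phase for the multi-dimensional ML algorithms:
        repeatedly bisect the subregion with the largest estimated error until
        there are p subregions, then give one to each processor's local list."""
        regions = [(err_fn(lower, upper), list(lower), list(upper))]
        while len(regions) < p:
            regions.sort(key=lambda r: r[0])          # largest error last
            _, lo, up = regions.pop()
            dim = max(range(len(lo)), key=lambda i: up[i] - lo[i])
            mid = 0.5 * (lo[dim] + up[dim])
            left_up, right_lo = list(up), list(lo)
            left_up[dim], right_lo[dim] = mid, mid
            regions.append((err_fn(lo, left_up), lo, left_up))
            regions.append((err_fn(right_lo, up), right_lo, up))
        # One subregion per processor's local list.
        return [[regions[i]] for i in range(p)]

    if __name__ == "__main__":
        # Crude stand-in error estimate: the volume of the subregion.
        def volume(lo, up):
            v = 1.0
            for a, b in zip(lo, up):
                v *= (b - a)
            return v
        for local_list in initial_distribution([0.0] * 3, [1.0] * 3, 5, volume):
            print(local_list)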
9.2 Implementation issues

The implementation of these algorithms for the multi-dimensional case is essentially the same as for the one-dimensional case. The only difference is that the data to be stored for each subinterval comprises 2d end-points, the error estimate, the integral approximation and the `worst' dimension (that with the largest fourth divided differences).
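As an illustration, the per-subinterval record just described might look as follows; this is a minimal sketch whose field names are invented for the purpose and are not taken from the thesis implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Subregion:
    """Data stored for one d-dimensional subinterval."""
    lower: List[float]       # d lower end-points, one per dimension
    upper: List[float]       # d upper end-points, one per dimension
    error_estimate: float    # local error estimate from the quadrature rule
    integral: float          # local integral approximation
    worst_dimension: int     # dimension with largest fourth divided differences

# Example: a subregion of the unit cube in three dimensions.
r = Subregion(lower=[0.0, 0.0, 0.0], upper=[0.5, 1.0, 1.0],
              error_estimate=1.3e-6, integral=0.042, worst_dimension=0)
```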
9.3 Numerical experiments

For the numerical experiments we use the same test problems as for the single list algorithms of Chapter 8. The random nature of the number of additional function evaluations required means that finding the optimal value of γ experimentally is difficult. However, we can justify our choice of γ as follows. Suppose the value of γ is sufficiently small that load balancing is successful, and the number of function evaluations I required to reduce the global error estimate below the requested tolerance is approximately the same as that required by Algorithm CP, even though termination may go undetected for a while. Then we can expect that the number of additional function evaluations required due to non-detection of termination has a rectangular distribution on [0, γI], and so the mean number is
$$\frac{\gamma I}{2}.$$

The number of synchronisation points (see Section 6.1) is approximately

$$1 + \frac{\log(S_{CP}/p)}{\log(1+\gamma)}.$$

Thus the total execution time Tp is approximately

$$T_p = \left(1 + \frac{\gamma}{2}\right)\frac{S_{CP}}{p}\,t_{CP} + \left(1 + \frac{\log(S_{CP}/p)}{\log(1+\gamma)}\right)t_{\mathrm{sync}},$$
where SCP is the total number of stages required by Algorithm CP, tCP is the typical time taken for one stage of Algorithm CP, and tsync is the typical time taken for a global synchronisation point, including global error checking and load balancing.

Figures 9.1 to 9.4 show the predicted efficiency of Algorithm MLGBR for Problems 1, 2, 3b and 4b on 4, 8, 16 and 24 processors for a range of values of γ. Note that a logarithmic scale is used on the γ axis. The efficiency is computed as Ts/(pTp), where Ts is the execution time of Algorithm CP and Tp is given by the expression above. Appropriate values for tCP and tsync were derived experimentally. We observe that the efficiency is quite sensitive to γ, and drops off rapidly when γ is either too large or too small. The value of γ which maximises efficiency, and hence performance, varies quite widely. As the number of stages SCP increases, the optimal value of γ decreases. Conversely, as the number of processors increases, so does the optimal value of γ. The optimal value also depends on the number of dimensions d through tCP. This means that choosing a single value of γ for all problems and all numbers of processors is somewhat unsatisfactory. Nevertheless, we find that a value of γ = 0.05 for both Algorithms MLGBR and MLGCR seems to be a good compromise, successfully load balancing all the test problems, and giving satisfactory performance while keeping the worst case of additional function evaluations tolerable even if function evaluation is expensive.

[Figure 9.1: Problem 1: Predicted efficiency of Algorithm MLGBR on KSR-1 (efficiency against γ for 4, 8, 16 and 24 processors)]
[Figure 9.2: Problem 2: Predicted efficiency of Algorithm MLGBR on KSR-1]
[Figure 9.3: Problem 3b: Predicted efficiency of Algorithm MLGBR on KSR-1]
[Figure 9.4: Problem 4b: Predicted efficiency of Algorithm MLGBR on KSR-1]

Figures 9.5 to 9.20 show temporal performance Rp and the number of integrand function evaluations Ip for the parallel multiple list algorithms on each of the eight test problems. The `Ideal' value of Ip is simply ICP, the corresponding value for Algorithm CP. The `Ideal' value for Rp is computed as p times the corresponding value for Algorithm CPH. Temporal performance for Algorithms SSAB and SSGZ, and the number of function evaluations for Algorithm SSAB, are included for comparison. To avoid losing detail, Algorithm MLL is omitted from plots of Ip.

To illustrate the differences between the single and multiple list algorithms in terms of communication overheads, we can use the KSR-1's hardware monitoring facilities to record the total time that each processor spends waiting for remote memory accesses to complete. Table 9.1 shows the percentage of the execution time attributable to remote accesses for Algorithms SSAB and MLGBR on the eight test problems using 16 processors.

            Problem
Algorithm     1     2     3a    3b    3c    4a    4b    4c
SSAB        1.52  3.60  5.60  1.77  0.55  6.25  3.66  2.64
MLGBR       0.42  1.97  0.22  0.31  0.69  5.66  1.16  1.97

Table 9.1: Percentage of execution time due to remote accesses
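The performance model of Section 9.3 is straightforward to evaluate numerically. The sketch below computes the predicted efficiency Ts/(pTp) over a range of values of γ, in the spirit of Figures 9.1 to 9.4. The numerical values of SCP, tCP and tsync used here are invented placeholders rather than the experimentally derived values, and the approximation Ts ≈ SCP·tCP is my own simplification.

```python
import math

def predicted_efficiency(gamma, p, S_cp, t_cp, t_sync):
    """Predicted efficiency Ts/(p*Tp) from the model of Section 9.3."""
    T_s = S_cp * t_cp  # approximating the sequential time of Algorithm CP
    sync_points = 1.0 + math.log(S_cp / p) / math.log(1.0 + gamma)
    T_p = (1.0 + gamma / 2.0) * (S_cp / p) * t_cp + sync_points * t_sync
    return T_s / (p * T_p)

# Placeholder parameter values chosen only for illustration.
S_cp, t_cp, t_sync = 20000, 1.0e-3, 5.0e-2
for gamma in (0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5):
    for p in (4, 8, 16, 24):
        e = predicted_efficiency(gamma, p, S_cp, t_cp, t_sync)
        print(f"gamma={gamma:<5} p={p:<2} predicted efficiency={e:.3f}")
```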
[Figure 9.5: Problem 1: Temporal performance of Algorithms ML and SS on KSR-1 (performance against number of processors for Algorithms MLL, MLGB, MLGBR, MLGC, MLGCR, SSAB, SSGZ and Ideal)]
[Figure 9.6: Problem 1: Total number of function evaluations performed by Algorithms ML and SS on KSR-1]
[Figure 9.7: Problem 2: Temporal performance of Algorithms ML and SS on KSR-1]
[Figure 9.8: Problem 2: Total number of function evaluations performed by Algorithms ML and SS on KSR-1]
[Figure 9.9: Problem 3a: Temporal performance of Algorithms ML and SS on KSR-1]
[Figure 9.10: Problem 3a: Total number of function evaluations performed by Algorithms ML and SS on KSR-1]
[Figure 9.11: Problem 3b: Temporal performance of Algorithms ML and SS on KSR-1]
[Figure 9.12: Problem 3b: Total number of function evaluations performed by Algorithms ML and SS on KSR-1]
[Figure 9.13: Problem 3c: Temporal performance of Algorithms ML and SS on KSR-1]
[Figure 9.14: Problem 3c: Total number of function evaluations performed by Algorithms ML and SS on KSR-1]
[Figure 9.15: Problem 4a: Temporal performance of Algorithms ML and SS on KSR-1]
[Figure 9.16: Problem 4a: Total number of function evaluations performed by Algorithms ML and SS on KSR-1]
[Figure 9.17: Problem 4b: Temporal performance of Algorithms ML and SS on KSR-1]
[Figure 9.18: Problem 4b: Total number of function evaluations performed by Algorithms ML and SS on KSR-1]
[Figure 9.19: Problem 4c: Temporal performance of Algorithms ML and SS on KSR-1]
[Figure 9.20: Problem 4c: Total number of function evaluations performed by Algorithms ML and SS on KSR-1]

9.4 Analysis

On Problem 1, Algorithm MLL gives reasonable, if somewhat erratic, performance, indicating that there are some load balance difficulties, but that they are not too severe. Algorithm MLGB solves the load balance difficulty, with values of Ip very close to those of Algorithm CP. However, synchronisation overheads are substantial, with the result that performance is little better than that of Algorithm MLL. Algorithm MLGBR, in which the synchronisation overheads are greatly reduced, performs much better, and as well as Algorithm SSAB. The number of function evaluations required is always in the range [ICP, (1 + γ)ICP], suggesting that any additional function evaluations are due to the reduced frequency of termination checking, rather than to any load imbalance. This is further supported by the observations that Algorithm MLGC, in which load balancing is iterated to convergence, requires virtually the same number of function evaluations as Algorithm MLGB, and that Algorithm MLGCR requires exactly the same number of function evaluations as Algorithm MLGBR. Thus in both cases, iterating load balancing to convergence serves no useful purpose; its only effect is to increase overheads and reduce performance.

On Problem 2, Algorithm MLL performs poorly, indicating that the load balance difficulties are more severe than in Problem 1. Nevertheless, the other ML variants all behave in much the same way as for Problem 1. Algorithm MLGBR again gives performance indistinguishable from Algorithm SSAB, and again iterating load balancing to convergence serves no useful purpose.

On Problem 3a, the importance of reducing synchronisation overheads is clear. Algorithms MLGB and MLGC both scale badly, while Algorithms MLGBR and MLGCR, despite the additional function evaluations, scale well, with Algorithm MLGBR giving consistently better-than-Ideal performance, and also outperforming Algorithm SSAB. On Problems 3b and 3c, as the dimension increases, the importance of reducing synchronisation overheads diminishes, and the difference in performance of the four MLG algorithms becomes less. The advantage of Algorithm MLGBR over Algorithm SSAB is lost, but their performances remain close.

On Problem 4a, Algorithm MLL tends to slow down as processors are added, indicating that the load balance difficulties associated with the singularity are severe. Nevertheless, Algorithm MLGB copes well with this, as the number of additional function evaluations required does not exceed 3% of ICP, and it outperforms Algorithm SSAB on nine or more processors. Algorithm MLGC does a little better in terms of Ip, but its execution time is larger. Reducing synchronisation has modest benefits, as the total number of stages is small, and hence the value of γ used is significantly lower than the optimal value for this problem. As the number of function evaluations increases from Problem 4a through to Problem 4c, load balance difficulties diminish, and iterating load balancing to convergence has even less impact on values of Ip, while reducing synchronisation has a progressively higher impact. Despite additional function evaluations, Algorithm MLGBR continues to outperform Algorithm SSAB on these problems.

In Table 9.1, we note that except for Problem 3c, overheads due to remote accesses are more significant for Algorithm SSAB than for Algorithm MLGBR. This is especially so when the list of subintervals becomes very large and load imbalance is mild, as in Problems 3a and 3b.
9.5 Discussion

In contrast to the one-dimensional case, multiple list algorithms in the multi-dimensional case give comparable performance to the single list algorithms. In the multi-dimensional case, as we observed in Chapter 8, there is typically a much higher degree of available parallelism than in the one-dimensional case. Thus the fact that the multiple list algorithms cannot multi-sect subintervals is no disadvantage in the multi-dimensional setting. Furthermore, it appears that load balance is much easier to achieve in the multi-dimensional case. Iterating
the load balancing scheme to convergence is of no benefit: Algorithm MLGC typically requires just as many function evaluations as Algorithm MLGB, and its load balancing is more expensive. To obtain performance comparable to the best single list algorithms, however, it is vital that synchronisation is performed much less frequently than at every stage of Algorithm ML. The synchronisation reduction mechanism we have chosen is shown here to work well, though the best choice of γ depends both on the number of processors used and on characteristics of the problem, some of which (the number of subintervals processed, for example) can only be estimated. Given a good choice of γ, however, Algorithm MLGBR proves very successful. On our test problems its performance is either as good as, or better than, that of the best single list algorithms. It is particularly effective on problems where the list of subintervals grows very large. On such problems, Algorithm SS does not exploit data locality very well, and thus the overheads due to remote memory accesses become significant. Algorithm ML, on the other hand, has excellent locality properties, as the only remote accesses required are in computing the global error and in load balancing. This allows Algorithm ML to exploit the cache hierarchy of NUMA machines such as the KSR-1 more effectively, and is responsible for the superlinear scaling (and hence negative overheads) observed on Problem 3a.
Chapter 10

Conclusions

We have investigated the properties of a number of parallel algorithms for globally adaptive quadrature, from both the single list and multiple list families of algorithms. In this Chapter we summarise our findings and assess the parallel algorithms according to the criteria laid out in Chapter 1.

Let us first focus on algorithms for the one-dimensional case. We observe that exploiting the parallelism within the quadrature rule (Algorithm LL) is not scalable on our test problems. On problems with more expensive function evaluation, scalability would of course be improved, but parallelism is still limited by the number of points in the quadrature rule. Algorithm AS, as well as being non-deterministic, also scales poorly. Access to the list of subintervals must be protected by a lock, and with sufficient processors, lock contention becomes a source of high overheads. With a sensible choice of selection strategy, Algorithm SS performs better than either AS or LL, particularly on problems where the difficulties are due to oscillation in the integrand rather than to sharp peaks or singularities. For functions that are cheap to evaluate, simple and cheap selection strategies (SSES and SSAL) give the best performance. Their success can be attributed to the tendency for there to be a large difference (orders of magnitude) in error estimate between
subintervals where the integrand is well resolved by the underlying polynomial interpolant, and those where it is not. SSAL has one tunable parameter, but there is a large interval in which the performance of the algorithm is insensitive to its value. The more sophisticated selection strategies (SSGL and SSAB) are a little better in terms of the total number of function evaluations required, and may therefore be preferred if function evaluation is expensive. On problems with singularities, Algorithm SS has limited scalability (self-speedup has a log p dependence), but useful performance gains are nevertheless possible.

The multiple list algorithms are in general poorly suited to one-dimensional quadrature. They lack the ability to allow more than one processor to work on the subdivision of one subinterval, and therefore give no performance gain where the only difficulty is a single singularity. Furthermore, even for non-singular integrands, none of the load balance mechanisms is able to successfully handle the load balance difficulties which are encountered. The large differences in error estimate between well resolved and poorly resolved subintervals may be responsible, as the estimated load on a processor may decrease by orders of magnitude from one stage of the algorithm to the next. Of the load balancing mechanisms tested, we find that using the total error estimate on a processor as an estimate of the load is the most successful, but we note that any mechanism which distributes subintervals so that systematic differences between the lists of subintervals are avoided will give reasonable results.

In Chapter 7 we examined the possibility of using parallel algorithms in conjunction with extrapolation methods for integrands with singularities. Since the multiple list algorithms cannot allow more than one processor to work on the subinterval containing the singularity, there is little point in adding extrapolation to them. For the single list algorithms the obvious approach to adding extrapolation to Algorithm SS is unsuccessful, as the number of results available
to the extrapolation process, and hence the efficacy of the process, diminishes as the number of processors increases. To overcome this we can switch to Algorithm LL when a singularity is detected (Algorithm SSxxXF), but we then suffer the shortcomings of Algorithm LL described above. If we have the additional information that the singularity is located at the origin, then it is possible to generate the necessary intermediate results for the extrapolation process (Algorithm SSxxXC). This is then more efficient than Algorithm SSxxXF. We also investigated an alternative to extrapolation, using a graded mesh approach (Algorithm SSxxGM), but this is uncompetitive, and its performance is sensitive to the value of its parameter.

We can therefore conclude that Algorithm SS, with an appropriate choice of selection strategy, is the method of choice for the one-dimensional case. Performance is as good as we can reasonably expect, the algorithm is robust, and, in the case of selection strategy AL, insensitive to parametrisation. However, if it is necessary to implement the algorithm in a message passing paradigm, Algorithm SS presents some difficulties, as the communication pattern is irregular and not known at compile time. The maintenance of a distributed list of subintervals and the generation of the correct messages to distribute selected subintervals to the appropriate processors would be a source of considerable overhead, as well as presenting a formidable programming task. Other solutions, such as keeping the list on a master processor, or replicating the list on all processors, are not scalable in terms of memory use.

Let us now turn our attention to multi-dimensional quadrature. Here we also find that even in cases where the number of dimensions, and hence the number of function evaluations required by the quadrature rule, is large, Algorithm LL performs poorly. For Algorithm SS, it is found to be advantageous to restrict subdivision of subintervals to bisection only. Multisection, whose purpose is to
increase available parallelism, also tends to increase the total number of function evaluations required, and the latter effect is usually the more important. A beneficial side effect of bisection-only subdivision, however, is that the value of the integral approximation Q is no longer dependent on the number of processors p.

For Algorithm SS, the relative merits of the selection strategies are not the same as in the one-dimensional case. The strategies which are not amenable to parallelisation (GZ and GL) prove to be significant sequential bottlenecks in the multi-dimensional case, where the number of subintervals in the final list is typically much larger than in the one-dimensional case. Strategy ES, which proved so successful in one dimension, processes many more subintervals than the other strategies, and although it displays good scalability, its lack of sequential competitiveness means that its performance is not as good as that of Strategies AB, NG and AL. The performance of Strategy AL is much more sensitive to the value of its parameter than in one dimension, but a value of 0.8 gives good performance on all our test problems. Such a high value, however, means that the number of calls to the selection procedure is higher than for Strategies AB and NG. Strategy AB has a slight edge in performance over Strategy NG, despite the theoretical advantages of NG. Both are competitive with Algorithm CP (in terms of the total number of function evaluations), scale well, and are insensitive to the choice of the number of bins B.

The multiple list algorithms are much better suited to multi-dimensional problems than they are to one-dimensional ones. The inability to multisect subintervals is no disadvantage here, and because the load estimate typically changes slowly from one stage to the next, load balancing is much more successful. On our test problems, the one-step load balancing Algorithm MLGB is sufficient to balance the loads. Additional load balancing steps (Algorithm MLGC) are of no benefit. We find that synchronising at every stage of the algorithm results in high
overheads. While this is exacerbated by the high cost of barriers on the KSR-1, it will still be significant on any shared variable architecture that does not have barriers implemented in hardware, and on any message passing machine. Reducing synchronisation can be successfully achieved by the use of the parameter γ
in Algorithms MLGBR and MLGCR. However, performance is quite sensitive to the choice of γ, and the optimal value of γ depends on both the problem and the number of processors used.

In all our test problems, the cost of integrand function evaluation is independent of the point at which the function is evaluated. If this were not the case, it is possible that there may be some load imbalance (in the traditional sense of idle processors) between global synchronisation points. However, since Algorithm SS typically processes many more subintervals between synchronisation points than there are processors, any differences in function evaluation time should be averaged out between processors. For the multiple list algorithms, this is only true when ŝ, the number of stages between synchronisation points, is sufficiently large. Thus the multiple list algorithms are likely to be more adversely affected by varying function evaluation times.

To summarise, therefore, we find that for one-dimensional quadrature there are clear reasons to prefer the single list approach. Algorithm SS with a fast and simple selection strategy gives good performance on a wide range of problem types, and satisfies the criteria of stability, robustness, determinism and insensitivity to parametrisation. Some success in combining this algorithm with extrapolation methods for singular integrands has been achieved. In the multi-dimensional case, there are good arguments in favour of both single and multiple list approaches. The benefits of the multiple list algorithms over single list algorithms are simple implementation in message passing, and low communication overheads. The latter is particularly important for problems
where the list of subintervals grows very large. On the other hand, the single list algorithms are less sensitive to parametrisation, return the same result regardless of the number of processors used, are not limited to numbers of processors with suitable factors, and are more tolerant of varying function evaluation cost.
Bibliography

[1] Bailey, D.H., (1992) Misleading performance reporting in the supercomputing field, Scientific Programming, vol. 1, no. 2, pp. 141–151.

[2] Berntsen, J., and T.O. Espelid, (1988) A parallel global adaptive quadrature algorithm for hypercubes, Parallel Computing, vol. 8, pp. 313–323.

[3] Berntsen, J., T.O. Espelid and A. Genz, (1991) An adaptive algorithm for the approximate calculation of multiple integrals, ACM Trans. on Math. Soft., vol. 17, no. 4, pp. 437–451.

[4] Bull, J.M., (1996) A hierarchical classification of overheads in parallel programs, Proceedings of First IFIP TC10 International Workshop on Software Engineering for Parallel and Distributed Systems, I. Jelly, I. Gorton and P. Croll (eds.), Chapman Hall, pp. 208–219.

[5] Bull, J.M. and T.L. Freeman, (1994) Parallel globally adaptive quadrature on the KSR-1, Advances in Computational Mathematics, vol. 2, pp. 357–373.

[6] Bull, J.M. and T.L. Freeman, (1994) Parallel algorithms and interval selection strategies for globally adaptive quadrature, Proceedings of PARLE '94, Lecture Notes in Computer Science no. 817, Springer Verlag, pp. 490–501.
[7] Bull, J.M., T.L. Freeman and I. Gladwell, (1994) Parallel quadrature algorithms for singular integrals, Proceedings of 14th World Congress on Computational and Applied Mathematics, IMACS, pp. 1136–1139.

[8] Burden, R.L. and J.D. Faires, (1988) Numerical Analysis, Fourth Edition, PWS-KENT, Boston.

[9] Burrage, K., (1990) An adaptive numerical integration code for a chain of transputers, Parallel Computing, vol. 16, pp. 305–312.

[10] Cranley, R. and T.N.L. Patterson, (1971) On the automatic numerical evaluation of definite integrals, Comput. J., vol. 14, pp. 189–198.

[11] Davis, P.J. and P. Rabinowitz, (1984) Methods of Numerical Integration, Second Edition, Academic Press.

[12] de Doncker, E. and J. Kapenga, (1987) A parallelisation of adaptive integration methods, in Numerical Integration: Recent Developments, Software and Applications, G. Fairweather and P.M. Keast (eds.), D. Reidel Publishing, Dordrecht, pp. 207–218.

[13] de Doncker, E. and J. Kapenga, (1990) Parallel systems and adaptive integration, Contemporary Mathematics, vol. 155, pp. 33–51.

[14] de Doncker, E. and J. Kapenga, (1992) Parallel cubature on loosely coupled systems, in Numerical Integration: Recent Developments, Software and Applications, T.O. Espelid and A. Genz (eds.), Kluwer Academic, Dordrecht, pp. 317–327.

[15] Flynn, M.J., (1972) Some computer organisations and their effectiveness, IEEE Trans. Comput., vol. C-21, pp. 948–960.
[16] Fritsch, F.N., D.K. Kahaner and J.N. Lyness, (1981) Double integration using one-dimensional adaptive quadrature routines: a software interface problem, ACM Trans. on Math. Soft., vol. 7, no. 1, pp. 46–75.

[17] Geist, A., A. Beguelin, J. Dongarra, W. Jiang, R. Manchek and V. Sunderam, (1994) PVM 3 Users Guide and Reference Manual, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831.

[18] Genz, A., (1982) Numerical multiple integration on parallel computers, Computer Physics Communications, vol. 26, pp. 349–352.

[19] Genz, A., (1987) The numerical evaluation of multiple integrals on parallel computers, in Numerical Integration: Recent Developments, Software and Applications, G. Fairweather and P.M. Keast (eds.), D. Reidel Publishing, Dordrecht, pp. 219–229.

[20] Genz, A.C. and A.A. Malik, (1980) An adaptive algorithm for numerical integration over an N-dimensional rectangular region, J. Comput. Appl. Math., vol. 6, pp. 295–302.

[21] Genz, A.C. and A.A. Malik, (1983) An imbedded family of fully symmetric integration rules, SIAM J. Numer. Anal., vol. 20, pp. 580–588.

[22] Gladwell, I., (1987) Vectorisation of one dimensional quadrature codes, in Numerical Integration: Recent Developments, Software and Applications, G. Fairweather and P.M. Keast (eds.), D. Reidel Publishing, Dordrecht, pp. 230–238.

[23] Helmbold, D.P. and C.E. McDowell, (1989) Modeling speedup(n) greater than n, in Proceedings of 1989 Int. Conf. on Parallel Processing, Pennsylvania State Univ. Press, pp. III-219–III-225.
[24] High Performance Fortran Forum, (1993) High Performance Fortran Language Specification, Scientific Programming, vol. 2, nos. 1 and 2.

[25] Hockney, R.W. and C.R. Jesshope, (1988) Parallel Computers 2: Architecture, Programming and Algorithms, Adam Hilger.

[26] Hockney, R., (1992) A framework for benchmark performance analysis, Supercomputer, vol. 48, pp. 9–22.

[27] Johnson, E.E., (1988) Completing an MIMD multiprocessor taxonomy, Computer Architecture News, vol. 16, no. 3, pp. 44–47.

[28] Korobov, N.M., (1959) The approximate computation of multiple integrals, Doklady Akademii Nauk SSSR, vol. 124, pp. 1207–1210.

[29] Kronrod, A.S., (1965) Nodes and Weights of Quadrature Formulas, Consultants Bureau, New York.

[30] K.S.R., (1991) KSR Fortran Programming, Kendall Square Research, Waltham, Mass.

[31] Lapegna, M. and A. D'Alessio, (1993) A scalable parallel algorithm for the adaptive multidimensional quadrature, in Proceedings of the Sixth SIAM Conference on Parallel Processing, R.F. Sinovec, D.E. Keyes, M.R. Leuze, L.R. Petzold and D.A. Reed (eds.), SIAM, Philadelphia, pp. 933–936.

[32] Lautrup, B., (1971) An adaptive multidimensional integration procedure, in Proc. of 2nd Colloquium on Advanced Computing Methods in Theoretical Physics, Marseille, pp. I57–I82.

[33] Malcolm, M.A. and R.B. Simpson, (1975) Local versus global strategies for adaptive quadrature, ACM Trans. on Math. Soft., vol. 1, pp. 129–146.
[34] Mascagni, M., (1990) High-dimensional numerical integration and massively parallel computing, Contemporary Mathematics, vol. 115, pp. 53–73.

[35] Message Passing Interface Forum, (1994) MPI: A message-passing interface standard, International Journal of Supercomputer Applications and High Performance Computing, vol. 8, nos. 3 and 4.

[36] NAG, (1991) NAG Fortran Library Manual, Mark 15, NAG Ltd., Oxford.

[37] Napierala, M.A. and I. Gladwell, (1995) Reducing ranking effects in parallel adaptive quadrature, in Proceedings of the Seventh SIAM Conference on Parallel Processing, D.H. Bailey, P.E. Bjorstad, J.R. Gilbert, M.V. Mascagni, R.S. Schreiber, H.D. Simon, V.J. Torczon and L.T. Watson (eds.), SIAM, Philadelphia, pp. 647–651.

[38] Piessens, R., E. de Doncker, C. Überhuber and D. Kahaner, (1983) QUADPACK, A Subroutine Package for Automatic Integration, Springer-Verlag.

[39] Rice, J., (1975) A metalgorithm for adaptive quadrature, Journal of the ACM, vol. 23, pp. 61–82.

[40] Rice, J., (1974) Parallel algorithms for adaptive quadrature – convergence, in Information Processing, J.L. Rosenfeld (ed.), North-Holland, Amsterdam, pp. 600–604.

[41] Rice, J., (1975) Parallel algorithms for adaptive quadrature II – metalgorithm correctness, Acta Informatica, vol. 5, pp. 273–285.

[42] Rice, J., (1976) Parallel algorithms for adaptive quadrature III – program correctness, ACM Trans. on Math. Soft., vol. 2, pp. 1–30.

[43] Sag, T.W. and G. Szekeres, (1964) Numerical evaluation of high-dimensional integrals, Math. Comput., vol. 18, pp. 245–253.
[44] Shanks, D., (1955) Non-linear transformations of divergent and slowly convergent sequences, J. Math. and Phys., vol. 34, pp. 1–42.

[45] Sloan, I.H. and S. Joe, (1994) Lattice Methods for Multiple Integration, Oxford.

[46] Wynn, P., (1956) On a device for computing the e_m(S_n) transformation, Math. Comput., vol. 10, pp. 91–96.

[47] Valiant, L.G., (1990) A bridging model for parallel computation, Communications of the ACM, vol. 33, no. 8, pp. 103–111.