Parallel Quasi-Monte Carlo Integration using (t,s)-Sequences
Wolfgang Ch. Schmid¹ and Andreas Uhl²
¹ Department of Mathematics, University of Salzburg, AUSTRIA
² RIST++ & Department of Computer Science and System Analysis, University of Salzburg, AUSTRIA
e-mail: {wolfgang.schmid, [email protected]}
Abstract. Currently, the most effective constructions of low-discrepancy point sets and sequences are based on the theory of (t,m,s)-nets and (t,s)-sequences. In this work we discuss parallelization techniques for quasi-Monte Carlo integration using (t,s)-sequences. We show that leapfrog parallelization may be very dangerous, whereas block-based parallelization turns out to be robust.
1 Introduction

Currently, the most effective constructions of low-discrepancy point sets and sequences, which are of great importance for quasi-Monte Carlo methods in multidimensional numerical integration, are based on the concept of (t,m,s)-nets and (t,s)-sequences. A detailed theory was developed in Niederreiter [10].

High-dimensional numerical integration problems may require a significant amount of computation. Therefore, substantial effort has been invested into finding techniques for performing these computations on all kinds of parallel architectures (see [8] for an exhaustive overview). In order to keep the communication within a parallel system to a minimum, each processing element (PE) requires its own source of integration nodes. Therefore, our aim is to investigate techniques for using separately initialized and disjoint portions of a given point set on single PEs.

In practice, it is usually not possible to determine a priori the number of integration nodes N necessary to meet a given error requirement. Therefore, it is of great importance that N may be increased without losing previously calculated function values. Additionally, unbalanced load within a parallel system makes it extremely difficult to predict the number of integration nodes required on a single PE. For these reasons we restrict the discussion to infinite (t,s)-sequences.

We may choose between two possible approaches for generating separately initialized and disjoint substreams of a (t,s)-sequence (a small index sketch illustrating both is given at the end of this section):

- Blocking: disjoint contiguous blocks of the original sequence are used on the PEs. This is achieved by simply using a different start point on each PE (e.g., employing p PEs, PE$_i$ generates the points $x_i, x_{i+1}, x_{i+2}, \ldots$, where i is the first index of the block specific to each PE).
- Leaping: interleaved substreams of the original sequence are used on the PEs. Each PE skips those points handled by other PEs (leap-frogging); e.g., employing p PEs, PE$_i$ generates the points $x_i, x_{i+p}, x_{i+2p}, \ldots$ with $i = 0, \ldots, p-1$.

These techniques have their corresponding counterparts in the field of pseudo-random number generation, where it is also desirable to obtain separately initialized and disjoint portions of the output stream of a pseudo-random number generator for parallel applications. In this context, both techniques may lead to dangerous side-effects (see e.g. [4] for blocking and [5] for leaping).

Only a small amount of work has been done on using (t,s)-sequences for parallel numerical integration. Bromley [3] describes a leap-frog parallelization technique to break up the so-called Sobol' sequence into interleaved subsets. In this work we extend Bromley's technique to all types of binary digital (t,s)-sequences, and we investigate the effects which occur when using blocking and leaping parallelization, respectively. Most importantly, we demonstrate that leaping parallelization may lead to dramatic defects in the results, whereas blocking behaves very stably.
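As a purely illustrative sketch (not code from the paper), the following shows how a global index stream would be split under the two strategies; the PE rank, block length, and counts are hypothetical parameters chosen for the example.

```python
def block_indices(pe_rank, block_len, count):
    """Blocking: PE pe_rank uses a contiguous block starting at index pe_rank * block_len."""
    start = pe_rank * block_len
    return [start + k for k in range(count)]

def leap_indices(pe_rank, num_pes, count):
    """Leaping (leap-frog): PE pe_rank uses every num_pes-th index, offset by its rank."""
    return [pe_rank + k * num_pes for k in range(count)]

# With 4 PEs, the first indices handled by PE 1 under each scheme:
print(block_indices(1, block_len=2**20, count=5))   # [1048576, 1048577, 1048578, 1048579, 1048580]
print(leap_indices(1, num_pes=4, count=5))          # [1, 5, 9, 13, 17]
```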
2 Numerical integration by means of digital (t,s)-sequences

We consider quasi-Monte Carlo methods for multidimensional numerical integration, with the half-open s-dimensional unit cube $I^s = [0,1)^s$ for $s \ge 2$ as the normalized integration domain. These methods are based on the integration rule
$$\int_{I^s} F(u)\,du \approx \frac{1}{N} \sum_{n=0}^{N-1} F(x_n)$$
with deterministic nodes $x_0, x_1, \ldots, x_{N-1}$ which are customarily taken from $I^s$.
One basic estimate for the integration error is given by the Koksma–Hlawka inequality: for the rather general class of functions F of bounded variation V(F) in the sense of Hardy and Krause we have, for any $x_0, \ldots, x_{N-1} \in I^s$,
$$\left| \int_{I^s} F(u)\,du - \frac{1}{N} \sum_{n=0}^{N-1} F(x_n) \right| \le V(F)\, D_N^*(x_0, \ldots, x_{N-1}).$$
We recall that for a point set P consisting of $x_0, x_1, \ldots, x_{N-1} \in I^s$, its star discrepancy $D_N^*(P)$ is defined by
$$D_N^*(P) = \sup_J \left| \frac{A(J; P)}{N} - \mathrm{Vol}(J) \right|,$$
where the supremum is extended over all intervals J of the form $J = \prod_{i=1}^{s} [0, u_i)$ with $0 < u_i \le 1$ for $1 \le i \le s$. Here, for an arbitrary subinterval J of $I^s$, $A(J; P)$ is the number of n with $0 \le n \le N-1$ and $x_n \in J$, and $\mathrm{Vol}(J)$ denotes the s-dimensional volume of J. So, to guarantee small integration errors, the nodes should form a low-discrepancy point set, i.e., a point set with small star discrepancy (see Niederreiter [12] for a survey of quasi-Monte Carlo methods).

The concepts of (t,m,s)-nets and of (t,s)-sequences in a base b provide low-discrepancy point sets of $b^m$ points, respectively infinite sequences, in $I^s$, $s \ge 1$, which are extremely well distributed if the quality parameter $t \in \mathbb{N}_0$ is "small". We follow Niederreiter [12] in our basic notation and terminology.
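As a small illustrative aside (not part of the paper's experiments), the star discrepancy can be evaluated exactly in dimension s = 1 via the classical closed form $D_N^* = \frac{1}{2N} + \max_{1 \le i \le N} \left| x_{(i)} - \frac{2i-1}{2N} \right|$ over the sorted points; the sketch below assumes this setting and uses the van der Corput sequence merely as a convenient example point set.

```python
def star_discrepancy_1d(points):
    """Exact star discrepancy for s = 1:
    D*_N = 1/(2N) + max_i | x_(i) - (2i-1)/(2N) | over the sorted points."""
    xs = sorted(points)
    n = len(xs)
    return 1.0 / (2 * n) + max(abs(x - (2 * i + 1) / (2 * n)) for i, x in enumerate(xs))

def van_der_corput(n):
    """Radical inverse of n in base 2 (used here only as an example point set)."""
    x, denom = 0.0, 1.0
    while n:
        denom *= 2
        x += (n & 1) / denom
        n >>= 1
    return x

print(star_discrepancy_1d([van_der_corput(i) for i in range(64)]))  # small (low discrepancy)
print(star_discrepancy_1d([(0.7 * i) % 1.0 for i in range(64)]))    # larger, for comparison
```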
Definition 1. Let $b \ge 2$, $s \ge 1$, and $0 \le t \le m$ be integers. A point set consisting of $b^m$ points of $I^s$ forms a (t,m,s)-net in base b if every subinterval $J = \prod_{i=1}^{s} \left[ a_i b^{-d_i}, (a_i + 1) b^{-d_i} \right)$ of $I^s$ with integers $d_i \ge 0$ and $0 \le a_i < b^{d_i}$ for $1 \le i \le s$ and of volume $b^{t-m}$ contains exactly $b^t$ points of the point set.
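For small parameters the net property of Definition 1 can be checked by brute force. The sketch below (our own illustration, base b = 2 only) enumerates all elementary intervals of volume $2^{t-m}$ and counts the points falling into each; the two-dimensional Hammersley point set is used as a classical example of a (0,m,2)-net in base 2.

```python
import itertools

def is_net_base2(points, t, m, s):
    """Brute-force check of the (t,m,s)-net property in base 2 (Definition 1)."""
    assert len(points) == 2 ** m
    for d in itertools.product(range(m - t + 1), repeat=s):
        if sum(d) != m - t:
            continue  # only elementary intervals of volume 2^(t-m)
        counts = {}
        for x in points:
            box = tuple(int(x[i] * (2 ** d[i])) for i in range(s))  # interval containing x
            counts[box] = counts.get(box, 0) + 1
        if len(counts) != 2 ** (m - t) or any(c != 2 ** t for c in counts.values()):
            return False
    return True

def bit_reverse(n, m):
    return int(format(n, f'0{m}b')[::-1], 2)

m = 4
hammersley = [(i / 2 ** m, bit_reverse(i, m) / 2 ** m) for i in range(2 ** m)]
print(is_net_base2(hammersley, t=0, m=m, s=2))   # expected: True
```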
To date, all construction methods which are relevant for applications in quasi-Monte Carlo methods are digital methods over finite fields $\mathbb{F}_q$, in most cases over $\mathbb{F}_2$.
Definition 2. Let q be a prime power and $s \ge 1$ an integer. Let $C^{(1)}, \ldots, C^{(s)}$ be $(\infty \times \infty)$-matrices over $\mathbb{F}_q$. For $n \ge 0$ let $n = \sum_{k=0}^{\infty} a_k q^k$ be the q-adic representation of n in base q, and consider the digits $a_0, a_1, \ldots$ as elements of $\mathbb{F}_q$. Let
$$\left( y_1^{(i)}(n), y_2^{(i)}(n), \ldots \right)^T := C^{(i)} \cdot (a_0, a_1, \ldots)^T \qquad \text{for } i = 1, \ldots, s.$$
The sequence
$$x_n := \left( \sum_{k=1}^{\infty} \frac{y_k^{(1)}(n)}{q^k}, \ldots, \sum_{k=1}^{\infty} \frac{y_k^{(s)}(n)}{q^k} \right) \in I^s \qquad \text{for } n = 0, 1, \ldots$$
is called a digital (t,s)-sequence constructed over $\mathbb{F}_q$ if, for all integers $k \ge 0$ and $m > t$, the point set consisting of the $x_n$ with $k q^m \le n < (k+1) q^m$ is a (t,m,s)-net in base q.
For a more general definition of digital (t,s)-sequences and digital (t,m,s)-nets (over arbitrary finite commutative rings) see for example Niederreiter [12] or Larcher, Niederreiter, and Schmid [9].

Remark: The construction of the points of digital nets and sequences, and therefore the quality of their distribution, depends only on the matrices $C^{(1)}, \ldots, C^{(s)}$. So the crucial point for a concrete implementation is the construction of these matrices. Examples of the construction of "good" matrices (and also of properties providing a good distribution) can be found, for example, in Niederreiter [12].
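As a minimal illustration of Definition 2 over $\mathbb{F}_2$ (not code from the paper), one coordinate of a digital sequence can be generated from a finite portion of its generator matrix; with the identity matrix this reproduces the van der Corput sequence in base 2, which is a digital (0,1)-sequence. The working precision W and all helper names are our own choices for the illustration.

```python
W = 32  # working precision: number of generator-matrix rows/columns actually used

# Column c_j of the generator matrix is packed into an integer: bit (k-1) of the
# integer holds the matrix entry in row k, which contributes 2^(-k) to the coordinate.
identity_columns = [1 << j for j in range(W)]   # C = identity -> van der Corput in base 2

def to_unit_interval(y):
    """Map the packed digit vector y to sum_k y_k * 2^(-k) in [0, 1)."""
    return sum(((y >> k) & 1) * 2.0 ** -(k + 1) for k in range(W))

def digital_coordinate(n, columns):
    """One coordinate of x_n: XOR of the columns c_j selected by the binary digits a_j of n."""
    y, j = 0, 0
    while n:
        if n & 1:
            y ^= columns[j]
        n >>= 1
        j += 1
    return to_unit_interval(y)

print([digital_coordinate(n, identity_columns) for n in range(8)])
# expected: [0.0, 0.5, 0.25, 0.75, 0.125, 0.625, 0.375, 0.875]  (van der Corput)
```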
3 Generating substreams of digital (t,s)-sequences

In the following we only consider digital sequences over the binary field $\mathbb{F}_2$. In this case we may use the following notation for the definition of the sequence (see Definition 2):
$$y_n^{(i)} := \left( y_1^{(i)}(n), y_2^{(i)}(n), \ldots \right)^T = c_0^{(i)} a_0 \oplus c_1^{(i)} a_1 \oplus c_2^{(i)} a_2 \oplus \cdots,$$
where $\oplus$ denotes the binary exclusive-or operation and $c_j^{(i)}$ the j-th column vector of the matrix $C^{(i)}$, $j \ge 0$. Since there is a bijection between the vector $y_n^{(i)}$ and the i-th coordinate $x_n^{(i)}$ of the n-th point of the sequence, we will frequently use the vector notation without further mention.

Due to a suggestion of Antonov and Saleev [1], one can rewrite the sequence and calculate $y_n^{(i)}$ from
$$y_n^{(i)} = c_0^{(i)} g_0 \oplus c_1^{(i)} g_1 \oplus \cdots = c_{l(n-1)}^{(i)} \oplus y_{n-1}^{(i)},$$
where $g_0, g_1, \ldots$ are the binary digits of the Gray-code representation of n and $l(n)$ is the position of the least-significant zero bit in the binary representation of n. (For details see for example [1, 2]; for $k = 1, 2, \ldots$ the use of the Gray-code representation only shuffles each segment of length $2^k$, but does not affect the property of being a digital (t,s)-sequence.)

Remark: The shuffled sequence is provided by the matrices $D^{(i)} := C^{(i)} \cdot Z$ for $i = 1, \ldots, s$, where Z is a regular $(\infty \times \infty)$-matrix over $\mathbb{F}_2$ with 1 as the elements of the main diagonal and of the diagonal below, and 0 elsewhere.

In our experiments we confine ourselves to the so-called Niederreiter sequences (defined in Niederreiter [11] for arbitrary base b; see also Bratley, Fox, and Niederreiter [2]). These sequences are widely used in many applications of quasi-Monte Carlo methods, such as physics, the technical sciences, financial mathematics, and many more.
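The Gray-code reformulation can be sketched as follows (again only an illustration, not the PowerC implementation used for the experiments): the n-th vector is obtained from the (n-1)-th by a single XOR with the column indexed by the least-significant zero bit of n-1. The helpers repeat the representation from the previous sketch so the snippet runs on its own.

```python
W = 32
identity_columns = [1 << j for j in range(W)]              # columns c_j, as in the previous sketch

def to_unit_interval(y):
    return sum(((y >> k) & 1) * 2.0 ** -(k + 1) for k in range(W))

def least_significant_zero(n):
    """l(n): position of the lowest zero bit in the binary representation of n."""
    l = 0
    while n & 1:
        n >>= 1
        l += 1
    return l

def gray_code_stream(columns, count):
    """Yield y_0, y_1, ... of the Gray-code shuffled sequence via
    y_n = y_(n-1) XOR c_{l(n-1)}  (Antonov and Saleev [1])."""
    y = 0                                                  # y_0, i.e. x_0 = 0
    yield y
    for n in range(1, count):
        y ^= columns[least_significant_zero(n - 1)]
        yield y

print([to_unit_interval(y) for y in gray_code_stream(identity_columns, 8)])
# the same van der Corput points as before, now in Gray-code (shuffled) order
```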
3.1 Blocking of (t,s)-sequences

The starting point $x_n$ is calculated by the usual Gray-code notation ($y_n^{(i)} = c_0^{(i)} g_0 \oplus c_1^{(i)} g_1 \oplus \cdots$); the further points are given by the recursion relation ($y_k^{(i)} = c_{l(k-1)}^{(i)} \oplus y_{k-1}^{(i)}$). Note that if the number of points is of the form $2^m$, $m \in \mathbb{N}$, and the index of the starting point is a multiple of the number of points, say $n 2^m$, then the chosen point set $x_{n2^m}, x_{n2^m+1}, \ldots, x_{(n+1)2^m - 1}$ forms a (t,m,s)-net by definition.
Starting points with rather small indices different from 0 have already been considered, for example, by Fox [6], Bratley et al. [2], and Radovic et al. [14]. For a parallel execution, different blocks (i.e., different starting points) are assigned to different PEs. In order to guarantee non-overlapping blocks, the starting points should be sufficiently far apart. Note that even if it happens by chance that an identical point set is used in parallel and sequential execution, the values of the integration error will be different for almost all sample sizes N (except when the full sample of size N has been used), since the ordering of the integration nodes is thoroughly shuffled. In general, the point sets used in parallel and sequential execution are different.
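A blocking scheme along these lines could be sketched as follows, assuming hypothetical start indices and block lengths: each PE evaluates its starting vector directly from the Gray code of the start index and then advances through its block with the one-XOR-per-point recursion (helpers repeated from the previous sketches so the snippet is self-contained).

```python
W = 32
identity_columns = [1 << j for j in range(W)]    # generator-matrix columns c_j, as before

def least_significant_zero(n):
    l = 0
    while n & 1:
        n >>= 1
        l += 1
    return l

def gray_code_vector(n, columns):
    """y_n computed directly from the Gray code g(n) = n XOR (n >> 1) of the index n."""
    g, y, j = n ^ (n >> 1), 0, 0
    while g:
        if g & 1:
            y ^= columns[j]
        g >>= 1
        j += 1
    return y

def block_stream(start, length, columns):
    """Contiguous block y_start, ..., y_(start+length-1): seed once, then one XOR per point."""
    y = gray_code_vector(start, columns)
    yield y
    for n in range(start + 1, start + length):
        y ^= columns[least_significant_zero(n - 1)]
        yield y

# PE i of p PEs might use start = i * 2**22, so that blocks of up to 2**22 points never overlap.
pe_block = list(block_stream(start=3 * 2**22, length=4, columns=identity_columns))
```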
3.2 Leaping of (t,s)-sequences

Bromley [3] describes a leap-frog parallelization technique to break up the Sobol' sequence into interleaved subsets. In our experiments we only use leaps which are an integral power of 2, say $2^L$. Then, in the terminology of Bromley, we have the new recursion relation for leaped substreams
$$y^{(i)}_{k 2^L + o} = c^{(i,L)}_{l(k 2^L - 1)} \oplus y^{(i)}_{(k-1) 2^L + o}, \qquad o = 0, 1, \ldots, 2^L - 1,\ \ k = 1, 2, \ldots,$$
where
$$c^{(i,L)}_r = c^{(i)}_{l(2^r - 1)} \oplus c^{(i)}_{l(2^r - 2)} \oplus \cdots \oplus c^{(i)}_{l(2^r - 2^L)}, \qquad r \ge L,$$
are the column vectors of the new $(\infty \times \infty)$-matrices $C^{(i,L)}$, $i = 1, \ldots, s$, such that each point of the new sequence is obtained by skipping $2^L - 1$ points of the original sequence (for $r < L$ we can set $c^{(i,L)}_r = 0$ since these vectors are never touched in the recursion relation). There are $2^L$ different streams (the same number as the leap factor), which are uniquely defined by their offset o. The first point of each stream is calculated by the usual Gray-code notation: $y^{(i)}_o = c^{(i)}_0 g_0 \oplus c^{(i)}_1 g_1 \oplus \cdots$, $o = 0, \ldots, 2^L - 1$. For the details of this technique we refer to Bromley [3]. Kocis and Whiten [7] have already mentioned a leaped Sobol' sequence, but without any numerical experiments.

For a parallel execution, we assign each of the $2^L$ different streams (assuming leap factor $2^L$) to a different PE. Note that in this case we use identical point sets in parallel and sequential execution; even the ordering of the points may be completely preserved. A different strategy is to use a leap factor larger than the number of PEs, which results in different point sets being used in parallel and sequential execution (a reason for doing this might be the identification of a specific leap factor leading to high-quality streams, as suggested for Halton and Sobol' sequences [7]).
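The leap-frog construction for leaps $2^L$ can be sketched as follows, following the formulas above (our own illustration, with the identity generator matrix as a stand-in for a real Niederreiter matrix): the leaped columns are precomputed once, each stream is seeded from its offset o via the Gray code, and the result is checked against simply skipping through the original sequence.

```python
W = 32
identity_columns = [1 << j for j in range(W)]     # original columns c_j (identity as a stand-in)

def least_significant_zero(n):
    l = 0
    while n & 1:
        n >>= 1
        l += 1
    return l

def gray_code_vector(n, columns):
    g, y, j = n ^ (n >> 1), 0, 0
    while g:
        if g & 1:
            y ^= columns[j]
        g >>= 1
        j += 1
    return y

def leaped_columns(columns, L):
    """c_r^(i,L) = c_{l(2^r - 1)} XOR c_{l(2^r - 2)} XOR ... XOR c_{l(2^r - 2^L)} for r >= L,
    and 0 for r < L (those columns are never used)."""
    cols = [0] * W
    for r in range(L, W):
        acc = 0
        for m in range(1, 2 ** L + 1):
            acc ^= columns[least_significant_zero(2 ** r - m)]
        cols[r] = acc
    return cols

def leaped_stream(columns, L, o, count):
    """Every 2^L-th vector of the Gray-code sequence, for the stream with offset o in {0, ..., 2^L - 1}."""
    cols_L = leaped_columns(columns, L)
    y = gray_code_vector(o, columns)               # first point of the stream
    yield y
    for k in range(1, count):
        y ^= cols_L[least_significant_zero(k * 2 ** L - 1)]
        yield y

# Consistency check against plain skipping, for leap 2^2 = 4 and offset o = 3:
L, o = 2, 3
direct = [gray_code_vector(o + k * 2 ** L, identity_columns) for k in range(16)]
print(list(leaped_stream(identity_columns, L, o, 16)) == direct)   # expected: True
```

With $2^L$ PEs, PE number o would simply consume the stream with offset o.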
4 Parallel Numerical Integration Experiments

4.1 Experimental Settings
The selection of the test function F is a crucial point in any attempt to rate low-discrepancy point sets by quasi-Monte Carlo integration. Among others, we consider the following test functions:
$$F1(x_1, x_2, \ldots, x_s) = \prod_{i=1}^{s} \left( -2.4 \sqrt{7}\,(x_i - 0.5) + 8 \sqrt{7}\,(x_i - 0.5)^3 \right),$$
$$F2(x_1, x_2, \ldots, x_s) = \prod_{i=1}^{s} \frac{|4 x_i - 2| + i^2}{1 + i^2} - 1,$$
$$F3(x_1, x_2, \ldots, x_s) = \prod_{i=1}^{s} \frac{1}{20 \left| x_i - \frac{1}{2} \right| + 1} - 1,$$
$$F4(x_1, x_2, \ldots, x_s) = \sqrt{\frac{45}{4s}} \left( \sum_{i=1}^{s} x_i^2 - \frac{s}{3} \right).$$

For all these functions, $\int_{I^s} F(x)\,dx = 0$, which permits numerically stable calculation of the value $\frac{1}{N} \sum_{i=0}^{N-1} F(x_i)$ up to sample sizes $N \le 2^{30}$ and dimensions $s \le 300$. All these functions have been used before in the context of rating point sets for quasi-Monte Carlo or Monte Carlo integration (F1 and F4 in [7], F2 in [14], and a variant of F3 in [5]).

The error of an integral at a particular value of N varies in an apparently random manner within the accuracy range as N changes [7]. Therefore, we use the maximal error in the range $(N_{j-1}, N_j]$ as our error criterion ($N_j$ are the exponentially increasing measurement positions depicted in the plots). The maximal sample size is set to $N = 2^{24}$; all figures show results employing Niederreiter sequences. We restrict the splitting parameters to powers of two (e.g., leap 3 denotes a leap factor $2^3$, whereas start 3 denotes a block with start point $2^3$).

In parallel (quasi-)Monte Carlo integration there is a trade-off between communication minimization and integration efficiency. The partial results calculated on the single PEs need to be combined at certain stages in order to perform an error estimation to determine whether the entire computation may terminate or has to continue (to achieve higher accuracy). For example, in quasi-Monte Carlo integration this error estimation may be done using randomized rules (see e.g. [13], [8, pp. 166, 174]). Such a procedure requires communication or access to shared variables; therefore, on the one hand, it should be kept to a minimum. On the other hand, performing the calculations independently for too long may result in wasted computational effort, since the desired accuracy could have been reached employing fewer integration nodes. We denote the number of integration nodes $x_i$ evaluated independently on a single PE before computing the overall integration error as syncstep and fix its value to 500 (which is too small for practical applications but is already sufficient for demonstrating severe defects). We index the j-th set of integration nodes $\{x_{j \cdot \mathrm{syncstep}}, \ldots, x_{(j+1) \cdot \mathrm{syncstep} - 1}\}$ produced on PE$_i$ by the pair (i, j), with $j = 0, 1, 2, \ldots$ and $i = 0, 1, \ldots, \#\mathrm{PEs} - 1$. All computations have been carried out on an SGI POWERChallenge GR, using a data-parallel programming model employing the language PowerC.
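The interplay between syncstep and the combination step can be sketched as follows; this is a plain Python simulation of the data-parallel loop rather than the PowerC program used for the experiments, and the stopping rule, tolerance, and stream objects are placeholders.

```python
def parallel_integrate(substreams, f, syncstep=500, tol=1e-4, max_batches=2000):
    """Each (simulated) PE consumes `syncstep` nodes from its own substream per batch;
    after every batch the partial sums are combined into a global estimate and a
    (placeholder) stopping rule is evaluated.  Substreams are assumed to be infinite
    generators of points in I^s."""
    partial_sums = [0.0] * len(substreams)
    nodes_used = 0
    estimate = float('inf')
    for _ in range(max_batches):
        for i, stream in enumerate(substreams):        # conceptually: one independent loop per PE
            partial_sums[i] += sum(f(next(stream)) for _ in range(syncstep))
        nodes_used += syncstep * len(substreams)
        estimate = sum(partial_sums) / nodes_used      # combination step (communication / shared access)
        if abs(estimate) < tol:                        # placeholder criterion (exact integral is 0 here)
            break
    return estimate, nodes_used
```

In the paper's setting, the substreams would be the blocked or leaped Niederreiter substreams described above, and the stopping decision would be based on a randomized error estimation rule as in [13].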
4.2 Experimental Results

The possible effects of sequential leaping are displayed in Figure 1. We notice a severe worsening of the integration results for all leaps $\neq 0$ for test function F4 (Figure 1.b); a significant quality decrease for F2 is observed only for leaps $\geq 3$ (Figure 1.a). The results for F3 are similar to those for F4, whereas F1 does not exhibit any sensitivity to leaped sequences for leaps $\leq 4$.

Figure 2 reports on results for F4 employing parallel leaping on 2 PEs (therefore, leap 1 should result in the same error as in sequential execution). Synchronized execution means that the partial results of the single PEs are combined only if $|j - k| \le 1$, where j and k are the batch indices of the node sets most recently generated on PE$_0$ and PE$_1$.
Fig. 1. Comparing integration results using differently leaped Niederreiter sequences (log(error) vs. log(samplesize)): (a) F2, (b) F4; leaps 0, 1, 2, 3, 4.
Fig. 2. Parallel integration with leaping on 2 PEs (syncstep = 500): (a) F4, leap 1 (sequential, synchronized parallel, two non-synchronized runs); (b) F4, leaps 2 and 3 compared to sequential execution.
Note that $|j - k| = 0$ is required to obtain the sequential result (this situation may be typical of a synchronized execution on a SIMD machine, but even there it needs to be ensured explicitly). Non-synchronized execution does not put any restriction on $|j - k|$, which means that PEs are allowed to "overtake" each other (which obviously happens and is due to scheduling mechanisms). In the following plots, execution is synchronized if not stated otherwise. We observe that using leap 1, even with synchronized execution (which already decreases computational efficiency), the integration precision decreases slightly. A severe decrease takes place with non-synchronized execution for sample sizes $N > 10^6$ (see Figure 2.a). Parallel leaping with leap 2 decreases the precision slightly, whereas we again observe catastrophic results for leap 3 (see Figure 2.b). Similar results are obtained for F3, whereas for F1 and F2 no problems arise.
The effects of parallel leaping are even more severe when using a higher number of PEs. Figure 3 reports on results for F2 and F4 employing parallel leaping on 16 PEs (which means that leap 4 should result in the same error as in sequential execution). Synchronized execution means in that case that the absolute difference between the batch indices of the node sets most recently generated on any two PEs needs to be at most 1.
Fig. 3. Parallel integration with leaping on 16 PEs: (a) F2, leap 4, syncstep = 500 (sequential, synchronized parallel, two non-synchronized runs); (b) F4, leap 4 with unbalanced load (sequential, 15PE/1PE, 12PE/4PE, 8PE/8PE).
Even in synchronized execution mode we now notice a severe degradation of the results for F2, which is even more pronounced for sample sizes $N > 10^5$ in non-synchronized mode (see Figure 3.a). Very bad results are obtained for leaps 5 and 6 as well; the results for F3 and F4 are almost identical in all cases. Figure 3.b reports on results using leap 4 for F4 if certain PEs show load level 2 instead of 1 (such an effect may be caused by other users or system processes and leads to these PEs producing fewer integration nodes, i.e., half the amount the others do). The effects of this scenario are already dramatic in the case of one PE showing load 2 and even worse for the other cases considered (see Figure 3.b).

Now we switch to the case of blocking. Figure 4.a shows only a hardly noticeable decrease in integration precision for F2 using different start points, whereas Figure 4.b even displays a significant precision gain for F3. For F4 we obtain a slight gain as well, whereas F1 again does not show any sensitivity to different start points. As one might expect, the properties of sequential blocking are propagated to parallel blocking (as is the case for leaping). Employing 2 PEs, start 23 denotes the use of the blocks starting at start points $0 \cdot 2^{23}$ and $1 \cdot 2^{23}$ on the PEs, which results in the sequential integration node set for $N = 2^{24}$, in shuffled ordering. We observe a slight precision gain for F3 in synchronized and non-synchronized parallel execution mode (see Figure 5.a). Blocking for F1, F2, and F4 does not lead to any significant effects on the integration precision with 2 PEs.
Fig. 4. Comparing integration results using blocks with different start points: (a) F2, (b) F3; start 0, 24, 26, 28.
Fig. 5. Parallel integration with blocking on 2 and 16 PEs: (a) F3, start 23, syncstep = 500, 2 PEs (sequential, synchronized parallel, not synchronized); (b) F2, start 20, unbalanced load, 16 PEs (sequential, 15PE/1PE, 12PE/4PE, 8PE/8PE).
The worst result obtained when employing parallel blocking is shown in Figure 5.b. The unbalanced-load scenario described above for leaping leads to a moderate reduction of integration accuracy for F2. Similar results are obtained for F4, whereas F3 again shows a moderate improvement and the results for F1 are not affected at all.
5 Conclusion

In this work we investigate possible effects of blocking and leaping (t,s)-sequences for parallel quasi-Monte Carlo integration. The properties of blocked and leaped substreams of these sequences are propagated to their use in parallel environments. Therefore, leaping may lead to dramatic degradation of the integration results, whereas blocking turns out to be very robust. However, leaping offers the only possibility of obtaining a result identical to that of sequential execution. To achieve this, a significant amount of care has to be invested into proper synchronization of the parallel execution.
Acknowledgements

The first author was partially supported by the FWF (Austrian Science Fund) project P12441-MAT. We thank Andreas Pommer for his help in identifying and resolving a major software problem on the POWERChallenge.
References

1. I. A. Antonov and V. M. Saleev. An economic method of computing LPτ-sequences. U.S.S.R. Comput. Maths. Math. Phys., 19:256–259, 1979.
2. P. Bratley, B. L. Fox, and H. Niederreiter. Algorithm 738: Programs to generate Niederreiter's low-discrepancy sequences. ACM Trans. Math. Software, 20:494–495, 1994.
3. B. C. Bromley. Quasirandom number generators for parallel Monte Carlo algorithms. J. Parallel and Distributed Computing, 38:101–104, 1996.
4. A. DeMatteis and S. Pagnutti. Parallelization of random number generators and long-range correlations. Numer. Math., 53:595–608, 1988.
5. K. Entacher, A. Uhl, and S. Wegenkittl. Linear congruential generators for parallel Monte Carlo: the Leap-Frog case. Monte Carlo Methods and Appl., 4:1–16, 1998.
6. B. L. Fox. Algorithm 647: Implementation and relative efficiency of quasirandom sequence generators. ACM Trans. Math. Software, 12:362–376, 1986.
7. L. Kocis and W. J. Whiten. Computational investigations of low-discrepancy sequences. ACM Trans. Math. Software, 23:266–294, 1997.
8. A. R. Krommer and C. W. Überhuber. Numerical Integration on Advanced Computer Systems, volume 848 of Lecture Notes in Computer Science. Springer, Berlin, 1994.
9. G. Larcher, H. Niederreiter, and W. Ch. Schmid. Digital nets and sequences constructed over finite rings and their application to quasi-Monte Carlo integration. Monatsh. Math., 121:231–253, 1996.
10. H. Niederreiter. Point sets and sequences with small discrepancy. Monatsh. Math., 104:273–337, 1987.
11. H. Niederreiter. Low-discrepancy and low-dispersion sequences. J. Number Theory, 30:51–70, 1988.
12. H. Niederreiter. Random Number Generation and Quasi-Monte Carlo Methods. Number 63 in CBMS–NSF Series in Applied Mathematics. SIAM, Philadelphia, 1992.
13. A. B. Owen. Randomly permuted (t,m,s)-nets and (t,s)-sequences. In H. Niederreiter and P. J.-S. Shiue, editors, Monte Carlo and Quasi-Monte Carlo Methods in Scientific Computing, volume 106 of Lecture Notes in Statistics, pages 299–317. Springer, New York, 1995.
14. I. Radovic, I. M. Sobol', and R. F. Tichy. Quasi-Monte Carlo methods for numerical integration: Comparison of different low discrepancy sequences. Monte Carlo Methods and Appl., 2:1–14, 1996.