Evaluation of the Intel Clovertown Quad Core Processor

Aad J. van der Steen, High Performance Computing Group, Utrecht University, P.O. Box 80195, 3508 TD Utrecht, The Netherlands, [email protected], www.phys.uu.nl/~steen

Technical Report HPCG-2007-02

NCF/Utrecht University

April 2007

Abstract We evaluated the Intel Clovertown quad core processor by means of the EuroBen Benchmark. The single-core performance was assessed as well as the shared-memory parallel performance with OpenMP. In addition, the distributed-memory performance was measured using MPI and some linear algebra tests and an FFT test were performed using Intel’s MKL library. The single-core performance turns out to be generally very good. However, running multiple processes all accessing the memory appears to cause problems that for now are attributed to memory contention.

Contents

1 Introduction
2 The System
  2.1 Hardware
  2.2 Software
3 The EuroBen Benchmark
4 Benchmark results
  4.1 Single CPU results
    4.1.1 Module 1
    4.1.2 Module 2
  4.2 OpenMP results
    4.2.1 Module 1
    4.2.2 Module 2
  4.3 MPI results
    4.3.1 Module 1
    4.3.2 Module 2
5 Closing remarks and summary
Acknowledgements
References

1 Introduction

From January 2007 on, a system containing two Clovertown quad-core processors was made available to us for evaluation. During that month we ran the single-CPU, OpenMP, and MPI versions of the EuroBen Benchmark [5] on it and, in addition, we used Intel's MKL library to assess the performance of a matrix-vector multiplication, the solution of a dense linear system, and the solution of a dense symmetric eigenvalue problem. The MKL library was also used to compare its FFT performance with that of program mod2f of the EuroBen Benchmark. All algorithms were run for a range of sizes that is considered to occur in practical situations.
The programs in the benchmark are generally already written in a fairly optimised way. Therefore no extensive optimisation attempts with respect to the source codes were made. However, the compiler flags used generally gave rise to very well optimised code, as will become clear from the results presented here.
This paper is organised as follows: in the next section we give the particulars of the software and the hardware used in the test, then we briefly describe the EuroBen Benchmark and its OpenMP and MPI variants. Next we present results for each of these variants, with additionally some selected linear algebra and FFT results obtained with the MKL library. We conclude with some closing remarks on the evaluated system.

2 The System

Here we give a summary of the relevant hardware and software characteristics of the system used in the tests.

2.1 Hardware

The most important hardware features were:
• 2 quad-core Clovertown processors
• Clock frequency 2.66 GHz
• Front Side Bus 1333 MHz
• L1 cache: 32 kB data, 32 kB instruction
• L2 cache: 4 MB per 2 cores
• 16 GB of shared memory
• 95 GB of available user disk space

Only a very limited part of the available disk space was needed for the benchmarks that were performed.

2.2 Software

The following software was used:
• The Intel ifort compiler, version 9.1.040
• The Intel MKL library, version 9.1 beta
• Argonne National Laboratory's MPICH2, version 1.0.5, configured with --with-device=ch3:ssm

The following compiler flags were used: -fast -ip -tune pn4 -arch pn4 -axP -tpp7. These should give us the additional performance of the SSE units where possible, as well as interprocedural optimisation.

3 The EuroBen Benchmark

The EuroBen Benchmark [5] has existed since about 1990 and has evolved ever since. It consists entirely of Fortran 90 programs and initially only a single-CPU version was available. Presently there are also OpenMP and MPI versions that contain the same programs as the single-CPU version as far as sensible; in addition, separate programs that measure interprocessor communication are present in the MPI version. The versions used in this evaluation are version 5.0 for the single-CPU benchmark, version 2.0 for the shared-memory (OpenMP) benchmark, and also version 2.0 for the distributed-memory (MPI) benchmark.
Rather than yielding a single result that should represent all of the performance qualities of the processor, the EuroBen Benchmarks attempt to expose the particular strengths and weaknesses of a given processor-compiler combination. This approach is based on the idea of a performance profile instead of a single figure of merit characterising the performance of a processor. The same philosophy can be found, for instance, in Jack Dongarra's HPC Challenge Benchmark (HPCC) [3]. A difference with the HPCC Benchmark is that all EuroBen programs directly measure either a system parameter, e.g., the point-to-point bandwidth of interprocessor communication, or the performance of an operation or algorithm that occurs in practice in technical/scientific computations. Furthermore, there is a hierarchy in the complexity of the modules the benchmark sets consist of: module 1 always measures simple parameters or operations, while module 2 concentrates on important basic algorithms. As these algorithms in turn contain the same operations present in module 1, but in a different context, this enables a better understanding of the results obtained for these algorithms. This seeming redundancy also serves to expose possible performance anomalies that may occur due to unexpected behaviour of the compiler, the processor, the communication network, or any combination of them, thus increasing the insight into the practical capabilities of the processor or system.
We proceed by shortly characterising the test programs in the benchmark sets. The sets, sequential, OpenMP, and MPI, contain as far as sensible the same programs. The sequential set EuroBen V5.0 contains:

mod1a – A test for the speed of basic operations from cache (if present).
mod1b – A test for the speed of basic operations from main memory.
mod1d – A test for the effects of memory bank conflicts.
mod1e – A test for the accuracy of intrinsic mathematical functions.
mod1f – A test for the speed of intrinsic mathematical functions.
mod2a – A test for the speed of dense matrix-vector multiplication.
mod2as – A test for the speed of sparse matrix-vector multiplication with the matrix in CRS format.
mod2b – A test for the speed of solving a dense linear system Ax = b.
mod2ci – A test for the speed of solving a sparse non-symmetric linear system Ax = b of Finite Element type. The matrix is in CRS format.
mod2cr – A test for the speed of solving a sparse symmetric linear system Ax = b stemming from a 3-D finite difference problem.
mod2d – A test for the speed of finding eigenvalues of a dense matrix A (Real, symmetric).
mod2f – A test for the speed of a 1-D Fast Fourier Transform.
mod2g – A test for the speed of a 2-D Haar Wavelet Transform.
mod2h – A test for the speed of a random number generator.
mod2i – A test for the speed of sorting Integers and 64-bit Reals.


The OpenMP version 2.0 contains almost the same programs, except mod1e, which tests the accuracy of intrinsic mathematical functions, as this has no additional value over the sequential version. The MPI version 2.0 lacks, for the same reason, the programs mod1a–mod1f, but instead contains the program mod1h, which measures the bandwidth of some important communication patterns, like nearest-neighbour transfer in 1-D, 2-D, and 3-D topologies, matrix transposition, and a number of collective communication primitives. Program mod1i tests different implementations of a distributed dotproduct, while mod1j does a very precise two-sided point-to-point communication measurement to find the bandwidth and short-message latency for this kind of exchange. Program mod1k does the same, but in this case for one-sided communication (MPI_Put and MPI_Get). The MPI version of the benchmark contains all the module 2 algorithms except mod2d, the dense eigenvalue problem; an efficient implementation for this is still in the making.
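All of these programs report their results as rates (Mflop/s, Mop/s, or GB/s for the communication tests) over a range of problem sizes. As an illustration of what such a module 1 style measurement involves, the sketch below times a simple axpy loop and converts the elapsed time into Mflop/s. It is a hypothetical stand-in for the real harness: the problem size, repeat count, and all names are arbitrary and not taken from the EuroBen sources.

    program kernel_timing
       ! Hedged sketch of a module 1 style measurement: time a simple vector
       ! operation and report Mflop/s. Sizes and names are illustrative only.
       implicit none
       integer, parameter :: n = 10000, nrep = 10000
       real(8) :: x(n), y(n), alpha, t, mflops
       integer :: count0, count1, rate, i, k

       call random_number(x); call random_number(y); alpha = 0.5d0

       call system_clock(count0, rate)
       do k = 1, nrep                    ! repeat to get a measurable interval
          do i = 1, n
             y(i) = y(i) + alpha*x(i)    ! axpy: 2 flops per element
          end do
          if (y(1) > 1.0d30) exit        ! never true; defeats dead-code removal
       end do
       call system_clock(count1)

       t = real(count1 - count0, 8) / real(rate, 8)
       mflops = 2.0d0*real(n,8)*real(nrep,8) / (t*1.0d6)
       print '(a,f10.1,a)', 'axpy: ', mflops, ' Mflop/s'
    end program kernel_timing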

4 Benchmark results

We subsequently discuss the results found for the single-CPU version, the OpenMP version, and the MPI version. In addition, we have some results obtained for the linear algebra programs with Intel's MKL library, using the library versions of the relevant LAPACK routines rather than the compiled Fortran versions. We also compared the MKL FFT to the one implemented in program mod2f. To give the performances some additional perspective we sometimes compare the results with those from a dual-core Woodcrest processor system with the same clock frequency and a front-side bus with the same bandwidth (1333 MHz), but a slightly older compiler version.

4.1 Single CPU results

A more correct term would be single-core results, as we only look at the performance of a single core. We subsequently look at (some of) the results of the programs in module 1 and module 2.

4.1.1 Module 1

For many of the basic operations in program mod1a the speed is measured with operands that are consecutive in memory, but also with strides of 3 and 4 and with indirectly addressed operands. As one would expect, the latter three variants show a degradation in performance, as can be seen from the three examples in Figure 4.1, where we show the maximum achieved performance rmax. It is clear from Figure 4.1 that in all cases there is a significant degradation. This is due to several factors: a less effective use of the cache lines and, for a stride of 4, possibly cache memory bank conflicts. The lower performance for the indirectly addressed operations is due to both the extra address evaluation that is required and the fact that in most cases the operands are not in cache and have to be fetched from memory.

Figure 4.1: Performance degradation for strides ≠ 1 (strides 3 and 4, and indirect addressing) in three basic operations: multiplication, the axpy operation x = x + αy, and a general vector update.

Note that for a stride of 1 the axpy operation x = x + αy is much faster than the slightly different general vector update operation x1 = x2 + αy. This is because the first operation can be offloaded to the SSE unit, resulting in an 1800 Mflop/s speed difference. For strides ≠ 1 the SSE unit is not employed, which explains the sharp performance decrease for the axpy operation. For the simple dyadic operations '+', '−', and '×', a performance drop was noticeable at an array length of N = 800 due to cache bank conflicts: where the speed of these operations is around 1720 Mflop/s for N = 700, 801, and 900, for N = 800 it is roughly 920 Mflop/s, almost half of the normal speed. This matches more or less the speed for longer vector lengths that no longer fit in the L1 cache. Although the effect is not very serious, one should be aware of it.
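The access patterns behind these measurements have the following general form. This is an illustrative sketch with assumed array and variable names, not the EuroBen code itself; idx is assumed to hold a permutation of 1..n.

    subroutine stride_kernels(n, inc, alpha, x, y, idx)
       ! Illustrative forms of the mod1a-style access patterns discussed above
       ! (a sketch, not the actual benchmark code).
       implicit none
       integer, intent(in)    :: n, inc, idx(n)   ! idx: a permutation of 1..n
       real(8), intent(in)    :: alpha, y(n)
       real(8), intent(inout) :: x(n)
       integer :: i

       ! Contiguous axpy, x = x + alpha*y (stride 1, can be handed to the SSE unit):
       do i = 1, n
          x(i) = x(i) + alpha*y(i)
       end do

       ! The same operation with a non-unit stride (e.g. inc = 3 or 4):
       do i = 1, n, inc
          x(i) = x(i) + alpha*y(i)
       end do

       ! Indirectly addressed variant: an extra index lookup per element and
       ! little chance that the operands are already in cache:
       do i = 1, n
          x(idx(i)) = x(idx(i)) + alpha*y(idx(i))
       end do
    end subroutine stride_kernels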


Program mod1b measures the same operations as program mod1a, but from memory instead of from the cache. It shows that the differences with respect to the Woodcrest processor are not always in favour of the Clovertown processor. This is evident from Figure 4.2: in most cases the Woodcrest processor is actually faster, sometimes significantly so, e.g., in a 2-D rotation operation and the evaluation of a 9th degree polynomial. As both processors have a 1333 MHz memory bus and the micro-architectures of the processor cores are nearly the same, we have no ready explanation for this phenomenon. By contrast, the Clovertown core performs better for the dotproduct and the axpy operation, which may be attributed to improvements in the newest compiler version and slight alterations in the micro-architecture. It is not entirely clear why the 9th degree polynomial is not able to attain the full speed of 5.33 Gflop/s.


Figure 4.2: Performance of some basic operations from memory (mod1b) for the Clovertown and Woodcrest processors.

The computational intensity of this operation is f = 9, i.e., only two memory operations are required per 18 floating-point operations. Still, only 0.9517 flop/cycle is attained on the Clovertown and 1.2110 flop/cycle on the Woodcrest. This contrasts strongly with the same operation from cache in mod1a, where 1.9327 and 1.9333 flop/cycle, respectively, are attained, almost the maximum performance. It is clear that a memory bandwidth limitation causes the bottleneck. A similar effect can be seen with the multiplication and division operations: both have a speed of about 830 Mflop/s when accessed from the L2 cache (from the L1 cache the multiplication is about twice as fast, almost the theoretical maximum), but from memory, as measured in program mod1b, the speed for both operations is about 108 Mflop/s.
We looked in somewhat more detail into these bandwidth problems with the cachebench program from Philip Mucci [1, 4] and the experimental program mod1c that will become part of a future EuroBen release. The results of the cachebench experiment are given in Figure 4.3. The speed for the write operation agrees well with that for program mod1c when the SSE units are switched off.

Figure 4.3: Results for various memory operations (read, write, read-modify-write, memset, memcpy) as measured by cachebench.

One variant of program mod1c does a polynomial evaluation of the type α = Pi(x) with i = 1, ..., 9, α a scalar, and the vector x ranging in length from 200 to 40,000. For polynomials of higher order register spill occurs, and therefore the speed and the associated bandwidth decrease, as shown in Figure 4.4. For this range of vector lengths the bandwidth for polynomial orders 1–9 is constant, but the computational intensity increases linearly and so does the performance, indicating that up to order 9 the bandwidth is the limiting factor. When the code is allowed to vectorise the situation becomes different: for the operation α = Pi(x) the bandwidth needed is 24 B/cycle, and in this case from a polynomial order of 5 on the bandwidth is sufficient to support the operation fully and we obtain almost the maximum speed of 16 Gflop/s. Unfortunately this is not the case for the second variant of mod1c, which implements the much more realistic operation y = Pi(x), i = 1, ..., 9, where both x and y are vectors in the range 200–40,000. For some reason as yet not identified the bandwidth is much lower, about 1.98 GB/s, matching the speed of the read operation in cachebench. Now only 0.7433 B/cycle are available, which means that even for a 9th order polynomial not enough data can be provided to attain full speed. Correspondingly, we find a peak speed of about 2.2 Gflop/s where, without vectorisation, this should be 5.333 Gflop/s.
Program mod1e does a rigorous test of the main mathematical intrinsic functions with respect to their accuracy, not their speed. This test was included because until only a few years ago the implementation of these functions turned out not always to be reliable for all parameter ranges of interest. In that respect the ifort compiler passes the test for all intrinsics and all parameter ranges. Only the log function in the interval (0.0, 2.0) shows a Root Mean Square accuracy loss of 1.59 decimal digits and a maximal accuracy loss of 2.22 decimal digits at log(0.000896717), which is acceptable. For the other functions the accuracy losses were too small to notice.
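Returning to the two mod1c polynomial variants discussed above, the sketch below evaluates a polynomial of order m with Horner's rule, once with a scalar result and once with a vector result. Treating the scalar variant as a reduction over the vector is an assumption made for illustration, as are all names; the actual mod1c code may differ in detail.

    subroutine poly_kernels(n, m, c, x, y, alpha)
       ! Sketch of the two mod1c-style polynomial evaluations discussed above,
       ! using Horner's rule. Names and the reduction form of the scalar
       ! variant are assumptions, not the EuroBen source.
       implicit none
       integer, intent(in)  :: n, m          ! vector length, polynomial order
       real(8), intent(in)  :: c(0:m), x(n)  ! coefficients and argument vector
       real(8), intent(out) :: y(n), alpha
       real(8) :: p
       integer :: i, k

       ! Variant 1: scalar result; only x has to be streamed in from memory.
       alpha = 0.0d0
       do i = 1, n
          p = c(m)
          do k = m-1, 0, -1
             p = p*x(i) + c(k)               ! 2 flops per coefficient
          end do
          alpha = alpha + p
       end do

       ! Variant 2: vector result y(i) = P_m(x(i)); x is read and y is written,
       ! which adds a write stream (and typically a write-allocate read) per element.
       do i = 1, n
          p = c(m)
          do k = m-1, 0, -1
             p = p*x(i) + c(k)
          end do
          y(i) = p
       end do
    end subroutine poly_kernels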

4.1.2 Module 2

Program mod2a measures the speed of a matrix-vector multiplication implemented with dotproducts and with axpy operations. The first method requires less bandwidth (computational intensity f = 1), but it suffers from strided memory access that can only partly be improved by blocking the code. The computational intensity of the axpy operation is only f = 2/3, but with suitable unrolling and blocking the speed can be improved. Figure 4.5(a) shows the speed of the two algorithms. We also tested this basic BLAS algorithm with the implementation present in the MKL 9.1 beta library, which is an axpy implementation. The result of this test is also shown in Figure 4.5(a). When still in the L1 cache the MKL version does clearly better than the Fortran version. However, when accessed from the L2 cache there is no difference anymore. The dotproduct version suffers quite a lot from the large strides in accessing the matrix elements and is therefore much slower than the other two versions.
In practical applications the sparse matrix-vector multiplication has actually become more important than the dense variant due to the increased use of iterative linear solvers. This algorithm is implemented in program mod2as. Not surprisingly, the performance of this algorithm is lower than that of the dense variant. The storage of the matrix is in CRS format, which necessitates one lookup of an index and indirect access of the appropriate vector elements. So, apart from the fact that most matrix elements will not be present in the cache to start with, two extra memory operations are required per flop.
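The CRS kernel just described has the following general shape (an illustrative sketch with assumed array names, not the benchmark source):

    subroutine crs_matvec(n, ia, ja, a, x, y)
       ! Sketch of a sparse matrix-vector product y = A*x with A in CRS format,
       ! as used in mod2as (illustrative; names are not taken from EuroBen).
       implicit none
       integer, intent(in)  :: n            ! number of rows
       integer, intent(in)  :: ia(n+1)      ! row pointers into a/ja
       integer, intent(in)  :: ja(*)        ! column indices of the non-zeros
       real(8), intent(in)  :: a(*), x(*)   ! non-zero values and input vector
       real(8), intent(out) :: y(n)
       integer :: i, k
       real(8) :: s

       do i = 1, n
          s = 0.0d0
          do k = ia(i), ia(i+1)-1
             ! one index load (ja) and one indirect load (x) per multiply-add
             s = s + a(k)*x(ja(k))
          end do
          y(i) = s
       end do
    end subroutine crs_matvec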


Figure 4.4: Results for the mod1c polynomial evaluations: bandwidth (GB/s) vs. polynomial order, with and without SSE, for the variants y = P(x) and α = P(x).


Figure 4.5: Speed of the two matrix-vector multiplication implementations for a dense matrix A as found with program mod2a (a) and of a sparse matrix-vector multiplication as measured in program mod2as (b).

Program mod2b measures the performance of solving a dense linear system. In mod2b the Fortran versions of the usual LAPACK routines are used. Together with this pure Fortran version we also ran a test in which the MKL versions of the LAPACK routines were used. The result is shown in Figure 4.6. It is evident that a very good job is done in the LAPACK implementation within MKL, as the Fortran implementation has a much lower performance level, for larger matrix orders even slower by a factor of about 4. Furthermore, it is remarkable that there is still a noticeable performance difference between the 9.0 and 9.1β versions of MKL, so there was still room for improvement.
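The kind of call sequence being timed here is, in essence, a standard LAPACK solve; whether mod2b calls DGESV or the DGETRF/DGETRS pair directly is not stated, so the sketch below, which must be linked against either the compiled Fortran LAPACK sources or MKL, is only an illustration.

    program dense_solve
       ! Sketch of a LAPACK-style dense solve of the kind timed in mod2b.
       ! Matrix order and right-hand side are illustrative; linking against MKL
       ! instead of compiled Fortran LAPACK is what produces the difference
       ! shown in Figure 4.6.
       implicit none
       integer, parameter :: n = 1000, nrhs = 1
       real(8), allocatable :: a(:,:), b(:,:)
       integer :: ipiv(n), info

       allocate(a(n,n), b(n,nrhs))
       call random_number(a)
       call random_number(b)

       ! LU factorisation with partial pivoting followed by the triangular solves
       call dgesv(n, nrhs, a, n, ipiv, b, n, info)
       if (info /= 0) print *, 'dgesv failed, info =', info
    end program dense_solve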

Figure 4.6: Speed of solution of a dense linear system (mod2b) for the Fortran (ifort 9.1.040), MKL 9.0, and MKL 9.1β implementations.

Although a high speed for dense linear algebra algorithms is certainly desirable, the importance of sparse algorithms, and in particular sparse iterative solvers, is definitely higher for large-scale physical simulations. Two variants are present in the EuroBen set: mod2ci tests the speed of two types of iterative solvers, RGMRES and TFQMR, for non-symmetric irregularly sparse linear systems as originate from Finite Element type problems. Program mod2cr considers a symmetric heptadiagonal system stemming from a 3-D Finite Difference problem. In this case a CG solver with two types of preconditioners, ILU(0) and polynomial, is used. In both programs a fixed number of iterations is done to assess the speed, as we are more interested in the speed per iteration than in the final time to solution. Figures 4.7(a) and 4.7(b) show the performance for the irregular and regular type systems, respectively.

Figure 4.7: Speed of solution of a sparse linear system of Finite Element type with program mod2ci (a) and of a symmetric system of 3-D Finite Difference type with program mod2cr (b).

As can be seen from Figure 4.7(a), the TFQMR algorithm has a somewhat higher Mflop-rate than RGMRES, at least when part of the matrix and right-hand side still fit in the L2 cache, and also the convergence is a little better than that of RGMRES for the type of problem we consider. However, the amount of flops in TFQMR is about two times higher than that of RGMRES. Therefore, when convergence is not problematic one can expect the RGMRES variant to be almost two times faster. From Figure 4.7(b) it is evident that the polynomial preconditioner has a clear advantage as long as the majority of the diagonal elements still fit in the L2 cache. For a system of order N = 27,000 (from a 30×30×30 grid) this is still the case, but for a 40×40×40-size problem there are evidently cache problems. The L2 cache size is 4 MB per 2 cores, and the behaviour as a function of the problem size suggests that a core is bound to a 2 MB portion of this cache. Because the polynomial preconditioner in its sparse matrix-vector multiplication constantly needs vector elements that are located outside the L1 cache for the larger problems, the performance impact is more severe than for the ILU(0) preconditioner, which results in a lower performance for matrix orders N > 20,000. Note that the speed of both the irregular and the regular type of system solver converges to the speed of the sparse matrix-vector multiplication for large N, which is about 420 Mflop/s. This is understandable because this operation eventually dominates the total time spent in the solvers when the order N increases.
Program mod2d tests the performance of a symmetric dense eigenvalue problem. We use a straight LAPACK implementation, so we could compare the Fortran implementation with the two versions available from MKL 9.0 and MKL 9.1β. The result in Figure 4.8 shows a significantly higher performance of the MKL version, as one would expect. However, there is virtually no performance difference between the two MKL


versions. From an order of N = 750 on, the performance deteriorates for all three implementations because a significant amount of work cannot be held in the cache, in contrast to the case of the solution of a dense linear system. Still, a considerable speed can be achieved even when not all data fit into the cache, thanks to the high computational intensity of the majority of the operations.

Figure 4.8: Speed of solution of a symmetric dense eigenvalue problem (mod2d) for the Fortran, MKL 9.0, and MKL 9.1β implementations.

In program mod2f the speed of a 1-D complex-to-complex FFT is measured. The implementation present in the EuroBen Benchmark uses a radix-4 algorithm with special care to avoid short loop lengths during the FFT passes. As long as the data still fit in a 2 MB part of the L2 cache the speed can be maintained at about 2 Gflop/s. For N = 2^17 this is not the case anymore, because in addition to the data vector proper also a working area of the same size must be allocated. The sawtooth pattern in the performance that shows in Figure 4.9 stems from a final copy from this working area to the result vector that is necessary for vector lengths that are an odd power of 2. In addition to the Fortran implementation we also tested the FFT present in MKL 9.1β.


Figure 4.9: Speed of a 1-D complex-to-complex FFT (mod2f) for the Fortran radix-4 implementation and MKL 9.1β.

For vector lengths 512–2048 the performance difference is quite large, almost a factor of 2, and also for the larger vector lengths, N > 65,536, MKL does significantly better. For the intermediate lengths MKL is faster as well, but less markedly so than in the other ranges.
Program mod2g implements a 2-D Haar Wavelet Transform. Like most FFT algorithms this is a power-of-2 reduction algorithm. However, the computational intensity of the algorithm is much lower: f = 1 for the analysis part and only f = 1/3 for the synthesis part. This means that the speed is memory bound and


cannot be expected to be a significant fraction of the theoretical peak performance. In addition, a matrix transposition is required that adds O(N × M) data references, where N and M are the matrix dimensions. Figure 4.10 shows the large impact of the data access, which is hardly mitigated by the floating-point operations in the transform. Due to the fact that there is some data re-use, both in the analysis phase and in the synthesis phase, up to a data size of 8192 elements a speed of over 600 Mflop/s can be attained, as most data still reside in the L1 cache. The speed decreases steadily when more data must be fetched from the L2 cache. From N = 131,072 on, data will increasingly have to be accessed from memory, resulting eventually in a speed of 79 Mflop/s for N = 1,048,576, when the data size is over 16 MB.

Figure 4.10: Speed of a 2-D Haar Wavelet Transform (mod2g).

In program mod2h a portable uniform random number generator is tested on the interval (0,1]. The speed is reported in Mega operations/s, not flops, as the algorithm needs a mixture of integer and floating-point operations to produce the random numbers. Nine operations per random number are required. The generation of the sequence is recursive, so it cannot be pipelined and the advantage from data reuse in the caches is minimal. One can therefore expect that the performance will be a (very) weak function of the problem size. Figure 4.11 shows that there is indeed some variation in the performance, but that this variation is not very large.


Figure 4.11: Speed of a uniform random number generator (mod2h) on the interval (0,1].

For the smaller sequence lengths the non-negligible overhead of the initialisation is still a factor. For N ≥ 3,000,000 all cache influences are absent and the speed settles at about 680 Mop/s.
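The exact generator used in mod2h is not reproduced here; as an illustration of the kind of recursive, mixed integer/floating-point recipe involved, the sketch below implements the classic portable Park-Miller generator with Schrage's factorisation, which has a comparable operation count per number.

    module rng_mod
       ! A classic portable, recursive uniform generator (Park-Miller "minimal
       ! standard" with Schrage's factorisation). This illustrates the kind of
       ! generator measured by mod2h, not the mod2h code itself; being
       ! recursive, it cannot be pipelined across calls.
       implicit none
       integer, parameter :: ia = 16807, im = 2147483647, iq = 127773, ir = 2836
       real(8), parameter :: am = 1.0d0/im
    contains
       function ran_uniform(seed) result(r)
          integer, intent(inout) :: seed   ! state, must be in 1..im-1
          real(8) :: r
          integer :: k
          k = seed/iq
          seed = ia*(seed - k*iq) - ir*k   ! Schrage's trick avoids 32-bit overflow
          if (seed < 0) seed = seed + im
          r = real(seed,8)*am              ! uniform value in (0,1)
       end function ran_uniform
    end module rng_mod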


For program mod2i, which sorts sequences of Integers and 8-byte Reals, the situation is very similar to that found in random number generation, be it that the complexity of the algorithm is O(N log N) instead of linear. So, one again expects a flat performance profile in terms of operations/second, while the execution time grows logarithmically. Figure 4.12 shows the graph for the sorting speed. Integers as well as 8-byte Reals are sorted because they have different bandwidth requirements and different amounts of storage in the memory hierarchy. It can be seen in the figure that the performance difference for these two types of variables is modest: as expected, the sorting of the Reals shows a lower performance, but the difference is less than 20%, indicating that the sorting procedure itself dominates over the associated memory access.

Figure 4.12: Speed of sorting Integer and 8-byte Real numbers with program mod2i.

For the single-CPU version of the benchmark there is a program mod3a that measures I/O performance via a very large out-of-core sparse matrix-vector multiplication. We will, however, not discuss the results of this program, as in this setting the results are too dependent on the particular I/O configuration of the test platform and therefore do not provide useful information on the I/O performance.

4.2 OpenMP results

In this section we discuss the outcomes of the OpenMP version of the benchmark. As already stated in Section 3, the content of the OpenMP version is largely identical to that of the single-CPU version. The program measuring the accuracy of the intrinsic functions, mod1e, has been omitted as it does not give new information. We have tested the programs with up to 8 threads, as we had a dual-CPU quad-core system at our disposal. The executables were produced by simply adding the -openmp flag to the compiler flags mentioned in Section 2.2 and setting the number of threads with the environment variable OMP_NUM_THREADS to the appropriate value.

4.2.1 Module 1

From program mod1a, which measures the speed of simple operations, we show in Figure 4.13 four operations with non-decreasing floating-point operation content. Because of the parallelisation overhead associated with the OpenMP code we cannot expect the operations with a low floating-point content to scale very well. This is confirmed by the results in Figure 4.13. It is clear from the figure that, except for the 9th degree polynomial evaluation, which contains 18 flops, parallelisation of these single operations in a loop is not a good idea. For the dotproduct it is particularly problematic because of the reduction operation that needs synchronisation to obtain the global sum. The speed with 8 threads is therefore lower than with four. Still, the scaling with this compiler version and with the Clovertown processor has significantly improved over earlier compiler/processor combinations: except for the dotproduct, the speed increases with an increasing number of threads, where in the past we often found a decrease in speed when more threads were used.
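The dotproduct pattern that causes this synchronisation is, in essence, the following (a minimal sketch, not the EuroBen code):

    function omp_dot(n, x, y) result(s)
       ! OpenMP dotproduct: the partial sums of the threads must be combined in
       ! a reduction, and it is this implicit synchronisation that limits the
       ! scaling for such a cheap loop body.
       implicit none
       integer, intent(in) :: n
       real(8), intent(in) :: x(n), y(n)
       real(8) :: s
       integer :: i

       s = 0.0d0
    !$omp parallel do reduction(+:s)
       do i = 1, n
          s = s + x(i)*y(i)   ! only 2 flops per element
       end do
    !$omp end parallel do
    end function omp_dot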


Figure 4.13: Scaling of four operations in program mod1a (multiplication, dotproduct, x = x + αy, and a 9th degree polynomial) compared with the sequential speed of the same operations.

The initial penalty incurred for parallelisation is much less severe in program mod1b, where all data reside in memory. Because of the much longer times needed to store and fetch the data, the effect is hardly discernible, as is evident from Figure 4.14: the performance of the single OpenMP thread and the sequential version is virtually the same for all operations except the 9th degree polynomial, where the time for the floating-point operations in the loop is relatively much larger.


Figure 4.14: Scaling of four operations in program mod1b compared with the sequential speed of the same operations, all addressed from main memory.

Note that the performances for four and eight threads are the same: all threads have to compete for the memory at the same time and the memory bandwidth is not large enough to satisfy all the requests. Even for the 9th degree polynomial evaluation, a computational intensity of f = 9 is not enough to offset the effects of the large bandwidth requirements when all 8 cores need access to the memory: the speed of this operation is 4349 Mflop/s for 4 threads and 4377 Mflop/s for 8 threads.
In contrast to the simple operations in program mod1a, for the evaluation of intrinsic functions on vectors of sufficient length (N ≥ 6000, approximately) the computational content is sufficient to get a speedup when more OpenMP threads are used, as is shown in Figure 4.15. With 8 threads a speedup of a factor of 3–4 can be obtained.


Figure 4.15: Scaling of 4 intrinsic functions in program mod1f for a vector length of N = 6000.

4.2.2 Module 2

As in the single-core version, the OpenMP variant of program mod2a measures the performance of two implementations of a dense matrix-vector multiplication. For this algorithm there is always (some) benefit in using more threads, except for the small matrix orders N = 100 and 200 with the axpy implementation. In that case the small amount of work that is done per core cannot offset the parallelisation overhead. When the matrix order is so large that the active data no longer fit in the L1 cache, the performance drops to about a fifth of the maximum observed performance.


Figure 4.16: Speed of the two matrix-vector multiplication implementations, (a) dotproduct and (b) axpy, for a dense matrix A with program mod2a.

This happens earlier and more severely with the dotproduct implementation because of the strides occurring in the access of the matrix elements to be processed. The MKL version of the axpy implementation is the call to the BLAS 2 routine dgemv. From Figure 4.5(a) it can be seen that its speed is roughly equivalent to the 4-thread OpenMP performance.
The sparse matrix-vector multiplication as done in the OpenMP version of program mod2as consists in simply distributing the matrix rows to be processed over the available OpenMP threads. We show the results in Figure 4.17. The performance is low for small matrix orders, not only because of the little work per matrix row but also because the row elements are not in cache except by accident. The speed is therefore 10–15 times lower than in the corresponding dense matrix dotproduct implementation. Still, there is a clear benefit in using more OpenMP threads. Note that for a number of threads > 4 and N ≤ 1000 the performance is lower than that with 2 and 4 threads, because not all threads are located on the same chip anymore. For 8 threads at N = 2000, however, this effect is compensated by the larger amount of work per matrix row and a correspondingly smaller influence of the parallelisation overhead, arriving at a speed of 1740 Mflop/s.
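The row-wise distribution just described amounts to putting an OpenMP work-sharing directive on the outer loop of a CRS product like the one sketched earlier (again with assumed array names, not the benchmark source; for very uneven row lengths a different schedule clause might be preferable).

    subroutine crs_matvec_omp(n, ia, ja, a, x, y)
       ! Sketch of the row-parallel CRS matrix-vector product described above:
       ! the rows are simply distributed over the OpenMP threads.
       implicit none
       integer, intent(in)  :: n, ia(n+1), ja(*)
       real(8), intent(in)  :: a(*), x(*)
       real(8), intent(out) :: y(n)
       integer :: i, k
       real(8) :: s

    !$omp parallel do private(k, s)
       do i = 1, n
          s = 0.0d0
          do k = ia(i), ia(i+1)-1
             s = s + a(k)*x(ja(k))
          end do
          y(i) = s
       end do
    !$omp end parallel do
    end subroutine crs_matvec_omp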


Figure 4.17: Speed of a sparse matrix-vector multiplication with program mod2as.

Also an OpenMP version of a program for the solution of a dense linear system is available as program mod2b. In addition, there is a multi-threaded MKL version. We show the results of both in Figures 4.18(a) and (b), respectively. Of the MKL version only results for up to 4 threads are available.


Figure 4.18: Speed of the OpenMP (a) and multi-threaded MKL (b) versions for the solution of a dense linear system with program mod2b.

The all-Fortran version clearly benefits from using more threads, up to the 8 available. For N > 2000 the problems do not fit in the cache anymore, which causes the performance to stay at about the same level as that at N = 2000, even though the BLAS 3 routine DGEMM should mitigate the effects of the insufficient memory bandwidth. The convexity of the curves in Figure 4.18(a) indicates that the parallelisation overhead decreases relative to the problem size. The multi-threaded MKL version of program mod2b scales almost linearly up to 3 threads and slightly less well for 4 threads. Even so, the performance is quite high due to the support of the SSE units, which are employed very effectively here. As in the Fortran version, the performance levels off to some degree for N > 2000, be it less markedly so.
OpenMP versions of sparse linear solvers are again implemented in the programs mod2ci and mod2cr, which represent solving irregular Finite Element type and regular Finite Difference type problems, respectively.


We show the performance of both types in Figures 4.19(a) and (b). In both algorithms the matrix-vector multiplication is the most important factor. The implementation is completely different, however.


Figure 4.19: Speed of the OpenMP implementation of a sparse linear solver for irregular Finite Element type problems with program mod2ci (a) and regular Finite Difference type problems with program mod2cr (b).

In the irregular case the matrix is in Compressed Row Storage form. From program mod2as we know that the maximum speed for this type of operation is reached at a vector length of about 2000. So, it is not surprising that also for the sparse solver using this operation the maximum speed is found for matrices of this order. The higher overall speed with respect to the matrix-vector multiplication is due to the vector L2 norm, which is the other major operation in the TFQMR algorithm that is the actual solver. The L2 norm is significantly faster than the general dotproduct, in the order of 3–4 Gflop/s at a vector length of N = 2000, resulting in a peak speed of almost 3 Gflop/s for this matrix order when 8 threads are used. Note in Figure 4.19(a) that the speed for 6 threads is lower than that for 4 threads. Although this is not conclusive evidence, it suggests that 4 threads are active on one chip and 2 on the other chip. With 8 threads the situation is balanced again and the performance within the cache regime is clearly better than with 4 threads. When all data have to be fetched from memory all the time, which happens for N ≥ 10,000, there is virtually no difference in performance when more than one thread is used. So, for larger problem sizes the benefit of using OpenMP is marginal.
The matrix-vector multiplication on the regular banded matrices present in the Finite Difference type solver contains expressions like:

    !$omp parallel do
    ! n1 and n12 are grid-related offsets of the heptadiagonal structure
    Do i = n1+1, n12
       y(i) = a(i,0)*x(i)     + a(i,1)*x(i+1)   + a(i,2)*x(i+n1) + &
              a(i,3)*x(i+n12) + a(i-1,1)*x(i-1) + a(i-n1,2)*x(i-n1)
    End Do

These show a high degree of data reuse that can benefit both from the L1 and the L2 cache. Therefore, the highest performance is found at a matrix order of N = 125,000. For larger matrices an increasing amount of data has to be fetched directly from memory, with a correspondingly lower performance as a result. Note that in Figure 4.19(b) the performances for 4 and 6 threads are almost equal. This should again be ascribed to the imbalance in the thread activity with 6 threads. Also for this program the benefit of OpenMP is limited when the data have to be fetched from memory: at N = 1,000,000 more than 2 threads hardly make a difference in performance.
The OpenMP version of program mod2d is parallelised by just putting the necessary directives in the appropriate LAPACK and BLAS 3 routines, particularly dgemm, which dominates the total execution time. The program solves the eigenvalue problem for matrix orders in the range N = 10–2000. Unfortunately, except for N = 10, 20, and 40–80, all solutions turn out to be incorrect. This turns out to be specific to the Intel ifort compilers; compiling the same code with, for instance, Portland Group's pgf95 gives correct results. Also, running the ifort-generated code on other (older) Intel platforms yields the same errors. This is something for Intel's compiler group to look into.


Program mod2f performs a radix-4 FFT with some reorganisation of the loop constructs to keep the loops long enough to diminish the loop overhead and to benefit more from cache-resident data. The performance curves for 1–8 threads are shown in Figure 4.20. The figure shows that there is virtually no difference between using 1 thread or more. In fact, the performance curves are almost identical to those of the 1-processor (core) version (cf. Figure 4.9), showing that the current parallelisation approach is not able to exploit the inherent parallelism.


Figure 4.20: Speed of the OpenMP implementation of a radix-4 complex-to-complex 1-D FFT with program mod2f.

To take advantage of this parallelism, one has to factorise the large 1-D FFT into a series of smaller FFTs that are executed in parallel, essentially the same approach as is taken in distributed-memory based implementations.
In program mod2g, the 2-D Haar Wavelet Transform, there should be no problem in executing the 1-D partial transforms in parallel. Figure 4.21 gives the results for this algorithm. Indeed, there is a difference between 1 and multiple threads, but the parallelisation overhead is so large that the single-threaded version is the fastest for problem sizes up to 256×256 elements.


Figure 4.21: Speed of the OpenMP implementation of a 2-D Haar Wavelet Transform with program mod2g.

For larger problems the negative influence of running out of cache and the lower relative impact of the parallelisation overhead mean that with 8 threads the performance at size 512×256 is somewhat higher. For still larger problem sizes the difference becomes even more significant, be it that the absolute speed falls off both for the single-threaded


program and for the run with 8 threads. With 1 thread on a 1024×1024-size problem the speed is only 79 Mflop/s, while with 8 threads it is still 153 Mflop/s.
Random number generation can be implemented efficiently in OpenMP, except for the initialising phase in which the seeds for each of the independent processes must be generated. The overhead of the initialisation increases linearly with the number of threads and, in addition, we have the parallelisation overhead of OpenMP itself. As the number of operations required to produce a random number is low (9 in this implementation of program mod2h), we cannot expect a large benefit from parallelisation except for fairly large sequences. This turns out to be the case, as can be seen from Figure 4.22. With 2 threads the program runs significantly faster from N = 200,000 on, while with 4 threads the single-threaded version is overtaken at N = 900,000.


Figure 4.22: Speed of generating uniformly distributed random numbers with program mod2h.

The initialisation overhead, together with the shorter sequences per thread, means that the use of 6–8 threads does not pay off. For large practical sequences 4 threads seem to be a good choice.
The parallel implementation of a sorting algorithm as such can be done fairly efficiently by sorting subsequences of the data followed by a P-way merge, where P is the number of processes. It is possible to implement the merge part of the algorithm correctly without race conditions. However, the overhead generated in this way is so large that performing the merge operation sequentially in practice always turns out to be faster.


Figure 4.23: Speed of sorting 4-byte Integers with program mod2i.

Program mod2i therefore uses a serial merging procedure. We show the performance curves in Figure 4.23.


N.B.: We only show the results for sorting Integers. The pattern for 8-byte Reals is very similar, be it that the performance is lower by about 10%. Only data access operations are done in the sorting algorithm, and those are all from main memory. Therefore the performance curve for 1 thread is completely flat, as no cache is involved that might speed up the execution. For the multi-threaded runs the situation is different, because the exchange operations on the sub-sequences are perfectly parallel but the merging part is sequential. However, the merging has an O(N) complexity while the compare-exchange operations have an O(N log(N)) complexity. The relative influence of the merge part therefore decreases with increasing vector length and the performance increases accordingly. Note that the pay-off for adding threads becomes smaller not only because of the smaller sub-sequences per thread that have to be processed, but also because of the competition for memory access.

4.3 MPI results

In this section we discuss the results obtained with the MPI variant of the EuroBen Benchmark. The MPI programs implement the majority of the algorithms present in the single-CPU/core version and the OpenMP version. As measuring the speed of the simple operations implemented in the programs mod1a and mod1b is not sensible in a distributed-memory parallel environment, because it does not provide additional information, these programs are not part of the MPI set of programs. Instead, simple programs that characterise the communication properties are included as programs mod1i, mod1j, and mod1k. Furthermore, program mod2d, the solution of a dense symmetric eigenvalue problem, is lacking because there is no implementation of sufficient quality for it (yet).
The MPI version we used was Argonne National Lab's MPICH2, version 1.0.5. Initially we tried to install Intel's Cluster MPI 3.0, but after installation it kept complaining about undefined references to routines in the PMPI library. The Argonne version, however, worked very well when configured in SMP mode.

4.3.1 Module 1

The first program in this set we want to discuss is mod1i, which tests 3 variants of a distributed dotproduct. In the first "naïve" variant all partial sums are sent directly to a root processor that sums them and sends back the global sum to all other processors. In the second variant a tree of processes is built, in which each node receives partial sums from the nodes pointing to it, sums the received partial sums, and sends the result up the tree. The root processor in the end contains the global sum, and this sum is again sent via the tree to each processor. One would expect that collective MPI routines like MPI_Reduce and MPI_Bcast would work the same, but curiously enough until recently this was not always the case. Therefore a third variant is also tested that uses the routines MPI_Reduce and MPI_Bcast, to compare it with the Fortran tree-based version.
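The third variant essentially amounts to the following pattern. This is a minimal, self-contained sketch with an arbitrary local vector length, not the mod1i source.

    program mpi_dot
       ! Sketch of the collective mod1i variant described above: local partial
       ! sums combined with MPI_Reduce and redistributed with MPI_Bcast.
       ! Vector length and data are illustrative only.
       use mpi
       implicit none
       integer, parameter :: nloc = 125000          ! local piece of the vectors
       real(8) :: x(nloc), y(nloc), partial, global
       integer :: rank, nprocs, ierr, i

       call MPI_Init(ierr)
       call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
       call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

       call random_number(x); call random_number(y)

       partial = 0.0d0
       do i = 1, nloc
          partial = partial + x(i)*y(i)
       end do

       ! combine the partial sums on rank 0, then send the result to everyone
       call MPI_Reduce(partial, global, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, &
                       MPI_COMM_WORLD, ierr)
       call MPI_Bcast(global, 1, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierr)

       if (rank == 0) print *, 'dotproduct =', global
       call MPI_Finalize(ierr)
    end program mpi_dot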


Figure 4.24: Speed of three variants of a distributed dotproduct with program mod1i.

In Figure 4.24 we show the outcome of this program.


The figure shows that the Fortran tree implementation and the one using the collective MPI routines behave similarly. The naïve version lags behind because all processors need to deliver their data sequentially to the root processor, and the root processor in turn has to distribute the global sum sequentially to the other processors. Furthermore, note that the speed increases superlinearly with the number of processors, because the two vectors involved in the dotproduct each have a length of 1,000,000 8-byte Reals, of which an increasing part can be held in the L2 (or L1) cache, thus speeding up the computation.
The second module 1 program we consider is mod1j, which does a very precise measurement of two-sided communication via MPI_Send and MPI_Recv in order to assess the bandwidth and latency of point-to-point blocking communication for message sizes in the range of 4–4,000,000 bytes. Figure 4.25 shows the bandwidth vs. the message length. Note that this is the in-chip bandwidth as exploited by the MPICH2 MPI library, so this is the best possible bandwidth between MPI processes for this platform.


Figure 4.25: Bandwidth vs. message length for two-sided, blocking communication as measured with program mod1j.

Between processors this bandwidth is at best half of the measured in-chip bandwidth. The latency for this type of communication is 4.15–4.29 µs, about a factor of 2 lower than the average InfiniBand 1-hop latencies.
Program mod1k measures the point-to-point bandwidth and latency for one-sided communication with MPI_Get and MPI_Put. However, the program stalls indefinitely when executed on this system. This seems to be a flaw in the MPICH2 library, as we have experienced similar problems on other multi-core chip-based systems with this library, while no problems were encountered with other MPI implementations. So, we have no results for this type of communication.
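For reference, the two-sided exchange measured by mod1j is in essence a ping-pong between two ranks of the following form; the message size and repeat count are illustrative and the timing is far less careful than in the benchmark itself.

    program pingpong
       ! Sketch of a two-sided ping-pong of the kind measured by mod1j: rank 0
       ! sends a message to rank 1 and waits for it to come back; the round-trip
       ! time yields the bandwidth (and, for short messages, the latency).
       use mpi
       implicit none
       integer, parameter :: nbytes = 1048576, nrep = 100
       character :: buf(nbytes)
       integer :: rank, ierr, k, status(MPI_STATUS_SIZE)
       real(8) :: t0, t1

       call MPI_Init(ierr)
       call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
       buf = 'x'

       t0 = MPI_Wtime()
       do k = 1, nrep
          if (rank == 0) then
             call MPI_Send(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD, ierr)
             call MPI_Recv(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD, status, ierr)
          else if (rank == 1) then
             call MPI_Recv(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, status, ierr)
             call MPI_Send(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, ierr)
          end if
       end do
       t1 = MPI_Wtime()

       if (rank == 0) print *, 'bandwidth (MB/s):', &
            2.0d0*nbytes*nrep/(t1 - t0)/1.0d6
       call MPI_Finalize(ierr)
    end program pingpong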

4.3.2 Module 2

As in the single-CPU (core) and OpenMP versions, the first program of module 2 is the dense matrix-vector multiplication, program mod2a. As the MPI processes are completely independent (the partial result vectors stay on the processors they are computed on), the scaling should be about linear. Figure 4.26 shows that this is indeed the case up to a matrix order of N = 500. For N > 500 the data do not fit in the L1 cache anymore, so the speed deteriorates. At N = 2000 the matrix rows per processor do not fit in the L2 cache anymore either, so the speed drops seriously here, to 1–2 Gflop/s. The anomalous behaviour of the algorithm on 8 processors at N = 1000 can presently not be accounted for: the speed found is 10.68 Gflop/s, significantly lower than the speeds found with 4 and 6 processors.
Program mod2as measures the speed of a sparse matrix-vector multiplication. Also in this case the MPI processes are completely independent. The rows of the CRS-formatted matrix are evenly distributed over the available processors, while each processor gets the entire vector to be multiplied with. So, after the distribution no communication is necessary anymore. Scaling should therefore be good under the assumption that the numbers of non-zero elements in the matrix rows are evenly distributed. Figure 4.27 gives the performance curves for 1–8 processors.


Figure 4.26: Speed of a dense matrix-vector multiplication with program mod2a.

For matrix orders N = 500–5000 the scalability is indeed almost linear (although there is an unexplained anomaly for the 4-processor case at N = 2000).


Figure 4.27: Speed of a sparse matrix-vector multiplication with program mod2as.

For matrix orders up to 200 and a non-zero fill of ≈ 3.5% the amount of work per processor is so small that the speed gain is observable but modest. For N > 2000 cache effects again play a significant role, and at N = 10,000 there is virtually no difference in speed between 4–8 processors due to the dominant part played by the data access from memory.
Solving a dense linear system is inherently not entirely scalable. The factorisation is well-behaved in this respect, but the solution part is recursive and therefore offers only limited opportunity for parallelism. This is well demonstrated in Figure 4.28. The solution part, however, has only O(N^2) operations, while the factorisation requires O(N^3) operations, which means that for increasing system size the scalability should become better. This in part explains the better performance with more than 1 processor (see below, however). For problems up to an order N = 200 the communication overhead makes the solution actually slower than the serial version. For N = 300, however, the parallel versions are all faster, be it not dramatically so. Up to the point where significant cache misses begin to occur (N > 1000) there is a performance gain that is best for P = 4, P being the number of processors; for more processors the increased communication, combined with the decrease in the volume of computation per processor, reduces the gain again. For N > 1000 the memory accesses out of cache cause a drop in performance of a factor 4–5. At N = 4000 the speed of the algorithm is about equal for P = 2–8, so large is the influence of the reduced bandwidth to memory.


Figure 4.28: Performance of the solution of a dense linear system with program mod2b.

MPI versions of the sparse linear solvers are again implemented in the programs mod2ci and mod2cr which, as before, represent solving irregular Finite Element type and regular Finite Difference type problems, respectively. We show the performance of both types in Figures 4.29(a) and (b). In comparison with the OpenMP implementation of mod2cr the speed is quite low: about an order of magnitude.


Figure 4.29: Speed of the MPI implementation of a sparse linear solver for irregular Finite Element type problems with program mod2ci (a) and regular Finite Difference type problems with program mod2cr (b).

This is not unique to the Clovertown processor (cf. the results for the Itanium 2 and POWER5+, for instance, at [2]). The bad performance is due to the many MPI_Allgatherv operations on the result vectors of the matrix-vector multiplications in the CG algorithm and the preconditioner. A phenomenon that cannot be explained (yet) is the fact that the performance for P = 4 is much lower than that for P = 2. For the irregular type sparse solver the performance is as expected: it is somewhat better than that of the OpenMP version and, thanks to the (logical) distribution of the data, we even see a super-linear speedup for N = 2000 on 2 and 4 processors and for N = 5000 on 4 processors. For N ≥ 10,000 addressing is mainly from outside the cache, with a corresponding loss in speed and an almost equal performance for P = 2 and P = 4 due to the dominance of the data access time.
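The MPI_Allgatherv step that dominates the mod2cr solver corresponds to the following communication pattern; the sketch is self-contained, but the sizes and array names are assumptions, not those of the benchmark.

    program gather_result
       ! Sketch of the reassembly step described above: after each local piece
       ! of the matrix-vector product is computed, the full result vector is
       ! collected on every process. Done once per CG iteration, this
       ! communication quickly dominates the run time.
       use mpi
       implicit none
       integer, parameter :: nloc = 250000          ! local part of the vector
       real(8), allocatable :: xloc(:), xfull(:)
       integer, allocatable :: counts(:), displs(:)
       integer :: rank, nprocs, ierr, p

       call MPI_Init(ierr)
       call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
       call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

       allocate(xloc(nloc), xfull(nloc*nprocs), counts(nprocs), displs(nprocs))
       call random_number(xloc)                     ! stands in for a local matvec result
       counts = nloc
       displs = (/ (p*nloc, p = 0, nprocs-1) /)

       call MPI_Allgatherv(xloc, nloc, MPI_DOUBLE_PRECISION, xfull, counts, &
                           displs, MPI_DOUBLE_PRECISION, MPI_COMM_WORLD, ierr)

       if (rank == 0) print *, 'gathered', size(xfull), 'elements'
       call MPI_Finalize(ierr)
    end program gather_result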


In contrast to the OpenMP version of the 1-D complex-to-complex FFT, the MPI version of program mod2f scales very well, as is evident from Figure 4.30. This is due to a different implementation: the 1-D FFT is factorised into a 2-D arrangement of smaller FFTs that can be performed in parallel. The matrix containing the series of transformed FFTs is then transposed and the FFT operation is repeated on the columns of the transposed matrix. The only communication during this procedure takes place in the global matrix transposition, which is done by an MPI_Alltoallv operation. For the larger vector lengths, approximately from N = 4096 on, the parallel computation starts to offset the penalty of the communication and the parallel versions clearly start to win out. The sawtooth pattern of the performance stems from the intermediate vector copies that are required for vector lengths that are an even power of 2.

Figure 4.30: Performance of a 1-D complex-to-complex FFT with program mod2f.

Figure 4.31 shows that also in this case there is a large difference between the parallel behaviour of the OpenMP implementation (cf. Figure 4.21) and that of the MPI implementation of the 2-D Haar Wavelet Transform. Largely, both implementations are very much alike: after the parallel transformation of the columns of the data matrix, the partly transformed matrix is transposed and again a parallel transformation pass is done on the columns to complete the transformation process. Unlike in the OpenMP case, the parallelisation overhead is small.

Figure 4.31 shows that also in this case there is a large difference between the parallel behaviour of the OpenMP implementation (cf. Figure 4.21) and that of the MPI implementation of the 2-D Haar Wavelet Transform. Largely, both implementations are very much alike: after the parallel transformation of the columns of the data matrix, the partly transformed matrix is transposed and again a parallel transformation pass is done on the columns to complete the transformation process. Unlike in the OpenMP case, however, the parallelisation overhead is small. The only communication required is the MPI_Alltoallv operation during the global matrix transposition. It is clear that the communication is not negligible: the performance with more processors is only higher when the data set is large enough to offset the communication. Still, the scalability is quite acceptable, with a maximum speed of about 1200 Mflop/s on 8 processors. Note that for P = 8 processors the speed can be maintained for the larger problem sizes because the data to be operated on per core still fits in the cache.

Figure 4.31: Performance of a 2-D Haar Wavelet Transform with program mod2g (Mflop/s against the matrix size N*M for 1, 2, 4, and 8 processes).
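The per-column work that each process performs between the two transpositions is completely independent. The fragment below shows a single level of a scaled Haar step on one column; it is a generic illustration of the principle only, and its normalisation, interface, and routine name haar_level are assumptions that do not necessarily match mod2g.

#include <string.h>

/* One Haar level on a column of even length n: the first half of the
   result holds the pairwise averages (low-pass part), the second half
   the pairwise differences (detail part).  work is scratch space of
   length n. */
void haar_level(double *col, double *work, int n)
{
    int h = n / 2;
    for (int i = 0; i < h; i++) {
        work[i]     = 0.5 * (col[2 * i] + col[2 * i + 1]);   /* average    */
        work[h + i] = 0.5 * (col[2 * i] - col[2 * i + 1]);   /* difference */
    }
    memcpy(col, work, (size_t)n * sizeof *col);
}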

Also for the MPI implementation of the random number generator in program mod2h the scalability is fairly good. The implementation is conceptually the same as the OpenMP version: there is an initialisation phase whose cost increases linearly with the number of processors involved, after which the generation of the random sequences is completely independent and should exhibit perfect scaling. The speed curves in Figure 4.32 agree well with these considerations. As can be seen, the initialisation overhead negatively influences the performance for the shorter sequence lengths for P = 4–8. However, for N = 10^7 the speed gradient for P = 4–8 is still positive, which means that for longer sequences we may expect still better speedups, eventually almost linear. As the random sequences in Monte Carlo-like procedures tend to be very long, the speed on 8 cores can be up to about 4.5 Gop/s. Note that, although the OpenMP version and the MPI version are very similar, the parallelisation overhead of the OpenMP version kills the speedup beyond P = 4 (cf. Figure 4.22), whereas the parallelisation overhead for MPI turns out to be much lower. This is a general problem with OpenMP on almost all known platforms and is not specific to the Intel processor/compiler combination.

Figure 4.32: Speed of generating uniformly distributed random numbers with program mod2h (Mop/s against the sequence length for 1, 2, 4, 6, and 8 processes).
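The structure responsible for this behaviour can be sketched as follows: each rank positions a serial generator at the start of its own block of the sequence by stepping it forward during initialisation, which makes the initialisation cost grow with the number of processes, while the subsequent generation is perfectly parallel. The program below is our own illustration; the simple 64-bit linear congruential generator, the seed, the sequence length, and the skip-ahead scheme are assumptions for the example and are not taken from mod2h.

#include <mpi.h>
#include <stdint.h>

static uint64_t state;

/* Knuth's 64-bit LCG constants; returns a uniform deviate in [0, 1). */
static double next_uniform(void)
{
    state = state * 6364136223846793005ULL + 1442695040888963407ULL;
    return (state >> 11) * (1.0 / 9007199254740992.0);   /* 53 bits / 2^53 */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, P;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    const long N = 10000000;        /* total sequence length (example value) */
    long n_loc = N / P;             /* block per rank; N divisible by P assumed */

    state = 12345;                  /* common seed on all ranks */
    for (long i = 0; i < (long)rank * n_loc; i++)
        (void)next_uniform();       /* skip-ahead: cost grows with the rank number */

    double sum = 0.0;
    for (long i = 0; i < n_loc; i++)
        sum += next_uniform();      /* independent generation phase */

    double global_sum;              /* consume the numbers so the loop is not optimised away */
    MPI_Reduce(&sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}

A production generator would use a logarithmic jump-ahead or independent streams rather than stepping through the sequence, but the linear skip-ahead shown here reproduces the kind of initialisation cost described above.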

In Figure 4.33 we show the speed of a sort algorithm that is in principle the same as the one in the OpenMP version. In this program, mod2i, subsequences are evenly distributed over the processors and then sorted. The sorted subsequences are merged into a global sequence in the sense that each process ends up with ⌊N/P⌋ sorted values in increasing order, except the last process, which contains the last ⌊N/P⌋ + N mod P values. The same program also sorts sequences of 8-byte Reals. The performance pattern is, however, almost identical to that of the Integer sorting, so we do not show these results; the speed is roughly 25% lower for the Real case due to the bandwidth of both the memory and the communication network. The main difference with the OpenMP version is that the merging phase of the algorithm can, at least in part, be done in parallel. The only communication required is an exchange of the data that must be sent to the appropriate processors after sorting the subsequences, performed with an MPI_Allgatherv operation. The time required for this is O(P), while the amount of work per processor is O((N/P) log(N/P)), so the scalability should not suffer much from a larger number of processors when the sequence to be sorted is reasonably long. Figure 4.33 shows that this indeed is the case. For P = 8 the performance drops unexpectedly when going from N = 500,000 to N = 1,000,000; as yet we have no explanation for this phenomenon. Note again that the speedup of the MPI version is much better than that of the equivalent OpenMP version (see Figure 4.23). This is partly due to the parallel merging of the subsequences, but also to the larger OpenMP overhead. As with the random number generation there is no data re-use from the cache, so no performance degradation for larger problem sizes is to be expected.

Figure 4.33: Speed of sorting 4-byte Integers with program mod2i (Mop/s against the sequence length for 1, 2, 4, 6, and 8 processes).
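The overall pattern can be sketched as follows: every rank sorts its own block, the sorted blocks are exchanged, and each rank then merges the sorted runs while keeping only its own slice of the global result. The fragment below is a simplified illustration written for this report: it assumes N is divisible by P and therefore uses MPI_Allgather with equal block sizes, whereas mod2i uses MPI_Allgatherv so that the last process can hold the remaining N mod P values; the routine names parallel_sort and merge_slice and the naive P-way merge are our own choices.

#include <mpi.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Merge P sorted runs of length n_loc each and write only the elements
   [lo, hi) of the merged sequence into out (a naive selection merge,
   kept simple for clarity). */
static void merge_slice(const int *runs, int P, int n_loc,
                        long lo, long hi, int *out)
{
    int *head = calloc((size_t)P, sizeof *head);   /* next unread index per run */
    for (long pos = 0; pos < hi; pos++) {
        int best = -1;
        for (int p = 0; p < P; p++)                /* pick the smallest run head */
            if (head[p] < n_loc &&
                (best < 0 ||
                 runs[(long)p * n_loc + head[p]] < runs[(long)best * n_loc + head[best]]))
                best = p;
        int v = runs[(long)best * n_loc + head[best]++];
        if (pos >= lo)
            out[pos - lo] = v;
    }
    free(head);
}

/* local: n_loc unsorted integers owned by this rank;
   slice: on return, this rank's n_loc values of the globally sorted sequence. */
void parallel_sort(int *local, int n_loc, int *slice, MPI_Comm comm)
{
    int rank, P;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &P);

    qsort(local, (size_t)n_loc, sizeof *local, cmp_int);              /* local sort */

    int *all = malloc((size_t)P * n_loc * sizeof *all);
    MPI_Allgather(local, n_loc, MPI_INT, all, n_loc, MPI_INT, comm);  /* exchange   */

    merge_slice(all, P, n_loc, (long)rank * n_loc, (long)(rank + 1) * n_loc, slice);
    free(all);
}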

5 Closing remarks and summary

1. Like the Woodcrest processor, the Clovertown processor is able to perform a significantly higher number of instructions/cycle/core than earlier IA-32 processors. This can lead to quite high performances, especially when the SSE units can also be employed.

2. The OpenMP version of program mod2d, the solution of a dense symmetric eigenvalue problem, gives incorrect results that must be attributed to a compiler back-end error. This error does not occur with the ifort compiler for the IA-64 architecture, nor with the Portland Group's pgf90 compiler on the Clovertown processor.

3. There proved to be a problem when one attempted to acquire data from the main memory very intensively. For instance, when starting more than 2 sparse matrix-vector multiplication programs with large matrices (order N = 1,000,000, non-zero fill ≈ 3%), the system would stall indefinitely and the only way to make it accessible again was to power it off and on again. This problem seems peculiar to this version of the evaluation platform and should be solved in later versions, according to Intel's representative.

4. As remarked in section 4.3, we were not able to get Intel's Cluster MPI, version 3.0, working: it kept complaining about not being able to find PMPI routines. Fortunately, Argonne's MPICH2, configured for SMPs, proved to be a very good alternative.

5. The point-to-point communication bandwidth within the system, as well as the latency, proved to be very good. This was also evident from the collective communication in, for instance, programs mod2f and mod2g. Generally speaking, the scaling of the MPI programs was very good.

6. On almost all platforms the scaling of OpenMP is poor, and the Intel processor/compiler combination used to be no exception. In this test, however, there was a definite improvement in this respect: a number of programs that until recently would slow down when an OpenMP version was used now showed speedups, e.g., programs mod2as and mod2g.

7. In summary, the evaluation platform was largely an agreeable machine to work with, while the ifort compiler is definitely one of the best compilers around for x86-type processors.

Acknowledgements

We would like to thank Intel Benelux for making an Intel reference platform available to us. We especially thank Chao Yu, who gave valuable feedback on the results, which altered our conclusions with respect to the use of the MKL libraries and their multi-threaded capabilities.

References

[1] Cachebench Homepage: http://icl.cs.utk.edu/projects/llcbench/cachebench.html.

[2] EuroBen Benchmark. Collected results of EuroBen Benchmarks can be found at: www.euroben.nl/results.

[3] HPC Challenge Benchmark. Both the benchmark and an explanation of its philosophy can be found at: icl.cs.utk.edu/hpcc/.

[4] Philip D. Mucci, Accurate Cache and TLB Characterization Using Hardware Counters, Proc. of the International Conference on Computational Science 2004, Krakow, Poland, June 2004.

[5] A.J. van der Steen, The benchmark of the EuroBen Group, Parallel Computing 17 (1991) 1211–1221.