Lessons Learned when Comparing Shared Memory and Message Passing Codes on Three Modern Parallel Architectures

J. M. MacLaren and J. M. Bull

Centre for Novel Computing, Department of Computer Science, University of Manchester, Oxford Road, Manchester M13 9PL, UK
email: {jon.maclaren,[email protected]

Abstract. A serial Fortran 77 micromagnetics code, which simulates the behaviour of thin-film media, was parallelised using both shared memory and message passing paradigms, and run on an SGI Challenge, a Cray T3D and an SGI Origin 2000. We report the observed performance of the code, noting some important effects due to cache behaviour. We also demonstrate how certain commonly-used presentation methods can disguise the true performance profile of a code.

1 Introduction

Micromagnetics is an area where simulation is of vital importance, enabling electronic engineers to model and predict the behaviour of magnetic materials. As in many other fields, accurate simulations are computationally demanding, and parallel computers offer a means of meeting these demands. A serial micromagnetics code was parallelised in two ways, producing a Shared Memory code and a Message Passing code. This paper describes this procedure, compares the performance of the two codes on three different parallel architectures, and attempts to explain the results as fully as possible.

In Section 2, the subject code and its problem domain are described briefly. Details of the relevant aspects of the target architectures are provided in Section 3, where methods for writing parallel Fortran on these machines are then chosen. Full details of how the parallel codes were implemented can be found in [7]. Results for the codes are presented in Section 4; these are found to be unusual for one platform in particular. The results are discussed in detail, with explanations given for seemingly anomalous behaviour, and the deficiencies of some common performance presentation practices are exposed. Finally, in Section 5, we draw some conclusions.

2 Subject Code

The subject code simulates thin film media, such as hard-disk surfaces, which are made up from layers of many magnetisable grains. The code models the shapes, sizes and locations of the grains, and uses this information to predict how the magnetic fields of the grains change in reaction to an applied, external magnetic field. Each grain has its own magnetic field, and so influences every other grain, making this an N-body problem. The interactions are solved by integrating N Landau-Lifshitz equations [8] of the form

$$\frac{d\mathbf{M}_i}{dt} = -\frac{\gamma}{1+\alpha^2}\,(\mathbf{M}_i \times \mathbf{H}_T) - \frac{\gamma\alpha}{(1+\alpha^2)\,|\mathbf{M}_i|}\,\mathbf{M}_i \times (\mathbf{M}_i \times \mathbf{H}_T)$$

for i = 1, ..., N, where N is the number of grains. Here M_i is the magnetic moment of the ith grain, γ and α are constants, and the term H_T contains the contributions to the magnetic field from the externally applied field and from the influence of the other grains.

To evaluate the term H_T precisely for every grain would have complexity O(N^2), so approximations are sought which yield acceptably accurate solutions in a reasonable time. Extensive work in the field of N-body problems has generated methods for constructing approximations of N-particle systems which are of complexity O(N log N) or better [2, 5]. These methods group particles into a hierarchy of cells. An `average' field is calculated for each cell, based on the fields of the particles contained within it, and these averages are used when calculating interactions with distant particles: the more distant the particle, the larger the size of cell which is used. The subject code uses an algorithm based upon the Barnes-Hut method [2], but which is restricted to a 3-level (rather than n-level) hierarchy. The largest cells are used to govern the problem size and shape, e.g. 4 × 4 denotes a square problem area with 16 large cells (each of which is subdivided into 9 smaller cells, which in turn contain a total of 48 magnetisable grains).

At each iteration of the code, the external field is incremented by a fixed amount and the system of ODEs is integrated to convergence, so that the magnetic fields reach an equilibrium. A variable-order, variable-step Adams method with error control is used for the integration. The code uses 400 of these iterations to show how the overall magnetic field of the particles follows a changing external field, giving a hysteresis loop. Further details of the subject code can be found in [9].
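A minimal sketch of evaluating the right-hand side of this equation for a single grain is shown below. It is not taken from the subject code; the routine name, argument layout and the use of the reconstructed constants gamma and alpha (γ and α above) are illustrative assumptions.

      subroutine llgrhs(mx, my, mz, hx, hy, hz, gamma, alpha,
     &                  dmx, dmy, dmz)
c     Evaluates dM/dt for one grain: M = (mx,my,mz) is the moment,
c     H_T = (hx,hy,hz) the total field, gamma and alpha the constants.
      double precision mx, my, mz, hx, hy, hz, gamma, alpha
      double precision dmx, dmy, dmz
      double precision cx, cy, cz, dx, dy, dz, mnorm, c1, c2
c     first cross product, M x H_T
      cx = my*hz - mz*hy
      cy = mz*hx - mx*hz
      cz = mx*hy - my*hx
c     second cross product, M x (M x H_T)
      dx = my*cz - mz*cy
      dy = mz*cx - mx*cz
      dz = mx*cy - my*cx
      mnorm = sqrt(mx*mx + my*my + mz*mz)
      c1 = -gamma/(1.0d0 + alpha*alpha)
      c2 = -gamma*alpha/((1.0d0 + alpha*alpha)*mnorm)
      dmx = c1*cx + c2*dx
      dmy = c1*cy + c2*dy
      dmz = c1*cz + c2*dz
      end

In the full code such an evaluation sits inside the variable-order Adams integrator and is performed for every grain at each step.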

3 Target Machines and Programming Methods

There are many different kinds of parallel architecture in existence. Here, machines with different architectures are chosen in order to represent the currently popular classes of parallel machines. The three chosen architectures, SGI Challenge, Cray T3D and SGI Origin 2000, are each described briefly. Methods for programming these machines are then selected.

The SGI Challenge is a true shared memory architecture, the main memory being interleaved among the processors, which are connected using a simple bus. Each processor has its own write-through level-one and level-two caches. An invalidate protocol is used to achieve cache coherency, providing sequential consistency. The SGI Challenge machine which was made available had 512Mb of main memory, shared between four 100MHz MIPS R4400 processors, running IRIX version 5.3. Each processor also had 1Mb of level-two cache (direct mapped, 128 byte lines) and 16kb of level-one cache (direct mapped, 16 byte lines).

The DEC Alpha processor nodes of the Cray T3D run a minimal version of the UNICOS MAX operating system (there is no virtual memory provided); a second machine, typically a Cray Y-MP, is used as a front end to enable users to load and run jobs on the T3D. The T3D is a distributed memory machine, with access to other processors' memory being provided through the SHMEM library routines, which do not maintain cache consistency. The T3D which was used had 512 150MHz Alpha processors, running version 1.3 of the UNICOS MAX operating system, arranged in a 3D torus. Each node had 64Mb of local memory and 8kb of on-chip cache (direct mapped, 32 byte lines) using a write-back policy.

The SGI Origin 2000 is a CC-NUMA (Cache-Coherent Non-Uniform Memory Access) architecture. Although it is a distributed memory architecture, sequentially consistent Virtual Shared Memory (using an invalidate protocol) is provided in hardware. Each node board contains two processors and some memory. The node boards are linked together using a CrayLink Interconnect network. Each processor also has its own level-one and level-two caches. The Origin 2000 which was available for use had 14 195MHz MIPS R10000 processors, running IRIX version 6.3, with a main memory of 1792Mb (256Mb per two-processor node board). Each processor also had 4Mb of level-two cache (2-way set associative, 128 byte lines) and 32kb of on-chip level-one cache (2-way set associative, 32 byte lines).

Both the Challenge and Origin machines have compiler support for writing parallel shared-memory Fortran, by allowing the user to execute groups of loop iterations in parallel. This is achieved by inserting directives (embedded in comments) into the code which tell the compiler how the loop should be parallelised. These directives were used to produce a parallel shared memory code. The T3D's CRAFT compiler uses similar directives to generate parallel code, but was found to be too restrictive for this example. The model requires that the data (and loop iterations) are split between processors at compile-time, and also restricts any index which is iterated over in parallel to be an exact power of two. For these reasons, a message-passing code was also required. The MPI library [10] was chosen as a means of writing this code as it is popular, is readily available for many parallel architectures, including the three described above, and is more recent than PVM [4]. On both the Origin and the T3D, proprietary versions of MPI were available. For the Challenge, version 1.0.13 of the freely available MPICH [11] was used.
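The paper does not reproduce the directives it used, so the fragment below is only an illustrative sketch of the style, assuming SGI's C$DOACROSS comment directive; the loop, array names and scoping clauses are invented for the example.

      subroutine scale(n, a, b)
c     Illustrative only: parallelises a simple loop with an SGI
c     comment directive.  A compiler without directive support
c     treats the c$doacross line as an ordinary comment, so the
c     same source still compiles serially.
      integer n, i
      double precision a(n), b(n)
c$doacross local(i), share(n, a, b)
      do 10 i = 1, n
         a(i) = 2.0d0*b(i)
 10   continue
      end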

4 Results

Results are presented for two different problem sizes: 4 × 4, the smallest runnable problem size; and 8 × 8, one of the largest. Timings shown are measured over 4 iterations (rather than the usual 400 iterations, which can take several hours), starting from iteration 3 (to avoid measuring heavy startup costs). This was shown to be representative of complete execution times. So that execution times were not distorted by periodic operating system activities (or by I/O transfers on the T3D), each measurement was made three times, and the best time taken.

Graphs showing the performance of the codes are shown for the Challenge, T3D and Origin in Figure 1. In all cases, the line marked `Naive Ideal' shows the projected speedup of the serial code, based on perfect linear speedup. Execution times were measured in seconds, so performance is shown in s^-1. The actual execution times for the serial code are shown in Table 1. The results for each platform are now examined separately.

Platform          Problem Size
                   4 × 4    8 × 8
SGI Challenge       9.98    53.06
Cray T3D            9.42    69.11
SGI Origin 2000     1.12     8.40

Table 1. Measured Execution Times for the Serial Code in Seconds

4.1 SGI Challenge

The graphs in Figure 1 for the Challenge show that both the Shared Memory and MPI codes give approximately the naive ideal performance. However, the MPI code performs a little better than the Shared Memory code on both problem sizes. As the Challenge is a true shared memory architecture, this is not what one would expect, since the MPI library essentially introduces a layer of buffering which is not present in the Shared Memory code.

To explain this, the classification of cache misses defined in Sections 5.3 and 8.3 of [6] will be used. In this classification, cache misses are attributable to four causes, namely compulsory (or cold-start), capacity, conflict and coherency. In the case of the MPI code, data is copied from user memory on the sending process to user memory on the receiving process, via the library's own buffers. Although this procedure involves additional memory reads and writes, and uses more memory than the Shared Memory code due to replication, it does mean that each MPI process is always reading from its own copy of the data, thus eliminating all coherency misses. Furthermore, MPI uses point-to-point synchronisation between processors, which can be more efficient than the barrier synchronisation used in the Shared Memory code.

When developing the MPI code, the last optimisation performed was to amalgamate as many small messages as possible into large messages. As shown in [7], this gives a significant gain in performance. Although the MPI library requires more memory for buffering the communications, call overheads are vastly reduced, especially for the larger problem size.

[Figure: six graphs of performance (1/sec) against processors p, each showing a `Naive ideal' line and the measured curves. (a) Problem Sizes 4 × 4 and 8 × 8 for the SGI Challenge (SM-CHAL and MPI-CHAL); (b) Problem Sizes 4 × 4 and 8 × 8 for the Cray T3D (MPI-T3D); (c) Problem Sizes 4 × 4 and 8 × 8 for the SGI Origin 2000 (SM-ORI and MPI-ORI).]

Fig. 1. Performance Graphs

In addition to this, the MPI library performs its own memory management (unlike PVM, which uses user-allocated space), so combining messages can greatly reduce the amount of time spent in Operating System memory management routines. Although the manual combining of messages by user code does introduce another level of buffering, these buffers are written to sequentially from start to finish, and read in the same manner. This will make optimal use of the cache, using all addresses in fetched cache-lines.
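The amalgamation code itself is not shown in the paper; the following is a minimal sketch of the idea under stated assumptions. The array names hxext, hyext and hzext are borrowed from Figure 2, but the routine, the buffer layout and the message tag are invented for the illustration.

      subroutine sendfld(n, hxext, hyext, hzext, buf, dest, comm)
c     Illustrative sketch: three small per-grain field arrays are
c     packed into one scratch buffer and sent as a single message,
c     so the MPI call overhead is paid once instead of three times.
      implicit none
      include 'mpif.h'
      integer n, dest, comm, ierr, i
      double precision hxext(n), hyext(n), hzext(n), buf(3*n)
c     pack contiguously; the buffer is written (and later read)
c     sequentially, using every address in each fetched cache line
      do 10 i = 1, n
         buf(i)       = hxext(i)
         buf(n + i)   = hyext(i)
         buf(2*n + i) = hzext(i)
 10   continue
c     one send instead of three
      call MPI_SEND(buf, 3*n, MPI_DOUBLE_PRECISION, dest, 0,
     &              comm, ierr)
      end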

4.2 Cray T3D

When running the smaller problem size on the T3D, performance levels off at 16 processors. This is due to the tiling strategy employed by the MPI code. In order to simplify communications, the problem area is divided up by allocating a whole number of the larger cells to each processor. This means that for the problem size 4 × 4, any processors used above sixteen will be idle, so performance cannot increase further. As with the Challenge, the MPI code exhibits superlinearity for the larger problem size.

4.3 SGI Origin 2000

The results for the MPI code with the 8 × 8 problem size on the Origin produce an unusual performance curve: the performance on three processors is almost double that of the naive ideal. To explain this, absolute performance (measured in MFlop s^-1) must be examined. The number of floating point operations required for the two problem sizes was measured using hardware counters on a fourth machine, a Cray J90 (using optimised, but not vectorised, code). The 4 × 4 problem size requires 138.5 million floating point operations to solve; the 8 × 8 problem size, 659.7 million. Using these figures, the absolute performance of the serial code on the two problem sizes 4 × 4 and 8 × 8 was calculated (for example, 138.5 Mflop divided by the 9.98 s serial time from Table 1 gives 13.9 MFlop s^-1 for the 4 × 4 problem on the Challenge). The calculated values are shown in Table 2, along with figures for a third problem size, 7 × 7, which required 490.3 million floating point operations to solve.

Machine           Performance (MFlop s^-1)
                   4 × 4    7 × 7    8 × 8
SGI Challenge       13.9     14.2     12.4
Cray T3D            14.7     18.7      9.6
SGI Origin 2000    123.7    151.5     78.5

Table 2. Absolute Performance of the Serial Code in MFlop s^-1

If a problem is small enough, the entire data set will fit into level-one cache. As the problem size is increased, it will become too big for level-one cache, leading to capacity misses, resulting in a drop in performance. Further increases in size will lead to level-two cache being exceeded (where it exists), resulting in level-two capacity misses. For each machine, performance drops considerably from the 4 × 4 problem size to the 8 × 8 (around 35% on the T3D and Origin, but only 10% on the Challenge). These results could be attributable to increasing numbers of capacity misses but, on each machine, performance increases from the 4 × 4 problem size to the 7 × 7, which does not fit this explanation. Furthermore, on the T3D, there is no level-two cache and only 8kb of level-one cache, which is not large enough even for one processor's part of the 4 × 4 problem size. Similarly, for the Challenge and the Origin machines, the level-two cache is not sufficiently large to accommodate the 7 × 7 problem size, and so increased capacity misses cannot be responsible for the drop in performance when moving to problem size 8 × 8. These observations show that there is another effect at work.

To explain the results, the tiling strategy used by the MPI code must be considered in detail. As described in Section 4.2, the MPI code divides the problem up by allocating a number of large cells to each MPI process. Let s be the number of large cells in the problem area, and p be the number of MPI processes. The tile size n is defined as ⌈s/p⌉. Each processor is allocated n large cells, until there are fewer than n cells remaining; these are given to the next processor. This tiling system is simple, and guarantees that all the work is shared out completely among the processors.

The most executed subroutine of the serial Fortran code (responsible for over 80% of execution time) works on many large arrays whose dimensions depend upon the problem size, s. Hence the dimensions of the arrays used by the MPI code depend on n. Due to the nature of the loops in this subroutine, an example of which can be seen in Figure 2, performance will decrease dramatically if arrays such as rd11, rd12, etc. are aligned so that they cause cache conflict misses. For this code, arrays are most likely to cause conflict misses if they start on addresses which map to the same cache line. This will happen when n contains sufficient factors of two: the more factors of two, the more arrays will cause cache conflict misses.

A graph showing how the factors of two in n varied with the number of processors used on the 8 × 8 problem size was drawn. An inverted y-axis was used, as these factors inversely affect performance. This graph is shown in Figure 3, next to one showing performance divided by processors for the 8 × 8 problem size on the Origin; this gives both graphs a flat base. The correlation between these two graphs is clear, especially up to ten processors. After this point, load imbalance becomes increasingly significant. Similar graphs can be drawn for the 4 × 4 and 7 × 7 problem sizes, which show that these cache conflict misses affect the 8 × 8 problem size the most, and the 7 × 7 problem size the least. This is consistent with the observed changes in performance.

Although this effect is most apparent on the Origin, it turns out to be present, but less evident, on the other architectures. For the Challenge, the difference in performance between problem sizes is far smaller, and so the fluctuations would be smaller; it is also difficult to extrapolate a pattern from only four processors.

         do 5000 igr=istart,iend
            igrmod=igr-istart+1
            dtmpx=dble(hxext(igrmod))
     +            +dble(rd11(igrmod))*dmx(igr)
     +            +dble(rd12(igrmod))*dmy(igr)
     +            +dble(rd13(igrmod))*dmz(igr)
            dtmpy=dble(hyext(igrmod))
     +            +dble(rd21(igrmod))*dmx(igr)
     +            +dble(rd22(igrmod))*dmy(igr)
     +            +dble(rd23(igrmod))*dmz(igr)
            dtmpz=dble(hzext(igrmod))
     +            +dble(rd31(igrmod))*dmx(igr)
     +            +dble(rd32(igrmod))*dmy(igr)
     +            +dble(rd33(igrmod))*dmz(igr)
            ...
 5000    continue

Fig. 2. Sample Loop from MPI Code
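The following short program (not part of the subject code) tabulates the tile size n = ⌈s/p⌉ and its factors of two for the 8 × 8 problem, which has s = 64 large cells; the program name and output format are illustrative. It reproduces the quantity plotted in Figure 3(b): for example, p = 3 gives n = 22 (one factor of two), whereas p = 4 gives n = 16 (four factors of two).

      program tilefac
c     Tabulates the MPI code's tile size n = ceiling(s/p) and the
c     number of factors of two it contains, for the 8 x 8 problem
c     (s = 64 large cells) on 1 to 14 processors.
      integer s, p, n, m, nfac
      s = 64
      do 20 p = 1, 14
         n = (s + p - 1)/p
         nfac = 0
         m = n
 10      continue
         if (mod(m, 2) .eq. 0) then
            nfac = nfac + 1
            m = m/2
            go to 10
         end if
         write (*, *) 'p =', p, '  n =', n, '  factors of 2 =', nfac
 20   continue
      end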

[Figure: (a) Problem Size 8 × 8: performance / processors (1/sec) against processors p for the `Naive ideal', SM-ORI-8x8 and MPI-ORI-8x8 curves; (b) Factors of 2 in Tile Size against processors, plotted on an inverted y-axis.]

Fig. 3. Adjusted Performance for SGI Origin 2000 alongside `Factors of 2' Graph

On the T3D, jobs must be run on a number of processors which is an exact power of two. From the graphs in Figure 3, this will always lead to the poorest performances, and so any fluctuations which might occur are hidden. This last point can be verified by Figure 4, which shows the performance of the code on the 8 × 8 problem size for the T3D and the Origin, but only for 1, 2, 4 and 8 processors; the graphs are almost identical in shape. As the layout of the data in the Shared Memory code is independent of the number of processors, and is the same layout used in the serial code, the effect is always present, yielding consistently poor performance.

[Figure: performance (1/sec) against processors p at 1, 2, 4 and 8 processors, each panel showing the `Naive ideal' line and the measured MPI curve for the 8 × 8 problem size. (a) Cray T3D (MPI-T3D-8x8); (b) SGI Origin 2000 (MPI-ORI-8x8).]

Fig. 4. Performance Graphs for MPI Code on Problem Size 8 × 8 (Selected Points)

5 Conclusions

Deciding how best to present parallel performance results is a non-trivial task. A bad choice of what to measure, or of how to present these measurements, can lead to information being hidden or distorted, either accidentally, as discussed in [3], or deliberately, as criticised in [1]. If the initial performance graphs in Figure 1 had been plotted showing only times for 1, 2, 4, 8, etc. processors, then the drop in performance for the 8 × 8 problem size would have gone unnoticed. Although potentially misleading, this practice is not mentioned in either [1] or [3]. As they are, the graphs shown are adequate for conveying intra-problem-size performance information, but they allow no inter-problem-size comparisons. Unfortunately, accurate measurement of absolute performance (i.e. in MFlop s^-1) is not always possible, as hardware counters for floating point operations are not common, although it would seem wise to use absolute scales wherever possible.

It is already well documented that the use of complex memory hierarchies can lead to unexpected performance results which are hard to explain. On cache-coherent parallel computers, the extended memory hierarchy introduces coherency cache misses, but the effects witnessed above on the Origin turn out to be due to conflict misses which are also present in the serial code. As a future study, it would be interesting to use the hardware counters that are present in the Origin 2000 to perform a detailed overhead analysis. Such an analysis could try to quantify the effect of cache conflict misses, so that the performance figures can be explained more precisely.

Although MPI introduces an extra communication layer, it turns out to be a very efficient way of programming shared memory architectures for this type of code. This is especially true when smaller messages are combined, reducing the number of calls to both the MPI library and Operating System memory management routines. Additional buffering is introduced, but this inefficiency is offset by the performance gains which are achieved.

Acknowledgements

The authors would like to thank Dr. Jim Miles of the University of Manchester and Prof. Roy Chantrell of the University of Wales, Bangor, for arranging CPU time on the Cray T3D at the Edinburgh Parallel Computing Centre (EPCC), and Dr. Nigel John of Silicon Graphics, who arranged for exclusive use of an SGI Origin 2000.

References

1. Bailey, D. H. (1992) Misleading Performance Reporting in the Supercomputing Field, Scientific Programming, vol. 1, no. 2, pp. 141-151.
2. Barnes, J. and Hut, P. (1986) A Hierarchical O(N log N) Force-Calculation Algorithm, Nature, vol. 324, no. 4, pp. 446-449.
3. Crowl, L. A. (1994) How to Measure, Present, and Compare Parallel Performance, IEEE Parallel and Distributed Technology, vol. 2, no. 1, pp. 9-25.
4. Geist, G., et al. (1993) PVM 3 User's Guide and Reference Manual, Technical Report ORNL/TM-12187, Oak Ridge National Laboratories, Oak Ridge, Tennessee.
5. Greengard, L. and Rokhlin, V. (1987) A Fast Algorithm for Particle Simulations, Journal of Computational Physics, vol. 73, pp. 325-348.
6. Hennessy, J. L. and Patterson, D. A. (1996) Computer Architecture: A Quantitative Approach (Second Edition), Morgan Kaufmann Publishers Inc., San Mateo, California.
7. MacLaren, J. M. (1997) Parallelising Serial Codes: A Comparison of Three High-Performance Parallel Programming Methods, MPhil Thesis, Department of Computer Science, University of Manchester.
8. Mallinson, J. C. (1987) On Damped Gyromagnetic Precession, IEEE Transactions on Magnetics, vol. MAG-23, no. 4, pp. 2003-2004.
9. Miles, J. J. and Middleton, B. K. (1991) A Hierarchical Micromagnetic Model of Longitudinal Thin Film Recording Media, Journal of Magnetism and Magnetic Materials, vol. 95, pp. 99-108.
10. Message Passing Interface Forum (1994) MPI: A Message-Passing Interface Standard, International Journal of Supercomputer Applications and High Performance Computing, vol. 8, nos. 3 and 4.
11. MPICH World Wide Web Home Page: http://www.mcs.anl.gov/mpi/mpich.