Capability Computing: Achieving Scalability on over 1000 Processors

Joachim Hein and Mark Bull
EPCC, The University of Edinburgh, Mayfield Rd, Edinburgh EH9 3JZ, Scotland, UK

May 2, 2003

Abstract

HPCx is the UK's largest High Performance Computing Service, consisting of 40 IBM Regatta-H SMP nodes, each containing 32 POWER4 processors. The main objective of the system is to provide a capability computing service for a range of key scientific applications, i.e. a service for applications that can utilise a significant fraction of the resource. To achieve this capability computing objective, applications must be able to scale effectively to around 1000 processors. This presents a considerable challenge, and requires an understanding of the system and its bottlenecks. In this paper we present results from a detailed performance investigation on HPCx, highlighting potential bottlenecks for applications and how these may be avoided. For example, we achieve good scaling on a benchmark code through effective use of environment variables, level 2 cache and less populated logical partitions.
1 Introduction

HPCx is the UK's newest and largest National High Performance Computing system, comprising 1280 IBM POWER4 processors and delivering up to 3.2 Tflop/s sustained performance. It is currently ranked number 9 in the Top 500 list of supercomputers (www.top500.org). The system has been funded by the British Government through the Engineering and Physical Sciences Research Council (EPSRC). The project is run by the HPCx Consortium, led by the University of Edinburgh (through the Edinburgh Parallel Computing Centre, EPCC), with the Central Laboratory for the Research Councils in Daresbury (CLRC) and IBM as partners.

The main objective of the system is to provide a world-class capability computing service for the UK scientific community. Achieving effective scaling on over 1000 processors for the broad range of application areas studied in the UK, such as materials science, atomic and molecular physics, computational engineering and environmental science, is a key challenge of the service. To meet this challenge, we require a detailed understanding of the system and its bottlenecks. Hence in this paper we present results from a detailed performance investigation on HPCx, using a simple iterative Jacobi application. This highlights a number of potential bottlenecks and how these may be avoided.
Table 1: Cache sizes on a logical partition

  Cache level   Size      Shared between
  Level 1       32 kB     1 processor
  Level 2       1440 kB   2 processors
  Level 3       128 MB    8 processors
In section 2 we provide an overview of the HPCx system and in section 3 we give details of the Jacobi application. Section 4 considers the most effective utilisation of a logical partition, section 5 examines efficient cache utilisation, section 6 explores the influence of the MPI protocol, and section 7 considers the most effective utilisation of a frame, while section 8 examines variations in run times. Finally, sections 9 and 10 provide conclusions and outline future work.
2 The HPCx system

HPCx consists of 40 IBM p690 Regatta-H frames. Each frame has 32 POWER4 processors with a clock speed of 1.3 GHz. This provides a peak performance of 6.6 Tflop/s and up to 3.2 Tflop/s sustained performance. The frames are connected via IBM's SP Colony switch. Within a frame, the processors are grouped into 4 multi-chip modules (MCMs) of 8 processors each. In order to increase the communication bandwidth of the system, the frames have been divided into 4 logical partitions (lpars), coinciding with the MCMs. Each lpar is operated as an 8-way SMP, running its own copy of the AIX operating system. The processors inside an lpar share the same memory architecture; in particular there is only a single bus to main memory and a single level 3 cache per lpar or MCM. We review the cache sizes of the HPCx system in table 1.

For a modern cache-based architecture, memory bandwidth is one of the scarcest resources. This is one of the key reasons why HPCx allows only a single application on an lpar at any given time. Selecting the number of processors per lpar that gives optimum performance for the application under consideration is a matter of tuning. In section 4 we discuss this in detail for our case study.
3 The Jacobi application

The code inverts a simple lattice Laplacian in 2 dimensions using an iterative Jacobi algorithm. It is a Fortran 90 code which is parallelised using MPI. The performance and scaling of this code has been examined on a range of processor combinations, using up to 1024 processors. We study a number of optimisation techniques and suggest possible improvements applicable to a range of application codes on HPCx.

The matrix edges is defined as the lattice Laplacian applied to the matrix image,

    edges_{m,n} = (\Delta image)_{m,n}                                                             (1)
                = image_{m-1,n} + image_{m+1,n} + image_{m,n-1} + image_{m,n+1} - 4 image_{m,n}.   (2)

Given the matrix edges, the matrix image can be determined iteratively (Jacobi algorithm):

    image^{(k+1)}_{m,n} = 1/4 ( image^{(k)}_{m-1,n} + image^{(k)}_{m+1,n} + image^{(k)}_{m,n-1} + image^{(k)}_{m,n+1} - edges_{m,n} ).   (3)

We start with image^{(0)} = 0. This benchmark code contains typical features of a field theory with nearest-neighbour interactions.
The simplicity of this code enables us to observe and identify causes of poor performance that are difficult to isolate in real production applications. The current version does not contain any global communications; studying global communications is ongoing work, see section 10. The point-to-point communication is implemented using MPI_SENDRECV (a sketch of such a halo exchange is given below). The modules have been compiled using version 8.1 of the IBM XL Fortran compiler with the options -O3 and -qarch=pwr4. We have been using version 5.1 of the AIX operating system.
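To make the communication pattern concrete, the following is a minimal sketch of how a halo exchange with MPI_SENDRECV could look for the n direction of the decomposition. The routine name, the neighbour ranks and the communicator are illustrative assumptions; the benchmark's actual exchange routine is not reproduced in this paper.

  subroutine exchange_halo_n(image, my_msize, my_nsize, left, right, comm)
    ! Hedged sketch of a halo exchange in the n direction using MPI_SENDRECV.
    ! "left" and "right" are the ranks of the neighbouring MPI tasks
    ! (MPI_PROC_NULL at the domain boundary); "comm" is the communicator.
    use mpi
    implicit none
    integer, intent(in)  :: my_msize, my_nsize, left, right, comm
    real, intent(inout)  :: image(0:my_msize+1, 0:my_nsize+1)
    integer :: ierr, status(MPI_STATUS_SIZE)

    ! send the last owned column to the right neighbour,
    ! receive the left halo column (columns are contiguous in Fortran)
    call MPI_SENDRECV(image(0, my_nsize), my_msize+2, MPI_REAL, right, 1, &
                      image(0, 0),        my_msize+2, MPI_REAL, left,  1, &
                      comm, status, ierr)

    ! send the first owned column to the left neighbour,
    ! receive the right halo column
    call MPI_SENDRECV(image(0, 1),          my_msize+2, MPI_REAL, left,  2, &
                      image(0, my_nsize+1), my_msize+2, MPI_REAL, right, 2, &
                      comm, status, ierr)
  end subroutine exchange_halo_n

An analogous exchange would be needed for the m direction of a two-dimensional decomposition; since rows are not contiguous in memory, that exchange would typically use a derived datatype or copy the halo rows into a buffer first.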
[Figure 1 omitted: log-log plot of wall clock time in seconds against number of processors, for 1, 2, 4 and 8 lpars and problem sizes 840x1008, 1680x2016 and 3360x4032, with reference lines of perfect scaling.]
Figure 1: Wall clock time vs number of processors for a given number of logical partitions. Results are for 2000 iterations.
4 Populating logical partitions
We measured the performance of our application code on three different problem sizes, small = 840x1008, medium = 1680x2016 and large = 3360x4032, on a range of processors and lpars. Details of the code can be found in appendix A. Our results are shown in figure 1. The points give the fastest observed run time out of three or more trials. To guide the eye, we connected runs on the same number of lpars. The straight lines give "lines of perfect scaling"; they are separated by factors of two.

We start the discussion with the results for a single lpar and the medium problem size (1680x2016). By increasing the number of active processors on the lpar, we note a drop in efficiency to slightly less than 50% [1]. This pattern is observed for all numbers of lpars and problem sizes. When using large numbers of processors per lpar, the computation becomes faster and the data is required at a higher rate; however, the memory system is unable to deliver data at this rate. In this context it is interesting to compare the results for 8 processors and different numbers of lpars for the medium problem size.

[1] The result for a single processor lies on top of a solid scaling line. The result for 8 processors lies slightly above the neighbouring dashed line.
[Figure 2 omitted: bar chart of run time relative to the "Array Syntax" version for the "Array Syntax", "Do Loop" and "Compact" versions, at problem sizes 420x504, 840x1008, 1680x2016 and 3360x4032.]
Figure 2: Performance comparison of different versions of the update code. This was run using 8 processors on a single lpar.

For example, running an 8 processor job across 8 lpars (i.e. 1 processor per lpar), rather than on 1 lpar (i.e. 8 processors per lpar), reduces the execution time by 35%. With 8 processors across 8 lpars, each processor no longer shares its memory bus and level 3 cache with 7 other processors; the level 2 cache is also no longer shared between two active processors. As a consequence, the processors access their data at a higher rate. However, time on HPCx is charged to users on an lpar basis. Hence the run on 8 lpars is 8 times as expensive but only 35% faster than the single lpar run. In summary, running with a single CPU per lpar is not cost-effective. On HPCx it is advisable to choose the number of processors per lpar which gives the fastest wall clock time for the selected number of lpars. Figure 1 does not give a consistent picture here: depending on the parameters, 7 or 8 processors per lpar appears to be optimal. The above charging model is of course specific to HPCx; for other systems a different strategy might be effective.

When comparing the different single processor results, the run for the small size is 4 times faster than the medium size run. This is expected, since the problem is 4 times smaller. However, the large size is more than 5 times slower than the medium size. This reflects the fact that the large size does not fit into the level 3 cache of a single lpar. When choosing 2 lpars it does fit into level 3 cache, and we observe reasonable scaling between the 2 lpar runs for the medium and large problem sizes.
5 Cache utilisation

For the above investigation the computation code has been programmed in Fortran 90 array syntax and is shown in appendix A. One obvious question is how effective the machine code is that the Fortran compiler generates from array syntax. We compared against the code in appendix B, which implements the same functionality using do-loops.

Looking more closely at the update code with the do-loops, one realises that after reading image(i_m, i_n-1), this value is not needed any more and can be overwritten. This allows us to merge the two loops and to reduce the array update from My_Nsize+2 lines to only 2 lines. We call this the "compact version" of the update code; it is shown in appendix C. By reducing the size of the array update we hope it will fit into level 2 cache at all times. Merging the loops should enable better cache reuse, in particular reducing cache write misses when writing update back into image.

The results of a timing test on HPCx are shown in figure 2. We give results for four different problem sizes. The bars give the fastest observed run time out of 40 trials for 2000 iterations each. For the smallest problem size the data fits into level 2 cache in all cases, and for the intermediate sizes it fits into level 3 cache. In case of the largest problem size the versions using array syntax and do-loops need to move more data than the level 3 cache can hold; due to its reduced memory consumption, the compact version still fits into level 3 cache. In neither case do we observe a significant difference between the version using array syntax and the one using explicit do-loops. However, the compact version is faster in all cases. The improvement is dramatic when running out of level 3 cache: here the compact code is more than 2 times faster. The advantage reduces when running out of level 2 cache, but the saving is still significant.
6 MPI protocol

The environment variable MP_EAGER_LIMIT controls the protocol used for the exchange of messages under MPI. Messages smaller than MP_EAGER_LIMIT use the "eager protocol", which gives a lower latency. The message is sent directly to the receiving process, on the assumption that it has buffer space available to store the message if needed. If this buffer space is exhausted the program will fail, which is one of the potential disadvantages of sending messages eagerly. Messages larger than MP_EAGER_LIMIT use the "rendezvous protocol". These are sent as a small header and a data body. Only the small header is sent directly to the receiving process and buffered if needed, which greatly reduces the need for buffer space compared to the eager protocol. Once buffer space for the data body becomes available, the receiving process requests the data from the sending process. This leads to synchronisation delays, resulting in an increased latency compared to the eager case.

The default and maximum values of MP_EAGER_LIMIT depend on the number of MPI tasks: the larger the number of tasks, the smaller the default and maximum MP_EAGER_LIMIT [2]. The best value is a matter of tuning. The default values are based on conservative assumptions and are very small, to avoid program failure due to buffer space exhaustion. For many applications, however, these values do not provide optimal performance.

Figure 3 shows the effect of setting MP_EAGER_LIMIT to zero and to its maximum. This study used the compact version of the update code in appendix C. Again we give the time needed for 2000 iterations. For larger messages, i.e. smaller numbers of processors, the value of MP_EAGER_LIMIT has little effect on the performance. For 1024 processors we note a dramatic difference between the two settings. The better choice of MP_EAGER_LIMIT pushes the point where the time spent in the communication routine equals the time spent in the calculation routine from 512 processors out to 1024 processors.
It is interesting to study the efficiency of these timings in closer detail, which is shown in figure 4. We define the efficiency as

    E(p) = T(1) / ( p T(p) ),    (4)

where p is the number of processors and T(p) is the time needed to execute the program on p processors. When increasing the number of processors from 1 to 8, that is on a single lpar,
we observe the decrease in efficiency discussed in section 4. However, due to the improvements of section 5, this decrease in efficiency is now only 36%; previously it was more than 50%, see figure 1. When increasing the number of lpars, the update code (computation) shows significant super-linear scaling. This is typical behaviour for a modern cache-based architecture: when using a larger number of logical partitions we get more cache memory. Starting from 32 processors the problem fits into level 3 cache, and for 1024 processors it fits into level 2 cache. This super-linear scaling compensates for most of the overhead associated with the increased communication when running on a larger number of processors. With MP_EAGER_LIMIT set to its maximum value, the efficiency of the total code remains almost level over the whole range. For 1024 processors we observe an execution speed of 0.51 Tflop/s.
[2] If one tries to set MP_EAGER_LIMIT to a value larger than its maximum, the maximum will be used instead. The system notifies the user of this reduction in the LoadLeveler error file; the job will execute correctly.
[Figure 3 omitted: log-log plot of time in seconds against number of processors for the 6720x8064 problem with 8 tasks per lpar, showing the update time and the total and halo-exchange times for MP_EAGER_LIMIT=0 and MP_EAGER_LIMIT set to its maximum.]
Figure 3: Time for the update (computation) and halo exchange (communication) for different values of the eager limit.
[Figure 4 omitted: efficiency against number of processors for the 6720x8064 problem with 8 tasks per lpar, for the update time and the total time with MP_EAGER_LIMIT=0 and MP_EAGER_LIMIT set to its maximum.]
Figure 4: Efficiency for the data from figure 3.
[Figure 5 omitted: average run time in seconds against run number (about 90 runs) for the 6720x8064 problem on 4 lpars, comparing total, update and halo times when the lpars are taken from a single frame and from different frames.]
Figure 5: Run times for the update code (calculation), the halo exchange code (communication) and the sum of both for about 90 subsequent runs. Each run consists of 2000 iterations, on 4 lpars taken either from a single frame or from four different frames.

Considering that the code spends half of its time in communication and is unbalanced with respect to multiplications vs additions [3], we believe this is a good achievement.

[3] The IBM POWER4 processors have two floating point multiply-add units. For optimum performance these require an equal number of multiplications and additions.
7 Populating frames

It is interesting to investigate performance differences between the code executed on lpars of different frames and on lpars belonging to the same frame. We submitted two jobs, one running on the four lpars of a single frame and the other running on the first lpar of four neighbouring frames. To get better information on the fluctuations of the run time, each job executed the code of the case study several times. The results are shown in figure 5.

Figure 5 shows that the time spent in communication is not affected by the choice of lpars. Conversely, the calculation is around 10% slower when selecting the lpars of a single frame. This carries over into the overall program performance. To understand this further, we repeated the experiment for different problem sizes; these results are shown in figure 6. The smallest problem size fits into level 2 cache, the larger sizes fit into level 3 cache. For the smallest problem size, no difference is observed between the two choices of frames. However, running on lpars of different frames proves to be substantially faster when using level 3 cache.

[Figure 6 omitted: bar chart of update run time on 4 lpars, relative to the single-frame case, for problem sizes 840x1008, 1680x2016, 3360x4032 and 6720x8064, comparing lpars taken from the same frame with lpars from different frames.]
Figure 6: Performance comparison of the update time (computation) for different problem sizes, when running on 4 lpars of a single frame or of 4 different frames. The figure reports the fastest observed run time out of 100 trials relative to the time needed when running on a single frame. All times use 8 tasks per lpar.

This behaviour is easy to understand from the hardware. As noted, the frames are designed as 32-way SMPs, split up into 4 lpars operated as 8-way SMPs; however, the hardware that keeps the level 2 caches coherent is still active. Hence, when a level 2 cache miss occurs for a problem size that does not fit into level 2 cache, the other lpars of the frame have to answer the requests for cache coherency. When selecting lpars of a single frame, because of the strongly synchronising nature of our code and the high demands it places on the memory system, these lpars are producing level 2 cache misses as well, resulting in even more coherency requests. As a result the execution time increases. However, when running on lpars of different frames, the remaining lpars of these frames are executing other users' programs, which may not stress the memory system substantially and can therefore answer the coherency requests more quickly.

To summarise, cache coherency on the IBM p690 is carried out in hardware and cannot be disabled by the partitioning software. For capability jobs using a significant fraction of the resource, the timings obtained when using all the lpars of a single frame are the relevant ones: if a job is allocated a large fraction of the available lpars at random, it is likely to be allocated all the lpars of at least one frame, and as the progress of a parallel program is often determined by the slowest processor, this will slow the program down to the performance level of the single-frame scenario.
8 Run time fluctuations

When running our code on a large number of lpars, we observed a wide variation of run times, which cannot be explained by the findings in section 7. Interruptions by operating system daemons and helper tasks of the MPI system might be a cause of this noise; such problems have been reported earlier for IBM SP machines, see e.g. [1, 2, 3]. If these daemons and helpers are indeed the cause, then using only 7 processors per lpar, that is leaving 1 processor free to deal with the interruptions, should reduce the run time variation. As discussed above, each lpar runs its own copy of the operating system.

For a problem size of 6720x8064 we performed two runs of 20000 iterations each using 128 lpars, with 7 resp. 8 tasks per lpar, that is a total of 896 resp. 1024 processors. Figure 7 shows that, for the chosen problem size and number of lpars, running 7 tasks per lpar gives less run time variation and better performance than 8 tasks per lpar. With 8 tasks per lpar, even when averaging over as many as 1001 subsequent iterations, the speed variation is large: between 1487 and 2035 iterations per second. Apart from the higher iteration count, the run with 8 tasks per lpar has the same parameters as the rightmost data point of figure 3 for the better choice of MP_EAGER_LIMIT.
[Figure 7 omitted: two panels of iterations per second against wall clock time (0-11 s) for the 6720x8064 problem on 128 lpars, left panel with 7 tasks per lpar and right panel with 8 tasks per lpar; each panel shows individual iterations and averages over 11, 101, 1001 and 10001 iterations.]
Figure 7: Average number of iterations per second for the 6720x8064 sized problem on 128 lpars. The grey graph gives the speed for the individual iterations. The coloured graphs give the speed averaged over between 11 and 10001 iterations, see legend. On the horizontal axis we give the wall clock time of the run. The left hand plot gives results for 7 tasks per lpar, the right hand plot for 8 tasks per lpar.
The time spent in calculation and the time spent in communication are approximately equal, and the code switches very rapidly between them (of the order of milliseconds each). It is interesting to see whether the above conclusion also holds for a different program regime. For a second pair of runs we increased the problem size to 26880x32256. The results are shown in figure 8. Again, the run using 7 tasks per lpar shows less fluctuation in run time, but in this case the run using 8 tasks per lpar is substantially faster. In summary, the nature of the application and its run time parameters influence the optimal choice of processors per lpar.
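As a hedged illustration of how a 7-tasks-per-lpar job could be requested (the keywords and values below are assumptions about a typical LoadLeveler setup, not taken from this study):

  # Fragment of a LoadLeveler job command file (illustrative):
  # request 128 lpars with 7 MPI tasks on each, leaving one
  # processor per lpar free for operating system daemons
  # @ job_type       = parallel
  # @ node           = 128
  # @ tasks_per_node = 7
  # @ queue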
9 Conclusion

This study highlights the bandwidth from level 3 cache and main memory as one of the most precious resources on the HPCx system. This bottleneck should be kept in mind when writing and optimising applications. A reduction of local workspace inside routines and the merging of loops (loop fusion) have a positive impact on performance. We observe the code generated from Fortran 90 array syntax to be as efficient as that generated for the same algorithm using explicit do-loops.

For codes using MPI, the environment variable MP_EAGER_LIMIT has a large impact on performance and proper tuning is required. In particular, when using large numbers of processors the default value is relatively small and might not be optimal for a given application.

For larger numbers of processors, the data held per processor becomes smaller and the cache architecture of the HPCx system becomes more effective. This can result in super-linear scaling of certain routines, which can compensate for the increased communication overhead.

We have shown that the performance of the system is affected by the activity on the remaining lpars of the same frame. This originates from the hardware keeping the caches of the logical partitions coherent. Variation in run times reduces considerably when using only 7 out of 8 processors per logical partition. However, whether 7 or 8 processors per partition is superior with respect to overall performance depends on the application and its run time parameters.
[Figure 8 omitted: two panels of iterations per second against wall clock time (0-200 s) for the 26880x32256 problem on 128 lpars, left panel with 7 tasks per lpar and right panel with 8 tasks per lpar; each panel shows individual iterations and averages over 11, 101, 1001 and 10001 iterations.]
Figure 8: Average number of iterations per second for the 26880x32256 sized problem on 128 lpars. We used 7 tasks per lpar on the left hand side and 8 on the right hand side. See the caption of figure 7 for further explanation.
10 Future work

For these initial studies, global communications have been removed from the application, and only point-to-point communications have been investigated. Future work will re-introduce the global communications into the application and investigate their performance and scaling on large numbers of processors. Initial studies suggest that global communications scale poorly on the system, and that a performance benefit may be obtained by performing the operation in two stages: within an lpar and then across lpars. This can be achieved using multiple MPI communicators, or by using OpenMP within an lpar and MPI across lpars; a sketch of the communicator-based variant is given below. At SC2003 we will present scaling results for all these types of global communication.
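As a minimal sketch of the communicator-based two-stage idea, assuming that 8 consecutive MPI ranks share an lpar and that the total number of ranks is a multiple of 8 (the routine and variable names are illustrative and not taken from the benchmark):

  subroutine two_stage_sum(local, global)
    ! Hedged sketch: a global sum performed in two stages, first inside
    ! each lpar and then across the lpars, using MPI_COMM_SPLIT.
    use mpi
    implicit none
    real, intent(in)  :: local
    real, intent(out) :: global
    integer :: world_rank, lpar_comm, leader_comm, ierr
    real :: lpar_sum

    call MPI_COMM_RANK(MPI_COMM_WORLD, world_rank, ierr)

    ! stage 1: the 8 tasks of an lpar form their own communicator
    call MPI_COMM_SPLIT(MPI_COMM_WORLD, world_rank/8, world_rank, lpar_comm, ierr)
    call MPI_ALLREDUCE(local, lpar_sum, 1, MPI_REAL, MPI_SUM, lpar_comm, ierr)

    ! stage 2: the first task of each lpar combines the partial sums
    call MPI_COMM_SPLIT(MPI_COMM_WORLD, mod(world_rank, 8), world_rank, leader_comm, ierr)
    if (mod(world_rank, 8) == 0) then
       call MPI_ALLREDUCE(lpar_sum, global, 1, MPI_REAL, MPI_SUM, leader_comm, ierr)
    end if

    ! distribute the result to the remaining tasks inside each lpar
    call MPI_BCAST(global, 1, MPI_REAL, 0, lpar_comm, ierr)

    call MPI_COMM_FREE(leader_comm, ierr)
    call MPI_COMM_FREE(lpar_comm, ierr)
  end subroutine two_stage_sum

Whether such a two-stage scheme actually outperforms the library's own collectives on HPCx is exactly the question the future work is intended to answer.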
Acknowledgements Suggestions and discussions with the other members of the HPCx terascaling team are acknowledged, in particular Stephen Booth, David Henty, Chris Johnson, Gavin Pringle and Lorna Smith.
References

[1] T. Jones, presentation at the ScicomP6 meeting, August 19-23, Berkeley, California; http://www.spscicomp.org/ScicomP6/Presentations/Jones/

[2] P. Wang, presentation at the ScicomP6 meeting, August 19-23, Berkeley, California; http://www.spscicomp.org/ScicomP6/Presentations/Wang/

[3] R. Treumann, presentation at the ScicomP7 meeting, March 4-7, 2003, Göttingen, Germany; http://www.spscicomp.org/ScicomP7/Presentations/Treumann-MPI-Next Chapter.pdf
A Update code in Fortran 90 array syntax

The following is part of a module in Fortran 90 showing the update_image subroutine. The problem has dimensions 1,...,Msize in the m direction and 1,...,Nsize in the n direction. The parameters My_Msize and My_Nsize specify the problem size divided by the number of processors in the m and n directions. These numbers of processors are specified as parameters as well and are therefore known to the compiler at compile time. The arrays edges and image have their ranges extended by a halo of depth 1.

  Real, dimension(0:My_Msize+1, 0:My_Nsize+1) :: edges = 0.0
  Real, dimension(0:My_Msize+1, 0:My_Nsize+1) :: image

contains

  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
  subroutine update_image()
  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    Real, dimension(0:My_Msize+1, 0:My_Nsize+1) :: update = 0.0

    Update(1:My_Msize, 1:My_Nsize) = 0.25 * (   &
          image(0:My_Msize-1, 1:My_Nsize)       &
        + image(2:My_Msize+1, 1:My_Nsize)       &
        + image(1:My_Msize,   0:My_Nsize-1)     &
        + image(1:My_Msize,   2:My_Nsize+1)     &
        - edges(1:My_Msize,   1:My_Nsize) )

    Image = Update
  end subroutine update_image
B Update code with do-loops

One of the questions we asked ourselves is how well the Fortran 90 compiler does at generating code from array syntax. We tried the following alternative using do-loops.

  Real, dimension(0:My_Msize+1, 0:My_Nsize+1) :: edges = 0.0
  Real, dimension(0:My_Msize+1, 0:My_Nsize+1) :: image

contains

  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
  subroutine update_image()
  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    Real, dimension(0:My_Msize+1, 0:My_Nsize+1) :: update = 0.0
    Integer :: i_n, i_m

    n_loop: do i_n = 1, my_nsize
       m_loop: do i_m = 1, my_msize
          update(i_m, i_n) =                 &
             0.25 * ( image( i_m - 1, i_n )  &
                    + image( i_m + 1, i_n )  &
                    + image( i_m, i_n - 1 )  &
                    + image( i_m, i_n + 1 )  &
                    - edges( i_m, i_n ) )
       enddo m_loop
    enddo n_loop

    Do i_n = 1, my_nsize
       Do i_m = 1, my_msize
          Image(i_m, i_n) = Update(i_m, i_n)
       end Do
    end Do
  end subroutine update_image
C Compact version of the update code

A closer look at the code in appendix B shows that in fact only two lines of the array update are needed: once image( i_m, i_n - 1 ) has been read, its value can be overwritten.

  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
  subroutine update_image()
  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    Real, dimension(0:My_Msize+1, 0:1) :: update
    Integer :: i_n, i_m
    integer :: current_upline, previous_upline

    Zero_loop: do i_m = 0, my_msize+1
       update(i_m, 0) = 0.0
    end do Zero_loop

    previous_upline = 0
    n_loop: do i_n = 1, my_nsize
       current_upline = mod( i_n, 2 )
       m_loop: do i_m = 1, my_msize
          update(i_m, current_upline) =        &
             0.25 * ( image( i_m - 1, i_n )    &
                    + image( i_m + 1, i_n )    &
                    + image( i_m, i_n - 1 )    &
                    + image( i_m, i_n + 1 )    &
                    - edges( i_m, i_n ) )
          ! image(i_m, i_n - 1) is no longer needed and can be overwritten
          ! with the result from the previous line of the update
          image( i_m, i_n - 1 ) = update( i_m, previous_upline )
       enddo m_loop
       previous_upline = current_upline
    enddo n_loop

    ! now copy in the last line
    last_line_loop: do i_m = 1, my_msize
       Image(i_m, My_nsize) = Update(i_m, previous_upline)
    end do last_line_loop
  end subroutine update_image