Cautious, Machine-Independent Performance Tuning for Shared-Memory Multiprocessors

Sarah A. M. Talbot, Andrew J. Bennett, Paul H. J. Kelly
Department of Computing, Imperial College of Science, Technology and Medicine, London SW7 2BZ
Abstract. Coherent-cache shared-memory architectures often give disappointing performance which can be alleviated by manual tuning. We describe a new trace analysis tool, clarissa, which helps diagnose problems and pinpoint their causes. Unusually, clarissa works by analysing potential contention, instead of predicting contention by simulating a specific memory system design. This is important because, after tuning, the software will be executed on different inputs and different configurations. The goal is to produce a program with robustly good performance. This paper explains the principle behind cautious trace analysis, describes our implementation, and presents our experience of using the tool.
1 Introduction

There has been considerable recent interest in developing tools to support manual performance optimisation of applications running on coherent-cache shared-memory multiprocessors (e.g. [1, 3]). The purpose of a performance tuning tool is to direct the programmer's attention to where a program is spending its time and to give as much guidance as possible on how to reduce the performance bottlenecks. Existing performance tools measure (using special monitoring circuitry) or predict (using a simulation of the shared memory architecture) the behaviour of the machine for which the program is being developed. Although this is very useful in understanding the factors influencing performance, there are two fundamental problems in using such tools for producing high-quality software:

1. In the field, the software will be run with many different inputs, leading to behaviour different from that seen during tuning, and
2. The software will be installed on hardware with different characteristics from that used during tuning.

In this paper we present an alternative approach, cautious trace analysis, aimed at addressing these problems. The key idea is to identify behaviour which might lead to lost performance on some reasonable architecture, or with different timing assumptions. If we can diagnose and eliminate, or at least minimise, these characteristics, the program should behave well in service.
2 Cache Line Contention in Shared-Memory Systems

A shared-memory multiprocessor consists of several CPUs with associated caches, linked to memory units via an interconnection network. A cache coherency protocol is required to ensure that CPUs do not use stale cached data. In addition to the overheads of maintaining coherence, such architectures can suffer from three problems: contention for nodes, cache lines and communication links. These all conspire to increase memory access times, and hence slow down the execution of tasks running on the processors. The challenge is to minimise the causes of contention, i.e. to keep data in the local cache whenever possible and to avoid using the network. We use the following definitions of cache line sharing:
Active sharing: a data item is accessed by more than one processor during the execution of a program.
False sharing: processors share a cache line without sharing any data item within the line. With an invalidation-based protocol, a write to an item in a shared cache line requires all other copies of that cache line to be invalidated, even if the other processors never use the data item which was changed.

Passive sharing: shared data remains in a processor's cache even though no object on the cache line will be accessed by that processor again [2]. Since a write by another processor to any item in that cache line will require all other copies of the cache line to be invalidated, it is desirable that the redundant cache line is ejected after its last use.
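As a minimal illustration (ours, not from the paper), the following C fragment exhibits false sharing under an invalidation protocol: two threads update logically independent counters that happen to sit on the same cache line, so each write invalidates the other CPU's copy. The 64-byte line size and the padded variant are illustrative assumptions.

    /* Illustrative sketch: false sharing with POSIX threads.
     * Compile with: cc -O2 -pthread falseshare.c */
    #include <pthread.h>
    #include <stdio.h>

    #define LINE 64                    /* assumed cache line size */

    struct counters {
        long a;                        /* written by thread 0 */
        long b;                        /* written by thread 1: same line as a */
    };

    struct padded_counters {           /* the fix: pad b onto its own line, */
        long a;                        /* eliminating the invalidation      */
        char pad[LINE - sizeof(long)]; /* ping-pong between the two caches  */
        long b;
    };

    static struct counters shared;

    static void *bump_a(void *arg) {
        (void)arg;
        for (long i = 0; i < 10000000; i++) shared.a++;
        return NULL;
    }

    static void *bump_b(void *arg) {
        (void)arg;
        for (long i = 0; i < 10000000; i++) shared.b++;
        return NULL;
    }

    int main(void) {
        pthread_t t0, t1;
        pthread_create(&t0, NULL, bump_a, NULL);
        pthread_create(&t1, NULL, bump_b, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        printf("a=%ld b=%ld\n", shared.a, shared.b);
        return 0;
    }

Replacing struct counters with struct padded_counters leaves the program's meaning unchanged but removes the false sharing, which is exactly the kind of transformation the tool is intended to motivate.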
These characteristics interact, and are affected by the memory access characteristics of a particular program and the shared-memory architecture. Cache line size is particularly relevant: larger cache lines would allow many objects to be allocated on each cache line, which could be helpful if an application has locality of access. However, the larger line size can lead to contention for cache lines, especially if false sharing plays a significant role in the behaviour of an application.
3 Cautious Trace Analysis and CLARISSA

The analysis process operates as a sequence of phases, in order to reflect barrier synchronisations (which prevent events occurring on different sides of a barrier from overlapping), to limit the amount of analysis time and space required, and to prune overlaps which are unlikely because they appear at widely-differing times in the trace (such overlaps are possible, since no synchronisation prevents them, but they are unlikely). Essentially, what needs to be considered is which events could possibly occur in the same phase. Whatever the hardware configuration, the events for a particular CPU will always occur in the same order, but the order in which events occur between different CPUs can vary. In the example of Fig. 1, an event in CPU 0 may occur before or after any event in, say, CPU 1 from the start of the program up to the first barrier synchronisation. However, it is not possible for events occurring before the first barrier to overlap with events after that barrier.
Fig. 1. An example execution path: four CPUs (CPU 0 to CPU 3) proceed from start to end through barrier 1, a lock 1/unlock 1 critical section, and barrier 2, with time running downwards.
Considering only barriers, Fig. 1 has three phases. Sharing can occur across barriers, but it is not currently part of our analysis. In programs where barriers are used regularly to ensure synchronisation, such as mp3d (Sect. 4), cautious trace analysis between barriers shows the sharing effects that may occur within the weak ordering programming model. In applications where there are few or no barriers, the analysis becomes so broad that clarissa is likely to over-report the potential sharing, and the volume of data and phase-end processing will become problematic. In such cases, we introduce fixed time-slots. The length of the time-slots depends on the overall length of the program. Too short a time-slot will result in some effects being missed and give too fine a level of summary information, whereas too long a time-slot will tend to over-report sharing and contention, and give a summary which is too coarse. In addition, edge effects have to be allowed for, i.e. the analysis must take into consideration sharing effects which cross a time-slot boundary. In Sect. 5 an example is given of using time-slot analysis (with overlaps) to tune the performance of an application which makes little use of barrier synchronisation. Cautious analysis also has to allow for the use of locks. For example, in Fig. 1, if each CPU only reads and writes a particular data item when it holds lock 1 then, although there is still active sharing of the item between the barriers, the programmer has protected the data item from the possibility of simultaneous update by two or more processors.
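The following minimal C sketch shows the barrier-phase idea; the trace format and event names are our assumptions, not clarissa's actual input. Events between consecutive "all CPUs arrived at a barrier" points belong to the same phase, and only events within one phase are candidates for overlap analysis.

    /* Illustrative sketch: splitting a merged trace into
     * barrier-delimited phases. */
    #include <stdio.h>

    enum ev_kind { EV_READ, EV_WRITE, EV_BARRIER };

    struct event {
        int cpu;
        enum ev_kind kind;
        unsigned long addr;    /* cache line address for READ/WRITE */
    };

    #define NCPUS 4

    int main(void) {
        /* a tiny hand-written trace for demonstration */
        struct event trace[] = {
            {0, EV_WRITE, 0x1000}, {1, EV_READ, 0x1000},
            {0, EV_BARRIER, 0},    {1, EV_BARRIER, 0},
            {2, EV_BARRIER, 0},    {3, EV_BARRIER, 0},
            {2, EV_WRITE, 0x2000}, {3, EV_READ, 0x2000},
        };
        int n = (int)(sizeof trace / sizeof trace[0]);
        int arrived = 0;       /* CPUs that have reached the barrier */
        int phase = 0;

        for (int i = 0; i < n; i++) {
            if (trace[i].kind == EV_BARRIER) {
                /* a phase boundary once every CPU has arrived */
                if (++arrived == NCPUS) { arrived = 0; phase++; }
            } else {
                printf("phase %d: cpu %d %s line 0x%lx\n", phase,
                       trace[i].cpu,
                       trace[i].kind == EV_WRITE ? "writes" : "reads",
                       trace[i].addr);
            }
        }
        return 0;
    }

For time-slot analysis the same loop would cut phases on event timestamps instead of barrier arrivals, with a configurable overlap at each boundary to catch the edge effects described above.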
3.1 Using CLARISSA

The clarissa tool is based on [5]. Input parameters include cache line size, class threshold (the N value in Table 1), phase type (barrier or time-slot), time-slot length and overlap. A classification system is needed for summarising the wealth of data. Table 1 gives the classification used in the SM-prof performance debugging tool, which reports cache line accesses for fixed time-slots in terms of read or write accesses and the number of CPUs involved [1]. In clarissa, an enhanced version of this categorisation is used, where the sharing categories (ending in E/F/M) are further split according to active or false sharing.
Table 1. SM-prof classification of cache line accesses [1]

class  degree of sharing  access mode  comments
UNR    none               none         no processor referenced the cache line
ROE    exclusive          read only    one processor has done a read operation, but no write operation, to the cache line
ROF    shared by few      read only    i processors (1 < i < N) have done read operations, but no write operations, to the cache line
ROM    shared by many     read only    N or more processors have done read operations, but no write operations, to the cache line
RWE    exclusive          read/write   one processor has performed a read-modify-write sequence on the cache line
RWF    shared by few      read/write   i processors (1 < i < N) have performed read-modify-write sequences on the cache line
RWM    shared by many     read/write   N or more processors have performed read-modify-write sequences on the cache line
The results are used to provide histograms of sharing activity for each phase during the execution of the program.
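A minimal C sketch of the classification follows; the function name and counts are our assumptions (clarissa's internals are not shown in the paper), and we approximate the degree of sharing by the larger of the reader and writer counts.

    /* Illustrative sketch: pick a Table 1 class for one cache line in
     * one phase, given how many distinct CPUs read it and how many
     * wrote it, with class threshold N. */
    #include <stdio.h>

    static const char *classify(int readers, int writers, int N) {
        int sharers = readers > writers ? readers : writers;
        if (sharers == 0) return "UNR";    /* line never referenced */
        if (writers == 0)                  /* read-only classes     */
            return sharers == 1 ? "ROE" : (sharers < N ? "ROF" : "ROM");
        return sharers == 1 ? "RWE" : (sharers < N ? "RWF" : "RWM");
    }

    int main(void) {
        printf("%s\n", classify(3, 0, 8)); /* ROF: few readers       */
        printf("%s\n", classify(9, 2, 8)); /* RWM: many sharers      */
        printf("%s\n", classify(1, 1, 8)); /* RWE: one CPU, r/w      */
        return 0;
    }

In clarissa's enhanced scheme each F/M class would additionally carry an active/false flag, yielding the ROFA/ROFF-style categories that appear in the phase level graphs below.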
4 MP3D

mp3d is a particle-based wind tunnel simulation from SPLASH [4]. It is used to study the shock waves created as an object flies at high speed through the upper atmosphere. It was run for 30000 molecules, using the supplied geometry file test.geom, for 10 time-steps. Two large arrays of structures account for more than 99% of the static data space used by mp3d; the first stores the state information for each particle and the second stores the properties of each cell in the active space. The way mp3d uses barriers for synchronisation within each step meant that clarissa barrier analysis was the most appropriate, and the resulting phase level graphs are shown in Fig. 2. These graphs, in conjunction with the summary and detail level sharing information generated by clarissa, showed that the dominant sharing was active sharing of the cells array. For many data items within the cells array, more than one processor updates the same data item between barriers, and this generates a high number of coherency protocol invalidation messages. To avoid the active sharing of cells, mp3d was modified so that the scheduling of work for the CPUs was driven by cells rather than particles. When the trace output for mp3d-new was analysed by clarissa, there was a substantial reduction in the active sharing of the cells array, and false sharing had also been reduced (similar changes have been made by other researchers, e.g. [1], but we made the change because it was specifically indicated by the active sharing information from clarissa). This is illustrated by the active and false sharing phase level graphs shown in Fig. 2.
Fig. 2. Phase level graphs for mp3d-orig and mp3d-new: access counts per phase under barrier analysis, for the false sharing classes (ROFF, RWFF, ROMF, RWMF) and the active sharing classes (ROFA, RWFA, ROMA, RWMA).

Fig. 3. Relative performance and cache miss rates for mp3d: speedup against number of processing elements for ideal memory, mp3d-new and mp3d-orig, together with the cache miss rates:

             miss rate (%)
no. of pe's  mp3d-orig  mp3d-new
 1            0.1682     0.2262
 2            2.0008     0.3486
 4            3.0215     0.4420
 8            3.5685     0.5549
16            3.8490     0.7858
32            4.0972     1.1188
The two different versions, mp3d-orig and mp3d-new, were run on an execution-driven simulator to obtain cc-numa execution timings, cache miss rates and relative performance. The results are shown in Fig. 3. The best speedup is achieved by mp3d-new. Simulation statistics confirm that the miss rate is substantially lower for mp3d-new, so the strategy of reducing sharing has been successful. However, mp3d-new starts to lose its "edge" at 32 CPUs: the drop in performance is believed to be due to poor load balancing given the relatively small problem size (30000 molecules) used for these tests.
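To make the restructuring concrete, here is a minimal sketch; it is our illustration, not mp3d's actual source, and the data layout and names are assumptions. In the particle-driven organisation any CPU may update any cell, so two CPUs can collide on the same cell (and cache line); in the cell-driven organisation each cell has a single owner, so cell updates stay local to one cache.

    /* Illustrative sketch: particle-driven vs cell-driven scheduling. */
    #define NCELLS 1024
    #define NPART  30000
    #define NCPUS  8

    struct particle { int cell; /* ... particle state ...   */ };
    struct cell     { int count; /* ... cell properties ... */ };

    struct particle particles[NPART];
    struct cell     cells[NCELLS];

    /* Original: CPUs partition the particles; writes to cells[] are
     * scattered across CPUs, so cells are actively shared. */
    void step_particle_driven(int cpu) {
        for (int p = cpu; p < NPART; p += NCPUS)
            cells[particles[p].cell].count++;   /* contended update */
    }

    /* mp3d-new style: CPUs partition the cells; each cell is updated
     * by exactly one CPU, eliminating active sharing of cells[].
     * (A real code would keep per-cell particle lists rather than
     * scanning every particle for every cell.) */
    void step_cell_driven(int cpu) {
        for (int c = cpu; c < NCELLS; c += NCPUS)
            for (int p = 0; p < NPART; p++)
                if (particles[p].cell == c)
                    cells[c].count++;           /* local to this CPU */
    }

    int main(void) {
        for (int p = 0; p < NPART; p++) particles[p].cell = p % NCELLS;
        step_particle_driven(0);   /* original organisation */
        step_cell_driven(0);       /* mp3d-new organisation */
        return 0;
    }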
5 Computational Fluid Dynamics

cfd is a major application area of high performance computing. The system modelled in our cfd application is a laminar flow in a square cavity with a lid which slides across the cavity, introducing a zone of re-circulatory fluid. cfd uses barriers to ensure that CPU 1 has updated global variables before all the processors move on to the next stage, but there are long periods without a barrier [6]. Time-slot analysis was therefore appropriate, and the time-slot length was chosen to give around 100 slots over the execution time, i.e. to give a reasonably detailed profile without being swamped by too much information. The phase level graphs generated by clarissa are shown in Fig. 4. The graphs, in conjunction with the summary and detail level sharing information, indicated that the most significant sharing was false sharing within the var data structure. The 2-D arrays in var were originally distributed to the CPUs column-wise, and the program was modified to create cfd-new, which uses a square block distribution (see the sketch at the end of this section). Without the help of clarissa, it would only have been possible to pinpoint the performance problem by gaining a thorough knowledge of the application program. When the trace output for cfd-new was analysed by clarissa, there was a substantial reduction in false sharing messages relating to the var data structure, reflected in the improved false sharing phase level graph shown in Fig. 4. Active sharing was increased by the change but, as shown by the performance results below, any coherence overhead incurred by this increase is more than compensated for by the reduction in false sharing. cfd-orig and cfd-new were run on the simulator to obtain timings, cache miss rates and relative performance, shown in Fig. 5. The best speedup running under real memory is achieved by cfd-new. In addition, the simulations showed that the cache miss rate was always lower for cfd-new than for cfd-orig. The reduction in false sharing led to a significant improvement in performance, even though active sharing increased slightly.
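The following minimal sketch shows why the change helps; it is our illustration, assuming row-major storage and a power-of-two grid, and the names are not from cfd. Under a column-wise distribution, every row-major cache line that straddles a strip boundary is shared between two CPUs without any shared data item; a square block distribution shortens the total ownership boundary and hence the opportunity for false sharing.

    /* Illustrative sketch: which CPU owns grid point (i, j) under
     * column-wise and square block distributions, and a rough count
     * of falsely-shareable neighbour pairs. */
    #include <stdio.h>

    #define N     64           /* grid is N x N           */
    #define NCPUS 16           /* assume a 4 x 4 CPU grid */
    #define PROWS 4
    #define PCOLS 4

    /* column-wise: CPUs own vertical strips of N/NCPUS columns */
    int owner_columnwise(int i, int j) {
        (void)i;
        return j / (N / NCPUS);
    }

    /* square blocks: each CPU owns an (N/PROWS) x (N/PCOLS) tile */
    int owner_block(int i, int j) {
        return (i / (N / PROWS)) * PCOLS + (j / (N / PCOLS));
    }

    int main(void) {
        /* count row-major neighbour pairs with different owners: a
         * proxy for cache lines crossing an ownership boundary */
        int cw = 0, bl = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j + 1 < N; j++) {
                cw += owner_columnwise(i, j) != owner_columnwise(i, j + 1);
                bl += owner_block(i, j) != owner_block(i, j + 1);
            }
        printf("boundary-crossing pairs: column-wise %d, block %d\n",
               cw, bl);
        return 0;
    }

With these parameters the column-wise distribution crosses an ownership boundary 960 times along the rows, against 192 for the block distribution, which is consistent with the reduction in false sharing that clarissa reported for cfd-new.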
6 Related Shared-Memory Tools

MemSpy [3] assists in locating bottlenecks by providing detailed information that focuses the programmer's attention on the problem areas in the application. SM-prof [1] is similar to the tool presented here, but has the drawback that it does not distinguish between active and false sharing of cache lines. It also splits a program's execution up into time-slots, but does not allow for boundary effects between adjacent slots; consequently the analysis has to be performed multiple times with different time-slot lengths.
Fig. 4. Phase level graphs for cfd-orig and cfd-new: access counts per time-slot under time-slot analysis, for the false sharing classes (ROFF, RWFF, ROMF, RWMF) and the active sharing classes (ROFA, RWFA, ROMA, RWMA).

Fig. 5. Relative performance and cache miss rates for cfd: speedup against number of processing elements for ideal memory, cfd-new and cfd-orig, together with the cache miss rates:

             miss rate (%)
no. of pe's  cfd-orig  cfd-new
 1            0.0007    0.0007
 2            0.6117    0.0356
 4            1.7589    0.6702
 8            4.0980    0.7320
16            9.2645    1.8799
32           13.2287    2.0094
64           18.9424    4.2586
The information given by clarissa differs from that of existing performance analysis tools because it diagnoses potential contention rather than problems arising on a particular architecture. It also distinguishes between the different types of sharing, i.e. active, passive and false sharing. This is important as the action to be taken depends on the type of sharing that needs to be eliminated. For example, in [1], false sharing is "suspected" in mp3d, whereas clarissa reports false sharing precisely. Similarly, MemSpy can report that cache misses are high for a particular variable, but cannot say whether this is due to active, false or passive sharing. Finally, clarissa allows the analysis to be carried out on a barrier or time-slot basis, so that the timing of phases is appropriate for a particular application.
7 Conclusions

clarissa is a new tool which has been shown to be effective in analysing the cache-line sharing effects in shared-memory parallel applications. It uses a novel approach, cautious trace analysis, to locate potential cache line contention rather than measuring actual contention in a specific memory system design. We have shown how it was used to greatly improve the performance and cache behaviour of two scientific programs. As further work, we plan to enhance clarissa to provide boundary analysis of cache lines, i.e. to cater for false sharing effects that may not show up for a particular problem size because data structures happen to align with cache line boundaries. In addition, the performance of clarissa should be improved by closer integration with the simulator.
Acknowledgements. This work was funded by the U.K. Engineering and Physical Sciences Research Council through an Advanced Course Studentship and project GR/J 99117. Enormous thanks are due to Ashley Saulsbury for allowing us to use his simulator.
References

1. Mats Brorsson. SM-prof: A tool to visualise and find cache coherence performance bottlenecks in multiprocessor programs. In Proceedings of ACM SIGMETRICS/Performance '95, pages 178–187, May 1995.
2. Susan J. Eggers and Randy H. Katz. A characterisation of sharing in parallel programs and its application to coherency protocol evaluation. In 15th Annual International Symposium on Computer Architecture, Honolulu; Computer Architecture News, 16(2):373–382, May 1988.
3. M. Martonosi, A. Gupta, and T. Anderson. Tuning memory performance of sequential and parallel programs. IEEE Computer, 28(4):32–40, April 1995.
4. Jaswinder Pal Singh, Wolf-Dietrich Weber, and Anoop Gupta. SPLASH: Stanford parallel applications for shared-memory. Computer Architecture News, 20(1):5–44, March 1992.
5. Sarah A. M. Talbot. Performance tuning of programs for shared-memory multiprocessors. Master's thesis, Department of Computing, Imperial College, London, U.K., 1995.
6. B. A. Tanyi. Iterative Solution of the Incompressible Navier-Stokes Equations on a Distributed Memory Parallel Computer. PhD thesis, UMIST, 1993.