Profiling Parallel Performance using Vampir and Paraver

Andrew Sunderland, Andrew Porter
STFC Daresbury Laboratory, Warrington, WA4 4AD

Abstract

Two popular parallel profiling tools installed on HPCx are Vampir and Paraver, which are also widely available on other platforms. These tools can simultaneously monitor hardware counters and track message-passing calls, providing valuable information on an application's runtime behaviour which can be used to improve its performance. In this report we look at using these tools in practice on a number of different applications on HPCx, with the aim of showing users how to utilise such profilers to help them understand the behaviour of their own codes. As part of this, we also examine the use of Vampir for codes run on large numbers of processes (64 or more). Interested parties should check back here regularly for updates to this paper.

This is a Technical Report from the HPCx Consortium.

Report available from http://www.hpcx.ac.uk/research/publications/HPCxTR0704.pdf

© HPCx UoE Ltd 2007

Neither HPCx UoE Ltd nor its members separately accept any responsibility for loss or damage arising from the use of information contained in any of their reports or in any communication about their tests or investigations.

Contents

1   Introduction
2   Background to Profilers
    2.1   VAMPIR & VAMPIRTRACE
          2.1.1   Product History
    2.2   PARAVER
3   Background to Applications
    3.1   DL_POLY 3
    3.2   NEMO
    3.3   PDSYEVR
    3.4   LU decomposition using OpenMP
4   VAMPIR Performance Analysis on HPCx
    4.1   Installation
          4.1.1   VampirTrace version 5.3.1
          4.1.2   Vampir
    4.2   Tracing the Application Code on HPCx
          4.2.1   Automatic Instrumentation
          4.2.2   Manual Instrumentation using the VampirTrace API
          4.2.3   Running the application with tracing on HPCx
          4.2.4   Hardware Event Counter Monitoring with PAPI
    4.3   Analysing DL_POLY VampirTrace files with Vampir
          4.3.1   Vampir Summary Chart
          4.3.2   Vampir Activity Chart
          4.3.3   Global Timeline View
    4.4   Analysing parallel 3D FFT performance in DL_POLY
    4.5   Profiling the NEMO application on large process counts using Vampir on HPCx
    4.6   Identifying Load Imbalances in the Development of PDSYEVR
5   PARAVER performance analysis on HPCx
    5.1   Setting up Paraver Tracing on HPCx
    5.2   Viewing Paraver tracefiles on HPCx
    5.3   Analysing the LUS2 application using Paraver
6   Summary
7   References


1 Introduction

The performance of a parallel code is commonly dependent on a complex combination of factors. It is therefore important that developers of High Performance Computing applications have access to effective tools for collecting and analysing performance data. This data can be used to identify issues such as computational and communication bottlenecks, load imbalances and inefficient CPU utilization. In this report we investigate the use of Vampir (Visualization and Analysis of MPI Resources) [1], in association with its related tracing tool VampirTrace [2], and Paraver (Parallel Program and Visualization Analysis Tool) [3]. HPCx usage is demonstrated here by applying the tools to the parallel DL_POLY 3 [4] application code, the computational core of a new symmetric parallel eigensolver PDSYEVR [5], an LU decomposition code [6] parallelised using OpenMP [7] and the NEMO ocean-modelling code [8].

It is not intended that this report be referenced as a user guide for the tools investigated; excellent documents detailing the large number of available features can be found at the respective tools' websites. Rather, this report is intended to give users a quick introduction to getting started with the tools on HPCx and to demonstrate, with the aid of application examples, some of the in-depth analysis that they enable.

2 Background to Profilers

Both analysis tools involve a similar approach: analysis of a tracefile created at the application's runtime that contains information on the various calls and events undertaken. For tracing the application code, VampirTrace requires the application to be relinked against the VampirTrace libraries, whereas Paraver-based tracing requires no relinking of the code, only execution via the OMPItrace tool. Both VampirTrace and OMPItrace can produce a tracefile for an OpenMP program, an MPI program, or a mixed-mode OpenMP and MPI program. Both tools require licenses, and environment variable settings can be used to customize which tracing events are recorded.

2.1 VAMPIR & VAMPIRTRACE

Vampir (Visualisation and Analysis of MPI Resources) [1] is a commercial post-mortem trace visualisation tool from the Center for Information Services and High Performance Computing (ZIH) of TU Dresden [2]. The freely available VampirTrace, developed in collaboration with the KOJAK project at ZAM/FZ Jülich [9], is obtainable from the same organization. The tool uses profiling extensions to MPI and permits analysis of the message events in which data is passed between processors during execution of a parallel program. Event ordering, message lengths and times can all be analysed. The latest version (5.0) features support for OpenMP events and hardware performance counters. The tool comes in two components: VampirTrace and Vampir. The first of these includes a library which, when linked to and called from a parallel program, produces an event tracefile. Common events include the entering and leaving of function calls and the sending and receiving of MPI messages.

By using keywords, application-specific information can be built into the trace using subroutine calls. Trace calls can be applied automatically to the whole run-time or added manually around time-critical program sections. The latter involves adding calls to VT_USER_START('label') and VT_USER_END('label') around the section of interest in the source. Automatic instrumentation requires only a re-link of the application code with the VT libraries, whilst manual instrumentation requires a re-compilation of the program. Vampir itself is then used to convert the trace information into a variety of graphical views, e.g. timeline displays showing state changes and communication, profiling statistics displaying the execution times of routines, communication statistics indicating volumes and transmission rates, and more.

2.1.1 Product History

The Vampir tool has been developed at the Center for Applied Mathematics of Research Center Jülich and the Center for High Performance Computing of the Technische Universität Dresden. Vampir has been available as a commercial product since 1996 and has been enhanced in the scope of many research and development projects. In the past it was distributed by the German company Pallas GmbH, which later became part of Intel Corporation. The cooperation with Intel ended in 2005. Vampir has been widely used in the high performance computing community for many years. A growing number of performance monitoring environments, such as TAU [10] and KOJAK [9], can produce tracefiles that are readable by Vampir. Since the release of version 5.0, Vampir supports the new Open Trace Format (OTF), also developed by ZIH. This trace format is designed especially for massively parallel programs. Vampir is portable across many computing platforms due to its X-based graphical user interface.

2.2 PARAVER

The Paraver performance analysis tool is developed by the European Center for Parallelism of Barcelona (CEPBA) [11] at the Technical University of Catalonia. Based on an easy-to-use Motif GUI, Paraver has been developed to respond to the need for a qualitative global perception of application behaviour by visual inspection, followed by detailed quantitative analysis of the problems identified. Paraver provides a large amount of information that is useful when deciding whether, and where, to invest programming effort to optimize an application.


3 Background to Applications

3.1 DL_POLY 3

DL_POLY [4] is a parallel molecular dynamics simulation package developed at STFC's Daresbury Laboratory [12]. DL_POLY 3 is the most recent version (2001) and exploits a linked-cell algorithm for domain decomposition, suitable for very large systems (up to of order 1,000,000 particles) of reasonably uniform density. Computationally the code is characterised by a series of timestep calculations involving exchanges of short-range forces between particles and long-range forces between domains using three-dimensional FFTs. The computation of these 3D FFTs [13] is a major expense during the run. Depending on the general integration flavour, a DL_POLY 3 timestep can be considered to comprise the following stages: integration part 1, particle exchange, halo reconstruction, force evaluation and integration part 2. The most communication-expensive operation is the particle exchange stage, since it involves recovery of the topology of bonded interactions for particles crossing domains. Metal interactions are evaluated using tabulated data and involve a halo exchange of data, as they depend on the local density. The test case examined here is a molecular simulation of dipalmitoylphosphatidylcholine (DPPC) in water. This system is of interest due to its complex forcefield, containing many bonded interactions including constraints as well as vdW and Coulomb charges.

3.2 NEMO

NEMO (Nucleus for European Modelling of the Ocean) [8] is designed for the simulation of both regional and global ocean circulation and is developed at the Laboratoire d'Océanographie Dynamique et de Climatologie at the Institut Pierre Simon Laplace. It solves a primitive-equation model of the ocean system in three dimensions using a finite-difference scheme and contains sea-ice and passive-tracer models. Originally designed for vector machines, the most recent version uses MPI in its MPP implementation. Here we use NEMO as an example to discuss how Vampir may be used to analyse a code's performance on processor counts of up to 256.

3.3 PDSYEVR

In the 1990s, Dhillon and Parlett devised a new algorithm, Multiple Relatively Robust Representations (MRRR) [14], for computing numerically orthogonal eigenvectors of a symmetric tridiagonal matrix at O(n²) cost. Recently a ScaLAPACK [15] implementation of this algorithm, named PDSYEVR, has been developed, and it is planned that this routine will be incorporated into future releases of ScaLAPACK. Analysis of some of the subroutines from initial versions of this code with Vampir helped identify performance issues on HPCx, which were later rectified by the developers.


3.4 LU decomposition using OpenMP

LUS2 is a short Fortran program that calculates an LU decomposition of a dense matrix. Parallelisation of the LU algorithm is achieved using OpenMP Fortran directives, in particular PARALLEL DO directives applied to loops that run over the rows and columns of a matrix, as in the construct shown below:

C$OMP PARALLEL DO SCHEDULE(DYNAMIC,16), PRIVATE(j)
      do i=1, ISIZE
        do j=1, ISIZE
          D(i,j) = A(i,j) + B(i,j)
        enddo
      enddo
C$OMP END PARALLEL DO
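The construct above illustrates the directive style; in the factorisation itself it is the trailing-matrix update at each elimination step that can be shared amongst the threads in the same way. The following is a minimal sketch only, assuming an unblocked, unpivoted right-looking LU factorisation (the routine name lu_sketch is hypothetical and the actual LUS2 source may be organised differently):

C     Hedged sketch: an unblocked, unpivoted right-looking LU factorisation
C     with the trailing-matrix update parallelised by an OpenMP PARALLEL DO,
C     in the same style as the construct above.  Illustrative only; this is
C     not taken from the LUS2 source.
      subroutine lu_sketch(A, ISIZE)
      implicit none
      integer ISIZE, i, j, k
      double precision A(ISIZE,ISIZE)

      do k = 1, ISIZE-1
C        Compute the multipliers for column k (serial, O(n) work).
         do i = k+1, ISIZE
            A(i,k) = A(i,k) / A(k,k)
         enddo
C        Rank-1 update of the trailing submatrix: the columns are
C        independent, so the j loop is shared amongst the OpenMP threads.
C$OMP PARALLEL DO SCHEDULE(DYNAMIC,16), PRIVATE(i,j)
         do j = k+1, ISIZE
            do i = k+1, ISIZE
               A(i,j) = A(i,j) - A(i,k)*A(k,j)
            enddo
         enddo
C$OMP END PARALLEL DO
      enddo

      end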

4 VAMPIR Performance Analysis on HPCx

4.1 Installation

4.1.1 VampirTrace version 5.3.1

The source files for VampirTrace version 5.3.1 can be downloaded free of charge from http://tu-dresden.de/die_tu_dresden (search for VampirTrace from the home page). In order to install a 64-bit version of VampirTrace on HPCx the following compiler options were used:

./configure AR="ar -X32_64" CC=xlc_r CXX=xlC_r F77=xlf_r FC=xlf90_r MPICC=mpcc_r CFLAGS="-O2 -g -q64" CXXFLAGS="-O2 -g -q64" FFLAGS="-O2 -g -q64" FCFLAGS="-O2 -g -q64"

The following configuration options were also required in order to link to IBM's Parallel Operating Environment (poe), IBM's Message Passing Interface (MPI) library and to access hardware event counter monitoring via the Performance Application Programming Interface (PAPI):

--with-mpi-inc-dir=/usr/lpp/ppe.poe/include
--with-mpi-lib-dir=/usr/lpp/ppe.poe/lib
--with-mpi-lib=-lmpi_r
--with-papi-dir=/usr/local/packages/papi/papi-3.2.1-64bit
--with-papi-lib="-lpapi64 -lpmapi"

4.1.2 Vampir

A pre-compiled binary of Vampir 5.0 for AIX is available for download from the Vampir website http://www.vampir.eu/. Note that this download is a demonstration copy only, and a permanent Vampir 5.0 installation is at present unavailable to users on HPCx. Vampir 5.0 is a GUI-based product and it is therefore intended that users provide their own copy of Vampir 5.0 installed on their remote platforms; this can then be used to view tracefiles of parallel runs from HPCx locally. However, a fully featured permanent copy of Vampir 4.3 is installed on HPCx. Users should also note that previous versions of Vampir cannot read tracefiles obtained from VampirTrace 5.0, as they are incompatible with the new OTF (Open Trace Format).

4.2 Tracing the Application Code on HPCx

In order to use the VampirTrace libraries:
a) calls to switch VampirTrace on or off are made from the source code (optional);
b) the code is relinked to the VT libraries;
c) the code is then run in the normal way (under poe) on HPCx.

4.2.1 Automatic Instrumentation

Automatic instrumentation is the most convenient way to instrument your application. Simply use the special VT compiler wrappers, found in the $VAMPIRTRACE_HOME/bin subdirectory, without any additional parameters, e.g.:

vtf90 prog1.f90 prog2.f90 -o prog

In this case the appropriate VT libraries will automatically be linked into the executable and tracing will be applied to the whole executable.

4.2.2 Manual Instrumentation using the VampirTrace API

The VT_USER_START and VT_USER_END instrumentation calls can be used to mark any user-defined sequence of statements.

Fortran:

#include "vt_user.inc"
      VT_USER_START('name')
      ...
      VT_USER_END('name')

C:

#include "vt_user.h"
VT_USER_START("name");
...
VT_USER_END("name");

A unique label should be supplied as "name" in order to identify the different sections traced. If a block has several exit points (as is often the case for functions), all exit points have to be instrumented with VT_USER_END. The code can then be compiled using the VT compiler wrappers (e.g. vtf90, vtcc) as described above. This approach is particularly advantageous if users wish to profile certain sections of the application code and leave other parts untraced. A selective tracing approach can also reduce the size of the resulting tracefiles considerably, which in turn speeds up loading times when analysing them in Vampir.
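As an illustration, the fragment below sketches how a time-critical section of a Fortran source file might be instrumented manually. The routine name force_evaluation and the label 'force_loop' are hypothetical; the file must be compiled with a VT wrapper (e.g. vtf90) so that the #include line and the VT macros are processed.

#include "vt_user.inc"
      subroutine force_evaluation(n, f, x)
C     Hypothetical time-critical routine used to illustrate manual
C     instrumentation; only the section between the VT calls is traced.
      implicit none
      integer n, i
      double precision f(n), x(n)

      VT_USER_START('force_loop')
      do i = 1, n
         f(i) = -2.0d0 * x(i)
      enddo
      VT_USER_END('force_loop')

      end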

4.2.3 Running the application with tracing on HPCx

The code can then be run in the usual way on HPCx using poe through a LoadLeveler script. Upon completion a series of tracefiles is produced: a numbered *.filt and *.events.z file for each process used, plus a global *.def.z and *.otf file.

4.2.4 Hardware Event Counter Monitoring with PAPI

In order to direct VampirTrace to collect hardware event counter data, the VT_METRICS environment variable must be set in the LoadLeveler job command script to specify which counters should be monitored, e.g. VT_METRICS=PAPI_FP_OPS to record floating point operations. A list of all counters supported by the Performance Application Programming Interface (PAPI) [16] on HPCx can be generated by running the 'papi_avail' tool in the /usr/local/packages/papi/papi-3.2.1-64bit/share/papi/utils/ directory; a full list is included in Appendix A of this report. Many useful performance metrics are available for analysis, including floating point and integer instruction rates, L1, L2 and L3 cache usage statistics, and processor load/store instruction rates.


4.3 Analysing DL_POLY VampirTrace files with Vampir

The Vampir analyser can be invoked from the command line and the tracefile loaded through the menu options File -> Open Tracefile. The loading operation can take several minutes if the tracefiles are large.

4.3.1 Vampir Summary Chart

The first analysis window to be generated is the Summary Chart, shown below in Figure 1:

Figure 1. Vampir Summary Chart

The black bar represents the sum of the overall execution time on HPCx. This time is broken down into three constituent parts: the Application (i.e. computation) time in green, the MPI (i.e. communication) time in red and the VT_API (i.e. tracing overhead) time in blue. These representations are maintained throughout all the different Vampir views described here. From this display users can get an overall impression of the communication/computation ratio in their application code.

4.3.2 Vampir Activity Chart

A useful way of identifying load imbalances between processors is to view the Global Activity Chart under the Global Displays menu. This view, shown in Figure 2, gives a breakdown of the Application / MPI / VT_API ratio for each process involved in the execution of the program. The display below is for an eight-processor DL_POLY run and it shows that communication and computation are relatively evenly distributed across the processors, and therefore that the load balancing is good.

Figure 2. Vampir Global Activity Chart

4.3.3 Global Timeline View

Figure 3. Vampir Global Timeline View


The Global Timeline gives an overall view of the application's parallel characteristics over the course of the complete tracing interval – in this case the complete runtime. The time interval is measured along the horizontal axis (0.0 – 6.36 minutes here) and the processes are listed vertically. Message passing between processes is represented by the black (point-to-point) and purple (global communication operations) lines that link the process timelines. From the prevalence of purple in the above representation it appears that communication in DL_POLY is mainly global; however, this can be somewhat misleading, as the purple messages overlay and obscure the black lines at this rather coarse zoom level. The proliferation of red MPI operations in the central part of the timeline could lead viewers to conclude that the code is highly communication intensive. However, the above test run has far fewer timesteps than a production run, and approximately the first two-thirds of the global timeline represents a set-up phase that in reality would be substantially less significant.

4.4 Analysing parallel 3D FFT performance in DL_POLY

Figure 4 shows how, by zooming in (left click with the mouse) on the right-hand portion of the Global Timeline, we can obtain a more representative view of the run. This shows a series of timesteps which include phases of computation (green) separated by a series of global communications at the beginning and end of each timestep. Here the 3D FFTs, signified by black and red areas around the middle of each timestep, can just begin to be distinguished.

Figure 4. DL_POLY Timesteps in the Global Timeline View


Selecting Global Displays -> Counter Timeline then allows the hardware counters selected via the VT_METRICS environment variable to be viewed on the same scale (Figure 5). Here we have run the code with VT_METRICS=PAPI_FP_OPS set in the LoadLeveler script, thereby measuring floating point operations throughout the application.

Figure 5. Vampir Hardware Counter Timeline view of DL_POLY timesteps

It can be seen that the flops/s rate peaks at around 100 Mflops/s per processor towards the centre of a timestep and reduces to virtually zero during the intensive global communication phases at the end of the timestep. Zooming further in (Figure 6), we can identify the subroutine in which the flop rate is at a maximum, in this case 'parallel_fft' (the number after a function name in the display represents the number of times that the function has been called). The associated Counter Timeline is also shown below.


Figure 6. Parallel 3D FFT in DL_POLY Timelines

The characteristic communication pattern for a 3D FFT is shown clearly in Figure 6, i.e. pairwise point-to-point communications in firstly the x, then the y, then the z direction. Again, the corresponding counter timeline shows how the flops/s rate reduces to almost zero during communication-dominated periods, and serial performance peaks at around 100 Mflops/s during the FFT computation. A summary of the message passing statistics, highlighting the level of data transfer between processors, can also be obtained (Figure 7). This shows how each processor transfers 8 Mbytes of data with three other processors, representing pairwise communication in the x, y and z directions.

Figure 7. Message Passing statistics for the 3D FFT
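For readers unfamiliar with this pattern, the following is a minimal, self-contained sketch of the kind of pairwise exchange that produces such a timeline: one MPI_Sendrecv per Cartesian direction of a 3D process grid, with roughly 8 Mbytes exchanged each time. The program name, buffer size and grid construction are illustrative assumptions only; this is not taken from the DL_POLY source.

! Minimal sketch (not DL_POLY source): one pairwise exchange per Cartesian
! direction of a 3D process grid, the pattern visible in Figures 6 and 7.
program fft_exchange_sketch
  use mpi
  implicit none
  integer, parameter :: n = 1048576        ! 1M doubles = 8 Mbytes per message
  integer :: comm_cart, ierr, rank, nprocs, dim, src, dest
  integer :: dims(3), status(MPI_STATUS_SIZE)
  logical :: periods(3)
  double precision, allocatable :: sendbuf(:), recvbuf(:)

  call MPI_Init(ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  dims = 0
  call MPI_Dims_create(nprocs, 3, dims, ierr)      ! e.g. 8 procs -> 2x2x2 grid
  periods = .true.
  call MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, .true., comm_cart, ierr)
  call MPI_Comm_rank(comm_cart, rank, ierr)

  allocate(sendbuf(n), recvbuf(n))
  sendbuf = dble(rank)

  ! Exchange with the neighbour in the x, then y, then z direction in turn;
  ! each exchange appears as a block of point-to-point messages in the timeline.
  do dim = 0, 2
     call MPI_Cart_shift(comm_cart, dim, 1, src, dest, ierr)
     call MPI_Sendrecv(sendbuf, n, MPI_DOUBLE_PRECISION, dest, 0, &
                       recvbuf, n, MPI_DOUBLE_PRECISION, src,  0, &
                       comm_cart, status, ierr)
  end do

  deallocate(sendbuf, recvbuf)
  call MPI_Finalize(ierr)
end program fft_exchange_sketch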

Left clicking on any of the black point-to-point message lines in the 3D FFT timeline highlights the specified message and initiates a pop-up box with more details on this message passing instance. Shown in Figure 8 are the details of the message highlighted at the bottom right corner of the timeline in Figure 6.

Figure 8. Individual Message Statistics


4.5 Profiling the NEMO application on large process counts using Vampir on HPCx

An immediate drawback of VampirTrace when using large numbers of processes is the size of the trace files produced and, consequently, the amount of memory (and time) needed by Vampir when loading them. This may be alleviated by reducing the length of the benchmarking run itself (e.g. the number of timesteps requested), but ultimately it may be necessary to manually instrument the source code (as described in Section 4.2.2) so that data is only collected for the sections of the code that are of interest. For instance, the scaling performance of a code will not be affected by the performance of any start-up and initialisation routines and yet, for a small benchmarking run, these may take a significant fraction of the runtime. Below we show an example of a summary activity timeline generated by Vampir using a trace file from a manually-instrumented version of the NEMO source code. The large blue areas signify time when the code was not in any of the instrumented regions, broken only by some initialisation and by a region where tracing was (programmatically) switched on for a few timesteps midway through the run before being switched off again.

Figure 9. The activity timeline generated from a manually-instrumented version of NEMO. It contains a little initialisation and then data for a few time-steps midway through the run.


The full trace data for the few timesteps may be loaded by selecting the relevant region from the summary timeline. Since the tracing has been programmatically switched on for a set number of timesteps, the information provided by the resulting summary may be reliably compared between different runs, as it does not depend on the area of the activity timeline selected by the user. Below we show an example of such a summary where the code has been manually instrumented.

Figure 10. Summary view of trace data for five timesteps of a manually instrumented version of NEMO running on 128 processes.

Once the trace data has been loaded, the user often wishes to view the global timeline, an example of which is shown below for a single timestep of NEMO. A useful summary of this view may be obtained by right-clicking and selecting Components->Parallelism Display. This brings up the display visible at the bottom of the figure from which it is easy to determine which sections of the timestep are dominated by e.g. MPI communications (coloured red by default in Vampir). An example here is the section of NEMO dealing with ice transport processes (coloured bright green). Also of note in this example is the dominance of global communications (coloured purple) over the last 16 processes. It turns out that these processes have been allocated the region of the globe in the vicinity of the poles and thus have extra work to do in removing noise introduced by the small mesh size in this region.


Figure 11. A global timeline for each of 64 processors during a single timestep of NEMO. A 'parallelism' display is included at the bottom, showing the percentage of the processors involved in each activity at any one time.

The usefulness of the global timeline can be limited when looking at tracefiles for more than 64 processors, as Vampir will try to scale the data for each process so as to fit them all on screen. However, one can specify the number of process timelines displayed at a time by right-clicking on the display and selecting Options->Show Subset... This brings up the Show Subset Dialog:

Figure 12. The Show Subset Dialog for the global timeline. Use this to choose the number of processors (“Bars”) for which data is shown on the timeline.


Using this dialog one can look at the application's behaviour in detail on a few processes, or look at the overall behaviour on many processes. The figure below shows Vampir displaying the activity of the majority of 128 processes during the section of the code dealing with ice rheology in NEMO. The effect of the LPARs on HPCx (effectively 16-way SMP nodes) on inter-process communications is highlighted by the fact that groups of 16 well-synchronised processes may be identified.

Figure 13. The global timeline configured to show data for the majority of the 128 processors of the job.

4.6 Identifying Load Imbalances in the Development of PDSYEVR

Profiling early versions of the new ScaLAPACK routine PDSYEVR with VampirTrace (VT) allows us to investigate its performance in detail. Basic timing analysis of the code revealed that load-balancing problems may exist for certain datasets in the eigenvector calculation stage of the underlying tridiagonal eigensolver MRRR. The Vampir analyses shown below enabled us to track this potential inefficiency with great precision.


In order to track the code in more detail, different colours were assigned to different functions here, using the syntax described in $VT_HOME/info/GROUPS.SPEC. Some additions to the underlying source code are required and a re-compilation must be undertaken. In the timeline view shown in Figure 14 the cyan areas represent computation in the subroutine DLARRV, which is involved in the calculation of eigenvectors. As usual, time spent in communication is represented by the red areas in the timeline, and the purple lines represent individual messages passed between processors.

Figure 14. Vampir Timeline for original DLARRV subroutine

The above timeline trace shows that, when calculating half the subset of eigenvalues, the workload in DLARRV increases substantially from process 0 to process 14. This causes a large communication overhead, represented by the large red areas in the trace. It was subsequently determined that the load imbalance was primarily caused by an unequal division of eigenvectors amongst the processes. These problems were addressed by the ScaLAPACK developers, and a newer version of the code gives a much better division of workload, as can be seen in the timeline traces in Figure 15.


Figure 15. Vampir Timeline for modified DLARRV subroutine

5 PARAVER performance analysis on HPCx

5.1 Setting up Paraver Tracing on HPCx

Paraver uses the tool OMPItrace to generate tracefiles for OpenMP programs, MPI programs, or mixed-mode OpenMP and MPI programs. Users should note that OMPItrace currently only works with 32-bit executables on HPCx, and also that OMPItrace uses IBM's DPCL (Dynamic Probe Class Library), which requires a .rhosts file in your home directory listing all the processor ids on HPCx. Paraver tracefiles are generated on HPCx by adding the environment variables (in e.g. ksh/bash):

export OMPITRACE_HOME=/usr/local/packages/paraver
export MPTRACE_COUNTGROUP=60

to the LoadLeveler job control script. The poe command in the LoadLeveler script is also changed from, e.g.:

poe ./prog

to:

$OMPITRACE_HOME/bin/ompitrace -counters -v poe.real ./prog


On HPCx 'poe' is in fact a wrapper to the 'real' poe command. In order for OMPItrace to function correctly on HPCx, poe.real must be called directly.

5.2 Viewing Paraver tracefiles on HPCx

The following environment variables should be set in the user's login session:

export PARAVER_HOME=/usr/local/packages/paraver
export MPTRACE_COUNTGROUP=60

During the run, Paraver will have created a temporary trace file for each process (*.mpit and *.sim files). After the run has completed, the user must pack the individual profile files into one global tracefile. This is done by issuing the command:

$PARAVER_HOME/bin/ompi2prv *.mpit -s *.sym -o trace_prm.prv

To view the resulting tracefile use the command:

$PARAVER_HOME/bin/paraver trace_prm.prv

5.3 Analysing the LUS2 application using Paraver

Unlike Vampir, Paraver immediately shows users the Global Timeline view on start-up. The parallelisation of LUS2 is based on OpenMP, therefore threads rather than processes are listed on the vertical axis, against time on the horizontal axis. Zooming in on a representative section of the trace shows:


Figure 16. Paraver Timeline for two cycles of $OMP PARALLEL DO

The default colours assigned represent the following activities:

Figure 17. Colour properties in Paraver


The trace in Figure 16 shows a typical slice of the timeline from LUS2, where the code is executing the $OMP PARALLEL DO construct across the matrix as described in Section 3.4. It can be seen that relatively large swathes of blue, representing computation, are divided by thread administration tasks at the start and end of each $OMP PARALLEL DO cycle.

Figure 18. Detailed view of OMP thread scheduling in LUS2


In Figure 18, above the timeline bar of each thread is a series of green flags, each denoting a change of state in the thread. Clicking on a flag gives a detailed description, as shown in the example above. Here it can be seen that thread 16 first undergoes a global synchronisation before being scheduled to run the next cycle of the loop.

6 Summary

Profilers can be highly effective tools in the analysis of parallel programs on HPC architectures. They are particularly useful for identifying and measuring the effect of problems such as communication bottlenecks and load imbalances on the efficiency of codes. New versions of these tools also include hardware performance data, which facilitates the detailed analysis of serial processor performance within a parallel run. The Vampir and Paraver GUI-based analysis tools allow users to switch with ease from global analyses of the parallel run to very detailed analyses of specific messages, all within one profiling session. Interoperability of VampirTrace with other profilers such as KOJAK and TAU has now been made possible by the adoption of the Open Trace Format.

Acknowledgements

The authors would like to thank Matthias Jurenz from TU Dresden, Chris Johnson from EPCC University of Edinburgh, and Ilian Todorov & Ian Bush from STFC Daresbury Laboratory for their help in creating this report.

7 References

[1] Vampir – Performance Optimization, http://www.vampir.eu.
[2] VampirTrace, ZIH, Technische Universität Dresden, http://tu-dresden.de/die_tu_dresden/zentrale_einrichtungen/zih.
[3] Paraver, The European Center for Parallelism of Barcelona, http://www.cepba.upc.es/paraver.
[4] The DL_POLY Simulation Package, W. Smith, STFC Daresbury Laboratory, http://www.cse.scitech.ac.uk/ccg/software/DL_POLY/.
[5] "PDSYEVR. ScaLAPACK's parallel MRRR algorithm for the symmetric eigenvalue problem", D. Antonelli, C. Vomel, LAPACK Working Note 168, (2005), http://www.netlib.org/lapack/lawnspdf/lawn168.pdf.
[6] OMPItrace Tool User's Guide, http://www.cepba.upc.es/paraver/docs/OMPtrace.ps.
[7] The OpenMP Application Program Interface, http://www.openmp.org.
[8] NEMO – Nucleus for European Modelling of the Ocean, http://www.lodyc.jussieu.fr/NEMO/.
[9] KOJAK – Automatic Performance Analysis Toolset, Forschungszentrum Jülich, http://www.fz-juelich.de/zam/kojak/.
[10] TAU – Tuning and Analysis Utilities, University of Oregon, http://www.cs.uoregon.edu/research/tau/home.php.
[11] The European Center for Parallelism of Barcelona, http://www.cepba.upc.es.
[12] Science & Technology Facilities Council, http://www.scitech.ac.uk.
[13] "A Parallel Implementation of SPME for DL_POLY 3", I. J. Bush and W. Smith, STFC Daresbury Laboratory, http://www.cse.scitech.ac.uk/arc/fft.shtml.
[14] "A Parallel Eigensolver for Dense Symmetric Matrices based on Multiple Relatively Robust Representations", P. Bientinesi, I. S. Dhillon, R. A. van de Geijn, UT CS Technical Report TR-03-26, (2003), http://www.cs.utexas.edu/users/plapack/papers/pareig.ps.
[15] ScaLAPACK, http://www.netlib.org/scalapack/scalapack_home.html.
[16] PAPI – Performance Application Programming Interface, http://icl.cs.utk.edu/papi/.

Appendix A

The list of available PAPI hardware counters on HPCx, as reported by papi_avail:

Test case avail.c: Available events and hardware information.
-------------------------------------------------------------------------
Vendor string and code    : IBM (-1)
Model string and code     : POWER5 (8192)
CPU Revision              : 983040.000000
CPU Megahertz             : 1502.495972
CPU's in this Node        : 16
Nodes in this System      : 1
Total CPU's               : 16
Number Hardware Counters  : 6
Max Multiplex Counters    : 32
-------------------------------------------------------------------------
Name          Code        Avail Deriv Description (Note)
PAPI_L1_DCM   0x80000000  Yes   Yes   Level 1 data cache misses
PAPI_L1_ICM   0x80000001  No    No    Level 1 instruction cache misses
PAPI_L2_DCM   0x80000002  Yes   No    Level 2 data cache misses
PAPI_L2_ICM   0x80000003  Yes   No    Level 2 instruction cache misses
PAPI_L3_DCM   0x80000004  Yes   Yes   Level 3 data cache misses
PAPI_L3_ICM   0x80000005  Yes   Yes   Level 3 instruction cache misses
PAPI_L1_TCM   0x80000006  No    No    Level 1 cache misses
PAPI_L2_TCM   0x80000007  No    No    Level 2 cache misses
PAPI_L3_TCM   0x80000008  No    No    Level 3 cache misses
PAPI_CA_SNP   0x80000009  No    No    Requests for a snoop
PAPI_CA_SHR   0x8000000a  No    No    Requests for exclusive access to shared cache line
PAPI_CA_CLN   0x8000000b  No    No    Requests for exclusive access to clean cache line
PAPI_CA_INV   0x8000000c  No    No    Requests for cache line invalidation
PAPI_CA_ITV   0x8000000d  No    No    Requests for cache line intervention
PAPI_L3_LDM   0x8000000e  Yes   Yes   Level 3 load misses
PAPI_L3_STM   0x8000000f  No    No    Level 3 store misses
PAPI_BRU_IDL  0x80000010  No    No    Cycles branch units are idle
PAPI_FXU_IDL  0x80000011  Yes   No    Cycles integer units are idle
PAPI_FPU_IDL  0x80000012  No    No    Cycles floating point units are idle
PAPI_LSU_IDL  0x80000013  No    No    Cycles load/store units are idle
PAPI_TLB_DM   0x80000014  Yes   No    Data translation lookaside buffer misses
PAPI_TLB_IM   0x80000015  Yes   No    Instruction translation lookaside buffer misses
PAPI_TLB_TL   0x80000016  Yes   Yes   Total translation lookaside buffer misses
PAPI_L1_LDM   0x80000017  Yes   No    Level 1 load misses
PAPI_L1_STM   0x80000018  Yes   No    Level 1 store misses
PAPI_L2_LDM   0x80000019  Yes   No    Level 2 load misses
PAPI_L2_STM   0x8000001a  No    No    Level 2 store misses
PAPI_BTAC_M   0x8000001b  No    No    Branch target address cache misses
PAPI_PRF_DM   0x8000001c  No    No    Data prefetch cache misses
PAPI_L3_DCH   0x8000001d  No    No    Level 3 data cache hits
PAPI_TLB_SD   0x8000001e  No    No    Translation lookaside buffer shootdowns
PAPI_CSR_FAL  0x8000001f  No    No    Failed store conditional instructions
PAPI_CSR_SUC  0x80000020  No    No    Successful store conditional instructions
PAPI_CSR_TOT  0x80000021  No    No    Total store conditional instructions
PAPI_MEM_SCY  0x80000022  No    No    Cycles Stalled Waiting for memory accesses
PAPI_MEM_RCY  0x80000023  No    No    Cycles Stalled Waiting for memory Reads
PAPI_MEM_WCY  0x80000024  No    No    Cycles Stalled Waiting for memory writes
PAPI_STL_ICY  0x80000025  Yes   No    Cycles with no instruction issue
PAPI_FUL_ICY  0x80000026  No    No    Cycles with maximum instruction issue
PAPI_STL_CCY  0x80000027  No    No    Cycles with no instructions completed
PAPI_FUL_CCY  0x80000028  No    No    Cycles with maximum instructions completed
PAPI_HW_INT   0x80000029  Yes   No    Hardware interrupts
PAPI_BR_UCN   0x8000002a  No    No    Unconditional branch instructions
PAPI_BR_CN    0x8000002b  No    No    Conditional branch instructions
PAPI_BR_TKN   0x8000002c  No    No    Conditional branch instructions taken
PAPI_BR_NTK   0x8000002d  No    No    Conditional branch instructions not taken
PAPI_BR_MSP   0x8000002e  Yes   Yes   Conditional branch instructions mispredicted
PAPI_BR_PRC   0x8000002f  No    No    Conditional branch instructions correctly predicted
PAPI_FMA_INS  0x80000030  Yes   No    FMA instructions completed
PAPI_TOT_IIS  0x80000031  Yes   No    Instructions issued
PAPI_TOT_INS  0x80000032  Yes   No    Instructions completed
PAPI_INT_INS  0x80000033  Yes   No    Integer instructions
PAPI_FP_INS   0x80000034  Yes   No    Floating point instructions
PAPI_LD_INS   0x80000035  Yes   No    Load instructions
PAPI_SR_INS   0x80000036  Yes   No    Store instructions
PAPI_BR_INS   0x80000037  Yes   No    Branch instructions
PAPI_VEC_INS  0x80000038  No    No    Vector/SIMD instructions
PAPI_RES_STL  0x80000039  No    No    Cycles stalled on any resource
PAPI_FP_STAL  0x8000003a  No    No    Cycles the FP unit(s) are stalled
PAPI_TOT_CYC  0x8000003b  Yes   No    Total cycles
PAPI_LST_INS  0x8000003c  Yes   Yes   Load/store instructions completed
PAPI_SYC_INS  0x8000003d  No    No    Synchronization instructions completed
PAPI_L1_DCH   0x8000003e  No    No    Level 1 data cache hits
PAPI_L2_DCH   0x8000003f  No    No    Level 2 data cache hits
PAPI_L1_DCA   0x80000040  Yes   Yes   Level 1 data cache accesses
PAPI_L2_DCA   0x80000041  No    No    Level 2 data cache accesses
PAPI_L3_DCA   0x80000042  No    No    Level 3 data cache accesses
PAPI_L1_DCR   0x80000043  Yes   No    Level 1 data cache reads
PAPI_L2_DCR   0x80000044  No    No    Level 2 data cache reads
PAPI_L3_DCR   0x80000045  Yes   No    Level 3 data cache reads
PAPI_L1_DCW   0x80000046  Yes   No    Level 1 data cache writes
PAPI_L2_DCW   0x80000047  No    No    Level 2 data cache writes
PAPI_L3_DCW   0x80000048  No    No    Level 3 data cache writes
PAPI_L1_ICH   0x80000049  Yes   No    Level 1 instruction cache hits
PAPI_L2_ICH   0x8000004a  No    No    Level 2 instruction cache hits
PAPI_L3_ICH   0x8000004b  No    No    Level 3 instruction cache hits
PAPI_L1_ICA   0x8000004c  No    No    Level 1 instruction cache accesses
PAPI_L2_ICA   0x8000004d  No    No    Level 2 instruction cache accesses
PAPI_L3_ICA   0x8000004e  Yes   No    Level 3 instruction cache accesses
PAPI_L1_ICR   0x8000004f  No    No    Level 1 instruction cache reads
PAPI_L2_ICR   0x80000050  No    No    Level 2 instruction cache reads
PAPI_L3_ICR   0x80000051  No    No    Level 3 instruction cache reads
PAPI_L1_ICW   0x80000052  No    No    Level 1 instruction cache writes
PAPI_L2_ICW   0x80000053  No    No    Level 2 instruction cache writes
PAPI_L3_ICW   0x80000054  No    No    Level 3 instruction cache writes
PAPI_L1_TCH   0x80000055  No    No    Level 1 total cache hits
PAPI_L2_TCH   0x80000056  No    No    Level 2 total cache hits
PAPI_L3_TCH   0x80000057  No    No    Level 3 total cache hits
PAPI_L1_TCA   0x80000058  No    No    Level 1 total cache accesses
PAPI_L2_TCA   0x80000059  No    No    Level 2 total cache accesses
PAPI_L3_TCA   0x8000005a  No    No    Level 3 total cache accesses
PAPI_L1_TCR   0x8000005b  No    No    Level 1 total cache reads
PAPI_L2_TCR   0x8000005c  No    No    Level 2 total cache reads
PAPI_L3_TCR   0x8000005d  No    No    Level 3 total cache reads
PAPI_L1_TCW   0x8000005e  No    No    Level 1 total cache writes
PAPI_L2_TCW   0x8000005f  No    No    Level 2 total cache writes
PAPI_L3_TCW   0x80000060  No    No    Level 3 total cache writes
PAPI_FML_INS  0x80000061  No    No    Floating point multiply instructions
PAPI_FAD_INS  0x80000062  No    No    Floating point add instructions
PAPI_FDV_INS  0x80000063  Yes   No    Floating point divide instructions
PAPI_FSQ_INS  0x80000064  Yes   No    Floating point square root instructions
PAPI_FNV_INS  0x80000065  No    No    Floating point inverse instructions
PAPI_FP_OPS   0x80000066  Yes   Yes   Floating point operations
-------------------------------------------------------------------------
avail.c                                  PASSED
