Profiling Parallel Performance using Vampir and Paraver
Andrew Sunderland, Andrew Porter
STFC Daresbury Laboratory, Warrington, WA4 4AD

Abstract
Two popular parallel profiling tools installed on HPCx are Vampir and Paraver, which are also widely available on other platforms. These tools can simultaneously monitor hardware counters and track message-passing calls, providing valuable information on an application's runtime behaviour which can be used to improve its performance. In this report we look at using these tools in practice on a number of different applications on HPCx, with the aim of showing users how to utilise such profilers to help them understand the behaviour of their own codes. As part of this, we also examine the use of Vampir for codes run on large numbers of processes (64 or more). Interested parties should check back here regularly for updates to this paper.
This is a Technical Report from the HPCx Consortium.
Report available from http://www.hpcx.ac.uk/research/publications/HPCxTR0704.pdf
© HPCx UoE Ltd 2007 Neither HPCx UoE Ltd nor its members separately accept any responsibility for loss or damage arising from the use of information contained in any of their reports or in any communication about their tests or investigations.
Contents

1 Introduction
2 Background to Profilers
  2.1 VAMPIR & VAMPIRTRACE
    2.1.1 Product History
  2.2 PARAVER
3 Background to Applications
  3.1 DL_POLY 3
  3.2 NEMO
  3.3 PDSYEVR
  3.4 LU decomposition using OpenMP
4 VAMPIR Performance Analysis on HPCx
  4.1 Installation
    4.1.1 VampirTrace version 5.3.1
    4.1.2 Vampir
  4.2 Tracing the Application Code on HPCx
    4.2.1 Automatic Instrumentation
    4.2.2 Manual Instrumentation using the VampirTrace API
    4.2.3 Running the application with tracing on HPCx
    4.2.4 Hardware Event Counter Monitoring with PAPI
  4.3 Analysing DL_POLY VampirTrace files with Vampir
    4.3.1 Vampir Summary Chart
    4.3.2 Vampir Activity Chart
    4.3.3 Global Timeline View
  4.4 Analysing parallel 3D FFT performance in DL_POLY
  4.5 Profiling the NEMO application on large process counts using Vampir on HPCx
  4.6 Identifying Load Imbalances in the Development of PDSYEVR
5 PARAVER performance analysis on HPCx
  5.1 Setting up Paraver Tracing on HPCx
  5.2 Viewing Paraver tracefiles on HPCx
  5.3 Analysing the LUS2 application using Paraver
6 Summary
7 References
1 Introduction
The performance of a parallel code is commonly dependent on a complex combination of factors. It is therefore important that developers of High Performance Computing applications have access to effective tools for collecting and analysing performance data. This data can be used to identify such issues as computational and communication bottlenecks, load imbalances and inefficient CPU utilisation. In this report we investigate the use of Vampir (Visualization and Analysis of MPI Resources) [1] in association with its related tracing tool VampirTrace [2] and Paraver (Parallel Program and Visualization Analysis Tool) [3]. HPCx usage is demonstrated here by applying the tools to the parallel DL_POLY 3 [4] application code, the computation core of a new symmetric parallel eigensolver PDSYEVR [5], an LU decomposition code [6] parallelised using OpenMP [7] and the NEMO ocean-modelling code [8]. This report is not intended to serve as a user guide for the tools investigated; excellent documents detailing the huge range of available features can be found at the respective tools' websites. Rather, this report aims to give users a quick introduction to getting started with the tools on HPCx and to demonstrate, with the aid of application examples, some of the in-depth analysis that they enable.
2 Background to Profilers
Both analysis tools take a similar approach, i.e. post-mortem analysis of a tracefile created at the application's runtime that contains information on the various calls and events undertaken. For tracing the application code, VampirTrace requires relinking of the application to the VampirTrace libraries, whereas Paraver-based tracing requires no relinking, only execution via the OMPItrace tool. Both VampirTrace and OMPItrace can produce a tracefile for an OpenMP program, an MPI program, or a mixed-mode OpenMP and MPI program. Both tools require licences, and environment variable settings can be used to customise which tracing events are recorded.
2.1 VAMPIR & VAMPIRTRACE
Vampir (Visualisation and Analysis of MPI Resources) [1] is a commercial post-mortem trace visualisation tool from the Center for Information Services and High Performance Computing (ZIH) of TU Dresden [2]. The freely available VampirTrace, developed in collaboration with the KOJAK project at ZAM/FZ Jülich [9], is obtainable from the same organisation. The tool uses profiling extensions to MPI and permits analysis of the message events in which data is passed between processors during execution of a parallel program. Event ordering, message lengths and times can all be analysed. The latest version (5.0) features support for OpenMP events and hardware performance counters. The tool comes in two components, VampirTrace and Vampir. The first of these includes a library which, when linked and called from a parallel program, produces an event tracefile. Common events include the entering and leaving of function calls and the sending and receiving of MPI messages.
By using keywords, application-specific information can be built into the trace using subroutine calls. Trace calls can be applied automatically to the whole run-time or added manually around time-critical program sections; the latter involves adding calls to VT_USER_START('label') and VT_USER_END('label') around the section of interest in the source. Automatic instrumentation requires only a re-link of the application code with the VT libraries, whilst manual instrumentation requires a re-compilation of the program. Vampir itself is then used to convert the trace information into a variety of graphical views, e.g. timeline displays showing state changes and communication, profiling statistics displaying the execution times of routines, communication statistics indicating volumes and transmission rates, and more.

2.1.1 Product History
The Vampir tool has been developed at the Center for Applied Mathematics of Research Center Jülich and the Center for High Performance Computing of the Technische Universität Dresden. Vampir has been available as a commercial product since 1996 and has been enhanced in the scope of many research and development projects. In the past it was distributed by the German Pallas GmbH, which later became part of Intel Corporation; the cooperation with Intel ended in 2005. Vampir has been widely used in the high performance computing community for many years, and a growing number of performance monitoring environments, such as TAU [10] and KOJAK [9], can produce tracefiles that are readable by Vampir. Since the release of version 5.0, Vampir supports the new Open Trace Format (OTF), also developed by ZIH. This trace format is especially designed for massively parallel programs. Thanks to its X-based graphical user interface, Vampir is portable and available on many computing platforms.
2.2 PARAVER
The Paraver performance analysis tool is developed by the European Center for Parallelism of Barcelona (CEPBA) [11] at the Technical University of Catalonia. Based on an easy-to-use Motif GUI, Paraver has been developed to respond to the need to have a qualitative global perception of application behaviour by visual inspection and then to be able to focus on a detailed quantitative analysis of the problems found. Paraver provides a large amount of information that is useful for deciding whether, and where, to invest programming effort to optimise an application.
3 Background to Applications

3.1 DL_POLY 3
DL_POLY [4] is a parallel molecular dynamics simulation package developed at STFC's Daresbury Laboratory [12]. DL_POLY 3 is the most recent version (2001) and exploits a linked-cell algorithm for domain decomposition, suitable for very large systems (up to order 1,000,000 particles) of reasonably uniform density. Computationally the code is characterised by a series of timestep calculations involving exchanges of short-range forces between particles and long-range forces between domains using three-dimensional FFTs. The computation of these 3D FFTs [13] is a major expense during the calculation. Depending on the general integration flavour, a DL_POLY 3 timestep can be considered to comprise the following stages: integration part 1, particle exchange, halo reconstruction, force evaluation, integration part 2. The most communication-expensive operation is the particle exchange stage, since it involves recovery of the topology of bonded interactions for particles crossing domains. Metal interactions are evaluated using tabulated data and involve a halo exchange of data as they depend on the local density. The test case examined here is a molecular simulation involving dipalmitoylphosphatidylcholine (DPPC) in water. This system is of interest due to its complex forcefield, containing many bonded interactions including constraints as well as vdW and Coulomb charges.
3.2 NEMO
NEMO (Nucleus for European Modelling of the Ocean) [8] is designed for the simulation of both regional and global ocean circulation and is developed at the Laboratoire d'Océanographie Dynamique et de Climatologie at the Institut Pierre Simon Laplace. It solves a primitive-equation model of the ocean system in three dimensions using a finite-difference scheme and contains sea-ice and passive-tracer models. Although originally designed for vector machines, the most recent version uses MPI in its MPP implementation. Here we discuss how Vampir may be used to analyse a code's performance on processor counts of up to 256, using NEMO as an example.
3.3 PDSYEVR
In the 1990s, Dhillon and Parlett devised a new algorithm, Multiple Relatively Robust Representations (MRRR) [14], for computing numerically orthogonal eigenvectors of a symmetric tridiagonal matrix at O(n²) cost. Recently a ScaLAPACK [15] implementation of this algorithm, named PDSYEVR, has been developed and it is planned that this routine will be incorporated into future releases of ScaLAPACK. Analysis of some of the subroutines from initial versions of this code with Vampir helped identify performance issues on HPCx, which were later rectified by the developers.
3.4 LU decomposition using OpenMP
LUS2 is a short Fortran program that calculates an LU decomposition on a dense matrix. Parallelisation of the LU algorithm is achieved by using OpenMP Fortran interface directives, in particular PARALLEL DO loop directives, as in the construct that loops through the rows and columns of a matrix shown below:

C$OMP PARALLEL DO SCHEDULE(DYNAMIC,16), PRIVATE(j)
      do i=1, ISIZE
         do j=1, ISIZE
            D(i,j) = A(i,j) + B(i,j)
         enddo
      enddo
C$OMP END PARALLEL DO
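The LUS2 source itself is not reproduced in this report. As a rough illustration only (the loop structure and variable names below are assumed, not taken from LUS2), the elimination step of an LU factorisation without pivoting could be shared among threads with the same directive:

c     Hypothetical sketch: parallelising the row updates of an
c     in-place LU factorisation (no pivoting) over OpenMP threads.
c     A and ISIZE simply follow the naming of the example above.
      do k = 1, ISIZE-1
C$OMP PARALLEL DO SCHEDULE(DYNAMIC,16), PRIVATE(i,j)
         do i = k+1, ISIZE
            A(i,k) = A(i,k) / A(k,k)
            do j = k+1, ISIZE
               A(i,j) = A(i,j) - A(i,k)*A(k,j)
            enddo
         enddo
C$OMP END PARALLEL DO
      enddo

Each value of k must still be processed in order, so only the updates of the rows below the pivot row are distributed; a DYNAMIC schedule helps here because the amount of work per row shrinks as k grows.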
4 VAMPIR Performance Analysis on HPCx

4.1 Installation

4.1.1 VampirTrace version 5.3.1
The source files for VampirTrace version 5.3.1 can be downloaded free of charge from http://tu-dresden.de/die_tu_dresden (search for VampirTrace from the home page). In order to install a 64-bit version of VampirTrace on HPCx the following compiler options were used:
./configure AR="ar -X32_64" CC=xlc_r CXX=xlC_r F77=xlf_r FC=xlf90_r MPICC=mpcc_r CFLAGS="-O2 -g -q64" CXXFLAGS="-O2 -g -q64" FFLAGS="-O2 -g -q64" FCFLAGS="-O2 -g -q64"
The following configuration options were also required in order to link to IBM's Parallel Operating Environment (poe) and IBM's Message Passing Interface (MPI) library, and to access hardware event counter monitoring via the Performance Application Programming Interface (PAPI):
--with-mpi-inc-dir=/usr/lpp/ppe.poe/include
--with-mpi-lib-dir=/usr/lpp/ppe.poe/lib
--with-mpi-lib=-lmpi_r
--with-papi-dir=/usr/local/packages/papi/papi-3.2.1-64bit
--with-papi-lib="-lpapi64 -lpmapi"
4.1.2 Vampir
A pre-compiled binary of Vampir 5.0 for AIX is available for download from the Vampir website, http://www.vampir.eu/. Note that this download is a demonstration copy only and a permanent Vampir 5.0 installation is at present unavailable to users on HPCx. Vampir 5.0 is a GUI-based product, so users are expected to install their own copy on their local platforms and use it to view tracefiles of parallel runs from HPCx. However, a fully featured permanent copy of Vampir 4.3 is installed on HPCx. Users should also note that previous versions of Vampir cannot read tracefiles produced by VampirTrace 5.0, as they are incompatible with the new Open Trace Format (OTF).
4.2 Tracing the Application Code on HPCx
In order to use the VampirTrace libraries:
a) calls to switch VampirTrace on/off are made from the source code (optional);
b) the code must be relinked to the VT libraries;
c) the code is then run (in the normal way under poe) on HPCx.
4.2.1 Automatic Instrumentation
Automatic instrumentation is the most convenient way to instrument your application. Simply use the special VT compiler wrappers, found in the $VAMPIRTRACE_HOME/bin subdirectory, without any extra parameters, e.g.:

vtf90 prog1.f90 prog2.f90 -o prog

In this case the appropriate VT libraries will automatically be linked into the executable and tracing will be applied to the whole program.

4.2.2 Manual Instrumentation using the VampirTrace API
The VT_USER_START and VT_USER_END instrumentation calls can be used to mark any user-defined sequence of statements:
Fortran:
#include "vt_user.inc"
VT_USER_START('name')
...
VT_USER_END('name')

C:
#include "vt_user.h"
VT_USER_START("name");
...
VT_USER_END("name");
A unique label should be supplied as "name" in order to identify the different sections traced. If a block has several exit points (as is often the case for functions), all exit points have to be instrumented with VT_USER_END. The code can then be compiled using the VT compiler wrappers (e.g. vtf90, vtcc) as described above. This approach is particularly advantageous if users wish to profile certain sections of the application code and leave other parts untraced. A selective tracing approach can also reduce the size of the resulting tracefiles considerably, which in turn can speed up loading times when it comes to analysing them in Vampir.
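For illustration, a manually instrumented Fortran routine might look like the hypothetical fragment below; the routine, its arguments and the region label 'solve_loop' are invented for the example and are not taken from any of the codes discussed in this report.

c     Hypothetical example of the VampirTrace user API in Fortran.
#include "vt_user.inc"
      subroutine solve_step(a, n)
      implicit none
      integer n, i
      real*8 a(n)

      VT_USER_START('solve_loop')
c     ... the time-critical work of the routine goes here ...
      do i = 1, n
         a(i) = 0.5d0*a(i)
      enddo
      VT_USER_END('solve_loop')

      end

Because VT_USER_START and VT_USER_END are preprocessor macros, the file containing them must be run through the C preprocessor before compilation with the vtf90 wrapper described above.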
4.2.3 Running the application with tracing on HPCx
The code can then be run in the usual way on HPCx using poe through a LoadLeveler script. Upon completion a series of tracefiles is produced: a numbered *.filt and *.events.z file for each process used, and a global *.def.z and *.otf file.
4.2.4 Hardware Event Counter Monitoring with PAPI
In order to direct VampirTrace to collect hardware event counter data, the $VT_METRICS environment variable must be set in the LoadLeveler job command script, specifying which counters should be monitored. A list of all counters supported by the Performance Application Programming Interface (PAPI) [16] on HPCx can be generated by running the tool 'papi_avail' in the /usr/local/packages/papi/papi-3.2.1-64bit/share/papi/utils/ directory. A full list is included in this report in Appendix A. Many useful performance metrics are available for analysis, including floating point instruction rates, integer instruction rates, L1, L2 and L3 cache usage statistics, and processor load/store instruction rates.
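As a sketch only (the LoadLeveler keywords and resource values below are generic illustrations rather than the exact HPCx production settings, and the executable name is invented), the relevant part of a job script might look as follows:

#@ job_type         = parallel
#@ total_tasks      = 32
#@ wall_clock_limit = 00:20:00
#@ queue

# Ask VampirTrace to record floating point operation counts from PAPI
# (any counter name from Appendix A could be used here instead).
export VT_METRICS=PAPI_FP_OPS

# Run the VT-instrumented executable under poe in the usual way.
poe ./myprog_vt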
4.3 Analysing DL_POLY VampirTrace files with Vampir
The Vampir analyser can be invoked from the command line and the tracefile loaded through the menu options File -> Open Tracefile. The loading operation can take several minutes if the tracefiles are large.
4.3.1 Vampir Summary Chart
The first analysis window to be generated is the Summary Chart, shown below in Figure 1:
Figure 1. Vampir Summary Chart
The black bar represents the Sum of the overall execution time on HPCx. This time is then broken down into three constituent parts – the Application (i.e. computation) time in green, the MPI (i.e. communication) time in red and the VT_API (i.e. tracing overhead) time in blue. These representations are maintained throughout all the different Vampir views described here. From this display users can get an overall impression of the communication / computation ratio in their application code.
4.3.2 Vampir Activity Chart
A useful way of identifying load imbalances between processors is to view the Global Activity Chart under the Global Displays menu. This view, shown in Figure 2, gives a breakdown of the Application / MPI / VT_API ratio for each process involved in the execution of the program. The display below is for an eight-processor DL_POLY run and it shows that communication and computation are relatively evenly distributed across the processors and therefore the load balancing is good.
Figure 2. Vampir Global Activity Chart
4.3.3 Global Timeline View
Figure 3. Vampir Global Timeline View
The Global Timeline gives an overall view of the application's parallel characteristics over the course of the complete tracing interval, in this case the complete runtime. The time interval is measured along the horizontal axis (0.0 to 6.36 minutes here) and the processes are listed vertically. Message passing between processes is represented by the black (point-to-point) and purple (global communication operations) lines that link the process timelines. From the prevalence of purple in this view it appears that communication in DL_POLY is mainly global; however, this can be somewhat misleading, as the purple messages overlay and obscure the black lines at this rather coarse zoom level. The proliferation of red MPI operations in the central part of the timeline could lead viewers to conclude that the code is highly communication intensive. However, the above test run has far fewer timesteps than a production run, and approximately the first two-thirds of the global timeline represents a set-up phase that in reality would be substantially less significant.
4.4 Analysing parallel 3D FFT performance in DL_POLY
Figure 4 shows how, by zooming in (left-click with the mouse) on the right-hand portion of the Global Timeline, we can obtain a more representative view of the run. This shows a series of timesteps which include phases of computation (green) separated by a series of global communications at the beginning and the end of each timestep. Here the 3D FFTs, signified by black and red areas around the middle of each timestep, can just begin to be distinguished.
Figure 4. DL_POLY Timesteps in the Global Timeline View
By now selecting Global Displays -> Counter Timeline, the hardware counters selected via the $VT_METRICS environment variable can be viewed on the same scale (Figure 5). Here we have chosen to run the code with $VT_METRICS=PAPI_FP_OPS set in the LoadLeveler script, thereby measuring floating point operations throughout the application.
Figure 5. Vampir Hardware Counter Timeline view of DL_POLY timesteps
It can be seen that the flop/s rate peaks at around 100 Mflop/s per processor towards the centre of a timestep and reduces to virtually zero during the intensive global communication phases at the end of the timestep. Zooming further in (Figure 6), we can identify where in the program the flop rate is at a maximum: in the routine 'parallel_fft' (the number appended to the function name in the display represents the number of times that the function has been called). The associated Counter Timeline is also shown below.
Figure 6. Parallel 3D FFT in DL_POLY Timelines
The characteristic communication pattern for a 3D FFT is shown clearly in Figure 6, i.e. pairwise point-to-point communications first in the x, then the y, then the z direction. Again, the corresponding counter timeline shows how the flop/s rate reduces to almost zero during communication-dominated periods, while serial performance peaks at around 100 Mflop/s during the FFT computation. A summary of the message passing statistics, highlighting the level of data transfer between processors, can also be obtained (Figure 7). This shows how each processor transfers 8 Mbytes of data with three other processors, representing pair-wise communication in the x, y and z directions.
Figure 7. Message Passing statistics for the 3D FFT
Left clicking on any of the black point-to-point message lines in the 3D FFT timeline highlights the specified message and initiates a pop-up box with more details on this message passing instance. Shown in Figure 8 are the details of the message highlighted at the bottom right corner of the timeline in Figure 6.
Figure 8. Individual Message Statistics
4.5 Profiling the NEMO application on large process counts using Vampir on HPCx
An immediate drawback of VampirTrace when using large numbers of processes is the size of the trace files produced and, consequently, the amount of memory (and time) needed by Vampir when loading them. This may be alleviated by reducing the length of the benchmarking run itself (e.g. the number of timesteps requested), but ultimately it may be necessary to manually instrument the source code (as described in Section 4.2.2) so that data is only collected for the sections of the code that are of interest. For instance, the scaling performance of a code will not be affected by the performance of any start-up and initialisation routines and yet, for a small benchmarking run, these may take a significant fraction of the runtime. Below we show an example of a summary activity timeline generated by Vampir using a trace file from a manually instrumented version of the NEMO source code. The large blue areas signify time when the code was not in any of the instrumented regions and are broken only by some initialisation and by a region where tracing was (programmatically) switched on for a few timesteps midway through the run before being switched off again.
Figure 9. The activity timeline generated from a manually-instrumented version of NEMO. It contains a little initialisation and then data for a few time-steps midway through the run.
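A minimal sketch of how such programmatic switching might be coded is given below. It assumes that the VampirTrace user API provides VT_OFF() and VT_ON() calls for disabling and re-enabling trace recording (the on/off calls referred to in Section 4.2); the routine name, loop bounds and chosen step numbers are purely illustrative.

c     Hypothetical sketch: record full trace data only for a few
c     timesteps midway through the run, as in Figure 9.
#include "vt_user.inc"
      subroutine time_loop(nsteps)
      implicit none
      integer nsteps, istep

c     Assumed VampirTrace call: stop trace recording after start-up.
      VT_OFF()
      do istep = 1, nsteps
         if (istep .eq. 101) then
c           Assumed call: re-enable tracing for steps 101 to 105.
            VT_ON()
         endif
         if (istep .eq. 106) then
            VT_OFF()
         endif
c        ... one model timestep would be computed here ...
      enddo

      end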
The full trace data for the few timesteps may be loaded by selecting the relevant region from the summary timeline. Since the tracing has been programmatically switched on for a set number of timesteps, the information provided by the resulting summary may be reliably compared for different runs, because it is not dependent on the area of the activity timeline selected by the user. Below we show an example of such a summary where the code has been manually instrumented.
Figure 10. Summary view of trace data for five timesteps of a manually instrumented version of NEMO running on 128 processes.
Once the trace data has been loaded, the user often wishes to view the global timeline, an example of which is shown below for a single timestep of NEMO. A useful summary of this view may be obtained by right-clicking and selecting Components->Parallelism Display. This brings up the display visible at the bottom of the figure from which it is easy to determine which sections of the timestep are dominated by e.g. MPI communications (coloured red by default in Vampir). An example here is the section of NEMO dealing with ice transport processes (coloured bright green). Also of note in this example is the dominance of global communications (coloured purple) over the last 16 processes. It turns out that these processes have been allocated the region of the globe in the vicinity of the poles and thus have extra work to do in removing noise introduced by the small mesh size in this region.
Figure 11. A global timeline for each of 64 processors during a single timestep of NEMO. A 'parallelism' display is included at the bottom showing the percentage of the processors involved in each activity at any one time.

The usefulness of the global timeline can be limited when looking at tracefiles for numbers of processors greater than 64, as Vampir will try to scale the data for each process so as to fit them all on screen. However, one can specify the number of process timelines one would like displayed at a time by right-clicking on the display and selecting Options->Show Subset... This brings up the Show Subset Dialog:
Figure 12. The Show Subset Dialog for the global timeline. Use this to choose the number of processors (“Bars”) for which data is shown on the timeline.
Using this dialog one can look at the application's behaviour in detail on a few processes or look at the overall behaviour on many processes. The figure below shows Vampir displaying the activity of the majority of 128 processes during the section of the code dealing with ice rheology in NEMO. The effect of the LPARs on HPCx (effectively 16-way SMP nodes) on inter-process communications is highlighted by the fact that groups of 16 well-synchronised processes may be identified.
Figure 13. The global timeline configured to show data for the majority of the 128 processors of the job.
4.6 Identifying Load Imbalances in the Development of PDSYEVR
Profiling early versions of the new ScaLAPACK routine PDSYEVR with VampirTrace (VT) allows us to investigate its performance in detail. Basic timing analysis of the code revealed that load-balancing problems may exist for certain datasets in the eigenvector calculation stage of the underlying tridiagonal eigensolver MRRR. The Vampir analyses shown below enabled us to track this potential inefficiency with great precision.
In order to track the code in more detail, different colours were assigned to different functions using the syntax described in $VT_HOME/info/GROUPS.SPEC. Some additions to the underlying source code are required and a re-compilation must be undertaken. In the timeline view shown in Figure 14 the cyan areas are set to represent computation in the subroutine DLARRV. This routine is involved in the calculation of eigenvectors. As usual, time spent in communication is represented by the red areas in the timeline and the purple lines represent individual messages passed between processors.
Figure 14. Vampir Timeline for original DLARRV subroutine
The above timeline trace shows that, when calculating half the subset of eigenvalues, the workload in DLARRV increases substantially from process 0 to process 14. This causes a large communication overhead, represented by the large red areas in the trace. Following this, it was determined that the load imbalance was primarily caused by an unequal division of eigenvectors amongst the processes. These problems were addressed by the ScaLAPACK developers and a newer version of the code gave a much better division of workload, as can be seen in the timeline traces in Figure 15.
Figure 15. Vampir Timeline for modified DLARRV subroutine
5 PARAVER performance analysis on HPCx

5.1 Setting up Paraver Tracing on HPCx
Paraver uses the tool OMPItrace to generate tracefiles for OpenMP programs, MPI programs, or mixed-mode OpenMP and MPI programs. Users should note that OMPItrace currently only works with 32-bit executables on HPCx, and that OMPItrace uses IBM's DPCL (Dynamic Probe Class Library), which requires a .rhosts file in your home directory listing all the processor ids on HPCx. Paraver tracefiles are generated on HPCx by adding the environment variables (in e.g. ksh/bash):

export OMPITRACE_HOME=/usr/local/packages/paraver
export MPTRACE_COUNTGROUP=60

to the LoadLeveler job control script. The poe command in the LoadLeveler script is then changed from e.g.

poe ./prog

to

$OMPITRACE_HOME/bin/ompitrace -counters -v poe.real ./prog
On HPCx, 'poe' is in fact a wrapper to the 'real' poe command; in order for OMPItrace to function correctly on HPCx, poe.real must be called directly.
5.2 Viewing Paraver tracefiles on HPCx
The following environment variables should be set in the user's login session:

export PARAVER_HOME=/usr/local/packages/paraver
export MPTRACE_COUNTGROUP=60

During the run, Paraver will have created a temporary trace file for each process (*.mpit and *.sim files). After the run has completed the user must pack the individual profile files into one global tracefile. This is done by issuing the command:

$PARAVER_HOME/bin/ompi2prv *.mpit -s *.sym -o trace_prm.prv

To view the resulting tracefile use the command:

$PARAVER_HOME/bin/paraver trace_prm.prv
5.3 Analysing the LUS2 application using Paraver
Unlike Vampir, Paraver shows users the Global Timeline view immediately upon starting. The parallelisation of LUS2 is based on OpenMP, so threads rather than processes are listed on the vertical axis against time on the horizontal axis. Zooming in on a representative section of the trace shows the following:
Figure 16. Paraver Timeline for two cycles of $OMP PARALLEL DO
The default colours assigned represent the following activities:
Figure 17. Colour properties in Paraver
The trace in Figure 16 shows a typical slice of the timeline from LUS2, where the code is executing the $OMP PARALLEL DO construct across the matrix as described in Section 3.4. It can be seen that relatively large swathes of blue, representing computation, are divided by thread administration tasks at the start and end of each $OMP PARALLEL DO cycle.
Figure 18. Detailed view of OMP thread scheduling in LUS2
In Figure 18, above the timeline bar of each thread is a series of green flags, each denoting a change of state in that thread. Clicking on a flag gives a detailed description, as shown in the example above. Here it can be seen that thread 16 first undergoes a global synchronisation before being scheduled to run the next cycle of the loop.
6 Summary
Profilers can be highly effective tools in the analysis of parallel programs on HPC architectures. They are particularly useful for identifying and measuring the effect of problems such as communication bottlenecks and load imbalances on the efficiency of codes. New versions of these tools also record hardware performance data, which facilitates detailed analysis of serial processor performance within a parallel run. The Vampir and Paraver GUI-based analysis tools allow users to switch with ease from global analyses of the parallel run to very detailed analyses of specific messages, all within the one profiling session. Interoperability of VampirTrace with other profilers such as KOJAK and TAU has now been made possible by the adoption of the Open Trace Format (OTF).
Acknowledgements
The authors would like to thank Matthias Jurenz from TU Dresden, Chris Johnson from EPCC University of Edinburgh, and Ilian Todorov & Ian Bush from STFC Daresbury Laboratory for their help in creating this report.
7 References

[1] Vampir – Performance Optimization, http://www.vampir.eu.
[2] VampirTrace, ZIH, Technische Universität Dresden, http://tu-dresden.de/die_tu_dresden/zentrale_einrichtungen/zih.
[3] Paraver, The European Center for Parallelism of Barcelona, http://www.cepba.upc.es/paraver.
[4] The DL_POLY Simulation Package, W. Smith, STFC Daresbury Laboratory, http://www.cse.scitech.ac.uk/ccg/software/DL_POLY/.
[5] "PDSYEVR. ScaLAPACK's parallel MRRR algorithm for the symmetric eigenvalue problem", D. Antonelli and C. Vomel, LAPACK Working Note 168, 2005. http://www.netlib.org/lapack/lawnspdf/lawn168.pdf.
[6] OMPtrace Tool User's Guide, http://www.cepba.ups.es/paraver/docs/OMPtrace.ps.
[7] The OpenMP Application Program Interface, http://www.openmp.org.
[8] NEMO – Nucleus for European Modelling of the Ocean, http://www.lodyc.jussieu.fr/NEMO/.
[9] KOJAK – Automatic Performance Analysis Toolset, Forschungszentrum Jülich, http://www.fz-juelich.de/zam/kojak/.
[10] TAU – Tuning and Analysis Utilities, University of Oregon, http://www.cs.uoregon.edu/research/tau/home.php.
[11] The European Center for Parallelism of Barcelona, http://www.cepba.upc.es.
[12] Science & Technology Facilities Council, http://www.scitech.ac.uk.
[13] "A Parallel Implementation of SPME for DL_POLY 3", I. J. Bush and W. Smith, STFC Daresbury Laboratory, http://www.cse.scitech.ac.uk/arc/fft.shtml.
[14] "A Parallel Eigensolver for Dense Symmetric Matrices based on Multiple Relatively Robust Representations", P. Bientinesi, I. S. Dhillon and R. A. van de Geijn, UT CS Technical Report #TR-03026, 2003. http://www.cs.utexas.edu/users/plapack/papers/pareig.ps.
[15] ScaLAPACK, http://www.netlib.org/scalapack/scalapack_home.html.
[16] PAPI – Performance Application Programming Interface, http://www.icl.cs.utk.edu.papi/.
Appendix A
The list of available PAPI hardware counters on HPCx.

Test case avail.c: Available events and hardware information.
-------------------------------------------------------------------------
Vendor string and code   : IBM (-1)
Model string and code    : POWER5 (8192)
CPU Revision             : 983040.000000
CPU Megahertz            : 1502.495972
CPU's in this Node       : 16
Nodes in this System     : 1
Total CPU's              : 16
Number Hardware Counters : 6
Max Multiplex Counters   : 32
-------------------------------------------------------------------------
Name          Code        Avail  Deriv  Description (Note)
PAPI_L1_DCM   0x80000000  Yes    Yes    Level 1 data cache misses
PAPI_L1_ICM   0x80000001  No     No     Level 1 instruction cache misses
PAPI_L2_DCM   0x80000002  Yes    No     Level 2 data cache misses
PAPI_L2_ICM   0x80000003  Yes    No     Level 2 instruction cache misses
PAPI_L3_DCM   0x80000004  Yes    Yes    Level 3 data cache misses
PAPI_L3_ICM   0x80000005  Yes    Yes    Level 3 instruction cache misses
PAPI_L1_TCM   0x80000006  No     No     Level 1 cache misses
PAPI_L2_TCM   0x80000007  No     No     Level 2 cache misses
PAPI_L3_TCM   0x80000008  No     No     Level 3 cache misses
PAPI_CA_SNP   0x80000009  No     No     Requests for a snoop
PAPI_CA_SHR   0x8000000a  No     No     Requests for exclusive access to shared cache line
PAPI_CA_CLN   0x8000000b  No     No     Requests for exclusive access to clean cache line
PAPI_CA_INV   0x8000000c  No     No     Requests for cache line invalidation
PAPI_CA_ITV   0x8000000d  No     No     Requests for cache line intervention
PAPI_L3_LDM   0x8000000e  Yes    Yes    Level 3 load misses
PAPI_L3_STM   0x8000000f  No     No     Level 3 store misses
PAPI_BRU_IDL  0x80000010  No     No     Cycles branch units are idle
PAPI_FXU_IDL  0x80000011  Yes    No     Cycles integer units are idle
PAPI_FPU_IDL  0x80000012  No     No     Cycles floating point units are idle
PAPI_LSU_IDL  0x80000013  No     No     Cycles load/store units are idle
PAPI_TLB_DM   0x80000014  Yes    No     Data translation lookaside buffer misses
PAPI_TLB_IM   0x80000015  Yes    No     Instruction translation lookaside buffer misses
PAPI_TLB_TL   0x80000016  Yes    Yes    Total translation lookaside buffer misses
PAPI_L1_LDM   0x80000017  Yes    No     Level 1 load misses
PAPI_L1_STM   0x80000018  Yes    No     Level 1 store misses
PAPI_L2_LDM   0x80000019  Yes    No     Level 2 load misses
PAPI_L2_STM   0x8000001a  No     No     Level 2 store misses
PAPI_BTAC_M   0x8000001b  No     No     Branch target address cache misses
PAPI_PRF_DM   0x8000001c  No     No     Data prefetch cache misses
PAPI_L3_DCH   0x8000001d  No     No     Level 3 data cache hits
PAPI_TLB_SD   0x8000001e  No     No     Translation lookaside buffer shootdowns
PAPI_CSR_FAL  0x8000001f  No     No     Failed store conditional instructions
PAPI_CSR_SUC  0x80000020  No     No     Successful store conditional instructions
PAPI_CSR_TOT  0x80000021  No     No     Total store conditional instructions
PAPI_MEM_SCY  0x80000022  No     No     Cycles Stalled Waiting for memory accesses
PAPI_MEM_RCY  0x80000023  No     No     Cycles Stalled Waiting for memory Reads
PAPI_MEM_WCY  0x80000024  No     No     Cycles Stalled Waiting for memory writes
PAPI_STL_ICY  0x80000025  Yes    No     Cycles with no instruction issue
PAPI_FUL_ICY  0x80000026  No     No     Cycles with maximum instruction issue
PAPI_STL_CCY  0x80000027  No     No     Cycles with no instructions completed
PAPI_FUL_CCY  0x80000028  No     No     Cycles with maximum instructions completed
PAPI_HW_INT   0x80000029  Yes    No     Hardware interrupts
PAPI_BR_UCN   0x8000002a  No     No     Unconditional branch instructions
PAPI_BR_CN    0x8000002b  No     No     Conditional branch instructions
PAPI_BR_TKN   0x8000002c  No     No     Conditional branch instructions taken
PAPI_BR_NTK   0x8000002d  No     No     Conditional branch instructions not taken
PAPI_BR_MSP   0x8000002e  Yes    Yes    Conditional branch instructions mispredicted
PAPI_BR_PRC   0x8000002f  No     No     Conditional branch instructions correctly predicted
PAPI_FMA_INS  0x80000030  Yes    No     FMA instructions completed
PAPI_TOT_IIS  0x80000031  Yes    No     Instructions issued
PAPI_TOT_INS  0x80000032  Yes    No     Instructions completed
PAPI_INT_INS  0x80000033  Yes    No     Integer instructions
PAPI_FP_INS   0x80000034  Yes    No     Floating point instructions
PAPI_LD_INS   0x80000035  Yes    No     Load instructions
PAPI_SR_INS   0x80000036  Yes    No     Store instructions
PAPI_BR_INS   0x80000037  Yes    No     Branch instructions
PAPI_VEC_INS  0x80000038  No     No     Vector/SIMD instructions
PAPI_RES_STL  0x80000039  No     No     Cycles stalled on any resource
PAPI_FP_STAL  0x8000003a  No     No     Cycles the FP unit(s) are stalled
PAPI_TOT_CYC  0x8000003b  Yes    No     Total cycles
PAPI_LST_INS  0x8000003c  Yes    Yes    Load/store instructions completed
PAPI_SYC_INS  0x8000003d  No     No     Synchronization instructions completed
PAPI_L1_DCH   0x8000003e  No     No     Level 1 data cache hits
PAPI_L2_DCH   0x8000003f  No     No     Level 2 data cache hits
PAPI_L1_DCA   0x80000040  Yes    Yes    Level 1 data cache accesses
PAPI_L2_DCA   0x80000041  No     No     Level 2 data cache accesses
PAPI_L3_DCA   0x80000042  No     No     Level 3 data cache accesses
PAPI_L1_DCR   0x80000043  Yes    No     Level 1 data cache reads
PAPI_L2_DCR   0x80000044  No     No     Level 2 data cache reads
PAPI_L3_DCR   0x80000045  Yes    No     Level 3 data cache reads
PAPI_L1_DCW   0x80000046  Yes    No     Level 1 data cache writes
PAPI_L2_DCW   0x80000047  No     No     Level 2 data cache writes
PAPI_L3_DCW   0x80000048  No     No     Level 3 data cache writes
PAPI_L1_ICH   0x80000049  Yes    No     Level 1 instruction cache hits
PAPI_L2_ICH   0x8000004a  No     No     Level 2 instruction cache hits
PAPI_L3_ICH   0x8000004b  No     No     Level 3 instruction cache hits
PAPI_L1_ICA   0x8000004c  No     No     Level 1 instruction cache accesses
PAPI_L2_ICA   0x8000004d  No     No     Level 2 instruction cache accesses
PAPI_L3_ICA   0x8000004e  Yes    No     Level 3 instruction cache accesses
PAPI_L1_ICR   0x8000004f  No     No     Level 1 instruction cache reads
PAPI_L2_ICR   0x80000050  No     No     Level 2 instruction cache reads
PAPI_L3_ICR   0x80000051  No     No     Level 3 instruction cache reads
PAPI_L1_ICW   0x80000052  No     No     Level 1 instruction cache writes
PAPI_L2_ICW   0x80000053  No     No     Level 2 instruction cache writes
PAPI_L3_ICW   0x80000054  No     No     Level 3 instruction cache writes
PAPI_L1_TCH   0x80000055  No     No     Level 1 total cache hits
PAPI_L2_TCH   0x80000056  No     No     Level 2 total cache hits
PAPI_L3_TCH   0x80000057  No     No     Level 3 total cache hits
PAPI_L1_TCA   0x80000058  No     No     Level 1 total cache accesses
PAPI_L2_TCA   0x80000059  No     No     Level 2 total cache accesses
PAPI_L3_TCA   0x8000005a  No     No     Level 3 total cache accesses
PAPI_L1_TCR   0x8000005b  No     No     Level 1 total cache reads
PAPI_L2_TCR   0x8000005c  No     No     Level 2 total cache reads
PAPI_L3_TCR   0x8000005d  No     No     Level 3 total cache reads
PAPI_L1_TCW   0x8000005e  No     No     Level 1 total cache writes
PAPI_L2_TCW   0x8000005f  No     No     Level 2 total cache writes
PAPI_L3_TCW   0x80000060  No     No     Level 3 total cache writes
PAPI_FML_INS  0x80000061  No     No     Floating point multiply instructions
PAPI_FAD_INS  0x80000062  No     No     Floating point add instructions
PAPI_FDV_INS  0x80000063  Yes    No     Floating point divide instructions
PAPI_FSQ_INS  0x80000064  Yes    No     Floating point square root instructions
PAPI_FNV_INS  0x80000065  No     No     Floating point inverse instructions
PAPI_FP_OPS   0x80000066  Yes    Yes    Floating point operations
-------------------------------------------------------------------------
avail.c PASSED