Memory Leak Detection in Fortran Applications using TAU

Sameer Shende and Allen Malony
ParaTools, Inc., Eugene, OR
{sameer,malony}@paratools.com

Shirley Moore and David Cronk
University of Tennessee
{shirley,cronk}@cs.utk.edu

Abstract

There is a growing awareness that high-end performance evaluation and tuning requires holistic program analysis. In addition to CPU performance characterization, observation of memory, network, and I/O performance can help to identify execution bottlenecks related to these factors. Correctness of memory and communication operations is also an issue and can affect performance indirectly. This paper describes extensions to the TAU performance system to incorporate direct source-level code instrumentation for tracking dynamic memory management in Fortran codes that use allocate and deallocate statements. TAU’s lightweight profiling can then generate a detailed report of memory usage including the sizes of memory blocks allocated and deallocated with precise program attribution: variable name, source line number, and file name. We report on results and experiences in applying TAU to the PTURBO application.

1. Introduction

The TAU performance system [1] provides users with robust instrumentation, measurement, and analysis capabilities for observing and evaluating the performance of high-end DoD applications. It supports both profiling and tracing modes of measurement and has been ported to all DoD HPC platforms, installed and supported by HPCMP PET Computational Environment Core Support personnel as part of the consistent tool environment at the MSRCs and allocated DCs. TAU can instrument application source code automatically at the level of routines and outer-loops. It captures both time and hardware counter data from PAPI [3], generates trace files in several trace formats, presents the profile information in interactive displays, and stores experiment results in a performance database.
Until recently, TAU provided support primarily for measurement of control flow performance, with less accounting of memory usage. Observation of memory allocation and deallocation is important to understand how memory is being managed during execution, especially when it leads to leakage problems. This paper highlights TAU’s capabilities for evaluating memory usage in Fortran applications and its successful application in an important DoD application.

2. The Problem

The TAU performance system provides robust portable profiling and tracing capabilities. Performance measurement in TAU follows a process of selecting “events” for instrumentation, deciding what performance data to collect, running the experiment, and analyzing the profile or trace data generated. TAU supports both interval events, having entry and exit semantics (such as routine or loop level timers), as well as atomic events that are triggered at a specific point in the program’s execution and can record application-specific data. An application programmer can use the TAU measurement API for source code instrumentation. However, for purposes of efficiency and reliability, TAU also supports automatic instrumentation in several forms: source code, library interposition, and dynamic instrumentation (dynamic library loading and binary rewriting). Automated source code instrumentation is accomplished with the TAU static source analysis and instrumentation tool.

Here, we leverage TAU’s atomic context events to track the volume of memory allocations, deallocations, and memory leaks. A memory leak is defined as a memory allocation that does not have a corresponding deallocation. Previously, the amount of heap memory utilized in the program could be examined in TAU using periodic interrupts. However, this did not reveal where and when the allocations and de-allocations took place. A common user request is for information regarding the precise program attribution (application variable name, source line number, file name) where allocations and de-allocations take place. This level of detail is not typically available in performance analysis tools. For C and C++, TAU already supports wrappers for malloc() and free() that provide information on memory usage, but such schemes do not extend to Fortran easily. Instead, a direct instrumentation approach that places instrumentation hooks in the program source at the point of memory allocation and de-allocation has been deployed. Through these hooks, TAU tracks and reports memory operations and memory leaks.

3. Related Work

There are a number of tools that track memory usage in C and C++ using wrapper libraries for malloc and free [7,8]. These tools re-define malloc / free calls during the pre-processing stage of compilation in order to track memory operations. These approaches do not extend to dynamic memory allocations in Fortran where intrinsic calls to malloc and free are used. Each compiler implements its own intrinsic library and manages memory internally. One approach to tracking memory in Fortran is to use a debugger or a memory emulator such as Valgrind [5].
Valgrind reports slowdowns of an order of magnitude in execution time for memory intensive applications. TotalView Technologies’ MemoryScape and TotalView [4] debugging tools and IBM’s Purify [6] are examples of commercially available tools that track memory operations. While Purify tracks each memory reference, debuggers typically require the program to be compiled with debugging flags (-g) to gather file and line number information. They also recommend turning off all optimization switches (-O), which can consequently slow down a parallel application. Debuggers typically track the entire memory of the execution and locate memory leaks by examining registers and memory pointers to identify addresses that are not referenced. This is an expensive operation and may not always be suitable for high performance computing applications. Batch execution also poses interesting problems for debuggers that require a GUI. However, debuggers are now providing scripting interfaces that permit users to describe memory tracking operations in a script and execute their code within a batch system [4]. Fortran poses another unique problem for memory tracking tools. It permits multiple arrays and variables to be allocated and de-allocated in a single line of code. When a memory leak is detected at a given line by a debugger, the user must re-execute the program after setting appropriate breakpoints and examine the source code and program state to identify which variable is responsible for the leak. The problems associated with existing approaches led us to investigate the use of source-based instrumentation to track memory operations and identify memory leaks in Fortran programs. Our goal was to create a lightweight solution that would work reliably in the presence of code-transforming
optimizations and report the volume of memory operations in both profiles and event traces at the level of a Fortran variable name, source line number and file name.

4. Method of Solution

4.1 Instrumentation Specification

To instrument a program, TAU parses the original source code and applies event selection and instrumentation rules found in an instrumentation specification file to produce an instrumented copy of the source code. The instrumentation specification describes the locations in the source code where instrumentation hooks are to be inserted. The instrumentation specification acts as a filter whereby the user identifies which events and files should be included or excluded from instrumentation. TAU's source code instrumentation tool (tau_instrumentor) reads the instrumentation specification file and automatically performs the necessary modifications to the source code. The user may define multiple instrumentation specification files to create different instrumented source versions. This is important for purposes of storing instrumentation configurations and automating experimentation. The user may also maintain an instrumentation history in this way, as instrumentation is being refined.

To instrument all dynamic memory allocations and de-allocations in a given routine, the following section in the selective instrumentation file may be used:

BEGIN_INSTRUMENT_SECTION
memory [file=""] routine=""
END_INSTRUMENT_SECTION

The file name may contain the Kleene star * and ? as wildcard characters, where the former matches zero or more characters and the latter matches exactly one character in the name. The routine name may be fully qualified with a signature, or may be just a single word comprising the routine name. The fully qualified name appears in the TAU program database (PDB) generated by PDTtoolkit. It also appears in profile files and may be extracted using TAU’s pprof or paraprof utilities. The routine name may contain # as a wildcard character to match zero or more characters, like the Kleene star (*) operator. For example,

BEGIN_INSTRUMENT_SECTION
memory file="*.f90" routine="#"
END_INSTRUMENT_SECTION

is sufficient to instrument all dynamic memory allocations and de-allocations in all routines in all Fortran source files that have a .f90 suffix.

To instrument a program, a user simply changes the name of the compiler in the makefiles to a TAU wrapper shell script (tau_f90.sh for Fortran, tau_cxx.sh for C++, and tau_cc.sh for C). In addition, two environment variables (TAU_MAKEFILE and TAU_OPTIONS) describe the nature of instrumentation and measurement and select the instrumentation specification file. During compilation, the application program is parsed and an instrumented version of the program is generated and compiled.
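The wildcard rules described above for file and routine names can be illustrated with a short Python sketch. It is not TAU's implementation: the helper names (`file_matches`, `routine_matches`) are hypothetical, and case-sensitive matching is an assumption made here for simplicity.

```python
import re


def _wildcard_to_regex(pattern, many, one=None):
    """Translate a wildcard pattern into a compiled regular expression.

    `many` is the character that matches zero or more characters;
    `one` (optional) is the character that matches exactly one character.
    Every other character is matched literally.
    """
    parts = []
    for ch in pattern:
        if ch == many:
            parts.append(".*")
        elif one is not None and ch == one:
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def file_matches(pattern, filename):
    """File patterns: '*' is zero or more characters, '?' is exactly one."""
    regex = _wildcard_to_regex(pattern, many="*", one="?")
    return regex.fullmatch(filename) is not None


def routine_matches(pattern, routine):
    """Routine patterns: '#' is zero or more characters (assumed case-sensitive)."""
    regex = _wildcard_to_regex(pattern, many="#")
    return regex.fullmatch(routine) is not None


if __name__ == "__main__":
    print(file_matches("*.f90", "mod.f90"))             # True
    print(file_matches("*.f90", "TURBO.F"))             # False
    print(routine_matches("#", "ALLOCATE_VARIABLES"))   # True
```

With file="*.f90" and routine="#", every routine in every file ending in .f90 is selected, which corresponds to the example specification above.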
4.2 Design of the memory instrumentation tool

We extended tau_instrumentor to implement the above instrumentation specification. This work included several steps. First, we extended the instrumentation specification file parser module to read in instrumentation requests for memory events. Next, a list of instrumentation requests is created. When a PDB file is read in with the source code, this list is used for identifying source locations in each routine where Fortran allocate and de-allocate intrinsic routines are called. Memory instrumentation requests can co-exist with other instrumentation requests for loop-level instrumentation and routine entry and exit instrumentation. These requests are sorted by source location and stored internally. Next, the source file is read line by line and checked against the list of instrumentation requests. If a request exists at the current line, it is processed.

Memory allocation requests in Fortran require special parsing of the source code to identify the names of variables at a given statement. Unlike C and C++, Fortran allocation statements can simultaneously allocate (and deallocate) memory for multiple allocatable arrays in the same statement. Each variable, in a comma separated list, is associated with the size of the array dimensions in the allocation statement. An optional trailing status argument can also be specified in the list of parameters for allocation and deallocation. The instrumentor parses the allocation statement and extracts the names of variables, ignoring the status argument. It writes the original allocation statement to the instrumented source file, followed by a series of calls to the TAU_ALLOC subroutine (one for each variable) that pass the variable name, its address, size (obtained from the Fortran sizeof intrinsic function, or size under AIX), source line number, and file name.
This step involves parsing continuation statements in fixed and free format as well as checking if the memory statement is associated with a single-if Fortran statement. In the latter case, the single-if statement is re-written as a compound statement with a then and an endif clause. It includes the allocate intrinsic routine call followed by a call to the TAU_ALLOC subroutine. This routine has been implemented in the TAU measurement library and correlates callstack information with memory events. Each allocation request is mapped back to the list of calling routines, or the edges in the callgraph, to uniquely identify the calling location using a TAU context event. The depth of the callpath information maintained in TAU can be specified using the runtime environment variable TAU_CALLPATH_DEPTH. When a de-allocate event is encountered, the instrumentor extracts the names of variables to be de-allocated by parsing the statement. Extracting the variable name from the allocate/deallocate statement involves checking the number of open and close parentheses. A variable name may include access to user defined type members. TAU extracts the qualified name of the variable that performs the allocation/deallocation. There can be multiple allocation/de-allocations on each line, so it is important to extract names of all variables involved. For the deallocate statement, TAU inserts a call to the TAU_DEALLOC routine and passes the address of the variable, its name, source line number and file name. It is not necessary to pass the size information at this stage as it is maintained internally in TAU. The de-allocate routine in the TAU library triggers a context event to associate the de-allocation with the executing callpath. It then removes the variable's entry from the list of allocations. At the end of program execution, the memory leak detection module in each thread writes out the list of allocations that were not de-allocated. 
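The name extraction step described above (counting parentheses, keeping derived-type member access, and ignoring the status argument) can be sketched in Python as follows. This is a minimal illustration, not the tau_instrumentor code: the function names are hypothetical, and the sketch assumes the statement is already joined onto one line, begins with the allocate or deallocate keyword, and contains no type specification or quoted strings.

```python
import re

# A list item of the form "keyword = value", such as stat=ierr.
_KEYWORD_ARG = re.compile(r"[A-Za-z_]\w*\s*=")


def split_top_level(text):
    """Split on commas that are not nested inside parentheses."""
    items, current, depth = [], [], 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            items.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    items.append("".join(current).strip())
    return [item for item in items if item]


def strip_shape(item):
    """Drop the final parenthesised shape, keeping earlier subscripts."""
    if not item.endswith(")"):
        return item
    depth = 0
    for i in range(len(item) - 1, -1, -1):
        if item[i] == ")":
            depth += 1
        elif item[i] == "(":
            depth -= 1
            if depth == 0:
                return item[:i].rstrip()
    return item  # unbalanced parentheses: leave the item untouched


def memory_statement_names(statement):
    """Return the qualified variable names in an allocate/deallocate statement."""
    stmt = statement.strip()
    open_idx, close_idx = stmt.find("("), stmt.rfind(")")
    keyword = stmt[:open_idx].strip().lower() if open_idx != -1 else ""
    if keyword not in ("allocate", "deallocate") or close_idx < open_idx:
        raise ValueError("not an allocate/deallocate statement: " + statement)
    names = []
    for item in split_top_level(stmt[open_idx + 1:close_idx]):
        if _KEYWORD_ARG.match(item):
            continue  # skip the status argument (stat=...)
        names.append(strip_shape(item) if keyword == "allocate" else item)
    return names


if __name__ == "__main__":
    print(memory_statement_names("allocate(a(n), fj(nlb)%v(n1, n2), stat=ierr)"))
```

For the statement in the usage line, the sketch yields one name per allocated variable, so an instrumentor built along these lines could emit one tracking call per name after the original statement.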
Thus, in TAU profiles, we see the following information:

• Node information - The summary of the size of memory allocations and de-allocations at a source location including number of samples, minimum, maximum, mean and standard deviation across all samples. Information about the variable name, source line number, and file name is also stored in each entry for each thread of execution.
• Callpath information - The above summary information further partitioned and qualified with the precise callpath, or the sequence of events that led to a given allocation or de-allocation.
• Memory leak information - The callpaths and the variables that were associated with memory leaks, their allocation locations, and the sizes of memory leaks along each callpath. The size information as described above includes detailed statistics regarding the number of samples, maximum, minimum and standard deviation in the memory sizes at each location, for each thread of execution.
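The bookkeeping behind these three kinds of information can be sketched in Python. This is an illustrative model rather than the TAU measurement library: the class and method names are hypothetical, the sizes in the usage example are made up, and the use of the population standard deviation is an assumption.

```python
import statistics
from collections import defaultdict


class MemoryTracker:
    """Per-thread record of allocations, keyed by address."""

    def __init__(self):
        self.live = {}                          # address -> (size, event key)
        self.alloc_sizes = defaultdict(list)    # event key -> allocated sizes
        self.dealloc_sizes = defaultdict(list)  # event key -> freed sizes

    def alloc(self, address, size, variable, line, filename, callpath):
        key = (filename, line, variable, callpath)
        self.live[address] = (size, key)
        self.alloc_sizes[key].append(size)

    def dealloc(self, address, variable, line, filename, callpath):
        # The size is not passed in; it is looked up from the allocation entry.
        entry = self.live.pop(address, None)
        if entry is None:
            return  # address was never tracked; ignored in this sketch
        size, _ = entry
        key = (filename, line, variable, callpath)
        self.dealloc_sizes[key].append(size)

    def leaks(self):
        """Allocations still live at exit, grouped by allocation site."""
        by_site = defaultdict(list)
        for size, key in self.live.values():
            by_site[key].append(size)
        return dict(by_site)


def summarize(samples):
    """Summary statistics of the kind reported for each memory event."""
    return {
        "samples": len(samples),
        "min": min(samples),
        "max": max(samples),
        "mean": statistics.fmean(samples),
        "std_dev": statistics.pstdev(samples),  # population form is an assumption
    }


if __name__ == "__main__":
    t = MemoryTracker()
    # Hypothetical sizes; names and line numbers follow the PTURBO case study.
    t.alloc(0x1000, 800, "s", 251, "TURBO.F", "TURBO")
    t.alloc(0x2000, 400, "fj(nlb)%v", 686, "mod.memori.F",
            "TURBO=>ALLOCATE_VARIABLES")
    t.dealloc(0x1000, "s", 568, "TURBO.F", "TURBO")
    for site, sizes in t.leaks().items():
        print(site, summarize(sizes))
```

Keying the live table by address is what allows the de-allocation hook to omit the size, and whatever remains in the table at exit is the leak report.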
5. Significance to DoD - Case Study: PTURBO

To test the implementation, we instrumented the Fortran routines for allocating and deallocating memory in the DoD PTURBO Challenge application, in addition to carrying out routine-level and selective loop-level performance profiling. PTURBO is in production use at ASC MSRC and under development at ASC and NASA GRC. We compiled and ran PTURBO on 8 CPUs. The code ran for about 1 hour and 50 minutes on kraken (an IBM Power4 based p655 AIX system). Typical overhead introduced by TAU v2.16.4 was less than 1% of total execution time as shown in Table 1 below. We optimized the instrumentation in PTURBO using runtime throttling and a selective instrumentation file that contained an exclude list of lightweight routines obtained from a prior profiling run. The instrumented case profiled 117 distinct routines (events) and generated 1100 atomic event instances or more for each rank. We then instrumented PTURBO with TAU using six simultaneous hardware performance counters obtained from PAPI [3] as well as the wallclock time. The instrumentation overhead observed was still under 1% as shown in the table below. Using the TAU performance data, we found a load imbalance in the application where processor zero did more computation than communication, delaying the other nodes. We also used outer-loop level instrumentation to identify specific loops that were compute-intensive.

Performance experiment                     Elapsed wallclock time   Elapsed CPU time (all ranks)
Uninstrumented original execution          6617.978 s               52356.878 s
TAU with memory instrumentation            6643.019 s               52563.281 s
Overhead for memory instrumentation        0.378 %                  0.394 %
TAU with 6 PAPI counters and time          6656.526 s               52757.187 s
Overhead for PAPI-based instrumentation    0.582 %                  0.764 %

Table 1. Runtime overhead incurred by monitoring PTURBO with TAU
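The overhead percentages in Table 1 follow from the elapsed times as (instrumented - baseline) / baseline. The short Python check below recomputes them; the function and variable names are illustrative only. The recomputed values agree with the table to within 0.001 percentage points, and the table entries appear to be truncated rather than rounded.

```python
def overhead_percent(baseline_s, instrumented_s):
    """Relative slowdown of an instrumented run, in percent."""
    return 100.0 * (instrumented_s - baseline_s) / baseline_s


# Elapsed times in seconds, taken from Table 1.
WALLCLOCK = {"baseline": 6617.978, "memory": 6643.019, "papi": 6656.526}
CPU_ALL_RANKS = {"baseline": 52356.878, "memory": 52563.281, "papi": 52757.187}

if __name__ == "__main__":
    for label, times in (("wallclock", WALLCLOCK), ("CPU", CPU_ALL_RANKS)):
        for run in ("memory", "papi"):
            pct = overhead_percent(times["baseline"], times[run])
            print(f"{label:9s} {run:6s} overhead: {pct:.4f} %")
```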
Memory instrumentation of PTURBO revealed several memory leaks in the code as shown in Figure 1. Each memory leak is highlighted along with the number of times the corresponding allocations took place and statistics (number of samples, max, min, mean, and std.dev.) for the memory event. Each allocation and deallocation instance is shown with its size. Clicking on a single item in TAU's Paraprof browser reveals details across all ranks as shown in Figure 3 below. There is no leak in this case, and all the memory allocated above is de-allocated below. We see the name of the file (TURBO.F), the variable name (s), and line numbers (251, 568 respectively) along with the callpath that led to this atomic event. A callpath of the form a=>b=>c refers to the callsite in a routine c when it was called by b and when b was called by a.
Figure 1. Size of arrays allocated for a specific variable on each processor for an 8 cpu execution of PTURBO.
Figure 2. Memory events on rank 0 for PTURBO.
The profile on rank 0 (n,c,t 0,0,0) is shown in Figure 2. It shows us the maximum value of memory events (1.23E7 elements) for the leak in mod.memori.F variable fj(nlb)%v at line 686 when TURBO calls ALLOCATE_VARIABLES.
Figure 3. Size of arrays allocated for a specific variable on each processor for an 8 cpu execution of PTURBO.
In Figure 4, we see how this leak appears across all ranks of the application. Ranks 0 and 1 have smaller leaks than ranks 2 through 7. Most of the memory leaks detected in PTURBO were observed in initialization routines or modules where the memory was allocated once and used throughout the program. The deallocate statement was not invoked prior to termination, generating the memory leak. Discussions with the PTURBO developer led to the conclusion that the addition of the deallocate statements would improve the code but they were not deemed critical. The information on the location, volume, and number of memory allocations and de-allocations was of greater value in identifying parts of the CFD code where these memory operations occurred.
Figure 4. Memory leaks for a single variable differ on each rank. MPI ranks 0 and 1 have smaller leaks for the fj(nlb)%v variable.
7. Conclusions

We have developed a lightweight memory evaluation capability in TAU that identifies the volume and location of memory allocations, deallocations, and leaks in Fortran programs using source-based instrumentation. In applying this feature to a production CFD code under AIX, we observed less than 1% overhead for the entire execution. The features described in this paper are included in the TAU distribution which is freely available for download from the TAU website (http://www.cs.uoregon.edu/research/tau/) under a BSD style license. In the future, we plan to apply TAU’s memory tracking features to additional DoD applications.

8. Acknowledgements

This publication was made possible through support provided by DoD HPCMP PET activities through Mississippi State University under the terms of Agreement No. #N62306-01-D-7110. The opinions expressed herein are those of the author(s) and do not necessarily reflect the views of the DoD or Mississippi State University. The work described in this paper was carried out as part of the PET COP-KY6-001 project. The authors wish to thank Alan Morris, University of Oregon, Tom Cortese, University of Tennessee, and the PTURBO development team.
References

1. S. Shende and A. D. Malony, “The TAU Parallel Performance System,” Int’l Journal of High Performance Computing Applications, SAGE Publishers, 20(2): pp. 287-331, Summer 2006.
2. U. Oregon, “Tuning and Analysis Utilities,” http://www.cs.uoregon.edu/research/tau/, 2007.
3. S. Browne, J. Dongarra, N. Garner, G. Ho, and P. Mucci, “A Portable Programming Interface for Performance Evaluation on Modern Processors,” Int’l Journal of High Performance Computing Applications, 14(3), pp. 189-204, 2000. http://icl.cs.utk.edu/papi/
4. TotalView Technologies, http://www.totalviewtech.com/, 2007.
5. Valgrind, http://www.valgrind.org/, 2007.
6. IBM, Rational Purify, http://www-306.ibm.com/software/awdtools/purify/, 2007.
7. Peter Toft, “Some memory tools and functions categorized by license,” http://www.sslug.dk/emailarkiv/bog/2001_08/msg00030.html.
8. Petr Sorfa, “Debugging Memory on Linux,” Linux Journal, Issue 87, pp. 84.