Selective Instrumentation and Monitoring ∗
Michael Gerndt, Edmond Kereku
Institut für Informatik, LRR, Technische Universität München
[email protected],
[email protected]
Abstract

This paper presents the design and implementation of a monitoring environment for parallel applications. The focus of our work is on selective monitoring: the amount of runtime data is reduced by collecting only the information required during the search for performance bottlenecks. The programmer can selectively instrument a Fortran 95 program via a production-quality source-code instrumenter. In addition, the monitor can be configured via the Monitoring Request Interface (MRI). The instrumenter and a first prototype of the monitor were implemented within the EP-Cache project, focusing on memory hierarchy analysis.
Keywords: Automatic Performance Analysis, Program Instrumentation, OpenMP, Selective Monitoring
1 Introduction
Performance tuning of parallel programs is of increasing importance since the complexity of parallel architectures is more and more exposed to the programmer. Current machines are scalable systems with shared memory nodes. The architecture space of the nodes covers SMP systems with vector or RISC processors as well as ccNUMA-type systems with large numbers of processors. The overall memory is distributed amongst the nodes, thus requiring hybrid programming with OpenMP and MPI.

A number of performance analysis tools are available to assist the user in analyzing program behavior, e.g. VTune, SUN One, and Vampir. All these tools rely on runtime information. Profiling tools collect summary information via sampling or instrumentation, while tracing tools collect traces via source code or binary instrumentation. Profiling tools produce a moderate amount of data, whereas tracing tools create huge amounts of data. Of course, all profiling information can be generated from the traces; thus, tracing tools are more general. Applying these tools to real programs usually requires a first investigation with profiling tools and a subsequent inspection with tracing tools. But instead of tracing all information in the latter stage, it would be much better to combine both types of tools so that the amount of information generated is limited by selecting only the information that is really required.

We present a solution to this problem that is based on source code instrumentation for Fortran 95 OpenMP programs and on a monitoring system offering an interface for specifying monitoring requests. This interface, called the Monitoring Request Interface (MRI), can be used by offline and online analysis tools. Besides reducing the information provided by monitoring, our approach allows gathering information for a large variety of program regions, including subprograms, loops, vector statements, call sites, OpenMP parallel regions, and MPI routines. The instrumenter is implemented on top of the commercial Fortran 95 frontend from NAG. It performs automatic instrumentation guided by command-line options [8, 5].

Section 2 of this article gives an overview of related work. Section 3 describes the Monitoring Request Interface. Section 4 provides details on the instrumentation process, especially the transformations required for instrumenting sequential regions. Section 5 presents the MRI-based monitor. Section 6 analyzes the overhead for a real application.

∗ Part of this work is funded by the European Commission via the working group on Automatic Performance Analysis: Real Tools (APART), http://www.fz-juelich.de/apart, EP-Cache ...
2 Related Work
Source code instrumentation is typically done by the compiler. A more recent portable tool was developed in the KOJAK project at Research Centre Jülich [9]: OPARI [12, 11] instruments OpenMP regions in Fortran and C programs. Source code instrumentation of program regions is also supported by the instrumenter used in the Askalon project at the University of Vienna [3], the AIMS toolsuite [1], and the TAU environment [14]. None of these tools is as general with respect to the instrumentation of program regions as the tool presented here.

The other main approach is object code instrumentation, which can be applied before loading the program or during program execution. This technique was developed within the Paradyn environment and is supported in DYNINST [7, 4] and DPCL [6, 13]. Object code instrumentation is language independent and can strictly limit the overhead of the instrumentation. On the other hand, program regions other than functions and call sites cannot be instrumented.

In the context of parallel applications, the MPI profiling interface (PMPI) was included in the MPI standard, and the POMP interface [10] has been proposed as a standard interface for monitoring libraries for OpenMP programs. An excellent collection of instrumentation tools can be found at [2].
3 The Monitoring Request Interface
While monitoring in current tools is done in a very tool-specific way, we propose a standard interface, the Monitoring Request Interface (MRI). As shown in Figure 1, the MRI is used by tools to request information as well as to retrieve it. Online tools directly process the obtained information and can request further information based on the status of the performance property search. Offline tools use an archiver that requests the information and stores it in files.

Figure 1: The monitor consists of sensors providing information and the Runtime Information Producer (RIP) implementing the MRI. Analysis tools submit requests to the RIP and retrieve the runtime information produced for the requests.

The MRI is based on the following definitions:

• A code region is a single-entry block of statements. Examples are functions, sequential loops, vector statements, parallel loops, and parallel sections.
• A region instance is the execution of a region by a thread of control.
• An active object is an entity performing some computation. Examples are compute nodes, processes, and threads.
• Runtime information is any information gathered during the execution of an application. Examples are individual events, such as RegionEnter and RegionExit, the number of L1-cache hits, and the execution time of regions.
• An aggregation combines runtime information from multiple instances of a region. These might be region instances in the same thread or in different threads. Examples are the sum over all instances in a single thread, the sum over all instances in all threads, and the mean and standard deviation over all instances in a single thread.

The MRI provides routines for four different purposes: runtime information publication, monitoring request management, runtime information delivery, and application control. The publication routines allow a monitor to advertise the measured information. The application control routines can be used by online tools to start and stop the application in order to perform an incremental performance property search. These two groups are not discussed further in this article. The monitoring request management routines provide means to insert and delete monitoring requests. A monitoring request specifies:
• the requested runtime information,
• the set of monitored active objects,
• the set of regions,
• and, possibly, an aggregation across region instances in a single active object or a set of active objects.

The specification of the regions and the active objects is based on a hierarchically structured name space. A program is represented by a tree structure which determines the nesting of program regions. Each node in the tree has a specific region type and a unique region id. The root node is the program node, representing the whole program. The next levels are program units and regions within units. The root node of the tree representing the active objects is called the application object. On the next levels are processes and threads.

Selecting regions and active objects in a request is based on a simple principle. The requester specifies a reference node in the tree and an entity type. Runtime information is then collected for all entities of that type in the subtree. For example, if a request specifies the application object and the process type, it asks for runtime information about all processes. If a request specifies the application object and the thread type, runtime information will be provided for all threads. If a request specifies a specific process and no type, runtime information will be provided for this process only.

A request is inserted with the routine

Req_ID MRI_req_submit(runtime_info, regions, active_objects, aggregation);
It returns a request id which is subsequently used for accessing the data. The data delivery routines support both the pull and the push model. For the pull model we provide a routine that retrieves the runtime information for a specific request id. The push model is supported via a routine that is registered as a callback. Periodic runtime information, such as trace data, is written to a monitor-local buffer. When the buffer is full, the callback is invoked and the tool which registered it can process the data. A sketch of this request lifecycle is shown below.
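The following C sketch illustrates the request lifecycle. It is a minimal sketch under assumed signatures: the paper names MRI_req_submit only conceptually, so all types, constants, and the retrieval and callback routines shown here are hypothetical.

    /* All identifiers are illustrative; the MRI gives no concrete C binding. */
    typedef int Req_ID;
    typedef struct MRI_Node MRI_Node;            /* node in region/object tree */

    enum { RI_L1_MISSES = 1 };                   /* assumed metric id */
    enum { AGG_SUM_ALL_THREADS = 1 };            /* assumed aggregation id */

    extern Req_ID MRI_req_submit(int runtime_info, MRI_Node *regions,
                                 MRI_Node *active_objects, int aggregation);
    extern int    MRI_get_result(Req_ID id, void *buf, int size);       /* pull */
    extern void   MRI_register_callback(Req_ID id, void (*cb)(void *)); /* push */

    void example(MRI_Node *program_node, MRI_Node *application_object)
    {
        long long misses;

        /* Reference node = program node; sum the L1 misses over all
           region instances in all threads. */
        Req_ID id = MRI_req_submit(RI_L1_MISSES, program_node,
                                   application_object, AGG_SUM_ALL_THREADS);

        MRI_get_result(id, &misses, sizeof misses); /* pull the aggregated value */
    }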
4 Selective Instrumentation
Selective monitoring requires two techniques: selective instrumentation and a configurable monitor. We implemented selective source code instrumentation for Fortran 95 programs. It is based on the commercial compiler frontend from NAG. Our Fortran 95 instrumenter is the only instrumenter based on a production-quality frontend, and it implements instrumentation of a broad spectrum of regions, sequential as well as OpenMP parallel.
We uniquely identify regions in the inserted calls to the monitoring library via a unique file id and the region's first line number. The file id is specified via a command-line argument to the instrumenter.
4.1 Program Regions
The following region types can be instrumented: program units, i.e. the main program, procedures, and functions; loops and nested loops; vector statements; call sites; IO statements; parallel regions; work-sharing regions; master and single regions; and synchronization regions. Very important for tuning the single-node performance of applications is the instrumentation of loops. To reduce program perturbation, the user can select to instrument only outermost loops. If this is not sufficient, nested loops can also be instrumented. Here, nested loops do not include perfectly nested loops: perfect loop nests are handled as multidimensional loops, and inner loops are only instrumented if they are not perfectly nested in an already instrumented loop. The region types to be instrumented can be selected via command-line switches. The following example illustrates the instrumentation of a loop nest.
call enter_region(LOOP, 17, 42)
do i=1,n                          ! orig. line 42
  do j=1,n
    a(i,j)=...
    call enter_region(LOOP, 17, 45)
    do k=1,n                      ! orig. line 45
      ...
    enddo
    call end_region(LOOP, 17, 45)
  enddo
enddo
call end_region(LOOP, 17, 42)
We specified the command line f90inst test.f90 loop nestedloop. The loop switch triggers the instrumentation of the outermost loop nest, while the nestedloop switch enforces instrumentation of the inner loop nest.
4.2 Instrumentation Transformations
Several problems have to be solved when implementing this instrumentation strategy for Fortran 95. First of all, the statement to be instrumented can be the target of a jump. Our instrumenter modifies the code such that the label is removed from the original statement and added to the inserted start_region statement. Statements can also be guarded by a logical IF statement; we transform logical IF statements into block IF statements if the guarded statement has to be instrumented. The following transformations solve region-specific problems. When instrumenting the main program, we not only have to insert end_region before the program's END statement but also
before each STOP statement. A STOP can terminate multiple active regions (a region is called active if its start_region was executed but not yet its end_region), while an END cannot. Therefore, we insert the end_all_regions routine before each STOP. It has only two parameters: the file id of the file containing the STOP and the current line number (the region id of the main routine might not be known when the STOP is instrumented). This technique allows the monitor to issue an error message if other active regions exist when end_region is executed.

There is one other issue with program units. While the main program is always instrumented, the user can select whether to instrument subroutines. If subroutines are not instrumented, a call to enter_region and leave_region is inserted. These routines indicate to the monitor that a new unit is executed, but no measurement can be requested for that region. This reduces the overhead for the unit merely to the single function call. We decided to insert the calls since the monitor might need that information for setting up internal data structures.

There are three more situations where multiple active regions can be terminated: loop nests can be terminated by jumps as well as via CYCLE and EXIT. In all three cases we insert end_region with the identification of the outermost terminated region. In addition, our end_region routine has a Region Abortion Flag. This flag is TRUE if it is legal to terminate nested active regions and FALSE otherwise.

call start_region(LOOP, 42, 17)
do i=1,n
  ...
  call start_region(LOOP, 42, 19)
  do j=1,m
    ...
    call end_region(LOOP, 42, 17, TRUE)
    GOTO 123
  enddo
  call end_region(LOOP, 42, 19, FALSE)
enddo
call end_region(LOOP, 42, 17, FALSE)
...
123 continue

The example above demonstrates the termination of multiple active regions via GOTO. The same may happen due to CYCLE and EXIT. If the GOTO terminated only the inner loop, the Region Abortion Flag would be FALSE.

Besides writing loops in the structured DO ... END DO fashion, Fortran allows terminating multiple loops with a single labeled statement:

   do 25 i=1,n
     do 25 j=1,n
       ...
       do 25 k=1,m
         ...
25 continue

This notation hinders instrumentation of inner loops. Instead of normalizing all loops before the instrumentation, we keep the code as close to the original as possible: we only normalize the outer loops to DO ... END DO.

call start_region(LOOP, 42, 17)
do i=1,n
  do j=1,n
    ...
    call start_region(LOOP, 42, 25)
    do 25 k=1,m
      ...
25  continue
    call end_region(LOOP, 42, 25, FALSE)
  end do
end do
call end_region(LOOP, 42, 17, FALSE)

Instrumentation of vector statements in Fortran 95 is as important as instrumentation of sequential loops. While vector assignments and FORALL and WHERE statements are straightforward to instrument, FORALL and WHERE blocks have limitations: only the outermost FORALL block can be instrumented, since the statements inside are executed as vector statements.

Instrumentation of IO statements enables measurement of IO overhead. The tricky part is that Fortran allows the specification of target statements via labels for error and end-of-file events. Although a normalization as shown below would be possible, it is very system dependent since the return values of IOSTAT are not standardized.

read (unit=1, err=10, end=20) ...

Transformed code:

read (unit=1, iostat=stat) ...
if (stat == -1) then      ! end of file; exact value is system dependent
  goto 20
else if (stat /= 0) then  ! error
  goto 10
end if

This is the main reason why we chose another transformation strategy. We create two new labels and let the IO statement jump to an inserted piece of code that terminates the region and then branches to the original target statement. The new code is inserted at the end of the subprogram.

call start_region(READ, 42, 130)
read (unit=1, err=30, end=40) ...
call end_region(READ, 42, 130, FALSE)
...
30 call end_region(READ, 42, 130, FALSE)
   goto 10
40 call end_region(READ, 42, 130, FALSE)
   goto 20
The last transformation is again related to subprograms. Fortran subprograms might have multiple entry points; thus, start_region has to be inserted after the ENTRY statement. If the control flow reaches the statement from a previous statement, start_region should not be executed. Therefore, we insert a call to ignore_next_entry_point before the ENTRY statement, as shown in the following code.

...
subroutine foo()
  ...
  call ignore_next_entry_point()
entry foo1()
  call start_region(42, 17, 33)
  ...
end subroutine foo

Finally, our instrumenter also allows instrumentation of individual call sites. This is important if performance information has to be related back to individual call sites. Of course, in the case of very small subprograms, the overhead of the instrumentation might be significant. Call sites of subprograms with alternate return points are also handled correctly, but their instrumentation is not described in this paper.
4.3 OpenMP Regions
The instrumenter can also be used for OpenMP programs. The instrumentation is based on the work of Bernd Mohr in OPARI, a source-level instrumenter for OpenMP that inserts POMP library calls into the code. The main reason for implementing the instrumentation again is that OPARI is based on a fuzzy parser: it cannot handle parallel loops that are not terminated by an explicit !$OMP END DO directive. Although this might not look severe, real programs frequently omit this directive since it is not required. The main idea is outlined in the example below. The parallel loop is transformed so that the implicit barrier synchronization can be monitored.

Original code:

!$omp parallel do
do i=1,n
  ...
end do
!$omp end parallel do

Transformed code:

call start_region(PARALLEL, 42, 17)
!$omp parallel
call start_region(PARALLELBODY, 42, 17)
call start_region(DO, 42, 17)
!$omp do
do i=1,n
  ...
end do
!$omp end do nowait
call start_region(IMPLBARRIER, 42, 17)
!$omp barrier
call end_region(IMPLBARRIER, 42, 17)
call end_region(DO, 42, 17)
call end_region(PARALLELBODY, 42, 17)
!$omp end parallel
call end_region(PARALLEL, 42, 17)

This transformation adds quite a few calls to the monitoring library. But, assuming that it is worthwhile to execute the loop in parallel, the overhead is not significant. This detailed instrumentation allows measuring the region startup and shutdown overhead in the master thread. It also allows measuring the implicit barrier because it is made explicit here; note that we insert the nowait clause at the !$OMP END DO. Instrumentation of the body of the parallel construct is required to allow the monitor to measure the execution time not only in the master thread but also in the other threads of the executing team.

All other OpenMP regions are handled in a similar fashion, including critical regions. In contrast to OPARI, we also instrument ordered blocks in loops, thus enabling inspection of the pipelined processing of such loops. Measurement of functions of the OpenMP runtime library, such as omp_set_lock, is done by providing appropriate wrappers; a sketch of such a wrapper is shown below. The instrumentation can be guided by two additional command-line switches: the first switches instrumentation of parallel constructs on and off, the second controls instrumentation of synchronization constructs. Implicit synchronization of parallel regions is always instrumented with the region.
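As an illustration, the following minimal C sketch shows what such a wrapper could look like. The paper does not show the wrapper code, so the region type OMPLOCK, the extra identification parameters, and the interposition mechanism are all assumptions.

    #include <omp.h>

    /* Assumed monitoring-library entry points (see Section 4.4). */
    extern void start_region(int rtype, int file_id, int rfl);
    extern void end_region(int rtype, int file_id, int rfl, int reg_ab_fl);

    enum { OMPLOCK = 99 };  /* hypothetical region type for lock operations */

    /* Measures the time a thread spends waiting for a lock. How calls are
       redirected to the wrapper (renaming, link-time interposition) is not
       specified in the paper. */
    void mon_omp_set_lock(omp_lock_t *lock, int file_id, int line)
    {
        start_region(OMPLOCK, file_id, line);
        omp_set_lock(lock);                    /* the real OpenMP runtime call */
        end_region(OMPLOCK, file_id, line, 0); /* FALSE: no nested regions aborted */
    }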
4.4 Implementation
The implementation is based on the commercial compiler frontend from NAG in the UK; therefore, production Fortran 95 codes can be instrumented. The compiler frontend first transforms the source code into an abstract syntax tree (AST). The OpenMP directives are kept as comments by the standard parser. To avoid modifications of NAG's software, we parse OpenMP directives in a separate pass after the internal representation is available and insert the directives into the syntax tree as appropriate nodes. This technique enables easy development of other OpenMP tools. Although we parse the directives, we do not fully parse all clauses. Some clauses allow arbitrary Fortran expressions, for example the chunk-size specification in parallel loops; these expressions are represented in the syntax tree as string constants. The next phase transforms the syntax tree according to the instrumentation transformations outlined in the previous section. All transformations are carried out in a single pass over the AST.
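To make this concrete, the following C sketch shows one conceivable shape for such a directive node; the actual NAG-based data structures are not public, so every name here is illustrative.

    /* Illustrative only: a directive node that keeps not-fully-parsed
       clauses as string constants, as described above. */
    typedef enum { OMP_PARALLEL, OMP_DO, OMP_BARRIER /* ... */ } OmpKind;

    typedef struct AstNode AstNode;   /* opaque frontend AST node */

    typedef struct {
        OmpKind     kind;             /* which directive this node represents */
        const char *raw_clauses;      /* e.g. "schedule(dynamic,chunk)" kept verbatim */
        AstNode    *body;             /* statements governed by the directive */
    } OmpDirectiveNode;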
In the last step, the source code is reconstructed from the AST. The original reconstructor was extended to handle the new OpenMP parse tree nodes. Special care was necessary to reconstruct directives as multiple comment lines so as not to exceed the Fortran maximum line length.

The interface to the monitoring library includes the following routines, which were introduced in the previous sections:

start_region(RType, FileID, RFL) marks the start of a new region. RType is the region type, FileID the unique number assigned to a file, and RFL the first line number of the region in the file.

end_region(RType, FileID, RFL, RegAbFl) marks the end of a region. The additional parameter RegAbFl is the region abortion flag.

end_all_regions(FileID, RFL) is inserted before STOP statements.

enter_region(RType, FileID, RFL) marks the start of subprograms that are not instrumented.

leave_region(RType, FileID, RFL, RegAbFl) ends a subprogram that is not instrumented.

ignore_next_entry_point(RType, FileID, RFL) is inserted before ENTRY statements.
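Written as a C header, the interface might look as follows. The routine names and parameter lists are taken from the list above; the concrete parameter types are an assumption.

    /* Monitoring library interface; types are assumed, names follow Section 4.4. */
    typedef int RType;   /* region type, e.g. LOOP, PARALLEL, READ */

    void start_region(RType rtype, int file_id, int rfl);
    void end_region(RType rtype, int file_id, int rfl, int reg_ab_fl);
    void end_all_regions(int file_id, int rfl);
    void enter_region(RType rtype, int file_id, int rfl);
    void leave_region(RType rtype, int file_id, int rfl, int reg_ab_fl);
    void ignore_next_entry_point(RType rtype, int file_id, int rfl);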
5 Selective Monitoring
In order to perform selective monitoring, a configurable monitor is required in addition to the selective instrumenter. This monitor implements the MRI interface outlined in Section 3, controls the application's execution, and manages different hardware and software monitoring resources. Figure 2 shows the monitoring system's building blocks with a focus on the Monitor Control Component (MCC). The MCC provides two interfaces: the MRI and the monitoring library interface outlined in Section 4.4. Via the MRI, performance analysis tools specify runtime requests and retrieve runtime data. The MCC receives information about the region currently executed by the application via the monitoring library interface.

The MCC converts MRI requests into many low-level requests which are addressed to different software or hardware sensors (referred to in Figure 2 as monitoring resources). Those sensors can be hardware counters, e.g. counting cache misses, or software sensors, e.g. extracting the message length from an MPI message. The information provided by the sensors is subsequently processed by the MCC in order to produce the required profile or trace data.

An important property of our monitor is that, while it accepts relatively complex requests, it keeps the overhead at an acceptable level (see Section 6). The design and implementation of the MCC was done with this requirement in mind. The following sections give more details on the MCC's structure and its implementation.
Figure 2: (a) The components of a monitoring system and (b) implementation details of our Monitor Control Component.
5.1 Monitor Control Component
The MCC is the central and most important component in our monitoring system (Figure 2). It is responsible for the following tasks:

• Initialization of the whole system. This includes building the Configuration Table, initializing the monitoring resources, starting the facility for communication with the RIP, and making the gathered information about available sensors available to the tool.
• Synchronization between the application, the tool, and the resources. The very first instruction executed by the application starts the control component; this way, no region of the program is lost. The monitor control keeps the application blocked until the rest of the system is up and the tool has made its first requests.
• Translation of MRI requests into low-level requests such as PAPI requests (a PAPI-based sketch is shown after this list). If data structures are involved in the monitoring process, their symbol information is translated into virtual addresses.
• Configuration of the sensors via low-level interfaces, as well as retrieving the results from the sensors.
• Aggregation. Aggregations specified via the MRI are also performed in the MCC; this happens directly after the result data are retrieved at the end of the monitored region.

The MCC consists of two main components, the Runtime Information Producer (RIP) and the Monitoring Resources Configurator (MRC), as shown in Figure 2. The RIP is the manager of MRI requests. It accepts requests and ensures their validity. This includes checking that no more requests involving hardware counters are made than counters are available, and verifying that the information about regions and active objects is accurate and valid. The RIP also returns the gathered runtime information. The MRC's main responsibilities are the configuration of the sensors and the retrieval of their measured data. The basic data are then aggregated and stored in the MCC's internal data structures.
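As a rough illustration of such a translation, the following C sketch shows how a request for L1 data-cache misses could be mapped onto the PAPI interface. The mapping from MRI metrics to PAPI events and the error handling are simplified assumptions; the PAPI calls themselves are the library's standard API.

    #include <papi.h>

    /* Translate an (assumed) MRI metric into a PAPI event set and start
       counting; the event set would be stopped and read at the monitored
       region's end with PAPI_stop(). */
    int start_l1_miss_counting(void)
    {
        int eventset = PAPI_NULL;

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            return -1;                              /* library/version mismatch */
        if (PAPI_create_eventset(&eventset) != PAPI_OK)
            return -1;
        if (PAPI_add_event(eventset, PAPI_L1_DCM) != PAPI_OK)
            return -1;                              /* L1 data-cache misses */
        if (PAPI_start(eventset) != PAPI_OK)
            return -1;
        return eventset;
    }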
5.2 MCC Implementation
The RIP communicates with the rest of the runtime monitor via a set of internal data structures. We implemented the RIP as a separate process (typically it is part of the performance analysis tool), while the rest of the monitor is implemented in the monitoring library. Therefore, these data structures reside in a shared address space; we use the shared memory model of System V Inter Process Communication (IPC) for communication and synchronization. The data structures shared between the RIP and the MRC are:

• The Configuration Table (see Figure 2). It contains an entry for every instrumented region of the application to be monitored, as well as some special flags used to synchronize the application run and to drive the monitoring process.
• Self-contained MRI Requests. They are created after each call of MRI_req_submit() and are attached to the specified region in the Configuration Table or, if other requests are already linked to the region, at the end of the linked list. As the name suggests, each of these entities contains the requested runtime information, the region and, where applicable, data structure information, the aggregation, and, for profiling requests, a reserved place for the measurement results.
• The Trace Buffer. Trace requests do not follow the same pattern as profile requests. There is no reserved place for their results; instead, the trace information is dumped into a buffer which is emptied by the tool when it is full or when the request reaches its end.

The monitoring library implements the MRC. Besides the required functionality, the most important aspects to be taken into account in the design and implementation of this library were that the instrumentation calls may be executed frequently and that there will always be many more instrumented regions without information requests than regions with MRI requests appended to them. With selective instrumentation some of this overhead can be avoided, but this is not always possible. Thus, we had to ensure that the library calls are as lightweight as possible, especially in cases where no configuration has to be done for the current region. Therefore, we implemented the Configuration Table as a hash table with an entry and a flag for each instrumented region present in the program. The flag is set only if there is an MRI request appended to the region. This way, the minimal cost of a library call is basically reduced to the access time for the hash table, as the sketch below illustrates.
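The following C sketch shows this fast path under assumed data structures; the real entry layout and lookup routine are not described in the paper.

    typedef struct MriRequest MriRequest;  /* self-contained MRI request (opaque) */

    typedef struct {
        int         file_id, rfl;   /* region key: file id, region's first line */
        int         has_request;    /* set by the RIP when a request is attached */
        MriRequest *requests;       /* linked list of attached requests */
    } ConfigEntry;

    /* Assumed hash-table lookup over the Configuration Table. */
    extern ConfigEntry *config_lookup(int file_id, int rfl);

    void start_region(int rtype, int file_id, int rfl)
    {
        ConfigEntry *e = config_lookup(file_id, rfl);
        if (e == NULL || !e->has_request)
            return;                 /* common case: cost is one hash access */
        /* otherwise: configure sensors for region type rtype and start the
           requested measurements */
    }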
Figure 3: The execution times of the original program, the instrumented program, and the program linked with the monitoring library are nearly identical.
6 Overheads
To measure the overhead of instrumentation and of the monitoring library, we used a program which solves a finite-difference discretization of the Helmholtz equation

(d²/dx²)u + (d²/dy²)u − αu = f

using the Jacobi iterative method (author: Joseph Robicheaux, Kuck and Associates, Inc. (KAI), 1998). The version of the monitoring library used here includes all the code required to access the correct entry in the Configuration Table, but no configuration of the hardware counters is performed. Our instrumenter detected 35 regions. Two of them are executed twice per iteration, while the rest are executed only once during the whole program run. We increased the number of iterations from 100 to 11000. The results are shown in Figure 3: the execution times are almost the same. In some cases, the instrumented version of the code is even faster than the original version.
7 Summary
This paper presented an environment for selective monitoring. It is based on a Fortran 95 instrumenter that applies program transformations to enable instrumentation of a wide range of sequential and OpenMP code regions. In addition, it includes a configurable monitor that implements the Monitoring Request Interface. The environment allows gathering only the relevant runtime data for online and offline performance analysis tools with very low overhead. It is currently under development in the German EP-Cache project. We hope to be able to make the instrumenter available to the performance tools research community.
References

[1] Automated Instrumentation and Monitoring System, www.nas.nasa.gov/Groups/Tools/Projects/AIMS
[2] IST Working Group on Automatic Performance Analysis: Real Tools, www.fz-juelich.de/apart
[3] Askalon: A Programming Environment and Tool Set for Cluster and Grid Computing, www.par.univie.ac.at/project/askalon
[4] B. Buck, J.K. Hollingsworth: An API for Runtime Code Patching, Journal of Supercomputing Applications, Vol. 14, No. 4, pp. 317-329, 2000
[5] P. Dixit: Performance Analysis for OpenMP Programs, Master Thesis, Technische Universität München, 2003
[6] Dynamic Probe Class Library, oss.software.ibm.com/developerworks/opensource/dpcl/
[7] DYNINST API, www.dyninst.org
[8] S. Knoben: Selektives Monitoring von Fortran 90-Anwendungen für KOJAK, Diplomarbeit, RWTH Aachen, Technical Report Forschungszentrum Jülich Jül-3749, 2000
[9] Kit for Objective Judgement and Knowledge-based Detection of Performance Bottlenecks, www.fz-juelich.de/zam/kojak
[10] B. Mohr, A. Malony, S. Shende, F. Wolf: Design and Prototype of a Performance Tool Interface for OpenMP, Journal of Supercomputing, Vol. 23, pp. 105-128, 2002
[11] A. Malony, B. Mohr, S. Shende, F. Wolf: Towards a Performance Tool Interface for OpenMP: An Approach Based on Directive Rewriting, EWOMP 01, Third European Workshop on OpenMP, 2001
[12] OpenMP Pragma and Region Instrumentor, www.fz-juelich.de/zam/kojak/opari
[13] L. DeRose, T. Hoover, J.K. Hollingsworth: The Dynamic Probe Class Library - An Infrastructure for Developing Instrumentation for Performance Tools, IBM, 2001
[14] Tuning and Analysis Utilities, www.cs.uoregon.edu/research/paracomp/tau/tautools