The EP-Cache Automatic Monitoring System ∗

Edmond Kereku, Institut für Informatik, Technische Universität München, email: [email protected]
Michael Gerndt, Institut für Informatik, Technische Universität München, email: [email protected]

ABSTRACT
In this paper we present an automatic monitoring system consisting of a monitoring infrastructure and an automatic performance analyzer. The monitoring infrastructure supports different monitoring resources (CPU counters, simulation) and monitors the utilization of cache hierarchies in serial and OpenMP programs. A special feature of our system is the restriction of monitoring to single data structures. Our ASL[2]-based automatic analyzer, called AMEBA, is able to search for predefined performance bottlenecks in code regions using a provided set of search and refinement strategies.

KEY WORDS
Automatic Performance Analysis, AMEBA, Selective Monitoring, MRI, ASL, Instrumentation, Simulation

∗ The work presented in this paper was mainly performed in the context of the EP-CACHE project, funded by the German Federal Ministry of Education and Research (BMBF).

1
Introduction

With hardware components getting constantly faster and computers growing bigger, the role of performance analysis is becoming more and more crucial for achieving maximum flop rates for applications. Analyzing the behavior of complex applications on HPC systems with thousands of processors is a daunting task when the proper knowledge and tools are missing. Effective performance analysis requires the developer's knowledge about the application, about the underlying hardware architecture, and about the performance analysis process.
Sometimes the methods and tools used to measure performance are not appropriate. Consider, for example, the giga- or even terabytes of raw information produced by a conventional trace generation process, and the visualization and manual search for bottlenecks in such an enormous information source. Sometimes the tools' capabilities are too restricted. For example, being able to identify a data structure with bad cache behavior, together with the code regions where the bad behavior happens, can help a lot to optimize the code. But how many performance tools are out there that support monitoring of data structures in addition to monitoring program regions?
The hardware vendors have already taken the first steps toward data structure monitoring. Intel already provides some support for data structure monitoring in its recent processors. In the P4 architecture it is possible to identify the data structure which caused a cache miss by using Precise Event-Based Sampling. The Itanium processor goes even a step further by allowing the counter surveillance to be restricted to a precise piece of memory. What is missing is the support from the performance analysis tools.
We believe that the only way to succeed in the performance analysis of our applications is to build more intelligent and capable monitoring systems and to automate the performance analysis process itself. This paper describes the work on automatic performance analysis done in the EP-CACHE project. We created a new infrastructure for monitoring and analyzing the cache behavior of serial and parallel programs. Our monitoring system supports online selective monitoring. We use source code instrumentation to identify the application's code regions and a measurement interface to request performance data at runtime for specific code regions. One of the primary requirements we had in mind when developing our monitoring system was its use for automatic performance analysis. Therefore we put major effort into building proper interfaces and mechanisms that would be useful for the development of automatic analysis tools. We also developed AMEBA, the prototype of an automatic performance analyzer. It is based on a specification of performance bottlenecks for bad cache utilization and applies a specialized search strategy to detect such bottlenecks.
The rest of the paper is organized as follows. Section 2 discusses similar work on automatic analysis, and Section 3 is a reflection on the performance analysis process. Our approach to automatic performance analysis is explained in Section 4, followed by the description of the EP-Cache monitoring system and its interfaces in Section 5.
The AMEBA automatic performance analyzer is described in Section 6. Future work is the subject of Section 7.
2
Related Work
The Paradyn tool by Miller et al.[6] is the closest related work. The Performance Consultant module of Paradyn uses a W3 (Why is there a problem, Where in the application is the problem, and When does the problem occur) search model to automate the identification of performance problems by analyzing data provided by means of run-time dynamic instrumentation. Performance problems are simply expressed in terms of a threshold and a counter-based metric. The dynamic insertion of measurement probes into the running code is also guided by the Performance Consultant.
EXPERT[7], developed at the Forschungszentrum Jülich, performs an offline hierarchical search for patterns of inefficient execution in trace files. EXPERT uses source code instrumentation for MPI, OpenMP, and hybrid applications and defines more complex bottlenecks than Paradyn by using Python for the specification of performance properties. The huge amount of raw performance data it produces and the long execution time of the post mortem analysis are the main limitations of this tool for use with large parallel programs.
JavaPSL[8] is a language for specifying performance properties based on the APART Specification Language (ASL) developed in the European Working Group on Automatic Performance Analysis Tools (APART). Performance properties are formalized as Java abstract classes, taking advantage of Java language mechanisms such as polymorphism, abstract classes, and reflection. JavaPSL is used in a tool called Aksum[10], which relies on source code instrumentation to generate raw performance data, stores it in a relational database, and then automatically searches for the existence of the predefined performance properties.
3
The performance analysis process
Our goal is to automate the performance analysis process. In order to succeed, we must study the process and find out what to automate and how to do it. After implementing an application, one of the final objectives for a developer is to get the best possible performance out of the code on a given system. This task has even greater significance in high performance computing, where it is crucial to maintain optimal utilization of the expensive computational resources and where a short execution time often remains one of the primary requirements for the developer. In order to attain this goal, the developer has to go through an iterative process of performance tuning. The task of performance analysis during the tuning process is to gather information about the runtime behavior of a specific application in a specific environment (hardware/software configuration), to identify the code regions with unsatisfactory performance, and to explore the causes of their inefficiency. The causes should lead to suggestions for possible optimization steps. Performance information may be analyzed online, as the application runs, or post mortem, by analyzing the data generated after the program's execution. A performance expert usually refines his search for performance problems down to the smallest possible code region or to a data structure.
Several aspects of the performance analysis process can be automated:
The search process itself can be automated. This requires that performance bottlenecks be defined in terms of the performance data already available and, further, that a process of searching and refinement for bottlenecks is defined.
The selection of data required by the search process can be automated. Many sources may be available to satisfy the queries raised at any given point in the process, so query planning and query optimization are required.
The reduction of overhead caused by the measures taken to collect the performance data can be automated. Unnecessary instrumentation could be automatically found and removed, or not considered during subsequent analysis steps.
In our approach, an automated search process is implemented in the performance analyzer AMEBA (Sections 4 and 6). Transparent selection of performance data is supported by our monitoring system (Section 5.1), which also provides features to reduce the monitoring overhead.
4
The AMEBA performance analyzer
Our performance analyzer AMEBA (Automatic Monitoring Environment for Bottleneck Analysis), is based on an automated iterative search process which is executed while the application is running. It searches for performance properties that are specified in the ASL. The search process is iterative in the sense that AMEBA starts with a set of potential performance properties, performs an experiment, evaluates the hypotheses based on the measured data, and then refines the hypotheses towards more specific performance properties. Since AMEBA performs an online analysis, an experiment consists of the specification of measurements and of measuring the data during the execution of the next program phase. This approach requires that the program spends most of its time in repetitive application phases, i.e., these phases are executed multiple times.
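This hypothesize-measure-refine cycle can be sketched in a few lines of C++. The types and names below (Candidate, experiment, refine, the binary-tree region numbering) are illustrative stand-ins, not AMEBA's actual API; the "experiment" simply copies pre-recorded miss rates instead of driving the monitor.

```cpp
#include <vector>

// Illustrative stand-in for one hypothesis: an ASL-like property bound to a
// code region, to be proven or disproven by measurement.
struct Candidate {
    int    region;    // region id the hypothesis refers to
    double missRate;  // filled in by the experiment
    bool proven() const { return missRate > 0.1; }
};

// A faked "experiment": in the real system this would configure the monitor
// via MRI, release the application for one phase instance, and read results.
void experiment(std::vector<Candidate>& cs,
                const std::vector<double>& measured) {
    for (auto& c : cs) c.missRate = measured[c.region];
}

// One refinement step: re-hypothesize the property for the two (imaginary)
// children of every region where it was proven.
std::vector<Candidate> refine(const std::vector<Candidate>& cs) {
    std::vector<Candidate> next;
    for (const auto& c : cs) {
        if (c.proven()) {
            next.push_back({2 * c.region + 1, 0.0});
            next.push_back({2 * c.region + 2, 0.0});
        }
    }
    return next;
}
```

Each iteration of the loop (experiment, then refine) corresponds to one instance of the repetitive application phase described above.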
4.1
Definition of performance bottlenecks with ASL
We tried to overcome the limitations in the specification of performance properties found in other automatic tools by using a well-designed definition language. The APART Specification Language (ASL)[2] was jointly defined by the performance analysis experts of the APART group. It allows specifying performance properties, as well as their severity and existence, in an elegant and consistent manner. Here is a simplified cache-related ASL property.
property LC1MissOverMemRef(SeqPerf sp){
    condition:  sp.lc1_miss/sp.mem_ref > 0.1;
    confidence: 1;
    severity:   (sp.lc1_miss/sp.mem_ref)
                * sp.execution_time;
}

where sp is the summary of performance data for one instance of a region. The property LC1MissOverMemRef specifies that there is a cache problem in regions where the LC1 miss rate is greater than 10%. If the condition is true for a region, the property holds for that region. Furthermore, it specifies that the problem is more severe if it is found in regions where most of the execution time is spent. We are also 100% confident about the existence of the problem in regions where the condition is true, because our measurements are precise (counter values) and not statistical.
An ASL compiler is used to generate C++ classes, in the form of definitions and partial implementations, from ASL specifications. Finally, the tool developer includes them (or classes derived from them) in his automatic analyzer. Here is an outline of the C++ class generated for the above property.

class LC1MissOverMemRef : public Asl::Property {
public:
    LC1MissOverMemRef(Asl::SeqPerf *sp);
    bool condition() {
        return sp->lc1_miss/sp->mem_ref > 0.1;
    }
    double confidence() { return 1.0; }
    double severity() {
        return (sp->lc1_miss/sp->mem_ref)
               * sp->exec_time;
    }
};

ASL is far more powerful than shown in the example. ASL performance properties can be hierarchically ordered, and properties can be specified from scratch or from existing predefined ASL templates.
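Since severity weights the miss rate by execution time, proven properties can be ranked so the most costly problems surface first. The following self-contained sketch uses a simplified stand-in for the generated class (the value-holding SeqPerf and its member names are assumptions, not the generated code):

```cpp
#include <algorithm>
#include <vector>

// Simplified stand-in for the property's input data and the generated class.
struct SeqPerf { double lc1_miss, mem_ref, exec_time; };

struct LC1MissOverMemRef {
    SeqPerf sp;
    bool   condition() const { return sp.lc1_miss / sp.mem_ref > 0.1; }
    double severity()  const {
        return (sp.lc1_miss / sp.mem_ref) * sp.exec_time;
    }
};

// Keep only the properties that hold and sort them, most severe first.
std::vector<LC1MissOverMemRef>
rankProven(std::vector<LC1MissOverMemRef> props) {
    props.erase(std::remove_if(props.begin(), props.end(),
                               [](const LC1MissOverMemRef& p) {
                                   return !p.condition();
                               }),
                props.end());
    std::sort(props.begin(), props.end(),
              [](const LC1MissOverMemRef& a, const LC1MissOverMemRef& b) {
                  return a.severity() > b.severity();
              });
    return props;
}
```

The counters are stored as doubles here to avoid the integer-division pitfall a naive translation of sp.lc1_miss/sp.mem_ref would introduce.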
4.2
Specification of application phases
AMEBA is an online analysis tool. That is, monitoring requests are submitted and performance properties are evaluated while the application runs. In order to be able to incrementally search for performance properties in a running application, we must split the entire execution into phases, called application phases. More precisely, a phase is a specific program region. The overall execution of the program thus includes a sequence of instances of those phases. Different applications are constructed with different phases; one possible example is depicted in Figure 1.

Figure 1. An example application including an initialization phase, several computational phases (which can be repetitive), and a concluding phase for possible deallocation and results gathering.

Once the user has identified those phases, appropriate measurements can be performed during their execution. Most HPC applications contain a time loop which consumes most of the execution time. This loop's body is an excellent example of a repetitive application phase. AMEBA can basically use each instrumented code region as an application phase. While the phases are mainly intended to support online analysis, they can also be used to support iterative analysis via repetitive execution of the program. The phase in this case is the program's main routine.
The user may want to mark code regions as application phases that are not standard code regions, such as sequential loops or parallel regions in OpenMP. We implemented a simple mechanism in our instrumenter which allows specifying those phases anywhere in the application with very little effort. Indeed, it only requires the user to insert "USER REGION" directives in the code before the instrumentation.

do time = 1, MAX_TIME
!$MON USER REGION
   ... !doing something ...
!$MON END USER REGION
end do

The instrumenter recognizes the directives and transforms them into monitoring library calls. Here is the transformed code for the example above.

do time = 1, MAX_TIME
   call start_region(USER_REGION, fileID, LineNR)
   ... !doing something ...
   call end_region(USER_REGION, fileID, LineNR)
end do
4.3
AMEBA’s search strategy
The APART working group did excellent work on knowledge specification for automatic performance analysis. The missing piece of the puzzle for building our automatic analyzer is the specification of search and refinement strategies. A search strategy answers questions like "which performance property should be evaluated first?" and "what are the next steps upon the evaluation of a specific property?". AMEBA's strategies specify the set of properties to be evaluated in the next phase. At the beginning of the analysis process, the strategy specifies which properties are to be evaluated for the phase region. After each execution of the application phase, a new strategy step is performed. There it is decided, depending on the evaluated properties, whether to refine the search for the same property in smaller code regions or to start searching for other properties. The analysis terminates if one of the following conditions is true:
1. The maximum number of strategy steps is reached.
2. The maximum number of iterations for the phase region is reached.
3. The application terminates and the phase region is not the same as the main region.
The strategies are of course dependent on the underlying hardware, on the middleware used to tie the program pieces together (communication libraries, for instance), and on the application domain. It is unlikely that a common pattern exists which applies to all possible cases, but an effort is being made to classify and store those patterns in a strategy repository (see Figure 3) and to define a framework for finding dependencies between performance properties. A set of basic strategies which refine the search for bottlenecks in subregions and data structures is already provided. The user feeds the tool with the strategy to be used through the command line. We explain one of these strategies in Section 6.1.

5
The EP-CACHE monitoring infrastructure

The EP-CACHE ([1], [5]) monitoring infrastructure (Figure 2) was built to support existing and future hardware monitors offering new techniques for monitoring cache hierarchies in SMP nodes. The monitoring infrastructure was designed for a new hardware monitor developed in the EP-CACHE project. The counters of our hardware monitor can be configured to observe the whole memory or only a precise address range. We built a simulator called SMART to simulate SMP nodes with our hardware monitor integrated into each of the node's processors. The system supports Fortran 95 OpenMP programs, and support for C is planned.
Our monitoring environment requires source code instrumentation to insert monitoring library calls into the code. We use a Fortran 95 instrumenter[4] based on the NAGWare f95 compiler front-end to instrument the code regions. We also use a data structure instrumenter to map the data structures in the program. The amount of performance data produced by the monitoring system is kept small by processing and aggregating monitoring results in different layers of the system. At this point we have to emphasize that our environment is tested and supports the simulation and monitoring of real HPC applications.

Figure 2. The components of the EP-CACHE monitoring system.

Figure 2 shows the monitoring system's building blocks and its interfaces. The Monitoring Control Component (MCC) is the core of the monitoring system. It provides two interfaces: the monitoring library interface for communication with the running application, and the Measurement Request Interface (MRI)[3] for communication with the performance analysis tool. We also implemented a PAPI[11]-like interface called ePAPI to configure the monitoring resources and read the measurement results.
5.1
Highlights of the monitoring system
We mentioned in the previous sections that our system was conceived for use with automatic analysis tools. Here are the main techniques provided by the monitor that enable automatic performance analysis.
Online configuration. Automatic performance analysis tools can spend much more effort on precisely selecting the information to be measured than a user doing it manually. Therefore, our monitor provides the MRI, which allows specifying the required information based on the information type, the code region, the data structure, and the thread. These requests can be specified and modified during runtime to enable online analysis.
Application control. To enable an incremental online analysis, the performance analysis tool has to be able to control the program's execution, e.g., to start the application and to stop it at the end of a phase. This application control is also part of the MRI.
Access to performance data. The MRI also provides means to access, at runtime, the performance data measured according to the given requests. With the help of these routines, the online tool can request the data required to evaluate the properties.
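A request in this spirit bundles what to measure, where, and for which data structure and thread. The struct below is only a guess at the shape of such a request for illustration; the field names and the validation helper are assumptions, not the actual MRI types.

```cpp
#include <string>
#include <vector>

// Hypothetical shape of one MRI measurement request (field names assumed).
struct MriRequest {
    std::string metric;        // e.g. "LC1_MISS"
    int         regionId;      // instrumented code region
    std::string dataStructure; // empty = the whole region
    int         thread;        // -1 = all threads
};

// The tool would hand a batch of such requests to the monitor before
// releasing the application for the next phase; here we just sanity-check.
bool validate(const std::vector<MriRequest>& reqs) {
    for (const auto& r : reqs)
        if (r.metric.empty() || r.regionId < 0) return false;
    return true;
}
```

Restricting a request to a single data structure is what enables the selective, per-data-structure monitoring highlighted above.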
Selective instrumentation of code regions and data structures. The EP-CACHE Fortran 95 instrumenter is highly configurable and flexible. This, combined with the possibility offered by our simulator of simulating only desired parts of the code (the so-called FAST mode), can be used to automatically reduce the monitoring overhead. Precious simulation time can be spared this way by not simulating code regions which are not of interest (i.e., regions with scarce potential for optimization, or regions which previous analysis steps discovered to have no performance problems). In addition, re-instrumentation reduces the overhead of entering and leaving the monitoring library if no measurements will be requested.
Structural information about the application. Automatic performance analysis might not only use dynamic information but might also benefit from information about the program's static structure, e.g., the nesting of regions or usage information about data structures. In our environment such information is generated by the instrumenter in the form of a Standard Intermediate Representation (SIR)[12]. The SIR was defined by the APART group and is an XML file which contains the structure of the instrumented application, the instrumented code regions, and the data structures. A performance analysis tool working with our monitoring system has to first parse the SIR and then use the instrumented regions and data structures found there to generate monitoring (MRI) requests.

6
AMEBA's functionality

Figure 3 outlines how the tool functions. Here we give a short overview of the components in the figure; Section 6.1 explains how those components work together in an example. AMEBA is built in C++. The predefined ASL performance properties and the search strategies are compiled and stored in the corresponding repositories. The tool is started by providing on the command line the SIR file and the search strategy to be used. At this point the application, which is instrumented and linked with the monitoring library, is simulated. The Main Processing Unit is the heart of the tool. It parses the SIR, getting information about the application, and it initializes the strategy and the monitoring system. The Application data structure is built from the SIR file. Each instrumented code region or data structure is represented as an object. Each of these regions is also connected with its performance summary (for the sake of simplicity not shown in Figure 3), which holds summary information about the runtime behavior of the region. This is needed to evaluate the ASL performance properties. The Main Processing Unit starts an Experiment after a new set of properties has been determined by the strategy. During an experiment, first the monitor is configured via the MRI, then the application is released for the next phase, and after the execution of the phase the performance data are retrieved. The next step is to evaluate the properties and to pass back to the Main Processing Unit the set of proven properties.

Figure 3. Functionality of AMEBA.

6.1
Example
In this section we demonstrate the operation of AMEBA for a simple strategy. Let us say we want to look for code regions and data structures where the LC1 miss rate is greater than 10%, refining our search from the phase region down to the smallest subregion or data structure. The first step is to read the code regions and the data structures available in the application (parse the SIR file). We build a tree of region objects and create for each region a summary object for the monitored data. The phase region is selected as the starting point for the bottleneck search process. The Main Processing Unit initializes the monitoring interface and selects the search strategy according to the command line argument. After the strategy is selected from the strategies repository and initialized, the main unit receives from the strategy the initial candidate properties set. In our simple case that is a list of LC1MissOverMemRef properties, each one to be evaluated for a specific subregion of the phase region.
A new experiment is initiated with the candidate properties set. If the summary of monitored data of each region is empty, MRI requests are generated and the application is started. At the end of the application's phase, the performance summary objects are filled with the monitored data and the ASL properties are evaluated. The properties which are found to be true are put in a new list called the evaluated properties set. The experiment finishes by passing this list to the Main Processing Unit.
At this point the next strategy step starts. The Main Processing Unit provides the last evaluated list to the strategy and asks for a new set of properties. In turn, the strategy searches for subregions or data structures in the regions for which a high LC1-cache miss rate was proven. If there are any, a new candidate properties set is created. Again a new experiment starts, and the whole process goes on until one of the termination conditions described in Section 4.3 is fulfilled. The last step is to collect all the properties which were found true and to present them to the user as a list of regions or data structures together with their LC1 miss rates. This list can be sorted by the severity of the problem so that the user can concentrate his efforts on solving the most severe problems first.
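Putting the steps of this walkthrough together, the driver loop might look like the following self-contained sketch. The measurements are faked as a fixed table of miss rates, regions refine into imaginary binary-tree children, and only the step-count termination condition is modeled; all names are illustrative, not AMEBA's actual code.

```cpp
#include <algorithm>
#include <vector>

struct Hit { int region; double rate; };   // one proven property instance

// Drive the search: evaluate the candidate regions, keep those above the
// 10% threshold, refine into their children, and stop after maxSteps or
// when nothing is left to refine.
std::vector<Hit> search(const std::vector<double>& missRate, int maxSteps) {
    std::vector<Hit> proven;
    std::vector<int> candidates = {0};            // start at the phase region
    for (int step = 0; step < maxSteps && !candidates.empty(); ++step) {
        std::vector<int> next;
        for (int r : candidates) {
            if (r >= static_cast<int>(missRate.size())) continue;
            if (missRate[r] > 0.1) {              // property holds here
                proven.push_back({r, missRate[r]});
                next.push_back(2 * r + 1);        // refine into subregions
                next.push_back(2 * r + 2);
            }
        }
        candidates = next;
    }
    // Present the most severe problems first.
    std::sort(proven.begin(), proven.end(),
              [](const Hit& a, const Hit& b) { return a.rate > b.rate; });
    return proven;
}
```

In the real tool each loop iteration is one experiment over an instance of the application phase, and severity (not the raw miss rate) would drive the final sort.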
7
Future Work
AMEBA is in its first steps and the work on it has just begun. We intend to extend it in the course of a new project called Periscope as part of a multi-agent automatic monitoring system for teraflop computers[9]. Major effort will be spent on the creation of a set of ASL properties as bottleneck specifications for cache architectures. We also plan to drastically improve the existing search strategies and introduce new ones, leading in the ideal case to a set of strategy "bricks". Those "bricks" could be combined with each other by the user in order to get a working strategy for his own application domain.
The EP-CACHE monitoring system is also under further development. We expect to have full support for configuring and using CPU hardware counters on Intel Xeon and Itanium architectures using PAPI[11] soon. Especially Itanium, with its support for monitoring predefined data address ranges, is at the center of our attention. The MRI implementation is being improved to allow subsequent runs of the monitored application, which is very useful if a saturation of resources occurs while working with hardware counters.
References
[1] T. Brandes et al.: Monitoring Cache Behavior on Parallel SMP Architectures and Related Programming Tools, Future Generation Computer Systems, Vol. 20, 2005
[2] T. Fahringer, M. Gerndt, G. Riley, J.L. Träff: Knowledge Specification for Automatic Performance Analysis, APART Technical Report, www.fz-juelich.de/apart, 2001
[3] M. Gerndt, E. Kereku: Monitoring Request Interface Version 1.0, http://wwwbode.in.tum.de/~kereku/epcache/pub/MRI.pdf
[4] M. Gerndt, E. Kereku: Selective Instrumentation and Monitoring, In Proceedings of the 11th Workshop on Compilers for Parallel Computers (CPC 04), pp. 61-74, Kloster Seeon, 2004
[5] E. Kereku, T. Li, M. Gerndt, J. Weidendorfer: A Data Structure Oriented Monitoring Environment for Fortran OpenMP Programs, In Proceedings of Euro-Par 2004, pp. 133-140, Pisa, September 2004
[6] B. P. Miller, M. D. Callaghan, J. M. Cargille, J. K. Hollingsworth, R. B. Irvine, K. L. Karavanic, K. Kunchithapadam, and T. Newhall: The Paradyn Parallel Performance Measurement Tool, IEEE Computer, 28(11):37-46, 1995
[7] F. Wolf, B. Mohr: Automatic performance analysis of hybrid MPI/OpenMP applications, Journal of Systems Architecture: the EUROMICRO Journal, v.49 n.10-11, pp. 421-439, November 2003
[8] T. Fahringer and C. Seragiotto Junior: Modelling and Detecting Performance Problems for Distributed and Parallel Programs with JavaPSL, In Proc. of the Conference on Supercomputing (SC2001), Denver, Colorado, November 2001
[9] M. Gerndt, A. Schmidt, M. Schulz, R. Wismüller: Performance Analysis for Teraflop Computers - A Distributed Automatic Approach, Euromicro Workshop on Parallel, Distributed, and Network-based Processing, Gran Canaria, pp. 23-30, January 2002
[10] T. Fahringer, C. Seragiotto: Aksum: a performance analysis tool for parallel and distributed applications, Performance Analysis and Grid Computing, Kluwer Academic Publishers, Norwell, MA, 2004
[11] S. Browne, J. Dongarra, N. Garner, G. Ho, P. Mucci: A Portable Programming Interface for Performance Evaluation on Modern Processors, The International Journal of High Performance Computing Applications, 14(3), Fall 2000, pp. 189-204
[12] C. Seragiotto, H. Truong, T. Fahringer, B. Mohr, M. Gerndt, and T. Li: Standardized Intermediate Representation for Fortran, Java, C and C++ Programs, APART Working Group Technical Report, Institute for Software Science, University of Vienna, October 2004