Monitoring Cache Behavior on Parallel SMP Architectures and Related Programming Tools

Thomas Brandes, Helmut Schwamborn
Institut für Algorithmen und Wissenschaftliches Rechnen (SCAI), Fraunhofer Gesellschaft (FhG), Schloss Birlinghoven, D-53754 St. Augustin
e-mail: [email protected]

Michael Gerndt, Jürgen Jeitner, Edmond Kereku, Wolfgang Karl, Martin Schulz, Jie Tao
Lehrstuhl für Rechnertechnik und Rechnerorganisation (LRR), Technische Universität München, Boltzmannstr. 3, D-85748 Garching
e-mail: [email protected]

Holger Brunst, Wolfgang Nagel, Reinhard Neumann, Ralph Müller-Pfefferkorn, Bernd Trenkler
Zentrum für Hochleistungsrechnen (ZHR), Technische Universität Dresden, Zellescher Weg 12, D-01062 Dresden
e-mail: [email protected]

Hans-Christian Hoppe
Pallas GmbH, Hermühlheimer Str. 10, D-50321 Brühl
e-mail: [email protected]
This paper describes the ideas and developments of the project EP-CACHE. Within this project, new methods and tools are developed to improve the analysis and the optimization of programs for cache architectures, especially for SMP clusters. The tool set comprises the semi-automatic instrumentation of user programs, the monitoring of the cache behavior, the visualization of the measured data, and optimization techniques for better cache usage. As current hardware performance counters do not give sufficient user-relevant information, new hardware monitors are designed that provide more detailed information about the cache utilization related to the data structures and code blocks in the user program. The cost of the hardware and software realization will be assessed to minimize the risk of an actual implementation of the investigated monitors. The usefulness of the hardware monitors is evaluated with a cache simulator.
1. Introduction

The gap between CPU performance (increasing by roughly a factor of 1.5 per year) and memory performance (a factor of about 1.07 per year) has widened dramatically and might continue to widen over the next years. Caches placed between memory and processor make it possible to access data much faster by duplicating parts of main memory in smaller and faster memory. These advantages only materialize, however, if the user program exploits temporal and spatial data locality. Re-using data in the cache provides temporal locality and can speed up a program by a factor of 6 to 10 on many computers. Programs written without regard to the cache hierarchy achieve only a small fraction of the theoretical peak performance. Tuning a program for better cache utilization has therefore become an expensive part of the software development cycle. Furthermore, the modular and extendible design of software conflicts directly with the locality needed to exploit the cache.

Identifying and understanding bottlenecks caused by cache problems is one of the most critical issues. Aggregate information about cache misses is of limited use, as it only tells the user that something goes wrong, but not where and why. The project EP-CACHE [5] (funded by the German Federal Ministry of Education and Research, BMBF) is intended to overcome this problem. By exploiting hardware monitors and related monitor control techniques, the user can gain more useful information about the cache behavior of a program. Related tools for monitor control, performance visualization and optimization provide new and advanced possibilities for identifying memory and cache problems and for eliminating them. The tools address the analysis and the optimization of programs for cache architectures, especially for SMP clusters.

The rest of the paper is organized as follows. Section 2 describes our approach of hardware-supported cache monitoring. The performance analysis outlined in Section 3 uses the VAMPIR tool for the visualization of the monitored data. In Section 4 we outline our tools for optimizing the source program for better cache usage. We present first evaluation results in Section 5 and give a summary in Section 6.

2. Monitoring Cache Behavior

Any technique implementing an efficient cache optimization strategy depends on an accurate observation of the memory behavior of the target system. Only this provides the necessary basis for determining existing bottlenecks.

2.1. A New Approach for Hardware Cache Monitoring

Cache monitoring is available in most microprocessor architectures, but the current approaches are very restricted in their capabilities. They only allow the monitoring of a few (typically 4–8) global event types and do not provide any means to associate the observed behavior with the addresses causing the events. This severely restricts their applicability and hence also the optimization techniques that can be derived from their observation. Within this project a new cache monitoring architecture has been designed which overcomes these limitations and delivers a detailed overview of the complete memory access behavior of an application in relation to its virtual address space. The core of this approach is an associative counter array capable of recording all events snooped from an arbitrary bus together with their address information.
This work is based on previous research in the area of hardware DSM monitoring [9], which was carried all the way to a corresponding hardware prototype, demonstrating the feasibility of the approach.

The counters can be used to record specific events for a specified data and/or instruction address range. This enables, for example, detailed measurements of the number of cache misses for a specific data structure in a specific code region. Both the event type and the address ranges can be configured. Configuration of the counters can also be controlled by the hardware monitor itself: the address space to be monitored is partitioned into blocks according to a given granularity, and if an access to an address currently not covered by a counter leads to a miss, a free counter is identified or a counter is freed by purging its content to a monitor buffer. This mode can be used to generate access histograms with a given granularity for the specified address range.

A single monitor attached to a particular level of the cache hierarchy, however, provides only a limited view of the observed system. It is therefore necessary to deploy multiple monitors of the same type, one on each level of the memory hierarchy. The individual histograms delivered by these independent monitors can easily be combined in software, leading to the intended global overview of the complete memory access behavior. Figure 1 shows a system with three cache levels; the first-level cache is located inside the processor. The logical structure of the monitor consists of three components: (1) the monitor control, (2) the cache-level components, and (3) the data buffers.
Figure 1. Sample hardware-monitor design for a system with three cache levels.
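To make the counting mechanism more tangible, the following Fortran sketch mimics in software what the associative counter array does in hardware: each event address is mapped to a block of the configured granularity, a counter already covering that block is incremented, and otherwise a free counter is taken or an occupied one is purged to the monitor buffer. All names, sizes, and the purge policy are illustrative assumptions, not the actual hardware design.

module counter_array
  implicit none
  integer, parameter :: ncounters = 8       ! number of associative counters (assumed)
  integer, parameter :: granularity = 64    ! block size in bytes (assumed)
  integer :: blk(ncounters) = -1            ! block index covered by each counter
  integer :: cnt(ncounters) = 0             ! event count per counter
contains
  subroutine record_event(address)
    integer, intent(in) :: address
    integer :: block, i, victim
    block = address / granularity           ! map the address to a block of the given granularity
    do i = 1, ncounters                     ! hit: a counter already covers this block
      if (blk(i) == block) then
        cnt(i) = cnt(i) + 1
        return
      end if
    end do
    victim = free_or_purge()                ! miss: take a free counter or purge one
    blk(victim) = block
    cnt(victim) = 1
  end subroutine record_event

  integer function free_or_purge()
    integer :: i
    do i = 1, ncounters
      if (blk(i) == -1) then
        free_or_purge = i
        return
      end if
    end do
    ! no free counter: flush counter 1 to the monitor buffer (placeholder policy)
    call flush_to_buffer(blk(1), cnt(1))
    blk(1) = -1
    cnt(1) = 0
    free_or_purge = 1
  end function free_or_purge

  subroutine flush_to_buffer(block, count)
    integer, intent(in) :: block, count
    print *, 'histogram entry: block', block, 'accesses', count
  end subroutine flush_to_buffer
end module counter_array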
The monitor control can be seen as a single logical unit, although the distribution of the implementation depends on the architecture of the target system (Figure 1) and may result in more than one physical component. This unit is responsible for the communication with the software layer, which provides the interface for applications and monitoring tools. The monitor control is also in charge of user-defined settings (e.g. the granularity) and manages the access to the cache-level components.

Each cache-level component monitors a specific cache. Data collection and preprocessing is done in each component based on the associative counter array mentioned above. To simplify the implementation, the design of the monitor components is identical for all cache levels; the components differ only in the number of counters, depending on the properties of the corresponding cache. The gathered data is directed to a buffer in main memory; to reduce the influence on the performance of the monitored application, two data buffers are provided. Additionally, status information and timestamps are merged into the collected data.

2.2. Monitor Control

The hardware monitor introduced in the previous section can be configured to measure a broad range of event types in relation to specific address ranges. It can be configured which measurement is performed and whether a histogram over the given address range should be generated. This configuration is done via a hierarchy of interfaces that provides means suited for performance tools on different levels.
[Figure 2 depicts the interface stack: the AMC and VAMPIR on top access the Monitoring Request Interface (MRI), which is based on variables and program regions; the MRI builds on ePAPI, which is based on virtual addresses; ePAPI builds on the Hardware Interface, which is based on physical addresses.]
Figure 2. Accessing the hardware monitor via a stack of interfaces.
The most abstract interface in the hierarchy, outlined in Figure 2, is the Monitoring Request Interface (MRI). This interface allows tools to specify Runtime Information Requests (RIR) determining the runtime information type, the source code region, the name of the data structure, and the active objects. The runtime information type can be, e.g., the individual enter and exit events of regions, the number of cache misses, or a histogram of the cache misses for a program's data structure. Typical program regions are program units, loops, parallel regions and function call sites. Active objects are the threads executing the program; requests can thus be restricted to individual threads as well as issued for all threads. The runtime information determined by the monitor is delivered from the MRI to the consumer via a push and/or pull interface.

The MRI will be used by two performance analysis tools: the performance visualization tool VAMPIR and an automatic analysis tool called Automatic Monitor Control (AMC). While it is the task of the programmer analyzing the program with VAMPIR to write appropriate requests into a configuration file that guides the monitoring during program execution, the AMC will automatically request useful runtime information during program execution. The automatic search in the AMC is guided by a specification of typical performance properties related to the memory access behavior. For example, severe coherence traffic or conflict misses might be properties of a specific code region and data structure. These properties will be defined in the APART Specification Language (ASL) [6]. ASL has been applied in the specification of performance properties for the Hitachi SR8000 supercomputer at the Leibniz Computing Centre in Munich [7].

property MemoryAccessOvhdInSeqReg (SeqPerf pD) {
  // potential property of any sequential region
  condition:  pD.d1CacheC > 0;
  confidence: 1;
  severity:   (pD.d1CacheC / pD.instrC * pD.execT) / RB(pD);
}
Figure 3. Example of ASL: representation of cache misses.
For example, the property in Figure 3 specifies that first-level data cache misses occur in the sequential region. The parameter of the property identifies the sequential program region, e.g. a sequential loop, as well as the performance data measured for that region. The severity is computed as the portion of the execution time spent on instructions suffering from cache misses, related via the ranking basis RB to the execution time of the entire program run. In the context of this project, the AMC evaluates such predefined properties for the program's code regions and the regions' data structures. It will determine appropriate RIRs during program execution based on the property specification, on static program information about regions and data structures, and on the limited resources of the hardware monitor. The last aspect requires careful experiment planning.
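Written out in full, the severity expression of Figure 3 reads as follows, where d1CacheC denotes the first-level data cache miss count, instrC the instruction count, execT the execution time measured for the region, and RB the ranking basis of the region:

\[
  \mathit{severity}(pD) \;=\; \frac{\dfrac{\mathit{d1CacheC}}{\mathit{instrC}} \cdot \mathit{execT}}{\mathit{RB}(pD)}
\]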
The information on the code structure required by the AMC is determined by a Fortran 95 instrumenter that is being developed based on the NAG Fortran frontend (see below). It will output at least the list of instrumented program regions as well as the data structures accessed in those regions. Structural information for non-instrumented regions is unnecessary, since no information can be requested at runtime for those program areas.

While RIRs are based on symbolic information, such as variable names and region identifiers, the lower-level interfaces of the hardware monitor translate this information step by step into lower-level representations. The next level is the ePAPI interface. It is defined based on the PAPI interface for accessing standard hardware performance counters [10], [4]. This interface provides means to request counts of memory hierarchy events for specific virtual address ranges. The mapping between symbolic names and virtual addresses is provided by the compiler runtime system. ePAPI also offers means to access the measured information and to return it to the MRI layer. The interface closest to the hardware monitor is the Hardware Interface. It is based on control registers in the hardware monitor and controls measurements related to physical addresses. The mapping of virtual to physical addresses is obtained from the page table of the operating system.

Figure 4 summarizes the execution environment: the runtime information is based on the hardware monitor as well as on program instrumentation. The instrumented program is executed on a dual Xeon workstation; the hardware monitor is simulated. The AMC automatically configures the monitor to determine performance bottlenecks, while a VAMPIR trace file is generated according to a request file read at program startup.
Figure 4. Execution environment of EP-CACHE.

The runtime information accessible via the MRI is determined not only by hardware sensors but also by software sensors. The hardware sensor information is preprocessed by the hardware monitor designed in this project. The software sensors are inserted into the application's source code via instrumentation (Figure 4). We developed a Fortran 95 instrumenter based on the commercial NAG frontend from the NAG tools. It can selectively instrument program units, sequential loops, vector statements, IO statements, call sites, as well as OpenMP regions. If necessary, it transforms the program to insert the instrumentation correctly. For example, a sequential loop can be terminated by a jump to a statement outside the loop. In that case, the call to the sensor marking the end of the region cannot simply be inserted before the jump's target, since it might then also be executed if the target is reached from another program region. The instrumenter therefore inserts a new jump target with the sensor and generates a subsequent jump to the original target.

Sensors are implemented via calls to the EP-CACHE monitoring library. Each call marks the entry or exit of a program region. These events can be delivered to the runtime information consumer either individually or as collective statistical information, e.g. the total execution time spent in a program region. When a program region is entered, the monitoring library determines whether a runtime information request specified for that region requires a reconfiguration of the hardware monitor. The exit of a program region might lead to the collection of information determined by the monitor and its delivery to the information consumer.

Our approach requires a kind of information usually not collected by monitoring systems. The hardware monitor is able to generate access histograms for entire address ranges. Therefore, mapping information for program symbols must be delivered together with the histogram so that the information consumer, either VAMPIR or the AMC, can relate the measured information back to the program variables. This information is provided by the monitoring library based on information only accessible in the Fortran OpenMP runtime environment. The Fortran OpenMP compiler determines for each array a descriptor which specifies the start and end address as well as the dope vector information. This information is transferred via a monitoring function to the monitoring library, which maintains a mapping table. When a histogram is measured by the hardware monitor, the current mapping is delivered to the consumer. Any changes of the mapping until the measurement stops are delivered as well. This gives full information to the consumer, but it might also lead to a mapping where multiple variables are mapped to the same virtual address range.

2.3. Developing a Simulation System as an Evaluation Platform

In order to evaluate this hardware monitoring approach and to speed up the design phase, a simulation environment is being implemented at Technische Universität München. This simulation system also enables the development of tools during the time when the hardware facility is not yet available. It exactly models the hardware design and provides the same performance data. For this purpose, the target system architecture and the application execution have to be simulated. The prototype of this simulation environment is built on top of an existing simulation tool called SIMT [12]. SIMT is an event-driven multiprocessor simulator with a focus on the memory hierarchy. It hence models in detail caches and cache consistency protocols, a distributed shared memory with a spectrum of data distribution policies, the hardware monitor for observing memory traffic, and the data transfer across processor nodes.
SIMT has been developed as a general tool for evaluating the memory system and shared memory applications on NUMA (Non-Uniform Memory Access) machines. It has been used to analyze the cache access behavior, the memory access behavior, and the execution behavior of applications, and to study the impact of locality optimizations on the memory system.
Within SIMT, applications have to use specific macros for specifying parallelism and synchronization. Hence, extensions are necessary for EP-CACHE, since this project requires a platform for modeling SMP machines and simulating the parallel execution of OpenMP applications. A significant step towards this is to enable nested parallelism on the simulator. In addition, a new OpenMP runtime library has been implemented. This library is needed for transforming the semantics from real threads to simulated threads. For this, the original implementation of most OpenMP directives has been modified. An example is ORDERED, which forces threads to run in a specific order. In an actual execution, this order is maintained by a global identifier that indicates the next thread allowed to run. Threads whose id does not match the global identifier have to wait until the active thread leaves the ordered region and updates the global identifier. In simulation, however, this scheme cannot be used, because the threads are actually executed sequentially: when the execution control is owned by a thread waiting for permission to enter the ordered region, it cannot be transferred to the active thread to update the global identifier. To tackle this problem, we use explicit events and appropriate handling mechanisms that are capable of forcing a context switch between simulated threads. Similar work has been done for other OpenMP directives.

3. Performance Analysis

The visualization of monitored data helps the user to identify critical code blocks and data structures that do not utilize the cache properly. The behavior of a program is stored in a trace file. Tools that can directly handle program traces are typically capable of showing accumulated performance measures in all sorts of diagrams for arbitrary time intervals of a program run. Additionally, they offer a detailed insight into a program's runtime behavior by means of so-called timeline diagrams. These diagrams show the parallel program's states for an arbitrary period of time.

We have chosen VAMPIR as the display engine to visualize performance bottlenecks. VAMPIR is a well-established tool for displaying large amounts of trace data [3]. It takes information from a trace file and generates a large number of displays (state charts, timelines, several statistic displays, etc.). The effective zoom and scroll functions allow the user to easily extract information about the dynamic behavior of a program at any stage of the analysis process. There are further advantages to taking VAMPIR as the basis tool. Current processor architectures usually offer performance monitoring functionality, mostly in the form of special registers that contain various performance metrics like the number of floating-point operations, cache misses, etc. The use of these registers is limited because there is normally no relation to the program structure: an application programmer typically does not know which parts of the application actually cause, e.g., bad cache behavior.

Dealing with large amounts of performance data requires multiple abstraction layers. While the lower layers are expected to provide direct access to the event data by means of standard timeline and statistic views, the upper layers must provide aggregated event information. The aggregation needs to be done in a way that provides clues to performance bottlenecks caused in lower layers. Measures like cache performance, floating point performance, communication volume, etc. turned out to have good summarizing qualities with respect to activities on lower layers.
Therefore, VAMPIR already supports adequate displays.

The following example was taken from a performance optimization session which we recently carried out on one of our end users' programs. For some reason, the parallel program started off performing as expected but suffered a serious decrease in its MFLOPS rate after two seconds. Figure 5 shows the program's call stack combined with the MFLOPS rates of a representative CPU over a period of 6 seconds, with a close-up of the time interval in which the program behavior changes. We see two similar program iterations separated by a communication step. The first one is twice as fast as the second one. We can also see that the amount of work carried out in both iterations is identical, as their integral surfaces are the same (effect 1). A third aspect can be found in the finalizing part (function DIFF2D) of each iteration. Obviously, the major problem resides here, as the second iteration is almost 10 times slower than the first one (effect 2). We eventually discovered that the whole problem was caused by a simple buffer size limit inside the program which led to repeated data re-fetching.
Figure 5. Performance Monitor Combined with Call Stack.
The displayed counter could also have been a cache miss rate instead of the MFLOPS rate. To our knowledge, no other performance tool can currently present this information with such precision. Unfortunately, in many cases this information and display is still just the beginning of an intensive and prolonged search to find out which data structures have caused the high cache miss rates. Existing analysis concepts are not efficient here, and only with a lot of manual work in individual cases can the detailed reasons be identified. Our project addresses this problem. Beyond improved monitoring concepts, we develop extended visualization methods which make critical accesses to individual data structures visible and which allow, after high cache miss rates have been identified, the direct determination of the relevant data structures and, if necessary, even of the accesses within the source code. Due to the high volume of data, suitable reduction algorithms have to be developed so that this information can be handled in an interactive visualization environment. In particular, the measurements will only be carried out selectively for critical program regions.
4. Optimization Techniques

In many cases, the MFLOPS performance of programs running on current hardware falls below expectations. Almost always, the reason is that the program ignores data locality, which places a heavy burden on the rather complex cache infrastructure of current processors. To address this problem, the runtime behavior can be improved by program optimizations that increase the spatial and temporal locality. For example, a suitable loop permutation may fit the column-major storage order of Fortran much better and thus lead to a higher re-use of the data. Likewise, suitable blocking strategies adjust the size of the data blocks to the cache capacity and thus improve the cache hit rate. Compilers are mostly unable to change the data layout automatically, and there is no way to request blocking strategies explicitly. This is all the more annoying since no other tools are available that support this kind of transformation. Therefore, in most cases only the manual transformation of the program, with the corresponding risk of errors, remains during the optimization process.

Within the project we follow two approaches. One approach is a tool supporting source-to-source optimizations (similar to TransIt [2]); the other approach allows the specification of locality by a data mapping that results in better cache usage. In the final version, the results of the cache monitoring might be used to drive the optimization in a semi-automatic way.

4.1. Source-to-Source Optimizations

Source-to-source transformations provide the user with a tool to tune a program for cache architectures. Code restructuring by hand (e.g. blocking) is complicated and error-prone. Within the project, we develop a software tool which supports cache optimizations (in particular loop transformations and blocking) in the form of source-to-source transformations for Fortran 90 programs, so that the user can further optimize the transformed code by hand. The tool has a graphical user interface where the transformations are made available in a simple manner. Within a few steps it is possible to insert optimization directives into the source text and to trigger the required transformation by mouse click. Both the source program supplemented with directives and the transformed Fortran 90 program are shown next to each other on the screen, so that the code modifications can be inspected visually. The screenshot in Figure 6 gives a first impression of the tool; a sketch of the kind of blocking transformation the tool targets follows the figure. The central component of the implementation is a language parser which scans the source program for the pre-compiler directives controlling the source-to-source transformation and performs the necessary changes in the underlying infrastructure. The front end of the ADAPTOR compilation system [1] has been tested successfully for this task and is used to implement the transformations.
Figure 6. Screenshot of the user interface of the source-to-source transformation tool.
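As a concrete illustration of the blocking transformation discussed above, the following hedged sketch shows a plain Fortran 90 loop nest and a blocked variant of the kind such a tool could produce; array names, extents, the block size, and the omitted declarations are purely illustrative, and the tool's actual directive syntax is not shown.

! Original loop nest: transposition with poor locality for one of the arrays
do j = 1, n
   do i = 1, n
      b(i, j) = a(j, i)
   end do
end do

! Blocked variant: both arrays are touched in nb x nb tiles that fit into the cache
nb = 64                      ! illustrative block size, to be adapted to the cache capacity
do jj = 1, n, nb
   do ii = 1, n, nb
      do j = jj, min(jj + nb - 1, n)
         do i = ii, min(ii + nb - 1, n)
            b(i, j) = a(j, i)
         end do
      end do
   end do
end do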
4.2. Utilization of Data Locality

High Performance Fortran [8] as a high-level programming language offers compiler directives to map the data of the user program to abstract processors. This mapping has two levels: on the first level related data is aligned with each other, and on the second level the data is distributed onto abstract processors. In current HPF compilers such mapping information is used to generate a parallel program in which each abstract processor is identified with a physical processor that operates on all the data mapped to it, following the owner-computes rule. On parallel machines with distributed memory the HPF compiler generates an equivalent SPMD program where each processor works on its local data and where non-local data is exchanged via message passing.

The same kind of mapping can also be used to increase the spatial and temporal locality of data within a program and thus to use the cache more effectively. The mapping is then chosen in such a way that the data owned by one abstract processor fits in the cache. The execution of the local operations implicitly mapped to an abstract processor should then cause no further capacity misses. Abstract processors are no longer identified with physical processors, so one physical processor has to emulate many abstract processors. The compiler techniques used in HPF compilers result in a blocking and tiling of data and iteration spaces that leads to more effective cache usage. Compared with existing loop blocking techniques this approach has certain advantages. First, the blocking is no longer restricted to single loop nests but applies to the whole program and becomes more effective. Second, the data mapping can also be used to optimize the data layout (e.g. by merging arrays). The data-structure related monitoring provided within the EP-CACHE project will contribute to optimizing the mapping process.

The HPF compilation system ADAPTOR [1] provided by SCAI is being extended to implement this approach. In a first step an HPF execution model for one-processor architectures has been defined and implemented. All abstract processors are emulated on one physical processor. The execution of the program implies a blocking of the data where each block fits into the cache. This approach allows for an automatic blocking on a higher abstraction level where the user can directly influence the block building process. In a second step the cache-related execution model will be coupled with the shared-memory and the distributed-memory execution models, respectively.
[Figure 7 illustrates the hierarchical mapping: a template or array A is distributed onto abstract processors PA (DISTRIBUTE A(...) ONTO PA), and the abstract processors PA are in turn distributed onto the physical processors PP (DISTRIBUTE PA(...) ONTO PP).]
Figure 7. The hierarchical HPF execution model.
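As an illustration of the two-level mapping in Figure 7, a hedged HPF-style sketch follows. The distribution of the abstract processors PA onto the physical processors PP mirrors the directives shown in the figure and goes beyond standard HPF; the array, the processor shapes, and all extents are illustrative only.

!HPF$ PROCESSORS PA(8, 8)                 ! abstract processors, sized so that the local
                                          ! portion of A owned by each one fits into the cache
!HPF$ PROCESSORS PP(2, 2)                 ! physical processors of the SMP node
      REAL A(1024, 1024)
!HPF$ DISTRIBUTE A(BLOCK, BLOCK) ONTO PA
!HPF$ DISTRIBUTE PA(BLOCK, BLOCK) ONTO PP ! hierarchical step beyond standard HPF (Figure 7):
                                          ! each physical processor emulates several abstract ones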
By a hierarchical mapping, abstract processors related to cache sizes will be mapped to all available physical processors, and these physical processors in turn to the nodes of the SMP cluster. The HPF compilation system already supports a hierarchical execution model for SMP clusters, which is now extended with a further level for cache optimizations. The main advantage of this approach to exploiting data locality is the portability and homogeneity of the programming model for cache optimization and parallelization.

5. First Evaluation Results

During the project we use the EP-CACHE OpenMP simulator (see Section 2.3) as an evaluation platform that enables the tool development during the time when the hardware facility is not yet available and that gives information about the effectiveness of the chosen approach. Using this simulator, we have investigated several small kernel codes [11]. The performance data provided by the monitoring component facilitated understanding the runtime access distribution, the cache miss behavior, and the temporal locality of the caches. The first example presents the results from WATER, a molecular dynamics code from the SPLASH-2 suite [13] (Figure 8). The second example in Figure 9 illustrates a cache access histogram acquired by simulating the Jacobi code on an 8-node system. This program solves the Helmholtz equation on a regular mesh, using an iterative Jacobi method with over-relaxation. As shown in Figure 9, the access histogram depicts the accesses to the whole working set and the entire memory system at the granularity of a cache line. The x-axis shows the accessed locations, while the corresponding numbers of accesses to L1, L2, and the local memory are presented on the y-axis. Since the obtained histograms clearly exhibit the different behavior of each location in the memory hierarchy, they are capable of directing the user to optimize the data accesses towards better cache locality.

Once the entire tool set is available, the mapping of data structures to memory locations will be supported by the monitoring environment. It will also be possible to focus the measurements on specific data structures and code regions in order to significantly reduce the amount of performance data. VAMPIR will give extremely flexible graphical means to analyze the given measurements, far beyond current displays.
[Figures 8 and 9 plot, for each cache block (x-axis), the number of accesses served by L1, L2, and the local memory (y-axis).]
Figure 8. Accumulated Memory Access Histograms of WATER with 64 molecules — complete address space (left) and zoom on range for 3 molecules (right).
Figure 9. Access histogram of the Jacobi code (first 100 cache lines).
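For orientation, the Jacobi iteration with over-relaxation measured in Figure 9 has roughly the following structure; this is a hedged sketch, with array and variable names, the stencil coefficients, the relaxation factor, and the OpenMP parallelization chosen for illustration rather than taken from the benchmark source.

! One relaxation sweep of the Jacobi solver for the Helmholtz equation:
! u is the current solution, unew the next iterate, f the right-hand side,
! ax, ay, b the stencil coefficients, omega the over-relaxation factor
!$OMP PARALLEL DO PRIVATE(i, j, resid)
do j = 2, m - 1
   do i = 2, n - 1
      resid = (ax * (u(i-1, j) + u(i+1, j)) &
             + ay * (u(i, j-1) + u(i, j+1)) &
             + b * u(i, j) - f(i, j)) / b
      unew(i, j) = u(i, j) - omega * resid
   end do
end do
!$OMP END PARALLEL DO
u(2:n-1, 2:m-1) = unew(2:n-1, 2:m-1)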
The AMC will alleviate the programmer's task in the analysis process by automatically searching for common performance problems.

For further evaluation and validation of the project results two representative parallel simulation applications are used: the local model (LM) of the DWD (Deutscher Wetterdienst, German National Meteorological Service) and the geophysical package GeoFEM of RIST, Tokyo. These two large codes have already been in production use for several years. The possible support during the development of cache-relevant application programs will be investigated by considering the prototype of a parallel FEM code for crash-worthiness simulation.

6. Summary and Outlook

Within the EP-CACHE project new methods and tools are developed to improve the analysis and the optimization of programs for cache architectures. The project started in March 2002 and runs for three years. Our approach exploits hardware monitors and related monitor control techniques to gain more useful information about the cache behavior related to the source code and its data structures. We use the interactive visualization environment VAMPIR for the visualization of the monitored data, as it already supports adequate displays. The monitoring of the cache and the visualization of the monitored data help to identify and understand the bottlenecks in the program. For the optimization of their programs, we offer users a source-to-source transformation tool supporting all relevant cache optimization transformations and a high-level approach based on HPF mapping directives to exploit data locality. One big challenge is the integration of techniques for semi-automatic optimization based on the monitored data. During the project we use a simulation system as an evaluation platform that enables the development of the tools during the time when the hardware facility is not yet available. For the evaluation and validation of the project results two representative parallel simulation applications are used: the local model (LM) of the DWD (Deutscher Wetterdienst, German National Meteorological Service) and the geophysical package GeoFEM of RIST, Tokyo.

REFERENCES

1. ADAPTOR. High Performance Fortran Compilation System. WWW documentation, Institute for Algorithms and Scientific Computing (SCAI, FhG) et al., 2000. http://www.scai.fhg.de/lab/adaptor.
2. J. Blum. TransIt: Ein interaktives Werkzeug zur Programmoptimierung mittels Code-Transformationen. Master's thesis, Forschungszentrum Jülich, 1996. Published as Report Jül-3302, http://www.fz-juelich.de/zam/docs/autoren96/blum.html.
3. H. Brunst, H.-Ch. Hoppe, W. E. Nagel, and M. Winkler. Performance Optimization for Large Scale Computing: The Scalable VAMPIR Approach. In ICCS 2001, San Francisco, CA, USA, May 28-30, Part II, volume 2074 of Lecture Notes in Computer Science, pages 751–760. Springer-Verlag, Berlin Heidelberg New York, May 2001.
4. Jack Dongarra, Kevin London, Shirley Moore, Philip Mucci, and Daniel Terpstra. Using PAPI for hardware performance monitoring on Linux systems. In Proceedings of Linux Clusters: The HPC Revolution, 2001. http://icl.cs.utk.edu/projects/papi/documents/pub-papers/2001/linux-rev2001.pdf.
5. EP-CACHE. Tools for Efficient Parallel Programming of Cache Architectures. WWW documentation, Institute for Algorithms and Scientific Computing (SCAI, FhG), 2002. http://www.scai.fhg.de/EP-CACHE.
6. T. Fahringer, M. Gerndt, G. Riley, and J. L. Träff. Knowledge Specification for Automatic Performance Analysis. APART technical report, 2001. http://www.fz-juelich.de/apart.
7. Michael Gerndt. Specification of Performance Properties of Hybrid Programs on Hitachi SR8000. Peridot technical report, 2002. http://www.in.tum.de/~gerndt/Vita/publications.htm.
8. High Performance Fortran Forum. High Performance Fortran Language Specification, Version 2.0. Department of Computer Science, Rice University, January 1997.
9. W. Karl, M. Leberecht, and M. Schulz. Optimizing Data Locality for SCI-based PC-Clusters with the SMiLE Monitoring Approach. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 169–176, 1999.
10. PAPI. Software Specification. http://icl.cs.utk.edu/projects/papi/.
11. M. Schulz, J. Tao, J. Jeitner, and W. Karl. A Proposal for a New Hardware Cache Monitoring Architecture. In Proceedings of the ACM/SIGPLAN Workshop on Memory System Performance, 2002 (to appear).
12. J. Tao, W. Karl, and M. Schulz. Using Simulation to Understand the Data Layout of Programs. In Proceedings of the IASTED International Conference on Applied Simulation and Modeling (ASM 2001), pages 249–354, 2001.
13. S. Woo, M. Ohara, E. Torrie, J. Singh, and A. Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. In International Symposium on Computer Architecture (ISCA), pages 24–36, 1995.