Performance Tuning with AIMS — An Automated Instrumentation and Monitoring System for Multicomputers
Jerry C. Yan†
RECOM Technologies, MS 269-3, NASA Ames Research Center, Moffett Field, CA 94035-1000
Tel. (415) 604-4381, e-mail: [email protected]
† AIMS was developed jointly by: Jerry Yan, Charles Fineman, Philip Hontalas, Sherry Listgarten, Melisa Schmidt, Pankaj Mehra, Sekhar Sarukkai, and Cathy Schulbach. Tarek Elaydi of Lawrence Berkeley Labs and Rob Gordon of Convex Computers helped develop AIMS prototypes on the CM-5 and PVM, respectively. The author only put the paper together and serves as a point of contact.
Abstract
Whether a researcher is designing the “next parallel programming paradigm,” another “scalable multiprocessor,” or investigating resource allocation algorithms for multiprocessors, a facility that enables parallel program execution to be captured and displayed is invaluable. This paper describes the Automated Instrumentation and Monitoring System (AIMS), a software toolkit that facilitates performance evaluation of parallel applications on multiprocessors. It has four major software components: a source-code instrumentor, which automatically inserts event recorders into the application; a run-time performance-monitoring library, which collects performance data; a tracefile animation and analysis toolkit; and a trace post-processor, which compensates for data-collection overhead. We illustrate the process of performance tuning using AIMS with two examples. Currently, AIMS accepts FORTRAN and C parallel programs written for TMC’s CM-5, Intel’s iPSC/860, iPSC/Delta, and Paragon, and HP workstations running PVM.
1: Introduction
1.1: Motivation and background
While parallel processing promises to deliver orders of magnitude speed-up in the near future, a “TeraFLOPS” machine (as conceived today) will never execute at that level unless i.) all of its processors can be effectively utilized and ii.) inter-processor communication, should there be any, can be accomplished instantaneously. Realistically, the performance obtained will depend on several factors: how well the application is formulated to harness the inherent parallelism of the problem, the architecture of the multicomputer, and how well the application is mapped onto the multicomputer. Although simulation and theoretical considerations have produced many interesting results, their applicability needs to be carefully evaluated on actual implementations. Careful analysis of
execution traces can help (scientific) application developers and system software developers to uncover system behavior not modeled and to take advantage of specific application characteristics and hardware features. Performance evaluation in parallel processing environments presents technical challenges that do not exist in uniprocessing environments. Firstly, massive amounts of data are generated because a parallel program has many threads of control. Secondly, the completion time of the entire program depends on the order of synchronization/communication events in different control threads. Such “event-ordering” data are difficult to capture because data collection overheads perturb event timings and may, therefore, alter the relative ordering of events on different threads. Furthermore, with loosely coupled systems (especially in heterogeneous computing environments), the lack of a globally synchronous clock severely limits one’s ability to determine the relative ordering of events on different machines. Finally, there are (far too) many parallel programming paradigms. If it is difficult for developers of performance tools to keep pace with each combination introduced into the market, it is even harder to establish a performance evaluation methodology (not to mention metric) that makes “fair” comparisons across different machines. Therefore, there is a pressing need for a methodology that will enable program execution to be captured and analyzed automatically across many architectures.
1.2: Outline of paper
The focus of this paper is on AIMS (Automated Instrumentation and Monitoring System), a toolkit designed to facilitate performance evaluation of parallel applications on multicomputers via measurement and visualization of execution traces. Section 2 presents the motivations behind AIMS’ design choices, which were made to accommodate performance evaluation and comparison across a range of multicomputers and to keep the performance analysis process simple. Section 3 describes AIMS’ four main software components — a source-code instrumentor (xinstrument), a run-time performance monitoring library (the monitor), two tracefile animation and analysis tools (VK and tally), and a trace post-processor (tpp) — which work together to measure and display a
program’s performance. Section 4 contains two examples to illustrate the use of AIMS in performance tuning. Conclusions, directions for future research and availability are discussed in section 5.
2: Issues in designing performance tuning tools
2.1: Performance tuning and performance tools
After developing a parallel algorithm that yields correct results on small test-cases, the user may find that a full-scale application executes far too long on the problem he/she really wants to solve. In order to locate execution bottlenecks, the user needs to address several issues:
Q1: Defining Desired Performance Characteristics: “What should/shouldn’t happen during execution?”
Q2: Instrumentation and Parameterization: “What information should be collected to help discover what really happened?”
Q3: Monitoring: “How can we gather such information without perturbing the system?”
Q4: Analysis: “Based on the information gathered, how can we deduce what really happened?”
Q5: Tuning: “How can we improve the program’s performance?”
The approach to these issues depends on both the programming environment and the machine architecture. These issues also dictate the tools required for supporting the performance-evaluation process. For example, if a “desired performance characteristic” dictates that “all processors should be busy 100% of the time” (cf. Q1), we need to identify situations when this is not attained. We need data that indicate both the duration of processor idleness and the number of processors that are idle simultaneously (cf. Q2). In order to obtain this information, we need to devise a mechanism to detect and record processor idleness (cf. Q3). Instead of going through stacks of print-outs manually, an analysis tool would be needed to process the data and make reports such as “Whenever procedure X executes on node 15, all other nodes are idle waiting for its results” (cf. Q4). Based on such reports, the user may be able to reformulate the application, enabling node 15 to share its work and thus eliminate a performance bottleneck (cf. Q5). (A toy sketch of this kind of idle-time analysis appears at the end of this section.)
In summary, three observations can be made based on this example:
1. Although performance tools are built to “help” users tune their programs, these tools can only be used effectively when users follow a particular performance tuning methodology.
2. This performance tuning methodology, in turn, assumes certain desired performance characteristics.
3. The success of the performance tuning process, therefore, depends on the validity of these performance characteristics and how effectively the programmer (and his/her performance tools) can suggest time-saving program transformations using the trace data.
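To make Q2-Q4 concrete, the toy C sketch below reduces recorded per-node busy/idle data to the kind of report quoted above. Everything in it is illustrative: the sampled busy/idle table, the node count, and the numbers are invented and bear no relation to AIMS’ trace format; the point is only that, once idleness has been recorded (Q2/Q3), a small analysis pass can summarize it mechanically (Q4).

    /* Toy illustration of Q2-Q4.  The per-node busy/idle table below is
     * invented sample data (it is NOT AIMS' trace format); the analysis
     * pass reduces it to an idle-time report of the kind quoted above. */
    #include <stdio.h>

    #define NODES 4
    #define STEPS 8

    /* 1 = busy, 0 = idle, sampled at regular intervals (illustrative only) */
    static const int state[NODES][STEPS] = {
        {1, 1, 1, 1, 1, 1, 1, 1},   /* node 0 stays busy, like "procedure X" */
        {1, 0, 0, 0, 1, 0, 0, 1},
        {1, 0, 0, 0, 1, 0, 0, 1},
        {1, 0, 0, 0, 1, 0, 0, 1}
    };

    int main(void)
    {
        int idle[NODES] = {0};
        int all_but_one_idle = 0;

        for (int t = 0; t < STEPS; t++) {
            int idle_now = 0;
            for (int n = 0; n < NODES; n++)
                if (!state[n][t]) { idle[n]++; idle_now++; }
            /* the situation described in the text: one node working while
               every other node sits idle waiting for its results */
            if (idle_now == NODES - 1)
                all_but_one_idle++;
        }

        for (int n = 0; n < NODES; n++)
            printf("node %d idle for %d of %d intervals\n", n, idle[n], STEPS);
        printf("intervals with all but one node idle: %d of %d\n",
               all_but_one_idle, STEPS);
        return 0;
    }

On this sample table the program reports that nodes 1-3 were each idle for five of the eight intervals while node 0 never was, and that all but one node were idle in five intervals; this is exactly the shape of report an analysis tool would need to produce for Q4.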
2.2: “Why don’t users use tools tool-developers develop?” 1
AIMS was built in response to frustrations we experienced two years ago trying to get users to use performance tools. Putting it mildly, we were simply unable to convince application developers on our Intel iPSC/860 multicomputer to use performance tools. There were a number of reasons:
1. Users were unwilling to modify their code in order to use any performance tools. The most they were willing to do was to link their code with an extra library or compile it with an extra flag. The idea of changing every csend call (in NX) to send0 (in order to use PICL/ParaGraph [1, 2]) was unacceptable.
2. Users wanted a mechanism to map observed events back into their code. For example, scrolling displays can be a great help for monitoring message-passing patterns during execution. Unfortunately, trying to figure out which line drawn on the screen corresponds to which message can be very difficult (especially after one has looked briefly away from the screen).
3. Many of the tools available to us were not very robust. Their limitations were quickly felt when analyzing large, complex codes residing in multiple directories across the file system and using a wide spectrum of synchronization primitives. Users were not eager to experiment with prototype tools, especially if they had had a bad experience with another performance tool. As soon as they encountered a problem whose solution was not obvious, they assumed that the tool was “just another prototype” and gave up, even though relevant documentation was available.
4. The data-collection process was intrusive. Users’ confidence in the validity of observed program behavior was greatly reduced when instrumented programs took five times longer to execute.
5. The few application/algorithm developers who actually used the tools did so for algorithm visualization (i.e. to illustrate their papers) rather than for tuning their programs. This further suggested that the way tool-builders (computer science majors) built their tools was not the way application developers (scientists or numerical analysts) wrote and tuned their programs.
1 Quotation from Prof. Cherri Pancake, Oregon State University.
3: The Automated Instrumentation and Monitoring System
AIMS was built in response to the aforementioned outcries. It requires minimal user participation since it can automatically insert instrumentation into the program. AIMS provides facilities that greatly reduce the time required to pinpoint performance bottlenecks. It supports a
wide variety of parallel-programming paradigms and hardware platforms because instrumentation, monitoring, and analysis are supported by separate software modules with well-defined interfaces. AIMS’ overhead is well characterized, and an intrusion compensator is available for those who desire to use it.
3.1: Architecture of AIMS
AIMS’ four main software components are: (1) a source-code instrumentor (xinstrument); (2) a run-time performance monitoring library (the monitor); (3) a tracefile animation and analysis toolkit (VK and tally); and (4) a trace post-processor (tpp). These work together to instrument, measure and display a program’s performance. Xinstrument inserts instrumentation into the source code to record performance data from the program’s execution. It also creates and manages two data structures critical to the performance-evaluation process: an application database and an instrument-enabling profile, whose functions are explained in section 3.2. The monitor comprises routines called by the instrumented code in order to trace performance during program execution. Trace data is collected into a tracefile subsequently used by VK and tally. VK provides animated views for displaying program behavior. Tally tabulates cumulative statistics gleaned from the tracefile. Should the user desire to remove the intrusion introduced by the monitor, the tracefiles can be processed by tpp before being fed to VK/tally. Figure 1 shows AIMS’ architecture: application source code passes through xinstrument (which produces instrumented source code, the application database, and the instrument-enabling profile), is compiled and linked against the monitor, and executes on the multiprocessor to produce an execution trace; tpp turns this into a compensated execution trace, from which VK and tally produce performance data.
Figure 1. Components of the Automated Instrumentation and Monitoring System
3.2: Software instrumentation — AIMS’ language-dependent components
Performance evaluation requires some form of instrumentation — a mechanism whereby performance data can be generated so that program execution may be monitored. Many such mechanisms have been proposed; these include
event sampling [3], hardware event recorders [4], and software event recorders [5]. A detailed survey of these methodologies for multiprocessors may be found in [6]. Although event sampling produces the smallest amount of data, it is not adopted in AIMS because we wanted to capture and replay the entire execution process for the programmer to analyze his/her code. Event sampling can also be very intrusive without instrumentation hardware. In order to support uniform performance evaluation in a heterogeneous computing environment (or across different platforms), software instrumentation provides the most plausible approach. Although intrusion may be non-negligible for software instrumentation, we argue that we can characterize and compensate for its effects and infer the original program behavior from the tracefile.
There are two possible mechanisms for software instrumentation: i.) instrumented system software [2], or ii.) event recorders [7, 5]. The use of instrumented versions of communication libraries and operating systems is most convenient since performance data can be obtained without modifying the application code. However, this requires vendor participation and does not provide an easy mechanism for the user to turn monitoring on/off for different portions of the application. The insertion of event recorders at the source-code level performs better on both counts: it does not require vendor participation, and it collects exactly what the user wants to measure. In this respect, the insertion of event recorders resembles putting “print” statements at various points in the program to trace its control flow. Furthermore, this approach is highly portable since the instrumented programs will execute on any machine on which the original program executes. Xinstrument addresses the main drawback of this approach: source-code instrumentation by hand is extremely tedious.
Currently, AIMS supports parallel programs written in FORTRAN and C using three possible message-passing paradigms: Intel’s NX, TMC’s CMMD and PVM. Xinstrument inserts event recorders to trace subroutine invocations, synchronization operations, and message passing. To accommodate several parallel-programming paradigms, AIMS supports easy specification of both the syntax of the constructs to be instrumented (e.g. pvm_send, Intel’s csend, TMC’s cmmd_send) and the transformations required to instrument these constructs. (An illustrative sketch of such a source-level transformation appears at the end of this section.) Besides inserting instrumentation at appropriate locations in the program code, xinstrument also generates two key data structures — an application database (or APPL_DB) and an instrument-enabling profile (profile) (see Figure 1). The APPL_DB stores static information about the application’s source code (e.g. the file names and line numbers of instrumented constructs). AIMS’ analysis tools use the APPL_DB to relate traced events to instrumented constructs and data structures in the source code. An APPL_DB is incorporated at the beginning of each tracefile produced by executing the instrumented application program. This supports limited source-code click-back with VK (see section 3.4.1) even when the application code has been changed or is not available at visualization time. The profile contains a table of flags that selectively trigger the instrumentation inserted in the program. Since the monitor reads in the profile at the beginning of program execution, the user does not need to recompile the application when trying to obtain performance data for different parts of the program. The profile may be modified and saved using xinstrument via a point-and-click user interface (Figure 2).
Figure 2. Graphical Interface of AIMS’ Source-Code Instrumentor
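To illustrate the kind of transformation xinstrument automates, the following C fragment shows a subroutine instrumented by hand in the “print-statement” style described above. The routine name trace_event(), its arguments, and the record layout are hypothetical; xinstrument’s generated calls and AIMS’ actual record format differ. The sketch only conveys the principle: a timestamped event record is emitted at the entry and exit of each instrumented construct.

    /* Conceptual source-level event recording, written by hand in C.
     * trace_event() and its record layout are hypothetical -- they are not
     * the calls xinstrument actually inserts. */
    #include <stdio.h>
    #include <time.h>

    static FILE *trace;

    /* one record per event: construct id, what happened, and a clock reading */
    static void trace_event(int construct_id, const char *what)
    {
        fprintf(trace, "%d %s %ld\n", construct_id, what, (long) clock());
    }

    static void solve(void)
    {
        trace_event(3, "enter");    /* recorder inserted at subroutine entry */
        /* ... original body: computation and message passing ... */
        trace_event(3, "exit");     /* recorder inserted at subroutine exit  */
    }

    int main(void)
    {
        trace = fopen("trace.node0", "w");  /* per-node tracefile (illustrative) */
        if (trace == NULL)
            return 1;
        solve();
        fclose(trace);
        return 0;
    }

Writing such wrappers by hand for every subroutine, loop, and message-passing call in a large application is exactly the tedium that xinstrument removes.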
3.3: Monitor — AIMS’ machine-dependent component
After the source code has been instrumented, it must be compiled and linked with a run-time performance monitoring library (the monitor). The monitor contains a set of “event recorders” (routines) that are inserted into the application by xinstrument; these generate the tracefile used by AIMS’ analysis tools. Figure 3 illustrates the basic function of these recorders during program execution — they write records into a buffer at each processing node. Either at the end of program execution or when the buffer fills up, the buffer is written (or “flushed”) to the system’s disks. Since flushing is quite intrusive, AIMS allows the user to control the buffer size and the flushing frequency (see [8]); a minimal sketch of this buffering scheme follows Figure 3. Generated event records include:
• state entrance/exit — for subroutines, loops, and user-defined code segments;
• communication — time spent by the application on message sending/receiving, message transmission time, message size, destination and type;
• I/O — file system read/write times;
• barriers — global reduction operations, barrier synchronizations, and event waits;
• markers — event enabling, probes to the message queue, and user-specified events; and
• statistical records — summaries of cumulative performance statistics at specified points of program execution.
Figure 3. Event Recorders Store Performance Data Locally, then “Flush” to Disks (each node’s instrumented code — e.g. CALL proc_begin(3,0), CALL sync_send(...,3,1), CALL proc_end(3,0) bracketing SUBROUTINE xyz — appends event records to a local buffer, which is flushed to a trace file on the system disks)
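The buffering scheme just described can be sketched in a few lines of C. The record layout, buffer size, and routine names below are illustrative only; AIMS’ monitor is more elaborate and its actual record formats are documented in [8]. The sketch shows the essential behavior: records accumulate cheaply in a node-local buffer, and an expensive disk write happens only when the buffer fills or at the end of the run.

    /* Minimal sketch of per-node event buffering and flushing (names, record
     * layout, and buffer size are illustrative, not AIMS' implementation). */
    #include <stdio.h>

    typedef struct {
        int  event_id;      /* which recorder fired              */
        int  construct_id;  /* which instrumented construct      */
        long timestamp;     /* node-local hardware clock reading */
    } EventRecord;

    #define BUF_RECORDS 1024        /* buffer size; user-controllable in AIMS */

    static EventRecord buffer[BUF_RECORDS];
    static int   n_records = 0;
    static FILE *tracefile;

    /* the intrusive step: a disk write that delays this node and, indirectly,
       any node that is waiting to communicate with it */
    static void flush_buffer(void)
    {
        fwrite(buffer, sizeof(EventRecord), (size_t) n_records, tracefile);
        n_records = 0;
    }

    /* the cheap, common step: store a record locally */
    static void record_event(int event_id, int construct_id, long timestamp)
    {
        buffer[n_records].event_id     = event_id;
        buffer[n_records].construct_id = construct_id;
        buffer[n_records].timestamp    = timestamp;
        if (++n_records == BUF_RECORDS)
            flush_buffer();
    }

    int main(void)
    {
        tracefile = fopen("trace.node0", "wb");
        if (tracefile == NULL)
            return 1;
        for (long t = 0; t < 5000; t++)   /* stand-in for a real execution */
            record_event(1, 3, t);
        flush_buffer();                   /* final flush at end of the run */
        fclose(tracefile);
        return 0;
    }

Enlarging BUF_RECORDS trades memory on each node for fewer flushes, which is precisely the knob AIMS exposes to reduce intrusion.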
Figure 4. A Potpourri of VK’s Displays
3.4: Tracefile animation and analysis
The tracefile contains the event records of the program execution; after execution, it is collected and transferred to a graphics workstation where it can be analyzed and displayed by AIMS’ visualization toolkit. VK and tally offer several ways of examining performance data. With VK, the tracefile can be animated via a number of X-window based views that depict the program’s changing state as time passes. Tally, on the other hand, collects and tabulates statistics that reflect the cumulative activity of the program.
3.4.1 The View Kernel (VK): VK displays a program’s execution via five animated views. These views present information indicating when certain constructs were executing, when messages were sent, how long messages were queued up before being processed, and when nodes were running/idle. Some views scroll as time passes, showing a segment of the program’s history, while others animate each state in sequence (updating the previous state). Some of the views VK provides are shown in Figure 4. A full description of these views can be found in
[8]. The OverVIEW, for example, uses colors or dithering to depict different procedures running on each node, and the messages being passed between them. The nodes are represented by rows with node numbers listed on the left. Thin lines on the OverVIEW indicate messages being sent between nodes. The tracefile can be viewed step-by-step or at high speed. VK can be instructed to pause (or start) at any instrumented construct (such as a particular message send/receive or when a certain subroutine is invoked) or after a certain time (say 3.54 msecs into the execution). A source-code click-back capability allows tracefile events pictured on the display to be mapped directly back into application code. For example, clicking on a procedure bar on the OverVIEW will reveal the subroutine running at that time/node point. Similar clicks on message lines reveal information about the corresponding sends and receives. As shown in Figure 5, this information may be displayed as a text window with the corresponding code (where the exact line is pointed to by a “^”) or as a construct-tree view showing the relationship between the observed event and instrumented points in the source code.
Figure 5. Source-Code “Click-Back” with VK’s OverVIEW
3.4.2 Tally: Tally generates a list of resource-utilization statistics on node-by-node and routine-by-routine bases. These statistics can help point out inefficient sections of code, which can then be more closely examined with VK. In addition, tally’s output can be used as input to statistical packages such as Excel or WingZ. For example, Figure 6 shows time usage across the different subroutines of a CFD solver. A detailed account of all the statistics tally tabulates can be found in [8].
Figure 6. Plots Obtained from Tally’s Output (time usage by routine — busy time versus send, receive, and global blocking — for routines such as comp_ps, eigv, edge_ew, jacx, and rhsx)
3.5: Intrusion — characterization and compensation
There are two main sources of intrusion in AIMS. The first of these is flushing, the need to periodically write out filled memory buffers to the file system. Even on the iPSC/860, where some compute nodes are directly connected to disks via a Concurrent File System (CFS), disk writes are still time-consuming; they significantly perturb the program’s execution. Not only is the “writing node” delayed; all nodes that interact with it may also be affected. Messages are delayed, synchronization routines are held up, and carefully orchestrated parallel programs may go haywire. The second source of intrusion is the time consumed by the instrumentation code. An event recorder typically accesses the node’s hardware clock once or twice and stores some information into a buffer. Although event recorders are relatively fast, they still perturb the program.
The measured program behavior, in turn, is perturbed in at least three ways. First, events on each control thread are delayed, and the execution time of the entire computation is lengthened. Second, events on one node that originally precede events on another node may not do so after the insertion of software probes, because each node may be perturbed at a different rate (depending on the amount and nature of instrumentation). Third, instrumentation may cause different events to occur, because a program’s execution path may change if decisions are made based on the order and time in which events occur.
Some other effects of the event recorders are very difficult to account for. Firstly, program size is increased and memory reference patterns are changed; these may, in turn, change the cache-miss and page-fault rates of the memory sub-system. Secondly, for processing nodes that are time-shared between processes (particularly in a distributed environment), the application can be non-deterministically interrupted. Therefore, compensation is only meaningful to a certain extent, depending on the execution environment.
A “tracefile post-processor” (tpp) has been developed for AIMS on the iPSC/860 [9]. It represents an attempt to characterize AIMS’ intrusion and to provide a mechanism to remove as much intrusion as possible so that the original program behavior can be recovered. Tpp attempts to: i.) correct event delays due to flushing and executing event recorders; ii.) maintain the consistency of send/receive/block relationships across nodes; and iii.) preserve barrier-synchronization patterns. We have tested over 20 parallel programs on the Intel iPSC/860. Experimental results to date show that tpp was able to remove 60% of the CPU overhead due to the execution of instrumentation software and 100% of the file-system usage for storing trace data during program execution. Overall, compensated execution times are within 6% of the uninstrumented execution times.
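To convey the flavor of the delay-removal part of tpp’s job, the following C sketch subtracts an assumed constant per-recorder overhead from a stream of event timestamps on one node. This is only the simplest piece of the problem: the timestamps and the 20-microsecond overhead are invented, the constant-overhead assumption is ours, and the real post-processor [9] additionally restores send/receive/block consistency across nodes and accounts for flush delays, which this sketch ignores.

    /* Sketch of the delay-removal idea: if each event recorder costs a
     * roughly constant amount of time, the k-th event on a node has been
     * pushed later by about k * overhead, so subtracting the accumulated
     * overhead approximates the uninstrumented timestamps. */
    #include <stdio.h>

    #define N_EVENTS 5

    int main(void)
    {
        long measured[N_EVENTS] = {100, 260, 430, 620, 810};  /* microseconds */
        long overhead_per_event = 20;     /* assumed cost of one recorder */
        long removed = 0;

        for (int k = 0; k < N_EVENTS; k++) {
            removed += overhead_per_event;          /* overhead accumulated so far */
            long compensated = measured[k] - removed;
            printf("event %d: measured %ld us, compensated %ld us\n",
                   k, measured[k], compensated);
        }
        return 0;
    }

Even this simple correction cannot be applied node by node in isolation, which is why tpp must also re-establish the ordering of matching sends and receives before the compensated trace can be trusted.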
4: Two examples of usage
AIMS can be used to analyze program execution and highlight problem areas that can then be modified to improve performance. For example, a user complained that his program took 25% longer to execute once out of every 10 executions for no apparent reason. We instrumented it and found the cause: Intel’s O/S took twice as long to send one particular message (10% of the time) at the beginning of the algorithm. This kind of performance
problem would not readily show up (if at all) in an aggregated analysis such as a plot of average message latencies. Figure 8 shows an OverVIEW of a parallel integer sort program. In the initial run, the sorting required 841 msec (note that white areas represent idle times). Using AIMS, the programmer noticed where the program spent most of its time idling and was able to make a simple change that reduced the sorting time to 184 msec. Figure 9 shows the result of this first improvement. (The scales are not the same on the figures.) The improvement involved replacing a loop of global operations with a single global operation performed on a vector; a sketch of this transformation follows. Figure 10 shows that execution time was finally reduced to 80 msec after 7 modifications, including reducing message lengths, combining lengthy loops, using broadcasts instead of communicating by stages, and removing unnecessary global sums and broadcasts.
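The transformation behind the first improvement, replacing a loop of global operations with one vector global operation, is sketched below in C. Here global_sum() is a hypothetical stand-in for the message-passing library’s vector global-reduction call (e.g. gisum under Intel’s NX); it is stubbed out so the fragment compiles on its own, and the bucket-count array is invented for illustration.

    /* One global reduction over a whole vector instead of one reduction per
     * element.  global_sum() is a stub standing in for the library call. */
    #include <stdio.h>

    #define N 8

    /* stand-in: sums x[0..n-1] element-wise across all nodes */
    static void global_sum(long *x, int n) { (void) x; (void) n; }

    int main(void)
    {
        long bucket_counts[N] = {0};

        /* Before: one global operation per element -- N separate rounds of
           communication, each paying latency and synchronization costs. */
        for (int i = 0; i < N; i++)
            global_sum(&bucket_counts[i], 1);

        /* After: the same result from a single vector global operation --
           one round of communication for all N elements.  (The two versions
           are alternatives; both appear here only for comparison.) */
        global_sum(bucket_counts, N);

        printf("bucket counts reduced across all nodes\n");
        return 0;
    }

The win comes from paying the fixed cost of a global operation once rather than N times, which is what the shrinking idle regions between Figures 8 and 9 reflect.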
Figure 7. Comparing Message Passing Delays with AIMS
Figure 8. Original Integer Sort Program
Figure 9. Integer Sort After First Modification
Figure 10. Final Version of Integer Sort Program
5: Conclusions and future research
The Automated Instrumentation and Monitoring System (AIMS) provides a suite of software tools to facilitate the tuning of parallel applications. Application source code is instrumented automatically. Performance data gathered from the execution of instrumented code can be displayed on a variety of workstations. These displays provide users with a means for observing the behavior of their programs and for tracing the sequence of operations via source-code click-back. Thus the performance and correctness of parallel algorithms on hypercubes may be evaluated easily. Although we have shown that AIMS can be a powerful tool for the development of parallel applications, there is much room for improvement.
• Scalability: The 2-dimensional display formats and “event trace” approaches are not scalable. Enormous amounts of performance data can be gen-
erated rather easily. The run-time overhead (e.g. flushing) and analysis overhead (e.g. the need to sort the records before feeding them to the visualization/analysis toolkit) could render the performance tuning methodology described here impractical. Our current research efforts are addressing the scalability problem from several directions:
- limit the trace data generated: Through the use of “performance metric predicates” [10], we will allow the user to define performance parameters that control trace data generation (e.g. monitoring only during load-imbalance situations);
- scalable hierarchical visualization formats: The user can specify the level of abstraction and other selection criteria for perusing sub-sets of the data.
- performance bottleneck indicators (indices): We are working on a version of tally that produces statistics to help users “zoom in” on problematic sections of the code. This eliminates the need to animate the entire tracefile.
• Portability and Application to Heterogeneous Computing: AIMS attempts to enhance portability by decoupling the performance evaluation process into three phases: a language/paradigm-dependent instrumentation phase, a machine-dependent moni-
toring phase, and a tracefile-format-dependent analysis phase. We have also developed an experimental meta-monitor for cooperating parallel programs executing on different partitions of the iPSC/860. Based on this experiment, together with the progress we have made with PVM-AIMS, we plan to build a generalized framework that enables us to coordinate our monitors on different platforms to support heterogeneous computing environments.
• Vendor Participation will enable performance evaluation to take place at a level not possible before: for example, the evaluation of compiler effectiveness and the utilization of hardware monitors.
• Flexibility: Application specialists want different questions answered for the same data-set. It is impossible to build a visualization/analysis toolkit that will make everybody happy. AIMS tracefiles are readable by Pablo [7] (which provides users with a mechanism to customize the analysis toolkit) and ParaGraph [1] (which provides over 30 animated views).
We hope that performance data gathered using AIMS will eventually help us build more accurate models of parallel programs/machines and predict their scalability with respect to increases in problem and machine sizes.
Requirements and availability
AIMS is available to the public through COSMIC (NASA’s software clearing house). AIMS (version 2.2) supports FORTRAN 77 and C with NX on the Intel iPSC/860, iPSC/Delta, and the Paragon. VK, tally and xinstrument require X Windows (X11R5) and Motif (1.1.3) and have been tested on Sun SparcStations and Silicon Graphics IRISs and Indigos under the twm and mwm window managers. We are collaborating with computer vendors to interface with their performance-monitoring hardware and software. A version of AIMS that supports HP workstations running PVM is available from CONVEX. A version of AIMS for Thinking Machines’ CM-5 will be available in March 1994.
Acknowledgments
AIMS was developed after evaluating software prototypes from the research community and reviewing published ideas on performance visualization. We would like to acknowledge the community’s help and support in letting us adopt, adapt and augment their research prototypes for the parallel processing environment here at NASA Ames Research Center. The current version of AIMS uses the POEM source-code instrumentation system developed under the Programming and Instrumentation Environment (PIE) project [11] at Carnegie Mellon University. AIMS’ monitor adopts many of the event-record conventions established by PICL, a Portable Instrumented Communication Library [2]. Some of the displays have been inspired by ParaGraph [1] and Quartz [12]. We also want to acknowledge the computing facilities provided to us by the Numerical Aerodynamic Simulation Systems Division, NASA Ames Research Center. The work described here is supported under the NASA High Performance Computing and Communications Program.
References
[1] M. Heath and J. Ethridge. “Visualizing the Performance of Parallel Programs.” IEEE Software, Vol. 8, No. 5, Sept. 1991, pp. 29-39.
[2] G. A. Geist, M. T. Heath, B. W. Peyton, and P. H. Worley. “PICL — A Portable Instrumented Communication Library.” Tech. Report ORNL/TM-11130, Oak Ridge National Laboratory, May 1990.
[3] S. L. Graham, P. B. Kessler, and M. K. McKusick. “gprof: a Call Graph Execution Profiler.” In Proceedings of the SIGPLAN ’82 Symposium on Compiler Construction, June 1982.
[4] A. Malony. “Performance Observability.” Ph.D. Dissertation, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801. Report No. UIUCDCS-R-90-1603, October 1990.
[5] T. Lehr. “Compensating for Perturbation by Software Performance Monitors in Asynchronous Computations.” Ph.D. Dissertation, Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, April 1990.
[6] M. H. Reilly. A Performance Monitor for Parallel Programs. Academic Press Inc., 1990.
[7] D. A. Reed, R. D. Olson, R. A. Aydt, T. M. Madhyastha, T. Birkett, D. W. Jensen, B. A. A. Nazief, and B. K. Totty. “Scalable Performance Environments for Parallel Systems.” In Proceedings of the 6th Distributed Memory Computing Conference, April 1991.
[8] “The Automated Instrumentation and Monitoring System (AIMS) Reference Manual.” NASA Technical Memorandum, September 1993.
[9] J. C. Yan and S. Listgarten. “Intrusion Compensation for Performance Evaluation of Parallel Programs on a Multicomputer.” In Proceedings of the 6th International Conference on Parallel and Distributed Computing Systems, Louisville, KY, October 14-16, 1993.
[10] C. E. Fineman and P. J. Hontalas. “Selective Monitoring Using Performance Metric Predicates.” In Proceedings of the Scalable High Performance Computing Conference SHPCC-92, Williamsburg, VA, April 26-29, 1992, pp. 162-165.
[11] T. Lehr, Z. Segall, D. Vrsalovic, E. Caplan, A. Chung, and C. Fineman. “Visualizing Performance Debugging.” Computer, October 1989, pp. 38-51.
[12] T. E. Anderson and E. D. Lazowska. “Quartz: A Tool for Tuning Parallel Program Performance.” In Proceedings of the SIGMETRICS ’90 Conference on Measurement and Modeling of Computer Systems, May 1990, pp. 115-125.