Copyright 1998 IEEE. Published in the Proceedings of Rtss’98, 1 December 1998 in Madrid, Spain. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE. Contact: Manager, Copyrights and Permissions / IEEE Service Center / 445 Hoes Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA. Telephone: + Intl. 732-562-3966.
Performance Measurement using Low Perturbation and High Precision Hardware Assists 1
Alan Mink and Wayne Salamon2
Jeffrey K. Hollingsworth and Ramu Arunachalam3
Scalable Parallel Systems Group Information Technology Laboratory National Inst. of Standards and Tech. (NIST) Gaithersburg, MD 20899 {amink,wsalamon}@nist.gov
Computer Science Dept. University of Maryland College Park, MD 20742 {hollings,ramu}@cs.umd.edu
Abstract
We present the design and implementation of MultiKron PCI, a hardware performance monitor that can be plugged into any computer with a free PCI bus slot. The monitor provides a series of high-resolution timers and the ability to monitor the utilization of the PCI bus. We also demonstrate how the monitor can be integrated with online performance monitoring tools such as the Paradyn parallel performance measurement tools to reduce the overhead of key timer operations by a factor of 25. In addition, we present a series of case studies using the MultiKron hardware performance monitor to measure and tune high-performance parallel computing applications. By using the monitor, we were able to find and correct a performance bug in a popular implementation of the MPI message passing library that caused some communication primitives to run at one half of their potential speed.
1. Introduction

Low cost and low perturbation performance data collection is necessary in high performance computing for both measurement and control purposes. This applies to single processors, parallel processing, heterogeneous distributed environments, and especially real-time environments. Event tracing can be accomplished in software without hardware support, although the perturbation incurred may be substantial and the precision of the available "time" may be too coarse for resolving delays and correlating interprocessor events. Hardware support can significantly reduce the perturbation, improve the time precision, and is a must for tallying high-speed hardware events.

Writing and debugging parallel and real-time programs is a complex and time-consuming task. The correctness of such programs depends not only on logical or functional correctness but also on temporal correctness. Temporal correctness can be difficult to guarantee or verify, especially in an open system with a dynamic task set, because of the timing interactions and unpredictable interleaving of the other tasks in the system. Runtime monitoring tools can be a valuable asset in debugging and fine-tuning such systems by providing a means to observe scheduling and timing behavior. Using a monitoring tool, one can collect information about system behavior such as execution time, CPU usage, and blocked time, which can be used to pinpoint temporal errors, restructure computation, and check the accuracy of worst-case execution times. Runtime monitoring is also useful during normal operation of the system to verify that the system meets its temporal requirements.

Monitoring tools collect data for both user- and system-level metrics by inserting software probes into the appropriate functions of the code. Execution of the instrumented code triggers the capture and storage of data by the monitoring system. It is, however, important for the monitoring tool to be as non-intrusive as possible so as not to change the behavior of the underlying task set. Significant intrusion will increase the execution time of the activities being monitored, which can lead to tasks missing their deadlines, adversely affect load balance, or even reduce the ability to schedule hard real-time tasks.

Although many current microprocessors provide hardware counters, there is usually only one clock register. For performance measurement it is necessary to virtualize a clock into a timer (i.e., support start and stop operations). In addition, multiple timers are required to measure the time between different events. In this paper we present the NIST MultiKron4 hardware monitor, which provides 16 high-precision, non-intrusive timers and counters. In addition, MultiKron provides minimally intrusive event tracing via an on-board trace memory (or an optional external data path). We describe MultiKron's integration with the Paradyn[18] performance measurement tool. Paradyn provides application performance analysis, visualization, and dynamic insertion and removal of measurement probes. This paper presents the first integration of Paradyn's dynamic instrumentation (insertion and deletion of code during program execution) with hardware-based counters. The combination of these two systems permits Paradyn to dynamically re-use the finite number of hardware timers on the MultiKron system.

The rest of the paper is organized as follows. Section 2 describes the MultiKron measurement hardware. Section 3 describes the integration of MultiKron with the Paradyn Parallel Performance Measurement Tools. Section 4 presents a series of micro-benchmarks and case studies to demonstrate the benefit of hardware-software performance measurement. Section 5 summarizes related work and Section 6 presents our conclusions.
1 This NIST contribution is not subject to copyright in the United States. Certain commercial items may be identified, but that does not imply recommendation or endorsement by NIST, nor does it imply that those items are necessarily the best available for the purpose.
2 This work was partially sponsored by DARPA.
3 Supported in part by NIST CRA #70NANB5H0055, NSF Grants ASC-9703212 and CDA-9401151, and DOE Grant DE-FG02-93ER25176.
4 MultiKron is a registered trademark of NIST.
Figure 1: Overview of Measurement Probes. (Code probes and signal probes feed the MultiKron, which is memory mapped via the memory or I/O bus; it produces event traces and performance counter data such as "msg A sent @ 27,384,013 us, 1,327 bytes took 3,859 us", "function X took 2,501 us", "L2 cache hit ratio was 43%", and "Memory Bus Utilization was 98%".)
2. Hardware Measurement Environment

The focus of the NIST performance instrumentation work[22] is to provide hardware support for obtaining performance measurement data from parallel computers, as well as uniprocessors, with tolerable perturbation to both the executing processes and the architecture on which they are executing. Current NIST instrumentation consists of the MultiKron_II[19] and the MultiKron_vc[20] custom VLSI chips, and their associated toolkits[21, 23]. The chips are designed to be memory mapped to the local processor(s), via the memory or I/O bus. The MultiKron_II provides both event tracing and 16 performance counters, while the MultiKron_vc provides only performance counters, but 64 thousand of them. Performance counters can be used to count the number of occurrences of a target event or to record the elapsed time between events. Both chips provide a high-precision clock.

The MultiKron toolkits are printed circuit boards (PCBs) that contain a MultiKron chip, interface logic to a standard I/O bus (currently VME, SBus, and PCI), logic for support and management of the MultiKron, and two data storage schemes: a local, dedicated memory on the PCB or an external interface to another machine. The MultiKron chips are not directly useful to experimenters by themselves, since experimenters would likely lack the time or expertise to interface the chips to their machines. Thus, the MultiKron toolkits provide experimenters a quick and easy means to utilize MultiKron instrumentation. The toolkits are designed so that experimenters can plug in the PCB, install its support software, and begin to integrate performance measurement into their experiments.

2.1 MultiKron Instrumentation

During execution of a program under test, performance measurement data are acquired as directed by measurement probes (see Figure 1). There are two types of measurement events, hardware and software. A hardware measurement event is captured via a wire physically connected to an electrical signal in the system being measured. One of the options for the MultiKron performance counters is to count the occurrences of these external signals.

Operationally, the MultiKron is a passive, memory-mapped device. Programmers interact with MultiKron via reads and writes to the mapped memory region. To generate traces, a software measurement event is triggered by the execution of a specific statement, a measurement probe, in the application program. A measurement probe can be added to the source code, requiring recompilation, or added directly to the executable code via a binary patch[11]. The probe appears as an assignment statement to a memory-mapped MultiKron address. When the measurement probe code is executed, the value from the assignment statement (generally an event ID) is written to the MultiKron, which then appends its current timestamp (precision on the order of 100 ns), the identity of the process, and, in multi-processor systems, the identity of the CPU, to form an event trace sample. The trace sample is then buffered and written to the MultiKron data storage interface. Operating system support is necessary during context switches to load the MultiKron with the current process ID. Thus, MultiKron provides a hardware assist to traditional software-based instrumentation systems, thereby minimizing the perturbation to the executing program by simplifying the probe code to a single write operation and recording the time-stamped samples into its own memory.
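To make the software probe concrete, the sketch below shows what such a probe could look like in C. It is only an illustration of the single-store idea: the mapped pointer, the driver that sets it up, and the event ID values are assumptions, not the actual MultiKron register map or toolkit API.

    /* Minimal sketch of a MultiKron-style trace probe (hypothetical names). */
    #include <stdint.h>

    /* Assumed to be initialized elsewhere by mmap()ing the region that the
     * toolkit's device driver exports for trace writes. */
    extern volatile uint32_t *mk_trace_port;

    #define EVENT_SEND_START 0x0101   /* application-chosen event IDs */
    #define EVENT_SEND_DONE  0x0102

    static inline void mk_probe(uint32_t event_id)
    {
        /* This single store is the entire run-time cost of the probe; the
         * chip appends the timestamp, process ID, and CPU ID on its own. */
        *mk_trace_port = event_id;
    }

    /* Usage: mk_probe(EVENT_SEND_START); send_message(...); mk_probe(EVENT_SEND_DONE); */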
Figure 2: Functional Block Diagram of the MultiKron Chip. (The diagram shows the CPU interface; the clock selection multiplexers (TS_Clk, Clk/100, Clk/10, Clk, Ext, SW); the 16 performance counters with their Mode, Clk Sel, and Enable control fields, CSR filter, and 16 shadow registers; the 56-bit timestamp; the header, timestamp, source address, and data fields of a trace sample; the FIFO; and the data storage interface with its EOM, parity, and data lines.)
The block diagram of the MultiKron_II chip is shown in Figure 2. There are two separate interfaces: a CPU interface to acquire measurement data and control the chip, and a data storage interface to store measured data separately without interfering with normal processor execution. The right-hand side of the figure represents the acquisition and buffering of the 20-byte trace event samples. The left-hand side of the figure represents the performance counters and their control. Each counter is controlled by a dedicated sub-field in the Mode, Clk Sel, and Enable registers. The Mode and Clk Sel registers select the signal to be tallied by each counter. The Enable register controls whether each counter is active or inactive. The counters, which can be virtualized for each process with operating system support, can be written directly by the CPU, but only read indirectly via the shadow registers. This transparent operation allows the data of all of the performance counters to be read out simultaneously and held constant while the counters continue their programmed activities. This simultaneous read provides a single-time view of the measured data without concern for the time skew caused by reading each counter separately at slightly different times. Data records sent to the data storage interface are 20-byte trace event samples, optionally concatenated with the 64 bytes of data from the performance counters.
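The shadow-register read can be pictured as follows. This is only a sketch of the snapshot idea; the register layout and the write that latches the counters are assumptions, not the documented MultiKron_II interface.

    /* Hypothetical sketch of a consistent snapshot of all 16 counters. */
    #include <stdint.h>

    struct mk_counters {
        volatile uint32_t latch;       /* assumed: writing here copies all       */
                                       /* counters into the shadow registers     */
        volatile uint32_t shadow[16];  /* frozen copies, safe to read one by one */
    };

    void mk_snapshot(struct mk_counters *mk, uint32_t out[16])
    {
        mk->latch = 1;                 /* capture all counters at one instant */
        for (int i = 0; i < 16; i++)
            out[i] = mk->shadow[i];    /* counters keep running meanwhile     */
    }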
2.2 The PCI MultiKron Toolkit

The MultiKron toolkit PCBs are designed to be easily re-engineered for different buses. The PCB form factor, connectors, and bus protocol differ for each bus, but these functions have been isolated from the rest of the board design. The block diagram of the PCI bus toolkit, designed and used for this project, is shown in Figure 3. A single programmable logic device (PLD) handles the bus interface logic. A second PLD handles the logic for the rest of the toolkit board, which includes a MultiKron_II chip, a MultiKron_vc chip, a FIFO, and 16 Mbytes of DRAM storage. The MultiKron_vc controls its own dedicated SRAM storage, which is normally not directly accessible to the experimenter. The MultiKron_II uses the on-board 16 Mbytes of DRAM to store trace event samples, and this memory is directly accessible to the experimenter.

One of the PCI bus signals, DEVSEL#, is a device active signal. To measure the utilization of the PCI bus, we hardwired two versions of the DEVSEL# signal to the external input pins of MultiKron performance counters 0 and 2.
Figure 3: Functional Block Diagram of the PCI MultiKron Toolkit. (The board contains the MultiKron_vc with its dedicated 8K x 40-bit SRAM, the MultiKron_II, an optional FIFO, 16 Mbytes (4M x 32 bits) of DRAM, the board logic, and the PCI logic; the PCI bus shares the low-order 32 address and data lines, with an optional buffer for the high 32 data lines.)
One version is the direct signal, and the other is the signal enabled only when the toolkit board is active. These signals can be used to determine the overall PCI bus utilization and the time used by the toolkit board. By configuring performance counters 0 and 2 to use these input signals, they tally the total number of clock ticks during which the PCI bus is busy, and busy only with the toolkit board, respectively. By configuring another performance counter to tally the total number of elapsed clocks, we can compute the percentage of time the bus is utilized. The MultiKron clocks operate at a rate faster than the 33 MHz PCI bus clock, yielding a fairly accurate measure of utilization. Each MultiKron performance counter is a 32-bit counter, but "even" numbered counters can be configured as 64-bit counters by concatenating them with their neighboring "odd" numbered counter. Concatenating counters is useful for tallying high-frequency events, since a 32-bit counter can overflow quickly: in less than 90 seconds when tied to a 50 MHz clock.
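The arithmetic behind the bus-utilization metric is simple; a sketch is shown below. The helper that reads a concatenated even/odd counter pair is hypothetical, and the choice of counter 4 for elapsed clocks is ours.

    /* Sketch of deriving PCI bus utilization from the counters described
     * above: counter 0 counts clocks with DEVSEL# asserted, counter 2 counts
     * clocks when the toolkit board itself is active, and counter 4 (our
     * choice) counts every elapsed clock. mk_read_counter64() is assumed. */
    #include <stdio.h>
    #include <stdint.h>

    extern uint64_t mk_read_counter64(int n);  /* even/odd pair read as 64 bits */

    void report_pci_utilization(void)
    {
        uint64_t busy    = mk_read_counter64(0);   /* bus busy clocks             */
        uint64_t ours    = mk_read_counter64(2);   /* busy with the toolkit board */
        uint64_t elapsed = mk_read_counter64(4);   /* total elapsed clocks        */

        /* A lone 32-bit counter on a 50 MHz clock wraps after roughly
         * 2^32 / 50e6, about 86 seconds, hence the 64-bit concatenation. */
        printf("PCI bus: %.1f%% busy (%.1f%% for the toolkit board)\n",
               100.0 * (double)busy / (double)elapsed,
               100.0 * (double)ours / (double)elapsed);
    }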
3. Integrating Paradyn and MultiKron

To provide an end-to-end demonstration of the benefits of hardware instrumentation for tuning application programs, we have integrated support for the MultiKron PCI toolkit into the Paradyn Parallel Performance Measurement Tools. To take advantage of the high-resolution, low-overhead timers available with the MultiKron board, we replaced Paradyn's default software-based instrumentation with memory operations on the mapped PCI board. The next sub-section provides a brief overview of the Paradyn tools, their instrumentation, and their data collection model. Section 3.2 describes the design and implementation of the Paradyn-MultiKron integration.

3.1 Paradyn and Its Data Collection Model

Paradyn is a parallel performance suite designed to scale to long-running (multi-hour) executions on large-scale parallel systems. The tools are designed to provide precise information about application performance during program execution, yet keep instrumentation overhead reasonable. To do this, Paradyn uses a mechanism known as Dynamic Instrumentation[11] that employs run-time code generation to insert instrumentation into a running program. The end-to-end requirements of Paradyn are very similar to those of a soft real-time system: to provide accurate real-time performance feedback, all of the modules (data collection, analysis, and visualization) have delay requirements that must be met for the system to be effective. To control the amount of instrumentation overhead, dynamic instrumentation is regulated by a cost model[13] that limits the amount of instrumentation enabled so that it does not exceed a user-defined time budget (expressed as the ratio of instrumentation code execution time to application code execution time).

Paradyn's data collection model is based on a hybrid two-step process. As the program to be measured executes, data structures record the frequency and duration of instrumented operations. These data structures are then periodically sampled to report performance information to an external process (called the Paradyn daemon) that forwards the data to visualization and analysis components. Periodic sampling of these structures provides accurate information about the time-varying performance of an application without requiring the large amount of data needed to store event logs of every change in the state of a monitored variable.

To meet the challenges of providing efficient and detailed instrumentation, Dynamic Instrumentation makes some radical changes in the way traditional performance data collection has been done. However, since the approach is designed to be usable by a variety of high-level tools, a simple interface is used. The interface is based on two abstractions: resources and metrics. Resources are hardware or software abstractions about which we wish to collect performance information. For example, procedures, processes, CPUs, and disks are all resources. Metrics are time-varying functions that characterize some aspect of a parallel program's performance. Metrics can be computed for any subset of the resources in the system. For example, CPU utilization can be computed for a single procedure executing on one processor or for the entire application. Metric definitions are converted into instrumentation snippets by a compiler that inserts code into the running application[14]. Figure 4 shows an application program with two snippets inserted: one at the entry to the RecvMsg function and the second at the end of that function. This pair of snippets computes the amount of time that the RecvMsg function, and any functions it calls, is active. The first snippet starts a wall timer when the application program enters the RecvMsg function. The second snippet stops the timer when the application leaves this function.
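The semantics of such a snippet pair can be rendered roughly as the C below. Paradyn actually generates this code at run time and patches it into the application; the timer structure and the gettimeofday-based clock shown here are only illustrative of the default software path.

    /* Illustrative rendering of the Figure 4 snippet pair (not Paradyn source). */
    #include <stddef.h>
    #include <sys/time.h>

    struct pd_timer {
        double start;     /* wall time at the most recent start   */
        double total;     /* accumulated wall time for the metric */
        int    running;
    };

    static double wall_now(void)          /* default software path: a system call */
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1e-6;
    }

    void startTimer(struct pd_timer *t) { t->start = wall_now(); t->running = 1; }

    void stopTimer(struct pd_timer *t)
    {
        if (t->running) { t->total += wall_now() - t->start; t->running = 0; }
    }

    /* Inserted control flow, shown inline for clarity:
     *   startTimer(&msgTime);  RecvMsg(src, ptr, cnt, size);  stopTimer(&msgTime); */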
Figure 4: Paradyn's Data Collection Paths. (Dynamic Instrumentation inserts startTimer(msgTime, WallTime) and stopTimer(msgTime) snippets around the application's RecvMsg(src, ptr, cnt, size) function; the Paradyn daemon samples the timer and forwards the data to the visualization and analysis tools.)
3.2 Integrating Paradyn and MultiKron

To integrate Paradyn's instrumentation system with the MultiKron toolkit, we replaced the software used to implement Paradyn's timers with instructions that trigger MultiKron's hardware timers. Figure 5 shows the modified data path when using the hardware assist for timers. The application program still has two snippets, but these snippets now consist of assignment statements to the memory-mapped control register. In addition to periodically sampling the value of the timer, the Paradyn daemon now directly reads metric values by memory mapping the MultiKron hardware.

Timers are implemented by configuring the performance counters in the MultiKron to be incremented by the clock on the MultiKron board. Once configured, starting a timer simply requires a write operation to the memory-mapped location that enables the counter; stopping a timer requires a similar operation. Figure 5 shows a slightly simplified version of the implementation (the actual code uses a bit mask to select a specific counter, so the constant is not 1). In addition to providing faster access to a clock source, the MultiKron implementation of Paradyn's timers is faster because it provides the abstraction of a timer (e.g., start and stop) rather than the building block of a clock (e.g., return the current time). Sampling by the Paradyn daemon is accomplished by having the daemon process also map the MultiKron into its address space.

We have also exported the PCI bus utilization counter in the MultiKron toolkit as a Paradyn metric. This allows online (during execution) visualization of bus utilization alongside application (software) specific metrics such as message passing statistics. Isolating bus utilization to specific program components can be difficult due to asynchronous events such as I/O and the unpredictable interleaving of kernel activity with application requests. As a result, the bus utilization metric can only be requested for the entire machine rather than for specific procedures, files, or threads.
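Under this scheme the snippets reduce to something like the sketch below. The enable-register address, its per-counter bit layout, and the idea that a single store suffices for both start and stop are assumptions drawn from the description above and Figure 5, not the actual generated code.

    /* Hypothetical sketch of MultiKron-backed timer snippets: one store to a
     * memory-mapped enable location starts or stops a hardware counter. */
    #include <stdint.h>

    extern volatile uint32_t *mk_enable;   /* mapped enable word (assumed layout) */

    #define MK_COUNTER_BIT(n) ((uint32_t)1 << (n))

    static inline void mk_timer_start(int n) { *mk_enable = MK_COUNTER_BIT(n); }
    static inline void mk_timer_stop(int n)  { *mk_enable = 0; /* counter holds its value */ }

    /* Generated snippets around RecvMsg then become, in effect:
     *   mk_timer_start(3);  RecvMsg(src, ptr, cnt, size);  mk_timer_stop(3); */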
Figure 5: Paradyn Data Collection Using MultiKron. (The application process now contains memory-mapped writes such as *0x4f5e2 = 1; and *0x4f5e2 = 0; around RecvMsg(src, ptr, cnt, size); the Paradyn daemon reads the MultiKron counters directly and forwards the data to the visualization and analysis tools.)
However, for many real-time and parallel applications, only one process per node is executing, so the per-machine measurements are nearly identical to per-process data.
4. Measurement Studies

To evaluate the capabilities of the MultiKron and its integration with the Paradyn Parallel Performance Tools, we conducted a series of measurement studies. The first set of experiments was designed to demonstrate the capabilities of the MultiKron's hardware instrumentation and to quantify the advantages of using hardware assistance for online performance evaluation tools such as Paradyn. The second set of experiments was designed to show the benefits of using the combined hardware-software set of tools to understand the performance of parallel and distributed applications. These same benefits and concerns apply to real-time systems, and especially to parallel and distributed real-time systems using the real-time extensions to the Message Passing Interface (MPI)[15]. In these systems, the timing of network and peripheral device activity is of equal concern to that of processor activity. Lightweight data capture mechanisms allow performance measurements to be acquired in real-time systems, even those with tight time budgets. Since this measurement mechanism causes very little perturbation to the executing processes, accurate, undistorted performance information can be obtained down to relatively fine-grained events. In contrast, software-based data capture tools are not well suited to real-time systems since they can distort performance information, especially for fine-grained events.

4.1 Benefits of Using MultiKron Storage

We wished to quantify the benefits of using the on-board memory of the MultiKron to record periodic samples of performance data. By using the MultiKron's storage, a system being measured can record periodic samples of the 16 MultiKron performance counters into the on-board memory for offline analysis after the time-critical events have been measured. To evaluate this capability, we constructed a test program that can either sample the 16 performance counters and store the values in the system main memory (i.e., request multiple reads over the PCI bus), or request the MultiKron to store a sample containing the current values of the counters in its own memory. We then ran the test program, varying the frequency of these requests.

The results of comparing the overhead at different sampling frequencies are shown in Figure 6. By using the MultiKron to trigger samples, we are able to gather over 100,000 samples per second while incurring a load of less than 5% on the PCI bus. By contrast, a similar level of sampling that directly reads the performance counters results in a PCI bus utilization of over 40%. In addition, the processor utilization required to read samples frequently is very high. For example, when polling, the maximum possible sampling rate is 122,000 samples per second, which requires all available cycles on the 233 MHz Pentium II processor used in these experiments. The implication is that a large set of timed trace events with the corresponding set of counters can be assembled with little perturbation to the executing code by using MultiKron-type instrumentation, whereas reading each counter and the clock individually at each event can cause significant perturbation. Such delays could significantly degrade parallel code performance, but could be catastrophic to real-time codes.

Figure 6: Sampling Frequency vs. Bus Utilization. (PCI bus utilization, from 0 to 50%, for polling the counters versus triggering traces, at sampling rates from 0 to 125,000 samples/second.)

4.2 Performance of MultiKron Integration

In the original implementation of Paradyn, the gettimeofday and getrusage UNIX system calls are used for the wall timer and process timer, respectively. In the MultiKron implementation, each timer start and stop operation is simply one or two writes to the mapped control registers. A read or write to a MultiKron register is slightly slower than reading or writing a "normal" address. This difference occurs because the PCI bus (where the MultiKron is located) is slower than the memory bus. However, this difference is insignificant compared to the time taken by a timer system call.

To quantify the performance gains of using the MultiKron compared with software system calls, we wrote a test program to start and stop Paradyn wall timers for each implementation. In the test program, a timer is started and stopped 5 million times. The test program was run on a DEC Alpha with a clock speed of 275 MHz. The aggregate net time to execute each call, after factoring out loop overhead, is 2.6 microseconds (715 processor clock cycles) for the system call-based implementation and 0.096 microseconds (26 processor clock cycles) for the MultiKron version, or 27 times faster. This speedup is due to being able to execute a single memory operation rather than the sequence: read clock, read start time, and then update accumulated time. Also, reading a wall clock on the Alphas requires trapping into the operating system kernel.
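The measurement itself can be reproduced with a loop of the following shape; the timer entry points stand for whichever implementation (system-call based or MultiKron backed) is under test, and the names are placeholders.

    /* Sketch of the start/stop micro-benchmark: time an empty loop, then a
     * loop that starts and stops a wall timer each iteration, and report the
     * net cost per start/stop pair. */
    #include <stdio.h>
    #include <sys/time.h>

    #define ITERS 5000000L

    extern void startTimer(void *timer);   /* implementation under test */
    extern void stopTimer(void *timer);
    extern void *msgTime;                  /* one pre-allocated timer   */

    static double seconds(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1e-6;
    }

    int main(void)
    {
        volatile long sink = 0;
        double t0 = seconds();
        for (long i = 0; i < ITERS; i++) sink += i;      /* loop overhead only */
        double loop = seconds() - t0;

        t0 = seconds();
        for (long i = 0; i < ITERS; i++) {
            startTimer(msgTime);                         /* operations measured */
            stopTimer(msgTime);
        }
        double net = (seconds() - t0) - loop;

        printf("net cost per start/stop pair: %.3f us\n", net / ITERS * 1e6);
        return 0;
    }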
4.3 Micro-benchmarks of MultiKron on PCs

We have also installed the PCI MultiKron toolkit on a network of Intel Pentium II-based PCs connected by a Myrinet network. Myrinet[4] is a 1.2 Gbps switched local area network designed to provide both low latency and high speed communication. Our experimental configuration consists of a network of eight Pentium IIs with
64 MB of memory each. Inter-node communication is provided by sending messages using the industry standard Message Passing Interface (MPI)[8]. We used the BIP[25] protocol, which provides a Myrinet driver that works with the MPICH[7] implementation of MPI developed by Argonne National Laboratory.

Given the high speed of Myrinet, we wanted to use the MultiKron's PCI bus monitor to see whether the performance of the PCI bus was a limiting factor in the communication speed between nodes. We constructed a simple ping-pong message-passing program that sends a series of fixed-size messages between two hosts. We also instrumented this program to use the MultiKron PCI bus-monitoring feature to record the average PCI bus utilization for each run. For each message size, the experiment was repeated 10,000 times, and the average of the last 9,000 values is reported. We repeated the measurements for message sizes from 1024 bytes to 10 megabytes.

Figure 7 shows the results of these measurements. The two curves show the correlation between application-level message passing rates and PCI bus utilization as the message size is varied. The graph shows that once the message size exceeds 500 kilobytes, both the transfer rate and bus utilization curves flatten out. The effective transfer rate is about 120 MB/sec and the bus utilization is over 95%. Considering that the maximum speed of the PCI bus is 132 MB/sec (4-byte-wide bus at 33 MHz) and that the theoretical maximum speed of Myrinet is 150 MB/sec, it appears that the PCI bus is the limiting resource.
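A sketch of the ping-pong test is shown below. The MPI calls are standard; the read_pci_utilization() helper that samples the MultiKron counters is hypothetical, and details such as averaging only the last 9,000 iterations are omitted.

    /* Simple MPI ping-pong bandwidth test between ranks 0 and 1. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    extern double read_pci_utilization(void);   /* assumed MultiKron wrapper */

    int main(int argc, char **argv)
    {
        int rank, reps = 10000;
        int size = atoi(argv[1]);                /* message size in bytes */
        char *buf = malloc(size);
        MPI_Status st;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
            } else if (rank == 1) {
                MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double elapsed = MPI_Wtime() - t0;

        if (rank == 0)
            printf("%d bytes: %.1f MB/sec, PCI bus %.1f%% busy\n", size,
                   2.0 * reps * size / elapsed / 1e6, read_pci_utilization());

        MPI_Finalize();
        return 0;
    }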
Figure 7: BW and Bus Utilization vs. Pkt Size. (Bandwidth in MB/sec, from 0 to 140, and PCI bus utilization, from 0 to 100%, plotted against message sizes from 1,000 to 10,000,000 bytes.)

We also used the PCI bus-monitor feature of the MultiKron board to find and correct a performance bottleneck in the all-to-all communication primitive in MPICH, a popular implementation of MPI. The PCI bus utilization for two nodes (Glen7 and Glen8) of an eight-node cluster is shown in Figure 8. The graph shows a one-second interval of a test application that executes a series of all-to-all communication requests. As can be seen, there is a large fluctuation in bus utilization, and the average bus utilization for the entire application is only 30%. However, since all-to-all message passing is nothing more than a sequence of point-to-point messages, we would have expected the bus utilization to be similar to that seen in Figure 7. Since each all-to-all message is 1.3 MB (or 163 KB per node), we would have expected an average bus utilization of close to 80%. Also, the average aggregate transfer rate across eight nodes is 162 MB/sec, which is only 35% higher than the two-node case.

Figure 8: PCI bus utilization vs. time for two nodes. (Bus utilization for nodes Glen7 and Glen8 over a one-second interval, from 110 to 111 seconds of the run.)

Qualitatively, the graph shows a large fluctuation in the bus utilization as a function of time. As a result, we suspected that somehow the all-to-all function was not able to send data at the uniformly high rate we had expected. We then looked at the implementation of the function and discovered that although the function was using non-blocking sends and receives to transmit the data, each node was initiating the sends in the same order: each process started with virtual node zero and then proceeded sequentially to each other node. We suspected that the lower-level implementation of non-blocking communication was processing message send requests in FIFO order. This send pattern would result in contention at each node that was the recipient of several simultaneous sends. To test this theory, we changed the send loop so that each node starts by sending data to the node with the next highest node number. The results of this version are shown in Figure 9. In this case, the test application achieves an average bus utilization of 58%, and the aggregate communication performance across eight nodes is 316 MB/sec.
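The change amounts to staggering the destination order of the non-blocking sends, roughly as sketched below (this is an illustration of the fix described above, not the actual MPICH source).

    /* Sketch of the send-ordering change: instead of every rank posting its
     * sends starting at rank 0, each rank starts with the next-highest rank,
     * so no destination is hit by several simultaneous sends. */
    #include <stddef.h>
    #include <mpi.h>

    void post_all_to_all_sends(const char *sendbuf, int blocksize, int rank,
                               int nprocs, MPI_Request *req)
    {
        for (int i = 0; i < nprocs; i++) {
            /* original order:  int dest = i; */
            int dest = (rank + 1 + i) % nprocs;          /* staggered order */
            MPI_Isend((void *)(sendbuf + (size_t)dest * blocksize), blocksize,
                      MPI_CHAR, dest, 0, MPI_COMM_WORLD, &req[i]);
        }
        /* matching MPI_Irecv and MPI_Waitall calls omitted for brevity */
    }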
Figure 9: PCI Bus Utilization after tuning. (Bus utilization for nodes Glen7 and Glen8 over the same one-second interval after the change.)

The micro-benchmarks show the ability of the MultiKron toolkit to reduce the overhead associated with software-based instrumentation and the utility of the PCI bus monitor to isolate performance problems in systems software. In the next section, we consider the utility of the toolkit for measuring application programs.
4.4 Application Case Studies
To evaluate the performance of the MultiKron toolkit on real applications, we selected two programs. The first is an out-of-core LU matrix decomposition that was run on a DEC Alpha cluster. This application shows the integration of MultiKron PCI bus statistics with time- and counter-based metrics triggered by software probes. The second application is an implementation of the Poisson problem (Jacobi integration on a 2-D grid). This application demonstrates how the MultiKron can be used to show that the PCI bus is not a bottleneck.

LU is the dense LU decomposition of an out-of-core matrix[10]. This application is interesting because it is communication, computation, and I/O intensive. In this implementation, the out-of-core data staging is done using synchronous read and write operations. In our measurement studies, we used an 8192×8192 double precision matrix (total size 536 MB) staged using slabs of 64 columns as a basic unit. We ran the application on a cluster of four DEC Alpha processors connected via a 155 Mbps ATM network. A time histogram showing the performance of the first several minutes of the application's execution is shown in Figure 11. In this graph, we have requested Paradyn to display three metrics: message bytes transferred, PCI bus utilization, and execution (wall) time spent in the I/O write routine. The bus utilization and execution time curves were generated using data gathered via the MultiKron monitor, and the message bytes metric was computed using Paradyn's software counters. The curves show that the PCI bus utilization is highest during the I/O operations, approximately 30%. We also see that the bus utilization continues past the I/O interval due to the file system's write-behind policy. Once the I/O phase completes, there is a period when none of the metrics shows much activity. This middle interval is computation that is not captured by the three metrics shown in Figure 11. After this middle phase ends, the application enters a communication phase where data transfers to and from the selected process peak at 1.5 MB/sec. During this final phase, the PCI bus utilization is perceptible (slight peaks in the time histogram), but not a significant bottleneck.

The second application is a solution to the Poisson problem using Jacobi integration on a 2-D decomposition. Figure 10 shows the performance statistics for this application when it is executed on 2, 4, and 8 nodes of a Myrinet-connected cluster of Pentium II machines. For all versions of the application, the program spends a majority of its time performing communication operations. However, despite the majority of time being spent in communication, the fifth column of Figure 10 shows that the PCI bus utilization never exceeds 5%. This shows that the software overhead of the message-passing library, not the PCI bus, dominates the communication cost. Further investigation into the application confirmed that the reason it spends so much time in communication routines is the relatively small message size (the average message size is only 15-30 bytes, depending on the number of nodes used).

  Nodes    Time    Comm    % Comm    PCI
    2      14.07    7.79    55.4%    3.9%
    4      17.34   10.33    59.6%    4.6%
    8      18.99   10.85    57.2%    3.7%

Figure 10: Data for the Decomp Application. (Time is total execution time, Comm is the time spent in communication, % Comm is Comm as a fraction of Time, and PCI is the PCI bus utilization.)
5. Related Work

Several tools exist to permit runtime verification of real-time requirements. Mok[24] developed a language for expressing real-time timing requirements that can be verified at runtime. Chodrow et al.[6] have developed a system that permits checking real-time properties during system execution. Their system could benefit from the MultiKron monitor, since MultiKron provides a mechanism for high-resolution, time-stamped trace records.

Various types of hardware performance measurement have been incorporated into high-performance microprocessors and systems. The IBM RP3[5] included performance-monitoring hardware throughout the system. The Sequent Symmetry also included hardware to measure bus utilization, bus read and write counts, and cache miss rates. Using this information, tools[12] were able to give the programmer insights that substantially simplified program tuning. Unfortunately, Sequent never widely disclosed the existence of these features due to fears that information obtained using the hardware instrumentation would be used by their competitors. Another early example of hardware performance data comes from the Cray Y-MP[2]. The Y-MP provided a wide variety of hardware-level performance data, including information about the frequency of specific instructions.

Most current microprocessors also include hardware counters for measuring processor-visible events. For example, the Intel Pentium family of processors[17] provides access, via model specific registers (MSRs), to data including cache misses, pipeline stalls, instructions executed, misaligned memory accesses, interrupts, and branches. Similar data is also provided by the UltraSPARC, DEC Alpha[1], IBM POWER2[27], and MIPS R10000[28] processors. However, a limitation of this type of instrumentation is that it is only available for processor-visible events, and therefore cannot track other aspects of the system such as the bus and the I/O system.

Several versions of hardware support for tracing have been developed. The TMP monitor[9] included a general-purpose microprocessor dedicated to instrumentation and a separate I/O network to gather performance data at a central location. Tsai et al.[26] have developed a non-intrusive hardware monitor for real-time systems. Their technique requires a complex processor to track the activity of the primary processor; however, they were able to develop a completely non-intrusive monitor as a result. Malony and Reed[16] developed HYPERMON to measure the performance of an Intel iPSC/2 hypercube. HYPERMON permitted gathering data from each node at a custom-designed monitoring node. Like MultiKron, HYPERMON used software-based instrumentation to trigger the transfer of events to the hardware tracing facility. One advantage of the MultiKron toolkit and its on-board memory is that a separate data collection network is not required. Instead, data can be accumulated on each node and transferred when the application completes execution. A key enabling trend that makes plug-in instrumentation hardware possible has been the migration of high-performance computing platforms to Commodity Off The Shelf (COTS) nodes with standardized I/O buses.
Figure 11: Paradyn Time Histogram Showing Application and PCI Metrics. This graph shows the Paradyn time-histogram display for three metrics: execution time for the put_seq procedure, PCI bus utilization, and message bytes transferred. The graph has three distinct phases. In the first phase, the put_seq procedure is active and the PCI bus utilization is about 30%. In the middle phase, none of the three metrics shows a significant amount of activity (i.e., the program is performing operations not related to the three metrics). In the final phase, the program is sending messages, but this activity induces only a very light load on the bus.
6. Conclusions

Low-perturbation, high-precision timers are beneficial in collecting performance data, as demonstrated in our experiments. Also, having performance counters throughout the computer system, rather than just inside the microprocessor, is useful for tuning applications. Application developers are finding I/O bus statistics vital for tuning their applications[3]. As real-time parallel and distributed systems become more complex and pervasive, the utility of these counters will increase.

While many microprocessor vendors are incorporating hardware counters in their designs, the interface to these counters is rarely what is necessary for application performance measurement studies. For example, most vendors make their performance counters readable using only privileged instructions. This requires application programs to trap into the operating system to read the counters. The overhead of a trap is tolerable for coarse-grained measurements (i.e., at the beginning and end of a program). However, for fine-grained measurements at the procedure and loop levels, this trap overhead is unacceptable, since the time required for even the fastest trap represents a significant fraction of the time of the event to be measured. Operating system vendors should also provide either a virtualized view of the performance counters, so that performance data can be tracked on a per-process or per-thread basis, or easy hooks to allow third-party software to have a callback during the context-switch code.

MultiKron-type instrumentation includes a larger set of counters, sixteen, compared to microprocessor on-chip counters, which generally have two timers. Current on-chip counters do have the advantage of access to internal chip information that the MultiKron cannot reach, but 32 or more signals must be multiplexed onto the limited number of counters. MultiKron also provides event tracing in addition to counting. Thus, the MultiKron provides an efficient, low-perturbation means to collect the information as well as to generate it. This is a significant benefit to real-time systems, which generally have tight time budgets and are sensitive to the insertion of delays. Event tracing and data collection incur larger implementation costs than do simple counters, and therefore face higher manufacturer resistance.

A performance monitoring tool like the Paradyn and MultiKron combination presented in this paper can be very useful in providing runtime feedback about the scheduling and timing behavior of a set of real-time tasks under different system conditions. Metrics like resource utilization and blocking time can be collected efficiently and displayed on a timely basis without affecting the underlying behavior of the task set.
Acknowledgments We thank Li Zhang for his efforts to port Paradyn’s Dynamic Instrumentation system to the DEC Alpha processor. The BIP team provided several explanations and bug-fixes that were critical to using the Myrinet. Mustafa Uysal supplied the implementation of LU used in the case
study and helped us to use it. The decomposition application was taken from examples supplied in [8].
References
1. DECchip 21064 and DECchip 21064A Alpha AXP Microprocessors - Hardware Reference Manual, EC-Q9ZUA-TE, DEC, June 1994.
2. UNICOS File Formats and Special Files Reference Manual SR-2014 5.0, Cray Research Inc.
3. A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, D. E. Culler, J. M. Hellerstein, and D. A. Patterson, "Searching for the Sorting Record: Experiences in Tuning NOW-Sort," SIGMETRICS Symposium on Parallel and Distributed Tools. Aug. 1998, Welches, OR, pp. 124-133.
4. N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic, and W.-K. Su, "MYRINET: A Gigabit Per Second Local Area Network," IEEE Micro, 15(1), 1995, pp. 29-36.
5. W. C. Brantley, K. P. McAuliffe, and T. A. Ngo, RP3 Performance Monitoring Hardware, in Instrumentation for Future Parallel Computer Systems, M. Simmons, R. Koskela, and I. Bucher, Editors. 1989, Addison-Wesley. pp. 35-47.
6. S. E. Chodrow, F. Jahanian, and M. Donner, "Run-Time Monitoring of Real-Time Systems," Real-Time Systems Symposium. Dec. 1991, pp. 74-83.
7. W. Gropp, E. Lusk, N. Doss, and A. Skjellum, "A high-performance, portable implementation of the MPI message passing interface standard," Parallel Computing, 22(6), 1996, pp. 789-828.
8. W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message Passing Interface. 1995: MIT Press.
9. D. Haban and D. Wybranietz, "A Hybrid Monitor for Behavior and Performance Analysis of Distributed Systems," IEEE Transactions on Software Engineering, 16(2), 1990, pp. 197-211.
10. B. Hendrickson and D. Womble, "The Torus-wrap Mapping for Dense Matrix Calculations on Massively Parallel Computers," SIAM Journal on Scientific Computing, 15(5), 1994, pp. 1201.
11. J. K. Hollingsworth and B. Buck, DyninstAPI Programmer's Guide, CS-TR-3821, University of Maryland, August 1997.
12. J. K. Hollingsworth, R. B. Irvin, and B. P. Miller, "The Integration of Application and System Based Metrics in a Parallel Program Performance Tool," 1991 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. April 21-24, 1991, Williamsburg, VA, pp. 189-200.
13. J. K. Hollingsworth and B. P. Miller, "Using Cost to Control Instrumentation Overhead," Theoretical Computer Science, 196(1-2), 1998, pp. 241-258.
14. J. K. Hollingsworth, B. P. Miller, M. J. R. Goncalves, O. Naim, Z. Xu, and L. Zheng, "MDL: A Language and Compiler for Dynamic Program Instrumentation," International Conference on Parallel Architectures and Compilation Techniques (PACT). Nov. 1997, San Francisco, pp. 201-212.
15. A. Kanevsky, A. Skjellum, and J. Watts, "Standardization of a Communication Middleware for High-Performance Real-Time Systems," RTSS. December 2-5, 1997, San Francisco, CA.
16. A. D. Malony and D. A. Reed, "A Hardware-Based Performance Monitor for the Intel iPSC/2 Hypercube," 1990 International Conference on Supercomputing. June 11-15, 1990, Amsterdam, pp. 213-226.
17. T. Mathisen, "Pentium Secrets," Byte, 19(7), 1994, pp. 191-192.
18. B. P. Miller, M. D. Callaghan, J. M. Cargille, J. K. Hollingsworth, R. B. Irvin, K. L. Karavanic, K. Kunchithapadam, and T. Newhall, "The Paradyn Parallel Performance Measurement Tools," IEEE Computer, 28(11), 1995, pp. 37-46.
19. A. Mink, Operating Principles of MultiKron II Performance Instrumentation for MIMD Computers, NISTIR 5571, National Institute of Standards and Technology, Dec. 1994.
20. A. Mink, Operating Principles of MultiKron Virtual Counter Performance Instrumentation for MIMD Computers, NISTIR 5743, National Institute of Standards and Technology, Nov. 1995.
21. A. Mink, Operating Principles of the SBus MultiKron Interface Board, NISTIR 5652, National Institute of Standards and Technology, May 1995.
22. A. Mink, R. Carpenter, G. Nacht, and J. Roberts, "Multiprocessor Performance Measurement Instrumentation," IEEE Computer, 23(9), 1990, pp. 63-75.
23. A. Mink and W. Salamon, Operating Principles of the PCI Bus MultiKron Interface Board, NISTIR 5993, National Institute of Standards and Technology, Mar. 1997.
24. A. K. Mok and G. Liu, "Early Detection of Timing Constraint Violations at Runtime," Real-Time Technology and Applications Symposium. 1997, pp. 176-185.
25. L. Prylli and B. Tourancheau, "BIP: a new protocol designed for high performance networking on Myrinet," IPPS Workshop on PC-NOW. 1998, Orlando, FL.
26. J. J. P. Tsai, K. Y. Fang, and H. Y. Chen, "A Noninvasive Architecture to Monitor Real-time Distributed Systems," IEEE Computer, 23(3), 1990, pp. 11-23.
27. E. H. Welbon, C. C. Chen-Nui, D. J. Shippy, and D. A. Hicks, The POWER2 Performance Monitor, in PowerPC and POWER2: Technical Aspects of the New RISC System/6000. 1994, IBM Corporation. pp. 45-63.
28. M. Zagha, B. Larson, S. Turner, and M. Itzkowitz, "Performance Analysis Using the MIPS R10000 Performance Counters," Supercomputing. Nov. 1996, Pittsburgh, PA.