Performance Measurements of Parallel and Distributed Systems Using On-Chip Counters

Abdul Waheed and Jerry Yan
MRJ Technology Solutions
MS T27A-2, NASA Ames Research Center, Moffett Field, CA 94035-1000
E-mail: {waheed,yan}@nas.nasa.gov

Abstract

The present generation of high performance computing (HPC) systems is often built around commodity processors due to their high performance at a low cost. These processors provide a number of new architectural features, including support for performance measurement through on-chip counters. On-chip performance counters can provide reliable measurements of processor-level activities that are too complex to capture accurately through other means of instrumentation. We briefly survey the architectures, programming interfaces, and measurement tools for three processors: MIPS R10000, IBM POWER2, and Intel Pentium Pro. These processors are found in several parallel and distributed HPC systems. We then use on-chip counters on two such HPC systems, the SGI Origin2000 and the IBM SP-2, as case studies to demonstrate the utility of on-chip counters on these systems.

Key Words: performance measurement, on-chip performance counters, instrumentation, performance tuning, workload characterization, parallel and distributed systems, high performance computing.

1. Introduction

High performance computing (HPC) system architectures are increasingly being developed around commodity parts, including processors [6]. The present generation of HPC systems uses state-of-the-art processors due to their potential for high performance at a low cost. These processors provide a number of new architectural features, including support for performance measurement through on-chip counters. Examples of processors that provide on-chip counters include the Intel Pentium, IBM POWER2, MIPS R10000, DEC Alpha, and Sun Ultrasparc. These processors are being used in a broad range of high performance parallel and distributed systems. Some examples of such HPC systems include: the Intel/Sandia ASCI teraflop machine based on Intel Pentium Pro processors; the IBM SP-2 distributed memory parallel system based on POWER2 processors in RS/6000 nodes; the SGI Origin2000 scalable distributed shared memory system based on two R10000 processors per node; and the Sun Enterprise and HPC2 servers based on Ultrasparc processors.

Continuous improvements in processor architectures have sustained an exponential increase in processor performance. While current processor architectures deliver high performance, they also increase the complexity of the system. Architectural features such as superscalar designs, speculative execution, pipelining, and support for multiple levels of cache and memory hierarchy not only enhance performance but also make it difficult to analyze or tune software performance on systems built around these processors. On-chip performance counters can provide reliable measurements of processor-level activities that are too complex to capture accurately through application-level instrumentation [1,18]. These activities include statistics about cache utilization, floating-point operations, branch predictions, memory traffic, and so on. We use measurements from on-chip counters on two HPC systems (SGI Origin2000 and IBM SP-2) for characterizing scientific workloads and for performance tuning, with an emphasis on the utilization of the available memory hierarchy. In addition, we survey the use of embedded counters on Pentium Pro based PCs and clusters of PCs. Although the performance counters on our selected processors may provide similar information across these platforms, their application programming interfaces (APIs) are completely different and in some cases even unsupported by the manufacturers.

In Section 2, we put measurements collected from on-chip counters in perspective with respect to other instrumentation techniques. Section 3 surveys the architectures, APIs, and measurement tools for data collected from on-chip counters of three processors. Section 4 presents two case studies of applying measurements obtained from on-chip counters to performance tuning and workload characterization. We conclude with a discussion of the advantages and limitations of using these counters in different performance measurement scenarios.
2. On-Chip Instrumentation

There are several techniques for measuring computer system performance. These techniques differ in terms of their accuracy, intrusion on the system under test, level of detail, and ease of use. Traditionally, application- or OS-level software instrumentation is easy to customize for particular measurement-based experiments that provide specific information. The main drawbacks of this technique are its intrusion on the system under test and its lack of hardware-level detail [10]. Simulation-based techniques are often used when detailed measurements are needed under carefully controlled operating conditions [4]. However, these studies are limited by the accuracy of the simulator and by the amount of time and computing resources required to execute a high-fidelity simulation. Hardware instrumentation is another technique to collect low-level measurements with minimal intrusion [12]. Hardware monitors are plugged into the backplane of a processor to gather relevant information from the bus activity. Contemporary computer system design trends dictate the use of commodity parts to develop the architecture, and non-standard parts such as hardware monitors add cost to the overall system. In addition, the information thus collected may still miss important processor-level activities, such as cache utilization, branch prediction, pipeline performance, and so on.

On-chip counters combine the level of detail and accuracy of hardware monitors with the ease of use of software instrumentation. Special-purpose counters embedded in processors can be programmed to collect information about processor-level activities. Developers provide OS and system software level interfaces that allow users to access and control these counters to obtain measurements relevant to their application programs [18]. Given the complexity of state-of-the-art processors, this information is increasingly considered essential for tuning performance, a task that can no longer be accomplished appropriately by software instrumentation alone. Advantages of on-chip counters include:
1. they enable experiments that require measurements of interactions among various hardware and software components;
2. they allow the measurement of highly accurate counts and fine-grained intervals of time; and
3. they permit convenient access to the counters from user-level software through appropriate APIs.

Implementing performance instrumentation facilities within a processor means allocating chip real estate for additional registers and counters. This design decision is a trade-off between supporting on-chip data collection facilities and implementing other functions that may be more central to processor performance or other design goals. The implementation of on-chip instrumentation capabilities in a number of processors therefore underscores the importance of processor-level performance data for tuning software performance. Since processor architectures are becoming highly complex and integrated, the information about internal activities available outside a processor is too scarce to be of any value for monitoring or recording. Performance data can be presented to external hardware through additional I/O pins for monitoring; however, this increases the I/O pin requirements of a processor, further complicating its design. Providing measurements through software-accessible registers reduces these pin I/O requirements.

On-chip performance measurement capabilities are useful for both single-processor and multiprocessor measurements. Typically, an on-chip counter is not aware of the multiple processors in a parallel or distributed system. In order to perform measurements relevant to a parallel or distributed system, higher-level measurement tools are responsible for constructing a global view from the counts collected at individual processors.

3. Performance Measurements with On-Chip Counters
In this section, we consider the on-chip counter architectures, APIs, and measurement tools for three processors: MIPS R10000, IBM POWER2, and Intel Pentium Pro. We consider the R10000 and POWER2 processor counters, APIs, and tools in the context of the SGI Origin2000 and IBM SP-2 systems, respectively. In the case of the Pentium Pro, our focus is on standalone PCs as well as clusters of networked PCs.
3.1 MIPS R10000

The MIPS R10000 is a superscalar RISC processor. It has a 64-bit architecture with dynamic instruction scheduling and out-of-order execution capabilities, and 32KB on-chip, non-blocking instruction and data caches [8]. Due to its superscalar architecture and out-of-order execution, R10000 performance is greatly improved over its predecessors. However, these features also make it difficult to predict the time required to execute any section of a program, since that time often depends on the instruction mix and the critical dependencies between instructions. Measurements of processor-level activities are therefore greatly helpful in understanding performance.

The R10000 supports two on-chip counters to measure processor-level activities, called events. Each hardware counter can track one out of a choice of sixteen events at a time; thus, the two hardware counters can track two of thirty-two possible events. The events to be counted are specified by two control registers. Each counter is a 32-bit read/write register, which is incremented by one whenever an event specified by its corresponding control register occurs. A counter may optionally assert an overflow interrupt when its most significant bit becomes set. The counters can be multiplexed by system software to count up to 32 processor-level events. Therefore, the programming interface can support a view of 32 counters by counting multiple (up to 16) events per hardware counter. Clearly, counting more than two events reduces the accuracy of the measurements, but the error is negligible for fairly long-running programs. Table 1 lists all hardware events that can be measured using the R10000 performance counters. The kernel maintains 64-bit virtual counters for the user program on top of the 32-bit hardware counters; this 64-bit view is maintained through the programming interface.

Table 1. List of possible events that can be measured by R10000 counters.

Event Number    Measurement (count)
 0              Clock cycles
 1              Issued instructions
 2              Issued loads
 3              Issued stores
 4              Issued store conditionals
 5              Failed store conditionals
 6              Conditional branches resolved
 7              Quadwords written back from secondary cache
 8              Correctable ECC errors on secondary cache data
 9              Primary instruction cache misses
10              Secondary instruction cache misses
11              Secondary instruction cache way mispredictions
12              External intervention requests
13              External invalidate requests
14              Functional unit completion cycles
15              Instructions graduated
16              Clock cycles
17              Instructions graduated
18              Graduated loads
19              Graduated stores
20              Graduated store conditionals
21              Graduated FP instructions
22              Quadwords written back from primary data cache
23              TLB misses resulting in refills
24              Mispredicted branches
25              Primary data cache misses
26              Secondary data cache misses
27              Secondary data cache way mispredictions
28              External intervention request hits in secondary cache
29              External invalidate request hits in secondary cache
30              Stores to clean exclusive secondary cache blocks
31              Stores to shared secondary cache blocks

3.1.1 Application Programming Interface

There are three techniques to access and use the R10000 performance counters: 1. an operating system level interface using /proc and ioctl calls; 2. a procedural interface using libperfex; and 3. a command-line interface using the perfex tool.

The performance counters are accessible to a user program through the /proc interface for controlling, reading, and writing to a device. This interface takes a process ID as one of its arguments; therefore, the hardware counters can be programmed to count events related to a specific process or group of processes. The interface also allows the user to specify an overflow threshold on a per-event basis, and a signal can be sent to the program upon overflow. The counters can be used in one of two modes: user or system mode. In user mode, the counters can be shared among multiple processes, as the kernel saves and restores their counts across context switches. A user with root privileges can access the counters in system mode; other user programs cannot access the counters while they are being used in system mode. The user invokes the ioctl system call to initialize, write, and read the counters in a program. Therefore, additional instructions are added to the actual program to use the /proc interface.
A procedural interface to the R10000 counters is available through libperfex. These library routines for initializing, starting, and reading the hardware counters do not require direct invocation of OS-level ioctl calls on the /proc interface, so this interface is more convenient to use in a program. However, its functionality is limited compared to the /proc interface. The command-line interface to the R10000 counters is used for measurements over an entire program rather than over specific sections of code as with the /proc and libperfex based APIs. Using this interface, two or all 32 events can be counted during the execution of a program. Perfex can count events over all processes that are descendants of the target executable.
3.1.2 Measurement Tools

SGI provides the SpeedShop toolset for profiling programs on MIPS R10000 based systems. SpeedShop makes extensive use of the on-chip performance counters. Using SpeedShop, a program can be profiled based on any one of the possible hardware events. SpeedShop can report source code line numbers and functions to let the user analyze where the execution time is spent. SpeedShop runs do not require relinking an executable and are invoked through a command-line interface. The tool can be used for single-processor as well as multiprocessor executions; in the multiprocessor case, profiling information is collected on a per-process basis.

The Perfex tool, with its command-line interface to the counters, can also be used to count specific events. Like SpeedShop, it does not require relinking an executable. Using various command-line options, Perfex can report counts of two or all 32 events, their estimated costs in terms of execution time, and a number of other summary statistics related to cache utilization, cache miss rates, the ratio of memory accesses to floating-point operations, memory bandwidth, etc. Perfex can also be used for multiprocessor executions.

In addition to the SpeedShop and Perfex tools, it is not very difficult to develop custom tools using the API to the performance counters. We developed customized utilities to generate traces corresponding to hardware events occurring in specific sections of code using the /proc API. Other customized measurement tools can similarly be developed for single-processor as well as multiprocessor evaluations.

3.2 IBM POWER2

The IBM POWER2 is a superscalar RISC processor used in the RS/6000 family of desktop and rack-mounted systems. While its clock speed is only about 66 MHz, its main source of performance, which is comparable to other desktop workstation processors, is its aggressive instruction-level parallelism. The POWER2 processor consists of eight semi-custom chips partitioned as: an Instruction Cache Unit (ICU), a Fixed Point Unit (FXU), a Floating Point Unit (FPU), four Data Cache Units (DCU), and a Storage Control Unit (SCU). The multi-chip module (MCM) packaging of the POWER2 processor is an example of a highly complex and integrated architecture [17].

Figure 1 provides an overview of the architecture of the on-chip counters on a POWER2 processor [16]. Most of the chip inputs are interconnected on the MCM substrate and are not externally accessible. The memory and I/O buses are the only external interfaces to the MCM. Consequently, there are few external POWER2 CPU signals suitable for deriving performance data. This pin I/O requirement was reduced by implementing software-accessible registers in the processor architecture. To provide on-chip measurement capabilities, the designers allocated the ICU, FPU, SCU, and FXU five counters each. Similarly, the designers provided each of these four basic units with a 4-bit control field in the Monitor Mode Control Register (MMCR) that selects the set of events to be counted. Thus, for each unit, it is possible to choose any one of sixteen groups of five events for monitoring while requiring only nine pins per chip, as shown in Figure 1. As illustrated by Figure 1, the monitor contains twenty-two 32-bit counters for CPU and storage-related performance events. The MMCR provides monitor control functions. The MMCR, along with two status bits in the Machine State Register (MSR), also allows selective measurement of specific threads of execution. The counters and the MMCR are addressable for read and write operations using Programmed I/O (PIO).
Figure 1. Architecture of POWER2 performance counters [16].

The POWER architecture defines the system-wide MSR as part of the process state. Two MSR bits, the Process Mark (PM) bit and the Problem bit, along with the MMCR, control the state of the monitor. Since the MSR is part of the process state, the operating system saves and restores it when processes pause and resume execution. Therefore, with low overhead, the MSR PM bit can selectively qualify processes for monitoring. Thus, the MMCR and MSR together can efficiently control the state of the monitor.
3.2.1 Application Programming Interface

Although POWER2 supports an elaborate hardware counter architecture, RS/6000 systems (e.g., the SP-2) do not provide an API to access the counters from user programs. Maki has used these counters by extending the AIX kernel to access them for system monitoring [11]. These kernel extensions allow the performance counters to monitor all processes running on a POWER2 processor. A daemon collects the measurements, and library functions are used to obtain this information from the daemon. Therefore, this API is not useful for process-level monitoring in a multi-user environment. We are not aware of any other effort to access the POWER2 performance counters for performance measurements.

3.2.2 Measurement Tools

A public domain tool, called rs2hpm, was developed by Maki based on his extensions to the AIX kernel. A daemon collects measurements in sampling mode about floating-point operations and cache misses from multiple groups of counters. These measurements are found to be extremely accurate for programs that run for more than a few seconds. Rs2hpm has been extended by the NAS division of NASA Ames Research Center to monitor program executions on an IBM SP-2 multicomputer system [14]. This extended tool is called the Parallel Hardware Performance Monitor (PHPM). It allows monitoring of multiple nodes of the system and interfaces with the Portable Batch System (PBS) to monitor batch jobs. The tool can provide measurements for individual nodes as well as summarized measurements for all the nodes allocated to a job.

3.3 Intel Pentium Pro

The Pentium Pro has a pipelined, superscalar microarchitecture. It enhances the superscalar capability of the Pentium processor, which can execute two instructions simultaneously, by using an "instruction pool" instead of traditional linear instruction sequencing between the fetch and execute phases. Instructions are executed from the pool out of order and speculatively, and the original instruction sequencing is maintained by an instruction retirement scheme.

The Pentium Pro processor has two performance counter registers to aid in performance measurement and tuning. These counters can be programmed to keep track of any two of several dozen events, including instructions executed, floating-point instructions, data cache misses, hardware interrupts, pipeline stalls, branches, branch mispredictions, and so forth. Additionally, a nanosecond-accurate timer called the Time Stamp Counter (TSC), a 64-bit register that increments once per clock tick, can be used to add time stamps to measurements. Table 2 lists the four performance measurement registers, which are also called Model Specific Registers (MSRs). Counter 0 and Counter 1 each have an associated output pin (PM0 and PM1), which can be programmed to signal each time the counter increments or each time it overflows.

Table 2. Model Specific Registers of Intel Pentium Pro.

Register Name                 Description
Time stamp counter            64-bit free-running counter, incremented every clock tick
Control and event selection   Selects the event that each event counter will monitor
Counter 0                     40-bit event counter that counts occurrences or duration of a selected event
Counter 1                     40-bit event counter that counts occurrences or duration of a selected event

3.3.1 Application Programming Interface

There are three methods of using the Pentium Pro event counters for performance measurements: 1. monitoring selected sections of code by inserting additional instructions around those sections; 2. monitoring events generated by all the processes running on a system; and 3. hardware-assisted event-based sampling to profile the code.

The first method allows an instruction-level granularity of monitoring. Direct access to the counters requires kernel-mode (also known as ring 0) execution for writing to and reading from the MSRs. The second method requires only user-level (ring 3) access to the counters, while low-level counter-related tasks are handled by a kernel-mode device driver. This method is useful when the application under test is the main contributor to the events being counted. The third method avoids the insertion of extra instructions for measurement. The accuracy of the sampling technique can be improved by measuring over a long execution time. Statistics are recorded whenever a non-maskable interrupt occurs. This method can be intrusive if the samples are recorded too often; intrusion can be reduced by recording only on every N-th event and setting N to a high value.
3.3.2 Measurement Tools

A number of Windows 9x and NT based tools are built on the three APIs for accessing the counters. Some of these tools are commercially available from Intel and independent software vendors for single-processor measurements [15]. PMX (Privileged Mode eXecution), PMON (Performance MONitor), and MTPM (MMX Technology Performance Monitor) are examples of Windows utilities that allow a user program to access the counters for monitoring selected sections of code. PMX allows the user to execute several ring 0 commands, including reading or writing the Control Registers and MSRs. PMON and MTPM are GUI-based alternatives to the functionality provided by PMX.

EMON (Event MONitor), P5MON, and PTACH are some of the tools that can be used for application-level monitoring. EMON can run, monitor, and stop any program; in addition, it logs counter values at selectable time intervals. P5MON is a public domain utility with a real-time-updated GUI display. PTACH is a commercial utility with a real-time-updated oscilloscope display.

VTUNE and EDBEMON (Event Domain Based EMON) are examples of utilities that use the event-based sampling method of measurement. VTUNE allows a user to select any number of different performance-related events and runs these event selections one at a time in separate sessions. VTUNE has two features in addition to event-based sampling: dynamic analysis and time-based sampling. The dynamic analyzer simulates the execution of a part of the code rather than using the actual registers for measurements. Time-based sampling is accomplished by generating interrupts at regular time intervals and saving the system state instead of counting events. EDBEMON is a stand-alone utility for event-based sampling, similar to VTUNE. In summary, there is a large number of measurement tools for Windows 9x and NT based systems that can utilize the Pentium Pro performance counters.

There are only a few independently developed and freely available tools based on the Pentium performance counters for Linux platforms. In order to access the counters on a Linux PC, a device driver is needed; such drivers can be found in the public domain as patches to the Linux operating system. Perfmon and mperfmon are two performance profiling tools developed at Los Alamos National Laboratory based on such a patch [5]. Perfmon allows the user to specify two profiling events to be monitored. Mperfmon multiplexes over all available profiling events to produce an estimate of resource utilization for the specified user program.

4. Using On-Chip Counters: Two Case Studies
In this section, we present two case studies of using on-chip performance counters for two measurement-based tasks: (1) tuning source code to optimize cache utilization; and (2) data collection for workload characterization. These measurements were collected on an Origin2000 and an SP-2 system.
4.1 Cache Performance Tuning

Processor performance has been increasing at a rate significantly higher than the rate of improvement in memory access latencies. Due to this growing disparity between processor and memory speeds, the high performance potential of cache-based processors is limited by the number of off-chip memory accesses. Ignoring instruction fetch time, if all of the operands were available in CPU registers (or on-chip caches), a processor could operate at (or close to) its peak performance rate. In practice, non-local memory accesses are difficult to eliminate without extensively tuning the code for the available memory hierarchy. Nevertheless, this task is critical for a distributed shared memory multiprocessor with a cache-coherent Non-Uniform Memory Access (cc-NUMA) architecture, such as the Origin2000 [8]. We illustrate the importance of this task with the following case study, adapted from a real application.

Figure 2 provides a section of code, simplified to hide unnecessary detail, taken from ARC3D, a Computational Fluid Dynamics (CFD) application. ARC3D was not originally written for a cc-NUMA multiprocessor; the code simply reflects the numerical algorithm for a CFD simulation. We execute this code on a single processor of an Origin2000. Using the Perfex tool with its command-line interface, we measured the overall execution time, the primary and secondary data cache miss overheads, the TLB miss overhead, and the time for executing floating-point operations (see the part of Figure 4 labeled "unoptimized").

      parameter (maxj=64,maxk=64,maxl=64)
      double precision X(maxj,maxk,maxl)
      double precision XX(maxj,maxk,maxl)
      integer l,k,j
      do l=1,maxl
        do k=1,maxk
          do j=1,maxj
            X(j,k,l)=X(j,k,l)*XX(j,k,l)
          enddo
        enddo
      enddo

Figure 2. An example code adapted from the ARC3D application.

It is a well-known fact that power-of-two array dimensions that are a multiple of the cache line size result in excessive cache misses. Due to
such array dimensions, the likelihood of successive memory accesses competing for the same cache line becomes very high. This situation can be remedied by "padding" the arrays, so that successive array accesses are less likely to compete for the same cache line; an access to such an array is then less likely to result in a cache miss. Figure 3 shows the modified code with the array dimensions padded by one.

      parameter (maxj=65,maxk=65,maxl=65)
      double precision X(maxj,maxk,maxl)
      double precision XX(maxj,maxk,maxl)
      integer l,k,j
      do l=1,maxl
        do k=1,maxk
          do j=1,maxj
            X(j,k,l)=X(j,k,l)*XX(j,k,l)
          enddo
        enddo
      enddo

Figure 3. Modified code with array padding to improve cache utilization.
unoptimized version (first bars from left). One should also notice that the sum of L1, L2, and TLB misses and time for executing floating point instructions is slightly greater than the overall execution time in the optimized case. This inaccuracy is due to multiplexing of two hardware counters to measure 32 events for an R10000 processor. Additionally, due to latency-hiding through cache coherence, memory accesses are overlapped by other processor activities. Measurement accuracy is further affected due to this overlap. Measurement inaccuracy does not appear for unoptimized case due to longer execution time that tends to hide approximate count of event occurrences. Without processor-level information about cache misses, it is difficult to understand the impact of various changes to an application code. Array padding is one transformation to improve cache performance. There are several other possible transformations, such as prefetching to control data placement, loop nest transformations, loop unrolling, reorganization of temporary arrays, and array privatization. Measurements provided by on-chip counters are the only source of accurate and detailed memory performance analysis of such code changes.
Figure 3. Modified code with array padding to improve cache utilization. Figure 4 illustrates the difference between cache performance of two versions of the ARC3D application–one without array padding and other with array padding. Notice that the time taken by floating point instruction execution for both unoptimized and optimized versions is about 16 seconds (5-th bars from left for each version). Primary (L1) cache miss overhead reduces from 61 seconds to 14 seconds after optimization (see the second bars from left). Secondary cache miss overhead reduces from 98 to 8 seconds while TLB misses take about 6 seconds in both cases. Results clearly indicate that significantly large values of primary and secondary cache miss overhead contribute to larger overall execution time in the case of
200 180 160 140 120 100 80 60 40 20 0
Execution time L2 miss overhead FP inst. exec. time
Unoptim Unoptimized ized
4.2 Workload Characterization Workload characterization is often considered as an effort to generalize the results of various measurement-based studies of a class of applications. System architects, software developers, and performance analysts can use workload characterization for experimenting with new architectural features and system management protocols, and their impact on performance [3]. The goal of our workload characterization effort is to assist simulation-based evaluation of alternative resource management policies in a metacomputing system [7]. In order to collect accurate measurements of resource usage, such as CPU, caches, memory bandwidth and so on, performance counters are effective means to gather runtime information for workload characterization. We elaborate with a simple example. Consider the code fragment given in Figure 5. It represents the repetitive nature of a typical numerical algorithm. We execute this example code on one
L1 miss overhead TLB miss overhead
Double A(1024,1024) integer time, i,j real temp do time = 1,100 do I = 1,128 do j = 1,128 temp = A(i,j) enddo enddo enddo
Optimize Optimized d
Figure 4. Comparison of memory access overhead and its impact on overall execution time for unoptimized and optimized versions of ARC3D. All times are in seconds.
Figure 5. 7
Example code.
This simple example illustrates the importance of on-chip counters and appropriate API to characterize quantities that may have temporal dependencies. While in this example, primary data cache misses are uniform across successive iterations, there may be more complex dependencies for other types of workloads. Since accesses to caches represent a low-level of information, on-chip counters are valuable to accurately measure and characterize these patterns. On-chip performance counters can also be used for workload characterization across different platforms. We study the cost of memory accesses across two systems: Origin2000 and SP-2. Our focus is on offchip memory accesses. Since R10000 processor has two levels of caches, our objective is to determine the cost of primary cache misses. A hit in secondary cache may take up to 10 cycles (at 195 MHz). POWER2 processor has only one level of cache and takes at least 8 cycles (at 66 MHz) to access off-chip memory. We measured these costs for an MPI implementation of NAS Parallel Benchmarks using perfex on Origin2000 and PHPM on SP-2 [2]. These measurements are shown in Table 3 and differences in memory access costs can be characterized with respect to the differences in cache architectures of two processors. Until recently, execution-driven [9] and trace-driven simulations [4] have been the only means to analyze different cache architectures. Onchip performance counters may be used to validate the simulation results from an early design stage against measurements from embedded counters at a later stage.
MIPS R10000 processor of an Origin2000. Each Origin2000 node consists of two processors, each with two levels of cache: a 64KB primary cache and a 4MB secondary cache. In this example, we chose a matrix size that could not fit in the primary data cache. Additionally, the number of time steps was kept large so that the "cold" cache misses in the beginning do not significantly skew the overall results. We used the /proc- and ioctl-based API to access the R10000 performance counters and configured them to measure primary data cache misses. Figure 6 presents the histogram of primary data cache misses over time steps. This experiment suggests that recurrent and regular accesses to large arrays result in cache misses that are uniformly distributed over time steps and the address space, as expected.
Figure 6. Histogram of primary data cache misses with respect to time steps.
Table 3. Cache performance measurements across two architectures. Miss cost is given as a percentage of total execution time; total execution time in seconds is shown in parentheses.

Benchmark   L1 cache miss cost for R10000   Cache miss cost for POWER2
BT          22% (820)                       12% (594)
SP          33% (393)                       10% (473)
LU          36% (365)                        7% (479)
FT          57% (46)                         5% (67)
CG          69% (13)                         7% (11)
MG          38% (17)                         3% (16)
Figure 7 presents the distribution of primary data cache misses over the array index, calculated as I*128+J. The histogram shows that repeated accesses to each of the array elements resulted in a consistent number of primary data cache misses.
The two case studies presented in this section highlight the unique ability of on-chip counters to collect invaluable measurements of processor-level activities.
Figure 7. Histogram of primary data cache misses with respect to array index.

5. Discussion and Conclusions
In this paper, we presented a survey of three commodity processors with on-chip performance counters that are being used in state-of-the-art parallel and distributed systems. We considered application programming interfaces and tools for using on-chip counters for measurements, and we presented examples of using these counters for performance measurements on two concurrent systems. Our survey of three processors indicates that suitable APIs and measurement tools based on hardware counters are not readily available in all cases. While some processors, such as the MIPS R10000, come with mature APIs as well as measurement tools, others do not provide such support. In the case of Pentium processors, a number of measurement tools exist for a single processor, but these tools are not applicable to a parallel or distributed configuration of processors. Recently, the Parallel Tools Consortium has launched an effort to design a consistent API for accessing on-chip counters [13]. Such an API can facilitate the development of measurement tools based on on-chip counters that are also portable across heterogeneous platforms.

On-chip performance counters provide an accurate means of measuring processor-level activities. The importance of these measurements for performance evaluation of sequential as well as concurrent systems continues to grow due to the increasing complexity of processors. On-chip counters provide a convenient mechanism to measure and tune performance without ignoring the architectural complexities of systems built around commodity processors.
References

[1] G. Ammons, T. Ball, and J. Larus, "Exploiting Hardware Performance Counters with Flow and Context Sensitive Profiling," Proc. of ACM SIGPLAN Conference on Programming Language Design and Implementation, Las Vegas, Nevada, June 15-18, 1997.
[2] D. Bailey et al., "The NAS Parallel Benchmarks 2.0," Technical Report NAS-95-020, Dec. 1995.
[3] Robert T. Dimpsey and Ravishankar K. Iyer, "A Measurement-Based Model to Predict the Performance Impact of System Modifications: A Case Study," IEEE Transactions on Parallel and Distributed Systems, 6(1), January 1995, pp. 28-40.
[4] J. Gee, M. Hill, D. Pnevmatikatos, and A. Smith, "Cache Performance of SPEC92 Benchmark Suite," IEEE Micro, August 1993.
[5] M.P. Goda, "Model Specific Registers and Performance Monitoring," Proc. of Pentium Pro Cluster Workshop, Ames, Iowa, April 10-11, 1997. Available on-line from http://qso.lanl.gov/~mpg/perfmon.html.
[6] John L. Hennessy, "Perspectives on the Architecture of Scalable Multiprocessors: Recent Development and Prospects for the Future," State of the Field Talk, High Performance Networking and Computing Conference (SC '97), San Jose, California, Nov. 15-21, 1997.
[7] Ken Kennedy, Charles F. Bender, John W. D. Connolly, John L. Hennessy, Mary K. Vernon, and Larry Smarr, "A Nationwide Parallel Computing Environment," Communications of the ACM, 40(11), November 1997, pp. 63-72.
[8] James Laudon and Daniel Lenoski, "The SGI Origin: A ccNUMA Highly Scalable Server," Proc. of the 24th Annual International Symposium on Computer Architecture, Denver, Colorado, June 2-4, 1997, pp. 241-251.
[9] J. Larus, "The SPIM Simulator for the MIPS R2000/R3000," in Computer Organization and Design—The Hardware/Software Interface by David A. Patterson and John L. Hennessy, Morgan Kaufmann Publishers, 1994.
[10] A.D. Malony, D.A. Reed, and H.A.G. Wijshoff, "Performance Measurement Intrusion and Perturbation Analysis," IEEE Transactions on Parallel and Distributed Systems, 3(6), Nov. 1992, pp. 657-671.
[11] Jussi Maki, "A Free AIX Performance Monitor," Proc. of GUIDE/SHARE Europe Conference, Vienna, Austria, Oct. 10-13, 1994.
[12] Alan Mink, Robert Carpenter, George Nacht, and John Roberts, "Multiprocessor Performance Measurement Instrumentation," IEEE Computer, Sept. 1990, pp. 63-75.
[13] Philip J. Mucci, Christopher Kerr, and Shirley Browne, "PerfAPI—Performance Data Standard and API," PTools project description. Available on-line from http://www.cs.utk.edu/~mucci/pdsa/.
[14] NAS Parallel Systems Division, "Parallel Hardware Performance Monitor," NASA Ames Research Center. Available on-line from http://parallel.nas.nasa.gov/Parallel/SP2/HPM/index.html.
[15] "Survey of Pentium Processor Performance Monitoring Capabilities and Tools." Available on-line from http://developer.intel.com/drg/mmx/AppNotes/PERFMON.HTM.
[16] E. H. Welbon, C.C. Chan-Nui, D.J. Shippy, and D.A. Hicks, "POWER2 Performance Monitor." Available on-line from http://www.rs6000.ibm.com/resource/technology/monitor.html.
[17] Steven White and Sudhir Dhawan, "POWER2: Next Generation of the RISC System/6000 Family." Available on-line from http://service2.boulder.ibm.com/devcon/p2ngr/power2.2.html.
[18] M. Zagha, B. Larson, Steve Turner, and Marty Itzkowitz, "Performance Analysis Using the MIPS R10000 Performance Counters," Proceedings of Supercomputing '96, Pittsburgh, Pennsylvania, Nov. 1996.

Abdul Waheed is a research staff member with MRJ Technology Solutions at NASA Ames Research Center. In 1991, he worked as a Field Service Engineer in the Medical Engineering Division at Siemens in Lahore, Pakistan. He held a summer position in the Concurrent Computing Division at Hewlett-Packard Research Laboratories in Palo Alto, California, in 1994. His research interests include high-performance parallel and distributed systems, instrumentation systems, performance evaluation tools, computer system modeling, and distributed real-time and embedded systems. He received the B.Sc. degree with honors in Electrical Engineering from the University of Engineering and Technology, Lahore, Pakistan, in 1991. He received the M.S. degree in 1993 and the Ph.D. degree in 1997, both in Electrical Engineering from Michigan State University. Dr. Waheed is a member of the IEEE Computer Society and the ACM.

Jerry Yan has been working at NASA Ames Research Center (currently as a senior research scientist with MRJ Technology Solutions) since 1989. He received a B.Sc. from Imperial College, London, England, in 1982, an MSEE in 1984, and a Ph.D. in 1989 from Stanford University, California, all in Electrical Engineering. His research interests include high-performance computing, performance evaluation, and software tools. Dr. Yan was a founding member of the Steering Committee of the Parallel Tools Consortium. He is also a senior member of the IEEE and a member of the IEE and ACM.