Performance Evaluation of FFT Algorithms Using Performance Counters
Martin Auer, Franz Franchetti, Herbert Karner, Christoph W. Ueberhuber
Institute for Applied and Numerical Mathematics, Technical University of Vienna
Wiedner Hauptstrasse 8-10, A-1040 Wien, Austria
E-Mail:
[email protected],
[email protected] [email protected],
[email protected]
The work described in this paper was supported by the Special Research Program SFB F011 "AURORA" of the Austrian Science Fund FWF.
Contents

Introduction
1 Empirical Performance Quantification
  1.1 Performance Indices
  1.2 Workload Measurement
  1.3 Runtime Measurement
2 Performance Relevant Events
  2.1 Calculating or Simulating Events
    2.1.1 Complexity Analysis
    2.1.2 Simulation
    2.1.3 Source Code Analysis
    2.1.4 Profiling
  2.2 Hardware Performance Counters
  2.3 Performance Relevant Events
  2.4 Event Types and the Floating-Point Case
3 Performance Evaluation With FFTest
  3.1 Parts of FFTest
    3.1.1 Main Program Files
    3.1.2 Program Files for Windows NT
    3.1.3 Input and Output
    3.1.4 Compile Tools
    3.1.5 Batch Tools
    3.1.6 Other Tools
    3.1.7 Groups of Files
  3.2 Example
  3.3 Execution of FFTest
  3.4 Input File Format
  3.5 Counting Events
  3.6 Output File Format
4 Compiling FFTest
  4.1 DEC Unix on a DEC Alpha
  4.2 IBM AIX on a PowerPC System
  4.3 HP-UX on an HP 9000
  4.4 Adding New Algorithms to FFTest
  4.5 Additional Tools and Useful UNIX Commands
5 Event Counting on MIPS Processors
6 Event Counting on Pentium Processors
7 Event Counting on Alpha Processors
Conclusion
References
Introduction

Many scientific applications require the use of fast Fourier transforms (FFTs). Because of their importance, a great variety of FFT programs is available. Some FFT programs require extra work space, some operate in-place. Some access memory in constant strides, some access memory in varying power-of-two strides. Some initialize the required twiddle-factors before computing the transform by calling a separate initialization routine, some calculate them during the computation of the transform.

There are conflicting goals when deciding which FFT program to use. On a computer with limited memory, an in-place algorithm is the best choice. If the transform is to be performed more than once, an algorithm with a separate initialization phase is best. On a vector computer, FFT algorithms which carry out the reordering during the calculation perform best.

Due to the great importance of DFT based methods for the solution of various scientific problems, FFT routines are included in all of the major scientific software libraries. In addition, a great number of FFT routines are publicly available on the Internet.

The newly developed tool FFTest, described in this report, makes it possible to evaluate the performance of FFT programs not only by measuring their run-time. It also helps to establish detailed performance characteristics by counting relevant events, the most important being floating-point operations and cache misses. Other features of FFTest include the ability to measure the accuracy of FFT programs, as well as their memory usage. Therefore an FFT algorithm's resource usage can be determined completely and the costs of the computation can be established.
Chapter 1
Empirical Performance Quantification

The assessment and choice of computer hardware and/or scientific software requires the use of numerical values, which may be determined analytically or empirically, for the quantitative description of performance.
1.1 Performance Indices

The user of a computer system who has to wait for the solution of a particular problem is mainly interested in the time it takes for the problem to be solved. This time depends on two parameters, workload and performance:

$$\text{time} = \frac{\text{workload}}{\text{performance}_{\text{effective}}} = \frac{\text{workload}}{\text{performance}_{\text{maximum}} \cdot \text{efficiency}}.$$

The computation time is therefore influenced by the following quantities:
1. Workload is a measure for the amount of work performed in an application. For scientific computing applications where numerical calculations dominate, a natural metric is the number of floating point operations that need to be executed. The arithmetic complexity is therefore a measure for the execution workload of an algorithm. The metric of workload has a unit of millions of flops (Mflop).

2. Peak performance is a theoretical measure: the manufacturer guarantees that programs will not exceed this rate, so it is a sort of "speed of light" for a given computer. The manufacturer usually refers to peak performance when describing a system. The peak performance is arrived at by counting the number of floating-point operations that can theoretically be performed in one second. The metric of peak performance has a unit of millions of flops per second (Mflop/s).

3. Efficiency is the percentage of peak performance utilized in a specific application. A low efficiency always indicates a poor implementation or compiler.
Example (LINPACK Benchmark) The Linpack benchmark (Dongarra [3]) uses the programs LINPACK/sgefa and LINPACK/sgesl for the solution of systems of linear equations with $n = 100$ variables. The effective floating point performance is determined by the workload, as given by the number of floating point operations executed,

$$W_F = \tfrac{2}{3} n^3 + 2 n^2,$$

and the required computing time $T$:

$$P = \frac{W_F}{T} \quad [\text{flop/s}].$$
Relating this value to the peak performance of the computer gives the efficiency, i. e., the degree to which the two Linpack programs exploit computer resources. In Table 1.1 empirical performance and efficiency values for the Linpack benchmark are listed (Dongarra [4]). For n = 1000, mathematically equivalent programs, which were designed to make the best use of hardware resources, were used. Obviously, machine dependent program optimization significantly improves efficiency.
                                         Linpack benchmark
                           Peak        n = 100             n = 1000
Computer Type              [Mflop/s]   [Mflop/s]   [%]     [Mflop/s]   [%]
IBM RS/6000 397            640         312         49      528         82
DEC 8200 5/440             880         205         23      588         67
HP 9000/K460-XP            720         158         22      510         71
Pentium 400 MHz            400         124         31      -           -
SGI Power Challenge XL     390          86         22      344         88

Table 1.1: Empirical performance and efficiency for the Linpack benchmark.
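The relations between workload, run time, and efficiency used in this example can be checked with a few lines of C. This is only an illustrative sketch: the function names are hypothetical and not part of FFTest, and the numbers in the usage note are taken from the first row of Table 1.1.

```c
#include <assert.h>

/* Linpack workload W_F = (2/3) n^3 + 2 n^2, in floating-point operations. */
double linpack_workload(double n)
{
    assert(n > 0.0);
    return (2.0 / 3.0) * n * n * n + 2.0 * n * n;
}

/* Efficiency in percent: measured rate divided by peak rate (same units). */
double efficiency_percent(double measured_mflops, double peak_mflops)
{
    assert(peak_mflops > 0.0);
    return 100.0 * measured_mflops / peak_mflops;
}

/* Run time in seconds implied by a workload (flop) and a rate (Mflop/s). */
double runtime_seconds(double workload_flop, double rate_mflops)
{
    assert(rate_mflops > 0.0);
    return workload_flop / (rate_mflops * 1.0e6);
}
```

For n = 100 the workload is about 686,667 floating-point operations. At the 312 Mflop/s listed for the IBM RS/6000 397 this corresponds to a run time of roughly 2.2 ms, and efficiency_percent(312.0, 640.0) gives 48.75, which appears rounded to 49 in the table.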
1.2 Workload Measurement

In order to determine the efficiency of FFT routines, it is necessary to determine their arithmetic complexity. This can be done either "analytically" by using formulas for the arithmetic complexity or "empirically" by counting executed floating-point operations on the computer systems used.

Estimating the number of floating-point operations analytically has the disadvantage that real implementations of FFT algorithms often do not achieve the complexity bounds given by the formulas, due to an inefficient handling of multiplication operations with trivial twiddle-factors. Moreover, formulas for the arithmetic complexity do not include the costs for the on-line computation of the twiddle-factors.

To overcome these limitations, in this report the empirical approach was chosen to determine the arithmetic complexity of the examined FFT routines. The empirical approach has the additional benefit that arithmetic complexities of pre-compiled library routines can also be determined exactly.

In order to determine the number of executed floating-point operations, a special feature of modern microprocessors is used: the Performance Monitor Counter (PMC). PMCs are hardware counters able to count various types of events, such as cache misses, memory coherence operations, branch mispredictions, and several categories of issued and graduated instructions (see Chapters 2, 5, 6, and 7). In addition to characterizing the workload of an application by counting the number of floating-point operations, PMCs can help application developers to gain deeper insight into application performance and to pinpoint performance bottlenecks.

PMCs were first used extensively on Cray vector processors, and appear in some form in all modern microprocessors, such as the MIPS R10000 (see [5] [13] [14] [17]), Intel Pentium (see [7] [8] [12]), IBM PowerPC (see [16]), DEC Alpha (see [2]), and HP PA-8000 (see [6]).
Most of the microprocessor vendors provide hardware developers and selected performance analysts with documentation on counters and counter-based performance tools. Useful information regarding PMCs can be found on the following webpages:

www.sandpile.org
www.quantasm.com/opcode_f.html
www.x86.org
www.intercast.ch/drg/mmx/appnotes/perfmon.htm
www.intel.com/design/perftool/cbts/contents.htm
1.3 Runtime Measurement

The run-time is the second important performance index (in addition to the workload). To determine the performance of an algorithm, its run-time has to be measured. In doing so, one has to deal with the resolution of the system clock. Moreover, in most cases not the whole program but just some relevant parts of it have to be measured. For example, in FFT programs the initialization step usually is not included, as it takes place only once, whereas the transform is executed many times.

For the purpose of measuring the run-time of FFT algorithms, the test environment FFTest was created (1). To keep the code portable across Unix platforms the C subroutine times is used, which is available on all Unix systems. The following fragment of a C program demonstrates how to determine user and system CPU time as well as the overall elapsed time using the predefined subroutine times.

(1) Please contact Herbert Karner to get access to the source code.

    #include <sys/times.h>   /* times, struct tms */
    #include <unistd.h>      /* sysconf, _SC_CLK_TCK */
    ...
    /* period is the granularity of the subroutine times */
    period = (float) 1/sysconf(_SC_CLK_TCK);
    ...
    start_time = times(&begin_cpu_time);
    /* begin of the examined section */
    ...
    /* end of the examined section */
    end_time = times(&end_cpu_time);
    user_cpu = period*(end_cpu_time.tms_utime - begin_cpu_time.tms_utime);
    system_cpu = period*(end_cpu_time.tms_stime - begin_cpu_time.tms_stime);
    elapsed = period*(end_time - start_time);

The subroutine times provides timing results as multiples of a specific period of time. This period depends on the computer system and must be determined with the Unix standard subroutine sysconf before times is used. The subroutine times itself must be called immediately before and after that part of the program to be measured. The argument of times returns the accumulated user and system CPU times, whereas the current time is returned as the function value of times. The difference between the respective begin and end times finally yields, together with scaling by the predetermined period of time, the actual execution times.

Whenever the execution time is smaller than the resolution of the system clock, different solutions are possible. First, performance counters can be used to determine the exact number of cycles required. Second, the measured part of a program can be executed many times and the overall time is divided by the number of runs.

This second approach has a few drawbacks. The resulting time may be too optimistic, as first-time cache misses will only occur once and the pipeline might be used too efficiently when executing subsequent calls. A possible solution is to empty the cache with special instructions.

Calling in-place FFT algorithms repeatedly has the effect that in subsequent calls the output of the previous call is the input of the next call. When this process is repeated very often, it can result in excessive floating-point errors up to overflow conditions. This would not be a drawback by itself, if processors handled those exceptions as fast as normal operations. But some processors handle them with a great performance loss, making the timing results too pessimistic. The solution to this problem is calling the program with special vectors (zero vector, eigenvectors) or restoring the first input between every run. The second solution leads to higher run times, but measuring this small additional time and subtracting it finally yields the correct result.
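The repeated-run strategy with restored input can be sketched in C as follows. This is a minimal sketch, not the FFTest implementation: the names kernel_fn, scale_kernel, user_cpu_seconds and time_per_call are hypothetical, and scale_kernel merely stands in for an in-place FFT routine. Only times and sysconf are real Unix calls.

```c
#include <assert.h>
#include <string.h>
#include <sys/times.h>
#include <unistd.h>

/* Routine under test: an in-place operation on n doubles. */
typedef void (*kernel_fn)(double *data, size_t n);

/* Hypothetical stand-in for an in-place FFT routine. */
void scale_kernel(double *data, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        data[i] *= 0.5;
}

/* User CPU seconds consumed so far by this process. */
double user_cpu_seconds(void)
{
    struct tms t;
    long ticks = sysconf(_SC_CLK_TCK);   /* clock ticks per second */
    assert(ticks > 0);
    times(&t);
    return (double) t.tms_utime / (double) ticks;
}

/* Average user CPU time of one kernel call. The input is restored before
   every call; the cost of restoring it is measured in a second loop and
   subtracted from the total. */
double time_per_call(kernel_fn kernel, double *work, const double *input,
                     size_t n, long runs)
{
    double t0, t1, t2;
    long r;

    assert(kernel != NULL && work != NULL && input != NULL && runs > 0);

    t0 = user_cpu_seconds();
    for (r = 0; r < runs; r++) {             /* restore and transform */
        memcpy(work, input, n * sizeof *work);
        kernel(work, n);
    }
    t1 = user_cpu_seconds();
    for (r = 0; r < runs; r++)               /* restore only */
        memcpy(work, input, n * sizeof *work);
    t2 = user_cpu_seconds();

    return ((t1 - t0) - (t2 - t1)) / (double) runs;
}
```

The number of runs should be chosen large enough that both loops take many clock periods; otherwise the result is dominated by the granularity of times and may even come out slightly negative. An optimizing compiler may also shorten the copy-only loop, so the subtracted overhead should be checked for plausibility.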
Chapter 2
Determination of Performance Relevant Events

In order to understand the performance characteristics of a given algorithm, its execution on a computer system has to be examined closely. The number of executed operations or the number of executed floating-point operations are used most of the time. But there are other influences, like exceptions, pipeline stalls and cache misses, that affect the run-time. All these influences are called events.
2.1 Calculating or Simulating Events

To establish the effective count of events happening during a program's execution, different strategies can be used. They differ considerably in cost, exactness and field of application.
2.1.1 Complexity Analysis

An accurate mathematical analysis of an algorithm can sometimes give the exact number of events such as loads or branches (see Karner et al. [10]). In the case of FFT algorithms, difference equations can be used to achieve this goal.

Generally speaking, only events whose number is not influenced by run time factors (for example, the execution of other programs on multi-tasking machines) can be investigated that way. Moreover, the event counts of some algorithms depend on their input parameters. For example, some linear algebra programs have different event characteristics depending on the values of the matrix coefficients, not only on the size of the matrix. Usually, FFT algorithms of a given length use the same number of operations every time they are called, independently of the data vector.

Another important point is that some events, like mispredicted jumps, are special hardware attributes. Accordingly, this aspect would have to be part of the mathematical model.

Some algorithms perform run time checks to choose between different code versions. This makes counting difficult, as the code (and its decisions) would have to be analyzed first. However, not every program's source code is publicly available. Many libraries are protected by copyright, or compiled at different optimization levels. Some algorithms are not as clearly organized as FFT algorithms, and event counting may be complicated. Worst of all are rare cases of self-modifying code and random decisions based on statistical profiling data. Therefore not all events are suitable for complexity analysis.
2.1.2 Simulation

If the event to be counted is intrinsically hardware related or complicated to calculate, simulation might be the method of choice. Simulation is basically a special kind of execution of the program to be measured, which instead of performing the code just keeps track of the number of operations. Sometimes internal software counters are included in the original program and can be activated via a compiler option at compile-time or a parameter at run time.

It is fairly easy to count, for example, the number of executed arithmetic operations: only a simple counter has to be added to the program's code. But difficulties may arise when counting, for example, the number of cache misses. Here, a very accurate model of the memory hierarchy (a program in its own right) has to be built. Furthermore, the whole memory access pattern of the algorithm must be known precisely (which is the case for simple FFT algorithms). Programs running simultaneously (for example, on multi-tasking computers) which use the same cache to access their data, and therefore cause additional cache misses, cannot be investigated in simulation studies.

It requires a lot of work to write a complex simulator, which is usually not portable to other hardware architectures and which needs computational resources of its own.
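As a minimal sketch of the counting approach described above, the following C function walks through the loop nest of a textbook iterative radix-2 FFT without touching any data and only tallies real floating-point operations. The function name, the cost model (6 operations per complex multiplication, 4 per complex addition and subtraction pair), and the restriction of "trivial" twiddle factors to the factor 1 are assumptions made for illustration; none of this is taken from FFTest.

```c
#include <assert.h>

/* Count the real floating-point operations of an iterative radix-2 FFT of
   length n (a power of two) by traversing its loop structure only.
   If skip_trivial is nonzero, the multiplication by the twiddle factor 1
   (butterfly index k == 0) is assumed to be omitted. */
long count_radix2_flops(long n, int skip_trivial)
{
    long flops = 0;
    long m, start, k;

    assert(n >= 2 && (n & (n - 1)) == 0);    /* n must be a power of two */

    for (m = 2; m <= n; m *= 2) {                 /* one pass per stage */
        for (start = 0; start < n; start += m) {  /* each group of size m */
            for (k = 0; k < m / 2; k++) {         /* each butterfly */
                if (!(skip_trivial && k == 0))
                    flops += 6;   /* complex multiply: 4 mul + 2 add */
                flops += 4;       /* complex add and complex subtract */
            }
        }
    }
    return flops;
}
```

For n = 8 the function returns 120 without skipping, which equals 5 n log2(n), and 78 when multiplications by 1 are skipped. The gap between such counts is the kind of difference between formula and implementation that makes empirical counting attractive.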
2.1.3 Source Code Analysis

Some events, like specific instructions, depend on the hardware the program was compiled on. It is always possible to analyze the instruction sequence generated by the compiler at assembler code level. Sometimes the compiler even gives valuable information and it is not necessary to count the instructions. Again, the program must be investigated at compile-time. If the compiler does not add scheduling information, this analysis is not user-friendly. The best tools are compilers which create scheduling reports, as they were used when developing FMAfft (see Karner et al. [9] [11]). Good scheduling reports contain the original code as well as valuable pipelining information.
2.1.4 Profiling

Profiling tools have been used for years to trace program execution. Function calls in particular can be monitored using profilers. Profiling data can be used for program optimization as well. Generally, substantial code is added to the program, making it bigger and slower. Also, sometimes the whole program is affected and not only the interesting parts of it. In recent years, profiling tools have become more sophisticated by making use of event counters. Profiling tools generally calculate some statistics of the data. They are developed by hardware vendors, and are often the first programs to use new processor features.
2.2 Hardware Performance Counters

Performance monitor counters (PMCs) offer an elegant solution to the counting problem. They have many advantages:

- They can count any event of any program.
- They provide exact numbers.
- They can be used to investigate arbitrary parts of huge programs.
- They do not affect program speed or results, or the behavior of other programs.
- They can be used in multi-tasking environments to measure the influence of other programs.
- They are cheap to use in resources and time.

They have disadvantages as well:

- Only a limited number of events can be counted, typically two. When counting more events, the counts have to be multiplexed and they are not exact any more.
- Extra instructions have to be inserted, re-coding and re-compilation is necessary.
- Documentation is sometimes insufficient and difficult to obtain.
- Usage is sometimes difficult and tricky.
- The use of the counters differs from architecture to architecture. To overcome these difficulties a portable interface to the PMCs has been developed by the "Zentralinstitut für Angewandte Mathematik, Forschungszentrum Jülich" (1).
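The loss of exactness under multiplexing can be illustrated with a small C sketch. The report does not describe how multiplexing is implemented; a common scheme, assumed here, lets several events take turns on one hardware counter and extrapolates each raw count to the whole measured interval. The function name and its parameters are hypothetical.

```c
#include <assert.h>

/* Extrapolate a multiplexed counter reading to the whole measured interval.
     raw_count      events seen while this event owned the hardware counter
     active_cycles  cycles during which the event was actually counted
     total_cycles   cycles of the whole measured section
   The result is an estimate, not an exact count. */
double estimate_multiplexed_count(double raw_count, double active_cycles,
                                  double total_cycles)
{
    assert(active_cycles > 0.0 && active_cycles <= total_cycles);
    return raw_count * (total_cycles / active_cycles);
}
```

If an event was observed for one quarter of the run, its raw count is multiplied by four. The estimate is only as good as the assumption that the event rate is the same in the observed and unobserved parts of the run, which need not hold for programs with distinct phases.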
(1) See http://www.fz-juelich.de/zam/pt/redec/softtools/pcl/pcl.html

2.3 Performance Relevant Events

All processors which provide performance counters count different types of events. There is no standard for implementing such counters, and many events are named differently. But one will find most of the important events on any implementation.

Cycles: Cycles needed by the program to complete. This event type depends heavily on the underlying architecture. It can be used, for instance, to achieve high resolution timing.

Graduated instructions, graduated loads, graduated stores: Instructions, loads and stores completed.

Issued instructions, issued loads, issued stores: Instructions started, but not necessarily completed. The number of issued loads is usually far higher than the number of graduated ones, while issued and graduated stores are almost the same.

Primary instruction cache misses: Cache misses of the primary instruction cache. High miss counts can indicate performance deteriorating loop structures.

Secondary instruction cache misses: Cache misses of the secondary instruction cache. Usually, this count is very small and is therefore not a crucial performance indicator.

Primary data cache misses: One of the most crucial performance factors is the number of primary data cache misses. It is usually far higher than the number of instruction cache misses.

Secondary data cache misses: When program input exceeds a certain amount of memory, this event type will dominate even the primary data cache misses.

Graduated floating-point instructions: Once used as main performance indicator, the floating-point count is, together with precise timing results, still one of the most relevant indicators of the performance of a numerical program.

Mispredicted branches: Number of mispredicted branches. Affects memory and cache accesses, and thus can be a reason for high memory latency.
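The cycle count and the floating-point count listed above can be combined into the performance indices of Chapter 1. The following C sketch assumes that the clock rate of the processor is known; the function names and the clock rate used in the usage note are illustrative and not part of FFTest.

```c
#include <assert.h>

/* Run time in seconds from a cycle count and the clock rate in Hz. */
double cycles_to_seconds(double cycles, double clock_hz)
{
    assert(cycles >= 0.0 && clock_hz > 0.0);
    return cycles / clock_hz;
}

/* Floating-point rate in millions of instructions per second, computed
   from a floating-point instruction count and a cycle count. */
double mflops_from_counts(double fp_instructions, double cycles,
                          double clock_hz)
{
    double seconds = cycles_to_seconds(cycles, clock_hz);
    assert(seconds > 0.0);
    return fp_instructions / seconds / 1.0e6;
}

/* Primary data cache misses per graduated load. */
double miss_ratio(double cache_misses, double graduated_loads)
{
    assert(graduated_loads > 0.0);
    return cache_misses / graduated_loads;
}
```

With a hypothetical clock rate of 200 MHz, 2,000,000 cycles correspond to 0.01 s, and 1,000,000 graduated floating-point instructions in that time give a rate of 100 million instructions per second. This equals Mflop/s only if every counted instruction performs exactly one floating-point operation.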
2.4 Event Types and the Floating-Point Case

When the program execution does not strongly depend on its input data or other circumstances not known at compile-time, then three event classes can be defined:

Constant count events: The count of these events does not vary on different architectures. A typical example is the number of branches taken. Counting these events has to be performed only once, even when a whole range of computer systems is to be investigated.

Trembling count events: The number of these events depends strongly on hardware features and compiler options. Typical examples are the number of integer instructions and the number of issued floating-point operations.

Janus count events: These events are very hardware dependent. Usually they are difficult to model and impossible to count exactly without special hardware support. Cache misses and pipeline stalls are the most important members of this category. Sometimes even the operating system can be responsible for producing Janus count events. For example, virtual memory management is handled differently by all operating systems.

While the first and the third group have well-defined characteristics, the second is the most interesting one. What are the reasons that, for example, the floating-point count is not a constant count event?

The varying number of these events is sometimes due to hardware features.

- Some processors provide special instructions like the fused multiply-add instruction, which combines two operations into a single one.
- Standard functions like sin(x) may be implemented directly on a processor or calculated via a library call resulting in several floating-point operations.

Sometimes the varying number of these events is caused by the supporting software.

- Different implementations of functions, like the Fortran function MINT, which converts a floating-point number to an integer, may exist.
- Libraries use different methods to calculate standard functions, resulting in different numbers of floating-point operations.
- Different languages and/or different compilers may use different library versions.
- Some library functions require different numbers of floating-point operations depending on the data, like the function sin(x) on the SGI Power Challenge, which performs no floating-point operation at all when called with the argument x = 0.
Compilers have significant influence on trembling count events.

- Compilers may or may not use new hardware features, like FMA instructions, efficiently.
- Some compilers are able to recognize common subexpressions, and thus reduce the respective number of floating-point operations.
- Compilers may handle constants in different ways. For example, the code

      const = 2; a = -const - x;

  leads to one negation and one subtraction, i. e., two operations on a simple-minded compiler. The rearrangement

      const = -2; a = const - x;

  saves one operation.
- Optimization depends on the optimization level of the compiler. These levels are not comparable across language and platform borders.
- Some compilers handle low precision numbers differently and may decide to convert low precision numbers to higher precision, resulting in additional instructions, sometimes counted as if they were floating-point instructions.
- Dead code, like

      double myStrangeFunction(void)
      {
          double a, b;
          b = rand();
          a = sin(b);
          return b; /* sin-result never used */
      }

  may or may not be compiled at all, depending on the chosen optimization level. So the sine function might not even be called, as its result is not used. Of course, the compiler must be able to decide that the sine function has no side effects.

Finally, even the algorithm itself can make hardware assumptions and choose alternative code segments with different event characteristics.
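Why the floating-point count is a trembling count event can be made concrete with a hand-kept tally for one complex multiplication. The sketch below is a counting model only: both functions compute the product with ordinary C arithmetic, and the tally merely records how many instructions a processor without and with fused multiply-add would plausibly issue. The type and function names are hypothetical.

```c
#include <assert.h>

/* Tally of floating-point instructions, kept by hand for illustration. */
struct fp_tally { long mul, add, fma; };

/* (a + bi)(c + di) with separate multiplies and adds: 4 mul + 2 add. */
void cmul_plain(double a, double b, double c, double d,
                double *re, double *im, struct fp_tally *t)
{
    assert(re != 0 && im != 0 && t != 0);
    *re = a * c - b * d;
    *im = a * d + b * c;
    t->mul += 4;
    t->add += 2;
}

/* The same product as it could be scheduled with fused multiply-add:
   2 mul + 2 fma. Only the tally models the FMA hardware. */
void cmul_fma_model(double a, double b, double c, double d,
                    double *re, double *im, struct fp_tally *t)
{
    assert(re != 0 && im != 0 && t != 0);
    *re = a * c - b * d;
    *im = a * d + b * c;
    t->mul += 2;
    t->fma += 2;
}

/* Number of instructions a hardware counter would report. */
long instructions(const struct fp_tally *t)
{
    return t->mul + t->add + t->fma;
}

/* Number of arithmetic operations, counting each FMA as two. */
long operations(const struct fp_tally *t)
{
    return t->mul + t->add + 2 * t->fma;
}
```

The arithmetic work is 6 operations in both cases, but an instruction counter would report 6 on the first machine and 4 on the second. Whether a given counter reports instructions or operations has to be checked for each processor.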
Chapter 3
The Performance Evaluation Tool FFTest

FFTest is a portable FFT benchmark program. It is written in ANSI C, and has been ported to and run successfully on different platforms like IRIX on the SGI Power Challenge, AIX on IBM workstations, HP-UX on HP workstations, DEC Unix on DEC Alpha workstations, Linux and Windows NT on Intel Pentium based PCs (only on some Pentium models when using the event count registers).

FFTest was designed to measure the run time of FFT algorithms as well as their accuracy. Furthermore, event counters can be addressed and read. FFTest comes with a few tools to handle large numbers of test runs for various algorithms, types of events, transformation lengths, etc. effectively. There are makefiles to make compilation easier and less system-dependent.
3.1 Parts of FFTest

The environment FFTest consists of several parts.

3.1.1 Main Program Files

FFTest.c: Main program, containing functions for time measurement. It is entirely written in ANSI C to achieve portability.

FFTest.h: Header file for FFTest.c. Defines non-ANSI macros and program specific constants.

FFTest_interface.c: Here the FFT algorithms, which can reside in other source files, are declared and given attributes needed for the function calls and the accuracy measurement. This is the only FFTest program file that needs to be changed when including a new algorithm.
3.1.2 Program Files for Windows NT

DirectNT.sys: This is a system driver, needed only for Windows NT to access the event counters on Pentium processors (1). With its help a user program obtains privileged access to the registers.

direct.c: Code to call the system driver DirectNT.sys to access the event counters on Pentium processors.

(1) It can be found at http://www.heise.de.

3.1.3 Input and Output

.dat files: Input files for FFTest. Here the algorithms and transformation lengths are specified. Several algorithms can be handled in one single .dat file, but usually only one algorithm per .dat file should be used, so that a crashing algorithm (for example, when a floating-point exception takes place) does not stop the execution of the whole test run.

.txt files: Output files produced by FFTest.

eX_Y: Event files for FFTest which specify the events to be measured using the event counters. (Currently the event files are read only on SGI computers. Under Windows NT the events are selected via special function calls in the file aufr.c.)

3.1.4 Compile Tools

makeit: System-independent makefile to compile and link Fortran, C, and C++ programs. FFTest does not translate Fortran code to C before compiling it, therefore different compilers have to be called during the process of compilation and linkage. This file is used as input to the Unix command make, which is invoked by executing the system-specific mbat* files.

mbathp, mbatsgi, mbatalph, mbatnec, mbatibm: Specific batch files for each computer architecture. They call the system-independent makefile makeit with the Unix command make and system-specific variables such as compiler names and options.
3.1.5 Batch Tools

cr_bat_a: Creates an intermediate batch file called smallbat. The arguments are a list of .dat files. Shell wildcards are allowed. To change the paths of the executables and the program output, change the file smallbat.

cr_bat_b: Completes the intermediate batch file created by cr_bat_a, adding program version and event selection.

submit: Submits a batch file to the program execution queue. Abbreviation of the qsub command with a few arguments. On some systems the batch command has to be used instead.
3.1.6 Other Tools

migrate: As the program is supposed to be ported to many different architectures, changes to the current version have to be consistent on all of them. migrate compresses all the files into a single one (a tar archive), opens an FTP connection to a host and copies the file to the remote system.

onlyColumns, noTabs, tab2semi: Just sed command calls to handle the FFTest output and transform it into suitable form for postprocessing.

wls, clean, cleanall: List and remove unnecessary files (like object files) to save disc space.

d2u: Calls dos2unix. Accepts wildcards, converts a list of ASCII files from DOS to Unix. This is sometimes necessary, as some compilers do not ignore DOS specific sequences and stop the compilation process.
3.1.7 Groups of Files

/TOPT: Contains source code. Here the compilation takes place. Contains the makefiles, the migrate, d2u and other code related commands.

/TOPT/runs: Contains the current versions of the FFTest executables. Contains the current .dat files, and output (both from FFTest and from larger batch runs of FFTest performed by the operating system) is written to this directory. Contains batch utilities and event files.

/TOPT/oldexe: Former versions of FFTest executables are stored here to avoid re-compilation.

/TOPT/runs/TXT_XXX: Output files are stored here according to event types counted.

/TOPT/TMP_OBJECTFILES: Temporary storage directory for object files.
3.2 Example

The example presented in this section shows a typical test run in batch mode on an SGI Power Challenge XL step by step. The sign > denotes a shell command.

>cp mbatsgi mbatsgiFUSED
>vi mbatsgiFUSED

Change the file mbatsgi, setting the variables for the make -f makeit command. Save the new file as mbatsgiFUSED, because the options enforce the use of fused multiply-add instructions.

>mbatsgiFUSED

The compilation process is started. Depending on the number of processors and the optimization level, between 2 and 20 minutes are needed when about 20 algorithms are included. On single processor systems, or when the compiler is not able to compile the files in parallel, this time is usually far higher, making testing very time-consuming. Here changes to the system-independent file makeit could lower the number of linked algorithms. On finishing, an executable is created with an extension denoting the optimization level (in this case the extension is fused). The compiler options are part of every run's output.

>mv FFTest_fused runs
>cd runs

The executable FFTest_fused is moved to the sub-directory runs, where the input and event files are, then the current directory is changed to runs.

>cr_bat_a i2.dat i3.dat i1*.dat

This is the first step in creating a batch file to submit to the program queue. The batch file will use the input files 1, 2, 3, 10-19.

>cr_bat_b FFTest_fused 21 1

The batch file is finished and named bat.FFTest_fused_21_1. It consists of calls to the executable FFTest_fused for each of the .dat input files. The events measured are 21 (issued floating-point operations) and 1 (issued instructions).

>SUBMIT bat.FFTest_fused_21_1

The batch file is submitted to the queue using the qsub command. An e-mail is sent upon start and completion of the job.

>cr_bat_b FFTest_fused 25 0

Another batch file is created from the intermediate file. This time the same executable and the same input files (1, 2, 3, 10-19) are used, but different events are measured: 25 (primary data cache misses) and 0 (cycle count).

>SUBMIT bat.FFTest_fused_25_0
>cr_bat_b FFTest_nofusedopt 26 0
The same input files are used, but with different events and a different executable for the third batch file. Note that the executable FFTest_fusedopt has to be created first and moved to the directory runs before a SUBMIT command is executed. All three batch files can run concurrently, as the output uses unique file names. Very long batch files (including a large number of algorithms and transformation lengths) should be split up into different files and sent to different processors, depending on how many jobs the user is allowed to run simultaneously (four in this particular case).

After completion of the runs, an e-mail is sent to the user. The output consists of several .txt files written to the directory runs. Subsequent runs of the same job append data to the files instead of creating new ones. This can be useful to compare, for example, cache performance and execution time depending on the system load or the number of TLB misses.

The e-mail sent to the user contains the exit code of the job. If this code is not zero, the reason is usually a memory problem (for instance, the transformation length might be too large for a memory-consuming algorithm) or a missing line (i.e., a syntax error) in one of the .dat files.

The output can be in one of two formats: TEMPO (a keyword-based format) or plain table format. When using table output, the columns are separated by semicolon and tabulator. The kind of output is selected at compile time through a -D directive.

>noTabs i19.FFTest_fused.e25_0.txt > a.txt
Tabulators are removed from the original output file to make it easily readable for Microsoft Excel. A new file named a.txt is created.

>cut -f 1,4,7,10 -d';' a.txt > b.txt
>onlyColumns a.txt > b.txt
Only columns 1, 4, 7, and 10 of the output are copied to the file b.txt: the name of the algorithm, the transformation length, event count 1, and the run time.
3.3 Execution of FFTest

After compilation of FFTest, an executable is created with the desired file name extension, for example FFTest_fused. It includes all of the algorithms selected in the system-independent makefile makeit. To access any of the algorithms, FFTest has to be executed. When typing

>FFTest
the program is started. It reads the file FFTest.dat as default input and writes the results to the file FFTest.txt. Additional output (such as the name of the algorithm and the current processing status) is directed to standard output, most likely your terminal. To change these default files, simply type

>FFTest myInput.dat myOutput.txt
Additional information is still written to standard output. To redirect it to a file, type

>FFTest myInput.dat myOutput.txt > myRun.inf
No event counters are used yet. To use the event counters, you have to specify a third argument:

>FFTest myInput.dat myOutput.txt myEvents.evt > myRun.inf
The event files contain only a single line, with two integer values separated by a comma, specifying the desired events. Note that the executable FFTest must have been compiled with the -DMEASURE_EVENTS flag. The event files are created automatically by the batch system and are called

e(event specifier 1)_(event specifier 2)
Normally such manual execution is useful only when debugging algorithms. In testing mode the batch system should be used instead, because the interactive user time is restricted on most systems, and execution of a large FFT problem may be interrupted by the operating system.
3.4 Input File Format

Here are two examples of typical .dat files to show their general syntax:

>more i30.dat
; FFTest
; 13.1.98
#testenv
#algorithms
; FFTest
0x3000
#dimension
1
; n=2**5 etc, 2nd row indicates repetitions
#power_of_two
5, 10
6, 10
15, 10
16, 1
17, 1
18, 1
19, 1
;20, 1
;21, 1
Lines starting with a semicolon are regarded as comments and ignored by FFTest. The keyword #testenv starts a new environment, which may contain several algorithms; in this case only one is included. A .dat file may contain several environments, each headed by the #testenv keyword. After the keyword #algorithms, the algorithms to be investigated are specified as a list of algorithm numbers (these numbers are defined in the interface file of FFTest). In the example above, the algorithm chosen has the hexadecimal number 0x3000.
After the keyword #dimension, a decimal number specifies the dimension of the transform; in this case the FFT is one-dimensional. Then, following the keyword #power_of_two, the powers 5 to 19 are selected, with 10 repetitions for the powers 5 to 15. The resulting run times are the average of all runs or the minimum run time, depending on the transform length. A slightly more complex example is the following:

>more i1.dat
; IMSL
#testenv
#algorithms
; IMSL FFT complex double precision
0x2003
; some other FFT algorithm..
0x3005
#dimension
1
#power_of_two
1, 10000
2, 1000
3, 100
4, 10
;20, 1
;21, 1
#extra_veclen
100, 1
200, 1
300, 1
In this example two algorithms are investigated, using not only power-of-two lengths but also the lengths 100, 200, and 300. High repetition counts are chosen for the small lengths to enhance the timing accuracy. Any .dat file may contain more than one environment, and any environment may contain more than one algorithm. This makes for a powerful benchmark test when, for example, all algorithms are known to meet memory restrictions and the influence of slight changes in cache size or memory access time on a system's performance is to be investigated.

Normally, when FFT algorithms are investigated in batch mode, a .dat file should be created for each algorithm and FFTest should be executed separately for each of them to reduce overall execution time in multi-processor environments. Another advantage of this workload splitting becomes apparent when an algorithm is called with illegal transformation lengths. In this case only one FFTest process stops execution, allowing the other algorithms to complete; a single input file error therefore cannot stop a whole run of many algorithms.
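Since a missing line in a .dat file is a common reason for a failed job, it can help to check an input file before it is queued. The following Python sketch parses the keywords shown in the two examples of this section. It is not part of FFTest: the function name parse_dat is hypothetical, and the layout with one entry per line is assumed from the examples.

```python
def parse_dat(text):
    """Parse FFTest-style .dat text into a list of environments (sketch)."""
    environments = []
    current = None   # environment currently being filled
    section = None   # keyword whose values are currently being read
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):
            continue  # blank lines and comment lines are skipped
        if line.startswith("#"):
            section = line[1:]
            if section == "testenv":
                current = {"algorithms": [], "dimension": None, "lengths": []}
                environments.append(current)
            continue
        if current is None:
            raise ValueError("data before #testenv: " + line)
        if section == "algorithms":
            current["algorithms"].append(int(line, 16))
        elif section == "dimension":
            current["dimension"] = int(line)
        elif section in ("power_of_two", "extra_veclen"):
            value, repetitions = (int(field) for field in line.split(","))
            # powers of two are given as exponents, extra lengths directly
            length = 2 ** value if section == "power_of_two" else value
            current["lengths"].append((length, repetitions))
        else:
            raise ValueError("unexpected data after #%s: %s" % (section, line))
    return environments


example = """; IMSL
#testenv
#algorithms
0x2003
0x3005
#dimension
1
#power_of_two
1, 10000
2, 1000
;20, 1
#extra_veclen
100, 1
"""
envs = parse_dat(example)
print(envs[0]["algorithms"], envs[0]["lengths"])
```

Splitting a large input file into one file per algorithm, as recommended above, then amounts to writing each entry of the algorithms list into its own #testenv block.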
3.5 Counting Events

Currently the event counters available on Pentium processors and on SGI R10000 processors can be accessed by FFTest. In order to count events on a Pentium processor, the code of the file direct.c has to be changed. The event is selected by a 64-bit mask, which is automatically set in direct.c. The functions of direct.c call the system driver DirectNT.sys. To count events on an R10000 processor, FFTest has to be called with an event file that selects the events to be counted:

>FFTest i.dat o.txt myEventfile
The file format is very simple:

>more myEventfile
21, 1
The event files are created automatically in batch mode:

>cr_bat_a i1.dat i2.dat
>cr_bat_b FFTest 21 1
>submit bat.FFTest_21_1
The event file e21_1 is created, containing the numbers 21 and 1 in its first line. The batch file bat.FFTest_21_1 then calls FFTest twice, once for each of the input files i1.dat and i2.dat, using the event file e21_1 in both calls.
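The naming scheme and the one-line format of the event files can be summarized in a few lines of Python. This is only an illustrative sketch: the helper names are hypothetical, and the actual batch tools may build these names differently.

```python
def event_file_name(event1, event2):
    """Name of the event file, e.g. e21_1 for events 21 and 1."""
    return "e%d_%d" % (event1, event2)


def batch_file_name(executable, event1, event2):
    """Name of the batch file, e.g. bat.FFTest_21_1."""
    return "bat.%s_%d_%d" % (executable, event1, event2)


def event_file_content(event1, event2):
    """Single line with the two event numbers separated by a comma."""
    return "%d, %d\n" % (event1, event2)


def parse_event_file(text):
    """Read the two event numbers back from the first line of an event file."""
    fields = [int(field) for field in text.splitlines()[0].split(",")]
    if len(fields) != 2:
        raise ValueError("an event file must contain exactly two event numbers")
    return fields[0], fields[1]


print(event_file_name(21, 1), batch_file_name("FFTest", 21, 1))
print(parse_event_file(event_file_content(21, 1)))
```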
3.6 Output File Format

The output of FFTest is written to a text file that might look like the following file.

>more myOutput.txt
Name; Platform; Dimension; Vectorlength; Memory usage; Repetitions; Mode; e1_21; e2_1; Accuracy; (User)Time
Valkenburg's FFT Algorithm; SGI:c:-n32 -mips3 -O1:cpp:-n32 -mips3 -O1:f:-n32 -mips3 -O1; 1; 80000; 640; 1; min; 18317635.000000; 270137537.000000; 0.0000; 2.320000
Valkenburg's FFT Algorithm; SGI:c:-n32 -mips3 -O1:cpp:-n32 -mips3 -O1:f:-n32 -mips3 -O1; 1; 90000; 720; 1; min; 21327355.000000; 305022619.000000; 0.0000; 2.690000
This is the plain table output, with the first line naming the columns. The columns are separated by both semicolon and tabulator to make them more readable. The columns contain the following information: name, platform and compiler options, dimension, transformation length, memory usage, number of repetitions, mode (average or minimum time), event count one, event count two, accuracy as a multiple of the machine epsilon, and the run time the algorithm required in user mode. A few useful Unix commands and abbreviations to handle the data are:

>noTabs myOutput.txt > myOutputNoTabs.txt
or

>sed -e 's/[ ]//g' myOutput.txt > myOutputNoTabs.txt

Tabulators are removed by calling the sed command (the brackets in the pattern enclose a literal tabulator character), and the output is written to the new file myOutputNoTabs.txt.

>onlyColumns myOutputNoTabs.txt > myOutputColumns.txt
or

>cut -f1,4,7,10 -d';' myOutputNoTabs.txt > myOutputColumns.txt
This command writes columns 1, 4, 7, and 10 to the new file.

>cat i10*.txt > all.txt
All the .txt files starting with i10 are written to a single file all.txt.

>more i10*.txt > all.txt
This has the same functionality, except that lines with the names of the files precede their content in the final output.

When FFTest is run, additional output is written to standard output stdout to indicate the current processing status:

>FFTest_fused i1.dat XXX.txt e21_1
FFTest 7.11.6
testprocessor for FFT-algorithms
tests time-conditions of FFT-algorithms
output prepared for TEMPO
(C) Copyright 1997 by [email protected]
skriptfile: i1.dat
outputfile: XXX.txt
platform: SGI: c:-n32 -mips4 -O1: cpp:-n32 -mips4 -O1: f:-n32 -mips4 -O1
new testenvironment:
dimension = 1
algorithms: 0x2003
(vectorlengths, repetitions, mode):
(2, 1, avg) (4, 1, avg) (8, 1, avg) (16, 1, avg) (32, 1, avg) (64, 1, avg) (128, 1, avg) (100, 1, avg) (200, 1, avg)
algorithm: 0x2003; IMSL FFT complex Double Precision
(2, 1, avg) (4, 1, avg) (8, 1, avg) (16, 1, avg) (32, 1, avg) (64, 1, avg) (128, 1, avg) (100, 1, avg) (200, 1, avg)
events counted: 21, 1
tests done
This output can be redirected to a file by typing

>FFTest i1.dat XXX.txt e21_1 > mySpecialOutput.out
This output is meant to provide debugging information, for example, at which transformation length a memory problem occurred. Sometimes this output is the only debugging information available, especially when an FFT algorithm exits without further error messages.
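The plain table output described in this section can also be post-processed without the shell tools. The following Python sketch splits each line at the semicolons and strips the tabulators, which has the same effect as noTabs followed by cut. The function name read_table is hypothetical, and the sample row uses a shortened platform field.

```python
def read_table(text):
    """Turn FFTest plain table output into a list of dictionaries (sketch)."""
    lines = [line for line in text.splitlines() if line.strip()]
    # strip() removes the tabulators and blanks that follow each semicolon
    header = [field.strip() for field in lines[0].split(";")]
    rows = []
    for line in lines[1:]:
        fields = [field.strip() for field in line.split(";")]
        rows.append(dict(zip(header, fields)))
    return rows


sample = (
    "Name;\tPlatform;\tDimension;\tVectorlength;\tMemory usage;\tRepetitions;\t"
    "Mode;\te1_21;\te2_1;\tAccuracy;\t(User)Time\n"
    "Valkenburg's FFT Algorithm;\tSGI:c:-n32 -mips3 -O1;\t1;\t80000;\t640;\t1;\t"
    "min;\t18317635.000000;\t270137537.000000;\t0.0000;\t2.320000\n"
)
rows = read_table(sample)
for row in rows:
    print(row["Name"], row["Vectorlength"], row["(User)Time"])
```

Addressing columns by name rather than by position keeps such a script valid when the set of output columns changes.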
Chapter 4
Compiling FFTest

When assessing only one algorithm on only one computer system, the following procedures may not seem very effective. But when dealing with twenty functions on various different computer architectures, the makefile approach is to be preferred. To compile FFTest, the makefile makeit has to be called with the command make -f makeit. Many system-specific variables have to be passed to the make command; this is done in a system-specific batch file called mbat(system name). Thus typing

>mbatsgi
starts the whole compilation process on SGI computers. The respective makefile makeit may look as follows (lines starting with # are comments):

>more makeit
# Makefile for FFTest
# Override variables in batch-file
#
# For later output PLATFORM will be defined as this
OPTIMIZATION_TEXT= "$(SYSTEM):c:$(C_OPT):cpp:$(CPP_OPT):f:$(F_OPT)"
# Linkage section
# 4 different combinations of languages
#
# Program is created
# linking Fortran, C++ and C object files
#
# Final target, needs 3 subtargets
FFTest_c_cpp_f: FFTest_c.O FFTest_cpp.O FFTest_f.O
	mv TMP_OBJECTFILES/* .
	$(LINKER_NAME) $(LINKER_PRE_DEFOPT) *.o -o FFTest_c_cpp_f $(LINKER_POST_DEFOPT)
	mv FFTest_c_cpp_f FFTest$(FILE_EXTENSION)
# The final executable is named FFTest_(EXTENSION)
#
# 3 Compilation sections for each language, before linking
#
# C functions are compiled here
FFTest_c.O:
	$(C_NAME) sexyfft.c timer.c krukar_b512.c nielsen_mixfft.c \
	  ohallaron-tzeng_fft.c ronmayer_fft.c sitton_cfft.c sitton_qftsubs.c \
	  sitton_rfft.c FFTest.c FFTest_interface.c config.c executor.c \
	  fft_duhamel.c fftwnd.c generic.c krukar_bitrev.c krukar_dint.c \
	  krukar_dintime.c krukar_fft.c krukar_idint.c krukar_idintime.c \
	  malloc.c naive.c pfafft.c planner.c twiddle.c \
	  $(C_DEFOPT) $(C_OPT) $(CPPACTIVE) $(FACTIVE) \
	  -DPLATFORM=\"$(OPTIMIZATION_TEXT)\" -c
	mv *.o TMP_OBJECTFILES
#
# Cpp functions are compiled here
FFTest_cpp.O:
	$(CPP_NAME) aux4step.$(CPP_END) fft4step.$(CPP_END) fftfht.$(CPP_END) \
	  fht.$(CPP_END) fht0.$(CPP_END) scramble.$(CPP_END) transpos2.$(CPP_END) \
	  -D_CPP_ACTIVE $(CPP_DEFOPT) $(CPP_OPT) -c
	mv *.o TMP_OBJECTFILES
# not all the C++ compilers accept the same file suffix
# extra variable specifies suffix
#
# Fortran functions are compiled here
FFTest_f.O:
	$(F_NAME) M1VE.F M1VECH.F N1RGB.F N1RTY.F S1ANUM.F UMACH.F I*.F C1*.F \
	  D*.F E*.F vfftpack.f Xlib.f cfft99.f fftd.f csr1.f fft.f gpfa.f \
	  gpfa3f.f rvsrh.f sfft.f csr2.f fft0.f gpfa2f.f gpfa5f.f setgpfa.f \
	  CFFTF.F CFFTI1.F PASSF3.F RADF2.F RADF5.F RFFTF1.F CFFTF1.F PASSF.F \
	  PASSF4.F RADF3.F RADFG.F RFFTI.F CFFTI.F PASSF2.F PASSF5.F RADF4.F \
	  RFFTF.F RFFTI1.F \
	  $(F_DEFOPT) $(F_OPT) -c
	mv *.o TMP_OBJECTFILES
This makefile is called by the following batch file:
>more mbatsgi
echo "Removing object files"
rm *.o
echo "Emptying temporary object file directory"
rm TMP_OBJECTFILES/*
make FFTest_c_cpp_f -f makeit \
  FILE_EXTENSION=_fused \
  CPPACTIVE=-D_CPP_ACTIVE FACTIVE=-D_F_ACTIVE CPP_END=cpp \
  SYSTEM="SGI" \
  C_NAME=cc F_NAME=f77 CPP_NAME=CC LINKER_NAME=CC \
  C_OPT="-Ofast=ip25 -O3" CPP_OPT="-Ofast=ip25 -O3" F_OPT="-Ofast=ip25 -O3" \
  C_DEFOPT="-D_SGI -D_NAG_DOUBLE_ACTIVE -D_SIMPLE_OUT" \
  R_LINK_DEFOPT="-64 -mips4" \
  CPP_DEFOPT=-D_HPUX_SOURCE F_DEFOPT="" \
  LINKER_PRE_DEFOPT="-64 -mips4" \
  LINKER_POST_DEFOPT="-lm -lnag -lftn "
The make command is executed starting from the so-called target line FFTest_c_cpp_f in the makefile makeit. The commands after the target line

FFTest_c_cpp_f: FFTest_c.O FFTest_cpp.O FFTest_f.O

of makeit are executed only when the subtargets FFTest_c.O, FFTest_cpp.O, and FFTest_f.O are available. The subtargets are compilations in three different languages, which take place first; the final target is the linkage of the three object files. For debugging purposes, the makefile could also be called with a subtarget, like

>make FFTest_f.O -f makeit
This would invoke the Fortran compiler and suppress further linkage. The variables defined in the batch file and read by the makefile makeit are:

FILE_EXTENSION: Extension to append to the final executable's name; in this example the name will be FFTest_fused.

CPPACTIVE: Should be set to -D_CPP_ACTIVE or to an empty string "". Needed when C++ code is to be included in the environment.

FACTIVE: Set to -D_F_ACTIVE when Fortran code is to be included, otherwise to "".

CPP_END: The suffix of the C++ files accepted by the C++ compiler.

SYSTEM: Name of the computer system, part of the output of FFTest. No other effect.

C_NAME, F_NAME, CPP_NAME, LINKER_NAME: The command names invoking the various compilers and the linker.

C_OPT, CPP_OPT, F_OPT: Important options passed to the different compilers, therefore part of the FFTest output.

C_DEFOPT, CPP_DEFOPT, F_DEFOPT: Normal options for the compilation process, not part of the output.

LINKER_PRE_DEFOPT, LINKER_POST_DEFOPT: Final link options. The linker is called by the makefile using the syntax

LINKER_NAME LINKER_PRE_DEFOPT object file list LINKER_POST_DEFOPT

Some options must precede the file names.

The mbat files depend on the computer system used. They should be the only files to be modified when porting from one system to another.
4.1 DEC Unix on a DEC Alpha

make FFTest_c_cpp_f -f makehp \
  FILE_EXTENSION=_o4 \
  CPPACTIVE=-D_CPP_ACTIVE FACTIVE=-D_F_ACTIVE CPP_END=cc \
  SYSTEM="DEC" \
  C_NAME=cc F_NAME=f77 CPP_NAME=cxx LINKER_NAME=cc \
  C_OPT="-O5 -inline speed -unroll 8 -migrate -tune host -D_INTRINSICS -D_INLINE_INTRINSICS -D_FASTMATH -fp_reorder" \
  CPP_OPT="-O2 -tune host" \
  F_OPT="-O5 -inline speed -tune host -unroll 8 -math_library fast -assume noaccuracy_sensitive -align records -align dcommons" \
  C_DEFOPT="-Ae -D_DEC -D_NAG_DOUBLE_ACTIVE -D_SIMPLE_OUT" \
  CPP_DEFOPT="" F_DEFOPT="" \
  LINKER_PRE_DEFOPT="" \
  LINKER_POST_DEFOPT="-lUfor -lfor -lFutil -lm_4sqrt -lm -lots -lc -lcxx -lnag"
4.2 IBM AIX on a PowerPC System

make FFTest_c_cpp_f -f makehp \
  FILE_EXTENSION=_o4 \
  CPPACTIVE=-D_CPP_ACTIVE FACTIVE=-D_F_ACTIVE CPP_END=cpp \
  SYSTEM="RS6000" \
  C_NAME=cc F_NAME=f77 CPP_NAME=cc LINKER_NAME=f77 \
  C_OPT="-O3 -qunroll=8 -qtune=604" \
  CPP_OPT="-O3 -qunroll=8 -qtune=604" \
  F_OPT="-O3 -qhot -qtune=604" \
  C_DEFOPT="-Ae -D_IBM -D_SIMPLE_OUT" \
  CPP_DEFOPT="" F_DEFOPT="" \
  LINKER_PRE_DEFOPT="" \
  LINKER_POST_DEFOPT="-lc -lm -lC "
4.3 HP-UX on an HP 9000

make FFTest_c_cpp_f -f makehp \
  FILE_EXTENSION=_o4 \
  CPPACTIVE=-D_CPP_ACTIVE FACTIVE=-D_F_ACTIVE CPP_END=cpp \
  SYSTEM="HP K460-XP, HP-UX, aurora" \
  C_NAME=cc F_NAME=f77 CPP_NAME=CC LINKER_NAME=cc \
  C_OPT="+O4 +Oentrysched +Odataprefetch +Ofastaccess +Onoinitcheck +Oinline +Olibcalls +Oloop_unroll=8" \
  CPP_OPT="+O4 +Oentrysched +Odataprefetch +Ofastaccess +Onoinitcheck +Oinline +Olibcalls +Oloop_unroll=8" \
  F_OPT="+O4 +Oentrysched +Odataprefetch +Ofastaccess +Onoinitcheck +Oinline +Olibcalls +Oloop_unroll=8" \
  C_DEFOPT="-Ae -D_NAG_DOUBLE_ACTIVE -D_SIMPLE_OUT" \
  CPP_DEFOPT=-D_HPUX_SOURCE F_DEFOPT="" \
  LINKER_PRE_DEFOPT="+Ofastaccess" \
  LINKER_POST_DEFOPT="-lc -lm -lC -lcl -lnag"
4.4 Adding New Algorithms to FFTest

Several steps are required to add new FFT routines to FFTest.

1. The new routine's code has to be compiled and linked as part of the FFTest executable. This is done by adding all necessary source code file names of the new algorithm to the compilation section (the subtargets) of the makefile makeit. The linkage process automatically links the new object files. No further changes to the makeit file and no changes to the mbat file are necessary.

2. Three routines have to be provided for FFTest: the init function, which is responsible for initialization and is not timed; the exit function (only when some memory cleanup is required, which is usually not the case); and the interface function, which calls the desired FFT function. These functions are added to the file FFTest_interface.c. All of them receive the same arguments (length, dimension, pointers to the floating-point and integer data buffers, and finally a pointer to a special structure defining the algorithm's attributes). They call the init, fft, and exit functions of the new algorithm with the appropriate algorithm-specific list of arguments. To enable better type checking, one should add the function prototypes at the top of the file FFTest_interface.c. While this step is not necessary, it should be done to increase program readability.

3. An element of type struct algdata must be added to the global list of algorithms in the file FFTest_interface.c.

struct algdata alglist[]= {
/*{"Name", 0xAlgID, float/double workspace, extra float workspace (absolute),
   int workspace, Typ: float/double, Typ: real/complex, dimension,
   Typ: 2^XX/split-radix, in-place/workspace, &init, &exit, &fftfunc} */
{"Duhamel-Hollman's Split-Radix FFT Algorithm", 0x1001, 0, 0, 0,
 DOUBLE_ALG, COMPLEX_COMPLEX, 1, ORDER_TWO_VECT, POWER_OF_TWO,
 CALC_INPLACE, NORM_N, NULL, NULL, &dhsr_iface},
{"New Algorithm..", 0x1002, 0, 0, 0,
 DOUBLE_ALG, COMPLEX_COMPLEX, 1, ORDER_TWO_VECT, POWER_OF_TWO_SHORT,
 CALC_INPLACE, NORM_N, &newInit, &newExit, &newInterface}
};
This example list consists of two struct algdata elements, which contain the following information:

The name of the algorithm (it may contain spaces and is part of the FFTest output).

A unique hexadecimal number which selects the respective algorithm in the .dat files.

Workspace required by the algorithm. A workspace value of k and a transformation length of N mean that kN words of additional memory of the algorithm's data type are needed.

Extra workspace, as an absolute value.

Integer workspace, only needed by some algorithms.

Selection of the data type of the algorithm, double or float. This selection is accomplished by setting the struct element to either DOUBLE_ALG or FLOAT_ALG.

Selection of FFT type: real-to-complex or complex-to-complex. Selected by REAL_COMPLEX or COMPLEX_COMPLEX.

The dimension of the FFT.

Type of algorithm: power-of-two or split-radix. Selected by the constant values POWER_OF_TWO or SPLIT_RADIX. These values can be logically ORed to combine them.

Selection of whether the result of the FFT is in-place or in workspace. Selected by CALC_INPLACE or CALC_WKSPACE.

Type of result, denoting whether it is divided by N or not.

Three pointers to the init, exit, and interface functions defined previously.
4. FFTest can be compiled now. When started with a new .dat file containing the hexadecimal identifier of the algorithm, the new FFT routine is timed.

Some extra work has to be done to include code written in a language other than C. You have to make sure that the options

FACTIVE=-D_F_ACTIVE

and

CPPACTIVE=-D_CPP_ACTIVE

are part of the mbat file that calls the makefile when including Fortran or C++ code. For Fortran programs, extra lines have to be added to the FFTest interface file FFTest_interface.c. One must add compiler defines like

#define newFortranInit newFortranInit_
The trailing underscore must not be omitted for some compiler combinations, because some Fortran compilers assign names to object code by adding underscores or by using capital letters. On Windows NT the names are in capital letters; on SGI/Solaris, DEC Unix, and NEC SuperUX the names end with an underscore. For Fortran programs to be compiled with the Microsoft Fortran compiler, in addition to the changes already mentioned, the function prototypes must be changed to

extern void __stdcall

(see the example in the file FFTest_interface.c).
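The naming rules stated above can be written down as a small lookup. The Python sketch below only encodes the two conventions mentioned in the text (capital letters on Windows NT, a trailing underscore on the listed Unix systems). The function names and platform strings are hypothetical, and further case changes applied by individual Fortran compilers are not modelled.

```python
# Platforms for which the text states that names end with an underscore.
TRAILING_UNDERSCORE = {"SGI", "Solaris", "DEC Unix", "NEC SuperUX"}


def fortran_symbol(name, platform):
    """Object-code name of a Fortran routine under the rules given in the text."""
    if platform == "Windows NT":
        return name.upper()
    if platform in TRAILING_UNDERSCORE:
        return name + "_"
    raise ValueError("no naming rule known for " + platform)


def define_line(name, platform):
    """Compiler define mapping the C-side name to the Fortran object name."""
    return "#define %s %s" % (name, fortran_symbol(name, platform))


print(define_line("newFortranInit", "SGI"))
print(define_line("newFortranInit", "Windows NT"))
```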
4.5 Additional Tools and Useful UNIX Commands

Some of the tools presented in this section are merely scripts that perform batch calls to Unix commands to speed up work and to avoid some cryptic options. Others simply add wildcard ability to Unix commands. Genuine Unix commands are underlined.

migrate: This command copies all the files of the current directory to a remote computer system. It bundles the files with the tar command into one single file, calls the ftp command in non-interactive mode, opens a connection to the remote system, enters user name and password, and puts the file into the respective home directory. When run without arguments, the command's usage is displayed. For example, migrate 3 copies all the files to the host with identifier 3. To add a new host as a possible target for the migrate command, edit the migrate file.

tar: Used to unpack the tar archives, for example: tar -xf myArchive.

cut: Used to select a few columns of a text file. The columns must be separated by a delimiter. The syntax for columns 4 and 6 is cut -f4,6 -d';' myFile.

grep, egrep: Used to select rows from a text file. See the manual pages for a detailed description of the numerous arguments.

onlyColumns: Selects the most important columns of a plain text output of FFTest. Syntax:

onlyColumns myFile > myNewImportantColumnsFile

noTabs: Calls the sed command to remove tabulators from a text file. Syntax:

noTabs myFileWithTabs > myNewFileFarewellToTheTabs

sed: Powerful command to manipulate text files, especially to search and replace text. The following command substitutes all strings PEACE with WAR:

sed -e 's/PEACE/WAR/g' myFile > myNewFile
For more options consult the manual pages with the man command.

qsub: Submits a batch job to a queue. Might not be implemented on every system. Example:

qsub -mb -me -q long myBatchFile

submits the batch file to the program queue called long, and sends an e-mail when starting and finishing the job.

qstat: Prints the current queue status.

submit: Calls the qsub command with a few often-used arguments. Example:

submit myBatchFile

To change the arguments, edit the file submit.

batch: On some systems this command has to be used for batch jobs.

at: Starts a command at a specified time.

wls: Calls the ls and egrep commands to list all files of the current directory not ending in .c, .f, or .cpp. Type more wls to see how the call to the egrep command works.

clean, cleanall: Remove unnecessary files, like object files, from the current directory. cleanall removes all FFTest executables as well.

dos2unix, d2u: Convert DOS text files to Unix text files, in particular removing unwanted end-of-line characters. This not only improves readability but is also necessary for some compilers that do not ignore them. d2u accepts shell wildcards, overwriting all the input files with their corresponding output files.

sysconf: System information (architecture, ticks per second).
Chapter 5
Event Counting on MIPS Processors

The MIPS R10000 processor provides two 32-bit PMCs for counting two types of events simultaneously (see [5] [13] [14] [17]). Each counter can track one event at a time, and there is a choice of sixteen events per counter. There are also two associated control registers, which are used to specify which event the respective counter is counting. Table 5.1 lists the 32 events that can be counted by the MIPS R10000 PMCs. Unfortunately, each of the hardware counters can only access 16 of the events: counter 0 tracks events 0-15 and counter 1 tracks events 16-31.

To gain more control over where PMC events are gathered, the PMCs can be accessed directly by calling the appropriate libperfex routines:

int start_counters(int, long long *, int, long long *)
int read_counters(int, long long *, int, long long *)
int print_counters(int, long long *, int, long long *)

start_counters clears and starts the counters specified by the e0/e1 parameters.

read_counters stops the counters and returns their values in the 64-bit integers.

print_counters does not affect the counters but produces formatted output displaying the counter name and value.
Encoding  Counter  Event
00h       0        Cycles
01h       0        Instructions issued
02h       0        Memory data accesses issued
03h       0        Stores issued
04h       0        Store conditional issued
05h       0        Store conditionals failed
06h       0        Branches decoded
07h       0        Quadwords written back from L2 cache
08h       0        Correctable ECC errors on L2 cache
09h       0        L1 cache misses (instruction)
0Ah       0        L2 cache misses (instruction)
0Bh       0        L2 cache way mispredicted (instruction)
0Ch       0        External intervention requests
0Dh       0        External invalidation requests
0Eh       0        Instructions done
0Fh       0        Instructions graduated
10h       1        Cycles
11h       1        Instructions graduated
12h       1        Memory data loads graduated
13h       1        Stores graduated
14h       1        Store conditional graduated
15h       1        Floating-point instructions graduated
16h       1        Quadwords written back from L1 cache
17h       1        TLB refill exceptions
18h       1        Branches mispredicted
19h       1        L1 cache misses (data)
1Ah       1        L2 cache misses (data)
1Bh       1        L2 cache way mispredicted (data)
1Ch       1        External intervention hits
1Dh       1        External invalidation hits
1Eh       1        Upgrade requests on clean L2 cache lines
1Fh       1        Upgrade requests on shared L2 cache lines

Table 5.1: Events counted on MIPS R10000 processors.
In order to access those functions, the library has to be included in the link process:

>ln all.o -lperfex
Each pointer argument of the function call has to point to a long long variable, to which the result is written by read_counters. Another approach to counting events is the system command

>perfex -e 21 -e 1 testedProgram
which counts events 21 and 1. In this case, however, the initialization phase is part of the observed program. The advantage of this approach is that no code changes and no recompilation are required. For programs without initialization, or when the measured event is known not to occur during initialization, this is the easiest way of counting events. Alternatively, perfex can be called with the -s argument, like

>perfex -s -e 21 -e 1 testedProgram
Then testedProgram has to send the USR1 and USR2 signals to start and stop counting. Code changes and recompilation are necessary. Reportedly, sending the signal can sometimes result in a delayed start of the counting, and thus in slightly too low event counts.

Common event pairs used for the analysis of FFT programs on an SGI computer are (18, 2), (19, 3), (21, 1), (22, 7), (25, 9), and (26, 10). Generally speaking, perfex and the library libperfex, on which perfex is based, are easy to include, provide exact event counts, and are well documented. These are remarkable properties for software dealing with low-level hardware aspects. Please note that the use of the perfex command-line tool and direct calls to the libperfex routines cannot be mixed.
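Because counter 0 serves events 0-15 and counter 1 serves events 16-31, two events can be measured in the same run only if they belong to different counters. The Python sketch below checks this for the event pairs listed above. The function names are hypothetical and the check is not part of perfex or FFTest.

```python
def counter_for_event(event):
    """Return the R10000 counter (0 or 1) that can count the given event."""
    if not 0 <= event <= 31:
        raise ValueError("R10000 event numbers range from 0 to 31")
    return 0 if event <= 15 else 1


def is_valid_pair(event_a, event_b):
    """Two events can be counted together only on different counters."""
    return counter_for_event(event_a) != counter_for_event(event_b)


common_pairs = [(18, 2), (19, 3), (21, 1), (22, 7), (25, 9), (26, 10)]
for pair in common_pairs:
    print(pair, is_valid_pair(*pair))

# Events 21 and 25 both belong to counter 1 and cannot be combined.
print((21, 25), is_valid_pair(21, 25))
```

The decimal event numbers used on the command line correspond to the hexadecimal encodings of Table 5.1; event 21, for example, is encoding 15h.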
Chapter 6
Event Counting on Pentium Processors

The Pentium processor family provides two 40-bit PMCs, making it possible to monitor two types of events simultaneously. These counters can either count events or measure duration (see [7] [8] [12]). When counting events, a counter is incremented each time a specified event takes place or a specified number of events takes place. When measuring duration, a counter counts the number of processor clock cycles that occur while a specified condition is true. The counters can count events or measure durations that occur at any privilege level. Table 6.1 lists events that can be counted on all Pentium processors.

The PMCs are supported by three model-specific registers (MSRs): the PMC Control & Event Select Register (CESR), MSR 11h, and the performance counters CTR0 and CTR1, MSRs 12h and 13h. These registers can be read from and written to using the RDMSR and WRMSR instructions, respectively. In Pentium processors they can be accessed using these instructions only when operating at privilege level 0. User applications under Windows NT run at privilege level 3; therefore a virtual device driver is used to access the MSRs. In Pentium processors with MMX extension and in Pentium Pro and Pentium II processors, the counters can be read from any privilege level using the RDPMC (read PMCs) instruction.

The PMCs are started by writing valid setup information to the CESR register, i.e., setting the enable counters flag and selecting the event to be counted (compare Table 6.1).
Encoding  Type        Description
00h       occurrence  data reads
01h       occurrence  data writes
02h       occurrence  data TLB misses
03h       occurrence  data read misses
04h       occurrence  data write misses
05h       occurrence  writes (hits) to M or E state lines
06h       occurrence  data cache lines written back
07h       occurrence  data cache snoops
08h       occurrence  data cache snoop hits
09h       occurrence  memory accesses in both pipes
0Ah       occurrence  bank conflicts
0Bh       occurrence  misaligned data memory references
0Ch       occurrence  code reads
0Dh       occurrence  code TLB misses
0Eh       occurrence  code cache misses
0Fh       occurrence  any segment register loaded
10h       occurrence  segment descriptor cache accesses
11h       occurrence  segment descriptor cache hits
12h       occurrence  branches
13h       occurrence  BTB hits
14h       occurrence  taken branches or BTB hits
15h       occurrence  pipeline flushes
16h       occurrence  instructions executed in both pipes
17h       occurrence  instructions executed in the v-pipe
18h       duration    clocks while bus cycle in progress (bus utilization)
19h       duration    pipe stalled by full write buffers (writes backup)
1Ah       duration    pipe stalled by waiting for data memory reads
1Bh       duration    pipe stalled by writes to M or E lines
1Ch       occurrence  locked bus cycles
1Dh       occurrence  I/O read or write cycles
1Eh       occurrence  non-cacheable memory references
1Fh       duration    pipeline stalled by address generation interlock
20h       occurrence  source destination conflicts
21h       -           reserved
22h       occurrence  floating-point operations
23h       occurrence  breakpoint matches on DR0 register
24h       occurrence  breakpoint matches on DR1 register
25h       occurrence  breakpoint matches on DR2 register
26h       occurrence  breakpoint matches on DR3 register
27h       occurrence  hardware interrupts
28h       occurrence  data reads or data writes
29h       occurrence  data read misses or data write misses

Table 6.1: Events counted on Intel Pentium processors.
If the setup is valid, the counters begin counting following the execution of the WRMSR instruction that sets up the CESR register. The counters can be stopped by clearing the enable counters flag. The fundamental measurement procedure is (see [8]):

1. Set the CESR register (the counter immediately begins accumulating).
2. Read and save the counter's starting value.
3. Execute the code section being measured.
4. Read the counter's end value.
5. Subtract the counter values to find the number of executed floating-point operations.
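The subtraction in the last step deserves some care because the counters are only 40 bits wide and may wrap around during a long measurement. The Python sketch below illustrates the arithmetic only. The counter values are plain integers here, since reading the real counters requires privileged instructions, and the wraparound handling is an assumption that is not described in the text. It is correct as long as the counter wraps at most once between the two readings.

```python
COUNTER_BITS = 40  # width of the Pentium performance counters


def counted_events(start_value, end_value, bits=COUNTER_BITS):
    """Number of events between two counter readings, modulo the counter width."""
    modulus = 1 << bits
    return (end_value - start_value) % modulus


# Ordinary case: the counter did not wrap around.
print(counted_events(100, 350))

# The counter wrapped once between the two readings.
print(counted_events((1 << 40) - 10, 5))
```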
These steps cannot all be done in user-level (Ring 3) code because:

- CESR can only be set by the WRMSR instruction; WRMSR can only be run from Ring 0.
- The event counters can be read by the RDMSR instruction, but RDMSR can only be run from Ring 0.
- The counters can also be read by the RDPMC instruction; RDPMC can be run from Ring 3 if the CR4.PCE permission bit is set, but in Windows 95 and Windows NT, CR4.PCE is not set by default.

These difficulties can be addressed by using appropriate drivers, for example, CPUMON (see http://www.ntinternals.com). Intel itself provides software developers with many tools to use the event counters. Some of them are available via the Intel developer website (developer.intel.com), while others are not published (yet). Sometimes equivalent programs are provided by other companies and organizations (see www.x86.org or www.heise.de). Intel sells the tool VTune, which handles many aspects of performance monitoring (a 30-day trial version can be downloaded from the Intel developer website).

The PMCs of the Pentium Pro and Pentium II processors can be configured to monitor any of the possible types of events. One or more of these events can be chosen for monitoring. However, VTune monitors every selected event in a separate session. During event-based sampling, VTune interrupts the processor each time a certain number of events have occurred and collects samples of the instruction addresses of all active software on the system. The frequency of the processor interrupts and the sample collection can be determined by VTune. The data collected by VTune make it possible to determine the number of events that occurred and the amount of CPU time that was spent in each active module running on the system used.

Note, Event-Based Sampling (EBS) is possible on Pentium Pro and Pentium II processors. On Pentium processors or Pentium processors with MMX technology, EBS is supported only in connection with a special CPU socket, which is not provided by Intel. EBS is not possible in interactive mode, so some applications cannot be assessed using VTune.
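How event counts might be recovered from event-based samples can be illustrated with a small sketch. It is not VTune's interface; the function name estimate_events, the module names, and the sampling interval of 10000 events are hypothetical. The only idea taken from the text is that one sample is collected each time a fixed number of events has occurred, so each sample stands for roughly that many events in the module that was active.

```python
from collections import Counter


def estimate_events(sampled_modules, events_per_sample):
    """Attribute events to modules: each sample represents events_per_sample events."""
    samples = Counter(sampled_modules)
    return {module: n * events_per_sample for module, n in samples.items()}


# Hypothetical sample log: the module that was active at each interrupt.
samples = ["fft.exe", "fft.exe", "kernel", "fft.exe"]
estimate = estimate_events(samples, events_per_sample=10000)
print(estimate)
```

The estimate is statistical: its resolution is one sampling interval, so short code sections may receive no samples at all.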
VTune is capable of displaying basic statistical data on the events that occurred. VTune can sample and display a system-wide view of software activity and CPU time distribution. It can perform a static analysis of the functions or blocks of code in an application and can determine overall percentages of instruction pairing and pipeline penalties incurred by the instructions in each function. It is possible to determine how many clock cycles each instruction takes to execute and how many of them were incurred due to pipeline penalties. Additionally, a dynamic analysis is available, which is able to simulate a block of code and discover events such as missed cache accesses and misaligned data, which can degrade performance. It can even analyze C or Fortran source code and give high-level optimization advice. Note, VTune has a graphical user interface and is therefore not suitable for measuring the event counts of interactive applications.
Chapter 7

Event Counting on Alpha Processors

The uprofile and kprofile tools access the PMCs of the DEC Alpha chip. They do not collect information on shared libraries. By default, both tools collect the number of cycles required by the investigated program. The performance data produced by these tools is processed with the prof command.

Direct access is possible through the pfm pseudo device, which serves as an interface to the performance counters. First, it has to be opened with the open function. Then, the ioctl command is used to pass pre-defined constants to it (see [2]).

The counters work slightly differently than on other architectures. Each counter interrupts the operating system when a certain number of the selected events have been counted. There are three actions that can happen at each interrupt:

- Increment a value counter;
- Histogramming;
- User or kernel PC profiling.

These actions can be selected through a call to the ioctl command with constants defined in the file /sys/pfcntr.h. If counters are selected, the interrupt count is incremented. Thus, the number of times an event has happened is counted in multiples of the interrupt frequency. Note, the driver can only count the interrupts generated. There is no direct access to the on-chip counter values.
Histogramming increments the entries in an array accessible with the ioctl command; if profiling is enabled, the value of the program counter is added to the profile histogram if the program's current mode is the selected one (kernel or user mode).

In a multi-processor system each CPU has its own counters. Consequently, there are three ways to open the device:

- Open and collect data only on the current CPU;
- Open all CPU counters, but keep data separate;
- Open all counters and accumulate data in one collection.

Note, in the first case, the process must be bound to one specific CPU, or the open will fail. It must also remain bound to that processor; otherwise the results are unpredictable. The interrupt frequency of performance counter 0 can be set to either $2^{12}$ or $2^{16}$; the frequency of counter 1 can be set to either $2^{8}$ or $2^{12}$.

Probably the biggest problem when trying to open the device or when using uprofile is that the operating system has to be re-configured to provide the pseudo-device. This can be done by including the line

    pseudo-device pfm

in the kernel configuration file and by rebuilding the kernel.
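Because the driver reports only interrupts, an event count can be recovered only up to one interrupt interval. The sketch below illustrates this, assuming the on-chip counter starts at zero when counting begins (an assumption, since the on-chip value cannot be read). The function name event_range is hypothetical; the allowed frequencies are the ones stated above for counters 0 and 1.

```python
# Selectable interrupt frequencies (events per interrupt) for counters 0 and 1.
ALLOWED_FREQUENCIES = {0: (2**12, 2**16), 1: (2**8, 2**12)}


def event_range(counter, interrupts, frequency):
    """Return the (lowest, highest) event count consistent with an interrupt count."""
    if frequency not in ALLOWED_FREQUENCIES[counter]:
        raise ValueError("frequency not selectable for this counter")
    lowest = interrupts * frequency
    # Up to frequency - 1 further events may have occurred without an interrupt.
    return lowest, lowest + frequency - 1


print(event_range(0, 3, 2**12))  # (12288, 16383)
```

Choosing the smaller frequency gives a finer resolution at the price of more interrupts.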
Issues (total number of issues divided by 2): This counter is incremented by one for each cycle in which two instructions are issued and is incremented by 1/2 for each cycle in which only one instruction is issued. The number of cycles in which one instruction is issued can be found by using the dual issues field and the equation $S = (I - D) \cdot 2$, where $S$ denotes single issues, $D$ dual issues, and $I$ issues.
Non-issues (total number of non-issues divided by 2): This counter is incremented by one for each cycle in which no instructions are issued and is incremented by 1/2 for each cycle in which only one instruction is issued. This counter is the complement of the issues counter: every cycle adds exactly one to the sum of the two counters, so per cycle, non-issues = 1 - issues.
Dual-issues: This counter is incremented once per cycle where two instructions are dual-issued.
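The relations between the issues, non-issues, and dual-issues counters can be checked with a few lines of arithmetic. The sketch below uses the definitions given above, with I for issues, D for dual issues, and S for single-issue cycles; the function name issue_breakdown and the sample numbers are hypothetical. Fractions are used because the issues counter advances in steps of 1/2.

```python
from fractions import Fraction


def issue_breakdown(issues, dual_issues, cycles):
    """Derive cycle and instruction counts from the issues and dual-issues counters."""
    I = Fraction(issues)
    D = Fraction(dual_issues)
    single = 2 * (I - D)             # S = (I - D) * 2
    zero = cycles - D - single       # cycles in which nothing was issued
    return {
        "single_issue_cycles": single,
        "zero_issue_cycles": zero,
        "non_issues": zero + single / 2,  # expected non-issues counter value
        "instructions": 2 * D + single,   # total instructions issued
    }


# Hypothetical run: 100 cycles, of which 30 dual-issue and 40 single-issue,
# so the issues counter reads 30 + 40/2 = 50.
result = issue_breakdown(issues=50, dual_issues=30, cycles=100)
print(result["single_issue_cycles"], result["non_issues"], result["instructions"])
```

In this example issues plus non-issues equals the cycle count (50 + 50 = 100), which offers a simple consistency check on measured values.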
Pipedry: This counter is incremented by one for each cycle in which nothing is issued due to the lack of valid instruction stream data. The causes could be instruction cache refill operations (due to normal sequential operation, or delays while fetching the target of a branch) or delays caused by the draining of the pipeline in response to an exception.
Loads: This counter is incremented once per issuance of a load instruction. Note, if a load misses in the primary data cache, the replay of the instruction will cause the load counter to be incremented again.
Stores: This counter is incremented once per store instruction.

Pipe frozen: This counter is incremented by one for each cycle in which nothing is issued due to a resource conflict within the pipeline. Examples are:

- Not all source and destination registers are available;
- A load miss or write buffer overflow occurs;
- A conditional branch cannot be issued in the cycle following a jump;
- Memory barrier instruction processing causes the pipe to freeze.
Branches: This counter is incremented once per issuance of a branch instruction.
Cycles: This counter is incremented once per cycle. It is useful when determining whether or not the performance counters are operating properly.
Victims (external pin 0): This counter is incremented once per external
event supplied to external pin 0. On the DEC 3000/500 and DEC 3000/400, this pin is connected to logic that indicates external cache misses with victims. A victim is a data block that must be written back to main memory before it is reused.
Novictims (external pin 1): This counter is incremented once per external event supplied to external pin 1. On the DEC 3000/500 and DEC 3000/400, this pin is connected to logic that indicates external cache misses without victims.
Dcache: This counter is incremented once per primary data cache miss. Note, this counter is incremented each time a primary data cache probe does not complete in one cycle. This includes all misses, but also includes hits that are stalled for other reasons such as bus traffic holding previous misses pending.

Icache: This counter is incremented once per primary instruction cache miss.

Mispredicts: This counter is incremented once per incorrectly predicted branch.

Floatops: This counter is incremented once per floating-point instruction. Floating-point operate instructions do not include floating-point load, floating-point branch, and floating-point store instructions.

Intops: This counter is incremented once per integer operate instruction as well as once per load address and load address high instruction.
Conclusion

The newly developed tool FFTest makes it possible to evaluate the performance of FFT algorithms not only by measuring their run time. It also helps to establish detailed performance characteristics by counting performance relevant events, the most important being floating-point operations and cache misses. FFTest's ANSI-C compliance makes it easy to port the code to different platforms and to choose the fastest FFT algorithm available for the given computer system. Other features of FFTest include the ability to measure the accuracy of FFT programs, as well as their memory usage. Therefore an FFT algorithm's resource usage can be determined completely and the costs of the computation can be established. Another use would be the benchmarking of computer systems by running FFTest with the same set of algorithms on various architectures.
References

[1] M. Auer, R. Benedik, F. Franchetti, H. Karner, P. Kristoefel, A. Slatte, C. W. Ueberhuber, Machine Independent Serial FFT Routines: A Comparative Performance Study, AURORA Tech. Report, Institute for Applied and Numerical Mathematics, Technical University of Vienna, to appear.
[2] Digital Equipment Corp., Pfm The 21064 Performance Counter Pseudo-Device, DEC OSF/1 Manual pages, 1995.
[3] J. J. Dongarra, The Linpack Benchmark: An Explanation, in "Evaluating Supercomputers" (A. J. Van der Steen, Ed.), Chapman and Hall, London, 1990, pp. 1-21.
[4] J. J. Dongarra, Performance of Various Computers Using Standard Linear Equations Software, Tech. Report CS-89-85, Computer Science Dept., University of Tennessee, Knoxville, 1997.
[5] M. Galles, E. Williams, Performance Optimizations, Implementation, and Verification of the SGI Challenge Multiprocessor, Proceedings of the 27th Annual Hawaii International Conference on System Sciences, 1994.
[6] D. Hunt, Advanced Performance Features of the 64-bit PA8000, COMPCON'95, 1995.
[7] Intel Corp., Intel Architecture Software Developer's Manual, Volume 3: System Programming Guide, Order Number 243192, 1998.
[8] Intel Corp., Survey of Pentium Processor Performance Monitoring Capabilities & Tools, App. Note, 1996.
[9] H. Karner, M. Auer, C. W. Ueberhuber, Top Speed FFTs for FMA Architectures, AURORA Tech. Report TR1998-16, Institute for Applied and Numerical Mathematics, Technical University of Vienna, 1998.
[10] H. Karner, M. Auer, C. W. Ueberhuber, Optimum Complexity FFT Algorithms for RISC Processors, AURORA Tech. Report TR1998-03, Institute for Applied and Numerical Mathematics, Technical University of Vienna, 1998.
[11] H. Karner, M. Auer, C. W. Ueberhuber, FMA Optimized FFT Algorithms: An Empirical Performance Study, AURORA Tech. Report, Institute for Applied and Numerical Mathematics, Technical University of Vienna, to appear.
[12] T. Mathisen, Pentium Secrets, Byte Magazine, July 1994, pp. 191-192.
[13] MIPS Technologies Inc., R10000 Microprocessor Technical Brief, 1994.
[14] MIPS Technologies Inc., Definition of MIPS R10000 Performance Counter, 1997.
[15] C. W. Ueberhuber, Numerical Computation, Springer-Verlag, Berlin Heidelberg New York Tokyo, 1997.
[16] E. H. Welbon, C. C. Chan-Nui, D. J. Shippy, D. A. Hicks, POWER2 Performance Monitor. PowerPC and POWER2: Technical Aspects of the New IBM RISC System/6000, IBM Corporation SA23-2737, 1994, pp. 55-63.
[17] M. Zagha, B. Larson, S. Turner, M. Itzkowitz, Performance Analysis Using the MIPS R10000 Performance Counters, Proceedings Supercomputing'96, IEEE Computer Society Press, 1996.