An Abstraction Layer for SIMD Extensions

Franz Franchetti, Christoph W. Ueberhuber
AURORA TR2003-06
Institute for Applied Mathematics and Numerical Analysis Vienna University of Technology Wiedner Hauptstrasse 8-10, A-1040 Wien, Austria E-Mail:
[email protected] [email protected]
The work described in this report was supported by the Special Research Program SFB F011 “AURORA” of the Austrian Science Fund FWF.
Abstract

This paper presents an abstraction layer for short vector SIMD ISA extensions like Intel’s SSE, AMD’s 3DNow!, Motorola’s AltiVec, and IBM’s Double Hummer. It provides unified access to short vector instructions via intermediate level building blocks. These primitives are C macros that allow, for instance, portable and highly efficient implementations of discrete linear transforms like FFTs and DCTs. The newly developed API is built on top of recently introduced C language extensions that give access to short vector SIMD hardware from within C programs by means of data types and intrinsic or built-in functions. Empirical evidence of the success of the portable SIMD API is provided by means of program runs of short vector versions of Spiral and Fftw, resulting in the fastest FFT implementations to date.
Chapter 1
Introduction

A few years ago major vendors of general purpose microprocessors started to include short vector SIMD (single instruction, multiple data) extensions in their instruction set architectures to exploit the data level parallelism found in multimedia applications. Examples of SIMD extensions supporting both integer and floating-point operations include Intel’s SSE, AMD’s 3DNow!, and Motorola’s AltiVec extension. SIMD extensions are also included in the Double Hummer floating-point unit of IBM’s BG/L top performance machines. SIMD extensions have the potential to significantly speed up implementations in all areas where (i) performance is crucial, and (ii) the relevant algorithms exhibit fine grain parallelism.

Currently, most short vector extensions can be accessed from within the C language by means of language extensions that mirror the actual hardware using data types and intrinsic (or built-in) functions. However, all these language extensions are proprietary. No common API exists yet, neither across compiler vendors for a given extension nor across different extensions.

This paper describes a core technology that enables the portable and efficient support of short vector SIMD extensions within automatic performance tuning systems: a portable SIMD API. This API abstracts the details of the target vector extension and implements primitives which are the basic blocks upon which high performance numerical software can be built. The implementations of these basic operations differ heavily between different short vector extensions. The new SIMD API was used to develop short vector versions of the state-of-the-art numerical software systems Fftw and Spiral. Experiments show that the new versions of Fftw and Spiral achieve performance comparable to optimal hand-generated code.
1.1 Related Work

Apple Computer Inc. included the vDSP vendor library in its operating system Mac OS X. This library features a DFT implementation which supports the AltiVec extension of Motorola’s MPC 74xx G4 processors. Intel’s MKL provides support for SSE, SSE 2, and the Itanium processor family.
Previous work extended Fftw and Spiral to portably support short vector SIMD extensions [1, 3, 4, 5]. Fftw-gel is a short vector SIMD extension for Fftw supporting 3DNow! and SSE 2 [8].
1.2 Synopsis

Chapter 2 overviews Spiral and Fftw, the current state-of-the-art automatic performance tuning systems for discrete linear transforms. Chapters 3 and 4 introduce and define the portable SIMD API. Chapter 5 experimentally evaluates short vector versions of Spiral and Fftw which are based on the portable SIMD API.
Chapter 2
Automatic Performance Tuning Systems The increasing complexity of today’s computer systems and the need for portability and high performance spawned the development of automatic performance tuning systems. This new technology was introduced to automatically implement optimized programs for numerical computation. In the field of numerical linear algebra Atlas and PHiPAC automatically generate optimized basic linear algebra subroutines (BLAS). Fftw [6] and Spiral [9] are the most prominent examples of state-of-the-art software in the area of discrete linear transforms.
2.1 FFTW

Fftw (the fastest Fourier transform in the west) was the first software package to automatically generate optimized FFT code, using a special purpose compiler and the actual run time as optimization criterion [6]. Typically, Fftw runs faster than other publicly available FFT codes and nearly as fast as hand optimized vendor-supplied libraries across different machines. Fftw is based on a recursive implementation of the Cooley-Tukey FFT algorithm. The recursion stops when the remaining subproblems are solved using codelets, automatically generated routines that carry out the bulk of the computation. For a given problem size there are many different ways of computing the FFT, with potentially very different run times. Fftw applies dynamic programming, using measured run times of the alternatives as cost function, to find a highly efficient FFT implementation for a given problem size on a given machine.
2.2 SPIRAL

Spiral (signal processing algorithms implementation research for adaptive libraries) is a generator producing high performance code for discrete linear transforms like DFTs, discrete cosine transforms (DCTs), and many others [9]. Spiral uses a mathematical approach that translates the implementation problem of discrete linear transforms into an optimization problem in the space of
structurally different algorithms and their possible implementations to generate code that is adapted to the given computing platform. Spiral represents the many different algorithms for a transform as formulas in a concise mathematical language. These formulas are automatically generated and automatically translated into code, thus enabling an automated search process.
Chapter 3
A Portable SIMD API

A programming model for short vector SIMD extensions has been established recently. The C language is extended by new data types corresponding to the available registers, and the operations are mapped onto intrinsic or built-in functions [7]. Using this programming model, a programmer does not have to deal with assembly language; register allocation and instruction selection are done by the compiler. However, these interfaces are standardized neither across different compiler vendors on the same architecture nor across architectures. But for any current short vector SIMD architecture at least one compiler featuring such an interface is available (see Table 3.1).

Table 3.1: Short vector SIMD extensions providing single or double precision floating-point arithmetic found in general purpose microprocessors.

Vendor    Name                 n-way  Precision  Processor               Compiler
Intel     SSE                  4-way  single     Pentium III, Pentium 4  MS Visual C++, Intel C++ Compiler, Gnu C Compiler 3.0
Intel     SSE 2                2-way  double     Pentium 4               MS Visual C++, Intel C++ Compiler, Gnu C Compiler 3.0
Intel     IPF                  2-way  single     Itanium, Itanium 2      Intel C++ Compiler
AMD       3DNow!               2-way  single     K6                      MS Visual C++, Gnu C Compiler 3.0
AMD       Enhanced 3DNow!      2-way  single     K7, Athlon XP/MP        MS Visual C++, Gnu C Compiler 3.0
AMD       3DNow! Professional  4-way  single     Athlon XP, Athlon MP    MS Visual C++, Intel C++ Compiler, Gnu C Compiler 3.0
Motorola  AltiVec              4-way  single     MPC 74xx G4             Gnu C Compiler 3.0, Apple C Compiler 2.96
Careful analysis of the instructions required by the code generated for discrete linear transforms within the context of Spiral and Fftw made it possible to define a set of C macros (the portable SIMD API of this paper) that can be implemented efficiently on all current architectures and features all necessary operations. The main restriction turned out to be that across all short vector SIMD extensions only naturally aligned vectors can be loaded and stored highly efficiently. All other
memory access operations lead to performance degradation and possibly to prohibitive performance characteristics. The newly defined portable SIMD API as described in this paper serves two main purposes: (i) to abstract from hardware peculiarities, and (ii) to abstract from special compiler features. Accordingly, it provides a tool for the portable and efficient implementation of numerical software on modern processors.
3.1 Abstracting from Special Machine Features In the context of this paper all short vector SIMD extensions feature the functionality required in intermediate level building blocks. However, the implementation of such building blocks depends on special features of the target architecture. For instance, a complex reordering operation like a permutation has to be implemented using register-register permutation instructions provided by the target architecture. In addition, restrictions like aligned memory access have to be handled. Thus, a set of intermediate building blocks—called the portable SIMD API—has to be defined which (i) can be implemented on all current short vector SIMD architectures, and (ii) enables all discrete linear transforms to be built on top of them.
3.2 Abstracting from Special Compiler Features All compilers featuring a short vector SIMD C language extension provide the required functionality to implement the portable SIMD API. But syntax and semantics differ from platform to platform and from compiler to compiler. These specifics have to be hidden in the portable SIMD API.
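As an illustration, such compiler and platform specifics could be hidden behind conditional compilation. The following sketch is hypothetical: the macro name follows the paper’s API, but the dispatch structure and the feature-test macros are our own assumptions, not the paper’s actual implementation.

```c
#include <assert.h>

/* One API macro, several back ends selected at compile time.
 * Each branch defines the same vector type and the same VEC_ADD_P
 * macro in terms of the platform's native intrinsics. */
#if defined(__SSE__)
  #include <xmmintrin.h>
  typedef __m128 simd_vector;                 /* 4-way single precision */
  #define VEC_ADD_P(v, v0, v1)  ((v) = _mm_add_ps((v0), (v1)))
#elif defined(__ALTIVEC__)
  #include <altivec.h>
  typedef vector float simd_vector;           /* 4-way single precision */
  #define VEC_ADD_P(v, v0, v1)  ((v) = vec_add((v0), (v1)))
#else
  #error "no supported short vector SIMD extension found"
#endif
```

Client code written against VEC_ADD_P then compiles unchanged on either platform; only the header selects the back end.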
Chapter 4
Definition of the Portable SIMD API The portable SIMD API includes macros of four types: (i) data types, (ii) constant handling, (iii) arithmetic operations, and (iv) extended memory operations. An overview of the provided macros is given below. All examples of such macros displayed in this paper suppose a two-way or four-way short vector SIMD extension. However, the portable SIMD API can be extended easily to an arbitrary vector length ν.
4.1 Data Types The portable SIMD API introduces three data types, which are all naturally aligned: (i) Real numbers of type float or double (depending on the extension) have type simd_real. (ii) Complex numbers of type simd_complex are pairs of simd_real elements. (iii) Vectors of type simd_vector are vectors of ν elements of type simd_real. For two-way short vector SIMD extensions the type simd_complex is equal to simd_vector.
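For concreteness, the three data types could be instantiated as follows for a 4-way single-precision extension (SSE). The type names follow the paper; the concrete SSE definitions are our own assumption.

```c
#include <assert.h>      /* for the usage check below */
#include <xmmintrin.h>   /* SSE intrinsics */

#define SIMD_VECTOR_LENGTH 4                /* nu = 4 for SSE */

typedef float simd_real;                    /* scalar element type */
typedef struct { simd_real re, im; } simd_complex;  /* pair of reals */
typedef __m128 simd_vector;                 /* nu naturally aligned reals */
```

For a two-way extension such as SSE 2, simd_real would be double, SIMD_VECTOR_LENGTH would be 2, and simd_complex would coincide with simd_vector, as stated above.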
4.2 Constant Handling

The portable SIMD API provides declaration macros for the following types of constants whose values are known at compile time: (i) the zero vector, (ii) homogeneous vector constants (whose components all have the same value), and (iii) inhomogeneous vector constants (whose components may have different values). Three types of constant load operations have been introduced: (i) load a constant (both homogeneous and inhomogeneous) that is known at compile time, (ii) load a constant vector (both homogeneous and inhomogeneous) that is precomputed at run time (but not known at compile time), and (iii) load a precomputed constant real number and build a homogeneous vector constant with that value. See Table 4.1 for some examples.
Table 4.1: Some of the constant handling operations of the portable SIMD API.

Macro                                  Type
DECLARE_CONST(name, r)                 compile time homogeneous
DECLARE_CONST_2(name, r0, r1)          compile time inhomogeneous
DECLARE_CONST_4(name, r0, r1, r2, r3)  compile time inhomogeneous
LOAD_CONST(name)                       compile time
LOAD_CONST_SCALAR(*r)                  precomputed homogeneous real
LOAD_CONST_VECT(*v)                    precomputed vector
SIMD_SET_ZERO()                        compile time homogeneous
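A possible SSE realization of three of these macros is sketched below. The aligned-array representation of compile time constants is our own assumption, not the paper’s actual implementation.

```c
#include <assert.h>
#include <xmmintrin.h>

typedef __m128 simd_vector;

/* DECLARE_CONST declares a naturally aligned homogeneous constant
 * known at compile time; LOAD_CONST brings it into a vector register;
 * SIMD_SET_ZERO yields the zero vector. */
#define DECLARE_CONST(name, r) \
    static const float name[4] __attribute__((aligned(16))) = { (r), (r), (r), (r) }
#define LOAD_CONST(name)  _mm_load_ps(name)
#define SIMD_SET_ZERO()   _mm_setzero_ps()
```

A homogeneous constant would then be declared once and loaded wherever needed, e.g., DECLARE_CONST(simd_half, 0.5f); followed by simd_vector v = LOAD_CONST(simd_half);.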
4.3 Arithmetic Operations

The portable SIMD API provides real addition, subtraction, and multiplication operations, the unary minus, two types of fused multiply-add operations, and a complex multiplication. For each operation both variants exist: macros that modify a parameter and macros that return the result. See Table 4.2 for a summary.

Table 4.2: Some of the arithmetic operations provided by the portable SIMD API.

Macro                                 Operation
VEC_ADD_P(v, v0, v1)                  v = v0 + v1
VEC_SUB_P(v, v0, v1)                  v = v0 - v1
VEC_UMINUS_P(v, v0)                   v = -v0
VEC_MUL_P(v, v0, v1)                  v = v0 * v1
VEC_MADD_P(v, v0, v1, v2)             v = v0 * v1 + v2
VEC_NMSUB_P(v, v0, v1, v2)            v = -(v0 * v1 - v2)
COMPLEX_MULT(v0, v1, v2, v3, v4, v5)  v0 = v2 * v4 - v3 * v5, v1 = v2 * v5 + v3 * v4
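The table entries can be sketched for SSE as follows. SSE has no fused multiply-add instruction, so VEC_MADD_P has to be expressed as a multiply followed by an add (on AltiVec the same macro would map to the native vec_madd); the concrete definitions are our own assumptions.

```c
#include <assert.h>
#include <xmmintrin.h>

#define VEC_ADD_P(v, v0, v1)      ((v) = _mm_add_ps((v0), (v1)))
#define VEC_SUB_P(v, v0, v1)      ((v) = _mm_sub_ps((v0), (v1)))
#define VEC_MUL_P(v, v0, v1)      ((v) = _mm_mul_ps((v0), (v1)))
#define VEC_MADD_P(v, v0, v1, v2) ((v) = _mm_add_ps(_mm_mul_ps((v0), (v1)), (v2)))

/* Lane-wise complex multiplication on split real/imaginary vectors:
 * (v0 + i v1) = (v2 + i v3) * (v4 + i v5). */
#define COMPLEX_MULT(v0, v1, v2, v3, v4, v5)              \
    do {                                                  \
        (v0) = _mm_sub_ps(_mm_mul_ps((v2), (v4)),         \
                          _mm_mul_ps((v3), (v5)));        \
        (v1) = _mm_add_ps(_mm_mul_ps((v2), (v5)),         \
                          _mm_mul_ps((v3), (v4)));        \
    } while (0)
```

Keeping real and imaginary parts in separate vectors avoids the intra-register shuffles that an interleaved complex format would require on SSE.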
4.4 Extended Memory Operations

The portable SIMD API features three types of memory operations. All vector reordering operations are part of the memory access operations, as all permutations are joined with load or store operations. The three types of memory access operations are: (i) plain vector load and store, (ii) vector memory access plus permutation, and (iii) generic memory access plus permutation. The semantics of all load operations is to load data from memory locations, which are not necessarily aligned or contiguous vectors, into a set of vector variables (for an overview see Table 4.3). The semantics of all store operations is to store a set of vector variables at specific memory locations, which again are not necessarily aligned or contiguous vectors.
Table 4.3: Load and store operations supported by the portable SIMD API for two-way short vector SIMD extensions.

Macro                          Access  Permutation
LOAD_VECT(t, *v)               vector  I_nu
LOAD_J(t, *v)                  vector  J_nu
LOAD_L_4_2(t0, t1, *v0, *v1)   vector  L^4_2
LOAD_R_2(t, *r0, *r1)          real    implicit
STORE_VECT(*v, t)              vector  I_nu
STORE_J(*v, t)                 vector  J_nu
STORE_L_4_2(*v0, *v1, t0, t1)  vector  L^4_2
STORE_R_2(*r0, *r1, t)         real    implicit
4.4.1 Vector Memory Access

The macros LOAD_VECT and STORE_VECT load or store naturally aligned vectors from or to memory, respectively. These are the most efficient operations for memory access.

Vector Memory Access plus Permutation. A basic set of permutations of n vector variables is supported. A load operation loads n vectors from memory, permutes them accordingly, and stores them into n vector variables. A store operation first permutes and then stores the vector variables. The supported permutations include stride permutations (denoted by L^{m nu}_n [2]) applied to the elements of k vectors, and reversing the order of vector elements (denoted by J). In addition, the identity permutation yields standard vector memory access. The load and store macros are defined according to these permutations. This leads, for instance, to macros like LOAD_L_4_2 and LOAD_L_16_4 for stride permutations, and STORE_J_4 for reversing a vector. Details on these permutations can be found in [2].
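As an illustration, the stride permutation load could be realized for a two-way double-precision extension (SSE 2) with unpack instructions; the implementation below is our own sketch, not the paper’s actual code.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE 2 intrinsics */

/* LOAD_L_4_2 for a two-way extension: two naturally aligned vectors
 * (a0, a1) and (a2, a3) are loaded, and the stride permutation L^4_2
 * yields t0 = (a0, a2) and t1 = (a1, a3). */
#define LOAD_L_4_2(t0, t1, v0, v1)                        \
    do {                                                  \
        __m128d x_ = _mm_load_pd((const double *)(v0));   \
        __m128d y_ = _mm_load_pd((const double *)(v1));   \
        (t0) = _mm_unpacklo_pd(x_, y_);                   \
        (t1) = _mm_unpackhi_pd(x_, y_);                   \
    } while (0)
```

The permutation costs only two register-register unpack operations on top of the two aligned vector loads, which is why this class of macros remains efficient on all target architectures.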
4.4.2 Generic Memory Access plus Permutation

These macros are a generalization of the vector memory access macros. They allow the implementation of general permutations which are not directly supported by all short vector SIMD extensions. Instead of accessing whole vectors, these macros imply memory access at the level of real or complex numbers. Depending on the underlying hardware, these operations may require scalar, subvector, or vector memory access. The performance of such permutations depends strongly on the target architecture. For instance, on SSE, properly aligned memory access for complex numbers does not degrade performance significantly. For two-way short vector SIMD extensions
complex numbers are native vectors. On AltiVec, however, these memory operations feature prohibitive performance characteristics, as a single such macro may require a large number of vector memory access and permutation operations. These operations lead to macros like LOAD_L_4_2_R, which loads four real numbers from arbitrary locations and then performs an L^4_2 permutation, or STORE_8_4_C, which performs an L^8_4 permutation and then stores pairs of real numbers properly aligned at arbitrary locations.
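A possible SSE 2 version of such a generic macro is sketched below; the gather-by-scalar strategy is our own assumption, and on AltiVec the same macro would expand to the much more expensive vector load and permute sequences described above.

```c
#include <assert.h>
#include <emmintrin.h>

/* LOAD_L_4_2_R: gather four reals r0..r3 from arbitrary, possibly
 * unaligned addresses and apply the permutation L^4_2, giving
 * t0 = (r0, r2) and t1 = (r1, r3).  Note that _mm_set_pd takes the
 * high element first, hence the reversed argument order. */
#define LOAD_L_4_2_R(t0, t1, r0, r1, r2, r3)  \
    do {                                      \
        (t0) = _mm_set_pd(*(r2), *(r0));      \
        (t1) = _mm_set_pd(*(r3), *(r1));      \
    } while (0)
```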
Chapter 5
Experimental Results

Using the portable SIMD API, Fftw has been extended to utilize short vector SIMD instructions when computing complex-to-complex DFTs. This Fftw extension supports arbitrary data strides, and problem sizes are not restricted to powers of two. Various algorithms have been included in Fftw [1, 3] to take advantage of short vector SIMD extensions. The Spiral system has been extended to produce short vector SIMD code for all discrete linear transforms supported by the original version of Spiral. Both a method for arbitrary discrete linear transforms and a DFT specific method have been implemented [4, 5].

This section presents performance results obtained using the methods mentioned above, which are all based on the portable SIMD API. Numerical tests were run on a 2.53 GHz Intel Pentium 4 and on a 400 MHz Motorola PPC 7400 G4. The Intel C++ compiler 6.0 was used for the Pentium 4 and an AltiVec-enabled version of the Gnu C compiler for the Motorola processor. Due to architectural implications, the theoretical speed-up achievable by vectorization is limited to a factor of four for SSE on the Pentium 4 as well as for AltiVec on the MPC 7400 G4, and to a factor of two for SSE 2 on the Pentium 4. FFT performance data are displayed in pseudo-flop/s, i.e., 5N log N/T, a scaled inverse of the run time T which preserves run time relations and gives an indication of the absolute performance [6]. Performance data of the two-dimensional DCTs are displayed in Gflop/s.
5.1 Intel SSE

On the Pentium 4 the new single-precision SSE version of Spiral is compared to the scalar versions of Fftw, Fftpack, and Intel’s hand-optimized MKL 5.1 (see Figure 5.1). The SSE version of Spiral is up to three times faster than the scalar Fftw version and most of the time even faster than the hand-coded MKL routine. It is about six times faster than the industry standard routine cfftf of Fftpack. Figure 5.2 shows the performance of double-precision two-dimensional DCT routines generated by Spiral for the Pentium 4. For some dimensions the SSE 2 version is nearly twice as fast as the scalar version.
Figure 5.1: Floating-point performance of the Spiral SIMD version for SSE (four-way single-precision) compared to Fftpack, Fftw 2.1.3, and Intel’s MKL on an Intel Pentium 4 running at 2.53 GHz.
Figure 5.2: Floating-point performance of the Spiral SIMD version for SSE 2 (two-way double-precision) compared to the scalar Spiral version on an Intel Pentium 4 running at 2.53 GHz.
5.2 Motorola AltiVec

On the Motorola PowerPC 7400 the scalar version of Fftw was compared to the newly developed AltiVec version of Fftw. Figure 5.3 shows the performance of the respective single-precision routines. The AltiVec version of Fftw is up to 2.5 times faster than the scalar version.
Figure 5.3: Floating-point performance of the Fftw SIMD version for AltiVec (four-way single-precision) compared to Fftw 2.1.3 on a 400 MHz Motorola PowerPC 7400.
Conclusion

This paper presents a portable short vector SIMD API. This new tool enables the development of portable and efficient short vector implementations of state-of-the-art numerical software. The new SIMD API was used to develop portable short vector versions of the automatic performance tuning systems Spiral and Fftw. The new versions are the fastest FFT implementations on Intel Pentium 4 processors and among the fastest implementations of discrete linear transforms across various short vector extensions.
Acknowledgement

We would like to thank the Spiral team and the Fftw team for our long-lasting and highly successful cooperation. In addition, we would like to acknowledge the financial support of the Austrian Science Fund FWF.
Bibliography

[1] Franchetti, F.: A portable short vector version of Fftw. In: Proc. MATHMOD, ARGESIM-Verlag, Vienna (2003)

[2] Franchetti, F.: Performance Portable Short Vector Transforms. PhD thesis, Institute for Applied Mathematics and Numerical Analysis, Vienna University of Technology, Aurora Technical Report TR2003-01 (2003)

[3] Franchetti, F., Karner, H., Kral, S., Ueberhuber, C.W.: Architecture independent short vector FFTs. In: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP’01). Volume 2. IEEE Press, New York (2001) 1109–1112

[4] Franchetti, F., Püschel, M.: A SIMD vectorizing compiler for digital signal processing algorithms. In: Proc. International Parallel and Distributed Processing Symposium (IPDPS’02) (2002)

[5] Franchetti, F., Püschel, M.: Short vector code generation for the discrete Fourier transform. In: Proceedings of the 17th International Parallel and Distributed Processing Symposium (IPDPS’03) (to appear)

[6] Frigo, M., Johnson, S.G.: Fftw: An adaptive software architecture for the FFT. In: Proc. ICASSP’98. Volume 3. (1998) 1381–1384

[7] Intel Corporation: Intel C/C++ compiler user’s guide (2002)

[8] Kral, S., Franchetti, F., Lorenz, J., Ueberhuber, C.W.: SIMD vectorization of straight line code. EuroPar’03 (submitted)

[9] Püschel, M., Singer, B., Xiong, J., Moura, J.M.F., Johnson, J., Padua, D., Veloso, M., Johnson, R.W.: Spiral: A generator for platform-adapted libraries of signal processing algorithms. Journal of High Performance Computing and Applications (2001) (submitted)