Improving Large Vector Operations with C++ Expression Template and ATLAS

L. Plagne and F. Hülsemann
EDF R&D, 1 av. du Général de Gaulle, F-92141 Clamart, France
[email protected],
[email protected]
Abstract. This paper describes a short and simple way of improving the performance of vector operations (e.g. X = aY + bZ + ..) applied to large vectors. The principle is to take advantage of the high performance vector copy operation provided by the ATLAS library [1], used as a kernel for a C++ Expression Template (ET) mechanism. The proposed ET implementation, which involves a simple blocking technique, leads to a significant performance increase compared to existing implementations (up to 50%) and extends the scope of ATLAS.
1 Introduction
FORTRAN is a widely used language specialized for developing scientific software. It offers native types to handle multidimensional arrays of floating-point elements, and the implementations built on top of these types are fast and expressive. For Linear Algebra (LA) computations, the Basic Linear Algebra Subprograms (BLAS) [2] are commonly used in FORTRAN codes to achieve maximal performance on specific target architectures. Within this procedural language, libraries like the BLAS consist of collections of subroutines with rigid and rather cumbersome signatures involving numerous arguments. In contrast to FORTRAN, C++ is a powerful general-purpose language that supports multiple programming paradigms. Hence, C++ scientific software development requires selecting among several possible strategies. These multiple design choices require from the development team a rather strong experience in computer science, in addition to mastery of the scientific domain at hand. On the other hand, the power of C++ allows creating generic libraries and embedded Domain Specific Languages (DSL) that can enhance the quality of the implementation while dramatically reducing its size. The software development process can then be split into two consecutive stages:
– Identify and develop the DSL embedded in C++.
– Develop the scientific application on top of this DSL.
In this paper we propose a simplified C++ implementation of a LA vector class that allows composing compact and abstract vector expressions such as:

X = a*Y + b*(Z + W);
(1)
This implementation is based on the Expression Template (ET) mechanism introduced by Veldhuizen [3] and Vandevoorde [4]. ET allows avoiding temporary vectors, so that the performance of an abstract expression like (1) competes with that of the corresponding low-level (loop-based) basic implementation such as:

for (i = 0; i < N; i++) X[i] = a*Y[i] + b*(Z[i] + W[i]);
(2)
Our vector class implementation relies on the Curiously Recurring Template Pattern (CRTP) [5, 4] and makes use of both the object-oriented and generic programming paradigms. Our contribution in this paper is a performance improvement of vector expressions obtained by combining our ET-based vector class with the ATLAS [1] implementation of the procedural BLAS library. The high performance ATLAS implementation of the vector copy operation, which relies partly on low-level assembly language, is used as a kernel for the vector ET evaluation. The resulting vector class allows composing abstract vector expressions like (1) that achieve better performance than both loop-based implementations such as (2) and off-the-shelf vector libraries such as Blitz++ [6], uBLAS [7] or std::valarray. This performance increase reaches 50% for large vectors. The paper is organized as follows: Section 2 presents the considered vector operations and the different BLAS implementations. Section 3 presents the different performance regions for typical BLAS operations and focuses on the ATLAS-accelerated copy operation for large vectors. Section 4 gives a short description of our C++ vector class implementation based on ET and CRTP. Performance measurements show that this implementation avoids abstraction penalties. Section 5 presents our enhanced ET vector class relying on the ATLAS dcopy kernel. Performance measurements are carried out on three different architectures. We conclude in Section 6.
2 Vector Operations

2.1 Linear Algebra Vectors
Let us first define the scope of this paper and what we refer to as vector operations. From the linear algebra point of view, vectors can be defined as indexed collections of numerical elements of the same type. Indexed means that the value of every vector element can be accessed through an integer index chosen in a given range. In STL terms, this means that a random access iterator can be defined. Here is a list of the element types most commonly encountered in the linear algebra community:
– real of given floating point precision,
– complex of given floating point precision,
– integer of given size,
– vector if one deals with the multidimensional case.
2.2 Considered Vectors and C++/C/F77 Arrays
While a wide variety of linear algebra vector types (sparse, multidimensional, ...) can be considered, we will focus on simple vector types where real elements (single and double precision) are stored in basic containers that can be defined and exchanged through the following common programming languages: F77, C and C++. Within these languages, the location of a contiguous memory region containing a given number of floating point elements can be manipulated either as a pointer type (C and C++) or as an array type (F77). These arrays are the main input/output types for the Basic Linear Algebra Subprograms (BLAS) API [2].

2.3 The BLAS API
The BLAS API is a widely used interface in the dense linear algebra community. This interface defines a large collection of subroutines that are separated into three main groups:
– Level 1 BLAS contains subroutines acting on real and complex vectors (X^T Y, Y ← X, Y ← αX + Y, ...).
– Level 2 BLAS contains subroutines acting on matrix-vector operations (Y ← αA*X + βY, ...).
– Level 3 BLAS contains subroutines acting on matrix-matrix operations (C ← αA*B + βC, ...).
In this paper we will focus on Level 1 BLAS subroutines. Although a default implementation of the BLAS API exists [8], it is important to understand that the fundamental goal of the BLAS is to provide a unified interface for these subroutines. Numerical analysts can write their own codes on top of the BLAS in order to reach high performance while remaining independent of a specific architecture. The concrete implementations of the BLAS are provided by high performance libraries that are specialized for a given target architecture. These libraries can be grouped into three categories:
– Vendor-tuned BLAS libraries, which are produced by the architects of a given target micro-processor (e.g. Intel MKL for Intel chips [9], AMD ACML for AMD chips [10]).
– Independently tuned BLAS libraries, which can compete in certain cases with the first category (e.g. GOTO-BLAS [11], mini-SSE-L1-BLAS [12]).
– Autotuned BLAS libraries. The ATLAS project [1] provides an open-source implementation that adapts automatically to a given architecture at the installation/compilation stage and achieves performance that competes with the best available libraries.
2.4 Performance Measurements and Target Architecture Description
This paper is based on performance measurements that have been carried out on three different target architectures. Table 1 gives a short description of the main features of these machines. In the following, the performance curves will be named after these three architectures. Most of the time, the different targets exhibit the same kind of performance behavior; in this case, we will report only the Pentium M curves. The performance measurements are carried out with the tools developed by the BTL project [13]. Three different compilers have been used:
– gnu gcc 3.3.5
– gnu gcc 4.1.1
– Intel ICC 9.1

Table 1. Description of the three target architectures for performance measurements. The bandwidth figures are the results of the Stream Benchmark tool [14].

  Name       processor                   frequency (MHz)  L2 Cache size (KB)  Bandwidth (MB/s)
  Pentium M  Intel(R) Pentium(R) M       1595             1024                 932
  BiOpteron  Dual Core AMD Opteron(tm)   1808             1024                2557
  BiXeon     Intel(R) Xeon(TM)           2658              512                1848
Most of the time the performance results are very close and we report only one measurement set. As an example, Fig. 1 shows the different performances of a matrix-matrix product operation (L3 BLAS dgemm) obtained on the Pentium M. The C results correspond to a direct implementation in C/C++ language (gcc 3.3.5):

  double sum;
  for (int i=0;i<N;i++)
    for (int j=0;j<N;j++){
      sum=0.0;
      for (int k=0;k<N;k++)
        sum+=A[i][k]*B[k][j];
      C[i][j]=sum;
    }

A.2 VectorExpression

The template class VectorExpression is implemented as follows:

  template <class LEFT, class OP, class RIGHT>
  class VectorExpression :
    public BaseVector< VectorExpression<LEFT,OP,RIGHT> > {
  public:
    typedef const VectorExpression & StoreType;
    typedef typename RIGHT::ElementType ElementType;
    typedef typename RIGHT::SizeType SizeType;
    VectorExpression(const BaseVector<LEFT> & left,
                     const BaseVector<RIGHT> & right):
      left_(left.getCDR()),
      right_(right.getCDR()){}
    inline ElementType operator[]( SizeType i) const {
      ElementType result=OP::apply(left_[i],right_[i]);
      return result;
    }
    inline SizeType size( void ) const { return right_.size();}
  private:
    typename LEFT::StoreType left_;
    typename RIGHT::StoreType right_;
  };
The most important feature of this class is its operator[], which returns the result of the binary operator OP (Add or Minus) applied to the stored operands left_ and right_. Note that the VectorExpression constructor does not imply any actual calculation. The operators Add and Minus are implemented as follows:
  struct Add{
    template <class T>
    static inline T apply(T left, T right){
      T result=left+right;
      return result;
    }
  };
  struct Minus{
    template <class T>
    static inline T apply(T left, T right){
      T result=left-right;
      return result;
    }
  };
A.3 VectorScalarExpression
An instance of the class VectorScalarExpression is created when the * operator is applied to a scalar and a BaseVector object:

  template <class L>
  VectorScalarExpression<L> operator * (const BaseVector<L> & v,
                                        const typename L::ElementType & a){
    return VectorScalarExpression<L>(v,a);
  }
  template <class R>
  VectorScalarExpression<R> operator * (const typename R::ElementType & a,
                                        const BaseVector<R> & v){
    return VectorScalarExpression<R>(v,a);
  }
The template class VectorScalarExpression is very similar to the VectorExpression one:

  template <class V>
  class VectorScalarExpression :
    public BaseVector< VectorScalarExpression<V> > {
  public:
    typedef VectorScalarExpression StoreType;
    typedef typename V::ElementType ElementType;
    typedef typename V::SizeType SizeType;
    VectorScalarExpression(const BaseVector<V> & v,
                           const ElementType & a):a_(a),
                                                  v_(v.getCDR()){}
    inline ElementType operator[]( SizeType i) const {
      ElementType result=a_*v_[i];
      return result;
    }
    inline SizeType size( void ) const { return v_.size();}
  private:
    ElementType a_;
    typename V::StoreType v_;
  };
and again provides the operator[] that performs the actual calculation. As before, this calculation is not performed at the VectorScalarExpression construction stage.

A.4 Usage and Schematic Parsing Process
The Expression Template mechanism used in the Vector class allows the user to write arbitrarily complex vector expressions like:

  const int size=10000000;
  Vector<double> X(size,1.0);
  Vector<double> Y(size,2.0);
  Vector<double> Z(size,3.0);
  Vector<double> R(size,0.0);
  R=2.0*X+2.0*(Y-Z*2.0);
Here we give a schematic representation of the compiler parsing of this expression, where V, S and X stand for Vector, VectorScalarExpression and VectorExpression respectively:

1. Expression Parsing:

  2*X          -> s1 = S(2,X)
  Z*2          -> s2 = S(2,Z)
  Y - s2       -> x1 = X(Y,s2)
  2*(Y - Z*2)  -> s3 = S(2,x1)
  s1 + s3      -> right = X(s1,s3)

2. Vector operator =: R = right

  for (int i...)
    R[i] = right[i]
         = s1[i] + s3[i]
         = 2*X[i] + 2*x1[i]
         = 2*X[i] + 2*(Y[i] - s2[i])
         = 2*X[i] + 2*(Y[i] - 2*Z[i])

In this case, the C++ code, which implies only one loop and no temporary vectors, should be equivalent (from the performance point of view) to the direct loop-based implementation:

  for (int i=0 ; i < size ; i++)
    R[i] = 2.0*X[i]+2.0*(Y[i]-Z[i]*2.0);
2. Vector operator =: R = right for (int i...) R[i] = right[i] = s1 [i] + s3 [i] = 2 ∗ X[i] + 2 ∗ x1 [i] = 2 ∗ X[i] + 2 ∗ (Y[i] + s2 1[i]) = 2 ∗ X[i] + 2 ∗ (Y[i] + 2 ∗ Z[i]) In this case, the C++ code, that implies only one loop and no temporary vectors, should be equivalent (from the performance point of view) to the direct loop-based implementation: for (int i=0 ; i < size ; i++) 2.0*X[i]+2.0*(Y[i]-Z[i]*2.0);