Improving Large Vector Operations with C++ Expression Template and ATLAS

L. Plagne and F. Hülsemann
EDF R&D, 1 av. du Général de Gaulle, F-92141 Clamart, France
[email protected],
[email protected]
Abstract. This paper describes a short and simple way of improving the performance of vector operations (e.g. X = aY + bZ + ..) applied to large vectors. The principle is to take advantage of the high performance vector copy operation provided by the ATLAS library [1], used as a kernel for a C++ Expression Template (ET) mechanism. The proposed ET implementation, which involves a simple blocking technique, leads to a significant performance increase compared to existing implementations (up to 50%) and extends the scope of ATLAS.
1 Introduction
FORTRAN is a widely used language specialized for developing scientific software. It offers native types to handle multidimensional arrays of floating-point elements, and the implementations built on top of these types are fast and expressive. For Linear Algebra (LA) computations, the Basic Linear Algebra Subprograms (BLAS) [2] are commonly used in FORTRAN codes to achieve maximal performance on specific target architectures. Within this procedural language, libraries like the BLAS consist of collections of subroutines with rigid and rather cumbersome signatures involving numerous arguments. In contrast to FORTRAN, C++ is a powerful general-purpose language that supports multiple programming paradigms. Hence, C++ scientific software development requires selecting among several possible strategies. These multiple design choices require from the development team a rather strong experience in computer science, in addition to mastery of the scientific domain at hand. On the other hand, the power of C++ allows creating generic libraries and embedded Domain Specific Languages (DSL) that can enhance the quality of the implementation while dramatically reducing its size. The software development process can then be split into two consecutive stages:
– Identify and develop the DSL embedded in C++.
– Develop the scientific application on top of this DSL.
In this paper we propose a simplified C++ implementation of a LA vector class that allows composing compact and abstract vector expressions such as:

X = a*Y + b*(Z + W);
(1)
This implementation is based on the Expression Template (ET) mechanism introduced by Veldhuizen [3] and Vandevoorde [4]. ET allows avoiding temporary vectors, so that the performance of an abstract expression like (1) competes with that of the corresponding low-level (loop-based) basic implementation such as:

for (i = 0; i < N; i++) X[i] = a*Y[i] + b*(Z[i] + W[i]);
(2)
Our vector class implementation relies on the Curiously Recurring Template Pattern (CRTP) [5, 4] and makes use of both the object-oriented and generic programming paradigms. Our contribution in this paper is a performance improvement of vector expressions obtained by combining our ET-based vector class with the ATLAS [1] implementation of the procedural BLAS library. The high performance ATLAS implementation of the vector copy operation, which relies partly on low-level assembly language, is used as a kernel for the vector ET evaluation. The resulting vector class allows composing abstract vector expressions like (1) that achieve better performance than both loop-based implementations such as (2) and off-the-shelf vector libraries such as Blitz++ [6], uBLAS [7] or std::valarray. This performance increase reaches 50% for large vectors. The paper is organized as follows: Section 2 presents the considered vector operations and the different BLAS implementations. Section 3 presents the different performance regions for typical BLAS operations and focuses on the ATLAS-accelerated copy operation for large vectors. Section 4 gives a short description of our C++ vector class implementation based on ET and CRTP. Performance measurements show that this implementation avoids abstraction penalties. Section 5 presents our enhanced ET vector class relying on the ATLAS dcopy kernel. Performance measurements are carried out on three different architectures. We conclude in Section 6.
2 Vector Operations

2.1 Linear Algebra Vectors
Let us first define the scope of this paper and what we refer to as vector operations. From the linear algebra point of view, vectors can be defined as indexed collections of numerical elements of the same type. Indexed means that the value of every vector element can be accessed through an integer index chosen in a given range. In STL terms, this means that a random access iterator can be defined. Here is a list of the element types most commonly encountered in the linear algebra community:
– real of given floating point precision,
– complex of given floating point precision,
– integer of given size,
– vector if one deals with the multidimensional case.
2.2 Considered Vectors and C++/C/F77 Arrays
While a wide variety of linear algebra vector types (sparse, multidimensional, ...) can be considered, we will focus on simple vector types where real elements (single and double precision) are stored in basic containers that can be defined and exchanged through the following common programming languages: F77, C and C++. Within these languages, the location of a contiguous memory region containing a given number of floating point elements can be manipulated either as a pointer type (C and C++) or as an array type (F77). These arrays are the main input/output types for the Basic Linear Algebra Subprograms (BLAS) API [2].

2.3 The BLAS API
The BLAS API is a widely used interface in the dense linear algebra community. This interface defines a large collection of subroutines that are separated into three main groups:
– Level 1 BLAS contains subroutines acting on real and complex vectors (X^T Y, Y ← X, Y ← αX + Y, ...).
– Level 2 BLAS contains subroutines acting on matrix-vector operations (Y ← αA*X + βY, ...).
– Level 3 BLAS contains subroutines acting on matrix-matrix operations (C ← αA*B + βC, ...).
In this paper we will focus on Level 1 BLAS subroutines. Although a default implementation of the BLAS API exists [8], it is important to understand that the fundamental goal of the BLAS is to provide a unified interface for these subroutines. Numerical analysts can write their own codes on top of the BLAS in order to reach high performance while remaining independent of a specific architecture. The concrete implementations of the BLAS are provided by high performance libraries that are specialized for a given target architecture. These libraries can be grouped into three categories:
– Vendor-tuned BLAS libraries, which are produced by the architects of a given target micro-processor (e.g. Intel MKL for Intel chips [9], AMD ACML for AMD chips [10]).
– Independently tuned BLAS libraries, which can compete in certain cases with the first category (e.g. GOTO-BLAS [11], mini-SSE-L1-BLAS [12]).
– Autotuned BLAS libraries. The ATLAS project [1] provides an open-source implementation that adapts automatically to a given architecture at the installation/compilation stage and achieves performance that competes with the best available libraries.
2.4 Performance Measurements and Target Architecture Description
This paper is based on performance measurements that have been carried out on three different target architectures. Table 1 gives a short description of the main features of these machines. In the following, the performance curves will be named after these three architectures. Most of the time, the different targets exhibit the same kind of performance behavior; in this case, we will report only the Pentium M curves. The performance measurements are carried out with the tools developed by the BTL project [13]. Three different compilers have been used:
– gnu gcc 3.3.5
– gnu gcc 4.1.1
– Intel ICC 9.1

Table 1. Description of the three target architectures for performance measurements. The bandwidth figures are the results of the Stream Benchmark tool [14].

  Name       processor                   frequency (MHz)  L2 Cache size (KB)  Bandwidth (MB/s)
  Pentium M  Intel(R) Pentium(R) M       1595             1024                 932
  BiOpteron  Dual Core AMD Opteron(tm)   1808             1024                2557
  BiXeon     Intel(R) Xeon(TM)           2658              512                1848
Most of the time the performance results are very close and we report only one measurement set. As an example, Fig. 1 shows the different performances of a matrix-matrix product operation (L3 BLAS dgemm) obtained on the Pentium M. The C results correspond to a direct implementation in C/C++ language (gcc 3.3.5):

  double sum;
  for (int i=0;i<N;i++)
    for (int j=0;j<N;j++){
      sum=0.0;
      for (int k=0;k<N;k++)
        sum+=A[i][k]*B[k][j];
      C[i][j]=sum;
    }

A.2 VectorExpression

The template class VectorExpression is implemented as follows:

  template <class LEFT, class OP, class RIGHT>
  class VectorExpression :
    public BaseVector< VectorExpression<LEFT,OP,RIGHT> > {
  public:
    typedef const VectorExpression & StoreType;
    typedef typename RIGHT::ElementType ElementType;
    typedef typename RIGHT::SizeType SizeType;
    VectorExpression(const BaseVector<LEFT> & left,
                     const BaseVector<RIGHT> & right):
      left_(left.getCDR()),
      right_(right.getCDR()){}
    inline ElementType operator[]( SizeType i) const {
      ElementType result=OP::apply(left_[i],right_[i]);
      return result;
    }
    inline SizeType size( void ) const { return right_.size();}
  private:
    typename LEFT::StoreType left_;
    typename RIGHT::StoreType right_;
  };
The most important feature of this class is its operator[], which returns the result of the binary operator OP (Add or Minus) applied to the stored operands left_ and right_. Note that the VectorExpression constructor does not imply any actual calculation. The operators Add and Minus are implemented as follows:
  struct Add{
    template <class T>
    static inline T apply(T left, T right){
      T result=left+right;
      return result;
    }
  };
  struct Minus{
    template <class T>
    static inline T apply(T left, T right){
      T result=left-right;
      return result;
    }
  };
A.3 VectorScalarExpression
An instance of the class VectorScalarExpression is created when the * operator is applied to a scalar and a BaseVector object:

  template <class L>
  VectorScalarExpression<L> operator * (const BaseVector<L> & v,
                                        const typename L::ElementType & a){
    return VectorScalarExpression<L>(v,a);
  }
  template <class R>
  VectorScalarExpression<R> operator * (const typename R::ElementType & a,
                                        const BaseVector<R> & v){
    return VectorScalarExpression<R>(v,a);
  }
The template class VectorScalarExpression is very similar to the VectorExpression one:

  template <class V>
  class VectorScalarExpression :
    public BaseVector< VectorScalarExpression<V> > {
  public:
    typedef VectorScalarExpression StoreType;
    typedef typename V::ElementType ElementType;
    typedef typename V::SizeType SizeType;
    VectorScalarExpression(const BaseVector<V> & v,
                           const ElementType & a):a_(a),
                                                  v_(v.getCDR()){}
    inline ElementType operator[]( SizeType i) const {
      ElementType result=a_*v_[i];
      return result;
    }
    inline SizeType size( void ) const { return v_.size();}
  private:
    ElementType a_;
    typename V::StoreType v_;
  };
and again provides the operator[] that performs the actual calculation. As before, this calculation is not performed at the VectorScalarExpression construction stage.

A.4 Usage and Schematic Parsing Process
The Expression Template mechanism used in the Vector class allows the user to write arbitrarily complex vector expressions like:

  const int size=10000000;
  Vector<double> X(size,1.0);
  Vector<double> Y(size,2.0);
  Vector<double> Z(size,3.0);
  Vector<double> R(size,0.0);
  R=2.0*X+2.0*(Y-Z*2.0);
Here we give a schematic representation of the compiler parsing of this expression, where V, S and X stand for Vector, VectorScalarExpression and VectorExpression respectively:

1. Expression Parsing:

  2*X          -> s1 = S(2,X)
  Z*2          -> s2 = S(2,Z)
  Y - s2       -> x1 = X(Y,s2)
  2*(Y - Z*2)  -> s3 = S(2,x1)
  s1 + s3      -> right = X(s1,s3)

2. Vector operator =: R = right

  for (int i...)
    R[i] = right[i]
         = s1[i] + s3[i]
         = 2*X[i] + 2*x1[i]
         = 2*X[i] + 2*(Y[i] - s2[i])
         = 2*X[i] + 2*(Y[i] - 2*Z[i])

In this case, the C++ code, which implies only one loop and no temporary vectors, should be equivalent (from the performance point of view) to the direct loop-based implementation:

  for (int i=0 ; i < size ; i++)
    R[i] = 2.0*X[i]+2.0*(Y[i]-Z[i]*2.0);
2. Vector operator =: R = right for (int i...) R[i] = right[i] = s1 [i] + s3 [i] = 2 ∗ X[i] + 2 ∗ x1 [i] = 2 ∗ X[i] + 2 ∗ (Y[i] + s2 1[i]) = 2 ∗ X[i] + 2 ∗ (Y[i] + 2 ∗ Z[i]) In this case, the C++ code, that implies only one loop and no temporary vectors, should be equivalent (from the performance point of view) to the direct loop-based implementation: for (int i=0 ; i < size ; i++) 2.0*X[i]+2.0*(Y[i]-Z[i]*2.0);