Programming with the HPC++ Parallel Standard Template Library

Elizabeth Johnson†

Dennis Gannon†

Abstract We present an overview of the HPC++ Parallel Standard Template Library (PSTL), a parallel version of the C++ Standard Template Library (STL). The PSTL is part of HPC++, a C++ library and language extension framework being developed by the HPC++ consortium as a standard model for portable parallel programming in C++. The PSTL includes distributed versions of the seven STL containers (vector, list, deque, set, map, multiset, multimap), as well as parallel versions of the STL algorithms. A key component of the PSTL is the parallel iterator, which provides global access to all elements in the distributed containers and facilitates generic parallel programming.

1 Introduction

Reusable software components are of primary importance for software developers. Until recently, however, C++ provided neither standard basic data structures (such as linked lists and vectors) nor algorithms. Instead, each new software project included a development phase for these basic structures. Algorithms would then be created to operate on these structures. If the structures changed, the algorithms needed to be modified. Each programming team also had to be concerned about efficiency of data access and algorithms.

This problem is particularly acute in the parallel programming realm, where data structures must be built to utilize concurrency whenever possible. With no widely accepted standard for serial data structures, the parallel data structures roam even farther afield. While many parallel C++ extensions and libraries have been defined [13], most provide either non-standard tools to build and support data structures or include a set of specialized data structures. Clearly, support is needed for cost-effective, efficient, and reusable software components in parallel processing.

The Standard Template Library (STL) [8, 10, 12], a recent addition to the C++ draft standard, is a library of templated algorithms, containers, and iterators which provide support for generic programming. Generic programming [9] is a programming paradigm in which algorithms are written so that they can operate on any type of container with accessibility meeting certain minimal criteria. This element access is provided via iterators, C-pointer-like objects which can traverse through a container. Standard operations, such as incrementing and dereferencing, can be performed on the iterators. Algorithms are written in terms of these standard operations so that the same algorithm can be run on various types of containers. The iterator, not the algorithm, encapsulates knowledge of how to access successive container elements.
This work is supported by DARPA under contract DABT63-94-C-0029 and Rome Labs from contract 30602-92-C-0135.
† Department of Computer Science, Indiana University, Bloomington, IN.

The STL provides basic building blocks for the development of sequential programs which utilize containers and algorithms. The emergence of this standard is a good opportunity for establishing a parallel Standard Template Library which provides these same building blocks, but in a parallel environment. The HPC++ consortium, composed of representatives from industry, academia, and government laboratories, has included such a parallel library in their new framework for parallel C++ programming, HPC++ [5]. This paper describes the PSTL portion of the HPC++ framework and provides basic examples of PSTL use.

2 Overview of HPC++

The current HPC++ framework (Level 1) describes a C++ library along with compiler directives which support parallel C++ programming. A future version (Level 2) will include language extensions for semantics that cannot be expressed by the Level 1 library.

The standard architecture model supported by HPC++ is a system composed of a set of interconnected nodes. Each node is a shared-memory multiprocessor (SMP) and may have several contexts, or address spaces. The HPC++ framework will support homogeneous as well as heterogeneous systems (where nodes may be on physically distinct computers).

Level 1 of the HPC++ framework consists of the following parts: parallel loop directives, a parallel Standard Template Library, a multidimensional array class, and, in the future, a library for distributed active objects, an interface to CORBA via IDL [4] mapping, and a set of programming and performance analysis tools.

The parallel loop directives support single context parallelism. A loop can be declared by the programmer to be parallelizable using the compiler directive

    #pragma HPC_INDEPENDENT

placed before the loop. Given this directive, the compiler can generate the necessary code to execute the loop in parallel and to synchronize at loop termination. Compiler directives to support reduction over loops and private loop variables are also provided.

The Parallel Standard Template Library (PSTL) is a parallel extension of the C++ Standard Template Library (STL). Distributed versions of the STL container classes are provided along with versions of the STL algorithms which have been modified to run in parallel. In addition, several new algorithms have been added to support standard parallel operations such as the element-wise application of a function and parallel reduction over container elements. Finally, parallel iterators have been provided. These iterators extend global pointers and are used to access remote elements in distributed containers.

The STL does not include a multidimensional array class, but such a class is essential for the scientific computation typical of parallel applications. For this reason, HPC++ includes a multidimensional distributed array class based on A++ [11] and LPARX [6]. This array class will support element access via standard array indexing as well as parallel random access iterators. The latter will facilitate the use of STL and PSTL algorithms on the array class. Use of optimized mathematical libraries such as the BLAS [7, 3, 2] and LAPACK [1] for computations on HPC++ matrices and vectors will also be supported.
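To make the loop directive introduced earlier in this section concrete, the following is a minimal sketch of a loop whose iterations are independent of one another. The function name scale and its parameters are our own illustration and are not part of HPC++. A compiler that does not recognize the directive ignores it (possibly with a warning) and runs the loop sequentially, so the result is the same either way.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical example: multiply every element by a constant.
// No iteration reads or writes data touched by another iteration,
// so the loop is a candidate for the HPC_INDEPENDENT directive.
void scale(std::vector<double>& a, double factor) {
    #pragma HPC_INDEPENDENT
    for (std::size_t i = 0; i < a.size(); ++i) {
        a[i] = a[i] * factor;
    }
}
```

A loop that accumulated into a shared variable would not qualify as written; according to the text, that case is covered by the separate reduction directives.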

3 Standard Template Library

The C++ Standard Template Library has five basic components.

Container class templates provide standard definitions for common aggregate data structures, including vector, list, deque, set and map. Class templates are C++ classes which have a template parameter added. For example, if a templated vector is defined, vectors can be instantiated as collections of ints or floats, or any other type, without rewriting the basic vector definition. The code for each specific vector type is generated by the compiler as needed.

Iterators generalize the concept of a pointer. Each container class defines an iterator that provides a way to step through the contents of containers of that type. There are five basic categories of iterators: random access, bidirectional, forward, input and output. Each category provides certain operations. For example, forward iterators support increment (++) but not decrement (--), while bidirectional iterators support both operations.

Generic Algorithms are function templates that allow standard element-wise operations to be applied to containers. Like the class template, function templates are functions with a template parameter added. The compiler automatically generates the correct code during compilation.

Function Objects are created by wrapping functions with classes that typically have only operator() defined. They are used by the generic algorithms in place of function pointers because they provide greater efficiency.

Adaptors are used to modify STL containers, iterators, or function objects. For example, container adaptors are provided to create stacks and queues, and iterator adaptors are provided to create reverse iterators to traverse an iteration space backwards.
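The five components can be seen working together in a short program fragment. The sketch below uses only the standard (sequential) C++ library; the names IsEven, count_matching, and the small helper functions are ours and serve only as illustration.

```cpp
#include <list>
#include <stack>
#include <vector>

// Function object: a class whose only job is to provide operator().
struct IsEven {
    bool operator()(int x) const { return x % 2 == 0; }
};

// Generic algorithm: written only in terms of iterator operations
// (++, *, !=), so it works for any container with forward iterators.
template <class Iterator, class Predicate>
int count_matching(Iterator first, Iterator last, Predicate pred) {
    int n = 0;
    for (; first != last; ++first) {
        if (pred(*first)) ++n;
    }
    return n;
}

// The same algorithm applied to two different container templates.
int evens_in_vector() {
    std::vector<int> v = {1, 2, 3, 4, 5, 6};
    return count_matching(v.begin(), v.end(), IsEven());
}

int evens_in_list() {
    std::list<int> l = {2, 4, 7};
    return count_matching(l.begin(), l.end(), IsEven());
}

// Iterator adaptor: reverse iterators traverse the vector backwards.
std::vector<int> reversed_copy(const std::vector<int>& v) {
    return std::vector<int>(v.rbegin(), v.rend());
}

// Container adaptor: std::stack restricts a container to LIFO access.
int top_after_pushes() {
    std::stack<int> s;
    s.push(10);
    s.push(20);
    return s.top();
}
```

Note that count_matching never mentions vector or list; the iterator supplied by each container carries the knowledge of how to reach the next element.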

4 Parallel Standard Template Library

The Parallel Standard Template Library (PSTL) extends the STL to include distributed containers, parallel iterators, and parallel algorithms. The following sections briefly describe these components.

4.1 Distributed Containers

Distributed containers are data structures whose elements are distributed across several contexts. There are seven distributed containers in the PSTL. These mirror the containers in the basic STL. Three of the containers are sequence containers, which store elements in a sequential order.

distributed vector is a one-dimensional array.

distributed deque is a double-ended queue.

distributed list is a doubly-linked list.

The other four are associative containers. Elements are ordered by a key, which can also be used to retrieve elements from the container.

distributed set is an ordered set of unique keys.

distributed multiset is like a distributed set except that duplicate keys are allowed.

distributed map is an ordered set of unique keys along with associated objects.

distributed multimap is like a distributed map except that duplicate keys are allowed.

Each container has two iterators associated with it: a standard (local) iterator and a parallel iterator. The local iterator iterates through storage local to the context, while the parallel iterator iterates through both local and remote elements. Iterators for the sequence containers traverse elements in sequence, while associative container iterators traverse in the order of the element keys.

Distribution of container elements across contexts is handled via Ratio objects, which specify the proportion of elements in each context. The Ratio object may be modified during execution. Like the STL containers, the PSTL containers are dynamic: the size varies during execution as elements are added or deleted. A redistribute operation is provided to bring the distribution back into compliance with the container's Ratio object after insertions or deletions, or after modification of the Ratio object.
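The paper does not give the interface of the Ratio object, so the following is only a sketch of the arithmetic that a proportional distribution implies. The function counts_from_ratio and its rounding rule (floor each share, then hand out the remainder one element at a time) are our assumptions, not the PSTL implementation.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: given n elements and one weight per context,
// return how many elements each context would hold.
std::vector<std::size_t> counts_from_ratio(std::size_t n,
                                           const std::vector<std::size_t>& ratio) {
    std::vector<std::size_t> counts(ratio.size(), 0);
    const std::size_t total =
        std::accumulate(ratio.begin(), ratio.end(), std::size_t(0));
    if (total == 0) return counts;  // no context has a nonzero share

    // First pass: each context receives the floor of its exact share.
    std::size_t assigned = 0;
    for (std::size_t c = 0; c < ratio.size(); ++c) {
        counts[c] = n * ratio[c] / total;
        assigned += counts[c];
    }

    // Second pass: distribute the leftover elements round-robin,
    // skipping contexts whose weight is zero.
    for (std::size_t c = 0; assigned < n; c = (c + 1) % ratio.size()) {
        if (ratio[c] > 0) {
            ++counts[c];
            ++assigned;
        }
    }
    return counts;
}
```

Under this model, a redistribute operation would recompute the counts from the current container size and move elements between contexts until each context holds its computed share.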

4.2 Parallel Iterators

Parallel iterators extend the functionality of global pointers. In the case of random access parallel iterators, the operators ++, --, +n, -n, and [i] allow random access to the entire contents of a distributed container. In general, each distributed container class C will have a subclass for the strongest form of parallel iterator that it supports (e.g. forward, bidirectional, or random access).
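One way to picture a random access parallel iterator is as a global index that is translated into a (context, offset) pair on each dereference. The sketch below simulates this inside a single address space, with each inner vector standing in for one context's local storage. The class GlobalIter and the typedef Blocks are hypothetical; a real parallel iterator would use a global pointer and communication to reach remote elements.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Each inner vector stands in for the local storage of one context.
typedef std::vector<std::vector<double> > Blocks;

class GlobalIter {
    Blocks* blocks_;
    std::ptrdiff_t index_;  // position in the global element order
public:
    GlobalIter(Blocks& b, std::ptrdiff_t i) : blocks_(&b), index_(i) {}

    // Translate the global index into (block, offset) and return the element.
    double& operator*() const {
        if (index_ < 0) throw std::out_of_range("GlobalIter: negative index");
        std::size_t i = static_cast<std::size_t>(index_);
        for (std::size_t b = 0; b < blocks_->size(); ++b) {
            std::vector<double>& block = (*blocks_)[b];
            if (i < block.size()) return block[i];
            i -= block.size();
        }
        throw std::out_of_range("GlobalIter: index past the end");
    }

    GlobalIter& operator++() { ++index_; return *this; }
    GlobalIter& operator--() { --index_; return *this; }
    GlobalIter operator+(std::ptrdiff_t n) const {
        return GlobalIter(*blocks_, index_ + n);
    }
    GlobalIter operator-(std::ptrdiff_t n) const {
        return GlobalIter(*blocks_, index_ - n);
    }
    double& operator[](std::ptrdiff_t n) const { return *(*this + n); }

    bool operator==(const GlobalIter& o) const {
        return blocks_ == o.blocks_ && index_ == o.index_;
    }
    bool operator!=(const GlobalIter& o) const { return !(*this == o); }
};
```

Because the caller sees only the global index, code written against this interface does not need to know where block boundaries fall.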

4.3 Algorithms

As in the STL, the algorithms in the PSTL are generic. The arguments to the algorithms are iterators which access container elements rather than the containers themselves. This approach has several advantages. First, the same algorithm can be used for any container which utilizes an iterator of sufficient capability. So, for example, an algorithm which applies a function to each element can be called for any container which provides an iterator capable of the increment (++) operation. Second, algorithms can be easily applied to subranges of elements by passing as arguments to the algorithm the iterators which mark the beginning and one past the end of the subrange. In addition, through the use of user-defined iterator adaptors, subgroups of elements (such as odd- or even-indexed elements) can be accessed.

There are three types of parallel algorithms in the PSTL.

STL algorithms with parallel semantics.

par_ versions of STL algorithms.

par_ algorithms for standard parallel operations.

The first group of algorithms will retain their STL names, but will require pariterator arguments in place of iterators. These new algorithms will be collective and will include parallel semantics. For example, in addition to

    count(iterator first, iterator last, T& val)

which counts the number of elements equal to val in [first, last), the PSTL will include

    count(pariterator first, pariterator last, T& val).

This version of count will compute the count for each local section of the space in parallel and then do a reduction to sum the count over all contexts.

The second group of algorithms will also be versions of the STL algorithms, but par_ will be prepended to their names. When invoked with parallel iterators, these algorithms will be semantically equivalent to the first type of algorithm (and will be collective). When invoked with local iterators, these algorithms will not be collective. Execution will be local to a particular context, but have parallel semantics (so a loop may be parallelized, for example).

Finally, special par_ algorithms such as par_apply, par_scan, and par_reduce will be provided. These will be collective operations with parallel semantics. The following have been defined thus far:

par_apply: Applies a function object pointwise to the elements of a set of containers.

par_reduction: Applies a function object pointwise to the elements of a set of containers and then does a reduction on an associative binary operator.

par_scan: Applies a function object pointwise to the elements of a set of containers and then does a parallel prefix operation using an associative binary operation.
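The two-stage structure of the collective count described in this section (a local count in each context, followed by a sum reduction) can be modelled sequentially. In the sketch below each inner vector stands in for one context's local elements; the names Contexts, local_counts, and global_count are hypothetical and the code uses only the standard library, not the PSTL.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Each inner vector stands in for the local elements of one context.
typedef std::vector<std::vector<int> > Contexts;

// Stage 1: every context counts the matches among its own elements.
// In a real run these counts would be computed concurrently.
std::vector<std::ptrdiff_t> local_counts(const Contexts& ctx, int val) {
    std::vector<std::ptrdiff_t> counts;
    for (std::size_t c = 0; c < ctx.size(); ++c) {
        counts.push_back(std::count(ctx[c].begin(), ctx[c].end(), val));
    }
    return counts;
}

// Stage 2: a sum reduction combines the per-context results.
std::ptrdiff_t global_count(const Contexts& ctx, int val) {
    const std::vector<std::ptrdiff_t> counts = local_counts(ctx, val);
    return std::accumulate(counts.begin(), counts.end(), std::ptrdiff_t(0));
}
```

The reduction is valid because addition is associative, which is the same requirement the text places on the binary operator used by the reduction and scan algorithms.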

5 Using the PSTL

In this section, we include examples of basic operations on a distributed vector. Due to the generic nature of the PSTL algorithms, these operations can be applied to any of the distributed containers so long as the container provides iterators of sufficient strength. For example, all of the code below could be used on a distributed deque without modification since both the deque and vector provide random access local and parallel iterators.

One basic operation on a vector is to invoke some computation to transform each element. For example, we may require each element in a vector of doubles to be replaced by its square root. In order to do this, we define a function object:

    class square_root {
    public:
        void operator()(double& x) { x = sqrt(x); }
    };

To apply this function object in parallel over all elements of the vector v, we use one of the parallel algorithms:

    par_apply(v.parbegin(), v.parend(), square_root());

This algorithm is a collective operation (i.e., it must be called in all contexts). In each context, the () operator of the function object (created by square_root()) is applied to each local element. A similar operation is used to add a scalar to each element. In this case, we specify the scalar during function object construction.

    class add_scalar {
        double val;
    public:
        add_scalar(double v) : val(v) {}
        void operator()(double& x) { x = x + val; }
    };

    par_apply(v.parbegin(), v.parend(), add_scalar(5.0));
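For readers without access to a PSTL implementation, the two function objects above can be exercised with the sequential std::for_each, which applies a function object to every element of a range in the same pointwise fashion. This is a sequential analogue only: std::for_each is not collective and knows nothing about contexts. The helper sqrt_then_shift is a hypothetical name.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Replaces an element by its square root.
class square_root {
public:
    void operator()(double& x) const { x = std::sqrt(x); }
};

// Adds a scalar, fixed at construction time, to an element.
class add_scalar {
    double val;
public:
    explicit add_scalar(double v) : val(v) {}
    void operator()(double& x) const { x = x + val; }
};

// Sequential stand-in for two successive par_apply calls on one vector.
void sqrt_then_shift(std::vector<double>& v, double shift) {
    std::for_each(v.begin(), v.end(), square_root());
    std::for_each(v.begin(), v.end(), add_scalar(shift));
}
```

Storing the scalar in the function object, rather than in a global variable, lets the same class be reused with different values in different calls.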

Adding two vectors of the same length, v and w, is only slightly different. Here, we put the result back in v.

    class add_vect {
    public:
        void operator()(double& x, double y, double z) { x = y + z; }
    };

    par_apply(v.parbegin(), v.parend(), v.parbegin(), w.parbegin(), add_vect());

The two vectors do not need to have the same distribution: par_apply fetches any remote elements of w needed during the computation. The same function object can be used for another operation. Suppose we want to compute each element such that:

    v[i] = u[i] + u[i+1]

We could use the following (assuming that u has one more element than v):

    par_apply(v.parbegin(), v.parend(), u.parbegin(), u.parbegin()+1, add_vect());
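The five-argument form of par_apply used here walks one output range in step with two input ranges. A sequential model of that traversal is sketched below; apply3 and adjacent_sums are hypothetical names of our own, and the model ignores distribution and remote fetches entirely.

```cpp
#include <vector>

// Writes the sum of two inputs into the output element.
struct add_vect {
    void operator()(double& x, double y, double z) const { x = y + z; }
};

// Hypothetical sequential stand-in for the five-argument par_apply:
// advances through [first, last) in step with two input ranges.
template <class Out, class In1, class In2, class F>
void apply3(Out first, Out last, In1 a, In2 b, F f) {
    for (; first != last; ++first, ++a, ++b) {
        f(*first, *a, *b);
    }
}

// Computes v[i] = u[i] + u[i+1]; v has one element fewer than u.
std::vector<double> adjacent_sums(const std::vector<double>& u) {
    std::vector<double> v;
    if (u.size() < 2) return v;  // no adjacent pairs to add
    v.resize(u.size() - 1);
    apply3(v.begin(), v.end(), u.begin(), u.begin() + 1, add_vect());
    return v;
}
```

Offsetting the second input iterator by one is all that distinguishes this call from plain vector addition, which is why add_vect can be reused unchanged.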

To compute the dot product of two vectors, we can use a parallel version of an STL algorithm along with two of the STL function objects:

    double result = inner_product(v.parbegin(), v.parend(), w.parbegin(), 0.0, plus<double>(), times<double>());

This first computes the dot product in each context of the local elements (by applying times to the matching elements in v and w and then reducing across the context using plus). A reduction operation across all contexts using plus produces the final result.

An operation involving a stencil, such as:

    v[i] = (v[i-1] + v[i] + v[i+1])/3

can be accomplished using another function object (the boundary elements remain unchanged).

    class stencil {
    public:
        void operator()(double& x, double xsub1, double xplus1) {
            x = (xsub1 + x + xplus1)/3.0;
        }
    };

    distributed_vector<double> tmp(v); // make a copy of v
    par_apply(v.parbegin()+1, v.parend()-1, tmp.parbegin(), tmp.parbegin()+2, stencil());
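The role of the copy tmp is easiest to see in a sequential model of the same computation. In the sketch below, which uses std::vector rather than a distributed vector, the neighbour values are always read from the unmodified copy, so each update uses the old values of v[i-1] and v[i+1] regardless of the order in which elements are processed. The function smooth is a hypothetical name.

```cpp
#include <cstddef>
#include <vector>

// Replaces x by the average of itself and its two neighbours.
struct stencil {
    void operator()(double& x, double xsub1, double xplus1) const {
        x = (xsub1 + x + xplus1) / 3.0;
    }
};

// Sequential model of the stencil update; boundary elements are untouched.
void smooth(std::vector<double>& v) {
    if (v.size() < 3) return;          // no interior elements
    const std::vector<double> tmp(v);  // neighbours are read from this copy
    stencil s;
    for (std::size_t i = 1; i + 1 < v.size(); ++i) {
        s(v[i], tmp[i - 1], tmp[i + 1]);
    }
}
```

Without the copy, an in-place sweep would read a left neighbour that had already been overwritten, and a parallel execution would make the result depend on timing.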

Sometimes more complicated function objects are required. Suppose we have a sparse matrix represented in coordinate-wise format, where each element contains not only the value, but also its row and column index.

    class Elem {
    protected:
        double val;
        int row;
        int col;
    public:
        Elem(int r, int c, double v) : row(r), col(c), val(v) {}
        int getRow() const { return row; }
        int getCol() const { return col; }
        double& getVal() { return val; }
    };

We might wish to sort the vector elements according to their coordinates. A parallel sort algorithm is included in the PSTL, but we must provide a predicate which can be used when two elements are being compared.

    class orderByCoord : public binary_function<Elem, Elem, bool> {
    public:
        bool operator()(const Elem& e1, const Elem& e2) const {
            if (e1.getRow() < e2.getRow()) return true;
            else if (e1.getRow() > e2.getRow()) return false;
            else return (e1.getCol()
