Irregular Data-Parallel Objects in C++

Jean-Luc Dekeyser, Boris Kokoszko, Jean-Luc Levaire, Philippe Marquet
{dekeyser,kokoszko,levaire,[email protected]}
Laboratoire d'Informatique Fondamentale de Lille, Université de Lille
Abstract. Most data-parallel languages use arrays to support parallelism. This regular data structure allows a natural development of regular parallel algorithms. The implementation of irregular algorithms requires a programming effort to project the irregular data structures onto regular structures. We first propose in this paper a classification of existing data-parallel languages. We briefly describe their irregular and dynamic aspects, and derive different levels where irregularity and dynamicity may be introduced. We then propose a new irregular and dynamic data-parallel programming model, called Idole. Finally, we discuss its integration in the C++ language, and present an overview of the Idole extension of C++.
1 Irregularity and Data-Parallelism

The evolution of data-parallel languages closely mimics the evolution of sequential languages. Keeping in mind efficiency and simplicity, compilers have supported, in a first step, only regular data structures: arrays in sequential languages, vectors and matrices in parallel languages. Handling irregular data structures then forces the programmer into an explicit management of the memory. A consequence is that irregularity takes place at the level of the algorithm (gather/scatter operations). The integration of irregular data structures in data-parallel languages can be compared to the emergence of dynamic memory management and pointers in sequential languages. While preserving the semantics of the data-parallel model, irregular data structures essentially allow the specification of interdependencies between parallel object elements, which are similar to the interdependencies expressed by pointers in sequential data structures.

Let us consider the following example: in a binary tree, each node wants to compute the minimum of a value held by its two child nodes. (A parallel happy birthday: in a family spanning several generations, each parent wants to know the next birthday to wish among its children.) This algorithm is effectively a data-parallel algorithm: the same algorithm is applied to a set of data of similar type. For a data-parallel language supporting irregular data structures, the translation is straightforward:

  if (rchild and lchild exist) then
      min_child = min(rchild.val, lchild.val)
To write this algorithm using arrays, it is necessary to linearize the binary tree into an array (val(:)) and to manage links using two index tables (lchild(:) and rchild(:)). The algorithm then consists of two gather operations:

  where (rchild(:) and lchild(:))
      min_child(:) = min(val(rchild(:)), val(lchild(:)))
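For concreteness, here is a minimal sequential C++ sketch of this linearized representation; the use of std::vector containers and of -1 as the marker for a missing child are illustrative choices, not taken from any particular data-parallel language.

  #include <algorithm>
  #include <cstddef>
  #include <vector>

  // Linearized binary tree: node i stores its value in val[i]; lchild[i] and
  // rchild[i] hold the indices of its children, or -1 when a child is absent.
  void min_of_children(const std::vector<double>& val,
                       const std::vector<int>& lchild,
                       const std::vector<int>& rchild,
                       std::vector<double>& min_child)
  {
      for (std::size_t i = 0; i < val.size(); ++i) {
          // Sequential equivalent of the data-parallel where/gather:
          // active only where both children exist.
          if (lchild[i] != -1 && rchild[i] != -1)
              min_child[i] = std::min(val[lchild[i]], val[rchild[i]]);
      }
  }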
Similarly, during a tree update (insertion, deletion), a data-parallel language with irregular data structures will provide adequate tools (e.g. dynamic virtual processor allocation, PV_alloc()). In an array language, the programmer has to manage the distributed memory himself.
2 Data-Parallel Language Classification

It is currently difficult to provide an exhaustive list of data-parallel languages. Some of them have been proposed by manufacturers, and were often recognized as standards. Apart from these "dinosaurs", a great number of languages are developed by research teams. As for sequential languages, none is unanimously accepted by the users, even if their contributions are significant. From this diversity of data-parallel languages, we propose to extract some common criteria that will be the root of a classification of data-parallel languages. We extend the classification proposal of D. Lazure [16] and obtain the following criteria.
2.1 Object Declaration

The explicit declaration of parallel objects in a language implies that the identification of the data-parallel code is directly achievable during the compilation phase. On the contrary, for an implicit (sequential) language, a preliminary phase of automatic parallelization is required to identify objects that may be safely handled in parallel. The following criteria only concern explicit languages.
2.2 Virtualization

The size of data-parallel structures depends on the problem and on the algorithm used to solve it. Usually, the target machine does not offer the exact number of processors allowing a direct mapping of one element of the structure to one elementary processor. Virtual machine languages rely on the notion of a virtual machine, which offers the desired number of virtual processors. It is then up to the compiler to ensure the emulation of these virtual processors on the physical processors. Physical machine languages impose that the size of data-parallel structures be strictly equal to the number of physical processors. It is then up to the programmer to ensure the virtualization "by hand". The following criteria only concern virtual machine languages.
2.3 Virtual Machine

A virtual machine is called an instantiation machine if each object declared on the machine inherits the size and the rank of the machine. All objects declared on one instantiation machine thus have the same number of elements and the same topology as the machine itself. A virtual machine is called an alignment machine if the objects declared on the machine may have any number of elements up to the size of the machine. Moreover, the elements of an object may be mapped to any subset of virtual processors of the machine. For instance, on a 2D grid alignment virtual machine, you can declare an object corresponding to only one line of the grid, or an object corresponding to the four corners of the grid.
2.4 Mapping

During the compilation phase of a virtual machine language, the compiler performs a mapping of virtual processors to physical processors. Some languages require that the programmer explicitly state a mapping for each virtual machine (cyclic, block...). Others leave it to the compiler to decide which mapping to use.
2.5 Data-Parallel Semantics

In implicit languages, there is no explicit notion of parallel variables. The program only accesses their elements through an indexing mechanism (e.g. A[i]). In explicit languages, parallel objects are seen as a whole. A program can then express a computation involving two parallel variables directly using their names (e.g. A + B).

For an instantiation machine, the semantics of the computation is unique, as objects are defined on the whole machine: perform the computation on every virtual processor. For an alignment machine, there exist two different interpretations, as objects are not necessarily defined on the whole virtual machine. The difference between the two interpretations lies in the method used to map the two objects one onto the other, in order to perform the computation element by element.

The first interpretation, called the semantics of the virtual processor, uses the virtual machine as a referential to realize the mapping between the two objects. The most natural approach is to consider the intersection of the two sets of virtual processors where each object is defined. Only virtual processors where both objects are declared will perform the computation. In this way, the semantics is always well defined. Moreover, the compiler does not have to generate implicit communications, because all objects of the same virtual processor are usually located on the same physical processor.

The other interpretation, called the semantics of the index, linearizes the two objects (or considers them as arrays when possible), and directly maps the corresponding arrays one onto the other. The referential used to perform the mapping between the objects is then the referential of the objects themselves. The problem here is that the two objects should have the same topology, or at least the same number of elements when linearized. This leads to a semantics which is not always well defined. Moreover, the mapping of each object onto the virtual machine is not necessarily the same. This forces the compiler to perform some extra tests and possibly generate communications between physical processors.

This classification is illustrated with some typical languages in Figure 1.
Fig. 1. Data-parallel language classification. (Decision tree over the criteria above: non-explicit languages such as Fortran 77; explicit but physical machine languages such as MPL; instantiation virtual machines without or with explicit mapping, such as C* and DPCE/HyperC; alignment machines with the semantics of the index, such as Fortran 90 and HPF; alignment machines with the semantics of the virtual processor, such as Help and Idole.)
2.6 Content of the Paper

From our classification, we identify three levels where dynamic and irregular properties of data-parallel structures may be introduced (Section 3). The Idole programming model is described in Section 4. Idole is a virtual machine language with alignment, and supports the semantics of the virtual processor. Irregularity and dynamicity appear in Idole at the level of the virtual machine and at the level of objects. The integration of this model in C++ is discussed in Section 5. Section 6 presents, through examples, the basic steps of programming in Idole. Section 7 focuses on the irregular and dynamic aspects of the Idole language.
3 Irregularity and Dynamicity in Data-Parallel Languages

In the proposed classification, no assumption has been made on the structure of parallel objects. Nevertheless, it is difficult to imagine handling irregular data structures in some of the classes. Implicit languages can handle them since they are sequential, but current parallelization techniques do not succeed in recognizing them. Similarly, physical machine languages manipulate exclusively regular data: the architecture of data-parallel machines is indeed often regular and static. In the class of virtual machine languages, one finds the expression of irregularity and dynamicity at different levels.
3.1 The Virtual Machine

A virtual machine is composed of a set of virtual processors. The topology of this machine is often a static multi-dimensional grid. In some languages (C* [12], DPCE [10], HyperC [15, 6], Parallaxis [3, 2]), irregular and/or dynamic topology constructions facilitate the development of data-parallel algorithms. As a rule, objects associated with an irregular virtual machine inherit its irregularity (and in some cases its dynamicity).
3.2 The Objects

In instantiation virtual machines, all the objects have the rank of the virtual machine. In this case, irregularity at the object level makes no sense. In the case of alignment virtual machines (Help [8, 16]), objects do not necessarily cover the whole virtual machine. The allocation domain of an object on this machine is regular when it may be described directly using an array subscript notation (for example, multi-dimensional arrays). It is irregular when its description is impossible using such a notation (for example, the elements on the perimeter of a 2D domain).
3.3 The Mapping

The lack of irregularity at the object level as well as at the virtual machine level leads some languages to propose dynamic and irregular mapping functions (HPF2 [13, 14, 19], Vienna Fortran [5, 20], Vienna Fortran 90 [1], PST [18]). A judicious gathering of array elements then ensures the physical locality of elements according to the irregular addressing of the algorithm. However, in this case, the programmer still has to construct the algorithm using regular structures; only the irregular mapping allows better performance in the execution of the algorithm. Such a situation requires a particular effort from the programmer.

This classification has been presented in [9], where many irregular and dynamic language constructions illustrating these different levels are detailed.
4 Idole Programming Model

The Idole language supports a virtual alignment machine with the virtual processor semantics. Such a machine is called a collection in the language. The Idole programming model integrates, as a basic principle, the dynamicity and the irregularity of parallel objects. The expression of irregularity is found at two levels: the virtual machine and the objects themselves. This twofold irregularity is justified as follows.

The structure of a collection is common to all objects of the collection. A transformation of this structure has identical consequences on all these objects. For instance, when deleting a
virtual processor of the collection, all objects that own an occurrence on this virtual processor also undergo a deletion. One can thus modify the structure of the virtual machine according to the algorithm. Such a structure modification represents a change in the application domain of the data-parallel algorithm.

Consider stack algorithms [11, 7]. In this kind of algorithm, the data structure is a distributed stack. The same algorithm is applied to each element of the stack. Each element can generate another element or can be deleted. For example, in particle dynamics simulations, a particle is tracked step by step until it stops. On each step, a particle can generate a child particle, which is pushed on the stack for further computation. This algorithm is well suited to a data-parallel implementation, but it relies on a data structure which evolves in an irregular way (a sequential sketch of this pattern is given at the end of this section).

Inside the collection, the alignment machine authorizes the creation of objects not allocated on the whole collection. Unlike Help or HPF, which only manipulate multi-dimensional objects, Idole accepts objects allocated on any subset of virtual processors. No assumption is made on the structure of this set. As in Help, Idole objects may vary in rank and size. The irregularity at this level allows limiting, for a particular phase of the algorithm, the range of the data-parallel processing. This set may actually characterize both an active domain of computation and an allocation domain for parallel objects.
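The following is a minimal sequential C++ sketch of the stack-algorithm pattern described above; the Particle fields, the energy-based spawning rule and the step() routine are purely illustrative.

  #include <stack>

  struct Particle { double x; double energy; };   // illustrative state

  // Hypothetical tracking step: advances the particle, loses energy and, while
  // the energy is still high, emits a child particle onto the shared stack.
  // Returns true while the particle is still alive.
  static bool step(Particle& p, std::stack<Particle>& work)
  {
      p.x += 1.0;
      p.energy *= 0.5;
      if (p.energy > 0.25)
          work.push(Particle{p.x, p.energy * 0.5});   // child particle for later processing
      return p.energy > 0.1;                          // particle stops below a threshold
  }

  // The same algorithm is applied to every element of the (here sequential) stack;
  // in the data-parallel version each stack element lives on its own virtual processor.
  static void track_all(std::stack<Particle>& work)
  {
      while (!work.empty()) {
          Particle p = work.top();
          work.pop();
          while (step(p, work)) { /* keep tracking until the particle stops */ }
      }
  }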
5 Idole Language: A C++ Extension

The Idole language is derived from the C++ language. The object-oriented technology provides two main features: the encapsulation of functions in object methods, and the refinement of object properties through the inheritance mechanism. We have chosen to take advantage of these characteristics in the data-parallel part of the language.

The direct integration of the data-parallel programming model in C++ has already been studied by Lickly and Hatcher [17]. Their opinion is that such an integration would not be satisfying. Indeed, C++ does not allow expressing data-parallel computations without adding a special mechanism; neither does it provide a way to define parallel control structures. Both of these aspects have been included in Idole. We modify the C++ semantics of computation for data-parallel objects in the usual way: element operations are performed in parallel on each element when applied to a data-parallel object. Syntactically, Idole also provides data-parallel control structures, mainly the where statement and the other usual structures. The Idole-specific extensions are collection classes, virtual processor classes, shapes, and finally data-parallel objects (DPOs).

Each Idole extension will be illustrated using an adaptive mesh algorithm [4]. In this example, the node concentration is not the same on the whole mesh. Moreover, the adaptive mesh technique relies on refinements of the mesh (cf. Figure 2) to concentrate on points where error contributions are large. By using such an irregular mesh, the number of points needed for the computation is reduced while still accurately capturing the regions where high numerical precision is required.
Fig. 2. An adaptive mesh (refinement by bisection of the longest side).
5.1 Collection Classes

Idole collections correspond to the virtual machines of the programming model, thus providing the way to express data-parallelism. Collections are involved as an extra specifier in the declarations of data-parallel objects, in addition to the element class specifier. We extend C++ to support this new mechanism. Idole introduces classes of collections with the keyword collection_class, and adds a new predefined class bag. This collection class corresponds to unstructured sets of processors, and implements basic attributes (size) and methods on collections. The mechanism of inheritance is extended to collection classes. For convenience, Idole also provides the collection class grid, derived from bag, which corresponds to multidimensional grids.

It would have been convenient to introduce collections just as specific classes in the language; they actually act as classes in the declaration of parallel objects. But in the model, the dynamicity at the virtual machine level dictates that collections should have some modifiable attributes. The only way to provide this feature in C++ is to consider collections as objects themselves. Thus, a collection is an object, instance of a collection class. Irregularity and dynamicity at the virtual machine level are specified by the programmer using predefined primitives (see Section 7). Note that a modification of the structure of a collection may result in modifications of objects defined on that collection.

In the case of the adaptive mesh, we define a collection to handle the whole mesh structure. Each node of the mesh is represented by a virtual processor. These processors will be interconnected to describe the mesh structure.
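To make the mesh example concrete, here is a sketch (not code from the paper) of how such a collection class might be declared, assuming the collection_class syntax and template extension shown in Section 6.2; the class name mesh and its methods are purely illustrative.

  template <class P>                            // processor class left generic, as in Section 6.2
  collection_class mesh : bag {                 // unstructured set of nodes, hence derived from bag
      mesh(int n_nodes) : bag(n_nodes) {...}    // initial, unrefined mesh
      void refine_region(shape& region) {...}   // add nodes and rewire ports (see Section 5.2)
  dpo:
      void exchange_with_neighbours(void) {...} // example of a global DPO operation
  } ;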
5.2 Virtual Processors

The virtual processor semantics implies an explicit communication mechanism between the virtual processors of a virtual machine. Communication links between virtual processors are identified in Idole through communication ports holding processor references (entries). This set of communication links defines the virtual machine topology. This topology may vary at runtime, as the value of a communication port is dynamic.

Communication ports are declared at the level of the virtual processors. Processor classes are derived from the predefined processor class. An instance of this class is a virtual processor without predefined communication ports. It provides methods to declare communication ports (these are only allocated at the processor instantiation) and methods to connect a communication port to a processor of any type. The mechanism of inheritance on processor classes allows expanding already defined processor classes.

Collections are built using processor objects. Conceptually, processors are created at the instantiation of collections. One processor handles (and performs computations on) one element (if present) of each DPO in the collection. A collection can be composed of different types of processors. Moreover, virtual processors can be dynamically allocated. A collection class can also be defined in a generic manner to support any kind of virtual processor: the notion of C++ template has been extended to collection classes.

When the refinement of a region is required, this region is divided into two sub-regions by bisection of the longest side. First, we create a new virtual processor to get a new node, using the dynamic allocation function. Then we update the values of the communication ports to reflect the new structure of the mesh.
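As an illustration, here is a hedged sketch of this refinement step, reusing the port-declaration macros of Section 6.1 and the port-assignment style of Figure 5; the port names, the alloc_node() allocation call and the surrounding context are assumptions rather than Idole primitives defined in the paper.

  // Hypothetical mesh-node processor with three neighbour ports.
  DECL_PROCESSOR(mesh_node, processor, 3)
      CONNECTION(n0, 0)
      CONNECTION(n1, 1)
      CONNECTION(n2, 2)
  END_PROCESSOR

  // One refinement step: allocate a new node, then update the ports so that the
  // split edge goes through the new node (alloc_node() stands in for the dynamic
  // allocation primitive mentioned in the text).
  mesh_node_entry new_node = alloc_node() ;
  new_node->n0() = self ;          // connect the new node back to the node being split
  new_node->n1() = self->n1() ;    // the new node takes over the old neighbour
  self->n1()     = new_node ;      // the split edge now points to the new node
  new_node->n2() = NULLPROC ;      // remaining port filled as the mesh dictates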
5.3 Shapes

Processing domains or DPO allocation domains can be defined on collections. These domains have the type shape. A shape refers to a subset of the virtual processors of a given collection. A shape is built by addition and deletion of virtual processors. Specific block description operations are available for grid collection shapes. Set operations (union and intersection) are also defined on shapes.
In our example, in the last refinement step of Figure 2, we highlight the region where high precision is required. We define a shape by adding all the virtual processors of this region. Using this shape, we can limit the computation domain to this region and allocate DPOs focused on this domain.
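A possible sketch of this step, following the shape operations of Section 6.4; the activity condition high_error_on_mesh is an illustrative stand-in for the actual error criterion.

  shape refined_region ;            // subset of the mesh collection's virtual processors
  ...
  where (high_error_on_mesh)        // nodes whose error contribution is large
      refined_region.add() ;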
5.4 Data-Parallel Objects

A DPO is defined by specifying both a collection (or a shape) and an element class. The element class is a predefined or user-defined C++ class. Methods of the element class are implicitly extended to the DPO: a method of the element class called on a DPO is applied to every element of the DPO. Global operations on the DPO are defined in the collection class, in a special section named dpo. In particular, it is possible to define DPO methods involving inter-processor communications, reshaping, etc.

In our example, the DPO elements are the data required to describe the physical properties of each node of the mesh. DPOs can be defined on the whole mesh or only on a shape. For example, we can declare a DPO on the shape defined in Section 5.3. This DPO will hold the extra data required for a more accurate computation.
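For illustration, here is a hypothetical element class for the mesh nodes; relax() is an invented method, and extra_dpo is assumed to be a DPO of node_data elements allocated on the refined-region shape.

  // Hypothetical element class describing the physical state of one mesh node.
  class node_data {
  public:
      double value ;
      double error ;
      void relax(void) {...}       // one local update step
  } ;

  // Element methods are implicitly extended to the DPO: the call below applies
  // relax() in parallel to every element of extra_dpo.
  extra_dpo.relax() ;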
6 Idole Programming Method

Figure 3 describes the Idole programming methodology.

Fig. 3. Diagram of the Idole programming model (steps: processor class definition, collection class definition, collection instantiation, shape definition, elements definition, DPO instantiation).
6.1 Processor Class Definition

The first step consists of declaring the processor classes. The macro DECL_PROCESSOR below declares a new class processor4 of processors, each of which has four communication ports. The CONNECTION macros allow the programmer to name these communication ports. We use macros here in order to simplify the syntax.

  DECL_PROCESSOR(processor4, processor, 4)
      CONNECTION(north, 0)
      CONNECTION(south, 1)
      CONNECTION(west, 2)
      CONNECTION(east, 3)
  END_PROCESSOR
6.2 Collection Class Definition

The second step consists of declaring the collection classes. We define below a 2D grid class 2d_grid, derived from the predefined class grid. The processor class
on which collections will be built is left as a template; it will be specified at the collection instantiation. In the constructor of the collection class, the constructor of the predefined class grid is called. The body of the 2d_grid constructor will also usually contain an initialization phase of the communication ports of the virtual processors. Communication ports may be referenced here using their indices, as the processor class may be unknown at that time. Every collection is dynamic in Idole; we specify here a method add_new_line to add a new line of processors to the collection. Finally, we define a DPO method move in the dpo section, which shifts a DPO inside the grid.

  template <class P>           // the processor class is left as a template parameter (name illustrative)
  collection_class 2d_grid : grid {
      2d_grid(int x_size, int y_size) : grid(x_size, y_size) {...}
      ...
      void add_new_line(void) {...}
      ...
  dpo:
      ...
      void move(int x_offset, int y_offset) {...}
  } ;
6.3 Collection Instantiation
We can now instantiate a collection from a collection class and a processor class. We define a 4 × 3 2D grid collection named a_grid, which is an instance of the collection class 2d_grid and is built using processor4-type processors.

  2d_grid<processor4> a_grid(4, 3) ;   // the processor class is supplied at instantiation
6.4 Shape Definition

The optional fourth step is to define some useful domains on the collection a_grid. We define here a shape a_domain, and initialize it by adding the virtual processors which satisfy a given condition.

  shape a_domain ;
  ...
  where (a_condition_is_true_on_a_grid)
      a_domain.add() ;
6.5 Elements Definition

The fifth step is to reuse already defined element classes, or to define your own element classes. These classes are ordinary C++ classes, e.g. a class of complex numbers.

  class complex {...} ;
6.6 DPO Instantiation

We can now define parallel variables, the DPOs, by specifying their collections (or shapes) and their element types. Complete DPOs are defined on the whole collection, whereas incomplete DPOs are only defined on a domain of the collection.

  a_grid   a_complete_dpo ;        // complete DPO (element class complex), on the whole collection
  a_domain an_incomplete_dpo ;     // incomplete DPO, defined on the shape a_domain only
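For illustration, a short usage sketch of these DPOs; move() is the DPO method declared in the 2d_grid class of Section 6.2, while the use of operator+ assumes that the complex class defines it.

  a_complete_dpo.move(1, 0) ;       // DPO method from the dpo: section of 2d_grid

  // Element-wise expression; with the virtual processor semantics (Section 2.5),
  // the sum is only computed where both operands are allocated, i.e. on a_domain.
  an_incomplete_dpo = an_incomplete_dpo + a_complete_dpo ;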
The final step consists of writing the code of the algorithm itself. The next section presents some sample programs written in Idole.
7 Irregular and Dynamic Programming in Idole

This section emphasizes the dynamic and irregular aspects of the Idole programming language.
  grid pascal(1) ;
  pascal value = 1 ;
  for (int i = 1; i ...

  ...

          rchild() = child ;
      } elsewhere {
          self = (tree_node_entry) mytree.getvpid() ;
          self->parent() = (tree_node_entry) mytree.getpvpid() ;
          self->lchild() = NULLPROC ;
          self->rchild() = NULLPROC ;
      }
  }
Fig. 5. Dynamic manipulation of virtual processors in Idole.
The processor class tree_node is derived from the predefined class processor. A tree_node processor offers three communication ports. These ports are actually defined as methods of the class tree_node. The macro definition DECL_PROCESSOR also defines the generic type tree_node_entry, whose values are references (entries) to processors of that new class. The bag collection mytree is built using tree_node processors. The add_rchild method adds a right child to all active processors. child is a DPO of references to tree_node processors; it is defined on the collection mytree. The DPO self is initialized from the getvpid() method of bag; getvpid() returns an entry (a reference) to the current processor. The fork() constructor performs the parallel creation of tree_node processors. In the first alternative of the where, the parent processors connect to their child. In the second alternative, the child processors initialize themselves and connect to their parent (the getpvpid() method returns a reference to the parent processor).
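Since the code of Figure 5 is only partially legible in this copy, here is a hedged reconstruction of what the add_rchild method plausibly looks like, based on the description above; the where condition is_parent(), the exact use of fork() and the declarations of the DPOs child and self are assumptions.

  // Processor class with three ports, as described in the text.
  DECL_PROCESSOR(tree_node, processor, 3)
      CONNECTION(parent, 0)
      CONNECTION(lchild, 1)
      CONNECTION(rchild, 2)
  END_PROCESSOR

  // add_rchild: every active processor creates and connects a right child.
  // child and self are DPOs of tree_node_entry references defined on mytree.
  void add_rchild(void)
  {
      self  = (tree_node_entry) mytree.getvpid() ;   // reference to the current processor
      child = mytree.fork() ;                        // parallel creation of tree_node processors
      where (is_parent()) {                          // first alternative: previously existing processors
          self->rchild() = child ;                   // each parent connects to its new child
      } elsewhere {                                  // second alternative: the newly created children
          self = (tree_node_entry) mytree.getvpid() ;
          self->parent() = (tree_node_entry) mytree.getpvpid() ;
          self->lchild() = NULLPROC ;
          self->rchild() = NULLPROC ;
      }
  }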
7.3 Idole Objects Idole proposes the manipulation of dynamic and irregular parallel objects. This dynamicity covers the size and the position of objects. Idole introduces irregularity in the shape of incomplete objects. The allocation domain of objects can be any subset of virtual processors of the collection; this is supported for grid collections as well as for bag collections. An illustration of these features is given in Figure 6.
  grid pascal(N) ;
  pascal[0:0] value = 1 ;
  shape oldshape ;
  for (int i = 1; i ...