Object-Oriented Runtime Support for Complex Distributed Data Structures

Chialin Chang    Alan Sussman    Joel Saltz
Institute for Advanced Computer Studies and Department of Computer Science
University of Maryland, College Park, MD 20742
{chialin, als, [email protected]

This research was supported by the National Science Foundation under Grant #ASC 9318183, and NASA under Grant #NAG 11485 (ARPA Project #8874).

Abstract
Object-oriented applications utilize language constructs such as pointers to synthesize dynamic complex data structures, such as linked lists, trees and graphs, with elements consisting of complex composite data types. Traditionally, however, applications executed on distributed memory parallel architectures in single-program multiple-data (SPMD) mode use distributed (multi-dimensional) data arrays. Good performance has been achieved by applying runtime techniques to such applications executing in a loosely synchronous manner. Existing runtime systems that rely solely on global indices are not always applicable to object-oriented applications, since no global names or indices are imposed upon dynamic complex data structures linked by pointers. We describe a portable object-oriented runtime library that has been designed to support applications that use dynamic distributed data structures, including both arrays and pointer-based data structures. In particular, CHAOS++ deals with complex data types and pointer-based data structures by providing two abstractions, mobile objects and globally addressable objects. CHAOS++ uses preprocessing techniques to analyze communication patterns, and provides data exchange routines to perform efficient data transfers between processors. Results for applications taken from three distinct classes are also presented to demonstrate the wide applicability and good performance characteristics of the runtime library.
1 Introduction

As parallel processing systems become more widely available, an increasing number of complex, object-oriented applications are being developed to make use of the computational power provided by these systems. Parallelism is exploited by assigning objects to different processors, so that computation is carried out on different objects concurrently. A number of concurrent object-oriented systems (e.g., [5, 13, 11, 19]) have been proposed in the literature to provide support for a global object name space and remote method invocation. To hide communication latency in such systems, invoking a method of a remote object is usually implemented as a non-blocking function call, which generates an asynchronous request message to the processor that has been assigned ownership of the object. The message is then received by the owner processor, using either a polling scheme or an
interrupt mechanism, and processed under a scheduling policy defined by the underlying concurrent object-oriented system.

On the other hand, a large class of applications has been developed to execute on distributed memory parallel computers in single-program multiple-data (SPMD) mode, in a loosely synchronous manner [9]. That is, collections of data objects are partitioned among the local memories of the processors, often in an irregular fashion for better locality or load balance, and the program executes a sequence of parallel computational phases. Each computation phase corresponds to, for instance, a time step in a physical simulation or an iteration in the solution of a system of equations by relaxation. Since there are no data dependencies between the computations on different subcollections, each computation phase can be carried out on all processors concurrently, and synchronization is only required at the end of the phase. Therefore, once all the data for a computation phase (which is typically produced by a preceding computation phase) becomes available, a collective communication phase can be performed by all processors, after which each processor holds a local copy of the data needed to carry out the computation phase. The computation phase can then be executed entirely locally on each processor, with no remote accesses, and hence without interprocessor communication.

For loosely synchronous applications, a collective communication phase before and/or after each computation phase is more efficient than generating asynchronous remote accesses during the computation phase, because collective communication enables various optimizations, such as message aggregation (to reduce communication startup cost) and elimination of duplicate data transfers. Furthermore, since communication is effectively performed simultaneously by both the sender and the receiver, there is no overhead incurred for polling or interrupts, as is the case for asynchronous communication. This approach is, however, complicated by the fact that the data access pattern of a computation phase usually depends on the input data and the distribution of that data across the processors, and such information is typically not available until runtime. Optimizations that can be carried out by compilers are therefore limited, and runtime analysis is often required [27]. Good performance on distributed memory architectures has been achieved by applying such runtime techniques to various problems with irregular data access patterns, such as molecular dynamics for computational chemistry [12], particle-in-cell (PIC) codes for computational aerodynamics [20], and computational fluid dynamics [7].

However, many existing runtime systems for distributed memory parallel machines fail to handle pointers, which are often used in object-oriented applications to construct complex objects and data structures. These runtime systems therefore only support primitive data types, such as integers and floating point numbers, and simple objects that contain no references to other objects. Most of these runtime systems also rely on the existence of global indices, which makes them applicable only to distributed arrays.

In many applications, such as image processing, geographic information systems, and data mining, hierarchies of complex data types are defined, such that the ones at the higher levels serve as containers for the ones at lower levels.
Pointers are usually used by container objects to point to the objects they contain. Objects that are solely contained within such a container object are referred to as sub-objects. Objects of data types at the top of the hierarchy can further be connected through pointers, forming complex data structures. We refer to these as pointer-based data structures; examples include linked lists, trees, and graphs. Figure 1 shows an example in C++, where the pixels of an image are clustered into regions, and regions contain pointers to adjacent regions to form a map. In this example, the Region class is implemented as a container class for the Pixel class, so that a Pixel is a sub-object of a Region. Since different regions may consist of different numbers of pixels, the Region class uses a pointer to an array of its constituent pixels. A set of regions interconnected with pointers then forms a graph, defined by the class Region_Map. Without adequate support for pointers, parallel applications utilizing such complex data structures cannot achieve good performance on distributed memory architectures.

class Pixel {                  // a single pixel of an image
    int x, y;                  //   x,y coordinates
};

class Region {                 // a region consisting of pixels
    int num_pixels;            //   number of pixels
    Pixel *pixels;             //   an array of pixels
    int num_neighbors;         //   number of adjacent regions
    Region **neighbors;        //   list of pointers to adjacent regions
};

class Region_Map {             // a graph consisting of regions
    Region *region;
};

Figure 1: An example of complex objects and pointer-based data structures.

The runtime library we have designed and implemented to solve these problems is called CHAOS++, a runtime library targeted at parallel object-oriented applications with dynamic communication patterns. It subsumes CHAOS [8], a runtime library developed to efficiently support applications with irregular patterns of access to distributed arrays. In addition to providing support for distributed arrays through the features of the underlying CHAOS library, CHAOS++ also provides support for distributed pointer-based data structures, and allows flexible and efficient exchange of complex data objects among processors. CHAOS++ is implemented as a C++ class library, and can be used directly by application programmers to parallelize applications with adaptive and/or irregular data access patterns. The design of the library is architecture independent and assumes no special support from C++ compilers. Currently, CHAOS++ uses message passing as its transport layer, and is implemented on various distributed memory machines, including the Intel iPSC/860 and Paragon, the Thinking Machines CM-5, and the IBM SP-1 and SP-2. However, the techniques used in the library can also be applied to other parallel environments that provide a standard C++ compiler and a mechanism for off-processor data accesses, including various distributed shared memory architectures [18, 21, 29].

The remainder of the paper is structured as follows. In Section 2, we give a brief overview of the CHAOS runtime library and discuss issues related to the additional runtime support required for parallel object-oriented applications. In Section 3 we describe how CHAOS++ supports
complex objects and distributed pointer-based data structures. Performance results for three complete applications are presented in Section 4. Section 5 discusses related work on concurrent object-oriented systems, and we present conclusions and future work in Section 6.

double x[max_nodes], y[max_nodes];   // data arrays
int ia[max_edges], ib[max_edges];    // indirection arrays

for (int n = 0; n < n_steps; n++)
    for (int i = 0; i < size_of_indirection_arrays; i++)   // a parallel loop
        x[ia[i]] += y[ib[i]];

Figure 2: An example with an irregular loop.
2 Runtime Support for Distributed Dynamic Data Structures

In this section, we briefly discuss how runtime techniques have been applied in CHAOS to applications with irregular data access patterns over distributed arrays, and discuss the additional functionality required to apply these techniques to pointer-based data structures consisting of complex objects.
2.1 Overview

Traditionally, scientific computing has proved to be the predominant class of parallel applications. These applications are usually implemented in a Fortran style, with (multi-dimensional) arrays used as the primary data structures. To enable parallel SPMD execution on distributed memory architectures, large data arrays are partitioned among the local memories of the processors, and accesses to non-local array elements are carried out through message passing. In many cases, the data access patterns to the arrays are not known until runtime, since array elements are accessed through one or more levels of indirection. Figure 2 illustrates a typical irregular loop. The data access pattern is determined by the indirection arrays, ia and ib, whose values are not known until runtime.

For the loosely synchronous applications described in the introduction, however, the data access pattern of a parallel computation phase is usually known before entering the computation phase, and the access pattern is repeated many times. This makes it possible to utilize various preprocessing strategies to optimize the computation and communication. The CHAOS runtime library [7] has been developed to efficiently handle such adaptive and irregular problems. It employs various runtime techniques to optimize data movement between processor memories. CHAOS carries out its optimization through two phases, the inspector phase and the executor phase [27]. During program execution, the CHAOS inspector routines examine the data references, given as global indices, and precompute the locations of the data each processor needs to send and receive.

For irregular problems, large data arrays are often partitioned in an irregular manner for
performance reasons, such as reduced communication cost or better load balance. To implement irregular data partitioning, the CHAOS library constructs a translation table that contains the host processor number and the local address of every array element, and uses the translation table to convert global indices into processor numbers and local indices. Since the translation table can be quite large (the same size as the data array), it can be either replicated or distributed across the processors.

On distributed memory MIMD architectures, there is typically a non-trivial communication latency or startup cost. For efficiency reasons, the CHAOS inspector routines optimize multiple non-local references to the same data item through simple software caching, and coalesce requests for non-local data into relatively large messages to reduce both communication latency and startup costs. The result of these optimizations is a communication schedule, which is used by the CHAOS data transportation routines in the executor phase to efficiently carry out the interprocessor communication needed to collect the data for the computation phase. The communication pattern does not change as long as the data access pattern remains the same (i.e., the values in the indirection arrays do not change), in which case the communication schedule can be constructed once by the CHAOS inspector routines and reused multiple times by the CHAOS executor routines. This effectively amortizes the (often considerable) cost of building a communication schedule.

CHAOS also provides primitives to redistribute data arrays efficiently at runtime to improve data locality and load balance. Special attention has been devoted to optimizing the inspector routines for adaptive applications, where data access patterns change occasionally at runtime, so that communication schedules cannot be reused many times [28].
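To make the two phases concrete, the loop from Figure 2 might be restructured along the lines of the sketch below. This is only an illustration: the interfaces (build_translation_table, build_schedule, gather) are hypothetical stand-ins rather than actual CHAOS primitives, and the sketch assumes that off-processor elements of y are gathered into a ghost region at the end of the local array, with the indirection arrays already translated to local offsets.

// Assumed, illustrative interfaces (not the actual CHAOS API):
struct TranslationTable;
struct Schedule;
TranslationTable *build_translation_table(const int *owned_globals, int n_owned);
Schedule *build_schedule(TranslationTable *table, const int *refs, int n_refs);
void gather(double *data, Schedule *sched);   // fills the ghost region of 'data'

void sweep(double x_local[], double y_local[],   // y_local has room for ghost copies
           int ia_local[], int ib_local[],       // already translated to local offsets
           int owned_globals[], int n_owned, int n_refs, int n_steps)
{
    // Inspector: run once, since the indirection arrays do not change.
    // Global indices are translated to (processor, local offset) pairs and
    // duplicate off-processor references are coalesced into one schedule.
    TranslationTable *table = build_translation_table(owned_globals, n_owned);
    Schedule *sched = build_schedule(table, ib_local, n_refs);

    // Executor: run every time step, reusing the same schedule.
    for (int n = 0; n < n_steps; n++) {
        gather(y_local, sched);   // one aggregated exchange fills the ghost copies
        for (int i = 0; i < n_refs; i++)
            x_local[ia_local[i]] += y_local[ib_local[i]];   // purely local accesses
    }
}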
2.2 Issues in Runtime Support for Pointer-Based Data Structures

The CHAOS runtime library has been successfully applied to irregular and adaptive problems that use distributed arrays of primitive data types (integers, double-precision floating point numbers, etc.). As discussed in the previous subsection, data access patterns in these applications are typically represented as global indices, which are examined by the CHAOS inspector primitives to generate communication schedules. The CHAOS data transportation primitives carry out the interprocessor communication, using simple assignments or calls to an efficient memory block copy routine to pack and unpack array elements between data arrays and message buffers.

However, in an object-oriented model, where programmers are allowed to define complex composite objects with pointers to sub-objects, and to build data structures with pointers, two significant problems must be addressed for a runtime system to support applications that make use of such complex data structures.

First, the runtime system must provide support for arbitrarily complex objects that contain pointers to sub-objects. This typically happens when a hierarchy of data types is defined, as was shown in Section 1, and sub-objects contained within a container object are instantiated dynamically during program execution. In this case, a container object and its sub-objects are unlikely to occupy consecutive memory locations. As a consequence, remote accesses to container objects, which are carried out by message passing on a distributed memory architecture, require more sophisticated functionality than a memory-to-memory block copy to pack and unpack the complex objects from
message buffers. The runtime system must follow pointers when packing a container object into a message for a send operation, and pack all of its sub-objects as well. At the receiving end, the processor must instantiate the container object when unpacking it from the message buffer, and also instantiate all the sub-objects that were packed into the message. Pointers in the container object must also be initialized properly, since local pointers on one processor bear no correspondence to those on another processor.

The second problem that must be addressed is support for naming and finding off-processor objects in a pointer-based data structure. Since elements (objects) may be added to and removed from pointer-based data structures dynamically at runtime, no static global names are associated with the elements, and accesses to those elements are performed only through pointer dereferencing. As a consequence, access patterns to such data structures cannot be determined until runtime, so only runtime optimization techniques can be applied. As previously discussed, objects in pointer-based data structures do not have global names or indices, as array elements do, so it is not feasible to use an existing runtime system that relies on the existence of global indices.

Due to the use of pointers, partitioning a large pointer-based data structure across a distributed memory architecture also introduces the need for global pointers, since two objects connected via a pointer may be assigned to two different processors. In contrast to a regular (local) pointer, which is only valid within the address space of a single processor, a global pointer may point to an object owned by another processor. It effectively consists of a processor identifier and a local pointer (one that is only valid on the named processor), and has been implemented as part of several language extensions, including Split-C [16], CC++ [5], and pC++ [19].
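A global pointer of this kind can be pictured as in the minimal sketch below, which simply pairs a processor identifier with a local pointer as just described; it illustrates the concept only, and is not the representation used by CHAOS++ or by the cited language extensions.

// A global pointer names an object anywhere in the distributed machine:
// a processor identifier plus an address valid only on that processor.
template <class T>
struct GlobalPtr {
    int proc;    // identifier of the owning processor
    T   *addr;   // local pointer; meaningful only in proc's address space

    // Dereferencing is safe only on the owner; any other processor must
    // obtain the object's contents through interprocessor communication.
    bool is_local(int my_proc) const { return proc == my_proc; }
};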
3 The CHAOS++ Runtime Library

CHAOS++ is designed to effectively support applications that contain complex data types and pointer-based data structures. As with applications that utilize the CHAOS library, a parallel application executing in loosely synchronous mode using CHAOS++ partitions its data structures among processors, and performs a preprocessing phase before each computation phase. The preprocessing phase examines the data references for the subsequent computation phase (for arrays the references are global array indices, and for pointer-based data structures they are global pointers) to determine the communication required to fetch the off-processor data needed to execute the computation phase. A communication schedule is generated, and is used by the CHAOS++ data exchange routines to perform the required communication.

As described in Section 2, CHAOS++ must be able to deal with complex data types and global pointers. These two problems are handled by two kinds of CHAOS++ objects, called mobile objects and globally addressable objects. The model that CHAOS++ provides relies heavily on class inheritance, as supported by the C++ language.
3.1 Mobile Objects
class Buffer;            // message buffer

class Mobject {          // mobile object
public:
    virtual void pack(Buffer&);
    virtual void unpack(Buffer&);
};
Figure 3: Class definition of Mobject.

To be able to send the contents of arbitrarily complex objects between processors, CHAOS++ defines an abstract data type, called Mobject, for these mobile objects. A Mobject is essentially an object that knows how to pack and unpack itself to or from a message buffer. The CHAOS++ data exchange primitives make use of these capabilities when transferring Mobjects between processors. The Mobject class, shown in Figure 3, contains two virtual member functions, pack and unpack. It is designed as a base class for all objects whose contents need to be transferred between processors. These include objects that may be accessed by processors other than the ones they are assigned to, and objects that are allowed to migrate from processor to processor. Object migration may be required, for example, when a collection of objects is redistributed during execution to provide better load balance among processors.

When a user-defined class is derived from Mobject, users are expected to provide implementations of the pack and unpack functions. Dynamic binding of virtual functions in C++ ensures that the CHAOS++ runtime system always invokes the appropriate implementation. For an object that occupies only consecutive memory locations, the pack and unpack functions consist of a simple memory copy between the object data members and the message buffer. However, for a more complex object that is either derived from other classes, or contains pointers to sub-objects and thus has parts to be copied scattered throughout program memory (the runtime heap), the application programmer must provide implementations of pack and unpack that perform a deep copy. To be more specific, on the sending processor the pack function should copy into the message buffer
the contents of the declared members of the object, the contents of the members of all its base classes, and the contents of all its sub-objects (those pointed to by members).

A straightforward implementation of pack for a complex object can be obtained by deriving all of its base classes and all the classes of its sub-objects from Mobject. The pack function for the object can then recursively call the pack function of each of its base classes and each of its sub-objects, as shown in Figure 4. On the receiving processor, the unpack function must perform the inverse operation. That is, it must interpret the flattened object data in the message buffer, packed by the sender's pack function, re-instantiate the container object and all of its sub-objects, and properly initialize the pointers that connect them.
// a base class containing appropriate pack and unpack implementations
class BaseClass : public Mobject { /* ... */ };

// a sub-object class containing appropriate pack and unpack implementations
class SubObject : public Mobject { /* ... */ };

class MyClass : public BaseClass {   // inherits Mobject through BaseClass
    SubObject *sub_obj;              // a pointer to an array of SubObject
    int ndata;                       // length of the array above
public:
    void pack(Buffer&);
    void unpack(Buffer&);
};

void MyClass::pack(Buffer& buffer)
{
    BaseClass::pack(buffer);         // recursively pack the base class members
    buffer.write(ndata);             // record the array length
    for (int i = 0; i < ndata; i++)
        sub_obj[i].pack(buffer);     // recursively pack each sub-object
}

Figure 4: An example implementation of the pack function for a complex object.
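For completeness, a matching unpack for the class in Figure 4 might look like the sketch below. It assumes a Buffer::read counterpart to the write call used above, and re-allocates the sub-object array with new; both are assumptions for illustration, not part of the interfaces shown in Figure 3.

void MyClass::unpack(Buffer& buffer)
{
    BaseClass::unpack(buffer);        // restore the base class members first
    buffer.read(ndata);               // recover the array length (assumed API)
    sub_obj = new SubObject[ndata];   // re-instantiate the sub-objects locally
    for (int i = 0; i < ndata; i++)
        sub_obj[i].unpack(buffer);    // each sub-object restores its own state
}

Because unpack allocates the sub-objects and assigns sub_obj on the receiving processor, the pointers in the container end up properly initialized in the local address space, as the discussion above requires.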