Experience with Benchmark Applications on the AP1000 Herod Object Store
C. W. Johnson, S. Fenwick, W. Keating
E-mail: [email protected]
ACSys - Advanced Computational Systems Cooperative Research Centre and ANU-Fujitsu CAP Project, Australian National University
Dept of Computer Science, ANU, Canberra, A.C.T. 0200, Australia
Abstract

A multicomputer persistent object store (MPOS) is a software layer that is intended to help in programming irregular, data-intensive applications on multicomputers such as the Fujitsu AP series. We have built a prototype MPOS with an architecture of specialised object-server processors and client-application processors, over the HiDIOS multiple-disk filesystem. Application processes see a common object address space, using optimistic transactions for concurrency control. Objects transparently persist on disk, via page buffers in the object servers. We report experience and scalability performance from two benchmark application studies, one in scientific computation (an N-body tree code) and the other a parallel version of the OO7 object-oriented database benchmark.
1 Introduction
1.1 A multicomputer persistent object store - HeROD
The HeROD multicomputer persistent object store allows multiple application processes running in a multicomputer to see a common, persistent store made of object-shaped data. The store looks like a shared data heap of pointer-linked records. The HeROD design has been described at a previous Parallel Computing Workshop [9]. The ideal persistent object store (POS) has the properties of persistence by reachability, crash resilience or durability, and location transparency. The property of persistence by reachability means that any object which can be reached by following pointers from a defined root of persistence, or from any object that can be reached this way, transitively, will live longer than the program which created it, and can be used by later or concurrently executing programs. A resilient POS is an implementation that saves updated objects to non-volatile or stable storage (such as disk storage) in a consistent way, so that an internally consistent state of the store can be recovered after an application or system crash. This must be done efficiently, and requires extra care when concurrent processes may be using the store. This is similar to some of the requirements of databases, and common database techniques have also been adapted to implementing a concurrent POS. For example, the performance of the Data Safe method adapted to a persistent object store has recently been studied on the Fujitsu AP1000 multicomputer [3]. The property of location transparency means that referring to any object is done in the same way at every point in the application program, without needing to know whether that object is on disk or in memory on another processor. This allows the program to use the stability of disk storage without any need for the programmer to program input/output in any way.
1.2 Programming applications for a Multicomputer Object Store
The HeROD POS is designed to allow experiments with application programs that use large numbers of objects in the AP1000 parallel computer. Because the HeROD store object references (OIDs) are location transparent and the store is durable, the programs are much easier to write than conventional multicomputer programs: the programmer does not need to do any message passing or input/output, yet the store automatically and transparently transfers and replicates data between processors and saves updated objects to disk. The programmer can view the data as a shared heap space. The method of concurrency control in HeROD is by optimistic transactions, so the programmer does not need to be concerned with message passing for data transfer or for synchronisation. Application programs that are built of cooperating processes with little contention for data objects are clearly served well by this model. Where there is more free competition between processes, or significant contention for common data, performance may suffer compared to more carefully crafted programs for the same problems which avoid such contention by more explicit management of the data by the programmer. Such careful programming is a significant cost which contributes to the perceived difficulty of parallel programming. One of the objectives of our work was to compare the performance and ease of programming for representative applications. In this paper we discuss the programming and performance of two benchmark applications from different areas: (1) in parallel scientific computing, part of an N-body treecode, and (2) in object-oriented database programming, part of the OO7 benchmark suite.
1.3 Software Architecture of HeROD

The HeROD store software architecture is layered. The store is a layer of application processes over an object layer, over a storage layer. It runs as a client-server architecture and uses the HiDIOS parallel filesystem [14], whereby all processors have equal access to the filesystem, based on disks that may be attached to only a subset of the processors. The user chooses some number of the processors to be servers and the remainder act as clients. The clients each run an application process which requests objects from the local object layer; the object layer in turn requests operations from storage layer processes running on the server processors, which access the parallel filesystem. Application processes run concurrently, and concurrency control is provided in the form of optimistic transactions by the store to guarantee coherent and consistent reading and writing of the store by the application processes. The object layer holds a cache of objects at each client, for use in its current transaction. The application requests objects by their unique object-identifier, an OID; if the object is not present in the local object layer cache, this object fault is handled by the object layer requesting a copy of the object from the server layer in a particular server processor, as determined by the value of the object's OID. The servers' storage layer processes each hold a buffer of pages of objects, and they will page fault and fetch a page full of objects from the filesystem if necessary. The layout of storage layer pages on disks is simple. The filesystem lays out files in stripes across the multiple disks. The use of the whole space is partitioned between HeROD servers by each having full control of a number of files. There is no constraint on HeROD OIDs being cross-references between files. The layout on disk follows the filesystem allocation of sequential allocation across stripes, and no attention is paid to the locality of the disk to the server processor. Because of the fast network transfer rates and the efficiency of the HiDIOS filesystem there is little time cost in this independent layered design. The storage layer operates in 4 kbyte pages; filesystem stripes are each 32 pages (128 kbytes).
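As an illustration of the object-fault path just described, the following minimal sketch shows how a client object layer might resolve an OID to its owning server and fetch a copy on a cache miss. The names (oid_t, fetch, request_object_from_server), the modulo mapping and the sizes are assumptions made for the sketch, not the actual HeROD interface.

    /* Sketch of client-side object faulting in a HeROD-like store.  All
       names, the modulo OID-to-server mapping and the sizes are
       illustrative assumptions, not the actual HeROD interface. */
    #include <stdint.h>
    #include <stdlib.h>

    typedef uint64_t oid_t;

    enum { NUM_SERVERS = 50, CACHE_SLOTS = 4096, OBJ_SIZE = 256 };

    /* The owning server is determined by the value of the OID; a simple
       modulo mapping is assumed here purely for illustration. */
    static int server_for_oid(oid_t oid) { return (int)(oid % NUM_SERVERS); }

    /* A toy direct-mapped cache standing in for the client object layer. */
    static struct { oid_t oid; void *obj; } cache[CACHE_SLOTS];

    /* Stub for the blocking request to the storage layer of one server; in
       the real system this is a message to a server process, which may in
       turn page-fault a 4 kbyte page in from the parallel filesystem. */
    static void *request_object_from_server(int server, oid_t oid)
    {
        (void)server; (void)oid;
        return calloc(1, OBJ_SIZE);
    }

    /* fetch(): return a local-memory handle for the object named by oid.
       A cache miss is an "object fault": a copy is requested from the
       server that owns the OID and kept for the current transaction. */
    void *fetch(oid_t oid)
    {
        size_t slot = (size_t)(oid % CACHE_SLOTS);
        if (cache[slot].obj == NULL || cache[slot].oid != oid) {
            free(cache[slot].obj);        /* evict any previous occupant */
            cache[slot].obj = request_object_from_server(server_for_oid(oid), oid);
            cache[slot].oid = oid;
        }
        return cache[slot].obj;
    }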
1.4 Optimistic Transactions

HeROD provides concurrency control through flat, optimistic transactions, each of which runs in a single application process. Optimistic transactions have the property that the operations in the transaction are executed without being blocked, and no locks are placed on objects during transactions; only when the transaction tries to commit is it determined whether the transaction is valid, that is, whether it is serialisable against all other previously committed transactions. A valid transaction is committed (and cannot be retracted); an invalid transaction is aborted and must be retried. To be valid, a transaction requires that it has used a consistent version of each object. A full algorithm for distributed servers and possibly replicated data is described by Bassiouni [2]. In HeROD each object is managed by a single server, so a considerably simpler algorithm is possible. We further simplify the algorithm by being slightly more conservative when transactions possibly interact, possibly aborting both transactions in some cases, and were thereby able to use local version numbers on each object, rather than a global timestamp on transactions.

Committing a transaction is a two-phase process during which each of the servers covered by that transaction is locked against all other requests. When a transaction commits, the implementation is in two stages: (1) validate and (2) update or abort. To validate a transaction the client object layer sends two sets of OIDs, a WriteSet and a ReadSet, to each of the servers that are involved. Each server determines whether the transaction is valid against the timestamped version number of each of its objects at the server, and replies to the client object layer. If all of the replies from servers are positive, the client enters the update phase: it commits the transaction and signals the servers, sending them updated values for all objects that it has changed. During the validate and update phases all of these objects in the servers' buffers remain locked against any other access by other clients. The updated pages are written to disk before the servers can proceed. This clearly has serious implications for the performance of applications that have a high rate of transactions, for transactions with a large number of objects that are likely to be spread across many servers, and for transactions that are all likely to refer to some common object such as the root of a tree, and therefore all involve that object's owning server. This is discussed further below. A transaction is judged invalid by the client if any of the validation replies is negative. The transaction is then aborted by the client object layer: it discards all objects in the client's cache, signals the servers to remove the locks on those objects at the servers, and the transaction is retried by the application. At present the cache is flushed at the end of every successful or aborted transaction and there are no optimisations such as retaining some objects in the cache.
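To make the validate stage concrete, the following minimal sketch shows the per-server check under the simplification used in HeROD of local version numbers on each object. The types, field names and helper stubs are illustrative assumptions, not the real server code.

    /* Sketch of the server-side validate phase of an optimistic commit. */
    #include <stdint.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef uint64_t oid_t;

    typedef struct {            /* one entry of a ReadSet or WriteSet          */
        oid_t    oid;
        uint32_t version_seen;  /* version the client used in the transaction  */
    } set_entry;

    /* Stubs standing in for the server's page-buffer bookkeeping. */
    static uint32_t current_version(oid_t oid) { (void)oid; return 0; }
    static void     lock_object(oid_t oid)     { (void)oid; }

    /* validate(): true iff every object the transaction read or wrote is
       still at the version the client saw.  On success the objects remain
       locked against other clients until the update (or abort) message;
       on failure the client aborts and retries the whole transaction. */
    bool validate(const set_entry *read_set,  size_t nreads,
                  const set_entry *write_set, size_t nwrites)
    {
        for (size_t i = 0; i < nreads; i++)
            if (current_version(read_set[i].oid) != read_set[i].version_seen)
                return false;                       /* stale read           */
        for (size_t i = 0; i < nwrites; i++)
            if (current_version(write_set[i].oid) != write_set[i].version_seen)
                return false;                       /* conflicting update   */
        for (size_t i = 0; i < nreads; i++)  lock_object(read_set[i].oid);
        for (size_t i = 0; i < nwrites; i++) lock_object(write_set[i].oid);
        return true;
    }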
2 N-body Application Benchmark

The application we use as a benchmark example of a computational science application is the oct-tree N-body algorithm described by Barnes and Hut [1]. A parallel implementation of this algorithm was developed by Warren and Salmon [15]. This implementation was carefully built to exploit a Thinking Machines Corporation CM5, and in fact did this well enough to win the Gordon Bell prize. A later implementation study by the same authors made use of an indexing scheme that in some ways resembles a user-level implementation of the simple store-wide object addressing of HeROD, but greatly increased the complexity of the programming [16]. Another example of this algorithm was studied by Scales and Lam [12] using the SAM virtual distributed object memory system. This system has quite different concurrency control mechanisms that required a lot of programmer attention. Another distributed implementation of the algorithm, for the Amoeba distributed operating system, is described by Romein and Bal [11]. Much of the complexity of the Warren and Salmon algorithms resulted from the issues of data distribution, and HeROD's virtual shared object space and optimistic concurrency control were expected to greatly simplify the programming task.

The N-body problem occurs in many fields of astronomy, for example galactic and cosmological simulations. It involves calculating the gravitational forces acting between a large number of massive bodies, and thus simulating the evolution of the system over time. A naive way to calculate the force acting on each body is to sum the attraction of every other body in the system, but this approach leads to N^2 complexity. The Barnes and Hut approach reduces an N-body force calculation from N^2 complexity to approximately N log N, by arranging the bodies into a spatial oct-tree. This involves a hierarchical subdivision of the coordinate space into cubic cells, each of which is recursively divided into eight sub-cells whenever more than one body occupies the cell. Calculating the forces on a particular body due to each other body can be reduced by using an approximation for the cluster of bodies in sub-trees that are sufficiently distant, the cluster then being treated as a single body with the sum of its constituent masses positioned at the centre of mass of the cluster.

3 Oct-Tree Implementation Using HeROD

Our implementation of the N-body problem is based on an implementation of the same problem for the SAM distributed memory system [12]. Because of the differing characteristics of SAM and HeROD the details of our implementation differ considerably from those of the SAM implementation. However, the overall structure is very similar. Each physical processor runs a single application process, and the responsibility for bodies is divided evenly between processors. The implementation has the following phases of computation:

1. Initialisation. The set of bodies to be simulated is created. In our example each body is given a random initial position with a clustering distribution. Each processor generates an equal number of bodies. Alternatively the bodies could be read in from an external file. While generating the initial data each processor calculates a cube that encloses all of the bodies that it generated. This phase involves no sharing of data.

2. Build Oct-tree.
(a) Bounding cube. A bounding cube that encloses all the bodies is calculated. This phase involves each of the processors adding its cube's bounds to determine an overall enclosing cube.
(b) Insert bodies. Each process inserts its set of bodies into the common oct-tree. Each body is independently inserted into the shared tree, which is initially empty. Inserting each body into the tree involves descending from the root of the tree to a leaf node, at each level choosing one child to follow depending on the position of the body, and creating additional internal tree nodes when two bodies would otherwise lie at the same leaf position (a sketch of this descent is given after this list).
(c) Combine Clusters. The total mass and centre of mass is calculated for all internal nodes. This is done serially by a single processor.

3. Calculate Forces and Update. Each process chooses a set of bodies in a physical neighbourhood (contained in a particular subtree) and calculates the force on each one. The force calculation for each body traverses the tree, descending into each part of the tree as far as necessary to obtain an accurate force calculation. Once the force has been calculated the velocity and position of the body can be updated for the next iteration. A new set of bounding cubes is then calculated for each processor. This phase has not been implemented in our benchmark example.

4. The computation is repeated from step 2(a) using the updated positions and velocities.
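The body-insertion descent of phase 2(b) can be sketched as follows on an ordinary in-memory oct-tree; in the real program each node is a HeROD object reached through an OID and each insertion runs inside a transaction that is retried if it conflicts. The types and names are illustrative, not the benchmark code.

    #include <stdlib.h>
    #include <stdbool.h>

    typedef struct { double x, y, z, mass; } body_t;

    typedef struct node {
        double       centre[3], half;   /* cube centre and half-width       */
        body_t      *body;              /* body stored at a leaf, else NULL */
        struct node *child[8];
    } node_t;

    static bool has_children(const node_t *n)
    {
        for (int i = 0; i < 8; i++)
            if (n->child[i] != NULL) return true;
        return false;
    }

    /* Index (0..7) of the octant of n's cube containing body b. */
    static int octant(const node_t *n, const body_t *b)
    {
        return (b->x > n->centre[0]) |
               ((b->y > n->centre[1]) << 1) |
               ((b->z > n->centre[2]) << 2);
    }

    static node_t *new_child(const node_t *parent, int oct)
    {
        node_t *c = calloc(1, sizeof *c);
        c->half = parent->half / 2.0;
        for (int d = 0; d < 3; d++)
            c->centre[d] = parent->centre[d] + (((oct >> d) & 1) ? c->half : -c->half);
        return c;
    }

    /* Descend from the root to a leaf, choosing one child per level by the
       body's position, and split a leaf into an internal node whenever two
       bodies would otherwise occupy the same cell (phase 2(b) above). */
    static void insert(node_t *n, body_t *b)
    {
        if (!has_children(n) && n->body == NULL) {  /* empty leaf: store here  */
            n->body = b;
            return;
        }
        if (!has_children(n)) {                     /* occupied leaf: split it */
            body_t *old = n->body;
            n->body = NULL;
            int oo = octant(n, old);
            n->child[oo] = new_child(n, oo);
            insert(n->child[oo], old);
        }
        int o = octant(n, b);                       /* internal node: descend  */
        if (n->child[o] == NULL)
            n->child[o] = new_child(n, o);
        insert(n->child[o], b);
    }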
3.1 Programming Issues

Writing the N-body application raised several issues concerning the way in which distributed object stores should be programmed. In this section we discuss some of these issues and their implications for the design of distributed object stores. This application exploits the shared heap and concurrency control of the HeROD POS, but in this example does not exploit the durability. Durability provides a transparent checkpoint facility that could be useful for long-running executions of this algorithm, as discussed below in Section 3.2.1.
3.1.1 Synchronisation

One of the apparent advantages of concurrent processing over distributed object stores is that they avoid the need for explicit synchronisation.
Since the store is always in a consistent state it should be possible for an application to execute operations at any time without regard to other operations that are taking place concurrently. If there are any conflicts between operations they are resolved by the transaction mechanism. In practice things are not quite that simple. Many applications have a notion of consistency that extends beyond the normal scale of transactions, which are designed to ensure read-write consistency over objects or groups of objects. For example, Section 3 described the N-body program in terms of phases, each of which must be executed in sequence. Phase 2b, inserting the bodies into the oct-tree, is possible only once phase 2a, the generation of a cube enclosing the entire set of bodies, has been completed. Phase 2a in turn cannot be completed until all processes have finished generating bodies in phase 1 or updating them in phase 3. It is not possible to make the whole of a phase a single transaction for each process, since this would serialise all the processing in the phase, because each process reads many objects and updates some objects in the common tree. The insertion of a single object is a suitable scale of operation for a transaction, so that many insertions can proceed concurrently; processes only collide and back out (retry) if two processes simultaneously (in a transaction timescale) insert into one interior node, or insert a new node before a leaf. But at this scale it is possible that one processor may complete several transactions in phase 2b before another has completed all object generation and bounding cube calculation in phase 2a. If some processes start inserting bodies into the oct-tree while other processes are still generating bodies there is a possibility that some of the new bodies will lie outside the initially computed bounds of the oct-tree, in which case the oct-tree may be ill-formed and the computation will fail. This is a type of conflict that will not be detected using the standard transaction mechanism. To deal with this problem we needed to violate the pure "object store paradigm" by using traditional barrier synchronisation operations (MPI_Barrier) to separate the phases. In principle this could be avoided by spin locking on the value of a single object, but this would be very inefficient, creating a high load on the single server that manages the spin lock object and probably interfering with other clients that are still doing useful work. The barrier synchronisation call is functionally equivalent to an implementation that uses the object store, but avoids overloading the servers.
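The resulting structure of one simulation step is sketched below; only MPI_Barrier is a real library call, and the phase functions are placeholders for the application code described above.

    #include <mpi.h>

    extern void generate_or_update_bodies(void);   /* phase 1 or phase 3         */
    extern void compute_bounding_cube(void);       /* phase 2(a)                 */
    extern void insert_bodies(void);               /* phase 2(b): one transaction
                                                      per insertion              */
    extern void combine_clusters(void);            /* phase 2(c), single process */

    void build_tree_step(void)
    {
        generate_or_update_bodies();
        MPI_Barrier(MPI_COMM_WORLD);   /* no cube calculation until every
                                          process has finished its bodies   */
        compute_bounding_cube();
        MPI_Barrier(MPI_COMM_WORLD);   /* no insertion until the overall
                                          enclosing cube is known           */
        insert_bodies();
        MPI_Barrier(MPI_COMM_WORLD);   /* all insertions complete before the
                                          clusters are combined             */
        combine_clusters();
    }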
3.1.2 Shared work pool problem

As stated above, the central object-insertion phase of the N-body program repeatedly takes a body from a shared pool and inserts it into the oct-tree. An obvious way to implement this is for many processors to perform insertions concurrently, representing each body as an object and maintaining the pool as a shared collection structure such as a linked list. However, this creates a problem: it turns out that performing insertions and deletions on such common collection structures effectively serialises the computation. To see how this happens, suppose that a linked list is used to store the pool of bodies, and that several clients are simultaneously executing a loop of the form:
    while (more bodies) {
        repeat {
            start transaction;
            fetch body from list;
            add body to tree;
            end transaction;
        } until (last transaction was successful);
    }
Although this is a very natural way of implementing this stage of the problem, it turns out that it allows virtually no parallelism. The reason for this is that all transactions that take an object from the linked list have to update the value of an object that points to the head of the list. Any other process that reads from the list before the first transaction is committed will read the same object, and will attempt to insert it again into the tree. When the second process's transaction tries to commit, it will fail (the pool object has been updated by the first to commit, so the second transaction is invalid) and the process must retry the transaction. This time it will fetch another object and possibly succeed, but only one object from the pool can be successfully worked with by any number of processes: the computation has been serialised. This is an example of the very general worker-farm paradigm, and it is essential to have a solution that allows parallelism. The same problem does not occur with the concurrent insertions of body objects into the oct-tree itself, since after the first few nodes have been created the processors are likely to be inserting into different nodes, at some distance down the tree. Our measurements of data contention below bear this out. In this example we can avoid the problem by completely partitioning the work pool, and having one pool per processor. Each processor generates a separate pool of bodies, takes only from its own pool, and inserts into the common tree. The data structure required is a separate list of bodies for each process, and the operations of inserting generated bodies into the lists and removing them cause no contention at all. Although it works well in this case (a minimal sketch of the partitioned scheme follows the two points below), this is not ideal for two reasons:
It does not support a general worker-farm paradigm, because it provides no load-balancing effect between processors by sharing the pool. In our case the number of work items that each processor must handle is fixed and nearly equal, and the insertion times vary only slightly.
It assumes that each processor generates its own work items, or that the production of items is phased and does not contend with their consumption. The producers and consumers may contend for their common work item pools unless additional layers of data structures are used to avoid this, or, as in this example, the application is phased.
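A minimal sketch of the partitioned scheme, with one body list per process so that taking the next work item never touches shared state; the names are illustrative, not the HeROD interface.

    #include <stddef.h>

    #define MAX_BODIES_PER_PROC 10000

    typedef struct { double x, y, z, mass; } body_t;

    typedef struct {
        body_t items[MAX_BODIES_PER_PROC];
        size_t count;                 /* bodies still waiting to be inserted */
    } local_pool_t;

    /* Each process owns exactly one pool; no other process reads or writes
       it, so fetching the next body can never cause a transaction conflict.
       Only the insertion into the shared oct-tree runs inside a transaction. */
    static body_t *next_body(local_pool_t *pool)
    {
        if (pool->count == 0)
            return NULL;         /* this process's share of the work is done */
        return &pool->items[--pool->count];
    }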
We have explored a range of other solutions to this problem, including store support for new datatypes such as a Weakly FIFO Queue [13], splitting the transactions to separate operations on the pool and those on the tree (at some increased complexity to maintain crash recoverability), and nested transactions[10]. These approaches are discussed in a forthcoming Technical Report[8].
3.2 Performance Issues

We have studied the performance of the tree construction phase of this problem, and not the force calculation phase. Although the latter is expected to be much more time consuming in a real instance of this problem, once the shareable tree structure has been set up by the construction phase the force calculation involves very little movement of data and no contention. This phase is therefore expected to scale very well and run at near full computational speed. The tree construction phase reveals the issues of a multicomputer object store, since it is data intensive and has many opportunities for contention. Tuning the performance of the N-body application raised two important considerations in the use and design of distributed object stores. These are the relationship between data coherency and crash recovery (both of which HeROD currently addresses using transactions), and the trade-off between data replication and the cost of maintaining replicated data, which again is closely related to HeROD's transaction mechanism. We report a series of experiments for insertion of 10,000 bodies into an oct-tree. The hardware configuration was the ANU Fujitsu AP1000 with 128 processors, each with 16 Mbytes of RAM; the filesystem had 16 disks attached to separate processors, each of 4 Gbytes. Our experiments use fewer than 100 processors as clients and servers, and the 16 processors with disks attached were not included in this set.
3.2.1 Data Coherency and Crash Recovery

The atomic transactions provided by HeROD have four properties described by the acronym ACID: atomicity, consistency, independence, and durability. Atomicity refers to the property that a transaction is never seen in a half-completed state. Consistency and independence refer to the properties that each transaction sees a consistent view of the store and executes independently of other transactions. Durability refers to the property that a successfully committed transaction is permanent. The property of durability includes crash resiliency, which means that data will not be lost if the system crashes. However, strict durability is unnecessary for many classes of problem, and it has a significant performance overhead that is not justified by its benefits. The cost arises from the fact that to be durable every transaction must be written to disk when it is committed. If transactions are small, which is the case in the N-body problem, this results in many writes of small amounts of data to disk. This writing of data to disk accounts for a major proportion of the total execution time, as shown by comparing the line labelled "no write to disk" in Figure 6.

The nature of the N-body program means that strict durability is unnecessary. The ACID transaction model was developed for databases that describe, and frequently interact with, the outside world. Losing information in the database is unacceptable because then the state of the database no longer reflects the state of the outside world. Crash recovery should be to the state of the store at the last committed transaction. For the same reasons of real-world interaction, crash recovery should be fast. The N-body application, along with many other scientific multicomputer applications, does not interact with the outside world except through the object store interface to the filesystem. Thus if the application or system crashes it is acceptable to restore the store to any consistent past state, not necessarily the most recent one, since the computation performed since that state was valid can easily be repeated. It is unnecessary to restore the state to the last committed transaction. Since system crashes are usually very infrequent, it may well be acceptable to use a slow recovery method, traded off against time saved at every transaction. This resembles traditional checkpoint mechanisms for long-running computations, but transactions automatically provide coherence of the saved state. Given these considerations and the high cost of providing strict durability, it is sensible to write data to disk much less frequently. In HeROD at present the only way for an application to do this is to combine its transactions into larger ones, for example, inserting several objects into the tree before ending the transaction. However, since transactions are also used to control data consistency this has the side-effect of causing some increased data contention.

[Figure 1: Effect of transaction size on performance, with 5 clients and 5 servers (10,000 bodies). Execution time vs. insertions per transaction (0-20), curves for 5 clients and 20 clients.]

[Figure 2: Effect of transaction size on proportion of aborted transactions, with 5 clients and 5 servers (10,000 bodies). Ratio of aborted to successful transactions vs. insertions per transaction (0-20), curves for 5 clients and 20 clients.]
We made an experimental investigation of this issue. The gains in performance from inserting larger numbers of bodies in one transaction, compared to just inserting one, can be seen in Figure 1. The increased data contention on tree nodes is revealed in the rising number of aborted transactions as the transaction size is increased, in Figure 2. The increased performance is apparent up to approximately 10 bodies inserted per transaction; the effect of more aborted transactions, because of increasing data contention on tree nodes, is clearly apparent, but this does not affect overall performance below about 12 insertions per transaction. This is because an aborted and retried transaction requires no disk I/O, and is relatively cheap compared to the disk I/O required to commit a successful transaction.
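The batching of insertions that these measurements explore can be sketched as follows; begin_transaction, commit_transaction and insert_into_tree stand in for the real HeROD and application calls and are assumptions of the sketch.

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct { double x, y, z, mass; } body_t;

    extern void begin_transaction(void);
    extern bool commit_transaction(void);     /* false => validation failed */
    extern void insert_into_tree(const body_t *b);

    /* Insert bodies[0..n) using k insertions per transaction.  Larger k means
       fewer disk writes at commit time, but a higher chance that the whole
       batch is aborted and retried because of contention on shared tree
       nodes, as Figures 1 and 2 show. */
    void insert_batched(const body_t *bodies, size_t n, size_t k)
    {
        for (size_t start = 0; start < n; start += k) {
            size_t end = (start + k < n) ? start + k : n;
            bool committed = false;
            while (!committed) {              /* retry the whole batch on abort */
                begin_transaction();
                for (size_t i = start; i < end; i++)
                    insert_into_tree(&bodies[i]);
                committed = commit_transaction();
            }
        }
    }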
3.2.2 Data Caching and Data Contention

Section 3.2.1 described how increasing the size of transactions results in data being written to disk less frequently. Larger transactions have an additional benefit resulting from the fact that client processors do not maintain local copies of objects between transactions. Thus using larger transactions means that the client's copies of objects are kept longer and fetched less frequently from the server. Since fetching an object from the server is expensive compared to using a local copy, this can have a significant impact on performance. Once again a trade-off results from the fact that larger transactions result in more data contention, which reduces performance. The HeROD implementation discards all objects in the client's cache at the end of a transaction. Any object that is needed again must be fetched again, but it is up to date at that time. It appears attractive to consider maintaining the local copies between transactions, particularly in a program like N-body where, after the initial flurry of in-fill in the top of the tree, many tree nodes change little thereafter but are used by many transactions. On reflection, however, in general this would either lead to more transactions being aborted (since the local objects would be more likely to become out of date) or require additional work to determine whether the objects were valid at the start of the transaction. In the case of such small objects as we see in this study, the cost of a re-validation message exchange is little different from the cost of flushing the local cache and re-fetching all objects on demand. The comparison of large and small sizes of transactions in Figures 1 and 2 points to performance gains from retaining replicas longer, for this N-body program. Further studies are needed to reveal whether this is a general phenomenon, and to what scale of transaction it applies.
3.3 Scalability Performance Measurements

In multicomputer studies the scalability of performance of an application or software system, with respect to problem size or numbers of processors, is a significant measure of quality. We report on varying three parameters of the N-body program: the number of client processes against a fixed number of servers, the number of client/server processors in a fixed ratio, and the number of client/server processes in different ratios. Unlike the normal distributed database studies where the number of servers is small and nearly fixed, in a multicomputer with a parallel filesystem the number of server processors can be freely varied, and the choice of which processors are servers is not tied to the processor having a disk directly attached.
3.3.1 Client Scalability

[Figure 3: Execution time, varying number of client processors (10,000 bodies, 50 servers). Total time (sec) vs. number of clients (0-50), curves for 1 and 8 insertions per transaction.]

[Figure 4: Effect of number of client processors on proportion of transactions aborted (10,000 bodies, 50 servers). Ratio of aborted to successful transactions vs. number of clients (0-50), curves for 1 and 8 insertions per transaction.]

First we look at the effects of varying the number of clients while holding the number of servers constant. Figure 3 shows how the time to insert 10,000 bodies into the oct-tree varies with the number of clients while the number of servers is held at 50. Again, this shows the execution time with both one and eight insertions per transaction. Performance increases well up to around 10 processors and then starts to level off. This levelling off is due to servers becoming overloaded, as well as increased data contention caused by the large number of processors simultaneously accessing the oct-tree.

3.3.2 Overall Scalability

[Figure 5: Execution time, varying number of client and server processors together (10,000 bodies). Total time (sec) vs. number of clients/servers (0-50), curves for 1 insertion/transaction, 8 insertions/transaction, and 1 insertion/transaction with no write to disk.]

[Figure 6: Speedup, varying number of client and server processors together, with one insertion per transaction (10,000 bodies). Speedup vs. number of servers/clients (0-50), measured performance against the ideal.]

Next we present the effects of varying the number of client and server processes together. Figure 5 shows the time to insert 10,000 bodies into the oct-tree varying the total number of client and server processors, but holding the number of each type equal. The x-axis of this graph shows the number of clients, or equivalently the number of servers; the total number of processors is thus twice this value. Figure 5 also shows the effect of disk I/O on system performance. HeROD includes an option that prevents data being written to disk at the end of each transaction. The "no disk write" data shows performance with writing to disk at the end of each transaction disabled. Unfortunately this option cannot be used with small numbers of servers, since it works only if all data used by the application fits in the server's memory. However, the data set shows the expected dramatic improvement in performance, and improved scalability (approaching 20 times), that results from avoiding disk activity at the end of each transaction.

Figure 6 shows the speedup for the single insertion per transaction case relative to a single processor. Unfortunately, because of a limitation of HeROD, it was not possible to run the 10,000 body problem using a single server. For this reason we have estimated the time for one client and one server as being twice the time for two clients and two servers. Thus the graph is likely to slightly overestimate the actual speedup, although this error should not be large. The speedup approaches a value of only 10, which is not unexpected given the intensive data activity and the committing of many very small transactions to disk in this insertion phase.

3.3.3 Client-Server Balance

[Figure 7: Execution time, varying balance of client and server processors (10,000 bodies, 100 processors, clients plus servers). Execution time (sec) vs. number of clients/number of servers from 10/90 to 90/10, curves for 1 and 8 insertions per transaction.]

Figure 7 shows the effect of holding the total number of processors at 100, but varying the balance between clients and servers. Although performance decreases significantly at extreme imbalances, we see that relatively good performance is obtained from a ratio of 2 servers for every client, through a 1:1 ratio, through to a ratio of 2 clients to 1 server. As the number of clients is increased the number of concurrent processors inserting into the oct-tree is increased, tending to improve performance. At the same time two factors decrease performance: the smaller number of servers results in increased server contention, and the larger number of clients results in more data contention. The effect of increased data contention is particularly severe when many clients are used, with the result that with 90 clients, performing eight insertions per transaction gives poorer performance than inserting a single body per transaction.
4 OO7 Object Oriented Database benchmark

The OO7 Object-Oriented Database benchmark suite [5] comprises a database schema and a set of database operations. The database schema is intended to be representative of engineering applications such as CAD/CAM support, but the operations are torture tests such as complete traversal of the data, rather than relating to a particular application. The OO7 benchmark focusses on allowing the implementor of a database to examine the performance of important system components, and for this reason it is revealing to experiment with it on the relatively unusual HeROD architecture.
[Figure 8: Structure of OO7 database schema - a Module (id, type, builddate) with its Manual text, a design_root at the top of an assembly hierarchy of complex assemblies and base assemblies (children numbered 1 to N), and a Design Library of Composite Parts referenced by the base assemblies.]
4.1 OO7 Database Structure

The OO7 database schema is a very different data structure from that in the N-body example, and at a larger scale. The schema is shown in Figure 8. The OO7 database is composed of a number of Modules. Each module is a tree (called an assembly hierarchy). All the non-terminal nodes (with the exception of the root node) in the tree have the same number of child nodes. The root node is relatively small but has a large associated object that normally occupies multiple disk pages. The terminal nodes each point to a given number of complex objects known as composite parts. When the database is generated, the composite parts pointed to by a terminal node are chosen (with replacement) from a set of composite parts. Hence, two or more terminal leaf nodes may refer to the same composite part (i.e. composite parts are shared). A composite part is an object that contains meta-information about the part and a graph of very small atomic parts. The atomic part graph has a fixed number of edges and each edge is represented by a small connection object. There are a fixed number of known edges that form a graph as well as a fixed number of edges linking random objects. The actual database instances created by OO7 are configured at generation time. In particular it is possible to select the depth of the tree, the branching factor, the size of the Manual objects, the number of atomic objects in each composite part and the number of edges (and hence connection objects). The number of composite part objects dominates the database; in the example database we used, the composite parts were more than 99% of the objects in the database (there were 20,000 to 40,000 objects in total, with only 363 objects in the Assemblies).
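For orientation, the schema just described might be laid out as HeROD objects along the following lines; the field names beyond id, type and builddate, and the sizes, are illustrative assumptions for the sketch rather than the benchmark's actual definitions.

    #include <stdint.h>

    typedef uint64_t oid_t;             /* references between objects are OIDs */

    #define NUM_CHILDREN   3            /* assembly fan-out (configurable)     */
    #define PARTS_PER_BASE 3            /* composite parts per base assembly   */
    #define ATOMS_PER_PART 20           /* atomic parts per composite part     */

    typedef struct {                    /* node of the assembly hierarchy      */
        int   id, type, builddate;
        int   is_base;                  /* 1 for a terminal (base) assembly    */
        oid_t children[NUM_CHILDREN];   /* complex assembly: sub-assemblies    */
        oid_t parts[PARTS_PER_BASE];    /* base assembly: shared composite parts */
    } assembly;

    typedef struct {                    /* composite part: meta-info plus a
                                           graph of atomic parts               */
        int   id, builddate;
        oid_t root_atom;
        oid_t atoms[ATOMS_PER_PART];
    } composite_part;

    typedef struct {                    /* very small atomic part              */
        int   id, x, y;
        oid_t out_connections[3];       /* fixed number of outgoing edges      */
    } atomic_part;

    typedef struct {                    /* one edge of the atomic-part graph   */
        int   length;
        oid_t from, to;
    } connection;

    typedef struct {                    /* root of one module                  */
        int   id, type, builddate;
        oid_t design_root;              /* top of the assembly hierarchy       */
        /* The large associated Manual object is omitted here because the
           HeROD port does not support objects larger than about 3.5 kbytes. */
    } module;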
4.2 Programming Issues

We used the original Exodus OO7 benchmark code, written in the E language (an extension of C for an OODB), as a template for creating a HeROD implementation of the benchmark. Exodus is a distributed object store for Ethernet-speed connections and small numbers of servers [4]. The transformation of the program was labour intensive but relatively straightforward because most of the E language constructs could be converted easily into the low level HeROD calls (HeROD lacks a programming language layer). Significant issues in the conversion were:

Encapsulation: Exodus objects contain references, data and methods whereas HeROD objects contain only references and data. This meant that we had to turn the encapsulated E methods into procedures. We also had to decompose inheritance hierarchies and incorporate explicit object creation and deletion code into the relevant procedures.

Pointer Swizzling: In HeROD applications objects are generally referred to using an OID, by the application calling the explicit object fetch operation of the object layer. The operation returns the memory address of the object in local cache; if it is not locally resident in the client, a copy of the object is requested from its server and its return awaited. The application can then repeatedly use the local memory address (the handle) for subsequent references to that object, or can call the fetch operation again at a small additional (local) cost. The handle remains valid for the duration of a transaction. Any references contained in objects themselves are always in the form of OIDs, and may only be used via a fetch operation: the objects are not swizzled. This contrasts with the E programming language, which has implicit object fetches and operates by pointer swizzling. There was some opportunity at this point to "cheat" when writing the HeROD benchmark code, by preserving object addresses in the application and passing them around from method to method in an unnatural and cumbersome fashion, thereby avoiding extra fetch calls. However, in order to be fair to the benchmark we chose to do what most programmers would do. This was to use memory addresses within procedures by performing a single fetch call per object, but to pass only OIDs between procedures (a sketch of this pattern follows the list). This approach turned out to be natural, simple and efficient, once the implementation of fetch had been tuned for speed.

Collections: The collection of objects is a high level storage construct in Exodus/E which is absent in HeROD. At the storage level, collections provide a means both of clustering objects on disk and of fast retrieval of all the objects in a collection; it often makes sense for the programmer to place all objects of a particular type into a single collection. The objects in an Exodus collection can be scanned by referring to the named collection. However, in HeROD all objects are accessed from a single root of persistence and there is no notion of a collection. The OO7 benchmark code used collections extensively. We simulated them in HeROD using the OO7 Association methods, which provide the same functionality as collections without the clustering property.
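A minimal sketch of the fetch-per-procedure pattern adopted for the port: procedures receive OIDs, fetch each object they touch once, and keep the returned handles only for the current transaction. fetch() and the part layout are assumptions of the sketch, not the HeROD or OO7 code.

    #include <stdint.h>

    typedef uint64_t oid_t;

    typedef struct {
        int   id;
        int   natoms;
        oid_t atoms[20];           /* references are stored as OIDs, never as
                                      raw addresses: objects are not swizzled */
    } composite_part;

    typedef struct { int id, x, y; } atomic_part;

    extern void *fetch(oid_t oid); /* the explicit object-layer fetch operation */

    /* Visit every atomic part of one composite part.  Only an OID crosses the
       procedure boundary; the handles from fetch() stay local and are valid
       only for the duration of the current transaction. */
    long traverse_composite(oid_t part_oid)
    {
        composite_part *cp = fetch(part_oid);
        long sum = 0;
        for (int i = 0; i < cp->natoms; i++) {
            atomic_part *ap = fetch(cp->atoms[i]);   /* one fetch per object */
            sum += ap->x + ap->y;
        }
        return sum;
    }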
The HeROD version of the benchmark provides separate programs for database creation, an exhaustive traversal, and an update benchmark. One variation from the Exodus version is that the HeROD version does not support the large Manual objects that are associated with each Module object, because HeROD does not yet support objects larger than about 3.5 kbytes (objects are currently implemented to be smaller than one page, nominally 4 kbytes in size). Database creation is performed in HeROD by a single client and by one or more server processors. This serialises database creation; this is not an inherent limitation of HeROD, but parallelising the creation effectively would require considerable re-design of the creation code. At present the serialised database creation times limit the size of databases that it is practical to create. The traversal is performed in parallel, as described in the next section. In parallelising the traversal algorithm to exploit multiple client processors, we also wanted the parallel benchmark to reflect the simplest parallel code that a programmer would like to write. In order to reduce the amount of change to the sequential code we decided that each client would duplicate the traversal of the relatively small Assemblies tree, but each would traverse only a subset of the leaf node composite parts. The set of leaf nodes is partitioned to the clients in a round-robin fashion based on the leaf node's id. In the original form of the benchmark, many composite parts are shared between different base assemblies. This complicates the analysis of performance, and sharing can be selectively omitted in our version. The "non-shared" option has the side-effect that an effectively smaller database is traversed.
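The round-robin partitioning of the leaf work can be sketched as follows; the names are illustrative, not the benchmark code. With four clients, for example, client 1 processes base assemblies 1, 5, 9 and so on, while every client still reads (but never updates) the shared upper Assemblies tree.

    #include <stdbool.h>

    /* True if this client is responsible for the given base assembly. */
    static bool mine(int base_assembly_id, int client_rank, int num_clients)
    {
        return (base_assembly_id % num_clients) == client_rank;
    }

    /* Called at each terminal (base assembly) node during the traversal;
       traverse_parts() stands in for the per-leaf composite-part work. */
    extern void traverse_parts(int base_assembly_id);

    void visit_base_assembly(int base_assembly_id, int client_rank, int num_clients)
    {
        if (mine(base_assembly_id, client_rank, num_clients))
            traverse_parts(base_assembly_id);
        /* otherwise skip: another client owns this leaf */
    }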
4.3 Performance Issues

In this application the creation of the database in the HeROD persistent object store is performed by executing one program, and it is accessed in the store by the subsequent execution of the traversal and update programs. These programs thus test the performance of retrieving a large number of objects from disk storage via the HeROD storage layer mechanisms, and the server page caches are tested in their ability to hold and organise the very large amount of data involved. By running both "cold" and "hot" traversals the performance of the server memory alone can be distinguished from that of disk and servers combined.

The T1 benchmark traversal operation of OO7 performs a depth-first traversal of the benchmark database. When a leaf node is encountered all the atomic objects within the leaf node (the composite part) are traversed. The OO7 tree of Assemblies is created in a depth-first manner and hence any system that naively creates objects on disk adjacent to the previously created object will obtain significant disk locality benefits during traversal of this tree. However, as noted above, the Assemblies tree constitutes a very small proportion of the database. The good disk locality is broken by the fact that each base assembly refers to a random set of composite parts, ensuring random accesses to get a composite part. Again, however, within each composite part objects are traversed in the order they are created. Hence there is a big advantage in this operation for systems that lay out objects in the simplest and most obvious manner, systems that perform sequential read ahead, and systems with large disk caches (or page caches). Carey et al. claim that T1 is a test of raw traversal speed. It is arguably the most important benchmark operation of OO7 since it shows up most of the object store functionality, i.e. fetching objects into local memory and caching performance across the memory hierarchies. Even though the traversal does not cause any updates to objects, the optimistic transactions used in HeROD require some time to perform a transaction commit. This is necessary to ensure that all of the objects touched by the transaction are in fact not updated by any other concurrent transaction that committed first. The large number of objects traversed by each client in this benchmark, and the lack of any offsetting work performed with the objects, means that this commit validation time may be significant. The effect is exacerbated by the even balance of workload between the client processors (increasing the chances of near-simultaneous commits) and the use by all clients of the common Assemblies tree, ensuring that every transaction involves at least one common server (remembering that all of a client's corresponding servers are blocked during that client's commit validation). We can measure the cost of this by comparing the time taken when we abort, rather than commit, each transaction.
4.4 Scalability Issues

As with the N-body example, the configuration of numbers of servers and clients can be widely varied for OO7 on HeROD. The principal question is how well the benchmark algorithm scales across variable numbers of client and server processors. As noted above, traversals are done in parallel by all client processors, but very little work is done on objects once they are in the clients' local caches: most of the work is in fetching objects from the servers to the clients. Simply scaling up the number of clients is therefore not by itself expected to scale performance well. HeROD provides parallelism through the client processors, the server processors, and the parallel filesystem. Servers accept client requests in a single thread, and normally clients make only blocking requests of servers from their own single thread. The degree of parallelism is thus normally restricted to the number of client processors. Only in committing a transaction does one client possibly cause one server, or multiple servers, to issue concurrent asynchronous filesystem write operations for each updated page. This source of parallelism is not available in a read-only traversal. It is possible to increase the number of servers independently of the number of clients. However, in general, a small number of clients cannot be expected to keep all the servers busy, and so we cannot simply scale the system performance by increasing the number of servers. However, as the database gets bigger, increasing the number of servers will allow more of the database to be held in their combined RAM page caches, and hence even a small number of clients will benefit from removing any need for page replacement in the servers. For a large number of clients, this effect may lead to some super-linearity being observed as the number of servers increases. The scalability parameter that we explore for this system is therefore neither the number of clients nor the number of servers, but the total number of processors, within which the ratio of servers to clients is held fixed. The effect of varying this ratio is the other parameter of interest. If the balance is nearly equal, increasing the ratio of clients to servers may give greater opportunity for parallel server and disk requests. Decreasing the ratio, so increasing the number of servers, has a counter tendency of decreasing the probability of request contention at the servers. Overall performance for large numbers of processors may be bounded by the capacities of the filesystem and the internal network. The difference between a cold traversal (reading all objects initially from disk) and a subsequent hot traversal (taking objects mainly or entirely from server page caches) will indicate the former. There is no simple way to isolate the latter, but it may be taken as the explanation of falling off from ideal speedup at large numbers of client and server processors.
4.5 Performance Measurements

With a large number of available servers, hot performance scales well with the number of clients (a speedup of 38 for 50 clients compared to 1 client is observed). With a cold traversal the speedup is much less and approaches a limit of around 18. A cold traversal has a significant component of disk read requests, and once more than 20 or 30 clients are available to saturate the 16-disk filesystem no further speedup is available.

[Figure 9: Scalability of Traversal T1. Speedup vs. (Clients+Servers)/2 (0-50), curves for cold, cold (no commit), hot (no commit) and hot traversals, against the ideal case.]

The scaling characteristics of the performance of traversal T1 with increasing equal numbers of clients and servers are shown in Figure 9. The graph shows that the performance scales well against the number of clients (or the number of servers) provided that the number of the other is kept in proportion. The cost of the transaction commit mechanism for this traversal operation is approximately 20% of the entire traversal cost. The implementation is particularly inefficient at this point, as the client object layer sends every ReadSet and WriteSet member of the transaction in a separate message to one of the servers. Bundling these messages together is an obvious way to improve performance considerably that has yet to be implemented. The hot, no-commit curve falls away from the ideal at higher numbers. This experiment includes no disk activity nor any commit effect. The fall-off is possibly due to approaching the capacity of the internal communications network. We have not done any follow-up experiments with alternative processor topology to investigate this.

[Figure 10: Varying client:server ratio for T1. Time (secs) vs. total number of processors (0-120), curves for 1:1, 1:2 and 2:1 client:server ratios.]

A comparison of the time taken for T1 with various ratios of number of clients to servers against total number of processors (clients + servers) is shown in Figure 10. It can be seen that above 15 processors total, increasing the ratio from 1:1 to 2:1 makes almost no difference in performance. We tentatively conclude that in this range the increased ratio of clients to servers that might provide more opportunity for parallelism among the clients is balanced by the additional workload of requests at each server. The performance of a server-rich mix, on the other hand, is shown in the 1:2 ratio data. Performance here ranges from 25% to 10% worse than for the 1:1 case (when reading this graph, note that the data points on the different curves are for different values of numbers of processors). This supports the expectation that one client can keep no more than one server occupied; only the decrease in expected random contention at the servers works against this decrease being directly in the same ratio, which would give a 33% decrease.

5 Conclusions
Using the implementation of the HeROD POS for a classic irregular computational application and a data-intensive application demonstrates ease of programming with good performance, while it reveals some limits on scalability. Improvements to the prototype implementation of HeROD can be expected to raise these specific limits significantly, and with its ease of programming the persistent object store model of computation has great potential for application in the coming generation of commodity parallel computers. We have shown that distributed persistent object stores provide an effective and elegant way of sharing data between processors in the multicomputer. In addition, the use of transactions avoids much of the complexity of shared memory programs. In some situations, careful design of the program is needed to avoid the effects of data contention. This was particularly significant when designing a structure to contain the bodies to be inserted into the oct-tree. It was very difficult to find a satisfactory solution to this problem using the services provided by HeROD, leading us to the conclusion that a distributed object store needs to provide specialised data structures to deal efficiently with common programming problems such as the shared work-set structure.

The issues of data coherency and crash recovery need to be separated for acceptable efficiency of object-intensive computation. HeROD uses transactions for both purposes, but significantly slows down computations using transaction concurrency control by writing all updates in a transaction to disk at the end of the transaction. Given that system crashes are generally rare, and in this type of application any lost computation can easily be recreated, a much coarser grain of crash recovery is appropriate. However, fine grain transactions are still needed to avoid data contention. Thus transactions should be used only to maintain coherency, and data written to disk much less frequently than at present, possibly using a separate checkpoint mechanism. This approach is being explored in the form of an explicit StabilizeAll() operation in the Glasgow/Sun Persistent Java project [7].

Our experiments with the N-body and OO7 benchmark applications on the HeROD multicomputer persistent object store indicate that it has met its goal of combining an easy programming model with acceptable performance for two different representative applications. The distributed object space model of HeROD has been easily adapted to two different existing object-structured programs. The benchmark programs are designed to stress test data management and movement alone, and represent the worst case of many applications that require computation with objects. Such computations are expected to be unconstrained by the object store, and so we conclude that, apart from the issue of committing transactions, HeROD is an effective object data manager to support parallel computations in multicomputers. We expect that the performance of committing read-only transactions can be substantially improved with a little further work. Improving the performance of transactions that include updates requires an explicit change to the computational model of transactions, as discussed above, or an automatic feature that defers writing to disk until some clock time has elapsed or some number of transactions have been committed. This is not hard to implement while maintaining coherence, since we can block new attempts to validate or commit transactions while some accumulated updates are written to disk.
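As an illustration only, the deferred-write policy suggested above might look like the following at a server; this is not current HeROD behaviour, and all names and thresholds are assumptions of the sketch.

    #include <stdbool.h>
    #include <time.h>

    #define FLUSH_EVERY_N_COMMITS 64
    #define FLUSH_EVERY_SECONDS   30

    extern void write_accumulated_pages_to_disk(void);  /* stub for server disk I/O */
    extern void block_new_validations(bool blocked);

    static int    commits_since_flush;
    static time_t last_flush;

    /* Called by a server after each successful commit: updated pages stay
       dirty in the server's buffer until one of the thresholds is reached,
       trading a coarser grain of crash recovery for far fewer disk writes. */
    void after_commit(void)
    {
        commits_since_flush++;
        if (commits_since_flush >= FLUSH_EVERY_N_COMMITS ||
            time(NULL) - last_flush >= FLUSH_EVERY_SECONDS) {
            block_new_validations(true);     /* keep the flushed state coherent */
            write_accumulated_pages_to_disk();
            block_new_validations(false);
            commits_since_flush = 0;
            last_flush = time(NULL);
        }
    }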
Acknowledgements

The authors wish to acknowledge the support provided by the Cooperative Research Centre for Advanced Computational Systems established under the Australian Government's Cooperative Research Centres Program and the ANU-Fujitsu CAP Research Program. We must also acknowledge the contributions of all members of the HeROD project team, including Jeffrey Yu and Steven Blackburn for discussions, and Robert Cohen and Dean Jackson for algorithms and implementation.
References

[1] J. Barnes and P. Hut. A hierarchical O(N log N) force-calculation algorithm. Nature, 324:446-449, 1986.
[2] M. A. Bassiouni. Single-site and distributed optimistic protocols for concurrency control. IEEE Transactions on Software Engineering, 14(8):1071-1080, August 1988.
[3] Stephen M. Blackburn, Robin B. Stanton, and Christopher W. Johnson. Recovery and page coherency for a scalable multicomputer object store. In Proceedings of the 30th Hawaii International Conference on System Sciences, 1997.
[4] M. Carey, David J. DeWitt, J. Richardson, and E. Shekita. Storage management for objects in EXODUS. In W. Kim and F. Lochovsky, editors, Object-oriented Concepts, Databases and Applications. Addison-Wesley, 1989.
[5] M. J. Carey, D. J. DeWitt, C. Kant, and J. F. Naughton. A status report on the OO7 OODBMS benchmarking effort. In Proceedings of OOPSLA'94, 1994.
[6] J. Darlington, editor. Proceedings of the Fourth International Parallel Computing Workshop, London, U.K., 1995. Imperial College Fujitsu Parallel Computing Centre.
[7] Laurent Daynes. A flexible transaction model for persistent Java. In M. Jordan and M. Atkinson, editors, First International Workshop on Persistence and Java (PJ1), 1996.
[8] Stephen Fenwick and Chris Johnson. HeROD flavoured oct-trees. Technical Report TR-CS-96-nn, Dept of Computer Science, Australian National University, 1996. To appear.
[9] C. W. Johnson, J. X. Yu, and R. B. Stanton. Architecture of a high-performance persistent object store. In Darlington [6], pages 297-306.
[10] J. Eliot B. Moss. Design of the Mneme persistent object system. Transactions on Information Systems, 8(2):103-139, 1990.
[11] John W. Romein and Henri E. Bal. Parallel N-body simulation on a large-scale homogeneous distributed system. In EuroPar'95, August 1995. N-body for the Amoeba operating system.
[12] D. J. Scales and M. S. Lam. The design and evaluation of a shared object system for distributed memory machines. In First Symposium on Operating Systems Design and Implementation, pages 101-114, November 1994.
[13] P. M. Schwarz and A. Z. Spector. Synchronizing shared abstract data types. ACM Transactions on Computer Systems, 2(3):223-250, August 1984.
[14] A. Tridgell and D. Walsh. The HiDIOS file system. In Darlington [6], pages 53-63.
[15] Michael S. Warren and John K. Salmon. Astrophysical N-body simulations using hierarchical tree data structures. In Proceedings of Supercomputing '92, pages 570-576. IEEE Computer Society, 1992.
[16] Michael S. Warren and John K. Salmon. A parallel hashed oct-tree N-body algorithm. In Proceedings of Supercomputing '93, pages 12-21. IEEE Computer Society, 1993.