Architecture of a high performance persistent object store

C.W. Johnson, J.X. Yu, R.B. Stanton
Advanced Computational Systems CRC and Dept of Computer Science, Australian National University
{cwj,yu,rbs}@cs.anu.edu.au

1 Introduction

Applications for multicomputers (massively parallel, MIMD, distributed memory computers) are apparently under challenge from the continuing increase in speed of single processors, and from their combination over LANs as NoWs (networks of workstations) and over shared memory as SMP (symmetric multiprocessor) systems. However, there are many classes of applications which are well suited to the multicomputer's larger number of processors (compared with SMP architectures), its coordinated parallelism, and its high-speed interconnectivity. This is becoming more apparent as effective, scalable parallel filesystems[5, 15] become available, addressing the multicomputer's previous problem of inadequate I/O performance. However, there remains a class of problems whose irregular structure makes them hard to program for current multicomputer systems, although they display the potential to benefit from the architecture. For the programmer and automatic compiler alike these applications pose daunting problems: finding and managing hundred-fold parallelism, synchronisation and data transfer by message passing, and a broad memory hierarchy of processor cache, processor-local RAM, other processors' RAM, local and parallel disk, and remote disk. We believe that the use of a persistent object store can be part of the solution to this problem. A persistent object store is a software layer that provides a location-independent, uniform view of data as objects throughout the storage hierarchy. In single processor stores "location independence" refers to hiding from the programmer the distinction between storage on disk and in processor RAM, by automatically moving objects between them on demand when a pointer to an object is dereferenced. Several persistent object stores have been built under various models[16, 9, 2]. Issues such as the methods for pointer transparency, garbage collection, crash resilience (disk coherence), pointer swizzling, and the engineering of these methods for performance are involved in the design of the store itself; the concept also raises questions in programming languages, such as type safety for persistent objects and type extensibility, and new programming development methods using browsers and embedded hyperlinks[4]. In a multiprocessor store concurrency control also becomes an issue, and in a distributed memory system the issues of sharing or copying, the level in the memory hierarchy at which a global address space is constructed, data movement between processors, and cache coherence between local memories must be considered. Performance also depends on object placement and clustering in logical proximity with respect to the application access patterns, the cache coherence protocols, the method of stabilising to the disk, and the speed of communication and data movement between processors[6]. The multicomputers we consider are designed for high performance: tens to hundreds of parallel processors with tens of megabytes of RAM each, tens of disks, and a parallel filesystem. They require an object store that is high performance and scalable. The size of objects suited to multicomputer applications is typically larger than those common to single processor stores: many thousands of objects of 600-700 bytes, for instance for computational molecular modelling[11]; thousands of text objects up to 2.5 MByte for document indexing and search applications[7]; hundreds of images of 7.5 MByte[17]. The high performance object store needs to support data parallel operations over individual objects or many replicas, as much as concurrent operations on separate objects. Our focus is on massive concurrency, object movement and storage issues, and the interaction of the object store and the filesystem; we ignore language and type issues, and garbage collection. The issues specific to a persistent object store for this class of computer include:

- effective concurrency control – an efficient mechanism suited to programming for large datasets
- utilisation of high bandwidth disk – hiding latency and utilising the high bandwidth of parallel disks
- large memory – effective use of the large total amount of RAM
- distributed memory – use of memory in local and remote processors
- scalability – up to hundreds of processors
- thread or data movement – multiprocessing in nodes and moving threads between nodes are poorly supported by multicomputer systems; moving data between nodes must minimise latency


[Figure 1: HeROD simplified architecture diagram. A client node runs the application above a local object manager with an object/fragment cache (the Object Layer); each server-manager node runs a transaction manager, a local object manager and a storage manager, with a Residence Table, Page Map, page cache and stabiliser (the Storage Layer), holding part of the global store; comms managers connect clients, other managers, and the filesystem over the parallel disk system.]

2 An experimental multicomputer object store

The HeROD project in the Advanced Computational Systems CRC is concerned with comparing and implementing multicomputer object store techniques. Part of the project is to construct a prototype persistent object store for the Fujitsu AP1000/+ architecture[13], utilising the locally developed HiDIOS parallel filesystem[15]. The significant features of the filesystem are that it presents a single filesystem image at all nodes; each file is striped across a number of disks; and it presents a MIMD/SPMD viewpoint – application processes can perform I/O independently (with normal read and write calls very similar to UNIX System V) or cooperatively, in that a read or write issued by one node may manage data in another node's address space.

2.1 HeROD object store architecture

Objects and references

An object has a simple structure: a vector of object references and a vector of untyped data. This is a storage abstraction of objects, not a programming language view. To support the use of large objects we allow fragments of objects to be referred to and handled at the application. A fragment is a contiguous subrange of references or untyped data from an object. An object has a unique system-wide identity known as an OID.
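As a concrete illustration, this storage abstraction might be rendered in C roughly as follows. All type and field names here are our own sketch; the paper defines the structure only abstractly.

```c
#include <stddef.h>
#include <stdint.h>

/* A system-wide unique object identity (OID); the concrete width
   is our assumption, not specified in the paper. */
typedef uint64_t OID;

/* Storage-level view of an object: a vector of object references
   plus a vector of untyped data. */
typedef struct {
    OID      oid;     /* this object's identity */
    size_t   nrefs;   /* length of the reference vector */
    OID     *refs;    /* vector of references to other objects */
    size_t   ndata;   /* length of the untyped data vector */
    uint8_t *data;    /* vector of untyped data bytes */
} Object;

/* A fragment: a contiguous subrange of one of an object's vectors,
   which can be fetched and cached independently of the whole object. */
typedef struct {
    OID    oid;       /* the owning object */
    int    in_refs;   /* nonzero: subrange of refs; zero: of data */
    size_t offset;    /* start of the contiguous subrange */
    size_t length;    /* number of elements in the subrange */
} Fragment;
```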

Object layer and storage layer

The object store has two abstract layers: an object layer and a storage layer. The object layer is concerned with the handling of objects and fragments of objects to and from application processes. It is also responsible for concurrency control by transaction management. The storage layer is responsible for the grouping of objects into storage units (pages) and the movement of pages into and out of the filesystem. It maintains a coherent stabilised image of the store on the filesystem that reflects the transactions committed up to some earlier point in time. Two classes of object representation are used in the storage layer: small objects (many to a page) and (very) large objects (one or more pages to the object).
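For instance, the small-object representation packs many objects into one storage-layer page; a minimal sketch of such a page header (our own layout, not one documented for HeROD, reusing the types declared above) is:

```c
/* Hypothetical layout of a "small object" page: many objects packed
   into one fixed-size storage unit, located via an offset table. */
#define PAGE_SIZE 8192            /* assumed page size */

typedef struct {
    uint32_t count;               /* number of objects on this page */
    uint32_t offsets[];           /* byte offset of each object body;
                                     the packed bodies fill the rest
                                     of the page */
} SmallObjectPage;
```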

Client-manager specialisation and caching

HeROD provides a load-store model of object use. An application process loads objects by request from a server, reads and/or updates them, and stores the updates through the server when a transaction commits. We specialise processor nodes into being either application nodes (clients) or managers (servers). They behave as a variant of an object-server system[6] with multiple servers: data is passed between application-client and manager-server as objects, and the managers are responsible for controlling transfers to and from disk in the form of pages. The object layer is realised as library routines in the application process on application nodes, and as routines in the manager nodes, communicating across a simple messaging protocol. The object layer manages a local cache of objects and fragments of objects in each application process (one process per node, for simplicity). The storage layer is realised as the sole user-level process in each manager node. The managers' responsibility for handling pages is also used to provide a large global distributed cache, organised in pages, between the filesystem and the local memory of applications. Each manager is responsible for organising a configuration-fixed part of this cache and the associated part of the filesystem. The mapping from OID to manager is fixed by the runtime configuration of the store, according to the number of manager processors in the configuration. Each object is the responsibility of a single manager, and all application process requests concerning that object are made to that manager.
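Since the OID-to-manager mapping is fixed by the runtime configuration, one plausible realisation (ours; the paper does not give the actual function) simply partitions the OID space over the configured number of managers:

```c
/* Hypothetical configuration-fixed mapping from an OID to the manager
   node responsible for that object. Any deterministic function of the
   OID and the manager count would serve; modulo is the simplest. */
static int oid_to_manager(OID oid, int num_managers)
{
    return (int)(oid % (OID)num_managers);
}
```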

2.2 Concurrency control – the optimistic transaction model

Concurrency control is by optimistic transactions[1]. A transaction belongs to a single application process: the one process starts the transaction, performs object accesses, and ends the transaction. No object locks are used. In Bassiouni's words: "Transactions are allowed to proceed without any synchronisation; but updates are made on local copies of objects. At the end of execution, a validation is performed to determine whether conflicts may have occurred... If a conflict is detected the transaction is backed up [i.e. aborted]... otherwise the transaction is committed and the local updates are made global." The determination of conflict between transactions rests on comparisons between the sets of objects that have been read and written by this transaction, and the sets belonging to transactions whose lifetimes overlap the time from this transaction's start until its attempt to commit.

In the HeROD architecture the "local copies" of objects are those in the application node's local cache. Once an object fragment has been fetched from its manager, read and write access is made directly by the application process, without notifying the manager. The HeROD programming interface allows the application program to access the local copy of object data directly with a C language pointer and normal offsets and indexing. The client maintains the Read and Write sets of its transactions: the interface automatically adds any fetched fragment to the Read set, and provides a user function DidWrite that includes fragments in the Write set. We expect to use a variant of Bassiouni's "Unsynchronized Clocks, Distributed Database" algorithm[1] to update data in the managers' page caches when a transaction commits. The significant parts of the algorithm are as follows (a sketch of the client-side lifecycle appears after this outline):

start transaction: the client requests a global timestamp sequence number from a central node.

end transaction: the client informs all the managers that are covered by the transaction, i.e. those that are responsible for any of the objects in its Read set or Write set. The managers validate the transaction using a two-phase algorithm among themselves to determine whether this transaction is free of conflict with any others previously committed. Only these covered managers need be involved in this validation, since any conflict for this transaction can only be on objects in its Read or Write set. For each transaction a super-manager is appointed within the covered group. When the validation phase is completed by the managers, the super-manager informs the client that the transaction can commit or must abort.

commit transaction: to commit its updates the client sends update messages to the managers covered by the Write set. The updates apply to pages in each manager's own part of the global page cache (this update in memory replaces Bassiouni's operation of writing to disk). We refer to this phase, during which a client is updating manager pages, as the completing phase of the transaction.

abort transaction: the client deletes all its local fragments of objects. If the application decides to retry the transaction it must get a fresh timestamp and re-fetch fresh copies. This can safely be optimised to retain a previous copy if the manager replies that the previous version is still current.
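Putting these pieces together, a client-side transaction might look roughly like the sketch below. Only DidWrite is named in the paper; txn_begin, fetch_object, txn_end, and the Account type are hypothetical names invented here for illustration (OID is the type from the earlier sketch).

```c
typedef struct Txn Txn;                     /* opaque transaction handle */
typedef struct { long balance; } Account;   /* example application object */

extern Txn  *txn_begin(void);               /* requests a global timestamp */
extern void *fetch_object(Txn *, OID);      /* fetch; auto-added to Read set */
extern void  DidWrite(Txn *, OID);          /* add fragment to Write set */
extern int   txn_end(Txn *);                /* validate; commit or abort */
#define TXN_COMMITTED 1

void transfer(OID src, OID dst)
{
    for (;;) {
        Txn *t = txn_begin();

        /* Local copies land in this node's object cache; reads and
           writes then go directly through C pointers, without
           notifying the manager. */
        Account *a = fetch_object(t, src);
        Account *b = fetch_object(t, dst);

        a->balance -= 100;
        b->balance += 100;

        DidWrite(t, src);                   /* record the updates */
        DidWrite(t, dst);

        if (txn_end(t) == TXN_COMMITTED)    /* validation + completing */
            return;
        /* On abort the local fragments are deleted; retry with a
           fresh timestamp and freshly fetched copies. */
    }
}
```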

The manager maintains a number of versions of each logical page in its part of the cache: possibly a current page and possibly a number of completed pages. The current version of the page includes the combined updates of all transactions for objects on that page that are completing or have completed. The page is simultaneously being updated and serving as the source for new object fetch requests from other clients. Any read-write conflicts that might lead to inconsistencies in this shared use are detected by the optimistic transaction algorithm and will cause the offending new transaction to abort.
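A manager's versioned-page bookkeeping might be sketched as follows; this is again our own illustration of the description above, not HeROD code.

```c
/* Hypothetical version chain for one logical page in a manager's part
   of the global cache: a current version absorbing the updates of
   completing transactions, plus earlier completed versions. */
typedef struct PageVersion {
    uint64_t            commit_ts;  /* timestamp of the latest
                                       transaction applied to it */
    uint8_t            *bytes;      /* the page contents */
    struct PageVersion *older;      /* next completed version, or NULL */
} PageVersion;

typedef struct {
    uint64_t     page_no;           /* logical page number */
    PageVersion *current;           /* head of the chain; also serves
                                       new object fetch requests */
} LogicalPage;
```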

2.3 Stabilisation of updates

The optimistic concurrency control algorithm ensures that the transaction is completed at the manager without conflicts; that is, the application processes execute with consistent, transaction-coherent views of the object store. It does not guarantee that the transaction has been committed to the filesystem at the time that the application resumes execution. We accept this distinction to enhance performance. In the typical multicomputer application we are not concerned with the real-time effects of transactions that must commit down to stable store at the moment a response is given at an automated teller machine or airline reservation screen; for us the purpose of the store's stability is to allow resumption of computation with a consistent store after gentle failure of the hardware system, expiry of the execution time allocated to the application, or software failures in the application code. Loss of several seconds or even minutes of processing in the event of recovery is acceptable. The stabilisation algorithm is still under development and is not further described here.

3 Intra-transaction concurrency – data parallelism

The mechanism described above allows for inter-transaction concurrency, and is particularly effective where there is very little interaction or conflict between transactions. For many problems the maximum effective parallelism can only be achieved by intra-transaction concurrency. We will achieve this in the form of data parallel operations by many processors on partitions or replicas of single objects. Using process groups and collective operations in the MPI sense[8], we introduce group transactions and partitioned or replicated group fetch operations. A group transaction starts by collective collaboration between a named group of processes. A group fetch is also a collective operation, which optionally replicates or partitions an object's data into the group members' local object caches. Subsequent group fetches in the same transaction are necessarily in the same process group but may partition other objects differently. This concept supports data parallelism and also allows very large objects: those that are larger than any one manager's storage. Neither the entire object nor substantial fragments of it could be handled efficiently by the page-caching mechanism described earlier, since the manager's cache would be thrashed by this one object's pages. The manager remains responsible for controlling the movement of the data, but can do so by requesting the filesystem to transfer the data directly between disk and the client group's memories. The manager storage layer determines a list of disk addresses, lengths and corresponding client addresses for the whole request, and issues a vector of remote-destination filesystem requests that are performed asynchronously to maximise parallel I/O throughput. The group transaction does not monitor or prevent conflict between the client processes in the group. A group transaction effectively broadens the focus of attention to include the whole data parallel operation within the transaction, so any internal conflict becomes the programmer's responsibility.
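In MPI terms, a group transaction and a partitioned group fetch might be used as in the following sketch; the group_* interface is hypothetical, invented here to mirror the description above (OID is the type from the earlier sketch).

```c
#include <mpi.h>
#include <stddef.h>

/* Hypothetical collective interface for group transactions. */
typedef struct GroupTxn GroupTxn;
typedef struct { void *base; size_t length; } Partition;

extern GroupTxn  *group_txn_begin(MPI_Comm group);         /* collective */
extern Partition  group_fetch_partition(GroupTxn *, OID);  /* collective */
extern int        group_txn_end(GroupTxn *);               /* collective */

void scan_large_object(MPI_Comm group, OID big)
{
    GroupTxn *t = group_txn_begin(group);   /* collective start */

    /* Each member receives only its partition of the object; the
       manager directs the filesystem to move the data directly from
       disk into the members' memories, bypassing its page cache. */
    Partition mine = group_fetch_partition(t, big);

    /* ... data-parallel computation over mine.base[0..mine.length) ... */

    group_txn_end(t);   /* conflicts within the group are the
                           programmer's responsibility, not detected */
}
```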

4 Discussion

Distribution of functionality

This architecture utilises the large amounts of RAM available in the multicomputer's processors, and the attached processing power of those processors configured as managers, to provide a very large, distributed, page-structured cache for the store. The managers are not servers in the traditional distributed database sense, in that the filesystem does not appear as a number of disks intimately attached to individual servers, but is itself distributed and equally accessible to all. The high performance read and write operations to a third-party processor reinforce this view of a more intimately coupled system than the traditional distributed systems models[6]. The issue of object serving, page serving or file serving is met here by choosing a form of object serving, but with a different division of responsibilities that accords with the closer networking of the processors. In an intimate network the choice between transferring data to another processor and operating on it locally tends to favour transfer, as we show by requiring all simple and bulk object operations to be done in the clients. The cost of making requests object-by-object is less here than in a conventional distributed system, which leads us to favour object service rather than page service[6]. The use of an object cache rather than a page cache at clients avoids false usage conflicts on pages, and maximises the utilisation of memory across the system. The clients cache only the objects (or fragments) that they need; the managers cache potentially shareable data, at the page level, and optimise disk traffic in page-sized units. The likelihood of logically related objects being used concurrently by multiple processors in multicomputer applications encourages us in this architecture. Intra-page clustering – the clustering of related objects onto pages – does however remain a significant performance issue, as in traditional OO databases. In the parallel filesystem there is a counter-argument against logical grouping of related pages. The striped architecture of a parallel filesystem maximises access concurrency by spreading blocks of adjacent addresses across many disks, so that where a sufficiently large amount of data is transferred in one access the latency overheads of several disk accesses are paid concurrently. For the same reason, having logically adjacent object store pages allocated on different disk devices can apparently be advantageous. This factor needs investigation. It will also have an impact on our future choice of shadow paging or logging file structures.

Parallelism and scalability

The combination of transaction-based concurrency and parallel scientific computation is a previously unexplored area that requires further study. The HeROD architecture cleanly supports independent transactions that, similarly to the use of an interactive database, cause independent queries and updates to objects in the store that interact only by chance. The optimistic transaction is well suited to this case of loosely coupled processes. A stronger coupling of processes is exemplified by visualisation, rendering and data-intensive search problems, which cooperatively use parts or copies of the bulk data contained in objects or groups of objects. Our group operations meet some of the needs of these applications, where the single-process transaction model is inadequate. Case studies are needed to determine how effective this facility will be, and where its restrictions lie. Further opportunities for parallelism come from applying the data parallel group operations in programs. Several parallel programming systems provide ways of expressing overlapping and selectively synchronised data partitions, e.g. DINO[12] and GA[10], but we have not explored these issues for the object-structured store. The potential amount of parallelism in any real system is subject to degradation by bottlenecks, and the way these bottlenecks come to dominate with increasing system size determines scalability. In this case we note that there are potential bottlenecks in validating, committing and stabilising transactions. The only central bottleneck is the provision of a common timestamp service to start every transaction. This is a very cheap service, and we estimate that a large number of requests can be served by a single processor. As noted by Bassiouni, this service may be provided by more than one processor if the load is heavy, with periodic synchronisation between them. Other potential bottlenecks in the validation and stabilisation of transactions are minimised by performing their operations within the minimal group of managers covered by the transaction, and by keeping the ordering of client-to-manager updates minimally synchronised. The only serial dependency should be between multiple clients validly updating the same objects.

Experimental evaluation strategy

Our further work involves tuning the design and system implementation, and comparing the performance of the HeROD store with other multicomputer and distributed systems. To do so we expect to draw benchmark programs from the object-oriented database domain, such as the Wisconsin OO7 benchmark[3], which represents characteristics of CAD/CAM/CASE data structures. Its shortcoming for our purposes is its assumption of loosely coupled operations (there is no visualisation/rendering stage of processing, for example). The computational benchmarks of high performance computing also include some that are apparently suited to an object store, such as the SPLASH suite's Water N-body simulation benchmark[14], recently used as a demonstrator of the distributed Amoeba system[11]. Significant unknowns to be explored by benchmarking include the optimum page size, which we conjecture should be larger than in conventional systems to make use of larger cache memory and a relatively higher ratio of total processor power to disks; the client-server ratio; and the logical and physical (de)clustering of related pages in the parallel filesystem.

Acknowledgments

The authors wish to acknowledge the support provided by the Cooperative Research Centre for Advanced Computational Systems established under the Australian Government's Cooperative Research Centres Program and the ANU-Fujitsu CAP Research Program.

References

[1] M. A. Bassiouni. Single-site and distributed optimistic protocols for concurrency control. IEEE Transactions on Software Engineering, 14(8):1071-1080, August 1988.

[2] M. Carey, David J. DeWitt, J. Richardson, and E. Shekita. Storage management for objects in EXODUS. In W. Kim and F. Lochovsky, editors, Object-Oriented Concepts, Databases and Applications. Addison-Wesley, 1989.

[3] M. J. Carey, D. J. DeWitt, C. Kant, and J. F. Naughton. A status report on the OO7 OODBMS benchmarking effort. In Proceedings of OOPSLA'94, 1994.

[4] R. C. H. Connor, Q. I. Cutts, G. N. C. Kirby, and R. Morrison. Exploring the boundaries of static safety in persistent application systems. In R. Kotagiri, editor, Proceedings of the Eighteenth Australasian Computer Science Conference, pages 99-107, 1995.

[5] Peter F. Corbett and Dror G. Feitelson. Design and implementation of the Vesta parallel file system. In Proceedings of the Scalable High Performance Computing Conference, pages 63-70. IEEE Computer Society, May 1994.

[6] David J. DeWitt, Philippe Futtersack, David Maier, and Fernando Velez. A study of three alternative workstation-server architectures for object-oriented database systems. In Dennis McLeod, Ron Sacks-Davis, and Hans Schek, editors, Proceedings of the 16th VLDB Conference, pages 107-121. Morgan Kaufmann, 1990.

[7] D. Hawking and P. Thistlewaite. Searching for meaning with the help of a PADRE. In D. K. Harman, editor, Proceedings of the Third Text Retrieval Conference (TREC-3), pages 257-268. US Department of Commerce, National Institute of Standards and Technology, 1995.

[8] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, version 1.1, June 1995.

[9] J. Eliot B. Moss. Design of the Mneme persistent object system. ACM Transactions on Information Systems, 8(2):103-139, 1990.

[10] J. Nieplocha, R. J. Harrison, and R. J. Littlefield. Global Arrays: a portable "shared-memory" programming model for distributed memory computers. In Supercomputing'94, 1994.

[11] John W. Romein and Henri E. Bal. Parallel N-body simulation on a large-scale homogeneous distributed system. In EuroPar'95, August 1995.

[12] M. Rosing, R. B. Schnabel, and R. P. Weaver. The DINO parallel programming language. Technical Report CU-CS-457-90, Department of Computer Science, University of Colorado at Boulder, Boulder, Colorado, April 1990.

[13] O. Shiraki, Y. Koyanagi, N. Imamura, K. Hayashi, T. Shimizu, T. Horie, and H. Ishihata. Architecture of highly parallel computer AP1000+. In PCW'94: Proceedings of the Third Parallel Computing Conference, pages P1-F-1-9, Kawasaki, Japan, November 1994. Fujitsu Parallel Computing Research Facilities, Fujitsu Laboratories Ltd.

[14] J. P. Singh, W.-D. Weber, and A. Gupta. SPLASH: Stanford parallel applications for shared memory. ACM Computer Architecture News, 20(1), March 1992.

[15] A. Tridgell and D. Walsh. The HiDIOS file system. In Fujitsu Parallel Computing Workshop (to appear), 1995.

[16] Francis Vaughan and Alan Dearle. Supporting large persistent stores using conventional hardware. In Proc. Fifth International Workshop on Persistent Object Systems, San Miniato, Italy, September 1992.

[17] The Visible Human Project, November 1994. http://www.nlm.nih.gov/extramural research.dir/visible human.html
