A Distributed JavaSpace Implementation for Harness

Mauro Migliardi, Simon Schubiger, Vaidy Sunderam
Abstract
Harness is an experimental metacomputing system based upon the principle of dynamically reconfigurable, object-oriented, networked computing frameworks. Harness supports reconfiguration not only in terms of the computers and networks that comprise the virtual machine, but also in the capabilities of the VM itself. These characteristics allow the construction of modular programming environments that can be plugged into the Harness system on demand. As a proof of concept exercise, and also with an intent to provide a flexible coordination facility for the Harness system, Sun's JavaSpace interface was implemented as a Harness plug-in. The JavaSpaces model is part of the Java distributed computing effort and is essentially an object-oriented tuple space. JavaSpaces technology is a simple unified mechanism for dynamic communication, coordination, and sharing of objects between Java technology-based network resources. This paper introduces the Harness system and discusses design, implementation, and preliminary experiences with the JavaSpace for Harness.

Keywords: Reconfigurable Metacomputing Systems, Flexible Programming Environments, Pluggable Frameworks, Distributed Components, PVM, JavaSpaces
1 Introduction

Harness [16, 15] is an experimental metacomputing system based upon the principle of dynamically reconfigurable, object-oriented, networked computing frameworks. Harness supports reconfiguration not only in terms of the computers and networks that comprise the virtual machine, but also in the capabilities of the VM itself. These characteristics may be modified under user control via an object-oriented "plug-in" mechanism that is the central feature of the system. The motivation for a plug-in based approach to reconfigurable virtual machines is derived from two observations. First, distributed and cluster computing technologies change often in response to new machine capabilities, interconnection network types, protocols, and application requirements. At the system level, the capability to reconfigure the set of services delivered by the virtual machine assists in overcoming obsolescence-related problems and in the incorporation of new technologies. The second reason for investigating the plug-in model is to attempt to provide a virtual machine environment that can dynamically adapt to meet an application's needs, rather than forcing the application to fit into a fixed environment. In fact, the Harness reconfiguration capability and its object-oriented design allow building modular, plug-in based programming environments that can be plugged into the system on demand in order to tailor the system to the needs of the application.
Mauro Migliardi: Department of Math & Computer Science, Emory University, Atlanta, GA 30322; [email protected]
Simon Schubiger: IIUF, Universite de Fribourg, Chemin du Musee 3, CH-1700 Fribourg; [email protected]
Vaidy Sunderam: Department of Math & Computer Science, Emory University, Atlanta, GA 30322; [email protected]
Research supported in part by NSF grant ACI-9872167 and DoE grant DE-FG02-99ER25379.
As a proof of concept exercise, we selected the JavaSpaces technology and developed a modular, distributed, plug-in based implementation of JavaSpaces as an example of a programming environment that can be loaded into Harness on demand. As the JavaSpaces technology aims to provide a simple unified mechanism for dynamic communication, coordination, and sharing of objects between Java technology-based network resources, it is a good candidate for a Harness plug-in, and likely to be useful in a wide range of applications. JavaSpace for Harness extends the set of Harness services by a Linda-like coordination space. Instead of providing just an adapter for the existing JavaSpace implementation, the goals were to implement a distributed space by reusing existing plug-ins as well as by writing new ones. Distributing the space should add some degree of fault tolerance and eventually improve performance. Having the system decomposed into plug-ins allows us to tailor the JavaSpace to the application needs and available hardware. This paper first gives an introduction to the Harness system. Section 3 then explains the JavaSpaces model and describes the implementation of the JavaSpace for Harness. Preliminary performance data are provided in section 4, and comparisons to related work are presented in section 5. Finally, some concluding remarks are given in section 6.
2 Harness

The main achievement of the Harness system is its higher abstraction level compared to other distributed computing environments such as MPI [12] or PVM [11]. Instead of providing a virtual machine with a fixed set of services, Harness provides mechanisms to tailor the virtual machine to the needs of the application. This reconfiguration of the virtual machine can even happen at runtime, further increasing its flexibility. This means that the Harness system can easily incorporate new technologies, is independent of the parallel programming model used by the application, and can easily adapt to new models. For example, the availability of Myrinet [5] interfaces and Illinois Fast Messages [19] has recently led to new models for closely-coupled Network Of Workstations (NOW) computing systems. Similarly, multicast protocols and better algorithms for video and audio codecs have led to a number of projects that focus on telepresence over distributed systems. In traditional metacomputing frameworks the underlying middleware either needs to be changed or re-constructed, thereby increasing the effort involved and hampering interoperability. A virtual machine model intrinsically incorporating reconfiguration capabilities has the potential to address these issues in an effective manner.

At the programming level, the flexibility of the system translates into the capability to support different programming models and environments. The ability to reconfigure services provided by the system by plugging in or replacing computational components is natural and consistent with an object-oriented programming environment based on distributed components. Such a programming environment supports the development of reusable applications capable of adapting themselves at run-time by means of behavioral objects [18]. However, the Harness framework is also able to support Harness-unaware legacy applications; in fact, its reconfigurability can be exploited to define and build different programming environments layered on top of its native model [17]. Besides, the distributed component-based nature of Harness gives these programming environments the capability to extend and adapt themselves at run-time to optimally suit the needs of applications.
[Figure 1: Overview of the Harness system. Level 1: the abstract Distributed Virtual Machine (DVM); Level 2: heterogeneous computational resources; Level 3: services (plug-ins); Level 4: the service baseline seen by applications. Users reconfigure the DVM both by changing the set of computational resources enrolled in the DVM and by changing the DVM capabilities (adding or removing services).]
2.1 Fundamental Abstractions and System Architecture
The fundamental abstraction in the Harness metacomputing framework is the Distributed Virtual Machine (DVM) (see figure 1, level 1). A DVM is associated with a symbolic name that is unique within a Harness name space, but has no physical entities connected to it. Heterogeneous computational resources may enroll into a DVM (see figure 1, level 2) at any time; however, at this level the DVM is not yet ready for use by applications. To adapt to applications' needs, the heterogeneous computational resources enrolled in a DVM need to load plug-ins (see figure 1, level 3). A plug-in is a software component implementing a specific service. By loading plug-ins that implement services, it is possible to complement the set of native services of a computational resource in such a way that all the computational resources enrolled in a DVM present a consistently homogeneous service baseline to applications (see figure 1, level 4, Baseline). Users may reconfigure the DVM at any time (see figure 1, level 4), both in terms of the computational resources enrolled, by having them join or leave the DVM, and in terms of the services available, by loading and unloading plug-ins.

The main goal of the Harness metacomputing framework is to support the ability to enroll heterogeneous computational resources into a DVM and make them capable of delivering a consistent service baseline to users. This goal requires the programs comprising the framework to be as portable as possible over as large a selection of systems as possible. The availability of services to heterogeneous computational resources derives from two different properties of the framework: the portability of plug-ins and the presence of multiple searchable plug-in repositories. Harness implements these properties primarily by leveraging two different features of Java technology: the capability to layer a homogeneous architecture such as the Java Virtual Machine (JVM) [14] over a large set of heterogeneous computational resources, and the capability to customize the mechanism provided to load and link new objects and libraries. However, the adoption of the Java language as the development platform for the Harness metacomputing framework has given us several other advantages:
- The Harness framework is realized as a collection of cooperating objects with consistent boundaries (Java classes) that present a coherent and stable object-oriented development environment to users.
- A clear and consistent boundary is defined for plug-ins; in fact, each plug-in is required to appear to the system as a Java class.
- All the entities in the framework could be implemented using a robust multi-threaded architecture.
- Users can develop additional services both in a passive, library-like flavor and in an active, thread-enabled flavor.
- A simple object-oriented mechanism is available to request services from remote computational resources (Java Remote Method Invocation [21]).
- A generic methodology is available to transfer data over the network in a consistent format (Java Object Serialization [20]).
- Users are provided the definition of interfaces to be implemented by plug-ins implementing the basic services.
- The trade-off between portability and efficiency for the different components of the framework could be tuned as appropriate.

However, the adoption of Java does not impose any major constraints on Harness applications. In fact, the only strict requirement is to encapsulate services into a Java class; the adoption of advanced Java technologies such as object serialization and RMI in user-developed plug-ins is completely optional. The capability to trade portability for efficiency by using native code inside plug-ins is extremely important; in fact, although portability in general is needed in all components of the framework, it is possible to distinguish three categories of components that require different levels of portability.

The first category is represented by the components implementing DVM status management and load/unload services. We call these components kernel level services. These services require the highest achievable degree of portability, as they are needed to enroll a computational resource into a DVM. The second category is represented by very commonly used services (e.g. a general, network-independent message passing service, or a generic event notification mechanism). We call these services basic services. Basic services should be generally available, but it is conceivable for some computational resources based on specialized architectures to lack them. The last category is represented by highly architecture-specific services. These include all those services that are inherently dependent on the specific characteristics of a computational resource (e.g. a low-level image processing service exploiting a SIMD co-processor, a message passing service exploiting a specific network interface, or any service that needs architecture-dependent optimization). We call these services specialized services. For this last category portability is a goal to strive for, but it is acceptable if they are available only on small subsets of the available computational resources.

These different degrees of required portability and efficiency over heterogeneous computational resources can optimally leverage the capability to link together Java byte code and system-dependent native code enabled by the Java Native Interface (JNI) [13]. The JNI allows development of those parts of the framework that are most critical to efficient application execution in a low-level language, and also permits the desired level of architecture-dependent optimization to be introduced into them at the cost of increased development effort.
The use of native code requires a different implementation of a service for each type of heterogeneous computational resource that needs to deliver that service. This fact implies a replicated development effort for each plug-in incorporating native code. However, if a version of the plug-in for a specific architecture is available, the Harness metacomputing framework is able to fetch and load it in a user-transparent fashion; thus users are screened from the necessity to track the set of architectures their application is currently running on. To achieve this transparency, Harness leverages the capability of the JVM to let users redefine the mechanism used to retrieve and load both Java bytecode and native shared libraries. In fact, each DVM in the framework is able to search a set of plug-in repositories for the desired library. This set of repositories is dynamically reconfigurable at run-time and users can add new repositories at any time. For further details about the features and system architecture of the Harness framework we direct the interested reader to [16, 15].
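To make the plug-in notion concrete, the following minimal sketch shows what a service packaged as a Java class might look like. The actual Harness plug-in API is not reproduced in this paper, so the interface name and lifecycle methods below are assumptions; the only stated requirement is that a plug-in appears to the system as a Java class.

    // Illustrative only: "HarnessService", start() and stop() are assumed
    // names, not the actual Harness plug-in API. The one stated requirement
    // is that a plug-in presents itself to the system as a Java class.
    public interface HarnessService {
        void start();   // invoked when the plug-in is loaded into a resource
        void stop();    // invoked when the plug-in is unloaded
    }

    // A passive, library-like service flavor: it exposes a method but runs
    // no threads of its own. An active flavor would spawn a thread in start().
    class EchoService implements HarnessService {
        public void start() { /* register the service locally (assumed) */ }
        public void stop()  { /* unregister (assumed) */ }
        public String echo(String msg) { return msg; }
    }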
3 The JavaSpace for Harness Implementation

The availability of plug-ins is critical for the attractiveness of the Harness system: the more plug-ins are available, the easier the task of the application programmer will be. Since the Linda [4] model is a well-known coordination model for distributed applications, its Java incarnation, called JavaSpaces, was implemented as a Harness plug-in.
3.1 The JavaSpaces Model and the JavaSpace Interface
Sun Microsystems, Inc. defined the JavaSpaces model and its interface in [22]. The JavaSpaces model is part of the Java distributed computing effort and plays a central role in the Jini connection technology. JavaSpaces technology is a simple unified mechanism for dynamic communication, coordination, and sharing of objects between Java technology-based network resources such as clients and servers. JavaSpaces technology is very similar to the Linda [4] model. Instead of holding tuples, a JavaSpace holds entries, which are instances of classes implementing the Entry interface with some additional restrictions (see footnote 1). Templates are used to match entries in the space. The main differences between Linda and a JavaSpace are (see [22] for further details):
- JavaSpaces entries are typed, and the type is used together with the field values for matching.
- JavaSpaces entries may have methods associated with them.
- The typing of entries allows matching of subtypes; exact type equality is not required if the entry is a subtype of the template.
- The JavaSpaces technology supports transactions, which can span multiple spaces.
- Entries in a space are leased, helping the garbage collector.
- The JavaSpaces model has no equivalent to Linda's "eval".
- The JavaSpaces model allows clients to be notified through remote events when a matching entry is written to the space.
Footnote 1: Each Entry class must provide a public no-arg constructor. Only public fields of entries are considered. Entries may not have fields of primitive type (int, boolean, etc.), although the objects they refer to may have primitive and non-public fields.
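To illustrate these restrictions, the sketch below shows a well-formed entry class modeled on the matrix elements used in the experiments of section 4; the class name and fields are our own illustration, not code from the implementation.

    import net.jini.core.entry.Entry;

    // A minimal well-formed entry: public no-arg constructor, public fields,
    // and wrapper types (Integer, Double) instead of primitives (int, double).
    public class MatrixElement implements Entry {
        public String id;      // identifier
        public Integer row;    // row index (Integer, not int)
        public Integer col;    // column index
        public Double value;   // element value

        public MatrixElement() {}  // required public no-arg constructor

        public MatrixElement(String id, Integer row, Integer col, Double value) {
            this.id = id; this.row = row; this.col = col; this.value = value;
        }
    }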
The JavaSpaces model uses a different nomenclature than Linda:

    Linda       JavaSpaces
    --------    ----------
    tuple       entry
    template    template
    actual      value
    formal      wild card
    out         write
    in          take
    rd          read
The operations allowed on a JavaSpace are defined by the JavaSpace interface, shown for brevity without the exceptions thrown by the different methods:

    public interface JavaSpace {
        Lease write(Entry entry, Transaction txn, long lease);
        Entry read(Entry tmpl, Transaction txn, long timeout);
        Entry readIfExists(Entry tmpl, Transaction txn, long timeout);
        Entry take(Entry tmpl, Transaction txn, long timeout);
        Entry takeIfExists(Entry tmpl, Transaction txn, long timeout);
        EventRegistration notify(Entry tmpl, Transaction txn,
                                 RemoteEventListener listener,
                                 long lease, MarshalledObject handback);
        Entry snapshot(Entry e);
    }
The write() method writes the given entry into the space. The write operation will only succeed if all other operations in the same transaction txn succeed. The lease parameter specifies the initial lifetime of the entry in the space; the lifetime of the entry can be extended through the Lease returned by write(). When the lease expires, the entry is automatically removed from the space and its memory will eventually be reclaimed by the garbage collector.

The read(), readIfExists(), take() and takeIfExists() methods are very similar. They all take a template tmpl that is used for matching entries in the space. Template fields are either filled with values or null: a null field means "match anything", while a field with a value has to match exactly. Matching exactly means that equals() has to hold between the marshaled representation of the template field and that of the corresponding entry field. The txn parameter specifies the transaction under which the method will be executed, and the timeout parameter tells how long the client is willing to wait for a transactionally proper matching entry. The ...IfExists() versions return immediately, with a matching entry or with null if no such entry exists; the other versions block until such an entry is written to the space. The read...() methods simply return a matching entry from the space; the take...() methods also remove it from the space.

The notify() method is used by a client to announce interest in some kind of entry. When entries are written that match the given template tmpl, the space notifies the given listener with a RemoteEvent that includes the given handback object. Matching is done as for read and take.

Finally, the snapshot() method returns a special instance of an entry that can be used everywhere the original entry could. Using this instance is more efficient than repeatedly using the original entry: a JavaSpace implementation usually stores entries internally in a different format, and snapshot() helps avoid unnecessary conversion between the external entry form and the internal format.
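A short usage sketch may help to tie these methods together; it reuses the illustrative MatrixElement entry from section 3.1 and assumes that a JavaSpace proxy has already been obtained (how the proxy is obtained is outside the scope of the sketch).

    import net.jini.core.entry.Entry;
    import net.jini.core.lease.Lease;
    import net.jini.space.JavaSpace;

    // Typical client calls against an already-obtained JavaSpace proxy.
    class SpaceUsage {
        static void demo(JavaSpace space) throws Exception {
            // Write one entry with no transaction and an unbounded lease request.
            space.write(new MatrixElement("m", 0, 0, 3.14), null, Lease.FOREVER);

            // An all-null template matches any MatrixElement (or subtype).
            MatrixElement tmpl = new MatrixElement();
            tmpl.row = 0;  // constrain one field; the others remain wildcards

            // Blocks until a matching entry exists, then removes and returns it.
            Entry taken = space.take(tmpl, null, Long.MAX_VALUE);

            // Non-blocking variant: returns null immediately if nothing matches.
            Entry maybe = space.readIfExists(tmpl, null, JavaSpace.NO_WAIT);
        }
    }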
3.2 Implementation Overview
The implementation is split into four different layers, some of them implemented as Harness plug-ins. The Harness plug-ins allow the user to substitute different implementations for the layers, depending on the application and the available hardware and software. Figure 2 gives an overview of the JavaSpace for Harness implementation. The layers will be discussed top-down and in more detail in the following sections.
[Figure 2: An overview of the JavaSpace for Harness implementation. The leftmost layer (JSpace) implements the JavaSpace interface (write, read, readIfExists, take, takeIfExists, notify, snapshot). The store layer acts as an entry cache (put, process) and holds the request queue and the memory. The interceptor layer (put, lookup, remove, removeAsync, handle) is responsible for a consistent view of the space among different nodes, and the group layer implements the ring-based group communication (sendToAll, sendTo, send, sendReceiveSync, sendReceive, receive). Group communication is based on other plug-ins for low-level communication.]
3.3 Implementation of the JavaSpace Interface and the Store
The JavaSpace interface is implemented by the JSpace class. This class mainly passes writes to an instance of the Store class and converts all the read and take calls into requests which are executed by the store's process() method. The main components of the store are the request queue and the memory. The request queue holds outstanding requests from previous process() calls. The memory holds all entries written to the space which were not previously removed by take() or takeIfExists() calls; the memory therefore represents the state of the space at a given instant.

The store's process() method first tries to satisfy a read or a take request from the memory. If a matching entry is found, it is returned. If no matching entry is found in the memory, null is returned in the ...IfExists cases; in the other cases, the request is added to the request queue and the calling thread is blocked. The put() method first scans the request queue for an outstanding request which may be satisfied by the new entry. In that case, the request is satisfied, removed from the request queue, and the thread waiting on the request is unblocked. If there is an outstanding take request that matches the new entry, it can be satisfied directly and the entry never has to be stored in the memory. In all other cases, the entry is stored in the memory for further use.

The main task of the memory is holding the JavaSpace content and providing template matching. In order to provide near-constant matching time for entries, hashing is used for different features of entries. Figure 3 shows the data structures involved in matching entries in the memory.

[Figure 3: The data structures used by the memory to provide near-constant matching time for entries: the same-or-subclass bitmap, the type lists, the ID-hashed table, and the hashCode()-hashed table.]

The "same or subclass bitmap" provides a fast test of whether two classes are the same class or whether one class is a subclass of the other. This bitmap is updated every time a new (entry-) type is added to the space. If no values are given in a template, matching has to occur based on the type of the entry and the type of the template. For every (entry-) type in the space there exists a list in the "type lists" data structure. Together with the same-or-subclass bitmap, the type lists are searched until a type list is found which is of the same class as, or a subclass of, the template and contains an entry.

Every entry has an ID that unambiguously identifies it. This identifier is used as a hash value in the "ID hashed" table and provides fast lookup of entries based on their identifiers. In the case where values are given for some entry fields, the "hashCode() hashed" table is used. When an entry is written to the space, the hash code of each of its fields is calculated, and the entry is then added, for each of these hash codes, to the corresponding list in the "hashCode() hashed" table. Each "hashCode() hashed" list therefore holds entries with fields having the same hash code, independent of the field position and the type of the entry. Figure 4 illustrates the process of adding an entry to the space. To match a template containing one or more non-null fields, the hash codes of these fields are calculated in the same way as for entries. These hash codes are then used to find the shortest list in the "hashCode() hashed" structure. Finally, the shortest list is searched for an entry matching the template.
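The following sketch, under our own simplifying assumptions, illustrates the shortest-list lookup just described. It tests field values with equals() rather than comparing marshaled representations, and it omits the type-list fallback for all-null templates; the class and method names are ours, not the Harness code.

    import java.util.List;
    import net.jini.core.entry.Entry;

    // Sketch of template lookup over the "hashCode() hashed" table.
    class HashLookupSketch {
        // buckets.get(h) holds every entry having some field whose hash code
        // maps to slot h, regardless of field position or entry type.
        static Entry lookup(Entry tmpl, List<List<Entry>> buckets)
                throws IllegalAccessException {
            List<Entry> shortest = null;
            for (java.lang.reflect.Field f : tmpl.getClass().getFields()) {
                Object v = f.get(tmpl);
                if (v == null) continue;  // null field = wildcard, no constraint
                List<Entry> b = buckets.get(Math.floorMod(v.hashCode(), buckets.size()));
                if (shortest == null || b.size() < shortest.size()) shortest = b;
            }
            if (shortest == null) return null;  // all-null template: type lists are used instead
            for (Entry e : shortest)
                if (matches(tmpl, e)) return e; // scan only the shortest candidate list
            return null;
        }

        // Simplified match: the entry must be of the template's type (or a
        // subtype) and agree on every non-null template field; equals() stands
        // in for the marshaled-form comparison of the JavaSpaces specification.
        static boolean matches(Entry tmpl, Entry e) throws IllegalAccessException {
            if (!tmpl.getClass().isAssignableFrom(e.getClass())) return false;
            for (java.lang.reflect.Field f : tmpl.getClass().getFields()) {
                Object want = f.get(tmpl);
                if (want != null && !want.equals(f.get(e))) return false;
            }
            return true;
        }
    }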
[Figure 4: Writing an entry to the space. First, the type is added to the "same or subclass bitmap". Second, the entry is added to its type list. Third, the entry is added to the "ID hashed" table. Fourth, the entry is added to every list holding entries with fields having the same hash code.]
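A corresponding sketch of the insertion path, again with illustrative names and with the same-or-subclass bitmap step elided, might look as follows.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import net.jini.core.entry.Entry;

    // Sketch of the insertion steps of Figure 4 (field names are ours).
    class MemorySketch {
        final Map<Class<?>, List<Entry>> typeLists = new HashMap<>(); // per-type lists
        final Map<Long, Entry> idTable = new HashMap<>();             // "ID hashed" table
        final List<List<Entry>> buckets = new ArrayList<>();          // "hashCode() hashed"
        long nextId = 0;

        MemorySketch(int size) {
            for (int i = 0; i < size; i++) buckets.add(new ArrayList<>());
        }

        void put(Entry e) throws IllegalAccessException {
            // Step 1 (elided): record e's type in the same-or-subclass bitmap.
            // Step 2: append the entry to the list for its exact type.
            typeLists.computeIfAbsent(e.getClass(), k -> new ArrayList<>()).add(e);
            // Step 3: index the entry under a fresh unique identifier.
            idTable.put(nextId++, e);
            // Step 4: index the entry under the hash code of every non-null
            // public field, independent of field position and entry type.
            for (java.lang.reflect.Field f : e.getClass().getFields()) {
                Object v = f.get(e);
                if (v != null)
                    buckets.get(Math.floorMod(v.hashCode(), buckets.size())).add(e);
            }
        }
    }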
3.4 The Store Interceptor
In order to allow different back ends for this JavaSpace implementation, the store supports propagating specific requests to a store interceptor. The store without a store interceptor is a fully functional implementation of a JavaSpace; in such a setting, the memory provides transient storage for the JavaSpace. Using the JavaSpace in that way, however, limits its access to the threads running in the same address space as the JavaSpace. Although it may be made remotely accessible through RMI, it does not offer more services than Sun's transient "outrigger" implementation. In order to offer more services, such as fault tolerance and persistence, the JavaSpace for Harness implementation supports different store interceptors. The store interceptor is responsible for a consistent view of the space among the different client applications, possibly running in different address spaces on different hosts.
[Figure 5: Two possible implementations of a store interceptor. (A) Application, (JS) JavaSpace for Harness, (GCSI) group communication based store interceptor, (DBSI) database based store interceptor, (DB) database.]

It is up to the store interceptor how to enforce consistency. The example on the left-hand side of figure 5 uses group communication to synchronize access to the replicated store. The interceptor on the right-hand side of figure 5 is a classical client/server implementation with a database as the persistent store. The Harness implementation replicates the space over multiple nodes. Every node runs the same implementation; there is no special master or server node. Every node uses IP multicast to distribute newly added entries to the other nodes. As IP (UDP/IP) multicast is not reliable, each node might not have a complete copy of the space. However, if a query cannot be completed locally, reliable group communication is used to perform a global query, so consistency is guaranteed. Reliable group communication is also used to atomically remove entries from the space.
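Reconstructed from the method names shown in figure 2, a store-interceptor contract might look roughly as follows; the actual Harness interface is not reproduced in the paper, so the signatures are assumptions.

    import net.jini.core.entry.Entry;

    // Assumed shape of the store-interceptor contract, based only on the
    // method names in Figure 2 (put, lookup, remove, removeAsync, handle).
    interface StoreInterceptorSketch {
        void put(Entry entry);        // propagate a newly written entry (e.g. IP multicast)
        Entry lookup(Entry tmpl);     // global query when the local memory has no match
        Entry remove(Entry tmpl);     // globally atomic removal (take semantics)
        void removeAsync(Entry tmpl); // removal that does not wait for a result
        void handle(Object request);  // process a request arriving from a peer node
    }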
3.5 Group Communication
Reliable group communication is used to satisfy the following consistency requirements among the nodes sharing a JavaSpace:

- Removal of entries from the space has to be atomic; that is, an entry may not be removed more than once from the space.
- Requests must be seen by all nodes; that is, requests have to be distributed to all nodes upon arrival.

The current implementation of JavaSpaces in Harness uses a plug-in providing group semantics by means of a ring abstraction. Atomicity of the actions performed is guaranteed by the presence of a token that a node needs to hold in order to initiate an action. The reliability of the ring is guaranteed by the reliability of the links composing the ring together with the underlying Harness system. Each node in the ring is reliably notified by the Harness system of any other node failure. When a failure occurs, the node upstream from the failed node queries the Harness system to get the handle of the node downstream from the crashed node and re-establishes the ring. Any node with a pending message (i.e. a message that has not circulated back to it yet) inserts a dummy message in the ring; ordered forwarding performed by the nodes guarantees that any message that is still pending at the time the dummy message is received was lost on the failed node and has to be re-circulated. The number of requests circulating in the ring is reduced by grouping requests together.

The links of the ring itself are provided by a Harness plug-in delivering reliable point-to-point communication. Currently the plug-in providing point-to-point communication for our implementation of the JavaSpaces adopts generic TCP connections; however, the modularity provided by the opaque interfaces of Harness plug-ins allows replacement if advanced, high-performance communication fabrics (e.g. Myrinet, SCI or VIA) are available. Similarly, the plug-in providing reliable and atomic transport by means of the ring abstraction can be replaced by a different scheme: any plug-in providing the Harness atomic-reliable service interface can be adopted. At the moment, only the ring scheme is available; however, in order to overcome the limitations inherent in the ring scheme, e.g. the large number of messages necessary to perform an atomic operation, a different plug-in based on reliable multicast [23] is currently being evaluated. It should be noted that, due to the Harness opaque-interface based plug-in concept, any change introduced by a new plug-in will not necessitate any changes in the upper layers of the JavaSpace implementation.
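As an illustration of the token discipline described above, the following sketch shows how a node might initiate a globally atomic removal. Only sendToAll() is taken from the ring interface of figure 2; Ring, acquireToken(), releaseToken() and RemoveRequest are our own names, not the actual plug-in code.

    // Illustrative token-ring usage, not the actual Harness plug-in code.
    interface Ring {
        void acquireToken() throws InterruptedException; // block until the token arrives
        void releaseToken();                             // pass the token downstream
        void sendToAll(Object msg);                      // circulate a message around the ring
    }

    class AtomicRemoveSketch {
        static class RemoveRequest {                     // assumed request payload
            final Object template;
            RemoveRequest(Object template) { this.template = template; }
        }

        // A node must hold the circulating token before initiating an action,
        // so at most one removal of a given entry can be in flight at a time.
        static void atomicRemove(Object template, Ring ring) throws InterruptedException {
            ring.acquireToken();
            try {
                ring.sendToAll(new RemoveRequest(template));
            } finally {
                ring.releaseToken();
            }
        }
    }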
4 Performance Results

We focus here on the data transfer behavior of the implementation only, because we are evaluating other group communication protocols at the moment. The goal was to verify the initial assumptions we made about the dynamic behavior (see footnote 2) of the ring-based store interceptor. Obviously, running the implementation without a store interceptor provides the best performance (roughly ten times faster than Sun's transient "outrigger" implementation), but limits its usage to a single address space. Accessing such a space through RMI is still twice as fast as the Sun implementation, but has no other advantages besides that.

As stated earlier, we were interested in the data transfer behavior of the space. Our test program therefore consisted of one node that wrote data to the space and n nodes reading/taking the data from the space. Data structures were not optimized and the Java just-in-time (JIT) compiler was disabled. The space was running on a cluster of Sun Ultra 10 workstations interconnected by 100 MBit Ethernet. Writing to the space is about twice as fast as with the Sun implementation (10521 ms vs. 27568 ms for a 20x20 matrix; see footnote 3), and reading is between four and seven times faster than the Sun implementation; this ratio is likely to increase with the number of nodes due to multicasting (see figure 6).
Footnote 2: Constant-time read() and write() due to multicasting, and a linear increase of removal time (take()) with the number of nodes due to ring-based group communication.
Footnote 3: Each matrix element consisted of an identifier (String), its row (Integer), its column (Integer) and its value (Double).
[Figure 6: Reading a 20x20 matrix from the space. Mean run time for read (ms) versus number of nodes (1 to 8), comparing the Sun implementation and the Harness implementation.]
[Figure 7: Taking a 10x10 matrix from the space. Mean run time (ms) versus number of nodes (1 to 8), comparing the Sun implementation and the Harness implementation.]
In fact, it was observed that relatively few packets are missed during multicast reception and therefore have to be retransmitted through reliable group communication. The picture changes considerably in the case of taking entries from the space (see figure 7). Although we expected the time to increase with the number of nodes, the overhead introduced is not acceptable. We are currently working to reduce the absolute time required for message transmission (by avoiding unnecessary deserialization and serialization at every ring node) on the one hand, and examining other group communication protocols based on multicasting and tree structures on the other.