A High Performance Cluster JVM Presenting a Pure Single System Image

Y. Aridor, M. Factor, A. Teperman
IBM Haifa Research Lab, Matam, Advanced Technology Center, Haifa 31905, Israel
yariv|factor|teperman@il.ibm.com

T. Eilam, A. Schuster
Computer Science Dept., Technion, Israel Institute of Technology, Haifa 32000, Israel
eilam|schuster@cs.technion.ac.il

ABSTRACT

cJVM is a Java Virtual Machine (JVM) which provides a single system image of a traditional JVM while executing in a distributed fashion on the nodes of a cluster. cJVM virtualizes the cluster, transparently distributing the objects and threads of any pure Java application. The aim of cJVM is to obtain improved scalability for Java Server Applications by distributing the application's work among the cluster's computing resources. cJVM's architecture, including its unique object, thread and memory models, was described in [6]. In this article we focus on the optimization techniques employed in cJVM to achieve high scalability. In particular, we focus on the techniques used to enhance locality, thereby reducing the amount of communication generated by cJVM. In addition, we describe how communication overhead can be reduced by taking advantage of Java semantics. Our optimization techniques are based on three principles. First, we employ a large number of mostly simple optimizations which address caching, locality of execution and object migration. Second, we take advantage of Java semantics and of common usage patterns in implementing the optimizations. Third, we use speculative optimizations, taking advantage of the fact that the cJVM run-time environment can correct false speculations. We have demonstrated the usefulness of these techniques on a large (10 Kloc) Java application, achieving 80% efficiency on a four-node cluster. This paper discusses the various techniques used and reports our results.

1. INTRODUCTION

The Cluster JVM (cJVM) is an implementation of a Java Virtual Machine (JVM) that executes in a distributed manner on the nodes of a cluster. Our goals in building cJVM were: 1) demonstrate that a JVM can virtualize a cluster, presenting an image of a single system to any pure Java application and thus extending the "Write Once, Run Anywhere" promise of Java beyond the confines of single systems, and 2) utilize the cluster to obtain significant performance advantages for an interesting class of applications. While the existence of the cluster is not visible to a Java application running on top of cJVM, the implementation of cJVM is cluster-aware. The implementation distributes the objects and threads created by the application among the nodes of the cluster. In addition, when a thread wishes to use a remote object, it is the cJVM implementation that supports this remote access in a manner that is 100% transparent to the application. In [6], we presented the architecture of cJVM which allows us to provide a pure single system image. Section 2 recaps the relevant aspects of this architecture.

The benefit we want to gain by virtualizing a cluster is high performance for Java Server Applications, which are concurrent daemons implemented in Java. When we started this effort, we hypothesized that there were several conditions for achieving scalability in an object-oriented system that transparently uses the cluster. These conditions include: 1) use a large combination of (mostly) simple optimizations which address caching, locality of execution and object migration; 2) take into account the semantics of the language and common usage patterns in defining and implementing the optimizations; and 3) build upon speculative analyses, taking advantage of the fact that we implement the run-time system (i.e., cJVM) and can therefore correct any errors.

The emphasis in this paper is on how cJVM demonstrates these hypotheses, focusing on reducing the amount of communication. We also briefly address how to reduce the cost of an individual remote interaction. As stated above, the main mechanism we use to reduce the communication volume is a collection of (mostly) speculative optimizations which leverage the semantics of Java, namely 1) caching at the level of the class, object and field, 2) locality of method execution and 3) object placement (including migration). We attempt to match each type of data element with its own optimization: one that will answer its expected access pattern in Java Server Applications. There are multiple advantages to this approach. First, by tailoring optimizations to the expected behavior of each data type, we can design a policy which will achieve maximal locality of reference to instances of this type. Second, for several instances of the same type (e.g., the same field in two different objects which implement the same class), learning the memory access pattern exhibited by one of them gives a strong indication of the future use of the others. Third, by using speculative analyses, we can take into account the actual execution sequence of a program and not just the static call graph. Fourth, by taking into account the semantics and usage patterns of Java, we can apply our speculative optimizations in those areas most likely to provide a payoff. And finally, by using a large combination of optimizations, we become less sensitive to the effectiveness of a particular optimization in a particular context.

The next section presents the base architecture of cJVM. The performance optimizations are introduced in the following sections (3-6). In Section 7, we present performance results for a large benchmark on a four-node cluster, including a breakdown of the impact of the optimizations. In Section 8, we relate cJVM to other work, and we present our conclusions and future work in the final section.

2. CJVM ARCHITECTURE

This section describes the base architecture of cJVM, which enables cJVM to present a Single System Image to any Java application. In the following sections we extend this base architecture with optimizations which enable us to achieve high performance. Figure 1 shows our basic approach. The upper half shows the threads and objects of a Java application as seen by the programmer. This is a traditional JVM. The lower half shows the objects and threads of the Java application distributed transparently (to the Java application) across the nodes of the cluster by the implementation of cJVM. To take advantage of the cluster for scalability, thread objects are placed on the nodes according to a pluggable load balancing function. Other objects are co-located with the threads creating them (see Section 6.1).

To support distributed access to objects, cJVM uses the master-proxy paradigm. The master node for an object is the node where the thread creating it resides. A proxy is a surrogate for a remote master object through which that object can be accessed. In cJVM, all objects can have masters and proxies, including regular instances, arrays, threads and classes. While a proxy is a fundamental concept used in systems supporting location-transparent access to remote objects [1, 2], cJVM pushes the idea one step further. Smart proxies are a novel mechanism allowing multiple proxy implementations for a given class, while the most efficient implementation is determined on a per object instance basis.

Figure 1: Cluster JVM

Two key challenges of cJVM towards a single system image (SSI) are 1) giving the application the illusion that it is using a single monolithic heap and 2) hiding the distinction between master and proxy from the application. The first challenge is met by a new memory model based on a distributed heap. Objects are allocated in the local Java heaps of the cluster nodes, and are referenced locally by regular Java references. Upon passing an object as an argument to a remote operation, it is assigned a unique global identifier used to locate the corresponding proxy at the target node. cJVM's object model addresses the second challenge by allowing multiple implementations (e.g., master and proxy) of a single method to coexist in a single class, selecting for each instance object the precise code to execute for a given method.

When a thread accesses a proxy, our basic approach is method shipping. The proxy, in a manner transparent to the application, redirects the flow of execution to the node where the object's master copy is located. This basic approach is enhanced with class and object specific caching and replication policies, as described later in this paper. Since cJVM's memory model is based on a distributed heap, each of the bytecodes that access the heap (e.g., putfield, getfield, etc.) is modified to determine whether the data it is accessing is located at the node where the bytecode is executed or at another node. In the latter case, a remote bytecode access is applied. One can treat this as a field-level distributed memory. Therefore, it is not necessary to ship a method invoked on a proxy to the master node to obtain correct behavior. The code can be executed locally, and each access to the fields of the object which is the target of the method invocation will result in a remote bytecode. Thus, cJVM's remote method shipping can be viewed as an optimization, replacing (possibly) many remote bytecode accesses with one remote invocation. However, there are two kinds of methods that must always be executed at master nodes: synchronized methods and native methods. Locks are always obtained at the master, thus we always execute synchronized methods at the master. Native methods may use native state which is not visible to cJVM and which cannot be made available at a non-master node.
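To make the master-proxy and method-shipping discussion concrete, the following plain-Java sketch shows the shape of the mechanism. All names (Account, NodeId, Cluster.invokeRemote) are hypothetical; cJVM implements this inside the virtual machine, below the bytecode level, rather than as library code.

    // Illustrative only: one logical object, two implementations.  The master holds
    // the real state; a proxy ships method invocations to the node holding the master.
    interface Account {
        void deposit(int amount);
        int balance();
    }

    final class AccountMaster implements Account {
        private int balance;                      // state exists only at the master node
        public synchronized void deposit(int amount) { balance += amount; }
        public synchronized int balance() { return balance; }
    }

    final class AccountProxy implements Account {
        private final long globalId;              // assigned when the object first crosses nodes
        private final NodeId masterNode;

        AccountProxy(long globalId, NodeId masterNode) {
            this.globalId = globalId;
            this.masterNode = masterNode;
        }

        // Method shipping: redirect the flow of execution to the master's node.
        public void deposit(int amount) {
            Cluster.invokeRemote(masterNode, globalId, "deposit", amount);
        }
        public int balance() {
            return (Integer) Cluster.invokeRemote(masterNode, globalId, "balance");
        }
    }

    final class NodeId {
        final int id;
        NodeId(int id) { this.id = id; }
    }

    final class Cluster {
        // Stand-in for cJVM's internal messaging layer.
        static Object invokeRemote(NodeId node, long globalId, String method, Object... args) {
            throw new UnsupportedOperationException("illustrative stub");
        }
    }

In these terms, a smart proxy simply means that more than one proxy implementation (for example, a caching variant) can stand behind the same interface, with the choice made per object instance.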

3. COMMUNICATION

To achieve high performance in a distributed system like cJVM, two aspects must be addressed: reducing the cost of a remote interaction and reducing the number of remote interactions. The bulk of this paper focuses on the second point. In this section, however, we briefly describe how we reduce communication and processing overhead by taking advantage of Java semantics.

As mentioned above, the implementation of the heap-accessing bytecodes was modified to be cluster-aware. When a bytecode's implementation needs to perform a remote access, we use the following basic mechanism. The thread executing the bytecode sends a message to the master node. At the master node a high-priority dispatcher thread receives the message, wakes up a service thread, hands it the message, and resumes waiting for the next message. The service thread performs the bytecode and returns the result to the waiting thread at the calling node. Since in the general case the operation executed by the service thread may require class loading, monitor locking and/or stack traversal, the requesting node must include extra information in the message, which increases the message size and latency. The context switch from the dispatcher to the service thread is mandated; otherwise the response to incoming messages might become unacceptably slow or even lead to deadlock.

In particular, we observe that remote operations that only involve primitive types (other than the target master object) or that only read cannot lead to class loading, monitor locking, or stack traversal at the node where the target master object is located. More specifically, these operations are non-blocking with respect to the JVM implementation. For these remote bytecodes, we can 1) optimize the message passed to describe the remote operation and 2) optimize the server-side processing of this message. In particular, we can avoid passing all information about the requesting thread, and we can handle these messages in the dispatcher thread. This reduces the amount of data transferred and the number of context switches, leading to a significant reduction in the latency seen by a client of a remote bytecode. The numbers we measured for a getfield of a primitive type show that these optimizations lead to a more than 30% improvement, with the latency of the remote bytecode implementation decreasing from 145µs to 90µs.
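The division of labor between the dispatcher and the service threads, including the fast path for non-blocking requests, can be sketched roughly as follows. The Request type and the queue standing in for the message layer are assumptions made for the example; they are not cJVM's actual interfaces.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;

    // Sketch of server-side handling of remote bytecodes (illustrative types only).
    final class RemoteBytecodeDispatcher implements Runnable {

        interface Request {
            boolean isNonBlocking();   // primitive-only or read-only: no class loading,
                                       // no monitor locking, no stack traversal needed
            void execute();            // perform the bytecode and send the reply
        }

        private final BlockingQueue<Request> incoming;   // stands in for the message layer
        private final ExecutorService serviceThreads;    // pool of service threads

        RemoteBytecodeDispatcher(BlockingQueue<Request> incoming, ExecutorService serviceThreads) {
            this.incoming = incoming;
            this.serviceThreads = serviceThreads;
        }

        @Override
        public void run() {
            try {
                while (true) {
                    Request req = incoming.take();
                    if (req.isNonBlocking()) {
                        req.execute();                        // fast path: no context switch
                    } else {
                        serviceThreads.submit(req::execute);  // slow path: may block
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();           // shut down the dispatcher
            }
        }
    }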

4. CACHING

cJVM supports caching at multiple levels of granularity: classes, objects and individual fields. Since in an environment such as Java new classes can be created and loaded on the fly, it is often impossible to prove that a datum is immutable. We thus often make speculative decisions that a particular datum will not be mutated. To handle the case where our speculation is incorrect, we augment our caching optimizations with invalidation schemes. We currently do not use the alternative of update-based protocols because of 1) their relatively complicated protocols and 2) the repetitive nature of our target Java Server Applications, which implies that an access that conflicts with a caching optimization is likely to repeat itself.


Class level caching is applied by caching static variables and accessing them locally. While static variables can be modified, in common usage, even if they are not declared final, they are set at most once (i.e., during class initialization) and read many times. In cJVM, a node caches a static field the first time it makes a remote access for that field. When the master for a class receives a remote request to retrieve the value of one of the class's fields, it records the fact that the field is being cached and the node doing the caching. Subsequent getstatic operations on this field will be executed locally. If a cached static variable is updated, all its replicas are invalidated, after which all accesses are directed to the node containing the master class object.

At the level of objects, we cache ReadOnly objects, which are objects whose fields are proved to be immutable after the object is instantiated. Thus, there is no need for either an update or an invalidation protocol. When a read-only object is passed to a remote node, cJVM creates a special proxy for it, namely, a ReadOnly proxy, through which all field access operations are applied locally.

A more interesting example of object level caching is arrays referenced from static final variables. This optimization is a clear example of leveraging the usage pattern of the language. The semantics of a final field referencing a Java array is that the reference to the array never changes, but the array elements can be modified. However, in practice, it is uncommon for the elements of such arrays to be modified. As with static fields, we use an invalidation-based protocol for those cases where there is a write to one of the array elements.

The rest of this section is dedicated to a novel caching technique: speculating that specific individual fields will not be mutated after initialization, and caching them in all instances of a given class.
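A minimal sketch of the invalidation-based protocol for a single cached static field might look as follows; the class name and the Messenger interface are illustrative assumptions, not cJVM code.

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // The master class object remembers which nodes cache the field; a write
    // invalidates every replica, after which getstatic requests return to the master.
    final class StaticFieldMaster {
        private volatile Object value;
        private final Set<Integer> cachingNodes = ConcurrentHashMap.newKeySet();

        // Remote getstatic: return the value and remember the caching node.
        Object remoteGet(int requestingNode) {
            cachingNodes.add(requestingNode);
            return value;
        }

        // putstatic on a cached field: invalidate all replicas before completing the write.
        synchronized void put(Object newValue, Messenger messenger) {
            for (int node : cachingNodes) {
                messenger.sendInvalidate(node);   // subsequent accesses go back to the master
            }
            cachingNodes.clear();
            value = newValue;
        }

        interface Messenger { void sendInvalidate(int node); }
    }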

4.1 Field Level Caching

Read-only in practice fields are fields which, in a particular run of the program, are not modified after the object containing them has at least one proxy. This definition is much less restrictive than a code-based mutability analysis, which reports a field as mutable if there is reachable code that mutates the field, even if this code is never executed. In addition, in a language such as Java which allows dynamic loading of code, a static analysis must report the field as mutable if code loaded in the future is able to mutate the field. Finally, by only considering mutations that occur after the object is shared, we support a much broader class of programs than a theoretical mutability analysis could ever support. As long as an object is used only by threads on a single node, the presence of mutations is irrelevant to caching.

We define a field that is read-only in practice as read-locally. When a class is loaded, cJVM speculatively marks all non-static, private fields (except those belonging to read-only objects) as read-locally. A read-locally field which is modified after it is cached is invalidated, losing its read-locally status.

cJVM employs a per-field, per-class approach to read-only-in-practice fields. In other words, caching and invalidation of a specific field is done in all instances of the class, independent of all other fields in instances of the class. This approach is consistent with object-oriented programming methodology, which encourages a programmer to write code that treats all instances of a class identically. Thus, if one of the fields in a certain instance is mutated, it is highly likely that it will also be mutated in other instances; this is especially true for repetitive applications, such as our target Java Server Applications. This approach also requires only minimal storage for the infrastructure to support this optimization; in each class we record which of its fields are currently read-locally.

A read-locally field in a class is invalidated when a putfield for this field is executed on an instance of the class that has a proxy. Note that this holds regardless of whether the target of the putfield is a master or a proxy. The invalidation protocol must guarantee that 1) the field is invalidated in all instances (masters and proxies) of the class and 2) the field is unmarked as read-locally at the master and proxy class objects, so that nodes which later utilize the class will be aware of this invalidation. It must be done in a way which guarantees that the Java memory model is respected.

The invalidation process contains two phases. In the first phase a message is sent to the master class object. The node where the master class object is located then sends an invalidation request to all nodes which have used the class. In order to preserve Java's memory consistency model, the invalidation protocol is embedded in the code of putfield. Namely, 1) a putfield that causes an invalidation is not complete until the field is invalidated in all the nodes and 2) the new value of the field is written in the master instance object as part of the invalidation process. In other words, the invalidation process is applied atomically with respect to other accesses to the same field. Consequently, the invalidation process does not introduce any interleaving of the bytecode instruction streams which cannot occur in a traditional JVM.

Note that fields in read-only objects may be viewed as a special case of read-only in practice fields. However, because it can be proved that they are read-only, cJVM uses a different mechanism to handle them in order to reduce the overhead.
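The per-class bookkeeping described above can be sketched as follows. The names are illustrative; in cJVM this state lives in the master and proxy class objects, and the invalidation is driven from the putfield implementation itself.

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative per-class state for "read-only in practice" (read-locally) fields.
    final class ReadLocallyState {
        // Fields of this class currently speculated to be read-only in practice.
        private final Set<String> readLocally = ConcurrentHashMap.newKeySet();

        void markAllPrivateInstanceFields(Set<String> fieldNames) {
            readLocally.addAll(fieldNames);       // speculative marking at class-load time
        }

        boolean isReadLocally(String field) {
            return readLocally.contains(field);
        }

        // Called from putfield when the target instance has a proxy somewhere.
        // Phase 1: tell the master class object.  Phase 2 (not shown): the master
        // broadcasts the invalidation to every node that has used the class, and the
        // new value is written at the master instance as part of the same protocol,
        // so the putfield stays atomic with respect to other accesses to this field.
        void invalidate(String field, ClassMasterMessenger messenger) {
            if (readLocally.remove(field)) {
                messenger.notifyMasterClassObject(field);
            }
        }

        interface ClassMasterMessenger { void notifyMasterClassObject(String field); }
    }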

5. METHOD SHIPPING

As a general rule, cJVM executes methods on the node holding the master copy of the object which is the target of the method invocation. As mentioned earlier, given that cJVM essentially implements field-level DSM, this can be viewed as a zeroth-level locality optimization. We are leveraging the fact that the encapsulation principle of object-oriented programming implies that we will increase locality (avoiding remote bytecodes) by executing a method targeted at a particular object at the node where the object's master (i.e., its state) is located. Sometimes, however, all of the data needed to execute the method is located at the node where the object's proxy is located. In this case, method shipping adds needless overhead. In cJVM we have three different optimizations to resolve method shipping:

Class methods. Since static fields are cached with class proxies, there is a high chance that a class method which accesses static fields can find all of them locally. Thus, by invoking class methods locally, we gain performance except in very rare cases.

Stateless methods. Methods which work only on the local thread's stack without accessing the heap (e.g., java/lang/Math.min(a, b), which accepts two integer parameters and returns the smaller one) are safely executed on the proxy's node. cJVM uses a simple load-time analysis to detect such methods.

Locally Executable methods. Methods which have no heap accesses other than to read-locally fields (see Section 4.1) can be executed locally, since all of their data is locally cached. However, what makes things interesting is that invalidation of read-locally fields forces the invalidation of all the locally executable methods which access the particular field being invalidated. Details are omitted here for lack of space; see [5].

None of these optimizations apply if the method is synchronized or native, since as described earlier, synchronized and native methods must always be executed at the master node.
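The load-time classification and the resulting per-invocation decision can be summarized by a sketch along these lines (illustrative types; the real decision is folded into cJVM's invoke bytecodes). Note that a method's locally executable status is revoked if one of the read-locally fields it depends on is later invalidated.

    // Where to run a method invoked on a proxy: locally, or shipped to the master node.
    enum ExecutionSite { LOCAL, MASTER }

    final class MethodInfo {
        final boolean isSynchronized;
        final boolean isNative;
        final boolean isStatic;            // class methods: statics are cached with class proxies
        final boolean isStateless;         // touches only the local stack, never the heap
        final boolean isLocallyExecutable; // heap accesses limited to read-locally fields

        MethodInfo(boolean isSynchronized, boolean isNative, boolean isStatic,
                   boolean isStateless, boolean isLocallyExecutable) {
            this.isSynchronized = isSynchronized;
            this.isNative = isNative;
            this.isStatic = isStatic;
            this.isStateless = isStateless;
            this.isLocallyExecutable = isLocallyExecutable;
        }

        ExecutionSite siteWhenTargetIsProxy() {
            // Synchronized and native methods must run where the lock / native state lives.
            if (isSynchronized || isNative) return ExecutionSite.MASTER;
            if (isStatic || isStateless || isLocallyExecutable) return ExecutionSite.LOCAL;
            return ExecutionSite.MASTER;   // default: method shipping
        }
    }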

6. OBJECT PLACEMENT

Another way to improve locality is to place the master copy of an object where it will be used. Currently, cJVM uses two heuristic optimizations that focus on object placement. The first, factory methods, is aimed at creating objects where they will be used, focusing on objects created by factory instance methods. The second, single chance, looks at migrating objects whose usage pattern has one thread creating and initializing the object and exactly one other thread using the object with no overlap between the two threads. Both of these optimizations are speculative attempts to leverage Java semantics and usage patterns.

6.1 Factory Methods

A factory instance method is a method which creates an object which it returns; such methods are associated with a common software design pattern of the same name [11]. With a factory method, we have very strong evidence that the invoker of the method will use the newly created object; after all, it invokes the method precisely to create the object. Thus it makes sense to co-locate an object created by a factory method with the invoker of the method. However, in cJVM, without special handling, invoking a factory method on a proxy factory object creates an object co-located with the factory master object and not where it will be used, as seen in the upper part of Figure 2. More generally, if the factory and singleton design patterns are combined, all threads will share a single factory object, and on all but one of the nodes they will be accessing newly created objects via proxies. Our basic approach to solving this problem is shown in the bottom part of Figure 2: we always execute factory methods where they are invoked, thus naturally co-locating the created object with its invoker.
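As a concrete, hypothetical illustration of the pattern being targeted, consider a singleton factory. Under plain method shipping, every newOrder() call would execute at the node holding the factory's master, and each new Order would then be accessed remotely by its caller; executing the factory method locally instead co-locates each Order with its user.

    // Illustrative factory + singleton combination (synchronization omitted for brevity).
    final class Order {
        private final int id;
        Order(int id) { this.id = id; }
        int id() { return id; }
    }

    final class OrderFactory {
        private static final OrderFactory INSTANCE = new OrderFactory();   // one factory, cluster-wide
        private int nextId;

        private OrderFactory() { }

        static OrderFactory instance() { return INSTANCE; }

        // Factory method: it returns an object it creates.  cJVM's heuristic identifies
        // it at class-load time and always executes it at the invoking node, so the new
        // Order is created where it will actually be used.
        Order newOrder() { return new Order(nextId++); }
    }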

Figure 2: Factory Methods: unoptimized case (upper part) versus the optimized case (lower part).

Figure 3: Usage Scenario for Single-Chance Optimization.

We define an instance method to be a factory method if it either 1) returns an object it creates or 2) returns an object it received as the result of a call to a factory method. The support for this optimization includes 1) identifying factory methods and 2) placing the master copy of an object allocated by a factory method on the node where that factory method was invoked.

cJVM uses a simple heuristic to identify factory methods. This heuristic, executed when the code for the class is loaded, performs a simple, flow-insensitive, non-conservative analysis of the bytecodes. Specifically, we track the JVM stack variable which contains the value returned by a method. If all stores into this variable contain either the result of executing a bytecode to allocate an object (one of new, newarray or anewarray) or the return value of a call to a factory method, we identify this as a factory method. Note that we can be insensitive to the control and data flow since the specification of valid bytecodes ensures that the type of a variable at any point in time is well-defined.

Appendix B shows pseudo code for our analysis algorithm. There are several places where it can be improved. For instance, the code may be storing a reference to an object it previously created, even if the store instruction is not immediately preceded by an invocation of a factory method or an object creation. As we report in Section 7, this simple heuristic, which is inexpensive at run-time (linear in the size of the code) and inexpensive to implement, is sufficient to identify several interesting factory methods and to improve the placement of a large number of objects. An alternative means of identifying factory methods could be based upon escape analysis, e.g., [9, 7, 8, 19]. Escape analysis is a conservative, static analysis technique which determines which objects are not shared between threads. One way to determine factory methods would be to extend escape analysis to determine the method calls which return objects (created directly or indirectly) which never escape. Using escape analysis would allow a finer degree of distinction, allowing us to determine whether a method, as used at a particular call site, is a factory method. On the other hand, escape analysis is a stronger analysis than we need. In our definition of factory methods we do not care if another thread may in the future use the object; we are trying to optimize for the almost certain usage that the caller of the factory method will make.

After identifying factory methods, we need to ensure that the master copy of an object created by a factory method is co-located with the caller of the factory method. In cJVM we achieve this by always invoking a factory method locally. (As noted earlier, local execution is always correct.) An alternative would be to remotely invoke the factory method and to migrate the created object back to the invoking node. The drawbacks of this alternative are that not all objects are migrateable and that object migration is relatively expensive in a multithreaded environment.

6.2 Single Chance Migration

Single chance migration is a heuristic attempt to support a usage pattern in which an object goes through two non-overlapping phases, where in each phase only a single thread uses the object. One concrete example of such a usage pattern is one thread which performs setup for the other threads; these other threads only begin executing after the setup is completed. One key point to note about this usage pattern is that there is no sharing of the object: at any point in time, no more than one thread is actively using the object. We show this pattern graphically in Figure 3. Single chance migration is a speculative optimization which guesses that an object fits the design pattern and migrates the object to the node where the object will be used in the second phase of its life. If we determine that objects of a particular class do not fit this pattern, e.g., we detect that an object is concurrently accessed by two nodes or that it is used by more than two nodes during its life, we invalidate this optimization for the class.

The single chance migration optimization involves the following elements: 1) statically identify classes whose instances may be candidates for this optimization; 2) dynamically prune this list of classes by detecting objects which are shared by threads on different nodes; 3) detect remote service requests for an object of a class whose instances are eligible for this optimization; 4) migrate the object from the node where it was created to the node where the thread that is using the object is executing.

We only want to migrate objects whose code is relatively encapsulated, i.e., not too dependent upon other objects which may be left behind on the node where the object was created. When the code is loaded, we perform a simple analysis to determine if a class's code is relatively encapsulated. This is done as follows. We focus on arrays, since using an array often involves a large number of memory accesses. An array is said to be encapsulated if it is accessed only by the code of its containing instance. Now, if a class contains a method which uses a non-encapsulated array, the class is not relatively encapsulated, and its instances are thus marked non-migrateable. We also do not migrate objects with application-defined native methods, because we have no way of migrating (or even detecting) any native state that may be used by these methods. Finally, we do not migrate objects which are cacheable.

Objects belonging to classes which have not been marked as ineligible are candidates for migration. If an object is shared, i.e., if an attempt is made to perform a remote operation on an object that has been migrated, we stop migration of all instances of the object's class by marking the class as ineligible. In this way, we zero in on a list which accurately reflects the usage of objects by the program. Our migration works on demand. The process starts when a node receives an object it has never seen in response to a remote request. The node then determines whether the object's class is eligible for migration. If so, it sends a message to the master node for the object, requesting that the object be migrated.

The simple part of migrating an object is transferring the object's state. The hard part is handling the race that can occur if the object is (contrary to our speculation) shared and the state of the object is modified on the source node while the state is being copied. This is hard since we do not want to use a lock to synchronize between all writes to the state of an object (i.e., all putfield bytecodes) and the copying of an object's state for purposes of migration. Our solution uses a two-phase approach which combines inexpensive detection of a race with a heavyweight fix-up algorithm. The detection consists of a check in the implementation of putfield, which verifies that the local copy remains the master throughout the actual modification. Only in very rare cases will this check imply that the master has moved, in which case the fix-up protocol will check whether the modification was actually performed and will otherwise perform it again remotely. Details are omitted here for lack of space.
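The inexpensive race check can be pictured roughly as follows. This is only a sketch under the assumptions stated in the comments; the paper does not describe the actual fix-up protocol, and all names here are illustrative.

    // Sketch of the race check performed by putfield during single-chance migration.
    final class MigratableObject {
        private volatile boolean isMaster = true;   // cleared when the object migrates away
        private final Object[] fields;

        MigratableObject(int fieldCount) { this.fields = new Object[fieldCount]; }

        void putfield(int index, Object value, FixUpProtocol fixUp) {
            // Fast path assumption: we are writing to the master copy.
            fields[index] = value;
            if (!isMaster) {
                // Rare race: the object migrated while the write was in flight.  The
                // heavyweight fix-up protocol checks whether the new master saw this
                // write and, if not, re-applies it remotely.
                fixUp.reapplyRemotely(this, index, value);
            }
        }

        void markMigratedAway() { isMaster = false; }   // called once the state has been copied out

        interface FixUpProtocol {
            void reapplyRemotely(MigratableObject obj, int index, Object value);
        }
    }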

7. RESULTS

In this section we show how the set of optimizations we presented achieves good performance when running an unmodified version of the portable Business Object Benchmark (pBOB). IBM pBOB is a kernel of business logic inspired by the TPC-C benchmark [3]. In accordance with the TPC's fair use policy, we note that pBOB deviates from the TPC-C specification and is not comparable to any official TPC result. We chose pBOB as it is a large (10 Kloc), self-contained, pure Java application which depends only on the Java Core APIs.

When it starts, pBOB's main thread creates and initializes N warehouse composite objects. A warehouse object consists of objects representing customers, stock items, orders, etc. After the warehouses are initialized, the main thread spawns M threads per warehouse for a total of M*N application threads. Both N and M are run-time configuration parameters. After initialization, the application enters a recording phase in which all threads concurrently execute "transactions" against their respective warehouses. Each transaction involves modifications, insertions or deletions of a warehouse's objects. The application runs for a fixed time after which it terminates, reporting its throughput as the total number of transactions per minute (TPM). The TPM value indicates the application's performance; the higher the TPM value, the higher the performance.

The key factor in gaining scalability of pBOB with cJVM is maintaining high locality between a thread and its corresponding warehouse composite objects. We want the warehouses (and their component objects) cached with the threads working on the warehouses, as indicated by the gray hexagons in Figure 4. This reduces the quantity of remote messages and the bottleneck that would occur if all warehouse objects were accessed remotely on a single node.

Figure 4: pBOB Architecture: warehouse objects are denoted by light hexagons and their cached copies are denoted by gray hexagons.

Figure 5 shows the TPMs and efficiency of pBOB run with four warehouses and one thread per warehouse on a four-node cluster. As seen, we achieved an efficiency of 80% (a speedup of 3.2); the results are compared to an execution using the Sun JDK 1.2 reference implementation on identical hardware. In both cases we use the interpreter version and did not take advantage of a JIT. In addition to the optimizations described here, these results use a single configuration parameter for the placement of a single class's instances.

Figure 5: Efficiency and Performance of pBOB configured with four warehouses, one thread per warehouse, on a four-node cluster.

Figure 6 accompanies the speedup results. It shows the cumulative impact of the optimizations we discuss in this paper on the quantity of remote operations. The vertical axis represents the number of remote operations involved in a single average transaction applied by a thread which is not co-located with its warehouse object. The horizontal axis represents accumulated optimizations; the impact of every optimization is added to the impact of the preceding ones. The "non-optimized" column reflects cJVM as described in Section 2. It includes caching of java/lang/String objects, which are implemented differently than other read-only objects since they are interned; it does not include caching of other read-only objects, which is shown in the second column. Optimizations which are not discussed in this paper are presented as "other optimizations" and are mostly related to providing distributed implementations of specific Java classes without changing their interfaces. For technical reasons, the column labeled "read-only in practice" includes the optimization for caching of these fields, for locally executing stateless methods and for locally executing methods depending only upon these fields. As shown, these optimizations, excluding the configuration parameter, result in a reduction of almost 75% of the total amount of remote messages relative to a non-optimized cJVM version.

The impact of an optimization is not absolute. It is likely that with a different order (or for a different application), the impact of some of the optimizations would be different than presented in Figure 6. For example, an application can benefit from local invocation of static methods if static fields are already cached across the cluster nodes, as in Figure 6. Otherwise, local invocations of static methods might increase the quantity of remote operations due to access of remote static fields.

In many cases multiple optimizations can lead to the same effect. For example, factory methods which are locally executable could be invoked locally due to the read-only-in-practice optimization. Thus, the potential impact of a later optimization is higher than what is shown in Figure 6. As mentioned, we use a single configuration parameter which enables us to enhance locality for a single class of objects which we are currently not able to migrate. These objects are created when pBOB is building the warehouses (i.e., before there are multiple threads). In the future, we expect to apply our migration mechanism to a broader range of objects based upon run-time profiling. Interestingly, when we use the non-optimized version of cJVM, the configuration parameter has no impact on the performance. In other words, the configuration parameter is only effective in the context of the optimizations we presented. Finally, the invalidation penalty can be measured by the number of invalidation messages. Our experience shows that this overhead is negligible, consisting of no more than a few tens of messages for a run of pBOB using all of the optimizations.

Figure 6: Impact of Optimizations on Number of Remote Messages.

8. RELATED WORK

From a programmer's point of view, tools and infrastructure which support Java applications on a cluster range from completely explicit solutions to implicit solutions similar to cJVM. Explicit approaches [2, 1, 10] assume an architecture of multiple JVMs while handling remote objects and threads at the level of the Java language and external frameworks. Most of these frameworks have very little relevance to cJVM; to various degrees, they do not support transparent creation and access of remote class objects. For lack of space, we only list here some of the projects which seem most related to the concept of cJVM.

Hicks et al. [13] describe an extension to Java with specialized remote operations provided as (optional) hooks to the programmer. References to remote objects are handled implicitly. No reduction of consistency or any form of smart caching is applied, as instance methods always execute on the node where the object was allocated. A notable system which focuses on transparent support of remote objects is JavaParty [17]. Still, in contrast to cJVM, programmers must take care to distinguish between remote and local method invocation, as the argument passing conventions of the two differ. Moreover, while JavaParty monitors objects' interactions during run-time and schedules object migration to enhance locality, cJVM mainly relies on its smart caching policies.

Java/DSM [20] takes the implicit approach at the level of infrastructure. It is a modified JVM whose heap is implemented in a distributed shared memory. A related system is Hyperion [15], which is an implementation of a JVM on top of an object-based distributed shared memory. Hyperion uses an object shipping model, in which a copy of a remote object is brought to the accessing node. The accessing node uses this local cached copy, which is written back to the server at synchronization points. This weakening of the memory model consistency is not fully compliant with the Java memory model; see [14, Chapter 17] and [12, 18].

JESSICA [16] is probably the system that has the most in common with cJVM. Like cJVM, it is a JVM implementation which virtualizes a cluster, presenting a single system image to Java applications. JESSICA is built on top of a DSM and, unlike cJVM, focuses on thread migration for purposes of load balancing. In terms of architecture, JESSICA is closer to a JVM on top of a cluster-enabled infrastructure than a cluster-aware JVM. While JESSICA leverages the semantics of the language to achieve a single system image, unlike cJVM, performance in JESSICA is obtained from a low-level, usage-neutral infrastructure and not by using a large number of speculative optimizations that leverage Java's semantics and usage.

9. SUMMARY AND FUTURE RESEARCH

cJVM is a cluster-aware Java virtual machine which aims to 1) virtualize a cluster, presenting a single system image for pure Java applications and 2) obtain high performance for Java Server Applications which are concurrent daemons. This paper describes how we have met the second goal of high performance. We presented a large set of optimizations addressing caching, locality of execution and object placement. These are, for the most part, speculative optimizations which take into account common object usage patterns extracted from analysis of bytecodes and knowledge of Java semantics. We showed their effect on the performance of a fairly large benchmark, which achieved an efficiency of 80% while running on a four-node cluster.

cJVM's smart proxy mechanism and the infrastructure for maintaining field-level consistency are general frameworks which can be used to implement many locality protocols at different levels of granularity. Indeed, we are in the process of including more optimizations which will be based on memory-access-pattern statistics gathered during run-time, enhanced static/escape analyses, memory model and consistency protocols, etc.

For up to date information regarding cJVM status, see our homepage [4].

10. ACKNOWLEDGEMENTS

We would like to thank Oded Cohn and Hillel Kollodner for their input to cJVM. Special thanks to Alain Azagury, who initiated this research activity, and to Steve Munroe for giving us insight on pBOB.

11. REFERENCES

[1] http://www.javasoft.com/.
[2] http://www.objectspace.com/voyager/.
[3] http://www.tpc.org/cspec.html.
[4] http://www.haifa.il.ibm.com/projects/systech/cjvm.html.
[5] Y. Aridor, T. Eilam, M. Factor, A. Schuster, and A. Teperman. Field Level Caching in a Cluster JVM. In preparation.
[6] Y. Aridor, M. Factor, and A. Teperman. cJVM: a Single System Image of a JVM on a Cluster. In Proceedings of the 1999 International Conference on Parallel Processing, pages 4-11, September 1999.
[7] B. Blanchet. Escape Analysis for Object-Oriented Languages. Application to Java. In Proceedings of the 1999 OOPSLA, pages 20-34. ACM, November 1999.
[8] J. Bogda and U. Hölzle. Removing Unnecessary Synchronization in Java. In Proceedings of the 1999 OOPSLA, pages 35-46. ACM, November 1999.
[9] J. D. Choi, M. Gupta, M. Serrano, V. C. Sreedhar, and S. Midkiff. Escape Analysis for Java. In Proceedings of the 1999 OOPSLA, pages 1-19. ACM, November 1999.
[10] E. Freeman, S. Hupfer, and K. Arnold. JavaSpaces(TM) Principles, Patterns and Practice (The Jini(TM) Technology Series). Addison-Wesley, 1999. Project URL: http://java.sun.com/products/javaspaces.
[11] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns. Addison-Wesley, 1995.
[12] A. Gontmakher and A. Schuster. Java Consistency: Non-Operational Characterizations for Java Memory Behavior. Technical Report CS-0922, Technion - Israel Institute of Technology, November 1997. Online version: http://www.cs.technion.ac.il/~assaf/publications/java.ps.
[13] M. Hicks, S. Jagannathan, R. Kelsey, J. Moore, and C. Ungureanu. Transparent Communication for Distributed Objects in Java. In ACM Java Grande Conference, pages 160-170, June 1999.
[14] B. Joy, J. Gosling, and G. Steele. The Java Language Specification. Addison-Wesley, 1996.
[15] M. MacBeth, K. McGuigan, and P. Hatcher. Executing Java threads in parallel in a distributed-memory environment. In IBM Center for Advanced Studies Conference, November-December 1998.
[16] M. J. M. Ma, C. Wang, F. C. M. Lau, and Z. Xu. JESSICA: Java-Enabled Single-System-Image Computing Architecture. In 1999 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA '99), June-July 1999.
[17] M. Philippsen and M. Zenger. JavaParty: Transparent remote objects in Java. Concurrency: Practice and Experience, 9(11):1225-1242, 1997.
[18] W. Pugh. Fixing the Java memory model. In ACM Java Grande Conference, pages 682-687, June 1999.
[19] J. Whaley and M. Rinard. Compositional Pointer and Escape Analysis for Java Programs. In Proceedings of the 1999 OOPSLA, pages 187-206. ACM, November 1999.
[20] A. Yu and W. Cox. Java/DSM: A platform for heterogeneous computing. In ACM 1997 Workshop on Java for Science and Engineering Computation, June 1997.

APPENDIX
A. PSEUDO CODE FOR CACHING OF READ-ONLY IN PRACTICE FIELDS

At the requester:

    send (pulling_req, obj_id) message to the object's master node;
    wait for a response message (fields, remote_version);
    get_version_vector(obj_id, &local_version);
    while (local_version < remote_version) {
        // wait on an event object until an invalidation occurs
        wait(FIELD_INVALIDATION);
        get_version_vector(obj_id, &local_version);
    }
    // unpack the fields according to the remote version vector
    unpack(obj_id, fields, remote_version);

At the master for the object:

    get_version_vector(obj_id, &version);   // get version from obj_id's class
    pack(&fields, obj_id, version);         // pack the fields of obj_id according to version
    send (fields, version) to the requester node;

B. PSEUDO CODE FOR STATIC ANALYSIS TO IDENTIFY FACTORY METHODS

for each opcode in the method {
    opcode = current opcode being processed;
    next_opcode = next opcode in sequential order;
    if (next_opcode == store to the variable returned by the method) {
        if ((opcode == aconst_null) || (opcode == new) ||
            (opcode == anewarray) || (opcode == newarray)) {
            continue;    // so far still a factory method
        } else if ((opcode == invokespecial) || (opcode == invokevirtual) ||
                   (opcode == invokestatic) || (opcode == invokeinterface)) {
            char *methodname = name of the method being invoked;
            char *signature  = signature of the method being invoked;
            if ((opcode == invokespecial) && (strcmp(methodname, "<init>") == 0)) {
                // calling a constructor.  Assume the common code idiom of
                // duplicating the reference to the newly created object, calling
                // the constructor and then storing the reference to the new object.
                continue;    // so far still a factory method
            } else if (returnTypeIsObject(signature) && isAFactoryMethod(methodname)) {
                continue;    // so far still a factory method
            } else {
                return NOT_A_FACTORY_METHOD;
            }
        } else {
            return NOT_A_FACTORY_METHOD;
        }
    } else {
        continue;
    }
}
return IS_A_FACTORY_METHOD;
