HP-RMI: High Performance Java RMI over FM

Luis Rivera, Lynn Zhang, Geetanjali Sampemane, Sudha Krishnamurthy
flrivera,l-zhang,geta,[email protected]

Abstract

Java Remote Method Invocation (RMI)[7] is a convenient mechanism for making remote method calls in distributed object systems. However, from our experiments with Java RMI, we found that the available TCP/IP implementation is not particularly fast. It is our contention that marshalling and the transport mechanism are the main sources of overhead. We focus here on reducing network delays. Our approach is to trace the network path of an RMI call and provide a more efficient transport layer as an alternative to the default TCP. We selected Illinois Fast Messages (FM)[6] as the alternative transport layer because of its low overhead and its ability to deliver a large part of the underlying network capacity to the application. We implemented an FM socket interface in Java, using the Java Native Interface (JNI)[5], and attempted to give the application a choice between TCP/IP and FM as the transport layer. To measure the performance speedup, we designed a testbed that repeatedly runs a set of RMI calls on a network of two Pentium II machines running Windows NT, connected by Ethernet and Myrinet[9]. The RMI calls sent and received data over the network, and we measured the time taken on the standard TCP/IP implementation and on our FM implementation. We observed a significant improvement in performance (up to 60x for an RMI call sending 1500 bytes).

1 Introduction

This paper describes the work and results of a half-semester course project for CS490 "High-Performance Distributed Object Systems". Our goal in this project was to optimize the network performance of Java RMI. We chose to replace the TCP transport layer with a faster messaging layer, FM. To achieve this, we designed and implemented a Java socket interface to FM. We then integrated this with the Java RMI subsystem. We designed a test suite and measured the speedup obtained. The RMI calls sent and received data over the network, and we measured the time taken on the standard TCP/IP implementation and on our FM implementation. We observed a significant improvement in performance (up to 60x speedup for an RMI call sending a packet of 1500 bytes).

Section 2 describes our project schedule. Section 3 provides the technical background for the project: we briefly cover Java and the JVM, Java RMI, the Java Native Interface (JNI), and Fast Messages (FM). Sections 4 and 5 describe our implementation approach, followed by the test methodology used. In Section 6, we describe our experimental setup in some detail and present the measured results. We analyze the results in Section 7 and explain what we achieved and what the limitations of our experimental setup were, followed by a summary of our goals and conclusions in Section 8. In Section 9, we provide some suggestions for future work in this area, and Section 10 offers a brief perspective on what we learned from the entire project.

2 Project Plan & Execution

The original plan was as shown in Fig. 1. The actual schedule followed is described below:

Oct 20-31:

- Discussed ideas for performance enhancements for RPC systems; decided to work on network improvements to Java RMI; formed teams; started work on the initial project proposal.
- Got familiar with Java, the JDK, and RMI.
- Wrote and submitted a project proposal.

Nov 1-15:

- Set up a working environment: installed Windows NT and software utilities on a dedicated set of machines.
- Compiled the JDK source and then began tracing the execution of RMI calls to find the most feasible way to provide an alternative transport for RMI.
- Tried replacing TCP with UDP to find the places where the network interface is referenced, but were not very successful, since RMI was hard-coded to use TCP connections.
- Installed the FM library, tried the sample programs with the TCP/IP version of FM, and started designing the Java socket interface to FM; experimented with JNI for this.
- Designed and implemented a test suite to measure the time taken for RMI calls on Windows NT.
- Submitted a revised project proposal which included timing measurements for the unmodified Java RMI system.

[Figure 1: Proposed timeline for the project, spanning weeks 1-7 (10/31 through 12/12). Planned activities included understanding RMI, JNI, and FM (with open questions on how stubs and skeletons are generated, where and how the SocketFactory class is invoked, how to create an alternative socket class or replace the transport layer, and how RMI interfaces with the transport layer), a performance analysis based on profiling the current system, lab setup and selection of timing and measurement tools, design of the test suites, implementation of the FM interface using JNI and debugging, performance measurement and result collection, data analysis and comparison with the expected benefits and with a "null" communication layer, and the report write-up and presentation.]

Nov 15-30:

- Implemented the FM socket class with JNI.
- Got the FM sockets to compile and become part of the java.net hierarchy.
- Explored the RMI subsystem source code to find the exact locations for integration.
- Tried out some initial approaches to integrate FM sockets with RMI, unsuccessfully.

Dec 1-15:

- By this time Myrinet was installed, and we ported our FM socket implementation to Myrinet. In theory, this just involved linking the Myrinet version of FM.lib instead of the sockets version, but it took a lot longer.
- Had a basic client-server application running using FM sockets.
- Integrated the FM sockets with the RMI subsystem.
- Implemented a basic client-server program using RMI and Myrinet FM; it works with a few glitches.
- Took measurements with the integrated FM-RMI system as well as with the FM sockets without RMI.
- Wrote up this report.
- Prepared for the presentation.

3 Technical Background

Traditional distributed computing environments generally used remote procedure calls[2, 8] for communication across a network. When optimizing those, designers tried to minimize network access, since the network was the slowest component of the system. In modern distributed systems, the network is no longer the bottleneck. Low-latency gigabit networks are becoming commonly available, and many of the assumptions made in optimizing distributed applications are no longer valid. For example, marshalling and demarshalling are now the bottlenecks in many conventional RPC implementations that were designed when networks were orders of magnitude slower than CPUs. As available network bandwidth increased, application developers began to send larger and larger amounts of data over the network. Distributed object computing, where communication is implemented in the form of method calls on remote objects, is becoming the preferred model of network computing. For an RPC system, latency is more important than the peak achievable bandwidth, since response time is what the user perceives. Currently, there is a lot of hype about Java, with a large variety of applications being developed in it. Application developers are also coming round to the view that the component approach is essential for developing large software projects. As applications became object-based, RPC systems had to follow suit to remain useful. Distributed objects were thus the natural successor to conventional RPC systems, and many such systems have been developed, e.g. OMG's CORBA[4], Microsoft's COM-based[1] technologies, and Java RMI.

[Figure 2: Structure of a Java RMI call. The client-side stubs and server-side skeletons sit above the remote reference layer, which in turn sits above the transport.]

3.1 Java RMI

Java RMI (Remote Method Invocation) is an API standard for building distributed Java systems[10]. It allows applications to pass objects between JVMs and invoke remote methods on them. By using RMI in a pure Java environment, one gains the benefits of distributed garbage collection and full Java semantics. Since the JVM is a standard specification, all RMI applications will be able to interoperate.

3.1.1 System Architecture

Fig. 2 represents the basic RMI system architecture. It consists of three main layers: the stub/skeleton layer, the remote reference layer, and the transport. Fig. 3 shows the actual hierarchy of some of the classes that implement the RMI subsystem in JDK1.1.4. The client invoking a method on a remote server object actually makes use of a stub, or proxy, for the remote object. The stub/skeleton layer does not deal with any transport specifics but transmits data using marshal streams. The stub is a client-side implementation of the remote interfaces of the remote object. It is responsible for:

- initiating a call to the remote object by calling the remote reference layer;
- marshaling arguments to a marshal stream;
- informing the remote reference layer that the call should be invoked (the StreamRemoteCall class represents the object that currently executes the remote call);
- unmarshaling the return value from a marshal stream;
- informing the reference layer that the call is complete.

[Figure 3: Java RMI class hierarchy. The rmi package contains the server, registry, rmic, and transport subpackages; the transport package contains the StreamRemoteCall, Channel, Transport, Endpoint, and Connection classes, with tcp and fm subpackages providing concrete classes such as TCPChannel and TCPTransport, alongside UnicastRef, UnicastServerRef, and Dispatcher.]

The remote reference layer handles the semantics of the type of invocation, namely unicast or multicast. UnicastRef and UnicastServerRef are the JDK1.1.4 classes that implement the remote reference layer. The reference layer currently transmits data to the transport layer via the abstraction of a stream-oriented connection. The transport takes care of the implementation details of the connection. The transport is responsible for connection set-up to remote address spaces, managing connections, monitoring connection liveness, listening for incoming calls, maintaining a directory of the objects residing in the local address space, locating the dispatcher for the target of a remote call, and passing the connection over to this dispatcher. The transport for Java RMI consists of the following four abstract classes:

Endpoint: an endpoint denotes an address space. Given an endpoint, a specific instance of the transport can be obtained. A remote object is identified by an object identifier and an endpoint.

Channel: this class is an abstraction for a passage between two address spaces and is responsible for connection management.

Transport: given an endpoint to an address space, this class sets up a channel to that address space. This class is also responsible for accepting calls on the incoming connections to the address space, setting up a connection object for the call, and dispatching the call to the higher layers.

Connection: a connection is an abstraction for the actual data transfer.
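To make the division of responsibilities concrete, the following sketch restates these four abstractions as interfaces. These are illustrative shapes only, not the actual JDK1.1.4 classes (which are abstract classes with different signatures):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Illustrative interfaces only; the real sun.rmi.transport classes differ.
interface Endpoint {
    Transport getTransport();            // obtain a transport instance for this address space
}

interface Channel {
    Connection newConnection() throws IOException;   // manage the passage between two address spaces
}

interface Transport {
    Channel getChannel(Endpoint ep);     // set up a channel to the remote address space
    void listen() throws IOException;    // accept incoming calls and dispatch them to higher layers
}

interface Connection {
    InputStream getInputStream() throws IOException;    // the actual data transfer
    OutputStream getOutputStream() throws IOException;
}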

A skeleton for a remote object is a server-side entity that contains the dispatch method, which dispatches the call to the actual implementation of the remote object. It is responsible for:

- unmarshaling arguments from the marshal stream;
- making the up-call to the actual remote object implementation;
- marshaling the return value of the call onto the marshal stream.

3.1.2 Remote Method Invocation

In order to perform an RMI call on a remote object, the server implementing the remote object must first register itself with the rmiregistry. The client must then obtain a reference to the remote object by looking up its URL in the name server. The java.rmi.Naming class is an abstraction of a name server that stores URL-based named references to remote objects. It provides methods to look up, bind, unbind, and rebind a remote object maintained on a particular host and port. The reference returned by the name server lookup is actually a reference to the local stub implementing the remote interface. The stub marshals the arguments, if any, using stream-based communication. It then invokes the remote method using the underlying transport. At the server end, the transport dispatches the call to the skeleton. The latter unmarshals the arguments and invokes the appropriate method. The results from the server, if any, are then sent back to the client in an analogous manner.
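As a concrete illustration of this sequence, the sketch below shows a client looking up a remote object and invoking a method on it. The interface is a cut-down, illustrative version of the Timer interface used in our test suite (Section 5), and the URL is illustrative:

import java.rmi.Naming;
import java.rmi.Remote;
import java.rmi.RemoteException;

// Cut-down remote interface, modeled on our Timer test application.
interface Timer extends Remote {
    void sendbyte(byte b) throws RemoteException;
    void senddata(byte[] data) throws RemoteException;
}

public class TimerClient {
    public static void main(String[] args) throws Exception {
        // The name server returns a reference to the local stub
        // implementing the remote interface.
        Timer timer = (Timer) Naming.lookup("rmi://server-host/Timer");
        // The stub marshals the argument and invokes the method
        // through the underlying transport.
        timer.sendbyte((byte) 1);
    }
}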

3.2 Java Native Interface

Java was designed to be a platform-independent "write once, run everywhere" programming environment. However, the downside of this is a performance penalty: early versions of Java were notoriously slow. The proposed solution to this was the Java Native Interface (JNI), which allows platform-dependent code to be linked into a JVM for sections where performance is important. Native code accesses Java VM features by calling JNI functions, which are available through an interface pointer. The JNI interface is organized like a C++ virtual function table or a COM interface. Native libraries are loaded with the System.loadLibrary() method. The argument to System.loadLibrary() is a library name chosen arbitrarily by the programmer. The system follows a standard, but platform-specific, approach to convert the library name to a native library name. For example, Windows NT converts our library "net" to "net.dll". The JNI interface pointer, of type JNIEnv, is the first argument to native methods. The second argument differs depending on whether the native method is static or nonstatic: for a nonstatic native method it is a reference to the object, and for a static native method it is a reference to its Java class. The remaining arguments correspond to the regular Java method arguments. The native method passes its result back to the calling routine via the return value.

We have an FM library written in C, and we implemented a thin wrapper around the existing FM library functions so that they would be more "socket-like". We make them accessible to Java code through the JNI. There are different versions of Java native method interfaces, such as:

- Netscape's Java Runtime Interface
- Microsoft's Raw Native Interface and Java/COM interface
- the JDK 1.0 native method interface

We adopted the JDK 1.0 native method interface, mainly because it is the most compatible and easiest to use given that we were working with the JDK source code. However, there are some serious problems with the JDK 1.0 native interface:

- First, the native code accesses fields in Java objects as members of C structures. However, the Java Language Specification does not define how objects are laid out in memory. If a Java VM lays out objects differently in memory, the programmer has to recompile the native method libraries.
- Second, JDK 1.0's native method interface relies on a conservative garbage collector. The unrestricted use of the unhand macro, for example, makes it necessary to conservatively scan the native stack.
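As a small illustration of the loading and declaration mechanics described above, the sketch below declares a native send routine on the Java side. The class and method signature are hypothetical; the library name "net" is the one our implementation actually uses (resolved to net.dll on Windows NT):

// Hypothetical wrapper class; only the loadLibrary/native-declaration
// mechanics are the point here.
public class FMNative {
    static {
        // Resolved by the platform to a native library name,
        // e.g. "net.dll" on Windows NT.
        System.loadLibrary("net");
    }

    // On the native side this method receives the interface pointer and a
    // reference to this object before the declared arguments.
    public native int sendBytes(byte[] data, int length, int destNode);
}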

3.3 FM

Illinois Fast Messages (FM) is a low-overhead, high-performance software messaging layer with an implementation for the Myrinet network. FM's primary goal is to provide high performance not just to applications written directly to the FM API, but also to applications written to a wide range of higher-level communication APIs. To achieve that goal, FM's few, simple primitives provide a number of important guarantees, saving higher-level messaging layers the burden and performance penalty of having to implement them themselves [3]:

- reliable delivery
- in-order delivery
- decoupling of communication and computation

FM's interface is carefully designed to enable efficient composition into higher-level layers. This composition eliminates copies and buffer pool overruns, which often reduce performance significantly. Receiver flow control enables messaging layers to pace the rate at which data is removed from the network layer.


Myricom's Myrinet[9] is a high-speed local area network with full-duplex 1.28 Gb/s links. Derived from multicomputer routers, current Myrinet offerings employ wormhole routing and an 8x8 crossbar switch. The network interface has a programmable 33 MHz CPU (the LANai) with 256 KB of SRAM that attaches to the I/O bus of the host processor. The LANai has three DMA engines, which minimize the data copying from the host to the LANai and between the LANai and the network. They also allow the LANai to send to and receive from the network at the same time.

4 Technical Design & Implementation

The project objective was clear: select an implementation of Java RMI, obtain an indication of where the network delays are, and then work on reducing or eliminating them. We identified TCP/IP as a bottleneck and decided that replacing it with FM was the best method of overcoming it.

4.1 Selection of Platform

We selected the Sun Java RMI implementation included as part of JDK1.1.4 as the implementation to fine-tune. (The full source release of the Java Development Kit is available on a limited, non-commercial basis to those interested in the Java source code for educational, evaluation, or research purposes.) We intended to eventually test our implementation with a JIT compiler, but had problems using the rmiregistry with one: the server, when run with the JIT compiler, could not bind to the rmiregistry. So our measurements were taken with a non-JITed JVM.

Another important decision was the choice of operating system. JDK source was available for two platforms, Sun's Solaris and Microsoft's Windows NT. Since FM is currently available for Windows NT, we decided to work with the Wintel platform. We started out by obtaining the JDK1.1.4 source code, installing it, and compiling it. We traced the execution of an RMI call and verified that the major sources of delay were marshalling and networking. In the rest of this section, we describe our approach to implementing the FM sockets and integrating them with the Java RMI system.

4.2 FM Socket Implementation

The Java network library provides two kinds of socket classes: the stream Socket class and the DatagramSocket class. The Socket class is associated with an InputStream and an OutputStream for transmission and reception of data, respectively. To use it, the application creates an instance of a Socket, binds and/or connects to a TCP port, and then uses the input and output streams to transmit data.


The DatagramSocket class, on the other hand, uses DatagramPackets for data transmission and reception. Here the application creates a DatagramPacket with the source and destination addresses and the relevant data, and uses send and receive to transmit data. The Socket class implements a reliable, stream-based connection, while the DatagramSocket class implements an unreliable datagram protocol. Both of these classes provide a high-level abstraction over the actual socket implementation, which is done in C. These wrapper classes use the Java Native Interface[5] to interact with their corresponding C implementation. JDK1.1.4 provides implementations of these two classes and allows custom sockets to be added to its library. This enabled us to implement FM sockets using the FM library and add them to the java.net package.

Our implementation was modeled on the Java socket classes. We provide a stream version, FMSocket, and a datagram version, FMDgSocket, of the FM sockets. The underlying implementation is in C, and the interface to the wrapper classes is implemented using JNI. The static initialization for the FMSocket class calls FMsocketInitProto, which initializes the FM library for use. During this stage, FM_set_parameter() is invoked to set the key parameter, and FM_initialize() is called for initialization. During socket construction, the constructor of the class binds the socket to the appropriate handler using FM_register_handler(), for later data reception. FMSockets use FMInputStream and FMOutputStream to exchange data, analogous to the Java Socket class. (These stream classes logically belong under the io hierarchy, though we have placed them in the net hierarchy temporarily for convenience; putting them into io would have required more changes to the structure of the code base, and since this does not affect functionality, we did not keep the hierarchy clean, in the interest of time.) Since FM treats its data as "messages", there is no real concept of connection establishment in FMSocket; when data transfer begins, FM_begin_message() (from the FM library) establishes the path on which to send data. The FMDgSocket is fairly similar, except that it uses FMPackets rather than streams for data transfer. The FM socket classes support both character and byte modes of transfer; in our test cases, we predominantly used the byte mode. The following piece of code illustrates the use of FMDgSocket for sending and receiving bytes of data:

int node_id = 0;                    // destination node id
byte[] buffer = new byte[500];      // data to be sent

// construct a packet to be sent to a node
FMPacket packet = new FMPacket(buffer, buffer.length, node_id);
FMDgSocket sock = new FMDgSocket(); // create an FM socket
sock.sendBytes(packet);             // send the data to the node
...
sock.receiveBytes(packet);          // receive data from any node
sock.close();

[Figure 4: Java RMI implementation. (a) The original export path on the server: UnicastRemoteObject.exportObject calls UnicastServerRef.exportObject, then LiveRef.exportObject and TCPEndpoint, with the actual creation of the TCP transport done in TCPTransport.exportObject. (b) The modified path: UnicastRemoteObject.exportObject calls FMServerRef.exportObject (where the FMDgSockets are created), which exports the object through FMTransport.exportObject.]

From our tests, we found that there was no significant difference in performance between the stream version and the datagram version. Moreover, the model of communication in FM is inherently message-oriented and reliable. As a result, we chose to use the datagram version for our tests.

4.3 Integration of FM with RMI

After integrating the FM sockets with the Java networking library, our next main challenge was to integrate them with Java RMI. This required a thorough understanding of the source code implementing the classes that make up the Java RMI subsystem. In this section we provide details about the different approaches we tried and the justification for adopting them. A high-level overview of the Java RMI implementation was provided in Section 3.1. Here we focus on the specific classes that were affected by our integration.

4.3.1 Providing an Alternative Socket Factory

Java RMI currently uses stream-based sockets for communication. However, an alternative socket implementation can be used by appropriately setting the defaultSocketFactory in the RMIMasterSocketFactory class. We used this approach to change the default RMI sockets to datagram sockets. That was when we realized that providing an alternative socket implementation was not, by itself, sufficient to achieve the integration: we also had to implement the entire transport subsystem corresponding to the alternative socket implementation.

Since we were not sure whether implementing an entirely new transport was feasible in the time we had, our first approach was to reuse parts of the existing transport wherever possible and replace the parts of the TCP transport that use stream sockets so that they would use FM sockets instead. However, this approach was not suitable either. The main problem was that the TCP transport was used for both the client-server communication and the client-rmiregistry communication. Our objective was to modify only the client-server transport to use FM sockets, and modifying the TCP transport would affect the client-rmiregistry transport as well, which was not desirable.

Thereafter we focused our efforts on designing a transport based on FM sockets. Towards this end, we first had to modify the automatically generated stubs and skeletons. The stubs and skeletons generated by the RMI compiler are hard-coded to communicate using streams that are tied to a TCP connection. Since the RMI compiler by default generates bytecode for the stubs and skeletons, we first had to find an option to generate the source files instead, in order to modify them. The "-keepgenerated" option of the RMI compiler accomplished this.

On account of the tight coupling of the TCP streams with RMI, we found it very hard to replace the TCP transport entirely. So we decided on an incremental approach, first replacing all the stream-based marshaling and unmarshaling of arguments with FM datagram socket based marshaling and unmarshaling. The actual remote call structure itself was unmodified. The remote call is performed by invoking the executeCall() method of a StreamRemoteCall object. This call object contains information about the exact operation to be invoked, the id of the target object, and other relevant information required to invoke the exact service at the server end. Moreover, since this call object is tied to the TCP connection, the remote call results in awakening the listening TCP server, which then spawns a new thread to service the call. The thread invokes the dispatch() method of the appropriate server skeleton, which then makes the upcall to the server. On account of the layers of abstraction involved, we decided to retain the TCP transport to invoke the appropriate skeleton, while using the FM sockets for data transmission only. However, we found that the overhead of spawning a new thread for each remote call far outweighed the overhead of the data transmission. For the FM sockets, the largest overheads were observed in the initialization stage, while calling FM_initialize(). Since we performed static initialization, these overheads were incurred only when the first FM socket was created, not for every call. As a result, the client and server communication fell out of synchronization very often, with the stub sending data well before a socket was ready to receive data at the server end. So the performance numbers we achieved with this approach were very poor.
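For reference, the public JDK1.1 API exposes a related hook through java.rmi.server.RMISocketFactory. The sketch below shows roughly what installing an FM-backed factory through that hook could look like; it is an assumption-laden illustration (our actual change set the internal defaultSocketFactory, and the FMSocket/FMServerSocket classes shown here are placeholders assumed to subclass java.net.Socket and java.net.ServerSocket):

import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.rmi.server.RMISocketFactory;

public class FMRMISocketFactory extends RMISocketFactory {
    public Socket createSocket(String host, int port) throws IOException {
        return new FMSocket(host, port);        // FM-backed client socket
    }
    public ServerSocket createServerSocket(int port) throws IOException {
        return new FMServerSocket(port);        // FM-backed listening socket
    }
    public static void install() throws IOException {
        // RMI uses the installed factory for all subsequent connections.
        RMISocketFactory.setSocketFactory(new FMRMISocketFactory());
    }
}

// Placeholder FM-backed socket classes; in our implementation these wrap
// the FM library through JNI rather than delegating to TCP as shown here.
class FMSocket extends Socket {
    FMSocket(String host, int port) throws IOException {
        super(host, port);   // placeholder only
    }
}
class FMServerSocket extends ServerSocket {
    FMServerSocket(int port) throws IOException {
        super(port);         // placeholder only
    }
}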

4.3.2 Designing an Alternative Transport

Our final effort was to do away with the TCP transport entirely and design a custom transport that uses FM sockets for communication. To this end, we designed an FMTransport class, based on the TCPTransport class, to provide at least the mandatory functions: the ability to export a remote object using the exportObject method, and the ability to handle incoming connections and dispatch calls to the appropriate skeleton. We also had to locate the exact point at which the server initiates the TCP connection. After much code hacking, we found that the starting point was the exportObject() method of the server's base class, UnicastRemoteObject. The exact sequence of calls made before the server side of the TCP connection is initiated is shown in Fig. 4a.

Our objective was to cut through the layers of abstraction as much as possible. So we designed an FMServerRef class analogous to the UnicastServerRef class, except that the former uses the FMTransport instead of the TCPTransport to export a remote object. The modified sequence of calls is illustrated in Fig. 4b.

Since most of the time was spent hacking through the code and working through the layers of abstraction (locating a call was like searching for a needle in a haystack!), we were left with little time to achieve a complete integration of our classes with Java RMI. We were able to perform tests with only a modest amount of integration. However, we are confident that our last approach is the correct and feasible way to achieve the integration, and a more robust and complete integration may be considered future work.
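For orientation, the sketch below shows a stock JDK1.1 server whose construction triggers the UnicastRemoteObject.exportObject() path discussed above. It uses the illustrative Timer interface from Section 3.1.2 and the default TCP transport, not our FM classes:

import java.rmi.Naming;
import java.rmi.RemoteException;
import java.rmi.server.UnicastRemoteObject;

public class TimerServer extends UnicastRemoteObject implements Timer {
    public TimerServer() throws RemoteException {
        super();   // exportObject() is invoked from this constructor
    }
    public void sendbyte(byte b) throws RemoteException {
        // server-side implementation of the remote method (empty for the test)
    }
    public void senddata(byte[] data) throws RemoteException {
        // receives the byte array sent by the client
    }
    public static void main(String[] args) throws Exception {
        // Register the exported object with the rmiregistry under a URL.
        Naming.rebind("rmi://localhost/Timer", new TimerServer());
    }
}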

5 Testing

Right at the start, we designed a test suite to measure the performance of Java RMI and to try to isolate the networking components contributing to the delay. We wrote a simple "Timer" client-server application that used Java RMI to make repeated procedure calls to another JVM. The Timer application had methods to make a Null() call (no arguments, no results), sendbyte() and senddata() calls (to send one byte and a byte array from the client to the server), and recvbyte() and recvdata() calls (for the client to receive data from the server). The delays in RMI can be viewed as coming from system delays (into which we include procedure calls, initial RMI setup costs, etc.), marshalling delays, and network delays. We were interested in "fast-path" performance, so we ignored system delays due to binding etc., which get amortized over a large number of calls. We did not attempt to minimize marshalling delays here, but focused on networking delays. Network delays can be characterised as a fixed "base latency" (depending on the type of network) plus a per-packet overhead (related to the size and number of packets, and depending on the type of network used and the available bandwidth).

- The Null() procedure just makes a method call which takes no arguments and produces no results. This provides an indication of the base latency of the system/network combination, since there is no data transfer involved.
- senddata() sends an array of bytes (we selected this datatype to minimize marshalling overhead) from client to server. We measured the time taken for 10,000 calls, varying the size of the array up to 1500 bytes (a sketch of the timing loop follows this list). We also implemented a recvdata() call, which receives data from the server, but since send and receive are symmetric, we only show the send data here.
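The timing loop for each measurement looked roughly like the sketch below; it reuses the illustrative Timer interface from Section 3.1.2, the host name is illustrative, and the per-call figure is simply the elapsed time divided by the number of calls:

import java.rmi.Naming;

public class TimingTest {
    static final int CALLS = 10000;              // calls per measurement
    public static void main(String[] args) throws Exception {
        Timer timer = (Timer) Naming.lookup("rmi://server-host/Timer");
        byte[] data = new byte[1500];            // payload size under test
        long start = System.currentTimeMillis();
        for (int i = 0; i < CALLS; i++) {
            timer.senddata(data);                // the remote call being timed
        }
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("ms/call = " + (double) elapsed / CALLS);
    }
}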

We took these measurements on different machines and networks:

- between two JVMs on the same machine, running TCP/IP;
- between two JVMs on different machines connected by a 10 Mb/s Ethernet, running TCP/IP;
- between two JVMs on different machines connected over Myrinet, running FM. (We were unable to take readings between two JVMs on the same machine with Myrinet, since a Myrinet node cannot send data to itself in a switchless configuration.)

The two JVMs on the same machine emulate a "null" network layer. We expected that the time with the FM implementation would fall somewhere between the first two sets of readings. However, the results turned out differently, as explained in the following section. We also re-implemented the test suite using a basic client-server approach (i.e. using sockets directly, not through RMI). This shows the overhead of using RMI in such a situation. We used UDP sockets, since the FM model, being message-based, is closer to UDP than to TCP. This approach therefore lets us account for the following factors:

- The different network layers allow us to compare the effect of the network protocol.
- The simple client-server versus RMI client-server comparison allows us to comment on the overhead associated with RMI.
- Null() and senddata() allow us to estimate the overhead due to actual data transfer.

Note that performance may also vary as the number of connections or the number of objects increases. However, we do not take this into account for the purposes of this experiment. The Ethernet connection goes through a switch, so it was not possible to completely isolate the test network (since we did not know what other machines were on that network segment). However, most of our readings were taken in the early hours of the morning, when network traffic was at its lowest. Our Myrinet network had a very simple topology: two machines connected directly by Myrinet cable. Since there was no switch, we would see lower latencies than in a real-world situation, where switch latencies would also affect performance.

6 Experiments and Measurement

For our experiments we used two Pentium II machines connected to each other in two ways: using Ethernet and using Myrinet. The Ethernet is a 10 Mb/s twisted-pair network connected to a switch. The Myrinet cards in each machine are LANai 5.0 cards and are connected back to back using full-duplex 1.28 Gb/s Myrinet cable, without a switch. As described in Section 4, we selected Microsoft Windows NT as our experimental platform. We installed the basic NT Server 4.0 on both machines and then added the following pieces of software:

- JDK1.1.4, Sun's Java Developers Kit. JDK1.1.4 was the current release of the JDK from Sun at the time we started our project; JDK1.1.5 has just been released.
- MKS Toolkit (the trial version) from Mortice Kern Systems Inc. This provides some Unix tools for Windows NT that are required to build the JDK.
- Microsoft Visual C++, the Microsoft SDK, and the Microsoft DDK: the C compiler, other utilities (like nmake), and documentation.
- The FM library from the HPVM-1.0 distribution for x86 and Windows NT.

We developed a test library that consists of the following functions:

- a Null() function with no arguments and no results;
- a sendbyte() function with one byte argument and no result;
- a recvbyte() function with no arguments and one byte result;
- a senddata() function with a large argument (an array of bytes) and no result;
- a recvdata() function with no argument and an array of bytes as result.

Since the times taken for send and receive are similar for the same-sized data, we analyze only the send functions. We were unable to run the server using the JIT compiler, since it encountered problems while trying to bind with the rmiregistry. We also tried out several public-domain JIT compilers for Windows NT: kaffe, SuperCede, grok, and the JIT compiler bundled with Microsoft's IE3.0. We have not been successful so far, since some of these do not support JDK1.1 and RMI yet. From the postings in the rmi-users FAQ, we understand that the Symantec JIT compiler does not work with RMI either. Testing on JIT compilers is left for future work.

Table 1: Measurements for a basic client-server application

  Function            Null network layer (ms/call)   UDP/IP (ms/call)   FM (ms/call)
  Null()              0.3075                         0.5353             0.0472
  sendbyte()          0.3138                         0.5437             0.0472
  send (100 bytes)    0.3300                         0.8181             0.0543
  send (256 bytes)    0.3409                         1.3400             0.0672
  send (1500 bytes)   0.6331                         5.5328             0.1391

Table 2: Measurements for the RMI application

  Function            Null network layer (ms/call)   TCP/IP (ms/call)   FM (ms/call)
  Null()              2.7092                         3.1970             0.0470
  sendbyte()          4.0094                         4.4812             0.0485
  send (100 bytes)    4.1188                         4.8438             0.0531
  send (256 bytes)    4.15                           5.3812             0.0656
  send (1500 bytes)   4.3282                         8.425              0.1407

7 Analysis

We took readings with a null network layer (the loopback Ethernet interface), with a UDP/IP network interface, and with an FM network interface. There were two sets of readings: one for a simple client and server, and the other for an RMI-based client-server pair. The "null" network layer was the case where the client and server ran on the same physical machine (in different JVMs) using UDP as the transport. The difference between this reading and the readings between two different machines running UDP provides the network delay due to the UDP transport. We wish to compare the performance of Java RMI using FM with that of standard Java RMI, which uses TCP/IP. Ideally, FM performance should be compared with UDP, since they are similar in having no connection establishment overheads. However, we could not set Java RMI to use UDP instead of TCP, hence we present figures for TCP. In the case of UDP, the delays were proportional to the packet length, whereas in FM there was little variance based on packet length. The best performance improvement we observed was when sending 1500 bytes: FM was around 34 times faster than UDP. The least improvement was in the case of the Null() call, where FM was only around 4 times faster than UDP. As mentioned earlier, we were not able to directly compare FM-RMI performance with UDP-based RMI. However, performance for FM with and without RMI is fairly similar. We have shown that our FM implementation achieves better performance than UDP. Since UDP has lower overheads than TCP, we conclude that FM-RMI performance is better than that of TCP-based RMI.

7.1 Limitations

The most serious limitation of JDK1.1.4 is the stream-based marshaling of arguments. As a result, we had to spend a lot of effort decoupling the streams from the (hard-coded TCP) transport-based RMI. This is conceptually simple, but the layers of abstraction made it difficult to find the precise point in the code where we could intercept it. The Java platform is still evolving; both RMI and JNI are relatively new additions and still in a state of flux. JIT compilers, which will provide great performance improvements, have not caught up with these latest developments, but we expect they will soon. There was also a great time constraint, so we ran out of time to clean up the code and do rigorous testing. The system is not yet stable enough for production or mission-critical applications.

8 Summary

High Performance RMI is an enhancement to Sun's JDK1.1.4 distribution. The key to achieving high performance over the original implementation is an understanding of the current performance bottlenecks. Marshaling and unmarshaling are being improved in subsequent versions of RMI, and the current trend indicates even better performance in future releases. We therefore attacked the second big source of overhead: the transport layer built on top of TCP. Given the availability of a high-performance network (Myrinet), a messaging layer able to take advantage of it (FM), and inexpensive computer hardware (Intel Pentium), we decided to replace TCP with FM as the Java RMI transport layer. We did this as described earlier, providing an alternate socket implementation so that RMI could use FM sockets.

Our testbed consisted of two Pentium II machines running Windows NT Server 4.0, connected with Ethernet and Myrinet, Illinois FM-2.0, the JDK1.1.4 source code, and related hardware and software tools. The test suite was run before and after the modification to compare performance between TCP and FM. Measurements were based on the transmission of bytes in order to minimize the marshaling/unmarshaling overhead. They showed around a 60x speedup. The test results and analysis are encouraging and can be used to develop high-performance distributed applications. To make this production-quality, we need to do more rigorous testing and analysis. The results do not include the FM initialization overhead, which in fact is done over TCP/IP. We feel justified in ignoring this overhead because the initialization is not part of the "fast path" that we are seeking to optimize. Though our RMI/FM implementation is currently not tightly coupled, it should be possible to achieve a closer integration, given more time. We should also note that the RMI/FM implementation is made possible mainly by the flexibility FM allows. We have also described some of the problems faced during the development cycle, and conclude that Wintel-based platforms still have some way to go from the software developer's point of view.

Having stated the goals, described the reality and the problems, and assimilated the achievements of this project so far, we can fairly say that High Performance RMI (HP-RMI) is a reality. Though our integration of the alternate FM-Socket version is not yet clean and robust, the path is traced, and the understanding of the mechanisms, difficulties, and viability is already in place. Of course, we limit this affirmation to high-speed local area networks where these fast messaging tools are fully functional.

9 Future Work

A great number of improvements can be made to the current state of our implementation. The time constraint forced us to take some shortcuts, and these must be studied and corrected carefully in order to make the implementation clean and robust; as of now, we cannot say that the integration meets these two goals. To gain as much performance as possible, we need to study and understand the contribution of each of the components involved in the communication, so that performance is not diminished as a result of our integration. By drawing a clear distinction between what is pure Java and what needs to be platform-dependent, JNI provides not just the link but also the division of responsibilities in the communication path. Many of the problems involved in integrating FM and RMI are platform-related and to a great extent out of our control. What we can control is the Java side, and a good implementation there can perform significantly better than a bad one.

We believe the approach stated as the goal of this project is the best and most natural way to proceed, since it maintains the uniformity and homogeneity of the current distribution. However, deeper and more careful research is needed to corroborate this: an analysis of different approaches, and extensive, selective testing to point out the problems that may exist. Likewise, a better understanding of the technicalities and peculiarities of FM is required in order to improve the implementation without making the end user aware of the underlying transport. To measure the real impact, tests also need to be done with final applications in a fully distributed working environment.

One of the nice aspects of this project is that we had the opportunity to work with a real distributed object-oriented system; this field is growing, and this research can contribute to the search for high-performance distributed systems. Comparisons with CORBA and DCOM seem unavoidable, so a further step would be to add features that are becoming increasingly important in the field, such as quality of service for supporting real-time applications. Achieving gigabit bandwidth seems not only feasible but necessary. Up to this point we have not said anything about the marshaling/unmarshaling engine in Java RMI. That is another project in itself, but the two are necessarily integrated. JavaSoft seems to be taking the marshaling/unmarshaling performance issue seriously, given the increasing popularity of the language. Each release brings new improvements and Java is being made faster; in the near future we will reach the point where the transport again becomes the bottleneck to worry about. Furthermore, inexpensive distributed computers interconnected by high-speed networks are gaining popularity, so a fully functional high-speed Java RMI/FM with a highly optimized marshaling/unmarshaling engine is worth doing, and we consider it research for the near future.

10 Perspective

The project was interesting because it gave us a feel for real-life project design: starting with basic project selection and design, then the experimental setup, and finally the actual implementation and write-up. It was somewhat complicated because of the number of large pieces of software involved. JDK1.1.4 had around 60 MB of source code, and just figuring out where to intercept the RMI calls to the network interface involved a lot of code tracing. Then there was FM, which we were partly familiar with, but for which we had to use the source to figure out and handle some memory problems. This project was also different from most course projects in that the problem was more "open-ended". We therefore had to start from scratch in deciding what the interesting parameters to study were, and then go about designing the experiments to measure them. This made it more interesting than just solving a "toy" problem that someone specified.

However, this also led to some problems. Since this was not something people had done before, the lab was not set up, and we had to specify what hardware and software setup we needed. This was complicated by the fact that the setup of the NT machines in the lab did not permit students to install any software for testing. So whenever we discovered that we needed something, we had to ask for it and wait for it to be installed. We later had two machines moved downstairs to which we had administrator access. This worked well, but we had wasted a lot of precious time trying to get the CRL machines usable for the project. Windows NT was yet another learning curve. The development environment is not really conducive to shared working: no effective version control software exists by default, and it took a lot of time to get a consistent and non-volatile environment set up for the whole group. We all finally ended up working under the "administrator" account, since that was the only way to ensure that we were all using the same environment variables. The C development environment also had some interesting features; multiple incompatible versions of the C runtime libraries was one. (This was a good practical example of why component software is a good idea!) Since we were unfamiliar with the environment, we probably did not use the most efficient makefiles etc., which also slowed development. On occasion, the compiler and all other programs would just produce memory errors, which were solved only by a power cycle (an ordinary reboot did not help). For a while we tried debugging these errors, before realizing we probably should not bother. This project also required a lot of collaborative work among group members, since we could not easily split the work into separate parts to be merged at the end. This required the development of some interesting scheduling algorithms for the use of the machines downstairs; we had close to 100% utilization of the machines towards the end :-).

The project made us aware of the internals of commercial software, and quickly rid us of our biases in its favor as compared to academic software. The JDK is very much a work in progress, and we suffered its rough edges. JNI, on the other hand, still has a long way to go: there is not yet one standard native interface, but three incompatible ones, none of them comprehensively documented. RMI is still a new interface, and each release of the JDK makes enhancements to it, dramatically improving both the architectural design and the performance. We expect JITs to catch up with these developments soon, making RMI an even more attractive option for distributed object computing in the near future. Overall, the project was a great learning experience: we had the opportunity to work with a massive piece of commercial software, understand the code, extend it with our own customization, and finally achieve a reasonable speedup. Thus we were able to realize most of the objectives we had proposed.

References

[1] The Component Object Model Specification. Available from http://www.microsoft.com.
[2] A. D. Birrell and B. J. Nelson. Implementing remote procedure calls. ACM Transactions on Computer Systems, volume 2, February 1984.
[3] K. Connelly and A. Chien. FM-QoS: Real-time communication using self-synchronizing schedules. In Proceedings of Supercomputing, 1997.
[4] The CORBA 2.0 Specification. Available from http://www.omg.org.
[5] JNI Specification. Available from http://www.javasoft.com/.
[6] S. Pakin et al. HPVM 1.0 User Documentation, August 1997.
[7] Java RMI Specification. Available from http://www.javasoft.com/products/jdk/1.1/docs/guide/rmi/spec/rmiTOC.doc.html.
[8] M. D. Schroeder and M. Burrows. Performance of Firefly RPC. In ACM Symposium on Operating System Principles, December 1989.
[9] Charles Seitz. Myrinet: a gigabit-per-second local-area network. In Proceedings of the IEEE Symposium on Hot Interconnects, 1994.
[10] A. Wollrath, R. Riggs, and J. Waldo. A distributed object model for the Java system. In COOTS, June 1996.
