NETWORKING PERFORMANCE FOR DISTRIBUTED OBJECTS IN JAVA

M. Migliardi
DIST - University of Genoa
Via Opera Pia 13, Genoa 16145, Italy

Vaidy Sunderam
Dept. of Math & Computer Science, Emory University
1784 N. Decatur Rd., Atlanta, GA 30322, USA

Abstract. In network-based concurrent computing systems, communication performance is always a crucial factor: it is a bottleneck even in conventional environments and is therefore likely to be critical in Java-based systems. In order to obtain a baseline measure for distributed-object communication performance in Java, in the context of distributed HPC systems, we have undertaken a benchmarking exercise. Our experiments show some interesting and somewhat surprising results, viz., that the overhead introduced by Java over C socket programming is very small, that the overhead of object serialization is highly asymmetric, and that memory bandwidth is crucial for object serialization. Our results and findings are presented in this paper.

Keywords: Benchmark, Java, Metacomputing, Distributed Objects

1. Introduction

The Java programming language and system has been receiving increasing attention for high performance computing, though, of course, to a lesser extent than for internet/multimedia applications. The prominence of Java in the latter class of applications is not surprising, given the explosive growth in economic and technical drivers in the web domain. However, the use of Java in HPC is less natural, at least on the surface, for two reasons: one, it is basically an interpreted language, almost by definition subject to poor performance; and two, it does not allow conventional methods of source-level optimization, including low-level memory manipulation and exploitation of machine architecture idiosyncrasies.

Yet, Java is being increasingly adopted for HPC projects, as evidenced by the growing effort in the area [1]. Several explanations may be postulated for this, including: (1) those based on momentum and trends, availability of expertise, software libraries, tools, and Java extensions to support HPC; and (2) others, such as the existence of JIT compilers, highly efficient JVM implementations, and even ongoing increases in CPU speeds that offset software-induced slowdowns. Nevertheless, the effectiveness of Java in HPC depends at some level on the performance it delivers, both in absolute terms and relative to the alternatives, i.e. traditionally compiled language systems and low-level access to system facilities. In particular, in concurrent computing systems based on message exchange, communication performance is always a crucial factor, a bottleneck even in conventional environments and therefore likely to be critical in Java-based systems. Communications performance has always been a critical issue in distributed systems. Recently, some multiprocessors have incorporated interconnection networks and switches that achieve communication speeds of the same order of magnitude as computing speeds. Cluster and metacomputing systems, on the other hand, are based on network technologies that are typically one to two orders of magnitude slower than CPU speeds. In these environments, the overheads added by the software layers are all the more crucial to the end-to-end communications performance that ultimately impacts the overall efficiency of distributed-object applications.

|                  | Dilbert | Labss1{a-h} | Labss3{a,h} | Wharness{1,2} | Lharness{1,2} |
|------------------|---------|-------------|-------------|---------------|---------------|
| Model            | Sun Server Enterprise 3000 | Ultra 1 Model 170 | Ultra 5/10 | Dell Dimension XPS R450 | Dell Dimension XPS R450 |
| CPU / clock      | UltraSPARC II, 250 MHz | UltraSPARC I, 166 MHz | UltraSPARC IIi, 300 MHz | Intel Pentium II, 450 MHz | Intel Pentium II, 450 MHz |
| OS               | SunOS Release 5.7 (Generic) | SunOS Release 5.7 (Generic) | SunOS Release 5.7 (Generic_10654101) | Windows 98 | Red Hat Linux 5.2 (Apollo), kernel 2.0.36 |
| Physical memory  | 512 MB | 128 MB | 256 MB | 128 MB | 128 MB |
| SPECint95 [2][3] | 10.4 | 6.6 | 12.1 | 18.5¹ | 18.5¹ |
| JVM              | Solaris VM (build Solaris_JDK_1.2_01_dev05_fcsK, native threads, sunwjit) | Solaris VM (build Solaris_JDK_1.2_01_dev05_fcsK, native threads, sunwjit) | Solaris VM (build Solaris_JDK_1.2_01_dev05_fcsK, native threads, sunwjit) | java version "1.2", Classic VM (build JDK-1.2V, native threads, symjit) | Classic VM (build Linux_JDK_1.2_pre-release-v1, native threads, sunwjit) |
| Network fabric   | Full-duplex switched Ethernet, 100 Mb/s | Full-duplex switched Ethernet, 100 Mb/s | Full-duplex switched Ethernet, 100 Mb/s | Full-duplex switched Ethernet, 100 Mb/s | Full-duplex switched Ethernet, 100 Mb/s |

Table 1. Main features of the systems composing our testbed.
¹ This value was measured under the Windows NT 4.0 OS.

In Java-based systems, this software overhead comprises not only the usual transport layer functions, but also object stream processing overheads and JVM interpreter overheads. Since the hardware interface layer, i.e. the JVM, plays more of a direct role, different JVM implementations may also influence communication performance. As a result, the number of parameters affecting communication performance in Java message-passing HPC systems is considerably larger than in traditional C or Fortran-based systems, making benchmark results particular to very specific environments. Nevertheless, we believe that some baseline measurements of Java message-passing performance will help quantify the degree to which its unique characteristics are likely to impact metacomputing and network-based high performance applications. With this in mind, we have conducted a basic set of benchmarking experiments that measure message-passing performance in Java in a few representative hardware and software environments. In the following sections, we describe the experimental testbed and benchmarking methodology, and present empirical results obtained from our exercises.

This paper is structured as follows: in section 2 we discuss related efforts and present a detailed description of our testbed, experimental methodology and benchmark programs; in section 3 we present our experimental results; finally, in section 4 we discuss our results and provide some concluding remarks.

2. Background and testbed

Recently, several projects, such as DOGMA [4], IceT [5], Harness [6], Javelin [7] and Bayanihan [8], have adopted Java technology to tackle the problem of generating a portable distributed virtual machine. Other projects, such as NinjaRMI [9], Albatross [10] and NexusRMI [11], have enhanced the performance of the Java Remote Method Invocation (RMI) mechanism with their own non-standard implementations of that facility. Some of these projects report having achieved high levels of message-passing performance by adopting reliable communication services in Java, and others have modified the object serialization mechanism on which RMI is built to boost performance.


| System    | C socket layer | Java streams layer | Object stream layer |
|-----------|----------------|--------------------|---------------------|
| UltraI    | 185 µs         | 193 µs             | 574 µs              |
| UltraII   | 127 µs         | 149 µs             | 424 µs              |
| Windows98 | 175 µs         | 165 µs             | 575 µs              |
| Linux     | 58 µs          | 72 µs              | 240 µs              |

Table 2. Ping-pong latencies for different systems at different software layers.

However, to our knowledge, a systematic study of the performance impact of the different software layers involved in Java-based reliable communication has not yet been published. Such a study, i.e. an analytical measurement of the overhead introduced by the JVM and the object serialization mechanism over the C stream socket interface, is the main contribution of this paper.

We performed our experiments on four systems with different hardware and software configurations. In Table 1 we detail the characteristics of the systems included in our testbed. All the computers in our testbed are medium to high-end workstations, with the notable exception of a Sun Ultra Enterprise server (Dilbert). We decided to disregard the homogeneity issues raised by its inclusion because it allowed us to observe the impact of an architecture optimized for client-server corporate applications, such as the Sun Server Enterprise 3000, on our benchmark programs. From the JVM standpoint, we tested the two reference implementations of the Java2™ architecture provided by Sun, as well as the pre-release of the port to the Linux OS by the Blackdown Java-Linux Porting Team. Given the growing interest that the Linux OS is attracting, we decided to include it in our testbed although the only available implementation of Java2™ is at a pre-release stage. 100 Mb/s switched Ethernet is an increasingly common medium-end networking standard, so we adopted it as our baseline communication fabric for distributed computing.

We measured the latency and bandwidth delivered by reliable streams through one, two and three additional software layers, namely the socket interface in C, the java.net package, and the object serialization mechanism provided by the Object{Input,Output}Stream classes. In order to simulate a cooperating distributed-objects environment, we acknowledged each message at the application level. This mimics the invocation of a method requiring no computation on a remote object, adopting the synchronous call semantics used in many distributed-object programming environments. We measured the bandwidth attainable at the interface of each layer for message sizes ranging from one byte to 64k bytes, in exponential increments. Latency times were obtained as the average time required to execute a set of ping-pong transmissions carrying the smallest possible payload. We performed our experiments on hosts with no user-level load on either the CPU or the network connection: only the boot-time system tasks and our test programs were active during the experiments. To measure the time required to perform our tests, we used the currentTimeMillis method of the java.lang.System class in Java programs and the gettimeofday system call in C programs. We measured the duration and granularity of both calls, and we tuned the number of times each experiment was repeated inside a single timed measurement in order to reduce the error introduced by the timing call to less than 1%.
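To make the methodology concrete, the following is a minimal sketch of a ping-pong measurement of the kind described above, at the Java streams layer (byte arrays over Data{Input,Output}Stream). This is not the benchmark code used in our experiments: the class name, port number and repetition count are illustrative assumptions.

```java
import java.io.*;
import java.net.*;

// Minimal ping-pong sketch at the Java streams layer: the client sends a
// byte-array message and waits for an application-level reply before the
// next send. Port and repetition count are illustrative only.
public class PingPong {
    static final int PORT = 5000;  // assumption: any free port
    static final int REPS = 1000;  // inner repetitions per timed run

    public static void main(String[] args) throws IOException {
        boolean server = args[0].equals("server");
        int size = Integer.parseInt(args[1]);  // message size in bytes
        byte[] buf = new byte[size];

        Socket s = server ? new ServerSocket(PORT).accept()
                          : new Socket(args[2], PORT);  // args[2]: server host
        s.setTcpNoDelay(true);  // avoid Nagle-induced delays on small messages
        DataInputStream in = new DataInputStream(
                new BufferedInputStream(s.getInputStream()));
        DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(s.getOutputStream()));

        long start = System.currentTimeMillis();
        for (int i = 0; i < REPS; i++) {
            if (server) {                 // echo: read, then reply
                in.readFully(buf);
                out.write(buf); out.flush();
            } else {                      // send, then wait for the reply
                out.write(buf); out.flush();
                in.readFully(buf);
            }
        }
        long elapsed = System.currentTimeMillis() - start;
        if (!server) {
            // With size == 1 this approximates latency; with large sizes
            // it measures the attainable request-response bandwidth.
            System.out.println("avg round trip: "
                    + (elapsed * 1000.0 / REPS) + " us");
        }
        s.close();
    }
}
```

Run, for instance, as `java PingPong server 1` on one host and `java PingPong client 1 <serverhost>` on another; with a one-byte payload the reported round-trip time approximates the kind of figure shown in Table 2.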

Figure 1a. Experimental results in distributed-object environment on the UltraSPARC systems: bandwidth (Mb/s) vs. message size (bytes) for the C socket, Java streams and object streams layers on the UltraSPARC I and UltraSPARC II.

Figure 1b. Experimental results in distributed-object environment on the PentiumII systems: bandwidth (Mb/s) vs. message size (bytes) for the C socket, Java streams and object streams layers under Linux and Windows.

All the JVM flags were set to their default values, and no non-standard -X options were used. We repeated each measurement at least one hundred times and report the minimum measured time: this choice portrays the upper bound of achievable performance and minimizes the impact of artifacts on our measurements. The Java stream experiments do not perform any kind of type marshalling; in fact, they transmit only byte arrays. To evaluate the overhead of type marshalling we adopted the full object-oriented mechanism provided by object serialization.
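As a sketch of how the object serialization variant differs, the fragment below replaces the data streams with Object{Input,Output}Stream and sends one serializable object per message. The Message class and all names are illustrative assumptions, not our benchmark code; the comments note two standard behaviors of these classes that are relevant to the results in section 3 (the stream-header handshake at connection set-up, and the reset() call needed so every message is marshalled in full).

```java
import java.io.*;
import java.net.Socket;

// Sketch of the ping-pong exchange at the object serialization layer:
// each message is a serializable object written to an ObjectOutputStream
// and answered by an application-level reply. Message is illustrative.
class Message implements Serializable {
    byte[] payload;
    Message(int size) { payload = new byte[size]; }
}

public class ObjectPingPong {
    // Client side; the server performs the mirror-image
    // readObject()/writeObject() loop on its end of the socket.
    static void client(Socket s, int size, int reps) throws Exception {
        // The ObjectOutputStream constructor writes a stream header, and
        // the peer's ObjectInputStream constructor blocks until it has read
        // one: this handshake is part of the connection set-up cost
        // discussed in the latency results of section 3.
        ObjectOutputStream out = new ObjectOutputStream(s.getOutputStream());
        out.flush();  // push the header so the peer's constructor returns
        ObjectInputStream in = new ObjectInputStream(s.getInputStream());

        long start = System.currentTimeMillis();
        for (int i = 0; i < reps; i++) {
            out.writeObject(new Message(size));
            // reset() clears the stream's cache of previously written
            // objects; without it the stream retains a reference to every
            // message sent, and a re-sent object would travel as a small
            // back-reference handle instead of being marshalled in full.
            out.reset();
            out.flush();
            in.readObject();  // wait for the application-level reply
        }
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("avg round trip: "
                + (elapsed * 1000.0 / reps) + " us");
    }
}
```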

Figure 2a. Experimental results in distributed-objects environment with the C socket layer: bandwidth (Mb/s) vs. message size (bytes) for UltraI, UltraII, Linux and Windows.

Figure 2b. Experimental results in distributed-objects environment with the Java streams layer: bandwidth (Mb/s) vs. message size (bytes) for UltraI, UltraII, Linux and Windows.

3. Experimental Results

The results of our latency measurements are shown in Table 2. They show that while the difference between the latency measured at the C socket layer and at the Java streams layer is always very small (less than 15%, and even slightly in Java's favor on the Windows 98 system), the latency at the object serialization layer can be up to three times larger than at the C socket layer. This is due to the large amount of data exchange and data-structure set-up required to establish a connection at the object serialization layer.

Figures 1a and 1b show the bandwidth achieved by the various layers for different message sizes on the different architectures of our testbed, for message exchange with application-level replies. Our experiments show that in this environment too the Java Data{Input,Output}Stream layer introduces a very limited overhead; in fact, connections through this layer deliver from 91% to 96% of the bandwidth attainable with pure C programs. The results obtained for the object serialization layer show a behavior analogous to that outlined above: this layer introduces a noticeable overhead, but the overhead is easily absorbed by the more powerful architectures. It is interesting to note that the request-response mechanism enforced in our experiments prevents the TCP protocol from filling the byte pipe by means of its sliding-window mechanism; indeed, none of the architectures in our testbed showed any kind of bandwidth saturation in this environment. This effect is caused by the synchronous ping-pong communication pattern of this mode of message exchange: the sender has to wait for a reply to a message before sending the next one, forcing the communication pipe to empty at each step.

The latency measurements in Table 2 also show, unexpectedly, that the latency experienced depends on the power of the CPU to an extent that does not change across the communication layers: the performance ratio between the different systems is almost constant (variations of less than 10%) in all three layers. Moreover, our results show that, in terms of latency, the Linux system is able to take full advantage of the CPU power and consistently outperforms the other systems; on the contrary, although running on the same CPU, the Windows 98 system delivers extremely poor performance.

The comparison between the different architectures composing our testbed is shown in figures 2a to 2c. The synchronous nature of this kind of communication forces the communication pipe to empty at each message exchange, so the performance obtained is very sensitive to the message size. The behavior of the different platforms in the distributed-object environment is very regular, and our experiments show that the more powerful systems consistently deliver a higher level of performance. However, it is important to note that the systems running the Solaris operating system are more efficient in performing large memory copies and thus manage large messages more efficiently. This characteristic is evident in the results obtained at the C socket and Java streams layers.

Figure 2c. Experimental results in distributed-objects environment with the object streams layer: bandwidth (Mb/s) vs. message size (bytes) for UltraI, UltraII, Linux and Win98.

4. Summary

In this paper we have presented results and early findings of a systematic performance analysis conducted to characterize the overhead introduced by the different reliable communication services available in the Java2™ platform, namely the Java stream layer and the object serialization layer, over the C socket layer. Our experiments show that the Java stream layer introduces very small overheads and, on some systems, delivers up to 95% of the performance of the C socket layer. This overhead, however, depends to a large extent on the JVM implementation available on the specific system, both in its magnitude and in the kinds of operations that are performed inefficiently.

Our experiments also show that object serialization is a heavyweight process that introduces noticeable overheads over the C socket communication layer. The measurements indicate, however, that systems capable of delivering a large amount of computational power, e.g. Pentium II 450 MHz based systems, can absorb or mask this overhead, incurring only limited performance degradation and sacrificing less than 20% of the bandwidth attainable with the C socket layer.

It is our opinion that, although Java object serialization technology is today neither mature enough nor well suited to support the needs of communication-bound tasks, its importance lies in its capability to support complex distributed programming concepts and techniques such as code mobility and object migration. Moreover, as the computing power available on single systems increases, the granularity of the tasks resident on a single system is bound to grow too. This can reduce the communication performance requirements of applications; thus the set of tasks and applications that can sacrifice some communication performance to exploit the sophisticated capabilities delivered by the object serialization layer is bound to grow.

5. References

[1] Java Grande Forum, http://www.javagrande.org
[2] SPEC web site, http://www.specbench.org/
[3] UC Berkeley CPU Info Center, http://infopad.eecs.berkeley.edu/CIC/summary/local/
[4] M. Clement, Q. Snell, G. Judd, High Performance Computing for the Masses, Proc. of the Workshop on Java for Parallel and Distributed Computing of IPPS/SPDP99, pp. 781-796, San Juan, Puerto Rico, April 12-16, 1999.
[5] P. Gray and V. Sunderam, Native Language Based Distributed Computing Across Network and Filesystem Boundaries, Concurrency: Practice and Experience, vol. 10, no. 1, 1998.
[6] M. Migliardi, V. Sunderam, Heterogeneous Distributed Virtual Machines in the Harness Metacomputing Framework, Proc. of the Heterogeneous Computing Workshop of IPPS/SPDP99, pp. 60-72, San Juan, Puerto Rico, April 12-16, 1999.
[7] B. O. Christiansen, P. Cappello, M. F. Ionescu, M. O. Neary, K. E. Schauser, D. Wu, Javelin: Internet-Based Parallel Computing Using Java, Proc. of the ACM97 Workshop on Java for Science and Engineering Computation, June 1997.
[8] L. F. G. Sarmenta, S. Hirano, S. A. Ward, Towards Bayanihan: Building an Extensible Framework for Volunteer Computing Using Java, Concurrency: Practice and Experience, vol. 10(11-13), pp. 1015-1019, 1998.
[9] NinjaRMI web page, http://www.cs.berkeley.edu/~mdw/proj/ninja/ninjarmi.html
[10] J. Maassen, R. van Nieuwpoort, R. Veldema, H. E. Bal, A. Plaat, An Efficient Implementation of Java's Remote Method Invocation, Proc. of PPoPP'99, Atlanta, GA, May 1999.
[11] F. Breg, D. Gannon, A Customizable Implementation of RMI for High Performance Computing, Proc. of the Workshop on Java for Parallel and Distributed Computing of IPPS/SPDP99, pp. 733-747, San Juan, Puerto Rico, April 12-16, 1999.