Networking Performance for Metacomputing in Java

Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Systems, November 3-6, 1999, MIT, Boston, USA

Mauro Migliardi and Vaidy Sunderam
Dept. of Math & Computer Science, Emory University

Abstract. The Java programming language and system have been receiving increasing attention for High Performance Computing, particularly on cluster and metacomputing platforms. In network-based concurrent computing systems, communication performance is always a crucial factor and a bottleneck even in conventional environments, and it is therefore likely to be critical in Java-based systems. In order to obtain a baseline measure of message-passing performance in Java in the context of distributed HPC systems, we have undertaken a benchmarking exercise. Our experiments lead to some interesting results, such as the discovery that, surprisingly, the overhead introduced by Java over C socket programming is either very small or, for some implementations, nonexistent. Our results and findings are presented in this paper.

1. Introduction
The Java programming language and system have been receiving increasing attention for high performance computing, though of course to a lesser extent than for internet/multimedia applications. The prominence of Java in the latter class of applications is not surprising, given the explosive growth in economic and technical drivers in the web domain. However, the use of Java in HPC is less natural, at least on the surface, for two reasons: one, it is basically an interpreted language, almost by definition subject to poor performance; and two, it deters conventional methods of hand-optimization, including low-level memory manipulation and the exploitation of machine architecture idiosyncrasies. Yet Java is being increasingly adopted for HPC projects, as evidenced by the growing effort in the area [1]. Several explanations may be postulated for this, including: (1) those based on momentum and trends, availability of expertise, software libraries, tools, and Java extensions to support HPC; and (2) others, such as the existence of JIT compilers, highly efficient JVM implementations, and even ongoing increases in CPU speeds that offset software-induced slowdowns. Nevertheless, the effectiveness of Java in HPC depends at some level on the performance it delivers, both in absolute terms and relative to the alternatives, i.e. traditionally compiled language systems and low-level access to system facilities. In particular, in concurrent computing systems based on message exchange, communication performance is always a crucial factor and a bottleneck even in conventional environments, and it is therefore likely to be critical in Java-based systems.

Communications performance has always been a critical issue in message-passing concurrent systems. Recently, however, some multiprocessors have incorporated interconnection networks and switches that achieve communication speeds of the same order of magnitude as computing speeds. Cluster and metacomputing systems, on the other hand, are based on network technologies that are typically one to two orders of magnitude slower than CPU speeds. In these environments, the overheads added by the software layers are all the more crucial to the end-to-end communications performance that ultimately determines the overall efficiency of message-passing applications. In Java-based systems, this software overhead comprises not only the usual transport layer functions, but also object stream processing and JVM interpretation. Since the hardware interface layer, i.e. the JVM, plays a more direct role, different JVM implementations may also influence communication performance. As a result, the number of parameters affecting communications performance in Java message-passing HPC systems is considerably larger than in traditional C or Fortran-based systems, making benchmark results specific to very particular environments. Nevertheless, we believe that some baseline measurements of Java message-passing performance will help quantify the degree to which its unique characteristics are likely to impact metacomputing and network-based high performance applications. With this in mind, we have conducted a basic set of benchmarking experiments that measure message-passing performance in Java in a few representative hardware and software environments.

In the following sections we describe the experimental testbed and benchmarking methodology, and present the empirical results obtained from our exercises. The paper is structured as follows: in section 2 we discuss related efforts and present a detailed description of our testbed, experimental methodology and benchmark programs; in section 3 we present a set of experimental results that focus on the different communication layers; in section 4 we compare our experimental results across platforms; finally, in section 5 we discuss our results and provide some concluding remarks.

2. Background and testbed
Recently, several projects, such as DOGMA [4], IceT [5], Harness [6], Javelin [7] and Bayanihan [8], have adopted Java technology to tackle the problem of building a portable distributed virtual machine. Other projects, such as NinjaRMI [9], Albatross [10] and NexusRMI [11], have enhanced the performance of the Java Remote Method Invocation (RMI) mechanism with their own non-standard implementations of that facility. Some of these projects report having achieved high levels of message-passing performance by adopting reliable communication services in Java, and others have modified the object serialization mechanism on which RMI is built in order to boost performance. However, to our knowledge, a systematic study of the performance impact of the different software layers involved in Java-based reliable communication has not yet been published. This kind of study, i.e. an analytical measurement of the overhead introduced by the JVM and the object serialization mechanism over the C stream socket interface, is the main contribution of this paper.

We performed our experiments on four systems with different hardware and software configurations; Table 1 details the characteristics of the systems included in our testbed. From the JVM standpoint, we tested the two reference implementations of the Java2™ architecture provided by Sun, as well as the pre-release port to the Linux OS by the Blackdown Java-Linux Porting Team. Given the growing interest that the Linux OS is attracting, we decided to include it in our testbed although the only available implementation of Java2™ is at a pre-release stage. 100Mb/s switched ethernet is an increasingly common medium-end networking standard, and we therefore adopted it as our baseline communication fabric for distributed computing.

We measured the latency and bandwidth delivered by reliable streams through one, two and three additional software layers, namely the socket interface in C, the java.net package, and the object serialization mechanism provided by the Object{Input, Output}Stream classes. We simulated a message-passing-like programming environment by transmitting all messages asynchronously (without application-level replies), as is the norm in message-passing systems without rendezvous semantics. We measured the bandwidth attainable at the interface of each layer for message sizes ranging from one byte to 64k bytes, in exponential increments. Latency figures were obtained by measuring the time required to execute a set of ping-pong transmissions carrying the smallest possible payload. We performed our experiments on hosts with no user-level load on either the CPU or the network connection: only the boot-time system tasks and our test programs were active during the experiments. To measure the time required to perform our tests, we used the currentTimeMillis method of the java.lang.System class in Java programs and the gettimeofday system call in C programs. We measured the duration and granularity of both calls, and we tuned the number of times each experiment was repeated inside a single time measurement so as to reduce the error introduced by the timing call to less than 1%. All JVM flags were set to their default values. We repeated all measurements at least a hundred times and report the minimum measured time; this choice reflects our intent to portray the upper bound of achievable performance and to minimize the impact of artifacts on our measurements. The Java stream experiments do not perform any kind of type marshalling; they transmit only byte arrays. To evaluate the overhead of type marshalling, we adopted the full object-oriented mechanism provided by object serialization.
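To make the measurement procedure concrete, the listing below sketches the shape of the Java stream benchmark pair: an echo/sink peer and a measuring peer that times ping-pong round trips for latency and asynchronous byte-array sends for bandwidth. This is a minimal reconstruction for illustration, not the benchmark code actually used in our experiments; the port number and the repetition count are placeholder values, and in practice the repetition count must be tuned against the granularity of currentTimeMillis as described above.

    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.net.ServerSocket;
    import java.net.Socket;

    // Minimal sketch of the Java stream benchmark pair (illustrative only).
    // Run with no arguments to act as the echo/sink peer, or with the peer's
    // host name as the only argument to act as the measuring peer.
    public class StreamBench {
        static final int PORT = 5000;   // placeholder port
        static final int REPS = 1000;   // placeholder; tune against timer granularity

        public static void main(String[] args) throws Exception {
            if (args.length == 0) sink(); else measure(args[0]);
        }

        // Echo phase: bounce one-byte pings back to the sender; sink phase:
        // drain the one-way bandwidth stream so the sender never blocks.
        static void sink() throws Exception {
            ServerSocket ss = new ServerSocket(PORT);
            Socket s = ss.accept();
            DataInputStream in = new DataInputStream(s.getInputStream());
            DataOutputStream out = new DataOutputStream(s.getOutputStream());
            for (int i = 0; i < REPS; i++) {
                out.writeByte(in.readByte());
                out.flush();
            }
            byte[] buf = new byte[65536];
            for (int size = 1; size <= 65536; size *= 2)
                for (int i = 0; i < REPS; i++)
                    in.readFully(buf, 0, size);
            s.close();
            ss.close();
        }

        static void measure(String host) throws Exception {
            Socket s = new Socket(host, PORT);
            DataInputStream in = new DataInputStream(s.getInputStream());
            DataOutputStream out = new DataOutputStream(s.getOutputStream());

            // Latency: time REPS ping-pong exchanges of the smallest payload.
            long t0 = System.currentTimeMillis();
            for (int i = 0; i < REPS; i++) {
                out.writeByte(0);
                out.flush();
                in.readByte();
            }
            long rtt = System.currentTimeMillis() - t0;
            System.out.println("one-way latency (us): " + rtt * 1000.0 / (2.0 * REPS));

            // Bandwidth: stream byte arrays asynchronously, i.e. without
            // application-level replies, for sizes from 1 byte to 64 KB.
            for (int size = 1; size <= 65536; size *= 2) {
                byte[] msg = new byte[size];
                long b0 = System.currentTimeMillis();
                for (int i = 0; i < REPS; i++)
                    out.write(msg);
                out.flush();
                double secs = (System.currentTimeMillis() - b0) / 1000.0;
                System.out.println(size + " bytes: " + (8.0 * size * REPS / 1e6) / secs + " Mbit/s");
            }
            s.close();
        }
    }

The same pair, rewritten against the C socket interface with gettimeofday, provides the C watermark, and replacing the Data streams with object streams yields the serialization variant.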

Table 1. Main features of the systems composing our testbed.

System: Dilbert
  Model: Sun Server Enterprise 3000; CPU: UltraSparc II, 250 MHz
  OS: SunOS Release 5.7 Version Generic [UNIX(R) System V Release 4.0]
  Physical memory: 512 MB; SPECint95 [2][3]: 10.4
  JVM: Solaris VM (build Solaris_JDK_1.2_01_dev05_fcsK, native threads, sunwjit)

System: Labss1{a-h}
  Model: Ultra 1 Model 170; CPU: UltraSparc I, 166 MHz
  OS: SunOS Release 5.7 Version Generic [UNIX(R) System V Release 4.0]
  Physical memory: 128 MB; SPECint95 [2][3]: 6.6
  JVM: Solaris VM (build Solaris_JDK_1.2_01_dev05_fcsK, native threads, sunwjit)

System: Labss3{a, h}
  Model: Ultra 5/10; CPU: UltraSparc IIi, 300 MHz
  OS: SunOS Release 5.7 Version Generic_10654101 [UNIX(R) System V Release 4.0]
  Physical memory: 256 MB; SPECint95 [2][3]: 12.1
  JVM: Solaris VM (build Solaris_JDK_1.2_01_dev05_fcsK, native threads, sunwjit)

System: Wharness{1,2}
  Model: Dell Dimension XPS R450; CPU: Intel Pentium II, 450 MHz
  OS: Windows 98
  Physical memory: 128 MB; SPECint95 [2][3]: 18.5 (*)
  JVM: java version "1.2", Classic VM (build JDK-1.2-V, native threads)

System: Lharness{1,2}
  Model: Dell Dimension XPS R450; CPU: Intel Pentium II, 450 MHz
  OS: Red Hat Linux release 5.2 (Apollo), kernel 2.0.36 on an i686
  Physical memory: 128 MB; SPECint95 [2][3]: 18.5 (*)
  JVM: Classic VM (build Linux_JDK_1.2_pre-release-v1, native threads, sunwjit)

Memory bandwidth data will be available in the full paper.
(*) This value was measured on the NT 4.0 OS.

Table 2. Ping-pong latencies for the different systems at the different software layers.

System       C socket layer   Java streams layer   Object stream layer
UltraI       185 us           193 us               574 us
UltraII      127 us           149 us               424 us
Windows98    175 us           165 us               575 us
Linux         58 us            72 us               240 us

3. Cross-layer comparisons
The experimental results of our latency measurements are shown in Table 2. They show that, while the difference between the latency encountered at the C socket layer and at the Java stream layer is always very small (less than 15%, and actually negative on the Windows98 system), the latency at the object serialization layer can be up to three times larger than that at the C socket layer. This is due to the large amount of data exchange and data structure set-up required to establish a connection at the object serialization layer.

In the rest of this section we compare the performance of message passing at the different layers during communication between homogeneous machines. Table 3 and Figure 1 show the bandwidth achieved by the various layers at different message sizes on the different architectures of our testbed. Our experiments show that Java streams introduce a very limited overhead over the C implementation when no data type marshalling is performed (i.e. when byte arrays are used for data transmission). This is clearly shown by the benchmark results: a stream connected through the Data{Input, Output}Stream classes delivers from 85% to 95% of the top bandwidth attainable in the C programs, and even delivers better performance under Windows98. The Windows98 system suffers from very poor performance with C sockets for a message size of 32 kilobytes; further investigation of this phenomenon is outside the scope of this paper, since we adopt the C socket programs only as a performance watermark.

Java-based connections do not show the performance reduction associated with message sizes exceeding the Ethernet MTU. Moreover, with the notable exception of the Win98 system, they show a bandwidth saturation pattern that is very close to, although slightly slower than, the one shown by C-based connections. In fact, while on C-based connections the best performance is achieved for messages with size close to the Ethernet MTU, Java stream based connections deliver their best performance for messages of size equal to or greater than eight kilobytes. This suggests that Java stream connections introduce an additional level of buffering, which masks the overhead of sending messages that do not fit in a single Ethernet packet. The memory-copy overhead introduced by this buffering is nonetheless extremely low: the resulting performance reduction is less than 15%.

The results we obtained for channels delivering the object serialization service are quite different and clearly show that object serialization is a heavy-weight process. However, while the object serialization overhead is extremely evident on the UltraI, where it lowers the attainable bandwidth to less than one half of the C program bandwidth, more powerful architectures, such as the PentiumII systems, overcome it almost completely: in the latter cases it is possible to achieve from 73% to 89% of the bandwidth attainable with the native C programs. This shows that, although object serialization is not well suited to communication-bound applications, the performance toll it imposes is a reasonable trade-off for tasks that need its powerful object-oriented semantics.
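To make the layering of the serialization experiments concrete, the fragment below sketches a hypothetical sender for the object stream layer; it is not the original benchmark code. The peer host is taken from the command line, the port, payload size and message count are placeholder values, the explicit BufferedOutputStream is our illustrative choice, and a matching ObjectInputStream reader is assumed on the other side.

    import java.io.BufferedOutputStream;
    import java.io.ObjectOutputStream;
    import java.net.Socket;

    // Illustrative sender for the object stream layer (not the original code).
    // Each message crosses three layers: object serialization, user-space
    // buffering, and the underlying TCP socket.
    public class ObjectStreamSender {
        public static void main(String[] args) throws Exception {
            Socket s = new Socket(args[0], 5000);        // placeholder port
            ObjectOutputStream out = new ObjectOutputStream(
                    new BufferedOutputStream(s.getOutputStream()));
            out.flush();   // push the stream header so the peer's
                           // ObjectInputStream constructor can complete
            byte[] msg = new byte[8192];                 // example payload size
            for (int i = 0; i < 1000; i++) {             // placeholder count
                out.writeObject(msg);  // full marshalling, unlike write(byte[])
                out.reset();           // clear the handle table so each message is
                                       // re-serialized rather than back-referenced
            }
            out.flush();
            s.close();
        }
    }

Without the reset() call, the stream's object cache would turn every send after the first into a few-byte back reference, and the measurement would no longer exercise the marshalling path.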

4. Cross-platform comparisons
The latency results in Table 2 also show, unexpectedly, that the latency experienced depends on the power of the CPU to an extent that does not change across the communication layers: the performance ratio between the different systems is almost constant (less than 10% variation) in all three layers. Moreover, in terms of latency the Linux system is able to take full advantage of the CPU power and consistently outperforms the other systems; on the contrary, although running on the same CPU, the Windows98 system delivers extremely poor performance. The comparison between the architectures composing our testbed shows that the highest level of performance is not always achieved by the system with the most powerful CPU: the PentiumII-based systems deliver the highest level of performance only at the C socket layer and at the layer providing the object serialization service.

Table 3 and Figure 1 show the results of our bandwidth experiments for each layer on all the systems in our testbed. The results obtained at the C socket layer allow some interesting observations. At this layer the performance gap between the slowest and the fastest system is very small, less than 9% of the attainable bandwidth. For this reason we can say that the performance level attainable at this layer has very little dependency on CPU power, and that the dependency of performance on CPU power is primarily observed at the upper layers.

The Java streams layer experiments show a pattern that differs from the one observed at the other two layers: the highest level of performance is achieved by the UltraII systems. This likely indicates that the Java stream implementation on the Solaris operating system is more efficient than the ones available on Windows98 and Linux (although it is important to remember that the version of the Java2™ platform available on Linux at this time is only a pre-release). Another interesting fact is that the Windows98 and Linux JVMs incur two different kinds of inefficiency. The Windows98 JVM delivers a level of performance very close to the one attainable on the UltraII systems for messages sized up to five times the Ethernet MTU, and then swiftly incurs a significant performance degradation. The performance delivered at this layer by the Linux JVM, on the other hand, rises slowly and shows a limited bandwidth saturation pattern for messages sized over sixteen kilobytes. These observations suggest that the Windows98 JVM executes calls to the socket layer efficiently but manages large memory buffers inefficiently, while the Linux JVM is unable to service short and frequent system calls efficiently but copes better with large memory buffers.

The cross-platform analysis of our experiments with the object serialization layer confirms that object serialization is a heavy-weight process, and hence that the attainable bandwidth grows as the available CPU power grows. It is interesting to note, however, that although the highest level of performance is achieved on a PentiumII system, there is a noticeable difference between the behavior of the Windows98 JVM and that of the Linux JVM: while for short messages the best performance is achieved with the Windows98 JVM, as soon as the message size exceeds the 512-byte watermark the Linux JVM becomes the most efficient. This again indicates that the Linux JVM is inefficient in servicing very frequent, very short I/O system calls.

Table 3. Bandwidth achieved by the different layers on the different architectures. Write size is in bytes; bandwidth is in Mbit/s. Entries marked "--" could not be measured because the program ran out of memory.

Size     UI C     UI Jstreams  UI Ostreams  UII C    UII Jstreams  UII Ostreams
1        0.10     0.06         0.02         0.13     0.05          0.02
2        0.20     0.12         0.03         0.26     0.05          0.04
4        0.42     0.25         0.05         0.51     0.10          0.08
8        0.84     0.58         0.10         1.02     0.20          0.15
16       1.68     1.07         0.21         2.03     0.44          0.30
32       3.15     1.97         0.42         4.22     1.51          0.61
64       5.55     6.83         1.42         7.00     12.80         1.90
128      12.46    15.06        2.83         17.09    25.60         3.90
256      30.55    32.00        5.41         39.43    41.80         7.59
512      50.79    42.01        9.64         59.05    56.11         13.11
1024     70.61    56.37        13.98        83.19    68.84         20.10
2048     79.25    75.50        22.41        89.27    86.23         30.06
4096     60.07    76.65        29.00        67.46    87.61         38.78
8192     58.48    77.04        34.49        64.63    88.44         46.48
16384    65.85    77.10        35.05        72.92    88.86         46.48
32768    74.61    77.47        35.19        82.92    88.47         44.73
65536    80.96    77.76        --           88.89    88.75         44.54

Size     Win C    Win Jstreams  Win Ostreams  Linux C  Linux Jstreams  Linux Ostreams
1        0.11     0.02          0.16          0.01     0.01            0.03
2        0.20     0.04          0.32          0.02     0.02            0.06
4        0.40     0.08          0.64          0.04     0.04            0.12
8        0.80     0.16          1.28          0.09     0.09            0.24
16       1.61     0.31          2.56          0.18     0.18            0.47
32       3.22     0.63          5.12          0.35     0.36            0.95
64       6.44     1.28          10.24         0.70     0.71            1.90
128      12.60    4.94          20.48         2.19     2.32            4.10
256      25.13    17.84         40.96         4.96     5.94            7.59
512      47.57    83.32         81.92         20.48    15.63           15.17
1024     83.00    90.00         81.92         49.05    27.17           24.82
2048     61.73    90.67         86.23         61.02    51.12           38.10
4096     50.26    89.81         86.23         70.32    62.95           49.65
8192     69.35    88.12         85.11         75.33    68.27           56.99
16384    60.10    87.42         82.44         78.39    73.10           61.25
32768    1.42     84.67         80.91         79.64    --              --
65536    75.96    83.94         80.91         80.50    --              --

[Figure 1a. Bandwidth achieved on UltraSparc systems: log-log plot of bandwidth (0.01 to 100 Mbit/s) versus message size (1 to 100000 bytes) for the UI C, UI Jstreams, UI Ostreams, UII C, UII Jstreams and UII Ostreams series.]

[Figure 1b. Bandwidth achieved on PentiumII systems: log-log plot of bandwidth (0.01 to 100 Mbit/s) versus message size (1 to 100000 bytes) for the Win C, Win Jstreams, Win Ostreams, Linux C, Linux Jstreams and Linux Ostreams series.]



5. Summary
In this paper we have reported results and early findings of a systematic performance analysis undertaken to characterize the overhead introduced by the different reliable communication services available in the Java2™ platform, namely the Java stream layer and the object serialization layer, over the C socket layer. The results of our experiments show that the Java stream layer introduces a very limited overhead and, on some systems, is able to deliver up to 95% of the performance delivered by the C socket layer. This overhead, however, depends on the JVM implementation available on the specific system; we have shown that it depends on the JVM implementation both in magnitude and in the kind of operation that is performed inefficiently. Our experiments also show that object serialization is a heavy-weight process that introduces a noticeable overhead over the C socket communication layer. However, systems capable of delivering a large amount of computational power, namely the PentiumII 450MHz based systems, can absorb this overhead with only a limited performance degradation, sacrificing less than 20% of the bandwidth attainable at the C socket layer. It is our opinion that, as the computing power available on individual systems increases, the set of tasks and applications that can sacrifice some communication performance to exploit the sophisticated object transmission and migration capabilities delivered by this layer is bound to grow.

6. References
[1] Java Grande Forum, http://www.javagrande.org
[2] SPEC web site, http://www.specbench.org/
[3] UC Berkeley CPU Info Center, http://infopad.eecs.berkeley.edu/CIC/summary/local/
[4] M. Clement, Q. Snell, G. Judd, High Performance Computing for the Masses, Proc. of the Workshop on Java for Parallel and Distributed Computing of IPPS/SPDP 99, pp. 781-796, San Juan, Puerto Rico, April 12-16, 1999.
[5] P. Gray, V. Sunderam, Native Language Based Distributed Computing Across Network and Filesystem Boundaries, Concurrency: Practice and Experience, vol. 10, no. 1, 1998.
[6] M. Migliardi, V. Sunderam, Heterogeneous Distributed Virtual Machines in the Harness Metacomputing Framework, Proc. of the Heterogeneous Computing Workshop of IPPS/SPDP 99, pp. 60-72, San Juan, Puerto Rico, April 12-16, 1999.
[7] B. O. Christiansen, P. Cappello, M. F. Ionescu, M. O. Neary, K. E. Schauser, D. Wu, Javelin: Internet-based Parallel Computing Using Java, Proc. of the ACM 97 Workshop on Java for Science and Engineering Computation, June 1997.
[8] L. F. G. Sarmenta, S. Hirano, S. A. Ward, Towards Bayanihan: Building an Extensible Framework for Volunteer Computing Using Java, ACM 1998 Workshop on Java for High-Performance Network Computing, Palo Alto, California, Feb. 28-Mar. 1, 1998. Published in Concurrency: Practice and Experience, vol. 10(11-13), pp. 1015-1019, 1998.
[9] NinjaRMI web page, http://www.cs.berkeley.edu/~mdw/proj/ninja/ninjarmi.html
[10] J. Maassen, R. van Nieuwpoort, R. Veldema, H. E. Bal, A. Plaat, An Efficient Implementation of Java's Remote Method Invocation, Proc. of PPoPP 99, Atlanta, GA, May 1999.
[11] F. Breg, D. Gannon, A Customizable Implementation of RMI for High Performance Computing, Proc. of the Workshop on Java for Parallel and Distributed Computing of IPPS/SPDP 99, pp. 733-747, San Juan, Puerto Rico, April 12-16, 1999.