Practical Guidelines for Boosting Java Server Performance

Reinhard Klemm
Bell Laboratories, Lucent Technologies
600 Mountain Ave., Rm. 2B-420, Murray Hill, NJ 07974, U.S.A.
Email: [email protected]

Abstract

As Java technology matures, an increasing number of applications that have traditionally been the domain of languages such as C++ are implemented in Java. Many of these applications, such as Internet servers, demand high execution speed. Currently, it is mostly the programmer's responsibility to optimize Java code for speed. This paper presents several simple yet effective source code-level guidelines for accelerating Java programs. Among the performance-enhancing rules described in the paper are object reuse, avoiding Java API methods with implicit object allocations, statically creating immutable objects, thread pooling, and avoiding dynamically expanding objects. All these rules aim at reducing the frequency of object allocations and object-to-object copy operations. Using examples, the paper demonstrates up to 15-fold accelerations by using these guidelines. When applied under the appropriate circumstances explained in the paper, none of the rules leads to performance penalties.

1 Introduction

Java allows fast and convenient development of portable software. Development speed and convenience come at the expense of execution speed of some Java programs compared to equivalent programs in more traditional languages such as C++. Improvements in Java bytecode compilers, just-in-time compilers, and virtual machines, however, have significantly decreased the execution speed gap between Java and C++ programs since the inception of Java. In the future, the speed difference is likely to decrease even further. Because of these improvements in Java technology and the convenience

of programming in Java, an increasing number of applications, e.g., Internet servers, which have traditionally been the domain of languages such as C++ are implemented in Java. Currently, it is mostly the programmer's responsibility to optimize Java code for speed. This paper presents several simple yet effective source code-level guidelines for improving execution speed and decreasing memory consumption of Java programs. The focus is on some sources of Java-specific execution speed problems that can be easily avoided or alleviated by careful programming. More specifically, the presented rules aim at reducing the frequency of object allocations and object-to-object copy operations. Language-independent algorithmic complexity as a potential source of slow program execution is not considered in this paper. The guidelines in this paper complement other optimization techniques described elsewhere (see Section 2), e.g., those that are often part of an optimizing compiler. The performance-enhancing strategies outlined in this paper can also be important for programmers who design their own classes or class libraries in future versions of Java, on a future Java execution platform, or in a more traditional object-oriented language such as C++. Even in an environment that is very different from the current Java platform, or in a language different from Java, eliminating an object creation or object-to-object copy operation is likely to decrease program execution time. When applied under the appropriate circumstances explained in the paper, none of the presented rules leads to performance penalties. However, if the appropriate circumstances are not given, applying the guidelines might slow down program execution or even result in programming bugs. The clarity and maintainability of the code can suffer, and an increase in code size occurs in some cases.
Although these guidelines apply to all Java programs, they are particularly important for programs such as servers that need to exhibit short response times and/or high throughput. The decrease in memory consumption as a consequence of applying the presented rules also contributes to the longevity of a program because it reduces the risk of memory exhaustion due to a Java program bug or a faulty Java virtual machine implementation. This, too, is particularly important for servers because they usually have to operate for a long time without outages. The guidelines have been successfully applied to a Web prefetching proxy that requires very fast response times [8]. After having applied the rules to the proxy, its performance increased from providing only slightly faster Web access on the average to speeding up Web access by more than 50% on the average compared to not using the proxy. Moreover, a memory leak in the Java virtual machine affected the life span of the improved proxy dramatically less than the original proxy. The remainder of this paper is organized as follows. Section 2 contains an overview of previous work on Java performance optimizations. Section 3 states the origin of major performance problems with Java code. The central collection of performance-enhancing guidelines and corresponding examples follows in Section 4. The paper provides a conclusion in Section 5.

2 Related Work

There is a myriad of articles and books that address Java speed optimization issues. Some of these resources are listed below. However, there does not seem to be any publication that discusses the Java performance problems and solutions or the quantitative evaluations demonstrated in this paper. The Java language is described in many books. Good sources of information on Java are [1] and [5]. The latter book also contains a chapter dedicated to Java speed considerations, including relative execution speed figures for various Java language and API constructs. A good book on general object-oriented system development with information on high-level performance optimizations that are also applicable to Java is [3]. Large-scale programs are rarely implemented in a traditional programming language such as C++ and then again in Java. Therefore, it is hard to gauge the relative performance of Java and C++ for a realistic software solution. There are, however, many published studies such as [9] and [7] that use microbenchmarks to compare the performance of Java and C++. Most of these studies conclude that Java with a just-in-time compiler performs nearly as fast as C++. Microbenchmarks, however, do not give C++ compilers the opportunity to leverage all their optimization techniques. Such techniques are largely missing in current Java compilers and put C++ at a performance advantage over Java if the programmer does not manually optimize the Java code.

In [10], the reader will find performance measurements for various Java language constructs (not Java API calls) in different hardware and Java virtual machine environments. The author bases general recommendations on his performance measurements. An online collection of general recommendations for optimizing Java code for speed and size is available in [6]. It also contains a description of Java compiler optimizations and performance measurements of Java constructs and API calls. A Java performance benchmark applet is presented in [2]. The applet computes the execution speed of several Java language constructs on a given hardware, operating system, and Java virtual machine platform. A case study using Java as the implementation language for a distributed medical imaging system with stringent performance requirements is described in [7]. The article contains a list of speed optimization techniques that were manually applied to the software. The described techniques are not Java-specific and can often be found in optimizing compilers. The authors compare the Java code speed with that of equivalent compiler-optimized and hand-optimized C and C++ code. In [5], the reader will find the description of a technique that allows Java code to select a native method if available and to select a Java method otherwise, so that the resulting code remains portable. Replacing Java code in performance-critical areas with native methods is a very effective way of performance boosting. However, this paper concentrates on portable source code-level performance enhancements and thus does not discuss native methods. Object reuse is a central theme of this paper and has been mentioned in several of the references listed above. A software support for reusing certain objects is described in [4], albeit without a systematic quantification of the performance advantages.

3 Sources of Java Code Performance Problems

Java source programs are compiled into programs in Java byte code format before execution. Executing a Java program means interpretation of the byte code program by another software layer, the Java virtual machine. Since the Java virtual machine is a software program, the execution of Java code can be slower than the execution of native code on a physical machine. Moreover, not being able to control details of memory management in a Java application can create performance problems. In Java virtual machines, all objects are allocated on the heap. Heap allocation means finding space for an object on the heap, updating heap bookkeeping data, and later deallocation (garbage collection) of the object when it is no longer needed. This is generally more time-consuming than allocation on the stack, which is possible in C++ in addition to heap allocation. To illustrate this, consider the case of a stack allocation of a Java double as opposed to the heap allocation of a Java Object. On the reference platform described in Section 4, the double is created in 5ns whereas the Object creation takes 890ns and is thus more than 175 times slower. As the examples in the remainder of this paper show, a liberal use of Java constructs leads to the creation of many objects and a high frequency of time-consuming object-to-object copy operations. The latter are typically copies of large parts of the state of objects into other objects, either by performing an exact copy or by transforming the data type of object states. An example of an exact copy is the ByteArrayOutputStream.toByteArray() method, which returns a copy of the internal byte[] buffer in the ByteArrayOutputStream object. An example of a copy operation that changes the data type of the copied object state is the String.getBytes() method. The Java API generally prefers copies of objects to object references as method return values and when dealing with method parameters. An increase in the number of allocated objects also leads to an increase in the work that the garbage collector has to accomplish. Since a Java application has little control over the times when garbage collection occurs, this can have additional negative repercussions for the performance of Java code. Many Java API objects contain arrays acting as internal buffers. Arrays are objects in Java and therefore are allocated on the heap as well. Moreover, each array element is initialized upon array creation, which means that the time for creating an array grows with the array size. For example, initializing a ByteArrayOutputStream with varying internal buffer sizes takes about 164 microseconds per kilobyte of buffer space on the reference platform described in Section 4.
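The mandatory initialization mentioned above can be observed directly: every element of a newly allocated Java array holds its type's default value, which is why array creation time grows with array length. A small illustrative sketch (not from the paper):

```java
public class ArrayInitDemo {
    // Java zero-fills every element of a newly allocated array,
    // so allocation cost is proportional to the array length.
    public static boolean allZero(byte[] a) {
        for (byte b : a) {
            if (b != 0) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] buffer = new byte[5000]; // heap allocation + zero-fill
        System.out.println(allZero(buffer)); // prints "true"
    }
}
```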
Since the advent of more sophisticated Java just-in-time compilers and virtual machines, much of the speed gap between C++ and Java programs can now be attributed to the memory management in Java virtual machines and to certain methods in the Java API. This affects mostly programs that create a large and quickly changing pool of objects. In these cases, and if a Java program is CPU-bound, a slower execution compared to functionally equivalent C++ code can often be observed. In addition, current Java just-in-time compilers avoid time-consuming code optimization techniques that are part of most C++ compilers. Java also has many portable safety and security features that slow down Java execution compared to C++ without these features. Performance optimizations of CPU-bound Java programs will help narrow the performance gap between Java and C++ programs but probably not eliminate it any time soon. Performance-optimized I/O-bound Java programs, however, should execute nearly as fast or as fast as comparable C++ programs.

4 Reducing Object Allocation and Copy Operation Frequencies

For the reasons outlined in the previous section, the number and frequency of object creations and object-to-object copy operations throughout the execution of a Java program should be decreased as much as possible if execution speed is important. This paper discusses

- object reuse,
- avoiding API methods with implicit object allocations,
- statically creating immutable objects,
- avoiding dynamically expanding objects,
- adequate object initialization, and
- object pooling and, in particular, thread pooling

as possible techniques to attain this goal. Below are explanations and some examples for these guidelines. The examples show code fragments that would typically appear in an Internet server. Many of the examples are illustrated with performance comparisons between different versions of classes (unoptimized/less optimized versus optimized/more optimized class versions). Objects of these classes were created in test programs on a 400 MHz Pentium-II computer running Microsoft Windows NT 4.0 and with 128 MB of main memory. The Java JDK used was JavaSoft's JDK 1.1.7 with a just-in-time compiler. All Java API implementation details in this paper are based on this JDK. In the remainder of the paper, the term reference platform will refer to this hardware, operating system, and Java JDK combination. To ensure that the experimental results are not due to reference platform idiosyncrasies, all experiments were repeated with the same JDK on a low-end 133 MHz Pentium laptop with 32 MB of memory and Microsoft Windows 95. Although the measured performance was very different on the laptop, the relative performance of the different experimental class versions was very similar to that on the reference platform.

4.1 Object Reuse

Object reuse refers to a life cycle of an object that does not end with the first use of the object. Instead, the object's internal state is cleaned up after each use in such a way that the object can be used again as if it had never been used before. Cleaning up an object's internal state is often much less time-consuming than discarding the object and then reallocating another object of the same class, including time for garbage collection. Consider the code fragment of class Fragment1_1 in Figure 1. In each call to method readIntoBuffer, an allocation of the buffer array takes place on the heap and the entire buffer is initialized with 0. This can be time-consuming.

    class Fragment1_1 {
        public byte[] readIntoBuffer(InputStream in) throws IOException {
            byte[] buffer = new byte[5000];
            ByteArrayOutputStream byteStream = new ByteArrayOutputStream();
            int bytesRead;
            do {
                bytesRead = in.read(buffer);
                if (bytesRead > 0)
                    byteStream.write(buffer, 0, bytesRead);
            } while (bytesRead > 0);
            return byteStream.toByteArray();
        }
        ...
    }

Figure 1: Class Fragment1_1

From a performance point of view, a much better alternative to Fragment1_1 is class Fragment1_2 shown in Figure 2, provided only one thread calls readIntoBuffer simultaneously. In Fragment1_2, a buffer is allocated only once, when a Fragment1_2 object is defined. The buffer is reused during each call to readIntoBuffer. With Fragment1_1, many copies of the buffer may exist between subsequent garbage collections, thus increasing the memory requirements of the program and possibly the time needed to perform the garbage collection.

    class Fragment1_2 {
        private byte[] buffer = new byte[5000];

        public byte[] readIntoBuffer(InputStream in) throws IOException {
            ByteArrayOutputStream byteStream = new ByteArrayOutputStream();
            int bytesRead;
            do {
                bytesRead = in.read(buffer);
                if (bytesRead > 0)
                    byteStream.write(buffer, 0, bytesRead);
            } while (bytesRead > 0);
            return byteStream.toByteArray();
        }
        ...
    }

Figure 2: Class Fragment1_2

If a thread-safe version of Fragment1_2 is needed (a thread-safe version can be used by several threads simultaneously and produces the same result as if all threads had used it sequentially in some order), each thread using Fragment1_2 can provide a class member variable private byte[] buffer = new byte[5000] and pass buffer to readIntoBuffer in a modified class Fragment1_2 as shown in Figure 3. The only change affecting execution speed in the thread-safe version compared to the thread-unsafe version is the presence of a second parameter in readIntoBuffer and the definition of buffer in each thread instead of in objects of class Fragment1_2. If the threads are created more often than Fragment1_2 objects, the thread-safe version will be slower than the thread-unsafe version. Otherwise, it will actually be faster (Section 4.6 shows how the thread creation frequency can be lowered in many cases).

    class Fragment1_2 {
        public byte[] readIntoBuffer(InputStream in, byte[] buffer)
                throws IOException {
            ByteArrayOutputStream byteStream = new ByteArrayOutputStream();
            int bytesRead;
            do {
                bytesRead = in.read(buffer);
                if (bytesRead > 0)
                    byteStream.write(buffer, 0, bytesRead);
            } while (bytesRead > 0);
            return byteStream.toByteArray();
        }
        ...
    }

Figure 3: A thread-safe version of class Fragment1_2

In class Fragment1_2, there is still an expensive object allocation. Each time readIntoBuffer is called, a ByteArrayOutputStream is created. This implies the allocation of an internal byte[] array in byteStream that can dynamically grow in length during subsequent byteStream.write calls (see Section 4.4 for an analysis of the time and memory cost of dynamic growth). There is a way to avoid the repeated creation of this ByteArrayOutputStream. Many Java API classes such as ByteArrayOutputStream provide a reset() method that allows the reuse of the object, and specifically of the internal buffer as it was immediately before the reset() call. Figure 4 shows how class Fragment1_2 can be changed to class Fragment1_3 accordingly.

    class Fragment1_3 {
        private byte[] buffer = new byte[5000];
        private ByteArrayOutputStream byteStream = new ByteArrayOutputStream();

        public byte[] readIntoBuffer(InputStream in) throws IOException {
            int bytesRead;
            byteStream.reset();
            do {
                bytesRead = in.read(buffer);
                if (bytesRead > 0)
                    byteStream.write(buffer, 0, bytesRead);
            } while (bytesRead > 0);
            return byteStream.toByteArray();
        }
        ...
    }

Figure 4: Class Fragment1_3

The performance of all Fragment1_x classes is compared in Section 4.2. Some classes do not provide a reset() or similar method allowing object reuse. Examples of popular non-reusable objects with potentially large internal data structures are StringTokenizers, StringBuffers, and Strings. In these cases, the programmer may want to invest the time and effort to design a similar class version from the ground up that allows the creation and later reuse of objects. The means for object reuse are methods for resetting the internal state of an object to a predefined value and then enabling subsequent object state changes.
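As a minimal sketch of such a ground-up design (the class and method names below are illustrative, not from the paper), a reusable buffer class exposes a reset() method so one instance can be recycled across many uses without reallocating its backing array:

```java
// Illustrative sketch of a reusable class: reset() clears the logical
// content while keeping the backing array, so the object can serve
// many uses with at most occasional growth allocations.
public class ReusableBuffer {
    private byte[] buf;
    private int count;

    public ReusableBuffer(int capacity) {
        buf = new byte[capacity];
    }

    public void write(byte[] src, int off, int len) {
        if (count + len > buf.length) { // grow only when needed
            byte[] bigger = new byte[Math.max(buf.length * 2, count + len)];
            System.arraycopy(buf, 0, bigger, 0, count);
            buf = bigger;
        }
        System.arraycopy(src, off, buf, count, len);
        count += len;
    }

    public int size() { return count; }

    // Reset the internal state to its predefined initial value,
    // enabling reuse without discarding the internal buffer.
    public void reset() { count = 0; }

    public byte[] getByteArray() { return buf; } // no defensive copy
}
```

After reset(), subsequent writes overwrite the old content in place, mirroring the behavior of ByteArrayOutputStream.reset() described above.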

4.2 Avoiding Java API Methods with Implicit Object Allocations

Implicit object allocation refers to the creation of an object within an API method and without explicit allocation by the calling method. For example, ByteArrayOutputStream.toByteArray() returns a newly created copy of the internal byte[] buffer in the ByteArrayOutputStream object. Class Fragment1_3 contains implicit object allocations that may slow down the execution. Suppose the contents of the byte[] array returned by readIntoBuffer were not used after the next invocation of readIntoBuffer, and there is only one thread calling readIntoBuffer simultaneously. In these cases, the byteStream.toByteArray() method call is wasteful because it will allocate a new byte[] array and then copy the contents of the ByteArrayOutputStream's internal buffer into the newly allocated byte[] array. The array allocation and the copy operation can be avoided by using a simple modification of the ByteArrayOutputStream class shown in Figure 5.

    class MyByteArrayOutputStream extends ByteArrayOutputStream {
        public MyByteArrayOutputStream() {
            super();
        }
        public MyByteArrayOutputStream(int capacity) {
            super(capacity);
        }
        public byte[] getByteArray() {
            return buf;
        }
    }

Figure 5: Subclassing ByteArrayOutputStream to gain access to its internal buffer

The main difference between the new class MyByteArrayOutputStream shown in Figure 5 and ByteArrayOutputStream is the public accessibility of the internal buffer buf in the new class via getByteArray. The buffer is a protected variable in ByteArrayOutputStream, and thus every subclass has the right to give class users access to buf. By taking advantage of MyByteArrayOutputStream's features, class Fragment1_3 can be rewritten as Fragment1_4 listed in Figure 6.

    class Fragment1_4 {
        private byte[] buffer = new byte[5000];
        private MyByteArrayOutputStream byteStream = new MyByteArrayOutputStream();

        public MyByteArrayOutputStream readIntoBuffer(InputStream in)
                throws IOException {
            int bytesRead;
            byteStream.reset();
            do {
                bytesRead = in.read(buffer);
                if (bytesRead > 0)
                    byteStream.write(buffer, 0, bytesRead);
            } while (bytesRead > 0);
            return byteStream;
        }
        ...
    }

Figure 6: Class Fragment1_4

Figure 7 shows an example of how class Fragment1_4 can be used.

    Fragment1_4 stream = new Fragment1_4();
    MyByteArrayOutputStream temp = stream.readIntoBuffer(in); // in: an InputStream
    byte[] buffer = temp.getByteArray();
    int length = temp.size();
    for (int i = 0; i < length; i++)
        if (buffer[i] == '\n')
            System.out.println("New Line Found.");

Figure 7: An example of code using class Fragment1_4

Table 1 compares the execution times of all Fragment1_x classes per invocation of readIntoBuffer. It is clear that each technique presented above yields a significant execution time decrease. In fact, Fragment1_4 is more than 15 times faster than Fragment1_1. The test program calls readIntoBuffer to read the contents of a 5kB buffer that is wrapped in an input stream object. The numbers in the table are averaged over 10000 invocations of method readIntoBuffer on the same Fragment1_x object. The Java API contains many methods that copy an object-internal buffer into a newly created object before returning the copy to the calling method. The copy operation can be an exact copy or a conversion of the contents of the object-internal buffer into a different representation. Examples of such methods are toString(), getBytes(), and toByteArray(). In many cases, there is no need for operating on a copy of an internal buffer instead of on the buffer itself. Instead, the technique demonstrated with MyByteArrayOutputStream or designing the desired class from the ground up are better alternatives from a performance point of view than the Java API classes.

4.3 Statically Creating Immutable Objects

Another way of reusing objects is to statically create immutable objects that are used repeatedly. Static creation means allocation of the corresponding objects as member variables in classes and declaring them static. Such objects are created once across all objects of the same class and not each time before they are used.
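As a minimal sketch of this pattern (class and field names are illustrative, not from the paper), the message bytes and the exception below are allocated once during class initialization and then reused on every call:

```java
import java.io.IOException;

// Illustrative sketch of statically created immutable objects: the
// byte[] message and the exception are built once per class, not
// once per use, so repeated calls incur no allocations.
public class StaticMessages {
    static final byte[] ERROR_MSG =
        "\rMaximum message length exceeded".getBytes();
    static final IOException LENGTH_EXCEEDED = new IOException();

    public static byte[] errorMessage() {
        return ERROR_MSG; // same reference on every call
    }
}
```

Every call returns the identical object rather than a fresh allocation, which is exactly the saving this section quantifies.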

Table 1: Comparing the performance of all Fragment1_x classes

                               Fragment1_1  Fragment1_2  Fragment1_3  Fragment1_4
    Execution Time in µs           478          399          141           32
    Normalized Execution Time      100        83.63        29.42         6.64

Examples of immutable objects are certain error messages, notifications, and exceptions whose content is independent of the context in which they are used. Consider the excerpt of class Fragment2_1 shown in Figure 8. Suppose we expect the IOException to be thrown frequently inside the readIntoBuffer method.

    class Fragment2_1 {
        private byte[] buffer = new byte[5000];
        private MyByteArrayOutputStream byteStream = new MyByteArrayOutputStream();

        public MyByteArrayOutputStream readIntoBuffer(InputStream in, OutputStream out) {
            int bytesRead;
            byteStream.reset();
            try {
                do {
                    bytesRead = in.read(buffer);
                    if (bytesRead > 0) {
                        byteStream.write(buffer, 0, bytesRead);
                        if (bytesRead > 512)
                            throw new IOException();
                    }
                } while (bytesRead > 0);
                out.write("\rMessage successfully read in readIntoBuffer".getBytes());
            } catch (IOException e) {
                try {
                    out.write("\rMaximum message length exceeded in readIntoBuffer".getBytes());
                } catch (IOException e1) {}
            }
            return byteStream;
        }
        ...
    }

Figure 8: Class Fragment2_1

Notice that the IOException and both Strings that can be written to out never change and yet are dynamically created each time the corresponding statement is executed. Thus, by statically creating the Strings and the IOException, class Fragment2_1 can be coded more efficiently as class Fragment2_2 listed in Figure 9. Creating an exception also means capturing the stack at this point in time. Consequently, if an exception is created statically as in Fragment2_2, retrieving the stack contents associated with this exception in a catch clause means obtaining the stack contents at the time of exception creation and not the time when the exception was thrown.

    class Fragment2_2 {
        private byte[] buffer = new byte[5000];
        private MyByteArrayOutputStream byteStream = new MyByteArrayOutputStream();
        private static byte[] successMsg =
            "\rMessage successfully read in readIntoBuffer".getBytes(),
            errorMsg =
            "\rMaximum message length exceeded in readIntoBuffer".getBytes();
        private static IOException lengthExceeded = new IOException();

        public MyByteArrayOutputStream readIntoBuffer(InputStream in, OutputStream out) {
            int bytesRead;
            byteStream.reset();
            try {
                do {
                    bytesRead = in.read(buffer);
                    if (bytesRead > 0) {
                        byteStream.write(buffer, 0, bytesRead);
                        if (bytesRead > 512)
                            throw lengthExceeded;
                    }
                } while (bytesRead > 0);
                out.write(successMsg);
            } catch (IOException e) {
                try {
                    out.write(errorMsg);
                } catch (IOException e1) {}
            }
            return byteStream;
        }
        ...
    }

Figure 9: Class Fragment2_2

Table 2 compares the execution time of class Fragment2_1 with that of Fragment2_2 per invocation of readIntoBuffer. The test program calls readIntoBuffer to read the contents of a 5kB buffer that is wrapped in an input stream object. The output is written to a file. The performance figures for this case can be found in the top portion of Table 2. The numbers in the table are averaged over 10000 invocations of readIntoBuffer on the same Fragment2_x object. Method readIntoBuffer is set up in such a way that an IOException is thrown in one quarter of all calls. If the output of Fragment2_2 is written to the computer screen instead of a file, the execution times of the different versions of readIntoBuffer in Fragment2_1 and Fragment2_2 are much closer to the same value, i.e., less time can be saved by statically creating immutable objects. This is shown in the middle of Table 2. Apparently, buffered output to a file is much faster here than output to a screen, and screen output dominates the execution time.

Table 2: Comparing the execution times of Fragment2_1 and Fragment2_2 per invocation of readIntoBuffer

                                             Fragment2_1  Fragment2_2
    File Output (buffered)
      Execution Time in µs                        91           41
      Normalized Execution Time                  100        45.05
    Screen Output (System.out.write)
      Execution Time in µs                       213          163
      Normalized Execution Time                  100        76.53
    Screen Output (System.out.print)
      Execution Time in µs                       196          191
      Normalized Execution Time                  100        97.45

Fragment2_x uses System.out.write for screen output instead of System.out.print. The naive way of statically creating the error and success messages and writing them to the screen would be to declare

    private static String successMsg =
        "\rMessage successfully read in readIntoBuffer",
        errorMsg =
        "\rMaximum message length exceeded in readIntoBuffer";

and to output these messages by calling System.out.print. Here, the conversion of the Strings to byte[] arrays would happen during each invocation of the System.out.print method and not during the one-time creation of the error and success messages. The bottom portion of Table 2 shows how the execution time of readIntoBuffer in Fragment2_1 would compare to that in Fragment2_2 if Fragment2_2 used the naive way of creating and displaying the success and error messages. Clearly, Fragment2_2 now saves hardly any time compared to Fragment2_1.

4.4 Avoiding Dynamically Expanding Objects

Many instantiations of Java API classes contain buffers that expand dynamically as more data is added to them. Examples of such classes are ByteArrayOutputStream, StringBuffer, Vector, and Hashtable. Whenever the internal buffer has reached its capacity and new data is added, a new buffer with twice the size of the previous buffer, or more if necessary, is allocated. The content of the old buffer is then copied into the new buffer, and the new data is added to the larger internal buffer. These allocations and copy operations and the subsequent garbage collection can be very time-consuming. Suppose the initial size of the internal buffer is 1 data unit, the final size is n data units, and a certain maximum number of data chunks is added several times in between (1 data unit during the first and second addition of data and not more than 2^(i-2) data units during the i-th addition of data, for i > 2). The total cost of the buffer expansion is

- ⌈log2 n⌉ + 1 buffer allocations on the heap,
- ⌈log2 n⌉ copy operations with a total copy volume of at least 2n − 1 data units, and
- a total data volume of at least 4n − 1 data units in all allocated buffers.

Although all these functions are at most linear in n, they can considerably slow down a Java server under heavy load that has to perform these operations many hundreds or thousands of times each second. There are many ways of avoiding objects with dynamically expanding buffers, none of which, however, offers the convenience of using the corresponding Java API classes. One technique for avoiding the frequent creation of dynamically expanding objects is demonstrated here. Suppose class Fragment1_4 were used in a Web proxy and had the task of reading in HTML documents. Suppose also that readIntoBuffer is called only once or a few times on each Fragment1_4 object. We know that currently few HTML documents exceed 15000 bytes in size. Fragment1_4 can be modified to Fragment1_5 as shown in Figure 10 to avoid the overhead of internal buffer allocation and copy operations in MyByteArrayOutputStream in most cases. Notice that in class Fragment1_5 there is no need for copying data from the buffer into the byteStream unless the amount of data read exceeds bufferLength. This is another significant performance improvement. Previous examples showed that copying an internal object buffer before returning its contents to a calling method can have a significant impact on the execution speed. Naturally, copying data from a source (such as a buffer) into another object for initialization or to expand the content of the destination object can have an equally serious impact on program performance, as this example and the performance measurements in Table 3 show. This table compares the execution times of classes Fragment1_4 and Fragment1_5 per invocation of method readIntoBuffer. The top part of the table (10000 Byte Input) refers to a test program that calls readIntoBuffer to read the contents of a 10000 byte buffer 512 bytes at a time.
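The allocation count in the analysis above can be checked with a small simulation (illustrative code, not from the paper) that mimics the doubling strategy: starting from a 1-unit buffer, each exhausted capacity triggers one new allocation of twice the size:

```java
public class GrowthDemo {
    // Simulates the doubling strategy of classes such as
    // ByteArrayOutputStream: when capacity is exhausted, a buffer
    // twice as large is allocated and the old content is copied.
    public static int allocationsToReach(int n) {
        int capacity = 1;
        int allocations = 1; // the initial 1-unit buffer
        while (capacity < n) {
            capacity *= 2;   // one more heap allocation (plus a copy)
            allocations++;
        }
        return allocations;
    }

    public static void main(String[] args) {
        // Growing to 5000 units takes 14 allocations, i.e. ⌈log2 5000⌉ + 1.
        System.out.println(allocationsToReach(5000));
    }
}
```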
In this test environment, classes Fragment1_2 and Fragment1_3 cannot capitalize on object reuse techniques (see Sections 4.1 and 4.2) because readIntoBuffer is called only once. Therefore, Fragment1_2 and Fragment1_3 do not provide performance gains over

Table 3: Comparing the performance of classes Fragment1_4 and Fragment1_5 using a 10000/20000 byte input

                                  Fragment1_4  Fragment1_5
    10000 Byte Input
      Execution Time in µs             692          318
      Normalized Execution Time        100         45.9
    20000 Byte Input
      Execution Time in µs            1390          970
      Normalized Execution Time        100        69.64

Fragment1_1. Fragment1_4 is approximately 25% faster than Fragment1_1, Fragment1_2, and Fragment1_3 because it avoids copying the internal buffer of the ByteArrayOutputStream before returning from the method. Fragment1_5 is more than twice as fast as Fragment1_4. If there were an a priori known limit of 15000 bytes on the size of a document read by readIntoBuffer, there would not be any need for the variable byteStream, and the design of Fragment1_5 would be much simpler. In any case, applying the above technique has to be weighed carefully because there is a trade-off between time and memory requirements. If a document is usually very small, creating a static buffer of size 15000 bytes leads to fast execution but also consumes unnecessary memory. The above test favored Fragment1_5 because less than 15000 bytes of data were read in readIntoBuffer. Table 3 also shows the performance comparison between classes Fragment1_4 and Fragment1_5 if 20000 bytes of data are read. All other parameters are unchanged from the test with a 10000 byte input. Since Fragment1_5 processes the first 15000 bytes more efficiently than Fragment1_4 and afterwards switches to the same algorithm that Fragment1_4 uses, it still performs considerably faster than Fragment1_4 in this case.
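As a hedged sketch of the technique just described (hypothetical code; the paper's actual Fragment1_5 appears in its Figure 10), a preallocated fixed buffer absorbs documents up to 15000 bytes without any copy operations, and a dynamically expanding stream handles only the rare overflow:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch of the Fragment1_5 idea: the common case
// (input shorter than the fixed buffer) incurs no buffer expansion
// and no copy; the growing stream is touched only on overflow.
public class FixedFirstReader {
    private final byte[] buffer = new byte[15000];
    private final ByteArrayOutputStream overflow = new ByteArrayOutputStream();

    // Returns the total number of bytes read.
    public int readIntoBuffer(InputStream in) throws IOException {
        overflow.reset();
        int filled = 0;
        int bytesRead;
        // Fill the fixed buffer first; no copy operations needed.
        while (filled < buffer.length
                && (bytesRead = in.read(buffer, filled, buffer.length - filled)) > 0) {
            filled += bytesRead;
        }
        // Rare case: input exceeds the fixed buffer, so switch to
        // the dynamically expanding stream for the remainder.
        byte[] spill = new byte[4096];
        while ((bytesRead = in.read(spill)) > 0) {
            overflow.write(spill, 0, bytesRead);
        }
        return filled + overflow.size();
    }

    // Convenience wrapper for exercising the class on in-memory data.
    public static int readAll(byte[] data) {
        try {
            return new FixedFirstReader().readIntoBuffer(new ByteArrayInputStream(data));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

The reset() call at the top mirrors the object reuse rule of Section 4.1, so one instance can be recycled across documents.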

4.5 Adequate Object Initialization

If a dynamically expanding object is used in a Java program, much of the internal buffer expansion explained in Section 4.2 can be avoided by using a constructor that appropriately initializes the internal buffer. Most Java API classes with internal buffers such as ByteArrayOutputStream provide a constructor that allows the programmer to set an initial size for the internal data structure. The default constructor for ByteArrayOutputStream uses an initial capacity of 32 bytes in the JavaSoft Java 1.1.7 API. If the programmer expects a large number of bytes to be read in small chunks during the average execution of readIntoBuffer, using ByteArrayOutputStream(n) with an adequate n saves time-consuming memory allocations and copy operations over using the default constructor. On the other hand, an overly large n could waste memory, and the time for allocating a large buffer can be significantly longer than for creating a small buffer (see Section 3).
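The effect of the guideline can be seen with a small self-contained sketch, not the paper's benchmark; the byte counts mirror Scenario 2 of Table 4, but the timings will vary by machine.

```java
import java.io.ByteArrayOutputStream;

// Compare a default-sized ByteArrayOutputStream (32 bytes, which must
// repeatedly grow and copy) with one pre-sized to hold the whole input.
class InitDemo {
    // Write total bytes in chunkSize pieces into a stream created with
    // the given initial capacity; return the elapsed time in ns.
    static long fill(int initialSize, int total, int chunkSize) {
        byte[] chunk = new byte[chunkSize];
        long start = System.nanoTime();
        ByteArrayOutputStream out = new ByteArrayOutputStream(initialSize);
        for (int written = 0; written < total; written += chunkSize)
            out.write(chunk, 0, chunkSize);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // 5000 bytes, 100 bytes at a time, as in Scenario 2 of Table 4.
        long defaultSize = fill(32, 5000, 100);    // grows repeatedly
        long preSized = fill(10000, 5000, 100);    // never grows
        System.out.println("default:  " + defaultSize + " ns");
        System.out.println("presized: " + preSized + " ns");
    }
}
```

On a single cold run the numbers are noisy; a real comparison should loop many times and average, as the paper's tables do.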

Class Fragment1_4 in Section 4.2 uses the default size (32 bytes) for the internal buffer of class MyByteArrayOutputStream. Table 4 (Scenario 1) compares the execution time per invocation of method readIntoBuffer in class Fragment1_4 with that of a modified version of Fragment1_4, where the internal buffer of MyByteArrayOutputStream is initialized with 10000 bytes. A call to method readIntoBuffer reads the contents of a 5kB buffer, 5000 bytes at a time. The numbers in the table are averaged over 10000 invocations of readIntoBuffer for the same Fragment1_4 object. Apparently, using an internal buffer with an initial size of 10000 bytes actually results in a slight performance penalty. The reason is that all invocations of readIntoBuffer reuse the internal buffer, and therefore the buffer does not grow as more data is read into it. Allocating a large internal buffer, however, is more time-consuming than allocating a buffer with the default size, as explained above.

Let us look at a different scenario. If readIntoBuffer is called only once and reads in a total of 5kB, 100 bytes at a time, an initial buffer size of 10000 bytes instead of the default size (assuming that the actual size of the input is not known exactly ahead of time) saves considerable time. This is shown in Table 4 (Scenario 2). In the case where the default buffer size is used instead, the time saved by not allocating the larger initial buffer is more than offset by the amount of time spent in dynamically expanding the internal buffer as more data is being read in during each call to readIntoBuffer.

4.6 Thread Pooling

A well-known technique for reusing objects is object pooling and specifically thread pooling. Thread creation is time-consuming not only because a Java thread is an object but also because it requires behind-the-scenes bookkeeping and context switching as a parallel unit of execution. Threads usually contain objects as member variables that have to be allocated and initialized each time a thread is spawned. Therefore, if a method solves the problem at hand just as easily and conveniently as a thread, then a method should be used. If not, thread pooling can provide the means to reduce the overhead associated with frequent thread creation, for instance in a server (timers, request handlers, etc.).
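The wait/notify mechanics behind such a pool can be sketched as follows. This is a minimal single-worker illustration written for this summary, not the paper's PoolThread class; the class and method names are invented.

```java
// A pooled worker: created once, it waits until a job is handed to it,
// runs the job, and returns to the wait state instead of dying.
class Worker extends Thread {
    private Runnable job;
    private boolean shutdown = false;

    // Hand a job to the waiting worker.
    public synchronized void execute(Runnable r) {
        job = r;
        notify();                        // wake the worker
    }

    public synchronized void shutdown() {
        shutdown = true;
        notify();
    }

    public void run() {
        while (true) {
            Runnable r;
            synchronized (this) {
                while (job == null && !shutdown) {
                    try { wait(); }      // the pool's wait state
                    catch (InterruptedException e) { return; }
                }
                if (job == null) return; // shutdown requested
                r = job;
                job = null;
            }
            r.run();                     // process one request
        }
    }
}

class PoolDemo {
    public static void main(String[] args) throws InterruptedException {
        Worker w = new Worker();
        w.start();                       // thread created exactly once
        w.execute(new Runnable() {
            public void run() { System.out.println("request handled"); }
        });
        Thread.sleep(100);               // let the job finish
        w.shutdown();
        w.join();
        System.out.println("worker reused, then retired");
    }
}
```

A full pool would keep several such workers in a container and dispatch each incoming request to an idle one; the single-worker case already shows the saving the paper measures, since the thread object and its member buffers are allocated only once.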

                                  Default Size   With 10000 Byte Buffer
  Scenario 1
    Execution Time in µs                  31.8                     32.8
    Normalized Execution Time              100                    103.4
  Scenario 2
    Execution Time in µs                   357                      221
    Normalized Execution Time              100                    61.89

Table 4: Comparing the performance of class Fragment1_4 with a 32 byte and a 10000 byte buffer initialization in two scenarios

Thread pooling refers to the static creation of a sufficient number of thread objects before their actual use. Each thread then enters a wait state and waits until it is unlocked by another thread or by some event. Once a thread has completed its computation, it returns to the wait state. For example, consider the excerpt from class DynamicThread in Figure 11. This class can be rewritten to establish thread pools instead of dynamically spawned threads as shown in Figure 12. Table 5 shows how the performance of thread pooling can compare to that of dynamically spawning threads per one execution of the body of the run() method. For the performance comparison, one version of a test Internet server embedded DynamicThreads and another version used PoolThreads. A test client sent off 10000 requests to the server. The server using DynamicThreads spawned a new DynamicThread for each incoming request, and the server using PoolThreads created one PoolThread in the beginning and reused this thread for each request. An execution of the body of the run method therefore corresponds to processing a client request. The table shows that, including the time it takes to start a thread, a thread that is part of a pool can process a request more than 40% faster than a dynamically created thread.

5 Summary

Java and the Java API offer many operations that are convenient for the programmer but need to be used with caution because of potential performance penalties. This paper concentrates on object allocations and object-to-object copy operations due to Java API method calls as a major source of speed inefficiencies under certain circumstances. The paper demonstrates how to avoid the ensuing speed problems while at the same time decreasing the memory footprint of a Java program. The paper also shows that performance enhancements can increase the risk of introducing bugs into the given Java program because they work only if the given program satisfies the preconditions for their application. For example, some of the presented guidelines turn a thread-safe class into a thread-unsafe class. In another case described in this paper, a technique that speeds up program execution when the preconditions for its application are satisfied slows down the execution in other situations. The readability and maintainability of the code can also suffer. In addition, the textual volume of the code can increase. This may not be of concern for a standalone Java application, but it may very well be for a Java applet that is downloaded to a client via a link with small bandwidth.

References

[1] Ken Arnold and James Gosling. The Java Programming Language. Addison-Wesley Publishing Company, Reading, MA, 1998.
[2] Doug Bell. Make Java fast: Optimize! JavaWorld, April 1997.
[3] Dennis de Champeaux, Doug Lea, and Penelope Faure. Object-Oriented System Development. Addison-Wesley Publishing Company, Reading, MA, 1993.
[4] Thomas Davis. Build your own ObjectPool in Java to boost app speed. JavaWorld, June 1998.
[5] Bruce Eckel. Thinking in Java. Prentice Hall PTR, Upper Saddle River, NJ, 1998.
[6] Jonathan Hardwick. Java Optimization. http://www.cs.cmu.edu/~jch/java/optimization.html.
[7] Prashant Jain, Seth Widoff, and Douglas C. Schmidt. The design and performance of MedJava. In Proceedings of the 4th USENIX Conference on Object-Oriented Technologies and Systems, April 1998.
[8] Reinhard Klemm. WebCompanion: A Friendly Client-Side Web Prefetching Agent. To appear in IEEE Transactions on Knowledge and Data Engineering, 1999.
[9] Carmine Mangione. Performance tests show Java as fast as C++. JavaWorld, February 1998.
[10] Mark Roulo. Accelerate your Java apps! JavaWorld, September 1998.

                                  DynamicThread   PoolThread
  Execution Time in ms                     5.13         2.94
  Normalized Execution Time                 100        57.32

Table 5: Comparing the performance of DynamicThreads and PoolThreads

class Fragment1_5 {
  private static final int min = 150, bufferLength = 15000;
  private byte[] buffer = new byte[bufferLength];

  public byte[] readIntoBuffer(InputStream in, Integer returnLength)
      throws IOException {
    int total = 0, bytesRead;
    MyByteArrayOutputStream byteStream = null;
    // overflow if > 15000 bytes have been read
    boolean overflow = false;
    while (true) {
      bytesRead = in.read(buffer, total, bufferLength - total);
      if (bytesRead < 0)
        break;
      total += bytesRead;
      if (total == bufferLength) {
        // more than 15000 bytes: switch to the dynamically
        // expanding byteStream
        if (byteStream == null)
          byteStream = new MyByteArrayOutputStream();
        overflow = true;
        byteStream.write(buffer, 0, total);
        total = 0;
      }
    }
    if (overflow) {
      if (total > 0)
        byteStream.write(buffer, 0, total);
      returnLength = new Integer(byteStream.size());
      return byteStream.getByteArray();
    } else {
      returnLength = new Integer(total);
      return buffer;
    }
  }
}

Figure 10: Class Fragment1_5

class DynamicThread extends Thread {
  private static String webProxy;
  private static int proxyPort;
  private Socket webServer;
  private static Hashtable documentCache;

  public DynamicThread(String proxy, int port, Socket server,
                       Hashtable cache) {
    webProxy = proxy;
    proxyPort = port;
    webServer = server;
    documentCache = cache;
  }

  public void run() {
    String id = "Connection Handler";
    byte[] buffer = new byte[2048];
    ByteArrayOutputStream webDoc = new ByteArrayOutputStream(8000);
    Hashtable documentLinks = new Hashtable(50, .8f);
    ... // read in document from server
        // and prefetch links here
  }
  ...
}

Figure 11: Class DynamicThread

class PoolThread extends Thread {
  private static String webProxy = "http://webproxy.bell-labs.com";
  private static int proxyPort = 8888;
  private static Hashtable documentCache;
  private static String id = "Connection Handler";
  private byte[] buffer = new byte[2048];
  private ByteArrayOutputStream webDoc = new ByteArrayOutputStream(8000);
  private Hashtable documentLinks = new Hashtable(50, .8f);
  private static ServerSocket server;

  public PoolThread(Hashtable cache, ServerSocket thisHost) {
    documentCache = cache;
    server = thisHost;
  }

  public void run() {
    Socket webServer;
    while (true) {
      try {
        // wait for connection request (wait state)
        webServer = server.accept();
        ... // read in document
            // and prefetch links here
      } catch (...) { ... }
    }
  }
  ...
}

Figure 12: Class PoolThread
