Branch Behavior of Java Runtime Systems and its Microarchitectural Implications

Tao Li✝, Lizy Kurian John✝, Vijaykrishnan Narayanan✻ and Anand Sivasubramaniam✻

✝Laboratory for Computer Architecture, Electrical and Computer Engineering Dept., The University of Texas at Austin
✻220 Pond Lab, Dept. of Computer Science and Engineering, The Pennsylvania State University
Abstract
Java programs are becoming increasingly prevalent on numerous platforms ranging from embedded systems to high-end servers. Dynamic translation (interpretation and compilation), frequent calls to native interface libraries or operating system kernel services, and abundant use of virtual methods by Java programs can complicate the intrinsic predictability of the control flow that an ILP machine exploits. An in-depth look at the implications on the underlying microarchitecture can guide architects in designing efficient Java processors, or in finely tuning the performance of a Java run time system on general purpose, wide issue machines. To our knowledge, this paper presents the first insight into the branch behavior of a standard JVM running on a commercial operating system using real workloads. Employing a complete system simulation environment, we profile and quantify the performance of branch prediction structures for both user and kernel executions. The impact of different JVM styles (JIT compiler and interpreter) on branch behavior is also studied. We find that: (1) kernel instructions and kernel branches play an important role in the execution of a Java program; (2) the studied JVM contains a fairly large number of static conditional branches, with an average of 38K in JIT mode and 30K in interpreter mode, as opposed to 5.5K in SPECInt95; (3) a major part of the dynamic indirect branches are multiple target (polymorphic) branches; (4) kernel and user codes in a Java run time system use the underlying speculative mechanisms in different ways. We propose a split branch predictor structure that separates the branch instruction stream into kernel and user parts based on the processor's execution mode and uses an optimized, cost-effective prediction mechanism for each. Simulations using SPECjvm98 show this technique improves prediction accuracy by reducing destructive branch aliasing between kernel and user codes.
Since this scheme requires access to only one of these smaller modules (kernel or user) for a given branch instruction mode, it also results in energy savings.

1. Introduction
Java is becoming increasingly important with the blooming of the Internet, E-commerce and intelligent mobile devices. Its "write-once run-anywhere" promise has resulted in portable software and standardized interfaces spanning from embedded systems to enterprise servers. To adhere to this motto, Java programs are first converted to a machine neutral format known as bytecodes, and a Java Virtual Machine (JVM) [Lind99] dedicated to a specific hardware platform is then invoked to execute the bytecodes. Early JVM implementations relied on interpretation [Java and Rom96], a binary emulation technique that has been adopted in programming languages such as Lisp, Basic, Perl and Tcl. Direct interpretation of bytecodes facilitates the portability and security properties of Java applications and is easier to implement. Unfortunately, it can lead to poor performance. Dynamic compilation of bytecodes to native machine instructions at run time can improve performance, provided the compilation time is amortized by the savings due to the execution of the generated code. Such a technique is commonly referred to as Just-In-Time (JIT) compilation [Cram97, Kral98 and Taba98]. Further, bytecodes can also be executed on a hardware implementation of the JVM [Vija98b] such as Sun's PicoJava & MAJC [Conn97, Radh00b and MPR0899], and JEDI's JSTAR [MPR0300].
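The control-flow cost of direct interpretation can be made concrete with a sketch of a bytecode dispatch loop. The following is a minimal illustration only (the opcodes and stack layout are invented for the example, not Sun JDK code); the switch on the opcode typically compiles to an indirect jump whose target changes on almost every iteration.

```java
// Minimal stack-machine interpreter sketch (hypothetical opcodes, not JDK code).
// The switch on `op` typically compiles to an indirect jump through a dispatch
// table -- the hard-to-predict indirect branch discussed later in this paper.
public class TinyInterp {
    static final int PUSH = 0, ADD = 1, MUL = 2, HALT = 3;

    public static int run(int[] code) {
        int[] stack = new int[16];
        int sp = 0, pc = 0;
        while (true) {
            int op = code[pc++];
            switch (op) {                    // one indirect control transfer per bytecode
                case PUSH: stack[sp++] = code[pc++]; break;
                case ADD:  sp--; stack[sp - 1] += stack[sp]; break;
                case MUL:  sp--; stack[sp - 1] *= stack[sp]; break;
                case HALT: return stack[sp - 1];
                default:   throw new IllegalStateException("bad opcode " + op);
            }
        }
    }

    public static void main(String[] args) {
        // (2 + 3) * 4
        int[] prog = {PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, HALT};
        System.out.println(run(prog));   // prints 20
    }
}
```

A JIT compiler removes this dispatch loop entirely by translating each bytecode sequence into straight-line native code, which is one reason the two modes show such different indirect branch frequencies.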
The execution of Java applications is largely dependent on the Java Virtual Machine (JVM) as well as the underlying operating system. A JVM environment can be significantly different from that required to support traditional C or FORTRAN based codes. Many Java classes require services provided by the underlying operating system and can spend a significant amount of time exercising the native code interfaces and OS services [Li00]. Moreover, the different JVM implementation styles (JIT compiler or interpreter) and the manner in which the various JVM components (such as the garbage collector and class loader) are exercised can have a significant impact on system behavior. Java is designed to support the safe execution of portable programs across platforms and networks, and hence provides additional language features (e.g. dynamic class loading and validation, garbage collection, multithreading and synchronization) that are absent in other object-oriented languages like C++. Current ILP processors use prediction techniques to speculatively execute instructions. Correct control flow predictions help alleviate control dependences and can provide a smooth stream of instructions to the execution engine. Accurate control flow prediction is important for improving processor performance, especially on wide issue and deeply pipelined machines where the misprediction penalty is high. Despite steady improvements, branch prediction is still considered a key hurdle for current and future microprocessors [Heil99]. Consequently, there continues to be ongoing research to improve branch prediction accuracy, especially on real applications and emerging workloads.
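The 2-bit saturating-counter scheme used as the baseline conditional predictor later in this paper can be sketched as follows. This is a generic textbook model, not the simulator's actual code; the table size and PC-based index are illustrative assumptions.

```java
// Generic one-level branch predictor: a table of 2-bit saturating counters
// indexed by low-order PC bits (textbook sketch, not the MXS simulator code).
public class BimodalPredictor {
    private final int[] counters;   // each entry in 0..3; >= 2 means predict taken
    private final int mask;

    public BimodalPredictor(int entries) {        // entries must be a power of two
        counters = new int[entries];
        mask = entries - 1;
        java.util.Arrays.fill(counters, 1);       // initialize to weakly not-taken
    }

    public boolean predict(long pc) {
        return counters[(int) (pc & mask)] >= 2;
    }

    public void update(long pc, boolean taken) {  // saturate at 0 and 3
        int i = (int) (pc & mask);
        if (taken) { if (counters[i] < 3) counters[i]++; }
        else       { if (counters[i] > 0) counters[i]--; }
    }
}
```

The hysteresis of the 2-bit counter is what makes it effective on loop-closing branches: a branch taken many times and then not taken once still predicts taken on the next execution.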
Dynamic translation (interpretation and compilation), frequent calls to native interface libraries or the operating system kernel, and abundant usage of virtual methods to support the extension and reuse of classes can complicate the intrinsic predictability of control flow in a Java run time system. Previous studies have mainly focused on optimizing branch prediction performance on general-purpose processors using application-only references from benchmarks such as SPECInt95 [Chang97, Yeh93, Shad99 and Heil99]. While the results from these studies are suggestive and useful, it is our belief that a Java run time system can behave differently from SPECInt95. Moreover, given the importance of kernel execution in a Java run time system, it is interesting and necessary to examine how the operating system uses the underlying microarchitecture for effective speculation. This gives rise to an interesting question: What branch prediction hardware should be provided to effectively support Java run time systems on ILP machines? Such a study can help us evaluate the microarchitectural efficiency of different Java run time system implementations, identify the scope for potential improvement, and derive design enhancements that can improve performance in a given setting. Further, an optimization of branch prediction for one design style may be useful for other JVM implementation styles as well. For instance, an improvement in branch behavior could be useful in the design of efficient processor microarchitectures as well as in profile-based optimization of JVM codes on general-purpose processors. Finally, complete system profiling also helps us to design environments that efficiently and seamlessly combine the JVM, the operating system and the underlying microarchitecture wherever possible.
Adhering to this philosophy, this paper presents results from an in-depth, complete system profiling of the branch characteristics of a standard Java run time system (with different implementations) running on a commercial operating system with real applications. Employing a complete system simulation environment, we quantify the performance of branch prediction structures for both user and kernel executions. The impact of different JVM styles (JIT compiler and interpreter) is also studied. We find the following: (1) kernel instructions and kernel branches play an important role in the execution of a Java program; (2) the indirect branch frequency caused by virtual method calls and switch statements in a JVM interpreter can be as high as 18.4%; (3) the studied JVM contains a fairly large number of static conditional branches (with an average of 38K in JIT mode and 30K in interpreter mode), as opposed to 5.5K in SPECInt95; (4) monomorphic indirect branches constitute a dominant portion of static branch sites, whereas polymorphic branches are invoked more frequently in dynamic instances: polymorphic branches, which constitute 4% of static branches in JIT mode and 5% in interpreter mode, lead to 28% of all dynamic branches in JIT mode and 75% in interpreter mode; (5) for the same benchmark, the two JVM implementations affect the user branch target access patterns drastically, while the kernel target access pattern depends more on the JVM and the JVM native interfaces than on the applications themselves; (6) kernel and user codes in a Java run time system use the underlying speculative mechanisms in different ways. We propose a split branch prediction structure to separately handle the different branch behaviors of user and kernel codes. This structure separates the branch instruction stream into kernel and user parts based on the processor's execution mode and uses a separate prediction mechanism for each.
Simulations using SPECjvm98 show that this technique improves prediction accuracy by reducing destructive branch aliasing between kernel and user codes. Compared with a unified predictor, a split predictor results in an average prediction accuracy improvement of 9% for user code and 30% for kernel code. Since this scheme requires access to only one of the smaller modules (kernel or user) for a given branch instruction mode, it also results in energy savings. Although we focus only on indirect branches in this paper, due to their increased usage in Java execution, we believe that such an approach would also benefit conditional branch prediction. The rest of this paper is organized as follows. Section 2 discusses related work. Section 3 describes the benchmarks and the simulation based experimental setup. Section 4 gives insight into the branch characteristics of a Java run time system and examines how operating system and user codes exploit sophisticated speculative microarchitectures. Section 5 describes our proposed split branch prediction structure and evaluates its benefits. Finally, Section 6 summarizes the contributions of this work and outlines directions for future research.

2. Related Work
Many previous studies have focused on enhancing branch prediction performance on ILP machines [Yeh91, Smith81, Chang97, Yeh93, Shad99, Heil99 and Drie98a]. These studies have concentrated mainly on the analysis and optimization of the branch behavior of SPECInt95 and C++ programs. A recent paper [Radh00]
studies the branch predictability of SPECjvm98 benchmarks using a limited range of predictors. Prediction accuracies of 65-87% (in interpreter mode) and 88-92% (in JIT compiler mode) are observed for a 2K-entry gshare scheme [Chang95] using 5 bits of global history and 1K BTB entries. Hsieh, Conte and Hwu [Hsie97]
compare the performance of Java code run on the Sun JDK 1.0.2 Java interpreter to code compiled through Caffeine [Hsie96]. These are also compared with compiled C/C++ versions of the code. It is observed that microarchitectural mechanisms, such as caches and the BTB, are not well exploited by Java interpreters. Translating Java bytecodes to native code alleviates some of these problems. A related study [Vija99] examines the effectiveness of using path history to predict the target addresses of indirect branches for virtual method invocations in Java applications. An XOR hashing scheme with a global path history and a 2-bit update policy is found to perform best. This result is shown only for the interpreter mode and small instruction footprint benchmarks (richards and deltablue) and does not include kernel code. Although there have been efforts to study branch prediction performance using system workloads [Gloy96 and Sech96] in the past, little work has been done on hardware optimizations. Gloy, Young and Smith [Gloy96] analyze ATOM-generated system traces from the Instruction Benchmark Suite (IBS) and find that user-only traces yield fidelity only when the kernel accounts for less than 5% of the total executed instructions. Their simulation results show that including kernel branches in the branch trace worsens the effects of aliasing. For example, aliasing adds 300% and 60% more mispredictions in the GAs.4K and GAs.64K predictors respectively. Driesen and Hölzle [Drie98a] investigate a wide range of two-level predictors dedicated exclusively to indirect branches using programs from the SPECInt95 suite as well as a suite of large C++ applications. On the average, a global-history, per-address predictor performs best in the design space. They find that combining two-level predictors with different path lengths in a hybrid style further improves prediction accuracy.
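The history-based indirect predictors discussed above share a common skeleton: hash the branch PC with a record of recently taken targets to index a target table. The following is our own simplified rendering of that scheme, not the exact design evaluated in the cited studies; the table size, history folding, and hash are illustrative assumptions.

```java
// Simplified two-level indirect-branch target predictor: the branch PC is
// XORed with a folded path history to select a target-table entry
// (illustrative sketch, not the exact design from the cited studies).
public class PathTargetPredictor {
    private final long[] targets;
    private final int mask;
    private long pathHistory;       // recent target addresses folded together

    public PathTargetPredictor(int entries) {   // entries: power of two
        targets = new long[entries];
        mask = entries - 1;
    }

    private int index(long pc) {
        return (int) ((pc ^ pathHistory) & mask);
    }

    public long predict(long pc) {
        return targets[index(pc)];              // 0 means no prediction yet
    }

    public void update(long pc, long actualTarget) {
        targets[index(pc)] = actualTarget;
        // fold a few bits of the new target into the path history
        pathHistory = ((pathHistory << 4) ^ (actualTarget >>> 2)) & 0xFFFF;
    }
}
```

Because the index depends on the path taken to reach the branch, a polymorphic call site can occupy several table entries, one per calling context, which is precisely how these predictors beat a plain last-target BTB.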
In [Drie98b], Driesen and Hölzle propose a cascaded branch predictor, which dynamically classifies and filters branches between a simple first-stage predictor and a more expensive second-stage predictor. Simulation results show that a cascaded predictor with 64+1024 entries achieves the same performance (a 7.8% misprediction rate) as a 4096-entry two-level predictor. Early indirect branch prediction studies are reported in [Lee84, Jaco97, Emer97 and Chang97]. Branch classification and hybrid prediction were first proposed for conditional branches by Chang & Patt [Chang94] and by McFarling [McFar93] respectively. To our knowledge, no previous study has looked at the branch prediction of Java programs by examining both user and kernel execution.

3. Benchmarks and Experimental Methodology
This work is based on simulation analysis of branch instruction traces generated in a complete system simulation environment. The experimental platform used to perform this study is SimOS [Rose95a, Herr97 and Herr98], a complete simulation environment that models hardware components with enough detail to boot and run a full-blown commercial operating system. In this study, the SimOS version that runs the Silicon Graphics IRIX 5.3 operating system has been used. The interpreter and JIT compiler of the Java Development Kit [Java]
from Sun Microsystems are simulated on top of the IRIX 5.3 operating system. Our studies are based on programs from the SPECjvm98 suite (see Table 1), a set of programs intended to evaluate the combined hardware (CPU, cache, memory, and other platform-specific performance) and software aspects (efficiency of the JVM, the JIT compiler, and operating system implementations) of the JVM client platform [SPECJVM98]. The SPECjvm98 suite uses common computing features such as integer and floating point operations, library calls and I/O, but does not include AWT (window), networking, and graphics. Each benchmark can be run with three different input sizes referred to as s1, s10 and s100. Due to the large slowdown of complete system simulation, and in order to keep trace sizes reasonable (300M - 4,000M instructions) while capturing the behavior of entire programs, we use the reduced data size s1 as the input. The benchmark mpegaudio is excluded from this table because it could not finish completely on the detailed CPU model of SimOS.

Table 1. SPECjvm98 Workloads
  compress  Modified Lempel-Ziv method (LZW) to compress and decompress large file
  jess      Java expert shell system based on NASA's CLIPS expert system
  db        Performs multiple database functions on a memory resident database
  javac     The JDK 1.0.2 Java compiler compiling 225,000 lines of code
  mtrt      Dual-threaded raytracer
  jack      Parser generator with lexical analysis, early version of what is now JavaCC
The branch instruction traces used in this study are generated by MXS [Benn95], which models a superscalar microprocessor with multiple instruction issue, register renaming, dynamic scheduling, and speculative execution with precise exceptions. The simulated architectural model is an 8-issue superscalar processor with MIPS R10000 [Mips94 and Yeag96] instruction latencies. Unlike the MIPS R10000, our processor model has a 128-entry instruction window, a 128-entry reorder buffer and a 64-entry load/store buffer. Additionally, all functional units can handle any type of instruction. To provide the baseline performance, conditional branch prediction is implemented as a one-level, 1024-entry BPT (Branch Pattern Table) with 2-bit saturating counters. Indirect branches and calls/returns are handled by a 1024-entry BTB (Branch Target Buffer) and a 32-entry RAS (Return Address Stack) respectively. By default, the branch prediction unit allows the fetch unit to fetch through up to 4 unresolved branches. The instruction and data accesses of both the applications and the operating system are modeled. We instrumented the MXS simulator to gather complete system traces on the above machine architecture. Each benchmark is run to completion. More details of our simulation techniques and machine architecture can be found in [Li00].

4. Branch Behavior of Java Run Time System
The goal of this paper is to explore hardware and software support features that allow the operating system and Java run time system to exploit sophisticated speculative microarchitectures. We performed a detailed system profiling of the studied JVM [Li00] before we set out to investigate its branch behavior. Different JVM
implementations are investigated because both interpretation and JIT compilation modes will continue to coexist (either in the same implementation or in distinct implementations) due to the variety in application behavior and available resources in the target systems on which the JVM is implemented. We believe such a study is not only necessary but also crucial because it gives us in-depth background knowledge of a JVM and justifies both the investigation of different JVM implementations and the use of system traces to perform the study we present in this paper. Our studies using SimOS showed that kernel activity constitutes a significant part of the overall execution time (see Table 2). The kernel activity is found to constitute 1% (compress with intr) to 31% (db) of the overall execution time. On the average, the SPECjvm98 programs spend 17.3% (with jit) and 15% (with intr) of their execution time in the kernel, in contrast to less than 2% for the SPECInt95 benchmarks. This implies that ignoring the kernel instructions in SPECjvm98 workloads may not capture complete and accurate execution behavior.

Table 2. Kernel Execution Percentages (s1 Data Set)
  % Kernel   compress  jess  db   javac  mpeg  mtrt  jack
  jit        6%        30%   31%  19%    11%   7%    17%
  intr       1%        28%   31%  23%    2%    4%    16%
We begin by classifying the branch behavior of JVMs with different implementation styles, specifically trying to answer the following questions: What are the mixes of branch instructions executed on a general purpose CPU when running Java programs (using an interpreter or a JIT compiler)? Are these different from those of traditional SPECInt95 programs? Do the different applications use branches in the JVM and its native libraries in a similar way? If so, can we "preoptimize" the Java run time system using a profile-based optimizer? How do Java executions in JIT and interpreter modes fare with a typical branch predictor? Do user and kernel codes exploit this microarchitectural feature in different ways? Based on this, can we suggest optimized and cost-effective speculative mechanisms for Java executions?

4.1 Branch Profiling
A complete branch instruction profile for the Java run time system and OS running the benchmarks is given in Table 3. Branch instructions are categorized as conditional branches; direct branches, which unconditionally redirect the instruction stream to a statically specified target encoded in the instruction itself; (non-return) indirect branches, which transfer control to an address stored in a register; and return branches, which always use a jump and link instruction (e.g. jal, jalr) and a specified architectural register (e.g., r31 on MIPS machines). For each benchmark, the table lists branch frequency, expressed as branches per instruction for each category, in both user and kernel modes. The execution of these benchmarks in both JIT compiler (jit) and interpreter (intr) modes is profiled. For example, 16.4% of kernel instructions in jack are branches and 88% of these branches
are conditional branches. Table 3 shows that the kernel has more branches per instruction than user code does. This observation, together with the significance of the kernel activity shown in Table 2, suggests that kernel instructions and kernel branches play an important role in the execution of a Java program. Hence, branch prediction accuracy measured using simulation of user-only activity is not necessarily a reliable indicator of the true prediction accuracy on the full-system trace.

Table 3. Basic Profiling of Branch Instructions in the Java Run Time System (branches per instruction)
                          Kernel                                   User
  Mode   Benchmark    All    Cond.  Direct Return Indirect    All    Cond.  Direct Return Indirect
  jit    compress     0.202  0.184  0.008  0.009  0.001       0.177  0.153  0.000  0.009  0.015
  jit    jess         0.215  0.184  0.014  0.015  0.002       0.158  0.112  0.001  0.020  0.025
  jit    db           0.261  0.232  0.013  0.014  0.002       0.156  0.114  0.002  0.020  0.021
  jit    javac        0.254  0.229  0.012  0.012  0.002       0.155  0.119  0.001  0.017  0.019
  jit    mtrt         0.227  0.206  0.010  0.010  0.001       0.171  0.138  0.000  0.015  0.017
  jit    jack         0.164  0.144  0.009  0.010  0.002       0.165  0.124  0.000  0.018  0.023
  intr   compress*    0.254  0.230  0.011  0.012  0.001       0.143  0.108  0.000  0.001  0.034
  intr   jess         0.211  0.182  0.013  0.014  0.002       0.144  0.102  0.000  0.013  0.029
  intr   db           0.260  0.231  0.013  0.014  0.002       0.153  0.107  0.002  0.019  0.026
  intr   javac        0.251  0.226  0.011  0.012  0.001       0.155  0.110  0.001  0.019  0.025
  intr   mtrt         0.230  0.209  0.010  0.011  0.001       0.137  0.108  0.000  0.006  0.023
  intr   jack         0.149  0.132  0.008  0.008  0.001       0.149  0.109  0.000  0.013  0.026
  *Statistics are based on simulation of each benchmark (using s1 data sets) on the SimOS MXS model until completion, except for compress invoked with an interpreter, which takes an extremely long time to finish. In this case, we use the first 4,000M instructions as a representative execution window, based on profiling of the entire execution on the SimOS Mipsy model.
The difference between the behavior of kernel and user codes is indicated by the number of branches per instruction: on the average, the 6 benchmarks execute 22 branches per 100 instructions in kernel mode and 16 branches per 100 instructions in user mode. Kernel branches include loops, error and bounds checking, and other routine conditionals. Error and bounds checking branches are abundant in operating system routines because the kernel has to be designed to handle all possible situations. In kernel codes, 89% of branch instructions are conditional branches; the rest account for only 5% (direct), 5.3% (return) and 0.7% (indirect) of the branches. In user mode with a JIT compiler, conditional branches contribute 77% of total branches, and the rest represent 0.4% (direct), 10.4% (return) and 12.2% (indirect) of all branches. Compared with kernel codes, the higher indirect branch ratio corresponds to virtual method calls in Java codes. As an object-oriented language, Java widely uses virtual methods to support the extension and reuse of classes (a polymorphic programming style). Virtual method calls in Java incur a performance penalty because the target of these calls can only be determined at run time. JIT compilers such as Kaffe, CACAO, and LaTTe [Lee00] typically maintain a virtual method table for each loaded class. A virtual method invocation is then translated into an indirect function call after two loads: first, the base address of the virtual table is loaded; next, the target address is loaded by indexing into this table; finally, an indirect call is made to this target location. For statically bound method calls, JIT compilers generate a direct jump at the call site. The branch behavior of an interpreter can be different from that of a JIT compiler. For example, in user mode, the percentages of conditional, direct, and return branches decrease to 73%, 0.4%, and 8.2% respectively, while the indirect
branch ratio increases from 12.2% to 18.4%. The interpreter mode results in a higher frequency of indirect control transfers due to the additional indirect jumps used to implement the switch statement for case by case interpretation [Radh00].

[Figure 1 plots instructions per conditional branch and instructions per indirect branch for the SPECjvm98 benchmarks (J: with a Just-In-Time compiler, I: with an interpreter) and for the SPECInt95 benchmarks m88ksim, gcc, ijpeg, li, perl and compress95.]
Figure 1. Instructions per Branch in Java and SPECInt95 Benchmarks

To gain insight into how the Java programs and the JVM impact branch behavior, Figure 1 compares branch frequency (expressed as instructions per branch) between the SPECjvm98 and SPECInt95 benchmarks. Instructions per branch is shown separately for conditional and indirect branches. All SPECInt95 programs were compiled with the MIPSpro C compiler (version 7.2.1) at optimization level -O2. Although the numbers of instructions per conditional branch in Java (7.9 with JIT and 9.3 with interpreter) and SPECInt95 (10.8) are not very different, Java programs have far fewer instructions per indirect branch (50 with JIT and 37 with interpreter) than SPECInt95. For example, ijpeg and compress in SPECInt95 have 468 and 246 instructions per indirect branch respectively. perl, li and gcc have many indirect branches because these programs essentially perform interpretation or compilation, which is somewhat similar to a JVM's functionality. With indirect branches becoming more frequent relative to conditional branches, the indirect branch prediction penalty can start to dominate the overall branch misprediction cost, because indirect branches tend to be mispredicted more frequently. For example, if indirect branches are mispredicted 6.3 times (with JIT; 4 times with interpreter) more frequently than conditional branches, indirect branch misses will dominate conditional branch misses in the studied JVM.
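The two-load-plus-indirect-call sequence that a JIT emits for a virtual invocation, described above, can be illustrated at the source level. The class names below are invented for the example, and the commented pseudo-assembly is a generic rendering of vtable dispatch, not the output of any particular JIT.

```java
// A virtual call site whose receiver type varies at run time -- the kind of
// polymorphic indirect branch profiled in this paper. Class names are
// hypothetical; the comments sketch generic vtable dispatch, not a specific JIT.
abstract class Shape { abstract double area(); }

class Square extends Shape {
    final double s;
    Square(double s) { this.s = s; }
    double area() { return s * s; }
}

class Circle extends Shape {
    final double r;
    Circle(double r) { this.r = r; }
    double area() { return Math.PI * r * r; }
}

public class Dispatch {
    static double totalArea(Shape[] shapes) {
        double sum = 0.0;
        for (Shape sh : shapes) {
            // JIT-compiled dispatch, roughly:
            //   load  t1 <- [sh + vtblOffset]   ; 1st load: vtable base
            //   load  t2 <- [t1 + areaIndex]    ; 2nd load: target address
            //   call  t2                        ; indirect call
            sum += sh.area();
        }
        return sum;
    }

    public static void main(String[] args) {
        Shape[] shapes = { new Square(3.0), new Circle(1.0) };
        System.out.println(totalArea(shapes));   // 9 + pi
    }
}
```

If every Shape in the array were a Square, this call site would be monomorphic and a last-target BTB would predict it perfectly after the first execution; mixing receiver classes is what turns it into a polymorphic, hard-to-predict indirect branch.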
4.2 Conditional Branch
Table 4 lists the number of dynamic conditional branches, the number of static conditional branches executed, the number of static branches constituting the given portion of total dynamic instances and the percentage of static branches that correspond to that portion. To provide a comparison, we also tabulate the corresponding data for SPECInt95 benchmarks. For example, in javac with a JIT compiler, 38,815 static branches constitute
34,766,245 dynamic instances, 90% of which are executed from 1,058 static branch sites (2.7% of total static branch sites).
Table 4. Characterization of Conditional Branches
                      Dynamic          Static        # of Static Branches Constituting the Given
                      Conditional      Conditional   Portion of Total Dynamic Instances
  Benchmark    Mode   Branches         Branches      (Percentage of Static Conditional Branches)
                                                       50%             90%
  SPECjvm98:
  jess         jit     35,986,299      38,654          156 (0.4%)      1,466 (3.7%)
  jess         intr    39,335,967      29,477           46 (0.2%)        501 (1.7%)
  db           jit     13,147,512      33,957          112 (0.3%)      1,172 (3.4%)
  db           intr    11,585,463      29,219           66 (0.2%)        673 (2.3%)
  javac        jit     34,766,245      38,815          138 (0.4%)      1,058 (2.7%)
  javac        intr    25,467,007      29,487           79 (0.3%)        597 (2.0%)
  SPECInt95:
  compress95   -          411,486       1,960           12 (0.6%)        102 (5.2%)
  gcc          -       66,327,989      14,351          214 (1.5%)      1,618 (11.3%)
  ijpeg        -       53,859,383       2,899           15 (0.5%)         68 (2.3%)
  m88ksim      -      100,527,021       3,046            5 (0.2%)         79 (3.0%)
Considered statically, the SPECjvm98 benchmarks all contain a fairly large number of static conditional branches, with an average of 38K in JIT mode and 30K in interpreter mode, whereas the four studied SPECInt95 benchmarks have only 5.5K static branch sites on average. Only gcc exercises a substantial number of branches, and even these account for only around 37% - 47% (depending on execution mode) of the total static branches in a Java execution. The large number of branch sites in Java execution is not surprising, since control flow is transferred frequently among the translated user codes, the JVM and its native interfaces, and the underlying operating system. When the dynamic frequency of these branches is considered, however, a small number of distinct branches contribute the overwhelming majority of the branch instances. For example, 90% of active branches in Java executions are executed from fewer than 1,232 branch sites (3.3% of total branch sites) in JIT mode and 590 (2% of total branch sites) in interpreter mode, showing that they are highly biased. In SPECInt95, 90% of the conditional branches are executed from fewer than 100 conditional branch sites, except for gcc, which contains a much larger number of active conditional branches (1,618). The number of conditional branch sites in the studied Java executions is relatively constant across applications. For example, in JIT mode the average number of conditional branch sites is 37,746, and the variation across benchmarks is within 8.7%. The data shown in Table 4 imply that a Java run time system spends a significant portion of time executing the underlying JVM, the JVM native interface libraries and operating system kernel services. This is only natural, since a JVM implementation typically results in complex software that is best addressed by libraries. The study in [Li00] shows that only a small number of kernel routines are used, and that these routines are used in a similar fashion across applications.
This raises an interesting question: Do different applications use the JVM environment in a similar style? Towards
answering this question, we analyze the dynamic conditional branch distribution during the entire execution of each benchmark. Figure 2 shows the results for benchmarks db, javac and jess in both JIT and interpreter modes. The data is plotted as branch identifier vs. percentage of branch execution. The branch identifier is uniquely numbered based on the call site, namely program counter (PC) value of the branch instruction. For example, in benchmark db with a JIT compiler, there is only one branch that accounts for 13.4% of dynamic instances, with the rest having a value (percentage of branch execution) less than 2%. A more interesting observation is that dynamic conditional instances are distributed in a similar style across different benchmarks, i.e., the most frequently invoked branches are clustered together, spanning approximately the same range of branch sites. This similarity indicates that the studied applications use JVM and its native libraries in a similar way and it may be possible to “preoptimize” a Java run time system to boost the ILP performance. For example, using profile-based compiler optimizations [Cald95] for JVM, Java run time libraries and OS kernel may be effective for improving performance across different applications.
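The 50%/90% coverage numbers in Table 4 come from a simple computation over the dynamic profile: sort branch sites by execution count and count how many of the hottest sites are needed to reach a given fraction of all dynamic instances. A sketch of that computation (the counts in the usage example are hypothetical, not the paper's data):

```java
import java.util.Arrays;

// Given per-site dynamic execution counts, how many of the hottest branch
// sites cover a target fraction of all dynamic instances? This is the
// computation behind the 50%/90% columns of Table 4 (illustrative only).
public class BranchCoverage {
    static int sitesForCoverage(long[] countsPerSite, double fraction) {
        long[] counts = countsPerSite.clone();
        Arrays.sort(counts);                             // ascending order
        long total = 0;
        for (long c : counts) total += c;
        long need = (long) Math.ceil(total * fraction);  // instances to cover
        long seen = 0;
        int sites = 0;
        for (int i = counts.length - 1; i >= 0; i--) {   // hottest sites first
            seen += counts[i];
            sites++;
            if (seen >= need) break;
        }
        return sites;
    }
}
```

For instance, with hypothetical counts {50, 30, 10, 5, 3, 2}, one site covers 50% of the instances and three sites cover 90%, mirroring the heavy bias reported in the text.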
Figure 2. Dynamic Distribution of Conditional Branches based on Call Sites

4.3 Indirect Branch
As the percentage of indirect branches is high in a Java run time system, it becomes imperative to examine their behavior. Indirect branches, which transfer control to an address stored in a register, are typically harder to predict accurately. Current processors predict indirect branches with a branch target buffer (BTB), which caches the most recent target address of a branch. Even an unconstrained BTB has been shown to achieve an average prediction accuracy of only 64% on SPECInt95 benchmarks [Drie98a]. Tables 5 and 6 show the profile of (non-return) indirect branches in user and kernel modes in the Java run time system. Indirect branches are further categorized as branches with only one target (monomorphic branches) and those with more than one target (polymorphic branches). Further, we divide polymorphic branches based on the number of targets. These
tables give, for each category, the number of indirect branch sites, the percentage of static branch sites, and their dynamic instance counterparts. For example, in benchmark mtrt with a JIT compiler, there are a total of 24.3 million executed indirect branches (17.4 million with one target and 6.9 million with multiple targets). These dynamic instances come from 6,865 indirect branch sites, of which 6,603 (96%) have one target and 262 (4%) have multiple targets. Considered statically, monomorphic branches constitute the dominant portion. When the dynamic frequency of these branches is considered, however, the importance of polymorphic branches becomes apparent: polymorphic branches, which constitute 4% of static branches in JIT mode and 5% in interpreter mode, account for 28% of all dynamic indirect branches in JIT mode and 75% in interpreter mode. Compared with the JIT compiler, the interpreter has fewer indirect branch sites: on average, the 6 studied benchmarks have 6,597 static indirect branch sites in JIT mode, but only 3,960 in interpreter mode. The studied Java run time system has a much larger number of indirect branch sites than SPECInt95 and C++ programs, in which 90% of indirect branches are executed from fewer than 100 branch sites [Drie98a].

Table 5. Non-Return Indirect Branch Profiling (User Mode) in Java Run Time System
Static branch sites (count and percentage of all indirect branch sites; multiple-target columns are cumulative, e.g. a site with >=8 targets is also counted under >=2 and >=4):

Benchmark       Single target    >=2 targets    >=4           >=8          >=16         >=32        >=64
compress (JIT)  5358 (96.07%)    219 (3.93%)    74 (1.33%)    23 (0.41%)   9 (0.16%)    4 (0.07%)   1 (0.02%)
compress (Intr) 3650 (95.65%)    166 (4.35%)    56 (1.47%)    19 (0.50%)   8 (0.21%)    3 (0.08%)   1 (0.03%)
jess (JIT)      6722 (94.68%)    378 (5.32%)    119 (1.68%)   40 (0.56%)   13 (0.18%)   7 (0.10%)   3 (0.04%)
jess (Intr)     3864 (95.29%)    191 (4.71%)    63 (1.55%)    21 (0.52%)   8 (0.20%)    5 (0.12%)   2 (0.05%)
db (JIT)        5424 (96.15%)    217 (3.85%)    73 (1.29%)    25 (0.44%)   10 (0.18%)   3 (0.05%)   1 (0.02%)
db (Intr)       3785 (95.32%)    186 (4.68%)    61 (1.54%)    20 (0.50%)   8 (0.20%)    5 (0.13%)   1 (0.03%)
javac (JIT)     6757 (95.57%)    313 (4.43%)    121 (1.71%)   40 (0.57%)   13 (0.18%)   6 (0.08%)   1 (0.01%)
javac (Intr)    3915 (95.09%)    202 (4.91%)    62 (1.51%)    23 (0.56%)   8 (0.19%)    6 (0.15%)   2 (0.05%)
mtrt (JIT)      6603 (96.18%)    262 (3.82%)    92 (1.34%)    34 (0.50%)   12 (0.17%)   4 (0.06%)   1 (0.01%)
mtrt (Intr)     3768 (95.46%)    179 (4.54%)    62 (1.57%)    18 (0.46%)   8 (0.20%)    3 (0.08%)   1 (0.03%)
jack (JIT)      7071 (96.45%)    260 (3.55%)    88 (1.20%)    30 (0.41%)   15 (0.20%)   6 (0.08%)   3 (0.04%)
jack (Intr)     3679 (95.56%)    171 (4.44%)    59 (1.53%)    18 (0.47%)   8 (0.21%)    3 (0.08%)   1 (0.03%)

Dynamic instances (count and percentage of all dynamic indirect branches):

Benchmark       Single target        >=2 targets          >=4                  >=8                  >=16                 >=32                 >=64
compress (JIT)  18705704 (51.23%)    17804658 (48.77%)    13242308 (36.27%)    102229 (0.28%)       73021 (0.20%)        62068 (0.17%)        58417 (0.16%)
compress (Intr) 1131814 (2.77%)      39662270 (97.23%)    39594738 (97.06%)    38835968 (95.20%)    38815571 (95.15%)    38803333 (95.12%)    38803333 (95.12%)
jess (JIT)      5205790 (75.08%)     1728210 (24.92%)     1173233 (16.92%)     678839 (9.79%)       283601 (4.09%)       239916 (3.46%)       177510 (2.56%)
jess (Intr)     4924628 (46.51%)     5663323 (53.49%)     5323622 (50.28%)     4863046 (45.93%)     4435293 (41.89%)     4380235 (41.37%)     4232004 (39.97%)
db (JIT)        1852952 (82.27%)     399448 (17.73%)      264882 (11.76%)      144829 (6.43%)       84015 (3.73%)        59463 (2.64%)        58112 (2.58%)
db (Intr)       1715960 (67.01%)     844660 (32.99%)      683429 (26.69%)      513916 (20.07%)      477044 (18.63%)      463728 (18.11%)      431208 (16.84%)
javac (JIT)     3922966 (77.40%)     1145661 (22.60%)     677169 (13.36%)      445532 (8.79%)       288405 (5.69%)       267624 (5.28%)       223020 (4.40%)
javac (Intr)    3455111 (69.86%)     1490420 (30.14%)     1030649 (20.84%)     826893 (16.72%)      657261 (13.29%)      652316 (13.19%)      571703 (11.56%)
mtrt (JIT)      17413166 (71.66%)    6885721 (28.34%)     5267999 (21.68%)     4111372 (16.92%)     925788 (3.81%)       94766 (0.39%)        92336 (0.38%)
mtrt (Intr)     13476212 (24.79%)    40878066 (75.21%)    39972136 (73.54%)    35178089 (64.72%)    35161782 (64.69%)    34906317 (64.22%)    34906317 (64.22%)
jack (JIT)      23184753 (75.35%)    7584478 (24.65%)     5603077 (18.21%)     1910769 (6.21%)      1283077 (4.17%)      267692 (0.87%)       261538 (0.85%)
jack (Intr)     14806299 (48.42%)    15770268 (51.58%)    15046729 (49.21%)    11499847 (37.61%)    11206312 (36.65%)    10078036 (32.96%)    9958788 (32.57%)
Table 6. Non-Return Indirect Branch Profiling (Kernel Mode) in Java Run Time System
Static branch sites (count and percentage of all indirect branch sites; multiple-target columns are cumulative):

Benchmark       Single target   >=2 targets   >=4          >=8         >=16        >=32        >=64
compress (JIT)  96 (64.43%)     53 (35.57%)   11 (7.38%)   2 (1.34%)   1 (0.67%)   1 (0.67%)   0 (0%)
compress (Intr) 96 (64.86%)     52 (35.14%)   10 (6.76%)   2 (1.35%)   1 (0.68%)   1 (0.68%)   0 (0%)
jess (JIT)      96 (64.43%)     53 (35.57%)   11 (7.38%)   2 (1.34%)   1 (0.67%)   1 (0.67%)   0 (0%)
jess (Intr)     97 (64.67%)     53 (35.33%)   11 (7.33%)   2 (1.33%)   1 (0.67%)   1 (0.67%)   0 (0%)
db (JIT)        93 (64.14%)     52 (35.86%)   11 (7.59%)   2 (1.38%)   1 (0.69%)   1 (0.69%)   0 (0%)
db (Intr)       93 (64.14%)     52 (35.86%)   11 (7.59%)   2 (1.38%)   1 (0.69%)   1 (0.69%)   0 (0%)
javac (JIT)     96 (64.43%)     53 (35.57%)   11 (7.38%)   2 (1.34%)   1 (0.67%)   1 (0.67%)   0 (0%)
javac (Intr)    96 (64.43%)     53 (35.57%)   11 (7.38%)   2 (1.34%)   1 (0.67%)   1 (0.67%)   0 (0%)
mtrt (JIT)      96 (64.00%)     54 (36.00%)   11 (7.33%)   3 (2.00%)   1 (0.67%)   1 (0.67%)   0 (0%)
mtrt (Intr)     98 (65.33%)     52 (34.67%)   10 (6.67%)   2 (1.33%)   1 (0.67%)   1 (0.67%)   0 (0%)
jack (JIT)      96 (64.43%)     53 (35.57%)   11 (7.38%)   2 (1.34%)   1 (0.67%)   1 (0.67%)   0 (0%)
jack (Intr)     96 (64.86%)     52 (35.14%)   10 (6.76%)   2 (1.35%)   1 (0.68%)   1 (0.68%)   0 (0%)

Dynamic instances (count and percentage of all dynamic indirect branches):

Benchmark       Single target     >=2 targets       >=4               >=8              >=16             >=32             >=64
compress (JIT)  29318 (15.58%)    158835 (84.42%)   130108 (69.15%)   10141 (5.39%)    10085 (5.36%)    10085 (5.36%)    0 (0%)
compress (Intr) 23385 (15.16%)    130827 (84.84%)   107332 (69.60%)   8651 (5.61%)     8620 (5.59%)     8620 (5.59%)     0 (0%)
jess (JIT)      161857 (28.77%)   400812 (71.23%)   328655 (58.41%)   56098 (9.97%)    56098 (9.97%)    56098 (9.97%)    0 (0%)
jess (Intr)     157240 (28.90%)   386895 (71.10%)   314456 (57.79%)   54196 (9.96%)    54141 (9.95%)    54141 (9.95%)    0 (0%)
db (JIT)        49028 (18.69%)    213338 (81.31%)   176782 (67.38%)   18313 (6.98%)    18287 (6.97%)    18287 (6.97%)    0 (0%)
db (Intr)       51587 (20.21%)    203624 (79.79%)   166985 (65.43%)   17405 (6.82%)    17380 (6.81%)    17380 (6.81%)    0 (0%)
javac (JIT)     66290 (27.98%)    170649 (72.02%)   144817 (61.12%)   12889 (5.44%)    12866 (5.43%)    12866 (5.43%)    0 (0%)
javac (Intr)    69579 (30.16%)    161097 (69.84%)   135338 (58.67%)   11442 (4.96%)    11395 (4.94%)    11395 (4.94%)    0 (0%)
mtrt (JIT)      64136 (28.58%)    160290 (71.42%)   132995 (59.26%)   11356 (5.06%)    11154 (4.97%)    11154 (4.97%)    0 (0%)
mtrt (Intr)     40881 (22.83%)    138225 (77.17%)   111028 (61.99%)   9582 (5.35%)     9546 (5.33%)     9546 (5.33%)     0 (0%)
jack (JIT)      164068 (28.34%)   414842 (71.66%)   335073 (57.88%)   62812 (10.85%)   62754 (10.84%)   62754 (10.84%)   0 (0%)
jack (Intr)     154050 (33.43%)   306796 (66.57%)   246691 (53.53%)   44195 (9.59%)    44149 (9.58%)    44149 (9.58%)    0 (0%)
4.4 Target Access Patterns

So far we have characterized indirect branch behavior: the distribution of single-target and multiple-target indirect branches, and their static and dynamic statistics. For single-target branches, mispredictions come from aliasing between branches that map to the same branch target buffer entry. For multiple-target branches, mispredictions also depend on the target access pattern. Multiple targets do not directly imply a penalty in prediction accuracy. For example, if one target address is accessed 1,000 times in a row and then the other is accessed 1,000 times, the loss due to interleaving is negligible. However, if the two targets alternate in trace order, the interleaving can cause significant misprediction. To characterize the multiple-target access patterns, we uniquely number both the branch sites and the corresponding targets of each polymorphic branch and show the results for 0.5 million dynamic branch instances in the trace. As shown in Figures 3 and 4, the data are visualized in 3-D for both user and kernel executions, in both JIT and interpreter modes. Each dot plotted in the 3-D space records an occurrence of the following event: at time X (represented by the number of dynamic instances), a given branch Y (represented by branch ID) jumps to its target Z (represented by branch target ID).
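The cost of interleaving can be illustrated with a small model of a BTB that caches only the most recent target of each branch site (a simplification for illustration, not the simulated hardware):

```python
# Simplified BTB model: remember the most recent target of each branch
# site and count a misprediction whenever the next target differs.
def btb_mispredictions(trace):
    last_target = {}          # branch site -> most recently seen target
    misses = 0
    for site, target in trace:
        if last_target.get(site) != target:
            misses += 1       # cold miss or target change
        last_target[site] = target
    return misses

# One polymorphic branch, two targets, 2,000 dynamic instances in total.
blocked     = [("B", "t1")] * 1000 + [("B", "t2")] * 1000   # t1...t1 t2...t2
alternating = [("B", "t1"), ("B", "t2")] * 1000             # t1 t2 t1 t2 ...

print(btb_mispredictions(blocked))      # 2: one cold miss, one target switch
print(btb_mispredictions(alternating))  # 2000: every instance mispredicts
```

With the same targets and the same instance counts, the blocked trace loses almost nothing while the strictly alternating trace mispredicts on every instance, which is exactly the distinction the paragraph above draws.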
Figure 3. Visualization of Branch Target Access Patterns for Non-Return Polymorphic Indirect Branches Invoked in the Java Run Time System (user mode; x: total number of non-return indirect branches invoked, y: static non-return indirect branch ID, z: static branch target ID; cut-out sections are enlarged and shown in Figure 5)
Figure 4. Visualization of Branch Target Access Patterns for Non-Return Polymorphic Indirect Branches Invoked in the Java Run Time System (kernel mode; x: total number of non-return indirect branches invoked, y: static non-return indirect branch ID, z: static branch target ID)

As shown in Figure 3, each benchmark demonstrates a unique target access pattern, ranging from executions dominated by a small group of branches with highly biased targets (such as compress and mtrt in interpreter mode) to executions dominated by a wide range of branches with random target access
patterns (as in javac in interpreter mode and db in JIT mode). For the same benchmark, the two JVM implementations affect the target access patterns drastically. In interpreter mode, the number of active branch sites is small and the target access patterns are highly interleaved, whereas JIT mode usually has more active branch sites with random and sparse access patterns for a given target. Benchmarks compress, mtrt and jack show interesting patterns in interpreter mode: the enlarged cut-out sections of these benchmarks (see Figure 5) show that the target addresses of an indirect branch are interleaved regularly. This is particularly true in compress, where the target address instances repeatedly work through a body of 9 branch sites. In JIT mode, target address accesses do not follow any particular pattern and are distributed randomly. Figure 4 shows that in kernel mode the target access pattern does not vary significantly between interpretation and JIT, since both use operating system kernel services in a similar style [Li00]. This observation also holds for each studied benchmark, which implies that the kernel target access pattern depends more on the JVM and the JVM native interfaces than on the applications themselves.
Figure 5. Enlarged Cut-out Sections from Figure 3 (compress, mtrt and jack in interpreter mode, user code; x: dynamic branch instances, y: branch target ID)

4.5 Baseline Prediction Performance

We have performed a detailed analysis of branch behavior for both user and kernel execution in a Java run time system with different JVM implementations. How do these behaviors fare with a typical processor's branch prediction mechanisms? To find out, we first establish a conservative baseline of branch prediction performance using hardware with constraints on predictor size and organization. Such an approach helps identify bottlenecks and quantify the effect of any optimization through cost-performance trade-offs. This is particularly interesting for Java because the underlying microarchitecture needs to be tailored or augmented to support a wide spectrum of hardware, from portable devices to high-end servers. We instrument the SimOS MXS simulator to perform selective profiling of conditional branches, non-return indirect branches and call/return branches (see Figure 6). We skip direct branches because their target addresses can be computed easily by a fast adder. To obtain the baseline performance, a simple yet typical predictor is used for each type of branch. In user code, the prediction accuracy of conditional branches falls in the range from 90.9% (jess) to
97.1% (compress) in JIT mode and from 92.7% (javac) to 97.9% (compress) with an interpreter. On average, the prediction accuracy is 92.6% in JIT mode and 94.6% in interpreter mode. In kernel code, the prediction accuracy ranges from 80.4% (jess) to 86.8% (jack) in JIT mode and from 82.4% (db) to 88.7% (jack) in interpreter mode; the average is 83.6% in JIT mode and 84.3% in interpreter mode. For indirect branches, JIT mode has a prediction accuracy of 83.3% (jack) to 96.9% (compress), with an average of 87.6%; interpreter mode has a prediction accuracy of 3.6% (compress) to 78.7% (javac), with an average of 50.2%. In kernel mode, the prediction accuracy of indirect branches is relatively high, reaching 93.1% in JIT mode and 95.1% in interpreter mode. Procedure calls/returns are excluded from indirect branches and predicted with a return address stack [Kaeli91, Skad98]. In user code, a 32-entry RAS achieves an average accuracy of 68.8% in JIT mode and 65.9% in interpreter mode. The prediction accuracy of call/return addresses in kernel mode is 94.8% in JIT mode and 94.1% in interpreter mode.
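The return address stack behavior can be modeled in a few lines (an illustrative sketch; the fixed depth and the discard-oldest overflow policy are modeling assumptions, not the simulated hardware's exact behavior). Deep or unbalanced call sequences overflow a fixed-depth stack, one plausible contributor to the low user-mode call/return accuracy reported above:

```python
# Minimal return address stack (RAS) model: calls push, returns pop and
# predict. The fixed depth models the hardware stack; a call chain
# deeper than the stack loses its oldest entries, so the outermost
# returns mispredict.
class ReturnAddressStack:
    def __init__(self, entries=32):
        self.entries = entries
        self.stack = []

    def call(self, return_addr):
        if len(self.stack) == self.entries:
            self.stack.pop(0)          # overflow: oldest entry is lost
        self.stack.append(return_addr)

    def predict_return(self, actual_addr):
        predicted = self.stack.pop() if self.stack else None
        return predicted == actual_addr

ras = ReturnAddressStack(entries=32)
depth = 40                             # call chain deeper than the stack
for i in range(depth):
    ras.call(f"ret{i}")
correct = sum(ras.predict_return(f"ret{i}") for i in reversed(range(depth)))
print(correct)   # 32 of the 40 returns are predicted correctly
```

The innermost 32 returns hit; the 8 outermost return addresses were pushed out of the stack and mispredict.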
Figure 6. Baseline Performance of Branch Prediction Structures in the Java Run Time System. Branch Instructions are Separated into Kernel and User Modes and each Benchmark is Invoked with both a JIT Compiler (J) and an Interpreter (I)

Branches are broken down into three categories: conditional, non-return indirect and call/return. To provide the baseline performance, we use a 1KB branch prediction table (with 2-bit saturating counters) for conditional branches, a 1KB BTB (with a two-bit counter update rule) for indirect branches, and a 32-entry return address stack (with stack pointer repair) for call/return jumps.
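The 2-bit saturating counter used in the branch prediction table above can be sketched as follows (a minimal model for illustration, not the simulator's implementation):

```python
# 2-bit saturating counter: states 0-1 predict not-taken, states 2-3
# predict taken. The counter moves one step toward the actual outcome,
# so a single anomalous outcome does not flip a strongly biased branch.
class TwoBitCounter:
    def __init__(self, state=2):       # start in weakly-taken state
        self.state = state

    def predict(self):
        return self.state >= 2         # True = predict taken

    def update(self, taken):
        correct = self.predict() == taken
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)
        return correct

# An almost-always-taken branch (the kernel pattern discussed below)
# stays well predicted: one not-taken outcome costs one misprediction
# but does not flip the prediction afterwards.
ctr = TwoBitCounter()
outcomes = [True] * 8 + [False] + [True] * 8
correct = sum(ctr.update(t) for t in outcomes)
print(f"{correct}/{len(outcomes)} predicted correctly")
```

The hysteresis of the two-bit state is what keeps highly biased branches accurate; branches without a strong bias, as in kernel control flow, defeat it.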
Figure 6 shows that conditional branches in kernel code cannot be predicted well with a simple predictor. A previous study [Xia99] shows that the transfer of control flow in the operating system differs from that in typical technical applications. Instead of executing tight loops, the operating system frequently transfers control to handle TLB misses, process interrupts, service system calls and execute the core routines involved in context switching. As a result, the frequently invoked instruction sequences are mixed with seldom-executed basic blocks that are bypassed by conditional branches which are almost always taken. The low indirect branch prediction accuracy in user code in interpreter mode stems from the fact that the multiple targets of an indirect branch interleave with each other in the dynamic instances. A BTB scheme with a simple hashing function (e.g., direct mapping) can incur a high misprediction rate. Instead, a path-history-based predictor which captures the distinct patterns leading to a specific branch site may improve performance.

5. Split Branch Prediction

Having characterized the different branch behaviors and examined their performance on a baseline predictor, we find that kernel and user code in a Java run time system use the underlying speculative mechanisms in different styles. For example, user Java code uses indirect branches heavily and accesses the branch target buffer more frequently, while the conditional branches in the kernel cannot be predicted well with a simple predictor. As a result, separating the control flow into kernel and user streams and handling each separately may provide opportunities to optimize the prediction of each branch stream individually. This idea leads to the split branch prediction structure, which is discussed next.

5.1 Architecture

Branch predictors (e.g. GAs, PAp and gshare) typically use combinations of branch PC bits and branch histories to achieve higher prediction accuracy [Yeh93].
Recently, data values have also been correlated with branch outcomes to improve prediction performance [Heil99]. Compared to gshare, the Branch Difference Predictor (BDP) proposed by Heil and Smith, which employs data values as additional heuristics, has been shown to reduce the misprediction rate by up to 33% on SPECInt95 benchmarks. The split branch prediction structure that we propose exploits additional information (i.e., the processor operation mode or privilege level) to separate kernel and user branches and handles them with separate predictors. Typically, a set of bits in the Processor Status Register (PSR) records the mode information. For example, the MIPS R10000 uses the KSU field in the PSR to identify the current execution mode, and Intel's next-generation IA-64 Itanium (Merced) uses PSR.cpl to determine one of four privilege levels (levels 0-3) [MPR0400]. Switching between modes is done by the operating system (in kernel mode) by writing to the corresponding field in the PSR. The processor is forced into kernel mode when operating system services are required and returns to user mode after the kernel services complete. Figure 7 outlines
the block diagram of the split branch prediction structure. The branch instructions coming from the fetch unit are split into kernel and user streams according to the execution mode and fed to separate branch predictors.
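This mode-based dispatch can be sketched as a thin selector in front of two independent tables (a simplified Python model using direct-mapped tables for clarity; the evaluation in Section 5.2 uses a fully associative BTB with LRU replacement, and all names here are illustrative):

```python
# Split branch predictor sketch: the processor's execution mode selects
# one of two independent predictors, so kernel and user branches can
# never alias in the same table, and each table can be sized separately.
class DirectMappedBTB:
    def __init__(self, entries):
        self.entries = entries
        self.table = [None] * entries

    def predict_and_update(self, pc, target):
        idx = pc % self.entries            # simple direct-mapped index
        correct = self.table[idx] == target
        self.table[idx] = target
        return correct

class SplitPredictor:
    def __init__(self, user_entries=512, kernel_entries=32):
        self.user = DirectMappedBTB(user_entries)
        self.kernel = DirectMappedBTB(kernel_entries)

    def predict_and_update(self, pc, target, kernel_mode):
        # The mode check selects one table, so only that (smaller)
        # table is accessed for a given branch.
        btb = self.kernel if kernel_mode else self.user
        return btb.predict_and_update(pc, target)

# Aliasing example: a user branch and a kernel branch that share an
# index in a unified table never collide in the split structure.
unified = DirectMappedBTB(512)
split = SplitPredictor(user_entries=512, kernel_entries=32)
hits_unified = hits_split = 0
for _ in range(100):
    for pc, target, kmode in [(0x1000, 0xA, False), (0x3000, 0xB, True)]:
        hits_unified += unified.predict_and_update(pc, target)
        hits_split += split.predict_and_update(pc, target, kmode)
print(hits_unified, hits_split)
```

In this contrived trace the two branches map to the same unified entry and evict each other on every access, while the split predictor hits on every access after the two cold misses.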
Figure 7. Branch Prediction via Split Branch Predictor

Compared with a conventional design, the split branch prediction approach has several advantages. First, aliasing between kernel and user branches is avoided. Since the operating system's virtual memory system can map kernel and user segments into the same virtual address space, kernel and user branches can map to the same table entries. Previous studies [Sech96] show that for predictors with small resources, aliasing between distinct branches can have a dominating influence on prediction accuracy and can eliminate the advantage gained through branch correlation. Second, dividing the branch stream into smaller streams opens up new opportunities to optimize each with specialized techniques. For instance, various speculative techniques can be tailored to each stream for higher efficiency. Third, since this scheme accesses only one of the smaller modules (kernel or user) for a given branch, it saves energy. In many applications, such as embedded processors used in portable devices, energy efficiency is a major concern; energy dissipation is a function of the size of the memory array (capacitive load and leakage) and the number of accesses [Vija00]. The proposed split branch approach may permit the use of smaller prediction tables without performance penalty in most cases and with performance improvements in some cases.

5.2 Preliminary Simulation Results

We present performance evaluation results of employing this technique on an indirect branch predictor (with a fully associative BTB organization and LRU update policy). We compare the misprediction rates of a unified predictor (denoted Ui, where i is the number of BTB entries) with those of a split branch predictor consisting of a predictor (denoted Sui) for user code and a predictor (denoted Skj) dedicated to kernel branches.
Table 7 shows the performance of combining an Sk32 or an Sk64 with an Sui, where i varies from 512 to 2048. The misprediction rate of Ui, broken down into user (u) and kernel (k) modes, is tabulated. The performance of the predictor U(x+y), which has the same total BTB size as its split counterpart Sux+Sky, is also presented for comparison. All performance data are based on interpreter mode and are shown both as absolute misprediction rates and as percentages of accuracy improvement. Compared with the Ui predictor, the Sui+Sk32 and Sui+Sk64 predictors improve branch prediction performance in both user and kernel modes. For example, in db, Su512+Sk64 decreases the misprediction rate from 26.7% to 23.6% in user mode and from
8.52% to 5.04% in kernel mode, improvements in prediction accuracy of 11.8% in user mode and 40.8% in kernel mode, respectively. The use of the more cost-effective Su512+Sk32 predictor still yields a 38.8% improvement in kernel mode. Over the 4 SPECjvm98 benchmarks, the average improvement relative to unified predictors (e.g. U(512+64) and U(512+32)) with BTB sizes equivalent to their split counterparts (e.g. Su512+Sk64 and Su512+Sk32) is approximately 9% in user mode and 30% in kernel mode. This result indicates that the split branch predictor improves prediction performance by eliminating branch aliasing between user and kernel code. The misprediction rate reduction of the split predictor is more significant when the BTB of the Sui predictor is small. While Su512+Sk64 and Su512+Sk32 require more silicon area than U512, such a comparison may be considered fair since silicon area is often traded for reduced energy consumption [Vija00], and power dissipation is now of much greater concern than die area. The proposed split branch approach permits the use of smaller prediction tables with performance improvement (or without performance penalty) in most cases. Since this scheme accesses only one of the smaller modules (kernel or user) for a given branch, it saves energy. Also, Table 7 shows that even when the total BTB resources are kept constant across the unified and split approaches, we still gain in performance.

Table 7. Indirect Branch Misprediction Rates (%, Interpreter Mode) and Percentage Accuracy Improvement from Combining Sk32/Sk64 with Sui (i = 512, 1024, 2048). BTB Configuration: Fully Associative with LRU Replacement Policy

Configuration  Mode  jack           jess           javac          db
U512           u     44.70          47.77          23.55          26.71
U512           k     10.18          9.79           6.83           8.52
Su512          u     41.75 (6.6%)   44.71 (6.4%)   20.79 (11.7%)  23.56 (11.8%)
Sk64           k     7.65 (24.9%)   7.57 (22.7%)   4.02 (41.1%)   5.04 (40.8%)
Sk32           k     7.82 (23.2%)   7.83 (20.0%)   4.14 (39.4%)   5.21 (38.8%)
U(512+64)      u     44.65 (0.1%)   47.75 (0%)     23.5 (0.2%)    26.66 (0.2%)
U(512+64)      k     10.14 (0.4%)   9.72 (0.7%)    6.78 (0.7%)    8.45 (0.8%)
U(512+32)      u     44.68 (0%)     47.75 (0%)     23.53 (0%)     26.68 (0.1%)
U(512+32)      k     10.16 (0.2%)   9.75 (0.4%)    6.81 (0.3%)    8.5 (0.2%)
U1024          u     43.78          45.99          20.67          24.63
U1024          k     9.86           9.34           5.42           6.77
Su1024         u     41.72 (4.7%)   43.87 (4.6%)   18.96 (8.3%)   22.56 (8.4%)
Sk64           k     7.65 (22.4%)   7.57 (19.0%)   4.02 (25.8%)   5.04 (25.6%)
Sk32           k     7.82 (20.7%)   7.83 (16.2%)   4.14 (23.6%)   5.21 (23.0%)
U(1024+64)     u     43.78 (0%)     45.99 (0%)     20.65 (0.1%)   24.6 (0.1%)
U(1024+64)     k     9.86 (0%)      9.33 (0.1%)    5.42 (0%)      6.77 (0%)
U(1024+32)     u     43.78 (0%)     45.99 (0%)     20.67 (0%)     24.63 (0%)
U(1024+32)     k     9.86 (0%)      9.33 (0.1%)    5.42 (0%)      6.77 (0%)
U2048          u     43.45          45.13          19.17          23.05
U2048          k     8.12           7.97           4.54           5.62
Su2048         u     42.40 (2.4%)   44.09 (2.3%)   18.36 (4.2%)   22.05 (4.3%)
Sk64           k     7.65 (5.8%)    7.57 (5.3%)    4.02 (11.5%)   5.04 (10.3%)
Sk32           k     7.82 (3.7%)    7.63 (4.3%)    4.14 (8.8%)    5.21 (7.3%)
U(2048+64)     u     43.45 (0%)     45.13 (0%)     19.17 (0%)     23.05 (0%)
U(2048+64)     k     8.12 (0%)      7.97 (0%)      4.54 (0%)      5.62 (0%)
U(2048+32)     u     43.45 (0%)     45.13 (0%)     19.17 (0%)     23.05 (0%)
U(2048+32)     k     8.12 (0%)      7.97 (0%)      4.54 (0%)      5.62 (0%)
Table 7 shows that kernel branches benefit from split structures even with small (32- or 64-entry) BTB configurations. This is not surprising, since Table 6 and Figure 4 show that a small number of indirect branches dominate the dynamic branch instances in the kernel. In the kernel, the invocations of indirect branches and the corresponding target access patterns are relatively constant across applications. Figure 8 further illustrates how the performance of Ski varies with different BTB sizes and associativities. The performance is compared with the kernel prediction accuracy of a unified predictor with double the BTB size (U2i). The data shown are the average statistics of all studied benchmarks running in interpreter mode. For a small BTB with 128 or 256 entries, the benefit of split kernel prediction is significant. For instance, with a direct-mapped BTB configuration, an Sk64 reduces the misprediction rate to 6.2%, whereas the kernel misprediction rate of a U128 is 17.8%. Another observation is that kernel branch prediction performance on a Ui predictor depends on BTB associativity more than that of an Ski does. For example, for a 256-entry BTB, increasing the associativity from 2 to 4 decreases the kernel misprediction rate from 11.6% to 8.5% on a U256 predictor. On the Ski predictor, this associativity-related improvement largely disappears: for all of our experiments, the kernel prediction accuracy increases by less than 1% from a direct-mapped to a fully associative BTB configuration. This implies that an Ski predictor works efficiently with a small and simple BTB organization, which results in less hardware complexity and a shorter access time. Previous studies [Kin00] show that a simple, direct-mapped cache has around a 30% reduction in access time over its 2- or 4-way associative counterparts.
Figure 8. Impact of BTB Sizes and Associativities on Kernel Indirect Branch Misprediction Rate (Ski compared with U2i: U128/Sk64, U256/Sk128, U512/Sk256, U1024/Sk512)

6. Conclusions and Future Research

The popularity and wide adoption of Java has necessitated the development of efficient Java run time systems. This study has provided insights into the interaction of the JVM implementation and the operating system with branch prediction mechanisms. To our knowledge, this work is the first to characterize the branch behavior, including kernel activity, of Java applications and to analyze its influence on architectural enhancements. The major findings and contributions of this research are:
• Kernel code has a higher branch instruction frequency (22%) than user code (16%). This observation, together with the significance of kernel activity (an average of 17.3% in JIT mode and 15% in interpreter mode), suggests that kernel instructions and kernel branches play an important role in the execution of a Java program.
• The use of virtual method calls to support the extension and reuse of classes leads Java code to have 12.2% indirect branches in its instruction stream (with a JIT compiler). The additional indirect jumps used to implement the switch statement for case-by-case interpretation further increase this number to 18.4% in interpreter mode.
• The number of instructions per conditional branch in Java (7.9 with JIT and 9.3 with interpreter) and SPECInt95 (10.8) do not differ greatly. However, the SPECjvm98 benchmarks all contain a fairly large number of static conditional branches, with an average of 38K in JIT mode and 30K in interpreter mode, whereas the four studied SPECInt95 benchmarks have only 5.5K static branch sites. The substantially larger number of branch sites in Java execution is due to an instruction footprint larger than that of the statically compiled SPECInt95 benchmarks. A small number of distinct branches contribute the overwhelming majority of the branch instances: 90% of branch instances in Java executions come from fewer than 1,232 branch sites (3.3% of total branch sites) in JIT mode and 590 (2% of total branch sites) in interpreter mode, showing that they are highly biased.
• The number of conditional branch sites in the studied Java executions is relatively constant across applications; the variation across benchmarks is within 8.7%. This implies that a Java run time system spends a significant portion of its time executing the underlying JVM, the JVM native interface libraries and operating system kernel services. Dynamic conditional branch instances are also distributed similarly. This similarity indicates that the studied applications use the JVM and its native libraries in a similar way, and that it may be possible to "preoptimize" a Java run time system to boost ILP performance.
• Monomorphic branches constitute the dominant portion of static branch sites, whereas polymorphic branches are invoked more frequently in the dynamic instances. Polymorphic branches, which constitute 4% of static branches in JIT mode and 5% in interpreter mode, account for 28% of all dynamic indirect branches in JIT mode and 75% in interpreter mode. Compared with the JIT compiler, the interpreter has fewer indirect branch sites. On average, the studied benchmarks have 6,597 static indirect branch sites in JIT mode, but only 3,960 in interpreter mode.
• For the same benchmark, the two JVM implementations affect the target access patterns drastically. In interpreter mode, the number of active branch sites is small and target addresses are highly interleaved, whereas JIT mode usually has more active branch sites with random and sparse access patterns for a given target. In kernel mode, the target access pattern does not vary significantly between interpretation and JIT, since both use operating system kernel services in a similar style.
• In user code, the average prediction accuracy for conditional branches on a simple branch predictor is 92.6% in JIT mode and 94.6% in interpreter mode. In kernel code, the average prediction accuracy is 83.6% in JIT mode and 84.3% in interpreter mode. For indirect branches, JIT mode has a prediction accuracy of 83.3% (jack) to 96.9% (compress), with an average of 87.6%; interpreter mode has a prediction accuracy of 3.6% (compress) to 78.7% (javac), with an average of 50.2%.
• We propose a split branch prediction structure to handle the different branch behaviors of user and kernel code separately. This structure separates the branch instruction stream into kernel and user parts based on the processor's execution mode and uses an optimized prediction mechanism for each. Simulations show that this technique improves prediction accuracy by reducing destructive branch aliasing between kernel and user code. The average improvement from using a 32-entry BTB for kernel branch prediction is approximately 30% for kernel indirect branches and 9% for user indirect branches. The split branch approach permits the use of smaller and simpler prediction tables (e.g. a direct-mapped 32- or 64-entry BTB) for kernel branch prediction. The split scheme also accesses only one of the smaller modules (kernel or user) for a given branch, which saves energy.

Although in this paper we focus only on indirect branches, due to their increased usage in Java execution, we
believe that such an approach would also benefit conditional branch prediction. Future research will focus on exploring the optimal branch prediction mechanism for each part. Meanwhile, we will extend this technique to conditional branch and return address prediction. Also, branch prediction techniques that rely on profiling-based optimization at compile time will be applied to the studied JVM.

Acknowledgments

This research is supported in part by the National Science Foundation under Grants EIA-9807112, CCR-9796098, CCR-9900701, Career Award MIP-9701475, State of Texas ATP Grant #403, and by Sun Microsystems, Dell, Intel, Microsoft and IBM.

References

[Lind99] T. Lindholm and F. Yellin, The Java Virtual Machine Specification, Second Edition, Addison-Wesley, 1999
[Drie98a] K. Driesen and U. Hölzle, Accurate Indirect Branch Prediction, In Proceedings of the 25th Annual International Symposium on Computer Architecture, pages 167-178, 1998
[Drie98b] K. Driesen and U. Hölzle, The Cascaded Predictor: Economical and Adaptive Branch Target Prediction, In Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture, pages 249-258, 1998
[Gloy96] N. Gloy, C. Young, J. B. Chen and M. D. Smith, An Analysis of Dynamic Branch Prediction Schemes on System Workloads, In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 12-21, 1996
[Lee84] J. Lee and A. Smith, Branch Prediction Strategies and Branch Target Buffer Design, IEEE Computer, 17(1), 1984
[Emer97] J. Emer and N. Gloy, A Language for Describing Predictors and its Application to Automatic Synthesis, In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 304-314, 1997
[Jaco97] Q. Jacobson, S. Bennett, N. Sharma and J. E. Smith, Control Flow Speculation in Multiscalar Processors, In Proceedings of the 3rd International Symposium on High Performance Computer Architecture, pages 218-229, 1997
[Chang97] P.-Y. Chang, E. Hao and Y. N. Patt, Target Prediction for Indirect Jumps, In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 274-283, 1997
[Chang94] P.-Y. Chang, E. Hao, T.-Y. Yeh and Y. Patt, Branch Classification: a New Mechanism for Improving Branch Predictor Performance, In Proceedings of the 27th Annual International Symposium on Microarchitecture, pages 22-31, 1994
[Chang95] P.-Y. Chang, E. Hao and Y. N. Patt, Alternative Implementations of Hybrid Branch Predictors, In Proceedings of the 28th Annual International Symposium on Microarchitecture, pages 252-257, 1995
[MPR0300] T. R. Halfhill, JSTAR Coprocessor Accelerates Java, Microprocessor Report, pages 29-33, Mar. 2000
[MPR0400] K. Diefendorff, HP, Intel Complete IA-64 Rollout, Microprocessor Report, pages 1-9, Apr. 2000
[MPR0899] T. R. Halfhill, Sun Reveals Secrets of "Magic", Microprocessor Report, Aug. 1999
[Kaeli91] D. R. Kaeli and P. G. Emma, Branch History Table Prediction of Moving Target Branches due to Subroutine Returns, In Proceedings of the 18th Annual International Symposium on Computer Architecture, pages 34-42, 1991
[Skad98] K. Skadron, P. S. Ahuja, M. Martonosi and D. W. Clark, Improving Prediction for Procedure Returns with Return-Address-Stack Repair Mechanisms, In Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture, pages 259-271, 1998
[McFar93] S. McFarling, Combining Branch Predictors, WRL Technical Note TN-36, Digital Equipment Corporation, June 1993.
[Yeh93] T.-Y. Yeh and Y. N. Patt, A Comparison of Dynamic Branch Predictors that Use Two Levels of Branch History, In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 257-266, 1993.
[Skad99] K. Skadron, P. S. Ahuja, M. Martonosi and D. W. Clark, Branch Prediction, Instruction Window Size, and Cache Size: Performance Trade-Offs and Simulation Techniques, IEEE Transactions on Computers, vol. 48, no. 11, Nov. 1999.
[Heil99] T. H. Heil, Z. Smith and J. E. Smith, Improving Branch Predictors by Correlating on Data Values, In Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture, pages 28-37, 1999.
[Lee00] J. Lee, B.-S. Yang, S. Kim, S. Lee, Y. C. Chung, H. Lee, J. H. Lee, S.-M. Moon, K. Ebcioglu and E. Altman, Reducing Virtual Call Overheads in a Java VM Just-In-Time Compiler, In Proceedings of the 4th Annual Workshop on Interaction between Compilers and Computer Architectures, 2000.
[Yeh91] T.-Y. Yeh and Y. Patt, Two-Level Adaptive Branch Prediction, In Proceedings of the 24th Annual International Symposium on Microarchitecture, pages 51-61, 1991.
[Smith81] J. Smith, A Study of Branch Prediction Strategies, In Proceedings of the 8th International Symposium on Computer Architecture, pages 135-148, 1981.
[Sech96] S. Sechrest, C.-C. Lee and T. Mudge, Correlation and Aliasing in Dynamic Branch Predictors, In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 22-32, 1996.
[Li00] T. Li, L. K. John, N. Vijaykrishnan, A. Sivasubramaniam, J. Sabarinathan and A. Murthy, Using Complete System Simulation to Characterize SPECjvm98 Benchmarks, In Proceedings of the ACM International Conference on Supercomputing, pages 22-33, 2000.
[Radh00] R. Radhakrishnan, N. Vijaykrishnan, L. K. John and A. Sivasubramaniam, Architectural Issues in Java Runtime Systems, In Proceedings of the 6th International Conference on High Performance Computer Architecture, pages 387-398, 2000.
[Hsie96] C.-H. A. Hsieh, J. C. Gyllenhaal and W.-M. W. Hwu, Java Bytecode to Native Code Translation: The Caffeine Prototype and Preliminary Results, In Proceedings of the 29th International Symposium on Microarchitecture, pages 90-99, 1996.
[Radh00b] R. Radhakrishnan, D. Talla and L. K. John, Allowing for ILP in an Embedded Java Processor, In Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 294-305, 2000.
[Vija00] N. Vijaykrishnan, M. Kandemir, M. J. Irwin, H. S. Kim and W. Ye, Energy-Driven Integrated Hardware-Software Optimizations Using SimplePower, In Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 95-106, 2000.
[Rom96] T. H. Romer, D. Lee, G. M. Voelker, A. Wolman, W. A. Wong, J.-L. Baer, B. N. Bershad and H. M. Levy, The Structure and Performance of Interpreters, In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 150-159, 1996.
[Herr98] S. A. Herrod, Using Complete Machine Simulation to Understand Computer System Behavior, Ph.D. Thesis, Stanford University, Feb. 1998.
[Rose95a] M. Rosenblum, S. A. Herrod, E. Witchel and A. Gupta, Complete Computer System Simulation: the SimOS Approach, IEEE Parallel and Distributed Technology: Systems and Applications, vol. 3, no. 4, pages 34-43, Winter 1995.
[Cram97] T. Cramer, R. Friedman, T. Miller, D. Seberger, R. Wilson and M. Wolczko, Compiling Java Just in Time, IEEE Micro, vol. 17, pages 36-43, May 1997.
[Kral98] A. Krall, Efficient JavaVM Just-In-Time Compilation, In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, pages 54-61, 1998.
[Taba98] A. Adl-Tabatabai, M. Cierniak, G. Lueh, V. M. Parakh and J. M. Stichnoth, Fast Effective Code Generation in a Just-In-Time Java Compiler, In Proceedings of the Conference on Programming Language Design and Implementation, pages 280-290, 1998.
[Vija98b] N. Vijaykrishnan, Issues in the Design of a Java Processor Architecture, Ph.D. Thesis, College of Engineering, University of South Florida, July 1998.
[Java] Overview of the Java Platform Product Family, http://www.javasoft.com/products/OV_jdkProduct.html
[Conn97] M. O’Connor and M. Tremblay, PicoJava-I: The Java Virtual Machine in Hardware, IEEE Micro, pages 45-53, Mar. 1997.
[Hsie97] C.-H. A. Hsieh, M. T. Conte, T. L. Johnson, J. C. Gyllenhaal and W. W. Hwu, A Study of the Cache and Branch Performance Issues with Running Java on Current Hardware Platforms, In Proceedings of COMPCON, pages 211-216, 1997.
[SPECJVM98] SPEC JVM98 Benchmarks, http://www.spec.org/osg/jvm98/
[Vija99] N. Vijaykrishnan and N. Ranganathan, Tuning Branch Predictors to Support Virtual Method Invocation in Java, In Proceedings of the 5th USENIX Conference on Object-Oriented Technologies and Systems, pages 217-228, 1999.
[Herr97] S. Herrod, M. Rosenblum, E. Bugnion, S. Devine, R. Bosch, J. Chapin, K. Govil, D. Teodosiu, E. Witchel and B. Verghese, The SimOS User Guide, http://simos.stanford.edu/userguide/
[Benn95] J. Bennett and M. Flynn, Performance Factors for Superscalar Processors, Technical Report CSL-TR-95-661, Computer Systems Laboratory, Stanford University, Feb. 1995.
[Mips94] MIPS Technologies, Inc., R10000 Microprocessor Product Overview, MIPS Open RISC Technology, Oct. 1994.
[Yeag96] K. C. Yeager, MIPS R10000, IEEE Micro, vol. 16, no. 1, pages 28-40, Apr. 1996.
[Cald95] B. Calder, D. Grunwald and A. Srivastava, The Predictability of Branches in Libraries, In Proceedings of the 28th Annual International Symposium on Microarchitecture, pages 24-34, 1995.
[Kin00] J. Kin, M. Gupta and W. H. Mangione-Smith, Filtering Memory References to Increase Energy Efficiency, IEEE Transactions on Computers, vol. 49, no. 1, Jan. 2000.
[Xia99] C. Xia and J. Torrellas, Comprehensive Hardware and Software Support for Operating Systems to Exploit MP Memory Hierarchies, IEEE Transactions on Computers, May 1999.
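To make the proposed mechanism concrete, the split branch target buffer described in the conclusion can be sketched as follows. The table sizes (a small direct-mapped kernel BTB alongside a larger user BTB) follow the configurations discussed in the paper; the word-granularity index/tag split and the Python modeling itself are illustrative assumptions, not the paper's implementation.

```python
class DirectMappedBTB:
    """A direct-mapped branch target buffer: each branch PC maps to one entry."""

    def __init__(self, entries):
        self.entries = entries
        self.table = [None] * entries  # each slot holds (tag, target) or None

    def _index_tag(self, pc):
        word = pc >> 2                 # assume 4-byte-aligned instructions
        return word % self.entries, word // self.entries

    def predict(self, pc):
        idx, tag = self._index_tag(pc)
        entry = self.table[idx]
        if entry is not None and entry[0] == tag:
            return entry[1]            # predicted branch target
        return None                    # BTB miss

    def update(self, pc, target):
        idx, tag = self._index_tag(pc)
        self.table[idx] = (tag, target)


class SplitBTB:
    """Routes each lookup to a kernel or user BTB by the processor's
    execution mode, so kernel and user branches can no longer alias
    destructively in a shared table."""

    def __init__(self, kernel_entries=32, user_entries=512):
        self.kernel = DirectMappedBTB(kernel_entries)
        self.user = DirectMappedBTB(user_entries)

    def _select(self, kernel_mode):
        # Only the selected (smaller) module is accessed per lookup,
        # which is the source of the energy savings noted above.
        return self.kernel if kernel_mode else self.user

    def predict(self, pc, kernel_mode):
        return self._select(kernel_mode).predict(pc)

    def update(self, pc, target, kernel_mode):
        self._select(kernel_mode).update(pc, target)
```

In this sketch, a user branch and a kernel branch at the same (or index-conflicting) PC occupy entries in separate modules, so neither can evict the other's target, which is exactly the aliasing the split scheme eliminates.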