Improving Branch Predictability in Java Processing
Tao Li✝, Lizy Kurian John✝ and Vijaykrishnan Narayanan✻
✝ Laboratory for Computer Architecture, Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX 78712, {tli3,ljohn}@ece.utexas.edu
✻ 220 Pond Lab, Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, [email protected]
Abstract

Java programs are becoming increasingly prevalent on numerous platforms ranging from embedded systems to enterprise servers. Dynamic translation (interpretation and compilation), frequent calls to native interface libraries or OS kernel services, and abundant use of virtual methods by Java programs can complicate the intrinsic predictability of the control flow that an ILP machine with deep speculation exploits. This paper focuses on adapting processor microarchitecture to achieve accurate and efficient control flow speculation in Java processing. It presents insight into the branch behavior of a standard JVM running on a commercial operating system with real workloads. Employing a complete system simulation environment, we profile, analyze and quantify performance issues of branch prediction structures for both user and kernel execution. The impact of different JVM styles (JIT compiler and interpreter) on branch behavior is also studied. We find that: (1) kernel instructions and kernel branches play an important role in a Java runtime system; (2) kernel/user branch aliasing constitutes up to 35% of total branch aliasing and is more significant in the global-history-based GAg, GAs and Gshare predictors; (3) multiple-target (polymorphic) branches dominate indirect branch instances and cause high BTB miss rates. We propose two techniques - the split branch predictor and the target history promotion buffer (THPB) - to improve branch prediction efficiency and accuracy for Java processing. The OS-aware split branch predictor selects its prediction strategy based on the processor's execution mode and uses an optimized prediction mechanism for user and kernel space individually. The target history promotion buffer dynamically promotes the less predictable polymorphic branches to a small dedicated buffer (in which their target history patterns are built) and rehashes BTB entries for the interleaved multiple targets.
Simulations using SPECjvm98 show that, compared to the unified GAg and Gshare schemes, the user/kernel split counterparts reduce mispredictions by up to 50% and 90% in user and kernel space, respectively. Hybridization of split predictors provides further performance improvements. For example, for most benchmarks, a hybrid predictor with the GAs technique for user code and the Gshare technique for kernel code yields good performance. By exploiting target locality and BTB entry rehashing, a 16-entry THPB contributes a 48% - 70% reduction in BTB miss rate.

1. Introduction

Java is becoming increasingly prevalent with the blooming of the Internet, E-commerce and intelligent mobile devices. Its "write-once run-anywhere" promise has resulted in portable software and standardized interfaces spanning embedded systems to enterprise servers. To adhere to this motto, Java programs are first converted to a machine-independent format known as bytecodes, and a Java Virtual Machine (JVM) [Lind99] dedicated to a specific hardware platform is then invoked to execute the bytecodes. Early JVM implementations relied on interpretation [Java and Rom96]. Direct interpretation of bytecodes leads to poor performance. Dynamic compilation of bytecodes to native machine instructions at runtime using a Just-In-Time (JIT) compiler [Cram97] can improve performance. So far, interpretation and JIT compilation are the two popular
JVM implementations for Java processing, and they coexist in state-of-the-art JVM technologies such as Sun's HotSpot [HotSpot99]. The execution of Java applications depends heavily on the JVM as well as the underlying operating system (OS). A JVM environment can be significantly different from that required to support traditional C or FORTRAN codes: many Java classes require services provided by the underlying operating system and can spend a significant amount of time exercising the native code interfaces and OS services [Li00]. Moreover, Java is designed to support the safe execution of portable programs across platforms and networks and hence provides
additional language features (e.g. dynamic class loading and validation, runtime
compilation/interpretation and exception checking, garbage collection, multithreading and synchronization), which are absent in other object-oriented languages like C++. Current ILP processors use prediction techniques to speculatively execute instructions. Correct control flow predictions help alleviate control dependences and provide a smooth stream of instructions to the aggressive execution engine. With the trend toward wider-issue and deeper-pipeline designs, the increasing control flow misprediction penalty can quickly throttle the performance of a superscalar machine. Hence, despite steady improvements, branch prediction is still considered a key hurdle for current and future microprocessors [Heil99]. Consequently, there continues to be ongoing research to improve branch prediction accuracy, especially on real applications and emerging workloads. Dynamic translation (interpretation and compilation), frequent calls to native interface libraries or OS kernel service routines, and abundant use of virtual methods to support the extension and reuse of classes can complicate the intrinsic predictability of control flow in Java processing. Previous studies [Chang97, Yeh93, Skad99 and Heil99] have mainly focused on optimizing branch prediction performance on general-purpose processors using application-only references from benchmarks such as SPECInt95. While the results from these studies are suggestive and useful, branch behavior during Java processing is not well understood. Moreover, given the importance of kernel execution in Java processing, it is interesting and necessary to examine how the operating system uses the underlying microarchitecture for effective speculation. This paper focuses on understanding Java branch behavior and adapting processor microarchitecture to achieve accurate and efficient control flow speculation in Java processing.
This paper presents the results of an in-depth, complete-system analysis of branch characteristics, and of microarchitectural enhancements, for a standard JVM running on a commercial OS with real applications. We observe that kernel branch execution forms a significant portion of the overall branches and increases pressure and competition on the underlying branch prediction resources. In the studied Java runtime system, kernel/user branch aliasing is found to constitute up to 35% of total branch aliasing and is more significant in the global-history-based GAg, GAs and Gshare predictors. The brief and intermittent kernel executions, due to the lightweight implementation of interrupt/exception handlers, and the persistent kernel activities in the Java runtime system cause
kernel branches to fail to build up and exploit useful branch history patterns and steady counter states in the shared but highly aliased branch prediction tables. This observation motivates an OS-aware branch predictor design, in which the privilege (or execution mode) bits that are historically used for security purposes (e.g., preventing illegal page table modification) are employed to protect performance-critical microarchitecture resources like branch prediction tables. Our experiments show that, compared to the unified GAg and Gshare schemes, the user/kernel aliasing-free split predictors using the same hardware resources reduce mispredictions by up to 50% and 90% in user and kernel space. Hybridization of split predictors provides further performance improvements. Yet another contribution of this paper is a cost-effective Branch Target Buffer (BTB) extension for accurate indirect branch target prediction. We find that multiple-target (polymorphic) branches dominate and cause high BTB miss rates in indirect-branch-intensive Java processing. The highly mispredicted polymorphic branches exhibit a new locality - target address locality. We propose a novel BTB extension structure, the Target History Promotion Buffer (THPB), which employs a target-centric approach to caching target patterns in order to exploit the target locality phenomenon. The THPB structure is used in conjunction with a BTB to achieve improved target prediction. Due to its ability to capture both target patterns and target locality, a small THPB can provide significant reductions in BTB miss rates. Our experiments demonstrate that by augmenting a 4-way BTB of size 2KB with a direct-mapped 16-entry THPB, we can obtain reductions in BTB miss rates ranging from 48% to 70%. The rest of this paper is organized as follows. Section 2 describes the simulation-based experimental setup and benchmarks.
Section 3 gives insight into the branch characteristics of a Java runtime system and examines how OS and user code behave on the underlying branch prediction schemes. Section 4 describes our proposed OS-aware split branch prediction structure and evaluates its benefits. Section 5 illustrates a cost-effective BTB design enhancement exploiting polymorphic branch target locality and BTB entry rehashing and evaluates various design trade-offs. Section 6 discusses related work. Finally, Section 7 summarizes the contributions of this work and outlines directions for future research.
2. Experimental Methodology and Benchmarks
To capture the entire execution of the JVM, Java workloads and a commercial OS, we use a full system simulation environment, SimOS [Rose95], to study Java branch characteristics. A Silicon Graphics Inc. port of the JDK (Java Development Kit) from Sun Microsystems and the SPECjvm98 benchmarks [SPECjvm98] (described in Table 1) are simulated on top of the IRIX 5.3 operating system. In this paper we exclude the benchmark mpegaudio from our experimental evaluations because we could not make it run on the superscalar model of SimOS.
Table 1 SPECjvm98 Workloads

Benchmark   Description
compress    Modified Lempel-Ziv method (LZW) to compress and decompress large files
jess        Java expert shell system based on NASA's CLIPS expert system
db          Performs multiple database functions on a memory-resident database
javac       The JDK 1.0.2 Java compiler compiling 225,000 lines of code
mtrt        Dual-threaded raytracer
jack        Parser generator with lexical analysis, early version of what is now JavaCC
We collect system traces from a heavily instrumented SimOS simulator and then feed them to our back-end branch prediction simulators. The traces are generated by an instrumented SimOS MXS model [Benn95], which simulates a superscalar microprocessor with multiple instruction issue, register renaming, dynamic scheduling, and speculative execution with precise exceptions. The simulated architectural model is an 8-issue superscalar processor with MIPS R10000 [Yeag96] instruction latencies. Unlike the MIPS R10000, our processor model has a 128-entry instruction window, a 128-entry reorder buffer and a 64-entry load/store buffer. The instruction and data accesses of both applications and the operating system are modeled. We simulate each benchmark on the SimOS MXS model until completion, except for benchmark compress invoked with an interpreter, which takes an extremely long time to finish. In this case, we use the first 2,000M instructions as the representative execution window, based on profiling the entire execution on the SimOS Mipsy model.

Table 2 Branch Predictor Configurations

Scheme.size    PC bits for       PC bits for   BHSR bits for BHT      Size of scheme
(i=1..6)       BHSR selection    BHT index     index (path length)    (# of BHT entries)
2bc.2^iK       0                 i+10          0                      2^i K
GAg.2^iK       0                 0             i+10                   2^i K
GAs.2^iK       0                 i+6           4                      2^i K
Gshare.2^iK    0                 i+10          i+10                   2^i K
SAg.2^iK       i+9-log2(i+9)     0             i+9                    2^i K
SAs.2^iK       i+8               i+5           4                      2^i K
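The index computations summarized in Table 2 can be illustrated with a small sketch (our own illustration, not the authors' simulator; bit widths follow the table, and dropping the two word-offset PC bits is an assumption for a MIPS-like ISA):

```python
# Illustrative BHT index functions for the global-history schemes of Table 2,
# for a BHT with 2**k two-bit counters.

def gag_index(ghr: int, k: int) -> int:
    # GAg: index purely by the k most recent global history bits.
    return ghr & ((1 << k) - 1)

def gshare_index(pc: int, ghr: int, k: int) -> int:
    # Gshare: XOR k branch-address bits with k global history bits.
    mask = (1 << k) - 1
    return ((pc >> 2) ^ ghr) & mask   # low PC bits above the word offset

def gas_index(pc: int, ghr: int, k: int, hist_bits: int = 4) -> int:
    # GAs: concatenate (k - hist_bits) PC bits with hist_bits history bits.
    pc_part = (pc >> 2) & ((1 << (k - hist_bits)) - 1)
    return (pc_part << hist_bits) | (ghr & ((1 << hist_bits) - 1))
```

The same pattern extends to SAg/SAs by first selecting a per-branch BHSR with PC bits instead of using a single global register.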
The various branch prediction schemes we evaluate in this study are summarized in Table 2. The examined schemes, ranging from a simple per-branch 2-bit saturating counter table (2bc) indexed by the branch instruction address to more sophisticated two-level adaptive schemes that exploit patterns in the recent global (GAg, GAs and Gshare) or local (SAg and SAs) branch history, have been shown to be successful at predicting user-level branches. Branch prediction schemes are represented by "name.size" (as in [Gloy96]), where "name" falls into the taxonomy proposed by Yeh and Patt [Yeh93] and "size" is the number of 2-bit counter entries in the Branch History Table (BHT). The two-level adaptive schemes use Branch History Shift Registers (BHSRs) to record the recent branch history: GAg, GAs and Gshare use a single BHSR to record and maintain global history information, while the SAg and SAs schemes map each program branch into a table of BHSRs. The content of the selected BHSR is combined with a portion of the branch address to select a
BHT entry. For example, SAs.16K contains 4K BHSR entries with a 4-bit history path, and 512 BHT sets, each of which consists of 16 2-bit counter entries indexed by the 4-bit history path. In our study, we use a 2K-entry, 4-way Branch Target Buffer (BTB) with an LRU replacement policy to provide the targets of control flow for predicted-taken branches.

3. Branch Behavior of a Java Runtime System

The goal of this paper is to efficiently support control flow prediction in Java processing. We begin by characterizing the branch behavior of a Java runtime system (with different implementation styles and including operating system activity).

3.1 Branch Frequency and Mix

Table 3 Branch Frequency and Mix (jit: JIT compilation, intr: interpretation)

                     Kernel Branches per Instruction           User Branches per Instruction
Benchmarks        All    Cond.  Direct Call/Ret Indirect    All    Cond.  Direct Call/Ret Indirect
intr compress     0.202  0.184  0.008  0.009    0.001       0.177  0.153  0.000  0.009    0.015
intr jess         0.215  0.184  0.014  0.015    0.002       0.158  0.112  0.001  0.020    0.025
intr db           0.261  0.232  0.013  0.014    0.002       0.156  0.114  0.002  0.020    0.021
intr javac        0.254  0.229  0.012  0.012    0.002       0.155  0.119  0.001  0.017    0.019
intr mtrt         0.227  0.206  0.010  0.010    0.001       0.171  0.138  0.000  0.015    0.017
intr jack         0.164  0.144  0.009  0.010    0.002       0.165  0.124  0.000  0.018    0.023
jit  compress     0.254  0.230  0.011  0.012    0.001       0.143  0.108  0.000  0.001    0.034
jit  jess         0.211  0.182  0.013  0.014    0.002       0.144  0.102  0.000  0.013    0.029
jit  db           0.260  0.231  0.013  0.014    0.002       0.153  0.107  0.002  0.019    0.026
jit  javac        0.251  0.226  0.011  0.012    0.001       0.155  0.110  0.001  0.019    0.025
jit  mtrt         0.230  0.209  0.010  0.011    0.001       0.137  0.108  0.000  0.006    0.023
jit  jack         0.149  0.132  0.008  0.008    0.001       0.149  0.109  0.000  0.013    0.026
Table 3 presents the branch profiling results of the studied Java runtime system and OS running the SPECjvm98 benchmarks. Branch instructions are categorized as conditional branches; direct branches, which unconditionally redirect the instruction stream to a statically specified target encoded in the instruction itself; (non-return) indirect branches, which transfer control to an address stored in a register; and call/returns, which use jump-and-link instructions (e.g., jal, jalr) and a specified architectural register (e.g., r31 on MIPS machines). For each benchmark, the table lists branch frequency, expressed as branches per instruction for each category, in both user and kernel modes. The execution of these benchmarks under both a JIT compiler (jit) and an interpreter (intr) is profiled. In user mode with a JIT compiler, conditional branches contribute on average 77% of total branches, and the rest represent 0.4% (direct), 10.4% (call/return) and 12.2% (indirect) of all branches. Compared with kernel code, the higher indirect branch mix corresponds to virtual method calls in Java code. Virtual method calls in Java incur a performance penalty because the targets of these calls can only be determined at run time. JIT compilers such as Kaffe, CACAO and LaTTe [Lee00] typically maintain a virtual method table for each loaded class. A virtual method invocation is then translated into an indirect function call after two loads. (For
statically bound method calls, JIT compilers generate a direct jump at the call site.) Table 3 indicates that interpretation increases the indirect branch ratio from 12.2% to 18.4%. The interpreter mode results in a higher frequency of indirect control transfers due to the additional indirect jumps used to implement the switch statement for case-by-case interpretation [Radh00]. Table 3 also reveals that the kernel has a higher number of branches per instruction than user code. Kernel branches include loops, error and bounds checking, and other routine conditionals. Error and bounds checking branches are abundant in the operating system because it must be designed to handle all possible situations. This observation, together with the significance of kernel activity as observed by [Li00], suggests that kernel instructions and kernel branches play an important role in the execution of a Java program. Moreover, with indirect branches becoming more frequent relative to conditional branches, their prediction penalty can start to dominate the overall branch misprediction cost.

3.2 Implications of Kernel Branches

Table 4 Characterization of User/Kernel Branch Behaviors (in JIT mode)
               User Statistic                       Kernel Statistic
Benchmarks     Static sites   Dynamic instances     Static sites   Dynamic instances
db             33,957         13,147,512            6,016          19,742,706
jess           38,654         35,986,299            6,037          28,266,026
javac          38,815         34,766,245            6,070          20,807,714
jack           40,640         210,722,195           6,142          40,451,532
mtrt           36,629         195,674,102           6,099          23,343,298
Table 4 reveals the significance of kernel branches by showing the number of static branch sites and dynamic branch instances (of conditional branches) in user and kernel space when executing the JVM (with a JIT compiler) and the SPECjvm98 benchmarks. Overall, the OS is found to exercise around 6K branch sites, approximately 1/6th of what user code invokes. The relatively constant number of kernel branch sites across workloads stems from the similarity of kernel behavior on the different benchmarks [Li00]. The kernel portion of dynamic branch instances is found to be more significant. For example, in benchmark db the OS exercises around 6.6M more branch instructions than user code does. On average, kernel branches constitute 14% of branch sites and 21% of dynamic instances in our collected system traces. This fact justifies the use of system traces for accurate microarchitecture study and motivates the necessity of investigating the impact and implications of kernel branches on branch prediction schemes. Our analysis [Li01] shows that the highly biased branch site distribution, combined with the abundance of runtime exception checking code (e.g., class verification and array bounds checking) that is seldom taken, leads to a low fraction (26%, data not shown) of taken branch sites. This implies that for conditional branches in user space, provided the direction of control flow can be predicted accurately, the number of BTB entries
allocated for taken branch sites is small. This feature is exploited by a BTB rehashing mechanism, which will be discussed in Section 5. As kernel branch execution forms a significant portion of the overall branches, there is increased pressure and competition on the underlying branch prediction resources. The study by Gloy et al. [Gloy96] suggests that the prediction accuracies of current implementations of dynamic prediction schemes are negatively affected by aliasing. Aliasing occurs when different branch sites are assigned to the same entry of prediction hardware structures such as the BHT, BHSRs and BTB.
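As a toy illustration of such aliasing (the addresses below are made up for the example), any two branches whose index bits coincide share a single counter in an address-indexed BHT:

```python
# Toy illustration of BHT aliasing: with address-indexed tables, any two
# branches whose low PC bits match share one 2-bit counter.

ENTRIES = 4096                        # a 4K-entry BHT

def bht_index(pc: int) -> int:
    return (pc >> 2) % ENTRIES        # drop the 2 word-offset bits

user_branch = 0x0040_2A74             # hypothetical user-space branch
kernel_branch = 0x8000_6A74           # hypothetical kernel-space branch
assert bht_index(user_branch) == bht_index(kernel_branch)
```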
[Figure 1 (stacked bars, one per-BHT-entry panel and one per-BHT-reference panel for each of jess, db, javac and jack, across the 2bc, GAg, GAs, Gshare, SAg and SAs predictors; categories: K-K Hit, U-U Hit, K-K Aliasing, U-U Aliasing, K-U Aliasing, Cold Misses, Unmapped.]
Figure 1 Branch Aliasing Breakdown Based on BHT Entry and BHT Reference (in JIT mode)

To quantify the impact of this issue in the context of Java processing, we instrument our branch prediction simulators to record the mapping histograms between branch instructions and BHT entries. We capture both per-BHT-entry and per-BHT-reference events and attribute them to one of the following categories: (1) if a BHT entry is mapped to the same branch site from user (or kernel) space, we record an instance of user hit (or kernel hit); (2) if a BHT entry is mapped to different branch sites from user (or kernel) space, we refer to it as user aliasing (or kernel aliasing); (3) if a BHT entry is mapped to branches from different spaces, we call it user/kernel aliasing; (4) if a BHT entry is never mapped during the simulation, we treat it as unmapped. Finally, a cold miss is counted when a BHT entry is first mapped to a branch. To reduce the effect of capacity misses, we examine the mapping behavior on an 8K-entry BHT. The results for 4 of the 6 studied benchmarks are illustrated in Figure 1. Figure 1 shows that in a processor with fine-grained resource sharing like a superscalar, the presence of OS branches changes the utilization of microarchitecture resources such as the BHT. For example, in benchmark jess, kernel branches occupy 4% (on the SAg predictor) to 18% (on the 2bc predictor) of BHT entries and constitute 22%
(on the SAg predictor) to 44% (on the 2bc predictor) of BHT references. The importance and impact of kernel branches on different predictors are found to differ due to the variety of mapping schemes. Kernel-kernel aliasing (< 10% of total aliasing in most cases) is found to be less significant than user-user aliasing (> 55% of total aliasing) because of the fewer branch sites in the kernel. User-kernel aliasing is present in all examined predictors. It happens infrequently in the 2bc predictor, since address-based indexing efficiently distinguishes different branches within a large (8K-entry) table. However, this conflict increases in the two-level adaptive predictors and is more significant in the global-history-based predictors. In the Gshare and GAg predictors, for example, user/kernel aliasing occurs in 10%-18% of BHT entries and accounts for 5%-9% of BHT references and 30%-35% of total aliasing. BHSR resource competition between user and kernel space also leads to noticeable user-kernel aliasing in the SAg and SAs predictors, despite their aliasing-resistant, local-history-based BHT indexing mechanism. Aliasing may not directly imply penalties on prediction accuracy, provided that an aliased branch is executed enough times (to amortize this "context switch" cost) before a conflict occurs. During the execution of Java programs, the instruction streams of user applications (e.g., the JIT compiler and translated native methods) and kernel instructions (invoked by interrupts, exceptions or system calls) alternate with each other. Hence, branches from user and kernel space alternate in trace order. Our OS characterization [Li00 and Li01b] shows that most kernel branches come from frequently invoked routines such as TLB handling (utlb and tlb_miss), paging management such as copy-on-access and zero-fill page allocation (demand_zero), file system services (read and write) and other frequent processing events (clock).
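The per-reference classification described above amounts to simple bookkeeping per BHT entry; the following sketch (our hypothetical reconstruction of the instrumentation, not the authors' code) labels each reference:

```python
# Hypothetical sketch of the BHT-reference classification behind Figure 1:
# each reference is labeled by comparing the referencing branch (site, mode)
# with the branch last mapped to that BHT entry.

def classify(entry: dict, site: int, mode: str) -> str:
    """entry holds the last (site, mode) mapped to this BHT slot;
    mode is 'user' or 'kernel'. Returns the event label and updates entry."""
    if entry.get('site') is None:
        label = 'cold miss'                      # entry first mapped
    elif entry['site'] == site and entry['mode'] == mode:
        label = f'{mode} hit'                    # same branch re-references
    elif entry['mode'] == mode:
        label = f'{mode}-{mode} aliasing'        # different branch, same space
    else:
        label = 'user-kernel aliasing'           # branch from the other space
    entry['site'], entry['mode'] = site, mode
    return label
```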
In OS designs, in order to minimize overhead, lightweight exception handlers are usually used to swiftly handle frequently occurring events such as TLB misses. Hence, OS kernel execution tends to be brief and intermittent because of this exception handling style and its lightweight, highly optimized code. These exception-driven, short-term execution characteristics [Li01b] cause kernel branches to fail to build up and exploit persistent and useful branch history patterns and steady counter states in shared BHSR or BHT prediction environments. In such environments, useful kernel branch history and kernel BHT entries are either destroyed or displaced by user branch information, due to the long periods of execution in user space and the visible user-kernel branch aliasing (shown in Figure 1). This implies that providing a dedicated branch prediction structure to the OS may yield better control flow speculation for branch-rich kernel code.

4. Split Branch Prediction

To eliminate the aliasing problem and exploit the distinct runtime behaviors of user and kernel code on the underlying microarchitecture, we propose a split branch prediction structure that separately handles user and kernel branches, as described below.
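In essence, the processor's execution mode selects which of two independent predictor banks sees a branch. A minimal sketch (our own illustration using simple 2-bit-counter banks; the class and method names are assumptions, not the paper's hardware description):

```python
# Minimal sketch of an OS-aware split predictor: the execution mode (from
# the PSR) selects an independent predictor bank, so kernel branches never
# alias user prediction state.

class TwoBitCounterTable:
    def __init__(self, entries: int):
        self.table = [1] * entries            # 2-bit counters, weakly not-taken
        self.mask = entries - 1

    def predict(self, pc: int) -> bool:
        return self.table[(pc >> 2) & self.mask] >= 2

    def update(self, pc: int, taken: bool):
        i = (pc >> 2) & self.mask
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)

class SplitPredictor:
    def __init__(self, user_entries: int, kernel_entries: int):
        self.banks = {'user': TwoBitCounterTable(user_entries),
                      'kernel': TwoBitCounterTable(kernel_entries)}

    def predict(self, pc: int, mode: str) -> bool:
        return self.banks[mode].predict(pc)   # mode comes from the PSR

    def update(self, pc: int, mode: str, taken: bool):
        self.banks[mode].update(pc, taken)
```

Training a branch in one space leaves the other space's state untouched, which is exactly the aliasing-free property the split structure targets.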
Branch predictors typically use combinations of branch PC bits and history patterns to achieve high prediction accuracy [Yeh93]. Recently, data values have also been correlated to improve prediction performance [Heil99]. The split branch prediction structure that we propose exploits additional information (i.e., processor operation modes or privilege levels) to separate kernel and user branches and handles them using separate predictors. Filtering out kernel branches can easily be done at run time by using the Processor Status Register (PSR). Typically, a set of PSR bits records and identifies the kernel/user execution mode or privilege level. For example, the MIPS R10000 uses the KSU field in the PSR to identify the current execution mode, and Intel's next generation IA-64 Itanium (Merced) uses PSR.cpl to determine one of 4 privilege levels (levels 0-3) [MPR0400]. The OS switches modes by writing into the corresponding field in the PSR. Usually, a processor is forced into kernel mode when OS services are required and returned to user mode after the kernel services complete. Figure 2 outlines the block diagram of the split branch prediction structure. Instructions from the fetch unit, combined with branch prediction hints (e.g., branch history pattern and data value, selected by the execution mode), are filtered into an active prediction structure (kernel or user, depending on the execution mode). The active branch predictor is then employed to identify branch instructions, predict the outcome of control flow and update microarchitecture state for resolved branches.

[Figure 2: block diagram - branch prediction hints (PC, history, data value) and the execution mode steer instructions toward either the kernel-optimized or the user-optimized branch predictor.]
Figure 2 Split Branch Prediction

4.1 Performance of Splitting

Intuitively, split branch prediction demands more silicon by allocating resources for both kernel and user space. Therefore, one design issue is to improve prediction accuracy (via splitting) while avoiding an increase in hardware budget. To evaluate the efficiency of a simple half-and-half splitting policy with fixed resources and to identify the potential design space, we perform a misprediction rate study on a unified predictor and on a combination of two identical split branch predictors of half the size. Note that all hardware resources, including the BHSRs, BHT and BTB, are split evenly between OS and user code in our simulation study. We vary the unified predictor size from 4KB to 64KB and the results are shown in Figure 3. Also, the
aggregate misprediction rate of unified predictor is broken down and normalized to user and kernel parts separately for comparison.

[Figure 3 (one kernel/user panel pair per predictor: 2bc, GAg, GAs, Gshare, SAg, SAs): misprediction rates (%) for compress, db, jess, javac, jack and mtrt, comparing unified predictors of 4KB-64KB (U4K-U64K) against split predictors of half the size (S2K-S32K).]
Figure 3 Misprediction Rates (Normalized) on Half-and-Half Splitting (U: Unified, S: Split)

Figure 3 shows that the simple half-and-half split branch predictors reduce mispredictions of both user and kernel code on GAg and Gshare in all studied benchmarks. On these two predictors, splitting yields a 30% (db) to 90% (jack) misprediction reduction in kernel code and 9% (compress) to 50% (jack) in user code. The half-and-half splitting also benefits kernel branch prediction on the SAg, SAs and GAs predictors, but the improvement is smaller (< 5%) than on GAg and Gshare. Another observation is that the impact of splitting on kernel branches varies across benchmarks. For example, in benchmark jack, splitting contributes to misprediction reduction on all studied prediction schemes. The half-and-half splitting policy is found to penalize prediction accuracy for user code on the SAg and SAs predictors, increasing misprediction rates by up to 7%.
4.2 Hybridizing Optimized Split Predictors

Figure 3 reveals that kernel and user code favor different branch prediction schemes because of their different dynamic behaviors. Surprisingly, we find that the kernel has better branch prediction accuracy than user code on the SAg and SAs predictors. The use of splitting also yields accurate kernel prediction on the GAg and Gshare predictors. As a result, separating control flow into kernel and user modes provides opportunities to optimize the prediction of each branch stream individually. Table 5 further compares the performance of unified branch prediction schemes with hybridized split predictors. Combining hybridization and splitting leads to the best predictors (shaded and underlined in Table 5) for Java processing. For example, the misprediction rate of benchmark jack drops from 9.6% on a 4KB unified Gshare predictor to 4.1% on a hybrid GAs(U)+Gshare(K)
configuration
of
the
same
size.
Statistically,
the
SAs(U)+Gshare(K)
and
GAs(U)+Gshare(K) configurations provide the best performance on all studied benchmarks. Among other examined hybrid split predictors, we find that the simple 2bc(U)+Gshare(K), 2bc(U)+GAg(K) configurations yield performance comparable to that of the best predictor. The performance of dynamic, adaptive predictors are largely dependent on the runtime branch information that can be captured and correlated. Unfortunately, the persistent control flow transfers between OS and user codes, augmented with different runtime behaviors of each part, yields dynamic branch patterns that may not be efficiently captured and exploited by “single context” branch predictors. Historically, privilege bits are used to
protect restricted and critical resources, such as page tables and process control blocks, for security reasons. Our study shows that on a superscalar microprocessor with fine-grained resource sharing, protecting performance-critical microarchitectural hardware is also necessary for performance reasons.
Table 5 Misprediction Rates (%) on Hybrid Split Predictors (best configurations: SAs(U)+Gshare(K) and GAs(U)+Gshare(K))
Predictor (misprediction % at 4k/16k/32k BHT entries)   db              jess            javac           jack            mtrt
Unified SAg                                             12.1/11.6/11.5  9.5/8.8/8.7     10.6/10.2/10.1  5.3/4.7/4.7     7.8/7.6/7.6
Unified SAs                                             5.7/5.1/4.7     11.4/9.4/8.7    8.9/7.5/6.9     11.2/9.0/8.2    5.3/4.4/4.1
Unified GAg                                             7.7/6.9/6.7     7.5/6.7/5.7     7.7/6.6/6.4     4.7/3.6/3.1     4.8/4.6/4.5
Unified Gshare                                          5.6/4.2/3.8     9.9/7.7/6.7     8.0/6.4/5.7     9.6/6.7/5.7     4.8/3.6/3.1
Unified GAs                                             4.2/3.3/2.8     6.4/4.9/4.0     6.0/4.9/4.1     5.2/3.5/2.4     3.5/2.8/2.3
Unified 2bc                                             4.8/3.9/3.7     5.9/4.4/4.0     6.0/4.7/4.5     4.3/2.6/2.3     4.4/3.8/3.7
2bc(U)+Gshare(K)                                        4.0/3.3/3.1     5.3/4.3/4.0     5.8/5.1/4.9     4.3/3.6/3.5     6.7/6.3/6.3
2bc(U)+GAg(K)                                           4.3/3.5/3.3     5.7/4.6/4.3     5.9/5.2/5.0     4.4/3.7/3.6     6.7/6.3/6.3
2bc(U)+SAg(K)                                           4.0/3.5/3.2     5.1/4.4/4.1     5.8/5.2/5.0     4.3/3.6/3.5     6.7/6.4/6.3
GAg(U)+GAg(K)                                           4.3/3.7/3.3     7.6/5.6/4.9     7.0/5.4/4.8     5.9/4.3/3.7     4.1/3.4/2.9
Gshare(U)+Gshare(K)                                     4.5/3.2/2.7     6.8/4.7/4.0     6.5/4.7/4.1     5.0/2.9/2.3     3.7/2.7/2.5
SAs(U)+Gshare(K)                                        3.9/2.7/2.4     6.0/4.0/3.5     5.9/4.3/3.7     4.6/2.6/2.3     4.7/3.8/3.6
GAs(U)+Gshare(K)                                        3.8/2.8/2.6     5.4/3.9/3.5     5.6/4.2/3.9     4.1/2.8/2.6     4.5/4.0/3.9
SAs(U)+SAg(K)                                           3.9/2.9/2.6     5.8/4.0/3.5     5.9/4.5/3.9     4.5/2.7/2.3     4.7/3.8/3.6
SAs(U)+GAg(K)                                           4.3/2.9/2.7     6.4/4.2/3.7     6.0/4.4/3.9     4.7/2.7/2.3     4.7/3.8/3.6
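The hybridization idea can likewise be sketched in code. The toy model below is our own illustrative sketch with assumed sizes, not the simulated configuration: privilege mode steers each branch to a differently structured component, in the spirit of the GAs(U)+Gshare(K) configuration.

```python
# Illustrative sketch with assumed sizes (not the simulated configuration):
# privilege mode steers each branch to a differently structured component,
# GAs for user code and Gshare for kernel code.
class HybridSplitGAsGshare:
    def __init__(self, sets=4, entries=512, hist_bits=9):
        self.sets = sets
        self.mask = entries - 1
        self.u_pht = [[2] * entries for _ in range(sets)]  # GAs: per-set PHTs
        self.k_pht = [2] * entries                         # Gshare: single PHT
        self.u_hist = 0
        self.k_hist = 0
        self.hist_mask = (1 << hist_bits) - 1

    def _locate(self, pc, mode):
        if mode == "user":   # GAs: PC picks the set, global history the entry
            return self.u_pht[pc % self.sets], self.u_hist & self.mask
        return self.k_pht, (pc ^ self.k_hist) & self.mask  # Gshare indexing

    def predict(self, pc, mode):
        table, idx = self._locate(pc, mode)
        return table[idx] >= 2

    def update(self, pc, mode, taken):
        table, idx = self._locate(pc, mode)
        table[idx] = min(3, table[idx] + 1) if taken else max(0, table[idx] - 1)
        if mode == "user":
            self.u_hist = ((self.u_hist << 1) | int(taken)) & self.hist_mask
        else:
            self.k_hist = ((self.k_hist << 1) | int(taken)) & self.hist_mask
```

Because each side is selected by the processor's execution mode, either component can be swapped for whichever scheme Table 5 favors for that stream.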
Moreover, compared with a conventional design, the split branch prediction approach has several implementation advantages. Accessing only one of the smaller modules (kernel or user) for a given branch saves energy, since energy dissipation is a function of the size of the memory array (capacitive load and leakage) and the number of accesses [Vija00], and energy efficiency is a major concern in modern microprocessor designs. Additionally, branch prediction often resides on the critical path of program execution. As a consequence of technology scaling, slower wires and faster clock rates will require multi-cycle access times to large on-chip microarchitectural structures such as the BHT. Jiménez et al. [Jim00] show that with an aggressive clock frequency (2 GHz) at 180 nm technology, accessing a 32KB BHT requires 3 cycles. The smaller, split branch prediction tables can therefore provide quick and accurate prediction in the face of increasing latency.
5. BTB Extension for Java Processing
Table 1 shows that the indirect branch frequency in the control flow of Java processing can be high due to virtual method calls and runtime bytecode interpretation. Targets of indirect branches, which transfer program flow to a target address stored in a register, are typically hard to predict accurately. Current processors predict branch targets with a branch target buffer (BTB) [Chang97], which caches the most recently resolved
target. An indirect branch can always be predicted "taken" by setting a corresponding "branch type" bit once it enters the BTB. Its prediction performance, however, depends largely on the efficiency of the BTB, because the BTB is referred to both for identifying indirect branches and for obtaining target addresses in the fetch stage of the pipeline. Our analysis [Li01] shows that BTB performance depends heavily on the JVM implementation, as interpretation significantly increases the BTB miss rate on most studied benchmarks. For example, the BTB miss rate of benchmark compress increases from 8% in JIT mode to 96% in interpreting mode, and increasing the BTB size does little to reduce target misses. The higher frequency of BTB misses in user code in interpreter mode indicates that it is difficult to predict the targets of the indirect branch that implements the case-by-case interpretation.
5.1 Indirect Branch Behavior and Target Locality
Indirect branches can be categorized as branches with only one target (monomorphic branches) and branches with more than one target (polymorphic branches) [Drie98a]. Our indirect branch characterization [Li01] shows that polymorphic branches, which constitute 4% of static branches in JIT mode and 5% in interpreter mode, account for 28% of all dynamic branches in JIT mode and 75% in interpreter mode. For monomorphic branches, mispredictions come from aliasing between branches that map to the same BTB entry. For polymorphic branches, mispredictions also depend on the multiple-target access pattern: if one target address is accessed 1,000 times followed by another executing 1,000 times, the loss due to interleaving is negligible; if the multiple targets alternate at a higher frequency, however, the interleaving may cause significant misprediction.
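The interleaving effect described above is easy to demonstrate with a toy model of a single BTB slot that remembers only the most recently resolved target (the target addresses below are hypothetical):

```python
# Toy model of one BTB slot that caches only the most recently resolved
# target, as a conventional BTB does for an indirect branch.
def btb_miss_rate(target_sequence):
    last, misses = None, 0
    for target in target_sequence:
        if target != last:       # predicted target (the last one seen) is wrong
            misses += 1
        last = target            # the BTB now caches the newly resolved target
    return misses / len(target_sequence)

phased = [0xA] * 1000 + [0xB] * 1000   # each target runs 1,000 times in a row
alternating = [0xA, 0xB] * 1000        # targets alternate on every instance

print(btb_miss_rate(phased))       # 2 misses in 2,000 instances: negligible
print(btb_miss_rate(alternating))  # every instance mispredicts
```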
Figure 4 Targets Locality in Polymorphic Indirect Branches
To characterize the multiple-target access pattern, we uniquely number both branch sites and their corresponding targets for each polymorphic branch and capture the first 0.5 million dynamic branch instances executed in interpreter mode. We visualize the data in 3-D space; Figure 4 shows the results in user space for benchmarks compress, mtrt and jack. Each dot plotted in the 3-D space records an occurrence of the following event: at time X (represented by the number of dynamic instances), a given branch Y (represented by branch ID) jumps to its target Z (represented by branch target ID). The cut-out sections are enlarged and shown separately. We observe that the studied programs exhibit a new kind of value locality, branch target address locality, in which a few target addresses appear very frequently among polymorphic indirect branch instances. As depicted in Figure 4, in interpreter mode a polymorphic indirect branch can potentially jump to a large set of targets, partly because of the large switch-case body that interprets the more than 200 bytecodes. But within a limited execution period, the actually invoked target set usually consists of a small number of heavily reused addresses, and this observation holds for the entire execution of the programs. This target locality, a characteristic similar to the temporal locality found in instruction and data references, implies that target addresses transferred to recently have a high probability of reuse in the near future. The set of frequent targets remains small and stable over the execution of some programs, such as compress, where branch instances repeatedly work through a body of 9 distinct address sites. Not surprisingly, this highly interleaved branch target transfer pattern almost always causes mispredictions (a misprediction rate of 96%) in a BTB where only the most recently transferred target is recorded.
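The target locality described above can be quantified simply as the fraction of dynamic instances covered by the k hottest targets. The sketch below uses a synthetic trace standing in for the interpreter's dispatch branch (an assumed shape, not measured data):

```python
# Quantify target locality: what fraction of a polymorphic branch's dynamic
# instances is covered by its k hottest targets? The trace is synthetic.
from collections import Counter

def hot_target_coverage(targets, k):
    counts = Counter(targets)
    hottest = sum(n for _, n in counts.most_common(k))  # k most frequent targets
    return hottest / len(targets)

# 100+ possible targets, but a handful dominate, as in Figure 4:
trace = [1, 2, 3] * 300 + list(range(4, 104))
print(hot_target_coverage(trace, 3))  # three targets cover 90% of instances
```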
5.2 Capturing Target Locality through a Target History Promotion Buffer and BTB Entry Rehashing
Our analysis [Li01] shows that polymorphic branches lead to a high misprediction rate in conventional BTBs: the simple most-recently-used target bookkeeping and static hashing mechanism employed in a BTB design are insufficient to capture an interleaved but highly clustered target sequence. We propose a BTB extension that employs a small buffer to collect polymorphic branch history. The captured history information can then be exploited to provide more accurate target prediction through BTB entry rehashing and reuse. Figure 5 illustrates the architecture of this scheme. At runtime, polymorphic branches with high target miss rates in the BTB are dynamically filtered out into a smaller dedicated structure called the Target History Promotion Buffer (THPB), where target history patterns are built up for more accurate prediction. The filtering, or promotion, of polymorphic branches is achieved by associating a Target Miss Counter (TMC) with each BTB entry. Whenever a branch instruction hits a BTB entry but the predicted target is incorrect, the TMC associated with that entry is incremented. When a TMC reaches a threshold, a migration is triggered by
copying the branch information (tag, branch target) from the corresponding BTB entry to a newly allocated THPB entry. The BTB entry is then reclaimed for future use and the TMC is reset to zero. As depicted in Figure 5, each THPB entry comprises a tag field (tag) and a compressed target table (t1, t2, …, tn). The tag field identifies an indirect branch at the instruction fetch cycle, and the compressed target table (CTT) records the target history pattern for that branch. To reduce the hardware overhead of the THPB, the target history information is stored in a compressed form: only the lower bits (8 bits in our current design and simulations) of each target address are recorded and concatenated with each other. When a new target of the branch is resolved, the current target pattern is shifted by that number of bits to accommodate the newly generated information. The CTT can be maintained globally or locally. In a global CTT configuration, a single one-entry CTT is shared by all branch sites residing in the THPB, while in a local CTT configuration (as illustrated in Figure 5), a separate target pattern is maintained for each branch site.
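The promotion mechanism just described can be sketched as follows. This is an illustrative sketch, not the hardware: plain dictionaries stand in for the tagged BTB/THPB arrays, and the threshold of 4 is chosen only to keep the example short (the paper explores thresholds of 16 to 512).

```python
# Illustrative sketch of TMC-driven promotion and CTT compression.
TMC_THRESHOLD = 4    # promotion threshold (pth); small here for illustration
CTT_BITS = 32        # CTT entry width, as in the paper's simulations
TARGET_LOW_BITS = 8  # only the low 8 bits of each target enter the CTT

class BTBEntry:
    def __init__(self, tag, target):
        self.tag, self.target, self.tmc = tag, target, 0  # one TMC per entry

def resolve(btb, thpb, pc, actual_target):
    entry = btb.get(pc)
    if entry is not None and entry.target != actual_target:
        entry.tmc += 1                      # BTB hit, but predicted target wrong
        if entry.tmc >= TMC_THRESHOLD:      # promote: move the branch to the
            thpb[pc] = {"ctt": 0}           # THPB and reclaim the BTB entry
            del btb[pc]
    if pc in thpb:                          # shift in the new target's low bits
        ctt = (thpb[pc]["ctt"] << TARGET_LOW_BITS) | (
            actual_target & ((1 << TARGET_LOW_BITS) - 1))
        thpb[pc]["ctt"] = ctt & ((1 << CTT_BITS) - 1)
```

How the accumulated CTT pattern is turned into a BTB set index is sketched separately under the rehashing discussion.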
Figure 5 BTB Extension via THPB
When an instruction is being fetched, its program counter is sent to both the BTB and the THPB; a hit in either structure results in an overall hit, but a branch is never cached in both simultaneously. If a BTB hit occurs, the target stored in the BTB entry is used for prediction. If a THPB hit occurs, the history pattern collected in the global or local CTT entry is used to produce a BTB set index address, a procedure we refer to as BTB entry rehashing. The rehashing algorithm can use simple arithmetic and logic operations such as XOR and concatenation, or more complicated permutation or shuffling. The rehashing can be performed concurrently with the tag comparison, so it causes no extra cycle of delay. Usually,
to capture more history information, the number of bits stored in a CTT entry is wider than the number needed to index a BTB set. A target pattern folding operation, which partitions the target history pattern recorded in a CTT entry into smaller bit chunks whose width equals that of the BTB set index address, is performed first. The rehashing algorithm (e.g., XOR) is then applied to the partitioned bit chunks to produce a BTB set index address. The rehashed BTB set offset is generated by a modulo-based direct mapping between the branch PC and the BTB set associativity. The target residing in the rehashed BTB entry is then used for branch target prediction. When the actual branch target is resolved in the pipeline, the rehashed BTB entry is updated (if necessary) and the newly resolved target bits are collapsed into the CTT to keep tracking the history pattern. With this target history pattern and rehashing algorithm, the multiple targets of a promoted branch can be rehashed and stored in different BTB entries. The low percentage (26%) of taken branch sites [Li01] indicates that BTB entry utilization in a Java runtime system can be low, because a significant portion of runtime checking branches are seldom taken. The magnitude of BTB corruption caused by rehashing and reusing an already allocated BTB entry is therefore likely to be low. The target locality shown in Figure 4 also implies that within a given execution period, the number of rehashed BTB entries is low, due to the small set of active targets and the limited number of patterns they can generate. As an extension to a BTB scheme, one design issue is to keep the THPB small enough to avoid a significant increase in hardware budget. Inherently, the number of THPB entries is determined by the number of unique branch sites that can be promoted at a given promotion threshold (pth), and the width of a CTT entry depends intrinsically on the target patterns and control flow characteristics of programs.
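The folding-plus-rehashing step can be sketched as follows. This is an illustrative sketch (the function name and defaults are ours): the CTT pattern is folded into index-width chunks, the chunks are XORed together, and, as in the BTB+PBuffer1 variant evaluated later, the result is XORed once more with the branch PC. Nine index bits match the 2KB, 4-way set-associative BTB used in the paper's simulations.

```python
# Target pattern folding and XOR rehashing, as in the BTB+PBuffer1 variant.
def rehash(ctt_pattern, pc, index_bits=9):
    mask = (1 << index_bits) - 1
    folded = 0
    while ctt_pattern:            # fold the pattern into index-width chunks
        folded ^= ctt_pattern & mask
        ctt_pattern >>= index_bits
    return (folded ^ pc) & mask   # extra PC-XOR step of BTB+PBuffer1
```

Because distinct recent-target histories fold to distinct indices with high probability, the interleaved targets of one promoted branch land in different BTB entries instead of overwriting a single slot.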
If the capacity of the THPB is too small, there will be significant thrashing between different promoted branch sites. On the other hand, the THPB can be underutilized if it is too large. To investigate this design choice, we simulate a BTB combined with a THPB of infinite size to find the maximum number of branch sites that can be promoted at a given promotion threshold (the data are shown in Table 6). Table 6 shows that the number of unique branch sites that can potentially be promoted ranges from 13 (compress with pth=512) to 74 (javac with pth=16). Increasing the promotion threshold decreases the mining ability for polymorphic branches. This also implies that using very small bit counters for the TMC may require a larger THPB to accommodate the larger number of over-mined branches.
Table 6 Number of Mined Branch Sites
Benchmarks  pth=16  pth=32  pth=64  pth=128  pth=256  pth=512
db            64      54      45      41       35       29
jess          71      61      54      45       36       30
javac         74      61      50      38       29       24
jack          66      54      47      42       36       34
mtrt          65      49      37      27       21       21
compress      52      34      24      19       15       13
AVG           65      52      43      35       29       25
5.3 Design Trade-Off Evaluation
To explore a cost-effective design space for the proposed BTB extension, we examine the impact of several factors: the CTT entry configuration, the rehashing algorithm, and the size and associativity of the THPB. We first examine the efficiency of local versus global target pattern history and of different rehashing algorithms, using a 16-entry, direct-mapped THPB. A promotion threshold of 512 is used in all simulations unless otherwise specified. We model a stand-alone BTB (BTB), a BTB augmented with a THPB using a global CTT and an XOR rehashing algorithm (BTB+PBuffer1 and BTB+PBuffer2), and a BTB combined with a THPB using a local CTT and an XOR rehashing algorithm (BTB+PBuffer3). The rehashing algorithms of BTB+PBuffer2 and BTB+PBuffer3 are the same, whereas in BTB+PBuffer1 the rehashed BTB set index address is further XORed with the branch PC before it is sent to the BTB entry indexing logic. The CTT entry size is fixed at 32 bits, and the target pattern folding length is set to 9 bits to produce the set index of a 2KB, 4-way set-associative BTB. The rehashed BTB entry set offset is generated from the 2 least significant bits of a shifted PC. The misprediction rates for indirect branches alone and for all branch instructions (in interpretation mode) are shown in Figure 6. The proposed THPB and BTB rehashing technique significantly reduces the BTB miss rate by exploiting branch target locality and BTB resource reuse. For example, in benchmark mtrt, the BTB miss rates on indirect branches and on all branch instructions drop from 66% to 19% and from 15% to 5%, respectively, with the BTB+PBuffer1 scheme. The two rehashing algorithms described above yield comparable performance improvements, and the use of a local history CTT does not necessarily outperform its global history counterpart.
For this reason, we use the CTT configuration and rehashing algorithm adopted by BTB+PBuffer1 for our further design space explorations.
Figure 6 Performance of BTB Extension (in Interpretation Mode)
So far we have evaluated the performance of the proposed scheme using interpretation of Java codes. We now repeat the simulations with the same BTB extension configuration (BTB+PBuffer1) on each benchmark running with a JIT compiler. Figure 7 shows the performance of a stand-alone BTB and of BTB+PBuffer1 on indirect branches and on all branch instructions. The BTB extension yields misprediction reduction in JIT mode in all cases. For example, on benchmark mtrt, the indirect branch BTB miss rate and the overall BTB miss rate drop from 13.2% to 3.8% and from 2.4% to 0.9%, respectively. The fact that the THPB benefits both Java execution modes also indicates potential performance improvement for other indirect-branch-intensive codes, such as C++, Perl and Tcl.
Figure 7 BTB Extension Performance in JIT mode
To further search for a cost-effective BTB extension design, we investigate the performance of resource-constrained THPB configurations. Figure 8 shows the performance of BTB extensions with the number of THPB entries varied from 16 to 128 and the associativity varied from 1 to 4 (on a 32-entry THPB). Increasing the number of THPB entries does not provide significant performance improvement (less than 2%), and a 16-entry (PBuffer=16 in Figure 8), direct-mapped (a=1 in Figure 8) THPB provides performance comparable to that of more complex and costly configurations. Conflicts between different promoted branch sites are not observed to be a concern even in a small (16-entry) promotion buffer, and direct mapping is sufficient to distinguish the promoted branch sites in a THPB.
Figure 8 Impact of THPB Size and Set Associativity
Figure 9 further shows the impact of the promotion threshold on the BTB extension misprediction rate (with a 32-entry THPB). Increasing the promotion threshold is found to slightly reduce the misprediction rate by reducing the conflicts between THPB entries and the reused BTB entries. Most of the frequently invoked polymorphic indirect branches with high target miss rates can be captured even with a promotion threshold of 32.
Figure 9 Impact of Promotion Threshold
6. Related Work
Many previous studies have focused on enhancing branch prediction performance on ILP machines [Yeh91, Chang97, Yeh93, Skad99, Heil99, Drie98a]. These studies concentrate mainly on the analysis and optimization of branch prediction for SPECInt and C++ programs. Hsieh, Conte and Hwu [Hsie97] compare the performance of Java code run on the Sun JDK 1.0.2 interpreter with code compiled through Caffeine [Hsie96]. They observe that microarchitectural mechanisms such as the BTB are not well exploited by Java interpreters. A related study [Vija99] examines the effectiveness of using path history to predict the target addresses of indirect branches due to virtual method invocations in Java applications. An XOR hashing scheme with global path history and a 2-bit update policy is found to perform best. That result is shown only for interpreter mode and for Java programs with small instruction footprints (richards and deltablue), and it does not include kernel code. Although branch prediction has been studied with system workloads [Gloy96, Sech96] in the past, little work has been done on hardware optimizations. Gloy, Young and Smith [Gloy96] analyze ATOM-generated system traces from the Instruction Benchmark Suite (IBS) and find that user-only traces retain fidelity when the kernel accounts for less than 5% of the total executed instructions. Their simulation results show that including kernel branches in the branch trace worsens the effects of aliasing. Our research further analyzes the runtime behaviors of both user and kernel code and leads to the design of OS-aware branch predictors. Driesen and Hölzle [Drie98a] investigate a wide range of two-level predictors dedicated exclusively to indirect branches, using programs from the SPECInt95 suite as well as a suite of large C++ applications. On the
average, a global history, per-address predictor performs best in the design space. They find that combining two-level predictors with different path lengths in a hybrid style further improves prediction accuracy. Compared with such dedicated two-level indirect branch predictors, the proposed BTB extension dynamically rehashes the interleaved but highly clustered polymorphic branch targets by promoting, capturing and exploiting more accurate target history patterns. Hence, it can augment a traditional BTB structure, in which an address-based indexing scheme is preferred for fast branch identification and target prediction. Prediction through the BTB extension is triggered only when "hotspots" in target prediction are detected. This on-demand usage makes branch target prediction more adaptive to the dynamic behaviors of programs. In [Drie98b], Driesen and Hölzle propose a cascaded branch predictor, which dynamically classifies and filters branches into a simple first-stage predictor and a more expensive second-stage predictor. The difference between our BTB extension and the cascaded predictor lies in the fact that the BTB extension exploits BTB entry rehashing and reuse, thus avoiding the more complicated prediction resource management rules used in a cascaded predictor. Early indirect branch prediction studies are reported in [Lee84, Jaco97, Emer97, Chang97]. Branch classification and hybrid prediction were first proposed for conditional branches by Chang and Patt [Chang94, Chang95] and by McFarling [McFar93], respectively. To our knowledge, no previous study has analyzed branch prediction of Java programs by examining both user and kernel execution.
7. Conclusion and Future Research
The popularity and wide adoption of Java has necessitated the development of an efficient Java runtime system.
We believe that efficient Java processing requires a synergistic approach in which the processor hardware, the software, and the OS collaborate to deliver high performance. This study has provided insights into the interaction of JVM implementations and the OS with the control flow speculation mechanisms ubiquitously used in current processor designs. To our knowledge, this work is the first to investigate optimized microarchitectural enhancements for efficient Java processing. The major findings and contributions of this research are:
• Kernel instructions and kernel branches play an important role in Java processing. Kernel/user branch aliasing constitutes up to 35% of total branch aliasing and is more significant in the global history based GAg, GAs and Gshare predictors. This aliasing can be eliminated by an execution-mode-guarded split branch predictor with separate prediction structures for kernel and user space. Our experiments show that the aliasing-free split predictor reduces mispredictions by up to 50% in user space and 90% in kernel space compared with a unified counterpart of the same hardware budget.
• Kernel and user codes favor different branch prediction schemes because of their distinct runtime branch behaviors. Split branch predictors provide the opportunity to optimize each predictor individually. Hybridized split predictors, e.g., a GAs predictor for user code and Gshare for kernel code, behave better than a uniform prediction strategy applied to both.
• The use of virtual method calls, combined with the indirect jumps that implement the switch statement for case-by-case interpretation, pushes the indirect branch frequency in Java processing as high as 18.4%, so accurate indirect branch prediction is necessary for efficient Java processing.
• Another contribution of this paper is an improved indirect branch target predictor. By exploiting target locality and BTB entry rehashing, a 16-entry Target History Promotion Buffer (THPB) contributes a 48% to 70% reduction in BTB miss rate.
Historically, privilege bits are used to protect system-critical resources such as page tables and process
control blocks for security purposes. The results of our study show that on a superscalar microprocessor with fine-grained resource sharing, protecting performance-critical microarchitectural hardware, such as branch prediction tables, yields better performance than OS-unaware designs. We will further investigate the benefit of OS-aware microarchitecture on other speculative techniques, such as value prediction, and on other OS-intensive commercial workloads such as database and web servers [Reds00]. Because the split branch predictor accesses only one of its smaller prediction tables for a given execution mode (kernel or user), it can deliver energy savings and low-latency access in the face of increasing power budgets and clock frequencies. In our future work, we plan to model the energy consumption of split branch predictors.
References
[Benn95] J. Bennett and M. Flynn, Performance Factors for Superscalar Processors, Technical Report CSL-TR-95-661, Computer Systems Laboratory, Stanford University, Feb. 1995
[Chang94] P.-Y. Chang, E. Hao, T.-Y. Yeh and Y. Patt, Branch Classification: a New Mechanism for Improving Branch Predictor Performance, In Proceedings of the 27th Annual International Symposium on Microarchitecture, pages 22-31, 1994
[Chang95] P.-Y. Chang, E. Hao and Y. N. Patt, Alternative Implementations of Hybrid Branch Predictors, In Proceedings of the 28th Annual International Symposium on Microarchitecture, pages 252-257, 1995
[Chang97] P.-Y. Chang, E. Hao and Y. N. Patt, Target Prediction for Indirect Jumps, In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 274-283, 1997
[Cram97] T. Cramer, R. Friedman, T. Miller, D. Seberger, R. Wilson and M. Wolczko, Compiling Java Just in Time, IEEE Micro, vol. 17, pages 36-43, May 1997
[Drie98a] K. Driesen and U. Hölzle, Accurate Indirect Branch Prediction, In Proceedings of the 25th Annual International Symposium on Computer Architecture, pages 167-178, 1998
[Drie98b] K. Driesen and U. Hölzle, The Cascaded Predictor: Economical and Adaptive Branch Target Prediction, In Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture, pages 249-258, 1998
[Emer97] J. Emer and N. Gloy, A Language for Describing Predictors and its Application to Automatic Synthesis, In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 304-314, 1997
[Gloy96] N. Gloy, C. Young, J. B. Chen and M. D. Smith, An Analysis of Dynamic Branch Prediction Schemes on System Workloads, In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 12-21, 1996
[Heil99] T. H. Heil, Z. Smith and J. E. Smith, Improving Branch Predictors by Correlating on Data Values, In Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture, pages 28-37, 1999
[HotSpot99] The Java HotSpot Performance Engine Architecture, White Paper, http://java.sun.com/products/hotspot/whitepaper.html, Apr. 1999
[Hsie96] C.-H. A. Hsieh, J. C. Gyllenhaal and W. W. Hwu, Java Bytecode to Native Code Translation: the Caffeine Prototype and Preliminary Results, In Proceedings of the 29th International Symposium on Microarchitecture, pages 90-97, 1996
[Hsie97] C.-H. A. Hsieh, M. T. Conte, T. L. Johnson, J. C. Gyllenhaal and W. W. Hwu, A Study of the Cache and Branch Performance Issues with Running Java on Current Hardware Platforms, In Proceedings of COMPCON, pages 211-216, 1997
[Jaco97] Q. Jacobson, S. Bennet, N. Sharma and J. E. Smith, Control Flow Speculation in Multiscalar Processors, In Proceedings of the 3rd International Symposium on High Performance Computer Architecture, pages 218-229, 1997
[Java] Overview of Java Platform Product Family, http://www.javasoft.com/products/OV_jdkProduct.html
[Jim00] D. A. Jiménez, S. W. Keckler and C. Lin, The Impact of Delay on the Design of Branch Predictors, In Proceedings of the 33rd International Symposium on Microarchitecture, 2000
[Lee00] J. Lee, B.-S. Yang, S. Kim, S. Lee, Y. C. Chung, H. Lee, J. H. Lee, S.-M. Moon, K. Ebcioglu and E. Altman, Reducing Virtual Call Overheads in a Java VM Just-In-Time Compiler, In Proceedings of the 4th Annual Workshop on Interaction between Compilers and Computer Architectures, 2000
[Lee84] J. Lee and A. Smith, Branch Prediction Strategies and Branch Target Buffer Design, IEEE Computer, 17(1), 1984
[Li00] T. Li, L. K. John, N. Vijaykrishnan, A. Sivasubramaniam, J. Sabarinathan and A. Murthy, Using Complete System Simulation to Characterize SPECjvm98 Benchmarks, In Proceedings of the ACM International Conference on Supercomputing, pages 22-33, 2000
[Li01] T. Li, S. W. Hu, Y. Luo, L. K. John and N. Vijaykrishnan, Understanding Control Flow Transfer and Its Predictability in Java Processing, Technical Report TR-010108-01, Department of Electrical and Computer Engineering, University of Texas at Austin, Jan. 2001
[Li01b] T. Li, L. K. John, N. Vijaykrishnan and A. Sivasubramaniam, Characterizing Operating System Activity in SPECjvm98 Benchmarks, Book Chapter in Characterization of Contemporary Workloads, pages 53-82, Kluwer Academic Publishers, to be published in 2001
[Lind99] T. Lindholm and F. Yellin, The Java Virtual Machine Specification, Second Edition, Addison Wesley, 1999
[McFar93] S. McFarling, Combining Branch Predictors, WRL Technical Note TN-36, Digital Equipment Corporation, June 1993
[MPR0400] K. Diefendorff, HP, Intel Complete IA-64 Rollout, Microprocessor Report, pages 1-9, Apr. 2000
[Radh00] R. Radhakrishnan, N. Vijaykrishnan, L. K. John and A. Sivasubramaniam, Architectural Issues in Java Runtime Systems, In Proceedings of the 6th International Conference on High Performance Computer Architecture, pages 387-398, 2000
[Reds00] J. A. Redstone, S. J. Eggers and H. M. Levy, An Analysis of Operating System Behavior on a Simultaneous Multithreaded Architecture, In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 245-256, 2000
[Rom96] T. H. Romer, D. Lee, G. M. Voelker, A. Wolman, W. A. Wong, J.-L. Baer, B. N. Bershad and H. M. Levy, The Structure and Performance of Interpreters, In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 150-159, 1996
[Rose95a] M. Rosenblum, S. A. Herrod, E. Witchel and A. Gupta, Complete Computer System Simulation: the SimOS Approach, IEEE Parallel and Distributed Technology: Systems and Applications, vol. 3, no. 4, pages 34-43, Winter 1995
[Sech96] S. Sechrest, C-C. Lee and T. Mudge, Correlation and Aliasing in Dynamic Branch Predictors, In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 22-32, 1996
[Skad99] K. Skadron, P. S. Ahuja, M. Martonosi and D. W. Clark, Branch Prediction, Instruction Window Size, and Cache Size: Performance Trade-Offs and Simulation Techniques, IEEE Transactions on Computers, vol. 48, no. 11, Nov. 1999
[SPECJVM98] SPEC JVM98 Benchmarks, http://www.spec.org/osg/jvm98/
[Vija00] N. Vijaykrishnan, M. Kandemir, M. J. Irwin, H. S. Kim and W. Ye, Energy-Driven Integrated Hardware-Software Optimizations Using SimplePower, In Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 95-106, 2000
[Vija99] N. Vijaykrishnan and N. Ranganathan, Tuning Branch Predictors to Support Virtual Method Invocation in Java, In Proceedings of the 5th USENIX Conference on Object-Oriented Technologies and Systems, pages 217-228, 1999
[Yeag96] K. C. Yeager, MIPS R10000, IEEE Micro, vol. 16, no. 1, pages 28-40, Apr. 1996
[Yeh91] T. Yeh and Y. Patt, Two-Level Adaptive Branch Prediction, In Proceedings of the 24th International Symposium on Microarchitecture, pages 51-61, 1991
[Yeh93] T. Yeh and Y. N. Patt, A Comparison of Dynamic Branch Predictors that Use Two Levels of Branch History, In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 257-266, 1993