Sep 12, 2013 - ... operations and of the remaining lookups, 80% are done for highly bi- .... Hardware Architecture: a BP
On-Demand Branch Prediction CVA MEMO 133 Milad Mohammadi, Song Han, Tor M. Aamodt, and William J. Dally Electrical Engineering, Stanford University, Stanford, CA 94305 E-mail:{milad, songhan, dally}@cva.stanford.edu Electrical Engineering, University of British Columbia, Vancouver, BC V6T 1Z4 E-mail:
[email protected] Thursday 12th September, 2013 Abstract In out-of-order (OoO) processors, speculative execution with high branch prediction accuracy is employed to achieve good single thread performance. In these processors the branch prediction unit tables (BPU) are accessed in parallel with the instruction cache before it is known whether a fetch group contains branch instructions. For integer applications, we find 80% of BPU lookups are done for non-branch operations and of the remaining lookups, 80% are done for highly biased branches that can be predicted statically with high accuracy. We propose ondemand branch prediction (ODBP), a novel technique that uses compiler generated hints to identify those instructions that can be more accurately predicted statically to eliminate unnecessary BPU lookups. We evaluate an implementation of ODBP that combines static and dynamic branch prediction. For a 4-wide superscalar processor, ODBP delivers as much as 12% improvement in average energy-delay (ED) product, 8% core average energy saving, and 3% speedup. ODBP also enables the use of large BPU’s for a given power budget.
1
Introduction
To achieve high single thread performance, superscalar processors must fetch and decode multiple instructions per cycle. To achieve this, modern speculative superscalar processors (e.g., Power 7 [23]), access the branch prediction unit (BPU) every cycle. Our studies show that up to 12% of processor energy can be consumed on branch prediction. For integer applications, roughly 20% of all dynamic instructions are branches and for our workload we find 80% of executed branches are strongly biased and indeed statically predictable with the same level of accuracy as achieved using modern dynamic branch predictors (BP). Thus, 96% of instructions do not require access to 1
dynamic branch prediction. On a four wide superscalar machine this in turn implies that ideally only one access is required every six cycles. If the processor could know in advance which cycles a prediction was not needed, then it could potentially save up to 83% of branch prediction energy resulting in 10% overall energy reduction while obtaining similar performance. We propose on-demand branch prediction (ODBP), an energy efficient prediction technique that uses compiler-generated hints to select static prediction for instructions that are either not branches or are strongly biased. The hint bits are associated with earlier instructions (ahead prediction) so they are available during the relevant prediction. ODBP reduces the average core energy by 8% and improves the ED product by 12%. Ahead instruction type prediction is related to the challenge of long predictor latency [12, 13, 19–21]. To hide long predictor latency processors such as the Alpha EV6 [13] and EV8 [20], Power 7 [23] and ARM Cortex-A15 [14] employ an override prediction strategy [12]. Whereas ODBP employs ahead type prediction to avoid lookups to branch prediction arrays, the overriding approaches generally involve parallel lookup to predictors every cycle to tackle wire delay. Wide issue superscalar processors can provide fewer branch target predictions than instructions in a fetch group [4, 11]. However, for accurate prediction of direction, they provide a parallel prediction of all instructions in a fetch group (regardless of whether they are branches) and use instruction predecode bits the following cycle to identify predictions corresponding to branches [20, 23]. ODBP aims to perform dynamic prediction only when necessary. While a fetch target buffer (FTB) [19] avoids parallel direction prediction lookups it complicates return address prediction and to “get ahead” of instruction fetch, the FTB makes a direction and target predictor access every cycle which consumes energy. Early research on branch prediction explored the use of combining static and dynamic prediction similar to ODBP, but focused on the benefits to performance [6] rather than energy. In this paper, we first introduce a mechanism for saving energy while achieving strong branch prediction accuracy by annotating static prediction hints to earlier instructions, and then discuss a profile based hint assignment technique that optimizes ED gain. ODBP would be suitable either in the context of embedded systems or systems employing runtime compilation to encourage software portability such as the Heterogeneous System Architecture platform [1].
2
On-Demand Branch Prediction Theory
Static prediction has the advantage of energy efficiency, but suffers from poor prediction accuracy for branches with weak bias. Dynamic prediction provides higher prediction accuracy, but suffers from high energy overhead. Combining the two schemes delivers substantially lower energy footprint with stronger prediction accuracy. At compile time, we annotate instruction binaries with prediction hints. This enables the CPU to choose between static and dynamic prediction schemes at runtime. The hints determine instruction type (branch or non-branch); if an instruction is a branch, hints also determine if it is statically predicted taken (ST), statically not-taken (SN), or dynamically predictable (DY). The decision for the choice of prediction scheme is made using 2
an algorithm that assigns hint annotations based on both profile information (accuracy, bias) and a number of architectural parameters. The following analysis provides the corresponding annotation (ST, SN, DY) for each static instruction. Below, EDODBP and ED are the energy-delay products for ODBP and our OoO baseline (BASE) respectively; incremental energy and delay changes in ED gives EDODBP . The ED is reduced when the incremental change in energy by using static prediction more than offsets any increase in delay. ∆E and ∆D refer to the change of energy and execution latency when the ODBP scheme is used. The change in EDODBP is favorable for a static instruction when equation 5 is satisfied. EDODBP = (E + ∆E) ∗ (D + ∆D)
(1)
⇒ EDODBP ≈ ED + E ∗ ∆D + D ∗ ∆E
(2)
EDODBP < ED
(3)
⇒ E ∗ ∆D < −D ∗ ∆E
(4)
⇒ −∆E/∆D > E/D
(5)
For a branch biased towards taken, ∆D is estimated using equations 6, 7. Dmis denotes mis-speculation penalty cycles. If a branch bias is greater than the dynamic predictor’s accuracy, static prediction is superior. A negative value for ∆D implies static prediction is more accurate than dynamic. ∆DST = (accuj − biasj ) ∗ Dmis
(6)
∆DSN = (accuj − (1 − biasj )) ∗ Dmis
(7)
∆E is estimated using equations 8, 9. Epc denotes core energy-per-cycle, and Ebp is the energy saved by avoiding the BPU lookup. A negative value for ∆E favors static prediction and a positive value favors dynamic. ∆EST = ∆DST ∗ Epc − Ebp
(8)
∆ESN = ∆DSN ∗ Epc − Ebp
(9)
if (−∆EST /∆DST > E/D) choose ST annotation, else if (−∆ESN /∆DSN > E/D) choose SN annotation, else choose DY annotation.
3
Architecture Implementation
Baseline Model: the baseline branch predictor (BASE) for this study is built similar to the Alpha EV8 branch predictor [20]. The access model follows [8] where on every cycle, the BPU is indexed using the fetch group PC, to collect fetch-width many predictions, one for every fetched instruction; ODBP also uses this model. Hardware Architecture: a BPU consists of three key components: Branch Prediction Table (BP), Branch Target Buffer (BTB), and Return Address Stack (RAS). All three units are accessed at every fetch event. In ODBP, during the fetch stage, DY instructions look up all tables in BPU, ST instructions only look up BTB and RAS, 3
8/G2-A/2+,-%00(/==/=.(/01+213*-K-:)@+B@)213*
7*=2(B+213*:)+,/-%++/==
7*=2(B+213*;/+30/
789:)+,/ 4*12
789;/+30/ 4*12
/2+,-)00(/==->(3L-/G+/E213*-3(-L1=CE(/01+213*
Figure 1: OoO core front-end pipeline with ODBP logic
and SN instructions never lookup any BPU tables. At the commit stage, DY instructions update any table in BPU as needed, ST instructions update BTB and RAS when needed, and SN instructions never update any BPU tables. In addition to determining the prediction policy, hint bits enable instruction type prediction; SN includes all nonbranch instructions as well as branches with not-taken bias. ST includes unconditional branches and branches with taken bias. DY includes all other branches. Upon a mis-speculation or exception, the pipeline is flushed and the program resumes from the most recent non-speculative context checkpoint. To do so, every checkpoint stores hints for upcoming instructions and restores them upon a squash event. Fig. 1 shows the gate level changes (in gray) made to the BASE front-end to filter out unnecessary BP, BTB, and RAS accesses. ODBP fetch and forwards branch prediction hints to the first stage where the two AND gates block unnecessary lookups. The prediction hint bit polices used in this work assume SN = b0 00, ST = b0 01, DY = b0 10. The most significant bit determines the prediction policy choice (static vs. dynamic), and the least significant bit gives the prediction in case the static policy is set. While b0 11 is unused here, it can be provisioned to further reduce table accesses by selectively looking up a BP table to obtain its prediction with lower access energy. Hint Forwarding: Fig. 1 shows hint bits are available to BPU just after instructionfetch while the PC is available immediately after the BPU completes the previous PC computation. The lag between the arrival time of PC and hints implies hint information must be annotated to earlier instructions. Equation 10 gives the distance between an instruction and its annotated hint; it depends on the fetch-width (FW), w, and fetch pipeline depth, n.
4
d = w ∗ (n − 1)
(10)
Notice, d is the distance between an instruction and its hint in the dynamic code trace. Annotating hints, however, is done at compile time where they are assigned to static instructions; instruction hints are hoisted up by d instructions and annotated to the corresponding static instruction in the control-flow graph (CFG). Depending on d and basic-block (BB) sizes, hints can be propagated across BB boundaries, and potentially contend for annotating their conflicting hints to the same upstream static instruction. The conflict is resolved by picking the hint with the best expected improvement in ED given profile information. To do so, path frequency statistics are collected during the profiling phase. No hint contention occurs when the number of hint alternatives is one. This happens when the hint for an instruction is annotated to an earlier instruction in the same BB, or when two BB’s are connected with an unconditional branch, or when all hints are of the same kind. Hints are either incorporated into the instruction set architecture (ISA) or into an existing ISA by storing a hint file separate and parallel to the executable file. Hint Annotation Algorithm: following the above analysis, a four-step algorithm is formulated for ahead hint annotation: first, for each static instruction, initialize ∆DST , ∆DSN , ∆EST , ∆ESN as discussed in section 2 . Then, for every static instruction j with distance d (equation 10) from instruction i in the CFG, compute its weighted path frequency (W P Fij ). Then, at every instruction i, find ∆DST,i by summing the weighted ∆DST,j ’s over each ij path in the CFG (i.e. sum of W P Fij ∗ ∆DST,j ). Repeat this for estimating ∆EST,i , ∆ESN,i , and ∆DSN,i . Next, use equation 5 for each static instruction, i; this equation is either satisfied for SN or ST but not both; if satisfied for neither, i is marked DY. Fig. 2 illustrates how hints are placed in ODBP for the simple case of a scalar pipeline (w = 1) along with the pipeline stages as in Fig. 1 (n = 2). For this combination, the hints must be placed one instruction earlier (d = 1) than the instruction receiving the hint. Instruction 3.M U L holds the hint for 4.BCC, and 4.BCC holds the hint for 5.BCC and 6.BCC combined. 5.BCC must hold the hint for 6.BCC and 7.SU B; since their hints disagree, 5.BCC is marked DY assuming that both 6.BCC and 7.SU B are frequently executed. The hint for 1.ADD is held by an earlier instruction, and 8.M U L holds the hint for the next instruction in the CFG.
4
Simulation Methodology
For performance analysis, we used Gem5 (with SimPoint support), a cycle accurate simulator with support for OoO execution and Alpha ISA; for energy modeling, we used McPAT [2, 10, 15]. SPEC CPU2006 benchmarks are used for the evaluation of this work [18]. The test inputs are used for profiling, and the reference inputs for performance and energy evaluation. Our evaluations use the 2Bc-gskew predictor designed for low prediction aliasing [22]; it consists of the following 2bit FSM tables: two g-share (G0, G1), a bimodal (BIM), a choice (META) predictor. Similar to Alpha EV8, BIM is used as the third gskew table [20]. In this study, we use the 45nm technol-
5
!"#$! !"#$ !%"&$"#'
! '! *! .! 1! 2! 4! 6!
""""#$$" """"$() """"+,""""/00"""""/00"-' - 3"/00"-* -'3"5,/ -*3"+,-
%"&!$ !"#$ !%"&$"#' () * "#&$+,-$"%#
% !& % !& % !& %"#& % $& % $& % !& % !&
! '! *! .! 1! 2! 4! 6!
""""#$$ % !& """"$() % !& """"+,%"#& """"/00"% $& """"/00"-' %"#& - 3"/00"-* % !& -'3"5,/ % !& -*3"+,-%%&'()*%+),*%(&
Figure 2: Hint forwarding in a scalar processor with fetch pipeline latency n = 2 and fetch-width w = 1.
ogy node, 2GHz clock frequency, 4096 entry BTB, 32kB L1I/L1D caches, and 1MB L2 cache.
5
Results
Fig. 3 shows the results for varying FW; increases in the distance between the hint and its corresponding instruction due to FW increases worsen hint annotation accuracy leading to lower speedups, and less energy saving (Fig. 3(a,b)). Nonetheless, our results suggest the ED remains favorable for all FW values commonly used in modern CPU’s (Fig. 3(c)). We observe a similar trend when sweeping the fetch pipeline depth (FPD) from 2 to 5 cycles. Also, the ED product improves when sweeping BP sizes. For the prediction configuration of FPD=2, FW=2, when BP size increases from {1,2,2,2}k to {16,32,32,32}k entries, BASE prediction accuracy improves, fewer aliasing mispredictions happen, speedup drops but remains above 1, energy savings grow, and so the overall ED gain improves. This trend suggests ODBP enables building large branch predictor tables without paying the toll for energy consumption and benefiting from more accurate predictions. Energy saving is accomplished by eliminating unnecessary BPU lookups and better performance is achieved through less BPU traffic and higher prediction accuracy for statically predicted instructions. Fig. 3(d) compares the mis-prediction per 1000 instructions (MPKI) of BASE and ODBP. For ODBP, the MPKI’s are broken into static and dynamic. In most cases, the overall MPKI for ODBP is lower than BASE because the ODBP algorithm favors the prediction model that delivers the lower ED which mainly implies the better prediction accuracy, despite the different inputs for profiling and evaluation; so, highly biased branches with accuracy < bias, use static prediction. Because of the distance between the hint and its instruction, it is possible that less frequently executed instructions be annotated with a hint that does not reflect their correct control behavior. For instance, at runtime, an unconditional branch may be assigned a SN hint and inevitably be mis-speculated later. Moreover, it is possible different paths of execution assign different hints to different dynamic instances of a static instruction. For instance, one path of execution may label a non-branch instruc-
6
0-*1/,+,+,+23,&'-.*/&
?@AB'C4C4C4DE$ ?@AB4C#C#C#DE$
GHA#$
$()
$()
G@IA4$
$0 1
+$
-./
G@IAF$ G@IA4$ G@IAJ$
0-*1/,+,+,+23,&'-.*/,&'(*G&
IP;(7.1$ ()*(+C$LI?@$ ()*(+C$?MNO$ ,-./C$LI?@$ ,-./C$?MNO$ $011C$LI?@$ $011C$?MNO$ 2'34+56C$LI?@$ 2'34+56C$?MNO$ 2775+C$?MNO$ 2775+C$LI?@$ 8.,9:(;*:7C$ 8.,9:(;*:7C$?MNO$ 716C$LI?@$ 716C$?MNO$ ;0$ 5+ (0 5$
0-*1/,+,+,+23&,&'(*+&
GHA'$
!"#$
?@AB&3CF'CF'CF'DE$
?@AB&C'C'C'DE$
G@IAJ$
&$
'(*+,&'-.*/&
,-./C$LI?@$
G@IA4$
&"'$
?@AB#C&3C&3C&3DE$
&"4$ &"'$ &$ !"#$ !"3$
()*(+C$?MNO$
G@IAF$
&$
-./
?@AB4C#C#C#DE$
()*(+C$LI?@$
G@IA'$
&"'$
$,
?@AB'C4C4C4DE$
!"3$
K!$ 3!$ J!$ 4!$ F!$ '!$ &!$ !$
&"'$ &"&$ &$ !"%$ !"#$
0-*1/,+,+,+23,&'-.*/& ?@AB&C'C'C'DE$
!"#$
R$ @:A"6#$:BC5=DEF&
GHA4$
'(*+,&'-.*/& &$
$()
456789:;#$& ;0$ 5+ (0 5$
!""#$%"&
($
N*(Q1$
Figure 3: (a) DBASE /DODBP , (b) EODBP /EBASE , (c) EDODBP /EDBASE , (d) static/dynamic branch mispredictions/1000 instructions of ODBP and BASE. The BP sizes correspond to the number of entries for {BIM,G0,G1,META} predictor tables.
tion SN and another may label it DY. In such circumstances, we observe undesirable BPU traffic that may lead to destructive aliasing. We find that while such instances are possible, they are rare events associated with less frequently executed code paths. For benchmarks with poor speedup such as gcc, static mis-speculation is the source of performance drop which is in turn the result of frequent mis-labeling cases at d = 2 and d = 8; we attribute the destructive hint contention at such d values to the small BB sizes and control flow structure of gcc. Static mis-prediction can be reduced if every instruction were to provide multiple hints to the BP so that ODBP chose the appropriate hint according to the execution path.
6
Related Work
The Intel Pentium 4 employs static prediction upon a BTB miss in decode after several pipeline stages [3]. The IBM Power 4 employs static hints to find if it must use the RAS or BTB [24]. Chaver, et al. [7] focus on profiling techniques to reduce BPU energy by dynamically adjusting its resources such as BTB size. Yang, et al. [25] focus on early identification of incoming branch addresses using static information about the program control-flow. Using this information BTB is accessed only for branches dynamically predicted-taken rather than on every instruction. Yeh, et al. [26] discuss various techniques for predicting and fetching multiple branch instructions per cycle. Mahlke, et al. [16] discuss a technique to eliminate hard to predict branches from BPU through hyperblock formation and instruction predication. Chang, et al. [5] eliminate
7
g-share aliasing by tracking highly biased branches in BTB and not updating BP for such branches; in doing so the fetch unit looks up BTB and BP for every fetch group. Monchiero, et al. [17] limit BP lookup though static hints used to activate the dynamic branch prediction only when a branch is about to execute in VLIW architectures; this technique eliminates lookups for only some non-branch instructions. ODBP’s prediction model makes it possible to reduce BPU accesses by combining static and dynamic branch prediction. It focuses on ahead prediction through static hint annotations to support energy efficient BP utilization at peak fetch efficiency; this needs hoisting hints to earlier instructions while improving ED.
7
Conclusion
In this work we proposed on-demand branch prediction, a technique that enables 36% average energy saving in the fetch unit and 8% total average processor energy saving by combining dynamic and static branch prediction. The compiler, using profiling information, provides static hints that enable the program to avoid dynamic predictions on both non-branch instruction and on highly-biased branches (where the bias exceeds the accuracy of the dynamic predictor); this leads to 80% reduction in BPU lookup events, and an average energy-delay (ED) product gain of 12%.
References [1] “The HSA Foundation,” http://hsafoundation.com. [2] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti et al., “The gem5 simulator,” ACM SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1–7, 2011. [3] D. Boggs, A. Baktha, J. Hawkins, D. T. Marr, J. A. Miller, P. Roussel, R. Singhal, B. Toll, and K. Venkatraman, “The Microarchitecture of the Intel Pentium 4 Processor on 90nm Technology.” Intel Technology Journal, vol. 8, no. 1, 2004. [4] B. Calder and D. Grunwald, “Next cache line and set prediction,” in ISCA, 1995, pp. 287–296. [5] P.-Y. Chang, M. Evers, and Y. N. Patt, “Improving Branch Prediction Accuracy by Reducing Pattern History Table Interference,” in PACT, ser. PACT ’96. IEEE Computer Society, 1996, p. 48. [6] P. Chang, E. Hao, T. Yeh, and Y. Patt, “Branch Classification: a New Mechanism for Improving Branch Predictor Performance,” in MICRO, 1994, pp. 22–31. [7] D. Chaver, L. Pinuel, M. Prieto, F. Tirado, and M. C. Huang, “Branch Prediction On Demand: an Energy-Efficient Solution,” in Proc. Int’l Symposium on Low Power Electronics and Design (ISLPED), 2003.
8
[8] G. P. Giacalone and J. H. Edmondson, “Method and apparatus for predicting multiple conditional branches,” Aug. 7 2001, US Patent 6,272,624. [9] D. Grunwald, D. Lindsay, and B. Zorn, “Static Methods in Hybrid Branch Prediction,” in PACT, 1998. [10] G. Hamerly, E. Perelman, J. Lau, and B. Calder, “SimPoint 3.0: Faster and more flexible program phase analysis,” Journal of Instruction Level Parallelism, vol. 7, no. 4, pp. 1–28, 2005. [11] P. Y.-T. Hsu, “Designing the TFP Microprocessor,” IEEE Micro, vol. 14, no. 2, pp. 23–33, 1994. [12] D. A. Jim´enez, S. W. Keckler, and C. Lin, “The impact of delay on the design of branch predictors,” in MICRO, 2000, pp. 67–76. [13] R. E. Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro, vol. 19, no. 2, pp. 24–36, 1999. [14] T. Lanier, “Exploring the design of the cortex-A15 processor,” in ARM TechCon, 2011. [15] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, “McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures,” in MICRO, 2009, pp. 469–480. [16] S. A. Mahlke, R. E. Hank, R. A. Bringmann, J. C. Gyllenhaal, D. M. Gallagher, and W.-m. W. Hwu, “Characterizing the impact of predicated execution on branch prediction,” in Proceedings of the 27th annual international symposium on Microarchitecture. ACM, 1994, pp. 217–227. [17] M. Monchiero, G. Palermo, M. Sami, C. Silvano, V. Zaccaria, and R. Zafalon, “Power-aware branch prediction techniques: a compiler-hints based approach for VLIW processors,” in Proc. ACM Great Lakes Symp. on VLSI (GLVLSI), 2004. [18] T. K. Prakash and L. Peng, “Performance characterization of SPEC CPU2006 benchmarks on Intel Core 2 Duo processor,” ISAST Trans. Comput. Softw. Eng, vol. 2, no. 1, pp. 36–41, 2008. [19] G. Reinman, T. Austin, and B. Calder, “A scalable front-end architecture for fast instruction delivery,” in ISCA, 1999, pp. 234–245. [20] A. Seznec, S. Felix, V. Krishnan, and Y. Sazeides, “Design Tradeoffs for the Alpha EV8 Conditional Branch Predictor,” in ISCA, 2002, pp. 295–306. [21] A. Seznec, S. Jourdan, P. Sainrat, and P. Michaud, “Multiple-block ahead branch predictors,” in ASPLOS, 1996, pp. 116–127. [22] A. Seznec, P. Michaud et al., “De-aliased hybrid branch predictors,” Technical Report RR-3618, Inria, Feb. 1999.
9
[23] B. Sinharoy, R. Kalla, W. J. Starke, H. Q. Le, R. Cargnoni, J. A. Van Norstrand, B. J. Ronchetti, J. Stuecheli, J. Leenstra, G. L. Guthrie, D. Q. Nguyen, B. Blaner, C. F. Marino, E. Retter, and P. Williams, “IBM POWER7 multicore server processor,” IBM J. Res. Dev., vol. 55, no. 3, pp. 1:1–1:29, 2011. [24] J. M. Tendler, J. S. Dodson, J. S. Fields, H. Le, and B. Sinharoy, “POWER4 System Microarchitecture,” IBM J. Res. Dev., vol. 46, no. 1, pp. 5–25, 2002. [25] C. Yang and A. Orailoglu, “Power efficient branch prediction through early identification of branch addresses,” in Proc. ACM Int’l Conf. on Compilers, Architecture and Synthesis for Embedded Systems (CASES), 2006, pp. 169–178. [26] T.-Y. Yeh, D. T. Marr, and Y. N. Patt, “Increasing the instruction fetch rate via multiple branch prediction and a branch address cache,” in Proc. ACM Int’l Conf. on Supercomputing (ICS), 1993, pp. 67–76.
10