Superscalar Branch Instruction Processor

T.L. Jeremiah, IBM Corp., Endicott, NY, USA
S. Vassiliadis, TU Delft, E.E. Dept., Delft, The Netherlands
B. Blaner, IBM Corp., Essex Junction, VT, USA
[email protected]
Abstract

In this paper we describe the design of the branch unit that has been implemented in some models of the recently announced IBM AS/400. The branch unit we describe is a modification of the unit originally designed for the experimental IBM ESA/370 SCISM processor. The main feature of the branch unit is its capability to remove branch instructions from the instruction stream dynamically and pre-process them before the branches enter the pipeline. This allows the processor to issue branch-less code from the instruction stack while the branches execute separately and in parallel with the processing of other instructions. Branch prediction in the SCISM processor is achieved by the tagging of instructions in the cache, while in the AS/400 it is achieved with a branch history table. (AS/400 and ESA/370 are registered trademarks of IBM.)
1 Introduction

It is well-known, see for example [1, 2], that branch instructions disrupt pipeline operation by introducing dead cycles (stalls or bubbles) when a decision must be made, based on a prior or coincident execution result, to either fetch a new (target) instruction stream or continue executing the sequential (fall-through) stream. This performance-diminishing effect of branch instructions is further amplified in computers with multiple pipelined functional units. The effect has been known for some time [3], and has recently received further attention [4] with the advent of superscalar and scalable compound instruction set machines (SCISM) [5]. These machines, which attempt to execute multiple instructions in parallel from a single instruction stream, are particularly susceptible to the adverse effects of branches: not only will branch instructions cause multiple pipelines to stall, but, because these machines consume instructions at a higher rate, the likelihood of a branch instruction entering the pipelines is greater than for a scalar machine. In this paper we describe the design of a branch unit implemented according to the SCISM [5] branch proposal [6]. The basic idea for the branch processor described here is as follows:
Branches are moved “up” in the instruction stream. That is, we remove branches from the incoming instruction stream and process them in parallel with the instructions that precede them. In superscalar machines the speculative execution of the branch is guided by some dynamic prediction mechanism (e.g., tags in the cache or branch history tables). The speculation is completed when the branch instruction is committed, that is, at the point of the instruction stream where the branch would have entered the processor pipeline to be executed. When the outcomes of the branches are predicted correctly, this idea allows the processor to execute from an instruction stream containing no branches.
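To make the commit point concrete, the following C sketch (purely illustrative; all names and data structures are ours, not the hardware design) models a window from which the branch has been removed: the branch is pre-processed with a predicted outcome, the remaining instructions are issued branch-free, and the prediction is checked, i.e. the branch is committed, when the issue pointer reaches the branch's original position.

#include <stdbool.h>
#include <stdio.h>

struct pending_branch {
    int  position;        /* index in the window where the branch originally sat */
    bool predicted_taken; /* outcome guessed when the branch was pre-processed   */
};

/* Stand-in for the real branch test (condition codes, count register, ...). */
static bool resolve_branch(void) { return true; }

static void issue_window(const char *insns[], int n, const struct pending_branch *br)
{
    for (int i = 0; i < n; i++) {
        if (br != NULL && i == br->position) {
            /* Commit point: this is where the branch would have entered the pipeline. */
            bool taken = resolve_branch();
            if (taken != br->predicted_taken)
                printf("mispredict at %d: cancel and redirect fetch\n", i);
            else
                printf("branch at %d committed without using an issue slot\n", i);
            continue;                    /* the branch itself is never issued here */
        }
        printf("issue: %s\n", insns[i]); /* branch-free instruction stream */
    }
}

int main(void)
{
    const char *window3[] = { "R9 = R9 + R15", "CC = R3 - R14", "BNE LL1" };
    struct pending_branch bne = { .position = 2, .predicted_taken = true };
    issue_window(window3, 3, &bne);
    return 0;
}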
The applicability of the scheme is demonstrated with the design of the branch processing in some models of the recently announced AS/400 machines. The paper is organized as follows: Section 2 provides some background and discusses the general operation of the branch unit. Section 3 then describes the branch processing in some models of the recently announced IBM AS/400 machines. Finally, in Section 4 we conclude with some general remarks.
2 Background and General Description

Fundamental to the SCISM machine organization [5] is the existence of an instruction tag, i.e., a field containing one or more bits that is associated with each instruction. One bit of the tag, called the t0 bit in [5], is required to identify compound instruction boundaries. Other bits may be defined as needed; in fact, one of the suggested bit definitions is for branch prediction. Tagging is performed in the instruction compounding unit (ICU), which takes instructions as input and produces compounded instructions, i.e., instructions with their tags, as output. These tagged instructions are then stored in a compound instruction cache (CIC), which provides the usual functions associated with an instruction cache and additionally provides storage for the tags. It is therefore conceivable that the ICU could statically predict the outcome of branch instructions as it formulates tags for compound instructions and embed the resulting prediction information in the tag. However, the prediction accuracy of such a system is questionable because the ICU has limited or no knowledge of the context of the branch, which is requisite for making accurate static predictions [7]. Furthermore, the SCISM machine organization may employ several units that resolve true data dependencies [8, 9] and alleviate load-use dependencies, as described for example in [10, 11, 12, 13, 14].

To illustrate the operation of a SCISM machine and the way it deals with branches, consider the code fragment described in Figure 1, which shows the fragment together with the fetching of the instruction text for a hypothetical superscalar pipelined machine. It is assumed that the text is fetched in terms of three “windows” and that the processor issues at most a window per cycle. Furthermore we assume, for simplicity of exposition, that instructions within windows one and two contain no branches and have no true dependencies with window three, which contains the code fragment. The critical path of the dynamic dependency graph is shown in Figure 2: using the SCISM constructs, the critical path associated with the dynamic execution is reduced to two critical blocks. We focus in more detail on the dependency associated with the branch instruction. To remove this dependency, the branch is first moved to window one and marked as speculative. The speculative execution of the branch (not discussed here) is guided by a means of branch prediction, and the branch is committed at the point where it should be executed. This process is depicted graphically in Figure 3. More information and details regarding the constructs can be found elsewhere [5].
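As a concrete, purely illustrative reading of the tagging just described, the C fragment below declares a per-instruction tag carrying the t0 compound-boundary bit of [5] and one additional bit reserved for a branch-prediction hint; the field widths, bit positions and names are our assumptions, not the SCISM encoding.

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical tag layout: bit 0 = t0 (compound-instruction boundary),
 * bit 1 = branch-prediction hint.  The real SCISM tag may define more
 * bits and a different layout; see [5]. */
#define TAG_T0        0x1u
#define TAG_PRED_TKN  0x2u

struct tagged_insn {
    uint32_t text;  /* 4-byte instruction text                        */
    uint8_t  tag;   /* tag bits stored alongside the text in the CIC  */
};

static inline bool starts_compound(const struct tagged_insn *i)
{
    return (i->tag & TAG_T0) != 0;        /* t0: compound instruction boundary */
}

static inline bool hinted_taken(const struct tagged_insn *i)
{
    return (i->tag & TAG_PRED_TKN) != 0;  /* static prediction embedded by the ICU */
}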
3 General Description of the IBM AS/400 Branch Processing

In this section we give a brief description of the branch processing of some models of the IBM AS/400. The overall design of some models of the newly announced AS/400 processors follows the SCISM [5] organization, and the branch processor is modeled after the branch processor of the ESA/370 SCISM experimental implementation. The scheme implemented in these machines is a modification of the scheme presented in the previous sections; the major modification is that a branch history table is employed for dynamic branch prediction. The following models employ the branch processor described here: the AS/400 models 400, 500 and 510, the AS/400 Advanced Server models 40s and 50s, and the AS/400 Advanced 36 (model 436). The description of the branch processing is as follows. The dataflow of the branch processor is shown in Figure 4. Excluding the address generation logic, the branch processing is divided into three major areas (see Figure 4): the Instruction Dispatch Unit (IDU), the Branch Lookahead Unit (BLU), and the Branch Queue and Branch Test Unit (BQBTU). (To avoid detailed explanations that provide no further insight into the branch processor, the discussion that follows is a general description of the unit and may not correspond to the figure in every detail.)

The instruction dispatch unit (IDU) operates as follows. Instructions are fetched from the I cache and placed into an 8-entry instruction buffer inside the instruction dispatch unit. From there, the instructions are sent to the various execution units by the instruction dispatching mechanism. Loading and unloading of the instruction buffer is controlled by two sets of pointers which keep track of how much instruction text has been fetched and how much has been dispatched in any given cycle. Whenever the instruction buffer is empty and the processor is waiting for instructions, the output of the I cache can be bypassed around the instruction buffer directly to the appropriate execution unit.

In accordance with the general SCISM directives, as reported in [5], such processors implement cache tagging or decoding of instructions. Here only a subset of the tagging proposed in the SCISM paper [5] is implemented, as there is no inter-instruction-dependent decoding or compounding; the tagging concerns only the routing of instructions. Routing of instructions is accomplished with the help of a two-bit field stored in the I cache along with each 4-byte instruction. The routing field indicates whether the instruction is to be executed by the Load/Store Unit (LSU), the Branch Unit, or one of the remaining execution units. A very few instructions are routed to both the LSU and the branch unit, and these instructions use the remaining decode of the two-bit field. The routing field is created on the fly as instructions are loaded into the I cache from storage.

Zero, one, two, or three instructions are dispatched to the execution units in any given cycle, depending on the order in which the instructions occur in the program and how many instructions are available in the buffer or on the instruction bus from the I cache. In order to keep track of instruction order for the purposes of handling interrupts and purging conditionally executed instructions, each instruction is dispatched with an offset indicator, which remains with the instruction as it is executed. The offset indicator specifies whether the instruction was the first, second, or third instruction dispatched in the cycle.

The Branch Lookahead Unit (BLU) searches the incoming instruction stream for branches and attempts to fetch more instructions along the branch direction most likely to be taken. This process occurs somewhat asynchronously, several cycles ahead of the actual branch instruction execution whenever possible. The BLU employs several methods to select the most likely branch outcome. Some branch instructions are unconditional, so merely decoding the instruction is sufficient to establish the outcome of the branch. Others rely on the value of the Count Register, so the value of the Count Reg is tested to guess at the branch outcome.
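A minimal C sketch of the two-bit routing field and the offset indicator described above follows. The code-point assignment, the sample encodings and the function names are assumptions made for illustration; the paper states only that the four decodes are the LSU, the branch unit, the remaining execution units, and the LSU-plus-branch-unit combination. (The sketch forwards the branch text at dispatch for simplicity; as described later, the dispatcher actually only signals the branch unit to pop its queue.)

#include <stdint.h>
#include <stdio.h>

/* Two-bit routing field kept in the I cache beside each 4-byte instruction.
 * The assignment of code points below is hypothetical. */
enum route {
    ROUTE_OTHER      = 0,  /* FXU/FPU/CRU (the remaining execution units) */
    ROUTE_LSU        = 1,  /* Load/Store Unit                             */
    ROUTE_BRANCH     = 2,  /* Branch Unit                                 */
    ROUTE_LSU_BRANCH = 3   /* the few instructions sent to both           */
};

struct icache_entry {
    uint32_t text;   /* 4-byte instruction text                        */
    uint8_t  route;  /* enum route, created on the fly at I cache load */
};

/* Placeholder sink standing in for the real execution-unit interfaces. */
static void send(const char *unit, uint32_t text, int offset)
{
    printf("%-5s <- %08x (offset %d)\n", unit, (unsigned)text, offset);
}

/* Dispatch of zero to three instructions per cycle; each carries an offset
 * indicator (0, 1 or 2) so that interrupts and purges of conditionally
 * executed instructions can restore program order later. */
static void dispatch(const struct icache_entry *e, int count /* 0..3 */)
{
    for (int offset = 0; offset < count; offset++) {
        uint32_t t = e[offset].text;
        switch ((enum route)e[offset].route) {
        case ROUTE_LSU:        send("LSU", t, offset);                         break;
        case ROUTE_BRANCH:     send("BRU", t, offset);                         break;
        case ROUTE_LSU_BRANCH: send("LSU", t, offset); send("BRU", t, offset); break;
        case ROUTE_OTHER:      send("OTHER", t, offset);                       break;
        }
    }
}

int main(void)
{
    struct icache_entry group[3] = {   /* instruction encodings are dummies */
        { 0x00000001u, ROUTE_OTHER },
        { 0x00000002u, ROUTE_LSU },
        { 0x00000003u, ROUTE_BRANCH },
    };
    dispatch(group, 3);
    return 0;
}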
[Figure 1: Example Code Fragment. Windows one and two contain no branches and no dependencies; window three contains the nine-instruction fragment:
1: R5 = R5 XOR R12
2: R21 = [R20(D)]
3: R21 = R21 OR R30
4: R3 = R5 + R16
5: R3 = [R3 + R21]
6: R9 = R9 + R15
7: R14 = R9 - R12
8: CC = R3 - R14
9: BNE LL1]
[Figure 2: Critical Path. The dynamic dependency graph of the fragment, showing execution dependencies, load-use dependencies, address generation (Agi), and the condition-code (CC) dependency feeding the branch (instruction 9).]
[Figure 3: Critical Path with Branch Removal. The loads (instructions 2 and 5) and the branch (instruction 9) are moved to window one and executed speculatively; their commit points remain at the original positions in window three, where collapsed operations such as R3 = R5 XOR R12 + R16 and R14 = R9 + R15 - R12 are also shown.]

The Count Reg may change between the point at which the prediction is made and the point at which the branch actually executes, so this method of prediction is not infallible. Likewise, branches which use the value of the Link Register or the Count Register to create a branch target address may generate an incorrect prediction if the Count Reg or Link Reg is not set up several instructions before the branch. Certain conditional branch instructions rely on the use of the Branch History Table (BHT) for predicting whether the branch is taken. The BHT is a 512 x 2-way buffer which is loaded whenever a conditional branch instruction is executed and the particular branch was not previously found in the BHT when the BLU preprocessed the branch. The BHT is also updated if the BLU found an entry but the prediction was incorrect. Each time the BLU processes a conditional branch, it saves the address of the instruction following the branch to use as the index into the BHT. In other words, the BHT is accessed using the address of the beginning of the sequential instruction stream prior to a given branch instruction, instead of using the address of the branch instruction itself. This technique avoids the necessity of calculating the address of the branch instruction in series with accessing the BHT, thus relieving what was discovered to be a critical delay path in the logic. The BLU accesses the BHT using effective address bits
53:61 and matches the two entries accessed from the BHT with a hashed Effective Address. The BHT entries and the hashed address are created as follows:

$\mathrm{EA}_{32:35} \oplus \mathrm{EA}_{47:50} \parallel \mathrm{EA}_{51:52}$   (1)

where $\oplus$ is the exclusive-or and $\parallel$ is concatenation. In order to save area, the BHT contains only a 6-bit tag, but this is not expected to materially reduce the accuracy of the prediction. If a match occurs, the branch is predicted taken, and the target address computed by the BLU agen function is used to redirect instruction prefetching. The instruction text fetched from the target address replaces the text in the instruction buffer immediately following the branch itself. If the branch is predicted not taken, either by virtue of not having a matching entry in the BHT, or via one of the other mechanisms, the BLU makes no attempt to redirect instruction fetching that cycle.
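The index and tag computation of equation (1) can be written out concretely as in the C sketch below. It assumes IBM big-endian bit numbering of a 64-bit effective address (bit 0 the most significant), which is consistent with the 9-bit index 53:61 selecting one of 512 sets; the structure layout and function names are illustrative, not the hardware implementation.

#include <stdint.h>
#include <stdbool.h>

/* Extract IBM-numbered bits hi..lo (bit 0 = most significant) of a 64-bit EA. */
static inline uint64_t ea_bits(uint64_t ea, int hi, int lo)
{
    int width = lo - hi + 1;
    return (ea >> (63 - lo)) & ((1ull << width) - 1);
}

struct bht_entry {
    bool    valid;
    uint8_t tag;      /* 6-bit hashed tag                               */
    /* predicted target, history state, etc. omitted in this sketch    */
};

static struct bht_entry bht[512][2];   /* 512 sets, 2-way */

/* ea is the address of the beginning of the sequential instruction stream
 * leading up to the branch (saved by the BLU when it processed the previous
 * branch), so no branch-address calculation sits in series with the lookup. */
static bool bht_predicts_taken(uint64_t ea)
{
    unsigned index = (unsigned)ea_bits(ea, 53, 61);                    /* 9 bits */
    uint8_t  tag   = (uint8_t)(((ea_bits(ea, 32, 35) ^
                                 ea_bits(ea, 47, 50)) << 2) |
                               ea_bits(ea, 51, 52));                   /* 6 bits */

    for (int way = 0; way < 2; way++)
        if (bht[index][way].valid && bht[index][way].tag == tag)
            return true;    /* hit: predict taken, redirect prefetch to target */
    return false;           /* miss: predict not taken                         */
}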
[Figure 4: Branch Unit High-level Design. The figure shows the IDU with its 8-entry instruction buffer (IBFR) fed from the I cache, the BLU with the BHT, its hit compare and prefetch/target address generation, and the BQBTU with the three-entry BQ, the branch test logic, the CTR, LINK, SRR0, SRR1 and MSR registers, the I cache address control, and the paths to the FXU, FPU, CRU and LSU.]

As branches are processed by the BLU, they are placed into one of the three entries in the Branch Queue (BQ). If the BQ fills up, the BLU stops processing. The information stored in the BQ consists of a decoded form of the branch instruction, the computed branch target address, the next sequential address, and the predicted branch outcome. When the branch instruction is actually executed later on, this information is compared to the outcome of the branch execution, and if different, the instruction stream is refetched from the correct address.

The BLU does not actually remove the branch instruction from the instruction buffer when it performs the preprocessing function, nor is the branch instruction actually passed to the branch unit at dispatch time. Instead, as the instruction dispatch mechanism encounters the branch instruction, it signals the branch unit to access the next decoded branch entry in the BQ. In this way, execution of the branch is synchronized with execution of other instructions in the program. Branches are always dispatched at offset 0, meaning they are always the first of the three possible instructions which can be dispatched in any given cycle. This is done to permit previous instructions time to complete execution and set any conditions the branch may need to test. The two instructions which may be dispatched along with the branch are cancelled if the branch test fails.

The branch pipe is a two-stage pipeline, corresponding to a Decode phase and an Execute phase. During Decode, the BQ entry is accessed and latched in a pipeline buffer, freeing that entry for the BLU to use on the next cycle if necessary. During the Execute phase, the branch test is made and compared to the predicted result.
Any target address mismatch or wrong guess results in cancellation of subsequent instructions in the pipe, and refetching of the correct instruction stream. The pipeline waits one cycle for the I-fetch to take place (or more if an I cache miss or translation miss occurs), plus one additional cycle for fetched instructions to be decoded. Thus, a mispredicted branch causes a two-cycle bubble in the pipe.
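The C sketch below summarizes the branch-queue handshake and the two-stage branch pipe just described; the structure layout, queue management and function names are illustrative, not the hardware interfaces.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* One of the three Branch Queue entries filled by the BLU. */
struct bq_entry {
    uint32_t decoded;        /* decoded form of the branch instruction */
    uint64_t target_addr;    /* branch target computed by the BLU agen */
    uint64_t seq_addr;       /* next sequential address                */
    bool     predicted_taken;
    bool     valid;
};

static struct bq_entry bq[3];
static int bq_head;

/* Stand-ins for the real branch test and fetch redirection. */
static bool branch_test(uint32_t decoded) { (void)decoded; return true; }
static void refetch(uint64_t addr)
{
    printf("cancel younger instructions, refetch from %#llx\n",
           (unsigned long long)addr);   /* costs the two-cycle bubble */
}

/* Decode phase: on the dispatcher's signal, pop the next BQ entry into a
 * pipeline buffer, freeing the slot for the BLU on the next cycle. */
static struct bq_entry branch_decode_phase(void)
{
    struct bq_entry e = bq[bq_head];
    bq[bq_head].valid = false;
    bq_head = (bq_head + 1) % 3;
    return e;
}

/* Execute phase: make the branch test and compare it with the prediction;
 * a wrong guess also cancels the at most two instructions dispatched with
 * the branch (which is always at offset 0). */
static void branch_execute_phase(const struct bq_entry *e)
{
    bool taken = branch_test(e->decoded);
    if (taken == e->predicted_taken)
        return;                                      /* prediction held     */
    refetch(taken ? e->target_addr : e->seq_addr);   /* redirect the stream */
}

int main(void)
{
    bq[0] = (struct bq_entry){ .decoded = 0, .target_addr = 0x1000,
                               .seq_addr = 0x2004, .predicted_taken = false,
                               .valid = true };
    struct bq_entry e = branch_decode_phase();   /* dispatcher signals the BRU */
    branch_execute_phase(&e);                    /* mispredict in this example */
    return 0;
}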
4 Summary

In this paper we discussed the design of the branch unit present in some models of the recently announced IBM AS/400 systems. The basic idea for the branch processor has been the following:
Branches are moved “up” in the instruction stream. That is, we remove branches from the incoming instruction stream and process them in parallel with the instructions preceding them. In pipelined superscalar machines the speculative execution of the branch is guided by some dynamic prediction mechanism (e.g., tags in the cache, branch history tables, etc.). The speculation is completed when the branch instruction is committed, that is, at the point of the instruction stream where the branch should enter the pipeline of the processor and be executed. When the outcomes of the branches are predicted correctly, this idea allows the processor to execute from an instruction stream containing no branches. Our design and implementation experience with such a branch unit strongly suggests that for superscalar machines branch instruction processing can be achieved with the following additional characteristics:
A simple mechanism can be provided to synchronize the branch tests done in the branch unit with the instructions being processed in the execution unit. The proposed design causes the most frequent branches to be removed from the instruction stream of the execution unit and potentially executed in zero cycles, providing a significant performance gain.
References

[1] J. K. F. Lee and A. J. Smith, “Branch prediction strategies and branch target buffer design,” IEEE Computer, January 1984.
[2] J. E. Smith, “A study of branch prediction strategies,” Proceedings of the 8th Annual Symposium on Computer Architecture, May 1981.
[3] E. M. Riseman and C. C. Foster, “The inhibition of potential parallelism by conditional jumps,” IEEE Transactions on Computers, pp. 1405–1411, December 1972.
[4] M. D. Smith, M. Johnson, and M. Horowitz, “Limits on multiple instruction issue,” Proceedings of ASPLOS III, ACM, pp. 290–302, 1989.
[5] S. Vassiliadis, B. Blaner, and R. Eickemeyer, “SCISM: A scalable compound instruction set machine,” IBM Journal of Research and Development, vol. 38, no. 1, pp. 59–77, January 1994.
[6] B. Blaner, S. Vassiliadis, and T. Jeremiah, “A branch instruction processor for SCISM organizations,” IEEE EUROMICRO 95, Conference Proceedings, pp. 285–293, September 1995.
[7] S. Bandyopadhyay, V. S. Begwani, and R. B. Murray, “Compiling for the CRISP microprocessor,” Proceedings of COMPCON, pp. 96–100, Spring 1987.
[8] S. Vassiliadis, J. Phillips, and B. Blaner, “Interlock collapsing ALUs,” IEEE Transactions on Computers, vol. 42, pp. 825–839, July 1993.
[9] J. Phillips and S. Vassiliadis, “High-performance 3-1 interlock collapsing ALUs,” IEEE Transactions on Computers, vol. 43, pp. 257–268, March 1994.
[10] R. Eickemeyer and S. Vassiliadis, “A load-instruction unit for pipelined processors,” IBM Journal of Research and Development, vol. 37, no. 4, pp. 547–564, July 1993.
[11] T. Chen and J. Baer, “Effective hardware-based data prefetching for high-performance processors,” IEEE Transactions on Computers, vol. 44, no. 5, pp. 609–623, May 1995.
[12] T. Austin and G. Sohi, “Zero-cycle loads: Microarchitecture support for reducing load latency,” Proceedings of the 28th Annual ACM/IEEE International Symposium and Workshop on Microarchitecture, pp. 82–92, June 1995.
[13] S. Mehrotra and L. Harrison, “Examination of a memory access classification scheme for pointer-intensive and numeric programs,” Proceedings of the 10th International Conference on Supercomputing, May 1996.
[14] Y. Sazeides, S. Vassiliadis, and J. E. Smith, “The performance potential of data dependence speculation and collapsing,” Proceedings of the 29th Annual ACM/IEEE International Symposium on Microarchitecture (MICRO-29), December 1996.