A Branch Instruction Processor for SCISM Organizations

B. Blaner, IBM Corp., Essex Junction, VT, USA
S. Vassiliadis, TU Delft, Faculty of Electrical Engineering, Delft, The Netherlands
T.L. Jeremiah, IBM Corp., Endicott, NY, USA
Abstract
The performance degradation caused by branch instructions in pipelined computers is well known. The degradation is even greater on computers with multiple pipelines processing a single instruction stream, such as superscalar and scalable compound instruction-set machines (SCISM). Several branch prediction schemes have been proposed that attempt to reduce this performance penalty. One of these, dynamic prediction of branch outcomes by tagging instructions in an instruction cache with prediction information, is adapted to an IBM ESA/370 SCISM implementation with several important additions. The adaptation may be extended to other architectures with similar characteristics. More significantly, a scheme is developed that allows the predominant IBM ESA/370 branch instructions to be removed from the instruction stream. These instructions, in effect, execute in zero time when the prediction is correct, thereby significantly increasing the performance achieved by the base SCISM machine organization.
1 Introduction
The technique of pipelined instruction processing is widely known and employed in designing computers. Pipelining offers a significant performance speed-up if conditions conducive to pipelining can be met and sustained [1, 2]. Unfortunately, the characteristics of program behavior are such that these conditions frequently go unmet, thus reducing, often considerably, the speed-up actually attained when pipelining is employed. One such characteristic is the comparatively high frequency of branch instructions occurring in programs. In fact, it has been observed that branch instructions may comprise 25% or more of all instructions executed in a given program trace [3]. Branch instructions disrupt pipeline operation by introducing dead cycles (stalls or bubbles) when a decision must be made, based on a prior or coincident execution result, to either fetch a new (target) instruction stream or continue executing the sequential (fall-through) stream. As the time it takes to make the decision and fetch the appropriate stream increases, the number of bubbles injected into the pipeline increases, and, since branch instructions are such a large fraction of the instruction stream, the speed-up obtained by pipelining decreases.

The performance-diminishing effect of branch instructions is further amplified in computers with multiple pipelined functional units. This effect has been known for some time [4], and has recently received further attention [5] with the advent of superscalar [6, 7] and scalable compound instruction set machines (SCISM) [12]. These machines, which attempt to execute multiple instructions in parallel from a single instruction stream, are particularly susceptible to the adverse effects of branches: not only will branch instructions cause multiple pipelines to stall, but since these machines consume instructions at a higher rate, the likelihood of a branch instruction entering the pipelines is greater than for a scalar machine.

This topic has been the focus of much research, in both academia and industry, and many useful results have been produced. Smith [8] examines several branch prediction strategies, whereby the outcome of a branch is predicted and the appropriate instruction stream is fetched directly and executed as soon as possible. Lee and Smith [3] define and quantify the branch target buffer (BTB) as a practical means of predicting both the outcome and target address of branch instructions. To achieve an acceptable BTB hit ratio for a high-performance computer, say 0.9 or greater, a comparatively large BTB is required. For example, Lee and Smith show that a 1K-entry, 2-way set associative BTB yields a hit ratio of 0.919 on a particular IBM instruction trace. A more economical approach was suggested by Smith [8]: instructions in an instruction cache are tagged with bits for dynamic branch prediction [3].
Although this technique does not supply a target address for a predicted-taken branch, it is readily applicable to a SCISM utilizing a compound instruction cache, as will be demonstrated.

This paper will show how an IBM ESA/370 SCISM design, an implementation of a SCISM [12], addresses the branch problem. The techniques developed, although applied specifically to a particular implementation, may be applied to the IBM ESA/370 designs in general, as well as to other architectures with similar characteristics. Section 2 provides some discussion for background purposes and discusses the main results. Sections 3 and 4 will then show how branch instructions are processed early in the pipeline so that the largest fraction of branch instructions can actually be removed from the instruction stream and potentially execute in zero time. Finally, Section 5 will summarize the results.
2 Background and Main Results
Fundamental to the SCISM architecture [12] is the existence of an instruction tag, i.e., a field containing one or more bits that is associated with each instruction. One bit of the tag, called the t0 bit in [12], is required to identify compound instruction boundaries. Other bits may be defined as needed. In fact, one of the suggested bit definitions is for branch prediction. This is achieved in the instruction compounding unit (ICU), which takes instructions as input and produces compounded instructions, i.e., instructions with their tags, as output. These tagged instructions are then stored in a compound instruction cache (CIC), which provides the usual functions associated with an instruction cache and additionally provides storage for the tags. It is therefore conceivable that the ICU could statically predict the outcome of branch instructions as it formulates tags for compound instructions and embed the resulting prediction information in the tag. However, the prediction accuracy of such a system is questionable because the ICU will have limited or no knowledge of the context of the branch, which is requisite for making accurate static predictions [11].

Smith [8] has suggested tagging instructions in an instruction cache with bits for dynamic branch prediction [3]. This approach is conceptually applicable to a SCISM [12] utilizing a compound instruction cache, can be designed to comply with the IBM ESA/370 architecture, and provides acceptable branch prediction accuracy. It may appear that implementing Smith's approach would require increasing tag storage to accommodate the number of branch prediction bits desired. However, after a more careful study of the tag definition in [12] and certain related elements of the IBM ESA/370 architecture, an alternative approach has been developed that:

1. requires no tag bits beyond the t0 bit for dynamic prediction of all ESA/370 branches, and
2. depending on the opcode of the instruction, provides up to two bits for dynamic branch prediction, again without extending the size of the tag.

These enhancements are made possible by including hardware to process certain branch instructions in zero cycles. The basic dynamic branch prediction algorithm for this design is as follows:
The ICU initializes the prediction bits and embeds them in the required tag when a line of instructions is fetched into the CIC. When a branch instruction is fetched from the CIC for execution, the bits are used to predict the outcome of the branch and the appropriate instruction stream is subsequently fetched. If the prediction is correct, instruction processing proceeds without stalling the pipeline. If the prediction is found to be incorrect, the prediction bits for the incorrectly predicted branch are updated in the CIC. The execution units are stalled while the correct instruction stream is fetched.
Thus, through this mechanism, the CIC functions both as a compound instruction cache and as a branch target buffer. A CIC with integrated BTB functions will henceforth be referred to as the combined CIC-BTB. The following sections will detail the organization of the branch prediction bits and will assess two parameters of concern: branch prediction accuracy and hit ratio of the combined CIC-BTB. Also, manipulation of the prediction bits will be presented.
3 Prediction Bit Organization
To define an organization of the tag-resident dynamic branch prediction bits, it is necessary to first postulate some structural properties of the tag. By definition, the IBM ESA/370 instructions are one, two, and three halfword (one halfword = two bytes) quantities aligned on halfword address boundaries. Therefore, to allow tagging of every instruction in the CIC, a tag must be logically associated with every halfword in the CIC. If tags are assumed to be one bit in extent (the t0 bit is minimally required), then clearly two- and three-halfword instructions have one and two extraneous tag bits, respectively, that are available for other use, potentially as branch prediction bits.

Table 1 lists the IBM ESA/370 branches, their frequencies relative to all instructions in a TSO-XA operating system representative workload trace, their frequencies relative to branch instructions only, their frequencies of taken and not taken outcomes, and their length in halfwords. From the definition of the instructions and their lengths, the branches may be further subdivided as follows:

1. The unconditional branches BAL, BALR, BAS, BASR, BSM, BASSM, and EX.
2. The conditional two-halfword branches BC, BCT, BXLE, and BXH.
3. The conditional one-halfword branches BCR and BCTR.

The outcome of the unconditional branches can always be predicted with absolute certainty from their opcode and register fields. Thus, no prediction bits are required for these instructions. The conditional two-halfword branches have one tag bit available for branch prediction. These branches plus the unconditional branches account for over 90% of the branches in the TSO-XA trace. The single-halfword branches BCR and BCTR are somewhat problematic in that there would appear to be no means of predicting these branches without increasing the size of the tag. To the contrary, there are several possible solutions to the problem:

1. Predict BCR as taken. For TSO-XA, this prediction will be correct 91.9% of the time. The prediction could be more informed by conditioning it with the BCR mask (M1) and R2 fields, i.e., by definition, the branch is not taken if M1 = 0 or R2 = 0.
2. Predict BCTR as not taken. For TSO-XA, this prediction will always be correct, since the instruction is really used to decrement a register when R2 = 0 and fall through to the next instruction. Once again, the prediction direction may be conditioned by the value of the R2 field.

Both of these solutions risk generalizing TSO-XA results to other instruction mixes.
For example, Lee and Smith [3] analyze several traces where the frequency of BCTR and its likelihood of being taken is higher than in the TSO-XA trace. The worst-case trace is an IBM business workload where 5% of all branches are BCTR and 17.3% of these are taken branches. Even so, predicting BCTR as never taken still produces an acceptable 82.7% prediction accuracy. This approach is adopted for BCTR prediction in this implementation.

For BCR, however, and for BC as well, there is an alternative. Section 4 will detail a scheme that obviates the need for compounding these instructions because they are removed from the instruction stream during the instruction fetching process. For these two instructions, even the t0 bit is not required because the BC and BCR opcodes are explicitly decoded to facilitate removing them from the instruction stream. Therefore, the tag bit associated with each halfword is available for one bit of prediction for BCR and two bits of prediction for BC.
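The bit budget described above can be illustrated with a minimal sketch. It assumes one tag bit per halfword, as postulated in this section; the function name `prediction_bits` and the flag `removed_from_stream` are illustrative and do not appear in the paper.

```python
def prediction_bits(length_halfwords: int, removed_from_stream: bool) -> int:
    """Tag bits left over for branch prediction, assuming one tag bit per halfword.

    One bit (t0) marks compound instruction boundaries unless the instruction
    is removed from the stream before compounding (the BC/BCR case).
    """
    if length_halfwords not in (1, 2, 3):
        raise ValueError("ESA/370 instructions are 1, 2, or 3 halfwords long")
    reserved_for_t0 = 0 if removed_from_stream else 1
    return length_halfwords - reserved_for_t0


# Lengths follow Table 1; only BC and BCR are removed from the stream.
print(prediction_bits(2, removed_from_stream=True))   # BC
print(prediction_bits(1, removed_from_stream=True))   # BCR
print(prediction_bits(2, removed_from_stream=False))  # BCT, BXLE, BXH
print(prediction_bits(1, removed_from_stream=False))  # BCTR
```

Under these assumptions BCTR is left with no spare bit, which is consistent with the choice to predict it statically as not taken.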
3.1 Prediction Accuracy
BAL, BALR, BAS, BASR, BSM, BASSM, and EX are unconditional branches and may be predicted with 100% certainty. Likewise, with some further decoding, BC with M1 = 0, BC with M1 = 15, BCR with M1 = 0, BCR with M1 = 15, and BCR with R2 = 0 can all be predicted with 100% certainty. With one bit of prediction, as will be the case with BCR, BCT, BXLE, and BXH, Lee and Smith [3] show prediction accuracy ranging from 79.7 to 95.2% for IBM workloads of varying nature. For two bits of prediction, as could be the case with BC, Lee and Smith show accuracies ranging from 83.4 to 96.6%.

If it is assumed that these prediction accuracies can be applied to individual branch instructions or groups of them, then it is possible to speculate on the overall range of prediction accuracy for the ESA/370 SCISM implementation. The limits of the range may be calculated by summing over all branches the product of the frequency of a given branch (with respect to all other branches) and the prediction accuracy of the branch. Using Lee and Smith's ranges applied to the branch frequencies from Table 1, the low end of the range, $p_l$, is

$$p_l = 1.0\,(3.8 + 3.0 + 1.4 + 1.3 + 0.55 + 0.46) + 0.797\,(8.8 + 8.0 + 0.1) + 0.834\,(68.6) + 0.827\,(1.7) = 82.6$$

The first term is the percentage of branches predicted with 100% accuracy weighted by the frequencies of the corresponding branches (BAL, BALR, EX, BASR, BAS, and BASSM). The second is the same calculation for branches predicted at 79.7% accuracy (BCT, BCR, and BXLE; see footnote 1). The third is for BC, predicted at 83.4% accuracy. The fourth is for BCTR predicted at 82.7% accuracy (a worst-case value, assuming, as stated earlier, that BCTR is always predicted as not taken).

| Instruction | Operands | % of Trace | % of Branches | % Taken | % Not Taken | Length (halfwords) |
|---|---|---|---|---|---|---|
| BC | M1,D2(X2,B2) | 17.23 | 68.6 | 54.3 | 45.7 | 2 |
| BCT | R1,D2(X2,B2) | 2.213 | 8.8 | 95.4 | 4.6 | 2 |
| BCR | M1,R2 | 2.013 | 8.0 | 91.9 | 8.1 | 1 |
| BAL | R1,D2(X2,B2) | 0.964 | 3.8 | 100 | 0 | 2 |
| BALR | R1,R2 | 0.747 | 3.0 | 62.3 | 37.7 | 1 |
| BSM | R1,R2 | 0.548 | 2.2 | 100 | 0 | 1 |
| BCTR | R1,R2 | 0.434 | 1.7 | 0 | 100 | 1 |
| EX | R1,D2(X2,B2) | 0.351 | 1.4 | - | - | 2 |
| BASR | R1,R2 | 0.326 | 1.3 | 62.3 | 37.7 | 1 |
| BAS | R1,D2(X2,B2) | 0.137 | 0.55 | 100 | 0 | 2 |
| BASSM | R1,R2 | 0.115 | 0.46 | 100 | 0 | 1 |
| BXLE | R1,R3,D2(B2) | 0.026 | 0.10 | 57.4 | 42.6 | 2 |
| BXH | R1,R3,D2(B2) | 0.0004 | 0.01 | 65.7 | 34.3 | 2 |

Table 1: Branch statistics for TSO-XA trace

The high end of the range, $p_h$, is calculated similarly, using the prediction accuracy values at the high end of the range:

$$p_h = 1.0\,(3.8 + 3.0 + 1.4 + 1.3 + 0.55 + 0.46) + 0.952\,(8.8 + 8.0 + 0.1) + 0.966\,(68.6) + 1.0\,(1.7) = 94.6$$

Thus, prediction accuracies may be expected to range from 82.6% to 94.6%. For BC and BCR alone, a similar calculation yields prediction accuracies ranging from 83.0% to 96.5%.
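The weighted sums above can be checked with a short script. This is a sketch, not part of the original analysis: the names `FREQ`, `overall_accuracy`, and `bc_bcr_accuracy` are illustrative, the terms mirror the calculation as printed (so BSM and BXH are omitted, as in the text), and the BC/BCR-only figure is assumed to be normalized by the combined BC and BCR frequency, which reproduces the quoted values.

```python
# Branch frequencies (% of all branches) taken from Table 1.
FREQ = {"BC": 68.6, "BCT": 8.8, "BCR": 8.0, "BAL": 3.8, "BALR": 3.0,
        "BCTR": 1.7, "EX": 1.4, "BASR": 1.3, "BAS": 0.55, "BASSM": 0.46,
        "BXLE": 0.1}

CERTAIN = ("BAL", "BALR", "EX", "BASR", "BAS", "BASSM")  # predicted at 100%
ONE_BIT = ("BCT", "BCR", "BXLE")                          # one prediction bit


def overall_accuracy(one_bit: float, two_bit: float, bctr: float) -> float:
    """Frequency-weighted accuracy (in %) using the same four terms as the text."""
    total = 1.0 * sum(FREQ[b] for b in CERTAIN)
    total += one_bit * sum(FREQ[b] for b in ONE_BIT)
    total += two_bit * FREQ["BC"]
    total += bctr * FREQ["BCTR"]
    return total


def bc_bcr_accuracy(one_bit: float, two_bit: float) -> float:
    """Accuracy (in %) over BC and BCR only, normalized by their combined frequency."""
    weighted = two_bit * FREQ["BC"] + one_bit * FREQ["BCR"]
    return 100.0 * weighted / (FREQ["BC"] + FREQ["BCR"])


p_low = overall_accuracy(one_bit=0.797, two_bit=0.834, bctr=0.827)
p_high = overall_accuracy(one_bit=0.952, two_bit=0.966, bctr=1.0)
print(round(p_low, 1), round(p_high, 1))          # 82.6 94.6
print(round(bc_bcr_accuracy(0.797, 0.834), 1),
      round(bc_bcr_accuracy(0.952, 0.966), 1))    # 83.0 96.5
```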
Footnote 1: BXH does not appear because of its comparatively low frequency.

Branch Hit Ratio: As stated previously, an acceptable hit ratio for a high-performance computer, i.e., 90% or greater, would require a 1K-entry BTB. The question arises as to what size CIC is required to provide comparable performance. The following reasoning is used to derive a first-order approximation.

Assume for simplicity that all ESA/370 instructions are two halfwords in length. Then assume, as before, that one instruction in four is a branch (a fair assumption for TSO-XA, since the sum of the "% of Trace" column is 25.1%). It follows that in a 16KByte or 4KWord CIC there will be 1K branch instructions, or, expressed differently, a 16KByte CIC also functioning as a BTB will provide approximately the same performance behavior as a 1K-entry BTB.
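The first-order estimate amounts to the following arithmetic, sketched here with an illustrative function name (`branches_resident`) and with the paper's simplifying assumptions as default parameters.

```python
def branches_resident(cache_bytes: int, bytes_per_instruction: int = 4,
                      branch_fraction: float = 0.25) -> int:
    """First-order count of branch instructions held in a full instruction cache."""
    instructions = cache_bytes // bytes_per_instruction
    return int(instructions * branch_fraction)


print(branches_resident(16 * 1024))  # 16 KByte CIC -> 1024 branches
```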
Prediction Bit Manipulation: Manipulation of the branch prediction bits in the combined CIC-BTB consists of two parts: initializing the bits when branch instructions are processed by the ICU, and updating the bits as required. Initialization can be as simple as setting all prediction bits to predict the branch outcome as taken. According to Lee and Smith [3], this will be correct 60 to 70% of the time. More complicated initializations can be made based on the opcodes, M1, and R2 fields. Doing this will provide more correct initial guesses, but may impose additional ICU hardware complexity.

Updating the branch prediction bits depends on both the number of bits per branch instruction and the meaning assigned to the bits when there are two or more bits available for prediction. A single bit of prediction functions as a simple history bit, i.e., the bit is one if the branch was taken last time and zero if it was not. The branch is predicted to go in the same direction as last time. If this prediction is incorrect, the bit must be updated to reflect the actual direction of the branch. If two or more bits are available for branch prediction, the meaning assigned to the bits may vary, as will the frequency of bit updating. Lee and Smith [3] propose several state machine schemes, some requiring updating the bits only when the prediction turns out to be incorrect, and others requiring more frequent updates. The choice of which scheme to use is implementation dependent and will not be discussed further here.
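The single-bit scheme can be sketched as follows. This is an illustration only: a dictionary keyed by branch address stands in for the tag storage in the CIC, the class and function names are hypothetical, and the initial value follows the simple "initialize to taken" policy mentioned above.

```python
class OneBitPredictor:
    """Single history bit per branch: 1 = taken last time, 0 = not taken."""

    def __init__(self):
        self.bits = {}  # branch address -> history bit (stand-in for the CIC tag)

    def predict(self, branch_address: int) -> bool:
        # First sighting: initialize the bit to "taken", as the ICU might
        # do when the line is brought into the CIC.
        return bool(self.bits.setdefault(branch_address, 1))

    def resolve(self, branch_address: int, taken: bool) -> bool:
        """Return True if the prediction was correct; update the bit only on a miss."""
        correct = self.predict(branch_address) == taken
        if not correct:
            self.bits[branch_address] = int(taken)
        return correct


def accuracy(predictor, outcomes, branch_address=0x1000):
    """Fraction of correctly predicted outcomes for one branch."""
    hits = sum(predictor.resolve(branch_address, t) for t in outcomes)
    return hits / len(outcomes)


# A loop-closing branch: taken nine times, then falls through, run twice.
trace = ([True] * 9 + [False]) * 2
print(accuracy(OneBitPredictor(), trace))  # 0.85
```

The example trace shows the familiar weakness of one history bit on loops: each loop exit costs a misprediction, and so does the first iteration of the next pass.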
4 Branch Processing
From an execution standpoint, branches are divided into two categories. Those in the first category perform some general purpose register (GPR) manipulation. These require an execution cycle (E-cycle) in the execution unit, and may also be compounded with other instructions. Branches in the second category do not manipulate GPRs, thus no E-cycle is required, and they are not dispatched to the execution unit. These potentially execute in zero cycles. There are only two instructions in the second category, BC and BCR, but they comprise approximately 19% of the dynamic instruction frequency. Branches in the first category account for approximately 6% of the dynamic frequency.

The Branch Unit in this design strives to provide an uninterrupted supply of instructions to the instruction decode (ID) stage of the execution unit and the branch decode (BD) stage of the branch unit. Furthermore, it removes Branch on Condition (BC/BCR) instructions entirely from the execution pipeline by generating branch target addresses, fetching the predicted instruction stream, and performing the actual branch test in parallel with the execution of other instructions in the execution unit. Consequently, with two instructions executing in the execution unit and one more simultaneously executing in the branch unit, a peak throughput of three instructions per cycle can be achieved. The Branch Unit comprises the following major functions (see Figure 1):

1. CI Align: Parses instruction text coming from the Compound Instruction Cache (CIC) into individual simple or compound instructions to be loaded into registers in the Compound Instruction Queue.
2. Instruction Queue (IQ): A group of registers containing the sequence of instructions predicted to be needed for execution by the Execution Unit and/or the Branch Unit.
3. Branch Address Generator: Contains a copy of the registers needed to create branch addresses and an adder dedicated to instruction address generation.
4. Branch Queue (BQ): Holds branch-type instructions which have been processed by the Branch Address Generator until the actual branch test can be made at the proper point in the execution of the branch instruction.
5. Sequential Address Generator: Generates the addresses needed for prefetching sequential instructions. This is done using a prefetch address register and a prefetch offset amount. The offset amount is added to the current prefetch address register contents for sequential prefetching.
6. Branch Test: For BC and BCR instructions, generates a branch condition by applying the mask to the condition code. Other branch-type instructions provide the branch result to the Branch Test unit by comparisons performed in the execution unit. The actual branch direction is compared to the predicted branch direction.
Figure 1: Branch unit high-level block diagram (blocks: CI Align, IQ, Sequential Address Generator, Branch Address Generator, BQ, and Branch Test; paths to the E-Unit and the CIC, and branch conditions from the E-Unit)

Instructions are fetched from the CIC as sequential bytes of text, the length of the fetch being several instructions, so that a queue of instructions to be processed by the branch unit and the execution unit may be created. For purposes of illustration, assume the length of the fetch to be up to sixteen bytes, although other lengths can be used depending on the particular implementation.

The queue is created in several registers called the Instruction Queue (IQ Regs) (see Figure 3). Instruction text coming from the CIC is separated into individual or compound instructions in the CI Align unit and routed to the appropriate IQ Reg. Instructions are routed directly to the first stage of the pipeline if the queue is empty and the first stage of the pipeline is available to receive them. All branch-type instructions are processed by the branch unit to create the branch address and initiate any necessary fetches to the CIC. Additionally, those branches which manipulate registers are also processed by the execution unit. Such instructions include Branch on Count (BCT/BCTR), which decrements a GPR; Branch and Link (BAL/BALR) and Branch and Save (BAS/BASR/BASSM), which save link information in a GPR; Branch on Index (BXLE/BXH), which increments a GPR by an increment amount in another GPR and compares the incremented value to a third GPR; Branch and Set Mode, which modifies a GPR; and Execute (EX), which is handled as a special case. Instructions such as BC and BCR, which only test the condition code, do not get processed by the execution unit and are skipped when loading the CIR from the IQ registers.

Figure 2: Branch unit detail (Part 1 of 2) (labels recoverable from the figure: text from the CIC entering a Left Shifter and Compound Instruction Align, Instruction Queue registers IQ0 to IQ3, and the CIR and BIR, with the CIR feeding the E-Unit)

Figure 3: Branch unit detail (Part 2 of 2) (labels recoverable from the figure: PFAR, GR Copy, BIAR, INC, Branch ILC, BR TGT ADDR, DCD, Refetch ADDR, IAMUX to the CIC, RFAMUX, OPMUX, Branch Queue entries OP0 to OP3, BIA0 to BIA3, and TGT0 to TGT3, and Branch Test fed by the condition code and branch conditions from the E-Unit)

CI Aligner: The sixteen bytes of instruction text fetched from the CIC are shifted left in the CI Aligner according to the four low-order bits of the address so that the beginning of the first instruction is always in the same place. This provides a reference point to begin splitting the text into individual instructions (individual instructions in this context also means compound instructions). Instruction boundaries can be identified by decoding the instruction length at the shifter output. The leftmost instruction fetched is gated to the first available Instruction Queue (IQ) register, and subsequent instructions are gated in order to the remaining available IQ registers. Leftover text is either discarded or placed into temporary storage in an Instruction Buffer. In either case, the next sequential fetch will be to the address immediately following the last instruction placed into the queue.

Instructions may bypass the IQ and be directly loaded into the pipeline of the execution unit and/or the branch unit. This happens when the queue is empty, which may occur for a CIC line miss or occasionally for a taken branch. Mispredicted branches always cause bypass to occur, since the contents of the queue behind the branch are invalid. If an instruction is bypassed directly into the Compound Instruction Register (CIR), it is not placed into an IQ register. However, instructions loaded into the Branch Instruction Register (BIR) are also placed into an IQ register provided there are valid instructions in the queue. This allows the execution pipeline to observe the branch instructions in the proper conceptual sequence and set a control latch when they are encountered, as will be described later. The purpose of this action is to synchronize branch testing at the correct point in the instruction stream.

It is also possible to load the same instruction into the BIR and the CIR. This happens whenever a branch-type instruction having an E-cycle is encountered. The instruction enters the branch unit to permit the branch address to be calculated; it enters the execution unit to perform any GPR updates that may be required. Branch-type instructions may enter the branch unit before or at the same time as they enter the execution unit, but never afterward.
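The shift-and-split step performed by the CI Aligner can be sketched in software as follows. This is a simplified illustration, not the hardware design: it ignores tags and compounding, the function names are hypothetical, and the length decode relies on the System/370 convention that the two high-order opcode bits give the instruction length (00 for one halfword, 01 or 10 for two, 11 for three), which is taken from the architecture rather than from this paper.

```python
def instruction_length(first_byte: int) -> int:
    """Length in bytes, decoded from the two high-order opcode bits."""
    return {0b00: 2, 0b01: 4, 0b10: 4, 0b11: 6}[first_byte >> 6]


def align_and_split(fetch: bytes, address: int):
    """Shift out the bytes ahead of `address`, then split on instruction boundaries.

    Returns (instructions, leftover). The leftover is a trailing partial
    instruction that a real design would discard or hold in a buffer.
    """
    if len(fetch) != 16:
        raise ValueError("this sketch assumes a sixteen-byte fetch")
    text = fetch[address & 0xF:]   # left shift by the four low-order address bits
    instructions = []
    pos = 0
    while pos < len(text):
        length = instruction_length(text[pos])
        if pos + length > len(text):
            break                  # instruction continues in the next fetch
        instructions.append(text[pos:pos + length])
        pos += length
    return instructions, text[pos:]


# Four bytes preceding the fetch address, then BCR, BC, and a six-byte instruction.
fetch = bytes.fromhex("00000000" "07fe" "47f01000" "d20310002000")
instrs, leftover = align_and_split(fetch, 0x2004)
print([i.hex() for i in instrs], leftover.hex())
```

In this example the three instructions come out in order and nothing is left over; starting a six-byte instruction within the last four bytes of the fetch would instead leave it in `leftover`.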
Instruction Queue (IQ): The IQ registers are loaded and unloaded under the control of pointers. The load pointer indicates the next available IQ register to receive an instruction from the CI Align unit. The unload pointer controls which IQ register is to be gated into the CIR. A separate unload pointer controls the gating of instructions into the BIR, since that operation is not synchronous with the first. If the load pointer advances such that it points to either of the registers designated by the unload pointers, prefetching stops because the IQ is full. Conversely, if either unload pointer advances such that it points to the same register indicated by the load pointer, the queue is considered empty. The optimum number of registers in the IQ varies with the particular instruction mix being executed, and also with the number of cycles required to fetch instructions from the cache. Fewer taken branches and a small number of fetch cycles require fewer IQ registers to keep the Execution unit busy. In this case, the choice has been made to implement four IQ registers, but the number could be reduced to three without major loss in performance.
Branch Address Generator: The algorithm used by the branch unit is very simple in concept. It searches the IQ for the first branch-type instruction and loads it into the BIR, which is the instruction decode register in the first stage of the branch unit. The branch address is unconditionally generated, and the branch prediction bits from the CIC are used to determine whether the branch address will be sent to the CIC. If a branch is predicted to be taken, the branch address is sent to the CIC and a fetch is begun. The fetched instruction text is divided on instruction boundaries as previously described in the CI Aligner and placed in the IQ following the branch just executed. The address of the instruction sequentially following the branch is saved in the Branch Queue (BQ) in case the branch was wrongly predicted. If the branch is predicted not to be taken, then the branch address is generated as before, but is only saved in the BQ. The branch unit sends the address of the instruction following the last valid instruction in the IQ to the CIC to initiate a sequential prefetch in case a location becomes available in the IQ on the next cycle. In other words, the branch unit attempts to fill the IQ with sequential instructions whenever it fails to encounter a predicted taken branch. In this manner, the branch unit creates a single instruction stream to be processed by the execution unit and itself, based on a predicted branch path.
Branch Queue (BQ): Clearly, no branch prediction scheme will always predict branch directions correctly. Indeed, incorrect predictions will occur approximately 5 to 17% of the time. The branch unit recovers from a wrong prediction by selecting the appropriate address from the BQ and issuing an instruction fetch request to the CIC. The incorrect instruction sequence is flushed from the pipeline and the IQ registers and replaced with the newly fetched instructions. A penalty in the form of pipeline bubbles or stalls occurs at this point while the correct instruction stream is fetched and entered into the pipeline. In this design, this amounts to four cycles lost, although the number may vary in other implementations.

This branch unit only creates a single instruction stream. For reasons of simplicity, it does not attempt to maintain or conditionally execute instructions sequentially prefetched after a branch is predicted taken. This decision was further justified on the basis of a single cycle access to the CIC; thus no advantage was derived by saving the aforementioned sequential instructions, given the decision not to start processing them. However, other implementations having a longer cache access time, or those with multiple conditional instruction stream execution pipelines, may choose to maintain alternate instruction streams and even begin processing them in an attempt to reduce or eliminate the penalty associated with incorrectly predicted branches.

The Branch Queue (BQ) holds branch instructions after they have been decoded in the BIR and any necessary prefetches performed. Branch instructions enter the BQ on the cycle following the branch address calculation, and each BQ entry consists of the decoded branch instruction (the first halfword of the instruction), the address of the branch instruction, and the address of the branch target. The branch target address is used to recover from a taken branch which was predicted not taken. The address of the branch instruction itself is saved to allow the branch prediction bits to be updated whenever required. It is also used to refetch the sequential instruction address for not-taken branches that were predicted to be taken. In this case, the instruction length of the branch is added to the address of the branch before refetching the sequential instruction stream. For BAL and BALR, it also provides the link address.

Branch instructions are removed from the queue in the order they enter. The number of entries in the queue should be large enough to avoid pipeline stalls due to the BQ becoming full. There are four such registers in the present design. A pointer is used to indicate which register is to be loaded next from the branch unit. As an entry is made, the pointer is updated to the next register, and loading of the registers in a circular fashion occurs until the pointer designates a register which has yet to be unloaded, in which case a branch pipeline stall will occur if the branch unit attempts to load the register. A separate pointer designates the register to be unloaded, and it, too, operates in a circular fashion. Updating of the unload pointer occurs as the entries are selected for testing the actual branch condition.
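The circular load and unload behavior of the BQ, together with the choice of refetch address after a misprediction, can be sketched as below. Several details are assumptions made for illustration and are not taken from the paper: the class and field names, the explicit `length` and `predicted_taken` fields stored alongside the three items the text lists, and the occupancy counter used to tell a full queue from an empty one.

```python
from dataclasses import dataclass


@dataclass
class BQEntry:
    opcode_halfword: int   # first halfword of the decoded branch
    branch_address: int    # address of the branch instruction itself
    target_address: int    # generated branch target address
    length: int            # instruction length in bytes (assumed stored here)
    predicted_taken: bool  # predicted direction (assumed stored here)


class BranchQueue:
    """Four registers loaded and unloaded in circular fashion."""

    SIZE = 4

    def __init__(self):
        self.regs = [None] * self.SIZE
        self.load_ptr = 0
        self.unload_ptr = 0
        self.count = 0     # occupancy; an assumption to separate full from empty

    def load(self, entry: BQEntry) -> bool:
        """Return False (branch pipeline stall) if the target register is not yet unloaded."""
        if self.count == self.SIZE:
            return False
        self.regs[self.load_ptr] = entry
        self.load_ptr = (self.load_ptr + 1) % self.SIZE
        self.count += 1
        return True

    def unload(self) -> BQEntry:
        """Remove the oldest entry when it is selected for branch testing."""
        if self.count == 0:
            raise IndexError("branch queue is empty")
        entry = self.regs[self.unload_ptr]
        self.regs[self.unload_ptr] = None
        self.unload_ptr = (self.unload_ptr + 1) % self.SIZE
        self.count -= 1
        return entry


def refetch_address(entry: BQEntry, actually_taken: bool):
    """Address to refetch after a misprediction, or None if the prediction held."""
    if actually_taken == entry.predicted_taken:
        return None
    if actually_taken:
        return entry.target_address             # predicted not taken, was taken
    return entry.branch_address + entry.length  # predicted taken, fell through
```

A fifth `load` without an intervening `unload` returns `False`, which corresponds to the branch pipeline stall described above.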
Branch Test: A branch test consists of performing the architected branch test and comparing the branch direction with the predicted direction saved for that instruction in the BQ entry. If they agree, the entry is removed from the BQ and the instruction is completed. When the prediction does not match the actual branch decision, an instruction fetch request is issued to the CIC for the correct instruction address, and the pipeline is flushed of incorrect instructions.

The architected branch test for BC and BCR merely consists of testing the condition code against the mask field in the instruction. Thus, no E-cycle in the execution unit is needed, and the branch appears to execute in zero time, as seen by the E-unit. BAL, BALR, BAS, BASR, BCT, BCTR, BXLE, BXH, BASSM, and BSM all require an E-cycle to modify a GPR value. EX reads a GPR and modifies the fetched instruction text, conceptually requiring an E-cycle. The branch unit performs the branch test in the cycle following the E-cycle for that instruction and, from that point on, operates as described above for BC and BCR.

BC and BCR are not compounded with other instructions because they require no E-cycle, and would prevent some otherwise possible compounding from occurring. The execution unit merely skips over them as they are encountered in the IQ registers, and loads the next non-BC/BCR instruction. All other branches may be compounded, and also enter the execution unit. Thus, it is possible for the branch unit to be required to do two branch tests in a single cycle. Consider the case where BCT precedes AR, and is compounded with it. A BC instruction follows AR and tests its condition code. The branch unit removes two entries from the BQ simultaneously, the first corresponding to the BCT, the second corresponding to the BC. Both branch tests are performed simultaneously, and the results posted in the correct order. That is, if the BCT is wrongly predicted, the instruction stream is refetched using its address entry in the BQ. If the BCT prediction was correct, and the BC prediction was wrong, then the address corresponding to the BC entry is used to refetch the instruction stream. If both branches were correctly predicted, then both entries are removed from the BQ in the same cycle.

Matching of instructions in the BQ with instructions being processed through the execution unit is accomplished with two control bits, B1 and B2. B1 is set whenever the execution unit loads a branch instruction other than BC/BCR into the CIR. It propagates through the execution unit pipeline with the instruction. When B1 reaches the execution stage of the pipeline, it is held there for the number of cycles required to complete the execution of the compound instruction. When the execution completes, control logic in the execution unit asserts an instruction completion signal, which is ANDed with B1 to cause the unload pointer in the BQ to be incremented to point to the next entry to be unloaded, and causes a branch test to be performed on the next cycle. Thus, the branch test for those instructions having an E-cycle is synchronized with the execution of the correct instruction. Since BC and BCR are not loaded into the execution unit, they utilize the B2 bit.
B2 is set whenever the mechanism which loads instructions from the IQ registers into the CIR of the execution unit skips over BC or BCR. Thus, B2 marks the place of the BC/BCR instruction in the correct order in the execution unit, even though the instruction itself does not enter the execution unit. B2 advances through the pipeline just as B1 does and causes a similar action in the branch unit when it reaches the E-cycle in the pipe. B1 and B2 are independent in the sense that either or both bits may be on (or both may be off) at any stage in the pipeline. When both are on, they cause two entries to be unloaded from the BQ as previously described.
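The unload behavior controlled by B1 and B2 can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware design: the names `BQEntry`, `BranchQueue`, `resolve`, and `refetch_address` are hypothetical, the queue is assumed to be cleared on a misprediction (the text says only that the pipeline is flushed), and `bc_taken` assumes the ESA/370 convention that mask bit values 8, 4, 2, 1 select condition codes 0, 1, 2, 3.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Optional, Tuple


def bc_taken(mask: int, cc: int) -> bool:
    """BC/BCR test: mask bit values 8, 4, 2, 1 select condition codes 0, 1, 2, 3."""
    return bool((mask >> (3 - cc)) & 1)


@dataclass
class BQEntry:
    name: str               # illustrative label only (hypothetical field)
    predicted_taken: bool   # predicted direction saved in the BQ entry
    refetch_address: int    # address to fetch from if the prediction was wrong


class BranchQueue:
    """Hypothetical FIFO model of the BQ; the oldest entry is at the left."""

    def __init__(self) -> None:
        self.entries: Deque[BQEntry] = deque()

    def load(self, entry: BQEntry) -> None:
        self.entries.append(entry)

    def resolve(self, b1: bool, b2: bool,
                b1_taken: bool = False, b2_taken: bool = False) -> Optional[int]:
        """Unload one entry per asserted bit, B1 (older) before B2.

        Returns the refetch address of the oldest mispredicted branch,
        or None if every tested prediction was correct.
        """
        outcomes: List[bool] = []
        if b1:
            outcomes.append(b1_taken)
        if b2:
            outcomes.append(b2_taken)
        # Both entries leave the queue in the same "cycle".
        unloaded: List[Tuple[BQEntry, bool]] = [
            (self.entries.popleft(), taken) for taken in outcomes
        ]
        # Results are posted in program order: the older branch wins.
        for entry, taken in unloaded:
            if entry.predicted_taken != taken:
                self.entries.clear()  # assumption: younger entries are flushed
                return entry.refetch_address
        return None


# BCT compounded with AR, followed by a BC that tests AR's condition code.
bq = BranchQueue()
bq.load(BQEntry("BCT", predicted_taken=True, refetch_address=0x1000))
bq.load(BQEntry("BC", predicted_taken=False, refetch_address=0x2000))
refetch = bq.resolve(b1=True, b2=True,
                     b1_taken=True, b2_taken=bc_taken(0b0010, cc=2))
print(hex(refetch))  # prints 0x2000: BCT was correct, BC was mispredicted
```

In this model a call with only `b2=True` unloads just the oldest entry, which matches the case where B1 is off and B2 is associated with the oldest branch.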
If B1 is on, it is associated with the oldest branch in the pipeline, and causes the oldest entry in the BQ to be accessed. If B2 is also on, then it is associated with the next oldest branch. If B1 is off, then B2, if on, is associated with the oldest branch.
Updating of the branch prediction bits in the tags is initiated by a request from the branch unit as required. For instructions having a single branch prediction bit, it is assumed that the bit represents the most recently taken path. If the prediction is correct, no update takes place. If the prediction is determined by the branch unit to be incorrect, then a request is issued to the instruction tag controls to set the bit to the opposite state. The effective address of the branch instruction itself is sent to the CIC addressing and translation logic to access the correct tag entry. The update of branch prediction bits for two-bit prediction is more complicated, but involves the same general actions as for one-bit prediction. The particular algorithm selected will determine when the bits in the tags require an update, but in any case the update is requested by the branch unit as previously noted.
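The update rules can be sketched as follows. The one-bit function follows the description above directly. The two-bit function is an assumption: the text does not fix a two-bit algorithm, so a saturating up/down counter is used here purely as one common choice. All function names are hypothetical.

```python
from typing import Tuple


def update_one_bit(bit: bool, actual_taken: bool) -> Tuple[bool, bool]:
    """One-bit scheme: the bit records the most recently taken path.

    Returns (new_bit, update_requested). A tag update is requested
    only when the prediction was wrong, and it inverts the bit.
    """
    if bit == actual_taken:
        return bit, False
    return not bit, True


def predict_two_bit(counter: int) -> bool:
    """Assumed encoding: counter values 2 and 3 predict taken."""
    return counter >= 2


def update_two_bit(counter: int, actual_taken: bool) -> Tuple[int, bool]:
    """Assumed two-bit saturating counter in the range 0..3.

    Returns (new_counter, update_requested). Unlike the one-bit case,
    a correct prediction can still change the stored bits.
    """
    if actual_taken:
        new_counter = min(counter + 1, 3)
    else:
        new_counter = max(counter - 1, 0)
    return new_counter, new_counter != counter
```

Under this assumed counter, a correctly predicted taken branch in state 2 still requests an update to state 3, which shows why the choice of algorithm determines when the tag bits must be rewritten.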
5 Summary
The motivation for using branch prediction in superscalar and SCISM processors has been described. It has been shown that branch prediction may be economically provided by utilizing tag bits in a SCISM processor design instead of separate branch prediction bits in the cache or a Branch Target Buffer. The idea of zero-cycle conditional branch execution was also proposed as a means of increasing instruction processing throughput. A description of an ESA/370 SCISM processor branch unit implementation was given to demonstrate the feasibility of implementing these concepts.
References
[1] Kogge, P. M., The Architecture of Pipelined Computers, McGraw-Hill, New York, NY, 1981.
[2] Smith, J. E., "Dynamic Instruction Scheduling and the Astronautics ZS-1," IEEE Computer, July 1989, pp. 21-35.
[3] Lee, J. K. F., Smith, A. J., "Branch Prediction Strategies and Branch Target Buffer Design," IEEE Computer, January 1984.
[4] Riseman, E. M., Foster, C. C., "The Inhibition of Potential Parallelism by Conditional Jumps," IEEE Trans. Computers, December 1972, pp. 1405-1411.
[5] Smith, M. D., Johnson, M., Horowitz, M., "Limits on Multiple Instruction Issue," Proceedings of ASPLOS III, ACM, 1989, pp. 290-302.
[6] Jouppi, N. P., Wall, D. W., "Available Instruction-Level Parallelism for Superscalar and Superpipelined Machines," Proceedings of ASPLOS III, ACM, 1989, pp. 272-282.
[7] Chan, S., Horst, R., "Building Parallelism into the Instruction Pipeline," High Performance Systems, Vol. X, No. 12, December 1989, pp. 53-60.
[8] Smith, J. E., "A Study of Branch Prediction Strategies," Proceedings of the 8th Ann. Symp. Computer Arch., May 1981.
[9] Ditzel, D. R., McLellan, H. R., Berenbaum, A. D., "The Hardware Architecture of the CRISP Microprocessor," Proceedings of the 14th Ann. Intern. Symp. Computer Arch., June 1987, pp. 309-319.
[10] Blaner, B., Vassiliadis, S., A Hardware Preprocessor for Instruction-Level Parallel Processors, IBM Corp., Endicott, NY, January 1992, no. TR 01.C208.
[11] Bandyopadhyay, S., Begwani, V. S., Murray, R. B., "Compiling for the CRISP Microprocessor," Proceedings of COMPCON, Spring 1987, pp. 96-100.
[12] Vassiliadis, S., Blaner, B., Eickemeijer, R. J., "SCISM: A Scalable Compound Instruction Set Machine," IBM Journal of Research and Development, Vol. 38, No. 1, January 1994, pp. 59-77.
[13] Vassiliadis, S., Philips, J., Blaner, B., "Interlock Collapsing ALUs," IEEE Trans. Computers, Vol. 42, July 1993, pp. 825-839.
[14] Philips, J., Vassiliadis, S., "High Performance 3-1 Interlock Collapsing ALUs," IEEE Trans. Computers, Vol. 43, March 1994, pp. 257-268.