Validation of Speculative and Out-of-order Execution Microarchitecture¹

Appeared in the International Workshop on Microprocessor Test and Verification, Washington, DC, October 1998

Noppanunt Utamaphethai, R.D. (Shawn) Blanton and John Paul Shen Department of Electrical and Computer Engineering Carnegie Mellon University Pittsburgh, PA 15213 fnau,blanton,[email protected]

Abstract

We validate speculative and out-of-order execution microarchitecture using an ATPG-like methodology. The validation methodology uses FSM models derived from microarchitecture specifications. Complete transition tours are generated from the FSM models to obtain a high-level test sequence. Small assembly sequences associated with each FSM transition are used to translate FSM testing sequences into simulatable test programs. The methodology is demonstrated on the speculative and out-of-order execution mechanisms of the PowerPC 604. The effectiveness of our sequences is compared with some real programs by measuring transition coverage. Simulation results show that all targeted FSM transitions are covered by our sequences. Real programs can achieve the same coverage for some portions of the targeted functionality but use 1000X more instructions.

1 Introduction

Based on previous predictions and actual performance of microprocessors in the last decade, microprocessors in 2006 will most likely operate at 4GHz with a 500 SPECint95 rating [1]. One of the key enabling technologies is the breakthrough in microarchitecture techniques. Many novel microarchitectures have been introduced to relax program semantics and better exploit instruction-level parallelism, resulting in a higher level of performance. However, these aggressive and complex microarchitecture features substantially increase the design complexity. One of the major technological barriers to be resolved before a microprocessor of such high complexity can be realized is the validation of the design itself.

¹ This research effort is sponsored by the Semiconductor Research Corporation under contract DC068.070.

Correcting design errors detected late in the development cycle has a high cost; therefore it is crucial that high-level design specifications are correct. In particular, validating microarchitecture specifications is critical since they are defined early in the design cycle and involve many complex control behaviors such as branch prediction, dynamic register renaming and pipeline interlock. Microarchitecture validation is also complicated by the fact that these specifications are hard to create.

Formal verification [2, 5, 10] has been the focus of many current approaches to detect design errors. It provides a rigorous means to mathematically prove the correctness of a design. Since simulation is a common tool used by designers, it is logical for design validation to rely on the simulation of test stimuli on the simulatable model of the design rather than proving mathematical expressions that abstractly describe the design. As a result, design validation is a less rigorous but more practical approach to uncover design bugs.

Our previous work [14] presents a systematic approach for validating superscalar microarchitecture specifications via simulations. Our methodology involves direct derivation of microarchitecture models in terms of FSMs from the microarchitecture specifications. High-level test sequences are then derived from the FSMs which either cover every transition (a transition tour) or apply a checking sequence [9]. Specific instruction sequence synthesis techniques are then developed to obtain test sequences of user-set confidence levels. We have applied our approach to the branch prediction mechanism of a real superscalar processor, the PowerPC 604 [14]. Microarchitecture specifications of the PowerPC 604 described in the MW framework [3, 6, 7] are used for simulation. Simulation results show that 100% coverage of the targeted functionality is achieved using a very small number of instructions. Real programs that have orders of magnitude more instructions can achieve the same coverage for only some portions of the target functionality.

The performance enhancing features used in many modern microarchitectures can be viewed as "small" control units (FSMs) that read and write buffer entries. For example, branch history in the form of branch direction (taken or not taken) and target addresses are stored in a table (buffers). Branch history is used (read) to make predictions for encountered branch instructions and updated (written) when the actual branch outcome is resolved. Our validation approach is to focus on buffers, determine their functionality, model the operation of each buffer entry at the microarchitecture level using an FSM and rigorously generate instruction sequences that systematically exercise the FSM.

In this paper, we apply this methodology to the speculative and out-of-order execution mechanisms of superscalar processors. We continue to use the PowerPC 604 as a research vehicle to explore this approach. Speculative and out-of-order execution mechanisms in the 604 consist of reservation stations, rename buffers and a reorder buffer. An FSM model is derived for each of these three components. A set of instructions associated with each FSM transition is obtained and concatenated to form a test program which exercises the functionality of each component by performing a transition tour. We report FSM transition coverage of each FSM using our methodology and make comparison with some real programs.

The rest of this paper is organized as follows. Section 2 describes the PowerPC 604's implementation and the FSM models of speculative and out-of-order execution. Section 3 describes how high-level test sequences for validating the three components are derived from the FSM models.
There we also show how these sequences are transformed into simulatable instruction sequences. Section 4 presents the simulation results and compares the FSM transition coverage from our generated sequences to the coverage of real programs. Finally, in Section 5 we summarize our work and present directions for future work.

2 Speculative and Out-of-order Execution

Superscalar processors try to achieve high instruction throughput by allowing concurrent execution of instructions in multiple pipelines. However, simply adding more hardware resources and widening the pipeline cannot exploit instruction-level parallelism. A number of techniques must be used to relax inter-instruction dependency and take advantage of existing instruction parallelism.

Superscalar processors continuously fetch and issue instructions based on the assumption that branches are correctly predicted and exceptions do not occur. In most cases, branches are correctly predicted and recoveries from mis-speculation are infrequent. Speculation allows instruction execution to proceed without waiting for previous instructions to complete, thereby achieving greater instruction throughput.

With in-order execution, the frontend of the pipeline stops supplying instructions whenever a decoded instruction creates a resource conflict or has a true or output dependency. Out-of-order execution bypasses this dependency by utilizing a lookahead capability to find independent instructions to execute. An instruction window is used to buffer instructions after the decode stage so that the processor can continue to decode instructions regardless of whether they can be executed immediately. Instructions are issued from the instruction window with little regard to the original program order.

The PowerPC 604 is used as a research vehicle to explore our validation methodology as applied to speculative and out-of-order execution. The PowerPC 604 serves our purpose because it has all the characteristics of modern superscalar processors. It implements speculative and out-of-order execution using rename registers, a reorder buffer and reservation stations, all of which are described in the following subsections.

2.1 Register Renaming

Register renaming [8] is a technique that improves parallelism by eliminating stalled cycles due to anti- and output dependencies. It avoids contention for a given register file location in the course of out-of-order execution by storing instruction results in temporary (rename) buffers. The PowerPC 604 uses three sets of rename buffers for three register files: a general purpose rename buffer (GRB) for the general purpose registers (GPR), a floating-point rename buffer (FRB) for the floating-point registers (FPR) and a condition code rename buffer (CRB) for the condition code registers (CR). Twelve rename registers are used for the GPRs and eight each for the FPRs and the CRs.

Figure 1 shows the FSM model for the operation of each rename buffer entry. (Note, the model is valid whether the rename buffer entry is from the GRB, FRB or CRB.) An entry is Free until the dispatch unit allocates an entry for an instruction in the dispatch stage. This occurs if an instruction modifies any register. The entry remains allocated until the instruction finishes. There are two states for an allocated entry. At the time of renaming, each newly allocated rename entry will always hold the most recent (MR) value for the renamed register, denoted by the MR Alloc state of Figure 1. If a rename entry is allocated to a register which is then later renamed by another instruction, the previously allocated entry will no longer hold the most recent value and will therefore transition from the MR Alloc state to the NonMR Alloc state. Once the instruction finishes, the content of the rename entry becomes valid, which causes a transition from MR Alloc (NonMR Alloc) to MR Valid (NonMR Valid). The FSM stays in the valid state until the result is written to the register file (WB transition) or a prior instruction causes an exception that requires all subsequent instructions to be discarded (discard transition).

Figure 1: The FSM model of a rename buffer entry. (States: Free, MR Alloc, NonMR Alloc, MR Valid and NonMR Valid. Transition labels: dispatch, stale, finish, WB and discard.)
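The rename buffer entry model can be written down as a small transition table. The following Python sketch is illustrative only and is not the authors' tooling. The state and event names follow the text and the labels of Figure 1; the stale edge between the two valid states and the discard edges out of the two allocated states are assumptions read off the figure labels, since the text does not spell them out. The helper name `run` is hypothetical.

```python
# Transition table for one rename buffer entry.
# Keys are (state, event) pairs; values are the next state.
RENAME_FSM = {
    ("Free", "dispatch"): "MR Alloc",
    ("MR Alloc", "stale"): "NonMR Alloc",
    ("MR Alloc", "finish"): "MR Valid",
    ("NonMR Alloc", "finish"): "NonMR Valid",
    ("MR Valid", "stale"): "NonMR Valid",    # assumed from the second "stale" label
    ("MR Valid", "WB"): "Free",
    ("NonMR Valid", "WB"): "Free",
    ("MR Alloc", "discard"): "Free",         # assumed: discard from allocated states
    ("NonMR Alloc", "discard"): "Free",      # assumed: discard from allocated states
    ("MR Valid", "discard"): "Free",
    ("NonMR Valid", "discard"): "Free",
}


def run(events, state="Free"):
    """Apply a list of events to one entry and return every state visited."""
    visited = [state]
    for event in events:
        key = (state, event)
        if key not in RENAME_FSM:
            raise ValueError(f"no transition {event!r} from state {state!r}")
        state = RENAME_FSM[key]
        visited.append(state)
    return visited


# An entry that is renamed again before its instruction finishes.
trace = run(["dispatch", "stale", "finish", "WB"])
print(trace)
```

Here the entry passes through MR Alloc, NonMR Alloc and NonMR Valid before returning to Free, which is the path taken when a later instruction renames the same register.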

2.2 Reorder Buffer

Out-of-order execution allows independent instructions to be executed in an order that is different from the original program order. However, out-of-order instructions must complete in program order to ensure precise exception handling. Pipelines of superscalar processors can typically be divided into an in-order frontend, an out-of-order execution core and an in-order backend. During the last stage of the in-order frontend, an entry is allocated for an instruction in a reorder buffer [12]. The execution of the instruction is performed in the out-of-order core. When the instruction finally completes (that is, transfers its speculative state into permanent machine state in the in-order backend), the associated reorder buffer entry is deallocated in program order. A reorder buffer entry is, in essence, a placeholder for results that preserves program order.

The PowerPC 604 uses a 16-entry reorder buffer to implement in-order completion at the backend. A reorder buffer entry is allocated during instruction dispatch. When an instruction finishes execution, its status (completed with or without exception) is recorded in its corresponding reorder buffer entry. The completion unit retires up to four finished instructions per cycle from the reorder buffer and updates registers in the complete stage. The completion unit recognizes exceptions and, if necessary, discards any operations being performed on subsequent instructions in program order.

Figure 2 shows the FSM model for the operation of each reorder buffer entry. A reorder buffer entry is available for allocation if its FSM is in the Free state. The three upper states of Figure 2 (Alloc, Ex and Fin) indicate that an instruction is non-speculative while the lower three states (Alloc_S, Ex_S and Fin_S) indicate a speculative instruction. Depending on whether or not the instruction is speculative, the FSM transitions from the Free state to one of the two allocate states (Alloc or Alloc_S) when the instruction is dispatched to a reservation station. For both speculative and non-speculative paths, the FSM will transition from the allocate state to the execute state (Ex or Ex_S) and finally to the finish state (Fin or Fin_S) when the instruction executes and finishes, respectively. If an instruction is speculative, it must wait for the instruction it depends on to complete in order to traverse a non-speculative transition. Hence, an instruction in the lower three states (a speculative instruction) can transition to an upper state when it becomes non-speculative. Only a non-speculative instruction can complete (complete transition) and cause the processor's permanent state to change. However, both speculative and non-speculative reorder buffer entries can be discarded if they follow an instruction that causes an exception (discard transition).
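As a rough illustration of the reorder buffer entry model described above, the Python sketch below builds the transition table from the two three-state paths. It is a sketch under stated assumptions rather than the model used in the experiments: the event names `allocate_spec`, `execute` and `finish` are hypothetical labels chosen here, and the state spelling Alloc_S, Ex_S and Fin_S follows the figure labels.

```python
NONSPEC = ["Alloc", "Ex", "Fin"]        # upper (non-speculative) states
SPEC = ["Alloc_S", "Ex_S", "Fin_S"]     # lower (speculative) states


def build_rob_fsm():
    """Return a dict mapping (state, event) to the next state."""
    fsm = {
        ("Free", "allocate"): "Alloc",
        ("Free", "allocate_spec"): "Alloc_S",   # hypothetical event name
        ("Fin", "complete"): "Free",            # only non-speculative entries complete
    }
    for path in (NONSPEC, SPEC):
        fsm[(path[0], "execute")] = path[1]     # hypothetical event name
        fsm[(path[1], "finish")] = path[2]      # hypothetical event name
        for state in path:
            fsm[(state, "discard")] = "Free"    # any allocated entry can be discarded
    for spec, nonspec in zip(SPEC, NONSPEC):
        fsm[(spec, "non_speculative")] = nonspec
    return fsm


ROB_FSM = build_rob_fsm()


def step(state, event):
    """Return the next state, or None when the event is illegal in this state."""
    return ROB_FSM.get((state, event))


print(step("Fin_S", "complete"))          # a speculative entry cannot complete
print(step("Fin_S", "non_speculative"))   # it must first become non-speculative
```

The table makes the key restriction explicit: the complete event is defined only for the Fin state, so a speculative entry has to take a non-speculative transition before it can retire.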

Figure 2: The FSM model of a reorder buffer entry. (States: Free, Alloc, Ex, Fin, Alloc_S, Ex_S and Fin_S. Transition labels: allocate, non speculative, complete and discard.)

2.3 Reservation Stations

When instructions are dispatched to an appropriate functional unit, they are placed in reservation stations (RSs) before they are executed [13]. Instructions are buffered in the reservation stations of functional units even if their operands are ready. Reservation stations are useful because they allow instructions that do not have ready operands to progress deeper in the pipeline; this allows the frontend of the pipeline to process new instructions instead of stalling. The PowerPC 604 has distributed reservation stations, that is, each of the functional units has a dedicated two-entry buffer for two instructions of the proper type.

If a reservation station is available and an instruction can be dispatched, an entry in the reservation station will be allocated for the instruction. The following conditions prevent an instruction from being dispatched to a reservation station: (i) The required reservation station or its write port is full; (ii) The reorder buffer or its write port is full; (iii) An instruction break occurs and changes the program flow; (iv) No rename register is available.

When an entry in a reservation station is allocated, the value of each instruction operand is written into the reservation station entry. If the value is not yet available, the tag (a temporary hardware identifier created to identify the result) of the pending operand will be used instead of the actual value. Once all of the operands are available, the instruction is removed from the reservation station and execution begins.

Figure 3a shows the FSM model for each RS entry of a two-operand PowerPC instruction (an add instruction for example). Every RS entry starts as free and available for allocation. Hence, the starting state for an RS entry is the Free state shown in the center of the diagram of Figure 3a. When an instruction is in the dispatch stage and all dispatch conditions are met, an entry in the RS is allocated. Since the instruction requires two source operands, there are four possible allocation states for an RS entry:

Alloc 00: No source operands are available
Alloc 01: Only the right source is available
Alloc 10: Only the left source is available
Alloc 11: Both source operands are available

The FSM can transition from the Alloc 00, Alloc 01 and Alloc 10 states to another Alloc state when other operand(s) become available. An RS entry is deallocated if all operands are ready and the instruction is issued (issue transition); or if the entry is discarded due to an exception created by a prior instruction (discard transition).

Figure 3: The FSM models of a reservation station entry requiring (a) two source operands and (b) a single source operand. (States in (a): Free, Alloc 00, Alloc 01, Alloc 10 and Alloc 11, with transition labels none valid, 1 op. valid, 2 op. valid, issue and discard. States in (b): Free, Alloc v0 and Alloc v1, with transition labels allocate, op. valid, issue and discard.)
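The edges of the reservation station model can be enumerated mechanically from the operand-ready bits. The Python sketch below is a hedged illustration, not the authors' model files: the function name `rs_transitions` is hypothetical, allocation edges are all labeled `allocate` (the figure appears to annotate them with the number of valid operands instead), and for the single-operand case the states come out as Alloc 0 and Alloc 1 rather than the Alloc v0 and Alloc v1 names of Figure 3b.

```python
from itertools import product


def rs_transitions(n_operands=2):
    """Enumerate (state, event, next_state) edges for one RS entry.

    Each allocated state is named by its operand-ready bits, left operand
    first, so "Alloc 10" means only the left source is available.
    """
    bits = ["".join(b) for b in product("01", repeat=n_operands)]
    all_ready = "1" * n_operands
    edges = []
    for b in bits:
        edges.append(("Free", "allocate", f"Alloc {b}"))
        edges.append((f"Alloc {b}", "discard", "Free"))
    for src, dst in product(bits, repeat=2):
        # Operands only ever become available; a ready bit never clears.
        gained = sum(s == "0" and d == "1" for s, d in zip(src, dst))
        lost = any(s == "1" and d == "0" for s, d in zip(src, dst))
        if gained and not lost:
            edges.append((f"Alloc {src}", f"{gained} op. valid", f"Alloc {dst}"))
    # An entry issues only when every operand is ready.
    edges.append((f"Alloc {all_ready}", "issue", "Free"))
    return edges


for edge in rs_transitions(2):
    print(edge)
```

Under these assumptions the two-operand model has 14 edges and the one-operand model has 6, which gives a concrete count of the transitions a tour has to cover.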

3 Test Sequence Generation

The FSM models derived from the microarchitecture specifications are used to derive a high-level test sequence. This test consists of a sequence of FSM transitions that exercises the FSM in some prescribed way (such as complete or partial state tour, complete or partial transition tour and checking experiment [9]). An obvious tradeoff exists between sequence generation effort and the level of validation depending on the type of tour selected. We choose to use complete transition tours for testing our FSM models. Complete transition tours are widely used because they exercise a significant amount of FSM functionality while requiring little effort to create. Although a transition tour cannot guarantee total correctness, it is a proven technique for uncovering design errors [11].

Small sequences of PowerPC assembly instructions are used to translate FSM test sequences (i.e. transition tours) into a simulatable instruction sequence, that is, an assembly program. These small instruction sequences are called atomic sequences. For each FSM transition, an atomic sequence is derived. Execution of the atomic sequence instructions causes the associated FSM transition to be exercised. The atomic sequence structure is partitioned into two parts: an initialization subsequence and a single triggering instruction. The initialization subsequence places the processor into a machine state that makes it ready to traverse the transition associated with the atomic sequence. Execution of the triggering instruction then causes the traversal of the transition. Some atomic sequences do not require the initialization subsequence (the reorder buffer for example). Atomic sequences are derived from the microarchitecture and the ISA specifications. Once the atomic sequences are obtained, assembly-level test programs are created using Perl scripts to concatenate various atomic sequences in the order of the transition tour.
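The two steps described above, generating a complete transition tour and concatenating atomic sequences in tour order, can be sketched as follows. This is a minimal Python illustration and not the Perl scripts mentioned in the text. The tour generator is a simple greedy heuristic (walk to the nearest uncovered transition), which is one of several ways to build a tour and is not claimed to be the authors' algorithm or to produce a shortest tour. The example FSM is the single-operand reservation station model of Figure 3b, and the atomic sequences are placeholder comment lines, not real PowerPC code.

```python
from collections import deque


def transition_tour(fsm, start):
    """Return a list of (state, event) steps from `start` covering every edge.

    `fsm` maps (state, event) to the next state.
    """
    out_edges = {}
    for (state, event), nxt in fsm.items():
        out_edges.setdefault(state, []).append((event, nxt))
    uncovered = set(fsm)
    tour, state = [], start
    while uncovered:
        # Breadth-first search for the shortest path ending in an uncovered edge.
        queue, seen, path = deque([(state, [])]), {state}, None
        while queue and path is None:
            cur, steps = queue.popleft()
            for event, nxt in out_edges.get(cur, []):
                if (cur, event) in uncovered:
                    path = steps + [(cur, event)]
                    break
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [(cur, event)]))
        if path is None:
            raise ValueError("some transitions are unreachable from the current state")
        tour.extend(path)
        uncovered.difference_update(path)
        state = fsm[path[-1]]
    return tour


def assemble(tour, atomic):
    """Concatenate the atomic sequence of each transition in tour order."""
    program = []
    for key in tour:
        program.extend(atomic[key])
    return "\n".join(program)


# Single-operand reservation station entry (Figure 3b).
RS1_FSM = {
    ("Free", "allocate, none valid"): "Alloc v0",
    ("Free", "allocate, 1 valid"): "Alloc v1",
    ("Alloc v0", "op. valid"): "Alloc v1",
    ("Alloc v0", "discard"): "Free",
    ("Alloc v1", "issue"): "Free",
    ("Alloc v1", "discard"): "Free",
}

# Placeholder atomic sequences: an initialization line and a triggering line.
ATOMIC = {
    key: [f"# init subsequence for {key[0]} / {key[1]}",
          f"# triggering instruction for {key[1]}"]
    for key in RS1_FSM
}

tour = transition_tour(RS1_FSM, "Free")
program = assemble(tour, ATOMIC)
print(program)
```

A tour may revisit an already covered transition in order to reach an uncovered one, so the resulting program can contain more atomic sequences than the FSM has transitions.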

4 Simulation Results

The MW [6, 7] performance simulator is used for our simulation experiments for the PowerPC 604. Checking code has been added to the simulator to track and measure the level of FSM transition coverage achieved by the test program under simulation. Coverage is defined as the percentage of targeted transitions traversed. In this case, coverage is simply the ratio of the number of traversed FSM transitions to the total number of targeted FSM transitions.

In current industry practice, real and randomly-generated programs are typically used for validating hardware designs. We use the SPEC95 (both integer and floating point) benchmarks, which have over a billion instructions and an execution cycle count of 735M. We evaluate the effectiveness of our generated sequences by comparing the amount of transition coverage obtained from our test programs with that of the SPEC95 benchmarks. Figure 4 shows the number of instructions simulated in the SPEC95 benchmarks and our test programs.

Since the current version of MW is strictly trace-driven, it is not able to simulate the functionality associated with mis-speculated paths. As a result, some transitions in the FSM models are untrackable. The untrackable transitions are shown as bold arrows in Figures 1, 2 and 3. These transitions can only be traversed by executing instructions down speculative paths. MW is currently being modified into an execution-driven tool that is capable of simulating both speculative and non-speculative instructions.

Figure 5 shows overall coverage of the GRB, FRB and CRB rename buffers. Our test sequence is exhaustive in that each FSM model edge is traversed in every possible way. For the GRB, this means that each edge can be traversed using any one of the 32 different general purpose registers. Similarly, there are 32 possible ways to traverse an edge in the floating point rename entry since there are 32 different floating point registers. However, there are only 8 alternatives for a condition code rename entry since there are only eight condition registers. Figure 5 shows that none of the SPEC95 benchmarks achieve 100% coverage of the targeted functionality of the rename buffer. The benchmark programs from compress to vortex have low FRB coverage because these benchmarks are integer applications and therefore contain very few floating point instructions. However, our test programs easily achieve 100% coverage using (comparatively) very few instructions.

Figure 6 shows overall coverage for the reorder buffer. The targeted functionality for the reorder buffer is quite simple. Here, we ensure that each buffer entry is exercised by each of the six functional units SFX0, SFX1, CFX, FPU, LSU and BRU. All but three of the benchmark programs (go, li, and vortex) achieve 100% coverage.

Figures 7 and 8 compare coverage for the 1- and 2-operand reservation station FSMs, respectively, for four functional units SFX0, SFX1, CFX and FPU. The targeted functionality for each reservation station includes traversing each transition with all register operand possibilities. For the two-operand reservation stations, this means there are 1024 different alternatives for exercising an edge since there are 32 register choices for each of the two sources. The one-operand reservation stations have only 32 possible alternatives since there is only one source. The benchmark coverage of the one-operand reservation station is quite low, with the highest coverage achieved being about 40% for the program fpppp. The benchmark coverage is even lower for the two-operand reservation stations. No benchmark program achieves 20% coverage.
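The coverage metric and the counting of operand alternatives can be made concrete with a short Python sketch. It is an illustration under stated assumptions and not the checking code added to the simulator: the function names `targets` and `coverage`, the register names r0 to r31, the two sample edges and the sample set of traversed targets are all hypothetical.

```python
from itertools import product


def targets(edges, alternatives):
    """Cross every FSM edge with every operand alternative."""
    return {(edge, alt) for edge in edges for alt in alternatives}


def coverage(targeted, traversed):
    """Percentage of targeted transitions that were actually traversed."""
    if not targeted:
        raise ValueError("no targeted transitions")
    return 100.0 * len(targeted & traversed) / len(targeted)


# 32 source register choices for a one-operand entry,
# 32 x 32 = 1024 choices for a two-operand entry.
gprs = [f"r{i}" for i in range(32)]
one_op_alternatives = gprs
two_op_alternatives = list(product(gprs, repeat=2))
print(len(one_op_alternatives), len(two_op_alternatives))

# Hypothetical example: two edges, each targeted with all 32 registers,
# of which a program happens to exercise the first 8 registers on each edge.
edges = [("Alloc v0", "op. valid"), ("Alloc v1", "issue")]
targeted = targets(edges, one_op_alternatives)
traversed = targets(edges, gprs[:8])
print(coverage(targeted, traversed))
```

Because every edge is multiplied by its operand alternatives, a program can visit every edge of the FSM and still report low coverage if it uses only a few registers, which is consistent with the low benchmark coverage reported for the reservation stations.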

Figure 4: Program sizes (i.e. the total number of instructions) for (a) the SPEC benchmarks and (b) our test programs. (Vertical axis: No. of instructions, logarithmic scale. Horizontal axis: Program names. The recovered bar labels range from 38M to 258M instructions for the twelve benchmarks and from 0.3K to 138K instructions for the test programs rename, rob, rs (1op) and rs (2op).)

5 Summary

Design errors detected late in the design cycle are extremely expensive to correct. Hence, microarchitecture validation is important since it impacts many of the complex behaviors found in the design. Here, we have applied our microarchitecture validation approach [14] to the out-of-order execution mechanisms of the PowerPC 604, which include a reorder buffer, functional-unit reservation stations and rename buffers. We have described microarchitecture-level, FSM-like models for each of these components, generated transition tours for these models and have automatically transformed these tours into PowerPC test programs. The effectiveness of our sequences has been shown through coverage comparisons with the SPEC95 benchmark programs. Our simulation results show that our programs achieve 100% FSM transition coverage. The benchmarks, although three orders of magnitude larger than our test programs, do not achieve 100% coverage of all the targeted functionality.

Currently, we are working to automate our methodology. We envision an ATPG-like algorithm which accepts FSM models of the microarchitecture, information about the instruction set architecture and fault coverage goals which are set by the program user. The output of this ATPG algorithm would be assembly-level test programs that "exercise" the functionality targeted by the FSMs. The type of exercise can range from tours and checking experiments applied to individual FSMs to the intercommunication of these FSMs. The PowerPC 604, of course, would continue to be used as our case study but the algorithm would be general and therefore applicable to any superscalar processor. Of particular interest is the application of an automated form of our validation methodology to the x86 and Alpha families of microprocessor designs.

Figure 5: Rename buffer coverage comparison between (a) the SPEC95 benchmarks and (b) our test programs. (Vertical axis: Coverage percentage, 0 to 100. Horizontal axis: Program names. Legend: GRB, FRB, CRB.)

Figure 6: Reorder buffer coverage comparison between the (a) SPEC95 benchmarks and (b) our test programs. (Vertical axis: Coverage percentage, 0 to 100. Horizontal axis: Program names.)

Figure 7: One-operand reservation station coverage comparison between (a) the SPEC95 benchmarks and (b) our test program. (Vertical axis: Coverage percentage, 0 to 100. Horizontal axis: Program names. Legend: SFX0, SFX1, CFX, FPU.)

Figure 8: Two-operand reservation station coverage comparison between (a) the SPEC95 benchmarks and (b) our test program. (Vertical axis: Coverage percentage, 0 to 100. Horizontal axis: Program names. Legend: SFX0, SFX1, CFX, FPU.)

References

[1] Special Issue Celebrating the 25th Anniversary of the Microprocessor. Microprocessor Report, Aug. 1996.
[2] D. L. Beatty and R. E. Bryant. "Formally Verifying a Microprocessor Using a Simulation Methodology," In Proc. of Design Automation Conference, pp. 596-602, June 1994.
[3] B. Black and J. P. Shen. "Calibration of Microprocessor Performance Models," IEEE Computer, Vol. 31, No. 5, pp. 59-65, May 1998.
[4] A. K. Chandra et al. "AVPGEN - A Test Case Generator for Architecture Verification," IEEE Transactions on VLSI Systems, Vol. 3, No. 2, pp. 188-200, June 1995.
[5] E. M. Clarke and R. P. Kurshan. "Computer-Aided Verification," IEEE Spectrum, pp. 61-67, June 1996.
[6] T. A. Diep. "VMW: A Visualization-based Microarchitecture Workbench," PhD thesis, Carnegie Mellon University, Aug. 1995.
[7] A. S. Huang and T. A. Diep. "MW Developer's Guide," Technical Report, CMuART-95-1, Carnegie Mellon University, Aug. 1995.
[8] R. M. Keller. "Look-Ahead Processors," Computing Surveys, Vol. 7, No. 4, pp. 177-195, Dec. 1975.
[9] Z. Kohavi. Switching and Finite Automata Theory. McGraw-Hill, New York, 1978.
[10] M. C. McFarland. "Formal Verification of Sequential Hardware: A Tutorial," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 12, No. 5, pp. 633-654, May 1993.
[11] S. Naito and M. Tsunoyama. "Fault Detection for Sequential Machines by Transition-Tours," Proc. of the International Symposium on Fault-Tolerant Computing, pp. 238-243, June 1981.
[12] J. E. Smith and A. R. Pleszkun. "Implementation of Precise Interrupts in Pipelined Processors," Proc. of the International Symposium on Computer Architecture, pp. 36-44, Jun. 1985.
[13] R. M. Tomasulo. "An Efficient Algorithm for Exploiting Multiple Arithmetic Units," IBM Journal, Vol. 11, No. 1, pp. 25-33, Jan. 1967.
[14] N. Utamaphethai, R. D. Blanton, and J. P. Shen. "Superscalar Processor Validation at the Microarchitecture Level," Proc. of International Conference on VLSI Design, Jan. 1999.
