Chapter 3 Instruction-Level Parallelism and Its Exploitation

Dr. Shadrokh Samavi


Some slides are from the instructors' resources that accompany the 5th and previous editions. Some slides are from David Patterson, David Culler, and Krste Asanovic of UC Berkeley; Israel Koren of UMass Amherst; and Milos Prvulovic of Georgia Tech. Otherwise, the source of the slide is mentioned at the bottom of the page. Please send an email if a name is missing from the above list.


Outline
1. Instruction-Level Parallelism: Concepts and Challenges
2. Basic Compiler Techniques for Exposing ILP
3. Reducing Branch Costs with Advanced Branch Prediction
4. Overcoming Data Hazards with Dynamic Scheduling
5. Dynamic Scheduling: Examples and the Algorithm
6. Hardware-Based Speculation
7. Exploiting ILP Using Multiple Issue and Static Scheduling
8. Exploiting ILP Using Dynamic Scheduling, Multiple Issue, and Speculation
9. Advanced Techniques for Instruction Delivery and Speculation
10. Studies of the Limitations of ILP
11. Cross-Cutting Issues: ILP Approaches and the Memory System
12. Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor Throughput


1. Instruction-Level Parallelism: Concepts and Challenges


All processors since about 1985, including those in the embedded space, use pipelining to overlap the execution of instructions and improve performance. This potential overlap among instructions is called instruction-level parallelism (ILP), since the instructions can be evaluated in parallel.

Two approaches to exploiting ILP:
1. Dynamic (hardware-based), e.g., the Intel Core.
2. Static (software-based), e.g., the ARM Cortex-A8 (but not the A9, which is dynamically scheduled).


Cycles per instruction:

Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls

Reducing each of the terms on the right-hand side minimizes the overall pipeline CPI and thereby increases the IPC.
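As a quick numeric illustration, here is a hedged sketch in Python; the stall values are invented for the example, not measurements:

    # Illustrative only: invented stall contributions per instruction,
    # showing how the CPI terms add up and how IPC is the reciprocal.
    ideal_cpi = 1.0                      # ideal pipeline CPI
    structural, data_hazard, control = 0.1, 0.4, 0.3
    pipeline_cpi = ideal_cpi + structural + data_hazard + control
    ipc = 1.0 / pipeline_cpi
    print(pipeline_cpi, ipc)             # 1.8 and ~0.56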


Techniques to increase IPC fall into two classes: dynamic (hardware-based) techniques and static (compiler-based) techniques.


Instruction-Level Parallelism
Types of parallelism:
1. Parallelism at the system level
2. Parallelism among instructions
3. Parallelism at the level of detailed digital design

The amount of ILP available within a basic block is quite small:
• The average dynamic branch frequency in MIPS programs is 15% to 25%, so 4 to 7 instructions execute between a pair of branches.
• These instructions are likely to depend upon one another.
Since the overlap within a basic block is much less than the average basic-block size, we must exploit ILP across multiple basic blocks.


Increasing the amount of ILP: the simplest way is loop-level parallelism (LLP), i.e., exploiting parallelism among the iterations of a loop.

Example (the iterations are independent of one another):

    for (i = 1; i <= 1000; i = i + 1)
        x[i] = x[i] + y[i];

6. Hardware-Based Speculation

Reorder Buffer (ROB) Operation
• Holds instructions in FIFO order, exactly as issued.
• When instructions complete, results are placed into the ROB, which supplies operands to other instructions between execution complete and commit (=> more registers, like the reservation stations).
• Once an instruction commits, its result is put into the register file.
• As a result, it is easy to undo speculated instructions on mispredicted branches or exceptions.


[Figure: The basic structure of a MIPS FP unit using Tomasulo's algorithm, extended to handle speculation. From the instruction unit, an instruction queue feeds issue; the reorder buffer, FP registers, load buffers, and reservation stations exchange operands over the operand buses and the Common Data Bus (CDB); an address unit handles load/store operations against the memory unit; the functional units are the FP adders and the FP multiplier.]


Speculative Tomasulo Algorithm
1. Issue: get an instruction from the FP op queue. If a reservation station and a reorder buffer slot are free, issue the instruction and send the operands and the reorder buffer number for the destination (this stage is sometimes called "dispatch").
2. Execution: operate on operands (EX). When both operands are ready, execute; if not ready, watch the CDB for the result; once both operands are in the reservation station, execute. This stage checks RAW hazards (sometimes called "issue").
3. Write result: finish execution (WB). Write the result on the Common Data Bus to all awaiting functional units and to the reorder buffer; mark the reservation station available.
4. Commit: update the register with the reorder buffer result. When an instruction is at the head of the reorder buffer and its result is present, update the register with the result (or store to memory) and remove the instruction from the reorder buffer. A mispredicted branch flushes the reorder buffer (this stage is sometimes called "graduation"). A small sketch of commit follows below.

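The commit step can be pictured with a small data-structure sketch. This is a minimal illustration under assumed field names (dest, value, ready, mispredicted), not the actual hardware organization: it shows in-order commit from the head of a FIFO and a flush on a mispredicted branch.

    # Minimal sketch of in-order commit with a reorder buffer (ROB).
    # Field names and the flush policy are illustrative assumptions.
    from collections import deque

    class ROBEntry:
        def __init__(self, dest):
            self.dest = dest           # architectural register to update
            self.value = None          # result, filled in at Write Result
            self.ready = False         # set when the result arrives on the CDB
            self.mispredicted = False  # set for a branch found to be mispredicted

    class ReorderBuffer:
        def __init__(self):
            self.entries = deque()     # FIFO: oldest entry (head) commits first

        def issue(self, dest):
            entry = ROBEntry(dest)
            self.entries.append(entry) # entries kept exactly in issue order
            return entry

        def commit(self, regfile):
            # Only the head may commit, and only once its result is present.
            while self.entries and self.entries[0].ready:
                e = self.entries.popleft()
                if e.mispredicted:
                    self.entries.clear()   # flush all younger (speculated) work
                    return
                regfile[e.dest] = e.value  # architectural state updated in order

Because architectural state changes only at commit, undoing speculation is just clearing the buffer.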

Speculative Tomasulo Algorithm: Example
Assume an add takes 2 clock cycles, a multiply 10 clock cycles, and a divide 40 clock cycles.

    L.D    F6,34(R2)
    L.D    F2,45(R3)
    MUL.D  F0,F2,F4
    SUB.D  F8,F6,F2
    DIV.D  F10,F0,F6
    ADD.D  F6,F8,F2

Show what the status tables look like when the MUL.D is ready to go to commit.

At the time the MUL.D is ready to commit, only the two L.D instructions have committed.


Speculative Tomasulo Algorithm

A processor with the ROB can dynamically execute code while maintaining a precise interrupt model. If the MUL.D instruction caused an interrupt, we could simply wait until it reached the head of the ROB and take the interrupt, flushing any other pending instructions.


Example

    Loop: L.D     F0,0(R1)
          MUL.D   F4,F0,F2
          S.D     F4,0(R1)
          DADDIU  R1,R1,#-8
          BNE     R1,R2,Loop   ; branches if R1 != R2

We have issued all the instructions in the loop twice. Let's also assume that the L.D and MUL.D from the first iteration have committed and all other instructions have completed execution. Normally, the store would wait in the ROB for both the effective address operand (R1 in this example) and the value (F4 in this example). Since we are only considering the floating-point pipeline, assume the effective address for the store is computed by the time the instruction is issued.

Speculative Tomasulo Algorithm
[Figure: ROB and reservation-station status tables for the loop example above.]

WAW and WAR hazards through memory are eliminated with speculation because the actual updating of memory occurs in order, when a store is at the head of the ROB; hence, no earlier loads or stores can still be pending. RAW hazards through memory are maintained by two restrictions (a sketch of restriction 1 follows):
1. not allowing a load to initiate the second step of its execution if any active ROB entry occupied by a store has a Destination field that matches the value of the A field of the load, and
2. maintaining the program order for the computation of the effective address of a load with respect to all earlier stores.
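As a rough illustration of restriction 1, a load can be checked against all earlier, still-active stores in the ROB before it reads memory. This is a hedged sketch; the field names (is_store, dest_addr) are illustrative assumptions, not the hardware's actual structures.

    # Sketch of the load-issue check implied by restriction 1: a load may
    # read memory only if no earlier active store in the ROB targets the
    # same address. Field names are illustrative.
    def load_may_proceed(rob_entries, load_addr):
        """rob_entries: active ROB entries older than the load."""
        for e in rob_entries:
            if e["is_store"] and e["dest_addr"] == load_addr:
                return False  # a pending store to the same address blocks the load
        return True

    # A pending store to address 0x100 blocks a load from 0x100 but not 0x108.
    pending = [{"is_store": True, "dest_addr": 0x100}]
    assert not load_may_proceed(pending, 0x100)
    assert load_may_proceed(pending, 0x108)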

7. Exploiting ILP Using Multiple Issue and Static Scheduling


More ILP with Multiple Issue
To improve performance further, we must decrease the CPI to less than one (CPI < 1): issue multiple instructions per clock.

10. Studies of the Limitations of ILP

Hardware model of an ideal (perfect) processor:
1. Register renaming: infinite virtual registers => all register WAW and WAR hazards are avoided.
2. Branch prediction: perfect; no mispredictions.
3. Jump prediction: all jumps perfectly predicted.
(2 and 3 together give a machine with perfect speculation and an unbounded buffer of instructions available.)
4. Memory-address alias analysis: all addresses are known, and a store can be moved before a load provided the addresses are not equal.
Also: an unlimited number of instructions issued per clock cycle; perfect caches; and 1-cycle latency for all instructions (including FP multiply and divide).

IBM Power5
1. Issues up to 4 instructions per clock.
2. Initiates execution on up to six (with significant restrictions on the instruction type, e.g., at most two load-stores).
3. Renaming registers: 88 integer and 88 floating-point.
4. Over 200 instructions in flight, of which up to 32 can be loads and 32 can be stores.
5. A large, aggressive branch predictor.
6. Employs dynamic memory disambiguation.

[Figure: the SPEC benchmarks used in the studies of ILP limits.]

Upper Limit to ILP: Ideal Machine
[Figure: IPC (instructions per clock) achieved by each benchmark on the ideal machine.]

The perfect processor must:
1. Look arbitrarily far ahead to find a set of instructions to issue, predicting all branches perfectly.
2. Rename all register uses to avoid WAR and WAW hazards.
3. Determine whether there are any data dependences among the instructions in the issue packet; if so, rename accordingly.
4. Determine if any memory dependences exist among the issuing instructions and handle them appropriately.
5. Provide enough replicated functional units to allow all the ready instructions to issue.


To determine whether n issuing instructions have any register dependences among them, assuming all instructions are register-register and the total number of registers is unbounded, each instruction's two source operands must be compared against the destinations of all earlier instructions in the group, for a total of

2 + 4 + ... + 2(n - 1) = n^2 - n comparisons.

Thus, analyzing the next 2000 instructions (the default window size assumed in the next several figures) requires almost 4 million comparisons.
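A quick check of this count (the helper function is ours, for illustration only):

    # Each of instruction i's two source registers is compared against
    # the destinations of all i-1 earlier instructions:
    # 2 + 4 + ... + 2(n-1) = n^2 - n comparisons in total.
    def comparisons(n):
        return sum(2 * i for i in range(1, n))

    assert comparisons(2000) == 2000**2 - 2000 == 3_998_000  # ~4 million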


Multiple-issue implementation constraints:
1. Issues per clock
2. Functional units and unit latency
3. Register file ports
4. Functional unit queues (which may be fewer than units)
5. Issue limits for branches
6. Limitations on instruction commit

[Figure: The effect of window size on the average number of instruction issues per clock cycle, for each benchmark.]

[Figure: The effects of reducing the size of the window: IPC (instructions per clock) versus window size.]

A base window size of 2K entries is assumed (roughly 10 times as large as the largest implementation in 2005), along with a maximum issue capability of 64 instructions per clock (also 10 times the widest-issue processor in 2005).

The five levels of branch prediction shown are:
1. Perfect: all branches and jumps are perfectly predicted at the start of execution.
2. Tournament-based branch predictor: a correlating two-bit predictor and a noncorrelating two-bit predictor, together with a selector that chooses the better predictor for each branch (see the sketch after this list). The prediction buffer contains 2^13 (8K) entries, each consisting of three two-bit fields: two predictors and a selector. The correlating predictor is indexed using the exclusive-or of the branch address and the global branch history; the noncorrelating predictor is the standard two-bit predictor indexed by the branch address.
3. Standard two-bit predictor with 512 two-bit entries: in addition, we assume a 16-entry buffer to predict returns.
4. Static: a static predictor uses the profile history of the program and predicts that the branch is always taken or always not taken based on the profile.
5. None: no branch prediction is used.
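The two-bit counters and the tournament selection can be sketched in a few lines of Python. This is a simplified illustration, not the exact studied configuration: the table sizes, indexing, and history update below are assumptions.

    # Two-bit saturating counters plus a tournament selector, roughly as
    # described above. Sizes and history handling are simplified.
    class TwoBit:
        def __init__(self, entries):
            self.table = [1] * entries  # counters 0..3; start weakly not-taken
        def predict(self, idx):
            return self.table[idx % len(self.table)] >= 2  # taken if 2 or 3
        def update(self, idx, taken):
            i = idx % len(self.table)
            self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)

    class Tournament:
        def __init__(self, entries=8192):  # 2^13 entries, as above
            self.local = TwoBit(entries)   # indexed by branch address
            self.corr = TwoBit(entries)    # indexed by address XOR global history
            self.sel = TwoBit(entries)     # chooses which predictor to trust
            self.ghist = 0
        def predict(self, pc):
            idx = pc ^ self.ghist
            return self.corr.predict(idx) if self.sel.predict(pc) else self.local.predict(pc)
        def update(self, pc, taken):
            idx = pc ^ self.ghist
            local_ok = self.local.predict(pc) == taken
            corr_ok = self.corr.predict(idx) == taken
            if local_ok != corr_ok:        # train selector toward whichever was right
                self.sel.update(pc, corr_ok)
            self.local.update(pc, taken)
            self.corr.update(idx, taken)
            self.ghist = ((self.ghist << 1) | int(taken)) & 0x1FFF  # 13-bit history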

More Realistic HW: the impact of branch prediction
[Figure: IPC (instructions per clock) for each benchmark under the five branch prediction strategies. The graph highlights the differences between the programs with extensive loop-level parallelism (tomcatv and fpppp) and those without (the integer programs and doduc).]

[Figure: IPC (instructions per clock) under each branch prediction scheme.]

[Figure: Branch misprediction rate for the conditional branches in the SPEC92 benchmarks.]

[Figure: instruction issues per cycle for each benchmark.]

The Effects of Imperfect Alias Analysis
[Figure: instruction issues per cycle for each benchmark under the three memory alias-analysis models.]

The impact of three models for the memory alias analysis: 1. Global/stack perfect— This model does perfect predictions for global and stack references and assumes all heap references conflict. This model represents an idealized version of the best compiler-based analysis schemes currently in production. Recent and ongoing research on alias analysis for pointers should improve the handling of pointers to the heap in the future. 2. Inspection— This model examines the accesses to see if they can be determined not to interfere at compile time. For example, if an access uses R10 as a base register with an offset of 20, then another access that uses R10 as a base register with an offset of 100 cannot interfere, assuming R10 could not have changed. In addition, addresses based on registers that point to different allocation areas (such as the global area and the stack area) are assumed never to alias. This analysis is similar to that performed by many existing commercial compilers, though newer compilers can do better, at least for loop-oriented programs. 3. None— All memory references are assumed to conflict.

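As a toy illustration of the inspection model (model 2), two accesses off the same base register with non-overlapping offsets cannot conflict, assuming the base register is unchanged between them. The function and the 8-byte access size are illustrative assumptions:

    # Inspection-based disambiguation: same base register plus
    # non-overlapping offsets => no conflict. 8-byte accesses assumed.
    def may_alias(base1, off1, base2, off2, size=8):
        if base1 == base2:
            # Same base (assumed unchanged): conflict only if ranges overlap.
            return abs(off1 - off2) < size
        return True  # different bases: conservatively assume a conflict

    # The example above: 20(R10) and 100(R10) cannot interfere.
    assert may_alias("R10", 20, "R10", 100) is False
    assert may_alias("R10", 20, "R11", 20) is True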

Limitations on ILP for Realizable Processors
1. Up to 64 instruction issues per clock with no issue restrictions. The practical implications of very wide issue widths on clock rate, logic complexity, and power may be the most important limitation on exploiting ILP.
2. A tournament predictor with 1K entries and a 16-entry return predictor. The predictor is not a primary bottleneck.
3. Perfect disambiguation of memory references done dynamically: this is ambitious but perhaps attainable for small window sizes (and hence small issue rates and load-store buffers) or through a memory dependence predictor.
4. Register renaming with 64 additional integer and 64 additional FP registers, which is roughly comparable to the IBM Power5.

[Figure: Effect of window size for a processor capable of issuing up to 64 arbitrary instructions per clock: instruction issues per cycle for each benchmark.]

Example: Consider the following three hypothetical, but not atypical, processors, which we run with the SPEC gcc benchmark:
1. A simple MIPS two-issue static pipe running at a clock rate of 4 GHz and achieving a pipeline CPI of 0.8. This processor has a cache system that yields 0.005 misses per instruction.
2. A deeply pipelined version of a two-issue MIPS processor with slightly smaller caches and a 5 GHz clock rate. The pipeline CPI of the processor is 1.0, and the smaller caches yield 0.0055 misses per instruction on average.
3. A speculative superscalar with a 64-entry window. It achieves one-half of the ideal issue rate measured for this window size. This processor has the smallest caches, which lead to 0.01 misses per instruction, but it hides 25% of the miss penalty on every miss by dynamic scheduling. This processor has a 2.5 GHz clock.

Assume that the main memory time (which sets the miss penalty) is 50 ns. Determine the relative performance of these three processors.

Answer
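A hedged sketch of the calculation in Python. It assumes the ideal issue rate measured for gcc with a 64-entry window is about 9 instructions per clock, so processor 3 sustains one-half of that, 4.5 issues per clock (a pipeline CPI of about 0.22); treat that issue rate as an assumption taken from the earlier window-size data.

    # Relative performance = clock rate / total CPI, where
    # total CPI = pipeline CPI + misses/instr * miss penalty (in cycles).
    MEM_NS = 50.0  # main memory time, which sets the miss penalty

    def instr_rate(clock_ghz, pipeline_cpi, misses_per_instr, hidden_frac=0.0):
        penalty_cycles = MEM_NS * clock_ghz * (1.0 - hidden_frac)
        total_cpi = pipeline_cpi + misses_per_instr * penalty_cycles
        return clock_ghz / total_cpi  # billions of instructions per second

    p1 = instr_rate(4.0, 0.8, 0.005)                       # 200-cycle penalty -> CPI 1.8
    p2 = instr_rate(5.0, 1.0, 0.0055)                      # 250-cycle penalty -> CPI ~2.4
    p3 = instr_rate(2.5, 1 / 4.5, 0.01, hidden_frac=0.25)  # ~94-cycle penalty -> CPI ~1.16
    print(p1, p2, p3)  # ~2.22, ~2.11, ~2.16: the simple static pipe wins narrowly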

Limitations even for the perfect model
1. WAW and WAR hazards through memory: the study eliminated WAW and WAR hazards through register renaming, but not those that arise through memory usage.
2. Unnecessary dependences: with infinite numbers of registers, all but true register data dependences are removed. One example is the dependence on the control variable in a simple do loop: since the control variable is incremented on every loop iteration, the loop contains at least one dependence.
3. Overcoming the data flow limit: if value prediction worked with high accuracy, it could overcome the data flow limit. Perfect data value prediction would lead to effectively infinite parallelism, since every value of every instruction could be predicted a priori.

Hardware versus Software Speculation
1. To speculate extensively, we must be able to disambiguate memory references. This is difficult to do at compile time for integer programs that contain pointers; Tomasulo's algorithm can do the job dynamically.
2. Hardware-based speculation works better when control flow is unpredictable and when hardware-based branch prediction is superior to software-based branch prediction done at compile time.
3. Hardware-based speculation maintains a completely precise exception model even for speculated instructions.
4. Hardware-based speculation does not require compensation or bookkeeping code, which is needed by ambitious software speculation mechanisms.
5. Compiler-based approaches may benefit from the ability to see further in the code sequence, resulting in better code scheduling than a purely hardware-driven approach.
6. Hardware-based speculation with dynamic scheduling does not require different code sequences to achieve good performance for different implementations of an architecture.

12. Multithreading: Using ILP Support to Exploit Thread-Level Parallelism to Improve Uniprocessor Throughput


ILP vs. Thread-Level Parallelism (TLP)
1. ILP is reasonably transparent to the programmer.
2. The next higher level of parallelism is thread-level parallelism (TLP). It is logically structured as separate threads of execution.
3. A thread is a separate process with its own instructions and data. A thread may represent a process that is part of a parallel program consisting of multiple processes, or it may represent an independent program on its own. Each thread has all the state (instructions, data, PC, register state) necessary to allow it to execute. Thread-level parallelism is explicitly represented by the use of multiple threads of execution that are inherently parallel.

1. Is it possible for a processor oriented at ILP to exploit TLP?
2. Multithreading allows multiple threads to share the functional units of a single processor in an overlapping fashion. A separate copy of the register file, a separate PC, and a separate page table are required for each thread. The memory itself can be shared through the virtual memory mechanisms, which already support multiprogramming.
3. Thread switch: the ability to change to a different thread relatively quickly; in particular, a thread switch should be much more efficient than a process switch, which typically requires hundreds to thousands of processor cycles.

Approaches to Multithreading
Fine-grained multithreading: switches between threads on each instruction, in a round-robin fashion (a sketch follows below).
Coarse-grained multithreading: switches threads only on costly stalls, such as level-2 cache misses. Pipeline start-up overhead must be considered: this method is most useful for reducing the penalty of high-cost stalls, where the pipeline refill time is negligible compared to the stall time.
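A minimal sketch of fine-grained (round-robin) thread selection at the fetch stage; the Thread fields and the stall model are illustrative assumptions, not a real pipeline:

    # Each cycle, fetch picks the next ready thread in round-robin order,
    # so consecutive pipeline slots come from different threads.
    class Thread:
        def __init__(self, tid):
            self.tid, self.pc = tid, 0
            self.stalled_until = 0  # cycle at which a pending miss resolves

    def fetch_schedule(threads, n_cycles):
        issued, rr = [], 0
        for cycle in range(n_cycles):
            for i in range(len(threads)):          # scan for a ready thread
                t = threads[(rr + i) % len(threads)]
                if t.stalled_until <= cycle:
                    issued.append((cycle, t.tid, t.pc))
                    t.pc += 4
                    rr = (rr + i + 1) % len(threads)
                    break                          # one instruction per cycle
        return issued

    # Four ready threads interleave as T0, T1, T2, T3, T0, ...
    print(fetch_schedule([Thread(t) for t in range(4)], 8))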

Pipeline Hazards

    LW   r1, 0(r2)
    LW   r5, 12(r1)
    ADDI r5, r5, #12
    SW   12(r1), r5

Each instruction may depend on the next.
[Figure: cycle-by-cycle pipeline diagram (F D X M W, cycles t0-t14) showing each dependent instruction stalling in decode until the prior result is written back.]

Execution units are largely idle in an out-of-order superscalar (shown for an 8-way superscalar).
[Figure: from Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-chip Parallelism," ISCA 1995.]

Multithreading
We need to guarantee no dependences between instructions in the pipeline; hence, interleave the execution of instructions from different program threads on the same pipeline.

Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe:

    T1: LW   r1, 0(r2)
    T2: ADD  r7, r1, r4
    T3: XORI r5, r4, #12
    T4: SW   0(r7), r5
    T1: LW   r5, 12(r1)

[Figure: pipeline diagram (cycles t0-t9). With four threads interleaved, the prior instruction in a thread always completes write-back before the next instruction in the same thread reads the register file.]

Superscalar Machine Efficiency
[Figure: issue slots (issue width 4) over time. A completely idle cycle is vertical waste; a partially filled cycle (IPC < 4) is horizontal waste.]

Vertical Multithreading
[Figure: a second thread interleaved cycle-by-cycle into the issue slots; partially filled cycles (IPC < 4) remain as horizontal waste.]
Cycle-by-cycle interleaving can remove vertical waste, but leaves some horizontal waste.

Chip Multiprocessing (CMP)
[Figure: the issue slots split between two narrower processors.]
Splitting the pipeline into multiple processors:
– reduces horizontal waste,
– leaves some vertical waste, and
– puts an upper limit on the peak throughput of each thread.

Ideal Superscalar Multithreading [Tullsen, Eggers, Levy, UW, 1995]
[Figure: instructions from multiple threads fill all issue slots each cycle.]
Interleave multiple threads into the multiple issue slots with no restrictions.

Power4
Single-threaded predecessor to the Power5. Eight execution units in the out-of-order engine, each of which may issue an instruction every cycle.

Power4 vs. Power5
[Figure: pipeline diagrams of the Power4 and the Power5. To support SMT, the Power5 front end has 2 fetch units (2 PCs) and 2 initial decode stages, and its back end has 2 commit units (2 architected register sets).]

Changes in Power5 to support SMT
1. Increased associativity of the L1 instruction cache and the instruction address translation buffers
2. Added per-thread load and store queues
3. Increased the size of the L2 (1.92 vs. 1.44 MB) and L3 caches
4. Added separate instruction prefetch and buffering per thread
5. Increased the number of virtual registers from 152 to 240
6. Increased the size of several issue queues
7. The Power5 core is about 24% larger than the Power4 core because of the addition of SMT support

Power5 thread performance
The relative priority of each thread is controllable in hardware. For balanced operation, both threads run slower than if they "owned" the machine.

Design Challenges in SMT ■ Dealing with a larger register file needed to hold multiple contexts ■ Not affecting the clock cycle, particularly in critical steps such as instruction issue, where more candidate instructions need to be considered, and in instruction completion, where choosing what instructions to commit may be challenging ■ Ensuring that the cache and TLB conflicts generated by the simultaneous execution of multiple threads do not cause significant performance degradation


[Figure: A comparison of SMT and single-thread (ST) performance (speedup) on the 8-processor IBM eServer p5 575.]

Simultaneous Multithreading (SMT): Converting TLP into ILP The key motivation for SMT is that multiple-issue processors often have more functional unit parallelism available than a single thread can effectively use.


[Figure: How five approaches use the issue slots of a multiple-issue processor over time (processor cycles): superscalar, fine-grained multithreading, coarse-grained multithreading, multiprocessing, and simultaneous multithreading. Shading distinguishes Threads 1-5; white boxes are idle slots.]

Comparison of Performance (Int)
[Figure: SPECRatio for the SPECint2000 benchmarks.]

Comparison of Performance (FP)
[Figure: SPECRatio for the SPECfp2000 benchmarks.]

[Figure: comparison of processors by transistor count, area, power, and efficiency ratio.]

[Figure: performance on parallel benchmarks.]