CSE 533 – Advanced Computer Architectures

Homework Assignment #2
Due Date: 08/04/2008

1) Consider the following instruction sequence for the following problem:

            ADD   R2, R0, R0       /* R2 ← 0 */
            ADDL  R8, R0, #600     /* R8 ← 600 */
    Loop:   LOAD  R1, R2, #200
            LOAD  R3, R2, #1000
            MUL   R4, R1, R3
            ADD   R4, R4, R2
            STORE R4, R2, #2000
            ADD   R2, R2, #2
            BNE   R2, R8, Loop

Assume the following latencies:

    Instruction   Latency
    ADD           1 cycle  (F, D, E, M, W)
    ADDL          1 cycle  (F, D, E, M, W)
    LOAD          2 cycles (F, D, E1, E2, M, W)
    STORE         2 cycles (F, D, E1, E2, M, W)
    MUL           3 cycles (F, D, E1, E2, E3, M, W)
    BNE           1 cycle  (F, D, E, M, W)

Assuming the static scheduling introduces NOP instructions until the branch instruction is resolved and bypass circuitry is present:

a) Show the pipeline grid (# of cycles vs. # of instructions) for the original schedule.

b) Apply one of the popular static scheduling algorithms: Loop Unrolling. Unroll the loop once, and, again, show the pipeline grid for this schedule.

c) Do we save any cycles? How many cycles can we save if we completely get rid of the loop structure by unrolling it 300 times?
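As a sanity check on the loop in Problem 1 (not a solution to the pipeline grid), the following is a minimal Python sketch of what the instruction sequence computes and how many times the loop body runs. It assumes that `LOAD Rd, Rb, #off` reads `Mem[Rb + off]` and that `STORE Rs, Rb, #off` writes `Mem[Rb + off]`, which the assignment does not state explicitly. The function name `run_problem1_loop` and the dictionary-based memory are hypothetical, introduced only for illustration.

```python
def run_problem1_loop(mem, bound=600, step=2):
    """Functional model of the Problem 1 loop (no timing).

    `mem` is a dict mapping addresses to values; unset addresses read as 0
    (an assumption made for this sketch). Returns the iteration count.
    """
    r2 = 0                              # ADD   R2, R0, R0
    r8 = bound                          # ADDL  R8, R0, #600
    iterations = 0
    while True:
        r1 = mem.get(r2 + 200, 0)       # LOAD  R1, R2, #200  (assumed Mem[R2+200])
        r3 = mem.get(r2 + 1000, 0)      # LOAD  R3, R2, #1000
        r4 = r1 * r3                    # MUL   R4, R1, R3
        r4 = r4 + r2                    # ADD   R4, R4, R2
        mem[r2 + 2000] = r4             # STORE R4, R2, #2000
        r2 += step                      # ADD   R2, R2, #2
        iterations += 1
        if r2 == r8:                    # BNE   R2, R8, Loop falls through
            break
    return iterations
```

With R2 advancing by 2 up to 600, the model executes the body 300 times, which matches the "unrolling it 300 times" figure in part c).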

2) Consider the following instruction sequence for the following problem:

            ADD   R2, R0, R0       /* R2 ← 0 */
            ADDL  R8, R0, #600     /* R8 ← 6 */
            ADDL  R10, R0, #400    /* R10 ← 400 */
    Loop:   LOAD  R1, R2, #200
            MUL   R1, R1, R8
            STORE R1, R2, #4000
            ADD   R2, R2, #4
            BNE   R2, R10, Loop

a) Fill in the blanks:

               ADD   R2, R0, R0
               ADDL  R8, R0, #600
               ADDL  R10, R0, ___      /* 3 iteration less */
    Prelude:   LOAD  R20, R2, #200     /* iteration #1 */
               LOAD  R21, R2, #204     /* iteration #2 */
               LOAD  R1, R2, #208      /* iteration #3 */
               MUL   R20, R20, R8      /* iteration #1 */
               MUL   R21, R21, R8      /* iteration #2 */
               STORE R20, R2, #4000    /* iteration #1 */
    Loop:      STORE ___, ___, ____    /* iteration #2 or i+1 */
               MUL   ___, ___, ___     /* iteration #3 or i+2 */
               LOAD  ___, ___, ____    /* iteration #4 or i+3 */
               ADD   R2, R2, #4
               BNE   R2, R10, Loop
    Postlude:  STORE R21, R2, #4004    /* iteration #99 */
               MUL   R21, R1, R8       /* iteration #100 */
               STORE R21, R2, #4008    /* iteration #100 */

b) Assuming the same latency values and bypass characteristics of the first problem, how many cycles can we save after software pipelining?
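
A filled-in software-pipelined schedule can be checked against the behaviour of the original loop in Problem 2. The sketch below is one possible Python reference model of that original loop only; it does not fill in any blanks. It assumes the same `Mem[Rb + off]` addressing for `LOAD` and `STORE` as above. The multiplier in R8 is left as a parameter because the listing's immediate (#600) and its comment (6) disagree. The name `run_problem2_loop` and the dictionary-based memory are hypothetical.

```python
def run_problem2_loop(mem, multiplier=600, bound=400, step=4):
    """Functional model of the original Problem 2 loop (no timing).

    `mem` is a dict mapping addresses to values; unset addresses read as 0
    (an assumption made for this sketch). Returns the iteration count.
    """
    r2 = 0                              # ADD   R2, R0, R0
    r8 = multiplier                     # ADDL  R8, R0, #600 (comment says 6)
    r10 = bound                         # ADDL  R10, R0, #400
    iterations = 0
    while True:
        r1 = mem.get(r2 + 200, 0)       # LOAD  R1, R2, #200  (assumed Mem[R2+200])
        r1 = r1 * r8                    # MUL   R1, R1, R8
        mem[r2 + 4000] = r1             # STORE R1, R2, #4000
        r2 += step                      # ADD   R2, R2, #4
        iterations += 1
        if r2 == r10:                   # BNE   R2, R10, Loop falls through
            break
    return iterations
```

The model runs the body 100 times, consistent with the "iteration #100" comments in the postlude. Any candidate schedule should leave the same values at addresses 4000 through 4396.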
