16-Bit vs. 32-Bit Instructions for Pipelined Microprocessors

John Bunda, [email protected], 512-471-9715
Donald Fussell, [email protected], 512-471-9719
Roy Jenevein, [email protected], 512-471-9722
Department of Computer Sciences, The University of Texas at Austin, Austin, Texas 78712

W. C. Athas, [email protected], 310-822-1511 ext. 644
Information Sciences Institute, USC, Marina del Rey, California

Keywords: RISC, instruction sets, code density, performance

October 23, 1992

Abstract

In any stored-program computer system, information is constantly transferred between the memory and the instruction processor. Machine instructions are a major portion of this traffic. Since transfer bandwidth is a limited resource, inefficiency in the encoding of instruction information (low code density) can have definite hardware and performance costs. Starting with a parameterized baseline RISC design, we compare performance for two instruction encodings for the architecture. One is a variant of DLX, the other is a 16-bit format which sacrifices some expressive power while retaining essential RISC features. Using optimizing compilers and software simulation, we measure code density and path length for a suite of benchmark programs, relating performance differences to specific instruction set features. We measure time-to-completion performance while varying memory latency and instruction cache size parameters. The 16-bit format is shown to have significant cost-performance advantages over the 32-bit format under typical memory system performance constraints.

This work was supported in part by the IBM Corporation's Graduate Resident Study Program.

1 Introduction

Efficient transfer of instructions between the memory and instruction set processor is a significant issue in any von Neumann-style computer system. Since the capacity of processors to execute instructions typically exceeds the capacity of a memory to provide instructions, inefficiency in the encoding of instruction information can be expected to have definite hardware and/or performance costs. Such considerations for many years supported the development of CISC processors. CISC instructions provide relatively compact encodings of computations, but this comes at the cost of complex decoding and execution, often requiring multiple processor cycles per instruction. These drawbacks have motivated widespread adoption of the RISC paradigm, which in pure form employs only simple instructions which can be decoded easily, execute in a single machine cycle, and facilitate pipelining of the processor. With the use of instruction caches and advanced compiler technology, RISC machines can provide significant performance advantages over CISC machines.

Most RISC instruction sets contain only 32-bit instructions, allowing simple instruction fetch and decode stages, three-operand instructions for flexibility in compiler optimization, and sufficient addressing capability for modern machines. However, such instruction sets require significantly larger numbers of bits to represent object programs than CISC machines, and in that sense provide a less efficient encoding of instructions. This means that RISC programs require more main memory for storage and more instruction fetch bandwidth for execution than CISC machines. These considerations are somewhat mitigated by the fact that memory is a relatively inexpensive resource in current technology, and through the use of instruction caches to reduce fetch bandwidth requirements. Thus, the RISC paradigm seems better suited to today's technology.

In spite of the success of RISC, low code density is still a drawback, and still exacts a penalty in terms of cost and/or performance, even though these disadvantages are more than offset by RISC advantages vis-a-vis CISC. It is natural to ask whether low code density is really an inherent penalty for RISC performance. In this paper, we claim that the answer to this question is no. One way to improve code density is to use a 16-bit rather than a 32-bit instruction set, while otherwise retaining the design features of the typical RISC processor. By keeping all instructions the same length, fetch and decode simplicity can be maintained. A load/store general register architecture can still be used, but the short instructions will limit the number of registers that can be referenced and the number of operands per instruction. Fewer bits are available for address offsets. These limitations can be expected to limit the ability of compilers to optimize code for these machines, lengthen instruction sequences required for given computations, and thus decrease performance. On the other hand, short instructions allow more fetches for a given memory bandwidth and require smaller instruction caches for a given miss ratio. Thus, it is unclear a priori whether the use of 32-bit instructions is the best way to exploit the advantages of the RISC paradigm.

In this paper, we report the results of a set of experiments designed to provide specific, quantitative evaluations of these tradeoffs. We start with a baseline instruction processor with a fixed pipeline architecture and set of operations, and compare performance of two instruction encodings for the machine.
Execution of a suite of benchmark programs is simulated for a processor with a multi-stage pipeline, executing single instructions at a peak rate of one per clock cycle. DLXe is a variant of DLX, Hennessy and Patterson's composite RISC instruction set with a conventional fixed 32-bit format [HP90]. We compare this to a 16-bit format called D16. An overview of the D16 and DLXe instruction sets is given in Section 2.

Our measurements are based on code compiled by GCC 2.1, a portable optimizing C compiler [Sta92]. Optimized code produced by GCC is competitive with the native compilers for commercial workstations. While there are potential advantages to specifically targeted compilers, basing both on the same technology helps ensure a level playing field, where compilation, optimization, and code generation capabilities are as similar as possible. The minor differences between the instruction sets are, for the most part, handled by code generation parameters of the portable compiler.

Using an architecture simulator that executes programs of either instruction encoding, we measure relative code density (size of compiled programs) and path length (total count of executed instructions) for a suite of benchmark programs. We examine in detail the particular restrictions of the 16-bit encoding to determine precisely the effects of these restrictions on density and path length. Results show that the compiler is able to exploit the expressive power of 32-bit instructions to measurably reduce the number of instructions in the static representation. However, we find that the reduction in path length is considerably less than what might be expected based on static measures. These results are presented in Section 3.

In Section 4, performance of each instruction set is measured with respect to memory interface parameters, including latency and instruction cache size. We examine time-to-completion performance for programs of the benchmark suite, and find that, under constraints of typical memory system performance, there are sufficient advantages in the reduced instruction traffic of the 16-bit encoding to provide comparable or improved performance with respect to the more conventional 32-bit format.

2 D16 and DLXe Instruction Sets

D16 and DLXe are both RISC-inspired load-store instruction sets. They are nearly identical in function, and supported on the same pipeline with identical execution resources. Both instruction sets have the normal complement of ALU, shift, memory, and floating-point operations. The principal differences lie in the size and format of instruction encodings and the size of the register files. D16 instructions are sixteen bits, DLXe instructions thirty-two. D16 instructions can address sixteen general and sixteen floating-point registers, while DLXe instructions can address thirty-two of each.

DLXe is a variant of the 32-bit DLX RISC instruction set [HP90], which is a simple RISC design with a strong resemblance to the MIPS R2000 [Kan88]. DLXe differs from DLX only in its floating-point comparison instructions and the lack of direct loads and stores of floating-point registers. These restrictions were incorporated to simplify the FPU interface for a prototype implementation.

D16 and DLXe instruction formats are shown in Figures 1 and 2, respectively. Because the information density of 16-bit instructions is higher, a more elaborate encoding scheme is necessary; D16 has five instruction types to DLXe's three. Both instruction sets define general-register machines. Some D16 instructions, including compares and branches, have fixed, implicit operand and destination registers. D16 has four bits for each register address, designating one of sixteen general or sixteen floating-point registers. The DLXe instruction format has five bits per register address, addressing thirty-two of each register class. The set of opcodes for each instruction set is approximately the same, as shown in Table 1. D16 and DLXe instructions are executed on the same five-stage execution pipeline, shown in Figure 3.

Figure 1: D16 16-bit instructions. (Figure body omitted: five 16-bit formats, MEM, REG, MVI, BR, and LDC, laying out opcode, register (rx, ry), immediate, constant, and offset fields within bits 15-0.)
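Since the figure conveys only field boundaries, a small C sketch of decoding one D16-style instruction word may make the idea concrete. The field positions below are assumptions chosen to be consistent with the description in the text (4-bit register addresses, an opcode in the high bits); the layout in the original Figure 1, not this sketch, is authoritative.

```c
#include <stdint.h>

/* Hypothetical decode of a D16-style register-format (REG) instruction.
 * Field positions are illustrative assumptions, not the paper's encoding. */
struct d16_reg {
    unsigned opcode;  /* operation selector */
    unsigned ry;      /* source register, one of sixteen */
    unsigned rx;      /* source/destination register, one of sixteen */
};

static struct d16_reg decode_reg(uint16_t word)
{
    struct d16_reg d;
    d.opcode = (word >> 8) & 0x3F;  /* assumed 6-bit opcode field */
    d.ry     = (word >> 4) & 0xF;   /* 4-bit register address */
    d.rx     =  word       & 0xF;   /* 4-bit register address */
    return d;
}
```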

3 Instruction Set Performance

One measure of the expressive power of an instruction set is the physical size of compiled programs. The relative density of two programs that perform the same computation is the ratio of their sizes. A smaller program has higher density than a larger one. Another measure is path length, the total number of instructions in an execution of a given program. In general, a shorter path length is preferred. However, other performance considerations can offset small path length differences: a processor achieving higher throughput can execute more instructions in the same number of cycles. In this section, we compare these measures for a suite of programs chosen from commonly quoted synthetic benchmarks and real programming tasks. The benchmark suite is summarized in Table 2.

3.1 Code Density

The relative density of programs of the benchmark suite compiled and linked for DLXe and D16 is shown in Figure 4. The figure shows the relative density of D16 programs with respect to the DLXe equivalent. The programs are compiled with all optimizations enabled, including instruction scheduling. The size measure is the number of bytes in the stripped [Lab83] binary executable file, including both text and data segments.[1] DLXe programs average approximately 1.5 times the size in bytes of D16 binaries. While D16 instructions are half the size of DLXe, the reduced expressive power of the smaller format results in an increase in the total number of instructions, so the relative density is less than two.

3.2 Path Length

Path length is the total number of instructions in an execution trace of a given program. Relative density is useful in assessing memory requirements, but static instruction counts are only weakly correlated with path length. This is partly because program execution time is dominated by inner loops. Also, some optimization strategies reduce path length at a cost of increased static size [AC71]. Consequently, few conclusions about performance can be drawn from density alone. Direct comparison of D16 and DLXe program path lengths is more meaningful than comparisons between arbitrary architectures, because both instruction sets are executed on the same pipeline. If execution resources or computations per instruction for both machines were different, it would be more difficult to make direct cost-performance comparisons. Path lengths of each DLXe program relative to D16 are shown in Figure 5. The DLXe speedup over D16 is not as dramatic as relative density measures might predict, averaging 15 percent over all test programs. In the next section, we examine individual distinguishing features of DLXe instructions to assess the contribution of each to density and path length differences.

3.3 Instruction Set Features

DLXe instructions differ from D16 instructions in several important ways:

1. 32 vs. 16 general and floating-point registers.

2. Three-address vs. two-address instructions.

3. Larger immediate fields, more immediate operations.

4. Larger address offsets, available for all addressing modes.

[1] Since library source is identical, this comparison is fair. Comparison of executable sizes among architectures with different libraries could be misleading.

Figure 2: DLX 32-bit instructions. (Figure body omitted: three 32-bit formats, I-type, R-type, and J-type, laying out op, rs1, rs2, rd, immediate, func, and offset fields within bits 31-0.)

Instruction                                Description

ld, st, ldh, ldhu, sth, ldb, ldbu, stb     For D16, the address for subword modes is not offsettable. D16 offsets for word modes are limited to 128, or negative to -4096 for the LDC format. Loads have one delay slot.

br, bz, bnz                                Branches are to PC-relative offsets; the D16 limit is 1024.

j, jz, jnz, jl                             Jumps are to an absolute address in a specified register; the linkage register is r1. One delay slot.

s<cond>                                    Integer compare. For D16, both operands must be registers; sets r0 to zeros or ones. Cond may be lt, ltu, le, leu, eq, or neq. DLXe allows immediate operands, also gt, gtu, ge, geu, and any GPR destination.

add, addi, sub, subi                       D16 immediate range on addi/subi is unsigned 5 bits. All DLXe immediates are 16 bits.

and, or, xor, neg, inv                     DLXe also has andi, ori, and xori; neg and inv are unneeded on DLXe because r0 is always zero.

shra, shrai, shr, shri, shl, shli          D16 immediates are 5 bits unsigned, 0 <= i < 32.

mv, mvi                                    D16 only; the mvi immediate is signed 9 bits.

mvhi                                       DLXe only; sets the upper 16 bits.

add.sf, sub.sf, mul.sf, div.sf, neg.sf,    Floating-point operations.
cmp.sf, add.df, sub.df, mul.df, div.df,
neg.df, cmp.df

si2sf, sf2df, di2df, df2di, df2sf          Mode conversions.

trap, rdsr                                 Special instructions; rdsr reads the status register containing the result of floating-point compares.

Table 1: D16 and DLXe opcodes.

To determine which instruction set features provide the most return in the code density tradeoff, the code generator of the DLXe compiler is selectively restricted to prevent exploitation of a particular feature, and the resulting performance is compared.

3.3.1 Register File Size

One of the first difficult compromises in the D16 instruction set is the number of addressable registers. The number of registers required depends on both the compiler allocation scheme and the target application. It has been claimed that, with procedure-level register allocation, a general register file larger than sixteen does not significantly increase performance [Rad83], [HF89]. The problem of optimally allocating registers by a compiler is NP-complete, but heuristic solutions with very good behavior exist [CAC+81]. However, with the GCC compiler, we observe measurable differences in both static and dynamic performance between sixteen and thirty-two registers.

For both instruction sets, the register file is fixed and allocated at compile time, as opposed to using a register-windowing scheme. Register windows provide sliding-window access to a large register file, which can be used, for example, by procedure calls to reduce memory traffic in saving and restoring non-volatile scratch registers.

Figure 3: D16-DLXe execution pipeline. (Figure body omitted: both instruction sets share a five-stage pipeline, Instruction Fetch, Decode, Execute, Memory, and Writeback, with successive instructions overlapped on each clock.)

Program      Description
ackermann    Computes the Ackermann function.
assem        The D16 assembler.
bubblesort   Sorting program from the Stanford suite.
queens       The Stanford eight-queens program.
quicksort    The Stanford quicksort program.
towers       The Stanford towers of Hanoi program.
grep         The Unix utility from the BSD sources.
linpack      The linear programming benchmark.
matrix       Gaussian elimination.
dhrystone    The synthetic benchmark.
pi           Computes digits of pi.
solver       Newton-Raphson iterative solver.
latex        The typesetter.
ipl          PostScript plotting package.
whetstone    The synthetic floating point benchmark.

Table 2: The benchmark suite.

This scheme originally appeared in the Berkeley RISC machine [Pat85], and has been employed in other machines, including the Am29000 [Joh87] and SPARC [GAB+88]. Link-time register allocation, which allows static allocation of registers across procedure calls [Wal86], has been shown to perform as well as or better than register windows, and partitioning the register file and dedicating several registers to globals is just as effective [Wal87]. This approach allocates registers across procedure and module boundaries, and can make use of more registers. However, the small D16 register file, and the increase in GCC performance with more registers, suggest that D16 is at least slightly register-limited, perhaps to the point of severely limiting the benefits of a link-time allocation scheme. With a small register file, register windows are a more promising option.

All else being equal, reducing the size of the visible register file only degrades a program's performance if, at any point in its execution, there are more live values than available registers. When this happens, values needed in registers must be spilled to temporary variables in memory. An increase in spills increases the total number of instructions, as well as data traffic to memory. This is the bad news; the good news is that in many programming environments (our compiler for one), spills are to stack frame variables, which are extremely likely to hit in a data cache. Spills are therefore generally less expensive than other memory references. While the register allocation and organization issue is not merely peripheral, we restrict our scope to the least common denominator, a fixed, flat register file with procedure-based allocation, and leave evaluation of more sophisticated allocation or windowing schemes as possible future work.

To test the effects of the smaller D16 register file in isolation, the DLXe compiler is restricted to generate code using only 16 general and 16 floating-point registers. Again, the code is measured for density and dynamic instruction counts.
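A contrived C fragment makes the spill mechanism concrete. Nothing below is from the paper's benchmarks; it simply keeps more values live at once than a sixteen-register file can comfortably hold, so a compiler restricted to sixteen registers is more likely to store and reload some partial sums than one with thirty-two.

```c
/* Contrived register-pressure example: twelve partial sums, the input
 * pointer, and expression temporaries are all live at the return
 * expression.  With 16 visible registers some of these values are
 * likely to be spilled to the stack frame; with 32 they can all stay
 * in registers.  Whether a given compiler spills is allocation-dependent. */
long many_live(const long *a)
{
    long s0  = a[0]  + a[1],  s1  = a[2]  + a[3];
    long s2  = a[4]  + a[5],  s3  = a[6]  + a[7];
    long s4  = a[8]  + a[9],  s5  = a[10] + a[11];
    long s6  = a[12] + a[13], s7  = a[14] + a[15];
    long s8  = a[16] + a[17], s9  = a[18] + a[19];
    long s10 = a[20] + a[21], s11 = a[22] + a[23];
    return ((s0 * s1) + (s2 * s3)) + ((s4 * s5) + (s6 * s7))
         + ((s8 * s9) + (s10 * s11));
}
```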

Figure 4: D16 relative density (static code size DLXe/D16). (Figure body omitted: per-benchmark bar chart of the DLXe/D16 size ratio.)

Additionally, we measure data traffic to determine the extent to which it is increased with the smaller register file. Figure 6 shows the increase in static code size for the suite of benchmarks. The increase in total instructions to run each program with the smaller register file is shown in Figure 7. Table 3 gives the relative data traffic increase over DLXe for both D16 and DLXe restricted to a D16-sized register file. D16 does better than the restricted DLXe in some cases, but this may be due to the fact that DLXe has one fewer register available, since r0 is always zero. In general, however, for the allocation scheme used by the GCC compiler, the data traffic penalty for the smaller register file is about 10 percent.


Figure 5: DLXe path length reduction. (Figure body omitted: per-benchmark DLXe path lengths relative to D16 = 1.0.)

Figure 6: Density effects of 16 vs. 32 registers. (Figure body omitted: static code size of DLXe with 16 and with 32 registers, relative to D16 = 1.00.)

Figure 7: Path length effects, 16 vs. 32 registers. (Figure body omitted: path lengths of DLXe with 16 and with 32 registers, relative to D16 = 1.0.)

Benchmark     Increase %
Program       D16     DLXe-16
ackermann     0.2     1.2
assem         0.0     3.1
bubblesort    0.7     1.0
queens        8.8     2.7
quicksort     15.4    14.7
towers        -2.0    0.0
grep          54.5    36.5
linpack       10.8    10.6
matrix        26.3    26.9
dhrystone     10.9    2.7
pi            1.3     3.2
solver        13.8    10.5
latex         6.2     4.7
ipl           1.6     -2.8
whetstone     3.0     19.5
Average       10.1    9.0

Table 3: Data traffic increase for the smaller register file.

3.3.2 Three-Address vs. Two-Address Instructions

The ability to specify a destination register distinct from the operand registers appears in many 32-bit RISC instruction sets, despite the fact that it is widely believed to have little tangible benefit. However, the availability of bits when encoding most operations in a fixed 32-bit instruction makes this feature virtually free in most RISC instruction sets.

Figure 8: Code density effects, two-address instructions. (Figure body omitted: per-benchmark code size ratios relative to D16 = 1.00.)

Figure 8 shows how density decreases when DLXe instructions are restricted to two operands (the destination register is always required to be the same as the left source operand register). The increase in path length is shown in Figure 9. Both measures show that three-address instructions have a small but measurable advantage for most of the benchmark programs.


Figure 9: Path length effects, two-address instructions. (Figure body omitted: per-benchmark path length ratios relative to D16 = 1.0.)

3.3.3 Immediate Fields

The limits of 16-bit instructions appear especially acute when confronting the issue of encoding constant operands in an instruction field. Where DLXe allows 16-bit immediate operands and address displacements, the bits available for such operands are very scarce in D16 instructions. As described in Section 2, the only D16 integer instructions supporting immediate operands are the add, subtract, and shift instructions, which are limited to (unsigned) 5 bits. The move-immediate instruction has a sign-extended 9-bit immediate operand. Address displacements are limited to word-aligned displacements of 128 bytes.

Figure 10: Effect of large immediates on path lengths. (Figure body omitted: per-benchmark speedup for DLX immediates and offsets, relative to D16 = 1.00.)

Restricting the DLXe compiler to both a small register file and two-address instructions, we can measure a machine that approximates D16 but retains the immediate-operand instructions, 16-bit immediate fields, and large address displacements of DLXe. The reduction in total instructions this provides is shown in Figure 10. The figure gives the speedup of each program relative to D16 = 1.0. Two of the programs are actually measurably slower; this is perhaps due to the sacrifice of a register (r0 is permanently zero). The average speedup provided by immediate fields is about 10 percent, and this is confirmed by the breakdown of instructions in program traces given in Table 4, which gives the frequency of instructions in the restricted DLXe programs with immediate operands that exceed the limits of the D16 instruction set. Assuming each of these has a penalty of one additional D16 instruction, the path length difference is accounted for.

Compare immediate                 2.1%
ALU immediate, > 5 bits           2.8%
Memory displacements > 8 bits     4.6%
Total                             9.5%

Table 4: Average immediate-field instruction frequencies.

Can anything be done to improve D16 performance? The D16 MVI instruction moves 9-bit immediates to a register. Giving up one bit in the D16 MVI immediate field, one could implement an 8-bit move-immediate and an 8-bit compare-equal-immediate instruction, which could improve D16 performance by up to 2 percent (if the comparands are 8 bits or less).
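To make the cost of scarce immediate bits concrete, the C sketch below counts the instructions needed to materialize a 32-bit constant under the limits just described. The assumed D16 sequence (mvi a leading piece, then a shift/mvi/or triple per additional byte, using one scratch register) is one plausible strategy, not the authors' code generator.

```c
#include <stdint.h>

/* Instructions to materialize a 32-bit constant on DLXe: one op if it
 * fits a 16-bit immediate, otherwise mvhi for the upper half plus one
 * more immediate op for the rest. */
static int dlxe_const_cost(int32_t v)
{
    if (v >= -32768 && v <= 32767)
        return 1;
    return 2;
}

/* Assumed D16 strategy: mvi covers a sign-extended 9-bit value; larger
 * constants are built a byte at a time with shli / mvi-to-scratch / or. */
static int d16_const_cost(int32_t v)
{
    if (v >= -256 && v <= 255)
        return 1;
    int bytes = 1;
    uint64_t u = (uint32_t)v;        /* 64-bit so the shift below is safe */
    while (u >> (8 * bytes))
        bytes++;                     /* 8-bit pieces the constant spans */
    return 1 + 3 * (bytes - 1);      /* mvi, then 3 ops per extra byte */
}
```

For example, under these assumptions a 16-bit constant such as 0x1234 costs one DLXe instruction but four D16 instructions.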

Half of the total DLXe speedup involves address displacements that exceed 8 bits, and most of these are stack frame references. The GCC compiler assumes that stack frame slots can be addressed cheaply, so this situation could be improved with more sophisticated compiler stack management. For example, common subexpression elimination on stack addresses, or biasing the stack and frame pointers to increase addressability could be used by a compiler to reduce the D16 addressability problem for large stack frames.

3.4 Combined Effects

In doubling the instruction size from sixteen to thirty-two bits, static code size does not double. The compiler is able to exploit larger immediate fields and offsets for memory instructions, yielding a significant reduction in the static number of instructions. However, the reduction in path length is not nearly as dramatic as density measures might predict. One reason for this is that the execution time of any program is dominated by inner loops. The better a compiler is able to move expensive operations out of inner loops, the less effect these instructions have on ultimate performance. It appears that D16 programs have considerable overhead in manipulating constant operands, but the overhead instructions comprise a proportionally small part of dynamic instruction traces.

Figure 11: Code density summary. (Figure body omitted: code size ratios DLXe/D16 for 16 registers/2-address, 16 registers/3-address, 32 registers/2-address, and full DLXe.)

Figures 11 and 12 summarize the density and path-length differences of all the measured instruction set features and their interactions. Each bar group shows how DLXe density and path length are affected by the corresponding instruction set feature: immediate fields, three-address instructions, a larger register file, and all three combined. Density and path length ratios averaged over all programs in the test suite are shown in Table 5.

            Code Size (D16 = 1.00)           Path Length (D16 = 1.00)
Registers   Two-Address   Three-Address      Two-Address   Three-Address
16          1.62          1.61               0.95          0.94
32          1.57          1.52               0.90          0.87

Table 5: Summary of density and path length effects.

Figure 12: Path length summary. (Figure body omitted: path length ratios DLXe/D16 for 16 registers/2-address, 16 registers/3-address, 32 registers/2-address, and full DLXe.)

D16 instructions produce denser code, reducing static size by about 35 percent, with an approximately equal reduction in instruction traffic. DLXe and D16 instructions are executed by an identical pipeline. Assuming decoding either format does not affect the critical path, the only factor that could close the gap between D16 and DLXe instructions is the penalty for higher instruction traffic, the total bits of instructions fetched during program execution (path length times bits per instruction). Steenkiste's uniformity assumption is that the effects of density are uniformly distributed, affecting all program parts equally. This allows predicting traffic based on density [Ste89]. Instruction traffic is given with static size references in Figure 13, which shows the uniformity assumption to be approximately true for the benchmark suite. To determine whether the cost of instruction traffic can offset the additional path length, we need to consider the performance of a real instruction fetch mechanism. In the remainder of this paper, we explore performance of the D16 and DLXe instruction sets with respect to memory and cache performance parameters.

4 Effects of Memory Performance

For a scalar processor without a cache, it is fairly easy to estimate performance based on the instruction count and total memory traffic. If all memory references have ℓ wait-state cycles, and each instruction fetch request returns a block of k instructions, the total number of cycles is at least:

    Cycles = IC + ℓ × (IC / k + MemOps)

where IC is the instruction count (total path length), and MemOps is the total number of memory operations (loads and stores). This estimate is optimistic, because it assumes that all instructions fetched are actually issued (none are fetched but discarded due to branches). This estimate also ignores interlocks due to delayed load and FPU operations. Simulation to measure actual memory traffic and pipeline behavior can be used to more accurately predict performance:

    Cycles = IC + ℓ × (Ifetches + Dfetches) + Interlocks
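Rendered as C, with the assem/D16 counts from Tables 8, 9, and 10 of the appendix plugged in as an example (the single wait state is an arbitrary illustration, and CPI is cycles divided by instruction count, as defined below):

```c
#include <stdio.h>

/* Optimistic lower bound for a cacheless scalar processor:
 * Cycles = IC + l*(IC/k + MemOps).  Assumes every fetched instruction
 * issues and ignores delayed-load and FPU interlocks. */
static double cycles_estimate(double ic, double l, double k, double mem_ops)
{
    return ic + l * (ic / k + mem_ops);
}

/* Simulation-based form: Cycles = IC + l*(Ifetches + Dfetches) + Interlocks */
static double cycles_simulated(double ic, double l, double ifetches,
                               double dfetches, double interlocks)
{
    return ic + l * (ifetches + dfetches) + interlocks;
}

int main(void)
{
    /* assem on D16: path length, word fetches, loads/stores, interlocks
     * (appendix Tables 8, 9, and 10); l = 1 wait state. */
    double ic = 275270, ifetch = 150574, dfetch = 86876, ilk = 24774;
    double cyc = cycles_simulated(ic, 1.0, ifetch, dfetch, ilk);
    printf("cycles = %.0f, CPI = %.2f\n", cyc, cyc / ic);
    return 0;
}
```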

Figure 13: Instruction traffic vs. density, DLXe/D16. (Figure body omitted: per-benchmark bars of instruction traffic and static code size, D16 = 1.00.)

For a given program and instruction set, the number of Ifetches is constant in k. The performance impact of varying ℓ can be accurately estimated using this formula.[2] Dividing the total number of cycles by the instruction count gives a performance measure independent of path length, the average number of cycles per instruction (CPI):

    CPI = Cycles / IC

Dividing the cycle counts for a particular program on two different machines by the path length of one of them normalizes the performance measure. This allows direct comparison of program performance that factors out path length differences but still reflects overall performance. Figure 14 compares normalized CPI for cacheless D16 and DLXe processors as the number of memory wait states ℓ varies from 0 to 3 cycles. CPI is normalized (to correct for differences in instruction count), and the two graphs show performance, averaged over all programs in the test suite, for 32-bit and 64-bit memory ports, respectively. These graphs illustrate two interesting points. First, the instruction prefetching of D16 with a 32-bit fetch bus (k = 2) results in higher performance (lower CPI) with any nonzero ℓ. The 64-bit fetch bus gives k = 2 for DLXe and k = 4 for D16, and almost identical average performance (especially with ℓ > 1). There are two reasons the path length disadvantage of the D16 machine does not always result in poorer performance. First, the D16 machine prefetches more instructions with each fetch (k is increased), and second, the total number of instruction fetches (traffic) is lower. In other words, D16 instructions result in fewer total cycles of instruction fetch latency, and each cycle of latency is amortized over more instructions. Relative instruction traffic is illustrated in Figure 15. The graph shows saturation of the instruction fetch bus in requests per cycle as ℓ varies. As these figures show, instruction fetch performance falls off sharply as memory latency increases. Because large, fast memories can be prohibitively expensive (if not impossible) to build, some means of reducing fetch latency is necessary to avoid severely degrading the performance of today's microprocessors.

[2] This formula is actually slightly pessimistic because it assumes memory and FPU latencies do not overlap. Actual measured performance is presented, but differs by less than 1% from that predicted by the formula.


Figure 14: Normalized CPI for 32-bit and 64-bit fetch, no cache. (Figure body omitted: two panels of cycles per instruction vs. memory wait states 0-3, comparing DLXe (k=1, k=2) with D16 (k=2, k=4) and normalized D16 curves.)

4.1 Performance with Instruction Cache

One technique for reducing the impact of long main memory latencies is to add an instruction cache, a smaller and faster buffer memory between the instruction fetch hardware and main memory. The locality of instruction fetch references can make caches very effective in reducing the average latency of an instruction fetch request. Even a very small cache can reduce main memory instruction fetch traffic and the associated latency by more than 80 percent [HS84]. The actual reduction in traffic and latency provided by a cache depends not only on its size and organization, but also on the dynamic memory access patterns of the program. To evaluate cache performance, the dinero cache simulator [Hil92] was applied to measure miss rates and traffic for the programs of the benchmark suite large enough to have interesting cache behavior: assem, latex, and ipl. The D16/DLXe path length ratios for these programs are 1.24, 1.15, and 1.26, respectively. Predicting system performance with a cache is difficult because key parameters are target-program dependent and vary widely. Byte for byte, D16 instructions yield better instruction cache performance than DLXe instructions, because twice as many instructions fit in the same size cache, and each bus transaction transfers twice as many instructions. All else being equal, D16 hit rates are higher, and total traffic lower. The question is, given equal resources, does the improvement in miss rate, and therefore miss penalty cycles, offset the increase in cycles required to execute the extra D16 instructions?
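For readers who want the mechanics spelled out, here is a bare-bones direct-mapped instruction cache model in C that only counts hits and misses for a stream of fetch addresses. It is a sketch, not the dinero simulator: the sub-blocking and prefetch used in the experiments below are omitted, and the 4K size and 32-byte blocks merely mirror one configuration studied here.

```c
#include <stdint.h>

/* Minimal direct-mapped I-cache model: one tag per line, demand fill,
 * no sub-blocks, no prefetch.  Call fetch() once per instruction fetch. */
#define BLOCK_BYTES 32
#define CACHE_BYTES 4096
#define NUM_LINES   (CACHE_BYTES / BLOCK_BYTES)

static uint32_t tag[NUM_LINES];
static int      valid[NUM_LINES];
static long     hits, misses;

static void fetch(uint32_t addr)
{
    uint32_t block = addr / BLOCK_BYTES;   /* which memory block */
    uint32_t line  = block % NUM_LINES;    /* direct-mapped index */
    if (valid[line] && tag[line] == block) {
        hits++;
    } else {
        misses++;                          /* demand miss: fill the line */
        valid[line] = 1;
        tag[line]   = block;
    }
}
```

Because a D16 instruction is half the size of a DLXe instruction, the same CACHE_BYTES holds twice as many D16 instructions, which is the capacity advantage the measurements below quantify.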

4.1.1 Cache Miss Rates

D16 and DLXe processors with separate on-chip direct-mapped instruction and data caches are considered. Block (line) size is 32 bytes with 4-byte sub-blocks; hit latency is zero processor cycles. The memory bus is 32 bits, and on a demand miss, the word following the missed word is always prefetched (ignoring bus cycles, this is similar to the 64-bit fetch described for the cacheless machines above). Figure 16 shows miss rates for the three benchmark programs with instruction caches of 1K to 16K. D16 and DLXe miss rate differences are considerable, but in both cases, traffic is substantially reduced. The performance difference for a given program depends on the actual ratio of instructions executed and total miss penalty cycles. Figure 17 shows how CPI varies with miss penalty in processor cycles for the three programs, assuming each processor has 4K direct-mapped instruction and data caches.

Figure 15: Instruction fetch saturation, no instruction cache. (Figure body omitted: two panels of fetch requests per cycle vs. memory wait states 0-3 for DLXe and D16, with 32-bit and 64-bit fetch buses.)

With the exception of assem, normalized performance for D16 and DLXe is comparable, despite the D16 path length increase. For assem, D16 performance is better despite a 24 percent increase in path length. This is because, for this program, 4K is sufficient to capture the D16 working set, but 8K is required for DLXe to achieve a similar hit rate. Figure 18 gives CPI with instruction and data caches expanded to 16K bytes. The performance curves are very close for all three programs; even with a large cache, the hit rate increase and traffic reduction for D16 are sufficient to effectively cancel the impact of longer path length on net performance.

4.1.2 Instruction Traffic

An instruction cache of any size substantially reduces instruction traffic for either instruction set. Figure 19 compares D16 and DLXe instruction traffic for each cache benchmark, with a 4K instruction cache. The figure shows that regardless of program or cache hit rate, D16 instruction traffic is significantly lower than that of DLXe.

5 Conclusions.

Results presented in this paper show that reducing the instruction size for a RISC processor to 16 bits increases code density with respect to a processor with a fixed 32-bit format. This approach has measurable costs in terms of the expressive power of the instructions, but measurement with optimizing compilers reveals that this sacrifice does not impact dynamic performance as much as static measurements or conventional wisdom might predict. The increased density measured for D16 programs compares favorably with CISC encodings, yet the instruction set design does not sacrifice essential advantages of the RISC paradigm. Compared to the 32-bit DLXe format, D16 16-bit instructions directly increase the capacity of instruction fetch resources, including memory bandwidth, cache size, and cache block size. The impact on performance depends on factors including memory latency, cache size, and availability of fetch bandwidth. Performance measurements for benchmark programs show that in almost all interesting cases, the increase in throughput achieved by greater instruction fetch efficiency is at least sufficient to offset the path length penalty.

Figure 16: Instruction cache miss rates. (Figure body omitted: three panels, assem, ipl, and latex, of D16 and DLXe miss rate vs. instruction cache size, 1K to 16K.)

For parallel architectures, multiprocessors, and processors with high clock rates compared to memory access times, performance is often significantly impacted by the limits of instruction fetch resources. These trends in technology increase, rather than decrease, concern over instruction traffic as a performance bottleneck. The improved density provided by reducing instruction size directly reduces traffic, improving instruction fetch performance. This paper demonstrates that this greater efficiency can be achieved without compromising the inherent advantages of the RISC approach.

Acknowledgments.

Justine Blackmore prototyped the architecture simulator. The assembler is a modified version of one originally written by Rick Simpson. The linker and C compiler are ports of GNU software from the Free Software Foundation; special thanks to Richard Stallman and Richard Kenner for their invaluable advice in porting GCC. Some of the library code, including the floating point math routines, came from public BSD sources.


Figure 17: Performance with 4K instruction and data caches. (Figure body omitted: three panels, assem, ipl, and latex, of CPI vs. miss penalty 4-16 cycles for DLXe, D16, and normalized D16.)

Figure 18: Performance with 16K instruction and data caches. (Figure body omitted: three panels, assem, ipl, and latex, of CPI vs. miss penalty 4-16 cycles for DLXe, D16, and normalized D16.)

Figure 19: Instruction traffic with a 4K cache. (Figure body omitted: three panels, assem, ipl, and latex, of words per cycle vs. instruction cache size 1K-16K for DLXe and D16.)

References

[AC71] Frances E. Allen and John Cocke. A catalogue of optimizing transformations. In Randall Rustin, editor, Proceedings Courant Computer Science Symposium 5. Prentice-Hall, March 1971.

[CAC+81] G. J. Chaitin, M. A. Auslander, A. K. Chandra, J. Cocke, M. E. Hopkins, and P. W. Markstein. Register allocation via coloring. Computer Languages, 6(1):231-240, 1981.

[GAB+88] Robert Garner, Anant Agrawal, Will Brown, David Hough, Bill Joy, Steve Kleiman, Stephen Muchnick, Dave Patterson, Joan Pendleton, and Richard Tuck. The scalable processor architecture (SPARC). In Proc. 1988 COMPCON, March 1988.

[HF89] Jerome C. Huck and Michael J. Flynn. Analyzing Computer Architectures. IEEE Computer Society Press, 1989.

[Hil92] Mark D. Hill. Dinero cache simulator, 1992. Available at several Internet sites, including max.stanford.edu.

[HP90] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, Inc., Palo Alto, CA, 1990.

[HS84] Mark D. Hill and Alan Jay Smith. Experimental evaluation of on-chip microprocessor cache memories. In Conference Proceedings of the 11th Annual International Symposium on Computer Architecture, pages 158-166, Ann Arbor, MI, 1984.

[Joh87] M. Johnson. System considerations in the design of the Am29000. IEEE Micro, 7(4):28-41, August 1987.

[Kan88] Gerry Kane. MIPS RISC Architecture. Prentice-Hall, 1988.

[Lab83] Bell Telephone Laboratories. UNIX Programmer's Manual, 1983.

[Pat85] David A. Patterson. Reduced instruction set computers. CACM, 28(1), January 1985.

[Rad83] George Radin. The 801 minicomputer. IBM Journal of Research and Development, 27(3):237-246, May 1983.

[Sta92] Richard M. Stallman. Using and Porting GCC Version 2. Free Software Foundation, 1992.

[Ste89] Peter Steenkiste. The impact of code density on instruction cache performance. In Conference Proceedings of the 16th Annual International Symposium on Computer Architecture, pages 252-259, 1989.

[Wal86] David W. Wall. Global register allocation at link time. SIGPLAN Notices, 21(7):264-275, July 1986. Proceedings of the SIGPLAN 1986 Symposium on Compiler Construction.

[Wal87] David W. Wall. Register windows vs. register allocation. Technical Report 87/5, Digital Western Research Laboratory, Palo Alto, CA, 1987.

A D16 and DLXe Performance Data

Performance comparisons all assume equal hardware. The important distinction from an instruction fetch point of view is that for any given set of instruction fetch resources, D16 has approximately double the capacity in instructions. A 32-bit word holds one DLXe instruction and two D16 instructions. Therefore, all things being equal, D16 fetch bandwidth and cache capacities are effectively double for the same resource budget.

A.1 Instruction Set Performance

Tables 6 and 7 summarize code density and path lengths for each benchmark program for D16, DLXe, and the restrictions of DLXe to a smaller register file and two-address instructions.

Benchmark                ISA/registers/operands
Program       D16/16/2   DLXe/16/2  DLXe/16/3  DLXe/32/2  DLXe/32/3
ackermann     1840       2980       2932       2980       2972
assem         29852      42864      42360      41448      40700
bubblesort    2140       3508       3460       3492       3480
queens        2136       3596       3560       3508       3400
quicksort     2292       3804       3728       3748       3748
towers        2908       4732       4656       4724       4684
grep          5836       9180       9032       9084       8984
linpack       7592       13980      13660      14504      13808
matrix        3348       5832       5748       6380       6236
dhrystone     3800       5920       5836       5488       5760
pi            3648       5912       5816       5800       5724
latex         194532     281356     277200     274248     265468
ipl           167793     228529     227413     224097     219637
whetstone     12868      21740      20500      21756      21080

Relative Density
Average       1.00       1.62       1.61       1.57       1.53
Std. Dev.                0.11       0.13       0.11       0.09

Table 6: Code size/density summary.

Benchmark                ISA/registers/operands
Program       D16/16/2   DLXe/16/2  DLXe/16/3  DLXe/32/2  DLXe/32/3
ackermann     2751       2500       2353       2291       2144
assem         302600     268463     269009     257672     245873
bubblesort    607978     606416     608202     605467     605443
queens        33385      33496      33170      26951      24310
quicksort     63983      60544      61411      59600      58657
towers        2858775    2940662    2842265    2866884    2760296
grep          316976     281817     273617     268161     259961
linpack       38600636   38012876   37844974   37496251   37188113
matrix        325799     343419     349792     302204     301126
dhrystone     41294      33612      33150      32823      31124
pi            49632      48654      46735      49361      46046
solver        1264827    1250021    1217622    1151606    1084102
ipl           5454311    5049427    5050049    4976673    4844964
latex         11678293   10480781   10100524   9488626    9091141
whetstone     807623     780750     752114     764898     715909

Path Length Reduction
Average       1.00       0.95       0.94       0.90       0.87
Std. Dev.                0.06       0.07       0.07       0.08

Table 7: Path length summary.

The reduction in code size for D16 is manifest as reduced instruction traffic. Assuming the instruction fetch mechanism fetches 32-bit words (ignoring the low-order two address bits), path length versus the total number of instruction words fetched for D16 and DLXe is shown for each benchmark program in Table 8.

D16 instruction traffic (in words) is greater than half its path length because all fetches are word-aligned, and alignment of branch instructions and targets can result in fetches of words in which both instructions are not executed. Memory instructions (loads and stores) for each program are shown in Table 9. Interlocks, caused when an instruction refers to a result that is not yet available, are shown in Table 10. The data in these tables can be used to compute performance under various resource assumptions.

Benchmark     Path Length              Instruction Traffic
Program       D16         DLXe         D16         DLXe        %
ackermann     2518        1914         1584        1914        17.2
assem         275270      222430       150574      222430      32.3
bubblesort    513724      511791       279355      511791      45.4
queens        30035       21532        15986       21532       25.8
quicksort     57535       50957        32318       50957       36.6
towers        2645908     2506461      1425583     2506461     43.1
grep          281392      233127       153504      233127      34.2
linpack       34010048    32670025     18269825    32670025    44.1
matrix        310925      277616       164750      277616      40.7
dhrystone     38089       28353        20520       28353       27.6
pi            40519       37936        22209       37936       41.5
solver        1164552     968865       604886      968865      37.6
ipl           4803128     4187762      2736828     4187762     34.6
latex         10398621    8285180      5568851     8285180     32.8
whetstone     715845      626182       376402      626182      39.9
Average                                                        35.6
Std. Dev.                                                      7.7

Table 8: Path length and instruction traffic.

Program       D16        DLXe       %
ackermann     518        517        -0.2
assem         86876      86846      -0.0
bubblesort    186149     184934     -0.7
queens        9428       8667       -8.8
quicksort     14136      12249      -15.4
towers        1212240    1236809    2.0
grep          81943      53043      -54.5
linpack       7755395    6998787    -10.8
matrix        110541     87497      -26.3
dhrystone     12384      11169      -10.9
pi            9943       9811       -1.3
solver        377088     331481     -13.8
latex         3457032    3255669    -6.2
ipl           1217766    1198124    -1.6
whetstone     221041     214659     -3.0
Average                             -10.1
Std. Dev.                           14.5

Table 9: Total loads and stores.


                         D16                                DLXe
Program       Instructions  Interlocks  Rate     Instructions  Interlocks  Rate
ackermann     2518          214         0.085    1913          207         0.108
assem         275270        24774       0.090    222429        43596       0.196
bubblesort    513724        79627       0.155    511790        79327       0.155
queens        30035         3004        0.100    21531         2455        0.114
quicksort     57535         5811        0.101    50956         6624        0.130
towers        2645908       195797      0.074    2506460       230594      0.092
grep          281392        31516       0.112    233126        24012       0.103
linpack       34010048      4047196     0.119    32670024      3953073     0.121
matrix        310925        14303       0.046    277615        21654       0.078
dhrystone     38089         2971        0.078    28352         3062        0.108
pi            40519         7455        0.184    37935         6677        0.176
solver        1164552       92000       0.079    968864        102700      0.106
latex         10398621      1143848     0.110    8285179       737381      0.089
ipl           4803128       571572      0.119    4187761       569535      0.136
whetstone     715845        81606       0.114    626181        78273       0.125
Mean                                    0.104                              0.122
Std. Dev.                               0.034                              0.032

Table 10: Delayed load and math unit interlocks.

A.2 Processor Performance

Without on-chip cache, all memory references are external. Buffering instructions according to the width of the memory bus means that not all instruction requests result in external traffic. Performance of a program can be predicted using the formula:

    Cycles = IC + Interlocks + Latency × (IRequests + DRequests)

with instruction count, interlock, and traffic data from Tables 8, 10, and 9. Without an instruction cache, each fetch request returns a block of k instructions, where k is the fetch bus width divided by the instruction size. When k is greater than 1, the instruction block is buffered, and as long as the instructions requested are in the buffer, no memory request is made, and the fetch latency for those instructions is zero cycles. With a 32-bit fetch bus, k = 1 for DLXe and k = 2 for D16. Table 11 gives the cycle ratios (DLXe to D16) for wait states of zero to three cycles. With zero wait states, DLXe cycles are lower, but D16 performs better with any nonzero wait state due to lower instruction traffic (fewer fetch requests). Widening the fetch bus to 64 bits and buffering doubles the number of instructions per fetch: k = 2 for DLXe and k = 4 for D16. As block size grows large, the likelihood of fetching unneeded instructions increases; moreover, the memory interface may require fetching of aligned double-words, which exacerbates this. The benefits of prefetching do help DLXe, which averages 8 percent slower than D16; some programs are slightly faster.
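A minimal C sketch of this fetch buffer, driven by a trace of instruction addresses, shows why taken branches inflate D16 word traffic; the structure and names are illustrative, not taken from the authors' simulator.

```c
#include <stdint.h>

/* Cacheless fetch buffer: one block of k instructions is held; a fetch
 * that falls outside the buffered block issues an external request.
 * block_bytes = k * instruction size (e.g. 4 for DLXe on a 32-bit bus,
 * 4 for two D16 instructions on the same bus). */
struct fetch_buf {
    uint32_t base;         /* base address of the buffered block */
    uint32_t block_bytes;
    int      full;
    long     requests;     /* external memory requests issued */
};

static void fetch_insn(struct fetch_buf *b, uint32_t pc)
{
    uint32_t block = pc - (pc % b->block_bytes);
    if (!b->full || block != b->base) {
        b->requests++;     /* buffer miss: go to memory */
        b->base = block;
        b->full = 1;
    }
}
```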


              Memory Wait States
Program       0      1      2      3
ackermann     0.78   0.94   1.01   1.04
assem         0.81   1.03   1.11   1.16
bubblesort    1.00   1.21   1.30   1.34
queens        0.73   0.93   1.01   1.05
quicksort     0.92   1.10   1.18   1.22
towers        0.97   1.18   1.26   1.30
grep          0.82   0.99   1.06   1.09
linpack       0.96   1.19   1.28   1.34
matrix        0.92   1.09   1.17   1.21
dhrystone     0.77   0.96   1.04   1.07
pi            0.93   1.14   1.23   1.28
solver        0.86   1.06   1.14   1.18
ipl           0.89   1.09   1.17   1.21
latex         0.78   1.00   1.08   1.13
whetstone     0.89   1.10   1.19   1.24
Mean          0.87   1.07   1.15   1.19

Table 11: DLXe/D16 performance, 32-bit fetch bus, no cache.

              Memory Wait States
Program       0      1      2      3
ackermann     0.76   0.89   0.94   0.96
assem         0.81   0.96   1.02   1.06
bubblesort    1.00   1.12   1.17   1.21
queens        0.72   0.87   0.93   0.97
quicksort     0.89   1.03   1.08   1.12
towers        0.95   1.09   1.14   1.17
grep          0.83   0.90   0.94   0.96
linpack       0.96   1.09   1.15   1.19
matrix        0.89   1.00   1.05   1.08
dhrystone     0.74   0.89   0.94   0.97
pi            0.94   1.05   1.12   1.16
solver        0.83   0.98   1.04   1.07
latex         0.80   0.92   0.99   1.03
ipl           0.87   1.01   1.07   1.11
whetstone     0.87   1.01   1.08   1.12
Mean          0.86   0.99   1.04   1.08

Table 12: DLXe/D16 cycles, 64-bit fetch bus, no cache.

A.3 Cache Performance

All cache performance figures are derived from the following measurements. Instructions, interlocks, and instruction and data traffic for each cache benchmark program are given in Table 13.

Benchmark           Instruction  Interlock  Instruction  Data     Data
Program     ISA     Count        Rate       Fetches      Reads    Writes
as16        D16     275270       0.090      150574       64018    22858
            DLXe    222429       0.196      222430       64283    22563
ipl         D16     4803128      0.119      2736348      939713   277829
            DLXe    4187025      0.136      4187026      916880   281020
latex       D16     10398621     0.110      5564128      2566212  887625
            DLXe    8277794      0.089      8277795      2393084  859445

Table 13: Traffic and interlocks for cache benchmarks.

Instruction and data cache behavior for each cache benchmark program is given in Tables 14, 15, and 16. All caches are organized in 8-byte sub-blocks, with wrap-around prefetch for instruction and data reads and no prefetch on write. Cache performance for these experiments was measured with dinero [Hil92].

Size   Block   Instruction        Data Read          Data Write
               D16      DLXe      D16      DLXe      D16      DLXe
1k     8       0.117    0.278     0.262    0.284     0.342    0.412
       16      0.125    0.288     0.304    0.334     0.411    0.479
       32      0.132    0.299     0.386    0.447     0.482    0.593
       64      0.148    0.317     0.503    0.535     0.606    0.679
2k     8       0.076    0.189     0.168    0.159     0.216    0.224
       16      0.084    0.203     0.213    0.202     0.280    0.302
       32      0.089    0.222     0.278    0.285     0.321    0.442
       64      0.104    0.239     0.364    0.377     0.410    0.513
4k     8       0.042    0.126     0.094    0.095     0.161    0.160
       16      0.047    0.139     0.127    0.120     0.207    0.188
       32      0.051    0.156     0.176    0.180     0.235    0.253
       64      0.063    0.173     0.242    0.266     0.334    0.351
8k     8       0.015    0.066     0.039    0.052     0.081    0.090
       16      0.016    0.074     0.050    0.065     0.091    0.099
       32      0.017    0.085     0.066    0.092     0.099    0.120
       64      0.021    0.096     0.104    0.154     0.177    0.159
16k    8       0.004    0.019     0.023    0.042     0.073    0.074
       16      0.004    0.020     0.028    0.051     0.075    0.075
       32      0.004    0.020     0.035    0.071     0.076    0.077
       64      0.004    0.023     0.051    0.124     0.145    0.087

Table 14: Cache miss rates for as16.

With these figures, performance for each program on D16 and DLXe machines with different instruction and data cache sizes and configurations can be estimated using the formula:

    Cycles = IC + Interlocks + MissPenalty × (IMiss + RMiss + WMiss)

Size   Block   Instruction        Data Read          Data Write
               D16      DLXe      D16      DLXe      D16      DLXe
1k     8       0.057    0.095     0.066    0.077     0.086    0.091
       16      0.060    0.102     0.087    0.091     0.117    0.115
       32      0.067    0.113     0.120    0.115     0.180    0.154
       64      0.077    0.123     0.167    0.154     0.250    0.222
2k     8       0.034    0.050     0.035    0.037     0.061    0.060
       16      0.037    0.054     0.044    0.046     0.082    0.079
       32      0.041    0.061     0.058    0.064     0.118    0.102
       64      0.048    0.072     0.091    0.088     0.161    0.149
4k     8       0.020    0.030     0.019    0.022     0.040    0.029
       16      0.023    0.032     0.025    0.027     0.055    0.038
       32      0.026    0.038     0.033    0.040     0.081    0.049
       64      0.031    0.046     0.046    0.056     0.117    0.086
8k     8       0.004    0.019     0.014    0.009     0.034    0.018
       16      0.004    0.020     0.019    0.012     0.048    0.025
       32      0.004    0.024     0.024    0.017     0.070    0.033
       64      0.005    0.029     0.034    0.024     0.104    0.055
16k    8       0.002    0.007     0.005    0.005     0.006    0.010
       16      0.002    0.007     0.006    0.006     0.007    0.013
       32      0.002    0.008     0.008    0.008     0.008    0.016
       64      0.002    0.009     0.011    0.012     0.012    0.022

Table 15: Cache miss rates for ipl.

These tables contain instruction counts (path length), interlock and miss rates, and fetch traffic for each program. Note that miss rates are reported per instruction, not per fetch request, since the number of instructions in each word of traffic depends on the instruction set; read and write data misses are percentages of read and write instructions, respectively.


Size   Block   Instruction        Data Read          Data Write
               D16      DLXe      D16      DLXe      D16      DLXe
1k     8       0.092    0.201     0.180    0.189     0.178    0.177
       16      0.099    0.212     0.234    0.234     0.222    0.221
       32      0.114    0.226     0.297    0.308     0.312    0.302
       64      0.127    0.247     0.418    0.423     0.412    0.415
2k     8       0.065    0.139     0.124    0.125     0.122    0.110
       16      0.070    0.150     0.165    0.154     0.151    0.134
       32      0.086    0.163     0.208    0.208     0.195    0.182
       64      0.100    0.183     0.280    0.277     0.256    0.252
4k     8       0.033    0.087     0.088    0.081     0.076    0.081
       16      0.038    0.093     0.110    0.102     0.093    0.093
       32      0.044    0.102     0.142    0.137     0.112    0.116
       64      0.051    0.117     0.186    0.190     0.155    0.157
8k     8       0.018    0.052     0.062    0.053     0.065    0.068
       16      0.020    0.056     0.074    0.065     0.075    0.077
       32      0.025    0.064     0.097    0.086     0.087    0.094
       64      0.028    0.075     0.126    0.120     0.118    0.118
16k    8       0.025    0.027     0.037    0.036     0.053    0.056
       16      0.012    0.029     0.044    0.043     0.057    0.060
       32      0.015    0.032     0.055    0.053     0.062    0.067
       64      0.016    0.037     0.070    0.068     0.073    0.079

Table 16: Cache miss rates for latex.
