
The Performance Potential of Data Value Reuse Antonio González, Jordi Tubella and Carlos Molina Departament d’Arquitectura de Computadors Universitat Politècnica de Catalunya, Jordi Girona 1-3, Edifici D6, 08034 Barcelona, Spain e-mail: {antonio,jordit,cmolina}@ac.upc.es

Abstract

This paper presents a study of the performance limits of data value reuse. Two types of data value reuse are considered: instruction-level reuse and trace-level reuse. The former reuses instances of single instructions whereas the latter reuses sequences of instructions as an atomic unit. Two different scenarios are considered: an infinite resource machine and a machine with a limited instruction window. The results show that reuse is abundant in the SPEC applications. Instruction-level reuse may provide a significant speedup, but it drops dramatically when the reuse latency is considered. Trace-level reuse has in general less potential for the unlimited window scenario, but it is much more effective for the limited window configuration. This is because trace-level reuse, in addition to reducing the execution latency, increases the effective instruction window size by avoiding the fetch and execution of sequences of instructions. Overall, trace-level reuse is shown to be a promising approach since it can provide speedups around 3 for a 256-entry instruction window and a realistic reuse latency.

Keywords: Data value reuse, instruction-level reuse, trace-level reuse, instruction-level parallelism.

1. Introduction

Data dependences1 are one of the most important hurdles that limit the performance of current microprocessors. The amount of instruction-level parallelism (ILP) that processors may exploit is significantly limited by the serialization caused by data dependences. This limitation is more severe for integer codes, in which data dependences are more abundant. Some studies on the ILP limits of integer applications have revealed that some of them cannot achieve more than a few tens of instructions per cycle (IPC) in an ideal processor with the sole limitation of data dependences [6] [12]. This suggests that techniques to avoid the serialization caused by data dependences are important to boost ILP, and they will be crucial for future wide-issue microprocessors.

Two techniques have been proposed so far to avoid the serialization caused by data dependences: data value speculation and data value reuse. This paper focuses on the latter. The objective of this work is not to propose or evaluate any particular reuse mechanism, but to understand the nature of this phenomenon and the potential benefits that may be expected from data value reuse. The main objective is to study the performance limits of data value reuse, ignoring implementation aspects. Two different types of reuse are considered: instruction-level reuse and trace-level reuse. The former targets single instructions whereas the latter targets sequences of instructions of the dynamic stream. The potential of both approaches is evaluated for an infinite resource machine and for a machine with a limited instruction window.

The rest of this paper is organized as follows. Section 2 reviews the concept of data value reuse and the most relevant related work. Section 3 details the objectives of this work. The evaluation methodology is described in section 4. The performance of instruction-level reuse and trace-level reuse is analyzed in sections 5 and 6 respectively. Finally, section 7 summarizes the main conclusions of this work.

1. In this paper, data dependences refer to true dependences (output and anti-dependences are not included).

2. Data value reuse

Data value reuse1 is a technique that exploits the fact that many dynamic instances of some static instructions, and groups of instructions, have the same inputs, and thus generate the same results. Data value reuse exploits this fact by buffering previous inputs and their corresponding outputs.

1. Some authors refer to this technique as instruction reuse [10].


When an instruction or group of instructions is fetched and decoded and its current inputs are found in that buffer, its execution can be avoided by taking the outputs from the buffer. This reduces functional unit utilization and, more importantly, reduces the time to compute the results, and thus shortens the critical paths of the execution. Particularly interesting is the fact that this technique may compute all at once the results of a chain of dependent instructions (e.g. in a single cycle), which allows the processor to exceed the dataflow limit that is inherent in the program.

Data value reuse can be implemented in software or hardware. The software implementation is usually known as memoization or tabulation [2] [9]. Memoization is a code transformation technique that takes advantage of the redundant nature of computation by trading execution time for increased memory storage. The results of frequently executed sections of code (e.g. function calls, groups of statements with limited side effects) are stored in a table. Later invocations of these sections of code are preceded by a table lookup and, in case of a hit, the execution of these sections of code is avoided.

A hardware implementation of data value reuse was proposed by Harbison for the Tree Machine [5]. The Tree Machine has a stack-oriented ISA and the main novelty of its architecture was that the hardware assumed a number of the compiler’s traditional optimizations, like common subexpression elimination and invariant removal. This is achieved by means of a value cache, which stores the results of dynamic sequences of code (called phrases). For each phrase, the value cache keeps its result as well as an identifier of its input variables. For the sake of simplicity, input variables are represented by a bit vector (called the dependence set), such that multiple variables share the same codification. Every time the value of a variable changes, all the value cache entries that may have this variable as an input are invalidated.

Another hardware implementation of data value reuse is the result cache proposed by Richardson [8] [9]. The objective was to speed up some long-latency operations, like multiplications, divisions and square roots, by caching the results of recently executed operations.


The result cache is indexed by hashing the source operand values, and for each pair of operands it contains the operation code and the corresponding result. Result caching is further investigated by Oberman and Flynn [7]. They evaluate division caches, square root caches and reciprocal caches, which are similar to Richardson’s result cache but for just one type of operation: division, square root and reciprocal respectively. They also investigate a shared cache for reciprocals and square roots.

Sodani and Sohi propose the reuse buffer [10], which is a hardware implementation of data value reuse (or dynamic instruction reuse, as it is called in that paper). The reuse buffer is indexed by the instruction address. They propose three different reuse schemes. In the first scheme, the buffer holds, for each instruction, the source operand values and the result of the last execution of that instruction. In the second scheme, instead of the source operand values, the buffer holds the source operand names (architectural register identifiers). In the third scheme, in addition to the information of the second scheme, the buffer stores the identifiers of the producer instructions of the source operands.
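The first (value-based) reuse scheme above can be sketched as follows. This is a minimal reconstruction for illustration, not the original hardware design; the class and method names are ours.

```python
# Sketch of a value-based reuse buffer in the spirit of Sodani and Sohi's
# first scheme: indexed by instruction address, each entry holds the source
# operand values and the result of the last execution of that instruction.

class ReuseBuffer:
    def __init__(self):
        self.table = {}  # pc -> (operand values, result)

    def lookup(self, pc, operands):
        """Return the cached result if this instruction was last executed
        with the same operand values; otherwise None (must execute)."""
        entry = self.table.get(pc)
        if entry is not None and entry[0] == operands:
            return entry[1]          # reuse hit: execution can be skipped
        return None

    def update(self, pc, operands, result):
        self.table[pc] = (operands, result)

rb = ReuseBuffer()
rb.update(pc=0x400100, operands=(3, 4), result=7)   # first execution of an add
assert rb.lookup(0x400100, (3, 4)) == 7             # same inputs: reusable
assert rb.lookup(0x400100, (3, 5)) is None          # different inputs: execute
```

The other two schemes would replace the stored operand values with register names and, in the third case, producer-instruction identifiers; only the value-based variant is sketched here.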

3. Contributions of this work

In this work, we are interested in understanding the data value reuse phenomenon and investigating the performance potential of this technique. We try to isolate the effect of data value reuse from other microarchitectural aspects by putting it into an ideal machine context, with either a limited or an unlimited instruction window. We are interested in evaluating the performance limits of data value reuse rather than focusing on any particular implementation. In this study, we consider two different families of data value reuse schemes:


• Instruction-level reuse. In this type of scheme, reuse is exploited for each individual dynamic instruction. This is the case of Richardson’s result cache, Oberman and Flynn’s proposals, and Sodani and Sohi’s reuse buffer.

• Trace-level reuse. In this type of scheme, reuse is exploited for sequences of dynamic instructions, also called traces in the literature. This is the case of the memoization scheme, Harbison’s value cache and, to some extent, the third scheme (reuse based on register names and dependences) of Sodani and Sohi.

Notice that data value reuse exploits the fact that an instruction or a trace has appeared in the past with the same input values. Therefore, the maximum instruction-level reuse can be evaluated by analyzing the dynamic instruction stream and counting how many times a static instruction appears with the same inputs as before. Evaluating trace-level reuse is not so easy, since it depends on how the dynamic instruction stream is split into traces. Since we are interested in the performance limits, we should consider every possible partition, which would be an intractable combinatorial problem. However, we will prove some properties that allow us to compute an upper bound of the maximum trace-level reuse, based on the instruction-level reuse.

Notice also that data value reuse is performed by caching the inputs and the results of previous dynamic instructions/traces. Therefore, a reuse requires the following operations: a) read the input operands; b) perform a table lookup; and c) write the output operands. Even though some of these tasks can be overlapped (e.g. (a) and (b) if the table is indexed by the instruction address), the whole reuse operation will take some time. That is, any reuse engine will have a given latency. To provide insight into this issue, we will consider an ideal reuse engine with parameterized latency, including the ideal case of null latency.


4. Evaluation methodology

For each reuse scheme we have considered two different scenarios: an ideal machine with either an infinite window or a limited window. For both scenarios, a machine with infinite resources and perfect branch prediction has been assumed. For the infinite window scenario, the execution time is only limited by data dependences among instructions, both through register and memory. For the limited window scenario, the execution order inside each sequence of W instructions, W being the instruction window size, is only limited by data dependences, whereas any pair of instructions at a distance greater than W are forced to be sequentially executed. Below, the approach to compute the IPC for each scenario is described. It is an extension of the approach proposed in [1].

4.1. IPC for an infinite window machine

The IPC is computed by analyzing the dynamic instruction stream. For each instruction, its completion time is determined as the maximum of the completion times of the producers of all its inputs plus its latency. The inputs of an instruction may be register or memory operands. Therefore, for each logical register and each memory location, the completion time of the latest instruction that has updated that storage location so far is kept in a table. The latencies of the instructions have been borrowed from the latencies of the Alpha 21164 instructions [3]. Once the whole dynamic instruction stream has been processed, the IPC is computed as the quotient between the number of dynamic instructions and the maximum completion time of any instruction.
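The infinite-window computation above can be sketched as a single pass over the dynamic stream. This is our simplified reconstruction of the model, not the authors' tool; instructions are represented here as (inputs, outputs, latency) tuples over named storage locations.

```python
# Sketch of the infinite-window IPC model: each instruction completes at
# max(completion time of its producers) + latency, and the completion time
# of the last writer of every register/memory location is kept in a table.

def ipc_infinite_window(stream):
    """stream: list of (inputs, outputs, latency) tuples, where inputs and
    outputs are storage-location names (registers or memory addresses)."""
    last_write = {}          # location -> completion time of its last writer
    max_completion = 0
    for inputs, outputs, latency in stream:
        ready = max((last_write.get(loc, 0) for loc in inputs), default=0)
        completion = ready + latency
        for loc in outputs:
            last_write[loc] = completion
        max_completion = max(max_completion, completion)
    return len(stream) / max_completion

# Two independent chains of two dependent 1-cycle instructions:
# critical path = 2 cycles, 4 instructions -> IPC = 2.0
stream = [((), ("r1",), 1), (("r1",), ("r2",), 1),
          ((), ("r3",), 1), (("r3",), ("r4",), 1)]
print(ipc_infinite_window(stream))  # 2.0
```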

4.2. IPC for a limited window machine

A limited instruction window of size W implies that a given instruction cannot start execution until the instruction that is W locations above it in the dynamic stream has completed. The process to compute the IPC for this scenario is an extension of the unlimited window approach. The extension consists of computing the graduation time of each instruction as the maximum


completion time of any previous instruction, including itself. Then, the completion time of a given instruction is computed as the maximum among the completion times of all the producers of its inputs and the graduation time of the instruction W locations above in the trace, plus the latency of the instruction. Notice that only the graduation times of the latest W instructions must be tracked.
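The graduation-time extension can be sketched as follows (a simplified model of ours; instructions are again (inputs, outputs, latency) tuples over named storage locations, and the graduation list is kept whole here for clarity even though only the last W entries are actually needed).

```python
# Sketch of the W-entry window model: instruction i also waits for the
# graduation time of instruction i - W, where graduation[i] is the maximum
# completion time of instructions 0..i.

def ipc_limited_window(stream, W):
    last_write = {}      # location -> completion time of its last writer
    graduation = []      # graduation[i] = max completion time of 0..i
    for i, (inputs, outputs, latency) in enumerate(stream):
        ready = max((last_write.get(loc, 0) for loc in inputs), default=0)
        if i >= W:                            # window constraint
            ready = max(ready, graduation[i - W])
        completion = ready + latency
        for loc in outputs:
            last_write[loc] = completion
        prev = graduation[-1] if graduation else 0
        graduation.append(max(prev, completion))
    return len(stream) / graduation[-1]

# Four independent 1-cycle instructions: W=1 serializes them (IPC 1.0),
# W=4 lets them all overlap (IPC 4.0).
indep = [((), (f"r{i}",), 1) for i in range(4)]
print(ipc_limited_window(indep, 1), ipc_limited_window(indep, 4))  # 1.0 4.0
```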

4.3. IPC for instruction-level reuse

Under this scenario, the IPC is computed by extending the mechanism described in sections 4.1 and 4.2, for the unlimited and limited window configurations respectively. The completion time of a non-reusable instruction is computed as explained in those sections, whereas the completion time of a reusable instruction is computed as the maximum of the completion times of all the producers of its inputs (an instruction cannot be reused until all its inputs are available) plus a reuse latency. As stated in section 3, the effect of different reuse latencies will be investigated. In any case, if the completion time of a reused instruction is higher than the completion time of the normal execution of that instruction, the latter is chosen. This is equivalent to assuming that an oracle determines the best approach for each instruction.
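The oracle choice amounts to taking the minimum of the two completion times per instruction; a minimal sketch (function and parameter names are ours):

```python
# Oracle model: a reusable instruction completes either normally
# (ready + latency) or via reuse (ready + reuse_latency), whichever is
# earlier; a non-reusable instruction always executes normally.

def completion_with_reuse(ready, latency, reusable, reuse_latency):
    normal = ready + latency
    if reusable:
        return min(normal, ready + reuse_latency)
    return normal

print(completion_with_reuse(ready=5, latency=3, reusable=True, reuse_latency=1))  # 6
```

Note that when the reuse latency exceeds the instruction's own latency, the oracle falls back to normal execution, so reuse can never hurt in this model.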

4.4. IPC for trace-level reuse

For instructions that do not belong to reusable traces, the IPC is computed as in sections 4.1 and 4.2, for the unlimited and limited instruction window scenarios respectively. For a reusable trace, the completion time of all instructions that produce an output1 is computed as the maximum of the completion times of all the producers of its inputs plus the reuse latency. Moreover, two ways to model the reuse latency have been analyzed. In one of them, the reuse latency is assumed to be a constant time per reuse operation. In the other, the reuse latency is assumed to be proportional to the number of inputs plus the number of outputs of the trace. Notice that this is reasonable since reusing a trace requires reading all its inputs and checking that they are the same as in a previous execution, and then updating all its outputs.

1. Notice that there are instructions in the trace that just produce values consumed by other instructions of the trace.


For the limited window scenario, notice that a reusable trace does not need to be fetched, and a single entry in the instruction window is allocated to represent the whole trace. Therefore, when computing the completion time of a given instruction, the graduation time of the instruction that is W locations above must be interpreted taking into account only those instructions that must be brought into the processor. In other words, trace-level reuse has the additional advantage of artificially increasing the effective instruction window size, as will be discussed and evaluated in section 6.2, since some instructions are not even fetched and do not use any instruction slot. If the completion time of an instruction in a reusable trace is higher than the completion time of the normal execution of that instruction, the latter is chosen.
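The two trace-reuse latency models of section 4.4 can be sketched as one helper (a sketch of ours, under the assumption stated in the text that K is the inverse of the read/write bandwidth of the reuse engine):

```python
# Trace-reuse latency model: the outputs of a reusable trace become available
# at max(completion time of the trace's external producers) plus either a
# constant latency or K * (number of inputs + number of outputs).

def trace_reuse_completion(input_ready_times, n_inputs, n_outputs,
                           K=None, constant=None):
    ready = max(input_ready_times, default=0)
    if K is not None:                 # latency proportional to trace I/O
        return ready + K * (n_inputs + n_outputs)
    return ready + constant           # fixed per-reuse latency

# A trace whose inputs are ready at cycle 7, with 10.9 inputs and 10.5
# outputs (the paper's averages) and K = 1/16, completes about 1.3 cycles
# after its inputs are ready.
print(trace_reuse_completion([7.0], 10.9, 10.5, K=1/16))
```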

4.5. Benchmarks

The benchmark programs are a subset of the SPEC95 benchmark suite, composed of both integer and FP codes: compress, gcc, go, ijpeg, li, perl and vortex from the integer suite, and applu, apsi, fpppp, hydro2d, su2cor, tomcatv and turb3d from the FP suite. The programs have been compiled with the DEC C and Fortran compilers with full optimization (“-non_shared -O5 -tune ev5 -migrate -ifo” for C codes and “-non_shared -O5 -tune ev5” for Fortran codes). Each program was run for 50 million instructions after skipping the first 25 million. The compiled programs have been instrumented with the Atom tool [11] and their dynamic traces have been processed in order to obtain the IPC for each configuration. Results are shown for individual programs and, in some cases, as the geometric mean of the integer programs, the FP programs or the whole set of benchmarks. Notice that in those cases where the range of values is very broad, a logarithmic scale is used in the graphs.


Figure 1: Instruction-level reusability. [Bar chart: percentage of reusable instructions (0–100%) per benchmark, with GMEAN_INT, GMEAN_FP and GMEAN summary bars.]

5. Instruction-level reuse

5.1. Limits of instruction-level reusability

In this work we are interested in the potential of data value reuse. Hence, we consider here the maximum instruction-level reuse that can be exploited. This is measured by keeping, for each static instruction, all the different input values of its previously executed instances in a table; a given dynamic instruction is reusable if its current inputs are the same as in a previous execution. The maximum instruction-level reuse is then the percentage of dynamic instructions that have previously appeared with the same input operands. This maximum percentage of reusable instructions will be referred to as the instruction-level reusability of a program. Figure 1 depicts it for the SPEC95 benchmark suite. We can observe that the instruction-level reusability is very high. For most programs it is higher than 90% of all dynamic instructions, and on average it is 87%. The reusability ranges from 53% to 99%, applu and hydro2d being the programs with the lowest and highest reusability respectively. We can also observe that there are no large differences


Figure 2: Speedup of instruction-level reuse for an infinite instruction window and null reuse latency. [Per-benchmark bars on a logarithmic scale (1–1000), with GMEAN_INT, GMEAN_FP and GMEAN summary bars.]

between integer and FP codes (91% and 83% instruction-level reusability respectively). We can conclude that instruction-level reuse is abundant in all types of programs and is therefore a phenomenon that deserves further research.
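The measurement just described can be sketched as a single pass over the dynamic stream (our reconstruction; the representation of instructions is illustrative):

```python
# Sketch of the reusability measurement of section 5.1: for each static
# instruction, the set of input-value tuples seen so far is kept; a dynamic
# instance is reusable if its inputs were already in that set.

def reusability(dynamic_stream):
    """dynamic_stream: list of (static_pc, input_values) pairs.
    Returns the fraction of reusable dynamic instructions."""
    seen = {}            # static_pc -> set of previously seen input tuples
    reusable = 0
    for pc, inputs in dynamic_stream:
        history = seen.setdefault(pc, set())
        if inputs in history:
            reusable += 1
        history.add(inputs)
    return reusable / len(dynamic_stream)

# The same add executed with inputs (1,2), (1,2), (3,4): only the second
# instance is reusable.
print(reusability([(0x40, (1, 2)), (0x40, (1, 2)), (0x40, (3, 4))]))
```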

5.2. Performance improvement

The ultimate figure in which we are interested is the effect of data value reuse on execution time. Figure 2 shows the speedup provided by instruction-level reuse when the reuse latency is assumed to be null. Notice that the speedups are very dependent on the particular benchmark. On average, the speedup is around 18.7, and it is slightly higher for FP than for integer programs. However, some programs benefit enormously from instruction-level reuse, such as turb3d and ijpeg, which show speedups of 2231 and 1072 respectively. On the other hand, there are also programs that hardly benefit from instruction-level reuse, such as applu and perl, which have a speedup of 1.5. The effect of instruction-level reuse does not only depend on the percentage of reusable instructions; it also strongly depends on the criticality of such instructions. In other words, if reusable instructions are concentrated on the critical path of the program, their benefit can be very


Figure 3: Theoretical versus actual speedup. [Per-benchmark paired bars on a logarithmic scale (1–1000), with GMEAN_INT, GMEAN_FP and GMEAN summary bars.]

high, whereas if they are located in the less critical sections of a program, their benefit can be negligible. Notice that if reusable instructions were uniformly distributed over the whole dynamic instruction stream, independently of the criticality of the instructions and their latency, the theoretical speedup achieved by instruction reuse would be given by 100 / (100 − r), where r is the percentage of reusable instructions. If reusable instructions tend to be highly critical, this theoretical speedup will be much lower than the actual one. On the other hand, if reusable instructions are not critical, the theoretical speedup will be much higher than the actual one. Figure 3 compares the theoretical and the actual speedup. On average, we can see that the actual speedup is slightly higher than the theoretical one (18.7 versus 13.9), which suggests that reuse is slightly more frequent for critical instructions than for non-critical ones. The same trend is observed for integer and FP programs. Analyzing individual programs, for fpppp, turb3d, ijpeg and li, reusable instructions tend to be highly critical, whereas for hydro2d, su2cor, tomcatv, compress, go and perl reusability tends to concentrate on non-critical instructions. For the other programs, applu, apsi, gcc and vortex, reusability and criticality seem to be quite independent phenomena.
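As a worked example of the formula, applying it to a single reusability figure:

```python
# Theoretical speedup under uniformly distributed reusable instructions:
# speedup = 100 / (100 - r), with r the percentage of reusable instructions.

def theoretical_speedup(r):
    return 100.0 / (100.0 - r)

# For the 87% average reusability reported in section 5.1:
print(round(theoretical_speedup(87), 1))  # 7.7
```

Note that this differs from the 13.9 average quoted above, presumably because that figure is a (geometric) mean of per-program theoretical speedups rather than the formula applied to the mean reusability.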


Figure 4: Speedup of instruction-level reuse for an infinite instruction window and a reuse latency varying from 0 to 4 cycles. [Average speedup bars (0–15) for integer and FP codes at each latency.]

Figure 4 shows the effect of the reuse latency on performance, for a latency varying from 0 to 4 cycles per reuse (only average figures for integer and FP codes are shown). Notice that the benefits of instruction-level reuse drastically drop when a 1-cycle latency is assumed and are practically negligible for higher latencies. This indicates that the instructions on the critical path are usually low-latency instructions, and thus the latency reduction achieved by instruction reuse is effective only if the reuse latency is very low. In the case of a limited instruction window, the speedup of instruction-level reuse is not so high but still significant. This is shown in figure 5 for a null reuse latency. On average it is 4.3, with a minor difference between the integer and FP means (3.8 and 4.8 respectively). Individual programs, and in particular FP codes, show a high variation: from 1.2 for fpppp to 33.7 for hydro2d. Notice also that in some cases, such as hydro2d, the benefits for a limited instruction window (speedup of 34) are even higher than for the infinite window scenario (speedup of 18 – see figure 2), although this is not the common case. Finally, the benefits of instruction-level reuse when the reuse latency is not null also drop dramatically (see figure 6), but they are slightly higher than those obtained for the infinite window configuration (see figure 4).

Figure 5: Speedup of instruction-level reuse for a limited instruction window (256 entries) and null reuse latency. [Per-benchmark bars on a logarithmic scale, with GMEAN_INT, GMEAN_FP and GMEAN summary bars.]

Figure 6: Speedup of instruction-level reuse for a 256-entry instruction window and a reuse latency varying from 0 to 4 cycles. [Average speedup bars (0–4) for integer and FP codes at each latency.]

6. Trace-level reuse

6.1. Limits of trace-level reusability

Reuse of traces is an attractive technique since a single reuse operation may skip the execution of a potentially long sequence of dynamic instructions. To evaluate the performance limits of this technique we should compute the maximum reuse that can be attained for any possible partition


of the dynamic instruction stream into traces. Since there is no constraint on the contents of each trace, the ways to partition a dynamic instruction stream into traces are practically unlimited, which precludes an exhaustive exploration of all of them. Given that each reuse operation has an associated latency (e.g. table lookup), the most effective schemes will be those that reuse maximum-length traces. That is, given a dynamic instruction stream that corresponds to the execution of a program, we are interested in identifying a set of reusable traces such that: a) the number of instructions included in those traces is maximum; and b) the number of traces is minimum. In other words, if a trace is reusable, it will be more effective to reuse the whole trace in a single reuse operation than to reuse parts of it separately. However, finding maximum-length reusable traces would still be a complex problem if all the possible partitions of a program into traces had to be explored.

We can, however, prove that if we consider just those traces that are formed by all maximum-length sequences of dynamically consecutive reusable instructions, we have an upper bound of the reusability that can be exploited by maximum-length traces (condition (a) above) and a lower bound of the number of traces required to exploit it (condition (b) above). This is supported by the theorems below. The performance obtained by assuming that such traces are reusable will provide an upper bound of the performance limits of trace reuse.

Theorem 1. Let T be a trace composed of the sequence of dynamic instructions ⟨i1, …, in⟩. If T is reusable, then ik is reusable for every k ∈ [1,n].

Proof. Refer to the enclosed appendix on page 21.

Theorem 2. Let T be a trace composed of the sequence of dynamic instructions ⟨i1, …, in⟩. If ik is reusable for every k ∈ [1,n], then T is not necessarily reusable.

Proof. Refer to the enclosed appendix on page 21.
Theorem 1 implies that the number of instructions whose execution can be avoided by any trace reuse scheme is limited by the amount of individual instructions that are reusable. Thus, we

Figure 7: Speedup of trace-level reuse for an infinite instruction window and null reuse latency. [Per-benchmark bars on a logarithmic scale (1–100), with GMEAN_INT, GMEAN_FP and GMEAN summary bars.]

compute an upper bound of the benefits of trace-level reuse by assuming that the amount of trace-level reusability is equal to the amount of instruction-level reusability, and the overhead of trace-level reuse is given by grouping reusable instructions into the minimum number of traces (i.e. assuming maximum-length traces). Theorem 2 states that this approach will result in an upper bound that may not be reached.

6.2. Performance improvement

The above theorems also imply that in a scenario where the reuse latency is null and the resources are infinite, instruction-level reuse will provide better performance than trace-level reuse, since the latter cannot increase the percentage of the dynamic stream where reuse can be exploited. This is shown in figure 7. We can see that the speedup of trace-level reuse is significant but much lower than that of instruction-level reuse (see figure 2) for an infinite window configuration. The speedups for integer and FP programs are similar, in the range of 1.3 to 10.3, except for ijpeg, which goes up to 378. However, the conclusions are quite different for the finite window scenario. In such a more realistic scenario, trace-level reuse may provide a very important additional advantage: it may avoid the fetching of instructions and may artificially increase the effective instruction window.

Figure 8: Speedup of trace-level reuse for a 256-entry instruction window and null reuse latency. [Per-benchmark bars on a logarithmic scale, with GMEAN_INT, GMEAN_FP and GMEAN summary bars.]

That is, a trace may be identified by its initial address. Once the control flow reaches the initial address, all the input values of the trace may be checked and, if they are the same as in a previous execution of the same trace, the whole trace can be reused without fetching or executing the remaining instructions. The speedup achieved by trace-level reuse when the reuse latency is null is shown in figure 8. Notice that the results are in general slightly higher than those of instruction-level reuse for a limited instruction window – 5.9 versus 4.3 on average – (see figure 5), and much higher than those of trace-level reuse for an infinite window – 5.9 versus 3.7 on average – (see figure 7). As pointed out before, this is due to the fact that an important contribution of value reuse is the artificial increase in effective window size. This can be observed by looking at figure 9, which shows the average trace size, and correlating it with figure 8. It can be observed that, in general, larger traces imply higher speedups, which can be attributed to their higher potential to artificially increase the effective instruction window size. Integer programs have a quite uniform trace size, ranging from 14.5 to 36.7 instructions, and they also exhibit a quite homogeneous speedup. On the other hand, some FP programs have very short traces, such as applu, apsi and fpppp, and they exhibit very low speedups, whereas the others have large traces (up to 203 instructions for hydro2d) and they also have high speedups.
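The trace-reuse test just described can be sketched as follows. This is an illustrative reconstruction of ours, not a proposed hardware organization; the class, names and machine-state representation are assumptions.

```python
# Sketch of trace-level reuse: a trace is identified by its starting address;
# when fetch reaches that address, the recorded input locations are compared
# against the values of a previous execution, and on a match the trace's
# outputs are written directly, skipping fetch and execution of the trace.

class TraceReuseTable:
    def __init__(self):
        self.table = {}   # start_pc -> (input locs/values, output locs/values)

    def record(self, start_pc, inputs, outputs):
        """inputs/outputs: dicts mapping storage location -> value."""
        self.table[start_pc] = (inputs, outputs)

    def try_reuse(self, start_pc, machine_state):
        entry = self.table.get(start_pc)
        if entry is None:
            return None
        inputs, outputs = entry
        if all(machine_state.get(loc) == v for loc, v in inputs.items()):
            machine_state.update(outputs)   # one reuse skips the whole trace
            return outputs
        return None

state = {"r1": 3, "r2": 4}
t = TraceReuseTable()
t.record(0x1000, inputs={"r1": 3, "r2": 4}, outputs={"r3": 7, "mem_0x80": 12})
assert t.try_reuse(0x1000, state) is not None
assert state["r3"] == 7
```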

Figure 9: Average trace size. [Per-benchmark bars on a logarithmic scale (1–100+ instructions), with GMEAN_INT, GMEAN_FP and GMEAN summary bars.]

It is also remarkable that trace-level reuse, unlike instruction-level reuse (shown in figure 6), provides significant speedups even if the reuse latency is not null. This is shown in figure 10.a, where it can be observed that the average speedup for a reuse latency ranging from one to four cycles is not much degraded when compared with the results for null latency. Notice that a trace is reused provided that all input values are the same as in a previous execution. Therefore, a trace reuse operation implies checking as many values as inputs the trace has. Moreover, as a consequence of a trace reuse, all the output values of the trace must be updated. Hence, it may be more realistic to assume that the reuse latency is proportional to the number of input and output values. That is, it is equal to a constant K multiplied by the number of input/output values, where K is the inverse of the read/write bandwidth of the reuse engine; for instance, K=1/16 implies that the reuse engine can read/write 16 values per cycle. Under this scenario, the speedup of trace-level reuse is shown in figure 10.b, where the X axis represents different values of K. It can be observed that the speedup of value reuse is still high, although it is significantly affected by the reuse latency. Notice that it is reasonable to assume that future microprocessors may have the capability to perform around 16 reads+writes per cycle, including register and memory values. In fact, current microprocessors such as the Alpha 21264 [4] can perform 14 reads+writes per cycle (8 register reads, 4 register writes and 2 memory references). Thus, looking

Figure 10: Speedup of trace-level reuse for a 256-entry instruction window and a reuse latency that a) varies from 0 to 4 cycles, b) is proportional to the number of inputs plus outputs of the trace. [Average speedup bars (0–5); in b) the X axis sweeps K over 0, 1/32, 1/16, 1/8, 1/4, 1/2 and 1.]

at the bar corresponding to K=1/16 in figure 10.b, we can conclude that a speedup of 2.9 can reasonably be expected from trace-level reuse. This suggests that the study of implementation aspects of trace-level reuse may be an interesting research area. On average, we have measured that the number of input values per trace is 10.9 (3.3 register values and 7.6 memory values), and the number of output values is 10.5 (5.4 register values and 5.1 memory values). Since the average number of instructions per trace is 23.2, each reused instruction requires 0.47 reads and 0.45 writes, which is significantly lower than the number of reads and writes required by the execution of an instruction. We can thus conclude that trace-level reuse also provides a significant reduction in data bandwidth requirements, and thereby reduces the pressure on the memory and register file ports.
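The per-instruction bandwidth figures above follow directly from the quoted trace averages:

```python
# Checking the bandwidth arithmetic: with 10.9 inputs and 10.5 outputs per
# trace, and 23.2 instructions per trace on average, each reused instruction
# needs roughly 0.47 reads and 0.45 writes.

inputs_per_trace, outputs_per_trace, instrs_per_trace = 10.9, 10.5, 23.2
reads_per_instr = inputs_per_trace / instrs_per_trace
writes_per_instr = outputs_per_trace / instrs_per_trace
print(round(reads_per_instr, 2), round(writes_per_instr, 2))   # 0.47 0.45
```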

7. Conclusions

We have presented an analysis of the performance limits of data value reuse. Two different reuse approaches have been considered: instruction-level reuse and trace-level reuse. We have shown that the former can exploit a higher degree of reuse than the latter, and may provide very large


speedups for an ideal machine. However, instruction-level reuse also has more overhead, since it requires more reuse operations. When the reuse latency is considered, trace-level reuse is much less degraded and may even outperform instruction-level reuse. For a realistic instruction window size, trace-level reuse provides very high speedups and outperforms instruction-level reuse even when the reuse latency is not considered. This is because trace-level reuse not only avoids the execution of some instructions, but also avoids fetching them and storing them in the instruction window. This latter effect can be regarded as an artificial increase of the effective instruction window size, which grows with the average trace size. Moreover, as in the infinite window scenario, trace-level reuse is less degraded than instruction-level reuse when the reuse latency is considered. A speedup in the range of 1.7-7.9 has been observed for a 256-entry instruction window when the reuse latency is assumed to be proportional to the number of inputs and outputs of a trace. We can conclude that trace-level reuse is a promising technique for future microarchitectures: it can provide significant speedups, especially for realistic instruction window configurations. The study of the implementation of this type of reuse may be an interesting area of research.

Acknowledgments

This work has been supported by the Spanish Ministry of Education under contract CYCIT 429/95 and by the grant AP96-52460503. The research described in this paper has been developed using the resources of the Center of Parallelism of Barcelona (CEPBA).

References

[1] T.M. Austin and G.S. Sohi, "Dynamic Dependence Analysis of Ordinary Programs", in Proc. of Int. Symp. on Computer Architecture, pp. 342-351, 1992.

[2] H. Abelson and G.J. Sussman, Structure and Interpretation of Computer Programs, McGraw Hill, New York, 1985.

[3] Digital Equipment Corporation, Alpha 21164 Microprocessor Hardware Reference Manual, 1995.

[4] L. Gwennap, "Digital 21264 Sets New Standard", Microprocessor Report, vol. 10, no. 14, Oct. 1996.

[5] S.H. Harbison, "An Architectural Alternative to Optimizing Compilers", in Proc. of Int. Conf. on Architectural Support for Programming Languages and Operating Systems, pp. 57-65, 1982.

[6] J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, Second Edition, Morgan Kaufmann Publishers, San Francisco, 1996.

[7] S.F. Oberman and M.J. Flynn, "On Division and Reciprocal Caches", Technical Report CSL-TR-95-666, Stanford University, 1995.

[8] S.E. Richardson, "Exploiting Trivial and Redundant Computations", in Proc. of Symp. on Computer Arithmetic, pp. 220-227, 1993.

[9] S.E. Richardson, "Caching Function Results: Faster Arithmetic by Avoiding Unnecessary Computation", Technical Report SMLI TR-92-1, Sun Microsystems Laboratories, 1992.

[10] A. Sodani and G.S. Sohi, "Dynamic Instruction Reuse", in Proc. of Int. Symp. on Computer Architecture, pp. 194-205, 1997.

[11] A. Srivastava and A. Eustace, "ATOM: A System for Building Customized Program Analysis Tools", in Proc. of the 1994 Conf. on Programming Language Design and Implementation, 1994.

[12] D.W. Wall, "Limits of Instruction-Level Parallelism", Technical Report WRL 93/6, Digital Western Research Laboratory, 1993.


A. Appendix

In this appendix we present the proofs of theorems 1 and 2. In fact, we prove a more general formulation, in which a trace is considered as a sequence of consecutive smaller traces; theorems 1 and 2 are the particular case in which every smaller trace consists of a single instruction.

Definitions

• An input of a trace T is a register, condition code or memory location that is read and has not been written earlier in that trace.
• An output of a trace T is a register, condition code or memory location that is written in that trace.
• Let IL(T) be the sequence of input storage locations of trace T. Notice that IL(T) is a sequence and not a set; its order is given by the order in which the inputs are read.
• Let OL(T) be the sequence of output storage locations of trace T, in the order in which the outputs are written.
• Let IV(T) be the sequence of input values of trace T, in the order in which they are read.
• Let OV(T) be the sequence of output values of trace T, in the order in which they are written.
• If A and B are two sequences, we say that A ⊆ B if A is a subsequence of B.
• Let A and B be two sequences. A ∪ B refers to any sequence composed of the elements of A and B, regardless of their order.

Different dynamic instances of the same trace are denoted by the same symbol with a superscript that corresponds to the dynamic execution order (T^i is the i-th dynamic instance of trace T). Notice that different instances of the same trace always have the same input/output registers but may have different input/output memory locations.
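The definitions above can be made concrete with a small Python sketch (a hypothetical illustration; the paper defines no implementation). A trace is modeled as a list of instructions, each naming the locations it reads and writes, and the sketch derives IL, OL, IV and OV by simulating the trace over an architectural state.

```python
# Compute the input/output sequences of a trace, following the definitions:
# an input is a location read before any write to it within the trace;
# an output is any location written within the trace.

def trace_io(trace, state):
    """Return (IL, OL, IV, OV) for `trace` executed against `state`.
    Each instruction is (reads, writes, fn), where fn maps the values
    read to the tuple of values written."""
    state = dict(state)
    IL, OL, IV, OV = [], [], [], []
    written = set()
    for reads, writes, fn in trace:
        in_vals = []
        for loc in reads:
            if loc not in written:   # read before any write in the trace: an input
                IL.append(loc)
                IV.append(state[loc])
            in_vals.append(state[loc])
        for loc, val in zip(writes, fn(*in_vals)):
            OL.append(loc)           # every written location is an output
            OV.append(val)
            state[loc] = val
            written.add(loc)
    return IL, OL, IV, OV

# Example trace: r3 = r1 + r2 ; r4 = r3 * 2.
# r3 is read after being written inside the trace, so it is not an input.
trace = [
    (("r1", "r2"), ("r3",), lambda a, b: (a + b,)),
    (("r3",), ("r4",), lambda a: (a * 2,)),
]
print(trace_io(trace, {"r1": 1, "r2": 2}))
# -> (['r1', 'r2'], ['r3', 'r4'], [1, 2], [3, 6])
```

The reuse test of Section 4 then amounts to comparing (IL, IV) of the current instance against those recorded for a previous instance of the same trace.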


If a trace T is reusable, then IL(T^i) = IL(T^j) and IV(T^i) = IV(T^j) for some j < i. This implies that OL(T^i) = OL(T^j) and OV(T^i) = OV(T^j): if the inputs are the same locations and hold the same values, then the outputs will also be the same locations and will hold the same values.

Theorem 3 (generalization of theorem 1). Let T be a trace composed of a sequence of traces ⟨t1, t2, ..., tn⟩. If T is reusable, then tk is reusable for every k ∈ [1, n].

Proof. If T^i is reusable, then IL(T^i) = IL(T^j) and IV(T^i) = IV(T^j) for some j < i. Notice that IL(t1^i) ⊆ IL(T^i) = IL(T^j), which implies that IL(t1^i) = IL(t1^j) and IV(t1^i) = IV(t1^j). Therefore, OL(t1^i) = OL(t1^j) and OV(t1^i) = OV(t1^j); that is, t1 is reusable. Next, notice that IL(t2^i) ⊆ IL(T^i) ∪ OL(t1^i) = IL(T^j) ∪ OL(t1^j). Since IL(t2^j) ⊆ IL(T^j) ∪ OL(t1^j) and OV(t1^i) = OV(t1^j), we have that IL(t2^i) = IL(t2^j) and IV(t2^i) = IV(t2^j). Thus, t2 is reusable, that is, OL(t2^i) = OL(t2^j) and OV(t2^i) = OV(t2^j). In general, tk+1 is reusable provided that tm is reusable for every m = 1..k. Notice that IL(tk+1^i) ⊆ IL(T^i) ∪ OL(tk^i) ∪ OL(tk-1^i) ∪ ... ∪ OL(t1^i) = IL(T^j) ∪ OL(tk^j) ∪ OL(tk-1^j) ∪ ... ∪ OL(t1^j). Thus, IL(tk+1^i) = IL(tk+1^j) and IV(tk+1^i) = IV(tk+1^j), which means that tk+1 is reusable.

Theorem 4 (generalization of theorem 2). Let T be a trace composed of the sequence of traces ⟨t1, t2, ..., tn⟩. If tk is reusable for every k ∈ [1, n], then T is not necessarily reusable.

Proof. If t1^i, t2^i, ..., tn^i are reusable, then each of them has inputs equal to those of some previous execution. That is, IV(t1^i) = IV(t1^j1), IV(t2^i) = IV(t2^j2), ..., IV(tn^i) = IV(tn^jn), but j1, j2, ..., jn may be different. Therefore, IV(T^i) ⊆ IV(t1^i) ∪ IV(t2^i) ∪ ... ∪ IV(tn^i) = IV(t1^j1) ∪ IV(t2^j2) ∪ ... ∪ IV(tn^jn), but IV(T^i) may be different from IV(T^j) for every j < i, and thus T^i may be non-reusable.
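The intuition behind Theorem 4 can be shown with a toy example in Python (the values are hypothetical, not taken from the paper). Each sub-trace of T = ⟨t1, t2⟩ matches some previous execution, but not the same one, so the whole trace has never been observed with its current input values:

```python
# Counterexample sketch for Theorem 4: per-sub-trace reuse succeeds,
# whole-trace reuse fails, because the matching previous executions differ.

# Input values (IV(t1), IV(t2)) recorded for two previous executions of T.
history = [
    ((1,), (10,)),  # execution j1
    ((2,), (20,)),  # execution j2
]

current = ((1,), (20,))  # current execution i

# t1 matches execution j1, t2 matches execution j2...
t1_reusable = any(iv1 == current[0] for iv1, _ in history)
t2_reusable = any(iv2 == current[1] for _, iv2 in history)

# ...but no single previous execution matches both at once.
T_reusable = any((iv1, iv2) == current for iv1, iv2 in history)

print(t1_reusable, t2_reusable, T_reusable)  # True True False
```

This is why a trace-level reuse engine must compare the whole input sequence of the trace against a single previous instance, rather than checking each constituent instruction independently.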
