Measuring Limits of Parallelism and Characterizing its Vulnerability to Resource Constraints

Lawrence Rauchwerger*
Center for Supercomputing R. & D., University of Illinois
422 CSRL, 1308 W. Main St., Urbana, IL 61801
[email protected]

Pradeep K. Dubey and Ravi Nair
T. J. Watson Research Center, I.B.M.
P.O. Box 704, Yorktown Heights, NY 10598
pradeep,[email protected]

Abstract

This paper addresses a two-fold question: whether there is enough parallelism in numeric and non-numeric workloads, such as the SPEC92 benchmark suite, under ideal conditions, disregarding any resource constraints, and, more importantly, whether a high ideal parallelism can be further characterized to assess its extractability with finite resources. We have designed and implemented an analysis tool that accepts as input a dynamic execution trace from an IBM RS/6000 environment and outputs a parallelized instruction trace (schedule) that could be executed on an abstract machine with unlimited functional units and various constraints on the rest of its resources, namely registers, stack and memory. We also analyze two different instruction scheduling policies: greedy and lazy. The paper further offers a characterization of ideal parallelism (obtainable on a machine with infinite resources) using a measure called slack to assess its sustainability with finite resources. Most of the new floating-point additions to SPEC92 (except ora) offer quite high limits of ideal (oracle) parallelism. But this very high parallelism is mostly associated with not very high average slack, suggesting the vulnerability of this parallelism to resource constraints. While register renaming is much more important for all the programs than stack or memory renaming, for quite a few programs renaming of stack and memory, in addition to the registers, offers a significant performance boost.

*Research supported by the IBM Corporation and by Army contract #DABT63-92-C-0033. This work is not necessarily representative of the positions or policies of the Army or the Government.

1 Introduction

Recent advances in VLSI technology are making it increasingly feasible to put multiple execution pipelines on the same chip. Such a processor design can also be used as the building block of massively parallel systems. Effective utilization of such a system is the key to realizing its sustainable performance potential. This implies a two-fold question: whether there is enough parallelism in numeric and non-numeric workloads under ideal conditions, disregarding any resource constraints, and, more importantly, whether a high ideal parallelism can be further characterized to assess its extractability with finite resources. This paper addresses both these questions. We have developed a tool which analyzes the influence of different architectural features and instruction scheduling policies. We also offer a characterization of ideal parallelism to assess its sustainability with finite resources.

1.1 Previous Work

There has been a lot of work recently in measuring the average instruction-level parallelism in sequential instruction traces [1, 2, 3, 4, 5, 6, 8]. Nicolau/Fisher [1] and Austin/Sohi [4] reported large amounts of parallelism using an ideal model with perfect branch prediction, unlimited register and memory renaming, and no other resource constraints. Wall [2] concluded that most of such ideal parallelism is lost when relying on imperfect branch prediction, even with very ambitious hardware/software support. Lam/Wilson [5] studied in more detail the effect of conditional branches and concluded that control dependence analysis coupled with speculative execution of multiple control flows is essential to boosting the speedup limits observed by Wall. Theobald et al. [6] analyzed the smoothability of the parallelism in instruction traces. A single-value smoothability measure attempts to quantify whether the ideal (infinite-resource) parallelism is smoothable enough to be extracted with finite resources.

1.2 This Research

This paper differs from the previous work in the following aspects:

- Together with the usual greedy (ASAP) scheduling policy, we introduce an alternate scheduling scheme, lazy (as-late-as-possible, or ALAP) scheduling. Schedules obtained with both these policies are compared, and new insights are offered.

- We present an alternate measure called slack to quantify the ability of high ideal parallelism to map itself onto finite resources. For example, lower slack in the high-parallelism regions implies that finite resource constraints are very likely to significantly increase the critical path and hence reduce the ideal parallelism.

- Almost all of the previous work is based on traces derived from MIPS and SPARC workstations. This paper tests the generality of the results using a different compiler and instruction set architecture, namely the IBM RS/6000.

- Most of the previous work [4, 5] is based on running around 100 million instructions from each trace. All the programs used in this paper have been run to completion (like Theobald et al. [6]).

- Finally, our set of benchmarks is based on the SPEC92 suite, whereas almost all of the previous work was based on the SPEC89 suite.

1.3 Paper Organization

The next section summarizes the theoretical background for the paper. Section 3 details the experimental setup and some of the important data dependence issues concerning loop unrolling and address calculations. The experimental results and their ramifications are presented in Section 4. The final section presents some conclusions and describes our plan for extending the analysis tools.

2 Dependence Analysis and Scheduling

2.1 Types of Dependences

An instruction can be defined as a function, with the source operands representing its domain and the result destinations representing its range. Thus, a typical instruction execution consists of reading the source operands, applying the specified function, and writing the result. Consider two instructions, Ii and Ij, where Ii precedes Ij in a purely sequential execution. Data dependence between these instructions results from overlapping ranges and/or domains. If the range of Ii is the same as that of Ij, then Ij is said to be output dependent on Ii. If the range of Ii is the same as the domain of Ij, and there is no instruction Ik, which follows Ii but precedes Ij, such that Ik is output dependent on Ii, then Ij is said to be essentially dependent on Ii. On the other hand, if the range of Ij is the same as the domain of Ii, Ij is said to be anti-dependent on Ii, since the dependence exists only if the order specified by the proper sequential execution is reversed. Finally, if the domain of Ij is the same as the domain of Ii, Ij is said to be input dependent on Ii. Input dependence does not imply a lack of concurrence unless there is a limit on the simultaneous distribution of an input to a number of different executing instructions.

Resource dependence among instructions Ii and Ij may exist if both correspond to the same function. Thus, data-independent instructions may not be executable concurrently if they must use the same resource for evaluation and the resource cannot be shared. An instruction is considered control dependent on some of the preceding conditional branch instructions. Thus, while data and resource independence of two instructions implies that the operations are executable concurrently, procedural dependence tells whether they even need to be executed.

To explore the nature of dependences further, consider an ideal machine with infinite resources. On such a machine, it should always be possible to remap the range of an instruction such that it does not overlap the domain of any of the preceding operations. Similarly, it should always be possible to remap the range of any instruction such that it does not overlap the range of any of its followers. Resource dependence is not an issue in such an ideal environment. Further, assume that whenever a conditional branch is encountered, resources can be replicated such that different branch paths can be explored concurrently until the control dependence is resolved, at which time the incorrect paths can be discarded. As a result, for an environment with infinite resources, control dependence does not imply a lack of concurrency either. In this environment only one type of dependence remains: essential dependence. Hence, all dependences other than essential data dependence can be considered different variations of resource dependence.
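To make the taxonomy concrete, the following small sketch (ours, not the authors' tool) classifies the dependences of a later instruction Ij on an earlier Ii from their domain and range sets; essential (flow) dependence additionally assumes no intervening redefinition of the shared resource, which the shadow structures of Section 3 track.

```python
def classify_dependences(i_dom, i_rng, j_dom, j_rng):
    """Classify dependences of later instruction j on earlier instruction i.

    i_dom/j_dom: sets of resources read (the instruction's domain).
    i_rng/j_rng: sets of resources written (the instruction's range).
    Returns the subset of {"flow", "anti", "output", "input"} that applies;
    "flow" corresponds to what the paper calls essential dependence,
    assuming no intervening redefinition of the shared resource.
    """
    kinds = set()
    if i_rng & j_dom:
        kinds.add("flow")    # i writes what j reads
    if i_dom & j_rng:
        kinds.add("anti")    # j writes what i reads
    if i_rng & j_rng:
        kinds.add("output")  # both write the same resource
    if i_dom & j_dom:
        kinds.add("input")   # both read the same resource
    return kinds

# Example: i: r3 <- r1 + r2 ; j: r1 <- r3 * r3
print(classify_dependences({"r1", "r2"}, {"r3"}, {"r3"}, {"r1"}))
# -> {'flow', 'anti'}
```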

2.2 Scheduling

A machine with no resource constraints, which has prior knowledge of the path through the program, demonstrates parallelism which we refer to as the oracle limit of parallelism [5]. A very high fraction of the dramatically high oracle limits of parallelism in many programs is lost under the resource constraints arising out of limited functional units, registers, and memory. In practice, machines aimed at exploiting instruction-level parallelism attempt to detect the above-mentioned dependences between instructions (either at compile time or at run time) and to schedule independent instructions for simultaneous execution. A given instruction, I, cannot be scheduled any sooner than the scheduling cycle in which all its source operands become available to be read as inputs. Also, I must be scheduled no later than the cycle preceding the one in which its result gets used as a source operand. A scheduling technique often referred to as the greedy (or as-soon-as-possible, i.e., ASAP) technique attempts to schedule instructions in the earliest possible time slot. Another scheduling technique, often referred to as the lazy (or as-late-as-possible, i.e., ALAP) technique, attempts to schedule instructions in the latest possible time slot. The difference between these two scheduling cycles is defined as the slack associated with instruction I.
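As an illustration of these two policies, the sketch below (our reconstruction, not the paper's tool) computes greedy (ASAP) and lazy (ALAP) levels, and hence slack, over a unit-latency dependence graph. The paper's lazy schedule places an instruction just before its first use, and instructions with unused results drift to the end of the program, so absolute values may differ slightly from this classic ALAP formulation.

```python
def asap_alap_slack(n, deps):
    """n instructions, numbered 0..n-1 in trace order.
    deps[j] = list of earlier instructions i < j that j depends on.
    Unit latency: an instruction executes one level after its latest producer.
    Returns (asap, alap, slack) level lists."""
    asap = [0] * n
    for j in range(n):
        for i in deps[j]:
            asap[j] = max(asap[j], asap[i] + 1)
    depth = max(asap, default=0)
    # ALAP: schedule as late as possible without stretching the critical path.
    alap = [depth] * n
    users = [[] for _ in range(n)]
    for j in range(n):
        for i in deps[j]:
            users[i].append(j)
    for i in range(n - 1, -1, -1):
        for j in users[i]:
            alap[i] = min(alap[i], alap[j] - 1)
    slack = [late - early for early, late in zip(asap, alap)]
    return asap, alap, slack

# Tiny example: instruction 2 reads the results of instructions 0 and 1.
print(asap_alap_slack(3, {0: [], 1: [], 2: [0, 1]}))
# -> ([0, 0, 1], [0, 0, 1], [0, 0, 0])
```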

2.3 Characterizing Program Parallelism

Program dependences can be represented by a partially ordered, directed, acyclic graph, referred to as the dynamic dependence graph [4], where the nodes of the graph represent specific instruction executions and the edges represent the inter-instruction dependences during the program execution. The height of the topologically sorted dynamic dependence graph (DDG) is often referred to as the critical path length of the application, which also provides a measure of the minimum number of steps needed to complete the program execution. A plot of the number of instructions per level in the topologically sorted DDG reveals the parallelism profile of the program. The average parallelism in a program is the average number of instructions per level in the parallelism profile.

The parallelism profile reveals details of the nature of program parallelism, such as whether or not the parallelism is evenly distributed. Consider a parallelism profile that is concentrated in short bursts of high parallelism, separated by long periods of little or no parallelism. A machine with limited resources would be forced to delay some of the instructions considered independent in the profile. These delayed instructions will reduce the average parallelism if the scheduling delays cause additional instructions to fall on the critical path, thereby increasing the length of the critical path. The profile does not offer much insight into the likelihood that a reduced set of resources leads to an increase in the critical path. The slack as defined above offers a measure of this vulnerability. For example, high parallelism along with high slack offers better potential for its extraction with a smaller set of resources, since higher slack implies more slots between the greedy and lazy schedules, and hence scheduling delays are better absorbed. Equivalently, higher slack implies better potential for balancing the resource utilization.

3 Experimental Framework

3.1 Simulation Methodology

Dynamic instruction traces (along with the data addresses) are generated by instrumenting the object code and then executing it on an IBM RS/6000. We have designed and implemented an analysis tool that accepts as input the aforementioned sequential trace and outputs a parallelized instruction trace (schedule) that could be executed on an abstract machine with unlimited functional units and various constraints on the rest of its resources, namely registers, stack and memory. We assume perfect branch prediction. Therefore, using the option of unlimited renaming of registers, stack, and memory, with unlimited functional units, the tool is capable of providing oracle limits of parallelism for the trace under consideration. For simplicity, we assume single-cycle latency for all instructions. Therefore, there is a direct correspondence between the level of an instruction in the dynamic dependence graph and its execution time slot (cycle). Also, for the purpose of this paper, our scope of concurrency detection (instruction window) is unlimited, i.e., limited only by the length of the program. Thus, our scheduler allows instructions arbitrarily far apart in the program trace to execute in parallel.
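Since every instruction carries a single-cycle latency, its DDG level is its time slot, and the parallelism profile is simply a histogram of levels. A minimal sketch of that bookkeeping (our illustration, not the tool's code):

```python
from collections import Counter

def parallelism_profile(levels):
    """levels[i] = DDG level (execution cycle) of instruction i.
    Returns (profile, critical_path_length, average_parallelism)."""
    profile = Counter(levels)           # instructions scheduled per level
    critical_path = max(levels) + 1     # minimum number of steps to completion
    avg = len(levels) / critical_path   # average instructions per level
    return profile, critical_path, avg

profile, cp, avg = parallelism_profile([0, 0, 1, 1, 1, 2])
print(cp, avg)   # 3 levels, average parallelism 2.0
```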


3.2 Scheduling Constraints

For every instruction read from the incoming trace, the parallelizer computes the earliest time when it could be executed, taking into account all data dependence relations that have to be satisfied, and places the instruction on the appropriate level of the critical path. Essential dependence is enforced by simply waiting for all the input sources (register, stack or memory). Anti and output dependences are enforced by also waiting for the destination resource. The definition time of an instruction is the time when all the source operands and all the destination registers or memory locations referenced by the instruction are ready (also known as the ready time). The first use time of an instruction is the time of first use of any of the resources being defined by the instruction (also known as the latest issue time).

The IBM RS/6000 instructions can potentially define several resources at the same time. For example, the store-update instruction (stu) simultaneously defines registers and memory locations. In order to keep an accurate first use and last use stamp we have to keep links between these simultaneously-defined resources. As the resources are being redefined we break the original links and establish new ones. When a resource that has a valid entry (instruction id) is being redefined and it has no links to other registers or memory locations, then that instruction is eligible to be emitted, i.e., scheduled on its corresponding level and removed from the data structures. If a resource is never redefined during the execution of the rest of the program, then it will wait for emission until the end of the program, when it will be scheduled according to its level in the DDG.

As indicated before, greedy schedules are obtained by scheduling instructions based on their definition level, and lazy schedules are obtained by scheduling instructions just prior to their first use time. Instructions whose results are never used will be scheduled at the end of the program. Note that there will be resources that seem only defined and never used, because we do not trace system code (only application and shared libraries are traced) and because we see only the code that follows the taken branches. For example, buffers that are written by the user and read by the system will appear to be defined but never used, and vice versa. Also note that the critical path can be ended only by instructions that cause a load/store transaction, i.e., a real data transaction. We assume that any activity that occurs in a program after the last result has been stored is useless from the data flow point of view. This means that all timestamps larger than the last data transaction (that can be found in the register shadow structures) cannot extend the critical path.
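The level computation described above can be sketched as follows (the class and function names are ours, not the tool's). Sources enforce essential dependence, destinations enforce output and anti dependences, and under unlimited renaming of a resource class the destination terms simply vanish for that class:

```python
class Shadow:
    """Timestamps for one resource (register, stack or memory word)."""
    def __init__(self):
        self.def_time = -1   # level of the last definition (-1: initial state)
        self.last_use = -1   # level of the last read

def definition_level(sources, dests, shadow):
    """Earliest DDG level for an instruction, per the rules in this section.
    sources/dests: resource names; shadow: dict name -> Shadow."""
    level = 0
    for r in sources:
        level = max(level, shadow[r].def_time + 1)   # essential (flow)
    for r in dests:
        level = max(level, shadow[r].def_time + 1)   # output
        level = max(level, shadow[r].last_use + 1)   # anti (one convention:
                                                     # write after last read)
    return level

def update_shadow(sources, dests, shadow, level):
    """Record the instruction's reads and writes at its assigned level."""
    for r in sources:
        shadow[r].last_use = max(shadow[r].last_use, level)
    for r in dests:
        shadow[r].def_time = level
```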

3.3 Data Structures

In order to enforce the dependence relations, we declare a shadow data structure for all visible system resources: registers, stack and memory. Each such storage resource has a shadow structure whose fields record:

- definition timestamp,
- first use timestamp,
- last use timestamp,
- instruction id (position in the sequential trace) of the defining instruction, and
- link to a co-defined register or memory.

There is a shadow structure for every register and every memory object in use. Since the shadow data structure records at least four integers worth of information, keeping this information for every byte (the smallest addressable memory object) would result in a 20-to-1 expansion of the data space. On the other hand, memory objects are seldom referred to in single bytes. Most references are to word (4-byte) or double-word (8-byte) objects. This observation helps us to reduce our space requirements significantly.

Also note that for lazy scheduling, execution of an instruction is delayed until its first use time. But, since we are parallelizing the instruction stream, we cannot know when the earliest consuming instruction will occur until we have either redefined (consumed) all resources that were defined by the instruction under consideration, or the program ends. This lack of foreknowledge of the lifetime of a resource, together with the unlimited scope of concurrency detection, makes memory management of the shadow data structures quite difficult.
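A sketch of this per-resource shadow record, with the emission rule applied on redefinition (our reconstruction; the word-granularity layout and the end-of-trace flush are omitted):

```python
from dataclasses import dataclass, field

@dataclass
class ShadowEntry:
    def_time: int = 0        # definition timestamp (level) -> greedy slot
    first_use: int = 0       # first use timestamp -> lazy slot
    last_use: int = 0        # last use timestamp
    insn_id: int = -1        # position of the defining instruction in the trace
    links: set = field(default_factory=set)  # co-defined resources, created by
                                             # multi-destination instructions
                                             # such as stu

def redefine(resource, new_id, new_def_time, shadow, emit):
    """When a resource is redefined, break its co-definition links; its
    previous defining instruction is emitted once no co-defined resource
    still carries a live link to it.  A resource that is never read keeps
    first_use unresolved until the end-of-trace flush."""
    e = shadow[resource]
    for other in e.links:
        shadow[other].links.discard(resource)
    if e.insn_id >= 0 and not e.links:
        emit(e.insn_id, greedy_slot=e.def_time, lazy_slot=e.first_use)
    shadow[resource] = ShadowEntry(def_time=new_def_time, insn_id=new_id)
```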

3.4 Specific Dependence Issues

Branches: As mentioned before, we assume perfect branch prediction. It should also be noted that the branch instructions in the IBM RS/6000 are not exclusively flow control instructions. They sometimes have dataflow components as well. For example, the branch and link instruction sets a link register simultaneously with the execution of the branch. The analysis in this paper includes such data dependences too.

Subroutine calls: We have not yet implemented inlining of subroutine calls, so it is possible that a long chain of calls will increase the length of the critical path.

Loop index variables: Parallelism in many loops is inhibited by dependences on the loop index variables, which are used only as loop counters. Identifying loop index variables in a dynamic (object code level) instruction trace is in general a non-trivial problem. However, the IBM RS/6000 compilers normally store the inner loop counter in a dedicated count register, making the identification easier. By ignoring the data dependences involving the count register in branches, we ensure that loop index manipulation does not affect the critical path, as the sketch below illustrates.
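For instance, with hypothetical trace-record fields (`id`, `sources`, `is_branch` are our invention), the flow edges through the count register can be filtered out when they originate from loop-closing branches:

```python
COUNT_REGISTER = "CTR"   # the RS/6000 dedicated loop-count register

def flow_edges(insn, last_writer, ignore_loop_counters=True):
    """Yield (producer_id, consumer_id) flow edges for one traced instruction.
    last_writer: dict resource -> id of its most recent defining instruction.
    Reads of the count register by branch instructions are skipped so that
    loop-index bookkeeping never lands on the critical path."""
    for r in insn.sources:
        if ignore_loop_counters and insn.is_branch and r == COUNT_REGISTER:
            continue
        if r in last_writer:
            yield last_writer[r], insn.id
```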

Address calculation: Parallelism in loops is also inhibited by trivial address calculations needed to access vectors with constant stride. Identification of such calculations typically involves detecting constant increments in every loop iteration. The IBM RS/6000 compiler makes this identification easier by using the store-update (stu) or load-update (lu) instructions in such cases. Our analysis can eliminate the influence of such address calculations on the critical path. Note that even when address calculations are ignored, instructions such as load/store update are still included in the total instruction count, because they also perform the load/store operation. Also note that there are other address calculations which cannot be easily deciphered and are ignored in the discussion to follow.

System calls: The analyzer allows the option of either optimistically ignoring the system call instructions, or pessimistically considering system calls as synchronization barriers and hence disallowing any instruction scheduling across system calls.

Libraries and system code: Code executed in all shared libraries is included in the trace. However, we do not trace code in the operating system kernel (system code).

4 Experimental Results

4.1 Benchmark Programs

As a vehicle for our experiments we have chosen the widely known and accepted SPEC92 benchmarks. They are a collection of scientific and integer codes from a wide variety of applications. To reduce the size of traces, they are often truncated or sampled. However, trace truncation fails to capture the long-term changes in program behavior, and trace sampling can potentially also miss important phases of program behavior. Hence, we have instead chosen to run all our benchmarks to completion. In order to keep traces and analysis time within practical limits we have reduced the input data sets of the benchmarks, either by using the short input files originally provided by the SPEC92 suite or a modified version.

4.2 Output Representation

The parallelism profiles discussed later in the paper have been represented through a piecewise-constant approximation, because the original data contains millions of points that can differ by orders of magnitude in value. We have passed the raw profile data through a filter that takes two parameters: an additive parameter and a multiplicative parameter. In order for a point to be plotted on the profile graph, it has to differ from the old point by an additive constant and by a certain percentage given by the multiplicative parameter. When a new point is found, it is plotted and the old reference is swapped with the new point so that no cumulative error occurs. By using both types of conditions we ensure fairness at both high and low values of the data. The data obtained with this technique contain far fewer points and hence are easier to plot. Also, the new data set is plotted with a step-function interpolation, which means that the continuous horizontal lines represent a continuous series of data points within a preset bandwidth (usually 15 additive units and 15% multiplicative units). Parallel horizontal lines represent quickly alternating values that are so close as to give the impression of a continuous line. Note that our filter is not very effective if the profile is varying quickly and widely; the filter compresses only portions of relative constancy.
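The filter admits a compact sketch (ours; the parameter names are illustrative): a point is kept only when it moves away from the last kept point by both the additive and the multiplicative margin, and the reference is then swapped so that no cumulative error occurs.

```python
def compress_profile(points, add=15, mult=0.15):
    """Piecewise-constant compression of a parallelism profile.
    points: iterable of (time, value); a point is kept only if it differs
    from the last *kept* value by more than `add` AND by more than `mult`
    (i.e., 15%) of that value."""
    kept = []
    ref = None
    for t, v in points:
        if ref is None or (abs(v - ref) > add and abs(v - ref) > mult * abs(ref)):
            kept.append((t, v))
            ref = v   # swap the reference so no cumulative error occurs
    return kept

print(compress_profile([(0, 100), (1, 104), (2, 300), (3, 310), (4, 20)]))
# -> [(0, 100), (2, 300), (4, 20)]
```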

4.3 Experiments Performed

Assuming an ideal machine with infinite functional units, we have studied the following machine models with different constraints on the renaming of registers, stack and memory:

1. unlimited renaming of registers, stack and memory, ignoring address calculation overhead,
2. unlimited renaming of registers, stack and memory,
3. unlimited register renaming only,
4. unlimited stack and memory renaming only, and
5. registers, stack and memory limited to those available in the IBM RS/6000 [7].

For all the benchmarks we have chosen the conservative option and imposed a global synchronization barrier for every occurrence of a system call. Two of the benchmarks, tomcatv and gcc, have also been analyzed with the optimistic system call assumption. These are referred to as tomcatv (ns) and gcc (ns), respectively. Table 1 contains the average parallelism results for the five architectural models mentioned above. This table is derived from the following data obtained for all benchmarks:

- A time profile of the lazy and greedy schedules, i.e., the number of instructions that are scheduled at every level of the critical path. Sample profiles for tomcatv are shown in Figures 1 through 4.

- A parallelism distribution graph, i.e., the distribution function of the different degrees of parallelism (instructions/cycle) approximated by a histogram. Sample histograms for tomcatv are shown in Figures 9 through 12.

Table 1 also lists the standard deviation of the degree of parallelism for the lazy and greedy scheduling techniques. Slack distributions have also been obtained for a subset of the benchmarks. Table 3 lists the average slack per instruction, along with the standard deviation of the slack distribution.

4.4 Results

The following important observations can be made.

4.4.1 Oracle Limit of Parallelism

It should be evident from the table that there is a significant amount of parallelism available in almost all the benchmarks. Most of the new floating-point additions to SPEC92 (except ora) offer a high oracle limit of parallelism, as was the case with the SPEC89 floating-point benchmarks. Note that for some benchmarks (such as eqntott) our parallelism limits with unlimited resources are smaller than those reported previously [4, 5, 6]. This is primarily due to our smaller input data sets.

4.4.2 Register Renaming

Register renaming is much more important for all the programs than stack or memory renaming. Comparing the results for Model-4 and Model-5, one would notice that for a limited number of registers, renaming of stack or memory has a negligible effect on performance. On the other hand, comparing the results of Model-2 and Model-3, it is clear that given unlimited register renaming, for some programs (such as fpppp, mdljdp2, mdljsp2, ora, su2cor) additional renaming of memory and stack is beneficial. Austin/Sohi [4] also observed the importance of register renaming while analyzing the SPEC89 traces.

4.4.3 Address Computation Overhead

As mentioned before, address arithmetic involving autoincrement operations on address registers in program loops can inhibit program parallelism, because it creates an induction on the address register. Comparison of Model-1 and Model-2 in Table 1 demonstrates the performance impact of such address-computation-related dependences. All benchmarks benefit significantly from the removal of this overhead. It would not be unrealistic to imagine a compiler that would precompute an address pattern well in advance and therefore create more parallelism. For this reason we also think that such a compiler would make less use of autoincrement addressing modes, such as the ones present in the IBM RS/6000 instruction set.

4.4.4 Greedy vs. Lazy Scheduling

As one might expect, both greedy and lazy schedules show relatively large variances with the very high parallelism offered in Model-1. Unlike the earliest-possible scheduling approach of the greedy algorithm, the lazy scheduling technique delays execution of an instruction until its first use. This is demonstrated in the detailed profiles shown in Figures 5 and 6 for tomcatv. The lazy schedules in Figure 6 clearly show more schedules of smaller width than the greedy schedules in Figure 5. Hence, one would expect lazy schedules to have less variance overall than the greedy schedules. But this is not supported by the data in Table 1. On further analysis, we found that since lazy scheduling postpones scheduling instructions until their first use time, many instructions whose results never get used are emitted together at the end. These include store operations to memory locations used by the program to write out the final results (outputs) of its computations. These appear as operations which define resources that never get used and hence are postponed by lazy scheduling until the very end. Earlier in this section, we have also explained that due to our inability to trace system code, there are many instructions which define resources that never seem to get used. These exceptionally large-width lazy schedules at the end of the critical path hurt the overall variance significantly. One might argue that the scheduling of such instructions could easily be distributed over the rest of the program. To confirm our hypothesis that the lazy schedule variance is unfairly hurt by this parallelism spike at the end of the critical path, we recomputed the variance of the lazy schedules for some of the programs after removing this spike. The results are given in Table 2. This optimization significantly reduces the standard deviation (and hence the variance) of lazy scheduling to below that of greedy scheduling.

[Table 1 and Table 2 data are not legible in this copy and are omitted.]

Table 1: In the architectural model description, U denotes unlimited, L limited, M memory, R registers, and S stack. In the first model, No A indicates that address computations have not been included; address computations are included in all other models. For example, U: R - L: M,S represents unlimited registers and limited memory and stack, i.e., register renaming only. A dash (-) indicates an incomplete experiment.

Table 2: The architectures are abbreviated in the same manner as for Table 1.

4.4.5 System Call Effects on Parallelism

As mentioned before, we have chosen the conservative option of disallowing any instruction scheduling across system calls. The detailed profiles shown in Figures 5 and 6 clearly demonstrate the damaging effect of system call barriers on average parallelism (note the repetitive synchronization barrier effects due to system calls in Figs. 5 and 6, represented by the regularly spaced vertical bars at the bottom of the profiles). To assess the effect of this pessimistic assumption, we reran two of the benchmarks, gcc and tomcatv, without any system call barriers. These are referred to as gcc (ns) and tomcatv (ns), respectively, in Table 1. As expected, given enough parallelism, both these benchmarks benefit significantly.

4.4.6 Parallelism Profiles of Benchmarks

Also note that the parallelism profiles of different benchmarks can be very different from each other. For example, the tomcatv profiles in Figures 1 through 4 are concentrated along certain degrees of parallelism quite uniformly through the program execution, whereas the hydro2d profile shown in Figure 7 consists of distinct phases with different distributions of parallelism. The parallelism profile of ora (Fig. 8) is even more interesting to look at, since it consists of different phases with significantly different average parallelism but very little variance in each phase.

4.4.7 Slack Distributions

Table 3 contains average slack measurements for a subset of the benchmarks presented in Table 1. The following observations can be inferred from the data:

- The very high parallelism of Model-1 is mostly associated with not very high average slack. For example, most of the sample slack distributions shown in Figs. 13-16 have 70 to 80 percent of the instructions with a minimum slack value of one. On the one hand, this high parallelism accompanied by low slack implies a very good utilization of the resources. On the other hand, as explained in the previous section, this also characterizes the vulnerability of this high parallelism to resource constraints. In other words, as the unlimited resources of Model-1 are made limited, most of the off-the-critical-path instructions would very likely be delayed enough to fall on the critical path, thereby reducing this high parallelism. Also note that going from the most relaxed Model-1 to the more constrained Model-2, tomcatv, with an average slack of only 35, suffers a significant loss of average parallelism, from 614 to 402, whereas tomcatv (ns), with an average slack of 488, suffers an only negligible loss of parallelism, from 781 to 773.

- Note that the average slack for a given model refers to the average distance between the definition and first use of an instruction. This distance is in terms of levels of the dynamic dependence graph corresponding to the resource constraints specified by the model. Hence one must exercise caution in comparing absolute average slack values across different models. As we move from the least restrictive Model-1 to less relaxed models, the first use time for some of the instructions on the critical path is further delayed due to the increased length of the critical path. This delayed first use results in an increased slack value for the instruction under consideration.

[Table 3 data are not legible in this copy and are omitted.]

Table 3: The architectures are abbreviated in the same manner as for Table 1.

5 Conclusion and Future Work

In this paper we have presented a methodology for analyzing the parallelism of real applications and have demonstrated its capabilities on programs from the new SPEC92 benchmark set. We have shown a method of extracting the dynamic dependence information from traces of programs executed on the IBM RS/6000, a system that has so far not been used as a test vehicle for similar studies and that presents most of the elements of a superscalar architecture. With the data dependence information thus extracted we have been able to concurrentize the sequentially executed instruction trace and reschedule it - off line - into a parallel instruction stream for a hypothetical superscalar machine with infinite functional units. We have introduced the use of two different scheduling techniques, the greedy (ASAP) scheduling policy and the lazy (ALAP) scheduling policy. Dynamic parallelism profiles of the newly constructed execution schedules were then presented for several representative benchmarks and a comparison was drawn between them:

- The profiles representing the lazy scheduling policy show lower variance than their greedy counterparts. This observation hints at the need for a different approach to scheduling than the greedy policies currently used, in order to obtain sustainable bounded parallelism on real machines with limited resources.

- Parallelism distribution and its average vary significantly between benchmarks. The dynamic behavior of the parallelism also varies significantly between benchmarks; we can observe the whole range of time variation, from almost constant to periodic through phasic behavior.

From our studies of dynamic parallelism under different architectural models that assume different degrees of storage availability (registers, stack/memory), we came to the conclusion that the number of registers is the most important resource for increasing parallelism. Only for a few programs is stack/memory renaming a real benefit. We have also shown that the removal of the data dependences introduced by address computations is very effective in extracting more concurrency. Under this assumption, all benchmarks demonstrated a sizeable increase of parallelism, which is essentially due to removing the effect of the induction on the address register when accessing a vector with constant stride with autoincrement instructions.

Finally, we have introduced slack as a possible measure of the vulnerability of parallelism to resource constraints and other limiting factors. We have observed that high parallelism is usually accompanied by low average slack in the codes studied, which implies good resource utilization. On the other hand, it also correlates with a high vulnerability to resource constraints. Low average slack will usually imply a significant loss of parallelism when we move from a relaxed to a constrained model. In the future we will be exploring the possibility of normalizing the slack metric into a generally valid measure of parallelism realizability.

Acknowledgements

The authors wish to thank Doru Marcusiu from the NCSA at the University of Illinois for the use of the facilities, and Nancy Amato as well as Daniel Lavery for the careful review of the manuscript.

References

[1] A. Nicolau and J. Fisher. Measuring the Parallelism Available for Very Long Instruction Word Architectures. IEEE Trans. on Computers, C-33(11), pp. 968-976, Nov. 1984.

[2] D. Wall. Limits of Instruction-Level Parallelism. In Proceedings of the 4th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, pp. 176-188, April 1991.

[3] M. Butler, T. Yeh, Y. Patt, M. Alsup, H. Scales, M. Shebanow. Single Instruction Stream Parallelism Is Greater Than Two. In Proceedings of the 18th Annual Intl. Symp. on Computer Architecture, pp. 276-286, May 1991.

[4] T. Austin and G. Sohi. Dynamic Dependency Analysis of Ordinary Programs. In Proceedings of the 19th Annual Intl. Symp. on Computer Architecture, pp. 342-351, May 1992.

[5] M. Lam and R. Wilson. Limits of Control Flow on Parallelism. In Proceedings of the 19th Annual Intl. Symp. on Computer Architecture, pp. 46-57, May 1992.

[6] K. Theobald, G. Gao, L. Hendren. On the Limits of Program Parallelism and its Smoothability. In Proceedings of the 25th Annual Intl. Symp. on Microarchitecture, pp. 10-19, Dec. 1992.

[7] IBM RISC System/6000 Technology, Publication No. SA23-2619, IBM Corp., Austin Communications Dept., Austin, TX 78758.

[8] M. Kumar. Measuring Parallelism in Computation-Intensive Scientific/Engineering Applications. IEEE Trans. on Computers, C-37(9), pp. 1088-1098, Sept. 1988.

[Figures 1-8: parallelism profiles (instructions per level vs. time in cycles) for the greedy and lazy scheduling policies under various renaming models; the recoverable panel titles include "Parallelism Profile for GREEDY Policy: TOMCATV", "Parallelism Profile for LAZY Policy: TOMCATV", "Parallelism Profile for GREEDY Policy: HYDRO2D", and "Parallelism Profile for GREEDY Policy: ORA". The plots themselves are not legible in this copy.]