
In Proceedings of the 2005 International Conference on Computer Design (CDES'05), pp. 168--174, (2005)

An Operand Status Based Instruction Steering Scheme for Clustered Architectures Yukinori Sato, Ken-ichi Suzuki, Tadao Nakamura Department of Computer and Mathematical Sciences Graduate School of Information Sciences, Tohoku University 6-6-01, Aramaki Aza Aoba, Aoba-ku, Sendai, 981-8579, Japan

Abstract: Clustered architectures, which process data within a localized processing element (PE), are one approach to sustaining performance as wire delays grow. The performance of a clustered architecture depends on its instruction steering scheme. Existing steering schemes insert inter-PE communications to balance the load among PEs. These insertions delay the execution of dependent instructions and degrade performance. In this paper, we propose a novel instruction steering scheme that gives priority to critical dependencies. Critical dependencies are identified by observing the status of the source operands of an instruction. We evaluate the proposed scheme and compare it with existing ones. The results show that the proposed scheme outperforms the existing schemes in terms of instructions per clock, because it reduces critical inter-PE communications while achieving superior load balance among the PEs.

Index Terms: Instruction steering, Clustered architecture, Data dependence-based design, Instruction-level parallelism

1. Introduction

Current superscalar processors execute independent instructions in parallel by extracting instruction-level parallelism (ILP) from a sequential program. ILP can improve the performance of many kinds of applications. However, detecting and handling ILP requires complex, global hardware structures [1]. As transistor scaling proceeds, wire delays are becoming more significant than gate delays. Slow wires therefore motivate architectures that exploit the physical locality obtained by processing data near where it is stored.

The characteristics of data dependencies make such localized processing possible. Dependencies among instructions prevent those instructions from executing in parallel. However, a set of instructions grouped according to its dependencies can be processed in a physically localized structure, without global structures. Moreover, sets of instructions that are independent of each other can be processed in parallel. Dependencies therefore let us exploit parallelism using localized structures.

Dependence-based clustered architectures exploit this property of dependencies [1], [2]. In clustered architectures, the global structures of superscalar processors, such as the issue queue, register file and functional units, are partitioned into simpler structures and arranged in smaller localized PEs (processing elements), called clusters in some papers [2], [3]. This partitioning makes the hardware simpler and its control and data paths faster, because the number of ports and entries of the partitioned structures can be reduced [1] and processing within a localized PE avoids large wire delays. When processing needs input values from different PEs, communication between PEs is required. This inter-PE communication still needs relatively long wires and causes extra latency.

The amount of inter-PE communication depends heavily on how a program is partitioned into sets of dependent instructions. Instruction steering mechanisms perform this partitioning dynamically. In the data dependence-based instruction steering scheme, instructions that are data dependent on an instruction are assigned to the same PE. Data-independent instructions, which need no communication, are most likely assigned to different PEs. This minimizes inter-PE communication while processing independent streams of dependent instructions in parallel on different PEs.

Load balance among PEs also affects the performance of clustered architectures. If the workload is not well balanced, the PEs may be less productive than they could be. Several instruction steering proposals that balance the workload have appeared in [3], [4]. They showed that the data dependence-based scheme with load balancing performs better than the simple data dependence-based scheme. However, the load balancing in this scheme is not adaptive. Whenever the workload imbalance exceeds a given threshold, the scheme always rebalances the workload, which increases the undesirable inter-PE communication that has a strong impact on performance.

In this paper, we propose an instruction steering scheme that gives priority to critical dependencies, that is, the dependencies that degrade performance more than the others when an inter-PE communication is inserted. To find the critical dependencies, the proposed scheme regards an instruction that has an unready source operand as having a critical dependency. The proposed scheme then tries to allocate critical dependencies to the same PE and non-critical dependencies to the least loaded PE, in order to reduce the performance degradation due to undesirable inter-PE communication.

The rest of this paper is organized as follows. In Section 2, we briefly give an overview of the clustered architecture and the existing load balancing method. We then propose the novel steering scheme, which reduces undesirable inter-PE communication while achieving superior load balance. Section 3 describes the experimental framework, the evaluation methodology and the results. Section 4 concludes this paper.

2. Clustered architectures

2.1. Microarchitecture overview

Fig. 1. The overview of the clustered architecture. (Diagram labels: Processor Front-End, steering, IQ, and PE 0 through PE n, each with a Reg File and FU.)

The microarchitecture of a clustered architecture is based on that of aggressive out-of-order issue superscalar processors. Fig. 1 shows the overview of the clustered architecture assumed in this paper. The processor front-end fetches multiple instructions at the same time and decodes them. The decoded instructions are delivered to the steering logic. The steering logic determines a PE for the

execution of each instruction. Next, the steered instruction is dispatched to the IQ (instruction queue), which tracks whether each operand is ready or unready. When the required operands are ready, the instruction is woken up and the corresponding resources of the steered PE are checked. If the resources are available, the instruction is selected, issued to the PE and executed.

Conventional unclustered architectures provide a single monolithic RegFile (register file) composed of a large number of registers, many read and write ports, and a full forwarding network to support wide instruction issue. A monolithic RegFile adapts easily to a wide range of applications, but it suffers from slow access and high energy dissipation. In clustered architectures, each PE has a piece of the monolithic RegFile. The partitioned RegFile design reduces the number of registers and the number of ports per RegFile. The full forwarding network is also limited to forwarding within a PE, which avoids the delay of the forwarding logic.

Most instruction set architectures define a single logical RegFile on the assumption that it is implemented as a single monolithic RegFile. Clustered architectures, however, provide physically partitioned RegFiles, and their register mapping mechanisms must map the logical registers onto the partitioned registers. An elegant mapping scheme that exploits the physically partitioned hardware is therefore desirable. One effective scheme is for each register in each RegFile to hold its own register instance, so that no register instance is duplicated across RegFiles. This scheme is referred to as non-consistent RegFiles [5] and allows smaller RegFiles with fewer ports. Because of these advantages, this paper focuses on clustered architectures with non-consistent RegFiles.

Fig. 2. The timing of the pipeline. (a) Assumed instruction pipeline organization (IF, ID, MAP, ISSUE, REG, EX, COMMIT). (b) Pipeline timing of an inter-PE register read. (c) Pipeline timing of an inter-PE result forwarding.

Fig. 2 (a) shows the pipeline stages assumed in this paper, which are based on those of the Alpha 21264 [6]. The processor front-end (IF and ID stages) has the same structure as the Alpha 21264. In the MAP stage, the renaming logic allocates an available free register to each destination register and records the mapping in the map table. To adapt the renaming mechanism to the non-consistent RegFiles configuration, we prepare a free register list for each PE. The selection of a free register list determines the PE where the instruction is executed. Therefore, the instruction steering mechanism has to select a PE before the register renaming process. The register renaming mechanism we adopted is also based on the mechanism used in the Alpha 21264 [6]. To avoid concentrating register pressure on particular PEs, we assume that the architectural registers that hold committed values are partitioned among the PEs and that each PE has the same number of architectural registers.

In the ISSUE stage, the instructions in the IQ are checked to see whether their operands are ready and their corresponding functional units are available. After the ISSUE stage, the operands are read from the RegFiles in the REG stage. When an operand is stored in the same PE, it can be read in a single cycle. When an operand is stored in a different PE, we assume the operand fetch requires one extra cycle for communication. Fig. 2 (b) shows the pipeline timing of an inter-PE communication due to a register read.

After the REG stage, the instruction is executed in the EX stage. If an instruction has not waited for the result of any other execution, it is executed with its given latency. If instead an instruction

has been waiting for the result of a preceding instruction that has not yet been written to the RegFiles, that result is forwarded through the forwarding logic. When the preceding instruction is executed in the same PE as the waiting instruction, the waiting instruction executes in the cycle after the preceding instruction executes. When the waiting instruction is allocated to a different PE from the preceding instruction, we assume that inter-PE forwarding takes 2 extra cycles, as shown in Fig. 2 (c). This delay model of inter-PE communication is the same as the delay model in [2].

In this paper, we assume all PEs share a single IQ, as shown in Fig. 1. This assumption lets us concentrate on the effects of the instruction steering scheme. If the IQ were partitioned among the PEs, we would also have to consider the utilization of each queue. The effect of a partitioned IQ will be evaluated in our future work.
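The delay model of this section can be illustrated with a minimal sketch. The function and constant names are hypothetical and are not taken from the authors' simulator; only the cycle counts (one extra cycle for an inter-PE register read, two extra cycles for inter-PE forwarding) come from the text. The second function additionally assumes a producer whose result is available for local forwarding one cycle after it executes.

```python
# Hypothetical sketch of the inter-PE delay model assumed in Section 2.1.
INTER_PE_REG_READ_EXTRA = 1  # extra cycle to read a register held by another PE
INTER_PE_FORWARD_EXTRA = 2   # extra cycles to forward a result from another PE


def extra_operand_cycles(consumer_pe: int, producer_pe: int,
                         value_in_regfile: bool) -> int:
    """Return the extra cycles a consumer pays to obtain one source operand."""
    if consumer_pe == producer_pe:
        return 0  # local register read or local forwarding: no penalty
    if value_in_regfile:
        return INTER_PE_REG_READ_EXTRA  # value already written back in the other PE
    return INTER_PE_FORWARD_EXTRA       # value still in flight, forwarded across PEs


def earliest_execute_cycle(producer_ex_cycle: int, consumer_pe: int,
                           producer_pe: int) -> int:
    """Earliest EX cycle of a consumer waiting on a forwarded result.

    Assumption: with local forwarding the consumer executes in the cycle
    right after the producer executes.
    """
    penalty = extra_operand_cycles(consumer_pe, producer_pe, value_in_regfile=False)
    return producer_ex_cycle + 1 + penalty
```

Under these assumptions, a consumer steered to the producer's PE can execute one cycle after the producer, while a consumer steered elsewhere executes three cycles after it.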

2.2. Instruction steering schemes

The performance of clustered architectures depends on the number of instructions executed in parallel and the amount of inter-PE communication. If too many instructions are steered to a particular PE, communication among PEs seldom occurs, but the instructions in the overloaded PE cause resource conflicts, which degrade performance. This is referred to as workload imbalance. Conversely, if instructions are steered to various PEs, the opportunity for parallel processing increases, but so does the amount of inter-PE communication, which also degrades performance. Hence, we have to design an instruction steering scheme that balances the workload among PEs against inter-PE communication.

The most basic instruction steering scheme is the dependence-based scheme [1] (also called RMBS, Register Mapping Based Scheme, in [4]). This scheme assigns dependent instructions to the same PE, which minimizes inter-PE communication penalties. However, this scheme suffers from load imbalance among the PEs.

Parcerisa and Gonzalez proposed several data dependence-based steering schemes with load balancing [2], [4]. They concluded that Advanced RMBS is the best of them. This scheme works in the following way: if there is a significant workload imbalance, the instruction is assigned to the least loaded PE. Otherwise, the scheme follows the dependence-based scheme. The heuristic they use to measure workload balance is DCOUNT, which is the product of the number of PEs and the difference between the number of instructions dispatched to a PE and the average number of instructions dispatched per PE. If the DCOUNT exceeds a given threshold, the situation is identified as a workload imbalance. As a result of instruction steering using DCOUNT, an equal number of instructions is dispatched to each PE on average. This scheme improves the workload balance, but it may increase the communication among dependent instructions.

Fig. 3. A dataflow graph and critical path. (Nodes i0 to i7 represent instructions, edges represent dependencies, and the shaded area marks the critical path.)

The effect of inter-PE communication is not constant. Some inter-PE communications among dependent instructions hurt performance more than others, because not every inter-PE communication lengthens the execution time equally. The execution time of a set of instructions is determined by the longest sequence of dependent instructions, called the critical path. The shaded area of Fig. 3 shows the critical path. If we insert an inter-PE communication into the critical path, the total execution time of the instructions increases. In the Advanced RMBS scheme, whenever the workload imbalance exceeds the threshold, an inter-PE communication takes place, even if it affects the critical path and degrades performance. In contrast, if we insert the communication between non-critical instructions, for example between i1 and i5 in Fig. 3, the total execution time is not affected.
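The DCOUNT heuristic and the Advanced RMBS decision described in this section can be sketched as follows. This is a minimal illustration, not the implementation from [2] or [4]: the function names and the threshold values are hypothetical, and the sketch assumes that an imbalance is declared when the largest per-PE DCOUNT exceeds the threshold.

```python
from typing import List, Optional


def dcount(dispatched: List[int]) -> List[int]:
    """Per-PE DCOUNT: number of PEs times (count on this PE minus the average).

    n * (c - total / n) equals n * c - total, so integers suffice.
    """
    n = len(dispatched)
    total = sum(dispatched)
    return [n * c - total for c in dispatched]


def advanced_rmbs_steer(dispatched: List[int], producer_pe: Optional[int],
                        threshold: int) -> int:
    """Pick a PE: rebalance on significant imbalance, else follow the producer."""
    d = dcount(dispatched)
    least_loaded = min(range(len(d)), key=d.__getitem__)
    # Assumption: imbalance means the largest DCOUNT exceeds the threshold.
    if max(d) > threshold or producer_pe is None:
        return least_loaded
    return producer_pe  # dependence-based choice
```

For example, with dispatch counts `[6, 2, 2, 2]` the DCOUNT values are `[12, -4, -4, -4]`; a hypothetical threshold of 8 sends the next instruction to PE 1 even if its producer sits on PE 0, which is exactly the kind of forced inter-PE communication discussed above.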

2.3. An operand status based instruction steering scheme

To avoid both undesirable inter-PE communication and load imbalance, we propose a novel instruction steering scheme for clustered architectures. We classify dependencies into critical and non-critical ones. If we allocate instructions with a critical dependency to the same PE, the performance loss due to undesirable inter-PE communication can be avoided. If we allocate a non-critical instruction to the least loaded PE, at the cost of an inter-PE communication, the performance loss due to workload imbalance can be avoided.

To classify dependencies into critical and non-critical ones, we observe the temporal locality of the production and consumption of a register instance. This is represented by the data dependent distance, which is the number of dynamically executed instructions between the producer and the consumer instruction. Franklin and Sohi characterized the data dependent distance [7] as follows: dependencies with a small data dependent distance tend to be used for computing intermediate values, while dependencies with a long data dependent distance come mainly from a small number of frequently used values, such as loop invariants.

In conventional superscalar processors, dependencies with a small distance are handled by the data forwarding logic. Since adding extra cycles to the forwarding logic significantly increases the execution time [8], architects try to design it without extra cycles. In contrast, the execution time is likely to be less sensitive to dependencies with a long distance. In clustered architectures, dependencies with a small distance are therefore likely to affect the execution time more than dependencies with a long distance. Accordingly, we treat dependencies with a small distance as critical dependencies, which compose the critical path, and dependencies with a long distance as non-critical dependencies, which lie off the critical path.
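As a small illustration of the data dependent distance, the sketch below measures it on a toy dynamic instruction trace. The trace format (a destination register plus a tuple of source registers per instruction) and the function name are assumptions made for this example, and the distance is taken as the difference of dynamic instruction indices, so that back-to-back producer and consumer have distance 1.

```python
from typing import Dict, List, Optional, Tuple

# One trace entry: (destination register or None, source registers).
Instr = Tuple[Optional[str], Tuple[str, ...]]


def dependence_distances(trace: List[Instr]) -> List[Tuple[int, int, int]]:
    """Return (producer index, consumer index, distance) for each register read."""
    last_writer: Dict[str, int] = {}
    result = []
    for idx, (dest, srcs) in enumerate(trace):
        for src in srcs:
            if src in last_writer:  # registers never written have no producer
                producer = last_writer[src]
                result.append((producer, idx, idx - producer))
        if dest is not None:
            last_writer[dest] = idx  # later readers depend on this instance
    return result


toy_trace: List[Instr] = [
    ("r1", ()),            # i0 produces r1
    ("r2", ("r1",)),       # i1 consumes r1 right away
    ("r3", ("r2",)),       # i2 consumes r2 right away
    ("r4", ("r1", "r3")),  # i3 reuses r1 later and consumes r3 right away
]
```

In this toy trace the reuse of `r1` by the last instruction has distance 3, while all other dependencies have distance 1 and would be treated as the more critical ones.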
The status of a source operand, which is either ready or unready, can be used as a proxy for the data dependent distance instead of measuring the exact distance. If an instruction has an operand with ready status, the preceding instructions it depends on have already been executed and their results have been written to the register. Therefore, an instruction with a ready operand can be regarded as having a long data dependent distance. If an instruction has an operand with unready status, the preceding instruction it depends on has not been executed yet. Such an instruction can be regarded as having a small data dependent distance. The status of the operands of an instruction is always monitored in conventional ILP processors in order to realize out-of-order execution. Thus, we can classify dependencies without extra hardware.

We propose a novel steering scheme based on the status of operands, named !ready (not ready). The proposed scheme gives priority to instructions with at least one unready operand in order to prevent undesirable inter-PE communications. Instructions with unready source operands are steered in the same way as in the dependence-based scheme, which is expected to avoid undesirable inter-PE communications. Instructions with only ready source operands are steered to the least loaded PE, which evens out the load across PEs. The heuristic used to measure the load balance is DCOUNT, the same as in the existing schemes.

To investigate the effect of the temporal locality of register usage, we introduce another scheme, named ready. The ready scheme gives priority to instructions with at least one ready operand and behaves in the opposite way to the !ready scheme. Instructions with ready source operands are steered in the same way as in the dependence-based scheme. Instructions with unready source operands are steered to the least loaded PE.

Table I shows the relationship between the status of the source operands of an instruction and its steered PE for each steering scheme. The first column and first row of each table denote the status of the two source operands of a consumer instruction, in1 and in2. The status of a source operand is classified as follows: there is no dependent operand (null), the dependent operand is not ready (!ready), or the operand is ready (ready). Therefore, there are 9 possible status combinations for 2 source operands. A steering scheme determines, for each combination, the PE in which the instruction is executed; the remaining cells of each table give that PE. When both source operands have priority (in1/in2 in the table), we give priority to source operand in1, because in the Alpha 21264 model the address generation of a store instruction, which uses the in1 operand, is performed before the actual store operation, which uses the in2 operand.

TABLE I. The status of the source operands of an instruction and its steered PE for the steering schemes.

(a) dependence-based scheme

in1 \ in2          null          !ready or ready
null               Min_dcount    in2
!ready or ready    in1           in1/in2

(b) !ready scheme

in1 \ in2    null          !ready     ready
null         Min_dcount    in2        Min_dcount
!ready       in1           in1/in2    in1
ready        Min_dcount    in2        Min_dcount

(c) ready scheme

in1 \ in2    null          !ready        ready
null         Min_dcount    Min_dcount    in2
!ready       Min_dcount    Min_dcount    in2
ready        in1           in1           in1/in2

Min_dcount indicates that the instruction is steered to the PE with the minimum DCOUNT. in1 and in2 indicate that the instruction is steered to the producer PE of source operand in1 and in2, respectively.
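The steering rules of Section 2.3 can be condensed into one decision function, sketched below. This is an illustrative reading of the prose, not the authors' hardware logic: the names `Status`, `steer` and the scheme strings are hypothetical, and the PE with the minimum DCOUNT is assumed to be computed elsewhere and passed in. An operand "has priority" if its status is in the set the scheme favors, and in1 wins when both operands have priority.

```python
from enum import Enum


class Status(Enum):
    NULL = "null"          # no producer to depend on
    NOT_READY = "!ready"   # producer has not delivered the value yet
    READY = "ready"        # value is already available


# Operand statuses that each scheme steers toward the producer PE.
PRIORITY = {
    "dep_based": {Status.NOT_READY, Status.READY},
    "!ready": {Status.NOT_READY},
    "ready": {Status.READY},
}


def steer(scheme: str, in1: Status, in2: Status,
          pe_in1: int, pe_in2: int, min_dcount_pe: int) -> int:
    """Return the PE for an instruction with source operands in1 and in2.

    pe_in1 and pe_in2 are the producer PEs of the operands; they are ignored
    when the corresponding status does not have priority under the scheme.
    """
    favored = PRIORITY[scheme]
    if in1 in favored:
        return pe_in1         # in1 has priority, also when both qualify
    if in2 in favored:
        return pe_in2
    return min_dcount_pe      # no prioritized operand: balance the load
```

For instance, under the !ready scheme an instruction with a ready in1 (produced on PE 1) and an unready in2 (produced on PE 2) goes to PE 2, whereas under the ready scheme the same instruction goes to PE 1.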

3. Experiments

3.1. Methodology

We developed a cycle-accurate, execution-driven simulator to evaluate the proposed steering scheme for clustered architectures. The baseline simulator is sim-alpha [9], an extension of the SimpleScalar tool set. Sim-alpha models the detailed microarchitecture of the Alpha 21264, which is itself a clustered architecture composed of dual integer PEs (clusters). We modified sim-alpha to model the clustered architecture with all the architectural features described in the previous section.

In this paper, we assume an 8-way processor, which has 8 homogeneous integer PEs. Each integer PE issues 1 instruction per cycle. We simulate integer RegFiles that have 16 physical registers per PE with a 64-entry IQ. This configuration is named reg16. We also simulate a configuration with 40 physical registers per PE, named reg40. The rest of the configuration follows that of the Alpha 21264. The main architectural parameters of the assumed architecture are shown in Table II. The latencies of the caches and the functional units are the same as in the Alpha 21264.

TABLE II. Main architectural parameters.

Fetch and decode          8 instructions
Branch predictor          Tournament branch predictor
IQ, FQ, LQ, SQ size       64
ROB size                  256
Functional Units          1 ALU + 1 MUL per int. PE
Issue width               1 inst per PE
The number of PEs         8 int + 1 fp
Icache                    128kB, 2way
Dcache                    128kB, 2way
RegFile size              16 / 40 per PE
Inter-PE communication    2 cycles

We selected a subset of 4 benchmarks (djpeg, cjpeg, rawdaudio, rawcaudio) from the MediaBench benchmark suite. This suite captures the main features of commercial multimedia applications, which tend to achieve high instruction-level parallelism. We also selected a subset of 7 benchmarks (gzip, vpr, gcc, mcf, perlbmk, bzip, twolf) from the SPEC2000 CPU int benchmark suite. All the benchmarks were compiled to Alpha binaries using Compaq's C compiler v6.5 on Tru64 UNIX V5.1B with the -O4 -fast -non_shared options. Each MediaBench program was executed to completion, and 100 million instructions of each SPEC2000int program were executed after fast-forwarding 1 billion instructions.

3.2. Results

Figs. 4 and 5 show the instructions per clock (IPC) for the four steering schemes described in Section 2.3. The IPC with !ready steering is higher than with the other schemes in most of the benchmarks. The average IPCs of the !ready steering for the reg16 and reg40 configurations in MediaBench are improved by 16.4% and 10.9%, respectively, compared with the Advanced RMBS scheme. The IPC of the dependence-based scheme in the reg16 configuration is superior to that of the Advanced RMBS scheme. The ready scheme is the worst of all. The reason is that dependencies with a small data dependent distance are more critical than those with a long one.

Fig. 4. The IPC with 16 registers per PE configuration. (Bar chart of IPC per benchmark for the dep_based, Advanced_RMBS, !ready and ready schemes, including the MediaBench and SPEC2000 averages.)

Fig. 5. The IPC with 40 registers per PE configuration. (Bar chart of IPC per benchmark for the dep_based, Advanced_RMBS, !ready and ready schemes, including the MediaBench and SPEC2000 averages.)

The differences between the reg16 and reg40 configurations are caused by unbalanced register pressure among PEs. In the non-consistent RegFiles configuration, the number of available free registers differs from PE to PE. If

the number of in-flight instructions in a particular PE increases, the register pressure of that PE increases and the PE runs short of free registers. When a PE runs out of free registers, we assume that the steering logic reallocates the instructions to the PE with the largest number of free registers. This reallocation causes extra inter-PE communication compared with the case in which the PE has enough registers. Therefore, the IPC of the reg16 configuration is lower than that of the reg40 configuration. The results of the reg16 (128 registers in total) and reg40 configurations also imply that the register pressure of clustered architectures with non-consistent RegFiles is larger than that of conventional 8-way, 64-entry superscalar processors, whose performance saturates at 128 registers in total [10]. To alleviate this register pressure, clustered architectures need more registers than conventional superscalar processors.

Fig. 6. The number of instructions stalled per cycle. (a) Instructions stalled due to inter-PE communication per cycle. (b) Instructions stalled due to resource conflicts per cycle. (Bar charts for the dep_based, Advanced_RMBS and !ready schemes, for reg16 and reg40, over the MediaBench and SPEC2000 averages.)

Fig. 6 (a) shows the number of instructions stalled per clock due to inter-PE communication. When inter-PE communication due to inter-PE data forwarding occurs, we assume that the instruction is stalled for 2 cycles. When communication due to an inter-PE register read occurs, we assume that the instruction is stalled for 1 cycle. The result shows that dependence-based steering has the fewest stalls per cycle. This is because the dependence-based scheme aims to minimize inter-PE communication without considering load balance. The proposed !ready scheme decreases the stalls by 27.3% in MediaBench reg16 and by 53.8% in SPEC2000 reg40 compared with the Advanced RMBS scheme. This means that the !ready scheme is effective in reducing inter-PE communication.

Load balance is another factor that strongly affects the IPC. Fig. 6 (b) shows the instructions stalled per clock due to resource conflicts. The frequency of resource conflicts indicates the load imbalance among PEs. When a woken-up instruction is not selected in a PE, we count it as a stall due to a resource conflict. The result shows that Advanced RMBS steering, which emphasizes workload balance, has the fewest resource conflicts among the three schemes. However, this intensive load balancing increases the inter-PE communication, as depicted in Fig. 6 (a), which degrades overall performance, as shown in Figs. 4 and 5. The !ready scheme, on the other hand, reduces the resource conflicts compared with the dependence-based scheme and reduces the communication compared with the Advanced RMBS scheme. Consequently, the !ready scheme is the best balanced steering scheme overall and achieves a reasonable performance gain.

4. Conclusions

In this paper, we have proposed a novel instruction steering scheme that exploits dependencies among instructions in order to prevent undesirable inter-PE communications. To steer instructions appropriately, we classify data dependencies among instructions into critical and non-critical ones. The metric used to classify dependencies is the source operand status of an instruction. We have evaluated the performance of the instruction steering schemes and found that the proposed scheme, which gives priority to critical dependencies, achieves higher IPC (instructions per clock) than the other schemes. The performance gain of the proposed scheme is due to the reduction of undesirable inter-PE communication and the avoidance of load imbalance among the PEs.

References

[1] S. Palacharla, N. P. Jouppi, and J. E. Smith, “Complexity-effective superscalar processors,” in Proceedings of the 24th annual international symposium on Computer architecture, 1997, pp. 206–218.
[2] J.-M. Parcerisa and A. Gonzalez, “Reducing wire delay penalty through value prediction,” in Proceedings of the 33rd annual international symposium on Microarchitecture, 2000, pp. 317–326.
[3] A. Aggarwal and M. Franklin, “An empirical study of the scalability aspects of instruction distribution algorithms for clustered processors,” in Proceedings of IEEE International Symposium on Performance Analysis of Systems and Software, 2001, pp. 172–179.
[4] R. Canal, J.-M. Parcerisa, and A. Gonzalez, “A cost-effective clustered architecture,” in Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques, 1999, pp. 160–168.
[5] J. Llosa, M. Valero, and E. Ayguade, “Non-consistent dual register files to reduce register pressure,” in Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture, 1995, pp. 22–31.
[6] R. E. Kessler, “The Alpha 21264 microprocessor,” IEEE Micro, vol. 19, no. 2, pp. 24–36, 1999.
[7] M. Franklin and G. S. Sohi, “Register traffic analysis for streamlining inter-operation communication in fine-grain parallel processors,” in Proceedings of the 25th annual international symposium on Microarchitecture, 1992, pp. 236–245.
[8] M. S. Hrishikesh, D. Burger, N. P. Jouppi, S. W. Keckler, K. I. Farkas, and P. Shivakumar, “The optimal logic depth per pipeline stage is 6 to 8 FO4 inverter delays,” in Proceedings of the 29th annual international symposium on Computer architecture, 2002, pp. 14–24.
[9] R. Desikan, D. Burger, and S. W. Keckler, “Measuring experimental error in microprocessor simulation,” in Proceedings of the 28th annual international symposium on Computer architecture, 2001, pp. 266–277.
[10] K. Farkas, N. Jouppi, and P. Chow, “Register file design considerations in dynamically scheduled processors,” in High-Performance Computer Architecture, 1995, pp. 40–51.
