2) Sparc2 was programmed in C, a high-level language, whereas MARS was programmed in assembly language, which made full use of the hardware features of the MARS processors, such as bit-field extraction and manipulation, table access, and interprocessor communication support.

VI. CONCLUSION

The key contribution of this short note is the partitioning of the Goldberg-Tarjan network flow algorithm for pipelined execution on a message-passing multicomputer. The MARS multicomputer is the platform used in the implementation. Although 15 processors were available in MARS, the granularity of the algorithm needed to maintain maximal data locality permitted the use of only six processors. A larger number of processors would have required data duplication and more interprocessor communication. Even in the present implementation, copies of the vertex labels are maintained in tables in three PE memories. This forced the algorithm to be partitioned into two phases, thus reducing the efficiency. A serial implementation of the entire algorithm on a single PE of the MARS system was impossible because of the limited size of the program memory within each PE. Although the exact CPU time of this implementation could not be measured, a six-processor version at 5 MHz yields the same order of performance as a Sparc2 workstation at 40 MHz.

Many data partitioning-based implementations of the same algorithm exist [5], [7]. Ours is the first attempt at an algorithm-based partitioning approach. The pipelined implementation can be made even faster by using a hybrid pipelined-parallel approach that uses many PE's in parallel within the pipelined stages. Other methods of combining data partitioning and pipelining may improve the efficiency of either implementation. Further studies may investigate such combinations.

REFERENCES

[1] P. Agrawal and W. J. Dally, "A hardware logic simulation system," IEEE Trans. Comput.-Aided Design of Circuits Syst., vol. 9, pp. 19-29, Jan. 1990.
[2] F. Alizadeh and A. V. Goldberg, "Experiments with the push-relabel method for the maximum flow problem on a connection machine," DIMACS Implementation Challenge Workshop: Network Flows and Matching, Tech. Rep. 92-4, pp. 56-71, Sept. 1991.
[3] R. J. Anderson and J. C. Setubal, "Parallel and sequential implementations of maximum-flow algorithms," DIMACS Implementation Challenge Workshop: Network Flows and Matching, Tech. Rep. 92-4, pp. 17-41, Sept. 1991.
[4] J. Cheriyan and S. N. Maheshwari, "Analysis of preflow push algorithms for maximum network flow," SIAM J. Comput., vol. 18, pp. 1057-1086, 1989.
[5] A. V. Goldberg and R. E. Tarjan, "A new approach to the maximum flow problem," Symp. Theory of Computing, May 1986, pp. 136-146.
[6] A. V. Goldberg, "Efficient graph algorithms for sequential and parallel computers," Tech. Rep. TR-374, Lab. for Comput. Sci., Massachusetts Inst. of Technol., Cambridge, MA, 1987.
[7] ___, "Processor-efficient implementation of a network flow algorithm," Tech. Rep. STAN-CS-90-1301, Dept. of Comput. Sci., Stanford Univ., Palo Alto, CA, 1990.
[8] A. V. Karzanov, "Determining the maximum flow in a network by the method of preflows," Soviet Math. Dokl., vol. 15, pp. 434-437, 1974.
[9] E. L. Lawler, Combinatorial Optimization: Networks and Matroids. New York: Holt, Rinehart and Winston, 1976.
[10] K. J. Singh, A. R. Wang, R. K. Brayton, and A. L. Sangiovanni-Vincentelli, "Timing optimization in combinational circuits," IEEE Int. Conf. Comput.-Aided Design, 1988, pp. 282-285.
Pipelining and Bypassing in a VLIW Processor

Arthur Abnous and Nader Bagherzadeh

Abstract: This short note describes issues involved in the bypassing mechanism for a very long instruction word (VLIW) processor and its relation to the pipeline structure of the processor. We first describe the pipeline structure of our processor, analyze its performance, and compare it to typical RISC-style pipeline structures in the context of a processor with multiple functional units. Next, we study the performance effects of various bypassing schemes in terms of their effectiveness in resolving pipeline data hazards and their effect on the processor cycle time.

Index Terms: Pipeline, bypassing, very long instruction word (VLIW), computer architecture, RISC, performance evaluation
I. INTRODUCTION

The very long instruction word (VLIW) architecture is considered to be one of the promising methods of increasing performance beyond standard RISC architectures. Whereas RISC architectures take advantage of temporal parallelism (by using pipelined functional units), VLIW architectures can also take advantage of spatial parallelism by using multiple functional units to execute several operations concurrently. Similar to superscalar architectures, the VLIW architecture can reduce the clock per instruction (CPI) factor by executing several operations concurrently. Superscalar processors schedule the execution order of the operations at run-time; they demand more hardware support in the architecture to manage synchronization among concurrent operations. Since VLIW machines schedule operations at compile-time (allowing global optimizations), they tend to have relatively simple control paths, following the RISC methodology. Recent advances in compiler optimization techniques [1] have produced new fine-grain compilation techniques that compute a static parallel schedule from an originally sequential program. Percolation Scheduling is one of these promising techniques [2].

In recent years, there have been several efforts to design and develop VLIW architectures. Multiflow's Trace was one of the pioneer architectures in this field; its design was expandable to support 1024-bit instructions by concatenating 256-bit processor boards [3]. VLIW ideas have also surfaced in the designs of Cydrome's Cydra-5 [4], iWARP [5], and LIFE [6].

In order to resolve pipeline data hazards, RISC processors have typically used a technique known as bypassing (or forwarding) [11]. The function of the bypassing hardware is to resolve data hazards that arise when an instruction needs the results of previous instructions in the pipeline that have not yet been written to the register file by the time the current instruction reads its source operands from the register file. The bypassing hardware routes the required operands from the temporary pipeline registers to the inputs of the ALU.

In this short note, we present some of the design issues that have arisen in the pipeline structure and bypassing mechanism in the context of the VLIW processor designed at the University of California, Irvine [7], [8]. Section II presents an overview of our processor, which is named VIPER (VLIW Integer Processor).

Manuscript received March 11, 1992; revised March 5, 1993. The authors are with the Department of Electrical and Computer Engineering, University of California, Irvine, CA 92717, USA; e-mail: [email protected], [email protected]. IEEE Log Number 9401191.
VIPER is based on a Percolation Scheduling compiler [9]. In Section III, we present the pipeline structure of VIPER and discuss the relevant design issues. In Section IV, we discuss and evaluate various interconnection topologies for the bypassing network.
II. GENERAL ORGANIZATION OF VIPER
Fig. 1 shows the block diagram of VIPER. The processor is pipelined and contains four integer functional units connected through a shared multiport register file that has thirty-two 32-bit registers. The register file provides two read ports and one write port to each functional unit. The ports are multiplexed for read and write operations; thus, the register file has eight ports. Hardware execution units within the functional units can be classified into three types:

1) Arithmetic/Logic Unit (ALU),
2) Load/Store Unit (LSU), and
3) Control Transfer Unit (CTU).

The hardware configuration of VIPER was decided based on numerous simulations that analyzed the efficiency of different configurations and organizational strategies [7]. Implementation costs imposed another factor on the configuration of the processor. One of the important considerations was the utilization of the pins devoted to the wide instruction bus. In each cycle, the register file can perform eight read and four write transactions (two reads and one write per functional unit). Register R0 is hardwired to contain 0 at all times. Writing to R0 is allowed, but has no effect on its content.

Our initial design goal was to have 64 registers, but it was finally decided to have 32 registers. One reason for this decision was that the silicon area required for 64 registers proved to be quite large after preliminary layout efforts. Another reason was that the extra number of bits required to address 64 registers presented a severe problem for our goal of having a 32-bit format for each operation. Also, more registers could result in a slower register file, because of the increased load on the bit lines. Having 32 registers has not presented a performance penalty for our benchmarks so far; for larger benchmarks, however, a larger number of registers might be a better choice. This is because the PS compiler uses register renaming extensively in order to expose parallelism.

VIPER has some of the typical attributes of a RISC processor. It has a simple instruction set that is designed with efficient pipelining and decoding in mind. All operations follow the register-to-register execution model. Data memory is accessed with explicit load/store operations. All instructions have a fixed size, and there are only a few instruction formats. Arithmetic, logic, and shift operations are executed by all functional units. Load/store operations are executed by FU2 and FU3, which have access to the data cache subsystem through their LSU's. Load/store operations use the register indirect addressing mode instead of the more typical displacement addressing mode; this is explained in Section III. Control transfer operations are executed by FU0 and FU1, which interact with the program counter through their CTU's. VIPER has hardware support for the execution of three-way branch operations. Also, to increase execution throughput, operations following a branch are executed conditionally: They are issued and executed, but whether they are allowed to complete and write their results depends on the outcome of the branch.

Fig. 1 is drawn with the floorplan of the processor in mind. The organization of the processor was developed to facilitate an efficient very large scale integration (VLSI) layout. A key aspect was an emphasis on locality of communication between different hardware blocks of the processor. The floorplan of the processor is shown in Fig. 2. The relative sizes of various hardware blocks were estimated from preliminary layout efforts.
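The encoding pressure mentioned above is easy to quantify. The following back-of-the-envelope sketch is ours, not VIPER's actual instruction encoding; the three-register-field layout is an assumption made only for illustration:

```python
import math

# Sketch (assumed three-operand, register-to-register format): how many
# bits remain for the opcode and modifiers in a 32-bit operation once the
# register specifiers are accounted for.
def opcode_bits_left(num_regs: int, word_bits: int = 32, reg_fields: int = 3) -> int:
    spec_bits = math.ceil(math.log2(num_regs))  # bits per register specifier
    return word_bits - reg_fields * spec_bits   # bits left for opcode/modifiers

print(opcode_bits_left(32))  # 17 bits remain with 5-bit specifiers
print(opcode_bits_left(64))  # 14 bits remain with 6-bit specifiers
```

Losing three bits per operation is significant when the opcode space must also encode three-way branches and per-unit operation formats, which is consistent with the designers' choice of 32 registers.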
Fig. 1. Block diagram of VIPER.

A generic functional unit has been designed and fabricated in the form of a stand-alone RISC microprocessor [10]. The data path of VIPER comprises a large
portion of the floorplan. The floorplan of the chip is centered around the register file and is symmetric with respect to it. Shaded areas in Fig. 2 highlight the ith bit-slice of different data path blocks. A register file bit-slice is twice as wide as a functional unit bit-slice. At the top and bottom of the register file are routing areas that are used to connect the register file bit-slices to the functional unit bit-slices. Notice that in this figure, the CTU's of FU0 and FU1 and the PC unit are all combined into a single hardware block and are collectively called the Control Transfer Unit.

III. PIPELINE STRUCTURE AND BYPASSING

Fig. 4 shows the pipeline structure typically used in RISC processors. To resolve possible data hazards caused by the delay between the ID and WB stages, RISC processors use bypassing [11]. The cost of bypassing in a RISC processor is minor compared to the number of cycles that it saves. It requires 2d comparators (where d is the number of pipeline stages between ID and WB) and the necessary pathways from the pipeline registers to the inputs of the execution unit corresponding to the EX stage. If the bypassing circuitry is carefully designed, its cycle time overhead can be relatively small in a RISC processor.

In a VLIW processor with n functional units, however, bypassing becomes a costly function to implement. There are two factors contributing to this cost. One factor is the number of comparators that are required: in a VLIW processor with n functional units, the number of required comparators is 2dn^2. The bypassing comparators are usually not in the critical path of the execution cycle, because comparison of register addresses can start right at the beginning of the ID stage and progress in parallel with the register file write and read operations, which are slower than the comparison required for bypassing. The comparators present an area penalty; but given the circuit densities available in current 1.0 μm technologies, they might not present a severe constraint. On the other hand, the buses required to bypass operands not only present global layout problems but also are likely to be on the critical path, for two reasons.
Fig. 4. Pipeline structure of typical RISC processors (IF: Instruction Fetch; ID: Instruction Decode; EX: Execute; MEM: Memory Access; WB: Write Back).

Fig. 5. Pipeline structure of VIPER (IF: Instruction Fetch; ID: Instruction Decode; EX: Execute; WB: Write Back).

Fig. 2. Floorplan of VIPER.

Fig. 3. The bypassing path for a single functional unit.
1) Bypassing of operands cannot start until these operands are available. This means that the bypassing hardware has to wait until the EX and ID stages have generated their results. Then, given the outcome of the bypassing comparators, the output of the EX stage can be bypassed back to its input.

2) In a processor with a single functional unit (e.g., a RISC processor), the bypassing pathways are within the data path of the processor and are distributed among the bit-slices of the data path; in other words, they are local to each bit-slice and do not present a major capacitive load (see Fig. 3). In a processor with multiple functional units (e.g., a VLIW processor), however, the bypassing pathways are global buses connecting multiple data paths and present a heavy capacitive load.
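To make the comparator arithmetic above concrete, here is a minimal sketch of the 2dn^2 count; the function and the example configurations are ours, chosen to mirror the scalar RISC and four-unit VLIW cases discussed in the text:

```python
# Comparator count for a complete bypassing network, per the 2*d*n^2
# formula above: d = pipeline stages between ID and WB, n = number of
# functional units, and each operation has two source operands to check.
def bypass_comparators(n_fus: int, d_stages: int) -> int:
    return 2 * d_stages * n_fus ** 2

print(bypass_comparators(1, 2))  # 4:  scalar RISC, five-stage pipeline
print(bypass_comparators(4, 2))  # 64: four-FU VLIW, five-stage pipeline
print(bypass_comparators(4, 1))  # 32: four-FU VLIW, four-stage pipeline
```

The last two lines preview the 50% comparator saving that motivates the four-stage pipeline introduced next.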
The frequency of read-after-write (RAW) hazards can be reduced by decreasing the delay between the ID and WB stages. If, instead of using the displacement addressing mode for load/store operations, the register indirect addressing mode is used, then memory access can take place during the EX stage, because the effective memory address is available by the end of the ID stage. Thus, the MEM stage can be removed, and the delay between the ID and WB stages is reduced to a single cycle. The resulting pipeline is shown in Fig. 5. This scheme has the following advantages.

1) The frequency of RAW hazards will decrease. This can improve the performance of the processor.

2) The number of comparators is reduced by 50%. This will substantially reduce the area taken by the address comparison circuitry. Even if the processor does not use complete global bypassing, but detects RAW hazards and resolves them by stalling the execution pipeline, the savings in the number of required comparators is significant.

3) The number of bypassing buses that are needed to connect different functional units decreases by half. This will somewhat relax the layout difficulties imposed by a complete global bypassing network. It can also make interfunctional-unit bypassing more feasible.

Changing the addressing mode from displacement to the less powerful register indirect mode can have a negative effect on performance. With the register indirect addressing mode, the compiler will have to schedule an extra add operation before load/store operations in order to compute memory addresses. This can increase the path length of the program and reduce performance. There are, however, several mitigating factors.
1) Not all load/store operations need an add operation. In [11], the percentage of load/store operations with zero displacement for two standard integer benchmarks, GCC and TeX, on a RISC-style machine is reported to be 27% and 17%, respectively. In [12], Gross et al. report the average percentage of load/store operations with zero displacement to be around 29% for a variety of small and large integer benchmarks in C and Pascal.

2) The extra add operation could be "absorbed" into an empty operation slot in a long word instruction during the compaction process without increasing the path length of the program.

3) With the register indirect addressing mode, the new pipeline structure does not suffer from load delays. An operation using the result of a load operation in the previous instruction can execute without delay (assuming that there is a bypass path from the functional unit executing the load operation to the functional unit that uses the result of the load).
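As a rough illustration of factors 1) and 2), the following sketch lowers a displacement-mode load for a register indirect machine. The mnemonics and register names are hypothetical; VIPER's actual assembly syntax is not given in this note:

```python
# Sketch: lowering 'dst = MEM[base + disp]' for a machine with only
# register indirect addressing. A nonzero displacement costs one extra
# add, which the compaction phase may absorb into an empty operation
# slot of an earlier long instruction.
def lower_load(dst: str, base: str, disp: int, scratch: str = "r1"):
    if disp == 0:                                # zero-displacement loads need no add
        return [f"load {dst}, ({base})"]
    return [f"add {scratch}, {base}, #{disp}",   # compute the effective address
            f"load {dst}, ({scratch})"]

print(lower_load("r4", "r2", 0))  # ['load r4, (r2)']
print(lower_load("r4", "r2", 8))  # ['add r1, r2, #8', 'load r4, (r1)']
```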
TABLE I. Comparison of the performance of the five-stage and four-stage pipelines.
In order to find out which pipeline structure would result in higher performance, a series of simulations was performed. In these simulations, the objective was to compare the speedup achieved by two different versions of VIPER: one with a five-stage pipeline using the displacement addressing mode, and one with a four-stage pipeline using the register indirect addressing mode. The speedup factors for these two processors were computed by comparing them to a processor with a single functional unit using the five-stage pipeline. The reason is that if we were to design a pipelined processor with a single functional unit, we would use the five-stage pipeline, because it would result in better performance. In these simulations, it was assumed that bypass paths existed from each functional unit to itself only, i.e., no global bypassing among different functional units. This means that a data hazard will result in a stall only if a functional unit uses the result produced by a different functional unit in the previous cycle. The results of these simulations are presented in Table I.

The results in Table I show that a four-stage pipeline can result in higher performance for a VLIW processor. On the average, the speedup achieved by the processor with the four-stage pipeline is 8.4% greater than that of the processor with the five-stage pipeline. The percentage of pipeline stalls caused by RAW hazards is much lower for the four-stage pipeline. This accounts for the performance advantage of the four-stage pipeline, even though it incurs an extra addition for a large fraction of load/store operations. For two of the benchmarks (binsearch and quicksort), the performance of the five-stage pipeline is as good as, or slightly better than, that of the four-stage pipeline. This is due to the increase in the path length of the program caused by the extra add operations used to compute effective addresses of load/store operations.

Given the performance advantage of the four-stage pipeline and the fact that the hardware implementation of the four-stage pipeline with register indirect addressing is simpler than that of the five-stage pipeline with displacement addressing, it was decided that VIPER would use the four-stage pipeline. RAW data hazards that cannot be resolved by the available bypassing network are detected at run-time by bypassing comparators, and are resolved by stalling the execution pipeline by one cycle. It is possible to allow the code generator to eliminate RAW hazards by scheduling instructions with NOP operations at appropriate points in the program; however, this approach has two drawbacks.
1) Stalling the processor by scheduling instructions with NOP operations will increase the code size and will effectively waste some of the instruction fetch bandwidth.

2) After a branch delay slot, depending on the outcome of the branch, it might be necessary to stall the pipeline. The code generator has to assume the worst case and schedule a stall cycle. If a branch operation takes the path that does not really require a stall cycle, a cycle is lost.

The frequency of pipeline stalls caused by RAW hazards can be further reduced by software scheduling. This is done in a final pass by the code generator, during which it attempts to modify the assignment of operations to functional units so that RAW hazards can be resolved by the bypassing hardware of the processor instead of resulting in pipeline stalls.
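The stall rule used in the simulations above (a hazard stalls the pipeline only when producer and consumer are different functional units, since self-bypass resolves same-FU hazards) can be sketched as follows. The instruction representation is hypothetical, chosen only to make the rule executable:

```python
from typing import List, NamedTuple, Optional, Tuple

class Op(NamedTuple):
    fu: int                  # index of the functional unit executing this operation
    dst: Optional[int]       # destination register number, or None
    srcs: Tuple[int, ...]    # source register numbers

def needs_stall(prev_insn: List[Op], cur_insn: List[Op]) -> bool:
    """Stall if an operation reads a register written in the previous
    cycle by a *different* FU; the self-bypass path covers the rest."""
    for cur in cur_insn:
        for prev in prev_insn:
            if prev.dst is not None and prev.dst in cur.srcs and prev.fu != cur.fu:
                return True
    return False

# FU0 writes r5; FU1 reads r5 in the next long instruction -> stall.
print(needs_stall([Op(0, 5, (1, 2))], [Op(1, 7, (5, 3))]))  # True
# Same FU produces and consumes r5 -> resolved by self-bypass.
print(needs_stall([Op(1, 5, (1, 2))], [Op(1, 7, (5, 3))]))  # False
```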
Fig. 6. Various bypassing interconnection network topologies and their corresponding P matrices (rows and columns indexed by FU0-FU3):

P1 = [1 0 0 0; 0 1 0 0; 0 0 1 0; 0 0 0 1]
P2 = [1 1 0 0; 1 1 0 0; 0 0 1 1; 0 0 1 1]
P3 = [1 0 0 1; 0 1 1 0; 0 1 1 0; 1 0 0 1]
P4 = [1 1 0 1; 1 1 1 0; 0 1 1 1; 1 0 1 1]
P5 = [1 1 1 1; 1 1 1 1; 1 1 1 1; 1 1 1 1]
IV. BYPASSING INTERCONNECTION NETWORK

One can think of the bypassing hardware as an interconnection network that connects different functional units together. Increasing the connectivity of the bypassing network has two conflicting effects on the performance of the machine. A higher degree of connectivity can result in a smaller number of pipeline stalls and a higher level of performance. On the other hand, increasing the connectivity of the bypassing network will increase the capacitive load of the bypassing pathways and will lengthen the processor cycle time. In this section, the performance effects of various bypassing interconnection network topologies are analyzed. The objective is to explore the conflicting performance effects of increasing the connectivity of the bypassing interconnection network. We can describe a given four-node network topology by a 4 x 4 matrix P that is defined as follows: P[i][j] = 1 if there is a path from functional unit i to functional unit j, and P[i][j] = 0 otherwise.
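Under the same hypothetical instruction representation as the earlier sketch, the stall check generalizes directly to an arbitrary P, and the off-diagonal entries of P count the inter-FU bypass buses a topology implies:

```python
# P1 (self-only) and P5 (fully connected) from Fig. 6, as Python lists.
P1 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
P5 = [[1] * 4 for _ in range(4)]

def stall_required(P, prev_insn, cur_insn) -> bool:
    """Stall if a needed previous-cycle result has no bypass path under P.
    Operations are Op tuples (fu, dst, srcs) as in the earlier sketch."""
    return any(prev.dst is not None and prev.dst in cur.srcs
               and not P[prev.fu][cur.fu]
               for cur in cur_insn for prev in prev_insn)

def inter_fu_buses(P) -> int:
    """Inter-FU bypass buses implied by a topology (off-diagonal 1's)."""
    return sum(P[i][j] for i in range(4) for j in range(4) if i != j)

print(inter_fu_buses(P1), inter_fu_buses(P5))  # 0 12
```

The bus count makes the layout trade-off explicit: P1 needs no global buses at all, while P5 needs twelve of them.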
Fig. 7. Critical path involving the bypassing hardware.

The P matrix is a simulation parameter that is used to determine whether the execution pipeline needs to be stalled in case of a data hazard. Fig. 6 shows various interconnection topologies along with their corresponding P matrices. The bypassing network for P = P1 presents a minimal degree of connectivity, where destination operands are bypassed from each functional unit to itself only. The area taken by the bypassing pathways for this topology is virtually zero. They are embedded within the data paths of the functional units and are local to each bit-slice (see Fig. 3). Their capacitive load is insignificant compared to the capacitive load of the destination bus to which they are connected. The cycle time penalty of this scheme is minimal; however, it can result in more pipeline stalls than the other bypassing networks.

For P = P2 and P = P3, each functional unit has access to two bypassed destination operands: one from itself and another from a neighboring functional unit. The network topology for P = P2 includes extra buses that connect FU0 to FU1 (and vice versa) and FU2 to FU3 (and vice versa). The cycle time penalty for these extra connections is relatively modest in VIPER because of the physical proximity of FU0 to FU1 and FU2 to FU3. Because these extra connections require horizontal wires (with respect to the floorplan in Fig. 2), however, the routing areas between the register file and the functional units have to be stretched in the vertical direction to accommodate the following horizontal buses:

1) a bus from FU0 to FU1,
2) a bus from FU1 to FU0,
3) a bus from FU2 to FU3, and
4) a bus from FU3 to FU2.

For P = P3, the additional bypassing connections connect FU0 to FU3 (and vice versa) and FU1 to FU2 (and vice versa). The buses required for this scheme are longer than the ones required for P = P2, because they have to cross the entire height of the register file. Since the functional unit bit-slices that are being connected by these vertical connections are at the same horizontal coordinate, the required routing tracks can be placed within the bit-slices of the register file without the need for extra routing space. The bypassing network for P = P4 is a combination of the previous two. It offers an even higher degree of connectivity at the expense of a larger cycle time penalty. The routing area taken by this network is equal to the routing area required for P = P2. For P = P5, all functional units are completely interconnected. Each functional unit has access to all destination operands from
the previous machine cycle; however, the cycle time penalty for this configuration is the largest. The additional bypassing connections are used to connect FU0 to FU2 (and vice versa) and FU1 to FU3 (and vice versa). These connections present additional layout difficulties, because they require that the routing areas between the register file and the functional units be stretched further to accommodate four more 32-bit horizontal buses.

To analyze the performance effects of these different bypassing networks, two sets of simulations were performed. The speedup factors compare VIPER with various interconnection network topologies to a single-processor system. In the first set of simulations, the objective was to observe the speedup factors that can be achieved by different network topologies. These speedup values are ideal (S_ideal in Table II) in the sense that they do not include the cycle time penalty of a given network topology. They provide a measure of the number of stalls saved by a given bypassing network. For this set of simulations, the code generator was allowed to schedule operations to reduce the frequency of pipeline stalls caused by RAW data hazards. In the second set of simulations, the extracted layout of the bypassing path was simulated using SPICE [13]. For these simulations, the process parameters of the 1.2 μm CMOS technology offered by MOSIS were used [14]. The results of these circuit simulations provide a measure of the cycle time penalty of a given bypassing network. These results are presented as normalized cycle times (T_normalized in Table II). The actual speedup that can be achieved with a given bypassing interconnection network is computed by using the following equation:

S_actual = S_ideal / T_normalized.
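As a worked example of this equation (the numbers below are illustrative only; the actual S_ideal and T_normalized values from Table II are not recoverable here):

```python
# Illustrative only: a denser bypass network may save stalls (raising
# S_ideal) while stretching the cycle (T_normalized > 1); the net win
# is the ratio of the two.
def actual_speedup(s_ideal: float, t_normalized: float) -> float:
    return s_ideal / t_normalized

print(round(actual_speedup(2.00, 1.00), 3))  # e.g., a P1-like baseline -> 2.0
print(round(actual_speedup(2.10, 1.06), 3))  # e.g., a P5-like network  -> 1.981
```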
The circuit simulations analyzed the path shown in Fig. 7, which is the longest chain of dependencies during the ID stage. First, the output of the ALU or the shifter is driven onto the Dest bus. Then it goes through the bypassing network and the bypassing multiplexor, and then it is driven onto the Src1 bus, which goes into the branch detection logic and is used to determine the outcome of branch operations. In VIPER, the outcome of branch operations is decided by testing the least significant bit of a register operand. Explicit comparison operations are used to set or reset the least significant bit of a register. The outcome of a branch operation is known by the end of the ID pipeline stage. The address of the target of the branch is also computed during the ID stage. The capacitive loads of the bypassing buses were estimated from preliminary layout efforts.

The simulation results are also plotted in Fig. 8. These results show that the best performance can be achieved with P = P4. Its performance advantage over the bypassing network for P = P1 is about 2.6%. Because this is not a very significant performance advantage, the simplicity of the bypassing network for P = P1 makes it a viable option, even though its performance is less than optimal.

Fig. 8. Performance effects of various bypassing interconnection networks.

TABLE II. Performance effects of various bypassing interconnection networks.

V. CONCLUSION

We presented the design issues involved in the pipeline structure and the bypassing interconnection network in a VLIW processor. We presented the four-stage pipeline of our processor and showed why it can achieve higher performance in a VLIW processor than the five-stage pipeline typically employed in RISC processors. Even though the four-stage pipeline and the associated register indirect addressing mode require an extra addition operation for roughly half of the load/store operations, the frequency of pipeline data hazards is much lower than that of the five-stage pipeline. Additionally, the simplicity of the four-stage pipeline is very attractive from an implementation point of view. We presented different bypassing interconnection networks that can be used for increased levels of connectivity among the functional units of the processor. Our analysis included both the frequency of stalls and cycle time penalties. We have shown that a bypassing network that completely connects all of the functional units does not provide enhanced performance when its cycle time penalty and the required silicon area are taken into account.

REFERENCES

[1] A. S. Aiken, "Compaction-based parallelization," Ph.D. dissertation, Cornell Univ., Aug. 1988.
[2] A. Nicolau, "Percolation scheduling: A parallel compilation technique," Tech. Rep. 85-678, Dept. of Comput. Sci., Cornell Univ., May 1985.
[3] R. P. Colwell, R. P. Nix, J. J. O'Donnell, D. P. Papworth, and P. K. Rodman, "A VLIW architecture for a trace scheduling compiler," IEEE Trans. Comput., vol. 37, pp. 967-979, 1988.
[4] B. R. Rau, D. W. L. Yen, W. Yen, and R. A. Towle, "The Cydra 5 departmental supercomputer," Comput., vol. 22, pp. 12-35, 1989.
[5] R. Cohn, T. Gross, M. Lam, and P. S. Tseng, "Architecture and compiler tradeoffs for a long instruction word microprocessor," Proc. 3rd Int. Conf. Architectural Support for Programming Languages and Operating Syst., 1989, pp. 2-14.
[6] J. Labrousse and G. A. Slavenburg, "CREATE-LIFE: A modular design approach for high performance ASIC's," COMPCON, 1990.
[7] A. Abnous, "Architectural design of a VLIW processor," M.S. thesis, Dept. of Elec. and Comput. Eng., Univ. of California, Irvine, 1991.
[8] A. Abnous, R. Potasman, N. Bagherzadeh, and A. Nicolau, "A percolation based VLIW architecture," Proc. 1991 Int. Conf. Parallel Processing, 1991, pp. 144-148.
[9] R. Potasman, "Percolation-based compiling for evaluation of parallelism and hardware design trade-offs," Ph.D. dissertation, Univ. of California, Irvine, 1991.
[10] A. Abnous, C. Christensen, J. Gray, J. Lenell, A. Naylor, and N. Bagherzadeh, "VLSI design of the tiny RISC microprocessor," Proc. 1992 Custom Integrated Circuits Conf., 1992.
[11] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach. Palo Alto, CA: Morgan Kaufmann, 1990.
[12] T. R. Gross, J. L. Hennessy, S. A. Przybylski, and C. Rowen, "Measurement and evaluation of the MIPS architecture and processor," ACM Trans. Comput. Syst., pp. 229-257, Aug. 1988.
[13] A. Vladimirescu and S. Liu, "The simulation of MOS integrated circuits using SPICE2," ERL Memo ERL M80/7, Electron. Res. Lab., Univ. of California, Berkeley, Oct. 1980.
[14] C. Tomovich, MOSIS User Manual, Rel. 3.1, USC/Inform. Sci. Inst., Marina del Rey, CA, 1988.
Embedding Binary X-Trees and Pyramids in Processor Arrays with Spanning Buses

Zicheng Guo and Rami G. Melhem

Abstract: We study the problem of network embeddings in 2-D array architectures in which each row and column of processors is interconnected by a bus. These architectures are especially attractive if optical buses are used that allow simultaneous access by multiple processors through either wavelength division multiplexing or message pipelining, thus overcoming the bottlenecks caused by the exclusive access of buses. In particular, we define X-trees to include both binary X-trees and pyramids, and present two embeddings of X-trees into 2-D processor arrays with spanning buses. The first embedding has the property that all neighboring nodes in X-trees are mapped to the same bus in the target array, thus allowing any two neighbors in the embedded X-trees to communicate with each other in one routing step. The disadvantage of this embedding is its relatively high expansion cost. In contrast, the second embedding has an expansion cost approaching unity, but does not map all neighboring nodes in X-trees to the same bus. These embeddings allow all algorithms designed for binary trees, pyramids, as well as X-trees to be executed on the target arrays.

Index Terms: Alignment cost, embedding, spanning bus, pyramid, reflection index, X-tree
I. INTRODUCTION AND PROBLEM DEFINITION

Although the computational powers of parallel computers are potentially much larger than those of sequential ones, parallel computers suffer from inefficient interprocessor communications. Dealing with the communication inefficiency through either architecture or algorithm design, or both, has always been a key research issue in parallel computation. In order to improve the communication efficiency of parallel systems, buses, both electronic and optical, have been considered by many researchers for interconnecting processor arrays [1], [7], [9], [18], [19], [21]. In particular, 2-D processor arrays with broadcasting row and column buses have been suggested that allow for efficient solutions to various problems, including semigroup and prefix computations, image processing, computational geometry, and numerical computations [3], [16], [17]. Broadcasting buses, however, have low bandwidth because of their exclusive access mode of operation, and thus are not suited for tasks involving extensive interprocessor communications. This limitation may be alleviated if optical buses are used either with wavelength division multiplexing, which provides multiple channels on the same bus [7], or with space/time multiplexing, which allows message pipelining on a bus [9], [15]. In both cases, simultaneous bus access by multiple processors is permitted. As an illustration, in the following subsection, we briefly describe the principle of message pipelining on optical buses.

Manuscript received March 12, 1992; revised September 11, 1992. This work was supported by the U.S. Air Force under Grant AFOSR-89-0469 and by the National Science Foundation under Grant MIP-8901057. Z. Guo is with the Department of Electrical Engineering, Louisiana Tech University, Ruston, LA 71272-0046, USA; e-mail:
[email protected]. R . G. Melhern is ,with the Department of Computer Science, University of Pittshurgh. Pittsburg,h, PA 15261, USA. IEEE Log Number 9401 190.
A. Message Pipelining on Opticul Buses
Pipelined optical buses take advantage of two unique properties of optical signal transmissions in waveguides: unidirectional propagation and predictable path delays. Consider Fig. I , which shows an array of .V processors connected by an optical bus (waveguide). Each processor is coupled to the bus with two passive optical couplers, one for writing signals and the other for reading. Assume that the optical distance between each pair of adjacent processors is Do and that a message consists of a sequence of b optical pulses, each having a width 11: in seconds. In contrast to the case of an electronic bus, where writing access to the bus is exclusive, all of the processors may write their messages on the optical bus simultaneously. This may be accomplished if all processors write their messages at the same instant and if the length of each message is smaller than D