June 8, 1998
PipeRench Manual
Matthew Myers, Kevin Jaget, Srihari Cadambi, Jeffrey Weener, Matthew Moe, Herman Schmit, Seth Copen Goldstein, Dan Bowersox
Carnegie Mellon University
An overview of the PipeRench architecture, assembler, and simulator

Section 1.0: Introduction — 1
Section 2.0: Parameterized Architecture — 2
Section 3.0: PipeRench Architecture — 9
Section 4.0: Programming PipeRench — 18
Section 5.0: The CHASM Assembly Language — 20
Section 6.0: The Assembler and Simulator — 36
Section 7.0: Conclusion — 41
Section 8.0: References — 41
1.0 Introduction
PipeRench (also known as Cached Virtual Hardware or CVH) is a run-time reconfigurable Field-Programmable Gate Array (FPGA) which manages a “virtual” pipeline. The fabric of the FPGA contains physical pipeline stages which, though limited in number, can be reconfigured separately at run-time to perform numerous calculations in an application. In this way, the physical pipeline stages, or “stripes,” are configured and arranged to form the virtual pipeline which executes the desired task. The logical size of a virtual pipeline is unbounded, and it can run on a compatible architecture of any size. Such hardware virtualization provides the benefits of forward-compatibility and more robust compilation. This document describes the architectural design for the PipeRench chip. Architectures both general and specific are discussed. Also outlined are an assembler and a simulator for PipeRench that we have implemented for the UNIX operating system. These software tools allow the convenient testing of various PipeRench applications in the absence of the actual hardware. The syntax for the Cached Virtual Hardware Assembly language is detailed here, as are planned improvements to the architecture, the assembler, and the simulator.
PipeRench Manual
1 of 41
2.0 Parameterized Architecture
In this section we explore the general Cached Virtual Hardware architecture. Features such as global busses, pass registers, and interconnect will be introduced. This model will provide the foundation for a more detailed description of the PipeRench architecture in the next section. The structure of PipeRench is based on a new pipeline reconfigurable FPGA architecture called “striped FPGA.” The main advantage of a striped FPGA over other FPGA alternatives is its high resolution of reconfigurability (see [1]). In a pipeline reconfigurable architecture, individual pipeline stages (“stripes”) can be reconfigured at runtime to perform numerous tasks over the course of one application. A standard FPGA, on the other hand, must be reconfigured in its entirety; a given piece of such hardware can only perform one task per application. On-the-fly reconfiguration means that an application running on PipeRench is not limited by the number of physical stripes in the FPGA fabric, but instead runs on the virtual hardware created from these reconfigured stripes.
FIGURE 1. Generalized Pipeline Reconfigurable FPGA
(Figure: two adjacent stripes, n and n+1, each containing a block of combinational logic; registered connections link stripe n to stripe n+1, and a global bus feeds both.)
Two stripes of an extremely general pipeline reconfigurable architecture are shown in Figure 1. In this architecture, the combinational logic in any stripe is a function of the global I/O, the registered outputs of the previous stripe, and the registered outputs of the current stripe. There must be global I/Os because, as discussed in [2], the pipeline stages in an application can be physically placed in any of the stripes of the fabric; therefore, inputs to and outputs from the application must use a global bus to get to their destination. Connections between stripes must be registered because, on the first cycle that stripe n operates on data, stripe n + 1 may still be undergoing configuration or may be working on another part of the pipeline. The register ensures that values do not outrun the configuration. Conversely, values cannot go to previous stripes, because those stripes may have been reconfigured to operate on another stage of the pipelined application. The underlying assumption for our area model is that the building blocks for combinational logic are look-up tables (LUTs). Therefore, the combinational logic block in Figure 1 can be replaced by an interconnection network and a set of N LUTs, each of which has D inputs. This yields the architecture in Figure 2. Inputs to the interconnection network include the global bus, the registers from the previous stripe, and the registered and unregistered outputs from the LUTs. Unregistered outputs are distributed so that LUTs can be chained together to build complex combinational functions within one stripe.
FIGURE 2. LUT-Based FPGA
(Figure: within each stripe, an interconnect network feeds N D-input LUTs, D-LUT #1 through D-LUT #N; the global bus, the registers of the previous stripe, and the registered and unregistered LUT outputs all enter the network, which repeats in stripe n+1.)
This architecture is targeted at implementing data paths for digital signal processing and other word-based computations. In word-based computations, the logic found in LUTs is frequently duplicated across every bit of the word. In order to conserve configuration bits, we would like to investigate the benefit of grouping multiple LUTs into logic blocks, as in [3]. Figure 3 shows the architecture of our logic block, which includes B LUTs with D inputs each. As in [3], we have included two types of inputs to a logic block: Dd data inputs which are B bits wide and are wired straight to the B LUTs, and Dc control inputs which propagate to one input of each LUT. These control inputs are useful for masks and multiplexor select lines.
FIGURE 3. B-output LUTs
(Figure: a logic block of B LUTs producing B outputs; the Dd data inputs (here Dd = 2) are B bits wide and wired bitwise to the B LUTs, while the Dc control inputs (here Dc = 1) are broadcast to one input of every LUT.)
2.1 Pass registers
A shortcoming of the architecture depicted in Figure 2 is that a LUT is associated with a single register. Our initial experience with mapping applications to such an architecture has shown that a high number of registers are required to pass values from the register in one stripe directly down to the corresponding register in the next stripe. In the architecture in Figure 2, the only way to do this is to configure the logic block to pass one input, which wastes the logic block’s computational power. Furthermore, the interconnect network is under-utilized since it is only asked to pass a value from a logic block in one column to the logic block in the next stripe and the same column. There is no facility for local register-to-register interconnection. In order to provide for this type of interconnect, we have transformed each register in Figure 2 into a pass register file, with P different B-bit wide registers. In a pass register file, which is depicted in Figure 4, the logic block can write to one of the P registers. The remaining registers are written with the value in the corresponding register in the previous stripe. There may be multiple read ports on the register file, which can be connected to one of the logic block inputs or to the interconnect network. The connection of pass register read ports to logic blocks and interconnect will be discussed later.
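The per-cycle behavior of a pass register file can be sketched as a toy Python model (the list representation and function names are ours, not part of the architecture):

```python
# Toy model of one pass register file (Section 2.1): each cycle, the logic
# block may write one of the P registers; every other register is loaded
# from the corresponding register of the previous stripe's file.

P = 4  # registers per file (the parameter P in the text)

def pass_regfile_cycle(prev_stripe_regs, logic_out=None, write_idx=None):
    """Compute the next contents of a pass register file.

    prev_stripe_regs: list of P values from the previous stripe's file.
    logic_out/write_idx: optional value written by this stripe's logic block.
    """
    assert len(prev_stripe_regs) == P
    next_regs = list(prev_stripe_regs)    # default: pass values straight down
    if write_idx is not None:
        next_regs[write_idx] = logic_out  # logic block overwrites one register
    return next_regs

# Register 2 is overwritten; registers 0, 1, and 3 flow through unchanged.
print(pass_regfile_cycle([10, 11, 12, 13], logic_out=99, write_idx=2))
# → [10, 11, 99, 13]
```

This captures why the file is called a “pass” register file: unwritten registers simply shift down the pipeline.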
FIGURE 4. Pass Register File
(Figure: in stripe n, the logic block's write port stores into one of the P B-bit registers, 1 through P; the remaining registers load from the corresponding registers of the previous stripe, and a read port feeds the logic block and stripe n+1.)
2.2 Global Busses
Figure 2 depicts one global bus feeding into the generalized interconnect of the stripe. This would, in principle, allow any part of the global bus to route through the interconnect network to any logic cell. Though useful, this would be a very expensive feature to implement, especially for the large amount of data that needs to be carried on the global busses. Keep in mind that global busses are the only way for operands to get into the FPGA, and for results to get out of the FPGA. In addition, as discussed in [2], global busses are necessary to save and restore state from stripes when they are removed from and later restored to the fabric during hardware virtualization. We have therefore decided to have the global bus bypass the general interconnect network and connect directly to the logic blocks. A set of G busses are associated with each column of logic cells. These busses, which are each B bits wide, span all the stripes in the FPGA, but only directly connect to the logic blocks underneath them. Each logic block has one write port and multiple read ports on these busses. The enable signals for writing global busses are generated globally in order to prevent multiple stripes from driving the same bus, much as in [5]. Because these enables are generated globally, there is no need to locally store configuration bits to determine which bus to drive.
2.3 Interconnect Network
The interconnect network, which is responsible for moving values between cells in a single stripe, will play a very substantial role in the size and routability of an architectural implementation. Obviously, having complete interconnect between all logic blocks and pass registers is infeasible for wide stripes. We can reduce the size of the interconnect network
• by reducing the number of inputs to the network from each logic block, (Figure 2 shows registers and combinational logic connecting to the interconnect network);
• by reducing the connectivity inside the network (e.g. not having a complete crossbar); and
• by reducing the number and functionality of the output ports from the network (which are the inputs to the logic block).

It is rare that both the combinational value from the logic block and a value from the corresponding pass register file need to simultaneously use the interconnect network to move laterally within the stripe. As shown in Figure 5, one input port of the interconnect network is multiplexed between the read port of the pass register file and the combinational output of the logic block. The other input port to the interconnect network is connected to a read port of the pass register file in the previous stripe.
FIGURE 5. Parameterized Cell Architecture
(Figure: one cell with Dg = 1 and Dp = 1; the horizontal interconnect network, the G busses, the Dc control inputs, and the pass register files of stripes n and n+1 feed the logic block's D inputs, with P*B bits available from the register file.)
We are currently considering which network architecture we will use for the horizontal interconnect network. The prototype chip that we have designed has a complete crossbar between every input port and every output port. (The CVH assembler and simulator described in Section 6.0 currently support a full crossbar as well.) The logic block inputs can access not only any of the values sent from the N cells in the stripe, but a sliding window of bits within the N*B bits on the crossbar. This allows our architecture to do data alignments that are necessary for word-based arithmetic. This innovation was originally described in [3]. We realize that this complete crossbar architecture is not scalable, and we are therefore investigating and parameterizing other network architectures. Our architecture assumes that each of the Dd data inputs to the logic block has a dedicated connection to the network. We assume that a subset of these inputs, with cardinality Dg, also has access to one of the global busses. The remaining Dp inputs to the logic block can access a dedicated read port of the pass register file. No input can access both the global busses and the pass registers. Figure 5 shows an example where Dp = 1 and Dg = 1. We have replaced the parameter Dd with two new parameters, Dp and Dg, such that Dd = Dg + Dp. Figure 5 provides an overview of our parameterized architecture. In summary, there are seven parameters:
• N: the number of logic blocks in the stripe;
• B: the width, in bits, of the logic block;
• G: the number of B-bit wide global busses visible to each cell;
• P: the number of B-bit wide registers in the pass register file;
• Dg: the number of logic block inputs that can connect to the horizontal interconnect network or the global busses;
• Dp: the number of logic block inputs that can connect to the horizontal interconnect network and the pass registers of the previous stripe; and
• Dc: the number of control inputs to the LUT.

The logic block therefore contains B LUTs, each of which has D = Dp + Dg + Dc inputs.
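For bookkeeping, the seven parameters and the derived LUT input count D can be collected in a small sketch (the class name and the sample values below are illustrative only, not a prescribed configuration):

```python
# The seven architecture parameters of Section 2, with D derived exactly
# as in the text: D = Dp + Dg + Dc.
from dataclasses import dataclass

@dataclass
class StripeParams:
    N: int   # logic blocks per stripe
    B: int   # logic block width in bits
    G: int   # B-bit global busses visible to each cell
    P: int   # B-bit registers in the pass register file
    Dg: int  # inputs that reach the interconnect or the global busses
    Dp: int  # inputs that reach the interconnect or the previous stripe's pass registers
    Dc: int  # control inputs per LUT

    @property
    def D(self) -> int:
        """Inputs per LUT."""
        return self.Dp + self.Dg + self.Dc

p = StripeParams(N=16, B=4, G=4, P=8, Dg=1, Dp=1, Dc=1)
print(p.D)  # → 3
```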
3.0 PipeRench Architecture
The architecture that can currently be targeted by the Cached Virtual Hardware Assembler is a restricted class of the architectures described in the previous section. This section details the architecture in a top-down manner, first discussing the stripe and then the processing elements (PEs) that are contained in the stripe. The discussion then moves to the routing and configuration of PEs, and finally to the memory controllers, which handle the flow of data between the FPGA fabric and the memory.

3.1 The Stripe
In a hierarchical sense, the stripe is the largest, most general block of PipeRench. Each stripe performs a computationally significant portion of the application and is analogous to a pipeline stage. Registers between stripes separate each stripe from the rest of the architecture, providing the pipeline capability. Figure 6 shows two stripes along with the local bus which connects each stripe to its successor. Each stripe is made up of N computational logic blocks, or Processing Elements (PEs). PEs are described in detail in Section 3.2.
FIGURE 6. Architecture of PipeRench at Stripe Level
(Figure: stripes n and n+1, each containing PEs N-1 down to 0 connected by N busses and Dc control bits. All lines are B bits wide except for control lines.)
3.2 The Processing Element
All computation for a PipeRench application is done in the processing elements. Each PE in a stripe can take its inputs from several different locations, and output to various other locations within the fabric. More details follow on the features of a processing element.

3.2.1 Inputs
A PE takes two B-bit inputs from the local bus, the global bus, or the registered output of the previous stripe. These inputs, A and B, are the values to be operated on within the PE. Operand A can read a value from the local bus, or it can read a value from the global bus. Operand B can read a value from the local bus, or it can read a value directly from the register file of the same PE in the previous stripe. Other inputs to the PE come from an adjacent PE in the same stripe. These 1-bit values are named Cin (carry-in), Zin (zero in), and Xin (general purpose), and can be chosen from the three inputs to the “Misc. Routing” block, although a limited number of combinations are allowed. One might choose to define Xin to equal the Cout from the previous PE, for example, in order to perform some logical function on that value. Figure 7 shows in detail where the particular inputs come into the PE.

3.2.2 Outputs
The B-bit functional output of the PE can be saved into the register file and passed on to the next stripe. In this case, the value can be fed directly to the B input of the corresponding PE in the next stripe. The PE output can also be routed, registered or unregistered, to the local bus, as shown in Figure 7. This value can be picked up from the local bus by another PE in the same stripe. 1-bit outputs Cout (carry-out), Zout (zero out), and Xout (general purpose) are routed through the “Misc. Routing” block to the neighboring PE on the left. Cout is the carry-out value from the top bit of the carry chain (see Section 3.2.5). Zout has a value of 0 if the PE output is zero, and 1 otherwise. The Xout of a PE is hardwired to the Xin of the same PE; this allows the PE to pass a value from the previous PE to the next PE.

3.2.3 Barrel Shifter
Connecting inputs from the local busses to the PE are barrel shifters for each operand, as seen in Figure 7. Currently, the barrel shifters can only shift left or rotate left, although a shift or rotate right can be accomplished through a combination of the routing resources. The barrel shifter can take as its input any B-bit window from the total value available to it (B*N bits wide, spanning all N of the local busses). For example, in an architecture with N=4 and B=4, imagine that the following 16-bit value was placed on the four local busses:

1 0 1 1 1 0 [ 0 1 0 1 ] 1 0 1 0 1 1

Any four-bit window on this value can be grabbed by the barrel shifter, and then that value can be shifted or rotated left by any amount. In this case, the value 0101 was grabbed, so if it was shifted left by 2, the resulting input to the PE would be 0100. If it
was rotated left by three, we would get 1010 as the PE’s input. This flexibility allows a great number of possibilities for manipulating data.
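The window-grab and shift/rotate behavior can be mimicked in a few lines; this sketch reproduces the example above (the function names are ours, not CHASM constructs, and bit strings are written most-significant-bit first):

```python
# Sketch of the barrel shifter's behavior for N=4, B=4: grab any B-bit
# window from the B*N bits on the local busses, then shift or rotate left.

B = 4

def grab_window(bus_bits: str, start: int) -> str:
    """Take any B-bit window from the B*N bits on the local busses."""
    return bus_bits[start:start + B]

def shift_left(bits: str, k: int) -> str:
    return bits[k:] + "0" * k            # zeros shift in from the right

def rotate_left(bits: str, k: int) -> str:
    k %= len(bits)
    return bits[k:] + bits[:k]

busses = "1011100101101011"              # the 16-bit example value above
window = grab_window(busses, 6)          # grabs "0101", as in the text
print(shift_left(window, 2))             # → 0100
print(rotate_left(window, 3))            # → 1010
```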
FIGURE 7. Processing Element Routing
(Figure: for stripes n and n+1, each PE's A input is fed by Barrel Shift 1 from the N local busses or by the global bus, and its B input by Barrel Shift 2 or the previous stripe's register file; Xin/Zin/Cin enter from, and Xout/Zout/Cout leave to, the neighboring PE; the PE output and R0 from the register file drive the local bus. All lines are B bits wide except for control lines.)
FIGURE 8. Processing Element Architecture (x is PE number, y is a global bus number, i a register)
(Figure: the N local busses, each B bits wide, and the G global busses feed two barrel shifters producing operands “x.A” and “x.B”; B 3-LUTs labeled “pe.x” sit above the ripple-carry chain of Figure 9; the “Misc. Routing” block exchanges “x.Cin”/“x.Xin”/“x.Zin” with “(x-1).Cout”/“(x-1).Xout”/“(x-1).Zout” and produces “x.Cout”/“x.Xout”/“x.Zout”, with a Zero? detector driving “x.Zout”; the B-bit output “x.Out” drives the current stripe's local bus and registers R0 through RP-1, with “x.R0” on the current stripe's local bus and “this.x.Ri” feeding the next stripe's B input; “prev.x.Ri” arrives from the previous stripe's register file and “global.y” from a global bus.)
3.2.4 Look-Up Table (LUT)
There are B 3-input look-up tables provided in each PE, as seen in Figure 8. The inputs to the LUTs are A, B, and Xin. Inputs A and B vary among the B LUTs according to their individual bit values, but Xin has the same value for every LUT in the PE. Each LUT can provide any function of its three inputs; however, all of the LUTs in a given PE perform the same function of A, B, and Xin.

3.2.5 Ripple-Carry Chain
The ripple-carry chain is intended to be used whenever a PE is performing an addition or subtraction. When the LUT output is combined with its inputs, the ripple-carry chain can output the sum (or difference) of the two B-bit inputs, A and B. The most significant carry-out result can be passed on to the next PE through Cout, so that addition and subtraction of values greater than B bits is supported. Figure 9 shows the actual contents of one bit of the ripple-carry logic. This logic exists underneath each LUT pictured in Figure 8. The “carry_enable” bit chooses whether the carry chain should pass along only the LUT output or whether it should compute the LUT output XOR’ed with the carry in from the previous PE. In this way, a function such as A ⊕ B ⊕ Cin can be expressed by the PE.
FIGURE 9. Carry Chain Cell
(Figure: one bit of the ripple-carry logic. A mux controlled by shift_input selects operand A or B; a mux controlled by carry_enable selects between the LUT output alone and the LUT output combined with the Cin from the previous carry cell (or previous PE); the cell drives the PE output bit and a Cout to the next carry cell (or next PE).)
An additional function of the carry chain is realized by activating the “shift_input” bit. It can select which input, A or B, should be shifted through the carry chain as a carry-in for the next carry chain cell.
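A one-bit model of the carry-chain cell, and a B-bit ripple add built from it, can be sketched as follows. The signal names carry_enable and shift_input come from the text; the carry-generation formula in add mode is our own assumption (the standard full-adder carry), since the manual does not spell it out:

```python
# Sketch of one carry-chain cell (Figure 9) and a ripple adder built from
# B such cells chained through Cin/Cout.

def carry_cell(a, b, cin, carry_enable, shift_input):
    """Return (pe_output_bit, cout) for one bit position."""
    lut_out = a ^ b                          # LUT programmed as A XOR B for addition
    out = lut_out ^ cin if carry_enable else lut_out
    if carry_enable:
        cout = (a & b) | (cin & (a ^ b))     # full-adder carry (our assumption)
    else:
        cout = b if shift_input else a       # shift mode: pass an operand bit along
    return out, cout

def ripple_add(a, b, width=4):
    """Ripple two width-bit operands through the chain: returns (sum, cout)."""
    cin, result = 0, 0
    for i in range(width):
        s, cin = carry_cell((a >> i) & 1, (b >> i) & 1, cin,
                            carry_enable=1, shift_input=0)
        result |= s << i
    return result, cin

print(ripple_add(5, 7))  # → (12, 0)
```

With carry_enable asserted, each cell computes A ⊕ B ⊕ Cin, exactly the function named in the text.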
3.2.6 Pass Registers (Register File)
Associated with each processing element is a register file. This register file can be used to hold intermediate values of a computation which need to be passed to subsequent stages in the pipeline. By using these registers, one can avoid wasting an entire processing element to simply pass values unchanged between stripes. On each clock cycle, the contents of a PE’s register file are automatically passed to the register file in the corresponding PE in the next stripe. In addition, the contents of a register from the file can be used and modified by the logic in a stripe. The output of a PE can be written to any register in its register file, but a register file can not be written to by any other PE except its own. A register’s output can be routed to the A or B input of any PE in the next stripe, and Register 0 can be routed to the input of any PE in the current stripe. Register 0 is the special register for save/restore (described in Section 3.2.7). Each PE’s register file has two read ports and one write port. Therefore, a register file can be read by multiple PEs, as long as each PE requires data from the same register. This is because one read port goes directly to the next stripe while the other read port can be fed back into the horizontal interconnect network from which the current stripe takes its inputs. Since the number of registers that will be present in the final version of PipeRench has not been determined, the assembler and simulator support an arbitrary number of registers per processing element.

3.2.7 State Storage and Restoration
In order to maintain a current value in a given stripe, one register of a PE can be stored into memory after the stripe has executed, and it can be restored the next time the stripe is loaded into the fabric. This operation will be necessary if a pipeline stage needs to store some “state” in a register and there is a likelihood that the stripe will be swapped out of the fabric. When the stripe is swapped back into the fabric, the register will be restored along with the stripe to allow execution to continue normally. Currently, Register 0 is designated as the store/restore register, so only this register will be saved in a store operation, and only this register will be restored. In order to support the store/restore feature, two global busses must be dedicated to the swapping operation. The actual interface with the memory controllers is explained in Section 3.5. 3.3 Stripe Routing and Limitations The local routing within a stripe is extremely flexible to allow dataflow to the necessary cells, whether they are in the same stripe or in the next stripe. However, there are some constraints to the amount of routing that can be performed. The following is a partial list of some interesting routing “quirks”:
• Outputs of a PE can be routed to the stripe’s local bus, which can be read by any other PE in the stripe. Each one of the N local busses must be shared among the output of the current PE, Register 0 from the current PE, and a registered output from the previous PE. This imposes several routing limitations, for example, not allowing Register 0 and the output of a given PE to both be accessed in a single stripe. Most of these
limitations can be overcome by using an extra stripe or an extra PE to perform the given function.
• Only operand A can read from the global bus, and only operand B can read from the extra register file port from the previous stripe. However, any of the registers in the same PE in the previous stripe can be accessed through the local bus if the register is driven on the local bus.
• Each PE can read from one of G global busses, each having N*B bits. However, a PE can only read the B bits that correspond to its position in the stripe. For example, the rightmost PE (“PE 0”) can only access the least significant B bits of any global bus.
• The “Misc. Routing” box in Figure 8 is a partial crossbar which allows Cin, Xin, and Zin to grab almost any combination of Cout, Xout, and Zout from the neighboring PE. However, not all combinations are supported.
• The “carry_enable” and “shift_input” signals are hard-coded logic and can only be turned on and off.

3.4 Configuration Controller
The control word is the main unit of configuration for the PipeRench architecture. This word determines how the stripe is configured, which global busses the stripe will access, and where the next control word is located. More significantly, it also determines the number of execution cycles when applications are virtualized. The fabric knows when each stripe reads/writes on a global bus based on the control word. A stripe gets exclusive access to a bus by asserting a bus_enable signal and reads from a bus by asserting a bus_use signal in the control word. Figure 10 shows how the control word is broken up.
FIGURE 10. Control Word Format
(Figure: a 20-bit control word containing an 8-bit next-address field; 4 bits to indicate which global bus this virtual stripe can read from; 4 bits to indicate which global bus(es) this virtual stripe needs to write to; and a bit that needs to be a ‘1’ for the last virtual stripe in the config set.)
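The control-word fields can be illustrated with a hypothetical packing. The manual names the fields but does not fix their bit positions, so the layout below (last-stripe flag in bit 16, write mask in bits 12–15, read mask in bits 8–11, next address in bits 0–7) is purely our own for illustration:

```python
# Hypothetical bit packing of the control-word fields from Figure 10.
# The field order here is an assumption, not the documented layout.

def pack_control_word(next_addr, read_bus_mask, write_bus_mask, last):
    assert 0 <= next_addr < 256          # 8-bit next address
    assert 0 <= read_bus_mask < 16       # 4-bit read-bus field
    assert 0 <= write_bus_mask < 16      # 4-bit write-bus field
    return (last << 16) | (write_bus_mask << 12) | (read_bus_mask << 8) | next_addr

# Last virtual stripe in the config set: reads bus 0, writes bus 1,
# next address 5.
print(hex(pack_control_word(5, 0b0001, 0b0010, 1)))  # → 0x12105
```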
3.5 Memory Controllers
Currently, the memory architecture is only defined for the interface between the global busses and the memory. Each global bus has its own memory controller, which accesses the memory and supplies data to the global bus, or accepts data off the global bus and writes it to memory. Each memory controller keeps track of a current index to memory, and increments it after it transfers each piece of data to or from the memory. The design for the memory controllers was based on an analysis of the memory access patterns of data flow through the reconfigurable fabric. The majority of PipeRench applications, such as FIR and multiplication, are single pipelined. This means that data flows through the fabric starting at the first stripe and ending with the last stripe. Two memory controllers, one for data supplied to the first stripe and one for data output from the last stripe, are used to facilitate this common data flow pattern. In several other applications, the data flow is not single pipelined. In this case, the state of a stripe must be stored in memory, then retrieved for later use. (Section 3.2.7 covers state storage and restoration.) Sometimes only a store or a restore operation is needed, but not both, so that type of access is also permitted. Store and/or restore of a register is accomplished by asserting either the store bit or the restore bit in the configuration word. The store or restore data is then driven over global bus 0 or 1, respectively. (These two busses can be used for other access types if store and restore are not needed.) Each of the memory controllers must be programmed with its desired access pattern. (For simulation purposes, this is accomplished with the memconfig.data file – refer to Section 6.3.2 for details.) Five parameters are needed to describe each of the controllers:
• Access type — The first number describes which access type is desired, from those described above:
0. unused
1. Last stripe output
2. First stripe input
3. unused
4. unused
5. Store - bus 0 only
6. Store and restore - bus 1 only
7. Restore - bus 1 only (under construction)
8. unused
• Initial index — The second number represents the very first address to be used by memory.
• Stride value — The third number is the absolute value of the memory access stride.
• Period — The fourth number is the period of memory access. The period indicates how many stripe states must be stored or restored per iteration through the virtual stripes.
• Stride direction — The fifth and final number has a value of either 1 or 0 and dictates whether the stride is positive or negative, respectively.
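The five parameters above can be read as an address generator. A minimal sketch follows (the generator shape and names are ours; the access type is omitted since it does not change the addressing arithmetic):

```python
# Sketch of a memory controller's address sequence from the parameters
# above: initial index, stride magnitude, period, and stride direction
# (1 = positive, 0 = negative).

def controller_addresses(initial_index, stride, period, direction, iterations=1):
    """Yield one address per transfer, period transfers per iteration."""
    step = stride if direction == 1 else -stride
    addr = initial_index
    for _ in range(iterations * period):
        yield addr
        addr += step

# Three transfers per iteration, striding downward from address 100.
print(list(controller_addresses(initial_index=100, stride=2, period=3, direction=0)))
# → [100, 98, 96]
```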
4.0 Programming PipeRench
In designing a PipeRench application, one must consider the logic and routing resources available, and decide how they can be used most efficiently to carry out the procedures required to complete the task. For example, if we want to design a simple pipeline to multiply a B-bit variable input (call it J) by a constant number (say, 13), we have to realize that there is no arbitrary multiply function which is native to the processing elements. The tools available to us are 3-bit boolean functions (via the LUTs), carry logic (in the ripple-carry chain), and multiplication by 2 (using the barrel shifters). Combining the LUTs with the carry chain allows construction of a ripple adder without much difficulty. Incorporating the shifters allows partial product calculation using only powers of 2 for multiplication:

(8*J) + (4*J) + (1*J) = (13*J)

Assuming B=4, the product 13*J can require up to 8 bits to represent in binary, so we now know that we must add three 8-bit binary numbers. However, each PE can only input two 4-bit values (assuming we don’t use the 1-bit Xin as an operand). That presents two obstacles:
• How do we add two 8-bit numbers?
• How can we add together 3 of them?

The solutions:
• Using the carry-out from PE x and the carry-in from PE x+1, we link together two PEs to form an 8-bit ripple adder as described in Section 3.2.5.
• We use this 8-bit adder to add up the first 2 partial products and register the result. Then, in the next stripe, we can add the third number to find the final output.

Here is the code which implements our multiply-by-13 function. We bring in the J input from Global Bus 0 and output the result to Global Bus 1. Since CHASM is fairly straightforward, it should be evident what each line of code contributes to the configuration of the fabric. Details on the CHASM syntax are provided in the next section.
// Multiply-by-13
stripe First;
0.A = Global.0;
0.Cin = @0;
pe.0 = A;
pe.1 = 0;
2.A = 0.Out
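Independently of CHASM, the partial-product scheme can be sanity-checked in plain Python (this verifies the arithmetic only; it does not model PE width or carry chaining):

```python
# Check the decomposition used above: 13*J = 8*J + 4*J + J, computed with
# shifts and two additions (the first stripe adds 8*J + 4*J, a later
# stripe adds the remaining J).

def mul13(j):
    partial = (j << 3) + (j << 2)   # first stripe: 8*J + 4*J
    return partial + j              # next stripe: + 1*J

assert all(mul13(j) == 13 * j for j in range(16))
print(mul13(5))  # → 65
```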