Mapping multi-mode circuits to LUT-based FPGA using embedded MUXes

Tim Courtney, Richard Turner, Roger Woods
Queen’s University Belfast
{t.courtney, r.h.turner, r.woods}@ee.qub.ac.uk
Programmable Systems Laboratory, Electrical & Electronic Engineering, Ashby Building, Stranmillis Road, Belfast, BT9 5AH, Northern Ireland

Abstract
For some systems, a general-purpose FPGA solution tends to be large and slow. A reconfigurable solution is smaller and faster but has a delay associated with the reconfiguration. In this paper, embedded MUXes are used to achieve the performance of reconfiguration without the time penalty. For a CRC circuit, an area reduction of 93% compared to a general-purpose solution and a reduction of 17-34% compared to similar software-compiled systems is achieved.

1. Introduction
A model consisting of function units and a MUX has been proposed for reconfigurable circuits [1]. If a system can be mapped to this format, a reconfigurable solution can be applied. The delay penalty for reconfiguration is on an upward trend, because device size and configuration block size are increasing while configuration bus size is not. For a simple circuit on a Xilinx XC2V8000, the reconfiguration penalty is likely to be at least 22500 times the circuit delay [2,3]. In many instances full generality is not required, and reducing flexibility has previously given rise to an area gain for DSP circuits [4]. In this paper, a design strategy that avoids this reconfiguration penalty without incurring the overhead of general-purpose circuitry is suggested. It is applied first to a simple example and then to a 10-polynomial, 32-bit parallel CRC system.

2. Embedded MUX technique
A simple circuit containing a MUX selecting between a 5-input AND and a 5-input OR was described in VHDL and synthesised using Synplify Pro 7.0, resulting in the circuit shown in Figure 1. This circuit uses four LUTs.

Figure 1. Virtex circuit synthesised using Synplify Pro

The presence of the MUX at the output of Figure 1 indicates that a reconfigurable implementation can be used. In that case, two LUTs would implement the 5-way AND or the 5-way OR function, whichever is currently configured. The embedded MUX circuit is shown in Figure 2. The method involves identifying the common features (associativity and inputs) and partitioning the design in such a way that resource is left in each LUT to implement a MUX. This results in a 50% reduction in area, from four LUTs to two.

Figure 2. Embedded MUX circuit

3. Background on the Parallel CRC
The detailed workings of the CRC decoder can be found in [5]. It is based on a modulo-2 division operation. A full treatment of the parallel circuit can be found in [6], but a brief introduction is included here.

In the parallel CRC decode circuit the incoming data word is broken into blocks of j bits; the operations relating to these bits are performed in parallel. Thus, an m-bit data word is processed in m/j clock cycles. The main processing element in the parallel circuit is an array of XOR gates. The number of XOR gates and their connectivity are defined by the CRC generator polynomial and therefore differ between polynomials, although every XOR array, for a fixed generator length, has the same number of inputs and outputs. This connectivity is obtained from a series of modulo-2 matrix multiplications based on the generator polynomial. The resulting matrix is of size j by m and, for j = 8 and m = 32 as considered here, requires 4388 four-input LUTs to implement on a Virtex FPGA. The general structure of the parallel CRC circuit is given in Figure 3; Figure 4 then shows a particular example for the generator x^4 + x^3 + 1 with 4-bit input blocks.

Figure 3. General picture of the parallel CRC
[Labels recovered from the figure: input (j bits), j 2-way XORs, XOR array, registers on outputs 1,…,k, output(1,…,k-j), output(k-j+1,…,k).]
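As a rough illustration of how the XOR-array connectivity depends on the generator, the sketch below derives it for G = x^4 + x^3 + 1 with j = 4. It is not the authors' Matlab procedure: instead of explicit modulo-2 matrix multiplications it probes a bit-serial CRC model with one-hot inputs, which yields the same matrix because the circuit is linear over GF(2). The function names, the signal names (`s0`..`s3` for register bits, `d0`..`d3` for data bits) and the register ordering are all assumptions made for this example.

```python
from functools import reduce
from itertools import product
from operator import xor


def serial_block(state, data, g):
    """Shift the bits of `data` through a bit-serial CRC register, one per step.

    g[i] is the coefficient of x^i in the generator; the leading x^k term
    is implicit. The register ordering is an assumption of this sketch.
    """
    s = list(state)
    for d in data:
        fb = s[-1] ^ d  # feedback: top register bit XOR incoming data bit
        s = [fb & g[0]] + [s[i - 1] ^ (fb & g[i]) for i in range(1, len(g))]
    return s


def xor_array(g, j):
    """For each register input, list the signals that feed its XOR gate."""
    k = len(g)
    names = [f"s{i}" for i in range(k)] + [f"d{i}" for i in range(j)]
    rows = [[] for _ in range(k)]
    for n, name in enumerate(names):
        probe = [0] * (k + j)
        probe[n] = 1  # drive a single signal high, all others low
        for i, bit in enumerate(serial_block(probe[:k], probe[k:], g)):
            if bit:
                rows[i].append(name)
    return rows


def parallel_block(state, data, rows):
    """Process j data bits in one step using the derived XOR array."""
    signals = {f"s{i}": b for i, b in enumerate(state)}
    signals.update({f"d{i}": b for i, b in enumerate(data)})
    return [reduce(xor, (signals[n] for n in row), 0) for row in rows]


G = [1, 0, 0, 1]  # x^4 + x^3 + 1: coefficients of x^0 .. x^3
ROWS = xor_array(G, j=4)
```

Each entry of `ROWS` names the inputs of one XOR gate, so a different generator changes the connectivity but not the number of inputs or outputs, which is the property the embedded MUX approach relies on.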
4. Implementation details
A 10-polynomial system is considered, as this gives some flexibility without being too large. The matrices for the XOR array for each of these 10 polynomials were derived using Matlab™, and the circuits for the XOR array outputs were derived from them. The following paragraphs show only the circuits for the third output from the XOR array; the structure and design method for the other outputs were similar. The hexadecimal representations of the polynomials are:
FA8FC37F F6D15C19 DACEC37F A39431B7 F76D83AD
8A6C8B65 93A5A6DD A58B5D9B C7DCACA5 C7A9CF8D

Proceedings of the 10th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM’02) 1082-3409/02 $17.00 © 2002 IEEE
Figure 4. Parallel CRC circuit for G = x^4 + x^3 + 1, with j = 4

Given the logic resources of the Xilinx Virtex device used in this work, it makes sense to split the 10-way MUX (Figure 5) into a tree of 2-way MUXes. This splitting results in five groups of two to implement. Each group is then implemented as a circuit with two possible outputs, selected by a MUX, as shown in Figure 6. This output then feeds into the rest of the MUX tree. In the implementation of these five circuits, simple hardware sharing has been used: if a previously designed block provides the required output, it is re-used. In Figure 6 there are three LUT inputs labelled X, Y and Z; these are the outputs from three LUTs that are hardware shared. In the five circuits the aim is to have many MUXes embedded into the LUTs with the XOR gates, to eliminate redundancy in the first level of logic.

Four versions of the 10-polynomial CRC have been implemented. Two used commercial tools to compile behavioural VHDL to EDIF netlists, one used ten constant-polynomial circuits and MUXed between them, and the last was the proposed embedded MUX structure written in structural VHDL. The first three had low target speeds (1 MHz) for compilation to allow the tools to optimise for area. This was then changed to 35 MHz for place and route. Table 1 shows that the embedded MUX technique creates the smallest circuit.

Figure 5. Circuits for Output(3)
[A 10-way MUX selects between the following functions of the 8 inputs a, b, c, d, e, f, g, h:]
a⊕d⊕e⊕f, b⊕d, a⊕c⊕h, a⊕b⊕c⊕g, a⊕c⊕d⊕f, c⊕d, b⊕c⊕d⊕h, d⊕f, a⊕b⊕c⊕d⊕g, d⊕g⊕h
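To make the idea of embedding a MUX in a LUT concrete, the sketch below models a 4-input LUT as a 16-bit truth table and builds two examples. The first folds a 2-way MUX into the XOR logic for two of the Output(3) functions, b⊕d and c⊕d, which share the input d. Pairing these two functions is a hypothetical choice for illustration and is not taken from Figure 6. The second reproduces the 5-input AND/OR example in two LUTs. The helper names, the truth-table bit ordering and the polarity of the select signals are assumptions of this sketch.

```python
from itertools import product


def lut4_init(func):
    """Return the truth table of a 4-input function as a 16-bit integer.

    Bit (i3*8 + i2*4 + i1*2 + i0) holds func(i0, i1, i2, i3); this bit
    ordering is an assumption of the sketch.
    """
    init = 0
    for i3, i2, i1, i0 in product((0, 1), repeat=4):
        if func(i0, i1, i2, i3):
            init |= 1 << (i3 * 8 + i2 * 4 + i1 * 2 + i0)
    return init


def lut4(init, i0, i1, i2, i3):
    """Look up one output bit of a 4-input LUT."""
    return (init >> (i3 * 8 + i2 * 4 + i1 * 2 + i0)) & 1


# One LUT holding both XOR functions and the MUX that chooses between them:
# sel0 = 0 gives b XOR d, sel0 = 1 gives c XOR d (hypothetical pairing).
XOR_PAIR = lut4_init(lambda sel0, b, c, d: d ^ (c if sel0 else b))

# 5-input AND (s = 0) or OR (s = 1) in two LUTs: the select signal is fed
# to both stages, so no separate output MUX is needed.
STAGE1 = lut4_init(lambda s, a, b, c: (a | b | c) if s else (a & b & c))
STAGE2 = lut4_init(lambda s, t, d, e: (t | d | e) if s else (t & d & e))


def and_or5(s, a, b, c, d, e):
    """Evaluate the two-LUT embedded MUX version of the AND/OR circuit."""
    return lut4(STAGE2, s, lut4(STAGE1, s, a, b, c), d, e)
```

In both cases the select line occupies one LUT input, which is why the partitioning has to leave an input free in each LUT.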
Figure 6. The circuit as implemented
[Labels recovered from the figure: 2-input XORs, F5 MUX, F6 MUX, select lines sel0 to sel3, shared LUT outputs X, Y and Z, Output(3).]

Table 1. Comparison of circuits
Circuit            Circuit area (LUTs)  Target speed (MHz)  Relative area  Max. speed (MHz)
Synplify Pro       375                  35                  1.20           48.0
Xilinx Foundation  368                  35                  1.50           61.2
10 + MUX           399                  35                  1.28
Proposed           312                  35                  1.00           35.5
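The area figures can be cross-checked with a few lines of arithmetic. The sketch below recomputes the relative areas from the LUT counts in Table 1 and the percentage reductions quoted in the abstract. It assumes that the 4388-LUT XOR array from the background section is the general-purpose baseline behind the 93% figure, which the paper does not state explicitly. The names are illustrative only.

```python
# LUT counts as listed in Table 1.
LUTS = {
    "Synplify Pro": 375,
    "Xilinx Foundation": 368,
    "10 + MUX": 399,
    "Proposed": 312,
}

# Assumed general-purpose baseline: the 4388-LUT XOR array for j = 8, m = 32.
GENERAL_PURPOSE_LUTS = 4388


def relative_area(luts, baseline):
    """Area relative to the baseline circuit, rounded to two decimals."""
    return round(luts / baseline, 2)


def reduction_percent(new, old):
    """Percentage area saved by `new` compared with `old`."""
    return round(100 * (1 - new / old))


RELATIVE = {name: relative_area(n, LUTS["Proposed"]) for name, n in LUTS.items()}
VS_GENERAL = reduction_percent(LUTS["Proposed"], GENERAL_PURPOSE_LUTS)
VS_SYNPLIFY = reduction_percent(LUTS["Proposed"], LUTS["Synplify Pro"])
```

Under these assumptions the Synplify Pro and 10 + MUX rows reproduce the tabulated relative areas (1.20 and 1.28), and the 93% and 17% figures follow from 312 LUTs against 4388 and 375. Only the Xilinx Foundation row does not reproduce: 368/312 is about 1.18, whereas the table lists 1.50, so one of those two entries may have been garbled.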
5. Acknowledgements
The authors acknowledge the support of the Engineering and Physical Sciences Research Council (grant GR/98909).
6. References
[1] N. Shirazi, W. Luk, P. Cheung, “Automating Production of Run-Time Reconfigurable Designs”, Proc. IEEE Symp. on FCCM 1998, pp. 147-156, April 1998.
[2] T. Courtney, R. Turner, R. Woods, “Multiplexer Based Reconfiguration for Virtex Multipliers”, Field Programmable Logic and Applications, pp. 749-758, August 2000.
[3] “Virtex II Platform FPGA handbook”, p. 387, published online from www.xilinx.com, revised Dec. 2001.
[4] T. Courtney, R. Turner, R. Woods, “Implementation of Fixed Coefficient DSP functions using the reduced coefficient multiplier” (invited paper), Volume II, spec-L3.1, IEEE Conf. on Acoustics, Speech and Signal Processing, Special Session “Configurable Computing for DSP”, Salt Lake City, 7-11 May 2001.
[5] A. S. Tanenbaum, “Computer Networks”, Prentice Hall, 1981, pp. 128-131.
[6] T. Pei, C. Zukowski, “High-Speed Parallel CRC Circuits in VLSI”, IEEE Transactions on Communications, Vol. 40, No. 4, April 1992.