Logic Synthesis Importance in FPGA-based Designing of Information and Signal Processing Systems

Mariusz Rawski, Paweł Tomaszewicz, Tadeusz Łuba
Warsaw University of Technology, Institute of Telecommunications
Nowowiejska 15/19, 00-665 Warsaw, Poland, e-mail: [email protected]
Abstract – The goal of this paper is to promote the application of logic synthesis methods and tools in different tasks of modern digital design. The paper discusses functional decomposition methods that are currently being investigated, with special attention to balanced decomposition. Since technological and computer experiments with these methods produce promising results, this kind of logic synthesis will probably dominate the development of digital circuits for FPGA structures. Many examples are presented that confirm the effectiveness of the decomposition method in technology mapping for digital circuit design in cryptography and DSP applications.

Key Words: Logic Synthesis, Functional Decomposition, FPGA, VLSI.
I. INTRODUCTION¹

The influence of advanced logic synthesis procedures on the quality of hardware implementations of signal and information processing systems is especially important for applications targeted at FPGA structures based on look-up tables (LUTs). The direct cause is the imperfection of the technology mapping methods in wide use today, such as minimization and factorisation of Boolean functions, which were traditionally developed for structures based on standard cells. These methods transform Boolean formulas from sum-of-products form into a multilevel, highly factorised form that is then mapped into LUT cells. This process is at variance with the nature of the LUT cell, which from the point of view of logic synthesis can implement any logic function of a limited number of input variables. For this reason, decomposition is a much more efficient method for implementations targeted at FPGA structures. Decomposition allows a Boolean function to be synthesized into a multilevel structure built of components, each of which is a LUT logic block specified by a truth table.

The efficiency of functional decomposition has been demonstrated in many theoretical papers [1, 2, 3, 7]. However, there are relatively few papers in which functional decomposition procedures are compared with the analogous synthesis methods used in commercial design tools. The direct cause is the lack of appropriate interface software that would transform a description of a project structure obtained outside a commercial design system into a description compatible with that system's rules. Moreover, the computational complexity of functional decomposition procedures makes it difficult to construct efficient automatic synthesis procedures. These difficulties have been eliminated, at least partially, in so-called balanced decomposition [4, 6, 8].

In this paper, the DEMAIN system, which implements balanced decomposition, is presented first. Next, using the example of a simple binary-to-BCD encoder, the influence of the method on technology mapping in an FPGA structure is shown. Following that, the efficiency of decomposition procedures in hardware implementations of cryptographic and DSP algorithms is discussed.

¹ This paper was supported by the Polish State Committee for Scientific Research, financial grant number 4 T11D 014 24.
II. BALANCED FUNCTIONAL DECOMPOSITION

Here we review only the information necessary for an understanding of this paper. More detailed information on the balanced functional decomposition method can be found in [6, 8].

Balanced decomposition relies on partitioning a switching function, with either parallel or serial decomposition applied at each phase of the synthesis process.

In parallel decomposition, the set of output variables Y of a multi-output function F is partitioned into subsets Yg and Yh, and the corresponding functions G and H are derived so that, for either of these two functions, the input support contains fewer variables than the set of input variables X of the original function F. The objective of parallel decomposition is to minimize the input supports of G and H.

In serial decomposition, the set of input variables X is partitioned into subsets A and B, and functions G and H are derived so that the set of input variables of G is B ∪ C, where C is a subset of A; the set of input variables of H is A ∪ Z, where Z is the set of output variables of G; and H has fewer input variables than the original function F, i.e. F = H(A, G(B, C)).

Balanced decomposition is an iterative process in which, at each step, either parallel or serial decomposition of a selected component is performed. The process is carried out until all resulting subfunctions are small enough to fit blocks with a given number of input variables.

The idea of intertwining parallel and serial decomposition has been implemented in a program called DEMAIN (available at www.zpt.tele.pw.edu.pl). The tool is designed to aid the implementation of the combinational parts of digital systems and has two modes: automatic and interactive. It can also be used to reduce the number of inputs of a function when an output depends on only a subset of the inputs.
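The serial decomposition F = H(A, G(B, C)) described above can be illustrated with a small Python sketch. It is not the algorithm used in DEMAIN; it assumes the simplest disjoint case (C empty) and uses the classical column-grouping criterion: assignments of the bound set B that induce the same column over all assignments of the free set A can share one output code of G. The function `example` and the helper name `serial_decompose` are hypothetical and serve only as illustration.

```python
from itertools import product


def serial_decompose(f, n, bound):
    """Disjoint serial decomposition F(X) = H(A, G(B)) by column grouping.

    f     -- callable taking a tuple of n bits and returning the output value
    n     -- number of input variables
    bound -- indices of the bound set B (inputs of G); the rest form the free set A
    Returns (g, h, k): truth tables of G and H as dicts, and the number of G outputs.
    """
    free = [i for i in range(n) if i not in bound]
    a_rows = list(product((0, 1), repeat=len(free)))

    # Group the assignments of B by the column they induce over all assignments of A.
    classes = {}
    for b in product((0, 1), repeat=len(bound)):
        column = []
        for a in a_rows:
            x = [0] * n
            for i, v in zip(bound, b):
                x[i] = v
            for i, v in zip(free, a):
                x[i] = v
            column.append(f(tuple(x)))
        classes.setdefault(tuple(column), []).append(b)

    # G needs ceil(log2(number of distinct columns)) output bits.
    k = (len(classes) - 1).bit_length()

    g, h = {}, {}
    for code, (column, members) in enumerate(classes.items()):
        for b in members:
            g[b] = code              # G maps each B assignment to its class code
        for a, value in zip(a_rows, column):
            h[(a, code)] = value     # H reproduces F from A and the class code
    return g, h, k


def example(x):
    # Hypothetical 5-input function used only for illustration.
    return (x[0] ^ x[1] ^ x[2]) & (x[3] | x[4])


g, h, k = serial_decompose(example, 5, bound=[0, 1, 2])
print(k)  # number of output bits of G
```

For this example the three bound variables collapse into a single output of G, so H depends on three signals (two free inputs plus one output of G) instead of the original five, and both blocks would fit small LUTs.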
DEMAIN is thus a tool specially dedicated to FPGA-oriented technology mapping.

The influence of balanced decomposition on the final result of the FPGA-based mapping process is explained using an example circuit, a BIN2BCD converter, which converts binary numbers from 1 to 366 into their BCD codes. The BIN2BCD converter can be described in two styles: behavioural or data flow. The implementation based on the behavioural description (see Fig. 2) uses three digit registers, lda[], ldb[] and ldc[], in addition to the input shift register lb_r[]. The binary value lb[] is loaded into register lb_r[] when signal start is high. In the following steps, this value is converted into BCD format. In each step, if the condition ldb[] >= 5 or ldc[] >= 5 is satisfied, the contents of the corresponding register are increased by 3, and then the contents of registers ldc[], ldb[] and lda[] are shifted. After 12 clock cycles, the values from registers ldc[], ldb[] and lda[] are stored in the output register ld[], and signal ready goes high to indicate that the conversion is done.

    SUBDESIGN bin2bcd
    (
        lb[11..0], start, clock : INPUT;
        ld[11..0], ready        : OUTPUT;
    )
    VARIABLE
        lda[3..0], ldb[3..0]    : DFF;
        ldc[3..0]               : DFF;
        lb_r[11..0], lk[3..0]   : DFF;
        ld[11..0], ready        : DFF;
    BEGIN
        (lda[], ldb[], ldc[]).clk = clock;
        (lb_r[], lk[]).clk = clock;
        (ld[], ready).clk = clock;

        IF start THEN
            -- load the input value and the step counter
            lb_r[] = lb[];
            lk[] = 12;
        ELSE
            IF lk[] > 0 THEN
                -- add 3 to every digit that is >= 5, then shift left by one bit
                IF ldb[] >= 5 AND ldc[] >= 5 THEN
                    lda[] = (lda[2..0], B"1");
                    ldb[] = (ldb[2..0]+3, B"1");
                    ldc[] = (ldc[2..0]+3, lb_r[11]);
                ELSIF ldc[] >= 5 THEN
                    lda[] = (lda[2..0], B"0");
                    ldb[] = (ldb[2..0], B"1");
                    ldc[] = (ldc[2..0]+3, lb_r[11]);
                ELSIF ldb[] >= 5 THEN
                    lda[] = (lda[2..0], B"1");
                    ldb[] = (ldb[2..0]+3, B"0");
                    ldc[] = (ldc[2..0], lb_r[11]);
                ELSE
                    lda[] = (lda[2..0], B"0");
                    ldb[] = (ldb[2..0], B"0");
                    ldc[] = (ldc[2..0], lb_r[11]);
                END IF;
                lb_r[] = (lb_r[10..0], B"0");
                lk[] = lk[] - 1;
            ELSE
                -- conversion finished: hold the digits and present the result
                lda[] = lda[];
                ldb[] = ldb[];
                ldc[] = ldc[];
                ld[] = (lda[], ldb[], ldc[]);
                ready = B"1";
            END IF;
        END IF;
    END;

Fig. 2: AHDL behavioural description of the BIN2BCD conversion algorithm.

Such an implementation requires 60 logic cells and 41 flip-flops of the EPF10K10 device. However, the BIN2BCD converter can also be specified simply as a lookup table. This table, described in AHDL and placed in the binpla.tdf file, was synthesized with the QuartusII system. Its implementation in a FLEX10K circuit requires 114 logic cells. Thus, such an implementation is much
worse than the one obtained with behavioural synthesis. However, this solution has an important advantage: high performance, since it is a purely combinational circuit. Keeping this in mind, it is worth attempting additional optimisation. This can be done easily, since the circuit is described by a truth table and the functional decomposition procedures implemented in the DEMAIN software can be applied. In this scenario, the file with the AHDL description of the circuit is processed by DEMAIN, and the resulting binpla.ans file is then converted by the ANS2HDL application into an AHDL file that represents a multilevel structure with the behaviour of the original circuit. This structure is built of blocks sized so that each can be implemented in one FLEX logic cell. The implementation based on this multilevel description is very efficient, since it requires only 39 logic cells and is even faster than the previous implementation.

III. CRYPTOGRAPHY AND DSP APPLICATIONS

The presented results lead to the conclusion that the influence of balanced decomposition on the efficiency of practical digital system implementations will be particularly significant when the designed circuit contains complex combinational blocks. This is a typical situation when implementing cryptographic algorithms, where so-called substitution boxes are usually implemented as combinational logic. DEMAIN has been used in the implementation of such algorithms, allowing significant improvements in logic resource utilisation as well as in performance.

An implementation of the data path of the DES (Data Encryption Standard) algorithm with MAX+PlusII requires 710 logic cells and allows data to be encrypted with a throughput of 115 Mb/s. Applying balanced functional decomposition to optimise selected parts of the algorithm reduces the number of required logic cells to 296 and, rather than degrading performance, increases the throughput to 206 Mb/s [9].
Balanced functional decomposition was also used in the implementation of the Rijndael algorithm targeted at low-cost Altera programmable devices [10]. Applying the DEMAIN software allowed this algorithm to be implemented very efficiently in a FLEX10K200 circuit, with a throughput of 752 Mb/s. For comparison, implementations of Rijndael in the same programmable structure achieved 451 Mb/s at TSI (France) and the Technical University of Kosice (Slovakia), 316 Mb/s at George Mason University (USA), and 248 Mb/s at the Military University of Technology (Poland) [11].

Another interesting application of balanced decomposition can be found in DSP systems, particularly for circuits based on the distributed arithmetic (DA) concept. Given the dual use of
lookup tables as small memories, distributed arithmetic has already been an effective implementation choice for FPGA-based DSP circuits [5]. The DA is a method of computing the sum of products

$$y = \sum_{n=0}^{N-1} c[n] \cdot x[n].$$
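To make a table-based evaluation of this sum concrete, the following Python sketch writes each sample x[n] bit by bit and swaps the order of summation, so that the bits of all samples at one position form an address into a precomputed table of coefficient sums. It is only a sketch under stated assumptions: the samples are unsigned B-bit integers (sign handling for two's complement inputs is omitted), and the function names, coefficients and sample values are hypothetical.

```python
def build_da_table(coeffs):
    """Entry at address a holds the sum of c[n] over every bit n that is set in a."""
    n = len(coeffs)
    return [
        sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
        for addr in range(1 << n)
    ]


def da_dot_product(coeffs, samples, bits):
    """Bit-serial evaluation of y = sum(c[n] * x[n]) for unsigned `bits`-bit samples."""
    table = build_da_table(coeffs)
    acc = 0
    for b in range(bits):
        # Bit b of every sample forms the table address.
        addr = 0
        for i, x in enumerate(samples):
            addr |= ((x >> b) & 1) << i
        # Weight the looked-up partial sum by 2^b and accumulate.
        acc += table[addr] << b
    return acc


coeffs = [3, -1, 4, 2]   # hypothetical filter coefficients c[n]
samples = [5, 7, 0, 9]   # hypothetical unsigned 4-bit samples x[n]
y = da_dot_product(coeffs, samples, bits=4)
print(y == sum(c * x for c, x in zip(coeffs, samples)))
```

The table has 2^N entries for N coefficients, which is why a large table of this kind is a natural candidate for decomposition into smaller LUT-sized blocks.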
In many DSP applications, the coefficients c[n] are known a priori. Then, taking into account the fact that the input variable is a binary number

$$x[n] = \sum_{b=0}^{B-1} x_b[n] \cdot 2^b, \quad \text{where } x_b[n] \in \{0, 1\},$$
the whole formula can be implemented as shown in Fig. 3, where the contents of the lookup table include all the possible linear combinations of the filter coefficients and the bits of the incoming data samples act as the addresses for the table [5]. Utility programs that generate the lookup tables for filters with given coefficients can be found in the literature.

[Fig. 3 (image not recovered); surviving labels: X[0], X[1], ..., X[N–1], LUT, +/–, ACC]

However, we developed our own tools that