IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS
On Efficient Retiming of Fixed-Point Circuits
http://www.ntu.edu.sg/home/ASPKMeher/papers/TVLSI-Retiming.pdf
Pramod Kumar Meher, Senior Member, IEEE
Available in IEEE Xplore

Abstract—Retiming of digital circuits is conventionally based on estimates of the propagation delays across different paths in the data-flow graph (DFG) obtained with the discrete component timing model, which implicitly assumes that the operation of a node can begin only after the operation(s) of its preceding node(s) are complete, in order to obey the data dependence requirement. Such a discrete component timing model very often gives much higher estimates of propagation delay than the actual values, particularly when the computations in the DFG nodes correspond to fixed-point arithmetic operations such as additions and multiplications. On the other hand, it is very often imperative to deal with DFGs of such higher granularity at the architecture-level abstraction of digital system design for mapping an algorithm to the desired architecture, where the overestimation of propagation delay leads to unwanted pipelining and an undesirable increase in pipeline overheads. In this paper, we propose the connected component timing model to easily obtain adequately precise estimates of propagation delays across different combinational paths in a DFG, for efficient cutset retiming that reduces the critical path substantially without a significant increase in register complexity and latency. Apart from that, we propose novel node-splitting and node-merging techniques which can be used in combination with the existing retiming methods to reduce the critical path to a fraction of that of the original DFG with a small increase in overall register complexity.
Index Terms—Digital signal processing (DSP) hardware, retiming, cutset retiming, fixed-point arithmetic.

I. INTRODUCTION AND BACKGROUND

Retiming transformation changes the location(s) of delay element(s) in a circuit without changing its input-output characteristics [1], [2]. It has several applications in the design and optimization of synchronous circuits, e.g., reduction of clock period, reduction of the number of registers, and reduction of power consumption [1]–[4]. Cutset retiming is a special class of retiming which is performed by decomposing a data-flow graph (DFG) into two subgraphs by removing the edges of a given cutset, and then transferring a certain number of delays from (or to) the incoming edges of a subgraph to (or from) its outgoing edges. It is popularly used in architecture-level digital system design for the reduction of clock period. In the existing works, the retiming problem has been optimally solved where the propagation delays on different paths are assumed to be known [1]–[5]. Conventionally, retiming is based on the propagation delay information estimated by the discrete component timing model, which implicitly assumes that the operation of a DFG node can begin only after the completion of the operation(s) of its preceding node(s), to obey the data dependence requirement. It is shown that the discrete component timing model gives much higher estimates of propagation delays than the actual values when the DFG is represented at the granularity of arithmetic units, and leads to an undesirable increase in pipeline overhead in terms of register complexity and latency [6], [7].

Fig. 1. (a) Multiply-add circuit to compute R = P × Q + S. (b) 3-operand addition circuit to compute R = P + Q + S. It is assumed here that the word-length is not affected by the addition. However, generally, the word-length should increase by 1 bit after the additions.

Manuscript submitted on November 17, 2014, Revised April 21, 2015 and May 31, 2015; accepted June 26, 2015. This paper was recommended by Associate Editor Sachin S. Sapatnekar. The author is with the School of Computer Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore, 639798, Email: [email protected]. Information about the author is available at http://www.ntu.edu.sg/home/aspkmeher/

We demonstrate this issue by two simple examples in the following. The DFG in Fig.1(a) represents the computation of R = P × Q + S and the DFG in Fig.1(b) represents the computation of R = P + Q + S. The word-length of each of P, Q, and S is L. Conventionally, the propagation delays of the circuits in Fig.1(a) and Fig.1(b) are, respectively, considered to be

\[ T_{\text{MA}} = T_{\text{MULT}} + T_{\text{ADD}} \tag{1a} \]

and

\[ T_{\text{AA}} = 2\,T_{\text{ADD}} \tag{1b} \]
where TMULT and TADD, respectively, are the times required for a multiplication and an addition. The conventional estimate of propagation delay treats the adders and multipliers as discrete components, where the operation of one circuit begins only after the whole operation of the other is complete. In reality, however, the intermediate signals generated during the computation pass seamlessly across the combinational datapath, so that an operation can start as soon as the first output bits of its preceding operation(s) and other input bits (if any) are available; it need not wait for the whole of the preceding operation(s) to be over. The actual propagation delays of the circuits in Fig.1(a) and Fig.1(b) are, therefore, much less than these conventional estimates. In Table I, we list the propagation delays of a multiplier, an adder, and a multiply-add circuit from the report of Synopsys Design Compiler (DC) [8] generated after synthesis of the circuits using a 65 nm CMOS technology library. From Table I it can easily be seen that TMA is much less than TMULT + TADD. Similarly, in Table II, we list the propagation delays of multioperand adders. From Tables I and II we can see that the computation time of a 3-operand adder (T3OP-ADD) is much less than 2TADD and the computation time of a 4-operand adder (T4OP-ADD) is much less than 3TADD. We explain this behaviour of circuits in Fig.2. Fig.2(a) shows a multiply-add circuit for the computation of P × Q + S, where each of P and Q is a 4-bit word and
S is an 8-bit word. The product word T = P × Q is an 8-bit word¹. From this figure we can find that after the product T is computed by the multiplier, a 2-bit addition is required to add T to S and complete the multiply-add operation. The propagation delay of the multiply-add circuit is, therefore, given by

\[ T_{\text{MA}} = T_{\text{MULT}} + T_{\text{FA}} + T_{\text{FAC}} \tag{2a} \]

where TFA is the delay of a one-bit full-adder (FA), and TFAC is the time required by an FA to generate the output carry after the arrival of the input carry, provided that the other two input bits are available to the FA beforehand (shown in dashed lines in Fig.3). However, if we consider the increase in bit-width by 1 bit after the addition, we can find it to be

\[ T_{\text{MA}} = T_{\text{MULT}} + T_{\text{FA}} + 2\,T_{\text{FAC}} \tag{2b} \]

Fig.2(b) shows the computation of the 3-operand addition R = P + Q + S, where each of P, Q, and S is an 8-bit word. If we assume that the bit-width does not increase due to this addition, the sum word T = P + Q and R are also 8-bit words. From this figure we can find that after the sum T is computed by the first adder, the addition of T with S requires 1-bit full-adder time, TFA. Therefore, the propagation delay of 3-operand additions of the form R = P + Q + S is given by

\[ T_{\text{AA}} = T_{\text{ADD}} + T_{\text{FA}} \tag{3a} \]

However, if we consider the increase in bit-width after the addition, we can find it to be

\[ T_{\text{AA}} = T_{\text{ADD}} + T_{\text{FA}} + T_{\text{FAC}} \tag{3b} \]

¹ In this example it is assumed that the bit-width is not affected by the addition, although, in general, the bit-width is required to increase due to addition.

TABLE I
COMPUTATIONAL DELAY OF MULTIPLIER, ADDER, MULTIPLY-ADD CIRCUIT BASED ON 65 NM CMOS TECHNOLOGY LIBRARY.

multiplication time      addition time          multiply-add time
input size   TMULT       input size   TADD      input size   TMA       TFA            TFAC
8-bit        1.55        16-bit       1.38      (16, 8)      1.81      0.14 ∼ 0.17    0.09 ∼ 0.12
12-bit       2.29        24-bit       2.08      (24, 12)     2.55
16-bit       3.08        32-bit       2.77      (32, 16)     3.34

TFAC is the time required by an FA to generate the output carry after the arrival of input carry provided that the other two input bits are already available. All delays are in nanoseconds (ns).

TABLE II
COMPUTATIONAL DELAY OF MULTIOPERAND ADDER CHAIN BASED ON 65 NM CMOS TECHNOLOGY LIBRARY.

input size   T2OP-ADD   T3OP-ADD   T4OP-ADD   Diff-1   Diff-2
16-bit       1.38       1.55       1.57       0.17     0.02
24-bit       2.08       2.25       2.26       0.17     0.01
32-bit       2.77       2.94       2.96       0.17     0.02

All delays are in ns. TnOP-ADD is the computation time of n-operand adder. Diff-1 = T3OP-ADD − T2OP-ADD and Diff-2 = T4OP-ADD − T3OP-ADD.

Fig. 2. (a) Propagation delay of multiply-add circuit. RCA: ripple carry adder. (b) Propagation delay of 3-operand addition circuit.

Fig. 3. Propagation delays TFA and TFAC in a 1-bit full-adder.
The synthesis results in Tables I and II are in close conformity with the delay models of the multiply-add circuit and the 3-operand adder given by (2) and (3). The timing model used in the propagation delay estimates of (1) is referred to as the discrete component timing model, while that of (2) and (3) is referred to as the connected component timing model. The discrete component timing model could provide a precise estimate of propagation delay if we consider a gate-level description of the digital circuits. But very often it is imperative to deal with DFGs at the granularity of arithmetic operators at the architecture-level abstraction of digital system design for mapping an algorithm to the desired architecture, where the discrete component timing model overestimates propagation delays and leads to unwanted pipelining. In this paper, we show that based on the connected component timing model we can easily obtain adequately precise estimates of propagation delays on different paths in a DFG, and can use them for efficient retiming to reduce the critical path substantially without a significant increase in register complexity and latency. The main contributions of this paper are as follows:
• In Section II it demonstrates that conventional architecture-level retiming (based on the discrete component timing model) very often does not lead to an effective reduction in critical path or minimum sampling period, but leads to unwanted pipelining and an undesirable increase in pipeline overhead. Besides, it presents a better alternative retiming based on the proposed connected component timing model.
• In Section III it shows that some nodes of the DFG could be split and some could be merged before and after retiming to achieve a significant reduction in critical path. It demonstrates the use of splitting and merging of nodes for efficient retiming of some popular digital signal processing (DSP) circuits, e.g., finite impulse response (FIR) filters and basic recursive filters.
The use of the proposed retiming techniques in any given application is discussed briefly in Section IV. Conclusions and scope for further work are presented in Section V.
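To make the difference between the two timing models concrete, the following minimal Python sketch compares the discrete estimate (1b) and the connected estimate (3a) against the synthesized 3-operand adder delays of Table II. The function names are illustrative only, and the choice of TFA = 0.17 ns (the upper end of the range reported in Table I) is an assumption made for this check rather than a value fixed by the paper.

```python
# Delays in ns for 16-, 24-, and 32-bit operands, copied from Table II.
T2OP_ADD = [1.38, 2.08, 2.77]   # two-operand adder, i.e. T_ADD
T3OP_ADD = [1.55, 2.25, 2.94]   # synthesized three-operand adder

# Assumption: take the upper end of the T_FA range (0.14 to 0.17 ns).
T_FA = 0.17


def discrete_estimate(t_add):
    """Equation (1b): the second adder waits for the first to finish."""
    return 2 * t_add


def connected_estimate(t_add, t_fa):
    """Equation (3a): only one extra full-adder delay after the first adder."""
    return t_add + t_fa


rows = []
for t_add, t_syn in zip(T2OP_ADD, T3OP_ADD):
    rows.append((t_syn, discrete_estimate(t_add), connected_estimate(t_add, T_FA)))

for t_syn, t_disc, t_conn in rows:
    print(f"synthesized={t_syn:.2f}  discrete={t_disc:.2f}  connected={t_conn:.2f}")
```

Under this assumption the connected estimate tracks the synthesized delay closely, while the discrete estimate is larger by nearly one full adder delay TADD.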
II. LIMITATIONS OF CONVENTIONAL RETIMING AND LOW-OVERHEAD ALTERNATIVE RETIMING

In this Section we discuss the limitation of the popularly used conventional retiming of FIR filters and suggest a low-overhead alternative retiming based on the connected component timing model. Besides, in the case of lattice filters, we show that the 2-slow transformation does not help to reduce the minimum sampling period but increases the power consumption. We discuss alternative retiming of the lattice filter as well, and show the advantage of alternative schemes based on the connected component timing model.
A. Connected Component Timing Model

Very often in a DFG we find that an adder node follows a multiplication node or another adder node. As shown in (2) and (3), in such cases the combined propagation delay of the two adjacent nodes is much less than the sum of the propagation delays of the individual nodes. Therefore, the propagation delay of such connected nodes can be calculated together according to (2) and (3), and used while deciding on the retiming of circuits to achieve a certain critical path. Similarly, when a multiplication node follows an addition node, the propagation delay across them is much less than the sum of the propagation delays of the individual nodes. We refer to this as the connected component timing model, and show that it can be used to avoid unwanted pipelining and to reduce the corresponding pipeline overheads.

B. Retiming of FIR Filter

We discuss here the limitation of conventional retiming of the direct-form (DF) FIR filter, and propose a flexible retiming (FX-R) scheme based on precise estimation of propagation delay derived from the connected component timing model. The flexible retiming allows the cutsets to be selected according to the throughput requirement of the given application, where register complexity can be traded for performance.

1) Limitations of Conventional Retiming of FIR Filter: The DFG of the direct-form FIR filter of length N = 8 is shown in Fig.4(a). Assuming that the output adder-chain in Fig.4(a) is implemented by a binary adder-tree, based on the delay model given in (2) and (3), the critical path of an FIR filter of length N can be estimated to be

\[ T_{\text{DF}} = T_{\text{MULT}} + \lceil \log_2 N \rceil \, \Delta \tag{4a} \]

where

\[ \Delta = T_{\text{FA}} + T_{\text{FAC}} \tag{4b} \]

Fig. 4. (a) The DFG of FIR filter of length N = 8. (b) Decomposition of DFG to two subgraphs G1 and G2 for feed-forward cutset retiming of DFG. (c) The retimed DFG.

The dashed line in Fig.4(a) passes across a feed-forward cutset [2] to decompose the DFG into two subgraphs G1 and G2 as shown in Fig.4(b). The retimed DFG (shown in Fig.4(c)) can be obtained by adding a delay on all the edges going from subgraph G1 to subgraph G2. We consider here the input samples and coefficients to be L-bit words, and the product word to be of 2L bits. Moreover, the addition time of 2L-bit words could be comparable to the L-bit multiplication time. Therefore, the critical path of the direct-form retimed (DF-R) FIR filter (Fig.4(c)) could generally be determined by the time required by the final adder, and is given by

\[ T_{\text{DF-R}} = T_{\text{ADD}} + (\lceil \log_2 N \rceil - 1)\,\Delta + T_{\text{D-FF}} \tag{5} \]
where ∆ is given by (4b), TD-FF is the delay of a D flip-flop, and TADD is the time required for 2L-bit additions. Since the addition time of 2L-bit words is comparable to the L-bit multiplication time, from (4) and (5) we can find that the conventional retiming of Fig.4(b) does not help to achieve a significant reduction of critical path. Note that the critical path of the retimed DFG (Fig.4(c)) increases significantly as the filter length increases. The register complexity of the original DFG amounts to (N − 1)L bit-registers while that of the retimed DFG (Fig.4(c)) amounts to (3N − 1)L bit-registers. The register complexity, therefore, increases by nearly 3 times due to this retiming.

2) Flexible Retiming of FIR Filter Based on Connected Component Timing Model: We discuss here a flexible scheme for the retiming of the DFG of Fig.4(a). To perform the desired retiming, the direction of the accumulation path in the DFG is reversed (as shown in Fig.5(a)), and the cutsets across the dashed lines (after each gray box) are considered one after the other for retiming. During each of those retimings, the edges of a cutset are removed to decompose the DFG into a pair of subgraphs, and the delay on the upper edge of the subgraph is moved to the lower edge, to finally get the retimed DFG shown in Fig.5(b). The critical path of this retimed DFG is shown by the dashed line. Based on the delay model of (2) and (3), the critical path of the retimed DFG can be obtained to be

\[ T_{\text{FX-R}} = T_{\text{MULT}} + 2\,T_{\text{FA}} + (\lceil \log_2 N \rceil + 1)\,T_{\text{FAC}} + T_{\text{D-FF}} \tag{6} \]
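The three critical-path expressions (4a), (5), and (6) can be evaluated side by side to see how each grows with the filter length. The sketch below is only an illustration of the formulas: the function names are ours, and all delay values passed in the usage example (in particular the flip-flop delay TD-FF, which is not tabulated in this paper) are placeholder assumptions, not synthesis results.

```python
def ceil_log2(n):
    """Return ceil(log2(n)) for a positive integer n, using exact integer arithmetic."""
    return (n - 1).bit_length()


def t_df(n, t_mult, t_fa, t_fac):
    """Equation (4a): direct-form FIR filter, with Delta = T_FA + T_FAC from (4b)."""
    return t_mult + ceil_log2(n) * (t_fa + t_fac)


def t_df_r(n, t_add, t_fa, t_fac, t_dff):
    """Equation (5): conventionally retimed direct-form FIR filter."""
    return t_add + (ceil_log2(n) - 1) * (t_fa + t_fac) + t_dff


def t_fx_r(n, t_mult, t_fa, t_fac, t_dff):
    """Equation (6): flexible retiming based on the connected component model."""
    return t_mult + 2 * t_fa + (ceil_log2(n) + 1) * t_fac + t_dff


# Placeholder delays in arbitrary time units (assumed, for illustration only).
params = dict(t_fa=0.2, t_fac=0.1, t_dff=0.1)
for n in (8, 16, 32, 64, 128):
    print(n,
          round(t_df(n, 3.0, 0.2, 0.1), 2),
          round(t_df_r(n, 2.8, **params), 2),
          round(t_fx_r(n, 3.0, **params), 2))
```

Each doubling of N adds ∆ = TFA + TFAC to (4a) and (5) but only TFAC to (6), which is why the flexible retiming overtakes the conventional one as the filter length grows.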
We can also consider cutsets after 4 multiply-add sections, or after 8 such sections, or 16 sections to reduce the register-complexity at the cost of a small increase in the critical path by
The register overhead of the flexible retiming could be reduced further by a factor of 2 or 4 by reducing the number of cutsets at the cost of marginal increase of critical path by TFA or 2TFA , respectively.
Fig. 5. (a) Cutset selection for the proposed retiming of DFG of FIR filter of length N = 8. (b) The retimed DFG.

TABLE III
THE MINIMUM CLOCK PERIOD ACHIEVED BY CONVENTIONAL RETIMING OF DIRECT-FORM FIR FILTER AND PROPOSED FLEXIBLE RETIMING.

Filter        Minimum Clock Period (CP)     % Reduction of CP
Length, N     DF       DF-R     FX-R        DF-R     FX-R
8             3.99     3.54     3.77        11.3     5.5
16            4.27     3.82     3.86        10.5     9.6
32            4.55     4.10     3.97        9.9      12.7
64            4.83     4.37     4.08        9.5      15.5
128           5.11     4.65     4.19        9.0      18.0

Legends: DF: direct-form, DF-R: Direct-form retimed, and FX-R: Flexible-retimed. Word-length = 16 bits.
TFA, 2TFA, or 3TFA, respectively. Therefore, the number of registers can easily be traded for critical path by the proposed retiming.

3) Analysis and Comparison:
1) From (5) and (6) we can find that conventional retiming of the direct-form FIR filter has a lower critical path for small filter lengths, but for higher filter lengths the proposed retiming of Fig.5 involves a lower critical path since TFAC is much smaller than ∆.
2) The conventional retimed direct-form FIR filter requires N additional registers of 2L-bit size, while the proposed one requires only NL additional bit-registers.
3) In the proposed scheme, critical path can be easily traded for register complexity and vice versa.

We have synthesized the circuits corresponding to the original direct-form DFG, the conventional direct-form retimed (DF-R) DFG (Fig.4(c)), as well as the proposed flexible retimed (FX-R) DFG (Fig.5(b)) of FIR filters of various lengths for 16-bit word-size by Synopsys Design Compiler using a CMOS 65 nm technology library without any constraints. The minimum clock periods obtained from the synthesis results are listed in Table III. We can find that conventional retiming of the direct-form FIR filter reduces the minimum possible clock period by ≈ 10%, on average, for filter lengths N = 4, 8, 16, 32, 64, 128, while the flexible retiming of Fig.5 reduces the minimum possible clock period by ≈ 12%. The retiming of Fig.4 is found to have a lower critical path compared with that of Fig.5 up to filter length N = 16, while for filter lengths higher than 32, it involves a substantially higher critical path. But the register complexity of Fig.4 is twice that of the proposed one.
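The percentage reductions quoted above follow directly from the clock periods in Table III. The short Python sketch below recomputes them; the variable and function names are illustrative only, and the averages are taken over the five filter lengths that actually appear in the table.

```python
# Minimum clock periods in ns, copied from Table III.
LENGTHS = [8, 16, 32, 64, 128]
DF = [3.99, 4.27, 4.55, 4.83, 5.11]      # original direct form
DF_R = [3.54, 3.82, 4.10, 4.37, 4.65]    # conventional retiming
FX_R = [3.77, 3.86, 3.97, 4.08, 4.19]    # proposed flexible retiming


def percent_reduction(original, retimed):
    """Reduction of clock period relative to the original, in percent."""
    return 100.0 * (original - retimed) / original


red_df_r = [percent_reduction(o, r) for o, r in zip(DF, DF_R)]
red_fx_r = [percent_reduction(o, r) for o, r in zip(DF, FX_R)]

for n, a, b in zip(LENGTHS, red_df_r, red_fx_r):
    print(f"N={n:3d}  DF-R: {a:.1f}%  FX-R: {b:.1f}%")

avg_df_r = sum(red_df_r) / len(red_df_r)
avg_fx_r = sum(red_fx_r) / len(red_fx_r)
print(f"average  DF-R: {avg_df_r:.1f}%  FX-R: {avg_fx_r:.1f}%")
```

The per-length values reproduce the last two columns of Table III, and the averages are consistent with the ≈ 10% and ≈ 12% figures stated in the text.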
C. Retiming of Lattice Filters and Limitation of 2-Slow Transformation

A lattice filter is an all-pass filter popularly used as a phase equalizer in audio processing systems. Fig.6 shows the DFG of an N-stage lattice filter. Each stage of this filter consists of 4 nodes for computing 2 multiplications and 2 additions. The critical path of the DFG is shown by the dashed line in Fig.6.

1) Retiming of Lattice Filters Based on 2-Slow Transformation: Based on the discrete component timing model, the critical path of an N-stage lattice filter is considered to be 2TMULT + (N + 1)TADD. In order to reduce this critical path, conventionally the DFG of Fig.6 is retimed in two phases [2]. In the first phase, a 2-slow transformation of the DFG is performed, where each delay element ‘D’ is replaced by ‘2D’. The DFG of the lattice filter of Fig.6 after 2-slow transformation is shown in Fig.7(a). For cutset retiming, the (N − 1) cutsets intercepted by the vertical dashed lines are considered, where each cutset consists of a pair of edges: the lower edge, which is directed from right to left, and the upper edge, which is directed from left to right. In the second phase of retiming, one delay from the lower edge is moved to the upper edge for each of the cutsets. The retimed DFG is shown in Fig.7(b). The critical path of the retimed DFG is conventionally estimated to be 2(TMULT + TADD). Considering TMULT = 2 unit time (ut) and TADD = 1 ut, the critical path (or the minimum clock period) is assumed to be reduced from (N + 5) ut to 6 ut by the retiming. For a 100-stage lattice filter (N = 100), the critical path is believed to be reduced from 105 ut to 6 ut by the retiming. During 2-slow transformation it is assumed that the filter takes the input samples in alternate clock cycles. The sample period, therefore, gets doubled due to 2-slow transformation.
Accordingly, the minimum sample period achieved by the cutset retiming preceded by a 2-slow transformation is considered to be 12 ut. Based on the delay model given in (2) and (3), considering that the bit-width does not increase after addition and the result of addition is rounded to the input bit-width L, the critical path of the retimed DFG of Fig.7(b) can be found to be 2TMA, and can be given by

\[ T_{\text{LATTICE(C)}} = 2\,(T_{\text{MULT}} + T_{\text{FA}} + T_{\text{FAC}}) \tag{7} \]

Accordingly, the minimum sampling-period of the retimed DFG of Fig.7(b) can be found to be

\[ S_{\text{LATTICE(C)}} = 4\,(T_{\text{MULT}} + T_{\text{FA}} + T_{\text{FAC}}) \tag{8} \]
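The unit-time figures used in this discussion (105 ut, 6 ut, and 12 ut) and the connected-model expressions (7) and (8) can be reproduced with a few lines of Python. This is a sketch of the arithmetic only; the function names are ours, and the small TFA and TFAC values in the last call are placeholder assumptions.

```python
def lattice_cp_discrete(n_stages, t_mult, t_add):
    """Critical path of the original N-stage lattice filter (discrete model)."""
    return 2 * t_mult + (n_stages + 1) * t_add


def lattice_retimed_cp_discrete(t_mult, t_add):
    """Critical path after 2-slow transformation and cutset retiming (discrete model)."""
    return 2 * (t_mult + t_add)


def lattice_retimed_cp_connected(t_mult, t_fa, t_fac):
    """Equation (7): the same retimed DFG under the connected component model."""
    return 2 * (t_mult + t_fa + t_fac)


def two_slow_sampling_period(clock_period):
    """A 2-slow system accepts one input sample every two clock cycles."""
    return 2 * clock_period


cp_original = lattice_cp_discrete(100, t_mult=2, t_add=1)
cp_retimed = lattice_retimed_cp_discrete(t_mult=2, t_add=1)
msp_retimed = two_slow_sampling_period(cp_retimed)
print(cp_original, cp_retimed, msp_retimed)

# Equation (8) with placeholder full-adder delays (assumed values).
print(two_slow_sampling_period(lattice_retimed_cp_connected(2, 0.1, 0.05)))
```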
Note that the TLATTICE(C) and SLATTICE(C) given by (7) and (8), respectively, are much less than that of the original filter. 2) Exploring Retiming Without 2-Slow Transformation: The minimum sampling period (MSP) achieved by retiming with 2-slow transformation is expected to be 4(TMULT +TADD ) according to discrete component timing model. But, we show here that it is not necessary to perform slow-down transformation for retiming of lattice filter to achieve that MSP. The
sampling period of nearly 4TMULT can be achieved by a direct cutset retiming across the (N/2 − 1) cutsets shown by the dashed lines in Fig.8(a) after alternate stages of the DFG of the N-stage lattice filter of Fig.6, when we estimate the critical path by the connected component timing model. The delay on the lower edge of each of the cutsets of Fig.8(a) is moved to the upper edge to get the retimed DFG shown in Fig.8(b). Based on the delay model given in (2), the critical path of the proposed retimed DFG of the lattice filter (Fig.8(b)) can be found to be nearly that of 4 multiply-add operations, given by

\[ T_{\text{LATTICE(P)}} = 4\,(T_{\text{MULT}} + T_{\text{FA}} + T_{\text{FAC}}) \tag{9} \]

Fig. 6. An N-stage lattice filter. The dashed line shows the critical path.

Fig. 7. (a) Two-slow transformation of the DFG of the N-stage lattice filter of Fig.6. (b) Conventional retimed DFG.

Fig. 8. (a) Cutset selection for the proposed retiming of the DFG of 100 stage lattice filter of Fig.6. (b) Proposed retimed DFG.

TABLE IV
COMPARISON OF SYNTHESIS RESULTS OF RETIMING OF LATTICE FILTER WITH AND WITHOUT 2-SLOW TRANSFORMATION.

No. of      With 2-Slow Transformation      Retiming Without Slowdown
Stages      MSP      Area       EPC         MSP      Area       EPC
4           9.22     9662       15.3        8.06     9225       15.09
8           9.22     18989      26.2        8.06     18117      25.88
16          9.22     37643      47.9        8.06     35899      47.41
32          9.22     74950      91.0        8.06     71464      90.25
64          9.22     149565     176.5       8.06     142595     175.57

MSP: Minimum sampling period in ns, EPC: Energy per clock cycle in pJ.

3) Analysis and Comparison:
1) Comparing (8) and (9), we can find that the critical path of the proposed retimed lattice filter is the same as the sampling period achieved by 2-slow transformation as shown in Fig.7(b).
2) The minimum sampling-period achieved by the retimed DFG of Fig.8(b) is the same as its critical path since no slow-down transformation is performed. Therefore, the minimum sampling-period achieved by the proposed cutset retiming is the same as that of the retiming after 2-slow transformation.
3) The conventional 2-slow transformation based retiming involves N additional registers, while the proposed one does not involve any additional registers for retiming.

To establish the limitation of 2-slow transformation for retiming of lattice filters, and to show the advantage of retiming without 2-slow transformation, we have synthesized the retimed lattice filters of Fig.7(b) and Fig.8(b) by Synopsys Design Compiler using a CMOS 65 nm technology library without any constraints [8]. The minimum clock period, area,
and power consumption of lattice filters with different numbers of stages are obtained from the synthesis results for 16-bit input signals and 8-bit coefficient values. The minimum clock period is the same as the MSP for the retimed filter of Fig.8(b), but the MSP of the 2-slow filter (Fig.7(b)) is twice its minimum clock period. Energy consumption per clock cycle (EPC) is estimated from the power consumption. The calculated values of MSP and EPC along with the area are listed in Table IV. We can find that the filter retimed without slow-down transformation involves slightly less area, less MSP, and slightly less EPC than the other. The energy consumption per sample for the filter retimed after 2-slow transformation is nearly twice that of the other, since it requires two clock cycles to produce each output sample. The retiming of lattice filters based on 2-slow transformation [2] has no advantage over the direct retiming, but consumes nearly twice the energy per output sample compared to the proposed retiming.
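The energy-per-sample comparison can be checked numerically from Table IV. The sketch below assumes, as stated in the text, that the 2-slow filter spends two clock cycles per output sample and the directly retimed filter spends one; the variable names are illustrative only.

```python
# Energy per clock cycle in pJ, copied from Table IV (4, 8, 16, 32, 64 stages).
STAGES = [4, 8, 16, 32, 64]
EPC_TWO_SLOW = [15.3, 26.2, 47.9, 91.0, 176.5]
EPC_DIRECT = [15.09, 25.88, 47.41, 90.25, 175.57]

# Minimum sampling periods in ns, copied from Table IV.
MSP_TWO_SLOW = 9.22
MSP_DIRECT = 8.06

CYCLES_PER_SAMPLE_TWO_SLOW = 2   # one sample every two clock cycles
CYCLES_PER_SAMPLE_DIRECT = 1     # one sample every clock cycle

energy_ratio = [
    (CYCLES_PER_SAMPLE_TWO_SLOW * e2) / (CYCLES_PER_SAMPLE_DIRECT * e1)
    for e2, e1 in zip(EPC_TWO_SLOW, EPC_DIRECT)
]
msp_saving_percent = 100.0 * (MSP_TWO_SLOW - MSP_DIRECT) / MSP_TWO_SLOW

for n, r in zip(STAGES, energy_ratio):
    print(f"{n:2d} stages: energy per sample ratio (2-slow / direct) = {r:.2f}")
print(f"MSP reduction without slow-down: {msp_saving_percent:.1f}%")
```

For every filter size the ratio stays just above 2, which matches the statement that the 2-slow design consumes nearly twice the energy per output sample.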
III. NODE-MERGING AND NODE-SPLITTING FOR EFFICIENT RETIMING

We discuss here the proposed technique for the reduction of critical path by combining retiming with node-splitting and node-merging. Besides, we discuss the application of the proposed technique to FIR filters and to simple examples of recursive computation.
A. Splitting and Merging of DFG Nodes

It is quite simple to split an adder circuit into two nearly equal parts, as shown in Fig.9 for a ripple carry adder that performs the addition T = P + Q, where P and Q are 8-bit words. Similar splitting can be performed for carry-lookahead adders also. In a recent paper [7], we have shown that it is quite easy to decompose a Wallace-tree multiplier [9] into two parts, where each part involves nearly half of the propagation delay of a multiplier, for fine-grained seamless pipelining of multiplier and multiply-add circuits. Such a scheme is shown in Fig.10 for the decomposition of a multiplier for the computation of T = P × Q, where each of P and Q is an 8-bit word. When we need to add a large number of words, we can use a carry-save reduction (CSR) section. By a CSR section, M operands can be reduced to a pair of words by x reduction stages, where ⌈M(2/3)^x⌉ = 2. It is easy to test that x = 2(log2 M − 1) for M = 4, 8, 16, 32, and 64. Each CSR stage involves only a 1-bit full-adder delay, TFA. The carry and the sum words generated by the CSR section can be added together by a final adder. As shown in [7], it is simple to split a multioperand adder into two parts having nearly equal propagation delays for pipeline implementation. One such circuit is shown in Fig.11 for the addition of eight 8-bit words.

B. Retiming Combined with Splitting and Merging of Nodes

We demonstrate here the proposed technique for retiming combined with splitting and merging of DFG nodes in three typical examples.

1) Retiming of a Basic Recursive Computation: Let us consider the conventional retiming of the DFG of a basic recursive computation [2] given by

\[ y[n] = x[n] + h \cdot y[n-1] \tag{10} \]
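Returning briefly to the carry-save reduction count stated in Section III-A, the claim that x = 2(log2 M − 1) stages satisfy ⌈M(2/3)^x⌉ = 2 for M = 4, 8, 16, 32, and 64 can be tested mechanically. The following Python sketch does so with exact rational arithmetic; the function names are ours and the check covers only the operand counts listed in the text.

```python
import math
from fractions import Fraction


def csr_stages(m):
    """x = 2 * (log2(M) - 1), for M a power of two with M >= 4."""
    log2_m = m.bit_length() - 1          # exact log2 for powers of two
    return 2 * (log2_m - 1)


def words_after_reduction(m, x):
    """Evaluate ceil(M * (2/3)**x) exactly, as in the condition of Section III-A."""
    return math.ceil(m * Fraction(2, 3) ** x)


results = {}
for m in (4, 8, 16, 32, 64):
    x = csr_stages(m)
    results[m] = (x, words_after_reduction(m, x))
    print(f"M={m:2d}  x={x:2d}  ceil(M*(2/3)^x)={results[m][1]}")
```

Using `Fraction` avoids any floating-point rounding near the ceiling boundary, so the check depends only on the formula itself.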
The block diagram of the computation of (10) is shown in Fig.12 and the retiming of the corresponding DFG is shown in Fig.13. The parentheses near the nodes denote the relative computation times in terms of unit time (ut). In this example we consider the input samples {x[n]}, the coefficient h, and the output y[n] to be L-bit words. The product ‘h · y[n]0 is, therefore, a 2L-bit word 2 . The time required for the addition 2 If we consider rounding of product word to L-bits, the delay assumption FIGURE-9 will change, but that will not affect the critical path of the retimed DFG.
p7 q6
q7
t8
p6 q 5
FA
FA
t7
t6
p5 q4 FA
t5
q3
p4
p3 q2
p1 q0
p2 q1
FA
FA
FA
t4
t3
t2
FA
p0
HA
t1
t0
Fig. 9. Splitting of a ripple-carry adder circuit.
P
Booth recoding and carry‐save reduction stages
Q s14 c10 HA
t15
c9
x13 c8
FA
t14
s12 c7
FA
t13
s11 c6
FA
t12
s10 c5
FA
t11
s9 c4
FA
t10
s8 c3
FA
t9
s7 c2
FA
t8
Fig. 10. Splitting of multiplier circuit.
s6 c1
FA
t7
s5
s4
s3
s2
s1
s0
t4
t3
t2
t1
t0
c0
FA
t6
HA
t5
6
Slow-down transformation
– F For L-slow L l transformationf i replace l each h delay d l D by b LD. – (L-1) # of zero samples must be interleaved after each useful signal. Will have (L-1) nullsave operations. carry carry‐save reduction circuit for multi‐operand addition reduction circuit for multi operand addition – A 2-slow system will take s5 s6 cycles. s4 s3 s2 s1 s0 s8 insalternate 7 inputs9samples cycles –c8 Will ccycle. c4 cin3 everyc2odd cclock c7 have c6null coperations 1 0 5 HA
FA
FA
FA
FA
r(2) 7
(2) r6A
FA
(2) r10 A
r9
FA
FA
r4
(2) r M 3
HA
2D
D
r8
M
r5
D
r(2) 2
A
Fig. 11. Splitting of multioperand adder circuit.
r1
(2) M
r0 D
Fig. 12. Block diagram of the computation of (10).
Fig. 13. (a) The DFG of the computation of (10). (b) Two-slow transformation of the DFG. (c) The retimed DFG.
[Figure annotations: combining cutset retiming and slow-down transformation. For an L-slow transformation, each delay D is replaced by LD, and (L − 1) zero samples must be interleaved after each useful sample, giving (L − 1) null operations. A 2-slow system takes input samples in alternate clock cycles and has null operations in every odd clock cycle.]
of the 2L-bit word ‘h · y[n − 1]’ with x[n] can be considered nearly the same as (or comparable to) that of the computation of the product h · y[n − 1]. As discussed in the case of lattice filters in Section II, here also retiming is performed in two stages: a 2-slow transformation (shown in Fig.13(b)) followed by delay migration from the upper edge to the lower edge. The retimed DFG is shown in Fig.13(c). The critical path of the retimed DFG is 2 ut, and the minimum sampling-period is 4 ut due to the 2-slow transformation.
The multiplication node and the addition node of the DFG of Fig.13(a) can be split into two pairs of nodes (m1, m2) and (a1, a2), respectively, as shown in Fig.14(a), where the delay of each node is assumed to be 1 ut. The edge from a1 to a2 denotes the transfer of the carry output of a1. This DFG can be split into two subgraphs G1 and G2 (shown in Fig.14(b)) across the cutset indicated by the dashed line in Fig.14(a). Adding one delay on each edge from G2 to G1 and subtracting one delay from the edge of the cutset from G1 to G2, we get the retimed DFG shown in Fig.14(c). This DFG can again be retimed across the cutset indicated by the dashed line to get the retimed DFG shown in Fig.14(d). The nodes a1 and m1 (in the gray area) can then be merged to form a multiply-add node N1, and similarly, nodes a2 and m2 can be merged to form a node N2, as shown in Fig.14(d).
The critical path of the DFG in Fig.14(d) is nearly 2 ut without any slow-down, while the conventional retiming reaches a critical path of 2 ut only after the 2-slow transformation. After 2-slow transformation and cutset retiming, the proposed retimed DFG of Fig.14(d) can achieve a critical path of (1 + ∆) ut, where ∆ is given by (4b). Since ∆ is quite small, the proposed retimed DFG can support a minimum sampling-period of nearly 2 ut, while the conventional retiming (by 2-slow transformation) still has a critical path of 2 ut and a minimum sampling-period of 4 ut. The proposed retiming with node decomposition, therefore, can support nearly 2 times the maximum sampling rate of the conventional retiming.
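The conventional two-stage procedure described above (2-slow transformation followed by delay migration) can be sketched in a few lines of Python. This is only an illustrative model, not the author's tool: the function names, the edge-list representation, and the two-node graph standing in for Fig.13(a) (an adder A and a multiplier M of 2 ut each, with one delay on the edge from A to M) are assumptions made here. Path delays are summed as in the discrete component timing model, and the retiming rule is the standard one of [1], w_r(u→v) = w(u→v) + r(v) − r(u).

```python
from functools import lru_cache


def critical_path(delays, edges):
    """Longest computation time over register-free paths.

    delays: dict node -> computation time (in ut)
    edges:  list of (src, dst, registers)
    Assumes the zero-register subgraph is acyclic (no combinational loop).
    """
    succ = {node: [] for node in delays}
    for src, dst, regs in edges:
        if regs == 0:
            succ[src].append(dst)

    @lru_cache(maxsize=None)
    def longest_from(node):
        return delays[node] + max((longest_from(n) for n in succ[node]), default=0)

    return max(longest_from(node) for node in delays)


def slow_down(edges, factor):
    """L-slow transformation: replace each delay D by L*D."""
    return [(src, dst, factor * regs) for src, dst, regs in edges]


def retime(edges, r):
    """Apply retiming values r: w_r(u->v) = w(u->v) + r(v) - r(u)."""
    retimed = []
    for src, dst, regs in edges:
        new_regs = regs + r.get(dst, 0) - r.get(src, 0)
        if new_regs < 0:
            raise ValueError(f"negative register count on edge {src}->{dst}")
        retimed.append((src, dst, new_regs))
    return retimed


# Hypothetical model of Fig.13(a): adder A and multiplier M, 2 ut each.
delays = {"A": 2, "M": 2}
original = [("A", "M", 1), ("M", "A", 0)]

two_slow = slow_down(original, 2)
retimed = retime(two_slow, {"A": 1})  # moves one delay from A->M to M->A

clock_original = critical_path(delays, original)
clock_retimed = critical_path(delays, retimed)
sampling_period = 2 * clock_retimed   # one useful sample every 2 clock cycles
print(clock_original, clock_retimed, sampling_period)
```

Under these assumptions the retimed graph has one delay on each edge, a critical path of 2 ut, and a sampling period of 4 ut, which agrees with the figures quoted for Fig.13(c).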
Fig. 14. (a) The DFG of the computation of (10) after splitting of nodes. a1 and m1 represent the less significant parts and a2 and m2 represent the more significant parts. (b) Decomposition of the DFG into 2 subgraphs G1 and G2. (c) The retimed DFG. (d) The retimed DFG after merging of nodes.
Fig. 17. (a) Retimed DFG of Fig.16(b). (b) Merging of nodes in the reoriented retimed DFG.
Fig. 18. (a) The DFG of direct-form configuration of FIR filter of length N = 4. (b) Splitting of nodes of the DFG.
2) Retiming of an Infinite Impulse Response Filter: Let us consider an infinite impulse response (IIR) filter given by
y[n] = h1 · y[n − 2] + h2 · y[n − 3] + x[n]    (11)
As discussed in [2], the computation of (11) can be represented by the DFG of Fig.15(a) and conventionally retimed as shown in Fig.15(b). The critical path of the retimed DFG is nearly 2 ut, which is the same as that of the original DFG (Fig.15(a)). The multiplication nodes ‘M’ and the addition nodes ‘A’ of the DFG shown in Fig.15(a) can each be split into two parts, (m1, m2) and (a1, a2), having nearly equal propagation delays, as shown in Fig.16(a). Note that a carry transfer from the lower a1 to a2 is not required, since the carry transfer to the upper a2 can happen after the addition at the upper a1. Therefore, there is no edge from the lower a1 to a2. The two dashed lines CS1 and CS2 in Fig.16(a) mark two cutsets across which we can perform node-retiming [2] to obtain the retimed DFG of Fig.16(b). Fig.17(a) shows the DFG obtained by retiming that of Fig.16(b). A reoriented form of the DFG of Fig.17(a) is shown in Fig.17(b). Two sets of nodes (in the gray areas) are merged to form the two nodes N1 and N2 in the final retimed DFG (Fig.17(b)). The computation time of nodes N1 and N2 is (1 + δ) ut, where δ = 2T_FA + T_FAC. Since δ is small, the critical path of the retimed DFG of Fig.17(b) is nearly 1 ut, while that of the conventional retimed DFG (Fig.15(b)) is nearly 2 ut.
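To make the role of the single carry edge concrete, the following sketch splits the three-operand addition needed for h1 · y[n − 2] + h2 · y[n − 3] + x[n] into a less significant half (the a1 part) and a more significant half (the a2 part). It is an arithmetic illustration only and does not reproduce the circuit of Fig.16: the function name split_add3, the half-width L = 16, and the modelling of each half as an integer operation are assumptions made here. All low-half sums are accumulated before one carry value is passed to the high half, which mirrors the remark that the lower a1 needs no carry edge of its own.

```python
import random

L = 16                      # assumed half-width; full words are 2L bits
MASK = (1 << L) - 1


def split_add3(p, q, x):
    """Add three 2L-bit words in two halves, wrapping modulo 2**(2L)."""
    # a1 part: less significant halves; the sum may exceed L bits by up to 2.
    low = (p & MASK) + (q & MASK) + (x & MASK)
    carry = low >> L                      # single carry transfer to the a2 part
    # a2 part: more significant halves plus the carry from the a1 part.
    high = (p >> L) + (q >> L) + (x >> L) + carry
    return ((high & MASK) << L) | (low & MASK)


# Compare against a plain 2L-bit addition on random operands.
random.seed(0)
for _ in range(1000):
    p, q, x = (random.getrandbits(2 * L) for _ in range(3))
    assert split_add3(p, q, x) == (p + q + x) % (1 << 2 * L)
```

The split result matches the unsplit addition for every tested operand triple, so the two halves can be placed in different timing zones as long as the carry crosses the same cutset as the data.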
Fig. 15. (a) The DFG of IIR filter given by y[n] = h1 · y[n − 2] + h2 · y[n − 3] + x[n]. (b) A conventional retimed DFG.
Fig. 16. (a) The DFG of IIR filter given by y[n] = h1 · y[n − 2] + h2 · y[n − 3] + x[n] obtained after splitting of nodes into two parts of nearly equal delays. (b) The retimed DFG with split nodes.
Fig. 19. Feed-forward cutset retimed DFG of the DFG of Fig.18.
TABLE VI
SYNTHESIS RESULTS OF PIPELINED FIR FILTERS BY CONVENTIONAL RETIMING AND PROPOSED RETIMING WITH SPLITTING AND MERGING OF NODES AND WITHOUT SPLITTING AND MERGING OF NODES.
| Filter Length | Conventional Retiming (Fig.4(c)): Area | MCP | ADP | Proposed Retiming Without S&M: Area | MCP | ADP | Proposed Retiming With S&M: Area | MCP | ADP |
|---|---|---|---|---|---|---|---|---|---|
| 8 | 23242.3 | 3.35 | 77861.8 | 23618.9 | 3.39 | 80068.0 | 24197.4 | 2.30 | 55654.0 |
| 16 | 46548.7 | 3.53 | 164317.0 | 47419.2 | 3.39 | 160751.1 | 48582.4 | 2.30 | 111739.4 |
| 32 | 93161.5 | 3.71 | 345629.2 | 95019.8 | 3.39 | 322117.3 | 97352.3 | 2.30 | 223910.2 |
| 64 | 186387.1 | 3.89 | 725045.9 | 190221.1 | 3.39 | 644849.6 | 194892.1 | 2.30 | 448251.9 |
| 128 | 372838.3 | 4.07 | 1517452.0 | 380623.7 | 3.39 | 1290314.3 | 389971.8 | 2.30 | 896935.1 |
S&M: Split and Merge of nodes. MCP: minimum clock period measured in ns. Area is measured in sq.um. ADP: Area-Delay Product.
3) Retiming of an FIR Filter: We illustrate here the reduction of the critical path of an FIR filter by splitting and merging of nodes combined with retiming. The DFG of a direct-form FIR filter of length N = 4 [retimed similar to the FIR filter of Fig.5(b)] is shown in Fig.18(a), where the multiplication nodes as well as the addition nodes are split into two parts of nearly equal delays. The feed-forward cutset is shown by the horizontal dashed line in Fig.18(a), and the resulting retimed DFG is shown in Fig.18(b). Two pairs of connected (m1, a1) nodes (in the gray area) in the DFG of Fig.18(b) could be merged to form nodes N1 and N2. The DFG of Fig.18(b) can be node-retimed about the last a1 node to obtain the final DFG of Fig.19. The chain of two a2 nodes, the last a1 node, and the final adder node ‘a’ can be merged to form the node N3, which could be realized by a 4-operand adder. Assuming that the multiplication nodes and adder nodes are split into two parts of nearly equal propagation delay of 1 ut = (1/2)T_MULT or (1/2)T_ADD, where T_ADD is the (∼2L)-bit addition time and T_MULT is the time required for multiplication of two L-bit words, the delay of the N1 and N2 nodes can be found to be nearly (1 + ∆) ut. The critical path of the retimed FIR filter is the same as the delay of the N3 node, which amounts to
T_FIR-RT ≈ (1/2)T_MULT + 3∆    (12)
where ∆ is defined in (4b). Since ∆ is small compared with T_MULT, the proposed retiming combined with node-splitting and node-merging can result in a substantial reduction of the critical path over the lowest critical path achievable by the conventional retiming (one multiplication time, T_MULT).
4) Synthesis Results and Discussions: To validate the advantage of the proposed retiming using splitting and merging of DFG nodes, we have synthesized the proposed retimed designs as well as the conventional retimed structures using Synopsys Design Compiler with a 65-nm CMOS technology library, without any constraints. In all these cases, the word-size of input and output signals is taken to be 32 bits. The width of the coefficients is taken to be 16 bits, and the output result is truncated to 16 bits to feed back the result as input to the recursive structure. The synthesis results pertaining to the IIR filters of Sections III-B1 and III-B2 are listed in Table V. The proposed designs involve marginally higher area. But the retimed structure of Fig.14(d) without 2-slow transformation involves nearly 12% less critical path than the conventional retimed structure of Fig.13(c) obtained after 2-slow transformation. The proposed design of Fig.14(d) thus provides more than double the sampling rate compared to the conventional one. The proposed retimed structure with 2-slow transformation requires nearly half (∼56%) of the critical path of the conventional retimed structure. The conventional retimed IIR filter of Fig.15(b) requires nearly 2.9% less area but involves 47.58% more critical path than the proposed retimed filter of Fig.17(b) based on splitting and merging of nodes.
The synthesis results pertaining to the retimed FIR filters of Sections II-B and III-B3 are listed in Table VI for different filter lengths, for 16-bit input and coefficient size and 32-bit output size. The proposed retiming with splitting and merging of nodes (Fig.19), on average, requires nearly 2.5% more area than the retimed filter without splitting of nodes (Fig.5(b)), whose critical path is nearly 47.4% longer (3.39 ns against 2.30 ns). Similarly, on average, it requires nearly 4.2% more area than the conventional retimed filter of Fig.4(c), whose critical path is nearly 61.3% longer.
TABLE V
SYNTHESIS RESULTS OF IIR FILTERS BY CONVENTIONAL RETIMING AND PROPOSED RETIMING BY SPLITTING AND MERGING OF NODES.
| Retimed Filters | Area | MCP |
|---|---|---|
| Retimed Basic Recursive Structure of Fig.13(c) | 2952.0 | 3.36 |
| Retimed with S-and-M without 2-slow (Fig.14(d)) | 3016.1 | 2.95 |
| Retimed with S-and-M after 2-slow | 3152.2 | 1.90 |
| Retimed IIR Filter of Fig.15(b) | 5841.7 | 3.35 |
| Retimed with Split and Merge (Fig.17(b)) | 6010.6 | 2.27 |
Legends: S-and-M: Split and Merge. MCP: minimum clock period measured in nanoseconds. Area is measured in sq.um.
IV. APPLICATION OF PROPOSED TECHNIQUES
In most DSP applications, as well as other scientific and engineering applications involving matrix-matrix products and matrix-vector products, we find that a multiplication or an addition is followed by one or more addition(s). Correspondingly, in the DFGs an adder node is preceded by a multiplication node or an adder node. In such cases we can easily estimate the propagation delay of multiple connected nodes together according to the connected component timing model given in (2) and (3). The propagation delays estimated by the connected component timing model are quite precise, and can be used to make better retiming decisions and to avoid unwanted pipelining and the associated overheads.
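The intuition behind estimating connected nodes together can be illustrated with a small bit-level arrival-time simulation. The sketch below is not the model of (2) and (3); it assumes plain ripple-carry adders in which every full adder produces its sum and carry one unit (t_fa) after its latest input, all primary inputs arrive at time 0, and the bit-width does not grow. The function name and the width of 16 bits are likewise assumptions made here. It compares the delay obtained by adding node delays one after another with the delay obtained when each bit of the second adder starts as soon as its inputs are ready.

```python
def ripple_chain_arrival(width, stages, t_fa=1):
    """Arrival time of the last sum bit of `stages` cascaded ripple-carry adders.

    Assumes every full adder produces sum and carry t_fa after its latest
    input, all primary inputs arrive at time 0, and bit-width does not grow.
    """
    operand = [0] * width            # arrival times of the running sum bits
    for _ in range(stages):
        carry = 0                    # carry-in of each adder is available at t=0
        result = []
        for bit in range(width):
            done = max(operand[bit], carry) + t_fa
            result.append(done)      # sum bit of this position
            carry = done             # carry-out feeds the next bit position
        operand = result
    return max(operand)


width = 16
discrete = 2 * ripple_chain_arrival(width, 1)   # node delays added one after another
connected = ripple_chain_arrival(width, 2)      # bits flow as soon as they are ready
print(discrete, connected)
```

Under these assumptions two cascaded 16-bit additions finish after 17 full-adder delays rather than 32, which is consistent with the observation that a second addition adds only a small increment to the delay of the first.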
Besides, splitting and merging of nodes can be used along with retiming for more effective reduction of the critical path, as follows:
1) When we encounter a multiplication node followed by one or more adder nodes, or an adder node followed by another adder node in a DFG, we can split the multiplication node and the adder node(s) into pairs of nodes of nearly equal propagation delays. Retiming decisions could be taken thereafter by estimating the propagation delays according to the connected component timing model while the original DFG nodes are in split condition. After the retiming is over, the split connected nodes in the same timing zone can be merged together to perform the final timing analysis.
2) The time required for two consecutive additions is almost the same as that of a single addition when the bit-width of the result does not increase, and marginally higher when the bit-width increases. Therefore, consecutive adder nodes could very often be merged together to form a single node.
3) When we encounter two consecutive adder nodes with a delay in between them, we can try to move that delay to some other location if that could be utilized to reduce the critical path. If the delay is moved to some other location, then the consecutive adder nodes could be merged together to form a single node.
4) When several connected adder nodes are encountered, the adder nodes could be merged together to form a multi-operand adder node, which could be split and retimed for possible reduction of the critical path.
5) It is observed that area complexity increases by nearly 3% due to splitting and merging of nodes. Therefore, splitting and merging of nodes should be considered only on the critical path of the DFG, particularly when we intend to reduce the critical path.
V. SUMMARY AND CONCLUSIONS
Conventional retiming is based on the propagation delay information estimated by the discrete-component timing model, which assumes that the operation of each node in a DFG begins only after the completion of its preceding operations. But such a discrete-component timing model overestimates the propagation delays of circuits involving fixed-point arithmetic units. A retiming decision based on such overestimated propagation delays very often leads to unnecessary pipelining in architecture-level designs. We have shown that in applications like FIR filters the conventional pipelining results in only ≈ 10% reduction of the critical path for various filter lengths. In this paper we have presented a connected component timing model considering seamless signal propagation in combinational circuits, which provides a simple way to obtain adequately precise estimates of propagation delays across different paths in a DFG.
The precise estimates of propagation delay thus obtained can be used to make more efficient retiming decisions with less pipeline overhead. We have shown here that in the case of lattice filters it would be possible to achieve the same sampling rate (without 2-slow transformation) without increasing the clock frequency and register complexity, which reduces the energy consumption per sample to nearly half that of the filter retimed by 2-slow transformation. We have demonstrated flexible and efficient retiming of FIR filters, where we achieve a reduction of the critical path using fewer pipeline registers than the conventional one. We have also shown that the use of retiming in combination with node-splitting and node-merging can potentially reduce the sampling period by nearly 50% without 2-slow transformation in the case of the basic multiply-add recursive structure. In the case of both FIR and IIR filters, retiming in combination with node-splitting and node-merging offers nearly 50% to 60% reduction of the critical path with a marginal area overhead. The techniques presented in this paper can be applied during retiming of DFGs (described at the granularity of arithmetic operations) involving multiply-add operations and adder-trees to reduce pipelining overhead and critical path. They can be used for high-speed implementation of DSP applications, where multipliers and adders are very often followed by one or more adders. The use of the connected component timing model, which we have demonstrated for cutset retiming, could be extended to general retiming. The proposed retiming techniques for fixed-point circuits could be extended to floating-point circuits.
ACKNOWLEDGMENT
The author is grateful to Prof. K. K. Parhi for the rich wealth of knowledge in his book “VLSI Digital Signal Processing Systems: Design and Implementation” [2], and for providing a copy of the book to this author.
REFERENCES
[1] C. E. Leiserson and J. B. Saxe, “Retiming synchronous circuitry,” Algorithmica, vol. 6, no. 1, pp. 5–35, 1991.
[2] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. New York: John Wiley & Sons, Inc., 1999.
[3] J. Monteiro, S. Devadas, and A. Ghosh, “Retiming sequential circuits for low power,” International Journal of High Speed Electronics and Systems, vol. 7, no. 02, pp. 323–340, 1996.
[4] L.-F. Chao and E.-M. Sha, “Scheduling data-flow graphs via retiming and unfolding,” IEEE Transactions on Parallel and Distributed Systems, vol. 8, no. 12, pp. 1259–1267, 1997.
[5] N. Shenoy and R. Rudell, “Efficient implementation of retiming,” in Proceedings of the 1994 IEEE/ACM International Conference on Computer-Aided Design, 1994, pp. 226–233.
[6] P. K. Meher and S. Y. Park, “Critical-path analysis and low-complexity implementation of the LMS adaptive algorithm,” IEEE Trans. Circuits Syst. I: Regular Papers, vol. 61, pp. 778–788, 2014.
[7] P. K. Meher, “Seamless pipelining of DSP circuits,” Journal of Circuits System and Signal Processing, [Online]. Available: http://link.springer.com/article/10.1007/s00034-015-0089-2.
[8] “Synopsys, DesignWare Foundry Libraries, Mountain View, CA.” [Online]. Available: http://www.synopsys.com/.
[9] B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs. New York: Oxford University Press, Inc., 2009.
Pramod Kumar Meher (SM’03) received the B.Sc. (Honours) and M.Sc. degrees in physics, and the Ph.D. degree in science from Sambalpur University, India, in 1976, 1978, and 1996, respectively. Currently, he is a Senior Research Scientist with Nanyang Technological University, Singapore. Previously, he was a Professor of Computer Applications with Utkal University, India, from 1997 to 2002, a Reader in electronics with Berhampur University, India, from 1993 to 1997, and a Lecturer in physics in various government colleges in Odisha (India). His research interests include the design of dedicated and reconfigurable architectures for computation-intensive algorithms pertaining to signal, image and video processing, communication, bio-informatics and intelligent computing. He has contributed more than 200 papers on VLSI architectures to various reputed journals and conference proceedings (available at http://www.ntu.edu.sg/home/ASPKMeher/List.pdf). Dr. Meher served as a speaker for the Distinguished Lecturer Program (DLP) of the IEEE Circuits and Systems Society during 2011 and 2012, as Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-II: EXPRESS BRIEFS during 2008 to 2011, as Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-I: REGULAR PAPERS during 2012-2013, and as Associate Editor of the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS during 2009-2014. Currently, he is serving as Associate Editor for the Journal of Circuits, Systems, and Signal Processing (CSSP) and Integration, the VLSI Journal. Dr. Meher is a Fellow of the Institution of Electronics and Telecommunication Engineers, India. He was the recipient of the Samanta Chandrasekhar Award for excellence in research in engineering and technology for 1999.