Optimal logic level pipelining for digit-serial implementation of maximally fast recursive digital filters
Oscar Gustafsson and Lars Wanhammar
Department of Electrical Engineering, Linköping University, SE-581 83 Linköping, SWEDEN
E-mail: {oscarg, larsw}@isy.liu.se
Abstract
Although recursive filters cannot be pipelined at the algorithmic level, it is possible to introduce logic level pipelining and thereby increase the maximal sample rate. The critical path gets shorter, so the circuitry can be clocked faster. However, the latency of the operations increases with increased pipelining. It is therefore of interest to find the degree of pipelining that maximizes the sample rate of the implementation. In this paper the optimal degree of logic level pipelining is derived for maximally fast implementations of recursive digital filters.
1 Introduction

Frequency selective digital filters are used in many contemporary digital signal processing systems. As more and more equipment is portable, power consumption is becoming an important design criterion. Maximally fast implementations and bit-serial arithmetic have been shown to be an efficient way to implement recursive digital filters with low power consumption [1], [2]. By maximally fast (or rate optimal) we denote implementations of recursive filters that attain the maximal sample frequency. The maximal sample frequency of a recursive filter is

$$f_{max} = \min_i \left\{ \frac{N_i}{T_{op,i}} \right\} \tag{1}$$

where $N_i$ is the number of delay elements and $T_{op,i}$ is the total latency of the operations in recursive loop $i$ [3]. For bit- and digit-serial arithmetic the latency of a multiplication is proportional to the number of fractional bits of the coefficient:

$$T_{op,mult} = \left\lceil \frac{W_f}{d} \right\rceil T_{clk} \tag{2}$$

where $W_f$ is the number of fractional bits of the multiplier coefficient, $d$ is the digit-size ($d = 1$ for bit-serial), and $T_{clk}$ denotes the clock period. Hence, the latency of a multiplication depends on both the coefficient of the multiplier and the clock period of the system. The clock period of the system depends on the longest logical path between two registers and, thus, on the interconnection between different operations, while the number of clock periods only depends on the number of cascaded multipliers and, thus, on the filter algorithm. It is therefore convenient to divide the latency into an algorithmic latency, $L_{op}$, and a clock period, $T_{clk}$, as $T_{op} = L_{op}T_{clk}$. In the same way, the minimal sample period, $T_{min} = 1/f_{max}$, can be divided into $L_{min}$ and $T_{clk}$.

In earlier work three different logic models for bit-serial arithmetic have been defined depending on the level of pipelining: one for static logic, one for dynamic logic and one for pipelined dynamic logic [1]. In this work the pipelining is generalized to be performed after a number of operations. Further, digit-serial arithmetic is also considered.
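As a small numerical illustration of Eqs. (1) and (2), the sketch below evaluates the loop bound for two recursive loops. The function names, the loop contents and the 2 ns clock period are hypothetical values chosen for illustration, not data from this paper. Adders are assumed to contribute no algorithmic latency, so only the multipliers enter the loop latency.

```python
def multiplier_latency(w_f, d, t_clk):
    """Latency of one multiplication per Eq. (2): ceil(W_f / d) clock periods."""
    cycles = -(-w_f // d)  # integer ceiling of W_f / d
    return cycles * t_clk


def max_sample_frequency(loops):
    """Eq. (1): loops is a list of (N_i, T_op_i) pairs, one per recursive loop."""
    return min(n_i / t_op_i for n_i, t_op_i in loops)


# Hypothetical bit-serial example (d = 1) with an assumed clock period of 2 ns.
t_clk = 2.0
loop_a = (1, multiplier_latency(6, 1, t_clk) + multiplier_latency(4, 1, t_clk))
loop_b = (2, multiplier_latency(12, 1, t_clk))
f_max = max_sample_frequency([loop_a, loop_b])  # in samples per ns
print(f"f_max = {f_max * 1e3:.1f} MHz")
```

The loop with the smallest ratio of delay elements to latency is the critical loop; in this made-up example that is the first loop.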
2 Cyclic Scheduling
To obtain a maximally fast implementation using bit- or digit-serial arithmetic, the operations must be scheduled over more than one sample period [4]. The number of sample periods, $m$, to schedule over must satisfy

$$m \ge \frac{\max_j \left( T_{exe,j} \right)}{T_{min}} \tag{3}$$

where $T_{exe,j}$ is the time between two adjacent inputs for operation $j$. The total scheduling period is then $mT_{min}$, so $m$ must also be selected such that $mT_{min}$ is an integer number of clock periods. To schedule over $m$ sample periods the algorithm must be unfolded $m$ times. An illustration of unfolding is shown in Fig. 1.

The cyclic scheduling formulation exploits the fact that the algorithm is recursive. The schedule can be viewed as lying on the circumference of a cylinder, as shown in Fig. 2. One property of a maximally fast cyclic schedule is that all operations in the critical loop are scheduled adjacent to each other, without any shimming delays. As it is possible to introduce pipelining in all non-critical operations, the maximal clock frequency is determined by a critical path going through the operations in the critical loop only.
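The two conditions on $m$ above can be combined in a few lines. The sketch below is only an illustration under the assumption that all times are expressed in clock periods, so that the minimal sample period is the rational number $L_{min} = T_{min}/T_{clk}$. The function name and the example values are hypothetical.

```python
from fractions import Fraction
from math import ceil


def unfolding_factor(t_exe_cycles, l_min):
    """Smallest m that satisfies Eq. (3) and makes m * T_min a whole number of clock periods.

    t_exe_cycles: execution times T_exe,j in clock periods.
    l_min: minimal sample period T_min in clock periods (may be fractional).
    """
    l_min = Fraction(l_min)
    lower_bound = max(1, ceil(Fraction(max(t_exe_cycles)) / l_min))  # Eq. (3)
    step = l_min.denominator  # m * l_min is an integer only if m is a multiple of this
    return -(-lower_bound // step) * step  # round lower_bound up to a multiple of step


# Assumed example: T_min = 7/2 clock periods, longest execution time 8 clock periods.
m = unfolding_factor([8, 5], Fraction(7, 2))
print(m, m * Fraction(7, 2))
```

Exact rational arithmetic is used so that the integrality check is not disturbed by floating-point rounding.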
Figure 1: Illustration of unfolding an algorithm m times.

Figure 2: Illustration of cyclic scheduling.

Figure 3: Bit-serial adder with critical path marked.
3 Bit- and Digit-Serial Operations
In bit-serial processing one bit of the input word is processed each clock cycle. A bit-serial adder is shown in Fig. 3. The algorithmic latency of a bit-serial adder is zero clock cycles. However, the critical path of the adder will affect the maximal clock frequency.

Bit-serial multiplication can be performed using a serial/parallel multiplier as shown in Fig. 4. For a fixed coefficient, these multipliers can be simplified to reduce the hardware complexity [5]. A simplified serial/parallel multiplier for the coefficient $0.4375 = 0.0111_{2C} = 0.100\bar{1}_{CSD}$ is shown in Fig. 5. As stated previously, the algorithmic latency is proportional to the number of fractional bits of the coefficient. The critical path of the multiplier is at most one full adder, as can be seen from Fig. 5. If the coefficient has only one non-zero bit, no full adder is required and, thus, that operation will not alter the critical path. However, it must of course still be included in the algorithmic latency.

A digit-serial adder can be designed by unfolding a bit-serial adder [6]. In Fig. 6 a digit-serial adder with digit-size 3 is shown. The critical path from input to output is one full adder, but for some operations there will be a critical path from the input to the carry register of the adder. Digit-serial serial/parallel multipliers can be designed and simplified in a similar way as bit-serial serial/parallel multipliers. The latency is as stated in (2), while the critical path is the same as for a digit-serial adder.
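The coefficient example can be checked numerically. The following sketch converts a fractional coefficient to canonic signed digit (CSD) form using the standard non-adjacent-form recoding and counts the non-zero digits. The function names are hypothetical, and the code only illustrates the number representation, not the multiplier simplification procedure of [5].

```python
def to_csd(n):
    """Return the CSD digits (+1, 0, -1) of a non-negative integer, MSB first."""
    digits = []
    while n != 0:
        if n & 1:
            digit = 2 - (n & 3)  # +1 if n mod 4 == 1, -1 if n mod 4 == 3
            n -= digit
        else:
            digit = 0
        digits.append(digit)
        n >>= 1
    return digits[::-1]


def fractional_csd(coefficient, w_f):
    """CSD digits of a coefficient with w_f fractional bits, weights 2^-1 ... 2^-w_f.

    If the result has more than w_f digits, the leading digit has weight 2^0.
    """
    n = round(coefficient * 2**w_f)  # exact if the coefficient fits in w_f fractional bits
    digits = to_csd(n)
    return [0] * (w_f - len(digits)) + digits


csd = fractional_csd(0.4375, 4)
binary_nonzero = bin(round(0.4375 * 2**4)).count("1")
csd_nonzero = sum(1 for digit in csd if digit != 0)
print(csd, binary_nonzero, csd_nonzero)
```

For 0.4375 the recoding gives two non-zero digits instead of the three in the two's-complement form, which is in line with the single full adder in the simplified multiplier of Fig. 5.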
Figure 4: General coefficient bit-serial serial/parallel multiplier.

Figure 5: Simplified bit-serial multiplier with coefficient 0.4375. The critical path is marked with a thick line.
4 Pipelining
Although it is not possible to pipeline recursive loops at the algorithmic level, we can introduce pipelining at the logic level. The critical path goes through the operations in the critical loop unfolded $m$ times. This limits the sample period to $T_s = T_{min} = L_{min}T_{clk}$. Introducing a pipelining stage corresponds to increasing the latency of an operation by one clock cycle. However, it may also shorten the critical path and thereby allow a decrease of the clock period. Hence, the minimal sample period counted in clock cycles, $L_{min}$, increases while the clock period, $T_{clk}$, decreases. It is therefore possible to find an optimal level of pipelining that maximizes the sample rate.
Figure 6: Digit-serial adder with digit-size 3. The critical paths are marked with a thick line.

In this section we derive the optimal level of pipelining for maximally fast bit- and digit-serial implementations of recursive algorithms.
4.1 Bit-serial implementation

Assume that the critical loop has $N_{op}$ operations and that the algorithm is unfolded $m$ times according to (3). Only operations that add to the critical path should be included in $N_{op}$. Multiplications with only one non-zero bit in the coefficient should therefore not be included. For bit-serial arithmetic each operation adds a time $T_{FA}$ to the critical path, where $T_{FA}$ is the gate delay of a full adder. Thus, the minimal clock period is $T_{clk} = mN_{op}T_{FA}$. If the algorithmic latency in the critical loop is $L_{tot} = \sum_i W_{f,i}$ and there is one delay element in the critical loop, the minimal sample period is $T_s = L_{tot}mN_{op}T_{FA}$. If pipelining is introduced after every $M$ operations the critical path is reduced to

$$T_{clk} = MT_{FA} + T_D \tag{4}$$

where $T_D$ is the setup time of the pipelining registers plus the propagation delay from the clock edge to the output of the registers. The algorithmic latency of the critical loop is now

$$L_{min} = L_{tot} + \frac{N_{op}}{M} \tag{5}$$

The resulting minimal sample period is

$$T_s = L_{min}T_{clk} = \left( L_{tot} + \frac{N_{op}}{M} \right)\left( MT_{FA} + T_D \right) = \left( ML_{tot} + N_{op} \right)T_{FA} + \left( L_{tot} + \frac{N_{op}}{M} \right)T_D \tag{6}$$

Taking the derivative of the minimal sample period with respect to $M$ we find

$$\frac{\partial T_s}{\partial M} = L_{tot}T_{FA} - \frac{N_{op}}{M^2}T_D \tag{7}$$

and the optimal level of pipelining, obtained by setting (7) to zero, is found to be

$$M_{opt} = \sqrt{\frac{N_{op}T_D}{L_{tot}T_{FA}}} \tag{8}$$

For bit-serial arithmetic we can utilize the logic-level implementation to transfer from a static CMOS implementation to a dynamic implementation. It is possible to merge the full adder and the flip-flop to obtain a more efficient implementation [1]. In the simplest case this corresponds to $M = 1$, i.e., one flip-flop after each full adder. However, it is possible to introduce two flip-flops and pipeline the full adder internally as in Fig. 7. This corresponds to $M = 1/2$. By allowing pipelining inside the adder it should thus also be possible to obtain a resolution of 1/2 for $M$.

Figure 7: Bit-serial adder with two pipeline stages.

4.2 Digit-serial implementation

The minimal clock period for an implementation with digit-size $d$ is $T_{clk} = \left( mN_{op} + (d - 1) \right)T_{FA}$, i.e., $mN_{op}$ operations plus $d - 1$ extra full adders at the last operation, as shown in Fig. 6. The algorithmic latency of the critical loop is now $L_{tot} = \sum_i \lceil W_{f,i}/d \rceil$. Introducing pipelining after every $M$ operations reduces the critical path to

$$T_{clk} = (M + d - 1)T_{FA} + T_D \tag{9}$$

while the algorithmic latency increases to

$$L_{min} = L_{tot} + \frac{N_{op}}{M} \tag{10}$$

The minimal sample period is now

$$T_s = L_{min}T_{clk} = \left( L_{tot} + \frac{N_{op}}{M} \right)\left( (M + d - 1)T_{FA} + T_D \right) = \left( (M + d - 1)L_{tot} + N_{op} + \frac{N_{op}(d - 1)}{M} \right)T_{FA} + \left( L_{tot} + \frac{N_{op}}{M} \right)T_D \tag{11}$$

Again, taking the derivative of the minimal sample period with respect to $M$ yields

$$\frac{\partial T_s}{\partial M} = \left( L_{tot} - \frac{N_{op}(d - 1)}{M^2} \right)T_{FA} - \frac{N_{op}}{M^2}T_D \tag{12}$$

and the optimal level of pipelining is

$$M_{opt} = \sqrt{\frac{N_{op}T_D + N_{op}(d - 1)T_{FA}}{L_{tot}T_{FA}}} \tag{13}$$
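The closed-form optimum can be cross-checked numerically against the sample period expression. The sketch below evaluates Eq. (11), which reduces to Eq. (6) for $d = 1$, and compares the value at the optimum of Eq. (13) with neighbouring values of $M$. The function names and the parameter values are hypothetical and chosen only for illustration, and $M$ is treated as a continuous variable.

```python
from math import sqrt


def sample_period(m_ops, l_tot, n_op, t_fa, t_d, d=1):
    """Minimal sample period T_s = L_min * T_clk, Eqs. (6) and (11); d = 1 is bit-serial."""
    l_min = l_tot + n_op / m_ops          # Eqs. (5) and (10)
    t_clk = (m_ops + d - 1) * t_fa + t_d  # Eqs. (4) and (9)
    return l_min * t_clk


def optimal_pipelining(l_tot, n_op, t_fa, t_d, d=1):
    """Value of M where the derivative of T_s is zero, Eqs. (8) and (13)."""
    return sqrt((n_op * t_d + n_op * (d - 1) * t_fa) / (l_tot * t_fa))


# Assumed values: L_tot = 10 cycles, N_op = 6, T_FA = 1.0 ns, T_D = 0.5 ns, d = 2.
l_tot, n_op, t_fa, t_d, d = 10, 6, 1.0, 0.5, 2
m_opt = optimal_pipelining(l_tot, n_op, t_fa, t_d, d)
for m_ops in (0.9 * m_opt, m_opt, 1.1 * m_opt):
    t_s = sample_period(m_ops, l_tot, n_op, t_fa, t_d, d)
    print(f"M = {m_ops:.3f}: T_s = {t_s:.3f} ns")
```

Since $T_s$ has the form $aM + b/M + c$ with positive $a$ and $b$, it is convex for $M > 0$, so the stationary point is the minimum.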
To see how the optimal pipelining level $M_{opt}$ is affected by the digit-size $d$ we first assume

$$T_D = T_{FA} \tag{14}$$

which yields

$$M_{opt} = \sqrt{\frac{N_{op}d}{L_{tot}}} \tag{15}$$

Hence, for a given $L_{tot}$, the optimal level of pipelining is proportional to the square root of the digit-size. However, as the algorithmic latency is inversely proportional to the digit-size we notice that

$$L_{tot} = \sum_i \left\lceil \frac{W_{f,i}}{d} \right\rceil \approx \frac{\sum_i W_{f,i}}{d} \tag{16}$$

(equality if all $W_{f,i}$ are integer multiples of $d$). We now obtain

$$M_{opt} \approx d\sqrt{\frac{N_{op}}{\sum_i W_{f,i}}} \tag{17}$$

Thus, the optimal level of pipelining is directly proportional to the digit-size.

5 Examples

In Fig. 8 a second-order lattice wave digital filter allpass section is shown with its critical loop marked. This structure is used because it is suitable for VLSI implementation: it has a regular, low complexity structure and low coefficient sensitivity. It is clear that the number of operations in the critical loop, $N_{op}$, is six. The multipliers $\alpha_1$ and $\alpha_2$ both contribute to the algorithmic latency of the loop, $L_{tot}$. A cyclic scheduling formulation for this filter is found in [1]. Cyclic scheduling formulations for other types of wave digital filters are found in [7]–[9].

Assuming $T_{FA} = T_D = 1$ ns, the estimated sample period for bit-serial arithmetic is plotted in Fig. 9 for different numbers of fractional bits. Figure 9 shows that the longer the coefficients in the loop, the more pipelining stages can be introduced while still decreasing the sample period. Further, more pipeline stages can be introduced when the number of fractional bits is high than when it is low.

Considering digit-serial arithmetic with the same assumptions and four fractional bits for both coefficients, the sample periods shown in Fig. 10 are obtained. From Fig. 10 it can be seen that it is important that the number of fractional bits is an integer multiple of the digit-size. The minimal sample period is obtained for digit-sizes of one, two and four bits. This is due to the fact that the latency of a digit-serial multiplication is $\lceil W_f/d \rceil$ clock cycles. Thus, if the number of fractional bits is not an integer multiple of the digit-size, the algorithmic latency is rounded up to the next larger integer. It can also be seen that the optimal level of pipelining depends on the digit-size, which is expected from (13). This is shown more clearly in Fig. 11, where the optimal level of pipelining is plotted versus digit-size with three and five fractional bits for the coefficients of the two multipliers, respectively.

Figure 8: Second-order lattice wave digital filter allpass section.

Figure 9: Sample period for second-order allpass sections using bit-serial arithmetic as a function of the degree of pipelining. (Axes: operations per pipeline stage M, estimated sample period in ns; one curve for each of Ltot = 4, 6, 8, 10, 12, 14, 16.)
Figure 10: Sample period for second-order allpass sections using digit-serial arithmetic as a function of the degree of pipelining. Wf,α1 = Wf,α2. (Axes: operations per pipeline stage M, estimated sample period in ns; one curve for each digit-size d = 1, ..., 8.)

Figure 11: Optimal level of pipelining as a function of digit-size for second-order allpass section with Wf,α1 = 3 bits and Wf,α2 = 5 bits. (Axes: digit-size d, optimal operations per pipeline stage Mopt.)
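The digit-serial example can be reproduced approximately from Eqs. (9) to (11). The sketch below uses the stated assumptions ($N_{op} = 6$, $T_{FA} = T_D = 1$ ns, four fractional bits for both coefficients) and, as an additional assumption made here, restricts $M$ to the integers 1 to 8. The function names are hypothetical.

```python
def total_latency(w_f_list, d):
    """L_tot = sum_i ceil(W_f,i / d), in clock cycles."""
    return sum(-(-w_f // d) for w_f in w_f_list)


def sample_period_ns(m_ops, w_f_list, d, n_op=6, t_fa=1.0, t_d=1.0):
    """Minimal sample period per Eq. (11), in ns."""
    l_min = total_latency(w_f_list, d) + n_op / m_ops  # Eq. (10)
    t_clk = (m_ops + d - 1) * t_fa + t_d               # Eq. (9)
    return l_min * t_clk


def best_integer_m(w_f_list, d, m_max=8):
    """Integer M in 1..m_max with the smallest sample period; returns (T_s, M)."""
    return min((sample_period_ns(m, w_f_list, d), m) for m in range(1, m_max + 1))


for d in range(1, 9):
    t_s, m_ops = best_integer_m([4, 4], d)
    print(f"d = {d}: L_tot = {total_latency([4, 4], d)}, best M = {m_ops}, T_s = {t_s:.1f} ns")
```

With these assumptions the smallest sample period, 28 ns, is reached only for digit-sizes 1, 2 and 4, the digit-sizes that divide the number of fractional bits.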
6 Conclusions
In this work the optimal degree of logic level pipelining was derived for maximally fast recursive digital filters using bit- and digit-serial arithmetic. Even though it is not possible to introduce pipelining on the algorithmic level, it is possible to pipeline the operations at the logic level. Pipelining will increase the algorithmic latency, but the maximal clock frequency will also increase.
References
[1] M. Vesterbacka, On Implementation of Maximally Fast Wave Digital Filters, Linköping Studies in Science and Technology, Dissertation No. 487, Linköping University, Sweden, 1997.
[2] L. Wanhammar, DSP Integrated Circuits, Academic Press, 1999.
[3] M. Renfors and Y. Neuvo, “The maximum sampling rate of digital filters under hardware speed constraints,” IEEE Trans. Circuits Syst., vol. 28, no. 3, pp. 196–202, March 1981.
[4] K. Palmkvist, M. Vesterbacka, P. Sandberg, and L. Wanhammar, “Scheduling of data-independent recursive algorithms,” Proc. European Conf. Circuit Theory Design, Istanbul, Turkey, Aug. 27–31, 1995, vol. 2, pp. 855–858.
[5] M. Vesterbacka, K. Palmkvist, and L. Wanhammar, “Realization of serial/parallel multipliers with fixed coefficients,” Proc. National Conf. Radio Science (RVK), Lund, Sweden, April 5–7, 1993, pp. 209–212.
[6] K. K. Parhi, “A systematic approach for design of digit-serial signal processing architectures,” IEEE Trans. Circuits Syst., vol. 38, pp. 358–375, Apr. 1991.
[7] O. Gustafsson and L. Wanhammar, “Implementation of maximally fast ladder wave digital filters using a numerically equivalent state-space representation,” Proc. IEEE Int. Symp. Circuits Syst., Orlando, Florida, May 30–June 2, 1999, vol. 3, pp. 419–422.
[8] O. Gustafsson and L. Wanhammar, “Maximally fast scheduling of bit-serial lattice wave digital filters using constrained third-order sections,” Proc. IEEE Int. Conf. Electronics Circuits Syst., Paphos, Cyprus, Sept. 1999, pp. 729–732.
[9] O. Gustafsson and L. Wanhammar, “Maximally fast scheduling of bit-serial lattice wave digital filters using three-port adaptor allpass sections,” Proc. Nordic Signal Processing Symp., Norrköping, Sweden, June 13–15, 2000, pp. 441–444.