IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS
Area-Delay-Power Efficient Fixed-Point LMS Adaptive Filter with Low Adaptation-Delay

Pramod Kumar Meher, Senior Member, IEEE, and Sang Yoon Park, Member, IEEE
Abstract—In this paper, we present an efficient architecture for the implementation of the delayed least mean square (DLMS) adaptive filter. To achieve a lower adaptation-delay and an area-delay-power efficient implementation, we use a novel partial product generator and propose a strategy for optimized, balanced pipelining across the time-consuming combinational blocks of the structure. From synthesis results, we find that the proposed design offers nearly 17% less area-delay product (ADP) and nearly 14% less energy-delay product (EDP) than the best of the existing systolic structures, on average, for filter-lengths N = 8, 16, and 32. We propose an efficient fixed-point implementation scheme of the proposed architecture and derive the expression for the steady-state error. We show that the steady-state mean squared error obtained from the analytical result matches the simulation result. Moreover, we propose a bit-level pruning of the architecture, which provides nearly 20% saving in ADP and 9% saving in EDP over the proposed structure before pruning, without noticeable degradation of the steady-state error performance.

Index Terms—Adaptive filters, least mean square algorithms, fixed-point arithmetic, circuit optimization.
The authors are with the Institute for Infocomm Research, Singapore, 138632 (e-mail: [email protected]; [email protected]). The corresponding author of this paper is Sang Yoon Park. A preliminary version of this paper was presented at the IEEE International Midwest Symposium on Circuits and Systems (MWSCAS 2011).

I. INTRODUCTION

Adaptive digital filters find wide application in several digital signal processing (DSP) tasks, e.g., noise and echo cancellation, system identification, and channel equalization. The finite impulse response (FIR) filter whose weights are updated by the famous Widrow-Hoff least mean square (LMS) algorithm [1] may be considered the simplest known adaptive filter. The LMS adaptive filter is the most popular one, not only due to its simplicity but also due to its satisfactory convergence performance [2]. The direct-form LMS adaptive filter involves a long critical path due to an inner-product computation to obtain the filter output. The critical path must be reduced by pipelined implementation when it exceeds the desired sample period. Since the conventional LMS algorithm does not support pipelined implementation due to its recursive behavior, it is modified to a form called the delayed LMS (DLMS) algorithm [3]-[5], which allows pipelined implementation of the filter.

A lot of work has been done to implement the DLMS algorithm in systolic architectures to increase the maximum usable frequency [3], [6], [7], which can support very high sampling frequencies due to pipelining after every multiply-accumulate (MAC) operation. However, these involve an adaptation-delay of about N cycles for filter-length N, which is quite high for large-order filters. Since the convergence performance degrades considerably for large adaptation-delay, Visvanathan et al. [8] have proposed a modified systolic architecture to reduce the adaptation-delay. A transpose-form LMS adaptive filter is suggested in [9], where the filter output at any instant depends on delayed versions of the weights, and the number of delays in the weights varies from 1 to N. Van and Feng [10] have proposed a systolic architecture where they use relatively large processing elements (PEs) to achieve a lower adaptation-delay with a critical path of one MAC operation. Ting et al. [11] have proposed a fine-grained pipelined design that limits the critical path to at most one addition time, which supports a high sampling frequency but involves a large area overhead for pipelining and higher power consumption than [10], due to its large number of pipeline latches. A further effort was made by Meher and Maheswari [12] to reduce the number of adaptation-delays. Meher and Park have proposed a 2-bit multiplication cell, and used it with an efficient adder-tree for pipelined inner-product computation to minimize the critical path and silicon area without increasing the number of adaptation-delays [13], [14].

The existing work on the DLMS adaptive filter does not discuss fixed-point implementation issues, e.g., the location of the radix point, the choice of word-length, and quantization at various stages of computation, although these directly affect the convergence performance, particularly due to the recursive behavior of the LMS algorithm. Therefore, fixed-point implementation issues are given adequate emphasis in this paper. Besides, we present here an optimization of our previously reported design [13], [14] to reduce the number of pipeline delays along with the area, sampling period, and energy consumption.
The proposed design is found to be more efficient in terms of the power-delay product and the energy-delay product compared with the existing structures. The novel contributions of this paper are as follows:
i) a novel design strategy for reducing the adaptation delay, which strongly affects the convergence speed;
ii) a fixed-point implementation scheme of the modified DLMS algorithm;
iii) bit-level optimization of the computing blocks without adversely affecting the convergence performance.
In the next section, we review the DLMS algorithm, and in Section III we describe the proposed optimized architecture for its implementation. Section IV deals with fixed-point implementation considerations and simulation studies of the convergence of the algorithm. In Section V, we discuss the synthesis of the proposed architecture and its comparison with the existing architectures, followed by conclusions in Section VI.
Fig. 1. Structure of conventional delayed LMS adaptive filter.

Fig. 2. Structure of modified delayed LMS adaptive filter.

II. REVIEW OF DELAYED LMS ALGORITHM

The weights of the LMS adaptive filter during the nth iteration are updated according to the following equations [2]:

wn+1 = wn + µ · en · xn  (1a)

where

en = dn − yn and yn = wnT · xn  (1b)

and the input vector xn and the weight vector wn at the nth iteration are, respectively, given by

xn = [xn, xn−1, · · ·, xn−N+1]T
wn = [wn(0), wn(1), · · ·, wn(N − 1)]T.

Fig. 3. The convergence performance (mean squared error in dB versus iteration number) of system identification with the LMS (n1 = 0, n2 = 0) and modified DLMS (n1 = 5, n2 = 1 and n1 = 7, n2 = 2) adaptive filters.
Here dn is the desired response, yn is the filter output, and en denotes the error computed during the nth iteration; µ is the step-size, and N is the number of weights used in the LMS adaptive filter. In pipelined designs with m pipeline stages, the error en becomes available only after m cycles, where m is called the "adaptation-delay." The DLMS algorithm therefore uses the delayed error en−m, i.e., the error corresponding to the (n − m)th iteration, for updating the current weights instead of the most recent error. The weight-update equation of the DLMS adaptive filter is given by

wn+1 = wn + µ · en−m · xn−m.  (2)
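The recursion of (2) can be sketched in software as follows. This is an illustrative simulation only, not the hardware architecture; the function name `dlms_filter` and the zero-padded input buffering are our own choices for the sketch.

```python
import numpy as np
from collections import deque

def dlms_filter(x, d, N, mu, m):
    """Simulate a delayed-LMS adaptive filter per (2): the weight
    update at iteration n uses the error and input vector from
    m iterations earlier."""
    w = np.zeros(N)
    e_hist = deque([0.0] * (m + 1), maxlen=m + 1)  # e_n ... e_{n-m}
    errors = np.zeros(len(x))
    for n in range(len(x)):
        # current input vector x_n (zero-padded before the first sample)
        xn = np.array([x[n - k] if n - k >= 0 else 0.0 for k in range(N)])
        yn = w @ xn                     # filter output y_n = w_n^T x_n
        en = d[n] - yn                  # current error e_n
        e_hist.appendleft(en)           # e_hist[m] now holds e_{n-m}
        if n >= m:
            # delayed update: w_{n+1} = w_n + mu * e_{n-m} * x_{n-m}
            xnm = np.array([x[n - m - k] if n - m - k >= 0 else 0.0
                            for k in range(N)])
            w = w + mu * e_hist[m] * xnm
        errors[n] = en
    return w, errors
```

For a noise-free identification of a short FIR system, the weights converge close to the true impulse response despite the delayed update, consistent with the behavior described above.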
The block diagram of the DLMS adaptive filter is depicted in Fig. 1, where the adaptation-delay of m cycles amounts to the delay introduced by the whole adaptive-filter structure, consisting of the FIR filtering and the weight-update process. It is shown in [12] that the adaptation-delay of the conventional LMS can be decomposed into two parts: one part is the delay introduced by the pipeline stages in FIR filtering, and the other is due to the delay involved in pipelining the weight-update process. Based on such a decomposition of the delay, the DLMS adaptive filter can be implemented by the structure shown in Fig. 2. Assuming that the latency of the error computation is n1 cycles, the error computed by the structure at the nth cycle is en−n1, which is used together with the input samples delayed by n1 cycles to generate the weight-increment term. The weight-update equation of the modified DLMS algorithm is given by

wn+1 = wn + µ · en−n1 · xn−n1  (3a)

where

en−n1 = dn−n1 − yn−n1  (3b)

and

yn = wn−n2T · xn.  (3c)
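The decoupled recursion of (3a)-(3c), with separate delays n1 on the error path and n2 on the weight path, can be modeled as below. This is a behavioral sketch under our own buffering conventions (the name `modified_dlms` and the deque-based histories are assumptions), not a description of the hardware.

```python
import numpy as np
from collections import deque

def modified_dlms(x, d, N, mu, n1, n2):
    """Simulate the modified DLMS recursion (3a)-(3c): the output
    y_n uses weights delayed by n2 cycles, and the weight update
    uses the error and input samples delayed by n1 cycles."""
    w = np.zeros(N)
    w_hist = deque([w.copy()] * (n2 + 1), maxlen=n2 + 1)  # w_n ... w_{n-n2}
    e_hist = deque([0.0] * (n1 + 1), maxlen=n1 + 1)       # e_n ... e_{n-n1}
    err = np.zeros(len(x))
    for n in range(len(x)):
        xn = np.array([x[n - k] if n - k >= 0 else 0.0 for k in range(N)])
        yn = w_hist[n2] @ xn          # y_n = w_{n-n2}^T x_n        (3c)
        en = d[n] - yn                # e_n = d_n - y_n             (3b)
        e_hist.appendleft(en)
        if n >= n1:
            xd = np.array([x[n - n1 - k] if n - n1 - k >= 0 else 0.0
                           for k in range(N)])
            w = w + mu * e_hist[n1] * xd  # (3a), using e_{n-n1}, x_{n-n1}
        w_hist.appendleft(w.copy())
        err[n] = en
    return w, err
```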
We can notice that during the weight update, the error delayed by n1 cycles is used, while the filtering unit uses the weights delayed by n2 cycles. The modified DLMS algorithm decouples the computations of the error-computation block and the weight-update block, and allows optimal pipelining of these two sections separately by feed-forward cut-set retiming, to minimize the number of pipeline stages and the adaptation delay.

Adaptive filters with different n1 and n2 were simulated for a system identification problem. The 10-tap bandpass filter with impulse response

hn = sin(wH(n − 4.5))/(π(n − 4.5)) − sin(wL(n − 4.5))/(π(n − 4.5)), for n = 0, 1, 2, ..., 9, and hn = 0 otherwise,  (4)

is used as the unknown system, as in [10]. wH and wL represent the high and low cutoff frequencies of the passband, and are set to wH = 0.7π and wL = 0.3π, respectively. The step-size µ is set to 0.4. A 16-tap adaptive filter identifies the unknown system with Gaussian random input xn of zero mean and unit variance. In all cases, the output of the known system has unity power and is contaminated with white Gaussian noise of −70 dB strength. Fig. 3 shows the learning curve of the MSE of the error signal en, averaged over 20 runs, for the conventional LMS adaptive filter (n1 = 0, n2 = 0) and the DLMS adaptive filters with (n1 = 5, n2 = 1) and (n1 = 7, n2 = 2). As the total number of delays increases, the convergence slows down, while the steady-state MSE remains almost the same in all cases. In this example, the MSE difference between the cases (n1 = 5, n2 = 1) and (n1 = 7, n2 = 2) after 2000 iterations is less than 1 dB on average.
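The unknown-system taps of (4) can be generated directly; the sketch below does so and checks the bandpass character (the helper name `bandpass_taps` is our own, and the FFT-based check is only an illustration of the passband/stopband behavior, not part of the paper's experiment).

```python
import numpy as np

def bandpass_taps(wH=0.7 * np.pi, wL=0.3 * np.pi, taps=10):
    """10-tap bandpass impulse response of (4), symmetric about
    the half-sample centre n = 4.5 (so the denominator never
    vanishes for integer n)."""
    n = np.arange(taps)
    t = n - 4.5
    return (np.sin(wH * t) - np.sin(wL * t)) / (np.pi * t)

h = bandpass_taps()
```

Evaluating the magnitude response of these taps shows near-unity gain around the passband centre 0.5π and small gain at DC, as expected of a truncated ideal bandpass filter.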
Fig. 4. Proposed structure of the error-computation block. The input samples xn, xn−1, ..., xn−(N−1) and the delayed weights wn−n2(0), ..., wn−n2(N−1) feed N 2-bit PPGs (PPG-0 to PPG-(N−1)), each producing L/2 partial products pi0, ..., pi(L/2−1). An adder-tree of log2 N stages produces q0, ..., qL/2−1, and a shift-add-tree of log2 L − 1 stages produces the filter output yn, which is subtracted from the desired response dn to give the error en.
III. PROPOSED ARCHITECTURE

As shown in Fig. 2, there are two main computing blocks in the adaptive filter architecture, namely (i) the error-computation block and (ii) the weight-update block. In this section, we discuss the design strategy of the proposed structure to minimize the adaptation-delay in the error-computation block, followed by the weight-update block.

A. Pipelined Structure of Error-Computation Block

The proposed structure for the error-computation unit of an N-tap DLMS adaptive filter is shown in Fig. 4. It consists of N 2-bit partial product generators (PPGs) corresponding to N multipliers, a cluster of L/2 binary adder-trees, followed by a single shift-add tree. Each sub-block is described below.

Each AND cell takes an n-bit input D and a single-bit input b, and consists of n AND gates. It distributes all the n bits of input D to its n AND gates as one of the inputs, while the other inputs of all the n AND gates are fed with the single-bit input b. As shown in Fig. 6(c), each OR cell similarly takes a pair of n-bit input words and has n OR gates; a pair of bits in the same bit-position in B and D is fed to the same OR gate. The output of an AOC is w, 2w, and 3w corresponding to the decimal values 1, 2, and 3 of the 2-bit input (u1 u0), respectively. The decoder along with the AOC performs a multiplication of the input operand w with the two-bit digit (u1 u0), such that the PPG of Fig. 5 performs L/2 parallel multiplications of the input word w with a 2-bit digit to produce L/2 partial products of the product word wu.

3) Structure of Adder-Tree: Conventionally, we should have
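The arithmetic performed by the PPG and the subsequent shift-add combination can be illustrated with a small behavioral model. This is an unsigned sketch only (the hardware's AOC/decoder realization and two's-complement sign handling are not modeled, and the function name `ppg_multiply` is our own): the L-bit operand u is split into L/2 two-bit digits, each digit selects a partial product from {0, w, 2w, 3w}, and the partial products are summed with a left-shift of 2i bits for digit i.

```python
def ppg_multiply(w, u, L=8):
    """Behavioral model of 2-bit (radix-4) partial product
    generation followed by a shift-add combination:
    w * u = sum_i table[u_i] * 4**i, where u_i are the
    2-bit digits of the L-bit unsigned operand u."""
    digits = [(u >> (2 * i)) & 0b11 for i in range(L // 2)]
    table = [0, w, 2 * w, 3 * w]          # AOC outputs for digits 0..3
    partial = [table[d] for d in digits]  # L/2 parallel partial products
    return sum(p << (2 * i) for i, p in enumerate(partial))
```

Since u = Σ u_i·4^i, the shifted sum of the selected partial products reproduces the full product w·u, which is what the L/2 parallel PPG outputs followed by the shift-add tree compute in hardware.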