A High-Performance FIR Filter Architecture for Fixed ...

1

A High-Performance FIR Filter Architecture for Fixed and Reconfigurable Applications Basant Kumar Mohanty, Senior Member, IEEE and Pramod K. Meher, Senior Member, IEEE

Abstract—Transpose-form finite impulse response (FIR) structures are inherently pipelined and support multiple constant multiplication (MCM) results in significant saving of computation. However, transpose-form configuration does not directly support the block processing unlike direct-form configuration. In this paper, we explore the possibility of realization of block FIR filter in transpose-from configuration for area-delay efficient realization of large order FIR filters for both fixed and reconfigurable applications. Based on detail computational analysis of transpose-form configuration of FIR filter we have derived a flowgraph for transpose-from block FIR filter with optimized register complexity. A generalized block formulation is presented for transpose-from FIR filter. We have derived a general multiplierbased architecture for the proposed transpose-form block filter for reconfigurable applications. A low-complexity design using MCM scheme is also presented for the block implementation of fixed FIR filters. Performance comparison shows that the proposed structure involves significantly less area-delay product (ADP) and less energy per sample (EPS) than the existing block direct-form structure for medium or large filter lengths while for the short-length filters, the existing block direct-form FIR structure has less ADP and less EPS than the proposed structure. ASIC synthesis result shows that the proposed structure for block-size 4 and filter-length 64 involve 42% less ADP and 40% less EPS than the best available FIR structure proposed for reconfigurable applications. For the same filter length and the same block size, the proposed structure involves 13% less ADP and 12.8% less EPS than that of the existing direct-from block FIR structure. Based on these findings, we present a scheme for the selection of direct-form and transpose-form configuration based on the filter lengths and block-length for obtaining areadelay and energy efficient block FIR structures. Index Terms—Reconfigurable Architecture, Block Processing, FIR Filter, VLSI.

I. I NTRODUCTION

F

INITE impulse response (FIR) digital filter is widely used in several digital signal processing applications such as speech processing, loud speaker equalization, echo cancellation, adaptive noise-cancellation, and various communication applications including software defined radio (SDR) etc. [1]. Many of these applications require FIR filters of large order to meet the stringent frequency specifications [2]– [4]. Very often these filters need to support high sampling rate for high-speed digital communication [5]. The number of multiplications and additions required for each filter output, however, increases linearly with the filter order. Since, there is B. K. Mohanty with the Department of Electronics and Communication Engineering, Jaypee University of Engineering and Technology, Raghogarh, Guna, Madhya Pradesh, India-473226, (email: [email protected]. P. K. Meher is with School of Computer Engineering, Nanyang Technological University, Nanyang Avenue, Singapore, 639798, email: [email protected], URL:http://www.ntu.edu.sg/homes/aspkmeher/.

no redundant computation available in the FIR filter algorithm, real-time implementation of a large order FIR filter in a resource constrained environment is a challenging task. Filter coefficients very often remain constant and known a priori in signal processing applications. This feature has been utilized to reduce the complexity of realization of multiplications. Several designs have been suggested by various researchers for efficient realization of FIR filters (having fixed coefficients) using distributed arithmetic (DA) [22] and multiple constant multiplication (MCM) methods [10], [13]–[16]. DA-based designs use look-up-tables (LUTs) to store pre-computed results to reduce the computational complexity. The MCM method on the other hand reduces the number of additions required for the realization of multiplications by common subexpression sharing, when a given input is multiplied with a set of constants. The MCM scheme is more effective when a common operand is multiplied with more number of constants. Therefore, MCM scheme is suitable for the implementation of large order FIR filters with fixed coefficients. But, MCM blocks can be formed only in the transpose form configuration of FIR filters. Block-processing method is popularly used to derive highthroughput hardware structures. Not only does it provide throughput-scalable design but also improves the area-delay efficiency. The derivation of block-based FIR structure is straight-forward when direct-from configuration is used [21] whereas the transpose-form configuration does not directly support block processing. But, to take the computational advantage of the MCM, FIR filter is required to be realized by transpose form configuration. Apart from that, transpose form structures are inherently pipelined and suppose to offer higher operating frequency to support higher sampling rate. There are some applications such as SDR channelizer where FIR filters need to be implemented in a reconfigurable hardware to support multi-standard wireless communication [6]. Several designs have been suggested during the last decade for efficient realization of reconfigurable FIR (RFIR) using general multipliers, and constant multiplication schemes [7]– [12], [17], [18]. A programmable multiply-accumulator based processor is proposed in [7] for FIR filtering. The area and power requirement of these architectures are significantly large and, therefore, they are not suitable of SDR channelizer. The structure of [9] is multiplier-based and uses poly-phase decomposition scheme. In [10], a reconfigurable FIR filter architecture using computation sharing vector-scaling technique of [8] has been proposed. In [11], a programmable canonical signed digit (CSD) based architecture was proposed using Booth encoding to generate partial products and Wallace tree adder for

2

addition of partial products. Chen et al [12] have proposed a CSD-based reconfigurable FIR filter where the non-zero CSD values are modified to reduce the precision of filter coefficients without significant impact on filter behavior. But, the reconfiguration overhead is significantly large and does not provide an area-delay efficient structure. The architectures of [8]–[12] are more appropriate for lower order filters and they are not suitable for channel filters due to their large area complexity. Constant shift method (CSM) and programmable shift method (PSM) have been proposed in [16], [17] for RFIR filters specifically for SDR channelizer. Recently, Park et al. [18] have proposed an interesting distributed arithmetic (DA) based architecture for RFIR filter. The existing multiplier-based structures use either direct-form configuration or transposeform configuration. But, the multiplier-less structures of [16], [17] use transpose-form configuration whereas the DA-based structure of [18] uses direct-form configuration. But, we do not find any specific block-based design for RFIR filter in the literature. A block-based RFIR structure can easily be derived using the scheme proposed in [20], [21]. But, we find that the block structure obtained from [20], [21] is not efficient for large filter lengths and variable filter coefficients such as SDR channelizer. Therefore, the design methods proposed in [20], [21] are more suitable for 2-D FIR and BLMS adaptive filters. In this paper we explore the possibility of realization of block FIR filter in transpose-from configuration in order to take advantage of the MCM schemes and the inherent pipelining for area-delay efficient realization of large order FIR filters for both fixed and reconfigurable applications. The main contributions of this paper are as follows: • Computational analysis of transpose-form configuration of FIR filter and derivation of flow-graph for transposefrom block FIR filter with reduced register complexity. • Block formulation for transpose-from FIR filter. • Design of transpose-form block filter for reconfigurable applications. • A low-complexity design method using MCM scheme for the block implementation of fixed FIR filters. The remainder of the paper is organized as follows: In Section II, computational analysis and mathematical formulation of block transpose-form FIR filter are presented. The proposed architectures for fixed and reconfigurable applications are presented in Section III. Hardware and time complexities along with performance comparison are presented in Section IV. The paper ends with the conclusions in Section V. II. C OMPUTATIONAL A NALYSIS AND M ATHEMATICAL F ORMULATION OF B LOCK T RANSPOSE F ORM FIR F ILTER The output of an FIR filter of length N can be computed using the relation

y(n) =

N −1 X

h(i) · x(n − i)

(1)

i=0

The computation of (1) can be expressed by the recurrence relation

h Y (z) = z −1 (· · · (z −1 (z −1 h(N − l) + h(N − 2)) i +h(N − 3)) · · · + h(1)) + h(0) X(z)

(2)

x(n) h(5)

h(4)

h(3)

D

h(2)

D

h(1)

D

h(0)

D

y(n)

D

(a) x(n-1) h(5)

h(4)

h(3)

D

h(2)

D

h(1)

D

h(0)

D

y(n-1)

D

(b)

Fig. 1. Data-flow graph (DFG) of transpose-form structure for N = 6. (a) for output y(n). (b) For output y(n − 1).

ccs

M1

M2

M3

M4

M5

M6

1

x(n-5)h(5)

x(n-5)h(4)

x(n-5)h(3)

x(n-5)h(2)

x(n-5)h(1)

x(n-5)h(0)

2

x(n-4)h(5)

x(n-4)h(4)

x(n-4)h(3)

x(n-4)h(2)

x(n-4)h(1)

x(n-4)h(0)

3

x(n-3)h(5)

x(n-3)h(4)

x(n-3)h(3)

x(n-3)h(2)

x(n-3)h(1)

x(n-3)h(0)

4

x(n-2)h(5)

x(n-2)h(4)

x(n-2)h(3)

x(n-2)h(2)

x(n-2)h(1)

x(n-2)h(0)

5

x(n-1)h(5)

x(n-1)h(4)

x(n-1)h(3)

x(n-1)h(2)

x(n-1)h(1)

x(n-1)h(0)

6

x(n)h(5)

x(n)h(4)

x(n)h(3)

x(n)h(2)

x(n)h(1)

x(n)h(0)

(a) ccs

M1

M2

M3

M4

M5

M6

1

x(n-6)h(5)

x(n-6)h(4)

x(n-6)h(3)

x(n-6)h(2)

x(n-6)h(1)

x(n-6)h(0)

2

x(n-5)h(5)

x(n-5)h(4)

x(n-5)h(3)

x(n-5)h(2)

x(n-5)h(1)

x(n-5)h(0)

3

x(n-4)h(5)

x(n-4)h(4)

x(n-4)h(3)

x(n-4)h(2)

x(n-4)h(1)

x(n-4)h(0)

4

x(n-3)h(5)

x(n-3)h(4)

x(n-3)h(3)

x(n-3)h(2)

x(n-3)h(1)

x(n-3)h(0)

5

x(n-2)h(5)

x(n-2)h(4)

x(n-2)h(3)

x(n-2)h(2)

x(n-2)h(1)

x(n-2)h(0)

6

x(n-1)h(5)

x(n-1)h(4)

x(n-1)h(3)

x(n-1)h(2)

x(n-1)h(1)

x(n-1)h(0)

(b)

Fig. 2. (a) Data-flow table (DFT) of multipliers of DFG shown in Fig.1(a) corresponding to output y(n). (b) DFT of multipliers of DFG shown in Fig.1(b) corresponding to output y(n − 1). The arrow shows accumulation path of the products

A. Computational Analysis The data-flow graphs (DFG-1 and DFG-2) of transposeform FIR filter for filter length N = 6 as shown in Fig.1 for a block of two successive outputs {y(n), y(n − 1)} which are derived from (2). The product values and their accumulation paths in DFG-1 and DFG-2 of Fig.1 are shown in data-flow tables (DFT-1 and DFT-2) of Fig.2. The arrows in DFT-1 and DFT-2 of Fig.2 represents the accumulation path of the products. We find that 5 values of each column of DFT-1 are same as those of DFT-2 (shown in gray colour in Fig.2). These redundant computation of DFG-1 and DFG-2 can be avoided by using non-overlapped sequence of input-blocks as shown in Fig.3. Data flow tables (DFT-3 and DFT-4) of DFG-1 and DFG-2 for non-overlapping input blocks are, respectively, shown in Fig.3(a) and (b). As shown in Fig.3(a) and (b), DFT3 and DFT-4 do not involve redundant computation. It is easy to find that the entries in gray cells in DFT-3 and DFT-4 of Fig.3(a) and (b) correspond to the output y(n) while the other

3

entries of DFT-3 and DFT-4 correspond to y(n − 1). The DFG of Fig.1 need to be transformed appropriately to obtain the computations according to DFT-3 and DFT-4. ccs

M1

M2

M3

M4

M5

M6

1

x(n-10)h(5) x(n-10)h(4) x(n-10)h(3) x(n-10)h(2) x(n-10)h(1) x(n-10)h(0)

2

x(n-8)h(5)

x(n-8)h(4)

x(n-8)h(3)

x(n-8)h(2)

x(n-8)h(1)

x(n-8)h(0)

3

x(n-6)h(5)

x(n-6)h(4)

x(n-6)h(3)

x(n-6)h(2)

x(n-6)h(1)

x(n-6)h(0)

4

x(n-4)h(5)

x(n-4)h(4)

x(n-4)h(3)

x(n-4)h(2)

x(n-4)h(1)

x(n-4)h(0)

5

x(n-2)h(5)

x(n-2)h(4)

x(n-2)h(3)

x(n-2)h(2)

x(n-2)h(1)

x(n-2)h(0)

6

x(n)h(5)

x(n)h(4)

x(n)h(3)

x(n)h(2)

x(n)h(1)

x(n)h(0)

M5

M6

(a) ccs

M1

M2

M3

M4

1

x(n-11)h(5) x(n-11)h(4) x(n-11)h(3) x(n-11)h(2) x(n-11)h(1) x(n-11)h(0)

2

x(n-9)h(5)

x(n-9)h(4)

x(n-9)h(3)

x(n-9)h(2)

x(n-9)h(1)

x(n-9)h(0)

3

x(n-7)h(5)

x(n-7)h(4)

x(n-7)h(3)

x(n-7)h(2)

x(n-7)h(1)

x(n-7)h(0)

4

x(n-5)h(5)

x(n-5)h(4)

x(n-5)h(3)

x(n-5)h(2)

x(n-5)h(1)

x(n-5)h(0)

5

x(n-3)h(5)

x(n-3)h(4)

x(n-3)h(3)

x(n-3)h(2)

x(n-3)h(1)

x(n-3)h(0)

6

x(n-1)h(5)

x(n-1)h(4)

x(n-1)h(3)

x(n-1)h(2)

x(n-1)h(1)

x(n-1)h(0)

The computation of DFT-3 and DFT-4 can be realized by DFG-3 of non-overlapping blocks as shown in Fig.4. We refer it to block transpose-form type-I configuration of block FIR filter. The DFG-3 can be retimed to obtain the DFG4 of Fig.5 which is referred to block transpose-form type-II configuration. Note that both type-I and type-II configurations involve the same number of multipliers and adders, but type-II configuration involves nearly L times less delay elements than those of type-I configuration. We have, therefore, used block transpose-form type-II configuration to derive the proposed structure. In the next section we present mathematical formulation of block transpose-form type-II FIR filter for a generalized formulation of the concept of block-based computation of transpose form FIR filers. C. Mathematical Formulation of the Transpose-form Block FIR Filter

(b)

Fig. 3. Data-flow table of DFG-1 and DFG-2 for three non-overlapped inputblocks [x(n), x(n − 1)], [x(n − 2), x(n − 3)], and [x(n − 4), x(n − 5)]. (a) DFT-3 for computation of output y(n). (b) DFT-4 for computation of output y(n − 1).

x(n)

B. DFG Transformation

Suppose in every cycle the block FIR filter takes a block of L new input samples, and processes those to produce a block of L output samples. The k-th block of filter output yk is computed using the relation y k = Xk · h

x(n-1)

(3)

where the weight-vector h is defined as: T

h(5)

h(3)

h = [h(0), h(1), · · ·, h(N − 1)]

h(1)

D

The input matrix Xk is defined as D

D

D

x(n-2)

D

h(4)

h(2) D

h(0) y(n)

D

D

Xk = x(kL) x(kL − 1)  x(kL − 1) x(kL − 2)   . .   . . x(kL − L + 1) x(kL − L)

y(n-1)

D

. . . . .

 . x(kL − N + 1)  . x(kL − N )   . .   . . . x(kL − L − N + 2) (4)

The input matrix Xk can otherwise be expressed as Fig. 4. Merged DFG (DFG-3: transpose-form type-I configuration for block FIR structure).

Xk =

x0k

x1k

·

· x4k

·

−1 · xN k

(5)

xik

where is the (i + 1)-th column of Xk are defined as: T xik = x(kL − i) x(kL − i − 1) · · · x(kL − i − L + 1) (6)

x(n) x(n-1) D h(5) h(4)

h(3) h(2)

Substituting (5) in (3), the matrix-vector product is expressed in the form of scalar-vector products (SVPs) as

h(1) h(0)

yk =

N −1 X

xik · h(i)

(7)

i=0 D

D

y(n)

D

D

y(n-1)

Fig. 5. DFG-4 (retimed DFG-3) transpose-form type-II configuration for block FIR structure.

Suppose, N is a composite number and decomposed as N = M L, then index i is expressed as i = l + mL, for 0 ≤ l ≤ L − 1, and 0 ≤ m ≤ M − 1. Substituting i = l + mL in (6), we have = xlk−m xl+mL k

(8)

4

Substituting (8) in (5), we have Xk =

h

x0k

x1k

···

xL−1 k

III. P ROPOSED S TRUCTURES

x0k−1

x1k−1

··· i

xL−1 k−1

xL−1 k−M +1

x0k−M +1 x1k−M +1 · · ·

··· (9)

Substituting (9) in (3) we have

yk =

L−1 −1 XM X

xlk−m · h(l + mL)

(10)

There are several applications where the coefficients of FIR filters remain fixed, while in some other applications, like SDR channelizer which requires separate FIR filters of different specifications to extract one of the desired narrow-band channels from the wide-band RF front-end. These FIR filters need to be implemented in a reconfigurable FIR structure to support multi-standard wireless communication [6]. In this section we present a structure of block FIR filter for such reconfigurable applications. In this Section we discuss the implementation of block FIR filter for fixed filters as well using MCM scheme.

l=0 m=0

The input-matrix Xk of (9) has an interesting feature. The data-block x0k is the current block while {x0k−1 , x0k−2 , · · · , x0k−M +1 } are blocks delayed by 1, 2, · · · , (M − 1) cycles. The overlapped blocks {x1k−1 , x1k−2 , · · · , x1k−L+1 } are, respectively, 1 clock cycle, 2 clock cycles, · · ·, (M − 1) cycles delayed version of overlapped block x1k . To take the advantage of this feature, the input-matrix Xk is decomposed into M small matrices Slk such that S0k contain L input-blocks {x0k , x1k , · · · , xL−1 }, k }. and S1k contain input-blocks {x0k−1 , x1k−1 , · · · , xL−1 k−1 L−1 Similarly, the input-block {x0k−M +1 , x1k−M +1 , · · · , xk−M +1 } M −1 constitute the matrix Sk . The coefficient vector h is also decomposed into small weight-vectors cm = {h(mL), h(mL + 1), · · · , h(mL + L − 1)}. Interestingly, Sm k is symmetric and satisfy the following identity 0 Sm k = Sk−m

(11)

According to (11), Sm k (for 1 ≤ m ≤ M − 1) are m clock cycle delayed with respect to S0k . Computation of (10) can be expressed in matrix-vector product using S0k−m and cm as

yk = rm k =

M −1 X

rm k

m=0 S0k−m

· cm

(12a) (12b)

The Computations of (12) may be expressed in a recurrence form h Y(z) = S0 (z) (z −1 (· · · (z −1 (z −1 cM −1 + cM −2 ) i +cM −3 ) + · · ·) + c1 ) + c0 (13) where, S0 (z) and Y(z) are the z-domain representation of S0k and yk , respectively. The DFG-4 of block transpose-form type-II configuration (shown in Fig.5 for N = 6 and L = 2) can be derived using the recurrence relation of (13). The delay operator {z −1 } of (13) represents a delay for a block of data in the transposeform type-II structure which stores the product of S0k and cm . The proposed structure based on the recurrence relations of (13) is presented in the following Section.

A. Proposed Structure for Transpose-from Block FIR Filter for Reconfigurable Applications The proposed structure for block FIR filter is (based on the recurrence relation of (13)) and shown in Fig.6 for block size L = 4. It consists of one coefficient selection unit (CSU), one register unit (RU), M number of inner-product units (IPU) and one pipeline adder unit (PAU). The CSU stores coefficients of all the filters to be used for the reconfigurable application. It is implemented using N ROM LUTs such that filter coefficients of any particular channel filter is obtained in one clock cycle, where N is the filter length. The RU (shown in Fig.7) receives xk during the k-th cycle and produces L rows of S0k in parallel. L rows of S0k are transmitted to M IPUs of the proposed structure. The M IPUs also receive M short weight-vectors from the CSU such that during the k-th cycle, the (m + 1)-th IPU receives the weight-vector cM −m−1 from the CSU and L rows of S0k form the RU. Each IPU performs matrix-vector product of S0k with the short-weight vector cm , and computes a block of L partial filter outputs (rm k ). Therefore, each IPU performs L inner-product computations of L rows of S0k with a common weight-vector cm . The structure of the (m+1)-th IPU is shown in Fig.8. It consists of L number of L-point innerproduct cells (IPCs). The (l + 1)-th IPC receives the (l + 1)th row of S0k and the coefficient-vector cm , and computes a partial result of inner-product r(kL − l), for 0 ≤ l ≤ L − 1. Internal structure of (l+1)-th IPC for L = 4 is shown in Fig.9. All the M IPUs work in parallel and produce M blocks of result (rm k ). These partial inner-products are added in the PAU (shown in Fig.10) to obtain a block of L filter outputs. In each cycle, the proposed structure receives a block of L inputs and produces a block of L filter outputs, where the duration of each cycle is T = TM + TA + TF A log2 L, TM is one multiplier delay, TA is one adder delay, and TF A is one full-adder (FA) delay. B. MCM-Based Implementation of Fixed-Coefficient FIR Filter We discuss the derivation of MCM units for transpose form block FIR filter, and the design of proposed structure for fixed filters. For fixed-coefficient implementation, the CSU of Fig.6 is no longer required since the structure is to be tailored for only one given filter. Similarly, IPUs are not required. The multiplications are required to be mapped to the the MCM units for a low-complexity realization. In the

5

xk

L

Register Unit (RU)

rk1

Coefficient Storage Unit (CSU) cM-2 c0

cM-1

0

rk

rkM-1

D

D

D

D

D

D

D D

D D

D D

yk

0

Sk

Fig. 10. L

L IPU-1 L rk0

L IPU-2 L rk1

IPU-M L

rk M-1

L

Pipelined Adder Unit (PAU)

Fig. 6.

rk2

yk

Proposed structure for block FIR filter

xk

x(kL) x(kL-1)

D D

x(kL-2) x(kL-3)

D

Structure of pipeline adder unit (PAU) for block size L = 4.

following, we show that proposed formulation for MCM-based implementation of block FIR filter which makes use of the symmetry in input matrix S0k to perform horizontal and vertical common sub-expression elimination (CSE) [16] to minimize the number of shift-add operations in the MCM blocks. The recurrence relation of (13) can alternately be expressed as Y(z) = z −1 (· · · (z −1 (z −1 rM −1 + ·rM −2 ) +rM −3 ) + · · ·) + r1 ) + r0

(14)

The M intermediate data-vectors rm , for 0 ≤ m ≤ M − 1 can be computed using the relation R = S0k · C

1

2

xk Fig. 7.

3

xk

4

xk

xk

where R and C are defined as

Internal structure of register unit (RU) for block size L = 4.

Sk0 xk1

xk0 cm

(15)

L

xkL-1 L

L

L

IPC-1

IPC-2

r(kL)

r(kL-1)

IPC-L r(kL-L+1)

R=

rT0

rT1

C=

cT0

cT1

···

rTM −1

(16a)

···

cTM −1

(16b)

To illustrate the computation of (15) for L = 4 and N = 16, we write it as a matrix product given by (17). From (17) we can observe that the input matrix contains six input samples {x(4k), x(4k − 1), x(4k − 2), x(4k − 3), x(4k − 4), x(4k − 5), x(4k−6)}, and multiplied with several constant coefficients as shown in Table I.

m

rk

Fig. 8.

x(4k) R=

Structure of (m + 1)-th inner product unit (IPU).

cm

r(kL- l) (a)

0

rk

rk2

rkM-1

D

D

D

D

D

D

D D

D D

D

yk

D (b)

Fig. 9.

h(0) h(4) h(8)

h(12)

x(4k-1) x(4k-2) x(4k-3)

x(4k-4)

h(1)

h(5) h(9)

h(13)

x(4k-2) x(4k-3)

x(4k-4)

x(4k-5)

h(2)

h(6) h(10) h(14)

x(4k-4) x(4k-5)

x(4k-6)

h(3)

h(7) h(11) h(15)

(17)

x(kL-l-1) x(kL-l-2) x(kL-l-3)

h(Ll) h(Ll+1) h(Ll+2) h(Ll+3)

rk1

x(4k-3)

x(4k-3)

xkl x(kL-l)

x(4k-1) x(4k-2)

Internal structure of (l + 1)-th inner-product cell (IPC) for L = 4.

As shown in Table I, MCM can be applied in both horizontal and vertical direction of the coefficient matrix. The sample x(4k − 3) appeares in 4 rows or 4 columns of the input matrix of (17) whereas x(4k) appeares in only one row or one column. Therefore, all the 4 rows of coefficient matrix are involved in the MCM for the x(4k − 3) whereas only the first row of coefficients are involved in the MCM for x(4k). For larger values of N or the smaller block-sizes, the row size of the coefficient matrix are larger which results in larger MCM size across all the samples which results into larger saving in computational complexity. The proposed MCM-based structure for FIR filters for block-size L = 4 is shown in Fig.11 for the purpose of illustration. The MCM-based structure (shown in Fig.11) involves

6

xk 4

rk0

4

rk1

x(2k-6)

Adder Network 4 4 rk2

4-wide MCM

x(2k-5) 8-wide MCM

x(2k-4) 12-wide MCM

x(2k-3) 16-wide MCM

x(2k-2) 12-wide MCM

8-wide MCM

4-wide MCM

x(2k)

x(2k-1)

Register Unit (RU)

rk3

4

Pipelined Adder Unit (PAU)

4

yk

Fig. 11. Proposed MCM-based structure for fixed FIR filter of block-size L = 4 and filter length N = 16.

6 MCM blocks corresponding to 6 input samples. Each MCM block produces the necessary product terms as listed in Table I. The sub-expressions of the MCM blocks are shift added in the adder network to produce the inner-product values (rl,m ), for 0 ≤ l ≤ L − 1 and 0 ≤ m ≤ (N/L) − 1 corresponding to the matrix product of (15). The inner-product values are finally added in the PAU of Fig.10 to obtain a block of filter output. IV. C OMPLEXITIES AND P ERFORMANCE C ONSIDERATIONS A. Hardware and Time Complexities The proposed block transpose-form type-II FIR structure for reconfigurable application consists of one CSU, one RU, M IPUs and one PAU. The CSU consists of N ROM units of P words each, where P is the number of FIR filters TABLE I MCM IN T RANSPOSE - FORM B LOCK FIR FILTER OF LENGTH 16 AND BLOCK - SIZE 4 Input sample

Coefficient Group

x(4k)

{h(0), h(4), h(8), h(12)}

x(4k − 1)

{h(0), h(4), h(8), h(12)} {h(1), h(5), h(9), h(13)} {h(0), h(4), h(8), h(12)}

x(4k − 2)

{h(1), h(5), h(9), h(13)} {h(2), h(6), h(10), h(14)} {h(0), h(4), h(8), h(12)}

x(4k − 3)

{h(1), h(5), h(9), h(13)} {h(2), h(6), h(10), h(14)} {h(3), h(7), h(11), h(15)} {h(1), h(5), h(9), h(13)}

x(4k − 4)

{h(2), h(6), h(10), h(14)} {h(3), h(7), h(11), h(15)}

x(4k − 5) x(4k − 6)

{h(2), h(6), h(10), h(14)} {h(3), h(7), h(11), h(15)} {h(3), h(7), h(11), h(15)}

to be implemented by the proposed reconfigurable structure. We have excluded complexity of CSU in the performance comparison since it is common in all the reconfigurable FIR structures. Each IPU is comprised of L IP cells, where each IP cell involves L multipliers and (L − 1) adders. The RU involves (L − 1) registers of B-bit width. The PAU involves L(M − 1) adders and the same number of registers, where each register has a width of (B + B 0 ), B and B 0 respectively, being the bit-width of input sample and filter-coefficients. Therefore, the proposed structure involves LN multipliers, L(N − 1) adders and [B(N − 1) + B 0 (N − L)] FFs; and processes L samples in every cycle where the duration of cycle period T = [TM + TA + TF A log2 L − 1)]. We have estimated the hardware and time complexity of the block transposefrom type-I structure from DFG-3 of Fig.4 for comparison purpose. We do not find a multiplier-based direct-form block FIR structure on RFIR in the literature. However, direct-form multiplier-based block FIR structure can be derived from the block formulation of [20]. We have derived the direct-form block FIR structure using equation (4) of [20], and estimated its hardware and time complexities for comparison purpose. B. Performance Comparison The hardware and time complexities of proposed structure, the block transpose-form type-I structure and the extracted direct-from structure of [20] along with those of the existing reconfigurable FIR filter structures of [18] and [17] are listed in Table II for comparison. We have assumed fixed wordlength (B + B 0 ) for the adder-tree in case of direct-form structure as well as the pipeline-adder in case of transposeform structure. As shown in Table II, the proposed block transpose form structures have the same number of multipliers and adders, and they differ only by the register complexity and cycle period. The direct-form structure of [20] and the proposed structures involve the same number of multipliers and adders, but the proposed one involves {(log2 M −1)TF A } less cycle period, where M = N/L, at a marginal cost of B 0 (N − L) FFs. The register complexity of the proposed structure is independent of block-size as in the case of directform structure. However, the cycle period of the proposed block transpose-form type-II structure depends on the inputblock size whereas in case of existing direct-form block FIR structure of [20] it depends on the filter-length. Since, filterlength is usually higher than the block-length, the cycle period of the existing direct-form structure increase for large order filters. To compare with the DA-based structure of [18] and the proposed structure, we find that the proposed structure involves (LN ) multipliers in place of (3N BB 0 /2) MUXes (bit-level), nearly (2L/B) times more adders and B 0 N more FFs, but offers nearly L times higher throughput. Similarly, compared with the CSM-based structure of [17], the proposed structure involve (LN ) multipliers in place of ≈ (7N BB 0 /3) MUXes (bit-level), ≈ (3L/B 0 ) times more adders, B 0 (L − 1) less FFs, and offers L times higher throughput. The proposed block transpose-form type-II structure involves nearly L times less FFs than those of block transpose-form type-I structure and both the structures have the same cycle period. Due to

7

TABLE II G ENERAL C OMPARISON OF H ARDWARE AND T IME C OMPLEXITIES Structures

Flip-Flop

Adder

Multiplier

Mahesh et al [17]

(B + B 0 )(N − 1)

N (α + 1) + 2

0

B(N + B 0 + φ + 1)

[N (B + 1)/2] − 1

Direct-form structure of [20]

B(N − 1)

Transpose-form

B[L(N − L + 1) − 1]

Type-I

+B 0 L(N − L)

Transpose-form

B(N − 1) + B 0 (N − L)

(CSM) Park et al [18] (One Level Pipeline)

Type-II

MUX 2:1

CP

(bit-level) [7α(B + 2)

4TM U X + 4TA

+B + B 0 ]N

+(β + 1)TF A

0

3N BB 0 /2

2TM U X + 2TA

L(N − 1)

NL

0

L(N − 1)

NL

0

L(N − 1)

NL

0

+φTF A TM + TA +(log2 N − 1)TF A TM + TA +(log2 L)TF A TM + TA +(log2 L)TF A

Throughput 1/CP 1/CP L/CP L/CP

L/CP

α = dB 0 /3e, β = dlog2 α − 1e, φ = log2 (N/2) − 1. TABLE III T HEORETICALLY E STIMATED H ARDWARE AND T IME C OMPLEXITIES OF P ROPOSED AND E XISTING S TRUCTURES FOR B = 8 AND B 0 = 16 Structures

Filter-length

FF

Adder

Multipliers

MUX 2:1 bit-level

Cycle period (T )

Throughput

16

360

114

0

7104

4TM U X + 4TA + 3TF A

1/T

Mahesh et al [17]

32

744

226

0

14208


1/T

(CSM)

64

1512

450

0

28416


1/T

128

3048

898

0

56832


1/T

16

296

71

0

3456


1/T

Park et al [18]

32

432

143

0

6912


1/T

(One Level Pipeline)

64

696

287

0

13824


1/T

128

1216

575

0

27648


1/T

Direct-form

16

120

60

64

0

TM + TA + 3TF A

4/T

structure of [20]

32

248

124

128

0

TM + TA + 4TF A

4/T

L=4

64

504

252

256

0

TM + TA + 5TF A

4/T

128

1016

508

512

0

TM + TA + 6TF A

4/T

Direct-form

16

120

120

128

0

TM + TA + 3TF A

8/T

structure of [20]

32

248

248

256

0

TM + TA + 4TF A

8/T

L=8

64

504

504

512

0

TM + TA + 5TF A

8/T

128

1016

1016

1024

0

TM + TA + 6TF A

8/T

Proposed

16

312

60

64

0

TM + TA + 2TF A

4/T

structure

32

696

124

128

0

TM + TA + 2TF A

4/T

L=4

64

1464

252

256

0

TM + TA + 2TF A

4/T

128

3000

508

512

0

TM + TA + 2TF A

4/T

Proposed

16

248

120

128

0

TM + TA + 3TF A

8/T

Structure

32

632

248

256

0

TM + TA + 3TF A

8/T

L=8

64

1400

504

512

0

TM + TA + 3TF A

8/T

128

2936

1016

1024

0

TM + TA + 3TF A

8/T

less FF complexity, proposed type-II structure is more efficient than the type-I structure. In-spite of more FFs, the proposed structure may have less area-delay product (ADP) and less energy per sample (EPS) than the existing direct-form structure due to its small cycle period. We have estimated hardware and time complexities of proposed structures for block sizes L = 4, 8, and filter-lengths N = 32 and 64. Also we have estimated the hardware and the time complexities of the direct-form structure extracted from

[20] for the same block-size and for the filter-lengths, and those of [18] and [17] for the same filter-lengths. We have considered B = 8 (wordlength of input sample), B 0 = 16 (wordlength of filter coefficient) and 24-bit wordlength for the intermediate and output signals for all the designs. The estimated values are listed in Table III for comparison. We can find from Table III that, the multiplier and adder complexities of the proposed structure increases proportionately with block-size and filter-length as in the case of direct-form

8

structure. The cycle period of direct-form structure increases proportionately with the filter-length such that it increases by an amount TF A when filter-length doubles. But, cycleperiod of proposed structure is independent of filter-length and increases by an amount TF A when the block-size doubles. The area-delay performance of the proposed structure is found better than that of direct-form structure of [20] for higher filter lengths due to smaller cycle period. Besides, the proposed structure support MCM-scheme when fixed-coefficient filters are implemented, whereas direct-form structure does not support MCM scheme. We have shown that the proposed structure offers both horizontal and vertical MCM, which can be exploited in the proposed structure to reduce the area complexity substantially further compared with the direct-form structure for the implementation of fixed filters. Compared with the direct-form structure, the proposed structure for block size 4 involves 192, 448, 960, 1984 more FFs and its cycle period less by TF A , 2TF A , 3TF A , 4TF A for filter lengths 16, 32, 64 and 128, respectively. The proposed structure involves more number of FFs than the direct-form structure for higher filter lengths, but the excess area due to those FFs is very small compared to the total area of the direct-form structure. On the other hand, the saving in cycle period in the transpose-form structure for higher filter-lengths is significant with respect to the cycle period of direct-form structure. Therefore, the overall ADP of the proposed block transpose-form type-II structure is found to be less than that of direct-form structure [20] for higher filter-lengths. Compared with the structure of [18], the proposed transpose-form typeII structure for N = 64 and L = 4 involves 256 multipliers against 13824 bit-level MUXes, 35 less adders, 48 less FFs, and offers nearly 4 times higher throughput rate. For the same block-size and filter-length, the proposed transpose-form typeII structure involves 256 multipliers against 28416 bit-level MUXes, 198 less adders, 48 less FFs than those of the structure of [17], and it offers more than 4 times higher throughput rate due to its smaller cycle period. C. Synthesis Results We have coded the proposed structure in VHDL for filter lengths 16, 32 and 64 and block-sizes 4 and 8. Also we have coded the direct-form block FIR structure extracted from [20] for the same filter-lengths and the same blocksizes, and the structures of [18] and [17] for the same filterlengths. We have considered B = 8, B 0 = 16, and 24-bit wordlength for the intermediate and the output signals of all the designs. All the designs are synthesized using Synopsys Design Compiler TMSC 65nm CMOS library. The area, the minimum clock period (MCP), and power estimates obtained from the synthesis reports generated by the Design Compiler are listed in Table IV for comparison. As shown in Table IV, the proposed transpose-form type-II structure involves more area and consumes more power than the existing directform structure [20] due to extra FFs. But, it has less MCP (higher sampling frequency1 , block-size=1 in case of [17], [18]) than the corresponding direct-form structure of [20] due

TABLE IV S YNTHESIS R ESULTS OF P ROPOSED S TRUCTURES AND E XISTING S TRUCTURES USING TMSC 65 NM CMOS S TANDARD C ELL L IBRARY, N : FILTER - LENGTH . P OWER ESTIMATED AT MINIMUM CLOCK PERIOD (MCP) Structures

N

MCP (ns)

Samp. rate (GHz)

Area (um2 )

Power (mW)

Mahesh [17] (CSM)

16 32 64

1.27 1.29 1.30

0.78 0.77 0.76

67949 133015 264342

28.7 55.8 104.4

Park [18] (one level) Pipeline

16 32 64

0.95 1.10 1.28

1.05 0.90 0.78

56325 112446 218351

24.9 47.3 85.6

Direct-form structure of [20] (L = 4)

16 32 64

1.35 1.51 1.67

2.96 2.66 2.39

112723 223878 443277

55.3 98.6 179.5

Direct-form structure of [20] L=8

16 32 64

1.36 1.53 1.69

5.88 5.29 4.73

218446 436403 860879

90.2 175.3 345.2

Proposed structure (L = 4)

16 32 64

1.30 1.30 1.30

3.07 3.07 3.07

123489 247350 495437

59.3 108.2 201.1

Proposed structure (L = 8)

16 32 64

1.40 1.40 1.40

5.71 5.71 5.71

232089 476503 957186

100.9 186.4 366.2

Sampling frequency=block-size/minimum clock period (MCP), where blocksize=1 in case of [17] and [18].

to shorter critical path. We have estimated the increase in area (∆A) and reduction in MCP (∆T ) of proposed transposeform type-II structure over the direct-form structure of [20] for different block-sizes and different filter-lengths. Graphs are plotted using these estimated values and shown in Fig.12 and Fig.13. Note that the ADP varies directly with (∆A) whereas it varies inversely with (∆T ). As shown in Figs.12 and 13, the intersection point of two curves {∆A and ∆T } gives a filter-length (N0 ) where the direct-form stricture of [20] and transpose-form type-II structures have nearly the same ADP. For N < N0 , (∆A) is higher than (∆T ) and transposeform type-II structure has higher ADP than that of directfrom structure of [20]. Similarly, for N > N0 the (∆T ) is higher than (∆A) and the transpose-form type-II structure has less ADP than the direct-from structure of [20]. The N0 shift marginally towards higher value for higher block-sizes due to increase in MCP of the transpose-form type-II structure. We have estimated ADP2 and energy per sample (EPS3 ) of proposed transpose-form type-II structure and direct-form structure of [20]. We have also estimated the reduction of ADP (RADP) and reduction of EPS (REPS) of the proposed transpose-form type-II structure over the direct-form structure of [20]. The estimated RADP and REPS values are shown in the bar-chart of Fig.13 for comparison. As shown in Fig.13, the proposed transpose-form type-II structure has higher ADP and EPS (negative RADP and REPS) than the existing directform structure of [20] for small filter-lengths while it has less ADP and less EPS (positive RADP and REPS) for filter-length 2 ADP=area/sampling

1 sampling

frequency=block-size/minimum clock period

frequency frequency

3 EPS=power/sampling

9

TABLE V S ELECTION OF D IRECT- FORM AND T RANSPOSE - FORM TYPE -II C ONFIGURATIONS FOR B LOCK FIR F ILER S TRUCTURE

Block-size 4

In percentage (%)

Reduction in MCP 25 20 15 10 5 0

Increase in Area

N0

16

32

64

Filter-length (N)

(a)

Block-size

N = 16

N = 32

N = 64

N = 128

N = 256

L=2

TF-II

TF-II

TF-II

TF-II

TF-II

L=4

DF

TF-II

TF-II

TF-II

TF-II

L=8

DF

DF

TF-II

TF-II

TF-II

L = 16

DF

DF

TF-II

TF-II

TF-II

Block-size 8 Reduced MCP

LEGEND: DF: direct-form, TF-II: transpose-form type-II.

Increase in Area

In percentage (%)

20 15 10 5 0 16

-5

N0 32 Filter-length (N)

64

(b) Fig. 12. Increase in area and decrease in minimum cycle period (MCP) of transpose-form type-II structure with respect to direct-form structure of [20]. (a) For block-size 4, (b) For block-size 8. N0 is the filter-length for the direct-form structure of [20] and the transpose-form type-II FIR structures have nearly the same area-delay efficiency. Block-size 8

30 25 20 15 10 5 0 -5

Mahesh [9] Park [10] DF (L=4) [15] Prop. (L=4)

200 100

16 16

32 Filter-length (N)

64

150

Block-size 8

EPS (nJ)

Block-size 4

64


(a)

12 Reduction of EPS (pJ)

300

0

(a) 10 8

100 Mahesh [9] Park [10] DF (L=4) [15] Prop. (L=4)

50

6 4

0

2

16

0

(b)

-2 -4

(b)

400 ADP (sq.um.us)

Reduction of ADP (sq.um.us)

Block-size 4

structure has higher ADP and EPS saving than the proposed block transpose-form type-II structure for short-length filters. Compared with the structure of [18] which is the best amongst the existing designs, the block proposed transpose-form typeII structure for block-size 4 and filter-length 64 involve 42% less ADP and 40% less EPS respectively.

16


64

Fig. 13. Reduction of area-delay product (RADP) and reduction of energyper sample (REPS) of proposed transpose-form type-II structure compared to that of direct-form structure of [20]. (a) Reduction of ADP. (b) Reduction of EPS.

N ≥ 32. The proposed block transpose-form type-II structure offers higher saving of ADP and EPS saving than the existing direct-form structure [20] for higher filter-lengths. Based on the above observation, we suggest a selection rule (given in Table V) for the direct-form (DF) and transpose-form type-II (TF-II) configuration to derive area-delay and energy efficient block FIR filter structure of different block-sizes and filterlengths . We have estimated ADP and EPS of the existing structures of [17] and [18] and these estimated values along with those of the proposed structures and the direct-form structure of [20] are shown in bar-charts of Fig.13 for comparison. As shown in Fig.13, the proposed block transpose-form typeII structures has significantly less ADP and EPS than the existing multiplier-less structures of [17] and [18]. Compared with the existing direct-form structure of [20], the transposeform type-II structure offers higher ADP and EPS saving for medium and large filter lengths while the existing direct-form

Fig. 14.

32

64

Filter-length (N)

(a) Comparison of ADP. (b) Comparison of EPS

V. C ONCLUSION Transpose-form structures are inherently pipelined and supports MCM which results significant saving in computation and increase in higher sampling rate. However, transposeform configuration does not directly support the block processing. In this paper, we have explored the possibility of realization of block FIR filter in transpose-from configuration for area-delay efficient realization of large order FIR filters for both fixed and reconfigurable applications. We have made computational analysis of transpose-form configuration of FIR filter and derived a flow-graph for transpose-from block FIR filter with optimized register complexity. A generalized block formulation is also presented for transpose-from block FIR filter. Based on that we have derived transpose-form block filter for reconfigurable applications. We have presented the scheme to identify the MCM blocks explored the horizontal and vertical sub-expression elimination for the implementation of the proposed block FIR structure for fixed coefficients to reduce the computational complexity. A low-complexity design method using MCM scheme is also presented for the block implementation of fixed FIR filters. Performance comparison

10

shows that the proposed structure involve significantly less ADP and less EPS than the existing block direct-form structure for medium or large filter lengths while for the short-length filters, the existing block direct-form FIR structure has less ADP and less EPS than the proposed structure. ASIC synthesis result shows that the proposed structure for block-size 4 and filter-length 64 involve 42% less ADP and 40% less EPS than the best available FIR structure of [18] for reconfigurable applications. For the same filter length and block size, the proposed structure involves 13% less ADP and 12.8% less EPS than that of the existing direct-from block FIR structure of [20]. Based on these findings, selection of direct-form and transpose-form configuration based on the filter lengths and block-length is suggested for obtaining area-delay and energy efficient structures for block FIR filters.

[17] R. Mahesh and A. P. Vinod, ”New reconfigurable architectures for implementing FIR filters with low complexity,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 29, no. 2, pp. 275–288, Feb. 2010. [18] S. Y. Park and P. K. Meher, ”Efficient FPGA and ASIC realizations of DA-based reconfigurable FIR digital filter”, IEEE Transactions on Circuits and Systems-II, Express Briefs, online available. [19] K. K. Parhi, VLSI Digital Signal Procesing Systems: Design and Implementation. New York: Wiley, 1999. [20] B. K. Mohanty and P. K. Meher, ”A high-performance energy-efficient architecture for FIR adaptive filter based on new distributed arithmetic formulation of block LMS algorithm,” IEEE Transactions on Signal Processing, vol. 61, no. 4, pp. 921-932, Feb. 15, 2013. [21] B. K. Mohanty, P. K. Meher, S. A-Madeed A. Amira ”Memory footprint reduction for power-efficient realization of 2-D finite impulse response filters”, IEEE Transactions on Circuits and Systems-I, Regular Papers, vol. 61, no. 1, pp. 120-133, Jan. 2014. [22] S. A. White, ”Applications of distributed arithmetic to digital signal processing: A tutorial review,” IEEE ASSP Magazine, vol. 6, pp. 4–19, July, 1989.

R EFERENCES [1] J. G. Proakis and D. G. Manolakis, Digital Signal Processing: Principles, Algorithms and Applications. Upper Saddle River, NJ: PrenticeHall, 1996. [2] T. Hentschel and G. Fettweis, ”Software radio receivers,” in CDMA Techniques for Third Generation Mobile Systems, Dordrecht, The Netherlands: Kluwer Academic, 1999, pp. 257–283. [3] E. Mirchandani, R. L. Zinser Jr., and J. B. Evans, ”A new adaptive noise cancellation scheme in the presence of crosstalk [speech signals],” IEEE Trans. Circuits Syst. II, Analog. Digit. Signal Process, vol. 39, no. 10, pp. 681–694, Oct. 1995. [4] D. Xu and J. Chiu, ”Design of a high-order FIR digital filtering and variable gain ranging seismic data acquisition system,” in Proc. IEEE Southeastcon’93, Apr. 1993, pp. 6. [5] J. Mitola, ”Object-oriented approaches to wireless systems engineering,” in Software Radio Architecture, New York: Willy, 2000. [6] A. P. Vinod and E. M-K. Lai , ”Low power and high speed implementation of FIR filters for software defined radio receivers,” in IEEE Trans. on Wireless Communications, vol. 7, no. 5, pp. 1669–1675, 2006. [7] T. Solla and O. Vainio, ”Comparison of programmable FIR filter architectures for low power,” in Proc. 28th Eur. Solid-State Circuits Conf., Firenze, Italy, Sep. 2002, pp. 759–762. [8] K. Muhammad and K. Roy, ”Reduced computational redundancy implementation of DSP algorithms using computation sharing vector scaling,” IEEE Trans. Very Large Scale Integr. Syst., vol. 10, no. 3, pp. 292–300, Jun. 2002. [9] X. Chenghuan, C. He, Z. Shunan, and W. Hua, ”Design and implementation of a high-speed programmable polyphase FIR filter,” in Proc. 5th Int. Conf. Applicat.-Specific Integr. Circuit, vol. 2. Oct. 2003, pp. 783787. [10] J. Park, W. Jeong, H. Mahmoodi-Meimand, Y. Wang, H. Choo, and K. Roy, ”Computation sharing programmable FIR filter for low-power and high-performance applications,” IEEE J. Solid State Circuits, vol. 39, no. 2, pp. 348–357, Feb. 2004. [11] T. Zhangwen, J. Zhang, and H. Min, ”A high-speed, programmable, CSD coefficient FIR filter,” IEEE Trans. Consumer Electron., vol. 48, no. 4, pp. 834–837, Nov. 2002. [12] K. H. Chen and T. D. Chiueh, ”A low-power digit-based reconfigurable FIR filter,” IEEE Trans. Circuits Syst. II, vol. 53, no. 8, pp. 617–621, Aug. 2006. [13] P. K. Meher, ”Hardware-efficient systolization of DA-based calculation of finite digital convolution,” IEEE Trans. Circuits Syst. II: Express Briefs, vol. 53, no. 8, pp. 707–711, Aug. 2006. [14] P. K. Meher, S. Chandrasekaran, and A. Amira, ”FPGA realization of FIR filters by efficient and flexible systolization using distributed arithmetic,” IEEE Trans. Signal Process., vol. 56, no. 7, pp.3009-3017, Jul. 2008. [15] P. K. Meher, ”New approach to look-up-table design and memory-based realization of FIR digital filter,” IEEE Transactions on Circuits and Systems-I, vol. 57, no. 3, pp. 592-6003, March 2010. [16] R. Mahesh and A. P. Vinod, ”A new common subexpression elimination algorithm for realizing low complexity higher order digital filters,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 27, no. 2, pp. 217–219, Feb. 2008.

Basant K Mohanty (M’06, SM’11) received M.Sc degree in Physics from Sambalpur University, India, in 1989 and received Ph.D degree in the field of VLSI for Digital Signal Processing from Berhampur University, Orissa in 2000. In 2001 he joined as Lecturer in EEE Department, BITS Pilani, Rajasthan. Then he joined as an Assistant Professor in the Department of ECE, Mody Institute of Education Research (Deemed University), Rajasthan. In 2003 he joined Jaypee University of Engineering and Technology, Guna, Madhya Pradesh, where he becomes Associate Professor in 2005 and full Professor in 2007. His research interest includes design and implementation of low-power and high-performance systems for adaptive filters, image and video-processing applications, secured communication and reconfigurable architectures. He has published nearly 50 technical papers. Dr.Mohanty is serving as Associate Editor for the Journal of Circuits, Systems, and Signal Processing. He is a life time member of the Institution of Electronics and Telecommunication Engineering, New Delhi, India, and he was the recipient of the Rashtriya Gaurav Award conferred by India International friendship Society, New Delhi for the year 2012. Pramod Kumar Meher (SM’03) received the B.Sc. (Honours) and M.Sc. degree in physics, and the Ph.D. degree in science from Sambalpur University, India, in 1976, 1978, and 1996, respectively. Currently, he is a Senior Research Scientist with Nanyang Technological University, Singapore. Previously, he was a Professor of Computer Applications with Utkal University, India, from 1997 to 2002, and a Reader in electronics with Berhampur University, India, from 1993 to 1997. His research interest includes design of dedicated and reconfigurable architectures for computation-intensive algorithms pertaining to signal, image and video processing, communication, bio-informatics and intelligent computing. He has contributed more than 200 technical papers to various reputed journals and conference proceedings. Dr. Meher has served as a speaker for the Distinguished Lecturer Program (DLP) of IEEE Circuits Systems Society during 2011 and 2012 and Associate Editor of the IEEE Transactions on Circuits and Systems-II: Express Briefs during 2008 to 2011, and Associate Editor for the IEEE Transactions on Circuits and Systems-I: Regular Papers during 2012-2013. Currently, he is serving as Associate Editor for the IEEE Transactions on Very Large Scale Integration (VLSI) SYSTEMS, Journal of Circuits, Systems, and Signal Processing (CSSP), and Integration, the VLSI Journal. Dr. Meher is a Fellow of the Institution of Electronics and Telecommunication Engineers, India. He was the recipient of the Samanta Chandrasekhar Award for excellence in research in engineering and technology for 1999

A High-Performance FIR Filter Architecture for Fixed ...

A High-Performance FIR Filter Architecture for Fixed ...

Suggest Documents

FIXED-POINT-COEFFICIENT FIR FILTERS AND FILTER BANKS ...

Design of digit serial FIR filter using shift add architecture

High Speed Wavelet Based FIR Filter Architecture on FPGA ... - IJRIT

High Speed Wavelet Based FIR Filter Architecture on FPGA Platform ...

realization of fir filter using modified distributed arithmetic architecture

Automatic FIR Filter Generation for FPGAs - CiteSeerX

A window function for FIR filter design with an ...

HighPerformance Glass Fiber Development for

A HighPerformance, LowPower Chip Multiprocessor for ... - CiteSeerX

A HighPerformance Recycling Solution for ... - TAMU Chemistry

Efficient Vectorization of the FIR Filter - AES

Analog Multiplierless LMS Adaptive FIR Filter

Synthesis of a unique highperformance

FIR filter design - Signal Wizard Systems

Successive Approximation FIR Filter Design - UFRJ

Two-Channel Perfect Reconstruction FIR Filter

design of low power fir filter

Parametrized Biorthogonal Wavelets and FIR Filter

Multiplierless FIR Filter Implementation on FPGA - ijiee

Research Article Efficient FIR Filter Implementations ...

Design of FIR Filter Using Hamming Window

Practical FIR Filter Design in MATLAB

realization of fir filter using modified distributed

Evaluation of Power Efficient FIR Filter for FPGA ... - Science Direct