An Optimized Coefficient Update Processor for High-Throughput Adaptive Equalizers

Christian Lütkemeyer, Tobias G. Noll
Chair of Electrical Engineering and Computer Systems, University of Technology RWTH Aachen, Germany

Abstract

A processor for the adaptation of the coefficients in high-throughput adaptive equalizers is presented. The accumulation operation, the fundamental basis of the adaptation process, is split into two steps. A fine-grain carry-save accumulation with timesharing factor 2 collects the products of estimated error and input symbols over a blocklength of 16 input symbols and operates at twice the symbol rate. A master accumulator with timesharing factor 32 collects the block sums from 16 fine-grain accumulators, multiplies them by the adaptation constant and carries out the final vector merging operation, saturation, tap leakage and radix-4 Booth recoding.

Three steps to reduce the power consumption of the fine-grain accumulators are presented and evaluated for a 14-bit-wide accumulator. First, suppressing one of the two redundant codes for the value "1" in the carry-save digit alphabet, i.e. (0,1) or (1,0), reduces the power consumption by 5.5%. Second, the redundancy-reduced digit alphabet can be exploited to reduce the transistor count of the following full adder by one third; the resulting reduction of the input capacitance increases the maximum clock frequency by nearly 15% and reduces the power consumption by a further 2.7%. Finally, an optimized sign extension logic reduces the capacitive load of the input sign bits by 70%, eliminates six of the full adders in the sign extension slices and increases the power reduction to 19.2%. The maximum clock frequency of the accumulator is increased by 18% due to the reduced internal loads.

1 Introduction

Adaptive equalization is one of the key tasks of a receiver in a mobile communication environment. It mitigates intersymbol interference (ISI) originating from multipath propagation. In transversal or decision-feedback equalizers that update their coefficients according to the least-mean-square (LMS) error criterion, half of the arithmetic complexity is needed for the coefficient adaptation. Eqns. (1)-(4) describe the recursive coefficient adaptation for a complex-valued transversal equalizer.

\[ c_{ii}^{(j)}(k+1) = c_{ii}^{(j)}(k) + \Delta\,\hat{e}_i(k)\,x_i(k-j) \tag{1} \]
\[ c_{iq}^{(j)}(k+1) = c_{iq}^{(j)}(k) + \Delta\,\hat{e}_q(k)\,x_i(k-j) \tag{2} \]
\[ c_{qi}^{(j)}(k+1) = c_{qi}^{(j)}(k) + \Delta\,\hat{e}_i(k)\,x_q(k-j) \tag{3} \]
\[ c_{qq}^{(j)}(k+1) = c_{qq}^{(j)}(k) + \Delta\,\hat{e}_q(k)\,x_q(k-j) \tag{4} \]

Symbol $k$ denotes the timestep, $j$ the coefficient index, $\Delta$ the adaptation step size, $x$ the input sample and $\hat{e}$ the estimated error at the slicer. The choice of $\Delta$ is limited by stability considerations to $0 < \Delta < 2/(L \cdot E\{x^2\}) = \Delta_{max}$, where $L$ is the number of coefficients and $E\{x^2\}$ is the energy of the input signal. The actual choice of $\Delta$ is a compromise: a small $\Delta$ gives a slow convergence rate and a low asymptotic mean-square error, whereas $\Delta = \Delta_{max}/2$ gives the optimum adaptation speed at the expense of an increased asymptotic mean-square error [LM94].

Due to their recursive nature, these correlations between $x$ and $\hat{e}$ cannot be implemented directly for high data rates. It is possible to modify the adaptation to a block adaptation scheme that keeps the filter coefficients $c$ constant for $N$ input symbols, accumulates the products $\Delta\,\hat{e}_k\,x_{k-j}$ in a fast carry-save accumulator and updates the coefficients by a vector merging adder every $N$-th input symbol [DSS94]. We will refer to this approach as a one-stage implementation. It is customary to implement the multiplication by the adaptation constant $\Delta$ as a variable shift operation. The wordlength of the carry-save accumulator is determined by the minimal contribution from the term $\Delta\,\hat{e}_k\,x_{k-j}$ and the maximal coefficient value, and is typically on the order of 25 bits. The adaptation step size $\Delta$ has to be decreased for an increasing blocklength $N$ to ensure adaptation stability, but Fig. 1 shows that the overall adaptation speed is not sacrificed for blocklengths up to $N = 32$.

[Figure 1: plot of the log of the output mean-square error (axis range about -0.6 to -2.6) versus the number of output samples (0 to 1000). Channel $H(z) = 0.304\,z^{0} + 0.903\,z^{-1} + 0.304\,z^{-2}$, 11 taps, SNR = 30 dB. Curves: LMS with Delta = 0.040, N = 1; Delta = 0.0200, N = 4; Delta = 0.0100, N = 8; Delta = 0.0050, N = 16; Delta = 0.0025, N = 32.]

Figure 1. Adaptation of a transversal equalizer for adaptation blocklength N
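The block adaptation scheme described above can be illustrated with a minimal floating-point sketch. It is not the authors' implementation: it is real-valued rather than complex, the function name `block_lms` and its parameters are hypothetical, and the sign convention (error = reference minus filter output, with the accumulated products added to the coefficients) is an assumption.

```python
import random


def block_lms(x, d, num_taps, delta, block_len):
    """Block LMS: hold the coefficients for block_len symbols, then update.

    x: received samples, d: reference (training) symbols, both real-valued.
    Returns (coefficients, list of squared errors).
    """
    c = [0.0] * num_taps        # coefficients, constant within a block
    acc = [0.0] * num_taps      # one accumulator per coefficient
    delay = [0.0] * num_taps    # delay[j] holds x(k - j)
    sq_err = []
    for k, (xk, dk) in enumerate(zip(x, d), start=1):
        delay = [xk] + delay[:-1]
        y = sum(cj * xj for cj, xj in zip(c, delay))
        e = dk - y              # assumed convention: reference minus output
        sq_err.append(e * e)
        for j in range(num_taps):
            acc[j] += e * delay[j]          # accumulate e(k) * x(k - j)
        if k % block_len == 0:              # update every block_len-th symbol
            for j in range(num_taps):
                c[j] += delta * acc[j]
                acc[j] = 0.0
    return c, sq_err


if __name__ == "__main__":
    random.seed(1)
    symbols = [random.choice((-1.0, 1.0)) for _ in range(2000)]
    # Distortion-free channel: the equalizer should approach [1, 0, 0].
    coeffs, _ = block_lms(symbols, symbols, num_taps=3, delta=0.05, block_len=4)
    print([round(cj, 3) for cj in coeffs])
```

With `block_len=1` the function reduces to the sample-by-sample LMS recursion. The step size has to be chosen smaller for larger blocks, as stated in the text.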

Further modifications of the adaptation equations reduce one (hybrid correlation scheme) or both (sign-bit correlation) of the factors $x_{k-j}$ and $\hat{e}_k$ to their sign bits. The computation of $\hat{e}_k\,x_{k-j}$ is then reduced to a controlled two's complementer, at the expense of a reduced adaptation speed.
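As a small illustration of these simplified correlation schemes, the following sketch compares the full product with the two reduced forms. The function names are hypothetical, and the choice of reducing the error (rather than the input sample) to its sign bit in the hybrid scheme is an assumption, since the text does not say which factor is reduced.

```python
def sign(v):
    """Sign as given by a two's-complement sign bit: +1 for v >= 0, -1 for v < 0."""
    return -1 if v < 0 else 1


def full_product(e, x):
    """Conventional LMS correlation term e * x."""
    return e * x


def hybrid_product(e, x):
    """Hybrid scheme (assumed variant): only the sign of e is used,
    so x is either passed through or negated (controlled two's complement)."""
    return -x if e < 0 else x


def sign_sign_product(e, x):
    """Sign-bit correlation: both factors are reduced to their signs."""
    return sign(e) * sign(x)
```

For example, `hybrid_product(-3, 5)` negates the sample and returns `-5`, while `sign_sign_product(-3, -5)` returns `1`.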


The adaptation speed of the conventional LMS algorithm is relatively low compared to RLS-based algorithms [Pro89], but a modified LMS algorithm makes it possible to increase the adaptation speed to that of RLS-based algorithms and enables a flexible tradeoff between adaptation speed and throughput [LN97].

The reduction of the power consumption is one major goal in the design of mobile communication ICs. The modified LMS algorithm puts further pressure on this goal, as the arithmetic effort as well as the power dissipation per output sample grows in proportion to the increase in adaptation speed.

The paper is structured as follows. Section 2 describes the architecture of the coefficient adaptation processor, which is composed of 16 multipliers and fine-grain carry-save accumulators that calculate and collect the products $\hat{e}_k\,x_{k-j}$ for a block of $N = 16$ input symbols, and a master accumulator that carries out the variable shift, vector merging and saturation operations for the outputs of the fine-grain accumulators. Section 3 describes the logic extension in the fine-grain accumulator that eliminates the redundant code of the "1" in a carry-save digit alphabet and introduces a redundancy-reduced full adder optimized for the redundancy-reduced digit alphabet. Section 4 presents an optimized sign extension logic. Section 5 compares the fine-grain accumulator for the different optimization steps, and Section 6 concludes the paper.

2 The Coefficient Update Processor CUP

Key ideas in the Coefficient Update Processor are the reduction of the internal wordlength of the accumulators and an optimal utilization of the less frequently used operators, such as the vector merging adder, by introducing timesharing for these operators. Fig. 2 shows how these goals are realized by splitting the accumulation process into two stages. In the first stage, 16 fine-grain accumulators collect the products $\hat{e}_k\,x_{k-j}$, $j = 0, \ldots, 15$, from the 16 multipliers. Enabled by the pointer signal en, they transmit the MSBs of their results one at a time to the master accumulator in the second stage on a tri-state bus. The bits that were transmitted to the master accumulator are cleared in the fine-grain accumulators to ensure the correctness of the accumulation. The number of MSBs to be cleared is determined by the shift operation performed in the master accumulator. Fig. 3 outlines the co-operation.

The two-stage approach has several advantages over a one-stage approach without timesharing:

- The wordlength $w_\Sigma$ of the fine-grain accumulators can be reduced significantly, as they do not operate on the full coefficients $c$. For $w_x = 3$, $w_{\hat{e}} = 6$, $\Delta = 2^{-15}$ and $N = 16$, an internal wordlength of 14 bits is sufficient to achieve a resolution of 21 bits. Chip area and power dissipation for these components are reduced by one third.
- As the shifter, tap leakage operation, vector merging adder and saturation logic are timeshared by a factor of 16 in the master accumulator, the area of these components is reduced by more than 90%. The power consumption for these operations remains the same to a first-order approximation. Only the power consumption for the shift operation also decreases significantly, because it is moved from the input rate in a single-stage implementation to the decimated output rate.
- The initialization of the coefficients can be done easily, as they can be loaded serially into the master accumulator. They need not be broadcast to the fine-grain accumulators, which only have to be cleared at the beginning of the adaptation.

[Figure 2: block diagram. The delay line for the input x (wordlength $w_x$, with load and cin inputs) and the error input $\hat{e}$ (wordlength $w_{\hat{e}}$) feed the fine-grain accumulators $\Sigma_0, \Sigma_1, \ldots, \Sigma_{14}, \Sigma_{15}$ (wordlength $w_\Sigma$), which are selected by the enable signal en (with cout output). The master accumulator contains the $\Delta$-shift, tap leakage, VMA, saturation and Booth coding blocks and delivers coefficients of wordlength $w_c$.]

Figure 2. Block diagram of the CUP

[Figure 3: bit-alignment diagram. The upper bits of the fine-grain accumulator $\Sigma_i$ are cleared every N-th cycle, its lower bits are free running; the transmitted bits are aligned to the coefficient $c$ in the master accumulator according to $\Delta = 2^{-15}$.]

Figure 3. Co-operation between the fine-grain accumulator and the master accumulator

The final operation in the CUP is a radix-4 Booth recoding of the coefficients. The Booth algorithm reduces the chip area of a transversal filter by about 20% [LN97], and moving the Booth recoders from the coefficient inputs of the filter into the CUP decreases their area by more than 90% due to timesharing. To support the modified LMS algorithm [LN97] for faster adaptation, resume registers are integrated into the delay lines for the input symbol x and the enable signal en.

A significant part of the area of the CUP is occupied by the fine-grain accumulators. Their optimization is therefore a primary goal. The inputs of the fine-grain accumulators are the carry and sum output vectors d and e from the multipliers (Fig. 4). The accumulators have a clear stage to perform a global reset operation and a bit-clear stage that is used to clear those MSBs that were received by the master accumulator. The adders in the MSBs are modified according to [Nol91] to include the carry-overflow correction. An optimized pipeline scheme with two full additions or clear operations per period is applied as described in [Nol91, p. 123]. This results in a delay of two clock cycles in the accumulator loop, which can be exploited by interleaving two input data streams, because the accumulator has a "natural" timesharing factor of two. The registers in the x and en delay lines have to be doubled to compensate for this timesharing.

A closer look at the coefficient update equations (1) and (2), or (3) and (4), reveals that if we interleave the update of $c_{ii}$ and $c_{iq}$ onto the same CUP, the input x of the CUP changes only every second clock cycle. This reduces the switching factor, i.e. the probability of a charge event, in the x delay line. If a second clock signal, reduced by a factor of two, is generated, the registers that were introduced to compensate for the timesharing can be removed. This step saves 75% of the power dissipation in the part of the clocking network that drives the delay line latches.
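The co-operation between a fine-grain accumulator and the master accumulator can be sketched on the value level. This is a simplified model under stated assumptions, not the circuit of the paper: it uses plain integers instead of a carry-save representation, omits tap leakage, saturation and Booth recoding, and the function name and the `shift` parameter (the number of free-running low bits) are hypothetical.

```python
import random


def two_stage_accumulate(products, block_len=16, shift=6):
    """Accumulate integer products in a narrow fine-grain stage and a master stage.

    Every block_len products, the bits above `shift` are handed to the master
    accumulator and cleared in the fine-grain accumulator; the low bits keep running.
    Returns (master, fine, peak), where peak is the largest |fine| observed.
    """
    master = 0
    fine = 0
    peak = 0
    for k, p in enumerate(products, start=1):
        fine += p
        peak = max(peak, abs(fine))
        if k % block_len == 0:
            msbs = fine >> shift      # arithmetic shift: the transmitted MSBs
            master += msbs
            fine -= msbs << shift     # clear exactly the bits that were transmitted
    return master, fine, peak


if __name__ == "__main__":
    random.seed(3)
    prods = [random.randint(-128, 127) for _ in range(16 * 50)]   # 8-bit products
    master, fine, peak = two_stage_accumulate(prods)
    # Nothing is lost: master and residual together equal the one-stage sum.
    print((master << 6) + fine == sum(prods), peak)
```

Because only the transmitted bits are cleared, the sum of the shifted master value and the residual always equals the result of a one-stage accumulation, while the fine-grain accumulator only has to cover one block of products plus the residual.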

3 Suppression of the redundant "1" in the carry-save number representation

Carry-save arithmetic has an addition time that is independent of the wordlength, as the carry propagation is postponed. It is therefore well suited to implementing accumulators for high data rates. The digit value "1" is coded redundantly, as it can be represented by $(c_i, s_i) = (0, 1)$ and by $(c_i, s_i) = (1, 0)$ (Fig. 5a). This redundant coding causes unnecessary power consumption, as toggle events between these two representations have no arithmetic meaning. Fig. 5b) shows the circuitry that is needed to convert the redundant representation $(c_i, s_i)$ to a redundancy-free representation $(a_i, b_i)$, in which the state $(a_i, b_i) = (1, 0)$ is suppressed [SN95].

The property that the representation $(a_i, b_i) = (1, 0)$ does not occur can be used to reduce the number of transistors in the full adder in which $a_i$ and $b_i$ are added. Fig. 5c) displays the truth table of an inverting full adder and of an inverting full adder with the don't-care condition $(a_i, b_i) = (1, 0)$. Fig. 6 shows the transistor diagram of a symmetric inverting full adder (24 transistors), which can be reduced to a redundancy-reduced full adder with only 16 transistors. Fig. 7 shows a layout of the symmetric full adder (A = 400 µm²) and the redundancy-reduced full adder (A = 375 µm²) in a 0.6-µm CMOS technology. Both cells have the same width, and the redundancy-reduced full adder can replace the symmetric full adder wherever the don't-care condition holds. The two redundant transistors T1 and T2 in the series connection to node s were not removed, as the contacts needed to switch to another layer would have enlarged the width of the layout.

[Figure 4: schematic with the inputs d2, e2, d1, e1, d0, e0, rows of full adders (FA), a clear stage (signal cl), a bit-clear stage (signal clb) and the feedback signals a5, b5, ..., a0, b0.]

Figure 4. Conventional carry-save accumulator (internal wordlength m = 6 bit, input wordlength n = 3 bit) with clear and bit-clear stage

The redundancy-suppressing circuitry can be integrated easily into the clear stage of the accumulator without sacrificing the throughput rate. Fig. 8 shows how the NAND gates of the clear stage are changed to include the redundancy suppression function. Only four additional transistors are needed to expand one two-input NAND by one input and to include the OR functionality in the second NAND. This suppresses the unnecessary transitions between the states (0,1) and (1,0) in the bit-clear stage and at the a and b inputs of the first adder row. The floorplan of the accumulator with redundancy suppression is shown in Fig. 10B). As the first adder row adds the input vector d to the redundancy-reduced feedback signals a and b, all adders in the first row can be replaced by the redundancy-reduced full adder, resulting in the floorplan of Fig. 10C). The load on the feedback signals is lowered by 11 fF and 18 fF, respectively.
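The effect of the redundancy suppression can be sketched on whole carry-save words. The AND/OR mapping used here is an assumption that is consistent with the description of Fig. 8 (one NAND extended by an input, OR functionality added to the other); the function names are hypothetical and the clear signal is left out.

```python
def suppress_redundancy(c, s):
    """Map carry-save words (c, s) to (a, b) without the code (a_i, b_i) = (1, 0)."""
    a = c & s   # a_i is set only for the digit value 2
    b = c | s   # b_i is set for the digit values 1 and 2
    return a, b


def has_suppressed_state(a, b):
    """True if any bit position holds the suppressed code (a_i, b_i) = (1, 0)."""
    return (a & ~b) != 0


if __name__ == "__main__":
    c, s = 0b1010, 0b0110
    a, b = suppress_redundancy(c, s)
    # The represented value c + s is unchanged by the recoding.
    print(bin(a), bin(b), a + b == c + s, has_suppressed_state(a, b))
```

Both codes of the digit value "1" are mapped to $(a_i, b_i) = (0, 1)$, so a change between them at the input causes no transition at the output.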

a) Carry-save digit codes

    c_i  s_i   value
    0    0     0
    0    1     1
    1    0     1
    1    1     2

b) Redundancy-suppressing circuit with inputs c_i, s_i and outputs a_i, b_i (schematic not reproduced)

c) Truth table of the inverting full adder (X = don't care)

    a_i  b_i  d_i   c_{i+1}  c^r_{i+1}   s_i  s^r_i
    0    0    0     1        1           1    1
    0    0    1     1        1           0    0
    0    1    0     1        1           0    0
    0    1    1     0        0           1    1
    1    0    0     1        X           0    X
    1    0    1     0        X           1    X
    1    1    0     0        0           1    1
    1    1    1     0        0           0    0

Figure 5. a) Carry-save digit codes with redundant code; b) redundancy-suppressing circuitry to convert the redundant code alphabet to a nonredundant one; c) truth table for an inverting full adder (c, s) and with don't-care terms (c^r, s^r) if a redundancy-free carry-save alphabet is used
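The don't-care condition of Fig. 5c) can be checked with a short sketch. The simplified Boolean expressions below are one possible way to exploit the condition $(a_i, b_i) \neq (1, 0)$; they are an illustration and not the 16-transistor circuit of the paper, and they use non-inverted outputs for readability.

```python
from itertools import product


def full_adder(a, b, d):
    """Reference full adder, returns (carry, sum)."""
    total = a + b + d
    return total >> 1, total & 1


def reduced_full_adder(a, b, d):
    """Simplified logic, valid only when (a, b) != (1, 0).

    With a = 1 implying b = 1, the carry a*b + a*d + b*d collapses to a + b*d
    and a XOR b collapses to b AND NOT a.
    """
    carry = a | (b & d)
    s = (b & (1 - a)) ^ d
    return carry, s


# All input combinations that can occur with the redundancy-free alphabet.
ALLOWED = [v for v in product((0, 1), repeat=3) if (v[0], v[1]) != (1, 0)]

if __name__ == "__main__":
    print(all(full_adder(*v) == reduced_full_adder(*v) for v in ALLOWED))
```

The two functions agree on all six allowed input combinations and differ only for the suppressed code, which is exactly the freedom expressed by the don't-care entries.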

4 Optimized sign extension logic

Sign extension puts a large capacitive load onto the sign bits of the input signals. To minimize this load, signal d is routed to input d of the full adder (Fig. 6), which has the smallest load (33 fF) of the three inputs. Signal e is added in the second stage of the accumulator; it only has to drive the inputs of the latches (8.5 fF) that synchronize signal e. The wiring of the extension bits has a load of approximately 8.8 fF per bitslice. The input load at the sign bits of an accumulator with m = 14 bits internal wordlength and n = 8 bits input wordlength is composed of 7 times the input load of a full adder, 7 latches and the wiring, in total 400 fF. Even if the load is distributed evenly to the sign bits $d_7$ and $e_7$ by interleaving them onto the full adder inputs and the latch inputs, both sign bits are loaded with 200 fF, and the load increases by 30 fF for each additional bit of internal wordlength.

The basic idea of the optimized sign extension logic is to merge the sign extension bits into one two's-complement number g, so that the feedback signals a and b can be bypassed to the second adder row, where they are added to g. A problem arises in bitslice $n-1$, where only two free inputs are available at the second-stage adder, as it receives a carry from slice $n-2$. This problem can be solved by a different interpretation of the signal $a_{n-1}$. Due to the redundancy-free code we know that $b_{n-1}$ is always set if $a_{n-1}$ is set. Therefore $a_{n-1}$ can be interpreted as having valency $2^n$ if we force $b_{n-1}$ to zero when $a_{n-1}$ is set. Table 1 shows the dependency of g on the bits $a_{n-1}$, $d_{n-1}$ and $e_{n-1}$. The following equations describe the generation of g and the zero-forcing operation for $b^{s}_{n-1}$.

\[ g_{n+1} = \bar{a}_{n-1}\,(d_{n-1} \lor e_{n-1}) \tag{5} \]
\[ g_{n} = a_{n-1} \oplus (d_{n-1} \lor e_{n-1}) \tag{6} \]
\[ g_{n-1} = d_{n-1} \oplus e_{n-1} \tag{7} \]

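[Figure 6: transistor schematics with the inputs a, b, d, the outputs c and s of the symmetric full adder, the outputs c^r and s^r of the redundancy-reduced full adder, and the redundant transistors T1 and T2.]

Figure 6. Transistor schematic of symmetric full adder and redundancy-reduced full adder for the don't-care condition (a, b) = (1, 0)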

    weight:  2^n       2^(n-1)   2^(n-1)    2^(m-1)   ...   2^(n+1)   2^n    2^(n-1)
             a_{n-1}   d_{n-1}   e_{n-1}    g_{m-1}   ...   g_{n+1}   g_n    g_{n-1}
             0         0         0          0         ...   0         0      0
             0         0         1          1         ...   1         1      1
             0         1         0          1         ...   1         1      1
             0         1         1          1         ...   1         1      0
             1         0         0          0         ...   0         1      0
             1         0         1          0         ...   0         0      1
             1         1         0          0         ...   0         0      1
             1         1         1          0         ...   0         0      0

Table 1. Merging of sign extension bits $d_{n-1}$, $e_{n-1}$ and feedback bit $a_{n-1}$ into one two's-complement number

\[ g_{i} = g_{n+1}; \quad i = n+2, \ldots, m-1 \tag{8} \]
\[ b^{s}_{n-1} = b_{n-1}\,\bar{a}_{n-1} \tag{9} \]

Figure 9 shows a CMOS implementation that generates the complementary signals, as inverting latches are used. The delay of the optimized sign extension is as small as the delay of the full adder. The signal $g_{n+1}$ is buffered by an inverter with $w_n / w_p$ = 5.6 µm / 11.2 µm to reduce the speed degradation if the internal wordlength is expanded. The load on the sign bits $d_7$ and $e_7$ is reduced by nearly 70%, from 200 fF to 63 fF.
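The merging of the sign extension bits can be verified exhaustively with a short sketch. The Boolean expressions are read off Table 1 and are an assumption about the logic function only, not about the gate-level form used in Fig. 9; the function names are hypothetical, and all values are expressed in units of $2^{n-1}$.

```python
from itertools import product


def merge_sign_extension(a, d, e):
    """Return (g_hi, g_n, g_lo) for the bits a_{n-1}, d_{n-1}, e_{n-1}.

    g_hi stands for g_{n+1} and all higher extension bits, which are equal.
    """
    g_hi = (1 - a) & (d | e)
    g_n = a ^ (d | e)
    g_lo = d ^ e
    return g_hi, g_n, g_lo


def value_of_g(g_hi, g_n, g_lo):
    """Two's-complement value of g in units of 2**(n-1); g_hi is the replicated sign."""
    return -4 * g_hi + 2 * g_n + g_lo


def zero_forced_b(a, b):
    """b_{n-1} forced to zero whenever a_{n-1} is set."""
    return b & (1 - a)


if __name__ == "__main__":
    for a, d, e in product((0, 1), repeat=3):
        # a has valency 2**n; d and e are sign bits with weight -2**(n-1).
        print(a, d, e, merge_sign_extension(a, d, e), 2 * a - d - e)
```

For every input combination the value of g equals $2a_{n-1} - d_{n-1} - e_{n-1}$, and for the allowed feedback codes the reinterpreted pair $2a_{n-1} + b^{s}_{n-1}$ equals $a_{n-1} + b_{n-1}$, so the accumulated value is preserved.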


    Input   C_in / fF (symmetric FA)   C_in / fF (redundancy-reduced FA)
    a       45                         34
    b       45                         27
    d       33                         35
    Σ       123                        96

Figure 7. Layouts and input capacitances (in basic accumulator cell) of symmetric full adder (24 transistors) and redundancy-reduced full adder for redundancy-suppressed input signals (18 transistors)

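[Figure 8: gate schematics. a) clear stage with inputs s_i, c_i, cl and outputs a_i, b_i; b) the same stage with integrated redundancy suppression.]

Figure 8. Integration of the redundancy suppression functionality b) into the NAND gates of the clear stage a)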

5 Comparison of the fine-grain accumulator for the different optimization steps

Fig. 10 shows the floorplans of the four fine-grain accumulators. Table 2 summarizes the power consumption and maximum clock frequency of the accumulators. They have an input wordlength of 8 bits and an internal wordlength of 14 bits, and use an area of 0.073 mm² in a 0.6-µm CMOS technology.

Accu A) is a conventional carry-save accumulator as shown in Fig. 4. Accu B) is obtained by integrating the redundancy suppression logic into the clear stage of the accumulator. This reduces the switching factor at the output of the clear stage by 28% for random input data and reduces the power consumption by 5.5%. Accu C) arises from B) by replacing the full adders in the first row of the accumulator with the redundancy-reduced full adder. The power dissipation decreases by a further 2.7% and the maximum clock frequency increases by nearly 15% due to the reduced load on the feedback lines. Accu D) utilizes the optimized sign extension logic and reduces the power dissipation by more than 19% compared to A). The maximum clock frequency is 18% higher than that of A).

[Figure 9: gate schematic with the inputs $d_{n-1}$, $e_{n-1}$, $a_{n-1}$, $b_{n-1}$ and the generated sign extension signals g and $b^{s}_{n-1}$.]

    Input   C_in / fF
    d7      61
    e7      63
    a7      35
    b7      19

Figure 9. CMOS implementation of the sign extension equations and input loads for a 14-bit accumulator with 8-bit input wordlength

    Accu   # trans.   P_in / mW   P_phi / mW   P_vdd / mW   Σ / mW   P/P_A / %   f_max / MHz
    A      1924       0.36        5.03         16.5         21.9     100.0       194
    B      1980       0.36        5.00         15.3         20.7     94.5        194
    C      1896       0.37        4.97         14.7         20.1     91.8        222
    D      1788       0.29        4.88         12.5         17.7     80.8        229

Table 2. Comparison of the transistor count and power consumption of the fine-grain accumulators (f_clock = 200 MHz, V_DD = 3.3 V, typical parameters) and maximum clock frequency
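The percentages quoted in this section follow directly from the totals in Table 2, as the following small sketch shows. The dictionary layout and the function name are hypothetical; the numbers are copied from the table.

```python
TABLE_2 = {
    # accu: (total power in mW at 200 MHz, maximum clock frequency in MHz)
    "A": (21.9, 194),
    "B": (20.7, 194),
    "C": (20.1, 222),
    "D": (17.7, 229),
}


def relative_to_a(table):
    """Return {accu: (power in % of accu A, f_max gain over accu A in %)}."""
    p_a, f_a = table["A"]
    return {
        name: (100.0 * p / p_a, 100.0 * (f / f_a - 1.0))
        for name, (p, f) in table.items()
    }


if __name__ == "__main__":
    for name, (power, gain) in relative_to_a(TABLE_2).items():
        print(f"{name}: power {power:.1f} % of A, f_max gain {gain:.1f} %")
```

The relative power values reproduce the P/P_A column, and the frequency gains of C) and D) correspond to the "nearly 15%" and "18%" quoted in the text.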

6 Conclusion

A Coefficient Update Processor for adaptive equalizers using the LMS algorithm was presented. A two-stage approach in the update recursions resulted in a significant reduction of the wordlength and power dissipation of the accumulators. Timesharing reduced the area of the less frequently used components, such as the vector merging adders, by more than 90%. The suppression of one redundant digit representation in the carry-save digit alphabet, a redundancy-reduced full adder and an optimized sign extension logic reduced the power consumption of the fine-grain accumulators by nearly 20%, while the maximum clock frequency rose by nearly the same amount. The input load on the sign bits was reduced by 70%. The overall efficiency, defined as throughput divided by area and by power dissipation per clock cycle, increases by 40%.

[Figure 10: four floorplans A, B, C and D. Legend: full adder; redundancy-reduced full adder; clear; bit-clear; clear with redundancy suppression; full adder modified for carry-save overflow correction; sign extension $g_{n-1}$; sign extension $g_n$; sign extension $g_{n+1}$; sign extension $g_i$, $i = n+2, \ldots, m-1$.]

Figure 10. Floorplans for different optimization steps: A) conventional carry-save accumulator, B) plus redundancy reduction logic, C) plus redundancy-reduced FA, D) plus optimized sign extension

References

[DSS94] Erik De Man, M. Schulz, and R. Schmidmaier. A 1.0-µm CMOS 60-MBaud single-chip QAM processor for digital radio applications. In Proceedings of the IEEE 1994 Custom Integrated Circuits Conference, page S4.3, May 1994.

[LM94] E. A. Lee and D. G. Messerschmitt. Digital Communication. Kluwer Academic Publishers, 1994.

[LN97] Ch. Lütkemeyer and T. Noll. A transversal equalizer with an increased adaptation speed and tracking capability. In Proceedings of the IEEE 1997 Custom Integrated Circuits Conference, May 1997. To be published.

[Nol91] T. G. Noll. Carry-save architectures for high-speed digital signal processing. In Journal of VLSI Signal Processing, pages 121–140. Kluwer Academic Publishers, 1991.

[Pro89] J. G. Proakis. Digital Communications. McGraw-Hill Book Company, 1989.

[SN95] A. Schlegel and T. Noll. Entwurfsmethoden zur Verringerung der Schaltaktivität bei verlustleistungsoptimierten digitalen CMOS-Schaltungen. In DSP-Deutschland, Munich, Germany, pages 61–74, September 1995.
