detection in communication receivers: a case for ... - Semantic Scholar

0

D ETECTION

IN

C OMMUNICATION R ECEIVERS : A

C ASE F OR O N - LINE A RITHMETIC

Sridhar Rajagopal and Joseph R. Cavallaro Department of Electrical and Computer Engineering MS-366 Rice University Houston TX 77005. USA sridhar,cavallar @rice.edu

Telephone: 1-713-348-2256, 1-713-348-4719 Fax: 1-713-348-6196

This work was presented in part at the IEEE International Symposium on Computer Arithmetic, Vail, CO, 2001 and submitted in part for the IEEE International Conference on Application-specific Systems, Architectures and Processors, San Jose, CA, 2002. This work was supported by Nokia, Texas Instruments, the Texas Advanced Technology Program under grant 1999003604-080, and by NSF under grant ANI-9979465. March 25, 2002

Submission for IEEE Trans. on Computers

1

Abstract This paper presents low power architectures using conventional and on-line arithmetic to accelerate detection in mobile communication receivers. We identify the potential application of on-line arithmetic for simple and advanced detection algorithms in mobile communication systems. We develop innovations in on-line arithmetic modules for better integration with conventional arithmetic systems. We explore a suite of tradeoffs in the usage of on-line arithmetic on the throughput, area and latency and its effects on bit error rate performance. We design novel detector circuits using radix-4 on-line arithmetic for detection algorithms. Our detector can directly interface with conventional arithmetic circuits if necessary and hence, be used as co-processor support to a DSP. A timeconstrained conventional arithmetic detector using carry-save adders is first presented that achieves 1.62 1.86 speedup in throughput and a corresponding 1.31 1.08 reduction in area for

bit precisions against a direct

implementation. The on-line arithmetic detector, however, does much better, providing 3.94 5.94 in throughput and area reduction from 1

14.55

speedup

respectively against the same implementation. Hence, by

eliminating extraneous computations, decreasing area requirements and simultaneously increasing throughput, online arithmetic can provide significant power savings and is well suited for detection in mobile wireless receivers. We also show the application of on-line arithmetic for multiuser detection, soft-decisions and higher modulation schemes in advanced communication receivers Keywords On-line arithmetic, on-line addition, on-line multiplication, detection, W-CDMA, low power, VLSI implementation, mobile communication systems.

I. I NTRODUCTION Advanced communication systems, such as those based on W-CDMA (Wideband Code Division Multiple Access) [1], EDGE (Enhanced Data rates for Global Evolution) [2], 802.11a OFDM (Orthogonal Frequency Division Multiplexing) based Wireless LAN [3] are being designed to increase data rates by orders-of-magnitude and enhance performance significantly in order to support real-time multimedia services in the near future. However, the increase in complexity required by the algorithms to support these high data rates and multimedia capabilities has made it difficult to achieve real time hardware implementations. Detection is one of the core baseband processing operations in a mobile communication receiver and is used to find the transmitted bits from the source after it has been corrupted by noise and other interference present in the channel. The data rates that can be achieved in a communication system are dependent on the speed of detection in hardware. Hence, a mobile receiver not only has to be area

March 25, 2002


2

and power efficient to meet the mobile device constraints, but also has to accelerate detection to meet real-time performance requirements. On-line arithmetic [4], [5], [6] has been shown to be very useful for many signal processing applications such as DCT, FFT, CORDIC, filtering and matrix based operations [7], [8]. In conventional arithmetic, all digits of the result must be computed before the next operation can be started. However, in on-line arithmetic, the operands, as well as the results, flow through the computations in a digit-by-digit manner, starting from the most significant digit (MSDF). The advantages of on-line arithmetic have been to eliminate carry-propagate addition, reduce interconnection bandwidth between modules and allow parallelism between several operations. With a serial data flow, on-line arithmetic can be pipelined to implement sophisticated algorithms. As carry-propagation is eliminated, on-line operations can be overlapped. The digit-serial nature of on-line arithmetic can be extremely beneficial for small area and low power architectures. Though on-line arithmetic has been shown to provide a speed-up of 2 16

[9] for con-

ventional numerical operations, this gain is reduced when conventional arithmetic systems are deeply pipelined. This is because the word-parallel logarithmic-time computations attain better throughput than the digit-serial on-line computations (though at the expense of larger area and higher latency). The other implementation tradeoffs related to the applicability of on-line arithmetic are its need for a non-conventional number system, conversions to-and-from a conventional system [10] and the inherently serial nature of the operations. We overcome all these above disadvantages to promote a strong case for the use of on-line arithmetic for detection in communication systems. The main contributions of our work are as follows. We identify the potential applications of on-line arithmetic for detection in communication systems. We develop innovations in on-line arithmetic modules for better integration with conventional arithmetic systems. We explore a suite of tradeoffs in the usage of on-line arithmetic on throughput, area and latency and its effects on the bit error rate performance. We first present the optimal, variable throughput on-line arithmetic detector that obtains the same bit error rate as the detector using conventional arithmetic. Then, we present a truncated, fixed throughput, on-line arithmetic detector to attain higher throughput and we quantify its performance on the bit error rates of the communication receiver. Then, we present a time-constrained conventional arithmetic architecture based on Carry-Save

March 25, 2002


3

Adders (CSAs) for comparison purposes. We also develop novel radix-4 on-line multipliers and on-line adders that allow our circuit to directly interface with a conventional arithmetic circuit, if necessary for example, as a co-processor support to a DSP. In all the architectures discussed, we present time-constrained versions in order to attain the maximum data rates and then, compare the area requirements for these architectures to determine their utility in a low power architecture design. After initially applying on-line arithmetic to simple detection problems and demonstrating its effectiveness, we show that on-line arithmetic is also effective for more sophisticated detection problems. In many emerging standards and technologies, the detection problem becomes more sophisticated. For example, at the base-station, there are many users interfering with each other and the detection problem for a single user grows into a multiuser detection problem [11], [12]. Even in the handset, detection for the user needs to account for interference from other users [13]. Decoding algorithms [14], [15] sometimes use soft decisions from the detector to provide better decoding of the information. Emerging standards [1], [2], [3], [16] also support higher data rates which use higher modulation schemes. These detection schemes need to detect multiple bits per symbol requiring more boundaries for detection. We outline the application of on-line arithmetic to multiuser detection, soft decisions and to these higher modulation schemes. We have organized this paper as follows. The next section presents the advantages of using on-line arithmetic for detection and the bit error rate trade-offs. Section 3 presents radix-4 online adder and multiplier design. Section 4 shows the architecture design of the detector based on conventional and on-line arithmetic. In Section 5, the area-time tradeoffs of the different detector architectures and results are presented. Section 6 shows the application of on-line arithmetic to obtain soft decisions and detection for higher modulation schemes. II. D ETECTION

IN MOBILE COMMUNICATION RECEIVERS

Figure 1 shows the main blocks in a mobile communication receiver. The physical baseband layer in a mobile communication receiver involves operations to detect and decode the transmitted information bits. Sophisticated algorithms for channel estimation, detection and decoding are applied on the receiver to determine the transmitted bits. Based on these high-precision operations, a hard decision (a sign-based test) is made on the transmitted bits for detection, which are typically March 25, 2002

’s, assuming a Binary Phase Shift Keying (BPSK) system for simplicity. De-


4

Hard decision ^ b or soft decision y

Antenna

Detection RF Unit

A/D

Decoded information bits

Decoding

Demux Channel Estimation

Fig. 1.

Block diagram of a mobile communication receiver. The detector produces either single-bit precision

outputs (hard decisions) of the detected bits or higher precision outputs (soft decisions). The final decision on the transmitted information is made at the decoder.

tection for Quadrature Phase Shift Keying (QPSK) [17] used in current W-CDMA standards can be easily translated to two BPSK detections as the transmission in both the dimensions are orthogonal. A. Code matched filter and multiuser detection algorithms A brief mention of the detection schemes used for the base-station in a CDMA-based wireless communication receivers is shown to signify the type of operations involved. For details, the reader is referred to [11], [12], [18]. Let be the received signal and true cross-correlation matrix.

is the number of users in the system. Let

be the

users to be detected. Then, the system can be formulated as given below:

where

is the length of the spreading code (also known as the spreading

gain or the spreading factor) and be the bits of the

!#"

(1)

is the Additive White Gaussian Noise (AWGN) in the system. The single user detector

or the matched filter detector with hard decision outputs is shown to be: $

where and

%'&)(+*-, /. 0213"

(2)

refer to the estimate of the detected bit and the estimate of the channel respec-

tively (as opposed to the true value). Figure 2 shows the architecture of a single user matched filter detector which exhibits a inner (dot) product structure for computations followed by signbased testing. The subscript, 4 , refers to the 4576 user in the system. The received signal coming March 25, 2002


5

Code Matched Filter Detector

Â Hp,2

Â Hp,1 ri,1 *

Â Hp,N-1

ri,2

ri,N -1

* +

*

Â Hp,N ri,N

*

+ +

+

+

Â Hp ri

^Hp ri ) ^bi, p = sign(A Fig. 2. A code matched filter detector for the user in the system. The operations involve testing a dot product for its sign.

from the A/D converter has typically 8-bit or greater precision [19]. Depending on the algorithm implementation and the desired bit error rate performance, the precision requirements of the hardware processing the received signal is typically in the 8-16 bit precision range for finite precision implementations such as those in [18], [20], [21], with more sophisticated algorithms requiring greater precision for operations such as division and matrix inversions. This implies extraneous computations in a conventional number system as the sign is obtained only at the end due to the Least Significant Digit First (LSDF) nature of computations. On-line arithmetic, based on a signed digit number representation, provides Most Significant Digit First (MSDF) computation. Hence, the computations can stop after the first non-zero MSD (sign) is computed and additional computations for the successive digits are avoided. The need for back-conversion to a conventional number system is also not required as the sign of the digit represents the detected bit. The code matched filter is also useful for the mobile handset algorithms and can be used as an initial estimate for more advanced handset schemes [13], [22], [23], [24]. At the base-station, the multiuser detector [11], [12] helps in canceling the interference of other users in the system and

March 25, 2002


6

Received Signal r i+3

Channel Estimates ^ A^0, A 1

Matched Filter T yi+2 b^ i+2 R= L

C

^

-LT ^bi+1

-Cb i

L

-Lb^ i-1

Delay Delay

yi

Adder

Stage 1

Sign Detection

yi

^b i

T

R= L

C

L

Stage 2 T yi-2 ^ bi-2 R= L

C

L

Stage 3 yi-4

^

bi-4

Detected bits

Fig. 3. Time-constrained pipelined detector architecture. The figure shows the matched filter which provides the initial estimates for parallel interference cancellation. A 3-stage parallel interference cancellation detector is shown. All stages are identical. The first stage of the detector is expanded in detail.

is one of the advanced algorithms proposed for detection in 3G systems for better performance. One of the popular algorithms for multiuser detection which has good error rate performance and suitable for implementation is Parallel Interference Cancellation (PIC) [25], [26], [27], [18]. The multistage multiuser detector uses the soft decisions of the matched filter to get an initial estimate of the bits and then subtracts the interference caused by other users. The multistage detector performs parallel interference cancellation iteratively in stages.

March 25, 2002


7

%'&)(+*-,

(3)

1 "

(4)

Equation (3) may be thought of as subtracting the interference from the past bit of users, with more delay, and the future bits of the users, with less delay than the desired user. The left matrix , stands for the partial correlation between the past bits of the interfering users and the desired user, the right matrix , stands for the partial correlation between the future bits of

the interfering users and the desired user. The center matrix

, is the correlation of the

current bits of interfering users and the diagonal elements are zeros since only the interference from other users, represented by the non-diagonal elements, is canceled. The lower index, & , represents time, while the upper index, , represents the iterations. We assume and

that are

derived from the channel estimates to be pre-computed as part of channel estimation procedure [18]. Figure 3 shows the architecture of the multistage multiuser detector [27] based on equa-

, and the hard decisions, , from the asynchronous matched filter. The partial correlation matrices (channel estimates) are

tions (3),(4). The detector takes in the soft decisions,

also loaded. The detector subtracts the interference iteratively in stages from the matched filter output. Three stages of PIC are shown to be sufficient for convergence of multistage detection [26]. The detected bits are then sent to the decoder for retrieving the transmitted information. Each stage of the multiuser detector uses only adders because multiplication by single bits can be reduced to addition and subtraction. B. On-line arithmetic background On-line arithmetic algorithms [4], [7] work in a digit-serial manner, producing the result in a MSDF fashion. To generate the first output digit,

digits of the input are required. Thereafter,

with each digit of the input, a new digit of the result can be obtained. The on-line delay, , is typically a small integer, e.g 1 to 4. Since the outputs are produced serially, the algorithms can be pipelined with a latency of as shown below: Input Input $ Output ( March 25, 2002

)

$

!"#

$%&$%$'!"$%# +*

(

.... ....

(,-(,"(.!/(,#

....


8

In order to achieve MSDF operations, on-line algorithms need to use a redundant number system [28] for carry-free addition. The on-line representation of a number is given by [4]

represents the value of the addends at step , is the radix of the redundant number syswhere tem and is the on-line delay. The digits belong to a redundant digit set, - ,.....,-1,0,1,....., (assumed symmetric) where , represents the amount of redundancy in the num for maximum redundancy as this will ber system. For our system, we shall assume

allow the inputs to the system to be directly acceptable in redundant form for on-line operations. C. Using on-line arithmetic for detection Figure 4 shows a simple example to motivate the use of on-line arithmetic for detection. Figure 4(a) shows a simple sign based detector for the received signal, which has been transmitted in an AWGN communication channel. The probability distribution of the received signal [17] is as shown in Figure 4(b-1). The sign of the received signal determines the bit that has been transmitted. Figures 4(b-2) and 4(b-3) represent the time taken by conventional and on-line arithmetic implementations of the detector to find the sign, assuming a 8-bit precision and radix4 on-line arithmetic modules. The $ -axis is normalized to the time taken to find the first digit using on-line arithmetic. Conventional arithmetic, being logarithmic time with precision, will

have a lower throughput than the on-line throughput per digit (roughly shown as " in the figure

for illustrative purposes). The figure shows that the on-line arithmetic scheme can take variable time to detect the sign while the conventional arithmetic detector finds the sign in constant time (assuming constant precision for both schemes). Hence, we will have performance gains as long as the expected time to detect the sign (the first non-zero MSD) is less than that of conventional arithmetic for the same precision. From Figure 4(b), it is clear that the larger the radix used for the on-line detection scheme, the faster it is to determine the sign. Hence, a higher radix system is preferable for our design. However, higher radix systems have overhead in terms of area and delay. We choose a radix-4 system as a trade-off point as it covers most of the area under the probability density curve for the first digit ( " ) and it reduces the on-line delay, , to March 25, 2002

for


9

(a)

-1,+1

Sign based detector

0

Input bits

-1,+1

Received signal

AWGN Channel

Detected bits

(b) 5

(1) Probability distribution of the received signal

4.5

Time taken for addition (Normalized)

4 3.5 3 2.5 2 1.5 1

0

(2) Conventional arithmetic

(3) On-line arithmetic (MSD )

0.5

-1

-0.5

0

0.5

1

Received Signal Amplitude (Normalized) Fig. 4. Use of on-line arithmetic for detection. Part (a) shows a noisy communication channel through which a

or

is transmitted. Part (b) shows the time taken by conventional and on-line arithmetic to detect the sign of

the received signal. To have performance benefits, the sign should be detected before the on-line arithmetic time exceeds the conventional arithmetic time. A 8-bit precision and a radix-4 on-line module is assumed.

multiplication and addition [4], [29]. On-line arithmetic, being digit serial, has a lower throughput if all the digits are required in the computation. Since the probability of obtaining the sign using the first digit is high, we truncate the on-line detector to stop the computations immediately after the first digit has been received. However, the throughput speedup due to the truncation of the first digit is at the expense of increase in the probability of error. This is as shown in Figure 5. The probability of

March 25, 2002


10

Probability of error for BPSK modulation using on−line arithmetic

0

10

−2

10

−4

10

−6

e

Probability of error (P )

10

−8

10

−10

10

−12

10

only first digit detector (MSD) second digit detector (2 MSDs) theoretical sign detector

−14

10

−16

10

−18

10

−20

10

0

2

4

6

8 E /N (in dB) b

10

12

14

16

0

Fig. 5. Probability of error for truncated detection. The figure shows the increase in probability of error assuming only the first digit or first 2 digits are used for detection in a BPSK system.

error

for an optimal detector for BPSK [17] is given by

where

, 1

is the Q-function [17],

noise energy and

(5)

is the energy of the transmitted bit,

is the variance of the noise. However, if we detect only the

the probability of error "

is the

MSD’s, then

for the on-line detector for BPSK can be shown to be

! "# ! $%

(6)

where is the radix used in the on-line implementation. We can see from equation (6) that with increasing the radix or the number of digits to be detected , we get increasingly closer to the

optimal detector in equation (5). Figure 5 shows the probability of error for a radix-4 number

system for the first and second most significant digits ( '& and it with the optimal detector. We can see around a March 25, 2002

)(+*

) and compares

drop in performance assuming only Submission for IEEE Trans. on Computers

11

the MSD is used for detection. This may still be acceptable for higher throughput as there could be a decoder following this detector to compensate for the bit error rate degradation. However, an additional digit makes the performance of the on-line detector close to the optimal detector. Hence, we will consider both a

as well as a MSD truncated on-line detector in this

paper for comparison purposes instead of a variable throughput, full precision, on-line detector architecture as presented in our recent work[30]. Figure 5 assumes a simple transmission system without any CDMA spreading code applied on top of it (

).

The error rate curve for the CDMA matched filter detector of equation

(2) is similar except that the Signal to Noise Ratio (SNR) gain

(giving

, 1(

*

is scaled by the spreading

better performance for the same SNR). This is assuming

perfect knowledge of the channel estimates of the user. With proper channel estimation for fading channels, a closed form expression for the probability of error is no longer available, but performance similar to Figure 5 can be expected. Figure 6 shows the suitability of on-line arithmetic for low precision decisions on higher precision data. Figure 6(a) shows the timing schedule of a conventional arithmetic full precision matched filter detector shown in Figure 2. An implementation based on conventional arithmetic has throughput as a logarithmic function of the precision as both fast adders and multipliers can be built for logarithmic time computation i.e.

(

( , (+1 [31], [32], [33], where

is the bit-precision. Figure 6(b) shows the application of on-line arithmetic for a full pre-

cision matched filter implementation. As an example, a radix-4 system and 8-bit precision is assumed, giving 4 digits per input. Though the full precision implementation has a low latency, the throughput of the system for repeated operations is a linear function of the precision i.e.

(

, where

is the on-line throughput time per digit. Note that one digit per

computation is wasted as it signifies the end of the current computation precision (shown by RESET( ) in Figure 6, which clears all latches at the end of the computation). Also, additional

% have to be inserted for full precision multiplication as the multiplier generates twice the

output precision. However, as shown in Figure 6(c), the determination of the sign (the first nonzero MSD) can take variable time. This figure refers to the optimal (theoretical sign) detector of Figure 6, which waits until the first non-zero digit is received. As soon as the sign is detected, we can insert the RESET signal(R) for all the adder stages in the pipeline. We will then have

March 25, 2002


12

0 0

ai * b i

ai * b i

0 0

0 R

R

Tree addition Level 1


0 R

R R

Tree addition Result

R


(b) On-line arithmetic with full precision

(a) Conventional arithmetic with full precision R

R

R *

R *

0 0



R *

*

R *

*

*

R

ai * b i

Idle

R *

*

R

*

*

R

t OL -MF a d* t OL

t CONV -MFa log(d)

ai * b i

R

R

R


R

R


*

t OL -MF a d* t OL

R R

R

t OL -MF = constant = 3* t OL

Sign determined at this point. Stop!

Sign determined at this point. Stop!

(d) On-line arithmetic with truncated detection(2 MSDs)

(c) On-line arithmetic for detection (variable throughput)

Fig. 6. Timing comparisons for different arithmetic techniques. Part (a) shows the the time taken by a conventional arithmetic circuit for the dot product with full precision. Part (b) shows the time taken by a on-line arithmetic module for calculating a dot product with full precision. Part (c) shows the time taken by a on-line arithmetic detector to find the sign of the dot product. Part (d) shows the time taken by a truncated-precision detector to find the sign.

idle stages in the pipeline (shown by ) during which the hardware could stop the clock for low power. Since the time for obtaining the MSD is variable, the time taken is almost the same as part(b), i.e.

(

and the area is also the same. The only advantage is due to the

power savings during idle computations. If only the first few MSDs in the output are desired or as in the case of truncated detection, the sign of the computation is obtained within the first two

digits (The MSD detector in Figure 6), a higher throughput can be obtained along with savings in area. In this case,

=

III. R ADIX -4

* %

*

=

.

ON - LINE ADDER AND MULTIPLIER MODULES

This section describes the design of a radix-4 on-line adder and multiplier used in the detectors. As opposed to previous radix-4 on-line multiplier implementations [8], [34] that assume one of the inputs as constant, we design a generalized multiplier assuming both inputs as vari-

March 25, 2002


13

able. The multiplier is also enhanced to accept inputs in a conventional number system so that the detector can directly interface with a conventional arithmetic circuit, if needed. The design of these modules were verified by simulation using the Xilinx Foundation Software for a schematic implementation on a FPGA. The area and throughput of the radix-4 adder and multiplier are determined in order to understand the area-throughput tradeoffs in the detectors. A. Radix-4 on-line adder The steps involved in fixed point on-line addition [4], [35] of 2 inputs

and $ , giving an

output ( are: Initialization:

(

(

Recurrence for j = 0,1,..... ,m+1 do

(

where

is the residual and

%

,

(

1

,

$

1

(7)

is the number of digits in the inputs. The selection is done by

rounding by using a discretization algorithm (DIS) [35] that performs the rounding based on the first few MSDs. The selection function is given by: %

,

1

%'& (*-,

%'& (*-,

1 1

"

Since the on-line adder interfaces with the output of an on-line multiplier and as an input to other on-line adders, we design the adder to be a general on-line adder, taking in redundant operands and producing a redundant result. The design of the on-line adder is shown in Figure 7. The inputs are shown as

and !

"

, with higher indices representing the more significant

bits. The inputs are taken to be in two’s complement radix-4 maximally redundant form (which is also the form in which the on-line multiplier produces its output). The residual is stored in Carry-Save (CS) form to avoid carry propagation delay (shown as

$#

and

&%

for residual sum

and carry). The design includes 3 half adders (HA)’s, 1 Full Adder (FA), 7 latches to store the residual and the output and a 3-bit Carry Lookahead adder (CLA) for the digit selection. This March 25, 2002


14

X3 Y3

X2 Y2

FA C

X1 Y1

HA S

HA

C

S

C

S

HA C

S

WS1

D

D

D

D

WC2 WS3 WS2

B1 B2

B3

A1

A2

A3

3-bit CPA

C out

C in

S1

S2

S3

LATCH

Z3

Z2

Z1

Fig. 7. A radix-4 on-line adder.

circuit has a delay of 8 gates and an area of 71 gates.

(8) (9)

For comparison purposes, we assume the following: Full Adder: Area = 5 gates, Delay = 2 gates Half Adder: Area = 2 gates, Delay = 1 gate 3-bit CLA: Area = 21 gates, Delay = 5 gates Latch: Area = 5 gates March 25, 2002


15

We assume the time delay for a ( -bit CLA adder with a fan-in of & as , ( 1

&

!

(10)

(

(11)

,+ ( 1

as in [31], [32] and the area in gates can be calculated as

,+ ( 1

and

where ( is a constant between

& "(

&

(

depending on

( ,( &

1 . In the rest of the paper,

wherever the time and area are dependent on the precision ( , we will emphasize it by using it as an argument in the equations. This is to clearly distinguish the conventional implementations from on-line implementations whose area or time requirements may be independent of the bitprecision. B. Radix-4 on-line multiplier The steps involved in on-line multiplication [4], [35] are: Initialization:

(

!

Recurrence for j = 1,..... ,m do

! (

!

%

,

$

$

!

(12)

1

(

where the selection function is the same as in on-line addition. The signed digit by vector multiplication in on-line multiplication [35] needs a multiplier having one of its inputs as a signed digit and another as a vector whose length keeps increasing with the number of digits (see equation (12)). This signed-digit by vector multiplication is difficult to implement as the multiplication time becomes dependent on the digit-precision, and the advantage of precision-free on-line multiplication time is lost. However, in the radix-2 case, the signed digit multiplication reduces to multiplication by

and hence, is easy to implement

with lower area requirements. As we saw from Figure 4, higher radix numbers are preferable in order to detect the sign as soon as possible, making us prefer a radix-4 number system as a March 25, 2002


16

better tradeoff for throughput. Radix-2 multipliers [36], [37] can still be used if area is a more important constraint than throughput for the detector design. Radix-4 on-line multipliers have also been designed for filter design problems and Discrete Cosine Transforms (DCT) [8], [34], where one of the inputs to the multiplier is considered constant. This is a reasonable assumption for filters as the coefficients were not time-varying and the SDVM multiplication is again avoided. However, since both the inputs in the detector are variable, we need to implement a general Signed Digit by Vector Multiplier (SDVM). The SDVM needed is a two’s complement multiplier of size (

. In order to make the multiplication

time precision-independent, we produce the outputs of the SDVM in CS form. The multiplier

outputs in CS form are then added together with the residual using a

compressor (parallel

counter) [31], [33]. We then use the MSDs in the residual for the selection function as in [8], [34] to attain on-line time independent of the precision.

Figure 8 shows the implementation of a ( -bit radix-4 multiplier in two’s complement rep-

resentation. Since the on-line multiplier may interface to a conventional arithmetic circuit (See Figures 1 and 2), we design the multiplier to take in inputs in conventional arithmetic two’s complement representation. The inputs converter or the input,

and

shown could represent the input, , from the A/D

$

, from the channel estimator with a parallel to serial (P/S) converter.

We internally add the appropriate sign digit to the two’s complement number in our circuit without any additional time overhead to make the number redundant. The

%

and

%

$

blocks

take 2 bits from each of the inputs in the conventional arithmetic system and convert them to a redundant number system. To do so, we note that all positive two’s complement conventional numbers are valid in the redundant number system with a

negative two’s complement numbers are valid with a the successive digits. The

%

and

%

!

as the sign digit for all digits. All

as the sign for the first digit and

s for

blocks append the digits generated to form the

vector for the SDVM as in equation (12). The SDVM is designed based on the modified Baugh-Wooley multiplier scheme [31], [38]. In

a general (

(

bit multiplication, the Baugh-Wooley method does not require any increase in the

maximum column height. But since the operands to the multiplier’s input do not have the same precision ( (

), an additional stage of CSAs are needed to add the 1’s at the bottom of the tree,

increasing the delay of the partial product accumulation by 1 stage of CSAs. Since the multiplier

March 25, 2002


17

outputs are now combined with the residual addition, the number of inputs to be added become

. Hence, a

compressor is used for addition. Again, to make the on-line multiplication

time independent of precision, the selection function uses only the most significant 4 bits to determine the result as shown in Figure 8.

tiplication (

The need for a SDVM and a

compressor makes the on-line time requirements for mul-

gate delays) greater than that for addition ( gate delays). Since the on-line

multiplier output feeds the on-line adders, this will leave the adder starved for data. Hence, in order to match the delays of the adder and multiplier, we pipeline the design further as shown by the dotted lines in Figure 8. This modified design has a delay of gates, equal to the delay of the compressor and the HAs. The pipeline stages increase the on-line delay, , of the multiplier from 1 to 3, resulting in higher throughput at the expense of larger latency. The complete on-line multiplier uses

gates for

bit precision. This large area requirement even for

a digit-serial multiplier is due to the need for a SDVM and a large number of latches required to pipeline the design. If this area requirement exceeds the detector architecture specifications, then a radix-2 multiplier can be used at the expense of throughput.

,+ ( 1

.

, +( 1

, +( 1

and

, +( 1

partial product generator gates. The

(

of four , and

.

(13)

(+1

(14) (15)

,(

1

, +( 1

& , +( 1 !

.

, ( 1

(16)

(17)

is the on-line multiplication time and area respectively. The

two SDVM’s having a combined area of

, ( 1

(

where

& , ( 1 (

, (

, +( 1

, (+1 consist of

(

FA cells,

compressor with area

(

HA cells and

, +( 1

#(

consists

1 -bit FA cells. The 3-bit CPA takes 21 FA cells, the 4 HA’s take 2 gates each , (1 is the area required by the latches in order to pipeline the design sufficiently

enough to reduce the on-line delay to 7.

, +( 1

is the area required for the additional

control logic required. The matched filter and multiuser detectors are now built using the radix-4 on-line adder and March 25, 2002


18

conventional arithmetic A/D or P/S converter (MSDF input) x[2:1]

y[2:1] Control Logic

Calcy

Calcx

y[3:1]

x[3:1] D D CalcX

CalcY

X[d:1]

Y[d:1]

Modified Baugh -Wooley SDVM

Modified Baugh -Wooley SDVM C

S

C

[0,0,PC[ d-1 :1],0,0]

[QS1 ,QS1 ,PS [d-1:1],0,0]

WS[d+3:1] WC[d+3:1]

D

S

yXC [d+3:1]

xYC[d+3:1]

yXS[d+3:1] PIPELINE STAGE

xYS[d+3:1]

(d+3)-bit 6:2 Compressor using CSA's

D

S

C

PC[ d-1:1] PC[ d+3: d]

PS [d-1:1]

PS[ d+3: d]

4-bit 2:2 CSA's (4 HA's) S

C

QS1

C out

PIPELINE STAGE

QS[4:2]

QC[3:1 ]

3-bit CPA

Z[3:1]

C in

QS1 PIPELINE STAGE

Fig. 8. A general radix-4 on-line multiplier.

March 25, 2002


19

(a) Conventional Code Matched Filter

Â Hp,1 ri ,1

Â Hp,2

^

ri ,2 *

(b) Conventional Code Matched Filter using a CSA Tree

ri ,N

ri ,N-1

*

Â Hp,1

Â Hp,N

A Hp,N-1

Â Hp,N-1

ri ,2

ri ,1

*

*

ri ,N-1

*

*

s

+

Â Hp,2

s

C

s

C

+ +

ri ,N * C

s

C

2* N:2 Compressor s C

+

+

*

Â Hp,N

CPA

Â Hp ri

Â Hp ri

^Hp ri ) ^bi, p = sign(A

^bi, p = sign(A ^Hp ri ) (c) On-line Code Matched Filter

^

Â Hp,2

A Hp,1

ri ,1 *

Â Hp,N-1

ri ,2

ri ,N-1

* +

*

Â Hp,N ri ,N * +

+

+ +

^Hp ri ) ^bi, p = sign(A Fig. 9. Code matched filter with different arithmetic schemes. Part(a) shows a simple implementation using conventional arithmetic. Part(b) shows a conventional implementation using CSAs. Part(c) shows an on-line arithmetic implementation.

multiplier as building blocks for calculating their area-time requirements and comparisons with conventional arithmetic. IV. C ONVENTIONAL

AND ON - LINE IMPLEMENTATION OF A MATCHED FILTER DETECTOR

We discuss different implementation schemes for a code matched filter detector using a simple conventional arithmetic implementation, a time constrained conventional arithmetic implementation using carry-save adders and an on-line arithmetic implementation. These are shown in Figure 9. In all the architectures discussed below, we present time-constrained versions in order to attain the maximum data rates and then, we compare the area requirements for these architectures to determine their utility in a low power architecture design.

March 25, 2002


20

A. Conventional arithmetic matched filter A straight-forward implementation of a matched filter [23] is shown in Figure 9(a). The dotproduct is implemented in a tree structure for a parallel implementation. The multipliers form the first level of the tree and the adders form the rest of the levels. For a length- there are

multipliers at the top of the tree,

adders at the next level,

&

dot product, adders at the

next level and so on. If the multipliers are implemented using Dadda tree multipliers [31], [39], the delay of the multiplier depends on the precision ( . If there are

#

stages in the tree reduction

for ( -bit precision, the time taken for the multiplication in terms of gate delays and the area requirements is given by

, +( 1

, +( 1

where

(

(

,

#

(

1

,

(

1

( & "( ,

1

,

"( 13

(18) (19)

is the size of the CPA needed for the Dadda multiplier [39], one gate delay is

required for the partial product generation and each FA in the reduction stage has gate delays.

There are

(

gates for generating the partial product matrix, ,

(

1

HA’s and ,

( & (

1

FAs in the Dadda multiplier [39]. Truncated multiplication [40], which can reduce the multiplier area requirements by and power dissipation by & , can also be used giving rise to tradeoffs between the error rate performance and the area requirements similar to on-line arithmetic. However, while on-line arithmetic also provides significant gains in throughput, truncated multiplication does not offer any throughput improvements. The time and area taken by each adder in the adder tree can be given by

, (+1

, (+1

Since there are

, 1

(1 , (+13" ,

(20) (21)

levels of addition, the time and area taken by the matched filter can

be given as

,+ ( 1

,+ ( 1

, +( 1

, +( 1

, 1 ,

, (+1 1 , (+13"

(22) (23)

This detector is used as a base case comparison point with other detector structures based on conventional and on-line arithmetic. March 25, 2002


21

B. Conventional arithmetic matched filter using Carry-Save adders Further reductions in delay can be obtained by reducing the carry-propagation delay overhead using CSAs. Similar schemes using CSAs for dot-product type computations have been proposed in [41], [42], [43]. We explore this architecture as another trade-off point for comparison purposes with a time-constrained on-line arithmetic architecture. The Dadda multiplier no longer does the CPA at the bottom of the multiplier tree, but instead passes on its output in a CS form, as in Figure 9(b). The CS outputs of all the

%

with a

multipliers are then added together

compressor. The CS output of this compressor is then fed to a wider CPA

for computing the final result. The time taken for the matched filter using carry-save adders is the time taken for the multiplier without the CPA

where

#

, the time taken by the compressor

, and the time for the final CPA.

, +( 1 , (+1

, (+1 and #

#

#

,+ ( 1

, +( 1

(24)

,

"(

,

(25) 1 1

(26)

are the number of stages required for the Dadda multiplier tree compression

and the addition compression respectively. Note that

#

is dependent on the precision ( while

#

is dependent only on the number of inputs . The precision required by the CPA at the end of the tree is ,

(

,

(

1 1 as

is the precision of the CS outputs of the multiplier

and the addition tree width [31] increases at most by

,

1 . Thus in equation (26), we no

longer get the multiplicative effect of the CPA time due to the tree height as in equation (22). We note that the design could be slightly optimized further by integrating the addition of the output of the different multipliers into the Dadda tree addition. However, this would affect only the number of stages slightly and will not have any significant change in the effective time delay of the matched filter detector. The area requirements for the matched filter using carry-save adders sum of the areas of the multipliers

March 25, 2002

, the compressor

will be the

and the final CPA


22

.

, +( 1 , (+1

a

,+ ( 1

where

, +( 1

(

,

1

( ,

compressor using

( & "(

1 ( , 1 1 , ( , (+1 , ( 1

1

(27)

(28)

(29)

is the number of multipliers required and "

,

,+ ( 1

(30)

is the number of FA’s required for

FA cells. The area in equation (27) is obtained similar to

that in equation (19). Comparisons of the area-time tradeoffs of this detector against the simple conventional arithmetic detector and the on-line arithmetic detector are presented in Section 5. C. On-line arithmetic matched filter In contrast to the conventional arithmetic implementations, the throughput of the on-line arithmetic based matched filter detector (shown in Figure 9(c)) depends only on the number of most significant digits

needed and is independent of the total precision of the input and the number

of stages in the adder-multiplier chain as it is highly pipelined.

where

,

1

(31)

is the total time for the matched filter detector and

is the on-line time. The additional digit in the (

,

) factor in equation (31) is due to the over-

head in on-line arithmetic to signify the end of the current precision computation (see the RESET

signal ( ) in Figure 6). The area requirements for on-line arithmetic

, +( 1

, +( 1

V. A REA -T IME

,

1

can be given by (32)

TRADEOFFS AND RESULTS

Figure 10 presents the throughput comparisons of the conventional arithmetic matched filter detector, the conventional arithmetic matched filter detector using carry-save adders and a radix-4 on-line arithmetic based matched filter detector. We can see that both the conventional arithmetic detectors have logarithmic increase in gate delays with precision while the on-line detector has a linear increase in gate delay and is hence, slower for a full precision implementation. The truncated detector schemes with March 25, 2002

and

MSDs, however, perform much better as Submission for IEEE Trans. on Computers

1

23

Throughput for Code Matched Filter and Multiuser Detector

3

10

Time required (gate delays)

Simple Conventional Conventional with CSA On−line (Full precision) Truncated On−line (2 MSDs) Truncated On−line (MSD)

2

10

1

10

0

10

20

30 40 Precision (in bits)

50

60

70

Fig. 10. Throughput for code matched filter and multiuser detector in terms of gate delay time.

they have a constant throughput independent of the digit precision. The conventional arithmetic implementation using carry save adders also outperforms the simple conventional arithmetic implementation by limiting the carry propagation delay. Figure 11 presents the corresponding area comparisons for the detectors. It can be observed from the figure that the conventional arithmetic matched filter detector using carry-save adders has better time performance than a straightforward implementation with nearly the same area. Full precision on-line arithmetic on the other hand, consumes more area than the conventional schemes for low precision due to the use of a redundant number system, but soon starts giving benefits in area for higher precision (

bits) due to its digit-serial nature. Though the area

required by the on-line adders is constant with precision, the on-line multiplier requires more area with increasing precision (without any decrease in throughput). The truncated schemes again give additional gains in area as the multiplier area also becomes constant with precision. Table I gives the area-throughput tradeoffs of the different matched filter architectures. The table shows the speedup in throughput and area reduction for all schemes against the simple conMarch 25, 2002


24

Area required for a Matched Filter Implementation

6

10

5

Area required (gates)

10

4

10


3

10

0

10

20


50

60

70

Fig. 11. Area consumed by code matched filter detector in terms of logic gates.

TABLE I A REA -T HROUGHPUT

TRADEOFFS FOR DIFFERENT MATCHED FILTER ARCHITECTURES WITH CONVENTIONAL

AND ON - LINE ARITHMETIC .

T HE

COMPARISONS PRESENTED ARE AGAINST A SIMPLE , CONVENTIONAL ARITHMETIC IMPLEMENTATION .

Precision (bits)

March 25, 2002

Throughput Speedup CSA

On-line

Area Reduction

On-line CSA

(2 MSDs)

(MSD)

On-line

On-line

(2 MSDs)

(MSD)

8

1.62

2.63

3.94

1.31

0.74

1.00

16

1.76

3.29

4.94

1.17

2.81

3.80

32

1.86

3.96

5.94

1.08

10.77

14.55


25

ventional arithmetic scheme. We can see that the conventional implementation with carry-save adders has a 1.62

1.86

improvement in throughput for

bit precision due to its reduc-

tion of the carry propagation delay while simultaneously providing 1.31

1.08

reduction in

area respectively. The truncated on-line arithmetic detector (using just the MSD), however, does much better, providing 3.94

speedup for

5.94

providing an area reduction of 1 14.55

bit precision while simultaneously

respectively compared to the simple conventional

implementation. A. Area-Time performance of the multiuser detector The architecture of the multiuser detector (shown in Figure 3) contains the matched filter detector as the initial estimate of the detected bits. Three stages of PIC are then applied to refine the decisions. Each stage in the detector consists of

matrix-vector products as shown

in equation (3). These matrix-vector products can again be implemented as dot products as in the matched filter implementation. However, the multiplications with

’s can be replaced

by additions and subtractions. Hence, on-line multipliers are not needed in the PIC stages, which significantly reduce the area requirements. The multiuser detection implementation, being pipelined, takes the same amount of time as the matched filter detector. Hence, the throughput requirement curves for the multiuser detector is the same as that shown in Figure 10. The area requirements for the multiuser detector are shown in Figure 12. The figure assumes a

-user multiuser detector (

) with

PIC stages and compares it with a

-user

-user matched filter implementation consists of single user matched filters in parallel and hence, will have the area of the matched filter shown in matched filter implementation. The

Figure 11. As no additional multiplications are required in the multiuser detector stages, there is very little area overhead in terms of gates ( " &

for the truncated MSD on-line detector; similar

or lower overhead for other on-line schemes) but significantly higher performance in terms of bit error rates [25], [26] can be achieved. This supports the use of the on-line multiuser detector design for advanced communication receivers. VI. E XTENSIONS

TO SOFT DECISIONS AND HIGHER MODULATION SCHEMES

On-line arithmetic can also be used for detection in higher modulation schemes, especially M-QAM (M-ary Quadrature Amplitude Modulation) [17] used in current and advanced wireless March 25, 2002


26

Area for a 32−user matched filter and multiuser detector

8

10

−−−− multiuser detector matched filter 7

Area required (gates)

10

6

10


5

10

4

10

0

10

20


50

60

70

Fig. 12. Area consumed by a 32-user code matched filter and multiuser detector in terms of logic gates.

LAN systems [3] and for QPSK for W-CDMA to attain higher data rates. This is done by not only using the sign of the most significant digits but also their magnitude. Each dimension of the constellation is treated independently to find the sign and the magnitude for detection of the symbol. Figure 13(a) shows the optimal (non-truncated) on-line arithmetic decision scheme for BPSK modulation. For the BPSK case, we waited for the first non-zero MSD to appear and then based on the sign, made a decision (See Figure 5). For this, we noted that the sign of the first non-zero MSD represented the sign of the number. Thus, if a

is received, this implies

that we need to wait for the next digit to determine the sign. If the next digit is again process is repeated. If all the digits are

, this

’s, the decision is made at random. Note that in this

case, we ignored the value of the MSD and just concentrated on the sign. In order to obtain soft decisions for decoding, we need to look at the magnitude of the first non-zero MSD along with the sign. The magnitude provides higher accuracy (softness) in the decision region and can be used to attain better error rate performance for decoding. However, in a redundant number system, the true magnitude of the MSD could be off by March 25, 2002

as the next


27

wait until next non-zero digit 1xx

000

0xx

(a) BPSK (See the sign)

IMAGINARY 101

110

111 000

001

010

011

011 010 wait until next non-zero digit 001

REAL

000 wait until next non-zero digit 111 110 wait until next non-zero digit 101

(b) 16- QAM (See the Sign and Magnitude)

Fig. 13. Using on-line arithmetic for higher modulation schemes.

digit following the MSD could have the opposite sign. Hence, one more additional digit needs to be detected in order to obtain the correct magnitude of the MSD. The decision scheme can be be similarly extended to higher modulation schemes. The magnitude as well as the sign of the number is used to tell the decision region. For the

-QAM

example case shown in Figure 13(b), we need to wait for a second digit, whenever the first digit is a

or a

as only the second digit will tell us in whether the actual magnitude is above

or below that value. The tradeoffs are between the additional digits to wait for, a higher radix implementation and probability of error similar to that shown in Figure 5 for the BPSK case. March 25, 2002


28

Systems based on M-ary PSK, such as GSM [2], have the constellation points on a circle. These make the decision regions angular. Hence, rotations along with the magnitude comparisons are necessary for detection of these signals using on-line arithmetic. Cordic schemes for on-line arithmetic [44] may be applicable for computing the rotations. A. Extensions to other blocks in the commmunication receiver On-line arithmetic can also be extended to other blocks in the communication receiver such as channel estimation and channel decoding (See Figure 1). The latency of computations can be reduced due to the MSDF pipelining of the digits to the detector block. The inputs from the A/D converter is fed directly to the estimation block. The inputs generated serially from the A/D converter in Figure 1 are inherently MSDF and hence, the on-line algorithms can increase the overall speed by overlapping computations with the conversion. Moreover, there is no need for conversion to a redundant number system as we assume a maximally redundant system and no need for re-conversion either as the subsequent detector block accepts the redundant inputs from channel estimation. Channel coding is typically performed on the transmitted data to protect against channel errors. Techniques such as Viterbi decoding [14] and turbo decoding [15] are some of current and advanced channel error protection schemes used for communication receivers. Decoding also involves making decisions on the detected bits to recover the information and hence, the on-line techniques shown in this paper for detection are applicable to decoding. The Add-CompareSelect (ACS) function in Viterbi decoding [14] can be implemented using on-line arithmetic. The operations involved in turbo decoding are exponentiation, logarithm, minimum-selection, addition, multiplications and sign-based testing. The last four operations are already shown to be well-suited for on-line implementations. Also, exponentiation and logarithms can be implemented in a digit-serial fashion [45] and can be implemented in on-line arithmetic. VII. C ONCLUSIONS This paper presents low power architectures using conventional and on-line arithmetic to accelerate detection in communication systems. We identify the potential application of on-line arithmetic for detection in mobile communication systems. We develop innovations in on-line arithmetic modules for better integration with conventional arithmetic systems. We explore a March 25, 2002


29

suite of tradeoffs in the usage of on-line arithmetic on the throughput, area and latency and its effects on bit error rate performance. We design novel detector circuits using radix-4 on-line arithmetic for detection algorithms. Our detector can directly interface with conventional arithmetic circuits if necessary and hence, be used as co-processor support to a DSP. We present a strong case for the use of on-line arithmetic for detection in mobile communication receivers as it eliminates extraneous computations, providing significant speedups in throughput and reductions in area requirements simultaneously. The detection schemes based on on-line arithmetic presented here are just a case study. Online arithmetic is directly applicable to a wide range of simple and advanced detection circuits used in signal processing and in wired and wireless communications. Hence, this can have a significant impact on future detector designs. On-line arithmetic can also be extended to include other blocks in the receiver such as channel estimation and Viterbi decoding for further pipelining benefits. Also, the detection schemes can also be extended to soft decisions for decoding as well as higher dimensional constellations such as M-ary Phase Shift Keying (M-PSK) and M-ary Quadrature Amplitude Modulation (M-QAM) used in GSM and wireless LAN systems for attaining higher data rates by using the magnitude of the digits detected along with the sign. We are currently investigating the integration of the detector using on-line arithmetic as a co-processor support in the design of a programmable wireless communication receiver. R EFERENCES [1]

F. Adachi, M. Sawahashi, and H. Suda, “Wideband DS- CDMA for next-generation mobile communication systems,” IEEE Communications Magazine, vol. 36, no. 9, pp. 56–69, September 1998.

[2]

A. Furuskar, S. Mazur, F. Muller, and H. Olofsson, “EDGE: enhanced data rates for GSM and TDMA/136 evolution,” IEEE Personal Communications Magazine, vol. 6, no. 3, pp. 56–66, June 1999.

[3]

R. Van-Nee, G. Awater, M. Morikura, H. Takanashi, M. Webster, and K. W. Halford, “New high-rate wireless LAN standards,” IEEE Communications Magazine, vol. 37, no. 12, pp. 82–88, December 1999.

[4]

M. D. Ercegovac, “On-line arithmetic: An overview,” in Real Time Signal Processing VII, SPIE, San Diego, CA, August 1984, vol. 495, pp. 86–93.

[5]

T. Lynch and M. J. Schulte, “A high radix on-line arithmetic for credible and accurate computing,” Journal of Universal

[6]

R. McIlhenny and M. D. Ercegovac, “On-line algorithms for complex number arithmetic,” in

Computer Science, vol. 1, no. 7, pp. 439–453, July 1995.

Asilomar Conference

on Signals, Systems and Computers, Pacific Grove, CA, October 1998, pp. 172–176. [7]

M. D. Ercegovac and T. Lang, “On-line arithmetic: A design methodology and applications in digital signal processing,” VLSI Signal Processing III, pp. 252–263, November 1988, IEEE Press.

March 25, 2002


30

[8]

J. Bruguera and T. Lang, “2-D DCT using on-line arithmetic,” in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Detroit, MI, May 1995, vol. 5, pp. 3275–3278.

[9]

M. D. Ercegovac and A. L. Grnarov, “On the performance of on-line arithmetic,” in International Conference on Parallel Processing, Harbor Springs, MI, August 1980, pp. 55–62.

[10] M. D. Ercegovac and T. Lang, “On-the-fly conversion of redundant into conventional representations,” IEEE Transactions on Computers, vol. C-36, no. 7, pp. 895–897, July 1987. [11] S. Moshavi, “Multi-user detection for DS-CDMA communications,” IEEE Communications Magazine, pp. 124–136, October 1996. [12] S. Verdú, Multiuser Detection, Cambridge University Press, 1998. [13] M. Honig, U. Madhow, and S. Verdú, “Blind adaptive multiuser detection,” IEEE Transactions on Information Theory, vol. 41, no. 4, pp. 944–960, July 1995. [14] H. Suzuki, Y. Chang, and K.K. Parhi, “Low-power bit-serial Viterbi decoder for next generation wide-band CDMA systems ,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Phoenix, AZ, March 1999, vol. 4, pp. 1913 –1916. [15] B. Slkar, “A Primer on Turbo Code Concepts,” IEEE Communications Magazine, pp. 94–101, December 1997. [16] N. Weste and D. J. Skellerm, “VLSI for OFDM,” IEEE Communications Magazine, vol. 36, no. 10, pp. 127–131, October 1998. [17] J. G. Proakis and M. Salehi, Communication Systems Engineering, Prentice Hall, 1994. [18] S. Rajagopal, S. Bhashyam, J. R. Cavallaro, and B. Aazhang, “Real-time algorithms and architectures for multiuser channel estimation and detection in wireless base-station receivers,” appearing in IEEE Transactions on Wireless Communications, July 2002. [19] R. H. Walden, “Performance trends in Analog-to-Digital converters,” IEEE Communications Magazine, vol. 37, no. 2, pp. 96–101, February 1999. [20] I. Seskar and N. B. Mandayam, “A software radio architecture for linear multiuser detection,” IEEE Journal on Selected Areas in Communications, vol. 17, no. 5, pp. 814–823, May 1999. [21] N. Zhang, A. Poon, D. Tse, R. Brodersen, and S. Verdú, “Trade-offs of performance and single chip implementation of indoor wireless multi-access receivers,” in IEEE Wireless Communications and Networking Conference (WCNC), New Orleans, LA, September 1999, vol. 1, pp. 226–230. [22] J. Chen, G. Shou, and C. Zhou, “High-speed low-power complex matched filter for W-CDMA: Algorithm and VLSI architecture,” IEICE Trans. Fundamentals, vol. E83-A, no. 1, pp. 150–157, January 2000. [23] K. Chapman, P. Hardy, A. Miller, and M. George, “CDMA matched filter implementation in Virtex devices,” Xilinx Application Note XAPP212(v1.1), January 2001. [24] T. Long and N. R. Shanbhag, “Low-power CDMA multiuser receiver architectures,” in IEEE Workshop on Signal Processing Systems, Taipei, Taiwan, October 1999, pp. 493–502. [25] M. K. Varanasi and B. Aazhang, “Multistage detection in asynchronous code-division multiple-access communications,” IEEE Transactions on Communications, vol. 38, no. 4, pp. 509–519, April 1990. [26] G. Xu and J. R. Cavallaro, “Real-time implementation of multistage algorithm for next generation wideband CDMA systems,” in Advanced Signal Processing Algorithms, Architectures, and Implementations IX, SPIE, Denver, CO, July 1999, vol. 3807, pp. 62–73. [27] S. Rajagopal and J. R. Cavallaro, “A bit-streaming pipelined multiuser detector for wireless communications,” in IEEE International Symposium on Circuits and Systems (ISCAS), Sydney, Australia, May 2001, vol. 4, pp. 128–131.

March 25, 2002


31

[28] A. Avizienis, “Signed-Digit number representations for fast parallel arithmetic,” IRE Transactions on Electronic Computers, vol. EC-10, no. 3, pp. 389–400, September 1961. [29] A. Gorji-Sinaki and M. D. Ercegovac, “Design of a digit-slice on-line arithmetic unit,” in

IEEE Symposium on

Computer Arithmetic, Ann Arbor, MI, May 1981, pp. 72–80. [30] S. Rajagopal and J. R. Cavallaro, “On-line arithmetic for detection in digital communication receivers,” in

IEEE

International Symposium on Computer Arithmetic (ARITH-15), Vail, CO, June 2001, pp. 257–265. [31] B. Parhami, Computer Arithmetic - Algorithms and Hardware Designs, Oxford University Press, 2000. [32] S. Waser and M. J. Flynn, Introduction to Arithmetic for Digital System Designers, CBS College Publishing, 1982. [33] I. Koren, Computer Arithmetic Algorithms, Prentice Hall, 1993. [34] J. S. Fernando and M. D. Ercegovac, “Conventional and on-line arithmetic designs for high-speed recursive digital filters,” in Fifth IEEE Workshop on VLSI Signal Processing, Napa Valley, CA, October 1992, pp. 81–90. [35] M. D. Ercegovac, “A general hardware-oriented method for evaluation of functions and computations in a digital computer,” IEEE Transactions on Computers, vol. 26, no. 7, pp. 667–680, July 1977. [36] A. Guyot, Y. Herreros, and J-M. Muller, “JANUS: An on-line multiplier/divider for manipulating large numbers,” in

IEEE Symposium on Computer Arithmetic, Santa Monica, CA, September 1989, pp. 106–111. [37] A. F. Tenca, M. D. Ercegovac, and M. E. Louie, “Fast on-line multiplication units using LSA organization,” in Advanced Signal Processing Algorithms, Architectures, and Implementations IX, SPIE, Denver, CO, July 1999, vol. 3807, pp. 74–83. [38] C. R. Baugh and B. A. Wooley, “A two’s complement parallel array multiplication algorithm,” IEEE Transactions on Computers, vol. 22, pp. 1045–1047, December 1973. [39] K’A. C. Bickerstaff, E. E. Swartzlander, and M. J. Schulte, “Analysis of column compression multipliers,” in IEEE International Symposium on Computer Arithmetic, Vail, CO, June 2001, pp. 33–39. [40] M. J. Schulte, J. E. Stine, and J. G. Jansen, “Reduced power dissipation through truncated multiplication,” in IEEE Alessandro Volta Memorial Workshop on Low Power Design, Como, Italy, March 1999, pp. 61–69. [41] P. I. Balzola, M. J. Schulte, J. Ruan, J. Glossner, and E. Hokenek, “Design alternatives for parallel saturating multioperand adders,” in IEEE International Conference on Computer Design (ICCD), Austin, TX, 2001 2001, pp. 172–177. [42] G. Wang and M. Tull, “The implementation of an efficient and high-speed inner-product processor,” in Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, October 2001, vol. 2, pp. 1362 –1366. [43] N. Azakova, R. Sung, N. Durdle, M. Margala, and J. Lamoureux, “Fast and low-power inner product processor,” in 2001 IEEE International Symposium on Circuits and Systems (ISCAS), Sydney, Australia, May 2001, vol. 4, pp. 646–649. [44] H. X. Lin and H. J. Sips, “On-line CORDIC algorithms,” IEEE Transactions on Computers, vol. 39, no. 8, pp. 1038–1052, August 1990. [45] J. J. Modi, Parallel Algorithms and Matrix Computation, Oxford University Press, 1988.

March 25, 2002