Low-Complexity High Throughput VLSI Architecture of Soft-Output ML MIMO Detector

Teo Cupaiuolo and Massimiliano Siti
Advanced System Technologies, STMicroelectronics, Agrate Brianza, Italy
Email: {teo.cupaiuolo,massimiliano.siti}@st.com

Alessandro Tomasoni
Politecnico di Milano, Milan, Italy
Email: [email protected]

Abstract: This paper presents a VLSI architecture of a high-throughput, high-performance soft-output (SO) MIMO detector, the recently presented Layered ORthogonal Lattice Detector (LORD). The baseline implementation includes optimal SO generation, i.e. maximum-likelihood (ML) in the max-log sense. A reduced-complexity variant of the SO generation stage is also described. To the best of the authors' knowledge, the proposed architecture is the first VLSI implementation of a max-log ML MIMO detector that includes both QR decomposition and SO generation. The SO generation stage has a deterministic, very high throughput thanks to a fully parallelizable structure, and is parameterizable in terms of the number of transmit and receive antennas and of the supported modulation orders. Both designs achieve a throughput that makes them particularly suitable for MIMO-OFDM systems such as IEEE 802.11n WLANs: the most demanding requirements are satisfied at a reasonable cost in area and power consumption.

978-3-9810801-6-2/DATE10 © 2010 EDAA

I. INTRODUCTION

Wireless transmission through multiple antennas, also referred to as MIMO (Multiple-Input Multiple-Output), currently enjoys great popularity because of the demand for high data rate communication from multimedia services. MIMO transmission consists of the simultaneous transmission of $T$ complex symbols using $T$ transmit antennas; in this way the transmit data rate is $T$ times that of a single-antenna system transmitting in the same bandwidth.

The sphere decoder (SD) [1] is the best-known low-complexity ML MIMO detector, but it has some inherent drawbacks, namely its non-deterministic nature and poorly parallelizable structure, that complicate its hardware implementation. Moreover, the original algorithm is hard-output; the problem of soft-output (SO) detection has since been analyzed from an implementation perspective in several works, such as [2]. Despite this, the VLSI design of SO near-ML MIMO detectors still presents important unsolved issues. Several papers have been published on the VLSI design of the SD and its extension to SO generation, such as [3]; however, the above-mentioned drawbacks affect the architectural design, and achieving a very high data rate with high modulation orders (e.g. 64-QAM, as required by 802.11n [4]) is still a major challenge.

In [5] the first VLSI architecture of an SO MIMO detector with 4 transmit and 4 receive antennas (in short, 4×4), based on the SD and supporting 64-QAM, is published. The authors introduce several algorithmic optimizations aimed at reducing the associated complexity, but the design suffers from a run-time variable detection throughput that depends on the experienced SNR. Other works address SO MIMO detection with fixed throughput [6][7], though their VLSI designs are specific to 4×4 16-QAM. In [8] a VLSI implementation of the recently introduced SO Layered ORthogonal Lattice Detector (LORD) [9] is given. That work focuses only on the SO generation stage (4×4 64-QAM), neglecting the multiple QR decompositions required by LORD; moreover, its maximum throughput of 188 Mbps does not meet the requirements of [4] even at a clock frequency as high as 500 MHz.

In this paper, a novel full VLSI architecture based on LORD is introduced. The channel processing (multiple QR decomposition) stage is outlined with reference to the case of $T = 2$ transmit antennas, while the architecture of the most critical stage for LORD complexity, i.e. the SO generation stage, can be applied to any number $T$. Compared to [8], a higher level of parallelism is proposed, achieving the very high throughput required by [4] while retaining the deterministic throughput and latency intrinsic to the LORD algorithm. Moreover, the VLSI design of a reduced-complexity variant of the SO stage, valid for OFDM systems, is proposed, allowing a significant complexity and power reduction.

The paper is organized as follows. Section II describes the system model, Section III recalls the LORD algorithm and introduces its reduced-complexity version, Section IV details the novel VLSI architecture, and Section V summarizes the results and compares the design with other state-of-the-art SO MIMO detectors. Section VI concludes the paper.

II. SYSTEM MODEL

In order to simplify the notation, we consider a frequency non-selective MIMO communication system with $T$ transmit and $R$ receive antennas. For OFDM systems, like those of interest for 802.11n WLANs, the following equations are intended to hold per sub-carrier in the frequency domain.
The signal received at each antenna is therefore a superposition of the $T$ transmitted signals corrupted by multiplicative fading and additive white Gaussian noise. The complex path gains are samples of zero-mean Gaussian random variables (RVs) with variance $\sigma^2 = 0.5$ per dimension. Fading processes for different transmit and receive antenna pairs are assumed to be independent. Complex gains are assumed constant over the duration of a codeword and vary independently from one codeword to another (i.e. quasi-static block fading). Ideal channel state information (CSI) at the receiver is assumed (i.e. the channel matrix $\mathbf{H}$, of size $R \times T$, is perfectly known). The transmitted signal at each time instant can be represented as a vector $\mathbf{X}$, of size $T \times 1$, where the $t$-th symbol $s_t$, taken from a generic $M^2$-QAM constellation, is transmitted by the $t$-th antenna. Under these assumptions the received vector $\mathbf{Y}$, of size $R \times 1$, is given by:

$$\mathbf{Y} = \sqrt{\frac{E_s}{T}}\,\mathbf{H}\mathbf{X} + \mathbf{N} \tag{1}$$

where $E_s$ is the total transmitted energy per symbol (under the hypothesis that the average constellation energy is $E[|s_k|^2] = 1$) and $\mathbf{N}$ is the noise vector of size $R \times 1$, whose elements are samples of independent circularly symmetric zero-mean complex Gaussian RVs with variance $N_0/2$ per dimension. The signal-to-noise ratio (SNR) per receive antenna is $E_s/N_0$.

ML detection over a MIMO channel corresponds to finding the transmitted sequence $\bar{\mathbf{X}}$ that minimizes the squared norm of the error vector (i.e. the sum of the squared magnitudes of all its components, $\|\cdot\|^2$):

$$\bar{\mathbf{X}} = \arg\min_{\mathbf{X}} \left\| \mathbf{Y} - \sqrt{\frac{E_s}{T}}\,\mathbf{H}\mathbf{X} \right\|^2 \tag{2}$$

If the symbols take values in an $M^2$-QAM constellation, solving eq. (2) corresponds to searching over $M^{2T}$ sequences, which rapidly becomes unfeasible as $T$ grows.

III. LORD MIMO DETECTION

The algorithm consists of three distinct stages. The first is the real-domain representation, which differs from the one used by the SD. A fundamental difference is that the in-phase (I) and quadrature-phase (Q) components of the complex quantities are taken in a different order, namely:

$$\begin{aligned}
\mathbf{x} &= [X_{1,I}\ X_{1,Q}\ \ldots\ X_{T,I}\ X_{T,Q}]^T \\
\mathbf{y} &= [Y_{1,I}\ Y_{1,Q}\ \ldots\ Y_{R,I}\ Y_{R,Q}]^T \\
\mathbf{n} &= [N_{1,I}\ N_{1,Q}\ \ldots\ N_{R,I}\ N_{R,Q}]^T \\
\mathbf{y} &= \mathbf{H}_{re}^t \mathbf{x}^t + \mathbf{n} = [\mathbf{h}_1\ \ldots\ \mathbf{h}_{2T}]\,\mathbf{x}^t + \mathbf{n}
\end{aligned} \tag{3}$$

The superscript $t$ in (3) is a symbol permutation index, as explained in the following. $\mathbf{H}_{re}^t$ is a real version of the channel matrix whose columns have the form:

$$\mathbf{h}_{2k-1} = [\Re(H_{1,k})\ \Im(H_{1,k})\ \ldots\ \Re(H_{R,k})\ \Im(H_{R,k})]^T \tag{4}$$

$$\mathbf{h}_{2k} = [-\Im(H_{1,k})\ \Re(H_{1,k})\ \ldots\ -\Im(H_{R,k})\ \Re(H_{R,k})]^T \tag{5}$$

where $\Re$ and $\Im$ denote the real and imaginary part of the argument, respectively. The pairs $\mathbf{h}_{2k-1}$, $\mathbf{h}_{2k}$ are orthogonal by construction, i.e. $\mathbf{h}_{2k-1}^T \mathbf{h}_{2k} = 0$.

A. The QR processing algorithm

An efficient preprocessing orthogonalization of the channel matrix $\mathbf{H}$ is possible as an alternative to a standard QR decomposition, still generating an orthonormal matrix $\mathbf{Q}^t$ and an upper triangular matrix $\mathbf{R}^t$ such that:

$$\mathbf{H}_{re}^t = \mathbf{Q}^t \mathbf{R}^t \tag{6}$$

The index $t$ refers to symbol sequence permutations in which each layer (layer and transmit antenna are used interchangeably throughout this paper) becomes the reference only once; more specifically, each permutation has to differ from the others in the complex symbol placed in the $T$-th position of the complex sequence $\mathbf{X}$, corresponding to the $T$-th I and Q pair in the real sequence $\mathbf{x}^t$. In order to reduce the complexity associated with the multiple QRDs (MQRDs), the proposed processing strategy is to perform the QR decomposition through a Gram-Schmidt orthogonalization (GSO) without explicit computation of $\mathbf{Q}$ [9]. This allows sharing unnormalized scalar products among channel columns, or between channel columns and the received vector, across all $T$ QR decompositions.

The QR decomposition related to the system (3) can be written as follows. With $T = 2$, $\mathbf{R}^t$ is a 4×4 upper triangular matrix:

$$\mathbf{R}^t = \begin{bmatrix}
\sigma_i & 0 & s_{1,3}/\sigma_i & (-1)^t\, s_{1,4}/\sigma_i \\
0 & \sigma_i & (-1)^{t+1}\, s_{1,4}/\sigma_i & s_{1,3}/\sigma_i \\
0 & 0 & \beta_j & 0 \\
0 & 0 & 0 & \beta_j
\end{bmatrix} \tag{7}$$

with $\beta_j^2 = \sigma_j^2 - (s_{1,3}/\sigma_i)^2 - (s_{2,3}/\sigma_i)^2$, where $t$ is the antenna index and, for $t = 1, 2$, one has $i = 3, 1$ and $j = 1, 3$, respectively. The notation is the same as in [9], namely $\sigma_{2k-1}^2 \equiv \|\mathbf{h}_{2k-1}\|^2$ and $s_{j,k} \equiv \mathbf{h}_j^T \mathbf{h}_k$. Multiplying (3) by $(\mathbf{Q}^t)^T$ one has:

$$\tilde{\mathbf{y}}^t = (\mathbf{Q}^t)^T \mathbf{y} = \mathbf{R}^t \mathbf{x}^t + \tilde{\mathbf{n}}^t \tag{8}$$

The noise vector $\tilde{\mathbf{n}}^t$ still has independent components with equal variances. From the above expressions, the minimization problem (2) translates into the real domain as:

$$\hat{\mathbf{x}}^t = \arg\min_{\mathbf{x}} \left\| \tilde{\mathbf{y}}^t - \mathbf{R}^t \mathbf{x}^t \right\|^2 \tag{9}$$
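The closed form (7) can be checked numerically. The following is a minimal sketch, assuming NumPy and a random 2×2 channel; the helper names (`real_columns`, `closed_form_R`, `permuted`) are illustrative and do not come from the paper. It builds the real-domain columns of eqs. (4)-(5), evaluates (7) from the shared scalar products, and verifies that $(\mathbf{R}^t)^T \mathbf{R}^t$ reproduces the Gram matrix of the permuted channel, which any valid triangular factor of (6) must do. The column ordering used for $t = 1$ (reference layer placed last) is an assumption inferred from the index rule $i = 3, 1$, $j = 1, 3$.

```python
import numpy as np

def real_columns(H):
    """Real-domain columns h_1..h_2T of eqs. (4)-(5) for a complex R x T matrix H."""
    R, T = H.shape
    cols = []
    for k in range(T):
        odd = np.empty(2 * R)
        even = np.empty(2 * R)
        odd[0::2], odd[1::2] = H[:, k].real, H[:, k].imag      # h_{2k-1}
        even[0::2], even[1::2] = -H[:, k].imag, H[:, k].real   # h_{2k}
        cols += [odd, even]
    return np.stack(cols, axis=1)                              # shape (2R, 2T)

def closed_form_R(Hre, t):
    """R^t of eq. (7) for T = 2; t = 1 or 2 is the reference layer."""
    h1, h3, h4 = Hre[:, 0], Hre[:, 2], Hre[:, 3]
    sig2 = {1: h1 @ h1, 3: h3 @ h3}            # sigma_1^2 and sigma_3^2
    s13, s14 = h1 @ h3, h1 @ h4                # shared unnormalized scalar products
    i, j = (3, 1) if t == 1 else (1, 3)
    sig_i = np.sqrt(sig2[i])
    # s_{2,3} = -s_{1,4}, so (s_{2,3})^2 can be replaced by (s_{1,4})^2.
    beta_j = np.sqrt(sig2[j] - (s13**2 + s14**2) / sig2[i])
    a, b = s13 / sig_i, s14 / sig_i
    sgn = (-1.0) ** t
    return np.array([[sig_i, 0.0, a, sgn * b],
                     [0.0, sig_i, -sgn * b, a],
                     [0.0, 0.0, beta_j, 0.0],
                     [0.0, 0.0, 0.0, beta_j]])

def permuted(Hre, t):
    """Assumed column order of H_re^t for T = 2: the reference layer t goes last."""
    order = [2, 3, 0, 1] if t == 1 else [0, 1, 2, 3]
    return Hre[:, order]

rng = np.random.default_rng(0)
H = (rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))) / np.sqrt(2)
Hre = real_columns(H)
for t in (1, 2):
    Rt, Ht = closed_form_R(Hre, t), permuted(Hre, t)
    # A valid triangular factor must reproduce the Gram matrix of H_re^t.
    assert np.allclose(Rt.T @ Rt, Ht.T @ Ht)
```

Only $\sigma_1^2$, $\sigma_3^2$, $s_{1,3}$ and $s_{1,4}$ are computed from the channel, and both $\mathbf{R}^1$ and $\mathbf{R}^2$ are derived from them, which illustrates the sharing of scalar products mentioned above.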

B. Demodulation and bit SO generation

After the preprocessing is done, the ML demodulation (9) can be solved with much lower complexity than the unfeasible exhaustive ML search (2), thanks to the properties of the matrix $\mathbf{R}^t$. According to the so-called "max-log" approximation, the log-likelihood ratio (LLR) of the bit $b_{T,k}$ can be expressed as [9]:

$$L(b_{T,k} \mid \tilde{\mathbf{y}}^t) = \min_{\{\tilde{x}_{2T-1},\tilde{x}_{2T}\} \in S(k)_T^-} D_{ED}^t[\hat{\mathbf{x}}(\tilde{x}_{2T-1}, \tilde{x}_{2T})] - \min_{\{\tilde{x}_{2T-1},\tilde{x}_{2T}\} \in S(k)_T^+} D_{ED}^t[\hat{\mathbf{x}}(\tilde{x}_{2T-1}, \tilde{x}_{2T})] \tag{10}$$

where $D_{ED}^t$ is the Euclidean distance (ED) metric:

$$D_{ED}^t(\mathbf{x}) = \left\| \tilde{\mathbf{y}}^t - \mathbf{R}^t \mathbf{x} \right\|^2 \tag{11}$$

In eq. (10), the following notation is used: $M_c$-bit transmitted symbols belong to an $M^2$-QAM complex constellation; $\hat{\mathbf{x}}(\tilde{x}_{2T-1}, \tilde{x}_{2T})$ denotes the sequence obtained by grouping a candidate value $(\tilde{x}_{2T-1}, \tilde{x}_{2T})$ of the I and Q pair of the reference layer complex symbol $X_T$ and the $(2T - 2)$ I and Q estimates of the $T - 1$ non-reference layer symbols, determined through spatial Decision Feedback Equalization (DFE) starting from such candidate value; $b_{T,k}$ are the bits mapped onto $X_T$, with bit index $k = 1, \ldots, M_c$; $S(k)_T^+$ and $S(k)_T^-$ represent the sets of symbols of the reference layer having $b_{T,k} = 1$ and $b_{T,k} = 0$, respectively.

It should be recalled that the LORD demodulation method considers all the constellation symbols as candidates for each reference layer, or transmit antenna, and then minimizes the ED metrics over the sequences $\mathbf{X}$ wherein a given bit value is "1" (or "0"). This will be referred to as "full candidate search" (FCS) in the remainder of this paper, as opposed to the "low candidate search" (LCS) variant, which results in negligible performance degradation [10] and is recalled below.

C. Reduced complexity SO generation for OFDM systems

If all the OFDM sub-carriers are demodulated employing the same number of clock cycles $N_{cycles}$, then the time $T_c$ employed to demodulate one OFDM symbol is given by:

$$T_c = N_{cycles} \cdot N_{DC} / f_{clk} \le L_c \tag{12}$$

where $f_{clk}$ is the clock frequency, $N_{DC}$ is the number of data carriers per OFDM symbol and $L_c$ is the available decoding time per OFDM symbol. The key point of the LCS method is to let the detector employ a variable number of clock cycles to demodulate different OFDM tones, while still satisfying (12) on average over the $N_{DC}$ data carriers of an OFDM symbol. In other words, the constraint to be respected is $T_c = N_{c,avg} \cdot N_{DC} / f_{clk} \le L_c$, where $N_{c,avg}$ is the average number of clock cycles available per OFDM tone.

A good trade-off between hardware complexity and performance degradation can be obtained as follows. Divide the OFDM sub-carriers into two groups: $N_L$ sub-carriers that are demodulated by LCS, searching $n^2 < M^2$ QAM symbols (per antenna), and $N_H$ sub-carriers that are demodulated by FCS. In the remainder of the paper, these two groups will be briefly referred to as "best" and "worst" sub-carriers, respectively; details on how to identify the $N_L$ and $N_H$ sub-carriers are discussed in Sec. IV-D.

It should finally be noted that LLRs are not optimal (in the max-log sense) for reduced-size QAM sub-sets of cardinality $C = n^2 < M^2$. In this case it may be desirable, in order to improve the performance, to consider an enlarged set of sequences $\mathbf{X}$ for the computation of (10). A preferred way to do so will be referred to as "cross demapping" (CD) in the remainder of this paper [11]. CD means considering also the other sets $S^j$ with $j \neq t$ when computing the bit LLRs relative to $X_t$. This means that for antenna $t$ and a given QAM symbol candidate for the reference layer, $X_t = \hat{X}$, the ED metrics (11) have to be minimized over the enlarged set $\dot{S}^t(\hat{X})$:

$$\dot{S}^t(\hat{X}) = \left\{ \arg\min_{\mathbf{X} \in S^t(\hat{X})\ \text{or}\ \mathbf{X} \in S^{j \neq t} : X_T = \hat{X}} D_{ED}^t,\ \forall \hat{X} \in S \right\} \tag{13}$$

Fig. 1. LORD architecture (blocks CHU, YTU and DU; inputs H, Y; output LLR)

Fig. 2. Preprocessing architecture (CHU) for T = 2

where $S^t(\hat{X})$ denotes the sequences obtained by grouping a value $X_T = \bar{X}$ for the reference layer and the symbol estimates obtained for the other layers, e.g. through spatial DFE.

IV. LORD VLSI ARCHITECTURE DESIGN

The hardware architecture of the near-ML MIMO detector LORD is given in Fig. 1, where the following units are shown:
• a channel estimates processing unit (CHU), which calculates the $T$ QRDs of $\mathbf{H}_{re}^t$, eq. (6);
• a received vector processing unit (YTU), which computes the $T$ processed received vectors $\tilde{\mathbf{y}}^t$, eq. (8);
• a unit that performs demodulation and bit SO generation (DU) for all $T$ transmit antennas (or layers).

A. Preprocessing

Several simplifications in the preprocessing architecture (Fig. 2) are possible with specific reference to the $T = 2$ case and are described below. The architecture calculates $\tilde{\mathbf{R}}^1$, $\tilde{\mathbf{R}}^2$ (eq. (7)) as well as $1/\sigma_1$, $1/\sigma_3$ and $1/\beta_1$, $1/\beta_3$. These terms are computed once per packet for all OFDM sub-carriers and stored in dedicated RAM blocks: during the detection of the received frame, they are retrieved for the sub-carrier under evaluation. The ISQRT(A) units in the figure compute the inverse square root of the input argument A and are based on a linear interpolation approximation. This requires small look-up tables and few arithmetic cells, thus minimizing latency while keeping high precision. Each sub-carrier is processed in two clock cycles: $(\sigma_1^2, s_{1,3})$ in parallel at the first clock cycle, and $(\sigma_3^2, s_{1,4})$ at the second. Because of this scheduling, one ISQRT unit can be used to compute both $1/\sigma_1$ and $1/\sigma_3$. Overall, the chosen schedule is suitable for a fully pipelined structure: the data related to successive sub-carriers can be input once every two clock cycles, for a resulting input data rate ($r_{MQRD}$) of 1/2 sub-carrier per clock cycle.

Fig. 3. DU architecture

Fig. 4. MIMO tree traversing according to LORD

It should be noted that the entries of the matrices $\tilde{\mathbf{R}}^1$, $\tilde{\mathbf{R}}^2$ can be stored as a single word in the same single-port RAM cut, thus reducing memory logic compared to that required by two separate RAM blocks. The overall channel processing time ($\kappa_{MQRD}$) required to calculate the $T = 2$ QRDs for $N_{DC}$ data sub-carriers is then:

$$\kappa_{MQRD} = \frac{1}{r_{MQRD}} \cdot \frac{N_{DC}}{f_{clk}} \tag{14}$$
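As a quick sanity check of eq. (14), the values quoted later in Section V ($N_{DC} = 52$ data sub-carriers and $f_{clk} = 80$ MHz) can be combined with $r_{MQRD} = 1/2$. Pairing these values with eq. (14) is an assumption made here for illustration; the result appears consistent with the 1.3 µs CHU figure reported for LORD–I in Table I.

```latex
\kappa_{MQRD} = \frac{1}{r_{MQRD}} \cdot \frac{N_{DC}}{f_{clk}}
             = 2 \cdot \frac{52}{80\ \text{MHz}}
             = 1.3\ \mu\text{s}
```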

As a last remark on Fig. 2, the unit LIST is related to the LCS demodulation and bit SO generation method. It is dedicated to sub-carrier list management, since in this case a list of the $N_H$ worst OFDM sub-carriers, based on the channel fading conditions, has to be determined (see Section III-C). The unit stores the $N_H$ lowest values of

$$\tilde{R}_{2T,2T}(h) = \min_{t=1,2} \tilde{R}^t_{2T,2T}(h) \tag{15}$$

for $h = 1, \ldots, N_{DC}$, together with the corresponding carrier indexes. It might be convenient to keep track of each sub-carrier status through an $N_{DC} \times 1$-bit logic array (FADINGVECT in Fig. 2), where a value of "0" or "1" may stand for a sub-carrier to be demodulated using LCS or FCS, respectively. It should be noted that the throughput $\kappa_{MQRD}$ of the unit is not affected by the sub-carrier selection architecture.

B. Constellation sweeping and SO generation

As previously said, the DU performs demodulation and bit SO generation. The steps performed by the DU for every reference layer are:
1) compute a set of ED metrics;
2) find the minimum ED of the partitioned constellation (where the constellation partition depends upon the modulation order and the evaluated bit) and compute the LLRs for every bit.

Fig. 3 illustrates the architecture of the DU: it is characterized by a parallelism level of $T$, wherein each DU$_t$ includes a lattice search unit (LSU), an optional cross-demapping unit (CDU) and a bit demapping unit (DMU). The role of the LSU is to perform the "constellation sweeping", i.e. a procedure consisting of:
1) selecting a set of candidate complex symbols of the reference layer belonging to an input (QAM) constellation;
2) computing the remaining symbol estimates through spatial DFE operation;
3) for each determined sequence of transmit symbol estimates, computing the ED metrics.
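The three LSU steps above can be sketched in software for the $T = 2$ case. This is a behavioural illustration only, not the hardware data path: it assumes NumPy, a 16-QAM constellation with unnormalized PAM levels, a generic `numpy.linalg.qr` factorization in place of the closed form (7), and the identity column order (layer 2 as reference). All function names are hypothetical. The final check confirms that the minimum ED found by the sweep equals the exhaustive minimum of eq. (9), at a cost of $M^2$ instead of $M^{2T}$ metric evaluations.

```python
import itertools
import numpy as np

PAM = np.array([-3.0, -1.0, 1.0, 3.0])   # M = 4 levels per axis (16-QAM), unnormalized

def real_channel(H):
    """Real-domain channel with interleaved I/Q ordering, eqs. (4)-(5)."""
    J = np.array([[0.0, -1.0], [1.0, 0.0]])
    return np.kron(H.real, np.eye(2)) + np.kron(H.imag, J)

def slice_pam(v):
    """Nearest PAM level (hard decision on one real dimension)."""
    return PAM[np.argmin(np.abs(PAM - v))]

def lsu_sweep(y_t, R):
    """ED of eq. (11) for every reference-layer candidate (x3, x4), T = 2.

    The reference layer occupies the last two real positions; the other
    layer (x1, x2) is estimated by spatial DFE from each candidate.
    Returns a dict {(x3, x4): ED}.
    """
    ed = {}
    for x3, x4 in itertools.product(PAM, PAM):         # step 1: candidates
        ref = np.array([x3, x4])
        # Step 2: DFE by back-substitution and slicing.
        x2 = slice_pam((y_t[1] - R[1, 2:] @ ref) / R[1, 1])
        x1 = slice_pam((y_t[0] - R[0, 1] * x2 - R[0, 2:] @ ref) / R[0, 0])
        x = np.array([x1, x2, x3, x4])
        ed[(x3, x4)] = float(np.sum((y_t - R @ x) ** 2))   # step 3: ED metric
    return ed

def exhaustive_min(y_t, R):
    """Brute-force minimum of eq. (9) over all M^(2T) real sequences."""
    return min(float(np.sum((y_t - R @ np.array(x)) ** 2))
               for x in itertools.product(PAM, repeat=4))

rng = np.random.default_rng(1)
H = (rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))) / np.sqrt(2)
Hre = real_channel(H)
y = Hre @ rng.choice(PAM, size=4) + 0.3 * rng.standard_normal(4)
Q, R = np.linalg.qr(Hre)                  # stand-in for the GSO-based QRD
y_t = Q.T @ y                             # eq. (8)
ed = lsu_sweep(y_t, R)
assert np.isclose(min(ed.values()), exhaustive_min(y_t, R))
```

The DFE step is exact here because, for each reference candidate, the two remaining real dimensions decouple ($R_{1,2} = 0$ up to rounding), so each can be sliced independently.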

Fig. 5. LORD–I constellation sweeping

C. VLSI design LORD–I

Fig. 4 graphically represents the computation of the ED for a 2×R transmission scheme and a generic constellation of size $M^2$ as a tree traversal. In the given architecture, $T$ parallel LSUs, one for each reference layer, are instantiated within the DU: each LSU includes $N_{ED}$ ED units (EDUs), wherein each EDU computes a single ED term of eq. (11). The ED is the result of the summation of $T$ partial Euclidean distances (PEDs), and the PED is defined as the summation of the two independent squares related to the I and Q of a given complex symbol. The architecture of the EDU is a systolic version of Fig. 4, where a PED unit (PEDU) at every clock cycle computes a PED term which is passed on in a forward-pipeline manner to the subsequent PEDU; thus each EDU includes $T$ PEDUs. The LSU architecture implements the constellation sweeping method of [9], where $M^2$ EDs per antenna are to be computed to demodulate $M^2$-QAM constellation symbols. If a fixed number of clock cycles ($N_{cycles}$) is employed to demodulate each sub-carrier, the number $N_{ED}$ can be set based on the timing equation (12) for the largest constellation to be supported. Given $N_{cycles}$, one has

$$N_{ED} = \lfloor M^2 / N_{cycles} \rfloor \tag{16}$$

For a regular and simple data flow, $N_{ED}$ has been chosen to be an integer sub-multiple of $M^2$. As an example, for $M^2 = 64$, $f_{clk} = 80$ MHz and $L_c = 4$ µs, then $N_{ED} = 16$: the corresponding constellation sweeping procedure is shown in Fig. 5.
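The example above can be reproduced with a few lines of arithmetic. The sketch below is one possible reading of the sizing rule, not the authors' procedure: it assumes $N_{DC} = 52$ (the value quoted in Section V), takes $L_c$ in microseconds and $f_{clk}$ in MHz as integers so that their product is an exact cycle count, and picks the smallest divisor of $M^2$ whose sweep length fits the cycle budget of eq. (12). The function names are hypothetical.

```python
def max_cycles_per_subcarrier(l_c_us, f_clk_mhz, n_dc):
    """Largest integer N_cycles with N_cycles * N_DC / f_clk <= L_c, eq. (12).

    l_c_us (microseconds) times f_clk_mhz (MHz) gives the number of clock
    cycles available per OFDM symbol; integer inputs are assumed.
    """
    return (l_c_us * f_clk_mhz) // n_dc

def choose_n_ed(m2, n_cycles_max):
    """Smallest divisor of M^2 whose sweep length M^2 / N_ED fits in n_cycles_max."""
    for n_ed in range(1, m2 + 1):
        if m2 % n_ed == 0 and m2 // n_ed <= n_cycles_max:
            return n_ed
    raise ValueError("no feasible N_ED for this cycle budget")

n_max = max_cycles_per_subcarrier(l_c_us=4, f_clk_mhz=80, n_dc=52)  # 320 // 52 = 6
n_ed = choose_n_ed(64, n_max)                                        # 16 EDUs
n_cycles = 64 // n_ed                                                # 4 cycles per sub-carrier
```

With a budget of 6 cycles, 8 EDUs would need 8 cycles and do not fit, so the next divisor of 64, $N_{ED} = 16$, is selected, giving $N_{cycles} = 4$.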

The proposed flow is to divide the EDUs into two subsets of $N_{ED}/2$ units (the gray rectangles in the figure), spanning the positive and negative Q semi-axes as indicated by the arrows on the left. At each clock cycle, $N_{ED}/2$ ED metrics are computed for the same positive Q value and all I PAM values; similarly, at the same clock cycle, $N_{ED}/2$ ED metrics are computed for a negative Q value.

LLR generation (eq. (10)), performed by the DMU, requires computing the minimum of the ED metrics over $S(k)_t^+$ and $S(k)_t^-$, the sets of symbols of the reference layer having bits $b_{t,k} = 1$ and $b_{t,k} = 0$, respectively, for $t = 1, \ldots, T$ and $k = 1, \ldots, M_c$. As the sets depend on the demapping rule and on the bit position within the symbol, the related hardware structure is not straightforward. The DMU architecture solves this issue with a two-step process. First, symbol demapping is performed for all constellation symbols, and the ED metrics are minimized and stored in corresponding registers as a function of the associated PAM value of the reference layer (two independent minimizations, one for I and one for Q, for a total of $2M$ registers). The symbol demapping is shown as an example in Fig. 5 for $N_{ED} = 16$ and $M^2 = 64$: during the constellation sweeping, at each clock cycle the registers store the minimum ED over the set of possible I and Q values separately, as the sweeping is performed by evaluating $M$ ED metrics for as many I candidate values and a constant Q in the two directions. Lastly, once the minimum EDs for each PAM element have been found, the DMU of Fig. 3 performs bit demapping according to a given input mapping (and corresponding demapping) rule. The LLR corresponding to a given bit of the I (or Q) component is determined by performing a further minimization over the $M/2$ values stored in the corresponding registers.
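The two-step DMU process can be illustrated with a small software model. The sketch assumes NumPy, a 64-QAM constellation whose bits map separately onto the I and Q axes, and a generic binary-reflected Gray labelling of the PAM indices; the labelling is an assumption for illustration and is not taken from the paper or from any specific standard. The model is compared against a direct evaluation of eq. (10) over all $M^2$ symbols.

```python
import numpy as np

M = 8                                        # PAM levels per axis (64-QAM)
BITS = 3                                     # bits per axis, so Mc = 2 * BITS
GRAY = [i ^ (i >> 1) for i in range(M)]      # assumed Gray label of PAM index i

def dmu_llrs(ed):
    """Two-step demapping of an M x M array ed[i_idx, q_idx] of ED metrics.

    Step 1: per-PAM minima (2M registers in total).
    Step 2: per-bit minimization over the M/2 registers with bit = 0 and
    the M/2 registers with bit = 1, following the sign convention of eq. (10).
    Returns 2 * BITS LLRs: I-axis bits first, then Q-axis bits.
    """
    reg_i = ed.min(axis=1)                   # minimum over Q for each I value
    reg_q = ed.min(axis=0)                   # minimum over I for each Q value
    llrs = []
    for reg in (reg_i, reg_q):
        for k in range(BITS):
            bit = np.array([(g >> k) & 1 for g in GRAY])
            llrs.append(reg[bit == 0].min() - reg[bit == 1].min())
    return np.array(llrs)

def brute_force_llrs(ed):
    """Direct evaluation of eq. (10) over all M^2 symbols, for comparison."""
    llrs = []
    for axis in (0, 1):
        for k in range(BITS):
            best = [np.inf, np.inf]          # best[b]: minimum ED with bit value b
            for i in range(M):
                for q in range(M):
                    b = (GRAY[(i, q)[axis]] >> k) & 1
                    best[b] = min(best[b], ed[i, q])
            llrs.append(best[0] - best[1])
    return np.array(llrs)

ed = np.random.default_rng(2).random((M, M))  # stand-in for the swept ED metrics
assert np.allclose(dmu_llrs(ed), brute_force_llrs(ed))
```

The equivalence holds because each bit depends on one axis only, so the minimum over a bit-constrained symbol set factors into a minimum over the other axis (the register contents) followed by a minimum over $M/2$ registers.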
It should be noted that the most critical part of the DU, and indeed of the whole detector architecture, in terms of hardware complexity is the computation of the set of ED metrics. This is due to the fact that the ED is obtained as the summation of $2T$ squares, and for a given number $M^2$ of EDs to be computed per layer, $M^2 T$ such multiplications need to be computed. This makes it particularly important to consider alternative architectures able to reduce the mentioned computational burden.

D. VLSI design LORD–II

An alternative approach for the LSU, applicable to OFDM-based systems, allows scalability of complexity versus performance and a considerable area reduction at the expense of negligible performance degradation. It consists of an LSU architecture implementing the LCS demodulation method of Section III-C, hereinafter referred to as LORD–II. According to LCS, in order to demodulate the best-case sub-carriers, $n^2$ QAM symbols need to be evaluated for every transmit antenna: the architecture then evaluates $n$ parallel EDs (per antenna) at each clock cycle, requiring $N_{cycles} = N_{c,L} = n$ clock cycles to span the whole sub-set and demodulate the best-case sub-carriers. If $N_{c,H}$ is the number of clock cycles required by the worst-case sub-carriers, then it can be shown that the number $N_H$ of worst-case sub-carriers to be selected is given by:

$$N_H = N_{DC}\, \frac{N_{c,avg} - n}{N_{c,H} - n} \tag{17}$$

Fig. 6. LORD–II constellation sweeping ($N_{ED} = 5$): (a) reduced and (b) full search case

LCS is shown in Fig. 6a. For a given $n$, $N_{ED} \ge n$ EDUs are instantiated. They work in parallel and process symbols along rows, one row per clock cycle. Then, $n$ clock cycles are required to span $n^2$ points. When FCS has to be performed for the $M = 8$ (i.e. 64-QAM) case and $N_{ED} < M$ EDUs are instantiated, a regular constellation sweeping process is not possible; an efficient option (minimizing the number of clock cycles required to complete the process) is illustrated in Fig. 6b for $N_{ED} = 5$. The process can be divided into three phases, identified by the arrows showing the direction of constellation sweeping and by the related subsets of symbols having different colors in the figure:
1) The constellation is spanned starting from the top left corner for $M$ clock cycles along the Q axis, i.e. at each clock cycle a constant PAM value of the Q component and $N_{ED}$ different PAM values of the I component are provided to the EDUs.
2) Then, the top right square is processed, i.e. sweeping occurs along the I and Q axes in reverse order compared to the former step. The duration of this phase is $(M - N_{ED})$ clock cycles.
3) Finally, the bottom right corner of the constellation is processed, for a total of $(M - N_{ED})$ clock cycles.

The whole constellation sweeping duration for the given architecture is $N_{c,H} = 3M - 2N_{ED}$ clock cycles. Other constellation sweeping techniques are possible in principle, but the proposed one allows trading hardware area for performance: increasing $N_{ED}$ saves clock cycles, which can be employed to perform FCS on more sub-carriers.

As previously stated, the LLR reliability can be improved by extending the number of candidate transmit sequences, see eq. (13). The drawback is the introduction of interdependence between the $T$ minimizations performed by the LSUs in order to compute the LLRs of the bits corresponding to the $T$ symbols transmitted by the related transmit antennas. The related operations are included in the CDU of Fig. 3. The operations performed by the unit refer to eq. (13), meaning that a given LSU, when computing the EDs over a given set of candidate symbols for a reference layer, keeps track of the minimum ED value found also as a function of the estimated values for the non-reference layers. Such estimates are not known a priori in general and need to be determined at run time, for example through spatial DFE starting from the candidate value of the reference layer [9]. From the above description, it should be noted that the DU architecture can be straightforwardly extended from $T = 2$ to a higher number of transmit antennas, with an approach similar to that of [8].

V. DESIGN HARDWARE COMPLEXITY AND COMPARISON

Table I summarizes the synthesis results for STMicroelectronics 65 nm low power CMOS technology, with $N_{ED} = 16$ for LORD–I and $N_{ED} = 5$ for LORD–II. The RAM requirement is related to the channel preprocessing (i.e. the terms of the triangular matrices) for all OFDM sub-carriers. The throughput of the CHU refers to the case where $N_{DC} = 52$.

TABLE I
LORD 2×2 HARDWARE RESULTS

Design    Area CHU [mm²]  Area DU [mm²]  Throughput CHU  Throughput DU  Ram     Power
LORD–I    0.08            0.64           1.3 µs          240 Mbps       1.3 kB  38 mW
LORD–II   0.10            0.21           1.4 µs          164 Mbps       1.5 kB  14 mW

TABLE II
DESIGN COMPARISON (SO STAGES WITHOUT QRD)

                [5]             [6]          [7]          [8]          This work I   This work II
Tech [nm]       65              130          65           45           65            65
Area [mm²]      0.24            1.07         n.a.         n.a.         0.64          0.21
Area [kGate]    174             97           576          70           408           135
Clock [MHz]     200             200          400          500          80            80
Rate [Mbps]     115 (variable)  107 (fixed)  533 (fixed)  188 (fixed)  240 (fixed)   164 (fixed)
T×R             4×4             4×4          4×4          4×4          2×2           2×2
QAM             64              16           16           4–64         4–64          4–64
Power [mW]      11              n.a.         114          n.a.         38            14

The detection throughput $\theta$ is calculated as:

$$\theta = T \cdot \frac{M_c \cdot f_{clk}}{N_{c,avg}} \tag{18}$$

where $\theta$ is measured in Mbps and $N_{c,avg}$ is the average number of clock cycles required to detect a sub-carrier. For the 64-QAM case with $f_{clk} = 80$ MHz and $L_c = 4$ µs, $N_{c,avg} = 4$ for LORD–I, eq. (16); for LORD–II, having chosen $N_{ED} = 5$, then $N_H = 5$ and $N_{c,avg} = 5.86$, eq. (17). The power consumption results are obtained from post-synthesis simulations for the 64-QAM 2×2 configuration at 80 MHz clock frequency. Table II compares our designs with state-of-the-art SO MIMO detectors. Our synthesis results refer to the DU only and have been obtained for a 2×R antenna configuration, so a direct and fair comparison is not possible.
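The figures quoted above can be cross-checked with a short script. The sketch below models the three-phase full-search sweep of Section IV-D as sets of constellation indices per clock cycle (the mapping of index ranges to the regions of Fig. 6b is an interpretation, and it assumes $M/2 \le N_{ED} < M$), counts the resulting cycles, and then evaluates the average cycle count and eq. (18). The reduced search size $n = 5$ is an assumption: it is not stated explicitly in the text but reproduces the quoted $N_{c,avg} \approx 5.86$. Function names are hypothetical.

```python
def full_sweep_schedule(m, n_ed):
    """Three-phase full-search sweep as a list of (I index, Q index) sets per cycle.

    Assumes m / 2 <= n_ed < m so that the leftover columns fit in n_ed rows.
    """
    cycles = []
    # Phase 1: one Q row per cycle, the first n_ed I values (m cycles).
    for q in range(m):
        cycles.append({(i, q) for i in range(n_ed)})
    # Phase 2: one remaining I column per cycle, n_ed Q values (m - n_ed cycles).
    for i in range(n_ed, m):
        cycles.append({(i, q) for q in range(n_ed)})
    # Phase 3: leftover corner, one row per cycle (m - n_ed cycles).
    for q in range(n_ed, m):
        cycles.append({(i, q) for i in range(n_ed, m)})
    return cycles

def average_cycles(n_dc, n_h, n_c_h, n):
    """Average cycles per sub-carrier: n_h full-search and n_dc - n_h reduced ones."""
    return (n_h * n_c_h + (n_dc - n_h) * n) / n_dc

def throughput_mbps(t, m_c, f_clk_mhz, n_c_avg):
    """Detection throughput of eq. (18), in Mbps for f_clk in MHz."""
    return t * m_c * f_clk_mhz / n_c_avg

sched = full_sweep_schedule(m=8, n_ed=5)
assert all(len(c) <= 5 for c in sched)        # never more than N_ED EDs per cycle
assert set().union(*sched) == {(i, q) for i in range(8) for q in range(8)}
n_c_h = len(sched)                            # 3M - 2*N_ED = 14 cycles
n_c_avg = average_cycles(n_dc=52, n_h=5, n_c_h=n_c_h, n=5)   # 305 / 52, about 5.865
lord_i = throughput_mbps(2, 6, 80, 4)         # 240.0 Mbps
lord_ii = throughput_mbps(2, 6, 80, n_c_avg)  # about 163.7 Mbps
```

Under these assumptions the script gives 240 Mbps for LORD–I and roughly 164 Mbps for LORD–II, in line with the DU throughput entries of Table I.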

The design of [5] supports high modulation orders such as 64-QAM, but its throughput is SNR dependent and variable; at a clock frequency of 200 MHz it does not meet the 2×2 requirements even though it is tailored to 4×4. Both [6] and [7] guarantee a fixed throughput, but the VLSI complexity associated with higher modulation orders such as 64-QAM is not given. The architecture of [8] supports high modulation orders with fixed throughput; however, due to its serial approach to the calculation of the EDs, a high clock frequency (i.e. 500 MHz) is needed, resulting in high power consumption. We underline that our designs support the highest 802.11n data rate (i.e. 130 Mbps for $T = 2$, 64-QAM, code rate 5/6 and 20 MHz bandwidth) with a deterministic latency and at a very low clock frequency, leading to high throughput and low power consumption, as desirable for wireless hand-held devices.

VI. CONCLUSION

We propose two new SO MIMO detector architectures based on LORD and its reduced-complexity variant, respectively. Both designs achieve a very high throughput, satisfying the most stringent 802.11n requirements. The SO units are characterized by a customizable degree of parallelism, are parameterizable in terms of the number of transmit and receive antennas, and are flexible regarding the supported modulation order. The resulting throughput is deterministic, and QAM constellation orders of up to 64-QAM are supported at the expense of reasonable area and power consumption.

REFERENCES

[1] A. Burg, M. Borgmann, M. Wenk, M. Zellweger, W. Fichtner, and H. Bölcskei, "VLSI implementation of MIMO detection using the sphere decoding algorithm," IEEE Journal of Solid-State Circuits, vol. 40, no. 3, July 2005.
[2] D. Wu, E. Larsson, and D. Liu, "Implementation aspects of fixed-complexity soft-output MIMO detection," in Proc. IEEE VTC, 2009.
[3] C. Studer, A. Burg, and H. Bölcskei, "Soft-output sphere decoding: Algorithms and VLSI implementation," IEEE Journal on Selected Areas in Communications, vol. 26, no. 2, pp. 290–300, Feb. 2008.
[4] A. Stephens et al., "Draft amendment to [...] -part 11: Wireless LAN medium access control (MAC) and physical layer (PHY) specifications: Enhancements for higher throughput," IEEE P802.11n™/D2.0, 2008.
[5] S. Chen and T. Zhang, "Low power soft-output signal detector design for wireless MIMO communication systems." New York, NY, USA: ACM, 2007, pp. 232–237.
[6] Z. Guo and P. Nilsson, "Algorithm and implementation of the K-best sphere decoding for MIMO detection," IEEE Journal on Selected Areas in Communications, vol. 24, no. 3, pp. 491–503, 2006.
[7] Y. Sun and J. R. Cavallaro, "High throughput VLSI architecture for soft-output MIMO detection based on a greedy graph algorithm," in GLSVLSI '09: Proceedings of the 19th ACM Great Lakes Symposium on VLSI. New York, NY, USA: ACM, 2009, pp. 445–450.
[8] P. Bhagawat, R. Dash, and G. Choi, "Dynamically reconfigurable soft output MIMO detector," in ICCD, 2008, pp. 68–73.
[9] M. Siti and M. Fitz, "A novel soft-output layered orthogonal lattice detector for multiple antenna communications," in IEEE International Conference on Communications, 2006. ICC'06, vol. 4, 2006.
[10] A. Tomasoni, M. Ferrari, S. Bellini, M. Siti, and T. Cupaiuolo, "A hardware oriented, low-complexity LORD MIMO detector," submitted to ICC 2010.
[11] A. Tomasoni, M. Siti, M. Ferrari, and S. Bellini, "Turbo-LORD: A MAP-approaching soft-input soft-output detector for iterative MIMO receivers," in GLOBECOM, 2007, pp. 3504–3508.
