
This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the ICC 2008 proceedings.

A Flexible VLSI Architecture for Extracting Diversity and Spatial Multiplexing Gains in MIMO Channels

Chia-Hsiang Yang, Student Member, IEEE, Dejan Marković, Member, IEEE
University of California, Los Angeles, USA

Abstract: The sphere decoding algorithm is able to approach maximum likelihood (ML) detection with significantly reduced computational complexity for multi-input multi-output (MIMO) communications. The computational reduction makes it attractive for hardware implementation. This paper presents a unified sphere decoder architecture that exploits the diversity-multiplexing tradeoff in MIMO channels by taking advantage of the flexibility in the number of antennas and modulation schemes. Several signal processing and circuit techniques are combined to reduce the hardware complexity: a 20× area reduction is achieved, even without interleaving of sub-carriers, compared to the direct-mapped architecture. The proposed flexible architecture supports antenna arrays from 2×2 to 16×16, modulations from BPSK to 64-QAM, and 16 to 128 sub-carriers. The peak estimated data rate exceeds 1.5 Gbps over a 16 MHz bandwidth in just 0.55 mm² in a standard 90 nm CMOS process.

I. INTRODUCTION

MIMO communication has recently received significant attention due to its potential to increase link robustness and channel capacity. Among various MIMO algorithms, sphere decoding is one of the most promising solutions. It approximates the information theoretic bound, set by the ML detection [1], with several orders of magnitude lower computational complexity. This means that, for a given hardware cost, the reduced complexity could be utilized to increase the size of antenna array and effectively improve the performance beyond the ML performance of a system with smaller array size.

The sphere decoding algorithm is a multi-dimensional signal processing task dealing with vector and matrix arithmetic. The required computation could involve hundreds of add and multiply operations, and may also need divide and trigonometric functions. Such a high complexity limits the system specifications such as antenna array size (up to 8×8) and modulations (up to 16-QAM) [2]-[6]. In addition, prior work focused only on solutions with fixed number of antennas and fixed modulation. The flexibility of supporting multiple search methods and multiple sub-carriers was not explored either.

In this paper, we present an architecture that simplifies sphere decoding implementation by jointly considering tradeoffs at the algorithm, architecture, and circuit layers of abstraction. A number of signal processing and circuit techniques are employed to support flexibility and scalability of search method (K-best and depth-first), antenna array size (up to 16×16), number of sub-carriers (up to 128), and modulation scheme (up to 64-QAM). Essentially, the proposed architecture can extract the diversity and spatial multiplexing gains available in MIMO wireless channels [7].

This paper is organized as follows. Section II reviews the sphere decoding algorithm and the fundamental diversity-multiplexing tradeoff in MIMO communications. Requirements for architecture flexibility are analyzed in Section III. Flexible architecture that optimally combines signal processing and circuit techniques is proposed in Section IV. Section V discusses the Matlab-Simulink design environment and hardware emulation results. Section VI concludes the paper.

II. ALGORITHM SPACE EXPLORATION

A MIMO system can improve the reliability of a wireless link through increased diversity or improve the channel capacity through spatial multiplexing. Diversity gain and spatial multiplexing gain are related to system coverage range and data rate, respectively. Both gains can be improved using a larger antenna array. However, there is a fundamental tradeoff between these two gains [7]. In the diversity-multiplexing space, repetition code, Alamouti code, and space-time code use data redundancy to increase diversity at the cost of reducing spatial multiplexing gain. In contrast, Bell Labs Layered Space Time (BLAST) algorithm, Singular Value Decomposition (SVD), and QR decomposition allocate data-streams in different eigen-modes to maximize spatial multiplexing gain while sacrificing diversity gain, as shown in Fig. 1. Sphere decoding is a decoding scheme that can extract both diversity and multiplexing gains. With flexibility in coding and modulation, sphere decoder can effectively explore the entire tradeoff curve as shown in Fig. 1. The original data type for sphere decoding is uncoded data. By manipulation of input data, sphere decoding is capable of decoding space-time block codes (STBC), which improves the error probability and thus increases diversity gain.
The data rate can be maximized by transmitting different modulations over different MIMO substreams to increase spatial multiplexing gain. Antenna array size provides an added flexibility to shift the tradeoff curve in the diversity-multiplexing space, Fig. 1. A unified sphere decoder model is discussed in the following section.

A. Sphere Decoding Algorithm

Consider a multiple antenna system with M transmit antennas and N receive antennas. The received vector y can be represented by


$$\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{n}, \qquad (1)$$

where y is an N×1 vector of received symbols, and H denotes an N×M channel matrix whose elements are i.i.d. complex Gaussian with zero mean and unit variance. Vectors s and n (M×1 and N×1, respectively) represent the transmitted symbols and zero mean, circularly symmetric white Gaussian noise, respectively. The transmitted vector with the smallest Euclidean distance is selected as the ML estimate in Eq. (2).

$$\hat{\mathbf{s}} = \arg\min_{\mathbf{s} \in \mathcal{Q}} \|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2 \qquad (2)$$

The channel matrix can be further decomposed using QR factorization; the ML estimate thus can be represented by two equivalent forms, Eqs. (3), (4).

$$\hat{\mathbf{s}}_{ML} = \arg\min_{\mathbf{s} \in \mathcal{Q}} \|\mathbf{R}(\mathbf{s} - \mathbf{s}_{ZF})\|^2 \qquad (3)$$

$$\hat{\mathbf{s}}_{ML} = \arg\min_{\mathbf{s} \in \mathcal{Q}} \|\mathbf{Q}^H\mathbf{y} - \mathbf{R}\mathbf{s}\|^2 \qquad (4)$$

where $\mathbf{Q}$ is a unitary matrix, $\mathbf{R}$ is an upper triangular matrix, and $\mathbf{s}_{ZF} = (\mathbf{H}^H\mathbf{H})^{-1}\mathbf{H}^H\mathbf{y}$ is the zero-forcing (unconstrained ML) estimate. Note that the diagonal elements of the R matrix are real. Using the upper triangular nature of R, the symbol decoding begins from the last row and occurs in several steps. The decoded symbols are used for successive decoding steps until all symbols are decoded.

This decoding algorithm can be mapped to finding a shortest path (with minimum Euclidean distance) in a tree topology: one possible constellation point denotes one node, and each row of the R matrix is mapped to one level of the tree, whose edges are weighted by channel coefficients. The whole solution space of this tree is equivalent to exhaustive search in the trellis diagram of the original problem; the total number of combinations of transmitted symbols is $|\mathcal{Q}|^M$, where $|\mathcal{Q}|$ is the constellation size. By properly choosing a search radius and a search method, the ML solution can be approached by visiting only nodes within a hyper-sphere, rather than performing an exhaustive search. This complexity reduction is feasible, because the Euclidean distance is a cumulative sum of square terms. This means that for each node, if its Euclidean distance is larger than the search radius, the corresponding branches are outside the search radius as well.

Fig. 1. Diversity-multiplexing tradeoff in MIMO communications. [Axes: spatial multiplexing gain (rate) vs. diversity gain (range). Repetition, Alamouti, and space-time codes sit at the high-diversity end; BLAST, SVD, and QR at the high-multiplexing end; sphere decoding spans the curve between them, and the curve shifts with array size.]

B. Improving BER Performance

Several effective methods such as detection ordering, candidate enumeration, and search radius setting are applied to improve error performance and/or reduce the complexity of the basic sphere decoding algorithm [9]. Most of these methods are executed in the preprocessing stage, but candidate enumeration should be considered in the tree searcher. For each level, the order of constellation point enumeration is another important factor to improve search speed. The Schnorr-Euchner (SE) enumeration [3] suggests traversing the constellation candidates in ascending order of distance increment. Therefore, the first candidate $s_i$ for each row of Eq. (4), with $\hat{\mathbf{y}} = \mathbf{Q}^H\mathbf{y}$, is the one that minimizes the distance between $b_i$ and the scaled constellation points $R_{ii}\mathcal{Q}_i$, as in Eq. (5). Finding a good admissible solution early means that we can shrink our initial radius early [1].

$$\hat{y}_i - \sum_{j=i}^{M} R_{ij}s_j = b_i - R_{ii}s_i, \quad \text{with } b_i = \hat{y}_i - \sum_{j=i+1}^{M} R_{ij}s_j \qquad (5)$$
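The tree search with SE enumeration and radius shrinking described above can be illustrated in software. The following is a minimal sketch, not the hardware architecture of this paper: it assumes that the preprocessing stage has already produced the upper triangular matrix R and the rotated vector ŷ = Q^H y, and the function name `sphere_decode` and its arguments are hypothetical.

```python
def sphere_decode(R, y_hat, constellation, radius_sq=float("inf")):
    """Depth-first sphere decoding sketch on the triangular system of Eq. (4).

    R is an M x M upper triangular matrix (list of rows), y_hat = Q^H y.
    Returns (symbols, squared distance); symbols is None if no point lies
    inside the initial radius.
    """
    M = len(R)
    s = [0j] * M                       # current partial path (levels i..M-1 valid)
    best = {"s": None, "d": radius_sq}

    def search(i, dist):
        # b_i = y_hat_i - sum_{j>i} R_ij s_j, as in Eq. (5)
        b = y_hat[i] - sum(R[i][j] * s[j] for j in range(i + 1, M))
        # SE enumeration: candidates in ascending order of distance increment
        ordered = sorted(constellation, key=lambda c: abs(b - R[i][i] * c) ** 2)
        for c in ordered:
            d = dist + abs(b - R[i][i] * c) ** 2
            if d >= best["d"]:
                break                  # remaining candidates are even farther
            s[i] = c
            if i == 0:
                best["s"], best["d"] = list(s), d   # leaf reached: shrink radius
            else:
                search(i - 1, d)

    search(M - 1, 0.0)                 # decoding starts from the last row of R
    return best["s"], best["d"]
```

Because the candidates at each level are sorted, the first leaf reached is the successive-cancellation solution, which gives an early radius for pruning the remaining branches.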

C. Tradeoff in Diversity-Multiplexing Space

In order to maximize diversity gain, the receiver has to average over multiple independently faded replicas of the same symbol. Thus, the error probability can be reduced by sending multiple copies of the same symbol in space and/or time dimensions. Since a unified signal model can be developed for these space-time coding schemes, the same sphere decoder architecture can be used with some data rearrangement. For STBC, the ML estimate can be written as

$$\hat{\mathbf{s}} = \arg\min_{\mathbf{s} \in \mathcal{Q}} \|\mathbf{y} - \mathbf{B}\mathbf{s}\|^2 \qquad (6)$$

where matrix B depends on code generators and channel matrix. By interpreting B as H in the original signal model, Eq. (2), the sphere decoding algorithm can be applied. Since the matrix dimension is changed due to the data rearrangement in the preprocessing stage, the equivalent antenna array size will be changed accordingly. For example, repetition coding by 2 in space domain for an 8×8 system will be transformed into data processing in a 4×4 system (only one half of the symbols need to be decoded). This requirement enhances the need for flexibility in antenna array size. Spatial multiplexing gain is characterized by data rate. To maximize spatial multiplexing gain, we should allow data rate to scale with the SNR or assign different data rate to different substreams for a fixed SNR [7]. To this end, modulation scheme should be adaptive according to channel condition: a larger constellation is applied to substreams with higher SNR, and a smaller constellation is applied to substreams with lower SNR. The system performance, thus, further motivates the need for adding flexibility in modulation schemes.
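The repetition-coding example can be made concrete with a small sketch. It assumes, purely for illustration, that each symbol is repeated on `rep` adjacent transmit antennas; the paper does not specify the antenna mapping, and the function name `repetition_equivalent_channel` is hypothetical. Under that assumption, the columns of H that carry the same symbol add up, so an 8×8 channel with repetition by 2 yields a matrix B with only 4 columns, that is, 4 symbols to decode.

```python
def repetition_equivalent_channel(H, rep):
    """Collapse an N x M channel H into the N x (M/rep) matrix B of Eq. (6).

    Assumption: symbol k is sent on transmit antennas k*rep .. k*rep + rep - 1,
    so y = H s_full + n becomes y = B s + n with B[:, k] the sum of those columns.
    """
    n_rx, n_tx = len(H), len(H[0])
    if n_tx % rep != 0:
        raise ValueError("number of transmit antennas must be a multiple of rep")
    n_sym = n_tx // rep
    return [
        [sum(H[r][k * rep + t] for t in range(rep)) for k in range(n_sym)]
        for r in range(n_rx)
    ]
```

The resulting B can then be passed through the same QR preprocessing and tree search as H, which is why flexibility in the array size is needed to support such codes.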


TABLE I
QUALITATIVE COMPARISON OF DEPTH-FIRST AND K-BEST ALGORITHM

Algorithm   | Area  | Throughput | Latency | Radius Shrinking | Performance
Depth-first | small | variable   | long    | Yes              | ML
K-best      | large | constant   | short   | No               | near-ML

Fig. 2. Basic architecture of (a) depth-first (folding) and (b) K-best (parallel and multi-stage) algorithm.

Fig. 3. Impact of antenna array size on (a) area and (b) critical path delay.
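For contrast with depth-first search, the K-best method compared in Table I only moves forward through the tree and keeps a fixed number of survivor paths per level. The following is a minimal software sketch under the same assumptions as before (R and ŷ = Q^H y are already available from preprocessing); the function name `k_best_decode` and its arguments are hypothetical and do not describe the hardware of this paper.

```python
def k_best_decode(R, y_hat, constellation, K):
    """K-best (breadth-first) detection sketch on the triangular system.

    At each tree level every survivor is extended by all constellation
    points, and only the K paths with the smallest accumulated distance
    are kept. There is no trace-back and no radius shrinking.
    """
    M = len(R)
    survivors = [(0.0, [])]            # (accumulated distance, [s_{i+1}, ..., s_{M-1}])
    for i in reversed(range(M)):       # from the last row of R to the first
        expanded = []
        for dist, tail in survivors:
            # interference from already decided symbols s_{i+1} .. s_{M-1}
            b = y_hat[i] - sum(R[i][i + 1 + t] * tail[t] for t in range(len(tail)))
            for c in constellation:
                expanded.append((dist + abs(b - R[i][i] * c) ** 2, [c] + tail))
        expanded.sort(key=lambda path: path[0])
        survivors = expanded[:K]       # fixed work per level: constant throughput
    best_dist, best_s = survivors[0]
    return best_s, best_dist
```

With K equal to the full number of paths the result is the ML solution; with a small K the work per level is constant, but the true ML path may be discarded early, which is the near-ML behavior listed in Table I.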

III. ARCHITECTURE SPACE EXPLORATION

Since there are several MIMO algorithms, a common architecture does not exist for different detection schemes. For the sphere decoding algorithm, existing architectures [2]-[6] are evaluated to better define implementation requirements in terms of area and throughput for flexible architecture design. Prior work is extended to include flexibility with respect to a number of parameters: search method, antenna array size, number of sub-carriers, and modulation scheme.

A. Search Method

Two major types of tree search methods are reported in the previous work: depth-first [3][4] and K-best [2][5][6]. Major advantages of depth-first are that the ML performance can be achieved, and that radius shrinking can be used for tree pruning. On the other hand, the advantages of K-best are its uniform data path and constant throughput. In hardware implementation, depth-first is realized in a folding-like architecture because only one node is visited at a time during the tree search process, Fig. 2 (a). K-best is realized as a multistage pipeline, Fig. 2 (b), because no trace-back is needed. To process K data paths at the same time, parallel architecture is applied. Table I summarizes architecture comparison in terms of circuit metrics and algorithmic performance. If only forward trace is allowed, the BER performance is limited by the number of parallel processors such as in K-best algorithm. Even though more processing cycles are provided, there is no room to improve the BER performance for K-best algorithm. Therefore, capability of supporting forward trace and backward trace should be considered in the architecture design.

B. Antenna Array Size

For the sphere decoder operating with a large antenna array, the biggest challenge in the implementation is reducing area.
Using the number of (complex) multipliers as a first order area estimate, the number of multipliers needed in the folding and multi-stage architectures are M and M(M+1)/2, respectively, where M is the number of transmit antennas. Expanding a 4×4 system to a 16×16 system, relative area increases from 4 to 16 for the folding architecture and from 10 to 136 for the multi-stage architecture. The folding architecture is therefore 2.5× to 8.5× more area efficient compared to the multi-stage architecture, as shown in Fig. 3 (a).

The second design challenge is the operating frequency for the folded architecture. As the array size increases, the number of operands in the Multiply-Accumulate (MAC) operation in the metric calculation unit increases proportionally to the number of antennas. Assuming a tree adder topology, the critical path delay roughly increases logarithmically with the number of transmit antennas. However, the time required to finish the MAC operation should be scaled down by the number of antennas in order to increase the throughput proportionally to the number of antennas. This timing requirement for a fixed bandwidth is shown in Fig. 3 (b), assuming the critical path delay just meets the timing requirement in a 4×4 system. The situation is actually worse when metric enumeration is included in the loop. Since pipelining in the loop is considered a difficult task, this architecture cannot operate at a high frequency even for a 4×4 system [3].

C. Number of Sub-Carriers

To facilitate pipeline insertion in the loop, inputs are up-sampled by a factor m, which means that each register in the loop has to be replaced with m pipeline registers. By applying data-stream interleaving, samples of other independent data streams can be introduced in the loop in place of the repeated values or padding zeros. Interleaving is therefore used to improve area efficiency through logic sharing and to provide the flexibility needed to support a varying number of data sub-carriers.

D. Modulation Scheme

The challenge for the sphere decoder with a large constellation size is that the hardware cost grows quickly as the modulation size increases.
Since the admissible constellation points should be enumerated according to their distance increments $|T_i|^2$ ($T_i = b_i - R_{ii}s_i$) in ascending order, exhaustive search is a straightforward implementation; it calculates the distance increments of all constellation points and uses a sorting circuit to find the constellation point with the minimum distance, as shown in Fig. 4 (a). The number of distance calculation units is proportional to the constellation size (64 units are required for 64-QAM, for example). In the constellation plane, metric enumeration corresponds to ordering the scaled constellation points $R_{ii}\mathcal{Q}_i$ by their distance to $b_i$, from the closest to the farthest [1]. This is the underlying principle of the SE algorithm.

The SE enumeration was originally applied to one-dimensional signals, such as real-valued PAM or PSK constellations; it was therefore modified to arrange QAM constellations in $P_Q$ concentric groups to fit the original algorithm. For example, a 16-QAM constellation can be expressed as an arrangement of points in 3 concentric circles. The problem is then reformulated as finding the closest point in each subgroup and then the closest point over all subgroups [3], as shown in Fig. 4 (b). The original algorithm [1] uses the phase relationship to find the closest point in a concentric circle. This approach is too complex for hardware implementation due to the calculation of $\cos^{-1}(\cdot)$, so [3] proposed a decision boundary based method for each concentric subgroup to simplify the SE enumeration. However, this simplification is only applicable to small constellations such as 16-QAM. Larger constellation sizes are hard to support for several reasons. First, the decision boundary algorithm is quite complex: many multiplications are needed to generate the decision boundaries. Second, the number of subgroups grows quickly, which increases the latency of the min-search circuit. For example, 64-QAM is decomposed into 9 subgroups. Third, the concentric group partition is not scalable as the QAM constellation size changes, which makes it infeasible for the architecture to support different modulations.

Fig. 4. Closest point selection scheme: (a) exhaustive search architecture, (b) SE enumeration for QAM, (c) region partition based search approach. Real value is represented by gray line.

IV. SCALABLE ARCHITECTURE DESIGN

A scalable processing element (PE) is constructed to meet the flexibility requirements outlined in the previous section. The PE consists of two parts: Metric Calculation Unit (MCU) and Metric Enumeration Unit (MEU), as shown in Fig. 5. There are m-stage pipeline registers inserted in the loop, so the critical path can be shortened by choosing a larger m. Since m data streams are interleaved into the PE, the pipelines are active every clock cycle, creating the maximum throughput. The flexibility of search scheme is provided by the shift-register chain, which can be configured as forward trace or backward trace. If K PEs are used, K search paths can be explored at the same time to implement the K-best algorithm, while each PE has the flexibility to trace back as in depth-first search. The flexibility to support varying antenna size is provided by the folding architecture. It reuses the same hardware with the clock frequency proportional to the antenna size.

Fig. 5. Scalable PE architecture. [Blocks: MCU (bi-directional shift-register chain, partial products from R, adder tree) and MEU (symbol selection, radius check), with m pipeline stages in the loop. Flexibility annotations: folding architecture for multiple antenna arrays; bi-directional shift-register chain for backward trace and forward trace; data interleaving for multiple sub-carriers; symbol mapping for multiple modulations.]

The architecture shown in Fig. 5 works with complex numbers. One of the key tradeoffs to consider in the decoder implementation is concurrency versus latency, because the decoding of real and imaginary parts can be done concurrently or sequentially. Joint decoding maps to a parallel architecture with large area and proportional throughput increase. Sequential approach achieves lower throughput due to time-multiplexing. Joint decoding is more attractive from the BER performance standpoint. It also takes advantage of the speed of logic gates to reduce voltage and minimize power. The details of our implementation strategy and the structure of building blocks are illustrated next.

A. Numerical Strength Reduction

From an algorithm perspective, the complexity of sphere decoding is evaluated by the number of nodes visited in the tree search process. When considered for hardware implementation, decoding algorithms are generally compared in terms of the number of multiplications. Down to the circuit level, the size of multipliers is the key factor in estimating the area, speed, and power of the sphere decoder. We start by simplifying the cost of the multiply operation to reduce hardware complexity. The multiplication is required to calculate Euclidean distance, which is mathematically represented by Eqs. (3) and (4), as adopted in [2][4][5][6] and [3], respectively. At first glance, Eq. (3) has one multiplication while Eq. (4) has two. However, a careful investigation shows that Eq. (4) is a better choice from the hardware perspective for at least two reasons.
First, $\mathbf{s}_{ZF}$ and $\mathbf{Q}^H\mathbf{y}$ can be precomputed and, hence, have negligible impact on the total number of operations. Second, the wordlength of $\mathbf{s}$ is usually much shorter than that of $\mathbf{s}_{ZF}$. Separating terms as in Eq. (4) results in multipliers with reduced wordlength and reduced area.
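That Eqs. (3) and (4) give the same metric can be checked numerically. The sketch below assumes a square, full-rank system, so that the zero-forcing estimate is obtained from the triangular system R s_ZF = Q^H y by back substitution; the function names are hypothetical. It also shows the operand difference discussed above: Eq. (3) multiplies R by the full-precision difference s − s_ZF, while Eq. (4) multiplies R by the short-wordlength symbols s directly.

```python
def back_substitute(R, y_hat):
    """Solve R x = y_hat for upper triangular R (gives s_ZF for a square system)."""
    M = len(R)
    x = [0j] * M
    for i in reversed(range(M)):
        acc = y_hat[i] - sum(R[i][j] * x[j] for j in range(i + 1, M))
        x[i] = acc / R[i][i]
    return x


def metric_eq3(R, s, s_zf):
    """||R (s - s_ZF)||^2: the multiplier operand is a full-precision difference."""
    M = len(R)
    return sum(
        abs(sum(R[i][j] * (s[j] - s_zf[j]) for j in range(i, M))) ** 2
        for i in range(M)
    )


def metric_eq4(R, s, y_hat):
    """||Q^H y - R s||^2: the multiplier operand is the constellation symbol itself."""
    M = len(R)
    return sum(
        abs(y_hat[i] - sum(R[i][j] * s[j] for j in range(i, M))) ** 2
        for i in range(M)
    )
```

For any candidate vector s the two functions agree up to rounding, so the choice between them is purely an implementation matter.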


Without loss of generality, the normalized size of a multiplier can be estimated by the product of wordlengths of the multiplier and the multiplicand. The normalized delay of a multiplier can be estimated by the sum of wordlengths of the multiplier and the multiplicand if an array multiplier is used [10]. The array multiplier approximation works well for first-order comparison. In a 64-QAM system, where the wordlength of s is 3 for a real multiplier, the area reduction is at least 50% compared to the case where the wordlength of s exceeds 6 bits. The delay reduction also reaches 50% for R with a wordlength of 16.

B. Metric Calculation Unit (MCU)

The MCU computes $\sum_{j=i+1}^{M} R_{ij}s_j$. Basically, it executes a MAC operation. To accumulate all 16 operands and achieve the highest throughput, there are 16 multipliers followed by an adder tree that merges the partial products. It is possible to reduce the number of multipliers in a time-multiplexing manner at the price of lower throughput. For example, 4 complex multipliers can be time-multiplexed by 4 to perform the work of 16 multipliers, with the throughput also reduced by 4. Such tuning at the architecture level is used to position the design along the throughput and power axis, by optimally tuning variables such as supply voltage.

Since the search process advances one stage per clock cycle, we propose an FIR-like architecture to facilitate metric calculation, as shown in Fig. 6. By observing that the trace-back goes back up by only one layer instead of a random jump, a bi-directional shift register chain is embedded to adjust the search depth. Since the search state is recorded in the shift registers, no extra memory, such as stack memory [3][4], is needed to keep all the states. Coefficients of the R matrix are stored in memory in an area efficient way. The diagonal terms of the R matrix are real, while the rest are complex numbers. Using the upper triangular nature, the real part and the imaginary part of the off-diagonal terms are organized into a square memory, which saves around 50% of memory area.

Fig. 6. Micro-architecture of MCU block.

C. Metric Enumeration Unit (MEU)

The MEU enumerates the possible constellation points in ascending order according to $|T_i|^2$. We propose a simple partition method based on Cartesian coordinates. The constellation plane is partitioned into 64 regions for 64-QAM (8 regions in the real part and 8 regions in the imaginary part). The closest point (with minimum distance) can be decided by the location of $b_i/R_{ii}$ since the real part and the imaginary part can be decoded separately, as shown in Fig. 4 (c). The area complexity of the three architectures in Fig. 4 is evaluated using the number of add-equivalent operators (add, subtract, compare). A similar concept is applied in the delay comparison: the number of adder delays is used as the delay estimation metric. The assumption is that the square operators in Eq. (2) are simplified to absolute operators with a small performance loss [3]. Table II summarizes the area and delay using this first order estimation for a 64-QAM constellation. A 4.6× area reduction is achieved compared to SE enumeration and a 32× compared to exhaustive search. The delay of our design is 1/5 the delay of the SE enumeration.

TABLE II
METRIC ENUMERATION UNIT: AREA AND DELAY COMPARISON

                   | Exhaustive [3] | SE enumeration [3] | Our work
Area (normalized)  | 192            | 73                 | 16
Delay (normalized) | 8              | 5                  | 1

After finding the closest point, the remaining candidates are also decided by the distance between $b_i$ and the constellation points in ascending order. The decoded symbol $s_i$ is used to enumerate the remaining candidates through geometric relationship. The complexity of the sphere decoding algorithm is independent of the lattice constellation size; therefore, we can enumerate the adjacent possible constellation points instead of the whole constellation plane. We extract 9 points in the constellation plane as illustrated in Fig. 7. Eight surrounding constellation points have either 1-bit error (Fig. 7 (a-b)) or 2-bit errors (Fig. 7 (c-d)) if Gray coding is used. The 2nd closest point for each solution set is decided based on decision boundaries indicated by the dashed lines in Fig. 7 (a), (c). The remaining points are decided by the search direction, which is specified by other decision boundaries, starting from the 2nd point, as shown in Fig. 7 (b), (d). These two decision boundaries are easy to calculate by sign check and comparison for the real part and the imaginary part of $b_i - R_{ii}s_i$. To decide the enumeration sequence by jointly considering these two subsets from the closest to the farthest, we adopt a mixed method: the two solution sets are compared to find the final enumeration sequence with respect to the central point.

Fig. 7. Candidate enumeration scheme. Decision boundaries are dashed lines in the central region. [Panels (a), (b): 1-bit error subset; panels (c), (d): 2-bit errors subset.]

An overall 20× area reduction is achieved by a combination of signal processing and circuit techniques, from the arithmetic level down to the circuit level. The area reduction comes from folding architecture (8.5×), MEU simplification (30%), simplified multiplier (20%), memory reduction (5%), and wordlength reduction (20%). Treating the percentages as successive multiplicative savings gives 8.5 / (0.70 × 0.80 × 0.95 × 0.80) ≈ 20.

D. Summary of Design Flexibility

The sphere decoder is designed to support different system configurations with respect to antenna array size, modulation and detection schemes, as well as the number of sub-carriers. Table III summarizes the configuration modes. Since varying antenna array size and modulation are supported, this design is capable of trading off diversity gain for spatial-multiplexing gain. Due to interleaving by 16, the supported number of sub-carriers can be a multiple of 16 through data rearrangement.

TABLE III
OVERVIEW OF SYSTEM CONFIGURATION MODES

Configuration      | Modes
Antenna array size | any square matrix between 2×2 and 16×16
Modulation         | BPSK, QPSK, 16-QAM, 64-QAM
# sub-carriers     | 16, 32, 64, 128
Detection          | Depth-first, K-best

The main design specification is the throughput constraint for the algorithm. Since a total bandwidth of 16 MHz is used, each sub-channel requires 1 MS/s to process the data in the case of 16 sub-carriers. The requirement is thus to process 16 parallel streams of data at a 1 MHz rate. The clock specification for the resulting architecture then becomes 256 MHz (1 MHz × 16 sub-carriers × 16 antennas). In the 16×16 64-QAM mode, the system supports throughput up to 1.536 Gbps (16 streams × 6 bits per symbol × 16 MS/s), which results in a spectral efficiency of up to 96 bps/Hz. When the system operates at a smaller array mode, clock frequency and supply voltage can be reduced to minimize power consumption.

V. EXPERIMENTAL VERIFICATION

Matlab-Simulink environment is used to unify design decisions across multiple abstraction layers, from algorithm to circuits. The design methodology is based on the graphical timed data flow representation using blocks from the Simulink hardware library. This allows us to estimate the hardware complexity of the design and map the architecture onto the BEE2 hardware emulation platform [12].

A. Design Methodology

An integrated design methodology is adopted in our work to incorporate algorithm, architecture, and circuit implementation in a highly automated environment. Since the design is complex, we start with a layered design approach which hierarchically decomposes the top architecture down to the fundamental modules. Different considerations such as area and throughput are evaluated at each layer for architecture optimization. The graphical Matlab-Simulink development environment offers bit-true, cycle-by-cycle hardware equivalent modules for simulation, and then translates to FPGA emulation without hardware description language (HDL) coding [11]. The BEE2 platform is used for hardware emulation. With the capacity of 10M equivalent logic gates, BEE2 can provide over 100 times the computing throughput of a microprocessor-based system with similar power consumption and cost [12].

B. Hardware Complexity

A comparison of hardware is illustrated in Table IV. The estimated chip area is 0.55 mm² in a standard 90 nm CMOS process using the approximation of 10,000 FPGA slices ⇔ 1 mm² layout area in a 90 nm CMOS [8]. To make a fair comparison, the area is normalized by the number of transmit antennas (this is a conservative estimate, because the hardware complexity could grow quadratically with the number of transmit antennas). The data indicates that the proposed architecture is the most area efficient compared to prior work. Furthermore, our design outperforms all previously published designs in terms of supported antenna array size and constellation size, as shown in Fig. 8. Unlike previous work, the proposed architecture also supports multiple sub-carriers and search methods. Finally, this is the first design that offers the flexibility required to fully traverse the diversity vs. spatial-multiplexing tradeoff curve.

TABLE IV
HARDWARE COMPLEXITY COMPARISON

             | [2]     | [3]    | [4]              | [5]          | [6]    | This work
Area         | 500k GC | 50k GC | 10 mm² (0.18 µm) | 12.7k slices | 97k GC | *
Area (norm.) | 6.5     | 1.3    | 18.2             | 9.2          | 2.5    | 1

*154k gate count (GC), 0.55 mm² (90 nm), or 5.5k FPGA slices

Fig. 8. Comparison of this work with previous work.

C. Hardware Emulation Results

The BER performance of one PE is verified through the hardware/software co-simulation environment. In this preliminary experiment, only the closest lattice point is chosen as the decoded symbol during the search process for the highest throughput performance. The BER performance can be further improved towards the ML performance without repetition coding by using multiple PEs, since more search paths can be examined in the same time interval.
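The closest-point rule used in this experiment follows the Cartesian partition of Section IV-C: the real and imaginary parts of b_i/R_ii are sliced independently. The sketch below is a software illustration only. It assumes an unnormalized square QAM constellation with odd-integer levels (±1, ±3, ...) and a real, positive R_ii; the function names `slice_pam` and `closest_qam_point` are hypothetical.

```python
import math


def slice_pam(x, n_levels):
    """Nearest level in {-(n_levels-1), ..., -1, +1, ..., n_levels-1} to x."""
    k = math.floor((x + n_levels) / 2)      # index of the region along one axis
    k = min(max(k, 0), n_levels - 1)        # values outside the grid map to the edge
    return 2 * k - (n_levels - 1)


def closest_qam_point(b, r_ii, n_levels):
    """Closest constellation point to b / r_ii, decided per axis.

    n_levels is the number of levels per axis (8 for 64-QAM, giving 8 x 8 = 64
    regions). Assumes r_ii is real and positive, as for the diagonal of R.
    """
    z = b / r_ii
    return complex(slice_pam(z.real, n_levels), slice_pam(z.imag, n_levels))
```

Only a scaling, an offset, and a clamp per axis are needed, instead of one distance computation per constellation point as in exhaustive search.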


10 10

BER

10 10 10 10 10

0

-1

-2

-3

-4

-5

-6

0

16x16 8x8 4x4 8x8, rep. x2 16x16, rep. x4 5 10

15

20

25

Eb/No (dB)

(a) 64-QAM 10 10

BER

10 10 10 10 10

0

4x4 4x4 ML 8x8, rep. x2 16x16, rep. x4

-1

-2

-3

ACKNOWLEDGMENT

-4

-5

-6

0

5

10

15

20

25

Eb/No (dB)

(b) 16-QAM

Fig. 9.

arrays from 2×2 to 16×16, modulations from BPSK to 64QAM, and 16 to 128 sub-carriers. The target data rate exceeds 1.5 Gbps using a 16 MHz bandwidth in the 16×16, 64-QAM mode. An architecture with folding and interleaving is used to reduce area and critical path delay. Varying antenna arrays is supported through hardware reuse. Multiple modulation schemes are implemented by metric enumeration which uses a simplified boundary decision algorithm. Circuit-level considerations include smaller multipliers achieved through numerical strength reduction and reduced memory size accomplished by using bi-directional shift registers. The combined effect of these techniques is a 20× reduction in area compared to the direct-mapped architecture. Hardware emulations show that the reduced hardware complexity allows for an 8×8 decoder with repetition coding to outperform, in BER vs. SNR, a 4×4 ML decoder. Currently, a multi-core architecture is being developed to cooperate multiple Processing Elements (PEs) to extend the search range and improve BER performance.

Hardware emulation results for several design configurations.

Figure 9 (a) shows the BER performance of 64-QAM modulation for different antenna array sizes and different repetition coding rates. The repetition coding here is referred to as sending replicas in space domain to reduce error probability. We see that the BER performance of 4×4, 8×8, and 16×16 is comparable; however, schemes with larger antenna array have proportionally higher throughput. By using repetition coding by a factor 2, the throughput also drops by a factor of 2, but the BER performance is greatly improved. Therefore, the throughput of 4×4, 8×8 with repetition coding by 2, and 16×16 with repetition coding by 4 is the same, but the BER performance is improved significantly. The performance of the ML estimate for a 4×4 system is depicted in Fig. 9 (b) as a reference. An 8×8 system with repetition coding by 2 has outperformed the 4×4 system with the ML performance by 5 dB. The results demonstrate that the BER performance of a system with a larger antenna array and repetition coding performs better than the ML performance with a smaller antenna array. This tradeoff is practically feasible due to greatly reduced complexity of sphere decoder hardware as compared to the ML algorithm based on exhaustive search. VI. C ONCLUSION This work proposes a flexible sphere decoder architecture for MIMO communications. Our architecture supports antenna

The authors acknowledge the support of the Focus Center for Circuit & System Solutions (C2S2), one of five research centers funded under the Focus Center Research Program, a Semiconductor Research Corporation Program.

REFERENCES

[1] B. M. Hochwald and S. ten Brink, “Achieving near-capacity on a multiple-antenna channel,” IEEE Trans. Communications, vol. 51, pp. 389-399, Mar. 2003.
[2] G. Knagge et al., “A VLSI 8×8 MIMO Decoder Engine,” in IEEE Workshop on Signal Processing Systems, SIPS’06, pp. 387-392, Oct. 2006.
[3] A. Burg et al., “VLSI implementation of MIMO detection using the sphere decoding algorithm,” IEEE J. Solid-State Circuits, vol. 40, pp. 1566-1577, July 2005.
[4] D. Garrett et al., “Silicon Complexity for Maximum Likelihood MIMO Detection Using Spherical Decoding,” IEEE J. Solid-State Circuits, vol. 39, pp. 1544-1552, Sep. 2004.
[5] L. G. Barbero and J. S. Thompson, “Rapid Prototyping of a Fixed-Throughput Sphere Decoder for MIMO Systems,” Proc. Int. Conf. Communications, vol. 7, pp. 3082-3087, June 2006.
[6] Z. Guo and P. Nilsson, “Algorithm and implementation of the k-best sphere decoding for MIMO detection,” IEEE J. Selected Areas in Communications, vol. 24, pp. 491-503, Mar. 2006.
[7] L. Zheng and D. Tse, “Diversity and Multiplexing: A Fundamental Tradeoff in Multiple-Antenna Channels,” IEEE Trans. Information Theory, vol. 49, no. 5, pp. 1073-1096, May 2003.
[8] D. Marković, B. Nikolić, and R. W. Brodersen, “Power and Area Efficient VLSI Architecture for Communication Signal Processing,” in Proc. Int. Conf. Communications, Jun. 2006.
[9] G. B. Giannakis, Z. Liu, X. Ma, and S. Zhou, Space-Time Coding for Broadband Wireless Communications, John Wiley & Sons, 2007.
[10] J. Rabaey, A. Chandrakasan, and B. Nikolić, Digital Integrated Circuits: A Design Perspective, 2nd ed. Prentice-Hall, 2003.
[11] W. R. Davis et al., “A Design Environment for High Throughput, Low Power Dedicated Signal Processing Systems,” IEEE J. Solid-State Circuits, vol. 37, no. 3, pp. 420-431, Mar. 2002.
[12] C. Chang, J. Wawrzynek, and R. W. Brodersen, “BEE2: A High-End Reconfigurable Computing System,” IEEE Design & Test of Computers, vol. 22, pp. 114-125, Mar.-Apr. 2005.
