VLSI Architecture for Soft-Output Tuple Search Sphere Decoding
Esther P. Adeva, M. Ali Shah, Björn Mennenga and Gerhard Fettweis
Vodafone Chair Mobile Communication Systems, Technische Universität Dresden, Dresden, Germany
Email: esther.perez, mohammad.ali.shah, mennenga,
[email protected]
Abstract—High detection complexity is one of the major challenges in MIMO communications based on spatial multiplexing. The Tuple Search Detector (TSD) was recently introduced; it significantly reduces detection complexity compared with conventional algorithms while achieving close to full max-log-APP BER performance. The irregular control flow and sequential nature of depth-first detectors hinder the efficient application of parallelization techniques, typically leading to inefficient realizations. This work presents a novel TSD implementation based on a scalable and parallelizable pipelined ASIP architecture. The proposed VLSI design is implemented for 4×4 MIMO transmission with a 64-QAM constellation in 65-nm CMOS technology. In low-SNR scenarios, the proposed detector achieves a throughput of 403.6 Mbps at a clock frequency of 454 MHz. The TSD can moreover be adjusted to the transmission conditions, reaching >1 Gbps. The TSD core occupies a silicon area of 0.14 mm² (98.9 kGE) and dissipates 57.94 mW under typical-case operating conditions. The proposed detector implementation achieves close to full max-log-APP BER performance and high detection throughput with reasonable hardware complexity, by far outperforming state-of-the-art realizations.
Keywords— MIMO detection, sphere decoder, tuple search, ASIP, VLSI architecture.

I. INTRODUCTION

Future mobile communication systems will make use of multiple-input multiple-output (MIMO) techniques in combination with high constellation orders to enhance spectral efficiency. Transmission of spatially multiplexed data streams allows increasing data rates as well as diversity. However, the high detection complexity of common detector realizations still represents a limiting factor for efficient implementations. Conventional low-complexity detectors provide poor BER performance (e.g. the Successive Interference Cancellation (SIC) detector), whereas detectors with high BER performance present impractically high complexity (e.g. the Single Tree Search (STS) detector). In this regard, TSD [1] has been shown to achieve a significant complexity reduction while providing close to full max-log-APP BER performance, outperforming state-of-the-art detection strategies [2] such as STS [3] or List Sphere Detection (LSD) [4].

The irregular and data-dependent control flow of so-called depth-first detection algorithms (such as STS and LSD) hinders efficient algorithm parallelization, thus limiting the achievable throughput. This, together with high detector complexity, typically leads to inefficient realizations. Moreover, the complexity of these conventional detectors grows
exponentially with the size of the constellation Q and with the number of transmit antennas $N_T$ [2]. Consequently, application of most state-of-the-art detection strategies is restricted to low-order transmission systems ($N_T \le 4$, $Q \le 16$). TSD presents, in contrast, a roughly linear trend, and is hence especially well suited for high-order transmission scenarios ($N_T > 4$, $Q > 16$).

In this work, the VLSI implementation of a novel TSD-ASIP architecture is presented. The considered communication system model is introduced in section II, followed by a description of the complexity-reduced TSD algorithm (section III), which forms the basis of our implementation. The proposed ASIP-based architecture is described in section IV. The corresponding hardware implementation results are shown in section V and subsequently discussed in section VI. Section VII summarizes the main highlights of this work.

II. SYSTEM MODEL

Throughout this paper, we consider an $N_T \times N_R$ MIMO system based on a BICM transmission strategy with $N_T$ transmit and $N_R$ receive antennas. A vector u of i.i.d. information bits is encoded by the outer channel code with rate R. The resulting stream of vectors c′ is bit-interleaved and partitioned into blocks c of $N_T \cdot L$ bits, where L denotes the number of bits per transmit symbol. For the transmission, the corresponding bits c ∈ C are mapped (e.g. by Gray mapping) onto complex constellation symbols x(c). We consider a flat-fading, uncorrelated channel and an additive noise vector $n \in \mathbb{C}^{N_R \times 1}$ at the receiver. The considered passive channel is represented by $H \in \mathbb{C}^{N_R \times N_T}$, whose entries are drawn from a zero-mean i.i.d. Gaussian random process of variance 1, and is assumed to be perfectly known at the receiver. The received signal y is therefore given by

$$ y = Hx + n. $$

In order to ensure comparability of results, a setup equivalent to the one used in e.g.
[1], [5], [6] has been used for our simulations.¹ In the following, we focus on $N_T = N_R = 4$ and Q = 64 and, for the sake of simplicity, consider non-iterative detection↔decoding.

¹ Rate 1/2 PCCC with $(7_R, 5)$ convolutional codes, information block size of 9216 bits (including tail bits), Gray mapping, spatial and temporal fading. Detection is carried out by a complex-valued tuple search sphere detector in conjunction with a BCJR-based decoder with 8 internal iterations.
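The system model of section II can be illustrated with a short simulation of one channel use. The following is only a minimal sketch in plain Python: the function names, the example symbol vector and the random seed are hypothetical choices made for illustration, and coding, interleaving and the symbol mapping of the actual setup are left out.

```python
import random

def complex_gaussian(rng, variance=1.0):
    """Draw one circularly symmetric complex Gaussian sample of the given variance."""
    sigma = (variance / 2.0) ** 0.5  # variance split evenly over real and imaginary part
    return complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))

def transmit(x, n_r, noise_variance, rng):
    """Return (H, y) for one channel use of y = H x + n."""
    n_t = len(x)
    # Channel entries: zero-mean i.i.d. complex Gaussian with variance 1
    H = [[complex_gaussian(rng) for _ in range(n_t)] for _ in range(n_r)]
    y = [sum(H[r][t] * x[t] for t in range(n_t)) + complex_gaussian(rng, noise_variance)
         for r in range(n_r)]
    return H, y

rng = random.Random(0)                      # fixed seed, for reproducibility only
x = [1 + 1j, -1 + 1j, 1 - 1j, -1 - 1j]      # hypothetical symbol vector, N_T = 4
H, y = transmit(x, n_r=4, noise_variance=0.0, rng=rng)  # noiseless case: y = H x
```

With `noise_variance=0.0` the received vector equals Hx exactly, which is a convenient sanity check before noise is added.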
III. COMPLEXITY-REDUCED TUPLE SEARCH MIMO DETECTION

A. Fundamentals

The task of the MIMO detector is to determine the bits c most likely sent as well as reliability information for these bits. This can be accomplished by calculating log-likelihood ratios (L-values):

$$ L(c_{m,l} \mid y) = \ln \frac{P(c_{m,l} = +1 \mid y)}{P(c_{m,l} = -1 \mid y)} \approx -\frac{1}{N_0} \min_{c \mid c_{m,l} = +1} \{\lambda_0\} + \frac{1}{N_0} \min_{c \mid c_{m,l} = -1} \{\lambda_0\}, \tag{1} $$

where (1) results from application of the max-log approximation [7]. The l-th bit of the symbol sent by the m-th antenna is represented by $c_{m,l}$, and $\lambda_0$ represents the distance metric. For the considered case without detector↔decoder iterations (i.e. without a priori information), this metric depends only on the Euclidean distance between the set of received symbols y and the representative $\hat{x}(c)$ of the symbols likely transmitted through the channel H:

$$ \lambda_0(y, c) = \| y - H \hat{x}(c) \|^2 . $$

Consequently, besides the most probably sent symbol $\arg\min_{\hat{x}(c) \mid c \in C} \{\lambda_0\}$ (i.e. the detection hypothesis) and its corresponding metric $\lambda_0(c^{ML})$, the detector also has to determine the counter-hypotheses $\arg\min_{\hat{x}(c) \mid c \in C,\, c_{m,l} \ne c^{ML}_{m,l}} \{\lambda_0\}$ and their metrics for each bit.
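To make (1) concrete, the sketch below evaluates the max-log L-values by exhaustive enumeration for a toy 2×2 system. It is an illustration under stated assumptions, not the detector discussed in this paper: the QPSK bit-to-symbol mapping, the function names and the example channel are hypothetical, and bits take values in {+1, −1} as in (1).

```python
from itertools import product

def qpsk(b0, b1):
    """Map two bits in {+1, -1} to a unit-energy QPSK symbol (assumed mapping)."""
    return complex(b0, b1) / 2 ** 0.5

def metric(y, H, x):
    """Distance metric lambda_0 = ||y - H x||^2."""
    return sum(abs(y[r] - sum(H[r][t] * x[t] for t in range(len(x)))) ** 2
               for r in range(len(y)))

def max_log_llrs(y, H, n0):
    """Brute-force max-log L-values following (1), two bits per transmit antenna."""
    n_t = len(H[0])
    n_bits = 2 * n_t
    # Smallest metric seen so far for every (bit position, bit value) pair
    best = {(k, s): float("inf") for k in range(n_bits) for s in (+1, -1)}
    for c in product((+1, -1), repeat=n_bits):
        x = [qpsk(c[2 * m], c[2 * m + 1]) for m in range(n_t)]
        lam = metric(y, H, x)
        for k, bit in enumerate(c):
            best[(k, bit)] = min(best[(k, bit)], lam)
    return [(-best[(k, +1)] + best[(k, -1)]) / n0 for k in range(n_bits)]

# Noiseless toy example: the sign of each L-value should match the sent bit
H = [[1.0, 0.2j], [0.1, 1.0]]
c = (+1, -1, -1, +1)
x = [qpsk(c[0], c[1]), qpsk(c[2], c[3])]
y = [sum(H[r][t] * x[t] for t in range(2)) for r in range(2)]
llrs = max_log_llrs(y, H, n0=1.0)
```

The loop visits every bit vector, which is exactly the exponentially growing effort that the tree search strategies of the following sections aim to avoid.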
B. Tree Search Basics

Since brute-force (full max-log-APP) evaluation of (1) is known to have a computational complexity that grows exponentially with the number of transmit antennas and the order of the constellation, several close-to-optimal detection approaches have lately been proposed, some of the most promising being based on tree search strategies. As described in detail in [5], the detection problem can be transformed by means of the QR decomposition (QRD) H = QR, where Q is unitary and R is an upper triangular matrix. With the modified received symbols $y' = Q^H y$, determination of the Euclidean distance

$$ \| y' - R \hat{x}(c) \|^2 \tag{2} $$

can be interpreted as a tree search. As a result, $\lambda_0$ can be calculated recursively through the layered partial metric

$$ \lambda_i = \underbrace{\lambda_{i+1}}_{\text{metric from already estimated symbols}} + \Big| \underbrace{y_i''}_{\text{interference reduced symbol}} - r_{ii} \hat{x}_i \Big|^2, \tag{3} $$

$$ y_i'' = y_i' - \sum_{j=i+1}^{N_T - 1} r_{ij} \hat{x}_j, \tag{4} $$
by adding the metric of the corresponding parent node to the squared distance between a representative of the estimated symbol and an interference reduced symbol. The search is carried out in depth-first fashion, successively extending the
selected nodes by analyzing their child nodes. As presented in [8], a regularized control flow and the parallel calculation of sibling parent nodes as well as of leaf nodes permit a so-called one-node-per-cycle implementation [9].

C. Complexity-Reduced Tuple Search Detector

Computing the L-values in (1) requires determination of a detection hypothesis and all counter-hypotheses, as described in section III-A. An explicit search for all required minima leads to an impractically high ♯n [10]. Therefore, instead of searching all possible minima, the TSD introduced in [1] searches a subset of the T most likely leaves, similarly to the approach used by LSD. The metrics $\lambda_0$ of these leaves are stored in a search tuple $\mathcal{T} := \{\lambda_0(c_1), \lambda_0(c_2), \ldots, \lambda_0(c_T)\}$, and the sphere radius is defined as the maximum metric in the tuple:

$$ R = \max_{c_t \mid c_t \in \mathcal{T}} \{\lambda_{0,t}\}. \tag{5} $$
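The interplay of the recursion (3), (4) with the tuple radius (5) can be sketched as a small depth-first search. This is a simplified illustration only: all names and the example values are hypothetical, R and y′ are assumed to be given (the QR decomposition is not shown), children are ordered by sorting their exact metrics rather than by the SSD and ME strategies used in this work, and the bit-specific soft-information storage is omitted.

```python
import heapq
from itertools import count, product

def tuple_search(R, yp, alphabet, T):
    """Depth-first tree search keeping the T smallest leaf metrics (the search tuple)."""
    n = len(yp)
    heap = []          # max-heap on the metric (stored negated): the search tuple
    tie = count()      # tie-breaker so heap entries never compare symbol vectors
    x_hat = [0j] * n   # symbols estimated on the current path

    def radius():
        # (5): largest metric in a full tuple; unbounded while the tuple fills up
        return -heap[0][0] if len(heap) == T else float("inf")

    def descend(i, lam_parent):
        # (4): subtract the interference of the already estimated layers j > i
        y_red = yp[i] - sum(R[i][j] * x_hat[j] for j in range(i + 1, n))
        # (3): partial metric of every child, sorted so the best child comes first
        children = sorted((lam_parent + abs(y_red - R[i][i] * s) ** 2, k)
                          for k, s in enumerate(alphabet))
        for lam, k in children:
            if lam >= radius():
                break  # this child and all remaining siblings lie outside the sphere
            x_hat[i] = alphabet[k]
            if i > 0:
                descend(i - 1, lam)
            elif len(heap) < T:
                heapq.heappush(heap, (-lam, next(tie), tuple(x_hat)))
            else:
                heapq.heapreplace(heap, (-lam, next(tie), tuple(x_hat)))

    descend(n - 1, 0.0)
    return sorted(((-neg, x) for neg, _, x in heap), key=lambda e: e[0])

def brute_force(R, yp, alphabet, T):
    """Reference: the T smallest values of ||y' - R x||^2 over all symbol vectors."""
    n = len(yp)
    metrics = [sum(abs(yp[i] - sum(R[i][j] * x[j] for j in range(i, n))) ** 2
                   for i in range(n))
               for x in product(alphabet, repeat=n)]
    return sorted(metrics)[:T]

qpsk = [complex(a, b) / 2 ** 0.5 for a in (1, -1) for b in (1, -1)]
R = [[1.2, 0.3 + 0.1j, -0.2j], [0, 0.9, 0.4], [0, 0, 1.1]]   # upper triangular
yp = [0.5 + 0.3j, -0.7 + 0.2j, 0.9 - 1.0j]
found = tuple_search(R, yp, qpsk, T=4)
reference = brute_force(R, yp, qpsk, T=4)
```

Because partial metrics never decrease along a path, pruning against the current radius cannot discard any of the T best leaves, so `found` and `reference` contain the same metrics.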
Tuple search can additionally be combined with separate bit-specific storage of information for the L-value calculation [1]. The resulting TSD achieves much better BER performance than LSD at a significantly reduced ♯n compared to STS detection. Additionally, the SNR⇔♯n trade-off can be adjusted by varying the size T of the tuple. Strategies like sorted QR decomposition, MMSE preprocessing as well as sphere and L-value clipping are widely applied approaches to ♯n reduction [1]. Novel approaches like search sequence determination (SSD) [11] and metric estimation (ME) [12] are additionally utilized in this work to further reduce ♯n as well as the computational complexity of the detection, as detailed in [2], [13].

IV. VLSI ARCHITECTURE

The TSD hardware realization presented in this work is based on the ASIP model and concepts proposed in [13]. The design follows the synchronous transfer architecture (STA) principle [14] and is controlled by means of Very Long Instruction Words (VLIWs). The architecture is composed of basic modules known as functional units (FUs). FU output ports are buffered with registers so that data produced by an FU can be directly consumed by connected FUs. Pipeline-interleaved execution of different detection paths is implemented for throughput enhancement [13] and, for simplicity, q-fold SIMD vectorization with q = 1 has been considered. The proposed ASIP design is depicted in Fig. 1 and described in the following sections. It mainly comprises:

• Control unit: a decoder reads VLIWs from the instruction memory (IMEM) and maps the corresponding operations onto the FUs of the design. A flow control unit (FCU) generates addresses for IMEM, assisting the sequencer in handling the conditional execution flow [13].
• Data path: it contains the data memories (SMEM, VMEMI, VMEMO), banks of data and address registers, and the MIMO detection module.

With regard to SIMD parallelization, vector and scalar paths should be distinguished, as illustrated in Fig. 1.
Fig. 1: ASIP-based detector architecture
A. Memory Organization
Fig. 2: Block diagram of the regularized TSD loop.
1) Scalar Memory (SMEM): SMEM contains system parameters ($N_T$, L, T, ...) as well as channel-dependent parameters (elements of the R matrix (2), ...). Assuming a certain channel coherence time (e.g. equivalent to m detections), the SMEM content can be shared by m individual detection paths, as explained in [13].

2) Input and Output Vector Memories (VMEMI, VMEMO): VMEMI is a buffer (of length bf words) storing the symbols y to be detected. The L-values resulting from the detection are stored in the corresponding output buffer (VMEMO). Each word in VMEMI and VMEMO stores the information corresponding to one detection path.

Memory   Width (bits)   Length (♯ words)   Size (bytes)   Content
IMEM     64             40                 320            Program code
SMEM     64             15                 120            System and channel info.
VMEMI    64             40                 320            Received symbols (y)
VMEMO    192            40                 960            Soft output (L-values)

TABLE I: Memory specifications for $N_T = N_R = 4$ and Q = 64 (bf = 40, q = 1).
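The sizes listed in Table I follow directly from word width and buffer length. The short check below recomputes them; the dictionary layout and names are illustrative only, and the grouping of SMEM, VMEMI and VMEMO as data memory follows the data path description in section IV.

```python
# (width in bits, length in words) per memory, taken from Table I
memories = {"IMEM": (64, 40), "SMEM": (64, 15), "VMEMI": (64, 40), "VMEMO": (192, 40)}

def size_bytes(width_bits, length_words):
    """Memory size in bytes for a given word width and number of words."""
    return width_bits * length_words // 8

sizes = {name: size_bytes(w, n) for name, (w, n) in memories.items()}

# Data memories only (the instruction memory IMEM is excluded)
data_memory = sizes["SMEM"] + sizes["VMEMI"] + sizes["VMEMO"]
print(sizes, data_memory)
```

The data memories add up to 1400 bytes, consistent with the figure of roughly 1.4 KB quoted for this configuration.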
Table I summarizes the memory specifications. Both VMEMI and VMEMO are q-fold vector memories with regard to the SIMD implementation, while SMEM is scalar. For the considered 4×4 MIMO transmission with a 64-QAM constellation, using buffers of size bf = 40 words (i.e. 40 individual detection paths) and disregarding vectorization (q = 1), a total of ∼1.4 KB of data memory is required.²

B. MIMO Detector Module

The TSD algorithm can be decomposed, as proposed in [8], into an arbitrary number of regularized loops. The operations performed within each loop are partitioned into the task blocks illustrated in Fig. 2, each mapped to an FU ([2], [13]). As mentioned in section III-B, a so-called one-node-per-cycle(-loop) realization is desired. Consequently, sibling parent nodes and leaf nodes are assumed to be processed in parallel [11]. For this purpose, separate modules individually processing the first, second and subsequent sibling nodes are defined (Fig. 2). Likewise, modules separately processing candidate leaf nodes are included.

² Values stored in the data memories are represented in fixed point with a maximum word width of 8 bits. The fixed-point representation is detailed in [13].
1) Node Enumeration Unit (NEU) and Interference Computation Unit (ICU): In order to analyze the most favourable symbols first, the Search Sequence Determination (SSD) strategy of [11] is implemented by the NEU (composed of FUs a, b and g in Fig. 2). The node enumeration resulting from SSD is determined through fixed sequences which are mapped to constellation symbols during the detection. The predefined sequences are stored in small (∼10 bytes) look-up tables (LUTs). As a result, the SSD strategy replaces the costly metric calculations (3) and sorting operations of the commonly used Schnorr-Euchner (SE) enumeration by a few comparison and addition operations. The operations involved in determining the interference-reduced received signal y′′ (4) are performed in the ICU, composed of FUs d, f and i in Fig. 2.

2) Metric Computation Unit (MCU): The computationally costly operations required for the conventional metric calculation (3) can be considerably simplified through an estimation based on the geometrical approach of SSD [12]. The Euclidean distances in (3) are replaced by predefined geometrical distances d, resulting in $r_{ii}^2 \| y_i''' - \hat{x}_i \|^2 \approx r_{ii}^2 d^2$. Additionally, $r_{ii}^2 d^2$ can be precalculated and stored in SMEM, thus reducing the complexity of (3) to a simple addition. Estimation of the metrics is performed in FUs c, e and h in Fig. 2.

3) Radius Administration Unit (RAU) and Level Determination Unit (LDU): The RAU is responsible for administrating the sorted list of leaf metrics $\lambda_0$ forming the search tuple. The search radius R is adapted (5) by this entity, as detailed in [1]. Subsequently, the LDU decides on the next tree level to be explored, implementing the depth-first tree examination. In addition, subtrees leading to metrics which exceed the search radius R are pruned. Both entities require only logic and a few comparators to carry out their respective tasks.
4) Soft-output Administration Unit (SAU) and L-values Computation Unit (LCU): SAU and LCU provide the soft output. Symbols representing the most favourable (counter-)hypothesis candidates are selected and their corresponding metrics are estimated within FU j. Subsequently, the stored soft information is updated by FU k according to the new (counter-)hypotheses found. As mentioned in section III-C, TSD makes use of this separate storage of soft information for the later L-value calculation (1) in the LCU.

Fig. 3: TSD average throughput at fmax = 454 MHz, for different tuple sizes (T) at $10^{-5}$ BER.
[Plot: average throughput (Mbps) versus Eb/N0 (dB). Curves: TSD (4×4 MIMO, 64-QAM) [this work]; SIC (4×4 MIMO, 64-QAM) [this work]; STS (4×4 MIMO, 16-QAM) [16]. Labelled operating points: T = 2 (1.1 Gbps @ 15.55 dB), T = 4 (681 Mbps @ 14.7 dB), T = 8 (403.6 Mbps @ 14.1 dB), T = 16 (279.4 Mbps @ 13.9 dB), T = 32 (162.2 Mbps @ 13.85 dB). Frequency labels in the plot: 454 MHz and 379 MHz.]

              Area                        Total Power
              mm²     kGE      %          mW       %
Total         0.23    160.4    100.0      88.31    100.0
Memory        0.09    61.5     38.3       24.78    28.0
TSD core      0.14    98.9     61.7       57.94    65.6
Ctrl. path    0.01    3.9      2.4        2.02     2.3
Data path     0.14    95.0     59.2       55.92    63.3
NEU           0.02    10.7     6.2        8.29     9.4
ICU           0.03    23.1     14.4       12.32    14.0
MCU           0.01    7.0      4.3        4.67     5.3
LDU           0.01    5.0      3.1        4.10     4.7
RAU           0.01    6.3      3.9        4.46     5.1
SAU + LCU     0.06    43.0     26.8       22.08    24.8

TABLE II: Area breakdown and total average power consumption under typical case operating conditions, at 454 MHz (4×4 MIMO, 64-QAM).

V. IMPLEMENTATION RESULTS

In order to assess the true silicon complexity of the proposed detector implementation, the described VLSI architecture has been modeled in Verilog HDL and synthesized using Synopsys Design Compiler³. RTL and gate-level netlists have been verified against the same test vectors, generated from a MATLAB/C++ fixed-point model. For the analysis, a detector design configured for $N_T = N_R = 4$ with a 64-QAM constellation has been instantiated. A maximum clock frequency fmax = 454 MHz is reached under typical-case operating conditions⁴, while the frequency degrades to fmax = 294 MHz (∼35% lower) under worst-case conditions⁴. In the following sections, results on throughput, silicon area and power consumption are presented and analyzed.

A. Throughput

Based on the considered pipeline-interleaving strategy and without vectorization (q = 1), the average detection throughput τ at frequency f is given by:

$$ \tau_{TSD} = \frac{N_T \cdot L}{\sharp n} \, f \quad [\text{bits/s}]. $$
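As a quick numerical illustration of the throughput expression, the sketch below evaluates it for the 4×4, 64-QAM configuration (L = 6 bits per symbol) at 454 MHz. The function names are hypothetical, and ♯n is interpreted here as the average number of nodes examined per detected symbol vector, which is an assumption of this example.

```python
def tsd_throughput(n_t, bits_per_symbol, f_hz, nodes_per_detection):
    """Average throughput in bits/s: N_T * L * f / #n."""
    return n_t * bits_per_symbol * f_hz / nodes_per_detection

def nodes_for_throughput(n_t, bits_per_symbol, f_hz, throughput_bps):
    """Invert the throughput expression to obtain the implied average #n."""
    return n_t * bits_per_symbol * f_hz / throughput_bps

# Limiting #n to N_T = 4 nodes per detection
sic_like = tsd_throughput(4, 6, 454e6, 4)

# Average #n implied by the reported 403.6 Mbps at T = 8
implied_nodes = nodes_for_throughput(4, 6, 454e6, 403.6e6)
```

Here `sic_like` evaluates to 2.724 Gbps, and `implied_nodes` comes out close to 27 nodes per detection for the T = 8 operating point.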
Figure 3 depicts the throughput of the proposed TSD realization, which is adjustable through the tuple size T. The throughput of successive interference cancellation (SIC) detection⁵ and of the detector realization from [16] are included as reference (note that different frequencies are used; an extensive comparison considering frequency-normalized throughput is provided in section VI). As illustrated, the presented TSD implementation achieves high throughput even in low-SNR scenarios (e.g. 403.6 Mbps with T = 8), outperforming by far the STS realization from [16]. In high-SNR scenarios, throughput values above 1 Gbps are achieved by TSD, more than doubling the throughput provided by SIC detection. TSD surpasses 2.7 Gbps throughput when limiting ♯n = $N_T$ (providing SIC SNR performance). A further throughput increase could still be achieved by exploiting SIMD vectorization (enhancement by a factor of ∼q [8]).

³ Using TSMC 65 nm low-power CMOS libraries under typical- and worst-case operating conditions.
⁴ Typical-case operating conditions: 1.2 V and 25 °C; worst-case operating conditions: 1.08 V and 125 °C.
⁵ Non-pipelined SIC implementation, based on the proposed TSD-ASIP design with limited ♯n = $N_T$.

B. Area

Table II shows the area breakdown of the proposed design, extracted from pre-layout synthesis reports. In order to provide a technology-independent area characterization, the number of gate equivalents (GEs) is additionally specified⁶. A total area of 0.23 mm² is required, of which 38% corresponds to memory (section IV-A) and 62% to the TSD ASIP core (section IV-B). The control path represents a small area overhead (2%). Regarding the TSD core, the obtained results show that the modules involved in soft-output computation (SAU and LCU) present the highest hardware complexity. A similar result is observed in [15] when comparing the area breakdowns of hard-output and soft-output sphere detector realizations. The area of the ICU represents ∼1/2 of the area required by the soft-output computation modules, while the NEU requires ∼1/4 of this value. The LDU, RAU and MCU modules present the lowest area complexity. As future work, the designed architecture could be optimized in order to reduce the hardware complexity of the most costly entities.

C. Power Consumption

Power analysis is performed using Synopsys PrimeTime, based on switching activity annotated during post-synthesis simulation into a value change dump (VCD) file. Simulations are carried out in Mentor ModelSim under typical-case operating conditions, at fmax = 454 MHz. The observed peak power deviates from the total power by ∼20%, except for the memory (peak power doubles the average total power).
The reported average total power consumption is detailed in Table II. As illustrated, memory consumes ∼30% of the total power. Concerning the TSD core, the observed power consumption results are in line with the obtained results on area. The greatest dissipation corresponds to the soft-output computation modules (∼27%), followed by the ICU (15%) and the NEU (10%). Remaining entities consume [...]

⁶ One GE corresponds to the area of a two-input NAND gate synthesized using TSMC 65 nm libraries.

2) Area: Table III compares the silicon area (in mm² and kGEs) occupied by the considered detectors. Since the design principles followed in the literature are completely different from those of the implementation proposed in this work, the occupied CMOS area is not directly comparable, as discussed next:

[...] ($N_T$ > 4, Q > 16), the pre-eminence of TSD is significantly enhanced due to the linearly growing complexity trend. As a result, the proposed design is highly flexible and scalable, in contrast to most n-variable realizations (limited to low-order systems due to their exponential complexity trend). Further throughput improvement can be provided by exploiting SIMD vectorization and possibly increasing the frequency. Additionally, the flexibility of the design can be enhanced by taking advantage of the ASIP programmability. Additional future work will target architecture optimization for the reduction of area and power consumption. The outstanding implementation results, together with the enabled flexibility, scalability and parallelizability, make the proposed design a very favorable candidate for MIMO detection in a wide range of applications.

REFERENCES

[1] B. Mennenga, A. von Borany, and G. Fettweis, “Complexity reduced Soft-In Soft-Out Sphere Detection based on Search Tuples,” in Proceedings of the IEEE International Conference on Communications (ICC’09), Dresden, Germany, 2009.
[2] E. P. Adeva and B. Mennenga, “Survey on an Efficient, Low-complex Tuple Search Based Sphere Detector,” submitted to IEEE 34th Sarnoff Symposium, 2011.
[3] C. Studer, A. Burg, and H. Bölcskei, “Soft-output sphere decoding: Algorithms and VLSI implementation,” IEEE Journal on Selected Areas in Communications, Feb. 2008.
[4] M. S. Yee, “Max-log-MAP sphere decoder,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’05), vol. 3, 18.-23. Mar. 2005.
[5] B. Hochwald and S. ten Brink, “Achieving near-capacity on a multiple-antenna channel,” IEEE Transactions on Communications, vol. 51, pp. 389–399, Mar. 2003.
[6] E.
Zimmermann and G. Fettweis, “Unbiased MMSE Tree Search Detection for Multiple Antenna Systems,” in Proceedings of the International Symposium on Wireless Personal Multimedia Communications (WPMC’06), San Diego, USA, Sep. 2006.
[7] B. Mennenga, Aufwandsgünstige Detektion in Mehrantennensystemen mittels komplexitätsreduzierter Baumsuchverfahren. Dissertation, Technische Universität Dresden, 2010.
[8] B. Mennenga, E. Matus, and G. Fettweis, “Vectorization of the Sphere Detection Algorithm,” in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS’09), Taipei, Taiwan, 2009.
[9] A. Burg, M. Borgmann, M. Wenk, M. Zellweger, W. Fichtner, and H. Bölcskei, “VLSI implementation of MIMO detection using the sphere decoding algorithm,” IEEE Journal of Solid-State Circuits, vol. 40, pp. 1566–1577, Jul. 2005.
[10] J. Jaldén and B. Ottersten, “Parallel Implementation of a Soft Output Sphere Decoder,” in Proceedings of Asilomar Conference on Signals, Systems, and Computers, 2005.
[11] B. Mennenga and G. Fettweis, “Search Sequence Determination for Tree Search based Detection Algorithms,” in Proceedings of the IEEE Sarnoff Symposium 2009, Princeton, USA, 2009.
[12] B. Mennenga and G. Fettweis, “Simplified Search Sequence and Metric Determination for Tree Search based Detection Algorithms,” submitted to the IEEE Transactions on Wireless Communications, 2009.
[13] E. P. Adeva and B. Mennenga, “Scalable ASIP Implementation and Parallelization of a MIMO Sphere Detector,” submitted to International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XI), 2011.
[14] G. Cichon, A Novel Compiler-Friendly Micro-Architecture for Rapid Development of High-Performance and Low-Power DSPs. Dissertation, Technische Universität Dresden, 2004.
[15] C. Studer, A. Burg, and H. Bölcskei, “Soft-output Sphere Decoding: Algorithms and VLSI Implementation,” IEEE Journal on Selected Areas in Communications, 2008.
[16] E. M. Witte, F. Borlenghi, G. Ascheid, R. Leupers, and H. Meyr, “A Scalable VLSI Architecture for Soft-input Soft-output Depth-first Sphere Decoding,” IEEE Transactions on Circuits and Systems II: Express Briefs, 2010.
[17] Z. Guo and P. Nilsson, “Algorithm and Implementation of the K-best Sphere Decoding for MIMO Detection,” IEEE Journal on Selected Areas in Communications, 2006.
[18] B. Wu and G. Masera, “A Novel VLSI Architecture of Fixed-complexity Sphere Decoder,” in 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools (DSD), 2010.
[19] A. Burg, M. Borgmann, M. Wenk, M. Zellweger, W. Fichtner, and H. Bölcskei, “VLSI Implementation of MIMO detection using the sphere decoding algorithm,” IEEE Journal of Solid-State Circuits, 2005.