2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines
FLEXDET: Flexible, Efficient Multi-Mode MIMO Detection using reconfigurable ASIP
Xiaolin Chen, Andreas Minwegen, Yahia Hassan, David Kammler, Shuai Li, Torsten Kempf, Anupam Chattopadhyay, Gerd Ascheid
Institute for Communication Technologies and Embedded Systems, RWTH Aachen University, Aachen, Germany
and WiMAX. An in-depth analysis of most of the proposed MIMO detection algorithms [1][4][5][6][7][8], both non-iterative and iterative, shows that a large portion of them can be implemented using matrix operations such as matrix/vector multiplication and addition, matrix inversion, and conjugation. This observation leads us to search for a way to accelerate the matrix operations shared by different MIMO detection algorithms. Matrix operations have been implemented on SIMD ASIPs [15][16]. Because the SIMD data path is one-dimensional (1-d) in both its computation and storage resources, it cannot fully exploit the parallelism within matrix operations. Considerable memory space and many clock cycles are consumed by storing and transferring intermediate matrix results. Compared with ASIPs, 2-d architectures like Coarse-Grained Reconfigurable Architectures (CGRAs) are usually better suited to exploiting spatial parallelism at the matrix level and support matrix operations efficiently [3][10]. Figure 1 illustrates a CGRA tailored for matrix operations, and Figure 2 shows two examples, a matrix-matrix and a matrix-vector multiplication, mapped onto this CGRA. Each processing element (PE) of the CGRA can perform a multiply-accumulate (MAC) operation and has a local storage resource. As the examples show, with the designed global and local interconnects, the matrix operations can be realized in a few cycles with an acceptable data path complexity. With the local storage resource in the PEs, a computed intermediate matrix can be stored topographically in the PEs and used for concatenated matrix operations without transferring the intermediate results out of the CGRA. For different detection algorithms, the control path differs; it includes, e.g., the loading of data and the scheduling of matrix operations onto the CGRA.
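As a rough illustration of how a 2-d array of MAC-capable PEs can compute a matrix product in a few cycles, the following Python sketch models one possible schedule. The schedule is an assumption made for illustration and is not taken from Figure 2: in cycle k, column k of the first matrix is broadcast along the rows, row k of the second matrix is broadcast along the columns, and every PE accumulates one product in its local storage. The function name `pe_array_matmul` is hypothetical.

```python
# Sketch only: models an n x n array of MAC processing elements (PEs).
# The broadcast schedule is an assumed one, not the mapping of Figure 2.

def pe_array_matmul(a, b):
    """Multiply two n x n matrices using one outer-product MAC step per cycle.

    Returns the product and the number of modeled cycles.
    """
    n = len(a)
    # acc[i][j] stands for the local storage of PE(i, j).
    acc = [[0j] * n for _ in range(n)]
    cycles = 0
    for k in range(n):
        col_of_a = [a[i][k] for i in range(n)]  # broadcast row-wise
        row_of_b = b[k]                          # broadcast column-wise
        # In hardware all PEs would perform this MAC in the same cycle.
        for i in range(n):
            for j in range(n):
                acc[i][j] += col_of_a[i] * row_of_b[j]
        cycles += 1
    return acc, cycles
```

Under this assumed schedule a 4 × 4 product takes four modeled cycles, and the result stays in the PE-local storage, where a following matrix operation could consume it directly.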
To keep the flexibility of handling the variations in the control path of different algorithms, the concept of partially reconfigurable ASIP (rASIP) [9] is used, which combines a processor with the CGRA. The CGRA takes care of most of the computation tasks by performing the required matrix operations, while
Abstract—This paper describes the implementation of a multi-mode MIMO detector based on the concept of the partially reconfigurable ASIP (rASIP). The multi-mode detector supports three different detection algorithms: Maximum Ratio Combining, linear Minimum Mean Square Error (MMSE) detection, and MMSE Successive Interference Cancellation. The detection algorithms also support different antenna configurations and modulation schemes. The rASIP is based on a Coarse-Grained Reconfigurable Architecture (CGRA), which is designed for efficient architectural support of matrix operations. A matrix inversion algorithm, which is used for the preprocessing of the different detection algorithms, is mapped onto the CGRA. By integrating a processor with the CGRA, the variations in the control path of different algorithm configurations can be handled efficiently. To the best of our knowledge, we show for the first time that CGRA-based multi-mode MIMO detection is extremely efficient and matches the performance of a dedicated ASIC implementation.
Keywords-MIMO Detection; CGRA; reconfigurable ASIP
I. INTRODUCTION
Future wireless communication terminals tend to become multi-mode, multifunctional devices. The systems have to be cognitive of changing environmental conditions as well as adaptive to variable user demands. Flexibility is becoming more important than ever. Considering the battery-powered nature of wireless terminals and the real-time constraints of wireless standards, the flexibility required to support multiple modes and access technologies in future cognitive wireless systems has to be traded off against the performance and energy efficiency of the systems. One way to reconcile these competing demands is to use programmable Application-Specific Instruction-set Processors (ASIPs), which usually have heavily customized micro-architectures and instruction set architectures to deliver enhanced performance and energy efficiency for a certain application domain. The programmability of such processors ensures that new features can be incorporated in the form of new software. However, for the application domain of MIMO detection targeted in this paper, a single ASIP sometimes cannot meet the real-time constraints due to the large amount of data processing, especially in next-generation mobile communication standards like LTE
978-0-7695-4699-5/12 $26.00 © 2012 IEEE DOI 10.1109/FCCM.2012.22
Figure 1. Base CGRA for matrix operations
Figure 2. Matrix/Vector multiplications mapped on the base CGRA
In [15], the linear MMSE detector is implemented using a programmable baseband processor aimed at software-defined radio (SDR). The processor can be considered an ASIP that contains a 4-way SIMD Complex Multiply-Accumulate (CMAC) unit. The implementation of the linear MMSE detector supports antenna configurations between 2 × 2 and 4 × 4. The preprocessing of the detector is based on direct matrix inversion. In [16], a VLIW-like processor is used to implement the computation of the linear MMSE equalization matrix. The data path of the VLIW-based processor is equipped with a 2-way SIMD CMAC unit, a complex ALU unit, and a real-valued division unit. A divide-and-conquer matrix inversion algorithm is applied in the preprocessing to compute the equalization matrix. Both architectures use a SIMD-based data path consisting of CMAC units to process the large number of CMAC operations within the matrix operations. However, with a SIMD-based data path, the spatial parallelism within the matrix/vector operations cannot be fully exploited. As the above analysis shows, there have been some implementations of multi-mode MIMO detectors based on flexible architectures, e.g. ASIP or VLIW. Given the inherent linear algebra operations of MIMO detection algorithms, it is natural to leverage CGRA-like structures for maximum performance benefit. To the best of our knowledge, a structure such as the one proposed in this paper has not been reported in the literature so far. We show, for the first time, that CGRA-based multi-mode detection is extremely efficient, matches the performance of a dedicated ASIC implementation, and still allows a high degree of flexibility. We borrow the Nucleus concept [11] to design the core kernels of the algorithms using a CGRA. We believe that such a CGRA-based implementation of matrix operations will be useful in general for wireless and broadband algorithms. In summary, the contributions of this work are as follows.
the processor is responsible for the scheduling of the required matrix operations onto the CGRA as well as other control-related tasks and low-complexity computations. We demonstrate that a multi-mode MIMO detector supporting multiple detection algorithms can be implemented efficiently using the rASIP architecture. The paper is organized as follows. Section II surveys the related work and states the contribution. Section III briefly introduces the target algorithms supported by the multi-mode detector. Section IV describes the proposed CGRA for matrix operations and the mapping of a matrix inversion algorithm onto it. Section V introduces the other components that are integrated with the CGRA and describes the design of the rASIP (i.e. FLEXDET). Section VI shows the gate-level synthesis results and analyzes the performance of FLEXDET in comparison with other implementations. Section VII concludes the paper.
(1) A CGRA architecture is proposed for efficient support of matrix operations used in wireless baseband processing.
(2) A direct matrix inversion algorithm is mapped onto the proposed CGRA.
(3) An rASIP architecture, FLEXDET, is designed, which combines the proposed CGRA with a processor to support multiple detection algorithms, antenna configurations, and modulation schemes in one architecture.
II. RELATED WORK AND CONTRIBUTION
Most architectures proposed in the literature for multi-mode MIMO detectors are ASIC implementations that target one detection algorithm (e.g. the sphere decoder) and support different antenna configurations and modulation schemes [13][14]. A few have attempted to use flexible architectures [15][16].
III. INTRODUCTION TO TARGET ALGORITHMS
In this paper, three detection algorithms widely used in MIMO systems are targeted: Maximum Ratio Combining (MRC), linear Minimum Mean Square Error (MMSE) detection, and MMSE Successive Interference Cancellation (MMSE-SIC). MRC supports antenna configurations of 1 × 2 and 1 × 4. MMSE and MMSE-SIC support antenna configurations from 2 × 2 to 4 × 8. All algorithms support the modulation schemes BPSK, 16QAM, and 64QAM. The purpose of supporting multiple detection algorithms and configurations in one architecture is to allow the terminal to be tuned to the most suitable detection algorithm according to the changing environment and the performance requirements of the user. Consider a MIMO system with M transmit and N receive antennas. The received symbol vector y is given by
$$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n} \tag{1}$$
where H is the complex channel matrix of dimension N × M, x is the transmitted symbol vector, and the vector n models the thermal noise in the system as i.i.d. proper complex Gaussian with variance N0. MRC is applicable in the case of M = 1, in which the channel matrix H is a column vector and x is a complex symbol. The estimate of the transmitted symbol, $\hat{x}$, given by MRC equalization [4] is computed by Equation 2, which needs two vector dot products and one reciprocal operation.
$$\hat{x} = (\mathbf{H}^H \mathbf{H})^{-1} \mathbf{H}^H \mathbf{y} \tag{2}$$
For the cases of multiple transmit and receive antennas, the estimated transmit symbol vector $\hat{\mathbf{x}}$ using the linear MMSE equalization [5] is computed using Equation 3.
$$\hat{\mathbf{x}} = (\mathbf{H}^H \mathbf{H} + N_0 \mathbf{I})^{-1} \mathbf{H}^H \mathbf{y} \tag{3}$$
The linear MMSE detection is more complex than MRC. It includes a matrix multiplication and a matrix inversion to compute the M × M matrix $(\mathbf{H}^H \mathbf{H} + N_0 \mathbf{I})^{-1}$. The inverted matrix is then multiplied with the Hermitian transpose of the matrix H and the received symbol vector y, which adds two matrix-vector multiplications. The MMSE-SIC detection is the most complex of the three algorithms and can be implemented in different ways. In this paper, V-BLAST MMSE [6] is used for the implementation of MMSE-SIC. In this scheme, the detection is done through a number of iterations. In each iteration, one symbol of the transmitted symbol vector x is detected using the linear MMSE detector. The interference introduced by the detected symbols is successively removed for the detection of the following symbols. The advantage of the MMSE-SIC detection is that it can be applied in an iterative receiver, where extra computational complexity is paid to achieve better algorithm performance. As can be seen, the complexity of the target algorithms is extremely high considering the matrix inversion and the repeated matrix multiplications. However, the algorithms can be clearly divided into regular matrix operations, which fits the idea of this paper well.
IV. PROPOSED COARSE-GRAINED RECONFIGURABLE ARCHITECTURE
The proposed CGRA is based on the CGRA in Figure 1, with the extensions necessary to efficiently support matrix inversion. The structure is depicted in Figure 3. The CGRA is modeled using the high-level description language proposed in [9], and the RTL model of the CGRA is generated from this high-level model.
Figure 3. Structure of the Proposed CGRA
The basic block of the CGRA is the PE (Figure 4). It includes three functional units (FUs): a complex multiplier, a complex ALU, and a shifter. With these FUs, the PE can be configured to realize more complex functionality, e.g. CMAC. A local register file is also included in the PE to store the results generated by the FUs. The output of the PE can be either the result of the FUs or a value stored in the register file.
Figure 4. Conceptual Diagram of the PE
Four PEs are used to construct a PE 2x2 cluster. A mesh connection is used between the PEs to pass the result of one PE to its two neighbor PEs. The output of the PE 2x2 cluster can come from any of its PEs. According to the supported antenna configurations, the maximum dimension of matrix operations is 4. Therefore, four PE 2x2 clusters are used to construct a 4 × 4 PE array analogous to Figure 1, which is used to perform matrix operations within this dimension. In order to efficiently support matrix inversion, the CGRA is extended with two PE chain clusters and one so-called
Center Alpha unit. A PE chain cluster, which has two PEs in a row, is inserted between each upper and lower pair of PE 2x2 clusters. The results of its PEs can be passed to each other or used as the outputs of the PE chain. The four PEs in the two PE chains form a 1 × 4 PE vector, which can perform vector operations independently of the matrix operations on the 4 × 4 PE array. The Center Alpha unit is positioned between the two PE chains and is used to calculate a special alpha parameter for the matrix inversion algorithm [10] mapped onto the CGRA. The CGRA has 4 input ports and 5 output ports. Four of the output ports deliver the results from the PEs in the PE chains. The fifth output port delivers the result of the Center Alpha unit. The global interconnect is similar to Figure 1. The row-wise and column-wise broadcasting signals can propagate four data elements d1 to d4 to the PEs in the 4 × 4 PE array. These four data elements can come from the four inputs of the CGRA, the results of the four PEs in the PE chains, or the result of the Center Alpha unit. Each PE chain can receive the results from its upper and lower PE 2x2 clusters and pass them to its internal PEs. The Center Alpha unit receives one output result from the PE chain on each side. The result of the Center Alpha unit can be passed to the PE 2x2 clusters and used by the 16 PEs inside. Figure 5 shows the matrix inversion algorithm [10] used to compute $(\mathbf{H}^H \mathbf{H} + N_0 \mathbf{I})^{-1}$. In this algorithm, a matrix INV is computed through a number of iterations (equal to the number of receive antennas) together with a real-valued parameter alpha. The matrix INV and the parameter alpha are first initialized to an identity matrix and N0, respectively. During each iteration, one row of the matrix H is used to update the matrix INV and the parameter alpha. After the iterations, the final inverted matrix is computed by scaling the matrix INV using the parameter alpha.
The mapping of a 4 × 4 matrix inversion on the CGRA is shown below the algorithm in Figure 5. Only the operations in the first 4 cycles are shown, which perform the initialization and the first iteration. The matrix INV is stored topographically in the PEs of the 4 × 4 PE array. For each iteration, 3 cycles are needed to compute the updated matrix INV and parameter alpha, and in total 13 cycles are needed to compute a 4 × 4 matrix inversion from the matrix H.
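The iteration structure described above (initialize INV to the identity and alpha to N0, consume one row of H per iteration, scale by alpha at the end) can be sketched in plain Python as follows. The update rule used here is a division-free Sherman-Morrison rank-one update, chosen as an assumption for illustration; the exact recurrence of [10] shown in Figure 5 may differ. The function names `mat_vec`, `iterative_inverse`, and `mmse_equalize` are hypothetical, and the sketch models the arithmetic only, not the cycle-level mapping.

```python
# Sketch only: division-free rank-one (Sherman-Morrison) updates, assumed
# for illustration; the exact recurrence of [10] / Figure 5 may differ.

def mat_vec(m, v):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def iterative_inverse(h, n0):
    """Return (H^H H + N0 I)^-1 for an N x M channel matrix h (list of rows)."""
    m = len(h[0])
    # Initialization: INV = identity, alpha = N0, so INV / alpha = (N0 I)^-1.
    inv = [[1.0 + 0j if i == j else 0j for j in range(m)] for i in range(m)]
    alpha = float(n0)
    for row in h:  # one iteration per receive antenna (row of H)
        u = [x.conjugate() for x in row]        # u = (row of H)^H
        v = mat_vec(inv, u)                     # v = INV u
        s = sum(ui.conjugate() * vi for ui, vi in zip(u, v)).real
        # INV / alpha remains the exact inverse after adding u u^H.
        inv = [[(alpha + s) * inv[i][j] - v[i] * v[j].conjugate()
                for j in range(m)] for i in range(m)]
        alpha *= alpha + s
    # Final step: scale INV by the real-valued parameter alpha.
    return [[x / alpha for x in r] for r in inv]

def mmse_equalize(h, y, n0):
    """Linear MMSE estimate (H^H H + N0 I)^-1 H^H y, as in Equation 3."""
    m = len(h[0])
    hh_y = [sum(h[r][c].conjugate() * y[r] for r in range(len(h)))
            for c in range(m)]
    return mat_vec(iterative_inverse(h, n0), hh_y)
```

Because each iteration only needs matrix-vector products, an outer product, and a scalar update, the loop body consists of the kind of regular operations that the PE array, the PE chains, and the Center Alpha unit are described as supporting. In this floating-point sketch alpha grows with every iteration; a fixed-point implementation would need to manage that dynamic range, which the sketch does not model.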
Figure 5. Matrix Inversion Algorithm and the Mapping on the CGRA
Figure 6. Integrated Multi-mode Detection Architecture
V. SYSTEM INTEGRATION
With the proposed CGRA, the matrix operations of the algorithms can be mapped. In order to use the CGRA efficiently for different algorithms, it is integrated with extra components, as shown in Figure 6: a data register file for the data processed by the CGRA, the configuration memory of the CGRA, an LLR block to calculate the log-likelihood ratio (LLR) values, a minimum search and quantization block (used by the MMSE-SIC detection), and an MMSE Mask block (used by the linear MMSE detection). These components are modeled at register transfer level and integrated with the generated RTL model of the CGRA. Since all the target algorithms can be realized by the integrated system of the CGRA and the above-mentioned components, this integrated system is named the Multi-mode Detection Architecture (MDA) in this paper. To handle the variations in the control path of different algorithms, an rASIP, i.e. FLEXDET, is designed by integrating a processor with the MDA. Details of the integrated components and the system integration are described as follows.
A. Data Register File
The structure of the data register file is depicted in Figure 7, which includes registers for the channel matrix H, the received symbol vector y, and special constants. The H registers are organized into a 4 × 8 register array, which is used to store the H matrix for the maximum supported antenna configuration of 4 × 8. For the requirement of the
block computes the LLR values for each of the estimated symbols, and stores them in the LLR registers.
D. Minimum Search and Quantization, MMSE Mask Block
The Minimum Search and Quantization block is used in the MMSE-SIC detection to search for the equalized symbol with the minimum noise power. The symbol is then quantized for successive interference cancellation. The MMSE Mask block is used in the linear MMSE detection to generate a mask signal for the LLR block. This mask signal indicates to the LLR block which symbols in the interface registers are valid for LLR calculation. For the MMSE-SIC case, the mask signal is computed from the index of the symbol with minimum noise.
Figure 7. Structure of the Data Register File
algorithms, the H register array is designed to be accessible per row/column. Since the CGRA has four data input ports, a maximum of four data elements from the data register file can be passed to the CGRA as input data. When the H registers are accessed per row, the four data elements of the row can be passed simultaneously to the CGRA. For the access per column, the eight elements in one column of the H registers are divided into upper and lower four data elements, which can be passed separately to the CGRA. The access to the registers for the received symbol vector y or the special constants is similar to one column of the H registers.
E. Processor Core
For different algorithms, the control paths differ. For example, although different detection algorithms can be implemented by a similar group of matrix operations, the scheduling of these operations differs between the algorithms. Even for the same detection algorithm, different antenna configurations or modulation schemes can result in different control paths, e.g. different numbers of iterations to perform the matrix inversion on the CGRA. For this reason, the rASIP concept is used: a processor is integrated with the MDA to efficiently implement the control paths of different algorithms. The main control tasks of the processor are the following:
(1) Update the data in the storage components of the MDA
(2) Control the different components in the MDA to perform the algorithm processing
Based on these two tasks, the integration of the MDA with the processor is depicted in Figure 8. The integration is assisted by Synopsys Processor Designer [18]. The whole system consists of three parts: the processor, the MDA, and a data memory with wide bandwidth. The starting point is a RISC processor with a 5-stage pipeline modeled in LISA [12]. A wrapper of the MDA is modeled in LISA and integrated with the LISA model of the RISC processor into the rASIP. A register interface is modeled for the communication of data and control information between the RISC processor and the MDA. The rASIP RTL model is generated by Processor Designer and includes the processor, the wrapper of the MDA, and the register interface. The MDA RTL model is wrapped by the wrapper, and the rASIP model is simulated at register transfer level. Using Processor Designer, special instructions are added to implement the two control tasks mentioned above. For the first control task, a 128-bit wide memory access is added to the RISC processor to allow memory load/store with the wide data memory. The wide data memory stores the processing data (e.g. the channel matrix H, the received symbol vector y etc.), the configuration bits of the CGRA,
B. Configuration Memory
The configuration memory is an SRAM which can store 32 configuration words of the CGRA. Each configuration word is 393 bits wide and can configure the whole CGRA to perform a required matrix operation. A 393-bit wide configuration register is used to store the configuration word for the operation running on the CGRA. Loading of a configuration word from the configuration memory to the configuration register is controlled by the processor. The content of the configuration register can be updated on a cycle basis, so that the CGRA can be reconfigured to perform a different matrix operation every cycle. Currently, the configuration words for the required matrix operations are generated manually. Because the matrix operations performed in different detection algorithms are similar, these configuration words are shared by the different detection algorithms. If a new matrix operation is needed, the corresponding configuration word can be loaded into the configuration memory by the processor at run-time. Dynamic reconfiguration is therefore supported [2].
C. LLR Block
The LLR block is used to calculate the soft outputs for the estimated symbols. After the CGRA computes the estimated symbol vector $\hat{\mathbf{x}}$, it outputs the symbol vector together with other additional parameters into the interface registers between the CGRA and the LLR block. The LLR
When the synthesis goes up in hierarchy, the synthesized designs of the lower hierarchy are used to construct the modules of the higher hierarchy. After the synthesis is done, the configuration bit-stream of each operation in all supported algorithms is applied to the configuration port of the CGRA using special commands of Design Compiler, and the timing analysis is performed. The maximum frequency is determined by the longest path among all the possible bit configurations. The gate-level synthesis results of FLEXDET using Faraday 65nm technology, 1V, standard performance, are shown in Table I. The CGRA was designed based on a fixed-point exploration of the target algorithms, in which a 22-bit data word is used for each real-valued number in the computation. This explains the large area of the CGRA in Table I. The maximum achievable frequency of the CGRA for all supported detection algorithms is 1/3.8ns = 263 MHz.

Table I SYNTHESIS RESULT OF DIFFERENT MODULES IN FLEXDET

| Name of Design | Area [k GEs] | Timing [ns] |
|---|---|---|
| CGRA | 534.134 | 3.8 |
| LLR block | 24.442 | 2.0 |
| Configuration Memory | 32.85 | — |
| Processor | 30.5 | 2.8 |

Figure 8. Integration of the MDA with Processor
and the LLR values from the MDA. For a memory read, the read request is sent at the EX stage. The data read from the memory and some control signals are sent at the WB stage to the interface registers between the RISC processor and the MDA. The data is then addressed by the control signals and stored into the corresponding storage component in the MDA. For a memory store, the LLR values are read from the MDA and stored into the wide data memory. To this end, the LLR index is sent to the MDA at the DC stage, so that the corresponding LLR values are read from the LLR registers and stored into the interface register. The LLR values are sent to the wide memory at the WB stage. For the second control task, two pieces of control information need to be sent to the MDA. One is the address of the configuration word in the configuration memory that is to be loaded into the CGRA to perform a specific matrix operation. The other is the set of control bits that coordinate the different components in the MDA to perform the algorithm processing. In the current implementation, both pieces of control information are decoded from the instructions and passed to the interface registers. The special instructions are designed by combining both control tasks in one instruction, so that both operations are executed simultaneously. Using these instructions, the MDA is controlled by the processor to continuously process the data in the data register file. At the same time, data in the data register file or the configuration memory can be updated during the algorithm processing. The LLR values can be read out and stored into the memory once they are computed.
B. Performance
Based on the synthesis results, performance numbers of FLEXDET for the equalization of each algorithm configuration are shown in Tables II, III, and IV. For the target algorithms, different modulation schemes only affect the LLR calculation. Therefore, the cycle counts for the equalization do not differ between modulation schemes for the same algorithm and antenna configuration. The throughput in million symbols per second is defined by Equation 4.
$$\mathrm{Throughput} = \frac{\#\mathrm{Tx.Ant.} \times \mathrm{Clk.Freq.}}{\#\mathrm{cycles}} \tag{4}$$
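As a quick check of Equation 4, the following Python sketch recomputes the linear MMSE throughput values from the cycle counts in Table III and the 263 MHz clock. It assumes that the first number of each antenna configuration is the number of transmit antennas M, consistent with Section III; the function and variable names are illustrative only.

```python
# Sketch of Equation 4; function and variable names are illustrative.

def throughput_msymb_per_s(num_tx_antennas, clock_mhz, cycles):
    """Throughput in million symbols per second (Equation 4)."""
    return num_tx_antennas * clock_mhz / cycles

CLOCK_MHZ = 263  # maximum CGRA frequency from the synthesis results

# (transmit antennas, cycle count) per linear MMSE configuration (Table III)
mmse_cycles = {
    "2x2": (2, 10),
    "2x4": (2, 19),
    "2x8": (2, 37),
    "4x4": (4, 16),
    "4x8": (4, 31),
}

mmse_throughput = {
    cfg: round(throughput_msymb_per_s(tx, CLOCK_MHZ, cyc), 2)
    for cfg, (tx, cyc) in mmse_cycles.items()
}
```

With these inputs the 4 × 4 configuration evaluates to 4 × 263 / 16 = 65.75 MSymb/s, and the other entries round to the throughput values reported in Table III.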
MRC has the same cycle count for the 1 × 2 and 1 × 4 cases because the CGRA is capable of handling vector operations up to a dimension of 4. For the linear MMSE detection, the 2 × 4 case needs roughly twice the cycle count of the 2 × 2 case (19 versus 10 cycles). This is because the mapped matrix inversion algorithm uses twice the number of iterations for the 2 × 4 case compared to the 2 × 2 case to compute the 2 × 2 matrix inversion. A similar explanation applies to the other antenna configurations. Because of the complexity of the V-BLAST algorithm, the MMSE-SIC takes many more cycles than the other two algorithms. Take the 4 × 4 linear MMSE detection as an example. FLEXDET needs 16 cycles to compute the equalization, in which 13 cycles are used to compute the 4 × 4 matrix inversion from the corresponding 4 × 4 H matrix and 3 cycles
VI. RESULTS
In this section, the synthesis result and the performance of FLEXDET are shown, and compared with some other implementations in literature.
A. Gate-level Synthesis
The synthesis of the CGRA is done hierarchically using Synopsys Design Compiler [19]. Constraints are set for the synthesis of modules in the lowest hierarchy (e.g. PEs).
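Table II PERFORMANCE OF DIFFERENT CONFIGURATIONS OF MRC

| Configuration | Cycle Count, BPSK | Cycle Count, 16QAM | Cycle Count, 64QAM | Throughput (Msymb./s) |
|---|---|---|---|---|
| 1x2 | 7 | 7 | 7 | 37.57 |
| 1x4 | 7 | 7 | 7 | 37.57 |

Table III PERFORMANCE OF DIFFERENT CONFIGURATIONS OF MMSE

| Configuration | Cycle Count, BPSK | Cycle Count, 16QAM | Cycle Count, 64QAM | Throughput (Msymb./s) |
|---|---|---|---|---|
| 2x2 | 10 | 10 | 10 | 52.6 |
| 2x4 | 19 | 19 | 19 | 27.68 |
| 2x8 | 37 | 37 | 37 | 14.22 |
| 4x4 | 16 | 16 | 16 | 65.75 |
| 4x8 | 31 | 31 | 31 | 33.94 |

Table IV PERFORMANCE OF DIFFERENT CONFIGURATIONS OF MMSE-SIC

| Configuration | Cycle Count, BPSK | Cycle Count, 16QAM | Cycle Count, 64QAM | Throughput (Msymb./s) |
|---|---|---|---|---|
| 2x2 | 25 | 25 | 25 | 21.04 |
| 2x4 | 43 | 43 | 43 | 12.23 |
| 2x8 | 79 | 79 | 79 | 6.65 |
| 4x4 | 77 | 77 | 77 | 13.66 |
| 4x8 | 137 | 137 | 137 | 7.68 |

Table V COMPARISON WITH FLEXIBLE IMPLEMENTATIONS IN [15] AND [16]

| Compute $\mathbf{G} = (\mathbf{H}^H \mathbf{H} + N_0 \mathbf{I})^{-1} \mathbf{H}^H$ | FLEXDET | VLIW [16] | ASIP [15] |
|---|---|---|---|
| Cycle count, 2×2 | 9 | 44 | N/A |
| Cycle count, 4×4 | 17 | 166/2=83 | 202 |
| Clock Frequency | 263 MHz @ 65nm | 250 MHz @ 180 nm | 400 MHz @ 65nm |
| Throughput (MSymb/s), 2×2 | 58.44 | 31.46 (1) | N/A |
| Throughput (MSymb/s), 4×4 | 61.88 | 33.36 (1) | 7.92 |
| Area (k GE) | 597 | 383 | 90 |

(1) Using clock frequency scaled to 65nm technology

processing of one symbol in second (i.e. the inverse of the throughput).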
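$$\frac{AT_{\mathrm{VLIW},2\times 2}}{AT_{\mathrm{FLEXDET},2\times 2}} = \frac{383 \times 58.44}{597 \times 31.46} = 1.19 \tag{5}$$
$$\frac{AT_{\mathrm{VLIW},4\times 4}}{AT_{\mathrm{FLEXDET},4\times 4}} = \frac{383 \times 61.88}{597 \times 33.36} = 1.19 \tag{6}$$
$$\frac{AT_{\mathrm{ASIP},4\times 4}}{AT_{\mathrm{FLEXDET},4\times 4}} = \frac{90 \times 61.88}{597 \times 7.92} = 1.18 \tag{7}$$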
For the compared configurations, FLEXDET is about 19% (VLIW, 2 × 2), 19% (VLIW, 4 × 4), and 18% (ASIP, 4 × 4) more efficient in terms of AT product. FLEXDET also supports more configurations, whereas the ASIP and VLIW implementations have not been evaluated with all of the algorithms.
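The AT-product ratios can be reproduced with a few lines of Python. The sketch below uses the area and throughput values of Table V and takes the latency T as the inverse of the throughput, as defined in the text; the function names `at_product` and `at_ratio` are illustrative.

```python
# Sketch reproducing the AT-product ratios; names are illustrative.

def at_product(area_kge, throughput_msymb_s):
    """Area-time product, with latency per symbol taken as 1 / throughput."""
    return area_kge / throughput_msymb_s

def at_ratio(other, flexdet):
    """AT product of another design divided by the AT product of FLEXDET."""
    return at_product(*other) / at_product(*flexdet)

# (area in k GE, throughput in MSymb/s), values from Table V
flexdet_2x2, flexdet_4x4 = (597, 58.44), (597, 61.88)
vliw_2x2, vliw_4x4 = (383, 31.46), (383, 33.36)
asip_4x4 = (90, 7.92)

ratios = [
    round(at_ratio(vliw_2x2, flexdet_2x2), 2),
    round(at_ratio(vliw_4x4, flexdet_4x4), 2),
    round(at_ratio(asip_4x4, flexdet_4x4), 2),
]
```

A ratio above 1 means the compared design has the larger AT product, i.e. FLEXDET is the more efficient one for that configuration.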
are used to do the equalization using the inverted matrix. FLEXDET can also be configured to work in another mode, in which 17 cycles are needed to compute the equalization matrix $\mathbf{G} = (\mathbf{H}^H \mathbf{H} + N_0 \mathbf{I})^{-1} \mathbf{H}^H$ and 1 cycle is used for the equalization using G. The second mode can be easily applied in a packet-based wireless standard like IEEE 802.11n, in which multiple symbol vectors can share the same channel matrix H estimated from pilot symbols. Based on the implementation results, FLEXDET is compared in the following with two flexible implementations [15][16] and an ASIC implementation [17] of a 4 × 4 linear MMSE detector. Because the comparison concerns the equalization task, the area of the LLR block in FLEXDET is not considered.
D. ASIC-based Implementation
A fair comparison of FLEXDET with a dedicated ASIC implementation could be performed by combining individual ASIC implementations of all the different detection algorithms. However, to the best of our knowledge, there are no ASIC implementation results in the literature that apply similar V-BLAST MMSE and MRC algorithms. Therefore, FLEXDET is only compared here with the ASIC implementation of the linear MMSE detector in [17]. In [17], a matrix inversion algorithm similar to the one in this paper is used for MIMO preprocessing to compute the equalization matrix. Both a pipelined and a non-pipelined implementation are proposed. Since pipelining of the operations mapped onto the CGRA is not used in the current implementation of FLEXDET, the non-pipelined ASIC is selected for comparison. Table VI shows the implementation results of both architectures for the 4 × 4 linear MMSE equalization including the matrix inversion. FLEXDET achieves a higher throughput than the ASIC implementation for this computation task, but also occupies more area. This is because FLEXDET uses more multipliers to exploit more parallelism than the ASIC implementation. To achieve the same throughput, the area of the ASIC implementation is scaled to 285.14 k GE, which is approximately half the area of FLEXDET.
C. Flexible Implementations
In [15] and [16], two flexible architectures of ASIP and VLIW are used for the computation of the linear MMSE equalization matrix. Both implementations apply direct matrix inversion for the computation. Implementation results of 2 × 2 and 4 × 4 in [16] and 4 × 4 in [15] are compared with FLEXDET. The throughput of FLEXDET is higher than the other two implementations for the target antenna configurations, while the area requirement is also higher. We use the AT-product to evaluate the efficiency of different implementations, in which A is the chip area and T is the latency for the
R EFERENCES
[1] E. Dahlman, S. Parkvall, J. Sköld, and P. Beming, 3G Evolution: HSPA and LTE for Mobile Broadband, 2nd Edition, 2008.
[2] H. Singh, M.H. Lee, G. Lu, F. Kurdahi, N. Bagherzadeh, and E. Chaves Filho, MorphoSys: an integrated reconfigurable system for data-parallel and computation-intensive applications, in IEEE Trans. on Computers, vol. 49, no. 5, May 2000.
[3] K. Patel and C. Bleakley, Systolic Algorithm Mapping for Coarse Grained Reconfigurable Array Architectures, in Reconfigurable Computing: Architectures, Tools and Applications, 2010.
[4] A. Shah and A.M. Haimovich, Performance Analysis of Maximal Ratio Combining and Comparison with Optimum Combining for Mobile Radio Communications with Cochannel Interference, in IEEE Trans. on Vehicular Technology, vol. 49, no. 4, July 2000.
[5] J.H. Winters, Optimum Combining in Digital Mobile Radio with Cochannel Interference, in IEEE Trans. on Vehicular Technology, vol. VT-33, no. 3, August 1984.
[6] P.W. Wolniansky, G.J. Foschini, G.D. Golden, and R.A. Valenzuela, V-BLAST: An Architecture for Realizing Very High Data Rates Over the Rich-Scattering Wireless Channel, in URSI International Symposium on Signals Systems and Electronics, 1998.
[7] B. Farhang-Boroujeny, H. Zhu, and Z. Shi, Markov Chain Monte Carlo Algorithms for CDMA and MIMO Communication Systems, in IEEE Trans. on Signal Processing, vol. 54, no. 5, 2006.
[8] C. Studer, S. Fateh, and D. Seethaler, ASIC Implementation of Soft-Input Soft-Output MIMO Detection Using MMSE Parallel Interference Cancellation, in IEEE Journal of Solid-State Circuits, vol. 46, no. 7, 2011.
[9] K. Karuri, A. Chattopadhyay, X. Chen, D. Kammler, L. Hao, R. Leupers, H. Meyr, and G. Ascheid, A Design Flow for Architecture Exploration and Implementation of Partially Reconfigurable Processors, in IEEE Trans. on VLSI Systems, vol. 16, no. 10, Oct 2008.
[10] I. LaRoche and S. Roy, An Efficient Regular Matrix Inversion Circuit Architecture for MIMO Processing, in IEEE International Symposium on Circuits and Systems, 2006.
[11] V. Ramakrishnan, M. Witte, T. Kempf, D. Kammler, G. Ascheid, R. Leupers, H. Meyr, M. Adrat, and M. Antweiler, Efficient and Portable SDR Waveform Development: The Nucleus Concept, in IEEE Military Communications Conference, 2009.
[12] A. Hoffmann, O. Schliebusch, A. Nohl, G. Braun, and H. Meyr, A Methodology for the Design of Application Specific Instruction Set Processors (ASIP) Using the Machine Description Language LISA, in Proceedings of the International Conference on Computer Aided Design (ICCAD), Nov. 2001.
[13] R. Shariat-Yazdi and T. Kwasniewski, A Multi-mode Sphere Detector Architecture for WLAN Applications, in SOC Conference, 2008.
[14] M. Witte, F. Borlengi, G. Ascheid, R. Leupers, and H. Meyr, A Scalable VLSI-Architecture for Soft-input Soft-output Single Tree-search Sphere Decoding, in IEEE Trans. on Circuits and Systems II: Express Briefs, vol. Expr Brief, no. 57, 2010.
[15] J. Eilert, D. Wu, and D. Liu, Implementation of a Programmable Linear MMSE Detector for MIMO-OFDM, in IEEE International Conference on Acoustics, Speech and Signal Processing, 2008.
[16] S. Eberli, D. Cescato, and W. Fichtner, Divide-and-Conquer Matrix Inversion for Linear MMSE Detection in SDR MIMO Receivers, in Proceeding of 26th Norchip Conference, 2008.
[17] A. Burg, VLSI Circuits for MIMO Communication Systems, in Dissertation of ETH Zurich, 2006.
[18] Synopsys Processor Designer, http://www.synopsys.com/Systems/BlockDesign/ProcessorDev/Pages/default.aspx
[19] Synopsys Design Compiler, http://www.synopsys.com/Tools/Implementation/RTLSynthesis/Pages/default.aspx
[20] 3GPP TS 36.101; User Equipment (UE) radio transmission and reception (Release 8)

Table VI
COMPARISON WITH ASIC-BASED IMPLEMENTATION IN [17]

| Compute $\hat{x} = (H^H H + N_0 I)^{-1} H^H y$ | FLEXDET | ASIC [17] |
|---|---|---|
| Cycle Count | 16 | 73 |
| Clock Frequency | 263 MHz @ 65 nm | 93 MHz @ 250 nm |
| Throughput (MSymb/s) | 65.75 | 19.6 ¹ |
| Area (kGE) | 597 | 85 |

¹ Using clock frequency scaled to 65 nm technology.
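The area-time (AT) comparison drawn from Table VI can be reproduced with a short calculation. The sketch below is illustrative only: the cycle counts, clock frequencies and areas are taken from Table VI, while the linear frequency scaling with feature size and the factor of 4 symbols per detected vector are assumptions, chosen because they reproduce the tabulated throughput values. The function names are hypothetical.

```python
# Values copied from Table VI; the scaling rule and symbols-per-vector are assumptions.
SYMBOLS_PER_VECTOR = 4  # assumption: 4x4 configuration, one symbol per transmit antenna


def throughput_msymb(cycles, clock_mhz, symbols=SYMBOLS_PER_VECTOR):
    """Throughput in MSymb/s when one symbol vector takes `cycles` clock cycles."""
    return clock_mhz / cycles * symbols


def scale_clock(clock_mhz, from_nm, to_nm):
    """Assumed first-order scaling: clock frequency grows linearly as feature size shrinks."""
    return clock_mhz * from_nm / to_nm


def at_product(area_kge, tput_msymb):
    """Area-time product expressed as area divided by throughput (lower is better)."""
    return area_kge / tput_msymb


flexdet_tput = throughput_msymb(cycles=16, clock_mhz=263)
asic_tput = throughput_msymb(cycles=73, clock_mhz=scale_clock(93, 250, 65))
at_ratio = at_product(597, flexdet_tput) / at_product(85, asic_tput)

print(f"FLEXDET: {flexdet_tput:.2f} MSymb/s, ASIC: {asic_tput:.1f} MSymb/s")
print(f"AT(FLEXDET) / AT(ASIC) = {at_ratio:.2f}")
```

Under these assumptions the ratio of the two AT products comes out slightly above 2.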
Therefore, in terms of AT product, the ASIC implementation is about twice as efficient as FLEXDET for the compared algorithm configuration. Considering that FLEXDET is a flexible implementation, the other half of the area is effectively the price paid for the flexibility to support different configurations of MMSE-SIC and MRC.

E. Comparison with LTE real-time constraint

We now consider the applicability of FLEXDET in a modern mobile communication system such as LTE. In LTE systems, downlink transmission is based on Orthogonal Frequency Division Multiplexing (OFDM), in which data is carried on a number of subcarrier frequencies. In a 0.5 ms slot, each subcarrier can carry a maximum of 7 symbol vectors. For the highest supported LTE bandwidth, a total of 12 × 100 = 1200 subcarriers are allowed [20]. Therefore, within the duration of one slot, a maximum of 7 × 1200 = 8400 symbol vectors need to be processed, which leaves about 0.5 ms / 8400 = 59.5 ns per symbol vector. For a 4 × 4 LTE system using the linear MMSE detection implemented on FLEXDET, the processing time of one symbol vector, including the matrix inversion, is about 3.8 ns × 16 = 60.8 ns (16 cycles at a clock period of about 3.8 ns). This means that, with some further optimization of FLEXDET, the real-time constraint of the highest mode of a 4 × 4 LTE system can be met using linear MMSE detection.

VII. CONCLUSION

In this paper, we presented the implementation of a multi-mode MIMO detector on FLEXDET, an rASIP architecture. It supports three different detection algorithms with different antenna configurations and modulation schemes. The architecture is based on a novel CGRA proposed for efficient architectural support of the matrix operations used in wireless baseband processing. A matrix inversion algorithm is also mapped onto the CGRA. The variations in the control path across the different detection algorithms are handled by the processor of the rASIP architecture.
Comparisons with existing flexible and ASIC implementations show that our implementation is more efficient than the flexible architectures while delivering performance comparable to the ASIC implementation.
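As a supplementary check on the LTE real-time comparison in Section E, the per-vector time budget and the FLEXDET processing time can be recomputed from the figures quoted there. This is a minimal sketch, not part of the original evaluation: the constant names are hypothetical, and the 16 cycles and 263 MHz clock are taken from Table VI.

```python
SLOT_S = 0.5e-3          # LTE slot duration in seconds
VECTORS_PER_SUBCARRIER = 7   # maximum symbol vectors per subcarrier in one slot
SUBCARRIERS = 12 * 100   # subcarriers at the highest supported LTE bandwidth
CYCLES_PER_VECTOR = 16   # linear MMSE detection incl. matrix inversion (Table VI)
CLOCK_HZ = 263e6         # FLEXDET clock frequency at 65 nm (Table VI)

# Number of symbol vectors that must be detected within one slot.
vectors_per_slot = VECTORS_PER_SUBCARRIER * SUBCARRIERS

# Time available per symbol vector, in nanoseconds.
budget_ns = SLOT_S / vectors_per_slot * 1e9

# Time FLEXDET needs per symbol vector, in nanoseconds.
processing_ns = CYCLES_PER_VECTOR / CLOCK_HZ * 1e9

# Clock frequency that would meet the budget with an unchanged cycle count.
required_clock_mhz = CYCLES_PER_VECTOR / (budget_ns * 1e-9) / 1e6

print(f"vectors per slot:   {vectors_per_slot}")
print(f"budget per vector:  {budget_ns:.1f} ns")
print(f"FLEXDET per vector: {processing_ns:.1f} ns")
print(f"clock needed at {CYCLES_PER_VECTOR} cycles: {required_clock_mhz:.1f} MHz")
```

With these numbers the gap is roughly 2%. A clock of about 268.8 MHz at 16 cycles per vector, or an equivalent reduction in cycle count, would close it, which is consistent with the statement that some further optimization is needed.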