Copyright 2010 IEEE, published in Proc. 16th European Wireless Conference (EW 2010), April 2010, Lucca, Italy
Boosting Sphere Decoding Speed Through Graphic Processing Units

Muhammad S. Khairy, Christian Mehlführer, and Markus Rupp
Institute of Communications and Radio-Frequency Engineering, Vienna University of Technology, Vienna, Austria
Email: [email protected], {chmehl,mrupp}@nt.tuwien.ac.at
Abstract—Graphic Processing Units (GPUs) have evolved to provide massive computational power. In contrast to Central Processing Units, GPUs are so-called many-core processors with hundreds of cores capable of running thousands of threads in parallel. This parallel processing power can accelerate the simulation of communication systems. In this work, we utilize NVIDIA's Compute Unified Device Architecture (CUDA) to execute two different sphere decoders on a graphic card. Both flat fading and frequency selective channels are considered. We find that the execution of the soft-sphere decoder can be accelerated by factors of 6-8, and the fixed-complexity sphere decoder even by a factor of 50.

I. INTRODUCTION

Graphic Processing Units (GPUs) are specialized for highly parallel computations. They are built as multithreaded many-core processors with tremendous computational power. This computational power increased by about a factor of ten within the last four years, as shown in Fig. 1. In 2008, GPUs clearly outperformed Central Processing Units (CPUs), which devote many more transistors to data caching and flow control than GPUs [1]. Also, the trend of the last years shows that the computational power of GPUs increases much faster than that of CPUs. Considering that high performance graphic cards are priced similarly to high performance quad core CPUs (in the order of a few hundred Euros), performing simulations on GPUs becomes very attractive from a cost point of view.

Recently, the two major GPU manufacturers NVIDIA and ATI adopted a fully programmable architecture and released software development tools for general purpose computing on their GPUs. NVIDIA presented the Compute Unified Device Architecture (CUDA) [1, 2] and AMD the ATI stream computing [3]. Since NVIDIA has developed scripts capable of parsing CUDA files (.cu) into MATLAB MEX scripts, it is very easy to integrate code executed on the graphic card into existing MATLAB simulation chains. To the authors' knowledge, up to now no similar support is available from ATI. Therefore, we adopted the CUDA architecture for our simulations. Many applications in both academic research and industrial products have been accelerated using CUDA to achieve significant speedups [4]. Such applications fall into a variety of problem domains, including database processing, video and audio processing, financial modeling, gas exploration, numerical linear algebra, product design, medical imaging, and physical simulation [5, 6].

[Figure: peak GFlop/s of NVIDIA GPUs and Intel CPUs from 2003 to 2008; the GPU curve reaches about 1000 GFlop/s with the GT200, while the Intel curve (3.0 GHz Core2 Duo, 3.2 GHz Harpertown) remains far below. GPU legend: GT200 = GeForce GTX280, G92 = GeForce 9800GTX, G80 = GeForce 8800GTX, G71 = GeForce 7900GTX, G70 = GeForce 7800GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX5950 Ultra, NV30 = GeForce FX5800.]

Fig. 1. Peak floating-point operations per second for Intel CPUs and Nvidia GPUs (reproduced from [1, pp. 2, Fig. 1-1]).
In the simulation of a MIMO communication system, such as, for example, UMTS Long Term Evolution [7], a large part of the simulation time is devoted to the detection algorithm, often implemented as a sphere decoder. In this work, we utilize NVIDIA's CUDA to accelerate the execution of two different sphere decoding algorithms for which implementations in C were available. In particular, we wrote wrapper functions for the codes of a hard output fixed-complexity sphere decoder [8] and a soft-output sphere decoder with max-log approximation [9]. In order to keep the implementation effort at a minimum, we did not perform any optimization of the sphere decoder C-codes. The parallel computing capabilities of the GPU were utilized by simply executing thousands of sphere decoders in parallel, as can be done in simulations of MIMO OFDM systems. This straightforward approach already leads to massive speed-ups, as will be shown in this paper. The sphere decoder implementations for both CPU and GPU execution will be made available online as C-source files at [10]. It should be noted that the focus of this work is not to optimize the structure of the sphere decoding algorithms for specific GPUs, as has been done, for example, in [11]. Instead, we try to estimate what amount of speed-up we can obtain with minimum effort, that is, without any algorithmic optimizations at all. The paper is organized as follows. In Section II, we briefly review the problem of maximum likelihood detection that is calculated on the graphic card to speed up the execution.
An overview of NVIDIA’s GPU and CUDA programming is presented in Section III. The methodology of running the parallel sphere decoders on CUDA is explained in Section IV. Section V presents the hardware and software configuration steps. Finally, the measured execution speed-ups are presented in Section VI.
II. MAXIMUM LIKELIHOOD DETECTION

In this section, we briefly review the problem of maximum likelihood detection that is to be calculated on the graphic card. Consider the mathematical model of a MIMO system with N_T transmit and N_R receive antennas:

    y = Hx + w.    (1)

In this model, y denotes the N_R × 1 dimensional received symbol vector, H the N_R × N_T dimensional MIMO channel matrix, x the N_T × 1 dimensional transmitted symbol vector, and w the N_R × 1 dimensional additive white Gaussian noise vector. The maximum likelihood detector for the transmitted symbol vector x in the above system model is

    \hat{x} = \arg\min_x \| Hx - y \|_2^2.    (2)

Here, the detected symbol vector \hat{x} is obtained by calculating the distance metric for every possible transmit vector. Thus, the complexity of the maximum likelihood detector grows exponentially in the number of antennas. It turns out that this kind of problem can be solved with largely reduced complexity by the so-called sphere decoding algorithm [12-14]. In this work, we particularly consider the fixed-complexity sphere decoder of [8, 15]. In case of soft detection, a so-called Log-Likelihood Ratio (LLR) for the k-th transmitted bit b_k is calculated using the max-log-MAP approximation:

    LLR(b_k) = \min_{x | b_k = 0} \frac{1}{\sigma_w^2} \| Hx - y \|_2^2 - \min_{x | b_k = 1} \frac{1}{\sigma_w^2} \| Hx - y \|_2^2.    (3)
This detection problem can either be solved by executing a hard decision sphere decoder two times per transmitted bit or, more efficiently, by using the single tree search sphere decoder of [9]. For our analysis in this paper, we employ the algorithm of [9].

III. OVERVIEW OF NVIDIA GPUS AND CUDA

CUDA is a new hardware and software architecture that enables researchers to harness the tremendous computation power and memory bandwidth of the GPU in a familiar C-like programming environment [1]. Thus, CUDA makes it significantly easier to utilize a GPU for scientific simulations. The GPU architecture of NVIDIA is built as a fully programmable processor array, organized into N multiprocessors, each containing M multi-threaded scalar processor cores, as shown in Fig. 2. Each multi-threaded scalar processor operates as a Single Instruction Multiple Data (SIMD) processor and can execute up to 128 concurrent threads. Thread creation, scheduling, and resource management are performed entirely
Fig. 2. Architecture of the CUDA GPU (reproduced from [1, pp. 72, Fig. 4-2]).
in hardware. The cost of creating and destroying threads is negligible, and there is effectively no overhead involved in thread scheduling. Each multiprocessor incorporates a small but fast memory shared by all scalar processors in this multiprocessor. All multiprocessors have access to three other device-level memory modules: constant, texture, and global (of which the texture and the constant memories are cached on each multiprocessor). These memories are also accessible from the host machine. Since the shared memory is on the GPU, it can be accessed by the scalar processors with a short latency of only one clock cycle. The device memory, however, is off-chip, and its access latency is several hundreds of clock cycles. A CUDA program consists of a host program and one or more so-called kernels which are executed on the GPU. A kernel, when called, launches multiple threads that run in parallel. The kernel is organized as a grid of thread blocks, with each thread block being a two- or three-dimensional thread array, as shown in Fig. 3. An example of a kernel function-call syntax is

    SphereDecoderKernel<<<dimGrid, dimBlock>>>(... parameter list ...);

Here, dimGrid and dimBlock are three-element vectors that specify the dimensions of the grid in blocks and the dimensions of the blocks in threads, respectively. Each thread block can only be run on a single multiprocessor, which itself can run up to eight concurrent blocks. However, since each multiprocessor has a limited number of registers and a limited amount of shared memory, the number of allocated blocks per multiprocessor is limited by the kernel occupancy of the
registers and the shared memory. Using the CUDA occupancy calculator [16], we can adjust the size of the thread block to obtain the best performance with high utilization of the processor resources.

IV. PARALLEL SPHERE DECODING ON CUDA

We have considered a 4×4 MIMO OFDM system with different constellation sizes. In such a system, many symbol vectors (number of OFDM symbols times the number of data subcarriers) can be detected in parallel. The preprocessing of the sphere decoder, such as QR decomposition and initial ZF solution, is computed on the host environment using MATLAB. The CUDA kernel was written in familiar C-like programming and consists of a wrapper function and standard (not optimized for the GPU) C-implementations of the sphere decoders. The kernel was then compiled by the nvcc CUDA compiler and mixed with MATLAB using NVIDIA scripts (nvmex) [17] to generate a MATLAB mex-file. This mex-file can be directly executed within MATLAB like any other function. The operations included in the kernel are illustrated in Fig. 3 and are as follows:
1) Copy the data required for the sphere decoding from the host memory to the device memory.
2) Determine the execution configuration, that is, the grid and block sizes, and launch the kernel computation (the sphere decoder in our case). The GPU Occupancy Calculator can assist in choosing the thread block size based on the shared memory and registers used by each thread block [16]. Maximizing the occupancy can help to cover the latency of global memory loads. We have used a block size of 128 threads to achieve the best utilization of the GPU resources. Each running thread represents one of the parallel sphere decoders. To accelerate the performance, the data is transferred from the global memory to the shared memory because of its lower latency.
3) Write the results (either the log-likelihood ratios of the soft sphere decoder or the detected symbols of the hard output sphere decoder) back to the host memory and compare them to the results obtained from running the same simulation on the host CPU.

[Figure: host side shows "Preprocessing of SD Data", "Copy to GPU", and "Results Back to CPU"; device side shows the sphere decoders kernel as a grid of thread blocks (e.g. blocks (0,0) to (2,1)), each block a 5×3 array of threads.]

Fig. 3. Workflow when executing the sphere decoder on the GPU.

V. CONFIGURATION SETUP

In order to run the sphere decoder on the NVIDIA graphic card, the following procedures have to be performed:
1) Installing the CUDA-enabled GPU and the software tools: MATLAB R2008b and Microsoft Visual Studio 2005 or 2008.
2) Downloading and installing both the device driver and the CUDA software [18].
3) Compiling the CUDA files using the nvmex script under the MATLAB environment, then using the executable MATLAB file (MEX) as a separate routine within the system.

VI. PERFORMANCE RESULTS
In this section, we present the performance results of running parallel sphere decoders on the GPU using CUDA. The same simulations are executed on 1) an Intel Core 2 Quad CPU at 2.4 GHz with 8 GB RAM, 2) a high performance NVIDIA GeForce GTX 285 graphic card (N = 30 multiprocessors and M = 8 scalar processors per multiprocessor) with 1 GB DDR3 memory, and 3) an NVIDIA GeForce 8400 GS graphic card (N = 1 multiprocessor and M = 8 scalar processors per multiprocessor) with 256 MB DDR2 memory. The results for this graphic card are not explicitly shown (but are discussed) in the paper, since it turned out that the speed-up compared to the CPU processing increases almost linearly with the available number of processors on the graphic card. We used CUDA Toolkit 2.2 with NVIDIA driver 185.85 for Windows Vista 64-bit. We have adopted a 4×4 MIMO OFDM system with 4-QAM, 16-QAM, and 64-QAM constellations. The speed-up of the GPU processing compared to the CPU processing was investigated for flat fading and frequency selective channels. These two scenarios differ significantly in the amount of data that has to be transferred from the host to the graphic card. In case of flat fading, the channel coefficients are the same for all sphere decoders and thus have to be transferred only once. In case of frequency selective fading, the overhead of the data transfer and the memory required on the graphic card are larger. All computations were performed in both single and double precision. In our simulations, we averaged over 100 channel realizations.
[Figure area: Figs. 4 and 5 each show, for a flat fading and a frequency selective channel, the average execution time in seconds (CPU time vs. GPU time) and the GPU vs. CPU speed-up, plotted over the number of parallel sphere decoders in thousands (0 to 50); the speed-up axis reaches about 10 in Fig. 4 and about 50 in Fig. 5.]
Fig. 4. Speed-up of the GTX 285 graphic card over the Core 2 Quad CPU for the soft-sphere decoder in a 4×4 spatial multiplexing system with 4-QAM modulation.
Fig. 5. Speed-up of the GTX 285 graphic card over the Core 2 Quad CPU for the fixed-complexity sphere decoder in a 4×4 spatial multiplexing system with 64-QAM modulation.
Fig. 4 shows the average execution times of the soft output sphere decoder and the speed-up for a 4×4 MIMO system with 4-QAM modulation under both flat fading and frequency selective channels, using double precision on the GTX 285. For a fair comparison, the GPU time includes the transfer time to and from the global device memory. It is apparent that for frequency selective channels, the speed-up is smaller. This is because the GPU processing time is dominated by the slowest of the sphere decoders executed in parallel, namely the one that suffers from the worst channel. Also, we find that the GPU processing time increases linearly, but in steps. These steps occur whenever the maximum number of concurrently executable threads is exceeded. In Figure 5, we show the simulation results for the fixed-complexity sphere decoder using a 64-QAM modulation scheme in flat fading and frequency selective channels with double precision computation on the GTX 285. We observe that a huge speed-up of approximately 40 is achieved when running more than ten thousand parallel sphere decoders under flat fading channels (as is the case, for example, when simulating a 2048 carrier OFDM system with five OFDM symbols). As for the soft sphere decoder, the speed-up is smaller in case of the frequency selective channel. However, for this sphere decoder the reason is the higher transfer time of the coefficients of the frequency selective channel. We have used the CUDA Visual Profiler to view the time taken by each sphere decoder and the memory copy time. In Figure 6, we show the speed-up of running parallel fixed-complexity sphere decoders for different constellation sizes. For each modulation scheme, we considered all combinations of flat fading and frequency selective channels with single and double precision calculations. This graph was obtained by running 50,000 sphere decoders sequentially on the CPU and in parallel on the GPU. As expected, the speed-up increases
with increasing constellation size because of the increasing computational complexity. Running more complex threads in parallel reduces the execution time compared to the sequential execution on the CPU. Another interesting point is that the double precision calculations achieve higher speed-ups than the single precision calculations. Due to the construction of the GPU, double and single precision calculations have approximately the same execution time. The CPU, in contrast, is able to perform two single precision calculations in about the same time as one double precision calculation. Therefore, the speed-up in case of single precision calculations is approximately halved. Figure 7 shows the speed-up for the soft sphere decoder. In contrast to the fixed-complexity sphere decoder, the speed-up decreases with increasing constellation size. This is due to the increased execution time of the soft sphere decoder on the GPU: the number of accesses to the high-latency global memory grows with the constellation size. Also, the single and double precision calculations have approximately the same speed-up. This is because the particular implementation of the soft sphere decoder has more conditional operations than arithmetic ones. Therefore, its execution time on the CPU is approximately the same for both single and double precision. Note that if we consider a general N×N MIMO system instead of the above 4×4 MIMO system, the only difference is the complexity and the number of operations of the individual sphere decoder thread. The speed-up will vary according to the type of the sphere decoder. For the fixed-complexity sphere decoder, the speed-up will increase for increasing antenna numbers (similar to increasing constellation size in Figure 6). This is due to the higher computational complexity, which is much better handled by the GPU. Regarding the soft output sphere decoder, the speed-up will decrease for increasing
[Figure area: Figs. 6 and 7 plot the speed-up (up to about 50 in Fig. 6 and up to about 10 in Fig. 7) over the modulation alphabet (4-QAM, 16-QAM, 64-QAM), each with four curves: flat fading double precision, flat fading single precision, frequency selective double precision, and frequency selective single precision.]

Fig. 6. Speed-up of the GTX 285 graphic card over the Core 2 Quad CPU for the fixed-complexity sphere decoder with different constellation sizes and precisions.

Fig. 7. Speed-up of the GTX 285 graphic card over the Core 2 Quad CPU for the soft sphere decoder with different constellation sizes and precisions.
antenna numbers. This is a result of the high-latency access of the global memory. Thus, to obtain high speed-ups for the soft sphere decoder with a large antenna number and/or a large constellation size, specific code optimizations for the GPU are required. Considering the second, slower graphic card (NVIDIA GeForce 8400 GS), the speed-up was smaller by a factor of about 30 to 40 compared to the speed-up of the high performance graphic card. This factor is directly proportional to the ratio of the numbers of multiprocessors multiplied by their individual clock frequencies.

VII. CONCLUSIONS

In this work, we utilized NVIDIA's CUDA architecture to speed up the execution of sphere decoding algorithms. Running parallel sphere decoders without any code optimization on a GPU can accelerate simulations by up to a factor of 50 for the fixed-complexity sphere decoder and up to a factor of nine for the soft output sphere decoder. Larger speed-ups can be expected by optimizing the code for the GPU or by using a higher performance graphic card (with a larger number of multiprocessors or a higher clock frequency). Alternatively, multiple GPUs can be utilized to increase the level of parallelism even further.

ACKNOWLEDGMENTS

This work has been funded by the Christian Doppler Laboratory for Wireless Technologies for Sustainable Mobility and the Institute of Communications and Radio-Frequency Engineering, Vienna University of Technology, in cooperation with the Electronics and Communications Department, Faculty of Engineering, Cairo University. The authors thank Christoph Mecklenbräuker and their industrial partners mobilkom austria AG and Kathrein-Werke KG.

REFERENCES
[1] NVIDIA Corporation, "NVIDIA CUDA programming guide 2.2," May 2009. [Online]. Available: http://developer.download.nvidia.com/compute/cuda/2_2/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.2.pdf
[2] J. Nickolls, I. Buck, M. Garland, and K. Skadron, "Scalable parallel programming with CUDA," Queue, vol. 6, no. 2, pp. 40-53, 2008.
[3] AMD, "ATI stream computing." [Online]. Available: http://ati.amd.com/technology/streamcomputing/index.html
[4] M. Garland, S. Le Grand, J. Nickolls, J. Anderson, J. Hardwick, S. Morton, E. Phillips, Y. Zhang, and V. Volkov, "Parallel computing experiences with CUDA," vol. 28, no. 4, pp. 13-27, July-Aug. 2008. [Online]. Available: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=4626815
[5] NVIDIA Corporation, "Applications using CUDA." [Online]. Available: http://www.nvidia.com/object/cuda_home.html
[6] D. Luebke, "CUDA: Scalable parallel programming for high-performance scientific computing," in Proc. 5th IEEE International Symposium on Biomedical Imaging: From Nano to Macro 2008 (ISBI 2008), pp. 836-838, May 2008. [Online]. Available: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=4541126
[7] C. Mehlführer, M. Wrulich, J. C. Ikuno, D. Bosanska, and M. Rupp, "Simulating the long term evolution physical layer," in Proc. 17th European Signal Processing Conference (EUSIPCO 2009), pp. 1471-1478, Glasgow, Scotland, UK, Aug. 2009. [Online]. Available: http://publik.tuwien.ac.at/files/PubDat_175708.pdf
[8] L. Barbero and J. Thompson, "A fixed-complexity MIMO detector based on the complex sphere decoder," in Proc. IEEE 7th Workshop on Signal Processing Advances in Wireless Communications 2006 (SPAWC 2006), July 2006. [Online]. Available: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=4153876
[9] C. Studer, M. Wenk, A. P. Burg, and H. Bölcskei, "Soft-output sphere decoding: Performance and implementation aspects," in Conference Record of the 40th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, Nov. 2006. [Online]. Available: http://ieeexplore.ieee.org/iel5/4176490/4176491/04176942.pdf?tp=&arnumber=4176942
[10] "Sphere decoder homepage." [Online]. Available: http://www.nt.tuwien.ac.at/spheredecoder/
[11] M. Wu, Y. Sun, and J. R. Cavallaro, "Reconfigurable real-time MIMO detector on GPU," in Conference Record of the 43rd Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, Nov. 2009. [Online]. Available: http://www-ece.rice.edu/~ys4937/distribute/2009-Asilomar-2.pdf
[12] E. Viterbo and J. Boutros, "A universal lattice code decoder for fading channels," vol. 45, no. 5, pp. 1639-1642, July 1999. [Online]. Available: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=771234
[13] O. Damen, A. Chkeif, and J.-C. Belfiore, "Lattice code decoder for space-time codes," vol. 4, no. 5, pp. 161-163, May 2000. [Online]. Available: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=846498
[14] H. Vikalo and B. Hassibi, "Maximum-likelihood sequence detection of multiple antenna systems over dispersive channels via sphere decoding," EURASIP Journal on Applied Signal Processing, vol. 2002, no. 5, pp. 525-531, 2002. [Online]. Available: http://www.hindawi.com/journals/asp/2002/156743.pdf
[15] M. S. Khairy, M. M. Abdallah, and S. E.-D. Habib, "Efficient FPGA implementation of MIMO decoder for mobile WiMAX system," in Proc. IEEE International Conference on Communications 2009 (ICC 2009), Dresden, Germany, June 2009.
[16] NVIDIA, "CUDA occupancy calculator." [Online]. Available: http://www.nvidia.com/object/cuda_programming_tools.html
[17] NVIDIA Corporation, "MATLAB plug-in for CUDA." [Online]. Available: http://developer.nvidia.com/object/matlab_cuda.html
[18] NVIDIA, "CUDA driver and toolkit download homepage." [Online]. Available: http://www.nvidia.com/object/cuda_get.html
Reference: M. S. Khairy, C. Mehlführer, and M. Rupp, "Boosting sphere decoding speed through graphic processing units," in Proc. 16th European Wireless Conference (EW 2010), Lucca, Italy, Apr. 2010. [Online]. Available: http://publik.tuwien.ac.at/files/PubDat_184823.pdf

BibTeX:

@InProceedings{EW2010_CUDA,
  author    = {Muhammad Sayed Khairy and Christian Mehlf\"uhrer and Markus Rupp},
  title     = {Boosting Sphere Decoding Speed Through Graphic Processing Units},
  booktitle = {Proc. 16th European Wireless Conference (EW 2010)},
  month     = apr,
  year      = 2010,
  address   = {Lucca, Italy},
  url       = {http://publik.tuwien.ac.at/files/PubDat_184823.pdf},
}