A Massively Parallel Implementation of QC-LDPC Decoder on GPU

Guohui Wang, Michael Wu, Yang Sun, and Joseph R. Cavallaro
Department of Electrical and Computer Engineering, Rice University, Houston, Texas 77005
{wgh, mwb2, ysun, cavallaro}@rice.edu

Abstract—The graphics processing unit (GPU) provides a low-cost and flexible software-based multi-core architecture for high performance computing. However, it is still very challenging to efficiently map real-world applications to the GPU and fully utilize its computational power. As a case study, we present a GPU-based implementation of a real-world digital signal processing (DSP) application: a low-density parity-check (LDPC) decoder. The paper describes the efforts we made to map the algorithm onto the massively parallel architecture of the GPU and to fully utilize the GPU's computational resources to significantly boost performance. Moreover, several algorithmic optimizations and efficient data structures are proposed to reduce the memory access latency and the memory bandwidth requirement. Experimental results show that the proposed GPU-based LDPC decoding accelerator can take advantage of the multi-core computational power provided by the GPU and achieve a high throughput of up to 100.3 Mbps.

Keywords-GPU, parallel computing, CUDA, LDPC decoder, accelerator
I. INTRODUCTION

A graphics processing unit (GPU) provides a parallel architecture which combines raw computational power with programmability. A GPU delivers extremely high computational throughput by employing many cores working on a large set of data in parallel. In the field of wireless communication, although the power and strict latency requirements of real communication systems remain the main challenges for a practical real-time GPU-based platform, GPU-based accelerators are attractive due to their flexibility and scalability, especially in the realm of simulation acceleration and software-defined radio (SDR) test-beds. Recently, GPU-based implementations of several key components of communication systems have been studied. For instance, a soft-information multiple-input multiple-output (MIMO) detector implemented on GPU achieves very high throughput [1]. In [2], a parallel turbo decoding accelerator implemented on GPU is studied for wireless channels. The low-density parity-check (LDPC) decoder [3] is another key communication component, and GPU implementations of LDPC decoders have drawn much attention recently due to their high computational complexity and inherently massively parallel nature. LDPC codes are a class of powerful error correcting codes that can achieve near-capacity error correcting performance. This class of codes is widely used in many wireless standards such as WiMAX (IEEE 802.16e) and WiFi (IEEE 802.11n), as well as in high-speed magnetic storage devices.
The flexibility and scalability of GPUs make them a good simulation platform to study the characteristics of different LDPC codes or to develop new LDPC codes. However, it is still very challenging to map the LDPC decoding algorithm efficiently onto the GPU's massively parallel architecture and achieve very high performance. Recently, parallel implementations of high throughput LDPC decoders were studied in [4]. In [5], the researchers optimize the memory access and develop parallel decoding software for cyclic and quasi-cyclic LDPC (QC-LDPC) codes. However, there is still great potential to achieve higher performance by developing a better algorithm mapping tailored to the GPU architecture.

This work focuses on techniques to fully utilize the GPU's computational resources to implement a computation-intensive digital signal processing (DSP) algorithm. As a case study, a highly optimized and massively parallel LDPC decoder implementation on GPU is presented. Several efforts have been made in the areas of algorithmic optimization, memory access optimization and efficient data structures, which enable our implementation to achieve much higher performance than prior work.

This paper is organized as follows. In Section II, the CUDA platform for GPU computing is introduced. Section III gives an overview of the LDPC decoding algorithm. Different aspects of the GPU implementation of the LDPC decoder are discussed in Section IV. Section V focuses on memory access optimization techniques. Section VI provides the experimental results for the performance and throughput of the proposed implementation. Finally, Section VII concludes this paper.

II. CUDA ARCHITECTURE

Compute Unified Device Architecture (CUDA) [6] is widely used to program massively parallel computing applications. The NVIDIA Fermi GPU architecture consists of multiple streaming multiprocessors (SMs). Each SM consists of 32 pipelined cores and two instruction dispatch units. During execution, each dispatch unit can issue a 32-wide single instruction multiple data (SIMD) instruction which is executed on a group of 16 cores. Although CUDA makes it possible to unleash the GPU's computational power, several restrictions prevent programmers from achieving peak performance; the programmer should pay attention to the following aspects.

In the CUDA model, a frequently executed task can be mapped onto a kernel and executed in parallel by many threads on the GPU.
CUDA organizes the threads within a thread block into groups of 32 threads, which are executed together under a common instruction (a warp instruction). As instructions are issued in order, a stall can occur when an instruction fetches data from device memory or when there is a data dependency. Stalls can be minimized by coalescing memory accesses, by using the fast on-die memory resources, and through hardware support for fast thread switching.

A CUDA device has a large amount (>1 GB) of off-chip device memory with high latency (400-800 cycles) [6]. To improve efficiency, device memory accesses can be coalesced: if all threads access 1, 2, 4, 8, or 16 byte words and the threads within half of a warp access sequential words in device memory, the memory requests of the half-warp can be coalesced into a single memory request. Coalescing memory accesses improves performance by reducing the number of device memory requests.

Fast on-chip memory resources such as registers, shared memory, and constant memory can reduce memory access time. Access to shared memory usually takes one load or store operation. However, random accesses to shared memory with bank conflicts should be avoided since they are serialized and cause performance degradation. Constant memory resides in device memory but is cached; a read therefore takes one cached access if all threads access the same location, while other constant memory access patterns are serialized and require device memory accesses, and should be avoided.

Furthermore, to hide processor stalls and achieve peak performance, one should map sufficient concurrent threads onto each SM. However, both shared memory and registers are shared among the concurrent threads on an SM. The limited amounts of shared memory and registers require the programmer to partition the workload effectively, such that a sufficient number of thread blocks can be mapped onto the same SM to hide stalls.
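To make these constraints concrete, the following minimal sketch (not taken from the paper; the kernel, array names, and launch configuration are illustrative) shows the basic thread-to-data mapping and a coalesced access pattern: consecutive threads read consecutive 4-byte words, and several 128-thread blocks are launched so that multiple blocks can reside on one SM to help hide device memory latency.

#include <cuda_runtime.h>

// Each thread handles one element; consecutive threads access consecutive
// 4-byte words, so accesses within a half-warp can be coalesced into a
// single device memory transaction.
__global__ void scale_llrs(const float* __restrict__ in, float* out, int n, float alpha)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (idx < n)
        out[idx] = alpha * in[idx];                   // coalesced read and write
}

int main()
{
    const int n = 1944;                               // one 802.11n codeword of LLRs
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    // Enough 128-thread blocks to cover all n elements; several resident
    // blocks per SM help hide device memory latency.
    int threads = 128;
    int blocks  = (n + threads - 1) / threads;
    scale_llrs<<<blocks, threads>>>(d_in, d_out, n, 0.75f);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}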
III. INTRODUCTION TO LDPC DECODING ALGORITHM

A. LDPC Codes

Low-density parity-check (LDPC) codes are a class of widely used error correcting codes. A binary LDPC code is defined by the equation H · xT = 0, in which x is a codeword and H is an M × N sparse parity check matrix. The number of 1's in a row of H is called the row weight, denoted by ωr. If ωr is the same for all rows, the code is regular; otherwise, the code is irregular. Regular codes are easier to implement but have poorer error correcting performance than irregular codes. Quasi-cyclic LDPC (QC-LDPC) codes [7] are a special class of LDPC codes with a structured H matrix, which is generated by expanding a base matrix: each entry of the base matrix is replaced by a Z × Z sub-matrix. As an example, Fig. 1 shows the parity check matrix of the (1944, 972) 802.11n LDPC code with sub-matrix size Z = 81. In this matrix representation, each square box with a label Ix represents an 81 × 81 circularly right-shifted identity matrix with a shift value of x, and each empty box represents an 81 × 81 zero matrix.

Fig. 1. Parity check matrix H for block length 1944 bits, code rate 1/2, sub-matrix size Z = 81, IEEE 802.11n LDPC code. H consists of Msub × Nsub sub-matrices (Msub = 12, Nsub = 24 in this example).
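Because of the cyclic structure, H never needs to be stored explicitly: a shift value x places the single 1 of local row r of a sub-matrix at local column (r + x) mod Z. The host-side sketch below illustrates this indexing; the 2 × 4 excerpt of shift values is purely illustrative and is not the actual 802.11n base matrix (a value of -1 marks an empty, all-zero sub-matrix).

#include <cstdio>

int main()
{
    const int Z = 81;                       // sub-matrix size for the (1944, 972) code
    // Illustrative 2 x 4 excerpt of a base matrix of shift values (not the real table).
    int base[2][4] = { { 57, -1, 50, 0 },
                       { -1,  3, 28, 0 } };

    for (int br = 0; br < 2; ++br)          // block-row of the base matrix
        for (int r = 0; r < Z; ++r)         // row inside the Z x Z sub-matrices
            for (int bc = 0; bc < 4; ++bc) {
                int x = base[br][bc];
                if (x < 0) continue;        // empty box: zero sub-matrix
                // Row (br*Z + r) of H has a 1 in column bc*Z + ((r + x) mod Z).
                int col = bc * Z + (r + x) % Z;
                if (r == 0)
                    printf("H row %d has a 1 at column %d\n", br * Z + r, col);
            }
    return 0;
}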
Because of its well-organized structure, a QC-LDPC code is very efficient for hardware implementation. In addition, irregular codes have better error correcting performance than regular codes. Therefore, irregular QC-LDPC codes have been adopted by many communication systems, such as 802.11n WiFi and 802.16e WiMAX. In this paper, we take the irregular QC-LDPC codes from the 802.11n standard as a case study; however, the proposed accelerator can be used for other cyclic or quasi-cyclic LDPC codes.

B. Sum-Product Algorithm for LDPC Decoding
The sum-product algorithm (SPA) is based on iterative message passing between check nodes (CNs) and variable nodes (VNs) and provides a powerful method to decode LDPC codes [3]. The SPA is usually performed in the log domain. The SPA has a computational complexity of O(N^3), and since N is normally very large, the computation is very intensive.

Let cn denote the n-th bit of a codeword, and let xn denote the n-th bit of a decoded codeword. The a posteriori probability (APP) log-likelihood ratio (LLR) is the soft information of cn and is defined as Ln = log(Pr(cn = 0)/Pr(cn = 1)).

1) Initialization: Ln is initialized to the input channel LLR. The VN-to-CN (VTC) messages Qmn and the CN-to-VN (CTV) messages Rmn are initialized to 0.

2) Begin the iterative decoding process: For each VN n, calculate Qmn by

    Qmn = Ln + Σ_{m'∈Mn\m} Rm'n,    (1)
where Mn\m denotes the set of all the CNs connected with VN n except CN m. Then, for each CN m, compute the new CTV message R'mn and the difference ∆mn by

    R'mn = Qmn1 ⊞ Qmn2 ⊞ · · · ⊞ Qmnk,    (2)

    ∆mn = R'mn − Rmn,    (3)

where n1, n2, · · · , nk ∈ {Nm\n}, and Nm\n denotes the set of all the VNs connected with CN m except VN n. R'mn and ∆mn are saved for the APP update and the next iteration. The ⊞ operation is defined as

    x ⊞ y = sign(x) sign(y) min(|x|, |y|) + S(x, y),    (4)

    S(x, y) = log(1 + e^(−|x+y|)) − log(1 + e^(−|x−y|)).    (5)

3) Update the APP values and make hard decisions: The APP value of each VN n is updated by

    Ln = Ln + Σ_{m∈Mn} ∆mn.    (6)

The decoder then makes a hard decision to obtain the decoded bit xn by quantizing the APP value Ln into 1 and 0; that is, xn = 0 if Ln ≥ 0, and xn = 1 otherwise. The iterative decoding process terminates when the codeword x satisfies H · xT = 0 or the pre-set maximum number of iterations is reached; otherwise, the decoder goes back to step 2 and starts a new iteration.
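As a concrete reference, the following device-side helper is a minimal sketch (not the paper's code) of the pairwise ⊞ operation in (4)-(5); folding it from left to right across Qmn1, ..., Qmnk produces the CTV message R'mn of (2).

#include <cmath>

// Pairwise box-plus of Eqs. (4)-(5): sign(x)sign(y)min(|x|,|y|) plus the
// correction term S(x, y). Marked __host__ __device__ so it can also be
// exercised in host-side unit tests.
__host__ __device__ inline float box_plus(float x, float y)
{
    float s    = copysignf(1.0f, x) * copysignf(1.0f, y);   // sign(x) * sign(y)
    float m    = fminf(fabsf(x), fabsf(y));                  // min(|x|, |y|)
    float corr = log1pf(expf(-fabsf(x + y)))                 // S(x, y) correction
               - log1pf(expf(-fabsf(x - y)));
    return s * m + corr;
}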
C. Scaled Min-Sum Algorithm

The min-sum algorithm (MSA) reduces the decoding complexity of the SPA with minor performance loss [8], [9]. The MSA approximates the log-SPA by removing the correction term S(x, y) from (4). The Rmn calculation in the scaled MSA can be expressed as

    R'mn = α · Π_{n'∈Nm\n} sign(Qmn') · min_{n'∈Nm\n} |Qmn'|,    (7)

where α is the scaling factor that compensates for the performance loss of the min-sum approximation (α = 0.75 is used in this paper) [9].

IV. MAPPING LDPC DECODING ALGORITHM ONTO GPU

In this work, we map both the MSA and the log-SPA onto the GPU architecture. In this section, we present modified decoding algorithms that reduce the decoding latency and the device memory bandwidth requirement, and we describe the implementation of the LDPC decoding kernels. As shown in Fig. 2, the decoder is organized as two CUDA kernels that are launched repeatedly inside the iterative decoding loop: Kernel 1 performs the horizontal (check node) processing, and Kernel 2 updates the APP values and makes the hard decisions; the host handles initialization, the data transfers between host and device, and the final termination of decoding.

A. Loosely Coupled LDPC Decoding Algorithm

The log-SPA and the min-sum algorithm can be simplified by reviewing equations (1) to (6). Instead of saving all the Q values and R values in device memory, only the R values are stored [10]. To compute the new CTV message R', we first recover the Q value from the R value and the APP value of the previous calculation. Based on this idea, Equation (1) becomes

    Qmn = Ln − Rmn.    (8)
Due to the limited size of the on-chip shared memory, the Rmn values can only fit in device memory. The loosely coupled LDPC decoding algorithm significantly reduces the device memory bandwidth by reducing the amount of device memory used. For example, for an M × N parity check matrix H with row weight ωr (the number of 1's in a row of H), we save up to M·ωr memory slots. Moreover, the number of accesses to device memory is reduced by at least 2M·ωr per iteration, since each Qmn would otherwise be read and written at least once per iteration.
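The sketch below (with an assumed flat data layout and names that are not the paper's actual data structures) shows how (7) and (8) combine in a single check node pass: Q is recovered on the fly as L − R, the scaled min-sum magnitude is formed with the usual first/second minimum trick, and the APP values are updated with the difference R' − R. For brevity it folds the APP update into the same kernel using an atomic add, whereas the decoder in Fig. 2 separates horizontal processing and the APP update into two kernels.

// One thread per check node; R holds one CTV message per nonzero of H
// (row-major), L holds one APP value per codeword bit.
__global__ void check_node_update(const int* __restrict__ col_idx, // column of each nonzero, row-major
                                  float* R, float* L,
                                  int num_rows, int row_weight, float alpha)
{
    int m = blockIdx.x * blockDim.x + threadIdx.x;   // row index of H
    if (m >= num_rows) return;

    const int base = m * row_weight;
    float q[32];                                     // per-row buffer; row_weight <= 32 assumed
    float min1 = 1e30f, min2 = 1e30f;                // smallest and second smallest |Q|
    int   min_pos = 0;
    float sign_prod = 1.0f;

    // Pass 1: recover Q = L - R (Eq. 8); track min1/min2 and the sign product.
    for (int j = 0; j < row_weight; ++j) {
        float qv = L[col_idx[base + j]] - R[base + j];
        q[j] = qv;
        float a = fabsf(qv);
        sign_prod *= (qv < 0.0f) ? -1.0f : 1.0f;
        if (a < min1)      { min2 = min1; min1 = a; min_pos = j; }
        else if (a < min2) { min2 = a; }
    }

    // Pass 2: R'mn excludes position n itself (Eq. 7); update L by the difference (Eq. 6).
    for (int j = 0; j < row_weight; ++j) {
        float mag   = (j == min_pos) ? min2 : min1;
        float sgn   = sign_prod * ((q[j] < 0.0f) ? -1.0f : 1.0f);  // divide out own sign
        float r_new = alpha * sgn * mag;
        atomicAdd(&L[col_idx[base + j]], r_new - R[base + j]);      // delta = R' - R
        R[base + j] = r_new;
    }
}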
Fig. 2. Code structure of the GPU implementation of the LDPC decoder using two CUDA kernels.
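A minimal sketch of the host-side control flow that Fig. 2 describes is given below; the kernel bodies, launch configurations, and the fixed iteration count are placeholders, and the real decoder also performs the host/device data transfers and can terminate early once H · xT = 0.

#include <cuda_runtime.h>

// Placeholder kernels standing in for the two kernels of Fig. 2; the real
// signatures and launch configurations are not shown in this excerpt.
__global__ void horizontal_processing()    { /* check node (horizontal) processing */ }
__global__ void app_update_hard_decision() { /* update APP values, make hard decisions */ }

void decode_on_gpu(int max_iter)
{
    // Step 1 (host): initialization and host-to-device transfer of channel LLRs (not shown).
    for (int it = 0; it < max_iter; ++it) {
        horizontal_processing<<<12, 81>>>();       // e.g., one block per block-row, Z threads each
        app_update_hard_decision<<<24, 81>>>();    // e.g., one block per block-column
    }
    cudaDeviceSynchronize();
    // Final step (host): device-to-host transfer of the decoded bits and finish decoding (not shown).
}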
B. Two-Pass Recursion CTV Message Update

Based on (2), we need to traverse one row of the H matrix multiple times to update all the CTV messages Rmn. Each traversal pass requires (ωr − 2) compare-select operations to find the minimum value among (ωr − 1) values. Moreover, (ωr − 1) XOR operations are needed to compute the sign of Rmn. In addition, the traversal introduces branch instructions which cause an unbalanced workload and reduce the throughput.

The traversal process can be optimized by using a forward and backward two-pass recursion scheme [11]. In the forward recursion, intermediate values are calculated and stored. During the backward recursion, the CTV message for each node is computed from the intermediate values of the forward recursion. A rough calculation shows that the number of operations per decoding iteration is reduced from M·ωr·(ωr − 2) to M·(3ωr − 2). For the 802.11n (1944, 972) code, this two-pass recursion scheme saves about 50% of the operations, which significantly reduces the decoding latency. The number of memory accesses and the number of required registers are also greatly reduced.
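The following self-contained sketch is one way the recursion can be written (an assumption, not the paper's code; the fixed-size buffers are illustrative). For one check node row it computes all leave-one-out CTV messages with a forward pass, a backward pass, and one final combine per position, i.e., roughly 3ωr operations instead of ωr full passes over the row. The combine shown is the log-SPA ⊞ of (4)-(5); the same structure applies to the min-sum sign/minimum combine.

#include <cmath>

// Pairwise box-plus of Eqs. (4)-(5), repeated here to keep the sketch self-contained.
__host__ __device__ inline float box_plus(float x, float y)
{
    float s = copysignf(1.0f, x) * copysignf(1.0f, y);
    float m = fminf(fabsf(x), fabsf(y));
    return s * m + log1pf(expf(-fabsf(x + y))) - log1pf(expf(-fabsf(x - y)));
}

// Two-pass recursion over one row: F[j] combines q[0..j], B[j] combines
// q[j..w-1], and the leave-one-out result for position j is F[j-1] (+) B[j+1].
__host__ __device__ inline void two_pass_ctv(const float* q, float* r_new, int w)
{
    float F[32], B[32];                              // w <= 32 assumed for this sketch
    F[0] = q[0];
    for (int j = 1; j < w; ++j)      F[j] = box_plus(F[j - 1], q[j]);   // forward pass
    B[w - 1] = q[w - 1];
    for (int j = w - 2; j >= 0; --j) B[j] = box_plus(B[j + 1], q[j]);   // backward pass

    r_new[0]     = B[1];                             // exclude q[0]
    r_new[w - 1] = F[w - 2];                         // exclude q[w-1]
    for (int j = 1; j < w - 1; ++j)
        r_new[j] = box_plus(F[j - 1], B[j + 1]);     // exclude q[j]
}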