Low Complexity, High Speed Decoder Architecture for Quasi-Cyclic LDPC Codes

Zhongfeng Wang
School of EECS, Oregon State Univ., Corvallis, OR 97331, USA
Email: [email protected]

Qing-wei Jia
Seagate Technology International, Singapore 118249
Email: [email protected]

Abstract

This paper presents a low-complexity, very high speed decoder architecture for quasi-cyclic Low Density Parity Check (QC-LDPC) codes, specifically Euclidean geometry (EG) based QC-LDPC codes. Algorithmic transformation and architectural-level optimizations are employed to increase the clock speed. Enhanced partially parallel decoding architectures are proposed to linearly increase the overall throughput with a small percentage of extra hardware. Based on the proposed architecture, an FPGA implementation of an (8176, 7156) EG-LDPC decoder achieves a worst-case throughput of 169 Mbps.

1. Introduction

Recently, a class of structured LDPC codes, namely quasi-cyclic LDPC codes [1], which can achieve performance comparable to that of random codes, has been proposed. Further works include irregular circulant-based QC-LDPC codes [2] and Euclidean geometry-based QC-LDPC codes [3][4]. QC-LDPC codes are well suited for hardware implementation. The encoder of a QC-LDPC code can be easily built with shift registers [5], while random codes usually require complex encoding circuitry to perform matrix and vector multiplications. In addition, QC-LDPC codes facilitate efficient high-speed decoding because of the regularity of their parity-check matrices. A memory-based partially parallel decoding architecture has been presented in [6] to obtain a good trade-off between hardware complexity and decoding speed.

The paper is organized as follows. After a brief review of the conventional BP algorithm [7], a modified version based on algorithmic transformation is discussed. VLSI architectures and optimizations for the Variable node Processing Units (VPUs) and Check node Processing Units (CPUs) are then presented for the new algorithm. An enhanced memory-based partially parallel decoding architecture is proposed to linearly increase the throughput with a small percentage of hardware overhead. Finally, an FPGA implementation of an (8176, 7156) EG-LDPC code is shown to achieve a maximum throughput of 169 Mbps, which is significantly higher than that of existing works such as [6] and [9].

2. Belief Propagation Algorithm and Modification

The conventional BP algorithm is composed of two phases of message passing, i.e., variable-to-check node message passing and check-to-variable node message passing. Let R_cv denote the check-to-variable message conveyed from the check node c to the variable node v, and let L_cv represent the variable-to-check message conveyed from the variable node v to the check node c. Then R_cv can be computed as follows:

R_{cv} = -S_{cv}\,\Psi\Big\{\sum_{n \in N(c)\setminus v} \Psi(L_{cn})\Big\},    (1)

where

S_{cv} = \prod_{n \in N(c)\setminus v} \mathrm{sign}(L_{cn}).    (2)

S_cv is the sign part of R_cv, and N(c)\v denotes the set of variable nodes connected to the check node c excluding the variable node v. The nonlinear function Ψ(x) = log(tanh(|x|/2)) is generally implemented with a look-up table (LUT) in hardware. On the other hand, the variable-to-check message L_cv can be computed with the following equation:

L_{cv} = \sum_{n \in M(v)\setminus c} R_{nv} - 2 r_v / \sigma^2,    (3)

where M(v)\c denotes the set of check nodes connected to the variable node v excluding the check node c, and 2 r_v / \sigma^2 is the intrinsic information related to the received soft symbol r_v and the estimated standard deviation of the channel noise. The log-likelihood ratio for the variable node v, denoted as L_v, is computed as follows:

L_v = \sum_{c \in M(v)} R_{cv}.    (4)

The sign of L_v is taken as the estimated information bit (+1 or -1).

It can be observed that the conventional BP algorithm has unbalanced computation complexity between the two decoding phases, which leads to unbalanced data-paths between VPUs and CPUs. In fact, the critical path of a CPU consists of a summation operation and two LUT operations, while that of a VPU consists of only a summation operation. As the clock speed is upper-bounded by the longest data-path, the throughput of an LDPC decoder employing the conventional BP algorithm is limited. In [8], a modified version based on algorithmic transformation was proposed in order to balance the computation load between the two decoding phases. The new algorithm is expressed as follows:

R_{cv} = -S_{cv}\Big\{\sum_{n \in N(c)\setminus v} \Psi(L_{cn})\Big\},    (5)

L_{cv} = \sum_{n \in M(v)\setminus c} -\mathrm{sign}(R_{nc})\,\Psi(R_{nc}) - 2 r_v / \sigma^2,    (6)

where S_cv is computed as before. It should be noted that the real value of R_cv computed here is different from what is obtained with the original algorithm. The major benefit of the modified algorithm is that the computation complexity, and thus the computation delay, is balanced between the two decoding phases. As shown in [8], this modification not only helps reduce the clock cycle time, but also facilitates 100% hardware utilization efficiency.
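To illustrate the transformation of equations (5)-(6), the sketch below (again ours, with the same helpers as in the previous sketch) moves the remaining Ψ evaluation from the check-node side to the variable-node side, so each phase now performs one summation and one LUT pass per message.

import math

def psi(x):  # as in the previous sketch
    return math.log(math.tanh(max(abs(x), 1e-12) / 2.0))

def sign(x):  # as in the previous sketch
    return -1.0 if x < 0 else 1.0

def check_update_modified(L_in):
    # Modified check-node update, eq. (5): the outer Psi of eq. (1) is dropped,
    # leaving one Psi (LUT) pass plus one summation on this side.
    R_out = []
    for v in range(len(L_in)):
        others = [L_in[n] for n in range(len(L_in)) if n != v]
        S_cv = 1.0
        for L in others:
            S_cv *= sign(L)
        R_out.append(-S_cv * sum(psi(L) for L in others))
    return R_out

def variable_update_modified(R_in, r_v, sigma2):
    # Modified variable-node update, eq. (6): the Psi pass removed from the
    # check node is applied here to the incoming messages instead.
    intrinsic = 2.0 * r_v / sigma2
    terms = [-sign(R) * psi(R) for R in R_in]
    return [sum(terms) - terms[c] - intrinsic for c in range(len(R_in))]

Both phases now have comparable depth, which is what allows the CPU and VPU data-paths described in the next section to be balanced.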

3. Architectures for Node Processing Units and Optimizations

Consider a (3, 5)-regular quasi-cyclic LDPC code. The architectures of the CPU and the VPU with the original BP algorithm are shown in Figure 1 and Figure 2, respectively. The input signal z_v in Fig. 2 stands for the intrinsic information.

Figure 1. The architecture of the CPU with the original BP algorithm.

Figure 2. The architecture of the VPU with the original BP algorithm.

As can be observed from Fig. 1, two LUT operations are involved in the critical path of each CPU. With the new algorithm, we move one LUT operation to the critical path of every VPU. We can further eliminate the sign-magnitude to 2's-complement conversion block (SM-2's) and the 2's-complement to sign-magnitude conversion block (2's-SM) in order to shorten the critical path further. The compensation for the removal of these blocks consists of adding the sign bit to the input of a normal LUT and forcing the output of the LUT to take different formats in the two decoding phases.

In addition, we propose to utilize architecture-level optimization techniques to further reduce the total computation delay of the summation part. The new architectures for the CPU and the VPU are shown in Figure 3 and Figure 4, respectively. The dashed lines indicate possible positions for inserting pipeline stages. With 3-stage pipelining, the critical path of either type of node processing unit is reduced to about one multi-bit addition, which leads to nearly a 6-times speed-up over the design presented in [8].

Figure 3. An optimized architecture for the CPU.

Figure 4. An optimized architecture for the VPU.
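As a purely illustrative software model of the LUT arrangement described above (the bit width, quantization step, and output formats here are our assumptions and are not specified in the paper), the sign bit can simply ride alongside the quantized magnitude that addresses a Ψ table, with the output format selected per decoding phase instead of using separate SM-2's and 2's-SM conversion blocks:

import math

STEP = 0.25      # assumed quantization step
MAG_BITS = 5     # assumed magnitude width

def build_psi_lut():
    # Pre-computed |Psi(x)| for every quantized magnitude code.
    lut = []
    for code in range(1 << MAG_BITS):
        x = max(code * STEP, STEP / 2.0)          # avoid Psi(0) = -infinity
        lut.append(-math.log(math.tanh(x / 2.0))) # store the magnitude of Psi
    return lut

PSI_LUT = build_psi_lut()

def psi_lookup(sign_bit, mag_code, signed_output):
    # Sign-aware LUT access: the sign bit bypasses the table and only selects
    # how the result is presented, either as a signed value in one phase or
    # as a (sign, magnitude) pair in the other phase.
    mag = PSI_LUT[mag_code]
    if signed_output:
        return -mag if sign_bit else mag
    return (sign_bit, mag)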

4. Partially Parallel Decoding Architecture and Enhancement

Several papers, such as [6] and [9], have addressed partially parallel decoding architectures for regular LDPC codes. This kind of architecture generally achieves a good trade-off between hardware complexity and decoding throughput. A partially parallel decoder architecture for general (3, 5)-regular QC-LDPC codes is shown in Figure 5, where in total 3*5 = 15 memory banks are used to store the soft message symbols conveyed in the two decoding phases, the memory banks labeled Z are used to store the intrinsic information, and the memory banks labeled C are used to store the estimated data bits.

Figure 5. The structure of a partially parallel decoder for (3, 5)-regular QC-LDPC codes.

For QC-LDPC codes, the address generator for each memory bank can be built with a simple counter, which not only simplifies the hardware design but also improves the circuit speed. In general, each node processing unit takes 1 clock cycle (assuming dual-port memories are used; otherwise 2 cycles are needed) to complete the message updating for each row (or column). To increase the parallelism, we can force each node processing unit to process multiple rows (or columns) at the same time. However, this will generally cause memory access problems. Figure 6 shows a small sub-matrix of a QC-LDPC parity-check matrix.

Figure 6. A sub-matrix of a QC-LDPC parity-check matrix.

As discussed in [8], all soft message symbols corresponding to the 1-components in a sub-matrix are stored in one memory bank. Thus, a straightforward approach to enable each processing unit to process multiple rows/columns per cycle is to store multiple soft symbols in each memory entry. For example, we can store 2 soft symbols corresponding to two adjacent 1-components in one memory entry. This easily solves the problem of processing 2 rows in each cycle. However, it is generally not applicable to column processing. Assume we start column processing from the first column in the same example. In the first cycle, no memory access conflict occurs, as both required soft symbols are stored in the same memory entry. In the second cycle, we are supposed to process the third and the fourth columns at the same time; however, the two required soft symbols are located in different memory entries. The situation becomes even worse when there are multiple sub-blocks in one (block) column of the parity-check matrix, e.g., there are 3 sub-blocks in one block column of a (3, 5)-regular QC-LDPC parity-check matrix. Using multi-port memories is a possible solution, but not an efficient one, since the overall hardware increases linearly. To fix this problem, the authors of [10] added one more constraint on each circulant matrix: the shift value of each shifted identity matrix must be a multiple of δ, where δ denotes the number of soft symbols stored in one memory entry (e.g., δ = 2 in the previous example). This added constraint unavoidably limits the performance of the LDPC codes.
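The access conflict described above can be reproduced with a short script (ours; the sub-matrix size and shift value below are arbitrary choices for illustration, not the values of Figure 6). It packs the soft symbols of a p x p cyclically shifted identity matrix two per memory entry in row order and reports which column pairs would need two different entries in the same cycle:

p, shift = 7, 3   # assumed sub-matrix size and shift (illustrative only)

# Row i of the shifted identity matrix has its single 1 in column (i + shift) % p.
# Soft symbols are stored in row order, two per entry: row i -> entry i // 2.
def entry_of_row(i):
    return i // 2

# Row processing: rows (2t, 2t+1) always share one entry, so 2 rows/cycle is easy.
# Column processing: columns (2t, 2t+1) need rows (c - shift) % p, which may
# fall into two different entries.
for t in range((p + 1) // 2):
    cols = [c for c in (2 * t, 2 * t + 1) if c < p]
    rows = [(c - shift) % p for c in cols]
    entries = {entry_of_row(r) for r in rows}
    status = "OK" if len(entries) <= 1 else "CONFLICT: two entries needed"
    print(f"columns {cols} -> rows {rows} -> entries {sorted(entries)}: {status}")

With these parameters the very first column pair happens to be conflict-free while later pairs are not, mirroring the behaviour described above; multi-port memories or the shift constraint of [10] are two ways around it, and the sub-bank partitioning proposed below is a third.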

Euclidean geometry-based QC-LDPC codes (EG-LDPC) were first introduced in [11] and later improved in [12][4]. This class of codes has performance comparable to that of random codes; in particular, they have a very low error floor. Consider the (8176, 7156) (4, 32)-regular EG-LDPC code discussed in [4]. Its coding gain is only 1 dB away from the Shannon limit at BER = 10^{-7}, and our simulations have shown that the error floor of this code lies below 10^{-10}. A key feature of the parity-check matrix of this code is that each sub-matrix consists of two overlapped cyclic-shifted identity matrices. There is no constraint other than that the size of the sub-block must be a prime number. A more general case would be to have multiple (e.g., m > 2) independently cyclic-shifted identity matrices overlapped in one sub-block. To efficiently decode this class of codes, we propose to have one separate memory bank for each independently (cyclic-)shifted identity matrix. Hence, we need 2 memory banks for each sub-block. To double the parallelism of a partially parallel decoder, we propose three solutions as follows:

I. Store two adjacent soft symbols in one memory entry while utilizing extra buffers (cf. [13]) to solve the memory access problem.

II. Partition each memory bank into two sub-banks, with one containing the even-numbered entries and the other the odd-numbered entries.

III. Combine Approach I and Approach II.

Due to limited space, we elaborate only on Approach II in this paper. For any shifted identity matrix, it is guaranteed that, of two adjacent soft symbols (whether from the row-processing or the column-processing point of view), one belongs to the even-numbered sub-bank and the other belongs to the odd-numbered sub-bank. Therefore, this method successfully solves the memory access problem, though extra multiplexers are needed and the control circuitry becomes slightly more complex. It can be observed that the proposed approach can be extended to the case with m > 2.
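A minimal software model of Approach II is sketched below (ours; bank sizing and the wrap-around handling at the block boundary are simplified, since those control details are not spelled out in the paper). Each memory bank serving one cyclically shifted identity matrix is split into an even-indexed and an odd-indexed sub-bank, so the two messages with consecutive indices needed in one cycle come from different sub-banks:

class SplitBank:
    # One message bank of a cyclically shifted identity matrix, partitioned
    # into even-indexed and odd-indexed sub-banks (Approach II).
    def __init__(self, p):
        self.p = p
        self.even = [0.0] * ((p + 1) // 2)   # entries 0, 2, 4, ...
        self.odd = [0.0] * (p // 2)          # entries 1, 3, 5, ...

    def _slot(self, i):
        bank = self.even if i % 2 == 0 else self.odd
        return bank, i // 2

    def write(self, i, value):
        bank, k = self._slot(i)
        bank[k] = value

    def read_adjacent(self, i):
        # Fetch entries i and i+1 in one cycle. Consecutive indices have
        # opposite parity, so the two reads hit different sub-banks and a
        # single-port memory per sub-bank suffices. (Wrap-around at the
        # block boundary is left to the address/control logic.)
        b0, k0 = self._slot(i)
        b1, k1 = self._slot(i + 1)
        return b0[k0], b1[k1]

# Usage with an arbitrary prime sub-block size (illustrative values only).
bank = SplitBank(31)
for i in range(31):
    bank.write(i, float(i))
print(bank.read_adjacent(10))   # -> (10.0, 11.0), one value from each sub-bank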

Most recently, we have implemented an LDPC decoder on a Xilinx Virtex-II 6000 FPGA for the (8176, 7156) EG-LDPC code, where all the techniques discussed above are employed. Our design achieves a worst-case decoding throughput of 169 Mbps (with 15 iterations), which is significantly higher than that of existing works such as [6] and [9].

5. Conclusion

In this paper, we have discussed a modified BP algorithm and presented optimized architectures for both types of node processing units based on the new algorithm. We have further presented enhanced partially parallel decoder architectures for QC-LDPC codes.

References

[1] D. Sridhara, T. Fuja, and R. M. Tanner, "Low density parity check codes from permutation matrices," in Proc. Conf. on Information Sciences and Systems, The Johns Hopkins University, March 2001.
[2] D. Hocevar, "Efficient encoding for a family of quasi-cyclic LDPC codes," in Proc. IEEE GLOBECOM '03, vol. 7, pp. 3996-4000, 2003.
[3] Y. Kou, J. Xu, H. Tang, S. Lin, and K. Abdel-Ghaffar, "On circulant low density parity check codes," in Proc. IEEE ISIT 2002, p. 200.
[4] L. Chen, J. Xu, I. Djurdjevic, and S. Lin, "Near Shannon limit quasi-cyclic low-density parity-check codes," to appear in IEEE Trans. on Communications, July 2004.
[5] Z. Li, L. Chen, S. Lin, W. Fong, and P. Yeh, "Efficient encoding of quasi-cyclic low-density parity-check codes," to appear in IEEE Trans. on Communications, 2004.
[6] Y. Chen and D. Hocevar, "A FPGA and ASIC implementation of rate 1/2, 8088-b irregular low density parity check decoder," in Proc. IEEE GLOBECOM '03, vol. 1, pp. 113-117, Dec. 2003.
[7] D. J. MacKay, "Good error-correcting codes based on very sparse matrices," IEEE Trans. Inform. Theory, vol. 45, pp. 399-431, Mar. 1999.
[8] Z. Wang, Y. Chen, and K. Parhi, "Area-efficient quasi-cyclic LDPC code decoder architecture," in Proc. IEEE ICASSP 2004.
[9] T. Zhang and K. Parhi, "A 54 Mbps (3,6)-regular FPGA LDPC decoder," in Proc. IEEE SiPS 2002, pp. 127-132.
[10] M. Karkooti and J. R. Cavallaro, "Semi-parallel reconfigurable architectures for real-time LDPC decoding," in Proc. ITCC 2004, vol. 1, pp. 579-585, April 2004.
[11] Y. Kou, J. Xu, H. Tang, S. Lin, and K. Abdel-Ghaffar, "On circulant low density parity check codes," in Proc. IEEE ISIT 2002, p. 200.
[12] S. Lin, L. Chen, J. Xu, and I. Djurdjevic, "Near Shannon limit quasi-cyclic low-density parity-check codes," in Proc. IEEE GLOBECOM '03, vol. 4, pp. 2030-2035, Dec. 2003.
[13] Z. Wang, Y. Tan, and Y. Wang, "Low hardware complexity parallel turbo decoder architecture," in Proc. IEEE ISCAS 2003.