IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 57, NO. 1, JANUARY 2010
Programmable Architecture for Flexi-Mode QC-LDPC Decoder Supporting Wireless LAN/MAN Applications and Beyond
Dan Bao, Bo Xiang, Rui Shen, An Pan, Yun Chen, Associate Member, IEEE, and Xiao Yang Zeng, Member, IEEE
Abstract—A programmable architecture is proposed for a flexi-mode quasi-cyclic low-density parity-check code decoder. The proposed architecture has the following advantages: 1) Code rate, length, and pattern can be programmed on the fly; 2) decoding complexity is reduced by algorithm modification; 3) memory read/write operation is reduced by access optimization and hierarchical memory structure; and 4) an early stopping scheme is adopted to give power efficiency, particularly in the low-signal-to-noise-ratio region. A decoder chip is implemented in an SMIC 180-nm 1.8-V CMOS technology. Experimental results show the advantages in terms of flexibility, area, power, and error-correction performance. Index Terms—Flexi-mode decoder, iterative decoding, low-density parity-check (LDPC) codes, programmable architecture.
Manuscript received August 26, 2008; revised January 18, 2009. First published March 27, 2009; current version published January 22, 2010. This work was supported by the Science and Technology Committee of Shanghai under Grant 77062001. This paper was recommended by Associate Editor V. Öwall. The authors are with the State Key Laboratory of ASIC and System, Fudan University, Shanghai 201203, China (e-mail: dan.bontaine@gmail.com). Digital Object Identifier 10.1109/TCSI.2009.2019395
I. INTRODUCTION
LOW-DENSITY parity-check (LDPC) codes, first proposed by Gallager [1], have attracted great interest since their rediscovery by MacKay and Neal [2]. Among several forward error correction (FEC) coding schemes, such as convolutional codes, Reed–Solomon codes, and turbo codes, LDPC codes provide the performance nearest to capacity [3]. Moreover, inherent parallelism ensures that the decoder can perform with flexible throughput ranging from several megabits per second (Mbits/s) to multiple gigabits per second (Gbits/s) [4]–[8], which can satisfy different applications. Consequently, LDPC codes have been adopted in broadcasting systems, such as second-generation digital video broadcasting via satellite [9] and digital television terrestrial multimedia broadcasting [10], and are optional FEC techniques in network systems, such as wireless LANs (IEEE 802.11n) [11] and metropolitan area networks (MANs) (IEEE 802.16e) [12]. The adopted LDPC codes are architecture-aware (AA) or quasi-cyclic LDPC (QC-LDPC) codes, which facilitate decoder implementation.
Generally, the architecture for LDPC code decoders can be grouped into three categories. First is single-mode architecture in terms of single code rate and length [5]. Second is multimode architecture in terms of multiple rates with single length
[4], [13], single rate with multiple lengths [14], and multiple rates and lengths [15]. The multimode architecture can support fixed-standard applications with different services, such as wireless MANs. Third is flexi-mode architecture in terms of flexible combinations of rates and lengths that can be programmed on the fly [16]–[18]. The flexi-mode architecture is desired due to the urgent demand of multistandard multimedia applications.
In this paper, a programmable architecture and design techniques are proposed for the flexi-mode LDPC decoder. The framework of designing a flexi-mode decoder consists of the following: 1) a generic decoding algorithm independent of code parameters; 2) a programmable decoder architecture; and 3) area-/power-efficient design techniques. This paper concentrates on these three topics.
There are mainly two approaches to soft-decision LDPC decoding, both of which iteratively pass soft messages.
1) The first approach is called two-phase message passing, which performs iterations between check- and variable-node updates, including the sum–product algorithm [19] and its variations, e.g., the min-sum (MS) algorithm [20] and phase-overlapped decoding [21], [22]. The performance of the MS algorithm can be enhanced by correction factors [23] that depend on channel and code parameters, i.e., code rate and length, which brings inconvenience to the flexi-mode decoder design.
2) The other approach performs turbo-decoding message passing (TDMP) or layered decoding [24]–[26], which boosts throughput and reduces computational complexity by accelerating convergence. The update operation, which is implemented as a soft-in–soft-out (SISO) engine free of lookup tables and independent of channel parameters, i.e., signal-to-noise ratio (SNR), can operate with negligible loss compared with the ideal case [26].
Partially parallel LDPC decoder architectures are developed for different area/throughput requirements [7], [8], [14], [15], [24], [27]. The advantage of partially parallel decoders is centered on the flexible tradeoff between area and throughput. Thus, the flexi-mode decoder design can be based on the partially parallel architecture. The partially parallel LDPC decoder is generally composed of on-chip memories, processing units (check- and variable-node-update units or SISO engines), and a permutation network (PN) for message delivery [4], [13], [24],
[27]. The on-chip memory occupies a great part of the decoder area and power dissipation and thus needs to be carefully deployed when designing a flexi-mode decoder. Furthermore, another challenge of implementing a flexi-mode LDPC decoder is the design of a programmable PN (PPN) without interconnection congestion [16]–[18], [28], [29].
Based on the design framework defined earlier and the aforementioned challenges, this paper deals with the flexi-mode QC-LDPC decoder design using the following techniques.
1) The TDMP algorithm [26] is adopted as the generic algorithm. In order to achieve area reduction, the modified TDMP (mTDMP) algorithm is proposed to reduce the complexity of the original algorithm.
2) A flexible decoder architecture is devised to support mode programming on the fly. The flexibility is achieved by the proposed scalable SISO engine, fully PPN, and appropriately deployed memories.
3) This paper proposes the following techniques to further improve area/power efficiency.
a) Structured memory architectures are devised so that i) dual-port and single-port memory can be appropriately deployed and ii) access reduction can be made by address remapping and read/write buffering.
b) The proposed SISO engine consists of independent cells that work concurrently and can be deactivated individually.
c) Low-complexity flexible PNs are designed, and appropriate paths of the PNs can be gated depending on the current code mode.
d) The undecodable iterations are terminated early.
The remainder of this paper is organized as follows. Section II briefly discusses the QC-LDPC codes and introduces the TDMP and mTDMP algorithms. The proposed flexi-mode decoder architecture is described in Section III. Section IV details the design techniques of the structured memory, the low-complexity SISO engine, and the fully PPN.
In Section V, a flexi-mode LDPC decoder chip is implemented for wireless LAN/MAN applications and beyond, and the results are demonstrated together with a detailed analysis. Finally, Section VI concludes this paper.
II. QC-LDPC CODES AND DECODING ALGORITHM
A. QC-LDPC Code Description
LDPC codes can be defined by parity-check equations [1] as
$$H x^T = 0 \tag{1}$$
where the $M$-by-$N$ matrix $H$ gives the structure of the codes with length $N$ and parity bit number $M$. The information bit number is given by $K = N - M$. QC-LDPC codes are well received for their inherent regularity, which avoids congestion in the corresponding decoder chip [4], [18], [24]. Based on the regularity of the permutation pattern used, the QC-LDPC matrix $H$ can be decomposed into multiple submatrices, which are either zero matrices or cyclic-shifted identity matrices of size $z$-by-$z$ ($z$ means expansion factor), i.e., nonzero entries (NZEs), along with an $M_b$-by-$N_b$ base matrix $H_b$ containing the permutation pattern (locations of NZEs and permutation sizes). Therefore, the base matrix $H_b$ and the submatrices with factor $z$ integrally represent the QC-LDPC codes with length $N = N_b z$ and rate $R$. A code is further characterized by the total number of NZEs in $H_b$ and by the number of NZEs in each row of $H_b$ (the row weight).
In this paper, we use the notation $(N_b, z, R)$ for the flexi-mode QC-LDPC codes, whose parameters take a large number of values (modes). A specific triple denotes one type of QC-LDPC code with expansion factor $z$, length $N = N_b z$, and rate $R$; for example, the (24, 96, 1/2) code has length 2304. The proposed decoder architecture is flexible in terms of programming the code configuration on the fly, which enables the rate, length, and row weight to be programmable. Example modes of QC-LDPC codes are shown in Table I. These codes are adopted in wireless LAN and MAN standards. The LAN codes have three expansion factors, i.e., 27, 54, and 81, and four rate types. The MAN codes have 19 expansion factors and six rate types. There are 12 and 114 modes in these codes, respectively. The key parameters of a (24, 96, 1/2) code and a (24, 27, 2/3) code are listed in Table II. Fig. 1 shows the structure of the (24, 96, 1/2) code in wireless MAN. The other codes have similar structures but different parameter values. The decomposition of matrix $H_b$ will be introduced in Section IV-A.
TABLE I EXAMPLE MODES OF FLEXI-MODE QC-LDPC CODES
TABLE II EXAMPLE PARAMETERS OF (24, 96, 1/2) AND (24, 27, 2/3) CODES
B. TDMP Decoding Algorithm
The TDMP algorithm [25], [26] utilizes a turbo-decoding schedule to propagate soft messages [log-likelihood ratio (LLR)] on the trellis. The schedule largely improves decoder throughput and memory efficiency and is adopted in the multimode decoder [4], [13], [24]. Three kinds of messages are used: the intrinsic messages from the channel, the extrinsic messages $\lambda_i$, and the posterior messages $\gamma$. Message $\lambda_i$ consists of the extrinsic messages corresponding to the $i$th row vector of the parity-check matrix. Vector $I_i$ is defined as the index set of the NZEs in the $i$th row. Vector $\gamma(I_i)$ is defined as the set of posterior messages indexed by $I_i$. The TDMP algorithm performs iterative decoding composed of subiterations. The $i$th subiteration consists of the following four stages [26].
Fig. 1. Wireless MAN (24, 96, 1/2) code structure.
TABLE III COMPUTATION COMPLEXITY COMPARISON FOR LLR UPDATE IN ONE SUBITERATION
1) Read the extrinsic messages $\lambda_i$ and the posterior messages $\gamma(I_i)$ from memories.
2) Subtract $\lambda_i$ from $\gamma(I_i)$ to obtain the prior messages $\rho$ by
$$\rho = \gamma(I_i) - \lambda_i \tag{2}$$
3) Row decode the following.
a) For each $j \in I_i$, update the extrinsic message $\Lambda_i$ by (3), with the auxiliary quantity in (3) given by (4).
b) Update $\gamma(I_i)$ using the updated $\Lambda_i$ by
$$\gamma(I_i) = \rho + \Lambda_i \tag{5}$$
4) Write back the updated $\lambda_i$, i.e., $\Lambda_i$, and the updated $\gamma$, i.e., $\gamma(I_i)$.
Note that the next subiteration can perform with the latest $\gamma$, which accelerates decoding convergence. The iteration is terminated when the maximum iteration number is reached or all parity-check equations defined by (1) are met. The decision result is given by the signs of the posterior messages $\gamma$.
The update (3) can be formulated as (6). The SISO operation can be implemented by forward–backward recursion [26] using a nonlinear function as (7). Based on the approximation given by (8), (7) can be approximated by [26] as (9) and (10).
C. mTDMP Decoding Algorithm
The TDMP algorithm achieves high performance using forward–backward recursion based on the approximation (9) or (10), which has a correction term independent of channel and
code parameters and free of lookup tables [26]. This paper further simplifies the extrinsic-message update in the TDMP algorithm and preserves the desirable feature, i.e., independence from channel and code parameters. Based on the correction term in (8), the modification to the update (3) is given by (11), where $\min_1$ and $\min_2$ are the first and second least absolute values of the elements in the prior-message vector $\rho$. The SISO operation in (6) replaced by (11) can be implemented as a forward operation consisting of comparison, addition, and simple correction. The negligible performance loss brought by the modification will be investigated later.
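The four-stage subiteration of Section II-B, with a row update driven by the two least magnitudes as in Section II-C, can be sketched as follows. This is a minimal illustration and not the authors' implementation: it uses the plain min-sum magnitude without the correction term of (11), it processes a single parity-check row in software, and the names `row_update_two_min` and `subiteration` are hypothetical.

```python
def row_update_two_min(rho):
    """Plain min-sum extrinsic update for one row (no correction term).

    rho: prior messages of one parity-check row (at least two values).
    Returns one extrinsic message per input.
    """
    mags = [abs(v) for v in rho]
    # Compare-select: smallest magnitude, its location, and the runner-up.
    loc1 = min(range(len(mags)), key=mags.__getitem__)
    min1 = mags[loc1]
    min2 = min(m for k, m in enumerate(mags) if k != loc1)
    # XOR of all sign bits (1 means an odd number of negative inputs).
    parity = sum(v < 0 for v in rho) % 2
    out = []
    for k, v in enumerate(rho):
        mag = min2 if k == loc1 else min1   # exclude the message itself
        negative = parity ^ (v < 0)         # sign product of the other inputs
        out.append(-mag if negative else mag)
    return out


def subiteration(gamma, lam, idx):
    """One layered-decoding subiteration on the row with index set idx.

    gamma is updated in place; the new extrinsic messages are returned.
    """
    rho = [gamma[j] - l for j, l in zip(idx, lam)]   # stages 1-2: read, subtract
    new_lam = row_update_two_min(rho)                # stage 3a: row decode
    for j, r, l in zip(idx, rho, new_lam):           # stages 3b-4: update, write back
        gamma[j] = r + l
    return new_lam


gamma = [1.0, -2.0, 3.0, 0.5]
lam = subiteration(gamma, [0.0, 0.0, 0.0], [0, 1, 3])
print(lam, gamma)
```

Because only the two minima, the location of the first minimum, and the signs are needed to regenerate every output, the row result can be stored in that compressed form rather than as one value per entry.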
Fig. 2. Error-correction performance.
D. Early Stopping in Iterative Decoding
Because no inherent stopping criterion is available in turbo codes, early stopping algorithms in turbo decoding have been developed to reduce complexity and improve throughput [30], [31], while LDPC decoding can be terminated early by codeword validation using (1). The efficiency of the codeword-validation technique was investigated by Li et al. [32] and Shin et al. [33], who discovered that, in the nonconvergence case, the technique operates ineffectively, which makes the decoding reach the maximum iteration number and results in wasted power and deteriorated throughput. Based on observing satisfied parity check (SPC) equations, a simple and efficient scheme [33] is proposed to terminate iterations that oscillate or get stuck, which happens more frequently in low-SNR regions than in high-SNR regions. Other rules, such as a hard-decision-aided criterion [30], provide no special treatment for oscillation. Thus, in the proposed architecture, the early stopping scheme (ESS) of SPC (ESS-SPC) [33] is adopted to avoid undecodable iterations.
E. Error-Correction-Performance Evaluation
To evaluate the error-correction performance of the proposed mTDMP algorithm and fixed-point scheme with ESS for decoder design, the decoding of QC-LDPC codes is simulated with an additive white Gaussian noise channel and a maximum iteration number of 30. The results of decoding (24, 96, 1/2) and (24, 24, 1/2) codes are shown in Fig. 2. Fig. 2 shows that, when decoding a (24, 96, 1/2) code, the mTDMP algorithm performs with an error-correction loss of less than 0.1 dB compared with the original TDMP algorithm, and the modification using MS without correction brings a loss of about 0.5 dB. The scheme of 6-bit quantization for the extrinsic messages and 8-bit quantization for the posterior messages has a performance loss of less than 0.1 dB at the reference bit error rate (BER). The ESS brings extra performance loss of less than 0.1 dB. The (24, 24, 1/2) code has similar results. Fig. 3 shows the average iteration number at different channel conditions. For the (24, 96, 1/2) code, when ESS is adopted, the average iteration number is reduced by between 10% and 60%, depending on the channel conditions. A reduction of more than 40% is obtained at an SNR below 1.2 dB, because undecodable iterations happen frequently in the low-SNR region. The rule of parameter selection in the ESS-SPC algorithm will be given in Section IV-D.
Fig. 3. Average iteration number.
F. Computational Complexity Evaluation
The computational complexity of the extrinsic-message update in a subiteration with a given row weight is evaluated in terms of multiplication, addition, min operation, and special operation. Table III lists the comparison results for different update algorithms. The algorithms (3) and (6) by approximation (9) [25] have the largest complexity, while the offset MS (OMS) algorithm features the least complexity. The proposed update (11) has a slightly larger complexity than the OMS, i.e., one extra addition and correction, but brings independence from channel and code parameters. The correction term can be implemented with complexity less than an adder [26]. The normalized MS (NMS) has similar complexity compared with the OMS and the proposed update (11), but the offset factor in OMS and the normalization factor in NMS are both dependent on channel and code parameters. Compared with (6) [26], the proposed update (11) gives complexity reduction by more than 50%.
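Codeword validation by (1), and the SPC count observed by the early stopping scheme of Section II-D, can be illustrated directly on the QC structure of Section II-A. The sketch below is illustrative only: the tiny base matrix, the use of -1 to mark a zero submatrix, the cyclic-shift convention, and the name `count_satisfied_checks` are assumptions rather than anything specified in the paper.

```python
def count_satisfied_checks(base, z, bits):
    """Count satisfied parity-check equations of a QC-LDPC code.

    base: base matrix; an entry s >= 0 stands for a z-by-z identity matrix
          cyclically shifted by s, and -1 for a z-by-z zero matrix
          (assumed convention).
    bits: hard-decision word of length len(base[0]) * z.
    """
    satisfied = 0
    for row in base:
        for r in range(z):                  # r-th check inside this block row
            parity = 0
            for col, shift in enumerate(row):
                if shift >= 0:
                    parity ^= bits[col * z + (r + shift) % z]
            satisfied += (parity == 0)
    return satisfied


base = [[0, 1, -1],
        [-1, 2, 0]]
z = 3
word = [0] * 9
print(count_satisfied_checks(base, z, word))   # the all-zero word meets every check
word[4] = 1                                    # flip one bit
print(count_satisfied_checks(base, z, word))
```

Decoding by codeword validation would stop once the count equals the total number of check equations. The thresholds that ESS-SPC applies to this count are not modeled here.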
TABLE IV RESOURCE PARAMETERS OF THE PROPOSED FLEXIBLE-MODE DECODER ARCHITECTURE
III. PROPOSED FLEXI-MODE DECODER ARCHITECTURE
Based on the TDMP algorithm (2)–(6) and the modification (11), we describe the proposed flexi-mode QC-LDPC decoder in the following sections.
A. Flexi-Mode Decoder Description
The partially parallel TDMP decoder consists of a code-pattern memory (CMEM) storing the code-pattern information, a posterior memory (PMEM) storing the posterior messages, an extrinsic memory (EMEM) storing the extrinsic messages, a SISO engine including a first-in first-out (FIFO) stack storing interim messages, and a PN for message delivery [4]. The essential target of designing flexi-mode QC-LDPC decoders is to support the programming on the fly of the code configuration information and to allocate appropriate hardware resources for the decoding operations. For a decoder chip with limited hardware resources, the limitations are given as follows: 1) CMEM depth, width, and capacity; 2) PMEM depth, width, and capacity; 3) EMEM depth, width, and capacity; 4) FIFO depth, width, and capacity; 5) data-path width of the SISO engine, i.e., the number of the SISO cells; and 6) PN width, i.e., the maximum bit number of messages that can be permuted. Table IV describes the relationship among the aforementioned resource limitations, the code parameters, and the bit numbers used for quantization. The resource allocation of flexi-mode QC-LDPC decoders can be made by the relationship in Table IV.
Fig. 4. Proposed flexi-mode decoder architecture.
Fig. 5. Pipelined process flow of decoding continuous blocks.
Fig. 6. Process flow of subiterations.
B. Proposed Decoder Architecture
We demonstrate the proposed decoder architecture for flexi-mode QC-LDPC codes in Fig. 4. The architecture consists of an input–output interface [input–output buffer (IOB)], a CMEM, PMEM banks, EMEM banks, a PPN, a scalable SISO engine, and a control unit (CTRL) including an ESS circuit. In order to support real applications, such as wireless LAN/MANs, the architecture is designed to decode codes with a maximum row weight of 20 and a maximum expansion factor of 96. In the architecture, an input buffer is used to enable parallel initialization for serial-input channel information to reduce initialization latency, and an output buffer is used to decouple the decision output from the decoding core (the decoder excluding IOB) to further reduce latency. Thus, the proposed architecture can decode continuous code blocks, as shown in Fig. 5. Performed serially by the mTDMP algorithm, every four-stage update consumes a number of clock cycles that results in large decoding latency. Therefore, the architecture is pipelined to reduce the cycles needed for decoding a row, with two extra clock cycles inserted to avoid memory access collision at the exchange between subiterations. Fig. 6 shows the pipelined data path of the proposed architecture. The architecture supports parallel processing with up to 96 SISO cells working concurrently, and the number of working cells depends on the dynamic allocation according to the expansion factor of the current code mode. The designed message-delivering network (PPN) is programmable in both permutation size and expansion factor. Based on the aforementioned description, the proposed flexi-mode decoder has the following operating flow.
1) Initializing: PMEM, EMEM, and CMEM banks are initialized with channel information, zero, and code-pattern information, respectively. The PMEM/CMEM initialization is performed with the help of the IOB.
2) Iterative decoding: Before the ESS indicates the decoding termination, an iterative message update is operated based on the mTDMP algorithm. The subiteration has the schedule shown in Fig. 6, which can be given as follows: a) PMEM/EMEM access, message delivery by the PN, and generation of the new extrinsic messages are pipelined, and b) the new posterior messages are generated during the next subiteration.
3) Decoding output: The decoder core (the decoder excluding IOB) writes hard decision results into the IOB, and the IOB provides bit-by-bit output.
The parameters of the proposed architecture are listed in Table IV. Table IV also gives the comparison with the multimode TDMP decoder [4]. Based on the same 4-bit quantization, the proposed EMEM capacity can be reduced by 28%. Benefiting from the proposed low-complexity PPN, which will be detailed in Section IV, the CMEM deployed in the architecture is ultrasmall, i.e., 7.1% of the capacity of that in [4]. We adopt the network allocation method [13], [34] (using the difference of the read/write permutation sizes as the read permutation size) to arrange one PPN for message delivery, which further reduces the permutation complexity. The architecture can be easily extended to be even more flexible if more resources are allocated according to the parameter correlation in Table IV.
IV. PROPOSED DESIGN TECHNIQUES
In this section, area-/power-efficient design techniques will be detailed for the memory deployment, SISO engine, PN, etc.
Fig. 7. Base-matrix optimization for PMEM access reduction.
Fig. 8. PMEM architecture.
A. Structured Memory Architecture
Memories consume a great part of the power in the LDPC decoder, e.g., as much as 50% of the power in the TDMP decoder [24]. In this paper, a structured memory architecture is proposed to reduce power dissipation. First, the base matrix is optimized, as shown in Fig. 7(a)–(d). $H_b$ in Fig. 1 is decomposed into an index matrix containing the locations of the NZEs, which are also the address pattern of PMEM, shown in Fig. 7(a), and the permutation size pattern for the PN, shown in Fig. 7(c). Fig. 7(a) shows that there is little correlation between two adjacent subiterations. For example, there are only two common locations (i.e., PMEM addresses 1 and 13) in the zeroth and first rows and one common location (i.e., 17) in the fourth and fifth rows. The little correlation means that frequent read/write access to PMEM is needed when subiterations are performed. Thus, by finding the maximum correlation between two rows in the whole matrix, the base matrix can be optimized into the patterns shown in Fig. 7(b) and (d). In this way, the newly updated posterior messages can be grouped into two categories: One is directly passed to the next subiteration without being written back, and the other is written back for the subsequent subiterations. The principle here is different from [22], which loosens the coupling between node processing to enable efficient phase overlapping. However, the scheduling algorithm [22] can be used to obtain the address pattern as in Fig. 7(b) and the permutation pattern as in Fig. 7(d). To facilitate PMEM access, the same address pattern is aligned between two subiterations in the direction of row decoding, as shown in Fig. 7(b).
Consequently, a hierarchical structure for PMEM, as in Fig. 8, can be devised to reduce memory access based on the aforementioned optimization. The PMEM consists of multiple banks of dual-port register files with a capacity of 18 432 bits and a block of registers with a capacity of 768 bits. The register files work as the main storage, and the small registers work as the fast-access storage and reduce the path delay between consecutive message updates. The posterior messages will be read from the registers if the adjacent subiteration has the same PMEM address, as in Fig. 7(b), or from the register files otherwise. In this way, access to the register files can be decreased. The optimization results for some modes of the LAN/MAN LDPC codes are shown in Fig. 9. Fig. 9 shows that more than 40% of the accesses can be removed for these codes, and the rate-5/6 MAN codes have the largest reduction of 60%.
Fig. 9. PMEM access reduction.
In addition, the EMEM has large power consumption and area occupation as well [24]. Based on the EMEM operation flow, shown in Fig. 6, access reduction can be made by scheduling and buffering. The architecture of the proposed EMEM is shown in Fig. 10. The EMEM consists of single-port register files and two blocks of registers, i.e., one for read buffering and the other for write–read buffering. The single-port register files are designed with a capacity of 26 496 bits. The access to the register files can be reduced by the following schedule: 1) At the start of one subiteration, read the extrinsic messages from the register files in one clock cycle and buffer the data into the Read Buffer for row decoding, and 2) when the new extrinsic messages are generated, write them into the Write/Read Buffer (WRB), write them from the WRB to the register files in one clock cycle, and update the posterior messages using the extrinsic messages in the WRB. With the aforementioned schedule, the access number of the register files can be optimized to two clock cycles, and the register files can be implemented as single port. The access-reduction ratio, given by $1 - 2/d$ for row weight $d$, can be about 67% and 90% for decoding (24, 96, 1/2) and (24, 96, 5/6) codes, respectively. It is obvious that the larger row weight corresponds to more access reduction.
Fig. 10. EMEM architecture.
Fig. 11. Low-complexity SISO cell.
B. Low-Complexity SISO Engine
The processing unit of the proposed decoder is the scalable low-complexity SISO engine for message update.
The SISO engine is composed of SISO cells. The detailed architecture of the SISO cell is shown in Fig. 11, where the four-stage subiteration is indicated. The row decoding based on algorithm (11) only generates four kinds of values, i.e., the first minimum, the second minimum, the location of the first minimum, and the signs, by comparison–selection (CS) and EXCLUSIVE-OR (XOR) operation. Therefore, the 6-bit quantized extrinsic message is updated in the form of these four kinds of values, which can reduce the EMEM capacity. Accordingly, the stored extrinsic message should be decompressed when it is used in the next update. The decompressor generates the extrinsic messages serially by comparison and two's complement transform. Similar to [4], the FIFO stack is deployed to store the interim prior messages. Based on the activation code defined in Table V, the unused SISO cells can be gated off when decoding a code with a smaller expansion factor. Thus, the SISO cells can be dynamically allocated according to the expansion factor $z$. In this way, the proposed SISO engine is power efficient when operating with different values of $z$.
TABLE V LOOKUP SISO CELL ACTIVATION
C. PPN
Congestion-free networks have been proposed as dual bidirectional networks [4], self-routing networks [15], [29], Benes networks [16], [17], and MS-CS networks [28]. Based on the concept of programmable shuffle networks introduced in [35], this paper gives a PPN with simple control to achieve flexible message delivery with programmable permutation size and expansion factor $z$. Fig. 12 shows the architecture of the PPN based on a multistage logarithmic shifter network. Each stage of the shifter consists of two paths of 2-by-1 multiplexers (MUXs), i.e., paths A and B. The last stage decides the valid messages, and the other messages remain unused according to the expansion factor $z$. The valid messages are distributed between a segment of path A and a segment of path B. The decision control of the MUXs in the last stage is given by (12).
Fig. 12. PPN.
The permutation size is remapped into two types of control instructions, depending on the permutation size and the factor $z$. The instructions are calculated by (13) and (14). The aforementioned control information is generated on the fly by a permutation pattern calculator with simple addition and comparison operations. Table VI lists the complexity of some recent networks and gives quantitative results by delivering 64 4-bit messages when decoding the code with 96 NZEs. Normalized results are also given. The proposed network is much smaller than the dual bidirectional networks [4] in terms of 2-by-1 MUXs and memory for control bits. Benes networks [16], [17] have the least usage of 2-by-1 MUXs (0.84) but have the largest usage of memory. For QC-LDPC decoders, the proposed PPN has sufficient flexibility, ultrasmall usage of memory, very small usage of MUXs, and control simplicity. It is noticed that Benes networks have the largest flexibility in terms of supporting random LDPC code decoding. The MS-CS networks [28] need MUXs with many inputs for delivering messages with an arbitrary expansion factor, which shows large complexity overhead. To deliver messages power efficiently, the extra input messages are always gated to zero when a code with a smaller expansion factor is decoded. The implementation results of the reduced complexity PPN will be demonstrated in Section V.
D. ESS (ESS-SPC)
In the ESS-SPC based on [33], there are three parameters, which indicate the thresholds for enabling the stuck counter, disabling SPC calculation, and terminating the iterations, respectively. In order to support flexi-mode decoding, the three parameters are selected by (15)
and one further quantity is given as an input signal based on the required error-correction performance. The evaluation in Section II is based on rule (15). Rule (15) shows no obvious error-correction loss, as shown in Fig. 2. Equation (15) can be implemented by binary right shifting, subtraction, and division by 3 as (16).
TABLE VI MESSAGE-PASSING COMPLEXITY BASED ON VARIOUS ARCHITECTURES
Fig. 13. Proposed ESS architecture based on SPC.
Fig. 13 shows the proposed architecture for the ESS circuit. The calculation of the SPC number requires the largest complexity in the ESS and may lead to long latency. A tree-like adder (TLA) architecture with quasi-logarithmic adder delay is used to calculate the SPC number at each subiteration. At each subiteration, the parity-check results of the current rows are obtained, and the TLA fetches the results, which are either 1 or 0, and adds them together. At the first stage, every three check results are added together, and at the following stages, every two results of the former stage are added together. Then, the partial sums are accumulated to obtain the SPC number, as in Fig. 13. After comparison (COMP) with thresholds, the termination decision can be made. The proposed TLA costs 1033 gates and can operate at 200 MHz. The power saving brought by the ESS will be discussed in Section V.
E. Programming the Proposed Architecture for LAN/MAN Applications and Beyond
There are 12 and 114 modes in wireless LAN/MAN LDPC codes, respectively. We take the wireless MAN (24, 96, 1/2) code (shown in Fig. 1) as an example to introduce the flexibility and programming flow of the proposed architecture. At the initialization stage, the locations and permutation sizes of the NZEs (76 in the code shown in Fig. 1) in $H_b$ are programmed into CMEM, besides which the 2304 (24 × 96) messages are initialized into PMEM. For an expansion factor of 96, 96 SISO cells are activated, and correspondingly, PMEM banks deliver 96 messages to SISO cells through the PPN. Therefore, 96 messages can be updated concurrently during the iterative decoding stage. The utilized hardware resources can be given by Table IV. At the output stage, a total of 1152 information bits are written from the decoder core into IOB, which is completed within 12 clock cycles. The other code modes in wireless LAN/MAN can be operated with the same decoding flow. Thus, the proposed flexi-mode architecture can support decoding codes with LAN/MAN modes and beyond. Compared with the single/multimode architectures [4], [5], [13]–[15], the proposed architecture has the largest flexibility.
V. IMPLEMENTATION RESULTS AND ANALYSIS
Based on the proposed architecture, a flexi-mode QC-LDPC decoder is implemented in an SMIC 180-nm 1.8-V one-poly sixmetal (1P6M) CMOS technology. The layout photo is shown in Fig. 14. The chip occupies 10.12 mm . The maximum operating frequency is implemented as 100 MHz. The implementation results and analysis will be given in Section VI. The power consumption is estimated by postlayout simulation in the case of codes. The decoding continuous blocks of flexi-mode MAN (24, 96, 1/2), (24, 96, 2/3A), (24, 96, 5/6), (24, 52, 1/2), and (24, 24, 1/2) codes are adopted as example codes to investigate the implemented decoder. The implemented decoder has the same error-correction performance as the fixed-point simulation model, i.e., 6 bits for and 8 bits for with ESS, the result of which is demonstrated in Section II-E. The TDMP decoder [4] achieves at 2.2 dB for a 2048-bit rate-1/2 AA-LDPC a BER of code. The phase-overlapped decoder [15] achieves a BER of at around 2.5 dB for a 2304-bit rate-1/2, i.e., (24, 96, 1/2), QC-LDPC code. This decoder gains a BER of
Fig. 15. Area and power distribution of the main building blocks.
Fig. 14. Layout photograph of the proposed decoder.
at 2.2 dB for the same (24, 96, 1/2) code. The performance improvement (0.3 dB) benefits from the mTDMP algorithm and the quantization scheme with ESS.
Fig. 15 shows the gate-count percentage of the main building blocks of the proposed decoder. Fig. 15 also shows the power dissipation for decoding a (24, 96, 1/2) code at an SNR of 2.0 dB and an operating frequency of 100 MHz. The SISO engine accounts for about 32% of the area and 29% of the power consumption. The PMEM occupies 13% of the chip area but only 7% of the power consumption. Meanwhile, the structured EMEM costs 22% of the area but only 10% of the power dissipation. The PPN consumes about 10% of the area and about 15% of the power. The ESS circuit costs 2657 gates in all (less than 1% of the decoder area) and less than 1% of the power. The IOB circuit is redundant if the proposed decoder is embedded in a wireless LAN/MAN system chip with inherent parallel initialization and output, which means that about 14% of the area and 20% of the power can be saved.
Fig. 16 shows the power consumption when decoding different codes with the same iteration number. Fig. 16 shows the following.
1) The memory (including FIFO) consumes less power to decode the (24, 96, 5/6) code than the (24, 96, 1/2) code, because the (24, 96, 5/6) code has a larger access-reduction ratio for PMEM/EMEM, which shows that the proposed memory architecture is power efficient.
2) The PPN and SISO (excluding FIFO) dissipate less power when decoding shorter codes (i.e., when the expansion factor is smaller); e.g., with an expansion factor of 24, the PPN and SISO dissipate 36% and 50%, respectively, of the power dissipated with an expansion factor of 96 (whereas the ideal ratio is 25%, i.e., 24/96). Similar results can be obtained for the (24, 52, 1/2) code. The results show that the power of the proposed PPN and SISO engine scales with the expansion factor.
Fig. 16. Power distribution for decoding (24, 24, 1/2), (24, 52, 1/2), (24, 96, 5/6), and (24, 96, 1/2) codes with five iterations at 100 MHz.
3) The IOB for the (24, 96, 1/2) and (24, 96, 5/6) codes consumes more power than for the (24, 24, 1/2) and (24, 52, 1/2) codes, because decoding the former two codes has more messages and decision results to initialize and output.
Fig. 17 shows the power consumed in decoding the (24, 96, 1/2) code at an SNR between 0.4 and 2.0 dB. Below 1.2 dB, the power consumption with ESS decreases as the SNR decreases, which results from the adopted ESS-SPC terminating the undecodable iterations at an earlier stage when the SNR is lower. Above 1.2 dB, the power consumption decreases again as the SNR increases, which indicates that fewer decoding failures happen and a smaller average iteration number is needed when the SNR is higher. The power reduction brought by ESS is also shown in Fig. 17. The ESS circuit consumes less than 1% of the total area and power (shown in Fig. 15), brings a small BER performance loss with the chosen parameter, and, at the same time, results in large savings in average iteration number and power consumption, particularly in the low-SNR region (more than 40% in the ESS-SPC at an SNR below 1.2 dB). The parameter can be chosen according to the BER requirement; for example,
TABLE VII SUMMARY OF THE PROPOSED DECODER ARCHITECTURE
Fig. 17. Power-consumption comparison for decoding the (24, 96, 1/2) code.
We define the normalized energy efficiency (NEE), similar to [36], to evaluate decoders implemented in different technologies.
Fig. 18. NTh at different channel conditions.
a higher BER requirement corresponds to a larger value of the parameter, a larger average iteration number, and higher power consumption [33].
We evaluate the net throughput (NTh) of the decoder at different channel conditions by (17), which is expressed in terms of the operating frequency and the number of clock cycles for one iteration. The first part of the denominator of (17) represents the cycles for initialization and output, and the second part represents the cycles for iterative decoding. Under an operating frequency of 100 MHz, the NTh of the implemented decoder for decoding the (24, 96, 1/2), (24, 96, 2/3A), and (24, 96, 5/6) codes is shown in Fig. 18. The NTh ranges from 66 to 168 Mbits/s for decoding the (24, 96, 1/2) code at an SNR between 0.4 and 2.0 dB. The NTh is flexible between 4.4 Mbits/s and 1.45 Gbits/s and can be tuned by different combinations of clock frequency, code rate, and code length. The value of 4.4 Mbits/s corresponds to decoding the (24, 24, 1/2) code with 30 iterations at 50 MHz. The value of 1.45 Gbits/s corresponds to decoding the (24, 96, 5/6) code with one iteration at 100 MHz.
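The structure of the NTh expression described above can be sketched numerically. This is only an illustration under stated assumptions: the denominator follows the text (cycles for initialization and output plus cycles for iterative decoding), whereas the numerator (information bits times clock frequency) is assumed, and the function name `net_throughput_bps` and the cycle counts in the usage lines are hypothetical rather than taken from the implemented chip.

```python
def net_throughput_bps(info_bits: int,
                       clock_hz: float,
                       io_cycles: int,
                       cycles_per_iteration: int,
                       iterations: float) -> float:
    """Information bits delivered per second for one decoded block.

    io_cycles            -- cycles for initialization and output (first part of the denominator)
    cycles_per_iteration -- clock cycles for one iteration
    iterations           -- (average) iteration number, so the second part of the
                            denominator is iterations * cycles_per_iteration
    The numerator info_bits * clock_hz is an assumption of this sketch.
    """
    total_cycles = io_cycles + iterations * cycles_per_iteration
    if total_cycles <= 0:
        raise ValueError("total cycle count must be positive")
    return info_bits * clock_hz / total_cycles


# Hypothetical cycle counts, for illustration only.
few_iterations = net_throughput_bps(1152, 100e6, io_cycles=40,
                                    cycles_per_iteration=80, iterations=5)
many_iterations = net_throughput_bps(1152, 100e6, io_cycles=40,
                                     cycles_per_iteration=80, iterations=10)
print(f"{few_iterations / 1e6:.1f} Mbit/s vs {many_iterations / 1e6:.1f} Mbit/s")
```

The sketch makes the dependence on channel conditions explicit: a lower average iteration number, whether from a higher SNR or from early stopping, shrinks the second part of the denominator and raises the net throughput.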
(18)
In (18), the NEE is expressed in terms of the power consumption. The NEE describes the energy required to decode 1 bit of an LDPC code block in one iteration, normalized to 180-nm CMOS technology. The NEE for decoding the (24, 96, 1/2) code at 1.2 dB is 410 pJ/(bit·iteration).
Table VII summarizes the implementation results of the proposed flexi-mode decoder. Table VIII lists the comparison results with existing AA/QC-LDPC decoders. The proposed decoder costs 198 000 logic gates, which is 10% less than that of the multimode TDMP decoder [4] and 47% less than that of the multimode decoder [15]. The multimode decoder [14] for wireless MAN has a much larger chip area (i.e., 63% larger) if scaled to the same CMOS technology. The flexi-mode decoder [16] has 15% fewer logic gates than this decoder, but its throughput is much lower (about 50%). The memory area of this decoder is half of that in the multimode decoder [15] for wireless MAN applications. The NEE of the proposed decoder is 16.7% less than that of the TDMP decoder [4]. Compared with the multimode decoder [15], this decoder has a similar NEE; considering that the IOB circuit accounts for about 20% of the power, the proposed decoder excluding the IOB consumes considerably less power. Thus, the proposed flexi-mode decoder has the largest flexibility with advantageous error-correction performance and efficient area occupation and power dissipation.
VI. CONCLUSION
This paper deals with the framework and architecture of an area-/power-efficient programmable LDPC decoder compliant to flexi-mode applications. The proposed decoder architecture, which is capable of decoding QC-LDPC codes, has been detailed in the form of several area-/power-efficient design techniques. These techniques include low-complexity decoding algorithms, memory access reduction, scalable SISO engines, reduced-complexity PPNs, ESSs, etc. To verify the proposed architecture and design techniques, a decoder chip supporting wireless LAN/MAN applications is implemented.
The implementation results in terms of power, area, and throughput validate the advantage of the proposed architecture and techniques.
TABLE VIII COMPARISON WITH RECENT LDPC DECODERS
The proposed decoder can be easily extended to future applications with the same architecture and techniques.
REFERENCES
[1] R. G. Gallager, Low-Density Parity-Check Codes. Cambridge, MA: MIT Press, 1963.
[2] D. J. C. MacKay and R. M. Neal, “Near Shannon limit performance of low density parity check codes,” Electron. Lett., vol. 32, no. 18, pp. 1645–1646, Aug. 1996.
[3] S. Chung, D. Forney, T. Richardson, and R. Urbanke, “On the design of low-density parity-check codes within 0.0045 dB of the Shannon limit,” IEEE Commun. Lett., vol. 5, no. 2, pp. 58–60, Feb. 2001.
[4] M. M. Mansour and N. R. Shanbhag, “A 640-Mb/s 2048-bit programmable LDPC decoder chip,” IEEE J. Solid-State Circuits, vol. 41, no. 3, pp. 684–698, Mar. 2006.
[5] A. J. Blanksby and C. J. Howland, “A 690-mW 1-Gb/s 1024-b, rate-1/2 low-density parity-check code decoder,” IEEE J. Solid-State Circuits, vol. 37, no. 3, pp. 404–412, Mar. 2002.
[6] A. Darabiha, A. C. Carusone, and F. R. Kschischang, “Multi-Gbit/sec low density parity check decoders with reduced interconnect complexity,” in Proc. IEEE ISCAS, May 2005, vol. 5, pp. 5194–5197.
[7] L. Liu and C.-J. Shi, “Sliced message passing: High-throughput overlapped decoding of high-rate low-density parity-check codes,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 55, no. 11, pp. 3697–3710, Dec. 2008.
[8] A. Darabiha, A. C. Carusone, and F. R. Kschischang, “Block-interlaced LDPC decoders with reduced interconnect complexity,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 55, no. 1, pp. 74–78, Jan. 2008.
[9] Digital Video Broadcasting (DVB-S2) Via Satellite. [Online]. Available: http://www.dvb.org
[10] Framing Structure, Channel Coding and Modulation for Digital Television Terrestrial Broadcasting System [S], Chinese Standard, 2006.
[11] B. Bangerter, E. Jacobsen, M. Ho, A. Stephens, A. Maltsev, A. Rubtsov, and A. Sadri, “High-throughput wireless LAN air interface,” Intel Technol. J., vol. 7, no. 3, pp. 47–57, Aug. 2003.
[12] Part 16: Air Interface for Broadband Wireless Access Systems, IEEE Std 802.16-2009 (Revision of IEEE Std 802.16-2004), IEEE, Piscataway, NJ. [Online]. Available: http://www.ieee802.org/16/tge/
[13] C. P. Fewer, M. F. Flanagan, and A. D. Fagan, “A versatile variable rate LDPC codec architecture,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 54, no. 10, pp. 2240–2251, Oct. 2007.
[14] X.-Y. Shih, C.-Z. Zhan, C.-H. Lin, and A.-Y. Wu, “An 8.29 mm² 52 mW multi-mode LDPC decoder design for mobile WiMAX system in 0.13 μm CMOS process,” IEEE J. Solid-State Circuits, vol. 43, no. 3, pp. 672–683, Mar. 2008.
[15] C.-H. Liu, S.-W. Yen, C.-L. Chen, H.-C. Chang, C.-Y. Lee, Y.-S. Hsu, and S.-J. Jou, “An LDPC decoder chip based on self-routing network for IEEE 802.16e applications,” IEEE J. Solid-State Circuits, vol. 43, no. 3, pp. 684–694, Mar. 2008.
[16] G. Masera, F. Quaglio, and F. Vacca, “Implementation of a flexible LDPC decoder,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 54, no. 6, pp. 542–546, Jun. 2007.
[17] F. Quaglio, F. Vacca, C. Castellano, A. Tarable, and G. Masera, “Interconnection framework for high-throughput, flexible LDPC decoders,” in Proc. Des. Autom. Test Eur., Mar. 2006, vol. 2, pp. 6–10.
[18] H. Zhang, J. Zhu, H. Shi, and D. Wang, “Layered approxi-regular LDPC: Code construction and encoder/decoder design,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 55, no. 2, pp. 572–585, Mar. 2008.
[19] F. R. Kschischang, B. J. Frey, and H. A. Loeliger, “Factor graphs and the sum–product algorithm,” IEEE Trans. Inf. Theory, vol. 47, no. 2, pp. 498–519, Feb. 2001.
[20] J. Chen and M. P. C. Fossorier, “Near optimum universal belief propagation based decoding of low-density parity check codes,” IEEE Trans. Commun., vol. 50, no. 3, pp. 406–414, Mar. 2002.
[21] Y. Chen and K. K. Parhi, “Overlapped message passing for quasi-cyclic low density parity check codes,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 51, no. 6, pp. 1106–1113, Jun. 2004.
[22] S. H. Kang and I. C. Park, “Loosely coupled memory-based decoding architecture for low density parity check codes,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 53, no. 5, pp. 1045–1056, May 2006.
[23] J. Chen, A. Dholakia, E. Eleftheriou, M. P. C. Fossorier, and X. Y. Hu, “Reduced-complexity decoding of LDPC codes,” IEEE Trans. Commun., vol. 53, no. 8, pp. 1288–1299, Aug. 2005.
[24] M. M. Mansour and N. R. Shanbhag, “High-throughput LDPC decoders,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 11, no. 6, pp. 976–996, Dec. 2003.
[25] D. E. Hocevar, “A reduced complexity decoder architecture via layered decoding of LDPC codes,” in Proc. SIPS, 2004, pp. 107–112.
[26] M. M. Mansour, “A turbo-decoding message-passing algorithm for sparse parity-check matrix codes,” IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4376–4392, Nov. 2006.
[27] Y. Dai, N. Chen, and Z. Yan, “Memory efficient decoder architectures for quasi-cyclic LDPC codes,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 55, no. 9, pp. 2898–2911, Oct. 2008.
[28] M. Rovini, G. Gentile, and L. Fanucci, “Multi-size circular shifting networks for decoders of structured LDPC codes,” Electron. Lett., vol. 43, no. 17, pp. 938–940, Aug. 2007.
[29] C.-H. Liu, C.-C. Lin, H.-C. Chang, C.-Y. Lee, and Y. Hsu, “Multi-mode message passing switch networks applied for QC-LDPC decoder,” in Proc. IEEE ISCAS, May 2008, pp. 752–755.
[30] R. Y. Shao, S. Lin, and M. P. C. Fossorier, “Two simple stopping criteria for turbo decoding,” IEEE Trans. Commun., vol. 47, no. 8, pp. 1117–1120, Aug. 1999.
[31] C. Benkeser, A. Burg, T. Cupaiuolo, and Q. Huang, “A 58 mW 1.2 mm² HSDPA turbo decoder ASIC in 0.13 μm CMOS,” in Proc. IEEE ISSCC Dig. Tech. Papers, 2008, pp. 264–612.
[32] J. Li, X. H. You, and J. Li, “Early stopping for LDPC decoding: Convergence of mean magnitude (CMM),” IEEE Commun. Lett., vol. 10, no. 9, pp. 667–669, Sep. 2006.
[33] D. Shin, K. Heo, S. Oh, and J. Ha, “A stopping criterion for low-density parity-check codes,” in Proc. VTC—Spring, Apr. 22–25, 2007, pp. 1529–1533.
[34] K. Gunnam, C. Gwan, W. Weihuang, and M. Yeary, “Multi-rate layered decoder architecture for block LDPC codes of the IEEE 802.11n wireless standard,” in Proc. IEEE ISCAS, May 2007, pp. 1645–1648.
[35] Y. Deng, “Reconfigurable LDPC decoder for digital video broadcasting systems,” M.S. thesis, State Key Lab. ASIC Syst., Fudan Univ., Shanghai, China, 2008.
[36] H.-Y. Hsu, J.-C. Yeo, and A.-Y. Wu, “Multi-symbol-sliced dynamically reconfigurable Reed–Solomon decoder design based on unified finite-field processing element,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 14, no. 5, pp. 489–500, May 2006.

Dan Bao received the B.S. degree in electronics from Beijing University of Aeronautics and Astronautics, Beijing, China, in 2005. He is currently working toward the Ph.D. degree in microelectronics at the State Key Laboratory of ASIC and System, Fudan University, Shanghai, China. From August 2005 to July 2006, he was an Assistant Engineer with Marine Radar Institute, Nanjing, China. His research interests include application-specific integrated circuit (ASIC) designs, channel decoding techniques, and VLSI architectures for broadband wireless transmission systems.

Bo Xiang received the B.S. degree in microelectronics from Sichuan University, Chengdu, China, in 2005. He is currently working toward the Ph.D. degree in microelectronics at the State Key Laboratory of ASIC and System, Fudan University, Shanghai, China. His research focuses on wireless communication systems and their VLSI architecture design, particularly channel coding and decoding algorithms and their VLSI implementations.
Rui Shen received the B.S. degree in electrical engineering from Tongji University, Shanghai, China, in 2006. She is currently working toward the M.S. degree in microelectronics at the State Key Laboratory of ASIC and System, Fudan University, Shanghai. Her research focuses on wireless communication systems and their VLSI architecture design.
An Pan received the B.S. degree from East China Normal University, Shanghai, China, in 2006. He is currently working toward the M.S. degree in microelectronics at the State Key Laboratory of ASIC and System, Fudan University, Shanghai. His main research interests include digital signal processing, orthogonal frequency-division multiplexing (OFDM) systems, and wireless transmission communications, particularly channel estimation and equalization for high-definition television (HDTV).

Yun Chen (A’09) received the B.S. and M.S. degrees from the University of Electronic Science and Technology of China, Chengdu, China, in 2000, and the Ph.D. degree from Fudan University, Shanghai, China, in 2007. She is currently with the State Key Laboratory of ASIC and System, Fudan University. Her research interests include high-definition television chip design and VLSI implementation of signal processing and communication systems.

Xiao Yang Zeng (M’05) received the B.S. degree from Xiangtan University, Xiangtan, China, in 1992, and the Ph.D. degree from Changchun Institute of Optics, Fine Mechanics, and Physics, Chinese Academy of Sciences, Beijing, China, in 2001. From 2001 to 2003, he was a Postdoctoral Researcher with Fudan University, Shanghai, China. Then, he joined the State Key Laboratory of ASIC and System, Fudan University, as an Associate Professor, where he is currently a Full Professor and the Director. His research interests include information security chip design, system-on-chip platforms, and VLSI implementation of digital signal processing and communication systems.