IEICE TRANS. FUNDAMENTALS, VOL.E89–A, NO.4 APRIL 2006
PAPER
Special Section on Selected Papers from the 18th Workshop on Circuits and Systems in Karuizawa
Partially-Parallel LDPC Decoder Achieving High-Efficiency Message-Passing Schedule

Kazunori SHIMIZU†a), Tatsuyuki ISHIKAWA†, Nonmembers, Nozomu TOGAWA††, Takeshi IKENAGA†, Members, and Satoshi GOTO†, Fellow
SUMMARY In this paper, we propose a partially-parallel LDPC decoder which achieves a high-efficiency message-passing schedule. The proposed LDPC decoder is characterized as follows: (i) The column operations follow the row operations in a pipelined architecture to ensure that the row and column operations are performed concurrently. (ii) The proposed parallel pipelined bit functional unit enables the column operation module to compute every message in each bit node which is updated by the row operations. These column operations can be performed without extending the single iterative decoding delay when the row and column operations are performed concurrently. Therefore, the proposed decoder performs the column operations more frequently in a single iterative decoding, and achieves a high-efficiency message-passing schedule within the limited decoding delay time. Hardware implementation on an FPGA and simulation results show that the proposed partially-parallel LDPC decoder improves the decoding throughput and bit error performance with a small hardware overhead. key words: low-density parity-check codes, partially-parallel LDPC decoder, message-passing algorithm, FPGA
1. Introduction
Low-Density Parity-Check (LDPC) codes achieve information rates very close to the Shannon limit by using the message-passing algorithm [1]–[4]. In the last few years, some work has been done on designing LDPC decoders [7]–[13]. LDPC decoders are composed of a check functional unit (CFU) and a bit functional unit (BFU), where the CFU performs row operations for check nodes and the BFU performs column operations for bit nodes. References [7], [8] have proposed fully-parallel LDPC decoders. Considering the trade-offs between hardware cost and decoding throughput, the partially-parallel LDPC decoder is the most practical implementation, as indicated in Refs. [9]–[13]. The decoding throughput of an LDPC decoder is determined by the decoding delay time, which is the product of the single iterative decoding delay and the number of iterations. An increase in the decoding delay time degrades not only the decoding throughput but also the bit error performance, because the decoder has to correct as many bit errors as possible within the limited decoding delay time.

Manuscript received June 23, 2005. Manuscript revised September 30, 2005. Final manuscript received November 30, 2005.
† The authors are with the Graduate School of Information, Production and Systems, Waseda University, Kitakyushu-shi, 808-0135 Japan.
†† The author is with the Dept. of Computer Science, Waseda University, Tokyo, 169–8555 Japan.
a) E-mail: [email protected]
DOI: 10.1093/ietfec/e89–a.4.969
The requirements for improving the decoding throughput and bit error performance are as follows: (1) The single iterative decoding delay should be reduced by performing the row and column operations concurrently. (2) The number of iterations until decoding convergence is reached should be reduced by improving the message-passing efficiency. Requirements (1) and (2) depend on the message-passing schedule for the row and column operations. On the other hand, the requirement from a hardware implementation point of view is as follows: (3) The message-passing schedule should not complicate the hardware design. In particular, the message-passing schedule should not partition the memory into a large number of memory banks.

The decoder shown in Refs. [9]–[11] performs the row and column operations independently. This schedule enables the decoder to perform the row and column operations concurrently using a dual memory architecture, so the single iterative decoding delay is reduced. However, the message-passing efficiency between the check and bit nodes is degraded since the row and column operations are performed independently (i.e., the decoder does not meet requirement (2)). On the other hand, in the partially-parallel LDPC decoders shown in Refs. [12], [13], the row operations follow the column operations. By approximating the column operation, the decoder of Ref. [12] reduces the single iterative decoding delay and the number of memory banks and words. However, the approximation degrades the message-passing efficiency (i.e., the decoder does not meet requirement (2)). In the decoder shown in Ref. [13], each column operation computes only a single message in association with each row operation. In order to compute a single message in the column operation and perform the row operation concurrently, the decoder partitions the memory into a large number of memory banks (i.e., the decoder does not meet requirement (3)).

In this paper, we propose an efficient architecture for the partially-parallel LDPC decoder which meets requirements (1), (2) and (3) simultaneously. The proposed decoder is based on the simple addressing and control logic shown in Refs. [9]–[11]. Firstly, the proposed schedule
performs the row operations, which determine the positions for the column operations; the column operations are then performed at these positions. We propose a pipelined architecture to ensure that the row and column operations are performed concurrently. Secondly, we focus on the fact that the computational complexity of the column operation is less than that of the row operation. In the proposed schedule the row and column operations are performed concurrently, as a result of which the column operations can be performed more frequently in a single iterative decoding. From this point of view, the proposed parallel pipelined bit functional unit enables the column operation module to compute every message in each bit node which is updated by the row operations. These column operations can be performed without extending the single iterative decoding delay. By using the proposed schedule, the row and column operations can be performed concurrently, and the message-passing efficiency is improved significantly. The proposed partially-parallel LDPC decoder was implemented on an FPGA, and the bit error performance of the decoder was simulated. Hardware implementation and simulation results show that the proposed decoder improves the decoding throughput and bit error performance with a small hardware overhead.

2. Partially-Parallel LDPC Decoder
Low-Density Parity-Check (LDPC) codes are a class of linear block codes with very sparse parity check matrices. The size of the parity check matrix H is $M \times N$, where M is the total number of check bits and N is the total number of codeword bits. A codeword y satisfies the parity check equation $H \cdot y = 0$. The parity check matrix is represented by a bipartite graph called a Tanner graph, shown in Fig. 1. There are two types of nodes in the graph, called bit and check nodes. Each check node $c_m$, $m = 1, \dots, M$, is connected to the bit nodes $b_n$, $n = 1, \dots, N$, for which the elements of the matrix H are one. LDPC codes can be decoded iteratively using a message-passing algorithm as described in Ref. [6]. Each iteration of the message-passing algorithm is composed of two phases. Phase 1, called the row operation, updates the messages ($\alpha_{mn}$) of all check nodes and sends the messages to
the bit nodes. Phase 2, called the column operation, updates the messages ($\beta_{mn}$) of all bit nodes and sends the messages to the check nodes. The message-passing algorithm is defined in Fig. 2, where $A(m) \triangleq \{n \mid H_{mn} = 1\}$, $B(n) \triangleq \{m \mid H_{mn} = 1\}$, and the Gallager function is $f(x) \triangleq \ln\frac{\exp(x)+1}{\exp(x)-1}$. A partially-parallel LDPC decoder performs Phase 1 (row operations) and Phase 2 (column operations) partially in parallel. For the partially-parallel LDPC decoder, the parity check matrix has to be structured in order to reuse the parallel CFUs and BFUs. Figure 3 shows the block-structured parity check matrix for a $(w_c, w_r)$-regular LDPC code. The matrix is composed of $w_c \times w_r$ sub-blocks. The diagonal line in each sub-block in Fig. 3 represents the ones in the sub-block. Each $b \times b$ square matrix is defined by cyclically shifting each row of the identity matrix $I_{b \times b}$ to the right; we determine the shift values by cyclotomic cosets as shown in Ref. [10]. If the partially-parallel LDPC decoder has k CFUs for each of the $w_c$ row blocks and k BFUs for each of the $w_r$ column blocks, the decoder can perform $k \times w_c$ row operations and $k \times w_r$ column operations in parallel. In order to perform the row and column operations concurrently, a dual memory architecture is used in the LDPC decoder. With the dual memory architecture, the messages β obtained from the column operations can be stored in the memory of the row operation module while the messages α obtained from the row operations are stored in the memory of the column operation module [9], [10].
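As an illustration, the following sketch assembles such a block-structured matrix from shifted identity sub-blocks. It is a minimal sketch: the shift values used here are arbitrary placeholders, whereas the paper derives the actual shifts from cyclotomic cosets [10].

```python
import numpy as np

def shifted_identity(b: int, s: int) -> np.ndarray:
    """b x b identity matrix with each row cyclically shifted s positions to the right."""
    return np.roll(np.eye(b, dtype=np.uint8), s, axis=1)

def block_structured_H(wc: int, wr: int, b: int, shifts) -> np.ndarray:
    """Stack wc x wr shifted-identity sub-blocks into a (wc*b) x (wr*b) matrix H."""
    return np.vstack([np.hstack([shifted_identity(b, shifts[i][j]) for j in range(wr)])
                      for i in range(wc)])

# Toy instance: (3,6)-regular structure with b = 7 and placeholder shifts.
shifts = [[(3 * i + j) % 7 for j in range(6)] for i in range(3)]
H = block_structured_H(3, 6, 7, shifts)
assert (H.sum(axis=0) == 3).all()  # column weight w_c
assert (H.sum(axis=1) == 6).all()  # row weight w_r
```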
Fig. 1 Tanner graph of a parity check matrix.

Fig. 2 Message-passing algorithm:

Initialization: Compute the log likelihood ratio (LLR) $\lambda_n$ for the bit nodes ($n = 1, 2, \dots, N$), and set $\beta_{mn} = \lambda_n$ for each $(m, n)$ satisfying $H_{mn} = 1$.

Phase 1: For all the check nodes $c_m$, in the order corresponding to $m = 1, 2, \dots, M$, compute the message $\alpha_{mn}$ with the following equation, where each pair $(m, n)$ satisfies $H_{mn} = 1$:

$$\alpha_{mn} = \Biggl(\prod_{n' \in A(m)\setminus n} \operatorname{sign}(\beta_{mn'})\Biggr) \cdot f\Biggl(\sum_{n' \in A(m)\setminus n} f(|\beta_{mn'}|)\Biggr). \tag{1}$$

Phase 2: For all the bit nodes $b_n$, in the order corresponding to $n = 1, 2, \dots, N$, compute the message $\beta_{mn}$ with the following equation, where each pair $(m, n)$ satisfies $H_{mn} = 1$:

$$\beta_{mn} = \lambda_n + \sum_{m' \in B(n)\setminus m} \alpha_{m'n}. \tag{2}$$

Tentative decision: Compute all the tentative LDPC codeword bits $\hat{y}_n$ for $n = 1, 2, \dots, N$ with the following equation:

$$\hat{y}_n = \begin{cases} 0, & \operatorname{sign}\bigl(\lambda_n + \sum_{m' \in B(n)} \alpha_{m'n}\bigr) = 1, \\ 1, & \operatorname{sign}\bigl(\lambda_n + \sum_{m' \in B(n)} \alpha_{m'n}\bigr) = -1. \end{cases} \tag{3}$$

Parity check: If the tentative LDPC codeword $\hat{y}_n$ satisfies the parity check equation shown in Eq. (4), or if the maximum number of iterations is reached, then stop the algorithm; otherwise go to Phase 1 and continue iterating:

$$H \cdot (\hat{y}_1, \hat{y}_2, \dots, \hat{y}_N)^T = 0. \tag{4}$$
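For reference, the following is a compact, unoptimized Python sketch of the algorithm of Fig. 2 under a flooding schedule. The sign convention (a positive LLR indicates bit 0), the numerical clipping, and all identifiers are our own assumptions, not the authors' implementation.

```python
import numpy as np

def f(x):
    """Gallager function f(x) = ln((exp(x)+1)/(exp(x)-1)); note f is its own inverse."""
    x = np.clip(x, 1e-9, 30.0)  # guard against overflow/underflow
    return np.log((np.exp(x) + 1.0) / (np.exp(x) - 1.0))

def decode(H, llr, max_iter=8):
    """Message passing per Eqs. (1)-(4); returns (tentative codeword, iterations used)."""
    M, N = H.shape
    beta = H * llr[None, :]              # init: beta_mn = lambda_n where H_mn = 1
    for it in range(1, max_iter + 1):
        # Phase 1 (row operations, Eq. (1)): exclude the target bit n itself
        alpha = np.zeros_like(beta)
        for m in range(M):
            A = np.flatnonzero(H[m])
            for n in A:
                others = A[A != n]
                sign = np.prod(np.sign(beta[m, others]))
                alpha[m, n] = sign * f(np.sum(f(np.abs(beta[m, others]))))
        # Phase 2 (column operations, Eq. (2)): exclude the target check m itself
        for n in range(N):
            B = np.flatnonzero(H[:, n])
            for m in B:
                beta[m, n] = llr[n] + np.sum(alpha[B[B != m], n])
        # Tentative decision (Eq. (3)) and parity check (Eq. (4))
        y = (llr + (alpha * H).sum(axis=0) < 0).astype(np.uint8)
        if not np.any(H @ y % 2):
            return y, it
    return y, max_iter
```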
Fig. 3 Block-structured parity check matrix for a partially-parallel LDPC decoder.

3. Partially-Parallel LDPC Decoder Achieving High-Efficiency Message-Passing Schedule
In this section, we propose a novel high-efficiency message-passing schedule and its hardware architecture for the partially-parallel LDPC decoder.

3.1 High-Efficiency Message-Passing Schedule

In order to improve the decoding throughput and bit error performance, the message-passing schedule and its hardware architecture should meet requirements (1), (2) and (3) described in Section 1. In order to meet requirement (3), we base the proposed message-passing schedule and hardware architecture on the simple addressing and control logic shown in Refs. [9]–[11]. Figure 4 shows an example of the message-passing schedule of Refs. [9]–[11]. This schedule performs the $i_1, i_2, i_3$-th row operations, stepping from 1 to b, and the $j_1, \dots, j_6$-th column operations, stepping from 1 to b, in parallel. The row and column operations are performed independently. In the q-th column block, three messages $\alpha'(i_1(q))$, $\alpha'(i_2(q))$, $\alpha'(i_3(q))$ are updated by the $i_1, i_2, i_3$-th row operations, and three messages $\beta'(j_q(1))$, $\beta'(j_q(2))$, $\beta'(j_q(3))$ are updated by the $j_q$-th column operation. The row and column operations update the messages α′ and β′ at the positions where the elements of the $i_1, i_2, i_3$-th rows and the $j_1, \dots, j_6$-th columns are one. Therefore, the positions of the updated messages α′ differ from those of the updated messages β′. An updated message α′ is not used by the column operations until the column number $j_q$ reaches the column position of that message. In addition, the timing at which the column operation is performed with the updated messages α′ differs among the sub-blocks, because the shift value of the identity matrix $I_{b \times b}$ for each sub-block differs from those of the other sub-blocks (see Fig. 3). From this point of view, we propose a high-efficiency message-passing schedule that improves the timing at which the column operations are performed with the latest messages
Fig. 4 Message-passing schedule of Refs. [9]–[11] in the q-th column block.
α′ updated by the row operations. In addition, we focus on the fact that the computational complexity of the row and column operations is proportional to the number of inputs. For a $(w_c, w_r)$-regular LDPC code, the number of inputs of a row operation is the row degree $w_r$ and the number of inputs of a column operation is the column degree $w_c$ of the parity check matrix. The code rate R of a $(w_c, w_r)$-regular LDPC code satisfies $0 < 1 - w_c/w_r \le R < 1$, so the degrees satisfy $w_c < w_r$; for the (3, 6)-regular codes used in this paper, for example, each column operation has half as many inputs as each row operation. This indicates that the computational complexity of the column operation is less than that of the row operation. Therefore, the column operations can be performed more frequently than the row operations within the single iterative decoding delay. In order to meet requirement (2), we propose a high-efficiency message-passing schedule which is characterized as follows: (i) First, the proposed schedule performs the row operations, which determine the positions for the column operations. The column operations are then performed at these positions.
IEICE TRANS. FUNDAMENTALS, VOL.E89–A, NO.4 APRIL 2006
972
Fig. 5 High-efficiency message-passing schedule in the q-th column block.

Fig. 6 High-efficiency message-passing schedule in the q-th column block:
Input: $\beta(i_1(q))$, $\beta(i_2(q))$, $\beta(i_3(q))$.
Step 1: The $i_1, i_2, i_3$-th row operations in each row block are performed in parallel. The $i_1$-th row operation updates the message $\alpha'(i_1(q))$, the $i_2$-th row operation updates the message $\alpha'(i_2(q))$, and the $i_3$-th row operation updates the message $\alpha'(i_3(q))$.
Step 2: In each q-th column block, the $j_q^1$-th column operation is performed using the updated message $\alpha'(i_1(q))$, the $j_q^2$-th column operation using the updated message $\alpha'(i_2(q))$, and the $j_q^3$-th column operation using the updated message $\alpha'(i_3(q))$.
Step 3: The $j_q^1$-th column operation updates the messages $\beta'(j_q^1(1))$, $\beta'(j_q^1(2))$, and $\beta'(j_q^1(3))$; the $j_q^2$-th column operation updates the messages $\beta'(j_q^2(1))$, $\beta'(j_q^2(2))$, and $\beta'(j_q^2(3))$; the $j_q^3$-th column operation updates the messages $\beta'(j_q^3(1))$, $\beta'(j_q^3(2))$, and $\beta'(j_q^3(3))$.
Output: $\beta'(j_q^1(1))$, $\beta'(j_q^1(2))$, $\beta'(j_q^1(3))$, $\beta'(j_q^2(1))$, $\beta'(j_q^2(2))$, $\beta'(j_q^2(3))$, $\beta'(j_q^3(1))$, $\beta'(j_q^3(2))$, $\beta'(j_q^3(3))$.

Fig. 7 The average number of iterations ($l_{max}$ = 4).
(ii) The proposed schedule performs every column operation using every message α′ updated by the row operations (a control-flow sketch of the resulting iteration follows below). The proposed schedule is shown in Figs. 5 and 6. In the proposed schedule, the column operations are always performed using the updated messages α′ immediately after the row operations are performed. In addition, three column operations are performed using the three messages α′ in each q-th column block. As shown in Figs. 4 and 5, the number of column operations under the proposed schedule is three times that under the schedule of Refs. [9]–[11]. Accordingly, the proposed schedule also accelerates the timing at which the row operations are performed with the latest messages β′ updated by the column operations. By using the high-efficiency message-passing schedule, the number of iterations for decoding can be reduced. This allows the decoder not only to increase the decoding throughput but also to improve the bit error performance within a limited decoding delay time.
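The control flow of one iteration under the proposed schedule can be sketched as follows. Here row_op, col_op, and r2c are hypothetical stand-ins for the row operation modules, the column operation modules, and the R2C address translator described in Sect. 3.2.

```python
def proposed_schedule_iteration(b, wc, wr, row_op, col_op, r2c):
    """One decoding iteration under the proposed schedule (cf. Figs. 5 and 6).

    row_op(i)    -- performs the i-th row operation in every row block
    col_op(q, j) -- performs the j-th column operation in column block q,
                    using the freshest alpha' messages
    r2c(p, q, i) -- address translator: column position of the 1 in
                    sub-block (p, q) for row i (a cyclic shift)
    """
    for i in range(b):                   # row addresses i = 1 .. b
        row_op(i)                        # Step 1: wc parallel row operations
        for q in range(wr):              # Steps 2/3: in every column block q,
            for p in range(wc):          # run a column operation at each
                col_op(q, r2c(p, q, i))  # position just updated by a row op

# Example wiring with stub operations for a toy code (wc=3, wr=6, b=7):
log = []
proposed_schedule_iteration(
    b=7, wc=3, wr=6,
    row_op=lambda i: log.append(("row", i)),
    col_op=lambda q, j: log.append(("col", q, j)),
    r2c=lambda p, q, i: (i + (3 * p + q)) % 7,  # placeholder shifts
)
```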
Fig. 8 Bit error performance ($l_{max}$ = 4).
We evaluate the message-passing efficiency of the schedules of Refs. [9]–[13] and of the proposed schedule at the algorithm level. We employ (3,6)-regular LDPC codes whose codeword length is $b \times w_r = 556 \times 6 = 3336$ bits, where each sub-block of the parity check matrix is of size b = 556. The maximum number of iterations is set to $l_{max}$ = 4 and 8, respectively. In the simulations, we assume the channel to be an AWGN (Additive White Gaussian Noise) channel. Figures 7 and 9 show the average number of iterations until decoding convergence is reached. Figures 8 and 10 show the bit error performance of each schedule. The results show that the average number of iterations under the proposed schedule is reduced by up to about 35%, and up to about 1.5 dB better coding gain is obtained, compared to the schedule of Refs. [9]–[11]. The average number of iterations under the schedules of Refs. [12], [13] is also better than that under the schedule of Refs. [9]–[11]. This is because the schedules of Refs. [12], [13] perform the row operations following the column operations, which improves the timing at which the row operations are performed with the latest messages β′ updated by the column operations.
Fig. 9 The average number of iterations ($l_{max}$ = 8).

Fig. 10 Bit error performance ($l_{max}$ = 8).

Fig. 11 Proposed row operation module.

However, the schedule of Ref. [12] approximates the column operations, and the schedule of Ref. [13] computes only a single message in association with each row operation. Compared with the schedule of Ref. [12], the average number of iterations under the proposed schedule is reduced by up to about 14%, and up to about 0.73 dB better coding gain is obtained. Compared with the schedule of Ref. [13], the average number of iterations under the proposed schedule is reduced by up to about 16%, and up to about 0.27 dB better coding gain is obtained. The results show that the schedule of Refs. [9]–[11] degrades the bit error performance significantly when the maximum number of iterations ($l_{max}$) is small. On the other hand, even if the maximum number of iterations ($l_{max}$) is increased, the schedule of Ref. [12] shows only a slight improvement in the bit error performance because of the approximation of the column operations. The results show that the proposed schedule achieves the best performance among the schedules of Refs. [9]–[13]. This is because the proposed schedule performs the column operations more frequently than the other schedules, in addition to improving the timing of the column operations. In the next section, we propose an efficient hardware architecture for the proposed high-efficiency message-passing schedule which ensures that the row and column operations are performed concurrently.

3.2 Hardware Architecture
The main modules of the partially-parallel LDPC decoder are the row operation module and the column operation module. In order to meet requirement (1), we propose a hardware architecture for the partially-parallel LDPC decoder based on the high-efficiency message-passing schedule, which is characterized as follows: (i) The column operations follow the row operations in a pipelined architecture to ensure that the row and column operations are performed concurrently. (ii) Our parallel pipelined bit functional unit enables the decoder to complete three column operations within a single row operation delay. Therefore, these column operations can be performed without extending the single iterative decoding delay. In the following sections, we design the partially-parallel LDPC decoder for (3, 6)-regular LDPC codes.

3.2.1 Row Operation Module

The row operation module is shown in Fig. 11. Each row operation module has six memory banks ($\beta(i_p(1)), \dots, \beta(i_p(6))$) for the messages β, which are updated by six column operations in parallel.
Fig. 12 Proposed column operation module.

Fig. 13 Serial bit functional unit.
When the row operation module has k CFUs, each memory bank holds k sets of messages β in a single word, and each memory bank is composed of b/k words. The address translator (R2C) translates a row address to the column address corresponding to the position of the one in each sub-block. For the $i_p$-th row operation, the row operation module inputs the row address $i_p$ to the memory banks for the messages β and to the address translator. The CFU computes Eq. (1) using the messages $\beta(i_p(1)), \dots, \beta(i_p(6))$, where the function f(x) is the Gallager function. Since hardware for the Gallager function requires a large number of gates, an approximated minimum function can be applied to the parallel LDPC decoder [6]. The row operation module outputs six sets of the messages $\alpha'(i_p(1)), \dots, \alpha'(i_p(6))$ and the corresponding column addresses ($j_p^1, \dots, j_p^6$) to the six column operation modules.

3.2.2 Column Operation Module

The proposed column operation module is shown in Fig. 12. Each column operation module has three memory banks for the messages α, which are updated by the three row operation modules. The row operation module holds the memory banks for the messages β and the column operation module holds the memory banks for the messages α (i.e., the dual memory architecture), so the row and column operations can be performed concurrently. Each column operation module receives three sets of the messages α′ and the corresponding column addresses from the three row operation modules. The addressing unit in the proposed column operation module stores the three column addresses. In the proposed column operation module, the BFU performs the three column operations for the three column addresses $j_q^x$ ($x = 1, \dots, 3$) with Eq. (2) sequentially. The address translator (C2R) translates a column address to the row address corresponding to the position of the one in each sub-block. For a single $j_q^x$-th column operation, the column operation module inputs the column address $j_q^x$ to the memory banks for the messages α and to the address translator.
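Since each sub-block is a cyclically shifted identity matrix, both address translators reduce to modular address arithmetic. A minimal sketch follows, assuming rows shifted s positions to the right; the shift value here is a placeholder (the paper takes the actual values from cyclotomic cosets [10]).

```python
def r2c(i: int, s: int, b: int) -> int:
    """R2C address translator: column of the 1 in row i of a b x b identity
    matrix whose rows are cyclically shifted s positions to the right."""
    return (i + s) % b

def c2r(j: int, s: int, b: int) -> int:
    """C2R address translator: the inverse mapping, column address to row address."""
    return (j - s) % b

b, s = 556, 17          # sub-block size from the paper; shift is a placeholder
assert c2r(r2c(200, s, b), s, b) == 200
```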
Fig. 14 Proposed parallel pipelined bit functional unit.
The single column operation computes the three messages $\beta'(j_q^x(1))$, $\beta'(j_q^x(2))$, $\beta'(j_q^x(3))$ using the updated message $\alpha'(i_x(q))$. In order to achieve an operation frequency as high as that of the CFU, the serial BFU architecture shown in Fig. 13 can be applied to the column operation module [10]. The serial BFU computes the additions shown in Eq. (2) serially. However, with the serial BFU, the three column operations of the proposed schedule take more clock cycles than a single row operation. Therefore, the three column operations of the proposed high-efficiency message-passing schedule become a bottleneck in a single iterative decoding when a single row operation and three column operations are performed concurrently. From this point of view, we propose a parallel pipelined BFU, shown in Fig. 14, which computes the three sets of additions in parallel. Figure 15 shows the timing diagram of the row and column operations. The upper half of Fig. 15 shows that the row operation module computes the absolute values |β| and determines the first and second minimum values of |β| in parallel using the pipelined architecture. In order to achieve a high operation frequency, the CFU in the row operation module compares the values |β| once per cycle, so it takes 7 clock cycles to determine the first and second minimum values, and 10 clock cycles in total to perform a single row operation, as shown in Fig. 15. The lower half of Fig. 15 shows that each column operation module performs three column operations after the row operations. The BFU with the pipelined architecture takes 6 clock cycles in total to perform the three column operations. The number of clock cycles for three column operations is thus less than that for a single row operation. In addition, the critical path delay of the proposed column operation module is expected to be smaller than that of the row operation module. Therefore, our partially-parallel LDPC decoder does not degrade the decoding throughput when a single row operation and three column operations are performed concurrently.
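A rough cycle model may make this concrete. It is our own simplification, using the 10-cycle row operation latency, the 6-cycle triple column operation, and the 7-cycle effective row-operation interval derived in Sect. 4; it is not the authors' exact timing.

```python
def iteration_delay_cycles(b, k, row_ii=7, row_latency=10, col3_latency=6):
    """Approximate cycles per iteration: b/k pipelined row operations issued
    every row_ii cycles, plus the latency tails. Because the triple column
    operation (6 cycles) is shorter than a row operation (10 cycles), the
    column operations stay hidden behind the row pipeline."""
    assert col3_latency <= row_latency   # column ops are not the bottleneck
    return (b // k - 1) * row_ii + row_latency + col3_latency

print(iteration_delay_cycles(556, 4))   # 982 cycles for b = 556, k = 4
```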
Fig. 15 Timing diagram of the row and column operations.

Table 1 Comparison of the number of memory banks and words based on each schedule.

Schedule | messages (αmn): banks, words | messages (βmn): banks, words | input (λn): banks, words | output (ŷn): banks, words | total (wr=6, wc=3, b=556): banks, words
Ref. [12] | –, – | wr, wr×b | wr, wr×b | wr, wr×b | 18, 10008
Ref. [13] | wr×wc, wr×wc×b | 2×wr×wc, 2×wr×b | wr×wc, wr×b | wr×wc, wr×b | 90, 23352
Refs. [9]–[11] and Proposed Schedule | wr×wc, wr×wc×b | wr×wc, wr×wc×b | wr, wr×b | wr, wr×b | 48, 26688
In the proposed column operation module, the unit with the largest overhead is the addressing unit (the gray area of Fig. 12), which is needed to perform the three column operations after the row operations. In addition, the hardware size of the proposed parallel pipelined BFU is larger than that of the serial BFU.

4. Implementation Results
Firstly, we evaluate the number of memory banks and words for the LDPC decoder. Memories are required for the messages $\alpha_{mn}$ and $\beta_{mn}$, the input values $\lambda_n$, and the tentative output values $\hat{y}_n$. The required numbers of memory banks and words for the schedules of Refs. [9]–[13] and for the proposed schedule are shown in Table 1. In the table, the numbers of memory banks and words are calculated from the parity check matrix shown in Fig. 3; the totals are obtained for $w_r$ = 6, $w_c$ = 3, b = 556. The schedule of Ref. [12] is designed on a single memory architecture, and the row operations follow the column operations, so the decoder based on this schedule does not perform the row and column operations concurrently. By approximating the column operation, the single iterative decoding delay can be reduced: in the approximation, the $w_c$ messages β in each bit node are assumed to take the same value, so the required numbers of memory banks and words can be reduced significantly. However, as shown in Figs. 8 and 10, this approximation degrades the bit error performance significantly. In the schedule of Ref. [13], each column operation computes only a single message in association with each row operation. Therefore, the required number of memory words for the messages β is less than that of the proposed schedule; clearly, the number of message-passings from the bit nodes under this schedule is also less than that under the proposed schedule.
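The totals in Table 1 follow from summing the per-memory formulas. The small script below is our own verification of that arithmetic, not part of the paper.

```python
def memory_totals(wr=6, wc=3, b=556):
    """Recompute the totals column of Table 1: (banks, words) summed over
    the alpha, beta, lambda (input) and y-hat (output) memories."""
    schedules = {
        "Ref. [12]": [(0, 0), (wr, wr * b), (wr, wr * b), (wr, wr * b)],
        "Ref. [13]": [(wr * wc, wr * wc * b), (2 * wr * wc, 2 * wr * b),
                      (wr * wc, wr * b), (wr * wc, wr * b)],
        "Refs. [9]-[11]/Proposed": [(wr * wc, wr * wc * b), (wr * wc, wr * wc * b),
                                    (wr, wr * b), (wr, wr * b)],
    }
    return {name: tuple(map(sum, zip(*cols))) for name, cols in schedules.items()}

print(memory_totals())
# {'Ref. [12]': (18, 10008), 'Ref. [13]': (90, 23352),
#  'Refs. [9]-[11]/Proposed': (48, 26688)}
```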
The schedule of Ref. [13] partitions the memory into a large number of memory banks in order to compute the single message in the column operation and perform the row operation concurrently. The number of memory banks of this schedule is about twice that of the proposed schedule. A large number of memory banks increases the hardware overhead caused by the duplication of addressing and control logic and by the wires required to exchange the messages, which makes the layout of VLSI circuits difficult [14]. The proposed schedule is designed on the basis of the schedule of Refs. [9]–[11], which enables the decoder to perform the row and column operations concurrently using the dual memory architecture. The numbers of memory banks and words of the dual memory architecture under the proposed schedule are the same as those under the schedule of Refs. [9]–[11].

We evaluate the hardware overhead of the proposed schedule in the logic of the decoder compared to the decoder based on the schedule of Refs. [9]–[11]. We design a partially-parallel LDPC decoder for (3, 6)-regular LDPC codes according to the parity check matrix shown in Fig. 3, with k = 4 and sub-block size b = 556. This decoder therefore performs 4 × 3 = 12 row operations and 4 × 6 = 24 column operations in parallel, and decodes LDPC codewords of 6 × 556 = 3336 bits. All intermediate messages are quantized, since the messages are obtained from fixed-point computations in the row and column operations. We define the quantization to be 8 bits, divided into one sign bit, four integer bits, and three fractional bits.
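As a concrete reading of this format, the following sketch quantizes a message to one sign bit, four integer bits, and three fractional bits (step $2^{-3}$ = 0.125, saturating at ±15.875). The rounding mode and saturation behavior are our assumptions; the paper does not specify them.

```python
def quantize(x: float) -> float:
    """Quantize a message to 8 bits: 1 sign, 4 integer, 3 fractional bits
    (assumed round-to-nearest with saturation at +/-15.875)."""
    step = 2.0 ** -3
    q = round(x / step) * step
    return max(-15.875, min(15.875, q))

assert quantize(3.14159) == 3.125
assert quantize(100.0) == 15.875   # saturation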
Table 2 Comparison of FPGA synthesis and implementation results based on the schedule of Refs. [9]–[11] and the proposed schedule (k = 4) (Virtex-II xc2v2000-5bf957).

Module Name | Slice F/F | 4-input LUT | Total Slices | Block RAM | Delay [ns]
Input Module for λ | 0 (0%) | 0 (0%) | 0 (0%) | 6 (10%) | 3.192
Output & Parity Check Module | 233 (1%) | 688 (3%) | 651 (6%) | 6 (10%) | 5.620
Row Operation Module | 1676 (7%) | 3030 (14%) | 2063 (19%) | 18 (32%) | 5.646
Column Operation Module, Refs. [9]–[11] | 3134 (14%) | 3888 (18%) | 3046 (28%) | 18 (32%) | 5.434
Column Operation Module, Proposed Schedule | 3933 (18%) | 5132 (23%) | 3754 (34%) | 18 (32%) | 4.602
Controller, Refs. [9]–[11] | 249 (1%) | 155 (0%) | 178 (1%) | 0 (0%) | 5.996
Controller, Proposed Schedule | 253 (1%) | 160 (0%) | 181 (1%) | 0 (0%) | 6.006
LDPC Decoder (Place & Route), Refs. [9]–[11] | 4812 (22%) | 7487 (34%) | 7408 (68%) | 48 (85%) | 9.931
LDPC Decoder (Place & Route), Proposed Schedule | 5552 (25%) | 8680 (40%) | 7808 (72%) | 48 (85%) | 9.946
The proposed partially-parallel LDPC decoder is composed of an input module, an output and parity check module, row operation modules, column operation modules, and a controller. The computation of the tentative decision in Eq. (3) is almost the same as the column operation, so the column operation module also computes the tentative output values. For the comparison between the hardware implementations based on the schedule of Refs. [9]–[11] and on the proposed schedule, we apply the approximated minimum function to each row operation module; the serial BFU shown in Fig. 13 is applied to the column operation module for the schedule of Refs. [9]–[11], and the parallel pipelined BFU shown in Fig. 14 is applied to the column operation module for the proposed schedule. We implement the design on the Xilinx Virtex-II xc2v2000-5bf957 FPGA. Synthesis and implementation are carried out using ISE Ver. 6.2 provided by Xilinx. The synthesis and implementation results are shown in Table 2, where the values in parentheses represent the utilization of the FPGA resources. A single Slice of the FPGA has two 4-input LUTs and two Slice flip-flops. The implementation (i.e., place-and-route) results show that the number of occupied Slices for the partially-parallel LDPC decoder based on the proposed schedule increases by about 5% compared to that based on the schedule of Refs. [9]–[11]. Within the Slices, the number of occupied Slice flip-flops, which indicates the number of registers, increases by about 15%, and the number of occupied 4-input LUTs increases by about 16%. The implementation results thus show that the proposed schedule can be implemented with a small hardware overhead. The critical path delay of the decoder based on the proposed schedule is almost the same as that based on the schedule of Refs. [9]–[11], and the operation frequency of the partially-parallel LDPC decoder based on the proposed schedule reaches more than 100 MHz.

We evaluate the LDPC decoding performance of the partially-parallel LDPC decoder based on the schedules of Refs. [9]–[13] and on the proposed schedule. The decoding throughput is determined by the decoding delay time, which is the product of the single iterative decoding delay and the number of iterations. Consider a single iterative decoding for the parity check matrix shown in Fig. 3. When each of the $w_c$ row operation modules has k CFUs and each of the $w_r$ column operation modules
has k BFUs, the row operations delay in a single iterative decoding is $\frac{b \times w_c}{k \times w_c} \cdot d_r$, where $d_r$ denotes a single row operation delay, and the column operations delay in a single iterative decoding is $\frac{b \times w_r}{k \times w_r} \cdot d_c$, where $d_c$ denotes a single column operation delay. When the row and column operations are performed concurrently, the row operations delay is larger than the column operations delay (i.e., $\frac{b}{k} \cdot d_r > \frac{b}{k} \cdot d_c$), so the single iterative decoding delay is $\frac{b}{k} \cdot d_r$. The schedules of Refs. [9]–[11], [13] and the proposed schedule perform the row and column operations concurrently. The decoder based on the schedule of Ref. [12] does not perform the row and column operations concurrently, since that schedule is designed on a single memory architecture; however, the schedule of Ref. [12] approximates the column operation and reduces the column operation delay significantly, so its single iterative decoding delay is almost the same as when the row and column operations are performed concurrently. Therefore, the decoding throughput of each schedule is calculated from the following equation, where $b \times w_r$ is the codeword length and l denotes the number of iterations:

$$\text{Throughput} = \frac{b \times w_r \ [\text{bit}]}{l \times (b/k) \times d_r \ [\text{sec}]}. \tag{5}$$
When the row operation module shown in Fig. 11 is applied to each schedule, a single row operation takes 10 clock cycles. However, the row operation module with the pipelined architecture performs the row operations consecutively, and the following row operation starts at the 8th clock cycle of the previous row operation. Therefore, the single row operation delay is $d_r$ = 7 × 10 ns = 70 ns at an operating frequency of 100 MHz. Tables 3 and 4 show the comparison of the number of iterations, decoding throughput, and bit error performance for $l_{max}$ = 4, SNR = 5.0 and $l_{max}$ = 8, SNR = 4.5, respectively. The decoding throughput is obtained from Eq. (5) and the average number of iterations; the average number of iterations for decoding is simulated at the algorithm level as shown in Figs. 7 and 9. These results show that the proposed schedule achieves the best decoding throughput. In Table 4, the proposed decoder achieves about 52% higher decoding throughput than that based on the schedule of Refs. [9]–[11]. Compared with the schedules of Ref. [12] and Ref. [13], the proposed schedule achieves about 13% and 14% higher decoding throughput, respectively.
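Substituting the measured $d_r$ = 70 ns and the average iteration counts of Table 4 into Eq. (5) reproduces the throughput column; the snippet below is merely our arithmetic check.

```python
def throughput_bps(b=556, wr=6, k=4, dr=70e-9, iterations=3.376):
    """Decoding throughput from Eq. (5): codeword bits over decoding delay."""
    return (b * wr) / (iterations * (b / k) * dr)

# Average iteration counts from Table 4 (l_max = 8, SNR = 4.5):
for name, l in [("Refs. [9]-[11]", 5.154), ("Ref. [12]", 3.808),
                ("Ref. [13]", 3.847), ("Proposed", 3.376)]:
    print(f"{name}: {throughput_bps(iterations=l) / 1e6:.0f} Mbps")
# -> about 67, 90, 89, 102 Mbps, matching Table 4
```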
Table 3 Comparison of the decoding performance based on each schedule ($l_{max}$ = 4, SNR = 5.0).

Schedule | Iterations | Throughput | BER
Refs. [9]–[11] | 3.983 | 86 Mbps | 0.000220
Ref. [12] | 3.251 | 105 Mbps | 0.000005
Ref. [13] | 3.360 | 102 Mbps | 0.000004
Proposed Schedule | 3.063 | 112 Mbps | 0.000000

Table 4 Comparison of the decoding performance based on each schedule ($l_{max}$ = 8, SNR = 4.5).

Schedule | Iterations | Throughput | BER
Refs. [9]–[11] | 5.154 | 67 Mbps | 0.000002
Ref. [12] | 3.808 | 90 Mbps | 0.000013
Ref. [13] | 3.847 | 89 Mbps | 0.000001
Proposed Schedule | 3.376 | 102 Mbps | 0.000000
In addition, the bit error rate of the proposed schedule is zero in each measurement, because the proposed schedule accelerates the decoding convergence within the limited decoding delay time. From a hardware implementation point of view, the proposed schedule can be implemented with a 5% FPGA resource overhead and the same number of memory banks as the schedule of Refs. [9]–[11], which is about half the number of memory banks of the schedule of Ref. [13]. The numbers of memory banks and words of the proposed schedule are larger than those of the schedule of Ref. [12]. However, the proposed schedule reduces the single iterative decoding delay without approximating the column operation, and therefore achieves much better bit error performance than the schedule of Ref. [12].

5. Conclusion
In this paper, we have proposed a partially-parallel LDPC decoder which achieves a high-efficiency message-passing schedule. The proposed schedule accelerates the decoding convergence without extending the single iterative decoding delay, which enables the decoder not only to increase the decoding throughput but also to improve the bit error performance within a limited decoding delay time. Hardware implementation and simulation results show that the proposed decoder achieves up to about 54% higher decoding throughput and obtains up to about 1.5 dB better coding gain with a 5% FPGA resource overhead, compared to the decoder based on the schedule of Refs. [9]–[11]. Compared with the schedule of Ref. [12], the proposed schedule achieves up to about 16% higher decoding throughput and obtains up to about 0.73 dB better coding gain; moreover, the proposed schedule does not degrade the decoding performance even if the number of iterations for decoding is increased. Compared with the schedule of Ref. [13], the proposed schedule achieves up to about 19% higher decoding throughput and obtains up to about 0.27 dB better coding gain with about half the number of memory banks. These results show that the proposed schedule achieves high-efficiency message-passing and is suitable for practical hardware implementations.

Acknowledgements

This work was supported by a fund from MEXT via the Kitakyushu innovative cluster project.

References

[1] D.J.C. MacKay and R.M. Neal, “Near Shannon limit performance of low density parity check codes,” Electron. Lett., vol.32, no.18, pp.1645–1646, Aug. 1996.
[2] D.J.C. MacKay, “Good error-correcting codes based on very sparse matrices,” IEEE Trans. Inf. Theory, vol.45, no.2, pp.399–431, March 1999.
[3] S.Y. Chung, G.D. Forney, Jr., T.J. Richardson, and R.L. Urbanke, “On the design of low-density parity-check codes within 0.0045 dB of the Shannon limit,” IEEE Commun. Lett., vol.5, no.2, pp.58–60, Feb. 2001.
[4] T.J. Richardson and R.L. Urbanke, “The capacity of low-density parity-check codes under message-passing decoding,” IEEE Trans. Inf. Theory, vol.47, no.2, pp.599–618, Feb. 2001.
[5] M. Fossorier, M. Mihaljevic, and H. Imai, “Reduced complexity iterative decoding of low density parity check codes based on belief propagation,” IEEE Trans. Commun., vol.47, no.5, pp.673–680, May 1999.
[6] R.G. Gallager, Low-Density Parity-Check Codes, MIT Press, Cambridge, MA, 1963.
[7] A. Blanksby and C. Howland, “A 690-mW 1-Gb/s 1024-b, rate-1/2 low-density parity-check code decoder,” IEEE J. Solid-State Circuits, vol.37, no.3, pp.404–412, March 2002.
[8] E. Liao, E. Yeo, and B. Nikolic, “Low-density parity-check code constructions for hardware implementation,” Proc. 2004 IEEE International Conference on Communications, ICC’04, pp.2573–2577, Paris, June 2004.
[9] Y. Chen and D. Hocevar, “A FPGA and ASIC implementation of rate 1/2 8088-b irregular low density parity check decoder,” IEEE Global Telecommunications Conference, GLOBECOM’03, pp.113–117, 2003.
[10] M. Mansour and N. Shanbhag, “Low power VLSI decoder architectures for LDPC codes,” Proc. International Symposium on Low Power Electronics and Design, pp.284–289, 2002.
[11] M. Karkooti and J.R. Cavallaro, “Semi-parallel reconfigurable architectures for real-time LDPC decoding,” Proc. International Conference on Information Technology: Coding and Computing, ITCC’04, vol.1, pp.579–585, April 2004.
[12] E. Yeo, P. Pakzad, B. Nikolic, and V. Anantharam, “High throughput low-density parity-check decoder architectures,” GLOBECOM 2001, IEEE Global Telecommunications Conference, pp.3019–3024, Nov. 2001.
[13] E. Boutillon, J. Castura, and F.R. Kschischang, “Decoder-first code design,” Proc. 2nd International Symposium on Turbo Codes and Related Topics, pp.459–462, Sept. 2000.
[14] L. Benini, L. Macchiarulo, A. Macii, and M. Poncino, “Layout-driven memory synthesis for embedded systems-on-chip,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol.10, no.2, pp.96–105, April 2002.
[15] Xilinx, Inc., Virtex-II Platform FPGAs: Complete Data Sheet, DS031 (v3.3), pp.21–22, June 2004.
Kazunori Shimizu received the B.Eng. and M.Eng. degrees from Waseda University in 2002 and 2004 respectively, all in electronics, information and communication engineering. He is currently working towards the Dr. Eng. degree. His research interests are design and verification of VLSIs, especially reconfigurable hardware systems.
Tatsuyuki Ishikawa received the B.Eng. degree in Electronic Engineering from Gunma University in 1999. He joined Toshiba Microelectronics in 1999, where he has been undertaking the design and implementation of ASICs. He is currently working towards the M.Eng. degree at Waseda University. His research interests are the design and implementation of VLSIs.
Nozomu Togawa received the B.Eng., M.Eng., and Dr.Eng. degrees from Waseda University in 1992, 1994, and 1997, respectively, all in electrical engineering. He is presently an Associate Professor in the Department of Computer Science, Waseda University. His research interests are VLSI design, graph theory, and computational geometry. He is a member of IEEE and the Information Processing Society of Japan.
Takeshi Ikenaga received the B.E. and M.E. degrees in electrical engineering and the Ph.D. degree in information & computer science from Waseda University, Tokyo, Japan, in 1988, 1990, and 2002, respectively. He joined LSI Laboratories, Nippon Telegraph and Telephone Corporation (NTT) in 1990, where he undertook research on design and test methodologies for high-performance ASICs, a real-time MPEG2 encoder chip set, and a highly parallel LSI & system design for image-understanding processing. He is presently an associate professor in the system LSI field of the Graduate School of Information, Production and Systems, Waseda University. His current interests are application SoCs for image, security and network processing. Dr. Ikenaga is a member of the IPSJ and the IEEE. He received the IEICE Research Encouragement Award in 1992.
Satoshi Goto received the B.Eng. and M.Eng. degrees in Electronics and Communication Engineering from Waseda University in 1968 and 1970, respectively. He also received the Dr.Eng. degree from the same university in 1981. He joined the Central Research Laboratory of NEC in 1970, and became a professor at Waseda University in 2003. He is an IEEE Fellow and a member of the Academy Engineering Society of Japan. His research interests include LSI systems and multimedia systems.