The 17th Annual IEEE International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC’06)
A SYNTHESIZABLE IP CORE FOR WIMAX 802.16E LDPC CODE DECODING

Torben Brack, Matthias Alles, Frank Kienle, Norbert Wehn
Microelectronic Systems Design Research Group, University of Kaiserslautern
Erwin-Schroedinger-Strasse, 67663 Kaiserslautern, Germany
Email: {brack,alles,kienle,wehn}@eit.uni-kl.de

ABSTRACT

The upcoming IEEE WiMax 802.16e standard, also referred to as WirelessMAN [7], is the next step toward very high throughput wireless backbone architectures, supporting up to 70 Mbps. As an advanced channel coding scheme it features Low-Density Parity-Check (LDPC) codes. The decoding of LDPC codes is an iterative process, hence a large amount of data has to be exchanged between processing units within each iteration. The variety of the specified codes and the intended use of different decoding schedules for different codes pose significant challenges to an LDPC decoder hardware realization. In this paper we present, to the best of our knowledge, the first published LDPC decoder architecture capable of processing all specified WiMax LDPC codes. Detailed synthesis and communications performance results are also shown.

I. INTRODUCTION
The IEEE WiMax standard [6] covers a large range of wireless transmission applications. Compared to WiFi (or WirelessLAN), it can support high throughput over larger distances, even with higher mobility involved. The upcoming IEEE WiMax 802.16e standard, also referred to as WirelessMAN [7], is the next step toward very high throughput wireless backbone architectures, supporting up to 70 Mbps. The WiMax standard features LDPC codes as an optional channel coding scheme. LDPC codes were invented by Gallager in the early 1960s and rediscovered in the 1990s. They can best be described by a bipartite graph, the so-called Tanner graph. The decoding is an iterative process which exchanges information between two types of nodes.

Hardware realization of LDPC decoders is an active topic in the research community. The most efficient methodology for decoder realization is the joint design of the code and the hardware, which can be achieved by designing the LDPC code based on permuted identity matrices. By taking the resulting hardware architecture into consideration during the LDPC code design, very efficient high-throughput decoder architectures can be obtained. A further important factor for the implementation cost is code rate and codeword size flexibility: the more codes have to be supported by the decoder hardware, the higher the overhead in terms of silicon area. The WiMax LDPC code was designed with respect to hardware constraints; however, the high number of defined codes (code rates and codeword sizes) specified by different members of the WiMax consortium imposes significant challenges on an LDPC decoder realization.

This paper presents, to the best of our knowledge, the first LDPC decoder realization for WiMax 802.16e. Synthesis results are presented for an LDPC decoder IP core which is capable of processing all specified WiMax LDPC codes.

Figure 1: Tanner graph for an irregular LDPC code

II. THE WIMAX 802.16E LDPC CODE
LDPC codes are linear block codes defined by a sparse binary matrix H, called the parity check matrix. The set of valid codewords C satisfies

$$H x^T = 0, \quad \forall x \in C. \tag{1}$$
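As a small illustration of Equation (1), the following sketch evaluates the parity checks over GF(2) for a toy matrix. The matrix `H_TOY` and the test vectors are hypothetical and are not taken from the WiMax standard; they only show how a syndrome check works.

```python
def syndrome(H, x):
    """Return H * x^T over GF(2): one parity-check result per row of H."""
    return [sum(h * b for h, b in zip(row, x)) % 2 for row in H]


def is_codeword(H, x):
    """A vector is a valid codeword if every parity check evaluates to 0."""
    return not any(syndrome(H, x))


# Hypothetical toy parity check matrix (M = 3 checks, N = 6 bits).
H_TOY = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
]

valid = [1, 0, 0, 1, 0, 1]      # satisfies all three checks
corrupted = [0, 0, 0, 1, 0, 1]  # first bit flipped

print(is_codeword(H_TOY, valid))     # True
print(syndrome(H_TOY, corrupted))    # [1, 0, 1]
```

The flipped bit participates in the first and third check, so exactly these two syndrome entries become nonzero.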
A column in H is associated with a codeword bit, and each row corresponds to a parity check. A nonzero element in a row means that the corresponding bit contributes to this parity check. The complete code can best be described by a Tanner graph [12], a graphical representation of the associations between code bits and parity checks. Code bits are shown as so-called variable nodes (VN) drawn as circles, parity checks as check nodes (CN) represented by squares, with edges connecting them according to the parity check matrix. Figure 1 shows a Tanner graph for a generic irregular LDPC code with N variable and M check nodes, resulting in a code rate of R = (N − M)/N.

The number of edges on each node is called the node degree. If the node degree is identical for all nodes, the corresponding LDPC code is called regular, otherwise it is called irregular. Note that the communications performance of an irregular LDPC code is known to be generally superior to that of regular LDPC codes. The degree distribution of the VNs $f_{[d_v^{max},\ldots,3,2]}$ gives the fraction of VNs with a certain degree, with $d_v^{max}$ the maximum variable node degree. The degree distribution of the CNs can be expressed as $g_{[d_c^{max},\, d_c^{max}-1]}$ with $d_c^{max}$ the maximum CN degree, meaning that only CNs with two different degrees occur [13].

1-4244-0330-8/06/$20.00 © 2006 IEEE

The WiMax 802.16e LDPC code [7] currently consists of six different code classes spanning four different code rates. All six code classes have the same general matrix structure that allows for a linear encoding scheme which simplifies the decoding process significantly. It consists of 24 columns and (1 − R) · 24 rows, with each entry describing a z-by-z sub-matrix which is either a permuted identity matrix or a zero matrix. The first R · 24 columns correspond to the systematic information, the remaining (1 − R) · 24 columns to the parity information; the latter have a fixed structure required by the encoder design. The sub-matrix size z-by-z is variable and ranges from 24x24 to 96x96 with a granularity of 4, therefore supporting 19 codeword sizes. The codeword length is given by N = 24 · z and ranges from N = 576 to N = 2304 bit with a granularity of 96 bit. Figure 2 shows the parity check matrix of the rate 1/2 code for z = 96 and thus a codeword length of N = 2304 bit.

Table 1: Degree distributions of the WiMax 802.16e LDPC code

Code    f                                      g
1/2     [2, 3, 6] = {11/24, 1/3, 5/24}         [6, 7] = {2/3, 1/3}
2/3 A   [2, 3, 6] = {7/24, 1/2, 5/24}          [10] = {1}
2/3 B   [2, 3, 4] = {7/24, 1/24, 2/3}          [10, 11] = {7/8, 1/8}
3/4 A   [2, 3, 4] = {5/24, 1/24, 3/4}          [14, 15] = {5/6, 1/6}
3/4 B   [2, 3, 6] = {5/24, 1/2, 7/24}          [14, 15] = {1/3, 2/3}
5/6     [2, 3, 4] = {3/24, 5/12, 11/24}        [20] = {1}

Figure 2: Structure of the parity check matrix for a rate 1/2 WiMax 802.16e LDPC code (z = 96).

Table 1 summarizes the six code classes with their degree distributions for variable and check nodes. The rate 1/2 code is suitable for layered decoding if the rows are processed in a distinct order. This layered decoding approach, which requires a special message scheduling, is explained in Section IV. Note that layered and non-layered decoding differ considerably in the architectural resources they utilize. There are two code classes of rate 2/3: code A is highly irregular, code B is semi-regular and allows for layered decoding. The two rate 3/4 code classes differ mainly in the maximum variable node degree to be supported. Rate 5/6 is provided by one code class.

It is obvious that enormous flexibility is necessary to fully support the WiMax 802.16e LDPC code with only one unified architecture:
• 6 different code classes
• Different VN and CN degree distributions with $d_v^{max} = 6$ and $d_c^{max} = 20$
• Different sub-matrix sizes from 24x24 to 96x96
• Different codeword sizes from 576 to 2304 bit
• Layered and non-layered decoding
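The structural parameters above can be checked with a few lines of code. The sketch below enumerates the codeword sizes N = 24 · z, expands a base matrix into a binary parity check matrix, and derives the number of Tanner graph edges from a VN degree distribution of Table 1. Two points are assumptions made for illustration only: the encoding of a base matrix entry as a cyclic shift value (with -1 for a zero sub-matrix), and the toy base matrix `BASE_TOY`, which is not one of the WiMax matrices.

```python
from fractions import Fraction

N_B = 24  # number of block columns of the base matrix (from the text)


def codeword_sizes():
    """Codeword lengths N = 24 * z for z = 24, 28, ..., 96."""
    return [N_B * z for z in range(24, 97, 4)]


def expand(base, z):
    """Expand a base matrix into a binary parity check matrix.

    Assumed encoding: -1 is the z-by-z zero matrix, s >= 0 is the
    z-by-z identity matrix cyclically shifted by s positions.
    """
    H = []
    for base_row in base:
        for r in range(z):
            row = []
            for s in base_row:
                block = [0] * z
                if s >= 0:
                    block[(r + s) % z] = 1
                row.extend(block)
            H.append(row)
    return H


def edges(vn_degrees, vn_fractions, z):
    """Edges of the Tanner graph implied by a VN degree distribution."""
    per_base = sum(d * f * N_B for d, f in zip(vn_degrees, vn_fractions))
    return int(per_base * z)


sizes = codeword_sizes()
print(len(sizes), sizes[0], sizes[-1])   # 19 576 2304

# Rate 1/2 row of Table 1: f[2, 3, 6] = {11/24, 1/3, 5/24}
f_half = [Fraction(11, 24), Fraction(1, 3), Fraction(5, 24)]
print(edges([2, 3, 6], f_half, 24), edges([2, 3, 6], f_half, 96))  # 1824 7296

# Hypothetical 2 x 3 base matrix, expanded with z = 4.
BASE_TOY = [[0, 1, -1], [-1, 2, 0]]
H_toy = expand(BASE_TOY, 4)
```

For the rate 1/2 code the edge counts obtained this way agree with the range of 1824 to 7296 edges listed in Table 2.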
III. DECODING ALGORITHM
LDPC codes can be decoded using the message passing algorithm [2]. It iteratively exchanges soft information between variable and check nodes, which is called belief propagation (BP). Updating the nodes can be done with a canonical two-phase scheduling: in the first phase all variable nodes are updated, in the second phase all check nodes. This scheduling is denoted as two-phase BP in the following. The processing of individual nodes within one phase is independent and can thus be parallelized. The exchanged messages are assumed to be log-likelihood ratios (LLR). Each variable node of degree $d_v$ calculates an update of message $k$ according to

$$\lambda_k = \lambda_{ch} + \sum_{l=0}^{d_v-1} \lambda_l - \lambda_k^{old}, \tag{2}$$
with $\lambda_{ch}$ the corresponding channel LLR of the VN and $\lambda_l$ the LLRs of the incident edges. Subtracting the node's own old extrinsic information $\lambda_k^{old}$ is the basic principle of iterative decoding. The check node LLR update can be done in either an optimal or a suboptimal way, trading off implementation complexity against communications performance. The simplest suboptimal check node algorithm is the well-known Min-Sum algorithm [3], where the incident message with the smallest magnitude mainly determines the output of all other messages:

$$\lambda_k = \prod_{l=0,\, l \neq k}^{d_c-1} \operatorname{sign}(\lambda_l) \cdot \min_{\forall l,\, l \neq k} |\lambda_l| \tag{3}$$
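Equations (2) and (3) translate directly into code. The following floating-point sketch is only meant to make the two update rules concrete; the function names are hypothetical, quantization is ignored, and a zero LLR is treated as positive, which is an assumption.

```python
def vn_update(lam_ch, incoming):
    """Equation (2): channel LLR plus all incoming LLRs, minus the old
    incoming LLR of the edge that is being updated."""
    total = lam_ch + sum(incoming)
    return [total - lam_old for lam_old in incoming]


def cn_update_min_sum(incoming):
    """Equation (3): product of signs and minimum magnitude, both taken
    over all incident edges except the one being updated."""
    outputs = []
    for k in range(len(incoming)):
        others = incoming[:k] + incoming[k + 1:]
        sign = 1.0
        for lam in others:
            if lam < 0:  # assumption: an LLR of exactly 0 counts as positive
                sign = -sign
        outputs.append(sign * min(abs(lam) for lam in others))
    return outputs


print(vn_update(0.5, [1.0, -2.0, 3.0]))     # [1.5, 4.5, -0.5]
print(cn_update_min_sum([1.0, -2.0, 3.0]))  # [-2.0, 1.0, -1.0]
```

In the check node example the smallest magnitude (1.0) determines two of the three outputs, and only the edge that carries this minimum receives the second smallest magnitude.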
The resulting performance comes close to the optimal Sum-Product algorithm only for high rate LDPC codes (R ≥ 3/4) with relatively large CN degree. For lower code rates the communications performance strongly degrades. Thus a more sophisticated suboptimal algorithm has to be used for low rates. The λ-3-Min algorithm [3] uses the three smallest absolute input values and applies a correction term δ to counter the introduced approximation:

$$\delta(x, z) = \ln\left(1 + e^{-|\lambda_x + \lambda_z|}\right) - \ln\left(1 + e^{-|\lambda_x - \lambda_z|}\right) \tag{4}$$
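The effect of the correction term in Equation (4) can be verified numerically for a pair of inputs. The sketch below assumes that δ is added to the Min-Sum result of the two inputs, which is the well-known LLR identity for a degree-3 check; how the hardware combines the three smallest inputs is not reproduced here, and all function names are hypothetical.

```python
import math


def delta(lam_x, lam_z):
    """Correction term of Equation (4)."""
    return (math.log1p(math.exp(-abs(lam_x + lam_z)))
            - math.log1p(math.exp(-abs(lam_x - lam_z))))


def min_sum_pair(lam_x, lam_z):
    """Min-Sum combination of two LLRs (sign product, minimum magnitude)."""
    sign = -1.0 if (lam_x < 0) != (lam_z < 0) else 1.0
    return sign * min(abs(lam_x), abs(lam_z))


def corrected_pair(lam_x, lam_z):
    """Min-Sum result plus correction term (assumed combination)."""
    return min_sum_pair(lam_x, lam_z) + delta(lam_x, lam_z)


def exact_pair(lam_x, lam_z):
    """Sum-Product (tanh rule) for two inputs, used as reference."""
    return 2.0 * math.atanh(math.tanh(lam_x / 2.0) * math.tanh(lam_z / 2.0))


x, z = 1.5, -0.7
print(min_sum_pair(x, z), corrected_pair(x, z), exact_pair(x, z))
```

For two inputs the corrected value coincides with the Sum-Product result up to rounding, whereas plain Min-Sum overestimates the magnitude.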
While increasing the implementation complexity, this decoding scheme comes close to the performance of the optimal algorithm for any given code rate and is therefore the best solution if a wide range of codes has to be supported.

IV. LAYERED DECODING
Layered decoding applies a different message schedule than two-phase decoding. It was originally proposed in [11], where it was denoted as turbo decoding message passing (TDMP), and was later referred to as layered decoding by Hocevar [5]. The basic idea is to process a subset of CNs and to pass the newly calculated messages immediately to the corresponding VNs. The VNs update their outgoing messages in the same iteration. The next CN subset will thus receive newly updated messages, which improves the convergence speed.

The major problem when using layered decoding within an LDPC decoder realization is the availability of the current messages. The hardware realization of the WiMax decoder is partly parallel, i.e. only a subset of the nodes is instantiated as functional units (FU). To provide flexibility, these functional units for variable node and check node processing (VFUs and CFUs, respectively) are realized sequentially. That means they can receive one message per clock cycle and can produce one message per clock cycle. Thus, to process a CN of degree $d_c = 6$, six cycles are required to pass all messages to the corresponding CFU. In each successive clock cycle one output message is calculated. Thus there is a latency of $d_c$ cycles before an input message is updated. Therefore we have to guarantee that for at least $d_c$ cycles no other CFU processes a CN which would require the messages that are currently being updated. This latency constraint of layered decoding can be an obstacle for a hardware realization if the LDPC code is not designed with respect to this constraint¹. The layered hardware constraint is respected by two WiMax 802.16e code classes, the R = 1/2 and the R = 2/3 B code. The layered decoding capability of these two codes is even explicitly emphasized in the WiMax standard.

V. LDPC DECODER ARCHITECTURE
A partly parallel architecture becomes mandatory to ensure flexibility of the resulting LDPC decoder. For the WiMax LDPC decoder a parallelization of P = 96 is utilized, which means that 96 edges are processed per clock cycle. This size P is determined by the maximum sub-matrix size z = 96. To ensure code rate and codeword size flexibility the functional nodes are realized in a serial manner. Thus each functional unit can accept one message per clock cycle, as already mentioned. The mapping of the VNs and CNs to the functional units is determined by the given code structure. An efficient mapping of the nodes to the functional units can always be guaranteed by designing LDPC codes using permuted identity matrices: the z VNs and z CNs whose connections are described by one z-by-z sub-matrix are always allocated to z distinct VFUs and z distinct CFUs, respectively. This was shown by many LDPC decoder realizations [1][4][10][9] as well as in the case of DVB-S2 LDPC decoding [8]. A permutation network has to realize the permutation of the identity matrix, which results in a simple barrel shifter. The required flexibility of z as introduced in Section II is supported by a highly flexible barrel shifting network. To reach reasonable communications performance and minimize the area consumption, the check nodes are implemented with a hardware-optimized version of the λ-3-Min algorithm (Section III).

¹ The layered latency constraint can be relaxed by introducing stall states, which however would reduce the throughput.

Figure 3: Unified decoder architecture (two-phase data path)

A. Data Path Mapping
The unified decoder architecture supports two rather different processing data paths, one for two-phase BP decoding (Section III) and one for layered BP decoding (Section IV). The major differences between the two data paths are the variable node block (VNB) and the check node block (CNB), which are responsible for the node processing and the message handling. For two-phase BP each VNB contains two sum RAMs. One sum RAM stores the VN sum which is currently being processed, the other stores the old sum of the previous iteration. Sum RAM 1 and sum RAM 2 exchange their roles for the next iteration. A related principle of VN processing was presented in [1]. However, the subtraction of Equation 2, i.e. ignoring the node's own information, is realized in the CNB. The CNB for two-phase processing is composed of a functional unit (CFU) and a message RAM. To ensure the correct CFU input information we have to subtract the corresponding edge message, which is done immediately before the CFU. The newly calculated extrinsic messages are stored in the message RAMs at the same storage location. For layered processing we can simplify the VN block, see Figure 4. This is achieved by storing the input information of the CFU in a FIFO and adding the new extrinsic information to the
corresponding input message bypassed by the FIFO. Thus the correct a posteriori information is passed back to the channel memory, which always holds the updated VN information. This basic concept was first presented in [11]. The disadvantage of passing already updated VN information to the channel RAMs is the increased bit width. The input channel values are represented by 6 bit, whereas an a posteriori message is represented by 9 bit, as is the information stored in the sum RAMs. These 9-bit messages have to be passed through the barrel shifter as well. The unified architecture combines the two presented data paths of Figure 3 and Figure 4 into one architecture which shares as many components as possible.

Figure 4: Unified decoder architecture (layered data path)

Figure 5: Architecture of the flexible Barrel Shifter

Table 2: Parameters, Throughput, and Latency of the WiMax 802.16e LDPC Code Decoder

Code    Lay.   Edges       Iter.   Throughput      Lat.
1/2     Yes    1824-7296   15      83-333 Mbps     6.9 µs
2/3 A   No     1920-7680   15      89-358 Mbps     6.4 µs
2/3 B   Yes    1944-7776   10      155-619 Mbps    3.7 µs
3/4 A   No     2040-8160   10      137-548 Mbps    4.2 µs
3/4 B   No     2112-8448   10      133-532 Mbps    4.3 µs
5/6     No     1920-7680   10      152-610 Mbps    3.7 µs

Figure 6: Communications Performance of the WiMax 802.16e LDPC Decoder (FER over Eb/N0 [dB] for the WiMax 802.16e LDPC codes; curves: 1/2 (15it, lay.), 2/3B (10it, lay.), 2/3A (15it), 3/4B (10it), 3/4A (10it), 5/6 (10it))
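The layered schedule of Section IV and the a posteriori update of the layered data path can be summarized in a short software model. This is a behavioural sketch under several assumptions, not the hardware implementation: it uses floating-point LLRs, a plain Min-Sum check node instead of the λ-3-Min algorithm, positive LLRs for bit 0, and a hypothetical toy code in which every check node forms its own layer. The latency constraint of the serial CFUs is not modelled.

```python
def cn_update_min_sum(inputs):
    """Min-Sum check node update (stand-in for the lambda-3-Min CFU)."""
    outputs = []
    for k in range(len(inputs)):
        others = inputs[:k] + inputs[k + 1:]
        sign = 1.0
        for lam in others:
            if lam < 0:
                sign = -sign
        outputs.append(sign * min(abs(lam) for lam in others))
    return outputs


def layered_iteration(layers, app, msg):
    """Run one layered iteration and return the updated a posteriori LLRs.

    layers : list of layers, each a list of check nodes, each check node
             given as the list of its variable node indices
    app    : a posteriori LLR per variable node (the channel memory)
    msg    : dict mapping (check id, variable index) to the last
             extrinsic message of that edge
    """
    cn_id = 0
    for layer in layers:
        for vns in layer:
            # Remove the old extrinsic message of each edge: CFU input.
            inputs = [app[v] - msg.get((cn_id, v), 0.0) for v in vns]
            outputs = cn_update_min_sum(inputs)
            for v, lam_in, lam_out in zip(vns, inputs, outputs):
                # Add the new extrinsic message to the buffered input
                # (the FIFO bypass in the layered data path).
                app[v] = lam_in + lam_out
                msg[(cn_id, v)] = lam_out
            cn_id += 1
    return app


# Hypothetical toy code: three checks on six bits, one check per layer.
LAYERS = [[[0, 1, 3]], [[1, 2, 4]], [[0, 2, 5]]]
channel = [-0.5, 2.0, 2.0, 2.0, 2.0, 2.0]  # all-zero codeword, bit 0 unreliable
app = layered_iteration(LAYERS, list(channel), {})
hard = [0 if lam >= 0 else 1 for lam in app]
print(app, hard)
```

Because the second and third layer already see the values updated by the first layer, the single wrong channel value is corrected within one iteration in this example.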
B. Network
Due to the high flexibility required by the variable block sizes, namely 576 to 2304 bit in steps of 96 bit, the degree of parallelism in our architecture has to be adapted between 24 and 96 in steps of 4. Thus the network has to mirror this kind of flexibility, too. We used a logarithmic barrel shifter composed of modular cells which provide wrap-arounds on 19 positions, one for each supported block size. Figure 5 shows the architecture of the network.
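The behaviour of such a network can be modelled in software as a cyclic rotation built from power-of-two stages with a configurable wrap-around position. The sketch below is a functional model only; the rotation direction and the function name are assumptions, and the cell structure of Figure 5 is not reproduced.

```python
def barrel_shift(values, shift, z):
    """Rotate the first z entries cyclically by `shift` positions using
    logarithmic stages (rotations by 1, 2, 4, ...), wrapping around at z."""
    if not (24 <= z <= 96 and z % 4 == 0):
        raise ValueError("z must be one of the 19 supported block sizes")
    data = list(values[:z])
    step = 1
    while step < z:
        if shift & step:
            # One stage: every output selects the input `step` positions on,
            # with wrap-around at the current block size z.
            data = [data[(i + step) % z] for i in range(z)]
        step *= 2
    return data


block_sizes = list(range(24, 97, 4))
print(len(block_sizes))                         # 19
print(barrel_shift(list(range(96)), 5, 24)[:4])  # [5, 6, 7, 8]
```

For z up to 96 seven stages (1 to 64) are sufficient, and only the wrap-around position changes with the block size.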
VI. IMPLEMENTATION
Table 2 summarizes important parameters of the WiMax 802.16e LDPC code with respect to our decoder implementation: the ability for layered decoding, the number of edges processed in each iteration, the maximum number of iterations, and the resulting throughput and latency. The stated ranges correspond to the variable codeword sizes from 576 to 2304 bit. The higher number of iterations used to process the rate 1/2 code is necessary to achieve reasonable communications performance; for the 2/3 A code we have to compensate for the unavailable layered decoding option in comparison to the 2/3 B code. The throughput numbers are calculated based on a clock frequency of f = 333 MHz as shown in Equation 5 for two-phase and in Equation 6 for layered processing:

$$T_{2p} = \frac{R \cdot N}{\left(5 + d_c^{max} + \mathit{Edges}/P\right) \cdot \mathit{Iter}} \cdot f \tag{5}$$

$$T_{lay} = \frac{R \cdot N}{5 + d_c^{max} + \mathit{Edges} \cdot \mathit{Iter}/P} \cdot f \tag{6}$$
The latency can be roughly calculated by simply dividing the codeword size N by the associated throughput number. Note that it is always possible to decrease the throughput by decreasing the clock frequency which will in turn lead to an increased latency.
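Equations 5 and 6 and the latency estimate can be evaluated directly. The sketch below uses the values of Table 1 and Table 2; it assumes that the parallelism P equals the sub-matrix size z of the respective codeword size, and the function names are hypothetical.

```python
def throughput_two_phase(R, N, d_c_max, edges, iterations, P, f_clk):
    """Equation 5: information bits per second for two-phase BP."""
    cycles = (5 + d_c_max + edges / P) * iterations
    return R * N / cycles * f_clk


def throughput_layered(R, N, d_c_max, edges, iterations, P, f_clk):
    """Equation 6: information bits per second for layered BP."""
    cycles = 5 + d_c_max + edges * iterations / P
    return R * N / cycles * f_clk


def latency(N, throughput):
    """Rough latency: codeword size divided by the throughput."""
    return N / throughput


F_CLK = 333e6  # clock frequency in Hz

# Rate 1/2, N = 2304 (z = P = 96), layered, 15 iterations, d_c_max = 7
t_half = throughput_layered(0.5, 2304, 7, 7296, 15, 96, F_CLK)
# Rate 2/3 A, N = 2304, two-phase, 15 iterations, d_c_max = 10
t_23a = throughput_two_phase(2 / 3, 2304, 10, 7680, 15, 96, F_CLK)

print(t_half / 1e6, latency(2304, t_half) * 1e6)
print(t_23a / 1e6, latency(2304, t_23a) * 1e6)
```

For the rate 1/2 code the layered schedule needs 1152 cycles for 1152 information bits, i.e. one information bit per clock cycle, which corresponds to the 333 Mbps and roughly 6.9 µs listed in Table 2.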
Table 3: Synthesis Results for the WiMax 802.16e LDPC Decoder

LDPC Code           WiMax 802.16e
Codeword Size       576-2304 bit
Code Rate           1/2, 2/3, 3/4, 5/6
Parallelism         24-96
Quantization        6 bit
Algorithm           λ-3-Min
Iterations          10, 15
Comm. Perform.      Figure 6

Area [mm²]          0.13 µm @ 333 MHz
VN Block Logic      0.317
CN Block Logic      2.115
Controller Logic    0.021
Network             0.511
Channel RAMs        0.11
Message RAMs        0.36
Sum RAMs            0.34
Code Vectors        0.06
Overall Area        3.834

Throughput          Table 2
Latency             Table 2

A. Communications Performance
Figure 6 shows the communications performance over an AWGN channel as frame error rates (FER) for all proposed WiMax 802.16e LDPC code classes assuming z = 96. The results were obtained using the λ-3-Min algorithm (Section III.) and a 6-bit message quantization. The codeword size was set to the maximum of N = 2304 bit for all simulations. Layered decoding was used if permitted by the code design (see Table 2). Increasing the number of iterations will decrease throughput and increase latency.
B. Synthesis Results
Table 3 contains the synthesis results of our LDPC decoder architecture using an STM 0.13 µm technology and a clock frequency of 333 MHz. The LDPC decoder is capable of processing all LDPC codes specified by the WiMax 802.16e standard. The two different decoding schedules can be executed as described above: two-phase BP, which can be applied to all codes, and layered BP for the 1/2 and 2/3 B codes. The overall area is about 3.83 mm², with ≈ 77% utilized by the logic part. This high portion of the area cost results mostly from the required flexibility. All RAMs are single-port memories, which is made possible by an elaborate RAM fragmentation. Note that the overall area can be reduced further if not all code classes have to be supported. For example, when supporting only the two code classes with layered decoding capabilities, the sum RAMs, the VNB logic part and one permutation network can be omitted, resulting in an overall area of less than 3 mm².
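The area figures quoted above can be reproduced from the entries of Table 3. In the sketch below the network is counted as logic, which is what yields the stated ≈ 77%. The estimate for a layered-only decoder additionally assumes that the network entry covers two permutation networks of equal size; this split is not given in the table and is purely an assumption.

```python
# Area in mm^2, taken from Table 3
AREA = {
    "VN block logic": 0.317,
    "CN block logic": 2.115,
    "Controller logic": 0.021,
    "Network": 0.511,
    "Channel RAMs": 0.11,
    "Message RAMs": 0.36,
    "Sum RAMs": 0.34,
    "Code vectors": 0.06,
}
LOGIC_PARTS = ("VN block logic", "CN block logic", "Controller logic", "Network")

total = sum(AREA.values())
logic = sum(AREA[name] for name in LOGIC_PARTS)
logic_share = logic / total

# Layered-only estimate: drop sum RAMs, VN block logic and one network.
# Assumption: "Network" consists of two equally sized permutation networks.
layered_only = total - AREA["Sum RAMs"] - AREA["VN block logic"] - AREA["Network"] / 2

print(round(total, 3), round(logic_share, 3), round(layered_only, 3))
```

Under this assumption the layered-only estimate stays below 3 mm², in line with the statement in the text.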
VII. CONCLUSIONS
In this paper we have presented, to the best of our knowledge, the first published IP core for WiMax 802.16e LDPC code decoding. It supports all codes currently under standardization and all specified block lengths up to 2304 bit. Furthermore, the decoder supports different decoding schedules, namely two-phase and layered BP. The overall area of the decoder is about 3.83 mm², which is mainly determined by the cost of flexibility.

REFERENCES

[1] E. Boutillon, J. Castura, and F.R. Kschischang. Decoder-first code design. In Proc. 2nd International Symposium on Turbo Codes & Related Topics, pages 459–462, Brest, France, September 2000.
[2] R.G. Gallager. Low-Density Parity-Check Codes. M.I.T. Press, Cambridge, Massachusetts, 1963.
[3] F. Guilloud, E. Boutillon, and J.L. Danger. λ-Min Decoding Algorithm of Regular and Irregular LDPC Codes. In Proc. 3rd International Symposium on Turbo Codes & Related Topics, pages 451–454, Brest, France, September 2003.
[4] D.E. Hocevar. LDPC Code Construction with Flexible Hardware Implementation. In Proc. 2003 International Conference on Communications (ICC '03), pages 2708–2712, May 2003.
[5] D.E. Hocevar. A reduced complexity decoder architecture via layered decoding of LDPC codes. In Proc. IEEE Workshop on Signal Processing Systems (SiPS '04), pages 107–112, Austin, USA, October 2004.
[6] IEEE 802.16. Worldwide Interoperability for Microwave Access. www.wimaxforum.org.
[7] IEEE 802.16e. Air interface for fixed and mobile broadband wireless access systems. IEEE P802.16e/D12 Draft, October 2005.
[8] F. Kienle, T. Brack, and N. Wehn. A Synthesizable IP Core for DVB-S2 LDPC Code Decoding. In Proc. 2005 Design, Automation and Test in Europe (DATE '05), Munich, Germany, March 2005.
[9] F. Kienle and N. Wehn. Design Methodology for IRA Codes. In Proc. 2004 Asia South Pacific Design Automation Conference (ASP-DAC '04), pages 459–462, Yokohama, Japan, January 2004.
[10] M. Mansour and N. Shanbhag. Architecture-Aware Low-Density Parity-Check Codes. In Proc. 2003 IEEE International Symposium on Circuits and Systems (ISCAS '03), Bangkok, Thailand, May 2003.
[11] M.M. Mansour and N.R. Shanbhag. High-Throughput LDPC Decoders. IEEE Transactions on Very Large Scale Integration Systems, 11(6):976–996, December 2003.
[12] T. Richardson and R. Urbanke. The Renaissance of Gallager's Low-Density Parity-Check Codes. IEEE Communications Magazine, 41:126–131, August 2003.
[13] T.J. Richardson, M.A. Shokrollahi, and R.L. Urbanke. Design of Capacity-Approaching Irregular Low-Density Parity-Check Codes. IEEE Transactions on Information Theory, 47(2):619–637, February 2001.