Hardware Implementation of a Soft Cancellation Decoder for Polar Codes

Guillaume Berhault, Camille Leroux, Christophe Jego and Dominique Dallet
IMS lab., University of Bordeaux, Bordeaux INP, CNRS UMR 5218, 351 Cours de la Libération, 33400 Talence, France
Email: [email protected]

Abstract—Polar Codes can provably achieve the capacity of discrete memoryless channels. In order to make them practical, efficient hardware decoder architectures must be proposed. In this paper, the first hardware decoder architecture implementing the Soft-output CANcellation (SCAN) decoding algorithm is presented. This decoder was implemented on Field Programmable Gate Array (FPGA) devices. The proposed architecture is parametrizable for any number of iterations without adding hardware complexity. The SCAN decoder architecture is compared to another soft-output decoder that implements the Belief Propagation (BP) algorithm. The SCAN decoder reaches a higher throughput than the BP decoder, with a lower memory footprint. Moreover, a single iteration of the SCAN algorithm leads to better decoding performance than 50 iterations of the BP algorithm.
I. INTRODUCTION
Invented by Arıkan [1], Polar Codes provably achieve the capacity of various communication channels for infinite codelengths. Despite this asymptotically optimal behavior, the original decoding algorithm, denoted Successive Cancellation (SC) decoding, shows mediocre performance at finite lengths. In order to make Polar Codes a practical approach for future telecommunication standards, a double challenge arises. The first is to improve the decoding performance of the SC algorithm. The second is to propose reduced-complexity decoding strategies so that efficient hardware architectures can be defined. Earlier works focused on proposing efficient hardware architectures for SC decoding [2], [3], [4], [5], [6], and on simplifying the SC decoding process [7], [8], so that even more efficient architectures can be defined. Several research works also focused on improving the SC decoding algorithm so that good performance can be achieved even at small to medium codelengths. As such, the list decoding algorithm was proposed in [9]. It consists in building a pruned list of candidate codewords and selecting the most probable one at the end of the decoding process. Several improvements of this algorithm were then investigated in [10], [11], [12], and some hardware architectures were also proposed [13], [14], [15]. Other research works explored the software implementation of polar decoders on various processor targets (x86, ARM, GPU) [16], [17] and showed that very fast and energy-efficient SC decoders can be implemented on current processors. One of the numerous open questions about polar codes is how one can generate a soft-decision output. Indeed, in most last-generation communication standards, the digital receiver consists of several blocks (equalizer, detector, demodulator, error correction decoder...) that exchange probabilistic information
with each other. In such a communication system, each block of the receiver should be able to generate a belief on its decision, which is called a soft decision. To date, in the case of polar codes, two decoding algorithms can generate a soft-decision output for each decoded bit. The first one, called BP [18], consists in iteratively propagating beliefs (in the form of likelihood ratios) on the factor graph of the polar code. This algorithm slightly improves the decoding performance compared to SC decoding. However, it requires about 50 decoding iterations in order to achieve decoding performance close to SC decoding. Another similar approach was proposed in [19], where probabilistic values are propagated on the graph in a way similar to the SC decoding algorithm. This SCAN decoding algorithm reaches better decoding performance than the BP with only a few iterations. In terms of hardware implementation, the BP was implemented on an FPGA device [20]. In this paper, we propose to investigate the hardware implementation of the SCAN decoding algorithm. To the best of our knowledge, this is the first architecture that implements this decoding algorithm. In section II, some definitions about polar coding and decoding are provided. In section III, a hardware architecture implementing the SCAN decoding algorithm is detailed. Section V compares our FPGA implementation results (hardware complexity, throughput and decoding performance) with the ones obtained for the BP decoder proposed in [20] and presented in section IV. Performance and throughput comparisons are presented in section VI. Conclusions are drawn in section VII.

II. POLAR CODES
A. Definition and Code Construction

Polar codes [1] are linear block codes of size N = 2^n, n being a positive integer. Each bit sees a more or less reliable equivalent channel. As a consequence, in order to send K information bits (K ≤ N), the K most reliable equivalent channels are selected, as explained in [1], [21]. In this work, Polar Codes are generated thanks to the method described in [22]. A Polar Code of size N has a code rate R = K/N, with K information bits and N − K frozen bits set to 0. In order to encode the message to be transmitted, the generator matrix of the code is used. It is a submatrix of the nth Kronecker power of

κ = [ 1 0 ; 1 1 ],

denoted

κ^{⊗n} = [ κ^{⊗(n−1)}  0_{N/2} ; κ^{⊗(n−1)}  κ^{⊗(n−1)} ]

where 0_{N/2} is the all-zero matrix of size N/2.
The encoding process consists in multiplying the generator matrix with the message to be transmitted. The result of the multiplication is called the codeword; it is transmitted over the channel. A codeword is called systematic if it contains the same K information bits as the original message, before the matrix multiplication. The remaining N − K bits are called redundancy bits. In the rest of this paper, only systematic codewords are considered. For Polar Codes, two steps are required to encode a message as a systematic codeword [23]. The first step consists in a classical encoding, explained below. The second step consists in setting to 0 the bits of the intermediate codeword produced by the first step that are placed on frozen bit indices. Thereupon, the modified intermediate codeword is encoded the same way as in the first step. The first encoding phase consists in creating an extended information vector U which contains the K information bits and the N − K frozen bits (all set to 0). The corresponding codeword X1 is then constructed by calculating X1 = U × κ^{⊗n}. The second step consists in setting to 0 the N − K bits of X1 placed on frozen bit indices, giving as a result X1′. Then, a similar encoding step is carried out in order to get the codeword, denoted X, such that X = X1′ × κ^{⊗n}. The Polar Code systematic encoder may also be represented graphically, as shown in Fig. 1 for N = 2^n = 4 and K = 3. One can notice that the same subencoder might be used for both phases. The subencoder consists of n stages of N/2 XOR logic gates. The input vector U, on the left hand side, is propagated into the graph in order to get X1, in the middle. The output of the first subencoder is the input of the next subencoder, after setting the bits placed on frozen indices to 0. Then, the vector X1, modified into X1′, is encoded with the same subencoder in order to get the codeword X.
Fig. 1: N = 4, K = 3 Polar Code systematic encoder graph.
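The two-pass systematic encoding described above can be sketched in a few lines of Python. The snippet below is illustrative only (function names are ours): `polar_transform` applies the butterfly network of Fig. 1, i.e. multiplication by κ^{⊗n} over GF(2), and `systematic_encode` performs the two passes of [23].

```python
def polar_transform(u):
    """Multiply a bit vector by kappa^(tensor n) over GF(2):
    n stages of N/2 XOR gates (the subencoder of Fig. 1)."""
    x = list(u)
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            for k in range(i, i + h):
                x[k] ^= x[k + h]   # upper output = upper XOR lower
        h *= 2
    return x

def systematic_encode(info_bits, frozen, N):
    """Two-pass systematic encoding: encode U, zero the frozen
    positions of the intermediate word X1, then encode again."""
    info_pos = [i for i in range(N) if i not in frozen]
    u = [0] * N
    for pos, b in zip(info_pos, info_bits):
        u[pos] = b
    x1 = polar_transform(u)        # first pass: X1 = U x kappa^(tensor n)
    for i in frozen:               # zero the frozen positions -> X1'
        x1[i] = 0
    return polar_transform(x1)     # second pass: X = X1' x kappa^(tensor n)
```

Because κ² = I over GF(2), the transform is an involution, which is what makes the second pass land back on a codeword whose information positions carry the original message bits.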
Fig. 2: Factor graph of an N = 2^3 = 8 Polar Code.

B. Soft Cancellation decoding algorithm

After being sent over a transmission channel, the noisy version Y of the codeword X is received. Each sample y_i is converted into Log-Likelihood Ratio (LLR) format in order to simplify the implementation of the computation functions. Once the channel LLRs are calculated, they are set at the input of the factor graph (right hand side), as presented in Fig. 2. The SCAN decoding process consists in iteratively propagating these soft values in the factor graph. The SCAN decoding algorithm can be seen as a mixture of the SC and BP algorithms. The operation scheduling is similar to the SC one. Nevertheless, the SCAN algorithm is close to the BP one in the sense that it is iterative, and that the values propagated in both directions are soft information (LLRs). The left-propagating and right-propagating LLRs at row i and stage j are denoted λ_{i,j} and β_{i,j}, respectively, as illustrated in Fig. 2, with 0 ≤ i ≤ N − 1 and 0 ≤ j ≤ n. The λ_{i,n} and β_{i,0} LLR values do not require an update during the decoding process. The λ_{i,n} are set depending on the value y_i received from the channel, as explained above. The β_{i,0} are set depending on the bit type such that:

β_{i,0} = +∞   if i is a frozen bit
β_{i,0} = 0    if i is an information bit
The SCAN decoding scheduling of a Polar Code of size N = 8, for 2 iterations, is presented in Table I. One can notice that the second iteration is composed of the same first twelve steps as the first iteration. An extra step is added at the end of the last iteration to update the β_{i,n} of the last stage of the graph, which are the results of the decoding. The SCAN scheduling is very similar to the SC one. For further details on the SC algorithm, see for example [1].
TABLE I: Scheduling of SCAN decoding for an N = 8 Polar Code

Iteration 1:
  step 1:  λ0,2 λ1,2 λ2,2 λ3,2
  step 2:  λ0,1 λ1,1
  step 3:  β0,1 β1,1
  step 4:  λ2,1 λ3,1
  step 5:  β2,1 β3,1
  step 6:  β0,2 β1,2 β2,2 β3,2
  step 7:  λ4,2 λ5,2 λ6,2 λ7,2
  step 8:  λ4,1 λ5,1
  step 9:  β4,1 β5,1
  step 10: λ6,1 λ7,1
  step 11: β6,1 β7,1
  step 12: β4,2 β5,2 β6,2 β7,2
Iteration 2:
  steps 13-24: same sequence as steps 1-12
Final step:
  step 25: β0,3 β1,3 β2,3 β3,3 β4,3 β5,3 β6,3 β7,3
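The recursive schedule of Table I can be reproduced with a short generator. The sketch below (helper names are ours, for illustration) emits, for one iteration, the ordered list of λ/β groups that are updated at each step; the root λ come from the channel and the root β step is appended only after the last iteration.

```python
def scan_schedule(stage, rows):
    """Recursively list the update steps for the subtree rooted at a
    node of the given stage covering the given rows."""
    if stage == 1:
        # a stage-1 node computes its betas directly from the leaf betas
        return [("beta", [(r, 1) for r in rows])]
    upper, lower = rows[:len(rows) // 2], rows[len(rows) // 2:]
    steps = [("lambda", [(r, stage - 1) for r in upper])]   # lambdas of upper child
    steps += scan_schedule(stage - 1, upper)                # update upper child
    steps += [("lambda", [(r, stage - 1) for r in lower])]  # lambdas of lower child
    steps += scan_schedule(stage - 1, lower)                # update lower child
    steps += [("beta", [(r, stage) for r in rows])]         # betas of this node
    return steps

def iteration_schedule(n):
    """One full iteration for N = 2^n: traverse the two subtrees of the
    root (the root beta step, step 25 in Table I, is kept for the end)."""
    N = 1 << n
    upper, lower = list(range(N // 2)), list(range(N // 2, N))
    steps = [("lambda", [(r, n - 1) for r in upper])]
    steps += scan_schedule(n - 1, upper)
    steps += [("lambda", [(r, n - 1) for r in lower])]
    steps += scan_schedule(n - 1, lower)
    return steps
```

For n = 3 this yields exactly the twelve steps of one iteration in Table I.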
The decoding process estimates the β_{i,n} after I_max iterations. The computations can efficiently be carried out by using a factor graph, as proposed by Arıkan in [1]. The factor graph of an N = 8 Polar Code, detailed in Fig. 2, is composed of n + 1 = 4 memory stages. The decoder successively estimates the λ_{i,j} and β_{i,j} such that:

λ_{i,j} = f(λ_{i,j+1}, λ_{i+2^j,j+1} + β_{i+2^j,j})      if B_{i,j} = 0
λ_{i,j} = f(λ_{i−2^j,j+1}, β_{i−2^j,j}) + λ_{i,j+1}      if B_{i,j} = 1      (1)

β_{i,j} = f(β_{i,j−1}, β_{i+2^{j−1},j−1} + λ_{i+2^{j−1},j})      if B_{i,j−1} = 0
β_{i,j} = f(β_{i−2^{j−1},j−1}, λ_{i−2^{j−1},j}) + β_{i,j−1}      if B_{i,j−1} = 1      (2)

with:

f(a, b) = 2 tanh^{−1}(tanh(a/2) tanh(b/2))      (3)

f(a, b) can be approximated with no significant impact on the decoding performance, as proposed in [2], by:

f(a, b) = sgn(ab) × min(|a|, |b|)      (4)

and B_{i,j} = ⌊i/2^j⌋ mod 2, with 0 ≤ i < N and 0 ≤ j < n.

Fig. 4: Binary tree representation of the SCAN decoding process for a Polar Code of size N = 2^3 = 8.

An elementary computation is illustrated in Fig. 3. The update of all values is computed as follows:

λ_a = f(λ_c, λ_d + β_b)
λ_b = f(λ_c, β_a) + λ_d
β_c = f(β_a, β_b + λ_d)
β_d = f(β_a, λ_c) + β_b

For instance, in Fig. 2:

λ_{0,1} = f(λ_{0,2}, λ_{2,2} + β_{2,1})
λ_{2,1} = f(λ_{0,2}, β_{0,1}) + λ_{2,2}
β_{0,2} = f(β_{0,1}, β_{2,1} + λ_{2,2})
β_{2,2} = f(β_{0,1}, λ_{0,2}) + β_{2,1}
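As a quick numerical illustration of Eq. (3) and its min-sum approximation (4), one can check that the approximation preserves the sign of the exact check-node function and upper-bounds its magnitude (helper names are ours):

```python
import math

def f_exact(a, b):
    """Eq. (3): f(a, b) = 2 atanh(tanh(a/2) * tanh(b/2))."""
    return 2.0 * math.atanh(math.tanh(a / 2.0) * math.tanh(b / 2.0))

def f_minsum(a, b):
    """Eq. (4): hardware-friendly approximation sgn(ab) * min(|a|, |b|)."""
    sign = 1.0 if (a >= 0) == (b >= 0) else -1.0
    return sign * min(abs(a), abs(b))
```

The approximation replaces the hyperbolic functions by a sign/compare datapath, which is what makes it attractive for a hardware Processing Element.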
The decoding process can alternatively be represented as a full binary tree, as shown in Fig. 4. Each node at stage s stores 2^s λ values and 2^s β values. In this example, the tree of depth n + 1 = 4 represents the decoding of a codeword of size N = 8. The updating process of the λ and β values, for one iteration, is given below. The node N_{i,j} is updated when all its β values have been computed, which is possible once its two children have been updated. First, the λ values of the upper child (N_{2i,j−1}) are computed from the λ values of N_{i,j} and the β values of the lower child (N_{2i+1,j−1}) (cf. Fig. 5a). Next, when the upper child is updated, the λ values of the lower child (N_{2i+1,j−1}) are calculated from the λ values of N_{i,j} and the
Fig. 3: Left and right propagating LLR values.
β values of the upper child (N_{2i,j−1}) (cf. Fig. 5b). When the lower child is updated, the β values of N_{i,j} are computed from the β values of its two children (N_{2i,j−1} and N_{2i+1,j−1}) and from the λ values of N_{i,j} (cf. Fig. 5c). An iteration corresponds to a complete traversal of the tree. This sequence is repeated as many times as there are iterations (I_max) to perform. It is noteworthy that during intermediate iterations, it is neither necessary to calculate the λ of the leaves (λ_{i,0}), nor the β of the root; the latter have to be computed at the end of the last iteration only.

Algorithm 1: SCAN decoding algorithm
Input: Node N_{i,j}
Result: β of N_{i,j} updated
if N_{i,j} has children which have children then
    for I = 1 to I_max do
        1) Calculate the λ of the upper child (N_{2i,j−1}) from the λ of N_{i,j} and the β of the lower child (N_{2i+1,j−1}).
        2) Update the upper child (N_{2i,j−1}).
        3) Calculate the λ of the lower child (N_{2i+1,j−1}) from the λ of N_{i,j} and the β of the upper child (N_{2i,j−1}).
        4) Update the lower child (N_{2i+1,j−1}).
        5) Calculate the β of N_{i,j} from the β of the children (N_{2i,j−1} and N_{2i+1,j−1}) and the λ of N_{i,j}.
else
    Nothing to do
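Algorithm 1 maps directly to code. The following Python prototype is a behavioral sketch, not the hardware architecture: it uses the min-sum f of Eq. (4), a finite saturation value in place of ±∞ for the frozen β_{i,0}, and returns hard decisions on the systematic codeword bits (all names are ours, illustrative only).

```python
SAT = 31.0  # saturation value, e.g. 2^(Q-1) - 1 for Q = 6

def f(a, b):
    """Min-sum approximation of Eq. (4)."""
    s = 1.0 if (a >= 0) == (b >= 0) else -1.0
    return s * min(abs(a), abs(b))

def scan_node(s, base, lam, bet):
    """Update the node at stage s covering rows [base, base + 2^s)."""
    if s == 0:
        return  # leaf: beta_{i,0} is a constant (frozen pattern)
    half = 1 << (s - 1)
    for k in range(base, base + half):       # 1) lambdas of the upper child
        lam[s - 1][k] = f(lam[s][k], lam[s][k + half] + bet[s - 1][k + half])
    scan_node(s - 1, base, lam, bet)         # 2) update the upper child
    for k in range(base, base + half):       # 3) lambdas of the lower child
        lam[s - 1][k + half] = f(lam[s][k], bet[s - 1][k]) + lam[s][k + half]
    scan_node(s - 1, base + half, lam, bet)  # 4) update the lower child
    for k in range(base, base + half):       # 5) betas of this node
        bet[s][k] = f(bet[s - 1][k], bet[s - 1][k + half] + lam[s][k + half])
        bet[s][k + half] = f(bet[s - 1][k], lam[s][k]) + bet[s - 1][k + half]

def scan_decode(chan_llr, frozen, imax):
    """SCAN decoding; hard decision LLR > 0 -> bit 0 on the codeword."""
    N = len(chan_llr)
    n = N.bit_length() - 1
    lam = [[0.0] * N for _ in range(n + 1)]
    bet = [[0.0] * N for _ in range(n + 1)]
    lam[n] = list(chan_llr)              # channel LLRs enter at stage n
    for i in frozen:
        bet[0][i] = SAT                  # "+infinity" for frozen bits
    for _ in range(imax):                # root betas recomputed each pass
        scan_node(n, 0, lam, bet)        # (only the last pass is needed)
    return [0 if lam[n][i] + bet[n][i] > 0 else 1 for i in range(N)]
```

With a noiseless channel (LLR = +4 for bit 0, −4 for bit 1), the sketch recovers the transmitted systematic codeword, since all min-sum messages keep the correct sign.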
Fig. 5: Calculation types during the SCAN decoding process: (a) Type 1 calculation (λ update), (b) Type 2 calculation (λ update), (c) Type 3 calculation (β update).
III. HARDWARE ARCHITECTURE FOR THE SCAN DECODER
The proposed architecture, illustrated in Fig. 6, is composed of three main units:
• Memory Unit: storage of the λ and β values during the decoding process.
• Processing Unit: computation of the two types of calculations, f(a, b + c) or f(a, c) + b, with a, b and c being λ or β LLRs.
• Control Unit: generation of the read and write addresses of the operands. It also controls the type of operation carried out by the Processing Unit.

A. Memory Unit

The Memory Unit is implemented with Random Access Memory (RAM), to store the λ and β quantized on Q bits, and Read-Only Memory (ROM), to store the frozen bit pattern quantized on 1 bit. The ROM is N-bit wide. The β_{i,0} LLRs depend on the stored values such that:

β_{i,0} = +sat   if the ith bit is frozen
β_{i,0} = 0      otherwise

where sat is the saturation (maximum value) that can be represented in the architecture, depending on the quantization Q:

sat = 2^{Q−1} − 1
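The Q-bit saturation can be illustrated with a small helper (names are ours); the same clamping is what the saturating adder of the PE (the "ADD Saturation" stage in Fig. 8) performs in hardware:

```python
def saturate(v, q=6):
    """Clamp an LLR to the symmetric Q-bit range [-(2^(Q-1)-1), 2^(Q-1)-1]."""
    sat = (1 << (q - 1)) - 1
    return max(-sat, min(sat, v))

def sat_add(a, b, q=6):
    """Saturating adder: add two Q-bit LLRs without overflow wrap-around."""
    return saturate(a + b, q)
```

For Q = 6, sat = 31, so frozen bits are initialized to +31 rather than +∞.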
A RAM is used to store the λ_{i,n} LLRs received from the channel. Two distinct RAMs are used to store the remaining λ and β.

Fig. 6: SCAN decoder architecture

We could use only two N(n + 1)-LLR memories as proposed in [20], which would simplify the memory address generation. Yet, the authors in [19] showed that it is possible to reduce the memory requirement of the λ from N(n + 1) to 2N − 1, by applying the memory optimization presented in [4]. As a consequence, at stage s, only 2^s λ LLRs need to be saved at a time, and this for any number of iterations. Moreover, it is not necessary to compute the λ of stage 0; therefore, only the λ LLRs from stage 1 to n are memorized, reducing the λ memory requirement to 2N − 2.

Not all β values need to be saved from one iteration to the other. Only the ones at positions (i, j) such that i mod 2 = 1 have to be memorized. For the others, at stage s, one shared block of size 2^s is sufficient. The β at stage n do not need to be stored because they are updated only at the end of the last iteration, as explained in Algorithm 1. The authors in [19] claimed that this optimization reduces the β memory requirement from N(n + 1) to 4N + Nn/2 − 2. We believe that this formula suffers from a misprint. Indeed, for a Polar Code of size N = 8, the maximum β memory size is N(n + 1) = 8 × (3 + 1) = 32, whereas the formula in [19] gives the following β memory requirement: 4N + Nn/2 − 2 = 4 × 8 + (3 × 8)/2 − 2 = 42 > N(n + 1). The corrected formula is expressed as follows:

N(n + 1) − Unnecessary β = N(n + 1) − Σ_{k=0}^{n−2} (N/2 − 2^k)
= Nn/2 + 2N − 1.

Our SCAN decoder architecture does not need to store the β LLR values at stages 0 and n. Therefore, the β memory required for the SCAN decoder becomes:

N(n − 1) − Unnecessary β = N(n − 1) − Σ_{k=1}^{n−2} (N/2 − 2^k) = Nn/2 + N/2 − 2

The SCAN decoder architecture memory structure is illustrated in Fig. 7. The solid colored rectangles stand for the amount of memory required at each stage for either the λ or the β LLRs. The gray rectangles mean that these values share the same memory location as the data of the block from which the arrow shown in Fig. 7 originates. The RAMs used are dual-port RAMs because, during the decoding process, a read and a write operation may be needed in the same RAM at the same time. Moreover, when the read address and the write address are identical, the output of the Processing Unit needs to be sent back directly to its inputs. The RAM is then bypassed while the value is written back to the memory. The λ and β RAM input ports are P × Q bits wide. Indeed, they can store a maximum of P LLRs during a clock cycle, with P the number of Processing Elements (PEs) instantiated in the Processing Unit. The RAM output ports are 2P × Q bits wide, which corresponds to reading a maximum of 2P LLRs at a time, because for each computation two operands lie in the same RAM.
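The two closed forms can be sanity-checked numerically. The sketch below assumes the per-stage bookkeeping reconstructed above (N/2 − 2^k unnecessary β values per stage); function names are ours, for illustration:

```python
def scan_beta_memory(n):
    """Beta storage of the proposed decoder (stages 1..n-1):
    N(n-1) minus the unnecessary values = Nn/2 + N/2 - 2."""
    N = 1 << n
    kept = N * (n - 1) - sum(N // 2 - (1 << k) for k in range(1, n - 1))
    assert kept == N * n // 2 + N // 2 - 2   # closed form
    return kept

def fayyaz_beta_memory(n):
    """Corrected beta storage for [19] (stages 0..n):
    Nn/2 + 2N - 1, instead of the printed 4N + Nn/2 - 2."""
    N = 1 << n
    kept = N * (n + 1) - sum(N // 2 - (1 << k) for k in range(0, n - 1))
    assert kept == N * n // 2 + 2 * N - 1    # closed form
    return kept
```

For N = 8, these give 14 and 27 β LLRs respectively, both below the uncompressed N(n + 1) = 32, unlike the printed formula of [19].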
Fig. 8: Processing element architecture
Fig. 7: SCAN decoder architecture memory structure for N = 16

B. Control Unit

The Control Unit is composed of three main blocks:
• Scheduler: the scheduler generates the sequence of the LLR indices to be updated. An example of scheduling for the decoding of a Polar Code of size N = 8 is shown in Table I.
• Address generator: the address generator retrieves the LLR indices that have to be updated. It then determines the addresses of the three operands that are necessary to carry out the decoding (cf. equations (1) and (2)).
• Controller: the controller selects the type of operation to be carried out by the Processing Unit. It also enables the bypasses of the RAMs when the read and write addresses are identical. Moreover, it controls the scheduler, the address generator and the RAMs (enable and R/W signals). The control signals of the operand selection block select the three required operands, amongst the 8 data read from the RAMs, depending on the operation type.

C. Processing Unit

The Processing Unit has three PQ-bit wide inputs, which correspond to the three operands of the calculation. The operation type is determined by the Control Unit. The result is sent to the RAMs; the targeted RAM is activated by the Control Unit in order to store it. The result is also connected to multiplexers used to bypass the RAMs when the next operand is a result that is not yet available in the RAM. In order to carry out these calculations, the Processing Unit is composed of P PEs that can compute both calculation types: f(a, b + c) or f(a, c) + b. The Processing Element (PE) architecture is given in Fig. 8. The calculation type is controlled by the operation type signal generated by the Control Unit.
D. Latency

The SCAN decoder requires updating from right to left and from left to right for a complete iteration. The β_{i,n} values are not updated during intermediate iterations. For the stages s such that s > p, with p = log2(P), the P PEs can all be used in parallel, while for the smaller stages only some of them are active. The latency for one intermediate iteration is then:

(2N/P) × (n − p + P − 2)

Since the last iteration requires N/P extra clock cycles, the total latency to decode with I iterations is then:

I × (2N/P) × (n − p + P − 2) + N/P clock cycles.      (5)
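Equation (5) can be checked against the data rates reported later in Table III. The helper below assumes an information throughput model of K × f_clk / latency (our assumption, consistent with the reported figures) for the (1024,512) code with P = 16 at 90 MHz:

```python
def scan_latency(N, P, iterations):
    """Eq. (5): I * (2N/P) * (n - p + P - 2) + N/P clock cycles,
    with n = log2(N) and p = log2(P)."""
    n = N.bit_length() - 1
    p = P.bit_length() - 1
    return iterations * (2 * N // P) * (n - p + P - 2) + N // P

def info_throughput(K, N, P, iterations, f_clk):
    """Information throughput in bit/s, assuming K bits per decoding."""
    return K * f_clk / scan_latency(N, P, iterations)
```

For N = 1024, P = 16 and one iteration, Eq. (5) gives 2624 cycles, i.e. about 17.56 Mbps at 90 MHz for K = 512.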
IV. BELIEF PROPAGATION HARDWARE ARCHITECTURE
To the best of our knowledge, [20] is the only work that proposed a hardware implementation of a soft-output decoder, using the BP algorithm [18], on FPGA devices. Other works focused on implementing the BP decoder as an Application-Specific Integrated Circuit (ASIC), as in [24] and [25]. They apply a stopping criterion; however, the decoding performance is not improved. The BP is similar to the SCAN in the sense that it propagates LLRs on the factor graph of the code. However, it is different since a flooding schedule is used instead of the recursive schedule of the SCAN decoding. During an iteration, stages are completely and successively updated from right to left and then from left to right. This means that the first half of an iteration corresponds to the calculation of all λ values, while the second half consists in updating the β values. In [20], the scheduling is slightly different from the original one given in [18], since the LLRs are propagated both ways at the same time. This scheduling is close to the shuffled scheduling presented in [26]. The main advantage of such a scheduling is its regularity, which enables the implementation of efficient decoders. The main drawback of the BP decoding is that around 50 iterations are necessary to reach the decoding performance of a simple SC decoding, while the SCAN decoding requires only 2 iterations to outperform the SC decoding.

The architecture proposed in [20] is composed of three main parts:
• The memory used to store the 2N(n + 1) LLRs (N(n + 1) λ and N(n + 1) β).
• The processing elements, with a parallelism level of P.
• The control, composed of an address generator and the R/W signals.

D clock cycles are required by a processing element in order to compute its LLRs. The scheduling of the decoding implies that the total latency necessary for one iteration is:

Nn/P      (6)

The memory amount required for such an architecture cannot be practically implemented for large codelengths because it grows as O(2N(n + 1)). Moreover, the BP algorithm requires many iterations in order to reach the decoding performance obtained with a single iteration of the SCAN algorithm. This significantly impacts the resulting data rate of the BP decoder. The decoding performance of the BP algorithm for 50 iterations, and of the SCAN algorithm for 1, 2 and 4 iterations, are discussed in section VI.
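Comparing Eq. (6) with the per-iteration term of Eq. (5) quantifies the trade-off: one SCAN iteration costs more cycles than one BP iteration, but far fewer than the 50 BP iterations needed for comparable decoding performance. A small sketch (helper names are ours):

```python
def bp_iteration_latency(N, P):
    """Eq. (6): Nn/P clock cycles per BP iteration."""
    n = N.bit_length() - 1
    return N * n // P

def scan_iteration_latency(N, P):
    """Per-iteration term of Eq. (5): (2N/P)(n - p + P - 2) cycles."""
    n = N.bit_length() - 1
    p = P.bit_length() - 1
    return (2 * N // P) * (n - p + P - 2)
```

For N = 1024 and P = 16, a BP iteration takes 640 cycles and a SCAN iteration 2560 cycles; yet one SCAN iteration (plus its N/P final cycles) remains an order of magnitude cheaper than 50 BP iterations.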
V. IMPLEMENTATION RESULTS
The architecture of the SCAN decoder presented in section III was functionally verified and then implemented on FPGA devices. In order to better characterize the SCAN decoder, we compared it with the only other decoder able to generate soft outputs: the BP decoder of [20], introduced in section IV. The implementation results show that the number of Look-Up Tables (LUTs) of the SCAN architecture depends almost exclusively on the parallelism level P and very little on the codelength N. Indeed, for a fixed parallelism level and several codelengths, the number of LUTs is steady (3291 ≤ LUTs ≤ 4017 for P = 16 and 256 ≤ N ≤ 8192). On the contrary, the number of LUTs increases linearly with P. This can be explained because the instantiated logic is mainly used to implement the PEs and the multiplexing of the RAMs, which grow with the parallelism level P. The number of Flip-Flops (FFs) is almost insensitive to the codelength and to the number of PEs. Indeed, FFs are instantiated inside the Control Unit; since this unit remains steady when the code parameters change, the SCAN architecture keeps few FFs. The SCAN decoder uses dual-port RAMs. As explained in

TABLE II: Implementation results of the SCAN decoder compared with the BP decoder [20] on the FPGA XC4VSX25

                  BP decoder [20]                |            SCAN decoder
N      P  | LUT   FF    BRAM  Bit Number | LUT   FF   BRAM  Bit Number  Dual Port Eq.
256    16 | 2779  1592    6      24576   | 3291  271   20     10368        20736
512    16 | 2809  1596    6      55296   | 3482  290   20     21888        43776
1024   16 | 2794  1600   12     122880   | 3517  308   20     46464        92928
2048   16 | 2797  1604   22     270336   | 3701  328   20     98688       197376
4096   16 | 2805  1605   48     589824   | 3693  346   23    209280       418560
8192   16 | 2808  1612   96    1277952   | 4017  364   46    442752       885504
2048    2 |  462   271   24     270336   | 1402  313   11     98280       196560
2048    4 |  792   459   24     270336   | 1754  314   12     98304       196608
2048    8 | 1459   839   24     270336   | 2413  285   14     98400       196800
2048   16 | 2797  1605   22     270336   | 3701  328   20     98688       197376
2048   32 | 5479  3144   22     270336   | 6592  416   37     99456       198912
Fig. 9: Bit requirement evolution depending on N for both BP and SCAN decoder architectures
section II, we need to store twice as many bits as strictly necessary:

λ: 2N − 2 Q-bit LLRs
β: Nn/2 + N/2 − 2 Q-bit LLRs

Therefore, 2 × ((2N − 2) + Nn/2 + N/2 − 2) × Q bits need to be stored in the RAMs. The memory requirement of the BP decoder is 2N(n + 1) × Q bits. The numbers of bits to store for both decoders are represented in Fig. 9. The number of bits to store for the BP decoder in [20] is clearly greater than the one for the SCAN decoder, whatever the codelength. In order to implement the RAMs, the FPGA devices infer 18-kb Block RAMs (BRAMs). As shown in Fig. 9, the SCAN decoder should use fewer BRAMs than the BP decoder. Even so, there are some implementation cases (Table II) in which the SCAN decoder uses more BRAMs than the BP decoder. This can be explained because, depending on the architecture implementation, some BRAMs are only partially filled. For example, for the Polar Code N = 2048 and P = 32, the SCAN decoder uses 37 BRAMs while the BP decoder uses only 22 BRAMs. For this particular example, the filling rate of the BRAMs is 66% for the BP decoder and only 14.5% for the SCAN decoder. These very low utilization rates can be explained because 10 Block RAMs are instantiated for the SCAN decoder for N = 2048 and P = 32; since we use dual-port RAMs, this becomes around 20 Block RAMs (cf. for N = 256 the utilization rate is 5.6%). It can also be explained because the synthesis tool (Xilinx XST) needs to instantiate Block RAMs for the few remaining bits that cannot fit in another Block RAM.
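The two memory formulas above can be compared programmatically (Q = 6, dual-port doubling included for the SCAN decoder; the BP figure follows the 2N(n + 1) × Q formula from the text; helper names are ours):

```python
def scan_bits(N, Q=6):
    """Dual-port-equivalent SCAN storage:
    2 * ((2N - 2) + (Nn/2 + N/2 - 2)) * Q bits."""
    n = N.bit_length() - 1
    return 2 * ((2 * N - 2) + (N * n // 2 + N // 2 - 2)) * Q

def bp_bits(N, Q=6):
    """BP decoder storage of [20]: 2N(n + 1) * Q bits."""
    n = N.bit_length() - 1
    return 2 * N * (n + 1) * Q
```

For N = 2048 this reproduces the 196560 dual-port-equivalent bits listed in Table II for P = 2 (larger P adds a few bits of BRAM-width padding), and the SCAN requirement stays below the BP one for every codelength of Fig. 9.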
VI. DECODING PERFORMANCE COMPARISON
For both the BP algorithm and the SCAN algorithm, the number of iterations has a direct impact on the data rate (throughput) and on the decoding performance. More generally, the more iterations, the better the decoding performance, but with a linear decrease in data rate. With hardware implementation in mind, a good balance between decoding performance and data rate needs to be found. In [20], the data rate is given for 5 iterations while the decoding performance is given for 50 iterations. We propose to compare both data rates and
Fig. 10: SCAN algorithm vs BP algorithm - (1024,512)

Fig. 11: SCAN algorithm vs BP algorithm - (512,426)

Fig. 12: SCAN algorithm vs BP algorithm - (256,128)

Fig. 13: SCAN algorithm vs BP algorithm - (256,192)

TABLE III: Data rate comparison (in Mbps) between the BP and the SCAN decoder for similar decoding performance

Polar Code  | BP [20], 160 MHz | SCAN (P = 16, 90 MHz): 1 it.  2 it.  4 it.
(1024,512)  |      2.783       |                       17.56   8.89   4.47
(512,426)   |      5.203       |                       30.72  15.56   7.83
(256,192)   |      5.333       |                       29.19  14.79   7.45
(256,128)   |      3.556       |                       19.46   9.86   4.97

decoding performance of the BP decoder proposed in [20] with 50 iterations, and of the SCAN decoder with 1, 2 and 4 iterations. The same Polar Codes as in [20] are compared in Fig. 10 through 13. One can notice that the SCAN algorithm has better decoding performance than the BP algorithm, even with only one iteration. As a consequence, we can compare the data rates of both architectures as follows. For the same LLR quantization (Q = 6), the architecture proposed in [20] works at 160 MHz on an FPGA device XC5VLX85, for several Polar Codes. Note that the authors do not specify the parallelism level used. Moreover, the data rates are given for 5 iterations only. As we know that the decoding
performance of the BP at 50 iterations is not as good as that of the SCAN with 1 iteration, we transpose their data rate from 5 iterations to a data rate for 50 iterations. We still suppose that the BP decoder can work at 160 MHz for 50 iterations; the data rate is therefore divided by 10, because each iteration takes the same amount of time. Our architecture is compared in terms of data rate for several configurations. A comparison is given in Table III. The SCAN decoder has better decoding performance with only one iteration, and a data rate which is at least 5 times higher than the BP decoder one. Even with 4 iterations, the SCAN decoder has a higher data rate than the BP and better decoding performance.

VII. CONCLUSION
In this article we have presented the SCAN algorithm, which is to date, together with the BP, the only polar decoding algorithm that can provide soft outputs. The first decoder architecture that implements this algorithm was presented in this paper. This architecture was then compared in terms of hardware complexity and data rate with a BP decoder architecture. The SCAN decoder can reach a higher data rate than a BP decoder implemented on an FPGA device, with a smaller memory complexity. Moreover, a single SCAN decoding iteration performs better than 50 BP decoding iterations. This first architecture demonstrates that SCAN decoding is an efficient solution for the soft-decision decoding of Polar Codes. Despite these encouraging results, the proposed architecture may be further optimized by reducing the memory footprint and the computation complexity. This will be the object of future research.
REFERENCES

[1] E. Arikan, "Channel Polarization: A Method for Constructing Capacity-Achieving Codes for Symmetric Binary-Input Memoryless Channels," IEEE Transactions on Information Theory, vol. 55, no. 7, pp. 3051–3073, Jul. 2009.
[2] C. Leroux, I. Tal, A. Vardy, and W. Gross, "Hardware architectures for successive cancellation decoding of polar codes," in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2011, pp. 1665–1668.
[3] C. Leroux, A. J. Raymond, G. Sarkis, I. Tal, A. Vardy, and W. J. Gross, "Hardware Implementation of Successive-Cancellation Decoders for Polar Codes," Journal of Signal Processing Systems, vol. 69, no. 3, pp. 305–315, Dec. 2012.
[4] C. Leroux, A. Raymond, G. Sarkis, and W. Gross, "A Semi-Parallel Successive-Cancellation Decoder for Polar Codes," IEEE Transactions on Signal Processing, vol. 61, no. 2, pp. 289–299, Jan. 2013.
[5] A. J. Raymond and W. J. Gross, "A Scalable Successive-Cancellation Decoder for Polar Codes," IEEE Transactions on Signal Processing, vol. 62, no. 20, pp. 5339–5347, Oct. 2014.
[6] C. Zhang and K. K. Parhi, "Low-Latency Sequential and Overlapped Architectures for Successive Cancellation Polar Decoder," IEEE Transactions on Signal Processing, vol. 61, no. 10, pp. 2429–2441, May 2013.
[7] A. Alamdar-Yazdi and F. Kschischang, "A Simplified Successive-Cancellation Decoder for Polar Codes," IEEE Communications Letters, vol. 15, no. 12, pp. 1378–1380, Dec. 2011.
[8] G. Sarkis, P. Giard, A. Vardy, C. Thibeault, and W. J. Gross, "Fast Polar Decoders: Algorithm and Implementation," IEEE Journal on Selected Areas in Communications, vol. 32, no. 5, pp. 946–957, May 2014.
[9] I. Tal and A. Vardy, "List decoding of polar codes," in 2011 IEEE International Symposium on Information Theory Proceedings (ISIT), Aug. 2011, pp. 1–5.
[10] B. Li, H. Shen, and D. Tse, "An Adaptive Successive Cancellation List Decoder for Polar Codes with Cyclic Redundancy Check," IEEE Communications Letters, vol. 16, no. 12, pp. 2044–2047, Dec. 2012.
[11] K. Niu and K. Chen, "Stack decoding of polar codes," Electronics Letters, vol. 48, no. 12, pp. 695–697, Jun. 2012.
[12] O. Afisiadis, A. Balatsoukas-Stimming, and A. Burg, "A Low-Complexity Improved Successive Cancellation Decoder for Polar Codes," arXiv preprint arXiv:1412.5501, Dec. 2014.
[13] A. Balatsoukas-Stimming, A. J. Raymond, W. J. Gross, and A. Burg, "Hardware Architecture for List Successive Cancellation Decoding of Polar Codes," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 61, no. 8, pp. 609–613, Aug. 2014.
[14] A. Balatsoukas-Stimming, M. B. Parizi, and A. Burg, "On Metric Sorting for Successive Cancellation List Decoding of Polar Codes," arXiv preprint arXiv:1410.4460, Oct. 2014.
[15] B. Yuan and K. K. Parhi, "Successive Cancellation List Polar Decoder using Log-likelihood Ratios," arXiv preprint arXiv:1411.7282, Nov. 2014.
[16] P. Giard, G. Sarkis, C. Thibeault, and W. J. Gross, "A Fast Software Polar Decoder," arXiv preprint arXiv:1306.6311, Jun. 2013.
[17] B. Le Gal, C. Leroux, and C. Jego, "Multi-Gb/s Software Decoding of Polar Codes," IEEE Transactions on Signal Processing, vol. 63, no. 2, pp. 349–359, Jan. 2015.
[18] E. Arikan, "Polar codes: A pipelined implementation," in Proc. Int. Symp. Broadband Communication (ISBC2010), Jul. 2010.
[19] U. U. Fayyaz and J. R. Barry, "Low-Complexity Soft-Output Decoding of Polar Codes," IEEE Journal on Selected Areas in Communications, vol. 32, no. 5, pp. 958–966, May 2014.
[20] A. Pamuk, "An FPGA implementation architecture for decoding of polar codes," in 2011 8th International Symposium on Wireless Communication Systems (ISWCS), Nov. 2011, pp. 437–441.
[21] E. Sasoglu, I. Telatar, and E. Arikan, "Polarization for arbitrary discrete memoryless channels," in 2009 IEEE Information Theory Workshop (ITW), Oct. 2009, pp. 144–148.
[22] I. Tal and A. Vardy, "How to Construct Polar Codes," IEEE Transactions on Information Theory, vol. 59, no. 10, pp. 6562–6582, Oct. 2013.
[23] E. Arikan, "Systematic Polar Coding," IEEE Communications Letters, vol. 15, no. 8, pp. 860–862, Aug. 2011.
[24] B. Yuan and K. Parhi, "Early Stopping Criteria for Energy-Efficient Low-Latency Belief-Propagation Polar Code Decoders," IEEE Transactions on Signal Processing, vol. 62, no. 24, pp. 6496–6506, Dec. 2014.
[25] Y. S. Park, Y. Tao, S. Sun, and Z. Zhang, "A 4.68 Gb/s belief propagation polar decoder with bit-splitting register file," in 2014 Symposium on VLSI Circuits Digest of Technical Papers, Jun. 2014, pp. 1–2.
[26] J. Zhang and M. Fossorier, "Shuffled iterative decoding," IEEE Transactions on Communications, vol. 53, no. 2, pp. 209–213, Feb. 2005.