Reduced-Memory High-Throughput Fast-SSC Polar Code Decoder

Reduced-Memory High-Throughput Fast-SSC Polar Code Decoder Architecture
Furkan Ercan, Carlo Condo, Warren J. Gross
Department of Electrical and Computer Engineering, McGill University, Montréal, Québec, Canada
Email: [email protected], [email protected], [email protected]

Abstract—Polar codes have been selected for use within 5G networks, and are being considered for the data and control channels of additional 5G scenarios, such as the next-generation ultra-reliable low-latency channel. As a result, efficient fast polar code decoder implementations are essential. In this work, we present a new fast simplified successive cancellation (Fast-SSC) decoder architecture. Our proposed solution reduces the memory requirements and improves the throughput with respect to state-of-the-art Fast-SSC decoders. We achieve these objectives through a more efficient memory utilization than that of Fast-SSC, which also enables the execution of multiple instructions in a single clock cycle. Compared to the state of the art, memory requirements are reduced by 22.2%, while at the same time a throughput improvement of 11.6% is achieved with (1024, 512) polar codes. At equal throughput, the memory requirements are reduced by up to 60.4%.

I. INTRODUCTION

Polar codes, introduced by Arıkan in [1], are a class of error-correcting codes that provably achieve the capacity of a memoryless channel as the code length N tends to infinity. They do so through channel polarization, which splits N channel uses into K reliable ones, through which information bits are sent, and N − K unreliable ones that are fixed to a constant, usually zero. Recently, polar codes have been selected for the control channel of 5G enhanced mobile broadband (eMBB) networks [2], where high-throughput and low-complexity implementations are required.

In [1], the successive cancellation (SC) decoding algorithm is proposed: it can be represented as a binary tree search. However, this approach suffers from long decoding latency due to its sequential nature, and from mediocre error-correction performance at moderate code lengths [1]. To increase the throughput of the SC algorithm, the Fast-SSC decoding algorithm proposed in [3] identifies particular frozen and information bit patterns, and introduces fast decoding methods for such nodes without affecting the error-correction performance. Compared to standard successive cancellation decoders [4], Fast-SSC achieves a throughput an order of magnitude higher. Further identification and use of special nodes were carried out in both SC-based [5], [6] and SC-list-based [7], [8] decoding techniques.

Hardware implementations of the Fast-SSC algorithm use Pe parallel processing units to optimize the throughput: this number in turn influences the memory word sizing. In [3], the number of processing elements affects

978-1-5386-0446-5/17/$31.00 © 2017 IEEE

the utilization factor of the decoder memory. In fact, nodes that represent codes shorter than Pe occupy a memory word without fully utilizing it. For example, for a polar code of length N = 1024, memory utilization is only 66% if Pe = 2^6, and 44% when Pe = 2^7. This work proposes a new Fast-SSC architecture that substantially increases memory utilization regardless of the parallelization factor. This is achieved by storing the node variables related to the lower stages of the Fast-SSC decoding tree in a single memory word. For a polar code with N = 1024 and K = 512, the memory utilization rises to 99.61%. This allows us to introduce multiple operations per clock cycle to increase the decoder throughput: new processing elements that support more than one operation per cycle are introduced, with limited impact on the system critical path.

The rest of this work is organized as follows: in Section II, polar codes and Fast-SSC decoding are reviewed. In Section III, the proposed architecture is described in detail. Section IV presents numerical results for an FPGA implementation of our decoder; conclusions are drawn in Section V.

II. BACKGROUND

A. Polar Codes

A polar code PC(N, K) is a linear block code of length N = 2^n and rate R = K/N. The encoding process can be represented by the matrix multiplication

  x = u G^{⊗n}   (1)

where u = {u_0, u_1, ..., u_{N−1}} is the input vector, x = {x_0, x_1, ..., x_{N−1}} is the encoded vector, and the generator matrix G^{⊗n} is the n-th Kronecker power of the polar code kernel G = [1 0; 1 1]. Thus, a polar code of length N can be seen as the composition of two polar codes of length N/2. In [1] it has been shown that, as N → ∞, the bit positions encoded as in (1) become either completely unreliable or completely reliable. The N − K most unreliable bits are usually frozen to 0, while the remaining K are used to transmit information bits. For a PC(16, 10) code, bits u_0, u_1, u_2, u_3, u_4, and u_8 in Fig. 1 have the least reliable indices and are thus frozen (white circles in the figure). The information bits are indicated with black nodes, and the calculation nodes are colored gray.

SC decoding of polar codes can be represented on a binary tree as in Fig. 1. The tree is explored depth-first, with priority given to the left branch. The root node of the tree is filled with the channel LLR values. At each stage of the tree, the soft

Fig. 1. A PC(16, 10) tree with identified special nodes.

messages (LLR values) α = (α_0, α_1, ..., α_{2^S−1}) are passed from a parent node to its child nodes, and hard-decision estimates β = (β_0, β_1, ..., β_{2^S−1}) are passed from a child node to its parent node. The soft information passed to the left child, α^l, and to the right child, α^r, can be approximated as

  α_i^l = sgn(α_i) sgn(α_{i+2^{S−1}}) min(|α_i|, |α_{i+2^{S−1}}|)   (2)
  α_i^r = α_{i+2^{S−1}} + (1 − 2β_i^l) α_i   (3)

where 0 ≤ i < 2^{S−1} for stage S, the root node being at stage S = log2(N). The hard-decision estimates β for stage S are calculated from the messages β^l and β^r received from the left and right child nodes as

  β_i = β_i^l ⊕ β_i^r,  if i < 2^{S−1};  β_i^r,  otherwise.   (4)

where ⊕ denotes the bitwise XOR operation, and 0 ≤ i < 2^S. At the leaf nodes, the β values are hard decisions computed by observing the sign bit of the soft information:

  β_i^leaf = 0,  if α_i^leaf ≥ 0;  1,  otherwise.   (5)

where i represents the node index.

B. Fast-SSC Decoding

1) Algorithm: Simplified successive cancellation (SSC) decoding [9] showed that the SC tree can be pruned, avoiding the descent into nodes whose leaves are either all information bits (rate-1) or all frozen bits (rate-0). Examples of rate-0 and rate-1 nodes are shown in Fig. 1. The Fast-SSC decoding algorithm [3] evolves SSC by identifying additional special patterns and presenting efficient decoding techniques for such nodes. Denoting an information bit with I and a frozen bit with F, in addition to rate-0 and rate-1 nodes, three other special node types occur in a polar code: repetition (Rep) nodes (FF···FI), single parity check (SPC) nodes (FI···II), and maximum likelihood (ML) nodes (FIFI).

A repetition node contains a single information bit; all its other bits are frozen. After encoding, every bit of a repetition node carries the value of that information bit, so its hard decision is made by summing all the LLR values and extracting the sign bit of the result (6):

  β_i = 0,  if Σ_{i=0}^{Nv−1} α_i ≥ 0;  1,  otherwise.   (6)

An SPC node has a single frozen bit in its first position. Due to the polar code construction, this frozen bit represents the parity of all the information bits. Thus, in an SPC node, the parity check (8) over all the LLR hard decisions (HD) (7) must be zero. If this condition is satisfied, the node is assumed to be decoded correctly; if not, the hard decision associated with the least reliable soft information (9) is flipped (10):

  HD_i = 0,  if α_i ≥ 0;  1,  otherwise.   (7)
  parity = ⊕_{i=0}^{Nv−1} HD_i   (8)
  j = argmin_i(|α_i|),  0 ≤ i < Nv   (9)
  β_i = HD_i ⊕ parity,  if i = j;  HD_i,  otherwise.   (10)
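As a concrete illustration, the encoding rule (1), the elementary operations (2) and (3), and the SPC decision (7)–(10) can be sketched in floating-point Python (a minimal model for exposition only; the function names are ours, not taken from the decoder hardware):

```python
import math

def polar_encode(u):
    """Eq. (1): x = u * G^{kron n}, computed recursively over halves."""
    if len(u) == 1:
        return u[:]
    half = len(u) // 2
    left, right = polar_encode(u[:half]), polar_encode(u[half:])
    # A length-N code is two length-N/2 codes: the left half is XORed
    # with the right half, the right half is passed through.
    return [l ^ r for l, r in zip(left, right)] + right

def f_min_sum(alpha):
    """Eq. (2): LLRs passed to the left child (min-sum approximation)."""
    half = len(alpha) // 2
    return [math.copysign(1.0, alpha[i]) * math.copysign(1.0, alpha[i + half])
            * min(abs(alpha[i]), abs(alpha[i + half])) for i in range(half)]

def g(alpha, beta_l):
    """Eq. (3): LLRs passed to the right child, given the left partial sums."""
    half = len(alpha) // 2
    return [alpha[i + half] + (1 - 2 * beta_l[i]) * alpha[i]
            for i in range(half)]

def decode_spc(alpha):
    """Eqs. (7)-(10): hard decisions, parity check, flip least reliable bit."""
    hd = [0 if a >= 0 else 1 for a in alpha]                 # (7)
    parity = 0
    for b in hd:
        parity ^= b                                          # (8)
    j = min(range(len(alpha)), key=lambda i: abs(alpha[i]))  # (9)
    if parity:
        hd[j] ^= 1                                           # (10)
    return hd
```

For example, `decode_spc([1.0, -2.0, 0.3, 4.0])` finds odd parity and flips the bit at index 2, the least reliable position.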

These special nodes can also be merged together. For example, in polar codes constructed as in [10], a repetition node is often followed by an SPC node of the same length. A Rep-SPC node can decode both in a single clock cycle by performing the two operations in parallel. Further node merging techniques are applied in [3] to generate new nodes, which are listed in Table I.

2) Decoder Architecture: The Fast-SSC architecture described in [3] contains separate memory units for the channel LLR values, the intrinsic LLR values α, the partial sums β, the decoding instructions, and the final codeword. Words from the channel, α, and β memories are routed to an ALU, where the operations listed in Table I are performed. The left operation (F), right operation (G), and combine operation (C) are adopted from the SC decoding of [1]. The notations 0, 1, and R denote child nodes of rate 0, rate 1, and rate R (0 < R < 1), respectively. Operations with the P- prefix are performed without explicitly visiting the right child node. The routing of the memory and the configuration of the datapath are controlled by an instruction list compiled offline. The datapath of the Fast-SSC decoder is depicted in Fig. 2.

The execution of a single instruction is called a step, and each step may take one or more clock cycles, depending on the tree stage and the number of physical processing units dedicated to the operation. If Pe is too small, performing one step may take many clock cycles; if Pe is too large, the resource utilization of the datapath becomes inefficient. In [3], for polar codes of length 2^15, Pe is chosen as 2^8. Accordingly, for a polar code of length 1024, Pe = 2^6 is reasonable. The SC tree of a PC(1024, 512) with the nodes of Table I is presented in Fig. 3. The code is constructed targeting a noise variance of σ² = 0.5625; the decoder executes a total

TABLE I
INSTRUCTION LIST FOR THE FAST-SSC ARCHITECTURE OF [3]

Name     | Operation
F        | Operation in (2).
G        | Operation in (3).
G-0R     | Operation in (3), but with β^l = 0.
C        | Combine β^l and β^r.
C-0R     | Combine β^l and β^r, with β^l = 0.
P-R1     | Calculate hard decision with β^r = 1.
P-01     | Calculate hard decision with β^l = 0 and β^r = 1.
P-RSPC   | Calculate hard decision with the right child being an SPC node.
P-0SPC   | Same as P-RSPC, but with β^l = 0.
ML       | Exhaustive-search ML decoding.
Rep      | Calculate hard decision with (6).
Rep-SPC  | Calculate hard decision merging Rep and SPC nodes.

Fig. 2. Fast-SSC datapath architecture from [3].

Fig. 3. PC(1024, 512) tree with special nodes from Table I. Green circles are Rep nodes, yellow circles are SPCs, blue circles are ML nodes, white are rate-0, black are rate-1, gray are rate-R.

Fig. 4. Memory architecture (for both α and β) for a polar code of length N = 1024, with Pe = 64. Nv is the node size at stage S; the gray area is unused.

number of 209 steps in 266 clock cycles to decode a single frame.

3) Memory: In the SC tree there are log2(N) + 1 stages, with the highest stage log2(1024) = 10 corresponding to the root node, and the lowest stage log2(1) = 0 to the leaves. For each stage of the SC tree, both α and β are stored in designated memory modules. The α memory comprises two banks, each of which is Pe LLRs wide. The partial-sum β memory is composed of two memory units, each of which has two banks, and each bank is Pe bits wide. A full word is defined as the content stored at one address of a memory unit. At each clock cycle, a full word can be read from each memory. The F and G modules can process 2 × Pe α values at once to generate Pe α outputs, so a half word is written back to the α memory. The C unit takes a half word from each β memory bank to produce a full word that is written to either of the banks. Consequently, a half word can be read from each β memory, and a full word can be written back to one β memory.
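The impact of this word organization on utilization can be reproduced with a short model of the α memory (a sketch under our reading of the architecture: one word holds 2·Pe LLRs, stages S = 2 up to log2(N) − 1 are stored, and, with the packing proposed in Section III, all stages with node size at most Pe share a single word; the function name is ours):

```python
def alpha_mem_utilization(N, Pe, packed):
    """Fraction of allocated alpha-memory entries actually holding LLRs."""
    word = 2 * Pe                      # one memory word holds 2*Pe LLRs
    top = N.bit_length() - 2           # highest internal stage, log2(N) - 1
    # LLRs that must be stored: one per node index at each stage 2..top.
    used = sum(2 ** s for s in range(2, top + 1))
    if packed:
        # Proposed scheme: all LO-stages (node size <= Pe) share one word.
        allocated = word
        allocated += sum(max(1, 2 ** s // word) * word
                         for s in range(2, top + 1) if 2 ** s > Pe)
    else:
        # Baseline of [3]: at least one full word per stage.
        allocated = sum(max(1, 2 ** s // word) * word
                        for s in range(2, top + 1))
    return used / allocated
```

With N = 1024, this model reproduces the figures quoted in the introduction: about 66% for Pe = 2^6 and 44% for Pe = 2^7 without packing, and 99.61% with packing at Pe = 2^6.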

In both the α and β memories, at least one word is reserved for each stage of the tree. Fig. 4 represents the memory architecture of both the α and β memories for a PC(1024, K) code. The root node (S = 10) is stored explicitly in the channel memory, because it has a different quantization scheme than the internal LLRs. Due to the special node decoding techniques mentioned in Section II-B, the lowest decoded stage is S = 2, so no memory is required below stage S = 2.

III. REDUCED-MEMORY & HIGH-THROUGHPUT FAST-SSC IMPLEMENTATION

In this work, we propose a new Fast-SSC decoder architecture with maximized memory utilization and throughput enhanced through instruction merging. For each decoding tree stage, one or more words are reserved in both the α and β memories. The word size is determined by the number of processing elements Pe, which acts as a parallelization factor. Pe can also be interpreted as a threshold on the SC tree (dashed line in Fig. 3), where the node size Nv = Pe. Above this threshold (HI-stage), each stage of the tree fully utilizes the memory words dedicated to that level; below the threshold (LO-stage), only a portion of the memory is used. We maximize the memory utilization by storing all the variables relative to the LO-stages in a single memory word, in

Fig. 5. Content of the last memory word with Pe = 64: segments of Pe, Pe/2, Pe/4, Pe/8, and Pe/16 entries hold stages S = 6, 5, 4, 3, and 2, followed by a final Pe/16-entry segment.

TABLE II
NUMBER OF SAVED CYCLES MERGING DIFFERENT INSTRUCTIONS. ROW INSTRUCTIONS ARE HEAD, AND COLUMN INSTRUCTIONS ARE TAIL INSTRUCTIONS.

Head \ Tail | F  | G | G-0R | C | C-0R
F           | 22 | 0 | 4    | 0 | 0
G           | 12 | 0 | 0    | 0 | 0
G-0R        | 2  | 0 | 6    | 0 | 0
C           | 0  | 5 | 0    | 7 | 0
C-0R        | 0  | 5 | 0    | 0 | 6

both the α and β memory units. This maximizes memory utilization, as shown in Fig. 5, which depicts the content of the last memory word. A similar approach was presented in [11], where the variables of each memory unit were stacked into a last word.

With these modifications to the memory structure, when a node at a LO-stage is to be processed, the entire last word is fetched from memory. Since the content of that word relates to multiple stages, multiple instructions can be performed per clock cycle. In [3], the critical path is determined by the SPC node. Compared to the SPC node hardware, modules such as F, G, and C introduce a significantly lower delay, which provides the opportunity to merge them without increasing the system critical path. The potential throughput improvement obtainable by merging different instructions is examined in Table II. We take the operations F, G, G-0R, C, and C-0R from Fast-SSC and investigate various combinations of them. To maintain the maximum operating frequency, the critical path delay of the original architecture should be preserved as much as possible. Our studies show that up to three identical instructions, or two different instructions, can be merged with minimal effect on the critical path delay.

The saved clock cycles in Table II are counted independently: not all the merges can be exploited at the same time. For example, if a sequence such as G-F-F is present at a LO-stage, the potential saved clock cycles count both the (GF)(F) and the (G)(F2) schemes, but only one of them can be implemented. To minimize the number of conflicts between different instruction merging schemes, we developed a merging algorithm based on the following observations:

1) GF with F: When a series of F instructions is observed in the original instruction list, they are merged starting from the tail F instruction, in order to minimize conflicts with the possible merging of a GF instruction.

2) CG with C (C0G with C0): The merging of consecutive C (C0) instructions begins from the head instruction towards the tail instruction, to maximize the chances of merging

TABLE III
ADDITIONAL INSTRUCTIONS FOR THE NEW FAST-SSC ARCHITECTURE.

Name           | Operation
F2, F3         | Up to three consecutive F operations.
G0R-2, G0R-3   | Up to three consecutive G-0R operations.
C2, C3         | Up to three consecutive C operations.
C0R-2, C0R-3   | Up to three consecutive C-0R operations.
GF             | G instruction followed by an F instruction.
FG0            | F instruction followed by a G-0R instruction.
CG             | C instruction followed by a G instruction.
C0G            | C-0R instruction followed by a G instruction.

with a G instruction that follows it, leading to a CG (C0G) instruction.

3) FG0 with G0 and F: Merging of consecutive G0 instructions starts from the tail G0 instruction towards the head instruction, to minimize the conflict with a potential FG0 instruction. To further minimize conflicts, F instructions should be merged starting from the head instruction; however, this would contradict the choice made to maximize the number of GF instructions. Since GF occurs much more often than FG0, F instructions are merged starting from the tail F instruction.

The proposed algorithm takes the original Fast-SSC instructions of Table I and generates an instruction set based on Tables I and III. For PC(1024, 512), the potential throughput improvement based on the schemes of Table II is 35.0%, while an actual improvement of 24.9% is achievable with our algorithm. A comprehensive study of the potential and actual throughput improvement of our instruction merging technique was carried out to observe the relationship between speedup and Pe, codeword length, and rate. Results are presented in Table IV. The speedup is largely independent of the code length, while the throughput gain is highest for rates between 1/3 and 2/3. Regardless of the code length, the throughput improvement is directly proportional to Pe.

The modified architecture must support all previously existing instructions, as well as the new instructions listed in Table III. First of all, instructions resulting from the merging of different instructions (e.g., GF) are only relevant to LO-stage nodes. Modules executing multiple instructions of the same type (e.g., FF) must support single HI-stage operations, as well as single and multiple LO-stage operations.
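The merging pass can be sketched as a single greedy scan over the offline instruction list (a simplified model that applies the two-instruction merges of Table III first and then collapses runs of up to three identical operations; it omits the head/tail ordering refinements of observations 1–3, and all names are ours):

```python
# Two-instruction merges from Table III, as (head, tail) pairs, and the
# operations that may be merged as runs of up to three identical instances.
PAIR_MERGES = {("G", "F"), ("F", "G-0R"), ("C", "G"), ("C-0R", "G")}
RUN_MERGEABLE = {"F", "G-0R", "C", "C-0R"}

def merge_instructions(program):
    """Greedily fuse instructions; each merged group costs one step."""
    merged, i = [], 0
    while i < len(program):
        op = program[i]
        if i + 1 < len(program) and (op, program[i + 1]) in PAIR_MERGES:
            merged.append(op + "+" + program[i + 1])   # e.g. GF, CG, C0G
            i += 2
        elif op in RUN_MERGEABLE:
            run = 1
            while run < 3 and i + run < len(program) and program[i + run] == op:
                run += 1                               # up to three identical ops
            merged.append(op if run == 1 else f"{op}x{run}")
            i += run
        else:
            merged.append(op)                          # not mergeable (e.g. SPC)
            i += 1
    return merged
```

For instance, the sequence G, F, F, F collapses from four steps to two (a GF followed by an F2), saving two clock cycles.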
Moreover, when the output of an instruction is immediately used by the following instruction and is never used by any other instruction afterwards, it need not be stored in memory if the two instructions are merged. Take a G0R-3 instruction as an example. The output of the first G-0R operation is used only by the following G-0R operation, and is never used again. Similarly, the output of the second G-0R operation is only used by the third one. Consequently, only the output of the last sub-instruction is saved. However, this does

TABLE IV
IMPROVEMENT ON T/P VS. POLAR CODES OF VARIOUS LENGTH, RATE AND Pe. LATENCY AND SAVINGS ARE EXPRESSED IN NUMBER OF CLOCK CYCLES.

Polar Code     | R   | Pe  | Original Latency | Pot. Savings | Actual Savings | T/P Improv.
(1024, 336)    | 1/3 | 64  | 247  | 66   | 44   | 21.67%
(1024, 512)    | 1/2 | 64  | 266  | 69   | 53   | 24.88%
(1024, 672)    | 2/3 | 64  | 240  | 59   | 43   | 21.83%
(1024, 853)    | 5/6 | 64  | 174  | 39   | 27   | 18.37%
(2048, 672)    | 1/3 | 64  | 478  | 103  | 77   | 19.20%
(2048, 1024)   | 1/2 | 64  | 512  | 113  | 81   | 18.79%
(2048, 1344)   | 2/3 | 64  | 475  | 98   | 73   | 18.16%
(2048, 1706)   | 5/6 | 64  | 341  | 58   | 46   | 15.59%
(8192, 2688)   | 1/3 | 128 | 1439 | 364  | 276  | 23.73%
(8192, 4096)   | 1/2 | 128 | 1527 | 379  | 285  | 22.95%
(8192, 5376)   | 2/3 | 128 | 1413 | 343  | 259  | 22.44%
(8192, 6824)   | 5/6 | 128 | 990  | 208  | 160  | 19.28%
(32768, 10752) | 1/3 | 256 | 4627 | 1361 | 1065 | 29.90%
(32768, 16384) | 1/2 | 256 | 4787 | 1337 | 1072 | 28.86%
(32768, 21504) | 2/3 | 256 | 4171 | 1091 | 877  | 26.62%
(32768, 27296) | 5/6 | 256 | 2658 | 630  | 504  | 23.40%

TABLE V
DETAILS OF DATA STORING REQUIREMENTS AND MAXIMUM NUMBER OF INPUTS FOR MERGED INSTRUCTIONS.

Merged Instruction | Storage of intermediate data | Max. number of inputs
F*                 | Yes | Pe
G0*                | No  | 2 × Pe
C/C0*              | No  | Pe/2
GF                 | Yes | Pe
FG0                | No  | 2 × Pe
CG/C0G             | Yes | Pe/2
*Multiple-operation instructions from Table III.

not apply to all merged instructions: for example, the output of each F operation in a merged F instruction is used by another instruction in the sequence, so all outputs must be stored. By avoiding the storage of intermediate values that will not be used by future instructions, it is possible to save memory bandwidth and to increase the number of parallel operations that merged instructions can perform. The complete list of intermediate data storage choices and of the maximum parallel operations for merged instructions is given in Table V.

Fig. 6 presents the new architecture, which supports both single and multiple operations (from Tables I and III). The original critical path (in both Fig. 2 and Fig. 6) lies on the path G–SPC–C. To support multiple instructions in a single cycle, the new G0 and C modules require more hardware complexity than the original G and C modules. Consequently, to avoid lengthening the critical path, modules including G and C operations are instantiated separately, and the old G and C modules are only used for the P-prefixed operations of Table I. The modules F, G0, and C/C0 support single instructions at HI-stage, as well as single and multiple instructions at LO-stage. The double-instruction modules GF, FG0, and CG are used only at LO-stage. The CG module performs both the CG and C0G operations. The remaining modules are unchanged.

Fig. 6. Proposed Fast-SSC datapath that supports multiple instructions at LO-stage while maintaining compatibility with regular Fast-SSC instructions.

IV. FPGA IMPLEMENTATION RESULTS

The proposed decoder architecture has been coded in VHDL within the Altera Quartus II environment. Logic synthesis and mapping targeted an Altera Stratix IV EP4SGX530KH40C2 FPGA device. Testing and validation with random test vectors have been performed in the Mentor Graphics ModelSim environment. A quantization scheme Q(6, 4, 1) has been used, where, in Q(Qi, Qc, Qf), Qi and Qc are the quantization bit widths of the internal and channel LLRs, respectively, and Qf is the number of fraction bits for both. Fig. 7 depicts the error-correction performance of the proposed decoder in terms of bit error rate (BER) and frame error rate (FER). The selected quantization values result in a 0.03 dB loss at FER = 10^−4 compared to floating-point precision. The introduced modifications do not cause any degradation with respect to [3].

Table VI presents the FPGA implementation results of our architecture for PC(1024, 512). Results for the original Fast-SSC architecture and for a recent SC-based pipelined combinational decoder [12] are also provided for comparison. Compared to Fast-SSC [3], the compression of the α and β memories into a single word for the decoding stages below the parallelization threshold results in a 22.2% memory reduction when Pe = 64. This percentage rises as Pe increases. The proposed approach requires additional routing and selection logic that slightly lengthens the critical path, leading to a lower maximum frequency. At the same time, the merging of instructions reduces the required number of
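The Q(Qi, Qc, Qf) scheme amounts to saturating fixed-point rounding of the LLRs; a minimal sketch (`quantize_llr` is our illustrative helper, not part of the decoder's RTL):

```python
def quantize_llr(x, total_bits, frac_bits):
    """Round x onto a two's-complement fixed-point grid and saturate."""
    step = 2.0 ** -frac_bits                  # resolution: 0.5 when Qf = 1
    lo = -(2 ** (total_bits - 1)) * step      # most negative representable value
    hi = (2 ** (total_bits - 1) - 1) * step   # most positive representable value
    return min(max(round(x / step) * step, lo), hi)
```

With the channel quantization Qc = 4 and Qf = 1, the representable range is [−4.0, 3.5] in steps of 0.5; with the internal quantization Qi = 6 it widens to [−16.0, 15.5].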

Fig. 7. Floating-point and fixed-point FER and BER performance for Fast-SSC decoding of PC(1024, 512), plotted against Eb/N0 [dB]. Curves: [3] floating point, [3] Q(6, 4, 1), and this work Q(6, 4, 1). The code is optimized for Eb/N0 = 2.5 dB.

TABLE VII
FPGA IMPLEMENTATION RESULTS COMPARISON WITH DIFFERENT Pe.

            |           [3]          |   This Work
Pe          |    64 |   128 |   256 |    64 |   128
LUTs        |  6126 | 10684 | 20947 | 14300 | 29828
Registers   |  1223 |  2347 |  4562 |  1216 |  2332
RAM (bits)  | 23592 | 30504 | 46376 | 18350 | 18356
fmax (MHz)  | 99.76 | 101.2 | 101.2 | 89.57 | 80.68
# of clocks |   266 |   229 |   217 |   213 |   170
T/P (Mbps)  | 384.0 | 452.5 | 477.5 | 430.6 | 486.0

TABLE VI
FPGA IMPLEMENTATION RESULTS COMPARISON.

                  |    [12]    |    [3]     | This Work
PC(N, K)          | (1024,512) | (1024,512) | (1024,512)
Quantization      | 5 bits     | Q(6, 4, 1) | Q(6, 4, 1)
Pe                | —          | 64         | 64
LUTs              | 190127     | 6126       | 14300
Registers         | 22928      | 1223       | 1216
RAM (bits)        | 13312      | 23592      | 18350
Instruction size  | N/A        | 5 bits     | 6 bits
fmax (MHz)        | —          | 99.76      | 89.57
# of instructions | N/A        | 209        | 157
# of clock cycles | —          | 266        | 214
T/P (Mbps)        | 1240       | 384.0      | 428.6

clock cycles to decode one frame by 19.92%, resulting in an overall 11.6% throughput improvement. As shown in Table VII, our method provides a throughput gain comparable to doubling Pe in the architecture of [3]. At similar throughputs, our method requires ≈1.43× the LUTs of [3]. On the other hand, our approach halves the required registers, while the architecture in [3] has 1.66× and 2.53× the RAM requirements, which in the proposed architecture are independent of Pe. The work in [12] proposes a pipelined combinational SC decoder for PC(1024, 512). Its implementation results yield 1.9× more throughput and 27% less memory than our work. On the other hand, the hardware complexity of our decoder is an order of magnitude smaller than that of [12]: the number of required LUTs is ≈13× lower, and the number of registers ≈19× lower.

V. CONCLUSIONS

In this work, we have proposed an improved Fast-SSC decoder architecture for polar codes. The α and β memory words

related to the lower stages of the decoding tree are compressed into a single word: this enables the merging of decoding instructions. We developed an algorithm to select which operations to merge in order to maximize the throughput gain, and we designed and implemented on FPGA a decoder architecture that supports the new merged instructions. For a PC(1024, 512) code and 64 processing elements, the memory requirements are reduced by 22.2%, and the throughput is increased by 11.6%. Comparing equal throughputs, the memory requirements are reduced by up to 60.4%.

ACKNOWLEDGMENTS

The authors would like to thank Seyyed Ali Hashemi, Gabriele Coppolino, Arash Ardakani, and Harsh Aurora for valuable comments and discussions.

REFERENCES

[1] E. Arıkan, "Channel polarization: A method for constructing capacity-achieving codes for symmetric binary-input memoryless channels," IEEE Transactions on Information Theory, vol. 55, no. 7, pp. 3051–3073, July 2009.
[2] 3GPP TSG RAN WG1 meeting no. 87, "Chairman's notes of agenda item 7.1.5, channel coding and modulation," October 2016.
[3] G. Sarkis, P. Giard, A. Vardy, C. Thibeault, and W. J. Gross, "Fast polar decoders: Algorithm and implementation," IEEE Journal on Selected Areas in Communications, vol. 32, no. 5, pp. 946–957, May 2014.
[4] A. Pamuk and E. Arıkan, "A two phase successive cancellation decoder architecture for polar codes," in 2013 IEEE International Symposium on Information Theory, July 2013, pp. 957–961.
[5] P. Giard, G. Sarkis, C. Thibeault, and W. J. Gross, "Multi-mode unrolled architectures for polar decoders," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 63, no. 9, pp. 1443–1453, Sept 2016.
[6] P. Giard, A. Balatsoukas-Stimming, G. Sarkis, C. Thibeault, and W. J. Gross, "Fast low-complexity decoders for low-rate polar codes," Journal of Signal Processing Systems, Springer, March 2016, to appear.
[7] G. Sarkis, P. Giard, A. Vardy, C. Thibeault, and W. J. Gross, "Fast list decoders for polar codes," IEEE Journal on Selected Areas in Communications, vol. 34, no. 2, pp. 318–328, Feb 2016.
[8] S. A. Hashemi, C. Condo, and W. J. Gross, "A fast polar code list decoder architecture based on sphere decoding," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 63, no. 12, pp. 2368–2380, Dec 2016.
[9] A. Alamdar-Yazdi and F. R. Kschischang, "A simplified successive-cancellation decoder for polar codes," IEEE Communications Letters, vol. 15, no. 12, pp. 1378–1380, December 2011.
[10] I. Tal and A. Vardy, "How to construct polar codes," IEEE Transactions on Information Theory, vol. 59, no. 10, pp. 6562–6582, Oct 2013.
[11] S. A. Hashemi, C. Condo, F. Ercan, and W. J. Gross, "Memory-efficient polar decoders," submitted to IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2017.
[12] O. Dizdar and E. Arıkan, "A high-throughput energy-efficient implementation of successive cancellation decoder for polar codes using combinational logic," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 63, no. 3, pp. 436–447, March 2016.
