High Speed QPP Generator with Optimized Parallel Architecture for 4G LTE-A System

Shuai Wang (1), Lihua Liu (2), Zhigang Wen (3)
(1,2,3) School of Electronic Engineering, Beijing University of Posts and Telecommunications, Hai Dian District, Beijing, P.R. China, 100876
(1) [email protected], (2) [email protected], (3) [email protected]

Abstract

In 4G Long Term Evolution Advanced (LTE-A), the contention-free quadratic permutation polynomial (QPP) internal interleaver is selected for parallel turbo decoding. However, existing hardware interleavers are still complicated and slow, and are not refined for LTE-A. The goal of this study is therefore to present an optimized implementation of a highly parallel QPP generator with higher processing speed and smaller area. To achieve that goal, pipelining and a recursive algorithm are applied. The article first analyzes the QPP formula and introduces a parallel interleaving algorithm which calculates the memory access address and the sub-block indexes. Then an efficient hardware interleaver is proposed, built on three implementation points. Finally, a performance analysis evaluates the proposed scheme. The results show that the proposed method compares favorably with other QPP generators: it costs fewer hardware slices on a Xilinx Virtex-6 field programmable gate array (FPGA) while reaching a maximum clock frequency of 284 MHz. Furthermore, the proposed 32-parallel QPP generator reaches 500 MHz in 65 nm CMOS with an area of 0.012 mm², which can meet the requirement of a Gbps turbo decoder for the 4G LTE-A standard.

Keywords: LTE-A Channel Coding, Parallel Turbo Decoding, QPP Interleaver, Sub-Block Index, High Speed and Small Area.

1. Introduction

With the fourth generation (4G) of mobile broadband communication approaching, 3GPP has submitted its new technical specification, LTE-A, to meet the International Mobile Telecommunications Advanced (IMT-Advanced) standard. One crucial feature of LTE-A is a much higher throughput, above 1 Gbps. To meet that speed target, the equipment in the physical layer (L1) must be adjusted accordingly. One important procedure of the LTE-A physical layer is channel coding. Since turbo coding is known to approach the Shannon limit through iterative decoding [1], it is adopted by 3GPP LTE-A. To reach Gbps throughput, however, LTE-A must apply a parallel turbo decoder structure, which demands correspondingly parallel interleaving.

Against this background, the paper discusses how to design an LTE-A internal interleaver and presents an optimized implementation. Two fundamental questions are addressed:

• Why does LTE-A choose the QPP interleaver for turbo codes?
• How can a parallel QPP generator be implemented with a better trade-off between hardware consumption and processing speed?

In practice, in communication systems such as High Speed Downlink Packet Access (HSDPA), parallel and interleaved memory access leads to an interleaver bottleneck caused by access contentions, which eventually results in rather inefficient implementations [2]. The QPP interleaver, however, is a contention-free interleaver that alleviates the bottleneck and therefore significantly improves the channel decoding throughput [3]. This is the answer to the first question, and it is briefly introduced in the second section. We then present our own work, whose main contribution lies in hardware optimization. In the proposed design, high throughput is guaranteed by pipelining, and recursion is used to simplify the hardware structure.

The features of our new architecture are listed as follows:

International Journal of Advancements in Computing Technology(IJACT) Volume4, Number23,December 2012 doi: 10.4156/ijact.vol4.issue23.42

• A processing speed of 500 MHz, sufficient for almost all present turbo decoders, which can meet the demanding 4G standard.
• A core area of 0.012 mm², which helps the turbo decoder save valuable silicon area.
• Reliable circuit behavior, verified by FPGA board tests and ASIC post-route simulation.

The remaining content is organized as follows. The second section reviews the architecture of a parallel turbo decoder and the contention-free property of the 3GPP internal interleaver, and then introduces an efficient algorithm for calculating block indexes [4]. Section 3 presents the optimized architecture, which applies pipelining and recursion. The FPGA and ASIC implementation results are both provided in Section 4. Finally, the fifth section concludes the article.

2. Parallel Interleaving in a Turbo Decoder

2.1. Parallel turbo decoder

According to 3GPP TS 36.212 [5], LTE-A adopts turbo codes based on the QPP internal interleaver. The coding rate is 1/3, and the block length K ranges from 40 to 6144, with 188 different values in total. Since the two soft-in soft-out (SISO) decoders and the internal interleaver must complete 5 to 8 iterations, the conventional serial decoder architecture cannot meet the 4G standard, and processing speed becomes the key problem of a turbo decoder. To improve the throughput, Studer et al. [2] apply a frame-splitting algorithm to design a turbo decoder with a degree of parallelism of 8. A common parallel decoder structure is depicted in Fig. 1 (a) [6][7]; it selects max-log maximum a posteriori (MAP) as its SISO algorithm. Fig. 1 (b) presents our MATLAB simulation (BER curves) of Fig. 1 (a), which shows that the bit error rate of a parallel turbo decoder falls below 10^-4 when SNR ≥ 1 dB, without losing much performance compared to the serial one.

In Fig. 1 (a), the received data is divided into N sub-blocks (N is the degree of parallelism) and the interleaver has to cooperate with this partition. Unfortunately, in a turbo decoder with degree of parallelism N, providing interleaved memory access at the high bandwidth required by N parallel MAP units is a challenging task. Research demonstrates that an efficient interleaver for a parallel structure should satisfy the contention-free (CF) property [3], which can be described by formula (1):

$$\left\lfloor \frac{\Pi(i + tL)}{L} \right\rfloor \neq \left\lfloor \frac{\Pi(i + vL)}{L} \right\rfloor, \quad 0 \le t < v < N \tag{1}$$

Here ⌊·⌋ denotes rounding down, Π(·) is the interleaving function, and L is the length of a sub-block, L = K/N.

Figure 1 (a). Structure of a parallel turbo decoder

Figure 1 (b). BER performance in AWGN channel for code block length 4096, max-log-MAP turbo decoder using QPP internal interleaver with iteration 5

2.2. 3GPP LTE-A interleaver algorithm

3GPP TS 36.212 [5] gives the relationship between the output index i and the input index Π(i), which satisfies the following quadratic form (2):

$$\Pi(i) = (f_1 \cdot i + f_2 \cdot i^2) \bmod K \tag{2}$$

Here the parameters f1 and f2 depend on the block size K and are summarized in Table 5.1.3-3 of [5]. Using formula (2), one can prove that the QPP function satisfies (1), which indicates its high efficiency for parallel interleaving [8]. Since our design must calculate the addresses of N sub-blocks, we rewrite (2) as formula (3) to obtain the i-th interleaved address of block s:

$$\begin{aligned}
\Pi(i,s) &= \Pi(i + sL) = \left( f_1 (i + sL) + f_2 (i + sL)^2 \right) \bmod K \\
&= \left( (f_1 i + f_2 i^2) + (f_1 L s + f_2 L^2 s^2) + (2 f_2 L i s) \right) \bmod K
\end{aligned} \tag{3}$$
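Formulas (1) to (3) can be checked numerically. The following Python sketch is an illustration only: the function names are not from the paper, and the (f1, f2) pairs for K = 40 and K = 6144 are quoted from Table 5.1.3-3 of [5]. It evaluates the QPP function, confirms that the expanded form of (3) agrees with evaluating (2) at i + sL, and tests the contention-free property by brute force.

```python
def qpp(i, K, f1, f2):
    """Formula (2): Pi(i) = (f1*i + f2*i^2) mod K."""
    return (f1 * i + f2 * i * i) % K


def qpp_block(i, s, K, N, f1, f2):
    """Formula (3), expanded form: i-th interleaved address of block s."""
    L = K // N
    return ((f1 * i + f2 * i * i)
            + (f1 * L * s + f2 * L * L * s * s)
            + 2 * f2 * L * i * s) % K


def is_contention_free(K, N, f1, f2):
    """Formula (1): for each offset i, the N parallel accesses must hit
    N different banks, i.e. floor(Pi(i + t*L) / L) differs for every t."""
    L = K // N
    return all(
        len({qpp(i + t * L, K, f1, f2) // L for t in range(N)}) == N
        for i in range(L)
    )


# Largest LTE block size, split into N = 32 sub-blocks.
K, N, f1, f2 = 6144, 32, 263, 480
L = K // N
expanded_ok = all(qpp_block(i, s, K, N, f1, f2) == qpp(i + s * L, K, f1, f2)
                  for i in range(L) for s in range(N))
print("expanded form of (3) matches (2):", expanded_ok)
print("contention-free, K=6144, N=32:", is_contention_free(K, N, f1, f2))
print("contention-free, K=40, N=8:", is_contention_free(40, 8, 3, 10))
```

The brute-force check is only practical because K is at most 6144; it is meant as a sanity test of the formulas, not as a model of the hardware.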

S. G. Lee's work [9] gives an implementation that calculates Π(i,s) directly. Its degree of parallelism is set to 8, which costs 8 basic recursive units. However, the RAM access address still needs to be derived from Π(i,s). In our design, we instead provide one RAM access address and N bank indexes, as in formulas (4) and (5). The QPP property implies that all the blocks share the same column address, given by (4) [4]:

$$A(i) = (f_1 \cdot i + f_2 \cdot i^2) \bmod L \tag{4}$$

Here A(i) represents the RAM access address. Using the properties of the modulo operation to simplify (3), we further deduce the i-th index of block s, I(i,s), as follows [4]:

$$\begin{cases}
I(i,s) = \left\lfloor \dfrac{\Pi(i,s)}{L} \right\rfloor = \left( I(i,0) + I(0,s) + C(i,s) \right) \bmod N \\[2mm]
I(i,0) = \left\lfloor \dfrac{(f_1 i + f_2 i^2) \bmod K}{L} \right\rfloor \\[2mm]
I(0,s) = (f_1 s + f_2 L s^2) \bmod N \\[1mm]
C(i,s) = 2 f_2 i s \bmod N
\end{cases} \tag{5}$$

Formula (5) indicates that the block indexes I(i,s) can be derived from the index of block 0, I(i,0), the initial term I(0,s) of each block, and the correction term C(i,s). The FPGA results in Section 4 demonstrate the advantage of this method.
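The decomposition in formulas (4) and (5) can be cross-checked against a direct evaluation of Π(i + sL). The sketch below is a software illustration with hypothetical function names (`direct`, `decomposed`); it uses the (f1, f2) = (263, 480) pair for K = 6144 from Table 5.1.3-3 of [5] and says nothing about how the hardware schedules these operations.

```python
def qpp(i, K, f1, f2):
    """Formula (2): Pi(i) = (f1*i + f2*i^2) mod K."""
    return (f1 * i + f2 * i * i) % K


def direct(i, s, K, N, f1, f2):
    """Bank index and in-bank address taken directly from Pi(i + s*L)."""
    L = K // N
    return divmod(qpp(i + s * L, K, f1, f2), L)


def decomposed(i, s, K, N, f1, f2):
    """Bank index from formula (5) and shared address from formula (4)."""
    L = K // N
    pi_i = qpp(i, K, f1, f2)
    a = pi_i % L                              # A(i), identical for every block s
    index_i0 = pi_i // L                      # I(i,0)
    index_0s = (f1 * s + f2 * L * s * s) % N  # I(0,s)
    correction = (2 * f2 * i * s) % N         # C(i,s)
    return (index_i0 + index_0s + correction) % N, a


K, N, f1, f2 = 6144, 32, 263, 480
L = K // N
match = all(direct(i, s, K, N, f1, f2) == decomposed(i, s, K, N, f1, f2)
            for i in range(L) for s in range(N))
print("formulas (4) and (5) reproduce Pi(i + sL):", match)
```

The point of the decomposition is visible in `decomposed`: only `index_i0` and `a` depend on a division by L, and both are computed once per cycle rather than once per block.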

3. An Efficient Hardware Interleaver for LTE-A

The proposed implementation rests on three points: a compare-select (CS) network, recursion of the initial values, and a parallel block-index generator.

3.1. Pipelined compare-select (CS) network

To obtain the value of I(i,0), the conventional design advises using the formula (Π(i) − a(i)) × (2^13/L) / 2^13 [4]. As a result, the values of f1+f2 mod L, 2f2 mod L and 2^13/L must be pre-stored in LUTs. Moreover, it contains complicated hardware units that must all compute within a single pipeline stage (one 13×6-bit multiplier, one memory access a(i) generator, one Π(i) generator and a subtractor), which makes the path delay relatively large. Another way to obtain I(i,0) is to compare against N thresholds and multiplex the result [10], but its efficiency is low. S. J. Wang's work [11] introduces a parallel address generator for N≤8. It applies recursion to avoid division. But as the degree of parallelism increases, the number of pre-stored values grows and costs more LUTs. At the same time, its working frequency is modest and further implementation of the circuit raises several complicated problems.

Our design uses binary search to locate the block index, which requires log2N pipeline stages of CS units. In stage p, the CS unit determines whether its input is above or below K/2^p, where p is the stage number, ranging from 1 to log2N. The compare-select network is shown in Fig. 2. Each pipeline stage contains a comparator, a subtractor, and a multiplexer. In this way the derivation of I(i,0) is done more efficiently. The drawback of this method is a latency of log2N clock cycles, which has to be absorbed in the decoding procedure. However, if lower latency is required, two or more CS units can be combined into one pipeline stage. The results of all the comparators make up the log2N-bit index I(i,0), and the multiplexer of the last stage outputs the RAM access address.

Figure 2. Compare-select network, p=1~log2N
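The binary search performed by the CS network can be sketched in software as follows. This is a behavioral illustration under the assumption that N is a power of two dividing K; the function name `cs_network` is hypothetical, and the loop iterations stand in for pipeline stages without modeling registers or timing.

```python
def cs_network(pi_i, K, N):
    """Split Pi(i) into (I(i,0), RAM address) using log2(N) compare-select stages.

    Stage p compares the running value with K / 2^p. If the value is not
    below the threshold, the stage emits a 1 bit and subtracts the threshold;
    otherwise it emits a 0 bit and passes the value through unchanged.
    """
    stages = N.bit_length() - 1          # log2(N) for a power-of-two N
    index = 0
    value = pi_i
    for p in range(1, stages + 1):
        threshold = K >> p               # K / 2^p, exact because 2^p divides K
        bit = 1 if value >= threshold else 0
        if bit:
            value -= threshold           # subtractor output selected by the mux
        index = (index << 1) | bit       # comparator results form the index bits
    return index, value                  # value is now below L = K / N


K, N = 6144, 32
L = K // N
same_as_division = all(cs_network(x, K, N) == divmod(x, L) for x in range(K))
print("CS network equals division by L:", same_as_division)
```

Each stage needs only a comparison and a conditional subtraction, which is why the network avoids the multiplier and the 2^13/L look-up table of the division-based approach.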

3.2. Derivation of initial value I(0,s)

To output the indexes of N blocks, we have to handle the initial value I(0,s) of each separate block, s = 1, 2, ..., N−1. The corresponding I(0,s) sets for all available K were calculated with MATLAB and are summarized in Table Ⅰ:

Table Ⅰ. [I(0,s), s = 1 to N−1] for 4 degrees of parallelism

| Degree of parallelism (N) | 8 | 16 | 32 | 64 |
|---|---|---|---|---|
| Available values of K | 188 | 158 | 127 | 96 |
| Number of possible sets [I(0,1), I(0,2), ..., I(0,N−1)] | 8 | 25 | 60 | 73 |

Table Ⅰ indicates that the initial values repeat regularly for lower degrees of parallelism (N≤16). In that case, we can use LUTs to map all the I(0,s) sets to the corresponding K rather than calculate them. For N=8, only 8×7×3 bits need mapping, which increases to 25×15×4 bits for N=16. For N≥32, the initial term of block s+1 can easily be derived from that of block s. Given the value for block 0, we can output I(0,s) as an incremental sequence. We therefore suggest that the design apply recursion, which avoids the two multipliers suggested in [4]. The expression for I(0,s) in (5) is rewritten as follows:

$$\begin{cases}
I(0, s+1) = \left( I(0,s) + g(s) \right) \bmod N \\
I(0,0) = 0
\end{cases} \tag{6}$$

where g(s) satisfies (7):

$$g(s) = (f_1 + f_2 L + 2 f_2 L \cdot s) \bmod N \tag{7}$$

In turn, g(s) can be derived by the recursive calculation (8):

$$\begin{cases}
g(s+1) = \left( g(s) + 2 f_2 L \right) \bmod N \\
g(0) = (f_1 + f_2 L) \bmod N
\end{cases} \tag{8}$$

However, when recursion is used for I(0,s), N clock cycles of pre-calculation must be completed before the generator starts running. This can be done during the first half of an iteration, even for a Gbps turbo decoder, and could be parallelized for higher speed. Fortunately, the modulo-N operation can be implemented simply by letting the adder overflow, because log2N is an integer (N is a power of two). The I(0,s) generator is depicted in Fig. 3.

Figure 3. Recursion for I(0,s), s=1,2…N-1
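The double recursion of formulas (6) to (8) can be sketched as two accumulators. The code below is an illustrative software model, not the circuit of Fig. 3: the function name `initial_indexes` is hypothetical, the bit mask stands in for adder overflow, and the (f1, f2) = (263, 480) pair for K = 6144 is quoted from Table 5.1.3-3 of [5].

```python
def initial_indexes(K, N, f1, f2):
    """Return [I(0,0), I(0,1), ..., I(0,N-1)] using additions only."""
    L = K // N
    mask = N - 1                     # N is a power of two: "& mask" models overflow
    step = (2 * f2 * L) & mask       # constant added to g(s) each cycle, formula (8)
    g = (f1 + f2 * L) & mask         # g(0), formula (8)
    index = 0                        # I(0,0) = 0, formula (6)
    result = [index]
    for _ in range(1, N):
        index = (index + g) & mask   # I(0,s+1) = (I(0,s) + g(s)) mod N
        g = (g + step) & mask        # g(s+1) = (g(s) + 2*f2*L) mod N
        result.append(index)
    return result


K, N, f1, f2 = 6144, 32, 263, 480
L = K // N
closed_form = [(f1 * s + f2 * L * s * s) % N for s in range(N)]
print("recursion matches closed form of I(0,s):",
      initial_indexes(K, N, f1, f2) == closed_form)
```

The loop runs N − 1 times, which corresponds to the pre-calculation cycles mentioned above; the constants `step` and g(0) are assumed to be available before the loop starts.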

3.3. Architecture design of block-index generator

The final key step concerns the correction term C(i,s). Instead of using a counter for i, N/4−1 multipliers and LUTs for the parameters (2f2s mod N), we simply use registers and adders for recursion, as depicted in Fig. 4 (left). The advantage of recursion is that the path delay becomes much smaller, because registers store the value of the previous clock cycle, and the consumed resources are fewer than for a corresponding multiplier (adder tree). Since f2 is an even number, C(i,s) is divisible by 4 and repeats regularly once s exceeds N/4 (it is periodic in s with period N/4). In the maximum case of N=64, the circuit needs to calculate only 15 parallel C(i,s) values, from C(i,1) to C(i,15). In addition, the values 2sf2 mod N can be derived by adder trees.

Following the discussion above, our parallel index generator outputs I(i,s) by adding three values: I(i,0), I(0,s) and C(i,s). The whole structure is depicted in Fig. 4 (right). The generator provides N block indexes (row addresses) and one RAM access address (column address) per clock cycle, with an optimized structure that contains only comparators, adders and registers.

Figure 4. Parallel I(i,s) and RAM access generator
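The recursion for the correction term can be sketched as one accumulator per lane. The paper does not spell out the register-level behavior, so the arrangement below (N/4 − 1 accumulators, each adding a constant 2·f2·s mod N per cycle) is one reading of the text; the function names are hypothetical, and the example uses f2 = 10, the value for K = 40 in Table 5.1.3-3 of [5].

```python
def correction_lanes(N, f2, num_cycles):
    """Return rows with rows[i][s-1] = C(i,s) for s = 1 .. N/4 - 1, using adders only."""
    mask = N - 1                          # N is a power of two
    lanes = N // 4 - 1
    # Per-lane increment 2*f2*s mod N, obtained by repeatedly adding 2*f2.
    increments, acc = [], 0
    for _ in range(lanes):
        acc = (acc + 2 * f2) & mask
        increments.append(acc)
    regs = [0] * lanes                    # C(0,s) = 0 for every s
    rows = []
    for _ in range(num_cycles):
        rows.append(list(regs))
        # One clock cycle: each register adds its own constant increment.
        regs = [(r + d) & mask for r, d in zip(regs, increments)]
    return rows


def correction_term(row, s, N):
    """C(i,s) for any s, using the period N/4 in s (valid because f2 is even)."""
    k = s % (N // 4)
    return 0 if k == 0 else row[k - 1]


N, f2, L = 8, 10, 5                       # K = 40 split into N = 8 sub-blocks
rows = correction_lanes(N, f2, L)
recursion_ok = all(correction_term(rows[i], s, N) == (2 * f2 * i * s) % N
                   for i in range(L) for s in range(N))
print("recursive C(i,s) matches 2*f2*i*s mod N:", recursion_ok)
```

For N = 64 the list `increments` has 15 entries, matching the 15 parallel values C(i,1) to C(i,15) mentioned in the text.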

4. Results and Comparison

The proposed design and T. Ilnseher's method [4] share the same algorithm for calculating the row addresses. Their differences are listed in Table Ⅱ. In general, this work costs fewer hardware resources and avoids complicated multiplication, so DSP48E1 modules are not required. We define the throughput of a QPP generator as:

$$\text{Throughput} = N \times F_{\text{clk,max}} \quad (\text{addresses/s}) \tag{9}$$
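As a small worked example of formula (9), the snippet below multiplies the degree of parallelism by the maximum clock frequencies that Table Ⅲ reports for this work. The helper name is hypothetical, and the result is expressed in millions of addresses per second because the frequencies are given in MHz.

```python
def throughput_maddr_per_s(n_parallel, fclk_max_mhz):
    """Formula (9): N addresses per cycle times the maximum clock frequency (MHz)."""
    return n_parallel * fclk_max_mhz


# (N, maximum frequency in MHz) for Arch-W, as listed in Table III.
arch_w = [(8, 259.4), (16, 284.0), (32, 240.3), (64, 215.0)]
for n, fmax in arch_w:
    print(n, round(throughput_maddr_per_s(n, fmax), 1))
```

The products come out close to the throughput row of Table Ⅲ; small differences are presumably due to rounding in the reported figures.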

4.1. FPGA Results

The FPGA implementation results of this work are shown in Table Ⅲ together with several conventional designs. T. Ilnseher's work was originally implemented in a 65 nm ASIC technology at 300 MHz (an ASIC runs much faster than an FPGA under the same conditions). To allow a fair comparison, we re-implemented it on a Virtex-6 FPGA. Comparing Arch-L [9] with the others, we find that calculating parallel indexes reduces resource usage by more than a factor of 2, which demonstrates its advantage over calculating parallel interleaved addresses. In our design, the maximum clock frequency reaches 240.3 to 284.0 MHz for degrees of parallelism from 8 to 32, better than Arch-I (109.7 to 155.8 MHz) or Arch-L (N/A to 47.2 MHz). Overall, the proposed design requires fewer hardware resources: it uses no DSP48E1s and costs fewer slice LUTs, LUT-FF pairs and occupied slices than Arch-I (with slightly more registers for N≥32 due to pipelining and recursion).

Table Ⅱ. Differences of hardware structure between this work and T. Ilnseher's method [4]

| Scheme | This work | Ilnseher's method [4] |
|---|---|---|
| Derivation of I(i,0) and RAM access | Pipelined CS network: compare-select unit × log2N; Π(i) generator × 1 | Division: 13×6-bit multiplier × 1; LUTs for 2^13/L; RAM access a(i) generator × 1; Π(i) generator × 1; 13-bit subtractor × 1 |
| Derivation of I(0,s) | Recursion: log2N-bit adder × 2; log2N-bit register × 2; LUTs for f2L mod N | Multiplication: multiplier × 3; adder × 1 |
| Derivation of C(i,s) | Recursion: (log2N−2)-bit register × (N/4−1); (log2N−2)-bit adder × (N/2−log2(N−1)−1) | Multiplication: counter for i × 1; (log2N−2)-bit multiplier × (N/4−1); LUTs for 2f2s mod N |

Table Ⅲ. FPGA implementation results and comparison to other QPP generators

Schemes: Arch-W is this work, Arch-I is Ilnseher [4], Arch-L is Lee [9].
Target device (all schemes): xc6vlx240t-1ff1156.
Method: Arch-W and Arch-I calculate the parallel index; Arch-L calculates the parallel interleaved address.
The rows from slice registers to DSP48E1s report FPGA utilization.

| Scheme | Arch-W | Arch-W | Arch-W | Arch-W | Arch-I | Arch-I | Arch-I | Arch-L |
|---|---|---|---|---|---|---|---|---|
| Degree of parallelism (N) | 64 | 32 | 16 | 8 | 32 | 16 | 8 | 8 |
| Maximum frequency (MHz) | 215.0 | 240.3 | 284.0 | 259.4 | 109.7 | 115.5 | 155.8 | 47.2 |
| Minimum period (ns) | 4.650 | 4.160 | 3.520 | 3.855 | 9.115 | 8.652 | 6.417 | 21.152 |
| Throughput (M Add/s) | 13,700 | 7,692 | 4,545 | 2,075 | 3,510 | 1,849 | 1,246 | 378 |
| Slice registers | 900 | 434 | 131 | 89 | 393 | 173 | 102 | 221 |
| Slice LUTs | 1,141 | 594 | 561 | 559 | 787 | 699 | 673 | 1,232 |
| Occupied slices | 612 | 307 | 196 | 192 | 309 | 291 | 224 | 372 |
| LUT-FF pairs | 1,495 | 744 | 564 | 573 | 901 | 758 | 695 | 1,245 |
| DSP48E1s | 0 | 0 | 0 | 0 | 3 | 3 | 2 | 0 |

The FPGA post-route simulation result of our design is shown in Fig. 5 (a). The inputs are the clock port (clk), system reset (rst_), frame length (K), initialization (init_en) and startup (calculate_en). The outputs are one RAM access address (add_ram) and the sub-block indexes (R_out). The clock frequency is set to 200 MHz, K is 6144 and N is 32. The data transferred through JTAG from the Virtex-6 board is shown in Fig. 5 (b). The clock port is connected to J9 (200 MHz). The outputs match the addresses calculated by MATLAB, so we conclude that the outputs of our design are correct.

Figure 5 (a). FPGA post-route simulation result, the clock frequency is set to 200 MHz, K is 6144 for N=32, the figure contains only part of ports from R0_out to R22_out

Figure 5 (b). Board verification, the clock is connected to J9 on Virtex-6 which is 200 MHz

4.2. ASIC Features

This section presents the ASIC implementation results. The QPP generator with a degree of parallelism of 32 is placed and routed in SMIC 65 nm CMOS using Cadence tools. The final reports show that the proposed design can stably reach 500 MHz with an area of 0.012 mm². The ASIC features are shown in Table Ⅳ. Compared to Arch-S and Arch-I, the frequency of the proposed work is increased by 66.7%, and its area is the smallest of the three. The timing report is shown in Fig. 6, which confirms that the architecture is able to work at a high speed of 500 MHz. This agrees with the preceding discussion. The circuit layout is shown in Fig. 7. The total chip including pins is a 120 µm × 120 µm square.

Table Ⅳ. ASIC features and comparison

| Scheme | Arch-S: State of the art [4] | Arch-I: Ilnseher's [4] | Arch-W: This work |
|---|---|---|---|
| Technology (nm) | 65 | 65 | 65 |
| Degree of parallelism (N) | 32 | 32 | 32 |
| Frequency (MHz) | 300 | 300 | 500 |
| Area (µm²) | 148801 | 13411 | 12116 |
| Cell count | 38636 | 2201 | 2973 |

Figure 6. The timing report of Encounter, path 1 is the largest reg-to-reg delay path

Figure 7. Circuit layout, core area 12116 µm²

5. Conclusions

This work proposes an optimized architecture for the QPP interleaver in the context of 4G LTE-A. It not only demonstrates the advantage of calculating the block indexes, but also implements degrees of parallelism from 8 to 64 for an LTE-A turbo decoder. The final results demonstrate that our work achieves the expected goal of hardware optimization: it sharply increases the processing speed and simplifies the whole structure of the internal interleaver. The design runs at a 500 MHz clock frequency, which is sufficient for nearly all present turbo decoder structures. In particular, this QPP interleaving generator can be adapted to high clock frequency applications such as Karim's work [12], which reaches 1.138 Gbps at 486 MHz. In cooperation with the master-slave network [2][11], the design can fulfill the interleaving task of LTE-A more efficiently. Performance experiments on turbo codes can be found in [13]. Future research is still required on better architectures and alternative interleavers. We will also focus on other new fields of LTE-A, such as cognitive radio (CR) [14] and coordinated multipoint (CoMP) [15].

6. Acknowledgement

The work presented in this paper was supported by the National Natural Science Foundation of China (Grant No. NSFC-61170176), the National Great Science Specific Project (2010ZX03005-001-03, 2012ZX03003006), and the Beijing Municipal Commission of Education Build Together Project.

7. References

[1] C. Berrou, A. Glavieux and P. Thitimajshima, "Near Shannon limit error-correcting coding and decoding: Turbo-codes", Proc. IEEE Int. Conf. Commun., pp. 1064-1070, May 1993.
[2] Christoph Studer, Christian Benkeser, Sandro Belfanti, "Design and Implementation of a Parallel Turbo-Decoder ASIC for 3GPP-LTE", IEEE Journal of Solid-State Circuits, vol. 46, no. 1, Jan. 2011.
[3] Ajit Nimbalker, T. Keith Blankenship, Brian Classon, "Contention-Free Interleavers for High-Throughput Turbo Decoding", IEEE Transactions on Communications, vol. 56, no. 8, Aug. 2008.
[4] Thomas Ilnseher, Matthias May, Norbert Wehn, "A Monolithic LTE Interleaver Generator for highly parallel SMAP Decoders", IEEE Wireless Telecommunications Symposium (WTS), July 2011.
[5] 3GPP TS 36.212 Release 10 V10.4.0: "Multiplexing and Channel Coding", Dec. 2011.
[6] Cheng-Chi Wong, Yung-Yu Lee, Hsie-Chia Chang, "A 188-size 2.1mm2 Reconfigurable Turbo Decoder Chip with Parallel Architecture for 3GPP LTE", IEEE 2009 Symposium on VLSI Circuits, Aug. 2009.
[7] Cheng-Chi Wong, Hsie-Chia Chang, "Reconfigurable Turbo Decoder With Parallel Architecture for 3GPP LTE System", IEEE Transactions on Circuits and Systems, vol. 57, no. 7, July 2010.
[8] Ajit Nimbalker, Yufei Blankenship, Brian Classon, "ARP and QPP Interleavers for LTE Turbo Coding", IEEE Wireless Communications and Networking Conference (WCNC), pp. 1032-1037, 2008.
[9] Shuenn-Gi Lee, Chung-Hsuan Wang, Wern-Ho Sheen, "Architecture Design of QPP Interleaver for Parallel Turbo Decoding", IEEE Vehicular Technology Conference (VTC 2010-Spring), June 2010.
[10] Zhaoyuan Liu, Hui Yu, You yun Xu, "Architecture Design and Implementation of a QPP Internal Interleaver", in Chinese, Telecommunication Engineering, Jan. 2011.
[11] Shuaijie Wang, Jun Ma, Guanghui He, "Generalized Interleaving Network Based on Configurable QPP Architecture for Parallel Turbo Decoder", IEEE Signal Processing Systems (SiPS), Oct. 2011.
[12] Karim S.M, Chakrabarti I, "Design of pipelined parallel turbo decoder using contention free interleaver", TENCON 2011 - 2011 IEEE Region 10 Conference, pp. 646-650, Nov. 21-24 2011.
[13] Dashe Li, Liu Shue, "Performance Enhancement of Foggy Atmosphere Channel Based on Turbo Code", JCIT, vol. 6, no. 4, pp. 145-151, 2011.
[14] Lei Shi, Zheng Zhou, Liang Tang, "Wideband Spectrum Sensing based on Sparse Channel State Recovery in Cognitive Radio Networks", JCIT, vol. 6, no. 2, pp. 207-215, 2011.
[15] Ralf Irmer, Heinz Droste, Patrick Marsch, "Coordinated Multipoint: Concepts, Performance, and Field Trial Results", IEEE Communications Magazine, vol. 49, no. 2, pp. 102-111, Feb. 2011.
