
IEEE COMMUNICATIONS LETTERS, VOL. 5, NO. 8, AUGUST 2001

Parallel Decoding of Turbo Codes Using Soft Output T-Algorithms

U. Dasgupta and K. R. Narayanan, Member, IEEE

Abstract—Turbo codes need to be used with reasonably long blocks of data, which invariably leads to an encoding delay (latency) and a decoding delay on the order of O(N) due to the finite computational speed of the processor. Recently it was shown that if the hardware complexity of W > 1 processors is acceptable, then the decoding delay can be reduced to roughly O(N/W) without much performance degradation, by decoding the turbo code in a parallel fashion. In this letter we show that the decoding delay can be further reduced if the component decoders use parallel versions of soft output T-algorithms instead of parallel versions of the MAP algorithm.

Index Terms—Decoding delay, iterative soft output decoding, MAP-based decoding, parallel decoding scheme, T-algorithm.

I. INTRODUCTION

ONE OF THE drawbacks of turbo codes, apart from the encoding latency due to the interleaver, is the high complexity of the iterative soft output decoding of the component codes, which directly leads to a significant amount of decoding delay. A single processor performing I decoding iterations using two MAP decoders (each decoding a component code with memory v) introduces a delay proportional to 2·I·2^v·N/F, where N is the block length and F is the processor speed in operations/second. The decoding delay can reduce the effective data rate and, hence, it is important to find techniques to decrease the complexity of the decoding algorithm or other ways to reduce the decoding delay. Two approaches to reduce the complexity of the MAP (BCJR) algorithm, the M-BCJR algorithm and the T-BCJR algorithm, were suggested in [2]. The M-BCJR algorithm reduces the BCJR algorithm complexity by retaining only the M most likely states in each trellis section. Thus, it reduces the worst-case decoding complexity of the MAP algorithm but suffers considerable performance degradation from the optimum. On the other hand, the T-BCJR algorithm [2] reduces the BCJR algorithm complexity by using a threshold T to delete states whenever the metric (or probability) of a state falls below the threshold. The T-BCJR algorithm provides near-optimum performance but fails to reduce the worst-case decoding complexity (it only reduces the average complexity). Hence, if there is a need to reduce the worst-case

Manuscript received September 20, 2000. The associate editor coordinating the review of this letter and approving it for publication was Prof. M. Fossorier. The work of K. R. Narayanan was supported in part by the National Science Foundation under Grant NSF CCR 0073506.
U. Dasgupta is with the DSPS R&D Center, Texas Instruments Incorporated, Dallas, TX 75206 USA (e-mail: [email protected]).
K. R. Narayanan is with the Department of Electrical Engineering, Texas A&M University, College Station, TX 77843 USA (e-mail: [email protected]).
Publisher Item Identifier S 1089-7798(01)07669-4.

Fig. 1. Turbo decoder using parallel architecture (W processors).

decoding complexity/delay with minimal performance degradation, then both the M-BCJR and T-BCJR algorithms fail. One way to reduce the worst-case decoding delay without much performance degradation is to use W processors. This is the approach adopted in this work. In general, the W processors could be arranged in two ways:
• Traditionally [1], a pipelined structure has been used to reduce the decoding delay. Each processor operates on the entire block for I/W iterations before passing the resulting extrinsic information to the next processor in the pipeline and operating on the next information block. Obviously, this approach can reduce the decoding delay by a factor of W without any performance degradation. This pipelined structure has been suggested in several papers.
• A much less studied structure is the parallel structure, as in Fig. 1, which was suggested in [3]. In this scheme each information block is divided into W partially overlapped sub-blocks, and each of the W processors performs all the iterations on its sub-block in parallel. This approach reduces the decoding delay by roughly a factor of W at the cost of a slight performance degradation.
This work shows that soft-output T-algorithms can provide a substantial reduction in the worst-case decoding delay, with minimal performance degradation, if used in a parallel decoding structure but not in a pipelined one. The next section discusses the parallel BCJR algorithms briefly.

II. RECEIVER STRUCTURES

The soft output decoding algorithms that were used for turbo decoding are explained in this section.
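Before detailing the algorithms, the delay trade-off among the single-processor, pipelined, and parallel arrangements described above can be sketched roughly. This is an illustrative sketch, not from the letter: delay is counted in abstract trellis operations, and all parameter values (block length N, iterations I, memory v, processors W, overlap d) are assumptions.

```python
# Hedged sketch of the worst-case decoding delays discussed above,
# assuming delay grows as 2 * I * 2**v * (symbols processed per processor).
# All parameters below are hypothetical example values.

def delay_single(N, I, v):
    # one processor, I iterations, two MAP decoders over 2**v states
    return 2 * I * (2 ** v) * N

def delay_pipelined(N, I, v, W):
    # W pipeline stages, each performing I/W iterations on the whole block
    return delay_single(N, I, v) / W

def delay_parallel(N, I, v, W, d=0):
    # each of W processors runs all I iterations on a sub-block of
    # length N/W plus the overlap d with its neighbors
    return 2 * I * (2 ** v) * (N / W + d)

N, I, v, W = 1024, 8, 4, 8
single = delay_single(N, I, v)          # → 262144
pipelined = delay_pipelined(N, I, v, W) # → 32768.0
parallel = delay_parallel(N, I, v, W, d=32)  # → 40960.0
```

Under these assumptions both multiprocessor arrangements cut the delay by roughly a factor of W, with the parallel structure paying a small extra cost for the sub-block overlap.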

1089–7798/01$10.00 © 2001 IEEE


The Parallel MAP Algorithm: In [3], the authors illustrated that by dividing a block of length N (see Fig. 1) into W overlapping sub-blocks of size N/W + d (where d is the overlap between adjacent sub-blocks), and by using a parallel implementation of the MAP algorithm on each sub-block simultaneously, near-optimum performance could be obtained. Thus their approach could provide a significant reduction in decoding delay with only a slight loss in performance.

The Parallel T-BCJR Algorithm: The T-BCJR algorithm, as proposed in [2], is similar to the BCJR algorithm but suppresses the calculation of the forward state metric α whenever it falls below a predefined threshold T. Such α's which fall below the threshold are set to 0 and thus "killed" during the forward recursion. The backward recursion is executed only for those sections of the trellis which remain "alive" after the forward pass. In the parallel implementation, the basic T-BCJR algorithm resides in all W processors and operates simultaneously on separate blocks of size N/W + d. Let the subscripts "b" and "e" denote the beginning and ending points of a sub-block. Also assume that both component encoders start from the all-zero state and that only the first encoder is terminated in the all-zero state, while the second encoder is left open. Each of the W sub-blocks is processed simultaneously by the W processors, with the forward metrics α and backward metrics β initialized in the following manner:
• First sub-block: In both decoders, α_b(s) = 1 if s = 0 and α_b(s) = 0 if s ≠ 0. Also, β_e(s) = 1/2^v for all s in both decoders.
• Middle sub-blocks: In both decoders, α_b(s) = 1/2^v and β_e(s) = 1/2^v for all s.
• Last sub-block: In both decoders, α_b(s) = 1/2^v for all s. In the first decoder, β_e(s) = 1 if s = 0 and β_e(s) = 0 if s ≠ 0, and in the second decoder β_e(s) = 1/2^v for all s.

III. SOFT OUTPUT T-ALGORITHMS IN A PARALLEL/PIPELINED STRUCTURE

When soft output T-algorithms are used in a pipelined turbo decoder, in which each of the W processors operates on the entire block for I/W iterations, they behave as follows. During the first few iterations, the "effective SNR" as seen by the component decoders is low, and most of the states are retained in order to correct most of the errors and provide near-optimum performance. In later iterations the "effective SNR" improves, and the decoder only needs to correct the few remaining errors, or to decide between a few possible codewords. Thus in later iterations a large number of states can be considered to be improbable and deleted, thereby reducing complexity. Hence the processor in which the initial iterations take place becomes a "bottleneck" in the processing. The initial iterations retain approximately all 2^v states, so the worst-case delay of this setup is governed by full-complexity iterations. Note that this is the same decoding delay obtained by using the MAP algorithm in a pipelined or parallel structure, indicating that the usage of soft-output T-algorithms in a pipelined structure does not provide any additional reduction in the worst-case decoding delay.
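A minimal sketch of the T-BCJR-style forward recursion described in Section II — prune states whose normalized metric falls below T, with the sub-block boundary initializations given above — might look as follows. This is an illustration under stated assumptions, not the authors' implementation: the transition-probability matrices `gammas` are placeholders that would come from the channel observations and code structure.

```python
import numpy as np

# Illustrative T-BCJR forward pass on one sub-block (a sketch, not the
# authors' code). States whose normalized probability drops below the
# threshold T are set to 0 ("killed"); the backward pass would then be
# executed only over the trellis sections that remain "alive".

def init_alpha(num_states, first_subblock):
    """Boundary initialization: known all-zero start state for the first
    sub-block, uniform (1/2^v per state) at an unknown boundary."""
    if first_subblock:
        a = np.zeros(num_states)
        a[0] = 1.0
    else:
        a = np.full(num_states, 1.0 / num_states)
    return a

def forward_recursion(gammas, alpha0, T):
    """gammas: one (num_states x num_states) transition-probability
    matrix per trellis section. Returns the list of pruned alphas."""
    alphas = [alpha0]
    a = alpha0
    for g in gammas:
        a = a @ g                      # sum over s of alpha_k(s) * gamma(s, s')
        a = a / a.sum()                # normalize so T acts on probabilities
        a = np.where(a < T, 0.0, a)    # kill states below the threshold
        alphas.append(a)
    return alphas
```

With a very small T no state is ever killed and the recursion reduces to the ordinary BCJR forward pass; raising T trades accuracy for fewer surviving states.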


Fig. 2. Performance comparison of different decoding approaches in AWGN channel.

In contrast, when soft-output T-algorithms are used in a parallel structure, a significant reduction in decoding complexity/delay results. In a parallel implementation each processor performs all the iterations on its sub-block, and although it retains nearly all the states in the initial few iterations, it deletes states, thereby reducing complexity, in later iterations. Thus, for a given block for which the turbo decoder requires I iterations to correct all the errors, each processor retains a certain number of states during each iteration over its sub-block of length N/W, and the total decoding delay is proportional to the largest total number of states retained by any one processor over all the iterations.

IV. SIMULATION RESULTS

Rate 1/3 turbo codes using 16-state component codes and a random interleaver of length 1024 have been simulated. A CRC code, providing a reliable stopping criterion, has been assumed. Note that if a fixed number of iterations had been used, as opposed to a CRC, the T-algorithms would have an unfair advantage over the other versions of the MAP algorithm, because in many blocks they could drastically reduce states after all the errors had been corrected. Fig. 2 shows the bit error rate (BER) performance of the BCJR, parallel BCJR, T-BCJR and parallel T-BCJR algorithms. It can be seen that the performance of all these algorithms is very close until an SNR of 0.75 dB, but beyond that the T-BCJR and the parallel T-BCJR exhibit an error floor. It should be noted that the conventional T-BCJR algorithm [2] exhibits this error floor as well, and that the parallel implementation does not cause any additional degradation compared to the conventional T-BCJR algorithm in [2]. The error floor in the performance of the T-BCJR algorithms is sensitive to the selection of the threshold. If the thresholds are not low enough, the algorithm tends to display an error floor higher than the BCJR algorithm.
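The threshold sensitivity just described can be illustrated with a toy two-state example. All numbers below are hypothetical and serve only to show the mechanism: a threshold that is too high kills a state that later observations would have revealed as the most likely one, and a killed state can never recover.

```python
# Toy illustration of threshold sensitivity in a T-BCJR-style pruning rule.
# prune() zeroes any state probability below the threshold T. With T too
# large, state 1 is killed at step 1 even though the step-2 observation
# strongly favors it, so the decoder is locked into the wrong decision.

def prune(probs, T):
    return [0.0 if p < T else p for p in probs]

def normalize(probs):
    s = sum(probs)
    return [p / s for p in probs] if s > 0 else probs

# step 1: state 1 looks unlikely; step 2: evidence strongly favors state 1
step1 = normalize([0.85, 0.15])
step2_weights = [0.1, 0.9]          # hypothetical per-state likelihoods

def run(T):
    a = prune(step1, T)
    a = normalize([p * w for p, w in zip(a, step2_weights)])
    return a

low = run(0.05)    # low threshold: state 1 survives and ends up dominant
high = run(0.2)    # high threshold: state 1 was killed and cannot recover
```

This is why the thresholds in the simulations below must be chosen low enough that the pruning rarely kills the eventually-correct path at the BERs of interest.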
For this particular set of simulations, the thresholds of the T-BCJR and parallel T-BCJR algorithms have been set at 10 , resulting in the performance of the various algorithms staying close to that of the BCJR algorithm for BERs greater


than about 5 × 10 . Although not shown here, it was seen that by reducing the threshold, it is possible to keep the performance close to the BCJR algorithm down to BERs of 5 × 10 . Typically, the threshold should be set such that the performance of the T-algorithms is similar to that of the full MAP algorithm for the BERs of interest. It should also be noted that it is unlikely that any turbo coded system will be designed to operate in the floor region.

Next we compare the decoding delays of these algorithms. Since the complexity/delay of the T-type algorithms strongly depends on the channel conditions, the cumulative distribution function (CDF) of the decoding delay, as in Fig. 3, has been used to compare the different approaches. As discussed in the previous section, the number of states retained has been used as a measure of decoding delay and, hence, the axis in Fig. 3 is linearly proportional to the actual decoding delay. All the decoding delay profiles have been taken at an SNR of 0.75 dB. The figure indicates that 99% of the time, the parallel T-BCJR algorithm has about 50% less delay than the parallel MAP algorithm, which in turn has about 50% less delay than the MAP algorithm. Moreover, it was seen that when the threshold was reduced to 10 , the parallel T-BCJR algorithm had 35% less delay than the parallel MAP for 99% of the data blocks, while the BER performance remained almost identical to that of the MAP algorithm down to 5 × 10 .

Fig. 3. CDF of decoding delays of different decoding approaches in AWGN channel.

V. CONCLUSIONS

This letter concentrates on reducing the worst-case decoding delay/complexity in turbo decoders. Various decoding structures and algorithms have been discussed. Turbo decoders implemented using multiple processors arranged in a parallel or pipelined structure reduce the decoding delay without significant performance degradation. Parallel versions of soft output T-algorithms have been shown to reduce the decoding delay still further.

REFERENCES

[1] C. Berrou, A. Glavieux, and P. Thitimajshima, "Near Shannon limit error-correcting coding and decoding: Turbo-codes," in Proc. IEEE Int. Conf. on Communications (ICC '93), vol. 2, May 1993, pp. 1064–1070.
[2] V. Franz and J. B. Anderson, "Concatenated decoding with a reduced-search BCJR algorithm," IEEE J. Select. Areas Commun., vol. 16, pp. 186–195, Feb. 1998.
[3] J. Hsu and C. Wang, "A parallel decoding scheme for turbo codes," in Proc. IEEE Int. Symp. on Circuits and Systems (ISCAS '98), vol. 4, June 1998, pp. 445–448.