Mapping interleaving laws to parallel turbo and LDPC decoder architectures

Alberto Tarable, Member, IEEE, Sergio Benedetto, Fellow, IEEE, and Guido Montorsi, Senior Member, IEEE

Abstract— For high data rate applications, the implementation of iterative turbo-like decoders requires parallel architectures, which impose collision-free constraints on the reading/writing process from/into the memory. This consideration applies to the two main classes of turbo-like codes, i.e., turbo codes and low-density parity-check codes. Contrary to the common belief in the literature, we prove in this paper that there is no need for an ad hoc code design to meet the parallelism requirement, because, for any code and any choice of the scheduling of the reading/writing operations, there is a suitable mapping of the variables in the memory that guarantees collision-free access. The proof is constructive, i.e., it yields an algorithm that obtains the desired collision-free mapping. The algorithm is applied to two simple examples, one for turbo codes and one for low-density parity-check codes, to illustrate how it works.

Index Terms— LDPC codes, memory mapping, parallel implementation, turbo codes

I. INTRODUCTION

Turbo-like codes are universally recognized as the most successful known class of codes, yielding performance close to the theoretical Shannon bound [6], [13]. The common feature of all codes in this family is the feasibility of an iterative decoder, which performs close to the maximum-likelihood (ML) decoder, provided that the signal-to-noise ratio (SNR) is sufficiently high. Turbo-like codes include parallel concatenated convolutional codes (PCCC) [6], serially concatenated convolutional codes (SCCC) [3], low-density parity-check (LDPC) codes [8] and other minor variations on the theme. A unified approach to the whole class of turbo-like codes can be found in [1].

The powerful error-correcting properties of turbo-like codes make them an attractive choice for all kinds of communication systems. However, in high data rate applications, there is a problem connected with the generally high latency of the iterative decoder: each frame has to be processed through several iterations before the corresponding data can be estimated. Consequently, the decoder implementation has to be optimized so as to keep the latency as low as possible. The obvious solution for a low-latency decoder is a highly parallelized structure. In such a decoder, many operations are performed at the same time by different processors, according to the intrinsic parallelizability of the decoding algorithm. Given a certain degree of parallelism, one of the most important problems to solve becomes the handling of memory access.

This work has been partially funded by MIUR under the CERCOM and FIRB-PRIMO projects. Part of the paper has already been published at the 3rd Turbo Code Symposium, held in Brest, France, in 2003, and in the IEEE Communications Letters, issue of March 2004.

Fig. 1. (a): PCCC encoder; (b): SCCC encoder

In fact, if more concurrent accesses are attempted at the same time, the collision must be resolved by serializing the colliding accesses, thus spoiling parallelism. The problem of handling memory access is similar in the two main classes of turbo-like codes, i.e., PCCC/SCCC and LDPC codes. However, since the decoder structures for the two classes are quite different, it is better to consider them separately.

A. Handling memory access: PCCC and SCCC

In a PCCC, two constituent codes are concatenated in parallel, as in Fig. 1. Also shown in the figure is an SCCC, which shares with the PCCC most properties that are important for our discussion. In the PCCC, an information sequence is encoded by encoder 1. The same sequence is interleaved (permuted) according to a permutation π and then encoded by encoder 2. In the SCCC, the information sequence enters encoder 2, whose output is interleaved (permuted) according to a permutation π to form the input to encoder 1. If the input to the interleaver is the sequence b_1 b_2 b_3 ..., then the output will be the sequence b_{π(1)} b_{π(2)} b_{π(3)} ...

The turbo decoder iterates for a certain number of iterations between the two soft-input soft-output (SISO) decoders [4] of the two constituent codes. In a given iteration, decoder 1 accepts as its input what was output by decoder 2 in the previous iteration (except in the first iteration, when no input from decoder 2 is available). In turn, decoder 1 sends its output to decoder 2, which accepts it as its input in the current iteration. Between the two decoders, an interleaver and a deinterleaver rearrange the information according to the permutation π and to its inverse π^{-1}, respectively. In the parallel concatenation, the information exchanged pertains to the input bits of both encoders,

Fig. 2. Turbo decoder

while in the SCCC, it pertains to the output of encoder 2 (the outer encoder), which is the input of encoder 1 (the inner encoder). The block diagram of the turbo decoder is shown in Fig. 2.

The most common design for turbo codes employs convolutional codes as constituent codes. In this case, the algorithm performed inside each SISO decoder is typically the BCJR algorithm [2], which ideally requires two recursions on the code trellis, one forward from the beginning of the block, the other backward from the end of the block. In practice, in low-latency implementations, the algorithm is windowed by dividing the entire block into sub-blocks and performing shorter recursions inside the sub-blocks, all assigned to distinct processors working in parallel. Since every processor performs the same algorithm, they all access the memory at the same time instants. Collisions may then occur, weakening the efficiency of the implementation.

In the following, we will suppose that the memory is divided into a certain number of banks. A collision happens when two (or more) accesses are attempted at the same time to the same memory bank, while there is no collision when the simultaneous accesses concern different memory banks. If there are P processors and each processor accesses the memory once per time instant, then there will be at most P simultaneous accesses to the memory. To guarantee a collision-free implementation, there must be at least P memory banks. A pictorial view of collisions is shown in Fig. 3.

Access to the memory is performed when the SISO decoder reads input variables and when it writes output variables. Actually, the output variables are an update of the input variables, and we suppose that the update of a given variable is written into the same memory cell in which it was stored.
Also, we suppose that the scheduling of the reading operations and of the writing of the corresponding updates coincide, which allows us to consider reading and writing as equivalent from the point of view of memory access. The problem of parallelism is that there are two different schedules of the reading/writing operations: one connected with decoder 1, which reads and writes variables in natural order, the other connected with decoder 2, which reads and writes in interleaved (permuted) order. In other words, the two half-iterations of the turbo decoding algorithm pose different constraints on the mapping of the variables into the memory banks. This is the main problem of parallelism in turbo decoders.
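The collision condition just described can be stated compactly in code. The sketch below is purely illustrative (the function and variable names are our own, not from the paper): a schedule lists, for each time instant, the bank addressed by each of the P processors, and a collision is a repeated bank within one time instant.

```python
# Illustrative sketch of the collision condition: at each time instant the
# P processors each address one memory bank; a collision is two processors
# addressing the same bank at the same time. Naming is ours, not the paper's.
def has_collision(bank_schedule):
    """bank_schedule[t][p] = bank addressed by processor p at time t."""
    return any(len(set(banks)) < len(banks) for banks in bank_schedule)
```

For instance, with P = 3 processors the schedule [[0, 1, 2], [1, 1, 2]] collides at the second time instant, while [[0, 1, 2], [2, 0, 1]] is collision-free.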

Fig. 3. (a): Absence of collisions; (b): a collision happens

As emerges from the above discussion, the main character in this play is the interleaver. If the permutation defining the interleaver is the identity permutation, i.e., π(i) = i, ∀i, the problem of parallelism becomes much simpler, provided that decoder 1 and decoder 2 work in the same way, at least as far as the scheduling of memory access is concerned: both decoders read/write in natural order. But in this case, the performance of the turbo code is much worse than it could be with a better interleaver design. Most researchers have therefore attempted to find good interleavers that allow for a collision-free parallel implementation of the turbo decoder. Ad hoc designs can be found in [9], [11], [15], [10]. In [17], the problem of handling collisions efficiently is posed. In this paper, we will show that a parallel implementation exists for any interleaver, so that there is no need for an ad hoc design. Flexibility, and the possibility of a design rule that takes into account only the performance of the turbo code and no other parameters, are the first consequences of this result. The second advantage lies in the compatibility between new parallel decoders and existing systems, which is rather important for already standardized applications. The problem of the parallelization of PCCC/SCCC is an instance of a more general problem, another application of which lies in collision-free LDPC decoder implementations.

B. Handling memory access: LDPC codes

LDPC codes are linear block codes characterized by a sparse parity-check matrix. It is customary to define an LDPC code either through the parity-check matrix itself, or through a bipartite graph, like the one in Fig. 4. In this graph, the subset of nodes on the left (represented by circles) is composed of the n variable nodes (VNs), each one corresponding to a bit of the codeword.
The subset of nodes on the right (represented by squares) consists of the n − k check nodes (CNs), where k is the code dimension; each of them corresponds to a parity equation satisfied by the code. There is an edge connecting the i-th VN and the j-th CN if the j-th parity equation involves the i-th coded bit. The LDPC decoding algorithm is an instance of the belief-propagation (BP) algorithm [12]. Through some iterations, the

Fig. 4. A bipartite graph defining an LDPC code (variable nodes on the left, check nodes on the right)

set of VNs and the set of CNs exchange information along the edges that connect them. In an iteration, the sets of VNs and CNs are activated alternately. Each active node, either a VN or a CN, accepts the input variables from its incident edges, computes its outputs (in a different way, according to whether it is a VN or a CN), which are updates of the inputs, and returns them to the edges; they will be used in the following half-iteration as new inputs.

LDPC codes have a high intrinsic degree of parallelism in the decoder: each node can in principle work in parallel with the others of the same kind. This is a favorable feature of LDPC codes with respect to PCCC and SCCC. However, the average number of iterations needed is usually higher for LDPC codes (some tens to a hundred, versus about ten for PCCC/SCCC).

The problem of parallelism in LDPC codes is analogous to that for concatenated codes. Consider the half-iteration in which the CNs are active. There are a few processors, each in charge of a subset of the CNs. If the processors work in parallel, they have to access the memory whenever they read their inputs or write their outputs. If the (maximum) number of simultaneous accesses to the memory during this phase is P, there must be at least P memory banks to avoid collisions among these accesses. The same is true when the VNs are active. Just as in turbo decoding, we suppose that the operations of writing the updates of the input variables into the memory follow the same scheduling as reading the input variables themselves from the memory. Consequently, the writing and the reading operations pose the same constraints on the mapping of the variables into the memory, so that, for our purposes, they are equivalent. The problem of parallelism in LDPC codes arises from the fact that, usually, the scheduling of the reading/writing operations at the VN side is different from that at the CN side.
In other words, the two half-iterations of the LDPC decoding algorithm pose different constraints on the mapping of the variables into the memory banks. It is easy to see the analogy with PCCC/SCCC. The difference is that for LDPC decoders the scheduling is freer, because of the intrinsic nature of the algorithm: processors may read some of their inputs in parallel, and subsets of processors may be activated in succession.

What is really important, however, is the scheduling of the reading/writing operations, independently of which processor it concerns. It is the scheduling that defines which variables are read simultaneously (either at the VN or at the CN side) and must then be placed into different memory banks. The role played by the interleaver in the turbo decoder is played now, in a less direct way, by the structure of the parity-check matrix, or equivalently by the bipartite graph defining the code.

Also for LDPC codes there exist ad hoc designs that try to solve the problem of parallelization at the root. Examples can be found in [14], [16], [18], where explicit rules are given to construct good LDPC codes satisfying certain conditions on the parity-check matrix. A more hardware-oriented treatment of the subject can be found in [7]. In our paper, collision-free LDPC decoders are obtained as a particular case of a more general design, without posing constraints on the parity-check matrix or even on the scheduling chosen for the VN and CN sides. This has the clear advantage of giving complete flexibility to the design of LDPC codes.

The parallelization of PCCC/SCCC and LDPC codes can be treated in a unified way, since the two classes of codes pose the same kind of constraints. This problem can be seen as a particular case of a more general problem, which will be dealt with in abstract terms in the next section.

II. MAPPING FUNCTIONS

Let us consider a set of L elements V = {v_1, ..., v_L}. Suppose we are given two different partitions of V, namely P1 = {V_1, ..., V_M} and P2 = {V_1′, ..., V_M′}, with the following characteristic: all subsets V_i, V_i′, i = 1, ..., M, have the same number of elements P = L/M. Note that M must be a divisor of L. The following definition is central:

Definition 1: Let V, P1 and P2 be defined as above.
A function M: {1, ..., L} → {1, ..., P} is a mapping function for (P1, P2) if it satisfies the following conditions for every j, j′ = 1, ..., L, j ≠ j′:

v_j, v_{j′} ∈ V_i for some i   ⟹   M(j) ≠ M(j′)   (1)
v_j, v_{j′} ∈ V_i′ for some i  ⟹   M(j) ≠ M(j′)   (2)

or, in words, elements belonging to the same subset in either partition are mapped to different bins. The connection with the problem of parallelism in PCCC/SCCC and LDPC codes is clear: the set V contains the variables that are updated at every half-iteration; the two partitions define the variables that may originate a collision, i.e., variables that are read/written at the same time in one of the two half-iterations; the mapping function then gives the correspondence between variables and memory banks. If the constraints (1), (2) are all satisfied, no collision in the memory access takes place.

It is possible to see the problem of finding a mapping function as a graph coloring problem. Consider a graph with L nodes corresponding to the elements v_1, ..., v_L, with an edge connecting node j to node j′ if (and only if) v_j and v_{j′} belong to the same subset in either partition. Then, the problem of finding a mapping function is equivalent to properly coloring the vertices of such a graph with P colors.
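Conditions (1)-(2) translate directly into a membership check. The following sketch (our own naming and 1-based indexing, not from the paper) verifies whether a candidate assignment is a mapping function for a given partition pair:

```python
# Direct check of Definition 1: elements sharing a subset in either
# partition must be assigned different banks. Illustrative sketch; the
# function name and the 1-based indexing are our own conventions.
def is_mapping_function(mapping, partition1, partition2):
    """mapping[j] = M(j) for j = 1, ..., L (a dict); each partition is a
    list of index sets over {1, ..., L}."""
    for subset in list(partition1) + list(partition2):
        banks = [mapping[j] for j in subset]
        if len(set(banks)) != len(banks):  # a repeated bank is a collision
            return False
    return True
```

Equivalently, this checks that the P-valued assignment induced by M is a proper coloring of the graph described above.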

Another connection with known problems can be established when the following two hypotheses are satisfied:

• P = M, so that L = P², and
• the intersection between any two V_i, V_i′ contains exactly one element.

By arranging the elements into a P × P matrix, such that elements in the same row belong to the same subset in P1 and elements in the same column belong to the same subset in P2, we have reduced the problem of finding a mapping function, in this particular case, to the problem of finding a P × P Latin square.

Given a pair of partitions (P1, P2) of V, the problem we want to solve is to find a mapping function that satisfies (1) and (2). In this section, an algorithm is presented that gives the desired mapping function for any (P1, P2), as the explanation of the algorithm will clarify. The algorithm can be divided into two successive steps, described in the following.

First step: any step that produces a preliminary mapping function with the following properties: no conditions in (1) and (2) are violated, but there may be some elements for which the mapping function is not determined yet, called blanks hereafter.

Second step: this step accepts the preliminary mapping function output by the first step and fills all blanks. This procedure of completing the mapping function is called annealing. The result is a valid mapping function satisfying (1) and (2).

Next, the annealing procedure, which is the core of the algorithm, is formalized. First, we define a collision as a single violation of one of the constraints in (1)-(2). The annealing procedure purposely injects a collision and at each step resolves it, possibly introducing another one.

A. Formal description of the annealing procedure

The annealing procedure can be decomposed into several cycles, each of them starting with a blank, picked at random, and ending when no collisions are produced.
After a cycle has ended, a new one starts if there are still blanks in the mapping function; otherwise the annealing procedure is over. A description of the operations performed by a single cycle requires the following definition:

Definition 2: L(V_i) (resp. L(V_i′)) is the set of values in {1, ..., P} that are not yet assigned to V_i (resp. V_i′).

At the beginning of a given cycle, a preliminary mapping function M^(0) maps {1, ..., L} into the set {1, ..., P} ∪ {∅}; (M^(0))^{−1}(∅) is the nonempty set of blanks. At the i-th step of the cycle, an updated preliminary mapping function M^(i) is produced¹.

To start the cycle, a blank is identified, say in position j^(0) ∈ V_{i(0)} ∩ V′_{i(1)}. If the intersection between L(V_{i(0)}) and L(V′_{i(1)}) is not empty, there is a value that fills the blank in j^(0) without creating collisions. In this case M^(1)(j^(0)) is taken from this intersection, while M^(1)(j) = M^(0)(j) for j ≠ j^(0). Since no collision is produced, the cycle is over.

If instead the intersection between L(V_{i(0)}) and L(V′_{i(1)}) is empty, pick at random two values, m ∈ L(V_{i(0)}) and

¹Here and later, the superscripts (0), (1), ... refer to variables in the corresponding step of the cycle.

n ∈ L(V′_{i(1)}). Then assign M^(1)(j^(0)) = m and M^(1)(j) = M^(0)(j) for j ≠ j^(0), and repeat the following pair of steps for k = 1, 2, ...:

• Search for j^(2k−1) ∈ V′_{i(2k−1)} ∩ V_{i(2k)} such that M^(2k−2)(j^(2k−1)) = m and j^(2k−1) ≠ j^(2k−2). If such a j^(2k−1) is found, assign M^(2k−1)(j^(2k−1)) = n and M^(2k−1)(j) = M^(2k−2)(j) for j ≠ j^(2k−1). Otherwise, stop the cycle.

• Search for j^(2k) ∈ V_{i(2k)} ∩ V′_{i(2k+1)} such that M^(2k−1)(j^(2k)) = n and j^(2k) ≠ j^(2k−1). If such a j^(2k) is found, assign M^(2k)(j^(2k)) = m and M^(2k)(j) = M^(2k−1)(j) for j ≠ j^(2k). Otherwise, stop the cycle.

After the end of a cycle, the sets L(V_i) and L(V_i′), ∀i = 1, ..., M, must be updated. Fig. 5 shows a pictorial view of the annealing procedure.

A cycle stops when either m ∈ L(V′_{i(2k−1)}) or n ∈ L(V_{i(2k)}). Notice that there surely exists some index i such that m ∈ L(V_i′): since m ∈ L(V_{i(0)}), fewer than M elements of V are mapped to m; since there are M subsets V_1′, ..., V_M′ and each subset can contain only one element mapped to m, there must be a subset V_i′, for some i, that does not yet contain any element mapped to m, so that m ∈ L(V_i′). Analogously, there surely exists some index i such that n ∈ L(V_i). Consequently, the cycle ends when it meets the subset V_i′ (resp. V_i) for which m ∈ L(V_i′) (resp. n ∈ L(V_i)).

A question may arise: does the cycle always end? From what we have said, it is clear that the cycle never ends if (and only if) it loops inside a set of subsets that does not contain the ones that would stop it. The following proposition shows that this case is impossible.

Proposition 1: A single cycle of the annealing procedure cannot fall into a loop.

Proof: Suppose j^(k) = j^(k′), k > k′, and that no other index between j^(k′) and j^(k) is equal to j^(k′). Suppose k is odd. Then it must be M^(k−1)(j^(k)) = M^(k−1)(j^(k′)) = m. Since the

cycle has already passed through j^(k′), it means that k′ is even. On the contrary, if k is even, k′ must be odd.

Suppose k is odd. Then j^(k) ∈ V′_{i(k)} and, for the same reason, j^(k′+1) ∈ V′_{i(k′+1)} = V′_{i(k)}, so that j^(k′), j^(k′+1), j^(k−1), j^(k) all belong to V′_{i(k)}. Before passing through the (k−1)-th element, we have M^(k−2)(j^(k−1)) = n and, at the same time, M^(k−2)(j^(k′+1)) = n. Since j^(k′+1) and j^(k−1) both belong to V′_{i(k)} and there is no collision in this subset at the (k−1)-th step², it must be j^(k−1) = j^(k′+1). Applying the argument repeatedly, we finally obtain j^((k+k′−1)/2) = j^((k+k′+1)/2), which is impossible, due to the fact that j^(l) ≠ j^(l+1). So, the loop is impossible and the cycle always reaches its end. ∎

Since a single cycle cannot fall into a loop, it will sooner or later meet the subset that makes it stop. If, for a given partition pair (P1, P2), we can find a preliminary mapping function as defined above, the annealing procedure finds a valid mapping function. Since for all partition pairs a possible preliminary

²At the (k − 1)-th step there is only one collision and, since k − 1 is even, it is found in the subset V_{i(k−1)}.

Fig. 5. Pictorial view of the annealing procedure: the cycle visits alternately the subsets V_{i(0)}, V′_{i(1)}, V_{i(2)}, V′_{i(3)}, ..., through the indices j^(0), j^(1), j^(2), .... Every black dot represents an index; every time a dot is reached, the corresponding value of the mapping function is changed.
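The two-step algorithm can be summarized in code. The sketch below is our own rendering (0-based indices, deterministic choices instead of random ones, and the all-blank preliminary mapping as the first step); it follows the cycle description above, with termination guaranteed by Proposition 1:

```python
# Sketch of the two-step algorithm: start from the all-blank preliminary
# mapping, then run one annealing cycle per blank. Indices are 0-based and
# choices are deterministic (min instead of random); naming is ours.
def free_values(mapping, subset, P):
    """The sets L(V): values in {0, ..., P-1} not yet used inside `subset`."""
    used = {mapping[x] for x in subset if mapping[x] is not None}
    return set(range(P)) - used

def collision_free_mapping(part1, part2, P):
    """part1, part2: the partitions P1, P2 as lists of M disjoint sets."""
    L = sum(len(s) for s in part1)
    in1 = {j: i for i, s in enumerate(part1) for j in s}  # subset index in P1
    in2 = {j: i for i, s in enumerate(part2) for j in s}  # subset index in P2
    mapping = [None] * L
    for j0 in range(L):
        if mapping[j0] is not None:
            continue
        f1 = free_values(mapping, part1[in1[j0]], P)
        f2 = free_values(mapping, part2[in2[j0]], P)
        common = f1 & f2
        if common:                       # blank fills with no collision
            mapping[j0] = min(common)
            continue
        m, n = min(f1), min(f2)          # inject a collision on purpose...
        mapping[j0] = m                  # (m is already used in the P2 subset)
        j, old, new, in_second = j0, m, n, True
        while True:                      # ...and chase it until it vanishes
            subset = part2[in2[j]] if in_second else part1[in1[j]]
            clash = [x for x in subset if x != j and mapping[x] == old]
            if not clash:                # Proposition 1: this always happens
                break
            j = clash[0]
            mapping[j] = new
            old, new, in_second = new, old, not in_second
    return mapping
```

Run on the data of Example 1 below, this yields a valid (though not necessarily identical) mapping, since the result depends on the choices made inside each cycle.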

mapping function is the one with all blanks, we have proved the following corollary.

Corollary 2: Let V be defined as in Definition 1, and let P1 and P2 be any two partitions as in Definition 1. Then, there exists a mapping function for (P1, P2), and it is found by applying the algorithm described above.

III. APPLICATIONS OF THE MAPPING FUNCTION

In this section, we apply the algorithm to two examples, one for a PCCC, the other for an LDPC code. The examples will also clarify how the annealing procedure works.

Fig. 6. The mapping matrix, a P × M array whose (j, i) entry is M((j − 1)M + i):

M(1)       M(2)       ...  M(M)
M(M+1)     M(M+2)     ...  M(2M)
M(2M+1)    M(2M+2)    ...  M(3M)
...        ...        ...  ...

A. Turbo decoder

Before starting with the example for the turbo code, let us consider how the partitions are usually chosen in this case. Since decoder 1 reads and writes in natural order, we obtain partition P1 = {V_1, ..., V_M} in the following way:

V_i = { v_{(j−1)M+i} , j = 1, ..., P }   (3)

i.e., V_i contains the variables read at time i by the P processors working on the sub-blocks: at time i, the processor working on the first sub-block reads variable i, the one working on the second sub-block reads variable M + i, and so on. Since decoder 2 reads and writes in permuted order, according to π, we obtain partition P2 = {V_1′, ..., V_M′} in the following way:

V_i′ = { v_{π((j−1)M+i)} , j = 1, ..., P }   (4)

i.e., V_i′ contains the variables read at time i by the P processors working on the sub-blocks: at time i, the processor working on the first sub-block reads variable π(i), the one working on the second sub-block reads variable π(M + i), and so on. In these terms, Eqs. (1) and (2) become, ∀k, k′, k ≠ k′:

k =_M k′   ⟹   M(k) ≠ M(k′)   (5)
k =_M k′   ⟹   M(π(k)) ≠ M(π(k′))   (6)

where =_M means "equal modulo M". Since the partitions show a certain regularity, this regularity can be exploited to obtain an intuitive pictorial view of the problem. It is useful to represent the mapping function as a P × M rectangular matrix, the mapping matrix, whose (j, i)-th element, j = 1, ..., P, i = 1, ..., M, represents the value of M((j − 1)M + i), as shown in Fig. 6.

In this way, the subsets V_i of partition P1 become columns of such a matrix, while the subsets V_i′ of partition P2, which depend on the permutation π, define a tiling of the mapping matrix. This easy way of describing the mapping function is used in the example below, where we refer to the subsets V_i and V_i′ as columns and tiles, respectively.
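Conditions (5)-(6) can be checked mechanically for any candidate mapping. The sketch below uses our own naming; the inputs are 1-based lists, with pi_perm[k-1] = π(k) and mapping[k-1] = M(k):

```python
# Check of Eqs. (5)-(6): indices equal modulo M must be mapped to distinct
# banks, both in natural and in permuted order. Illustrative sketch with
# our own naming; pi_perm and mapping are 1-based (entry k-1 holds index k).
def check_turbo_mapping(pi_perm, mapping, M):
    L = len(mapping)
    for r in range(M):                   # one residue class modulo M at a time
        ks = [k for k in range(1, L + 1) if k % M == r]
        natural = [mapping[k - 1] for k in ks]                # Eq. (5)
        permuted = [mapping[pi_perm[k - 1] - 1] for k in ks]  # Eq. (6)
        if len(set(natural)) != len(natural):
            return False
        if len(set(permuted)) != len(permuted):
            return False
    return True
```

In the mapping-matrix picture, Eq. (5) requires every column to contain distinct values, while Eq. (6) requires the same of every tile.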

Example 1: Suppose L = 25, P = M = 5 and, for instance,

π = (25, 10, 22, 3, 1, 14, 11, 5, 6, 2, 15, 13, 7, 16, 20, 24, 18, 8, 19, 17, 12, 21, 4, 23, 9).

The permutation π induces on the mapping matrix the tiling shown in Fig. 7.

Fig. 7. The tiling of the mapping matrix in Example 1 (letters are just labels for tiles):

E E D C C
D C C E B
B A B A A
D E B D E
B C D A A

Suppose the output of the first step is the following preliminary mapping matrix (a straightforward extension of the preliminary mapping function defined in Sect. II):

3 4 3 1 −
1 2 4 5 3
4 1 1 2 4
2 − 2 4 1
5 3 5 3 5

which has blanks in its (1, 5) and (4, 2) elements. The annealing procedure starts from one of them, for example the latter, and changes it to the value not yet represented in its column, i.e., 5. This change causes a collision, because (4, 2) and (2, 4) both belong to tile E and both have value 5. So (2, 4) is changed to the value not yet represented in tile E, i.e., 2:

3 4 3 1 −        3 4 3 1 −
1 2 4 5 3        1 2 4 2 3
4 1 1 2 4   →    4 1 1 2 4
2 − 2 4 1        2 5 2 4 1
5 3 5 3 5        5 3 5 3 5

Now there is a collision in the fourth column (the value 2 appears twice), so (3, 4) is changed to 5. This, in turn, collides with the 5 in (5, 5), which also belongs to tile A; the latter is changed to 2:

3 4 3 1 −        3 4 3 1 −
1 2 4 2 3        1 2 4 2 3
4 1 1 2 4   →    4 1 1 5 4
2 5 2 4 1        2 5 2 4 1
5 3 5 3 5        5 3 5 3 2

This produces no collision in the last column, so the corresponding cycle is over. The matrix thus obtained still has the blank in (1, 5). Changing it to 5, which is missing in the last column, creates no collision. The final result is thus the following mapping matrix:

3 4 3 1 5
1 2 4 2 3
4 1 1 5 4
2 5 2 4 1
5 3 5 3 2

One can verify that constraints (5) and (6) are all satisfied.

B. LDPC decoder

In the case of LDPC codes, as we have already noticed, the scheduling of decoding operations is not as standard as in turbo codes. It mostly depends on the number of CNs and VNs and on their degrees, on the assignment of the nodes to the processors, on the available number of memory banks, and on how the processors work.
It is, however, advisable that the number of memory accesses at each time instant be approximately constant, since a collision-free implementation requires a number of distinct memory banks equal to the maximum number of simultaneous memory accesses. Keeping the number of memory accesses as constant as possible over time then minimizes the number of

Fig. 8. The LDPC code of the example (edge indices 1-15; nodes assigned to two processors)

memory banks required for a collision-free implementation, if the duration of the single half-iteration is fixed.

Example 2: Suppose we have a code with 6 VNs and 5 CNs, defined by the bipartite graph shown in Fig. 8. In the figure, an index is attached to every variable, represented by an edge in the graph. Fig. 8 also shows a possible assignment of the nodes to two processors. Let us suppose that the two processors work in parallel while each processor accesses the memory sequentially, i.e., once per time instant, and that, at each side, the variables are read/written in top-down order of the corresponding edges. Since the first processor accesses eight variables and the second only seven, there will be a time instant in which only the first processor accesses the memory. To reduce the problem to the framework of Sect. II, we introduce a dummy variable, indexed 16. With these hypotheses, the partitions turn out to be:

P1 = {(1, 9), (2, 10), (3, 11), (4, 12), (5, 13), (6, 14), (7, 15), (8, 16)}

and

P2 = {(1, 2), (3, 10), (6, 12), (4, 14), (7, 8), (11, 13), (5, 15), (9, 16)}.

Using the notation of Sect. II, L = 16, M = 8, P = 2. We can again apply the formalism of the mapping matrix to this case. By applying the annealing procedure already seen in the previous example, we have found the following mapping matrix:

( M(1)  M(2)  ···  M(8)  )     ( 1 2 2 1 1 1 1 2 )
( M(9)  M(10) ···  M(16) )  =  ( 2 1 1 2 2 2 2 1 )

which satisfies constraints (1) and (2), as can be verified.

IV. PRACTICAL IMPLEMENTATION

Given a mapping matrix, the practical implementation of the whole writing and reading procedure can be obtained by cascading three different layers of blocks, as shown in Fig. 9.

Fig. 9. Practical implementation of the writing and reading procedure: a crossbar (permutations β_j), a bank of P interleavers δ_1, ..., δ_P, and a second crossbar (permutations β′_j)
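The construction of the interleaver bank of Fig. 9 can be sketched as follows (our own naming, 1-based indices): for each bank b, δ_b maps the read time j′ of a variable on the permuted side to its write time j on the natural side.

```python
# Sketch of the delta_i construction of Sect. IV (our own naming, 1-based):
# for each bank b, collect the M indices k with M(k) = b; then
# delta_b(j') = j, where j =_M k and j' =_M pi^{-1}(k).
def interleaver_bank(pi_perm, mapping, M):
    L = len(mapping)
    P = L // M
    inv = [0] * (L + 1)                    # pi^{-1}, 1-based
    for k, v in enumerate(pi_perm, start=1):
        inv[v] = k
    mod = lambda x: (x - 1) % M + 1        # representative in {1, ..., M}
    deltas = {}
    for b in range(1, P + 1):
        delta = [0] * (M + 1)
        for k in range(1, L + 1):
            if mapping[k - 1] == b:
                delta[mod(inv[k])] = mod(k)
        deltas[b] = delta[1:]              # deltas[b][j'-1] = delta_b(j')
    return deltas
```

Applied to the permutation and the final mapping matrix of Example 1, this reproduces the permutations δ_1, ..., δ_5 listed in Example 3.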

to permutation π and writing it again by rows as a 5 × 5 matrix:   2 3 3 3 3  5 4 5 1 4     4 1 2 2 1 .    3 2 4 4 5  1 5 1 5 2 β 0j (i) is the (i, j) element of the above matrix. Finally, the permutations δ 1 , ..., δ 5 can be easily found as : δ 1 = (2, 3, 4, 1, 5) δ 2 = (5, 3, 2, 1, 4)

The figure shows the process of information exchange in both turbo and LDPC decoders. Following the conventions of the previous sections, for turbo decoders the figure represents the information flowing from decoder 1 to decoder 2 (see Fig. 2); in LDPC decoders, instead, the information flows from the VN side to the CN side.
• The block on the left is a crossbar with P input and P output pins. The interconnection between inputs and outputs at time j = 1, ..., M is given by the time-varying permutation β_j over the integers {1, ..., P}: the i-th input pin is connected at time j to the β_j(i)-th output pin. In the notation of the previous sections, β_j(i) = M((i − 1)M + j).
• The interleaver bank in the middle rearranges the information into the order in which it is needed by decoder 2. The i-th interleaver, i = 1, ..., P, is defined by the permutation δ_i over the integers {1, ..., M}, constructed as follows. Consider the set of M indices k in {1, ..., L} such that M(k) = i, and call this set M^{−1}(i). Given k ∈ M^{−1}(i), let j be the integer in {1, ..., M} such that j ≡ k (mod M), and let j′ be the integer in {1, ..., M} such that j′ ≡ π^{−1}(k) (mod M). Then δ_i(j′) = j. (This interleaver bank causes no problem to the parallel architecture.)
• The block on the right is a crossbar with P input and P output pins. The interconnection between inputs and outputs at time j = 1, ..., M is given by the time-varying permutation β′_j over the integers {1, ..., P}: the i-th input pin is connected at time j to the β′_j(i)-th output pin. In the notation of the previous sections, β′_j(i) = M(π((i − 1)M + j)).
In the other half-iteration, i.e., when the information goes from decoder 2 to decoder 1 in turbo decoders and from the CN side to the VN side in LDPC decoders, the order of the blocks is inverted: the information flows first through the crossbar on the right in Fig. 9, is then sent to a bank of P interleavers defined by the permutations δ_1^{−1}, δ_2^{−1}, ..., δ_P^{−1}, and finally passes through the crossbar on the left in Fig. 9.
Example 3: Consider the permutation and the mapping of Example 5, with L = 25, P = M = 5. The time-varying permutation β_j can be read off in a very straightforward way: β_j(i) is the (i, j) element of the mapping matrix. The time-varying permutation β′_j can be read by stacking the rows of the mapping matrix in a vector, permuting it according
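The two crossbar laws and the interleaver-bank permutations defined above can be sketched in a few lines of code. The sizes L and P, the interleaver π, and the mapping M used below are hypothetical toy values (not the paper's Example 5), chosen only so that the resulting mapping is collision-free:

```python
# Sketch (not from the paper) of how the crossbar settings beta_j, beta'_j
# and the interleaver-bank permutations delta_i follow from a memory
# mapping M and an interleaver pi.  All indices are 1-based, as in the text.

L, P = 6, 2          # toy sizes (hypothetical, not the paper's examples)
M_steps = L // P     # M = L/P time steps per half-iteration

# Hypothetical interleaver pi and collision-free mapping M (1-based dicts).
pi = {1: 2, 2: 3, 3: 6, 4: 5, 5: 4, 6: 1}
Mmap = {1: 1, 2: 2, 3: 1, 4: 2, 5: 1, 6: 2}

pi_inv = {v: k for k, v in pi.items()}

def mod1(x, m):
    """Reduce x to {1, ..., m} (the congruence j = k mod M of the text)."""
    return (x - 1) % m + 1

# Left crossbar:  beta_j(i) = M((i-1)M + j)
beta = {j: {i: Mmap[(i - 1) * M_steps + j] for i in range(1, P + 1)}
        for j in range(1, M_steps + 1)}

# Right crossbar: beta'_j(i) = M(pi((i-1)M + j))
beta_p = {j: {i: Mmap[pi[(i - 1) * M_steps + j]] for i in range(1, P + 1)}
          for j in range(1, M_steps + 1)}

# Interleaver bank: for each k with M(k) = i, take j = k mod M and
# j' = pi^{-1}(k) mod M; then delta_i(j') = j.
delta = {i: {} for i in range(1, P + 1)}
for k in range(1, L + 1):
    i = Mmap[k]
    delta[i][mod1(pi_inv[k], M_steps)] = mod1(k, M_steps)

# Collision-freeness: at every time j, each crossbar must realize a
# permutation of {1, ..., P} (no two processors hit the same bank).
for j in range(1, M_steps + 1):
    assert sorted(beta[j].values()) == list(range(1, P + 1))
    assert sorted(beta_p[j].values()) == list(range(1, P + 1))
```

The final loop expresses the collision-free property: at every time step each crossbar realizes a permutation of the bank indices, so the P processors never access the same memory bank simultaneously.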

δ_3 = (4, 5, 2, 3, 1), δ_4 = (5, 1, 3, 4, 2), δ_5 = (4, 1, 5, 3, 2).
Example 4: Consider the LDPC code of Example 6, with L = 16, P = 2, M = 8. As seen before, β_j(i) is the (i, j) element of the mapping matrix shown in Example 6. To obtain β′_j(i), we construct the matrix whose columns correspond to the subsets in P_2:

( M(1)  M(3)  ···  M(8)  )   ( 1  2  1  1  1  1  1  2 )
( M(2)  M(10) ···  M(16) ) = ( 2  1  2  2  2  2  2  1 )

The (i, j) element of this matrix directly gives β′_j(i). Finally, the permutations δ_1 and δ_2 are given by:

δ_1 = (1, 2, 6, 4, 7, 3, 5, 8)
δ_2 = (2, 3, 4, 6, 8, 7, 5, 1).

V. CONCLUSIONS

In this paper, we have provided an algorithm that allows for a parallel implementation of turbo and LDPC decoders. No constraint is imposed on the code itself, which can therefore be designed so as to optimize its performance. The output of the algorithm is the memory mapping of the variables exchanged in the decoder, such that no collisions take place when the memory is accessed during the decoding process. The algorithm described in this paper is balanced, in the sense that every memory bank has the same number of variables assigned to it (except for a slight unbalance in LDPC codes when the scheduling itself is unbalanced in time), and it is optimal, in the sense that it provides a mapping for the minimum possible number of memory banks. When the system, in order to track time-varying channel conditions, requires a versatile implementation in which the coder/decoder pair can be reconfigured according to higher-level instructions after some data frames, the new mapping function must be recomputed from scratch. This can be done, for example, in a pre-processing step, because the complexity of the algorithm is quite low.
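As a consistency check on Example 4 above, the listed quantities can be verified programmatically. The following sketch simply re-types the β′ matrix and the two interleaver permutations from the example and checks the properties the construction guarantees:

```python
# Consistency check for Example 4 (L = 16, P = 2, M = 8): every column of
# the beta' matrix must be a permutation of {1, 2} (collision-free access),
# and delta_1, delta_2 must be permutations of {1, ..., 8}.

beta_prime = [
    [1, 2, 1, 1, 1, 1, 1, 2],   # beta'_j(1), j = 1..8
    [2, 1, 2, 2, 2, 2, 2, 1],   # beta'_j(2), j = 1..8
]
delta_1 = (1, 2, 6, 4, 7, 3, 5, 8)
delta_2 = (2, 3, 4, 6, 8, 7, 5, 1)

# At each time step j, the two processors must reach the two distinct banks.
for j in range(8):
    assert sorted(row[j] for row in beta_prime) == [1, 2]

# Each interleaver law must be a permutation of {1, ..., 8}.
assert sorted(delta_1) == list(range(1, 9))
assert sorted(delta_2) == list(range(1, 9))
```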

REFERENCES

[1] S. M. Aji and R. J. McEliece, "The generalized distributive law," IEEE Trans. Inform. Theory, vol. 46, pp. 325-343, Mar. 2000.
[2] L. R. Bahl, J. Cocke, F. Jelinek, and J. Raviv, "Optimal decoding of linear codes for minimizing symbol error rate," IEEE Trans. Inform. Theory, vol. IT-20, pp. 284-287, 1974.
[3] S. Benedetto, D. Divsalar, G. Montorsi, and F. Pollara, "Serial concatenation of interleaved codes: Performance analysis, design, and iterative decoding," Electronics Letters, vol. 32, pp. 1186-1188, June 1996.
[4] S. Benedetto, D. Divsalar, G. Montorsi, and F. Pollara, "Soft-input soft-output modules for the construction and distributed iterative decoding of code networks," European Trans. on Telecomm., vol. 9, no. 2, March-April 1998.
[5] S. Benedetto and G. Montorsi, "Unveiling turbo codes: some results on parallel concatenated coding schemes," IEEE Trans. Inform. Theory, vol. 42, pp. 409-428, Mar. 1996.
[6] C. Berrou, A. Glavieux, and P. Thitimajshima, "Near Shannon limit error-correction coding and decoding: turbo-codes," in Proc. 1993 Int. Conf. on Communications (ICC '93), 1993, pp. 1064-1070.
[7] A. J. Blanksby and C. J. Howland, "Parallel decoding architectures for low density parity check codes," in Proc. IEEE Int. Symp. on Circuits and Systems (ISCAS 2001), vol. 4, pp. 742-745, Sydney, May 2001.
[8] R. G. Gallager, Low-Density Parity-Check Codes. Cambridge, MA: MIT Press, 1963.
[9] A. Giulietti, L. Van der Perre, and A. Strum, "Parallel turbo coding interleavers: avoiding collisions in accesses to storage elements," Electronics Letters, vol. 38, pp. 232-234, Feb. 2002.
[10] D. Gnaedig, E. Boutillon, M. Jézéquel, V. C. Gaudet, and P. G. Gulak, "On multiple slice turbo codes," in Proc. 3rd Turbo Code Symposium, pp. 343-346, Brest, Sep. 2003.
[11] J. Kwak and K. Lee, "Design of dividable interleaver for parallel decoding in turbo codes," Electronics Letters, vol. 38, pp. 1362-1364, Oct. 2002.
[12] R. J. McEliece, D. J. C. MacKay, and J.-F. Cheng, "Turbo decoding as an instance of Pearl's belief propagation algorithm," IEEE J. Select. Areas Commun., vol. 16, pp. 140-152, Feb. 1998.
[13] D. J. C. MacKay and R. M. Neal, "Near Shannon limit performance of low density parity check codes," Electronics Letters, vol. 32, pp. 1645-1646, Aug. 1996.
[14] M. M. Mansour and N. R. Shanbhag, "Memory-efficient turbo decoder architectures for LDPC codes," in Proc. IEEE Workshop on Signal Processing Systems (SIPS 2002), pp. 159-164, San Diego, Oct. 2002.
[15] A. Nimbalker, T. K. Blankenship, B. Classon, T. E. Fuja, and D. J. Costello, Jr., "Inter-window shuffle interleavers for high throughput turbo decoding," in Proc. 3rd Turbo Code Symposium, pp. 355-358, Brest, Sep. 2003.
[16] A. Selvarathinam, G. Choi, K. Narayanan, A. Prabhakar, and E. Kim, "A massively scalable decoder architecture for low-density parity-check codes," in Proc. IEEE Int. Symp. on Circuits and Systems (ISCAS 2003), vol. 2, pp. 61-64, Bangkok, May 2003.
[17] M. J. Thul, N. Wehn, and L. P. Rao, "Enabling high-speed turbo-decoding through concurrent interleaving," in Proc. IEEE Int. Symp. on Circuits and Systems (ISCAS 2002), vol. 1, pp. 897-900, Scottsdale, May 2002.
[18] F. Verdier and D. Declercq, "A LDPC parity-check matrix construction for parallel hardware decoding," in Proc. 3rd Turbo Code Symposium, pp. 235-238, Brest, Sep. 2003.

Alberto Tarable received the Laurea degree (summa cum laude) in 1998 and the Ph.D. degree in Electronic Engineering in February 2002, both from Politecnico di Torino. From April 2001 to October 2001, he spent six months as a Visiting Scholar at the University of California, San Diego, working on space-time codes. Since March 2002, he has been working as a researcher in the Dipartimento di Elettronica of Politecnico di Torino. His current interests are in multiuser detection, CDMA systems, space-time coding, coding theory, and UWB systems.

Sergio Benedetto has been a Full Professor of Digital Communications at Politecnico di Torino, Italy, since 1981. He has been a Visiting Professor at the University of California, Los Angeles (UCLA) and at the University of Canterbury, New Zealand, and is an Adjunct Professor at the Ecole Nationale Superieure des Telecommunications in Paris. In 1998 he received the Italgas Prize for Scientific Research and Innovation. He has co-authored two books on probability and signal theory (in Italian), the books "Digital Transmission Theory" (Prentice-Hall, 1987), "Optical Fiber Communications" (Artech House, 1996), and "Principles of Digital Communications with Wireless Applications" (Plenum-Kluwer, 1999), and over 250 papers in leading journals and conferences. He has taught several continuing-education courses on channel coding for the UCLA Extension Program and for the CEI organisation. He has been Chairman of the Communications Theory Symposium of ICC 2001 and has organized numerous sessions in major conferences worldwide. Sergio Benedetto is the Area Editor of the IEEE Transactions on Communications for Modulation and Signal Design, and a Distinguished Lecturer of the IEEE Communications Society. Professor Benedetto is the Chairman of the Communication Theory Committee of the IEEE and a Fellow of the IEEE.

Guido Montorsi was born in Turin, Italy, on January 1, 1965. He received the Laurea in Ingegneria Elettronica in 1990 from Politecnico di Torino, Turin, Italy, with a master's thesis, developed at the RAI Research Center, Turin, on the study and design of coding schemes for HDTV. In 1992 he spent a year as a visiting scholar in the Department of Electrical Engineering at Rensselaer Polytechnic Institute, Troy, NY. In 1994 he received the Ph.D. degree in telecommunications from the Dipartimento di Elettronica of Politecnico di Torino. In December 1997 he became an assistant professor at Politecnico di Torino. From July 2001 to July 2002 he spent one year at Sequoia Communications, developing algorithms for 3G wireless receivers. In 2003 he became a Senior Member of the IEEE and an associate professor at Politecnico di Torino. His interests are in the area of channel coding, particularly the analysis and design of concatenated coding schemes and the study of iterative decoding strategies.
