Information Geometry of Turbo Codes

Shiro Ikeda
Kyushu Inst. of Tech., & JST

Toshiyuki Tanaka
Tokyo Metropolitan Univ.

Shun-ichi Amari
RIKEN BSI

Abstract

The properties of turbo decoding are studied from the information geometrical viewpoint. Our study gives an intuitive understanding of the theoretical background and a new framework for the analysis. Based on this framework, we reveal basic properties of turbo decoding.

1 Introduction

Turbo codes [2] have been attracting a lot of interest because of their high error-correcting performance. Although thorough experimental results support the potential of the iterative decoding method, its mathematical background is not sufficiently understood. The problem of turbo decoding is equivalent to marginalizing an exponential family distribution. The distribution includes higher-order correlations, and its direct marginalization is intractable. However, a partial model containing only part of the correlations (i.e., a constituent code) can be marginalized efficiently by soft decoding. By collecting and exchanging the results of the partial models, the true decoding result is approximated. Richardson [6] initiated a geometrical understanding of turbo decoding, but a further intuitive understanding seems necessary. We investigate the problem from the information geometrical viewpoint [1]. This gives a new framework for analyzing iterative methods and provides an intuitive understanding. It also reveals many basic properties, such as the characteristics of the equilibrium, the condition of stability, the cost function related to the iterative method, and the decoding error [3, 4].

2 Information Geometrical Framework

2.1 Turbo Decoding

[Figure 1: Turbo codes. The turbo encoder feeds the information bits $x$ through Encoder 1 and Encoder 2 to produce the parity bits $y_1$ and $y_2$; the codeword $(x, y_1, y_2)$ passes through a BSC($\sigma$); the turbo decoder receives $(\tilde{x}, \tilde{y}_1, \tilde{y}_2)$, and Decoder 1 and Decoder 2 iteratively exchange the parameters $\xi_1$ and $\xi_2$.]

Let $x \in \{-1,+1\}^N$ be the information bits, from which the turbo encoder generates two sets of parity bits, $y_1 = (y_{11}, \dots, y_{1L})^T$ and $y_2 = (y_{21}, \dots, y_{2L})^T$, $y_{1j}, y_{2j} \in \{-1,+1\}$ (Fig. 1). Each parity bit is of the form $y_{rj} = \prod_{i \in L_{rj}} x_i$ $(r = 1, 2)$, where the product is taken over a subset $L_{rj} \subset \{1, \dots, N\}$. The codeword $(x, y_1, y_2)$ is then transmitted over a noisy channel, which we assume to be a BSC (binary symmetric channel) with flipping probability $\sigma < 1/2$. The receiver observes the noisy version $(\tilde{x}, \tilde{y}_1, \tilde{y}_2)$, $\tilde{x}_i, \tilde{y}_{1j}, \tilde{y}_{2j} \in \{-1,+1\}$. The ultimate goal of turbo decoding is the MPM (maximization of the posterior marginals) decoding of $x$ based on $p(x|\tilde{x}, \tilde{y}_1, \tilde{y}_2)$. Since the channel is memoryless, the following relation holds:

$$p(\tilde{x}, \tilde{y}_1, \tilde{y}_2 | x) = \exp\bigl(\beta \tilde{x} \cdot x + \beta \tilde{y}_1 \cdot y_1 + \beta \tilde{y}_2 \cdot y_2 - (N + 2L)\psi(\beta)\bigr), \quad \psi(\beta) \stackrel{\text{def}}{=} \ln(e^{\beta} + e^{-\beta}), \quad \beta > 0, \quad \sigma = \tfrac{1}{2}(1 - \tanh\beta),$$

where `$\cdot$' denotes the inner product of vectors.

Assuming the uniform prior on $x$, the posterior distribution is

$$p(x|\tilde{x}, \tilde{y}_1, \tilde{y}_2) = \frac{p(\tilde{x}, \tilde{y}_1, \tilde{y}_2|x)}{\sum_x p(\tilde{x}, \tilde{y}_1, \tilde{y}_2|x)} = C \exp\bigl(\beta \tilde{x}\cdot x + \beta \tilde{y}_1 \cdot y_1 + \beta \tilde{y}_2 \cdot y_2\bigr) = C \exp\bigl(c_0(x) + c_1(x) + c_2(x)\bigr). \tag{1}$$

Here $C$ is the normalizing factor, $c_0(x) = \beta \tilde{x} \cdot x$, and $c_r(x) = \beta \tilde{y}_r \cdot y_r$ $(r = 1, 2)$. Soft decoding computes

$$\hat{x} \stackrel{\text{def}}{=} \sum_x x\, p(x|\tilde{x}, \tilde{y}_1, \tilde{y}_2),$$

where $\hat{x}_i$ $(i = 1, \dots, N)$ is the soft bit, whose sign is the MPM decoding result. Let $\Pi$ denote the operator of marginalization,

$$\Pi \circ p(x|\tilde{x}, \tilde{y}_1, \tilde{y}_2) \stackrel{\text{def}}{=} \prod_{i=1}^N p(x_i|\tilde{x}, \tilde{y}_1, \tilde{y}_2).$$

Since $\hat{x}_i$ is directly calculated from $p(x_i|\tilde{x}, \tilde{y}_1, \tilde{y}_2)$, the calculation costs of the soft bits and of $\Pi \circ p(x|\tilde{x}, \tilde{y}_1, \tilde{y}_2)$ are the same. In turbo decoding, direct calculation of the soft bits is intractable. Turbo codes therefore utilize two decoders, each performing soft decoding based on one of the two sets of parity bits. For this soft decoding, the following $p_r(x; \xi)$ $(r = 1, 2)$ is used:

$$p_r(x; \xi) = \exp\bigl(c_0(x) + c_r(x) + \xi \cdot x - \varphi_r(\xi)\bigr), \quad \xi \in \mathbf{R}^N, \quad \varphi_r(\xi) = \ln \sum_x \exp\bigl(c_0(x) + c_r(x) + \xi \cdot x\bigr). \tag{2}$$

This distribution is derived from $p(\tilde{x}, \tilde{y}_r|x)$ and a prior of $x$ of the form

$$\omega(x; \xi) = \exp\bigl(\xi \cdot x - \psi(\xi)\bigr), \quad \psi(\xi) = \sum_i \psi(\xi_i).$$

Each $p_r(x; \xi)$ includes one of $\{c_1(x), c_2(x)\}$ in eq. (1), and the additional parameter $\xi$ adjusts the linear term of $x$. In turbo decoding, the marginalization of $p_r(x; \xi)$ is feasible, and the soft decoding of $p(x|\tilde{x}, \tilde{y}_1, \tilde{y}_2)$ is approximated by updating $\xi$ iteratively in a "turbo"-like way.
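To make these quantities concrete, here is a minimal brute-force sketch for a toy instance: it enumerates all $2^N$ candidates, evaluates the posterior of eq. (1), and computes the soft bits. The sizes, parity index sets, flipping probability, and seed are all hypothetical choices for illustration; real turbo codes preclude this enumeration, which is exactly why the iterative decoder below is needed.

```python
import itertools
import numpy as np

# Toy instance (all values hypothetical): N = 3 information bits, L = 2
# parity bits per constituent code, BSC flipping probability sigma = 0.1.
N, sigma = 3, 0.1
beta = 0.5 * np.log((1 - sigma) / sigma)    # sigma = (1 - tanh(beta)) / 2
L1_sets = [(0, 1), (1, 2)]                  # index sets L_{1j} for encoder 1
L2_sets = [(0, 2), (0, 1, 2)]               # index sets L_{2j} for encoder 2

def parity(x, sets):
    """Parity bits y_{rj} = prod_{i in L_{rj}} x_i."""
    return np.array([np.prod([x[i] for i in s]) for s in sets])

rng = np.random.default_rng(0)
x_true = rng.choice([-1, 1], size=N)
flip = lambda v: v * rng.choice([-1, 1], size=v.shape, p=[sigma, 1 - sigma])
x_obs = flip(x_true)                        # noisy systematic bits x~
y1_obs = flip(parity(x_true, L1_sets))      # noisy parity bits y~1
y2_obs = flip(parity(x_true, L2_sets))      # noisy parity bits y~2

# Brute-force posterior p(x | x~, y~1, y~2) over all 2^N candidates, eq. (1).
X = np.array(list(itertools.product([-1, 1], repeat=N)))
c0 = beta * X @ x_obs
c1 = np.array([beta * parity(x, L1_sets) @ y1_obs for x in X])
c2 = np.array([beta * parity(x, L2_sets) @ y2_obs for x in X])
post = np.exp(c0 + c1 + c2)
post /= post.sum()

soft = X.T @ post                           # soft bits: sum_x x p(x | ...)
print("soft bits:", soft, " MPM decision:", np.sign(soft))
```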

2.2 Information Geometrical View of MPM Decoding

Let us consider the family of all probability distributions over $x$, which we call $S$. The manifold $S$ is equivalent to the family of distributions over $2^N$ atoms,

$$S = \Bigl\{ p(x) \Bigm| p(x) > 0, \; \sum_x p(x) = 1 \Bigr\}.$$

We define e-flat and m-flat submanifolds of $S$.

e-flat manifold: A submanifold $M \subset S$ is said to be e-flat when the following $r(x; t)$ belongs to $M$ for all $q(x), p(x) \in M$. Here $c(t)$ is the normalization factor:

$$\ln r(x; t) = (1 - t) \ln q(x) + t \ln p(x) + c(t), \quad t \in \mathbf{R}.$$

m-flat manifold: A submanifold $M \subset S$ is said to be m-flat when the following mixture $r(x; t)$ belongs to $M$ for all $q(x), p(x) \in M$:

$$r(x; t) = (1 - t) q(x) + t p(x), \quad t \in [0, 1].$$

Now, let us consider the submanifold of distributions $p_0(x; \theta)$ defined as

$$M_0 = \bigl\{ p_0(x; \theta) = \exp\bigl(c_0(x) + \theta \cdot x - \varphi_0(\theta)\bigr) \bigm| \theta \in \mathbf{R}^N \bigr\}. \tag{3}$$

$\theta$ gives the coordinate system of $M_0$ and is called the natural parameter. Since $c_0(x) = \beta \tilde{x} \cdot x$, every distribution of $M_0$ can be rewritten as

$$p_0(x; \theta) = \exp\bigl(c_0(x) + \theta \cdot x - \varphi_0(\theta)\bigr) = \exp\bigl((\beta \tilde{x} + \theta) \cdot x - \varphi_0(\theta)\bigr), \quad \varphi_0(\theta) = \psi(\beta \tilde{x} + \theta).$$

This shows that every distribution of $M_0$ is decomposable, or factorizable. From information geometry [1], we have the following theorem on the m-projection.

Theorem 1. Let $q(x) \in S$, and let $\hat{\theta}$ be the parameter of $p_0(x; \theta) \in M_0$ that minimizes the KL-divergence from $q(x)$ to $M_0$:

$$\hat{\theta} = \pi_{M_0} \circ q(x) \stackrel{\text{def}}{=} \mathop{\mathrm{argmin}}_{\theta \in \mathbf{R}^N} D[q(x); p_0(x; \theta)].$$

$p_0(x; \hat{\theta})$ is called the m-projection of $q(x)$ to $M_0$. The m-projection is unique.

Here, $D[q(x); p_0(x; \theta)]$ is the Kullback-Leibler divergence, defined as

$$D[q(x); p_0(x; \theta)] = \sum_x q(x) \ln \frac{q(x)}{p_0(x; \theta)}.$$

Generally $D[q(x); p_0(x; \theta)] \ge 0$, and it is equal to $0$ if and only if $q(x) = p_0(x; \theta)$ holds for every $x$. It is easy to show that marginalization corresponds to the m-projection to $M_0$ [7]. Since soft decoding and marginalization are equivalent, soft decoding is also equivalent to the m-projection to $M_0$. Finally, the following equation holds:

$$\hat{x} \stackrel{\text{def}}{=} \sum_x x\, q(x) = \sum_x x\, p_0(x; \hat{\theta}) = \eta_0(\hat{\theta}).$$

$\eta_0(\theta)$, which is called the expectation parameter in information geometry [1], gives the soft bits of $p_0(x; \theta)$ and provides another coordinate system of $M_0$.
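The claim that marginalization is the m-projection onto $M_0$ can be checked numerically. The sketch below is a toy verification, not part of the decoder; it needs SciPy, and for simplicity takes $c_0(x) = 0$, so $M_0$ is the fully factorized family. Minimizing $D[q; p_0(\cdot; \theta)]$ for a random $q(x)$ preserves the expectation of $x$, i.e., the marginals.

```python
import itertools
import numpy as np
from scipy.optimize import minimize

N = 3
X = np.array(list(itertools.product([-1, 1], repeat=N)))
rng = np.random.default_rng(1)
q = rng.random(2 ** N)              # a random distribution q(x) over 2^N atoms
q /= q.sum()

def p0(theta):
    """Factorized family p0(x; theta) with c0 = 0."""
    w = np.exp(X @ theta)
    return w / w.sum()

def kl(theta):
    """KL divergence D[q; p0(.; theta)]."""
    return np.sum(q * np.log(q / p0(theta)))

theta_hat = minimize(kl, np.zeros(N)).x
print("E_q[x]          :", X.T @ q)
print("eta0(theta_hat) :", X.T @ p0(theta_hat))   # same marginals
# Closed form for this factorized family: theta_hat_i = arctanh(E_q[x_i]).
print("arctanh(E_q[x]) :", np.arctanh(X.T @ q), " ~", theta_hat)
```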

2.3 Information Geometry of Turbo Decoding

The turbo decoding process is written as follows:

1. Set $\xi_1^0 = 0$ and $t = 0$.

2. Project $p_2(x; \xi_1^t)$ onto $M_0$ and calculate $\xi_2^{t+1}$ by
$$\xi_2^{t+1} = \pi_{M_0} \circ p_2(x; \xi_1^t) - \xi_1^t.$$

3. Project $p_1(x; \xi_2^{t+1})$ onto $M_0$ and calculate $\xi_1^{t+1}$ by
$$\xi_1^{t+1} = \pi_{M_0} \circ p_1(x; \xi_2^{t+1}) - \xi_2^{t+1}.$$

4. If $\pi_{M_0} \circ p_1(x; \xi_2^{t+1}) \ne \pi_{M_0} \circ p_2(x; \xi_1^{t+1})$, increment $t$ and go to step 2.

Turbo decoding approximates $\theta$, the parameter of the m-projection of $p(x|\tilde{x}, \tilde{y}_1, \tilde{y}_2)$ onto $M_0$, as $\theta = \xi_1 + \xi_2$, where the estimated distribution is

$$p_0(x; \theta) = \exp\bigl(c_0(x) + \xi_1 \cdot x + \xi_2 \cdot x - \varphi_0(\xi_1 + \xi_2)\bigr). \tag{4}$$
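The following minimal sketch runs steps 1-4 on a toy code, with brute-force marginalization of $p_r(x; \xi)$ standing in for the constituent soft decoders (a practical decoder would use BCJR). It exploits the factorized form of $M_0$, where $\eta_{0,i}(\theta) = \tanh(\beta\tilde{x}_i + \theta_i)$, to recover the m-projection parameter with an arctanh; the parity checks, observations, and helper names are hypothetical.

```python
import itertools
import numpy as np

# Toy model (hypothetical values): one parity bit per constituent code.
N, beta = 3, 1.0
X = np.array(list(itertools.product([-1, 1], repeat=N)))
rng = np.random.default_rng(2)
x_obs = rng.choice([-1, 1], N)
c0 = beta * X @ x_obs                                   # c_0(x) = beta x~ . x
c1 = beta * (X[:, 0] * X[:, 1]) * rng.choice([-1, 1])   # c_1(x) for decoder 1
c2 = beta * (X[:, 1] * X[:, 2]) * rng.choice([-1, 1])   # c_2(x) for decoder 2

def project(c_r, xi):
    """theta of the m-projection of p_r(x; xi) onto M0."""
    w = np.exp(c0 + c_r + X @ xi)
    eta = X.T @ (w / w.sum())               # soft bits eta_r(xi)
    # In M0, eta_{0,i} = tanh(beta x~_i + theta_i); invert to recover theta.
    return np.arctanh(eta) - beta * x_obs

xi1 = np.zeros(N)                           # step 1
for t in range(200):                        # step 4: iterate until agreement
    xi2 = project(c2, xi1) - xi1            # step 2
    xi1 = project(c1, xi2) - xi2            # step 3
theta = xi1 + xi2                           # theta* = xi1* + xi2*, eq. (6)

posterior = np.exp(c0 + c1 + c2)
posterior /= posterior.sum()
print("turbo soft bits:", np.tanh(beta * x_obs + theta))
print("exact soft bits:", X.T @ posterior)  # generally close, but not equal
```

The last two lines exhibit the decoding error discussed below: the turbo fixed point approximates, but does not generally equal, the true marginals.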

An intuitive understanding of turbo decoding is as follows. In step 2, $(\xi_2 \cdot x)$ in eq. (4) is replaced with $c_2(x)$; the distribution becomes $p_2(x; \xi_1)$, and $\xi_2$ is estimated by projecting it onto $M_0$. In step 3, $(\xi_1 \cdot x)$ in eq. (4) is replaced with $c_1(x)$, and $\xi_1$ is estimated by the m-projection of $p_1(x; \xi_2)$. We now define the submanifold corresponding to $p_r(x; \xi)$ $(r = 1, 2)$:



$$M_r = \bigl\{ p_r(x; \xi) = \exp\bigl(c_0(x) + c_r(x) + \xi \cdot x - \varphi_r(\xi)\bigr) \bigm| \xi \in \mathbf{R}^N \bigr\}.$$

$\xi$ is the coordinate system of $M_r$, and $M_r$ is also an e-flat submanifold. $M_1 \ne M_2$ and $M_r \ne M_0$ hold, because $c_r(x)$ includes cross terms of $x$ and $c_1(x) \ne c_2(x)$ in general. The information geometrical view of turbo decoding is shown schematically in Fig. 2.

3 The Properties of Turbo Decoding

3.1 Equilibrium

For the following discussion, we define the expectation parameters of $p_r(x; \xi)$, $r = 1, 2$:

$$\eta_r(\xi) \stackrel{\text{def}}{=} \sum_x x\, p_r(x; \xi), \quad r = 1, 2.$$

When turbo decoding converges, the equilibrium solution defines three important distributions, $p_1(x; \xi_2^*)$, $p_2(x; \xi_1^*)$, and $p_0(x; \theta^*)$. They satisfy the following two conditions:

1. $\pi_{M_0} \circ p_1(x; \xi_2^*) = \pi_{M_0} \circ p_2(x; \xi_1^*) = p_0(x; \theta^*)$; in other words, $\eta_1(\xi_2^*) = \eta_2(\xi_1^*) = \eta_0(\theta^*)$. (5)

2. $\theta^* = \xi_1^* + \xi_2^*$. (6)

For $p_0(x; \theta) \in M_0$, there exist $\xi_1$ and $\xi_2$ such that $\eta_1(\xi_2) = \eta_2(\xi_1) = \eta_0(\theta)$; we write $\xi_1(\theta)$ and $\xi_2(\theta)$ for the parameters satisfying this equation. The manifold $M(\theta)$ is defined by

$$M(\theta) = \Bigl\{ p(x) \Bigm| \sum_x x\, p(x) = \eta_0(\theta) \Bigr\}.$$

This is an m-flat submanifold, which includes $p_1(x; \xi_2(\theta))$, $p_2(x; \xi_1(\theta))$, and $p_0(x; \theta)$. From its definition, the expectation of $x$ is the same for any $p(x) \in M(\theta)$; hence the m-projection of any $p(x) \in M(\theta)$ to $M_0$ coincides with $p_0(x; \theta)$. We call $M(\theta)$ an equimarginal submanifold. Let us also define an e-flat version of the submanifold, $E(\theta)$, which connects $p_0(x; \theta)$, $p_1(x; \xi_2)$, and $p_2(x; \xi_1)$ in a log-linear manner:

$$E(\theta) = \Bigl\{ p(x) = C\, p_0(x; \theta)^{t_0}\, p_1(x; \xi_2)^{t_1}\, p_2(x; \xi_1)^{t_2} \Bigm| \sum_{r=0}^2 t_r = 1 \Bigr\}.$$

When eq. (6) holds, $p(x|\tilde{x}, \tilde{y}_1, \tilde{y}_2)$ is included in $E(\theta)$; this is proved by taking $t_0 = -1$, $t_1 = t_2 = 1$.

Theorem 2. When the turbo decoding procedure converges, the convergent probability distributions $p_0(x; \theta^*)$, $p_1(x; \xi_2^*)$, and $p_2(x; \xi_1^*)$ belong to the equimarginal submanifold $M(\theta^*)$, while its e-flat version $E(\theta^*)$ includes these three distributions and also the posterior distribution $p(x|\tilde{x}, \tilde{y}_1, \tilde{y}_2)$ (Fig. 3).

[Figure 2: Turbo decoding. $q(x)$ is m-projected from $M_1$ and $M_2$ onto $M_0$, yielding the projections of $p_1(x; \xi_2)$, $p_2(x; \xi_1)$, and $p_0(x; \theta)$.]

[Figure 3: $M(\theta^*)$ and $E(\theta^*)$. The equimarginal submanifold $M(\theta^*)$ contains $p_0(x; \theta^*)$, $p_1(x; \xi_2^*)$, and $p_2(x; \xi_1^*)$; the e-flat version $E(\theta^*)$ contains these three distributions and the posterior $q(x)$.]
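The inclusion argument behind Theorem 2 (taking $t_0 = -1$, $t_1 = t_2 = 1$) is easy to verify numerically: for any $\xi_1, \xi_2$ with $\theta = \xi_1 + \xi_2$, renormalizing $p_1 p_2 / p_0$ reproduces the posterior. A toy sketch with hypothetical parity checks and arbitrary $\xi_1, \xi_2$:

```python
import itertools
import numpy as np

N, beta = 3, 1.0
X = np.array(list(itertools.product([-1, 1], repeat=N)))
rng = np.random.default_rng(3)
c0 = beta * X @ rng.choice([-1, 1], N)
c1 = beta * (X[:, 0] * X[:, 1]) * rng.choice([-1, 1])
c2 = beta * (X[:, 1] * X[:, 2]) * rng.choice([-1, 1])
xi1, xi2 = rng.normal(size=N), rng.normal(size=N)   # arbitrary: only the sum matters
theta = xi1 + xi2

norm = lambda w: w / w.sum()
p0 = norm(np.exp(c0 + X @ theta))                   # p0(x; theta)
p1 = norm(np.exp(c0 + c1 + X @ xi2))                # p1(x; xi2)
p2 = norm(np.exp(c0 + c2 + X @ xi1))                # p2(x; xi1)
posterior = norm(np.exp(c0 + c1 + c2))              # p(x | x~, y~1, y~2)

combo = norm(p1 * p2 / p0)      # t0 = -1, t1 = t2 = 1, renormalized
print("max deviation:", np.abs(combo - posterior).max())   # ~ machine epsilon
```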

If $M(\theta^*)$ includes $p(x|\tilde{x}, \tilde{y}_1, \tilde{y}_2)$, then $p_0(x; \theta^*)$ is the true marginalization of $p(x|\tilde{x}, \tilde{y}_1, \tilde{y}_2)$. However, $M(\theta^*)$ does not necessarily include $p(x|\tilde{x}, \tilde{y}_1, \tilde{y}_2)$. This means that $p(x|\tilde{x}, \tilde{y}_1, \tilde{y}_2)$ and $p_0(x; \theta^*)$ are not necessarily equimarginal, which is the origin of the decoding error.

3.2 Condition of Stability

From the properties of exponential family distributions, the following relations hold for the expectation parameters, with $\varphi_0$ in eq. (3) and $\varphi_r$ $(r = 1, 2)$ in eq. (2):

$$\eta_0(\theta) = \partial_\theta \varphi_0(\theta), \quad \eta_r(\xi) = \partial_\xi \varphi_r(\xi).$$

We give a sufficiently small perturbation $\delta$ to $\xi_1^*$ and apply one turbo decoding step. The m-projection of $p_2(x; \xi_1^* + \delta)$ onto $M_0$ gives

$$\eta_0(\theta^* + \delta') = \eta_2(\xi_1^* + \delta), \quad \delta' = G_0(\theta^*)^{-1} G_2(\xi_1^*)\, \delta.$$

$G_0(\theta)$ is the Fisher information matrix of $p_0(x; \theta)$, and $G_r(\xi)$ is that of $p_r(x; \xi)$ $(r = 1, 2)$, defined as

$$G_0(\theta) = \partial_\theta \partial_\theta^T \varphi_0(\theta) = \partial_\theta \eta_0(\theta), \quad G_r(\xi) = \partial_\xi \partial_\xi^T \varphi_r(\xi) = \partial_\xi \eta_r(\xi), \quad r = 1, 2.$$

Note that $G_0(\theta)$ is a diagonal matrix. The new perturbation of $\xi_2$ in step 2 is therefore

$$\xi_2 = \xi_2^* + \bigl(G_0(\theta^*)^{-1} G_2(\xi_1^*) - I_N\bigr)\delta,$$

where $I_N$ is the identity matrix of size $N$. Following the same line for step 3, we derive the following theorem, which coincides with the result of Richardson [6].

Theorem 3. Let $\lambda_i$ be the eigenvalues of the matrix $T$ defined as

$$T = \bigl(G_0(\theta^*)^{-1} G_1(\xi_2^*) - I_N\bigr)\bigl(G_0(\theta^*)^{-1} G_2(\xi_1^*) - I_N\bigr).$$

When $|\lambda_i| < 1$ holds for all $i$, the equilibrium point is stable.
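Theorem 3 can be checked numerically on the toy model of the earlier sketches: run turbo decoding to its fixed point, build $G_0$, $G_1$, $G_2$ by brute force (for an exponential family over $x$, the Fisher information with respect to the linear parameter is the covariance of $x$), and inspect the spectrum of $T$. All setup values and helper names are hypothetical.

```python
import itertools
import numpy as np

N, beta = 3, 1.0
X = np.array(list(itertools.product([-1, 1], repeat=N)))
rng = np.random.default_rng(2)
x_obs = rng.choice([-1, 1], N)
c0 = beta * X @ x_obs
c1 = beta * (X[:, 0] * X[:, 1]) * rng.choice([-1, 1])
c2 = beta * (X[:, 1] * X[:, 2]) * rng.choice([-1, 1])

def eta(logw):
    w = np.exp(logw)
    return X.T @ (w / w.sum())

def project(c_r, xi):
    return np.arctanh(eta(c0 + c_r + X @ xi)) - beta * x_obs

xi1 = np.zeros(N)                      # run turbo decoding to its fixed point
for _ in range(200):
    xi2 = project(c2, xi1) - xi1
    xi1 = project(c1, xi2) - xi2
theta = xi1 + xi2

def fisher(logw):
    """Fisher information of an exponential family over x: Cov[x]."""
    w = np.exp(logw)
    p = w / w.sum()
    m = X.T @ p
    return (X * p[:, None]).T @ X - np.outer(m, m)

G0 = fisher(c0 + X @ theta)            # diagonal, since p0 is factorized
G1 = fisher(c0 + c1 + X @ xi2)
G2 = fisher(c0 + c2 + X @ xi1)
I = np.eye(N)
T = (np.linalg.inv(G0) @ G1 - I) @ (np.linalg.inv(G0) @ G2 - I)
print("spectral radius of T:", np.abs(np.linalg.eigvals(T)).max())  # < 1: stable
```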

3.3 Cost Function and Characteristics of Equilibrium

We now give the cost function that plays an important role in turbo decoding:

$$F(\xi_1, \xi_2) = \varphi_0(\theta) - \bigl(\varphi_1(\xi_2) + \varphi_2(\xi_1)\bigr), \quad \theta = \xi_1 + \xi_2.$$

This function is identical to the "free energy" defined in [5], implying a close relationship between our approach and the statistical-mechanical one.

Theorem 4. The equilibrium state $(\xi_1^*, \xi_2^*)$ is a critical point of $F$.

Proof. Direct calculation gives $\partial_{\xi_1} F = \eta_0(\theta) - \eta_2(\xi_1)$ and $\partial_{\xi_2} F = \eta_0(\theta) - \eta_1(\xi_2)$. At the equilibrium, $\eta_0(\theta^*) = \eta_1(\xi_2^*) = \eta_2(\xi_1^*)$ holds, and the proof is completed.

When $(\xi_r^{t+1} - \xi_r^t)$ is small, the linear approximation of the mapping defined by one cycle of turbo decoding is

$$\begin{pmatrix} \xi_1^{t+1} \\ \xi_2^{t+1} \end{pmatrix} - \begin{pmatrix} \xi_1^{t} \\ \xi_2^{t} \end{pmatrix} \simeq -\begin{pmatrix} O & G_0^{-1} \\ G_0^{-1} & O \end{pmatrix} \begin{pmatrix} \partial_{\xi_1} F \\ \partial_{\xi_2} F \end{pmatrix},$$

where $O$ is the $N \times N$ zero matrix and $G_0 = G_0(\theta^*)$.

This shows that the algorithm performs a "skewed" gradient search in the vicinity of an equilibrium: the update of $\xi_1$ is driven by $\partial_{\xi_2} F$ and vice versa. The Hessian of $F$ is

$$H = \begin{pmatrix} \partial_{\xi_1}\partial_{\xi_1} F & \partial_{\xi_1}\partial_{\xi_2} F \\ \partial_{\xi_2}\partial_{\xi_1} F & \partial_{\xi_2}\partial_{\xi_2} F \end{pmatrix} = \begin{pmatrix} G_0 - G_2 & G_0 \\ G_0 & G_0 - G_1 \end{pmatrix}.$$

By transforming the variables as $\theta = \xi_1 + \xi_2$ and $\zeta = \xi_1 - \xi_2$, we have

$$\begin{pmatrix} \partial_{\theta}\partial_{\theta} F & \partial_{\theta}\partial_{\zeta} F \\ \partial_{\zeta}\partial_{\theta} F & \partial_{\zeta}\partial_{\zeta} F \end{pmatrix} = \frac{1}{4}\begin{pmatrix} 4G_0(\theta) - (G_1 + G_2) & G_1 - G_2 \\ G_1 - G_2 & -(G_1 + G_2) \end{pmatrix}.$$

Most probably $\partial_\theta \partial_\theta F$ is positive definite, while $\partial_\zeta \partial_\zeta F$ is always negative definite; thus $F$ generally has a saddle structure at the equilibrium.
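Theorem 4 admits a direct numerical check on the same toy model: at the converged $(\xi_1^*, \xi_2^*)$, the analytic gradient $(\eta_0 - \eta_2, \eta_0 - \eta_1)$ vanishes, and a finite difference of $F$ agrees. A sketch under the same hypothetical setup:

```python
import itertools
import numpy as np

N, beta = 3, 1.0
X = np.array(list(itertools.product([-1, 1], repeat=N)))
rng = np.random.default_rng(2)
x_obs = rng.choice([-1, 1], N)
c0 = beta * X @ x_obs
c1 = beta * (X[:, 0] * X[:, 1]) * rng.choice([-1, 1])
c2 = beta * (X[:, 1] * X[:, 2]) * rng.choice([-1, 1])

logZ = lambda logw: np.log(np.exp(logw).sum())
F = lambda a, b: (logZ(c0 + X @ (a + b))        # phi_0(theta), theta = xi1 + xi2
                  - logZ(c0 + c1 + X @ b)       # phi_1(xi2)
                  - logZ(c0 + c2 + X @ a))      # phi_2(xi1)

def eta(logw):
    w = np.exp(logw)
    return X.T @ (w / w.sum())

def project(c_r, xi):
    return np.arctanh(eta(c0 + c_r + X @ xi)) - beta * x_obs

xi1 = np.zeros(N)                      # run turbo decoding to its fixed point
for _ in range(200):
    xi2 = project(c2, xi1) - xi1
    xi1 = project(c1, xi2) - xi2
theta = xi1 + xi2

grad1 = eta(c0 + X @ theta) - eta(c0 + c2 + X @ xi1)   # dF/dxi1 = eta0 - eta2
grad2 = eta(c0 + X @ theta) - eta(c0 + c1 + X @ xi2)   # dF/dxi2 = eta0 - eta1
eps = 1e-6
fd = (F(xi1 + eps * np.eye(N)[0], xi2) - F(xi1, xi2)) / eps
print("analytic gradient norms:", np.linalg.norm(grad1), np.linalg.norm(grad2))
print("finite-difference dF/dxi1[0]:", fd)              # ~0 at the equilibrium
```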

3.4 Perturbation Analysis

For the following discussion, we define a distribution $p(x; \theta, v)$ as

$$p(x; \theta, v) = \exp\bigl(c_0(x) + \theta \cdot x + v \cdot c(x) - \varphi(\theta, v)\bigr), \quad c(x) \stackrel{\text{def}}{=} (c_1(x), c_2(x))^T, \quad \varphi(\theta, v) = \ln \sum_x \exp\bigl(c_0(x) + \theta \cdot x + v \cdot c(x)\bigr).$$

This family includes $p_0(x; \theta)$ $(v = 0)$, $p(x|\tilde{x}, \tilde{y}_1, \tilde{y}_2)$ $(\theta = 0, v = \mathbf{1})$, and $p_r(x; \xi)$ $(\theta = \xi, v = e_r)$, where $\mathbf{1} = (1, 1)^T$, $e_1 = (1, 0)^T$, and $e_2 = (0, 1)^T$. The expectation parameter $\eta(\theta, v)$ is defined as

$$\eta(\theta, v) = \partial_\theta \varphi(\theta, v) = \sum_x x\, p(x; \theta, v).$$

Let us consider $M(\theta^*)$: every distribution $p(x; \theta, v) \in M(\theta^*)$ has the same expectation parameter, that is, $\eta(\theta, v) = \eta^*(\theta^*)$ holds, where we define $\eta^*(\theta) = \eta(\theta, 0)$. From the Taylor expansion, we have

$$\eta_i(\theta, v) = \eta_i(\theta^*) + \sum_j \partial_j \eta_i(\theta^*)\,\varepsilon_j + \sum_r \partial_r \eta_i(\theta^*)\, v_r + \frac{1}{2}\sum_{r,s} \partial_r \partial_s \eta_i(\theta^*)\, v_r v_s + \sum_{j,r} \partial_r \partial_j \eta_i(\theta^*)\, v_r \varepsilon_j + \frac{1}{2}\sum_{k,l} \partial_k \partial_l \eta_i(\theta^*)\, \varepsilon_k \varepsilon_l + O(\|v\|^3) + O(\|\varepsilon\|^3). \tag{7}$$

The indices $\{i, j, k, l\}$ are for $\theta$, $\{r, s\}$ are for $v$, and $\varepsilon \stackrel{\text{def}}{=} \theta - \theta^*$. Using the conditions $\eta_i(\theta, v) = \eta_i(\theta^*)$ and $\partial_j \eta_i(\theta^*) = g_{ij}(\theta^*)$, where $\{g_{ij}\}$ is the Fisher information matrix of $p(x; \theta^*, 0)$ and is a diagonal matrix, we solve for $\varepsilon_i$ as a function of the $v_r$ up to second order, neglecting higher orders of $v_r$:

$$\varepsilon_i \simeq -g^{ii}\sum_r A^i_r v_r - \frac{g^{ii}}{2}\sum_{r,s}\Bigl(\partial_r - \sum_k g^{kk} A^k_r \partial_k\Bigr)\Bigl(\partial_s - \sum_j g^{jj} A^j_s \partial_j\Bigr)\eta_i(\theta^*)\, v_r v_s, \tag{8}$$

where $g^{ii} = 1/g_{ii}$ and $A^i_r = \partial_r \eta_i(\theta^*)$.

Let us now consider $p(x; \theta^*, 0)$, $p(x; \xi_1^*, e_2)$, $p(x; \xi_2^*, e_1)$, and $p(x; 0, \mathbf{1})$. Let $p(x; \theta^*, 0)$, $p(x; \xi_1^*, e_2)$, and $p(x; \xi_2^*, e_1)$ be included in $M(\theta^*)$, and let $\theta^* = \xi_1^* + \xi_2^*$ be satisfied; this coincides with the converged point of turbo decoding. From eqs. (7) and (8), we have the following theorem.

Theorem 5. The true expectation of $x$, which is $\eta(0, \sum_r e_r)$, is approximated as

$$\eta\Bigl(0, \sum_r e_r\Bigr) \simeq \eta^*(\theta^*) + \frac{1}{2}\sum_{r \ne s}\Bigl(\partial_r - \sum_k g^{kk} A^k_r \partial_k\Bigr)\Bigl(\partial_s - \sum_j g^{jj} A^j_s \partial_j\Bigr)\eta^*(\theta^*), \tag{9}$$

where $\eta^*(\theta^*)$ is the solution of turbo decoding. Equation (9) is related to the m-embedding curvature of $E(\theta)$ (Fig. 3). The result can be extended to the general case where $K > 2$ [8].

4 Discussion

We have shown a new framework for understanding and analyzing turbo decoding. It elucidates the mathematical background and reveals basic properties. The information geometrical structure of the equilibrium is summarized in Theorem 2, which shows that the e-flat submanifold $E(\theta^*)$ plays an important role. Furthermore, Theorem 5 shows that the relation between $E(\theta^*)$ and the m-flat submanifold $M(\theta^*)$ causes the decoding error, and that the principal component of the error is the curvature of $E(\theta^*)$. Since the curvature strongly depends on the codeword, we can control it through the encoder design. This suggests room for improvement of the "near optimum error correcting code". This paper gives a first step toward the information geometrical understanding of belief propagation decoders. The main results are for turbo decoding, but the mechanism is common to a wider class of iterative decoders, and the framework is also valid for them. We believe further study in this direction will lead to a better understanding of these methods.

References

[1] S. Amari and H. Nagaoka. Methods of Information Geometry. AMS and Oxford University Press, 2000.

[2] C. Berrou and A. Glavieux. Near optimum error correcting coding and decoding: Turbo-codes. IEEE Trans. Commun., 44(10):1261-1271, October 1996.

[3] S. Ikeda, T. Tanaka, and S. Amari. Information geometrical framework for analyzing belief propagation decoder. To appear in Neural Information Processing Systems 2001 (NIPS 2001), December 2001.

[4] S. Ikeda, T. Tanaka, and S. Amari. Information geometry of turbo codes and low-density parity-check codes. Submitted to IEEE Trans. Inform. Theory, August 2001.

[5] Y. Kabashima and D. Saad. The TAP approach to intensive and extensive connectivity systems. In M. Opper and D. Saad, editors, Advanced Mean Field Methods, pages 65-84. MIT Press, 2001.

[6] T. J. Richardson. The geometry of turbo-decoding dynamics. IEEE Trans. Inform. Theory, 46(1):9-23, January 2000.

[7] T. Tanaka. Information geometry of mean-field approximation. In M. Opper and D. Saad, editors, Advanced Mean Field Methods, pages 259-273. MIT Press, 2001.

[8] T. Tanaka, S. Ikeda, and S. Amari. Information-geometrical significance of sparsity in Gallager codes. To appear in Neural Information Processing Systems 2001 (NIPS 2001), December 2001.
