2010 IEEE 26th Convention of Electrical and Electronics Engineers in Israel

Finite Blocklength Coding for Channels with Side Information at the Receiver

Amir Ingber and Meir Feder
Department of EE-Systems, The Faculty of Engineering, Tel Aviv University, Tel Aviv 69978, ISRAEL
email: {ingber, meir}@eng.tau.ac.il

Abstract—The communication model of a memoryless channel that depends on a random state that is known at the receiver only is studied. The channel can be thought of as a set of underlying channels with a fixed state, where at each channel use one of these channels is chosen at random, and this selection is known to the receiver only. The capacity of such a channel is known, and is given by the expectation (w.r.t. the random state) of the capacity of the underlying channels. In this work we examine the finite-length characteristics of this channel and their relation to the characteristics of the underlying channels. We derive error exponent bounds (random coding and sphere packing) for the channel and determine their relation to the corresponding bounds of the underlying channels. We also determine the channel dispersion and its relation to the dispersion of the underlying channels. We show that in both the error exponent and the dispersion cases, the expectation of these quantities is too optimistic w.r.t. the actual value. Examples of such channels are discussed.

I. INTRODUCTION

The communication model of a memoryless channel that depends on a random state is studied. We focus on the case where the random state, also known as channel state information (CSI), is known at the receiver only. The channel, denoted by W, can be thought of as a set of (memoryless) channels, W_S, where S is the random state. Such a model appears many times in practice: the ergodic fading channel is an example of such a channel, where the fade value is assumed to be known at the receiver. Sometimes the state S is a result of the communication scheme design and is inserted intentionally (for example, in order to attain symmetry properties).

In this work we study the relationship between the finite blocklength information theoretic properties of the channel W and those of the underlying channels W_S. The capacity of this channel is well known, and is generally given by the expectation (over S) of the capacity of the underlying channel W_S. We continue by analyzing other information theoretic properties, such as the error exponent and the channel dispersion of the channel W, and comparing them to the expected values of these properties of the channels W_S.

The main results can be summarized as follows. The random coding and sphere packing error exponent bounds [1] are both given by the expression E_0(ρ) − ρR

(optimized w.r.t. ρ), where E_0(ρ) is a function of the channel. We show that the function E_0 for the channel W is given by
$$E_0(\rho) = -\log E\left[ 2^{-E_0^{(S)}(\rho)} \right], \qquad (1)$$
where E_0^{(S)} is the corresponding E_0 function for the channel W_S, E[·] denotes expectation (w.r.t. S), and log = log_2. In [2], error exponents for channels with side information were considered. However, the focus there was on channels with CSI at the transmitter as well, compound channels and more. While the case of CSI known at the receiver only is a special case, the contribution here lies in the simplicity of the relation (1).

We also discuss the channel dispersion (see [3], [4]), which quantifies the speed at which the rate approaches capacity with the block length (when the codeword error rate is fixed). We show the following relationship between the dispersions of W and W_S, denoted V and V_S respectively:
$$V = E[V_S] + \mathrm{VAR}[C_S]. \qquad (2)$$

In both the error exponent and the dispersion cases, we show that the expected exponent and the expected dispersion are too optimistic w.r.t. the actual exponent and dispersion. Finally, we discuss several examples that involve channels with side information at the receiver, such as channel symmetrization, multilevel codes with multi-stage decoding (MLC-MSD) and bit-interleaved coded modulation (BICM).

II. THE GENERAL COMMUNICATION MODEL

A. Channel Model

Let W be a discrete memoryless channel (DMC)¹ with input x ∈ X and output (y, s) ∈ Y × S, where s ∈ S is the channel state, which is independent of the channel input X:
$$W(y,s|x) = P_{Y,S|X}(y,s|x) = P_S(s)\,P_{Y|S,X}(y|s,x). \qquad (3)$$

Definition 1: Let W_s be the channel W where the state S is fixed to s:
$$W_s(y|x) \triangleq P_{Y|S,X}(y|s,x). \qquad (4)$$

¹Similar results can be derived for continuous-output channels.

B. Communication Scheme

The communication scheme is defined as follows. Let n be the codeword length, and let M be a set of 2^{nR} messages. The encoder and decoder are denoted f_n and g_n respectively, where
• f_n : M → X^n is the encoder, which maps the input message m to the channel input x ∈ X^n.
• g_n : Y^n × S^n → M is the decoder, which maps the channel output and the channel state to an estimate m̂ of the transmitted message.
• The considered error probability is the codeword error probability p_e ≜ P(m̂ ≠ m), where the messages m are drawn randomly from M with equal probability.
The communication scheme is depicted in Figure 1. We shall be interested in the tradeoff between the rate R, the codelength n and the error probability p_e of the best possible codes.

III. INFORMATION THEORETIC ANALYSIS

Here we shall be interested in the performance of the optimal codes for the channel W. We review known results for the capacity, and present the results for the error exponent and the channel dispersion.

A. Capacity

Since the channel W is simply a DMC with a scalar input and a vector output, the capacity can be simply derived (see, e.g., [5]):

$$C(W) = \max_{p(x)} I(X;Y,S) = \max_{p(x)} \left[ I(X;Y|S) + I(X;S) \right] = \max_{p(x)} I(X;Y|S), \qquad (5)$$
where the last equality holds since X and S are independent. Note that the capacity can also be written as
$$\max_{p(x)} E_S\left[ I(X;Y|S=s) \right]. \qquad (6)$$
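As a concrete illustration (not part of the paper), the sketch below builds the joint-output channel of (3) from two hypothetical BSC state channels and checks, for an equiprobable input, that I(X; Y, S) equals E_S[I(X; Y | S = s)], in the spirit of (5)-(6); the maximization over p(x) is omitted, but for these symmetric channels the equiprobable input is in fact optimal. The state distribution and crossover probabilities are made-up values.

```python
import numpy as np

def mutual_info(p_x, W):
    """I(X;Y) in bits for a DMC with transition matrix W (rows: inputs) and input prior p_x."""
    p_xy = p_x[:, None] * W
    p_y = p_xy.sum(axis=0)
    return np.sum(p_xy * np.log2(W / p_y[None, :]))

def bsc(p):
    """Binary symmetric channel with crossover probability p."""
    return np.array([[1 - p, p], [p, 1 - p]])

p_S = np.array([0.7, 0.3])                    # hypothetical state distribution
W_states = [bsc(0.05), bsc(0.30)]             # the underlying channels W_s
p_x = np.array([0.5, 0.5])                    # equiprobable input (optimal here by symmetry)

# Joint-output channel W(y, s | x) = P_S(s) * P_{Y|S,X}(y | s, x); output alphabet is Y x S
W_joint = np.hstack([ps * Ws for ps, Ws in zip(p_S, W_states)])

I_joint = mutual_info(p_x, W_joint)                                        # I(X; Y, S)
I_cond = sum(ps * mutual_info(p_x, Ws) for ps, Ws in zip(p_S, W_states))   # E_S[I(X; Y | S = s)]
print(I_joint, I_cond)                        # the two values coincide
```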

In the paper we limit ourselves to a fixed input distribution (e.g. equiprobable). In this case the capacity is given by
$$I(X;Y|S) = E_S\left[ I(X;Y|S=s) \right]. \qquad (7)$$
Recalling the definition of the channel conditioned on the state s, we get
$$C(W) = I(X;Y|S) = E_S\left[ I(X;Y|S=s) \right] = E\left[ C(W_S) \right], \qquad (8)$$
where C(W_s) is the capacity of the underlying channel W_s. We conclude that the capacity formula can be interpreted as an expectation over the capacities of the underlying channels. Note that when the CSI is available at the transmitter as well, (8) holds even without the assumption of a fixed prior on X.

B. Error Exponent

The error exponent of a channel is given by [1]
$$E(R) \triangleq \lim_{n\to\infty} -\frac{1}{n}\log\left( p_e(n,R) \right), \qquad (9)$$
where p_e(n, R) is the average codeword error probability for the best code of length n and rate R (assuming that the limit exists). While the exact characterization of the error exponent is still an open question, two important bounds are known [1]: the random coding and the sphere packing error exponents, which are lower and upper bounds, respectively. The random coding exponent is given by
$$E_r(R) = \max_{\rho\in[0,1]} \max_{P_X} \left\{ E_0(\rho,P_X) - \rho R \right\}, \qquad (10)$$
where E_0(ρ, P_X) is given by
$$E_0(\rho,P_X) = -\log \sum_{y\in\mathcal{Y}} \left( \sum_{x\in\mathcal{X}} P_X(x)\, W(y|x)^{\frac{1}{1+\rho}} \right)^{1+\rho}. \qquad (11)$$
The sphere packing bound E_sp(R) is given by
$$E_{sp}(R) = \max_{\rho>0} \max_{P_X} \left\{ E_0(\rho,P_X) - \rho R \right\}. \qquad (12)$$

It can be seen that both exponent bounds are similar. In fact, they only differ in the optimization region of the parameter ρ, and they coincide at rates beyond a certain rate called the critical rate. We note that both bounds depend on the function E_0(·). For channels with CSI at the receiver, we derive E_0(·) explicitly. Following the relationship (8), we wish to find the connections between E_0(·) and the corresponding E_0 functions of the conditional channels W_s, denoted E_0^{(s)}.

Theorem 1: Let W be a channel with CSI at the receiver. Then the function E_0(·) for this channel is given by
$$E_0(\rho,P_X) = -\log E\left[ 2^{-E_0^{(S)}(\rho,P_X)} \right]. \qquad (13)$$

Proof: When the channel output is (y, s), we get
$$\begin{aligned}
E_0(\rho,P_X) &= -\log \sum_{y\in\mathcal{Y},\, s\in\mathcal{S}} \left( \sum_{x\in\mathcal{X}} P_X(x)\, W(y,s|x)^{\frac{1}{1+\rho}} \right)^{1+\rho} \\
&= -\log \sum_{s\in\mathcal{S}} P_S(s) \times \sum_{y\in\mathcal{Y}} \left( \sum_{x\in\mathcal{X}} P_X(x)\, P_{Y|S,X}(y|s,x)^{\frac{1}{1+\rho}} \right)^{1+\rho} \\
&\overset{(a)}{=} -\log \sum_{s\in\mathcal{S}} P_S(s)\, 2^{-E_0^{(s)}(\rho,P_X)} \\
&= -\log E\left[ 2^{-E_0^{(S)}(\rho,P_X)} \right],
\end{aligned}$$
where (a) follows from the definition of E_0^{(s)}.
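As a quick numerical check of Theorem 1 (an illustration, not from the paper), the following sketch computes E_0 directly on the joint-output channel W(y, s|x) and via the expectation formula (13), for a hypothetical pair of BSC state channels and an equiprobable input. The two values coincide, while the naive average E[E_0^{(S)}] is larger, i.e. too optimistic, as stated in Theorem 2 below.

```python
import numpy as np

def E0(rho, p_x, W):
    """Gallager's E0(rho, P_X) for a DMC given as an |X| x |Y| transition matrix, cf. (11)."""
    inner = (p_x[:, None] * W ** (1.0 / (1.0 + rho))).sum(axis=0)   # sum over x, for each output
    return -np.log2(np.sum(inner ** (1.0 + rho)))

def bsc(p):
    return np.array([[1 - p, p], [p, 1 - p]])

p_S = np.array([0.5, 0.5])            # hypothetical state distribution
W_states = [bsc(0.05), bsc(0.20)]     # underlying channels W_s
p_x = np.array([0.5, 0.5])            # fixed equiprobable input, as assumed in the paper
rho = 0.7

# E0 of W via Theorem 1: E0 = -log2 E_S[ 2^{-E0^{(S)}} ]
e0_states = np.array([E0(rho, p_x, Ws) for Ws in W_states])
e0_thm1 = -np.log2(np.sum(p_S * 2.0 ** (-e0_states)))

# Direct computation on W(y, s | x) = P_S(s) * P_{Y|S,X}(y | s, x) (output alphabet Y x S)
W_joint = np.hstack([ps * Ws for ps, Ws in zip(p_S, W_states)])
e0_direct = E0(rho, p_x, W_joint)

print(e0_thm1, e0_direct)             # the two values coincide
print(np.mean(e0_states))             # E[E0^{(S)}] is larger, i.e. too optimistic
```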

Fig. 1. Communication scheme for channels with CSI at the receiver (m ∈ M → Encoder → x ∈ X^n → W with random state s ∈ S^n → y ∈ Y^n → Decoder → m̂).

As a corollary, we get the random coding and the sphere packing exponents for the channel W according to (10) and (12). Following (8), one might think that the error exponent bounds (for example, E_r(R)) would be given by the expectation of the exponent function w.r.t. S. This is clearly not the case, as seen in Theorem 1. In addition, the following can be shown:

Theorem 2: Let Ẽ_r(R) be the average of E_r^{(S)} w.r.t. S:
$$\tilde{E}_r(R) \triangleq E\left[ E_r^{(S)}(R) \right]. \qquad (14)$$
Then Ẽ_r(R) always overestimates the true random coding exponent of W, E_r(R).

Proof: Let Ẽ_0(ρ, P_X) = E[E_0^{(S)}(ρ, P_X)]. Since 2^{−(·)} is convex, it follows by the Jensen inequality and Theorem 1 that
$$E_0(\rho,P_X) \le \tilde{E}_0(\rho,P_X). \qquad (15)$$
We continue with Ẽ_r(R):
$$\begin{aligned}
\tilde{E}_r(R) &= E\left[ E_r^{(S)}(R) \right] \\
&= E\left[ \sup_{P_X;\,\rho\in[0,1]} \left\{ E_0^{(S)}(\rho,P_X) - \rho R \right\} \right] \\
&\ge \sup_{P_X;\,\rho\in[0,1]} \left\{ E\left[ E_0^{(S)}(\rho,P_X) \right] - \rho R \right\} \\
&= \sup_{P_X;\,\rho\in[0,1]} \left\{ \tilde{E}_0(\rho,P_X) - \rho R \right\} \\
&\overset{(a)}{\ge} \sup_{P_X;\,\rho\in[0,1]} \left\{ E_0(\rho,P_X) - \rho R \right\} \\
&= E_r(R), \qquad (16)
\end{aligned}$$
where (a) follows from (15). Note that the proof of Theorem 2 holds no matter what the optimization region of ρ is. Therefore the same result for the sphere packing exponent follows similarly. We conclude that the expectation (w.r.t. S) of the error exponent bounds overestimates the true exponent bounds of W (and also the true error exponent, above the critical rate).

C. Dispersion

An alternative information theoretic measure for quantifying coding performance at finite block lengths is the channel dispersion. Suppose that a fixed codeword error probability p_e and a codeword length n are given. We can then seek the maximal achievable rate R given p_e and n. It appears that for fixed p_e and n, the gap to the channel capacity is approximately proportional to Q^{-1}(p_e)/√n (where Q(·) is the complementary Gaussian cumulative distribution function). The proportionality constant (squared) is called the channel dispersion. Formally, define the (operational) channel dispersion as follows [3]:

Definition 2: The dispersion V(W) of a channel W with capacity C is defined as
$$V(W) = \lim_{p_e\to 0} \limsup_{n\to\infty}\; n \cdot \left( \frac{C - R(n,p_e)}{Q^{-1}(p_e)} \right)^2, \qquad (17)$$
where R(n, p_e) is the highest achievable rate for codeword error probability p_e and codeword length n.

In 1962, Strassen [4] used the Gaussian approximation to derive the following result for DMCs:
$$R(n,p_e) = C - \sqrt{\frac{V}{n}}\, Q^{-1}(p_e) + O\!\left( \frac{\log n}{n} \right), \qquad (18)$$
where C is the channel capacity, and the new quantity V is the (information-theoretic) dispersion, which is given by
$$V \triangleq \mathrm{VAR}\left( i(X;Y) \right), \qquad (19)$$
where i(x; y) is the information spectrum, given by
$$i(x;y) \triangleq \log \frac{P_{XY}(x,y)}{P_X(x)\,P_Y(y)}, \qquad (20)$$
and the distribution of X is the capacity-achieving distribution that minimizes V. Strassen's result proves that the dispersion of DMCs is equal to VAR(i(X;Y)). This result was recently tightened (and extended to the power-constrained AWGN channel) in [3]. It is also known that the channel dispersion and the error exponent are related as follows: for a channel with capacity C and dispersion V, the error exponent can be approximated (for rates close to the capacity) by $E(R) \cong \frac{(C-R)^2}{2V\ln 2}$. See [3] for details on the early origins of this approximation by Shannon.
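As an illustration of (18)-(20) for an ordinary DMC (not from the paper), the following sketch computes C and V for a hypothetical BSC and evaluates the normal approximation without the O(log n/n) term; scipy is assumed to be available for Q^{-1}, and the exponent approximation near capacity is also evaluated.

```python
import numpy as np
from scipy.stats import norm

def capacity_dispersion(p_x, W):
    """C = E[i(X;Y)] and V = VAR(i(X;Y)) for a DMC with input prior p_x, cf. (19)-(20)."""
    p_xy = p_x[:, None] * W                      # joint distribution P_{XY}
    p_y = p_xy.sum(axis=0)
    i_xy = np.log2(W / p_y[None, :])             # information spectrum i(x;y)
    C = np.sum(p_xy * i_xy)
    V = np.sum(p_xy * (i_xy - C) ** 2)
    return C, V

W = np.array([[0.89, 0.11],                      # a BSC with crossover 0.11, as an illustration
              [0.11, 0.89]])
C, V = capacity_dispersion(np.array([0.5, 0.5]), W)

n, pe = 2000, 1e-3
R_approx = C - np.sqrt(V / n) * norm.isf(pe)     # eq. (18) without the O(log n / n) term
E_approx = 0.05 ** 2 / (2 * V * np.log(2))       # exponent approximation at R = C - 0.05
print(C, V, R_approx, E_approx)
```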

We now explore the dispersion for the case of channels with side information at the receiver.

Theorem 3: The dispersion of the channel W with CSI at the receiver is given by
$$V(W) = E\left[ V(W_S) \right] + \mathrm{VAR}\left[ C(W_S) \right], \qquad (21)$$
where both expectation and variance are taken w.r.t. the random state S.

Proof: Since W is nothing but a DMC with a vector output, the proof boils down to the calculation of VAR[i(X; (Y, S))]. The information spectrum in this case is given by
$$i(x;y,s) = \log \frac{P_{YSX}(y,s,x)}{P_{YS}(y,s)\,P_X(x)} \overset{(a)}{=} \log \frac{P_{Y|S,X}(y|s,x)}{P_{Y|S}(y|s)} \triangleq i(x;y|s), \qquad (22)$$
where (a) follows since X and S are independent. Suppose that s is fixed, i.e. consider the channel W_s. The capacity is given by
$$C(W_s) = E\left[ i(X;Y|S) \mid S=s \right] = I(X;Y|S=s). \qquad (23)$$
The dispersion of the channel W_s is given by
$$\begin{aligned}
V(W_s) &= \mathrm{VAR}\left( i(X;Y|S) \mid S=s \right) \\
&= E\left[ i^2(X;Y|S) \mid S=s \right] - E^2\left[ i(X;Y|S) \mid S=s \right] \\
&= E\left[ i^2(X;Y|S) \mid S=s \right] - C(W_s)^2. \qquad (24)
\end{aligned}$$
Finally, the dispersion of the original channel W is given as follows:
$$\begin{aligned}
V(W) &= \mathrm{VAR}\left( i(X;Y|S) \right) \\
&\overset{(a)}{=} E\big[ \mathrm{VAR}[\, i(X;Y|S) \mid S=s \,] \big] + \mathrm{VAR}\big[ E[\, i(X;Y|S) \mid S=s \,] \big] \\
&= E\left[ V(W_S) \right] + \mathrm{VAR}\left[ C(W_S) \right], \qquad (25)
\end{aligned}$$
where (a) follows from the law of total variance.

A few notes regarding this result:
• Let Ṽ(W) ≜ E[V(W_S)]. As an immediate corollary of Theorem 3, it can be seen that Ṽ(W) underestimates the true dispersion of W, V(W). This fact fits the exponent case: both the expected exponent and the expected dispersion are too optimistic w.r.t. the true exponent and dispersion.
• The factor VAR[C(W_S)] can be viewed as a penalty factor over the expected dispersion Ṽ(W), that grows as the capacities of the underlying channels are more spread.
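A small numerical check of Theorem 3 (an illustration, not from the paper): compute C(W_s) and V(W_s) for two hypothetical BSC state channels under an equiprobable input, combine them by (21), and compare with VAR(i(X; (Y, S))) computed directly on the joint-output channel.

```python
import numpy as np

def cap_disp(p_x, W):
    """Capacity C = E[i(X;Y)] and dispersion V = VAR(i(X;Y)) of a DMC with input prior p_x."""
    p_xy = p_x[:, None] * W
    p_y = p_xy.sum(axis=0)
    i_xy = np.log2(W / p_y[None, :])
    C = np.sum(p_xy * i_xy)
    return C, np.sum(p_xy * (i_xy - C) ** 2)

def bsc(p):
    return np.array([[1 - p, p], [p, 1 - p]])

p_S = np.array([0.5, 0.5])                       # hypothetical state distribution
W_states = [bsc(0.02), bsc(0.25)]                # underlying BSC state channels
p_x = np.array([0.5, 0.5])                       # equiprobable input

CV = np.array([cap_disp(p_x, Ws) for Ws in W_states])                     # rows: (C(W_s), V(W_s))
C_bar = np.sum(p_S * CV[:, 0])
V_thm3 = np.sum(p_S * CV[:, 1]) + np.sum(p_S * (CV[:, 0] - C_bar) ** 2)   # E[V_S] + VAR[C_S]

# Direct computation of VAR(i(X;(Y,S))) on the joint-output channel
W_joint = np.hstack([ps * Ws for ps, Ws in zip(p_S, W_states)])
_, V_direct = cap_disp(p_x, W_joint)

print(V_thm3, V_direct)                          # the two values agree; both exceed E[V(W_S)]
```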

IV. CODE DESIGN

When designing channel codes, the fact that the output is two-dimensional may complicate the code design. It would therefore be of interest to apply some processing to the outputs Y and S and feed them to the decoder as a single value. We seek a processing method that does not compromise the achievable performance over the modified channel (not only in the capacity sense, but also in the sense of the error probability at finite codelengths). For binary channels this can be done easily by calculating the log-likelihood ratio (LLR) for each channel output pair (y, s) (see Figure 2). For channel outputs s and y, denote the LLR of x given (y, s) by z:
$$z = \mathrm{LLR}(y,s) \triangleq \log \frac{P_{Y|S,X}(y|s,x=0)}{P_{Y|S,X}(y|s,x=1)}. \qquad (26)$$
It is well known that for channels with binary input, the optimal ML decoder can be implemented to work on the LLR values only. Therefore, by plugging the LLR calculator at the channel output and supplying the decoder with the LLRs only, the performance is not harmed, and we can regard the channel as a simple DMC with input x and output z = LLR(y, s) for code design purposes.

Fig. 2. Incorporating LLR calculation into the channel (m → Encoder → W with random state → LLR calc. → z ∈ R^n → Decoder → m̂).

V. EXAMPLES

A. Symmetrization of binary channels with equiprobable input

In the design of state-of-the-art channel codes, it is usually convenient to have channels that are symmetric. In recent years, methods have been developed for designing very efficient binary codes, such as LDPC codes. When designing LDPC codes, a desired property of a binary channel is that its output be symmetric [6].

Definition 3 (Binary input, output symmetric channels [6]): A memoryless binary channel U with input alphabet {0, 1} and output alphabet R is called output-symmetric if, for all y ∈ R,
$$U(y|0) = U(-y|1). \qquad (27)$$

Consider a general binary channel W with arbitrary output (which is not necessarily symmetric), and suppose that, for practical reasons, we are interested in coding over this channel with equiprobable input (which may or may not be the capacity-achieving prior for that channel). The fact that we use an equiprobable input does not make the channel symmetric according to Definition 3. However, there exists a method for transforming this channel into a symmetric one without compromising the capacity, the error exponent or the dispersion. First, we add the LLR calculation to the channel and regard it as a part of the channel. This way we get a real-output channel from any arbitrary channel. Second, instead of transmitting the binary codewords on the channel directly, we perform a bit-wise XOR operation with an i.i.d. pseudo-random binary vector. It can be shown that by multiplying the LLR values by −1 wherever the input was flipped, the LLRs are corrected. It can also be shown that the channel, with the corrected LLR calculation, is symmetric according to Definition 3. In [7], this method (termed 'channel adapters') was used in order to symmetrize the sub-channels of several coded modulation schemes. It is also shown in [7] that the capacity is unchanged by the channel adapters. By using Theorems 1 and 3, it can be verified that the error exponent bounds and the dispersion remain the same as well.
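The following sketch (an illustration under made-up parameters, not code from the paper) simulates the channel-adapter idea for a binary channel whose crossover probability depends on a receiver-known state: the codeword is scrambled by a pseudo-random i.i.d. binary vector before transmission, and the receiver computes LLRs from (y, s) as in (26) and flips their sign wherever the scrambler flipped the input.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
c = rng.integers(0, 2, n)                  # a binary codeword (here simply random bits)
d = rng.integers(0, 2, n)                  # i.i.d. pseudo-random scrambling vector, known at both ends

x = c ^ d                                  # bit-wise XOR before transmission (the "channel adapter")
s = rng.integers(0, 2, n)                  # random state, known to the receiver only
p = np.where(s == 1, 0.20, 0.02)           # hypothetical BSC crossover probability per state
y = x ^ (rng.random(n) < p).astype(int)    # channel output

# Receiver: LLR of x from (y, s) as in (26), then sign correction for the scrambling
llr_x = np.where(y == 0, np.log((1 - p) / p), np.log(p / (1 - p)))
llr_c = np.where(d == 1, -llr_x, llr_x)    # corrected LLRs now refer to the codeword bits c

hard = (llr_c < 0).astype(int)             # hard decision on c
print(np.mean(hard != c), np.mean(p))      # bit error rate matches the average crossover probability
```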

B. Multilevel Coding and Multistage Decoding (MLC-MSD)

MLC-MSD is a method for using binary codes in order to achieve capacity on nonbinary channels (see, e.g. [8]). In MLC-MSD, the binary encoders work in parallel over the same block of channel uses, and the decoders work sequentially as follows: the first decoder assumes the rest of the codewords are noise and decodes the message from the first encoder. Every other decoder, in its turn, decodes the message from the corresponding encoder assuming that the decoded messages from the previous decoders are correct, and therefore regards these messages as side information. The effective channels between each encoder-decoder pair, called sub-channels, are in fact channels with CSI at the receiver, and can therefore be analyzed by Theorems 1 and 3. For more details on the finite-length analysis of MLC-MSD, see [9].
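As a toy illustration of this sub-channel view (not from the paper), the sketch below builds the two bit-level sub-channels of a hypothetical 4-ary DMC under natural labeling: level 1 sees only y (the other bit acts as noise), while level 2 sees y together with the already-decoded first bit as receiver side information. By the chain rule, the level rates sum to I(X;Y).

```python
import numpy as np

def mutual_info(p_x, W):
    """I(X;Y) in bits for a DMC with transition matrix W and input prior p_x."""
    p_xy = p_x[:, None] * W
    p_y = p_xy.sum(axis=0)
    return np.sum(p_xy * np.log2(W / p_y[None, :]))

# Hypothetical 4-ary DMC; row x = 2*b1 + b2 (natural labeling of the two bit levels)
W4 = np.array([[0.85, 0.05, 0.05, 0.05],
               [0.05, 0.85, 0.05, 0.05],
               [0.05, 0.05, 0.85, 0.05],
               [0.05, 0.05, 0.05, 0.85]])
u4 = np.full(4, 0.25)                     # uniform symbols = independent equiprobable bits
u2 = np.full(2, 0.5)

# Level 1: input b1, output y; the other bit b2 is averaged out (treated as noise)
W_l1 = np.array([0.5 * (W4[2 * b1] + W4[2 * b1 + 1]) for b1 in (0, 1)])
R1 = mutual_info(u2, W_l1)

# Level 2: input b2, output (y, b1); the already-decoded bit b1 acts as receiver side information,
# so the sub-channel is P(y, b1 | b2) = P(b1) * P(y | b1, b2)
W_l2 = np.hstack([0.5 * np.array([W4[2 * b1 + b2] for b2 in (0, 1)]) for b1 in (0, 1)])
R2 = mutual_info(u2, W_l2)

print(R1 + R2, mutual_info(u4, W4))       # the level rates sum to I(X;Y) (chain rule)
```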

C. Bit-Interleaved Coded Modulation (BICM)

BICM [10] is another popular method for channel coding using binary codes over nonbinary channels (for example, a channel with an output alphabet of size 2^L). It is based on taking a single binary code, feeding it into a long interleaver, and then mapping the interleaved coded bits onto the nonbinary channel alphabet (every L-tuple of consecutive bits is mapped to a symbol in the channel input alphabet of size 2^L). At the receiver, the LLRs of all coded bits are calculated according to the mapping, de-interleaved and fed to the decoder. By assuming that the interleaver is ideal (i.e. of infinite length), the equivalent channel of BICM is modeled as a binary channel with a random state [10]. The state is chosen uniformly from {1, ..., L}, and represents the index of the input bit in the L-tuple. Since the state is known to the receiver only, this model fits the channel model discussed in the paper.

Finite blocklength analysis of BICM should be done carefully: although the model of a binary channel with a state known at the receiver allows the derivation of an error exponent and a channel dispersion, these do not have the usual meaning of quantifying the performance of BICM at finite block lengths. The reason is the interleaver: how can one rely on the existence of an infinite-length interleaver in order to estimate the finite-length performance? The solution comes in the form of an explicit finite-length interleaver. Recently an alternative scheme called Parallel BICM was proposed [11], where binary codewords are used in parallel and an interleaver of finite length is used in order to validate the BICM model of a binary channel with a state known at the receiver. This allows the proper use of Theorems 1 and 3 (see [11] for the details).

D. Fading Channels

The Rayleigh fading channel, which is popular in wireless communication, can be modeled as a channel with CSI at the receiver. The state in this setting is the fade value, which is usually estimated and some version of it is available at the receiver. When the fading is fast (a.k.a. ergodic fading), the channel is memoryless and fits the model discussed in the paper, and Theorems 1 and 3 can be applied.

ACKNOWLEDGMENT

A. Ingber is supported by the Adams Fellowship Program of the Israel Academy of Sciences and Humanities.

REFERENCES

[1] Robert G. Gallager, Information Theory and Reliable Communication, John Wiley & Sons, Inc., New York, NY, USA, 1968.
[2] Pierre Moulin and Ying Wang, "Capacity and random-coding exponents for channel coding with side information," IEEE Trans. on Information Theory, vol. 53, pp. 1326–1347, 2007.
[3] Y. Polyanskiy, H. V. Poor, and S. Verdú, "Channel coding rate in the finite blocklength regime," IEEE Trans. on Information Theory, vol. 56, no. 5, pp. 2307–2359, May 2010.
[4] V. Strassen, "Asymptotische Abschätzungen in Shannons Informationstheorie," Trans. Third Prague Conf. Information Theory, Czechoslovak Academy of Sciences, 1962, pp. 689–723.
[5] Thomas M. Cover and Joy A. Thomas, Elements of Information Theory, John Wiley & Sons, 1991.
[6] Thomas J. Richardson, Mohammad Amin Shokrollahi, and Rüdiger L. Urbanke, "Design of capacity-approaching irregular low-density parity-check codes," IEEE Trans. on Information Theory, vol. 47, no. 2, pp. 619–637, 2001.
[7] Jilei Hou, Paul H. Siegel, Laurence B. Milstein, and Henry D. Pfister, "Capacity-approaching bandwidth-efficient coded modulation schemes based on low-density parity-check codes," IEEE Trans. on Information Theory, vol. 49, no. 9, pp. 2141–2155, 2003.
[8] Udo Wachsmann, Robert F. H. Fischer, and Johannes B. Huber, "Multilevel codes: Theoretical concepts and practical design rules," IEEE Trans. on Information Theory, vol. 45, no. 5, pp. 1361–1391, 1999.
[9] Amir Ingber and Meir Feder, "Capacity and error exponent analysis of multilevel coding with multistage decoding," in Proc. IEEE International Symposium on Information Theory, Seoul, South Korea, 2009, pp. 1799–1803.
[10] Giuseppe Caire, Giorgio Taricco, and Ezio Biglieri, "Bit-interleaved coded modulation," IEEE Trans. on Information Theory, vol. 44, no. 3, pp. 927–946, 1998.
[11] Amir Ingber and Meir Feder, "Parallel bit interleaved coded modulation," in 48th Annual Allerton Conference on Communication, Control and Computing, Allerton, USA, September 29 – October 1, 2010.