The flexibility of context maximizing allows us to describe efficient methods for producing minimum description length classification structures. Combinations of our maximizing methods for the different classes lead to interesting classification procedures, even for attributes that take values in "large" alphabets. Note that the algorithm for Class III selects the positions (attributes) that give the highest reduction of the description length and produces a decision tree (see Quinlan and Rivest [4]), while Class II methods can be used to find the most effective thresholds in large attribute alphabets. The fact that there exist elegant context weighting methods to treat missing attributes demonstrates once more the flexibility of context weighting (maximizing).
ACKNOWLEDGMENT
The authors wish to thank the reviewers and the Associate Editor M. Feder for valuable comments.
REFERENCES

[1] F. Jelinek, Probabilistic Information Theory. New York: McGraw-Hill, 1968, pp. 476-489.
[2] R. E. Krichevsky, "The connection between the redundancy and the reliability of information about the source," Probl. Inform. Transm., vol. 7, no. 3, pp. 48-57, 1968.
[3] R. E. Krichevsky and V. K. Trofimov, "The performance of universal encoding," IEEE Trans. Inform. Theory, vol. IT-27, no. 2, pp. 199-207, Mar. 1981.
[4] J. R. Quinlan and R. L. Rivest, "Inferring decision trees using the minimum description length principle," Inform. Comput., vol. 80, pp. 227-248, 1989.
[5] J. Rissanen, "Universal coding, information, prediction, and estimation," IEEE Trans. Inform. Theory, vol. IT-30, no. 4, pp. 629-636, July 1984.
[6] J. Rissanen, Stochastic Complexity in Statistical Inquiry. Singapore: World Scientific, 1989.
[7] B. Ya. Ryabko, "Twice-universal coding," Probl. Inform. Transm., vol. 20, no. 3, pp. 24-28, 1984.
[8] Y. M. Shtarkov, "Methods of constructing lower bounds for redundancy of universal coding," Probl. Inform. Transm., vol. 18, no. 2, pp. 3-11, 1982.
[9] V. K. Trofimov, "The redundancy of universal encoding of arbitrary Markov sources," Probl. Inform. Transm., vol. 10, no. 4, pp. 16-24, 1974.
[10] M. J. Weinberger, A. Lempel, and J. Ziv, "A sequential algorithm for the universal coding of finite memory sources," IEEE Trans. Inform. Theory, vol. 38, no. 3, pp. 1002-1014, May 1992.
[11] M. J. Weinberger, N. Merhav, and M. Feder, "Optimal sequential probability assignment for individual sequences," IEEE Trans. Inform. Theory, vol. 40, no. 2, pp. 384-396, Mar. 1994.
[12] M. J. Weinberger, J. Rissanen, and M. Feder, "A universal finite memory source," IEEE Trans. Inform. Theory, vol. 41, no. 3, pp. 643-652, May 1995.
[13] F. M. J. Willems, Y. M. Shtarkov, and T. J. Tjalkens, "Context tree weighting: A sequential universal source coding procedure for FSMX sources," in IEEE Int. Symp. on Information Theory (San Antonio, TX, Jan. 17-22, 1993), p. 59.
[14] F. M. J. Willems, Y. M. Shtarkov, and T. J. Tjalkens, "The context-tree weighting method: Basic properties," IEEE Trans. Inform. Theory, vol. 41, no. 3, pp. 653-664, May 1995.
[15] R. Nohre, "Some topics in descriptive complexity," Ph.D. dissertation, Linköping Univ., Linköping, Sweden, 1994.


Nearest Neighbor Decoding for Additive Non-Gaussian Noise Channels

Amos Lapidoth, Member, IEEE

Manuscript received May 9, 1995; revised February 20, 1996. This work was supported in part by NSF under Grant NCR-9205663, by JSEP under Contract DAAH04-94-G-0058, and by ARPA under Contract J-FBI-94-218-2. The material in this correspondence was presented in part at the IEEE International Symposium on Information Theory, Trondheim, Norway, June 1994. The author is with the Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139-4307 USA. Publisher Item Identifier S 0018-9448(96)05225-X.

Abstract-We study the performance of a transmission scheme employing random Gaussian codebooks and nearest neighbor decoding over a power-limited additive non-Gaussian noise channel. We show that the achievable rates depend on the noise distribution only via its power and thus coincide with the capacity region of a white Gaussian noise channel with signal and noise power equal to those of the original channel. The results are presented for single-user channels as well as multiple-access channels, and are extended to fading channels with side information at the receiver.

Index Terms-Nearest neighbor decoding, Euclidean distance, mismatched decoding, multiple-access channel, generalized cutoff rate, fading.

I. INTRODUCTION
Due to the simplicity of their implementation using matched filters and due to their robustness, nearest neighbor decoders (minimum Euclidean distance decoders) are often used on additive noise channels even if the noise is not a white Gaussian process. For such channels the nearest neighbor decoding rule is usually suboptimal, and a loss in performance is thus incurred. In this correspondence we quantify this loss in terms of the achievable rates, i.e., the rates at which reliable communication is possible with a nearest neighbor decoder.

Stated as above, the problem is a special case of the general mismatch problem that has been studied in [1]-[5] for the single-user channel, and in [6] for the multiple-access channel. However, the techniques that we use in order to study the problem are quite different from those used in the above references and rely heavily on Euclidean geometry. This allows us to deal with the infinite input and output alphabets and with the memory that the noise may exhibit.

It should be noted that we only study the case where the transmitter uses random Gaussian codebooks. While this is the optimal input distribution for white Gaussian channels, it is not necessarily optimal for non-Gaussian channels and the decoder that we assume; see Example 1. The motivation for assuming this input distribution is that Gaussian codebooks are relatively well understood, and that, as we shall see, Gaussian codebooks and nearest neighbor decoders form a very robust communication scheme.
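As a concrete illustration of the decoding rule under study, the following Python sketch (not part of the original correspondence; the function names are ours) implements a nearest neighbor decoder for an arbitrary codebook. For codewords of equal energy, minimizing ||y - x(i)||^2 is the same as maximizing the correlation <y, x(i)>, which is the matched-filter implementation alluded to above.

```python
import numpy as np

def nearest_neighbor_decode(codebook: np.ndarray, y: np.ndarray) -> int:
    """Return the index of the codeword closest to y in Euclidean distance.

    codebook: array of shape (M, n), one codeword per row.
    y:        received vector of length n.
    """
    # Squared Euclidean distances ||y - x(i)||^2, computed for all codewords at once.
    dists = np.sum((codebook - y) ** 2, axis=1)
    return int(np.argmin(dists))

def correlation_decode(codebook: np.ndarray, y: np.ndarray) -> int:
    """Matched-filter form: maximize <y, x(i)>.

    Equivalent to nearest neighbor decoding whenever all codewords
    have the same energy ||x(i)||^2.
    """
    return int(np.argmax(codebook @ y))
```

The equivalence follows from expanding ||y - x||^2 = ||y||^2 - 2<y, x> + ||x||^2 and noting that the first and last terms do not affect the comparison when all codewords have equal energy.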
Furthermore, for the power-limited additive noise single-user channel, Gaussian noise and Gaussian signaling constitute a saddle point for the mutual information functional [7]: given that the noise is Gaussian, the input distribution that maximizes the mutual information between the channel input and output is the Gaussian distribution, and given that the input distribution is Gaussian, the worst noise, i.e., the noise that minimizes the mutual information, is the Gaussian noise. It thus seems that a robust design approach is to design for the worst case, i.e., assume that the noise is Gaussian, and use a Gaussian codebook and a decoder that is optimal for Gaussian noise, i.e., a nearest neighbor decoder. This reasoning seems to imply that a Gaussian codebook and a nearest neighbor decoder should allow one to transmit at rates arbitrarily close to the Gaussian capacity (corresponding to Gaussian inputs and Gaussian noise) if the noise is Gaussian, and at even higher rates if the noise is not the worst noise (i.e., not Gaussian).

One should, however, exercise great caution when following the above line of reasoning because if the noise is not Gaussian, then nearest neighbor decoding is no longer the optimal decoding rule and the mutual information between the input and output is no longer an appropriate measure of the achievable rates. If the noise is not Gaussian, then two opposite forces are at work: on the one hand the noise is less malevolent than the Gaussian noise and the mutual information is bigger than the Gaussian capacity, but on the other hand the receiver is no longer optimal, and some loss in performance might arise from the suboptimality of the decoder. We shall, in fact, see in Theorem 1 that the two forces exactly cancel out: irrespective of the noise distribution, the Gaussian capacity is achievable, and no rate above it is achievable with random Gaussian coding and nearest neighbor decoding. This result will be generalized in Theorem 3 to the additive-noise multiple-access channel (MAC) with joint nearest neighbor decoding.

The robustness of Gaussian codebooks and nearest neighbor decoding for the single-user channel has been demonstrated before; see [8]-[10] and references therein. (For the case where the noise power is smaller than the transmitted power, the strongest version is apparently due to Csiszár and Narayan [11].) Our contribution to the single-user channel is in proving a random coding converse, i.e., that with random Gaussian codebooks and nearest neighbor decoding no rate above (1/2) log(1 + P/N) is achievable, where P is the signal power and N is the noise power, and in extending these results to the multiple-access channel and to fading channels with side information at the receiver. Our results do not imply that if one is to use Gaussian codebooks and if one is to guarantee that (1/2) log(1 + P/N) be achievable for any noise distribution of power N, then no rate above (1/2) log(1 + P/N) is achievable with a decoder that does not depend on the noise distribution; it is only with a nearest neighbor decoder that this holds. At least for memoryless noise processes, one can in fact achieve the mutual information between the channel input and channel output using a universal decoding rule; see [12] and [13]. The shortcoming of the universal decoders is that their implementation is often too complex, even for codebooks with strong algebraic structure.

The term "Gaussian codebook" requires some clarification. In this correspondence it will always refer to a random codebook whose codewords are chosen independently of each other. Sometimes we shall take it to imply that each codeword is chosen uniformly over the n-dimensional sphere of radius sqrt(nP), where P is the average power and n is the blocklength, and at other times we shall take it to mean that each codeword is chosen according to a product Gaussian distribution, so that the coordinates of the codewords are i.i.d. N(0, P), where N(m, sigma^2) denotes the Gaussian distribution of mean m and variance sigma^2. The different meaning will usually be clear from the context or unimportant; otherwise, we shall make the distinction explicit and refer to the former case as an "equi-energy Gaussian codebook" (we refer to this codebook as Gaussian because, as the dimensionality n grows, the marginals of this distribution converge to the Gaussian distribution) and to the latter as a "product Gaussian codebook." All the results stated in this correspondence hold for
both cases, but we shall prove them for the case for which the proof is simplest. It should be noted that, in general, a direct part is slightly more impressive if it is shown for a product Gaussian codebook rather than for an equi-energy Gaussian codebook, whereas a random coding converse is stronger if it applies to an equi-energy Gaussian codebook. The reason is that in a decoding mismatch situation choosing all codewords to be of the same type is usually beneficial [6].

The rest of the paper is organized as follows. In Section II we treat the single-user case and show that under some weak conditions on the noise process, nearest neighbor decoding and random Gaussian codebooks allow one to communicate at any rate below (1/2) log(1 + P/N) but no higher. In Section III we show that the generalized cutoff rate [15], which is a commonly used lower bound on the maximal achievable rate under decoding mismatch conditions, can be very loose even when the cutoff rate is close to the channel capacity. Section IV extends some of the results of Section II to the single-user channel with vector inputs and outputs. The multiple-access channel is treated in Section V, where it is once again shown that Gaussian codebooks and nearest neighbor decoding achieve the capacity region of the corresponding (matched) Gaussian channel but no better. Extensions to fading channels with side information at the receiver are treated in Section VI, and some conclusions are drawn in Section VII. It should be noted that some of the results of this correspondence have duals in rate distortion theory. Those are explored in [14].
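To make the two ensembles concrete, the following sketch (an illustration based on the definitions above, not code from the correspondence) draws an equi-energy Gaussian codebook, with codewords uniform on the sphere of radius sqrt(nP), and a product Gaussian codebook, with i.i.d. N(0, P) coordinates.

```python
import numpy as np

def equienergy_gaussian_codebook(M: int, n: int, P: float, rng=None) -> np.ndarray:
    """M codewords drawn uniformly on the n-dimensional sphere of radius sqrt(n*P)."""
    rng = np.random.default_rng(rng)
    g = rng.standard_normal((M, n))
    # Normalizing an i.i.d. Gaussian vector yields a uniformly distributed direction.
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    return np.sqrt(n * P) * g

def product_gaussian_codebook(M: int, n: int, P: float, rng=None) -> np.ndarray:
    """M codewords with i.i.d. N(0, P) coordinates."""
    rng = np.random.default_rng(rng)
    return np.sqrt(P) * rng.standard_normal((M, n))
```

Every codeword of the first ensemble has power exactly P, while codewords of the second ensemble have power P only on average.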
II. THE SCALAR SINGLE-USER CASE
The single-user additive noise scalar channel is a channel whose input and output are real valued and satisfy

Y^(k) = X^(k) + Z^(k).    (1)

Here X^(k) and Y^(k) denote the input and output of the channel at time k, and Z^(k) is the noise sample. We assume throughout that the process {Z^(k)} is independent of the input process, and that it is ergodic with second moment N, i.e.,

lim_{n -> infinity} (1/n) sum_{k=1}^{n} (Z^(k))^2 = N   almost surely.

In fact, our results still hold if, rather than requiring ergodicity of the noise process, we only require that its empirical average power converge in probability to N, i.e.,

lim_{n -> infinity} Pr( | (1/n) sum_{k=1}^{n} (Z^(k))^2 - N | > delta ) = 0,   for all delta > 0.    (2)
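As an illustration of the model (1) and assumption (2), and not part of the original text, the sketch below simulates the channel with i.i.d. Laplacian noise scaled to have second moment N; being i.i.d., such noise is ergodic, and its empirical power converges to N as required by (2).

```python
import numpy as np

def additive_noise_channel(x: np.ndarray, N: float, rng=None) -> np.ndarray:
    """Channel (1) with i.i.d. Laplacian noise of second moment N."""
    rng = np.random.default_rng(rng)
    # A Laplace(0, b) sample has second moment 2*b^2, so choose b = sqrt(N/2).
    z = rng.laplace(loc=0.0, scale=np.sqrt(N / 2.0), size=x.shape)
    return x + z

# Empirical check of the convergence in (2).
rng = np.random.default_rng(0)
n, N = 100_000, 2.0
z = rng.laplace(scale=np.sqrt(N / 2.0), size=n)
print(np.mean(z ** 2))   # close to N = 2.0 for large n
```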
Theorem 1: For a single-user scalar additive noise channel with nearest neighbor decoding we have that, irrespective of the noise distribution, the average probability of error, averaged over the ensemble of Gaussian codebooks of power P, approaches zero as the blocklength n tends to infinity for code rates below (1/2) log(1 + P/N), and approaches one for rates above (1/2) log(1 + P/N).

Proof: Step 1: Consider a blocklength-n equi-energy Gaussian codebook of average power P and rate R,

C = {x(1), ..., x(2^{nR})}.

We are interested in P-bar, the average probability of error corresponding to a codebook drawn from the ensemble, averaged over the ensemble of codebooks. Since the different codewords are chosen independently and according to the same distribution, the ensemble averaged probability of error corresponding to the ith message, P-bar(i), does not depend on i, and we thus conclude that P-bar = P-bar(1). We can thus assume without loss of generality that the transmitted codeword is the one that corresponds to hypothesis number 1, i.e., x(1). Let y denote the received sequence, and let e(z) denote the conditional probability of error given that the transmitted message is message 1 and that the noise vector is z. We thus have

P-bar = P-bar(1) = E_z[e(z)]

where E_z denotes expectation with respect to the noise sequence.

Step 2: We claim that e(z) depends on z only via its Euclidean norm ||z||. To see this, note that the spherical symmetry of the codebook distribution implies that given z and x(1), the probability of error depends only on ||z + x(1)|| and ||z||. Since x(1) is drawn uniformly over the sphere, it follows that the conditional distribution of ||z + x(1)|| depends on z only via its norm. To stress that e(z) depends on z only via its norm (and dimension), we shall denote e(z) by e_n(||z||^2/n).

Step 3: We show that the function e_n(||z||^2/n) is monotonically nondecreasing in ||z||^2/n. To see this, note that by the preceding step, for the purposes of studying the function e(z) = e_n(||z||^2/n), we may assume without loss of generality that the noise vector z is aligned with some fixed deterministic unit vector u, so that z = ||z|| u. The monotonicity now follows from the observation that if for some i != 1 we have ||y - x(i)|| <= ||y - x(1)||, where y = x(1) + ||z|| u, then ||y' - x(i)|| <= ||y' - x(1)||, where y' = x(1) + (||z|| + delta) u and delta > 0. Indeed

||x(1) + (||z|| + delta) u - x(i)|| = ||x(1) + ||z|| u - x(i) + delta u||
                                   <= ||x(1) + ||z|| u - x(i)|| + delta
                                   <= ||z|| + delta
                                    = ||x(1) + (||z|| + delta) u - x(1)||

where the first inequality follows from the triangle inequality and the positivity of delta, and the second from the assumption that for the noise vector ||z|| u the codeword x(i) is at least as close to the received sequence as the transmitted codeword x(1) is.

Step 4: We can now conclude the proof by a comparison argument. Consider first the case where the code's rate satisfies

R < (1/2) log(1 + P/N)

so that there exists an N' > N such that

R < (1/2) log(1 + P/N').

Compare now the ensemble average probability of error corresponding to a Gaussian codebook transmitted over the additive i.i.d. Gaussian channel of variance N' with that over the channel under consideration. Let z denote the noise vector in the original (non-Gaussian) channel, and let z' denote the noise vector in the Gaussian channel. By the law of large numbers (LLN), ||z'||^2/n approaches N' as n tends to infinity, and since N' > N we conclude from (2) that

Pr(||z'|| > ||z||) -> 1 as n -> infinity.

The direct part now follows from the monotonicity of the function e_n(.) and the fact that a Gaussian codebook for the Gaussian channel achieves capacity.

In the other direction, if

R > (1/2) log(1 + P/N)

then there exists an N'' < N such that

R > (1/2) log(1 + P/N'').

We can now compare the performance of a nearest neighbor decoder and a Gaussian codebook on our channel with their performance on the Gaussian channel of variance N''. Denoting the noise vector on the Gaussian channel by z'', we have

Pr(||z|| > ||z''||) -> 1 as n -> infinity.

Since R exceeds (1/2) log(1 + P/N''), the ensemble average probability of error over the Gaussian channel of variance N'' tends to one by the strong converse for the Gaussian channel (we do not actually need a strong converse; it suffices that for rates above capacity the ensemble average probability of error goes to one), and the monotonicity of e_n(.) then implies that the same holds over the channel under consideration.

... where epsilon > 0 is very small. We thus have roughly 1/2 bit per channel use. However, using the uncoded input alphabet {+m, -m} and nearest neighbor decoding, one achieves zero probability of error at a rate of 1 bit per channel use.
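A small Monte Carlo experiment can illustrate Theorem 1 numerically. The sketch below (our illustration with arbitrary parameters, not code from the correspondence) estimates the ensemble-average error probability of an equi-energy Gaussian codebook with nearest neighbor decoding at a rate below (1/2) log2(1 + P/N), once with Gaussian noise and once with Laplacian noise of the same power N; the two estimates should be comparable, in line with the theorem.

```python
import numpy as np

def avg_error_prob(n, R_bits, P, noise_sampler, trials=200, rng=None):
    """Monte Carlo estimate of the ensemble-average error probability of an
    equi-energy Gaussian codebook with nearest neighbor decoding.

    A fresh codebook is drawn for every trial and, by symmetry, codeword 0
    is always the one transmitted.
    """
    rng = np.random.default_rng(rng)
    M = 2 ** int(round(n * R_bits))          # number of codewords
    errors = 0
    for _ in range(trials):
        C = rng.standard_normal((M, n))
        C *= np.sqrt(n * P) / np.linalg.norm(C, axis=1, keepdims=True)
        y = C[0] + noise_sampler(n, rng)
        if np.argmin(np.sum((C - y) ** 2, axis=1)) != 0:
            errors += 1
    return errors / trials

P, N, n = 1.0, 4.0, 96
R = 0.8 * 0.5 * np.log2(1 + P / N)           # 80% of the Gaussian capacity, in bits

gauss = lambda n, rng: np.sqrt(N) * rng.standard_normal(n)
laplace = lambda n, rng: rng.laplace(scale=np.sqrt(N / 2.0), size=n)

print(avg_error_prob(n, R, P, gauss, rng=1))
print(avg_error_prob(n, R, P, laplace, rng=1))
```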
III. A NOTE ON THE GENERALIZED CUTOFF RATE
The generalized cutoff rate GR_0 is a commonly used lower bound on the mismatch capacity of a single-user channel; see [15], [5], and references therein. For a binary-input output-symmetric channel, i.e., a channel with input alphabet X = {+1, -1}, output alphabet Y = R, and a probability law that satisfies

p(y | +1) = p(-y | -1)

the generalized cutoff rate has an additional operational meaning. If the decoding metric is also symmetric, i.e.,

d(+1, y) = d(-1, -y)
then the message error rate or the bit error rate of any linear binary code over the channel can be upper-bounded by some function that depends on the code only, evaluated at GR_0 [15]. Like C_LM, the lower bound on the maximal rate at which reliable communication is possible with a given suboptimal decoder [1], the generalized cutoff rate GR_0 is derived using a random coding argument. However, while the analysis of the ensemble averaged probability of error used in deriving C_LM is tight [5], the analysis that leads to GR_0 is not. In this section we shall demonstrate that GR_0 can indeed behave very differently from the mismatch capacity C_M and the random coding bound C_LM.

In order to derive GR_0, one upper-bounds the probability that a randomly chosen codeword accumulates a higher metric than the metric accumulated by the correct codeword, and then uses the union of events bound to upper-bound the probability that some incorrect codeword accumulates a higher metric than the metric accumulated by the transmitted codeword. To be more specific, assume that random coding is carried out according to the distribution p(x). Denote the transmitted codeword by x, the received sequence by y, and let x' be a randomly chosen codeword independent of x and y. By the union
of events bound we obtain that the probability of a message error is upper-bounded by
2^{nR} Pr( d(x', y) >= d(x, y) )

where R is the code rate and n is its blocklength. Using the Chernoff bound we obtain

Pr( d(x', y) >= d(x, y) ) = Pr( sum_{k=1}^{n} d(x'_k, y_k) >= sum_{k=1}^{n} d(x_k, y_k) )
                          <= inf_{lambda > 0} [ sum_{(x, x', y) in X x X x Y} p(x) p(x') p(y | x) exp( lambda (d(x', y) - d(x, y)) ) ]^n

and we can thus conclude that the generalized cutoff rate, defined as

GR_0 = -log inf_{lambda > 0} sum_{(x, x', y) in X x X x Y} p(x) p(x') p(y | x) exp( lambda (d(x', y) - d(x, y)) )

is achievable, i.e., is a lower bound on the mismatch capacity C_M.
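As a toy numerical illustration of the definition of GR_0 above (our example, for a finite-alphabet channel rather than the continuous-alphabet setting studied in this correspondence), the following sketch evaluates GR_0 for a binary symmetric channel under a mismatched decoding metric, approximating the infimum over lambda by a grid search.

```python
import numpy as np

def generalized_cutoff_rate(p_x, p_y_given_x, metric, lambdas=None):
    """GR_0 = -log inf_{lam>0} sum_{x,x',y} p(x) p(x') p(y|x) exp(lam*(d(x',y) - d(x,y))).

    p_x:         input distribution, shape (|X|,)
    p_y_given_x: channel law, shape (|X|, |Y|)
    metric:      decoding metric d(x, y), shape (|X|, |Y|) (decoder maximizes it)
    """
    if lambdas is None:
        lambdas = np.linspace(1e-3, 20.0, 4000)
    best = np.inf
    for lam in lambdas:
        s = 0.0
        for x in range(len(p_x)):
            for xp in range(len(p_x)):
                for y in range(p_y_given_x.shape[1]):
                    s += (p_x[x] * p_x[xp] * p_y_given_x[x, y]
                          * np.exp(lam * (metric[xp, y] - metric[x, y])))
        best = min(best, s)
    return -np.log(best)   # in nats

# Binary symmetric channel with crossover 0.1, uniform inputs,
# decoded with a (mismatched) metric that over-weights agreement on y = 1.
p_x = np.array([0.5, 0.5])
bsc = np.array([[0.9, 0.1],
                [0.1, 0.9]])
metric = np.array([[1.0, 0.0],
                   [0.0, 2.0]])     # d(x, y); the matched choice would be log p(y|x)
print(generalized_cutoff_rate(p_x, bsc, metric))
```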
Let us now focus once again on the scalar additive noise channel with nearest neighbor decoding, noise power N, and a Gaussian codebook of power P. For this channel we know that, irrespective of the noise distribution, the rate (1/2) log(1 + P/N) is achievable (Theorem 1). We shall, however, show that by a proper choice of the noise distribution (of power N) we can make GR_0 arbitrarily small.

Let g(z) denote the probability that a randomly chosen codeword x' is closer to y = x + z than the correct codeword x is. By the spherical symmetry of the Gaussian codebook, g(z) depends on the noise sequence only via its norm ||z||. We shall hence denote g(z) by g_n(||z||^2/n). Following the reasoning outlined in Step 3 of the proof of Theorem 1, we conclude that g_n(||z||^2/n) is monotonically nondecreasing. Let epsilon > 0 be arbitrary, and let gamma_0 be sufficiently large so that epsilon > (1/2) log(1 + P/gamma_0). Consider now a deterministic noise sequence of empirical power ||z||^2/n = gamma_0. By the random coding converse of Theorem 1 we conclude that for sufficiently large blocklength n the average probability of error, averaged over an ensemble of rate-epsilon Gaussian codebooks, is greater than 1/2. It thus follows from the union of events bound that for sufficiently large n we have 2^{n epsilon} g_n(gamma_0) > 1/2. By the monotonicity of g_n(.) we conclude that for sufficiently large n