Universal Noiseless Coding

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. IT-19, NO. 6, NOVEMBER 1973

Universal Noiseless Coding

LEE D. DAVISSON

Abstract: Universal coding is any asymptotically optimum method of block-to-block memoryless source coding for sources with unknown parameters. This paper considers noiseless coding for such sources, primarily in terms of variable-length coding, with performance measured as a function of the coding redundancy relative to the per-letter conditional source entropy given the unknown parameter. It is found that universal (i.e., zero-redundancy) coding in a weighted sense is possible if and only if the per-letter average mutual information between the parameter space and the message space is zero. Universal coding is possible in a maximin sense if and only if the channel capacity between the two spaces is zero. Universal coding is possible in a minimax sense if and only if a probability mass function exists, independent of the unknown parameter, for which the relative entropy of the known conditional probability mass function is zero. Several examples are given to illustrate the ideas. Particular attention is given to sources that are stationary and ergodic for any fixed parameter although the whole ensemble is not. For such sources, weighted universal codes always exist if the alphabet is finite, or more generally if the entropy is finite. Minimax universal codes result if an additional entropy stability constraint is applied. A discussion of fixed-rate universal coding is also given briefly with performance measured by a probability of error.

Manuscript received November 28, 1972; revised May 15, 1973. This work was supported by the National Science Foundation under Grant GK 14190 and by the Advanced Research Projects Agency of the Department of Defense monitored by the Air Force Eastern Test Range under Contract F08606-72-C-0008. The author is with the Department of Electrical Engineering, University of Southern California, Los Angeles, Calif. 90007.

I. INTRODUCTION

WE START with a general discussion. More precise definitions will appear subsequently. Suppose we are given a discrete-time and discrete-alphabet source that we wish to block encode without error to minimize the average codeword length, where the code is not allowed to depend on some unknown parameters of the source message probabilities. The unknown parameter could be something as simple as, say, the probability of a one in a binary independent-letter source or as general as the complete set of message probabilities for the source. Such sources are called composite (see, e.g., Berger [1]).

In some cases, although not usually, the unknown parameter may have a known prior distribution so that all ensemble message probabilities are known. As has been frequently stated, however, there is no reason to suppose that coding for the whole ensemble will be efficient for the actual parameter in effect. One of the important results of this paper is to establish that coding for the whole ensemble does, under certain circumstances, have a very important meaning in an asymptotic sense. For example, suppose we have a composite binary source in which the probability θ of a "one" is chosen by "nature" randomly according to the uniform probability law on [0,1] and then fixed for all time. The user is not told θ, however. Subsequently the source produces independent-letter outputs with the chosen, but unknown, probability. The probability, given θ, that any message block of length N contains n ones in any given pattern is then

θ^n (1 − θ)^(N−n).  (1)

If θ were known, a per-letter average codeword length at least equal to the entropy

−θ log₂ θ − (1 − θ) log₂ (1 − θ) b  (2)

would be required (and could be approached arbitrarily closely) to block encode the source. Although θ is unknown, the composite source output probabilities are known. In fact the probability of any given message containing n ones in N outputs is simply

∫₀¹ θ^n (1 − θ)^(N−n) dθ = 1 / [(N + 1) C(N,n)]  (3)

where C(N,n) denotes the binomial coefficient. (Note that the composite source does not produce independent letters. This is due to the fact that past observations provide information about the value of θ in effect. The composite source is stationary but not ergodic.) There is no reason to suppose that the use of the probabilities of (3) to design a variable-length code will do well for any θ that comes along in the sense of approaching (2). One of the surprising results, which will be shown subsequently, is that the probabilities of (3) provide codes asymptotically as good as the probabilities of (1) with θ known exactly.

This example has illustrated an interesting and important point of problem definition. Does one regard such a source as a stationary ergodic source with unknown message probabilities, or does one state simply that the source is not ergodic? While it is important that a precise statement be made in this regard to adequately define the coding problem, such precise statements typically have not been made in the literature. In this paper we will regard such sources to be conditionally stationary ergodic in the sense precisely defined in Definition 2 appearing in Section III. Roughly what is meant is that a source ensemble characterized by any possible value of the unknown parameter is stationary ergodic. We now turn to a definition.

Definition 1: Universal coding is any method of block encoding for composite sources where the following holds. a) The encoding depends only on the observed source message block encoded and not past or future outputs, i.e., it is blockwise memoryless. b) Some performance measure(s) is attained arbitrarily closely as the block length goes to infinity.

Any type of coding, in the sense of mapping the source message blocks into code blocks, is allowed provided that the mapping depends only on the message block to be encoded and not past and future blocks. This excludes what is loosely called "adaptive coding" where past statistical source message information is used to affect the coding. (The real reason for not using past information, of course, is uncertainty about stationarity; that is, statistics from past blocks may be an unreliable measure of current conditions.)

The performance measure to be used in any application could include, for example, the average codeword length, a distortion measure on the receiver output, and encoding complexity. In this paper, however, we consider noiseless coding only and complexity is either ignored or given brief mention. Variable-length coding is considered for the most part, without length constraints, so that functions of the coding redundancy are used to measure performance. Several functions will be proposed.

All previously known universal coding methods are constructive and do roughly the same thing: the message block is divided into two parts. The second part contains an encoding of the block in the usual sense and the first part tells how the encoding is done. As the block length goes to infinity, the first part takes up a vanishingly small fraction of the whole codeword while the average length of the second part converges to the minimum. In this paper it is shown that constructive methods need not be employed explicitly. However, interesting code constructions based upon maximum-likelihood estimation, sufficient statistics, and histograms can be made and will be discussed.

The paper was originally intended to be a tutorial review of universal coding, but the temptation and ease of extension to new directions could not be resisted. Therefore, both old and new results (mostly the latter) will be found with appropriate credits where applicable. In contrast with earlier work, which relies heavily on constructions, the viewpoint of this paper is primarily information-theoretic. An attempt is made to make new precise definitions of various forms of redundancy and universal coding with respect to these forms of redundancy.
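As a numerical sanity check (ours, not the paper's), the mixture probability (3) of the introductory example can be verified directly: averaging (1) over a uniform θ gives 1/[(N+1)C(N,n)], which sums to one over all 2^N blocks. A minimal sketch, with illustrative function names:

```python
from math import comb

# Mixture probability (3): a block with n ones out of N has probability
# 1 / ((N+1) * C(N,n)) when theta is averaged over Uniform[0,1].
def mixture_prob(N, n):
    return 1.0 / ((N + 1) * comb(N, n))

N = 10
# Summing over all 2^N blocks: C(N,n) equally probable patterns per count n.
total = sum(comb(N, n) * mixture_prob(N, n) for n in range(N + 1))
print(total)  # ≈ 1.0 (up to float rounding)

# Cross-check (3) against a Riemann sum of the integral of (1) for n = 3.
n, steps = 3, 20000
approx = sum((k / steps) ** n * (1 - k / steps) ** (N - n)
             for k in range(steps)) / steps
print(abs(approx - mixture_prob(N, n)) < 1e-6)  # → True
```

The first sum collapses to Σ_n 1/(N+1) = 1, which is why the composite count n is uniform on {0,...,N} under the mixture.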
The insight gained through mutual information and extensions to infinite alphabets is entirely new, as is the explicit application of maximum-likelihood estimation and sufficient statistics to universal coding. The section on histogram encoding is only partially new, as are some of the specific examples used. To put things in proper perspective, Section II presents a general history of universal coding. Section III presents the necessary terminology and fundamentals to be used in the paper. Section IV defines several performance measures for universal coding, finds necessary and sufficient conditions for universal coding, and demonstrates a number of constructive coding methods. Briefly, it is found that universal coding with respect to a prior probability distribution on the unknown parameter is possible if and only if the per-letter average mutual information between the source message space and the parameter space tends to zero as the block size goes to infinity. This always happens for conditionally stationary-ergodic finite-entropy sources. In a stronger maximin sense, universal coding is possible if and only if the per-letter channel capacity between the parameter space and the source message space is zero. Finally, universal coding is possible in a weakly minimax


sense if and only if a mixture probability mass function exists for which the relative entropy of the conditional probability mass function tends to zero for every value in the parameter space. If the convergence is uniform in the parameter, then minimax universal codes result. With some mild constraints on the parameter space, this happens for conditionally ergodic sources. Section V presents histogram constructions. Section VI discusses some implications of the work for fixed-length codes and finally, Section VII proposes some open problems.

II. A HISTORY OF UNIVERSAL CODING

Kolmogorov [2] introduced the term universal coding, although he gave it only brief mention and suggested the possibility of such coding in the sense described in Section I. The general approach proposed was combinatoric rather than probabilistic as was traditional in coding problems. Fitingof [3], [4] exploited this approach in greater detail using mixed combinatoric-probabilistic methods. Codes based on his ideas are described in Section V. At approximately the same time, Lynch [5], [6] and Davisson [7] presented a (somewhat practical) technique for encoding time-discrete conditionally stationary ergodic binary sources. The universality of the scheme in the sense of attaining the per-letter entropy is demonstrated in [7]. Cover [8] has also presented specific enumeration techniques. Shtarkov and Babkin [9] demonstrated the same thing for any stationary finite-order, finite-alphabet Markov source, showing in particular that the increased per-letter average code length is of the order of log (block length)/block length. Results for arbitrary stationary finite-alphabet sources are also included in [9]. Babkin [10] considered the complexity of a scheme for an independent-letter source and estimated the coding redundancy.

All the preceding was derived for variable-length coding. Ziv [11] took a different approach by using fixed-rate codes with performance measured by the probability of error. He showed that the error probability can be made arbitrarily small with a large enough block size provided that the message entropy is less than the coding rate. In contrast to [1]-[10], this can result in a chosen rate which is much greater than the actual (unknown) message entropy. Although the term universal was not used, the same type of idea is involved in the research of Berger [1] on a composite source with a distortion measure. Universal coding with a distortion measure has also been considered by Ziv [11].

An application of what might be called universal coding to "real" data is provided by the Rice [12] technique, whereby successive blocks of video data are examined; the best code for each block is chosen from among a set number of code books, and the code book designation is encoded along with the actual codeword. Another example is the analysis-synthesis technique [17] for speech waveforms, where adaptive predictor parameters are measured on each block and sent along as part of the block codeword. The interested reader is also referred to [13]-[16] which present related adaptive methods where statistical information from past blocks is used in the coding.


III. NOTATION AND PRELIMINARIES

Throughout, capital letters will be used for random variables with lower-case letters used for specific values of the random variables. Let (Ω, F, P) be a probability space. We will be concerned with two functions defined on this probability space: the source message stochastic process X = {X_i; i = 0, ±1, ±2, ...}, which is observed and is to be encoded, and a random variable Θ, which is unknown to the encoder and decoder and whose value, if known, would be used to improve the coding (e.g., the probability of a "one" in Section I). The random variable X_i: Ω → A has a countable range space, the alphabet space A = {a_1, a_2, ..., a_L}, which without loss of generality is taken to be the nonnegative integers, i.e., a_i = i − 1, where L is the size of the alphabet. Unless specifically stated, L = ∞ is allowed. Coding is done with specific reference to the message block X_N = (X_1, X_2, ..., X_N) of length N with values x_N = (x_1, x_2, ..., x_N) defined on the range product alphabet space A^N. Only noiseless coding will be considered (mostly variable length, but also fixed length with vanishing probability of error). For simplicity the coding alphabet is taken to be binary with 2 the base of all logarithms so that entropies and codeword lengths are in bits.

The unknown parameter Θ takes values in the measurable space (Λ, B) and is (F, B)-measurable (i.e., the inverse image of any B ∈ B belongs to F). Let Ω_θ ∈ F be the inverse image of any given value θ with conditional probability measure P_θ, i.e., P_θ(G) = 0 for any G ∈ F such that G ∩ Ω_θ = ∅.

Definition 2: A source is conditionally stationary ergodic if for every θ ∈ Λ the message ensemble defined on (Ω, F, P_θ) is stationary ergodic.

As noted in the introduction this definition makes it possible to discriminate between sources that would be ergodic if some parameter were known and the ensemble of all source message sequences.
In general, however, conditional stationarity and ergodicity are not assumed but are only required when certain limits are needed to exist. The measure P may or may not be known. It is assumed, however, that the measure P_θ is known for all θ ∈ Λ so that the (measurable) block conditional message probability mass function, p(x_N | θ), is known for all N. The common notational abuse of distinguishing probabilities, entropies, etc., by their arguments will be employed. For example, p(·) will be a generic notation for a probability mass function. With respect to a probability measure w(·) defined on Λ, the mixture probability mass function of X_N is

p(x_N) = ∫_Λ p(x_N | θ) dw(θ).  (4)

The probability measure w may or may not be known and may or may not represent the "true" prior probability measure (induced by P on the probability space) of the unknown parameter Θ. One might not be willing to accept the concept of a prior probability at all or, as will be seen, might not like the weighting it implies because, at any one

time, the coder sees only one member of the nonergodic ensemble. The w might actually be a preference weighting or it might be a priori chosen to be "least favorable" as will be discussed. The space of all such probability measures on Λ will be denoted W. The conditional entropy H(X_N | θ) defined pointwise on θ ∈ Λ and the average conditional entropy H(X_N | Θ) are defined in the usual fashion, as are the entropies H(X_N) and H(Θ), the average mutual information I(X_N;Θ), the conditional average of the mutual information I(X_N;θ), and the average conditional mutual information I(X_N; Θ | X_{N−1}). Given an arbitrary probability mass function q(x_N), not necessarily in the convex hull of p(x_N | θ), i.e., not necessarily satisfying (4) for any w ∈ W, the relative entropy of p(x_N | θ) with respect to q(x_N) is defined as

H(p:q) = Σ_{A^N} p(x_N | θ) log [ p(x_N | θ) / q(x_N) ].  (5)
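To see how strongly (5) depends on the choice of q, it can be evaluated per letter for the binary independent-letter source of Section I at a fixed θ, for two choices of q(x_N): the uniform pmf on A^N, for which the per-letter value stays bounded away from zero, and the uniform-θ mixture of (3), for which it shrinks with N. A small sketch under those assumptions (helper names are ours):

```python
from math import comb, log2

# Per-letter relative entropy (5) for the binary iid source at a fixed theta.
# q_of_n(n) is the q-probability of ONE sequence containing n ones.
def rel_entropy_rate(N, theta, q_of_n):
    total = 0.0
    for n in range(N + 1):
        p_n = comb(N, n) * theta**n * (1 - theta)**(N - n)  # P(count = n)
        if p_n > 0:
            log_p = n * log2(theta) + (N - n) * log2(1 - theta)  # one sequence
            total += p_n * (log_p - log2(q_of_n(n)))
    return total / N

N, theta = 200, 0.3
uniform = rel_entropy_rate(N, theta, lambda n: 2.0 ** -N)
mixture = rel_entropy_rate(N, theta, lambda n: 1.0 / ((N + 1) * comb(N, n)))
print(uniform, mixture)
# uniform stays near 1 - h(0.3) ≈ 0.119 for every N; the mixture value is
# already below 0.02 at N = 200 and continues to vanish.
```

The uniform q wastes 1 − h(θ) bits per letter forever; the mixture's penalty is roughly (log N)/N, which is the behavior the universal coding results below formalize.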

If, in addition, q(x_N) satisfies (4) for some w ∈ W, p(·) will be used notationally rather than q(·), and (5) is then the conditional average (for the prior probability measure on Θ) of the mutual information between X_N and Θ, I(X_N;θ). In subsequent proofs there will be need for the well-known Kraft inequality. Let l(x_N) be the code length in bits for message block x_N of any uniquely decipherable prefix code defined on A^N. Then

Σ_{A^N} 2^(−l(x_N)) ≤ 1.  (6)

The following theorem, which is a consequence of (6), will be used frequently.

Variable-Length Source Coding Theorem: Given a probability mass function q(x_N) with corresponding entropy

H(X_N) = −Σ_{A^N} q(x_N) log q(x_N) < ∞  (7)

it is possible to construct a uniquely decipherable code on block x_N of length l(x_N) satisfying

l(x_N) < −log q(x_N) + 1  (8)

so that the expected length l̄(X_N) satisfies

H(X_N) ≤ l̄(X_N) < H(X_N) + 1.  (9)

No uniquely decipherable code exists with l̄(X_N) < H(X_N). The minimum average codeword length is attained by Huffman coding if L < ∞.

In the subsequent material reference to a code will mean a set of uniquely decipherable codewords on a source output block of length N. The code will be specified by its length function, l(x_N), which satisfies (6). The space of all such length functions is denoted C_N. The space of all code sequences is denoted

C = ⨉_{N=1}^∞ C_N.


A universal code, in conformance with Definition 1, will mean a sequence of codes in C approaching optimum in the sense of some performance measure as N → ∞.

IV. UNIVERSAL CODING

Suppose that the source N-block conditional probability function is known as in Section III except for a fixed but unknown parameter θ ∈ Λ. That is,

p(x_N | θ) = Pr [X_N = x_N | Θ = θ]  (10)

is a known (measurable) function of x_N, N, and θ. X_N is observed but Θ is not. The source may or may not be conditionally stationary ergodic. We give several examples, which will be referred to frequently throughout the balance of the paper.

Example 1: As in the example of Section I the source is a binary independent-letter source with Λ = [0,1],

p(x_N | θ) = θ^n (1 − θ)^(N−n)  (11)

where n = Σ_{i=1}^N x_i.

Example 2: The source is one of M < ∞ distinct types, Λ = {1,2,...,M}; such possible sources are the following. 1) The source is independent-letter exponential with parameter (0.5). 2) The source is independent-letter Poisson with parameter (5.6). 3) The source is Markov with a given transition matrix, etc.

Example 3: The source is binary, Λ = [0,1], and X_N is with probability one equal to the first N digits of the binary expansion of θ. That is, let

θ = Σ_{i=1}^∞ θ_i 2^(−i),  θ_i ∈ {0,1}  (12)

where, for uniqueness, no expansion using an infinite string of zeros is used except for θ = 0. Then

p(x_N | θ) = 1, if x_N = (θ_1, θ_2, ..., θ_N); 0, otherwise.  (13)

Example 4: The source is independent-letter Poisson with parameter θ ∈ Λ = [0,∞),

p(x_N | θ) = ∏_{i=1}^N [ θ^(x_i) e^(−θ) / x_i! ].  (14)

Example 5: The source is conditionally stationary ergodic and entropy stable in the sense that the per-letter conditional entropy satisfies

| H(X_k | θ)/k − lim_{N→∞} H(X_N | θ)/N | ≤ ε_k  (15)

lim_{k→∞} ε_k = 0

with {ε_k} a known sequence. Λ is the set of all {ε_k}-uniform entropy-converging stationary-ergodic source probabilities.

The problem is to design an encoder for a source such as those in Examples 1-5 that performs well in some as yet undefined sense without knowing θ. Since coding is to be noiseless, variable-length coding performance can be measured relative to the source entropy conditioned on θ. (How meaningful the conditional entropy is, of course, depends upon how well the source has been parametrized. For conditionally stationary-ergodic sources it is obviously the most meaningful quantity.) If a code assigns a word of length l(x_N) to source message block x_N, the conditional average length is

l̄(X_N | θ) = Σ_{A^N} l(x_N) p(x_N | θ) ≥ H(X_N | θ)  (16)

the lower bound following from (9). If H(X_N | θ) = ∞, the conditional redundancy is defined as zero since any code is equally effective. For simplicity it will be assumed subsequently that Λ is redefined, if necessary, so that H(X_N | θ) < ∞ for all N < ∞, θ ∈ Λ. We would like, of course, to minimize this for all possible θ (i.e., approach the entropy H(X_N | θ)). It is not possible to do this pointwise because this would require θ to be known. That is, l(x_N) can depend only on what is observed, the source message block, and not the unknown parameter θ. We will define the following.

Definition 3: The conditional redundancy of a code for H(X_N | θ) < ∞ is

r_N(l,θ) = (1/N) [ l̄(X_N | θ) − H(X_N | θ) ].  (17)
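The conditional redundancy (17) of a code built without knowledge of θ can be evaluated numerically for Example 1: take Shannon lengths l(x_N) = ⌈log₂((N+1)C(N,n))⌉ derived from the mixture (3), which depend only on the count n. A sketch, with illustrative names (it anticipates the constructions discussed later in this section):

```python
from math import comb, ceil, log2

# Conditional redundancy (17) of a Shannon code for the mixture (3),
# evaluated for Example 1 at a fixed theta unknown to the encoder.
def redundancy(N, theta):
    avg_len = avg_ent = 0.0
    for n in range(N + 1):
        p_n = comb(N, n) * theta**n * (1 - theta)**(N - n)  # P(count = n)
        l = ceil(log2((N + 1) * comb(N, n)))  # mixture Shannon length
        avg_len += p_n * l
        # -log p(x_N | theta) averaged over blocks: gives H(X_N | theta)
        avg_ent += -p_n * (n * log2(theta) + (N - n) * log2(1 - theta))
    return (avg_len - avg_ent) / N

for N in (10, 100, 1000):
    print(N, redundancy(N, 0.3))
# The redundancy shrinks toward zero although theta = 0.3 was never used
# in constructing the code lengths.
```

The decay is roughly (log N)/(2N) plus a rounding term, consistent with the claim in Section I that the mixture probabilities (3) are asymptotically as good as knowing θ.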

We know r_N(l,θ) ≥ 0 by the variable-length source coding theorem and (16). The average codeword length with respect to w(θ) is

l̄(X_N) = ∫_Λ l̄(X_N | θ) dw(θ) = ∫_Λ Σ_{A^N} l(x_N) p(x_N | θ) dw(θ) ≥ H(X_N | Θ)  (18)

the lower bound following from (16).

Two Examples of Universal Coding

Before proceeding to a more formal definition of the problem, two examples will be presented to demonstrate the possibility of coding a source with conditional redundancy arbitrarily close to zero for all θ ∈ Λ. In Example 1, an N-block of N − n zeros and n ones is observed. If θ were known each letter would be independent with p(x_N | θ) given by (1),

p(x_N | θ) = (1 − θ)^(N−n) θ^n.  (19)

Consider the following encoding procedure. It is seen from (19) that all sequences with the same number of ones are equally probable. Send the message in two parts. Part one consists of a fixed-length code for the number of ones using at most log (N + 1) + 1 b. Part two is a code indicating the location of the ones within the block, one of C(N,n) possibilities. Using a fixed codeword length (given n) for each of the C(N,n) possibilities, at most log C(N,n) + 1 b, a variable


length that depends on the number of ones is required for this information. The per-letter coding rate is then

[log (N + 1) + 2 + log C(N,n)] / N.  (20)

The first part goes to zero as N → ∞. The expected value of the second part must converge to the per-letter entropy of (2) because the coding rate cannot be below the source entropy, and, at the same time, the second part of the code is optimal, all messages conditioned on the first part being equally likely, independent of θ (n = Σ_{i=1}^N x_i is a sufficient statistic for θ, an idea which will be used more generally later). This will in fact be demonstrated quantitatively later in the section (see also [7]). This method of coding was first presented in [5]-[7]. Note that a comparison of (20) and (3) shows that coding with respect to the whole ensemble with θ uniform on [0,1] is asymptotically optimum for this example.

Turning to Example 2, generate a uniquely decipherable code for each of the M possible source distributions using words of length at most −log p(x_N | θ) + 1 by (8). Upon observing a source block x_N to be encoded, send the message in two parts. First go through all M codes and find the minimum-length codeword for x_N. As the first part of the message, send the number θ' of the code book of the minimum-length codeword, requiring at most log (M) + 1 b. For the second part of the message, send the corresponding codeword for x_N. The average coding rate for any θ is then at most

[log (M) + 1]/N + (1/N) Σ_{A^N} p(x_N | θ) min_{θ'} [−log p(x_N | θ') + 1] ≤ [log (M) + 2 + H(X_N | θ)]/N.  (21)

As N → ∞, M remains fixed, so that the minimum coding rate for known θ is achieved. A decision-theoretic interpretation of the preceding procedure would be that it is equivalent to making a maximum-likelihood decision as to the true state θ of the source at the encoder and then conveying both the decision and the corresponding codeword to the receiver.

Definition 5: a) The average redundancy of a code l ∈ C_N is

R̄_N(w,l) = ∫_Λ r_N(l,θ) dw(θ).  (22)

b) The minimum Nth-order average redundancy of w is

R_N*(w) = inf_{l ∈ C_N} R̄_N(w,l).  (23)

c) The average redundancy of w (if the limit exists) is

R*(w) = lim_{N→∞} R_N*(w).  (24)

If

lim_{N→∞} R_N*(w) = 0  (25)

a sequence of codes that attains the limit is called weighted universal. The limit may not be uniform in θ. However, by the Riesz theorem there exists a subsequence of codes which converges almost surely.

Definition 6: a) The Nth-order maximin redundancy of W is

R_N^− = sup_{w ∈ W} R_N*(w).  (26)

b) The maximin redundancy of W (if it exists) is

R^− = lim_{N→∞} R_N^−.  (27)

If R^− = 0, a sequence of codes that attains the limit is called maximin universal. If there is a w* ∈ W which achieves the supremum (as a function of N) it is least favorable. Equation (25) holds for w* also, if it exists, of course.

Definition 7: a) The Nth-order minimax redundancy of Λ [13], [16] is

R_N^+ = inf_{C_N} sup_{θ ∈ Λ} [r_N(l,θ)].  (28)

b) The minimax redundancy of Λ (if it exists) is

R^+ = lim_{N→∞} R_N^+.  (29)

If the minimax redundancy R^+ is zero, then the user can transmit the source output at a rate per symbol arbitrarily close to the source entropy rate uniformly over all values θ ∈ Λ (unlike (25)) by taking N large enough. R^+ = 0 is the strongest and most desirable condition to be proposed and can in some cases be achieved, as in the two examples already given. It will be seen, however, that in some cases it is too much to desire. If R^+ = 0, a sequence of codes which attains the limit is called minimax universal. Although (29) is the form of primary interest, there will be occasion to refer to a weaker form of minimax universal codes.

Definition 8: The weak minimax redundancy of Λ is

R̂ = inf_C sup_{θ ∈ Λ} lim_{N→∞} r_N(l,θ)  (30)

where the infimum is over code sequences. If R̂ = 0, a code sequence which attains the limit is called weakly minimax universal. If a code is weakly minimax universal but not minimax universal, it means that the redundancy cannot be bounded uniformly in N over Λ. Therefore, in practice we would not know how large N should be to insure that the redundancy is below a desired level. This is, however, slightly stronger than (25). Now the following will be shown.

Theorem 1:
a) R_N^+ ≥ R_N^− ≥ R_N*(w), for every N = 1,2,....
b) R^+ ≥ R^− ≥ R*(w).
c) R^+ ≥ R̂ ≥ R*(w).

Proof: R_N^− ≥ R_N*(w) by definition. Now let l'(x_N) and l''(x_N) be two arbitrary codes in C_N. Consider

inf_{l'(x_N) ∈ C_N} ∫_Λ r_N(l',θ) dw(θ) ≤ ∫_Λ r_N(l'',θ) dw(θ) ≤ sup_{θ ∈ Λ} r_N(l'',θ).

Taking the supremum over w ∈ W of the left side and the infimum over the codes l''(x_N) on the right side retains the inequality. Therefore, R_N^− ≤ R_N^+. Part b) of the theorem follows by taking limits (when they exist). Part c) follows by first noting that for any sequence of codes in C,

R*(w) ≤ lim_{N→∞} R̄_N(w,l) ≤ sup_{θ ∈ Λ} lim_{N→∞} r_N(l,θ)

so that R*(w) ≤ R̂. Second, for any ε > 0, there exists a sequence of codes such that for all θ ∈ Λ, from (29),

lim_{N→∞} r_N(l,θ) ≤ R^+ + ε.

Therefore from (30),

R̂ ≤ sup_{θ ∈ Λ} lim_{N→∞} r_N(l,θ) ≤ R^+ + ε.

Since this is true for all ε > 0, R̂ ≤ R^+.

Minimum Redundancy Codes

Coding with respect to a measure w ∈ W is considered first. By Definition 5, (17), and (22),

R_N*(w) = inf_{C_N} (1/N) ∫_Λ [ l̄(X_N | θ) − H(X_N | θ) ] dw(θ)
        = inf_{C_N} (1/N) ∫_Λ [ Σ_{A^N} p(x_N | θ) l(x_N) − H(X_N | θ) ] dw(θ).

By the source coding theorem the sum is minimized by encoding with respect to the mixture probability mass function so that from (9),

(1/N) I(X_N;Θ) ≤ R_N*(w) ≤ (1/N) I(X_N;Θ) + 1/N.  (31)

If the limits exist,

R*(w) = lim_{N→∞} (1/N) I(X_N;Θ).

The preceding assumes H(X_N | θ) < ∞. If this is not so, the same conclusion can be reached by expanding l(x_N) before integrating as l(x_N) = [l(x_N) + log p(x_N)] − [log p(x_N)]. The latter factor yields I(X_N;Θ) inside the integral and then (31) follows. Hence we have the following theorem.

Theorem 2: The minimum average Nth-order redundancy of w is bounded as in (31). For a finite-alphabet source, the minimum is achieved by Huffman coding with respect to the mixture probability mass function. The necessary and sufficient condition for the existence of weighted universal codes is that the per-letter average mutual information between X_N and Θ tend to zero.

Note that for a conditionally stationary-ergodic source, in terms of the conditional mutual information

R*(w) = lim_{N→∞} (1/N) I(X_N;Θ) = lim_{N→∞} I(X_N; Θ | X_{N−1})  (32)

if the latter is termwise finite, so that weighted universal codes exist if and only if X_N and Θ have finite conditional average mutual information for all N and are asymptotically independent (in the preceding sense) conditioned on the past N − 1 message values. It will be seen in Section V that this is always the case if H(X_N) < ∞. Furthermore, we can obtain a theorem for maximin universal codes by using (26), (27), and (31) to get

sup_{w ∈ W} (1/N) I(X_N;Θ) ≤ R_N^− ≤ sup_{w ∈ W} (1/N) I(X_N;Θ) + 1/N  (33)

R^− = lim_{N→∞} sup_{w ∈ W} (1/N) I(X_N;Θ).  (34)

Therefore we have the following theorem.

Theorem 3: The Nth-order maximin redundancy is bounded as in (33). The minimum is achieved for finite alphabets by Huffman coding with respect to the mixture probability mass function determined by the least favorable measure w* if it exists. The maximin redundancy is given by (34). The necessary and sufficient condition for the existence of maximin universal codes is that the per-letter capacity K of the channel between the parameter space Λ and the message space be zero.

Finding a necessary and sufficient condition for minimax universal codes takes a little more work. Suppose we take an arbitrary probability mass function q(x_N). Then there exists a code with word lengths satisfying (8); therefore,

from (28),

R_N^+ ≤ sup_{θ ∈ Λ} (1/N) Σ_{A^N} p(x_N | θ) [−log q(x_N) + 1 + log p(x_N | θ)]
     = sup_{θ ∈ Λ} (1/N) H(p:q) + 1/N.  (35)
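The bound (35) is easy to exercise for a finite parameter set as in Example 2: with the uniform mixture q(x_N) = (1/M) Σ_k p(x_N | θ_k) we have q ≥ p(·|θ)/M, hence H(p:q) ≤ log M for every θ, so (1/N) H(p:q) + 1/N vanishes and minimax universal codes exist. A toy numerical check, with M = 2 binary independent-letter sources whose parameter values are chosen arbitrarily:

```python
from math import log2

# Uniform mixture over M = 2 known binary iid sources of block length N.
# Relative entropy H(p:q) of each source against the mixture stays <= log M.
def block_probs(theta, N):
    probs = {}
    for x in range(2 ** N):
        n = bin(x).count("1")
        probs[x] = theta ** n * (1 - theta) ** (N - n)
    return probs

N, thetas = 8, (0.2, 0.7)
ps = [block_probs(t, N) for t in thetas]
q = {x: sum(p[x] for p in ps) / len(ps) for x in ps[0]}

for p in ps:
    D = sum(p[x] * log2(p[x] / q[x]) for x in p)
    print(D <= log2(len(ps)))  # → True for each theta
```

Since the bound log M is a constant, dividing by N makes the per-letter relative entropy vanish uniformly in θ, which is exactly condition (39) below.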

Therefore, a sufficient condition for minimax universal coding is that there exist a sequence of probability mass functions for X, for which the relative entropy vanishes uniformly over 0 E A with N. Obviously a sufficient condition for weak minimax coding is that the relative entropy vanish only pointwise. That the vanishing of relative entropy is also necessary follows from (6). Suppose there is a sequence of 1(x,) for which .!Z’ = 0. Then define = ,cN 2--@N)

QN

where, by the Kraft inequality (6)

A sufficient condition for the existence of a weakly universal codes is that the convergence to zero be only pointwise in 8. The determination of maximin and minimax codes is most easily made by searching the class of minimum average redundancy codes. This is analogous to decision theory where Bayes’ rules are sometimes more easily found than minimax rules, so that minimax rules are constructed by searching the class of admissible Bayes’ rules or by finding the Bayes’ rule with respect to a “best guess” least favorable prior. (For A finite, minimum average redundancy codes have in fact been called Bayes [18].) Rather than calculate the mutual information directly it is frequently easier to construct a code which has zero redundancy. W e will consider now several examples where the mutual information can be shown to vanish (directly, though), and then turn to constructive techniques. Suppose with w-measure one 0 takes on only a denumerable number, M, of values {e,} with probabilities {wk}. Then

QN 5 1.

; ~(x,,@)

Let

= ; [H(O)

(36)

- H(@ 1 xN>] log M N

+H(O)

P(XN

esk

i @I

N

I

I

Similarly, the necessary condition for weakly minimax coding is the pointwise vanishing of the relative entropy. To summarize, we have the following theorem.

Theorem 4: The necessary and sufficient condition for the existence of minimax universal codes is that there exist a sequence of probability mass functions q(x_N) for which the per-letter relative entropy of p(x_N | θ) is zero uniformly in θ, i.e.,

lim_{N→∞} sup_{θ∈Λ} (1/N) Σ_{x_N ∈ A^N} p(x_N | θ) log [p(x_N | θ) / q(x_N)] = 0.    (39)

The necessary and sufficient condition for the existence of weakly minimax codes is that the convergence to zero be only pointwise in θ.

A sufficient condition for the existence of minimax universal codes is that there exist a mixture probability mass function for which the conditional average of the mutual information is zero, i.e., q(x_N) satisfies (4) for some w ∈ W, so that (39) becomes

lim_{N→∞} sup_{θ∈Λ} (1/N) Σ_{x_N ∈ A^N} p(x_N | θ) log [ p(x_N | θ) / ∫_Λ p(x_N | θ′) dw(θ′) ] = 0.    (40)

Or, from (31) and (40), we have the following.

Corollary: If Θ takes on at most a denumerable number of values with w-measure one, weighted universal codes exist if H(Θ) < ∞. (It is possible to construct infinite-entropy discrete random variables, e.g., w_k proportional to (k log^{1+δ} k)^{−1}, δ > 0.) If Θ takes on only a finite number M < ∞ of values,
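The mixture condition of Theorem 4 can be exercised on Example 1, a binary memoryless source with block probability θ^n (1 − θ)^{N−n} as in (1): under a uniform prior the mixture has the closed form q(x_N) = 1/((N + 1)·C(N, n)) for a block with n ones, and the per-letter relative entropy in (39) can be evaluated exactly. A numerical sketch (the bias 0.3 and the block lengths are illustrative choices, not from the paper):

```python
import math

def per_letter_divergence(theta, N):
    """(1/N) * D( p(.|theta) || q ) for the uniform-prior Bernoulli mixture,
    where q(x_N) = 1 / ((N + 1) * C(N, n)) for a block with n ones."""
    total = 0.0
    for n in range(N + 1):
        p_pattern = theta ** n * (1.0 - theta) ** (N - n)   # one pattern with n ones
        q_pattern = 1.0 / ((N + 1) * math.comb(N, n))       # mixture mass, per pattern
        total += math.comb(N, n) * p_pattern * math.log2(p_pattern / q_pattern)
    return total / N

d10 = per_letter_divergence(0.3, 10)
d100 = per_letter_divergence(0.3, 100)
assert 0.0 < d100 < d10   # per-letter redundancy shrinks as the block grows
```

The decay is roughly like (log₂ N)/(2N), consistent with the vanishing that Theorem 4 requires.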

minimax universal codes exist. This was already shown by construction in (21) for Example 2. Returning now to Example 1 with Θ uniform and using (1) and (3), the conditional average of the mutual information can be evaluated and shown to vanish with N; the vanishing here is a consequence of the artificiality of the problem setup, and W*(w) > 0 can in fact be demonstrated in this example for any distribution on Θ with a nonconstant absolutely continuous component on [0,1]. On the other hand, from the corollary, if Θ takes on only a denumerable number of values with H(Θ) < ∞, weighted universal coding is possible.

Example 4 provides us with an infinite-alphabet example. It should be clear that if the Poisson parameter can take on "too many" values, universal coding may not be possible. Suppose we let

dw(θ) = α e^{−αθ} dθ,   θ ≥ 0,    (45)

where α is some constant. Then the mixture probability mass function is

q(x_N) = ∫_0^∞ [θ^n e^{−Nθ} / Π_{i=1}^N x_i!] α e^{−αθ} dθ    (46)

= α n! / [(N + α)^{n+1} Π_{i=1}^N x_i!],    (47)

where

n = Σ_{i=1}^N x_i.

The average mutual information is

(1/N) I(X_N; Θ) = (1/N) ∫_0^∞ Σ_{x_N} p(x_N | θ) log [p(x_N | θ) / q(x_N)] α e^{−αθ} dθ,    (48)

which involves the entropy of a Poisson random variable with parameter Nθ. On the other hand, it can be shown that for any N < ∞ and any ε > 0 there is an α > 0 such that

(1/N) I(X_N; Θ) > ε.

Therefore, no maximin (or minimax, by Theorem 1) universal coding is possible. However, weakly minimax universal coding is possible, as can be seen by taking the limit of the integrand in (48) for every θ. It also can be shown that minimax universal codes exist if Λ is defined as a finite rather than an infinite interval. We now turn to universal coding based upon constructive methods.

A Sufficient Condition for the Existence of Universal Codes

Theorem 2 provides a necessary and sufficient condition for the existence of weighted universal codes. The calculations involved cannot always be easily done and, in addition, no particular insight as to practical implementation can be gained in many instances. The theorem presented here provides some help in that direction. The theorem applies to the coding scheme where code books are developed and enumerated for many "representative" values of θ. For any given x_N, the identity of the code book with the shortest codeword plus the codeword itself is transmitted in the way presented for Example 2 earlier in this section.

Theorem 6 (Code Book Theorem): Suppose there exists a sequence of partitions of A^N into sets {T_j^{(N)}; j = 1,2,…,J(N)}, J(N) possibly infinite, with

x_N ∈ T_{j(x_N)}^{(N)},    (49)

where, for some set {θ_j; j = 1,2,…,J(N)} and some sequence of vanishing positive numbers ε(N),

sup_{θ∈Λ} (1/N) log [ p(x_N | θ) / p(x_N | θ_{j(x_N)}) ] ≤ ε(N).    (50)

Let

p_j^{(N)} = Pr [T_j^{(N)}] = ∫_Λ Σ_{x_N ∈ T_j^{(N)}} p(x_N | θ) dw(θ).    (51)

a) If

lim_{N→∞} (1/N) E[−log p_{j(X_N)}^{(N)}] = 0    (52)

then weighted universal codes exist.

b) If (50) and (52) hold pointwise in θ, i.e., for some sequence of vanishing numbers ε(N,θ),

(1/N) log [ p(x_N | θ) / p(x_N | θ_{j(x_N)}) ] ≤ ε(N,θ)    (53)

and, for some set of probabilities (51) not necessarily satisfying (4) for any w ∈ W,

lim_{N→∞} (1/N) E[−log p_{j(X_N)}^{(N)} | θ] = 0    (54)

then weakly minimax universal codes exist. c) If (53) and (54) hold uniformly in θ, i.e., ε(N,θ) ≤ ε(N) in (53) and lim_N sup_θ can replace lim_N in (54), minimax universal codes exist.
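The evaluation of the Poisson mixture above rests on the gamma integral ∫₀^∞ θ^n e^{−(N+α)θ} dθ = n!/(N + α)^{n+1}, and the resulting closed form can be checked against direct numerical integration of the defining mixture. A sketch (the block of counts, the choice α = 1, and the trapezoid grid are illustrative assumptions):

```python
import math

def q_closed_form(x, alpha):
    """Closed-form mixture mass: alpha * n! / ((N + alpha)^(n+1) * prod_i x_i!)."""
    N, n = len(x), sum(x)
    prod_fact = math.prod(math.factorial(v) for v in x)
    return alpha * math.factorial(n) / ((N + alpha) ** (n + 1) * prod_fact)

def q_numeric(x, alpha, upper=40.0, steps=200_000):
    """Trapezoid approximation of the integral of the Poisson likelihood
    prod_i [theta^{x_i} e^{-theta} / x_i!] against the prior alpha*e^(-alpha*theta)."""
    N, n = len(x), sum(x)
    prod_fact = math.prod(math.factorial(v) for v in x)
    f = lambda t: alpha * t ** n * math.exp(-(N + alpha) * t) / prod_fact
    h = upper / steps
    return h * (0.5 * (f(0.0) + f(upper)) + sum(f(i * h) for i in range(1, steps)))

x = [2, 0, 3, 1]   # an illustrative block of Poisson counts
assert abs(q_closed_form(x, 1.0) - q_numeric(x, 1.0)) < 1e-8
```

The integrand is negligible beyond the chosen upper limit for these parameters, so the truncation does not affect the comparison.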


Comment 1: (54) is obviously satisfied, in particular, if

lim_{N→∞} (log J(N)) / N = 0.    (55)

Comment 2: The sequence {θ_j} is a sequence of asymptotically maximum-likelihood estimates of θ for the source output observations {x_N}. It is seen that the theorem is useful if a set of these estimates can be found involving a many-to-one mapping of A^N into Λ. Several examples will be given after the theorem proof.

Proof: Part a) will be proven by showing that the conditions of the theorem imply vanishing per-letter mutual information. Since the mixture p(x_N) minimizes the average relative entropy, for any probability mass function q(x_N),

(1/N) I(X_N; Θ) = E [(1/N) log (p(X_N | Θ) / p(X_N))] ≤ E [(1/N) log (p(X_N | Θ) / q(X_N))].

Taking q(x_N) = p(x_N | θ_{j(x_N)}) p_{j(x_N)}^{(N)}, which sums over A^N to at most one, and applying (50),

(1/N) I(X_N; Θ) ≤ (1/N) E[−log p_{j(X_N)}^{(N)}] + ε(N) → 0.    (56)

Parts b) and c) of the theorem follow by considering the relative entropy with

q(x_N) = p(x_N | θ_{j(x_N)}) p_{j(x_N)}^{(N)}    (57)

and proceeding as in (56).

A code construction based upon J(N) code books, one for each of the "representative" values θ_j, is suggested by (50)-(57). Find a code book for the set {θ_j} with words of length

l_j^{(N)} ≤ −log p_j^{(N)} + 1.    (58)

For each θ_j find a code book for every x_N ∈ A^N with words of length

n_j(x_N) ≤ −log p(x_N | θ_j) + 1.    (59)

The codeword corresponding to each x_N is then composed of two parts: the code book j(x_N) to which x_N belongs and the codeword in that code book; thus from (58) and (59)

l(x_N) ≤ −log p_{j(x_N)}^{(N)} − log p(x_N | θ_{j(x_N)}) + 2.    (60)
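The two-part length bound can be exercised numerically: charge an integer number of bits for the code-book index and for the in-book codeword (such lengths exist by the Kraft inequality), and the total stays within two bits of the ideal. The code-book masses and conditional probabilities below are illustrative assumptions, not values from the paper:

```python
import math

def two_part_length(p_book, p_cond):
    """Integer two-part codeword length: bits for the code-book index
    plus bits for the word inside that book, each within one bit of ideal."""
    return math.ceil(-math.log2(p_book)) + math.ceil(-math.log2(p_cond))

# Illustrative: code books with prior mass p_book, and the conditional
# probability the chosen book assigns to the observed block.
cases = [(0.75, 0.5), (0.25, 0.125), (0.75, 0.0625)]
for p_book, p_cond in cases:
    l = two_part_length(p_book, p_cond)
    # total length is within 2 bits of -log2(p_book * p_cond)
    assert l <= -math.log2(p_book) - math.log2(p_cond) + 2
```

Since ⌈a⌉ + ⌈b⌉ < a + b + 2, the "+2" slack is all the construction ever pays for using two integer-length parts.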

It follows immediately that this code is universal in the appropriate sense.

Examples 1-5 can all be handled by the code book theorem. Example 5 will be considered in the next section. In Example 2, {θ_j} are possible values of θ. In Example 1, the values θ_j are j/(N + 1), j = 0,1,…,N, corresponding to the relative frequency of ones observed at the source output. This is a particular example of a sufficient statistic. More generally, if θ_s(x_N) is a sufficient statistic for θ, then by the factorization theorem p(x_N | θ) can be factored trivially, and since θ_s(x_N) is a deterministic mapping, the sets T_j^{(N)} = {x_N : θ_s(x_N) = θ_sj} partition A^N. Then

p_j^{(N)} = Pr [θ_s(X_N) = θ_sj] = Σ_{x_N : θ_s(x_N) = θ_sj} p(x_N) = ∫_Λ p(θ_sj | θ) dw(θ).    (61)

As a special but important case, let {θ_sj; j = 1,2,…,J(N) ≤ ∞} be an enumeration of the values of a sufficient statistic for θ and let {θ_j} be the corresponding maximum-likelihood estimates, which are assumed to exist, i.e.,

p(θ_sj | θ_j) = sup_θ p(θ_sj | θ).    (62)

Therefore, the values {θ_j} are maximizing values of p(θ_s(x_N) | θ), which is a great simplification in many cases (e.g., Example 1). Then in the code book theorem we can take ε(N) = ε(N,θ) = 0. We have the following.

Corollary 1: A sufficient condition for weighted universal codes to exist is that a sufficient statistic for θ have entropy satisfying

lim_{N→∞} −(1/N) Σ_j p(θ_sj) log p(θ_sj) = 0.

Corollary 2: A sufficient condition for weakly minimax universal codes to exist is that the sequence of probability mass functions for the sufficient statistic is such that

lim_{N→∞} (1/N) E[−log p(θ_s(X_N)) | θ] = 0    (63)

for every θ ∈ Λ. Minimax universal codes exist if this convergence to zero is uniform in θ.

In Example 4, the sum of the letter outputs in the block is a sufficient statistic for θ. As previously noted, minimax universal codes do not exist for Λ = [0,∞). It is now seen that this can be interpreted as being due to the sufficient statistic taking on arbitrarily large values with nonvanishing probability. We can construct a weakly minimax universal code as follows. For the first part of the code, send

n = Σ_{i=1}^N x_i

using a two-part subcode. Let M be the integer satisfying

log₂ (n + 1) ≤ M < log₂ (n + 1) + 1.    (64)

Send M zeros followed by a 1. Then send the natural binary codeword of length M for n. The per-letter code length contributed by the sufficient statistic is then

< 2[log₂ (n + 1) + 1] / N.    (65)
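The sub-code is easy to make concrete: M zeros, a 1, then n in M-bit natural binary, with M the integer of (64). It is prefix-free, and its length 2M + 1 stays within a constant of 2 log₂(n + 1). A minimal sketch (the helper names are mine):

```python
import math

def encode_count(n):
    """Two-part sub-code: M zeros, a 1, then n in M-bit natural binary,
    with M = ceil(log2(n + 1)) as in the text."""
    M = math.ceil(math.log2(n + 1))
    body = format(n, "b").zfill(M) if M > 0 else ""
    return "0" * M + "1" + body

def decode_count(bits):
    """Inverse mapping; returns (n, number of bits consumed)."""
    M = bits.index("1")
    body = bits[M + 1:M + 1 + M]
    return (int(body, 2) if body else 0), 2 * M + 1

for n in range(200):
    code = encode_count(n)
    value, used = decode_count(code)
    assert value == n and used == len(code)
    assert len(code) <= 2 * math.log2(n + 1) + 3   # length 2M+1, M < log2(n+1)+1
```

Because the run of zeros announces M before the payload starts, concatenated codewords can be decoded left to right without separators.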

By the convexity of log x, for any value of θ,

E { 2[log₂ (n + 1) + 1] / N | θ } ≤ 2[log₂ (E[n | θ] + 1) + 1] / N = 2[log₂ (Nθ + 1) + 1] / N → 0 as N → ∞.    (66)

Note that this code construction is only weakly minimax, since (66) is unbounded in θ for N < ∞. The second part of the code consists, of course, of the message sequence conditioned on the sufficient statistic. This coding can be done optimally.

V. HISTOGRAM ENCODING

We now consider code constructions of great generality for use when very little is known of the source. In terms of Section IV, the θ parameter could represent all possible source probabilities in a class, e.g., Example 5. Essentially the idea is to measure the block conditional histogram on subblocks, encoding both the histogram and the codeword constructed according to the histogram. Obviously the latter portion must do at least as well as the code constructed using the actual subblock conditional probabilities. In fact, if it is known that the source is conditionally stationary ergodic and kth-order Markov where k < ∞, then the histogram of that order is a sufficient statistic for θ and the corollaries to the code book theorem apply. If the alphabet size is finite, the number of values of the sufficient statistic in (55) satisfies log J(N) < L^k log (N + 1), and it is seen immediately that a minimax universal code results. If L = ∞ and/or if the source is not Markov, then one must be satisfied with weaker forms of universal codes. If the source is finite-alphabet and conditionally stationary ergodic, the conditional probabilities converge, and the values of θ_j in the code book theorem can be taken as those conditional probabilities which are kth-order Markov and which coincide with the possible values of the kth-order relative frequencies for each N. We then let k → ∞ with N, so that (55) is satisfied. Because of the convergence of conditional probabilities, (53) is satisfied and thus the sequence is weakly minimax. Without further restriction, the codes are not minimax.

If the source is infinite-alphabet and conditionally stationary ergodic, (53) is satisfied by the histograms as in the last paragraph, but J(N) = ∞ and conditions of the form of (52) or (54) must be added for universal coding in the appropriate sense. Because of the importance of this type of coding we will describe the particular construction in greater detail. The essential constructive idea for these codes is due to Fitingof [3], [4], with improvement and information-theoretic interpretation by Shtarkov and Babkin [15].

Theorem 7: Let Λ be the space of all stationary-ergodic processes indexed by θ with finite alphabet size, i.e., L < ∞. Then weakly minimax (and hence weighted) universal codes exist. If in addition the probabilities are entropy-stable in the sense of satisfying (15), minimax universal codes exist.

Two constructive methods will be presented for proof of the theorem.

Proof: Both are based on histograms on subblocks of length k (called k-grams by Fitingof [3]), where for simplicity in the second method it is assumed that N/k is an integer. In the proof, as N → ∞ we will let k → ∞ in such a way that the per-letter "overhead" information → 0. The reason for this is that we want certain source probability information to require a vanishing portion of the codeword while increasing the allowed dimensionality of the probability information.

The first construction uses conditional histograms. Encode the first k values using a fixed-length code of at most log (L^k) + 1 b. Follow this by an encoding of the conditional histogram of each of the L source letters, where the conditioning is on each of the L^k possible immediate past sequences of length k. Using a fixed-length code, this requires at most {(number of histograms) × (number of cells in the histogram) × log₂ (number of possible values in each cell)} + 1 = L^k × L × log₂ (N + 1) + 1, i.e.,

L^{k+1} log₂ (N + 1) + 1  b.    (67)

Finally this is followed by a codeword for the remaining N − k letters following the first k (which are already encoded). The length of the codeword will be chosen in proportion to the log of the conditional histogram for each value. To be more precise, let the histogram be denoted as follows:

q_k(x_{n+k+1} | x_{n+1},…,x_{n+k}).    (68)

Then the codeword will be at most of length

−Σ_{n=0}^{N−k−1} log₂ q_k(x_{n+k+1} | x_{n+1},…,x_{n+k}) + 1.    (69)

Then the total per-letter codeword length is, from (67)-(69),

(1/N) l(x_N) ≤ (1/N) log L^k + (1/N) L^{k+1} log₂ (N + 1) − (1/N) Σ_{n=0}^{N−k−1} log₂ q_k(x_{n+k+1} | x_{n+1},…,x_{n+k}).    (70)

Letting N, k → ∞ so that k/log N → 0 eliminates all terms but the sum, which requires greater scrutiny. By definition of the histogram, the sum can be written as the
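The overhead terms can be tabulated to see why the growth condition on k matters: the k-prefix costs (k log₂ L)/N per letter and the L^{k+1} conditional-histogram cells cost L^{k+1} log₂(N + 1)/N. A small numeric sketch for a binary alphabet (the (k, N) schedule is an illustrative assumption):

```python
import math

def overhead_per_letter(L, k, N):
    """First two terms of the per-letter length: the fixed-length code for the
    first k letters plus L^(k+1) histogram cells of up to log2(N+1) bits each."""
    return (k * math.log2(L) + L ** (k + 1) * math.log2(N + 1)) / N

# Illustrative schedule k = (log2 N)/4: the histogram term behaves like
# 2 * N^(1/4) * log2(N + 1) / N and tends to zero.
schedule = [(2, 2 ** 8), (4, 2 ** 16), (8, 2 ** 32)]
values = [overhead_per_letter(2, k, N) for k, N in schedule]
assert values[0] > values[1] > values[2]
```

Any schedule that keeps L^{k+1} log₂(N + 1) sublinear in N drives the overhead to zero while still letting k grow.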


following:

−(1/N) Σ_{n=0}^{N−k−1} log₂ q_k(x_{n+k+1} | x_{n+1},…,x_{n+k})
 = −(1/N) Σ_{x_{k+1} ∈ A^{k+1}} (number of times x_{k+1} follows x_k) log₂ q_k(x_{k+1} | x_k)
 = −((N − k)/N) Σ_{x_{k+1} ∈ A^{k+1}} [(number of times x_{k+1} follows x_k) / (number of times x_k appears)] [(number of times x_k appears) / (N − k)] log₂ q_k(x_{k+1} | x_k)
 = −((N − k)/N) Σ_{x_{k+1} ∈ A^{k+1}} q_k(x_{k+1} | x_k) q_k(x_k) log₂ q_k(x_{k+1} | x_k).    (71)

Note that the quantity in (71) (called quasi-entropy [3]) represents the absolute minimum average codeword length for the given histogram and, hence, for the observed block when the k-length constraint is applied. If the average of (71) is now taken over the ensemble (i.e., the average with respect to p(x_N | θ)), it is bounded above by ((N − k)/N) H(X_{k+1} | X_k, θ) through the convexity of −x log x. The weak minimax redundancy is bounded by letting k → ∞ with N so that k/log N → 0, and by using (70) and (71) we have

R ≤ sup_θ lim_{N→∞} [ ((N − k)/N) H(X_{k+1} | X_k, θ) − (1/N) H(X_N | θ) ]
 = sup_θ lim_{k→∞} [ H(X_{k+1} | X_k, θ) − lim_{N→∞} (1/N) H(X_N | θ) ] = 0,    (72)

the latter statement following from the stationary ergodic property. By Theorem 1, weighted universal codes also exist. If in addition (15) holds, then

R⁺ ≤ lim_{k→∞} ε_k = 0.

The second constructive proof is based upon a histogram of the possible values of the k-grams. Use a fixed-length code for the histogram as in (68), requiring at most

(number of cells) × log₂ (number of possible values/cell) = L^k log₂ (N + 1) + 1  b    (73)

as before. Call the histogram itself q_k(x_k) as in (71). Encode each of the k-grams using at most −log₂ q_k(·) + 1 b. The per-letter codeword length is then

(1/N) l(x_N) ≤ (1/N) [L^k log₂ (N + 1) + 1] − (1/k) Σ_{x_k ∈ A^k} q_k(x_k) log₂ q_k(x_k) + 1/k.    (74)

Taking the expected value of (74), we obtain

E [(1/N) l(X_N) | θ] ≤ (1/N) [L^k log₂ (N + 1) + 1] + (1/k) H(X_k | θ) + 1/k.    (75)

Letting k, N → ∞ with k/log N → 0,

R ≤ sup_θ lim_{k→∞} [ (1/k) H(X_k | θ) − lim_{N→∞} (1/N) H(X_N | θ) ] = 0    (76)

and if (15) applies

R⁺ ≤ lim_{k→∞} ε_k = 0.

The theorem is established. Note that convergence will be slower with k in the second method than in the first, since k-grams are encoded independently.

The theorem can be extended to infinite-alphabet sources as follows.

Theorem 8: Let Λ be the space of all stationary-ergodic processes indexed by θ as in Theorem 7, with finite entropy H(X₁) < ∞. Weighted universal codes for such sources exist. If there exists a probability mass function p(x₁) such that

E [−log p(X₁) | θ] < ∞    (77)

for all θ ∈ Λ, then weakly minimax universal codes exist. If (77) is a bounded function of θ and (15) holds, minimax universal codes exist.

Remark: (77) is a very weak restriction. The existence of E[log X₁ | θ] is sufficient, for example.

Proof: The proof is by construction in a manner similar to the last. All values in the block of length N are limited in value to log (N + 1), taken to be an integer for simplicity, and the result is encoded by either of the code constructions of the last theorem with truncated alphabet size L = log (N + 1). Additional information is added to the end of the block for those values ≥ log (N + 1), on a letter-by-letter basis in order of their appearance, using a codeword of length ≤ −log p(x₁) + 1. Using the second construction of the last theorem, to be specific, and noting that the limiting operation decreases entropy, the average codeword length is bounded as in (75) with L = log N plus the additional information for the extreme values:

Σ_{x₁ ≥ log (N+1)} p(x₁ | θ) [−log p(x₁) + 1].

The last term goes to zero as N → ∞ in the weighted sense since H(X₁) < ∞, and in the minimax sense if (77) holds. The other terms are handled as in the last theorem with k, N → ∞ so that (log (N + 1))^k / N → 0. This completes the proof of Theorem 8.

Note that Theorems 2, 7, and 8 imply that the mutual information between the parameter space and the observation space goes to zero for all finite-entropy conditionally stationary-ergodic sources. Thus for all stationary-ergodic sources it is not necessary to know the source probabilities to encode in an asymptotically optimal fashion. Of course, the less one knows of the source probabilities, the larger the block size must be for a given average redundancy level.
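The regrouping that produces the quasi-entropy is a pure counting identity: summing −log₂ q_k over positions equals summing over (context, next letter) pairs weighted by their counts. It can be verified mechanically; a sketch with an illustrative binary block (function names are mine):

```python
import math
from collections import Counter

def histograms(x, k):
    """Full-block k-gram context counts and (context, next letter) counts."""
    ctx, pair = Counter(), Counter()
    for n in range(len(x) - k):
        c = tuple(x[n:n + k])
        ctx[c] += 1
        pair[(c, x[n + k])] += 1
    return ctx, pair

def positionwise_sum(x, k):
    """-sum over positions of log2 q_k(next letter | k-context)."""
    ctx, pair = histograms(x, k)
    return -sum(math.log2(pair[(tuple(x[n:n + k]), x[n + k])]
                          / ctx[tuple(x[n:n + k])])
                for n in range(len(x) - k))

def groupwise_sum(x, k):
    """The same quantity regrouped over (context, letter) pairs."""
    ctx, pair = histograms(x, k)
    return -sum(c * math.log2(c / ctx[key[0]]) for key, c in pair.items())

x = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1]   # illustrative binary block
assert abs(positionwise_sum(x, 2) - groupwise_sum(x, 2)) < 1e-9
```

Each position contributes −log₂ of its own empirical conditional probability, so collecting identical (context, letter) pairs multiplies that term by its count, which is exactly the grouped form.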

The vanishing of the mutual information can in fact be shown directly through the convergence of conditional probabilities and the code book theorem. The idea of quasi-entropic encoding [3], [9], [11] can be extended to arbitrary nonstationary, nonergodic sources. Here the performance measure could be taken as the minimum average length code over the block which is fixed on k-grams. It is obvious from (71) and (74) that quasi-entropic encoding does the best that can be done in this regard.

VI. FIXED-RATE CODING

We suppose now that fixed-length rather than variable-length coding is to be employed blockwise. Let R be the coding rate in bits per source letter, so that each block is encoded into RN b. We can define error probabilities in analogy to the redundancies of Section IV. Here a code will mean an assignment of x_N to one of 2^{RN} distinct codewords or a random assignment, in which case an error is made. C_N(R) is now the space of all such assignments. For a particular code let δ(x_N) = 0 if x_N has a distinct codeword and δ(x_N) = 1 otherwise, so that the conditional error probability is

P_{eN}(R,θ) = Σ_{x_N ∈ A^N} δ(x_N) p(x_N | θ).    (79)

Definition 9: The average error probability of w is given by

P_e*(R,w) = lim_{N→∞} inf_{C_N(R)} ∫_Λ P_{eN}(R,θ) dw(θ).    (80)

Definition 10: The maximin error probability of W is given by

P_e⁻(R) = lim_{N→∞} sup_{w∈W} inf_{C_N(R)} ∫_Λ P_{eN}(R,θ) dw(θ).    (81)

Definition 11: The minimax error probability of Λ is given by

P_e⁺(R) = lim_{N→∞} inf_{C_N(R)} sup_{θ∈Λ} P_{eN}(R,θ).    (82)

Definition 12: The weakly minimax error probability of Λ is given by

P_e = inf_{C(R)} sup_{θ∈Λ} lim_{N→∞} P_{eN}(R,θ).    (83)

A code sequence for which zero error probability is attained will be called a universal code in the various senses that were defined for variable-length codes in Section IV. Theorems for performance as in (83) have been considered by Ziv [11].

General theorems can be developed for universal fixed-rate coding in analogy with Section IV. It is apparent, for example, that the inf over C_N(R) in (80), (81) is taken by assigning distinct codewords to the most probable x_N vectors with respect to the mixture probability mass function. Let δ(x_N) satisfy the equation

δ(x_N) = 0, if p(x_N) > T_N; 1, if p(x_N) ≤ T_N    (84)

where T_N is determined parametrically as the smallest value for which

Σ_{x_N ∈ A^N} (1 − δ(x_N)) ≤ 2^{RN}.    (85)

Obviously a necessary and sufficient condition for weighted fixed-rate universal coding is that R be large enough so that

lim_{N→∞} Σ_{A^N} δ(x_N) p(x_N) = 0.    (86)

A similar procedure holds for (81). The integrands are nonnegative in (80) and (81); therefore, zero error probability implies almost-everywhere zero error probability with respect to w. Therefore, we need only find a mixture which yields P_e⁺(R) = 0 in (82) to get a minimax code, if one exists. We will not develop the general theorems, however. We will only consider a theorem for finite-alphabet conditionally stationary-ergodic sources that follows immediately from Theorem 7 and the McMillan asymptotic equipartition property.

Theorem 9: Let Λ be the space of all stationary-ergodic processes indexed by θ with finite alphabet L < ∞. Then for any rate R such that

R > lim_{N→∞} N^{−1} sup_θ H(X_N | θ)    (87)

weakly minimax (and hence weighted) fixed-rate universal codes exist in the sense of (83).

Proof: We use either of the codes in Theorem 7, with zeros added on the end if l(x_N) < RN and truncation if l(x_N) > RN, in which case an error occurs. By the stationary-ergodic property (for the first code), for any k₀,

Pr [ lim_{N→∞} (1/N) l(X_N) = H(X_{k₀+1} | X_{k₀}, θ) | θ ] = 1.    (88)

Therefore, we can find a sequence of values k and N, depending on k, such that for every θ ∈ Λ

lim_{k,N→∞} Pr [ (1/N) l(X_N) > R | θ ] = 0.    (89)
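The optimal fixed-rate assignment just keeps the 2^{RN} most probable blocks under the mixture and declares an error otherwise, the error probability being the remaining tail mass. A sketch for a tiny i.i.d. source (the letter probabilities and rates are illustrative assumptions):

```python
import itertools
import math

def error_probability(letter_pmf, N, R):
    """Tail mass of the blocks left without one of the 2^(RN) distinct
    codewords, i.e., the error probability of the threshold assignment."""
    probs = sorted(
        (math.prod(letter_pmf[v] for v in x)
         for x in itertools.product(range(len(letter_pmf)), repeat=N)),
        reverse=True,
    )
    keep = int(2 ** (R * N))        # number of distinct codewords available
    return sum(probs[keep:])

pmf = [0.8, 0.2]                    # illustrative binary source, H ~ 0.72 bit/letter
# A higher rate keeps a superset of blocks, so the error mass can only shrink.
assert error_probability(pmf, 10, 0.9) < error_probability(pmf, 10, 0.5)
```

With R above the source entropy the tail mass also decays as N grows, which is the fixed-rate universality phenomenon the theorem formalizes; the exhaustive enumeration here is only feasible for toy block lengths.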

Another construction due to Ziv [11] is based upon the k-gram histogram of Theorem 7. Find the J most frequently occurring k-grams in the block of N and assign each of these a fixed-length code of length

≤ log J + 1.    (90)

Send the list of J k-grams first. Then follow this by N/k fixed-length codewords for the observed k-gram sequence.


The per-letter codeword length is

(1/N) l(x_N) ≤ [ J(log L^k + 1) + (N/k)(log J + 1) ] / N  b.    (91)

Choose J, k, and N so that the result is less than R. To be specific, pick some ε such that

0 < ε < R − lim_{k→∞} k^{−1} sup_θ H(X_k | θ)

and

2^{k(R−ε)} < J.
