IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 53, NO. 3, MARCH 2007


The Minimum Average Code for Finite Memoryless Monotone Sources

Mohammadali Khosravifard, Hossein Saidi, Member, IEEE, Morteza Esmaeili, and T. Aaron Gulliver, Senior Member, IEEE

Abstract—The problem of selecting a code for finite monotone sources with N symbols is considered. The selection criterion is based on minimizing the average redundancy (called the Minave criterion) instead of its maximum (i.e., the Minimax criterion). The average probability distribution P̄_N, whose associated Huffman code has the minimum average redundancy, is derived. The entropy of the average distribution (i.e., H(P̄_N)) and the average entropy of the monotone distributions (i.e., H̄(P_N)) are studied. It is shown that both log N − H(P̄_N) and log N − H̄(P_N) are asymptotically equal to a constant (≈ 0.61). Therefore, there is only a negligible penalty (at most 1.61 bits/symbol) in using a simple fixed-length code with respect to the optimal code. An efficient near-optimal encoding technique is also proposed. The consequences of the two approaches, i.e., Minave and Minimax, are compared in terms of their associated distributions and associated codes. In order to evaluate the average performance of the Minimax code, we prove that the informational divergence of the average distribution and the Minimax distribution asymptotically grows as log log N − 2.275.

Index Terms—Average redundancy, finite monotone sources, fixed-length code, Huffman code, Minave code, Minimax code, minimum average criterion.

I. INTRODUCTION

WHEN the symbol probabilities are known, the best code for a memoryless information source in the sense of average codeword length can be found using the well-known Huffman algorithm [11]. However, it is often the case that the exact values of the symbol probabilities are unknown. In some situations the number of symbols is very large, or the symbol probabilities change over time and, because of nonstationary effects, cannot be estimated using adaptive techniques. If nothing is known about a source except the number of symbols N, it has been proposed in [5] to assume equiprobable symbols and use the corresponding Huffman code.

Fig. 1. The best codes in terms of knowledge.

Definition 1: The Huffman code of the uniform distribution (1/N, ..., 1/N) is called the Uniform code and is denoted by U(N). Its codeword length for the ith symbol is

l_i = ⌊log N⌋ if i ≤ 2^⌈log N⌉ − N, and l_i = ⌈log N⌉ otherwise.
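For concreteness, the following minimal Python sketch computes these codeword lengths. It assumes the standard floor/ceiling assignment for the Huffman code of a uniform distribution, i.e., the two-case formula above; the original displayed expression is not fully recoverable here, so this is a sketch under that assumption.

```python
import math

def uniform_code_lengths(n):
    """Codeword lengths of U(n), the Huffman code of the uniform distribution on n symbols,
    with the shorter codewords assigned to the lower-indexed (more probable) symbols."""
    k = math.ceil(math.log2(n))
    short = (1 << k) - n            # this many symbols receive floor(log2 n) = k - 1 bits
    return [k - 1 if i < short else k for i in range(n)]

print(uniform_code_lengths(5))      # [2, 2, 2, 3, 3]
print(uniform_code_lengths(8))      # [3, 3, 3, 3, 3, 3, 3, 3]
```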

This code may suffer a considerable degradation in performance for some sources. Clearly, a lack of information about the symbol probabilities causes this degradation. Therefore, we can consider the following two extreme cases in designing a code in terms of our knowledge:
• When we know the exact values of the symbol probabilities, we have complete knowledge and the best choice is the Huffman code.
• When our information is restricted to just the alphabet size (i.e., N), we use the Uniform code.
These are illustrated in Fig. 1. As one can see, there is a large gap in knowledge between the two cases, where something more than just the number of symbols is known, but less than all symbol probabilities. What may be known is that the source belongs to a specific class of sources such as Bernoulli sources, monotone sources, Markov sources, etc. [16]. For example, in universal data compression techniques such as LZ77 [33] and its variants, it is (implicitly) assumed that shorter match lengths are more probable, which implies dealing with monotone sources. In particular, this paper considers the case where the order of the symbol probabilities is known rather than their exact values (Fig. 1). This means designing a code for the class of memoryless monotone sources with a finite number of symbols. The question is how to include the available information in the code design to obtain good (or guaranteed) performance over the class of sources being considered. The first step in the design of an appropriate code is to define the design criterion. Elias [7] considered the problem of finding a code for monotone sources with countably many symbols. He defined an asymptotic optimality criterion for which the ratio of average codeword



length to entropy must tend to one as the entropy increases to infinity. Subsequently, some researchers have considered this criterion, and some asymptotically optimal codes have been obtained [30], [31]. In particular, more recently Yamamoto [32] constructed a recursive universal representation of the integers for which the codeword length for almost all sufficiently large positive integers is shorter than in previously known techniques. However, these codes were designed for an infinite number of symbols, and their satisfactory performance is guaranteed only as the entropy tends to infinity, which is not the case for finite sources. Thus, they provide no guarantee of optimality for finite sources (sources with a finite number of symbols), and it is likely that better performance can be obtained by considering the finiteness of the alphabet in the criterion. Here we consider the minimum average criterion, which was mentioned as a special case by Gilbert [9] and was well defined by Cover [3]. With this criterion, the average redundancy (or average codeword length) over all sources in the class must be minimized, and consequently, a distribution function on the class of sources is needed. Cover considered a Dirichlet function as the probability distribution over the class of all possible sources with N symbols and explained Gilbert's scheme [9] from the point of view of this criterion [3]. Most papers on designing codes for a given class of sources (in particular for finite monotone sources) use the Minimax criterion [25], [26], [28], [5], [12]. In this case, the maximum of the cost function over all of the sources is minimized. The minimum average criterion is very different, since the Minimax criterion improves the worst case performance, while the minimum average criterion improves the average performance. Smith [28] used the average codeword length as the cost function for the class of sources with known upper and lower bounds on each symbol probability. Rissanen [25] considered the Minimax criterion for the ratio of average codeword length to entropy and assumed an upper bound on the largest symbol probability. He derived the optimum real codeword lengths, that is, codeword lengths that would be optimal if the integer constraint on codeword lengths were removed. Ryabko [26] and Davisson and Leon-Garcia [5] presented a probability distribution whose Shannon code [4] is a suboptimal code for the Minimax criterion, using the difference of average codeword length and entropy as the redundancy function. They showed that this code has near-optimal performance. This code was used to design a suboptimal key for taxons [15]. Kazakos [12] minimized the largest average codeword length using a game-theoretic formulation. The result was a code designed for the source with the highest entropy. Since the uniform distribution belongs to the class of monotone sources, his conclusion was to use a Uniform code for these sources. In this paper, we use the minimum average criterion to design a code for the class of memoryless monotone sources. In Section II, the minimum average and Minimax criteria are explained in detail. Since integration (in the expectation) and summation (in the average codeword length) are linear operations, the problem of finding the optimum code reduces to finding the Huffman code for the average probabilities. Using a theorem on the center of gravity of uniformly distributed simplexes, the average probabilities are derived for equiprobable monotone sources in


Section III. Some properties of the entropy of the average distribution are derived in Section IV. The average entropy of monotone sources is studied in Section V. In Section VI, we introduce a fast and efficient encoding scheme. In Section VII, we compare the consequences of the two different criteria, i.e., minimum average and Minimax. Finally, some conclusions are presented in Section VIII.

In this paper we use the following notation:
• The set of real numbers: ℝ.
• The set of natural numbers: ℕ.
• Logarithmic constant: log e ≈ 1.4427.
• Euler–Mascheroni constant: γ ≈ 0.5772.
• The value of the convergent series defined in Theorem 15 of the Appendix.
• Number of symbols or alphabet size: N.
• Entropy of a distribution P: H(P) = −Σ_i p_i log p_i.
• Informational divergence of two probability distributions P and Q: D(P‖Q) = Σ_i p_i log(p_i/q_i).
• Logarithm in base 2: log.
• Natural logarithm: ln.
• Expectation of a random variable with probability density function f.
• Cardinality or the number of elements in a set A: |A|.
• Volume of a simplex S.
• Binary representation of an integer with a given number of bits.

II. THE MAIN PROBLEM

Define each memoryless information source with N symbols by a vector P = (p_1, p_2, ..., p_N), where p_i denotes the probability of the ith symbol appearing in the output. Also, define the class of monotone sources as all sources with N symbols and ranked probabilities p_1 ≥ p_2 ≥ ... ≥ p_N.

The problem is then to find a unique code for all the sources in this class. Since designing the codewords is straightforward once the codeword lengths are known, we only consider the problem of determining the codeword lengths. A given code for a source with N symbols is denoted by its length vector L = (l_1, l_2, ..., l_N), where l_i is the codeword length for the ith symbol. To make the average codeword length small, shorter codewords should be assigned to more probable symbols, i.e., we should, without loss of generality, have l_1 ≤ l_2 ≤ ... ≤ l_N. For unique decodability, the codeword lengths should satisfy the Kraft inequality Σ_i 2^(−l_i) ≤ 1 [4]. However, in order to minimize the average codeword length as much as possible, we


only consider compact codes (i.e., codes that satisfy the Kraft inequality with equality). Therefore, it is desirable to find good codes in the set of compact codes with nondecreasing codeword lengths.

An exhaustive search over this set requires generating all compact codes in the first stage and evaluating the performance of each code in the second stage. Exhaustive search is an intractable task for large N even when the second stage is feasible, because the number of possible compact codes [9], [23] increases exponentially [8], [14].
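As a small illustration of the first stage of such an exhaustive search, the sketch below enumerates all nondecreasing codeword-length profiles satisfying the Kraft equality for small N. The enumeration strategy here is a simple recursion for illustration only; it is not the tree-based algorithm of [13].

```python
from fractions import Fraction

def compact_length_profiles(n):
    """All nondecreasing length vectors (l_1 <= ... <= l_n) with sum(2**-l_i) == 1,
    i.e., the codeword-length profiles of binary compact (complete) codes."""
    max_len = n - 1                       # the deepest compact code is 1, 2, ..., n-1, n-1
    profiles = []

    def extend(prefix, budget):
        if len(prefix) == n:
            if budget == 0:
                profiles.append(tuple(prefix))
            return
        l = prefix[-1] if prefix else 1   # keep the lengths nondecreasing
        while l <= max_len:
            cost = Fraction(1, 2 ** l)
            if cost <= budget:
                extend(prefix + [l], budget - cost)
            l += 1

    extend([], Fraction(1))
    return profiles

for n in range(2, 7):
    profiles = compact_length_profiles(n)
    print(n, len(profiles), profiles)
```

The count grows quickly with n, which is the point made above about the intractability of exhaustive search for large alphabets.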


It should be noted that the entropy is not always achievable without extending the source. Therefore, if a code is to be used for nonextended sources, it is more reasonable to employ the minimum achievable average codeword length, i.e., that of the Huffman code, in the code design. In other words, the performance of a code should be compared with the performance of the Huffman code and not with the entropy. Generally, the two cost functions lead to different codes.

Example: Suppose that we want to design a Minimax code for two given distributions.

Thus, although there is an efficient algorithm for generating all binary compact codes with N codewords [13], an efficient means of code selection (rather than exhaustive search) must be employed for large N. We next consider two criteria to select one of these codes.

A. The Minimax Criterion

The Minimax criterion was considered by Smith [28] in 1974. For the class of monotone sources, it was independently used by Rissanen [25] in 1978, Ryabko [26], [27] in 1979, and Davisson and Leon-Garcia [5] in 1980. Defining the redundancy of a code for a distribution as the difference between its average codeword length and the entropy of the distribution, the Minimax criterion results in a code whose maximum redundancy over all monotone sources is minimum among all compact codes, that is

In other words, the goal with this code is to improve and guarantee the performance in the worst case (maximum redundancy).
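The following sketch illustrates this selection rule on a toy instance: it evaluates the redundancy (average length minus entropy) of two candidate compact codes against two hypothetical monotone sources and reports the worst case for each candidate. The sources and candidate codes below are illustrative assumptions, not data from the paper.

```python
import math

def redundancy(lengths, p):
    """Average codeword length minus entropy (in bits), the cost used by the Minimax criterion."""
    avg_len = sum(l * q for l, q in zip(lengths, p))
    entropy = -sum(q * math.log2(q) for q in p if q > 0)
    return avg_len - entropy

sources = [(0.7, 0.15, 0.10, 0.05),        # a skewed monotone source (hypothetical)
           (0.25, 0.25, 0.25, 0.25)]       # the uniform source
candidates = [(1, 2, 3, 3), (2, 2, 2, 2)]  # the two compact length profiles for N = 4

for lengths in candidates:
    worst = max(redundancy(lengths, p) for p in sources)
    print(lengths, "worst-case redundancy = %.3f bits/symbol" % worst)
```

On this toy instance the unbalanced profile has the smaller worst-case redundancy; the Minave criterion of Section II-B would instead weigh the sources by a prior distribution rather than take the maximum.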

The optimum code for the Minimax criterion depends on which redundancy function is used: the entropy-based cost and the Huffman-based cost lead to different optimum codes for these two distributions. This shows that by selecting one of them we save some bits for one distribution while incurring a loss of 0.25 bit for the other. However, computing the entropy of a distribution is much easier than computing the average codeword length of its Huffman code; therefore, the entropy-based cost is more tractable. Conversely, the Huffman-based cost is preferable when the number of sources is finite and computing the corresponding Huffman codes is feasible. Rissanen [25] used the Minimax criterion for the redundancy function

He assumed an additional constraint on the largest symbol probability for monotone sources and derived the optimum real codeword lengths in a two-case closed form whose parameter is the real solution to a nonlinear equation.

Definition 2: The Minimax code [26] is the Shannon code for the Minimax distribution. Hence, the codeword length for the ith symbol is given by

(1) (2)

for

It is shown in [5], [26] that

With the Minimax criterion, a code is designed based on the worst case, so the worst case performance may be improved at the cost of degrading the performance in other cases. Thus, using the Minimax criterion for evaluating the performance of a code is questionable if the worst case rarely occurs. In such cases the minimum average criterion is more reasonable. B. The Minimum Average Criterion

and, consequently, it is a suboptimal solution of this minimax problem.

Gilbert [9] and Cover [3] both noted that one can treat the source distribution itself as a random vector. Accordingly, we may consider a probability distribution function on



the class of sources. Using this distribution function, we can defor each as the expectation of , that is fine

different codes may result. In addition, this criterion can be used for an arbitrary class of sources on which a distribution function is predefined.

Thus, is the average redundancy over the class of sources. With the minimum average criterion (hereafter called Minave criterion), the codeword lengths of the optimal code (hereafter are called Minave code) for the class of monotone sources , which provide the minimum value of , i.e.,

Definition 3: Point of gravity of an -dimensional simplex , where function

Since is a linear combination of , is actually an -dimensional simplex in and -dimensional integration on vanishes to zero. Thus, the expectation of must be computed by -dimensional defined by integration on the simplex

(3) In addition, as functions of facts one can write

and variables

must be considered . Based on these

Since have

is the center with mass density

is a probability density function, we

and therefore

This shows that we should determine the center of gravity of . C. Minimax Versus Minave The differences between the Minimax and Minave criteria are summarized as follows. • With the Minimax criterion, we attempt to improve the worst case performance (maximum redundancy), whereas with Minave the overall performance is the major concern. • From the Minimax perspective, all sources carry the same importance. In other words, a source is either in the set of sources or is not. With the Minave criterion, we can assign a degree of importance to each source and consider this in the code design.

(4) where , and tropy, is a finite constant. Consequently

, , the average en-

(5) It follows from (5) that are exactly the Huffman codeword lengths for . Therefore, finding reduces to com, for which the Huffman code can puting efficiently be obtained [20], [21]. Clearly, from the right-hand side of (5), we can interpret the optimal code that mini, as the optimal code which minimizes mizes . Note that the optimal code depends directly on , and for different distribution functions,

• The Minimax criterion implicitly assumes that a single source selected from the set of sources generates all symbols. With the Minave criterion, we may assume that different sources generate the symbols. • With the Minimax criterion, the optimum code may differ , between the cost functions ( and ), while it is unchanged with Minave. • It can be difficult to derive an analytic expression for the optimum code with the Minimax criterion. In particular, when the number of sources is finite, an exhaustive search may be required. Conversely, deriving the optimum code with the Minave criterion only requires the computation of the average distribution and the application of the Huffman algorithm. Moreover, the evaluation of an arbitrary code with the Minave criterion is much easier than with the Minimax. Depending on the requirements of a particular application, one or the other of these criteria may be preferred. Keeping this in mind, we are motivated to determine how much the resulting codes differ. The Minimax code for finite monotone sources has


been considered previously [5], [26]. In this paper, we consider the Minave code for these sources.
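Before Section III derives the average distribution analytically, the following Monte Carlo sketch previews the result: drawing monotone sources uniformly at random (by sorting uniform samples from the probability simplex) and averaging them matches, up to sampling noise, the closed form obtained by averaging the simplex vertices in Theorem 1. The closed form written in the code is an assumed reading of (10), whose display is not reproduced here.

```python
import numpy as np

def average_monotone_distribution(n):
    """Assumed closed form for (10): p_i = (1/n) * sum_{j=i}^{n} 1/j."""
    return np.array([sum(1.0 / j for j in range(i, n + 1)) / n for i in range(1, n + 1)])

rng = np.random.default_rng(0)
n, trials = 6, 200_000
samples = rng.dirichlet(np.ones(n), size=trials)   # uniform over the probability simplex
monotone = -np.sort(-samples, axis=1)              # sort each sample in decreasing order
print("empirical mean:", monotone.mean(axis=0).round(4))
print("closed form   :", average_monotone_distribution(n).round(4))
```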


and noting that

Using

(8) III. THE MINAVE CODE FOR EQUIPROBABLE MONOTONE SOURCES

(which can be easily proved by induction), one may write

It is reasonable to assume the uniform distribution (6) on the sources when the only information available about the sources under consideration is their monotonicity. In order to where is compute the center of gravity of a constant, we use the following theorem. Theorem 1: Let be a uniformly distributed sim) with vertices plex (i.e., . Then the center of gravity of is the average of its vertices, that is

(9)

Definition 4: Assume a uniform probability distribution function on the class of monotone sources with symbols. In this case, the average probability distribution is given by (10)

The Huffman code of , called , which is optimum in the average sense has codeword lengths

(The proof is given in the Appendix.) Remark: If we define the center of gravity of located at masses

individual as In fact, can be interpreted as the average probability that the th symbol appears. It follows from (10) that

then we can say from Theorem 1 that the center of gravity of a uniformly distributed simplex is the same as the center of equal individual masses located at its vertices gravity of . It can be easily shown that the vertices of form of

for

[12]. Noting that

are in the .. .

iff

the vertices of are for Hence we can use Theorem 1 to compute the ’s, that is

.. .. . .

.. .

.. .

.. .

It may appear that the above probability distribution is not so uniform, but we will show that it is. As the inequality

. holds for any nonincreasing integrable real function de, we can set and conclude that fined for (7)

(11)



Fig. 2. A comparison of the probability distribution of English letters and the average distribution for N = 26.

Since

, we may also write

.

TABLE I ENTROPY AND AVERAGE CODEWORD LENGTHS FOR THE M (26), U (26), F (26), AND HUFFMAN CODES FOR ENGLISH, FRENCH, GERMAN, AND SPANISH

(12) One can also use the following simple approximation to compute the average probabilities: (13) which improves in accuracy as increases. Another approximation for the average probability is

The latter expression has a somewhat more complex form, but its performance is very good. In particular, the Huffman codes of the average distribution and of this approximation differ only for 12 values of N in the considered range.

Example: As an example, we applied the M(26) code to the probability distribution of English letters [24]. As shown in Fig. 2, the probability distribution of English letters is very close to the average distribution. Therefore, one could do very well even if only the ranking of the English letters, in terms of their probabilities, were known. So we find that the average distribution applies to real-world examples. Spanish, French, and German are other languages with 26 letters and distributions similar to English. Table I gives the average codeword length of the Huffman code for each language, as well as for the M(26), U(26), and F(26) codes. For these distributions, the M(26) code is the best, followed by the U(26) code. Note that the F(26) code has the worst performance in all cases.
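The sketch below builds the Minave code for N = 26 and measures its penalty on a Zipf-like monotone test source, used here as a stand-in for a natural-language letter distribution (the actual letter frequencies of [24] are not reproduced). The average distribution is again the assumed closed form p_i = (1/N) Σ_{j=i}^{N} 1/j.

```python
import heapq, itertools, math

def huffman_lengths(p):
    """Binary Huffman codeword lengths for the probability vector p (len(p) >= 2)."""
    counter = itertools.count()        # tie-breaker so the heap never compares lists
    heap = [(w, next(counter), [i]) for i, w in enumerate(p)]
    heapq.heapify(heap)
    lengths = [0] * len(p)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        for i in a + b:                # every symbol in the merged subtree gets one bit deeper
            lengths[i] += 1
        heapq.heappush(heap, (w1 + w2, next(counter), a + b))
    return lengths

n = 26
p_bar = [sum(1.0 / j for j in range(i, n + 1)) / n for i in range(1, n + 1)]
minave_lengths = huffman_lengths(p_bar)              # the M(26) code

zipf = [1.0 / i for i in range(1, n + 1)]
test = [x / sum(zipf) for x in zipf]                 # hypothetical monotone test source
own_huffman = huffman_lengths(test)

avg = lambda lengths, p: sum(l * q for l, q in zip(lengths, p))
print("M(26) applied to the test source :", round(avg(minave_lengths, test), 3), "bits/symbol")
print("test source's own Huffman code   :", round(avg(own_huffman, test), 3), "bits/symbol")
```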

A. Sources With More Uncertainty Until now, we have considered the design of a code based on . If our informathe probability ranking tion about the probability ranking is deficient, we can consider all possible cases, use (10) for each case, and average over these cases to obtain the average probability distribution. Obviously, the Huffman code for the resulting average distribution is optimum in the Minave sense. Example: Consider a source with five symbols and the in, so that only the order equalities and is unknown. Hence, we have two possible cases of and . Assuming that and are equiprobable, we obtain

Therefore, if we were certain that would use the distribution or were in doubt as to whether the distribution

, we , but if we is greater, we should use .


This lack of information about the probability ranking leads to a more uniform average distribution. In the extreme case, when we know nothing about the ranking of the probabilities, we should use a uniform distribution, as expected. In other words, the Uniform code U(N) is the Minave code for the class of all sources with N symbols.


and for The inequality

for

implies

IV. ENTROPY OF THE AVERAGE DISTRIBUTION

In this section, the behavior of the entropy of the average distribution (as the fundamental information-theoretic quantity of a distribution) is studied. We prove that the entropy and the average Huffman codeword length of the average distribution are nondecreasing functions of N, and that the difference between log N and the entropy tends to a constant (where γ is the Euler–Mascheroni constant, γ ≈ 0.5772).

Theorem 2: The entropy and the average Huffman codeword length of the average distribution are nondecreasing functions of N, that is

In addition, one can see that for . From have we write

for

. As a result, we and (11)

, which can be rearranged as

are nondecreasing Noting that the difference of the upper and lower bounds in the . Clearly, above inequality is less than , we have (14) for satisfies (14) as well.

and

Before proving this theorem we must prove the following lemma. Lemma 1: Defining such that integer

, for integer

Theorem 3: Abrahams–Lipman [1] Consider two distribuand with tions and . If Hu and Tucker’s condi, we have tion [10] holds, that is, for some , for for

there exists an then

for for The value of

is given by or (14) we have

Proof: From (10) for

Lemma 1 and Theorem 3 can be used to prove Theorem 2. According to Theorem 2, including an extra symbol in the alphabet causes an increase in the entropy and average codeword length. Therefore, one cannot assume a lower expectation of average , by adding symbols. codeword length, i.e., Note that, although it intuitively seems that a larger entropy is associated with a larger average Huffman codeword length, this is not true in general [1]. However, Lemma 1 and Theorem 3 and show that this is true for the underlying special cases ( ). , we have Since and consequently so that (15)

This shows that iff and . On the other hand, we have for . Therefore, implies the existence of an which satisfies for

iff and

Now, noting that gives

[4], and using (12) and (15) (16)

Equation (16) implies that . However, Theorem 4 shows that actually gets very close to .

and



Theorem 4: The difference stant value

as

tends to the conincreases, that is

over the intervals known property

and

. Hence, the implies that

(17) Proof: Let lower bounds on that

. Using the upper and , i.e., inequalities (11) and (12), and noting

(19) is the differential entropy of the probability In fact, , . Following similar density function steps, we can derive an upper bound on

for for we can write (20)

Using (12) gives

which implies gument on (20) results in

. Using a similar ar-

which together with (19) completes the proof.
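A quick numerical check of Theorem 4, under the same assumed closed form for the average distribution: log N − H(P̄_N) settles near 0.61 as N grows, consistent with the constant quoted in the abstract. The identification of that constant with (1 − γ) log e in the last line below is an inference and should be read as an assumption, since the displayed value of (17) is not reproduced here.

```python
import math

def entropy_of_average_distribution(n):
    """Entropy (bits) of the assumed average distribution p_i = (1/n) * sum_{j=i}^{n} 1/j."""
    tail, p = 0.0, [0.0] * n
    for i in range(n, 0, -1):          # accumulate the harmonic tail from the back
        tail += 1.0 / i
        p[i - 1] = tail / n
    return -sum(x * math.log2(x) for x in p)

gamma = 0.5772156649
for n in (10, 100, 1000, 10000, 100000):
    print(n, round(math.log2(n) - entropy_of_average_distribution(n), 4))
print("(1 - gamma) * log2(e) =", round((1 - gamma) * math.log2(math.e), 4))
```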

V. AVERAGE ENTROPY OF MONOTONE SOURCES In this section, we study the average entropy of equiprobable symbols (i.e., where monotone sources with ), denoted by . As the entropy is a concave function of distribution, we have . Here, we intend to compare with in more detail. , we first define the canonical In order to formulate in as simplex

for (18)

It can be shown that the last sum in (18) tends to as (see Lemma 4 in the Appendix). The two other sums tend to the Riemann integral of the continuous function

(21)

The importance of this definition is that the mean of some speand over cial symmetric functions (such as entropy) over are equal (Lemma 5 in the Appendix). Moreover, evalumight be simpler ating the mean of such functions over . The following theorem gives an exact formula for than over in terms of .


Fig. 3. The difference between H(P̄_N) and H̄(P_N).

Theorem 5: For an arbitrary positive integer

we have

Combining (22) with (23) gives

(22)

(24)

Proof: Let . Assuming equiprobability of the monotone sources, we write

Theorem 6: Assuming equiprobability of monotone distributions, the average entropy tends to be equal to the entropy of the average distribution as increases, that is (25)

Therefore,

Proof: Noting (24) and (17), the proof is clear.
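Theorem 6 can also be checked by simulation. Because the entropy is invariant under permutations of the probabilities (the situation covered by Lemma 5), the average entropy over uniformly distributed monotone sources can be estimated by sampling the unrestricted simplex. The sketch below compares that estimate with the entropy of the assumed average distribution; the gap shrinks as N grows, with the average entropy not exceeding the entropy of the average, as concavity requires.

```python
import numpy as np

def entropy_bits(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(1)
for n in (4, 16, 64, 256):
    samples = rng.dirichlet(np.ones(n), size=20000)         # uniform over all sources of size n
    avg_entropy = np.mean([entropy_bits(s) for s in samples])
    p_bar = np.cumsum(1.0 / np.arange(n, 0, -1))[::-1] / n  # assumed average distribution
    print(n, "average entropy ~= %.3f" % avg_entropy,
          "  entropy of average = %.3f" % entropy_bits(p_bar))
```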

is actually the mean of the function

Theorem 7: For an arbitrary positive integer

we have (26)

Proof: Setting gives

over . As satisfies the conditions of Lemma 5 (given in the Appendix), its mean over and over are equal, that is

in the well-known inequality and consequently

and Now, from Lemmas 6 and 7 (given in the Appendix), the mean of over is given by Therefore, is a nondecreasing sequence of and (24) completes the proof. Theorem 8: For an arbitrary positive integer

we have

and the proof is complete. The well-known property of Euler–Mascheroni constant is that (23)

Proof: The proof is clear from orem 7.

and The-

In Fig. 3, the difference is illustrated as on a logarithmic scale. It can be seen that a function of


Fig. 4. Average redundancy of the U(N), M(N), and PU(N) codes.

has its maximum at . Therefore, keeping Theorem 8 in mind, we have the following conjecture. Conjecture 1: For an arbitrary positive integer have

From

we

(28)

and (26) we have

(27) As the difference is actually the informational divergence of the average probability distribution and the uniform distribution, i.e., , the inequality (27) reflects the high uniformity of length code (assigning ceptable performance.

On the other hand, the average codeword length of a Uniform code) is certainly not greater than that of the code (the corresponding fixed-length code for any arbitrary distribution ). Also, the code is the Huffman code (in particular, for and has the minimum possible average codeword length for this distribution. Thus, we have

. Thus, one may expect a fixedbits to each symbol), to have ac-

An analytic investigation of the performance of for is not feasible. The average redundancy of the and codes are shown in Fig. 4. Although, we proved that the average code is less than , Fig. 4 suggests redundancy of the a smaller upper bound. Conjecture 2: The average redundancy of the Uniform code, , is not greater than , that is

Theorem 9: The average redundancy of a fixed-length code . is less than Proof: From Theorem 7 we have Therefore, knowing the rank of the probabilities (when we ) allows only a small improvement in the average use redundancy with respect to the case of knowing nothing about ). the symbol probabilities (when we use VI. A PARTIALLY UNIFORM CODE FOR Therefore, if a fixed-length code with length is used, . This sugthe average redundancy is less than a constant gests that if 1.61 bits/symbol average redundancy is tolerable, one can simply use a fixed-length code and take advantage of its benefits such as simple implementation and absence of error propagation.

As mentioned earlier, is the best code in the average sense, but for large , even the most efficient known algorithms require some time and space to determine the Huffman code [20] . On the other hand, is very easy to implement but of in average redundancy. In this section, certainly exceeds code) which is very we design a partially uniform code (



efficient in terms of both performance (Fig. 4) and implementation. A simple generalization of the start–step–stop code [29] provides the following two techniques to design a code for nearly uniform sources, i.e., sources with entropy slightly less than . In both methods, the set is partitioned each containing successive integers. into sets, To represent the integer , , we first encode , the set containing . Then using the codewords given in (29) code with sym(at the bottom of the page) for the with or bols, we indicate among all integers in bits. Note that the codewords given by establish an instantaneous code. The main difference between the two techniques is the partition of and the allocation of bits in representing the ’s. In the first technique, the partitioning is done by the user. For example, one may attempt to partition such that the ’s have to the same cardinality. Then by assigning each and applying the Huffman algorithm to the ’s, one can obtain a good representation for the ’s. In the second approach, the codeword lengths representing the ’s are predefined by the user, and partitioning is done so that the resulting code is appropriate. In particular, if the se, with is to be used for repquence , then the sets should be defined such resenting that

method more convenient we used the following approximation to determine , and

Both of these techniques were applied to for different values of and various codeword lengths for representing the ’s. This led to an efficient and fast method for encoding based on the second technique. With this method, is partiand these sets are encoded tioned into four sets , , , and bits, respectively. with should be determined in such a way Clearly, , , , and that

Solving the nonlinear equation we obtain

(30)

(32)

(33) (34) where (8) and (13) were used in (32) and (33), respectively. Now using (30), (31), and (34), we have

for

Thus we set

Conversely, if we partition using the above values for , , and , we obtain , , , and to represent , , , and , respectively. In fact, the two techniques presented in this section are the same for the nearly and . uniform distribution The codeword for the th symbol is the concatenation of two and defined by parts if if if if

(31) Since the ’s contain successive integers, knowing , , uniquely determines , , , and . To make this and

if if

(35)

(29)



Fig. 5. Average redundancy of the PU(N) code.

TABLE II: PARTIALLY UNIFORM CODE FOR P̄_1488 (PU(1488))

used when N changes frequently and takes on large values. Note that choosing a larger number of sets improves performance while increasing the code complexity, but this improvement is not considerable. This is why we chose four sets, which resulted in a code very similar to the classic start–step–stop code, except that it uses a Uniform code rather than a fixed-length code.
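Since the exact construction of PU(N) is only partially legible above, the sketch below shows the generic shape of such a partially uniform, start–step–stop-style encoder: a short prefix selects one of a few groups of consecutive symbols, and the offset inside the group is encoded with the Uniform code of Definition 1. The group boundaries and the prefix code in the sketch are illustrative assumptions, not the actual PU(N) parameters.

```python
import math

def uniform_codeword(index, group_size):
    """Codeword of the Huffman code for the uniform distribution on `group_size` symbols
    (floor/ceil of log2(group_size) bits), for the 0-based `index` inside the group."""
    if group_size == 1:
        return ""
    k = math.ceil(math.log2(group_size))
    short = (1 << k) - group_size              # this many symbols get k - 1 bits
    if index < short:
        return format(index, "0{}b".format(k - 1))
    return format(index + short, "0{}b".format(k))

def pu_encode(i, boundaries, prefixes):
    """Encode symbol i (1-based) as prefix(group) + uniform offset within the group."""
    lo = 1
    for hi, prefix in zip(boundaries, prefixes):
        if i <= hi:
            return prefix + uniform_codeword(i - lo, hi - lo + 1)
        lo = hi + 1
    raise ValueError("symbol index out of range")

boundaries = [2, 10, 40, 100]                  # illustrative partition of {1, ..., 100}
prefixes = ["0", "10", "110", "111"]           # an instantaneous code for the group index
for i in (1, 2, 3, 10, 11, 41, 100):
    print(i, pu_encode(i, boundaries, prefixes))
```

Decoding mirrors encoding: read the prefix to identify the group, then read either ⌊log g⌋ or ⌈log g⌉ further bits for a group of size g, which is exactly the kind of small lookup table described above.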


VII. COMPARISON OF THE CONSEQUENCES OF THE MINAVE AND MINIMAX CRITERIA

In this section, we examine the consequences of the Minave and Minimax criteria in terms of the resulting codes and of the associated distributions.

and if

(36)

This encoding scheme is very simple because we can construct a lookup table for the eight intervals according to (29), (35), and (36). After locating in one of these intervals, the codeword can easily be produced. Decoding will also be trivial if such a simple look-up table is used. , we constructed the Example: For code given in Table II. From this table, the codeword length for is and for is . In other words, in the code, more probable symbols may have longer codewords. However, because of the high uniformity of , the proposed method is efficient. the distribution Fig. 5 shows the average redundancy of the code for up to . The average redundancy is negligible, in particular for it does not exceed . In addialways has better performance tion, Fig. 4 shows that , especially for . For large , the average than code is at least 0.25 bits less than redundancy of the that of . Thus, the proposed code has near-optimum performance, and can be simply implemented. It can be efficiently

A. Codeword Lengths It is very difficult to characterize the Huffman length for arbitrary and , , but we observed that the following conjecture holds for . Conjecture 3: For

we have (37)

This conjecture may seem to suggest that the code is . Although it is true for a generalized Shannon code [6] of most values of , it is not always the case because (37) does not necessarily hold for . Instead, we have and

For instance,

and

but

. There are two other points to be mentioned concerning this conjecture. The first is that these bounds are very tight, be-



Fig. 6. Performance of the two asymptotically optimal integer codes considered in the text.

cause

is equal to

or . As an ex-

For

, one can write

, we have and . The second point is that (37) does not hold for a general distribution. For example, it does not hold for the distriwith Huffman lengths bution . Beyond this conjecture, we use the following theorem to characterize analytically. ample, for

Theorem 10: Montgomery and Kumar [22] If the largest probability of a distribution lies in the interval

then the smallest Huffman codeword length is equal to .

or

In other words, this theorem states that for an arbitrary distribution, the smallest Huffman codeword length is given by

which implies that at most one integer lies in the interval

or (38) where

denotes the largest probability.

Lemma 2: For alphabet size

Noting this point and using (38) and (39) completes the proof. The immediate corollary of Lemma 2 is the following theorem.

we have

Theorem 11: The shortest codeword length of to infinity as increases, that is

tends

where . Proof: From (12) we have

(39)

However, in representations of the integers such as the and codes, the shortest codeword has a fixed and finite length. and the average codeFig. 6 shows the average entropy word length of the asymptotically optimal codes and , when



TABLE III: ℓ(N), ℓ(N), AND log(1 + ·) FOR N = 2^k

applied to the average distribution. This figure suggests unbounded average redundancy for both codes, that is

Proposition 1: If (37) holds for sufficiently large , then

and

(43)

(40) tends to

Theorem 12: The shortest codeword length of infinity as increases, that is

one may write that

and conclude from (37) (44)

(41) where

Proof: From (2) we have

. Hence,

Proof: From (10) we have

Noting that is a compact code and and using Lemma 2 completes the proof.
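The growth of the shortest codeword length can be observed numerically. The sketch below builds the Huffman code of the assumed average distribution and reports its minimum codeword length, together with the Shannon length of the most probable symbol of a Zipf-like distribution used as a stand-in for the Minimax distribution; the remark in Section VII-B says the Minimax probabilities are approximately proportional to 1/i, and the exact distribution (1) is not reproduced here, so the second column is only an approximation.

```python
import heapq, itertools, math

def huffman_lengths(p):
    """Binary Huffman codeword lengths for the probability vector p."""
    counter = itertools.count()
    heap = [(w, next(counter), [i]) for i, w in enumerate(p)]
    heapq.heapify(heap)
    lengths = [0] * len(p)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        for i in a + b:
            lengths[i] += 1
        heapq.heappush(heap, (w1 + w2, next(counter), a + b))
    return lengths

for n in (16, 256, 2048):
    tail, p_bar = 0.0, [0.0] * n
    for i in range(n, 0, -1):
        tail += 1.0 / i
        p_bar[i - 1] = tail / n                         # assumed average distribution
    shortest_minave = min(huffman_lengths(p_bar))       # shortest codeword of M(n)
    shortest_minimax = math.ceil(math.log2(tail))       # Shannon length of q_1 ~ 1/H_n
    print(n, "shortest M(n) codeword:", shortest_minave,
          "  shortest codeword of the 1/i stand-in:", shortest_minimax)
```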

,

To show that this proposition and its assumption are plausible, and . This consider Table III, which compares , table suggests that for

Also Lemma 8 (given in the Appendix) implies that and (42) which agrees with (43) as

and consequently

The divergency of the harmonic series proof. Theorem 13: For code

completes the

we have

Proof: From (2) we can write

and

Noting that

we have

In addition, the last row of Table III verifies Lemma 2 for , . B. Distributions Depending on which one of the two criteria, Minave or Minor imax, is considered, we must deal with two distributions . The natural question that arises is how different are these two distributions? We answer this question in this subsection. To some extent, the answer reveals how different the two criteria are. and is evident The intrinsic difference between from Fig. 7. In order to examine the difference for arbitrary , . we consider the informational divergence Lemma 3: For any alphabet size

As the right-hand side of this inequality tends to infinity as increases, the proof is complete. Therefore, the shortest codeword length of the Minimax code increases to infinity as does that of the Minave code, but the growth rate is slow so that . In contrast, we conjecture that this ratio tends to a constant for the code.

we have

Proof: According to (1) and (10) we can write


Fig. 7. Comparison of the average distribution and the Minimax distribution.

Simulation shows that increases with (Fig.8). The following theorem indicates how fast it tends to infinity. Theorem 14: For the informational divergence of , i.e., , we have

where Now, using the fact that completes the proof:

in the following

and

denotes the value of the convergent series .

Proof: It is well known that consequently

and (45)

We show (Theorem 15 in the Appendix) that the infinite series

converges to a constant denoted by . Thus, from (23) and , one may conclude that and consequently (46) Now from Lemma 3 we have

which together with (17), (46), and (45) proves the theorem.
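A rough numerical illustration of Theorem 14 follows, with two caveats: the average distribution is the assumed closed form used throughout these sketches, and the Minimax distribution is replaced by the Zipf-like stand-in q_i ∝ 1/i mentioned in the remark of Section VII-B, so the constant in the theorem need not be matched exactly. What the sketch does show is that the divergence grows without bound, but only at a log log N rate.

```python
import math

def divergence_to_zipf(n):
    """D(P-bar_n || Q_n) in bits, with Q_n the Zipf-like stand-in q_i = 1/(i * H_n)."""
    tail, p = 0.0, [0.0] * n
    for i in range(n, 0, -1):
        tail += 1.0 / i
        p[i - 1] = tail / n
    h_n = tail                                           # the harmonic number H_n
    return sum(pi * math.log2(pi * i * h_n) for i, pi in enumerate(p, start=1))

for n in (10**2, 10**3, 10**4, 10**5, 10**6):
    print(n, "D = %.3f" % divergence_to_zipf(n),
          "  log2(log2 N) = %.3f" % math.log2(math.log2(n)))
```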



Fig. 8. Informational divergence of the average distribution and the Minimax distribution.

A direct consequence of Theorem 14 is that

The nonzero informational divergence of plies the nonzero redundancy of for the inequality

and imwhich satisfies

(47) This implies that

and

Thus,

Also, for the average redundancy one can write

and

are quite different for large

.

Remark: The Minimax and Minave criteria result in th probabilities and , which can be proportionally approximated by (the Zipf distribution) [26] and , respectively. Therefore, the th average probability is approximately proportional to the sum of . In the th to th Minimax probabilities, i.e., other words, the th Minimax probability is approximately proth average probportional to the difference of the th and . Interestingly, there is abilities, i.e., a meaningful difference between these distributions in the conis assotext of linguistics. The Minimax probability ciated with Zipf’s law, which models the probability of the th most frequent English word [19], while the average probability models the probability of the th most frequent English letter (Fig. 2). C. Average Redundancy For a fixed , the best code from the Minave point of view is , whereas is suboptimal with the Minimax criterion. The question is then how performs with the Minave criterion.

(48) Example: For

, we have and

Depending on the situation, this 1.239 bits/symbol degradation in the average performance may be tolerable. Therefore, for and have acceptable persmall values of we may use formance from both the Minave and Minimax points of view. for large . Now consider the average redundancy of Clearly, (25), (47), and (48) imply unbounded average redun. According to (40), dancy for large which grows as unbounded average redundancy for large also occurs for the asymptotically optimal codes, and . By contrast, according to Theorem 9 and (28), the average redundancy of the Minave,


Uniform and fixed-length codes is less than for all alphabet sizes . In the context of universal codes [7], where the redundancy tends to infinity, the ratio of the redundancy to entropy is a measure of the goodness of a code, i.e., asymptotic optimality criterion or

Similarly, as the average redundancy of the , , and codes tends to infinity, one may use the ratio of average redundancy to average entropy as a measure of their goodness ). As is an asymptotically optimal code and (i.e.,


small (and negligible for large ), improvement in performance with respect to the case when nothing is known about the symbol probabilities. , the ratio of the largest codeWe demonstrated that for to the smallest codeword length tends to word length as increases. This suggests that for large alphabet sizes, the smallest codeword length should be approximately half that of the largest one. Since universal codes are designed for a countably infinite number of symbols, they cannot satisfy this condition and do not result in bounded average redundancy. Also, for the Minimax code ( code), the ratio tends to infinity, and its average redundancy is approximately for large , which is not bounded. However, the ratio of the average redundancy to average entropy tends to zero for the Minand any asymptotically optimal code, such as imax code or . APPENDIX

we can write

be a uniformly distributed simTheorem 1: Let plex (i.e., ) with vertices . Then the center of gravity of is the average of its vertices, that is and from Theorem 6 we conclude

This is also the case for any other asymptotically optimal code (in particular, ). For the code one may write

Hence, the expected redundancy of all these codes is negligible with respect to the expected entropy and no one is preferable in code, however, is both suboptimal for the this sense. The Minimax criterion and asymptotically optimal for the Minave criterion. VIII. CONCLUSION In this paper, we considered the minimum average criterion to design a code for the class of finite memoryless monotone sources. Assuming equiprobable sources, we derived the avfor which the associated erage probability distribution code) is optimum. We proved the inHuffman code ( and , and that creasing behavior of tends to a constant value . Also we studied the average entropy , compared it with , and proved that . Thus, if an average redundancy of up to 1.61 bits/symbol is tolerable, a fixed-length code can be employed. Otherwise, we can use the , which has a small average fast and efficient code, i.e., redundancy. It is important to note that, from the minimum average point of view, knowing the ranking of the probabilities allows only a

Proof: We use a proof presented by Lasserre and Avrachenkov [17] for a more general problem. There is a well-known formula for integrating a Dirichlet function on the canonical simplex (defined in (21)) [17] (49)

Since is a simplex, for all can write with equivalently

where

and

, we , or

. Also from we can write (50)

Thus we have the following transformation (51) where

is an

matrix

Note that from the volume formula of a simplex we have (52)



Now, using the change of variables and (50) we get (53) is and (54) at the bottom of the page. In this derivation, the Jacobian of the transformation (51). Also, (53) and (54) are obtained from (52) and (49), respectively. Thus, from Definition 3 one may write

or just

.

Lemma 4: We have

Lemma 5: Consider a function which is invariant is symmetric (i.e., the value of under any permutation of its variables) and satisfies

over The mean of equal. Proof: Let the set of all permutations of size . It is clear that permutation transformation

and over

(55) are

and

denote taken from . Each defines an affine where

for Proof: Clearly, holds for Also, implies and consequently one can write for

and

Noting that

which maps (55) implies

onto

. The symmetry of

and property

for any permutation . On the other hand, a comparison and (i.e., (21) and (3)), rebetween the definitions of are a sorted version of the elements veals that the elements of . Therefore, can be partitioned into disjoint simof plexes, ’s, that is

and we have

completes the proof.

(53)

(54)



It can easily be shown that the canonical simplex in (21)) may be rewritten as

(defined

for (56) Using the well-known formula which gives the volume of an -dimensional simplex in terms of the coordinates of its vertices, it is not hard to show that

Thus, we have the equation at the bottom of the page. Defining the new variable , we have

(57) Thus, from (56) we conclude

Following this procedure, the most interior integral becomes and the proof is complete. Lemma 6: For a function

, the mean of which, with the change of variables

over

is equal to

, gives

, where

Proof: Let . The symmetry of with respect to variables the canonical simplex (see (21)) implies that

After have

successive change of variables and integration, we

for Finally, (57) and (58) give

and consequently

(58)

which completes the proof.



Lemma 7: If

and

,

Theorem 15: The infinite series

then

converges to . Proof: Manipulating the left-hand side inequality of (59) gives the following:

Proof: We can use integration by parts and write

Also, it can be easily shown by induction that

Then we can write

and consequently

(60) Noting (42), the existence of the upper bound (60) guarantees the convergence of

which completes the proof. Lemma 8: For

we have (59)

Proof: Let It is not hard to prove for

and that

.

We denote the value of this infinite series by which is approx. Interestingly, the logarithmic constant (i.e., imately ) is the only value of which makes convergent. ACKNOWLEDGMENT

and conclude that

and

for

we have

for

The authors are indebted to Prof. Thomas M. Cover for his constructive suggestions, and especially for simplifying the proof of (10). Also, the authors appreciate the careful evaluation of the paper by the anonymous reviewers. In particular, they thank the reviewers for bringing [6], [8], [14], [20], [21] to their attention, and motivating them to determine (25) and (47). Finally, the authors acknowledge Dr. Farid Bahrami for his help in proving Theorem 15. REFERENCES

and

which proves

. Defining

.

[1] J. Abrahams and M. J. Lipman, “Relative uniformity of sources and the comparison of optimal code costs,” IEEE Trans. Inf. Theory, vol. 39, no. 5, pp. 1695–1697, Sep. 1993. [2] J. L. Bently and A. C. Yao, “An almost optimal algorithm for unbounded searching,” Inf. Process. Lett., vol. 5, no. 3, pp. 82–87, Aug. 1976.


[3] T. M. Cover, “Admissibility properties of Gilbert’s encoding for unknown source probabilities,” IEEE Trans. Inf. Theory, vol. IT-18, no. 1, pp. 216–217, Jan. 1972.
[4] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley, 1991.
[5] L. D. Davisson and A. Leon-Garcia, “A source matching approach to finding minimax codes,” IEEE Trans. Inf. Theory, vol. IT-26, no. 2, pp. 166–174, Mar. 1980.
[6] M. Drmota and W. Szpankowski, “Precise minimax redundancy and regret,” IEEE Trans. Inf. Theory, vol. 50, no. 11, pp. 2686–2707, Nov. 2004.
[7] P. Elias, “Universal codeword sets and representations of the integers,” IEEE Trans. Inf. Theory, vol. IT-21, no. 2, pp. 194–203, Mar. 1975.
[8] P. Flajolet and H. Prodinger, “Level number sequences for trees,” Discr. Math., vol. 65, pp. 149–156, 1987.
[9] E. N. Gilbert, “Codes based on inaccurate source probabilities,” IEEE Trans. Inf. Theory, vol. IT-17, no. 3, pp. 304–314, May 1971.
[10] T. C. Hu and A. C. Tucker, “Optimal binary search trees,” in Proc. 2nd Chapel Hill Conf. Combinat. Math. and Applic., Chapel Hill, NC, May 1970, pp. 285–305.
[11] D. A. Huffman, “A method for the construction of minimum-redundancy codes,” Proc. IRE, vol. 40, no. 9, pp. 1098–1101, Sep. 1952.
[12] D. Kazakos, “Robust noiseless source coding through a game theoretic approach,” IEEE Trans. Inf. Theory, vol. IT-29, no. 4, pp. 576–583, Jul. 1983.
[13] M. Khosravifard, M. Esmaeili, H. Saidi, and T. A. Gulliver, “A tree based algorithm for generating all possible binary compact codes with N codewords,” IEICE Trans. Fundam., vol. E86-A, no. 10, pp. 2510–2516, Oct. 2003.
[14] J. Komlos, W. Moser, and T. Nemetz, “On the asymptotic number of prefix codes,” Mitteilungen aus dem Math. Seminar Giessen, pp. 35–48, 1984.
[15] R. E. Krichevskii, B. Y. Ryabko, and A. Y. Haritonov, “Optimal key for taxons ordered in accordance with their frequencies,” Discr. Appl. Math., vol. 3, no. 1, pp. 67–72, 1981.
[16] R. Krichevsky, Universal Compression and Retrieval. Norwell, MA: Kluwer Academic, 1994.
[17] J. B. Lasserre and K. E. Avrachenkov, “The multi-dimensional version of ∫ x^p dx,” Math. Assoc. Amer. Monthly, vol. 108, pp. 151–154, Feb. 2001.
[18] V. I. Levenstein, “On the redundancy and delay of decodable coding of natural numbers,” Probl. Cybern., vol. 20, pp. 173–179, 1968.

[19] W. Li, “Zipf’s law everywhere,” Glottometrics, vol. 5, pp. 14–21, 2002. [20] R. L. Milidiú, A. A. Pessoa, and E. S. Laber, “Three space-economical algorithms for calculating minimum-redundancy prefix codes,” IEEE Trans. Inf. Theory, vol. 47, no. 6, pp. 2185–2198, Sep. 2001. [21] A. Moffat and J. Katajainen, “In-place calculation of minimum-redundancy codes,” in Algorithms and Data Structures, 4th Int. Workshop (Lecture Notes in Computer Science), S. G. Akl, F. H. A. Dehne, J. Sack, and N. Santoro, Eds. Berlin, Germany: Springer-Verlag, 1995, vol. 955, pp. 393–402. [22] B. L. Montgomery and B. V. K. V. Kumar, “On the average codeword length of optimal binary codes for extended sources,” IEEE Trans. Inf. Theory, vol. IT-33, no. 2, pp. 293–296, Mar. 1987. [23] E. Norwood, “The number of different possible compact codes,” IEEE Trans. Inf. Theory, vol. IT-13, no. 4, pp. 613–616, Oct. 1967. [24] F. Pratt, Secret and Urgent: The Story of Codes and Ciphers. Garden City, NY: Blue Ribbon Books, 1942. [25] J. Rissanen, “Minimax codes for finite alphabets,” IEEE Trans. Inf. Theory, vol. IT-24, no. 3, pp. 389–392, May 1978. [26] B. Y. Ryabko, “Encoding of a source with unknown but ordered probabilities,” (in Russian) Probl. Pered. Inform., vol. 15, no. 2, pp. 71–77, Apr.-Jun. 1979. [27] ——, “Comments on ’A source matching approach to finding minimax codes’,” IEEE Trans. Inf. Theory, vol. IT-27, no. 6, pp. 780–781, Nov. 1981. [28] S. A. Smith, “A generalization of Huffman coding for messages with relative frequencies given by upper and lower bounds,” IEEE Trans. Inf. Theory, vol. IT-20, no. 1, pp. 124–125, Jan. 1974. [29] D. Solomon, Data Compression, The Complete Reference, 2nd ed. New York: Springer-Verlag, 2000. [30] Q. F. Stout, “Improved prefix encodings of the natural numbers,” IEEE Trans. Inf. Theory, vol. IT-26, no. 5, pp. 607–609, Sep. 1980. [31] H. Yamamoto and H. Ochi, “A new asymptotically optimal code for the positive integers,” IEEE Trans. Inf. Theory, vol. 37, no. 5, pp. 1420–1429, Sep. 1991. [32] H. Yamamoto, “A new recursive universal of the positive integers,” IEEE Trans. Inf. Theory, vol. 46, no. 2, pp. 717–723, Mar. 2000. [33] J. Ziv and A. Lempel, “A universal algorithm for sequential data compression,” IEEE Trans. Inf. Theory, vol. IT-23, no. 3, pp. 337–343, May 1977.
