IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 53, NO. 3, MARCH 2007
The Minimum Average Code for Finite Memoryless Monotone Sources

Mohammadali Khosravifard, Hossein Saidi, Member, IEEE, Morteza Esmaeili, and T. Aaron Gulliver, Senior Member, IEEE
Abstract—The problem of selecting a code for finite monotone sources with N symbols is considered. The selection criterion is based on minimizing the average redundancy (called the Minave criterion) instead of its maximum (i.e., the Minimax criterion). The average probability distribution P̄^(N), whose associated Huffman code has the minimum average redundancy, is derived. The entropy of the average distribution (i.e., H(P̄^(N))) and the average entropy of the monotone distributions (i.e., H̄(P^(N))) are studied. It is shown that both log N − H(P̄^(N)) and log N − H̄(P^(N)) are asymptotically equal to a constant (≈ 0.61). Therefore, there is only a negligible penalty (at most 1.61 bits/symbol) in using a simple fixed-length code with respect to the optimal code. An efficient near-optimal encoding technique is also proposed. The consequences of the two approaches, i.e., Minave and Minimax, are compared in terms of their associated distributions and associated codes. In order to evaluate the average performance of the Minimax code, we prove that the informational divergence of the average distribution and the Minimax distribution asymptotically grows as −2.275 + log log N.

Index Terms—Average redundancy, finite monotone sources, fixed-length code, Huffman code, Minave code, Minimax code, minimum average criterion.
I. INTRODUCTION
When the symbol probabilities are known, the best code for a memoryless information source in the sense of average codeword length can be found using the well-known Huffman algorithm [11]. However, it is often the case that the exact values of the symbol probabilities are unknown. In some situations, the number of symbols is very large, or the symbol probabilities change over time and, because of nonstationary effects, cannot be estimated using adaptive techniques. If nothing is known about a source except the number of symbols N, it has been proposed in [5] to assume equiprobable symbols and use the corresponding Huffman code.
Definition 1: The Huffman code of the uniform distribution (1/N, ..., 1/N) is called the Uniform code, which is
Manuscript received March 15, 2003; revised November 19, 2006. M. Khosravifard and H. Saidi are with the Department of Electrical and Computer Engineering, Isfahan University of Technology, Isfahan, Iran (e-mail:
[email protected];
[email protected]). M. Esmaeili is with the Department of Mathematical Sciences, Isfahan University of Technology, Isfahan, Iran (e-mail:
[email protected]). T. A. Gulliver is with the Department of Electrical and Computer Engineering University of Victoria, PO Box 3055, STN CSC Victoria, BC V8W 3P6, Canada (e-mail:
[email protected]). Communicated by S. A. Savari, Associate Editor for Source Coding. Color versions of Figs. 2, 3, and 5–8 in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TIT.2006.890782
Fig. 1. The best codes in terms of knowledge.
denoted by U(N). The codeword length for the i-th symbol is ⌊log N⌋ if i ≤ 2^{⌈log N⌉} − N, and ⌈log N⌉ otherwise.
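As a concrete illustration of this definition, the following sketch (the function name is ours, and the snippet only restates the definition above) returns the Uniform code lengths for an alphabet of size N, shorter codewords first.

```python
from math import ceil, log2

def uniform_code_lengths(N: int) -> list:
    """Codeword lengths of U(N), the Huffman code of the uniform distribution
    on N symbols: 2**ceil(log2(N)) - N codewords of length floor(log2(N)),
    and the remaining codewords of length ceil(log2(N))."""
    k = ceil(log2(N))
    short = 2 ** k - N                 # number of shorter codewords
    return [k - 1] * short + [k] * (N - short)

print(uniform_code_lengths(5))         # [2, 2, 2, 3, 3]
print(uniform_code_lengths(8))         # [3, 3, 3, 3, 3, 3, 3, 3]
```

For N a power of two the code is a fixed-length code; otherwise the first symbols (the more probable ones for a monotone source) receive the shorter codewords.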
This code may have a degradation in performance of up to bits for some sources. Clearly, a lack of information about the symbol probabilities causes this degradation. Therefore, we can consider the following two extreme cases in designing a code in terms of our knowledge: • When we know the exact values of the symbol probabilities, we have complete knowledge and the best choice is the Huffman code. • When our information is restricted to just the alphabet size (i.e., ), we use the Uniform code. These are illustrated in Fig. 1. As one can see there is a large gap in knowledge between the two cases where something more than just the number of symbols is known, but less than all symbol probabilities. What may be known is that the source belongs to a specific class of sources such as Bernoulli sources, monotone sources, Markov sources, etc., [16]. For example, in universal data compression techniques such as LZ77 [33] and its variants, it is (implicitly) assumed that shorter match lengths are more probable, which implies dealing with monotone sources. In particular, this paper considers the case where the order of the symbol probabilities is known rather than their exact values (Fig. 1). This means designing a code for the class of memoryless monotone sources with a finite number of symbols. The question is how to include the available information in the code design to obtain good (or guaranteed) performance over the class of sources being considered. The first step in the design of an appropriate code is to define the design criterion. Elias [7] considered the problem of finding a code for monotone sources with countable symbols. He defined an asymptotic optimality criterion for which the ratio of average codeword
length to entropy must tend to one as the entropy increases to infinity. Subsequently, some researchers have considered this criterion, and some asymptotically optimal codes have been obtained [30], [31]. In particular, more recently Yamamoto [32] constructed a recursive universal representation of the integers for which the codeword length for almost all sufficiently large positive integers is shorter than previous known techniques. However, these codes were designed for an infinite number of symbols, and their satisfactory performance is guaranteed only as the entropy tends to infinity, which is not the case for finite sources. Thus, they provide no guarantee of optimality for finite sources (sources with a finite number of symbols), and it is likely that better performance can be obtained by considering the finiteness of the alphabet in the criterion. Here we consider the minimum average criterion, which was mentioned as a special case by Gilbert [9], and was well defined by Cover [3]. With this criterion, average redundancy (or average codeword length) over all sources in the class must be minimized, and consequently, a distribution function on the class of sources is needed. Cover considered a Dirichlet function as the probability distribution over the class of all possible symbols and explained Gilbert’s scheme [9] sources with from the point of view of this criterion [3]. Most papers on designing codes for a given class of sources (in particular for finite monotone sources), use the Minimax criterion [25], [26], [28], [5], [12]. In this case, the maximum of the cost function over all of the sources is minimized. The minimum average criterion is very different since the Minimax criterion improves the worst case performance, while the minimum average criterion improves the average performance. Smith [28] used the average codeword length as the cost function for the class of sources with known upper and lower bounds on each symbol probability. Rissanen [25] considered the Minimax criterion for the ratio of average codeword length to entropy and assumed an upper bound on the largest symbol probability. He derived the optimum real codeword lengths, that is, codeword lengths that would be optimal if the integer constraint on codeword lengths is removed. Ryabko [26] and Davisson and Leon-Garcia [5], presented a probability distribution whose Shannon code [4] is a suboptimal code for the Minimax criterion using the difference of average codeword length and entropy as the redundancy function. They showed that this code has near optimal performance. This code was used to design a suboptimal key for taxons [15]. Kazakos [12] minimized the largest average codeword length using a game-theoretic formulation. The result was a code designed for the source with the highest entropy. Since the uniform distribution belongs to the class of monotone sources, his conclusion was to use a Uniform code for these sources. In this paper, we use the minimum average criterion to design a code for the class of memoryless monotone sources. In Section II, the minimum average and Minimax criteria are explained in detail. Since integral (in the expectation) and summation (in the average codeword length) are linear operations, the problem of finding the optimum code reduces to finding the Huffman code for the average probabilities. Using a theorem on the center of gravity of uniformly distributed simplexes, average probabilities are derived for equiprobable monotone sources in
Section III. Some properties of the entropy of the average distribution are derived in Section IV. The average entropy of monotone sources is studied in Section V. In Section VI, we introduce a fast and efficient encoding scheme. In Section VII, we compare the consequences of the two different criteria, i.e., minimum average and Minimax. Finally, some conclusions are presented in Section VIII.
In this paper we use the following notation:
• The set of real numbers: ℝ.
• The set of natural numbers: ℕ.
• The logarithmic constant log e.
• The Euler–Mascheroni constant: γ ≈ 0.5772.
• The value of the convergent series defined in Theorem 15.
• Number of symbols or alphabet size: N.
• Entropy of a distribution P: H(P).
• Informational divergence of two probability distributions P and Q: D(P‖Q).
• Logarithm in base 2: log.
• Natural logarithm: ln.
• Expectation of a random variable with respect to a probability distribution function.
• Cardinality (the number of elements) of a set.
• Volume of a simplex.
• Binary representation of an integer with a given number of bits.

II. THE MAIN PROBLEM

Define each memoryless information source with N symbols by a vector P = (p_1, p_2, ..., p_N), where p_i denotes the probability of the i-th symbol appearing at the output. Also, define the class of monotone sources as all sources with N symbols and ranked probabilities
p_1 ≥ p_2 ≥ ... ≥ p_N ≥ 0,  Σ_{i=1}^{N} p_i = 1.
The problem is then to find a unique code for all the sources in this class. Since designing the codewords is straightforward if the codeword lengths are known, we only consider the problem of determining the codeword lengths. A given code for a source with N symbols is denoted by its length vector L = (l_1, ..., l_N), where l_i is the codeword length for the i-th symbol. To make the average codeword length small, shorter codewords should be assigned to more probable symbols, i.e., we should, without loss of generality, have l_i ≤ l_{i+1} for i = 1, ..., N − 1. For unique decodability, the codeword lengths should satisfy the Kraft inequality Σ_{i=1}^{N} 2^{-l_i} ≤ 1 [4]. However, in order to minimize the average codeword length as much as possible, we
only consider compact codes, that is, codes whose lengths satisfy the Kraft inequality with equality (it is desirable to find good codes in this set).
An exhaustive search over the set of compact codes requires generating all compact codes in the first stage and evaluating the performance of each code in the second stage. Exhaustive search is an intractable task for large N even when the second stage is feasible, because the number of possible compact codes [9], [23] increases exponentially with N [8], [14].
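This growth can be observed directly by brute-force enumeration for small N; the sketch below (with names of our choosing, and not the tree-based algorithm of [13]) lists all nondecreasing length vectors satisfying the Kraft equality.

```python
from fractions import Fraction

def compact_codes(N):
    """All compact codes with N codewords: nondecreasing length vectors
    (l_1, ..., l_N) satisfying the Kraft equality sum_i 2^{-l_i} = 1."""
    results = []

    def extend(prefix, remaining):
        n_left = N - len(prefix)
        if n_left == 0 or remaining <= 0:
            if n_left == 0 and remaining == 0:
                results.append(tuple(prefix))
            return
        l = prefix[-1] if prefix else 1
        # Each remaining codeword contributes at most 2^{-l}, so the search can
        # stop as soon as n_left * 2^{-l} falls below the remaining Kraft mass.
        while n_left * Fraction(1, 2 ** l) >= remaining:
            if Fraction(1, 2 ** l) <= remaining:
                extend(prefix + [l], remaining - Fraction(1, 2 ** l))
            l += 1

    extend([], Fraction(1))
    return results

for N in range(2, 9):
    print(N, len(compact_codes(N)))    # 1, 1, 2, 3, 5, 9, 16 codes for N = 2, ..., 8
```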
It should be noted that the entropy is not always achievable without extending the source. Therefore, if a code is to be used for nonextended sources, it is more reasonable to employ the minimum achievable average codeword length , in the code design. In other words, it is better to use
than ; or the performance of a code should be compared with the performance of the Huffman and lead to code not the entropy. Generally, different codes. Example: Suppose that we want to design a Minimax code for the two distributions and
Thus, although there is an efficient algorithm for generating all binary compact codes with codewords [13], an efficient means of code selection (rather than exhaustive search) must be employed for large . We next consider two criteria to select one of these codes. A. The Minimax Criterion The Minimax criterion was considered by Smith [28] in 1974. For the class of monotone sources, it was independently used by Rissanen [25] in 1978, Ryabko [26], [27] in 1979, and Davisson and Leon-Garcia [5] in 1980. Defining the redundancy of a code for a distribution as
the Minimax criterion results in a code whose maximum is minimum among all , redundancy over all that is
In other words, the goal with this code is to improve and guarantee the performance in the worst case (maximum redundancy).
The optimum code for the Minimax criterion with redundancy function is and with is , where , , and . This shows that by selecting we save bits for while incurring a loss of 0.25 bit for . However, computing the entropy of a distribution is much easier than computing the average codeword length of its is more tractable. ConHuffman code; therefore, is preferable when the number of sources versely, is finite and computing the corresponding Huffman codes is feasible. Rissanen [25] used the Minimax criterion for the redundancy function
He assumed the additional constraint for monotone sources and derived the optimum real codeword lengths as if if where
is the real solution to
Definition 2: The Minimax code [26] is the Shannon code [4] for the Minimax distribution defined by
q_i^(N) = 1/(i c_N),  where c_N = Σ_{j=1}^{N} 1/j   (1)
Hence, the codeword length for the i-th symbol is given by
l_i = ⌈log(i c_N)⌉   (2)
for i = 1, ..., N.
It is shown in [5], [26] that
With the Minimax criterion, a code is designed based on the worst case, so the worst case performance may be improved at the cost of degrading the performance in other cases. Thus, using the Minimax criterion for evaluating the performance of a code is questionable if the worst case rarely occurs. In such cases the minimum average criterion is more reasonable. B. The Minimum Average Criterion
and, consequently, the Minimax code is a suboptimal solution of this minimax problem.
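For reference, a small sketch of the Minimax codeword lengths of (2) for the distribution (1); the function name is ours and the snippet is only an illustration of the construction of [5], [26].

```python
from math import ceil, log2

def minimax_code_lengths(N: int) -> list:
    """Shannon codeword lengths for the Minimax distribution q_i = 1/(i * c_N),
    with c_N the N-th harmonic number, as in (1)-(2): l_i = ceil(log2(i * c_N))."""
    c_N = sum(1.0 / j for j in range(1, N + 1))    # c_N = sum_{j=1}^{N} 1/j
    return [ceil(log2(i * c_N)) for i in range(1, N + 1)]

lengths = minimax_code_lengths(26)
print(lengths)
print(sum(2.0 ** -l for l in lengths) <= 1.0)      # the Kraft inequality holds
```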
Gilbert [9] and Cover [3] both noted that one can treat as a random vector. Accordingly, we may conon sider a probability distribution function
the class of sources. Using this distribution function, we can defor each as the expectation of , that is fine
different codes may result. In addition, this criterion can be used for an arbitrary class of sources on which a distribution function is predefined.
Thus, the expectation of R(L, P) is the average redundancy over the class of sources. With the minimum average criterion (hereafter called the Minave criterion), the codeword lengths of the optimal code (hereafter called the Minave code) for the class of monotone sources are those which provide the minimum value of the average redundancy, i.e.,
Definition 3: Point of gravity of an -dimensional simplex , where function
Since is a linear combination of , is actually an -dimensional simplex in and -dimensional integration on vanishes to zero. Thus, the expectation of must be computed by -dimensional defined by integration on the simplex
(3) In addition, as functions of facts one can write
and variables
must be considered . Based on these
Since have
is the center with mass density
is a probability density function, we
and therefore
This shows that we should determine the center of gravity of . C. Minimax Versus Minave The differences between the Minimax and Minave criteria are summarized as follows. • With the Minimax criterion, we attempt to improve the worst case performance (maximum redundancy), whereas with Minave the overall performance is the major concern. • From the Minimax perspective, all sources carry the same importance. In other words, a source is either in the set of sources or is not. With the Minave criterion, we can assign a degree of importance to each source and consider this in the code design.
E_F[R(L, P)] = Σ_{i=1}^{N} l_i p̄_i − H̄   (4)
where p̄_i = E_F[p_i], i = 1, ..., N, and H̄ = E_F[H(P)], the average entropy, is a finite constant. Consequently, the Minave codeword lengths are obtained by minimizing
Σ_{i=1}^{N} l_i p̄_i over all compact codes L.   (5)
It follows from (5) that the Minave codeword lengths are exactly the Huffman codeword lengths for the average distribution P̄^(N) = (p̄_1, ..., p̄_N). Therefore, finding the Minave code reduces to computing P̄^(N), for which the Huffman code can be obtained efficiently [20], [21]. Clearly, from the right-hand side of (5), we can interpret the optimal code that minimizes the average redundancy as the optimal code which minimizes the average codeword length. Note that the optimal code depends directly on P̄^(N), and for different distribution functions,
• The Minimax criterion implicitly assumes that a single source selected from the set of sources generates all symbols. With the Minave criterion, we may assume that different sources generate the symbols. • With the Minimax criterion, the optimum code may differ , between the cost functions ( and ), while it is unchanged with Minave. • It can be difficult to derive an analytic expression for the optimum code with the Minimax criterion. In particular, when the number of sources is finite, an exhaustive search may be required. Conversely, deriving the optimum code with the Minave criterion only requires the computation of the average distribution and the application of the Huffman algorithm. Moreover, the evaluation of an arbitrary code with the Minave criterion is much easier than with the Minimax. Depending on the requirements of a particular application, one or the other of these criteria may be preferred. Keeping this in mind, we are motivated to determine how much the resulting codes differ. The Minimax code for finite monotone sources has
been considered previously [5], [26]. In this paper, we consider the Minave code for these sources.
and noting that
Using
(8) III. THE MINAVE CODE FOR EQUIPROBABLE MONOTONE SOURCES
(which can be easily proved by induction), one may write
It is reasonable to assume the uniform distribution (6) on the sources when the only information available about the sources under consideration is their monotonicity. In order to where is compute the center of gravity of a constant, we use the following theorem. Theorem 1: Let be a uniformly distributed sim) with vertices plex (i.e., . Then the center of gravity of is the average of its vertices, that is
(9)
Definition 4: Assume a uniform probability distribution function on the class of monotone sources with N symbols. In this case, the average probability distribution P̄^(N) = (p̄_1^(N), ..., p̄_N^(N)) is given by
p̄_i^(N) = (1/N) Σ_{j=i}^{N} (1/j),  i = 1, ..., N.   (10)
The Huffman code of P̄^(N), called M(N), which is optimum in the average sense, has codeword lengths l_1^{M(N)}, ..., l_N^{M(N)}.
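The average distribution (10) and its Huffman code M(N) are straightforward to compute; the sketch below (function names ours) does both, using a standard merge-based Huffman length computation rather than the space-efficient algorithms of [20], [21].

```python
import heapq
from math import log2

def average_distribution(N):
    """Average probability of the i-th most probable symbol, eq. (10):
       p_i = (1/N) * (1/i + 1/(i+1) + ... + 1/N)."""
    p, tail = [0.0] * N, 0.0
    for i in range(N, 0, -1):
        tail += 1.0 / i
        p[i - 1] = tail / N
    return p

def huffman_lengths(p):
    """Codeword lengths of a binary Huffman code for the probabilities p."""
    # Heap entries: (probability, tie-breaker, symbol indices in the subtree).
    heap = [(pi, i, [i]) for i, pi in enumerate(p)]
    heapq.heapify(heap)
    lengths = [0] * len(p)
    counter = len(p)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:          # every merge adds one bit to all leaves below it
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

N = 16
p_bar = average_distribution(N)
print([round(pi, 4) for pi in p_bar])
print(huffman_lengths(p_bar))          # codeword lengths of the Minave code M(N)
```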
(The proof is given in the Appendix.) Remark: If we define the center of gravity of located at masses
individual as In fact, can be interpreted as the average probability that the th symbol appears. It follows from (10) that
then we can say from Theorem 1 that the center of gravity of a uniformly distributed simplex is the same as the center of equal individual masses located at its vertices gravity of . It can be easily shown that the vertices of form of
for
[12]. Noting that
are in the .. .
iff
the vertices of are for Hence we can use Theorem 1 to compute the ’s, that is
.. .. . .
.. .
.. .
.. .
It may appear that the above probability distribution is not so uniform, but we will show that it is. As the inequality
. holds for any nonincreasing integrable real function de, we can set and conclude that fined for (7)
(11)
Fig. 2. A comparison of the probability distribution of English letters and P̄^(26).
Since
, we may also write
TABLE I ENTROPY AND AVERAGE CODEWORD LENGTHS FOR THE M (26), U (26), F (26), AND HUFFMAN CODES FOR ENGLISH, FRENCH, GERMAN, AND SPANISH
(12) One can also use the following simple approximation to compute the average probabilities: (13) which improves in accuracy as increases. Another approximation for the average probability is
The latter expression has a somewhat more complex form, but its performance is very good. In particular, the Huffman codes and differ only for 12 values in the range of (i.e., ). Example: As an example, we applied the code to the probability distribution of English letters [24]. As shown in Fig. 2, the probability distribution of English letters is very . Therefore, one could do very well even if only close to the ranking of the English letters were known, in term of their applies to real-world examprobabilities. So we find that ples. Spanish, French, and German are other languages with 26 letters and distributions similar to English. Table I gives the average codeword length of the Huffman code for each language, , , and codes. For these as well as for the code is the best, followed by the distributions, the code. Note that the code has the worst performance in all cases.
A. Sources With More Uncertainty

Until now, we have considered the design of a code based on the full probability ranking p_1 ≥ p_2 ≥ ... ≥ p_N. If our information about the probability ranking is deficient, we can consider all possible cases, use (10) for each case, and average over these cases to obtain the average probability distribution. Obviously, the Huffman code for the resulting average distribution is optimum in the Minave sense.
Example: Consider a source with five symbols and the inequalities p_1 ≥ p_2 ≥ p_3 ≥ p_4 and p_3 ≥ p_5, so that only the order of p_4 and p_5 is unknown. Hence, we have two possible cases, p_4 ≥ p_5 and p_5 ≥ p_4. Assuming that the two cases are equiprobable, we obtain an averaged distribution in which the last two symbols each receive the average of p̄_4^(5) and p̄_5^(5) (see the sketch below).
Therefore, if we were certain that p_4 ≥ p_5, we would use the distribution P̄^(5), but if we were in doubt as to which of p_4 and p_5 is greater, we should use the averaged distribution.
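The averaging over candidate rankings can be carried out mechanically; the sketch below (names ours) reproduces the five-symbol example, where only the order of the last two symbols is unknown.

```python
def average_distribution(N):
    """Average monotone distribution of (10): p_i = (1/N) * sum_{j=i}^{N} 1/j."""
    p, tail = [0.0] * N, 0.0
    for i in range(N, 0, -1):
        tail += 1.0 / i
        p[i - 1] = tail / N
    return p

def averaged_over_rankings(N, rankings):
    """Average of the distribution (10) over a set of candidate rankings.

    Each ranking lists the symbols in (assumed) decreasing order of probability;
    all rankings are taken to be equally likely."""
    p_bar = average_distribution(N)
    q = [0.0] * N
    for ranking in rankings:
        for pos, symbol in enumerate(ranking):
            q[symbol] += p_bar[pos] / len(rankings)
    return q

# The five-symbol example: the order of the last two symbols is unknown.
cases = [(0, 1, 2, 3, 4), (0, 1, 2, 4, 3)]
print([round(qi, 4) for qi in averaged_over_rankings(5, cases)])
```

The first three probabilities are unchanged, while the last two symbols share the average of p̄_4^(5) and p̄_5^(5), which is the more uniform distribution described in the text.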
This lack of information about the probability ranking leads to a more uniform distribution. In the extreme case, when we know nothing about the ranking of the probabilities, we should use a uniform distribution, as expected. In other words, code is the Minave code for the class of sources with symbols.
and for The inequality
for
implies
IV. ENTROPY OF THE AVERAGE DISTRIBUTION

In this section, the behavior of the entropy of P̄^(N) (as the fundamental information-theoretic quantity of a distribution) is studied. We prove that the entropy and the average Huffman codeword length of P̄^(N) (i.e., H(P̄^(N)) and the average length of M(N)) are nondecreasing functions of N, and that the difference log N − H(P̄^(N)) tends to the constant (1 − γ) log e (where γ is the Euler–Mascheroni constant, γ ≈ 0.5772).
Theorem 2: H(P̄^(N)) and the average codeword length of M(N) are nondecreasing functions of N, that is
In addition, one can see that for . From have we write
for
. As a result, we and (11)
, which can be rearranged as
are nondecreasing Noting that the difference of the upper and lower bounds in the . Clearly, above inequality is less than , we have (14) for satisfies (14) as well.
and
Before proving this theorem we must prove the following lemma. Lemma 1: Defining such that integer
, for integer
Theorem 3: Abrahams–Lipman [1] Consider two distribuand with tions and . If Hu and Tucker’s condi, we have tion [10] holds, that is, for some , for for
there exists an then
for for The value of
is given by or (14) we have
Proof: From (10) for
Lemma 1 and Theorem 3 can be used to prove Theorem 2. According to Theorem 2, including an extra symbol in the alphabet causes an increase in the entropy and average codeword length. Therefore, one cannot assume a lower expectation of average , by adding symbols. codeword length, i.e., Note that, although it intuitively seems that a larger entropy is associated with a larger average Huffman codeword length, this is not true in general [1]. However, Lemma 1 and Theorem 3 and show that this is true for the underlying special cases ( ). , we have Since and consequently so that (15)
This shows that iff and . On the other hand, we have for . Therefore, implies the existence of an which satisfies for
iff and
Now, noting that gives
[4], and using (12) and (15) (16)
Equation (16) implies that . However, Theorem 4 shows that actually gets very close to .
and
Theorem 4: The difference log N − H(P̄^(N)) tends to the constant value (1 − γ) log e ≈ 0.61 as N increases, that is
over the intervals known property
and
. Hence, the implies that
(17) Proof: Let lower bounds on that
. Using the upper and , i.e., inequalities (11) and (12), and noting
(19) is the differential entropy of the probability In fact, , . Following similar density function steps, we can derive an upper bound on
for for we can write (20)
Using (12) gives
which implies gument on (20) results in
. Using a similar ar-
which together with (19) completes the proof.
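Theorem 4 is easy to check numerically. The sketch below computes log N − H(P̄^(N)) for increasing N and compares it with (1 − γ) log e ≈ 0.610, our reading of the limit in (17), consistent with the ≈ 0.61 value quoted in the abstract.

```python
from math import log2, e

GAMMA = 0.5772156649015329            # Euler-Mascheroni constant

def entropy_of_average(N):
    """H(P) for the average distribution p_i = (1/N) * sum_{j=i}^{N} 1/j."""
    tail, H = 0.0, 0.0
    for i in range(N, 0, -1):
        tail += 1.0 / i
        p = tail / N
        H -= p * log2(p)
    return H

for N in (2 ** 6, 2 ** 10, 2 ** 14, 2 ** 18):
    print(N, round(log2(N) - entropy_of_average(N), 4))
print("limit (1 - gamma) * log2(e) =", round((1 - GAMMA) * log2(e), 4))
```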
V. AVERAGE ENTROPY OF MONOTONE SOURCES In this section, we study the average entropy of equiprobable symbols (i.e., where monotone sources with ), denoted by . As the entropy is a concave function of distribution, we have . Here, we intend to compare with in more detail. , we first define the canonical In order to formulate in as simplex
for (18)
It can be shown that the last sum in (18) tends to as (see Lemma 4 in the Appendix). The two other sums tend to the Riemann integral of the continuous function
(21)
The importance of this definition is that the mean of some speand over cial symmetric functions (such as entropy) over are equal (Lemma 5 in the Appendix). Moreover, evalumight be simpler ating the mean of such functions over . The following theorem gives an exact formula for than over in terms of .
Fig. 3. The difference between H(P̄^(N)) and H̄(P^(N)).
Theorem 5: For an arbitrary positive integer
we have
Combining (22) with (23) gives
(22)
(24)
Proof: Let . Assuming equiprobability of the monotone sources, we write
Theorem 6: Assuming equiprobability of monotone distributions, the average entropy tends to be equal to the entropy of the average distribution as increases, that is (25)
Therefore,
Proof: Noting (24) and (17), the proof is clear.
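Theorems 5 and 6 can also be checked by Monte Carlo sampling. By the symmetry argument of Lemma 5, the mean entropy over the monotone simplex equals the mean entropy over the full simplex, so it suffices to sample probability vectors uniformly from the simplex (normalized exponentials) and sort them. The closed form below reads (22) as H̄(P^(N)) = (log e) Σ_{j=2}^{N} 1/j, the classical mean entropy of a uniformly distributed probability vector; this reading is ours.

```python
import random
from math import log2, e

def random_monotone_source(N):
    """Uniform sample from the monotone simplex: normalize exponentials
    (a uniform sample from the full simplex) and sort in decreasing order."""
    x = [random.expovariate(1.0) for _ in range(N)]
    s = sum(x)
    return sorted((v / s for v in x), reverse=True)

def entropy(p):
    return -sum(pi * log2(pi) for pi in p if pi > 0)

N, trials = 64, 20000
avg_H = sum(entropy(random_monotone_source(N)) for _ in range(trials)) / trials
closed_form = log2(e) * sum(1.0 / j for j in range(2, N + 1))   # our reading of (22)
print(round(avg_H, 2), round(closed_form, 2))   # Monte Carlo estimate vs. closed form
```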
is actually the mean of the function
Theorem 7: For an arbitrary positive integer
we have (26)
Proof: Setting gives
over . As satisfies the conditions of Lemma 5 (given in the Appendix), its mean over and over are equal, that is
in the well-known inequality and consequently
and Now, from Lemmas 6 and 7 (given in the Appendix), the mean of over is given by Therefore, is a nondecreasing sequence of and (24) completes the proof. Theorem 8: For an arbitrary positive integer
we have
and the proof is complete. The well-known property of Euler–Mascheroni constant is that (23)
Proof: The proof is clear from orem 7.
and The-
In Fig. 3, the difference is illustrated as on a logarithmic scale. It can be seen that a function of
Fig. 4. Average redundancy of the U(N), M(N), and PU(N) codes.
has its maximum at . Therefore, keeping Theorem 8 in mind, we have the following conjecture. Conjecture 1: For an arbitrary positive integer have
From
we
(28)
and (26) we have
(27) As the difference is actually the informational divergence of the average probability distribution and the uniform distribution, i.e., , the inequality (27) reflects the high uniformity of length code (assigning ceptable performance.
On the other hand, the average codeword length of a Uniform code) is certainly not greater than that of the code (the corresponding fixed-length code for any arbitrary distribution ). Also, the code is the Huffman code (in particular, for and has the minimum possible average codeword length for this distribution. Thus, we have
. Thus, one may expect a fixedbits to each symbol), to have ac-
An analytic investigation of the performance of for is not feasible. The average redundancy of the and codes are shown in Fig. 4. Although, we proved that the average code is less than , Fig. 4 suggests redundancy of the a smaller upper bound. Conjecture 2: The average redundancy of the Uniform code, , is not greater than , that is
Theorem 9: The average redundancy of a fixed-length code is less than 1 + (1 − γ) log e ≈ 1.61 bits/symbol.
Proof: From Theorem 7 we have a lower bound on the average entropy H̄(P^(N)). Therefore, if a fixed-length code with length ⌈log N⌉ is used, the average redundancy is less than a constant ≈ 1.61. This suggests that if 1.61 bits/symbol of average redundancy is tolerable, one can simply use a fixed-length code and take advantage of its benefits, such as simple implementation and the absence of error propagation.
Therefore, knowing the rank of the probabilities (when we use M(N)) allows only a small improvement in the average redundancy with respect to the case of knowing nothing about the symbol probabilities (when we use U(N)).
VI. A PARTIALLY UNIFORM CODE FOR P̄^(N)

As mentioned earlier, M(N) is the best code in the average sense, but for large N, even the most efficient known algorithms require some time and space to determine the Huffman code of P̄^(N) [20]. On the other hand, U(N) is very easy to implement but certainly exceeds M(N) in average redundancy. In this section, we design a partially uniform code (the PU(N) code) which is very
efficient in terms of both performance (Fig. 4) and implementation. A simple generalization of the start–step–stop code [29] provides the following two techniques to design a code for nearly uniform sources, i.e., sources with entropy slightly less than . In both methods, the set is partitioned each containing successive integers. into sets, To represent the integer , , we first encode , the set containing . Then using the codewords given in (29) code with sym(at the bottom of the page) for the with or bols, we indicate among all integers in bits. Note that the codewords given by establish an instantaneous code. The main difference between the two techniques is the partition of and the allocation of bits in representing the ’s. In the first technique, the partitioning is done by the user. For example, one may attempt to partition such that the ’s have to the same cardinality. Then by assigning each and applying the Huffman algorithm to the ’s, one can obtain a good representation for the ’s. In the second approach, the codeword lengths representing the ’s are predefined by the user, and partitioning is done so that the resulting code is appropriate. In particular, if the se, with is to be used for repquence , then the sets should be defined such resenting that
method more convenient we used the following approximation to determine , and
Both of these techniques were applied to for different values of and various codeword lengths for representing the ’s. This led to an efficient and fast method for encoding based on the second technique. With this method, is partiand these sets are encoded tioned into four sets , , , and bits, respectively. with should be determined in such a way Clearly, , , , and that
Solving the nonlinear equation we obtain
(30)
(32)
(33) (34) where (8) and (13) were used in (32) and (33), respectively. Now using (30), (31), and (34), we have
for
Thus we set
Conversely, if we partition using the above values for , , and , we obtain , , , and to represent , , , and , respectively. In fact, the two techniques presented in this section are the same for the nearly and . uniform distribution The codeword for the th symbol is the concatenation of two and defined by parts if if if if
(31) Since the ’s contain successive integers, knowing , , uniquely determines , , , and . To make this and
if if
(35)
(29)
Fig. 5. Average redundancy of the PU(N) code.
TABLE II: PARTIALLY UNIFORM CODE FOR P̄^(1488) (PU(1488))
used when changes frequently and takes on large values. Note that choosing a larger value of improves performance while increasing the code complexity, but this improvement is not conwhich resulted in a code siderable. This is why we chose very similar to the classic start–step–stop code, except that it uses a Uniform code rather than a fixed-length code.
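The two-part construction itself is easy to sketch: a prefix codeword selects the block of consecutive symbol indices, and the Uniform code of (29) indexes the symbol within the block. In the sketch below (names ours), the block sizes and block prefixes are hypothetical placeholders chosen only for illustration; they are not the specific four-set partition derived in this section.

```python
from math import ceil, log2

def uniform_codeword(j, n):
    """Uniform code for index j in {0, ..., n-1}: the first 2**ceil(log2 n) - n
    indices get floor(log2 n) bits, the rest get ceil(log2 n) bits."""
    if n == 1:
        return ""                      # a single element needs no extra bits
    k = ceil(log2(n))
    short = 2 ** k - n                 # number of (k-1)-bit codewords
    if j < short:
        return format(j, "b").zfill(k - 1)
    return format(short * 2 + (j - short), "b").zfill(k)

def pu_encode(i, blocks, block_codewords):
    """Two-part 'partially uniform' encoding of symbol index i (0-based): a
    prefix codeword for the block containing i, followed by the Uniform
    codeword of i's position inside that block."""
    start = 0
    for size, prefix in zip(blocks, block_codewords):
        if i < start + size:
            return prefix + uniform_codeword(i - start, size)
        start += size
    raise ValueError("symbol index out of range")

# Hypothetical example: N = 100 split into four blocks of sizes 5, 15, 30, 50,
# with block prefixes of lengths 1, 2, 3, 3 (shorter prefixes for the more
# probable leading blocks).
blocks = [5, 15, 30, 50]
prefixes = ["0", "10", "110", "111"]
print(pu_encode(0, blocks, prefixes), pu_encode(99, blocks, prefixes))
```

Encoding and decoding reduce to a small lookup of block boundaries, which is the simplicity argued for above.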
VII. COMPARISON OF THE CONSEQUENCES OF THE MINAVE AND MINIMAX CRITERIA In this section, we examine the consequences of the Minave and and Minimax criteria in terms of the resulting codes ( ) and distributions ( and ).
and if
(36)
This encoding scheme is very simple because we can construct a lookup table for the eight intervals according to (29), (35), and (36). After locating in one of these intervals, the codeword can easily be produced. Decoding will also be trivial if such a simple look-up table is used. , we constructed the Example: For code given in Table II. From this table, the codeword length for is and for is . In other words, in the code, more probable symbols may have longer codewords. However, because of the high uniformity of , the proposed method is efficient. the distribution Fig. 5 shows the average redundancy of the code for up to . The average redundancy is negligible, in particular for it does not exceed . In addialways has better performance tion, Fig. 4 shows that , especially for . For large , the average than code is at least 0.25 bits less than redundancy of the that of . Thus, the proposed code has near-optimum performance, and can be simply implemented. It can be efficiently
A. Codeword Lengths It is very difficult to characterize the Huffman length for arbitrary and , , but we observed that the following conjecture holds for . Conjecture 3: For
we have (37)
This conjecture may seem to suggest that the code is . Although it is true for a generalized Shannon code [6] of most values of , it is not always the case because (37) does not necessarily hold for . Instead, we have and
For instance,
and
but
. There are two other points to be mentioned concerning this conjecture. The first is that these bounds are very tight, be-
Fig. 6. Performance of the and ! codes.
cause
is equal to
or . As an ex-
For
, one can write
, we have and . The second point is that (37) does not hold for a general distribution. For example, it does not hold for the distriwith Huffman lengths bution . Beyond this conjecture, we use the following theorem to characterize analytically. ample, for
Theorem 10: Montgomery and Kumar [22] If the largest probability of a distribution lies in the interval
then the smallest Huffman codeword length is equal to .
or
In other words, this theorem states that for an arbitrary distribution, the smallest Huffman codeword length is given by
which implies that at most one integer lies in the interval
or (38) where
denotes the largest probability.
Lemma 2: For alphabet size
Noting this point and using (38) and (39) completes the proof. The immediate corollary of Lemma 2 is the following theorem.
we have
Theorem 11: The shortest codeword length of to infinity as increases, that is
tends
where . Proof: From (12) we have
(39)
However, in representations of the integers such as the and codes, the shortest codeword has a fixed and finite length. and the average codeFig. 6 shows the average entropy word length of the asymptotically optimal codes and , when
TABLE III: THE SMALLEST AND LARGEST CODEWORD LENGTHS OF M(N), AND log(1 + ·), FOR N = 2^k
applied to . This figure suggests unbounded average redundancy for both codes, that is
Proposition 1: If (37) holds for sufficiently large , then
and
(43)
(40) tends to
Theorem 12: The shortest codeword length of infinity as increases, that is
one may write that
and conclude from (37) (44)
(41) where
Proof: From (2) we have
. Hence,
Proof: From (10) we have
Noting that is a compact code and and using Lemma 2 completes the proof.
,
To show that this proposition and its assumption are plausible, and . This consider Table III, which compares , table suggests that for
Also Lemma 8 (given in the Appendix) implies that and (42) which agrees with (43) as
and consequently
The divergency of the harmonic series proof. Theorem 13: For code
completes the
we have
Proof: From (2) we can write
and
Noting that
we have
In addition, the last row of Table III verifies Lemma 2 for , . B. Distributions Depending on which one of the two criteria, Minave or Minor imax, is considered, we must deal with two distributions . The natural question that arises is how different are these two distributions? We answer this question in this subsection. To some extent, the answer reveals how different the two criteria are. and is evident The intrinsic difference between from Fig. 7. In order to examine the difference for arbitrary , . we consider the informational divergence Lemma 3: For any alphabet size
As the right-hand side of this inequality tends to infinity as increases, the proof is complete. Therefore, the shortest codeword length of the Minimax code increases to infinity as does that of the Minave code, but the growth rate is slow so that . In contrast, we conjecture that this ratio tends to a constant for the code.
we have
Proof: According to (1) and (10) we can write
Fig. 7. Comparison of the average distribution P̄^(N) and the Minimax distribution.
Simulation shows that increases with (Fig.8). The following theorem indicates how fast it tends to infinity. Theorem 14: For the informational divergence of , i.e., , we have
where Now, using the fact that completes the proof:
in the following
and
denotes the value of the convergent series .
Proof: It is well known that consequently
and (45)
We show (Theorem 15 in the Appendix) that the infinite series
converges to a constant denoted by . Thus, from (23) and , one may conclude that and consequently (46) Now from Lemma 3 we have
which together with (17), (46), and (45) proves the theorem.
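Theorem 14 can be examined numerically. The sketch below computes the informational divergence between the average distribution (10) and the Minimax distribution (1), with q_i = 1/(i c_N) as in our reading of (1), and prints it next to log log N; the slow, log log-type growth described by the theorem is visible as the near-constant gap in the last column.

```python
from math import log2

def average_distribution(N):
    """p_i = (1/N) * sum_{j=i}^{N} 1/j, the average distribution (10)."""
    p, tail = [0.0] * N, 0.0
    for i in range(N, 0, -1):
        tail += 1.0 / i
        p[i - 1] = tail / N
    return p

def minimax_distribution(N):
    """q_i = 1/(i * c_N) with c_N the N-th harmonic number, as in (1)."""
    c_N = sum(1.0 / j for j in range(1, N + 1))
    return [1.0 / (i * c_N) for i in range(1, N + 1)]

def divergence(p, q):
    """Informational divergence D(p || q) in bits."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

for N in (2 ** 8, 2 ** 12, 2 ** 16):
    d = divergence(average_distribution(N), minimax_distribution(N))
    # Theorem 14: the divergence grows like log log N plus a constant.
    print(N, round(d, 3), round(d - log2(log2(N)), 3))
```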
Fig. 8. Informational divergence of the average distribution and the Minimax distribution.
A direct consequence of Theorem 14 is that the informational divergence of the average and Minimax distributions grows without bound as N increases.
The nonzero informational divergence of plies the nonzero redundancy of for the inequality
and imwhich satisfies
(47) This implies that
and
Thus,
Also, for the average redundancy one can write
and
are quite different for large
.
Remark: The Minimax and Minave criteria result in th probabilities and , which can be proportionally approximated by (the Zipf distribution) [26] and , respectively. Therefore, the th average probability is approximately proportional to the sum of . In the th to th Minimax probabilities, i.e., other words, the th Minimax probability is approximately proth average probportional to the difference of the th and . Interestingly, there is abilities, i.e., a meaningful difference between these distributions in the conis assotext of linguistics. The Minimax probability ciated with Zipf’s law, which models the probability of the th most frequent English word [19], while the average probability models the probability of the th most frequent English letter (Fig. 2). C. Average Redundancy For a fixed , the best code from the Minave point of view is , whereas is suboptimal with the Minimax criterion. The question is then how performs with the Minave criterion.
(48) Example: For
, we have and
Depending on the situation, this 1.239 bits/symbol degradation in the average performance may be tolerable. Therefore, for and have acceptable persmall values of we may use formance from both the Minave and Minimax points of view. for large . Now consider the average redundancy of Clearly, (25), (47), and (48) imply unbounded average redun. According to (40), dancy for large which grows as unbounded average redundancy for large also occurs for the asymptotically optimal codes, and . By contrast, according to Theorem 9 and (28), the average redundancy of the Minave,
Uniform and fixed-length codes is less than for all alphabet sizes . In the context of universal codes [7], where the redundancy tends to infinity, the ratio of the redundancy to entropy is a measure of the goodness of a code, i.e., asymptotic optimality criterion or
Similarly, as the average redundancy of the , , and codes tends to infinity, one may use the ratio of average redundancy to average entropy as a measure of their goodness ). As is an asymptotically optimal code and (i.e.,
small (and negligible for large ), improvement in performance with respect to the case when nothing is known about the symbol probabilities. , the ratio of the largest codeWe demonstrated that for to the smallest codeword length tends to word length as increases. This suggests that for large alphabet sizes, the smallest codeword length should be approximately half that of the largest one. Since universal codes are designed for a countably infinite number of symbols, they cannot satisfy this condition and do not result in bounded average redundancy. Also, for the Minimax code ( code), the ratio tends to infinity, and its average redundancy is approximately for large , which is not bounded. However, the ratio of the average redundancy to average entropy tends to zero for the Minand any asymptotically optimal code, such as imax code or . APPENDIX
we can write
be a uniformly distributed simTheorem 1: Let plex (i.e., ) with vertices . Then the center of gravity of is the average of its vertices, that is and from Theorem 6 we conclude
This is also the case for any other asymptotically optimal code (in particular, ). For the code one may write
Hence, the expected redundancy of all these codes is negligible with respect to the expected entropy and no one is preferable in code, however, is both suboptimal for the this sense. The Minimax criterion and asymptotically optimal for the Minave criterion. VIII. CONCLUSION In this paper, we considered the minimum average criterion to design a code for the class of finite memoryless monotone sources. Assuming equiprobable sources, we derived the avfor which the associated erage probability distribution code) is optimum. We proved the inHuffman code ( and , and that creasing behavior of tends to a constant value . Also we studied the average entropy , compared it with , and proved that . Thus, if an average redundancy of up to 1.61 bits/symbol is tolerable, a fixed-length code can be employed. Otherwise, we can use the , which has a small average fast and efficient code, i.e., redundancy. It is important to note that, from the minimum average point of view, knowing the ranking of the probabilities allows only a
Proof: We use a proof presented by Lasserre and Avrachenkov [17] for a more general problem. There is a well-known formula for integrating a Dirichlet function on the canonical simplex (defined in (21)) [17] (49)
Since is a simplex, for all can write with equivalently
where
and
, we , or
. Also from we can write (50)
Thus we have the following transformation (51) where
is an
matrix
Note that from the volume formula of a simplex we have (52)
Now, using the change of variables and (50) we get (53) is and (54) at the bottom of the page. In this derivation, the Jacobian of the transformation (51). Also, (53) and (54) are obtained from (52) and (49), respectively. Thus, from Definition 3 one may write
or just
.
Lemma 4: We have
Lemma 5: Consider a function which is invariant is symmetric (i.e., the value of under any permutation of its variables) and satisfies
over The mean of equal. Proof: Let the set of all permutations of size . It is clear that permutation transformation
and over
(55) are
and
denote taken from . Each defines an affine where
for Proof: Clearly, holds for Also, implies and consequently one can write for
and
Noting that
which maps (55) implies
onto
. The symmetry of
and property
for any permutation . On the other hand, a comparison and (i.e., (21) and (3)), rebetween the definitions of are a sorted version of the elements veals that the elements of . Therefore, can be partitioned into disjoint simof plexes, ’s, that is
and we have
completes the proof.
(53)
(54)
It can easily be shown that the canonical simplex in (21)) may be rewritten as
(defined
for (56) Using the well-known formula which gives the volume of an -dimensional simplex in terms of the coordinates of its vertices, it is not hard to show that
Thus, we have the equation at the bottom of the page. Defining the new variable , we have
(57) Thus, from (56) we conclude
Following this procedure, the most interior integral becomes and the proof is complete. Lemma 6: For a function
, the mean of which, with the change of variables
over
is equal to
, gives
, where
Proof: Let . The symmetry of with respect to variables the canonical simplex (see (21)) implies that
After have
successive change of variables and integration, we
for Finally, (57) and (58) give
and consequently
(58)
which completes the proof.
Lemma 7: If
and
,
Theorem 15: The infinite series
then
converges to . Proof: Manipulating the left-hand side inequality of (59) gives the following:
Proof: We can use integration by parts and write
Also, it can be easily shown by induction that
Then we can write
and consequently
(60) Noting (42), the existence of the upper bound (60) guarantees the convergence of
which completes the proof. Lemma 8: For
we have (59)
Proof: Let It is not hard to prove for
and that
.
We denote the value of this infinite series by which is approx. Interestingly, the logarithmic constant (i.e., imately ) is the only value of which makes convergent. ACKNOWLEDGMENT
and conclude that
and
for
we have
for
The authors are indebted to Prof. Thomas M. Cover for his constructive suggestions, and especially for simplifying the proof of (10). Also, the authors appreciate the careful evaluation of the paper by the anonymous reviewers. In particular, they thank the reviewers for bringing [6], [8], [14], [20], [21] to their attention, and motivating them to determine (25) and (47). Finally, the authors acknowledge Dr. Farid Bahrami for his help in proving Theorem 15. REFERENCES
and
which proves
. Defining
.
[1] J. Abrahams and M. J. Lipman, “Relative uniformity of sources and the comparison of optimal code costs,” IEEE Trans. Inf. Theory, vol. 39, no. 5, pp. 1695–1697, Sep. 1993. [2] J. L. Bentley and A. C. Yao, “An almost optimal algorithm for unbounded searching,” Inf. Process. Lett., vol. 5, no. 3, pp. 82–87, Aug. 1976.
[3] T. M. Cover, “Admissibility properties of Gilbert’s encoding for unknown source probabilities,” IEEE Trans. Inf. Theory, vol. IT-18, no. 1, pp. 216–217, Jan. 1972. [4] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: Wiley, 1991. [5] L. D. Davisson and A. Leon-Garcia, “A source matching approach to finding minimax codes,” IEEE Trans. Inf. Theory, vol. IT-26, no. 2, pp. 166–174, Mar. 1980. [6] M. Drmota and W. Szpankowski, “Precise minimax redundancy and regret,” IEEE Trans. Inf. Theory, vol. 50, no. 11, pp. 2686–2707, Nov. 2004. [7] P. Elias, “Universal codeword sets and representations of the integers,” IEEE Trans. Inf. Theory, vol. IT-21, no. 2, pp. 194–203, Mar. 1975. [8] P. Flajolet and H. Prodinger, “Level number sequences for trees,” Discr. Math., vol. 65, pp. 149–156, 1987. [9] E. N. Gilbert, “Codes based on inaccurate source probabilities,” IEEE Trans. Inf. Theory, vol. IT-17, no. 3, pp. 304–314, May 1971. [10] T. C. Hu and A. C. Tucker, “Optimal binary search trees,” in Proc. 2nd Chapel Hill Conf. Combinat. Math. and Applic., Chapel Hill, NC, May 1970, pp. 285–305. [11] D. A. Huffman, “A method for the construction of minimum-redundancy codes,” Proc. IRE, vol. 40, no. 9, pp. 1098–1101, Sep. 1952. [12] D. Kazakos, “Robust noiseless source coding through a game theoretic approach,” IEEE Trans. Inf. Theory, vol. IT-29, no. 4, pp. 576–583, Jul. 1983. [13] M. Khosravifard, M. Esmaeili, H. Saidi, and T. A. Gulliver, “A tree based algorithm for generating all possible binary compact codes codewords,” IEICE Trans. Fundam., vol. E86-A, no. 10, pp. with 2510–2516, Oct. 2003. [14] J. Komlos, W. Moser, and T. Nemetz, “On the asymptotic number of prefix codes,” Mitteilungen aus dem Math. Seminar Giessen, pp. 35–48, 1984. [15] R. E. Krichevskii, B. Y. Ryabko, and A. Y. Haritonov, “Optimal key for taxons ordered in accordance with their frequencies,” Discr. Appl. Math., vol. 3, no. 1, pp. 67–72, 1981. [16] R. Krichevsky, Universal Compression and Retrieval. Norwell, MA: Kluwer Academic, 1994. [17] J. B. Lasserre and K. E. Avrachenkov, “The multi-dimensional version of ,” Math. Assoc. Amer. Monthly, vol. 108, pp. 151–154, Feb. 2001. [18] V. I. Levenstein, “On the redundancy and delay of decodable coding of natural numbers,” Probl. Cybern., vol. 20, pp. 173–179, 1968.
[19] W. Li, “Zipf’s law everywhere,” Glottometrics, vol. 5, pp. 14–21, 2002. [20] R. L. Milidiú, A. A. Pessoa, and E. S. Laber, “Three space-economical algorithms for calculating minimum-redundancy prefix codes,” IEEE Trans. Inf. Theory, vol. 47, no. 6, pp. 2185–2198, Sep. 2001. [21] A. Moffat and J. Katajainen, “In-place calculation of minimum-redundancy codes,” in Algorithms and Data Structures, 4th Int. Workshop (Lecture Notes in Computer Science), S. G. Akl, F. H. A. Dehne, J. Sack, and N. Santoro, Eds. Berlin, Germany: Springer-Verlag, 1995, vol. 955, pp. 393–402. [22] B. L. Montgomery and B. V. K. V. Kumar, “On the average codeword length of optimal binary codes for extended sources,” IEEE Trans. Inf. Theory, vol. IT-33, no. 2, pp. 293–296, Mar. 1987. [23] E. Norwood, “The number of different possible compact codes,” IEEE Trans. Inf. Theory, vol. IT-13, no. 4, pp. 613–616, Oct. 1967. [24] F. Pratt, Secret and Urgent: The Story of Codes and Ciphers. Garden City, NY: Blue Ribbon Books, 1942. [25] J. Rissanen, “Minimax codes for finite alphabets,” IEEE Trans. Inf. Theory, vol. IT-24, no. 3, pp. 389–392, May 1978. [26] B. Y. Ryabko, “Encoding of a source with unknown but ordered probabilities,” (in Russian) Probl. Pered. Inform., vol. 15, no. 2, pp. 71–77, Apr.-Jun. 1979. [27] ——, “Comments on ’A source matching approach to finding minimax codes’,” IEEE Trans. Inf. Theory, vol. IT-27, no. 6, pp. 780–781, Nov. 1981. [28] S. A. Smith, “A generalization of Huffman coding for messages with relative frequencies given by upper and lower bounds,” IEEE Trans. Inf. Theory, vol. IT-20, no. 1, pp. 124–125, Jan. 1974. [29] D. Solomon, Data Compression, The Complete Reference, 2nd ed. New York: Springer-Verlag, 2000. [30] Q. F. Stout, “Improved prefix encodings of the natural numbers,” IEEE Trans. Inf. Theory, vol. IT-26, no. 5, pp. 607–609, Sep. 1980. [31] H. Yamamoto and H. Ochi, “A new asymptotically optimal code for the positive integers,” IEEE Trans. Inf. Theory, vol. 37, no. 5, pp. 1420–1429, Sep. 1991. [32] H. Yamamoto, “A new recursive universal of the positive integers,” IEEE Trans. Inf. Theory, vol. 46, no. 2, pp. 717–723, Mar. 2000. [33] J. Ziv and A. Lempel, “A universal algorithm for sequential data compression,” IEEE Trans. Inf. Theory, vol. IT-23, no. 3, pp. 337–343, May 1977.