IEICE TRANS. FUNDAMENTALS, VOL.E99–A, NO.12 DECEMBER 2016


LETTER

Special Section on Information Theory and Its Applications

Average Coding Rate of a Multi-Shot Tunstall Code with an Arbitrary Parsing Tree Sequence*

Mitsuharu ARIMURA†a), Senior Member

SUMMARY  The average coding rate of a multi-shot Tunstall code, which is a variation of variable-to-fixed length (VF) lossless source codes, is investigated for stationary memoryless sources. A multi-shot VF code parses a given source sequence into variable-length blocks and encodes them into fixed-length codewords. If the parsing count is fixed, the overall multi-shot VF code can be treated as a one-shot VF code. For this setting of the Tunstall code, the compression performance is evaluated using two criteria. The first one is the average coding rate, which is defined as the codeword length divided by the average block length. The second one is the expectation of the pointwise coding rate. It is proved that both of these average coding rates converge to the entropy of a stationary memoryless source under the assumption that the geometric mean of the leaf counts of the multi-shot Tunstall parsing trees goes to infinity.

key words: lossless data compression, source coding, variable-to-fixed length code, Tunstall code, average coding rate

Manuscript received January 16, 2016.
Manuscript revised June 17, 2016.
†The author is with the Department of Applied Computer Sciences, Shonan Institute of Technology, Fujisawa-shi, 251-8511 Japan.
*This paper was partly presented at the 2014 International Symposium on Information Theory and its Applications (ISITA2014).
a) E-mail: [email protected]
DOI: 10.1587/transfun.E99.A.2281
Copyright © 2016 The Institute of Electronics, Information and Communication Engineers

1. Introduction

In this paper, the average coding rate of a multi-shot Tunstall code is investigated. The Tunstall code is the original variable-to-fixed length (VF) source code proposed by Tunstall [1], and is an optimal VF code in the sense that it attains the maximum average block length among all VF codes whose parsing trees have the same number of leaves. In the setting of the multi-shot Tunstall code used in this paper, which is formulated in [6], variable-length blocks are parsed from a source sequence using multiple Tunstall trees, and each of them is encoded as a VF code. In contrast, the setting in which a single variable-length block is parsed and encoded is called the one-shot Tunstall code in this paper. For one-shot and multi-shot Tunstall codes, almost-sure convergence coding theorems are proved in [6]. This paper evaluates the average coding rate of the multi-shot Tunstall code.

This paper is characterized by the following points. First, the parsing count of the multi-shot VF code is fixed as in [6], so that the total coding procedure can be regarded as a one-shot VF code. Second, this paper evaluates two types of average coding rates. The first one is the average coding rate, defined by dividing the fixed codeword length by the expectation of the variable block length. The second one is the expectation of the pointwise coding rate defined in [5], which is called the average pointwise coding rate in this paper. The reason that the latter definition is also used in this paper is discussed in the next section. Finally, in the multi-shot Tunstall code, the leaf counts of all the parsing trees are assumed to be given [6], and a rather loose sufficient condition under which the code is asymptotically optimal is presented.

In [6], two notions on multiple Tunstall trees, the Cartesian concatenation of trees and the geometric mean of the leaf counts of trees, are defined and play crucial roles in the analyses. In this paper, the upper bound is related to the geometric mean, while the lower bound is related to the Cartesian concatenation. This is in contrast to the previous result [6], in which all the results are characterized by the geometric mean of the leaf counts of trees.

This paper is organized as follows. Section 2 discusses the average coding rate of a VF code. Section 3 defines a multi-shot Tunstall code and shows some properties of this code. The two types of average coding rates are evaluated in Sect. 4.

2. Average Coding Rate and Average Pointwise Coding Rate of VF Codes

The average coding rate of a VF code is commonly defined as the (fixed) codeword length divided by the average block length [1]. The Tunstall code minimizes the average coding rate. Moreover, suppose that the codeword length is bounded from above by $n$ and the expected length of the words registered in the parsing tree is represented by $E[L]$. Then, as the length of the encoded source string increases, the number of code letters per source letter approaches $n/E[L]$ with probability one [4]. Therefore, the average coding rate has an operational justification in this sense.

On the other hand, we can consider the average pointwise coding rate, which is defined as the expectation of the pointwise coding rate. The pointwise coding rate is the coding rate for each individual sequence emitted by the information source. In the definition of the average coding rate, the expectations in the numerator and the denominator are taken independently (note that in VF codes, since the numerator is constant, its expectation takes the same value); hence, it does not necessarily provide a faithful statistical characterization of the behavior of the pointwise coding rate as a random variable [3]. In [5], the average pointwise coding rate of the one-shot Tunstall code is investigated.




In that paper, it is demonstrated that the average pointwise coding rate has direct relationships to the convergence in probability of the coding rate and to the maximum/minimum values of the pointwise redundancy. Therefore, when the theoretical compression performance of a one-shot VF code is considered, this definition has the merit that it is directly related to the other quantities describing the compression performance. A disadvantage of the investigation of the average pointwise coding rate [5] is that it is much more difficult and complicated than the investigation of the average coding rate [2], which is very simple, as noted in [3].

Subsequently, consider a multi-shot VF code which parses multiple blocks with multiple Tunstall parsing trees. For this VF code, the pointwise coding rate of the total coding procedure is defined as the sum of the codeword lengths divided by the sum of the block lengths. If we consider a stochastic process as an information source, various sequences are generated from the source. Therefore, when the average compression performance over the whole set of sequences is considered, we can define the following two average coding rates. First, if we fix the sum of the parsed block lengths, the total coding procedure can be treated as a fixed-to-variable length (FV) code, which maps the set of fixed-length blocks to the set of variable-length codewords. On the other hand, if we fix the number of parsed blocks, the total coding procedure can be treated as a VF code, which maps the set of variable-length blocks to the set of fixed-length codewords. If we define the multi-shot Tunstall code in the latter manner and treat the total coding as a one-shot VF code, the merit of the average pointwise coding rate for a one-shot VF code, discussed above, also applies to the multi-shot Tunstall code. For the above reason, we treat the multi-shot Tunstall code with a fixed parsing count as a one-shot VF code, and evaluate both the average coding rate and the average pointwise coding rate as its compression performance.

3. Multi-Shot Tunstall Algorithm

3.1 Notations

For any random variable $X$, the probability distribution of $X$ is denoted by $P_X$. The sequences $x_1 \cdots x_n$ and $X_1 \cdots X_n$ are denoted by $x_1^n$ and $X_1^n$, and $x_i^j$ represents the substring $x_i x_{i+1} \cdots x_j$ of $x_1^n$ for $i, j$ satisfying $1 \le i \le j \le n$. An infinite sequence starting with $x_i$ is denoted by $x_i^\infty$. The length of a finite string $w$ is written as $\|w\|$, and the cardinality of a finite set $\mathcal{X}$ is represented by $|\mathcal{X}|$. The union of all finite products of $\mathcal{X}$ is written as $\mathcal{X}^* = \bigcup_{n \ge 1} \mathcal{X}^n$. The bases of $\log$ and $\exp$ are assumed to be 2 in this paper.

Let $\mathbf{X} = X_1^\infty$ be a stationary memoryless information source such that $X_1 = X$. Each $X_i$ takes values in a finite alphabet $\mathcal{X}$ $(|\mathcal{X}| = A < \infty)$ with probability distribution $P_{X_i} = P_X$. The entropy rate $H(\mathbf{X})$ of the source $\mathbf{X}$ is equal to the entropy of the probability distribution $P_X$, which is defined as

$$H(\mathbf{X}) = H(P_X) = \sum_{x \in \mathcal{X}} P_X(x) \log \frac{1}{P_X(x)}.$$

The minimum probability of $P_X$ is written as

$$P = \min_{x \in \mathcal{X}} P_X(x). \qquad (1)$$

It is assumed that $P$ satisfies $P > 0$. The probability of a variable-length sequence $w = x_1 x_2 \cdots x_{\|w\|} \in \mathcal{X}^*$ is denoted by $P^*(w)$. Since the source $\mathbf{X}$ is stationary memoryless, it holds that

$$P^*(w) = \prod_{i=1}^{\|w\|} P_X(x_i). \qquad (2)$$
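As a small worked illustration of (1) and (2), the following sketch computes the entropy, the minimum symbol probability, and the probability of a finite string for a toy memoryless source. The distribution and the function names are illustrative assumptions, not part of the letter.

```python
import math

# Toy distribution P_X over a ternary alphabet (A = 3), chosen only for illustration.
P_X = {"a": 0.5, "b": 0.3, "c": 0.2}

def entropy(p):
    """Entropy H(P_X) in bits (the bases of log are 2 throughout the paper)."""
    return sum(px * math.log2(1.0 / px) for px in p.values() if px > 0)

def min_prob(p):
    """Minimum symbol probability P = min_x P_X(x), Eq. (1)."""
    return min(p.values())

def seq_prob(p, w):
    """Probability P*(w) of a finite string w, Eq. (2): a product of symbol probabilities."""
    prob = 1.0
    for x in w:
        prob *= p[x]
    return prob

print(entropy(P_X))          # about 1.485 bits per source symbol
print(min_prob(P_X))         # 0.2
print(seq_prob(P_X, "aab"))  # 0.5 * 0.5 * 0.3 = 0.075
```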

In this paper, it is assumed that the encoder and the decoder know the probability distribution $P_X$ of the information source $\mathbf{X}$.

Subsequently, some notations on parsing trees for VF coding are described. Let $T$ be a rooted $A$-ary tree such that each inner node has $A$ child nodes. Each edge of $T$ has a label $x \in \mathcal{X}$. The leaf set of $T$, which is defined as the set of nodes having no child, is written as $\mathcal{T}$. Each leaf node $\nu \in \mathcal{T}$ corresponds to a block $w(\nu)$, which is the concatenation of the labels of all the edges on the path from the root node to $\nu$. The probability of a leaf $\nu \in \mathcal{T}$ is defined as the probability $P^*(w(\nu))$. The depth of the leaf $\nu$ can be represented by $\|w(\nu)\|$.

3.2 Multi-Shot Tunstall Code

Here we describe the encoding algorithm of a multi-shot Tunstall code. When a stationary memoryless source $\mathbf{X}$ is given, a Tunstall parsing tree (or Tunstall tree) is created as follows [1].

1. At extension step 1, let $T(1)$ be the $A$-ary tree of depth one.
2. At extension step $m \ge 2$, $T(m)$ is created from $T(m-1)$ as follows.
   a. Take any leaf $\nu$ which has the largest probability $P^*(w(\nu))$ in $T(m-1)$.
   b. Let $T(m)$ be the tree created by attaching the root node of a copy of $T(1)$ to $\nu$.

It is not guaranteed that the Tunstall tree is unique when the source $\mathbf{X}$ and the extension step $m$ are given. Therefore, some rule is predetermined for the case that two or more leaves have the largest probability in Step 2-a. For example, take the leaf $\nu$ such that $w(\nu)$ heads the list created from all the $w(\nu)$'s with the same probability in lexical order. Since the size of the leaf set $\mathcal{T}(m)$ of the tree $T(m)$ is $|\mathcal{T}(m)| = (A-1)m + 1$, the block $w(\nu)$ corresponding to a leaf $\nu \in \mathcal{T}(m)$ can be encoded in $\lceil \log |\mathcal{T}(m)| \rceil$ bits and decoded without error.
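The construction above can be sketched in a few lines of Python: the leaf set is kept as a heap of (negative probability, word) pairs, so the most probable leaf is expanded at each step and ties are broken lexicographically, in the spirit of the rule suggested for Step 2-a. The function name and data layout are illustrative choices, not taken from the letter.

```python
import heapq

def tunstall_tree(P_X, m):
    """Leaf words w(nu) and probabilities P*(w(nu)) of the Tunstall tree T(m).

    Extension step 1 creates the depth-one A-ary tree; each of the remaining
    m - 1 steps pops the most probable leaf and attaches a copy of T(1) to it,
    i.e., appends every source symbol.  The leaf count is (A - 1) m + 1.
    """
    symbols = sorted(P_X)
    heap = [(-P_X[x], x) for x in symbols]     # max-heap via negated probabilities
    heapq.heapify(heap)
    for _ in range(m - 1):
        neg_p, w = heapq.heappop(heap)         # leaf with the largest P*(w(nu)); ties break lexicographically
        for x in symbols:
            heapq.heappush(heap, (neg_p * P_X[x], w + x))
    return sorted((w, -neg_p) for neg_p, w in heap)

# Binary example: m = 3 gives (2 - 1) * 3 + 1 = 4 leaves, so 2-bit codewords.
print(tunstall_tree({"a": 0.7, "b": 0.3}, m=3))
# [('aaa', 0.343), ('aab', 0.147), ('ab', 0.21), ('b', 0.3)]  up to floating-point rounding
```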

The multi-shot Tunstall code, which parses $K$ blocks from the source sequence $x_1^\infty$ and encodes them, is defined as follows [6].

1. Prepare the Tunstall trees $T_1(m_1), \ldots, T_k(m_k), \ldots, T_K(m_K)$ using the above algorithm.
2. At parsing step $k = 1, \ldots, K$, the block $w_k = x_{n_{k-1}+1}^{n_k}$ $(n_0 = 0)$ is parsed from $x_1^\infty$ as follows.
   a. Let $w_k = x_{n_{k-1}+1}^{n_k}$ be the prefix of the sequence $x_{n_{k-1}+1}^\infty$ that is equivalent to the block $w(\nu)$ corresponding to some leaf node $\nu$ of the tree $T_k(m_k)$.
   b. Parse $w_k = x_{n_{k-1}+1}^{n_k}$ from the sequence $x_{n_{k-1}+1}^\infty$.
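A minimal sketch of this parsing and encoding loop follows. The trees are given simply as lists of (leaf word, probability) pairs, for example as produced by the tunstall_tree sketch above, and each parsed block is emitted as its index in the sorted leaf-word list using ceil(log |T_k|) bits. The helper name and the toy trees are illustrative assumptions.

```python
import math

def multishot_encode(x, trees):
    """Parse and encode K blocks of the source string x with K Tunstall trees.

    Each element of `trees` is a list of (leaf word, probability) pairs.  Since
    every leaf set is complete and prefix-free, exactly one leaf word of T_k is
    a prefix of the remaining sequence at parsing step k.
    """
    codewords, pos = [], 0
    for leaves in trees:
        words = sorted(w for w, _ in leaves)
        index = next(i for i, w in enumerate(words) if x.startswith(w, pos))
        width = math.ceil(math.log2(len(words)))      # ceil(log |T_k|) bits
        codewords.append(format(index, "0{}b".format(width)))
        pos += len(words[index])                      # advance past the parsed block w_k
    return codewords, pos                             # pos = ||w_{1...K}||

# K = 2 blocks parsed with two different binary trees (leaf words listed explicitly).
T1 = [("aa", 0.49), ("ab", 0.21), ("b", 0.30)]
T2 = [("aaa", 0.343), ("aab", 0.147), ("ab", 0.21), ("b", 0.30)]
print(multishot_encode("aabaaabab", [T1, T2]))   # (['00', '11'], 3)
```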

The probability distribution $P_X$ and $m_1, \ldots, m_K$ are shared between the encoder and the decoder. Under this non-universal scenario, each tree $T_k(m_k)$ is constructed such that it is unique when $m_k$ is given. In the following analyses, since the relevant quantity of a tree $T_k(m_k)$ is not the extension step number $m_k$ but the leaf count $|\mathcal{T}_k(m_k)|$, the tree $T_k(m_k)$ may be written as $T_k$ by omitting $m_k$, especially for the leaf count $|\mathcal{T}_k|$.

When $K$ blocks $w_1^K = (w_1, \ldots, w_K)$ are parsed from a given source sequence $x_1^\infty$ using the Tunstall trees $T_1(m_1), \ldots, T_K(m_K)$, respectively, the random variables corresponding to the block $w_k$ and the sequence $w_1^K$ are written as $W_k$ and $W_1^K = (W_1, \ldots, W_K)$, respectively. The overall parsed block, which is the concatenation of $w_1^K$, and the corresponding random variable are written as $w_{1\cdots K} = w_1 \cdots w_K$ and $W_{1\cdots K} = W_1 \cdots W_K$.

For any $w \in \mathcal{T}_k$, let $P^*_k(w \mid w_1^{k-1})$ denote the probability of $w$ conditioned on $w_1^{k-1}$, $w_i \in \mathcal{T}_i$, $i = 1, \ldots, k-1$. From Lemma 2 of [6], it holds that $P^*_k(w \mid w_1^{k-1}) = P^*(w)$ for any $w$ and $w_1^{k-1}$, where the right-hand side is defined in (2). Therefore, the probability of the $k$-th parsed block $w \in \mathcal{T}_k$ is written as $P^*_k(w)$, which is equivalent to $P^*(w)$.

3.3 Cartesian Concatenation of Trees and Geometric Mean of the Leaf Counts of Trees

For a multi-shot Tunstall code, the Cartesian concatenation of trees and the geometric mean of the leaf counts of trees are defined in [6]. If two trees $T_1, T_2$ are given, the Cartesian concatenation of $T_1$ and $T_2$ is defined as the tree created by attaching the root node of a copy of $T_2$ to every leaf of $T_1$, and is denoted by $T_1 \times T_2$. The Cartesian concatenation of $K$ trees $T_1^K = (T_1, \ldots, T_K)$ is defined by

$$T_{1\cdots K} = T_1 \times T_2 \times \cdots \times T_K.$$

The leaf count $|\mathcal{T}_{1\cdots K}|$ of the tree $T_{1\cdots K}$ satisfies

$$|\mathcal{T}_{1\cdots K}| = \prod_{k=1}^{K} |\mathcal{T}_k|. \qquad (3)$$

For $K$ trees $T_1^K$, the geometric mean of their leaf counts is defined by

$$g(T_1^K) = \left( \prod_{k=1}^{K} |\mathcal{T}_k| \right)^{\frac{1}{K}} = |\mathcal{T}_{1\cdots K}|^{\frac{1}{K}}.$$
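As a small numerical illustration of (3) and of $g(T_1^K)$, the sketch below computes both quantities from a list of leaf counts. The example uses binary Tunstall trees $T_k(m_k)$ with $|\mathcal{T}_k(m_k)| = (A-1)m_k + 1$ leaves and $m_k = k$, an arbitrary choice made only for illustration.

```python
import math

def cartesian_leaf_count(leaf_counts):
    """Leaf count of the Cartesian concatenation T_{1...K}, Eq. (3): the product of |T_k|."""
    total = 1
    for n in leaf_counts:
        total *= n
    return total

def geometric_mean(leaf_counts):
    """Geometric mean g(T_1^K) of the leaf counts, i.e. |T_{1...K}|^(1/K)."""
    K = len(leaf_counts)
    return math.exp(sum(math.log(n) for n in leaf_counts) / K)

# Binary trees (A = 2) with m_k = k extension steps: |T_k| = (A - 1) k + 1.
counts = [(2 - 1) * k + 1 for k in range(1, 6)]   # [2, 3, 4, 5, 6]
print(cartesian_leaf_count(counts))               # 720
print(geometric_mean(counts))                     # about 3.73, i.e. 720 ** (1 / 5)
```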

4. Average Coding Rate and Average Pointwise Coding Rate

4.1 A Relationship between Average Coding Rate and Average Pointwise Coding Rate

In this subsection, the two types of average coding rates are compared. If $K$ blocks $W_1^K$ are parsed from $\mathbf{X} = X_1^\infty$ using the Tunstall trees $T_1^K$ and encoded, the average coding rate and the average pointwise coding rate are defined as

$$\frac{1}{E[\|W_{1\cdots K}\|]} \sum_{k=1}^{K} \lceil \log |\mathcal{T}_k| \rceil \qquad (4)$$

and

$$E\left[ \frac{1}{\|W_{1\cdots K}\|} \sum_{k=1}^{K} \lceil \log |\mathcal{T}_k| \rceil \right], \qquad (5)$$

respectively. From Jensen's inequality (cf. [7]), these two quantities have the relationship

$$\frac{\sum_{k=1}^{K} \lceil \log |\mathcal{T}_k| \rceil}{E[\|W_{1\cdots K}\|]} \le E\left[ \frac{1}{\|W_{1\cdots K}\|} \sum_{k=1}^{K} \lceil \log |\mathcal{T}_k| \rceil \right]. \qquad (6)$$
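The direction of inequality (6) can also be seen numerically with a small Monte-Carlo sketch: the sample mean of the pointwise rate (5) is at least the ratio-of-means rate (4). The source distribution, the two toy trees, and the sample size below are arbitrary illustrative choices, not an experiment from the letter.

```python
import math
import random

random.seed(0)
P_X = {"a": 0.7, "b": 0.3}
# Leaf words of two small binary Tunstall trees (cf. Sect. 3.2); K = 2.
trees = [["aa", "ab", "b"], ["aaa", "aab", "ab", "b"]]
code_len = sum(math.ceil(math.log2(len(t))) for t in trees)   # sum_k ceil(log |T_k|)

def parse_length(trees):
    """One realization of ||W_{1...K}||: total length of the K parsed blocks.

    Because the source is memoryless, each block can be generated symbol by
    symbol until a leaf word of the corresponding tree is reached.
    """
    total = 0
    for leaves in trees:
        w = ""
        while w not in leaves:
            w += random.choices("ab", weights=[P_X["a"], P_X["b"]])[0]
        total += len(w)
    return total

samples = [parse_length(trees) for _ in range(100000)]
avg_rate = code_len / (sum(samples) / len(samples))                  # definition (4)
avg_pointwise = sum(code_len / n for n in samples) / len(samples)    # definition (5)
print(avg_rate, avg_pointwise)   # (4) <= (5), consistent with Jensen's inequality (6)
```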

Therefore, an upper bound on the average pointwise coding rate (5) and a lower bound on the average coding rate (4) are evaluated in the following subsections. The evaluation of the upper bound is a corollary of the previous results given in [5], [6]. On the other hand, the evaluation of the lower bound depends on a new observation: the Cartesian concatenation of multiple Tunstall trees may not itself be a Tunstall tree, and the average coding rate of the multi-shot Tunstall code is bounded below by that of the one-shot Tunstall code using a Tunstall tree which has the same number of leaves as the Cartesian concatenated tree.

4.2 Upper Bound of Average Pointwise Coding Rate

In this subsection, an upper bound on the average pointwise coding rate of the multi-shot Tunstall code is evaluated. The strategy of the proof is a combination of the results of [5] and [6]. The length and the probability of $w_{1\cdots K}$, which is the concatenation of the blocks $w_1^K$ parsed by the $K$ Tunstall trees $T_1, \ldots, T_K$, are represented by $\|w_{1\cdots K}\|$ and $P^*(w_{1\cdots K})$, respectively.

Definition 1 (Pointwise Redundancy): If the blocks $w_1, \ldots, w_K$ are the result of the parsing of $X_1^\infty$ using the $K$ Tunstall trees $T_1, \ldots, T_K$, the pointwise redundancy with respect to the source $\mathbf{X}$ is defined as follows:



$$r(w_1^K, T_1^K, \mathbf{X}) = \frac{1}{\|w_{1\cdots K}\|} \left\{ \sum_{k=1}^{K} \lceil \log |\mathcal{T}_k| \rceil - \log \frac{1}{P^*(w_{1\cdots K})} \right\}.$$
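For concreteness, the pointwise redundancy of Definition 1 can be computed for a single parsed outcome as below; the argument names and the toy numbers are illustrative assumptions, and $P^*(w_{1\cdots K})$ is obtained as the product of the block probabilities as in Sect. 3.2.

```python
import math

def pointwise_redundancy(leaf_counts, block_probs, total_len):
    """r(w_1^K, T_1^K, X) of Definition 1 for one parsed outcome.

    leaf_counts : |T_k| for each tree; block_probs : P*(w_k) for each parsed
    block; total_len : ||w_{1...K}||.  P*(w_{1...K}) is the product of the
    block probabilities.
    """
    code_len = sum(math.ceil(math.log2(n)) for n in leaf_counts)
    self_info = -sum(math.log2(p) for p in block_probs)   # log(1 / P*(w_{1...K}))
    return (code_len - self_info) / total_len

# Blocks "aa" and "b" parsed by trees with 3 and 4 leaves, assuming P_X(a) = 0.7.
print(pointwise_redundancy([3, 4], [0.49, 0.30], total_len=3))   # about 0.41
```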

We have the following lemma on the self-information.

Lemma 1: When $K$ blocks $W_1, \ldots, W_K$ are parsed from $X_1^\infty$ using the Tunstall trees $T_1, \ldots, T_K$ and encoded, if

$$\liminf_{K \to \infty} g(T_1^K) = \infty$$

is satisfied, it holds that

$$\limsup_{K \to \infty} E\left[ \frac{1}{\|W_{1\cdots K}\|} \log \frac{1}{P^*(W_{1\cdots K})} \right] \le H(\mathbf{X}).$$

Proof of Lemma 1 is given in Appendix A. The following theorem is proved using Lemma 1.

Theorem 1: When $K$ blocks $W_1, \ldots, W_K$ are parsed from $X_1^\infty$ using the Tunstall trees $T_1, \ldots, T_K$ and encoded, if

$$\liminf_{K \to \infty} g(T_1^K) = \infty \qquad (7)$$

is satisfied, it holds that

$$\limsup_{K \to \infty} E\left[ \frac{1}{\|W_{1\cdots K}\|} \sum_{k=1}^{K} \lceil \log |\mathcal{T}_k| \rceil \right] \le H(\mathbf{X}).$$

Proof of Theorem 1 is given in Appendix B.

4.3 Lower Bound of Average Coding Rate

For the average coding rate of the multi-shot Tunstall algorithm, the following lower bound holds.

Theorem 2: When $K$ blocks $W_1, \ldots, W_K$ are parsed from $X_1^\infty$ using the Tunstall trees $T_1, \ldots, T_K$ and encoded, the average coding rate has the following lower bound:

$$\frac{1}{E[\|W_{1\cdots K}\|]} \sum_{k=1}^{K} \lceil \log |\mathcal{T}_k| \rceil \ge H(\mathbf{X}).$$

Proof of Theorem 2 is given in Appendix C.

4.4 Limits of Average Coding Rate and Average Pointwise Coding Rate

From Theorems 1, 2 and the inequality (6), we have the following result immediately.

Corollary 1: When $K$ blocks $W_1, \ldots, W_K$ are parsed from $\mathbf{X} = X_1^\infty$ using the Tunstall trees $T_1, \ldots, T_K$ and encoded, if

$$\liminf_{K \to \infty} g(T_1^K) = \infty$$

is satisfied, it holds that

$$\lim_{K \to \infty} \frac{1}{E[\|W_{1\cdots K}\|]} \sum_{k=1}^{K} \lceil \log |\mathcal{T}_k| \rceil = H(\mathbf{X}),$$

and

$$\lim_{K \to \infty} E\left[ \frac{1}{\|W_{1\cdots K}\|} \sum_{k=1}^{K} \lceil \log |\mathcal{T}_k| \rceil \right] = H(\mathbf{X}).$$

Note that the condition (7) of Theorem 1 and Corollary 1 is satisfied when the $k$-th block is parsed using the Tunstall tree $T_k(k)$ with $k$ internal nodes [6].

5. Conclusion

In this paper we have evaluated two types of average coding rates of a multi-shot Tunstall code [6]. Two properties of multiple Tunstall trees defined in [6] play crucial roles in the analyses of the upper and lower bounds of the average coding rate, respectively. The geometric mean of the leaf counts of multiple Tunstall trees is used to express the upper bound. On the other hand, the Cartesian concatenation of multiple Tunstall trees is used to obtain the lower bound. This is in contrast to the previous result [6], in which all the results are characterized by the geometric mean of the leaf counts of trees.

Acknowledgments

The author expresses his thanks to the associate editor who simplified the proof of Lemma 1. This research is supported in part by MEXT Grant-in-Aid for Scientific Research (C) 15K06088.

References

[1] B.P. Tunstall, Synthesis of noiseless compression codes, Ph.D. dissertation, Georgia Inst. Tech., Atlanta, GA, 1967.
[2] F. Jelinek and K. Schneider, "On variable-length-to-block coding," IEEE Trans. Inform. Theory, vol.18, no.6, pp.765-774, Nov. 1972.
[3] N. Merhav and D.L. Neuhoff, "Variable-to-fixed length codes provide better large deviations performance than fixed-to-variable length codes," IEEE Trans. Inform. Theory, vol.38, no.1, pp.135-140, Jan. 1992.
[4] S.A. Savari and R.G. Gallager, "Generalized Tunstall codes for sources with memory," IEEE Trans. Inform. Theory, vol.43, no.2, pp.658-668, March 1997.
[5] M. Arimura, "On the average coding rate of the Tunstall code for stationary and memoryless sources," IEICE Trans. Fundamentals, vol.E93-A, no.11, pp.1904-1911, Nov. 2010.
[6] M. Arimura, "Almost sure convergence coding theorems of one-shot and multi-shot Tunstall codes for stationary memoryless sources," IEICE Trans. Fundamentals, vol.E98-A, no.12, pp.2393-2406, Dec. 2015.
[7] T.M. Cover and J.A. Thomas, Elements of Information Theory, 2nd ed., John Wiley & Sons, 2006.

Appendix A: Proof of Lemma 1

From (1), (2), and Lemma 2 of [6], we have

$$P^*(w_{1\cdots K}) = \prod_{k=1}^{K} P^*(w_k) \ge \prod_{k=1}^{K} P^{\|w_k\|} = P^{\|w_{1\cdots K}\|}$$

for any $w_{1\cdots K}$. Hence it holds that


$$\max_{w_1^K} \frac{1}{\|w_{1\cdots K}\|} \log \frac{1}{P^*(w_{1\cdots K})} \le \log \frac{1}{P}. \qquad (\mathrm{A}\cdot 1)$$

Fix $\delta > 0$ arbitrarily. Then the following inequality is satisfied from (A·1) and Lemma 3 of [6]:

$$\begin{aligned}
& E\left[ \frac{1}{\|W_{1\cdots K}\|} \log \frac{1}{P^*(W_{1\cdots K})} \right] \\
& \le \Pr\left\{ \left| \frac{1}{\|W_{1\cdots K}\|} \log \frac{1}{P^*(W_{1\cdots K})} - H(\mathbf{X}) \right| < \delta \right\} \cdot (H(\mathbf{X}) + \delta) \\
& \quad + \Pr\left\{ \left| \frac{1}{\|W_{1\cdots K}\|} \log \frac{1}{P^*(W_{1\cdots K})} - H(\mathbf{X}) \right| \ge \delta \right\} \cdot \max_{w_{1\cdots K}} \frac{1}{\|w_{1\cdots K}\|} \log \frac{1}{P^*(w_{1\cdots K})} \\
& \le H(\mathbf{X}) + \delta + O\!\left( \frac{\bigl(\log (g(T_1^K))^K\bigr)^{A+1}}{\bigl(g(T_1^K)\bigr)^{K f^{-1}(\delta)}} \right) \cdot \log \frac{1}{P},
\end{aligned}$$

where $f^{-1}(\delta)$ is the inverse function of $f(\delta)$ defined as

$$f(\delta) = \delta + \sqrt{\frac{\delta}{2 \log e}}\, \log(A-1) + h\!\left( \sqrt{\frac{\delta}{2 \log e}} \right),$$

and $h(x)$ is the binary entropy function defined as $h(x) = -x \log x - (1-x) \log(1-x)$. If $g(T_1^K) \to \infty$ holds when $K \to \infty$, the right-hand side can be bounded by $H(\mathbf{X}) + 2\delta$ for sufficiently large $K$. Moreover, since $\delta > 0$ can be arbitrarily small, the left-hand side is independent of $\delta$, and $f^{-1}(\delta) \to 0$ when $\delta \to 0$, the right-hand side can be made arbitrarily close to $H(\mathbf{X})$.

Appendix B: Proof of Theorem 1

From Definition 1, it holds that

$$\begin{aligned}
E\left[ \frac{1}{\|W_{1\cdots K}\|} \sum_{k=1}^{K} \lceil \log |\mathcal{T}_k| \rceil \right]
&= E\left[ r(W_1^K, T_1^K, \mathbf{X}) \right] + E\left[ \frac{1}{\|W_{1\cdots K}\|} \log \frac{1}{P^*(W_{1\cdots K})} \right] \\
&\le \max_{w_1^K} r(w_1^K, T_1^K, \mathbf{X}) + E\left[ \frac{1}{\|W_{1\cdots K}\|} \log \frac{1}{P^*(W_{1\cdots K})} \right].
\end{aligned}$$

If (7) is satisfied, the first term converges to 0 from Theorem 1 of [6]. Therefore, from Lemma 1, we have the desired result.

Appendix C: Proof of Theorem 2

Let $T_{1\cdots K}$ be the Cartesian concatenation of the parsing trees $T_1, \ldots, T_K$. Since the parsed block using $T_{1\cdots K}$ is $W_{1\cdots K}$ and the parsed blocks using $T_1, \ldots, T_K$ are $W_1, \ldots, W_K$, it holds from Lemma 1 of [6] that

$$E[\|W_{1\cdots K}\|] = E\left[ \sum_{k=1}^{K} \|W_k\| \right].$$

Define $T^{(K)}$ as a Tunstall tree whose leaf count is the same as that of $T_{1\cdots K}$. The block parsed by the Tunstall tree $T^{(K)}$ is denoted by $w^{(K)}$, and the corresponding random variable is represented by $W^{(K)}$. The tree $T_{1\cdots K}$ may not be a Tunstall tree, whereas $T^{(K)}$ is. Since every Tunstall tree maximizes the average block length among all trees with the same leaf count, it holds that

$$E[\|W^{(K)}\|] \ge E[\|W_{1\cdots K}\|].$$

For the codeword length, from the relationship (3) of the leaf count, it holds that

$$\lceil \log |\mathcal{T}^{(K)}| \rceil = \lceil \log |\mathcal{T}_{1\cdots K}| \rceil \le \sum_{k=1}^{K} \lceil \log |\mathcal{T}_k| \rceil.$$

Using the above relationships, the average coding rate is bounded below as follows:

$$\frac{\sum_{k=1}^{K} \lceil \log |\mathcal{T}_k| \rceil}{E\left[ \sum_{k=1}^{K} \|W_k\| \right]} \ge \frac{\lceil \log |\mathcal{T}^{(K)}| \rceil}{E[\|W^{(K)}\|]} \ge \frac{\log |\mathcal{T}^{(K)}|}{E[\|W^{(K)}\|]}.$$

Since the tree $T^{(K)}$ is a Tunstall tree, the equation [2]

$$H(W^{(K)}) = E[\|W^{(K)}\|] H(\mathbf{X})$$

holds. Hence we have the following lower bound:

$$\frac{\log |\mathcal{T}^{(K)}|}{E[\|W^{(K)}\|]} \ge \frac{H(W^{(K)})}{E[\|W^{(K)}\|]} = H(\mathbf{X}). \qquad (\mathrm{A}\cdot 2)$$

Note that the inequality comes from $\log |\mathcal{T}^{(K)}| - H(W^{(K)}) = D\bigl(P^*(W^{(K)}) \,\big\|\, U(|\mathcal{T}^{(K)}|)\bigr) \ge 0$, where $U(|\mathcal{T}^{(K)}|)$ is the uniform probability distribution of size $|\mathcal{T}^{(K)}|$.
