ITW 2004, San Antonio, Texas, October 24-29, 2004

Grammar-Based Coding: New Perspectives¹

En-hui Yang², Da-ke He², and John C. Kieffer

Abstract — Grammar-based coding is investigated from three new perspectives. First, we revisit the performance analysis of grammar-based codes by proposing context-based run-length encoding algorithms as new performance benchmarks. A redundancy result stronger than all previous corresponding results is established. We then extend the analysis of grammar-based codes to sources with countably infinite alphabets. Let Λ denote an arbitrary class of stationary, ergodic sources with a countably infinite alphabet. It is shown that grammar-based codes can be modified so that they are universal with respect to any Λ for which there exists a universal code. Moreover, upper bounds on the worst-case redundancies of grammar-based codes among large sets of length-n individual sequences from a countably infinite alphabet are established. Finally, we propose a new theoretic framework for compression in which grammars, rather than sequential stochastic processes, are used as source generating models, and point out some open problems in the framework.

I. INTRODUCTION

Grammar-based codes [1] stand for a new class of universal lossless data compression algorithms. To compress a sequence x = x₁⋯xₙ, a grammar-based code first transforms x into a context-free grammar G, from which x can be fully reconstructed as the only sequence in the language generated by G, and then compresses the context-free grammar G. For a detailed description of grammar-based codes and context-free grammars, the reader is referred to [1], [2]. The class of grammar-based codes is very broad; it includes block codes, Lempel-Ziv types of codes [3], and many other new universal lossless compression algorithms as well. For instance, within the design framework of grammar-based codes, several new lossless data compression algorithms, such as the Yang-Kieffer algorithms [2] and the multilevel pattern matching (MPM) algorithm [4], were proposed. The concept of grammar-based coding was further extended in [5], where context-dependent grammar-based codes were proposed.

In this paper, we investigate grammar-based coding from new perspectives. We shall first take a close look at the performance of grammar-based codes for individual sequences x = x₁x₂⋯xₙ from a finite alphabet.


¹This work was supported in part by the Natural Sciences and Engineering Research Council of Canada under Grants RGPIN203035-98 and RGPIN203036-02 and under a Collaborative Research and Development Grant, by the Premier's Research Excellence Award, by the Communications and Information Technology Ontario, by the Canada Foundation for Innovation, by the Ontario Distinguished Research Award, and by the Canada Research Chairs Program.

²The authors are with the Department of Electrical and Computer Engineering, University of Waterloo, 200 University Avenue West, Waterloo, Ontario N2L 3G1, Canada.

It was proved in [1], [2] that the difference in compression rate on x between the worst grammar-based code³ and the best arithmetic coding algorithm with k contexts is at most d log log n / log n, uniformly over x, where d is a constant depending only on k and the alphabet size.⁴ The above worst-case redundancy result characterizes the compression performance of grammar-based codes for individual sequences from a finite alphabet. However, such a characterization becomes less meaningful if arithmetic coding algorithms with k contexts fail to provide good performance benchmarks. Indeed, it can be shown that there exist interesting stationary sources with a finite alphabet for which no arithmetic coding algorithm with k contexts can achieve the sources' Shannon entropy rates [6], no matter how large k is. For sequences emitted from these sources, we need to find new performance benchmarks to analyze the compression performance of grammar-based codes. Evidently, from this perspective, one can derive many (redundancy) results for grammar-based codes by selecting different performance benchmarks. In Section II, we discuss some initial results in this direction. Details of these results can be found in [6].

In addition to the worst-case redundancy result, another important result about the compression performance of grammar-based codes is that they are universal for the class of all stationary, ergodic sources with a finite alphabet, in the sense that they can achieve asymptotically the entropy rate of each and every stationary, ergodic source with a finite alphabet [1], [2]. Both the worst-case redundancy result and the universality result above share the assumption that the alphabet is finite. However, in many applications like lossless audio compression and high-quality image compression, where data sequences to be compressed are often from large alphabets with cardinalities even comparable to the sequence lengths, the assumption of a finite alphabet is hardly justified; instead, one may simply assume that data sequences come from a countably infinite alphabet. From this perspective, one wants to investigate the compression performance of grammar-based codes for sources with a countably infinite alphabet.

³Unless otherwise specified, throughout this paper we focus our discussion on grammar-based codes that transform each data sequence into an irreducible context-free grammar, and will simply refer to them as grammar-based codes.
⁴log stands for the logarithm to base 2 throughout this paper.


In Section III we will discuss the universality and redundancies of grammar-based codes in the countably infinite alphabet case. The results in Section III are available with more details in [7].

We now turn our attention to the basic concept of grammar-based coding itself. As described at the beginning of this section, to compress a sequence x, a grammar-based code always transforms x into a context-free grammar G whose language L(G) consists uniquely of the sequence x, i.e.,

L(G) = {x}.    (1)
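To make (1) concrete, here is a minimal sketch of a grammar transform in Python. It uses a Re-Pair-style heuristic (repeatedly replacing the most frequent adjacent pair with a new variable); this is our illustrative stand-in, not the irreducible grammar transform of [1], [2], but the resulting grammar likewise generates exactly one sequence.

```python
from collections import Counter

def grammar_transform(x):
    """Return production rules mapping each variable to its right-hand side.
    ('V', 'start') is the start symbol; terminals are the symbols of x."""
    rules = {}
    seq = list(x)
    next_var = 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        pair, count = max(pairs.items(), key=lambda kv: kv[1]) if pairs else (None, 0)
        if count < 2:
            break
        var = ('V', next_var); next_var += 1
        rules[var] = list(pair)          # new rule: var -> pair
        out, i = [], 0                   # rewrite seq with the new variable
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(var); i += 2
            else:
                out.append(seq[i]); i += 1
        seq = out
    rules[('V', 'start')] = seq
    return rules

def expand(sym, rules):
    """Recover the unique sequence generated by the grammar."""
    if sym in rules:
        return ''.join(expand(s, rules) for s in rules[sym])
    return sym

x = 'abababbababb'
G = grammar_transform(x)
assert expand(('V', 'start'), G) == x   # L(G) = {x}
```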

The constraint (1) limits us to use only a subclass of the class of all context-free grammars (see [1] for details). To involve more context-free grammars in the design of lossless data compression algorithms, it is logical to investigate whether the constraint (1) could be relaxed, for example, to

x ∈ L(G).    (2)

This motivates us to look at grammar-based coding from a completely new perspective: Given a context-free grammar G whose language L(G) contains the sequence x to be compressed, how can we design an efficient lossless data compression algorithm to compress x? In this setting L(G) is allowed to contain more than one sequence. For instance, L(G) could be a set of application-specific files such as the set of all HTML files, the set of all Microsoft Word document files, the set of all executable files, etc. Each sequence x ∈ L(G) is generated by the application of a sequence of production rules of G. In this sense, it is natural to regard G as a source generating model. Note that even though the exact sequence of production rules which generates x is unknown, it does not seem justifiable to regard x as a sample sequence of a sequential random process. This new concept of using G as a source generating model enables us to propose a new theoretic framework for compression. Many interesting problems, including the one raised above, arise in this new framework. We shall elaborate more on these problems in Section IV.

Throughout this paper, we shall use the following notation. A = {a₁, a₂, …, a_{|A|}} denotes a finite source alphabet with cardinality |A| greater than or equal to 2. N = {1, 2, …} denotes the set of all positive integers. A* denotes the set of all finite strings drawn from A, including the empty string λ, and A⁺ denotes the set of all finite strings of positive length from A. The notation |x| denotes the length of x for any x ∈ A*. For any positive integer n, Aⁿ denotes the set of all sequences of length n from A. Let x = x₁x₂⋯xₙ be a length-n sequence from A. We shall sometimes write the substring xᵢ⋯xⱼ (1 ≤ i ≤ j ≤ n) of x as x_i^j for brevity. Similar notation will be applied to other sets and finite strings drawn from them. To avoid possible confusion, a sequence from A is sometimes called an A-sequence.


II. PERSPECTIVE 1: NEW PERFORMANCE BENCHMARKS AND REDUNDANCY RESULTS

In [1], [2], the performance of grammar-based codes was analyzed by comparing it with that of the best arithmetic coding algorithms with finite contexts. Let k be a positive integer. Let Θ_k denote the set of all arithmetic coding algorithms with k contexts. For any sequence x = x₁x₂⋯xₙ from A and any arithmetic coding algorithm θ ∈ Θ_k, let r_θ(x) denote the compression rate in bits per letter resulting from using θ to encode x. Define

r_k^*(x) = min_{θ ∈ Θ_k} r_θ(x),

i.e., r_k^*(x) represents the smallest compression rate in bits per letter among all arithmetic coding algorithms with k contexts. Following the naming convention in [1], we shall call r_k^*(x) the k-context empirical entropy of x. Let X = {X_i}_{i=1}^∞ denote an alphabet-A finite-state source with no more than k states. Then it can be shown that for any individual sequence x emitted from X, r_k^*(x) gives the 'ideal' lossless compression rate in bits per letter. Thus, r_k^*(x) serves as a desirable performance benchmark in evaluating the compression performance of a lossless data compression algorithm for individual sequences emitted from finite-state sources. Let φ denote an arbitrary grammar-based code throughout this section. Let r^φ(x) be the compression rate in bits per letter resulting from using φ to compress x.
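Computing r_k^*(x) exactly requires minimizing over all k-context arithmetic coders, which is costly. A quick way to get a feel for the benchmark (our illustration; the function name and the previous-symbol context map are our assumptions, not the paper's) is to fix one natural context map and compute the resulting plug-in rate, which upper-bounds r_k^*(x) for k = |A| contexts, up to the usual idealizations of arithmetic coding.

```python
from collections import Counter
from math import log2

def prev_symbol_empirical_entropy(x):
    """Plug-in conditional empirical entropy of x given the previous symbol,
    in bits per letter; an upper bound on r_k^*(x) for k = |A| contexts."""
    joint = Counter(zip(x, x[1:]))                 # (context, symbol) counts
    ctx = Counter(c for c, _ in joint.elements())  # context counts
    n = len(x) - 1
    return -sum(cnt / n * log2(cnt / ctx[c]) for (c, _), cnt in joint.items())

print(prev_symbol_empirical_entropy('ababababab'))   # 0.0: fully predictable
print(prev_symbol_empirical_entropy('aabbabbbaa'))   # > 0
```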

Define

R_{n,k}^φ = max_{x ∈ Aⁿ} [r^φ(x) − r_k^*(x)].

The quantity R_{n,k}^φ is called the worst-case redundancy of φ against the k-context empirical entropy. The following upper bound on R_{n,k}^φ was established in [1], [2].

Result 1 There is a constant d, which depends only on |A| and k, such that

R_{n,k}^φ ≤ d (log log n) / (log n).

As alluded to in the introduction section, in this section we shall derive new redundancy results by introducing new performance benchmarks. To this end, we shall first define semi-finite-state sources; then propose context-based run-length encoding (RLE) algorithms and show that context-based RLE algorithms are indeed superior to arithmetic coding algorithms with finite contexts; and finally evaluate the compression performance of grammar-based codes against that of the best context-based RLE algorithms. As a consequence, a redundancy result stronger than Result 1 will be established.

Before giving the definition of semi-finite-state sources, we give a brief description of run-length parsing. Let X = {X_i}_{i=1}^∞ denote a source with alphabet A. The source X can also be represented by a sequence of symbol-run pairs {(Y_i, R_i)}_{i=1}^∞ (also called the run-length parsing of X), where Y₁ = X₁, R₁ is the run length of Y₁ at the beginning of X, Y₂ = X_{R₁+1}, which of course is not equal to the symbol Y₁, R₂ is the run length of Y₂ in X immediately after the first run of Y₁, and so on. For example, consider a sequence x = a₁a₁a₁a₂a₂a₂a₂a₃a₃; the corresponding sequence of symbol-run pairs is given by (a₁, 3)(a₂, 4)(a₃, 2). Clearly, there is a one-to-one correspondence between {X_i}_{i=1}^∞ and {(Y_i, R_i)}_{i=1}^∞.
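The run-length parsing and its inverse are straightforward to implement; the following sketch (function names are ours) mirrors the example above.

```python
from itertools import groupby

def run_length_parse(x):
    """Map x to its sequence of symbol-run pairs (Y_i, R_i)."""
    return [(sym, len(list(run))) for sym, run in groupby(x)]

def run_length_unparse(pairs):
    """Invert the parsing: the correspondence is one-to-one."""
    return ''.join(sym * r for sym, r in pairs)

pairs = run_length_parse('aaabbbbcc')
print(pairs)                            # [('a', 3), ('b', 4), ('c', 2)]
assert run_length_unparse(pairs) == 'aaabbbbcc'
```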

Let k be a positive integer. Let Z be a finite set consisting of k contexts (or states). An alphabet-A source X = {X_i}_{i=1}^∞ is called a semi-finite-state source with k states if there exist an initial state z₀, a transition probability function q₀ : Z × (A × Z) → [0,1] from Z to A × Z, and two transition probability functions Q_i : (A × Z) × (A × N × Z) → [0,1], i = 0, 1, from A × Z to A × N × Z satisfying Q_i(a, r, z | a, z′) = 0 for any a ∈ A, r ∈ N, and z, z′ ∈ Z, such that the probability Pr{X₁⋯Xₙ = x₁⋯xₙ} is equal to (4) for all n ≥ 1 and all x₁x₂⋯xₙ from A. In (4), (y₁, r₁)(y₂, r₂)⋯(y_t, r_t), 1 ≤ t ≤ n, denotes the sequence of symbol-run pairs associated with x₁x₂⋯xₙ. For brevity, we shall write the quantity given by (4) as A_{x₁⋯xₙ}(q₀, Q₀, Q₁, z₀).

For any positive integer k, let Λ_k^{sfs}(A) denote the class of all alphabet-A semi-finite-state sources with k states. It is easy to check that Λ_k^{sfs}(A) includes all renewal processes [11] when A is binary. We are also interested in the relationship between semi-finite-state sources and finite-state sources. Since for any source X from Λ_k^{sfs}(A), with probability one X does not have a run of infinite length, we here define the so-called degenerate finite-state sources. Let X = {X_i}_{i=1}^∞ denote an alphabet-A finite-state source with transition probability function p from Z to A × Z. Then X is called a degenerate finite-state source with k states if

Pr{X₂ = a, X₃ = a, …, Xₙ = a, … | X₁ = a, z₁} > 0    (5)

for some (z₁, a) ∈ Z × A. Otherwise, X is called a non-degenerate finite-state source with k states. Let Λ_k^{fs}(A) denote the class of all alphabet-A non-degenerate finite-state sources with k states. The following theorem shows that Λ_k^{sfs}(A) includes Λ_k^{fs}(A) as a strict subclass. The proof of Theorem 1, along with the proofs of the other theorems in this section, is provided in [6].

Theorem 1 Let k be a positive integer. Then,

i) any alphabet-A non-degenerate finite-state source with k states is a semi-finite-state source with k states;

ii) there exist semi-finite-state sources X = {X_i}_{i=1}^∞ with alphabet A such that X ∉ Λ_{k′}^{fs}(A) no matter how large k′ is.

The definition of semi-finite-state sources implies that in order to efficiently compress a sequence emitted from a semi-finite-state source, one can encode the sequence indirectly by encoding the sequence of symbol-run pairs associated with the sequence. In the literature, such an algorithm is called an RLE algorithm [8]. Traditionally, an RLE algorithm uses Huffman coding to encode the sequence of symbol-run pairs. As a result, it may not be optimal, since the length of each codeword must be an integer. One way to improve its compression performance is to replace Huffman coding with arithmetic coding, or more specifically multilevel arithmetic coding [9]. To avoid possible confusion, we shall refer to such a modified RLE algorithm as an arithmetic RLE algorithm, and to the original RLE algorithm as a Huffman RLE algorithm. If the arithmetic coding algorithm used in an arithmetic RLE algorithm is context-based with k contexts, then such an arithmetic RLE algorithm will be called an RLE algorithm with k contexts. We next show that, in comparison with arithmetic coding algorithms with k contexts, it is desirable to use RLE algorithms with k contexts to evaluate the compression performance of universal source coding algorithms, including grammar-based codes.

Let Z be a finite set consisting of k elements; each element z ∈ Z is regarded as an abstract context. Let q₀ : Z × (A × Z) → [0,1] be a transition probability function from Z to A × Z; let Q_i : (A × Z) × (A × N × Z) → [0,1], i = 0, 1, be two transition probability functions from A × Z to A × N × Z satisfying Q_i(a, r, z | a, z′) = 0 for any a ∈ A, r ∈ N, and z, z′ ∈ Z. For any sequence x = x₁x₂⋯xₙ from A, the compression rate in bits per letter resulting from using the k-context RLE algorithm with functions q₀, Q₀, and Q₁ and an initial state z₀ to encode x is given by

−(1/n) log A_{x₁⋯xₙ}(q₀, Q₀, Q₁, z₀).

Define

r_{sr,k}^*(x) = −(1/n) log max_{q₀,Q₀,Q₁} max_{z₀} A_{x₁⋯xₙ}(q₀, Q₀, Q₁, z₀),    (6)

where the outer maximization varies over all possible combinations of transition probability functions q₀, Q₀, and Q₁. The quantity r_{sr,k}^*(x) represents the smallest compression rate in bits per letter among all RLE algorithms with k contexts. It should be emphasized that there is no single RLE algorithm with k contexts that can achieve the compression rate r_{sr,k}^*(x) for every sequence x = x₁x₂⋯xₙ. Since symbols and runs are encoded by arithmetic coding with k contexts, we shall call r_{sr,k}^*(x) the k-context symbol-run empirical entropy of the sequence x.
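As a concrete, hedged illustration of why the symbol-run benchmark can be much smaller than the symbol-wise one, the sketch below encodes the run-length parsing with single-context (k = 1) plug-in empirical distributions for symbols and run lengths. Up to the usual idealizations, the resulting rate upper-bounds r_{sr,1}^*(x), whereas on the example shown no memoryless symbol-wise coder can get below the 0th-order empirical entropy of 1 bit per letter. All function names here are ours.

```python
from collections import Counter
from itertools import groupby
from math import log2

def symbol_run_rate(x):
    """Plug-in coding rate for the symbol-run pairs of x (bits/letter)."""
    pairs = [(s, len(list(g))) for s, g in groupby(x)]
    t, n = len(pairs), len(x)
    sym_counts = Counter(s for s, _ in pairs)
    run_counts = Counter(r for _, r in pairs)
    bits = -sum(c * log2(c / t) for c in sym_counts.values())   # symbols
    bits += -sum(c * log2(c / t) for c in run_counts.values())  # run lengths
    return bits / n   # bits per letter of the original sequence

# Long runs compress far below the symbol-wise empirical entropy:
print(symbol_run_rate('a' * 500 + 'b' * 500))   # 0.002 bits/letter
```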

The following two theorems together show that it is more desirable to use the k-context symbol-run empirical entropy r_{sr,k}^*(x) as the benchmark than to use r_k^*(x) to evaluate the compression performance of universal source coding algorithms.

Theorem 2 Let k be a positive integer.

i) Let X = {X_i}_{i=1}^∞ be an alphabet-A semi-finite-state source with k states. Then

r_{sr,k}^*(X₁ⁿ) ≤ −(1/n) log Pr{X₁ⁿ}.    (7)

ii) For any alphabet-A source X = {X_i}_{i=1}^∞, there exists a constant d > 0 depending only on k and A such that for any sufficiently large n,

r_{sr,k}^*(X₁ⁿ) ≥ −(1/n) log Pr{X₁ⁿ} − d/√n    (8)

with probability greater than or equal to 1 − 1/n².

Theorem 3 Let k be a positive integer. Then, for any sequence x ∈ A⁺,

r_{sr,k}^*(x) ≤ r_k^*(x).    (9)

Furthermore, there exist sequences x for which the above inequality becomes a strict inequality.

Remark 1 Fix a positive integer k. Theorem 3 shows that r_{sr,k}^*(x) is no greater than r_k^*(x) for any individual sequence x from a finite alphabet, and there exist sequences for which (9) becomes a strict inequality. However, being smaller than r_k^*(x) alone does not mean that r_{sr,k}^*(x) is a useful quantity from an information theoretic point of view; one has to guarantee that r_{sr,k}^*(x) cannot be essentially less than the self-information. That is why in Theorem 2 we relate r_{sr,k}^*(X₁⋯Xₙ) to the self-information −log Pr{X₁⋯Xₙ} for any source X = {X_i}_{i=1}^∞ with a finite alphabet.



Finally, in this section we revisit grammar-based codes by evaluating their compression performance against the k-context symbol-run empirical entropy for any individual sequence from a finite alphabet. In particular, we are interested in the difference between r^φ(x) and r_{sr,k}^*(x) for any individual sequence x ∈ A⁺. Define

R_{sr,n,k}^φ = max_{x ∈ Aⁿ} [r^φ(x) − r_{sr,k}^*(x)].

The quantity R_{sr,n,k}^φ is called the worst-case redundancy of the grammar-based code φ against the k-context symbol-run empirical entropy. In the following theorem, we upper-bound R_{sr,n,k}^φ.

Theorem 4 There is a constant d, which depends only on A and k, such that

R_{sr,n,k}^φ ≤ d (log log n) / (log n).

Theorem 4, together with Theorem 3, implies Result 1. Hence, the new redundancy result is stronger than Result 1.
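To give a rough sense of how slowly this redundancy bound decays (our numerical illustration, not the paper's): with base-2 logarithms, at n = 2²⁰ the factor (log log n)/(log n) equals (log₂ 20)/20 ≈ 0.22, and at n = 2³⁰ it is (log₂ 30)/30 ≈ 0.16, so the guarantee tightens only logarithmically with the sequence length.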

III. PERSPECTIVE 2: UNIVERSALITY FOR SOURCES WITH COUNTABLY INFINITE ALPHABETS

As mentioned in the introduction section, grammar-based codes were originally designed and analyzed for sources with finite alphabets. In this section, we show how grammar-based codes can be extended to encode sources with countably infinite alphabets, and analyze their compression performance. Let E be a countably infinite source alphabet throughout this section. Let x ∈ Eⁿ denote a sequence to be compressed. For any x, let E_x denote the set that consists of all distinct symbols appearing in x. Since x is of finite length, E_x is a finite subset of E. However, E_x may grow without bound as x gets longer and longer. Because E_x is unknown to the decoder, it is clear that in order to encode x efficiently, the set E_x has to be encoded and transmitted to the decoder. Note that such an issue does not exist in the case of a finite alphabet. Having the knowledge of E_x, one can then regard x as a sequence from E_x, which is finite if the length n of x is finite. Thus, by incorporating the encoding of E_x, any grammar-based code can be applied to encode x as if it were from E_x. Since E can be an arbitrary collection of symbols, it is generally not efficient to encode E_x directly. Nonetheless, because E is known to both encoder and decoder, encoding E_x can be simplified to encoding integers. To facilitate our discussion, let L : E → N denote a known one-to-one and onto deterministic mapping from E to N. For example, a labeling method that assigns a unique positive integer index to each symbol e in E gives rise to a mapping L. It is then easy to see that every symbol e ∈ E_x can be encoded indirectly by encoding the integer L(e). For simplicity, we choose the Elias universal doubly compound representation of integers, or simply the Elias code, described in [10] to encode integers.
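For illustration, here is a short sketch of the Elias gamma and delta codes from [10]; we use the delta code as a stand-in for the doubly compound representation mentioned above (which exact variant the paper uses is our assumption).

```python
def elias_gamma(n):
    """Elias gamma code: unary length prefix + binary value (n >= 1)."""
    b = bin(n)[2:]
    return '0' * (len(b) - 1) + b

def elias_delta(n):
    """Elias delta code: gamma-encode the length, then the value's low bits."""
    b = bin(n)[2:]
    return elias_gamma(len(b)) + b[1:]

for n in [1, 2, 17]:
    print(n, elias_delta(n))   # 1 -> '1', 2 -> '0100', 17 -> '001010001'
```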

Remark 2 It should be pointed out that even though the modification of grammar-based codes needed for encoding sources with countably infinite alphabets is minor, this does not mean that the extension of the performance analysis results of grammar-based codes from sources with finite alphabets to sources with infinite alphabets is straightforward. On the contrary, due to the fact that E_x grows without bound for most sequences as n → ∞, the problem becomes considerably more difficult.

Before analyzing the performance of grammar-based codes for sources with alphabet E, it is necessary to review a result obtained by Kieffer over 25 years ago. Let Λ denote a class of alphabet-E stationary, ergodic sources. Given Λ, it was proved in [11] that a universal code for Λ exists if and only if there is a countable set P = {p₁, p₂, …} of probability distributions on E such that Λ ⊆ Λ_P, where Λ_P consists of all stationary, ergodic alphabet-E sources X = {X_i}_{i=1}^∞ for which there exists a p_i ∈ P, which depends on {X_i}_{i=1}^∞, such that

E[−log p_i(X₁)] < ∞.    (12)

Therefore, unlike the case of a finite source alphabet, a universal code does not exist for the class of all stationary, ergodic sources with a countably infinite alphabet, or even for the class of all memoryless sources with a countably infinite alphabet [12].

In light of the above result of Kieffer, the most general question one could possibly ask about the universality of grammar-based codes for sources with alphabet E is as follows:

Q1 Given any Λ for which a universal code exists, are grammar-based codes with the modification described above universal with respect to Λ?

In the following we shall settle the above question in the affirmative. Moreover, we shall also investigate the compression performance of grammar-based codes for individual sequences from alphabet E. Let x be a sequence from E. Let φ denote an arbitrary grammar-based code that encodes E_x as described above. Let r^φ(x) be the compression rate in bits per letter resulting from using φ to compress x. To evaluate r^φ(x), we define r_b^*(x) as the smallest compression rate in bits per letter among all bth-order arithmetic coding algorithms, where b is a finite integer. In view of the fact that r_b^*(x) is equal to the empirical entropy of x when b = 0, we call r_b^*(x) the bth-order empirical entropy of x. Define

R_{F(Eⁿ),b}^φ = max_{x ∈ F(Eⁿ)} [r^φ(x) − r_b^*(x)],

where F(Eⁿ) is a subset of Eⁿ to be specified later. The reason for taking the maximization over a subset of Eⁿ, not Eⁿ itself, lies in the fact that there is no universal code for all alphabet-E stationary, ergodic sources. The quantity R_{F(Eⁿ),b}^φ is called the worst-case redundancy of the grammar-based code φ against the bth-order empirical entropy among F(Eⁿ). Recall that in Result 1 the k-context empirical entropy r_k^*(x) is used as the performance benchmark. The reason for using r_b^*(x) instead of r_k^*(x) here is that there exists a sequence emitted from a first-order Markov source with alphabet E for which the first-order empirical entropy is strictly less than the k-context empirical entropy for any finite k. The next two theorems are the main results of this section. Their proofs are provided in [7].

Theorem 5 Suppose that in φ, the encoding of E_x for each x ∈ E* is based on a known one-to-one and onto deterministic mapping L : E → N.

1) Let

Eⁿ(Lᵃ, K₁) = {x ∈ Eⁿ : Σ_{i=1}^n [L(xᵢ)]ᵃ < K₁n},

where a and K₁ are both positive constants. Then there is a constant d, which depends only on b, a, and K₁, such that

R_{Eⁿ(Lᵃ,K₁),b}^φ ≤ d (log log n) / (log n).

2) Let

Eⁿ(log L, K₂) = {x ∈ Eⁿ : Σ_{i=1}^n (⌊log L(xᵢ)⌋ + 1) < K₂n},

where K₂ > 1 is a positive constant. Then there is a constant d′, which depends only on b and K₂, such that

R_{Eⁿ(log L,K₂),b}^φ ≤ d′ (log log n) / (log n).

Theorem 6 Let P = {p₁, p₂, …} be a countable set of probability distributions on E. Then for any stationary, ergodic source X = {X_i}_{i=1}^∞ ∈ Λ_P,

r^φ(X₁X₂⋯Xₙ) → H(X)

with probability one as n → ∞, where H(X) denotes the entropy rate of X.

Clearly, Theorem 6 provides a positive answer to Question Q1 posed earlier in this section.

Remark 3 The redundancy result in Theorem 5 is stronger than the result in [14], which shows that the incremental parsing Lempel-Ziv code [3] with modifications is asymptotically optimal for any stationary, ergodic source {X_i}_{i=1}^∞ with alphabet N satisfying a suitable moment condition. On the other hand, Theorem 5 also indicates that a bad selection of L will result in a large worst-case redundancy among a certain set of sequences. Clearly, to design a good mapping L, one needs some prior information about the data sequences to be compressed. Fortunately, in practice, for example, in applications like audio compression and image compression, one does have such prior knowledge that can be utilized in the design of L.

IV. PERSPECTIVE 3: USING GRAMMARS AS SOURCE MODELS

In this section, we elaborate on the idea of using grammars as source generating models. From this perspective, we propose a new theoretic framework for compression, and point out some open problems in the framework. As mentioned in the introduction section, we shall allow our context-free grammars to have more than one sequence in their languages. This indicates that a single variable may be associated with more than one production rule in a grammar. To better illustrate this, let us look at the following example.

Example 1: Let G be a context-free grammar with variable set {s} and terminal set {a, b} consisting of the following four production rules.