The Fractional Greedy Algorithm for Data Compression

The Fractional Greedy Algorithm for Data Compression

József Békési, Gábor Galambos, Ulrich Pferschy, Gerhard J. Woeginger

Abstract

Text-compression problems are considered where substrings are substituted by code-words according to a static dictionary such that the original text is encoded by a shorter code sequence. We introduce a new efficient on-line heuristic which locally maximizes the compaction ratio. The worst-case behaviour of this fractional greedy heuristic is investigated for several types of dictionaries.

Categories and Subject Descriptors: E.4 [Coding and Information Theory]: data compaction and compression; G.2.2 [Discrete Mathematics]: Graph Theory – path and circuit problems

General Terms: Algorithms, Theory

Additional Key Words and Phrases: optimal and heuristic encoding, shortest paths, textual substitution

Zusammenfassung. Text-compression problems are treated in which subwords are replaced by code-words, with the help of a static dictionary, so that the original text is represented by a shorter code sequence. We introduce a new efficient on-line heuristic which maximizes the local compression ratio. The worst-case behaviour of this fractional greedy method is investigated for several types of dictionaries.

This research was partially supported by the Christian Doppler Laboratorium für Diskrete Optimierung and by the Fonds zur Förderung der wissenschaftlichen Forschung, Project P8971-PHY.

JGYTF, Department of Computer Science, P.O. Box 396, H-6720 Szeged, Hungary

TU Graz, Institut für Mathematik B, Kopernikusgasse 24, A-8010 Graz, Austria


Fractional Greedy Heuristic

1 Introduction


The transfer and storage of large data quantities are still bottlenecks for the performance and hardware requirements of modern computer systems. To reduce the computational cost of large-scale data manipulation, methods for data compression (and subsequently decompression) are widely used. A frequently applied way of compressing a given source-string is to substitute the string piecewise by corresponding code-strings with the help of a dictionary. A dictionary consists of pairs of strings over a finite alphabet (source-word, code-word), which are used to replace substrings in the source-string. We will consider only methods which use a static dictionary, i.e. a fixed dictionary that cannot be changed or extended during the encoding-decoding procedure. Our aim is to translate the source-string, with the help of the words in the dictionary, into a code-string of minimal length; in other words, we want to find a space-optimal encoding procedure.

The problem defined by the above setup is equivalent to the problem of finding a shortest path in a corresponding directed, weighted graph (cf. Schuegraf and Heaps [4]): For any source-string $S = s_1 s_2 \cdots s_n$, we define a graph $G = (V, A)$ on the vertex set $V = \{v_0, v_1, \ldots, v_n\}$ with an arc set $A$. An arc $(v_i, v_{i+d}) \in A$ if and only if there exists a pair (source-word, code-word) in the dictionary such that the source-word consists of exactly $d$ characters matching the original source-string from position $i+1$ until position $i+d$. The weight of an arc is the number of bits used for the representation of the corresponding code-word. Furthermore, we define the length of an arc as the number of characters of the corresponding source-word. The result of an encoding procedure is equivalent to a path in $G$ from $v_0$ to $v_n$. We denote the path generated by an encoding algorithm $A$ by the $A$-path.
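Concretely, this graph-based formulation can be sketched in a few lines. The sketch below is illustrative only (the dictionary is represented simply as a map from source-words to code-word lengths in bits, and all names are ours, not from the paper); since every arc points forward, one left-to-right pass over the vertices $v_0, \ldots, v_n$ computes a minimum-weight encoding.

```python
# Sketch of the shortest-path formulation (cf. Schuegraf and Heaps [4]).
# A dictionary is modelled as a map: source-word -> code-word length in bits
# (illustrative representation; names are ours, not from the paper).

def optimal_encoding_cost(source, dictionary):
    """Minimum total code length in bits for `source`,
    or None if no feasible encoding exists."""
    n = len(source)
    INF = float("inf")
    dist = [0] + [INF] * n        # dist[i] = cheapest encoding of source[:i]
    for i in range(n):            # arcs only go forward: one pass suffices
        if dist[i] == INF:
            continue
        for word, bits in dictionary.items():
            if source.startswith(word, i):     # arc (v_i, v_{i+|word|})
                j = i + len(word)
                dist[j] = min(dist[j], dist[i] + bits)
    return None if dist[n] == INF else dist[n]

# A tiny general dictionary: every single symbol of the alphabet {a, b}
# is a source-word, so a feasible encoding always exists.
D = {"a": 8, "b": 8, "ab": 4, "aba": 2}
print(optimal_encoding_cost("ababa", D))   # "aba" + "ab": 2 + 4 = 6 bits
```

The single forward pass works because every arc $(v_i, v_{i+d})$ has $d \ge 1$; for very long strings this exact computation becomes impractical, which is what motivates the on-line heuristics discussed next.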
With these definitions it is easy to see that the problem of finding an optimal compression of a source-string $S$ is equivalent to the computation of a shortest path w.r.t. the arc weights from $v_0$ to $v_n$. If the graph has many cut vertices (i.e. vertices which divide the original problem into independent subproblems) and these subproblems are reasonably small, we can solve the problem efficiently and compute the optimal encoding. Unfortunately, in practice this will not be the case, and the optimal algorithm cannot be applied since dealing with very long strings would take too much time and storage capacity. Therefore, heuristics have been developed to derive near-optimal solutions. On the performance of earlier heuristics (for example the longest fragment first heuristic, cf. Schuegraf and Heaps [4]) only experimental results have been reported. The first theoretical worst-case analysis for data compression heuristics was performed by Katajainen and Raita [3]. It is restricted to so-called on-line heuristics: An on-line data compression algorithm starts at the source vertex $v_0$, examines all outgoing arcs and chooses one of them according to some given rule. Then the algorithm continues


this procedure from the vertex reached via the chosen arc. There is no possibility to undo a decision made at an earlier time, and no backtracking is allowed. Although this type of heuristic may produce considerably worse encodings than an optimal algorithm, it is very fast and can be performed by scanning the source-string only once with a small buffer.

The worst-case behaviour of a heuristic is measured by its asymptotic worst-case ratio, which is defined as follows: Let $D = \{(w_i, c_i) : i = 1, \ldots, k\}$ be a static dictionary and consider an arbitrary data compression algorithm $A$. Let $A(D,S)$ resp. $OPT(D,S)$ denote the compressed string produced by algorithm $A$ resp. the optimal encoding for a given source-string $S$. The lengths of these codings will be denoted by $\|A(D,S)\|$ resp. $\|OPT(D,S)\|$. Then the asymptotic worst-case ratio of algorithm $A$ is defined as

$$R_A(D) = \limsup_{n \to \infty} \max \left\{ \frac{\|A(D,S)\|}{\|OPT(D,S)\|} : S \in \mathcal{S}(n) \right\}$$

where $\mathcal{S}(n)$ is the set of all text-strings with exactly $n$ characters. Four parameters are used to investigate and state asymptotic worst-case ratios:

$B_t(S)$ = length of each symbol of the source-string $S$ in bits,
$l_{\max}(D) = \max\{|w_i| : i = 1, \ldots, k\}$,
$c_{\min}(D) = \min\{\|c_i\| : i = 1, \ldots, k\}$,
$c_{\max}(D) = \max\{\|c_i\| : i = 1, \ldots, k\}$,

where $|w_i|$ denotes the length of a source-word $w_i$ in characters and $\|c_i\|$ the length of a code-word $c_i$ in bits. Obviously, the case $l_{\max} = 1$ (i.e. every code-word represents a single character) does not make sense, and we will therefore assume $l_{\max} \ge 2$. When we deal with a specific source-string, $B_t(S)$ is shortly written as $B_t$.

Katajainen and Raita [3] analyzed two simple on-line heuristics with respect to their worst-case behaviour:

- The longest matching heuristic LM chooses at each vertex of the underlying graph the longest outgoing arc, i.e. the arc corresponding to the encoding of the longest substring starting at the current position.
- The greedy heuristic, which we will call differential greedy DG (introduced by Gonzalez-Smith and Storer [2]), chooses at each position the arc implied by the dictionary entry $(w_i, c_i)$ yielding the maximal "local compression", i.e. the arc maximizing $|w_i| B_t - \|c_i\|$.

Not surprisingly, the worst-case behaviour of a heuristic strongly depends on the features of the available dictionary. In this paper we will examine only general dictionaries, i.e. dictionaries which contain all symbols of the input alphabet as source-words. (This ensures that every heuristic terminates in any case with a feasible solution.) Other properties of dictionaries discussed in this paper are given by the following definition. A general dictionary is called


1. a code-uniform dictionary, if all code-words are of equal length (i.e. $\|c_i\| = \|c_j\|$, $1 \le i, j \le k$);

2. a nonlengthening dictionary, if the length of any code-word never exceeds the length of the corresponding source-word (i.e. $\|c_i\| \le |w_i| B_t$, $1 \le i \le k$);

3. a suffix dictionary, if with every source-word $w$ also all of its proper suffixes are source-words (i.e. if $w = \omega_1 \omega_2 \cdots \omega_q$ is a source-word, then $\omega_h \omega_{h+1} \cdots \omega_q$ is a source-word for all $2 \le h \le q$);

4. a prefix dictionary, if with every source-word $w$ also all of its proper prefixes are source-words (i.e. if $w = \omega_1 \omega_2 \cdots \omega_q$ is a source-word, then $\omega_1 \omega_2 \cdots \omega_h$ is a source-word for all $1 \le h \le q-1$).

In [3] Katajainen and Raita raised the question whether there exist other types of heuristics which behave "good enough" even in the worst case. Starting from the practical point of view that usually the proportion of the size of the compressed text relative to the source-text is considered, we introduce the fractional greedy algorithm FG: The fractional greedy algorithm takes at any actual position in the source-string the fractionally best possible local compression, i.e. if $I$ is the set of indices of the arcs emanating from the current node of the corresponding graph, then an arc $i_0$ will be chosen such that

$$i_0 = \arg\min_{i \in I} \frac{\|c_i\|}{|w_i| B_t}.$$

It is intuitively clear that in many cases this type of heuristic will perform better than the LM-heuristic, which does not care about code lengths at all, and the DG-algorithm, which tries to maximize the absolute gain in each encoding step, thereby missing chances of encoding smaller parts of the string with a good relative compaction rate which together may yield a better total compression. This is illustrated by the following example (for any character $a$ let $a^1 = a$ and $a^{i+1} = a\,a^i$, $i \in \mathbb{N}$):

Example:

Let us consider the following nonlengthening dictionary with $c_{\max} = 4$, $c_{\min} = 1$ and, as usual for ASCII encodings, $B_t = 8$.

source-word:  $u$    $v$     $uv$    $v^{l_{\max}-1}u$
code-word:    10     1101    1100    0
weight:       2      4       4       1
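With $l_{\max}$ fixed to a concrete value, this example can be replayed numerically. The following sketch (the encoder and the rule names are ours, not from the paper) implements the three on-line selection rules; ties are broken in favour of the dictionary entry listed first, which matches the tie-breaking assumed for FG in the example.

```python
# Replaying the example with lmax = 5 (illustrative choice) and Bt = 8.
# Code-word lengths in bits: u -> 2, v -> 4, uv -> 4, v^(lmax-1)u -> 1.
LMAX, BT = 5, 8
D = {"u": 2, "v": 4, "uv": 4, "v" * (LMAX - 1) + "u": 1}

def encode(source, rule):
    """On-line encoding: apply `rule` to the dictionary entries matching
    at the current position; returns the total code length in bits."""
    i, total = 0, 0
    while i < len(source):
        cands = [(w, b) for w, b in D.items() if source.startswith(w, i)]
        w, b = rule(cands)
        total += b
        i += len(w)
    return total

lm = lambda c: max(c, key=lambda wb: len(wb[0]))                 # longest match
dg = lambda c: max(c, key=lambda wb: len(wb[0]) * BT - wb[1])    # max |w|Bt - ||c||
fg = lambda c: min(c, key=lambda wb: wb[1] / (len(wb[0]) * BT))  # min ||c||/(|w|Bt)

for i in (1, 4, 16):
    S = "u" + ("v" * (LMAX - 1) + "u") * i
    assert encode(S, lm) == encode(S, dg) == 4 * (LMAX - 1) * i + 2
    assert encode(S, fg) == i + 2
```

For $l_{\max} = 5$ and $i = 16$ this gives 258 bits for LM and DG against 18 bits for FG, in line with the formulas $4(l_{\max}-1)i + 2$ and $i + 2$ stated below.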


Compressing the source-string $S_i = u(v^{l_{\max}-1}u)^i$, consisting of $8(l_{\max}\, i + 1)$ bits, with the LM- or the DG-algorithm yields in both cases the code-string $(1100\,(1101)^{l_{\max}-2})^i\,10$ with $4(l_{\max}-1)i + 2$ bits. Applying the FG-heuristic to the same problem generates the code-string $10\,(0)^i$, which is only $i + 2$ bits long. Although this example is based on a very special dictionary, it demonstrates the advantages of choosing the fractionally best compression.

In this paper we will derive upper bounds for the worst-case ratio of the fractional greedy heuristic for dictionaries with different properties as defined above and combinations of them (as far as they are reasonable). Moreover, we will show that all these bounds are tight in the sense that there exist dictionaries and strings where the bounds are attained. (For suffix dictionaries a small gap remains towards the upper bound due to asymptotic bounding techniques.)

The paper is organized as follows. Section 2 deals with general, code-uniform, nonlengthening and prefix dictionaries, whereas suffix resp. suffix and nonlengthening dictionaries are treated in Section 3. Some concluding remarks close the paper in Section 4.

2 General Results

We will first show a general upper bound which is valid for any compression algorithm. Furthermore, it will be proved that this bound can be attained by the fractional greedy algorithm. A more complicated analysis yields a best possible worst-case bound for nonlengthening dictionaries.

Theorem 2.1 Let $D$ be a general dictionary. Then for any encoding algorithm $A$

$$R_A(D) \le \frac{(l_{\max}-1)\,c_{\max}}{c_{\min}}.$$

The bound can be attained by the FG-heuristic.

Proof. Only those source-strings have to be considered where the $A$-path and an $OPT$-path are vertex disjoint and have common endvertices $v$ and $\bar v$. We denote the number of characters between these endvertices by $j$ and the number of vertices on the $OPT$-path between them by $r$. Because the two paths are vertex disjoint and each arc "consumes" at least one character, the number of arcs of the $A$-path between $v$ and $\bar v$ is at most $j - r$. Because the arcs on the optimal path have to cover the distance between $v$ and $\bar v$, we have

$$r \ge \left\lceil \frac{j}{l_{\max}} \right\rceil. \qquad (1)$$

This yields

$$R_A(D) \le \frac{(j-r)\,c_{\max}}{(r+1)\,c_{\min}} \le \left(\frac{j}{r} - 1\right) \frac{c_{\max}}{c_{\min}} \le \frac{(l_{\max}-1)\,c_{\max}}{c_{\min}}.$$


It remains to be shown that the upper bound can be reached by the FG-heuristic. Let us consider the following dictionary:

source-word:  $u$         $w$         $u^2$       $u w^{l_{\max}-2} u$
code-word:    $a$         $b$         $c$         $d$
weight:       $c_{\max}$   $c_{\max}$   $c_{\max}$   $c_{\min}$

We compress the source-strings $S_i = (u^2 w^{l_{\max}-2})^i$ with $n = i\, l_{\max}$. An optimal algorithm generates the code-string $OPT(D, S_i) = a\, d^{i-1} a\, b^{l_{\max}-2}$, whereas the FG-algorithm produces $FG(D, S_i) = (c\, b^{l_{\max}-2})^i$, which together yields

$$R_{FG}(D) \ge \limsup_{n \to \infty} \frac{\|FG(D, S_i)\|}{\|OPT(D, S_i)\|} = \lim_{i \to \infty} \frac{i\,(l_{\max}-1)\,c_{\max}}{l_{\max}\, c_{\max} + (i-1)\,c_{\min}} = \frac{(l_{\max}-1)\,c_{\max}}{c_{\min}}.$$

□
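The tightness construction above can also be checked numerically. In the sketch below we fix illustrative values $l_{\max} = 4$, $c_{\max} = 8$, $c_{\min} = 1$, $B_t = 8$ (all names and the encoder are ours): the FG cost grows like $i\,(l_{\max}-1)\,c_{\max}$ per period while the optimum grows like $i\,c_{\min}$, so the ratio tends to $(l_{\max}-1)\,c_{\max}/c_{\min} = 24$.

```python
# Worst-case dictionary of Theorem 2.1 with concrete illustrative values.
LMAX, CMAX, CMIN, BT = 4, 8, 1, 8
D = {"u": CMAX, "w": CMAX, "uu": CMAX, "u" + "w" * (LMAX - 2) + "u": CMIN}

def fg_encode(s):
    """On-line fractional greedy: total code length in bits."""
    i, total = 0, 0
    while i < len(s):
        cands = [(w, b) for w, b in D.items() if s.startswith(w, i)]
        w, b = min(cands, key=lambda wb: wb[1] / (len(wb[0]) * BT))
        total += b
        i += len(w)
    return total

def opt_encode(s):
    """Exact optimum via the shortest-path dynamic program."""
    INF = float("inf")
    dist = [0] + [INF] * len(s)
    for i in range(len(s)):
        if dist[i] == INF:
            continue
        for w, b in D.items():
            if s.startswith(w, i):
                dist[i + len(w)] = min(dist[i + len(w)], dist[i] + b)
    return dist[len(s)]

for i in (2, 8, 32):
    Si = ("uu" + "w" * (LMAX - 2)) * i
    assert fg_encode(Si) == (LMAX - 1) * CMAX * i           # FG pays 24 per period
    assert opt_encode(Si) == LMAX * CMAX + (i - 1) * CMIN   # OPT pays ~1 per period
# fg/opt = 24*i / (i + 31): about 4.9 already for i = 8, tending to 24 as i grows.
```

The optimum reuses the cheap word $u w^{l_{\max}-2} u$ across period boundaries, which FG can never reach because it commits to $u^2$ at the start of each period.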

Examples where the same bound is attained by the Longest Matching and the Differential Greedy algorithm are given by Katajainen and Raita in [3], where an identical upper bound was shown for the Differential Greedy heuristic in a significantly longer proof.

Setting $c_{\max} = c_{\min}$ and using the same lower-bound example as above, we get the following

Corollary 2.2 Let $D$ be a code-uniform dictionary. Then for any encoding algorithm $A$

$$R_A(D) \le l_{\max} - 1.$$

The bound can be attained by the FG-heuristic. □

The above bounds lead to the expectation that the worst-case behaviour of the FG-algorithm will be identical to that of the LM- and the DG-heuristic for other dictionary types, too. Indeed, this is true for the nonlengthening dictionary.

Theorem 2.3 Let $D$ be a nonlengthening dictionary. Then

$$R_{FG}(D) \le \begin{cases} \dfrac{(l_{\max}-1)\,c_{\max}}{c_{\min}} & \text{if } c_{\max} \le B_t \\[2mm] \dfrac{(l_{\max}-2)\,B_t + c_{\max}}{c_{\min}} & \text{if } B_t < c_{\max} < 2B_t \\[2mm] \dfrac{l_{\max}\, B_t}{c_{\min}} & \text{if } 2B_t \le c_{\max} \end{cases}$$

and all bounds can be attained.


Proof. Depending on the relation between $c_{\max}$ and $B_t$ we distinguish three cases:

Case A. $c_{\max} \le B_t$: Under this condition every dictionary is nonlengthening and the result of Theorem 2.1 can be applied without changes (including the worst-case example).

Case B. $B_t < c_{\max} < 2B_t$: First we observe that for $l_{\max} = 2$ the claimed bound is equal to the general upper bound proved in Theorem 2.1. Therefore, we will assume $l_{\max} \ge 3$ throughout this part of the proof. We have to elaborate the model of Theorem 2.1 in more detail. Let us repeat that for a comparison of the fractional greedy algorithm with an optimal encoding, only those source-strings have to be considered where the FG-path and an $OPT$-path are vertex disjoint and have common endvertices. We denote these paths by $P_{FG} = v, v_1, v_2, \ldots, v_q, \bar v$ resp. $P_{OPT} = v, v_1', v_2', \ldots, v_r', \bar v$. Let there be $p$ resp. $p'$ characters between $v$ and $v_1$ resp. $v_1'$, and $j$ characters from $v$ to $\bar v$ as above. The code-strings of the arcs $(v, v_1)$ and $(v, v_1')$ are $c_0$ resp. $c_0'$. When the FG algorithm reaches vertex $v$ during the encoding process, it will choose the "fractionally shortest" outgoing arc, implying

$$\frac{\|c_0\|}{p\, B_t} \le \frac{\|c_0'\|}{p'\, B_t}, \quad \text{and hence} \quad \|c_0'\| \ge \frac{p'}{p}\, \|c_0\|.$$

On the other hand, since $P_{FG}$ and $P_{OPT}$ are vertex disjoint paths and each arc "consumes" at least one character, the total number of arcs between $v$ and $\bar v$ is

$$q + r + 2 \le j + 1 - (\min\{p, p'\} - 1)$$

and therefore

$$q \le j - r - \min\{p, p'\}.$$

Inequality (1) can be improved to

$$r \ge \left\lceil \frac{j - p'}{l_{\max}} \right\rceil.$$

We divide the arcs on the FG-path into two sets. Let $Q_1$ contain all arcs with length 1 and $Q_2$ the remaining arcs. The cardinality of these sets is denoted by $q_1 = |Q_1|$ and $q_2 = q - q_1 = |Q_2|$. Hence we have

$$R_{FG}(D) \le \frac{\|c_0\| + q_1 B_t + q_2 c_{\max}}{\|c_0'\| + r\, c_{\min}}. \qquad (2)$$

Now two subcases have to be treated:


Case B.1. $0 < p < p' \le \min\{j, l_{\max}\}$: We derive an upper bound for the sum of the arc weights on the FG-path. The worst case for a nonlengthening dictionary occurs if each character is encoded by $B_t$ bits. As there are $r$ vertices on the path of the optimal encoding which may not be visited by the FG-heuristic, they have to be bypassed with arcs from $Q_2$. If $c_{\max} < 2B_t$, the worst possible case for these shortcuts is attained for $q_2 = r$ with all arcs in $Q_2$ of length 2 and weight $c_{\max}$. Using the above bounds for $r$ and $q$ yields

$$q_1 B_t + q_2 c_{\max} \le (j - p - 2r)\, B_t + r\, c_{\max} = (j-p)\, B_t + r\,(c_{\max} - 2B_t) \le (j-p)\, B_t + \frac{j - p'}{l_{\max}}\,(c_{\max} - 2B_t) = (p' - p)\, B_t + \frac{j - p'}{l_{\max}}\,\bigl(c_{\max} + (l_{\max} - 2)\, B_t\bigr).$$

Inserting this estimation in (2) and using the bounds for $\|c_0'\|$ and $r$, we get

$$R_{FG}(D) \le \frac{\|c_0\| + (p' - p)\, B_t + \bigl((l_{\max} - 2)\, B_t + c_{\max}\bigr)\, \frac{j - p'}{l_{\max}}}{\frac{p'}{p}\, \|c_0\| + \frac{j - p'}{l_{\max}}\, c_{\min}}.$$

It can easily be checked that the right-hand side of this inequality is an increasing function in $j$, except for the case $p' = l_{\max}$ and $p = 1$. Taking the limit for $j \to \infty$ yields the upper bound $\bigl((l_{\max} - 2)\, B_t + c_{\max}\bigr)/c_{\min}$. Treating the case $p' = l_{\max}$ and $p = 1$ separately, a simple calculation shows that the above expression is again increasing in $j$ for $l_{\max} \ge 3$, which leads to the same upper bound.

Case B.2. $0 < p' < p \le \min\{j, l_{\max}\}$: In this case we have

$$q_1 B_t + q_2 c_{\max} \le (j - p' - 2r)\, B_t + r\, c_{\max} \le (j - p')\, B_t + \frac{j - p'}{l_{\max}}\,(c_{\max} - 2B_t).$$

In analogy to Case B.1 we get from (2)

$$R_{FG}(D) \le \frac{\|c_0\| + (j - p')\, B_t + (c_{\max} - 2B_t)\, \frac{j - p'}{l_{\max}}}{\frac{p'}{p}\, \|c_0\| + \frac{j - p'}{l_{\max}}\, c_{\min}}.$$

Setting $x = j - p'$ we can define the right-hand side of this inequality as a function in $x$, $p$ and $p'$:

$$f(x, p, p') := \frac{\|c_0\| + \bigl((l_{\max} - 2)\, B_t + c_{\max}\bigr)\, \frac{x}{l_{\max}}}{\frac{p'}{p}\, \|c_0\| + c_{\min}\, \frac{x}{l_{\max}}}.$$

Obviously, $f$ is a monotone function in $x$. Its partial derivative with respect to $x$ is given by

$$\frac{\partial f}{\partial x} = \frac{\|c_0\| \left( \bigl((l_{\max} - 2)\, B_t + c_{\max}\bigr)\, \frac{p'}{p} - c_{\min} \right)}{l_{\max} \left( \frac{p'}{p}\, \|c_0\| + c_{\min}\, \frac{x}{l_{\max}} \right)^2}.$$


The derivative is positive for all feasible values of $x$, $p$ and $p'$, except for the case $p' = 1$, $p = l_{\max}$ and $2 c_{\min} > c_{\max}$. If it is positive, and hence the function is increasing in $x$, the claimed upper bound is attained by taking the limit $j \to \infty$. Otherwise (if $p' = 1$, $p = l_{\max}$ and $2 c_{\min} > c_{\max}$) we use a different estimation for $\|c_0'\|$, namely

$$\|c_0'\| \ge c_{\min} > \frac{c_{\max}}{2} \ge \frac{\|c_0\|}{2}.$$

In this case (2) leads to

$$R_{FG}(D) \le \frac{\|c_0\| + (j-1)\, B_t + (c_{\max} - 2B_t)\, \frac{j-1}{l_{\max}}}{\frac{\|c_0\|}{2} + (j-1)\, \frac{c_{\min}}{l_{\max}}},$$

which is again an increasing function in $j$. The limit $j \to \infty$ yields the desired bound.

To prove that the bound for Case B is best possible, the same example as in the case of the general dictionary can be used with modified weights:

source-word:  $u$     $w$     $u^2$       $u w^{l_{\max}-2} u$
code-word:    $a$     $b$     $c$         $d$
weight:       $B_t$   $B_t$   $c_{\max}$   $c_{\min}$

Case C. $2B_t \le c_{\max}$: Since the given upper bound is a trivial upper bound for any nonlengthening dictionary (see [3]), it is enough to show that it can be reached, again by the same example with different weights:

source-word:  $u$     $w$     $u^2$     $u w^{l_{\max}-2} u$
code-word:    $a$     $b$     $c$       $d$
weight:       $B_t$   $B_t$   $2B_t$    $c_{\min}$

□

The following theorem states that the prefix property does not improve the upper and lower worst-case bounds.

Theorem 2.4 Let $D_1$ be a general prefix dictionary, $D_2$ be a prefix and code-uniform dictionary, and let $D_3$ be a prefix and nonlengthening dictionary. Then

$$R_{FG}(D_1) \le \frac{(l_{\max}-1)\,c_{\max}}{c_{\min}}, \qquad R_{FG}(D_2) \le l_{\max} - 1, \qquad R_{FG}(D_3) \le \frac{l_{\max}\, B_t}{c_{\min}},$$

and all three bounds can be attained.


Proof. To show that the upper bounds implied by the previous theorems can be attained also for prefix dictionaries, we give the following example with a 3-symbol alphabet $\{u, v, w\}$.

source-word:  $u$         $v$         $w$         $uv$        $v w^j$ ($j = 1, \ldots, l_{\max}-2$)   $v w^{l_{\max}-2} u$
code-word:    $a$         $b$         $c$         $d$         $e_j$                                  $f$
weight:       $c_{\max}$   $c_{\max}$   $c_{\max}$   $c_{\max}$   $c_{\max}$                            $c_{\min}$

For $i > 0$, we consider the strings $S_i = u (v w^{l_{\max}-2} u)^i$ with length $n = i\, l_{\max} + 1$. Obviously, $OPT(D_1, S_i) = a f^i$ and $FG(D_1, S_i) = (d\, c^{l_{\max}-2})^i a$. Hence

$$R_{FG}(D_1) \ge \lim_{n \to \infty} \frac{\|FG(D_1, S_i)\|}{\|OPT(D_1, S_i)\|} = \lim_{i \to \infty} \frac{i\,(l_{\max}-1)\,c_{\max} + c_{\max}}{i\, c_{\min} + c_{\max}} = \frac{(l_{\max}-1)\,c_{\max}}{c_{\min}}.$$

Examples for code-uniform and nonlengthening dictionaries can be constructed in a similar way and are left to the reader. □

We conclude that for general, code-uniform, nonlengthening and prefix dictionaries, and for all combinations of these properties, the worst-case bounds for the LM-, the DG- and the FG-heuristic are identical.

3 Results for Suffix Dictionaries

Throughout this section we will assume $l_{\max} \ge 3$, since every dictionary with $l_{\max} = 2$ is a suffix dictionary and the results of the previous section can be applied. The following theorem shows that for suffix dictionaries the fractional greedy algorithm behaves quite differently from the other heuristics.

Theorem 3.1 Let $D$ be a suffix dictionary. Then

$$R_{FG}(D) \le \frac{c_{\max}\,\bigl(\ln(l_{\max}-1) + 1\bigr)}{c_{\min}}$$

and there exists a suffix dictionary $D'$ with

$$\frac{c_{\max}\,\bigl(\ln(l_{\max}-1) + 1 - \ln 2\bigr)}{c_{\min}} < R_{FG}(D').$$

Proof. We consider again those source-strings where the FG-path and an $OPT$-path are vertex disjoint and have common endvertices. Let us denote the FG-path between any two adjacent vertices $v_a, v_{a+1}$ of the $OPT$-path by $z_1, z_2, \ldots, z_{k+1}$. We assume that there exists an arc on the FG-path connecting $z_1$ with a vertex preceding $v_a$, and an arc leading from $z_{k+1}$ past $v_{a+1}$ (so-called "skipping arcs"), except at the beginning and the end of the string. The following analysis can be performed for any $a$, i.e. for any arc of the $OPT$-path.


The length resp. weight of the arc $(z_i, z_{i+1})$ on the FG-path is denoted by $t_i$ resp. $c_i$. Furthermore, the number of characters between $z_{k+1}$ and $v_{a+1}$ is denoted by $\beta$. In the following, two simple observations will be used:

$$\sum_{i=1}^{k} t_i \le l_{\max} - \beta - 1 \qquad (3)$$

$$\frac{c_i}{t_i} \le \frac{c_{\max}}{\sum_{s=i}^{k} t_s + \beta} \quad \forall i \qquad (4)$$

To simplify the notation we introduce

$$T := \sum_{i=1}^{k} t_i + \beta \le l_{\max} - 1.$$

Considering observation (4) and summing up over all $i$, $1 \le i \le k$, we get an upper bound for the sum of the weights:

$$\sum_{i=1}^{k} c_i \le c_{\max} \sum_{i=1}^{k} \frac{t_i}{T - \sum_{s=1}^{i-1} t_s} \le c_{\max} \ln T.$$

The second inequality follows from Lemma 3.2. Putting things together and taking into account that the weight of the arc emanating from $z_{k+1}$ and "skipping" $v_{a+1}$ is at most $c_{\max}$ yields

$$R_{FG}(D) \le \frac{\sum_{i=1}^{k} c_i + c_{\max}}{c_{\min}} \le \frac{\ln(l_{\max}-1)\, c_{\max} + c_{\max}}{c_{\min}}.$$

The following construction gives the dictionary $D'$ and the claimed lower bound. Let us consider an alphabet with $l_{\max}$ letters $u, v, w_1, \ldots, w_{l_{\max}-2}$ and the following dictionary $D'$ (with $j = 1, \ldots, l_{\max}-2$):

source-word:  $u$         $v$         $w_j$                             $uv$        $w_j \cdots w_{l_{\max}-2} u$   $v w_1 \cdots w_{l_{\max}-2} u$
code-word:    $a$         $b$         $c_j$                             $d$         $e_j$                          $f$
weight:       $c_{\max}$   $c_{\max}$   $\frac{c_{\max}}{l_{\max}-j}$     $c_{\max}$   $c_{\max}$                     $c_{\min}$

Let $c_{\max} = (l_{\max}-1)!\, c_{\min}$ and let $S_i = u (v w_1 \cdots w_{l_{\max}-2} u)^i$ be the string to be compressed, with length $n = i\, l_{\max} + 1$. It is easy to see that $OPT(D', S_i) = a f^i$ and $FG(D', S_i) = (d\, c_1 \cdots c_{l_{\max}-2})^i a$, if we suppose that the FG algorithm resolves all ties by choosing the arc corresponding to $c_j$. This example yields

$$R_{FG}(D') \ge \limsup_{n \to \infty} \frac{\|FG(D', S_i)\|}{\|OPT(D', S_i)\|} = \lim_{i \to \infty} \frac{\left(c_{\max} + \sum_{j=1}^{l_{\max}-2} \frac{c_{\max}}{l_{\max}-j}\right) i + c_{\max}}{c_{\max} + i\, c_{\min}} = \lim_{i \to \infty} \frac{c_{\max} \left(1 + \sum_{j=2}^{l_{\max}-1} \frac{1}{j}\right) i + c_{\max}}{c_{\max} + i\, c_{\min}} \ge \frac{c_{\max} \left(1 + \frac{1}{l_{\max}-1} + \ln(l_{\max}-1) - \ln 2\right)}{c_{\min}},$$


with the last inequality due to the fact that

$$\sum_{k=2}^{n} \frac{1}{k} \ge \ln(n+1) - \ln 2. \qquad (5)$$

□

Lemma 3.2 For integers $s_1, \ldots, s_k, b$ with $s_i \ge 1$, $b \ge 1$ and $S = \sum_{i=1}^{k} s_i + b$,

$$\sum_{i=1}^{k} \frac{s_i}{S - \sum_{j=1}^{i-1} s_j} \le \sum_{i=1}^{S-b} \frac{1}{S - i + 1} \le \ln S$$

holds.

Proof. To prove the first inequality, we show that the left-hand side attains its maximum for $s_i = 1$, $i = 1, \ldots, k$. If one of the $s_i$, say $s_m$, is greater than or equal to 2, we can replace it by the two values $s_{m_1} = s_m - 1$ and $s_{m_2} = 1$. Setting $\tilde S := S - \sum_{j=1}^{m-1} s_j$, the left-hand side is thereby changed by

$$\frac{s_{m_1}}{\tilde S} + \frac{s_{m_2}}{\tilde S - s_{m_1}} - \frac{s_m}{\tilde S} = \frac{1}{\tilde S - s_m + 1} - \frac{1}{\tilde S} > 0.$$

Hence the left-hand side can be strictly increased as long as some $s_i$ is greater than 1, so its maximum is attained for $s_i = 1$ for all $i$; in that case it equals the middle expression. The second inequality is shown by bounding a harmonic series:

$$\sum_{i=1}^{S-b} \frac{1}{S - i + 1} = \sum_{i=b+1}^{S} \frac{1}{i} \le \sum_{i=2}^{S} \frac{1}{i} \le \ln S. \qquad \Box$$
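As a quick numerical sanity check of Lemma 3.2 (an illustrative sketch, not part of the proof), both inequalities can be verified on random instances:

```python
# Check  sum_i s_i/(S - s_1 - ... - s_{i-1})  <=  sum_{i=1}^{S-b} 1/(S-i+1)  <=  ln S
# for random positive integers s_1, ..., s_k and b >= 1, with S = sum(s_i) + b.
import math
import random

random.seed(1)
for _ in range(1000):
    k = random.randint(1, 8)
    s = [random.randint(1, 5) for _ in range(k)]
    b = random.randint(1, 5)
    S = sum(s) + b
    lhs = sum(s[i] / (S - sum(s[:i])) for i in range(k))
    mid = sum(1 / (S - i + 1) for i in range(1, S - b + 1))
    assert lhs <= mid + 1e-12          # first inequality (exchange argument)
    assert mid <= math.log(S) + 1e-12  # second inequality (harmonic bound)
```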

The worst-case performance of the LM- and the DG-heuristic for suffix dictionaries is given by the next proposition:

Proposition 3.3 (Békési et al. [1], Katajainen and Raita [3]) Let $D$ be a suffix dictionary. Then

$$R_{LM}(D) \le \frac{c_{\max}}{c_{\min}}$$

$$R_{DG}(D) \le \begin{cases} \dfrac{2 c_{\max} - B_t}{c_{\min}} & \text{if } c_{\max} \le \frac{3}{2} B_t \\[2mm] \dfrac{(2 c_{\max} + B_t)^2}{8 B_t\, c_{\min}} & \text{if } \frac{3}{2} B_t < c_{\max} \le (l_{\max} - \frac{3}{2}) B_t \\[2mm] \dfrac{(l_{\max}-1)\bigl(2 c_{\max} - (l_{\max}-2) B_t\bigr)}{2 c_{\min}} & \text{if } (l_{\max} - \frac{3}{2}) B_t < c_{\max} \end{cases}$$

and the bounds can be attained. □

Thus, the LM-algorithm has a better worst-case ratio than the FG-heuristic, which in turn is in most relevant cases superior to the DG-heuristic. To be precise, if $c_{\max} \le \frac{3}{2} B_t$, then $R_{DG}$ is smaller than $R_{FG}$. If $c_{\max}$ is between $\frac{3}{2} B_t$ and $(l_{\max} - \frac{3}{2}) B_t$, no strict dominance exists (e.g. if $c_{\max} = (l_{\max}-2) B_t$, FG is superior for $l_{\max} \ge 7$; if $c_{\max} = 2 B_t$, DG is superior for all $l_{\max} \ge 3$). For the last case ($c_{\max} > (l_{\max} - \frac{3}{2}) B_t$) elementary calculations show that FG is always better than DG. A more complicated analysis is necessary for the combination of the suffix and the nonlengthening property.

Theorem 3.4 Let $D$ be a nonlengthening suffix dictionary. Then

$$R_{FG}(D) \le \frac{\min\left\{ l_{\max} B_t,\; c_{\max} \left( \ln \frac{l_{\max} B_t}{c_{\max}} + 3 \right) - B_t \right\}}{c_{\min}}.$$

Proof. The first expression in the minimum clause is a trivial upper bound for any nonlengthening dictionary; it dominates the second bound if $c_{\max}$ is close to $l_{\max} B_t$. To prove the latter expression we will use the same notation as in the proof of Theorem 3.1. Furthermore, we suppose that $c_{\max} = \alpha B_t$ for some $\alpha > 0$. If the FG algorithm chooses at a vertex $z_i$ an arc with length $t_i$ and weight $c_i$, this arc has to be "better" than the suffix arc leading directly to $v_{a+1}$, which may have weight $c_{\max}$. This implies again

$$\frac{c_i}{t_i} \le \frac{c_{\max}}{\sum_{s=i}^{k} t_s + \beta},$$

whereas the nonlengthening property gives

$$\frac{c_i}{t_i} \le B_t.$$

Let $z_j$ denote the vertex with the smallest index $i \in \{1, \ldots, k\}$ for which $B_t \le c_{\max} / (\sum_{s=i}^{k} t_s + \beta)$, and let $T_i = \sum_{s=i}^{k} t_s + \beta$. It follows that $B_t \le \alpha B_t / T_j$ and hence $T_j \le \alpha$. Summing up over all weights yields

$$\sum_{i=1}^{k} c_i \le \sum_{i=1}^{j-1} \frac{t_i}{T_i}\, c_{\max} + \sum_{i=j}^{k} t_i B_t = c_{\max} \left( \sum_{i=1}^{j-1} \frac{T_i - T_{i+1}}{T_i} + \frac{T_j - \beta}{\alpha} \right) \le c_{\max} \left( (j-1) - \sum_{i=1}^{j-2} \frac{T_{i+1}}{T_i} - \frac{T_j}{T_{j-1}} + 1 - \frac{\beta}{\alpha} \right).$$


(In the last inequality, we used $\alpha \ge 1$ and $T_j \le \alpha$.) Bounding the arithmetic by the geometric mean we get

$$\sum_{i=1}^{j-2} \frac{T_{i+1}}{T_i} \ge (j-2) \left( \frac{T_{j-1}}{T_1} \right)^{\frac{1}{j-2}} \ge (j-2) \left( \frac{\alpha}{T_1} \right)^{\frac{1}{j-2}}. \qquad (6)$$

Using $x - x\, y^{1/x} \le \ln(1/y)$ for $x, y > 0$ leads to

$$\sum_{i=1}^{k} c_i \le c_{\max} \left( (j-2) - (j-2) \left( \frac{\alpha}{T_1} \right)^{\frac{1}{j-2}} + 2 - \frac{1}{\alpha} \right) \le c_{\max} \left( \ln \frac{T_1}{\alpha} + 2 - \frac{1}{\alpha} \right) \le c_{\max} \ln \frac{l_{\max} B_t}{c_{\max}} + 2 c_{\max} - B_t. \qquad (7)$$

Adding the weight of the skipping arc and dividing by $c_{\min}$ yields the desired upper bound. □

To illustrate that the upper bound shown above is almost best possible, we now give a general lower bound for suffix, nonlengthening dictionaries. A bound exactly matching the given upper bound cannot be constructed because of the bounding techniques used in the above proof. Of course, the trivial upper bound $l_{\max} B_t / c_{\min}$ can be exactly attained (see [3]).

Theorem 3.5 There exists a nonlengthening suffix dictionary $D_1$ such that

$$R_{FG}(D_1) \ge \frac{c_{\max} \left( \ln \frac{l_{\max} B_t}{c_{\max}} + 1 \right)}{c_{\min}}.$$

Proof. We consider the same dictionary $D_1$ as in the proof of Theorem 3.1, with $l_{\max}$ letters $u, v, w_1, \ldots, w_{l_{\max}-2}$ but different weights. Let $c_{\max} = (l_{\max}-1)!\, c_{\min}$ and $c_{\max} = \alpha B_t$ with $\alpha \in \mathbb{N}$.

source-word:  $u$     $v$     $w_j$        $uv$      $w_j \cdots w_{l_{\max}-2} u$   $v w_1 \cdots w_{l_{\max}-2} u$
code-word:    $a$     $b$     $c_j$        $d$       $e_j$                          $f$
weight:       $B_t$   $B_t$   $\gamma_j$    $2 B_t$   $\gamma_j\,(l_{\max} - j)$      $c_{\min}$

(with $j = 1, \ldots, l_{\max}-2$), where

$$\gamma_j = \begin{cases} \dfrac{c_{\max}}{l_{\max} - j} & \text{if } j = 1, \ldots, l_{\max} - \alpha \\[2mm] B_t & \text{if } j = l_{\max} - \alpha + 1, \ldots, l_{\max} - 2 \end{cases}$$

Let $S_i = u (v w_1 \cdots w_{l_{\max}-2} u)^i$ be the string to be compressed, with length $n = i\, l_{\max} + 1$. As in Theorem 3.1 we have $OPT(D_1, S_i) = a f^i$ and $FG(D_1, S_i) = (d\, c_1 \cdots c_{l_{\max}-2})^i a$, if we


suppose, as before, that the FG algorithm resolves all ties by choosing the arc corresponding to $c_j$. This example yields

$$R_{FG}(D_1) \ge \limsup_{n \to \infty} \frac{\|FG(D_1, S_i)\|}{\|OPT(D_1, S_i)\|} = \lim_{i \to \infty} \frac{\left( 2 B_t + \sum_{j=1}^{l_{\max}-\alpha} \frac{c_{\max}}{l_{\max}-j} + \sum_{j=l_{\max}-\alpha+1}^{l_{\max}-2} B_t \right) i + B_t}{B_t + i\, c_{\min}} = \frac{2 B_t + c_{\max} \sum_{j=\alpha}^{l_{\max}-1} \frac{1}{j} + (\alpha - 2) B_t}{c_{\min}} \ge \frac{c_{\max} \ln \frac{l_{\max}}{\alpha} + c_{\max}}{c_{\min}} = \frac{c_{\max} \left( \ln \frac{l_{\max} B_t}{c_{\max}} + 1 \right)}{c_{\min}},$$

using inequality (5). □

To compare FG with the other heuristics we give

Proposition 3.6 (Katajainen and Raita [3], Békési et al. [1]) Let $D$ be a nonlengthening suffix dictionary. Then

$$R_{LM}(D) \le \frac{c_{\max}}{c_{\min}}, \qquad R_{DG}(D) \le \frac{\min\{ l_{\max} B_t,\; 2 c_{\max} - B_t \}}{c_{\min}},$$

and the bounds can be attained. □

Surprisingly, it turns out that both LM and DG yield a (slightly) better worst-case ratio than FG.

4 Conclusions

In this paper we investigated the fractional greedy algorithm for on-line data compression. A worst-case analysis was performed for general dictionaries with the following four properties and reasonable combinations of them: code-uniform, nonlengthening, suffix and prefix. It turned out that the prefix property does not change the worst-case behaviour. For general, code-uniform and nonlengthening dictionaries the derived upper bounds can be attained and are identical to those of the previously analyzed longest matching and differential greedy algorithms (see [1] and [3]). The analysis of suffix dictionaries is more complicated and leads to bounds with logarithmic expressions, whereas all previously known bounds for on-line heuristics were rational. Because of the techniques necessary for the suffix case, a small gap remained between the worst-case examples and the upper bounds. For future research we see the following possibilities:


1. Define other reasonable properties of dictionaries which lead to attractive worst-case ratios of the LM-, DG- and FG-heuristics.

2. Develop new on-line heuristics which exploit special properties of dictionaries or input data.

3. Expand the standard on-line model by including e.g. the next $3\, l_{\max}$ characters in the decision on the next encoding step.

4. No theoretical worst-case results about dynamic dictionaries (such as the Ziv-Lempel algorithm [5]) are known. It would be very interesting to find a theoretical instrument for comparing dynamic and static dictionaries.

5. To find out more about the practical behaviour of the proposed algorithms, a probabilistic analysis might be useful.

Acknowledgement: Gábor Galambos gratefully acknowledges the hospitality of the TU Graz; Gerhard J. Woeginger gratefully acknowledges the hospitality of the JGYTF Szeged.

References

[1] J. Békési, G. Galambos, U. Pferschy and G.J. Woeginger, Greedy algorithms for on-line data compression, Report 276-93, Mathematical Institute, TU Graz, Austria, 1993.

[2] M.E. Gonzalez-Smith and J.A. Storer, Parallel algorithms for data compression, Journal of the ACM 32 (1985), 344-373.

[3] J. Katajainen and T. Raita, An analysis of the longest matching and the greedy heuristic in text encoding, Journal of the ACM 39 (1992), 281-294.

[4] E.J. Schuegraf and H.S. Heaps, A comparison of algorithms for data base compression by use of fragments as language elements, Information Storage and Retrieval 10 (1974), 309-319.

[5] J. Ziv and A. Lempel, A universal algorithm for sequential data compression, IEEE Transactions on Information Theory 23 (1977), 337-343.
