Block Sorting and Compression

Ziya Arnavut
Remote Sensing Application Laboratories, University of Nebraska, Omaha NE 68182
email: [email protected]  tel: (402)-554-2662  fax: (402)-554-3518

Spyros S. Magliveras
Department of Computer Science and Computer Engineering, University of Nebraska, Lincoln NE 68588
email: [email protected]  tel: (402)-472-5005  fax: (402)-472-1718

Abstract

The Block Sorting Lossless Data Compression Algorithm (BSLDCA) described by Burrows and Wheeler [3] has received considerable attention. It achieves compression rates as good as those of context-based methods, such as PPM, but at execution speeds closer to Ziv-Lempel techniques [5]. This paper describes the Lexical Permutation Sorting Algorithm (LPSA) and its theoretical basis, and delineates its relationship to BSLDCA. In particular, we describe how BSLDCA can be reduced to LPSA and show how LPSA can give better results than BSLDCA when transmitting permutations. We also introduce a new technique, Inversion Frequencies, and show that it does as well as Move-to-Front (MTF) coding when there is locality of reference in the data.

1 Introduction

Burrows and Wheeler [3] introduced a new algorithm, which they call the Block Sorting Lossless Data Compression Algorithm (BSLDCA). When applied to text or image data, their algorithm achieves better compression rates than Ziv-Lempel techniques at comparable speed, while its compression performance is close to that of context-based methods, such as PPM. Cleary et al. [4] have viewed BSLDCA (called BW94 by those authors) as a context-based method. Recently, Fenwick [5, 6, 7, 8] has done a comparative study on BSLDCA and concluded that BSLDCA is a "viable text compression technique, with a compression approaching that of the currently best compressors while being much faster than many other compressors of comparable performance" [5].

In this paper, we first define the Lexical Permutation Sorting Algorithm (LPSA), and show that BSLDCA is reducible to LPSA. The advantage of going to LPSA is that we now have a clear understanding of its theoretical foundations, and we gain some optimization choices not available with BSLDCA. When the underlying data to be transmitted is a permutation $p$, LPSA generates a cyclic group of order $n$ with $\phi(n)$ generators, where $\phi(n)$ is Euler's $\phi$ function of $n$. Communicating any one of the group generators and a single exponent $x$, $1 \le x \le n$, allows us to reconstruct the original data $p$. Among the $\phi(n)$ possibilities, one or more choices may be cheaper to communicate (higher compressibility) than the original permutation, while BSLDCA offers only one choice. Therefore, when communicating permutations, LPSA is superior to BSLDCA. We also introduce a new technique, Inversion Frequencies, and show that when the underlying data have locality of reference, inversion frequencies do as well as Move-to-Front coding.

2 Mathematical Preliminaries

In this work we assume knowledge of some basic mathematical concepts, including familiarity with elementary properties of standard objects of discrete mathematics. However, to set the stage for our later discussion, we begin by recalling some definitions and corresponding notation. By a permutation $\pi$ of a finite set $X$ we mean a bijection (i.e., a one-to-one and onto function) from $X$ onto itself. We use standard functional notation to denote permutations. For example, if $X = \{a, b, c, d, e\}$ and $\pi : X \to X$ is such that $\pi(a) = b$, $\pi(b) = c$, $\pi(c) = a$, $\pi(d) = e$, and $\pi(e) = d$, then we denote $\pi$ by:

$$\pi = \begin{pmatrix} a & b & c & d & e \\ b & c & a & e & d \end{pmatrix}.$$

If we select a particular fixed order for the elements of $X$, once and for all, say $[a, b, c, d, e]$, then we can specify $\pi$ by simply writing the sequence corresponding to the bottom row in the functional notation above. Thus, we can also write $\pi = [b, c, a, e, d]$. This is called the cartesian form of $\pi$. The totality of all permutations on $X$ forms a group under functional composition, called the symmetric group on $X$, and denoted by $S_X$. If $X = \{1, 2, \ldots, n\}$ we simply denote $S_X$ by $S_n$.

If $A$ is a finite alphabet of symbols, and $n$ is a positive integer, we denote by $A^n$ the set of all possible words of length $n$ with letters in $A$. Thus, the elements of $A^n$ are all sequences of the form $(Y[1], Y[2], \ldots, Y[n])$, where $Y[i] \in A$. We call elements of $A^n$ data strings. Frequently, we take $A = Z_g$, the ring of integers modulo $g$.

From a different perspective, data strings can be viewed as multiset permutations [9]. A multiset is like a set except that it can have repetitions of identical elements. For example, $M = \{1, 1, 2, 2, 2, 3, 3\}$ is a multiset. A multiset permutation of a multiset $M$ is an ordered arrangement of the elements of $M$. Hence, $[2, 1, 3, 3, 1, 2, 2]$ is a multiset permutation of $M = \{1, 1, 2, 2, 2, 3, 3\}$. Sometimes, in order to write the information more compactly, a multiset is represented in its product-exponential form $M = 1^{f_1} 2^{f_2} \cdots n^{f_n}$, where $f_i$ is the frequency of element $i$.

Given a data string $Y = (Y[1], Y[2], \ldots, Y[n])$, for $1 \le i \le n$ let $Y^{(i)}$ denote the data string formed by cyclically shifting $Y$ to the left $(i - 1)$ positions (with wrap-around). We define the lexical index permutation $\psi = \psi_Y$ for $Y$ by $\psi = \theta^{-1}$, where $\theta(j) = i$ if and only if $Y^{(i)}$ is lexically the $j$-th data string among the strings $\{Y^{(k)} : 1 \le k \le n\}$. It is easy to verify that $\psi^{-1}$ is a sorting permutation for $Y$; i.e., $Y[\psi^{-1}] = (Y[\psi^{-1}(1)], \ldots, Y[\psi^{-1}(n)])$ consists of the data string $Y$ in ascending order. For example, when $Y = (2, 1, 1, 3, 1, 2)$, $\psi = [4, 1, 3, 6, 2, 5]$ and $\psi^{-1} = [2, 5, 3, 1, 6, 4]$ sorts $Y$.
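To make the definition concrete, here is a minimal Python sketch (ours, not part of the original paper; the function name and 1-based conventions are our own) that computes the lexical index permutation by explicitly sorting the cyclic shifts:

    # Illustrative sketch: compute the lexical index permutation psi of Y.
    # Values are 1-based, matching the notation of this section.
    def lexical_index_permutation(Y):
        n = len(Y)
        shifts = [Y[i:] + Y[:i] for i in range(n)]        # shifts[i-1] = Y^(i)
        order = sorted(range(n), key=lambda i: shifts[i])
        theta = [i + 1 for i in order]                    # theta(j): which shift is j-th
        psi = [0] * n
        for j, i in enumerate(theta):
            psi[i - 1] = j + 1                            # psi = theta^(-1)
        return psi, theta

    Y = [2, 1, 1, 3, 1, 2]
    psi, theta = lexical_index_permutation(Y)
    print(psi)                        # [4, 1, 3, 6, 2, 5]
    print([Y[i - 1] for i in theta])  # [1, 1, 1, 2, 2, 3], i.e. Y in ascending order

Here theta is $\psi^{-1}$, the sorting permutation of the example above.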

3 Lexical Permutation Sorting Algorithm

Before discussing the theoretical basis of LPSA, we begin this section by giving an example. Let $p = [3, 1, 5, 4, 2]$ be a given permutation. Construct the matrix

$$N = \begin{pmatrix} 3 & 1 & 5 & 4 & 2 \\ 1 & 5 & 4 & 2 & 3 \\ 5 & 4 & 2 & 3 & 1 \\ 4 & 2 & 3 & 1 & 5 \\ 2 & 3 & 1 & 5 & 4 \end{pmatrix}$$

by forming successive rows of $N$ which are consecutive cyclic left-shifts of the sequence $p$. Let $F$ be the first, $S$ the second, and $L$ the last column of $N$. By sorting the rows of $N$ lexically, we transform it to

$$N' = \begin{pmatrix} 1 & 5 & 4 & 2 & 3 \\ 2 & 3 & 1 & 5 & 4 \\ 3 & 1 & 5 & 4 & 2 \\ 4 & 2 & 3 & 1 & 5 \\ 5 & 4 & 2 & 3 & 1 \end{pmatrix}.$$

This amounts to sorting $N$ with respect to the first column, i.e., applying a row permutation to $N$ so that its first column becomes $(1, 2, 3, 4, 5)^T$. The original sequence $p$ appears in the third row ($i = 3$) of $N'$. Let $F'$ be the first, $S'$ the second, and $L'$ the last column vector of $N'$. If the transmitter transmits the pair $(i, S')$ or $(i, L')$, then the receiver can reconstruct the original sequence $p$ uniquely. For example, if $(i, S')$ is transmitted, the receiver constructs the original sequence $p$ by using the following procedure:

1. Initially, let $p[1] = i$.
2. For $j = 2, \ldots, n$, let $p[j] = S'[p[j-1]]$.

Or, if $(i, L')$ is transmitted, the receiver applies the following:

1. $p[n] = L'[i]$.
2. For $j = 1, \ldots, n-1$, let $p[n-j] = L'[p[n-j+1]]$.

Of course, once we realize that $L'$ is the inverse of $S'$ as a permutation, the second procedure is seen to be equivalent to the one using $S'$. More generally, suppose that $A$ is an alphabet of $n$ symbols with a linear ordering. If $Y$ is a data string with elements in $A$, we denote by $N(Y)$ the $n \times n$ matrix whose $i$-th row is $Y^{(i)}$, and by $N'(Y)$ the matrix obtained by lexically ordering the rows of $N(Y)$. A small sketch of the transmit/receive procedure is given below.
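Under the stated assumptions (inputs are permutations in cartesian form, 1-based), the following Python fragment, again ours rather than the paper's code, implements the example above:

    # Illustrative sketch of the LPSA example: build N', extract the pair
    # (i, S'), and reconstruct p from it.
    def lpsa_encode(p):
        n = len(p)
        rows = sorted(p[k:] + p[:k] for k in range(n))   # rows of N'
        i = rows.index(p) + 1            # row of N' containing p (1-based)
        S = [row[1] for row in rows]     # second column S'
        L = [row[-1] for row in rows]    # last column L'
        return i, S, L

    def lpsa_decode(i, S):
        p = [i]                          # step 1: p[1] = i
        for _ in range(len(S) - 1):
            p.append(S[p[-1] - 1])       # step 2: p[j] = S'[p[j-1]]
        return p

    i, S, L = lpsa_encode([3, 1, 5, 4, 2])  # i = 3, S' = [5,3,1,2,4], L' = [3,4,2,5,1]
    assert lpsa_decode(i, S) == [3, 1, 5, 4, 2]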

We now develop a theoretical setting for the algorithms given above.

Lemma 3.1 Let $p$ be a permutation of degree $n$ given in cartesian form. Construct an $n \times n$ matrix $N$ whose first row is $p$ and each of whose rows is a left cyclic shift of the previous row. If $\tau_j$ is the $j$-th column of $N$, so that $N = [\tau_1, \tau_2, \ldots, \tau_n]$, then the result of lexically ordering the rows of $N$ is the matrix $N' = [\tau_1^{-1}\tau_1, \tau_1^{-1}\tau_2, \ldots, \tau_1^{-1}\tau_n]$, where $\tau_1^{-1}\tau_j$ is the $j$-th column of $N'$ in cartesian form.

Proof: To begin with, note that $\tau_1 = p$. Since each row of $N$ is a permutation, to sort $N$ it suffices to reorder the rows of $N$ so that the first column of the resulting matrix will be the identity permutation $[1, 2, \ldots, n]$. Hence, $N' = N[\tau_1^{-1}, *]$; that is, the $i$-th row of $N'$ is $N'[i, *] = N[\tau_1^{-1}(i), *]$. Therefore, $N'[i, j] = N[\tau_1^{-1}(i), j]$. Let $\tau_1^{-1}(i) = i'$; then $i = \tau_1(i')$ and

$$N'[i, j] = N[i', j] = \tau_j(i') = \tau_j(\tau_1^{-1}(i)).$$

Therefore, for all $i$, $\tau'_j(i) = \tau_j(\tau_1^{-1}(i))$; composing permutations left to right (first $\tau_1^{-1}$, then $\tau_j$), this says $\tau'_j = \tau_1^{-1}\tau_j$. That is,

$$N' = [\tau_1^{-1}\tau_1, \tau_1^{-1}\tau_2, \ldots, \tau_1^{-1}\tau_n]. \qquad \square$$

When we need to emphasize the dependency of $N$ and $N'$ on the input data permutation $p$, we write $N(p)$ and $N'(p)$ for $N$ and $N'$, respectively. We continue to assume that $p$ is a given permutation as in Lemma 3.1. We have the following:

Theorem 3.1 Let $\ell = \tau_1^{-1}\tau_n$ and $\sigma = \ell^{-1} = \tau_n^{-1}\tau_1$. Then $p(i+1) = \sigma(p(i))$.

Proof: Let $j$ be an entry in $p$; what follows $j$? Find $j$ in the last column of $N'$, say appearing in row $i$, and look at the index at the beginning of the $i$-th row. Since every row of $N'$ is a cyclic shift of $p$, that first entry is precisely the element that follows $j$ in $p$. But $i$ is in fact the position of $j$ in $\ell$, so the element following $j$ is $\ell^{-1}(j) = \sigma(j)$. $\square$

Corollary 3.1 Knowledge of $\ell = \tau_1^{-1}\tau_n$ and $p(1)$ allows us to recover $p$ completely.
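A few lines of Python (ours) check Theorem 3.1 and Corollary 3.1 on the running example: inverting $\ell = L'$ and iterating $\sigma$ from $p(1)$ regenerates $p$.

    # Illustrative check: p(i+1) = sigma(p(i)) with sigma = l^(-1).
    l = [3, 4, 2, 5, 1]              # l = last column L' of N'(p)
    n = len(l)
    sigma = [0] * n
    for i, j in enumerate(l):
        sigma[j - 1] = i + 1         # sigma = l^(-1) = [5, 3, 1, 2, 4]
    p = [3]                          # p(1)
    for _ in range(n - 1):
        p.append(sigma[p[-1] - 1])
    print(p)                         # [3, 1, 5, 4, 2]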

Let the matrix $N' = N'(p) = (t_{i,j})$. Note that if we interpret the second column of $N'$ as a permutation, we get

$$\sigma = \begin{pmatrix} 1 & 2 & \ldots & n \\ t_{1,2} & t_{2,2} & \ldots & t_{n,2} \end{pmatrix},$$

but the image under $\sigma$ of any index $j$ can be found by taking any row of $N'$, finding $j$ in that row, and looking at the next element in that row. Since the rows are cyclic shifts of each other, it does not matter which row we look at. In particular, we could just use the first row; i.e., we have that $\sigma = (1, t_{1,2}, t_{1,3}, \ldots, t_{1,n})$ in cycle notation. Hence, $\sigma$ is a cycle of length $n$. Moreover, it is clear that the third column, interpreted as a permutation, is simply $\sigma^2$, and in general the $k$-th column is $\sigma^{k-1}$. Hence, we have proved the following proposition.

Proposition 3.1 The columns of $N'(p)$ form a cyclic group of order $n$.

Let $\Delta$ be the set of columns of $N'$ which are generators of the cyclic group $\langle\sigma\rangle$; then $\sigma$, as well as $\ell = \sigma^{-1}$, can be completely specified as an integer power of any one element of $\Delta$. There are $|\Delta| = \phi(n)$ generators of the cyclic group $\langle\sigma\rangle$, where $\phi(n)$ is Euler's $\phi$ function of $n$. If $n$ is prime, there are $\phi(n) = n - 1$ generators of the cyclic group $\langle\sigma\rangle$.
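The following sketch (ours; the helper compose follows the left-to-right product convention used above) illustrates Proposition 3.1 on the running example: the columns of $N'(p)$ are the powers of $\sigma$, and the generators are exactly the powers $\sigma^k$ with $\gcd(k, n) = 1$.

    from math import gcd

    def compose(a, b):               # left-to-right product: apply a, then b
        return [b[x - 1] for x in a]

    sigma = [5, 3, 1, 2, 4]          # second column of N'(p) for p = [3,1,5,4,2]
    n = len(sigma)
    cols, c = [], list(range(1, n + 1))
    for _ in range(n):
        cols.append(c)               # identity, sigma, sigma^2, ...
        c = compose(c, sigma)
    gens = [cols[k] for k in range(1, n) if gcd(k, n) == 1]
    print(len(gens))                 # phi(5) = 4 candidate generators

In an actual coder, one would estimate the cost of each candidate generator and transmit the cheapest one together with its exponent.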

We now return to the general case of data strings. Let $Y$ be a data string of length $n$, from a linearly ordered alphabet $A$ of $g$ distinct symbols. Let $M = N(Y)$ and $M' = N'(Y)$. We may now state the following theorem.

Theorem 3.2 Let $Y$ be a data string as above, with lexical index permutation $\psi = \psi_Y$. If $N' = N'(\psi)$, then $M'_{i,j} = Y'[N'_{i,j}]$.

Proof: Recall that $Y^{(k)}$ denotes the data string formed by cyclically shifting $Y$ to the left $(k-1)$ positions. Let $Y'$ be the sorted version of $Y$ and $\rho = \psi^{-1}$. Then $Y' = Y[\rho]$, and $Y = Y'[\psi]$. Since the rows of $N(Y)$ and $N(\psi)$ are formed by the cyclic shifts $Y^{(k)}$ and $\psi^{(k)}$ respectively, from $Y = Y'[\psi]$ we obtain

$$N(Y) = Y'[N(\psi)].$$

Moreover, since $1 \le i < j \le n$ implies that $Y'(i) \le Y'(j)$, $Y'$ may be viewed as a monotonic function, and lexically ordering the rows of $N(Y)$ produces the same result as taking the images under $Y'$ of the result of lexically ordering the rows of $N(\psi)$; i.e., $M' = N'(Y) = Y'[N'(\psi)]$. Therefore, $M'_{i,j} = M'[i,j] = Y'[N'[i,j]] = Y'[N'_{i,j}]$, for all $1 \le i, j \le n$. $\square$

The BSLDCA algorithm of Burrows and Wheeler [3] takes a data string $Y$ and, by successively shifting it to the left, constructs an $n \times n$ matrix $M$. Then $M$ is transformed into $M'$ by lexically sorting the rows of $M$. Knowledge of the index of the original data string in $M'$, and of the last column of $M'$ (i.e., $M'[*, n]$), allows one to reconstruct the original data string; this has been shown by Burrows and Wheeler. Because the elements in the first column and the last column are neighbors, and the first column consists of the sorted elements of a given data string, the last column may be more orderly than the original data string $Y$. This has been empirically verified by Burrows and Wheeler.

Theorem 3.2 establishes the connection between LPSA and BSLDCA. When the data to be transmitted is a permutation, then, in general, LPSA will give better results than BSLDCA, because we are able to select the least expensive generator $\delta \in \Delta$, with the additional overhead of transmitting a single integer $x$, $1 \le x \le n$, such that $\delta^x = \sigma$. This amounts to an overhead bounded by $(\log n)/n$ bits per symbol. In the general case where $Y$ is not a permutation, Theorem 3.2 can be viewed as a theoretical reduction of BSLDCA to LPSA: simply transmit the sorted vector $Y'$ at very low cost (using run-length coding, say) and the lexical index permutation $\psi$ for $Y$ by means of LPSA. Although this is theoretically possible, in almost all cases BSLDCA will do better than LPSA.
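The reduction stated in Theorem 3.2 is easy to exercise numerically; the sketch below (ours) verifies $M' = Y'[N'(\psi)]$ on the example string of Section 2.

    # Illustrative check of Theorem 3.2: N'(Y) equals Y' applied to N'(psi).
    def sorted_shift_matrix(s):
        n = len(s)
        return sorted(s[k:] + s[:k] for k in range(n))

    Y   = [2, 1, 1, 3, 1, 2]
    psi = [4, 1, 3, 6, 2, 5]                 # lexical index permutation of Y
    Yp  = sorted(Y)                          # Y' = [1, 1, 1, 2, 2, 3]
    M1  = sorted_shift_matrix(Y)             # M' = N'(Y)
    M2  = [[Yp[v - 1] for v in row] for row in sorted_shift_matrix(psi)]
    assert M1 == M2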

4 Inversion Frequencies

The notion of an inversion table for a given permutation was introduced quite early [9] in an effort to provide concise representations of ordinary permutations. Several variants and types of inversions were defined at different times by different authors. In [10], Sedgewick gives some other inversion generation methods for permutations. In [1] the author studies inversion vector techniques in the context of data compression.

Certain inversion methods for multiset permutations present a solution to the well-known sparse-histogram problem, which occurs in some images [2]. The results are comparable to other known techniques but offer significant speedup. In this section, we present a new inversion technique for multiset permutations (data strings). We further show that it yields compression performance close to MTF coding if the underlying data have locality of reference.

Let $M$ be a multiset permutation of elements from an underlying set $S = \{1, 2, \ldots, k\}$. Let "$\|$" denote catenation of data strings. We define the inversion frequency vector $D = D_k$ for $M$ as follows:

1. $D_0 = \langle\,\rangle$, the empty vector.
2. $D_i = D_{i-1} \| T_i$ with $T_i = \langle x_1, x_2, \ldots, x_{f_i} \rangle$, where
   i) $x_1$ is the position of the first occurrence of $i$ in $M$, and
   ii) for $j > 1$, $x_j$ is the number of elements $y$ in $M$, $y > i$, occurring between the $(j-1)$-st and $j$-th occurrences of $i$ in $M$.

For example, for the multiset permutation $M = [1, 1, 2, 3, 1, 2, 4, 3, 4, 2, 4]$, we have $S = (1, 2, 3, 4)$. Initially $D_0 = \langle\,\rangle$. For $i = 1 \in S$, we have $D_1 = D_0 \| T_1$, where $T_1 = \langle 1, 0, 2 \rangle$, so $D_1 = \langle 1, 0, 2 \rangle$. For $i = 2 \in S$, $D_2 = D_1 \| T_2$, where $T_2 = \langle 3, 1, 3 \rangle$. Therefore, $D_2 = \langle 1, 0, 2, 3, 1, 3 \rangle$. For $i = 3 \in S$, $D_3 = D_2 \| T_3$, where $T_3 = \langle 4, 1 \rangle$, so $D_3 = \langle 1, 0, 2, 3, 1, 3, 4, 1 \rangle$. For $i = 4 \in S$, $D_4 = D_3 \| T_4$, where $T_4 = \langle 7, 0, 0 \rangle$. Hence, $D = D_4 = \langle 1, 0, 2, 3, 1, 3, 4, 1, 7, 0, 0 \rangle$.
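As an illustration, the following Python sketch (ours, not part of the paper) is a direct transcription of this definition; it uses quadratic scanning for clarity rather than efficiency.

    # Illustrative sketch: inversion frequency vector D of a multiset
    # permutation M over S = {1, ..., k}.
    def inversion_frequencies(M, k):
        D = []
        for i in range(1, k + 1):
            pos = [p for p, v in enumerate(M) if v == i]
            T = [pos[0] + 1]                       # x_1: first occurrence, 1-based
            for a, b in zip(pos, pos[1:]):
                T.append(sum(1 for v in M[a + 1:b] if v > i))
            D.extend(T)                            # D_i = D_(i-1) || T_i
        return D

    M = [1, 1, 2, 3, 1, 2, 4, 3, 4, 2, 4]
    print(inversion_frequencies(M, 4))   # [1, 0, 2, 3, 1, 3, 4, 1, 7, 0, 0]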

To recover the original multiset permutation $M$ from $D$, we need knowledge of the multiset, described by $F = (f_1, f_2, \ldots, f_k)$ and $S = (1, 2, \ldots, k)$. We initially let $M = [-, -, \ldots, -]$, where $|M| = |D| = \sum_{i=1}^{k} f_i$. From the definition of inversion frequencies, $D$ is built recursively as $D_i = D_{i-1} \| T_i$, where $T_i = \langle x_1, x_2, \ldots, x_{f_i} \rangle$, $x_1$ represents the position of the first occurrence of $i$ in $M$, and each $x_j$, $j > 1$, represents the number of elements greater than $i$ which occur between the $(j-1)$-st and $j$-th occurrences of $i$ in $M$. Hence, we can recover the elements of $M$ by first inserting $i$ in location $M[x_1]$ and, for $j = 2, \ldots, f_i$, inserting $i$ in the $(x_j + 1)$-st dash position in $M$ after the last inserted $i$.

For example, for the above multiset permutation $M$, $F = \langle 3, 3, 2, 3 \rangle$. The receiver, upon receiving the vectors $S$, $F$ and $D$, can reconstruct the multiset permutation $M$ as follows. It first creates a vector $M$ of size $\sum_{i=1}^{4} f_i = 11$. From the ordered set $S$ and $F$, it determines that the first element of $M$ is 1 and that there are three 1's in the multiset $M$. The receiver then knows that the first three entries in $D$ are the locations related to 1, and in the first pass the receiver inserts the 1's in their correct locations in $M$. Since the first entry in $D$ is 1, the location of the first 1 is position 1; hence $M = [1, -, -, -, -, -, -, -, -, -, -]$. The second entry in $D$ is 0. This means that there is no element greater than 1 between the first and second occurrences of 1. Hence, the receiver inserts the second 1 in the first blank position next to the first 1, so $M = [1, 1, -, -, -, -, -, -, -, -, -]$. The third entry in $D$ is 2. This means that there are two elements greater than 1 between the second and third 1's. Hence, the third 1 should be placed in the third empty position after the second 1. Therefore, $M = [1, 1, -, -, 1, -, -, -, -, -, -]$.

Again, from $S$ and $F$, the receiver knows that there are three 2's in $M$. Accessing the fourth position in $D$, it learns the location of the first 2 in $M$: the first 2 should occur in position 3 of $M$. Hence, $M = [1, 1, 2, -, 1, -, -, -, -, -, -]$. The receiver then proceeds to insert the second 2 into $M$. From $D$, the receiver determines that between the first 2 and the second 2 there is one element which is greater than 2. So, starting from the location of the first 2 in $M$, the receiver skips one blank and inserts the second 2 into the second blank position. Similarly, for the third 2, the receiver determines from $D$ that between the second and third 2's there are three elements which are greater than 2. Therefore, the receiver inserts the third 2 into the fourth blank position after the second 2. Hence, $M = [1, 1, 2, -, 1, 2, -, -, -, 2, -]$. Repeating the above procedure, the receiver can fully reconstruct $M$.
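The receiver's procedure just described can likewise be sketched in Python (ours); blanks are represented by None, and skipping $x_j$ blanks implements the "$(x_j + 1)$-st dash position" rule.

    # Illustrative sketch: reconstruct M from S = (1,...,k), F and D.
    def if_decode(D, F):
        M = [None] * sum(F)              # the vector of dashes
        t = 0                            # read position in D
        for i, f in enumerate(F, start=1):
            pos = D[t] - 1               # x_1: first occurrence of i
            M[pos] = i
            for j in range(1, f):
                skip = D[t + j]          # blanks (elements > i) to skip
                p = pos + 1
                while True:
                    if M[p] is None:
                        if skip == 0:
                            break
                        skip -= 1
                    p += 1
                M[p] = i
                pos = p
            t += f
        return M

    D = [1, 0, 2, 3, 1, 3, 4, 1, 7, 0, 0]
    print(if_decode(D, [3, 3, 2, 3]))    # [1, 1, 2, 3, 1, 2, 4, 3, 4, 2, 4]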

Text Data   MTF    IF     LZW
bib         2.28   2.25   3.34
book1       2.76   2.67   3.45
book2       2.40   2.34   3.28
news        2.80   2.79   3.86
paper1      2.68   2.65   3.77
paper2      2.70   2.67   3.52
progc       2.69   2.67   3.86
progl       1.90   1.92   3.03
progp       1.86   1.87   3.11
trans       1.63   1.68   3.26

Table 1: Entropy (bits per symbol) of IF versus MTF coding for text data.

Image      MTF    IF     Best JPEG
Usc-Girl   4.56   4.53   4.83
Couple     4.75   4.52   4.27
Beauty     4.30   4.24   4.24
Lady       4.14   4.00   3.81
House      5.04   4.82   4.44
Sat1       6.48   6.36   5.89
Tree       5.99   5.76   5.49
X-ray      4.73   4.65   5.17
F17        5.99   5.73   5.50

Table 2: Entropy (bits per pixel) of IF versus MTF coding for image data.

5 Move-to-Front Coding versus Inversion Frequencies

The sorting transformation in BSLDCA yields orderly data when the underlying data is text or image. Due to the nature of text data, BSLDCA generates even more orderly data when applied to text. Hence, Burrows and Wheeler obtain a better compression gain for text data than some well-known text compression algorithms. After the sorting transformation, they use MTF coding followed by run-length coding on the resulting output. In [5]-[8], Fenwick initially proposes to replace the MTF encoder; he concludes, however, that the best results are obtained when MTF encoding is indeed employed.

When we apply inversion frequencies instead of MTF coding, we obtain results almost equivalent to Burrows and Wheeler's on text data, as obtained from the ftp site mentioned in [3], and slightly better results on gray-scale images, as obtained from the USC database and UNL's Compression Laboratory. In Tables 1-2, we present the results obtained by applying move-to-front coding and inversion frequencies to the same data. Taking advantage of runs of similar elements may yield a gain of 0.1-0.25 bps on text data. For image data the gain is less significant, ranging from 0 to 0.05 bpp, both for MTF coding and inversion frequencies.
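For reference, the MTF coder that IF is compared against can be sketched in a few lines of Python (ours); each symbol is replaced by its rank in a recency list, so runs of equal symbols become runs of zeros.

    # Illustrative sketch of move-to-front coding over a fixed alphabet.
    def mtf_encode(data, alphabet):
        table = list(alphabet)
        out = []
        for s in data:
            r = table.index(s)             # rank of s in the recency list
            out.append(r)
            table.insert(0, table.pop(r))  # move s to the front
        return out

    print(mtf_encode([1, 1, 2, 3, 1, 2, 4, 3, 4, 2, 4], [1, 2, 3, 4]))
    # [0, 0, 1, 2, 2, 2, 3, 2, 1, 3, 2]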

6 Conclusions

In this paper we have presented a new algorithm, which we call LPSA (Lexical Permutation Sorting Algorithm), and its theoretical setting. We show that when the input data is a permutation, LPSA gives rise to a cyclic group $G$ of order $n$. For a given input data permutation $p$, LPSA generates a matrix $N' = N'(p)$ with $p$ occurring in row $p(1)$ of $N'$. Knowledge of the second column $\sigma$ of $N'$ and of $p(1)$ allows us to completely reconstruct $p$. To specify $\sigma$ it suffices to specify the smallest-cost generator $\delta$ of $G$ and the exponent $x$, $1 \le x \le n$, such that $\delta^x = \sigma$. Thus the overhead incurred is at most $(\log n)/n$ bits per symbol. Experimentation shows that the cost varies significantly among the $\phi(n)$ generators of $G$.

LPSA may also find applications in other fields, such as pattern matching. Knowledge of the lexical index permutation of a data string, and of the sorted form of the given data string (the $Y'$ vector), allows us to construct all the lexical subsequences of any length less than or equal to $n$, without having to generate a huge database.

We have also introduced a new data compression technique we call Inversion Frequencies (IF). Unlike MTF coding, IF forms a composite source: all information related to the smallest element in a given data string resides in the first block of the IF vector, all information regarding the second smallest element resides in the next block of the IF vector, and so on. It is important to investigate whether any further compression gain can be achieved by further examining the properties of this composite source.

References

[1] Arnavut, Z., "Applications of Permutations to Lossless Data Compression", Ph.D. Thesis, University of Nebraska - Lincoln, Lincoln, NE, December 1995.

[2] Arnavut, Z., "Applications of Inversions to Lossless Image Compression", International Conference on Optical Engineering and Instrumentation, SPIE, Denver, Colorado, August 4-9, 1996, Vol. 2847, pp. 491-499.

[3] Burrows, M. and Wheeler, D. J., "A Block-sorting Lossless Data Compression Algorithm", SRC Research Report 124, Digital Systems Research Center, Palo Alto, CA, May 1994. (ftp site: gatekeeper.dec.com, /pub/DEC/SRC/research-reports/SRC-124.ps.Z)

[4] Cleary, J. G., Teahan, W. J. and Witten, I. H., "Unbounded Length Contexts for PPM", Data Compression Conference, DCC-95, Snowbird, Utah, March 1995, pp. 52-61.

[5] Fenwick, P., "Block Sorting Text Compression", Proceedings of the 19th Australasian Computer Science Conference, Melbourne, Australia, pp. 193-202, 1996. (ftp site: ftp.cs.auckland.ac.nz, /out/peter-f/ACSC96paper.ps)

[6] Fenwick, P., "Experiments with a Block Sorting Text Compression Algorithm", The University of Auckland, Department of Computer Science, Technical Report 111, March 1995. (ftp site: ftp.cs.auckland.ac.nz, /out/peter-f/report111.ps)

[7] Fenwick, P., "Improvements to the Block-Sorting Text Compression Algorithm", The University of Auckland, Department of Computer Science, Technical Report 120, July 1995. (ftp site: ftp.cs.auckland.ac.nz, /out/peter-f/report120.ps)

[8] Fenwick, P., "Block Sorting Text Compression", The University of Auckland, Department of Computer Science, Technical Report 130, July 1995. (ftp site: ftp.cs.auckland.ac.nz, /out/peter-f/report130.ps)

[9] Knuth, D., The Art of Computer Programming, Vol. 3, Addison-Wesley Publishing Company, Reading, Mass., 1973.

[10] Sedgewick, R., "Permutation Generation Methods: A Review", ACM Computing Surveys, Vol. 9, No. 2, pp. 137-164, 1977.