
Word Based Text Compression

Alistair Moffat

Department of Computer Science, The University of Melbourne, Parkville, Victoria 3052, Australia. [email protected]


SUMMARY

The development of efficient algorithms to support arithmetic coding has meant that powerful models of text can now be used for data compression. Here the implementation of models based on recognising and recording words is considered. Move-to-the-front and several variable order Markov models have been tested with a number of different data structures. The decisions that went into the implementations are discussed first, and then experimental results are given showing English text being represented in under 2.2 bits per character. Moreover the programs run at speeds comparable to other compression techniques, and are suited for practical use.

KEY WORDS  data compression, arithmetic coding, word based compression.

INTRODUCTION

The recent development of arithmetic coding [1] and the consequent logical separation of model and coding in a data compression algorithm have meant that powerful models of text can now be realistically considered. Arithmetic coding is fast, requiring only a fixed number of arithmetic operations per bit of output, and, within the limits of the precision of the arithmetic used, is optimal with respect to the model. These two attributes have opened the way for consideration of models that better reflect the structure of the data to be compressed, which in turn leads to better compression performance.

One significant approach has been the use of a variable order Markov model, in which each character is predicted based on a finite context of immediately preceding characters. By considering such a context each character can be better predicted, and will thus require a shorter output code. The best of these schemes, the ‘Prediction by Partial Matching’ (PPM) algorithm of Cleary and Witten [2], has attained compression results in the vicinity of 2.2 bits per byte for mixed case English text, or a saving of up to 70% of the original data space. Such a saving can be of great use in the practical world, where disks are always full and hundreds of megabytes of information are daily transferred over telephone wires.

The intention with this research has been to exploit the nature of the text at a level higher than character by character. Bentley et al [3] described a word based Move to the Front (MTF) compression scheme and gave results showing their scheme representing English text in 3 to 4 bits per character. Here other ways in which word based compression might be effective are considered, and the use of arithmetic coding in conjunction with a variable order word based Markov model has allowed the development of a compression scheme that also encodes at about 2.2 bits per character, but requires fewer resources and is suitable for practical use.


WORD BASED COMPRESSION − THE MTF SCHEME

Bentley et al [3-5] suggested that text be considered as a strictly alternating sequence of words and non-words, according to some suitable definition of ‘word’, and that independent statistics should be used to control the encoding of words and non-words. When a word (similarly, non-word) is encountered for the first time it must be transmitted character by character using a subsidiary character model, but second and subsequent appearances of a word can be encoded as a single quantity. To encode known words Bentley et al suggested the use of a move to the front (MTF) list, which has the useful property of quickly adapting to changes in word usage in different sections of the text. Thus a word w would be encoded as the integer p, where p is the number of distinct words used since the most recent prior appearance of w. To encode the integer p it was suggested that either a Huffman code be constructed in a two pass approach, or that a fixed code could be accessed directly from an array for one pass processing. As examples, two fixed coding schemes were mentioned − the codes Cγ and Cδ described by Elias [6]. When an unknown word is to be transmitted it must be preceded by an ‘escape’ to indicate this fact; the suggestion made was that the code for n + 1 should be transmitted, where n is the current number of words known.
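
As an illustration of the mechanism (a minimal sketch, not taken from any of the implementations described here), an MTF list over token numbers can be kept in a simple array; encoding a token returns its current list position and moves it to the front:

    #include <string.h>

    #define MAXWORDS 2500   /* an upper bound on distinct words; the programs
                               described later allow 2500 of each token type */

    static int mtf[MAXWORDS];   /* mtf[0] is the front of the list           */
    static int nwords = 0;      /* number of distinct words seen so far      */

    /* Return the MTF list position of word w (0-based), or the old value of
       nwords if w is new, and move (or insert) w at the front.  A returned
       position equal to the previous list length signals the escape, after
       which the word must be spelt out character by character. */
    int mtf_encode(int w)
    {
        int p;
        for (p = 0; p < nwords; p++)
            if (mtf[p] == w)
                break;
        if (p == nwords)                 /* new word: caller sends the escape */
            nwords++;
        memmove(&mtf[1], &mtf[0], p * sizeof(int));   /* shift others down    */
        mtf[0] = w;
        return p;
    }

The linear scan makes each access cost proportional to the list position returned, which is one reason the data structure question is revisited for the implementations below.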

Table 1. Encoding of ‘new word’ flag for file bookA.

                                                      Encoded length (bytes)
      Coding method                                     words      non-words
 i.   n + 1, Cγ                                          3292         123
 ii.  n + 1, Cδ                                          2789         113
 iii. 1 bit flag preceding each word                      687         687
 iv.  adaptive arithmetic flag for each word              572          87

Witten, Neal and Cleary pointed out that arithmetic coding would be more suitable than both a fixed encoding of the integers and the use of dynamic Huffman codes, and that arithmetic coding would allow the scaling of the ‘new word’ flag to any desired fraction of the code space. The first experiments, summarised in Table 1, confirmed that use of n + 1 as the flag is indeed inefficient (i, ii), and is inferior even to the use of a single bit preceding each word to indicate whether the impending word is new or old (iii). More efficient still is a separate flag, implemented using adaptive arithmetic coding (iv), which on file bookA (mixed case English text of 30 844 bytes) reduced the encoded length to an average of about half a bit per word transmitted. For these experiments a ‘word’ (similarly, non-word) was taken to be a maximal sequence of alphabetic (non-alphabetic) characters, and no upper bound was placed on the number of distinct tokens that could be retained by the model.

If a word has been identified as ‘old’, a code representing the position of the word in the MTF list must be transmitted. Bentley et al discussed two different fixed encodings of the integers (i, ii), but any fixed encoding of the integers is likely to be inefficient, as it is unreasonable to expect that the list position frequency distributions for words and non-words will be the same. For example, on file bookA the first MTF list position accounted for more than 60% of the non-word accesses, but fewer than 1% of word accesses. The alternative is adaptive arithmetic coding (iv), which will mould itself to any distribution and, so long as separate statistics are kept, is suitable for encoding both words and non-words. Table 2 lists the output code lengths resulting from encoding the list positions of ‘old’ words.
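
The adaptive flag of method (iv) is no more than a two-symbol frequency model driving the arithmetic coder. A minimal sketch follows (illustrative names only; the halving threshold of 16383 is the limit used later for the full coding sets):

    /* Adaptive 'new word' flag: one count for 'old' and one for 'new', both
       starting at 1.  The arithmetic coder narrows its interval to the range
       [lo, hi) out of total returned for the symbol actually transmitted. */
    typedef struct { long cnt[2]; } flagmodel;      /* cnt[0]=old, cnt[1]=new */

    void flag_start(flagmodel *f) { f->cnt[0] = f->cnt[1] = 1; }

    void flag_range(const flagmodel *f, int s, long *lo, long *hi, long *total)
    {
        *lo = (s == 0) ? 0 : f->cnt[0];
        *hi = *lo + f->cnt[s];
        *total = f->cnt[0] + f->cnt[1];
    }

    void flag_update(flagmodel *f, int s)
    {
        f->cnt[s]++;
        if (f->cnt[0] + f->cnt[1] >= 16383) {       /* keep the total bounded */
            f->cnt[0] = (f->cnt[0] + 1) / 2;
            f->cnt[1] = (f->cnt[1] + 1) / 2;
        }
    }

Because both counts adapt, a long run of ‘old’ words quickly drives the cost of the flag well below one bit, which is the effect visible in row (iv) of Table 1.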

Table 2. Encoding of repeated words on file bookA.

                                                      Encoded length (bytes)
      Coding method                                     words      non-words
 i.   MTF, Cγ                                            6116        1604
 ii.  MTF, Cδ                                            5603        1784
 iii. MTF, 2-pass Huffman                                4509        1599
 iv.  MTF, adaptive arithmetic                           4344        1534
 v.   Zero, adaptive arithmetic                          4038        1011

Row (v) of Table 2 lists the code space that would be required by a zero order scheme if an MTF list was not used; for this file the MTF scheme is inferior to an encoding based on word frequency rather than MTF list position frequency. This is in agreement with the results given by Bentley et al, who found that zero order compression was consistently a little better than the MTF scheme for documents containing English text.

Table 3. Spelling out the unique words and non-words of file bookA.

                                                            Encoded length (bytes)
      Coding method                                           words      non-words
 i.   2-pass Huffman, terminating character                    5497         295
 ii.  arithmetic, initial 1’s, terminating character           5608         364
 iii. arithmetic, new/old escape, terminating character        5494         296
 iv.  arithmetic, new/old escape, lengths                      5342         290

The third component of a word based compression scheme is the spelling out of new words. Bentley et al suggested a zero order character scheme using Huffman coding (i), but again adaptive arithmetic coding should be preferred (ii). The best results were obtained when characters were only added to the character set when needed (iii), rather than all possible characters being given an initial count of one, and when words being spelt were preceded by an adaptive arithmetic code to indicate their length in characters (iv), avoiding the need to include an ‘end-of-word’ character in the character set. Thus at a lower level, the characters themselves should be preceded by a flag to say ‘new/old’, and characters that are not known must be spelt out in full as an atomic 8-bit quantum. This last approach means that words and non-words, which use mutually exclusive character sets, will only be encoded from amongst the relevant characters. The code space allocated to the possibility of a new character was n/(n + m), where n was the number of distinct characters encountered so far and m the total number of characters encoded during the spelling of words. Table 3 shows the improvements brought about by these changes.
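
The character model used for spelling might be sketched as follows (a hypothetical rendering, not the actual routines): characters enter the alphabet only as they are seen, and the escape that introduces a new character is given n of the n + m units of code space.

    #define ALPHABET 256

    typedef struct {
        long count[ALPHABET];    /* adaptive counts; zero until first seen   */
        long n;                  /* distinct characters seen so far          */
        long m;                  /* total characters coded while spelling    */
    } charmodel;

    /* Coding range for character c.  Known characters share the first m
       units of code space in proportion to their counts; the escape that
       precedes a brand new character occupies the remaining n units, so its
       probability is n/(n+m).  A new character is then sent as a raw 8-bit
       quantity. */
    void spell_range(const charmodel *cm, int c, long *lo, long *hi, long *total)
    {
        long cum = 0;
        int i;
        if (cm->n == 0) {                /* very first character: forced escape */
            *lo = 0;  *hi = 1;  *total = 1;
            return;
        }
        *total = cm->n + cm->m;
        if (cm->count[c] == 0) {         /* escape */
            *lo = cm->m;
            *hi = *total;
            return;
        }
        for (i = 0; i < c; i++)
            cum += cm->count[i];
        *lo = cum;
        *hi = cum + cm->count[c];
    }

    void spell_update(charmodel *cm, int c)
    {
        if (cm->count[c] == 0)
            cm->n++;
        cm->count[c]++;
        cm->m++;
    }

Coding the length of each token before its characters (method iv) then removes the need for an end-of-word symbol in this alphabet.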

Table 4. Predicted compression performance on some test files.

                  coding methods used       Predicted compression (bits per character)
Scheme          flag   repeated   unique       short     Cprog     news     bookA
MTF1             i        i         i           4.71      2.49     5.24      4.39
MTF2             iii      ii        i           3.98      2.27     4.20      3.78
MTF3             iv       iv        iv          3.49      1.97     3.79      3.16
ZeroWord         iv       v         iv          3.31      1.99     3.64      2.94

Table 4 summarises the compression that would result from the use of various combinations of these methods. MTF1 is the ‘simplest’ MTF scheme, and achieves relatively poor compression; MTF2 is a better version using a flag bit and fixed coding based on Cδ; MTF3 is the best MTF scheme, using arithmetic coding in all stages; and ZeroWord is a non-MTF scheme, using word frequencies rather than MTF list positions as the basis for the arithmetic coding. The only file for which the MTF scheme is superior to ZeroWord is the C program Cprog, and this is again in agreement with the results given by Bentley et al. The difference is a result of the type of text encountered in a program, where sequences of the type ‘A[i]=A[i]+1’ are common, and where procedures have local variables that occur frequently within a small scope and then do not reappear in the remainder of the program.


IMPLEMENTING ZERO ORDER WORD COMPRESSION

The results listed in the previous section came from statistics gathering programs that did not actually implement the compression. Those results were then used as a guide to the construction of a working ZeroWord program, which was chosen for implementation because of its slightly better performance on text than the MTF alternative. This section discusses the implementation and resource requirements of the ZeroWord program and gives precise compression results.

The program was composed of three modules. The first module was the data compression model, and included routines for breaking up the text into word tokens and non-word tokens, for searching a hash table and returning a unique token number, and for spelling out new tokens character by character. A token was defined to be an unbroken sequence of up to 20 characters, either alphabetic characters for a word or non-alphabetic characters for a non-word. Tokens longer than 20 characters were broken into two (or more, if necessary) parts to maintain this bound. For example, a 31 character non-alphabetic string would be broken into a 20 character non-word, a 0 character word, and then an 11 character non-word. The decoding part of the model included routines for learning and storing new tokens, and for re-creating the text from a sequence of token indices.

Six distinct arithmetic encoding distributions were maintained as part of the model, and each such adaptive distribution used as a basis for the encoding will be called a coding set. For each of words and non-words there was one coding set for the zero order word distribution, one for the length of words being spelt, and one for the characters being used to spell out the words. The flag code indicating whether each symbol was known or not known was automatically sent for each symbol number passed to the coding set routines.

The model was written so that new tokens encountered after the dictionary reached some pre-specified limit (maxwords) were spelt and then discarded without being added to the dictionary. It is possible that this policy might cause some loss of compression efficiency on a very long document if the first maxwords distinct tokens of each type are not a representative sample of the tokens that are used in later sections. If this became a problem tokens might be more usefully discarded according to some sort of least recently used policy. Notice however that the token distribution used is adaptive. Thus, although no new words would be learnt once the dictionary was full, the arithmetic coding of the words that were known would change to reflect different frequency patterns and a different ratio of new to old words. All of the experiments reported below allowed a maximum of 2500 distinct words and another 2500 distinct non-words, larger than the maximum of any of the test files, and so tokens were never in fact discarded. In their experiments Bentley et al only considered cases with maxwords ≤ 256, but no advantage could be seen in unnecessarily restricting the size of the lexicons built up by the model.
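
The tokenising rule just described might look something like the following sketch; the function name and the caller's strict alternation of word and non-word requests are assumptions for illustration, not a transcription of the actual program.

    #include <ctype.h>
    #include <stdio.h>

    #define MAXTOKEN 20     /* longest token retained, as described above */

    /* Read the next token from fp into buf: a maximal run of alphabetic
       characters when want_word is 1, or of non-alphabetic characters when
       want_word is 0, truncated at MAXTOKEN characters.  Returns the token
       length, which may legitimately be zero, or -1 at end of file. */
    int get_token(FILE *fp, char buf[MAXTOKEN + 1], int want_word)
    {
        int len = 0, c = EOF;
        while (len < MAXTOKEN && (c = getc(fp)) != EOF) {
            if ((isalpha(c) != 0) != want_word) {    /* wrong class: put back */
                ungetc(c, fp);
                break;
            }
            buf[len++] = (char)c;
        }
        buf[len] = '\0';
        return (len == 0 && c == EOF) ? -1 : len;
    }

Because a request for a word can legitimately return a token of length zero, the strict word/non-word alternation is preserved even when a long run of one class is split, as in the 31 character example above.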

At the next level the coding set module implemented a data structure for recording the distribution of frequencies used by the adaptive arithmetic encoding. Several different data structures were tested. Witten et al used a sorted linear list in their illustrative arithmetic coding implementation, and this possibility was considered again for the coding sets. Other data structures tested, all with better asymptotic efficiency than the list, were the implicit tree structure of Moffat [7], binary search trees, and the splay trees of Sleator and Tarjan [8]. The different coding set implementations are discussed in detail below.

The coding sets were made locally adaptive by halving all of the frequency counts (but letting none of them reach zero) whenever the total count for the coding set reached 16383, the maximum value that could be handled by 32-bit integer arithmetic while carrying out the arithmetic encoding. If this halving process is undesirable, or if more than 16383 distinct words are to be retained, then 64-bit arithmetic could be used, but on the hardware used this would add an overhead of about 30% to the time used by the arithmetic encoding routines [9], which in turn required about 30% of the total running time of the program. Another way to both avoid the periodic halvings and reduce the likelihood of the lexicons becoming full would be to use the sliding window approach of Knuth [10].

The third module contained routines for arithmetic encoding and decoding. The routines were very similar to those described by Witten, Neal and Cleary (including the use of macro calls for input_bit() and output_bit()), except that the searching and updating of coding sets were carried out by separate procedures rather than in-line code. This was necessary to maintain the separation between coding and data structure, so that the various data structures could use a uniform interface. The zero order model required about 300 lines of C code, the data structure modules 200 to 300 lines, and the arithmetic coding required 250 lines. All experiments were run on a VAX 11/780 computer under Unix BSD4.3; the running times listed are as reported by the Unix time command and are accurate to within about 10%.
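
The shape of the encoding step that consumes this interface is sketched below. It follows the Witten, Neal and Cleary description of arithmetic coding [1] rather than reproducing the actual routines, and output_bit is shown as a placeholder that writes printable bits. Each coding set, whatever its internal structure, supplies a cumulative range [lo, hi) out of a running total for the symbol being coded.

    #include <stdio.h>

    #define TOP        0xFFFFL          /* 16-bit code values                 */
    #define FIRST_QTR  (TOP / 4 + 1)
    #define HALF       (2 * FIRST_QTR)
    #define THIRD_QTR  (3 * FIRST_QTR)

    static long low = 0, high = TOP;    /* current coding interval            */
    static long bits_to_follow = 0;     /* opposite bits pending on underflow */

    static void output_bit(int bit) { putchar(bit ? '1' : '0'); }

    static void bit_plus_follow(int bit)
    {
        output_bit(bit);
        while (bits_to_follow > 0) {
            output_bit(!bit);
            bits_to_follow--;
        }
    }

    /* Narrow the interval to the symbol's share [lo, hi) of total, then emit
       whatever leading bits of low and high now agree. */
    void encode_range(long lo, long hi, long total)
    {
        long range = high - low + 1;
        high = low + (range * hi) / total - 1;
        low  = low + (range * lo) / total;
        for (;;) {
            if (high < HALF) {
                bit_plus_follow(0);
            } else if (low >= HALF) {
                bit_plus_follow(1);
                low -= HALF;  high -= HALF;
            } else if (low >= FIRST_QTR && high < THIRD_QTR) {
                bits_to_follow++;       /* underflow: remember an opposite bit */
                low -= FIRST_QTR;  high -= FIRST_QTR;
            } else
                break;
            low = 2 * low;  high = 2 * high + 1;
        }
    }

With total never exceeding 16383 the product range * hi stays within 32-bit precision, which is exactly the constraint that motivated the count halving described above.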

Coding Set Implementation

In discussing the coding set data structures it is supposed that n is the number of distinct symbols in the coding set, m the number of symbols to be coded using the coding set, and b the number of output bits produced by the encoding. The time required for the arithmetic encoding, excluding the time needed to calculate the parameters, which will be charged to the data structure, is O(m + b).

Linear List. In their description of arithmetic encoding Witten, Neal and Cleary assumed the coding set to be represented as a linear list stored in an array, ordered by decreasing symbol frequency. This data structure is fast for small coding sets, and asymptotically efficient given a highly skew symbol access distribution. For a less skew distribution the worst case running time for encoding in an adaptive model might be as large as O(mn + b).

Linear Time Implicit Tree. The necessary manipulations on this data structure can be carried out in time linear in the number of inputs and outputs, O(m + b). The implementation tested was an optimised version of that described by Moffat [7], tailored for adaptive coding. The main change was that the count fields were assumed to be always incremented and were processed in a single pass over the tree whenever possible, at the same time as the coding bounds were being calculated. For simplicity the original presentation described these two operations as two distinct passes. The disadvantage of this and the List data structure is that they are array based, requiring that the maximum number of symbols handled be predetermined. On small files a large value for maxwords would be wasteful of memory space, but a small limit would be overly restrictive on large files. Compression results with restricted dictionary sizes showed a clear tradeoff between the memory space used and the compression attained.

Binary Search Tree (BST). The third dictionary structure considered was a binary search tree. The implicit binary tree discussed above is stored in an array, and each node is allocated storage at the time the array is declared. The dynamic binary search tree has the advantage that space for nodes need only be allocated as it is required, avoiding the large fixed allocation of memory space. Each node of the tree recorded the frequency count for the symbol, and the total of the frequency counts in the left subtree of that node, meaning that absolute positions in the ordered coding set could be calculated while the tree was being searched for a particular symbol. Each node required two pointers, two integer counts, and a symbol number, totalling 14 bytes, and so was more space efficient than the List and Tree implementations (with maxwords = 2500) whenever fewer than about 1400 symbols were being stored. The space required per node could be further reduced to 10 bytes if, in the style of C, the pointers were regarded as 16-bit node indices rather than full 32-bit pointers and the nodes were all stored in a single large array. The implementations discussed here used full 32-bit pointers and dynamic allocation of nodes. Even for the character coding sets this approach had a space advantage: the List implementation required that all 256 characters be allocated space, but on typical text there were fewer than 40 characters used in spelling non-words and only slightly more for spelling words. Each of the trees was accessed by symbol number, but with the bit pattern reversed to prevent the tree degenerating into a linked list. With unsigned 14-bit symbol numbers this meant that sequentially allocated token numbers 1, 2, 3, 4, 5... mapped on to 8192, 4096, 12288, 2048, 10240... and forced the creation of a balanced tree with a total data structure access time that was O(m log n). Reversing the bit pattern of a symbol number required only two array accesses, two shifts, and two arithmetic operations. No attempt was made to dynamically reorganise the trees; instead it seemed reasonable to expect that frequently accessed symbols would occur early in the document and would naturally appear near the root of the tree.
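
A sketch of such a node and of the cumulative lookup might run as follows (illustrative only: insertion and the count maintenance along the search path are omitted, and the table-driven bit reversal mentioned above is replaced by a simple loop):

    #include <stddef.h>

    /* Each node carries its own count and the sum of the counts in its left
       subtree, so a symbol's cumulative 'lo' bound accumulates on the way
       down.  The tree is keyed on the bit-reversed symbol number. */
    typedef struct node {
        struct node *left, *right;
        long count;            /* frequency of this symbol                    */
        long leftsum;          /* total frequency in the left subtree         */
        int  key;              /* bit-reversed symbol number                  */
    } node;

    /* Return the cumulative count of all symbols ordered before key, and set
       *count to the symbol's own frequency (0 if the symbol is not present). */
    long bst_lookup(const node *t, int key, long *count)
    {
        long lo = 0;
        while (t != NULL) {
            if (key < t->key)
                t = t->left;
            else if (key > t->key) {
                lo += t->leftsum + t->count;
                t = t->right;
            } else {
                *count = t->count;
                return lo + t->leftsum;
            }
        }
        *count = 0;
        return lo;
    }

    /* Reverse the low 'bits' bits of a symbol number so that sequentially
       allocated numbers 1, 2, 3, ... spread evenly through the key space. */
    int reverse_bits(int s, int bits)
    {
        int r = 0, i;
        for (i = 0; i < bits; i++) {
            r = (r << 1) | (s & 1);
            s >>= 1;
        }
        return r;
    }

When a count is incremented, the leftsum fields on the search path from the root must also be adjusted; that bookkeeping is not shown here.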

Splay Tree (SPT). A splay tree [8] is a binary search tree with the addition of ‘self-organising’ heuristics. After each access the tree is adjusted by a sequence of local operations that have the effect of reducing the time required for subsequent operations in the tree. Thus any single operation may be time consuming, but a sequence of operations is guaranteed to run quickly. For data compression purposes, where it is actively hoped that the access pattern will be skew, the dynamically changing splay tree might thus be significantly better than the binary search tree. In the worst case, the total time to be charged to the data structure is again O(m log n). The splaying operations require the addition of a parent pointer to each node, and Sleator and Tarjan give a mechanism by which this can be done without requiring additional data space. In the implementation tested the simpler and faster method of adding an additional pointer was used.

Experimental Results for Zero Order Word Compression

The following tables describe the results of running the zero order word compression program (ZeroWord) on a set of test files. Also listed are the results of two other data compression programs − an optimised version of the zero order character program (ZeroChar) given by Witten, Neal and Cleary; and the Unix utility Compress, a highly optimised program implementing a form of Ziv-Lempel coding [11-13].

Table 5. Compression performance.

                  Compression (bits per character)
file            ZeroChar     Compress     ZeroWord
short             4.85         4.55         3.29
Cprog             3.58         2.52         1.99
object            5.77         5.38         5.30
skewwords         1.02         0.30         0.01
news              4.96         4.49         3.64
bookA             4.53         3.91         2.93
session           4.89         2.63         2.34
csh               4.75         3.43         2.65
bookB             4.68         3.32         2.50

Table 5 lists the compression performance of the three schemes. ZeroWord compression was superior to both ZeroChar and Compress on all of the test files. File skewwords, consisting of 2000 repetitions of ‘aaaabaaaa’ separated by new-line characters, was created to show word coding at its best, and was unambiguously compressed to 13 bytes of output, or almost 200 input characters per bit. This result highlights the superiority of arithmetic coding over Huffman coding on highly skewed symbol distributions.

Table 6. Average throughput for encoding and decoding.


                  Throughput, averaged over encoding and decoding (kbyte/sec)
                                                      ZeroWord
file            ZeroChar    Compress      List      Tree      BST      SPT
short              4.2        12.2         2.4       2.8      3.4      2.4
Cprog              4.4        21.7         4.2       4.7      5.0      3.8
object             2.5        15.0         1.5       2.3      2.5      1.8
skewwords          6.1        35.5        13.0      14.0     14.0     12.2
news               3.8        16.3         1.7       2.9      3.1      2.2
bookA              4.1        17.8         1.9       3.4      3.7      2.5
session            3.7        21.7         3.2       4.4      4.8      3.6
csh                3.9        18.6         2.3       3.8      4.2      3.0
bookB              3.9        18.6         2.0       3.8      4.3      3.1

Table 6 lists running times. As expected, the List coding set was acceptably fast for the files that had small lexicons, but was inefficient for the larger files. The splay tree was consistently worse than the simpler binary search tree, and it would appear that the binary search tree was sufficiently balanced and the access pattern sufficiently even that the cost of the self-adjustment steps was not recouped by reduced running time. The implicit tree was also a little slower than the binary search tree on all of the files, but was faster than the splay tree. For the Tree, BST, and SPT programs decoding was slightly faster than encoding, while for the List implementation encoding was slightly faster. In no case was the time difference between encoding and decoding more than about 20%.

The BST implementation processed the larger files at a rate of about 4 kbyte per second for each of encoding and decoding. The BST program also ran at about the same speed as ZeroChar, but was four to five times slower than Compress. The nature of the Ziv-Lempel algorithm used in Compress means that it is probably a naturally faster coding scheme, but it also seems reasonable to suppose that the ZeroWord program could be substantially improved if it were to be written as a production program rather than a testbed. For example, Compress uses almost no procedure calls and contains in-line assembler language code; were the same attention spent upon the modular ZeroWord test code, it would be hoped that running times might decrease by as much as a factor of two.

Table 7. Space requirements for ZeroWord.

                            Space required (kbyte)
                                               coding set               total
file          hash table    words     List, Tree    BST     SPT          BST
short             11           3           43         7       9           21
Cprog             11           6           43         9      12           26
object            17          15           43        25      32           57
skewwords          8           1           43         1       1           10
news              17          14           43        23      30           54
bookA             17          14           43        23      29           54
session           18          19           43        26      33           63
csh               20          18           43        28      37           66
bookB             27          28           43        45      58          100

Table 7 shows the amount of memory space required by the ZeroWord program during encoding and decoding. The two hash tables each consisted of a fixed allocation of 4000 bytes for 1000 head-of-list pointers, and each word stored required an additional pointer for the chaining of collisions. The first column of Table 7 lists the space attributable purely to the hash table. The second column lists the space required to store the words of the document. With all words packed into a single large array, each word could be represented using only 2 bytes for a starting index, one byte to record the length, and one byte for each character of the word. Even on the largest of the files only about 30 kbytes were required for storage of the dictionaries.

The next three columns list the space required by the coding sets. For the List and Tree implementations with maxwords = 2500 the storage required by the word coding sets was 39 kbytes, with another 4 kbytes required by the character coding sets. This allocation was fixed, and independent of the size of the document being processed. For the BST and SPT coding sets the allocation depended on the number of coding set nodes. The last column lists the total data space required by the BST implementation. To this should be added the 20 kbytes of space required by the program code, but even with this included the compression can be effected in under 100 kbytes for all of the files except bookB. In comparison, Compress requires as much as 400 kbytes of data space [14].
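
That arrangement might be sketched as follows; the code is hypothetical rather than the program itself, and the hash function in particular is an arbitrary choice since none is specified.

    #include <string.h>

    #define HASHSIZE 1000      /* head-of-list pointers, as in the text      */
    #define MAXWORDS 2500
    #define POOLSIZE 32768     /* assumed size of the packed character pool  */

    static char  pool[POOLSIZE];              /* all word texts, packed      */
    static unsigned short wstart[MAXWORDS];   /* 2-byte starting index       */
    static unsigned char  wlen[MAXWORDS];     /* 1-byte length (at most 20)  */
    static short wnext[MAXWORDS];             /* hash chain                  */
    static short head[HASHSIZE];              /* bucket heads, -1 when empty */
    static int   nwords = 0;
    static int   used = 0;

    void dict_init(void)
    {
        int i;
        for (i = 0; i < HASHSIZE; i++)
            head[i] = -1;
    }

    static unsigned hash(const char *s, int len)    /* an arbitrary choice */
    {
        unsigned h = 0;
        while (len-- > 0)
            h = h * 31 + (unsigned char)*s++;
        return h % HASHSIZE;
    }

    /* Return the token number of s, adding it to the dictionary if it is new;
       -1 means the lexicon is full and s should be spelt but not stored. */
    int lookup(const char *s, int len)
    {
        unsigned h = hash(s, len);
        int i;
        for (i = head[h]; i != -1; i = wnext[i])
            if (wlen[i] == len && memcmp(pool + wstart[i], s, (size_t)len) == 0)
                return i;
        if (nwords >= MAXWORDS || used + len > POOLSIZE)
            return -1;
        wstart[nwords] = (unsigned short)used;
        wlen[nwords]   = (unsigned char)len;
        memcpy(pool + used, s, (size_t)len);
        used += len;
        wnext[nwords] = head[h];
        head[h] = (short)nwords;
        return nwords++;
    }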

IMPLEMENTING FIRST ORDER WORD COMPRESSION

The PPM scheme of Cleary and Witten achieves very good compression by the explicit use of a variable order Markov model. Each character is predicted using some finite number of preceding characters to establish a context. For example, in the context ‘ telep’, ‘h’ is more likely to be the next character than ‘e’, even though ‘e’ occurs more frequently than ‘h’ if no context is considered. Clearly the same observation can be made for words − within this article the word ‘arithmetic’ has been frequently followed by ‘coding’ and never by ‘the’, even though the latter is probably the most frequent word overall. With this in mind the ZeroWord program of the previous section was extended to make a word level variation of the PPM scheme.

The changes were reasonably straightforward. To enable the decoder to decode the message unambiguously, another flag was prefixed to the code for each token to indicate whether the first or zero order (or ‘minus one’, if the token was unknown) context was used. Each token was thus encoded by one of three possible messages: ‘known in the context of the previous token (of this type), here is an arithmetic code in that coding set’; ‘not known in the context of the previous token, but known in the document, here is a code for the token in the corresponding zero order coding set’; and ‘not known in the context of the previous token, not known for the document, here is how this new token is spelt’. For example, the first time ‘arithmetic coding’ appears the first order context of ‘arithmetic’ will not contain ‘coding’, and so an escape flag will be sent, followed by ‘coding’ transmitted as a zero order code, or as a sequence of letters if it is encountered for the first time. Thereafter ‘coding’ can be predicted in the first order context of ‘arithmetic’, with hopefully a more concise code resulting.

The implementation for word coding differed from the character PPM suggestions of Cleary and Witten in two areas. Firstly, the counting strategies were altered. When a word was successfully predicted in the first order context, Cleary and Witten stipulated that the zero order count for the word should be incremented as well as the first order count. This would have meant that each time ‘arithmetic’ was used to predict ‘coding’, the count for ‘coding’ was also incremented in the zero order coding set. However this seemed unnecessary: if ‘coding’ is only being encountered in the context of ‘arithmetic’ and it is being successfully predicted using that context, it does not seem necessary for it to be given an increased code space allocation in the zero order context. The zero order context should thus record the distribution of words that are not being successfully predicted in a first order context, and words should move through the zero order context into the appropriate first order contexts. Experiments in which both coding sets were incremented when a word was predicted by the first order context were also carried out, and on all of the test files slightly better compression was attained by single counting; on bookB, for example, the difference was about 0.03 bits per character. The double counting approach also required extra running time, as two coding sets needed to be processed rather than one. This was a second and more practical reason for preferring not to increment both coding sets.

The second change was also made in the interests of running time, but this time at the expense of compression performance. Suppose that the first order context for word A contains word B; that C is known in the zero order context but not in the first order context of A; and that A C is encountered. Since C is not known in the context A, an escape must be transmitted and then C transmitted in the zero order context. Also known in the zero order context will be B. But, after the transmission of the escape, B can be excluded from zero order consideration, reducing the code length required for the transmission of C. However in our implementation these exclusions were not calculated, as the computational overhead would have been very high. Results from a statistics gathering program indicated that the loss due to this was very small, and on the larger files the codes produced were less than 1% longer than they would have been had exclusions been implemented.

The ‘drop from first order to zero order context’ code was an adaptive arithmetic code for each word in the dictionary, so that each word built up its own statistics as to whether or not it was a good predictor of following words. This was equivalent to method B of Cleary and Witten.
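
Putting these pieces together, the encoding of one token in the first order scheme has roughly the following shape. This is a structural sketch only: the helper names are hypothetical, and dictionary maintenance (assigning a number to a brand new token and adding it to the zero order coding set), full-lexicon handling, and the character level coding are all omitted.

    typedef struct codeset codeset;      /* adaptive distribution, as before  */

    extern codeset *first_order[];       /* one coding set per known token    */
    extern codeset *escape_flag[];       /* per-token hit/miss flag (method B) */
    extern codeset *zero_order;          /* zero order token distribution     */
    enum { MISS = 0, HIT = 1 };

    int  cs_contains(codeset *cs, int tok);
    void cs_encode(codeset *cs, int sym);    /* arithmetic code, count update */
    void cs_encode_new(codeset *cs);         /* the 'new token' flag          */
    void cs_add(codeset *cs, int tok);       /* admit tok with a small count  */
    void spell_token(const char *text);      /* length, then the characters   */

    /* Encode one token given the previous token of the same type. */
    void encode_token(int prev, int tok, int is_new, const char *text)
    {
        if (!is_new && cs_contains(first_order[prev], tok)) {
            cs_encode(escape_flag[prev], HIT);    /* predicted by the context  */
            cs_encode(first_order[prev], tok);    /* single counting: only the
                                                     first order count changes */
        } else {
            cs_encode(escape_flag[prev], MISS);   /* drop to zero order        */
            if (is_new) {
                cs_encode_new(zero_order);        /* unknown in the document   */
                spell_token(text);
            } else {
                cs_encode(zero_order, tok);       /* known, coded zero order   */
            }
            cs_add(first_order[prev], tok);       /* learn the transition so
                                                     prev can predict tok next */
        }
    }

The decoder makes the same sequence of decisions, reading the method B flag first, so encoder and decoder stay in step.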

Table 8. First order word compression.

              Compression    Improvement     Throughput (kbyte/sec)    Total data space (kbyte)
file          (bits/char)    (bits/char)        BST         SPT                 BST
short            3.16           0.13            2.3         2.0                  41
Cprog            1.67           0.32            4.2         3.6                  57
object           4.69           0.61            2.2         1.6                 147
skewwords        0.01           0.00           13.9        12.6                  10
news             3.35           0.29            2.4         1.9                 128
bookA            2.76           0.17            2.9         2.2                 141
session          1.63           0.71            4.3         3.5                 142
csh              2.33           0.32            3.3         2.6                 198
bookB            2.17           0.33            3.6         2.8                 346

At the character level the predictions were also initially attempted using a first order context, dropping to zero order only when necessary. The same counting and escape strategies were used at the character level as were used at the word level. On the small files the character level coding change accounted for most of the compression improvement, but on the larger files the improvement was primarily because of the first order word predictions. For example, on bookB the change in character coding saved an average of 0.08 bits per character and the change in word coding saved 0.25 bits per character.

Because of the large number of coding sets involved − one for each of the tokens in the document − it was imperative that they take up minimal space, and the List and Tree implementations were not suitable. Implementations using BST and SPT coding sets were tested with the first order model, which was about 100 lines longer than the zero order model. Table 8 lists compression performance, running time, and space requirements for these two variable first/zero order compression programs. The BST implementation was again the fastest, and required only about 20% more time than the zero order implementation. The data space required increased significantly, and most of the files required a data space larger than the file itself. Compression improved slightly; on bookB, for example, a further 13% space saving was possible. The third column of Table 8 lists the improvement in compression achieved by changing from zero to first order.


Higher Order Word Compression

Table 9. Second order word compression.

              Compression    Improvement    Throughput (kbyte/sec)    Total data space (kbyte)
file          (bits/char)    (bits/char)            BST                        BST
short            3.22          -0.06                1.7                         81
Cprog            1.61           0.06                3.1                        118
object           4.56           0.13                1.2                        331
skewwords        0.01           0.00               12.2                         10
news             3.35           0.00                1.7                        275
bookA            2.79          -0.03                1.9                        317
session          1.50           0.13                3.2                        305
csh              2.30           0.03                2.2                        475
bookB            2.17           0.00                2.1                        875

After establishing that a first order word scheme gave improved compression over the zero order scheme it was natural to consider a second order scheme. This time there was no significant compression improvement, and in fact on two of the files the compression was a little worse. Table 9 lists the results of a second order program using BST coding sets, and the improvement achieved by changing from first to second order is listed in the third column. At both the character and the word levels there was little net improvement, because more code space was wasted by the lack of exclusions, and because the relatively small sample sizes used to establish the second order frequencies meant that the extra escape codes consumed more code space than was saved by the second order predictions. Second order compression also required a great deal more space and time, and there would appear to be no advantage in extending the variable order word compression beyond first order. Only on very large (in the megabyte range) files might there be some advantage, but then the resource costs would be very high.

Bell [15] gives results for a large number of compression algorithms on files short, Cprog, bookA, session, csh, and bookB; of these methods character PPM and the DMC scheme of Cormack and Horspool [16] were consistently the best and second best respectively. In the experiments described here the first order word scheme improved slightly upon both of those methods for all of the files except session, and the second order scheme was better on session as well. Table 10 shows overall compression figures for those six files, calculated by taking the weighted average of the individual figures. The figures for the PPM and DMC algorithms are taken from Bell.


Table 10. Overall compression performance on six test files.

Compression Algorithm      Overall compression (bits/byte)
ZeroChar                               4.67
Compress                               3.26
MTF3                                   2.66
ZeroWord                               2.53
DMC                                    2.45
PPM                                    2.28
FirstWord                              2.15
SecondWord                             2.12

In their description of PPM Cleary and Witten estimate an encoding speed of about 1 kbyte per second (on a VAX 11/780) and a space requirement of as much as five hundred kbytes†, and it seems likely that the DMC program of Cormack and Horspool would require comparable resources. Within this framework it can be seen that second order word compression also uses similar resources to attain marginally better compression; first order word compression uses less data space and is faster in operation while still obtaining very good compression; and when space is limited to 100 kbytes or less the zero order word compression approach gives compression better than current techniques.

† These estimates have recently been shown to be pessimistic, and in fact the PPM scheme can be implemented to run at about 2 to 3 kbyte per second and within two hundred kbytes of memory with minimal compression loss [17].

SUMMARY

Experiments have shown that word based compression schemes, using the concept of a coding set and an underlying arithmetic coder, can be a space and time efficient method of attaining good compression on text type documents. Overall the best model tested was a first order scheme, attaining compressed representations of English text requiring as little as 2.2 bits per character, while the zero order model was more economical in terms of time and space and still compressed text to 2.5 bits per character. Both were better than the MTF scheme on the test files containing English text, and both compressed at speeds comparable to previous zero order character models, speeds that were fast enough to be practical. The zero order scheme has the additional advantage of being feasible in a limited amount of memory space, such as on a microcomputer, and yet still gives good compression.

A similar second order scheme required significantly more resources and gave no appreciable compression improvement over the first order scheme. On non-text the advantages of considering ‘words’ are not as apparent, but the good performance of the word based methods on file object shows that the approach is not necessarily bad. Moreover, on non-text the worst that can happen with these schemes is that they degrade gracefully to a corresponding character compression model. There thus seems little reason why the ZeroWord and FirstWord schemes should not be used for practical and productive data compression in a wide variety of applications.

Appendix − The Nine Test Files

The nine test files were chosen to give a wide range of sizes and data. Table 11 gives statistics for these test files.

Table 11. Test file statistics.

                    size              distinct
file              (bytes)       words     non-words      tokens
short                4510         310          61          1670
Cprog               15072         239         289          3190
object              16384         349        1090          3940
skewwords           20000           1           1          4000
news                20309        1290         258          6518
bookA               30844        1448          94         10978
session             57127         688         995         12790
csh                 60997        1480         450         21026
bookB              139521        2411         735         52514

short      − the first 100 lines of file bookB
Cprog      − a commented C program
object     − executable image from the Vax C compiler for an early version of ZeroWord
skewwords  − 2000 copies of ‘aaaabaaaa’ separated by new-line characters
news       − an item taken from the moderated Usenet newsgroup comp.risks (volume 4,
             issue 73) including mail header
bookA      − a section of a book, free of any formatting commands
session    − a transcript of a terminal session, including screen editing
csh        − manual entry for the Unix ‘csh’ command, including formatting commands
bookB      − a section of a book, including embedded formatting commands.


References

1.  I. Witten, R. Neal and J. Cleary, ‘Arithmetic coding for data compression’, Comm. ACM, 30, 520-541 (1987).
2.  J. Cleary and I. Witten, ‘Data compression using adaptive coding and partial string matching’, IEEE Trans. Communications, COM-32, 396-402 (1984).
3.  J. Bentley, D. Sleator, R. Tarjan and V. Wei, ‘A locally adaptive data compression scheme’, Comm. ACM, 29, 320-330 (1986).
4.  B. Ryabko, ‘Technical correspondence on “A locally adaptive data compression scheme”’, Comm. ACM, 30, 792 (1987).
5.  R. Horspool and G. Cormack, ‘Technical correspondence on “A locally adaptive data compression scheme”’, Comm. ACM, 30, 792-794 (1987).
6.  P. Elias, ‘Universal codeword sets and representations of the integers’, IEEE Trans. Information Theory, IT-21, 194-203 (1975).
7.  A. Moffat, ‘A data structure for arithmetic coding on large alphabets’, Proceedings of the 11th Australian Computer Science Conference, Brisbane, 309-317 (1988).
8.  D. Sleator and R. Tarjan, ‘Self-adjusting binary search trees’, J. ACM, 32, 652-686 (1985).
9.  R. Neal, private communication, 1987.
10. D. Knuth, ‘Dynamic Huffman coding’, J. Algorithms, 6, 163-180 (1985).
11. J. Ziv and A. Lempel, ‘Compression of individual sequences via variable rate coding’, IEEE Trans. Information Theory, IT-24, 530-536 (1978).
12. T. Welch, ‘A technique for high performance data compression’, IEEE Computer, 17, 8-20 (1984).
13. S. Thomas and J. Orost, Compress (version 4.0), program and documentation, available from [email protected], 1985.
14. J. Orost, Compress.digest, 2 (1987).
15. T. Bell, ‘A unifying theory and improvements for existing approaches to text compression’, PhD dissertation, University of Canterbury, Christchurch, New Zealand, 1986.
16. G. Cormack and R. Horspool, ‘Data compression using dynamic Markov modelling’, Comput. J., (to appear, 1988).
17. A. Moffat, ‘A note on the PPM data compression scheme’, Tech. Rep. 88/7, Department of Computer Science, The University of Melbourne, 1988.