Implementing the Context Tree Weighting Method for Text Compression

Kunihiko Sadakane, Takumi Okazaki, Hiroshi Imai
Department of Information Science, University of Tokyo
7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, JAPAN
{sada,takumi,imai}@is.s.u-tokyo.ac.jp

Abstract

The context tree weighting method is a universal compression algorithm for FSMX sources. Although it is expected to achieve good compression ratios in practice, it is difficult to implement, and in many cases the implementation only estimates the compression ratio. Willems and Tjalkens showed a practical implementation that uses conditional probabilities instead of block probabilities, but it applies only to binary alphabet sequences. We extend the method to multi-alphabet sequences and show a simple implementation that uses PPM techniques. We also propose a method to optimize a parameter of the context tree weighting in the binary alphabet case. Experimental results on texts and DNA sequences show that the performance of PPM can be improved by combining it with the context tree weighting, and that DNA sequences can be compressed to less than 2.0 bpc.

1 Introduction

The context tree weighting method [16], or CTW, is a lossless compression algorithm for FSMX sources. It has theoretically good compression ratios for binary alphabet sequences. It has been extended to multi-alphabet sources [20], to ease implementation [17, 14, 19], to compress text [13], and to improve PPMs [1]. Though the CTW is expected to achieve good compression ratios in practice, its implementation is difficult and there are few implementations [13, 15]. One reason is that the original CTW is for binary sequences and cannot be directly applied to multi-alphabet sequences. The other reason is that it uses block probabilities of many subsequences, which requires multi-precision floating-point arithmetic. This is a drawback in both speed and memory requirements. Concerning compression ratio, one of the best compression algorithms in practice is PPM [5] and its variants. These algorithms compress a sequence symbol by symbol, predicting each symbol's probability from past symbols. The prediction is made from the several preceding symbols, called the context. The length d of the context is called the

order of the PPM. This value is one of the parameters of the PPM. The PPM assumes that the information source creating the sequence is order-d Markovian. However, we usually do not know the real value of d. In PPM methods there exists an optimal value of the order for each text, and compression performance degrades if the order exceeds it. Though it is well known that the best value of the order d is five for many English texts, this value is not optimal for other texts such as program source codes or DNA sequences. Moreover, the assumption does not hold for typical text data; that is, the appropriate order varies from context to context. Therefore PPM is not a flexible algorithm. On the other hand, the CTW can adapt not only to Markov sources but also to tree sources, and its compression performance rarely degrades even if the order becomes too large. This was shown theoretically and has been confirmed by experiments, so we can set the order as large as possible. This is one reason that we use the CTW. We implement the CTW in two ways. One uses binary decomposition of the alphabet [13] and the other uses a PPM technique. We propose a method to improve the compression ratio of the former by optimizing a weighting parameter of the CTW. The latter was proposed by Åberg and Shtarkov [1]. However, it is difficult to implement because they used block probabilities of sequences, which are represented by floating-point numbers. Therefore they only simulated the compression ratio by calculating the probabilities of the encoded sequence; that is, they did not implement the decoder, and it may be difficult to implement one according to their method. We solve the problem by extending the idea of Willems and Tjalkens [17] to multi-alphabet sequences and show implementation details. Our method is intuitive, and both the encoder and the decoder can be easily implemented. Moreover, by using our technique, we can improve the compression ratio of PPMs. We also show by experiments that the compression ratio of the PPM can be improved by combining it with the CTW.

2 Context Tree Weighting method

2.1 The original algorithm

First we describe the original algorithm of Willems et al. [16], which is for binary alphabet sources. We represent a symbol by $x_i$ and a sequence of symbols $x_1 x_2 \ldots x_t$ by $x_1^t$. We consider an FSMX information source of depth $D$. A context tree represents the source; each node of the tree represents a context. When a symbol 0 appears $a_s$ times and a symbol 1 appears $b_s$ times at a context $s$, we store these numbers in the node $s$. An inner node $s$ has two children, $0s$ and $1s$, and the occurrence counts satisfy $a_{0s} + a_{1s} = a_s$ and $b_{0s} + b_{1s} = b_s$. We calculate an estimated probability $P_e^s(a_s, b_s)$ for each context $s$, which stands for the probability that a symbol 0 occurs $a_s$ times and a symbol 1 occurs $b_s$ times at the context $s$. The Krichevsky-Trofimov (KT) estimator [9] is commonly used; it is defined by

$$P_e^s(a_s, b_s) = \frac{\frac{1}{2} \cdot \frac{3}{2} \cdots (a_s - \frac{1}{2}) \cdot \frac{1}{2} \cdot \frac{3}{2} \cdots (b_s - \frac{1}{2})}{(a_s + b_s)!}.$$
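The KT-estimator is usually computed sequentially: given counts $(a_s, b_s)$, the conditional probability that the next symbol is 0 is $(a_s + \frac{1}{2})/(a_s + b_s + 1)$, and the block probability above is the product of these conditionals. A minimal Python sketch (function names are ours, not the paper's):

```python
def kt_conditional(a, b, symbol):
    """KT-estimator conditional: probability of the next symbol,
    given that 0 has occurred a times and 1 has occurred b times."""
    if symbol == 0:
        return (a + 0.5) / (a + b + 1.0)
    return (b + 0.5) / (a + b + 1.0)

def kt_block(bits):
    """Block probability P_e(a, b) of a bit sequence, accumulated as a
    running product of the sequential conditionals."""
    a = b = 0
    p = 1.0
    for x in bits:
        p *= kt_conditional(a, b, x)
        if x == 0:
            a += 1
        else:
            b += 1
    return p
```

For example, kt_block([0, 0]) returns (1/2 · 3/2)/2! = 0.375, matching the closed form.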

Next we calculate weighted probabilities $P_w^s$ from the values of $P_e^s$. This is done by the following recursion:

$$P_w^s(x_1^t) = \begin{cases} \gamma P_e^s(x_1^t) + (1-\gamma) P_w^{0s}(x_1^t) P_w^{1s}(x_1^t) & \text{(node } s \text{ is not a leaf)} \\ P_e^s(x_1^t) & \text{(node } s \text{ is a leaf)} \end{cases}$$

Here $\gamma$ is a real value satisfying $0 < \gamma < 1$. We can obtain code words for $x_1^t$ by arithmetic coding from the value $P_w^\lambda(x_1^t)$ at the root $\lambda$ of the context tree. This probability can be considered a weighted sum of the probabilities of $x_1^t$ under all sub-models in the context tree. To update nodes in the context tree while encoding a symbol $x_t$, it is sufficient to update only the nodes representing suffixes of $x_{t-D} \ldots x_{t-1}$. Therefore the time complexity of the algorithm is $O(nD)$; the time complexity of arithmetic coding is not considered here. Though block probabilities represent precise probabilities of sequences and guarantee a theoretical upper bound on code-word length, they require multi-precision floating-point arithmetic, which is difficult to implement and time-consuming.
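To make the recursion concrete, the following sketch evaluates $P_w$ over a small explicit tree by post-order recursion (the tree representation is ours; actual implementations, including the one in Section 3, maintain these quantities incrementally instead). A missing child is treated as an empty context whose block probability is 1:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    pe: float                        # estimated block probability P_e^s
    child0: Optional["Node"] = None  # child for context 0s
    child1: Optional["Node"] = None  # child for context 1s

def weighted_prob(node: Node, gamma: float) -> float:
    """P_w^s: mix the local estimate with the product of the children's
    weighted probabilities, following the recursion above."""
    if node.child0 is None and node.child1 is None:
        return node.pe               # leaf: P_w^s = P_e^s
    pw0 = weighted_prob(node.child0, gamma) if node.child0 else 1.0
    pw1 = weighted_prob(node.child1, gamma) if node.child1 else 1.0
    return gamma * node.pe + (1.0 - gamma) * pw0 * pw1
```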

2.2 Previous works

There exist some implementations of the CTW. Yokoo and Kawabata [19] used fixed-precision real values to represent $P_e^s(x_1^t)$ and $P_w^s(x_1^t)$ and showed that an encoded sequence can be decoded correctly by using round-down operations. Though they showed compression ratios by computer simulation, the method is slow because of multi-precision operations. Åberg and Shtarkov [1] calculate the block probability of $x_1^t$ using multi-precision operations. However, they calculate only the width of the range encoded by the arithmetic code, and they did not implement the decoder. They showed that the CTW combined with PPMD [7] has superior performance. Willems and Tjalkens [17] reduced the space complexity of the CTW. They use conditional weighted probabilities instead of weighted block probabilities: they store in the node $s$ not $P_e^s(x_1^t)$ and $P_w^s(x_1^t)$ but their ratio. They also use a value $\eta$, which represents the ratio of the probabilities of 0 and 1; therefore their method cannot be directly extended to encoding multi-alphabet sources. Tjalkens and Willems [14] proposed arithmetic encoding based on the conditional probabilities $\Pr(x_t \mid x_1^{t-1})$. However, they encode the probabilities with their own special arithmetic encoder, and they represent the range of the arithmetic code corresponding to a block probability of $x_1^t$ by a fixed-precision real value, which is not desirable. Volf and Willems [15] proposed the switching method, which calculates the code-word lengths of two different compression algorithms and uses the better one. They combine the CTW with other compression algorithms and improve the compression ratio. Because the CTW in this algorithm is based on the binary decomposition technique [13], it is difficult to improve its performance further. On the other hand, if we can apply the CTW to multi-alphabet sequences easily, we can improve the performance of the CTW by using techniques developed for PPMs [7, 12, 3, 2]. Therefore it is important to show an easy implementation of the CTW for multi-alphabet sequences that can use these techniques.

We propose an efficient implementation of the CTW. Though it is based on Willems and Tjalkens [17], it is simpler and never uses block probabilities. It encodes symbols one by one by arithmetic coding according to the conditional probabilities of each symbol. Our implementation can be used for encoding multi-alphabet sequences. As Åberg and Shtarkov suggested, we can combine PPM with the CTW. Our method can be interpreted as calculating a mixture of the probabilities estimated by PPMs of several orders.

3 Our Implementation

Tjalkens and Willems [14] calculate block probabilities of n symbols from conditional symbol probabilities and then perform arithmetic encoding. That is, they use conditional probabilities to compute block probabilities, while we directly encode the conditional probabilities with the arithmetic coder of Witten et al. [18].
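The coder of Witten et al. works with integer frequency counts, so each conditional distribution must be quantized before coding (step 3 of the encoding algorithm in Section 3.3 multiplies by a constant R). A sketch of that step; the value of R and the minimum count of 1 per symbol are our assumptions, chosen so every symbol stays decodable after rounding:

```python
def quantize(probs, R=1 << 14):
    """Scale a conditional distribution {symbol: probability} to the
    integer frequencies an integer arithmetic coder consumes."""
    return {c: max(1, int(p * R)) for c, p in probs.items()}
```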

3.1 Calculation of conditional probabilities

If a symbol $x_t$ has the context $0s$, the probability of $x_t = c$ in the context $s$ can be calculated as follows:

$$P_w^s(x_t = c) = \frac{P_w^s(x_1^t)}{P_w^s(x_1^{t-1})} = \frac{\gamma P_e^s(x_1^t) + (1-\gamma) P_w^{0s}(x_1^t) P_w^{1s}(x_1^t)}{\gamma P_e^s(x_1^{t-1}) + (1-\gamma) P_w^{0s}(x_1^{t-1}) P_w^{1s}(x_1^{t-1})}$$
$$= \frac{\gamma P_e^s(x_1^{t-1}) P_e^s(x_t = c) + (1-\gamma) P_w^{0s}(x_1^{t-1}) P_w^{0s}(x_t = c) P_w^{1s}(x_1^{t-1})}{\gamma P_e^s(x_1^{t-1}) + (1-\gamma) P_w^{0s}(x_1^{t-1}) P_w^{1s}(x_1^{t-1})}.$$

By defining

$$\beta_s(x_1^{t-1}) = \frac{P_e^s(x_1^{t-1})}{P_w^{0s}(x_1^{t-1}) P_w^{1s}(x_1^{t-1})},$$

we obtain

$$P_w^s(x_t = c) = \frac{\gamma \beta P_e^s(x_t = c) + (1-\gamma) P_w^{0s}(x_t = c)}{\gamma \beta + (1-\gamma)} = \frac{\gamma \beta}{\gamma \beta + (1-\gamma)} P_e^s(x_t = c) + \frac{1-\gamma}{\gamma \beta + (1-\gamma)} P_w^{0s}(x_t = c).$$

$P_e^s(x_t = c)$ is calculated by the KT-estimator according to the numbers of symbols that have occurred at the context $s$. Note that we will later use a PPM mechanism to estimate $P_e^s(x_t = c)$. $\beta$ can be computed incrementally as follows:

$$\beta_s(x_1^t) = \frac{P_e^s(x_1^t)}{P_w^{0s}(x_1^t) P_w^{1s}(x_1^t)} = \beta_s(x_1^{t-1}) \cdot \frac{P_e^s(x_t)}{P_w^{0s}(x_t)}.$$

The initial value of $\beta$ is 1.0. According to this conditional probability, a symbol $x_t$ is encoded by the arithmetic code. We calculate $\beta_s$ and the symbol probabilities with double-precision floating-point operations. Though these operations are approximations of the precise calculation, the values represent probabilities of a single symbol, and therefore it is unnecessary to use high-precision operations.
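In double precision the two formulas are only a few lines. A sketch with our own naming: mix computes $P_w^s(x_t = c)$ from the local estimate, the child's conditional probability, and the stored $\beta$, and update_beta applies the incremental rule once $x_t$ is known (the cap of 10000 anticipates step 3 of UpdateTree in Section 3.3):

```python
def mix(pe_c, pw_child_c, beta, gamma):
    """Conditional weighted probability P_w^s(x_t = c).

    pe_c:       P_e^s(x_t = c), the local (KT or PPM-based) estimate
    pw_child_c: P_w^{0s}(x_t = c), the child context on the coding path
    beta:       stored ratio P_e^s(x_1^{t-1}) / (P_w^{0s} P_w^{1s})
    """
    w = gamma * beta + (1.0 - gamma)
    return (gamma * beta / w) * pe_c + ((1.0 - gamma) / w) * pw_child_c

def update_beta(beta, pe_xt, pw_child_xt, cap=10000.0):
    """beta_s(x_1^t) = beta_s(x_1^{t-1}) * P_e^s(x_t) / P_w^{0s}(x_t)."""
    return min(beta * pe_xt / pw_child_xt, cap)
```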

3.2 Data structure of context tree

A node $sc$ of the context tree consists of these elements: c, the symbol that appeared at the context $s$; freq, the number of occurrences of c at $s$; beta, $\beta_{sc}$; son, a pointer to the child of $sc$; and next, a pointer to the sibling of $sc$. Before encoding, the context tree consists of only the root node $\lambda$. New nodes are created when new contexts appear in the sequence.
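In Python the node layout might look as follows; the son/next pointers give a first-child, next-sibling trie, and find_child (a helper name of ours) shows how a context is extended by one symbol:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CTNode:
    c: int                           # symbol that appeared at context s
    freq: int = 0                    # number of occurrences of c at s
    beta: float = 1.0                # beta_{sc}, initialized to 1.0
    son: Optional["CTNode"] = None   # first child of sc
    next: Optional["CTNode"] = None  # next sibling of sc

def find_child(first: Optional[CTNode], c: int) -> Optional[CTNode]:
    """Scan a sibling list for the node storing symbol c; None means the
    context is new and a node must be allocated for it."""
    node = first
    while node is not None and node.c != c:
        node = node.next
    return node
```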

3.3 Encoding algorithm

Encoding a symbol $x_t$ after encoding a sequence $x_1^{t-1}$ proceeds as follows:

1. calculate the probabilities $P[c] = P_w^\lambda(x_t = c)$ for all symbols $c$ in the alphabet $A$ ($FindP(\lambda, t-1, D)$)
2. read a symbol $x_t$ into $X[t \bmod D]$
3. convert the values of $P[c]$ into integers by multiplying by a constant $R$
4. encode $x_t$ by arithmetic coding according to $P[c]$
5. update $\beta$ and the symbol frequencies in the context tree ($UpdateTree(\lambda, t-1, D)$)

A function $FindP(s, t, d)$ calculating the probabilities $p = P_w^s(x_t = c)$ is defined as follows (a sketch of the mixing loop follows these lists):

1. traverse the context tree from the root and find the nodes $s_1, s_2, \ldots, s_D$ corresponding to $x_{t-1}, x_{t-2}, \ldots, x_{t-D}$
2. calculate $P_e^{s_1}(x_t = c), \ldots, P_e^{s_D}(x_t = c)$ for all symbols $c \in A$
3. let $c = X[(t-1) \bmod D]$ and $p'_w = FindP(cs, t-1, d-1)$
4. let $P_w^{s_D}(x_t = c) = P_e^{s_D}(x_t = c)$ $(c \in A)$
5. for $d = D-1$ down to $0$ and for all $c \in A$,
$$P_w^{s_d}(x_t = c) = \frac{\gamma \beta_{s_d}}{\gamma \beta_{s_d} + (1-\gamma)} P_e^{s_d}(x_t = c) + \frac{1-\gamma}{\gamma \beta_{s_d} + (1-\gamma)} P_w^{s_{d+1}}(x_t = c)$$

A function $UpdateTree(s, t)$ to update the context tree after encoding $x_t$ is defined as follows:

1. update the symbol counts in the nodes $s_1, s_2, \ldots, s_D$
2. for $d = D-1$ down to $0$, $\beta_{s_d} = \beta_{s_d} \cdot P_e^{s_d}(x_t) / P_w^{s_{d+1}}(x_t)$
3. if $\beta_{s_d} > 10000$ then let $\beta_{s_d} = 10000$
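Once the per-order estimates are in hand, the mixing loop of FindP and the β update of UpdateTree reduce to a few lines. A sketch under our own naming, with pe[d][c] standing for $P_e^{s_d}(x_t = c)$ and beta[d] for $\beta_{s_d}$; tree traversal and count updates are omitted:

```python
def find_p_mix(pe, beta, gamma, alphabet):
    """Steps 4-5 of FindP: mix the per-order estimates upward.

    pe[d][c] is P_e^{s_d}(x_t = c) for d = 0..D; beta[d] is beta_{s_d}.
    Returns pw, where pw[0] is the distribution handed to the coder."""
    D = len(pe) - 1
    pw = [None] * (D + 1)
    pw[D] = dict(pe[D])                       # P_w^{s_D} = P_e^{s_D}
    for d in range(D - 1, -1, -1):
        w = gamma * beta[d] + (1.0 - gamma)
        pw[d] = {c: (gamma * beta[d] / w) * pe[d][c]
                    + ((1.0 - gamma) / w) * pw[d + 1][c]
                 for c in alphabet}
    return pw

def update_betas(pe, pw, beta, xt, cap=10000.0):
    """Steps 2-3 of UpdateTree: refresh each beta with the coded symbol
    xt and clip it at 10000."""
    D = len(pe) - 1
    for d in range(D - 1, -1, -1):
        beta[d] = min(beta[d] * pe[d][xt] / pw[d + 1][xt], cap)
```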

3.4 Optimizing the γ value

The compression ratio of the CTW depends on the value of $\gamma$, especially when a binary alphabet is used for compressing multi-alphabet sequences, because the number of models being mixed is large. We use a heuristic: instead of fixing $\gamma$, we vary its value each time to minimize the expected code length for encoding a symbol.

1. calculate $P_w^\lambda$ for $\gamma = 0.1(0.8)^i$ $(i = 0, 1, 2, 3, 4)$
2. calculate the entropy of each weighted distribution and find the smallest one
3. calculate $P_w^\lambda$ using the optimal $\gamma$ and encode a bit
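For a binary alphabet the expected code length of the next bit under $P_w^\lambda$ is the binary entropy of $P_w^\lambda(x_t = 1)$, so the heuristic just evaluates the five candidates and keeps the minimizer. A sketch; prob_of_one stands for a routine (like FindP above) that returns $P_w^\lambda(x_t = 1)$ for a given $\gamma$:

```python
import math

GAMMAS = [0.1 * 0.8 ** i for i in range(5)]  # 0.1, 0.08, 0.064, 0.0512, 0.04096

def binary_entropy(p):
    """Expected code length (bits) of one bit drawn with P(1) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def pick_gamma(prob_of_one):
    """Choose the gamma whose weighted distribution has minimal entropy,
    i.e. minimal expected code length for the next bit."""
    return min(GAMMAS, key=lambda g: binary_entropy(prob_of_one(g)))
```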

3.5 Using PPM mechanism

The probability $P_e^s(x_t = c)$ is calculated by the KT-estimator. However, the redundancy of this estimator on text sources can be reduced by using the escape mechanism of PPM [5]. In PPM, a special escape symbol $esc$ is used: if a novel symbol $c$ appears in a context $s$ of depth $d$, we encode $esc$ in the context $s$ and then encode $c$ in a shorter context $s'$ of depth $d-1$. However, this escape mechanism does not fit the CTW directly, because updating the values of $\beta_s$ requires a probability for every symbol in every context. Therefore we calculate symbol probabilities that take the same values as if we had used the escape mechanism. The probability of the symbol $c$ is estimated as the probability of $esc$ in $s$ times the probability of $c$ in $s'$, that is, $P_e^s(c) = P_e^s(esc) \cdot P_e^{s'}(c)$. The escape probability $P_e^s(esc)$ is calculated using PPM techniques like those of PPMD [7]. In the null context $\lambda$, symbols that have never appeared are treated as equi-probable and the escape probability is divided equally among them. Assume that $M$ is the alphabet size and $m$ is the number of distinct symbols that have appeared in the null context; then the probability of a novel symbol in the null context is $\frac{1}{M-m} P_e^\lambda(esc)$. The probabilities $P_e^{s_d}(x_t = c)$ are calculated as follows:

1. calculate $P_e^\lambda(x_t = c)$
2. for $d = 0$ to $D-1$:
   (a) calculate $P_e^{s_{d+1}}(x_t = c)$ $(c \in A)$ and let $e = 0$
   (b) for all $c \in A$: if $c$ does not appear in $s_{d+1}$, $e = e + P_e^{s_d}(c)$
   (c) for all $c \in A$: if $c$ does not appear in $s_{d+1}$, $P_e^{s_{d+1}}(c) = P_e^{s_{d+1}}(esc) \cdot P_e^{s_d}(c) / e$

This corresponds to exclusion after estimation in PPM. Note that using an order-(−1) context as in PPM implementations does not work, because the symbol frequencies in the order-(−1) context never change and the value of $\beta_\lambda$ cannot be updated correctly.
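Step 2(c) spreads the escape probability of a context over its novel symbols in proportion to their probabilities at the next shorter context. A sketch of one iteration of that loop (naming is ours):

```python
def fill_novel_probs(pe_parent, pe_child, seen_in_child, p_escape):
    """Exclusion after estimation for one pair of contexts.

    pe_parent:     {c: P_e^{s_d}(c)}, fully defined at the shorter context
    pe_child:      {c: P_e^{s_{d+1}}(c)}, defined only for symbols that
                   occurred in the longer context s_{d+1}
    seen_in_child: set of symbols observed in s_{d+1}
    p_escape:      P_e^{s_{d+1}}(esc)
    """
    e = sum(p for c, p in pe_parent.items() if c not in seen_in_child)
    for c, p in pe_parent.items():
        if c not in seen_in_child:
            pe_child[c] = p_escape * p / e   # share of the escape mass
    return pe_child
```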

3.6 Update exclusion

In PPM, update exclusion is used: symbol frequencies are updated only in the contexts that are used to encode the symbol. This technique improves not only compression speed but also compression ratio. However, it is difficult to combine with the CTW. In the original CTW, the frequency of a symbol $c$ in a context $s$ is the sum of the frequencies of $c$ in the children of $s$, so it does not allow update exclusion. We therefore store frequencies $c_d$ $(0 \le d \le D)$, corresponding to a PPM of order $d$, in each node of the context tree. To calculate $P_e^{s_d}(x_t = c)$, we use the $c_d$'s in the nodes $s_0, \ldots, s_d$. That is, we use a symbol probability $P_e^{s_d}$ that is the same as the probability used by PPMD with order $d$. Another symbol probability $P_e^{s_{d'}}$, which is mixed with $P_e^{s_d}$ to calculate the weighted probability $P_w^\lambda$, is calculated by PPMD with order $d'$. We also update the frequencies $c_d$ according to the updating rules of a PPM of order $d$.

The time complexity of our algorithm is $O(nD^2 M)$ and the space complexity is $O(nD^2)$, where $n$ is the length of the sequence to be compressed, $D$ is the depth of the context tree, and $M$ is the size of the alphabet $A$. If we use a suffix tree to represent the context tree, the space complexity can be reduced to $O(nD)$.

4 Experimental Results

We made experiments on compressing English texts and DNA sequences. The workstation we used is a Sun Ultra60 (2048 MB memory). We use test files from the Calgary corpus [4] and DNA sequences [11]. Compression ratios are shown in bpc (bits/character). The results include the compression loss introduced by arithmetic coding.

Table 1: Compression ratio for Calgary corpus, D = 64

file     γ=0.5   γ=0.2   γ=0.05   mix     PPMD+
bib      2.131   1.926   1.878    1.860   1.862
book1    2.266   2.181   2.179    2.164   2.303
book2    2.052   1.929   1.916    1.899   1.963
geo      4.370   4.366   4.394    4.384   4.733
news     2.649   2.450   2.426    2.397   2.355
obj1     4.164   3.898   3.925    3.860   3.728
obj2     2.842   2.589   2.555    2.523   2.378
paper1   2.642   2.412   2.406    2.361   2.330
paper2   2.473   2.300   2.293    2.260   2.315
pic      0.777   0.772   0.776    0.772   0.795
progc    2.746   2.485   2.471    2.426   2.363
progl    2.006   1.770   1.706    1.688   1.677
progp    2.057   1.832   1.795    1.765   1.696
trans    1.909   1.630   1.564    1.540   1.467

4.1 Results on binary CTW

We use binary decomposition [13] for compressing sequences with the binary CTW. We measure compression ratios for various fixed values of γ and for the optimized γ. The length of the context is D = 64. Table 1 shows the results of the binary CTW using binary decomposition and those of PPMD+ [12]. The fifth column, 'mix', shows the results of using the optimal value of γ for each character in a text. The compression ratio varies greatly according to the value of γ. The mix algorithm has the best performance among the CTW methods. Its performance is better than that of PPMD+ for the

files bib, book1, book2, paper2 and pic, and it is very good for book1 and book2. However, it is worse than that of PPMD+ for the other files.

We also made experiments on DNA sequences. We use two representations of the alphabet: one represents the symbols a, t, g and c in two bits, and the other represents them in eight bits by ASCII codes. Table 2 shows the results. 'accession' is a unique identifier for each DNA sequence. 'PPMD+' shows the result of PPMD+. 'c2' and 'c8' are our implementations using the two-bit and the eight-bit alphabet; the depth D of the context and the value of γ are shown with the method name, as in c8(64, 0.5). 'PPMD+d' shows the result of a PPMD+ modified for an alphabet of size four: we assign no escape probability in a context in which all four symbols have already appeared. The order of PPMD+d is D = 3. The compression ratio of c2 is better than that of c8 because c8 also assigns probabilities to symbols that do not appear in the sequence.

Table 2: Compression ratio for DNA sequences

accession  length  PPMD+  PPMD+d  c8(32,0.5)  c8(64,0.5)  c2(32,0.05)  c2(32,0.01)  c2(32,0.005)  c2(32,0.001)  Bio2  CDNA  GTAC
X55026     100314  2.018  1.869   1.869       1.868       1.859        1.859        1.860         1.861         1.88  1.85  1.74
M68929     186609  2.075  1.964   1.964       1.964       1.950        1.947        1.948         1.949         1.94  1.87  1.78
X59720     315339  2.023  1.950   1.949       1.949       1.945        1.940        1.939         1.939         1.92  1.94  1.82
M35027     191737  2.002  1.909   1.908       1.904       1.873        1.870        1.871         1.870         1.76  1.81  1.67
X17403     229354  2.053  1.969   1.961       1.961       1.958        1.956        1.956         1.957         1.85  —     1.74

The compression ratio of PPMD+ is more than two bits per character for all sequences; DNA sequences cannot be compressed well by the original PPM algorithm designed for texts. The compression ratio can be improved by assuming that the alphabet size is four. However, the CTW outperforms this result. In the table, Bio2, CDNA and GTAC show the results of Biocompress-2 [6], CDNA compress [10], and GTAC [8]. These are compression algorithms specialized for DNA sequences; their compression ratios are improved by using approximate matching or by considering palindromes. Note that CDNA and GTAC only estimate compression ratio. Our CTW achieves compression ratios of less than 2.0 bpc for all sequences, which many well-known compression algorithms cannot achieve. Our CTW outperforms Biocompress-2 for one sequence, and by incorporating special properties of DNA sequences, such as approximate matching, in a simple preprocessing stage, we can further improve the results, which will be reported elsewhere. This demonstrates the power of the CTW as a general-purpose compression scheme.

4.2 Results on multi-alphabet CTW

Table 3 shows the results of our implementation of PPMD with order D = 5 using exclusion after estimation and update exclusion, of our CTW with depth D = 5, 8 and 16, and of PPMD+. The value of γ is fixed to 0.2. The compression ratios of the CTWs are always better than those of the PPMD, and they never degrade as the order grows; this means we should use as large an order as we can. Though our implementation of the PPMD is inferior to PPMD+, it is improved by simply combining it with the CTW. This means that if we use a good probability estimator, the performance of the CTW improves.

Table 3: Compression ratio for Calgary corpus

file     PPMD D=5  CTW D=5  CTW D=8  CTW D=16  PPMD+
bib      1.940     1.891    1.872    1.863     1.862
book1    2.319     2.229    2.217    2.217     2.303
book2    2.007     1.957    1.920    1.916     1.963
geo      4.838     4.592    4.589    4.586     4.733
news     2.427     2.384    2.364    2.359     2.355
obj1     3.991     3.903    3.902    3.898     3.728
obj2     2.507     2.470    2.406    2.386     2.378
paper1   2.393     2.344    2.331    2.331     2.330
paper2   2.352     2.270    2.265    2.265     2.315
pic      0.833     0.809    0.802    0.795     0.795
progc    2.446     2.403    2.384    2.382     2.363
progl    1.778     1.749    1.676    1.662     1.677
progp    1.772     1.733    1.681    1.641     1.696
trans    1.561     1.530    1.453    1.429     1.467
average  2.369     2.304    2.275    2.266     2.289

5 Concluding Remarks

We proposed a simple implementation of the CTW. Though its performance is worse than that of the usual PPMs for some files, it can be improved by using better probability estimators. With our implementation, such estimators can easily be combined with the CTW algorithm to improve compression performance. Our experimental results confirm the result of Åberg and Shtarkov.

Acknowledgments

The authors would like to thank Professor Tsutomu Kawabata and Doctor Jan Åberg, who provided us with information on the CTW.

References

[1] J. Åberg and Y. M. Shtarkov. Text Compression by Context Tree Weighting. In IEEE Data Compression Conference, pages 377–386, March 1997.
[2] J. Åberg, Y. M. Shtarkov, and B. J. M. Smeets. Multialphabet Coding with Separate Alphabet Description. In Proc. of Compression and Complexity of SEQUENCES 1997, pages 56–65. IEEE Computer Society, 1997.
[3] C. Bloom. PPMZ, 1995. http://www.cco.caltech.edu/~bloom/src/ppmz.zip.
[4] Calgary Text Compression Corpus. ftp://ftp.cpsc.ucalgary.ca/pub/projects/text.compression.corpus/.
[5] J. G. Cleary and I. H. Witten. Data compression using adaptive coding and partial string matching. IEEE Trans. on Commun., COM-32(4):396–402, April 1984.
[6] S. Grumbach and F. Tahi. A New Challenge for Compression Algorithms: Genetic Sequences. Information Processing & Management, 30:875–886, 1994.
[7] P. G. Howard. The Design and Analysis of Efficient Lossless Data Compression Systems. Technical Report CS-93-28, Brown University, 1993.
[8] J. K. Lanctot, M. Li, and E. Yang. Estimating DNA Sequence Entropy. Preprint.
[9] R. E. Krichevsky and V. K. Trofimov. The Performance of Universal Encoding. IEEE Trans. Inform. Theory, IT-27(2):199–207, March 1981.
[10] D. Loewenstern and P. Yianilos. Significantly Lower Entropy Estimation for Natural DNA Sequences. Accepted for publication in the Journal of Computational Biology.
[11] National Center for Biotechnology Information. Entrez Nucleotide Query. http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=n s.
[12] W. J. Teahan and J. G. Cleary. The entropy of English using PPM-based models. In IEEE Data Compression Conference, pages 53–62, March 1997.
[13] T. J. Tjalkens, P. A. J. Volf, and F. M. J. Willems. A Context-tree Weighting Method for Text Generating Sources. In IEEE Data Compression Conference, page 472, 1997.
[14] T. J. Tjalkens and F. M. J. Willems. Implementing the Context-Tree Weighting Method: Arithmetic Coding. In International Conference on Combinatorics, Information Theory & Statistics, July 1997.
[15] P. A. J. Volf and F. M. J. Willems. Switching between Two Universal Source Coding Algorithms. In IEEE Data Compression Conference, pages 491–500, March 1998.
[16] F. M. J. Willems, Y. M. Shtarkov, and T. J. Tjalkens. The Context Tree Weighting Method: Basic Properties. IEEE Trans. Inform. Theory, IT-41(3):653–664, May 1995.
[17] F. M. J. Willems and T. J. Tjalkens. Complexity Reduction of the Context-Tree Weighting Method. In 18th Benelux Symposium on Information Theory, pages 123–130, 1997.
[18] I. H. Witten, R. M. Neal, and J. G. Cleary. Arithmetic coding for data compression. Communications of the ACM, 30(6):520–541, June 1987.
[19] N. Yokoo and T. Kawabata. A Simple Implementation of Context Tree Weighting Method and its Verification. Technical Report IT93-123, IEICE, March 1994.
[20] N. Yokoo and T. Kawabata. Implementing Context Tree Weighting Method for non-binary. In Proc. of IEICE Fall Conference, pages 271–272, 1994.
