Online Encoding Algorithm for Infinite Set

Natthapon Punthong, Athasit Surarerks
ELITE (Engineering Laboratory in Theoretical Enumerable System)
Department of Computer Engineering, Faculty of Engineering,
Chulalongkorn University, Pathumwan, Bangkok 10330, Thailand
Email: [email protected], [email protected]

Abstract

In this paper we present online encoding and decoding algorithms for an infinite input alphabet. The well-known encoding algorithm that explicitly constructs optimal codes was proposed by Huffman. That algorithm is clearly inappropriate for online encoding, because the whole input is not known before the process starts. We propose an algorithm that encodes while the input data arrive in an online mode. A theoretical result shows that the tree obtained from the algorithm satisfies a code length property of an optimal solution. Moreover, experimental results show that the average code length is close to the one resulting from the Huffman algorithm and to the entropy of the input data.

Key words: Online encoding algorithm, infinite Huffman codes, recursive replacement method.

1. Introduction

In many research domains, encoding is an important problem. For instance, Golomb [9] proposed run-length encoding, using the variable-length concept, for lossless image compression. In 1952, Huffman [1] introduced the well-known encoding algorithm, which uses a binary tree to encode each input character; during the process, a bottom-up technique is applied to build the tree. Extended versions of Huffman's algorithm have been developed for the case where characters have different costs; see details in [6]. It is clear that Huffman's encoding algorithm cannot be applied to an online encoding problem, in which the number of different characters is unknown (i.e., a new character can appear during the process).

In order to solve this problem, tree structures have been studied by many researchers. Optimal prefix-free trees are studied in [5, 7, 8, 10]. One remarkable work is the optimal infinite prefix-free tree structure proposed by Golin and Ma in 2004 (see details in [3]). Laidlaw [4] also introduced an algorithm for constructing codes for infinite sets, using the recursive replacement method to construct an infinite tree. In this work, we focus on the online encoding problem. A top-down technique, which reverses Huffman's algorithm, is applied to construct an infinite set of prefix-free codes, using an optimal prefix-free infinite tree structure combined with the recursive replacement method. The tree is shown to be appropriate for the probability distribution. The paper is organized as follows: Section 2 recalls Huffman's encoding algorithm and the concept of infinite tree construction. Section 3 describes the online encoding problem; online encoding and decoding algorithms for an infinite set are presented, and some theoretical results are demonstrated. Experimental results comparing our online encoding algorithm for an infinite set with Huffman's algorithm are shown in Section 4. Section 5 concludes the paper.

2. Basic definitions

We introduce some basic definitions used in this work, starting from the concept of entropy. The classical variable-length encoding algorithm proposed by Huffman is described. Then the concept of recursive infinite tree construction is also included in this section.

2.1 Entropy

This section describes a general method for measuring the structure of the information source. The concept of entropy expresses this measure as a function of a probability distribution. Let p_j be the probability of getting information a_j; then the average code length contributed by symbol a_j is p_j log2(1/p_j). From this it follows that the average code length over the whole alphabet should be

∑_{j=1}^{n} p_j log2(1/p_j).
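For instance, the formula above can be evaluated directly; the following is a minimal Python sketch (ours, for illustration), applied to the four-symbol distribution used later in Figure 1.

```python
import math

def entropy(probabilities):
    """Average information content in bits per symbol:
    H = sum over j of p_j * log2(1 / p_j)."""
    return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)

# The distribution of Figure 1: A, B, C, D with probabilities 0.1-0.4.
print(round(entropy([0.1, 0.2, 0.3, 0.4]), 4))  # -> 1.8464
```

A uniform distribution over four symbols gives exactly 2 bits, the maximum for n = 4.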

2.2 Huffman encoding algorithm

Huffman [1] proposed an algorithm for generating codes in order to reduce the size of a given context. The frequency of characters is an important factor in encoding the characters. Given an alphabet, a finite set of characters, denoted by Σ = {a_1, a_2, a_3, …, a_n}, let P = {p_1, p_2, p_3, …, p_n} be the set of probabilities of occurrence of each character in the given context, such that p_j denotes the probability of a_j. The problem is to generate codes, which are strings of 0 and 1, with the minimum average length. Let S = {c_1, c_2, c_3, …, c_n} be the solution (c_j being the code of a_j). The average size of the solution, L_av, is computed by

L_av = ∑_{j=1}^{n} p_j l_j,

where l_j is the length of c_j. All codes generated by the algorithm preserve the prefix-free property (i.e., for any two different codes, one must not be a prefix of the other). All codes can be represented by one single binary tree structure. The Huffman encoding algorithm uses a bottom-up technique to construct the binary tree that minimizes the average code length. It starts from the least frequent character; the most frequent character should be encoded by the shortest possible code. Each character is assigned to a leaf node of the tree. The code of a character corresponds to the path from the root to that leaf node. Figure 1 shows the problem of encoding A, B, C, and D with probabilities 0.1, 0.2, 0.3, and 0.4 respectively. For instance, the code of C is 01.

Figure 1. Codes represented using a binary tree
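The bottom-up construction described above can be sketched with a binary heap; the following Python version (our illustration, not the paper's code) reproduces the Figure 1 example. Exact bit assignments depend on tie-breaking, so only the code lengths are guaranteed.

```python
import heapq

def huffman_codes(freq):
    """Bottom-up Huffman construction: repeatedly merge the two
    least-probable subtrees; codes are root-to-leaf paths."""
    # Heap entries: (probability, tiebreak, tree). A tree is a bare
    # symbol (leaf) or a pair of subtrees (internal node).
    heap = [(p, i, sym) for i, (sym, p) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    count = len(heap)
    if count == 1:                       # degenerate single-symbol alphabet
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)
        p2, _, t2 = heapq.heappop(heap)
        count += 1
        heapq.heappush(heap, (p1 + p2, count, (t1, t2)))
    codes = {}
    def walk(tree, path):
        if isinstance(tree, tuple):      # internal node: branch 0 / 1
            walk(tree[0], path + "0")
            walk(tree[1], path + "1")
        else:                            # leaf: record the full path
            codes[tree] = path
    walk(heap[0][2], "")
    return codes

# The example of Figure 1: the most frequent symbol, D, gets a 1-bit code.
print(huffman_codes({"A": 0.1, "B": 0.2, "C": 0.3, "D": 0.4}))
```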

2.3 An infinite tree construction

The problem of generating an infinite tree is studied in this section. Since Huffman's algorithm assigns all characters to leaf nodes of a binary tree, we are interested in constructing an infinite binary tree in which leaf nodes can occur at any level. Laidlaw [4] proposed that the number of leaf nodes in each level be defined by a generating function; this technique is called the recursive replacement method. In principle, a generating function G(x) is described by a recursive definition where G(0) must be zero. At each level of the tree construction, nodes are separated into two partitions, one terminal (leaf) part and one non-terminal (generating) part. Let N_j be the number of non-terminal nodes at level j; level j+1 then contains 2N_j nodes. As a consequence, not every generating function can be selected. Laidlaw also studied the characteristics of such generating functions. It is demonstrated that the average code length can be computed using the generating function associated with the tree.

Figure 2. Tree generated by fL = 2fL-2

The binary tree in Figure 2 is generated by the following generating function: f_L = 2 f_{L−2}, where f_L is the number of terminal nodes at level L, for L ≥ 2, with f_0 = 0 and f_1 = 1.
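The recurrence above is easy to evaluate; this short Python sketch (ours) lists the terminal-node counts per level for the tree of Figure 2.

```python
def terminal_counts(levels):
    """Number of terminal (leaf) nodes per level for the generating
    function f_L = 2 * f_{L-2}, with f_0 = 0 and f_1 = 1 (Figure 2)."""
    f = [0, 1]
    for L in range(2, levels):
        f.append(2 * f[L - 2])
    return f[:levels]

print(terminal_counts(8))  # even levels stay empty; odd levels double
```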

3. Online encoding and decoding

In this section, we start with a description of the online encoding problem. The binary tree structure and the property of the optimal tree will be studied, and some theoretical results will be shown. Finally, the code interchanging, online encoding, and online decoding algorithms will be introduced.

Figure 3. Online encoding and decoding process

Figure 3 shows the process of online encoding and decoding for an infinite input alphabet. Given a context, which is a sequence of characters over an unknown input alphabet, the online encoding process encodes the context sequentially through the encoding unit, character by character, from the first character of the context to the last. The decoding process is performed in the same manner.

3.1 Infinite alphabet

Since the encoding process performs in an online mode, the incoming context can be considered as a sequence of known and unknown characters. This implies that the number of different characters and their frequencies are also unknown. This is the reason that the input alphabet is considered as an infinite set.

3.2 The property of optimal tree

The objective of the encoding algorithm is to generate the optimal solution, whose average code length is minimal. That is, to minimize

L_av = ∑_{j=1}^{n} p_j l_j.

Definition 1. Let c_i and c_j be codes for characters in Σ, with p_i and p_j their respective probabilities. An encoding tree is said to satisfy the code length property if p_i ≥ p_j implies that length(c_i) is less than or equal to length(c_j).

Theorem 1 shows that the optimal encoding tree must satisfy the code length property.

Theorem 1. Let S be the optimal solution for the encoding problem. Let c_i and c_j be codes in S and let p_i and p_j be their respective probabilities. If p_i ≥ p_j, then length(c_i) ≤ length(c_j).

Proof: Suppose that S is the optimal solution and that the codes c_i, c_j in S correspond to characters a_i and a_j, with p_i ≥ p_j. We prove that length(c_i) ≤ length(c_j) by contradiction: suppose length(c_i) > length(c_j). Since S is the optimal solution, the minimal average code length is

∑_{k=1}^{n} length(c_k) p_k.

Let n′ = length(c_i) and m = length(c_j); then the average code length can be rewritten as

∑_{k=1, k≠i, k≠j}^{n} length(c_k) p_k + n′ p_i + m p_j.

Consider the solution in which c_i and c_j are instead the codes for a_j and a_i. The average code length resulting from this solution is

∑_{k=1, k≠i, k≠j}^{n} length(c_k) p_k + m p_i + n′ p_j.

Since n′ > m and p_i ≥ p_j, the average code length of this solution is less than the average code length of the solution S. This contradicts the optimality of S, so the theorem holds. ■

In our work, the binary tree is generated using the recursive replacement method with respect to the following constant function:

f(0) = f(1) = f(2) = 0,
f(L) = ⌊log2 n⌋, for L ≥ 3,

where n is the number of characters and L is the level in the tree. We focus on contexts that are English text files, so the number of characters is approximated by 26. The binary tree structure generated is illustrated in Figure 4.
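To make the construction concrete, the following Python sketch (function and variable names are ours) assigns prefix-free codes breadth-first under this constant generating function: at every level L ≥ 3, the first ⌊log2 n⌋ = 4 nodes become terminals and the remaining nodes branch, assuming terminals are taken from the left.

```python
import math

def level_codes(n_chars, n_codes):
    """Breadth-first code assignment for the constant generating
    function f(L) = floor(log2 n) for L >= 3 (and f(0..2) = 0):
    at each level the first f(L) frontier nodes become terminals
    and the rest branch into the next level."""
    per_level = math.floor(math.log2(n_chars))
    frontier = [""]            # non-terminal paths at the current level
    codes, level = [], 0
    while len(codes) < n_codes:
        level += 1
        frontier = [p + b for p in frontier for b in "01"]
        if level >= 3:
            take = min(per_level, n_codes - len(codes))
            codes.extend(frontier[:take])   # these nodes become leaves
            frontier = frontier[take:]      # the rest keep generating
    return codes

codes = level_codes(26, 26)
print(codes[:4])   # the four shortest codes sit at level 3
```

Because terminals are removed from the frontier before it is expanded, no assigned code can be a prefix of a later one.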

Figure 4. A binary tree

All nodes in a binary tree can be separated into two parts: non-terminal (branching) nodes and terminal (leaf) nodes. Each terminal node is represented by a triple node_i = ⟨a_i, frequency(a_i), c_i⟩ ∈ Tree. The parameter a_i is the input character, frequency(a_i) is the frequency of character a_i, and the code c_i is exactly the path from the root to that terminal node, node_i. Terminal nodes are sequentially assigned starting from the top level.

3.3 Code interchanging algorithm

The concept of the code interchanging process is to preserve the code length property of the binary tree in the online encoding and decoding algorithms. Once an input character arrives at the online encoding process, if it is a new one, a new code is assigned to the character and its frequency is increased by 1; otherwise the binary tree must be adjusted if it no longer preserves the code length property (i.e., the character's frequency is greater than that of another character in the tree but its code is longer). This is performed by interchanging its code with the shorter code of the less frequent character. The online decoding process runs in the same manner. The modification of the solution binary tree is described by the following algorithm.

Algorithm: Interchanging
input: unsatisfied tree, a (input character), ca (code of a), pa (its frequency)
output: satisfied tree
begin
  for each node b : length(cb) < length(ca) do
    if pb < pa
      buffer ← ca
      ca ← cb
      cb ← buffer
      exit
    endif
  enddo
end

Lemma 2 shows that the tree obtained from the interchanging algorithm satisfies the code length property.

Lemma 2. The encoding tree obtained from the algorithm always satisfies the code length property.

Proof: To prove the lemma, we show that the average code length computed from the output is less than or equal to the average code length corresponding to the input. Let a_1 a_2 a_3 … a_{k−1} be an encoded context and let c_1, c_2, c_3, …, c_{k−1} be the codes of a_1, a_2, a_3, …, a_{k−1} respectively. Let the input tree satisfy the code length property, with average code length

L_{k−1} = (∑_{i=1}^{k−1} length(c_i)) / (k − 1).

Suppose that a newcomer a_k arrives; its probability p_k must be increased. In the case that there exists a character a_j such that p_j < p_k but length(c_j) < length(c_k), the input tree does not satisfy the property. The average code length becomes

L_k = ((∑_{i=1}^{k−1} length(c_i)) + length(c_k)) / k.

By applying the interchanging algorithm, the tree is adjusted and the new average code length is

L′_k = ((∑_{i=1}^{k−1} length(c_i)) + length(c_j)) / k.

Since length(c_j) < length(c_k), we have L′_k < L_k. This completes the proof. ■
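A minimal Python sketch of the Interchanging step, assuming the tree is represented as a plain dictionary from character to [frequency, code] (our simplification of the node triples):

```python
def interchange(tree, a):
    """Sketch of the Interchanging algorithm: after the frequency of
    character `a` increases, swap its code with a shorter code held by
    a less frequent character, restoring the code length property.
    `tree` maps character -> [frequency, code]."""
    fa, ca = tree[a]
    for b, (fb, cb) in tree.items():
        if b != a and len(cb) < len(ca) and fb < fa:
            tree[a][1], tree[b][1] = cb, ca   # swap the two codes
            break                             # exit, as in the algorithm

# The situation of Figure 5: A's frequency now exceeds D's, yet A still
# holds the longer code (frequencies here are illustrative).
tree = {"A": [5, "0010"], "D": [4, "110"]}
interchange(tree, "A")
print(tree["A"][1])  # -> "110": A has taken D's shorter code
```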

3.4 Encoding & decoding algorithms

The online encoding and decoding algorithms are presented in this section. During the online encoding and decoding process, a character's frequency changes as the character recurs, and the binary tree must always preserve the code length property, maintained by the code interchanging algorithm. The online encoding algorithm is as follows:

Algorithm: Online Encoding
input: a1a2a3a4… (a sequence of characters)
output: c1c2c3c4… (a sequence of binary codes)
begin
  Tree ← ∅ (empty set)
  i ← 1 (current input character)
  j ← 0 (number of terminal nodes in Tree)
  while (not end-of-data) do
    if (ai = x | ⟨x, fx, cx⟩ = node in Tree)
      fx ← fx + 1
      ci ← cx
      call Interchanging algorithm
    else
      j ← j + 1
      new nodej
      Tree ← Tree ∪ nodej
      nodej ← ⟨ai, 1, ci⟩
    endif
    i ← i + 1
  enddo
end

By the same technique, the decoding algorithm is as follows:

Algorithm: Online Decoding
input: c1c2c3c4… (a sequence of binary codes)
output: a1a2a3a4… (a sequence of characters)
begin
  Tree ← ∅ (empty set)
  i ← 1 (current input code)
  j ← 0 (number of terminal nodes in Tree)
  while (not end-of-data) do
    if (ci = y | ⟨x, fx, y⟩ = node in Tree)
      fx ← fx + 1
      ai ← x
      call Interchanging algorithm
    else
      j ← j + 1
      new nodej
      Tree ← Tree ∪ nodej ***
      nodej ← ⟨ai, 1, ci⟩
    endif
    i ← i + 1
  enddo
end

*** When the code of a new character arrives, the character ai must be transmitted together with its code.

Theorem 2. The encoding tree generated by the proposed encoding algorithm satisfies the code length property.

Proof: From Lemma 2, it is clear that the tree satisfies the code length property. ■

As an example of the online encoding and decoding algorithms, Figure 5 illustrates the binary tree before and after encoding the character A. The frequency of A increases and becomes greater than the frequency of D, but the code of A is longer than the code of D (i.e., the tree does not satisfy the property). After sending the code of A, 0010, the tree is adjusted by interchanging the codes of A and D; the code of A becomes 110. In the same way, the decoding tree must be adjusted after receiving the code of A, 0010, in order to maintain the property.

Figure 5. The frequency of input A is increased and the property is not preserved; the tree is then adjusted (i.e., codes are interchanged between A and D).
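The encoder/decoder loop can be sketched as follows. This simplified Python version (our illustration: a pre-built code list stands in for the tree, and frequency updates with interchanging are omitted, see Section 3.3) focuses on the *** rule: a new character travels together with its code, so the decoder can grow an identical tree.

```python
def online_encode(text, codes):
    """Known characters emit their current code; a new character
    consumes the next free code and is transmitted alongside it
    (the *** rule), letting the decoder learn the same mapping."""
    tree, free, out = {}, iter(codes), []
    for ch in text:
        if ch in tree:
            out.append((tree[ch], None))       # code only
        else:
            tree[ch] = next(free)
            out.append((tree[ch], ch))         # code + the new character
    return out

def online_decode(stream):
    """Mirror of the encoder: learn a new character when it rides
    along with its code, otherwise look the code up."""
    tree, text = {}, []
    for code, ch in stream:
        if ch is not None:
            tree[code] = ch                    # grow the decoding tree
        text.append(tree[code])
    return "".join(text)

# Hypothetical prefix-free code pool standing in for the level-3+ leaves.
codes = ["00", "01", "100", "101", "110", "111"]
stream = online_encode("ABRA", codes)
print(online_decode(stream))  # -> "ABRA"
```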

4. Experimental results

Some experiments on the English alphabet are studied in this section. The results obtained from our algorithm and from the classical Huffman algorithm are compared with the entropy of the text. The generating function used in the experiments is defined as follows:

f(0) = f(1) = f(2) = 0,
f(L) = ⌊log2 n⌋, for L ≥ 3,

where n is the number of characters (n = 26). In our experiments, English text files (files 1–5) are used as the test data. Each file contains more than 2000 characters; the probability distribution of all characters is illustrated by Figure 6. This is also similar to the study in [2]. Table 1 shows the average code length computed from our algorithm, the Huffman algorithm, and the entropy of each file.

[Figure 6: bar chart, percentage per alphabet letter A–Z]

Figure 6. Average of character's frequency, related to Table 1.

Average code length

File  Our algorithm  Huffman   Entropy
1     4.350734       4.229473  4.187761
2     4.311534       4.217067  4.181448
3     4.294745       4.177557  4.140691
4     4.333783       4.194070  4.150287
5     4.291720       4.185328  4.140991
Avg   4.316503       4.200699  4.160236

Table 1. The comparison result between our algorithm and the Huffman algorithm.

The average code lengths from our algorithm are close to those of the Huffman algorithm. The average size using our algorithm is only 1.038 times the entropy, while the optimal solution is 1.01 times, as seen in Table 2.

               Avg/Entropy
Our algorithm  1.0375622
Huffman        1.0097262

Table 2. Entropy comparison for our algorithm.
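The Table 2 ratios follow directly from the Table 1 averages:

```python
# Reproduce the Table 2 ratios from the averages in Table 1.
avg_ours, avg_huffman, avg_entropy = 4.316503, 4.200699, 4.160236
print(avg_ours / avg_entropy)     # about 1.0376, the 1.038 factor cited
print(avg_huffman / avg_entropy)  # about 1.0097
```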

5. Conclusion

We present a novel online encoding algorithm using an infinite binary tree, generated by the recursive replacement method, to solve the encoding problem. Our binary tree is generated by a constant function. The theoretical result demonstrates that our tree satisfies the code length property of the optimal binary tree. Experiments performed on English text files show that the average code length is close to the one generated by the Huffman algorithm. Moreover, the average code length is approximately 1.038 times the entropy of the texts. One obvious direction for future work is the relationship between tree structures and their appropriate probability distributions.

6. References

[1] D.A. Huffman, "A method for the construction of minimum-redundancy codes," Proceedings of the IRE, vol. 40, pp. 1098–1101, September 1952.
[2] H. Beker and F. Piper, Cipher Systems, Wiley-Interscience, 1982.
[3] K.K. Ma and M.J. Golin, "Algorithms for infinite Huffman codes," in Proceedings of the Association for Computing Machinery, 2004.
[4] M.G.G. Laidlaw, "The construction of codes for infinite sets," in Proceedings of SAICSIT 2004, pp. 157–165.
[5] M. Golin, C. Kenyon, and N. Young, "Huffman coding with unequal letter costs," in Proceedings of the 34th ACM Symposium on Theory of Computing (STOC 2002), May 2002, pp. 785–791.
[6] M. Golin and G. Rote, "A dynamic programming algorithm for constructing optimal prefix-free codes for unequal letter costs," IEEE Transactions on Information Theory, vol. 44, pp. 1770–1781, September 1998.
[7] M. Golin and H. Na, "Optimal prefix-free codes that end in a specified pattern and similar problems: the uniform probability case (extended abstract)," IEEE Transactions on Information Theory, 2001.
[8] M. Golin and N. Young, "Prefix codes: equiprobable words, unequal letter costs," SIAM Journal on Computing, 25(6):1281–1292, December 1996.
[9] S.W. Golomb, "Run-length encodings," IEEE Transactions on Information Theory, IT-12, pp. 399–401, July 1966.
[10] S.L. Chan and M. Golin, "A dynamic programming algorithm for constructing optimal '1'-ended binary prefix-free codes," IEEE Transactions on Information Theory, 46(4):1637–1644, July 2000.
