A New Approach to Adaptive Encoding of Data using Self-organizing Data Structures

Luis Rueda∗ and B. John Oommen†

∗Member of the IEEE. Department of Computer Science, University of Concepcion, Edmundo Larenas 215, Concepcion, 4030000, Chile. E-mail: [email protected].
†Chancellor’s Professor; Fellow of the IEEE and Fellow of the IAPR. School of Computer Science, Carleton University, 1125 Colonel By Dr., Ottawa, ON, K1S 5B6, Canada. E-mail: [email protected]. This author is also an Adjunct Professor with the Department of Information and Communication Technology, Agder University College, Grimstad, Norway.
Abstract—This paper demonstrates how techniques applicable to defining and maintaining a special case of binary search trees (BSTs) can be incorporated into “traditional” compression techniques to yield enhanced schemes. Specifically, we demonstrate that the newly introduced data structure, the Fano Binary Search Tree (FBST), can be maintained adaptively and in a self-organizing manner. The correctness and properties of the encoding and decoding procedures that update the FBST are included. We also include a theoretical and empirical analysis, which shows that the number of shift operations is large for small files, and tends to decrease (asymptotically towards zero) for large files.

Index Terms—adaptive coding, self-organizing data structures, binary search trees.
I. INTRODUCTION

Binary Search Trees (BSTs) have been used in a wide range of applications that include storage, dictionaries, databases, and symbol tables. BSTs can be maintained statically when the statistical information about the number of accesses to the records is known a priori. When the probability distribution of the records is unknown, adaptive schemes are the most appropriate ones. These schemes update the BST during the search process itself.

Consider a set of records whose keys are given by the ordered set of distinct elements $K = \{k_1, \ldots, k_m\}$, where $k_1 < \ldots < k_m$. By following the procedure given in [1], the optimal BST can be constructed using dynamic programming in $O(m^2)$ time and space. Alternatively, using dynamic programming and divide-and-conquer techniques [2], a nearly-optimal BST can be constructed in $O(m)$ space and $O(m \log m)$ time. These two approaches can be used whenever the statistical information about the accesses to the records is known beforehand. As opposed to this, we assume that these probabilities are unknown, and that the structure and content of the BST are dynamically changed while the records are searched for in the tree.

The move-to-root heuristic, proposed by Allen and Munro [3], is a very simple approach to maintaining an adaptive BST. The aim of this approach is to keep the most frequently accessed records near the root, and consequently, to minimize the average cost of searching. Another approach, the simple exchange rule, was also introduced by Allen and Munro [3]. It consists of rotating the accessed record one level up
towards the root. Although this approach is not very efficient, it has the advantage that it does not require extra space. Splaying is another technique, due to Sleator and Tarjan [4], [5], which uses its own tree structure called the splay tree. The main idea of this technique is to move the accessed record towards the root, while simultaneously allowing access to each record by an in-order traversal of the tree. The splaying tree-structuring techniques have reported good results even for highly time-variant access probabilities. Another scheme found in the literature is known as the monotonic tree [6]. Each record maintains an extra memory location to count the number of times it has been accessed. Empirical results have shown that, on average, this approach performs poorly, particularly for key sets with high entropy. Other adaptive binary search tree approaches are biasing [7], dynamic binary search [8], weighted randomization [9], deep-splaying [10], and the technique that uses conditional rotations [11]. The basic idea of the latter approach is to maintain certain key pieces of information in each node. These are used by the heuristic called the conditional rotation, which is based on the fundamental rotation operation (also known as the promotion operation) introduced by Adel’son-Vel’skii and Landis [12].

On the other hand, adaptive coding is important in many applications that require online data compression and transmission. This modality is advantageous since the data is encoded in a single pass, as opposed to the strategy used in static algorithms, which requires two passes: the first to learn the probabilities, and the second to accomplish the encoding. Most of the well-known static encoding techniques have been extended to also function in an adaptive manner. The most well-known adaptive coding technique is Huffman’s algorithm, which was first presented by Faller [13], and augmented by Gallager [14], Knuth [15], and Vitter [16]. Another important encoding method that has been extended to an adaptive version is arithmetic coding [17]. Adaptive methods for Fano coding have recently been introduced for binary and multi-symbol code alphabets [18], [19].

In this paper, we show how we can effectively use results from adaptive self-organizing data structures to enhance compression schemes. Adaptive lists have been used earlier in compression [20], and the Fenwick tree [21] maintains the list of probabilities as a tree so as to efficiently update the probability estimates¹. To the best of our knowledge, adaptive self-organizing trees have not been used in this regard, and this is what we have done. We present the theoretical framework, including correctness and convergence properties.

¹We should mention, however, that the Fenwick tree [21] does not use a self-organizing adaptive structure.
II. CONDITIONAL ROTATIONS IN BSTs

Conditional rotations in adaptive BSTs are the basic mechanism used in defining and operating the Fano BST [11]. Consider a BST, $T = \{t_1, \ldots, t_{2m-1}\}$. If $t_i$ is any internal node of $T$, then: $P_i$ is the parent of $t_i$, $L_i$ is the left child of $t_i$, $R_i$ is the right child of $t_i$, and $B_i$ is the brother (or sibling) of $t_i$. Using these primitives, $B_i$ can also be defined as follows: $L_{P_i}$ if $t_i$ is the right child of $P_i$, and $R_{P_i}$ if $t_i$ is the left child of $P_i$. The properties of the rotation operation can be found in [11].

Also, let $\alpha_i(n)$ be the total number of accesses to node $t_i$ up to time $n$; let $\tau_i(n)$ be the total number of accesses to $T_i$, the subtree rooted at $t_i$, up to time $n$, calculated as $\tau_i(n) = \sum_{t_j \in T_i} \alpha_j(n)$; and let $\kappa_i(n)$ be the weighted path length (WPL) of $T_i$, the subtree rooted at $t_i$, at time $n$, calculated as $\kappa_i(n) = \sum_{t_j \in T_i} \alpha_j(n)\lambda_j(n)$, where $\lambda_j(n)$ is the path length from $t_j$ up to node $t_i$. By simple induction, it follows that $\kappa_i(n) = \sum_{t_j \in T_i} \tau_j(n)$, and the values of $\tau_i(n)$ and $\kappa_i(n)$ can also be calculated recursively as $\tau_i(n) = \alpha_i(n) + \tau_{L_i}(n) + \tau_{R_i}(n)$ and $\kappa_i(n) = \tau_i(n) + \kappa_{L_i}(n) + \kappa_{R_i}(n)$.

The criterion for performing a rotation is the aim of decreasing the WPL of the entire tree as a result of the rotation. Let $\alpha'$, $\tau'$ and $\kappa'$ be the corresponding values after a rotation is performed. Then, a rotation is performed on node $t_i$ if $\theta_i(n) = \kappa_{P_i}(n) - \kappa'_i(n) > 0$, which is a naive conditional rotation heuristic. An optimized heuristic consists of maintaining, for each node, the value of $\tau$ only; the rotation is then performed on node $t_i$ if $\psi_i(n) \geq 0$, where $\psi_i$ is defined as $\psi_i(n) = 2\tau_i(n) - \tau_{R_i}(n) - \tau_{P_i}(n)$ if $t_i$ is a left child, and $\psi_i(n) = 2\tau_i(n) - \tau_{L_i}(n) - \tau_{P_i}(n)$ if $t_i$ is a right child.

Searching and updating are performed at the same time. Starting from the root towards the leaves, the key being searched for, $k_i$, is compared to that of the current node, $k_j$. If they are distinct, $\tau_j(n)$ is updated, and $k_i$ is searched for in the subtree rooted at $t_{L_j}$ if $k_i < k_j$, or in the subtree rooted at $t_{R_j}$ if $k_i > k_j$. When $k_i$ is found, $\tau_i(n)$ is updated and a rotation is performed on node $t_i$ whenever $\psi_i(n) \geq 0$. Note that in this approach, it is assumed that $k_i$ is always in $T$. A minimal sketch of this search-and-rotate process is given below.
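To make the optimized heuristic concrete, the following is a minimal Python sketch (ours, not from [11]) of the search-and-update step under the $\psi$-based rule. The Node class and function names are our own; only the $\alpha$/$\tau$ counters and the $\psi$ test follow the definitions above.

```python
class Node:
    """BST node for the conditional rotation heuristic; alpha counts
    accesses to the node itself, tau accesses to its whole subtree."""
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right
        self.parent = None
        self.alpha = 0
        self.tau = 0
        for child in (left, right):
            if child is not None:
                child.parent = self

def tau(node):
    return node.tau if node is not None else 0

def psi(t):
    # psi_i = 2*tau_i - tau_{R_i} - tau_{P_i} for a left child
    # (mirror-image for a right child)
    inner = t.right if t.parent.left is t else t.left
    return 2 * t.tau - tau(inner) - t.parent.tau

def rotate_up(t):
    """Promote t above its parent p; only the two affected tau values
    need repair, and both are recomputed locally."""
    p, g = t.parent, t.parent.parent
    if p.left is t:
        p.left, moved, t.right = t.right, t.right, p
    else:
        p.right, moved, t.left = t.left, t.left, p
    if moved is not None:
        moved.parent = p
    p.parent, t.parent = t, g
    if g is not None:
        if g.left is p:
            g.left = t
        else:
            g.right = t
    p.tau = p.alpha + tau(p.left) + tau(p.right)  # p is now a child of t
    t.tau = t.alpha + tau(t.left) + tau(t.right)
    return t

def search_and_update(root, key):
    """Search for key (assumed present), updating tau down the path,
    then conditionally rotate the accessed node one level up."""
    node = root
    while node.key != key:
        node.tau += 1
        node = node.left if key < node.key else node.right
    node.alpha += 1
    node.tau += 1
    if node.parent is not None and psi(node) >= 0:
        node = rotate_up(node)
    return node  # the accessed node, possibly promoted one level
```

Note that the $\tau$ repair after a rotation is purely local (two nodes), which is what makes the heuristic cheap; a caller that rotates at the tree root should take the returned node as the new root.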
III. ADAPTIVE FANO CODING USING BSTs

Prior to introducing the relevant data structures and their properties, we briefly state the underlying problem. Data encoding involves processing an input sequence, $X = x[1] \ldots x[M]$, where each input symbol, $x[i]$, is drawn from a source alphabet, $S = \{s_1, \ldots, s_m\}$, whose probabilities are $P = [p_1, \ldots, p_m]$, with $2 \le m < \infty$. The encoding process is rendered by transforming $X$ into an output sequence, $Y = y[1] \ldots y[R]$, where each output symbol, $y[i]$, is drawn from a code alphabet, $A = \{a_1, \ldots, a_r\}$. The main problem in lossless data compression is to find an encoding scheme that minimizes the size of $Y$ in such a way that $X$ is completely recovered by the decompression process. We consider the case in which $P$ is unknown, and is updated during the encoding/decoding process.

A new structure, the Fano BST, which is a particular case of the BST, and the shift operators, which can be applied to any BST, have recently been introduced in [18], [22]. Let $\alpha_i$, $\tau_i$, and $\kappa_i$ be the corresponding values (as defined for the conditional rotation heuristic) contained in node $t_i$ at time $n$, i.e., $\alpha_i(n)$, $\tau_i(n)$, and $\kappa_i(n)$, respectively.

Definition 1: Consider the source alphabet $S = \{s_1, \ldots, s_m\}$ whose probabilities of occurrence are $P = [p_1, \ldots, p_m]$, where $p_1 \ge \ldots \ge p_m$, and the code alphabet $A = \{0, 1\}$. A partitioning BST is a binary tree, $T = \{t_1, \ldots, t_{2m-1}\}$, whose nodes are identified by their indices (for convenience, also used as the keys $\{k_i\}$), and whose fields are the corresponding values of $\tau_i$. Furthermore, every partitioning BST satisfies:
(i) Each node $t_{2i-1}$ is a leaf, for $i = 1, \ldots, m$, where $s_i$ is the $i$th alphabet symbol satisfying $p_i \ge p_j$ if $i < j$.
(ii) Each node $t_{2i}$ is an internal node, for $i = 1, \ldots, m-1$.

Remark 1: Given a partitioning BST, $T = \{t_1, \ldots, t_{2m-1}\}$, the number of accesses to a leaf node, $\alpha_{2i-1}$, is a counter, and $p_i$ refers to either $\alpha_{2i-1}$ (the frequency counter) or the probability of occurrence of the symbol associated with $t_{2i-1}$. In fact, the probability of occurrence of $s_i$ can be estimated as follows:

$$p_i = \frac{\alpha_{2i-1}}{\sum_{j=1}^{m} \alpha_{2j-1}}. \qquad (1)$$
A particular case of the partitioning BST, the Fano BST, is defined as follows.

Definition 2: Let $T = \{t_1, \ldots, t_{2m-1}\}$ be a partitioning BST. $T$ is a Fano BST if, for every internal node, $t_{2i}$, one of the following conditions is satisfied:
(a) $\tau_{R_{2i}} - \tau_{L_{2i}} \le \tau_{2i+1}$ if $\tau_{L_{2i}} < \tau_{R_{2i}}$,
(b) $\tau_{L_{2i}} - \tau_{R_{2i}} \le \tau_{2i-1}$ if $\tau_{L_{2i}} > \tau_{R_{2i}}$, or
(c) $\tau_{L_{2i}} = \tau_{R_{2i}}$.

The procedure for constructing a Fano BST from the source alphabet symbols and their probabilities of occurrence is given in [18].

Remark 2: The structure of the Fano BST is similar to that of the BST used in the conditional rotation heuristic introduced in [11]. The difference, however, is that since the internal nodes do not represent alphabet symbols, the values of $\alpha_{2i}$ are all 0, and the quantities for the leaf nodes, $\{\alpha_{2i-1}\}$, are set to $\{p_i\}$ or to frequency counters representing them. Clearly, the total number of accesses to the subtree rooted at node $t_{2i}$, $\tau_{2i}$, is obtained as the sum of the number of accesses to all the leaves of $T_{2i}$, as given in the following lemma, whose proof can be found in [18].

Lemma 1: Let $T_{2i} = \{t_1, \ldots, t_{2s-1}\}$ be a subtree rooted at node $t_{2i}$. The total number of accesses to $T_{2i}$ is given by:

$$\tau_{2i} = \sum_{j=1}^{s} \alpha_{2j-1} = \tau_{L_{2i}} + \tau_{R_{2i}}. \qquad (2)$$
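As an illustration, the following sketch (our own, not from [18]) checks Definition 2 on a pointer-based partitioning BST, reading $\tau_{2i+1}$ and $\tau_{2i-1}$ as the weights of the left-most leaf of the right subtree and the right-most leaf of the left subtree of $t_{2i}$, i.e., the two leaves flanking the partition boundary.

```python
class PNode:
    """Partitioning-BST node: leaves carry frequency counters, internal
    nodes carry tau = sum of the leaf counters below them."""
    def __init__(self, tau, left=None, right=None):
        self.tau, self.left, self.right = tau, left, right

def leftmost_leaf(t):
    while t.left is not None:
        t = t.left
    return t

def rightmost_leaf(t):
    while t.right is not None:
        t = t.right
    return t

def is_fano_bst(t):
    """Check conditions (a)-(c) of Definition 2 at every internal node."""
    if t is None or t.left is None:        # empty tree or a leaf
        return True
    tl, tr = t.left.tau, t.right.tau
    if tl < tr and tr - tl > leftmost_leaf(t.right).tau:  # violates (a)
        return False
    if tl > tr and tl - tr > rightmost_leaf(t.left).tau:  # violates (b)
        return False                       # (c): tl == tr always holds
    return is_fano_bst(t.left) and is_fano_bst(t.right)

# Example: counters [4, 3, 2]; at the root, 5 - 4 = 1 <= 3 (condition (a)).
T = PNode(9, PNode(4), PNode(5, PNode(3), PNode(2)))
assert is_fano_bst(T)
```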
The following theorem relates the WPL of a Fano BST to the average code word length of the encoding schemes generated from that tree [18]. Invoking this result, the average code word length can be obtained by a single access to the root of the Fano BST.

Theorem 1: Let $T = \{t_1, \ldots, t_{2m-1}\}$ be a Fano BST constructed from the source alphabet $S = \{s_1, \ldots, s_m\}$ whose probabilities of occurrence are $P = [p_1, \ldots, p_m]$. If $\phi : S \to \{w_1, \ldots, w_m\}$ is an encoding scheme generated from $T$, then

$$\bar{\ell} = \sum_{i=1}^{m} p_i \ell_i = \frac{\kappa_{root}}{\tau_{root}} - 1, \qquad (3)$$

where $\ell_i$ is the length of $w_i$, $\tau_{root}$ is the total number of accesses to $T$, and $\kappa_{root} = \sum_{t_j \in T} \alpha_j \lambda_j$.

Remark 3: From Theorem 1, we see that the WPL and $\bar{\ell}$ are closely related: the smaller the WPL, the smaller the value of $\bar{\ell}$. Consequently, the problem of minimizing the WPL of a Fano BST is equivalent to minimizing the average code word length of the encoding schemes obtained from that tree. A small numeric sketch of equation (3) follows.
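The following fragment (ours) evaluates both sides of equation (3) on a toy tree, using the recursions $\tau_i = \alpha_i + \tau_{L_i} + \tau_{R_i}$ and $\kappa_i = \tau_i + \kappa_{L_i} + \kappa_{R_i}$ from Section II; internal nodes have $\alpha = 0$ as per Remark 2.

```python
class PNode:
    """Partitioning-BST node; internal nodes have alpha = 0."""
    def __init__(self, alpha=0, left=None, right=None):
        self.alpha, self.left, self.right = alpha, left, right

def tau(t):
    return 0 if t is None else t.alpha + tau(t.left) + tau(t.right)

def kappa(t):
    return 0 if t is None else tau(t) + kappa(t.left) + kappa(t.right)

# Toy Fano BST for counters [4, 3, 2]: code words 0, 10, 11.
T = PNode(0, PNode(4), PNode(0, PNode(3), PNode(2)))

direct = (4 * 1 + 3 * 2 + 2 * 2) / 9   # sum_i p_i * l_i = 14/9
via_root = kappa(T) / tau(T) - 1       # kappa_root / tau_root - 1
assert abs(direct - via_root) < 1e-12
```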
IV. SHIFTING OPERATIONS IN PARTITIONING BSTs

After the encoder updates the frequency counter of the current symbol and the corresponding nodes, the resulting partitioning BST may need to be changed so that a Fano BST is maintained. To achieve this, one of the following two operators can be applied to a partitioning BST: the Shift-To-Left (STL) operator and the Shift-To-Right (STR) operator. The STL operator is defined below, while the STR operator, analogous to the STL, can be found in [18]. The STL operator applied to an internal node of a partitioning BST consists of removing the left-most leaf of the subtree rooted at the right child of that node, and inserting it as the right-most leaf in the subtree rooted at the left child of that node.

Notation STL: Consider a partitioning BST, $T = \{t_1, \ldots, t_{2m-1}\}$, in which the weight of each node, $t_l$, is $\tau_l$, and the key of each internal node is $k_l$. Let $t_i$ be an internal node of $T$, $R_i$ also be an internal node of $T$, $t_j$ be the left-most leaf of the subtree rooted at $R_i$, and $t_k$ be the right-most leaf of the subtree rooted at $L_i$. Three mutually exclusive cases in which the STL operator can be applied are: STL-1: $P_{P_j} = t_i$ and $L_i$ is a leaf; STL-2: $P_{P_j} \neq t_i$ and $L_i$ is a leaf; and STL-3: $L_i$ is not a leaf. The STL operator applied in each of these scenarios is given below. The complete proofs of the lemmas can be found in [18].

Rule 1 (STL-1): Consider a partitioning BST, $T$, and Notation STL. Suppose that the scenario is that of Case STL-1. The STL operator applied to the subtree rooted at node $t_i$ consists of the following operations: the value $\tau_k - \tau_{B_j}$ is added to $\tau_{P_j}$, $k_i$ and $k_{P_j}$ are swapped², $B_j$ becomes the right child of $t_i$, $P_j$ becomes the left child of $t_i$, $t_k$ becomes the left child of $P_j$, and $t_j$ becomes the right child of $P_j$. A code sketch of this rule is given below.

Remark 4: The node on which the STL operator is applied, $t_i$, can be any internal node or the root satisfying Notation STL.

Lemma 2 (STL-1 validity): Consider a partitioning BST, $T = \{t_1, \ldots, t_{2m-1}\}$, specified as per Notation STL. If an STL operation is performed on the subtree rooted at node $t_i$ as per Rule 1, then the resulting tree, $T' = \{t'_1, \ldots, t'_{2m-1}\}$, is a partitioning BST.

²In the actual implementation, the Fano BST can be maintained in an array, in which the node $t_l$, $1 \le l \le 2m-1$, can be stored at position $l$. In this case, and in all the other cases of STL and STR, swapping these two keys could be avoided, and searching for the node $t_l$ could be done in a single access to position $l$ in the array.
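The following is a minimal sketch of Rule 1, under the assumptions of Case STL-1 (so $P_j = R_i$ and $L_i = t_k$ is a leaf). The node fields (key, tau, left, right) follow our PNode shape from the earlier sketches, extended with a key; the key swap follows the rule rather than the array layout of footnote 2.

```python
def stl_1(ti):
    """Rule 1 (STL-1): ti.left is the leaf t_k, ti.right is P_j, whose
    children are the leaf t_j (left) and its sibling B_j (right)."""
    tk, pj = ti.left, ti.right
    tj, bj = pj.left, pj.right
    pj.tau += tk.tau - bj.tau        # new tau_{P_j} = tau_k + tau_j
    ti.key, pj.key = pj.key, ti.key  # swap k_i and k_{P_j}
    ti.left, ti.right = pj, bj       # P_j and B_j re-hang under t_i
    pj.left, pj.right = tk, tj       # t_k and t_j become P_j's children
    return ti
```

After the shift, every internal weight is again the sum of its children's weights, which is the invariant asserted by Lemma 2.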
Rule 2 (STL-2): Consider a partitioning BST, $T = \{t_1, \ldots, t_{2m-1}\}$, and Notation STL. Suppose that we are in the scenario of Case STL-2. The STL operator applied to node $t_i$ involves the following operations: $\tau_k - \tau_{B_j}$ is added to $\tau_{P_j}$, $\tau_j$ is subtracted from all the $\tau$'s in the path from $P_{P_j}$ to $R_i$, $k_i$ and $k_{P_j}$ are swapped, $B_j$ becomes the left child of $P_{P_j}$, $t_j$ becomes the right child of $P_j$, $t_k$ becomes the left child of $P_j$, and $P_j$ becomes the left child of $t_i$.

Lemma 3 (STL-2 validity): Consider a partitioning BST, $T = \{t_1, \ldots, t_{2m-1}\}$, as per Notation STL. The resulting tree, $T' = \{t'_1, \ldots, t'_{2m-1}\}$, obtained after performing an STL-2 operation as per Rule 2, is a partitioning BST.

Rule 3 (STL-3): Consider a partitioning BST, $T = \{t_1, \ldots, t_{2m-1}\}$, specified using Notation STL, and the scenario of Case STL-3. The STL operator applied to the subtree rooted at $t_i$ consists of shifting $t_j$ to the subtree rooted at $L_i$ in such a way that: $\tau_k - \tau_{B_j}$ is added to $\tau_{P_j}$, $\tau_j$ is subtracted from all the $\tau$'s in the path from $P_{P_j}$ to $R_i$, $\tau_j$ is added to all the $\tau$'s in the path from $P_k$ to $L_i$, $k_i$ and $k_{P_j}$ are swapped, $B_j$ becomes the left child of $P_{P_j}$, $t_j$ becomes the right child of $P_j$, $t_k$ becomes the left child of $P_j$, and $P_j$ becomes the right child of $P_k$.

Example 1: Let $S = \{a, b, c, d, e, f, g\}$ be the source alphabet whose frequency counters are $P = [8, 3, 3, 3, 3, 3, 3]$. A partitioning BST, $T$, constructed from $S$ and $P$ is the one depicted on the left of Figure 1. After applying the STL operator to the subtree rooted at node $t_i$ (in this case, the root node of $T$), we obtain $T'$, the tree depicted on the right. Observe that $T'$ is a partitioning BST. The general result is stated in Lemma 4 given below, and proved in [18].

Lemma 4 (STL-3 validity): Let $T = \{t_1, \ldots, t_{2m-1}\}$ be a partitioning BST in which an STL-3 operation is performed as per Rule 3, resulting in a new tree, $T' = \{t'_1, \ldots, t'_{2m-1}\}$. Then, $T'$ is a partitioning BST.

Note that the weights of the internal nodes in the new tree, $T'$, are consistently obtained as the sums of the weights of their two children. This is achieved in only two local operations, as opposed to re-calculating all the weights of the tree in a bottom-up fashion. The formal algorithm that considers the three mutually exclusive cases discussed above can be found in [18].
[Figure 1 appears here.]

Fig. 1. A partitioning BST, $T$, constructed from $S = \{a, b, c, d, e, f, g\}$ and $P = [8, 3, 3, 3, 3, 3, 3]$, and the corresponding partitioning BST, $T'$, after performing an STL-3 operation. The top-most node can be a left child, a right child, or the root.
V. THE CONDITIONAL SHIFTING HEURISTIC

The aim of the compression scheme is to maintain a Fano BST at all times, at both the encoding and decoding stages. In the encoding algorithm, at time $k$, the symbol $x[k]$ is encoded using a Fano BST, $T(k)$, which is identical to the one used by the decoding algorithm to retrieve $x[k]$. $T(k)$ must be updated in such a way that at time $k+1$, both algorithms maintain the same tree, $T(k+1)$. The procedure takes a Fano BST and the node corresponding to the current symbol occurring in the input sequence, and generates a new Fano BST. To achieve this, we introduce a conditional shifting heuristic, used to transform a partitioning BST into a Fano BST. The conditional shifting heuristic is based on the principles of Fano coding, namely the nearly-equal-probability property defined in [23]. This heuristic, used in conjunction with the STL and STR operators defined earlier, transforms a partitioning BST into a Fano BST. For every internal node, the shifting operations are performed as per the following rule.

Rule 4: Consider a partitioning BST, $T = \{t_1, \ldots, t_{2m-1}\}$. Let $t_i$ be an internal node of $T$, $t_j$ be the left-most leaf of the subtree rooted at $R_i$, and $t_k$ be the right-most leaf of the subtree rooted at $L_i$.
(i) If $\theta_1 = \tau_{R_i} - \tau_{L_i} - \tau_j > 0$, then perform an STL operation on $t_i$.
(ii) If $\theta_2 = \tau_{L_i} - \tau_{R_i} - \tau_k > 0$, then perform an STR operation on $t_i$.

We now introduce a definition that is important in the formalization of the algorithms that implement the updating procedure. Let a two-leaf internal node be an internal node whose left and right children are leaves.

Definition 3: Let $T = \{t_1, \ldots, t_{2m-1}\}$ be a partitioning BST. An internal node, $t_i$, is a two-leaf internal node if and only if $L_i$ and $R_i$ are both leaves. A sketch of the shifting test of Rule 4 follows.
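A compact sketch (ours) of Rule 4, combined with the recursive re-checking described next, might look as follows. The names mirror the text; stl and str_ stand for the full three-case operators of Section IV and are only stubbed here.

```python
def leftmost_leaf(t):
    while t.left is not None:
        t = t.left
    return t

def rightmost_leaf(t):
    while t.right is not None:
        t = t.right
    return t

def stl(t):
    raise NotImplementedError  # the three-case Shift-To-Left of Section IV

def str_(t):
    raise NotImplementedError  # Shift-To-Right, the mirror image [18]

def check_shift(t):
    """Apply Rule 4 at t and, after a shift, re-check both children.
    The updating procedure calls this on every non-two-leaf internal
    node along the search path; for a two-leaf internal node both
    theta values come out <= 0, so no shift is triggered there."""
    if t is None or t.left is None:      # a leaf: nothing to check
        return
    theta1 = t.right.tau - t.left.tau - leftmost_leaf(t.right).tau
    theta2 = t.left.tau - t.right.tau - rightmost_leaf(t.left).tau
    if theta1 > 0:
        stl(t)
        check_shift(t.left)
        check_shift(t.right)
    elif theta2 > 0:
        str_(t)
        check_shift(t.left)
        check_shift(t.right)
```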
The procedure for updating the Fano BST, updateFanoBST(...), is formalized in Algorithm Fano BST Updating, which can be found in [18]. The weights of all the internal nodes in the path from $P_n$ to the root are updated. After this, the path from the root down to $t_n$ is inspected to see whether there is a non-two-leaf internal node that does not satisfy the conditions stated in Definition 2. This is achieved by invoking the procedure checkShift(...), explained below. The path from the root down to $t_n$ is traced by using the key of each internal node, in such a way that the key of $t_n$, $k_n$, is searched for as in a binary search tree.

The procedure checkShift(...) of Algorithm Fano BST Updating is responsible for checking that all the non-two-leaf internal nodes satisfy Definition 2. This procedure receives as parameters the partitioning BST, $T$, and the non-two-leaf internal node to be inspected, $t_i$. By following Rule 4, if (i) is true (i.e., $\theta_1 > 0$), an STL operation is applied to $t_i$, and checkShift(...) is recursively invoked for the left and right children of $t_i$. If $\theta_1 \le 0$, the condition $\theta_2 > 0$ is evaluated, and if it is true, an STR operation is performed on $t_i$, and, as in the case of the STL operation, checkShift(...) is recursively invoked for the left and right children of $t_i$. After checkShift(...) has been executed on all the non-two-leaf internal nodes of the path from the root to the leaf associated with $x[k]$, $t_n$, the partitioning BST has been transformed into a Fano BST.

The encoding algorithm, Fano BST Encoding (Algorithm 1), takes the source sequence and generates the encoded sequence by adaptively maintaining a Fano BST. The initial Fano BST is created by invoking the procedure FanoBST(...) given in [18]. Let $T(k)$ be the Fano BST at time $k$, which, in the algorithm, is represented by root. The encoding process proceeds, as usual, by scanning all the symbols of $X$ from left to right. At time $k$, $i$ is obtained as the index of $x[k]$ in $S$. By taking advantage of Property (i) of a Fano BST, $2i-1$ is searched for in $T(k)$ as if in a binary search tree: going to the left child when $2i-1 < k_n$, or to the right child otherwise. The value $2i-1$ is always found, since, as mentioned earlier, we assume that all the keys are already in the BST. Besides, when going to the left child, a ‘0’ is sent to the output, and when going to the right child, a ‘1’ is sent to the output. Any of the $2^{m-1}$ labeling strategies other than the one used here can also be used.

The decoding algorithm performs the reverse process, and at each decoding step, the Fano BST is updated in the same way as in the encoder, i.e., by invoking the same procedure. The formal algorithm, Fano BST Decoding, can be found in [18].

The properties of the encoding and decoding algorithms follow. The complete proofs can be found in [18]. First of all, we show that the tree obtained after performing the updating procedure is a Fano BST. We then analyze the correctness of Algorithms Fano BST Encoding and Fano BST Decoding. The third property that we study is related to the asymptotic number of changes required to maintain a Fano BST for stationary distributions.

Theorem 2: Let $T(v) = \{t_1(v), \ldots, t_{2m-1}(v)\}$ be the Fano BST at time $v$, and $t_x(v)$ be the leaf of $T(v)$ associated with the last processed symbol, $x[v]$. Suppose that $T(v)$ and $t_x(v)$ are the parameters of the procedure updateFanoBST(...). Then, the resulting tree, $T(v+1)$, is a Fano BST.
Algorithm 1 Fano BST Encoding
Input: The source alphabet, S. A source sequence, X = x[1] . . . x[M].
Output: The encoded sequence, Y = y[1] . . . y[R].
Assumptions: The Fano BST is constructed by invoking FanoBST(...), and maintained correctly by invoking updateFanoBST(...).
Method:
  η_i(1) ← 1 for i = 1, . . . , m
  Create node root
  η(1) ← Σ_{i=1}^{m} η_i(1); τ_root ← Σ_{i=1}^{m} η_i(1)
  FanoBST(S, η(1), 1, m, root, τ_root)
  j ← 1
  for k ← 1 to M do
    i ← position of x[k] in T from left to right (counting the leaves only)
    t_n ← root                          // For each symbol, start again from root
    while t_n ≠ “leaf” do
      if 2i − 1 < k_n then              // Perform a binary search
        y[j] ← ‘0’; t_n ← L_n           // Go to left child
      else
        y[j] ← ‘1’; t_n ← R_n           // Go to right child
      endif
      j ← j + 1
    endwhile
    updateFanoBST(T, root, t_n)
  endfor
end Algorithm Fano BST Encoding
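Algorithm Fano BST Decoding itself is only referenced to [18] in the text. For intuition, a hypothetical minimal rendition of the decoding walk (our own sketch, not the formal algorithm) is given here; it assumes the same tree maintenance routine as the encoder, left as a stub, and a caller-supplied map from leaf nodes to symbols.

```python
def update_fano_bst(root, leaf):
    """Stub for the shared maintenance step (updateFanoBST in the text)."""
    pass

def decode(root, bits, symbol_of_leaf):
    """Walk the Fano BST bit by bit: '0' goes left, '1' goes right.
    On reaching a leaf, emit its symbol, update the tree exactly as the
    encoder does, and restart from the root."""
    out = []
    node = root
    for b in bits:
        node = node.left if b == '0' else node.right
        if node.left is None and node.right is None:  # reached a leaf
            out.append(symbol_of_leaf[node])
            update_fano_bst(root, node)
            node = root
    return out
```

Because both sides invoke the same update after each symbol, the encoder and decoder trees stay in lock-step, which is exactly what Theorem 3 below asserts.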
Theorem 3: Consider the source alphabet $S = \{s_1, \ldots, s_m\}$ and the code alphabet $A = \{0, 1\}$. Suppose that the input sequence $X = x[1] \ldots x[M]$ is encoded by using Algorithm Fano BST Encoding, yielding an output sequence $Y = y[1] \ldots y[R]$. If $Y$ is decoded with Algorithm Fano BST Decoding into $X' = x'[1] \ldots x'[T]$, then $X' = X$ (i.e., $T = M$, and $x'[i] = x[i]$ for all $i = 1, \ldots, M$).

Sketch of Proof: This result is proved by a double induction on the number of encoding steps. The outer induction is based on the encoding of $x[k]$ at the $k$th step, and the inner induction on the depth of the node visited in traversing the Fano BST in the encoding and the decoding. The complete proof can be found in [18].

Theorem 4: Consider the assumptions stated in Theorem 3. Encoding $X$ into $Y$ using Algorithm Fano BST Encoding, and decoding $Y$ into $X'$ following Algorithm Fano BST Decoding, yield on-line compression.

Sketch of Proof: The proof follows from the inductive proof of Theorem 3, observing that each symbol encoded from the input, $x[k]$, is simultaneously decoded into $x'[k] = x[k]$. This is done by decoding its code word, $y[j] \ldots y[j + \ell_k]$, where $\ell_k$ is the length of the code word of $x[k]$, in a prefix manner. Consequently, encoding by using Algorithm Fano BST Encoding and decoding by using Algorithm Fano BST Decoding achieve on-line compression at each individual bit processed.

Theorem 5: Consider a stationary source whose alphabet is $S = \{s_1, \ldots, s_m\}$, and whose probabilities of occurrence are $P = [p_1, \ldots, p_m]$, where $p_1 \ge \ldots \ge p_m$. If $\eta(k) = \{\eta_1(k), \ldots, \eta_m(k)\}$ are the frequency counters of $\{s_1, \ldots, s_m\}$, the Fano BST maintained using the procedure updateFanoBST(...) asymptotically converges to the static
Fano tree, where the latter is the tree that would have been obtained if the occurrence probabilities had been known a priori.

Sketch of Proof: The theorem follows by observing that the list maintained by the static Fano encoding is identical to the asymptotic list that would be obtained by traversing the leaves of the Fano BST in a left-to-right manner. The advantage, however, is that in the Fano BST, we do not have to traverse the “list” of the leaves to obtain the partitioning: the sums of the probabilities of the sublists reside in the corresponding $\tau$ values of the parents of these leaves, all the way up to the root. The complete proof can be found in [18].

Corollary 1: The number of updates in the Fano BST asymptotically tends to zero.
VI. EMPIRICAL RESULTS

We conclude the paper by showing the performance of the adaptive Fano coding that uses BSTs on the files of the Canterbury corpus [24]. We analyze the variability of the Fano BST using two different measures: the number of encoding steps with shift operations, and the total number of shifts in the entire encoding process. An encoding step involves reading a symbol from the input sequence and generating its code word.

The empirical results obtained are shown in Table I. The first and second columns contain the file name and the size (in bytes), $l_X$, of the original file. The third column contains the size (in bits), $l_Y$, of the compressed file, $Y$. The fourth column, $\gamma_b$, contains the total number of encoding steps in which at least one shift operation was performed, and the fifth, $\varphi_b$, contains the percentage of encoding steps with at least one shift operation, calculated as $\varphi_b = (\gamma_b / l_X) \cdot 100$. The next column, $\ell_b$, indicates how often (i.e., once every how many steps) an encoding step contains at least one shift operation, and is calculated as $\ell_b = l_X / \gamma_b$, while $\zeta_s$ is the total number of shift (STL or STR) operations performed in the entire encoding process. The last column, $\zeta_n$, shows the number of shift operations per node visited when generating the code words, calculated as $\zeta_n = \zeta_s / l_Y$. A small sketch that recomputes these derived quantities from the raw counts is given after Table I.

We observe that although the total number of steps with shift operations is smaller for small files, the difference for large files is not proportional. For example, for the file grammar.lsp, $\gamma_b = 462$, while for the file kennedy.xls (the largest file in the test set), $\gamma_b = 3{,}508$. This behavior is reflected in the percentage of steps with shift operations, $\varphi_b$. We see that for large files, such as kennedy.xls and lcet10.txt, $\varphi_b$ is quite small: 0.34% and 0.25%, respectively. Conversely, $\varphi_b$ is large for small files, such as grammar.lsp and xargs.1, for which $\varphi_b = 12.42\%$ and $\varphi_b = 11.38\%$, respectively. With regard to the frequency of steps with shifts, $\ell_b$ is small for small files and large for large files. This behavior is as expected, since fewer changes in the Fano BST are likely to occur after some time, when the estimated probabilities converge to the actual probabilities of the source symbols. When analyzing the number of shifts per node visited during the encoding, we observe that $\zeta_n$ is large for small files and small for large files. The behavior observed in these empirical results clearly corroborates the theoretical analysis presented earlier.
TABLE I
VARIABILITY OF THE FANO BST DURING THE ENCODING PROCESS ON THE FILES OF THE CANTERBURY CORPUS.

File Name      l_X (bytes)   l_Y (bits)   γ_b (steps)   φ_b (%)   ℓ_b (steps)   ζ_s (shifts)   ζ_n (shifts)
alice.txt         148,481      681,903          910       0.61       163.17         97,795         0.143
asyoulik.txt      125,179      610,304          905       0.72       138.32         23,260         0.038
cp.html            24,603      131,498          841       3.42        29.25         86,699         0.659
fields.c           11,150       58,080          714       6.40        15.61         58,760         1.012
grammar.lsp         3,721       18,683          462      12.42         8.05         38,320         2.051
kennedy.xls     1,029,744    3,722,970        3,508       0.34       293.54        155,813         0.042
lcet10.txt        419,235    1,954,593        1,038       0.25       403.89        109,030         0.056
plrabn12.txt      471,162    2,136,386          788       0.17       597.92         87,801         0.041
ptt5              513,216      855,435        1,349       0.26       380.44        114,144         0.133
sum                38,240      209,802        1,832       4.79        20.87         95,747         0.456
xargs.1             4,227       22,052          481      11.38         8.78         45,876         2.080
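The derived columns of Table I follow directly from the raw counts; the following fragment (ours) recomputes them for the alice.txt row as a sanity check.

```python
def table_metrics(l_x, l_y, gamma_b, zeta_s):
    """Derived columns of Table I from the raw counts:
    phi_b (% of steps with a shift), ell_b (steps per shift-step),
    and zeta_n (shifts per node visited, i.e., per output bit)."""
    phi_b = 100.0 * gamma_b / l_x
    ell_b = l_x / gamma_b
    zeta_n = zeta_s / l_y
    return phi_b, ell_b, zeta_n

# alice.txt row: l_X = 148,481 bytes, l_Y = 681,903 bits,
# gamma_b = 910 steps with shifts, zeta_s = 97,795 shifts.
print(table_metrics(148_481, 681_903, 910, 97_795))
# -> approximately (0.61, 163.17, 0.143), matching Table I
```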
VII. CONCLUSIONS

We have introduced an adaptive Fano coding scheme that uses a new structure, the Fano BST. By means of a rigorous theoretical analysis, we have shown the correctness of the updating procedure, i.e., that it receives a partitioning BST and, using the tree-based operators mentioned above, outputs a Fano BST. We have also shown the validity of the encoding and decoding algorithms, which achieve lossless compression while simultaneously maintaining a Fano BST, and we have analyzed the variability of the Fano BST during the encoding/decoding process. We have shown that when dealing with stationary sources, the Fano BST asymptotically converges to the static Fano tree, and consequently, fewer changes are required to maintain the “nearly-optimal” Fano BST. Finally, our results show that for small files, a large number of changes (shift operations) per encoding step are performed, whereas few changes per node are required when encoding larger files, corroborating the theoretical analysis.

ACKNOWLEDGMENTS

This research work has been partially supported by NSERC, the Natural Sciences and Engineering Research Council of Canada, and CONICYT, the National Council for Scientific and Technological Research of Chile.

REFERENCES

[1] D. Knuth, The Art of Computer Programming, vol. 3, Reading, MA: Addison-Wesley, 1973.
[2] W. Walker and C. Gotlieb, “A Top-down Algorithm for Constructing Nearly Optimal Lexicographical Trees,” in Graph Theory and Computing, New York: Academic Press, 1972.
[3] B. Allen and I. Munro, “Self-organizing Binary Search Trees,” J. Assoc. Comput. Mach., vol. 25, pp. 526–535, 1978.
[4] J. Iacono, “Alternatives to splay trees with o(log n) worst-case access times,” in Proceedings of the 12th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA-01), 2001, pp. 516–522.
[5] D. Sleator and R. Tarjan, “Self-adjusting Binary Search Trees,” J. Assoc. Comput. Mach., vol. 32, pp. 652–686, 1985.
[6] J. Bitner, “Heuristics that Dynamically Organize Data Structures,” SIAM Journal on Computing, vol. 8, pp. 82–110, 1979.
[7] S. Bent, D. Sleator, and R. Tarjan, “Biased Search Trees,” SIAM Journal on Computing, vol. 14, pp. 545–568, 1985.
[8] K. Mehlhorn, “Dynamic Binary Search,” SIAM Journal on Computing, vol. 8, pp. 175–198, 1979.
[9] C. Aragon and R. Seidel, “Randomized Search Trees,” in Proceedings of the 30th Annual IEEE Symposium on Foundations of Computer Science, 1989, pp. 540–545.
[10] M. Sherk, “Self-adjusting k-ary Search Trees and Self-adjusting Balanced Search Trees,” Tech. Rep. 234/90, University of Toronto, Toronto, Canada, February 1990.
[11] R. Cheetham, B. J. Oommen, and D. Ng, “Adaptive Structuring of Binary Search Trees Using Conditional Rotations,” IEEE Transactions on Knowledge and Data Engineering, vol. 5, no. 4, pp. 695–704, 1993.
[12] T. Lai and D. Wood, “Adaptive Heuristics for Binary Search Trees and Constant Linkage Cost,” SIAM Journal on Computing, vol. 27, no. 6, pp. 1564–1591, December 1998.
[13] N. Faller, “An Adaptive System for Data Compression,” in Seventh Asilomar Conference on Circuits, Systems, and Computers, 1973, pp. 593–597.
[14] R. Gallager, “Variations on a Theme by Huffman,” IEEE Transactions on Information Theory, vol. 24, no. 6, pp. 668–674, 1978.
[15] D. Knuth, “Dynamic Huffman Coding,” Journal of Algorithms, vol. 6, pp. 163–180, 1985.
[16] J. Vitter, “Design and Analysis of Dynamic Huffman Codes,” Journal of the ACM, vol. 34, no. 4, pp. 825–845, 1987.
[17] D. Hankerson, G. Harris, and P. Johnson Jr., Introduction to Information Theory and Data Compression, CRC Press, 1998.
[18] L. Rueda, Advances in Data Compression and Pattern Recognition, Ph.D. thesis, School of Computer Science, Carleton University, Ottawa, Canada, April 2002. Electronically available at http://www.inf.udec.cl/~lrueda/papers/PhdThesis.pdf.
[19] L. Rueda and B. J. Oommen, “A Fast and Efficient Nearly-optimal Adaptive Fano Coding Scheme,” Information Sciences, vol. 176, pp. 1656–1683, 2006.
[20] S. Albers and M. Mitzenmacher, “Average Case Analyses of List Update Algorithms, with Applications to Data Compression,” Algorithmica, vol. 21, pp. 312–329, 1998.
[21] P. Fenwick, “A New Data Structure for Cumulative Frequency Tables,” Software: Practice and Experience, vol. 24, no. 3, pp. 327–336, 1994.
[22] L. Rueda and B. J. Oommen, “Efficient Adaptive Data Compression using Fano Binary Search Trees,” in Proceedings of the 20th International Symposium on Computer and Information Sciences, Istanbul, Turkey, 2005, pp. 768–779.
[23] L. Rueda and B. J. Oommen, “A Nearly Optimal Fano-Based Coding Algorithm,” Information Processing & Management, vol. 40, no. 2, pp. 257–268, 2004.
[24] R. Arnold and T. Bell, “A Corpus for the Evaluation of Lossless Compression Algorithms,” in Proceedings of the Data Compression Conference, Los Alamitos, CA: IEEE Computer Society Press, 1997, pp. 201–210.