Randomness Preserving Deletions On Special Binary Search Trees

Michaela Heyer

MSc in Computer Science
Centre For Efficiency-Oriented Languages
Department of Computer Science
National University of Ireland, Cork

Supervisor: Dr. Michel Schellekens
Head of Department: Prof. Gregory Provan

March 2005

Funded by Science Foundation Ireland, Investigator Award.
Contents

1 Introduction

2 Background
  2.1 Average-Case Analysis of Algorithms
  2.2 Binary Search Trees
    2.2.1 What is a Binary Tree?
    2.2.2 Tree Properties
    2.2.3 Insertions in Binary Search Trees
    2.2.4 Deletions in Binary Search Trees
    2.2.5 Randomness Preservation in Binary Search Trees
  2.3 Randomized Binary Search Trees
  2.4 Randomized Treaps

3 The Ordered Binary Search Tree
  3.1 Introduction
  3.2 The Find/Insertion Algorithm
  3.3 The Deletion Algorithm
  3.4 Randomness Preservation
  3.5 Average-Case Analysis

4 An Empirical Comparison of RBST and OBST
  4.1 The Experiment
  4.2 The Results
  4.3 Analysing the Results

5 Conclusion and Future Work

A RBST Pseudo Code
  A.1 Insertion
  A.2 Insertion at the Root
  A.3 Split
  A.4 Deletion
  A.5 Join

B Randomized Treap Pseudo Code
  B.1 Insertion
  B.2 Deletion
  B.3 Rotation
  B.4 Functions

C Notes on the Pseudo-Code
  C.1 Replace() Method
  C.2 The Arrow Assignment
  C.3 Other Methods

References
Acknowledgement

I would like to thank my supervisor Dr. Michel Schellekens and also Dr. Joseph Manning for all their support and help with this thesis. My external examiner Dr. Geoff Hamilton has also provided useful comments and suggestions and I am very grateful for that.
Abstract

Deletions in binary search trees are difficult to analyse as they are not randomness preserving. We present a new kind of tree which differs slightly from the standard binary search tree. It is referred to as an ordered binary search tree, as it stores a history element in its nodes, which provides information about the order in which the nodes were inserted. Using this extra information it is possible to design a new randomness preserving deletion algorithm. We introduce this algorithm, explain why it works and prove that it is indeed randomness preserving. We also show through experimental analysis that it actually runs faster than one of the other main random binary search tree deletion algorithms currently in use.
Chapter 1 Introduction

Binary search trees (BST) are widely used for the storage and easy access of information. They provide the user with very simple algorithms for the basic operations like searching for, inserting or deleting an item. We will concentrate on the average-case running time of BST algorithms, in particular that of the deletion algorithm.

When trying to analyse the average running time of any algorithm, it is an advantage if we can assume a random input sequence. In general it is easy to see that we can fulfil this requirement by simply randomizing the list that is to be inserted into the data structure. However, we also require that the algorithms for the basic operations over that data structure be randomness preserving.

Randomness preservation is a topic of active investigation at the Centre for Efficiency-Oriented Languages. As remarked in [13], randomness preservation is a crucial property which enables and considerably simplifies (semi-automated) average-case time analysis. Randomness preservation plays, for instance, a crucial role in the possibility of carrying out so-called "backwards analysis of algorithms" [7], currently one of the most elegant methods for average-case time analysis. Randomness preservation has thus far been defined in a precise way in the literature for trees and has been used in an intuitive way in a more general setting. Certain algorithms, such as all variants of traditional Heapsort and combinations of deletions and insertions on random binary search trees, are not randomness preserving [7, 8, 14, 2]. The lack of randomness preservation has
been pointed out by Edelkamp, Flajolet and Knuth as a fundamental obstacle to average-case analysis (in the case of Heapsort-variants) and as a serious complication in general to average-case analysis (e.g. for the case of deletions and insertions in binary search trees). This last fact is also illustrated by Schellekens in [13] for the case of a Bubblesort-variant. When an algorithm is "randomness preserving" the general consensus is that average-case time analysis is feasible. The fact that this is indeed the case has been formally demonstrated in [13], which introduces a programming language ACETT for which all programs are guaranteed to be randomness preserving and as a result give rise to recurrence equations for the average-case time in a (semi-)automated fashion. In particular, [13] introduces the first Heapsort-variant which is randomness preserving and for which the average-case analysis is feasible and straightforward. [13] puts the notion of randomness preservation on a formal basis, extending the definition to the context of data structures which are general partial orders. Since we focus on binary trees, it will suffice in this context to use the standard restricted definition of randomness preservation. In the context of a BST, this can be understood as follows: given a tree T, we need to be able to perform any number of operations on it, thus creating a new tree T′, and be sure that the probability of T′ being created through those operations is the same as the probability that T′ was created by successive insertions of the elements of a random permutation of a list.

In [5] Jonassen and Knuth describe Hibbard's theorem on deletions in BSTs, which states that "If n + 1 items are inserted into an initially empty binary tree, in random order, and if one of those items (selected at random) is deleted, the probability that the resulting binary tree has a given shape is the same as the probability that this tree shape would be obtained by inserting n items into an initially empty tree, in random order." They also describe the paradox, discovered by Gary D. Knott, which shows that a subsequent random insertion after a deletion does not necessarily produce a random tree. Details of this paradox can be found in [6].

The question about the existence of a randomness preserving deletion algorithm for BSTs was posed by Knuth [8], and Martinez and Roura [9] were the
first ones to claim to have solved the problem. However, instead of preserving randomness in a strict sense, their algorithm reintroduces it via a random number generator each time a new insertion or deletion is performed.

An important fact to keep in mind, as pointed out above, is that randomness preservation is a crucial pre-requisite to facilitate average-case analysis. To carry out traditional average-case analysis in a feasible way two requirements need to be met: 1) one needs to assume that input data structures are random (which can e.g. be achieved if necessary by randomization) and 2) one needs to guarantee that the basic operations over these data structures preserve randomness. The fact that the method of [9] requires the introduction of an extraneous random factor as part of the operations in order to preserve randomness of the original input data is a drawback. Such a drawback could be justified if one were to prove that no randomness preserving operations can exist on the given data structure and that the introduction of an additional random element repairs this situation. However, as we will show, truly randomness preserving operations can be introduced on a very natural type of binary tree, the ordered binary search tree, through the incorporation of a history notion.

The present investigation originated from the goal to incorporate insertions and deletions on BSTs in the context of the programming language ACETT [13]. Here we present randomness preserving operations on ordered binary search trees independently of ACETT, as this is of independent interest, and we leave the incorporation of such operations in ACETT as a topic for future research.

This thesis introduces a new kind of BST, the so-called ordered binary search tree (OBST), making it possible to design a truly randomness preserving deletion algorithm which does not rely on the use of a random number generator. Instead it will consistently preserve randomness, i.e., for any number of random insertions and deletions the resulting BST will remain random and each possible tree structure will have the same probability. This not only simplifies the analysis, it also guarantees logarithmic average performance.

We will be covering some necessary background in Chapter 2, where a basic
introduction to average-case analysis is given in Section 2.1. We explain why average-case analysis of algorithms is so important and how it can be performed. There are many methods known for this, but only the ones which are required to understand the analysis in Section 3.5 are covered. We will explain how to use generating functions to solve recurrences and provide the reader with references to good introductions to this area. We will then explain BSTs and their terminology. We comment on some of the main properties of BSTs, like depth and height and their relationships to each other, which will be required for the analysis in Section 3.5. The general idea of the deletion algorithm and the problems it causes are explained in Section 2.2.4. Two of the main models providing randomized algorithms for the basic operations on BSTs are randomized binary search trees, which are covered in Section 2.3, and randomized treaps, which are covered in Section 2.4. In Chapter 3 we develop a new data structure, the ordered binary search tree. The main idea behind this new tree is described and the two main algorithms, insertion and deletion, are provided in pseudo code. In Section 3.4 we compare the probabilities of the different tree structures after a normal deletion with those using the new deletion algorithm on an example of a tree of size three. We then move on to provide a general proof of randomness preservation by proving that the OBST structure and its algorithms for the basic operations provide a bijection with permutations. The actual average-case analysis is done in Section 3.5, where we show that both the insertion and the deletion algorithm have an expected performance of O(log n). Chapter 4 compares OBSTs directly with randomized binary search trees by measuring their running times experimentally and shows that the OBST deletion runs a lot faster than the randomized binary search tree deletion. Finally we provide a conclusion and some ideas for future work in Chapter 5.
Chapter 2 Background

2.1 Average-Case Analysis of Algorithms
Often we need to know how long an algorithm will take to run. For example, if there are two algorithms performing essentially the same action, we would choose the one that runs faster. Also, when scheduling multiple tasks or algorithms, it is helpful to know the running times so we can schedule the tasks accordingly and thus save time. In most of these cases it suffices to look at the extremal values, i.e., the best- or worst-case performance. At other times, however, we might be interested to know how an algorithm performs on average. For example, when comparing algorithm A and algorithm B, we might find that A has a better worst-case performance than B and therefore choose A. On average, however, A may perform a lot worse than B, because B's worst-case may occur less frequently than A's worst-case. Similarly, when scheduling resources for an algorithm whose worst-case performance differs greatly from its average-case performance, we might actually be wasting resources by provisioning for the worst case. So it can be useful in some cases to know the average behaviour of an algorithm. The problem is that it is a lot harder to compute the average running time, especially compared to computing the best- or worst-case running time. A lot of research has been done in this area, trying to find methods and ways to simplify this process. Some of the most important ideas can be found in [14], where the authors suggest the following two main methods:
Distributional: This is the approach which seems the most obvious. We calculate each possible cost c, multiply it by the number n_c of inputs causing that cost, sum over all costs, and divide by the total number of inputs n to get the expected cost:

$$E_c = \frac{1}{n} \sum_c n_c \, c$$
Cumulative: This approach appears to be quite similar to the previous one, but is actually a lot more usable. Again we compute the total cost of the algorithm and divide it by the number of inputs. But here we might obtain this total cost through other means than by summing up the individual costs. Generating functions, details of which can be found in [14, 15, 17], greatly simplify this process of counting elements and are widely used.

No matter which approach is taken, generating functions (GFs) play a very important role in average-case analysis. They are used to directly count elements with a certain cost or other property (ordinary and bivariate generating functions), they can provide us with extra information such as mean and variance (probability generating functions), and they can be used to directly derive a functional equation from the recursive definition of a data structure (symbolic method). Here we will concentrate only on the usage of ordinary generating functions (OGFs) to solve recurrences, as this will be used in Section 3.5 for the analysis of the deletion algorithm in the OBST. More detailed information about the other uses of GFs can be found in [14, 18]. When given a recurrence, the following steps provide a method to solve that recurrence easily using OGFs:

1. Multiply both sides of the recurrence by $z^n$ and sum over all values of n for which the recurrence holds.

2. Express both sides of the resulting equation explicitly in terms of the generating function $A(z) = \sum_{n \ge 0} a_n z^n$.

3. Solve the resulting equation for the unknown generating function $A(z)$.
4. Express the OGF as a power series to get expressions for the coefficients.

For steps 2 and 4 it might be helpful to use one of the dictionaries that are provided for OGFs, for example in [14], which give the basic operations as well as some of the coefficient extractions. A small worked example of this recipe is given at the end of this section.

Another aspect that is very important in average-case analysis is that of randomness preservation. If we can assume a random input and guarantee that the algorithms for the basic operations on the data structure preserve randomness, then we can use probabilistic models to analyse the expected performance. For example, an array produced by a random insertion sequence will have the property that any given element has a probability of 1/(size of array) of being the middle element. This knowledge can be very useful when analysing, for example, an algorithm that selects the middle element for a certain action. It is very easy to produce a random input by simply randomizing the input list before supplying it to the algorithm. Alternatively there are methods which use some form of randomization inside the algorithm. Examples include randomized binary search trees and treaps, which are covered in Sections 2.3 and 2.4 respectively. Further information about randomized algorithms in general can also be found in [11]. Preserving randomness however can sometimes be hard, and for BSTs no randomness preserving deletion algorithm had been discovered until now. Considering other data structures, we find a randomness preserving deletion algorithm in [13]. The algorithm is written in ACETT (Average Case Execution Time Tool), a new programming language which simplifies average-case analysis through the use of randomness preserving data structures and algorithms. ACETT's deletion algorithm can currently be applied to general labelled partial orders and heap-ordered trees in particular.
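To illustrate the four-step recipe, here is a small worked example (our own addition, not taken from the thesis) solving the recurrence a_n = 2a_{n-1} + 1 for n ≥ 1 with a_0 = 0 using an OGF:

    % Step 1: multiply by z^n and sum over n >= 1, then
    % Step 2: express both sides in terms of A(z) = \sum_{n \ge 0} a_n z^n
    %         (note that a_0 = 0):
    \begin{align*}
      A(z) &= 2z\,A(z) + \frac{z}{1-z} \\
      % Step 3: solve for A(z) and split into partial fractions:
      A(z) &= \frac{z}{(1-z)(1-2z)} = \frac{2z}{1-2z} - \frac{z}{1-z} \\
      % Step 4: extract coefficients via the dictionary entry
      %         [z^n] 1/(1-cz) = c^n:
      a_n  &= 2^n - 1.
    \end{align*}

This is the well-known closed form for the Towers of Hanoi recurrence, so the answer can be checked independently.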
7
2.2 Binary Search Trees

2.2.1 What is a Binary Tree?
In everyday life we encounter trees on a regular basis: for example family trees, the directory structure on a computer or even the hierarchy of chapters and sub-chapters in a thesis. These examples can give us a good idea of what a tree might look like, but they do not provide any insight into the essence or definition of a tree as it is used in Mathematics or Computer Science. So let us take a closer look. From a mathematical point of view a tree is nothing else but a graph. But what is a graph? A graph is a construct consisting of a number of points, which are referred to as nodes or vertices. These nodes can be connected by lines, which are referred to as edges. Not all graphs are trees, however. To be classified as a tree, a graph has to adhere to certain restrictions like the following:

1. The number of edges is exactly one less than the number of nodes

2. The graph has no cycles

There exist other equivalent conditions, but these will do for our purposes. Figure 2.1 shows a simple tree, and it can be easily checked that the number of edges = the number of nodes - 1 and that it does not contain any cycles.
[Figure 2.1: A simple tree.]

Computer Scientists might also consider a tree as a special graph, but more commonly they would describe it as a data structure which provides very easy and efficient access to the data that has been stored. Nodes here are objects which usually store an item (the data) and in some cases they might contain additional information, like for example a key in the case of binary search trees. The nodes also contain a pointer to any other node that they are connected to by an
edge. Generally the implementation will distinguish between the children pointers and the parent pointer.

Definition 1. A node C is considered to be a child of another node N, if C is directly below N in the graphical representation of the tree.

Definition 2. A node P is considered to be the parent of another node N, if P is directly above N in the graphical representation of the tree.

We now have enough information to give an informal definition of a binary tree: a binary tree is a tree where each node has either no children, one child or two children. There are a few other terms which are used frequently in Computer Science when working with trees. In some cases (for example external nodes) there exist several understandings, and the following are the definitions which reflect the understanding of these terms in the context of this thesis.

Definition 3. The node which resides at the very top in the graphical representation of the tree is called the root of that tree. The root therefore cannot have a parent.

Note: In Computer Science we mainly consider trees which have a distinguished root and we use the term tree to mean a rooted tree.

Definition 4. All nodes that are below a node N in the graphical representation of a tree and connected to N by a path are referred to as the descendants of N.

Definition 5. All nodes that are above a node N in the graphical representation of a tree and connected to N by a path are referred to as the ancestors of N.

Figure 2.2 shows an example of a tree where the objects that we just defined are clearly marked.

Definition 6. External nodes are placeholders for empty binary trees.

Definition 7. Any node which is not an external node is considered an internal node. Thus all nodes that are visible in the normal graphical representation of a tree are considered internal nodes.
[Figure 2.2: Tree terminology; the root, the parent and ancestors of a node Y, and a child and the descendants of a node X are marked.]

Note: This definition of external nodes is used mainly in the context of binary trees. The difference between internal and external nodes is shown in Figure 2.3, where external nodes are represented as small unfilled rectangles. It is easy to see that the depicted tree is a valid binary tree with 6 internal and 7 external nodes.

[Figure 2.3: External and internal nodes.]

Using these new definitions we can now present a clearer definition of binary trees, as stated in [14]:

Definition 8. A binary tree is either an external node or an internal node attached to an ordered pair of binary trees called the left subtree and the right subtree of that node.

In the following we will use the term node to mean internal node. For simplicity we will omit drawing the external nodes when presenting examples of trees, unless the explicit visualization of these nodes is essential to the understanding.
2.2.2 Tree Properties
Now that we have defined trees and the necessary terminology, we move on to explain some of the properties a tree might have. The most obvious one is the size of a tree, which is simply the number of internal nodes it has. Considering for example the tree from Figure 2.3: this tree has 6 internal nodes and therefore we say it is of size 6. Other important properties of trees are height and path-length. To define these we need to introduce the notion of depth in a tree, which can be defined as follows: the root is at depth 0 and any other node is at a depth one greater than its parent.

Definition 9. The height of a tree is the maximum depth among all external nodes in the tree.

Definition 10. The internal path-length of a tree is the sum of the depths of each of the (internal) nodes.

Definition 11. The external path-length of a tree is the sum of the depths of each of the external nodes.

Taking another look at the tree in Figure 2.3, we can now tell that it is of height 4, its internal path-length is 9 and its external path-length is 21. All of the stated properties are closely related and we will now show some of the most important relationships, which will be used later on (a small code sketch computing these quantities follows the list):

a) The number of external nodes in a binary tree is exactly one more than the number of internal nodes

b) The internal path-length of a tree is the sum of the internal path-lengths of its two subtrees plus its size minus 1

c) The height h of a binary tree containing n (internal) nodes can be bounded as follows: log(n + 1) ≤ h ≤ n
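As a concrete companion to these definitions, here is a small Python sketch (our own illustration, not code from the thesis) that computes size, height and both path-lengths. External nodes are represented by None, and the example tree is chosen to reproduce the statistics quoted for Figure 2.3; it may not be the exact tree drawn there:

    class Node:
        """An internal node; external nodes are represented by None."""
        def __init__(self, left=None, right=None):
            self.left, self.right = left, right

    def size(t):
        return 0 if t is None else 1 + size(t.left) + size(t.right)

    def height(t, depth=0):
        # Height = maximum depth among external (None) positions.
        if t is None:
            return depth
        return max(height(t.left, depth + 1), height(t.right, depth + 1))

    def ipl(t, depth=0):
        # Internal path-length: sum of the depths of the internal nodes.
        if t is None:
            return 0
        return depth + ipl(t.left, depth + 1) + ipl(t.right, depth + 1)

    def epl(t, depth=0):
        # External path-length: sum of the depths of the external nodes.
        if t is None:
            return depth
        return epl(t.left, depth + 1) + epl(t.right, depth + 1)

    t = Node(Node(Node(), Node(Node())), Node())
    n = size(t)
    # Property a) gives externals = n + 1; the path-lengths always
    # satisfy epl = ipl + 2n for a binary tree with n internal nodes.
    assert epl(t) == ipl(t) + 2 * n
    print(n, height(t), ipl(t), epl(t))   # -> 6 4 9 21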
Definition 12. A binary search tree is a binary tree which stores some extra information at each node in the form of an integer value, called a key. These keys adhere to the property that the key in each node is greater than or equal to all the keys in its left subtree and less than or equal to all the keys in its right subtree.
2.2.3 Insertions in Binary Search Trees
Inserting a new key into an existing BST is very straightforward and follows the algorithm shown here as Algorithm 1. For simplicity we assume that any item to be inserted is distinct from all items already in the tree. Note: please refer to Appendix C for further explanations of any of the methods used in the algorithms.

Algorithm 1 BST Insertion
    // Insert a new node with key k into the tree T
    Insert(T,k)
        x ← T.root
        while !(x.isExternal) do
            if (k < x.key) then
                x ← x.leftChild
            else if (k > x.key) then
                x ← x.rightChild
            end if
        end while
        replace( x, Node(k) )

Figure 2.4 shows the insertion of a node with key value 13 into an existing BST: 13 > 2, so move right; 13 < 15, so move left; 13 > 10, so move right; an external node is found, so 13 is inserted there. The coloured path is the path that the node takes as the algorithm executes.

[Figure 2.4: BST Insertion.]
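For readers who prefer running code, here is a minimal Python rendering of Algorithm 1 (our own recursive formulation; the thesis's replace() of an external node is modelled by returning the new subtree), assuming distinct keys as above:

    class Node:
        def __init__(self, key):
            self.key = key
            self.left = None    # None plays the role of an external node
            self.right = None

    def insert(t, k):
        """Insert key k into the BST rooted at t; returns the new root."""
        if t is None:               # reached an external node: replace it
            return Node(k)
        if k < t.key:
            t.left = insert(t.left, k)
        elif k > t.key:
            t.right = insert(t.right, k)
        return t                    # keys are assumed to be distinct

    # Rebuild the tree from Figure 2.4 and insert 13.
    root = None
    for k in (2, 1, 15, 10, 13):
        root = insert(root, k)
    assert root.right.left.right.key == 13   # 13 ends up below 10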
2.2.4 Deletions in Binary Search Trees
Deleting a node with no children in a BST is straightforward, as we can just remove it. For nodes with one or two children the deletion is more complicated, as we have to ensure that the BST property of the tree is preserved. The pseudocode in Algorithm 2 describes an algorithm based on Hibbard's original deletion algorithm, which can be found in [4]. This algorithm makes use of the notion of a successor in a tree, which is defined as follows:

Definition 13. The successor of a node N in a binary search tree is the node with the smallest key greater than the key of N.

The successor of a node N can be easily found: if N has a non-empty right subtree, then the successor of N is the node with the minimum key value in that subtree. If N has an empty right subtree, then its successor is the closest ancestor of N whose left subtree contains N. For simplicity the algorithm to find the successor is not implemented here. Similarly the algorithm to find a key in a tree is not implemented explicitly, as it is very similar to the insertion algorithm. Figure 2.5 shows an example of this deletion algorithm for the more complicated case, where the node that is to be deleted has two children.
Algorithm 2 BST Deletion
    // Delete the node with key k in the tree T
    Delete(T,k)
        x ← Find(k,T)
        if (x.leftChild.isExternal && x.rightChild.isExternal) then
            remove x
        else if (x.leftChild.isExternal) then
            replace( x, x.rightChild )
        else if (x.rightChild.isExternal) then
            replace( x, x.leftChild )
        else
            if !(x.successor.rightChild.isExternal) then
                replace( x.successor, x.successor.rightChild )
            end if
            replace( x, x.successor )
        end if
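The following Python sketch mirrors Algorithm 2 in a recursive style (our own illustration, reusing the Node class and insert() from the sketch in Section 2.2.3); the two-children case copies the successor's key upwards, which produces the same resulting shape as the replace() steps in the pseudo-code:

    def delete(t, k):
        """Delete key k from the BST rooted at t; returns the new root."""
        if t is None:
            return None
        if k < t.key:
            t.left = delete(t.left, k)
        elif k > t.key:
            t.right = delete(t.right, k)
        else:
            if t.left is None:       # also covers the no-children case
                return t.right
            if t.right is None:
                return t.left
            # Two children: the successor is the minimum of the right
            # subtree; move its key up, then delete it from that subtree.
            s = t.right
            while s.left is not None:
                s = s.left
            t.key = s.key
            t.right = delete(t.right, s.key)
        return t

    # One insertion sequence consistent with the tree of Figure 2.5.
    root = None
    for k in (10, 3, 15, 1, 12, 17):
        root = insert(root, k)
    root = delete(root, 10)
    assert root.key == 12            # the successor of 10 moves to the top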
[Figure 2.5: BST Deletion. X is the node to be deleted (here 10, with subtrees rooted at 3 and 15) and 12 is its successor: first the successor is replaced by its right child (if it exists), then X is replaced by its successor.]

There are many different versions of this basic idea that try to improve the expected performance of the algorithm. A number of different deletion algorithms can be found in [1], where Culberson and Evans investigate the relationship between asymmetries in deletion algorithms and the structure of the evolving BSTs.
2.2.5 Randomness Preservation in Binary Search Trees
As already mentioned, the definition of randomness preservation in BSTs is as follows: given a tree T, we need to be able to perform any number of operations on it, thus creating a new tree T′, and be sure that the probability of T′ being created through those operations is the same as the probability that T′ was created by successive insertions of the elements of a random permutation of a list. The problem with Hibbard's original deletion algorithm, and any of the existing modified versions, was first discovered by Gary D. Knott and is that they do not adhere to the above condition. Hibbard himself believed his algorithm to be randomness
preserving, as he proved that a random deletion performed on a random BST will result in a random BST. He did not investigate, however, what effect further insertions would have on the tree. Knott did investigate this and discovered that further insertions after a deletion will not produce the same random tree as would have been produced solely by random insertions. Thus the deletion on BSTs was discovered not to be randomness preserving. Since then, much research has been done on this particular topic and several interesting results can be found in [8] and [5].

But why exactly are we so interested in a random BST? Random BSTs are a lot easier to analyse than ordinary BSTs, as each tree has the same probability of arising. But the main reason behind our interest in randomness is that it has been shown in [14, 7] that the expected cost for update operations on random binary search trees is O(log n). So randomness offers a relatively easy way to achieve logarithmic performance. There do exist other ways, of course, to improve performance. It should be obvious that the depth of a node is the main cost factor in the two most important algorithms for BSTs, so it is no surprise that a lot of different versions of BSTs have been developed, all aiming at keeping the depth as low as possible. Most of these are balanced in some sense and use rotations to restore this balance after any update operation. Examples are: multiway search trees, (2,4) trees, red-black trees and splay trees, details of which can be found in [3]. Often the rotations can complicate the algorithm immensely, and these trees also require extra storage space for the balance information. So even though these trees guarantee a good average depth and therefore good average performance, they are not the perfect solution. Thus random BSTs are considered the better alternative. However, there had been no success so far in the design of a truly randomness preserving deletion algorithm for BSTs, so more recent work seems to focus on the development of randomness generating algorithms for the basic operations instead. These randomized data structures use random number generators to produce randomness inside the algorithm. Some examples of this are randomized binary search trees [9], randomized treaps [16] and skip lists [12]. As there had not been a randomness preserving deletion algorithm for BSTs to date, it has been assumed that randomness generating and randomness preserving
algorithms for the basic operations perform equally well. However, this is not the case, as we will see in Chapter 4.
2.3 Randomized Binary Search Trees
Martinez and Roura describe a new version of BSTs in [9] which they call randomized binary search trees (RBST). The main idea lies in the insertion algorithm, which uses structural information to generate a random number. To be precise, they generate a random integer between 0 and the current size of the tree. If this integer is equal to the tree size, the new element gets inserted as the root; otherwise it gets inserted in either the left or right subtree of the root, following the usual BST procedure. By using this test, it is guaranteed that if we are dealing with an empty tree, the new item gets inserted as root, as 0 is the only number that can be generated. This method also fixes the probability of a new item becoming the root at 1/(n+1), as there is a 1/(n+1) probability of selecting any particular one of the n+1 integers from 0 to n. RBSTs use another random number generator as part of their join method (which is called inside the deletion method), to decide whether the root of the right or left subtree to be joined becomes the new overall root. Pseudo-code of the main methods for RBSTs is provided in Appendix A. As the randomness is constructed each time a new item is inserted, the model does not require the input sequence to be randomized. In fact it does not rely on any assumptions about the input distribution and it is impossible to construct a bad sequence. Even though this is advantageous, it also means that their algorithm is not actually randomness preserving but rather randomness creating. Therefore it does not necessarily answer the question of the existence of a randomness preserving deletion algorithm for BSTs which was posed in [8]. However, they do prove that their algorithm creates a random BST even after multiple insertions and deletions have been performed, thus enabling them to use many results already discovered about random BSTs. They also show that any update operation performed on their RBSTs has an expected performance of O(log n). The algorithms for deletion and insertion are very straightforward and run in logarithmic time,
which makes the model very attractive and usable.
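As a concrete sketch of the idea just described, here is a compact Python rendering of RBST insertion (our own simplification of [9]; the actual pseudo-code, including the join-based deletion, is reproduced in Appendix A):

    import random

    class Node:
        def __init__(self, key):
            self.key, self.left, self.right = key, None, None
            self.size = 1

    def size(t):
        return t.size if t else 0

    def update(t):
        t.size = 1 + size(t.left) + size(t.right)

    def split(t, k):
        """Split t into (keys < k, keys > k); used for insertion at the root."""
        if t is None:
            return None, None
        if k < t.key:
            lt, gt = split(t.left, k)
            t.left = gt
            update(t)
            return lt, t
        else:
            lt, gt = split(t.right, k)
            t.right = lt
            update(t)
            return t, gt

    def insert(t, k):
        # A random integer between 0 and size(t); with probability
        # 1/(size+1) the new key becomes the root. An empty tree always
        # takes this branch, since 0 is the only possible outcome.
        if random.randint(0, size(t)) == size(t):
            root = Node(k)
            root.left, root.right = split(t, k)
            update(root)
            return root
        if k < t.key:
            t.left = insert(t.left, k)
        else:
            t.right = insert(t.right, k)
        update(t)
        return t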
2.4 Randomized Treaps
In [16] Seidel and Aragon propose what they call randomized search treaps to achieve logarithmic expected performance. Treaps are very similar to BSTs, but in addition to a key they also store a priority. The items are arranged in a tree with the keys adhering to the BST property as described in Section 2.2. The priorities, however, are arranged in heap-order, i.e., the priority of any given node in the tree is smaller than or equal to that of its parent. It follows that the item with the largest priority will reside at the root. For a randomized treap we assume that all priorities are independent, identically distributed random variables, through which randomness is introduced. The insertion algorithm is similar to the normal BST insertion algorithm in that at each stage it picks the left/right subtree according to whether the key to be inserted is smaller/bigger than the key at the current node. In addition, it performs left/right rotations to restore the heap-order of the priorities. The deletion algorithm is basically just a backwards version of the insertion algorithm. The node to be deleted is located using the normal search procedure of a BST. It is then rotated down according to its priority value, and once it has reached an external position, the node is removed. Thus this data structure allows for balancing trees through randomization and therefore the time bounds do not rely on any assumptions about the input, which is similar to the model of RBSTs described above. The authors claim to provide simple algorithms which are fast in practice and do not require any additional storage space in the case of unweighted trees. The average running time for any update operation is O(log n), which is the same as for RBSTs. However, the time analysis is slightly more complex and does not make use of the many results already known about random BSTs. There are some similarities between this model and the model of the ordered binary search tree. For example, when running both deletion algorithms on a BST with the same structure, the resulting BST will also have the same structure. But whereas OBSTs focus on randomness preservation, RSTs
focus on maintaining the balance of the search tree, and this results in some major differences between the two approaches. RSTs use priority values to keep their trees balanced. As these priority values are random variables, RSTs require rotations on each insertion and deletion to restructure the tree. There is no need for rotations in the OBST algorithms, as its history values are generated by time-stamping the keys on insertion. This eliminates any need for restructuring after inserting an element and simplifies the restructuring process after a deletion. The insertion and deletion algorithms for the randomized treaps can be found in Appendix B.
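To make the rotation-based insertion concrete, here is a minimal Python sketch of treap insertion (our own illustration of the idea in [16]; the thesis's pseudo-code is given in Appendix B), with priorities drawn as i.i.d. uniform random variables:

    import random

    class TreapNode:
        def __init__(self, key):
            self.key = key
            self.prio = random.random()    # i.i.d. random priority
            self.left = self.right = None

    def rotate_right(t):
        l = t.left
        t.left, l.right = l.right, t
        return l

    def rotate_left(t):
        r = t.right
        t.right, r.left = r.left, t
        return r

    def insert(t, key):
        """BST insertion followed by rotations to restore heap-order."""
        if t is None:
            return TreapNode(key)
        if key < t.key:
            t.left = insert(t.left, key)
            if t.left.prio > t.prio:       # child outranks parent: rotate
                t = rotate_right(t)
        else:
            t.right = insert(t.right, key)
            if t.right.prio > t.prio:
                t = rotate_left(t)
        return t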
Chapter 3 The Ordered Binary Search Tree

3.1 Introduction
We introduce a new notion of BST which stores information about the order in which the elements were inserted. From here on we will refer to these trees as ordered binary search trees (OBST). A similar notion is mentioned in [10], where Mishna describes the transformation between BSTs and heap-ordered trees in order to create a bijection between permutations and BSTs. Any normal BST will create the same shape for the two distinct permutations (2 1 3) and (2 3 1), as shown in Figure 3.1.

[Figure 3.1: Normal BST; the tree with root 2 and children 1 and 3.]

In an OBST we will store an item at each node consisting of the usual key and its position in the permutation, yielding the two distinct trees depicted in Figure 3.2. Any tree is therefore uniquely defined by the elements it contains and the order in which they were inserted; the latter will be termed the history of the tree. We hereby define the structure of a tree as a new characteristic which contains exactly this information.

[Figure 3.2: OBSTs from the permutations (2 1 3) and (2 3 1).]

When classifying the OBSTs by their structure, the underlying permutation will be taken into account, meaning that the two trees depicted above are classified as having a different structure even if their shape is the same. In traditional BSTs it suffices to look at the shape alone when classifying the trees, as the deletion algorithm is based purely on this structural information. For our new kind of tree, however, the deletion algorithm will use history information as well as structural information, which is the reason that this distinction is required. For example, consider the possible binary tree structures for a tree with three elements x, y and z with x < y < z, as shown in Figure 3.3.
[Figure 3.3: The five possible BST shapes with 3 elements x < y < z, labelled A to E.]

There are five different structures and all arise with probability 1/6, except for the structure labelled C, which arises with probability 2/6 as it can be created by two permutations. If we now look at the different possible OBST structures for a tree with those three elements, we get six different structures, as shown in Figure 3.4, all of which arise with an equal probability of 1/6.

[Figure 3.4: The six possible OBST structures with 3 elements, labelled S1 to S6.]
3.2 The Find/Insertion Algorithm
To find an element in an OBST, we proceed the same as with normal BSTs, picking the left/right subtree of a node depending on whether the element to be found is smaller/greater than the node. The pseudo code for this algorithm is in accordance with [7] and is shown here as Algorithm 3.

Algorithm 3 Find in OBST
    // Find and return the node with key value k in the tree T.
    Find(T,k)
        x ← T.root
        while !(x.isExternal) do
            if k < x.key then
                x ← x.leftChild
            else if k > x.key then
                x ← x.rightChild
            else
                return x
            end if
        end while
        return "Not found"

To insert a new node, we first create a new item i = (key, T.size + 1) and then follow the above search algorithm until we reach an external node, which we then replace by the new item i. For simplicity we assume that all keys are distinct.
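One possible concrete representation of an OBST in Python is sketched below (our own illustration, not code from the thesis): each node stores a (key, history) pair, and the tree keeps an insertion counter so that every new node receives the next history value:

    class OBSTNode:
        def __init__(self, key, history):
            self.key = key
            self.history = history          # position in the insertion order
            self.left = self.right = None

    class OBST:
        def __init__(self):
            self.root = None
            self.count = 0                  # number of insertions so far

        def insert(self, key):
            """Standard BST insertion; the new node is stamped with the
            next history value, so no restructuring is needed."""
            self.count += 1
            node = OBSTNode(key, self.count)
            if self.root is None:
                self.root = node
                return
            x = self.root
            while True:
                if key < x.key:
                    if x.left is None:
                        x.left = node
                        return
                    x = x.left
                else:                        # keys are assumed distinct
                    if x.right is None:
                        x.right = node
                        return
                    x = x.right

    # Building from the permutation (2 3 1) yields root (2,1) with
    # children (1,3) and (3,2), as in Figure 3.2.
    t = OBST()
    for k in (2, 3, 1):
        t.insert(k)
    assert (t.root.key, t.root.history) == (2, 1)
    assert (t.root.left.key, t.root.left.history) == (1, 3)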
3.3 The Deletion Algorithm
The deletion algorithm is what distinguishes OBSTs from normal BSTs. The key idea is to consider not only the structure when rebuilding the tree, but also the history. That way we are able to achieve a one-to-one mapping between trees and permutations, i.e., the tree built after deleting an element is exactly the same as the tree built from the permutation with that element omitted. The advantages of this will be explained in more detail in Section 3.4. The pseudo code for the deletion in OBSTs is shown in Algorithm 4. The deletion algorithm restructures the tree according to its history values, but also ensures that any left-to-right relationship between the elements is maintained. So basically, when an element is deleted, it is replaced by its right or left child, depending on which one has the smaller history value, i.e. should be higher up in the tree. If the item that we moved up originally had two children, then we are faced with a separate subtree, as the item only has room for one child in its new position. The re-insertion of this subtree into the tree is done by the method ReInsert (Algorithm 5), which finds a new position for the subtree following two main steps: first, find the correct level for the subtree's root in the tree, which is determined solely by the root's history value; and second, on that level find out whether to attach the subtree as a right or a left child, which is determined through key comparison. If the newly found position is already occupied, we are once again left with a separate subtree, so ReInsert will call itself again. This process of finding a new home for a separate subtree is repeated until the position found is an external node, and we have no separate subtree left to deal with.

Algorithm 4 Deletion in OBST
    // Delete the node with key k in the tree T
    Delete(T,k)
        x ← Find(T,k)
        if x has no children then
            remove x
        else if x has one child c then
            replace(x,c)
        else if x.leftChild.getHistory < x.rightChild.getHistory then
            l ← x.leftChild
            if l.rightChild.isExternal then
                replace(x,l)
            else
                s ← l.rightChild
                replace(x,l)
                ReInsert(l.rightChild, left, s)
            end if
        else
            r ← x.rightChild
            if r.leftChild.isExternal then
                replace(x,r)
            else
                s ← r.leftChild
                replace(x,r)
                ReInsert(r.leftChild, right, s)
            end if
        end if

Algorithm 5 ReInsert in OBST
    // Insert the tree 'subtree' into the tree rooted at 'correctPos', starting
    // the search in the right (left) subtree if 'direction' = right (left)
    ReInsert(correctPos, direction, subtree)
        testPos ← correctPos.parent
        while (correctPos.getHistory < subtree.getHistory && !testPos.isExternal) do
            set Flag
            if direction = left then
                correctPos ← correctPos.leftChild
            else
                correctPos ← correctPos.rightChild
            end if
            testPos ← correctPos
        end while
        if Flag is not set then
            newSubtree ← correctPos
            replace(correctPos,subtree)
            ReInsert(subtree, -direction, newSubtree)
        else if !(correctPos.isExternal) then
            replace(correctPos,subtree)
            ReInsert(subtree, -direction, correctPos)
        else
            replace(correctPos,subtree)
        end if

To clarify the deletion process, consider the following example. We start with the tree created by the permutation (10 16 7 12 17 6 9 20 11 14), as shown in Figure 3.5. If we were to delete the key 10, we would be dealing with the item (10,1). We first move the lower history value up, i.e., (16,2). The right subtree of that node does not cause a problem and can stay connected to it the way it is; the left subtree, however, needs to be removed and considered as a separate subtree for now.
[Figure 3.5: The OBST built from the permutation (10 16 7 12 17 6 9 20 11 14).]

This leaves us with the situation depicted in Figure 3.6, where the tree now contains (16,2) as the new root and this node's original left subtree needs to be re-inserted into the tree somehow.
[Figure 3.6: The tree before ReInsert and its separate subtree.]

The question that arises now is: what do we do with the subtree? We need to find a new home for it. This is done by the recursive method ReInsert, which moves the whole subtree down the new left subtree of (16,2) (as it needs to stay to the left of this node) to its correct history level, i.e., just after the last node that has a smaller history value, which is the node (7,3). We now need to decide whether to insert the subtree as a left or a right child of (7,3). As the key of (12,4) is larger than the key of (7,3), we insert it as the right child. The original right child needs to be removed and considered as a separate subtree. The current stage of the deletion process is shown in Figure 3.7. As there is still a separate subtree to deal with, ReInsert calls itself and again tries to find a new home for the separate subtree.
[Figure 3.7: The tree after the first call to ReInsert, with its separate subtree.]

This time we start at the just-inserted node (12,4) and move our subtree down the left, as (9,7) has a smaller key than (12,4). Again we find the correct level according to the history value. This puts the subtree after (12,4), yet again leaving us with a modified tree and its separate subtree, as shown in Figure 3.8.
16,2
17,5
7,3
20,8
12,4
6,6
9,7
14,10
Figure 3.8: The tree after the second call to ReInsert with its separate subtree ReInsert makes yet another recursive call in order to incorporate the newly created subtree. We follow the same steps as before: starting at the newly inserted node, which is (9,7) we move our way down its right subtree (due to key comparison). As (9,7) has no children, we can just attach (11,9) as its new right child and therefore do not have to deal with a separate subtree anymore. We are now finished and Figure 3.9 shows the original tree after the deletion of (10,1). It can be easily seen that the structure of the resulting tree (subtracting one 27
[Figure 3.9: The tree after the deletion of 10.]

It can be easily seen that the structure of the resulting tree (subtracting one from each history value) is the same as we would obtain by deleting the number 10 from the permutation and then building a tree using the insertion algorithm defined above. As we are guaranteed that each new item has a history value different from all the previous ones, there is no need to relabel the existing values after a deletion. We are left with gaps in our list of history values, but that does not cause a problem, as no new value can possibly fall inside these gaps. This is a very important observation, as relabelling would have a huge impact on the running time. A relabelling algorithm would have to visit every node in the tree and therefore cause the deletion algorithm to run in linear time, thus making it unacceptable in practice.
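This equivalence can be checked mechanically. The sketch below (our own addition) implements the reference semantics of the OBST deletion: remove the key from the permutation and rebuild, which by the results of Section 3.4 yields the same tree as Algorithms 4 and 5, though in linear rather than logarithmic time. It reproduces the worked example above:

    def build(perm):
        """Build a BST shape from a permutation: a dict key -> [left, right]."""
        root, tree = None, {}
        for key in perm:
            tree[key] = [None, None]
            if root is None:
                root = key
                continue
            x = root
            while True:
                side = 0 if key < x else 1
                if tree[x][side] is None:
                    tree[x][side] = key
                    break
                x = tree[x][side]
        return root, tree

    perm = (10, 16, 7, 12, 17, 6, 9, 20, 11, 14)
    # Reference deletion semantics: drop the key, rebuild from the rest.
    root, tree = build(tuple(k for k in perm if k != 10))
    # The result matches Figure 3.9.
    assert root == 16
    assert tree[16] == [7, 17] and tree[7] == [6, 12]
    assert tree[12] == [9, 14] and tree[9] == [None, 11]
    assert tree[17] == [None, 20]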
3.4 Randomness Preservation
In [5], Jonassen and Knuth show the deletion paradox on the example of BSTs of size 2 and 3. We will follow their example using the new deletion algorithm and show that this new variant solves the paradox. Knuth begins by defining the different shapes possible for BSTs of size 2 and 3. The different OBST structures of size 3 have already been shown (see Figure 3.4). It can be easily seen that any list of integers with the same inter-relationship will yield one of the generic structures depicted, and we can therefore restrict ourselves to values between 1 and n. The OBST structures from Figure 3.4 map to Knuth's 3-element trees from Figure 3.3 as follows:

S1 → A, S2 → B, S3 → C, S4 → C, S5 → D, S6 → E

It is easy to see that we do not have a bijection, as two OBST structures map to the same BST shape. However, we do have a bijection between permutations and OBST structures, which will help us in achieving randomness preservation. Let us now consider the possible structures of OBSTs of size 2. There are only two structures, depending on which of the two elements, x or y, was entered first. The structures for BSTs of size 2 are actually the same, so we will name the OBST structures F and G according to Knuth's schema, as shown in Figure 3.10:
[Figure 3.10: The two OBST structures with two elements x < y, labelled F and G, distinguished by which element was inserted first.]
Figure 3.10: OBST with two elements Let us now consider the different generic OBSTs of size 3. For each of the trees there are three possible keys that can be deleted: x, y or z. If we use the deletion algorithm described above, then we get the following distribution of 2-element structures:
    Initial Structure   Delete x   Delete y   Delete z
    S1                  F          F          F
    S2                  F          F          G
    S3                  G          G          F
    S4                  G          F          F
    S5                  F          G          G
    S6                  G          G          G
The overall distribution of F and G is 9/18 = 1/2 each, which is the same as for normal BSTs and is as we would expect. If we distinguish between the different F and G structures according to the elements they contain, however, and compare those values to the values for normal BSTs, we get the following breakdown:

            F(x,y)   F(x,z)   F(y,z)   G(x,y)   G(x,z)   G(y,z)
    BST     3/18     4/18     2/18     3/18     2/18     4/18
    OBST    3/18     3/18     3/18     3/18     3/18     3/18
Already the values for the OBST structures are more evenly distributed. Now assume a random insertion into these trees. The newly inserted element can have the following inter-relationship to the elements that were originally in the tree, before the first deletion took place:

a) it can be smaller than all of them

b) it can be the second smallest

c) it can be the third smallest

d) it can be bigger than all of them

For example, the structure F(x,z) yields the structure S1 in case a), the structure S2 in cases b) and c), and the structure S3 in case d). If we do the same for all 2-element structures and then compute the probabilities, we get:

    Prob(S1) = (3+3+6)/72 = 1/6
    Prob(S2) = (3+6+3)/72 = 1/6
    Prob(S3) = (6+3+3)/72 = 1/6
    Prob(S4) = (3+3+6)/72 = 1/6
    Prob(S5) = (3+6+3)/72 = 1/6
    Prob(S6) = (6+3+3)/72 = 1/6
Comparing these to the values for normal BSTs, we get the following:

    Prob(A) = 11/72
    Prob(B) = 13/72
    Prob(C) = 25/72
    Prob(D) = 11/72
    Prob(E) = 12/72

It is easy to see that OBSTs still provide the nicer distribution, which already suggests that a further deletion will not cause a problem. And once we actually perform this random deletion and compute the probabilities, we see that structures F and G still arise with a probability of 1/2 each, in comparison to normal BSTs, where this final deletion causes structure F to arise with probability 109/216 > 1/2.
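These small-case numbers can also be verified by brute force. The sketch below (our own addition) uses the permutation-removal semantics of the OBST deletion, justified by Lemma 3 below, to enumerate all 6 permutations of three keys and all 3 possible deletions, confirming that each keyed 2-element structure arises exactly 3 times out of 18 (which label is F and which is G follows our reading of Figure 3.10; the uniformity check does not depend on it):

    from itertools import permutations
    from collections import Counter

    counts = Counter()
    for perm in permutations((1, 2, 3)):       # x = 1, y = 2, z = 3
        for victim in perm:
            # OBST deletion = delete from the permutation, keep the order.
            a, b = (k for k in perm if k != victim)
            shape = 'F' if a > b else 'G'      # distinguished by insertion order
            counts[(shape, frozenset((a, b)))] += 1

    # Six keyed structures, each arising 3 times out of 18 trials,
    # matching the OBST row of the table above.
    assert len(counts) == 6 and set(counts.values()) == {3}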
From the above results it seems that this new deletion is randomness preserving. However, we have only looked at trees of a very small size and cannot infer from these examples that it will work for all trees. To prove this we will first prove the correctness of the operations described in Sections 3.2 and 3.3. In the following we will show that if we are given an OBST which was created from a specific permutation, then inserting an element into this OBST is equivalent to adding a new element to the permutation and then using this new permutation to create the OBST. Similarly, deleting an element from the OBST is equivalent to removing the element from the permutation and then creating the OBST from the modified permutation. In particular we will show the following three things:

a) there is a one-to-one mapping between permutations and OBSTs

b) a random insertion preserves this mapping

c) a random deletion also preserves this mapping

We will use the following notation.
Let P = (e1 e2 ... en) denote the permutation of the elements e1 to en and let T = [e1, e2, ..., en] denote the OBST which was created by inserting the keys e1 to en in exactly that order.

a) We need to show that the following equivalence holds:

Lemma 1. P = (e1 e2 ... en) ⇔ T = [e1, e2, ..., en]

Proof. First we show P = (e1 e2 ... en) ⇒ T = [e1, e2, ..., en]. T = [e1, e2, ..., en] is defined above to be the OBST created by inserting the elements e1 to en in that order, so the implication is true by definition. Now we look at P = (e1 e2 ... en) ⇐ T = [e1, e2, ..., en]. Here we need to show that, given the OBST T = [e1, e2, ..., en], there is exactly one permutation P = (e1 e2 ... en) of the elements e1 to en that could have created it. This case is slightly harder and does not hold for ordinary BSTs: the tree shown in Figure 3.1 could have been created by the permutation (2 1 3) or (2 3 1). However, in an OBST we are provided with a history value for each node in the tree. As permutations are only defined through the order of their elements, the history value is all we need to guarantee a unique mapping.

b) For this case we need to prove the following equivalence:

Lemma 2. P = (e1 e2 ... en en+1) ⇔ Insert en+1 into T = [e1, e2, ..., en]

Proof. To show P = (e1 e2 ... en en+1) ⇒ Insert en+1 into T = [e1, e2, ..., en], consider the permutation P = (e1 e2 ... en en+1). We have already shown
in a) that this creates exactly one OBST, which is T = [e1, e2, ..., en, en+1]. This in turn describes the OBST which contains the elements e1 to en+1, and as en+1 is the element with the highest subscript, it is the node in T with the highest history value, i.e. the node that was inserted last. No changes have been made to the position or history values of any other nodes. Therefore we have reached the proposed state: Insert en+1 into T = [e1, e2, ..., en]. Now consider the initial situation Insert en+1 into T = [e1, e2, ..., en], i.e. the case P = (e1 e2 ... en en+1) ⇐ Insert en+1 into T = [e1, e2, ..., en]. Can we show that the resulting tree could only have been created by P = (e1 e2 ... en en+1)? From a) we know that T = [e1, e2, ..., en] could only have been created by P = (e1 e2 ... en), and we also know that our insertion algorithm will give en+1 the highest history value and add it at its correct position in T without actually modifying the rest of the tree structure. Therefore the only permutation that this tree could have been created from is the permutation (e1 e2 ... en) with en+1 added to the end of it, which is equal to P = (e1 e2 ... en en+1).
c) We will prove the following:

Lemma 3. P = (e1 e2 ... en−1) ⇔ Delete en from T = [e1, e2, ..., en]

Proof. First we will look at P = (e1 e2 ... en−1) ⇒ Delete en from T = [e1, e2, ..., en]. Again, we already know from a) that the permutation P = (e1 e2 ... en−1) creates exactly one OBST, which is T = [e1, e2, ..., en−1]. From the definition stated above it is clear that this is the OBST we are looking for. It contains the elements e1 to en−1 in the same order (meaning with the same history values) as before.
Proving P = (e1 e2 ... en−1) ⇐ Delete en from T = [e1, e2, ..., en] turns out to be a lot harder, as our deletion algorithm actually modifies the original structure of T and we need to show that this does not violate either of the following two properties: the relationship between any element ej and any of its predecessors ei, where 1 ≤ i < j, and the original relative ordering of the remaining elements. When deleting an element from an OBST, the algorithm does not modify the relative ordering between the elements. The structure of T is modified, but the subtrees are reattached to the tree in such a manner as to preserve the original left and right subtree relationships. Consider an arbitrary element ex in the tree: after the deletion of en the position of ex might be different. However, if ex was in the left (right) subtree of any ej, where 1 ≤ j < x, then it will still be there afterwards. Therefore the new tree could only have been created from the permutation P = (e1 e2 ... en−1).

Lemmas 1-3 immediately imply the following theorem:

Theorem 1. The structure and the deletion and insertion algorithms of an ordered binary search tree provide a bijection with permutations and their insertions and deletions.

Proof. The theorem is an immediate corollary of the lemmas.

Corollary 1. Ordered binary search trees are randomness preserving.

Proof. Permutations are randomness preserving and we have proved that there exists a bijection between ordered binary search trees and permutations.
3.5 Average-Case Analysis
As both the insertion and the deletion algorithm are randomness preserving, it is very easy to compute the expected values for internal path-length and height. The internal path-length is equivalent to the construction cost of a random BST and is described by the following recurrence, which can be found in [14]:

$$C_n = n - 1 + \frac{1}{n} \sum_{1 \le k \le n} (C_{k-1} + C_{n-k}) \quad \text{for } n > 0, \text{ with } C_0 = 0$$
It is easy to see why this recurrence holds. The first item inserted will reside at the root. As we are assuming a random input, there is a probability of 1/n that the item at the root is the kth smallest of all, for any 1 ≤ k ≤ n. In that case we have a left subtree with construction cost C_{k-1}, consisting of the k - 1 items that are smaller than the root, and a right subtree with construction cost C_{n-k}, containing the n - k items larger than the root. The added value of n - 1 reflects the fact that for every node other than the root the construction cost needs to be increased by one, so we add the sizes of the two subtrees: k - 1 + n - k = n - 1. There are many ways to solve this recurrence, but the most common method in algorithm analysis uses generating functions, as these can often be used to derive other properties or perform further conversions. We use the normal procedure for solving recurrences using GFs as described in Section 2.1. To simplify the recurrence we first change k to n - k + 1 in the second part of the sum:

$$C_n = n - 1 + \frac{1}{n} \sum_{1 \le k \le n} (C_{k-1} + C_{k-1}) \quad \text{for } n > 0, \text{ with } C_0 = 0$$
Now follow the first step from our 'recipe' from Section 2.1: multiply both sides of the recurrence by z^n and sum over all values of n for which the recurrence holds:

$$\sum_{n \ge 1} C_n z^n = \sum_{n \ge 1} (n-1) z^n + \sum_{n \ge 1} \frac{1}{n} \sum_{1 \le k \le n} (C_{k-1} + C_{k-1}) z^n$$
We then multiply the whole equation by n:

$$\sum_{n \ge 1} n C_n z^n = \sum_{n \ge 1} n(n-1) z^n + 2 \sum_{n \ge 1} \sum_{1 \le k \le n} C_{k-1} z^n$$

Differentiating the basic generating function $C(z) = \sum_{n \ge 0} C_n z^n$ yields $C'(z) = \sum_{n \ge 0} n C_n z^{n-1}$. The right hand side of this equation is very similar to the left hand side of our equation above, the only difference being the power of z. Multiplying by z makes them completely equal. So we can now write the recurrence as follows:
zC'(z) = \sum_{n \ge 1} n(n-1) z^n + 2 \sum_{n \ge 1} \sum_{1 \le k \le n} C_{k-1} z^n
Using the basic tables available in [14] for generating functions and operations on them, we see that the first summand can be expressed as the index multiply (differentiation) of \sum_{n \ge 1} n z^n with two right shifts. The second summand is simply the partial sum, also with a right shift. Applying these operations we get the following:

zC'(z) = \frac{2z^2}{(1-z)^3} + \frac{2z}{1-z} C(z)
Dividing this equation by z gives the following linear ordinary differential equation (ODE):

C'(z) = \frac{2z}{(1-z)^3} + \frac{2}{1-z} C(z)

This can be solved in the normal way by finding an integrating factor. The normal form of a linear ODE is y' + p(x)y = q(x), and from that the integrating factor can be computed as:

\mu(x) = \exp\left\{ \int p(x)\,dx \right\}

Therefore we get an integrating factor of \mu = (1-z)^2. Now we can easily solve the equation and quickly find the solution:
C(z) = \frac{2}{(1-z)^2} \ln\frac{1}{1-z} - \frac{2z}{(1-z)^2}
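For completeness, the omitted integration step works out as follows. Here p(z) = -2/(1-z) and q(z) = 2z/(1-z)^3, so

\mu(z) = \exp\left\{ \int \frac{-2}{1-z}\,dz \right\} = \exp\{2\ln(1-z)\} = (1-z)^2

Multiplying the ODE through by \mu(z) gives

\frac{d}{dz}\left[(1-z)^2 C(z)\right] = \frac{2z}{1-z} = \frac{2}{1-z} - 2

which integrates to (1-z)^2 C(z) = 2\ln\frac{1}{1-z} - 2z; the constant of integration vanishes since C(0) = C_0 = 0, and dividing by (1-z)^2 yields the solution above.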
All that remains to be done now is to extract the coefficients from the above. This is easily done using the dictionary for GFs, giving us the expected internal path-length for random BSTs:

C_n = 2(n+1)(H_{n+1} - 1) - 2n

As the sum of the depths d_i of all nodes in the tree yields the internal path-length, we get the following:

Internal path-length = \sum_{i=1}^{n} d_i
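For completeness, the two GF dictionary entries used in this extraction are

[z^n]\, \frac{1}{(1-z)^2} \ln\frac{1}{1-z} = (n+1)(H_{n+1} - 1) \qquad and \qquad [z^n]\, \frac{z}{(1-z)^2} = n

from which C_n = 2(n+1)(H_{n+1} - 1) - 2n follows immediately.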
From the expected path-length the expected depth of an item can easily be calculated by dividing by n, yielding the following:

Expected depth: D_e = 2H_{n+1} + \frac{2H_{n+1}}{n} - 4
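As a quick numerical illustration (the numbers here are our own): for n = 1000 we have H_{1001} \approx \ln 1001 + 0.57721 \approx 7.49, giving

D_e \approx 2(7.49) + \frac{2(7.49)}{1000} - 4 \approx 10.99

so an average node in a random BST of 1000 nodes lies only about 11 levels deep.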
We will also require a value for the expected height H_e of a BST. An approximation for this value can be found in [14]:

Expected height: H_e = c log n, where c ≈ 4.31107

These are the main values required to analyse both algorithms.

Insertion: To insert a new item into the tree, we essentially perform a search for the correct external node, starting at the root and turning left or right depending on the outcome of each key comparison. It is not hard to see that the expected cost is directly related to the number of nodes visited along this search. In the best case this could be one node, if the new item is to be placed at the root, which occurs with probability 1/n in a random BST. In the worst case this value will be equal to the height of the tree. How can we get a handle on the average case? We have already stated some expected values for random BSTs above. It is clear that the expected depth of a node in the tree is very closely related to the value we are looking for. In fact, the cost of inserting a new node is equal to the expected depth plus one step to reach the external node. This gives the following equation:

I_e = D_e + 1 = 2H_{n+1} + \frac{2H_{n+1}}{n} - 3

The approximation for the harmonic number H_n is given in [14] as H_n ≈ ln n + 0.57721... Using this approximation and the fact that H_{n+1}/n goes to zero as n becomes large, we can see that the average cost for insertion is O(log n), which is to be expected for random BSTs.
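This prediction is easy to test empirically. The sketch below is ours and is independent of the thesis implementation: it inserts n random keys into a plain BST, records the depth at which each insertion lands, and compares the average against 2 ln n (the averages should agree up to a small additive constant):

import java.util.Random;

public class InsertionDepthExperiment {

    static final class Node {
        int key; Node left, right;
        Node(int key) { this.key = key; }
    }

    // Insert key and return the depth of the newly created node
    // (the root has depth 1).
    static int insert(Node root, int key) {
        int depth = 1;
        Node cur = root;
        while (true) {
            if (key < cur.key) {
                if (cur.left == null) { cur.left = new Node(key); return depth + 1; }
                cur = cur.left;
            } else {
                if (cur.right == null) { cur.right = new Node(key); return depth + 1; }
                cur = cur.right;
            }
            depth++;
        }
    }

    public static void main(String[] args) {
        int n = 100_000;
        Random rng = new Random(42);
        Node root = new Node(rng.nextInt());
        long totalDepth = 0;
        for (int i = 1; i < n; i++) totalDepth += insert(root, rng.nextInt());
        System.out.printf("average insertion depth: %.2f, 2 ln n = %.2f%n",
                (double) totalDepth / (n - 1), 2 * Math.log(n));
    }
}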
Deletion: To analyse the deletion algorithm, we divide it into two main parts. In the first part we perform a search for the node with the key value that is specified for deletion. Let us refer to the cost of this part as S_e. The second part of the algorithm is concerned with restructuring the tree to make sure it is still an OBST. The cost of this part will be referred to as R_e.

Finding the cost S_e involves arguments similar to the ones used for the insertion algorithm. Again, the cost is directly related to the number of nodes visited along our search, only this time we are looking for an internal node rather than an external node. Therefore we can simply take the value for the expected depth of a node, which we already know:

S_e = D_e

Finding the value of R_e is slightly more complex. The cost of restructuring is essentially nothing but a recursive call to ReInsert, so the cost is reflected by the following recurrence:

R_n = I_n + R_{n-k}, with 0 < n - k < n

The cost of ReInsert on a tree with n nodes is made up of the cost of inserting a tree of arbitrary size into a tree of size n, plus the cost of calling ReInsert on a smaller tree T_sub with only n - k nodes. We stop when we reach an external node, i.e. when n - k = 0. For simplicity we will use the variable s to denote the value of n - k in the calculations below.

Inserting a tree is not very different from inserting a single node: we consider only the root of the tree and try to find its correct position in the tree; once the root is inserted, the rest of the tree follows automatically through its child pointers. As part of ReInsert we do not necessarily need to find an external node to insert the tree, so we use the cost of finding a node in a tree of size n, which is:
D_e = 2H_{n+1} + \frac{2H_{n+1}}{n} - 4
Thus this is the cost of inserting a tree of arbitrary size into a tree of size n. Now we need to get a handle on the average size s of T_sub. We already know the expected depth of an element in the tree, and we also know the expected height H_e of the tree. Therefore we can compute the expected height h of T_sub as follows:

h = H_e - D_e

Filling in the values for expected height and depth we get:

h = c \log n - 2H_{n+1} - \frac{2H_{n+1}}{n} + 4
Now we can fill in the approximation for the harmonic number to give:

h = c \log n - 2\ln(n+1) - 2c_1 - \frac{2(\ln(n+1) + c_1)}{n} + 4

where c_1 ≈ 0.57721 is Euler's constant.
At this point the equation looks like it is not going to be of any help, as it is far too complicated to plug into the recurrence. However, for this analysis we are not interested in constant factors, so we can ignore them. Also, we are only interested in an upper bound, so using O() notation we get the following:

h = O(\log n) - O(\log n) + O\left(\frac{\log n}{n}\right)

This simplifies to:

h = O\left(\frac{\log n}{n}\right)
We also know that for the average height of T_sub the normal formula applies, so:

h = c \log s \iff s = 2^{h/c}

Filling in the value we obtained for h yields the following for s:

s = \sqrt[n]{n}
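To get a feel for how small this quantity is (the numbers here are our own illustration): for n = 10^6 we get s = (10^6)^{1/10^6} = e^{\ln(10^6)/10^6} \approx 1.0000138, so the subtree passed to the recursive call is, on average, essentially of constant size.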
Now we can finally plug the values into our recurrence:

R_n = O(\log n) + R_{\sqrt[n]{n}}

After the first recursive step the recurrence will look as follows:

R_n = O(\log n) + O(\log \sqrt[n]{n}) + R_{s'}

where s' denotes the size of the next, even smaller, subtree. It is easy to see that as this recurrence unfolds further, all additional terms are bounded by O(log n), and we can now use this value in our original equation for the average cost of deletion:

Del_e = S_e + R_e = O(\log n) + O(\log n) = O(\log n)

Thus we get a logarithmic average running time for the deletion algorithm.

Considering the worst-case performance, there are two main factors to take into account: the time it takes to reach the node which is to be deleted, and the time it takes to restructure the tree. The time to reach the correct node can be O(n) in the worst case if the tree is not balanced. The time for restructuring depends on the cost of ReInsert, which is given by the number of nodes that have a smaller history value than the node we are trying to ReInsert, and on the number of recursive calls. The first value is maximised if the node to be re-inserted has a very high history value, e.g. n, and the node at the starting point has a very low history value, e.g. 1. In this case we would have to visit n nodes. However, as the node with history value n is external, the algorithm then finishes, yielding a running time of O(n) + O(n) = O(n).

Let us now try to maximise the number of times that ReInsert has to call itself. This is the case if every newly found position has two children. All searches for the new position, except for the very first one, start at the root of the just-inserted subtree, so it is this subtree that needs to have two children on each level. In the worst case we have to call ReInsert once for each level of the subtree, giving us log n calls, as the tree has to be reasonably balanced if it has two children on each level. This also fixes the number of nodes that need to be visited when searching for the correct position at log n, yielding an overall time of O(n) + O(log n) + O(log n) = O(n). So the worst-case running time of the deletion algorithm is linear.
Chapter 4

An Empirical Comparison of RBST and OBST

4.1  The Experiment
The formal analysis shows that the expected performance for deletions and insertions in OBSTs is logarithmic, which is the same as for the RBSTs introduced in [9]. To find out whether OBSTs perform better or worse than RBSTs in practice, both data structures were implemented and the average running times of the deletion and insertion algorithms were measured experimentally. The time was measured using the Java profiler JFluid (developed by Sun Labs; available from http://research.sun.com/projects/jfluid/download/), which provides accurate time and memory measurements.

In the experiment a random insertion sequence is simulated by using a random number generator. In practice there is no need for a random insertion sequence for RBSTs, as they introduce randomness through their algorithms. OBSTs, however, only preserve randomness and therefore need the insertion sequence itself to be random.

Deletion: First a number of random values are inserted into the tree. Due to memory restrictions the largest tree that we can create is one containing 1,500,000 nodes; the smallest tree that will be tested is of size 10,000. After the tree has been created, a random value is deleted from it. For the time measurements to be comparable, the same value needs to be deleted in both the RBST and the OBST. As the insertion sequence is random, we are guaranteed a random distribution of values in the tree. This random distribution has the effect that any value is as likely to be close to the root as it is to be close to an external node. We will check the running time of the deletion on two randomly selected values, 3 and 1239. Because of the random distribution, the deletion will be performed three times on each tree to give a more accurate measure of the average time.

Insertion: After creating an OBST and an RBST of relatively small size, 10,000 nodes, we are left with very similar running times. For that reason we concentrate on larger trees, where the values are more likely to diverge. We produce insertion sequences of up to 1,500,000 nodes and perform the experiment three times for each size to get a better estimate of the average performance.
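For concreteness, a minimal version of the deletion-timing harness is sketched below. It is ours, not the code actually used in the experiment: it assumes hypothetical ObstTree and RbstTree classes with the interfaces described in this thesis and in [9], and it times with System.nanoTime rather than with the JFluid profiler:

import java.util.Random;

public class DeletionTimingExperiment {

    public static void main(String[] args) {
        int size = 1_000_000;    // tree size under test
        int victim = 1239;       // value to delete, as in the experiment
        Random rng = new Random();

        ObstTree obst = new ObstTree();    // hypothetical OBST class
        RbstTree rbst = new RbstTree();    // hypothetical RBST class

        // Build both trees from the same random insertion sequence,
        // making sure the victim value is present in both.
        obst.insert(victim);
        rbst.insert(victim);
        for (int i = 1; i < size; i++) {
            int key = rng.nextInt();
            obst.insert(key);
            rbst.insert(key);
        }

        // Time the deletion of the same value in both trees.
        long t0 = System.nanoTime();
        obst.delete(victim);
        long t1 = System.nanoTime();
        rbst.delete(victim);
        long t2 = System.nanoTime();

        System.out.printf("OBST: %.3f ms, RBST: %.3f ms%n",
                (t1 - t0) / 1e6, (t2 - t1) / 1e6);
    }
}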
4.2  The Results

Deletion: The following table shows the times, in milliseconds, for the deletion of the value 3 on the OBST and the RBST:
Size      OBST                  RBST
10000     0.041, 0.040, 0.038   2.15, 0.719, 0.903
100000    0.043, 0.041, 0.042   25.2, 11.8, 16.3
200000    0.042, 0.041, 0.039   9.60, 0.472, 3.93
300000    0.046, 0.043, 0.037   12.1, 6.80, 36.8
400000    0.044, 0.082, 0.045   79.5, 0.052, 32.8
500000    0.044, 0.077, 0.035   107, 69.0, 22.9
600000    0.043, 0.070, 0.070   56.9, 21.7, 86.0
700000    0.038, 0.053, 0.042   14.2, 46.2, 0.342
800000    0.046, 0.041, 0.040   79.2, 1.31, 111
900000    0.036, 0.078, 0.043   158, 23.6, 217
1000000   0.078, 0.036, 0.041   72.7, 1.50, 185
1500000   0.079, 0.050, 0.045   8.7, 0.175, 17.6

The times, in milliseconds, for deleting the value 1239 are similar:

Size      OBST                  RBST
10000     0.075, 0.039, 0.041   2.3, 36.7, 0.035
100000    0.039, 0.042, 0.039   3.37, 4.73, 2.26
200000    0.041, 0.089, 0.036   53.5, 25.2, 11.2
300000    0.045, 0.081, 0.039   44.1, 8.74, 22.7
400000    0.049, 0.048, 0.046   105, 3.33, 24.1
500000    0.081, 0.045, 0.082   7.0, 36.4, 75.5
600000    0.044, 0.044, 0.050   177, 119, 19.8
700000    0.078, 0.039, 0.079   4.20, 8.13, 27.2
800000    0.044, 0.051, 0.076   72.7, 16.4, 119
900000    0.033, 0.053, 0.036   100, 44.7, 78.0
1000000   0.049, 0.053, 0.038   10.5, 290, 280
1500000   0.040, 0.052, 0.048   125, 210, 330

Taking the average of these values for each tree and plotting them against the number of nodes produces the graph depicted in Figure 4.1.
Insertion: For the insertion algorithm we measure the last 1,000 insertions for each of the different tree sizes and take their average. Each tree size was created three times, and the average of those three values can be found in the following table:

Size      OBST    RBST
10000     0.001   0.001
50000     0.002   0.001
100000    0.002   0.002
200000    0.004   0.003
500000    0.005   0.005
1000000   0.006   0.006
1100000   0.007   0.005
1200000   0.007   0.006
1300000   0.007   0.007
1400000   0.008   0.006
1500000   0.007   0.007
Even without drawing the graph it is easy to see that the values are very close. Plotting the values against the tree size yields the graph depicted in Figure 4.2.
4.3  Analysing the Results

Deletion: Looking at the results, the values for the deletion in the OBST appear to be much closer to each other than those for the RBST. To investigate this further, we calculate the standard deviation of each value-triple:
[Figure 4.1: Deletion in OBST and RBST. Average deletion time in milliseconds against the number of nodes in the tree; series: OBST, RBST.]

[Figure 4.2: Insertion in OBST and RBST (average time of a random insertion). Time in milliseconds against the number of nodes in the tree; series: OBST, RBST.]
Size      OBST (del 3)  OBST (del 1239)  RBST (del 3)  RBST (del 1239)
10000     0.002         0.020            0.78          20.55
100000    0.001         0.002            6.82          1.24
200000    0.002         0.029            4.61          21.55
300000    0.005         0.023            16.01         17.81
400000    0.022         0.002            39.93         53.72
500000    0.022         0.021            42.11         34.36
600000    0.016         0.003            32.2          79.49
700000    0.008         0.023            23.52         12.3
800000    0.003         0.017            56.44         51.38
900000    0.023         0.011            99.12         27.84
1000000   0.023         0.008            92.51         158.56
1500000   0.018         0.006            8.71          103
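The entries in this table are the sample standard deviations of the measurement triples; they can be reproduced with a few lines of Java (this sketch is ours, not the thesis code):

public class StdDev {

    // Sample standard deviation (dividing by n - 1) of a measurement triple.
    static double stdDev(double... xs) {
        double mean = 0;
        for (double x : xs) mean += x;
        mean /= xs.length;
        double sumSq = 0;
        for (double x : xs) sumSq += (x - mean) * (x - mean);
        return Math.sqrt(sumSq / (xs.length - 1));
    }

    public static void main(String[] args) {
        // OBST triple for the 400,000-node tree, deleting the value 3;
        // prints approximately 0.0217, i.e. the 0.022 entry in the table.
        System.out.println(stdDev(0.044, 0.082, 0.045));
    }
}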
These values yield an average standard deviation of 0.013 for OBSTs and 41.86 for RBSTs. The explanation for this immense difference is obvious. RBSTs employ their own random number generator to decide whether a node is inserted at the root or not, and the action taken in the deletion algorithm is also dependent on a random number. Therefore the exact same insertion sequence can produce completely different trees, and deleting the same element from the same tree can result in completely different actions. This has the effect of giving very different time values for the same algorithm. In contrast, the insertion algorithm of the OBST will produce the exact same tree structure when presented with the same insertion sequence, and its deletion algorithm will perform the same action when the same value is deleted from the same tree.

As we are only trying to analyse the average running time, the difference in standard deviation does not cause a problem and can be ignored. We only need to look at the respective average values as depicted in Figure 4.1. One issue we have to consider, however, is the fact that even though we are deleting the same value from each tree every time, we are dealing with trees that have been created by distinct insertion sequences. So, strictly speaking, the times are not completely comparable. As the deletions have been performed a number of times, however, we can take the average value as representative of a deletion on a random BST.

The graph clearly shows that deletion in an OBST is a lot faster on average than deletion in an RBST. Looking at the actual values, we can see that only once does the RBST deletion perform as fast as the OBST deletion (deleting 1239 in a tree with 10,000 nodes), and this is the case for the smallest tested tree size. So, even taking into account that one algorithm might suffer some time wastage due to implementation issues, we can safely say that OBSTs have a better average-case performance than RBSTs, and this applies even more so as the tree size increases. An immediate question that arises is whether the random number generator used in the join method of the RBST could cause some of the differences in running times. To eliminate this possibility, tests were run in which a similar random-number call was included in the deletion algorithm of the OBST; it caused no increase in time.

Insertion: The average times for insertion in OBSTs lie between 0.001 ms (10,000 nodes) and 0.008 ms (1,400,000 nodes) and produce a smoothly ascending graph, apart from the last value, which is slightly lower than its predecessor. The times for insertion in RBSTs lie between 0.001 ms (10,000 nodes) and 0.007 ms (1,500,000 nodes), and the graph shows unexpected falls, especially in the last quarter. The explanation for these falls lies in the random number generator used in the RBST insertion algorithm: it has the effect that the running time is only partly dependent on the tree size. The random number produced decides whether a new node is inserted at the root or not, giving rise to two different code paths with different running times. Apart from these falls, however, the running times for RBST and OBST insertion are almost identical, any differences being only fractions of a millisecond and therefore negligible.
Chapter 5

Conclusion and Future Work

We have presented a new kind of binary search tree, which provides an answer to Knuth's open question regarding the existence of a randomness preserving deletion algorithm for BSTs. The insertion and deletion algorithms that we have introduced are very straightforward, easy to implement and have an expected performance of O(log n). Not only does the deletion preserve randomness, it actually preserves the original order of elements, i.e. the resulting tree is the one that would have been created had the element never been inserted.

The average-case analysis of the OBST algorithms has been shown to be very easy, due to the fact that the OBST can be classified as a random tree. The data structure developed by Martinez and Roura offers an equally simple analysis, as their tree can also be classified as random. However, their claim to have developed a randomness preserving deletion algorithm has been shown to be incorrect, as their algorithm creates randomness rather than preserving it. The experiments performed also demonstrate that their deletion performs a lot worse than the deletion on an OBST, which makes our algorithm more attractive. Similarly, the deletion algorithm produced by Seidel and Aragon also introduces randomness rather than preserving it. It would be of interest to experimentally compare the performance of their deletion algorithm, or indeed the deletion algorithms of other existing random BSTs, with the OBST deletion to see which runs faster on average.

Clearly, ordered binary search trees are very usable in practice, as they have good average-case performance and the only extra information that needs to be stored is the history value, in the form of a single integer per node. To further investigate the usability of these trees, one might show how the actual tree structure and the corresponding internal path-length change after multiple insertions and deletions. The best-case and worst-case behaviour of these algorithms could also be analysed and compared to the times of the other main models of random binary search trees. Finally, it might be possible to apply the idea of an ordered tree to other kinds of trees to achieve good average performance.
Appendix A

RBST Pseudo Code

The pseudo code for the RBST algorithms, taken from [9]:

A.1  Insertion

bst insert(int x, bst T)
  int n, r;
  n = T → size;
  r = random(0, n);
  if (r = n) then
    return insertAtRoot(x, T);
  end if
  if (x < T → key) then
    T → left = insert(x, T → left);
  else
    T → right = insert(x, T → right);
  end if
  return T;
A.2  Insertion at the Root

bst insertAtRoot(int x, bst T)
  bst S, G;
  split(x, T, &S, &G);   /* S ≡ T< , G ≡ T> */
  T = newNode();
  T → key = x;
  T → left = S;
  T → right = G;
  return T;
A.3  Split

void split(int x, bst T, bst *S, bst *G)
  if (T = null) then
    *S = *G = null;
    return;
  end if
  if (x < T → key) then
    *G = T;
    split(x, T → left, S, &(*G → left));
  else
    *S = T;
    split(x, T → right, &(*S → right), G);
  end if
A.4  Deletion

bst delete(int x, bst T)
  bst Aux;
  if (T = null) then
    return null;
  end if
  if (x < T → key) then
    T → left = delete(x, T → left);
  else if (x > T → key) then
    T → right = delete(x, T → right);
  else
    Aux = join(T → left, T → right);
    freeNode(T);
    T = Aux;
  end if
  return T;
A.5  Join

bst join(bst L, bst R)
  int m, n, r, total;
  m = L → size;
  n = R → size;
  total = m + n;
  if (total = 0) then
    return null;
  end if
  r = random(0, total - 1);
  if (r < m) then
    L → right = join(L → right, R);
    return L;
  else
    R → left = join(L, R → left);
    return R;
  end if
Appendix B

Treap Pseudo Code

The pseudo code for the Treap algorithms, taken from [16]:
B.1  Insertion

procedure TREAP-INSERT( (k,p) : item, T : treap )
  if T = tnull then
    T ← NEWNODE()
    T → [key, priority, lchild, rchild] ← [k, p, tnull, tnull]
  else if k < T → key then
    TREAP-INSERT( (k,p), T → lchild )
    if T → lchild → priority > T → priority then
      ROTATE-RIGHT( T )
    end if
  else if k > T → key then
    TREAP-INSERT( (k,p), T → rchild )
    if T → rchild → priority > T → priority then
      ROTATE-LEFT( T )
    end if
  else
    // key k already in treap T
  end if
B.2  Deletion

procedure TREAP-DELETE( k : key, T : treap )
  tnull → key ← k
  REC-TREAP-DELETE( k, T )

procedure REC-TREAP-DELETE( k : key, T : treap )
  if k < T → key then
    REC-TREAP-DELETE( k, T → lchild )
  else if k > T → key then
    REC-TREAP-DELETE( k, T → rchild )
  else
    ROOT-DELETE( T )
  end if

procedure ROOT-DELETE( T : treap )
  if IS-LEAF-OR-NULL( T ) then
    T ← tnull
  else if T → lchild → priority > T → rchild → priority then
    ROTATE-RIGHT( T )
    ROOT-DELETE( T → rchild )
  else
    ROTATE-LEFT( T )
    ROOT-DELETE( T → lchild )
  end if
B.3  Rotation

procedure ROTATE-LEFT( T : treap )
  [T, T → rchild, T → rchild → lchild] ← [T → rchild, T → rchild → lchild, T]

procedure ROTATE-RIGHT( T : treap )
  [T, T → lchild, T → lchild → rchild] ← [T → lchild, T → lchild → rchild, T]
B.4  Functions

function IS-LEAF-OR-NULL( T : treap ) : Boolean
  return ( T → lchild = T → rchild )

function EMPTY-TREAP() : treap
  tnull → [priority, lchild, rchild] ← [−∞, tnull, tnull]
  return ( tnull )
Appendix C

Notes on the Pseudo-Code

The following notes apply to the pseudo-code used in this thesis only. They do not apply to the pseudo-code for the RBST or Treap algorithms shown in Appendices A and B.
C.1  Replace() Method

A call to replace(A, B) is used in most of the pseudo-code algorithms. Essentially this causes the node A to be replaced by the node B, automatically deleting A and updating B's pointers as follows: B's parent will be A's parent. If A's right child is not external, then B's right child will be A's right child; otherwise B will keep its own right child, if it exists. B's left child is updated similarly. In the case where B is initially a right (left) child of A, it is treated as an external node in the context of the pointer updates, thus allowing B to keep its own right (left) child.
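A possible realisation of these rules is sketched below in Java. This is our own reading of the description above, not the thesis implementation; the Node class, its fields, the external flag, and the assumption that the caller has already detached B from its old position are all ours:

final class Node {
    int key, history;
    Node parent, left, right;
    boolean external;           // true for an external (empty) node
}

class Obst {
    Node root;

    // Replace node a by node b, following the rules of C.1.
    // We assume b has already been detached from its old position.
    void replace(Node a, Node b) {
        // b inherits a's parent (or becomes the root).
        if (a.parent == null) {
            root = b;
        } else if (a.parent.left == a) {
            a.parent.left = b;
        } else {
            a.parent.right = b;
        }
        b.parent = a.parent;

        // b takes a's right child unless that child is external, or unless
        // b itself was that child; in those cases b keeps its own right child.
        if (!a.right.external && a.right != b) {
            b.right = a.right;
            a.right.parent = b;
        }
        // The left side is handled symmetrically.
        if (!a.left.external && a.left != b) {
            b.left = a.left;
            a.left.parent = b;
        }
        // a is now unlinked and can be discarded.
    }
}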
C.2  The Arrow Assignment

A statement like l ← x.leftChild means that l now points to the leftChild of x. Any updates performed directly on x.leftChild will not affect the information stored in l.
C.3  Other Methods

1. isExternal applied to a node x will return true if x is an external node and false otherwise.

2. key applied to any node will return the key value that is stored at that node.

3. rightChild applied to any node will return a pointer to the right child of that node. Similarly for leftChild.

4. parent applied to any node will return a pointer to the parent of that node. It will return 'NULL' when applied to the root.

5. The statement Node(k) will create a new node containing the key k.

6. getHistory applied to any node will return the history value associated with that node.

7. The statement remove x will remove the node x and any children it may have, and automatically update its parent's pointer to point to an external node.

8. - direction will return 'left' if it initially contained 'right' and vice versa.
Bibliography

[1] J. C. Culberson, P. A. Evans, Asymmetry in Binary Search Tree Update Algorithms, Technical Report TR 94-09 (1994)

[2] S. Edelkamp, Weak-Heapsort, ein schnelles Sortierverfahren, Diplomarbeit, Universität Dortmund (1996)

[3] M. T. Goodrich, R. Tamassia, Data Structures and Algorithms in Java, John Wiley and Sons, Inc. (1998)

[4] T. N. Hibbard, Some combinatorial properties of certain trees with applications to searching and sorting, Journal of the ACM, Vol. 9 (1962)

[5] A. T. Jonassen, D. Knuth, A Trivial Algorithm Whose Analysis Isn't, J. Comput. Syst. Sci. 16, 3, 301-322 (1978)

[6] G. D. Knott, Deletion in Binary Storage Trees, Ph.D. Thesis, Stanford University (1975)

[7] D. Knuth, Sorting and Searching, Volume 3 of The Art of Computer Programming, Reading, Massachusetts: Addison-Wesley (1973)

[8] D. Knuth, Deletions that preserve randomness, IEEE Trans. Software Engineering 3, 351-359 (1977)

[9] C. Martinez, S. Roura, Randomized Binary Search Trees, Journal of the ACM, Vol. 45, No. 2 (1998)

[10] M. Mishna, Attribute grammars and automatic complexity analysis, INRIA (2000)

[11] R. Motwani, P. Raghavan, Randomized Algorithms, Cambridge University Press (1995)

[12] W. Pugh, Skip Lists: A Probabilistic Alternative to Balanced Trees, Communications of the ACM 33, 6, 668-676 (1990)

[13] M. Schellekens, Towards a Calculus for Modular Software Timing I: ACETT, a Programming Language with Linearly-Compositional Average-Case Time Complexity, CEOL preprint, 152 p. (2005)

[14] R. Sedgewick, P. Flajolet, An Introduction to the Analysis of Algorithms, Addison-Wesley (1996)

[15] R. Sedgewick, P. Flajolet, Analytic Combinatorics, in preparation (preliminary version on the Web)

[16] R. Seidel, C. Aragon, Randomized Search Trees, Algorithmica 16, 464-497 (1996)

[17] J. S. Vitter, P. Flajolet, Chapter 9 in Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity (edited by J. van Leeuwen), Elsevier, 431-524 (1990)

[18] H. S. Wilf, Generatingfunctionology, Academic Press (1990)

[19] E. C. Young, Partial Differential Equations: An Introduction, Allyn and Bacon (1972)