Improved Uniform Error Bounds for Boolean Classifiers

Eric Bax

November 3, 1999

Abstract
We develop error bounds for boolean classifiers by using dynamic programming to count boolean expressions. Our method produces improved error bounds by eliminating some types of overcounting found in earlier methods.
Key words: machine learning, boolean functions, dynamic programming, statistics, combinatorics.
Math and Computer Science Department, University of Richmond, VA 23173 ([email protected]). Supported by a faculty research grant from the University of Richmond.
1 Introduction

Consider the following machine learning problem. We wish to use the results of several medical tests to predict whether patients in remission will remain healthy for five years. We have a set of in-sample data consisting of positive or negative test results and whether each patient remained healthy. We use this in-sample data to develop a boolean classifier. We establish a limit on the number of "and" or "or" operations in the classifier before we examine the in-sample data. We wish to bound the error rate of the classifier on out-of-sample patients.

Since the classifier is selected using the in-sample data, the in-sample error rate is a biased estimate of the out-of-sample error rate. However, the class of boolean classifiers is selected without reference to the in-sample data. So we can compute a bound on the probability that, for every classifier in the class, the performance over the in-sample data closely approximates the performance over out-of-sample data. These uniform bounds imply a bound on the selected classifier. If there are fewer classifiers in the class, then the uniform bounds are stronger. Often, the size of the class is unknown, so an upper bound on class size must be used to compute the error bound. This paper outlines a method to compute an upper bound on class size. The bound is tighter than previous bounds, resulting in stronger error bounds.

There are several previous results in this area. Pippenger [6] developed an upper bound on the number of classifiers. Pearl [5] used Pippenger's results to derive error bounds. Recently, Devroye, Györfi, and Lugosi [3] developed a tighter upper bound on the number of classifiers and stronger error bounds. They exhibit a correspondence between boolean classifiers and binary expression trees in which each classifier is represented by one or more trees. They bound the number of classifiers by counting trees. In this paper, we apply their strategy to a smaller set of trees.

In the next section, we review the correspondence between boolean classifiers and binary expression trees, and we develop a smaller class of expression trees. In the following section, we present a recurrence to count the new expression trees. Then we review uniform error bounds for machine learning. Finally, we compare error bounds based on our results to error bounds based on previous methods.
2 Classifiers and Expression Trees

Devroye, Györfi, and Lugosi [3] show that any boolean expression with k or fewer "and" or "or" operations can be represented as a binary expression tree with k internal nodes, each labeled by an operation, and k + 1 leaves, each labeled by a variable or a negated variable. (Note that any number of "not" operations may be accommodated by using DeMorgan's laws to express "not" operations through negated variables. For information on DeMorgan's laws, refer to, e.g., [2].)

Multiple trees may correspond to a single classifier. The following manipulations create new trees corresponding to the same classifier as the original tree.

1. Select an internal node with nonidentical left and right subtrees. Swap the subtrees.

2. Select an internal node with identical left and right subtrees. Replace the node and its subtrees by either subtree.

3. Select a node. Replace it by an internal node with either operation and with the subtree rooted at the selected node as its identical left and right subtrees.

4. Select a connected subgraph of internal nodes all labeled by the same operation. Disconnect the subtrees descending from the subgraph. Replace the subgraph by any other connected tree subgraph with the same label and the same number of nodes. Reconnect the descending subtrees to the new subgraph, in any order. (See Figure 1.)

Now we develop a class of expression trees that avoids these sources of multiple trees corresponding to the same classifier. To avoid multiple trees formed by reorderings of subtrees, impose an ordering within each set of trees having the same number of leaves, and order trees with fewer leaves before those with more leaves. Then insist that the child subtrees of each node decrease in the ordering from left to right. To avoid multiple trees caused by identical subtrees, prohibit any node from having identical child subtrees. To avoid multiple trees formed by rearrangements of connected subgraphs of identically labeled internal nodes, allow nodes to have any number of children except one, but insist that internal node labels alternate between levels of the tree. In other words, condense each maximal connected subgraph of identically labeled internal nodes into a single node. (See Figure 2, and the sketch below.) Nodes with three or more children may be viewed as "and" or "or" operations with three or more operands. Each such operation may be formed in multiple ways by combining binary operations.
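As an illustration, the normalization described above is easy to state as a procedure. The following is a minimal sketch in Python, using a hypothetical tuple representation of our own choosing: a leaf is a string such as "x1" or "~x1", and an internal node is a pair of an operation name and a tuple of children.

    def size(tree):
        # Number of leaves; used to order larger subtrees first.
        if isinstance(tree, str):
            return 1
        return sum(size(child) for child in tree[1])

    def normalize(tree):
        # Put a tree into the canonical form of Section 2.
        if isinstance(tree, str):
            return tree  # a leaf: a variable or a negated variable
        op, children = tree
        flat = []
        for child in (normalize(c) for c in children):
            if not isinstance(child, str) and child[0] == op:
                flat.extend(child[1])  # condense same-label connected subgraphs
            else:
                flat.append(child)
        # Identical child subtrees are redundant ("x and x" = "x"), so keep
        # one copy of each, then order with larger subtrees first.
        unique = sorted(set(flat), key=lambda t: (size(t), repr(t)), reverse=True)
        if len(unique) == 1:
            return unique[0]
        return (op, tuple(unique))

For example, normalize(("or", (("or", ("x1", "x2")), ("or", ("x3", "x4"))))) yields a single "or" node with the four leaves as children, mirroring the condensation shown in Figure 2.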
[Figure 1: These trees represent the same boolean classifier. The classifier remains the same after rearrangement of the connected subgraph of "or" nodes and reordering of the subgraphs descending from the "or" nodes. (The nodes are labeled by "a" for "and", by "o" for "or", and by numbers for variables.)]
[Figure 2: The tree on the left is a binary classifier tree. The tree on the right is the corresponding tree in the new class. The subgraph of "or" nodes in the left tree is condensed into a single "or" node in the right tree.]
3 Counting Expression Trees

Let L(n) be the number of trees in the new class that have n leaves. Let d be the number of input variables. Define

m(k, d) = \sum_{n=1}^{k+1} L(n).    (1)

Then m(k, d) is an upper bound on the number of classifiers that can be expressed using k or fewer binary "and" or "or" operations.

To simplify the computation of L(n), define T(n) to be the number of trees in the class counted by L(n), except that internal nodes are unlabeled. Since a tree with a single node has no internal node, L(1) = T(1). For larger trees, setting the root node label determines the other internal node labels, because labels must alternate between levels. Hence, for n > 1, L(n) = 2T(n).

Each leaf is labeled by a variable or its negation, so T(1) = 2d. We use a recurrence to compute T(n) for larger trees. For n > 1 and 0 < m < n, define T(n, m) to be the number of trees counted by T(n) that have m leaves on the largest subtree. Trees counted by T(n, m) can be constructed by adding a combination of trees with m leaves each as leftmost subtrees at the root of a tree with fewer than m leaves on each subtree. (See Figure 3.) In this way, we maintain the ordering, with the largest subtrees on the left. So, for n > 1 and 0 < m < n,

T(n, m) = \sum_{i \in \{1, \ldots, \lfloor n/m \rfloor\}} \binom{T(m)}{i} \sum_{j=0}^{m-1} T(n - im, j),    (2)

where we define

T(n - im, j) = 0 for n - im < j    (3)

and

T(n - im, 0) = 0 for n - im > 0.    (4)

To account for the possibility of adding new subtrees to a "proto-tree" that is a root with no children, let

T(0, 0) = 1.    (5)

Also, to account for the possibility of adding new subtrees to a "proto-tree" that is a root with a single child subtree, let

T(j, j) = T(j) for j > 0.    (6)

Recurrence 2 may be used to compute T(n, m) through dynamic programming. (See, e.g., [2].) For n > 1,

T(n) = \sum_{m=1}^{n-1} T(n, m).    (7)
[Figure 3: Forming a tree counted by T(n, m). Trees counted by T(n, m) have n leaves in total and a maximum of m leaves in each subtree. These trees may be formed by adding a collection of trees with m leaves each as subtrees of the root of a tree with fewer than m leaves in each subtree. The terms in the diagram correspond to terms in Formula 2.]

So, for n > 1,
L(n) = 2 \sum_{m=1}^{n-1} T(n, m).    (8)

Hence, the number of classifiers that can be expressed using k or fewer binary "and" or "or" operations is at most

m(k, d) = 2 \left[ d + \sum_{n=2}^{k+1} \sum_{m=1}^{n-1} T(n, m) \right].    (9)
4 Review of Uniform Error Bounds

Bounds on the number of classifiers in a class are used to compute error bounds for classifiers developed using in-sample data, as follows. Let g be a classifier. Let \nu be the (unknown) error rate over an input distribution. Let \hat{\nu} be the error rate over n in-sample examples, with inputs drawn according to the input distribution and outputs determined by the (unknown) target function. If g is selected without reference to the in-sample data, then, by Hoeffding's inequality [4],

\Pr\{\nu \geq \hat{\nu} + \epsilon\} \leq e^{-2n\epsilon^2}.    (10)

Now, let \{g_1, \ldots, g_m\} be a set of classifiers chosen without reference to the examples. Let \nu_i and \hat{\nu}_i denote the error rates of g_i. To develop uniform error bounds, use the sum of probabilities to bound the probability of the union of events:

\Pr\{\nu_1 \geq \hat{\nu}_1 + \epsilon \text{ or } \ldots \text{ or } \nu_m \geq \hat{\nu}_m + \epsilon\} \leq m e^{-2n\epsilon^2}.    (11)

If the examples are used to select a classifier g^* from the set, then the single-classifier bound does not apply to g^*. However, the uniform bound implies a bound for g^*. So

\Pr\{\nu^* \geq \hat{\nu}^* + \epsilon\} \leq m e^{-2n\epsilon^2}.    (12)

(For more information on uniform error bounds, see [1, 3, 7].)

For many classes, including the class of boolean classifiers with k or fewer "and" or "or" operations, the size of the class is unknown. So upper bounds must be used in place of m to compute error bounds using Formula 12.
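Setting the right side of Formula 12 equal to a failure probability \delta and solving for n gives n \geq \ln(m/\delta) / (2\epsilon^2), so the required number of examples grows with the logarithm of m. A minimal sketch of this inversion (a helper of our own, not from the paper):

    from math import ceil, log

    def examples_needed(m, epsilon=0.10, delta=0.10):
        # Smallest n with m * exp(-2 n epsilon^2) <= delta (Formula 12),
        # i.e., n >= ln(m / delta) / (2 epsilon^2).
        return ceil(log(m / delta) / (2 * epsilon ** 2))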
5 Comparison To Previous Results

Pippenger [6] derived the following bound for the number of classifiers that can be expressed using k or fewer "and," "or," or "not" operations:

m_p(k, d) = \left( \frac{16(d + k)^2}{k} \right)^k.    (13)

Devroye, Györfi, and Lugosi [3] (pp. 468-470 and p. 476) used binary expression trees to show that the number of classifiers that can be expressed using k or fewer "and" or "or" operations is no more than

m_b(k, d) = 2 (2d)^{k+1} \frac{1}{k+1} \binom{2k}{k}.    (14)

Devroye, Györfi, and Lugosi show that m_b(k, d) is a stronger bound than m_p(k, d). Since our expression trees eliminate some sources of overcounting by binary expression trees, m(k, d) is a stronger bound than m_b(k, d).

Figures 4, 5, and 6 compare numbers of in-sample examples required for uniform error bounds using binary expression trees and using the new expression trees to count classifiers. The values were obtained by substituting bounds m_b(k, d) and m(k, d) for m in Formula 12 and solving for n, with bound tolerance \epsilon = 0.10 and bound failure probability 10%. Figures 4, 5, and 6 show results for input spaces with dimensions d = 5, d = 10, and d = 20, respectively.

From the figures, note that the ratio of examples required for bounds using binary expression trees to examples required for bounds using the new expression trees increases as the limit on the number of operations increases. Also, note that the limit k on the number of operations has a much greater effect on the required number of examples than the dimension d of the input space. The number of examples required for uniform bounds grows as the logarithm of the bound on the number of classifiers, so the figures compare logarithms of bounds m_b(k, d) and m(k, d).
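The comparison is easy to reproduce with the helpers sketched above (again, an illustration of our own): compute m_b(k, d) from Formula 14 and compare the required sample sizes.

    from math import comb

    def m_b(k, d):
        # Formula 14: 2 (2d)^(k+1) Catalan(k), with Catalan(k) = C(2k, k)/(k+1).
        catalan = comb(2 * k, k) // (k + 1)  # exact: k+1 divides C(2k, k)
        return 2 * (2 * d) ** (k + 1) * catalan

    # For example, with d = 5 and k = 10, as in Figure 4:
    # examples_needed(m_b(10, 5)) versus examples_needed(m_bound(10, 5))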
[Figure 4 (plot: examples n versus operations k, for binary expression trees and new expression trees): Comparison of the number of in-sample examples required for uniform error bounds using binary expression trees and the new expression trees to bound the number of boolean classifiers with k or fewer "and" or "or" operations for input space of dimension d = 5. The uniform bounds have tolerance \epsilon = 0.10 and probability of bound failure no more than 10%.]
[Figure 5 (plot: examples n versus operations k, for binary expression trees and new expression trees): Comparison of the number of in-sample examples required for uniform error bounds using binary expression trees and the new expression trees to bound the number of boolean classifiers with k or fewer "and" or "or" operations for input space of dimension d = 10. The uniform bounds have tolerance \epsilon = 0.10 and probability of bound failure no more than 10%.]
[Figure 6 (plot: examples n versus operations k, for binary expression trees and new expression trees): Comparison of the number of in-sample examples required for uniform error bounds using binary expression trees and the new expression trees to bound the number of boolean classifiers with k or fewer "and" or "or" operations for input space of dimension d = 20. The uniform bounds have tolerance \epsilon = 0.10 and probability of bound failure no more than 10%.]
6 Conclusion

We have derived improved bounds for the number of boolean classifiers by counting a class of expression trees. There is still room for improvement because there are still classifiers that correspond to multiple expression trees. For example, the "xor" classifier can be expressed in two ways that correspond to different trees:

x_1 \text{ xor } x_2 = (x_1 \text{ and } \bar{x}_2) \text{ or } (\bar{x}_1 \text{ and } x_2)    (15)

and

x_1 \text{ xor } x_2 = (x_1 \text{ or } x_2) \text{ and } (\bar{x}_1 \text{ or } \bar{x}_2).    (16)

One strategy to improve the results in this paper is to count trees that contain the subtree corresponding to the second expression and subtract this number from the total number of trees. Similar reductions can be developed based on other boolean functions that correspond to multiple expression trees.
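As a quick sanity check (our own illustration, not part of the paper's argument), expressions (15) and (16) agree on all four input combinations:

    # Verify that expressions (15) and (16) both compute xor.
    for x1 in (False, True):
        for x2 in (False, True):
            e15 = (x1 and not x2) or (not x1 and x2)
            e16 = (x1 or x2) and (not x1 or not x2)
            assert e15 == e16 == (x1 != x2)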
7 Acknowledgements

Thanks to Dr. Hayden Porter and Dr. Doug Rall at Furman University, to Dr. Andras Recski at MTKI in Budapest, and to Dr. Joel Franklin at the California Institute of Technology for advice and encouragement. Thanks to Elizabeth Geiger, Dr. Gary Greenfield, and Dr. Van Nall at the University of Richmond for advice on the presentation of these results. Thanks to the University of Richmond for supporting this work through a faculty research grant.
References

[1] E. Bax, Partition-based and sharp uniform error bounds, to appear in IEEE Transactions on Neural Networks.

[2] T. Cormen, C. Leiserson, and R. Rivest, Introduction to Algorithms, MIT Press, Cambridge, MA, 1997.

[3] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition, Springer-Verlag New York, Inc., 1996.

[4] W. Hoeffding, Probability inequalities for sums of bounded random variables, Am. Stat. Assoc. J., 58 (1963) 13-30.

[5] J. Pearl, Capacity and error estimates for boolean classifiers with limited complexity, IEEE Transactions on Pattern Analysis and Machine Intelligence, 1 (1979) 350-355.

[6] N. Pippenger, Information theory and the complexity of boolean functions, Mathematical Systems Theory, 10 (1977) 124-162.

[7] V. N. Vapnik, Statistical Learning Theory, John Wiley and Sons, Inc., 1998.