On the Consistency of Boosting Algorithms
Shie Mannor and Ron Meir Department of Electrical Engineering Technion, Haifa 32000, Israel {shie,rmeir}@{techunix,ee}.technion.ac.il
Shahar Mendelson Australian National University Canberra 0200, Australia
[email protected]
Abstract

Boosting algorithms have been shown to perform well on many real-world problems, although they sometimes tend to overfit in noisy situations. While excellent finite sample bounds are known, it has not been clear whether boosting is statistically consistent, implying asymptotic convergence to the optimal classification rule. Recent work has provided sufficient conditions for the consistency of boosting for one-dimensional problems. In this work we provide sufficient conditions for the consistency of boosting in the multi-variate case. These conditions require non-trivial geometric concepts, which play no role in the one-dimensional setting. An interesting connection to the recently introduced notion of kernel alignment is pointed out.
1 Introduction
While boosting algorithms (e.g., [7, 14]) have been demonstrated to yield very effective classifiers in many situations, there are several issues about boosting which are not yet fully understood. The main focus of this paper is on the issue of consistency, namely the question of whether boosting converges asymptotically to the optimal classification rule. Some work along these lines has recently been done by Jiang [8], who established two main results. First, he showed that in the one-dimensional case boosting is consistent in the noise-free case when a simple classifier is used as a weak learner. Second, he provided an example of a one-dimensional noisy classification problem for which boosting is inconsistent. In fact, boosting is well known to overfit in noisy situations (e.g., [12, 14]), although Jiang's result is probably the first to provide an analytically tractable example of this. In this work we consider the multi-variate problem, the geometry of which poses many problems which do not exist in the one-dimensional case. Furthermore, we focus on the noiseless case, as even in this case the issue of consistency is rather difficult in multiple dimensions.

Before moving to the technical discussion, we discuss why the consistency of boosting is difficult to establish. For this purpose, consider first the classic approach to establishing consistency in the context of a hierarchy of models (e.g., neural networks with an increasing number of hidden units). In this context, one usually uses some
form of complexity regularization, whereby overly complex models are penalized. This kind of approach has been developed by Vapnik and Chervonenkis under the title of structural risk minimization (e.g., [15]), and is known to lead to consistent classification rules in many contexts (for example, see [6] for a proof of consistency in the context of neural networks). A distinguishing feature of boosting algorithms, at least in their original formulation, is that they do not incorporate any form of regularization, and keep increasing the complexity of the function-class under consideration without penalizing for increased complexity. Initial work demonstrated that in spite of this lack of regularization, boosting algorithms worked very well in many cases, and overfitting was not observed, at least for noiseless problems. In this work we attempt to understand this phenomenon by providing precise conditions under which boosting is expected to be consistent. Future work will consider the more difficult problem of noisy data, where some form of regularization seems essential (e.g., [12]). Finally, we comment that in this work we consider the case where the weak learner used in the boosting algorithm is a linear classifier.
2 Problem formulation and geometric concepts
We consider the problem of binary classification. A sequence of $m$ independent and identically distributed (i.i.d.) examples, $\{(x_i, y_i)\}_{i=1}^m$, is available, where each example pair $(x_i, y_i)$ is drawn from some unknown probability distribution $\mu(x, y)$ over $\mathbb{R}^d \times \{-1, +1\}$. Let $\mathcal{F}$ be a hypothesis class consisting of real-valued classification rules, and consider the loss $\ell(y, f(x))$ incurred by predicting according to the sign of $f(x)$. In this work we consider the 0-1 loss function given by $\ell(y, f(x)) = I(yf(x) \le 0)$. We define the empirical and expected loss functions in the standard way,
$$
\hat{L}_m(f) = \frac{1}{m} \sum_{i=1}^m I[y_i f(x_i) \le 0] ; \qquad L(f) = P_\mu[Y f(X) \le 0],
$$
and set
$$
\mathrm{opt}_\mu(\mathcal{F}) = \inf_{f \in \mathcal{F}} L(f),
$$
namely the minimal error incurred by the optimal classifier in $\mathcal{F}$. Clearly, $L(f)$ and $\mathrm{opt}_\mu(\mathcal{F})$ can only be calculated if the underlying distribution is known. Let $\mathcal{G}$ be a target class of classification rules. We define the notion of consistency relative to $\mathcal{G}$.

Definition 1 An algorithm is strongly consistent with respect to a target class $\mathcal{G}$ if the sequence of classifiers $\hat{f}_m$ generated by the algorithm obeys $\lim_{m \to \infty} L(\hat{f}_m) = \mathrm{opt}_\mu(\mathcal{G})$ with probability 1, for any distribution $\mu$. If $\mathcal{G}$ contains the Bayes rule, we have a universally consistent algorithm.
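To fix ideas, the following is a minimal sketch (not taken from the paper; the toy sample and the linear rule are purely hypothetical) of the empirical loss $\hat{L}_m(f)$ defined above.

```python
import numpy as np

def empirical_loss(f, X, y):
    """Empirical 0-1 loss: the fraction of examples with y_i * f(x_i) <= 0."""
    return float(np.mean(y * f(X) <= 0))

# Hypothetical toy sample with deterministic labels (zero Bayes error).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X @ np.array([1.0, -1.0]))
f = lambda Z: Z @ np.array([1.0, -1.0])      # a real-valued linear classification rule
print(empirical_loss(f, X, y))               # 0.0 here, since f realizes the labels exactly
```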
One popular approach to establishing consistency (e.g., [6]) is based on the following observation. If $\hat{f}_m$ results from a sample error minimization (sem) algorithm, it is easy to show that $L(\hat{f}_m) - \mathrm{opt}_\mu(\mathcal{F}) \le 2 \sup_{f \in \mathcal{F}} |L(f) - \hat{L}_m(f)|$. One can then use standard uniform convergence results in order to prove that the latter term converges to zero, yielding the minimal loss $\mathrm{opt}_\mu(\mathcal{F})$. If the size of the class $\mathcal{F}$ is allowed to increase at an appropriate rate, and complexity regularization is imposed, consistency can be established. In the context of boosting, there are two basic observations to be made concerning this approach. First, boosting is not in general a sem algorithm, although it does give rise to a nearly sem algorithm under appropriate conditions. In fact, it is the explicit aim of this paper to establish precise conditions under which boosting is a nearly sem algorithm, and to specify the conditions under which consistency may be established. Second, no regularization is imposed in the original boosting algorithms.

In order to set the stage for the development of the main results in Section 3, we begin by introducing a geometric quantity which measures the 'degree of separation' of two oppositely labeled sets of points. Let $S = \{(x_i, y_i)\}_{i=1}^m$ be a sample of $m$ labeled points, to each of which is associated a non-negative weight $P_i$, such that $\sum_{i=1}^m P_i = 1$. Define
$$
\hat{J}(S, P) = -\sum_{i=1}^m \sum_{j=1}^m \|x_i - x_j\| \, y_i y_j P_i P_j. \qquad (1)
$$
The intuition behind the term $\hat{J}(S, P)$ is as follows. Let $I_e$ be the subset of indices of the equally labeled pairs, namely $I_e = \{(i, j) : y_i = y_j\}$, and let $I_o$ be the subset of oppositely labeled index pairs. Then
$$
\hat{J}(S, P) = \sum_{(i,j) \in I_o} \|x_i - x_j\| P_i P_j - \sum_{(i,j) \in I_e} \|x_i - x_j\| P_i P_j.
$$
It is clear that $\hat{J}(S, P)$ is large when the equally labeled points are close by on average, while the oppositely labeled points are as far apart as possible. Thus, $\hat{J}(S, P)$ measures the 'separation' of the two subsets of points. We note that a similar quantity, termed kernel alignment, was recently introduced by Cristianini et al. [5] in the context of kernel methods, although the motivation was very different.

Assume initially that $P$ is a balanced measure, namely $\sum_i P_i y_i = 0$. Under this assumption, Alexander [1] proves that $\hat{J}(S, P) \ge 0$. Let $H$ be the class of linear classifiers, $H = \{h : h(x) = \mathrm{sgn}(w^T x + w_0),\ w \in \mathbb{R}^d,\ w_0 \in \mathbb{R}\}$; then it is shown in [10], based on [1], that
$$
\inf_{h \in H} \left\{ \sum_{i=1}^m P_i I[y_i h(x_i) \le 0] \right\} \le \frac{1}{2} - \sqrt{(c_d/2\rho)\hat{J}(S, P)},
$$
where the constant $c_d$ depends only on the dimension $d$, and $\rho$ is the diameter of the support of the data. In order to deal with general (i.e., unbalanced) measures $P$, we proceed as follows. For any $P$, define a balanced measure $Q$ (depending on $P$), obeying $\sum_i Q_i y_i = 0$ and $\sum_i Q_i = 1$. From Lemma 4.1 in [10] we have that
$$
\inf_{h \in H} \left\{ \sum_{i=1}^m P_i I[y_i h(x_i) \le 0] \right\} \le \frac{1}{2} - \sqrt{(c_d/8\rho)\hat{J}(S, Q)}, \qquad (2)
$$
where an extra factor of $1/2$ appears in the second term on the r.h.s. of (2), and the balanced measure $Q = Q(P)$ replaces $P$.
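A minimal numerical sketch of $\hat{J}(S, P)$ (not from the paper; the toy clusters and the uniform weights are our own assumptions) illustrates the separation interpretation and the nonnegativity under a balanced measure.

```python
import numpy as np

def J_hat(X, y, P):
    """J(S, P) = - sum_{i,j} ||x_i - x_j|| * y_i * y_j * P_i * P_j, as in Eq. (1)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances ||x_i - x_j||
    v = y * P
    return -float(v @ D @ v)

# Toy sample: two well-separated clusters, one per label, with equal class counts,
# so the uniform measure P_i = 1/m is balanced (sum_i P_i y_i = 0).
rng = np.random.default_rng(0)
Xp = rng.normal(loc=+3.0, size=(20, 2))
Xn = rng.normal(loc=-3.0, size=(20, 2))
X = np.vstack([Xp, Xn])
y = np.concatenate([np.ones(20), -np.ones(20)])
P = np.full(40, 1.0 / 40)
print(J_hat(X, y, P))   # positive: oppositely labeled pairs are far apart on average
```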
3 Establishing consistency
Consider a boosting algorithm giving rise to the empirical classifier $\hat{f}_{T,m}$ at the $T$-th step of boosting. Assuming that the weak learner at each boosting iteration is selected from the class $H$ of weak classifiers, clearly $\hat{f}_{T,m} \in \mathrm{co}_T(H)$, where $\mathrm{co}_T(H)$ is the convex hull of $H$, restricted to a combination of $T$ elements. In order to simplify notation we let $\mathcal{F}_T = \mathrm{co}_T(H)$, and split the loss incurred by $\hat{f}_{T,m}$ as follows:
$$
L(\hat{f}_{T,m}) = [L(\hat{f}_{T,m}) - \mathrm{opt}_\mu(\mathcal{F}_T)] + [\mathrm{opt}_\mu(\mathcal{F}_T) - \mathrm{opt}_\mu(\mathcal{G})] + \mathrm{opt}_\mu(\mathcal{G}). \qquad (3)
$$
The second term is an approximation error term and does not depend on the data. We focus now on the first term, which we refer to as the estimation error. Let $\hat{f}^*_{T,m}$ denote the empirically optimal classifier, namely the classifier minimizing the empirical error $\hat{L}_m(f)$ over $\mathcal{F}_T$. We then have
$$
P\left\{L(\hat{f}_{T,m}) - \mathrm{opt}_\mu(\mathcal{F}_T) \ge \epsilon\right\} \le P\left\{L(\hat{f}_{T,m}) - L(\hat{f}^*_{T,m}) \ge \frac{\epsilon}{2}\right\} + P\left\{L(\hat{f}^*_{T,m}) - \mathrm{opt}_\mu(\mathcal{F}_T) \ge \frac{\epsilon}{2}\right\}. \qquad (4)
$$
In order to guarantee that the estimation error converges to zero, it suffices that both terms on the r.h.s. of (4) converge to zero.
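The split in (4) is a standard union-bound step, spelled out here for completeness. Writing $A = L(\hat{f}_{T,m}) - L(\hat{f}^*_{T,m})$ and $B = L(\hat{f}^*_{T,m}) - \mathrm{opt}_\mu(\mathcal{F}_T)$, the left-hand side of (4) is $P\{A + B \ge \epsilon\}$, and since $A + B \ge \epsilon$ implies $A \ge \epsilon/2$ or $B \ge \epsilon/2$,
$$
P\{A + B \ge \epsilon\} \le P\{A \ge \epsilon/2\} + P\{B \ge \epsilon/2\}.
$$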
The second term converges to zero with high probability for 'small' (e.g., VC) classes, by standard arguments (see, for example, Chapter 4 in [3]). Since $\hat{f}^*_{T,m}$ results from a sample error minimization algorithm, we have from standard uniform convergence results (e.g., Theorem 4.3 in [3]) that
$$
P\left\{L(\hat{f}^*_{T,m}) - \mathrm{opt}_\mu(\mathcal{F}_T) \ge \frac{\epsilon}{2}\right\} \le 4\Pi_{\mathcal{F}_T}(2m)e^{-m\epsilon^2/32},
$$
where (abusing notation slightly) $\Pi_{\mathcal{F}_T}(2m)$ is the growth function of $\mathrm{sgn}(\mathcal{F}_T)$. Note that we do not make use here of recent bounds based on more refined complexity measures (e.g., the Rademacher complexities in [11]), as our main goal is to establish sufficient conditions for consistency, rather than optimal convergence rates.

Next, we consider the term $L(\hat{f}_{T,m}) - L(\hat{f}^*_{T,m})$. This is the first place where the nature of the learning algorithm enters. We make the following assumptions.

Assumption 3.1 The label $y$ is a deterministic function of $x$, and is given by $y = g(x)$ for some $g \in \mathcal{G}$.

We now show that the empirical error of the boosted classifier is small. In order to do this we need one further assumption.

Assumption 3.2 The distribution $\mu$ is such that, with probability 1, the oppositely labeled data points are separated by a gap of $\eta$.

Assumptions 3.1 and 3.2, respectively, imply that the Bayes error is zero and that the oppositely labeled points are $\eta$-far from each other. Note that we have assumed nothing about the optimal decision boundary, except that $g \in \mathcal{G}$; it may therefore be arbitrarily complex, depending on $\mathcal{G}$. From Lemma 11 in [2], together with the remark on page 9 of [2], we conclude that under the conditions of Assumption 3.2, $\hat{J}(S, Q) \ge c\eta d^{-3/2} \sum_{i=1}^m Q_i^2$ for an absolute positive constant $c$. Since the minimal value of $\sum_{i=1}^m Q_i^2$, under the constraints $\sum_{i=1}^m Q_i = 1$ and $\sum_{i=1}^m y_i Q_i = 0$, is attained at $|Q_i| = 1/m$, we conclude that for any such measure $Q$,
$$
\hat{J}(S, Q) \ge c\eta d^{-3/2}/m. \qquad (5)
$$
From (5) we have a distribution-free and sample-independent lower bound on $\hat{J}(S, Q)$.
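The only property of the minimizing measure used in passing to (5) is the bound $\sum_i Q_i^2 \ge 1/m$, which already follows from Cauchy-Schwarz:
$$
1 = \Bigl(\sum_{i=1}^m Q_i\Bigr)^2 \le m \sum_{i=1}^m Q_i^2, \qquad \text{hence} \qquad \sum_{i=1}^m Q_i^2 \ge \frac{1}{m}.
$$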
Assume, in general, that we have some lower bound of the form $\hat{J}(S, Q) \ge J_0(m)$; by (5), this always holds with $J_0(m) = c\eta d^{-3/2}/m$. We consider how this affects the generalization error bounds.

We recall the bound on the empirical error of the AdaBoost algorithm. Let $\epsilon_t = 1/2 - \gamma_t$ be the error incurred by the $t$-th weak classifier during the boosting process. Then from [13] we have that $\hat{L}_m(\hat{f}_{T,m}) \le \exp\{-\sum_{t=1}^T \gamma_t^2\}$. Although this bound was established for AdaBoost, it holds for many other boosting algorithms.

Lemma 3.1 Let Assumption 3.2 hold, and assume that the weak learner at each step of boosting minimizes the weighted empirical error. Assume further that $\hat{J}(S, Q) \ge J_0(m)$. Then
$$
\hat{L}_m(\hat{f}_{T,m}) \le e^{-(c_d/8\rho)J_0(m)T}.
$$
Proof From (2) we have that $\gamma_t \ge \sqrt{c_d \hat{J}(S, Q)/8\rho} \ge \sqrt{c_d J_0(m)/8\rho}$. We then have that $\hat{L}_m(\hat{f}_{T,m}) \le \exp\{-\sum_{t=1}^T \gamma_t^2\} \le \exp\{-(c_d J_0(m)/8\rho)T\}$.

It should be observed that the requirement that the weak learner exactly minimize the weighted empirical error is not essential. Even if the minimal weighted empirical error is $1/2 - \gamma_t$, it suffices that the weak learner achieve a weighted empirical error bounded from above by (say) $1/2 - \gamma_t/2$. While it is known that exact minimization of the empirical error is NP-hard, it is not known whether this holds for approximate minimization.
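To make the quantities in Lemma 3.1 concrete, here is a minimal AdaBoost sketch with a linear weak learner. It is purely illustrative: the pool-search weak learner (`weak_linear_learner`, `n_candidates`) only approximates the weighted-error minimization assumed above, and all names and constants are our own rather than the paper's.

```python
import numpy as np

def weak_linear_learner(X, y, P, n_candidates=200, rng=None):
    """Approximate weak learner: pick the best of a random pool of hyperplanes
    under the weighted 0-1 loss (exact minimization is NP-hard, as noted above)."""
    rng = np.random.default_rng() if rng is None else rng
    m, d = X.shape
    best_err, best = np.inf, None
    for _ in range(n_candidates):
        w, w0 = rng.normal(size=d), rng.normal()
        pred = np.sign(X @ w + w0)
        pred[pred == 0] = 1.0
        err = float(np.sum(P * (pred != y)))
        if err > 0.5:                       # the negated hyperplane does better
            w, w0, err = -w, -w0, 1.0 - err
        if err < best_err:
            best_err, best = err, (w, w0)
    return best, best_err

def adaboost(X, y, T=50, rng=None):
    """Run T rounds of AdaBoost; return the voted classifier and the edges gamma_t."""
    m = X.shape[0]
    P = np.full(m, 1.0 / m)                 # boosting weights P_i
    hyps, alphas, gammas = [], [], []
    for _ in range(T):
        (w, w0), eps = weak_linear_learner(X, y, P, rng=rng)
        eps = float(np.clip(eps, 1e-12, 0.5 - 1e-12))
        gamma = 0.5 - eps                   # the edge gamma_t of Lemma 3.1
        alpha = 0.5 * np.log((1.0 - eps) / eps)
        pred = np.sign(X @ w + w0)
        pred[pred == 0] = 1.0
        P = P * np.exp(-alpha * y * pred)   # reweight: up-weight the mistakes
        P = P / P.sum()
        hyps.append((w, w0)); alphas.append(alpha); gammas.append(gamma)

    def f(Xnew):
        score = sum(a * np.sign(Xnew @ w + w0) for a, (w, w0) in zip(alphas, hyps))
        return np.sign(score)

    return f, np.array(gammas)

# The training error of f can be compared with the bound exp(-sum_t gamma_t^2) quoted above.
```

With well-separated data (Assumption 3.2) and a sufficiently rich candidate pool, the edges $\gamma_t$ should remain bounded away from zero, so the training error of the voted classifier decays exponentially in $T$, in the spirit of Lemma 3.1.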
In order to proceed, we need to show that $L(\hat{f}_{T,m}) - L(\hat{f}^*_{T,m})$ is small. By Lemma 3.1 we have that $\hat{L}_m(\hat{f}_{T,m}) - \hat{L}_m(\hat{f}^*_{T,m}) \le e^{-(c_d/8\rho)J_0(m)T}$. Using the inequality
$$
P\left\{\sup_{f \in \mathcal{F}_T} \left|L(f) - \hat{L}_m(f)\right| \ge \epsilon\right\} \le 4\Pi_{\mathcal{F}_T}(2m)e^{-m\epsilon^2/8}
$$
(Theorem 4.3 in [3]), we find that
$$
P\left\{L(\hat{f}_{T,m}) - L(\hat{f}^*_{T,m}) \ge 2\epsilon + e^{-(c_d/8\rho)J_0(m)T}\right\} \le 8\Pi_{\mathcal{F}_T}(2m)e^{-m\epsilon^2/8}.
$$
Setting $T > (8\rho/c_d J_0(m)) \log(3/\epsilon)$ we obtain from (4), after some simple algebra, that
$$
P\left\{L(\hat{f}_{T,m}) - \mathrm{opt}_\mu(\mathcal{F}_T) \ge \epsilon\right\} \le 12\Pi_{\mathcal{F}_T}(2m)e^{-m\epsilon^2/72}.
$$
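As a back-of-the-envelope illustration (our own, not the paper's), the following computes the number of boosting rounds required by the condition $T > (8\rho/c_d J_0(m)) \log(3/\epsilon)$, using the distribution-free bound $J_0(m) = c\eta d^{-3/2}/m$ from (5); the constants `c`, `c_d`, `eta`, and `rho` are unspecified placeholders, not values from the paper.

```python
import math

def rounds_needed(m, eps, d, rho=1.0, c=1.0, c_d=1.0, eta=1.0):
    """Rounds required by T > (8*rho / (c_d * J0(m))) * log(3/eps),
    with the sample-independent bound J0(m) = c * eta * d**(-1.5) / m from (5).
    All constants are illustrative placeholders."""
    J0 = c * eta * d ** (-1.5) / m
    return math.ceil((8.0 * rho / (c_d * J0)) * math.log(3.0 / eps))

# For fixed eps and d, the required number of rounds grows linearly in m.
print(rounds_needed(m=1000, eps=0.1, d=2))
```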
In order to establish consistency with respect to the class $\mathcal{G}$, we consider the case where $\mathcal{G}$ is given by $\mathrm{sgn}(\mathrm{co}(H))$. In the case where $H$ is the class of linear classifiers of the form $\mathrm{sgn}(w^T x + w_0)$, Barron [4] characterized this class using boundedness of the Fourier transforms of the functions in the class. Fix $\epsilon$ and let $T_0(\epsilon)$ be sufficiently large so that $\mathrm{opt}_\mu(\mathrm{co}_T(H)) - \mathrm{opt}_\mu(\mathcal{G}) < \epsilon$ for all $T > T_0(\epsilon)$. We then have the following result.

Theorem 3.1 Let the target function be $g(x) = \mathrm{sgn}(g_0(x))$, where $g_0$ is contained in $\mathrm{co}(H)$, and let Assumptions 3.1 and 3.2 hold. Then, applying boosting for $T$ steps, where $T > \max\{(8\rho/c_d J_0(m)) \log(3/\epsilon), T_0(\epsilon)\}$,
$$
P\left\{L(\hat{f}_{T,m}) \ge \epsilon\right\} \le 12\Pi_{\mathcal{F}_T}(2m)e^{-m\epsilon^2/288}. \qquad (6)
$$
In order to establish consistency we need to specify $J_0(m)$. In (5) we demonstrated that $J_0(m) = \Omega(1/m)$ always holds. In order to establish consistency we will need to make a slightly stronger assumption, as specified in Corollary 3.1.

Corollary 3.1 Under the conditions of Theorem 3.1, and the added requirement that $J_0(m) \ge \Omega(\log^{2+\alpha} m/m)$ for some $\alpha > 0$, the boosting algorithm is strongly consistent with respect to the class of functions $\mathrm{sgn}(\mathrm{co}(H))$, where $H$ is the class of thresholded linear classifiers.
Proof The claim will follow from the Borel-Cantelli Lemma, upon showing that the r.h.s. of (6) is summable over $m$. Observe that $\mathrm{sgn}(\mathrm{co}_T(H))$ is equivalent to a two-layer feedforward neural network with sharp threshold activation functions. From Theorem 6.1 in [3] we have that $\Pi_{\mathcal{F}_T}(2m) \le O\left(m^{T \log T}\right)$. Setting $T \ge T(m) = (8\rho/c_d J_0(m)) \log(3/\epsilon)$, where $J_0(m) \ge \Omega(\log^{2+\alpha} m/m)$, we find that
$$
P\left\{L(\hat{f}_{T,m}) \ge \epsilon\right\} \le O\left(\exp\left\{\frac{m \log(1/\epsilon)}{\log^\alpha m} - \frac{m\epsilon^2}{288}\right\}\right).
$$
For any fixed $\epsilon > 0$, the second term in the curly brackets dominates the first, and thus the sequence is summable, establishing the claim.

Finally, we provide rates of convergence for the approximation error.

Lemma 3.2 Let the target function be given by $g(x) = \mathrm{sgn}(g_0(x))$, where $g_0$ is contained in $\mathrm{co}(H)$. Then $\mathrm{opt}_\mu(\mathcal{F}_T) \le E_\mu \exp(-T g_0(x))$, where $\mathcal{F}_T = \mathrm{co}_T(H)$.

Proof sketch The proof proceeds by a standard probabilistic method argument. First, since $g_0 \in \mathrm{co}(H)$ there are $\alpha_i$ such that $g_0(x) = \sum_i \alpha_i h_i(x)$, where the non-negative coefficients $\alpha_i$ sum to one. We then randomly select a sequence of functions $h_1, h_2, \ldots, h_T$, drawing $h_i$ with probability $\alpha_i$, and form the random function $(1/T)\sum_{i=1}^T h_i(x)$. The proof is then concluded using Hoeffding's inequality (a sketch of this step is given at the end of this section).

Observe that if $E_\mu \exp(-T g_0(x)) \le e^{-aT}$ we obtain an exponential rate of approximation. Related results for the $L_2$ loss (e.g., [4]) lead to an approximation rate of order $O(1/T)$. The faster rate here is due to the need to approximate only the sign of the target function, rather than the function itself.

Remark 1 The assumption that $g_0 \in \mathrm{co}(H)$ may seem somewhat stringent. However, if the distribution $\mu$ is such that $P_\mu[|g_0(x)| < \epsilon]$ is small for small $\epsilon$, we can extend Lemma 3.2 to the case where $\mathcal{G}$ is the space of continuous functions. This extension is deferred to the full paper.

We conclude this section by noting that the results above may be somewhat strengthened by requiring that the bound $\hat{J}(S, P) \ge J_0(m)$ hold only with high probability for sufficiently large $T$ and $m$. The proof is essentially unchanged, except for some minor technicalities. Furthermore, using recent results from [11] it is possible to show, with some effort, that $J_0(m) \ge \Omega(\log^{1+\alpha} m/m)$ suffices.
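For the Hoeffding step referenced in the proof sketch of Lemma 3.2, a minimal version (our own reconstruction, assuming each weak hypothesis takes values in $\{-1, +1\}$) reads as follows. Fix $x$ and suppose, without loss of generality, that $g_0(x) > 0$; draw $h_1, \ldots, h_T$ i.i.d., selecting $h_i$ with probability $\alpha_i$, so that $E[h_t(x)] = g_0(x)$. Since each summand lies in $[-1, 1]$, Hoeffding's inequality gives
$$
P_{h_1, \ldots, h_T}\left\{\frac{1}{T}\sum_{t=1}^T h_t(x) \le 0\right\} \le \exp\left(-\frac{T g_0(x)^2}{2}\right).
$$
Averaging over $x \sim \mu$ shows that the expected loss of the random convex combination is at most $E_\mu \exp(-T g_0(x)^2/2)$, so some member of $\mathcal{F}_T$ achieves at most this loss; this is the probabilistic-method step behind the exponential bound stated in Lemma 3.2.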
4 Conclusions
We have provided sufficient conditions for the consistency of boosting algorithms in the case where the weak learner is a linear classifier. We have shown that such algorithms, without any inherent regularization, are consistent if the input/output mapping is deterministic and the data is characterized by a nonzero gap between the oppositely labeled points. In addition to the assumptions about the data, we have had to introduce an assumption concerning the behavior of the boosting algorithm, namely that after a given number of boosting iterations, the weights $\{P_i\}$ assigned by the boosting algorithm are such that $\sum_i P_i^2 \ge \Omega(\log^{2+\alpha} m/m)$. Simulations (e.g., [9]) indicate that $\sum_i P_i^2$ tends to grow during the boosting process. Preliminary simulations also tend to support the above assumption concerning the dependence on $m$, although they are not yet conclusive. An interesting open problem at this point is the removal of the logarithmic factor in this bound. In fact, the condition on $\sum_i P_i^2$ is connected to the rate of decay of the empirical error. An interesting
question is whether boosting algorithms with faster decay rates of the empirical error can be found. In future work we intend to consider regularized forms of boosting, which are expected to work more effectively under noisy situations, where the unregularized algorithm is known to overfit. Finally, we note that our focus here has been on statistical rather than computational issues.
References

[1] R. Alexander. Geometric methods in the study of irregularities of distribution. Combinatorica, 10(2):115–136, 1990.
[2] R. Alexander. The effect of dimension on certain geometric problems of irregularities of distribution. Pacific J. Math., 165(1):1–15, 1994.
[3] M. Anthony and P.L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
[4] A.R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inf. Theory, 39:930–945, 1993.
[5] N. Cristianini, A. Elisseeff, and J. Shawe-Taylor. On optimizing kernel alignment. Technical Report 2001-087, Royal Holloway University of London, 2001.
[6] A. Farago and G. Lugosi. Strong universal consistency of neural network classifiers. IEEE Trans. Inf. Theory, 39(4):1146–1151, 1993.
[7] Y. Freund and R.E. Schapire. Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning, pages 148–156, 1996.
[8] W. Jiang. Some theoretical aspects of boosting in the presence of noisy data. In Proceedings of the Eighteenth International Conference on Machine Learning, 2001.
[9] S. Mannor and R. Meir. Geometric bounds for generalization in boosting. In Proceedings of the Fourteenth Annual Conference on Computational Learning Theory, 2001.
[10] S. Mannor and R. Meir. On the existence of weak learners and applications to boosting. Machine Learning, 2001. To appear.
[11] S. Mendelson. Rademacher averages and phase transitions in Glivenko-Cantelli classes. IEEE Trans. Inf. Theory, to appear, 2001.
[12] G. Rätsch, T. Onoda, and K.R. Müller. Soft margins for AdaBoost. Machine Learning, 42(3):287–320, 2001.
[13] R.E. Schapire, Y. Freund, P. Bartlett, and W.S. Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651–1686, 1998.
[14] R.E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297–336, 1999.
[15] V.N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York, 1982.