New Trends in Psychometrics


Consistency of Penalized Risk of Boosting Methods in Binary Classification

Kenichi Hayashi, Yasutaka Shimizu, and Yutaka Kano

Graduate School of Engineering Science, Osaka University, 1-6 Machikaneyamacho, Toyonaka-shi, Osaka 560-8531, Japan

Abstract

In this paper, we consider the boosting method with a penalized risk functional as a regularization method. Risk consistency is established under conditions on the penalizing parameter that controls the degree of the penalty. The condition required to prevent overlearning is simple: the parameter converges to zero as the sample size goes to infinity. It can also be seen that the penalizing parameter can be changed adaptively at each boosting step.

1. Introduction

Is it possible to construct a powerful classifier by combining poorly performing learners? A positive answer to this question is given by the boosting method (Schapire, 1990), which was proposed in the machine learning community. The boosting method is an algorithm that combines weak learners sequentially; the only requirement on the weak learners is that they be slightly better than random guessing. AdaBoost (Freund and Schapire, 1997) is a specific rule for determining the weights of the weak learners in the boosting method. AdaBoost has received much attention for the simplicity of its algorithm, its ease of implementation, and its superior empirical performance. In fact, classifiers constructed by AdaBoost with tree-type weak learners have been called "the most accurate off-the-shelf classifiers" (Breiman, 1998).

Boosting methods have also drawn attention from the statistical community, and their properties have been studied from the statistical viewpoint. Friedman et al. (2000) showed that the AdaBoost algorithm corresponds to fitting an additive logistic regression model via Newton-like updates for exponential loss minimization. Some theoretical results indicate that boosting methods are valid optimization procedures for a certain distance-minimization problem in a functional space. Lafferty (1999) showed that some boosting algorithms can be unified by introducing Bregman distances, and defined a class of additive models via the Legendre transform.

When we repeat the boosting step many times, the prediction performance of the constructed classifier can be poor on future data. This phenomenon is called overlearning or overfitting, and it raises the problem of regularizing the boosting algorithm.
Breiman (2000) showed that the classifier constructed by the AdaBoost algorithm based on infinitely many data (the population version of AdaBoost) converges to the Bayes risk, the lowest attainable risk, as the number of iterations increases, with tree-type weak learners. We call this property the risk consistency. Although the risk consistency gives a theoretical basis for the boosting method, it contributes little to understanding overlearning or to obtaining practical solutions. Friedman (2001) observed that the boosting algorithm with a small step size is empirically helpful for constructing a good classifier. Subsequent studies supported his insight theoretically: it is necessary to impose some restrictions on the optimization procedure to avoid overlearning. In fact, Lugosi and Vayatis (2004) established the risk consistency by restricting the sum of the coefficients of the weak learners in the set of convex combinations of weak learners. Zhang and Yu (2005) proposed a specific stopping rule, which determines the number of boosting steps depending on the sample size, and established the risk consistency; their results apply to general loss functions, and they also obtained the rate of convergence. In this paper, we prove the risk consistency of the penalized risk functional. While previous regularization methods can be regarded as optimization under constraints (on the sum of the coefficients, or via a stopping rule), the penalized risk minimization we analyse is optimization of a modified objective function.

© 2008 The Organizing Committee of the International Meeting of the Psychometrics Society, pp. 87–96

2. Boosting Algorithm

2.1. Setup and notation

Let (Ω, F, P) be a probability space, let 𝒳 be a subspace of ℝ^d, and let 𝒴 = {−1, +1}. We denote by X an 𝒳-valued random vector and by Y a 𝒴-valued random variable. Moreover, we denote by D the joint distribution of (X, Y): for any A ⊂ 𝒳 and B ⊂ 𝒴,

D(A × B) = P[(X, Y) ∈ A × B].

Let z_n = {(X_i, Y_i)}_{i=1}^n be independent and identically distributed training samples from the unknown distribution D. Let C be a set of ℝ-valued functions on ℝ^d, the family of weak learners, and let H be the linear span of C, the functional space

H = { F(x; θ) = Σ_{i=1}^m θ_i g_i(x) | g_i ∈ C, θ_i ∈ ℝ, i = 1, …, m, m ∈ ℕ }.
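To fix ideas, an element of the span H can be sketched in code. The sketch below uses decision stumps (the class C_ds discussed later in the paper) as an illustrative weak-learner family taking values in {−1, +1}; all function names are ad hoc, not from the paper.

```python
import numpy as np

# Illustrative weak learners: decision stumps g_{j,a}(x) = 2*I[x_j > a] - 1,
# which take values in {-1, +1}.
def make_stump(j, a):
    return lambda x: 2.0 * (x[:, j] > a) - 1.0

def F(x, learners, theta):
    """An element of the span H: F(x; theta) = sum_i theta_i g_i(x)."""
    return sum(th * g(x) for th, g in zip(theta, learners))

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))
learners = [make_stump(0, 0.0), make_stump(1, -0.5)]
theta = [0.8, -0.3]
margins = F(x, learners, theta)
labels = np.sign(margins)   # the predicted label is sign(F(X; theta))
print(labels)
```

Each margin here lies in {±1.1, ±0.5}, so the sign is always well defined for this toy combination.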

In this paper, we assume that the weak learners g ∈ C take values in {−1, +1}. The boosting procedure constructs a classifier F ∈ H, and we predict the label of X by sign(F(X; θ)).

Let φ : ℝ → ℝ₊ be a loss function, which we use to evaluate the performance of a classifier F ∈ H at each data point. Define the expected risk functional and the empirical risk functional, respectively, as follows: for each F ∈ H,

Q(F) = E[φ(Y F(X))],  Q̂(F) = (1/n) Σ_{i=1}^n φ(Y_i F(X_i)).

Since we consider only the binary classification problem in this paper, it suffices to measure the loss through Y F(X). In the multiclass problem we would have to consider more general loss functions such as φ(Y, F(X)); in binary classification, however, the sign of Y_i F(X_i) for i = 1, …, n tells us whether the prediction at X_i agrees with Y_i: when Y_i F(X_i) is positive, F predicts Y_i correctly, and when it is negative, F predicts Y_i incorrectly.

We impose the following assumptions on the loss function φ.

Assumption 1.
(L1) φ is differentiable, strictly decreasing, and strictly convex. Moreover, φ(0) = 1 and lim_{x→∞} φ(x) = 0;
(L2) φ is locally Lipschitz with constant γ_φ(β) on [−β, β]: |φ(a₁) − φ(a₂)| ≤ γ_φ(β)|a₁ − a₂| for all a₁, a₂ ∈ ℝ with |a₁|, |a₂| ≤ β.

Assumption (L1) implies that the loss for a correct classification is smaller than the loss for a misclassification with margin of the same absolute value: φ(−a) > 1 > φ(a) for any a > 0. The role of (L2) is shown in Section 4. The following loss functions satisfy Assumption 1:

• exponential loss: φ(a) = exp(−a), with γ_φ(a) ≤ exp(a);
• logistic loss: φ(a) = log(1 + exp(−a)), with γ_φ(a) ≤ 1.

We now give a general form of the boosting algorithm.

Algorithm 1 (Boosting algorithm).
1. Set F̂₁ ≡ 0 as the initial classifier.
2. For t = 1, …, T:
(a) choose a weak learner f̂_t ∈ C and a coefficient θ̂_t ∈ ℝ such that Q̂(F̂_t + θ̂_t f̂_t) ≤ Q̂(F̂_t);
(b) update the classifier: F̂_{t+1} = F̂_t + θ̂_t f̂_t.
3. Output the final classification function

F̂_T(x) = Σ_{t=1}^T θ̂_t f̂_t(x)

and determine the label by its sign for all x ∈ 𝒳.

The specific computation in step (a) of Algorithm 1 depends on the loss function φ; the boosting method with φ(a) = exp(−a) is called AdaBoost. When the error probability of f̂_t is nearly zero, θ̂_t becomes large; when it is close to that of random guessing (nearly 1/2), θ̂_t is close to zero. These tendencies are common to most loss functions. Our analysis does not require any specification of the computational details of the boosting algorithm; all we need is the property that the risk decreases at every boosting step.

2.2. Regularization of boosting

In this section we introduce a regularization method via the theory of reproducing kernel Hilbert spaces (RKHS). At first sight there seems to be no relation between boosting methods and kernel machines (e.g. the support vector machine). However, the space of combined weak learners is very close to one expressed by a kernel function. Kawakita et al. (2006) showed that the boosting method can be seen as a procedure that constructs classifiers in an RKHS. They proposed reproducing kernels based on the weak learners conventionally used in boosting and, moreover, a computationally effective reproducing kernel under assumptions that are natural in applications. Define the kernel function for the class C as

K_C(x, x′) = Σ_{g∈C} π_g g(x) g(x′) for all x, x′ ∈ 𝒳,

where

Σ_{g∈C} π_g = 1, 0 ≤ π_g ≤ 1 for all g ∈ C.
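A quick numerical sanity check of this construction is possible: on any finite sample, the Gram matrix of K_C must be symmetric and positive semidefinite. The sketch below uses an arbitrary finite stump dictionary standing in for C, with uniform π_g; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))

# A finite sub-family of C: decision stumps on each coordinate, with the
# data points themselves as thresholds (an illustrative choice).
learners = [
    (lambda x, j=j, a=a: 2.0 * (x[:, j] > a) - 1.0)
    for j in range(X.shape[1])
    for a in X[:, j]
]
pi = np.full(len(learners), 1.0 / len(learners))   # uniform pi_g, sums to 1

# Gram matrix of K_C(x, x') = sum_g pi_g g(x) g(x') on the sample.
G = np.column_stack([g(X) for g in learners])      # n x |C| matrix of values g(X_i)
K = (G * pi) @ G.T

# K is a valid reproducing kernel: symmetric, PSD, and K_C(x, x) = sum_g pi_g = 1.
print("symmetric:", np.allclose(K, K.T))
print("min eigenvalue:", np.linalg.eigvalsh(K).min())
```

Since each g takes values ±1, the diagonal entries equal Σ_g π_g = 1, and the minimum eigenvalue is nonnegative up to floating-point round-off.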

The RKHS induced by K_C is expressed as

H_{K_C} = { F(x; θ) = Σ_{g∈C} θ_g g(x) | g ∈ C, θ_g ∈ ℝ, ‖F‖_{H_{K_C}} < ∞ },

equipped with the inner product

⟨F(·; θ), G(·; θ′)⟩_{H_{K_C}} = Σ_{g∈C} (1/π_g) θ_g θ′_g for all F(·; θ), G(·; θ′) ∈ H_{K_C}.

The space H_{K_C} corresponds to the set of all classifiers constructed from weak learners in C. In the sequel we simply denote H_{K_C} by H and write the norm of F as ‖F‖²_H = ⟨F(·; θ), F(·; θ)⟩_H. The reproducing property is easily checked:

⟨F(·; θ), K_C(·, x)⟩_H = ⟨Σ_{g∈C} θ_g g(·), Σ_{g∈C} π_g g(·) g(x)⟩_H = Σ_{g∈C} (1/π_g) θ_g · π_g g(x) = F(x; θ).

Consider the penalized empirical risk functional of F ∈ H:

Q̂_λ(F) = Q̂(F) + λ‖F‖²_H, λ > 0,   (1)

where λ is a given constant. From the viewpoint of kernel machine learning, we can find the minimizer of Q̂_λ(·) by the representer theorem (Schölkopf and Smola, 2001). This theorem says that for given training data z_n ∈ (𝒳 × 𝒴)^n there is a unique minimizer of (1) in the span of {K(·, X₁), …, K(·, X_n)}; that is, the minimizer is of the form

F̂_n(x) = Σ_{ℓ=1}^n α_ℓ K(X_ℓ, x), α_ℓ ∈ ℝ, ℓ = 1, …, n.   (2)

Kawakita et al. (2006) proposed a penalized risk boosting algorithm that optimizes the coefficients α_ℓ sequentially. According to their experiments, minimizing Q̂_λ(·) yields a more stable classifier than minimizing Q̂(·).
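As a rough sketch of minimizing (1) over the representer form (2), one can run plain gradient descent on the coefficients α_ℓ. This is not the sequential algorithm of Kawakita et al. (2006); the Gaussian kernel, the logistic-type loss, and all step sizes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
X = rng.normal(size=(n, 1))
y = np.sign(X[:, 0])

# A stand-in reproducing kernel (Gaussian); the kernel K_C of the paper
# could be substituted here without changing the structure.
K = np.exp(-0.5 * (X - X.T) ** 2)

phi = lambda m: np.log1p(np.exp(-m))          # logistic-type loss
dphi = lambda m: -1.0 / (1.0 + np.exp(m))     # its derivative phi'(m)

def penalized_risk(alpha, lam):
    F = K @ alpha                              # F(X_i) = sum_l alpha_l K(X_l, X_i)
    return phi(y * F).mean() + lam * alpha @ K @ alpha   # ||F||_H^2 = alpha' K alpha

def fit(lam, steps=3000, lr=0.05):
    alpha = np.zeros(n)
    for _ in range(steps):
        F = K @ alpha
        grad = K @ (dphi(y * F) * y) / n + 2.0 * lam * (K @ alpha)
        alpha -= lr * grad
    return alpha

a_pen = fit(lam=0.1)
print("penalized risk:", penalized_risk(a_pen, 0.1))   # below the initial value phi(0) = log 2
print("||F||_H^2:", a_pen @ K @ a_pen)
```

Since the penalty term is λ αᵀKα = λ‖F‖²_H, a larger λ shrinks the RKHS norm of the fitted classifier, which is exactly the trade-off exploited in the consistency analysis below.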


In this paper, we consider the boosting algorithm for the penalized risk functional, adopting the norm on the RKHS as the penalty for analytical convenience. The optimization of Q̂_λ(·) is similar to the regularization method of Lugosi and Vayatis (2004): searching for the minimizer of the risk functional Q̂(·) (not Q̂_λ(·)) in the convex set of weak learners

{ F(x; θ) = Σ_{i=1}^m θ_i g_i(x) | g_i ∈ C, m ∈ ℕ, θ_i ≥ 0, i = 1, …, m, Σ_{i=1}^m θ_i = κ },

where κ > 0 is a given constant. The penalty corresponds to a constraint on the size of the θ's. However, the minimization with constraints is more difficult than that of (2). In this paper, we therefore adopt the penalized risk of Kawakita et al. (2006) as the regularization method.

3. Risk Consistency

In this section, we consider the classifier obtained by the boosting algorithm with the penalized risk (1). Our goal is to prove the risk consistency:

lim_{n→∞} lim_{t→∞} E[Q(F̂_{n,t})] = inf_{F∈H} Q(F),   (3)

where F̂_{n,t} is the classifier constructed at the t-th step, with sample size n, by minimizing Q̂_λ(·). The order of the limits in (3) is important. Breiman (2000) studied the case where F̂_{n,t} is the minimizer of Q̂(·) and the limits are taken in the reverse order, t → ∞ after n → ∞. That is the ideal situation, since the underlying distribution D is fully specified once n → ∞. Zhang and Yu (2005) studied the case where t depends on n and F̂_{n,t} is the minimizer of Q̂(·); a critical problem there is how to determine the iteration number t for a given n. We, in contrast, take the limit t → ∞ without any constraint on t before letting n → ∞, which is a very practical asymptotic regime. The equality (3) does not hold for the estimator constructed by minimizing Q̂(·), because we can easily construct a classifier whose empirical risk is smaller than the Bayes risk, the lowest attainable risk under the true distribution D. When we repeat the boosting step many times, the empirical risk Q̂(F) approaches zero in most cases. The penalized risk Q̂_λ(·), however, has a positive lower bound because of the trade-off between the value of the risk and the norm of the classifier.

In this paper, we consider the following boosting algorithm.

Algorithm 2.
1. Set F̂_{n,1} ≡ 0 as the initial classifier.
2. For t = 1, …, T, update the classifier as F̂_{n,t+1} = F̂_{n,t} + θ̂_t ĝ_t, where ĝ_t ∈ C and θ̂_t ∈ ℝ are chosen so that

Q̂_{λ_{n,t+1}}(F̂_{n,t+1}) ≤ Q̂_{λ_{n,t}}(F̂_{n,t}),

where {λ_{n,t}} is a positive sequence.


3. Output the final classification function

F̂_T(x) = Σ_{t=1}^T θ̂_t ĝ_t(x)

and determine the label by its sign for all x ∈ 𝒳.

We impose the following assumptions on the penalizing parameter.

Assumption 2.
(P1) For each n, a positive limit λ_{n,∞} := lim_{t→∞} λ_{n,t} exists, and lim_{n→∞} lim_{t→∞} λ_{n,t} = 0;
(P2) for each n ∈ ℕ, there exists a positive constant M_n > 0 such that lim_{t→∞} λ_{n,t} E[‖F̂_{n,t}‖²_H] < M_n;
(P3) lim_{n→∞} M_n = 0.
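To make the role of the penalizing sequence concrete, here is a toy run of Algorithm 2 over a finite stump dictionary with uniform π_g. The schedule λ_{n,t} = 0.1 n^{−1/2}(1 + 1/t) is an illustrative choice with a positive limit in t that vanishes as n → ∞, as (P1) requires; every other constant below is ad hoc.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 80
X = rng.normal(size=(n, 2))
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=n))   # noisy labels

# Finite stump dictionary standing in for C, with uniform pi_g = 1/m.
thresholds = np.quantile(X, np.linspace(0.1, 0.9, 9), axis=0)
dictionary = [(j, a) for j in range(2) for a in thresholds[:, j]]
m = len(dictionary)

def g(k, Z):
    j, a = dictionary[k]
    return 2.0 * (Z[:, j] > a) - 1.0

phi = lambda v: np.exp(-v)                         # exponential loss
lam = lambda t: 0.1 * n ** -0.5 * (1 + 1.0 / t)    # lambda_{n,t}: positive limit in t

# ||F||_H^2 = sum_g theta_g^2 / pi_g = m * sum(theta^2) under uniform pi.
risk = lambda Ft, sq, l: phi(y * Ft).mean() + l * m * sq

theta, F = np.zeros(m), np.zeros(n)
grid = np.linspace(-1.0, 1.0, 41)
obj = risk(F, 0.0, lam(1))                         # initial value phi(0) = 1
for t in range(1, 31):
    best = None
    for k in range(m):
        h = g(k, X)
        for th in grid:
            sq = (theta ** 2).sum() - theta[k] ** 2 + (theta[k] + th) ** 2
            cand = risk(F + th * h, sq, lam(t + 1))
            if cand < obj:                         # step 2 of Algorithm 2
                obj, best = cand, (k, th)
    if best is None:                               # no pair decreases the penalized risk
        break
    k, th = best
    theta[k] += th
    F += th * g(k, X)

print("final penalized risk:", obj)                # strictly below the initial value 1
print("||F||_H^2:", m * (theta ** 2).sum())
```

Because the penalty is bounded by the (decreasing) objective, the RKHS norm of the output stays bounded for each n, which is the content of Remark 1 below in spirit.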

This seems to be a reasonable assumption, since we have enough information about the unknown distribution D when n is large. The assumption (P2) controls the size of the penalty in the boosting procedure for a given n. Moreover, we impose two additional conditions on Algorithm 2. Define Q_λ(F) = Q(F) + λ‖F‖²_H.

Assumption 3. There exists a positive sequence {ε_t}_{t=1}^∞ with Σ_{t=1}^∞ ε_t < ∞ such that
(A1) Q_{λ_{n,t+1}}(F̂_{n,t+1}) ≤ Q_{λ_{n,t}}(F̂_{n,t}) + ε_t for all t ∈ ℕ;
(A2) lim_{t→∞} Q̂_{λ_{n,t}}(F̂_{n,t}) = inf_{F∈H} Q̂_{λ_{n,∞}}(F) and sup_n inf_{F∈H} Q̂_{λ_{n,∞}}(F) < ∞.

Remark 1. Since λ_{n,∞} > 0 by assumption (P1), we have

lim_{t→∞} ‖F̂_{n,t}‖_H < ∞ for each n ∈ ℕ.

Indeed, if lim_{t→∞} ‖F̂_{n,t}‖_H = ∞, then lim_{t→∞} Q̂_{λ_{n,t}}(F̂_{n,t}) = ∞, which contradicts (A2).

As a measure of the complexity of C, we use the expected Rademacher complexity

R_n(C) = E[sup_{g∈C} (1/n) Σ_{i=1}^n σ_i g(X_i)],

where the σ_i are independent random variables with P[σ_i = +1] = P[σ_i = −1] = 1/2 (i = 1, …, n). The next theorem is our main result.


Theorem 1. Suppose that Algorithm 2 satisfies Assumptions 1–3. Let {β_n}_{n=1}^∞ be a positive sequence such that γ_φ(β_n)β_n R_n(C) → 0 as n → ∞, and suppose that the sequence {M_n}_{n=1}^∞ in assumption (P2) satisfies M_n β_n^{−2} λ_{n,∞}^{−1} → 0 as n → ∞. Then the risk consistency holds:

lim_{n→∞} lim_{t→∞} E[Q(F̂_{n,t})] = inf_{F∈H} Q(F).

It may be useful to explore general conditions under which γ_φ(β_n)β_n R_n(C) converges to zero as n → ∞. When C is a set of very complex functions whose Rademacher complexity R_n(C) does not converge to zero as n → ∞, such a C may fail to satisfy the condition γ_φ(β_n)β_n R_n(C) → 0. If the VC dimension of C is finite, then R_n(C) → 0 as n → ∞; for details, see Vapnik (2000). The weak learners in common use satisfy this condition. For example, consider the class of decision stumps

C_ds = {±f_j(x, a) | a ∈ ℝ, j = 1, …, d},

where f_j(x, a) = 2I[x_j > a] − 1 and x = (x₁, …, x_d). It is easy to see that R_n(C_ds) → 0 as n → ∞. A choice of {β_n}_{n=1}^∞ for C_ds is given in Zhang and Yu (2005).
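The decay R_n(C_ds) → 0 can be checked by Monte Carlo. The sketch below (all names are mine) uses the fact that, for fixed σ, the supremum over thresholds a is attained at one of the n + 1 cuts of the sorted sample, and the ± sign in C_ds turns the supremum into an absolute value.

```python
import numpy as np

def rademacher_stumps(n, d=1, reps=200, seed=0):
    """Monte Carlo estimate of the expected Rademacher complexity R_n(C_ds)."""
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(reps):
        X = rng.normal(size=(n, d))
        sig = rng.choice([-1.0, 1.0], size=n)
        best = 0.0
        for j in range(d):
            s = sig[np.argsort(X[:, j])]
            # For a cut after the first k sorted points, sum_i sig_i f_j(X_i, a)
            # equals total - 2 * (sum of the first k sig's); the +/- sign in C_ds
            # makes the supremum the maximum absolute value over all n + 1 cuts.
            left = np.concatenate(([0.0], np.cumsum(s)))
            best = max(best, np.abs(s.sum() - 2.0 * left).max() / n)
        vals.append(best)
    return float(np.mean(vals))

for n in (50, 200, 800):
    print(n, round(rademacher_stumps(n), 3))   # the estimates shrink as n grows
```

The maximum over cuts is the range of a ±1 random walk divided by n, so the estimates decay at roughly the n^{−1/2} rate, consistent with a finite VC dimension.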

4. Proof

First, we prepare two lemmas. We introduce the "1-norm" as another measure of the complexity of a combined classifier F ∈ H:

‖F‖₁ = inf { Σ_{g∈C} |θ_g| | F(x; θ) = Σ_{g∈C} θ_g g(x) }.

Remark 2. If F ∈ H has two representations,

F = Σ_{g∈C} θ₁^g g = Σ_{g∈C} θ₂^g g,

then θ₁^g = θ₂^g for all g ∈ C by the properties of the RKHS (H, ‖·‖_H). This is one of the advantages of working in an RKHS. As a result, the norm ‖F‖₁ is well defined.

Remark 3. For all F ∈ H, ‖F‖₁ ≤ ‖F‖_H. Indeed, Schwarz's inequality yields

‖F‖₁ ≤ (Σ_{g∈C} (θ^g)²/π_g)^{1/2} (Σ_{g∈C} π_g)^{1/2} = ‖F‖_H.

Lemma 1 (Zhang and Yu, 2005). Under Assumption 1,

E[sup_{‖F‖₁≤β} {Q(F) − Q̂(F)}] ≤ 2γ_φ(β)βR_n(C).


Proof. See Lemma 4.3 in Zhang and Yu (2005). □

Lemma 2. Suppose that Algorithm 2 satisfies Assumptions 1–3. Then, for any n ∈ ℕ,

lim_{t→∞} E[Q(F̂_{n,t}) − Q̂(F̂_{n,t})] = E[lim_{t→∞} (Q(F̂_{n,t}) − Q̂(F̂_{n,t}))].

Proof. Notice that

E[Q(F̂_{n,t}) − Q̂(F̂_{n,t})] = E[Q_{λ_{n,t}}(F̂_{n,t}) − Q̂_{λ_{n,t}}(F̂_{n,t})]

by the definitions of Q_{λ_{n,t}}(·) and Q̂_{λ_{n,t}}(·); hence it suffices to show that

lim_{t→∞} E[Q_{λ_{n,t}}(F̂_{n,t}) − Q̂_{λ_{n,t}}(F̂_{n,t})] = E[lim_{t→∞} (Q_{λ_{n,t}}(F̂_{n,t}) − Q̂_{λ_{n,t}}(F̂_{n,t}))].

By step 2 of Algorithm 2, the sequence {Q̂_{λ_{n,t}}(F̂_{n,t})}_{t∈ℕ} is decreasing for an arbitrarily given n ∈ ℕ. Thus we obtain

lim_{t→∞} E[Q̂_{λ_{n,t}}(F̂_{n,t})] = E[lim_{t→∞} Q̂_{λ_{n,t}}(F̂_{n,t})] =: E[Q̂_{λ_{n,∞}}(F̂_{n,∞})]   (4)

by the monotone convergence theorem, since Q̂_{λ_{n,t}}(F̂_{n,t}) is bounded above almost surely; the last limit exists by condition (A2) of Assumption 3.

By (A1) of Assumption 3, we have

Q_{λ_{n,t+1}}(F̂_{n,t+1}) ≤ Q_{λ_{n,t}}(F̂_{n,t}) + ε_t ≤ Q_{λ_{n,1}}(F̂_{n,1}) + Σ_{j=1}^t ε_j.

Let R_t = Q_{λ_{n,t}}(F̂_{n,t}) − Σ_{j=1}^{t−1} ε_j for t = 2, 3, …, and R₁ = 1 (note that Q_{λ_{n,1}}(F̂_{n,1}) = φ(0) = 1, since F̂_{n,1} ≡ 0). It follows that

R_{t+1} − R_t = Q_{λ_{n,t+1}}(F̂_{n,t+1}) − Q_{λ_{n,t}}(F̂_{n,t}) − ε_t ≤ 0

by (A1). Hence the sequence {R_t}_{t=1}^∞ is decreasing and uniformly bounded above. Therefore the monotone convergence theorem yields

lim_{t→∞} E[R_t] = E[lim_{t→∞} R_t] = E[Q_{λ_{n,∞}}(F̂_{n,∞})] − Σ_{j=1}^∞ ε_j,

since Σ_{j=1}^∞ ε_j < ∞. Thus we have

lim_{t→∞} E[Q_{λ_{n,t}}(F̂_{n,t})] = E[lim_{t→∞} Q_{λ_{n,t}}(F̂_{n,t})].   (5)

The equalities (4) and (5) yield the conclusion of the lemma. □


Proof of Theorem 1. Note that

E[Q(F̂_{n,t}) − Q(F)] = E[Q(F̂_{n,t}) − Q̂(F̂_{n,t})] + E[Q̂(F̂_{n,t}) − Q̂(F)] + E[Q̂(F) − Q(F)]
 = E[Q(F̂_{n,t}) − Q̂(F̂_{n,t})] + E[Q̂(F̂_{n,t}) − Q̂(F)],   (6)

since E[Q̂(F)] = Q(F) for each fixed F. For the first term on the right-hand side of (6), Lemma 2 yields

lim_{t→∞} E[Q(F̂_{n,t}) − Q̂(F̂_{n,t})] = E[Q(F̂_{n,∞}) − Q̂(F̂_{n,∞})].   (7)

Since

Q_{λ_{n,∞}}(F̂_{n,∞}) ≤ Σ_{t=1}^∞ ε_t + Q_{λ_{n,1}}(F̂_{n,1}) = Σ_{t=1}^∞ ε_t + 1 < ∞

from (A1) of Assumption 3, and

Q̂_{λ_{n,∞}}(F̂_{n,∞}) ≤ Q̂_{λ_{n,1}}(F̂_{n,1}) = 1

from Algorithm 2, we have

sup_{z_n} (Q_{λ_{n,∞}}(F̂_{n,∞}) − Q̂_{λ_{n,∞}}(F̂_{n,∞})) < ∞.   (8)

For the right-hand side of (7), we have

E[Q(F̂_{n,∞}) − Q̂(F̂_{n,∞})]
 = E[(Q(F̂_{n,∞}) − Q̂(F̂_{n,∞})) I[‖F̂_{n,∞}‖₁ ≤ β_n]] + E[(Q(F̂_{n,∞}) − Q̂(F̂_{n,∞})) I[‖F̂_{n,∞}‖₁ > β_n]]
 ≤ E[sup_{‖F‖₁≤β_n} (Q(F) − Q̂(F))] + E[(Q(F̂_{n,∞}) − Q̂(F̂_{n,∞})) I[‖F̂_{n,∞}‖₁ > β_n]]
 ≤ 2γ_φ(β_n)β_n R_n(C) + K P[‖F̂_{n,∞}‖_H > β_n],   (9)

where K is a positive constant. In the last inequality, the bound on the first term follows from Lemma 1, and that on the second term from (8) and Remark 3. For the second term on the right-hand side of (6), it follows that

E[Q̂(F̂_{n,t}) − Q̂(F)] = E[Q̂_{λ_{n,t}}(F̂_{n,t}) − Q̂_{λ_{n,t}}(F)] − λ_{n,t} E[‖F̂_{n,t}‖²_H − ‖F‖²_H].


Then we have

lim_{t→∞} E[Q̂(F̂_{n,t}) − Q̂(F)] = E[Q̂_{λ_{n,∞}}(F̂_{n,∞}) − Q̂_{λ_{n,∞}}(F)] − M_n + λ_{n,∞}‖F‖²_H
 ≤ −M_n + λ_{n,∞}‖F‖²_H.   (10)

In the equality we used (4) from the proof of Lemma 2 together with (P2) and (P3) of Assumption 2; in the inequality we used (A2) of Assumption 3. The inequalities (6)–(10) yield

lim_{t→∞} E[Q(F̂_{n,t}) − Q(F)] ≤ 2γ_φ(β_n)β_n R_n(C) + K P[‖F̂_{n,∞}‖_H > β_n] − M_n + λ_{n,∞}‖F‖²_H.

The right-hand side of the last inequality converges to zero as n → ∞. Indeed, for the second term, Chebyshev's inequality gives

P[‖F̂_{n,∞}‖_H > β_n] ≤ E[‖F̂_{n,∞}‖²_H]/β_n² ≤ M_n/(β_n² λ_{n,∞}) → 0 as n → ∞,

and the remaining terms clearly converge to zero by the assumptions. Therefore we obtain

inf_{F∈H} Q(F) ≤ lim_{n→∞} E[Q(F̂_{n,∞})] ≤ Q(F)

for any F ∈ H. Taking the infimum over F ∈ H, we obtain the desired result. □

References

Breiman, L. (1998). Arcing classifiers. Annals of Statistics, 26, 801–824.
Breiman, L. (2000). Some infinite theory for predictor ensembles. Technical Report 577, Statistics Department, UC Berkeley.
Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55, 119–139.
Friedman, J. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29, 1189–1232.
Friedman, J., Hastie, T. and Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28, 337–407.
Kawakita, M., Ikeda, S. and Eguchi, S. (2006). A bridge between boosting and a kernel machine. Research Memorandum No. 1006, The Institute of Statistical Mathematics.
Lafferty, J. (1999). Additive models, boosting, and inference for generalized divergences. Proceedings of the Twelfth Annual Conference on Computational Learning Theory, 125–133.
Lugosi, G. and Vayatis, N. (2004). On the Bayes-risk consistency of regularized boosting methods. Annals of Statistics, 32, 30–55.
Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5, 197–227.
Schölkopf, B. and Smola, A. J. (2001). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
Vapnik, V. N. (2000). The Nature of Statistical Learning Theory, 2nd ed. Springer, New York.
Zhang, T. and Yu, B. (2005). Boosting with early stopping: convergence and consistency. Annals of Statistics, 33, 1538–1579.
