The Most Robust Loss Function for Boosting

Takafumi Kanamori¹, Takashi Takenouchi², Shinto Eguchi³, and Noboru Murata⁴

¹ Department of Mathematical and Computing Sciences, Tokyo Institute of Technology, Ookayama 2-12-1, Meguro-ku, Tokyo 152-8552, Japan, [email protected]
² Institute of Statistical Mathematics, 4-6-7 Minami-Azabu, Minato-ku, Tokyo 106-8569, Japan, [email protected]
³ Institute of Statistical Mathematics, 4-6-7 Minami-Azabu, Minato-ku, Tokyo 106-8569, Japan, [email protected]
⁴ School of Science and Engineering, Waseda University, 3-4-1 Ohkubo, Shinjuku, Tokyo 169-8555, Japan, [email protected]

Abstract. A boosting algorithm can be understood as a gradient descent algorithm on a loss function. It is often pointed out that the typical boosting algorithm, Adaboost, is seriously affected by outliers. In this paper, loss functions for robust boosting are studied. Based on a concept from robust statistics, we propose a positive-part truncation of the loss function which makes the boosting algorithm robust against extreme outliers. Numerical experiments show that the proposed boosting algorithm is useful for highly noisy data in comparison with other competitors.

1 Introduction

We study a class of loss functions for boosting algorithms in binary classification problems. Boosting is a learning algorithm that constructs a predictor by combining what are called weak hypotheses. Adaboost [3] is a typical implementation of boosting and has been shown to be a powerful method from both theoretical and practical viewpoints. The boosting algorithm can be derived from the gradient method [4]. Through these studies, the boosting algorithm is viewed as an optimization process for a loss function. The relation between Adaboost and the maximum likelihood estimator was also clarified from the viewpoint of information geometry [6, 8]. Since the loss function plays an important role in statistical inference, the relation between the loss function and the prediction performance has been widely studied in the statistics and machine learning communities. In the last decade, several useful loss functions for classification problems have been proposed, for example the hinge loss for the support vector machine [11] and the exponential loss for Adaboost [4]. Since these typical loss functions are convex, highly developed optimization techniques are applicable to the global minimization of the loss function.

We study the robustness of boosting. Adaboost puts too much weight on solitary examples even when they are outliers that should not be learned; for example, outliers may occur when recording data. Some studies have revealed that outliers seriously degrade the generalization performance of Adaboost. Though several improvements to recover robustness have already been proposed, the theoretical understanding of them is still limited. We measure the influence of an outlier by the gross error sensitivity [5] and derive the loss function which minimizes the gross error sensitivity. Our main objective is to propose a boosting algorithm that is robust against outliers.

This paper is organized as follows. In Section 2, we briefly introduce boosting algorithms from the viewpoint of optimization of loss functions. In Section 3, we explain some concepts from robust statistics and derive robust loss functions. Numerical experiments are presented in Section 4. The last section is devoted to concluding remarks.

2 Boosting Algorithms

Several studies have clarified that the boosting algorithm can be derived from the gradient descent method for loss functions [4, 7]. The derivation is illustrated in this section. Suppose that a set of examples {(x_1, y_1), ..., (x_n, y_n)} is observed, where x_i is an element of the input space X and y_i takes the value 1 or −1 as the class label. We denote the set of weak hypotheses by H = {h_t(x) : X → {1, −1} | t = 1, ..., T}, where each hypothesis assigns a class label to a given input. Our aim is to construct a powerful predictor H(x) by combining weak hypotheses, where H(x) is given as a linear combination of weak hypotheses, that is,

H(x) = \sum_{t=1}^{T} \alpha_t h_t(x).

Using the predictor, we assign the class label corresponding to x as sign(H(x)), where sign(z) denotes the sign of z. Loss functions are often used in classification problems. The loss of the predictor H for a sample (x, y) is defined as l(−yH(x)), where l : R → R is twice continuously differentiable except at finitely many points. Typically, convex and increasing functions are used because they make the optimization tractable. Let us define the empirical loss as

L_{emp}(H) = \frac{1}{n} \sum_{i=1}^{n} l(-y_i H(x_i)).
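As a small illustration (not taken from the paper), the empirical loss can be evaluated in a few lines of Python; the array names and the default choice l(z) = exp(z) are assumptions made for this example.

import numpy as np

# Sketch: L_emp(H) = (1/n) * sum_i l(-y_i H(x_i)).
# `scores` holds the values H(x_i) and `labels` holds y_i in {-1, +1};
# the exponential loss is just one possible choice of l.
def empirical_loss(scores, labels, loss=np.exp):
    return loss(-labels * scores).mean()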

Minimizing the empirical loss yields the estimator of the predictor. Applying the gradient method to the empirical loss gives the following boosting algorithm.

Boosting Algorithm
Input: Examples {(x_1, y_1), ..., (x_n, y_n)} and the initial predictor H_0(x) ≡ 0.
Do for m = 1, ..., M:

Step 1. Set the weights as w(x_i, y_i) ∝ l'(−y_i H_{m−1}(x_i)), i = 1, ..., n, where l' denotes the derivative of l and the weights are normalized so that they sum to one.

Step 2. Find h_m ∈ H which minimizes the weighted error

\sum_{i=1}^{n} w(x_i, y_i)\, I(y_i \neq h(x_i)).

Step 3. Find the coefficient α_m ∈ R which attains the minimum value of

\frac{1}{n} \sum_{i=1}^{n} l\bigl(-y_i (H_{m-1}(x_i) + \alpha h_m(x_i))\bigr).

Step 4. Update the predictor: H_m(x) ← H_{m−1}(x) + α_m h_m(x).

Output: The final predictor sign(H_M(x)).

Note that in the above algorithm the exponential loss l(z) = e^z yields Adaboost.
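To make the steps concrete, here is a minimal Python sketch of the generic algorithm above, using decision stumps as the weak hypotheses. The stump learner, the grid-based line search in Step 3, and all function names are illustrative assumptions rather than details taken from the paper.

import numpy as np

def fit_stump(X, y, w):
    """Step 2: weighted-error-minimizing decision stump h(x) in {-1, +1}."""
    d = X.shape[1]
    best = (np.inf, 0, -np.inf, 1)            # (error, feature, threshold, sign)
    for j in range(d):
        for thr in np.unique(X[:, j]):
            for s in (1, -1):
                pred = s * np.where(X[:, j] > thr, 1, -1)
                err = np.sum(w[pred != y])
                if err < best[0]:
                    best = (err, j, thr, s)
    _, j, thr, s = best
    return lambda Z: s * np.where(Z[:, j] > thr, 1, -1)

def boost(X, y, loss, dloss, M=50):
    """Gradient-descent boosting for a generic loss l (Steps 1-4)."""
    alpha_grid = np.linspace(0.0, 2.0, 201)   # coarse line search for Step 3
    hypotheses, coeffs = [], []
    H = np.zeros(len(y))                      # H_0(x) = 0
    for m in range(M):
        # Step 1: weights proportional to l'(-y_i H_{m-1}(x_i)), normalized.
        w = dloss(-y * H)
        w = w / w.sum()
        # Step 2: weak hypothesis minimizing the weighted error.
        h = fit_stump(X, y, w)
        hx = h(X)
        # Step 3: coefficient alpha_m minimizing the empirical loss.
        losses = [loss(-y * (H + a * hx)).mean() for a in alpha_grid]
        alpha = alpha_grid[int(np.argmin(losses))]
        # Step 4: update the predictor.
        H = H + alpha * hx
        hypotheses.append(h)
        coeffs.append(alpha)
    # Output: the final predictor sign(H_M(x)).
    return lambda Z: np.sign(sum(a * h(Z) for a, h in zip(coeffs, hypotheses)))

# With the exponential loss l(z) = exp(z) this reproduces an Adaboost-like
# procedure, e.g.  predict = boost(X_train, y_train, loss=np.exp, dloss=np.exp).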

3 Robust Loss Functions

It is often pointed out that Adaboost is very sensitive to outliers. Some alternative boosting algorithms have been proposed to overcome this drawback [9]. In this section we derive robust loss functions from the viewpoint of robust statistics. We define P(y|x) as the conditional probability of the class label for a given input x. Suppose that the samples are independently and identically distributed. When the sample size goes to infinity, the empirical loss converges to the expected loss function

L(H) = \int_{\mathcal{X}} \mu(dx) \sum_{y=\pm 1} P(y|x)\, l(-yH(x)),

where µ denotes the probability measure on the input space X. Let H^*(x) be the minimizer of L(H). When the loss function l satisfies l'(z)/l'(−z) = ρ(z), where ρ is a monotone increasing function and l' denotes the derivative of l, the minimizer of L(H) is given by

H^*(x) = \rho^{-1}\!\left( \frac{P(1|x)}{P(-1|x)} \right).

This formula is derived from the variational condition for L(H). Even if l_1 and l_2 are different loss functions, the equality l_1'(z)/l_1'(−z) = l_2'(z)/l_2'(−z) can hold. For example, all of the following loss functions,

l_{ada}(z) = e^z, \qquad l_{logit}(z) = \log(1 + e^{2z}), \qquad l_{mada}(z) = \begin{cases} z & z \geq 0 \\ \tfrac{1}{2} e^{2z} - \tfrac{1}{2} & z < 0, \end{cases}

satisfy l'(z)/l'(−z) = e^{2z} and therefore share the same minimizer H^*(x) = (1/2) log(P(1|x)/P(−1|x)).
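To spell out the variational argument behind the formula for H^*(x), a short sketch under the differentiability assumptions above is as follows. For each fixed x, it suffices to minimize the integrand pointwise,

\phi_x(H) = P(1|x)\, l(-H) + P(-1|x)\, l(H).

Setting the derivative with respect to H to zero gives

-P(1|x)\, l'(-H) + P(-1|x)\, l'(H) = 0
\quad\Longleftrightarrow\quad
\frac{l'(H)}{l'(-H)} = \frac{P(1|x)}{P(-1|x)},

that is, ρ(H^*(x)) = P(1|x)/P(−1|x), and the monotonicity of ρ yields H^*(x) = ρ^{-1}(P(1|x)/P(−1|x)).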
