Statistical Methods & Applications manuscript No. (will be inserted by the editor)

On boosting kernel density methods for multivariate data: density estimation and classification

Marco Di Marzio¹, Charles C. Taylor²

¹ Dipartimento di Metodi Quantitativi e Teoria Economica, Università G. d'Annunzio, email address: [email protected]
² Department of Statistics, University of Leeds, email address: [email protected]

The date of receipt and acceptance will be inserted by the editor

Abstract Statistical learning is emerging as a promising field in which a number of algorithms from machine learning are interpreted as statistical methods and vice versa. Due to its good practical performance, boosting is one of the most studied machine learning techniques. We propose algorithms for multivariate density estimation and classification. They are generated by using the traditional kernel techniques as weak learners in boosting algorithms. Our algorithms take the form of multistep estimators, whose first step is a standard kernel method. Some strategies for bandwidth selection are also discussed, with regard both to the standard kernel density classification problem and to our 'boosted' kernel methods. Extensive experiments, using real and simulated data, show an encouraging practical relevance of the findings. Standard kernel methods are often outperformed by the first boosting iterations over a range of bandwidth values. In addition, the practical effectiveness of our classification algorithm is confirmed by a comparative study on two real datasets, the competitors being classification trees and AdaBoost with trees.

Key words Bandwidth Selection, Bias Reduction, Learning, Leave-One-Out Estimates, Simulation, Smoothing

1 Introduction

Machine learning is an established research area in computer science. It deals with data-driven function approximation techniques that, once trained on the basis of examples, are able to generalize to new data. This subject, in which estimation is prominent, has gained a large amount of attention from statisticians. Here 'training' (or 'learning') means to (iteratively) estimate a decision function that: i) describes the relation between the inputs and the outputs at hand; and, consequently, ii) is able to make predictions. Clearly, the ability to make inferences by generalizing examples is related to the risk associated with the decision function. It is possible to distinguish two classes of problems, depending on whether the data are labelled or unlabelled. If the data are labelled we have a supervised learning problem, where the observed profiles of features have a label and the task is to estimate the label of a new profile of features. If the data are not labelled we have an unsupervised learning problem; here the aim is to identify the structure of the data. Statistical regression and classification can be defined as supervised learning, while cluster analysis and density estimation are examples of unsupervised learning. Due to many successful practical applications, boosting (Schapire (1990), Freund (1995)) has become one of the most studied ideas of supervised learning. A great deal of effort has been devoted to developing statistical theory able to explain why boosting works. Basically, an M-step boosting algorithm iteratively performs M estimates by applying a given method, called a weak learner, to M different versions of the sample. The estimates are then combined into a single one, the final output, which can


be viewed as a 'powerful committee', expected to be significantly more accurate than each single estimate. A common interpretation of the boosting algorithm is as a greedy function approximation technique (Breiman (1999) and Friedman (2001)), where a given loss function is optimized by M iterative adjustments of the estimate in the function space. Therefore, every given loss function leads to a specifically 'shaped' boosting algorithm. However, a loss function should be convex and smooth enough to obtain algorithm convergence. Clearly, if a specific weak learner is adopted, a boosting algorithm can be explicitly expressed as a multistep estimator, and consequently some statistical properties can be derived. While boosting was originally applied to classification and regression, more recently it has also been used in density estimation problems. Ridgeway (2002) combines boosting and bagging in a parametric density estimation setting: boosting is employed to reduce bias, and bagging to control variance. Rosset and Segal (2002) use boosting to learn Bayesian networks; here the density estimation problem is viewed from a machine learning perspective. Di Marzio and Taylor (2004) propose a boosting algorithm for kernel density estimation (BoostKDE). BoostKDE is a higher-order bias method that uses the basic kernel density estimator as its weak learner. Kernel density classification (KDC), which classifies according to kernel density estimates of the probability density functions, can be thought of as estimating class boundaries as the intersection of kernel density estimates. In this paper we discuss the use of standard kernel density classification as a weak learner in a boosting classification algorithm. In other words, we intend to generate a 'powerful committee' as a linear combination of M KDC boundary estimates resulting from M boosting steps where the weak learner is the standard KDC (note that for M = 1 boosting is identical to the standard kernel density classification method). It will be seen that the proposed classification algorithm is closely linked to BoostKDE.

The paper is structured as follows. Section 2 introduces notation and preliminary definitions. In Section 3 we introduce a multivariate version of the boosting algorithm for density estimation proposed by Di Marzio and Taylor (2004). In Section 4 boosting strategies for kernel density classification are discussed; here too the setting refers to higher dimensional problems. Some links between our boosting classification algorithm (BoostKDC) and density estimation emerge, also in connection with the higher order bias kernel density estimator proposed by Jones et al. (1995). In Section 5 we discuss the problem of bandwidth selection and boosting regularization. In particular, we explore the use of cross validation and $L_2$ loss, first for standard kernel density classification and then for the boosting framework. The conclusion is that the optimal choice of bandwidth is method-specific, according to whether the goal is density estimation or classification. A bandwidth-based approach to regularization is also proposed. To explore the potential advantage of our methods, Section 6 provides some experiments on simulated and real data. The results are that, when the bandwidths are chosen optimally, our algorithms are superior to the classical kernel density methods. In addition, although our main aim is to verify that boosting can be beneficial for kernel density methods, we present a comparative study with trees and AdaBoost with trees. Also in this case BoostKDC exhibits the best behavior. Finally, we make some concluding remarks.

2 The framework

Suppose we have a two-class training sample $\{(x_i, y_i), i = 1,\ldots,N\}\subset\mathbb R^d\times\{-1,1\}$ of size $N = n_{-1} + n_1$. These observations are interpretable as a set of $N$ i.i.d. realizations of the random vector $(X, Y)$, where the feature vector $X$ models the $d$-dimensional individual profile, and the response variable $Y$ denotes the individual class membership described by the label. Suppose a new observation with unknown label is denoted by $(x, Y)$. The task is to predict $Y$ given the observed $x$. We then have to construct a decision rule $\hat Y:\mathbb R^d\to\{-1,1\}$. The optimal decision rule is the Bayes classifier (also known as Bayes' decision rule), in which we allocate $x$ by maximizing the estimated class posterior probability $\hat P(Y = y\mid X = x)$. If we assume equal misclassification costs, the Bayes classifier is
$$\hat Y(x) = \operatorname*{argmax}_{y\in\{-1,1\}}\ \frac{\hat\pi_y\hat f_y(x)}{\hat\pi_{-1}\hat f_{-1}(x)+\hat\pi_1\hat f_1(x)} = \operatorname*{argmax}_{y\in\{-1,1\}}\ \hat\pi_y\hat f_y(x)$$


where $\pi_y = P(Y = y)$ is the class prior. Often $\pi_y$ is known, or simply estimated using $\hat\pi_y = n_y/N$, hence the main issue is the estimation of the conditional pdfs $f_y(x)$. If we use a kernel density estimate (KDE) of $f_y(x)$, $y = -1, 1$, we get kernel density classification (KDC). A KDE of $f$ at $x$ is
$$\hat f(x;\mathbf H) = \frac{1}{n}\sum_{i=1}^{n} K_{\mathbf H}(x - x_i),$$
where the scale factor $\mathbf H$ is a symmetric positive definite $d\times d$ matrix called the smoothing parameter or bandwidth matrix, $K_{\mathbf H}(x) = |\mathbf H|^{-1}K(\mathbf H^{-1}x)$, and $K:\mathbb R^d\to\mathbb R$ is called the kernel; usually $K$ is a symmetric probability density function. At this point it is useful to recall that statistical learning includes many other strategies for estimating the Bayes rule besides plugging the (parametric or nonparametric) conditional density estimates into Bayes' theorem. Some of these methods try to estimate the posterior probability directly (logistic regression, LogitBoost); some solve directly for the Bayes rule (trees, perceptron). The accuracy of a classification rule $\hat Y(x)$ is inversely related to the misclassification probability $P(\hat Y(x)\ne Y)$. In our two-class problem this criterion can be expressed by the expected misclassification rate:
$$E\left(\int_{\{x:\,\hat\pi_1\hat f_1(x) > \hat\pi_{-1}\hat f_{-1}(x)\}}\pi_{-1}f_{-1}(x)\,dx + \int_{\{x:\,\hat\pi_{-1}\hat f_{-1}(x) > \hat\pi_1\hat f_1(x)\}}\pi_1 f_1(x)\,dx\right). \qquad (1)$$
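To make the plug-in rule concrete, the following is a minimal sketch (not from the paper) of kernel density classification with a product Gaussian kernel and a scalar bandwidth per class; the function names, data and bandwidth values are illustrative assumptions.

```python
import numpy as np

def kde(train, h, query):
    """Product-Gaussian KDE of the rows of `train`, evaluated at the rows of `query`."""
    d = train.shape[1]
    z = (query[:, None, :] - train[None, :, :]) / h        # (n_query, n_train, d)
    k = np.exp(-0.5 * (z * z).sum(axis=2)) / (np.sqrt(2.0 * np.pi) * h) ** d
    return k.mean(axis=1)                                   # (1/n) sum_i K_h(x - x_i)

def kdc_predict(x_neg, x_pos, h_neg, h_pos, query):
    """Plug-in Bayes rule: allocate query points to the class maximizing pi_y * f_y(x)."""
    n_neg, n_pos = len(x_neg), len(x_pos)
    pi_neg, pi_pos = n_neg / (n_neg + n_pos), n_pos / (n_neg + n_pos)
    score_neg = pi_neg * kde(x_neg, h_neg, query)
    score_pos = pi_pos * kde(x_pos, h_pos, query)
    return np.where(score_pos > score_neg, 1, -1)

# toy usage with two simulated Gaussian classes
rng = np.random.default_rng(0)
x_neg = rng.normal(-1.0, 1.0, size=(100, 2))
x_pos = rng.normal(+1.0, 1.0, size=(100, 2))
print(kdc_predict(x_neg, x_pos, h_neg=0.5, h_pos=0.5, query=rng.normal(size=(5, 2))))
```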

The AdaBoost algorithm (Schapire (1990); Freund and Schapire (1996)) calls a weak learner $\delta(x)$ $M$ times on $M$ different versions of the sample, yielding $M$ weak decision rules $\delta_m(x):\mathbb R^d\to\{-1,1\}$; the final output, given by a weighted combination of the $\delta_m$'s, is expected to be significantly more accurate than each weak learner involved. A weak learner is defined here as any algorithm for which $P(\delta(x)\ne Y)$ is slightly smaller than the error of a default decision rule, which has expected error rate $\min(\pi_{-1},\pi_1)$. The $M$ 'sample versions' consist of $M$ copies of the sample, each associated with a particular weighting distribution over the observations. The first weighting distribution is uniform, whilst for $m\ge 2$ the $m$th one depends on the results of the $(m-1)$th call. The weighting distribution assigns more importance to badly estimated data through a loss function. Thus, the 'awkward' observations receive increasing attention from the classifier until they are correctly allocated; see Schapire (1990).

3 Boosting Kernel Density Estimates

First we summarize the one-dimensional case introduced by Di Marzio and Taylor (2004), then we extend it to a multidimensional setting. Fundamental to the boosting paradigm is the re-weighting of data based on a 'goodness-of-fit' measure. As originally conceived, this measure was derived from the resubstitution error rate of classifiers, or their relative likelihood. In a KDE, we can obtain such a measure by comparing $\hat f(x_i)$ with the leave-one-out estimate:
$$\hat f^{(i)}(x_i) = \frac{1}{n-1}\sum_{j\ne i} K_h(x_i - x_j). \qquad (2)$$

A boosting algorithm that employs a KDE as 'weak learner' could then re-weight $x_i$ by $w_i\propto\log\bigl(\hat f(x_i)/\hat f^{(i)}(x_i)\bigr)$. Note that $w_i > 0$ because $\hat f^{(i)}(x_i) < \hat f(x_i)$. Our pseudocode for boosting a KDE (BoostKDE) is given below.

Algorithm 1 BoostKDE
1. Given $\{x_i, i = 1,\ldots,n\}$, initialize $w_1(i) = 1/n$, $i = 1,\ldots,n$.
2. Select $h$.
3. For $m = 1,\ldots,M$:
   (i) Get $\hat f_m(x) = \sum_i w_m(i)\,K_h(x - x_i)$
   (ii) Update $w_{m+1}(i) = w_m(i) + \log\bigl(\hat f_m(x_i)/\hat f_m^{(i)}(x_i)\bigr)$
4. Output $\prod_{m=1}^{M}\hat f_m(x)$, renormalized to integrate to unity.
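A minimal one-dimensional sketch of Algorithm 1 (not the authors' code); the Gaussian kernel, the grid-based renormalization and the simulated sample are illustrative assumptions, and the leave-one-out value is obtained by simply removing each point's own contribution from the weighted sum.

```python
import numpy as np

def gauss(u, h):
    return np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2.0 * np.pi))

def boost_kde(x, h, M, grid):
    """Sketch of Algorithm 1 (BoostKDE) in one dimension, returning the boosted
    estimate on `grid`, renormalized to integrate to unity (Riemann sum)."""
    n = len(x)
    w = np.full(n, 1.0 / n)                        # step 1: uniform weights
    Kxx = gauss(x[:, None] - x[None, :], h)        # kernel matrix at the data points
    Kgx = gauss(grid[:, None] - x[None, :], h)     # kernel matrix, grid vs data
    log_f = np.zeros_like(grid)
    for m in range(M):                             # step 3
        f_at_x = Kxx @ w                           # (i)  f_m at the data points
        f_loo = f_at_x - w * gauss(0.0, h)         #      leave-one-out version
        log_f += np.log(Kgx @ w)                   #      accumulate log f_m on the grid
        w = w + np.log(f_at_x / f_loo)             # (ii) weight update
    est = np.exp(log_f)                            # step 4: product of the M estimates ...
    return est / (est.sum() * (grid[1] - grid[0])) # ... renormalized to integrate to one

rng = np.random.default_rng(0)
sample = rng.normal(size=100)
grid = np.linspace(-4.0, 4.0, 401)
f_hat = boost_kde(sample, h=0.4, M=3, grid=grid)
```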

Di Marzio and Taylor (2004) show that $w_2(i)\propto\hat f(x_i)^{-1}$, so we could interpret the multiplicative KDE of Jones et al. (1995) as a boosted KDE with $M = 2$. The estimator of Jones et al. (1995) could also have been iterated (and so $M > 2$), but they doubted its efficacy. However, at least for $M = 2$, Di Marzio and Taylor (2004) show that BoostKDE inherits their bias reduction properties. The overall behavior for further steps is a progressive bias reduction and a slightly increasing variance. Note that the $L_2$ boost algorithm for regression proposed by Bühlmann and Yu (2003) performs quite similarly. A straightforward extension to the multidimensional case is immediate by employing product kernels. So, if $x = ({}_1x,\ldots,{}_dx)\in\mathbb R^d$, we replace $K_h(x - x_i)$ in step 3(i) of the algorithm above by
$$K_{\mathbf H}(x - x_i) = \prod_{j=1}^{d} K_h({}_jx - {}_jx_i),$$
in which a unidimensional kernel is used at each coordinate of $x$.

4 Boosting Kernel Density Classifiers

4.1 Kernel density estimation and classification

KDC is widely used, and has been shown to work well in several real-world discrimination problems; see Habbema et al. (1974), Michie et al. (1994) and Wright et al. (1997). Nevertheless, we note that in kernel-based classification problems we are not primarily interested in density estimation per se, but as a route to classification. We believe that the methodological impact of this different perspective has not yet been fully explored, although there exist a few contributions on this matter; see, for example, Hall and Wand (1988). It is worth considering the extent to which we should adapt the standard methodology of density estimation when applied to discrimination problems. An obvious difference is that density estimation usually considers Mean Integrated Squared Error as a measure of an estimate's accuracy, whereas classification problems are more likely to use expected error rates. Moreover, many researchers avoid using higher order kernels in density estimation because the estimate is not itself a density and, for moderate sample sizes, there is not much to gain. However, for some classification problems at least, the former may not be an obstacle. Another interesting issue lies in how the densities are estimated: should we use two separate estimates or a joint one, i.e. adopt a loss function depending on all bandwidths, as formula (1) suggests? We further discuss this in Section 5.

4.2 The algorithm

One of the early implementations of boosting is AdaBoost. Its major feature is that discrete decision rules, i.e. $\delta_m(x):\mathbb R^d\to\{-1,1\}$, are learned on weighted versions of the data, often using tree stumps. The final output is
$$\hat Y(x) = \mathrm{sign}\left[\sum_{m=1}^{M} c_m\,\delta_m(x)\right],$$

where $c_m$ is proportional to the performance of $\delta_m(x)$ on the training sample. Schapire and Singer (1999) suggest more efficient continuous mappings. They propose Real AdaBoost, in which the weak classifier yields membership probabilities, in our case $\delta_m(x) = p(Y = y\mid X = x)$, for a fixed $y$. Recently Friedman et al. (2000) have presented Real AdaBoost from a more statistical perspective. If we consider that KDC estimates and compares conditional probabilities, it seems natural that a majority voting criterion that fully employs the generated information should be the sum of these membership probabilities, and not only the label indicated by the likelihood ratio. Besides, it is to be noted that a continuous final output is generated, preserving the analytical advantages of a KDE when further investigation is needed, for example, taking into account unequal costs of misclassification. This can easily be seen by considering that in Real AdaBoost a weight proportional to
$$V_i = \left[\frac{\min\{p(x_i\mid y_i=-1),\,p(x_i\mid y_i=1)\}}{\max\{p(x_i\mid y_i=-1),\,p(x_i\mid y_i=1)\}}\right]^{1/2}$$
is given to $x_i$ if it is correctly classified, and proportional to $V_i^{-1}$ otherwise. (The indicated weighting system is obtained by noting that the updating factor of $w_m(i)$ is $\bigl(p_m(x)/(1-p_m(x))\bigr)^{-y/2}$.) Our pseudocode for Real AdaBoost KDC (BoostKDC) follows.

Algorithm 2 BoostKDC
1. Given $\{(x_i, y_i), i = 1,\ldots,N\}$, initialize $w_1(i) = 1/N$, $i = 1,\ldots,N$.
2. Select $\mathbf H_y$, $y\in\{-1,1\}$.
3. For $m = 1,\ldots,M$:
   (i) Get $\hat f_{y,m}(x) = \sum_{j\in\{i:\,y_i=y\}} w_m(j)\,K_{\mathbf H_y}(x - x_j)$, $y\in\{-1,1\}$
   (ii) Calculate
   $$\delta_m(x) = \frac12\log\left(\frac{p_m(x)}{1-p_m(x)}\right), \quad\text{where}\quad p_m(x) = \hat f_{-1,m}(x)\Big/\sum_y \hat f_{y,m}(x).$$
   (iii) Update: $w_{m+1}(i) = w_m(i)\times\exp\bigl(-y_i\,\delta_m(x_i)\bigr)$
4. Output $\hat Y_M(x) = \mathrm{sign}\left[\sum_{m=1}^{M}\delta_m(x)\right]$.
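The following is a minimal sketch of Algorithm 2 (not the authors' implementation). Here $p_m(x)$ is taken as the estimated membership probability of class $y = 1$, so that the sign of $\sum_m\delta_m$ directly returns the predicted label; bandwidths may be scalars or length-$d$ vectors (diagonal $\mathbf H_y$), the clipping of $p_m$ is a purely numerical safeguard, and all names and values below are illustrative.

```python
import numpy as np

def wkde(query, train, weights, h):
    """Weighted product-Gaussian sum: sum_j w_j K_H(query - train_j), with H = diag(h)."""
    h = np.broadcast_to(np.asarray(h, dtype=float), (train.shape[1],))
    z = (query[:, None, :] - train[None, :, :]) / h
    k = np.exp(-0.5 * (z * z).sum(axis=2)) / np.prod(np.sqrt(2.0 * np.pi) * h)
    return k @ weights

def boost_kdc(x, y, x_test, h_neg, h_pos, M):
    """Sketch of Algorithm 2 (BoostKDC): Real AdaBoost with weighted class KDEs."""
    N = len(y)
    neg, pos = (y == -1), (y == 1)
    w = np.full(N, 1.0 / N)                              # step 1
    score = np.zeros(len(x_test))
    for m in range(M):                                   # step 3
        def delta(q):                                    # (i)-(ii): half log-odds of class +1
            f_neg = wkde(q, x[neg], w[neg], h_neg)
            f_pos = wkde(q, x[pos], w[pos], h_pos)
            p = np.clip(f_pos / (f_neg + f_pos), 1e-12, 1.0 - 1e-12)
            return 0.5 * np.log(p / (1.0 - p))
        score += delta(x_test)                           # accumulate the weak decisions
        w = w * np.exp(-y * delta(x))                    # (iii) re-weighting
    return np.where(score >= 0.0, 1, -1)                 # step 4: sign of the summed deltas

# toy usage
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
y = np.repeat([-1, 1], 100)
print(boost_kdc(x, y, x[:5], h_neg=0.6, h_pos=0.6, M=3))
```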

Note that $\hat f_{y,m}(x)$ does not integrate to 1, even for $m = 1$ (remember that our goal is classification), so the target of our estimation is actually $\pi_y f_y(x)$ (with $\pi_y = n_y/N$). We do not need to renormalize the weights because we consider the ratio $\hat f_{-1,m}(x)/\hat f_{1,m}(x)$, so any normalization constant will cancel. Finally, note that for $M = 2$ and $d = 1$ it can be shown that the frontier estimator (i.e. the estimator $\hat x_0$ such that $\hat Y_M(\hat x_0) = 0$) of BoostKDC is $O(h^2)$ biased if $\mathbf H_1\ne\mathbf H_{-1}$, and $O(h^4)$ biased if $\mathbf H_1 = \mathbf H_{-1}$; see Di Marzio and Taylor (2005) for further details.

5 Bandwidth selection and regularization

5.1 Bandwidth selection for standard KDC

If we assume the misclassification error rate criterion given in formula (1), for a leave-one-out estimator given by equation (2) we can choose $h_{-1}$ and $h_1$ to minimize
$$\sum_{\{x_j:\,y_j=-1\}} I\bigl[\hat\pi_1\hat f_1(x_j) - \hat\pi_{-1}\hat f_{-1}^{(j)}(x_j)\bigr] + \sum_{\{x_j:\,y_j=1\}} I\bigl[\hat\pi_{-1}\hat f_{-1}(x_j) - \hat\pi_1\hat f_1^{(j)}(x_j)\bigr],$$
where
$$I(u) = \begin{cases} 1 & \text{if } u > 0,\\ 0 & \text{otherwise.}\end{cases}$$
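A minimal sketch (not from the paper) of evaluating this leave-one-out criterion over a grid of bandwidth pairs, with a scalar bandwidth per class and a product Gaussian kernel; the data, grid limits and function names are illustrative.

```python
import numpy as np

def kde_at(point, train, h, drop=None):
    """Product-Gaussian KDE of `train` at a single point; `drop` gives the index of a
    training point to leave out (the leave-one-out estimate of equation (2))."""
    pts = np.delete(train, drop, axis=0) if drop is not None else train
    z = (point[None, :] - pts) / h
    k = np.exp(-0.5 * (z * z).sum(axis=1)) / (np.sqrt(2.0 * np.pi) * h) ** train.shape[1]
    return k.mean()

def loo_misclassifications(x_neg, x_pos, h_neg, h_pos):
    """Value of the leave-one-out error-rate criterion for the bandwidth pair (h_neg, h_pos)."""
    n_neg, n_pos = len(x_neg), len(x_pos)
    pi_neg, pi_pos = n_neg / (n_neg + n_pos), n_pos / (n_neg + n_pos)
    errors = 0
    for j, xj in enumerate(x_neg):    # true class -1: error if the class +1 score wins
        if pi_pos * kde_at(xj, x_pos, h_pos) > pi_neg * kde_at(xj, x_neg, h_neg, drop=j):
            errors += 1
    for j, xj in enumerate(x_pos):    # true class +1: error if the class -1 score wins
        if pi_neg * kde_at(xj, x_neg, h_neg) > pi_pos * kde_at(xj, x_pos, h_pos, drop=j):
            errors += 1
    return errors

# grid search over (h_{-1}, h_1)
rng = np.random.default_rng(1)
x_neg = rng.normal(-1.0, 1.0, size=(60, 2))
x_pos = rng.normal(+1.0, 1.0, size=(60, 2))
hs = np.linspace(0.2, 2.0, 10)
best = min((loo_misclassifications(x_neg, x_pos, a, b), a, b) for a in hs for b in hs)
print(best)   # (criterion value, h_{-1}, h_1)
```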

In our experience this function may have several local minima, and so the problem requires a grid search over a region of $(h_{-1}, h_1)$, which will be time-consuming for large $n_i$. Another possible accuracy criterion is to minimize an $L_2$ loss function constructed from the difference $g(x) = \pi_1 f_1(x) - \pi_{-1} f_{-1}(x)$. For ease of presentation we now suppose that $\pi_1 = \pi_{-1} = 1/2$; the following discussion easily generalizes to the case of unequal proportions. In particular, if $\mathcal H_f(\cdot)$ denotes the Hessian matrix of $f$, let the following assumptions hold:


1. Each entry of $\mathcal H_f(\cdot)$ is piecewise continuous and square integrable.
2. Let $\mathbf H_i$, $i = -1, 1$, be two sequences of bandwidth matrices such that $\lim_{n\to\infty} n^{-1}|\mathbf H_i|^{-1/2} = 0$ and $\lim_{n\to\infty} h_i = 0$, where $h_i$ is any entry of $\mathbf H_i$, and such that the ratio of the largest to the smallest eigenvalue of $\mathbf H_i$ is bounded for all $n$.
3. $K$ is a bounded, compactly supported $d$-variate kernel satisfying
$$\int K(z)\,dz = 1, \qquad \int zK(z)\,dz = 0, \qquad \int zz^{T}K(z)\,dz = \mu_2(K)\,\mathbf I_d.$$

Starting from the usual theory (see Wand and Jones (1995), p. 97) we obtain
$$E\,\hat g(x) = f_1(x) - f_{-1}(x) + \frac12\,\mu_2(K)\bigl\{\mathrm{tr}[\mathbf H_1\mathcal H_{f_1}(x)] - \mathrm{tr}[\mathbf H_{-1}\mathcal H_{f_{-1}}(x)]\bigr\} + o\{\mathrm{tr}(\mathbf H_{-1}) + \mathrm{tr}(\mathbf H_1)\}$$
and
$$\mathrm{Var}\,\hat g(x) = R(K)\bigl[n_1^{-1}|\mathbf H_1|^{-1/2}f_1(x) + n_{-1}^{-1}|\mathbf H_{-1}|^{-1/2}f_{-1}(x)\bigr] + o\{n_1^{-1}|\mathbf H_1|^{-1/2} + n_{-1}^{-1}|\mathbf H_{-1}|^{-1/2}\},$$
where $R(K) = \int K(z)^2\,dz$ and $\mu_t(K) = \int z^{t}K(z)\,dz$. Hence the mean squared error (MSE) of our estimate at a point $x_0$ such that $g(x_0) = 0$ is
$$\mathrm{MSE}\,\hat g(x_0) = \mathrm{AMSE}\,\hat g(x_0) + o\{n_1^{-1}|\mathbf H_1|^{-1/2} + n_{-1}^{-1}|\mathbf H_{-1}|^{-1/2} + \mathrm{tr}^2(\mathbf H_1) + \mathrm{tr}^2(\mathbf H_{-1})\},$$
where
$$\mathrm{AMSE}\,\hat g(x_0) = R(K)\bigl[n_1^{-1}|\mathbf H_1|^{-1/2}f_1(x_0) + n_{-1}^{-1}|\mathbf H_{-1}|^{-1/2}f_{-1}(x_0)\bigr] + \frac14\,\mu_2(K)^2\bigl\{\mathrm{tr}[\mathbf H_1\mathcal H_{f_1}(x_0)] - \mathrm{tr}[\mathbf H_{-1}\mathcal H_{f_{-1}}(x_0)]\bigr\}^2$$
is the asymptotic MSE, a large-sample approximation consisting of the leading term of the expanded MSE. By integrating the AMSE over the domain of the densities (here we are assuming the integrability assumptions in conditions 1 and 3) we get a global measure, the asymptotic integrated mean squared error of the density difference:
$$\mathrm{AMISE}\,\hat g(\cdot) = R(K)\bigl[n_1^{-1}|\mathbf H_1|^{-1/2} + n_{-1}^{-1}|\mathbf H_{-1}|^{-1/2}\bigr] + \frac14\,\mu_2(K)^2\int\bigl\{\mathrm{tr}[\mathbf H_1\mathcal H_{f_1}(x)] - \mathrm{tr}[\mathbf H_{-1}\mathcal H_{f_{-1}}(x)]\bigr\}^2\,dx.$$
The strategy should be to minimize an estimate of the AMSE or AMISE over $\mathbf H_{-1}$ and $\mathbf H_1$. Clearly, the straightforward use of the AMSE or AMISE expressions to evaluate the optimal bandwidths is prevented by the presence of terms that are functions of the unknown true densities. Moreover, the solutions that minimize the joint AMISE will, in general, differ from those that minimize the error rate (though simulations indicate some equivalence in the joint dependence). In any case we note the obvious difference between global (integrated) and pointwise solutions in our discrimination setting, the latter being more appropriate near the frontier.

5.2 Bandwidth selection and regularization in boosting

In kernel density methods a proper choice of the smoothing level is crucial. Unfortunately, for a given sample, boosting forever is not optimal: if in a given practical case $L^{*}$ is the minimal loss obtainable by the boosting strong hypothesis $H_M$ in predicting $f$, we typically have
$$\lim_{M\to\infty} L(f, H_M) > L^{*}.$$

Actually, as $M$ increases, $H_M$ becomes more complex and tends to reproduce the sample closely (overfitting). Therefore, a stopping rule indicating when $L^{*}$ is achieved, i.e. the optimal number of iterations $M^{*}$, is needed. However, not a great deal of theory exists on this matter, and the usual advice is cross-validation based on subsamples. As intuition confirms, if $\delta$ is too complex (i.e. not sufficiently 'weak'), overfitting can appear after a few steps, or even at the first iteration, yielding poor inferences. In statistical terms a complex learner is characterized by low bias and high variance. Therefore, a natural and direct way to reduce the complexity of any kernel method is oversmoothing, because large bandwidths increase bias and reduce variance. Our choice of using a fixed value of $h$ in the algorithms also seems appropriate: if we optimally select $h$ for every estimation task, we encourage the overfitting tendency, because the 'learning rate' of every single step is then maximized. The subject will be taken up again in Section 6. In summary, we should have $\mathbf H_m = \mathbf H^{*}$ for all $m\in[1, M]$, where $\mathbf H^{*}$ minimizes the loss after the $M$th step. In statistical terms we could identify the following strategy: 'use very biased and low-variance estimates by adopting large bandwidths, then reduce the bias component using several boosting iterations'. In machine learning terms we could consider the smoothing parameter as a potential regularization factor, because oversmoothing weakens the learner, reducing its learning rate and so preventing overfitting. Finally, note that regularizing through oversmoothing, combined with intensive iterating, increases the range of settings of $(\mathbf H, M)$ for which boosting works. In particular, many large values of $\mathbf H$ and $M$ yield good results, reducing the need for both an accurate selection of the smoothing level and an optimal stopping rule.

6 Empirical results

In all our examples we use a kernel based on a product of independent standard Gaussians, which is equivalent to setting the bandwidth matrix and the population covariance matrix to $\mathbf H_y = \mathrm{diag}({}_1h_y,\ldots,{}_dh_y)$ and $\Sigma_y = \mathrm{diag}({}_1\sigma_y^2,\ldots,{}_d\sigma_y^2)$ respectively, i.e.
$$\hat f_{y,m}(x) = \sum_{j\in\{i:\,y_i=y\}} w_m(j)\,K_{\mathbf H_y}(x - x_j), \qquad y\in\{-1,1\},$$
with
$$K_{\mathbf H_y}(x - x_j) = \left(\sqrt{2\pi}\right)^{-d}\prod_{k=1}^{d}\frac{1}{{}_kh_y}\exp\left\{-\frac12\left(\frac{{}_kx - {}_kx_j}{{}_kh_y}\right)^2\right\}, \qquad (3)$$

where ${}_kx$ is the $k$th entry of the vector $x = ({}_1x,\ldots,{}_dx)$. In nonparametric density estimation, a simple intuitive plug-in rule for $\mathbf H_y = \mathrm{diag}({}_1h_y,\ldots,{}_dh_y)$ (which is based on an assumption of normality) is
$${}_k\hat h_y = \left(\frac{4}{n(d+2)}\right)^{1/(d+4)}{}_k\hat\sigma_y \simeq n^{-1/(d+4)}\,{}_k\hat\sigma_y$$
(see Scott (1992), p. 152), where ${}_k\sigma_y$ is the standard deviation of the $k$th component ${}_kx_j$ for $j\in\{i : y_i = y\}$ with $y = -1, 1$. However, very little is known about an appropriate plug-in rule in the case of kernel classification for multivariate data, let alone plug-in rules under boosting.

6.1 Boosting kernel density estimates in 2-D

To investigate the effectiveness of boosting kernel density estimates, we simulated 50 observations from the bivariate normal distribution with mean $\mu = (0, 0)^T$. For each of 100 samples we compute the boosted kernel density estimate for $M = 1,\ldots,17$ boosting iterations over a range of smoothing parameters $h$ using Algorithm 1, with the kernel function given by (3). For each estimate we store the integrated squared error, which compares $\hat f(x)$ with the true density. For comparative purposes, we also store the integrated squared error of the parametric estimate in which $\hat f$ is estimated by a multivariate normal density using the sample mean and covariance of the data. Figure 1 shows the average (over 100 samples) integrated squared error for various values of $M$, plotted as a function of the smoothing parameter. For the uncorrelated data (left panel), it can be seen


that boosting improves the estimate for up to about 11 (M = 12 ) boosting iterations, with most of the gains being made in the first two iterations. Note that BoostKDE outperforms the parametric estimate if M and h are correctly chosen. For the correlated data, boosting improves the estimate for only one iteration, and it does not do better than the parametric version. However, in this case the product kernel is not very efficient. In both cases when there are more boosting iterations a larger smoothing parameter is required.
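A compact sketch (not the original experiment code) of how one such comparison can be set up for a single sample: the boosted estimate of Algorithm 1 with a product Gaussian kernel is evaluated on a grid, renormalized, and its integrated squared error is compared with that of the parametric Gaussian fit; the grid limits, seed and bandwidth are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
n, d, M, h = 50, 2, 3, 0.7
x = rng.multivariate_normal(np.zeros(d), np.eye(d), size=n)

g = np.linspace(-4.0, 4.0, 81)                            # evaluation grid for the ISE
gx, gy = np.meshgrid(g, g)
grid = np.column_stack([gx.ravel(), gy.ravel()])
cell = (g[1] - g[0]) ** 2

def prod_kernel(query, xi, h):
    z = (query - xi) / h
    return np.exp(-0.5 * (z * z).sum(axis=-1)) / (np.sqrt(2.0 * np.pi) * h) ** d

Kxx = np.array([prod_kernel(x, xi, h) for xi in x])       # n x n kernel matrix
Kgx = np.array([prod_kernel(grid, xi, h) for xi in x]).T  # grid x n

w = np.full(n, 1.0 / n)
log_f = np.zeros(len(grid))
for m in range(M):                                        # Algorithm 1 with product kernels
    f_at_x = Kxx @ w
    f_loo = f_at_x - w * Kxx.diagonal()                   # drop each point's own contribution
    log_f += np.log(Kgx @ w)
    w = w + np.log(f_at_x / f_loo)

boosted = np.exp(log_f)
boosted /= boosted.sum() * cell                           # renormalize on the grid

truth = multivariate_normal(np.zeros(d), np.eye(d)).pdf(grid)
param = multivariate_normal(x.mean(axis=0), np.cov(x.T)).pdf(grid)   # parametric competitor
print((np.sum((boosted - truth) ** 2) * cell,             # ISE of the boosted KDE
       np.sum((param - truth) ** 2) * cell))              # ISE of the parametric estimate
```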

[Figure 1 about here. Left panel: 'boosting kernel density estimates in 2-D'; right panel: 'bivariate correlated data'. Both panels plot the average integrated squared error against the smoothing parameter for boosting iterations M = 1, ..., 17.]

Fig. 1 For 100 samples of size n = 50 the average integrated squared error (ISE) for BoostKDE is shown as a function of the smoothing parameter h for various values of the boosting iteration M. The dashed line joins the points corresponding to the optimal smoothing parameters for each boosting iteration. The horizontal line shows the average ISE for the parametric estimate in which the mean and covariance are estimated from the data. The underlying distribution is a bivariate normal with mean $\mu = (0, 0)^T$ and covariance matrix $\Sigma$ given by: Left: $\Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$; Right: $\Sigma = \begin{pmatrix} 1 & 0.7 \\ 0.7 & 1 \end{pmatrix}$.

6.2 Simulated Concentric Shells

We simulated two groups of data which are the unions of concentric shells in 3 dimensions, with an additional pure noise variable. Specifically, each group has an equal number of observations in which $(x_1, x_2, x_3)$ are generated from polar co-ordinates $(r,\theta,\phi)$ with $\theta\sim U[0, 2\pi)$ and $\phi\sim U[0,\pi)$, and $x_4$ is a group-independent variable with $X_4\sim U(-13, 13)$. For each group the radii, $R = r$, are simulated from a mixture of (truncated) Normal distributions, with the mixtures chosen so that the combined density of points in $\mathbb R^4$ was approximately uniform. The distributions of the group-dependent radii, $R_j$, $j = -1, 1$, are given by $R_j\sim\sum_{i=1}^{3} p_{ij}f_{ij}$, in which the mixture proportions are $p_{\cdot,-1} = (0.0014, 0.1003, 0.3983)^T$, $p_{\cdot,1} = (0.0248, 0.1994, 0.2758)^T$, and the densities $f_{1,1}$, $f_{2,-1}$, $f_{2,1}$, $f_{3,-1}$ are Normal densities with standard deviations equal to 1 and means equal to 3, 6, 9, 12 respectively. $f_{1,-1}$ is the density of the absolute value of a standard Normal, and $f_{3,1}$ is the density of the absolute value of a standard Normal subtracted from 15. We simulated 10 datasets each of size 1200 (600 in each group) as the training data, and we evaluated each classifier on an independent test set of size 6000 (3000 in each group). For each training dataset we tried a range of smoothing parameters $h_{-1} = h_1$ in Equation (3) and investigated the performance (as measured by the misclassification rate) for up to 6 boosting iterations.
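A hedged sketch (not the original simulation code) of this data-generating process; the published mixture proportions, which sum to 0.5 within each group, are rescaled to sum to one, negative radii are avoided by a crude absolute-value constraint, and all function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_radii(group, size):
    """Illustrative approximation of the radial mixtures described above."""
    if group == -1:
        p = np.array([0.0014, 0.1003, 0.3983])
        draws = [np.abs(rng.normal(0, 1, size)),         # f_{1,-1}: |N(0,1)|
                 rng.normal(6, 1, size),                  # f_{2,-1}
                 rng.normal(12, 1, size)]                 # f_{3,-1}
    else:
        p = np.array([0.0248, 0.1994, 0.2758])
        draws = [rng.normal(3, 1, size),                  # f_{1,1}
                 rng.normal(9, 1, size),                  # f_{2,1}
                 15 - np.abs(rng.normal(0, 1, size))]     # f_{3,1}: 15 - |N(0,1)|
    comp = rng.choice(3, size=size, p=p / p.sum())        # rescaled mixture proportions
    return np.abs(np.choose(comp, draws))                 # crude positivity constraint

def sample_group(group, size):
    r = sample_radii(group, size)
    theta = rng.uniform(0, 2 * np.pi, size)
    phi = rng.uniform(0, np.pi, size)
    x1 = r * np.sin(phi) * np.cos(theta)
    x2 = r * np.sin(phi) * np.sin(theta)
    x3 = r * np.cos(phi)
    x4 = rng.uniform(-13, 13, size)                       # pure noise variable
    return np.column_stack([x1, x2, x3, x4])

train = np.vstack([sample_group(-1, 600), sample_group(1, 600)])
labels = np.repeat([-1, 1], 600)
```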


The results are displayed in Figure 2. It can be seen that boosting reduces the error rate for two boosting iterations ($M = 3$), and that, when boosting is used, a larger smoothing parameter is optimal. This is in agreement with the theory that boosting is more effective for weaker (oversmoothed) learners.

6.3 Digit recognition

This dataset was originally collected in order to design classifiers suited to the automated recognition of hand-written postcodes. The postcodes were segmented from envelopes and then a grey-level 16 × 16 image was created from each digit (0-9). The full dataset is described in more detail in Michie et al. (1994, ch. 9). Here we deal with a simplified problem in which we first average each image over non-overlapping 4 × 4 blocks to produce a crude 4 × 4 image (see Figure 3 for some examples). Note that the block averaging makes the images hard to distinguish by eye. Further, we consider only the digits "1" and "7" in order to perform boosting in the simpler two-class problem. We have 1800 examples in the training set (900 from each group), and an independent 1800 examples in the test set. For these data, since the ${}_k\hat\sigma$, $k = 1,\ldots,16$, are quite unequal, we first standardize all the data according to the overall sample means and standard deviations of each variable in the test set. In this example, the transformation was not dependent on the group membership, although this could also be explored. The results are shown in Figure 4. In this example, boosting reduces the error rate only for one iteration ($M = 2$), and only by a very small amount (one observation out of 1800). Again we see that boosting has an improving effect for larger smoothing parameters. In addition, we have found that the choice of smoothing parameter is usually less critical when boosting is applied. For example, if we consider the average error rate over a range of choices of $h$, then boosting is often better than no boosting ($M = 1$) for $h$ randomly chosen in various ranges.

6.4 Tsetse Data

We have applied BoostKDC to the StatLog tsetse dataset (two groups, $d = 14$, 3500 training data and 1499 test data); see Michie et al. (1994, ch. 9) for further details and results from the competition. For this dataset we explored two alternative smoothing strategies:
– a separate bandwidth selector for each class, i.e. $\hat{\mathbf H}_y = W\,(4/(n(d+2)))^{1/(d+4)}\,\mathrm{diag}({}_1\hat\sigma_y,\ldots,{}_d\hat\sigma_y)$, where ${}_k\hat\sigma_y$ is the standard deviation calculated along the $k$th dimension of $\{x_i : y_i = y\}$, $y = -1, 1$, for various choices of the multiplier $W$;
– the same smoothing parameter in both KDEs, i.e. $\hat{\mathbf H} = (\hat{\mathbf H}_{-1}\hat{\mathbf H}_1)^{1/2}$.
In Figure 5 we observe test and training error rate (err) patterns as functions of the boosting iterations for various choices of the multiplier $W$; recall that $\mathrm{err} = N^{-1}\sum_{i=1}^{N} I(y_i\ne\hat Y(x_i))$. For $W = 1$ the best test err is reached after a few tens of iterations: .0505 for $m = 23$ and .0376 for $m = 34$ (for the two separate bandwidths and the common bandwidth, respectively); later for $W = 1.5$: .0496 for $23 < m < 26$ and .0295 for $92 < m < 103$. Interestingly, in the StatLog project the best results on this dataset were obtained for tree algorithms, with about 3.6% test err. Note that the first-step (i.e., standard KDC) accuracy is everywhere greatly improved without the 'weak learner' being particularly rough (the first steps exhibit a training err no bigger than 8%). Moreover, the common selector choice achieves superior accuracy and overfitting resistance; this is consistent with our theoretical results for bias reduction in one dimension. Finally, we note that oversmoothing (i.e., reducing the flexibility of the 'weak learner' by setting $W > 1$) improves the performance; this characteristic was also observed for BoostKDE.

6.5 Some comparisons with trees and AdaBoost.M1

For the two real datasets we have compared the results of our classifier with those of other methods, including boosting. Comparative results for the tsetse fly data can be found in Michie et al. (1994, p. 169), in which the best test error was 3.6% for the decision tree algorithm CN2, whereas a standard linear discriminant analysis gave an error rate of 12.2%.


[Figure 2 about here: 'error: boosting and h'; the error rate is plotted against the smoothing parameter h, with curves labelled by the number of boosting iterations.]

Fig. 2 Average out-of-sample error rate for the simulated concentric sphere data as a function of smoothing parameter. Each curve corresponds to the number of boosting iterations, with the minimum attained value also shown with the symbol ◦. Ten simulations were used with 1200 observations in each training dataset, and 6000 observations in the test datasets.

Fig. 3 Top two rows: four examples of the observations corresponding to the digits "1" and "7". Bottom two rows: 4 × 4 average blocks corresponding to the first two rows. These 16 grey levels (one for each pixel) represent the data used in the boosting experiment.


[Figure 4 about here: 'error: boosting and h'; the number of test errors is plotted against the smoothing parameter h, with curves labelled by the number of boosting iterations.]

Fig. 4 Error (counts out of 1800 observations) for standardized digits data as a function of smoothing parameter. Each curve corresponds to the number of boosting iterations, with the minimum attained value also shown with the symbol ◦.

[Figure 5 about here. Left panel: training and test errors plotted against the boosting iterations (0-50) for W = 1. Right panel: the following table.]

W     err (%)   m
0.6   4.44      6
0.8   3.89      14
1.0   3.76      34
1.2   3.42      82-83
1.4   3.28      49-75
1.6   2.87      87-103
1.8   2.80      160-166
2.0   2.66      238-252
2.2   2.80      400-440*
2.6   3.01      503-520*
3.0   3.07      588-600*
3.4   3.48      658-680*
3.8   3.89      709-760*
4.2   4.30      803-830

Fig. 5 Tsetse fly data. Left: training and test errs of BoostKDC as a function of the iterations for W = 1; solid line: one common bandwidth used; dashed line: two different bandwidths used. Right: minima of test errs for several values of W in the case of the common bandwidth strategy; * = last iteration performed for a given value of W.

We implemented Freund & Schapire's (1996) boosting algorithm AdaBoost.M1 using a tree stump (i.e. a classification tree with only one split) as the weak learner. This algorithm can be run for $m = 1,\ldots,M$ iterations and the error rate for the training and test set can be monitored over $m$. For the tsetse fly data, we used AdaBoost.M1 for $M = 10\,000$ iterations. The optimal (over all boosting iterations) test error rate is just over 5%. We note that this is more than the error rate obtained by CN2, and that our version of boosting a kernel classifier (BoostKDC) outperforms the other methods if the smoothing parameters are correctly chosen.

For the digits data, we first note that a standard classification tree (using an information splitting criterion) gives 16 errors on the training data and 19 errors on the test data; this tree has only five splits, and uses four of the variables. AdaBoost.M1, with a stump as the weak learner, gave perfect classification on the training set after only 51 boosting iterations. At this point there were 12 misclassifications in the test set. The minimum test error (over the first 1000 boosting iterations) was 10, and this was obtained after 142 boosting iterations. It is again clear that the standard kernel density classifier and its boosted version BoostKDC compare favorably for these data when the smoothing parameters are chosen optimally. The number of boosting iterations required in these two examples is very different, both for tree stumps and for kernel classification. But it is clear that kernel classification, even with large smoothing parameters, is not as "weak" as tree stumps. At each step all the variables are used, and curved, disconnected boundaries between the classes are possible, so it is not surprising that fewer iterations are required for this method. For clarity, the comparison results just discussed are summarized in Table 1.

              Trees   AdaBoost Trees   AdaBoost KDC
Digits data    19          10              8
Tsetse data    3.6%        5%              2.66%

Table 1 Test errors for Digits data (number misclassified) and Tsetse data (percentage misclassified).
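For reference, a from-scratch sketch of AdaBoost.M1 with a one-split tree ('stump') as the weak learner, monitoring the test error at each iteration as described above; it is not the implementation used for the reported results, and scikit-learn's DecisionTreeClassifier is assumed to be available for fitting the stump.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1_stumps(x_train, y_train, x_test, y_test, M):
    """AdaBoost.M1 with stumps for labels in {-1, +1}; returns the test error after each iteration."""
    n = len(y_train)
    w = np.full(n, 1.0 / n)
    aggregate = np.zeros(len(y_test))
    test_err = []
    for m in range(M):
        stump = DecisionTreeClassifier(max_depth=1).fit(x_train, y_train, sample_weight=w)
        pred = stump.predict(x_train)
        miss = (pred != y_train)
        err = np.clip(np.sum(w * miss) / np.sum(w), 1e-10, 1 - 1e-10)
        alpha = np.log((1.0 - err) / err)                 # vote weight of the m-th stump
        w = w * np.exp(alpha * miss)                      # up-weight the misclassified points
        w = w / w.sum()
        aggregate += alpha * stump.predict(x_test)        # weighted committee on the test set
        test_err.append(np.mean(np.where(aggregate >= 0, 1, -1) != y_test))
    return test_err
```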

7 Conclusions

We have discovered an interesting link between boosting and a method of bias reduction used in kernel density estimation. Initial results suggest that this bias-reduction characteristic also holds when boosting is applied to classification. Intuition suggests that the more boosting is applied, the larger the optimal smoothing parameter, and all of our simulations and experiments confirm this. Although it is clear that boosting can lead to a reduction in error rates, there are still open questions. From a practical perspective the most important issue is finding the optimal combination of smoothing parameter(s) and number of boosting iterations. This would need to be addressed by some sort of cross-validation approach, and the success of any such strategy would require more investigation.

References

Breiman, L. (1999). Prediction games and arching algorithms. Neural Computation, 11:1493-1517.
Bühlmann, P. and Yu, B. (2003). Boosting with the L2 loss: regression and classification. Journal of the American Statistical Association, 98:324-339.
Di Marzio, M. and Taylor, C. C. (2004). Boosting kernel density estimates: a bias reduction technique? Biometrika, 91:226-233.
Di Marzio, M. and Taylor, C. C. (2005). Kernel density classification and boosting. Submitted for publication.
Freund, Y. (1995). Boosting a weak learning algorithm by majority. Information and Computation, 121:256-285.
Freund, Y. and Schapire, R. (1996). Experiments with a new boosting algorithm. Machine Learning: Proceedings of the Thirteenth International Conference, pages 148-156. Morgan Kaufmann, San Francisco.
Friedman, J. (2001). Greedy function approximation: a gradient boosting machine. The Annals of Statistics, 29:1189-1232.
Friedman, J., Hastie, T., and Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28:337-407.
Habbema, J., Hermans, J., and van der Burgt, A. (1974). Cases of doubt in allocation problems. Biometrika, 61:313-324.
Hall, P. and Wand, M. (1988). On nonparametric discrimination using density differences. Biometrika, 75:541-547.
Jones, M., Linton, O., and Nielsen, J. (1995). A simple bias reduction method for density estimation. Biometrika, 82:327-338.


Michie, D., Spiegelhalter, D., and Taylor, C. C. (1994). Machine Learning, Neural and Statistical Classification. Ellis Horwood, Chichester.
Ridgeway, G. (2002). Looking for lumps: boosting and bagging for density estimation. Computational Statistics and Data Analysis, 38:379-392.
Rosset, S. and Segal, E. (2002). Boosting density estimation. NIPS-2002.
Schapire, R. (1990). The strength of weak learnability. Machine Learning, 5:197-227.
Schapire, R. and Singer, Y. (1999). Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37:297-336.
Scott, D. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley, New York.
Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing. Chapman & Hall, London.
Wright, D., Stander, J., and Nicolaides, K. (1997). Nonparametric density estimation and discrimination from images of shapes. Journal of the Royal Statistical Society, Series C, 46:365-380.
