Speeding-up Margin Based Learning via Stochastic Curtailment


Raphael Pelossof ([email protected])
Center for Computational Learning Systems, Columbia University, 1214 Amsterdam Ave, New York, NY 10027

Michael Jones ([email protected])
Mitsubishi Electric Research Labs, 201 Broadway, Cambridge, MA 02139

Zhiliang Ying ([email protected])
Statistics Department, Columbia University, 1255 Amsterdam Ave, New York, NY 10027

Abstract

The purpose of this work is to speed up margin based learning algorithms by deriving a stopping rule which stops the partial computation of the margin when the full computation is likely to have the same outcome. This is achieved by a novel merger of sequential analysis and margin based learning. Early stopping may introduce decision errors, which occur when the algorithm stops the computation when in fact it should not have. Sequential analysis allows us to trade off decision error rates and algorithmic speed-ups through stopping rules. Our algorithm, Curtailed Online Boosting, is derived by adding stopping rules to Online Boosting. This results in large computational speed-ups by stopping computation early on uninformative examples. Our experiments show a successful speed-up of Online Boosting, while maintaining similar accuracy, on synthetic data and the MNIST data set.

1. Introduction

In this work we propose a simple novel method based on sequential analysis (Wald, 1945; Lan et al., 1982) to drastically improve the computational efficiency of margin based learning algorithms. Our method allows the algorithm to determine how much computation is needed per example. This is done by quickly determining for each training example whether it is unimportant. An example is unimportant if its margin is large and positive. Unimportant examples are then quickly discarded. In AdaBoost and SVM, the margin is computed as a function of the weighted sum of weak classifiers (which we will also call features). By finding a quick way to determine whether an example is unimportant and discarding it, we greatly speed up margin based learning algorithms without suffering much in terms of generalization.

We use the terms margin and full margin to describe the summation of all the feature evaluations, and partial margin for the summation of a part of the feature evaluations. The calculation of the margin is broken up for each example in the stream of examples. After evaluating each feature, the algorithm decides whether to evaluate the next feature or to stop and reject the example due to its lack of importance for training. By making a decision after each feature evaluation we are able to prematurely stop the evaluation of features on examples with a large partial margin after having evaluated only a few features. Examples with a large partial margin are unlikely to have a full margin that falls below the required threshold. Therefore, by rejecting these examples early, large savings in computation are achieved.

To find the early stopping thresholds we use ideas from the field of sequential analysis, which has been an active research field for over 60 years. It is mostly used and developed by the clinical testing and economics communities. In clinical testing the researcher would like to design a test which requires the smallest number of patients to prove the efficacy of a drug. The cost of many tests is monetarily high and patients may die throughout the test; therefore, the tests are designed sequentially, so that the minimum number of patients is tested. The connection between clinical tests and margin evaluation is made by looking at each feature as a patient, and the margin evaluation as the

Figure 1. Computational efficiency. Curtailed Online Boosting places most computational effort into processing examples near the decision boundary. The training set is pictured in the left subfigure. In the right subfigure, the size and color of each point are proportional to the number of features evaluated throughout the curtailed online learning process. Most examples with a large margin are rejected early without a full evaluation of their margins. (a) The synthetic two sine waves learning problem. Curtailed Online Boosting is trained with 100 boosting stumps and run over 100,000 examples; the stumps used are thresholded linear separators which were selected using AdaBoost. Plotted is part of the test set, in blue the positive class, in red the negative class, and the learned decision boundary in black. (b) Computational efficiency plot. The size and color of each point show the number of features evaluated by Curtailed Online Boosting throughout the learning process; the larger and more red a point is, the more features were evaluated and updated while processing it. A subset of 5000 random training points is plotted. The figure shows that the algorithm puts most of its computational effort into examples by the decision boundary; the jumps in the computation allocation are due to the structure of the decision surface.

sequential evaluation of patients' responses to a drug. Therefore, a sequential test can be designed to stop the margin evaluation when the partial evaluation of the margin gives high probability that the full evaluation will have the same outcome. If at any interim point we have enough confidence that the full evaluation will determine that an example is uninformative, we stop the test early, reject the example, and save on computational costs while maintaining accuracy (see Figure 1).

A tradeoff between accuracy and speed is established. Instead of looking at the traditional classification error, we look at the decision error. Decision errors occur when the algorithm rejects examples that should have been accepted for training. Given a desired decision error rate, we would like the test to decide when to stop the computation such that the decision error rate is maintained. The test is adaptive, and the partial computation of each example changes according to its margin. A decision error is made when an informative example is erroneously rejected from the learning process. Since we bound the mean rate of these decision errors, the number of examples for which the algorithm will erroneously not update its model is at most the decision error rate times the number of examples processed. By adding this same proportion of examples to the training set, the overall error of the algorithm should be equal to that of the original margin-based classifier. We demonstrate that our simple test can speed up Online Boosting by almost an order of magnitude while maintaining generalization accuracy.

This paper is structured as follows. In section 2 we present related work. In section 3 we develop the novel Sequential Thresholded Sum Test, which relates decision error rates with early stopping thresholds. In sections 4 and 5 we apply our sequential test to Online Boosting and validate its strength by experimentation. Finally, in section 6 we provide a brief discussion and ideas for future work.

2. Related Work

Margin-based learning has spurred countless algorithms in many different disciplines and domains. We summarize its impact in a few domains: online learning, batch learning and real-time detection, and finally active learning. Many margin-based online algorithms base their model update on the margin of each example in the stream. Online algorithms such as Kivinen and Warmuth's Exponentiated Gradient (Kivinen & Warmuth, 1997) and Oza and Russell's Online Boosting (Oza & Russell, 2001) update their respective models by using a margin based potential function. Passive online algorithms, such as Rosenblatt's perceptron (Rosenblatt, 1958) and Crammer et al.'s online passive-aggressive algorithms (Crammer et al., 2006), define a margin based filtering criterion which updates the algorithm's model only if the value of the margin falls below a defined threshold. All these algorithms fully evaluate the margin for each example, which means that they evaluate all their features for every example regardless of its importance. The running time of these algorithms is a function of the number of features, or the dimensionality of the input space. Since models today may have thousands of features, the running time can be daunting, and depending on the task, one may wish to speed up these online algorithms by pruning uninformative examples. We propose to stop the computation of feature evaluations early for uninformative examples by connecting sequential analysis (Wald, 1945; Lan et al., 1982) to margin based algorithms.

There has been extensive research in the computer vision community on prematurely stopping the computation of the margin or the score of examples. One example of this research is the well known Viola-Jones face detector (Viola & Jones, 2001; Jones & Viola, 2003), which uses a cascade of classifiers to quickly reject non-face image patches. The Viola-Jones cascaded face detector is a discriminative classifier that is trained using batch AdaBoost (Freund & Schapire, 1997). Following this work, Bourdev and Brandt (Bourdev & Brandt, 2005) replaced the cascaded classifier with a soft cascade that uses a set of rejection thresholds, one after each weak classifier, to determine whether to continue or to reject the example as a non-face. Attempting to make the tradeoff between early stopping and accuracy mathematically optimal, Matas and Sochman (Matas & Sochman, 2007) propose WaldBoost, which puts Wald's sequential test into a boosting framework. Our work differs by the development of new theory which does not require a likelihood function over the votes of weak hypotheses, and is therefore purely discriminative.

Another closely related learning paradigm is active learning. An active learning algorithm is presented with a set of unlabeled examples and decides which examples' labels to query at a cost. The algorithm's task is to pay as little as possible for labels while achieving specified accuracy and reliability rates (Dasgupta et al., 2005; Cesa-Bianchi et al., 2006; Settles, 2009). Typically, selective sampling active learning algorithms ignore examples that are easy to classify, and pay for labels of harder to classify examples that are close to the decision boundary (Dasgupta et al., 2005).

3. Sequential Thresholded Sum Test

Our task is to find a filtering framework that speeds up margin based learning algorithms by quickly rejecting examples of little importance. Quick rejection is done by creating a test which stops the margin evaluation process given the partial computation of the margin. We measure the importance of an example by the size of its margin, geometrically the distance from the decision boundary. We define $\theta$ as the importance threshold: examples that are important to us have a margin smaller than $\theta$. Statistically this problem can be generalized to finding a test for prematurely stopping the computation of a partial sum of independent random variables when the result of the full summation is guaranteed with high probability.

3.1. Mathematical roadmap

To design such a sequential test, we first upper bound the probability of a sum of weighted independent random variables ending above a threshold given that a partial computation was stopped - a decision error. Then, given a required decision error rate, we derive the Sequential Thresholded Sum Test (STST), which provides adaptive early stopping thresholds.

3.2. Probability inequalities for thresholded sums

In most of the above described algorithms, the learning algorithm compares the margin (or the score) of each example to a threshold to form a decision. If we generalize this process, most of the algorithms compare a weighted sum of random variables to a threshold to make a certain decision. Therefore, to design a test to stop a margin evaluation early, we extend the theory used by sequential analysis to deal with thresholded sums of independent random variables. By looking at the evaluation of the margin as a random walk, an observation made previously by Freund (2001) and Long & Servedio (2005; 2008), we are able to compute novel decision error inequalities required for the new sequential test.

We are interested in bounding the probability of making a decision error. We define a decision error as the event where the full sum of weighted independent random variables is smaller than a given threshold but the partial sum passed above an early stopping threshold. Let the sum of weighted independent random variables $(w_i, X_i)$, $i = 1, \ldots, n$, be defined by $S_n = \sum_{i=1}^{n} w_i X_i$, where $w_i$ is the weight assigned to the random variable $X_i$. We require that $w_i \in \mathbb{R}$ and $X_i \in [-1, 1]$. We define $S_n$ as the full sum, $S_i$ as the partial sum, and $S_{in} = S_n - S_i = \sum_{j=i+1}^{n} w_j X_j$ as the remaining sum. Once we have computed the partial sum up to the $i$th random variable we know its value $S_i$. Let the stopping threshold at coordinate $i$ be denoted by $\tau_i$. We use the notation $\mathbb{E}S_{in}$ to denote the expected value of the remaining sum.

Theorem 1. The probability of making a decision error for thresholded sums of weighted independent random variables is bounded by
$$P(\text{decision error}) \le \exp\!\left(-\frac{(\theta - \tau_i - \mathbb{E}S_{in})^2}{2\sum_{j=i+1}^{n} w_j^2}\right).$$


Proof. A decision error occurs when the computation of a sum is stopped early, when in fact it should have continued. This means that the partial sum passed above a given early stopping threshold $\tau_i$, when in fact it should not have, since the full sum satisfies the importance requirement $S_n \le \theta$. Our task is to bound the following probability (see Figure 2):
$$\begin{aligned}
P(\text{decision error}) &= P(S_n \le \theta \mid S_i > \tau_i) &(1)\\
&\lesssim P(S_n \le \theta \mid S_i = \tau_i) &(2)\\
&= P(S_n - S_i \le \theta - \tau_i \mid S_i = \tau_i) &(3)\\
&= P(S_{in} \le \theta - \tau_i). &(4)
\end{aligned}$$

The inequality in equation 2 is valid since $\tau_i$ is closer than $S_i$ to $\theta$ by construction, and tight since once the random walk passes the stopping threshold it is stopped, and its final value is close to the value of the corresponding stopping threshold (see Figure 2). The conditioning in equation 4 is dropped since $S_{in}$ does not depend on $S_i$. Hoeffding's inequality upper bounds the probability of a sum $S_n$ of independent random variables $\{X_i\}_{i=1}^{n}$ deviating from its expectation by more than $t$, where the summands are bounded, $X_i \in [a_i, b_i]$:
$$P(S_n - \mathbb{E}S_n \ge t) \le \exp\!\left(-\frac{2t^2}{\sum_{i=1}^{n}(b_i - a_i)^2}\right). \qquad (5)$$
To apply this bound to thresholded sums we need to convert equation 4 to the form used by the Hoeffding bound:
$$P(S_{in} \le \theta - \tau_i) = P(S_{in} - \mathbb{E}S_{in} \le \theta - \tau_i - \mathbb{E}S_{in}) = P\big((-S_{in}) - (-\mathbb{E}S_{in}) \ge -(\theta - \tau_i - \mathbb{E}S_{in})\big). \qquad (6)$$
Equation 6 gives an adaptive threshold which changes each time a random variable is observed and added to the sum. Let $t_i = \theta - \tau_i - \mathbb{E}S_{in}$, $a_i = -w_i$, $b_i = w_i$. Combining equations 4-6 we get the following inequality:
$$P(S_n \le \theta \mid S_i > \tau_i) \le \exp\!\left(-\frac{(\theta - \tau_i - \mathbb{E}S_{in})^2}{2\sum_{j=i+1}^{n} w_j^2}\right), \qquad (7)$$
where the threshold $\theta$, the partial sum $S_i$, and the weights $w_j$ are known (see Figure 2). What is left in order to calculate the upper bound is to compute the expectation of the remaining sum $\mathbb{E}S_{in}$.
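To make the bound concrete, the following is a minimal Python sketch (our own illustration, not code from the paper; the function name and example numbers are hypothetical) that evaluates the right-hand side of equation 7 for a given partial-computation state:

import math

def decision_error_bound(theta, tau_i, E_remaining, remaining_weights):
    """Right-hand side of equation 7: an upper bound on P(S_n <= theta | S_i > tau_i)."""
    gap = theta - tau_i - E_remaining              # informative when this is negative
    denom = 2.0 * sum(w * w for w in remaining_weights)
    return math.exp(-(gap * gap) / denom)

# 20 weak features of weight 0.5 remain, the remaining expectation is 0,
# and the partial sum already sits above a stopping threshold tau_i = 5.
print(decision_error_bound(theta=0.0, tau_i=5.0, E_remaining=0.0,
                           remaining_weights=[0.5] * 20))   # about 0.082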


Figure 2. Margin evaluation as a random walk. The margin evaluation is a summation of $n$ summands. As the partial sum $S_i$ is computed, feature weights $+w_i$ or $-w_i$ are added to it. We are interested in finding the stopping boundary $\tau_i^u$ such that the probability that $S_i$ passes it while $S_n \le \theta$ is bounded by $\delta$. The stopping boundary $\tau_i^l$ of the opposite decision error is plotted in green.

3.3. Sequential Thresholded Sum Test

Corollary 1. The Sequential Thresholded Sum Test. The probability of making a decision error is at most $\delta$ if the stopping thresholds are set as
$$\tau_i = \theta - \mathbb{E}S_{in} + \|w_{in}\| \sqrt{\ln \frac{1}{\delta^2}},$$
where $\|w_{in}\| = \sqrt{\sum_{j=i+1}^{n} w_j^2}$ is the norm of the remaining weights.

Proof. Theorem 1 gives us an upper bound on the decision error rate as more information is gathered sequentially. Let us explicitly calculate the early stopping thresholds for a required error rate $\delta$:
$$\exp\!\left(-\frac{(\theta - \tau_i - \mathbb{E}S_{in})^2}{2\sum_{j=i+1}^{n} w_j^2}\right) = \delta.$$
Solving for $\tau_i$,
$$\tau_i = \theta - \mathbb{E}S_{in} + \|w_{in}\| \sqrt{\ln \frac{1}{\delta^2}}. \qquad (8)$$

As long as $S_i$ satisfies equation 8, i.e. $S_i \le \tau_i$, we continue the summation by adding the next summand.
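As a concrete illustration of the test (our own sketch, not the paper's implementation; the remaining expectations $\mathbb{E}S_{in}$ are crudely set to zero here), the following Python function computes the stopping thresholds of equation 8 and applies them while a weighted sum is accumulated:

import math
import random

def stst_threshold(theta, delta, E_remaining, remaining_norm):
    """Early stopping threshold tau_i of equation 8."""
    return theta - E_remaining + remaining_norm * math.sqrt(math.log(1.0 / delta ** 2))

def curtailed_sum(weights, xs, theta, delta, E_remaining=None):
    """Add w_i * x_i sequentially; stop once the partial sum exceeds tau_i."""
    n = len(weights)
    if E_remaining is None:
        E_remaining = [0.0] * n          # crude stand-in; the paper estimates ES_in online
    tail_sq = [0.0] * (n + 1)            # tail_sq[i] = sum of squared weights from index i on
    for i in range(n - 1, -1, -1):
        tail_sq[i] = tail_sq[i + 1] + weights[i] ** 2
    s = 0.0
    for i in range(n):
        s += weights[i] * xs[i]
        tau = stst_threshold(theta, delta, E_remaining[i], math.sqrt(tail_sq[i + 1]))
        if s > tau:                      # decide S_n > theta without finishing the sum
            return s, i + 1, True
    return s, n, i + 1 < n

# toy usage: 100 unit-weight summands drifting upward, theta = 0, 20% allowed error rate
random.seed(0)
w = [1.0] * 100
x = [random.uniform(0.0, 1.0) for _ in range(100)]   # a clearly "unimportant" example
print(curtailed_sum(w, x, theta=0.0, delta=0.2))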


Equivalently, the STST will stop the summation if
$$\text{Stop if } S_i > \theta - \mathbb{E}S_{in} + \|w_{in}\| \sqrt{\ln \frac{1}{\delta^2}}; \quad \text{Decide } S_n > \theta.$$

We see from the structure of the STST that as the test progresses the stopping thresholds converge to the importance threshold as more summands are observed and added to the sum. The general shape of the stopping boundary is a square root, as shown in Figure 2. Similarly, we can find the thresholds of the opposite decision error event $P(S_n \ge \theta \mid S_i < \tau_i)$, and get the thresholds $\tau_i = \theta + \mathbb{E}S_{in} - \|w_{in}\| \sqrt{\ln \frac{1}{\delta^2}}$, which are shown in Figure 2 as the bottom green line $\tau_i^l$.
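A small sketch (our own illustration, under the simplifying assumptions of equal unit weights and $\mathbb{E}S_{in} = 0$) that tabulates the upper and lower STST boundaries $\tau_i^u$ and $\tau_i^l$, making the square-root shape and the convergence to $\theta$ visible:

import math

def stst_boundaries(n, theta=0.0, delta=0.2, weight=1.0):
    """Upper/lower stopping boundaries for n equally weighted summands with ES_in = 0."""
    c = math.sqrt(math.log(1.0 / delta ** 2))
    rows = []
    for i in range(1, n + 1):
        rem_norm = weight * math.sqrt(n - i)   # ||w_in|| with equal weights
        upper = theta + rem_norm * c           # tau_i^u: stop, decide S_n > theta
        lower = theta - rem_norm * c           # tau_i^l: stop, decide S_n < theta
        rows.append((i, lower, upper))
    return rows

for i, lo, hi in stst_boundaries(n=100)[::20]:
    print(f"i={i:3d}  tau_lower={lo:7.2f}  tau_upper={hi:7.2f}")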

3.4. Extensions to the STST

The STST decision thresholds take the form $\tau_i = \theta - \mathbb{E}S_{in} + \|w_{in}\| f(\delta)$. It is possible to obtain better (smaller) thresholds by using the Central Limit Theorem. This type of improvement to the constants in the bounds has been used in the sequential estimation and computer science literature; as examples see Watanabe (2000) and Mnih et al. (2008). An interesting remark is that since our algorithm is derived from the decision error standpoint, the statistics that our method is based on are remainder quantities. This is in contrast to the statistics gathered by the algorithms proposed by Watanabe and by Mnih et al.; nevertheless, the same improvement applies. Therefore, by making a weak independence assumption on the summands, we can apply the Central Limit Theorem to find sharper stopping boundaries. We reformulate the probability of making an error as
$$P(S_n \le \theta \mid S_i > \tau_i) \lesssim P(S_n \le \theta \mid S_i = \tau_i) = P(S_{in} \le \theta - \tau_i) = P\!\left(\frac{S_{in} - \mathbb{E}S_{in}}{\sqrt{\mathrm{var}(S_{in})}} \le \frac{\theta - \tau_i - \mathbb{E}S_{in}}{\sqrt{\mathrm{var}(S_{in})}}\right).$$
By the central limit theorem, the standardized quantity on the left is approximately a standard normal random variable, with zero expectation and unit variance. We solve for the threshold by using the normal CDF $\Phi$:
$$\Phi\!\left(\frac{\theta - \tau_i - \mathbb{E}S_{in}}{\sqrt{\mathrm{var}(S_{in})}}\right) = \delta. \qquad (9)$$
This gives us the following stopping thresholds:
$$\tau_i = \theta - \mathbb{E}S_{in} + \sqrt{\mathrm{var}(S_{in})}\, \Phi^{-1}(1 - \delta). \qquad (10)$$

Figure 3. An experimental comparison of the error rates and speedups gained by using Hoeffding's constants vs. the constants gained by the Central Limit Theorem. We simulated 50,000 random walks, each comprised of 1,000 steps ($S_{in} \sim N(0, n - i)$). The stopping thresholds obtained by the CLT constants produce decision error rates that are very tight to the wanted error rates $\delta$. CLT gives preferable error rates and speedups compared to the Hoeffding thresholds. At a 20% error rate CLT only requires half the features to be evaluated.

Since $\Phi^{-1}(1 - \delta) < \sqrt{\ln \frac{1}{\delta^2}}$ and the variance of the random process $S_{in}$ is positive, the thresholds produced by the Central Limit Theorem are superior to those produced by Hoeffding's inequality. A comparison can be seen in Figure 3.
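For intuition, a short Python comparison of the two constants (our own sketch; note that the Hoeffding constant multiplies $\|w_{in}\|$ in equation 8 while the CLT constant multiplies $\sqrt{\mathrm{var}(S_{in})}$ in equation 10):

import math
from statistics import NormalDist

def hoeffding_constant(delta):
    return math.sqrt(math.log(1.0 / delta ** 2))   # multiplies ||w_in|| in eq. (8)

def clt_constant(delta):
    return NormalDist().inv_cdf(1.0 - delta)       # multiplies sqrt(var(S_in)) in eq. (10)

for delta in (0.05, 0.1, 0.2, 0.3):
    print(f"delta={delta:.2f}  Hoeffding={hoeffding_constant(delta):.3f}  "
          f"CLT={clt_constant(delta):.3f}")

For every error rate the CLT constant is markedly smaller, which is why the CLT thresholds stop the evaluation earlier.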

4. Curtailed Online Boosting

We apply the STST filtering framework to speed up Online Boosting by quickly rejecting examples of little importance. We measure the importance of an example by the size of its margin. Let the margin of an example $(x, y)$ be defined by
$$S_n = y \sum_{i=1}^{n} \alpha_i h_i(x) = \sum_{i=1}^{n} \alpha_i u_i,$$

where $\alpha_i$ is the weight assigned to the $i$th weak hypothesis $h_i$, and $u_i = y h_i(x)$. Let the label be defined by $y \in \{-1, 1\}$, the classification of the $i$th confidence-rated weak hypothesis by $h_i \in [-1, +1]$, and the weight assigned to the $i$th weak hypothesis $\alpha_i \in \mathbb{R}$. We define $S_n$ as the full margin, $S_i$ as the $i$th partial margin, and $S_{in} = S_n - S_i$ as the remaining margin. Once we have computed the $i$th partial margin we know its value $S_i$. Directly applying the STST we get the test:
$$\text{Stop if } S_i > \theta - \mathbb{E}S_{in} + \|\alpha_{in}\| \sqrt{\ln \frac{1}{\delta^2}}.$$

This test adapts to the weights learned by Online Boosting. We compute the norms of the remaining weights $\|\alpha_{in}\|$, $i = 1, \ldots, n-1$, every time a feature is updated. To compute the stopping thresholds efficiently we use the following sequential update for the norm of the remaining weights, $\|\alpha_{i+1,n}\|^2 = \|\alpha_{in}\|^2 - \alpha_{i+1}^2$, with the initial condition $\|\alpha_{1,n}\|^2 = \|\alpha\|^2 - \alpha_1^2$. The expectations are calculated by summing the partial sums until they are stopped. These norms and expectations are the only varying quantities in the STST. They are updated by cumulatively summing the changes from the last hypothesis that was updated to the first. Curtailed Online Boosting is detailed in Algorithm 1. It is based on a modification of Online Boosting to the more prevalent AdaBoost exponential weight update rule (Schapire & Singer, 1999; Pelossof et al., 2009). For clarity, the algorithm listing does not detail the Bayesian feature selection; however, it can be incorporated without loss of generality. In the case where we have trained a classifier and are interested in speeding up the evaluation of the score of new examples (the testing/deployment phase), the STST thresholds can be pre-computed and stored by using the variances and expectations from training. Additionally, by sorting the coordinates by weight, the curtailed algorithm can become extremely fast while maintaining accuracy.
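As an illustration of the deployment-phase use described above (a hedged Python sketch with hypothetical names, not the paper's implementation; the remaining-margin expectations are simply set to zero), the thresholds can be fixed once from the learned weights and then reused while scoring new examples:

import math

def precompute_thresholds(alphas, theta, delta, E_remaining=None):
    """Deployment-phase STST thresholds tau_i, one per weak hypothesis.

    Uses the recursion ||alpha_{i+1,n}||^2 = ||alpha_{in}||^2 - alpha_{i+1}^2.
    """
    n = len(alphas)
    if E_remaining is None:
        E_remaining = [0.0] * n            # crude stand-in for the learned expectations
    c = math.sqrt(math.log(1.0 / delta ** 2))
    rem_sq = sum(a * a for a in alphas)    # squared norm of all weights before any evaluation
    taus = []
    for i in range(n):
        rem_sq -= alphas[i] ** 2           # remove the weight that was just evaluated
        taus.append(theta - E_remaining[i] + math.sqrt(max(rem_sq, 0.0)) * c)
    return taus

def curtailed_score(x, hypotheses, alphas, taus):
    """Evaluate the boosted score, stopping once the partial score exceeds tau_i."""
    s = 0.0
    for i, (h, a) in enumerate(zip(hypotheses, alphas)):
        s += a * h(x)
        if s > taus[i]:                    # early exit: remaining features cannot, with
            return s, i + 1                # high probability, pull the score below theta
    return s, len(alphas)

# toy usage with three stump-like hypotheses (purely illustrative)
hyps = [lambda x: 1.0 if x > 0 else -1.0,
        lambda x: 1.0 if x > 1 else -1.0,
        lambda x: 1.0 if x > -1 else -1.0]
alphas = [0.7, 0.5, 0.3]
taus = precompute_thresholds(alphas, theta=0.0, delta=0.2)
print(curtailed_score(2.0, hyps, alphas, taus))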

Algorithm 1 Curtailed Online Boosting
  Input: $h_1, \ldots, h_n$; $(x_1, y_1), \ldots, (x_m, y_m)$; $\theta$, $\delta$
  Initialize: $\lambda_i^+, \lambda_i^- = \epsilon$; $\mathbb{E}S_{in}, \alpha_i = 0$; $i = 1, \ldots, n$
  Define: $u_{ji} = y_j h_i(x_j)$
  for $j = 1$ to $m$ do
    $S_0 = 0$
    $d_1 = 1$
    for $i = 1$ to $n$ do
      $S_i = S_{i-1} + \alpha_i u_{ji}$
      % Run Online Boosting
      $\lambda_i^- \leftarrow \lambda_i^- + d_i \mathbf{1}[u_{ji} = -1]$
      $\lambda_i^+ \leftarrow \lambda_i^+ + d_i \mathbf{1}[u_{ji} = +1]$
      $\alpha_i = \frac{1}{2} \log \frac{\lambda_i^+}{\lambda_i^-}$
      $d_{i+1} = d_i e^{-\alpha_i u_{ji}}$
      % Run STST
      if $S_i > \theta - \mathbb{E}S_{in} + \|\alpha_{in}\| \sqrt{\ln \frac{1}{\delta^2}}$ OR $i = n$ then
        for $k = i$ to $1$ do
          Update $\mathbb{E}S_{kn}$
          Update $\|\alpha_{kn}\|$
        end for
        Jump to next example
      end if
    end for
  end for
  Output: $\alpha_1, \ldots, \alpha_n$

5. Experiments

We conduct three experiments to test the speed and accuracy of Curtailed Online Boosting. The first is a synthetic experiment which enables us to visualize how the algorithm spends its computational power. The second is a synthetic experiment that compares the constants derived from Hoeffding's inequality with the constants derived from the Central Limit Theorem. Finally, we conduct a real world experiment on the MNIST dataset, which shows the speed advantage of Curtailed Online Boosting.

5.1. Synthetic experiments

The first synthetic experiment was set up to test the computational efficiency of Curtailed Online Boosting. Figure 1 visualizes a simple 2D synthetic set which was designed to show how the algorithm spends its computation. The data was created by adding uniform noise to two translated sine waves.

The examples were split randomly into training and test sets. Each set contains 100,000 examples. First, AdaBoost was trained on the data to obtain a set of 100 features. The features used are thresholded planes. These features are not online-learnable, and therefore were set beforehand. Figure 1 shows a random subset of 5000 examples from the test set and the resulting decision boundary found by Curtailed Online Boosting. We set $\theta = 0$ and $\delta = 0.2$. The algorithm fully calculated the margin of only 1233 examples out of the 100,000, which is about 1% of the training set. The average number of weak classifiers evaluated per example is 18 out of 100, a 5x speed-up over Online Boosting. The figure shows how most of the computation was allocated to examples near the decision boundary, and substantially less computation was allocated to examples that were easily classifiable and further away. Throughout the learning process, only 282 examples actually had a margin below the required threshold of 0, of which 6% were not fully evaluated due to early stopping. This is far below the required error rate of 20% that we set. AdaBoost, Online Boosting and Curtailed Online Boosting obtained the same generalization error of 0.2% at the end of training.

The second synthetic experiment was designed to compare the different constants derived from Hoeffding's inequality and from the Central Limit Theorem. We simulated 50,000 random walks, each of 1,000 steps drawn from a Gaussian distribution with mean zero and variance 1. Figure 3 shows the results of using the different constants with different desired decision error rates. The Central Limit Theorem provides a very sharp approximation to the desired error rate $\delta$, and provides much faster stopping times than the ones obtained with Hoeffding's constants.

5.2. MNIST

The MNIST dataset consists of 28 x 28 images of the digits [0, 9]. The dataset is split into a training set of 60,000 images and a test set of 10,000 images. All digits are represented in approximately equal amounts in each set. Similarly to the synthetic experiment, we trained a classifier in an offline manner to find a set of weak hypotheses. When training, we normalized the images to have zero mean and unit variance. We used decision stumps $h_i(x) = \mathrm{sign}(\|x_i - x\|_2 - \gamma)$ as our weak hypotheses. The weak learner found, for every boosting round, the vector $x_i$ and threshold $\gamma$ that create the weak hypothesis which minimizes the training error. As candidates for $x_i$ we used all the examples that were sampled from the training set at that boosting round. We partitioned the multi-class problem into 10 one-versus-all problems. Each of the 1-vs-all classifiers was trained with 1,000 features. We set $\delta = 0.1$, $\theta = 0$, and $\mathbb{E}S_{in} = 0$. Although setting the expectation to zero is very crude, it turns out to work well in this classification task. At the beginning of the training process the weights $\alpha_i$ are very bad estimates of the final weights. This causes the random walk to be highly unreliable, and the curtailment process inefficient early on. There may exist odd cases where the first coordinate would always be curtailed; we therefore always curtailed the margin evaluation process one coordinate after the point where the margin passed the threshold. As training progresses the efficiency improves and the decision error rate decreases. Throughout the entire process the decision error rate was under the required 10%. We looped over the training set 3 times for each of the online algorithms to allow them time to converge.

Table 1. MNIST test error in % for each classifier, and curtailment efficiency. Curtailed Online Boosting (COB) is compared to Online Boosting (OB) trained with the same average number of features that COB used throughout learning. COB always performs better than OB. COB achieves a speedup of 8x with only 0.01% loss in classification accuracy!

Classification error in %
Digit                           0     1     2     3     4     5     6     7     8     9
AdaBoost, 1000 features       0.33  0.20  0.78  0.94  0.90  1.01  0.53  0.81  1.63  1.35
OB, 1000 features             0.34  0.21  0.81  0.97  0.90  0.99  0.50  0.84  1.66  1.37
COB                           0.35  0.23  0.82  1.04  0.97  1.01  0.59  0.91  1.77  1.35
OB with COB avg. no. feat.    0.50  0.31  1.32  1.55  1.55  1.46  0.87  1.28  2.75  2.17

Curtailed Online Boosting computational efficiency
COB avg. no. of features       149   163   265   224   211   234   184   195   346   265
COB speedup                     7x    6x    4x    4x    5x    4x    5x    5x    3x    4x

The generalization error rates for each classifier can be seen in Table 1. The efficiency results are also shown in Table 1. We compare OB with COB by taking the average number of features that were evaluated by COB for each 1-vs-all set and training the equivalent OB classifier. It is evident that COB outperforms OB on all 10 training problems. We should emphasize that the average number of features that should be evaluated was found by specifying a decision error rate threshold, and cannot be found with the "vanilla" OB. Also note that each feature requires the evaluation of 28^2 multiplications, which makes the savings obtained by the STST even greater!
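For concreteness, a small Python sketch of the distance-based decision stump used as the weak hypothesis above (an illustrative reimplementation with our own names; on 28 x 28 images each evaluation is a Euclidean distance over 28^2 pixels):

import math

def distance_stump(center, gamma):
    """h(x) = sign(||center - x||_2 - gamma): +1 far from the chosen training image,
    -1 close to it."""
    def h(x):
        d = math.sqrt(sum((c - v) ** 2 for c, v in zip(center, x)))
        return 1.0 if d - gamma > 0 else -1.0
    return h

# toy usage on 4-pixel "images"
h = distance_stump(center=[0.1, 0.2, 0.0, 0.3], gamma=0.5)
print(h([0.1, 0.2, 0.1, 0.3]), h([0.9, 0.9, 0.9, 0.9]))   # -1.0 (close), 1.0 (far)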

6. Discussion and Future Work

We showed that by using sequential analysis we can derive a very simple rule that speeds up online algorithms while maintaining accuracy. However, we did not estimate the expected remaining margin accurately, which leaves room for improvement of the testing procedure. By making a few assumptions on the weak hypotheses and the margin distribution, we believe that a bound on the average number of features evaluated can be obtained. Furthermore, by obtaining upper and lower bounds on the random walk, it might be possible to derive a more sophisticated sequential test which trades off false positive error rates and true positive detection rates.

The same analysis applies to majority vote based score computation when the algorithm is deployed. Further improvements would be to extend the inequalities to deal with absolute scores instead of signed margins, as well as to use kernel functions instead of weak hypotheses.

In summary, we demonstrated that by adapting the amount of computation allocated to each example according to its importance, a margin based learning algorithm can be sped up by an order of magnitude without losing much of its generalization power.

References

Bourdev, Lubomir and Brandt, Jonathan. Robust object detection via soft cascade. In CVPR, volume 2, pp. 236-243, 2005.

Cesa-Bianchi, Nicolo, Gentile, Claudio, and Zaniboni, Luca. Worst-case analysis of selective sampling for linear classification. Journal of Machine Learning Research, 7:1205-1230, 2006.

Cohn, David, Ladner, Richard, and Waibel, Alex. Improving generalization with active learning. In Machine Learning, pp. 201-221, 1994.

Crammer, Koby, Dekel, Ofer, Keshet, Joseph, and Singer, Yoram. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551-585, 2006.

Dasgupta, Sanjoy, Kalai, Adam Tauman, and Monteleoni, Claire. Analysis of perceptron-based active learning. In COLT, pp. 249-263, 2005.

Freund, Yoav. An adaptive version of the boost by majority algorithm. Machine Learning, 43(3):293-318, 2001.

Freund, Yoav and Schapire, Robert E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.

Freund, Yoav, Shamir, Eli, and Tishby, Naftali. Selective sampling using the query by committee algorithm. In Machine Learning, pp. 133-168, 1997.

Jones, Michael and Viola, Paul. Face recognition using boosted local features. In International Conference on Computer Vision, 2003.

Kivinen, Jyrki and Warmuth, Manfred K. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1-63, 1997.

Lan, K. K. Gordon, Simon, Richard, and Halperin, Max. Stochastically curtailed tests in long-term clinical trials. Sequential Analysis, 1(3):207-219, 1982.

Long, Philip M. and Servedio, Rocco A. Martingale boosting. In Learning Theory, pp. 79-94, 2005.

Long, Philip M. and Servedio, Rocco A. Adaptive martingale boosting. In Neural Information Processing Systems, 2008.

Matas, Jiri and Sochman, Jan. Wald's sequential analysis for time-constrained vision problems, pp. 57-77. Lecture Notes in Electrical Engineering. Springer, 2007.

Mnih, V., Szepesvári, C., and Audibert, J.-Y. Empirical Bernstein stopping. In International Conference on Machine Learning, pp. 672-679. ACM, 2008.

Oza, N. and Russell, S. Online bagging and boosting. In Artificial Intelligence and Statistics, pp. 105-112, 2001.

Pelossof, Raphael, Jones, Michael, Vovsha, Ilia, and Rudin, Cynthia. Online coordinate boosting. In On-line Learning for Computer Vision Workshop, 2009.

Rosenblatt, F. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386-408, 1958.

Schapire, Robert E. and Singer, Yoram. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297-336, 1999.

Settles, Burr. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin-Madison, 2009.

Viola, Paul and Jones, Michael. Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition, 2001.

Wald, A. Sequential tests of statistical hypotheses. The Annals of Mathematical Statistics, 16(2):117-186, 1945.

Watanabe, Osamu. Simple sampling techniques for discovery science. IEICE Transactions on Communications and Systems, 2000.