Approximation Vector Machines for Large-scale Online Learning


Trung Le∗

[email protected]


Ho Chi Minh University of Pedagogy, Vietnam

Tu Dinh Nguyen

[email protected]

Centre for Pattern Recognition and Data Analytics, Australia

Vu Nguyen

[email protected]

Centre for Pattern Recognition and Data Analytics, Australia

Dinh Phung

[email protected]

Centre for Pattern Recognition and Data Analytics, Australia

Editor: TBA

Abstract

One of the most challenging problems in kernel online learning is to bound the model size and to promote model sparsity. Sparse models not only improve computation and memory usage, but also enhance the generalization capacity, a principle that concurs with the law of parsimony. However, inappropriate sparsity modeling may also significantly degrade the performance. In this paper, we propose the Approximation Vector Machine (AVM), a model that can simultaneously encourage sparsity and safeguard against the risk of compromising the performance. When an incoming instance arrives, we approximate it by one of its neighbors whose distance to it is less than a predefined threshold. Our key intuition is that since the newly seen instance is expressed by its nearby neighbor, the optimal performance can be analytically formulated and maintained. We develop theoretical foundations to support this intuition and further establish an analysis that characterizes the gap between the approximate and optimal solutions. This gap crucially depends on the frequency of approximation and the predefined threshold. We perform the convergence analysis for a wide spectrum of loss functions including Hinge, smooth Hinge, and Logistic for the classification task, and ℓ1, ℓ2, and ε-insensitive for the regression task. We conducted extensive experiments for the classification task in batch and online modes, and the regression task in online mode, over several benchmark datasets. The results show that our proposed AVM achieves predictive performance comparable with current state-of-the-art methods while simultaneously achieving significant computational speed-up, owing to the ability of the proposed AVM to maintain the model size.

Keywords: kernel, online learning, large-scale machine learning, sparsity, big data, core set, stochastic gradient descent, convergence analysis

∗. This work was carried out when the first author was supported by and visiting the Centre for Pattern Recognition and Data Analytics, Deakin University, Australia in 2015.

© 2016 Trung Le, Tu Dinh Nguyen, Vu Nguyen and Dinh Phung.


1. Introduction

In modern machine learning systems, data usually arrive continuously in streams. To enable efficient computation and handle memory effectively, the system should be able to adapt to incoming data. Online learning represents a family of efficient and scalable learning algorithms for building a predictive model incrementally from a sequence of data examples (Rosenblatt, 1958). In contrast to conventional learning algorithms (Joachims, 1999; Chang and Lin, 2011), which usually require a costly retraining procedure over the entire dataset when a new instance arrives, online learning aims to utilize each new incoming instance to improve the model, given knowledge of the correct answers to previously processed data and possibly additional available information. This makes online learning more suitable for large-scale applications wherein data usually arrive sequentially and evolve rapidly.

The seminal line of work in online learning, referred to as linear online learning (Rosenblatt, 1958; Crammer et al., 2006; Dredze et al., 2008), aims at learning a linear predictor in the input space. The crucial limitation of this approach lies in its oversimplified linear model, which may consequently fail to capture the non-linearity commonly seen in many real-world applications. This motivated the work on kernel-based online learning (Freund and Schapire, 1999; Kivinen et al., 2004), in which a linear model in the feature space corresponds to a nonlinear model in the input space and is more robust to a variety of data distributions. One common issue with the kernel-based online learning approach is that the model size (i.e., the number of vectors in the support set) may grow linearly with the data size accumulated over time, hence causing computational problems and potential memory overflow (Wang et al., 2012). Therefore, in practice, one prefers kernel online learning methods with a guarantee of a limited and bounded model size. In addition, enhancing model sparsity is also of great interest to practitioners, since this allows the generalization capacity to be improved, a principle known as Occam’s razor or the law of parsimony, and in many cases leads to faster computation as well. However, encouraging sparsity needs to be done with care, since an inappropriate sparsity-encouraging mechanism may compromise the performance.

In this paper, we propose the Approximation Vector Machine (AVM) to simultaneously encourage model sparsity¹ and maintain the model performance. The model size is theoretically proven to be bounded regardless of the data distribution and data arrival order. To promote sparsity, we introduce the notion of δ-coverage, which partitions the data space into cells whose diameters are less than δ (cf. Figure 1). This coverage can be constructed in advance or on the fly. Our experiments on real datasets show that the coverage can impressively boost sparsity; e.g., for the dataset KDDCup99 with 4,408,589 instances, using δ = 3 results in 115 cells (i.e., the model size is 115), and for the dataset airlines with 5,336,471 instances, using δ = 1 results in 388 cells (i.e., the model size is 388). When an incoming instance arrives, it may be approximated by the corresponding core point in the cell that contains it. Our intuition is that when an instance is approximated by a nearby core point, the performance would largely be preserved. We develop a rigorous theory to support this intuition. In particular, our convergence analysis explicitly characterizes the gap between the approximate solution produced by our proposed method and the optimal exact solution. This gap crucially depends on the cell diameter δ and the approximation process, whose technical details shall be made clear. The cell parameter δ controls the trade-off between the sparsity level and the model performance. Furthermore, we introduce a rigorous convergence analysis for the average, suffix-average, and single-point outputs (Rakhlin et al., 2012). In addition, our convergence analysis is valid for six popular loss functions, namely Hinge, smooth Hinge, and Logistic for the classification task, and ℓ2, ℓ1, and ε-insensitive for the regression task. We further conduct experiments to validate the proposed method on a variety of learning tasks, including classification in batch mode and classification and regression in online mode, on several benchmark large-scale datasets. The experimental results demonstrate that our proposed method maintains a comparable predictive performance while simultaneously achieving an order-of-magnitude computational speed-up compared with the baselines, due to its capacity to maintain the model size.

1. Model sparsity can be computed as the ratio of the model size to the number of vectors received so far.

Figure 1: An illustration of the hypersphere coverage for 1,000 data samples located in 3D space. We cover this dataset using hyperspheres with diameter δ = 7.0, resulting in 20 hypersphere cells as shown in the figure (cf. Sections 5.3 and 8). All data samples in the same cell are approximated by a core point in this cell. The model size is therefore significantly reduced from 1,000 to 20.²

2. In fact, we used a subset of the dataset a9a, which has 123 features. We then project all data points onto 3D using t-SNE. We note that t-SNE does not do clustering; it only reduces the dimensionality to 3D for visualization while trying to preserve the local properties of the data.

The rest of this paper is organized as follows. In Section 2, we review the works most related to ours. In Section 3, we introduce the problem setting. In Section 4, we recall the standard SGD for kernel online learning, which suffers from the curse of kernelization. In Section 5, we present our proposed AVM, including the technical details and the related notions. In Section 6, we investigate the suitability of the loss functions. In Section 7, we study the multi-class setting. Finally, in Section 8, we conduct extensive experiments and then discuss the experimental results. In addition, all proofs are given in the appendix sections.

2. Related Work

In this section, we review the works closely related to ours. One common goal of kernel online learning is to bound the model size and to encourage sparsity. Generally, research in this direction can be broadly grouped into the following themes.

Budgeted Online Learning. This approach limits the model size to a predefined budget B. When the model size exceeds the budget, a budget maintenance procedure is triggered to decrement the model size by one. Three popular budget maintenance strategies are removal, projection, and merging. In the removal strategy, the most redundant support vector is simply eliminated. In the projection strategy, the information of the most redundant support vector is preserved through its projection onto the linear span of the remaining support vectors. The merging strategy first selects two vectors and then merges them into one before discarding them. Forgetron (Dekel et al., 2005) is the first budgeted online learning method that employs the removal strategy for budget maintenance. At each iteration, if the classifier makes a mistake, it conducts a three-step update: (i) running the standard Perceptron (Rosenblatt, 1958) update; (ii) shrinking the coefficients of the support vectors with a scaling factor; and (iii) removing the support vector with the smallest coefficient. Randomized Budget Perceptron (RBP) (Cavallanti et al., 2007) randomly removes a support vector when the model size overflows the budget. Budget Perceptron (Crammer et al., 2004) and Budgeted Passive Aggressive (BPA-S) (Wang and Vucetic, 2010) attempt to discard the most redundant support vector (SV). The work of Orabona et al. (2009) uses projection to automatically discover the model size: a new vector is added to the support set if its projection onto the linear span of the others in the feature space exceeds a predefined threshold; otherwise its information is kept through the projection. Other works involving the projection strategy include Budgeted Passive Aggressive Nearest Neighbor (BPA-NN) (Wang and Vucetic, 2010; Wang et al., 2012). The merging strategy is used in several works (Wang and Vucetic, 2009; Wang et al., 2012).

Random Features. The idea of random features was proposed in Rahimi and Recht (2007). Its aim is to approximate a shift-invariant kernel using harmonic functions. In the context of kernel online learning, the problem of model size vanishes since we can store the model directly in the random feature space. However, the arising question is how to determine the appropriate number of random features D to sufficiently approximate the real kernel while keeping this dimension as small as possible for efficient computation. The work of Ming et al. (2014) investigated the number of random features in the kernel online learning context. Recently, Lu et al. (2015) proposed to run stochastic gradient descent (SGD) in the random feature space rather than in the real feature space. The theory accompanying this work shows that, with a high confidence level, SGD in the random feature space can sufficiently approximate that in the real kernel space.

Core Set. This approach utilizes a core set to represent the model. This core set can be constructed on the fly or in advance. Notable works include the Core Vector Machine (CVM) (Tsang et al., 2005) and its simplified version, the Ball Vector Machine (BVM) (Tsang et al., 2007). The CVM is based on a result in computational geometry (Badoiu and Clarkson, 2002) that allows a variation of ℓ2-SVM to be reformulated as the problem of finding a minimal enclosing ball (MEB); the core set includes the points lying furthest from the centre of the current MEB. Our work can be categorized into this line. However, our work is completely different from (Tsang et al., 2005, 2007) in the method used to determine the core set and to update the model. In addition, the works of (Tsang et al., 2005, 2007) were not designed for online learning.

3. Problem Setting

We consider the following two optimization problems for the batch and online settings, respectively:

$$\min_{w} f(w) \triangleq \frac{\lambda}{2}\|w\|^{2} + \mathbb{E}_{(x,y)\sim\mathbb{P}_{N}}\left[l(w;x,y)\right] = \frac{\lambda}{2}\|w\|^{2} + \frac{1}{N}\sum_{i=1}^{N} l(w;x_{i},y_{i}) \qquad (1)$$

$$\min_{w} f(w) \triangleq \frac{\lambda}{2}\|w\|^{2} + \mathbb{E}_{(x,y)\sim\mathbb{P}_{\mathcal{X},\mathcal{Y}}}\left[l(w;x,y)\right] \qquad (2)$$

where $l(w;x,y)$ is a convex loss function, $\mathbb{P}_{\mathcal{X},\mathcal{Y}}$ is the joint distribution of $(x,y)$ over $\mathcal{X}\times\mathcal{Y}$ with the data domain $\mathcal{X}$ and label domain $\mathcal{Y}$, and $\mathbb{P}_{N}$ specifies the empirical distribution over the training set $\mathcal{D}=\{(x_{1},y_{1}),\ldots,(x_{N},y_{N})\}$. Furthermore, we assume that the convex loss function $l(w;x,y)$ satisfies the following property: there exist two positive numbers $A$ and $B$ such that $\|l'(w;x,y)\|\le A\|w\|^{1/2}+B$ for all $w,x,y$. As demonstrated in Section 6, this condition is valid for all common loss functions.

It is clear that given any $w$, there exists a random variable $g$ such that $\mathbb{E}[g\mid w]=f'(w)$. In fact, we can specify $g=\lambda w+l'(w;x_{t},y_{t})$ where $(x_{t},y_{t})\sim\mathbb{P}_{\mathcal{X},\mathcal{Y}}$ or $\mathbb{P}_{N}$. To simplify the convergence analysis, we further assume that a radial kernel is used, i.e., $K(x,x')=k\left(\|x-x'\|^{2}\right)$. Without loss of generality, we also assume that $\|\Phi(x)\|^{2}=K(x,x)=1$ for all $x\in\mathcal{X}$. We denote the optimal solution of the optimization problem in Eq. (1) or (2) by $w^{*}$, that is, $w^{*}=\operatorname{argmin}_{w} f(w)$.

4. Stochastic Gradient Descent Method

We introduce the standard kernel stochastic gradient descent (SGD) in Algorithm 1. The standard learning rate $\eta_{t}=\frac{1}{\lambda t}$ is used, and $\alpha_{t}$ is a scalar such that $l'(w_{t};x_{t},y_{t})=\alpha_{t}\Phi(x_{t})$ (this scalar exists for all typical loss functions). It is apparent that this standard kernel SGD algorithm is vulnerable to the curse of kernelization: the size of the support set, i.e., the model size, grows almost linearly with the data size accumulated over time. Consequently, the computation gradually becomes slower or even infeasible when the data size grows rapidly.


Algorithm 1 Stochastic Gradient Descent algorithm.
Input: λ, K(·, ·)
1: w1 = 0
2: for t = 1, 2, . . . , T do
3:    Receive (xt, yt)    // (xt, yt) ∼ PX,Y or PN
4:    gt = λwt + l′(wt; xt, yt)
5:    wt+1 = wt − ηt gt = ((t − 1)/t) wt − ηt αt Φ(xt)
6: end for
Output: w̄T = (1/T) Σ_{t=1}^{T} wt or wT+1
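To make the curse of kernelization concrete, the following is a minimal Python sketch of Algorithm 1 with the Hinge loss and an RBF kernel. The class name `KernelSGD` and the parameters `lam` and `gamma` are illustrative choices rather than notation from the paper; the point is that the support set `sv` can gain one element per margin-violating instance, so the model size grows with the stream.

```python
import numpy as np

class KernelSGD:
    """A minimal sketch of Algorithm 1 (kernel SGD, Hinge loss, RBF kernel)."""

    def __init__(self, lam=1e-3, gamma=1.0):
        self.lam, self.gamma = lam, gamma
        self.sv, self.coef = [], []          # support vectors and their weights

    def _kernel(self, X, x):
        return np.exp(-self.gamma * np.sum((np.asarray(X) - np.asarray(x)) ** 2, axis=1))

    def decision(self, x):
        if not self.sv:
            return 0.0
        return float(np.dot(self.coef, self._kernel(self.sv, x)))

    def partial_fit(self, x, y, t):
        eta = 1.0 / (self.lam * t)           # standard learning rate 1/(lambda*t)
        margin = y * self.decision(x)        # computed with the current w_t
        # shrink: w_{t+1} = (1 - eta*lam) w_t - eta*alpha_t*Phi(x_t)
        self.coef = [(1.0 - eta * self.lam) * c for c in self.coef]
        if margin < 1.0:                     # Hinge loss: alpha_t = -y on a violation
            self.sv.append(np.asarray(x, dtype=float))
            self.coef.append(eta * y)        # -eta*alpha_t = eta*y
        # the support set can grow by one per violating instance: the
        # "curse of kernelization" discussed in the text.
```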

5. Approximation Vector Machines for Large-scale Online Learning

In this section, we introduce our proposed Approximation Vector Machine (AVM) for online learning. The main idea is that we employ an overlapping partition of sufficiently small cells to cover the data domain, i.e., $\mathcal{X}$ or $\Phi(\mathcal{X})$; when an instance arrives, we approximate it by the corresponding core point of the cell that contains it. Our intuition behind this approximation procedure is that since the instance is approximated by its neighbor, the performance would not be significantly compromised while a significant speedup is gained. We start this section with the definition of δ-coverage, its properties, and its connection with the feature space. We then present AVM and the convergence analysis.

5.1 δ-coverage over a domain

To facilitate our technical development in the sequel, we introduce the notion of δ-coverage in this subsection. We first start with the usual definition of the diameter of a set.

Definition 1 (diameter). Given a set $A$, the diameter of this set is defined as $D(A)=\sup_{x,x'\in A}\|x-x'\|$. This is the maximal pairwise distance between any two points in $A$.

Next, given a domain $\mathcal{X}$ (e.g., the data domain or input space), we introduce the concept of δ-coverage for $\mathcal{X}$ using a collection of sets.

Definition 2 (δ-coverage). The collection of sets $\mathcal{P}=(P_{i})_{i\in I}$ is said to be a δ-coverage of the domain $\mathcal{X}$ iff $\mathcal{X}\subset\cup_{i\in I}P_{i}$ and $D(P_{i})\le\delta$ for all $i\in I$, where $I$ is the index set (not necessarily discrete) and each element $P_{i}\in\mathcal{P}$ is further referred to as a cell. If the index set $I$ is finite, the collection $\mathcal{P}$ is called a finite δ-coverage.

Definition 3 (core set, core point). Given a δ-coverage $\mathcal{P}=(P_{i})_{i\in I}$ over a given domain $\mathcal{X}$, for each $i\in I$ we select an arbitrary point $c_{i}$ from the cell $P_{i}$; the collection of all $c_{i}$(s) is called the core set $C$ of the δ-coverage $\mathcal{P}$. Each point $c_{i}\in C$ is further referred to as a core point.

We show that these definitions also extend to the feature space with the mapping $\Phi$ and kernel $K$ via the following theorem.

Theorem 4. Assume that a radial kernel $K(x,x')=k\left(\|x-x'\|^{2}\right)$ is examined, where $k(\cdot)$ is a monotonically decreasing continuous function with $k(0)=1$, and $\Phi(\cdot)$ is its induced feature map. If $\mathcal{P}=(P_{i})_{i\in I}$ is a δ-coverage of the domain $\mathcal{X}$, then $\Phi(\mathcal{P})=(\Phi(P_{i}))_{i\in I}$ is a $\delta_{\Phi}$-coverage of the domain $\Phi(\mathcal{X})$, where $\delta_{\Phi}=\left[2\left(1-k(\delta^{2})\right)\right]^{1/2}$ is a monotonically increasing function of δ and $\lim_{\delta\to 0}\delta_{\Phi}=0$.

In particular, the Gaussian kernel given by $K(x,x')=\exp\left(-\gamma\|x-x'\|^{2}\right)$ is a radial kernel, and $\delta_{\Phi}=\left[2\left(1-\exp\left(-\gamma\delta^{2}\right)\right)\right]^{1/2}$. Theorem 4 further reveals that the image of a δ-coverage in the input space is a $\delta_{\Phi}$-coverage in the feature space, and when the diameter δ approaches 0, so does the induced diameter $\delta_{\Phi}$. For readability, the proof of this theorem is deferred to Appendix A. We have further developed methods and algorithms to efficiently construct a δ-coverage; to maintain comprehensibility, we defer this construction to Section 5.3.

5.2 Approximation Vector Machines

We now present our proposed Approximation Vector Machine (AVM) for online learning. Under an online setting, instances arrive on the fly and we need an efficient approach to incorporate incoming instances into the learner. Different from existing work (cf. Section 2), our approach is to construct a δ-coverage $\mathcal{P}=(P_{i})_{i\in I}$ over the input domain $\mathcal{X}$, and for each incoming instance $x$ we find the cell $P_{i}$ that contains this instance and approximate it by the core point $c_{i}\in P_{i}$. The coverage $\mathcal{P}$ and core set $C$ can either be constructed in advance or on the fly, as presented in Section 5.3.

Algorithm 2 Approximation Vector Machine.
Input: λ, K(·, ·), δ-coverage P = (Pi)i∈I
1: w1 = 0
2: for t = 1, . . . , T do
3:    Receive (xt, yt)
4:    Find it s.t. xt ∈ Pit
5:    l′(wt; xt, yt) = αt Φ(xt)
6:    Sample a Bernoulli random variable Zt
7:    if Zt = 1 then
8:        ht = λwt + αt Φ(cit)    // do approximation
9:    else
10:       ht = λwt + αt Φ(xt)
11:   end if
12:   wt+1 = wt − ηt ht, or wt+1 = Π_{B(0, ymax λ^{−1/2})}(wt − ηt ht) if the ℓ2 loss is used and λ ≤ 1
13: end for
Output: w̄T = (1/T) Σ_{t=1}^{T} wt or wT+1

In Algorithm 2, we note that at Step 6 we introduce a Bernoulli random variable $Z_{t}$ to govern the approximation procedure. This random variable could be either statistically independent of or dependent on the incoming instances and the current model. In Section 8.2, we report on different settings for $Z_{t}$ and how they influence the model size and learning performance. Our finding at the outset is that the naive setting with $\mathbb{P}(Z_{t}=1)=1$ for all $t$ (i.e., always performing approximation) returns the sparsest model while obtaining learning performance comparable to the other settings.
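As a concrete illustration of the update in Algorithm 2, the sketch below mirrors the kernel SGD sketch from Section 4 and adds the approximation step: when $Z_{t}=1$ the incoming instance is replaced by the core point of its cell, so coefficients accumulate on a bounded set of core points rather than on every instance. The `Coverage` object with `assign`/`core` methods is an assumed helper (a sketch of it is given after Algorithm 4 in Section 5.3), and the always-approximate setting $\mathbb{P}(Z_{t}=1)=1$ is used by default for simplicity; names here are illustrative, not from the paper.

```python
import numpy as np

class AVM:
    """Minimal sketch of Algorithm 2 (Hinge loss, RBF kernel).
    `coverage` is an assumed helper mapping an instance to its cell/core point."""

    def __init__(self, coverage, lam=1e-3, gamma=1.0, p_approx=1.0):
        self.coverage, self.lam, self.gamma = coverage, lam, gamma
        self.p_approx = p_approx             # P(Z_t = 1)
        self.coef = {}                       # cell index -> coefficient on Phi(c_i)
        self.extra_sv, self.extra_coef = [], []   # raw instances kept when Z_t = 0

    def _k(self, a, b):
        return np.exp(-self.gamma * np.sum((np.asarray(a) - np.asarray(b)) ** 2))

    def decision(self, x):
        s = sum(c * self._k(self.coverage.core(i), x) for i, c in self.coef.items())
        s += sum(c * self._k(v, x) for v, c in zip(self.extra_sv, self.extra_coef))
        return s

    def partial_fit(self, x, y, t):
        eta = 1.0 / (self.lam * t)
        i_t = self.coverage.assign(x)        # find (or create) the cell containing x
        margin = y * self.decision(x)        # uses the current w_t
        shrink = 1.0 - eta * self.lam        # w_{t+1} = (1 - eta*lam) w_t - eta*alpha_t*Phi(.)
        self.coef = {i: shrink * c for i, c in self.coef.items()}
        self.extra_coef = [shrink * c for c in self.extra_coef]
        if margin < 1.0:                     # Hinge loss: alpha_t = -y on a violation
            if np.random.rand() < self.p_approx:   # Z_t = 1: put the mass on the core point
                self.coef[i_t] = self.coef.get(i_t, 0.0) + eta * y
            else:                                   # Z_t = 0: keep the raw instance
                self.extra_sv.append(np.asarray(x, dtype=float))
                self.extra_coef.append(eta * y)
```

With `p_approx = 1`, the number of stored kernel-expansion terms is at most the number of cells, matching the always-on bound discussed in Section 5.2.3.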


Furthermore, to ensure that $\|w_{t}\|$ is bounded for all $t\ge 1$ in the case of the ℓ2 loss, if $\lambda\le 1$ we project $w_{t}-\eta_{t}h_{t}$ onto the hypersphere with centre at the origin and radius $y_{\max}\lambda^{-1/2}$, i.e., $B(0,y_{\max}\lambda^{-1/2})$. Since it can be shown that with the ℓ2 loss the optimal solution $w^{*}$ lies in $B(0,y_{\max}\lambda^{-1/2})$ (cf. Theorem 26 in Appendix C), this operation could possibly result in faster convergence. In addition, by reusing the previous information, this operation can be implemented efficiently. Finally, we note that with the ℓ2 loss and $\lambda>1$, we do not need to perform a projection to bound $\|w_{t}\|$ since, according to Theorem 28 in Appendix C, $\|w_{t}\|$ is bounded by $\frac{y_{\max}}{\lambda-1}$.

In what follows, we present the theoretical results for our proposed AVM, including the convergence analysis for a general convex or smooth loss function and the upper bound on the model size, under the assumption that the incoming instances are drawn from an arbitrary distribution and arrive in a random order.

5.2.1 Analysis for General Convex Loss

We start with the theoretical analysis for Algorithm 2. The decision of approximation (i.e., the random variable $Z_{t}$) may be statistically independent of or dependent on the current model (i.e., $w_{t}$) and the incoming instance (i.e., $(x_{t},y_{t})$). For example, one can propose an algorithm in which the approximation is performed iff the confidence level of the incoming instance w.r.t. the current model is greater than 1, i.e., $y_{t}w_{t}^{\top}\Phi(x_{t})\ge 1$. We develop the theory that takes into account all possible cases. Theorem 5 establishes an upper bound on the regret under the possible assumptions on the statistical relationship among the decision of approximation, the data distribution, and the current model. Based on Theorem 5, in Theorem 8 we further establish a high-probability inequality for the error incurred by a single-point output.

Theorem 5. Consider the running of Algorithm 2 where $(x_{t},y_{t})$ is uniformly sampled from the training set $\mathcal{D}$ or the joint distribution $\mathbb{P}_{\mathcal{X}\times\mathcal{Y}}$. The following statements hold:

i) If $Z_{t}$ and $w_{t}$ are independent for all $t$ (i.e., the decision of approximation only depends on the data distribution), then
$$\mathbb{E}[f(\bar{w}_{T})-f(w^{*})]\le\frac{H(\log(T)+1)}{2\lambda T}+\frac{\delta_{\Phi}M^{1/2}W^{1/2}}{T}\sum_{t=1}^{T}\mathbb{P}(Z_{t}=1)^{1/2}$$
where $H,M,W$ are positive constants.

ii) If $Z_{t}$ is independent of both $(x_{t},y_{t})$ and $w_{t}$ for all $t$ (i.e., the decision of approximation is independent of the current hyperplane and the data distribution), then
$$\mathbb{E}[f(\bar{w}_{T})-f(w^{*})]\le\frac{H(\log(T)+1)}{2\lambda T}+\frac{\delta_{\Phi}M^{1/2}W^{1/2}}{T}\sum_{t=1}^{T}\mathbb{P}(Z_{t}=1)$$

iii) In general, we always have
$$\mathbb{E}[f(\bar{w}_{T})-f(w^{*})]\le\frac{H(\log(T)+1)}{2\lambda T}+\delta_{\Phi}M^{1/2}W^{1/2}$$


Remark 6. Theorem 5 covers the standard convergence analysis. In particular, if the approximation procedure is never performed, i.e., $\mathbb{P}(Z_{t}=1)=0$ for all $t$, we recover the regret bound $\mathbb{E}[f(\bar{w}_{T})-f(w^{*})]\le\frac{H(\log(T)+1)}{2\lambda T}$.

Remark 7. Theorem 5 further indicates that there exists an error gap between the optimal and the approximate solutions. When δ decreases to 0, this gap also decreases to 0. Specifically, when δ = 0 (so does $\delta_{\Phi}$), any incoming instance is approximated by itself and consequently the gap is exactly 0.

Theorem 8. Let us define the gap $d_{T}$, which is $\frac{\delta_{\Phi}M^{1/2}W^{1/2}}{T}\sum_{t=1}^{T}\mathbb{P}(Z_{t}=1)^{1/2}$ (if $Z_{t}$ is independent of $w_{t}$), $\frac{\delta_{\Phi}M^{1/2}W^{1/2}}{T}\sum_{t=1}^{T}\mathbb{P}(Z_{t}=1)$ (if $Z_{t}$ is independent of $(x_{t},y_{t})$ and $w_{t}$), or $\delta_{\Phi}M^{1/2}W^{1/2}$ (in general). Let $r$ be any number randomly picked from $\{1,2,\ldots,T\}$. With probability at least $1-\delta$, the following statement holds
$$f(w_{r})-f(w^{*})\le\frac{H(\log(T)+1)}{2\lambda T}+d_{T}+\Delta_{T}\sqrt{\frac{1}{2}\log\frac{1}{\delta}}$$
where $\Delta_{T}=\max_{1\le t\le T}\left(f(w_{t})-f(w^{*})\right)$.

We now present the convergence analysis for the case when we output the α-suffix average result as proposed in (Rakhlin et al., 2012). With $0<\alpha<1$, let us denote
$$\bar{w}_{T}^{\alpha}=\frac{1}{\alpha T}\sum_{t=(1-\alpha)T+1}^{T}w_{t}$$
where we assume that fractional indices are rounded to their ceilings. Theorem 9 establishes an upper bound on the regret for the α-suffix average case, followed by Theorem 10, which establishes a high-probability inequality for the error incurred by an α-suffix average output.

Theorem 9. Consider the running of Algorithm 2 where $(x_{t},y_{t})$ is uniformly sampled from the training set $\mathcal{D}$ or the joint distribution $\mathbb{P}_{\mathcal{X}\times\mathcal{Y}}$. The following statements hold:

i) If $Z_{t}$ and $w_{t}$ are independent for all $t$ (i.e., the decision of approximation only depends on the data distribution), then
$$\mathbb{E}[f(\bar{w}_{T}^{\alpha})-f(w^{*})]\le\frac{\lambda(1-\alpha)}{2\alpha}W_{T}^{\alpha}+\frac{\delta_{\Phi}M^{1/2}W^{1/2}}{\alpha T}\sum_{t=(1-\alpha)T+1}^{T}\mathbb{P}(Z_{t}=1)^{1/2}+\frac{H\log(1/(1-\alpha))}{2\lambda\alpha T}$$
where $H,M,W$ are positive constants and $W_{T}^{\alpha}=\mathbb{E}\left[\|w_{(1-\alpha)T+1}-w^{*}\|^{2}\right]$.

ii) If $Z_{t}$ is independent of both $(x_{t},y_{t})$ and $w_{t}$ for all $t$ (i.e., the decision of approximation is independent of the current hyperplane and the data distribution), then
$$\mathbb{E}[f(\bar{w}_{T}^{\alpha})-f(w^{*})]\le\frac{\lambda(1-\alpha)}{2\alpha}W_{T}^{\alpha}+\frac{\delta_{\Phi}M^{1/2}W^{1/2}}{\alpha T}\sum_{t=(1-\alpha)T+1}^{T}\mathbb{P}(Z_{t}=1)+\frac{H\log(1/(1-\alpha))}{2\lambda\alpha T}$$

iii) In general, we always have
$$\mathbb{E}[f(\bar{w}_{T}^{\alpha})-f(w^{*})]\le\frac{\lambda(1-\alpha)}{2\alpha}W_{T}^{\alpha}+\delta_{\Phi}M^{1/2}W^{1/2}+\frac{H\log(1/(1-\alpha))}{2\lambda\alpha T}$$


Theorem 10. Let us once again define the induced gap $d_{T}$, which is respectively $\frac{\lambda(1-\alpha)}{2\alpha}W_{T}^{\alpha}+\frac{\delta_{\Phi}M^{1/2}W^{1/2}}{\alpha T}\sum_{t=(1-\alpha)T+1}^{T}\mathbb{P}(Z_{t}=1)^{1/2}$ (if $Z_{t}$ is independent of $w_{t}$), $\frac{\lambda(1-\alpha)}{2\alpha}W_{T}^{\alpha}+\frac{\delta_{\Phi}M^{1/2}W^{1/2}}{\alpha T}\sum_{t=(1-\alpha)T+1}^{T}\mathbb{P}(Z_{t}=1)$ (if $Z_{t}$ is independent of $(x_{t},y_{t})$ and $w_{t}$), or $\frac{\lambda(1-\alpha)}{2\alpha}W_{T}^{\alpha}+\delta_{\Phi}M^{1/2}W^{1/2}$ (in general). Let $r$ be any number randomly picked from $\{(1-\alpha)T+1,\ldots,T\}$. With probability at least $1-\delta$, the following statement holds
$$f(w_{r})-f(w^{*})\le\frac{H\log(1/(1-\alpha))}{2\lambda\alpha T}+d_{T}+\Delta_{T}^{\alpha}\sqrt{\frac{1}{2}\log\frac{1}{\delta}}$$
where $\Delta_{T}^{\alpha}=\max_{(1-\alpha)T+1\le t\le T}\left(f(w_{t})-f(w^{*})\right)$.

Remark 11. Theorems 8 and 10 provide a theoretical guarantee when rendering any single-point output $w_{r}$ rather than the average outputs. The upper bound gained in Theorem 10 is tighter than that gained in Theorem 8 in the sense that the quantity $\frac{H\log(1/(1-\alpha))}{2\lambda\alpha T}+\Delta_{T}^{\alpha}\sqrt{\frac{1}{2}\log\frac{1}{\delta}}$ decreases faster and may decrease to 0 when $T\to+\infty$ given a confidence level $1-\delta$.

5.2.2 Analysis for Smooth Loss

Definition 12. A loss function $l(w;x,y)$ is said to be µ-strongly smooth w.r.t. a norm $\|\cdot\|$ iff for all $w,w'$ and $(x,y)$ the following condition is satisfied
$$l(w';x,y)\le l(w;x,y)+l'(w;x,y)^{\top}(w'-w)+\frac{\mu}{2}\|w'-w\|^{2}$$
Another equivalent definition of a µ-strongly smooth function is
$$\|l'(w;x,y)-l'(w';x,y)\|_{*}\le\mu\|w-w'\|$$
where $\|\cdot\|_{*}$ is used to represent the dual norm of the norm $\|\cdot\|$. It is well known that:

• The ℓ2 loss is 1-strongly smooth w.r.t. $\|\cdot\|_{2}$.
• The Logistic loss is 1-strongly smooth w.r.t. $\|\cdot\|_{2}$.
• The τ-smooth Hinge loss (Shalev-Shwartz and Zhang, 2013) is $\frac{1}{\tau}$-strongly smooth w.r.t. $\|\cdot\|_{2}$.

Theorem 13. Assume that the ℓ2, Logistic, or τ-smooth Hinge loss is used, and let us denote $L=\frac{\lambda}{2}+1$, $\frac{\lambda}{2}+1$, or $\frac{\lambda}{2}+\tau^{-1}$, respectively. Let us define the gap $d_{T}$ as in Theorem 10. Let $r$ be any number randomly picked from $\{(1-\alpha)T+1,\ldots,T\}$. With probability at least $1-\delta$, the following statement holds
$$f(w_{r})-f(w^{*})\le\frac{H\log(1/(1-\alpha))}{2\lambda\alpha T}+d_{T}+\frac{LM_{T}^{\alpha}}{2}\sqrt{\frac{1}{2}\log\frac{1}{\delta}}$$
where $M_{T}^{\alpha}=\max_{(1-\alpha)T+1\le t\le T}\|w_{t}-w^{*}\|$.


Remark 14. Theorem 13 extends Theorem 10 to the case of a smooth loss function. This allows the gap $\frac{H\log(1/(1-\alpha))}{2\lambda\alpha T}+\frac{LM_{T}^{\alpha}}{2}\sqrt{\frac{1}{2}\log\frac{1}{\delta}}$ to be quantified more precisely in terms of the discrepancy in the model itself rather than that in the objective function. This gap could possibly decrease rapidly when $T$ approaches $+\infty$.

5.2.3 Upper Bound of Model Size

In what follows, we present the theoretical results regarding the model size and sparsity level of our proposed AVM. Theorem 15 shows that AVM offers a high level of freedom to control the model size. Especially, if we use the always-on setting (i.e., $\mathbb{P}(Z_{t}=1)=1$ for all $t$), the model size is bounded regardless of the data distribution and data arrival order.

Theorem 15. Let us denote $\mathbb{P}(Z_{t}=1)=p_{t}$, $\mathbb{P}(Z_{t}=0)=q_{t}$, and the number of cells found after iteration $t$ by $M_{t}$. If we define the model size, i.e., the size of the support set, after iteration $t$ by $S_{t}$, the following statement holds
$$\mathbb{E}[S_{T}]\le\sum_{t=1}^{T}q_{t}+\sum_{t=1}^{T}p_{t}\,\mathbb{E}[M_{t}-M_{t-1}]\le\sum_{t=1}^{T}q_{t}+\mathbb{E}[M_{T}]$$
Specially, for some specific settings of $p_{t}$ we can bound the model size $\mathbb{E}[S_{T}]$ as follows:

i) If $p_{t}=1$ for all $t$, then $\mathbb{E}[S_{T}]\le\mathbb{E}[M_{T}]\le|\mathcal{P}|$, where $|\mathcal{P}|$ specifies the size of the partition $\mathcal{P}$, i.e., its number of cells.

ii) If $p_{t}=\max\left(0,1-\frac{\beta}{t}\right)$ for all $t$, then $\mathbb{E}[S_{T}]\le\beta(\log(T)+1)+\mathbb{E}[M_{T}]$.

iii) If $p_{t}=\max\left(0,1-\frac{\beta}{t^{\rho}}\right)$ for all $t$, where $0<\rho<1$, then $\mathbb{E}[S_{T}]\le\frac{\beta T^{1-\rho}}{1-\rho}+\mathbb{E}[M_{T}]$.

iv) If $p_{t}=\max\left(0,1-\frac{\beta}{t^{\rho}}\right)$ for all $t$, where $\rho>1$, then $\mathbb{E}[S_{T}]\le\beta\zeta(\rho)+\mathbb{E}[M_{T}]\le\beta\zeta(\rho)+|\mathcal{P}|$, where $\zeta(\cdot)$ is the Riemann zeta function defined by the integral $\zeta(s)=\frac{1}{\Gamma(s)}\int_{0}^{+\infty}\frac{t^{s-1}}{e^{t}-1}\,dt$.

Remark 16. We use the two parameters β and ρ to flexibly control the rate of approximation $p_{t}$. It is evident that when β increases, the rate of approximation decreases and consequently the model size and accuracy increase. On the other hand, when ρ increases, the rate of approximation increases as well, and it follows that the model size and accuracy decrease. We conducted experiments to investigate how the variation of these two parameters influences the model size and accuracy (cf. Section 8.2).

Remark 17. Items i) and iv) in Theorem 15 indicate that if $\mathbb{P}(Z_{t}=1)=p_{t}=\max\left(0,1-\frac{\beta}{t^{\rho}}\right)$, where $\rho>1$ or $\rho=+\infty$, then the model size is bounded by $\beta\zeta(\rho)+|\mathcal{P}|$ (by convention we define $\zeta(+\infty)=0$). In fact, the tight upper bound is $\beta\zeta(\rho)+\mathbb{E}[M_{T}]$, where $M_{T}$ is the number of unique cells used so far. It is empirically shown that $M_{T}$ can be very small compared with $T$ and $|\mathcal{P}|$. In addition, since the support sets of all $w_{t}$ ($1\le t\le T$) lie in the core set, if we output the average $\bar{w}_{T}=\frac{1}{T}\sum_{t=1}^{T}w_{t}$ or the α-suffix average $\bar{w}_{T}^{\alpha}=\frac{1}{\alpha T}\sum_{t=(1-\alpha)T+1}^{T}w_{t}$, the model size is still bounded.

Remark 18. Items ii) and iii) in Theorem 15 indicate that if $\mathbb{P}(Z_{t}=1)=p_{t}=\max\left(0,1-\frac{\beta}{t^{\rho}}\right)$, where $0<\rho\le 1$, then although the model size is not bounded, it grows slowly compared with $T$, i.e., as $\log(T)$ or $T^{1-\rho}$ when ρ is close to 1.
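To make the schedules in Theorem 15 concrete, the short sketch below computes the approximation probabilities $p_{t}=\max(0,1-\beta/t^{\rho})$ and the resulting bound $\beta\zeta(\rho)+|\mathcal{P}|$ on the expected model size for $\rho>1$. The specific numbers chosen for beta, rho and the cell count are arbitrary illustrations, not values from the paper.

```python
def p_schedule(t, beta=10.0, rho=2.0):
    """Approximation probability p_t = max(0, 1 - beta / t^rho)."""
    return max(0.0, 1.0 - beta / t ** rho)

def expected_size_bound(beta, rho, num_cells):
    """Bound E[S_T] <= beta * zeta(rho) + |P| from Theorem 15 (iv), valid for rho > 1."""
    zeta_rho = sum(1.0 / k ** rho for k in range(1, 100000))   # truncated zeta(rho)
    return beta * zeta_rho + num_cells

if __name__ == "__main__":
    print([round(p_schedule(t), 3) for t in (1, 2, 5, 10, 100)])   # ramps up to 1
    print(expected_size_bound(beta=10.0, rho=2.0, num_cells=115))  # ~131.4
```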


5.3 Construction of δ-Coverage

In this section, we present two methods to construct a finite δ-coverage. The first method employs hypersphere cells (cf. Algorithm 3), whereas the second method utilizes hyperrectangle cells (cf. Algorithm 4). In these two methods, the cells in the coverage are constructed on the fly as the incoming instances arrive. Both are theoretically proven to yield a finite coverage.

Algorithm 3 Constructing a hypersphere δ-coverage.
1: P = ∅
2: n = 0
3: for t = 1, 2, . . . do
4:    Receive (xt, yt)
5:    it = argmin_{k ≤ n} ||xt − ck||
6:    if ||xt − cit|| ≥ δ/2 then
7:        n = n + 1
8:        cn = xt
9:        it = n
10:       P = P ∪ [B(cn, δ/2)]
11:   end if
12: end for

Algorithm 3 employs a collection of open hypersphere cells $B(c,R)$, defined as $B(c,R)=\{x\in\mathbb{R}^{d}:\|x-c\|<R\}$, to cover the data domain. Similarly, Algorithm 4 uses a collection of open hyperrectangles $R(c,a)$, given by $R(c,a)=\{x\in\mathbb{R}^{d}:\|x-c\|_{\infty}<a\}$, to cover the data domain. Both Algorithms 3 and 4 are constructed in the same spirit: if the incoming instance $(x_{t},y_{t})$ lies outside all current cells, a new cell whose centre or vertex is this instance is generated. It is noteworthy that the variable $i_{t}$ in these two algorithms specifies the cell that contains the new incoming instance and coincides with $i_{t}$ in Algorithm 2. Theorem 19 establishes that regardless of the data distribution and data arrival order, Algorithms 3 and 4 always generate a finite δ-coverage, which implies a bound on the model size of AVM.

Theorem 19. Consider the coverages formed by running Algorithms 3 and 4. If the data domain $\mathcal{X}$ is compact (i.e., closed and bounded), then these coverages are all finite δ-coverages whose sizes depend only on the data domain $\mathcal{X}$ and are independent of the sequence of incoming data instances $(x_{t},y_{t})$ received.

Remark 20. Theorem 19 also reveals that regardless of the data arrival order, the model size of AVM is always bounded (cf. Remark 17). Referring to the work of (Cucker and Smale, 2002), it is known that this model size cannot exceed $\left(\frac{4D(\mathcal{X})}{\delta}\right)^{d}$. However, for many possible data arrival orders, the number of active cells, i.e., the model size of AVM, is significantly smaller than this theoretical bound.


Algorithm 4 Constructing a hyperrectangle δ-coverage.
1: P = ∅
2: a = δ/√d
3: n = 0
4: for t = 1, 2, . . . do
5:    Receive (xt, yt)
6:    it = 0
7:    for i = 1 to n do
8:        if ||xt − ci||∞ < a then
9:            it = i
10:           break
11:       end if
12:   end for
13:   if it = 0 then
14:       n = n + 1
15:       cn = xt
16:       it = n
17:       P = P ∪ [R(cn, a)]
18:   end if
19: end for
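The following is a minimal Python sketch of the on-the-fly hypersphere coverage of Algorithm 3, exposed through the `assign`/`core` interface assumed by the AVM sketch in Section 5.2. The linear scan over existing centres is the straightforward implementation; the class and method names are illustrative choices, not from the paper.

```python
import numpy as np

class SphereCoverage:
    """On-the-fly hypersphere delta-coverage (a sketch of Algorithm 3).
    A new cell B(c, delta/2) centred at the instance is created whenever the
    instance is at distance >= delta/2 from every existing centre."""

    def __init__(self, delta):
        self.delta = delta
        self.centres = []                    # core points c_1, ..., c_n

    def core(self, i):
        return self.centres[i]

    def assign(self, x):
        x = np.asarray(x, dtype=float)
        if self.centres:
            dists = np.linalg.norm(np.asarray(self.centres) - x, axis=1)
            i_t = int(np.argmin(dists))
            if dists[i_t] < self.delta / 2.0:
                return i_t                   # x falls inside an existing cell
        self.centres.append(x)               # otherwise open a new cell at x
        return len(self.centres) - 1
```

Used with the earlier AVM sketch, e.g. `model = AVM(SphereCoverage(delta=3.0))` followed by streaming calls to `model.partial_fit(x, y, t)`, the number of stored centres plays the role of the cell count in Theorem 15.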

6. Suitability of Loss Functions

We introduce six loss functions that can be used in our proposed algorithm, namely Hinge, Logistic, ℓ2, ℓ1, ε-insensitive, and τ-smooth Hinge. We verify that these loss functions satisfy the necessary condition, that is, $\|l'(w;x,y)\|\le A\|w\|^{1/2}+B$ for some appropriate positive numbers $A,B$. For comprehensibility, without loss of generality, we assume that $\|\Phi(x)\|=K(x,x)^{1/2}=1$ for all $x\in\mathcal{X}$. At the outset of this section, it is noteworthy that for the classification task (i.e., the Hinge, Logistic, and τ-smooth Hinge cases), the label $y$ is either $-1$ or $1$, which immediately implies $|y|=y^{2}=1$.

• Hinge loss
$$l(w;x,y)=\max\left\{0,\,1-yw^{\top}\Phi(x)\right\}$$
$$l'(w;x,y)=-\mathbb{I}_{\{yw^{\top}\Phi(x)\le 1\}}\,y\Phi(x)$$
where $\mathbb{I}_{S}$ is the indicator function, which renders 1 if the logical statement $S$ is true and 0 otherwise. Therefore, by choosing $A=0$, $B=1$ we have
$$\|l'(w;x,y)\|\le\|\Phi(x)\|\le 1=A\|w\|^{1/2}+B$$

• ℓ2 loss

In this case, we cannot verify at the outset that $\|l'(w;x,y)\|\le A\|w\|^{1/2}+B$ for all $w,x,y$. However, to support the proposed theory, we only need to check that $\|l'(w_{t};x,y)\|\le A\|w_{t}\|^{1/2}+B$ for all $t\ge 1$. We derive as follows:
$$l(w;x,y)=\frac{1}{2}\left(y-w^{\top}\Phi(x)\right)^{2}$$
$$l'(w;x,y)=\left(w^{\top}\Phi(x)-y\right)\Phi(x)$$
$$\|l'(w_{t};x,y)\|=\left|w_{t}^{\top}\Phi(x)-y\right|\,\|\Phi(x)\|\le\left|w_{t}^{\top}\Phi(x)\right|+y_{\max}\le\|\Phi(x)\|\,\|w_{t}\|+y_{\max}\le A\|w_{t}\|^{1/2}+B$$
where $B=y_{\max}$ and
$$A=\begin{cases}y_{\max}^{1/2}\lambda^{-1/4} & \text{if }\lambda\le 1\\ y_{\max}^{1/2}(\lambda-1)^{-1/2} & \text{otherwise.}\end{cases}$$
Here we make use of the fact that $\|w_{t}\|\le y_{\max}(\lambda-1)^{-1}$ if $\lambda>1$ (cf. Theorem 28 in Appendix C) and $\|w_{t}\|\le y_{\max}\lambda^{-1/2}$ otherwise (cf. Line 12 in Algorithm 2).

• ℓ1 loss
$$l(w;x,y)=\left|y-w^{\top}\Phi(x)\right|$$
$$l'(w;x,y)=\operatorname{sign}\left(w^{\top}\Phi(x)-y\right)\Phi(x)$$
Therefore, by choosing $A=0$, $B=1$ we have
$$\|l'(w;x,y)\|=\|\Phi(x)\|\le 1=A\|w\|^{1/2}+B$$

• Logistic loss
$$l(w;x,y)=\log\left(1+\exp\left(-yw^{\top}\Phi(x)\right)\right)$$
$$l'(w;x,y)=\frac{-y\exp\left(-yw^{\top}\Phi(x)\right)}{\exp\left(-yw^{\top}\Phi(x)\right)+1}\,\Phi(x)$$
Therefore, by choosing $A=0$, $B=1$ we have
$$\|l'(w;x,y)\|<\|\Phi(x)\|\le 1=A\|w\|^{1/2}+B$$

• ε-insensitive loss
$$l(w;x,y)=\max\left\{0,\,\left|y-w^{\top}\Phi(x)\right|-\varepsilon\right\}$$
$$l'(w;x,y)=\mathbb{I}_{\{|y-w^{\top}\Phi(x)|>\varepsilon\}}\operatorname{sign}\left(w^{\top}\Phi(x)-y\right)\Phi(x)$$
Therefore, by choosing $A=0$, $B=1$ we have
$$\|l'(w;x,y)\|\le\|\Phi(x)\|\le 1=A\|w\|^{1/2}+B$$

• τ-smooth Hinge loss (Shalev-Shwartz and Zhang, 2013)
$$l(w;x,y)=\begin{cases}0 & \text{if }yw^{\top}\Phi(x)>1\\ 1-yw^{\top}\Phi(x)-\frac{\tau}{2} & \text{if }yw^{\top}\Phi(x)<1-\tau\\ \frac{1}{2\tau}\left(1-yw^{\top}\Phi(x)\right)^{2} & \text{otherwise}\end{cases}$$
$$l'(w;x,y)=-\mathbb{I}_{\{yw^{\top}\Phi(x)<1-\tau\}}\,y\Phi(x)-\frac{1}{\tau}\mathbb{I}_{\{1-\tau\le yw^{\top}\Phi(x)\le 1\}}\left(1-yw^{\top}\Phi(x)\right)y\Phi(x)$$
Since the magnitude of the coefficient of $\Phi(x)$ is at most 1 in every case, by choosing $A=0$, $B=1$ we again have
$$\|l'(w;x,y)\|\le\|\Phi(x)\|\le 1=A\|w\|^{1/2}+B$$
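Since Algorithms 1 and 2 only need the scalar $\alpha_{t}$ with $l'(w_{t};x_{t},y_{t})=\alpha_{t}\Phi(x_{t})$, the derivatives above reduce to a small amount of code. The sketch below computes $\alpha_{t}$ from the current prediction $s=w_{t}^{\top}\Phi(x_{t})$ for each of the six losses; it is an illustration of the formulas in this section, with function and argument names chosen here rather than taken from the paper.

```python
import math

def alpha_t(loss, s, y, eps=0.1, tau=0.5):
    """Scalar alpha_t such that l'(w; x, y) = alpha_t * Phi(x),
    where s = w^T Phi(x) is the current prediction."""
    if loss == "hinge":
        return -y if y * s <= 1.0 else 0.0
    if loss == "logistic":
        return -y * math.exp(-y * s) / (1.0 + math.exp(-y * s))
    if loss == "smooth_hinge":                     # tau-smooth Hinge
        m = y * s
        if m > 1.0:
            return 0.0
        if m < 1.0 - tau:
            return -y
        return -y * (1.0 - m) / tau
    if loss == "l2":
        return s - y
    if loss == "l1":
        return math.copysign(1.0, s - y)
    if loss == "eps_insensitive":
        return math.copysign(1.0, s - y) if abs(y - s) > eps else 0.0
    raise ValueError("unknown loss")
```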



iv) Consider $p_{t}=\max\left(0,1-\frac{\beta}{t^{\rho}}\right)$ for all $t$, where $\rho>1$. It follows that
$$q_{t}=1-p_{t}\le 1-\left(1-\frac{\beta}{t^{\rho}}\right)=\frac{\beta}{t^{\rho}}$$
From Eq. (9), we gain
$$\mathbb{E}[S_{T}]\le\beta\sum_{t=1}^{T}\frac{1}{t^{\rho}}+\mathbb{E}[M_{T}]\le\beta\zeta(\rho)+\mathbb{E}[M_{T}]\le\beta\zeta(\rho)+|\mathcal{P}|$$
where $\zeta(\cdot)$ is the Riemann zeta function defined by the integral $\zeta(s)=\frac{1}{\Gamma(s)}\int_{0}^{+\infty}\frac{t^{s-1}}{e^{t}-1}\,dt$.

We now show the proof of Theorem 8. To realize this proof, we use the well-known Hoeffding inequality, which for completeness we state below.

Theorem (Hoeffding inequality). Let $X_{1},\ldots,X_{n}$ be independent random variables with $a_{i}\le X_{i}\le b_{i}$ for each $i\in[n]$. Let $S=\sum_{i=1}^{n}X_{i}$ and $\Delta_{i}=b_{i}-a_{i}$. The following hold:
$$\text{i) }\;\mathbb{P}\left(S-\mathbb{E}[S]>\varepsilon\right)\le\exp\left(-\frac{2\varepsilon^{2}}{\sum_{i=1}^{n}\Delta_{i}^{2}}\right)\qquad\text{ii) }\;\mathbb{P}\left(\left|S-\mathbb{E}[S]\right|>\varepsilon\right)\le 2\exp\left(-\frac{2\varepsilon^{2}}{\sum_{i=1}^{n}\Delta_{i}^{2}}\right)$$

Proof of Theorem 8. From Eqs. (4, 6, 8), we achieve
$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}[f(w_{t})]-f(w^{*})\le\frac{H(\log(T)+1)}{2\lambda T}+d_{T}$$

Let us denote $X=f(w_{r})-f(w^{*})$, where $r$ is uniformly sampled from $\{1,2,\ldots,T\}$. We have
$$\mathbb{E}_{r}[X]=\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}[f(w_{t})]-f(w^{*})\le\frac{H(\log(T)+1)}{2\lambda T}+d_{T}$$
It follows that
$$\mathbb{E}[X]=\mathbb{E}_{(x_{t},y_{t})_{t=1}^{T}}\left[\mathbb{E}_{r}[X]\right]\le\frac{H(\log(T)+1)}{2\lambda T}+d_{T}$$
Let us denote $\Delta_{T}=\max_{1\le t\le T}\left(f(w_{t})-f(w^{*})\right)$, which implies that $0<f(w_{r})-f(w^{*})<\Delta_{T}$. Applying the Hoeffding inequality to the random variable $X$, we gain
$$\mathbb{P}\left(X-\mathbb{E}[X]>\varepsilon\right)\le\exp\left(-\frac{2\varepsilon^{2}}{\Delta_{T}^{2}}\right)$$
$$\mathbb{P}\left(X-\frac{H(\log(T)+1)}{2\lambda T}-d_{T}>\varepsilon\right)\le\exp\left(-\frac{2\varepsilon^{2}}{\Delta_{T}^{2}}\right)$$
$$\mathbb{P}\left(X\le\frac{H(\log(T)+1)}{2\lambda T}+d_{T}+\varepsilon\right)>1-\exp\left(-\frac{2\varepsilon^{2}}{\Delta_{T}^{2}}\right)$$
Choosing $\delta=\exp\left(-\frac{2\varepsilon^{2}}{\Delta_{T}^{2}}\right)$, or equivalently $\varepsilon=\Delta_{T}\sqrt{\frac{1}{2}\log\frac{1}{\delta}}$, then with probability at least $1-\delta$ we have
$$f(w_{r})-f(w^{*})\le\frac{H(\log(T)+1)}{2\lambda T}+d_{T}+\Delta_{T}\sqrt{\frac{1}{2}\log\frac{1}{\delta}}$$

Proof of Theorem 9. We denote $W_{T}^{\alpha}=\mathbb{E}\left[\|w_{(1-\alpha)T+1}-w^{*}\|^{2}\right]$. Our proof proceeds as follows.

i) If $Z_{t}$ is independent of $w_{t}$, taking the sum in Eq. (3) over $t=(1-\alpha)T+1,\ldots,T$, we gain
$$\sum_{t=(1-\alpha)T+1}^{T}\mathbb{E}[f(w_{t})]-\alpha T f(w^{*})\le\frac{\lambda(1-\alpha)T}{2}W_{T}^{\alpha}+\frac{H}{2\lambda}\sum_{t=(1-\alpha)T+1}^{T}\frac{1}{t}+\sum_{t=(1-\alpha)T+1}^{T}\mathbb{P}(Z_{t}=1)^{1/2}\,\mathbb{E}\left[\|w_{t}-w^{*}\|^{2}\right]^{1/2}\mathbb{E}\left[\|\Delta_{t}\|^{2}\right]^{1/2}\qquad(10)$$
$$\le\frac{\lambda(1-\alpha)T}{2}W_{T}^{\alpha}+\frac{H\log(1/(1-\alpha))}{2\lambda}+\delta_{\Phi}M^{1/2}W^{1/2}\sum_{t=(1-\alpha)T+1}^{T}\mathbb{P}(Z_{t}=1)^{1/2}$$
where we have used the inequality $\sum_{t=(1-\alpha)T+1}^{T}\frac{1}{t}\le\log(1/(1-\alpha))$. Hence
$$\mathbb{E}[f(\bar{w}_{T}^{\alpha})-f(w^{*})]\le\frac{\lambda(1-\alpha)}{2\alpha}W_{T}^{\alpha}+\frac{\delta_{\Phi}M^{1/2}W^{1/2}}{\alpha T}\sum_{t=(1-\alpha)T+1}^{T}\mathbb{P}(Z_{t}=1)^{1/2}+\frac{H\log(1/(1-\alpha))}{2\lambda\alpha T}$$

ii) If $Z_{t}$ is independent of $w_{t}$ and $(x_{t},y_{t})$, taking the sum in Eq. (5) over $t=(1-\alpha)T+1,\ldots,T$, we gain
$$\sum_{t=(1-\alpha)T+1}^{T}\mathbb{E}[f(w_{t})]-\alpha T f(w^{*})\le\frac{\lambda(1-\alpha)T}{2}W_{T}^{\alpha}+\frac{H}{2\lambda}\sum_{t=(1-\alpha)T+1}^{T}\frac{1}{t}+\sum_{t=(1-\alpha)T+1}^{T}\mathbb{P}(Z_{t}=1)\,\mathbb{E}\left[\|w_{t}-w^{*}\|^{2}\right]^{1/2}\mathbb{E}\left[\|\Delta_{t}\|^{2}\right]^{1/2}\qquad(11)$$
$$\le\frac{\lambda(1-\alpha)T}{2}W_{T}^{\alpha}+\frac{H\log(1/(1-\alpha))}{2\lambda}+\delta_{\Phi}M^{1/2}W^{1/2}\sum_{t=(1-\alpha)T+1}^{T}\mathbb{P}(Z_{t}=1)$$
Hence
$$\mathbb{E}[f(\bar{w}_{T}^{\alpha})-f(w^{*})]\le\frac{\lambda(1-\alpha)}{2\alpha}W_{T}^{\alpha}+\frac{\delta_{\Phi}M^{1/2}W^{1/2}}{\alpha T}\sum_{t=(1-\alpha)T+1}^{T}\mathbb{P}(Z_{t}=1)+\frac{H\log(1/(1-\alpha))}{2\lambda\alpha T}$$

iii) In general, taking the sum in Eq. (7) over $t=(1-\alpha)T+1,\ldots,T$, we gain
$$\sum_{t=(1-\alpha)T+1}^{T}\mathbb{E}[f(w_{t})]-\alpha T f(w^{*})\le\frac{\lambda(1-\alpha)T}{2}W_{T}^{\alpha}+\frac{H}{2\lambda}\sum_{t=(1-\alpha)T+1}^{T}\frac{1}{t}+\sum_{t=(1-\alpha)T+1}^{T}\mathbb{E}\left[\|w_{t}-w^{*}\|^{2}\right]^{1/2}\mathbb{E}\left[\|\Delta_{t}\|^{2}\right]^{1/2}$$
$$\le\frac{\lambda(1-\alpha)T}{2}W_{T}^{\alpha}+\frac{H\log(1/(1-\alpha))}{2\lambda}+\alpha T\,\delta_{\Phi}M^{1/2}W^{1/2}\qquad(12)$$
Hence
$$\mathbb{E}[f(\bar{w}_{T}^{\alpha})-f(w^{*})]\le\frac{\lambda(1-\alpha)}{2\alpha}W_{T}^{\alpha}+\delta_{\Phi}M^{1/2}W^{1/2}+\frac{H\log(1/(1-\alpha))}{2\lambda\alpha T}$$
Proof of Theorem 10. The proof of this theorem is similar to that of Theorem 8, which relies on the Hoeffding inequality. From Eqs. (10, 11, 12), we achieve
$$\frac{1}{\alpha T}\sum_{t=(1-\alpha)T+1}^{T}\mathbb{E}[f(w_{t})]-f(w^{*})\le\frac{H\log(1/(1-\alpha))}{2\lambda\alpha T}+d_{T}$$
Let us denote $X=f(w_{r})-f(w^{*})$, where $r$ is uniformly sampled from $\{(1-\alpha)T+1,\ldots,T\}$. We have
$$\mathbb{E}_{r}[X]=\frac{1}{\alpha T}\sum_{t=(1-\alpha)T+1}^{T}\mathbb{E}[f(w_{t})]-f(w^{*})\le\frac{H\log(1/(1-\alpha))}{2\lambda\alpha T}+d_{T}$$
It follows that
$$\mathbb{E}[X]=\mathbb{E}_{(x_{t},y_{t})_{t=1}^{T}}\left[\mathbb{E}_{r}[X]\right]\le\frac{H\log(1/(1-\alpha))}{2\lambda\alpha T}+d_{T}$$
Let us denote $\Delta_{T}^{\alpha}=\max_{(1-\alpha)T+1\le t\le T}\left(f(w_{t})-f(w^{*})\right)$, which implies that $0<f(w_{r})-f(w^{*})<\Delta_{T}^{\alpha}$. Applying the Hoeffding inequality to the random variable $X$, we gain
$$\mathbb{P}\left(X-\mathbb{E}[X]>\varepsilon\right)\le\exp\left(-\frac{2\varepsilon^{2}}{(\Delta_{T}^{\alpha})^{2}}\right)$$
$$\mathbb{P}\left(X-\frac{H\log(1/(1-\alpha))}{2\lambda\alpha T}-d_{T}>\varepsilon\right)\le\exp\left(-\frac{2\varepsilon^{2}}{(\Delta_{T}^{\alpha})^{2}}\right)$$
$$\mathbb{P}\left(X\le\frac{H\log(1/(1-\alpha))}{2\lambda\alpha T}+d_{T}+\varepsilon\right)>1-\exp\left(-\frac{2\varepsilon^{2}}{(\Delta_{T}^{\alpha})^{2}}\right)$$
Choosing $\delta=\exp\left(-\frac{2\varepsilon^{2}}{(\Delta_{T}^{\alpha})^{2}}\right)$, or equivalently $\varepsilon=\Delta_{T}^{\alpha}\sqrt{\frac{1}{2}\log\frac{1}{\delta}}$, then with probability at least $1-\delta$ we have
$$f(w_{r})-f(w^{*})\le\frac{H\log(1/(1-\alpha))}{2\lambda\alpha T}+d_{T}+\Delta_{T}^{\alpha}\sqrt{\frac{1}{2}\log\frac{1}{\delta}}$$

Proof of Theorem 13. It is apparent that $f(w)$ is $L$-strongly smooth w.r.t. $\|\cdot\|_{2}$. Therefore, we have
$$f(w_{r})-f(w^{*})\le f'(w^{*})^{\top}(w_{r}-w^{*})+\frac{L}{2}\|w_{r}-w^{*}\|^{2}\le\frac{L}{2}\|w_{r}-w^{*}\|^{2}$$
It follows that $\Delta_{T}^{\alpha}\le\frac{1}{2}LM_{T}^{\alpha}$. Hence we gain the conclusion.

Appendix C. Proofs of Bound for ℓ2 Loss

We now consider the upper bound of $\|w^{*}\|$ in the case that the ℓ2 loss is used for the regression problem. Concretely, we have the following theorem.

Theorem 26. If $w^{*}=\operatorname{argmin}_{w}\left(\frac{\lambda}{2}\|w\|^{2}+\frac{1}{N}\sum_{i=1}^{N}\left(y_{i}-w^{\top}\Phi(x_{i})\right)^{2}\right)$ then $\|w^{*}\|\le y_{\max}\lambda^{-1/2}$.

Proof. Let us consider the equivalent constrained optimization problem
$$\min_{w,\xi}\left(\frac{\lambda}{2}\|w\|^{2}+\frac{1}{N}\sum_{i=1}^{N}\xi_{i}^{2}\right)\quad\text{s.t.: }\xi_{i}=y_{i}-w^{\top}\Phi(x_{i}),\;\forall i$$
The Lagrange function is of the following form
$$\mathcal{L}(w,\xi,\alpha)=\frac{\lambda}{2}\|w\|^{2}+\frac{1}{N}\sum_{i=1}^{N}\xi_{i}^{2}+\sum_{i=1}^{N}\alpha_{i}\left(y_{i}-w^{\top}\Phi(x_{i})-\xi_{i}\right)$$
Setting the derivatives to 0, we gain
$$\nabla_{w}\mathcal{L}=\lambda w-\sum_{i=1}^{N}\alpha_{i}\Phi(x_{i})=0\;\rightarrow\;w=\lambda^{-1}\sum_{i=1}^{N}\alpha_{i}\Phi(x_{i})$$
$$\nabla_{\xi_{i}}\mathcal{L}=\frac{2}{N}\xi_{i}-\alpha_{i}=0\;\rightarrow\;\xi_{i}=\frac{N\alpha_{i}}{2}$$
Substituting the above into the Lagrange function, we gain the dual form
$$W(\alpha)=-\frac{\lambda}{2}\|w\|^{2}+\sum_{i=1}^{N}y_{i}\alpha_{i}-\frac{N}{4}\sum_{i=1}^{N}\alpha_{i}^{2}=-\frac{1}{2\lambda}\left\|\sum_{i=1}^{N}\alpha_{i}\Phi(x_{i})\right\|^{2}+\sum_{i=1}^{N}y_{i}\alpha_{i}-\frac{N}{4}\sum_{i=1}^{N}\alpha_{i}^{2}$$
Let $(w^{*},\xi^{*})$ and $\alpha^{*}$ be the primal and dual solutions, respectively. Since strong duality holds, we have
$$\frac{\lambda}{2}\|w^{*}\|^{2}+\frac{1}{N}\sum_{i=1}^{N}\xi_{i}^{*2}=-\frac{\lambda}{2}\|w^{*}\|^{2}+\sum_{i=1}^{N}y_{i}\alpha_{i}^{*}-\frac{N}{4}\sum_{i=1}^{N}\alpha_{i}^{*2}$$
$$\lambda\|w^{*}\|^{2}=\sum_{i=1}^{N}y_{i}\alpha_{i}^{*}-\frac{N}{4}\sum_{i=1}^{N}\alpha_{i}^{*2}-\frac{1}{N}\sum_{i=1}^{N}\xi_{i}^{*2}\le\sum_{i=1}^{N}\left(y_{i}\alpha_{i}^{*}-\frac{N}{4}\alpha_{i}^{*2}\right)\le\sum_{i=1}^{N}\frac{y_{i}^{2}}{N}\le y_{\max}^{2}$$
We note that we have used $g(\alpha_{i}^{*})=y_{i}\alpha_{i}^{*}-\frac{N}{4}\alpha_{i}^{*2}\le g\left(\frac{2y_{i}}{N}\right)=\frac{y_{i}^{2}}{N}$. Hence, we gain the conclusion.

Lemma 27. Assume that the ℓ2 loss is used; then the following statement holds
$$\|w_{T+1}\|\le\lambda^{-1}\left(y_{\max}+\frac{1}{T}\sum_{t=1}^{T}\|w_{t}\|\right)$$
where $y_{\max}=\max_{y\in\mathcal{Y}}|y|$.

Proof. We have the following
$$w_{t+1}=\begin{cases}\frac{t-1}{t}w_{t}-\eta_{t}\alpha_{t}\Phi(x_{t}) & \text{if }Z_{t}=0\\ \frac{t-1}{t}w_{t}-\eta_{t}\alpha_{t}\Phi(c_{i_{t}}) & \text{otherwise}\end{cases}$$
It follows that
$$\|w_{t+1}\|\le\frac{t-1}{t}\|w_{t}\|+\frac{1}{\lambda t}|\alpha_{t}|\quad\text{since }\|\Phi(x_{t})\|=\|\Phi(c_{i_{t}})\|=1$$
It happens that $l'(w_{t};x_{t},y_{t})=\alpha_{t}\Phi(x_{t})$. Hence, we gain
$$|\alpha_{t}|=\left|y_{t}-w_{t}^{\top}\Phi(x_{t})\right|\le y_{\max}+\|w_{t}\|\,\|\Phi(x_{t})\|\le y_{\max}+\|w_{t}\|$$
It implies that
$$t\|w_{t+1}\|\le(t-1)\|w_{t}\|+\lambda^{-1}\left(y_{\max}+\|w_{t}\|\right)$$
Taking the sum over $t=1,2,\ldots,T$, we achieve
$$T\|w_{T+1}\|\le\lambda^{-1}\left(Ty_{\max}+\sum_{t=1}^{T}\|w_{t}\|\right)$$
$$\|w_{T+1}\|\le\lambda^{-1}\left(y_{\max}+\frac{1}{T}\sum_{t=1}^{T}\|w_{t}\|\right)$$

Theorem 28. If $\lambda>1$ then $\|w_{T+1}\|\le\frac{y_{\max}}{\lambda-1}\left(1-\frac{1}{\lambda^{T}}\right)$.

log (T + 1) and hence can

References

M. Badoiu and K. L. Clarkson. Optimal core-sets for balls. In Proc. of DIMACS Workshop on Computational Geometry, 2002.

G. Cavallanti, N. Cesa-Bianchi, and C. Gentile. Tracking the best hyperplane with a simple budget perceptron. Machine Learning, 69(2-3):143–167, 2007.

C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol., 2(3):27:1–27:27, May 2011.

K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. J. Mach. Learn. Res., 2:265–292, March 2002.

K. Crammer, J. Kandola, and Y. Singer. Online classification on a budget. In Advances in Neural Information Processing Systems 16. MIT Press, 2004.

K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. J. Mach. Learn. Res., 7:551–585, 2006.

F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39:1–49, 2002.

O. Dekel, S. Shalev-Shwartz, and Y. Singer. The forgetron: A kernel-based perceptron on a fixed budget. In Advances in Neural Information Processing Systems, pages 259–266, 2005.

M. Dredze, K. Crammer, and F. Pereira. Confidence-weighted linear classification. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, pages 264–271, New York, NY, USA, 2008. ACM.

Y. Freund and R. E. Schapire. Large margin classification using the perceptron algorithm. Mach. Learn., 37(3):277–296, December 1999.

J. Hensman, N. Fusi, and N. D. Lawrence. Gaussian processes for big data. In Uncertainty in Artificial Intelligence, page 282. Citeseer, 2013.

T. Joachims. Making large-scale support vector machine learning practical. In Advances in Kernel Methods, pages 169–184. MIT Press, Cambridge, MA, USA, 1999.

J. Kivinen, A. J. Smola, and R. C. Williamson. Online learning with kernels. IEEE Transactions on Signal Processing, 52:2165–2176, August 2004.

J. Lu, S. C. H. Hoi, J. Wang, P. Zhao, and Z.-Y. Liu. Large scale online kernel learning. J. Mach. Learn. Res., 2015.

L. Ming, W. Shifeng, and Z. Changshui. On the sample complexity of random fourier features for online learning: How many random fourier features do we need? ACM Trans. Knowl. Discov. Data, 8(3):13:1–13:19, June 2014.

F. Orabona, J. Keshet, and B. Caputo. Bounded kernel-based online learning. J. Mach. Learn. Res., 10:2643–2666, December 2009.

A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Neural Information Processing Systems, 2007.

A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In International Conference on Machine Learning (ICML-12), pages 449–456, 2012.

F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408, 1958.

S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14(1):567–599, 2013.

S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30, 2011.

I. W. Tsang, J. T. Kwok, P. Cheung, and N. Cristianini. Core vector machines: Fast SVM training on very large data sets. Journal of Machine Learning Research, 6:363–392, 2005.

I. W. Tsang, A. Kocsor, and J. T. Kwok. Simpler core vector machines with enclosing balls. In Proceedings of the 24th International Conference on Machine Learning, ICML ’07, pages 911–918, 2007.

Z. Wang and S. Vucetic. Twin vector machines for online learning on a budget. In Proceedings of the SIAM International Conference on Data Mining, pages 906–917, 2009.

Z. Wang and S. Vucetic. Online passive-aggressive algorithms on a budget. In AISTATS, volume 9, pages 908–915, 2010.

Z. Wang, K. Crammer, and S. Vucetic. Breaking the curse of kernelization: Budgeted stochastic gradient descent for large-scale SVM training. J. Mach. Learn. Res., 13(1):3103–3131, 2012.

K. Zhang, L. Lan, Z. Wang, and F. Moerchen. Scaling up kernel SVM on limited resources: A low-rank linearization approach. In International Conference on Artificial Intelligence and Statistics, pages 1425–1434, 2012.

P. Zhao, J. Wang, P. Wu, R. Jin, and S. C. H. Hoi. Fast bounded online gradient descent algorithms for scalable kernel-based online learning. CoRR, 2012.
