
What Objective Does Self-paced Learning Indeed Optimize?

Qian Zhao
School of Mathematics and Statistics
Xi'an Jiaotong University
[email protected]

Deyu Meng
School of Mathematics and Statistics
Xi'an Jiaotong University
[email protected]

Abstract

Self-paced learning (SPL) has been attracting increasing attention in machine learning and computer vision. Although empirically shown to be effective, it has barely been investigated theoretically; it is not even known what objective a general SPL regime converges to. Toward this issue, this study attempts to provide some initial insights into this "heuristic" learning scheme. Specifically, we prove that the solution strategy of SPL exactly accords with a majorization minimization (MM) algorithm, a well-known technique in optimization and machine learning, implemented on a latent objective. A more interesting finding is that the loss function contained in this latent objective has a configuration similar to the non-convex regularized penalty (NCRP), an attractive topic in statistics and machine learning. In particular, we show that the previous hard and linear self-paced regularizers are equivalent to the capped-norm and minimax concave plus (MCP) penalties, respectively, both widely investigated in statistics. Such connections between SPL and previously known research lead to new insights into SPL, such as its convergence behavior and the rationale for its parameter setting. The correctness of the proposed theory is substantiated by experimental results on synthetic and UCI data sets.

1 Introduction

Since being proposed, curriculum learning (CL) [1] and self-paced learning (SPL) [2] have been attracting increasing attention in the machine learning, computer vision and multimedia analysis communities. The philosophy of this paradigm is to simulate the learning principle of humans/animals, who generally start by learning easier aspects of a task and then gradually take more complex examples into training [3]. Instead of heuristically designing a curriculum by ranking samples based on manually preset easiness measurements, as CL does [4, 5], SPL further formulates this ad-hoc idea as a concise model by introducing a regularization term into the learning objective. This amelioration lets a sound SPL regime automatically optimize an appropriate curriculum by the model itself, which helps the model generalize to diverse applications and avoids the problem of subjectively setting "easiness" measures [6, 7, 8, 9]. Albeit rational in intuition and effective in practice, there has been little research into the mechanism underlying SPL. Specifically, even though it is easy to prove that the SPL regime converges when solved by an alternative search strategy (ASS) on the SPL model, it is still unknown what this SPL iteration converges to. There does not even exist rigorous theoretical evidence clarifying why SPL is capable of performing robustly, especially in the presence of outliers or heavy noise. Such in-depth investigations, however, are extremely necessary for future developments of SPL, and will illuminate whether the SPL methodology should be dismissed as an ungrounded nine-day wonder, or pursued as a rigorous and solid scientific research direction worthy of further exploration.

This study aims at providing some theoretical evidence to illuminate the insights underlying SPL. Our main results can be summarized in the following three points.

Firstly, we prove that the ASS algorithm commonly utilized to solve the SPL problem exactly accords with the widely known majorization minimization (MM) algorithm [10] implemented on a latent SPL objective function. In the past decade, MM has attracted much attention in machine learning and optimization, and many elegant theoretical results on it have been brought forward. This bridge is thus helpful for analyzing the properties underlying the SPL solution strategy, like convergence and stability, with the aid of this MM knowledge.

Secondly, we prove that the loss function contained in this latent SPL objective has a close relationship to the non-convex regularized penalty (NCRP), an attractive research branch in statistics and machine learning. Specifically, we discover that multiple current SPL realizations exactly comply with well-known NCRP terms; e.g., the hard and linear SPL regimes are equivalent to optimizations on losses with the capped-norm penalty [11, 12, 13] and the minimax concave plus penalty (MCP) [14], respectively. This equivalence on one hand facilitates a deeper comprehension of SPL by employing the off-the-shelf theoretical tools and statistical results of NCRP research, and on the other hand provides more choices for formulating NCRP terms and a new viewpoint from which to reexamine that research direction.

Thirdly, by understanding SPL optimization as an NCRP loss minimization problem, we provide an easy explanation of why SPL is able to perform robustly in the presence of extreme outliers or heavy noise, and accordingly offer new insights into the intrinsic working mechanism of the previously utilized SPL regimes. A rational termination criterion for controlling the age parameter in the SPL iteration can also be deduced accordingly.

This paper is organized as follows. Section 2 introduces related work. Section 3 presents our main theoretical results, clarifying the relationships between the ASS and MM algorithms as well as between SPL and NCRP problems. Section 4 shows experimental results verifying these theoretical results. A concluding remark is finally made.

2 Related Work

Curriculum Learning. Inspired by the intrinsic learning principle of humans/animals, Bengio et al. [1] formalized the fundamental definition of CL. The core idea is to incrementally involve samples into learning, where easy samples are introduced first and more complex ones are gradually included once the learner is ready for them. These gradually included samples, from easy to complex, correspond to the curricula learned at different growth stages of humans/animals. This strategy, as supported by empirical evaluation, is helpful for alleviating the local-optimum problem in nonconvex optimization [15, 16].

Self-paced Learning. Instead of using the aforementioned heuristic strategies, Kumar et al. [2] formulated the key principle of CL as a concise SPL model. Formally, given a training dataset $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$, in which $\mathbf{x}_i$ denotes the $i$-th observed sample and $y_i$ represents its label, let $L(y_i, g(\mathbf{x}_i, \mathbf{w}))$ denote the loss function that calculates the cost between the ground-truth label $y_i$ and the estimated one $g(\mathbf{x}_i, \mathbf{w})$. Here $\mathbf{w}$ represents the model parameter inside the decision function $g$. The SPL model consists of a weighted loss term on all samples and a general self-paced regularizer imposed on the sample weights:

$$\min_{\mathbf{w},\,\mathbf{v}\in[0,1]^n} E(\mathbf{w},\mathbf{v};\lambda) = \sum_{i=1}^{n}\big(v_i L(y_i, g(\mathbf{x}_i,\mathbf{w})) + f(v_i,\lambda)\big), \qquad (1)$$

where $\lambda$ is the age parameter controlling the learning pace, and $f(v,\lambda)$ represents the self-paced regularizer (SP-regularizer), whose intrinsic conditions have been theoretically abstracted in [17, 18]. By jointly learning the model parameter $\mathbf{w}$ and the latent weights $\mathbf{v} = [v_1, \cdots, v_n]^T$ with ASS under a gradually increasing age parameter, more samples are automatically included into training, from easy to complex, in a purely self-paced way.

Multiple variations of this SPL regime, like self-paced reranking [17], self-paced learning with diversity [19], self-paced curriculum learning [20] and self-paced multiple-instance learning [21], have been proposed under the format (1). The effectiveness of this SPL paradigm, especially its robustness on highly corrupted data, has been validated in various machine learning and computer vision tasks, such as object detector adaptation [6], specific-class segmentation learning [7], visual category discovery [8], and long-term tracking [9]. In particular, the SPL paradigm has been integrated into the system developed by the CMU Informedia team, which achieved the leading performance in the challenging TRECVID MED/MER competition organized by NIST in 2014 [22]. There is, however, little investigation into theoretically explaining the intrinsic mechanism behind the effectiveness of SPL. In this paper, we attempt to take a first step toward this issue.

Non-convex Regularized Penalty. NCRP has been demonstrated, both theoretically and practically, to have attractive properties for sparse estimation, and has attracted much attention in machine learning and statistics in recent years. Various NCRP realizations have been proposed; typical ones include the capped-norm penalty [11, 12, 13], the minimax concave plus (MCP) penalty [14] and the smoothly clipped absolute deviation (SCAD) penalty [23]. The mathematical forms of these NCRP terms in the one-dimensional case are as follows [24]:

$$\begin{aligned}
\text{CAP:}\quad & p^{CAP}_{\gamma,\lambda}(t) = \gamma\min(|t|,\lambda),\ \lambda>0;\\[4pt]
\text{MCP:}\quad & p^{MCP}_{\gamma,\lambda}(t) =
\begin{cases}
\gamma\big(|t| - \frac{t^2}{2\gamma\lambda}\big), & \text{if } |t| < \gamma\lambda,\\
\frac{\gamma^2\lambda}{2}, & \text{if } |t| \ge \gamma\lambda;
\end{cases}\\[4pt]
\text{SCAD:}\quad & p^{SCAD}_{\gamma,\lambda}(t) =
\begin{cases}
\lambda|t|, & \text{if } |t| \le \lambda,\\
-\frac{t^2 - 2\gamma\lambda|t| + \lambda^2}{2(\gamma-1)}, & \text{if } \lambda < |t| \le \gamma\lambda,\\
\frac{(\gamma+1)\lambda^2}{2}, & \text{if } |t| > \gamma\lambda.
\end{cases}
\end{aligned} \qquad (2)$$

In this work, we will prove the close relationship between these NCRP terms and conventional SPL models.
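To make the piecewise forms in (2) concrete, here is a minimal Python sketch of the three penalties in the one-dimensional case (a sketch for illustration only; the function names are ours, not from the cited works):

```python
import numpy as np

def cap_penalty(t, lam, gamma):
    """Capped-norm penalty: gamma * min(|t|, lambda)."""
    return gamma * np.minimum(np.abs(t), lam)

def mcp_penalty(t, lam, gamma):
    """Minimax concave plus (MCP) penalty, parametrized as in Eq. (2)."""
    a = np.abs(t)
    return np.where(a < gamma * lam,
                    gamma * (a - a ** 2 / (2 * gamma * lam)),
                    gamma ** 2 * lam / 2)

def scad_penalty(t, lam, gamma):
    """SCAD penalty as in Eq. (2); the standard setting assumes gamma > 2."""
    a = np.abs(t)
    mid = -(a ** 2 - 2 * gamma * lam * a + lam ** 2) / (2 * (gamma - 1))
    return np.where(a <= lam, lam * a,
                    np.where(a <= gamma * lam, mid, (gamma + 1) * lam ** 2 / 2))
```

All three functions grow (nearly) linearly near zero and become constant for large |t|, which is exactly the loss-suppressing shape exploited later in Section 3.3.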

Majorization Minimization Algorithm. MM algorithms have wide applications in machine learning and statistical inference [25]. The idea is to turn a relatively complicated optimization problem into a tractable one by alternately iterating a majorization step and a minimization step. In particular, consider a minimization problem with objective $F(\mathbf{w})$. Given an estimate $\mathbf{w}^k$ at the $k$-th iteration, a typical MM algorithm consists of the following two steps:

Majorization step: Substitute $F(\mathbf{w})$ by a surrogate function $Q(\mathbf{w}|\mathbf{w}^k)$ such that

$$F(\mathbf{w}) \le Q(\mathbf{w}|\mathbf{w}^k),$$

with equality holding at $\mathbf{w} = \mathbf{w}^k$.

Minimization step: Obtain the next parameter estimate $\mathbf{w}^{k+1}$ by solving the minimization problem

$$\mathbf{w}^{k+1} = \arg\min_{\mathbf{w}} Q(\mathbf{w}|\mathbf{w}^k).$$

It is easy to see that when the minimization of $Q(\mathbf{w}|\mathbf{w}^k)$ is tractable, the MM algorithm can be implemented very easily, even when the original objective $F(\mathbf{w})$ is difficult to optimize. This solution strategy has also been proven to possess many good theoretical properties, like convergence and stability, under certain conditions. In our work, we will clarify that the ASS algorithm generally utilized to solve an SPL model is exactly an MM strategy on a latent SPL objective, and is thus expected to inherit some of the advantages of this well-studied optimization technique.
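As a toy illustration of the MM mechanism (our own example, not taken from the SPL literature), consider minimizing $F(w) = \sum_i |w - a_i|$. Each absolute term can be majorized by a quadratic, $|t| \le \frac{t^2}{2|t_k|} + \frac{|t_k|}{2}$ with equality at $|t| = |t_k|$, so the minimization step reduces to a weighted average:

```python
import numpy as np

def mm_median(a, iters=50, eps=1e-8):
    """MM for F(w) = sum_i |w - a_i|: majorize each |w - a_i| by a quadratic
    centered at the current iterate, then minimize the resulting weighted
    least-squares surrogate in closed form (eps guards division by zero)."""
    w = float(np.mean(a))                                 # initial estimate
    for _ in range(iters):
        weights = 1.0 / (np.abs(w - a) + eps)             # majorization step
        w = float(np.sum(weights * a) / np.sum(weights))  # minimization step
    return w

a = np.array([1.0, 2.0, 3.0, 100.0])
print(mm_median(a))  # approaches a median of a, robust to the outlier 100
```

Each iteration decreases $F(w)$, which is the monotonicity property we invoke for SPL below.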

3 SPL Model and Algorithm Revisited

3.1 Axiomatic Definition of the SP-regularizer

By mathematically abstracting the insightful properties underlying a rational SPL regime, [17, 18] presented a formal definition for the SP-regularizer $f(v, \lambda)$ involved in the SPL model (1), as follows:

Definition 1 (Self-paced regularizer). Suppose that $v$ is a weight variable, $\ell$ is the loss, and $\lambda$ is the age parameter. $f(v, \lambda)$ is called a self-paced regularizer if

1. $f(v, \lambda)$ is convex with respect to $v \in [0, 1]$;

2. $v^*(\lambda; \ell)$ is monotonically decreasing with respect to $\ell$, and it holds that $\lim_{\ell \to 0} v^*(\lambda; \ell) = 1$ and $\lim_{\ell \to \infty} v^*(\lambda; \ell) = 0$;

3. $v^*(\lambda; \ell)$ is monotonically increasing with respect to $\lambda$, and it holds that $\lim_{\lambda \to \infty} v^*(\lambda; \ell) \le 1$ and $\lim_{\lambda \to 0} v^*(\lambda; \ell) = 0$;

where

$$v^*(\lambda; \ell) = \arg\min_{v \in [0,1]} \; v\ell + f(v, \lambda). \qquad (3)$$

The three conditions in Definition 1 provide an axiomatic understanding of SPL. Condition 2 indicates that the model inclines to select easy samples (with smaller losses) over complex samples (with larger losses). Condition 3 states that when the model "age" (controlled by the age parameter $\lambda$) gets larger, the model tends to incorporate more, probably complex, samples so as to train a "mature" model. The convexity in Condition 1 further ensures the soundness of this regularizer for optimization.

Under this axiomatic definition, multiple SP-regularizers have been constructed. The following lists several typical ones, together with their closed-form solutions $v^*(\lambda; \ell)$ as defined in Definition 1:

$$\begin{aligned}
f^H(v; \lambda) &= -\lambda v; & v^*(\lambda; \ell) &=
\begin{cases} 1, & \text{if } \ell < \lambda,\\ 0, & \text{if } \ell \ge \lambda; \end{cases}\\[4pt]
f^L(v; \lambda) &= \lambda\big(\tfrac{1}{2}v^2 - v\big); & v^*(\lambda; \ell) &=
\begin{cases} 1 - \ell/\lambda, & \text{if } \ell < \lambda,\\ 0, & \text{if } \ell \ge \lambda; \end{cases}\\[4pt]
f^M(v; \lambda, \gamma) &= \frac{\gamma^2}{v + \gamma/\lambda}; & v^*(\lambda, \gamma; \ell) &=
\begin{cases} 1, & \text{if } \ell \le \big(\frac{\lambda\gamma}{\lambda+\gamma}\big)^2,\\
0, & \text{if } \ell \ge \lambda^2,\\
\gamma\big(\frac{1}{\sqrt{\ell}} - \frac{1}{\lambda}\big), & \text{otherwise}.
\end{cases}
\end{aligned} \qquad (4)$$

Eq. (4) gives the hard, linear and mixture SP-regularizers proposed in [2], [17], and [18], respectively. By iteratively updating $\mathbf{v}$ and $\mathbf{w}$ in the SPL regime (1) with a gradually increasing age parameter $\lambda$, a rational solution to the problem is expected to be progressively approached.
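The closed-form weighting functions in (4) are straightforward to implement; the following Python sketch (ours, for illustration) evaluates $v^*(\lambda; \ell)$ for the three regularizers on an array of losses:

```python
import numpy as np

def v_hard(loss, lam):
    """Hard SP-regularizer: binary weights, Eq. (4)."""
    return (loss < lam).astype(float)

def v_linear(loss, lam):
    """Linear SP-regularizer: weights decay linearly with the loss."""
    return np.where(loss < lam, 1.0 - loss / lam, 0.0)

def v_mixture(loss, lam, gamma):
    """Mixture SP-regularizer: soft weights on intermediate losses."""
    low = (lam * gamma / (lam + gamma)) ** 2
    soft = gamma * (1.0 / np.sqrt(np.maximum(loss, 1e-12)) - 1.0 / lam)
    return np.where(loss <= low, 1.0,
                    np.where(loss >= lam ** 2, 0.0, soft))
```

Increasing `lam` visibly raises all three weighting curves, matching Condition 3 of Definition 1.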

3.2 Revisiting the ASS Algorithm for Solving SPL

For convenience of notation, we briefly write $L(y_i, g(\mathbf{x}_i, \mathbf{w}))$ as $\ell_i(\mathbf{w})$ or $\ell_i$, and $L(y, g(\mathbf{x}, \mathbf{w}))$ as $\ell(\mathbf{w})$ or $\ell$, in the following. Given an SP-regularizer $f(v, \lambda)$, we can obtain the integral function of the $v^*(\lambda; \ell)$ calculated by Eq. (3) as

$$F_\lambda(\ell) = \int_0^{\ell} v^*(\lambda; l)\, dl. \qquad (5)$$

The following result can then be proved (the proof is given in the appendix).

Theorem 1. For $v^*(\lambda; \ell)$ induced by an SP-regularizer and $F_\lambda(\ell)$ calculated by (5), given a fixed $\mathbf{w}^*$, it holds that

$$F_\lambda(\ell(\mathbf{w})) \le Q_\lambda(\mathbf{w}|\mathbf{w}^*) = F_\lambda(\ell(\mathbf{w}^*)) + v^*(\lambda; \ell(\mathbf{w}^*))\big(\ell(\mathbf{w}) - \ell(\mathbf{w}^*)\big).$$

The theorem verifies that $Q_\lambda(\mathbf{w}|\mathbf{w}^*)$ is a tractable surrogate for $F_\lambda(\ell(\mathbf{w}))$. Specifically, considering only the terms depending on $\mathbf{w}$, $Q_\lambda(\mathbf{w}|\mathbf{w}^*)$ simplifies $F_\lambda$, no matter how complicated its form, into the easy weighted-loss form $v^*(\lambda; \ell(\mathbf{w}^*))\,\ell(\mathbf{w})$. This constitutes the foundation of our new understanding of the ASS algorithm for SPL.

Based on Theorem 1, denote

$$Q_\lambda^{(i)}(\mathbf{w}|\mathbf{w}^*) = F_\lambda(\ell_i(\mathbf{w}^*)) + v^*(\lambda; \ell_i(\mathbf{w}^*))\big(\ell_i(\mathbf{w}) - \ell_i(\mathbf{w}^*)\big),$$

and we can then easily get that

$$\sum_{i=1}^{n} F_\lambda(\ell_i(\mathbf{w})) \le \sum_{i=1}^{n} Q_\lambda^{(i)}(\mathbf{w}|\mathbf{w}^*). \qquad (6)$$

We can now establish the equivalence between the ASS strategy for solving the SPL problem (1) and the MM algorithm for minimizing $\sum_{i=1}^{n} F_\lambda(\ell_i(\mathbf{w}))$ under the surrogate function $\sum_{i=1}^{n} Q_\lambda^{(i)}(\mathbf{w}|\mathbf{w}^*)$, as follows.

Denote by $\mathbf{w}^k$ the model parameters at the $k$-th iteration of the ASS implementation for solving SPL. Its two alternative search steps in the next iteration can then be precisely explained as a standard MM scheme:

Majorization step: To obtain each $Q_\lambda^{(i)}(\mathbf{w}|\mathbf{w}^k)$, we only need to calculate $v^*(\lambda; \ell_i(\mathbf{w}^k))$ by solving the following problem under the corresponding SP-regularizer $f(v_i, \lambda)$:

$$v^*(\lambda; \ell_i(\mathbf{w}^k)) = \arg\min_{v_i \in [0,1]} \; v_i \ell_i(\mathbf{w}^k) + f(v_i, \lambda).$$

This exactly corresponds to updating $\mathbf{v}$ in (1) with $\mathbf{w}$ fixed.

Minimization step: We need to calculate

$$\mathbf{w}^{k+1} = \arg\min_{\mathbf{w}} \sum_{i=1}^{n} \Big[ F_\lambda(\ell_i(\mathbf{w}^k)) + v^*(\lambda; \ell_i(\mathbf{w}^k))\big(\ell_i(\mathbf{w}) - \ell_i(\mathbf{w}^k)\big) \Big] = \arg\min_{\mathbf{w}} \sum_{i=1}^{n} v^*(\lambda; \ell_i(\mathbf{w}^k))\,\ell_i(\mathbf{w}),$$

which is exactly equivalent to updating $\mathbf{w}$ in (1) with $\mathbf{v}$ fixed.

It is then interesting to see that the ASS strategy commonly utilized in previous SPL regimes is exactly the well-known MM algorithm applied to the minimization of the latent SPL objective $\sum_{i=1}^{n} F_\lambda(\ell_i(\mathbf{w}))$ with the latent SPL loss $F_\lambda(\ell(\mathbf{w}))$. Various off-the-shelf theoretical results on MM can then be readily employed to explain the properties of this SPL solution strategy. For example, by MM theory, the lower-bounded latent SPL objective is monotonically decreasing during the MM/ASS iterations, which guarantees the convergence of the SPL algorithm.

The above theory provides a new viewpoint from which to explore the insights of SPL. We are thus eager to see what secrets hide under this latent SPL objective/loss.
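To make the ASS/MM correspondence concrete, here is a minimal Python sketch (our own illustration, not the authors' code), pairing the linear SP-regularizer with a weighted least-squares learner; the v-update is the closed form from (4) and the w-update solves the weighted loss problem:

```python
import numpy as np

def spl_least_squares(X, y, lam, iters=20):
    """ASS for SPL (= MM on the latent objective sum_i F_lambda(l_i(w))),
    sketched with the linear SP-regularizer and squared loss; assumes lam
    is large enough that at least some samples stay active."""
    w = np.linalg.lstsq(X, y, rcond=None)[0]             # plain LS warm start
    for _ in range(iters):
        loss = (X @ w - y) ** 2                          # per-sample losses
        v = np.where(loss < lam, 1.0 - loss / lam, 0.0)  # majorization: v-update
        s = np.sqrt(v)                                   # minimization: weighted LS
        w = np.linalg.lstsq(s[:, None] * X, s * y, rcond=None)[0]
    return w, v
```

Running this with gradually increasing `lam` reproduces the self-paced regime: samples enter training from easy to complex, and the latent objective decreases monotonically within each run.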

3.3 Revisiting the SPL Model

Now we try to discover more interesting insights from the latent SPL objective. To this aim, we first calculate the latent SPL losses under the hard, linear and mixture SP-regularizers introduced in (4), using Eq. (5):

$$\begin{aligned}
F_\lambda^H(\ell) &= \begin{cases} \ell, & \ell < \lambda,\\ \lambda, & \ell \ge \lambda; \end{cases}\\[4pt]
F_\lambda^L(\ell) &= \begin{cases} \ell - \ell^2/2\lambda, & \ell < \lambda,\\ \lambda/2, & \ell \ge \lambda; \end{cases}\\[4pt]
F_{\lambda,\gamma}^M(\ell) &= \begin{cases}
\ell, & \ell < \frac{1}{(1/\lambda + 1/\gamma)^2},\\
\gamma\big(2\sqrt{\ell} - \ell/\lambda\big) - \frac{\gamma}{1/\lambda + 1/\gamma}, & \frac{1}{(1/\lambda + 1/\gamma)^2} \le \ell < \lambda^2,\\
\gamma\big(\lambda - \frac{1}{1/\lambda + 1/\gamma}\big), & \ell \ge \lambda^2.
\end{cases}
\end{aligned} \qquad (7)$$

The configurations of these $F_\lambda(\ell)$ under different age parameters are depicted in Fig. 1 for better observation.

Figure 1: Graphical illustration of the latent SPL losses $F_\lambda^{H}(\ell)$, $F_\lambda^{L}(\ell)$ and $F_{\lambda,\gamma}^{M}(\ell)$, as defined in (7), induced by the hard, linear and mixture SP-regularizers on different loss functions (the logistic loss, the hinge loss, the absolute loss and the least-squares loss) under various age parameters in the one-dimensional case. Note that when $\lambda = \infty$ ($\lambda, \gamma = \infty$ in the mixture case), the latent SPL loss $F_\lambda(\ell)$ degenerates to the original loss $\ell$.

Some common patterns of these latent SPL losses can easily be seen from Fig. 1. E.g., compared with the original loss $\ell$, $F_\lambda(\ell)$ evidently suppresses large losses: once $\ell$ exceeds a certain threshold, $F_\lambda(\ell)$ becomes constant. This provides a rational explanation of why the SPL regime performs robustly in the presence of extreme outliers or heavy noise: samples with loss values larger than the age threshold have no influence on model training, since their gradients are 0. Correspondingly, in the original SPL model these large-loss samples receive importance weights $v_i = 0$, and thus have no effect on the optimization of the model parameters.

Now let us reexamine the intrinsic mechanism of the SPL implementation based on this understanding. At the start of the SPL iteration, the age $\lambda$ is small, so the latent loss $F_\lambda(\ell)$ strongly suppresses large losses and admits only a small number of high-confidence samples (with small loss values) into training; then, as $\lambda$ gradually increases, the suppressing effect of $F_\lambda(\ell)$ on outliers becomes weaker, and relatively less informative samples incline to be involved in training. Through such robust guidance, more and more faithful knowledge underlying the samples tends to be incrementally learned by the scheme. This gradually changing tendency of the latent SPL loss $F_\lambda(\ell)$ can be easily understood from Fig. 1.

An interesting observation is that the latent SPL objective $F_\lambda(\ell)$ has a close relationship to the NCRP widely investigated in machine learning and statistics. E.g., the hard and linear SPL objectives $F_\lambda^H(\ell)$ and $F_\lambda^L(\ell)$ comply exactly with the forms of the capped-norm penalty and MCP, as defined in (2), imposed on $\ell$ by setting $\gamma = 1$, respectively. I.e.,

$$F_\lambda^H(\ell) = p^{CAP}_{1,\lambda}(\ell), \qquad F_\lambda^L(\ell) = p^{MCP}_{1,\lambda}(\ell).$$

Furthermore, the form of $F_{\lambda,\gamma}^M(\ell)$ is very similar to the SCAD term: both contain three phases of values, and the first and third phases of both are linear and constant, respectively. The only difference lies in the second phase, where $F_{\lambda,\gamma}^M(\ell)$ is of linear+sqrt+constant form while SCAD is of linear+square+constant form. Actually, it is easy to deduce that any $F_\lambda(\ell)$ induced by an SP-regularizer is non-convex and has a configuration very similar to a general NCRP. (Albeit closely related, it should be noted that $F_\lambda(\ell)$ and NCRP differ in that they are generally imposed on the loss term and on the regularization term of the model parameters, respectively.) This natural relationship on one hand provides a new viewpoint on NCRP and facilitates more choices of NCRP formulations by virtue of the $F_\lambda(\ell)$ obtained under various SP-regularizers, and on the other hand inspires us to borrow mature statistical tools and theoretical results on NCRP to further understand SPL in our future investigation.
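As a quick numerical sanity check of this SPL–NCRP correspondence (our own sketch, not from the paper), one can recover $F_\lambda(\ell)$ by integrating $v^*(\lambda; l)$ as in Eq. (5) and compare it with the closed form in (7), which for the linear regularizer coincides with MCP at $\gamma = 1$:

```python
import numpy as np
from scipy.integrate import quad

def v_linear(l, lam):
    """Closed-form weight of the linear SP-regularizer, Eq. (4)."""
    return max(0.0, 1.0 - l / lam)

def F_linear_numeric(ell, lam):
    """F_lambda(ell) = integral of v*(lambda; l) from 0 to ell, Eq. (5)."""
    return quad(lambda l: v_linear(l, lam), 0.0, ell)[0]

def F_linear_closed(ell, lam):
    """Closed form from Eq. (7), i.e., the MCP p_{1,lambda} applied to ell."""
    return ell - ell ** 2 / (2 * lam) if ell < lam else lam / 2

lam = 2.0
for ell in [0.5, 1.5, 3.0]:
    print(F_linear_numeric(ell, lam), F_linear_closed(ell, lam))
    # the two values agree, confirming F^L = MCP with gamma = 1
```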

3.4 Age Parameter Tuning

In the starting stage of the SPL iteration, the age value is small, and only small-loss samples can participate in the training process, which naturally raises the problem of insufficient learning. Yet in the late SPL training stage, the age becomes much larger, and outliers or meaningless noisy samples tend to be involved in learning. Such false information inclines to negatively influence the performance of the learned model. Therefore, it is necessary to terminate the SPL iteration at an appropriate intermediate age.

The latent SPL objective gradually optimized in the SPL process provides helpful clues to this problem. Specifically, in an SPL implementation with a gradually increasing age parameter, more and more samples are incorporated into the learning process, which can be described by the mapping function $\gamma(n) = \lambda_n$, where $\lambda_n$ represents the age value at which $n$ training samples begin to be involved in the SPL training (i.e., $n$ samples have nonzero importance weights $v_i$). It is generally observed that in the early training stage there are many informative samples with small losses, and a small increase of the age parameter brings many of them into training. This implies that the discrepancy between adjacent $\lambda_n$ values (i.e., $\gamma(n+1) - \gamma(n)$) will be small. In the later stage, by contrast, noisy samples with larger losses tend to fall into the constant domain of $F_\lambda(\ell)$, and each such noisy sample tends to require a relatively large increase of $\lambda_n$ before it is included. We can thus rationally select a proper termination age simply by setting a threshold on $\gamma(n+1) - \gamma(n)$, or by locating the elbow of the $\gamma(n)$ curve. This tendency of the $\gamma(n)$ curve can be easily observed in Fig. 2, obtained from our synthetic experiment, where the elbow location can easily be anchored. We thus prefer this simple strategy for tuning the output age of SPL.
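A minimal sketch of this termination rule (our own implementation of the thresholding idea; `tau` is an illustrative threshold, not a value from the paper). For the hard regularizer, a sample becomes active as soon as $\lambda$ exceeds its loss, so $\gamma(n) = \lambda_n$ is simply the sorted loss sequence, and we can stop just before the first large jump:

```python
import numpy as np

def select_age(losses, tau):
    """Pick a termination age from the gamma(n) = lambda_n curve.
    lam[n] is the age at which the (n+1)-th sample becomes active;
    a gap larger than tau signals that outliers are about to enter."""
    lam = np.sort(losses)          # gamma(n) curve for the hard regularizer
    gaps = np.diff(lam)            # gamma(n+1) - gamma(n)
    if gaps.size == 0 or not np.any(gaps > tau):
        return lam[-1]             # no elbow detected: keep all samples
    idx = np.argmax(gaps > tau)    # index of the first large gap
    return lam[idx]                # stop before the noisy samples enter
```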

3.5 Discussion on SPL Superiority

A natural question is why not directly optimize the latent SPL objective instead of the SPL model, and what is the advantage of the latter? Actually, we can easily see that an intrinsic property of SPL is to decompose the minimization of the robust but difficult-to-solve non-convex loss $F_\lambda(\ell(\mathbf{w}))$ into two much easier optimization problems, one with respect to the sample importance weights $\mathbf{v}$ (solved via the closed-form solution of the SP-regularizer) and one with respect to the model parameters $\mathbf{w}$ (solved as a weighted loss problem). Such a decomposition not only simplifies the solution of the problem but, more importantly, makes it easy to embed helpful prior knowledge about sample importance (easiness) into the loss function of an SPL scheme. We list some such useful sample-importance knowledge that can generally be attained before learning:

1. Spatial/temporal smoothness prior: spatially/temporally adjacent samples tend to have relatively similar importance;

2. Partial-order prior: we might know in advance that some samples are more important (i.e., cleaner, easier, more high-confidence) than others;

3. Diversity prior: samples meaningful for the learning task should be scattered across the data range, so that learning can include global-scale data knowledge.

Thanks to the separation of the sample importance weights $\mathbf{v}$ from the original non-convex loss, such prior knowledge can be easily encoded into an SPL scheme. Specifically, Prior 1 can be formulated as a graph Laplacian term $\mathbf{v}^T L \mathbf{v}$, where $L$ is the Laplacian matrix of the data adjacency matrix (a sketch of this encoding follows this paragraph); Prior 2 can be encoded as the supplemental constraint $v_i > v_j$ whenever the $i$-th sample is known to be more important than the $j$-th [20]; and Prior 3 can be realized by a $-l_{2,1}$ norm or $-l_{0.5,1}$ norm on $\mathbf{v}$, as utilized in [19] and [21], respectively. Such easy loss-prior embedding inclines SPL toward a sounder learning manner, which is hard to integrate into conventional machine learning models with predefined loss functions.
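A hypothetical sketch of how Prior 1 changes the v-subproblem (our own illustration; the weight `mu` on the smoothness term and the step size `lr` are not from the paper). With the linear SP-regularizer, adding the graph Laplacian term $\mu\,\mathbf{v}^T L \mathbf{v}$ turns the closed-form v-update into a small box-constrained quadratic program, solvable e.g. by projected gradient descent:

```python
import numpy as np

def v_update_smooth(loss, lam, Lap, mu, steps=200, lr=0.01):
    """Minimize sum_i v_i*loss_i + lam*(0.5*||v||^2 - sum_i v_i) + mu*v'Lv
    over v in [0,1]^n, where Lap is the graph Laplacian of the data
    adjacency matrix; lr is a step size chosen for illustration."""
    v = np.clip(1.0 - loss / lam, 0.0, 1.0)      # warm start: closed form w/o prior
    for _ in range(steps):
        grad = loss + lam * (v - 1.0) + 2.0 * mu * (Lap @ v)
        v = np.clip(v - lr * grad, 0.0, 1.0)     # project onto the box [0,1]^n
    return v
```

With `mu = 0` this reduces exactly to the closed-form linear-regularizer update, so the prior is a strict add-on to the plain SPL scheme.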

4 Experiments

In this section, we aim to verify the correctness of the theoretical results proved above through synthetic and UCI experiments.

4.1 Synthetic Simulations

We constructed two synthetic data sets for substantiation. The first is a classification dataset with 600 points, generated from the mixture distribution

$$\sum_{i=1}^{5} \pi_i p_i(X, Y), \qquad p_i(X, Y) = p_i(X)\, p_i(Y|X),$$

with

$$\begin{aligned}
&\pi_1 = 0.32;\ \ p_1(X) = N([-4, -4], 3);\ \ p_1(1|X) = 0;\ \ p_1(-1|X) = 1;\\
&\pi_2 = 0.32;\ \ p_2(X) = N([4, 4], 3);\ \ p_2(1|X) = 1;\ \ p_2(-1|X) = 0;\\
&\pi_3 = 0.1;\ \ p_3(X) = U(\Omega_1);\ \ p_3(1|X) = 0.49;\ \ p_3(-1|X) = 0.51;\\
&\pi_4 = 0.1;\ \ p_4(X) = U(\Omega_2);\ \ p_4(1|X) = 0.51;\ \ p_4(-1|X) = 0.49;\\
&\pi_5 = 0.16;\ \ p_5(X) = U(\Omega_3);\ \ p_5(1|X) = 0.5;\ \ p_5(-1|X) = 0.5;
\end{aligned}$$

where $\Omega_1 = O([-7, -8], 4) \cap \{[x, y] \mid y < x\}$, $\Omega_2 = O([6, 8], 4) \cap \{[x, y] \mid y > x\}$, $\Omega_3 = \{[x, y] \mid x, y \in [-180, 180]\}$; $O(\mathbf{x}_0, r)$ represents the circular area with center $\mathbf{x}_0$ and radius $r$, $N(\mu, \sigma^2)$ denotes the Gaussian distribution with mean $\mu$ and variance $\sigma^2$, and $U(\Omega)$ represents the uniform distribution on $\Omega$. It is easy to deduce that the optimal classification surface of the problem is $y = -x$, and that $p_3$ and $p_4$ tend to generate noise while $p_5$ inclines to yield outliers. 600 correctly classified data points were also generated for performance testing.

The other dataset is for regression, containing 1000 points generated from the distribution

$$\sum_{i=1}^{3} \pi_i p_i(X, Y), \qquad p_i(X, Y) = p(X)\, p_i(Y|X),$$

with

$$\begin{aligned}
&p(X) = U(\Omega_1);\ \ \Omega_1 = \{x \mid x \in [-1, 1]\};\\
&\pi_1 = 0.3;\ \ p_1(Y|X) = N(f(X), 1);\\
&\pi_2 = 0.4;\ \ p_2(Y|X) = \mathrm{Laplacian}(f(X), 1);\\
&\pi_3 = 0.3;\ \ p_3(Y|X) = U(\Omega_3),\ \ \Omega_3 = \{\varepsilon \mid \varepsilon \in [f(X) - 20, f(X) + 20]\};
\end{aligned}$$

where $f(x) = 0.5x + 1$ and $\mathrm{Laplacian}(\mu, \beta)$ represents the Laplacian distribution with mean $\mu$ and scale $\beta$. Note that the optimal regression curve of the problem is $f(x)$, and that $p_1$ and $p_2$ tend to generate noise while $p_3$ inclines to yield outliers. 1000 data points were further generated along this regression curve as the test dataset.
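For reference, the regression dataset just described can be sampled with a few lines of Python (our own sketch of the stated generative process, not the authors' code):

```python
import numpy as np

def make_regression_data(n=1000, seed=0):
    """y = f(x) + noise with f(x) = 0.5*x + 1; noise is Gaussian (30%),
    Laplacian (40%), or a uniform outlier component (30%), as specified."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, n)
    f = 0.5 * x + 1.0
    comp = rng.choice(3, size=n, p=[0.3, 0.4, 0.3])        # mixture component
    y = np.where(comp == 0, f + rng.normal(0.0, 1.0, n),   # p1: Gaussian noise
        np.where(comp == 1, f + rng.laplace(0.0, 1.0, n),  # p2: Laplacian noise
                 rng.uniform(f - 20.0, f + 20.0)))         # p3: uniform outliers
    return x, y
```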

For the classification experiments, we utilized logistic regression (LR) [26] and support vector classification (SVC) [27], with the log loss and the hinge loss as objectives, respectively, as baseline comparison methods. For regression, we adopted the least-squares (LS) regression method [28] with the LS loss. The hard, linear and mixture SPL regimes were respectively imposed on these baseline methods to test whether they ameliorate performance. The age in SPL was tuned by drawing the $\gamma(n)$ curve and locating its elbow position, as introduced in the last section. For classification and regression, performance was assessed by the classification accuracy and the LS error on the test data, respectively. The code for the LR and LS methods was written by ourselves, and that for SVC and the weighted SVC involved in SPL directly used the "fitcsvm" function of Matlab 2014. All experimental results are listed in Table 1.

Table 1: Performance comparison of all competing methods (LR: logistic regression, SVC: support vector classification, LS: least squares) on the synthetic data. LR and SVC are measured by classification accuracy (higher is better), LS by the LS error (lower is better).

        Hard     Linear   Mixture  Original
LR      100%     97.4%    97.2%    48.45%
SVC     99.83%   99.83%   99.83%   48.35%
LS      4.9      2.7      2.6      16.9

Figure 2: Performance demonstration of applying the linear SP-regularizer to the least-squares loss in the synthetic regression experiment. Left panel: the $\gamma(n) = \lambda_n$ curve utilized to tune the proper age parameter for terminating the SPL iteration. Four typical age parameters, A, B, C and D, are marked along the curve, where age C corresponds to an appropriate elbow position of the curve and is preferred as the selected output age. Right upper panel: the tendency curves of the latent objective $\sum_{i=1}^{n} F_\lambda(\ell_i(\mathbf{w}))$ along the iterative process of the MM algorithm (i.e., the ASS iteration for SPL) under ages A, B, C, D, respectively. Right lower panel: the intrinsic non-convex loss $F_\lambda(\ell)$ under SPL at ages A, B, C, D, respectively.

For an easy observation of the working mechanism of SPL, Fig. 2 illustrates some related results of our regression experiment utilizing the linear SPL regime on the least-squares method. It is easy to see from the figure that the latent SPL objectives under different ages are monotonically decreasing over the iterations. This verifies the correctness of our SPL theory, i.e., that the SPL optimization problem corresponds to the minimization of this latent SPL objective.

Table 2: Performance comparison of all competing methods on three UCI datasets (classification accuracy).

                Hard     Linear   Mixture  Original
D1    LR        74.31%   74.31%   74.31%   65.74%
      SVC       65.74%   75.00%   70.37%   63.89%
D2    LR        83.51%   84.16%   83.95%   82.43%
      SVC       83.08%   84.60%   83.51%   82.86%
D3    LR        91.98%   91.98%   91.98%   65.24%
      SVC       80.75%   80.21%   80.75%   76.47%

From Table 1, the effectiveness of the proposed age tuning strategy can also be observed. Specifically, the performance of the original LR/SVC methods is evidently hampered by the noise/outliers in the training data, while the SPL regimes perform stably and robustly in such cases under the selected age values. This can be easily explained by Fig. 2: at the beginning of the SPL iteration, informative points are gradually incorporated into the learning process; after an appropriate age, the age threshold becomes large, more interfering outliers with large losses tend to be involved in training, and the SPL performance thus tends to degenerate. By properly terminating the iteration at this age, SPL effectively avoids such unexpected degeneration.

4.2 UCI Experiments

We further ran experiments on various binary classification problems from the UCI repository (http://archive.ics.uci.edu/ml/). In most cases, an SPL regime can more or less ameliorate the performance of a traditional classification method with a fixed loss. In Table 2 we report some typical results on three of these datasets: Monk's Problem 1 (D1), Mammographic Mass (D2) and SPECT Heart (D3). LR and SVC were adopted as the baseline methods, and the hard, linear and mixture SPL regimes were implemented to enhance their robustness. From the table, it is easy to see the better performance of the SPL regimes over the original LR and SVC in all experiments. This performance improvement implies that all three datasets contain evident noise to a certain extent, so that the noise-suppressing capability of the latent SPL objective takes effect. Such capability is not possessed by the baseline methods, since their loss functions are pre-fixed and cannot flexibly adapt to varying data distributions in practice, as the proposed SPL paradigms can.

5 Conclusion

We have provided some new insights into the conventional SPL regime in this study. On one hand, we have shown that the ASS algorithm generally utilized for solving SPL exactly complies with the well-known MM algorithm applied to a latent SPL objective; on the other hand, we have verified that the loss function contained in this latent SPL objective precisely accords with the well-studied non-convex regularized penalty (NCRP). The effectiveness of SPL, especially its robustness to outliers/heavy noise, as substantiated by previous experience, can then be naturally explained under this understanding. A rational age parameter tuning scheme can also easily be derived from this theory. In our future investigation, we will employ the theories on MM and NCRP to more deeply explore the theoretical/statistical properties underlying the SPL algorithm and model.


Appendix

Proof of Theorem 1. To prove the theorem, we need to show that

$$F_\lambda(l) \le F_\lambda(l') + v^*(\lambda; l')(l - l').$$

Two cases should be dealt with.

Case 1: $v^*(\lambda; l)$ is continuous with respect to $l$. From Eq. (5), we have

$$v^*(\lambda; l) = F'_\lambda(l).$$

By Definition 1, $v^*(\lambda; l) \ge 0$ when $l \ge 0$, and thus $F_\lambda(l)$ is nondecreasing with respect to $l$ on $[0, \infty)$. Besides, $v^*(\lambda; l) = F'_\lambda(l)$ is monotonically decreasing with respect to $l$. Therefore, we can conclude that $F_\lambda(l)$ is concave on $[0, \infty)$. By the property of concave functions, we have

$$F_\lambda(l) \le F_\lambda(l') + F'_\lambda(l')(l - l') = F_\lambda(l') + v^*(\lambda; l')(l - l').$$

Case 2: $v^*(\lambda; l)$ is discontinuous with respect to $l$. Without loss of generality, suppose there is only one discontinuity point $\tilde{l} \in [0, \infty)$. When $l, l' \in [0, \tilde{l})$ or $l, l' \in (\tilde{l}, \infty)$, following a derivation similar to Case 1, we have that $F_\lambda(l) \le F_\lambda(l') + v^*(\lambda; l')(l - l')$ still holds.

Now suppose $l \in [0, \tilde{l})$ and $l' \in (\tilde{l}, \infty)$. Pick $l_1 \in [0, \tilde{l})$; then we have

$$F_\lambda(l) \le F_\lambda(l_1) + v^*(\lambda; l_1)(l - l_1),$$

and

$$F_\lambda(\tilde{l}) \le F_\lambda(l') + v^*(\lambda; l')(\tilde{l} - l').$$

Denote $v^*(\lambda; \tilde{l})^- = \lim_{l \to \tilde{l}^-} v^*(\lambda; l)$, and let $l_1 \to \tilde{l}^-$. Since $F_\lambda(l)$ is continuous, we have

$$F_\lambda(l) \le F_\lambda(\tilde{l}) + v^*(\lambda; \tilde{l})^-(l - \tilde{l}).$$

Therefore,

$$\begin{aligned}
F_\lambda(l) - F_\lambda(l') &= F_\lambda(l) - F_\lambda(\tilde{l}) + F_\lambda(\tilde{l}) - F_\lambda(l')\\
&\le v^*(\lambda; \tilde{l})^-(l - \tilde{l}) + v^*(\lambda; l')(\tilde{l} - l')\\
&\le v^*(\lambda; l')(l - \tilde{l}) + v^*(\lambda; l')(\tilde{l} - l') = v^*(\lambda; l')(l - l'),
\end{aligned}$$

where the second inequality holds because $l \le \tilde{l}$ and $v^*(\lambda; l) \ge 0$ is decreasing with respect to $l$. Similarly, if $l \in (\tilde{l}, \infty)$ and $l' \in [0, \tilde{l})$, the result also holds.

Now consider the case $l' = \tilde{l}$. Suppose $l \in [0, \tilde{l})$ (the derivation is similar for $l \in (\tilde{l}, \infty)$), and pick $l_1 \in [0, \tilde{l})$. We have

$$F_\lambda(l) \le F_\lambda(l_1) + v^*(\lambda; l_1)(l - l_1).$$

Letting $l_1 \to \tilde{l}^-$ and using the continuity of $F_\lambda(l)$, we obtain

$$F_\lambda(l) \le F_\lambda(l') + v^*(\lambda; \tilde{l})^-(l - l') \le F_\lambda(l') + v^*(\lambda; l')(l - l'),$$

where the second inequality holds because $l \le l'$ and $v^*(\lambda; l) \ge 0$ is decreasing with respect to $l$.

From the above discussion, we conclude that $F_\lambda(l) \le F_\lambda(l') + v^*(\lambda; l')(l - l')$. Substituting $l$ and $l'$ with $\ell(\mathbf{w})$ and $\ell(\mathbf{w}^*)$, respectively, Theorem 1 follows.

References

[1] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, 2009.
[2] M. Kumar, B. Packer, and D. Koller. Self-paced learning for latent variable models. In NIPS, 2010.
[3] F. Khan, X. Zhu, and B. Mutlu. How do humans teach: on curriculum learning and teaching dimension. In NIPS, 2011.
[4] V. I. Spitkovsky, H. Alshawi, and D. Jurafsky. Baby steps: How "less is more" in unsupervised dependency parsing. In NIPS, 2009.
[5] A. Lapedriza, H. Pirsiavash, Z. Bylinskii, and A. Torralba. Are all training examples equally valuable? CoRR abs/1311.6510, 2013.
[6] K. Tang, V. Ramanathan, F. Li, and D. Koller. Shifting weights: Adapting object detectors from image to video. In NIPS, 2012.
[7] M. Kumar, H. Turki, D. Preston, and D. Koller. Learning specific-class segmentation from diverse data. In ICCV, 2011.
[8] Y. Lee and K. Grauman. Learning the easy things first: Self-paced visual category discovery. In CVPR, 2011.
[9] J. Supančič III and D. Ramanan. Self-paced learning for long-term tracking. In CVPR, 2013.
[10] F. Vaida. Parameter convergence for EM and MM algorithms. Statistica Sinica, 15(3):831, 2005.
[11] T. Zhang. Analysis of multi-stage convex relaxation for sparse regularization. The Journal of Machine Learning Research, 11:1081–1107, 2010.
[12] C. Zhang and T. Zhang. A general theory of concave regularization for high-dimensional sparse estimation problems. Statistical Science, 27(4):576–593, 2012.
[13] P. Gong, C. Zhang, Z. Lu, J. Huang, and J. Ye. A general iterative shrinkage and thresholding algorithm for non-convex regularized optimization problems. In ICML, 2013.
[14] C. Zhang. Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics, pages 894–942, 2010.
[15] E. Ni and C. Ling. Supervised learning with minimal effort. In Advances in Knowledge Discovery and Data Mining, pages 476–487. Springer, 2010.
[16] S. Basu and J. Christensen. Teaching classification boundaries to humans. In AAAI, 2013.
[17] L. Jiang, D. Meng, T. Mitamura, and A. Hauptmann. Easy samples first: self-paced reranking for zero-example multimedia search. In ACM MM, 2014.
[18] Q. Zhao, D. Y. Meng, L. Jiang, Q. Xie, Z. B. Xu, and A. Hauptmann. Self-paced learning for matrix factorization. In AAAI, 2015.
[19] L. Jiang, D. Y. Meng, S. Yu, Z. Z. Lan, S. G. Shan, and A. Hauptmann. Self-paced learning with diversity. In NIPS, 2014.
[20] L. Jiang, D. Y. Meng, Q. Zhao, S. G. Shan, and A. Hauptmann. Self-paced curriculum learning. In AAAI, 2015.
[21] D. Zhang, D. Meng, and J. Han. Co-saliency detection via a self-paced multiple-instance learning framework. In ICCV, 2015.
[22] S. Yu, L. Jiang, Z. Mao, X. J. Chang, X. Z. Du, C. Gan, Z. Z. Lan, Z. W. Xu, X. C. Li, Y. Cai, A. Kumar, Y. Miao, L. Martin, N. Wolfe, S. C. Xu, H. Li, M. Lin, Z. G. Ma, Y. Yang, D. Y. Meng, S. G. Shan, P. D. Sahin, S. Burger, F. Metze, R. Singh, B. Raj, T. Mitamura, R. Stern, and A. Hauptmann. CMU-Informedia @ TRECVID 2014 Multimedia Event Detection (MED). In TRECVID Video Retrieval Evaluation Workshop, 2014.
[23] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.
[24] Y. Kang, Z. Zhang, and W. Li. On the global convergence of majorization minimization algorithms for nonconvex optimization problems. arXiv:1504.07791v2, 2015.
[25] K. Lange, D. Hunter, and I. Yang. Optimization transfer using surrogate objective functions. Journal of Computational and Graphical Statistics, 9(1):1–20, 2000.
[26] D. R. Cox. The regression analysis of binary sequences (with discussion). Journal of the Royal Statistical Society: Series B, 20:215–242, 1958.
[27] V. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.
[28] S. M. Stigler. Gauss and the invention of least squares. Annals of Statistics, 9(3):465–474, 1981.
