Stochastic Optimization Techniques for Quantification Performance Measures

arXiv:1605.04135v1 [stat.ML] 13 May 2016

Harikrishna Narasimhan (IACS, Harvard University, USA) · Shuai Li (University of Insubria, Italy) · Purushottam Kar (IIT Kanpur, India) · Sanjay Chawla (QCRI-HBKU, Qatar) · Fabrizio Sebastiani* (QCRI-HBKU, Qatar)

ABSTRACT

The estimation of class prevalence, i.e., the fraction of the population that belongs to a certain class, is a very useful tool in data analytics and learning, and finds applications in many domains, such as sentiment analysis and epidemiology. For example, in sentiment analysis the objective is often not to estimate whether a specific text conveys a positive or a negative sentiment, but rather to estimate the overall distribution of positive and negative sentiment during an event window. A popular way of performing the above task, often dubbed quantification, is to use supervised learning to train a prevalence estimator from labelled data. In this paper we propose the first online stochastic algorithms for directly optimizing (i) performance measures for quantification, and (ii) hybrid performance measures that seek to balance quantification and classification performance. We prove rigorous bounds for our algorithms which demonstrate that they exhibit optimal convergence. Our algorithms present a significant advancement in the theory of multivariate optimization. We also report extensive experiments on benchmark and real data sets which demonstrate that our methods significantly outperform existing optimization techniques used for the quantification problem.

CCS Concepts: •Computing methodologies → Online learning settings; Supervised learning by classification; Learning linear models; Cost-sensitive learning

Keywords: Quantification, Classification, Class Imbalance, Stochastic Optimization, Multivariate Optimization

*Fabrizio Sebastiani is on leave from Consiglio Nazionale delle Ricerche, Italy.

KDD '16, San Francisco, California, USA. © 2016 ACM.

1. INTRODUCTION

Quantification [11] is defined as the task of estimating the prevalence (i.e., relative frequency) of the classes of interest in an unlabelled set, given a set of training items labelled according to the same classes. Quantification finds its natural application in contexts characterized by distribution drift, i.e., contexts where the training data may not exhibit the same class prevalences as the unlabelled test data. This phenomenon may be due to different reasons, including the inherently non-stationary character of the context, or class bias that affected the selection of the training data.

A naïve way to tackle quantification is via the "classify and count" (CC) approach: classify each unlabelled item independently and compute the fraction of the unlabelled items that have been attributed to the class. However, a good classifier is not necessarily a good quantifier: in the binary case, even if (FP + FN) is comparatively small, bad quantification accuracy may result if FP and FN are significantly different (since perfect quantification coincides with the case FP = FN). This has led researchers to study quantification as a task in its own right, rather than as a byproduct of classification. That quantification is not just classification in disguise can also be seen from the fact that evaluation measures different from those used for classification (e.g., F1, AUC) need to be employed. Quantification in fact amounts to computing how well an estimated class distribution p̂ fits an actual class distribution p; as such, the natural way to evaluate it is via a function from the class of f-divergences [7], and a natural choice from this class (if only for the fact that it is the best-known f-divergence) is the Kullback-Leibler divergence (KLD), defined as

  KLD(p, p̂) = Σ_{c∈C} p(c) · log( p(c) / p̂(c) )    (1)

Indeed, KLD is the most frequently used measure for evaluating quantification (see, e.g., [3, 11, 10, 12]). Note that KLD is non-decomposable, i.e., the error we make by estimating p via p̂ cannot be broken down into item-level errors. This is not just a feature of KLD, but an inherent feature of any measure for evaluating quantification. In fact, how the error made on a given unlabelled item impacts the overall quantification error depends on how the other items have been classified¹; e.g., if FP > FN for the other unlabelled items, then generating an additional false negative is actually beneficial to the overall quantification accuracy, be it measured via KLD or via any other function.

¹We assume for simplicity that quantification is tackled in an aggregative way, i.e., that the classification of individual items is a necessary intermediate step for the estimation of class prevalences. Note however that this is not necessarily so; examples of non-aggregative approaches to quantification may be found in [14, 22].

The fact that KLD is the measure of choice for quantification and that it is non-decomposable has led to the use of structured output learners, such as SVMperf [16], that can indeed optimize non-decomposable functions; the approach of Esuli and Sebastiani [9, 10] is indeed based on optimizing KLD within SVMperf. However, the position that minimizing |FP − FN| (or KLD, or any "pure" quantification measure) should be the only objective for quantification, regardless of the value of (FP + FN), is fairly paradoxical. Some authors [3, 26] have observed that this might lead to the generation of unreliable quantifiers (i.e., systems with good quantification accuracy but bad or very bad classification accuracy), and have, as a result, championed the idea of optimizing some "multi-objective" measure that combines quantification accuracy with classification accuracy. For instance, within a decision-tree-like approach, Milli et al. [26] minimize |FP² − FN²|, since this is the product of |FN − FP| (a measure of quantification error) and (FN + FP) (a measure of classification error); Barranquero et al. [3] also optimize (within SVMperf) a measure that combines quantification accuracy and classification accuracy. While SVMperf provides a recipe for optimizing a general performance measure, there are serious limitations to this method. For instance, SVMperf is not designed to directly handle applications where large streaming data sets are the norm. Another serious limitation is that SVMperf does not scale well in multi-class settings, where the time required is exponential in the number of classes.

In this paper, we develop stochastic optimization methods for optimizing popular quantification performance measures in streaming settings. While recent advances have brought much progress in efficient methods of online learning and bandits for large-scale problems (e.g., [13, 23, 25]), most of these works assume that the optimization objective is decomposable and can be written as a summation or expectation of losses on individual data points. However, many of the measures used in the quantification literature have a multivariate and complex structure and cannot be written as a sum of losses on individual points. Recently, there has been some progress in developing stochastic optimization methods for such non-decomposable measures, but these approaches either require the maintenance of large buffers (and as a result have poor convergence guarantees) or do not directly apply to many of the performance measures used for quantification. The design of efficient stochastic methods for quantification performance measures is the focus of the present paper.
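To make these definitions concrete, the following small script (our illustration, not part of the original paper) computes classify-and-count estimates and evaluates them with the KLD of Equation (1); it also illustrates the FP = FN point above: a classifier with more total errors can nevertheless be the better quantifier.

```python
import numpy as np

def cc_prevalence(y_pred):
    """Classify and count: the estimated positive prevalence is the
    fraction of items that the classifier labels positive."""
    return float(np.mean(y_pred == 1))

def kld_binary(p, p_hat, eps=1e-12):
    """KLD between two binary class distributions, as in Equation (1)."""
    pv = np.array([p, 1.0 - p])
    qv = np.array([p_hat, 1.0 - p_hat])
    return float(np.sum(pv * np.log((pv + eps) / (qv + eps))))

# 1000 items with true positive prevalence p = 0.30
y_true = np.array([1] * 300 + [-1] * 700)

# Classifier A: 120 errors, but FP = FN = 60, so CC is perfect
y_a = y_true.copy(); y_a[:60] = -1; y_a[300:360] = 1
# Classifier B: only 80 errors, all false negatives: worse quantifier
y_b = y_true.copy(); y_b[:80] = -1

print(kld_binary(0.30, cc_prevalence(y_a)))  # ~ 0.0
print(kld_binary(0.30, cc_prevalence(y_b)))  # clearly > 0
```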

2. RELATED WORK

Quantification methods. Different quantification methods have been proposed over the years, the two main classes being the aggregative and the non-aggregative methods. While the former require the classification of each individual item as an intermediate step, the latter do not, and estimate class prevalences holistically. Most methods (e.g., the ones described in [3, 4, 10, 11, 26]) fall in the former class, while the latter has few representatives (e.g., [14, 22]). Within the class of aggregative methods, a further distinction can be made between methods that use general-purpose learning algorithms (e.g., [4, 11]), sometimes tweaking them or post-processing their prevalence estimates to account for their estimated bias, and methods (which we have already discussed in Section 1) that instead use learning algorithms explicitly devised for quantification (e.g., [3, 10, 26]); the one we use in this paper belongs to this latter class.

Applications of quantification. From an application perspective, quantification is especially interesting in fields (such as social science, political science, market research, and epidemiology) which are inherently interested in aggregate data and care little about individual cases. Aside from applications in these fields [15, 22], quantification has also been used in contexts as diverse as natural language processing [6], resource allocation [11], tweet sentiment analysis [12], and veterinary science [14]. Quantification has independently been studied within statistics [15, 22], machine learning [2, 8, 30], and data mining [10, 11]. Unsurprisingly, given this varied literature, quantification also goes under different names, such as counting [24], class probability re-estimation [1], class prior estimation [6], and learning of class balance [8]. In some applications of quantification, the estimation of class prevalences is not an end in itself, but is functional to improving the accuracy of other tasks (e.g., classification). For instance, Balikas et al. [2] use quantification for model selection in supervised learning, i.e., they tune hyperparameters by choosing the values that yield the best quantification accuracy on the test data; this allows hyperparameter tuning to be performed without incurring the costs inherent in k-fold cross-validation. Saerens et al. [30] (followed by other authors [1, 32, 34]) apply quantification to customizing a trained classifier to the class prevalences of the test set, with the goal of improving classification accuracy on unlabelled data exhibiting a class distribution different from that of the training set. The work of Chan and Ng [6] may be seen as a direct application of this notion, since they use quantification to estimate the word sense priors of a text data set to be disambiguated, so as to tune a word sense disambiguator to the estimated sense priors. Their work can be seen as an instance of transfer learning (see, e.g., [28]), since the goal is to adapt a word sense disambiguation algorithm to a domain different from the one the algorithm was trained upon.

3. PROBLEM SETTING

For the sake of simplicity, in this paper we will restrict our analysis to binary classification problems and linear models. We will denote the space of feature vectors by X ⊂ R^d and the label set by Y = {−1, +1}. We shall assume that data points are generated according to some fixed but unknown distribution D over X × Y. We will denote the proportion of positives in the population by p := Pr_{(x,y)∼D}[y = +1]. Our algorithms will receive a sample of T training points drawn from D, which we will denote by T = {(x₁, y₁), ..., (x_T, y_T)}.

As mentioned above, we will present our algorithms and analyses for learning a linear model over X. We will denote the model space by W ⊆ R^d and let R_X and R_W denote the radii of the domain X and the model space W, respectively. We note that our algorithms and analyses can be extended to learning non-linear models by the use of kernels, as well as to multi-class versions of the quantification performance measures that we consider; however, we postpone a discussion of these extensions to an expanded version of this paper.

The present work considers the optimization of quantification performance measures that can be represented as functions of the confusion matrix of the classifier. Since we are working in a binary setting, the confusion matrix can be completely described in terms of the true positive rate (TPR) and the true negative rate (TNR) of the classifier. However, our algorithms will use reward functions as surrogates of these. More formally, we will use a reward function r that assigns a reward r(ŷ, y) to a prediction ŷ ∈ R when the true label is y ∈ Y. Given a reward function r, a model w ∈ W, and a data point (x, y) ∈ X × Y, we will use

  r⁺(w; x, y) = (1/p) · r(w⊤x, y) · 1(y = 1)
  r⁻(w; x, y) = (1/(1−p)) · r(w⊤x, y) · 1(y = −1)

to calculate rewards on positive and negative points. Note that since E_{(x,y)}[r⁺(w; x, y)] = E_{(x,y)}[r(w⊤x, y) | y = 1], setting r to the classification reward r⁰⁻¹(ŷ, y) = 1(y·ŷ > 0) yields E_{(x,y)}[r⁺(w; x, y)] = TPR(w). For the sake of convenience we will use P(w) = E_{(x,y)}[r⁺(w; x, y)] and N(w) = E_{(x,y)}[r⁻(w; x, y)] to denote the population averages of the reward functions. We shall assume that our reward function r is concave, L_r-Lipschitz, and takes values in a bounded range [−B_r, B_r].
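For illustration, here is a minimal sketch (our code; the surrogate reward and all function names are our assumptions) of the rescaled rewards r± and of how averaging them over a sample estimates TPR and TNR:

```python
import numpy as np

def reward_pos(r, w, x, y, p):
    """r+(w; x, y) = (1/p) * r(w'x, y) * 1(y = +1)."""
    return (1.0 / p) * r(np.dot(w, x), y) * (y == 1)

def reward_neg(r, w, x, y, p):
    """r-(w; x, y) = (1/(1-p)) * r(w'x, y) * 1(y = -1)."""
    return (1.0 / (1.0 - p)) * r(np.dot(w, x), y) * (y == -1)

def zero_one(y_hat, y):
    """0-1 reward: with this choice, E[r+] = TPR(w) and E[r-] = TNR(w)."""
    return float(y * y_hat > 0)

def truncated_linear(y_hat, y, B=1.0):
    """A concave, 1-Lipschitz surrogate reward (our illustrative choice);
    it stays in a bounded range once W and the features are bounded."""
    return min(B, y * y_hat)

# Averaging reward_pos over an i.i.d. sample gives an unbiased estimate
# of P(w); with the 0-1 reward this is exactly TPR(w).
```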

3.1 Performance Measures

The task of quantification requires estimating the distribution of a set S of unlabelled items across a set C of available classes; we here deal with the case |C| = 2. The main measure we adopt, and the one that has become somewhat standard in the evaluation of binary or multi-class quantification, is KLD, as defined in Equation (1). KLD ranges between 0 (best) and +∞ (worst).²

As noted in Section 1, some authors have championed the use of hybrid, "multi-objective" measures that trade off quantification accuracy and classification accuracy. One such measure is the Q-measure, introduced in [3] and defined as

  Q_β = (1 + β²) · (P_class · P_quant) / (β² · P_class + P_quant)    (3)

i.e., as a weighted combination of a measure of classification accuracy P_class and a measure of quantification accuracy P_quant. As P_class we adopt balanced accuracy, defined as BA = (1/2)·(TPR + TNR), while as P_quant we take what [3] calls the Normalized Squared Score, defined as

  NSS = 1 − ( (FN − FP) / (max{p, (1−p)} · |S|) )².

Alternatives to this, which we will focus on, are CQReward = BA/(2 − NSS) and BKReward = BA/(1 + KLD) (see also Table 2).
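For concreteness, the measures just defined can be computed from raw confusion-matrix counts as in the following sketch (our code; the paper itself works with rates rather than counts):

```python
import math

def quant_measures(TP, FP, FN, TN, beta=1.0):
    """BA, NSS, Q-measure, CQReward and BKReward from binary
    confusion-matrix counts (a sketch; all names are ours)."""
    S = TP + FP + FN + TN            # number of items
    p = (TP + FN) / S                # true positive prevalence
    TPR, TNR = TP / (TP + FN), TN / (TN + FP)
    BA = 0.5 * (TPR + TNR)
    NSS = 1.0 - ((FN - FP) / (max(p, 1 - p) * S)) ** 2
    p_hat = (TP + FP) / S            # classify-and-count estimate
    eps = 1e-12                      # guard against log(0)
    KLD = (p * math.log((p + eps) / (p_hat + eps))
           + (1 - p) * math.log((1 - p + eps) / (1 - p_hat + eps)))
    Q = (1 + beta**2) * BA * NSS / (beta**2 * BA + NSS)
    return {"BA": BA, "NSS": NSS, "Q": Q,
            "CQReward": BA / (2 - NSS), "BKReward": BA / (1 + KLD)}

# A classifier with FP = FN quantifies perfectly (NSS = 1, KLD = 0)
print(quant_measures(TP=240, FP=60, FN=60, TN=640))
```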

4. STOCHASTIC OPTIMIZATION ALGORITHMS FOR QUANTIFICATION

The previous discussion in Sections 1 and 2 clarifies two aspects of efforts in the quantification literature: 1) specific performance measures have been developed and adopted for evaluating quantification performance, including KLD, NSS, etc., and 2) algorithms that directly optimize these performance measures are desirable, as is evidenced by recent works [3, 9, 10, 26]. The works mentioned above make use of tools from the optimization literature to learn linear (e.g., [10]) and non-linear (e.g., [26]) models to perform quantification. The state-of-the-art efforts in this direction have adopted the structural SVM approach for optimizing these performance measures, with great success [3, 10].

However, this approach comes with severe drawbacks. The structural SVM [16], although a significant tool that allows the optimization of arbitrary performance measures, suffers from two key limitations. Firstly, the structural SVM surrogate is not necessarily a tight surrogate for all performance measures, something that has been demonstrated in past literature [21, 27], and this can lead to poor training. More importantly, optimizing the structural SVM surrogate requires the use of expensive cutting-plane methods, which are known to scale poorly with the amount of training data and are unable to handle streaming data.

To alleviate these problems, we propose stochastic optimization algorithms that directly optimize a large family of quantification performance measures. Our methods come with sound theoretical convergence guarantees, are able to operate on streaming data sets and, as our experiments will demonstrate, offer much faster and more accurate quantification performance on a variety of data sets. Our optimization techniques introduce crucial advancements in the field of stochastic optimization of multivariate performance measures, and address two families of performance measures: 1) nested concave performance measures and 2) pseudo-concave performance measures. We describe these in turn below.

²KLD is not a particularly well-behaved performance measure, being capable of taking unbounded values within the compact domain of the unit simplex. This poses a problem for optimization algorithms, from the point of view of convergence as well as numerical stability. To solve this problem, while computing KLD we can smooth both p(c) and p̂(c) via additive smoothing, i.e., by computing

  p_s(c) = (ε + p(c)) / (ε·|C| + Σ_{c'∈C} p(c'))    (2)

where p_s(c) denotes the smoothed version of p(c) and the denominator is just a normalizing factor (the same holds for the p̂_s(c)'s); the quantity ε = 1/(2|S|) is often used as a smoothing factor, and is the one we adopt here. The smoothed versions of p(c) and p̂(c) are then used in place of the non-smoothed versions in Equation (1). One can show that, as a result, KLD is always bounded: KLD(p_s, p̂_s) ≤ log(1/ε). Note, however, that the smoothed KLD still returns a value of 0 when p and p̂ coincide.
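A small sketch of this smoothing step (our code), showing that the smoothed KLD stays within the log(1/ε) bound even for a degenerate distribution:

```python
import numpy as np

def smooth(prev, n_items):
    """Additive smoothing of a class-prevalence vector (Equation 2),
    with the paper's choice eps = 1/(2|S|)."""
    prev = np.asarray(prev, dtype=float)
    eps = 1.0 / (2 * n_items)
    return (prev + eps) / (prev.sum() + eps * len(prev))

def kld(p, q):
    return float(np.sum(p * np.log(p / q)))

p_s  = smooth([1.0, 0.0], n_items=100)   # degenerate true distribution
ph_s = smooth([0.0, 1.0], n_items=100)   # worst-case estimate
# The smoothed KLD is finite and bounded by log(1/eps) = log(200)
print(kld(p_s, ph_s), np.log(200))
```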

4.1 Nested Concave Performance Measures

The first class of performance measures that we deal with are concave combinations of concave performance measures. More formally, given three concave functions Ψ, ζ1, ζ2 : R² → R, we can define a performance measure

  P_(Ψ,ζ1,ζ2)(w) = Ψ(ζ1(w), ζ2(w)),

where

  ζ1(w) = ζ1(P(w), N(w)),  ζ2(w) = ζ2(P(w), N(w)).

Examples of such performance measures include the negative KLD performance measure and the Q-measure, which were described in Section 3.1. Table 1 describes these performance measures in canonical form, i.e., in terms of the true positive and true negative rates.

Before describing our algorithm for nested concave measures, we recall the notion of the concave Fenchel conjugate of a concave function. For any concave function f : R² → R and any (u, v) ∈ R², the (concave) Fenchel conjugate of f is defined as

  f*(u, v) = inf_{(x,y)∈R²} {ux + vy − f(x, y)}.

Clearly, f* is concave. Moreover, it follows from the concavity of f that for any (x, y) ∈ R²,

  f(x, y) = inf_{(u,v)∈R²} {xu + yv − f*(u, v)}.
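As a concrete illustration (our worked example, not from the paper), consider the balanced-accuracy component ζ(P, N) = (P + N)/2 appearing in Table 1; its concave Fenchel conjugate follows directly from the definition above:

```latex
% Conjugate of \zeta(P, N) = (P + N)/2, computed from the definition
\zeta^*(u, v) = \inf_{(x, y) \in \mathbb{R}^2}
                \Big\{ u x + v y - \tfrac{x + y}{2} \Big\}
             = \begin{cases}
                 0       & \text{if } (u, v) = (\tfrac12, \tfrac12), \\
                 -\infty & \text{otherwise,}
               \end{cases}
```

since the infimum of the linear function (u − 1/2)·x + (v − 1/2)·y is 0 when both coefficients vanish and −∞ otherwise. Consequently, argmin_α {α·r − ζ*(α)} is forced to α = (1/2, 1/2), which matches the closed-form dual update reported for BA in Table 1.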

Below we state the properties of strong concavity and smoothness; these will be crucial in our convergence analysis.

DEFINITION 1 (STRONG CONCAVITY AND SMOOTHNESS). A function f : R^d → R is said to be α-strongly concave and γ-smooth if for all x, y ∈ R^d we have

  −(γ/2)·‖x − y‖₂² ≤ f(x) − f(y) − ⟨∇f(y), x − y⟩ ≤ −(α/2)·‖x − y‖₂².

We will assume that the functions Ψ, ζ1, and ζ2 defining our performance measures are γ-smooth for some constant γ > 0. This is true of all the functions involved, save the log function used in the definition of the KLD quantification measure; however, if we carry out the smoothing step described in Section 3.1 with some ε > 0, then it can be shown that the KLD function does become (1/ε²)-smooth. An important property of smooth functions, which will be crucial in our analyses, is a close relationship between smooth and strongly convex functions.

THEOREM 2 ([33]). A closed and convex function f is β-smooth iff its (concave) Fenchel conjugate f* is (1/β)-strongly concave.

We are now in a position to present our algorithm NEMSIS for the stochastic optimization of nested concave functions; Algorithm 1 gives an outline of the technique. We note that, as discussed before, a direct application of traditional stochastic optimization techniques [31] to nested performance measures such as those considered here is not possible. NEMSIS overcomes these challenges by exploiting the nested dual structure of the performance measure, carefully balancing updates at the inner and outer levels. At every time step, NEMSIS performs four very cheap updates. The first is a primal ascent update to the model vector, which takes a weighted stochastic gradient step whose weights are decided by the dual parameters of the functions Ψ, ζ1, and ζ2. NEMSIS then updates the dual variables in three simple steps; in fact, lines 15-17 can be executed in closed form (see Table 1) for all the performance measures considered here, which allows for very rapid updates.

Below we state the convergence guarantee for NEMSIS. We note that, despite the complicated nature of the performance measures being tackled, NEMSIS is still able to recover the optimal rate of convergence known for stochastic optimization routines. We refer the reader to Appendix A for a proof; the proof requires a careful analysis of the primal and dual update steps at the different levels, tying the updates together by taking into account the nesting structure of the performance measure.

THEOREM 3. Suppose we are given a stream of random samples (x₁, y₁), ..., (x_T, y_T) drawn from a distribution D over X × Y, and let Algorithm 1 be executed with appropriate choices for the step sizes η_t with a nested concave performance measure Ψ(ζ1(·), ζ2(·)). Then, for some constant C that depends on the Lipschitz properties of the functions Ψ, ζ1, ζ2, the average model w̄ = (1/T)·Σ_{t=1}^T w_t output by the algorithm satisfies, with probability at least 1 − δ,

  P_(Ψ,ζ1,ζ2)(w̄) ≥ sup_{w*∈W} P_(Ψ,ζ1,ζ2)(w*) − C · log(1/δ)/√T.

Table 1: List of nested concave performance measures and their canonical expressions in terms of the confusion matrix Ψ(P, N). The 4th, 6th, and 8th columns give the closed-form updates used in steps 15-17 of Algorithm 1; p and n denote the proportions of positives and negatives in the population.

Name | Expression | Ψ(x, y) | γ(q) | ζ1(P, N) | α(r) | ζ2(P, N) | β(r)
NegKLD | −KLD(p, p̂) | p·x + n·y | (p, n) | log(pP + n(1−N)) | (1/r₁, 1/r₂) | log(nN + p(1−P)) | (1/r₁, 1/r₂)
QMeasure_β [3] | (1+β²)·BA·NSS / (β²·BA + NSS) | (1+β²)·x·y / (β²·x + y) | (1+β²)·(z², (1−z)²/β²), z = q₂/(β²q₁ + q₂) | (P+N)/2 | (1/2, 1/2) | 1 − (p(1−P) − n(1−N))² | 2(z, −z), z = r₂ − r₁
BAKLD | C·BA + (1−C)·(−KLD) | C·x + (1−C)·y | (C, 1−C) | (P+N)/2 | (1/2, 1/2) | −KLD(P, N) | see above

Algorithm 1 NEMSIS: NEsted priMal-dual StochastIc updateS
Require: Outer wrapper function Ψ, inner performance measures ζ1, ζ2, primal/dual step sizes η_t, η′_t, feasible sets W, A
Ensure: Classifier w ∈ W
1: w₀ ← 0, t ← 1, {r₀, q₀, α₀, β₀, γ₀} ← (0, 0)
2: while data stream has points do
3:   Receive data point (x_t, y_t)
4:   // Perform primal ascent
5:   if y_t > 0 then
6:     w_{t+1} ← Π_W( w_t + η_t·(γ_{t,1}·α_{t,1} + γ_{t,2}·β_{t,1})·∇_w r⁺(w_t; x_t, y_t) )
7:     q_{t+1} ← t·q_t + (α_{t,1}, β_{t,1})·r⁺(w_t; x_t, y_t)
8:   else
9:     w_{t+1} ← Π_W( w_t + η_t·(γ_{t,1}·α_{t,2} + γ_{t,2}·β_{t,2})·∇_w r⁻(w_t; x_t, y_t) )
10:    q_{t+1} ← t·q_t + (α_{t,2}, β_{t,2})·r⁻(w_t; x_t, y_t)
11:  end if
12:  r_{t+1} ← (t + 1)⁻¹·( t·r_t + (r⁺(w_t; x_t, y_t), r⁻(w_t; x_t, y_t)) )
13:  q_{t+1} ← (t + 1)⁻¹·( q_{t+1} − (ζ1*(α_t), ζ2*(β_t)) )
14:  // Perform dual updates
15:  α_{t+1} ← argmin_α { α·r_{t+1} − ζ1*(α) }
16:  β_{t+1} ← argmin_β { β·r_{t+1} − ζ2*(β) }
17:  γ_{t+1} ← argmin_γ { γ·q_{t+1} − Ψ*(γ) }
18:  t ← t + 1
19: end while
20: return w̄ = (1/t)·Σ_{τ=1}^t w_τ
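To fix ideas, the following is a compact control-flow sketch of Algorithm 1 (our code, not the authors' implementation). The closed-form dual solvers `solve_z1`, `solve_z2`, `solve_psi`, the reward and its gradient, and the hyperparameters are all assumptions supplied by the caller; the step size η_t = η₀/√t and the norm projection are illustrative choices.

```python
import numpy as np

def nemsis(stream, dim, p, reward, grad_r, solve_z1, solve_z2, solve_psi,
           eta0=1.0, radius=1.0):
    """Sketch of Algorithm 1. solve_z1(r), solve_z2(r), solve_psi(q)
    return (minimizer, conjugate value at it); closed forms exist for
    the measures in Table 1. grad_r(w, x, y) is grad_w r(w'x, y)."""
    w, w_bar = np.zeros(dim), np.zeros(dim)
    r_avg, q_avg = np.zeros(2), np.zeros(2)
    alpha, beta, gamma = np.full(2, 0.5), np.full(2, 0.5), np.full(2, 0.5)
    z1c = z2c = 0.0                       # conjugates at current duals
    for t, (x, y) in enumerate(stream, start=1):
        i, scale = (0, 1.0 / p) if y > 0 else (1, 1.0 / (1.0 - p))
        # primal ascent (steps 5-11): weighted SGD step, then projection
        coef = (gamma[0] * alpha[i] + gamma[1] * beta[i]) * scale
        w = w + (eta0 / np.sqrt(t)) * coef * grad_r(w, x, y)
        nrm = np.linalg.norm(w)
        if nrm > radius:
            w *= radius / nrm
        # running averages of rewards and dual objectives (steps 12-13)
        rv = np.zeros(2); rv[i] = scale * reward(w @ x, y)
        pv = np.array([alpha @ rv - z1c, beta @ rv - z2c])
        r_avg += (rv - r_avg) / t
        q_avg += (pv - q_avg) / t
        # closed-form dual updates (steps 15-17)
        alpha, z1c = solve_z1(r_avg)
        beta, z2c = solve_z2(r_avg)
        gamma, _ = solve_psi(q_avg)
        w_bar += (w - w_bar) / t          # averaged model, returned below
    return w_bar
```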

Related work of Narasimhan et al. [27]: Narasimhan et al. [27] proposed an algorithm, SPADE, which offers stochastic optimization for concave performance measures. We note that although the quantification performance measures considered here are indeed concave, it is difficult to apply SPADE to them directly, since SPADE requires the computation of gradients of the Fenchel dual of the entire function P_(Ψ,ζ1,ζ2), which are difficult to compute given the nested structure of this function. NEMSIS, on the other hand, only requires the duals of the individual functions Ψ, ζ1, and ζ2, which are much more accessible. Moreover, NEMSIS uses a much simpler dual update which does not involve any parameters and, in fact, has a closed-form solution in all our cases; SPADE instead performs dual gradient descent, which requires fine tuning of yet another step-length parameter. A third benefit of NEMSIS is that it achieves a logarithmic regret with respect to its dual updates (see the proof of Theorem 3), whereas SPADE incurs a polynomial regret due to its gradient-descent-style dual update.

4.2 Pseudo-concave Performance Measures

The next class of performance measures we consider consists of performance measures that can be expressed as the ratio of a classification performance measure to a quantification performance measure. More formally, given a convex quantification performance measure P_quant and a concave classification performance measure P_class, we can define

Table 2: List of pseudo-concave performance measures and their canonical expressions in terms of the confusion matrix. p and n denote the proportions of positives and negatives in the population.

Name | Expression | P_quant(P, N) | P_class(P, N)
CQReward | BA / (2 − NSS) | 1 + (p(1−P) − n(1−N))² | (P+N)/2
BKReward | BA / (1 + KLD) | 1 + KLD (KLD: see Table 1) | (P+N)/2


a performance measure

  P_(Pquant,Pclass)(w) = P_class(w) / P_quant(w).

We assume that both performance measures, P_quant and P_class, are positive valued. Such performance measures can be very useful in allowing the system designer to balance classification and quantification performance. Moreover, the form of the measure allows an enormous amount of freedom in choosing the quantification and classification performance measures. Examples of such performance measures include the CQReward and BKReward measures, which were introduced in Section 3.1 and are represented in their canonical forms in Table 2.

Performance measures constructed in this way, as a ratio of a concave measure over a convex one, are called pseudo-concave measures. This is because, although these measures are not concave, their level sets are still convex, which makes it possible to optimize them efficiently. To see the intuition behind this, we need to introduce the notion of the valuation function corresponding to the performance measure. As a passing note, we remark that because of the non-concavity of these performance measures, NEMSIS cannot be applied here.

DEFINITION 4 (VALUATION FUNCTION). The valuation of a pseudo-concave performance measure P_(Pquant,Pclass)(w) = P_class(w)/P_quant(w) is defined, for any v > 0, as

  V(w, v) = P_class(w) − v · P_quant(w).

It can be seen that the valuation function defines the level sets of the performance measure: due to the positivity of the functions P_quant and P_class, we have P_(Pquant,Pclass)(w) ≥ v iff V(w, v) ≥ 0. Moreover, since P_class is concave, P_quant is convex, and v > 0, V(w, v) is a concave function of w. This close connection between level sets and valuation functions has been exploited before to give optimization algorithms for pseudo-linear performance measures such as the F-measure [27, 29]. These approaches treat the valuation function as a proxy or surrogate for the original performance measure and optimize it in the hope of making progress with respect to the original measure.

Taking this approach with our performance measures yields a very natural algorithm for optimizing pseudo-concave measures, which we outline below as Algorithm 2 (CAN).

Algorithm 2 CAN: Concave AlternatioN
Require: Objective P_(Pquant,Pclass), model space W, tolerance ε
Ensure: An ε-optimal classifier w ∈ W
1: Construct the valuation function V
2: w₀ ← 0, t ← 1
3: while v_t > v_{t−1} + ε do
4:   w_{t+1} ← argmax_{w∈W} V(w, v_t)
5:   v_{t+1} ← max{ v > 0 : V(w_{t+1}, v) ≥ 0 }
6:   t ← t + 1
7: end while
8: return w_t

Notice that step 4 in the algorithm is a concave maximization problem over a convex set, something that can be done using a variety of methods. Also notice that step 5 can, by the definition of the valuation function, be carried out by simply setting v_{t+1} = P_(Pquant,Pclass)(w_{t+1}).

It turns out that CAN has a linear rate of convergence for well-behaved performance measures; the next result formalizes this statement. We note that this result is similar to the one arrived at by [27] for pseudo-linear functions, but in a restrictive setting.
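In code form, the alternation of Algorithm 2 is just a few lines. In the sketch below (our code), `valuation_argmax` stands in for any concave maximization oracle over W and `perf` evaluates P_class/P_quant; both callables are assumptions of this illustration.

```python
def can(valuation_argmax, perf, eps=1e-4, v0=0.0, max_iters=100):
    """A sketch of Algorithm 2 (CAN). valuation_argmax(v) maximizes the
    concave valuation w -> Pclass(w) - v * Pquant(w) over W; perf(w)
    evaluates Pclass(w) / Pquant(w)."""
    v_prev, v = float("-inf"), v0
    w = None
    for _ in range(max_iters):
        if v <= v_prev + eps:          # stopping rule of Algorithm 2
            break
        w = valuation_argmax(v)        # step 4: concave maximization
        v_prev, v = v, perf(w)         # step 5: v_{t+1} = P(w_{t+1})
    return w
```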

Algorithm 3 SCAN: Stochastic Concave AlternatioN
Require: Objective P_(Pquant,Pclass), model space W, step sizes η_t, epoch lengths s_e, s′_e
Ensure: Classifier w ∈ W
1: v₀ ← 0, t ← 0, e ← 0, w₀ ← 0
2: repeat
3:   // Learning phase
4:   w̃ ← w_e
5:   while t < s_e do
6:     Receive sample (x, y)
7:     // NEMSIS update with V(·, v_e) at time t
8:     w̃_{t+1} ← NEMSIS(V(·, v_e), w̃_t, (x, y), t)
9:     t ← t + 1
10:  end while
11:  t ← 0, e ← e + 1, w_{e+1} ← w̃
12:  // Level estimation phase
13:  v₊ ← 0, v₋ ← 0
14:  while t < s′_e do
15:    Receive sample (x, y)
16:    v_y ← v_y + r^y(w_e; x, y)  // collect rewards
17:    t ← t + 1
18:  end while
19:  t ← 0, v_e ← P_class(v₊, v₋) / P_quant(v₊, v₋)
20: until stream is exhausted
21: return w_e
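A stream-level sketch of Algorithm 3 (our code): `nemsis_step` performs a single NEMSIS update on the valuation V(·, v), while `p_class` and `p_quant` evaluate the two measures on the accumulated reward estimates; all of these callables, and the epoch constants, are assumptions of this sketch.

```python
def scan(stream, nemsis_step, reward_pos, reward_neg, p_class, p_quant,
         D=2.0, s0=100):
    """Sketch of Algorithm 3 (SCAN). nemsis_step(w, sample, v) does one
    NEMSIS update on V(., v) and handles w = None by initializing;
    reward_pos/neg are the rescaled rewards r+ and r-."""
    v, w, e = 0.0, None, 0
    it = iter(stream)
    try:
        while True:
            n = int(s0 * D ** e)          # geometric epoch lengths
            for _ in range(n):            # learning phase
                w = nemsis_step(w, next(it), v)
            vp = vn = 0.0
            for _ in range(n):            # level-estimation phase
                x, y = next(it)
                if y > 0: vp += reward_pos(w, x, y)
                else:     vn += reward_neg(w, x, y)
            v = p_class(vp, vn) / p_quant(vp, vn)   # step 19
            e += 1
    except StopIteration:
        return w
```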

THEOREM 5. Suppose we execute Algorithm 2 with a pseudo-concave performance measure P_(Pquant,Pclass) such that the quantification performance measure always takes values in the range [m, M], where m > 0. Let P* := sup_{w∈W} P_(Pquant,Pclass)(w) be the optimal performance level, and let Δ_t = P* − P_(Pquant,Pclass)(w_t) be the excess error for the model w_t generated at time t. Then there exists a value η(m) < 1 such that Δ_t ≤ Δ₀ · η(m)^t.

The proof of this theorem can be found in Appendix B and generalizes the result of [27] to the more general case of pseudo-concave functions. Note that in all the pseudo-concave functions defined in Table 2, care is taken to ensure that the quantification performance measure satisfies m > 0.

A drawback of CAN is that it does not operate in an online mode and requires a concave optimization oracle. However, we notice that for all the pseudo-concave performance measures that we have considered, the valuation function is always at least a nested concave function. This motivates us to use NEMSIS to solve the inner optimization problems, which gives us the SCAN algorithm, outlined in Algorithm 3. It can be shown that SCAN enjoys a rate of convergence similar to that of NEMSIS, as stated by the following theorem.

THEOREM 6. Suppose we execute Algorithm 3 with a pseudo-concave performance measure P_(Pquant,Pclass) such that the quantification performance measure always takes values in the range [m, M], where m > 0, with epoch lengths increasing at a geometric rate decided by a suitable constant D > 1, i.e., s_e, s′_e = Θ̃(D^e). Also let Δ_e = P* − P_(Pquant,Pclass)(w_e) be the excess error for the model w_e generated after e epochs. Then after e = Õ(log(1/δε)) epochs, we can ensure with probability at least 1 − δ that Δ_e ≤ ε. Moreover, the number of samples consumed up to this point is at most Õ(1/ε²).

The proof of this result can be obtained by showing that CAN is robust to approximate solutions to the inner optimization problem. However, due to lack of space, we postpone this result to an expanded version of the paper.

Related work of Narasimhan et al. [27]: Narasimhan et al. [27] also proposed two algorithms, AMP and STAMP, which seek to optimize pseudo-linear performance measures. However, neither those algorithms nor their analyses transfer directly to the pseudo-concave setting. This is because, by exploiting the pseudo-linearity of the performance measure, AMP and STAMP are able to convert their problem into a sequence of cost-weighted optimization problems which are very simple to solve. This convenience is absent here: as mentioned above, even after the creation of the valuation function, SCAN still has to solve a possibly nested concave maximization problem, which it does by invoking the NEMSIS procedure on this inner problem. The proof technique used in [27] for analyzing AMP also makes heavy use of pseudo-linearity. The convergence proof of CAN, on the other hand, is more general and yet guarantees a linear convergence rate.

5. EXPERIMENTAL RESULTS

We have carried out an extensive set of experiments on diverse data sets to compare our proposed methods with other state-of-the-art approaches.

Data sets: We used the following benchmark data sets from the UCI archive: a) IJCNN, b) Covertype, c) Adult, d) Letter, and e) Cod-RNA. We also used the following three real-world data sets. Cheminformatics: a virtual screening data set from [18] with 2142 compounds, each represented as a 55-dimensional vector using the FP2 molecular fingerprint representation; there are 5 sets of 50 active compounds each (active against 5 different targets) and 1892 inactive compounds; for each target, the 50 active compounds are treated as positive and all others as negative. Medical Diagnosis: this task was put forward in the 2008 KDD Cup challenge and is related to the (early) detection of breast cancer, where the objective is to predict whether an image region of interest (ROI) from an X-ray is malignant (positive) or benign (negative); the data was extracted from images of 118 malignant patients and 1594 normal patients, each represented by a 117-dimensional feature vector. Protein-Protein Interaction: the goal here is to predict whether pairs of proteins interact or not; the data set consists of 2865 protein pairs that are known to interact (positive) and a random set of 237,384 protein pairs which are assumed to be non-interacting (negative), with 162 features per pair.

Table 3: Statistics of the data sets used.

Data Set | Data Points | Features | Positives
KDDCup08 | 102,294 | 117 | 0.61%
PPI | 240,249 | 85 | 1.19%
CoverType | 581,012 | 54 | 1.63%
ChemInfo | 2,142 | 55 | 2.33%
Letter | 20,000 | 16 | 3.92%
IJCNN-1 | 141,691 | 22 | 9.57%
Adult | 48,842 | 123 | 23.93%
Cod-RNA | 488,565 | 8 | 33.3%

Methods: We compared our proposed NEMSIS and SCAN algorithms against the state-of-the-art one-pass mini-batch stochastic gradient method (1PMB) of [20] and SVMPerf [17]. We also included a variant of the NEMSIS algorithm in which the dual updates are computed using the original 0-1 TPR and TNR values rather than surrogate reward functions (NEMSIS-NS), and a version of SCAN in which the level estimation is performed using 0-1 TPR/TNR values (SCAN-NS). We will make the code for our methods publicly available.

Parameters: In each case, we used 70% of the data for training and the remainder for testing. All parameters, including step sizes, upper bounds on reward functions, regularization parameters, and projection radii, were chosen from {10⁻⁴, ..., 10⁴} using a held-out portion of the training set treated as a validation set.

In Figure 1 a comparison of NEMSIS-NS and NEMSIS against 1PMB and SVMPerf is carried out on several data sets on the NKLD (negative KLD) measure. It is clear that the proposed algorithms offer comparable performance with a significantly faster rate of convergence. Since SVMPerf is an offline method, it is important to clarify how it was compared against the online methods: timers were embedded inside the SVMPerf code and, periodically, the current state of the model (the w vector) was extracted and used on the test set to evaluate the NKLD measure. It is clear that SVMPerf is significantly slower and its behavior is quite erratic. On three of the four data sets NEMSIS-NS achieves a faster rate of convergence vis-à-vis NEMSIS.

In Figure 2 the performance of NEMSIS is evaluated on a weighted sum of NKLD and the balanced error rate (BER), where the weights are defined by CWeight; this weighted sum measures joint quantification and classification performance. On all three data sets we notice a sweet spot where the two tasks simultaneously achieve good performance.

In Figure 3 we evaluate the robustness of the algorithms as a function of different class proportions. The y-axis is the positive KLD, so smaller values are better. Again, it is clear that the NEMSIS family of algorithms attains significantly smaller KLD, demonstrating its versatility across a range of class distributions.

In Figure 4 we test the performance of the NEMSIS family on data sets with drifts: we checked how models learnt on the training sets performed on test sets with varying class proportions. We have not included SVMPerf here as it was taking an inordinately long time. On both the Adult and Letter data sets the NEMSIS family was fairly robust to class distribution change. When the class distributions were changed by a large amount (over 100 percent), all algorithms, as expected, performed poorly.

Finally, we test our methods in optimizing composite performance measures that strike a more nuanced trade-off between quantification and classification performance. The results are presented in Figures 5 and 6. As can be seen, the proposed methods perform significantly better than the baseline 1PMB method.

6. CONCLUSION

Quantification, the task of estimating class prevalence from labelled data in contexts subject to distribution drift, has emerged as an important problem in machine learning and data mining. For example, sentiment analysis is often a quantification problem, where we are interested in gauging the overall positive or negative sentiment of a population, and not the sentiment of each individual entity. A good classifier is not necessarily a good quantifier and vice versa, and this has provided an impetus to design algorithms that exclusively solve the quantification task. In particular, the use of Kullback-Leibler divergence is now considered a de facto measure of quantification performance. In this paper we have proposed a family of stochastic algorithms (NEMSIS) to address the online quantification problem. By abstracting negative KL divergence into a family of concave or nested concave reward functions, we have designed provably correct and efficient algorithms, and validated them on several data sets under varying conditions, including class imbalance and distribution drift. The proposed algorithms include the ability to jointly optimize both quantification and classification tasks. To the best of our knowledge this is the first work that directly addresses the online quantification problem; as such, it opens up application areas which were till now inaccessible.

[Figure 1: Experiments with NEMSIS on KLD: plots of negative KLD vs. training time (secs) for NEMSIS, NEMSIS-NS, 1PMB, and SVMPerf on (a) Adult, (b) Cod-RNA, (c) KDD08, (d) PPI.]

[Figure 2: Experiments with NEMSIS on BAKLD: plots of quantification (negative KLD) and classification (BER) performance as a function of CWeight on (a) Adult, (b) Cod-RNA, (c) Covtype.]

[Figure 3: Comparison of NEMSIS, NEMSIS-NS, 1PMB, and SVMPerf on data sets with different class proportions: positive KLD on kdd08, ppi, covtype, chemo, letter, ijcnn1, a9a, and cod-rna.]

[Figure 4: Comparison of methods under drifts in class proportions: positive KLD vs. % change in class proportion for NEMSIS, NEMSIS-NS, and 1PMB on (a) Adult and (b) Letter.]

[Figure 5: Experiments with NEMSIS on Q-measure: plots of Q-measure performance vs. training time (secs) for 1PMB, NEMSIS, and NEMSIS-NS on (a) Adult, (b) Cod-RNA, (c) IJCNN1, (d) KDD08.]

[Figure 6: Experiments with SCAN on CQReward: plots of CQReward performance vs. training time (secs) for 1PMB and SCAN-NS on (a) Adult, (b) Cod-RNA, (c) CovType, (d) IJCNN1.]

7. REFERENCES

[1] R. Alaíz-Rodríguez, A. Guerrero-Curieses, and J. Cid-Sueiro. Class and subclass probability re-estimation to adapt a classifier in the presence of concept drift. Neurocomputing, 74(16):2614–2623, 2011.
[2] G. Balikas, I. Partalas, E. Gaussier, R. Babbar, and M.-R. Amini. Efficient model selection for regularized classification by exploiting unlabeled data. In Proceedings of the 14th International Symposium on Intelligent Data Analysis (IDA 2015), pages 25–36, Saint Etienne, FR, 2015.
[3] J. Barranquero, J. Díez, and J. J. del Coz. Quantification-oriented learning based on reliable classifiers. Pattern Recognition, 48(2):591–604, 2015.
[4] A. Bella, C. Ferri, J. Hernández-Orallo, and M. J. Ramírez-Quintana. Quantification via probability estimators. In Proceedings of the 11th IEEE International Conference on Data Mining (ICDM 2010), pages 737–742, Sydney, AU, 2010.
[5] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. In 15th Annual Conference on Neural Information Processing Systems (NIPS), pages 359–366, 2001.
[6] Y. S. Chan and H. T. Ng. Estimating class priors in domain adaptation for word sense disambiguation. In The 44th Annual Meeting of the Association for Computational Linguistics (ACL), pages 89–96, Sydney, AU, 2006.
[7] I. Csiszár and P. C. Shields. Information theory and statistics: A tutorial. Foundations and Trends in Communications and Information Theory, 1(4):417–528, 2004.
[8] M. C. du Plessis and M. Sugiyama. Semi-supervised learning of class balance under class-prior change by distribution matching. In The 29th International Conference on Machine Learning (ICML), 2012.
[9] A. Esuli and F. Sebastiani. Sentiment quantification. IEEE Intelligent Systems, 25(4):72–75, 2010.
[10] A. Esuli and F. Sebastiani. Optimizing text quantifiers for multivariate loss functions. ACM Transactions on Knowledge Discovery from Data, 9(4):Article 27, 2015.
[11] G. Forman. Quantifying counts and costs via classification. Data Mining and Knowledge Discovery, 2008.
[12] W. Gao and F. Sebastiani. Tweet sentiment: From classification to quantification. In Proceedings of the 7th International Conference on Advances in Social Network Analysis and Mining (ASONAM 2015), pages 97–104, Paris, FR, 2015.
[13] C. Gentile, S. Li, and G. Zappella. Online clustering of bandits. In The 31st International Conference on Machine Learning (ICML), 2014.
[14] V. González-Castro, R. Alaiz-Rodríguez, and E. Alegre. Class distribution estimation based on the Hellinger distance. Information Sciences, 218:146–164, 2013.
[15] D. J. Hopkins and G. King. A method of automated nonparametric content analysis for social science. American Journal of Political Science, 54(1):229–247, 2010.
[16] T. Joachims. A support vector method for multivariate performance measures. In Proceedings of the 22nd International Conference on Machine Learning (ICML 2005), pages 377–384, Bonn, DE, 2005.
[17] T. Joachims, T. Finley, and C. J. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27–59, 2009.
[18] R. N. Jorissen and M. K. Gilson. Virtual screening of molecular databases using a support vector machine. Journal of Chemical Information and Modeling, 45(3):549–561, 2005.
[19] S. M. Kakade and A. Tewari. On the generalization ability of online strongly convex programming algorithms. In NIPS, pages 801–808, 2008.
[20] P. Kar, H. Narasimhan, and P. Jain. Online and stochastic gradient methods for non-decomposable loss functions. In The 28th Annual Conference on Neural Information Processing Systems (NIPS), pages 694–702, Montreal, CA, 2014.
[21] P. Kar, H. Narasimhan, and P. Jain. Surrogate functions for maximizing precision at the top. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), pages 189–198, Lille, FR, 2015.
[22] G. King and Y. Lu. Verbal autopsy methods with multiple causes of death. Statistical Science, 23(1):78–91, 2008.
[23] N. Korda, B. Szorenyi, and S. Li. Distributed clustering of linear bandits in peer to peer networks. In The 33rd International Conference on Machine Learning (ICML), 2016.
[24] D. D. Lewis. Evaluating and optimizing autonomous text classification systems. In The 18th ACM International Conference on Information Retrieval (SIGIR), Seattle, US, 1995.
[25] S. Li, A. Karatzoglou, and C. Gentile. Collaborative filtering bandits. In The 39th International ACM SIGIR Conference on Information Retrieval (SIGIR), 2016.
[26] L. Milli, A. Monreale, G. Rossetti, F. Giannotti, D. Pedreschi, and F. Sebastiani. Quantification trees. In Proceedings of the 13th IEEE International Conference on Data Mining (ICDM 2013), pages 528–536, Dallas, US, 2013.
[27] H. Narasimhan, P. Kar, and P. Jain. Optimizing non-decomposable performance measures: A tale of two classes. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), pages 199–208, 2015.
[28] W. Pan, E. Zhong, and Q. Yang. Transfer learning for text mining. In C. C. Aggarwal and C. Zhai, editors, Mining Text Data, pages 223–258. Springer, Heidelberg, DE, 2012.
[29] S. P. Parambath, N. Usunier, and Y. Grandvalet. Optimizing F-measures by cost-sensitive classification. In Proceedings of the 28th Annual Conference on Neural Information Processing Systems (NIPS 2014), pages 2123–2131, 2014.
[30] M. Saerens, P. Latinne, and C. Decaestecker. Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure. Neural Computation, 14(1):21–41, 2002.
[31] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, Series B, 127(1):3–30, 2011.
[32] J. C. Xue and G. M. Weiss. Quantification and semi-supervised classification methods for handling changes in class distribution. In Proceedings of the 15th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD 2009), pages 897–906, Paris, FR, 2009.
[33] C. Zalinescu. Convex Analysis in General Vector Spaces. World Scientific Publishing, River Edge, NJ, 2002.
[34] Z. Zhang and J. Zhou. Transfer estimation of evolving class priors in data stream classification. Pattern Recognition, 43(9):3151–3161, 2010.
[35] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In 20th International Conference on Machine Learning (ICML), pages 928–936, 2003.

APPENDIX

A. PROOF OF THEOREM 3

We defer a detailed proof to an expanded version of this paper and give only the essential arguments in this proof sketch. We begin by observing the following general lemma regarding the follow-the-leader (FTL) algorithm for strongly convex losses.

LEMMA 7. Suppose we have an action space X and execute the follow-the-leader algorithm on a sequence of loss functions ℓ_t : X → R, each of which is α-strongly convex and L-Lipschitz. Then we have

  Σ_{t=1}^T ℓ_t(x_t) − inf_{x∈X} Σ_{t=1}^T ℓ_t(x) ≤ (L² log T)/α,

where x_{t+1} = argmin_{x∈X} Σ_{τ=1}^t ℓ_τ(x) are the FTL plays.

PROOF. By the standard forward regret analysis, we get

  Σ_{t=1}^T ℓ_t(x_t) − inf_{x∈X} Σ_{t=1}^T ℓ_t(x) ≤ Σ_{t=1}^T ( ℓ_t(x_t) − ℓ_t(x_{t+1}) ).

Now, by using the strong convexity of the loss functions, and the fact that the strong convexity property is additive, we get

  Σ_{τ=1}^{t−1} ℓ_τ(x_{t+1}) ≥ Σ_{τ=1}^{t−1} ℓ_τ(x_t) + (α(t−1)/2)·‖x_t − x_{t+1}‖₂²,
  Σ_{τ=1}^{t} ℓ_τ(x_t) ≥ Σ_{τ=1}^{t} ℓ_τ(x_{t+1}) + (αt/2)·‖x_t − x_{t+1}‖₂²,

which gives us

  ℓ_t(x_t) − ℓ_t(x_{t+1}) ≥ (α/2)·(2t − 1)·‖x_t − x_{t+1}‖₂².

However, we also get ℓ_t(x_t) − ℓ_t(x_{t+1}) ≤ L·‖x_t − x_{t+1}‖₂ by invoking the Lipschitz property of the loss functions. This gives us

  ‖x_t − x_{t+1}‖₂ ≤ 2L/(α(2t − 1)).

This, upon applying the Lipschitz property again, gives us

  ℓ_t(x_t) − ℓ_t(x_{t+1}) ≤ 2L²/(α(2t − 1)).

Summing over all the time steps gives us the desired result.

For the rest of the proof, we shall use the shorthand notation that we used in the outline of Algorithm 1, i.e.,

  α_t = (α_{t,1}, α_{t,2})  (the dual variables for ζ1),
  β_t = (β_{t,1}, β_{t,2})  (the dual variables for ζ2),
  γ_t = (γ_{t,1}, γ_{t,2})  (the dual variables for Ψ).

We will also use the additional notation

  s_t = (r⁺(w_t; x_t, y_t), r⁻(w_t; x_t, y_t)),
  p_t = (α_t⊤s_t − ζ1*(α_t), β_t⊤s_t − ζ2*(β_t)),
  ℓ_t(w) = (r⁺(w; x_t, y_t), r⁻(w; x_t, y_t)),
  R(w) = (P(w), N(w)).

Note that ℓ_t(w_t) = s_t. We now define the quantity that we shall be analyzing to obtain the convergence bound:

  (A) = Σ_{t=1}^T ( γ_t⊤p_t − Ψ*(γ_t) ).

Upper bound on (A). Since Ψ is β-smooth and concave, by Theorem 2 we know that Ψ* is (1/β)-strongly concave, which means that the loss function g_t(γ) := γ⊤p_t − Ψ*(γ) is (1/β)-strongly convex. Now, the NEMSIS algorithm (step 17) implements

  γ_t = argmin_γ { γ · q_t − Ψ*(γ) },

where q_t = (1/t)·Σ_{τ=1}^t p_τ. Notice that this is identical to the FTL algorithm run on the losses g_t, which are strongly convex. Thus, by an application of Lemma 7, we get

  (A) ≤ inf_γ { Σ_{t=1}^T ( γ⊤p_t − Ψ*(γ) ) } + βL² log T.

The same technique, along with the observation that steps 15 and 16 of Algorithm 1 also implement the FTL algorithm, can be used to get the following results:

  Σ_{t=1}^T ( α_t⊤s_t − ζ1*(α_t) ) ≤ inf_α { Σ_{t=1}^T ( α⊤s_t − ζ1*(α) ) } + βL² log T,

and

  Σ_{t=1}^T ( β_t⊤s_t − ζ2*(β_t) ) ≤ inf_β { Σ_{t=1}^T ( β⊤s_t − ζ2*(β) ) } + βL² log T.

Note that these are much stronger bounds than what Narasimhan et al. [27] obtain for their gradient-descent-based dual updates; this, in some sense, establishes the superiority of the follow-the-leader-type updates used by NEMSIS. Expanding the definition of p_t in the bound on (A) and then applying the two bounds above gives us

  (A) ≤ inf_{γ,α,β} { γ₁·Σ_{t=1}^T ( α⊤s_t − ζ1*(α) ) + γ₂·Σ_{t=1}^T ( β⊤s_t − ζ2*(β) ) − T·Ψ*(γ) } + O(βL² log T).

Now, because of the stochastic nature of the samples, we have

  E[ s_t | {(x_τ, y_τ)}_{τ=1}^{t−1} ] = E[ ℓ_t(w_t) | {(x_τ, y_τ)}_{τ=1}^{t−1} ] = R(w_t).

Thus, applying an online-to-batch conversion bound for strongly convex loss functions (for example [19]) gives us, with probability at least 1 − δ,

  (A)/T ≤ inf_{γ,α,β} { γ₁·( α⊤R̄ − ζ1*(α) ) + γ₂·( β⊤R̄ − ζ2*(β) ) − Ψ*(γ) } + (βL²/T)·( log T + log(1/δ) )
        = inf_γ { γ₁·ζ1(R̄) + γ₂·ζ2(R̄) − Ψ*(γ) } + (βL²/T)·( log T + log(1/δ) )
        = Ψ( ζ1(R̄), ζ2(R̄) ) + (βL²/T)·( log T + log(1/δ) ),

where R̄ = (1/T)·Σ_{t=1}^T R(w_t) and we have used the Fenchel identity from Section 4.1 to eliminate the inner infima. By Jensen's inequality, the concavity of the functions P(w) and N(w), and the assumption that ζ1 and ζ2 are increasing functions of both their arguments, we get ζ_i(R̄) ≤ ζ_i(R(w̄)), so that, with probability at least 1 − δ,

  (A) ≤ T·Ψ( ζ1(R(w̄)), ζ2(R(w̄)) ) + βL²·log(T/δ).

Lower bound on (A). A similar analysis can be applied to the primal updates. We can confirm that steps 6 and 9 of the NEMSIS algorithm (Algorithm 1) perform gradient ascent updates with respect to the loss functions

  h_t(w) = γ_{t,1}·( α_t⊤ℓ_t(w) − ζ1*(α_t) ) + γ_{t,2}·( β_t⊤ℓ_t(w) − ζ2*(β_t) ) − Ψ*(γ_t),

and that (A) = Σ_{t=1}^T h_t(w_t). Since the functions h_t(·) are concave and Lipschitz (due to the assumptions on the smoothness and values of the reward functions), the standard regret analysis for online gradient ascent (for example [35]), followed by a standard online-to-batch conversion bound (for example [5]), gives us, with probability at least 1 − δ, for any fixed w* ∈ W,

  (A)/T ≥ (1/T)·Σ_{t=1}^T γ_{t,1}·( α_t⊤R(w*) − ζ1*(α_t) )   [=: (B)]
        + (1/T)·Σ_{t=1}^T γ_{t,2}·( β_t⊤R(w*) − ζ2*(β_t) )   [=: (C)]
        − (1/T)·Σ_{t=1}^T Ψ*(γ_t) − log(1/δ)/√T.

Analyzing the expression (B) gives us

  (B) = ( Σ_t γ_{t,1} / T ) · [ ( Σ_t γ_{t,1}·α_t / Σ_t γ_{t,1} )⊤ R(w*) − Σ_t ( γ_{t,1} / Σ_t γ_{t,1} )·ζ1*(α_t) ]
      ≥ γ̄₁ · [ ( Σ_t γ_{t,1}·α_t / Σ_t γ_{t,1} )⊤ R(w*) − ζ1*( Σ_t γ_{t,1}·α_t / Σ_t γ_{t,1} ) ]
      ≥ γ̄₁ · min_α { α⊤R(w*) − ζ1*(α) } = γ̄₁ · ζ1(R(w*)),

where γ̄₁ = (1/T)·Σ_t γ_{t,1}, and the first inequality follows from Jensen's inequality and the concavity of ζ1*. A similar analysis for (C) gives (C) ≥ γ̄₂·ζ2(R(w*)). Thus, with probability at least 1 − δ,

  (A)/T ≥ γ̄₁·ζ1(R(w*)) + γ̄₂·ζ2(R(w*)) − (1/T)·Σ_t Ψ*(γ_t) − log(1/δ)/√T
        ≥ γ̄₁·ζ1(R(w*)) + γ̄₂·ζ2(R(w*)) − Ψ*(γ̄) − log(1/δ)/√T
        ≥ min_γ { γ₁·ζ1(R(w*)) + γ₂·ζ2(R(w*)) − Ψ*(γ) } − log(1/δ)/√T
        = Ψ( ζ1(R(w*)), ζ2(R(w*)) ) − log(1/δ)/√T,

where the second inequality again uses Jensen's inequality and the concavity of Ψ*, with γ̄ = (1/T)·Σ_t γ_t. Thus, we have with probability at least 1 − δ,

  (A) ≥ T·Ψ( ζ1(R(w*)), ζ2(R(w*)) ) − √T·log(1/δ).

Combining the upper and lower bounds on (A) finishes the proof.

Since the functions ht (·) are concave and Lipschitz (due to assumptions on the smoothness and values of the reward functions), the standard regret analysis for online gradient ascent (for example

B.

PROOF OF THEOREM 5

We will prove the result by proving a sequence of claims. The first claim ensures that the distance to the optimum performance value is bounded by the performance value we obtain in terms of the valuation function at any step. For notational simplicity, we will use the shorthand P(w) := P(Pquant ,Pclass ) (w) . C LAIM 8. P ∗ := supw∈W P(Pquant ,Pclass ) (w) be the optimal performance level. Also, define et = V (wt+1 , vt ). Then, for any t, we have et P ∗ − P(wt ) ≤ m

P ROOF. We will prove the result by contradiction. Suppose ˜ ∈ W such that there exists some w et ˜ = P(w) + vt + e′ =: v ′ , m where e′ > 0. Then we have ˜ vt ) − et = Pclass (w) ˜ − vt · Pquant (w) ˜ − et . V (w,

˜ = v ′ , we have Pclass (w) ˜ − v ′ · Pquant (w) ˜ = 0 Now since P(w) which gives us e  e  t t ˜ vt )−et = ˜ V (w, + e′ Pquant (w)−e + e′ m−et > 0. t ≥ m m But this contradicts the fact that maxw∈W V (w, vt ) = et . This completes the proof.

The second claim then establishes that in case we do get a large performance value in terms of the valuation function at any time step, the next iterate will have a large leap in performance in terms of the original performance function P. C LAIM 9. For any time instant t we have et P(wt+1 ) ≥ P(wt ) + M P ROOF. By our definition, we have V (wt+1 , vt ) = et . This gives us  et  Pclass (wt+1 ) − vt + D  et  = vt · Pquant (wt+1 ) + et − vt + Pquant (wt+1 ) D  et ≥ vt · Pquant (wt+1 ) + et − vt + M ≥ 0, D which proves the result. We are now ready to establish the convergence proof. Let ∆t = P ∗ − P(Pquant ,Pclass ) (wt ). Then we have, by Claim 8 e t ≥ m · ∆t , and also P(wt+1 ) ≥ P(wt ) +

et , D

by Claim 9. Subtracting both sides of the above equation from P ∗ gives us et ∆t+1 = ∆t − D  m m · ∆t , · ∆t = 1 − ≤ ∆t − D D which concludes the convergence proof.
