Local Rademacher Complexity-based Learning Guarantees for Multi-Task Learning
arXiv:1602.05916v2 [cs.LG] 9 Feb 2017
Niloofar Yousefi 1, Yunwen Lei 2, Marius Kloft 3, Mansooreh Mollaghasemi 4 and Georgios Anagnostopoulos 5

1 Department of Electrical Engineering and Computer Science, University of Central Florida
2 Department of Mathematics, City University of Hong Kong
3 Department of Computer Science, Humboldt University of Berlin
4 Department of Industrial Engineering & Management Systems, University of Central Florida
5 Department of Electrical and Computer Engineering, Florida Institute of Technology

[email protected], [email protected], [email protected], [email protected] and [email protected]
Abstract We show a Talagrand-type concentration inequality for Multi-Task Learning (MTL), using which we establish sharp excess risk bounds for MTL in terms of distribution- and data-dependent versions of the Local Rademacher Complexity (LRC). We also give a new bound on the LRC for norm regularized as well as strongly convex hypothesis classes, which applies not only to MTL but also to the standard i.i.d. setting. Combining both results, one can now easily derive fast-rate bounds on the excess risk for many prominent MTL methods, including, as we demonstrate, Schatten-norm, group-norm, and graph-regularized MTL. The derived bounds reflect a relationship akin to a conservation law of asymptotic convergence rates. This very relationship allows for trading off slower rates w.r.t. the number of tasks for faster rates with respect to the number of available samples per task, when compared to the rates obtained via a traditional, global Rademacher analysis.
Keywords: Multi-Task Learning, Generalization Bound, Local Rademacher Complexity.
1 Introduction A commonly occurring problem, when applying machine learning in the sciences, is the lack of a sufficient amount of training data to attain acceptable performance results; either obtaining such data may be very costly, or they may be unavailable due to technological limitations. For example, in cancer genomics, tumor bioptic samples may be relatively scarce due to the limited number of cancer patients, when compared to samples of healthy individuals. In neuroscience, electroencephalogram experiments are carried out on human subjects to record training data and typically involve only a few dozen subjects. When considering any type of prediction task per individual subject in such settings (for example, whether the subject is indeed suffering from a specific medical affliction or not), relying solely on the scarce data per individual most often leads to inadequate predictive performance. Such a direct approach completely ignores the advantages that might be gained when considering intrinsic, strong similarities between subjects and, hence, tasks. Revisiting the area of genomics, different living organisms can be related to each other in terms of their evolutionary relationships; the information of how they are genetically related to each other can be obtained from the tree of life. Taking into account such relationships may be instrumental in detecting genes of recently developed organisms, for which only a limited amount of training data is available. While our discussion so far focused on the realm of biomedicine, similar limitations, and opportunities to overcome them, exist in other fields as well.
Transfer learning [53] and, in particular, Multi-Task Learning (MTL) [19] leverage such underlying common links among a group of tasks, while respecting the tasks’ individual idiosyncrasies to the extent warranted. This is achieved by phrasing the learning process as a joint, mutually dependent learning problem. An early example of such a learning paradigm is the neural network-based approach introduced by [19], while more recent works consider convex MTL problems [2, 25, 4]. At the core of each MTL formulation lies a mechanism that encodes task relatedness into the learning problem [24]. Such a relatedness mechanism can always be thought of as jointly constraining the tasks’ hypothesis spaces, so that their geometry is mutually coupled, e.g., via a block norm constraint [65]. Thus, from a regularization perspective, the tasks mutually regularize their learning based on their inter-task relatedness. This process of information exchange during co-learning is often referred to as information sharing. With respect to learning-theoretical results, the analysis of MTL goes back to the seminal work by [11], which was followed up by the works of [2, 45]. Nowadays, MTL frameworks are routinely employed in a variety of settings. Some recent applications include computational genetics [63], image segmentation [1], HIV therapy screening [14], collaborative filtering [18], age estimation from facial images [67], and sub-cellular location prediction [64], just to name a few prominent ones. MTL learning guarantees are centered around the notion of (global) Rademacher complexities; notions that were introduced to the machine learning community by [7, 9, 34, 31, 30], and employed in the context of MTL by [45, 46, 26, 48, 47, 49]. All these papers are briefly surveyed in Sect. 1.3. To formally recapitulate the essence of these works, let $T$ denote the number of tasks being co-learned and $n$ denote the number of available observations per task.
Then, in terms of convergence rates in the number of samples and tasks, respectively, the fastest-converging error or excess risk bounds derived in these works, whether distribution- or data-dependent, are of the order $O(1/\sqrt{nT})$. More recently, the seminal works by [32] and [8] introduced a more nuanced variant of these complexities, termed Local Rademacher Complexity (LRC) (as opposed to the original Global Rademacher Complexity (GRC)). This new, modified function-class complexity measure is noteworthy since, as shown by [8], an LRC-based analysis is capable of producing more rapidly converging excess risk bounds ("fast rates") when compared to the ones obtained via a GRC analysis. This can be attributed to the fact that, unlike LRCs, GRCs ignore the fact that learning algorithms typically choose well-performing hypotheses that belong only to a subset of the entire hypothesis space under consideration. This distinction empowers a local analysis to provide less conservative and, hence, sharper bounds than the standard global analysis. To date, there have been only a few additional works attempting to reap the benefits of such local analysis in various contexts: active learning for binary classification tasks [33], multiple kernel learning [28, 20], transductive learning [60], semi-supervised learning [52] and bounds on the LRCs via covering numbers [37].
1.1 Our Contributions Through a Talagrand-type concentration inequality adapted to the MTL case, this paper’s main contribution is the derivation of sharp bounds on the excess MTL risk in terms of the distribution- and data-dependent LRC. For a given number of tasks $T$, these bounds admit faster (asymptotic) convergence characteristics in the number of observations per task $n$, when compared to corresponding bounds hinging on the GRC. Hence, these faster rates allow us to increase the confidence that the MTL hypothesis selected by a learning algorithm approaches the best-in-class solution as $n$ increases beyond a certain threshold. We also prove a new bound on the LRC, which holds generally for hypothesis classes with any norm function or strongly convex regularizer. This bound readily facilitates the bounding of the LRC for a range of such regularizers (not only for MTL, but also for the standard i.i.d. setting), as we demonstrate for classes induced by graph-based, Schatten- and group-norm regularizers. Moreover, we prove matching lower bounds showing that, aside from constants, the LRC-based bounds are tight for the considered applications. Our derived bounds reflect that one can trade off a slow convergence speed w.r.t. $T$ for an improved convergence rate w.r.t. $n$. The latter ranges, in the worst case, from the typical GRC-based rate $O(1/\sqrt{n})$ all the way up to the fastest rate of order $O(1/n)$, by allowing the bound to depend less on $T$. Nevertheless, the premium in question becomes less relevant to MTL, in which $T$ is typically considered fixed. Fixing all other parameters, when the number of samples per task $n$ approaches infinity, our local bounds yield faster rates compared to their global counterparts. Also, we observe that if the number of tasks $T$ and the radius $R$ of the norm-balls (in the considered norm regularized hypotheses) can grow with $n$, there are cases in which the local analysis always improves over the global one.
Comparing our local bounds to related works [48, 46] based on a global analysis, one can observe that our bounds give faster convergence rates, of the orders $1/T$ and $1/n$ in terms of the number of tasks $T$ and the number of samples per task $n$, respectively.
1.2 Organization The paper is organized as follows: Sect. 2 lays the foundations for our analysis by considering a Talagrand-type concentration inequality suitable for deriving our bounds. Next, in Sect. 3, after suitably defining LRCs for MTL hypothesis spaces, we provide our LRC-based MTL excess risk bounds. Based on these bounds, we follow up this section with a local analysis of linear MTL frameworks, in which task-relatedness is presumed and enforced by imposing a norm constraint. In more detail, leveraging the Fenchel-Young and Hölder inequalities, Sect. 4 presents generic upper bounds on the relevant LRC for any strongly convex as well as any norm regularized hypothesis class. These results are subsequently specialized to the cases of group-norm, Schatten-norm and graph-regularized linear MTL. Then, Sect. 5 supplies the corresponding excess risk bounds based on the LRC bounds of the aforementioned hypothesis classes. The paper concludes with Sect. 6, which compares side by side the GRC- and LRC-based excess risk bounds for the aforementioned hypothesis spaces, as well as two additional related works based on a GRC analysis.
1.3 Previous Related Works An earlier work by [45], which considers linear MTL frameworks for binary classification, investigates generalization guarantees based on Rademacher averages. In this framework, all tasks are pre-processed by a common bounded linear operator, and operator norm constraints are used to control the complexity of the associated hypothesis spaces. The GRC-based error bounds derived are of order $O(1/\sqrt{nT})$ for both the distribution-dependent and data-dependent cases. Another study [46] provides bounds for the empirical and expected Rademacher complexities of linear transformation classes. Based on Hölder’s inequality, GRC-based risk bounds of order $O(1/\sqrt{nT})$ are established for MTL hypothesis spaces with graph-based and $L_{S_q}$-Schatten norm regularizers, where $q \in \{2\} \cup [4, \infty]$. The subject of MTL generalization guarantees has experienced renewed attention in recent years. [26] take advantage of the strongly convex nature of certain matrix-norm regularizers to easily obtain generalization bounds for a variety of machine learning problems. Part of their work is devoted to the realm of online and off-line MTL. In the latter case, which pertains to the focus of our work, the paper provides a distribution-dependent GRC-based excess risk bound of order $O(1/\sqrt{nT})$. Moreover, [48] present a global Rademacher complexity analysis leading to both data- and distribution-dependent excess risk bounds of order $O(\sqrt{\log(nT)/nT})$ for a trace norm regularized MTL model. Also, [47] examines the bounding of (global) Gaussian complexities of function classes that result from considering composite maps, as is typical in MTL among other settings. An application of that paper’s results yields MTL risk bounds of order $O(1/\sqrt{nT})$. More recently, [49] presents excess risk bounds of order $O(1/\sqrt{nT})$ for both the MTL and Learning-To-Learn (LTL) settings and reveals conditions under which MTL is more beneficial than learning tasks independently.
Finally, since they are domains related to MTL but, at the same time, less connected to the focus of this paper, we only mention in passing a few works that pertain to generalization guarantees in the realms of life-long learning and domain adaptation. Generalization performance in life-long learning has been investigated by [59, 13, 12, 55] and [54]. Also, in the context of domain adaptation, similar considerations are examined by [41, 43, 42, 21, 66, 44] and [22].
1.4 Basic Assumptions & Notation Consider $T$ supervised learning tasks sampled from the same input-output space $\mathcal{X} \times \mathcal{Y}$. Each task $t$ is represented by an independent random variable $(X_t, Y_t)$ governed by a probability distribution $\mu_t$. Also, the i.i.d. samples related to each task $t$ are described by the sequence $(X_t^i, Y_t^i)_{i=1}^n$, which is distributed according to $\mu_t$. In what follows, we use the following notational conventions: vectors and matrices are depicted in boldface. The superscript $\mathsf{T}$, when applied to a vector/matrix, denotes the transpose of that quantity. We define $\mathbb{N}_T := \{1, \ldots, T\}$. For any random variables $X$, $Y$ and function $g$, we use $\mathbb{E}\, g(X, Y)$ and $\mathbb{E}_X\, g(X, Y)$ to denote the expectation with respect to (w.r.t.) all the involved random variables and the conditional expectation w.r.t. the random variable $X$, respectively. For any vector-valued function $f = (f_1, \ldots, f_T)$, we introduce the following two notations:
\[
P f := \frac{1}{T} \sum_{t=1}^T P f_t = \frac{1}{T} \sum_{t=1}^T \mathbb{E}(f_t(X_t)), \qquad
P_n f := \frac{1}{T} \sum_{t=1}^T P_n f_t = \frac{1}{T} \sum_{t=1}^T \frac{1}{n} \sum_{i=1}^n f_t(X_t^i).
\]
We also denote $f^\alpha = (f_1^\alpha, \ldots, f_T^\alpha)$, $\forall \alpha \in \mathbb{R}$. For any loss function $\ell$ and any $f = (f_1, \ldots, f_T)$, we define $\ell_f = (\ell_{f_1}, \ldots, \ell_{f_T})$, where $\ell_{f_t}$ is the function defined by $\ell_{f_t}((X_t, Y_t)) = \ell(f_t(X_t), Y_t)$.
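As a purely illustrative sketch (toy data and a toy per-task function, all names hypothetical), the averaged empirical operator $P_n f$ can be computed as:

```python
import numpy as np

# Illustrative computation of the averaged empirical operator
#   P_n f = (1/T) sum_t (1/n) sum_i f_t(X_t^i).
# The data and the per-task functions below are hypothetical toy choices.
rng = np.random.default_rng(0)
T, n = 3, 5
X = rng.normal(size=(T, n))          # X[t, i] plays the role of X_t^i

def f_t(t, x):
    # toy per-task function f_t(x) = (t + 1) * x
    return (t + 1) * x

Pn_f = np.mean([np.mean(f_t(t, X[t])) for t in range(T)])
print(Pn_f)
```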
Finally, let us mention that, in the subsequent material, measurability of functions and suprema is assumed whenever necessary. Additionally, in later material, operators on separable Hilbert spaces are assumed to be of trace class.
2 Talagrand-Type Inequality for Multi-Task Learning The derivation of our LRC-based error bounds for MTL is founded on the following modified Talagrand’s concentration inequality adapted to the context of MTL, showing that the uniform deviation between the true and empirical means in a vector-valued function class $\mathcal{F}$ can be dominated by the associated multi-task Rademacher complexity plus a term involving the variance of the functions in $\mathcal{F}$. We defer the proof of this theorem to Appendix A.

Theorem 1 (Talagrand-type inequality for MTL). Let $\mathcal{F} = \{f := (f_1, \ldots, f_T)\}$ be a class of vector-valued functions satisfying $\sup_{t,x} |f_t(x)| \le b$. Let $X := (X_t^i)_{(t,i)=(1,1)}^{(T,N_t)}$ be a vector of $\sum_{t=1}^T N_t$ independent random variables, where $X_t^1, \ldots, X_t^{N_t}$ are identically distributed for each $t$. Let $\{\sigma_t^i\}_{t,i}$ be a sequence of independent Rademacher variables. If $\frac{1}{T} \sup_{f \in \mathcal{F}} \sum_{t=1}^T \mathbb{E}\,[f_t(X_t^1)]^2 \le r$, then for every $x > 0$, with probability at least $1 - e^{-x}$,
\[
\sup_{f \in \mathcal{F}} (P f - P_n f) \le 4 R(\mathcal{F}) + \sqrt{\frac{8 x r}{nT}} + \frac{12 b x}{nT}, \tag{1}
\]
where $n := \min_{t \in \mathbb{N}_T} N_t$, and the multi-task Rademacher complexity of the function class $\mathcal{F}$ is defined as
\[
R(\mathcal{F}) := \mathbb{E}_{X,\sigma} \Bigg\{ \sup_{f = (f_1, \ldots, f_T) \in \mathcal{F}} \frac{1}{T} \sum_{t=1}^T \frac{1}{N_t} \sum_{i=1}^{N_t} \sigma_t^i f_t(X_t^i) \Bigg\}.
\]
Note that the same bound also holds for $\sup_{f \in \mathcal{F}} (P_n f - P f)$. In Theorem 1, the data from different tasks are assumed to be mutually independent, which is typical in the MTL setting [45]. To present the results in a clear way, we always assume in the following that the number of available samples for each task is the same, namely $n$.

Remark 2. At this point, we would like to present the result of the above theorem for the special case $T = 1$, which corresponds to the traditional single-task learning framework. It is very easy to verify that for $T = 1$, the bound in (1) can be written as
\[
\sup_{f \in \mathcal{F}} (P f - P_n f) \le 4 R(\mathcal{F}) + \sqrt{\frac{8 x r}{n}} + \frac{12 b x}{n}, \tag{2}
\]
where the function $f$ is chosen from a scalar-valued function class $\mathcal{F}$. This bound can be compared to the result of Theorem 2.1 of [8] (for $\alpha = 1$), which reads as
\[
\sup_{f \in \mathcal{F}} (P f - P_n f) \le 4 R(\mathcal{F}) + \sqrt{\frac{2 x r}{n}} + \frac{8 b x}{n}. \tag{3}
\]
Note that the difference between the constants in (2) and (3) is due to the fact that we could not directly apply Bousquet’s version of Talagrand’s inequality, as has been done by [8] for scalar-valued functions, to the class of vector-valued functions. To make this clearer, let $Z$ be defined as in (A.3) with the jackknife replication $Z_{s,j}$, for which a lower bound $Z_{s,j}''$ can be found such that $Z_{s,j}'' \le Z - Z_{s,j}$. Then, in order to apply Theorem 2.5 of [17], one needs to show that the quantity $\frac{1}{nT} \sum_{s=1}^T \sum_{j=1}^n \mathbb{E}_{s,j}\big[(Z_{s,j}'')^2\big]$ is bounded. This goal, ideally, could be achieved by including a constraint similar to $\frac{1}{T} \sup_{f \in \mathcal{F}} \sum_{t=1}^T \mathbb{E}\,[f_t(X_t^1)]^2 \le r$ in Theorem 1. However, we could not come up with an obvious and meaningful way, appropriate for MTL, of defining this constraint so as to bound $\frac{1}{nT} \sum_{s=1}^T \sum_{j=1}^n \mathbb{E}_{s,j}\big[(Z_{s,j}'')^2\big]$ in terms of $r$.
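For a small finite function class, the multi-task Rademacher complexity $R(\mathcal{F})$ of Theorem 1 can be approximated by Monte-Carlo sampling of the Rademacher signs. The sketch below is purely illustrative: the data, the class members and all names are hypothetical.

```python
import numpy as np

# Monte-Carlo estimate of the empirical multi-task Rademacher complexity
#   R(F) = E_sigma sup_{f in F} (1/T) sum_t (1/n) sum_i sigma_t^i f_t(X_t^i)
# for a small *finite* class of linear per-task predictors (hypothetical setup).
rng = np.random.default_rng(0)
T, n, d = 2, 50, 3
X = rng.normal(size=(T, n, d))                    # X[t, i] plays the role of X_t^i
F = [rng.normal(size=(T, d)) for _ in range(10)]  # each member: f_t(x) = <W[t], x>

def multitask_rademacher(X, F, n_mc=200, rng=rng):
    T, n, _ = X.shape
    total = 0.0
    for _ in range(n_mc):
        sigma = rng.choice([-1.0, 1.0], size=(T, n))   # Rademacher signs sigma_t^i
        # supremum over the finite class of the doubly averaged Rademacher sum
        vals = [np.mean([sigma[t] @ (X[t] @ W[t]) / n for t in range(T)]) for W in F]
        total += max(vals)
    return total / n_mc

R_hat = multitask_rademacher(X, F)
print(R_hat)
```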
We would like to emphasize that the key ingredient in the proof of Theorem 1 is the so-called logarithmic Sobolev inequality (Theorem A.1), which can be considered the exponential version of the Efron-Stein inequality.
3 Excess MTL Risk Bounds based on Local Rademacher Complexities The cornerstone of Sect. 2’s results is the presence of an upper bound on the variance of an empirical process (the second term on the right-hand side of (1)). In this section, we consider the Rademacher averages associated with a smaller subset of the function class $\mathcal{F}$ and use them as a complexity term in the context of excess risk bounds. As pointed out by [8], these (local) averages are always smaller than the corresponding global Rademacher averages and allow for eventually deriving sharper generalization bounds. Herein, we exploit this very fact for MTL generalization guarantees. Theorem 1 motivates us to extend the definition of the classical LRC $R(\mathcal{F}^{\mathrm{sclr}}, r)$ for a scalar-valued function class $\mathcal{F}^{\mathrm{sclr}}$,
\[
R(\mathcal{F}^{\mathrm{sclr}}, r) := \mathbb{E}_{X,\sigma} \sup_{f \in \mathcal{F}^{\mathrm{sclr}},\, V(f) \le r} \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i),
\]
to the Multi-Task Local Rademacher Complexity (MT-LRC) using the following definition.

Definition 1 (Multi-Task Local Rademacher Complexity). For a vector-valued function class $\mathcal{F}$, the Local Rademacher Complexity $R(\mathcal{F}, r)$ and its empirical counterpart $\hat{R}(\mathcal{F}, r)$ are defined as
\[
R(\mathcal{F}, r) := \mathbb{E} \sup_{\substack{f = (f_1, \ldots, f_T) \in \mathcal{F} \\ V(f) \le r}} \frac{1}{nT} \sum_{t=1}^T \sum_{i=1}^n \sigma_t^i f_t(X_t^i), \qquad
\hat{R}(\mathcal{F}, r) := \mathbb{E}_\sigma \sup_{\substack{f = (f_1, \ldots, f_T) \in \mathcal{F} \\ V_n(f) \le r}} \frac{1}{nT} \sum_{t=1}^T \sum_{i=1}^n \sigma_t^i f_t(X_t^i), \tag{4}
\]
where $V(f)$ and $V_n(f)$ are upper bounds on the variances and empirical variances of the functions in $\mathcal{F}$, respectively. This paper makes the choice $V(f) = P f^2$ and $V_n(f) = P_n f^2$, where
\[
P f^2 := \frac{1}{T} \sum_{t=1}^T P f_t^2 = \frac{1}{T} \sum_{t=1}^T \mathbb{E}\,[f_t(X_t)]^2, \qquad
P_n f^2 := \frac{1}{T} \sum_{t=1}^T P_n f_t^2 = \frac{1}{T} \sum_{t=1}^T \frac{1}{n} \sum_{i=1}^n (f_t(X_t^i))^2.
\]
Analogous to single-task learning, the challenge in applying the MT-LRCs in (4) to refine the existing learning rates is to find an optimal radius trading off the size of the set $\{f \in \mathcal{F} : V(f) \le r\}$ against its complexity, which, as we show later, reduces to the calculation of the fixed point of a sub-root function.

Definition 2 (Sub-Root Function). A function $\psi : [0, \infty] \to [0, \infty]$ is sub-root if
1. $\psi$ is non-negative,
2. $\psi$ is non-decreasing,
3. $r \mapsto \psi(r)/\sqrt{r}$ is non-increasing for $r > 0$.

The following lemma is an immediate consequence of the above definition.

Lemma 1 (Lemma 3.2 in [8]). If $\psi$ is a sub-root function, then it is continuous on $[0, \infty]$, and the equation $\psi(r) = r$ has a unique (non-zero) solution, which is known as the fixed point of $\psi$ and denoted by $r^*$. Moreover, for any $r > 0$, it holds that $r \ge \psi(r)$ if and only if $r^* \le r$.

We will see later that this fixed point plays a key role in the local error bounds. The definition of the local Rademacher complexity is based on the fact that, by incorporating the variance constraint, a better error rate for the bounds can be obtained. In other words, the key point in deriving fast-rate bounds is that, around the best function $f^*$ (the function that minimizes the true risk) in the class, the variance of the difference between the true and empirical errors of functions is upper bounded by a linear function of the expectation of this difference. We will call a class with this property a Bernstein class, and we provide a definition of a vector-valued Bernstein class $\mathcal{F}$ as follows.
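Lemma 1 also suggests a simple way to locate the fixed point numerically: since $\psi$ is sub-root, the iteration $r \leftarrow \psi(r)$ converges to $r^*$. A minimal sketch, with a hypothetical sub-root function of the typical form $\psi(r) = a\sqrt{r} + b$:

```python
import math

# Locating the fixed point r* of a sub-root function psi (Lemma 1) by the
# iteration r <- psi(r). The function psi(r) = a*sqrt(r) + b below is a
# hypothetical sub-root function of the typical form arising in such bounds.
def psi(r, a=2.0, b=0.5):
    return a * math.sqrt(r) + b

def fixed_point(psi, r0=1.0, tol=1e-12, max_iter=10_000):
    r = r0
    for _ in range(max_iter):
        r_next = psi(r)
        if abs(r_next - r) < tol:
            return r_next
        r = r_next
    return r

r_star = fixed_point(psi)
print(r_star)
```

For this choice, solving $r = 2\sqrt{r} + 0.5$ gives $r^* = (1 + \sqrt{6}/2)^2 \approx 4.95$; the iteration converges because $\psi'(r) = 1/\sqrt{r} < 1$ in a neighborhood of $r^*$.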
Definition 3 (Vector-Valued Bernstein Class). A vector-valued function class $\mathcal{F}$ is said to be a $(\beta, B)$-Bernstein class with respect to the probability measure $P$, with $0 < \beta \le 1$ and $B \ge 1$, if for any $f \in \mathcal{F}$ there exists a function $V : \mathcal{F} \to \mathbb{R}^+$ such that
\[
P f^2 \le V(f) \le B (P f)^\beta. \tag{5}
\]
It can be shown that the Bernstein condition (5) is not too restrictive; it holds, for example, for non-negative bounded functions with respect to any probability distribution [10]. Other examples include the class of excess risk functions $\mathcal{L}_{\mathcal{F}} := \{\ell_f - \ell_{f^*} : f \in \mathcal{F}\}$, with $f^* \in \mathcal{F}$ the minimizer of $P \ell_f$, when the function class $\mathcal{F}$ is convex and the loss function $\ell$ is strictly convex. In this section, we show that under some mild assumptions on a vector-valued Bernstein class of functions, LRC-based excess risk bounds can be established for MTL. We suppose that the loss function $\ell$ and the vector-valued hypothesis space $\mathcal{F}$ satisfy the following conditions:

Assumptions 1.
1. There is a function $f^* = (f_1^*, \ldots, f_T^*) \in \mathcal{F}$ satisfying $P \ell_{f^*} = \inf_{f \in \mathcal{F}} P \ell_f$.
2. There is a constant $B' \ge 1$ such that for every $f \in \mathcal{F}$ we have $P (f - f^*)^2 \le B' P (\ell_f - \ell_{f^*})$.
3. There is a constant $L$ such that the loss function $\ell$ is $L$-Lipschitz in its first argument.

As pointed out by [8], there are many examples of regularized algorithms for which these conditions are satisfied. More specifically, a uniform convexity condition on the loss function $\ell$ is usually sufficient to satisfy Assumption 1.2. As an example for which this assumption holds, [8] referred to the quadratic loss function $\ell(f(X), Y) = (f(X) - Y)^2$ when the functions $f \in \mathcal{F}$ are uniformly bounded. More specifically, if for all $f \in \mathcal{F}$, $X \in \mathcal{X}$ and $Y \in \mathcal{Y}$ it holds that $|f(X) - Y| \in [0, 1]$, then it can be shown that the conditions of Assumptions 1 are met with $L = 1$ and $B = 1$. We now present the main result of this section, showing that the excess error of MTL can be bounded by the fixed point of a sub-root function dominating the MT-LRC. The proof of the results is provided in Appendix B.

Theorem 3 (Distribution-dependent excess risk bound for MTL). Let $\mathcal{F} := \{f := (f_1, \ldots, f_T) : \forall t,\, f_t \in \mathbb{R}^{\mathcal{X}}\}$ be a class of vector-valued functions $f$ satisfying $\sup_{t,x} |f_t(x)| \le b$.
Also, let $X := (X_t^i, Y_t^i)_{(t,i)=(1,1)}^{(T,n)}$ be a vector of $nT$ independent random variables where, for each task $t$, $(X_t^1, Y_t^1), \ldots, (X_t^n, Y_t^n)$ are identically distributed. Suppose that Assumptions 1 hold. Define $\mathcal{F}^* := \{f - f^*\}$, where $f^*$ is the function satisfying $P \ell_{f^*} = \inf_{f \in \mathcal{F}} P \ell_f$. Let $B := B' L^2$ and let $\psi$ be a sub-root function with fixed point $r^*$ such that $B L R(\mathcal{F}^*, r) \le \psi(r)$, $\forall r \ge r^*$, where $R(\mathcal{F}^*, r)$ is the LRC of the function class $\mathcal{F}^*$:
\[
R(\mathcal{F}^*, r) := \mathbb{E}_{X,\sigma} \Big[ \sup_{f \in \mathcal{F},\, L^2 P (f - f^*)^2 \le r} \frac{1}{nT} \sum_{t=1}^T \sum_{i=1}^n \sigma_t^i f_t(X_t^i) \Big]. \tag{6}
\]
Then, we obtain the following bounds in terms of the fixed point $r^*$ of $\psi(r)$:

1. For any function class $\mathcal{F}$, $K > 1$ and $x > 0$, with probability at least $1 - e^{-x}$, $\forall f \in \mathcal{F}$,
\[
P(\ell_f - \ell_{f^*}) \le \frac{K}{K-1} P_n(\ell_f - \ell_{f^*}) + \frac{560 K}{B} r^* + \frac{(48 L b + 28 B K) x}{nT}. \tag{7}
\]

2. For any convex function class $\mathcal{F}$, $K > 1$ and $x > 0$, with probability at least $1 - e^{-x}$, $\forall f \in \mathcal{F}$,
\[
P(\ell_f - \ell_{f^*}) \le \frac{K}{K-1} P_n(\ell_f - \ell_{f^*}) + \frac{32 K}{B} r^* + \frac{(48 L b + 16 B K) x}{nT}. \tag{8}
\]
Corollary 4. Let $\hat{f}$ be any element of the convex class $\mathcal{F}$ satisfying $P_n \ell_{\hat{f}} = \inf_{f \in \mathcal{F}} P_n \ell_f$. Assume that the conditions of Theorem 3 hold. Then for any $x > 0$ and $r > \psi(r)$, with probability at least $1 - e^{-x}$,
\[
P(\ell_{\hat{f}} - \ell_{f^*}) \le \frac{32 K}{B} r^* + \frac{(48 L b + 16 B K) x}{nT}. \tag{9}
\]
Proof. The result follows by noticing that $P_n(\ell_{\hat{f}} - \ell_{f^*}) \le 0$.

The next theorem, analogous to Corollary 5.4 of [8], presents a data-dependent version of (9), replacing the Rademacher complexity in Corollary 4 with its empirical counterpart. The proof of this theorem, which repeats the same basic steps utilized in the proof of Theorem 5.4 of [8], can be found in Appendix B.

Theorem 5 (Data-dependent excess risk bound for MTL). Let $\hat{f}$ be any element of the convex class $\mathcal{F}$ satisfying $P_n \ell_{\hat{f}} = \inf_{f \in \mathcal{F}} P_n \ell_f$. Assume that the conditions of Theorem 3 hold. Define
\[
\hat{\psi}_n(r) = c_1 \hat{R}(\mathcal{F}^*, c_3 r) + \frac{c_2 x}{nT}, \qquad
\hat{R}(\mathcal{F}^*, c_3 r) := \mathbb{E}_\sigma \Big[ \sup_{f \in \mathcal{F},\, L^2 P_n (f - \hat{f})^2 \le c_3 r} \frac{1}{nT} \sum_{t=1}^T \sum_{i=1}^n \sigma_t^i f_t(X_t^i) \Big],
\]
where $c_1 = 2 L \max(B, 16 L b)$, $c_2 = 128 L^2 b^2 + 2 b c_1$ and $c_3 = 4 + 128 K + 4 B (48 L b + 16 B K)/c_2$. Then for any $K > 1$ and $x > 0$, with probability at least $1 - 4 e^{-x}$, we have
\[
P(\ell_{\hat{f}} - \ell_{f^*}) \le \frac{32 K}{B} \hat{r}^* + \frac{(48 L b + 16 B K) x}{nT},
\]
where $\hat{r}^*$ is the fixed point of the sub-root function $\hat{\psi}_n(r)$.
An immediate consequence of the results of this section is that one can derive excess risk bounds for given regularized MTL hypothesis spaces. In the next section, by further bounding the fixed point $r^*$ in Corollary 4 (and $\hat{r}^*$ in Theorem 5), we will derive distribution- (and data-)dependent excess risk bounds for several commonly used norm-regularized MTL hypothesis spaces.
4 Local Rademacher Complexity Bounds for Norm Regularized MTL Models This section presents very general distribution-dependent MT-LRC bounds for hypothesis spaces defined by norm as well as strongly convex regularizers, which allow us to immediately derive, as specific application cases, LRC bounds for group-norm, Schatten-norm, and graph-regularized MTL models. It should be mentioned that similar data-dependent MT-LRC bounds can easily be obtained by a similar deduction process.
4.1 Preliminaries We consider linear MTL models, where we associate to each task-wise function $f_t$ a weight $w_t \in \mathcal{H}$ via $f_t(X) = \langle w_t, \phi(X) \rangle$. Here $\phi$ is a feature map associated to a Mercer kernel $k$ satisfying $k(X, \tilde{X}) = \langle \phi(X), \phi(\tilde{X}) \rangle$, $\forall X, \tilde{X} \in \mathcal{X}$, and $w_t$ belongs to the reproducing kernel Hilbert space $\mathcal{H}_K$ induced by $k$ with inner product $\langle \cdot, \cdot \rangle$. We assume that the multi-task model $W = (w_1, \ldots, w_T) \in \mathcal{H} \times \ldots \times \mathcal{H}$ is learned by a regularization scheme:
\[
\min_W\; \Omega(W) + C \sum_{t=1}^T \sum_{i=1}^n \ell\big( \langle w_t, \phi(X_t^i) \rangle_{\mathcal{H}}, Y_t^i \big), \tag{10}
\]
where the regularizer $\Omega(\cdot)$ is used to enforce information sharing among tasks. This regularization scheme amounts to performing Empirical Risk Minimization (ERM) in the hypothesis space
\[
\mathcal{F} := \Big\{ X \mapsto \big[ \langle w_1, \phi(X_1) \rangle, \ldots, \langle w_T, \phi(X_T) \rangle \big]^{\mathsf{T}} : \Omega(D^{1/2} W) \le R^2 \Big\}, \tag{11}
\]
where $D$ is a given positive operator defined on $\mathcal{H}$. Note that the hypothesis spaces corresponding to group and Schatten norms can be recovered by taking $D = I$ and choosing their associated norms. More specifically, by choosing $\Omega(W) = \frac{1}{2} \|W\|_{2,q}^2$, we retrieve an $L_{2,q}$-norm hypothesis space in (11). Similarly, the choice $\Omega(W) = \frac{1}{2} \|W\|_{S_q}^2$ gives an $L_{S_q}$-Schatten norm hypothesis space in (11). Furthermore, graph-regularized MTL [51, 24, 46] can be specialized by taking $\Omega(W) = \frac{1}{2} \|D^{1/2} W\|_F^2$, wherein $\|\cdot\|_F$ is the Frobenius norm and $D = L + \eta I$, with $L$ being a graph Laplacian and $\eta$ a regularization constant. On balance, all these MTL models can be considered norm-regularized models. Also, for specific values of $q$ in the group and Schatten norm cases, it can be shown that they are strongly convex.
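For a finite-dimensional linear model with weight matrix $W \in \mathbb{R}^{T \times d}$ (rows $w_t$), the three regularizers named above can be sketched as follows; this is an illustration under stated assumptions, not the paper's implementation:

```python
import numpy as np

# Hypothetical finite-dimensional sketches of the three regularizers Omega(W):
# group norm, Schatten norm, and the graph regularizer. W has shape (T, d),
# with row t playing the role of the task weight vector w_t.
rng = np.random.default_rng(0)
T, d = 4, 6
W = rng.normal(size=(T, d))

def group_norm_reg(W, q):
    # 0.5 * ||W||_{2,q}^2, where ||W||_{2,q} = (sum_t ||w_t||_2^q)^(1/q)
    row_norms = np.linalg.norm(W, axis=1)
    return 0.5 * np.sum(row_norms ** q) ** (2.0 / q)

def schatten_norm_reg(W, q):
    # 0.5 * ||W||_{S_q}^2, the l_q norm of the singular values of W
    s = np.linalg.svd(W, compute_uv=False)
    return 0.5 * np.sum(s ** q) ** (2.0 / q)

def graph_reg(W, L, eta):
    # 0.5 * ||D^{1/2} W||_F^2 = 0.5 * tr(W^T D W), with D = L + eta * I
    D = L + eta * np.eye(W.shape[0])
    return 0.5 * np.trace(W.T @ D @ W)
```

For $q = 2$ all three coincide with $\frac{1}{2}\|W\|_F^2$ (taking $L = 0$, $\eta = 1$ in the graph case), which gives a convenient sanity check.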
4.2 General Bound on the LRC Now we can provide the main results of this section, which give general LRC bounds for any MTL hypothesis space of the form (11) in which $\Omega(W)$ is given as a strongly convex or a norm function of $W$.

Theorem 6 (Distribution-dependent MT-LRC bounds by strong convexity). Let $\Omega(W)$ in (10) be $\mu$-strongly convex with $\Omega^*(0) = 0$, and let $\|k\|_\infty \le K < \infty$. Let $X_t^1, \ldots, X_t^n$ be an i.i.d. sample drawn from $P_t$. Also, assume that for each task $t$, the eigenvalue-eigenvector decomposition of the Hilbert-Schmidt covariance operator $J_t$ is given by $J_t := \mathbb{E}(\phi(X_t) \otimes \phi(X_t)) = \sum_{j=1}^\infty \lambda_t^j u_t^j \otimes u_t^j$, where $(u_t^j)_{j=1}^\infty$ forms an orthonormal basis of $\mathcal{H}$ and $(\lambda_t^j)_{j=1}^\infty$ are the corresponding eigenvalues, arranged in non-increasing order. Then for any given positive operator $D$ on $\mathbb{R}^T$, any $r > 0$ and any non-negative integers $h_1, \ldots, h_T$:
\[
R(\mathcal{F}, r) \le \min_{\{0 \le h_t \le \infty\}_{t=1}^T} \left\{ \sqrt{\frac{r \sum_{t=1}^T h_t}{nT}} + \frac{R}{T} \sqrt{\frac{2}{\mu}\, \mathbb{E}_{X,\sigma} \big\| D^{-1/2} V \big\|_*^2} \right\}, \tag{12}
\]
where $V = \Big( \sum_{j > h_t} \big\langle \frac{1}{n} \sum_{i=1}^n \sigma_t^i \phi(X_t^i), u_t^j \big\rangle u_t^j \Big)_{t=1}^T$.

Proof. Note that, with the help of the LRC definition, we have for any function class $\mathcal{F}$:
\begin{align*}
R(\mathcal{F}, r) &= \mathbb{E}_{X,\sigma} \sup_{f \in \mathcal{F},\, P f^2 \le r} \frac{1}{T} \sum_{t=1}^T \Big\langle w_t, \frac{1}{n} \sum_{i=1}^n \sigma_t^i \phi(X_t^i) \Big\rangle \\
&= \mathbb{E}_{X,\sigma} \sup_{f \in \mathcal{F},\, P f^2 \le r} \frac{1}{T} \sum_{t=1}^T \sum_{j=1}^\infty \big\langle w_t, u_t^j \big\rangle \Big\langle \frac{1}{n} \sum_{i=1}^n \sigma_t^i \phi(X_t^i), u_t^j \Big\rangle \\
&\le \mathbb{E}_{X,\sigma} \sup_{P f^2 \le r} \frac{1}{T} \sum_{t=1}^T \sum_{j=1}^{h_t} \sqrt{\lambda_t^j}\, \big\langle w_t, u_t^j \big\rangle \cdot \big(\lambda_t^j\big)^{-1/2} \Big\langle \frac{1}{n} \sum_{i=1}^n \sigma_t^i \phi(X_t^i), u_t^j \Big\rangle \tag{13} \\
&\quad + \mathbb{E}_{X,\sigma} \sup_{f \in \mathcal{F}} \frac{1}{T} \Big\langle (w_t)_{t=1}^T,\, \Big( \sum_{j > h_t} \Big\langle \frac{1}{n} \sum_{i=1}^n \sigma_t^i \phi(X_t^i), u_t^j \Big\rangle u_t^j \Big)_{t=1}^T \Big\rangle \tag{14} \\
&= A_1 + A_2.
\end{align*}
Here, in the last equality, we defined the term in (13) as $A_1$ and the term in (14) as $A_2$.

Step 1. Controlling $A_1$: Applying the Cauchy-Schwarz (C.S.) inequality to $A_1$ yields
\[
A_1 \le \mathbb{E}_{X,\sigma} \sup_{P f^2 \le r} \frac{1}{T} \Bigg( \sum_{t=1}^T \sum_{j=1}^{h_t} \lambda_t^j \big\langle w_t, u_t^j \big\rangle^2 \Bigg)^{\frac{1}{2}} \Bigg( \sum_{t=1}^T \sum_{j=1}^{h_t} \big(\lambda_t^j\big)^{-1} \Big\langle \frac{1}{n} \sum_{i=1}^n \sigma_t^i \phi(X_t^i), u_t^j \Big\rangle^2 \Bigg)^{\frac{1}{2}}.
\]
With the help of Jensen's inequality, and regarding the facts that $\mathbb{E}_{X,\sigma} \big\langle \frac{1}{n} \sum_{i=1}^n \sigma_t^i \phi(X_t^i), u_t^j \big\rangle^2 = \frac{\lambda_t^j}{n}$ and that $P f^2 \le r$ implies $\frac{1}{T} \sum_{t=1}^T \sum_{j=1}^\infty \lambda_t^j \big\langle w_t, u_t^j \big\rangle^2 \le r$ (see Lemma 8 in the Appendix for the proof), we can further bound $A_1$ as
\[
A_1 \le \sqrt{\frac{r \sum_{t=1}^T h_t}{nT}}. \tag{15}
\]
Step 2. Controlling $A_2$: We use the strong convexity assumption on the regularizer in order to further bound the second term $A_2 = \frac{1}{T} \mathbb{E}_{X,\sigma} \sup_{f \in \mathcal{F}} \big\langle D^{1/2} W, D^{-1/2} V \big\rangle$. Let $\lambda > 0$. Applying (C.1) with $w = D^{1/2} W$ and $v = \lambda D^{-1/2} V$ gives
\[
\big\langle D^{1/2} W, \lambda D^{-1/2} V \big\rangle \le \Omega(D^{1/2} W) + \big\langle \nabla \Omega^*(0), \lambda D^{-1/2} V \big\rangle + \frac{\lambda^2}{2\mu} \big\| D^{-1/2} V \big\|_*^2.
\]
Note that, regarding the definition of $V$, we get $\mathbb{E}_\sigma \big\langle \nabla \Omega^*(0), \lambda D^{-1/2} V \big\rangle = 0$. Therefore, taking the supremum and the expectation on both sides, dividing throughout by $\lambda$ and $T$, and then optimizing over $\lambda$ gives
\[
A_2 = \frac{1}{T}\, \mathbb{E}_{X,\sigma} \sup_{f \in \mathcal{F}} \big\langle D^{1/2} W, D^{-1/2} V \big\rangle \le \min_{\lambda > 0} \left\{ \frac{R^2}{\lambda T} + \frac{\lambda}{2 \mu T}\, \mathbb{E}_{X,\sigma} \big\| D^{-1/2} V \big\|_*^2 \right\} = \frac{R}{T} \sqrt{\frac{2}{\mu}\, \mathbb{E}_{X,\sigma} \big\| D^{-1/2} V \big\|_*^2},
\]
which, together with the bound on $A_1$, completes the proof.

The next theorem provides an analogous LRC bound when the regularizer $\Omega(W)$ in (10) is given by a norm function.

Theorem 8 (Distribution-dependent MT-LRC bounds for norm regularized MTL). Let $\Omega(W) = \frac{1}{2} \|W\|^2$ in (10) for some norm $\|\cdot\|$ with dual norm $\|\cdot\|_*$, and let $\|k\|_\infty \le K < \infty$. Then, under the conditions of Theorem 6, for any given positive operator $D$, any $r > 0$ and any non-negative integers $h_1, \ldots, h_T$:
\[
R(\mathcal{F}, r) \le \min_{\{0 \le h_t \le \infty\}_{t=1}^T} \left\{ \sqrt{\frac{r \sum_{t=1}^T h_t}{nT}} + \frac{\sqrt{2}\, R}{T}\, \mathbb{E}_{X,\sigma} \big\| D^{-1/2} V \big\|_* \right\},
\]
where $V = \Big( \sum_{j > h_t} \big\langle \frac{1}{n} \sum_{i=1}^n \sigma_t^i \phi(X_t^i), u_t^j \big\rangle u_t^j \Big)_{t=1}^T$.
Proof. The proof of this theorem repeats the same steps as the proof of Theorem 6, except for controlling the term $A_2$ in (14), where Hölder's inequality can be efficiently used to further bound it as follows:
\begin{align*}
A_2 &= \mathbb{E}_{X,\sigma} \sup_{f \in \mathcal{F}} \frac{1}{T} \Big\langle (w_t)_{t=1}^T, \Big( \sum_{j > h_t} \Big\langle \frac{1}{n} \sum_{i=1}^n \sigma_t^i \phi(X_t^i), u_t^j \Big\rangle u_t^j \Big)_{t=1}^T \Big\rangle \\
&= \frac{1}{T}\, \mathbb{E}_{X,\sigma} \sup_{f \in \mathcal{F}} \big\langle D^{1/2} W, D^{-1/2} V \big\rangle \\
&\overset{\text{H\"older}}{\le} \frac{1}{T}\, \mathbb{E}_{X,\sigma} \Big\{ \sup_{f \in \mathcal{F}} \big\| D^{1/2} W \big\| \cdot \big\| D^{-1/2} V \big\|_* \Big\} \\
&\le \frac{\sqrt{2}\, R}{T}\, \mathbb{E}_{X,\sigma} \big\| D^{-1/2} V \big\|_*. \tag{18}
\end{align*}
Remark 9. Notice that, obviously, $\sqrt{2}\, \mathbb{E}_{X,\sigma} \| D^{-1/2} V \|_* \le \sqrt{\frac{2}{\mu}\, \mathbb{E}_{X,\sigma} \| D^{-1/2} V \|_*^2}$ for any $\mu \le 1$. Interestingly, for the cases considered in our study, it holds that $\mu \le 1$. More specifically, from Theorem 3 and Theorem 12 in [26], it can be shown that $R(W) = \frac{1}{2} \|W\|_{2,q}^2$ is $\frac{1}{q^*}$-strongly convex w.r.t. the group norm $\|\cdot\|_{2,q}$. Similarly, using Theorem 10 in [26], it can be shown that the regularization function $R(W) = \frac{1}{2} \|W\|_{S_q}^2$ with $q \in [1, 2]$ is $(q-1)$-strongly convex w.r.t. the $L_{S_q}$-Schatten norm $\|\cdot\|_{S_q}$. Therefore, given the range of $q$ in $[1, 2]$ for which these two norms are strongly convex, it can easily be seen that $\mu \le \frac{1}{2}$ and $\mu \le 1$ for the group- and Schatten-norm hypotheses, respectively. Therefore, for these cases, Hölder's inequality yields slightly tighter bounds for the MT-LRC.
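A quick numerical illustration of the inequality in Remark 9, with hypothetical random draws standing in for $D^{-1/2}V$ and $\mu = 1/2$ mimicking the group-norm case:

```python
import numpy as np

# Numerical illustration of Remark 9:
#   sqrt(2) * E||V||_*  <=  sqrt((2/mu) * E||V||_*^2)   whenever mu <= 1.
# The draws below are hypothetical stand-ins for D^{-1/2} V; mu = 0.5 mimics
# the strong-convexity constant of the group-norm regularizer.
rng = np.random.default_rng(0)
mu = 0.5
V = rng.normal(size=(10_000, 3))                  # hypothetical draws of the vector
norms = np.linalg.norm(V, axis=1)                 # ||V||_* for each draw
lhs = np.sqrt(2.0) * norms.mean()                 # Hoelder-based term (Theorem 8)
rhs = np.sqrt((2.0 / mu) * (norms ** 2).mean())   # strong-convexity term (Theorem 6)
print(lhs, rhs)
```

The gap comes from Jensen's inequality, $(\mathbb{E}\|V\|_*)^2 \le \mathbb{E}\|V\|_*^2$, combined with $\sqrt{2} \le \sqrt{2/\mu}$.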
Remark 10. It is worth mentioning that, when applied to the norm-regularized MTL models, the result of Theorem 8 can be more general than that of Theorem 6. More specifically, for $L_{2,q}$-group and $L_{S_q}$-Schatten norm regularizers, Theorem 6 can only be applied in the special case $q \in [1, 2]$, for which these two norms are strongly convex. In contrast, Theorem 8 is applicable to any value of $q$ for these two norms. For this reason, and considering the fact that very similar results can be obtained from Theorem 6 and Theorem 8 (see Lemma 2 and Remark 11), we will use Theorem 8 in the sequel to find the LRC bounds of several norm regularized MTL models. In what follows, we demonstrate the power of Theorem 8 by applying it to derive the LRC bounds for some popular MTL models, including group norm, Schatten norm and graph regularized MTL models extensively studied in the literature of MTL [46, 23, 6, 4, 38, 3].
4.3 Group Norm Regularized MTL

We first consider a group norm regularized MTL model capturing the inter-task relationships by the group norm regularizer $\frac12\|W\|_{2,q}^2:=\frac12\big(\sum_{t=1}^T\|w_t\|_2^q\big)^{2/q}$ [23, 4, 39, 58], for which the associated hypothesis space takes the form
$$\mathcal{F}_q:=\left\{X\mapsto\big[\langle w_1,\phi(X_1)\rangle,\ldots,\langle w_T,\phi(X_T)\rangle\big]^T\ :\ \tfrac12\|W\|_{2,q}^2\le R_{\max}^2\right\}.\tag{19}$$
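The $L_{2,q}$ group norm above is simply the $\ell_q$ norm of the per-task $\ell_2$ norms. A minimal numerical sketch (the matrix values are illustrative only):

```python
import numpy as np

def group_norm(W, q):
    """L_{2,q} group norm: the l_q norm of the per-task (column) l_2 norms.

    W is a d x T matrix whose t-th column is the weight vector w_t.
    """
    col_norms = np.linalg.norm(W, axis=0)      # ||w_t||_2 for each task t
    return np.linalg.norm(col_norms, ord=q)    # l_q norm across tasks

W = np.array([[3.0, 0.0],
              [4.0, 1.0]])                     # two tasks: w_1 = (3, 4), w_2 = (0, 1)
print(group_norm(W, 2))                        # sqrt(25 + 1) = sqrt(26)
print(group_norm(W, 1))                        # 5 + 1 = 6
```

For $q=2$ this recovers the Frobenius norm of $W$, and for $q=1$ the sparsity-inducing regularizer of Remark 14.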
Before presenting the result for the group-norm regularized MTL, we point out that $A_1$ does not depend on the $W$-constraint in the hypothesis space; therefore, the bound for $A_1$ is the same for all cases we consider in this study, regardless of the choice of the regularizer. However, $A_2$ can be further bounded for different hypothesis spaces corresponding to different choices of regularization functions. In the following, we start with a useful lemma which helps bound $A_2$ for the group-norm hypothesis space (19). The proof of this lemma, which is based on the application of the Khintchine (C.2) and Rosenthal (C.3) inequalities, is presented in Appendix C.

Lemma 2. Assume that the kernels in (10) are uniformly bounded, that is, $\|k\|_\infty\le K<\infty$. Then, for the group norm regularizer $\frac12\|W\|_{2,q}^2$ in (19) and for any $1\le q\le2$, the expectation $\mathbb{E}_{X,\sigma}\|D^{-1/2}V\|_{2,q^*}$ for $D=I$ can be upper-bounded as
$$\mathbb{E}_{X,\sigma}\left\|\left(\sum_{j>h_t}\Big\langle\frac1n\sum_{i=1}^n\sigma_t^i\phi(X_t^i),\,u_t^j\Big\rangle u_t^j\right)_{t=1}^T\right\|_{2,q^*}\le\frac{\sqrt{Ke}\,q^*T^{\frac1{q^*}}}{n}+\sqrt{\frac{e\,q^{*2}}{n}\left\|\Big(\sum_{j>h_t}\lambda_t^j\Big)_{t=1}^T\right\|_{\frac{q^*}{2}}}.$$
Remark 11. Similarly as in Lemma 2, one can easily prove that
$$\mathbb{E}_{X,\sigma}\left\|\left(\sum_{j>h_t}\Big\langle\frac1n\sum_{i=1}^n\sigma_t^i\phi(X_t^i),\,u_t^j\Big\rangle u_t^j\right)_{t=1}^T\right\|_{2,q^*}^2\le\frac{Ke\,q^{*2}T^{\frac2{q^*}}}{n^2}+\frac{e\,q^{*2}}{n}\left\|\Big(\sum_{j>h_t}\lambda_t^j\Big)_{t=1}^T\right\|_{\frac{q^*}{2}}.\tag{20}$$
To see this, note that in the first step of the proof of Lemma 2 (see Appendix C), by replacing the outermost exponent $\frac1{q^*}$ with $\frac2{q^*}$ and following the same procedure, one can verify (20). Therefore, it can be concluded that very similar LRC bounds can be obtained via Theorem 6 and Theorem 8.

Corollary 12. Using Theorem 8, for any $1\le q\le2$, the LRC of the function class $\mathcal{F}_q$ in (19) can be bounded as
$$\mathcal{R}(\mathcal{F}_q,r)\le\sqrt{\frac{4}{nT}\left\|\left(\sum_{j=1}^\infty\min\Big(rT^{1-\frac2{q^*}},\ \frac{2e\,q^{*2}R_{\max}^2}{T}\lambda_t^j\Big)\right)_{t=1}^T\right\|_{\frac{q^*}{2}}}+\frac{4\sqrt{2Ke}\,R_{\max}q^*T^{\frac1{q^*}}}{nT}.\tag{21}$$
Proof Sketch: The proof of the corollary uses the result of Lemma 2 to upper bound $A_2$ for the group-norm hypothesis space (19) as
$$A_2(\mathcal{F}_q)\le\sqrt{\frac{2e\,q^{*2}R_{\max}^2}{nT^2}\left\|\Big(\sum_{j>h_t}\lambda_t^j\Big)_{t=1}^T\right\|_{\frac{q^*}{2}}}+\frac{\sqrt{2Ke}\,R_{\max}q^*T^{\frac1{q^*}}}{nT}.\tag{22}$$
Now, combining (15) and (22) provides the bound on $\mathcal{R}(\mathcal{F}_q,r)$ as
$$\mathcal{R}(\mathcal{F}_q,r)\le\sqrt{\frac{r\sum_{t=1}^Th_t}{nT}}+\sqrt{\frac{2e\,q^{*2}R_{\max}^2}{nT^2}\left\|\Big(\sum_{j>h_t}\lambda_t^j\Big)_{t=1}^T\right\|_{\frac{q^*}{2}}}+\frac{\sqrt{2Ke}\,R_{\max}q^*T^{\frac1{q^*}}}{nT}.\tag{23}$$
Then, using the inequalities below, which hold for any $\alpha_1,\alpha_2>0$, any non-negative vectors $a_1,a_2\in\mathbb{R}^T$, any $0<q\le p\le\infty$ and any $s\ge1$,
$$(\star)\qquad\sqrt{\alpha_1}+\sqrt{\alpha_2}\le\sqrt{2(\alpha_1+\alpha_2)},\tag{24}$$
$$(\star\star)\quad\ell_p\text{-to-}\ell_q:\qquad\|a_1\|_q=\langle\mathbf1,a_1^q\rangle^{\frac1q}\overset{\text{H\"older}}{\le}\big(\|\mathbf1\|_{(p/q)^*}\|a_1^q\|_{p/q}\big)^{\frac1q}=T^{\frac1q-\frac1p}\|a_1\|_p,\tag{25}$$
$$(\star\star\star)\qquad\|a_1\|_s+\|a_2\|_s\le2^{1-\frac1s}\|a_1+a_2\|_s\le2\|a_1+a_2\|_s,\tag{26}$$
we can obtain the desired result. See Appendix C for the detailed proof.
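The three elementary inequalities $(\star)$–$(\star\star\star)$ are easy to sanity-check numerically; a minimal sketch on random non-negative vectors (sample values are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 8
a1 = rng.random(T)          # non-negative vectors in R^T
a2 = rng.random(T)

# (*)   sqrt(x) + sqrt(y) <= sqrt(2 (x + y))
x, y = 0.7, 2.3
assert np.sqrt(x) + np.sqrt(y) <= np.sqrt(2 * (x + y))

# (**)  l_p-to-l_q conversion: ||a||_q <= T^(1/q - 1/p) ||a||_p for q <= p
p, q = 4.0, 1.5
lhs = np.linalg.norm(a1, ord=q)
rhs = T ** (1 / q - 1 / p) * np.linalg.norm(a1, ord=p)
assert lhs <= rhs + 1e-12

# (***) ||a1||_s + ||a2||_s <= 2^(1 - 1/s) ||a1 + a2||_s for non-negative vectors
s = 3.0
assert (np.linalg.norm(a1, ord=s) + np.linalg.norm(a2, ord=s)
        <= 2 ** (1 - 1 / s) * np.linalg.norm(a1 + a2, ord=s) + 1e-12)
print("all three inequalities hold on this sample")
```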
we can obtain the desired result. See Appendix C for the detailed proof. Remark 13. Since the LRC bound above is not monotonic in q, it is more reasonable to state the above bound in terms of κ ≥ q; choosing κ = q is not always the optimal choice. Trivially, for the group norm regularizer with any κ ≥ q, it holds that kW k2,κ ≤ kW k2,q and therefore R(Fq , r) ≤ R(Fκ , r). Thus, we have the following bound on R(Fq , r) for any κ ∈ [q, 2], v
u
∞ √ T u 1
2 2 ∗ X u 4 2KeRmax κ∗ T κ∗
1− κ2∗ 2eκ Rmax j , R(Fq , r) ≤ u λ . (27) min rT +
t t nT
T nT j=1
t=1 κ∗ 2
Remark 14 (Sparsity-inducing group-norm). Assuming a sparse representation shared across multiple tasks is a well-known presumption in MTL [6, 4], which leads to the use of the group norm regularizer $\frac12\|W\|_{2,1}^2$. Notice that for any $\kappa\ge1$, it holds that $\mathcal{R}(\mathcal{F}_1,r)\le\mathcal{R}(\mathcal{F}_\kappa,r)$. Also, assuming an identical tail sum $\sum_{j\ge h}\lambda^j$ for all tasks reduces the bound in (27) to the function $\kappa^*\mapsto\kappa^*T^{1/\kappa^*}$ in terms of $\kappa$. This function attains its minimum at $\kappa^*=\log T$. Thus, by choosing $\kappa^*=\log T$, it is easy to show
$$\mathcal{R}(\mathcal{F}_1,r)\le\sqrt{\frac{4}{nT}\left\|\left(\sum_{j=1}^\infty\min\Big(rT^{1-\frac2{\kappa^*}},\ \frac{2e\,\kappa^{*2}R_{\max}^2}{T}\lambda_t^j\Big)\right)_{t=1}^T\right\|_{\frac{\kappa^*}{2}}}+\frac{4\sqrt{2Ke}\,R_{\max}\kappa^*T^{\frac1{\kappa^*}}}{nT}$$
$$\overset{(\ell_{\kappa^*}\text{-to-}\ell_\infty)}{\le}\sqrt{\frac{4}{nT}\left\|\left(\sum_{j=1}^\infty\min\Big(rT,\ \frac{2e^3(\log T)^2R_{\max}^2}{T}\lambda_t^j\Big)\right)_{t=1}^T\right\|_\infty}+\frac{4\sqrt{2K}\,R_{\max}e^{\frac32}\log T}{nT}.$$
Remark 15 ($L_{2,q}$ group-norm regularizer with $q\ge2$). For any $q\ge2$, Theorem 8 provides an LRC bound for the function class $\mathcal{F}_q$ in (19) as
$$\mathcal{R}(\mathcal{F}_q,r)\le\sqrt{\frac{4}{nT}\left\|\left(\sum_{j=1}^\infty\min\Big(rT^{1-\frac2{q^*}},\ \frac{2R_{\max}^2}{T}\lambda_t^j\Big)\right)_{t=1}^T\right\|_{\frac{q^*}{2}}},\tag{28}$$
where $q^*:=\frac{q}{q-1}$.
Proof.
$$A_2(\mathcal{F}_q)\overset{\text{H\"older}}{\le}\frac1T\,\mathbb{E}_{X,\sigma}\left\{\sup_{f\in\mathcal{F}_q}\|W\|_{2,q}\|V\|_{2,q^*}\right\}\le\frac{\sqrt2R_{\max}}{T}\,\mathbb{E}_{X,\sigma}\left(\sum_{t=1}^T\left\|\sum_{j>h_t}\Big\langle\frac1n\sum_{i=1}^n\sigma_t^i\phi(X_t^i),\,u_t^j\Big\rangle u_t^j\right\|^{q^*}\right)^{\frac1{q^*}}$$
$$\overset{\text{Jensen}}{\le}\frac{\sqrt2R_{\max}}{T}\left(\sum_{t=1}^T\left(\mathbb{E}_{X,\sigma}\left\|\sum_{j>h_t}\Big\langle\frac1n\sum_{i=1}^n\sigma_t^i\phi(X_t^i),\,u_t^j\Big\rangle u_t^j\right\|^2\right)^{\frac{q^*}2}\right)^{\frac1{q^*}}$$
$$=\frac{\sqrt2R_{\max}}{T}\left(\sum_{t=1}^T\Big(\frac{\sum_{j>h_t}\lambda_t^j}{n}\Big)^{\frac{q^*}2}\right)^{\frac1{q^*}}=\sqrt{\frac{2R_{\max}^2}{nT^2}\left\|\Big(\sum_{j>h_t}\lambda_t^j\Big)_{t=1}^T\right\|_{\frac{q^*}{2}}}.$$
By applying $(\star)$, $(\star\star)$ and $(\star\star\star)$, this last result, together with the bound in (15) for $A_1$, yields the result.

To investigate the tightness of the bound in (21), we derive a lower bound which holds for the LRC of $\mathcal{F}_q$ with any $q\ge1$. The proof of the result can be found in Appendix C.

Theorem 16 (Lower bound). The following lower bound holds for the local Rademacher complexity of $\mathcal{F}_q$ in (21) with any $q\ge1$: there is an absolute constant $c$ such that for all $t$, if $\lambda_t^1\ge\frac1{nR_{\max}^2}$, then for all $r\ge\frac1n$ and $q\ge1$,
$$\mathcal{R}(\mathcal{F}_{q,R_{\max},T},r)\ge\sqrt{\frac{c}{nT^{1-\frac2{q^*}}}\sum_{j=1}^\infty\min\Big(rT^{1-\frac2{q^*}},\ \frac{R_{\max}^2}{T}\lambda_1^j\Big)}.\tag{29}$$
A comparison between the lower bound in (29) and the upper bound in (21) can be clearly illustrated by assuming identical eigenvalue tail sums $\sum_{j\ge h}\lambda_t^j$ for all tasks, for which the upper bound translates to
$$\mathcal{R}(\mathcal{F}_{q,R_{\max},T},r)\le\sqrt{\frac{4}{nT^{1-\frac2{q^*}}}\sum_{j=1}^\infty\min\Big(rT^{1-\frac2{q^*}},\ \frac{2e\,q^{*2}R_{\max}^2}{T}\lambda_t^j\Big)}+\frac{4\sqrt{2Ke}\,R_{\max}q^*T^{\frac1{q^*}}}{nT}.$$
By comparing this to (29), we see that the lower bound matches the upper bound up to constants. The same analysis for MTL models with Schatten norm and graph regularizers yields similar results, confirming that the LRC upper bounds we have obtained are reasonably tight.

Remark 17. It is worth pointing out that a matching lower bound on the local Rademacher complexity does not necessarily imply a tight bound on the expectation of an empirical minimizer. As shown in Section 4 of [10], sharper bounds than the LRC-based ones can be obtained by a direct analysis of the empirical minimizer. Consequently, based on Theorem 8 in [10], there might be cases in which the local Rademacher complexity bounds are constant, whereas $P\hat f$ is of some order depending on the number of samples $n$, namely $O(1/n)$, which decreases as $n$ grows. As pointed out in that paper, under some mild conditions on the loss function $\ell$, a similar argument also holds for the class of loss functions $\{\ell_f-\ell_{f^*}:f\in\mathcal{F}\}$.
4.4 Schatten Norm Regularized MTL

[6] developed a spectral regularization framework for MTL, where the $L_{S_q}$-Schatten norm $\frac12\|W\|_{S_q}^2:=\frac12\big(\operatorname{tr}\big(W^TW\big)^{\frac q2}\big)^{\frac2q}$ is studied as a concrete example, corresponding to performing ERM in the following hypothesis space:
$$\mathcal{F}_{S_q}:=\left\{X\mapsto\big[\langle w_1,\phi(X_1)\rangle,\ldots,\langle w_T,\phi(X_T)\rangle\big]^T\ :\ \tfrac12\|W\|_{S_q}^2\le R'^2_{\max}\right\}.\tag{30}$$

Corollary 18. For any $1\le q\le2$ in (30), the LRC of the function class $\mathcal{F}_{S_q}$ is bounded as
$$\mathcal{R}(\mathcal{F}_{S_q},r)\le\sqrt{\frac{4}{nT}\left\|\left(\sum_{j=1}^\infty\min\Big(r,\ \frac{2q^*R'^2_{\max}}{T}\lambda_t^j\Big)\right)_{t=1}^T\right\|_1}.$$
The proof is provided in Appendix C.

Remark 19 (Sparsity-inducing Schatten-norm (trace norm)). Trace-norm regularized MTL, corresponding to Schatten norm regularization with $q=1$ [48, 57], imposes a low-rank structure on the spectrum of $W$ and can also be interpreted as low-dimensional subspace learning [5, 35, 27]. Note that for any $q\ge1$, it holds that $\mathcal{R}(\mathcal{F}_{S_1},r)\le\mathcal{R}(\mathcal{F}_{S_q},r)$. Therefore, choosing the optimal $q^*=1$, we get
$$\mathcal{R}(\mathcal{F}_{S_1},r)\le\sqrt{\frac{4}{nT}\left\|\left(\sum_{j=1}^\infty\min\Big(r,\ \frac{2R'^2_{\max}}{T}\lambda_t^j\Big)\right)_{t=1}^T\right\|_1}.$$
Remark 20 ($L_{S_q}$ Schatten-norm regularizer with $q\ge2$). For any $q\ge2$, Theorem 8 provides an LRC bound for the function class $\mathcal{F}_{S_q}$ in (30) as
$$\mathcal{R}(\mathcal{F}_{S_q},r)\le\sqrt{\frac{4}{nT}\left\|\left(\sum_{j=1}^\infty\min\Big(r,\ \frac{2R'^2_{\max}}{T}\lambda_t^j\Big)\right)_{t=1}^T\right\|_1}.\tag{31}$$
Proof. Taking $q^*:=\frac{q}{q-1}$, we first bound the expectation $\mathbb{E}_{X,\sigma}\|V\|_{S_{q^*}}$. Take $U_t^i$ as a matrix with $T$ columns, where the only non-zero column $t$ of $U_t^i$ is defined as $\sum_{j>h_t}\frac1n\langle\phi(X_t^i),u_t^j\rangle u_t^j$. Based on the definition of $V=\big(\sum_{j>h_t}\langle\frac1n\sum_{i=1}^n\sigma_t^i\phi(X_t^i),u_t^j\rangle u_t^j\big)_{t=1}^T$, we can then bound this expectation as
$$\mathbb{E}_{X,\sigma}\|V\|_{S_{q^*}}=\mathbb{E}_{X,\sigma}\left\|\left(\sum_{j>h_t}\Big\langle\frac1n\sum_{i=1}^n\sigma_t^i\phi(X_t^i),\,u_t^j\Big\rangle u_t^j\right)_{t=1}^T\right\|_{S_{q^*}}=\mathbb{E}_{X,\sigma}\left\|\sum_{t=1}^T\sum_{i=1}^n\sigma_t^iU_t^i\right\|_{S_{q^*}}$$
$$\overset{\text{Jensen}}{\le}\left(\operatorname{tr}\left(\mathbb{E}_{X,\sigma}\sum_{t,s=1}^T\sum_{i,j=1}^n\sigma_t^i\sigma_s^jU_t^i{U_s^j}^T\right)^{\frac{q^*}2}\right)^{\frac1{q^*}}=\left(\operatorname{tr}\left(\sum_{t=1}^T\sum_{i=1}^n\mathbb{E}_XU_t^i{U_t^i}^T\right)^{\frac{q^*}2}\right)^{\frac1{q^*}}$$
$$=\left(\mathbb{E}_X\sum_{t=1}^T\sum_{i=1}^n\sum_{j>h_t}\frac1{n^2}\big\langle\phi(X_t^i),u_t^j\big\rangle^2\right)^{\frac12}=\left(\frac1n\sum_{t=1}^T\sum_{j>h_t}\lambda_t^j\right)^{\frac12}=\sqrt{\frac1n\left\|\Big(\sum_{j>h_t}\lambda_t^j\Big)_{t=1}^T\right\|_1}.$$
Note that substituting this into (17), with the help of $(\star)$, $(\star\star)$ and $(\star\star\star)$, one can conclude the result.
4.5 Graph Regularized MTL

The idea underlying graph regularized MTL is to force the models of related tasks to be close to each other, by penalizing the squared distance $\|w_t-w_s\|^2$ with different weights $\omega_{ts}$. We consider the following MTL graph regularizer [46]:
$$\Omega(W)=\frac12\sum_{t=1}^T\sum_{s=1}^T\omega_{ts}\|w_t-w_s\|^2+\eta\sum_{t=1}^T\|w_t\|^2=\sum_{t=1}^T\sum_{s=1}^T(L+\eta I)_{ts}\langle w_t,w_s\rangle,$$
where $L$ is the graph-Laplacian associated to a matrix of edge-weights $\omega_{ts}$, $I$ is the identity operator, and $\eta>0$ is a regularization parameter. According to the identity $\sum_{t=1}^T\sum_{s=1}^T(L+\eta I)_{ts}\langle w_t,w_s\rangle=\|(L+\eta I)^{1/2}W\|_F^2$, the corresponding hypothesis space is
$$\mathcal{F}_G:=\left\{X\mapsto\big[\langle w_1,\phi(X_1)\rangle,\ldots,\langle w_T,\phi(X_T)\rangle\big]^T\ :\ \tfrac12\|D^{1/2}W\|_F^2\le R''^2_{\max}\right\},\tag{32}$$
where we define $D:=L+\eta I$.

Corollary 21. For any given positive definite matrix $D$ in (32), the LRC of $\mathcal{F}_G$ is bounded by
$$\mathcal{R}(\mathcal{F}_G,r)\le\sqrt{\frac{4}{nT}\left\|\left(\sum_{j=1}^\infty\min\Big(r,\ \frac{2D_{tt}^{-1}R''^2_{\max}}{T}\lambda_t^j\Big)\right)_{t=1}^T\right\|_1},\tag{33}$$
where $(D_{tt}^{-1})_{t=1}^T$ are the diagonal elements of $D^{-1}$.

See Appendix C for the proof.
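The identity $\Omega(W)=\|(L+\eta I)^{1/2}W\|_F^2$ underlying the hypothesis space (32) can be verified numerically; a minimal sketch with a random symmetric weight matrix (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
T, d = 5, 3

# random symmetric edge weights, zero diagonal
Wgt = rng.random((T, T)); Wgt = (Wgt + Wgt.T) / 2; np.fill_diagonal(Wgt, 0)
L = np.diag(Wgt.sum(axis=1)) - Wgt          # graph Laplacian
eta = 0.5
D = L + eta * np.eye(T)

W = rng.standard_normal((d, T))             # column t is the task vector w_t

# left-hand side: (1/2) sum_ts omega_ts ||w_t - w_s||^2 + eta sum_t ||w_t||^2
lhs = 0.5 * sum(Wgt[t, s] * np.sum((W[:, t] - W[:, s]) ** 2)
                for t in range(T) for s in range(T))
lhs += eta * np.sum(W ** 2)

# right-hand side: ||D^{1/2} W||_F^2, computed as tr(W D W^T)
rhs = np.trace(W @ D @ W.T)
assert np.isclose(lhs, rhs)
print("graph regularizer identity verified")
```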
5 Excess Risk Bounds for Norm Regularized MTL Models

In this section, we provide distribution- and data-dependent excess risk bounds for the hypothesis spaces considered earlier. Note that, due to space limitations, the proofs are provided only for the hypothesis space $\mathcal{F}_q$ with $q\in[1,2]$ in (19). However, in the cases involving the $L_{2,q}$-group norm with $q\ge2$, as well as the $L_{S_q}$-Schatten and graph norms, the proofs can be obtained in a very similar way. More specifically, by using the LRC bounds of Remark 15, Corollary 18, Remark 20 and Corollary 21, one can follow the same steps as in the proofs of this section to arrive at the results pertaining to these cases.

Theorem 22 (Distribution-dependent excess risk bound for $L_{2,q}$ group-norm regularized MTL). Assume that $\mathcal{F}_q$ in (19) is a convex class of functions with ranges in $[-b,b]$, and let the loss function $\ell$ of Problem (10) be such that Assumption 1 is satisfied. Let $\hat f$ be any element of $\mathcal{F}_q$ with $1\le q\le2$ which satisfies $P_n\ell_{\hat f}=\inf_{f\in\mathcal{F}_q}P_n\ell_f$. Assume moreover that $k$ is a positive semi-definite kernel on $\mathcal{X}$ such that $\|k\|_\infty\le K<\infty$. Denote by $r^*$ the fixed point of $2BL\mathcal{R}\big(\mathcal{F}_q,\frac{r}{4L^2}\big)$. Then, for any $K>1$ and $x>0$, with probability at least $1-e^{-x}$, the excess loss of the function class $\mathcal{F}_q$ is bounded as
$$P(\ell_{\hat f}-\ell_{f^*})\le\frac{32K}{B}r^*+\frac{(48Lb+16BK)x}{nT},\tag{34}$$
where for the fixed point $r^*$ of the local Rademacher complexity $2BL\mathcal{R}\big(\mathcal{F}_q,\frac{r}{4L^2}\big)$, it holds that
$$r^*\le\min_{0\le h_t\le\infty}\left\{\frac{B^2\sum_{t=1}^Th_t}{Tn}+4BL\sqrt{\frac{2e\,q^{*2}R_{\max}^2}{nT^2}\left\|\Big(\sum_{j>h_t}\lambda_t^j\Big)_{t=1}^T\right\|_{\frac{q^*}{2}}}+\frac{4\sqrt{2Ke}\,R_{\max}q^*BLT^{\frac1{q^*}}}{nT}\right\},\tag{35}$$
where $h_1,\ldots,h_T$ are arbitrary non-negative integers.
Proof. First notice that $\mathcal{F}_q$ is convex, and thus star-shaped around any of its points. Hence, according to Lemma 6, $\mathcal{R}(\mathcal{F}_q,r)$ is a sub-root function. Moreover, because of the symmetry of the $\sigma_t^i$, and because $\mathcal{F}_q$ is convex and symmetric, it can be shown that $\mathcal{R}(\mathcal{F}_q^*,r)\le2\mathcal{R}\big(\mathcal{F}_q,\frac{r}{4L^2}\big)$, where $\mathcal{R}(\mathcal{F}_q^*,r)$ is defined according to (6) for the class of functions $\mathcal{F}_q$. Therefore, it suffices to find the fixed point of $\psi(r)=2BL\mathcal{R}\big(\mathcal{F}_q,\frac{r}{4L^2}\big)$ by solving $\psi(r)=r$. For this purpose, we will use (23) as a bound for $\mathcal{R}(\mathcal{F}_q,r)$, and solve $\sqrt{\alpha r}+\gamma=r$ (or, equivalently, $r^2-(\alpha+2\gamma)r+\gamma^2=0$) for $r$, where we define
$$\alpha=\frac{B^2\sum_{t=1}^Th_t}{Tn},\qquad\gamma=2BL\sqrt{\frac{2e\,q^{*2}R_{\max}^2}{nT^2}\left\|\Big(\sum_{j>h_t}\lambda_t^j\Big)_{t=1}^T\right\|_{\frac{q^*}{2}}}+\frac{2\sqrt{2Ke}\,R_{\max}BLq^*T^{\frac1{q^*}}}{nT}.\tag{36}$$
It is not hard to verify that $r^*\le\alpha+2\gamma$. Substituting the definitions of $\alpha$ and $\gamma$ gives the result.
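The quadratic fixed-point step above is easy to check numerically: $\sqrt{\alpha r}+\gamma=r$ has a unique solution with $r\ge\gamma$, the larger root of $r^2-(\alpha+2\gamma)r+\gamma^2=0$, and it never exceeds $\alpha+2\gamma$. A minimal sketch (the values of $\alpha$ and $\gamma$ are arbitrary illustrations, not the quantities in (36)):

```python
import math

def fixed_point(alpha, gamma):
    # larger root of r^2 - (alpha + 2*gamma) * r + gamma^2 = 0
    b = alpha + 2 * gamma
    return (b + math.sqrt(b * b - 4 * gamma * gamma)) / 2

alpha, gamma = 0.3, 0.05          # illustrative values only
r = fixed_point(alpha, gamma)

# r solves psi(r) = sqrt(alpha * r) + gamma = r ...
assert abs(math.sqrt(alpha * r) + gamma - r) < 1e-12
# ... and satisfies the bound r* <= alpha + 2*gamma used in the proof
assert r <= alpha + 2 * gamma
print(r)
```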
Remark 23. If the conditions of Theorem 8 hold, then the following results hold for the fixed points for the hypothesis spaces in (19), (30) and (32).

• Group norm: For the fixed point $r^*$ of the local Rademacher complexity $2BL\mathcal{R}\big(\mathcal{F}_q,\frac{r}{4L^2}\big)$ with any $1\le q\le2$ in (19), it holds
$$r^*\le\min_{0\le h_t\le\infty}\left\{\frac{B^2\sum_{t=1}^Th_t}{Tn}+4BL\sqrt{\frac{2e\,q^{*2}R_{\max}^2}{nT^2}\left\|\Big(\sum_{j>h_t}\lambda_t^j\Big)_{t=1}^T\right\|_{\frac{q^*}{2}}}+\frac{4\sqrt{2Ke}\,R_{\max}q^*BLT^{\frac1{q^*}}}{nT}\right\}.\tag{37}$$
Also, for any $q\ge2$ in (19), it holds
$$r^*\le\min_{0\le h_t\le\infty}\left\{\frac{B^2\sum_{t=1}^Th_t}{Tn}+4BL\sqrt{\frac{2R_{\max}^2}{nT^2}\left\|\Big(\sum_{j>h_t}\lambda_t^j\Big)_{t=1}^T\right\|_{\frac{q^*}{2}}}\right\}.\tag{38}$$
• Schatten-norm: For the fixed point $r^*$ of the local Rademacher complexity $2BL\mathcal{R}\big(\mathcal{F}_{S_q},\frac{r}{4L^2}\big)$ with any $1\le q\le2$ in (30), it holds
$$r^*\le\min_{0\le h_t\le\infty}\left\{\frac{B^2\sum_{t=1}^Th_t}{Tn}+4BL\sqrt{\frac{2q^*R'^2_{\max}}{nT^2}\left\|\Big(\sum_{j>h_t}\lambda_t^j\Big)_{t=1}^T\right\|_1}\right\}.\tag{39}$$
Also, for any $q\ge2$ in (30), it holds
$$r^*\le\min_{0\le h_t\le\infty}\left\{\frac{B^2\sum_{t=1}^Th_t}{Tn}+4BL\sqrt{\frac{2R'^2_{\max}}{nT^2}\left\|\Big(\sum_{j>h_t}\lambda_t^j\Big)_{t=1}^T\right\|_1}\right\}.\tag{40}$$

• Graph regularizer: For the fixed point $r^*$ of the local Rademacher complexity $2BL\mathcal{R}\big(\mathcal{F}_G,\frac{r}{4L^2}\big)$ with any positive operator $D$ in (32), it holds
$$r^*\le\min_{0\le h_t\le\infty}\left\{\frac{B^2\sum_{t=1}^Th_t}{Tn}+4BL\sqrt{\frac{2R''^2_{\max}}{nT^2}\left\|\Big(D_{tt}^{-1}\sum_{j>h_t}\lambda_t^j\Big)_{t=1}^T\right\|_1}\right\}.\tag{41}$$
Given that the $\lambda_t^j$ are non-increasing in $j$, we can assume that there exists $d_t$ with $\lambda_t^j\le d_tj^{-\alpha_t}$ for some $\alpha_t>1$. For example, this assumption holds for finite-rank kernels as well as for convolution kernels. Thus, it can be shown that
$$\sum_{j>h_t}\lambda_t^j\le d_t\sum_{j>h_t}j^{-\alpha_t}\le d_t\int_{h_t}^\infty x^{-\alpha_t}\,dx=-\frac{d_t}{1-\alpha_t}h_t^{1-\alpha_t}=\frac{d_t}{\alpha_t-1}h_t^{1-\alpha_t}.\tag{42}$$
Note that via the $\ell_p$-to-$\ell_q$ conversion inequality in (25), applied with $p=1$ and $q=\frac{q^*}2$, we have
$$\frac{B^2\sum_{t=1}^Th_t}{Tn}\le B^2\sqrt{\frac{T\sum_{t=1}^Th_t^2}{n^2T^2}}\overset{(\star\star)}{\le}B^2\sqrt{\frac{T^{2-\frac2{q^*}}\big\|(h_t^2)_{t=1}^T\big\|_{\frac{q^*}2}}{n^2T^2}}.$$
Now, applying (24) and (26) and inserting (42) into (35), it holds for group norm regularized MTL with $1\le q\le2$ that
$$r^*\le\min_{0\le h_t\le\infty}2B\sqrt{\left\|\left(\frac{B^2T^{2-\frac2{q^*}}h_t^2}{n^2T^2}-\frac{32d_te\,q^{*2}R_{\max}^2L^2}{(1-\alpha_t)nT^2}h_t^{1-\alpha_t}\right)_{t=1}^T\right\|_{\frac{q^*}2}}+\frac{4\sqrt{2Ke}\,R_{\max}BLq^*T^{\frac1{q^*}}}{nT}.\tag{43}$$
Taking the partial derivative of the above bound with respect to $h_t$ and setting it to zero yields the optimal $h_t$ as
$$h_t=\left(16d_te\,q^{*2}R_{\max}^2B^{-2}L^2T^{\frac2{q^*}-2}n\right)^{\frac1{1+\alpha_t}}.$$
Note that substituting the above $h_t$, with $\alpha:=\min_{t\in\mathbb{N}_T}\alpha_t$ and $d:=\max_{t\in\mathbb{N}_T}d_t$, into (43), we can upper-bound the fixed point $r^*$ as
$$r^*\le\frac{14B^2}{n}\sqrt{\frac{\alpha+1}{\alpha-1}}\left(d\,q^{*2}R_{\max}^2B^{-2}L^2T^{\frac2{q^*}-2}n\right)^{\frac1{1+\alpha}}+\frac{10\sqrt K\,R_{\max}BLq^*T^{\frac1{q^*}}}{nT},$$
which implies that
$$r^*=O\left(\left(\frac{q^*}{T^{1-\frac1{q^*}}}\right)^{\frac2{1+\alpha}}n^{\frac{-\alpha}{1+\alpha}}+\frac{q^*T^{\frac1{q^*}}}{nT}\right).$$
It can be seen that the convergence rate can be as slow as $O\big(n^{-\frac{\alpha}{1+\alpha}}\big)$ with $\frac{\alpha}{1+\alpha}$ close to $\frac12$ (for small $\alpha$, where at least one $\alpha_t\approx1$), and as fast as $O(n^{-1})$ (when $\alpha_t\to\infty$ for all $t$). The bound obtained for the fixed point, together with Theorem 22, provides a bound for the excess risk, which leads to the following remark.
Remark 24 (Excess risk bounds for selected norm regularized MTL problems). Assume that $\mathcal{F}_q$, $\mathcal{F}_{S_q}$ and $\mathcal{F}_G$ are convex classes of functions with ranges in $[-b,b]$, and let the loss function $\ell$ of Problem (10) be such that Assumption 1 is satisfied. Assume moreover that $k$ is a positive semidefinite kernel on $\mathcal{X}$ such that $\|k\|_\infty\le K<\infty$. Also, denote $\alpha:=\min_{t\in\mathbb{N}_T}\alpha_t$ and $d:=\max_{t\in\mathbb{N}_T}d_t$. Then:

• Group norm: If $\hat f$ satisfies $P_n\ell_{\hat f}=\inf_{f\in\mathcal{F}_q}P_n\ell_f$, and $r^*$ is the fixed point of the local Rademacher complexity $2BL\mathcal{R}\big(\mathcal{F}_q,\frac{r}{4L^2}\big)$ with any $1\le q\le2$ in (19) and any $K>1$, it holds with probability at least $1-e^{-x}$ that
$$P(\ell_{\hat f}-\ell_{f^*})\le\min_{\kappa\in[q,2]}\left\{448K\sqrt{\frac{\alpha+1}{\alpha-1}}\big(d\kappa^{*2}R_{\max}^2L^2\big)^{\frac1{1+\alpha}}B^{\frac{\alpha-1}{\alpha+1}}T^{\frac{-2}{\kappa(1+\alpha)}}n^{\frac{-\alpha}{1+\alpha}}+\frac{320K\sqrt K\,R_{\max}L\kappa^*T^{\frac1{\kappa^*}}}{nT}\right\}+\frac{(48Lb+16BK)x}{nT}.\tag{44}$$
Also, for any $q\ge2$ in (19) and any $K>1$, it holds with probability at least $1-e^{-x}$ that
$$P(\ell_{\hat f}-\ell_{f^*})\le\min_{q\in[2,\infty]}\left\{256K\sqrt{\frac{\alpha+1}{\alpha-1}}\big(dR_{\max}^2L^2\big)^{\frac1{1+\alpha}}B^{\frac{\alpha-1}{\alpha+1}}T^{\frac{-2}{q(1+\alpha)}}n^{\frac{-\alpha}{1+\alpha}}\right\}+\frac{(48Lb+16BK)x}{nT}.\tag{45}$$

• Schatten-norm: If $\hat f$ satisfies $P_n\ell_{\hat f}=\inf_{f\in\mathcal{F}_{S_q}}P_n\ell_f$, and $r^*$ is the fixed point of the local Rademacher complexity $2BL\mathcal{R}\big(\mathcal{F}_{S_q},\frac{r}{4L^2}\big)$ with any $1\le q\le2$ in (30) and any $K>1$, it holds with probability at least $1-e^{-x}$ that
$$P(\ell_{\hat f}-\ell_{f^*})\le\min_{q\in[1,2]}\left\{256K\sqrt{\frac{\alpha+1}{\alpha-1}}\big(dq^*R'^2_{\max}L^2\big)^{\frac1{1+\alpha}}B^{\frac{\alpha-1}{\alpha+1}}T^{\frac{-1}{1+\alpha}}n^{\frac{-\alpha}{1+\alpha}}\right\}+\frac{(48Lb+16BK)x}{nT}.\tag{46}$$
Also, for any $q\ge2$ in (30) and any $K>1$, it holds with probability at least $1-e^{-x}$ that
$$P(\ell_{\hat f}-\ell_{f^*})\le256K\sqrt{\frac{\alpha+1}{\alpha-1}}\big(dR'^2_{\max}L^2\big)^{\frac1{1+\alpha}}B^{\frac{\alpha-1}{\alpha+1}}T^{\frac{-1}{1+\alpha}}n^{\frac{-\alpha}{1+\alpha}}+\frac{(48Lb+16BK)x}{nT}.\tag{47}$$

• Graph regularizer: If $\hat f$ satisfies $P_n\ell_{\hat f}=\inf_{f\in\mathcal{F}_G}P_n\ell_f$, and $r^*$ is the fixed point of the local Rademacher complexity $2BL\mathcal{R}\big(\mathcal{F}_G,\frac{r}{4L^2}\big)$ with any positive operator $D$ in (32) and any $K>1$, it holds with probability at least $1-e^{-x}$ that
$$P(\ell_{\hat f}-\ell_{f^*})\le256K\sqrt{\frac{\alpha+1}{\alpha-1}}\big(dR''^2_{\max}L^2\|D^{-1}\|_{\max}\big)^{\frac1{1+\alpha}}B^{\frac{\alpha-1}{\alpha+1}}T^{\frac{-1}{1+\alpha}}n^{\frac{-\alpha}{1+\alpha}}+\frac{(48Lb+16BK)x}{nT},\tag{48}$$
where $\|D^{-1}\|_{\max}:=\max_{t\in\mathbb{N}_T}D_{tt}^{-1}$.
Corollary 25 (Data-dependent excess risk bound for an MTL problem with an $L_{2,q}$ group-norm regularizer). Assume the convex class $\mathcal{F}_q$ in (19) has ranges in $[-b,b]$, and let the loss function $\ell$ in Problem (10) be such that Assumption 1 is satisfied. Let $\hat f$ be any element of $\mathcal{F}_q$ with $1\le q\le2$ which satisfies $P_n\ell_{\hat f}=\inf_{f\in\mathcal{F}_q}P_n\ell_f$. Assume moreover that $k$ is a positive semidefinite kernel on $\mathcal{X}$ such that $\|k\|_\infty\le K<\infty$. Let $K_t$ be the $n\times n$ normalized Gram matrix (or kernel matrix) of task $t$ with entries $(K_t)_{ij}:=\frac1nk(X_t^i,X_t^j)=\frac1n\langle\phi(X_t^i),\phi(X_t^j)\rangle$. Let $\hat\lambda_t^1\ge\cdots\ge\hat\lambda_t^n$ be the ordered eigenvalues of the matrix $K_t$, and let $\hat r^*$ be the fixed point of
$$\hat\psi_n(r)=c_1\hat{\mathcal{R}}(\mathcal{F}_q^*,c_3r)+\frac{c_2x}{nT},$$
where $c_1=2L\max(B,16Lb)$, $c_2=128L^2b^2+2bc_1$, $c_3=4+128K+4B(48Lb+16BK)/c_2$, and
$$\hat{\mathcal{R}}(\mathcal{F}_q^*,c_3r):=\mathbb{E}_\sigma\left[\sup_{\substack{f\in\mathcal{F}_q,\\L^2P_n(f-\hat f)^2\le c_3r}}\frac1{nT}\sum_{t=1}^T\sum_{i=1}^n\sigma_t^if_t(X_t^i)\ \middle|\ \big(X_t^i\big)_{t\in\mathbb{N}_T,\,i\in\mathbb{N}_n}\right].\tag{49}$$
Then, for any $K>1$ and $x>0$, with probability at least $1-4e^{-x}$, the excess loss of the function class $\mathcal{F}_q$ is bounded as
$$P(\ell_{\hat f}-\ell_{f^*})\le\frac{32K}{B}\hat r^*+\frac{(48Lb+16BK)x}{nT},\tag{50}$$
where for the fixed point $\hat r^*$ of the empirical local Rademacher complexity $\hat\psi_n(r)$, it holds
$$\hat r^*\le\frac{c_1^2c_3\sum_{t=1}^T\hat h_t}{L^2nT}+\sqrt{\frac{2c_1^2q^{*2}R_{\max}^2}{nT^2}\left\|\Big(\sum_{j>\hat h_t}\hat\lambda_t^j\Big)_{t=1}^T\right\|_{\frac{q^*}2}}+\frac{4c_2x}{nT},$$
where $\hat h_1,\ldots,\hat h_T$ are arbitrary non-negative integers, and $(\hat\lambda_t^j)_{j=1}^n$ are the eigenvalues of the normalized Gram matrix $K_t$ obtained from the kernel function $k$. The proof of the result is provided in Appendix D.
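The empirical quantities in Corollary 25 are directly computable from data. A minimal sketch of the eigenvalue tail sums of a normalized Gram matrix, using a Gaussian RBF kernel as an arbitrary illustrative choice of $k$ (data and bandwidth are made up):

```python
import numpy as np

def tail_sum_eigs(X, h, gamma=1.0):
    """Tail sum of eigenvalues of the normalized Gram matrix K,
    (K)_ij = k(x_i, x_j) / n, for a Gaussian RBF kernel."""
    n = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq) / n                  # normalized Gram matrix
    eigs = np.sort(np.linalg.eigvalsh(K))[::-1]  # descending eigenvalues
    return eigs[h:].sum()

rng = np.random.default_rng(4)
X = rng.standard_normal((50, 3))                 # one task's sample
tails = [tail_sum_eigs(X, h) for h in range(0, 50, 10)]
# K is PSD, so tail sums are non-negative and non-increasing in the cut-off h
assert all(t >= -1e-10 for t in tails)
assert all(tails[i] >= tails[i + 1] - 1e-10 for i in range(len(tails) - 1))
print(tails[0])
```

For the RBF kernel, the full sum (`h = 0`) equals the trace of $K$, which is exactly $1$ after normalization.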
6 Discussion

This section is devoted to comparing the excess risk bounds based on the local Rademacher complexity with those based on the global one.

6.1 Global vs. Local Rademacher Complexity Bounds

First, note that to obtain the GRC-based bounds, we apply Theorem 16 of [45], as it considers the same setting and assumptions on the tasks' distributions as this work. This theorem presents an MTL bound based on the notion of the GRC.

Theorem 26 (MTL excess risk bound based on the GRC; Theorem 16 of [45]). Let the vector-valued function class $\mathcal{F}$ be defined as $\mathcal{F}:=\{f=(f_1,\ldots,f_T):X\mapsto[-b,b]^T\}$. Assume that $X=(X_t^i)_{(t,i)=(1,1)}^{(T,n)}$ is a vector of independent random variables where, for each fixed $t$, $X_t^1,\ldots,X_t^n$ are identically distributed according to $P_t$. Let the loss function $\ell$ be $L$-Lipschitz in its first argument. Then for every $x>0$, with probability at least $1-e^{-x}$,
$$P(\ell_f-\ell_{f^*})\le P_n(\ell_f-\ell_{f^*})+2L\mathcal{R}(\mathcal{F})+\sqrt{\frac{2Lbx}{nT}}.\tag{51}$$

Proof. As shown in [45], the proof of this theorem is based on applying McDiarmid's inequality to $Z$ defined in Theorem 1, and noticing that, for the function class $\mathcal{F}$ with values in $[-b,b]$, it holds that $|Z-Z_{s,j}|\le2b/nT$.

It can be observed that, in order to obtain the excess risk bound in the above theorem, one has to bound the GRC term $\mathcal{R}(\mathcal{F})$ in (51). Therefore, we first upper-bound the GRC of the different hypothesis spaces considered in the previous sections. The proofs of these results can be found in Appendix E.

Theorem 27 (Distribution-dependent GRC bounds). Assume that the conditions of Theorem 6 hold. Then, the following results hold for the GRC of the hypothesis spaces in (19), (30) and (32), respectively.
• Group-norm regularizer: For any $1\le q\le2$ in (19), the GRC of the function class $\mathcal{F}_q$ can be bounded as
$$\forall\kappa\in[q,2]:\quad\mathcal{R}(\mathcal{F}_q)\le\sqrt{\frac{2e\,\kappa^{*2}R_{\max}^2}{nT^2}\big\|(\operatorname{tr}(J_t))_{t=1}^T\big\|_{\frac{\kappa^*}2}}+\frac{\sqrt{2Ke}\,R_{\max}\kappa^*T^{\frac1{\kappa^*}}}{nT}.\tag{52}$$
Also, for any $q\ge2$ in (19), the GRC of the function class $\mathcal{F}_q$ can be bounded as
$$\mathcal{R}(\mathcal{F}_q)\le\sqrt{\frac{2R_{\max}^2}{nT^2}\big\|(\operatorname{tr}(J_t))_{t=1}^T\big\|_{\frac{q^*}2}}.\tag{53}$$

• Schatten-norm regularizer: For any $1\le q\le2$ in (30), the GRC of the function class $\mathcal{F}_{S_q}$ can be bounded as
$$\mathcal{R}(\mathcal{F}_{S_q})\le\sqrt{\frac{2q^*R'^2_{\max}}{nT^2}\big\|(\operatorname{tr}(J_t))_{t=1}^T\big\|_1}.\tag{54}$$
Also, for any $q\ge2$ in (30), the GRC of the function class $\mathcal{F}_{S_q}$ can be bounded as
$$\mathcal{R}(\mathcal{F}_{S_q})\le\sqrt{\frac{2R'^2_{\max}}{nT^2}\big\|(\operatorname{tr}(J_t))_{t=1}^T\big\|_1}.\tag{55}$$

• Graph regularizer: For any positive operator $D$ in (32), the GRC of the function class $\mathcal{F}_G$ can be bounded as
$$\mathcal{R}(\mathcal{F}_G)\le\sqrt{\frac{2R''^2_{\max}}{nT^2}\big\|\big(D_{tt}^{-1}\operatorname{tr}(J_t)\big)_{t=1}^T\big\|_1},\tag{56}$$
where, for the covariance operator $J_t=\mathbb{E}(\phi(X_t)\otimes\phi(X_t))=\sum_{j=1}^\infty\lambda_t^ju_t^j\otimes u_t^j$, the trace is defined as $\operatorname{tr}(J_t):=\sum_j\langle J_tu_t^j,u_t^j\rangle=\sum_{j=1}^\infty\lambda_t^j$.

Notice that, assuming a uniform bound on the traces of all tasks' kernels, the bound in (52) is dominated by $O\big(\frac{q^*T^{1/q^*}}{T\sqrt n}\big)$. Also, taking $q^*=\log T$, we obtain a bound of order $O\big(\frac{\log T}{T\sqrt n}\big)$. We can also remark that, when the kernel traces are bounded, the bounds in (53), (54), (55) and (56) are of the order $O\big(\frac1{\sqrt{nT}}\big)$. Note that, for the purpose of comparison, we concentrate only on the parameters $R$, $n$, $T$, $q^*$ and $\alpha$, and assume all other parameters are fixed and hidden in the big-$O$ notation. Also, for the sake of simplicity, we assume that the eigenvalues of all tasks satisfy $\lambda_t^j\le dj^{-\alpha}$ (with $\alpha>1$). Note that from Theorem 26 it follows that a bound on the global Rademacher complexity also provides a bound on the excess risk. This, together with Theorem 27, gives GRC-based excess risk bounds of the following forms (note that $q\ge1$):
$$\begin{aligned}
\text{Group norm:}\quad&(a)\ \forall\kappa\in[q,2],\quad P(\ell_{\hat f}-\ell_{f^*})=O\Big(\big(R_{\max}^2\kappa^{*2}\big)^{\frac12}\big(T^{\frac2\kappa}\big)^{-\frac12}n^{-\frac12}\Big),\\
&(b)\ \forall q\in[2,\infty],\quad P(\ell_{\hat f}-\ell_{f^*})=O\Big(\big(R_{\max}^2\big)^{\frac12}\big(T^{\frac2q}\big)^{-\frac12}n^{-\frac12}\Big),\\
\text{Schatten-norm:}\quad&(c)\ \forall q\in[1,2],\quad P(\ell_{\hat f}-\ell_{f^*})=O\Big(\big(R'^2_{\max}q^*\big)^{\frac12}T^{-\frac12}n^{-\frac12}\Big),\\
&(d)\ \forall q\in[2,\infty],\quad P(\ell_{\hat f}-\ell_{f^*})=O\Big(\big(R'^2_{\max}\big)^{\frac12}T^{-\frac12}n^{-\frac12}\Big),\\
\text{Graph regularizer:}\quad&(e)\quad P(\ell_{\hat f}-\ell_{f^*})=O\Big(\big(R''^2_{\max}\big)^{\frac12}T^{-\frac12}n^{-\frac12}\Big),
\end{aligned}\tag{57}$$
which can be compared to their LRC-based counterparts:
$$\begin{aligned}
\text{Group norm:}\quad&(a)\ \forall\kappa\in[q,2],\quad P(\ell_{\hat f}-\ell_{f^*})=O\Big(\big(R_{\max}^2\kappa^{*2}\big)^{\frac1{1+\alpha}}\big(T^{\frac2\kappa}\big)^{-\frac1{1+\alpha}}n^{\frac{-\alpha}{1+\alpha}}\Big),\\
&(b)\ \forall q\in[2,\infty],\quad P(\ell_{\hat f}-\ell_{f^*})=O\Big(\big(R_{\max}^2\big)^{\frac1{1+\alpha}}\big(T^{\frac2q}\big)^{-\frac1{1+\alpha}}n^{\frac{-\alpha}{1+\alpha}}\Big),\\
\text{Schatten-norm:}\quad&(c)\ \forall q\in[1,2],\quad P(\ell_{\hat f}-\ell_{f^*})=O\Big(\big(R'^2_{\max}q^*\big)^{\frac1{1+\alpha}}T^{\frac{-1}{1+\alpha}}n^{\frac{-\alpha}{1+\alpha}}\Big),\\
&(d)\ \forall q\in[2,\infty],\quad P(\ell_{\hat f}-\ell_{f^*})=O\Big(\big(R'^2_{\max}\big)^{\frac1{1+\alpha}}T^{\frac{-1}{1+\alpha}}n^{\frac{-\alpha}{1+\alpha}}\Big),\\
\text{Graph regularizer:}\quad&(e)\quad P(\ell_{\hat f}-\ell_{f^*})=O\Big(\big(R''^2_{\max}\big)^{\frac1{1+\alpha}}T^{\frac{-1}{1+\alpha}}n^{\frac{-\alpha}{1+\alpha}}\Big).
\end{aligned}\tag{58}$$
It can be seen that, holding all other parameters fixed as $n$ approaches infinity, the local bounds yield faster rates, since $\alpha>1$. However, when $T$ grows to infinity, the convergence rate of the local bounds can only be as good as that obtained by the global analysis.

A close appraisal of the results in (57) and (58) points to a conservation of asymptotic rates between $n$ and $T$, when all other remaining quantities are held fixed. This phenomenon is most apparent for the Schatten norm and graph-based regularization cases: for both the global and local analysis results, the rates (exponents) of $n$ and $T$ sum to $-1$. In the local analysis, the trade-off is determined by the value of $\alpha$, which can facilitate faster $n$-rates at the price of slower $T$-rates. A similar trade-off is witnessed in the case of group norm regularization, but this time between $n$ and $T^{2/\kappa}$ instead of $T$, due to the specific characteristics of the group norm.

As mentioned earlier in Remark 13, the bounds for the class of group norm regularizers with $1\le q\le2$ are not monotonic in $q$; they are minimized for $q^*=\log T$. Therefore, we split our analysis for this case as follows:

1. First, we consider $q^*\ge\log T$, which leads to the optimal choice $\kappa^*=q^*$; taking the minimum of the global and local bounds gives
$$P(\ell_{\hat f}-\ell_{f^*})\le O\left(\min\left\{\big(R_{\max}^2q^*\big)\big(T^{\frac2q}\big)^{-\frac12}n^{-\frac12},\ \big(R_{\max}^2q^*\big)^{\frac2{1+\alpha}}\big(T^{\frac2q}\big)^{-\frac1{1+\alpha}}n^{\frac{-\alpha}{1+\alpha}}\right\}\right).\tag{59}$$
It is worth mentioning that, for any value of $\alpha>1$, if the number of tasks $T$ as well as the radius $R_{\max}$ of the $L_{2,q}$-ball can grow with $n$, the local bound improves over the global one whenever $\frac{T^{1/q}}{R_{\max}}=O(\sqrt n)$.

2. Secondly, assume that $q^*\le\log T$, in which case the best choice is $\kappa^*=\log T$. Then, the excess risk bound reads
$$P(\ell_{\hat f}-\ell_{f^*})\le O\left(\min\left\{\frac{R_{\max}\log T}{T}\,n^{-\frac12},\ \left(\frac{R_{\max}\log T}{T}\right)^{\frac2{1+\alpha}}n^{\frac{-\alpha}{1+\alpha}}\right\}\right),\tag{60}$$
and the local analysis improves over the global one when $\frac{T}{R_{\max}\log T}=O(\sqrt n)$.

Also, a similar analysis for the Schatten norm and graph regularized hypothesis spaces shows that the local analysis is beneficial over the global one whenever the number of tasks $T$ and the radius $R$ can grow such that $\frac TR=O(\sqrt n)$.
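The rate-conservation phenomenon described above can be illustrated numerically; a minimal sketch (all values illustrative):

```python
# Rate conservation: in the Schatten/graph cases the global bound scales as
# T^(-1/2) n^(-1/2), while the local one scales as T^(-1/(1+a)) n^(-a/(1+a));
# in both, the exponents of n and T sum to -1.
for a in [1.5, 2.0, 5.0]:
    n_exp, T_exp = -a / (1 + a), -1 / (1 + a)
    assert abs(n_exp + T_exp - (-1)) < 1e-12

# For fixed T, the local rate in n beats the global n^(-1/2) once a > 1
a, n = 3.0, 10_000
local_rate = n ** (-a / (1 + a))     # here n^(-3/4)
global_rate = n ** (-0.5)
assert local_rate < global_rate
print("exponent conservation and faster local n-rate verified")
```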
6.2 Comparisons to Related Work

It is also interesting to compare our (global and local) results for trace norm regularized MTL with the GRC-based excess risk bound provided in [48], wherein a trace norm regularizer is applied to capture the tasks' relatedness. It is worth mentioning that [48] considers a slightly different hypothesis space for $W$, which in our notation reads as
$$\mathcal{F}_{S_1}:=\left\{W:\tfrac12\|W\|_{S_1}^2\le TR'^2_{\max}\right\}.\tag{61}$$
This is based on the premise that, assuming a common vector $w$ for all tasks, the regularizer should not be a function of the number of tasks [48]. Given the task-averaged covariance operator $C:=\frac1T\sum_{t=1}^TJ_t$, the excess risk bound in [48] reads as
$$P(\ell_{\hat f}-\ell_{f^*})\le2\sqrt2LR'_{\max}\left(\sqrt{\frac{\|C\|_\infty}{n}}+5\sqrt{\frac{\ln(nT)+1}{nT}}\right)+\sqrt{\frac{bLx}{nT}},$$
where the loss function $\ell$ is $L$-Lipschitz and $\mathcal{F}$ has ranges in $[-b,b]$. One can easily verify that the trace norm is a Schatten norm with $q=1$. Note that for any $q\ge1$ it holds that $\mathcal{F}_{S_1}\subseteq\mathcal{F}_{S_q}$, which implies $\mathcal{R}(\mathcal{F}_{S_1})\le\mathcal{R}(\mathcal{F}_{S_q})$. This fact, in conjunction with Theorem 27 and Theorem 26 (applied to the class of excess loss functions), yields a GRC-based excess risk bound. Therefore, considering the trace norm hypothesis space (61) and the optimal value $q^*=2$, our global and local bounds translate to the following:

1. GRC-based excess risk bound:
$$P(\ell_{\hat f}-\ell_{f^*})\le4LR'_{\max}\sqrt{\frac{\big\|(\operatorname{tr}(J_t))_{t=1}^T\big\|_1}{nT}}+\sqrt{\frac{bLx}{nT}}.$$

2. LRC-based excess risk bound ($\forall\alpha>1$):
$$P(\ell_{\hat f}-\ell_{f^*})\le256K\sqrt{\frac{\alpha+1}{\alpha-1}}\big(2dR'^2_{\max}L^2\big)^{\frac1{1+\alpha}}B^{\frac{\alpha-1}{\alpha+1}}n^{\frac{-\alpha}{1+\alpha}}+\frac{(48Lb+16BK)x}{nT}.\tag{62}$$

Now, assume that each operator $J_t$ is of rank $M$ and denote its maximum eigenvalue by $\lambda_t^{\max}$. If $\lambda^{\max}:=\max_{t\in\mathbb{N}_T}\{\lambda_t^{\max}\}$, then it is easy to verify that $\operatorname{tr}(J_t)\le M\lambda_t^{\max}$ and $\|C\|_\infty\le\lambda^{\max}$, which leads to the following GRC-based bounds:
$$\text{Ours:}\quad P(\ell_{\hat f}-\ell_{f^*})\le4LR'_{\max}\sqrt{\frac{M\lambda^{\max}}{n}}+\sqrt{\frac{bLx}{nT}},\tag{63}$$
$$\text{[48]:}\quad P(\ell_{\hat f}-\ell_{f^*})\le2\sqrt2LR'_{\max}\left(\sqrt{\frac{\lambda^{\max}}{n}}+5\sqrt{\frac{\ln(nT)+1}{nT}}\right)+\sqrt{\frac{bLx}{nT}}.\tag{64}$$
One can observe that, as $n\to\infty$, in all cases the bound vanishes. However, it does so at a rate of $n^{-\alpha/(1+\alpha)}$ for our local bound in (62), at a slower rate of $1/\sqrt n$ for our global bound in (63), and at the slowest rate of $\sqrt{\ln n/n}$ for the one in (64). We remark that, as $T\to\infty$, all bounds converge to a non-zero limit: our local bound in (62) at a fast rate of $1/T$, the one in (63) at a slower rate of $\sqrt{1/T}$, and the bound in (64) at the slowest rate of $\sqrt{\ln T/T}$.

Another interesting comparison can be performed between our bounds and the one introduced by [46] for graph regularized MTL. For this purpose, we consider the following hypothesis space, similar to that considered by [46]:
$$\mathcal{F}_G=\left\{W:\tfrac12\big\|D^{1/2}W\big\|_F^2\le TR''^2_{\max}\right\}.\tag{65}$$
[46] provides a bound on the empirical GRC of the aforementioned hypothesis space. However, similarly to the proof of Corollary 21, we can easily convert it to a distribution-dependent GRC bound, which matches our global bound in (56) (for the hypothesis space (65)) and in our notation reads as
$$\mathcal{R}(\mathcal{F}_G)\le\sqrt{\frac{2R''^2_{\max}}{nT}\big\|\big(D_{tt}^{-1}\operatorname{tr}(J_t)\big)_{t=1}^T\big\|_1}.$$
Now, with $D:=L+\eta I$ (where $L$ is the graph-Laplacian, $I$ is the identity operator, and $\eta>0$ is a regularization parameter) and the assumption that the $J_t$ are of rank $M$, it can be shown that
$$\big\|\big(D_{tt}^{-1}\operatorname{tr}(J_t)\big)_{t=1}^T\big\|_1=\sum_{t=1}^TD_{tt}^{-1}\operatorname{tr}(J_t)\le M\lambda^{\max}\sum_{t=1}^TD_{tt}^{-1}=M\lambda^{\max}\operatorname{tr}\big(D^{-1}\big)=M\lambda^{\max}\operatorname{tr}\big((L+\eta I)^{-1}\big)=M\lambda^{\max}\sum_{t=1}^T\frac1{\delta_t+\eta}\le M\lambda^{\max}\left(\frac1{\delta_{\min}+\eta}+\frac T\eta\right),$$
where $\lambda^{\max}$ is defined as before, and $(\delta_t)_{t=1}^T$ are the eigenvalues of the Laplacian matrix $L$ with $\delta_{\min}:=\min_{t\in\mathbb{N}_T}\delta_t$. Therefore, the matching GRC-based excess risk bounds can be obtained as
$$\text{Ours \& [46]:}\quad P(\ell_{\hat f}-\ell_{f^*})\le\frac{2LR''_{\max}}{\sqrt n}\sqrt{2M\lambda^{\max}\left(\frac1{T(\delta_{\min}+\eta)}+\frac1\eta\right)}+\sqrt{\frac{bLx}{nT}}.\tag{66}$$
Also, from Remark 24, the LRC-based bound is given as
$$P(\ell_{\hat f}-\ell_{f^*})\le256K\sqrt{\frac{\alpha+1}{\alpha-1}}\big(dR''^2_{\max}L^2\|D^{-1}\|_{\max}\big)^{\frac1{1+\alpha}}B^{\frac{\alpha-1}{\alpha+1}}n^{\frac{-\alpha}{1+\alpha}}+\frac{(48Lb+16BK)x}{nT}.\tag{67}$$
The above results show that, when $n\to\infty$, both the GRC and LRC bounds approach zero, albeit the global bound at a rate of $\sqrt{1/n}$ and the local one at a faster rate of $n^{-\alpha/(\alpha+1)}$, since $\alpha>1$. Also, as $T\to\infty$, both bounds approach non-zero limits; however, the global bound does so at a rate of $\sqrt{1/T}$ and the local one at a faster rate of $1/T$.
Appendices

A Proofs of the results in Sect. 2: "Talagrand-Type Inequality for Multi-Task Learning"

This section presents the proof of Theorem 1. We first provide some useful foundations used in the derivation of our result in Theorem 1.

Theorem A.1 (Theorem 2 in [15]). Let $X_1,\ldots,X_n$ be $n$ independent random variables taking values in a measurable space $\mathcal{X}$. Assume that $g:\mathcal{X}^n\to\mathbb{R}$ is a measurable function and $Z:=g(X_1,\ldots,X_n)$. Let $X_1',\ldots,X_n'$ denote an independent copy of $X_1,\ldots,X_n$, and let $Z_i':=g(X_1,\ldots,X_{i-1},X_i',X_{i+1},\ldots,X_n)$ be obtained by replacing the variable $X_i$ with $X_i'$. Define the random variable $V^+$ by
$$V^+:=\sum_{i=1}^n\mathbb{E}'\big(Z-Z_i'\big)_+^2,$$
where $(u)_+:=\max\{u,0\}$ and $\mathbb{E}'[\cdot]:=\mathbb{E}[\cdot\,|\,X]$ denotes the expectation only with respect to the variables $X_1',\ldots,X_n'$. Let $\theta>0$ and $\lambda\in(0,1/\theta)$. Then,
$$\log\mathbb{E}\,e^{\lambda(Z-\mathbb{E}Z)}\le\frac{\lambda\theta}{1-\lambda\theta}\log\mathbb{E}\exp\Big(\frac{\lambda V^+}{\theta}\Big).$$

Definition 4 (Section 3.3 in [16]). A function $g:\mathcal{X}^n\to[0,\infty)$ is said to be $b$-self-bounding ($b>0$) if there exist functions $g_i:\mathcal{X}^{n-1}\to\mathbb{R}$ such that for all $X_1,\ldots,X_n\in\mathcal{X}$ and all $i\in\mathbb{N}_n$,
$$0\le g(X_1,\ldots,X_n)-g_i(X_1,\ldots,X_{i-1},X_{i+1},\ldots,X_n)\le b,$$
and
$$\sum_{i=1}^n\big(g(X_1,\ldots,X_n)-g_i(X_1,\ldots,X_{i-1},X_{i+1},\ldots,X_n)\big)\le g(X_1,\ldots,X_n).$$

Theorem A.2 (Theorem 6.12 in [16]). Assume that $Z=g(X_1,\ldots,X_n)$ is a $1$-self-bounding function. Then for every $\lambda\in\mathbb{R}$,
$$\log\mathbb{E}e^{\lambda(Z-\mathbb{E}Z)}\le\phi(\lambda)\,\mathbb{E}Z,\tag{A.1}$$
where $\phi(\lambda)=e^\lambda-\lambda-1$.

Corollary A.3. Assume that $Z=g(X_1,\ldots,X_n)$ is a $b$-self-bounding function ($b>0$). Then, for any $\lambda\in\mathbb{R}$, we have
$$\log\mathbb{E}e^{\lambda Z}\le\frac{e^{\lambda b}-1}{b}\,\mathbb{E}Z.$$
Proof. Note that Eq. (A.1) can be rewritten as $\log\mathbb{E}[\exp(\lambda Z)]\le(e^\lambda-1)\mathbb{E}Z$. The stated inequality follows immediately by rescaling $Z$ to $Z/b$ in this inequality.

Lemma 3 (Lemma 2.11 in [17]). Let $Z$ be a random variable and $A,B>0$ some constants. If for any $\lambda\in(0,1/B)$ it holds that
$$\log\mathbb{E}\,e^{\lambda(Z-\mathbb{E}Z)}\le\frac{A\lambda^2}{2(1-B\lambda)},$$
then for all $x\ge0$,
$$\mathbb{P}\Big(Z\ge\mathbb{E}Z+\sqrt{2Ax}+Bx\Big)\le e^{-x}.$$

Lemma 4 (Contraction property, [8]). Let $\phi$ be a Lipschitz function with constant $L\ge0$, that is, $|\phi(x)-\phi(y)|\le L|x-y|$ for all $x,y\in\mathbb{R}$. Then for every real-valued function class $\mathcal{F}$, it holds that
$$\mathbb{E}_\sigma\mathcal{R}(\phi\circ\mathcal{F})\le L\,\mathbb{E}_\sigma\mathcal{R}(\mathcal{F}),\tag{A.2}$$
where $\phi\circ\mathcal{F}:=\{\phi\circ f:f\in\mathcal{F}\}$ and $\circ$ is the composition operator.

Note that in Theorem 17 of [45], it has been shown that the result of this lemma also holds for the class of vector-valued functions.
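A concrete instance of a self-bounding function is a sum of independent $[0,1]$-valued variables (take $g_i$ to be the sum with the $i$-th term removed). For such a sum the moment generating function bound of Corollary A.3 can be checked in closed form; a minimal sketch with Bernoulli variables (parameters illustrative):

```python
import math

# For Z = sum_i X_i with X_i ~ Bernoulli(p) i.i.d., Z is 1-self-bounding, and
# Corollary A.3 (with b = 1) predicts log E exp(lambda * Z) <= (e^lambda - 1) E Z.
n, p = 20, 0.3
for lam in [-1.0, 0.1, 0.5, 1.0, 2.0]:
    log_mgf = n * math.log(1 - p + p * math.exp(lam))   # exact log-MGF of Z
    bound = (math.exp(lam) - 1) * n * p                 # (e^lambda - 1) E Z
    assert log_mgf <= bound + 1e-12
print("self-bounding MGF bound verified for a Bernoulli sum")
```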
Proof of Theorem 1. Before laying out the details, we first provide a sketch of the proof. Defining
$$Z:=\sup_{f\in\mathcal{F}}\frac1T\sum_{t=1}^T\frac1{N_t}\sum_{i=1}^{N_t}\big[\mathbb{E}f_t(X_t^i)-f_t(X_t^i)\big],\tag{A.3}$$
we first apply Theorem A.1 to control the log-moment generating function $\log\mathbb{E}\,e^{\lambda(Z-\mathbb{E}Z)}$. From Theorem A.1, we know that the main component to control $\log\mathbb{E}\,e^{\lambda(Z-\mathbb{E}Z)}$ is the variance-type quantity $V^+=\sum_{s=1}^T\sum_{j=1}^{N_s}\mathbb{E}'\big(Z-Z'_{s,j}\big)_+^2$. In the next step, we show that $V^+$ can be bounded in terms of two other quantities, denoted by $W$ and $\Upsilon$. Applying Theorem A.1 for a specific value of $\theta$ then gives a bound on $\log\mathbb{E}\,e^{\lambda(Z-\mathbb{E}Z)}$ in terms of $\log\mathbb{E}[e^{\frac\lambda{b'}(W+\Upsilon)}]$. We then turn to controlling $W$ and $\Upsilon$, respectively. Our idea for tackling $W$ is to show that it is a self-bounding function, whereupon we can apply Corollary A.3 to control $\log\mathbb{E}[e^{\frac{\lambda W}{b'}}]$. The term $\Upsilon$ is closely related to the constraint imposed on the variance of the functions in $\mathcal{F}$, and can easily be upper bounded in terms of $r$. We finally apply Lemma 3 to transfer the upper bound on the log-moment generating function $\log\mathbb{E}\,e^{\lambda(Z-\mathbb{E}Z)}$ to a tail probability on $Z$. To clarify the process, we divide the proof into four main steps.

Step 1. Controlling the log-moment generating function of $Z$ with the random variable $W$ and the variance $\Upsilon$. Let $X':=(X_t'^i)_{(t,i)=(1,1)}^{(T,N_t)}$ be an independent copy of $X:=(X_t^i)_{(t,i)=(1,1)}^{(T,N_t)}$. Define the quantity
$$Z'_{s,j}:=\sup_{f\in\mathcal{F}}\left\{\frac1{TN_s}\big[\mathbb{E}'f_s(X_s'^j)-f_s(X_s'^j)\big]-\frac1{TN_s}\big[\mathbb{E}f_s(X_s^j)-f_s(X_s^j)\big]+\frac1T\sum_{t=1}^T\frac1{N_t}\sum_{i=1}^{N_t}\big[\mathbb{E}f_t(X_t^i)-f_t(X_t^i)\big]\right\},\tag{A.4}$$
where $Z'_{s,j}$ is obtained from $Z$ by replacing the variable $X_s^j$ with $X_s'^j$. Let $\hat f:=(\hat f_1,\ldots,\hat f_T)$ be such that $Z=\frac1T\sum_{t=1}^T\frac1{N_t}\sum_{i=1}^{N_t}\big[\mathbb{E}\hat f_t(X_t^i)-\hat f_t(X_t^i)\big]$, and introduce
$$W:=\sup_{f\in\mathcal{F}}\frac1{T^2}\sum_{t=1}^T\frac1{N_t^2}\sum_{i=1}^{N_t}\big[\mathbb{E}f_t(X_t^i)-f_t(X_t^i)\big]^2,\qquad\Upsilon:=\sup_{f\in\mathcal{F}}\frac1{T^2}\sum_{t=1}^T\frac1{N_t^2}\sum_{i=1}^{N_t}\mathbb{E}\big[\mathbb{E}f_t(X_t^i)-f_t(X_t^i)\big]^2.$$
It can be shown that for any j ∈ Nn and any s ∈ NT : ′ Z − Zs,j ≤
and therefore
1 ˆ j 1 ′ ˆ ′j Efs (Xs ) − fˆs (Xsj ) − E fs (Xs ) − fˆs (Xs′j ) T Ns T Ns
′ (Z − Zs,j )2+ ≤
1 T 2 Ns2
2 [Efˆs (Xsj ) − fˆs (Xsj )] − [E′ fˆs (Xs′j ) − fˆs (Xs′j )] .
Then, it follows from the identity E′ [E′ fˆs (Xs′j ) − fˆs (Xs′j )] = 0 that Ns T X X s=1 j=1
=
Ns T X 2 X ′ ≤ E′ Z − Zs,j + s=1 j=1
Ns T X X s=1 j=1
≤ sup
2 i h 1 ′ ˆs (X j ) − fˆs (X j )] − [E′ fˆs (X ′j ) − fˆs (X ′j )] [E f E s s s s T 2 Ns2
Ns T X X 1 1 ˆs (X j ) − fˆs (X j )]2 + [E f E′ [E′ fˆs (Xs′j ) − fˆs (Xs′j )]2 s s 2 2 2 2 T Ns T N s s=1 j=1
Ns T X X
f ∈F s=1 j=1
Ns T X X 1 1 j j 2 [Ef (X ) − f (X )] + sup E[Efs (Xsj ) − fs (Xsj )]2 s s s s 2 2 T 2 Ns2 f ∈F s=1 j=1 T Ns
24
= W + Υ. 2b Introduce b′ := nT . Applying Theorem A.1 and the above bound on following bound on the log-moment generating function of Z:
log E eλ(Z−EZ) ≤
PT
s=1
2 λ λb′ log Ee b′ (W +σ ) , ′ 1 − λb
PNs
j=1
2 ′ E′ Z − Zs,j then gives the +
∀λ ∈ (0, 1/b′).
(A.5)
Step 2. Controlling the log-moment generating function of $W$. We now upper bound the log-moment generating function of $W$ by showing that it is a self-bounding function. For any $s \in \mathbb{N}_T$, $j \in \mathbb{N}_{N_s}$, introduce
$$W_{s,j} := \sup_{f\in\mathcal{F}} \bigg[ \frac{1}{T^2}\sum_{t=1}^T \frac{1}{N_t^2}\sum_{i=1}^{N_t}\big[E f_t(X_t^i) - f_t(X_t^i)\big]^2 - \frac{1}{T^2 N_s^2}\big[E f_s(X_s^j) - f_s(X_s^j)\big]^2 \bigg].$$
Note that $W_{s,j}$ is a function of $\{X_t^i : t \in \mathbb{N}_T, i \in \mathbb{N}_{N_t}\}\setminus\{X_s^j\}$. Letting $\tilde{f} := (\tilde{f}_1, \ldots, \tilde{f}_T)$ be the function achieving the supremum in the definition of $W$, it can be checked that (note that $b' = \frac{2b}{nT}$)
$$T^2[W - W_{s,j}] \le \frac{1}{N_s^2}\big[E \tilde{f}_s(X_s^j) - \tilde{f}_s(X_s^j)\big]^2 \le \frac{4b^2}{n^2} = T^2 b'^2. \quad \text{(A.6)}$$
Similarly, if $\tilde{f}^{s,j} := (\tilde{f}_1^{s,j}, \ldots, \tilde{f}_T^{s,j})$ is the function achieving the supremum in the definition of $W_{s,j}$, then one can derive the following inequality:
$$T^2[W - W_{s,j}] \ge \frac{1}{N_s^2}\big[E \tilde{f}_s^{s,j}(X_s^j) - \tilde{f}_s^{s,j}(X_s^j)\big]^2 \ge 0.$$
Also, it can be shown that
$$\sum_{s=1}^T \sum_{j=1}^{N_s} \big[W - W_{s,j}\big] \le \frac{1}{T^2}\sum_{s=1}^T \frac{1}{N_s^2}\sum_{j=1}^{N_s}\big[E \tilde{f}_s(X_s^j) - \tilde{f}_s(X_s^j)\big]^2 = \sup_{f\in\mathcal{F}} \frac{1}{T^2}\sum_{t=1}^T \frac{1}{N_t^2}\sum_{i=1}^{N_t}\big[E f_t(X_t^i) - f_t(X_t^i)\big]^2 = W. \quad \text{(A.7)}$$
Therefore (according to Definition 4), $W/b'$ is a $b'$-self-bounding function. Applying Corollary A.3 then gives the following inequality for any $\lambda \in (0, 1/b')$:
$$\log E\, e^{\lambda(W/b')} \le \frac{e^{\lambda b'} - 1}{b'^2}\, EW = \frac{(e^{\lambda b'} - 1)\,\Sigma^2}{b'^2} \le \frac{\lambda \Sigma^2}{b'(1 - \lambda b')}, \quad \text{(A.8)}$$
where we introduce $\Sigma^2 := EW$ and the last step uses the inequality $(e^x - 1)(1 - x) \le x$, $\forall x \in [0, 1]$. Furthermore, the term $\Sigma^2$ can be controlled as follows (here $(\sigma_t^i)$ is a sequence of independent Rademacher variables, independent of the $X_t^i$):
$$\Sigma^2 \le E_X \bigg[ \sup_{f\in\mathcal{F}} \frac{1}{T^2}\sum_{t=1}^T \frac{1}{N_t^2}\sum_{i=1}^{N_t} \Big( \big[E f_t(X_t^i) - f_t(X_t^i)\big]^2 - E\big[E f_t(X_t^i) - f_t(X_t^i)\big]^2 \Big) \bigg] + \Upsilon$$
$$\le 2 E_{X,\sigma} \bigg[ \sup_{f\in\mathcal{F}} \frac{1}{T^2}\sum_{t=1}^T \frac{1}{N_t^2}\sum_{i=1}^{N_t} \sigma_t^i \big[E f_t(X_t^i) - f_t(X_t^i)\big]^2 \bigg] + \Upsilon$$
$$\le 8b\, E_{X,\sigma} \bigg[ \sup_{f\in\mathcal{F}} \frac{1}{T^2}\sum_{t=1}^T \frac{1}{N_t^2}\sum_{i=1}^{N_t} \sigma_t^i \big[E f_t(X_t^i) - f_t(X_t^i)\big] \bigg] + \Upsilon \le \frac{16b\, R(\mathcal{F})}{nT} + \Upsilon,$$
where the first inequality follows from the definitions of $W$ and $\Upsilon$, and the second inequality follows from the standard symmetrization technique used to relate the Rademacher complexity to the uniform deviation of empirical averages from their expectation [8]. The third inequality comes from a direct application of Lemma 4 with $\phi(x) = x^2$ (with Lipschitz constant $4b$ on $[-2b, 2b]$), and the last inequality uses Jensen's inequality together with the definition of $R(\mathcal{F})$ and the fact that $\frac{1}{N_t^2} \le \frac{1}{n N_t}$. Plugging the previous inequality on $\Sigma^2$ back into (A.8) gives
$$\log E\, e^{\lambda(W/b')} \le \frac{\lambda}{b'(1 - \lambda b')}\bigg[ \frac{16b\, R(\mathcal{F})}{nT} + \Upsilon \bigg], \quad \forall \lambda \in (0, 1/b'). \quad \text{(A.9)}$$
Step 3. Controlling the term $\Upsilon$. Note that $\Upsilon$ can be upper bounded as
$$\Upsilon := \sup_{f\in\mathcal{F}} \frac{1}{T^2}\sum_{s=1}^T \frac{1}{N_s^2}\sum_{j=1}^{N_s} E\big[E f_s(X_s^j) - f_s(X_s^j)\big]^2 \le \frac{1}{nT^2} \sup_{f\in\mathcal{F}} \sum_{s=1}^T E\big[E f_s(X_s^1) - f_s(X_s^1)\big]^2 \le \frac{1}{nT^2} \sup_{f\in\mathcal{F}} \sum_{s=1}^T E\big[f_s(X_s^1)\big]^2 \le \frac{r}{nT}, \quad \text{(A.10)}$$
where the last inequality follows from the assumption $\frac{1}{T} \sup_{f\in\mathcal{F}} \big[\sum_{s=1}^T E[f_s(X_s^1)]^2\big] \le r$ of the theorem.

Step 4. Transferring the bound on the log-moment generating function of $Z$ to tail probabilities. Plugging the bound on $\log E\, e^{\lambda W/b'}$ given in (A.9) and the bound on $\Upsilon$ given in (A.10) back into (A.5) immediately yields the following inequality on the log-moment generating function of $Z$ for any $\lambda \in (0, 1/2b')$:
$$\log E\big[e^{\lambda(Z-EZ)}\big] \le \frac{\lambda b'}{1 - \lambda b'} \bigg[ \frac{\lambda}{b'(1 - \lambda b')}\Big(16(nT)^{-1} b\, R(\mathcal{F}) + \Upsilon\Big) + \frac{\lambda \Upsilon}{b'} \bigg] \le \frac{\lambda^2}{(1 - \lambda b')^2} \bigg[ \frac{16b\, R(\mathcal{F})}{nT} + 2\Upsilon \bigg] \le \frac{2\lambda^2}{2(1 - 2\lambda b')} \bigg[ \frac{16b\, R(\mathcal{F})}{nT} + \frac{2r}{nT} \bigg], \quad \text{(A.11)}$$
where the last inequality uses $(1 - \lambda b')^2 \ge 1 - 2\lambda b' > 0$ since $\lambda \in (0, 1/2b')$. That is, the conditions of Lemma 3 hold and we can apply it (with $A = 2\big(\frac{16b R(\mathcal{F})}{nT} + \frac{2r}{nT}\big)$ and $B = 2b'$) to get the following inequality with probability at least $1 - e^{-x}$ (note that $b' = \frac{2b}{nT}$):
$$Z \le E[Z] + \sqrt{\Big(\frac{16b\, R(\mathcal{F})}{nT} + \frac{2r}{nT}\Big) 4x} + 2b'x \le E[Z] + 8\sqrt{\frac{bx\, R(\mathcal{F})}{nT}} + \sqrt{\frac{8xr}{nT}} + \frac{4bx}{nT}$$
$$\le E[Z] + 2R(\mathcal{F}) + \frac{8bx}{nT} + \sqrt{\frac{8xr}{nT}} + \frac{4bx}{nT} \le 4R(\mathcal{F}) + \sqrt{\frac{8xr}{nT}} + \frac{12bx}{nT},$$
where the third inequality follows from $2\sqrt{uv} \le u + v$, and the last step uses the following inequality due to the symmetrization technique (here the ghost sample $X'$ is an i.i.d. copy of the initial sample $X$):
$$EZ = E_X \bigg[ \sup_{f\in\mathcal{F}} E_{X'}\Big[ \frac{1}{T}\sum_{t=1}^T \frac{1}{N_t}\sum_{i=1}^{N_t} \big(f_t(X_t'^i) - f_t(X_t^i)\big) \Big] \bigg] \le E_{X,X'} \bigg[ \sup_{f\in\mathcal{F}} \frac{1}{T}\sum_{t=1}^T \frac{1}{N_t}\sum_{i=1}^{N_t} \big(f_t(X_t'^i) - f_t(X_t^i)\big) \bigg]$$
$$= E_{X,X',\sigma} \bigg[ \sup_{f\in\mathcal{F}} \frac{1}{T}\sum_{t=1}^T \frac{1}{N_t}\sum_{i=1}^{N_t} \sigma_t^i \big(f_t(X_t'^i) - f_t(X_t^i)\big) \bigg] \le 2R(\mathcal{F}).$$
Note that the second identity holds since, for any $\sigma_t^i$, the random variable $f_t(X_t'^i) - f_t(X_t^i)$ has the same distribution as $\sigma_t^i\big(f_t(X_t'^i) - f_t(X_t^i)\big)$.
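As an illustrative sanity check of the symmetrization inequality $EZ \le 2R(\mathcal{F})$ used in Step 4 (not part of the formal argument), both sides can be computed exactly in a toy single-task setting by enumerating all samples and all Rademacher sign patterns. The two-point input space, the sample size, and the small finite class below are arbitrary assumptions:

```python
import itertools

# Exact check of  E Z <= 2 R(F)  in a toy setting: X_i uniform on {0, 1},
# n = 6 i.i.d. samples, and a small finite class F of functions {0,1} -> [-1, 1].
F = [(0.2, -0.9), (-0.6, 0.6), (1.0, -0.2), (-0.3, 0.1)]  # f given by (f(0), f(1))
n = 6
points = list(itertools.product([0, 1], repeat=n))  # all 2^n equally likely samples

# E Z = E_X sup_f ( E f - (1/n) sum_i f(X_i) ),  with E f = (f(0) + f(1)) / 2
EZ = 0.0
for xs in points:
    EZ += max((f[0] + f[1]) / 2 - sum(f[x] for x in xs) / n for f in F)
EZ /= len(points)

# R(F) = E_{X,sigma} sup_f (1/n) sum_i sigma_i f(X_i), computed exactly
R = 0.0
for xs in points:
    for signs in itertools.product([-1, 1], repeat=n):
        R += max(sum(s * f[x] for s, x in zip(signs, xs)) / n for f in F)
R /= len(points) * 2 ** n

assert EZ <= 2 * R + 1e-12
```

Because every sample and sign configuration is enumerated, both expectations are exact and the inequality is verified without sampling noise.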
B Proofs of the results in Sect. 3: "Excess MTL Risk Bounds based on Local Rademacher Complexities"

Theorem B.1 is at the core of proving Theorem 3 in Sect. 3. We first present the following lemma, which is used in the proof of Theorem B.1.

Lemma 5. Let $K > 1$, $r > 0$. Assume that $\mathcal{F} = \{f := (f_1, \ldots, f_T) : \forall t,\ f_t \in \mathbb{R}^{\mathcal{X}}\}$ is a vector-valued $(1, B)$-Bernstein class of functions. Also, let the rescaled version of $\mathcal{F}$ be defined as
$$\mathcal{F}_r := \bigg\{ f' = (f_1', \ldots, f_T') : f_t' := \frac{r f_t}{\max(r, V(f))},\ f = (f_1, \ldots, f_T) \in \mathcal{F} \bigg\}.$$
If $V_r^+ := \sup_{f' \in \mathcal{F}_r}[P f' - P_n f'] \le \frac{r}{BK}$, then
$$\forall f \in \mathcal{F}: \quad P f \le \frac{K}{K-1} P_n f + \frac{r}{BK}. \quad \text{(B.1)}$$
Proof. We prove (B.1) by considering two cases. Let $f$ be any element in $\mathcal{F}$. If $V(f) \le r$, then $f' = f$. Therefore, considering the fact that for any $f' \in \mathcal{F}_r$ it holds that $P f' \le P_n f' + V_r^+$, the inequality $V_r^+ \le \frac{r}{BK}$ translates to
$$P f \le P_n f + \frac{r}{BK} \le \frac{K}{K-1} P_n f + \frac{r}{BK}. \quad \text{(B.2)}$$
If $V(f) \ge r$, then $f' = r f / V(f)$. Therefore, $P f' \le P_n f' + V_r^+$ together with $V_r^+ \le \frac{r}{BK}$ gives
$$\frac{r}{V(f)}\, P f \le \frac{r}{V(f)}\, P_n f + \frac{r}{BK},$$
which, coupled with $V(f) \le B P f$, yields
$$P f \le P_n f + \frac{1}{K}\, P f.$$
This last inequality then implies
$$P f \le \frac{K}{K-1} P_n f \le \frac{K}{K-1} P_n f + \frac{r}{BK}. \quad \text{(B.3)}$$
Eq. (B.1) follows by combining (B.2) and (B.3).

The following provides another useful definition that will be needed in introducing the result of Theorem B.1.

Definition 5 (Star-Hull). The star-hull of a function class $\mathcal{F}$ around the function $f_0$ is defined as
$$\mathrm{star}(\mathcal{F}, f_0) := \{f_0 + \alpha(f - f_0) : f \in \mathcal{F},\ \alpha \in [0, 1]\}.$$
Now, we present a lemma from [8] which indicates that the local Rademacher complexity of the star-hull of any function class $\mathcal{F}$ is a sub-root function and, therefore, has a unique fixed point.

Lemma 6 (Lemma 3.4 in [8]). For any function class $\mathcal{F}$, the local Rademacher complexity of its star-hull is a sub-root function.

Theorem B.1 (Distribution-dependent bound for MTL). Let $\mathcal{F} = \{f := (f_1, \ldots, f_T) : \forall t,\ f_t \in \mathbb{R}^{\mathcal{X}}\}$ be a class of vector-valued functions satisfying $\sup_{t,x} |f_t(x)| \le b$. Let $X := (X_t^i, Y_t^i)_{(t,i)=(1,1)}^{(T,n)}$ be a vector of $nT$ independent random variables where, for every $t \in \mathbb{N}_T$, $(X_t^1, Y_t^1), \ldots, (X_t^n, Y_t^n)$ are identically distributed. Assume that $\mathcal{F}$ is a $(\beta, B)$-Bernstein class of vector-valued functions. Let $\psi$ be a sub-root function with fixed point $r^*$. Suppose that for all $r \ge r^*$,
$$B\, R(\mathcal{F}, r) \le \psi(r), \quad \text{where} \quad R(\mathcal{F}, r) := E\bigg[ \sup_{f\in\mathcal{F},\, V(f)\le r} \frac{1}{nT}\sum_{t=1}^T \sum_{i=1}^n \sigma_t^i f_t(X_t^i) \bigg]$$
is the LRC of the function class $\mathcal{F}$. Then:

1. For any $K > 1$ and $x > 0$, with probability at least $1 - e^{-x}$, every $f \in \mathcal{F}$ satisfies
$$P f \le \frac{K}{K-1} P_n f + \frac{560K}{B} r^* + \frac{(24b + 28BK)x}{nT}. \quad \text{(B.4)}$$
2. If $\mathcal{F}$ is convex, then for any $K > 1$ and $x > 0$, the following inequality holds with probability at least $1 - e^{-x}$ for every $f \in \mathcal{F}$:
$$P f \le \frac{K}{K-1} P_n f + \frac{32K}{B} r^* + \frac{(24b + 16BK)x}{nT}. \quad \text{(B.5)}$$
Proof. Similar to Lemma 5, define for the vector-valued function class $\mathcal{F}$
$$\mathcal{F}_r := \bigg\{ f' = (f_1', \ldots, f_T') : f_t' := \frac{r f_t}{\max(r, V(f))},\ f = (f_1, \ldots, f_T) \in \mathcal{F} \bigg\}.$$
The proof can be broken down into two steps. The first step applies Theorem 1 and the seminal peeling technique [61, 62] to establish an inequality on the uniform deviation over the function class $\mathcal{F}_r$. The second step then uses the Bernstein assumption $V(f) \le B P f$ to convert this inequality stated for $\mathcal{F}_r$ into a uniform deviation inequality for $\mathcal{F}$.

Step 1. Controlling uniform deviations for $\mathcal{F}_r$. To apply Theorem 1 to $\mathcal{F}_r$, we need to control the variances and uniform bounds for elements in $\mathcal{F}_r$. We first show $P f'^2 \le r$, $\forall f' \in \mathcal{F}_r$. Indeed, for any $f \in \mathcal{F}$ with $V(f) \le r$, the definition of $\mathcal{F}_r$ implies $f_t' = f_t$ and, hence, $P f'^2 = P f^2 \le V(f) \le r$. Otherwise, if $V(f) \ge r$, then $f_t' = r f_t / V(f)$ and we get
$$P f'^2 = \frac{1}{T}\sum_{t=1}^T P f_t'^2 = \frac{1}{T}\sum_{t=1}^T \frac{r^2}{V(f)^2}\, P f_t^2 \le \frac{r^2}{V(f)^2}\, V(f) \le r.$$
Therefore, $\frac{1}{T} \sup_{f'\in\mathcal{F}_r} \sum_{t=1}^T E[f_t'(X_t)]^2 \le r$. Also, since functions in $\mathcal{F}$ admit a range of $[-b, b]$ and since $0 \le r/\max(r, V(f)) \le 1$, the inequality $\sup_{t,x} |f_t'(x)| \le b$ holds for any $f' \in \mathcal{F}_r$. Applying Theorem 1 to the function class $\mathcal{F}_r$ then yields the following inequality with probability at least $1 - e^{-x}$, $\forall x > 0$:
$$\sup_{f'\in\mathcal{F}_r} [P f' - P_n f'] \le 4R(\mathcal{F}_r) + \sqrt{\frac{8xr}{nT}} + \frac{12bx}{nT}. \quad \text{(B.6)}$$
It remains to control the Rademacher complexity of $\mathcal{F}_r$. Denote $\mathcal{F}(u, v) := \{f \in \mathcal{F} : u \le V(f) \le v\}$, $\forall 0 \le u \le v$, and introduce the notation
$$R_n f' := \frac{1}{nT}\sum_{t=1}^T \sum_{i=1}^n \sigma_t^i f_t'(X_t^i), \qquad R_n(\mathcal{F}_r) := \sup_{f'\in\mathcal{F}_r} R_n f'.$$
Note that $R(\mathcal{F}_r) = E\, R_n(\mathcal{F}_r)$. Our assumption implies $V(f) \le B P f \le Bb$, $\forall f \in \mathcal{F}$. Fix $\lambda > 1$ and define $k$ to be the smallest integer such that $r\lambda^{k+1} \ge Bb$. Then, it follows from the union bound inequality
$$R(\mathcal{G}_1 \cup \mathcal{G}_2) \le R(\mathcal{G}_1) + R(\mathcal{G}_2) \quad \text{(B.7)}$$
that
$$R(\mathcal{F}_r) = E \sup_{f'\in\mathcal{F}_r} R_n f' = E \sup_{f\in\mathcal{F}} \frac{r}{\max(r, V(f))} \frac{1}{nT}\sum_{t=1}^T \sum_{i=1}^n \sigma_t^i f_t(X_t^i)$$
$$\overset{\text{(B.7)}}{\le} E \sup_{f\in\mathcal{F}(0,r)} \frac{1}{nT}\sum_{t=1}^T \sum_{i=1}^n \sigma_t^i f_t(X_t^i) + E \sup_{f\in\mathcal{F}(r,Bb)} \frac{r}{V(f)} \frac{1}{nT}\sum_{t=1}^T \sum_{i=1}^n \sigma_t^i f_t(X_t^i)$$
$$\overset{\text{(B.7)}}{\le} E \sup_{f\in\mathcal{F}(0,r)} \frac{1}{nT}\sum_{t=1}^T \sum_{i=1}^n \sigma_t^i f_t(X_t^i) + \sum_{j=0}^k \lambda^{-j}\, E \sup_{f\in\mathcal{F}(r\lambda^j,\, r\lambda^{j+1})} R_n f$$
$$\le R(\mathcal{F}, r) + \sum_{j=0}^k \lambda^{-j} R\big(\mathcal{F}, r\lambda^{j+1}\big) \le \frac{\psi(r)}{B} + \frac{1}{B}\sum_{j=0}^k \lambda^{-j} \psi\big(r\lambda^{j+1}\big).$$
The sub-root property of $\psi$ implies that for any $\xi \ge 1$, $\psi(\xi r) \le \xi^{1/2}\psi(r)$, and hence
$$R(\mathcal{F}_r) \le \frac{\psi(r)}{B}\bigg(1 + \sqrt{\lambda}\sum_{j=0}^k \lambda^{-j/2}\bigg) \le \frac{\psi(r)}{B}\bigg(1 + \frac{\lambda}{\sqrt{\lambda}-1}\bigg).$$
Taking the choice $\lambda = 4$ in the above inequality implies that $R(\mathcal{F}_r) \le 5\psi(r)/B$, which, together with the inequality $\psi(r) \le \sqrt{r/r^*}\,\psi(r^*) = \sqrt{r r^*}$, $\forall r \ge r^*$, gives
$$R(\mathcal{F}_r) \le \frac{5}{B}\sqrt{r r^*}, \quad \forall r \ge r^*.$$
Combining (B.6) and the above inequality, for any $r \ge r^*$ and $x > 0$, we derive the following inequality with probability at least $1 - e^{-x}$:
$$\sup_{f'\in\mathcal{F}_r} [P f' - P_n f'] \le \frac{20}{B}\sqrt{r r^*} + \sqrt{\frac{8xr}{nT}} + \frac{12bx}{nT}. \quad \text{(B.8)}$$

Step 2. Transferring uniform deviations for $\mathcal{F}_r$ to uniform deviations for $\mathcal{F}$. Setting $A = 20\sqrt{r^*}/B + \sqrt{8x/nT}$ and $C = 12bx/nT$, the upper bound (B.8) can be written as $A\sqrt{r} + C$, that is, $\sup_{f'\in\mathcal{F}_r}[P f' - P_n f'] \le A\sqrt{r} + C$. Now, according to Lemma 5, if $\sup_{f'\in\mathcal{F}_r}[P f' - P_n f'] \le \frac{r}{BK}$, then for any $f \in \mathcal{F}$,
$$P f \le \frac{K}{K-1} P_n f + \frac{r}{BK}. \quad \text{(B.9)}$$
Therefore, in order to use the result of Lemma 5, we let $A\sqrt{r} + C = r/(BK)$. Assume $r_0$ is the unique positive solution of the equation $A\sqrt{r} + C = r/(BK)$. It follows immediately that
$$r^* \le (ABK)^2 \le r_0 \le (ABK)^2 + 2BKC.$$
(B.8) then shows $\sup_{f'\in\mathcal{F}_{r_0}}[P f' - P_n f'] \le \frac{r_0}{BK}$, which together with (B.9) implies
$$P f \le \frac{K}{K-1} P_n f + \frac{r_0}{BK} \le \frac{K}{K-1} P_n f + BK\bigg(\frac{400}{B^2} r^* + \frac{40}{B}\sqrt{\frac{8xr^*}{nT}} + \frac{8x}{nT}\bigg) + \frac{24bx}{nT}.$$
The stated inequality (B.4) follows immediately from $\sqrt{8xr^*/nT} \le \frac{Bx}{2nT} + \frac{4r^*}{B}$. The proof of the second part follows from the fact that $\mathcal{F}_r \subseteq \{f \in \mathrm{star}(\mathcal{F}, 0) : V(f) \le r\}$, where $\mathrm{star}(\mathcal{F}, f_0)$ is defined according to Definition 5. Also, since any convex class $\mathcal{F}$ is star-shaped around any of its points, we have $\mathcal{F}_r \subseteq \{f \in \mathcal{F} : V(f) \le r\}$. Therefore, $R(\mathcal{F}_r)$ in (B.6) can be bounded as $R(\mathcal{F}_r) \le R(\mathcal{F}, r) \le \psi(r)/B$. The rest of the proof of (B.5) is analogous to that of the first part and is omitted for brevity.
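As an illustrative numeric check of Step 2 above (not part of the formal argument), the bracketing $(ABK)^2 \le r_0 \le (ABK)^2 + 2BKC$ for the positive solution $r_0$ of $A\sqrt{r} + C = r/(BK)$ can be verified by solving the underlying quadratic. The constants below are arbitrary positive values:

```python
import math

# r0 solves A*sqrt(r) + C = r/(B*K); with s = sqrt(r0) this is the quadratic
#   s^2 - A*B*K*s - C*B*K = 0,
# whose positive root gives r0. We then check (A*B*K)^2 <= r0 <= (A*B*K)^2 + 2*B*K*C.
for A, B, K, C in [(0.3, 1.0, 2.0, 0.05), (2.0, 0.5, 4.0, 1.0), (0.01, 3.0, 1.5, 0.2)]:
    s = (A * B * K + math.sqrt((A * B * K) ** 2 + 4 * B * K * C)) / 2
    r0 = s * s
    assert abs(A * math.sqrt(r0) + C - r0 / (B * K)) < 1e-9   # solves the equation
    assert (A * B * K) ** 2 - 1e-12 <= r0 <= (A * B * K) ** 2 + 2 * B * K * C + 1e-12
```

The lower bound corresponds to dropping $C$, and the upper bound follows from $A\sqrt{A^2 + 4C/(BK)} \le A^2 + 2C/(BK)$, as the assertions confirm for these instances.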
Proof of Theorem 3. Note that the proof of this theorem relies on the results of Theorem B.1. Introduce the following class of excess loss functions:
$$\mathcal{H}_{\mathcal{F}}^* := \big\{h_f = (h_{f_1}, \ldots, h_{f_T}),\ h_{f_t} : (X_t, Y_t) \mapsto \ell(f_t(X_t), Y_t) - \ell(f_t^*(X_t), Y_t),\ f \in \mathcal{F}\big\}. \quad \text{(B.10)}$$
It can be shown that $\sup_{t,x,y} |h_{f_t}(x, y)| = \sup_{t,x,y} |\ell(f_t(x), y) - \ell(f_t^*(x), y)| \le L \sup_{t,x} |f_t(x) - f_t^*(x)| \le 2Lb$. Also, Assumption 1 implies
$$P(\ell_f - \ell_{f^*})^2 \le L^2 P(f - f^*)^2 \le B' L^2 P(\ell_f - \ell_{f^*}), \quad \forall h_f \in \mathcal{H}_{\mathcal{F}}^*.$$
By taking $B = B' L^2$, we have for all $h_f \in \mathcal{H}_{\mathcal{F}}^*$,
$$V(h_f) := P h_f^2 \le L^2 P(f - f^*)^2 \le B P(\ell_f - \ell_{f^*}) = B P h_f,$$
which implies that $\mathcal{H}_{\mathcal{F}}^*$ is a $(1, B)$-Bernstein class of vector-valued functions. Also, note that one can verify
$$B\, R(\mathcal{H}_{\mathcal{F}}^*, r) = B\, E_{X,\sigma} \sup_{f\in\mathcal{F},\, V(h_f)\le r} \frac{1}{nT}\sum_{t=1}^T \sum_{i=1}^n \sigma_t^i h_{f_t}(X_t^i, Y_t^i) = B\, E_{X,\sigma} \sup_{f\in\mathcal{F},\, V(h_f)\le r} \frac{1}{nT}\sum_{t=1}^T \sum_{i=1}^n \sigma_t^i \ell_{f_t}(X_t^i, Y_t^i) \le B L\, R(\mathcal{F}^*, r) \le \psi(r),$$
where the second-to-last inequality is due to Talagrand's Lemma [36]. Applying Theorem B.1 (which is the extension of Theorem 3.3 of [8] to MTL function classes) to the function class $\mathcal{H}_{\mathcal{F}}^*$ completes the proof.

The following lemma, as a consequence of Corollary 2.2 in [8], is essential in proving Theorem 5.

Lemma 7. Assume that the functions in the vector-valued function class $\mathcal{F} = \{f = (f_1, \ldots, f_T)\}$ satisfy $\sup_{t,x} |f_t(x)| \le b$ with $b > 0$. For every $x > 0$, if $r$ satisfies
$$r \ge 32L^2 b\, E_{\sigma,X} \sup_{f\in\mathcal{F},\ L^2 P(f-f^*)^2 \le r} \frac{1}{nT}\sum_{t=1}^T \sum_{i=1}^n \sigma_t^i f_t(X_t^i) + \frac{128 L^2 b^2 x}{nT},$$
then, with probability at least $1 - e^{-x}$,
$$\big\{f \in \mathcal{F} : L^2 P(f - f^*)^2 \le r\big\} \subset \big\{f \in \mathcal{F} : L^2 P_n(f - f^*)^2 \le 2r\big\}.$$
Proof. First, define
$$\mathcal{F}_r^* := \big\{f' = (f_1', \ldots, f_T') : \forall t,\ f_t' = (f_t - f_t^*)^2,\ f = (f_1, \ldots, f_T) \in \mathcal{F},\ L^2 P(f - f^*)^2 \le r\big\}.$$
Note that for all $t \in \mathbb{N}_T$, $(f_t - f_t^*)^2 \in [0, 4b^2]$. Also, for any function in $\mathcal{F}_r^*$, it holds that
$$P f'^2 = \frac{1}{T}\sum_{t=1}^T P f_t'^2 = \frac{1}{T}\sum_{t=1}^T P(f_t - f_t^*)^4 \le \frac{4b^2}{T}\sum_{t=1}^T P(f_t - f_t^*)^2 = 4b^2 P(f - f^*)^2 \le \frac{4b^2 r}{L^2}.$$
Therefore, by Theorem 1, with probability at least $1 - e^{-x}$, every $f' \in \mathcal{F}_r^*$ satisfies
$$P_n f' \le P f' + 4R(\mathcal{F}_r^*) + \sqrt{\frac{32b^2 xr}{nT L^2}} + \frac{48b^2 x}{nT}, \quad \text{(B.11)}$$
where
$$R(\mathcal{F}_r^*) = E_{\sigma,X} \sup_{f\in\mathcal{F},\ L^2 P(f-f^*)^2 \le r} \frac{1}{nT}\sum_{t=1}^T \sum_{i=1}^n \sigma_t^i \big(f_t(X_t^i) - f_t^*(X_t^i)\big)^2 \le 4b\, E_{\sigma,X} \sup_{f\in\mathcal{F},\ L^2 P(f-f^*)^2 \le r} \frac{1}{nT}\sum_{t=1}^T \sum_{i=1}^n \sigma_t^i f_t(X_t^i). \quad \text{(B.12)}$$
The last inequality follows from the facts that $g(x) = x^2$ is $4b$-Lipschitz on $[-2b, 2b]$ and $f^*$ is fixed. This, together with (B.11), gives
$$P_n f' \le P f' + 16b\, E_{\sigma,X} \sup_{f\in\mathcal{F},\ L^2 P(f-f^*)^2 \le r} \frac{1}{nT}\sum_{t=1}^T \sum_{i=1}^n \sigma_t^i f_t(X_t^i) + \sqrt{\frac{32b^2 xr}{nT L^2}} + \frac{48b^2 x}{nT}$$
$$\le \frac{r}{L^2} + \frac{r}{2L^2} + 16b\, E_{\sigma,X} \sup_{f\in\mathcal{F},\ L^2 P(f-f^*)^2 \le r} \frac{1}{nT}\sum_{t=1}^T \sum_{i=1}^n \sigma_t^i f_t(X_t^i) + \frac{64b^2 x}{nT}.$$
Multiplying both sides by $L^2$ and noting that the assumption on $r$ guarantees $16L^2 b\, E_{\sigma,X}[\cdot] + \frac{64L^2 b^2 x}{nT} \le \frac{r}{2}$ completes the proof.
Proof of Theorem 5. With $c_1 = 2L \max(B, 16Lb)$ and $c_2 = 128L^2 b^2 + 2bc_1$, define the function $\psi(r)$ as
$$\psi(r) = \frac{c_1}{2}\, E \sup_{f\in\mathcal{F},\ L^2 P(f-f^*)^2 \le r} \frac{1}{nT}\sum_{t=1}^T \sum_{i=1}^n \sigma_t^i f_t(X_t^i) + \frac{(c_2 - 2bc_1)x}{nT}. \quad \text{(B.13)}$$
Since $\mathcal{F}$ is convex, it is star-shaped around any of its points; thus, using Lemma 3.4 in [8], it can be shown that $\psi(r)$ defined in (B.13) is a sub-root function. With the help of Corollary 4 and Assumption 1, we have with probability at least $1 - e^{-x}$
$$L^2 P\big(\hat{f} - f^*\big)^2 \le B P\big(\ell_{\hat{f}} - \ell_{f^*}\big) \le 32Kr + \frac{(48Lb + 16BK)Bx}{nT}, \quad \text{(B.14)}$$
where $B := B' L^2$. Denote the right-hand side of the last inequality by $s$. Since $s \ge r \ge r^*$, by the property of sub-root functions it holds that $s \ge \psi(s)$, which together with (B.13) gives
$$s \ge 32L^2 b\, E \sup_{f\in\mathcal{F},\ L^2 P(f-f^*)^2 \le s} \frac{1}{nT}\sum_{t=1}^T \sum_{i=1}^n \sigma_t^i f_t(X_t^i) + \frac{128L^2 b^2 x}{nT}.$$
Applying Lemma 7, we have with probability at least $1 - e^{-x}$,
$$\big\{f \in \mathcal{F} : L^2 P(f - f^*)^2 \le s\big\} \subset \big\{f \in \mathcal{F} : L^2 P_n(f - f^*)^2 \le 2s\big\}.$$
Combining this with (B.14) gives, with probability at least $1 - 2e^{-x}$,
$$L^2 P_n\big(\hat{f} - f^*\big)^2 \le 2\bigg(32Kr + \frac{(48Lb + 16BK)Bx}{nT}\bigg) \le 2\bigg(32K + \frac{(48Lb + 16BK)B}{c_2}\bigg) r = cr, \quad \text{(B.15)}$$
where $c := 2\big(32K + (48Lb + 16BK)B/c_2\big)$ and in the last inequality we used the fact that $r \ge \psi(r) \ge c_2 x/nT$. Applying the triangle inequality, if (B.15) holds, then for any $f \in \mathcal{F}$, we have
$$\sqrt{L^2 P_n\big(f - \hat{f}\big)^2} \le \sqrt{L^2 P_n(f - f^*)^2} + \sqrt{L^2 P_n\big(f^* - \hat{f}\big)^2} \le \sqrt{L^2 P_n(f - f^*)^2} + \sqrt{cr}. \quad \text{(B.16)}$$
Now, applying Lemma 7 for $r \ge \psi(r)$ implies that with probability at least $1 - 3e^{-x}$,
$$\big\{f \in \mathcal{F} : L^2 P(f - f^*)^2 \le r\big\} \subset \big\{f \in \mathcal{F} : L^2 P_n(f - f^*)^2 \le 2r\big\},$$
which, coupled with (B.16), implies that with probability at least $1 - 3e^{-x}$,
$$\big\{f \in \mathcal{F} : L^2 P(f - f^*)^2 \le r\big\} \subset \Big\{f \in \mathcal{F} : L^2 P_n\big(f - \hat{f}\big)^2 \le \big(\sqrt{2} + \sqrt{c}\big)^2 r\Big\}.$$
Also, with the help of Lemma A.4 in [8], it can be shown that with probability at least $1 - e^{-x}$,
$$E \sup_{f\in\mathcal{F},\ L^2 P(f-f^*)^2 \le r} \frac{1}{nT}\sum_{t=1}^T \sum_{i=1}^n \sigma_t^i f_t(X_t^i) \le 2E_\sigma \sup_{f\in\mathcal{F},\ L^2 P(f-f^*)^2 \le r} \frac{1}{nT}\sum_{t=1}^T \sum_{i=1}^n \sigma_t^i f_t(X_t^i) + \frac{4bx}{nT}.$$
Thus, we will have with probability at least $1 - 4e^{-x}$,
$$\psi(r) \le c_1 E_\sigma \bigg[ \sup_{f\in\mathcal{F},\ L^2 P(f-f^*)^2 \le r} \frac{1}{nT}\sum_{t=1}^T \sum_{i=1}^n \sigma_t^i f_t(X_t^i) \,\bigg|\, (X_t^i)_{t\in\mathbb{N}_T,\, i\in\mathbb{N}_n} \bigg] + \frac{c_2 x}{nT}$$
$$\le c_1 E_\sigma \bigg[ \sup_{f\in\mathcal{F},\ L^2 P_n(f-\hat{f})^2 \le (\sqrt{2}+\sqrt{c})^2 r} \frac{1}{nT}\sum_{t=1}^T \sum_{i=1}^n \sigma_t^i f_t(X_t^i) \,\bigg|\, (X_t^i)_{t\in\mathbb{N}_T,\, i\in\mathbb{N}_n} \bigg] + \frac{c_2 x}{nT}$$
$$\le c_1 E_\sigma \bigg[ \sup_{f\in\mathcal{F},\ L^2 P_n(f-\hat{f})^2 \le (4+2c) r} \frac{1}{nT}\sum_{t=1}^T \sum_{i=1}^n \sigma_t^i f_t(X_t^i) \,\bigg|\, (X_t^i)_{t\in\mathbb{N}_T,\, i\in\mathbb{N}_n} \bigg] + \frac{c_2 x}{nT} \le \hat{\psi}(r).$$
Setting $r = r^*$ and applying Lemma 4.3 of [8] gives $r^* \le \hat{r}^*$, which together with (B.14) yields the result.
C Proofs of the results in Sect. 4: "Local Rademacher Complexity Bounds for MTL models with Strongly Convex Regularizers"

In the following, we provide some basic notions of convex analysis which are helpful in understanding the results of Sect. 4.

Definition 6 (Strong Convexity). A function $R : \mathcal{X} \to \mathbb{R}$ is $\mu$-strongly convex w.r.t. a norm $\|\cdot\|$ if and only if $\forall x, y \in \mathcal{X}$ and $\forall \alpha \in (0, 1)$ we have
$$R(\alpha x + (1-\alpha)y) \le \alpha R(x) + (1-\alpha)R(y) - \frac{\mu}{2}\alpha(1-\alpha)\|x - y\|^2.$$

Definition 7 (Strong Smoothness). A function $R^* : \mathcal{X} \to \mathbb{R}$ is $\frac{1}{\mu}$-strongly smooth w.r.t. a norm $\|\cdot\|_*$ if and only if $R^*$ is everywhere differentiable and $\forall x, y \in \mathcal{X}$ we have
$$R^*(x + y) \le R^*(x) + \langle \nabla R^*(x), y\rangle + \frac{1}{2\mu}\|y\|_*^2.$$

Property 1 (Theorem 3 in [26]: strong convexity/strong smoothness duality). A function $R$ is $\mu$-strongly convex w.r.t. the norm $\|\cdot\|$ if and only if its Fenchel conjugate $R^*$ is $\frac{1}{\mu}$-strongly smooth w.r.t. the dual norm $\|\cdot\|_*$. The Fenchel conjugate $R^*$ is defined as
$$R^*(w) := \sup_v \{\langle w, v\rangle - R(v)\}.$$

Property 2 (Fenchel-Young Inequality). The definition of the Fenchel conjugate implies that for any strongly convex function $R$,
$$\forall w, v \in \mathcal{S}: \quad \langle w, v\rangle \le R(w) + R^*(v).$$
Combining this with the strong smoothness duality property of $R^*$ gives
$$\langle w, v\rangle - R(w) \le R^*(v) \le R^*(0) + \langle \nabla R^*(0), v\rangle + \frac{1}{2\mu}\|v\|_*^2. \quad \text{(C.1)}$$
Lemma 8. Assume that the conditions of Theorem 6 hold. Then, for every $f \in \mathcal{F}_q$:

(a) $P f^2 \le r$ implies $\frac{1}{T}\sum_{t=1}^T \sum_{j=1}^\infty \lambda_t^j \langle w_t, u_t^j\rangle^2 \le r$.

(b) $E_{X,\sigma} \big\langle \frac{1}{n}\sum_{i=1}^n \sigma_t^i \phi(X_t^i),\, u_t^j \big\rangle^2 = \frac{\lambda_t^j}{n}$.

Proof. Part (a):
$$P f^2 = \frac{1}{T}\sum_{t=1}^T E\langle w_t, \phi(X_t^i)\rangle^2 = \frac{1}{T}\sum_{t=1}^T \big\langle w_t \otimes w_t,\ E_X\big[\phi(X_t^i) \otimes \phi(X_t^i)\big]\big\rangle = \frac{1}{T}\sum_{t=1}^T \sum_{j=1}^\infty \lambda_t^j \big\langle w_t \otimes w_t,\ u_t^j \otimes u_t^j\big\rangle = \frac{1}{T}\sum_{t=1}^T \sum_{j=1}^\infty \lambda_t^j \langle w_t, u_t^j\rangle^2 \le r.$$
Part (b):
$$E_{X,\sigma}\Big\langle \frac{1}{n}\sum_{i=1}^n \sigma_t^i \phi(X_t^i),\, u_t^j\Big\rangle^2 = \frac{1}{n^2}\sum_{i,k=1}^n E_{X,\sigma}\big[\sigma_t^i \sigma_t^k \langle\phi(X_t^i), u_t^j\rangle \langle\phi(X_t^k), u_t^j\rangle\big] \overset{\sigma_t^i\ \text{i.i.d.}}{=} \frac{1}{n^2}\, E_X \sum_{i=1}^n \langle\phi(X_t^i), u_t^j\rangle^2$$
$$= \frac{1}{n}\Big\langle E_X\big[\phi(X_t^i) \otimes \phi(X_t^i)\big],\ u_t^j \otimes u_t^j\Big\rangle = \frac{1}{n}\sum_{l=1}^\infty \lambda_t^l \big\langle u_t^l \otimes u_t^l,\ u_t^j \otimes u_t^j\big\rangle = \frac{\lambda_t^j}{n}.$$
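The cross-term cancellation used in part (b), namely that $E_\sigma[\sigma_t^i \sigma_t^k]$ vanishes unless $i = k$, can be checked exactly for a fixed sample by enumerating all sign patterns. The vectors and the test direction below are arbitrary illustrative stand-ins for the feature maps $\phi(X_t^i)$ and an eigenfunction $u_t^j$:

```python
import itertools
import math

# Exact check (conditional on a fixed sample) of the identity used in Lemma 8(b):
#   E_sigma <(1/n) sum_i sigma_i v_i, u>^2 = (1/n^2) sum_i <v_i, u>^2.
v = [(1.0, 2.0), (0.5, -1.0), (-2.0, 0.25), (3.0, 1.5)]  # arbitrary "sample"
u = (1.0 / math.sqrt(2), 1.0 / math.sqrt(2))              # arbitrary unit vector
n = len(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

lhs = 0.0
for signs in itertools.product([-1, 1], repeat=n):
    avg = tuple(sum(s * vi[d] for s, vi in zip(signs, v)) / n for d in range(2))
    lhs += dot(avg, u) ** 2
lhs /= 2 ** n  # exact expectation over the 2^n sign patterns

rhs = sum(dot(vi, u) ** 2 for vi in v) / n ** 2
assert abs(lhs - rhs) < 1e-12
```

Taking the further expectation over $X$ then yields $\lambda_t^j / n$ in the lemma; the snippet verifies only the Rademacher step.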
The following lemmas are used in the proof of the LRC bound for the L2,q -group norm regularized MTL in Corollary 12.
Lemma 9 (Khintchine-Kahane Inequality [56]). Let $H$ be an inner-product space with induced norm $\|\cdot\|_H$, let $v_1, \ldots, v_n \in H$, and let $\sigma_1, \ldots, \sigma_n$ be i.i.d. Rademacher random variables. Then, for any $p \ge 1$, we have
$$E_\sigma \Big\| \sum_{i=1}^n \sigma_i v_i \Big\|_H^p \le c^{p/2} \Big( \sum_{i=1}^n \|v_i\|_H^2 \Big)^{p/2}, \quad \text{(C.2)}$$
where $c := \max\{1, p-1\}$. The inequality also holds with $p$ in place of $c$.

Lemma 10 (Rosenthal-Young Inequality; Lemma 3 of [29]). Let the independent non-negative random variables $X_1, \ldots, X_n$ satisfy $X_i \le B < +\infty$ almost surely for all $i = 1, \ldots, n$. If $q \ge \frac{1}{2}$ and $c_q := (2qe)^q$, then it holds that
$$E\Big( \frac{1}{n}\sum_{i=1}^n X_i \Big)^q \le c_q \bigg[ \Big(\frac{B}{n}\Big)^q + \Big( \frac{1}{n}\sum_{i=1}^n E X_i \Big)^q \bigg]. \quad \text{(C.3)}$$
Proof of Lemma 2. For the group norm regularizer $\|W\|_{2,q}$, we can further bound the expectation term in (16) for $D = I$ as follows:
$$E := E_{X,\sigma}\Bigg\| \bigg( \sum_{j>h_t} \Big\langle \frac{1}{n}\sum_{i=1}^n \sigma_t^i \phi(X_t^i),\, u_t^j\Big\rangle u_t^j \bigg)_{t=1}^T \Bigg\|_{2,q^*} = E_{X,\sigma}\Bigg[ \sum_{t=1}^T \bigg\| \sum_{j>h_t} \Big\langle \frac{1}{n}\sum_{i=1}^n \sigma_t^i \phi(X_t^i),\, u_t^j\Big\rangle u_t^j \bigg\|^{q^*} \Bigg]^{1/q^*}$$
$$\overset{\text{Jensen}}{\le} E_X\Bigg[ \sum_{t=1}^T E_\sigma \bigg\| \sum_{i=1}^n \sigma_t^i \Big( \frac{1}{n}\sum_{j>h_t} \langle \phi(X_t^i), u_t^j\rangle u_t^j \Big) \bigg\|^{q^*} \Bigg]^{1/q^*} \overset{\text{(C.2)}}{\le} E_X\Bigg[ \sum_{t=1}^T \bigg( \frac{q^*}{n^2} \sum_{i=1}^n \sum_{j>h_t} \langle \phi(X_t^i), u_t^j\rangle^2 \bigg)^{q^*/2} \Bigg]^{1/q^*}$$
$$= \sqrt{\frac{q^*}{n}}\, E_X\Bigg[ \sum_{t=1}^T \bigg( \frac{1}{n}\sum_{i=1}^n \sum_{j>h_t} \langle \phi(X_t^i), u_t^j\rangle^2 \bigg)^{q^*/2} \Bigg]^{1/q^*} \overset{\text{Jensen}}{\le} \sqrt{\frac{q^*}{n}} \Bigg[ \sum_{t=1}^T E_X\bigg( \frac{1}{n}\sum_{i=1}^n \sum_{j>h_t} \langle \phi(X_t^i), u_t^j\rangle^2 \bigg)^{q^*/2} \Bigg]^{1/q^*}. \quad \text{(C.4)}$$
Note that for $q \le 2$ it holds that $q^*/2 \ge 1$. Therefore, we cannot employ Jensen's inequality to move the expectation operator inside the inner term, and instead we need to apply the Rosenthal-Young (R+Y) inequality (Lemma 10), which yields
$$E \overset{\text{R+Y}}{\le} \sqrt{\frac{q^*}{n}} \Bigg[ \sum_{t=1}^T (eq^*)^{q^*/2} \bigg( \Big(\frac{K}{n}\Big)^{q^*/2} + \Big( \frac{1}{n}\sum_{i=1}^n \sum_{j>h_t} E_X\langle \phi(X_t^i), u_t^j\rangle^2 \Big)^{q^*/2} \bigg) \Bigg]^{1/q^*} = \sqrt{\frac{q^*}{n}} \Bigg[ \sum_{t=1}^T (eq^*)^{q^*/2} \bigg( \Big(\frac{K}{n}\Big)^{q^*/2} + \Big( \sum_{j>h_t} \lambda_t^j \Big)^{q^*/2} \bigg) \Bigg]^{1/q^*}. \quad \text{(C.5)}$$
The last quantity can be further bounded using the sub-additivity of $x \mapsto \sqrt{x}$ and $x \mapsto x^{1/q^*}$, respectively, in $(\dagger)$ and $(\dagger\dagger)$ below:
$$E \overset{(\dagger)}{\le} \sqrt{\frac{eq^{*2}}{n}} \Bigg[ \sum_{t=1}^T \bigg( \Big(\frac{K}{n}\Big)^{q^*/2} + \Big( \sum_{j>h_t} \lambda_t^j \Big)^{q^*/2} \bigg) \Bigg]^{1/q^*} \overset{(\dagger\dagger)}{\le} \sqrt{\frac{eq^{*2}}{n}} \Bigg[ \sqrt{\frac{K}{n}}\, T^{1/q^*} + \bigg\| \Big( \sum_{j>h_t} \lambda_t^j \Big)_{t=1}^T \bigg\|_{q^*/2}^{1/2} \Bigg] = \frac{\sqrt{Ke}\, q^* T^{1/q^*}}{n} + \sqrt{\frac{eq^{*2}}{n} \bigg\| \Big( \sum_{j>h_t} \lambda_t^j \Big)_{t=1}^T \bigg\|_{q^*/2}}. \quad \text{(C.6)}$$
Proof of Corollary 12. Substituting the result of Lemma 2 into (18) gives
$$A_2(\mathcal{F}_q) \le \sqrt{\frac{2eq^{*2} R_{\max}^2}{nT^2} \bigg\| \Big( \sum_{j>h_t} \lambda_t^j \Big)_{t=1}^T \bigg\|_{q^*/2}} + \frac{\sqrt{2Ke}\, R_{\max}\, q^* T^{1/q^*}}{nT}. \quad \text{(C.7)}$$
Now, combining (15) and (C.7) provides the bound on $R(\mathcal{F}_q, r)$ as
$$R(\mathcal{F}_q, r) \le \sqrt{\frac{r \sum_{t=1}^T h_t}{nT}} + \sqrt{\frac{2eq^{*2} R_{\max}^2}{nT^2} \bigg\| \Big( \sum_{j>h_t} \lambda_t^j \Big)_{t=1}^T \bigg\|_{q^*/2}} + \frac{\sqrt{2Ke}\, R_{\max}\, q^* T^{1/q^*}}{nT}$$
$$\overset{(\star)}{\le} \sqrt{\frac{2}{nT} \bigg( r\sum_{t=1}^T h_t + \frac{2eq^{*2} R_{\max}^2}{T} \bigg\| \Big( \sum_{j>h_t} \lambda_t^j \Big)_{t=1}^T \bigg\|_{q^*/2} \bigg)} + \frac{\sqrt{2Ke}\, R_{\max}\, q^* T^{1/q^*}}{nT}$$
$$\overset{(\star\star)}{\le} \sqrt{\frac{2}{nT} \bigg( r T^{1-\frac{2}{q^*}} \big\| (h_t)_{t=1}^T \big\|_{q^*/2} + \frac{2eq^{*2} R_{\max}^2}{T} \bigg\| \Big( \sum_{j>h_t} \lambda_t^j \Big)_{t=1}^T \bigg\|_{q^*/2} \bigg)} + \frac{\sqrt{2Ke}\, R_{\max}\, q^* T^{1/q^*}}{nT}$$
$$\overset{(\star\star\star)}{\le} \sqrt{\frac{4}{nT} \bigg\| \Big( r T^{1-\frac{2}{q^*}} h_t + \frac{2eq^{*2} R_{\max}^2}{T} \sum_{j>h_t} \lambda_t^j \Big)_{t=1}^T \bigg\|_{q^*/2}} + \frac{\sqrt{2Ke}\, R_{\max}\, q^* T^{1/q^*}}{nT}, \quad \text{(C.8)}$$
where in $(\star)$, $(\star\star)$ and $(\star\star\star)$ we applied the following inequalities, respectively, according to which, for all non-negative numbers $\alpha_1, \alpha_2$ and non-negative vectors $a_1, a_2 \in \mathbb{R}^T$, with $0 \le q \le p \le \infty$ and $s \ge 1$, it holds that
$$(\star)\quad \sqrt{\alpha_1} + \sqrt{\alpha_2} \le \sqrt{2(\alpha_1 + \alpha_2)},$$
$$(\star\star)\quad \ell_p\text{-to-}\ell_q:\ \|a_1\|_q = \langle 1, a_1^q\rangle^{1/q} \overset{\text{H\"older}}{\le} \big( \|1\|_{(p/q)^*} \|a_1^q\|_{p/q} \big)^{1/q} = T^{\frac{1}{q} - \frac{1}{p}} \|a_1\|_p,$$
$$(\star\star\star)\quad \|a_1\|_s + \|a_2\|_s \le 2^{1-\frac{1}{s}}\|a_1 + a_2\|_s \le 2\|a_1 + a_2\|_s.$$
Since inequality $(\star\star\star)$ holds for all non-negative $h_t$, it follows that
$$R(\mathcal{F}_q, r) \le \sqrt{\frac{4}{nT} \min_{h_t \ge 0} \bigg\| \Big( r T^{1-\frac{2}{q^*}} h_t + \frac{2eq^{*2} R_{\max}^2}{T} \sum_{j>h_t} \lambda_t^j \Big)_{t=1}^T \bigg\|_{q^*/2}} + \frac{\sqrt{2Ke}\, R_{\max}\, q^* T^{1/q^*}}{nT}$$
$$\le \sqrt{\frac{4}{nT} \bigg\| \Big( \sum_{j=1}^\infty \min\Big( r T^{1-\frac{2}{q^*}},\ \frac{2eq^{*2} R_{\max}^2}{T}\, \lambda_t^j \Big) \Big)_{t=1}^T \bigg\|_{q^*/2}} + \frac{\sqrt{2Ke}\, R_{\max}\, q^* T^{1/q^*}}{nT}.$$
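The last step uses the standard tail-sum trick: for a non-negative nonincreasing sequence $(\lambda^j)$ and any threshold $\rho \ge 0$, $\min_{h \ge 0}\big(\rho h + \sum_{j>h} \lambda^j\big) = \sum_j \min(\rho, \lambda^j)$. A brute-force check on an arbitrary illustrative sequence:

```python
# Brute-force check of  min_{h >= 0} ( rho*h + sum_{j > h} lam_j ) = sum_j min(rho, lam_j)
# for a nonincreasing non-negative sequence; values below are arbitrary.
lam = [2.0, 1.0, 0.5, 0.25, 0.125, 0.0625]  # nonincreasing "eigenvalues"
rho = 0.4

best = min(rho * h + sum(lam[h:]) for h in range(len(lam) + 1))
closed_form = sum(min(rho, l) for l in lam)
assert abs(best - closed_form) < 1e-12
```

The minimizer is $h^* = \#\{j : \lambda^j > \rho\}$, which is exactly how the truncation levels $h_t$ are chosen in the bounds above.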
Proof of Theorem 16.
$$R(\mathcal{F}_{q,R_{\max},T},\, r) = E_{X,\sigma} \sup_{P f^2 \le r,\ \|W\|_{2,q}^2 \le 2R_{\max}^2} \frac{1}{T}\sum_{t=1}^T \Big\langle w_t,\, \frac{1}{n}\sum_{i=1}^n \sigma_t^i \phi(X_t^i)\Big\rangle$$
$$= E_{X,\sigma} \sup_{\frac{1}{T}\sum_{t=1}^T E\langle w_t, \phi(X_t)\rangle^2 \le r,\ \|W\|_{2,q}^2 \le 2R_{\max}^2} \frac{1}{T}\sum_{t=1}^T \Big\langle w_t,\, \frac{1}{n}\sum_{i=1}^n \sigma_t^i \phi(X_t^i)\Big\rangle$$
$$\ge E_{X,\sigma} \sup_{\substack{\forall t:\ E_X\langle w_t, \phi(X_t)\rangle^2 \le r,\ \|W\|_{2,q}^2 \le 2R_{\max}^2,\\ \|w_1\|_2 = \cdots = \|w_T\|_2}} \frac{1}{T}\sum_{t=1}^T \Big\langle w_t,\, \frac{1}{n}\sum_{i=1}^n \sigma_t^i \phi(X_t^i)\Big\rangle$$
$$= E_{X,\sigma} \sup_{\forall t:\ E_X\langle w_t, \phi(X_t)\rangle^2 \le r,\ \forall t:\ \|w_t\|_2^2 \le 2R_{\max}^2 T^{-\frac{2}{q}}} \frac{1}{T}\sum_{t=1}^T \Big\langle w_t,\, \frac{1}{n}\sum_{i=1}^n \sigma_t^i \phi(X_t^i)\Big\rangle$$
$$= \frac{1}{T}\sum_{t=1}^T E_{X,\sigma} \sup_{E_X\langle w_t, \phi(X_t)\rangle^2 \le r,\ \|w_t\|_2^2 \le 2R_{\max}^2 T^{-\frac{2}{q}}} \Big\langle w_t,\, \frac{1}{n}\sum_{i=1}^n \sigma_t^i \phi(X_t^i)\Big\rangle$$
$$= E_{X,\sigma} \sup_{E_X\langle w_1, \phi(X_1)\rangle^2 \le r,\ \|w_1\|_2^2 \le 2R_{\max}^2 T^{-\frac{2}{q}}} \Big\langle w_1,\, \frac{1}{n}\sum_{i=1}^n \sigma_1^i \phi(X_1^i)\Big\rangle = R\big(\mathcal{F}_{1,\, R_{\max} T^{-1/q},\, 1},\, r\big).$$
According to [50], it can be shown that there is a constant $c$ such that if $\lambda_1^1 \ge \frac{1}{nR_{\max}^2}$, then for all $r \ge \frac{1}{n}$ it holds that
$$R\big(\mathcal{F}_{1,\, R_{\max} T^{-1/q},\, 1},\, r\big) \ge \sqrt{\frac{c}{n}\sum_{j=1}^\infty \min\Big(r,\ R_{\max}^2 T^{-\frac{2}{q}} \lambda_1^j\Big)},$$
which with some algebraic manipulation gives the desired result.
The following lemma is used in the proof of the LRC bounds for the $L_{S_q}$-Schatten norm regularized MTL in Corollary 18.

Lemma 11 (Non-commutative Khintchine's inequality [40]). Let $Q_1, \ldots, Q_n$ be a set of arbitrary $m \times n$ matrices, and let $\sigma_1, \ldots, \sigma_n$ be a sequence of independent Rademacher random variables. Then for all $p \ge 2$,
$$\bigg( E_\sigma \Big\| \sum_{i=1}^n \sigma_i Q_i \Big\|_{S_p}^p \bigg)^{1/p} \le p^{1/2} \max\Bigg\{ \bigg\| \Big( \sum_{i=1}^n Q_i Q_i^T \Big)^{1/2} \bigg\|_{S_p},\ \bigg\| \Big( \sum_{i=1}^n Q_i^T Q_i \Big)^{1/2} \bigg\|_{S_p} \Bigg\}. \quad \text{(C.9)}$$
Proof of Corollary 18. In order to find an LRC bound for the $L_{S_q}$-Schatten norm regularized hypothesis space (30), one just needs to bound the expectation term in (12). Define $U_t^i$ as a matrix with $T$ columns whose only non-zero column, the $t$-th one, equals $\frac{1}{n}\sum_{j>h_t} \langle \phi(X_t^i), u_t^j\rangle u_t^j$. Also, note that for the Schatten norm regularized hypothesis space (30), it holds that $D = I$. Therefore, we will have
$$E_{X,\sigma}\big\| D^{-1/2} V \big\|_{S_{q^*}} = E_{X,\sigma}\Bigg\| \bigg( \sum_{j>h_t} \Big\langle \frac{1}{n}\sum_{i=1}^n \sigma_t^i \phi(X_t^i),\, u_t^j\Big\rangle u_t^j \bigg)_{t=1}^T \Bigg\|_{S_{q^*}} = E_{X,\sigma}\Bigg\| \sum_{t=1}^T \sum_{i=1}^n \sigma_t^i U_t^i \Bigg\|_{S_{q^*}}$$
$$\overset{\text{Jensen}}{\le} E_X \Bigg[ E_\sigma \Big\| \sum_{t=1}^T \sum_{i=1}^n \sigma_t^i U_t^i \Big\|_{S_{q^*}}^{q^*} \Bigg]^{1/q^*} \overset{\text{(C.9)}}{\le} \sqrt{q^*}\, E_X \max\Bigg\{ \bigg\| \Big( \sum_{t=1}^T \sum_{i=1}^n U_t^i (U_t^i)^T \Big)^{1/2} \bigg\|_{S_{q^*}},\ \bigg\| \Big( \sum_{t=1}^T \sum_{i=1}^n (U_t^i)^T U_t^i \Big)^{1/2} \bigg\|_{S_{q^*}} \Bigg\}$$
$$\overset{(\dagger\dagger\dagger)}{=} \sqrt{q^*}\, E_X \bigg\| \Big( \sum_{t=1}^T \sum_{i=1}^n U_t^i (U_t^i)^T \Big)^{1/2} \bigg\|_{S_{q^*}} = \sqrt{q^*}\, E_X \Bigg[ \mathrm{tr}\bigg( \sum_{t=1}^T \sum_{i=1}^n U_t^i (U_t^i)^T \bigg)^{q^*/2} \Bigg]^{1/q^*}$$
$$\le \sqrt{q^*}\, E_X \Bigg[ \sum_{t=1}^T \sum_{i=1}^n \bigg\| \frac{1}{n}\sum_{j>h_t} \langle \phi(X_t^i), u_t^j\rangle u_t^j \bigg\|^2 \Bigg]^{1/2} = \frac{\sqrt{q^*}}{n}\, E_X \Bigg[ \sum_{t=1}^T \sum_{i=1}^n \sum_{j>h_t} \langle \phi(X_t^i), u_t^j\rangle^2 \Bigg]^{1/2}$$
$$\overset{\text{Jensen}}{\le} \frac{\sqrt{q^*}}{n} \Bigg[ n \sum_{t=1}^T \sum_{j>h_t} \lambda_t^j \Bigg]^{1/2} = \sqrt{\frac{q^*}{n} \bigg\| \Big( \sum_{j>h_t} \lambda_t^j \Big)_{t=1}^T \bigg\|_1}, \quad \text{(C.10)}$$
where in $(\dagger\dagger\dagger)$ we assumed that the first term in the max argument is the larger one.
Proof of Corollary 21. Similar to the proof of Corollary 18, for the graph regularized hypothesis space (32), one can bound the expectation term in (12) as
$$E_{X,\sigma}\big\| D^{-1/2} V \big\| = E_{X,\sigma}\big[ \mathrm{tr}\big(V^T D^{-1} V\big) \big]^{1/2}$$
$$\overset{\text{Jensen}}{\le} E_X \Bigg[ \sum_{t,s=1}^{T} \big(D^{-1}\big)_{st} \frac{1}{n^2} \sum_{i,l=1}^{n} \sum_{j>h_t} \sum_{k>h_s} E_\sigma\big[\sigma_t^i \sigma_s^l\big]\, \langle \phi(X_t^i), u_t^j\rangle \langle \phi(X_s^l), u_s^k\rangle \langle u_t^j, u_s^k\rangle \Bigg]^{1/2}$$
$$= E_X \Bigg[ \frac{1}{n}\sum_{t=1}^T \big(D^{-1}\big)_{tt}\, \frac{1}{n}\sum_{i=1}^n \sum_{j>h_t} \langle \phi(X_t^i), u_t^j\rangle^2 \Bigg]^{1/2} \overset{\text{Jensen}}{\le} \Bigg[ \frac{1}{n}\sum_{t=1}^T \big(D^{-1}\big)_{tt} \sum_{j>h_t} \lambda_t^j \Bigg]^{1/2} = \sqrt{\frac{1}{n} \bigg\| \Big( D_{tt}^{-1} \sum_{j>h_t} \lambda_t^j \Big)_{t=1}^T \bigg\|_1}, \quad \text{(C.11)}$$
where we used $E_\sigma[\sigma_t^i \sigma_s^l] = 1$ if $(t, i) = (s, l)$ and $0$ otherwise.

D Proofs of the results in Sect. 5: "Excess Risk Bounds for MTL models with Strongly Convex Regularizers"
Proof of Corollary 25. Assume that $(\hat{u}_t^j)_{j\ge1}$ is an orthonormal basis of $\mathcal{H}_K$ associated with the matrix $K_t$. First notice that $\hat{R}(\mathcal{F}_q^*, c_3 r) \le 2\hat{R}\big(\mathcal{F}_q, \frac{c_3 r}{4L^2}\big)$. Then, similar to the proof of Theorem 8, it can be shown that
$$\hat{R}(\mathcal{F}_q, r) \le \frac{1}{T}\, E_\sigma \sup_{P_n f^2 \le r} \sum_{t=1}^T \Bigg( \sum_{j=1}^{\hat{h}_t} \big(\hat{\lambda}_t^j\big)^{-1} \Big\langle \frac{1}{n}\sum_{i=1}^n \sigma_t^i \hat{\phi}(X_t^i),\, \hat{u}_t^j\Big\rangle^2 \Bigg)^{1/2} \Bigg( \sum_{j=1}^{\hat{h}_t} \hat{\lambda}_t^j \langle w_t, \hat{u}_t^j\rangle^2 \Bigg)^{1/2} + \frac{\sqrt{2}R}{T}\, E_\sigma \big\| D^{-1/2} \hat{V} \big\|_{2,q^*}$$
$$\le \sqrt{\frac{r \sum_{t=1}^T \hat{h}_t}{nT}} + \frac{\sqrt{2}R}{T}\, E_\sigma \Bigg\| \bigg( \sum_{j>\hat{h}_t} \Big\langle \frac{1}{n}\sum_{i=1}^n \sigma_t^i \hat{\phi}(X_t^i),\, \hat{u}_t^j\Big\rangle \hat{u}_t^j \bigg)_{t=1}^T \Bigg\|_{2,q^*},$$
where the last inequality is obtained by replacing $\hat{V} = \big( \sum_{j>\hat{h}_t} \langle \frac{1}{n}\sum_{i=1}^n \sigma_t^i \hat{\phi}(X_t^i), \hat{u}_t^j\rangle \hat{u}_t^j \big)_{t=1}^T$ and $D = I$, and regarding the facts that $E_\sigma \big\langle \frac{1}{n}\sum_{i=1}^n \sigma_t^i \hat{\phi}(X_t^i), \hat{u}_t^j\big\rangle^2 = \frac{\hat{\lambda}_t^j}{n}$ and that $P_n f^2 \le r$ implies $\frac{1}{T}\sum_{t=1}^T \sum_{j=1}^\infty \hat{\lambda}_t^j \langle w_t, \hat{u}_t^j\rangle^2 \le r$. Now, similar to the proof of Lemma 2, it can be shown that
$$E_\sigma \Bigg\| \bigg( \sum_{j>\hat{h}_t} \Big\langle \frac{1}{n}\sum_{i=1}^n \sigma_t^i \hat{\phi}(X_t^i),\, \hat{u}_t^j\Big\rangle \hat{u}_t^j \bigg)_{t=1}^T \Bigg\|_{2,q^*} \le \sqrt{\frac{q^{*2}}{n} \bigg\| \Big( \sum_{j>\hat{h}_t} \hat{\lambda}_t^j \Big)_{t=1}^T \bigg\|_{q^*/2}}.$$
Note that, for the empirical LRC, the expectation is taken only with respect to the Rademacher variables $(\sigma_t^i)_{(t,i)=(1,1)}^{(T,n)}$. Therefore, we get
$$\hat{R}\Big(\mathcal{F}_q, \frac{c_3 r}{4L^2}\Big) \le \sqrt{\frac{c_3 r \sum_{t=1}^T \hat{h}_t}{4nT L^2}} + \sqrt{\frac{2q^{*2} R_{\max}^2}{nT^2} \bigg\| \Big( \sum_{j>\hat{h}_t} \hat{\lambda}_t^j \Big)_{t=1}^T \bigg\|_{q^*/2}},$$
which implies
$$\hat{\psi}_n(r) \le 2c_1 \Bigg[ \sqrt{\frac{c_3 r \sum_{t=1}^T \hat{h}_t}{4nT L^2}} + \sqrt{\frac{2q^{*2} R_{\max}^2}{nT^2} \bigg\| \Big( \sum_{j>\hat{h}_t} \hat{\lambda}_t^j \Big)_{t=1}^T \bigg\|_{q^*/2}} \Bigg] + \frac{c_2 x}{nT} = \sqrt{\frac{c_1^2 c_3\, r \sum_{t=1}^T \hat{h}_t}{nT L^2}} + \sqrt{\frac{8c_1^2 q^{*2} R_{\max}^2}{nT^2} \bigg\| \Big( \sum_{j>\hat{h}_t} \hat{\lambda}_t^j \Big)_{t=1}^T \bigg\|_{q^*/2}} + \frac{c_2 x}{nT}.$$
Denote the right-hand side by $\hat{\psi}_n^{ub}(r)$. Solving the fixed-point equation $\hat{\psi}_n^{ub}(r) = \sqrt{\alpha r} + \gamma = r$ for
$$\alpha = \frac{c_1^2 c_3 \sum_{t=1}^T \hat{h}_t}{nT L^2}, \qquad \gamma = \sqrt{\frac{8c_1^2 q^{*2} R_{\max}^2}{nT^2} \bigg\| \Big( \sum_{j>\hat{h}_t} \hat{\lambda}_t^j \Big)_{t=1}^T \bigg\|_{q^*/2}} + \frac{c_2 x}{nT} \quad \text{(D.1)}$$
gives $\hat{r}^* \le \alpha + 2\gamma$. Substituting $\alpha$ and $\gamma$ completes the proof.
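The bound $\hat{r}^* \le \alpha + 2\gamma$ for the positive solution of $\sqrt{\alpha r} + \gamma = r$ can be verified numerically by solving the underlying quadratic; the values of $\alpha$ and $\gamma$ below are arbitrary:

```python
import math

# The positive solution of  sqrt(alpha*r) + gamma = r  (alpha, gamma > 0) is
#   r = s^2 with s = (sqrt(alpha) + sqrt(alpha + 4*gamma)) / 2,
# and it satisfies r <= alpha + 2*gamma, as used at the end of the proof.
for alpha, gamma in [(0.5, 0.1), (3.0, 2.0), (1e-3, 7.0), (10.0, 1e-4)]:
    s = (math.sqrt(alpha) + math.sqrt(alpha + 4 * gamma)) / 2
    r = s * s
    assert abs(math.sqrt(alpha * r) + gamma - r) < 1e-9  # it solves the equation
    assert r <= alpha + 2 * gamma + 1e-12                # claimed upper bound
```

The inequality follows from $\sqrt{\alpha^2 + 4\alpha\gamma} \le \alpha + 2\gamma$, since $r = \frac{1}{2}\big(\alpha + 2\gamma + \sqrt{\alpha^2 + 4\alpha\gamma}\big)$.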
E Proofs of the results in Sect. 6: "Discussion"

Proof of Theorem 27. Note that, regarding the definition of $A_2$ in (14), the global Rademacher complexity for each case can be obtained by replacing the tail sum $\sum_{j>h_t} \lambda_t^j$ in the bound of its corresponding $A_2(\mathcal{F})$ by $\sum_{j=1}^\infty \lambda_t^j = \mathrm{tr}(J_t)$. Indeed, similar to the proof of Lemma 2, it can be shown that for the group norm with $q \in [1, 2]$,
$$R(\mathcal{F}_q) = E_{X,\sigma} \sup_{f=(f_1,\ldots,f_T)\in\mathcal{F}_q} \frac{1}{nT}\sum_{t=1}^T \sum_{i=1}^n \sigma_t^i f_t(X_t^i) \le \frac{\sqrt{2}R}{T}\, E_{X,\sigma} \Bigg\| \bigg( \frac{1}{n}\sum_{i=1}^n \sigma_t^i \phi(X_t^i) \bigg)_{t=1}^T \Bigg\|_{2,q^*}.$$
Also, one can verify the following:
$$E_{X,\sigma} \Bigg\| \bigg( \frac{1}{n}\sum_{i=1}^n \sigma_t^i \phi(X_t^i) \bigg)_{t=1}^T \Bigg\|_{2,q^*} = E_{X,\sigma} \Bigg\| \bigg( \sum_{j=1}^\infty \Big\langle \frac{1}{n}\sum_{i=1}^n \sigma_t^i \phi(X_t^i),\, u_t^j\Big\rangle u_t^j \bigg)_{t=1}^T \Bigg\|_{2,q^*}$$
$$\le \sqrt{\frac{eq^{*2}}{n} \bigg\| \Big( \sum_{j=1}^\infty \lambda_t^j \Big)_{t=1}^T \bigg\|_{q^*/2} + \frac{Keq^{*2} T^{2/q^*}}{n^2}} \le \sqrt{\frac{eq^{*2}}{n} \big\| \big(\mathrm{tr}(J_t)\big)_{t=1}^T \big\|_{q^*/2}} + \frac{\sqrt{Ke}\, q^* T^{1/q^*}}{n}, \quad \text{(E.1)}$$
where the inequality is obtained in a similar way as in Lemma 2. The GRC bounds for the other cases can be easily derived in a very similar manner.
References [1] Qi An, Chunping Wang, Ivo Shterev, Eric Wang, Lawrence Carin, and David B Dunson. Hierarchical kernel stick-breaking process for multi-task image analysis. In Proceedings of the 25th international conference on Machine learning, pages 17–24. ACM, 2008. [2] Rie Kubota Ando and Tong Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. The Journal of Machine Learning Research, 6:1817–1853, 2005. [3] Andreas Argyriou, St´ephan Cl´emenc¸on, and Ruocong Zhang. Learning the graph of relations among multiple tasks. ICML workshop on New Learning Frameworks and Models for Big Data, 2014. [4] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008. [5] Andreas Argyriou, Andreas Maurer, and Massimiliano Pontil. An algorithm for transfer learning in a heterogeneous environment. In Machine Learning and Knowledge Discovery in Databases, pages 71–85. Springer, 2008. [6] Andreas Argyriou, Massimiliano Pontil, Yiming Ying, and Charles A Micchelli. A spectral regularization framework for multi-task structure learning. In Advances in neural information processing systems, pages 25–32, 2007. [7] Peter L Bartlett, St´ephane Boucheron, and G´abor Lugosi. Model selection and error estimation. Machine Learning, 48(1-3):85–113, 2002. [8] Peter L Bartlett, Olivier Bousquet, and Shahar Mendelson. Local rademacher complexities. Annals of Statistics, pages 1497–1537, 2005. [9] Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002. [10] Peter L Bartlett, Shahar Mendelson, and Petra Philips. Local complexities for empirical risk minimization. In International Conference on Computational Learning Theory, pages 270–284. Springer, 2004. [11] Jonathan Baxter. A model of inductive bias learning. 
Journal of Artificial Intelligence Research, 12(149-198):3, 2000. [12] Shai Ben-David and Reba Schuller Borbely. A notion of task relatedness yielding provable multiple-task learning guarantees. Machine learning, 73(3):273–287, 2008. [13] Shai Ben-David and Reba Schuller. Exploiting task relatedness for multiple task learning. In Learning Theory and Kernel Machines, pages 567–580. Springer, 2003. [14] Steffen Bickel, Jasmina Bogojeska, Thomas Lengauer, and Tobias Scheffer. Multi-task learning for hiv therapy screening. In Proceedings of the 25th international conference on Machine learning, pages 56–63. ACM, 2008. [15] St´ephane Boucheron, G´abor Lugosi, and Pascal Massart. Concentration inequalities using the entropy method. Annals of Probability, pages 1583–1614, 2003. [16] St´ephane Boucheron, G´abor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013. [17] Olivier Bousquet. Concentration inequalities and empirical processes theory applied to the analysis of learning algorithms. PhD thesis, Ecole Polytechnique, Paris, 2002. [18] Bin Cao, Nathan N Liu, and Qiang Yang. Transfer learning for collective link prediction in multiple heterogenous domains. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 159–166, 2010. [19] Rich Caruana. Multitask learning. Machine learning, 28(1):41–75, 1997. 40
[20] Corinna Cortes, Marius Kloft, and Mehryar Mohri. Learning kernels using local Rademacher complexity. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2760–2768. Curran Associates, Inc., 2013. Available from: http://papers.nips.cc/paper/4896-learning-kernels-using-local-rademacher-complexity.pdf.
[21] Corinna Cortes and Mehryar Mohri. Domain adaptation in regression. In International Conference on Algorithmic Learning Theory, pages 308–323. Springer, 2011.
[22] Corinna Cortes and Mehryar Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519:103–126, 2014. Available from: http://www.sciencedirect.com/science/article/pii/S0304397513007184, doi:10.1016/j.tcs.2013.09.027.
[23] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Multi-task feature learning. In Advances in Neural Information Processing Systems 19, page 41, 2007.
[24] Theodoros Evgeniou, Charles A. Micchelli, and Massimiliano Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6(Apr):615–637, 2005.
[25] Theodoros Evgeniou and Massimiliano Pontil. Regularized multi-task learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 109–117. ACM, 2004.
[26] Sham M. Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. Regularization techniques for learning with matrices. Journal of Machine Learning Research, 13(1):1865–1890, 2012.
[27] Zhuoliang Kang, Kristen Grauman, and Fei Sha. Learning with whom to share in multi-task feature learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 521–528, 2011.
[28] Marius Kloft and Gilles Blanchard. The local Rademacher complexity of ℓp-norm multiple kernel learning. In J. Shawe-Taylor, R.S. Zemel, P.L. Bartlett, F. Pereira, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2438–2446. Curran Associates, Inc., 2011. Available from: http://papers.nips.cc/paper/4259-the-local-rademacher-complexity-of-lp-norm-multiple-k
[29] Marius Kloft and Gilles Blanchard. On the convergence rate of ℓp-norm multiple kernel learning. Journal of Machine Learning Research, 13(1):2465–2502, 2012.
[30] V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30(1):1–50, 2002. Available from: http://dx.doi.org/10.1214/aos/1015362183, doi:10.1214/aos/1015362183.
[31] Vladimir Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47(5):1902–1914, 2001.
[32] Vladimir Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. Annals of Statistics, 34(6):2593–2656, 2006. Available from: http://dx.doi.org/10.1214/009053606000001019, doi:10.1214/009053606000001019.
[33] Vladimir Koltchinskii. Rademacher complexities and bounding the excess risk in active learning. Journal of Machine Learning Research, 11:2457–2485, December 2010. Available from: http://dl.acm.org/citation.cfm?id=1756006.1953014.
[34] Vladimir Koltchinskii and Dmitriy Panchenko. Rademacher processes and bounding the risk of function learning. In High Dimensional Probability II, pages 443–457. Springer, 2000.
[35] Abhishek Kumar and Hal Daumé III. Learning task grouping and overlap in multi-task learning. arXiv preprint arXiv:1206.6417, 2012.
[36] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer Science & Business Media, 2013.
[37] Yunwen Lei, Lixin Ding, and Yingzhou Bi. Local Rademacher complexity bounds based on covering numbers. arXiv:1510.01463 [cs.AI], 2015.
[38] Cong Li, Michael Georgiopoulos, and Georgios C. Anagnostopoulos. Multitask classification hypothesis space with improved generalization bounds. IEEE Transactions on Neural Networks and Learning Systems, 26(7):1468–1479, 2015.
[39] K. Lounici, M. Pontil, A.B. Tsybakov, and S.A. Van De Geer. Taking advantage of sparsity in multi-task learning. In COLT 2009 - The 22nd Conference on Learning Theory, 2009.
[40] F. Lust-Piquard. Khintchine inequalities in Cp (1 < p < ∞). Comptes Rendus de l'Académie des Sciences, Série I: Mathématique, 303(7):289–292, 1986.
[41] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT 2009). Omnipress, June 2009.
[42] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation with multiple sources. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1041–1048. Curran Associates, Inc., 2009. Available from: http://papers.nips.cc/paper/3550-domain-adaptation-with-multiple-sources.pdf.
[43] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Multiple source adaptation and the Rényi divergence. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI '09, pages 367–374, Arlington, Virginia, United States, 2009. AUAI Press. Available from: http://dl.acm.org/citation.cfm?id=1795114.1795157.
[44] Yishay Mansour and Mariano Schain. Robust domain adaptation. Annals of Mathematics and Artificial Intelligence, 71(4):365–380, 2013. Available from: http://dx.doi.org/10.1007/s10472-013-9391-5, doi:10.1007/s10472-013-9391-5.
[45] Andreas Maurer. Bounds for linear multi-task learning. Journal of Machine Learning Research, 7:117–139, 2006.
[46] Andreas Maurer. The Rademacher complexity of linear transformation classes. In Learning Theory, pages 65–78. Springer, 2006.
[47] Andreas Maurer. A chain rule for the expected suprema of Gaussian processes. In Algorithmic Learning Theory: 25th International Conference, ALT 2014, Bled, Slovenia, October 8-10, 2014, Proceedings, pages 245–259. Springer International Publishing, Cham, 2014. Available from: http://dx.doi.org/10.1007/978-3-319-11662-4_18, doi:10.1007/978-3-319-11662-4_18.
[48] Andreas Maurer and Massimiliano Pontil. Excess risk bounds for multitask learning with trace norm regularization. In Conference on Learning Theory, volume 30, pages 55–76, 2013.
[49] Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes. The benefit of multitask representation learning. Journal of Machine Learning Research, 17(81):1–32, 2016.
[50] Shahar Mendelson. On the performance of kernel classes. Journal of Machine Learning Research, 4:759–771, 2003.
[51] Charles A. Micchelli and Massimiliano Pontil. Kernels for multi-task learning. In Advances in Neural Information Processing Systems, pages 921–928, 2004.
[52] Luca Oneto, Alessandro Ghio, Sandro Ridella, and Davide Anguita. Local Rademacher complexity: Sharper risk bounds with and without unlabeled samples. Neural Networks, 65:115–125, 2015. Available from: http://www.sciencedirect.com/science/article/pii/S0893608015000404, doi:10.1016/j.neunet.2015.02.006.
[53] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
[54] Anastasia Pentina and Shai Ben-David. Multi-task and lifelong learning of kernels. In Algorithmic Learning Theory, pages 194–208. Springer, 2015.
[55] Anastasia Pentina and Christoph H. Lampert. Lifelong learning with non-i.i.d. tasks. In Advances in Neural Information Processing Systems, pages 1540–1548, 2015.
[56] G. Peshkir and Albert Nikolaevich Shiryaev. The Khintchine inequalities and martingale expanding sphere of their action. Russian Mathematical Surveys, 50(5):849–904, 1995.
[57] Ting Kei Pong, Paul Tseng, Shuiwang Ji, and Jieping Ye. Trace norm regularization: Reformulations, algorithms, and multi-task learning. SIAM Journal on Optimization, 20(6):3465–3489, 2010.
[58] Bernardino Romera-Paredes, Andreas Argyriou, Nadia Berthouze, and Massimiliano Pontil. Exploiting unrelated tasks in multi-task learning. In International Conference on Artificial Intelligence and Statistics, pages 951–959, 2012.
[59] Sebastian Thrun. Learning to learn: Introduction. In Learning to Learn, 1996.
[60] I. Tolstikhin, G. Blanchard, and M. Kloft. Localized complexities for transductive learning. In Proceedings of the 27th Conference on Learning Theory, volume 35, pages 857–884. JMLR, 2014. Available from: http://jmlr.org/proceedings/papers/v35/tolstikhin14.pdf.
[61] Sara Van De Geer. A new approach to least-squares estimation, with applications. The Annals of Statistics, pages 587–602, 1987.
[62] Aad W. Van Der Vaart and Jon A. Wellner. Weak convergence. In Weak Convergence and Empirical Processes, pages 16–28. Springer, 1996.
[63] Christian Widmer, Marius Kloft, and Gunnar Rätsch. Multi-task learning for computational biology: Overview and outlook. In Empirical Inference, pages 117–127. Springer, 2013.
[64] Qian Xu, Sinno Jialin Pan, Hannah Hong Xue, and Qiang Yang. Multitask learning for protein subcellular location prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 8(3):748–759, 2011.
[65] Niloofar Yousefi, Michael Georgiopoulos, and Georgios C. Anagnostopoulos. Multi-task learning with group-specific feature space sharing. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 120–136. Springer, 2015.
[66] Chao Zhang, Lei Zhang, and Jieping Ye. Generalization bounds for domain adaptation. In F. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 3320–3328. Curran Associates, Inc., 2012. Available from: http://papers.nips.cc/paper/4684-generalization-bounds-for-domain-adaptation.pdf.
[67] Yu Zhang and Dit-Yan Yeung. Multi-task warped Gaussian process for personalized age estimation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2622–2629. IEEE, 2010.