Consistency of Regularized Learning Schemes in Banach Spaces*

Patrick L. Combettes¹, Saverio Salzo², and Silvia Villa³

arXiv:1410.6847v2 [math.ST] 19 Nov 2014

¹ Sorbonne Universités – UPMC Univ. Paris 06, UMR 7598, Laboratoire Jacques-Louis Lions, F-75005 Paris, France. [email protected]
² Università degli Studi di Genova, Dipartimento di Matematica, 16146 Genova, Italy. [email protected]
³ Massachusetts Institute of Technology and Istituto Italiano di Tecnologia, Laboratory for Computational and Statistical Learning, Cambridge, MA 02139, USA. [email protected]
Abstract. This paper proposes a unified framework for the investigation of learning theory in Banach spaces of features via regularized empirical risk minimization. The main result establishes the consistency of such learning schemes under general conditions on the loss function, the geometry of the feature space, the regularization function, and the regularization parameters. The focus is placed on Tikhonov-like regularization with totally convex functions. This broad class of regularizers provides a flexible model for various priors on the features, including in particular hard constraints and powers of norms. In addition, the proposed analysis gives new insight into basic tools such as kernel methods, feature maps, and representer theorems in a Banach space setting. Even when specialized to Hilbert spaces, this framework yields new results that significantly extend the state of the art.
Keywords. consistency, Banach spaces, empirical risk, feature map, kernel method, regularization, representer theorem, statistical learning, totally convex function.
1 Introduction

A common problem arising in the decision sciences is to infer a functional relation f between an input set X and an output set Y from the observation of a finite number of realizations (xi, yi)_{1≤i≤n} in X × Y of independent input/output random pairs with an unknown common distribution P; see for instance [32, 48, 50]. We also have at our disposal prior information about f that confines it to some constraint set C. The quality of the estimation can be assessed by a risk R, i.e., the expectation, with respect to P, of a loss function. Since the space of functions from X to Y is not sufficiently structured to effectively address the problem, both in terms of theoretical analysis and numerical solution, we proceed by embedding a Banach space F into the space of functions from X to Y through a linear operator A : F → Y^X. If the range of A is sufficiently large to approximate C, the problem is recast in the space F, which is called the feature space. In this context a consistent learning scheme is a map that assigns to each sample (xi, yi)_{1≤i≤n} an estimator un ∈ F such that R(Aun) → inf R(C) as n becomes arbitrarily large. However, since the risk R is not available, as it depends on the unknown distribution P, a standard procedure to construct a learning scheme is to replace R by its empirical version Rn based on the available sample. Moreover, the straightforward minimization of the empirical risk typically leads to poor estimates and a regularization strategy must be introduced. In this paper we adopt Tikhonov-like regularization: a function G : F → [0,+∞] and a vanishing sequence (λn)_{n∈N} in R++ are selected and, for every n ∈ N, an estimator un is then computed by approximately solving the regularized problem

    minimize_{u∈F}  Rn(Au) + λn G(u).   (1.1)

* Contact author: P. L. Combettes, [email protected], phone: +33 1 4427 6319, fax: +33 1 4427 7200.
The ultimate objective of the present paper is to find appropriate conditions on the constituents of the problem ensuring that the above learning scheme is consistent, i.e., that, as the sample size n becomes arbitrarily large, the risk values R(Aun) converge, in some suitable probabilistic sense, to inf R(C). Our results go beyond the state of the art on two fronts simultaneously. First, the analysis is cast in general reflexive Banach spaces of features and functions, whereas the current literature focuses mainly on Hilbert spaces. Second, we bring into play a general class of convex functions as regularizers, namely totally convex functions. The main benefit of using this broad class of regularizers is that they can model various priors on the features, including hard constraints. At the same time, totally convex functions will be seen to provide the proper setting to achieve the consistency of the learning scheme based on (1.1). In addition, our analysis provides generalizations of tools such as kernel methods, feature maps, representer theorems, and concentration inequalities. We give an illustration of our results in the context of learning with dictionaries. Let us emphasize that our framework brings new results even in Hilbert spaces, as the current literature studies only specific instances of regularizers.

Notation is provided in Section 2. Totally convex functions and Tikhonov-like regularization are investigated in Section 3. Section 4 is devoted to the study of Banach spaces of functions and their description by feature maps; representer and sensitivity theorems are established in this general setting. In Section 5, the regularized learning scheme is formalized and the main consistency theorems are presented.
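To make the scheme (1.1) concrete, consider a toy finite-dimensional instance: F = R³ (a Hilbert space), A maps u to the function x ↦ ⟨u, φ(x)⟩ for a small dictionary φ, the loss is the square loss, and G = ‖·‖². The sketch below (all names are illustrative and not from the paper; a minimal numerical illustration, not the paper's general setting) computes the regularized empirical estimator for growing sample sizes n and vanishing λn = n^{-1/2}.

```python
import math
import random

random.seed(0)

def phi(x):
    # hypothetical dictionary of features: F = R^3 and (Au)(x) = <u, phi(x)>
    return [1.0, x, math.sin(x)]

def target(x):
    # regression function; it lies in the span of the dictionary
    return 2.0 - 0.5 * x

def erm_ridge(samples, lam):
    # minimize (1/n) sum_i (<u, phi(x_i)> - y_i)^2 + lam * ||u||^2
    # via the normal equations, solved by Gaussian elimination
    d, n = 3, len(samples)
    A = [[lam * (i == j) for j in range(d)] for i in range(d)]
    b = [0.0] * d
    for x, y in samples:
        f = phi(x)
        for i in range(d):
            b[i] += f[i] * y / n
            for j in range(d):
                A[i][j] += f[i] * f[j] / n
    for c in range(d):                      # forward elimination with pivoting
        p = max(range(c, d), key=lambda r: abs(A[r][c]))
        A[c], A[p], b[c], b[p] = A[p], A[c], b[p], b[c]
        for r in range(c + 1, d):
            m = A[r][c] / A[c][c]
            for j in range(c, d):
                A[r][j] -= m * A[c][j]
            b[r] -= m * b[c]
    u = [0.0] * d
    for i in reversed(range(d)):            # back substitution
        u[i] = (b[i] - sum(A[i][j] * u[j] for j in range(i + 1, d))) / A[i][i]
    return u

def risk(u, m=2000):
    # Monte Carlo estimate of the true risk R(Au) for the square loss
    xs = [random.uniform(-1.0, 1.0) for _ in range(m)]
    return sum((sum(c * f for c, f in zip(u, phi(x))) - target(x)) ** 2
               for x in xs) / m

for n in [10, 100, 1000]:
    lam_n = 1.0 / math.sqrt(n)              # vanishing regularization parameter
    data = []
    for _ in range(n):
        x = random.uniform(-1.0, 1.0)
        data.append((x, target(x) + 0.1 * random.gauss(0.0, 1.0)))
    u_n = erm_ridge(data, lam_n)
    print(n, risk(u_n))  # risk values approach inf R as n grows
```

The choice λn = n^{-1/2} is purely illustrative of a vanishing sequence; the paper's consistency theorems give the actual admissible schedules.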
2 Notation and basic definitions

We set R+ = [0,+∞[ and R++ = ]0,+∞[. Let B ≠ {0} be a real Banach space. The closed ball of B of radius ρ ∈ R++ centered at the origin is denoted by B(ρ). We say that B is of Rademacher type q ∈ [1,2] [35, Definition 1.e.12] if there exists T ∈ [1,+∞[ such that, for every n ∈ N \ {0} and every (ui)_{1≤i≤n} in B,

    ∫₀¹ ‖ Σ_{i=1}^n ri(t) ui ‖^q dt ≤ T^q Σ_{i=1}^n ‖ui‖^q,   (2.1)

where (ri)_{i∈N} denote the Rademacher functions, that is, for every i ∈ N, ri : [0,1] → {−1,1} : t ↦ sign(sin(2^i πt)). The smallest T for which (2.1) holds is denoted by Tq. Since every Banach space is of Rademacher type 1, this notion is of interest for q ∈ ]1,2]. Moreover, a Banach space of Rademacher type q ∈ ]1,2] is also of Rademacher type p ∈ ]1,q[. The modulus of convexity of B is

    δB : ]0,2] → R+ : ε ↦ inf{ 1 − ‖(u + v)/2‖ | (u,v) ∈ B², ‖u‖ = ‖v‖ = 1, ‖u − v‖ ≥ ε },   (2.2)

and the modulus of smoothness of B is

    ρB : R+ → R+ : τ ↦ sup{ (‖u + v‖ + ‖u − v‖)/2 − 1 | (u,v) ∈ B², ‖u‖ = 1, ‖v‖ ≤ τ }.   (2.3)
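In a Hilbert space, independence of the signs gives E‖Σi εi ui‖² = Σi ‖ui‖², so (2.1) holds with q = 2 and T = 1. This standard fact (not a claim of the paper) can be checked numerically by enumerating all sign patterns in R²; averaging over patterns plays the role of the integral against the Rademacher functions:

```python
import itertools

# three vectors u_1, u_2, u_3 in the Hilbert space R^2
us = [(1.0, 2.0), (-0.5, 1.5), (3.0, -1.0)]

def norm_sq(v):
    return sum(c * c for c in v)

# average of ||sum_i eps_i u_i||^2 over all sign patterns; this plays the
# role of the integral against the Rademacher functions on the left of (2.1)
total = 0.0
for signs in itertools.product([-1.0, 1.0], repeat=len(us)):
    s = [sum(e * u[k] for e, u in zip(signs, us)) for k in range(2)]
    total += norm_sq(s)
avg = total / 2 ** len(us)

rhs = sum(norm_sq(u) for u in us)  # right-hand side of (2.1) with T = 1, q = 2
print(avg, rhs)  # the two values agree: the cross terms average out
```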
We say that B is uniformly convex if δB vanishes only at zero, and uniformly smooth if lim_{τ→0} ρB(τ)/τ = 0 [8, 35]. Now let q ∈ [1,+∞[. Then B has modulus of convexity of power type q if there exists c ∈ R++ such that, for every ε ∈ ]0,2], δB(ε) ≥ cε^q, and it has modulus of smoothness of power type q if there exists c ∈ R++ such that, for every τ ∈ R++, ρB(τ) ≤ cτ^q [8, 35]. A smooth Banach space with modulus of smoothness of power type q is of Rademacher type q [35, Theorem 1.e.16]. Therefore, the notion of Rademacher type is weaker than that of uniform smoothness of power type; in particular, it does not imply reflexivity (see the discussion after [35, Theorem 1.e.16]).

Let F : B → ]−∞,+∞]. The domain of F is dom F = { u ∈ B | F(u) < +∞ } and F is proper if dom F ≠ ∅. Suppose that F is proper and convex. The subdifferential of F is the set-valued operator

    ∂F : B → 2^{B*} : u ↦ { u* ∈ B* | (∀v ∈ B) F(u) + ⟨v − u, u*⟩ ≤ F(v) },   (2.4)

and its domain is dom ∂F = { u ∈ B | ∂F(u) ≠ ∅ }. Moreover, for every (u,v) ∈ dom F × B, we set F′(u; v) = lim_{t→0+} (F(u + tv) − F(u))/t. If F is proper and bounded from below and C ⊂ B is such that C ∩ dom F ≠ ∅, we put Argmin_C F = { u ∈ C | F(u) = inf F(C) }, and when it is a singleton we denote by argmin_C F its unique element. Moreover, we set

    (∀ε ∈ R++)  Argmin^ε_C F = { u ∈ C | F(u) ≤ inf F(C) + ε }.   (2.5)

We denote by Γ0(B) the class of functions F : B → ]−∞,+∞] which are proper, convex, and lower semicontinuous, and we set Γ0+(B) = { F ∈ Γ0(B) | F(B) ⊂ [0,+∞] }. Let p ∈ [1,+∞]. The conjugate of p is

    p* = { +∞ if p = 1;  p/(p − 1) if 1 < p < +∞;  1 if p = +∞ }.   (2.6)

If p ∈ ]1,+∞[, the p-duality map of B is J_{B,p} = ∂(‖·‖^p / p) [20], and hence

    (∀u ∈ B)  J_{B,p}(u) = { u* ∈ B* | ⟨u, u*⟩ = ‖u‖^p and ‖u*‖ = ‖u‖^{p−1} }.   (2.7)

For p = 2 we obtain the normalized duality map J_B. Moreover, if B is reflexive, strictly convex, and smooth, then J_{B,p} is single-valued and its unique selection, which we denote also by J_{B,p}, is a bijection from B onto B* and J_{B*,p*} = J_{B,p}^{−1}. When a Banach space is regarded as a measurable space, it is with respect to its Borel σ-algebra.

Let (Z, A, µ) be a σ-finite measure space and let Y be a separable real Banach space with norm |·|. We denote by M(Z, Y) the set of measurable functions from Z into Y. If p ≠ +∞, L^p(Z, µ; Y) is the Banach space of all (equivalence classes of) measurable functions f ∈ M(Z, Y) such that ∫_Z |f|^p dµ < +∞, and L^∞(Z, µ; Y) is the Banach space of all (equivalence classes of) measurable functions f ∈ M(Z, Y) which are µ-essentially bounded. Let f ∈ L^p(Z, µ; Y). Then ‖f‖_p = (∫_Z |f|^p dµ)^{1/p} if p ≠ +∞, and ‖f‖_∞ = µ-ess-sup_{z∈Z} |f(z)| otherwise. If p ∈ ]1,+∞[, L^p(Z, µ; R) is uniformly convex and uniformly smooth, it has modulus of convexity of power type max{2, p} and modulus of smoothness of power type min{2, p} [35, p. 63], and hence it is of Rademacher type min{2, p}. If Z is countable, A = 2^Z, and µ is the counting measure, we set ℓ^p(Z; Y) = L^p(Z, µ; Y) and ℓ^p(Z) = L^p(Z, µ; R).

Let Y and Z be separable real Banach spaces. We denote by L(Y, Z) the Banach space of continuous linear operators from Y into Z endowed with the operator norm. A map Λ : Z → L(Y, Z) is strongly measurable if, for every y ∈ Y, the function Z → Z : z ↦ Λ(z)y is measurable. In that case the function Z → R : z ↦ ‖Λ(z)‖ is measurable [27]. If p ≠ +∞, L^p[Z, µ; L(Y, Z)] is the Banach space of all (equivalence classes of) strongly measurable functions Λ : Z → L(Y, Z) such that ∫_Z ‖Λ(z)‖^p µ(dz) < +∞, and L^∞[Z, µ; L(Y, Z)] is the Banach space of all (equivalence classes of) strongly measurable functions Λ : Z → L(Y, Z) such that µ-ess-sup_{z∈Z} ‖Λ(z)‖ < +∞ [7]. Let Λ ∈ L^p[Z, µ; L(Y, Z)]. Then ‖Λ‖_p = (∫_Z ‖Λ(z)‖^p µ(dz))^{1/p} if p ≠ +∞, and ‖Λ‖_∞ = µ-ess-sup_{z∈Z} ‖Λ(z)‖ otherwise.
Let (Ω, A, P) be a probability space, let P* be the associated outer probability, and let (Un)_{n∈N} and U be functions from Ω to B. For every ξ : Ω → R and t ∈ R, we set

    [ξ > t] = { ω ∈ Ω | ξ(ω) > t };   (2.8)

the sets [ξ < t], [ξ ≥ t], and [ξ ≤ t] are defined analogously. The sequence (Un)_{n∈N} converges in P-outer probability to U, in symbols Un →^{P*} U, if [49]

    (∀ε ∈ R++)  P*([‖Un − U‖ > ε]) → 0,   (2.9)

and it converges P*-almost surely (a.s.) to U if

    (∃ Ω0 ⊂ Ω)  P*(Ω0) = 0  and  (∀ω ∈ Ω \ Ω0)  Un(ω) → U(ω).   (2.10)
3 Total convexity and regularization

Convex analysis and optimization play a central role in this paper. In this section, we provide the basic background and the additional results that will be required.
3.1 Totally convex functions

Totally convex functions were introduced in [13] and further studied in [14, 15, 57]. This notion lies between strict convexity and strong convexity. Let φ : R → [0,+∞] be such that φ(0) = 0 and dom φ ⊂ R+. We set

    φ̂ : R → [0,+∞] : t ↦ { 0 if t = 0;  φ(t)/|t| if t ≠ 0 }.   (3.1)

The upper-quasi inverse of φ is [38, 57]

    φ♮ : R → [0,+∞] : s ↦ { sup{ t ∈ R+ | φ(t) ≤ s } if s ≥ 0;  +∞ if s < 0 }.   (3.2)

Note that, for every (t,s) ∈ R+², φ(t) ≤ s ⇒ t ≤ φ♮(s). We set

    A0 = { φ : R → [0,+∞] | dom φ ⊂ R+, φ is increasing on R+, φ(0) = 0, (∀t ∈ R++) φ(t) > 0 }   (3.3)

and

    A1 = { φ ∈ A0 | φ̂ is increasing on R+, lim_{t→0+} φ̂(t) = 0 }.   (3.4)
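For a concrete feel for (3.2), take φ(t) = t² on R+, which belongs to A1 (φ̂(t) = t is increasing and vanishes at 0); its upper-quasi inverse is φ♮(s) = √s for s ≥ 0. The sketch below (purely illustrative, not from the paper) approximates φ♮ on a grid and checks the implication φ(t) ≤ s ⇒ t ≤ φ♮(s) noted after (3.2):

```python
def phi(t):
    # a function in the class A1: phi(0) = 0, increasing, phi(t) > 0 for t > 0
    return t * t

def quasi_inverse(phi, s, t_max=10.0, steps=10 ** 5):
    # numerical version of (3.2): sup{ t >= 0 : phi(t) <= s } on a grid
    if s < 0:
        return float("inf")
    h = t_max / steps
    best = 0.0
    for i in range(steps + 1):
        t = i * h
        if phi(t) <= s:
            best = t
    return best

s = 4.0
print(quasi_inverse(phi, s))  # approximately sqrt(4) = 2
# the implication phi(t) <= s  =>  t <= phi_natural(s):
for t in [0.0, 0.5, 1.0, 1.9, 2.0]:
    assert phi(t) <= s and t <= quasi_inverse(phi, s) + 1e-3
```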
Proposition 3.1 Let φ ∈ A0. Then the following hold:

(i) dom φ is an interval containing 0.
(ii) dom φ♮ = [0, sup φ(R+)[.
(iii) Suppose that φ̂ is increasing on R+. Then dom φ♮ = R+ and φ is strictly increasing on dom φ.
(iv) Suppose that (tn)_{n∈N} ∈ R+^N satisfies φ(tn) → 0. Then tn → 0.
(v) φ♮ is increasing on R+ and lim_{s→0+} φ♮(s) = 0 = φ♮(0).
(vi) Let (s,t) ∈ R+ × R++. Then φ♮(s) < t ⇔ s < φ(t−).
(vii) Suppose that φ ∈ A1. Then int(dom φ) ≠ ∅, φ̂ ∈ A0, φ̂ is right-continuous at 0, and (φ̂)♮ ∈ A0.
Proof. (i): This follows from (3.3).
(ii): For every s ∈ R+, [φ ≤ s] ⊂ dom φ. Therefore, if dom φ is bounded, φ♮ is real-valued. Now, suppose that dom φ = R+. Let s ∈ R+ with s < sup φ(R+). Then there exists t1 ∈ R+ such that s < φ(t1). Moreover, since φ is increasing, t ∈ [φ ≤ s] ⇒ φ(t) ≤ s < φ(t1) ⇒ t ≤ t1. Hence φ♮(s) = sup[φ ≤ s] ≤ t1 < +∞. Therefore [0, sup φ(R+)[ ⊂ dom φ♮. On the other hand, if s ∈ [sup φ(R+), +∞[, then [φ ≤ s] = dom φ and hence φ♮(s) = +∞.

(iii): For every t ∈ [1,+∞[, φ(t) ≥ tφ(1) > 0. Hence sup φ(R+) = +∞ and therefore (ii) yields dom φ♮ = R+. Let t ∈ dom φ and s ∈ dom φ with t < s. If t > 0, then 0 < φ(t) = tφ̂(t) ≤ tφ̂(s) = (t/s)φ(s) < φ(s); otherwise, (3.3) yields φ(t) = φ(0) = 0 < φ(s).
(iv): Suppose that there exist ε ∈ R++ and a subsequence (t_{kn})_{n∈N} such that (∀n ∈ N) t_{kn} ≥ ε. Then φ(t_{kn}) ≥ φ(ε) > 0 and hence φ(tn) ↛ 0.

(v): See [57, Lemma 3.3.1(i)].
(vi): Suppose that t ≤ φ♮(s). Then, for every δ ∈ ]0,t[, there exists t′ ∈ R+ such that φ(t′) ≤ s and t − δ < t′, hence φ(t − δ) ≤ φ(t′) ≤ s. Therefore sup_{δ∈]0,t[} φ(t − δ) = φ(t−) ≤ s. Conversely, suppose that t > φ♮(s). Let t′ ∈ ]φ♮(s), t[. Then it follows from (3.2) that φ(t′) > s, and hence φ(t−) ≥ s.

(vii): It follows from (3.3) and (3.4) that int(dom φ) ≠ ∅, φ̂ ∈ A0, and φ̂ is right-continuous at 0. Let s ∈ R++. In view of (v), to prove that (φ̂)♮ ∈ A0, it remains to show that (φ̂)♮(s) > 0. By continuity of φ̂ at 0, { t ∈ R+ | φ̂(t) ≤ s } is a neighborhood of 0 in R+ and hence (φ̂)♮(s) = sup{ t ∈ R+ | φ̂(t) ≤ s } > 0.
Definition 3.2 [16] Let F be a reflexive real Banach space and let G : F → ]−∞,+∞] be a proper convex function.

(i) The modulus of total convexity of G is

    ψ : dom G × R → [0,+∞] : (u,t) ↦ inf{ G(v) − G(u) − G′(u; v − u) | v ∈ dom G, ‖v − u‖ = t },   (3.5)

and G is totally convex at u ∈ dom G if, for every t ∈ R++, ψ(u,t) > 0.

(ii) Let ψ be the modulus of total convexity of G. For every ρ ∈ R++ such that B(ρ) ∩ dom G ≠ ∅, the modulus of total convexity of G on B(ρ) is

    ψρ : R → [0,+∞] : t ↦ inf_{u ∈ B(ρ) ∩ dom G} ψ(u,t),   (3.6)

and G is totally convex on B(ρ) if ψρ > 0 on R++. Moreover, G is totally convex on bounded sets if, for every ρ ∈ R++ such that B(ρ) ∩ dom G ≠ ∅, it is totally convex on B(ρ).

Remark 3.3 Total convexity and standard variants of convexity are related as follows.

(i) Suppose that G is totally convex at every point of dom G. Then G is strictly convex.
(ii) Total convexity is closely related to uniform convexity [52, 56]. Indeed, G is uniformly convex on F if and only if, for every t ∈ R++, inf_{u∈dom G} ψ(u,t) > 0 [57, Theorem 3.5.10]. Alternatively, G is uniformly convex on F if and only if (∀t ∈ R++) inf_{ρ∈R++} ψρ(t) > 0.
(iii) In reflexive spaces, total convexity on bounded sets is equivalent to uniform convexity on bounded sets [15, Proposition 4.2]. Yet, some results will require pointwise total convexity, which makes it the pertinent notion in our investigation.

Remark 3.4 Let u0 and u be in dom G. Then Definition 3.2 implies that

    G(u) − G(u0) ≥ G′(u0; u − u0) + ψ(u0, ‖u − u0‖).   (3.7)
Moreover, if u* ∈ ∂G(u0), then ⟨u − u0, u*⟩ ≤ G′(u0; u − u0) and therefore

    G(u) − G(u0) ≥ ⟨u − u0, u*⟩ + ψ(u0, ‖u − u0‖).   (3.8)

Thus, ∂G(u0) ≠ ∅ ⇒ ψ(u0, ‖u − u0‖) < +∞.

The properties of the modulus of total convexity are summarized below.

Proposition 3.5 Let F be a reflexive real Banach space, let G : F → ]−∞,+∞] be a proper convex function the domain of which is not a singleton, let ψ be the modulus of total convexity of G, and let u0 ∈ dom G. Then the following hold:

(i) Let c ∈ ]1,+∞[ and let t ∈ R+. Then ψ(u0, ct) ≥ cψ(u0, t).
(ii) ψ(u0, ·) : R → [0,+∞] is increasing on R+.
(iii) Let t ∈ R+. Then

    ψ(u0, t) = inf{ G(u) − G(u0) − G′(u0; u − u0) | u ∈ dom G, ‖u − u0‖ ≥ t }.   (3.9)

(iv) Suppose that G is totally convex at u0. Then ψ(u0, ·) ∈ A0 and ψ(u0, ·)ˆ ∈ A0.
(v) dom ψ(u0, ·) is an interval containing 0; moreover, if ∂G(u0) ≠ ∅, then int dom ψ(u0, ·) ≠ ∅.
(vi) Suppose that ∂G(u0) ≠ ∅. Then lim_{t→0+} ψ(u0, ·)ˆ(t) = 0.
(vii) Suppose that ∂G(u0) ≠ ∅ and that G is totally convex at u0. Then ψ(u0, ·) ∈ A1.
(viii) Let ρ ∈ R++ and suppose that G is totally convex on B(ρ). Then ψρ ∈ A0 and ψ̂ρ ∈ A0. Moreover, if B(ρ) ∩ dom ∂G ≠ ∅, then ψρ ∈ A1.
(ix) Suppose that u0 ∈ Argmin_F G and that G is totally convex at u0. Then G is coercive.
Proof. (i): Suppose that u ∈ dom G satisfies ‖u − u0‖ = ct and set v = (1 − c⁻¹)u0 + c⁻¹u = u0 + c⁻¹(u − u0). Then v ∈ dom G and ‖v − u0‖ = t. Therefore, since G is convex and G′(u0; ·) is positively homogeneous [6, Proposition 17.2],

    ψ(u0, t) ≤ G(v) − G(u0) − G′(u0; v − u0)
            ≤ (1 − c⁻¹)G(u0) + c⁻¹G(u) − G(u0) − c⁻¹G′(u0; u − u0)
            = c⁻¹( G(u) − G(u0) − G′(u0; u − u0) ).

Hence cψ(u0, t) ≤ ψ(u0, ct).

(ii): Let (s,t) ∈ R²++ be such that t < s, and set c = s/t. Then using (i), we have ψ(u0, t) ≤ c⁻¹ψ(u0, ct) ≤ ψ(u0, s).

(iii): Suppose that u ∈ dom G satisfies ‖u − u0‖ ≥ t and set s = ‖u − u0‖. Then it follows from (ii) that ψ(u0, t) ≤ ψ(u0, s) ≤ G(u) − G(u0) − G′(u0; u − u0).

(iv): Since ψ(u0, 0) = 0, (ii) yields ψ(u0, ·) ∈ A0. Moreover, it follows from (i) that ψ(u0, ·)ˆ is increasing, hence ψ(u0, ·)ˆ ∈ A0.

(v): The first claim follows from the fact that ψ(u0, ·) is increasing and ψ(u0, 0) = 0. Next, since dom G is not a singleton, there exists u ∈ dom G with u ≠ u0. Finally, Remark 3.4 asserts that ∂G(u0) ≠ ∅ ⇒ ψ(u0, ‖u − u0‖) < +∞.

(vi): Since (i) asserts that ψ(u0, ·)ˆ is increasing, lim_{t→0+} ψ(u0, ·)ˆ(t) = inf_{t∈R++} ψ(u0, ·)ˆ(t). Suppose that inf_{t∈R++} ψ(u0, ·)ˆ(t) > 0. Then there exists ǫ ∈ R++ such that, for every t ∈ R++, ψ(u0, t) ≥ ǫt. Let u ∈ dom G \ {u0}. For every t ∈ ]0,1], define ut = u0 + tv, where v = u − u0. Then ǫt‖v‖ = ǫ‖ut − u0‖ ≤ ψ(u0, ‖ut − u0‖) ≤ G(u0 + tv) − G(u0) − G′(u0; tv). Hence, since G′(u0; ·) is positively homogeneous, ǫ‖v‖ + G′(u0; v) ≤ (G(u0 + tv) − G(u0))/t. Letting t → 0+ yields ǫ‖v‖ + G′(u0; v) ≤ G′(u0; v), which contradicts the facts that G′(u0; v) ∈ R and ǫ‖v‖ > 0.

(vii)–(viii): The claims follow from (iv) and (vi).

(ix): Since 0 ∈ ∂G(u0), (3.8) yields (∀u ∈ dom G) ψ(u0, ‖u − u0‖) ≤ G(u) − G(u0). On the other hand, since G is also totally convex at u0, (iv)–(v) imply that there exists s ∈ R++ such that 0 < ψ(u0, s) < +∞ and (∀t ∈ [s,+∞[) ψ(u0, t) ≥ tψ(u0, s)/s. Therefore, for every u ∈ dom G such that ‖u − u0‖ ≥ s, we have G(u) ≥ G(u0) + ‖u − u0‖ψ(u0, s)/s, which implies that G is coercive.

Remark 3.6 Statements (i), (ii), (iii), and (v) are proved in [16, Proposition 2.1] under the additional assumption that int dom G ≠ ∅, and in [14, Proposition 1.2.2] under the additional assumption that u0 is in the algebraic interior of dom G.
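As a sanity check of Definition 3.2 (a standard computation, not taken from the paper): in a Hilbert space H, the function G = ½‖·‖² has G′(u; v − u) = ⟨u, v − u⟩, so

```latex
\psi(u,t)
  = \inf_{\substack{v\in\mathcal H\\ \|v-u\|=t}}
      \Big(\tfrac12\|v\|^2-\tfrac12\|u\|^2-\langle u,\,v-u\rangle\Big)
  = \inf_{\|w\|=t}\Big(\tfrac12\|u+w\|^2-\tfrac12\|u\|^2-\langle u,w\rangle\Big)
  = \inf_{\|w\|=t}\tfrac12\|w\|^2
  = \tfrac12\,t^2 .
```

Here ψ is independent of u, so ψρ(t) = t²/2 for every ρ and G is uniformly convex, consistent with Remark 3.3(ii) and with the case r = q = 2 of Proposition 3.8 below.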
Example 3.7 Let F be a uniformly convex real Banach space and let φ ∈ A0 be real-valued, strictly increasing, continuous, and such that lim_{t→+∞} φ(t) = +∞. Define (∀t ∈ R) ϕ(t) = ∫₀^{|t|} φ(s) ds. Then [56, Theorem 4.1(ii)] and [15, Proposition 4.2] imply that G = ϕ ∘ ‖·‖ is totally convex on bounded sets (see also [52, Theorem 6]).

We now provide a significant example in which the modulus of total convexity on balls can be computed explicitly.

Proposition 3.8 Let q ∈ [2,+∞[ and let F be a uniformly convex real Banach space with modulus of convexity of power type q. Let r ∈ ]1,+∞[ and, for every ρ ∈ R+, denote by ψρ the modulus of total convexity of ‖·‖^r on the ball B(ρ). Then there exists β ∈ R++ such that

    (∀ρ ∈ R+)(∀t ∈ R+)  ψρ(t) ≥ { βt^r if r ≥ q;  βt^q/(ρ + t)^{q−r} if r < q }.   (3.10)

Hence ‖·‖^r is totally convex on bounded sets and, if r ≥ q, it is uniformly convex. Moreover, for every ρ ∈ R+ and every s ∈ R+,

    (ψ̂ρ)♮(s) ≤ { (s/β)^{1/(r−1)} if r ≥ q;  2^q ρ max{ (s/(βρ^{r−1}))^{1/(q−1)}, (s/(βρ^{r−1}))^{1/(r−1)} } if r < q }.   (3.11)
Proof. Let (u,v) ∈ F². We derive from [54, Theorem 1] that

    (∀u* ∈ J_{F,r}(u))  ‖u + v‖^r − ‖u‖^r ≥ r⟨v, u*⟩ + ϑr(u, v),   (3.12)

where

    ϑr(u, v) = rKr ∫₀¹ ( max{‖u + tv‖, ‖u‖}^r / t ) δF( t‖v‖ / (2 max{‖u + tv‖, ‖u‖}) ) dt   (3.13)

and Kr ∈ R++ is the constant defined according to [54, Lemma 3, Equation (2.13)]. Since δF(ε) ≥ cε^q for some c ∈ R++,

    ϑr(u, v) ≥ (rKr c / 2^q) ‖v‖^q ∫₀¹ max{‖u + tv‖, ‖u‖}^{r−q} t^{q−1} dt.   (3.14)

Let us consider first the case r ≥ q. Since, for every t ∈ [0,1], max{‖u + tv‖, ‖u‖} ≥ t‖v‖/2,

    ϑr(u, v) ≥ (rKr c / 2^q) ‖v‖^q ∫₀¹ (t‖v‖/2)^{r−q} t^{q−1} dt = (rKr c / 2^r) ‖v‖^r ∫₀¹ t^{r−1} dt = (Kr c / 2^r) ‖v‖^r.   (3.15)

Now, suppose that r < q. Then since, for every t ∈ [0,1], max{‖u + tv‖, ‖u‖} ≤ ‖u‖ + ‖v‖,

    ϑr(u, v) ≥ (rKr c / 2^q) ( ‖v‖^q / (‖u‖ + ‖v‖)^{q−r} ) ∫₀¹ t^{q−1} dt = (rKr c / (q2^q)) ‖v‖^q / (‖u‖ + ‖v‖)^{q−r}.   (3.16)

Let ψ be the modulus of total convexity of ‖·‖^r. Then it follows from (3.15) and (3.16) that

    (∀u ∈ F)(∀t ∈ R+)  ψ(u, t) ≥ { (Kr c / 2^r) t^r if q ≤ r;  (rKr c / (q2^q)) t^q / (‖u‖ + t)^{q−r} if q > r }.   (3.17)

Let ρ ∈ R++ and set β = (r / max{q, r}) Kr c / 2^{max{q,r}}. Then we obtain (3.10) by taking the infimum over u ∈ B(ρ) in (3.17). Thus, if r ≥ q, the modulus of total convexity is independent of ρ, and hence ‖·‖^r is uniformly convex on F. On the other hand, if r < q, we deduce that ‖·‖^r is totally convex on bounded sets. Hence,

    (∀t ∈ R+)  ψ̂ρ(t) ≥ { βt^{r−1} if r ≥ q;  βt^{q−1}/(ρ + t)^{q−r} if r < q }.   (3.18)

A simple calculation shows that, if r < q,

    (∀t ∈ R+)  ψ̂ρ(t) ≥ νρ(t),  where  νρ(t) = (βρ^{r−1}/2^q) min{ (t/ρ)^{q−1}, (t/ρ)^{r−1} }.   (3.19)
The function νρ is strictly increasing and continuous on R+, thus νρ♮ = νρ⁻¹. Since, for arbitrary functions ψ1 : R+ → R+ and ψ2 : R+ → R+, ψ1 ≥ ψ2 ⇒ ψ1♮ ≤ ψ2♮, we obtain (3.11).

Remark 3.9

(i) An inspection of the proof of Proposition 3.8 reveals that the constant β is explicitly available in terms of r and of a constant depending on the space F. In particular, it follows from [54, Equation (2.13)] that, when r ∈ ]1,2],

    Kr ≥ 4(2 + √3) min{ r(r − 1)/2, (r − 1) log(3/2), 1 − (2/3)^{r−1} } ≥ 14(1 − (2/3)^{r−1}),   (3.20)

and, when r ∈ ]2,+∞[,

    Kr ≥ 4(2 + √3) min{ 1, (r − 1)(2 − √3), 1 − (2/3)^{(r−1)/2} } ≥ 14(1 − (2/3)^{(r−1)/2}).   (3.21)

As an example, for the case F = ℓ^r(K) and ‖·‖_r^r with r ∈ ]1,2], since F has modulus of convexity of power type 2 with c = (r − 1)/8 [35], we have β ≥ (7/32) r(r − 1)(1 − (2/3)^{r−1}).

(ii) In [53, Theorem 1] and [8, Lemma 2, p. 310] the case r = q is considered. It is proved that ‖·‖_F^r is uniformly convex and that its modulus of uniform convexity, say ν, satisfies ν(t) ≥ βt^r for every t ∈ R++.
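The conclusion of Proposition 3.8 can be probed numerically in the Euclidean plane, which has modulus of convexity of power type q = 2. The sketch below (illustrative only; constants are not tracked and the infimum in (3.5) is estimated over sampled directions) evaluates the modulus of total convexity of G = ‖·‖^r at a point, with r = 1.5, and checks that it is strictly positive and increasing in t, as Proposition 3.5(ii) and total convexity require:

```python
import math

R = 1.5  # exponent r in ]1, 2]

def norm(v):
    return math.hypot(v[0], v[1])

def G(v):
    return norm(v) ** R

def dG(u, w):
    # directional derivative of ||.||^R at u != 0 in direction w
    nu = norm(u)
    return R * nu ** (R - 2) * (u[0] * w[0] + u[1] * w[1])

def psi(u, t, n_dirs=720):
    # modulus of total convexity (3.5), estimated over sampled directions
    best = float("inf")
    for k in range(n_dirs):
        a = 2.0 * math.pi * k / n_dirs
        w = (t * math.cos(a), t * math.sin(a))
        v = (u[0] + w[0], u[1] + w[1])
        best = min(best, G(v) - G(u) - dG(u, w))
    return best

u = (1.0, 0.5)
for t in [0.1, 0.5, 1.0, 2.0]:
    print(t, psi(u, t))  # strictly positive: ||.||^R is totally convex at u
```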
3.2 Tikhonov-like regularization

In this section we work with the following scenario.

Assumption 3.10 F is a reflexive real Banach space, F : F → ]−∞,+∞] is bounded from below, G : F → [0,+∞], dom G is not a singleton, and dom F ∩ dom G ≠ ∅. The function ε : R++ → [0,1] satisfies lim_{λ→0+} ε(λ) = 0 and, for every λ ∈ R++, uλ ∈ Argmin^{ε(λ)}_F (F + λG).

We study the behavior of the regularized problem

    minimize_{u∈F}  F(u) + λG(u)   (3.22)

as λ → 0+ in connection with the limiting problem

    minimize_{u∈F}  F(u).   (3.23)
We shall present results similar to those of [3] under weaker assumptions and with approximate solutions of (3.22) as opposed to exact ones. In particular, Proposition 3.11 does not require the family (uλ)_{λ∈R++} to be bounded or F to have minimizers. Indeed, although these are common requirements in the inverse problems literature, where the convergence of the minimizers (uλ)_{λ∈R++} is relevant, from the statistical learning point of view this assumption is not always appropriate. In that context, as discussed in the introduction, it is primarily the convergence of the values (F(uλ))_{λ∈R++} to inf F(F) that is of interest. On the other hand, when (uλ)_{λ∈R++} is bounded and additional convexity properties are imposed on G, we provide bounds and strong convergence results.

Proposition 3.11 Suppose that Assumption 3.10 holds. Then the following hold:

(i) lim_{λ→0+} inf(F + λG)(F) = inf F(dom G).
(ii) lim_{λ→0+} F(uλ) = inf F(dom G).
(iii) lim_{λ→0+} λG(uλ) = 0.

Proof. (i): Since dom F ∩ dom G ≠ ∅, inf(F + λG)(F) < +∞. Let u ∈ dom G. Then, for every λ ∈ R++,

    inf F(dom G) ≤ F(uλ) ≤ F(uλ) + λG(uλ) ≤ inf(F + λG)(F) + ε(λ) ≤ F(u) + λG(u) + ε(λ).   (3.24)

Hence,

    inf F(dom G) ≤ lim inf_{λ→0+} ( inf(F + λG)(F) + ε(λ) ) ≤ lim sup_{λ→0+} ( inf(F + λG)(F) + ε(λ) ) ≤ F(u).   (3.25)

Therefore lim_{λ→0+} ( inf(F + λG)(F) + ε(λ) ) = inf F(dom G), and the statement follows.

(ii): This follows from (i) and (3.24).

(iii): We derive from (i) and (3.24) that lim_{λ→0+} ( F(uλ) + λG(uλ) ) = inf F(dom G), which, together with (ii), yields the statement.

Remark 3.12 Assume that inf F(F) = inf F(dom G). Then Proposition 3.11 yields lim_{λ→0+} F(uλ) = inf F(F) and lim_{λ→0+} inf(F + λG)(F) = inf F(F). In particular, the condition inf F(F) = inf F(dom G) is satisfied in each of the following cases:

(i) The lower semicontinuous envelopes of F + ι_{dom G} and F coincide [3, Theorem 2.6].
(ii) dom G ⊃ dom F and F is upper semicontinuous [6, Proposition 11.1(i)].
(iii) Argmin_F F ∩ dom G ≠ ∅.

Proposition 3.13 Suppose that Assumption 3.10 holds and set S = Argmin_{dom G} F. Suppose that F and G are weakly lower semicontinuous, that G is coercive, and that ε(λ)/λ → 0 as λ → 0+. Then

    S ≠ ∅  ⇔  (∃ t ∈ R)(∀λ ∈ R++)  G(uλ) ≤ t.   (3.26)

Now suppose that S ≠ ∅. Then the following hold:

(i) (uλ)_{λ∈R++} is bounded and there exists a vanishing sequence (λn)_{n∈N} in R++ such that (uλn)_{n∈N} converges weakly.
(ii) Suppose that u† ∈ F, that (λn)_{n∈N} is a vanishing sequence in R++, and that uλn ⇀ u†. Then u† ∈ Argmin_S G.
(iii) lim_{λ→0+} G(uλ) = inf G(S).
(iv) lim_{λ→0+} ( F(uλ) − inf F(dom G) )/λ = 0.
(v) Suppose that G is strictly quasiconvex [6, Definition 10.25]. Then there exists u† ∈ F such that Argmin_S G = {u†} and uλ ⇀ u† as λ → 0+.
(vi) Suppose that G is totally convex on bounded sets. Then uλ → u† = argmin_S G as λ → 0+.
Proof. Assume that S ≠ ∅ and let u ∈ S. For every λ ∈ R++, F(uλ) + λG(uλ) ≤ F(u) + λG(u) + ε(λ), so that uλ ∈ dom G and

    G(uλ) ≤ (F(u) − F(uλ))/λ + ε(λ)/λ + G(u) ≤ G(u) + ε(λ)/λ.   (3.27)

Thus, since (ε(λ)/λ)_{λ∈R++} is bounded, so is (G(uλ))_{λ∈R++}. Hence (uλ)_{λ∈R++} lies in some sublevel set of G. Conversely, suppose that there exists t ∈ R++ such that sup_{λ∈R++} G(uλ) ≤ t. It follows from the coercivity of G that (uλ)_{λ∈R++} is bounded. Therefore, since F is reflexive, there exist u† ∈ F and a sequence (λn)_{n∈N} in R++ such that λn → 0 and uλn ⇀ u†. In turn, we derive from the weak lower semicontinuity of F and Proposition 3.11(ii) that

    F(u†) ≤ lim inf F(uλn) = lim F(uλn) = inf F(dom G).   (3.28)

Moreover, since G is weakly lower semicontinuous,

    G(u†) ≤ lim inf G(uλn) ≤ lim sup G(uλn) ≤ t.   (3.29)

Hence u† ∈ dom G and it follows from (3.28) that u† ∈ S.

(i): This follows from the reflexivity of F and the boundedness of (uλ)_{λ∈R++}.

(ii): Arguing as above, we obtain that (3.28) holds. Moreover, for every u ∈ S, it follows from (3.27) that, since G is weakly lower semicontinuous and ε(λn)/λn → 0,

    G(u†) ≤ lim inf G(uλn) ≤ lim sup G(uλn) ≤ G(u) < +∞.   (3.30)

Inequalities (3.28) and (3.30) imply that u† ∈ S and that (ii) holds.

(iii): It follows from (3.30) and (ii) that G(uλn) → inf G(S). Since (λn)_{n∈N} is arbitrary, the claim follows.

(iv): Let λ ∈ R++. Since uλ is an ε(λ)-minimizer of F + λG, for every u ∈ dom G, we have

    (F(uλ) − inf F(dom G))/λ + G(uλ) ≤ (F(u) − inf F(dom G))/λ + G(u) + ε(λ)/λ.   (3.31)

In particular, taking u = u† in (3.31) yields

    (F(uλ) − inf F(dom G))/λ + G(uλ) ≤ G(u†) + ε(λ)/λ.   (3.32)

Since ε(λ)/λ → 0, passing to the limit superior in (3.32) as λ → 0+ and using (ii) and (iii), we get

    lim sup_{λ→0+} (F(uλ) − inf F(dom G))/λ + G(u†) ≤ G(u†),   (3.33)

which implies (iv), since F(uλ) − inf F(dom G) ≥ 0.

(v): It follows from (i) and (ii) that Argmin_S G ≠ ∅. Since S is convex and G is strictly quasiconvex, Argmin_S G reduces to a singleton {u†} and (ii) yields uλ ⇀ u† as λ → 0+.

(vi): Since (uλ)_{λ∈R++} is bounded, it follows from [57, Proposition 3.6.5] (see also [15]) that there exists φ ∈ A0 such that

    (∀λ ∈ R++)  φ( ‖uλ − u†‖/2 ) ≤ ( G(u†) + G(uλ) )/2 − G( (uλ + u†)/2 ).   (3.34)

Hence, arguing as in [21, Proof of Proposition 3.1(vi)] and using (v) and the weak lower semicontinuity of G, we obtain uλ → u† as λ → 0+.

Remark 3.14 If Argmin_F F ∩ dom G ≠ ∅, then S = Argmin_{dom G} F = Argmin_F F ∩ dom G and Argmin_S G = Argmin_{Argmin_F F} G (see [3, Theorem 2.6] for related results).

The following proposition provides an estimate of the growth of the function λ ↦ ‖uλ‖ as λ → 0+ when the condition Argmin_{dom G} F ≠ ∅ is possibly not satisfied.

Proposition 3.15 Suppose that Assumption 3.10 holds, that G is convex with modulus of total convexity ψ, and that there exists u ∈ F such that Argmin_F G ∩ dom F = {u}. Then

    (∀λ ∈ R++)  ‖uλ − u‖ ≤ ( ψ(u, ·) )♮( (F(u) − inf F(dom G) + ε(λ))/λ ).   (3.35)

Proof. Let λ ∈ R++. Since F(uλ) + λG(uλ) ≤ F(u) + λG(u) + ε(λ), we have

    G(uλ) − G(u) ≤ (F(u) − F(uλ) + ε(λ))/λ ≤ (F(u) − inf F(dom G) + ε(λ))/λ.   (3.36)

Hence, recalling (3.8) and noting that u ∈ Argmin_F G ⇔ 0 ∈ ∂G(u), we obtain ψ(u, ‖uλ − u‖) ≤ (F(u) − inf F(dom G) + ε(λ))/λ, and the claim follows.
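Proposition 3.13(vi) can be visualized on a toy instance in R² (illustrative, not from the paper): F(u) = (u1 + u2 − 2)² has a whole line S of minimizers, and with the totally convex regularizer G = ‖·‖² the Tikhonov path λ ↦ uλ selects the minimal-norm element argmin_S G = (1, 1) as λ → 0+. Plain gradient descent is adequate for this smooth problem:

```python
def F(u):
    # data-fit term with a whole line of minimizers: S = {u : u1 + u2 = 2}
    r = u[0] + u[1] - 2.0
    return r * r

def G(u):
    # totally convex regularizer ||u||^2; argmin_S G = (1, 1)
    return u[0] ** 2 + u[1] ** 2

def solve(lam, steps=5000, lr=0.05):
    # minimizer of F + lam*G, approximated by plain gradient descent
    u = [0.0, 0.0]
    for _ in range(steps):
        r = u[0] + u[1] - 2.0
        g0 = 2.0 * r + 2.0 * lam * u[0]
        g1 = 2.0 * r + 2.0 * lam * u[1]
        u = [u[0] - lr * g0, u[1] - lr * g1]
    return u

for lam in [1.0, 0.1, 0.01, 0.001]:
    u = solve(lam)
    print(lam, u)  # the path approaches (1, 1) = argmin_S G as lam -> 0+
```

By symmetry the exact regularized minimizer here is uλ = (2/(2+λ), 2/(2+λ)), which indeed converges strongly to (1, 1), in agreement with (vi).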
4 Learning in Banach spaces

Basic tools such as feature maps, reproducing kernel Hilbert spaces, and representer theorems have played an instrumental role in the development of hilbertian learning theory [42, 46]. In recent years, there has been a marked interest in extending these tools to Banach spaces; see for instance [28, 58, 60] and references therein. The primary objective of this section is to further develop the theory on these topics.
4.1 Banach spaces of vector-valued functions and feature maps Sampling-based nonparametric estimation naturally calls for formulations involving spaces of functions for which the pointwise evaluation operator is continuous. In the Hilbert space setting, this framework hinges on the notions of a reproducing kernel Hilbert space and of a feature map, which have been extensively investigated, e.g., in [17, 46]. On the other hand, the study of reproducing kernel Banach spaces has been developed primarily in [58, 60]. However, in the Banach space setting, the continuity of the pointwise evaluation operators, the existence of a kernel, and the existence of a feature map may no longer be equivalent and further investigation is in order. Towards this goal, we start with the following proposition which extends [17, Proposition 2.2]. Proposition 4.1 Let X be a nonempty set, let Y and F be separable real Banach spaces, and let A : F → YX be a linear operator. Then ran A, endowed with the norm k·k : ran A → R+ : Au 7→ inf v∈ker A ku − vk, is a Banach space, and the quotient operator of A is a Banach space isometry from F/ ker A onto ran A. Moreover, the following are equivalent: (i) There exists a map Λ : X → L (Y∗ , F∗ ) such that (∀u ∈ F)(∀x ∈ X )
(Au)(x) = Λ(x)∗ u.
(4.1)
(ii) A : F → YX is continuous for the topology of pointwise convergence on YX .
(iii) The topology of ran A thus constructed is stronger than that of pointwise convergence on YX . Proof. Set W = ran A and N = ker A. Since N is a closed vector subspace of F, the quotient space F/N is a Banach space with the quotient norm πN u 7→ kπN ukF/N = inf v∈N ku − vk, where πN : F → F/N : u 7→ u + N is the canonical projection. Let A˜ : F/N → YX be the unique map such that A = A˜ ◦ πN . Then A˜ is linear, injective, and ran A˜ = ran A. Thus, we endow W with the Banach ˜ i.e., for every u ∈ F, kAuk = kAπ ˜ N uk = kπN uk space structure transported from F/N by A, F/N . Moreover, for every x ∈ X , we define the point-evaluation operator evx : W → Y : f 7→ f (x). (i)⇒(ii): Let x ∈ X . Then, by (4.1), evx ◦A = Λ(x)∗ is continuous. (ii)⇒(i): Set Λ : X → L (Y∗ , F∗ ) : x 7→ (evx ◦A)∗ . (ii)⇒(iii): Denote by |·| the norm of Y. Let x ∈ X and f ∈ W. Then there exists u ∈ F such that f = Au, and hence (∀v ∈ ker A) |f (x)| = |(evx ◦A)(u + v)| 6 kΛ(x)∗ k ku + vk. Taking the infimum over v ∈ ker A, and recalling the definition of the quotient norm, we get |f (x)| 6 kΛ(x)kkπN ukF/N = kΛ(x)∗ kkAuk = kΛ(x)kkf k. Hence, evx : W → Y is continuous. (iii)⇒(ii): Let x ∈ X . Since A˜ : F/N → W is an isometry and evx : W → Y is continuous, ˜ evx ◦A˜ : F/N → Y is continuous. Therefore, by definition of the quotient topology, evx ◦A = evx ◦A◦ πN : F → Y is continuous. Definition 4.2 In the setting of Proposition 4.1, if A : F → YX is continuous for the topology of pointwise convergence on YX , then the unique map Λ defined in (i) is the feature map associated with A and F is the feature space. Remark 4.3 Equation (4.1) is equivalent to (∀u ∈ F)(∀x ∈ X )(∀y∗ ∈ Y∗ )
⟨u, Λ(x)y∗⟩ = ⟨(Au)(x), y∗⟩, (4.2)
which shows that A is injective if and only if the span of {Λ(x)y∗ | x ∈ X, y∗ ∈ Y∗} is dense in F∗.
Proposition 4.4 Let (X , AX , µ) be a σ-finite measure space, let Y and F be separable real Banach spaces, let A : F → YX be linear and continuous for the topology of pointwise convergence on YX , and let Λ : X → L (Y∗ , F∗ ) be the associated feature map. Then the following hold: (i) Λ : X → L (Y∗ , F∗ ) is strongly measurable if and only if ran A ⊂ M(X , Y).
(ii) Let p ∈ [1, +∞] and suppose that Λ ∈ Lp [X , µ; L (Y∗ , F∗ )]. Then ran A ⊂ Lp (X , µ; Y) and, for every u ∈ F, kAukp 6 kΛkp kuk.
Proof. (i): It follows from Pettis’ theorem [25, Theorem II.2] and (4.2) that Λ : X → L (Y∗ , F∗ ) is strongly measurable if and only if, for every u ∈ F, Au is measurable.
(ii): Let u ∈ F and note that, by (i), Au is measurable. Moreover, by (4.1), (∀ x ∈ X) |(Au)(x)| = |Λ(x)∗u| ≤ ‖Λ(x)‖ ‖u‖.
Definition 4.5 Let (X, AX) be a measurable space, let Y be a separable uniformly convex real Banach space, and let W be a vector space of bounded measurable functions from X to Y. Let C ⊂ M(X; Y) be a convex set.
(i) W is ∞-universal relative to C if, for every probability measure µ on (X, AX) and for every f ∈ C ∩ L∞(X, µ; Y), there exists (fn)n∈N ∈ (C ∩ W)^N such that sup_{n∈N} ‖fn‖∞ < +∞ and fn → f µ-a.e.
(ii) Let p ∈ [1, +∞[. The space W is p-universal relative to C if, for every probability measure µ on (X, AX), C ∩ W is dense in C ∩ L^p(X, µ; Y).
When C = M(X; Y) the reference to the set C is omitted. The following proposition shows that Definition 4.5 is an extension of the standard notion of universality in the context of reproducing kernel Hilbert spaces [46, Corollary 5.29], [18].
Proposition 4.6 Let (X, AX) be a measurable space, let Y be a separable uniformly convex real Banach space, and let W be a vector space of bounded measurable functions from X to Y. Let (C(x))x∈X be a family of closed convex subsets of Y containing 0, let C = {f ∈ M(X, Y) | (∀ x ∈ X) f(x) ∈ C(x)}, and let p ∈ [1, +∞[. Consider the following properties:
(a) W is ∞-universal relative to C.
(b) W is p-universal relative to C.
Then the following hold:
(i) Suppose that x ↦ C(x) is measurable [19]. Then (a)⇒(b).
(ii) Suppose that X is a locally compact Hausdorff space and let C0(X; Y) be the space of continuous functions from X to Y vanishing at infinity [11]. Suppose that W ⊂ C0(X; Y) and that x ↦ C(x) is continuous with respect to the Attouch-Wets topology [5, 9]. Consider the following property:
(c) C ∩ W is dense in C ∩ C0(X; Y) for the uniform topology.
Then (a)⇔(b)⇔(c).
Proof. (i): Suppose that (a) holds and let µ be a probability measure on (X , AX ). We have W ⊂ L∞ (X , µ; Y). We derive from (a) and the dominated convergence theorem that C ∩ W is dense in C ∩ L∞ (X , µ; Y) for the topology of Lp (X , µ; Y). Next, let f ∈ C ∩ Lp (X , µ; Y) and let ǫ ∈ R++ . Since L∞ (X , µ; Y) is dense in Lp (X , µ; Y) for the topology of Lp (X , µ; Y), there exists g ∈ L∞ (X , µ; Y) such that kf − gkp 6 ǫ/2. The function PC (g) : X → Y : x 7→ PC(x) (g(x))
(4.3)
is well defined [31, Proposition 3.2] and its measurability follows from the application of [19, Lemma III.39] with ϕ : X × Y → R : (x, y) 7→ −|y − g(x)| and Σ = C : X → 2Y . Then PC (g) ∈ C and, for every x ∈ X , since {0, f (x)} ⊂ C(x), ( |PC(x) (g(x))| 6 |PC(x) (g(x)) − g(x)| + |g(x)| 6 2|g(x)| (4.4) |PC(x) (g(x)) − f (x)| 6 |PC(x) (g(x)) − g(x)| + |g(x) − f (x)| 6 2|g(x) − f (x)|. Therefore PC (g) ∈ L∞ (X , µ; Y) and kPC (g) − f kp 6 2kf − gkp 6 ǫ. (ii): (c)⇒(a): Let µ be a probability measure on (X , AX ) and let f ∈ C ∩ L∞ (X , µ; Y). We denote by K (X ; Y) the space of continuous functions from X to Y with compact support. Since X is completely regular, we derive from Lusin’s theorem [26, Corollary 1 in III.§15.8] and Urysohn’s lemma, that there exists a sequence (gn )n∈N in K (X ; Y) such that gn → f µ-a.e. and supn∈N kgn k∞ 6 kf k∞ . Let n ∈ N and define the function PC (gn ) : X → Y : x 7→ PC(x) (gn (x)). Let us prove that PC (gn ) is continuous. Let x0 ∈ X . Since limx→x0 C(x) = C(x0 ) in the Attouch-Wets topology, there exist a neighborhood U1 of x0 and t ∈ R++ such that, for every x ∈ U1 , inf |C(x)| < t. Moreover there exist a neighborhood U2 of x0 and q ∈ R++ such that, for every x ∈ U2 , gn (x) ∈ B(q). Now, fix r ∈ [3q + t, +∞[. Then, for every x ∈ U1 ∩ U2 , since r > 3q + inf |C(x)|, it follows from [38, Corollary 3.3 and Theorem 4.1] that |PC (gn )(x) − PC (gn )(x0 )|
≤ |PC(x)(gn(x)) − PC(x)(gn(x0))| + |PC(x)(gn(x0)) − PC(x0)(gn(x0))|
≤ φ♮(2r|gn(x) − gn(x0)|) + |gn(x) − gn(x0)| + φ♮(2r haus_{2q+t}(C(x), C(x0))), (4.5)
where φ ∈ A0 is the modulus of uniform monotonicity of the normalized duality map of Y on B(r) and, for every ρ ∈ R++, haus_ρ is the ρ-Hausdorff distance [9]. Hence, since lim_{x→x0} haus_{2q+t}(C(x), C(x0)) = 0, lim_{x→x0} |gn(x) − gn(x0)| = 0, and lim_{s→0+} φ♮(s) = 0 by Proposition 3.1(v), the continuity of PC(gn) at x0 follows. In addition, since 0 ∈ ∩_{x∈X} C(x), the support of PC(gn) is contained in that of gn. Therefore, for every n ∈ N, PC(gn) ∈ C ∩ K(X; Y), ‖PC(gn)‖∞ ≤ 2‖gn‖∞ and (∀ x ∈ X) |PC(x)(gn(x)) − f(x)| ≤ 2|gn(x) − f(x)|. Hence PC(gn) → f µ-a.e. It follows from (c) that, for every n ∈ N, there exists fn ∈ C ∩ W such that ‖fn − PC(gn)‖∞ ≤ 1/(n + 1). Therefore sup_{n∈N} ‖fn‖∞ ≤ sup_{n∈N}(1 + ‖PC(gn)‖∞) ≤ 1 + 2‖f‖∞ and fn → f µ-a.e.
(b)⇒(c): We follow the same reasoning as in the proof of [18, Theorem 4.1]. By contradiction, suppose that C ∩ W is not dense in C ∩ C0(X; Y). Since C ∩ W is nonempty and convex, by the Hahn–Banach theorem, there exist f0 ∈ C ∩ C0(X; Y), ϕ ∈ C0(X; Y)∗, and α ∈ R such that
(∀ f ∈ C ∩ W) ϕ(f) < α < ϕ(f0).
(4.6)
Now, by [26, Corollary 2 and Theorem 5 in III.§19.3], there exist a probability measure µ on X and a function h ∈ L∞(X, µ; Y∗) such that
(∀ f ∈ C0(X; Y)) ϕ(f) = ∫_X ⟨f(x), h(x)⟩ dµ(x). (4.7)
Since ϕ ≠ 0, we have h ≠ 0. Moreover h ∈ L^{p∗}(X, µ; Y∗). Therefore
(∀ f ∈ C ∩ W) ⟨f, h⟩_{p,p∗} < α < ⟨f0, h⟩_{p,p∗}. (4.8)
Let H^α_− = {f ∈ L^p(X, µ; Y) | ⟨f, h⟩_{p,p∗} ≤ α}. Then H^α_− is a closed half-space of L^p(X, µ; Y). Therefore, by (4.8), C ∩ W ⊂ H^α_− and f0 ∉ H^α_−. Hence, C ∩ W is not dense in C ∩ L^p(X, µ; Y).
Definition 4.7 Let X be a nonempty set and let Y be a separable real Banach space. Let W be a reflexive, strictly convex, and smooth Banach space of functions from X to Y. Then W is a reproducing kernel Banach space if, for every x ∈ X, the point-evaluation operator evx : W → Y : f ↦ f(x) is continuous.
In the next proposition we show that, in the Banach space setting, the duality map (see Section 2) is instrumental to properly define a kernel.
Proposition 4.8 Under the assumptions of Proposition 4.1, let Λ : X → L(Y∗, F∗) be defined by (4.1) and set W = ran A. Let B(Y∗, Y) be the set of operators mapping bounded subsets of Y∗ into bounded subsets of Y. Suppose that F is reflexive, strictly convex, and smooth, and let p ∈ ]1, +∞[. Then W is a reproducing kernel Banach space and there exists a unique Kp : X × X → B(Y∗, Y), called kernel, such that, for every u ∈ F, x ∈ X, and y∗ ∈ Y∗,
Kp(x, ·)y∗ ∈ W and ⟨Au, JW,p(Kp(x, ·)y∗)⟩ = ⟨(Au)(x), y∗⟩. (4.9)
Moreover, we have
(∀ x ∈ X)(∀ x′ ∈ X) Kp(x, x′) = Λ(x′)∗ ◦ J⁻¹_{F,p} ◦ Λ(x). (4.10)
Proof. Let N = ker A. Proposition 4.1 implies that W is isometrically isomorphic to F/N. Define
Kp : X × X → B(Y∗, Y) : (x, x′) ↦ Λ(x′)∗ ◦ J⁻¹_{F,p} ◦ Λ(x). (4.11)
Then (4.1) yields
(∀ x ∈ X)(∀ y∗ ∈ Y∗) Kp(x, ·)y∗ = A J⁻¹_{F,p}(Λ(x)y∗). (4.12)
Since F is reflexive, strictly convex, and smooth, F/N and W are likewise. Defining Ã and πN as in the proof of Proposition 4.1, we have Ã∗ ◦ JW,p ◦ Ã = JF/N,p and JF,p = π∗N ◦ JF/N,p ◦ πN. Hence, A∗ ◦ JW,p ◦ A = JF,p. Therefore, it follows from (4.12) and (4.1) that, for every (x, u) ∈ X × F,
(∀ y∗ ∈ Y∗) ⟨Au, JW,p(Kp(x, ·)y∗)⟩ = ⟨Au, JW,p(A J⁻¹_{F,p}(Λ(x)y∗))⟩ = ⟨u, Λ(x)y∗⟩ (4.13)
= ⟨(Au)(x), y∗⟩. (4.14)
Finally, if a kernel satisfies (4.9), it satisfies (4.13) and hence (4.12), and thus coincides with Kp.
Remark 4.9 (i) Equation (4.9) is a representation formula, meaning that the values of the functions in W can be computed in terms of the kernel Kp, which is said to be associated with the feature map Λ.
(ii) Definition 4.7 is more general than [60, Definition 2.2], since the latter requires that both F and Y be uniformly convex and uniformly smooth. Thus, Proposition 4.8 extends [60, Theorems 2.3 and 3.1]. Moreover, in Proposition 4.8, the kernel is built from a feature map and a general p-duality map, which results in a more general setting than that of [58, 60]. We emphasize that, when dealing with kernels in Banach spaces, there is no reason to restrict oneself to the normalized duality map. Rather, allowing general p-duality maps usually makes the computation of the kernel easier, as the following two examples show.
Remark 4.10 In the setting of Proposition 4.8, consider the scalar case Y = R [58]. Then, for every x ∈ X, Λ(x)∗ ∈ F∗ and the kernel becomes
Kp : X × X → R : (x, x′) ↦ ⟨J⁻¹_{F,p}(Λ(x)∗), Λ(x′)∗⟩. (4.15)
Moreover, for every x ∈ X, Kp(x, ·) = A[J⁻¹_{F,p}(Λ(x)∗)], and the representation formula (4.9) turns into
(∀ u ∈ F)(∀ x ∈ X) ⟨Au, JW,p(Kp(x, ·))⟩ = (Au)(x). (4.16)
It follows from the definitions of Kp and JF,p that
(∀ (x, x′) ∈ X × X) Kp(x, x) = ‖Λ(x)∗‖^{p∗} and |Kp(x, x′)| ≤ Kp(x, x)^{1/p} Kp(x′, x′)^{1/p∗}. (4.17)
Example 4.11 (generalized linear model) Let X be a nonempty set, let Y be a separable real Banach space with norm |·|, let K be a nonempty countable set, and let r ∈ [1, +∞[. Let (φk)k∈K be a family of functions from X to Y, which, in this context, is usually called a dictionary [23, 45]. Assume that, for every x ∈ X, (φk(x))k∈K ∈ l^{r∗}(K; Y) and denote by ‖(φk(x))k∈K‖_{r∗} its norm in l^{r∗}(K; Y). Set
A : l^r(K) → Y^X : u = (µk)k∈K ↦ Σ_{k∈K} µk φk (pointwise). (4.18)
Let x ∈ X. By Hölder's inequality, for every u ∈ F, |(Au)(x)| ≤ ‖u‖_r ‖(φk(x))k∈K‖_{r∗}, which implies that evx ◦ A is continuous. Therefore, Proposition 4.1 ensures that
ran A = {f ∈ Y^X | (∃ u ∈ l^r(K))(∀ x ∈ X) f(x) = Σ_{k∈K} µk φk(x)} (4.19)
can be endowed with a Banach space structure for which the point-evaluation operators are continuous. Moreover
ker A = {u ∈ l^r(K) | (∀ x ∈ X) Σ_{k∈K} µk φk(x) = 0} (4.20)
and, for every u ∈ l^r(K), ‖Au‖ = inf_{v∈ker A} ‖u − v‖_r. Hence, for every f ∈ ran A,
‖f‖ = inf{‖u‖_r | u ∈ l^r(K) and (∀ x ∈ X) f(x) = Σ_{k∈K} µk φk(x)}. (4.21)
Let us compute the feature map Λ : X → L(Y∗, l^{r∗}(K)). Let x ∈ X, let y∗ ∈ Y∗, and denote by ⟨·, ·⟩_{r,r∗} the canonical pairing between l^r(K) and l^{r∗}(K). Then, for every u ∈ l^r(K),
⟨u, Λ(x)y∗⟩_{r,r∗} = ⟨Λ(x)∗u, y∗⟩ = ⟨(Au)(x), y∗⟩ = Σ_{k∈K} µk ⟨φk(x), y∗⟩, (4.22)
which gives Λ(x)y∗ = (⟨φk(x), y∗⟩)k∈K. Since L(Y∗, l^{r∗}(K)) and l^{r∗}(K; Y) are isomorphic Banach spaces, the feature map can be identified with
Λ : X → l^{r∗}(K; Y) : x ↦ (φk(x))k∈K. (4.23)
We remark that ran A is p-universal if, for every probability measure µ on (X, AX), the span of (φk)k∈K is dense in L^p(X, µ; Y). Now suppose that r > 1. Since l^r(K) is reflexive, strictly convex, and smooth, Proposition 4.8 asserts that ran A is a reproducing kernel Banach space and that the underlying kernel Kp : X × X → B(Y∗, Y) can be computed explicitly. Indeed, [20, Proposition 4.9] implies that the r-duality map of l^r(K) is
Jr : l^r(K) → l^{r∗}(K) : u = (µk)k∈K ↦ (|µk|^{r−1} sign(µk))k∈K. (4.24)
Moreover, Jr⁻¹ : l^{r∗}(K) → l^r(K) is the r∗-duality map of l^{r∗}(K) (hence it has the same form as (4.24) with r replaced by r∗). Thus, for every (x, x′) ∈ X × X and every y∗ ∈ Y∗,
Kr(x, x′)y∗ = Λ(x′)∗ Jr⁻¹(Λ(x)y∗) = Σ_{k∈K} |⟨φk(x), y∗⟩|^{r∗−1} sign(⟨φk(x), y∗⟩) φk(x′). (4.25)
In the scalar case Y = R, this becomes
Kr(x, x′) = ⟨Jr⁻¹(Λ(x)), Λ(x′)⟩_{r,r∗} = Σ_{k∈K} |φk(x)|^{r∗−1} sign(φk(x)) φk(x′). (4.26)
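As a concrete check of (4.24)–(4.26), the scalar kernel of a finite dictionary can be evaluated numerically. The sketch below is an illustration, not part of the paper's argument; the three-element dictionary `phis` is a hypothetical choice, and for r = 2 the duality map is the identity, so the kernel reduces to the familiar symmetric kernel Σk φk(x)φk(x′).

```python
import numpy as np

# Hypothetical dictionary (phi_k): any finitely many real-valued functions on X.
phis = [np.sin, np.cos, lambda x: np.exp(-x ** 2)]

def Lam(x):
    """Feature map of Example 4.11 in the scalar case Y = R: Lam(x) = (phi_k(x))_k."""
    return np.array([phi(x) for phi in phis])

def J_inv(v, r):
    """Inverse of the r-duality map of l^r, i.e. the r*-duality map of l^{r*}:
    v -> (|v_k|^(r*-1) sign(v_k))_k, with 1/r + 1/r* = 1 (cf. (4.24))."""
    rstar = r / (r - 1.0)
    return np.abs(v) ** (rstar - 1.0) * np.sign(v)

def K(x, xp, r):
    """Scalar kernel (4.26): K_r(x, x') = <J_r^{-1}(Lam(x)), Lam(x')>."""
    return float(J_inv(Lam(x), r) @ Lam(xp))
```

Note that K(x, x) = ‖Lam(x)‖_{r∗}^{r∗}, in accordance with (4.17), while for r ≠ 2 the kernel is in general nonsymmetric.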
Example 4.12 (Sobolev spaces) Let (d, k, m) ∈ (N ∖ {0})³ and let p ∈ ]1, +∞[. Let X ⊂ R^d be a nonempty open bounded set with regular boundary and consider the Sobolev space W^{m,p}(X; R^k), normed with ‖·‖_{m,p} : f ↦ (Σ_{α∈N^d, |α|≤m} ‖D^α f‖_p^p)^{1/p}. Recall that, if mp > d, then W^{m,p}(X; R^k) is continuously embedded in C(X; R^k) [1]. Therefore
(∃ β ∈ R++)(∀ x ∈ X)(∀ f ∈ W^{m,p}(X; R^k)) |f(x)| ≤ ‖f‖∞ ≤ β‖f‖_{m,p}. (4.27)
Moreover W^{m,p}(X; R^k) is isometrically isomorphic to a closed vector subspace of [L^p(X; R^k)]^n, for a suitable n ∈ N, normed with ‖·‖_p : (f1, …, fn) ↦ (Σ_{i=1}^n ‖fi‖_p^p)^{1/p}. Therefore, W^{m,p}(X; R^k) is uniformly convex and smooth (with the same moduli of convexity and smoothness as L^p). This shows that W^{m,p}(X; R^k) is a reproducing kernel Banach space and also that the associated feature map Λ is bounded. Likewise, W0^{m,p}(X; R^k) is a reproducing kernel Banach space endowed with the norm ‖∇·‖_p, where this time ∇ : W0^{m,p}(X; R^k) → L^p(X; R^{k×d}) is an isometry. For simplicity, we address the computation of the kernel for the space W0^{1,p}(X; R). In this case, the p-duality map is
(1/p) ∂‖∇·‖_p^p = −∆_p : W0^{1,p}(X; R) → (W0^{1,p}(X; R))∗, (4.28)
where ∆_p is the p-Laplacian operator [4, Section 6.6]. Therefore, it follows from (4.15) that
(∀ (x, x′) ∈ X²) Kp(x, x′) = u(x′), where u ≠ 0 and −∆_p u = evx. (4.29)
In the case when X = [0, 1], the kernel can be computed explicitly as follows:
(∀ (x, x′) ∈ X²) Kp(x, x′) = (1 − x)x′/(x^{p−1} + (1 − x)^{p−1})^{1/(p−1)} if x′ ≤ x, and Kp(x, x′) = (1 − x′)x/(x^{p−1} + (1 − x)^{p−1})^{1/(p−1)} if x′ > x. (4.30)
Finally, using a mollifier argument [1, Theorem 2.29], W0^{m,p}(X; R)+ is dense in C0(X; R)+. Hence, by Proposition 4.6, W0^{m,p}(X; R) is universal relative to the cone of R+-valued functions.
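The closed form (4.30) lends itself to a quick numerical sanity check, given here as an illustration only: for p = 2 the denominator equals 1 and the kernel reduces to the Green's function min(x, x′)(1 − max(x, x′)) of −u″ on [0, 1] with Dirichlet boundary conditions.

```python
import numpy as np

def K_kernel(x, xp, p):
    """Kernel (4.30) of W_0^{1,p}([0,1]; R) (illustrative numerical check)."""
    denom = (x ** (p - 1) + (1 - x) ** (p - 1)) ** (1.0 / (p - 1))
    return (1 - x) * xp / denom if xp <= x else (1 - xp) * x / denom

# For p = 2 this is the Green's function of -u'' with Dirichlet conditions.
def green(x, xp):
    return min(x, xp) * (1 - max(x, xp))
```

For any p the function Kp(x, ·) is piecewise linear, nonnegative, and vanishes at both endpoints, as expected of a solution of (4.29).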
Remark 4.13 Proposition 4.8 and the results pertaining to the computation of the kernel are of interest in their own right. Note, however, that they will not be directly exploited subsequently since in the main results of Section 5.1 knowledge of a kernel will turn out not to be indispensable.
4.2 Representer and sensitivity theorems in Banach spaces
In the classical setting, a representer theorem states that a minimizer of a Tikhonov regularized empirical risk function defined over a reproducing kernel Hilbert space can be represented as a finite linear combination of the feature map values on the training points [42]. The investigation in Banach spaces was initiated in [37] and continued in [59]. In this section representer theorems are established in the general context of Banach spaces, totally convex regularizers, vector-valued functions, and approximate minimization. These contributions capture and extend existing results. Moreover, we study the sensitivity of such representations with respect to perturbations of the probability distribution on X × Y.
Definition 4.14 Let X and Y be two nonempty sets, let (X × Y, A, P) be a complete probability space, and let PX be the marginal probability measure of P on X. Let Y be a separable reflexive real Banach space with norm |·| and Borel σ-algebra BY. Υ(X × Y × Y) is the set of functions ℓ : X × Y × Y → R+ such that ℓ is measurable with respect to the tensor product σ-algebra A ⊗ BY and, for every (x, y) ∈ X × Y, ℓ(x, y, ·) : Y → R is continuous and convex. A function in Υ(X × Y × Y) is a loss. The risk associated with ℓ ∈ Υ(X × Y × Y) and P is
R : M(X, Y) → [0, +∞] : f ↦ ∫_{X×Y} ℓ(x, y, f(x)) P(d(x, y)). (4.31)
In addition,
(i) given p ∈ [1, +∞[, Υp (X × Y × Y, P ) is the set of functions ℓ ∈ Υ(X × Y × Y) such that (∃ b ∈ L1 (X × Y, P ; R))(∃ c ∈ R+ )(∀(x, y, y) ∈ X × Y × Y) ℓ(x, y, y) 6 b(x, y)+ c|y|p ; (4.32) (ii) Υ∞ (X × Y × Y, P ) is the set of functions ℓ ∈ Υ(X × Y × Y) such that (∀ρ ∈ R++ )(∃ gρ ∈ L1 (X × Y, P ; R))
(∀ (x, y) ∈ X × Y)(∀y ∈ B(ρ)) ℓ(x, y, y) 6 gρ (x, y); (4.33)
(iii) ΥY,loc (X × Y × Y) is the set of functions ℓ ∈ Υ(X × Y × Y) such that (∀ρ ∈ R++ )(∃ ζρ ∈ R++ )
(∀(x, y) ∈ X × Y)(∀(y, y′ ) ∈ B(ρ)2 )
|ℓ(x, y, y) − ℓ(x, y, y′ )| 6 ζρ |y − y′ |. (4.34)
Remark 4.15 (i) The properties defining the classes of losses introduced in Definition 4.14 arise in the calculus of variations [29]. Let p ∈ [1, +∞] and suppose that ℓ ∈ Υp(X × Y × Y, P). Then the risk (4.31) is real-valued on L^p(X, PX; Y). Moreover, since for every (x, y) ∈ X × Y, ℓ(x, y, ·) is convex and continuous, R : L^p(X, PX; Y) → R+ is convex and continuous [29, Corollaries 6.51 and 6.53].
(ii) If ℓ ∈ Υp (X × Y × Y, P ) then ℓ(x, y, ·) is bounded on bounded sets. Hence, by Proposition A.1(ii), ℓ(x, y, ·) is Lipschitz continuous relative to bounded sets. (iii) If q ∈ [p, +∞], then Υp (X × Y × Y, P ) ⊂ Υq (X × Y × Y, P ).
(iv) Suppose that ℓ ∈ ΥY,loc(X × Y × Y) and that there exists f ∈ L∞ (X , PX ; Y) such that R(f ) < +∞. Then ℓ ∈ Υ∞ (X × Y × Y, P ) and (i) implies that R : L∞ (X , PX ; Y) → R+ is convex and continuous. (v) The following are consequences of Propositions A.1(ii) and A.2(ii): (a) Suppose that ℓ ∈ Υ1 (X × Y × Y, P ) and let c ∈ R+ be as in Definition 4.14(i). Then ℓ ∈ ΥY,loc(X × Y × Y) and supρ∈R++ ζρ 6 c. Hence ℓ is Lipschitz continuous in the third variable, uniformly with respect to the first two. Moreover, in this case, the inequality in (4.32) is true with b = ℓ(·, ·, 0).
(b) Let p ∈ ]1, +∞[, let ℓ ∈ Υp (X × Y × Y, P ), and suppose that the inequality in (4.32) holds with b bounded and some c ∈ R+ . Then ℓ ∈ ΥY,loc (X × Y × Y) and ℓ(·, ·, 0) is bounded. Moreover, for every ρ ∈ R++ , ζρ 6 (p − 1)kbk∞ + 3cp max{1, ρ p−1 }. (c) Let ℓ ∈ Υ∞ (X × Y × Y, P ). Then the functions (gρ )ρ∈R++ in (4.33) belong to L∞ (P ) if and only if ℓ ∈ ΥY,loc (X × Y × Y) and ℓ(·, ·, 0) is bounded. In this case, for every ρ ∈ R++ , ζρ 6 2kgρ+1 k∞ .
Example 4.16 (L^p-loss) Consider the setting of Definition 4.14 and let p ∈ [1, +∞[. Suppose that Y ⊂ Y, that ∫_{X×Y} |y|^p P(d(x, y)) < +∞, and that
(∀ (x, y, ŷ) ∈ X × Y × Y) ℓ(x, y, ŷ) = |ŷ − y|^p. (4.35)
Then ℓ ∈ Υp(X × Y × Y, P). Moreover, suppose that Y is bounded and set β = sup_{y∈Y} |y|. Then ℓ ∈ ΥY,loc(X × Y × Y) and (∀ ρ ∈ R++) ζρ ≤ p(ρ + β)^{p−1}. Indeed, the case p = 1 is straightforward. If p > 1, it follows from (A.8) that, for every y ∈ Y and every (ŷ, ŷ′) ∈ Y², ||ŷ − y|^p − |ŷ′ − y|^p| ≤ p max{|ŷ − y|^{p−1}, |ŷ′ − y|^{p−1}} |ŷ − ŷ′|. Therefore, for every (ŷ, ŷ′) ∈ B(ρ)² and every y ∈ Y, ||ŷ − y|^p − |ŷ′ − y|^p| ≤ p(ρ + β)^{p−1} |ŷ − ŷ′|.
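The Lipschitz bound of Example 4.16 can be probed empirically. The sketch below (an illustration; the constants p, β, ρ are arbitrary choices) samples difference quotients of the p-loss over the ball B(ρ) and checks that none exceeds p(ρ + β)^{p−1}.

```python
import numpy as np

rng = np.random.default_rng(0)
p, beta, rho = 3.0, 1.0, 2.0        # |y| <= beta on Y, predictions in B(rho)

def loss(y, v):
    """The p-loss of Example 4.16 (the x-variable plays no role here)."""
    return abs(v - y) ** p

zeta_bound = p * (rho + beta) ** (p - 1)   # claimed bound on zeta_rho

# Sample difference quotients of v -> loss(y, v) on B(rho).
worst = 0.0
for _ in range(20000):
    y = rng.uniform(-beta, beta)
    v, vp = rng.uniform(-rho, rho, size=2)
    if v != vp:
        worst = max(worst, abs(loss(y, v) - loss(y, vp)) / abs(v - vp))
```

By the mean value theorem the sampled quotients approach, but never exceed, the stated constant.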
Now we propose a general representer theorem which involves the feature map from Definition 4.2.
Theorem 4.17 (representer) Let X and Y be two nonempty sets, let (X × Y, A, P ) be a complete probability space, and let PX be the marginal probability measure of P on X . Let Y be a separable reflexive real Banach space with norm |·|, let F be a separable reflexive real Banach space, let A : F → M(X , Y) be linear and continuous with respect to pointwise convergence on YX , and let Λ be the associated feature map. Let p ∈ [1, +∞], let ℓ ∈ Υp (X × Y × Y, P ), let R be the risk associated with ℓ and P , and suppose that Λ ∈ Lp [X , PX ; L (Y∗ , F∗ )]. Set F = R ◦ A, let G ∈ Γ+ 0 (F), let λ ∈ R++ , let ǫ ∈ R+ , and suppose that uλ ∈ F satisfies inf k∂(F + λG)(uλ )k 6 ǫ.
(4.36)
Then there exists hλ ∈ L^{p∗}(X × Y, P; Y∗) such that
(∀ (x, y) ∈ X × Y) hλ(x, y) ∈ ∂_Y ℓ(x, y, (Auλ)(x)) (4.37)
and
(∃ e∗ ∈ F∗) ‖e∗‖ ≤ ǫ and e∗ − EP(Λhλ) ∈ λ∂G(uλ), (4.38)
where Λhλ : X × Y → F∗ : (x, y) ↦ Λ(x)hλ(x, y) and, for every (x, y, y) ∈ X × Y × Y, ∂_Y ℓ(x, y, y) = ∂ℓ(x, y, ·)(y). Moreover, the following hold:
(i) Suppose that p ≠ +∞. Let (b, c) be as in Definition 4.14(i). If p = 1, then ‖hλ‖∞ ≤ c; if p > 1, then ‖hλ‖1 ≤ (p − 1)‖b‖1 + 3pc(1 + ‖Λ‖_p^{p−1} ‖uλ‖^{p−1}).
(ii) Suppose that p = +∞, that ℓ ∈ ΥY,loc(X × Y × Y), and let ρ ∈ ]‖uλ‖, +∞[. Then hλ ∈ L∞(X × Y, P; Y∗) and ‖hλ‖∞ ≤ ζ_{ρ‖Λ‖∞}.
Proof. Set
Ψ : L^p(X × Y, P; Y) → [0, +∞] : g ↦ ∫_{X×Y} ℓ(z, g(z)) P(dz). (4.39)
Since ℓ ∈ Υp(X × Y × Y, P), Ψ is real-valued and convex. Place L^p(X × Y, P; Y) and L^{p∗}(X × Y, P; Y∗) in duality by means of the pairing
⟨·, ·⟩_{p,p∗} : (g, h) ↦ ∫_{X×Y} ⟨g(z), h(z)⟩ P(dz). (4.40)
From now on, we denote by L^p and L^{p∗} the above cited Lebesgue spaces, endowed with the weak topologies σ(L^p, L^{p∗}) and σ(L^{p∗}, L^p) derived from the duality (4.40). Moreover, since ℓ ≥ 0, it follows from [41, Theorem 21(c)-(d)] that Ψ : L^p → R is lower semicontinuous and
(∀ g ∈ L^p) ∂Ψ(g) = {h ∈ L^{p∗} | h(z) ∈ ∂_Y ℓ(z, g(z)) for P-a.a. z ∈ X × Y}. (4.41)
Next, since Λ ∈ L^p[X, PX; L(Y∗, F∗)], it follows from Proposition 4.4(ii) that A : F → L^p(X, PX; Y) is continuous. Therefore the map Â : F → L^p defined by
(∀ u ∈ F) Âu : X × Y → Y : (x, y) ↦ (Au)(x) (4.42)
is linear and continuous. Moreover,
(∀ u ∈ F)(∀ h ∈ L^{p∗}) ⟨Âu, h⟩_{p,p∗} = ∫_{X×Y} ⟨(Au)(x), h(x, y)⟩ P(d(x, y)) = ∫_{X×Y} ⟨u, Λ(x)h(x, y)⟩ P(d(x, y)) = ⟨u, EP(Λh)⟩. (4.43)
Using (4.36) and [57, Theorem 2.8.3(vii)], there exists e∗ ∈ B(ε) such that e∗ ∈ ∂(F + ∗ λG)(uλ ) = ∂F (uλ ) + λ∂G(uλ ). Hence, in view of (4.44), there exists hλ ∈ Lp satisfying hλ (x, y) ∈ ∂ Y ℓ(x, y, (Auλ )(x)) for P -a.a. (x, y) ∈ X × Y and e∗ − EP [Λhλ ] ∈ λ∂G(uλ ). Since P is complete, and for every (x, y) ∈ X × Y, dom ∂ Y ℓ(x, y, ·) 6= ∅, we can modify hλ so that hλ (x, y) ∈ ∂ Y ℓ(x, y, (Auλ )(x)) holds for every (x, y) ∈ X × Y. (i): Let (x, y) ∈ X × Y. Since hλ (x, y) ∈ ∂ Y ℓ(x, y, (Auλ )(x)), |(Auλ )(x)| = |Λ(x)∗ uλ | 6 kΛ(x)kkuλ k .
(4.45)
By Definition 4.14(i), there exists b ∈ L1 (X × Y, P ; R)+ and c ∈ R++ such that (∀ y ∈ Y) ℓ(x, y, y) 6 b(x, y) + c|y|p .
(4.46)
Therefore, it follows from Proposition A.2 and (4.45) that, if p = 1, we have |hλ (x, y)| 6 c and, if p > 1, we have |hλ (x, y)| 6 (p − 1)b(x, y) + 3pc(kΛ(x)kp−1 kuλ kp−1 + 1).
(4.47)
p−1 Hence, using Jensen’s inequality, khλ k1 6 (p − 1)kbk1 + 3cp(1 + kΛkp−1 ). p kuλ k
(ii): Let (x, y) ∈ X × Y be such that kΛ(x)k 6 kΛk∞ , and set τ = ρkΛk∞ . We assume τ > 0. Then (4.45) yields |(Auλ )(x)| < τ . Thus, since B(τ ) is a neighborhood of (Auλ )(x) in Y, ℓ(x, y, ·) is Lipschitz continuous relative to B(τ ), with Lipschitz constant ζτ and hλ (x, y) ∈ ∂ Y ℓ(x, y, (Auλ )(x)), Proposition A.1(i) gives |hλ (x, y)| 6 ζτ . Remark 4.18 (i) Condition (4.36) is a relaxation of the characterization of uλ as an exact minimizer of F + λG via Fermat’s rule, namely 0 ∈ ∂(F + λG)(uλ ). (ii) In Theorem 4.17, let additionally F1 and F2 be separable reflexive real Banach spaces, let A1 : F1 → M(X , Y) and A2 : F1 → M(X , Y) be two linear operators which are continuous with respect to pointwise convergence on YX , let Λ1 : X → L (Y∗ , F1∗ ) and Λ2 : X → L (Y∗ , F2∗ ) be the feature maps associated with A1 and A2 respectively, and let G1 ∈ Γ+ 0 (F1 ). Suppose that F = F1 × F2 , that ǫ = 0, and that (∀u = (u1 , u2 ) ∈ F1 × F2 ) Au = A1 u1 + A2 u2
and G(u) = G1 (u1 ).
(4.48)
Then, setting uλ = (u1,λ , u2,λ ), (4.37) and (4.38) yield
and
(∀ (x, y) ∈ X × Y) hλ (x, y) ∈ ∂ Y ℓ x, y, (A1 u1,λ )(x) + (A2 u2,λ )(x) −EP (Λ1 hλ ) ∈ λ∂G1 (u1,λ ) and EP (Λ2 hλ ) = 0.
(4.49)
(4.50)
This gives a representer theorem with offset space F2 . If we assume further that F1 and F2 are reproducing kernel Hilbert spaces of scalar functions, that G1 = k · k2 , and that p < +∞, the resulting special case of (4.49) and (4.50) appears in [24, Theorem 2].
Corollary 4.19 In Theorem 4.17, make the additional assumption that F is strictly convex and smooth, that there exists a convex even function ϕ : R → R+ vanishing only at 0 such that G = ϕ ◦ k·k,
(4.51)
and that uλ ≠ 0. Then there exist e∗ ∈ F∗, hλ ∈ L^{p∗}(X × Y, P; Y∗), and ξ(uλ) ∈ ∂ϕ(‖uλ‖) such that ‖e∗‖ ≤ ǫ, (4.37) holds, and
JF(uλ) = (‖uλ‖/(λξ(uλ)))(e∗ − EP[Λhλ]). (4.52)
Proof. Note that ∂ϕ(R++) ⊂ R++ since ϕ is strictly increasing on R++. It follows from Theorem 4.17 that there exist hλ ∈ L^{p∗}(X × Y, P; Y∗) and e∗ ∈ F∗ such that (4.37) and (4.38) hold. Next, we prove that
(∀ u ∈ F) ∂G(u) = {u∗ ∈ F∗ | ⟨u, u∗⟩ = ‖u‖‖u∗‖ and ‖u∗‖ ∈ ∂ϕ(‖u‖)}. (4.53)
It follows from [6, Example 13.7] that, for every u∗ ∈ F∗ , G∗ (u∗ ) = ϕ∗ (ku∗ k). Moreover, the Fenchel-Young identity entails that, for every (u, u∗ ) ∈ F × F∗ , we have u∗ ∈ ∂G(u) ⇔ ϕ(kuk) + ϕ∗ (ku∗ k) = hu, u∗ i
⇔ ⟨u, u∗⟩ = ‖u‖‖u∗‖ and ‖u∗‖ ∈ ∂ϕ(‖u‖). (4.54)
Set u∗λ = (e∗ − EP(Λhλ))/λ. Since uλ ∉ {0} = Argmin_F G = {u ∈ F | 0 ∈ ∂G(u)} and u∗λ ∈ ∂G(uλ), we have u∗λ ≠ 0. Now put v∗λ = ‖uλ‖u∗λ/‖u∗λ‖; then (4.53) yields ⟨uλ, v∗λ⟩ = ‖uλ‖² and ‖u∗λ‖ ∈ ∂ϕ(‖uλ‖). Moreover, ‖v∗λ‖ = ‖uλ‖. Hence, (2.7) yields v∗λ = JF(uλ) and (4.52) follows.
Remark 4.20 Let r ∈ [1, +∞[ and let ϕ = |·|^r in Corollary 4.19. Then (4.52) specializes to JF(uλ) =
(e∗ − EP(Λhλ))/(λr‖uλ‖^{r−2}). (4.55)
If F is a Hilbert space and r = 2, we obtain the representation
uλ = (1/(2λ))(e∗ − EP(Λhλ)), (4.56)
which was first obtained in [24, Theorem 2].
Remark 4.21 In Corollary 4.19, note that G is strictly quasiconvex and coercive since F is strictly convex and ϕ is convex and strictly increasing on R+. Now, let ǫ = 0 and let P = n⁻¹ Σ_{i=1}^n δ_{(xi,yi)} be the empirical probability measure associated with the sample (xi, yi)_{1≤i≤n} ∈ (X × Y)^n. In this context, we obtain a representation for the solution uλ to the regularized empirical risk minimization problem
minimize_{u∈F} (1/n) Σ_{i=1}^n ℓ(xi, yi, Au(xi)) + λϕ(‖u‖). (4.57)
Indeed, (4.52) reduces to
JF(uλ) = Σ_{i=1}^n Λ(xi)y∗i, where (∀ i ∈ {1, …, n}) y∗i = −(‖uλ‖/(nλξ(uλ))) hλ(xi, yi) ∈ Y∗. (4.58)
Thus, uλ can be expressed as a linear combination of the feature vectors (Λ(xi))_{1≤i≤n}, for some vector coefficients (y∗i)_{1≤i≤n} ∈ (Y∗)^n. This covers the classical setting of representer theorems in scalar-valued Banach spaces of functions [59, Theorem 3] and improves on the vector-valued case [60, Theorem 5.7], since Y can be infinite-dimensional.
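In the Hilbert setting of Remark 4.20 (r = 2, square loss), the representer property (4.58) can be verified numerically: the exact minimizer of the regularized empirical risk lies in the span of the feature vectors Λ(xi) even when the feature space has larger dimension than the sample size. The sketch below assumes a finite-dimensional feature space and randomly generated data; it is an illustration, not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 5, 12, 0.1                  # fewer samples than feature dimensions
Phi = rng.standard_normal((n, d))       # row i: feature vector Lambda(x_i)
y = rng.standard_normal(n)

# Exact minimizer of (1/n)*||Phi u - y||^2 + lam*||u||^2 (square loss, G = ||.||^2),
# obtained from the normal equations of the strongly convex objective.
u = np.linalg.solve(Phi.T @ Phi / n + lam * np.eye(d), Phi.T @ y / n)

# Representer property: u must lie in span{Lambda(x_1), ..., Lambda(x_n)},
# i.e. in the row space of Phi.
coef, *_ = np.linalg.lstsq(Phi.T, u, rcond=None)
residual = float(np.linalg.norm(Phi.T @ coef - u))
```

The residual of projecting u onto the row space of Phi vanishes up to rounding error, matching (4.58) with p = r = 2.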
Example 4.22 We recover a case study of [37]. Let φ : R+ → R+ be strictly increasing, continuous, and such that φ(0) = 0 and lim_{t→+∞} φ(t) = +∞. Define ϕ : R → R+ : t ↦ ∫_0^{|t|} φ(s) ds, which is strictly convex, even, and vanishes only at 0. Assume that lim_{t→0} ϕ(2t)/ϕ(t) < +∞, let (Ω, S, µ) be a measure space, and let F = Lϕ(Ω, µ; R) be the associated Orlicz space endowed with the Luxemburg norm induced by ϕ. We recall that F∗ = Lϕ∗(Ω, µ; R), the Orlicz space endowed with the Orlicz norm associated to ϕ∗ [40]. Moreover, in this case the normalized duality map JF∗ = JF⁻¹ : F∗ → F can be computed. Indeed, by [40, Theorem 7.2.5], we obtain that, for every g ∈ F∗, there exists κg ∈ R++ such that JF∗(g) = ‖g‖ φ⁻¹(κg|g|) sign(g). Given (gi)_{1≤i≤n} ∈ (F∗)^n, (yi)_{1≤i≤n} ∈ R^n, and λ ∈ R++, the problem considered in [37] is to solve
minimize_{u∈F} (1/n) Σ_{i=1}^n ℓ(yi, ⟨u, gi⟩) + λϕ(‖u‖). (4.59)
This corresponds to the framework considered in Corollary 4.19 and Remark 4.21, with X = F∗, Y = Y = R, P = n⁻¹ Σ_{i=1}^n δ_{(gi,yi)}, and (∀ g ∈ X)(∀ u ∈ F) (Au)(g) = ⟨u, g⟩. Since, in this case, for every g ∈ X, Λ(g) = g, we derive from (4.58) that there exist κ ∈ R++ and (αi)_{1≤i≤n} ∈ R^n such that
uλ = ‖uλ‖ φ⁻¹(κ|Σ_{i=1}^n αi gi|) sign(Σ_{i=1}^n αi gi) and (∀ i ∈ {1, …, n}) −nλφ(‖uλ‖)αi ∈ ‖uλ‖ ∂ℓ(yi, ·)(⟨uλ, gi⟩). (4.60)
We conclude this section with a sensitivity result in terms of a perturbation of the underlying probability measure.
Theorem 4.23 (sensitivity) In Theorem 4.17, make the additional assumption that G is totally convex at every point of dom G and let ψ be its modulus of total convexity. Take hλ ∈ L^{p∗}(X × Y, P; Y∗) such that conditions (4.37)-(4.38) hold. Let P̃ be a probability measure on (X × Y, A) such that ℓ ∈ Υ∞(X × Y × Y, P̃) and Λ is P̃X-essentially bounded. Define
R̃ : M(X, Y) → [0, +∞] : f ↦ ∫_{X×Y} ℓ(x, y, f(x)) P̃(d(x, y)) and F̃ = R̃ ◦ A. (4.61)
Let ǫ̃ ∈ R++ and let ũλ ∈ F be such that inf ‖∂(F̃ + λG)(ũλ)‖ ≤ ǫ̃. Then the following hold:
(i) hλ ∈ L¹(X × Y, P̃; Y∗).
(ii) ψ̂(uλ, ‖ũλ − uλ‖) ≤ (‖EP̃(Λhλ) − EP(Λhλ)‖ + ǫ + ǫ̃)/λ.
Proof. (i): Let γ be the norm of Λ in L∞[X, P̃X; L(Y∗, F∗)] and let ρ ∈ ]γ‖uλ‖, +∞[. Since ℓ ∈ Υ∞(X × Y × Y, P̃), there exists g ∈ L¹(X × Y, P̃; R) such that (∀ (x, y) ∈ X × Y)(∀ y ∈ B(ρ + 1))
ℓ(x, y, y) 6 g(x, y).
(4.62)
Let (x, y) ∈ X × Y be such that kΛ(x)k 6 γ. Then |(Auλ )(x)| 6 kΛ(x)kkuλ k 6 γkuλ k < ρ.
(4.63)
Therefore, since hλ(x, y) ∈ ∂_Y ℓ(x, y, (Auλ)(x)), it follows from Proposition A.1(i)-(ii) and (4.62) that |hλ(x, y)| ≤ 2 sup ℓ(x, y, B(ρ + 1)) ≤ 2g(x, y). Hence hλ ∈ L¹(X × Y, P̃; Y∗).
(ii): Let (x, y) ∈ X × Y. Since hλ(x, y) ∈ ∂_Y ℓ(x, y, (Auλ)(x)), we have
⟨ũλ − uλ, Λ(x)hλ(x, y)⟩ = ⟨(Aũλ)(x) − (Auλ)(x), hλ(x, y)⟩
≤ ℓ(x, y, (Aũλ)(x)) − ℓ(x, y, (Auλ)(x)). (4.64)
Since Λ is P̃X-essentially bounded and hλ ∈ L¹(X × Y, P̃; Y∗), Λhλ is P̃-integrable. Integrating (4.64) with respect to P̃ yields
⟨ũλ − uλ, EP̃(Λhλ)⟩ ≤ R̃(Aũλ) − R̃(Auλ). (4.65)
Moreover, (4.38) and (3.8) yield
⟨ũλ − uλ, e∗ − EP(Λhλ)⟩ + λψ(uλ, ‖ũλ − uλ‖) ≤ λG(ũλ) − λG(uλ). (4.66)
Summing the last two inequalities, we obtain
⟨ũλ − uλ, EP̃(Λhλ) − EP(Λhλ) + e∗⟩ + λψ(uλ, ‖ũλ − uλ‖) ≤ (F̃ + λG)(ũλ) − (F̃ + λG)(uλ). (4.67)
Since there exists ẽ∗ ∈ F∗ such that ‖ẽ∗‖ ≤ ǫ̃ and ⟨uλ − ũλ, ẽ∗⟩ ≤ (F̃ + λG)(uλ) − (F̃ + λG)(ũλ), we have (F̃ + λG)(ũλ) − (F̃ + λG)(uλ) ≤ ǫ̃‖uλ − ũλ‖. This, together with (4.67), yields
λψ(uλ, ‖ũλ − uλ‖) ≤ (ǫ + ǫ̃)‖ũλ − uλ‖ + ‖EP̃(Λhλ) − EP(Λhλ)‖ ‖ũλ − uλ‖ (4.68)
and the statement follows.
5 Learning via regularization
We study statistical learning in Banach spaces and present the main results of the paper.
5.1 Consistency theorems
We first formulate our assumptions. They involve the feature map from Definition 4.2, as well as the loss and the risk introduced in Definition 4.14.
Assumption 5.1 (i) (Ω, S, P) is a complete probability space, X and Y are two nonempty sets, A is a σ-algebra on X × Y containing the singletons, (X, Y) : (Ω, S, P) → (X × Y, A) is a random variable with distribution P on X × Y, and P has marginal PX on X.
(ii) Y is a separable reflexive real Banach space, ℓ ∈ ΥY,loc(X × Y × Y), R : M(X, Y) → [0, +∞] is the risk associated with ℓ and P, and there exists f ∈ L∞(X, PX; Y) such that R(f) < +∞. For every ρ ∈ R++, ζρ is defined as in (4.34).
(iii) C is a nonempty convex subset of M(X, Y).
(iv) F is a separable reflexive real Banach space, q ∈ [2, +∞[, F∗ is of Rademacher type q ∗ with Rademacher type constant Tq∗ . (v) A : F → M(X , Y) is linear and continuous with respect to pointwise convergence on YX , Λ is the feature map associated with A, Λ ∈ L∞ [X , PX ; L (Y∗ , F∗ )].
(vi) G ∈ Γ+0(F), G(0) = 0, the modulus of total convexity of G is ψ, ψ0 = ψ(0, ·), and G is totally convex on bounded sets.
(vii) (λn)n∈N is a sequence in R++ such that λn → 0.
(viii) (Xi, Yi)i∈N is a sequence of independent copies of (X, Y). For every n ∈ N ∖ {0}, Zn = (Xi, Yi)_{1≤i≤n} and
Rn : M(X, Y) × (X × Y)^n → R+ : (f, (x1, y1), …, (xn, yn)) ↦ (1/n) Σ_{i=1}^n ℓ(xi, yi, f(xi)). (5.1)
The function ε : R++ → [0, 1] satisfies lim_{λ→0+} ε(λ) = 0. For every n ∈ N ∖ {0} and every λ ∈ R++, the function un,λ : (X × Y)^n → F satisfies
(∀ z ∈ (X × Y)^n) un,λ(z) ∈ Argmin_F^{ε(λ)}(Rn(A·, z) + λG). (5.2)
In the context of learning theory, X is the input space and Y is the output space, which can be considered to be embedded in the ambient space Y. The probability distribution P describes a functional relation from X into Y, and R quantifies the expected loss of a function f : X → Y with respect to the underlying distribution P. The set C models a priori constraints. Since M(X, Y) is poorly structured, measurable functions are handled via the Banach feature space F and the feature map Λ. Provided that the range of A is universal relative to C (see Definition 4.5), every function f ∈ C can be approximately represented by a feature u ∈ F via f ≈ Au. Since the true risk R depends on P, which is unknown, the empirical risk Rn is constructed from the available data, namely a realization of Zn. In (5.2), un,λ is obtained by approximately minimizing a regularized empirical risk. Regularization is achieved by the addition of the convex function G, which will be asked to fulfill certain compatibility conditions with the constraint set C, e.g., dom G = A⁻¹(C). The objective of our analysis can be stated as follows.
Problem 5.2 (consistency) Consider the setting of Assumption 5.1. The problem is to approach the infimum of the risk R on C by means of approximate solutions
un,λn(Zn) ∈ Argmin_F^{ε(λn)}(Rn(A·, Zn) + λnG) (5.3)
to the empirical regularized problems
minimize_{u∈F} Rn(Au, Zn) + λnG(u), (5.4)
in the sense that R(Aun,λn (Zn )) → inf R(C) in probability (weak consistency) or almost surely (strong consistency), under suitable conditions on (λn )n∈N . Definition 5.3 Let p ∈ [1, +∞]. Then C in Assumption 5.1 is p-admissible if C ⊂ Lp (X , PX ; Y), or if p C∩L (C(x))x∈X of closed convex subsets of Y such that (X , PX ; Y) 6= ∅ and there exists a family C = f ∈ M(X , Y) (∀x ∈ X ) f (x) ∈ C(x) .
We are now ready to state the two main results of the paper, the proofs of which are deferred to Section 5.2.
Theorem 5.4 Suppose that Assumption 5.1 holds, set ς = ‖Λ‖_∞, and write ε = ε_1 ε_2, where ε_1 and ε_2 are functions from R_{++} to [0, 1]. Let p ∈ [1, +∞] and suppose that ℓ ∈ Υ_p(X × Y × Y, P), that C is p-admissible, that ran A is p-universal relative to C, and that A(dom G) ⊂ C ∩ ran A ⊂ cl A(dom G), where the closure is taken in L^p(X, P_X; Y). Then the following hold:

(i) Assume that ℓ(·, ·, 0) is bounded and let (∀n ∈ N) ρ_n ∈ [ψ_0^♮((‖ℓ(·, ·, 0)‖_∞ + 1)/λ_n), +∞[. Suppose that

ζ_{ςρ_n} ε_1(λ_n) → 0  and  ε_2(λ_n) = O( ζ_{ςρ_n}/n^{1/q} ),  (5.5)

and that

(∀τ ∈ R_{++})  ζ_{ςρ_n} (ψ̂_{ρ_n})^♮( τ ζ_{ςρ_n}/(λ_n n^{1/q}) ) → 0.  (5.6)

Then R(Au_{n,λ_n}(Z_n)) → inf R(C) in outer probability (P*). Moreover, if

(∀τ ∈ R_{++})  ζ_{ςρ_n} (ψ̂_{ρ_n})^♮( τ ζ_{ςρ_n} log n/(λ_n n^{1/q}) ) → 0,  (5.7)

then R(Au_{n,λ_n}(Z_n)) → inf R(C) P*-a.s.

(ii) Assume that p ∈ ]1, +∞[ and that the function b associated with ℓ in Definition 4.14(i) is bounded, and let (∀n ∈ N) ρ_n ∈ [ψ_0^♮((‖ℓ(·, ·, 0)‖_∞ + 1)/λ_n), +∞[. Suppose that

ρ_n^{p−1} ε_1(λ_n) → 0,  ε_2(λ_n) = O( ρ_n^{p−1}/n^{1/q} ),  and  (∀τ ∈ R_{++})  ρ_n^{p−1} (ψ̂_{ρ_n})^♮( τ ρ_n^{p−1}/(λ_n n^{1/q}) ) → 0.  (5.8)

Then R(Au_{n,λ_n}(Z_n)) → inf R(C) in outer probability (P*). Moreover, if

(∀τ ∈ R_{++})  ρ_n^{p−1} (ψ̂_{ρ_n})^♮( τ ρ_n^{p−1} log n/(λ_n n^{1/q}) ) → 0,  (5.9)

then R(Au_{n,λ_n}(Z_n)) → inf R(C) P*-a.s.

(iii) Assume that p = 1 and let (∀n ∈ N) ρ_n ∈ [ψ_0^♮((R(0) + 1)/λ_n), +∞[. Suppose that

ε_1(λ_n) → 0,  ε_2(λ_n) = O( 1/n^{1/q} ),  and  (∀τ ∈ R_{++})  (ψ̂_{ρ_n})^♮( τ/(λ_n n^{1/q}) ) → 0.  (5.10)

Then R(Au_{n,λ_n}(Z_n)) → inf R(C) in outer probability (P*). Moreover, if

(∀τ ∈ R_{++})  (ψ̂_{ρ_n})^♮( τ log n/(λ_n n^{1/q}) ) → 0,  (5.11)

then R(Au_{n,λ_n}(Z_n)) → inf R(C) P*-a.s.
(iv) Suppose that S = Argmin_{dom G}(R ∘ A) ≠ ∅. Then there exists a unique u† ∈ S which minimizes G on S; moreover, Au† ∈ C and R(Au†) = inf R(C). Furthermore, suppose that the following conditions are satisfied:

ε_1(λ_n) → 0,  ε_2(λ_n)/λ_n → 0,  and  1/(λ_n n^{1/q}) → 0.  (5.12)

Then ‖u_{n,λ_n}(Z_n) − u†‖ → 0 and R(Au_{n,λ_n}(Z_n)) → R(Au†), both in outer probability (P*). Finally, suppose in addition that

(log n)/(n^{1/q} λ_n) → 0.  (5.13)

Then ‖u_{n,λ_n}(Z_n) − u†‖ → 0 P*-a.s. and R(Au_{n,λ_n}(Z_n)) → R(Au†) P*-a.s.
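To see the flavor of condition (5.12) numerically, here is a sketch of ours (not a statement from the paper) for the Hilbertian case q = 2 with the square loss and G = ‖·‖²: the schedule λ_n = n^{−1/4} satisfies λ_n → 0 and λ_n n^{1/2} → +∞, and the feature-space error ‖u_{n,λ_n}(Z_n) − u†‖ of ridge regression shrinks as n grows.

```python
import numpy as np

rng = np.random.default_rng(1)
d, sigma = 5, 0.1
u_dag = np.ones(d)          # risk minimizer u-dagger in the feature space

def ridge_error(n):
    lam = n ** -0.25        # lam -> 0 while lam * sqrt(n) -> +infinity
    Phi = rng.standard_normal((n, d))
    y = Phi @ u_dag + sigma * rng.standard_normal(n)
    u = np.linalg.solve(Phi.T @ Phi / n + lam * np.eye(d), Phi.T @ y / n)
    return np.linalg.norm(u - u_dag)

# average the error over a few replications for each sample size
errs = [np.mean([ridge_error(n) for _ in range(20)]) for n in (50, 500, 5000)]
assert errs[0] > errs[1] > errs[2]   # the error decreases as n grows
```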
Remark 5.5 (i) In the setting of Example 4.16, ℓ(·, ·, 0) is bounded if Y ⊂ B(ρ).

(ii) A(dom G) ⊂ C ∩ ran A ⊂ cl A(dom G) is a compatibility condition between G and C. It is satisfied in particular when dom G = A^{−1}(C), since A(A^{−1}(C)) = C ∩ ran A. On the other hand, ran A is trivially ∞-universal relative to C when C ⊂ ran A, or when ran A ⊂ C and ran A is ∞-universal.

(iii) It follows from Assumption 5.1(vi) that Argmin_F G ≠ ∅, hence 0 ∈ dom ∂G. Therefore, Proposition 3.5(viii) ensures that, for every ρ ∈ R_+, ψ_ρ ∈ A_1. Thus, Proposition 3.1(vii) yields (ψ̂_ρ)^♮ ∈ A_0 and, by Proposition 3.1(ii), dom (ψ̂_ρ)^♮ is a nontrivial interval containing 0.

(iv) Let (s_n)_{n∈N} and (ρ_n)_{n∈N} be sequences in R_{++} and suppose that ρ = inf_{n∈N} ρ_n > 0. Then (ψ̂_{ρ_n})^♮(s_n) → 0 ⇒ s_n → 0. Indeed, for every n ∈ N, ρ ≤ ρ_n ⇒ ψ_{ρ_n} ≤ ψ_ρ ⇒ ψ̂_{ρ_n} ≤ ψ̂_ρ ⇒ (ψ̂_ρ)^♮ ≤ (ψ̂_{ρ_n})^♮. Therefore (ψ̂_{ρ_n})^♮(s_n) → 0 ⇒ (ψ̂_ρ)^♮(s_n) → 0 ⇒ s_n → 0 by Proposition 3.1(iv).
Next we consider an important special case, in which the consistency conditions can be made explicit.
Corollary 5.6 Suppose that Assumption 5.1 holds, set ς = ‖Λ‖_∞, and write ε = ε_1 ε_2, where ε_1 and ε_2 are functions from R_{++} to [0, 1]. Let p ∈ [1, +∞] and suppose that ℓ ∈ Υ_p(X × Y × Y, P), that C is p-admissible, that ran A is p-universal relative to C, and that A(dom G) ⊂ C ∩ ran A ⊂ cl A(dom G), where the closure is taken in L^p(X, P_X; Y). In addition, assume that

F is uniformly convex with modulus of convexity of power type q, and G = η‖·‖^r + H, where η ∈ R_{++}, r ∈ ]1, +∞[, and H ∈ Γ_0^+(F).  (5.14)

Let β be the constant defined in Proposition 3.8, and set m = max{r, q}. Then the following hold:

(i) Assume that ℓ(·, ·, 0) is bounded and set (∀n ∈ N) ρ_n = ((‖ℓ(·, ·, 0)‖_∞ + 1)/(ηβλ_n))^{1/r}. Suppose that

ζ_{ςρ_n} ε_1(λ_n) → 0,  ε_2(λ_n) = O( ζ_{ςρ_n}/n^{1/q} ),  and  ζ_{ςρ_n}^m/(λ_n^{m/r} n^{1/q}) → 0.  (5.15)

Then R(Au_{n,λ_n}(Z_n)) → inf R(C) in outer probability (P*). Moreover, if

ζ_{ςρ_n}^m log n/(λ_n^{m/r} n^{1/q}) → 0,  (5.16)

then R(Au_{n,λ_n}(Z_n)) → inf R(C) P*-a.s.

(ii) Assume that p ∈ ]1, +∞[, that the function b associated with ℓ in Definition 4.14(i) is bounded, and that

ε_1(λ_n)/λ_n^{(p−1)/r} → 0,  ε_2(λ_n) = O( 1/(λ_n^{(p−1)/r} n^{1/q}) ),  and  1/(λ_n^{pm/r} n^{1/q}) → 0.  (5.17)

Then R(Au_{n,λ_n}(Z_n)) → inf R(C) in outer probability (P*). Moreover, if log n/(λ_n^{pm/r} n^{1/q}) → 0, then R(Au_{n,λ_n}(Z_n)) → inf R(C) P*-a.s.

(iii) Assume that p = 1 and that

ε_1(λ_n) → 0,  ε_2(λ_n) = O( 1/n^{1/q} ),  and  1/(λ_n^{m/r} n^{1/q}) → 0.  (5.18)

Then R(Au_{n,λ_n}(Z_n)) → inf R(C) in outer probability (P*). Moreover, if log n/(λ_n^{m/r} n^{1/q}) → 0, then R(Au_{n,λ_n}(Z_n)) → inf R(C) P*-a.s.
Remark 5.7 Corollary 5.6 shows that consistency is achieved when the sequence of regularization parameters (λ_n)_{n∈N} converges to zero not too fast. The upper bound depends on the power type of the modulus of convexity of the feature space, the exponent of the norm in the regularizer, and the Lipschitz behavior of the loss. Note that a faster decay of (λ_n)_{n∈N} is allowed when q = 2.

Remark 5.8 In the setting of general regularizers and/or Banach feature spaces, the literature on consistency of regularized empirical risk minimizers is scarce. In [44], consistency is discussed in the context of classification in Hilbert spaces for regularizers of the type G = ϕ(‖·‖). In [45], consistency and learning rates are provided for classification problems and G = ‖·‖, under appropriate growth assumptions on the average empirical entropy numbers. In [43], the consistency of an ℓ^1-regularized empirical risk minimization scheme is studied in a particular type of Banach spaces of functions, in which a linear representer theorem is shown to hold. Note that, in general reproducing kernel Banach spaces, the representation is not linear; see Corollary 4.19 and [59, 60].

We complete this section by providing an illustration of the above consistency theorems for learning with dictionaries in the context of Example 4.11. The setting is a specialization of Assumption 5.1 to specific types of feature maps and regularizers. Our analysis extends that of [23] in several directions.

Example 5.9 (Generalized linear model) Suppose that Assumption 5.1(i)-(iii) hold. Let K be a nonempty at most countable set, let r ∈ ]1, +∞[, and let F = ℓ^r(K). Let ς ∈ R_{++} and let (φ_k)_{k∈K} be a dictionary of functions in M(X, Y) such that, for P_X-a.a. x ∈ X, Σ_{k∈K} |φ_k(x)|^{r*} ≤ ς^{r*}. Moreover, set

A : F → Y^X : u = (µ_k)_{k∈K} ↦ Σ_{k∈K} µ_k φ_k (pointwise),  (5.19)

and let Λ : X → ℓ^{r*}(K; Y*) : x ↦ (φ_k(x))_{k∈K} be the associated feature map. For every k ∈ K, let η_k ∈ R_+ and let h_k ∈ Γ_0^+(R) be such that h_k(0) = 0. Define

G : F → [0, +∞] : u = (µ_k)_{k∈K} ↦ Σ_{k∈K} g_k(µ_k), where (∀k ∈ K) g_k = h_k + η_k |·|^r.  (5.20)

Let (λ_n)_{n∈N} be a sequence in R_{++} such that λ_n → 0 and let (X_i, Y_i)_{i∈N} be a sequence of independent copies of (X, Y). For every n ∈ N∖{0}, let Z_n = (X_i, Y_i)_{1≤i≤n}, and let u_{n,λ_n}(Z_n) be defined according to (5.2) as an approximate minimizer of the regularized empirical risk

(1/n) Σ_{i=1}^n ℓ(X_i, Y_i, (Au)(X_i)) + λ_n G(u).  (5.21)

The above model covers several classical regularization schemes, such as the Tikhonov (ridge regression) model [32], the ℓ^1 or lasso model [47], the elastic net model [23, 61], the bridge regression model [30, 33], as well as generalized Gaussian models [2]. Furthermore, the following hold:
(i) F is uniformly convex with modulus of convexity of power type max{2, r} [35, p. 63]. Moreover, ran A ⊂ M(X, Y),

(∀x ∈ X)(∀u ∈ F)  |Λ(x)*u| = |(Au)(x)| ≤ ‖u‖_r ‖(φ_k(x))_{k∈K}‖_{r*} ≤ ς‖u‖_r,  (5.22)

and therefore ‖Λ‖_∞ ≤ ς. Now suppose that inf_{k∈K} η_k > 0. Then, in view of Proposition 3.8, G is totally convex on bounded sets. Altogether, Assumption 5.1 holds with q = max{2, r}.

(ii) Let p ∈ [1, +∞] and suppose that one of the following holds:

(a) C = A(ℓ^r(K) ∩ ×_{k∈K} dom h_k).

(b) C = M(X, Y) and span{φ_k}_{k∈K} is p-universal (Definition 4.5).

Then C is p-admissible (Definition 5.3), A(dom G) ⊂ C ∩ ran A ⊂ cl A(dom G) (where the closure is taken in L^p(X, P_X; Y)), and ran A is p-universal relative to C. Indeed, as for (ii)(a), C ⊂ ran A ⊂ L^p(X, P_X; Y), hence C is p-admissible and ran A is p-universal relative to C. Moreover, A(dom G) ⊂ C ⊂ cl A(dom G) since, for every u ∈ ℓ^r(K) ∩ ×_{k∈K} dom h_k and every ǫ ∈ R_{++}, there exists ū ∈ R^K with finite support such that ‖u − ū‖_r ≤ ǫ and ‖Au − Aū‖_p ≤ ς‖u − ū‖_r ≤ ςǫ. On the other hand, if C = M(X, Y), (ii)(b) is satisfied when X is a locally compact topological space and span{φ_k}_{k∈K} is dense in C_0(X, Y) endowed with the uniform topology, by Proposition 4.6(ii).

(iii) Let C be as in item (ii)(a) or (ii)(b), let η ∈ R_{++}, and suppose that (∀k ∈ K) η_k ≥ η. Then consistency can be obtained in the setting of Corollary 5.6, where q = max{2, r} = m and, in view of Remark 3.9(i), β = (7/32) r(r − 1)(1 − (2/3)^{r−1}). In particular, in items (ii) and (iii) of Corollary 5.6, we have λ_n^{pm/r} n^{1/q} = λ_n^p n^{1/r} if r > 2, and λ_n^{pm/r} n^{1/q} = λ_n^{2p/r} n^{1/2} if r ≤ 2. Moreover, by Theorem 5.4(iv), weak consistency holds if λ_n n^{1/max{2,r}} → +∞, and strong consistency holds if λ_n n^{1/max{2,r}}/log n → +∞.

(iv) Suppose that r ∈ ]1, 2] and that the loss function is differentiable with respect to the third variable. Then, by exploiting the separability of G, for a given sample size n, an estimate u_{n,λ_n}(z_n) can be constructed in ℓ^2(K) using proximal splitting algorithms such as those described in [22, 51].

Remark 5.10 Let us compare the results of Example 5.9 to the existing literature on generalized linear models.

(i) In the special case when K is finite, r > 1, and G = ‖·‖_r^r, [33] provides an excess risk bound which depends on the dimension of the dictionary (the cardinality of K) and the level of sparsity of the regularized risk minimizer; see [12] for a recent account of the role of sparsity in regression.
(ii) In the special case when r = 2 and, for every k ∈ K, hk = wk |·| with wk ∈ R++ in (5.20), we recover the elastic net framework of [23]. This special case yields a strongly convex problem in a Hilbert space. In our general setting, the exponent r may take any value in ]1, +∞[. Note also that our framework implicitly allows for the enforcement of hard constraints on the coefficients since the functions (hk )k∈K are not required to be real-valued. We highlight that, when specialized to the elastic net regularizer, Theorem 5.4(iv) guarantees consistency under the same conditions as in [23, Theorem 2].
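Example 5.9(iv) points to proximal splitting for the separable regularizer (5.20). As a minimal sketch (ours), in the elastic-net case r = 2, g_k = w|·| + η|·|² with the square loss, the proximity operator of G acts componentwise as a soft-thresholding followed by a shrinkage, and the forward-backward (ISTA) iteration applies; `w`, `eta`, and the step-size choice below are illustrative assumptions.

```python
import numpy as np

def prox_g(u, t, w, eta):
    """Proximity operator of t * (w|u| + eta u^2), componentwise:
    soft-threshold at t*w, then shrink by 1/(1 + 2*t*eta)."""
    return np.sign(u) * np.maximum(np.abs(u) - t * w, 0.0) / (1.0 + 2.0 * t * eta)

def ista(Phi, y, lam, w=0.1, eta=1.0, iters=500):
    """Forward-backward iterations for (1/n)||Phi u - y||^2 + lam * sum_k g(u_k)."""
    n, d = Phi.shape
    u = np.zeros(d)
    # step = 1/L, where L = 2||Phi||^2 / n is the Lipschitz constant of the gradient
    step = n / (2.0 * np.linalg.norm(Phi, 2) ** 2)
    for _ in range(iters):
        grad = 2.0 * Phi.T @ (Phi @ u - y) / n   # gradient of the smooth data term
        u = prox_g(u - step * grad, step * lam, w, eta)
    return u

rng = np.random.default_rng(2)
Phi = rng.standard_normal((100, 8))
u_true = np.array([2.0, -1.0, 0, 0, 0, 0, 0, 0])   # sparse coefficient sequence
y = Phi @ u_true + 0.05 * rng.standard_normal(100)
u_hat = ista(Phi, y, lam=0.1)
```

The soft-thresholding step is where the ℓ¹ part of g_k induces sparsity in the estimated coefficients.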
5.2 Proofs of the main results

We start with a few properties of the functions underlying our construct. To this end, throughout this subsection, the following notation will be used.

Notation 5.11 In the setting of Assumption 5.1, F = R ∘ A and

(∀n ∈ N∖{0})  F_n : F × (X × Y)^n → R_+ : (u, z) ↦ R_n(Au, z).  (5.23)

In addition, ς = ‖Λ‖_∞ and, for every n ∈ N∖{0} and λ ∈ R_{++},

α_{n,λ} : R_{++} × R_{++} → R_+ : (τ, ρ) ↦ (ςζ_{ςρ}/λ)( 4T_{q*}/n^{1/q} + 4τ/(3n) + 2√(2τ/n) ).  (5.24)

Now let τ ∈ [1, +∞[ and n ∈ N∖{0}. Then, since 2√(2τ) ≤ 1 + 2τ ≤ 3τ and n^{1/q} ≤ n^{1/2} ≤ n, we have

(∀ρ ∈ R_{++})  α_{n,λ}(τ, ρ) ≤ τς(4T_{q*} + 5)ζ_{ςρ}/(λ n^{1/q}).  (5.25)
(ii) Let n ∈ N r {0} and z ∈ (X × Y)n . Then Fn (·, z) : F → R+ is convex and continuous.
(iii) G is coercive and strictly convex.
(iv) For every λ ∈ R++ , F + λG admits a unique minimizer. Proof. (i): Remark 4.15(iv) ensures that R : L∞ (X , PX ; Y) → R+ is convex and continuous. In turn, Proposition 4.4(ii) implies that A : F → L∞ (X , PX ; Y) is continuous. (ii):PThe argument is the same as above, except that P is replaced by the empirical measure (1/n) ni=1 δ(xi ,yi ) , where z = (xi , yi )16i6n .
(iii): It follows from Assumption 5.1(vi) and Proposition 3.5(ix) that G is coercive; its strict convexity follows from Definition 3.2(i).
(iv): By (i) and (iii), F + λG is a strictly convex coercive function in Γ+ 0 (F). It therefore admits a unique minimizer [57, Theorem 2.5.1(ii) and Proposition 2.5.6]. The strategy of the proof of Theorem 5.4 is to split the error in three parts, i.e., R(Aun,λ (Zn )) − inf R(C)
= (F (un,λ (Zn )) − F (uλ )) + (F (uλ ) − inf F (dom G)) + (inf F (dom G) − inf R(C)),
where uλ = argminF (F + λG). (5.26)
Note that Proposition 5.12(iv) ensures that uλ is uniquely defined. The first term on the right-hand side of (5.26) is known as the sample error and the second term as the approximation error. Proposition 3.11(ii) ensures that the approximation error goes to zero as λ → 0. Below, we start by showing that inf R(C) − inf F (dom G) = 0, if ran A is universal with respect to C and some compatibility conditions between G and C hold. Next, we study the sample error. Note that F (un,λ (Zn )) − F (uλ ) may not be measurable, hence the convergence results are provided with respect to the outer probability P∗ . 31
Proposition 5.13 Let X and Y be nonempty sets, let (X × Y, A, P) be a probability space, let P_X be the marginal of P on X, and let Y be a separable reflexive real Banach space. Let ℓ ∈ Υ(X × Y, Y), and let R : M(X, Y) → [0, +∞] be the risk associated with ℓ and P. Let C ⊂ M(X, Y) be nonempty and convex. Let p ∈ [1, +∞] and assume that C is p-admissible and that there exists g ∈ C ∩ L^p(X, P_X; Y) such that R(g) < +∞. Then inf R(C) = inf R(C ∩ L^p(X, P_X; Y)).

Proof. Suppose that C = {f ∈ M(X, Y) | (∀x ∈ X) f(x) ∈ C(x)}. Let f ∈ C be such that R(f) < +∞. For every n ∈ N, set A_n = {x ∈ X | |f(x)| ≤ n}, let A_n^c be its complement, and define f_n : X → Y, f_n = 1_{A_n} f + 1_{A_n^c} g. For every n ∈ N and x ∈ X, f_n(x) ∈ C(x) and |f_n(x)| ≤ max{n, |g(x)|}, hence f_n ∈ C ∩ L^p(X, P_X; Y). Moreover,

(∀n ∈ N)  |R(f_n) − R(f)| ≤ ∫_{A_n^c × Y} |ℓ(x, y, g(x)) − ℓ(x, y, f(x))| P(d(x, y)).  (5.27)

Set h : (x, y) ↦ |ℓ(x, y, g(x)) − ℓ(x, y, f(x))|. Since R(f) < +∞ and R(g) < +∞, we have h ∈ L^1(X × Y, P). Since 1_{A_n^c × Y} h → 0 pointwise and 1_{A_n^c × Y} h ≤ h, it follows from the dominated convergence theorem that the right-hand side of (5.27) tends to zero, and hence R(f_n) → R(f). This implies that inf R(C ∩ L^p(X, P_X; Y)) ≤ R(f).

Proposition 5.14 Let X and Y be nonempty sets, let (X × Y, A, P) be a probability space, let P_X be the marginal of P on X, and let Y be a separable reflexive real Banach space. Let C ⊂ M(X, Y) be nonempty and convex and let p ∈ [1, +∞]. Suppose that ℓ ∈ Υ_p(X × Y, Y, P), that Λ ∈ L^p[X, P_X; L(Y*, F*)], and that A(dom G) ⊂ C ∩ ran A ⊂ cl A(dom G), where the closure is taken in L^p(X, P_X; Y). Let R : M(X, Y) → [0, +∞] be the risk associated with ℓ and P. Then the following hold:

(i) inf F(dom G) = inf R(C ∩ ran A).

(ii) Suppose that C is p-admissible and ran A is p-universal relative to C. Then inf F(dom G) = inf R(C).

Proof. (i): By Remark 4.15(i), R is continuous on L^p(X, P_X; Y) and hence inf R(A(dom G)) = inf R(cl A(dom G)). Therefore, since A(dom G) ⊂ C ∩ ran A ⊂ cl A(dom G), the assertion follows.

(ii): Suppose first that p < +∞. Since R is continuous on L^p(X, P_X; Y) and C ∩ ran A is dense in C ∩ L^p(X, P_X; Y), inf R(C ∩ ran A) = inf R(C ∩ L^p(X, P_X; Y)). Thus, since C is p-admissible, Proposition 5.13 gives inf R(C ∩ L^p(X, P_X; Y)) = inf R(C) and hence inf R(C ∩ ran A) = inf R(C). The statement follows from (i). Now suppose that p = +∞. Let f ∈ C ∩ L^∞(X, P_X; Y). By Definition 4.5(i), there exist (f_n)_{n∈N} ∈ (C ∩ ran A)^N and ρ ∈ R_{++} such that sup_{n∈N} ‖f_n‖_∞ ≤ ρ and f_n → f P_X-a.s. It follows from (4.33) that (∃ g_ρ ∈ L^1(X × Y, P; R))(∀(x, y) ∈ X × Y) |ℓ(x, y, f_n(x)) − ℓ(x, y, f(x))| ≤ 2g_ρ(x, y). By the dominated convergence theorem, R(f_n) → R(f). Thus, inf R(C ∩ ran A) = inf R(C ∩ L^∞(X, P_X; Y)) and we conclude as above.

Proposition 5.15 Suppose that Assumption 5.1 holds and that Notation 5.11 is in use.
Write ε = ε_1 ε_2, where ε_1 and ε_2 are functions from R_{++} to [0, 1], let λ ∈ R_{++}, and define u_λ = argmin_F (F + λG). Let τ ∈ R_{++}, let n ∈ N∖{0}, and let ρ ∈ [‖u_λ‖, +∞[. Then the following hold:

(i) P*[ ‖u_{n,λ}(Z_n) − u_λ‖ > ε_1(λ) + (ψ̂_ρ)^♮( α_{n,λ}(τ, ρ) + ε_2(λ)/λ ) ] ≤ e^{−τ}.

(ii) P*[ [‖u_{n,λ}(Z_n)‖ ≤ ρ] ∩ [ F(u_{n,λ}(Z_n)) − F(u_λ) > ςζ_{ςρ}( ε_1(λ) + (ψ̂_ρ)^♮( α_{n,λ}(τ, ρ) + ε_2(λ)/λ ) ) ] ] ≤ e^{−τ}.

(iii) Suppose that ℓ ∈ Υ_1(X × Y × Y, P) and let c ∈ R_+ be as in Definition 4.14(i). Then

P*[ F(u_{n,λ}(Z_n)) − F(u_λ) > ςc( ε_1(λ) + (ψ̂_ρ)^♮( α_{n,λ}(τ, ρ) + ε_2(λ)/λ ) ) ] ≤ e^{−τ}.  (5.28)

Proof. (i): Let z = (x_i, y_i)_{1≤i≤n} ∈ (X × Y)^n. Since

u_{n,λ}(z) ∈ Argmin_F^{ε_1(λ)ε_2(λ)} ( F_n(·, z) + λG ),  (5.29)

it follows from Proposition 5.12(ii) and Ekeland's variational principle [36, Corollary 4.2.12] that there exists v_{n,λ} ∈ F such that ‖u_{n,λ}(z) − v_{n,λ}‖ ≤ ε_1(λ) and inf ‖∂(F_n(·, z) + λG)(v_{n,λ})‖ ≤ ε_2(λ). We note that ℓ ∈ Υ_∞(X × Y × Y) by Remark 4.15(iv). Hence, setting P̃ = (1/n) Σ_{i=1}^n δ_{(x_i,y_i)}, we derive from Theorems 4.17(ii) and 4.23(ii) that there exists a measurable and P-a.s. bounded function h_λ : X × Y → Y* such that ‖h_λ‖_∞ ≤ ζ_{ςρ} and

‖v_{n,λ} − u_λ‖ ≤ (ψ̂_ρ)^♮( (1/λ) ‖ E_P[Λh_λ] − (1/n) Σ_{i=1}^n Λ(x_i)h_λ(x_i, y_i) ‖ + ε_2(λ)/λ ).  (5.30)

Thus, for every z ∈ (X × Y)^n,

‖u_{n,λ}(z) − u_λ‖ ≤ ε_1(λ) + (ψ̂_ρ)^♮( (1/λ) ‖ E_P[Λh_λ] − (1/n) Σ_{i=1}^n Λ(x_i)h_λ(x_i, y_i) ‖ + ε_2(λ)/λ ).  (5.31)

Now consider the family of i.i.d. random vectors (Λ(X_i)h_λ(X_i, Y_i))_{1≤i≤n} from Ω to F*. Since max_{1≤i≤n} ‖Λ(X_i)h_λ(X_i, Y_i)‖ ≤ ςζ_{ςρ} P-a.s., Theorem A.5(i) gives

P[ ‖ E_P[Λ(X)h_λ(X, Y)] − (1/n) Σ_{i=1}^n Λ(X_i)h_λ(X_i, Y_i) ‖ > λα_{n,λ}(τ, ρ) ] ≤ e^{−τ}.  (5.32)

Hence, since (ψ̂_ρ)^♮ is increasing by Proposition 3.1(vii), a fortiori we have

P[ ε_1(λ) + (ψ̂_ρ)^♮( (1/λ) ‖ E_P[Λh_λ] − (1/n) Σ_{i=1}^n Λ(X_i)h_λ(X_i, Y_i) ‖ + ε_2(λ)/λ ) > ε_1(λ) + (ψ̂_ρ)^♮( α_{n,λ}(τ, ρ) + ε_2(λ)/λ ) ] ≤ e^{−τ}.  (5.33)

Thus (i) follows from (5.31) and (5.33).

(ii): Let ω ∈ [‖u_{n,λ}(Z_n)‖ ≤ ρ]. Since ‖u_λ‖ ≤ ρ and ‖u_{n,λ}(Z_n(ω))‖ ≤ ρ, we have ‖Au_λ‖_∞ ≤ ςρ and ‖Au_{n,λ}(Z_n(ω))‖_∞ ≤ ςρ. Hence, we derive from Assumption 5.1(ii) that

F(u_{n,λ}(Z_n(ω))) − F(u_λ) ≤ ζ_{ςρ} ‖Au_{n,λ}(Z_n(ω)) − Au_λ‖_∞ ≤ ςζ_{ςρ} ‖u_{n,λ}(Z_n(ω)) − u_λ‖.  (5.34)

Thus, (ii) follows from (i).

(iii): It follows from Remark 4.15(v)(a) that ℓ is globally Lipschitz continuous in the third variable, uniformly with respect to the first two, and that sup_{ρ′∈R_{++}} ζ_{ρ′} ≤ c. Hence, we derive from (4.31) that R is Lipschitz continuous on L^1(X, P_X; Y) with Lipschitz constant c. As a result,

(∀ω ∈ Ω)  F(u_{n,λ}(Z_n(ω))) − F(u_λ) ≤ c ‖Au_{n,λ}(Z_n(ω)) − Au_λ‖_∞ ≤ ςc ‖u_{n,λ}(Z_n(ω)) − u_λ‖.  (5.35)

Thus, the statement follows from (i).

The following technical result will be required subsequently.
Lemma 5.16 Let α : R_+ → R_+ and let γ ∈ R_{++} be such that, for every τ ∈ ]1, +∞[, α(τ) ≤ γτ. Let φ ∈ A_0, let (η, ǫ) ∈ R_{++} × R_+, and suppose that φ^♮(2γ) < η and φ^♮(2ǫ) < η. Set τ_0 = φ(η−)/(2γ). Then φ^♮(α(τ_0) + ǫ) < η.

Proof. Recalling Proposition 3.1(vi), we derive from φ^♮(2γ) < η and φ^♮(2ǫ) < η that, respectively, τ_0 > 1 and φ(η−) > 2ǫ. Therefore, since γτ_0 = φ(η−)/2, we have α(τ_0) + ǫ ≤ τ_0 γ + ǫ = φ(η−)/2 + ǫ < φ(η−). Again by Proposition 3.1(vi), we obtain φ^♮(α(τ_0) + ǫ) < η.

Proposition 5.17 Suppose that Assumption 5.1 holds, that Notation 5.11 is in use, and that ℓ(·, ·, 0) is bounded. Write ε = ε_1 ε_2, where ε_1 and ε_2 are functions from R_{++} to [0, 1]. Let (∀n ∈ N) ρ_n ∈ [ψ_0^♮((‖ℓ(·, ·, 0)‖_∞ + 1)/λ_n), +∞[. Then the following hold:

(i) Let λ ∈ R_{++}, set u_λ = argmin_F (F + λG), and let ρ ∈ [ψ_0^♮((‖ℓ(·, ·, 0)‖_∞ + 1)/λ), +∞[. Let τ ∈ R_{++} and let n ∈ N∖{0}. Then

P*[ F(u_{n,λ}(Z_n)) − inf F(dom G) > ςζ_{ςρ}( ε_1(λ) + (ψ̂_ρ)^♮( α_{n,λ}(τ, ρ) + ε_2(λ)/λ ) ) + F(u_λ) − inf F(dom G) ] ≤ e^{−τ}.  (5.36)

(ii) Suppose that (5.5) and (5.6) hold. Then F(u_{n,λ_n}(Z_n)) → inf F(dom G) in outer probability (P*).

(iii) Suppose that (5.5) and (5.7) hold. Then F(u_{n,λ_n}(Z_n)) → inf F(dom G) P*-a.s.
Proof. (i): Since, for every z_n = (x_i, y_i)_{1≤i≤n} ∈ (X × Y)^n, F_n(0, z_n) ≤ ‖ℓ(·, ·, 0)‖_∞ and F(0) ≤ ‖ℓ(·, ·, 0)‖_∞, it follows from Proposition 3.15 that ‖u_{n,λ}(Z_n)‖ ≤ ρ and ‖u_λ‖ ≤ ρ. Thus, Proposition 5.15(ii) yields P*[ F(u_{n,λ}(Z_n)) − F(u_λ) > ςζ_{ςρ}( ε_1(λ) + (ψ̂_ρ)^♮(α_{n,λ}(τ, ρ) + ε_2(λ)/λ) ) ] ≤ e^{−τ}, and (5.36) follows.

(ii): Because of (5.25), conditions (5.5)-(5.6) imply that

(∀τ ∈ [1, +∞[)  ςζ_{ςρ_n}( ε_1(λ_n) + (ψ̂_{ρ_n})^♮( α_{n,λ_n}(τ, ρ_n) + ε_2(λ_n)/λ_n ) ) → 0.  (5.37)

Therefore, it follows from (5.36) and Proposition 3.11(ii) that, for every (η, τ) ∈ R_{++} × [1, +∞[, there exists n̄ ∈ N such that, for every integer n ≥ n̄, P*[ F(u_{n,λ_n}(Z_n)) − inf F(dom G) > η ] ≤ e^{−τ}. Hence,

(∀(η, τ) ∈ R_{++} × [1, +∞[)  lim_{n→+∞} P*[ F(u_{n,λ_n}(Z_n)) − inf F(dom G) > η ] ≤ e^{−τ},  (5.38)

and the convergence in outer probability follows.
(iii): Let η ∈ R_{++} and let ξ ∈ ]1, +∞[. It follows from (5.5) and (5.7) that there exists an integer n̄ ≥ 3 such that, for every integer n ≥ n̄,

ζ_{ςρ_n}(ψ̂_{ρ_n})^♮( 2ςξ(4T_{q*} + 5)ζ_{ςρ_n} log n/(λ_n n^{1/q}) ) < η  and  ζ_{ςρ_n}(ψ̂_{ρ_n})^♮( 2ε_2(λ_n)/λ_n ) < η.  (5.39)

Now take an integer n ≥ n̄ and set γ = ς(4T_{q*} + 5)ζ_{ςρ_n}/(λ_n n^{1/q}). We derive from (5.25) that (∀τ ∈ [1, +∞[) α_{n,λ_n}(τ, ρ_n) ≤ τγ. Then, since 1 ≤ ξ log n, it follows from Lemma 5.16 that

τ_0 = ψ̂_{ρ_n}((η/ζ_{ςρ_n})−) λ_n n^{1/q}/(2ς(4T_{q*} + 5)ζ_{ςρ_n})  ⇒  ζ_{ςρ_n}(ψ̂_{ρ_n})^♮( α_{n,λ_n}(τ_0, ρ_n) + ε_2(λ_n)/λ_n ) < η.  (5.40)

Now set

Ω_{n,η} = [ F(u_{n,λ_n}(Z_n)) − inf F(dom G) > ςζ_{ςρ_n} ε_1(λ_n) + ςη + F(u_{λ_n}) − inf F(dom G) ].  (5.41)

Item (i) yields

P*(Ω_{n,η}) ≤ exp( −ψ̂_{ρ_n}((η/ζ_{ςρ_n})−) λ_n n^{1/q}/(2ς(4T_{q*} + 5)ζ_{ςρ_n}) ).  (5.42)

We remark that, by Proposition 3.1(vi)-(vii), the first condition in (5.39) is equivalent to

ψ̂_{ρ_n}((η/ζ_{ςρ_n})−) λ_n n^{1/q}/(2ς(4T_{q*} + 5)ζ_{ςρ_n}) ≥ ξ log n.  (5.43)

Thus it follows from (5.42) and (5.43) that Σ_{n=n̄}^{+∞} P*(Ω_{n,η}) ≤ Σ_{n=n̄}^{+∞} 1/n^ξ < +∞. Hence the Borel-Cantelli lemma yields (note that the Borel-Cantelli lemma requires only the property of σ-subadditivity; it therefore holds also for outer measures)

P*( ∩_{k≥n̄} ∪_{n≥k} Ω_{n,η} ) = 0.  (5.44)
We conclude that F(u_{n,λ_n}(Z_n)) → inf F(dom G) P*-a.s.

The next proposition considers the case of a globally Lipschitz continuous loss ℓ and does not require the boundedness of ℓ(·, ·, 0).

Proposition 5.18 Suppose that Assumption 5.1 holds, that Notation 5.11 is in use, and that ℓ ∈ Υ_1(X × Y × Y, P). Let c ∈ R_+ be as in Definition 4.14(i) and write ε = ε_1 ε_2, where ε_1 and ε_2 are functions from R_{++} to [0, 1]. Let (∀n ∈ N) ρ_n ∈ [ψ_0^♮((R(0) + 1)/λ_n), +∞[. Then the following hold:

(i) Let λ ∈ R_{++}, set u_λ = argmin_F (F + λG), and let ρ ∈ [ψ_0^♮((F(0) + 1)/λ), +∞[. Let τ ∈ R_{++} and let n ∈ N∖{0}. Then

P*[ F(u_{n,λ}(Z_n)) − inf F(dom G) > ςc( ε_1(λ) + (ψ̂_ρ)^♮( α_{n,λ}(τ, ρ) + ε_2(λ)/λ ) ) + F(u_λ) − inf F(dom G) ] ≤ e^{−τ}.  (5.45)

(ii) Suppose that (5.10) holds. Then F(u_{n,λ_n}(Z_n)) → inf F(dom G) in outer probability (P*).

(iii) Suppose that (5.10) and (5.11) hold. Then F(u_{n,λ_n}(Z_n)) → inf F(dom G) P*-a.s.

Proof. (i): First note that, by Proposition 3.15, ‖u_λ‖ ≤ ρ. Thus, (5.45) follows from Proposition 5.15(iii).

(ii)-(iii): Using (i), these can be established as in the proof of items (ii) and (iii) in Proposition 5.17.

Proposition 5.19 Suppose that Assumption 5.1 holds, that Notation 5.11 is in use, and that S = Argmin_{dom G} F ≠ ∅. Let u† = argmin_{u∈S} G(u) and write ε = ε_1 ε_2, where ε_1 and ε_2 are functions from R_{++} to [0, 1]. For every λ ∈ R_{++}, set u_λ = argmin_F (F + λG). Let ρ ∈ [sup_{λ∈R_{++}} ‖u_λ‖, +∞[ and let τ ∈ R_{++}. Then, for every sufficiently small λ ∈ R_{++} and every n ∈ N∖{0},

P*[ ‖u_{n,λ}(Z_n) − u†‖ > ε_1(λ) + (ψ̂_ρ)^♮( α_{n,λ}(τ, ρ) + ε_2(λ)/λ ) + ‖u_λ − u†‖ ] ≤ e^{−τ}.  (5.46)

Moreover, assume that (5.12) is satisfied. Then the following hold:
(i) For every sufficiently large n ∈ N,

P*[ F(u_{n,λ_n}(Z_n)) − F(u†) > ςζ_{ςρ}( ε_1(λ_n) + (ψ̂_ρ)^♮( α_{n,λ_n}(τ, ρ) + ε_2(λ_n)/λ_n ) ) + λ_n ] ≤ 2e^{−τ}.  (5.47)

(ii) ‖u_{n,λ_n}(Z_n) − u†‖ → 0 and F(u_{n,λ_n}(Z_n)) → inf F(dom G), both in outer probability (P*).

(iii) Suppose that (5.13) holds. Then F(u_{n,λ_n}(Z_n)) → inf F(dom G) P*-a.s. and u_{n,λ_n}(Z_n) → u† P*-a.s.

Proof. First note that items (i) and (v) of Proposition 3.13 imply that u† is well defined and that sup_{λ∈R_{++}} ‖u_λ‖ < +∞. Now, let λ ∈ R_{++} and let n ∈ N. Since ‖u_λ‖ ≤ ρ, it follows from Proposition 5.15(i) that

P*[ ‖u_{n,λ}(Z_n) − u_λ‖ > ε_1(λ) + (ψ̂_ρ)^♮( α_{n,λ}(τ, ρ) + ε_2(λ)/λ ) ] ≤ e^{−τ}  (5.48)

and, since ‖u_{n,λ}(Z_n) − u†‖ ≤ ‖u_{n,λ}(Z_n) − u_λ‖ + ‖u_λ − u†‖, (5.46) follows. Note also that Proposition 3.5(viii) implies that ψ̂_ρ ∈ A_0.
(i): Let η ∈ R_{++} be such that sup_{λ∈R_{++}} ‖u_λ‖ + η ≤ ρ. It follows from (5.12), (5.25), and Proposition 3.1(v) that ε_1(λ_n) + (ψ̂_ρ)^♮(α_{n,λ_n}(τ, ρ) + ε_2(λ_n)/λ_n) → 0. Hence, there exists n̄ ∈ N such that, for every integer n ≥ n̄, ε_1(λ_n) + (ψ̂_ρ)^♮(α_{n,λ_n}(τ, ρ) + ε_2(λ_n)/λ_n) ≤ η. Now take an integer n ≥ n̄ and set Ω_n = [‖u_{n,λ_n}(Z_n) − u_{λ_n}‖ ≤ η]. Then Ω_n ⊂ [‖u_{n,λ_n}(Z_n)‖ ≤ ρ] and it follows from (5.48) that P*(Ω ∖ Ω_n) ≤ e^{−τ}. Hence, we deduce from Proposition 5.15(ii) that

P*[ F(u_{n,λ_n}(Z_n)) − F(u_{λ_n}) > ςζ_{ςρ}( ε_1(λ_n) + (ψ̂_ρ)^♮( α_{n,λ_n}(τ, ρ) + ε_2(λ_n)/λ_n ) ) ] ≤ 2e^{−τ}.  (5.49)

On the other hand, Proposition 3.13(iv) implies that, for n sufficiently large, F(u_{λ_n}) − F(u†) ≤ λ_n, which combined with (5.49) gives (5.47).

(ii): Let η ∈ R_{++}. As above, α_{n,λ_n}(τ, ρ) + ε_2(λ_n)/λ_n → 0, and Proposition 3.13(vi) asserts that ‖u_{λ_n} − u†‖ → 0. Therefore, there exists n̄ ∈ N such that, for every integer n ≥ n̄, ε_1(λ_n) + (ψ̂_ρ)^♮(α_{n,λ_n}(τ, ρ) + ε_2(λ_n)/λ_n) + ‖u_{λ_n} − u†‖ ≤ η. It follows from (5.46) that lim P*[ ‖u_{n,λ_n}(Z_n) − u†‖ > η ] ≤ e^{−τ}. Therefore, P*[ ‖u_{n,λ_n}(Z_n) − u†‖ > η ] → 0. Likewise, using (5.47), we obtain P*[ F(u_{n,λ_n}(Z_n)) − F(u†) > η ] → 0.

(iii): The proof follows the same lines as that of Proposition 5.17(iii). Let η ∈ R_{++} and let ξ ∈ ]1, +∞[. It follows from (5.12), (5.13), and Proposition 3.1(v) that, for n ∈ N large enough,

ζ_{ςρ}(ψ̂_ρ)^♮( 2ςξ(4T_{q*} + 5)ζ_{ςρ} log n/(λ_n n^{1/q}) ) < η  and  ζ_{ςρ}(ψ̂_ρ)^♮( 2ε_2(λ_n)/λ_n ) < η.  (5.50)

In turn, for such an n, we derive from Lemma 5.16 that

τ_0 = ψ̂_ρ((η/ζ_{ςρ})−) λ_n n^{1/q}/(2ς(4T_{q*} + 5)ζ_{ςρ})  ⇒  ζ_{ςρ}(ψ̂_ρ)^♮( α_{n,λ_n}(τ_0, ρ) + ε_2(λ_n)/λ_n ) < η.  (5.51)

Now set

(∀n ∈ N∖{0})  Ω_{n,η} = [ F(u_{n,λ_n}(Z_n)) − F(u†) > ςζ_{ςρ} ε_1(λ_n) + ςη + λ_n ].  (5.52)

Then (i) implies that, for n sufficiently large,

P*(Ω_{n,η}) ≤ 2 exp( −ψ̂_ρ((η/ζ_{ςρ})−) λ_n n^{1/q}/(2ς(4T_{q*} + 5)ζ_{ςρ}) ).  (5.53)

Moreover, thanks to Proposition 3.1(vi), the first condition in (5.50) is equivalent to

ψ̂_ρ((η/ζ_{ςρ})−) λ_n n^{1/q}/(2ς(4T_{q*} + 5)ζ_{ςρ}) ≥ ξ log n.  (5.54)

Thus, for n̄ ∈ N sufficiently large, (5.53) and (5.54) yield Σ_{n≥n̄} P*(Ω_{n,η}) ≤ 2 Σ_{n≥n̄} 1/n^ξ < +∞, and hence it follows from the Borel-Cantelli lemma that P*(∩_{k≥n̄} ∪_{n≥k} Ω_{n,η}) = 0. This shows that F(u_{n,λ_n}(Z_n)) → F(u†) P*-a.s. Next, let n ∈ N be sufficiently large so that

(ψ̂_ρ)^♮( 2ε_2(λ_n)/λ_n ) < η  and  (ψ̂_ρ)^♮( 2ςξ(4T_{q*} + 5)ζ_{ςρ} log n/(λ_n n^{1/q}) ) < η.  (5.55)

Using Lemma 5.16, upon setting τ_0 = ψ̂_ρ(η−) λ_n n^{1/q}/(2ς(4T_{q*} + 5)ζ_{ςρ}), we obtain (ψ̂_ρ)^♮( α_{n,λ_n}(τ_0, ρ) + ε_2(λ_n)/λ_n ) < η. It then follows from (5.46) and (5.55) that, for n sufficiently large,

P*[ ‖u_{n,λ_n}(Z_n) − u†‖ > ε_1(λ_n) + η + ‖u_{λ_n} − u†‖ ] ≤ exp( −ψ̂_ρ(η−) λ_n n^{1/q}/(2ς(4T_{q*} + 5)ζ_{ςρ}) ) < 1/n^ξ.  (5.56)
The conclusion follows by the Borel-Cantelli lemma.
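Lemma 5.16, used twice in the proof above, can be sanity-checked on a concrete instance (ours, not from the paper): take φ : t ↦ t², whose generalized inverse φ^♮ is the square root, and the extremal admissible α : τ ↦ γτ.

```python
import math

phi = lambda t: t * t       # a function of A0 type; phi(eta-) = phi(eta) by continuity
phi_inv = math.sqrt         # its generalized inverse phi-natural

gamma, eps, eta = 0.1, 0.05, 1.0
alpha = lambda tau: gamma * tau                 # extremal case alpha(tau) = gamma*tau
assert phi_inv(2 * gamma) < eta                 # hypothesis phi_nat(2*gamma) < eta
assert phi_inv(2 * eps) < eta                   # hypothesis phi_nat(2*eps) < eta
tau0 = phi(eta) / (2 * gamma)                   # the choice of tau0 in the lemma
assert phi_inv(alpha(tau0) + eps) < eta         # the conclusion of the lemma
```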
Proof of Theorem 5.4. We first note that Proposition 5.14(ii) asserts that inf F(dom G) = inf R(C).

(i): This follows from Proposition 5.17(ii)-(iii).

(ii): Remark 4.15(v)(b) implies that, for every ρ ∈ R_{++}, ζ_ρ ≤ (p − 1)‖b‖_∞ + 3cp max{1, ρ^{p−1}} and that ℓ(·, ·, 0) is bounded. Hence conditions (5.8) and (5.9) imply (5.5)-(5.6) and (5.7), respectively. Therefore, the statement follows from (i).

(iii): This follows from Proposition 5.18(ii)-(iii).

(iv): This follows from Proposition 5.19(ii)-(iii).

Proof of Corollary 5.6. Since F is uniformly convex of power type q, F* is uniformly smooth with modulus of smoothness of power type q* [35, p. 63] and hence of Rademacher type q* (see Section 2), in conformity with Assumption 5.1(iv). Moreover, by (5.14), the modulus of total convexity ψ_ρ of G on B(ρ) is greater than that of η‖·‖^r. Hence, by Proposition 3.8,

(∀ρ ∈ R_+)(∀t ∈ R_+)  ψ_ρ(t) ≥ ηβt^r if r ≥ q,  and  ψ_ρ(t) ≥ ηβt^q/(ρ + t)^{q−r} if r < q,  (5.57)

and, for every ρ ∈ R_+ and every s ∈ R_+,

(ψ̂_ρ)^♮(s) ≤ (s/(ηβ))^{1/(r−1)} if r ≥ q,  and  (ψ̂_ρ)^♮(s) ≤ 2^q ρ max{ (s/(ηβρ^{r−1}))^{1/(q−1)}, (s/(ηβρ^{r−1}))^{1/(r−1)} } if r < q.  (5.58)

(i): It follows from (5.57) that

(∀n ∈ N)  ψ_0^♮( (‖ℓ(·, ·, 0)‖_∞ + 1)/λ_n ) ≤ ( (‖ℓ(·, ·, 0)‖_∞ + 1)/(ηβλ_n) )^{1/r} = ρ_n.  (5.59)

Now fix τ ∈ R_{++} and assume that sup_{n∈N} ζ_{ςρ_n} > 0. Since ζ_{ςρ_n}^m/(λ_n^{m/r} n^{1/q}) → 0 and m ≥ 2, we have ζ_{ςρ_n}/(λ_n^{m/r} n^{1/q}) → 0. Moreover, since m/r ≥ 1, we have ζ_{ςρ_n}/(λ_n n^{1/q}) → 0 and, therefore, since ρ_n → +∞, there exists n̄ ∈ N∖{0} such that, for every integer n ≥ n̄, τζ_{ςρ_n}/(λ_n n^{1/q}) ≤ ηβρ_n^{r−1}. Suppose that q > r and take an integer n ≥ n̄. Evaluating the maximum in (5.58), we obtain

(ψ̂_{ρ_n})^♮( τζ_{ςρ_n}/(λ_n n^{1/q}) ) ≤ 2^q ( τρ_n^{q−r} ζ_{ςρ_n}/(ηβ λ_n n^{1/q}) )^{1/(q−1)}.  (5.60)

Therefore, substituting the expression of ρ_n yields

ζ_{ςρ_n} (ψ̂_{ρ_n})^♮( τζ_{ςρ_n}/(λ_n n^{1/q}) ) ≤ 2^q ( (‖ℓ(·, ·, 0)‖_∞ + 1)^{q/r−1}/(ηβ)^{q/r} )^{1/(q−1)} ( τ ζ_{ςρ_n}^q/(λ_n^{q/r} n^{1/q}) )^{1/(q−1)}.  (5.61)

On the other hand, if q ≤ r, (5.58) yields

ζ_{ςρ_n} (ψ̂_{ρ_n})^♮( τζ_{ςρ_n}/(λ_n n^{1/q}) ) ≤ (τ/(ηβ))^{1/(r−1)} ( ζ_{ςρ_n}^r/(λ_n n^{1/q}) )^{1/(r−1)}.  (5.62)

Thus, altogether, (5.61) and (5.62) imply that there exists γ ∈ R_{++} such that, for every integer n ≥ n̄,

ζ_{ςρ_n} (ψ̂_{ρ_n})^♮( τζ_{ςρ_n}/(λ_n n^{1/q}) ) ≤ γ τ^{1/(m−1)} ( ζ_{ςρ_n}^m/(λ_n^{m/r} n^{1/q}) )^{1/(m−1)}.  (5.63)

It therefore follows from (5.15) that the right-hand side of (5.63) converges to zero and hence that (5.6) is fulfilled. Likewise, (5.16) implies (5.7). Altogether, the statement follows from Theorem 5.4(i).

(ii): It follows from Remark 4.15(v)(b) that ℓ(·, ·, 0) is bounded and that, for every ρ ∈ R_{++}, ζ_ρ ≤ (p − 1)‖b‖_∞ + 3cp max{1, ρ^{p−1}}. Set (∀n ∈ N) ρ_n = ((‖ℓ(·, ·, 0)‖_∞ + 1)/(ηβλ_n))^{1/r}. Then there exists γ ∈ R_{++} such that (∀n ∈ N) ζ_{ρ_n} ≤ γ/λ_n^{(p−1)/r}. Thus, the statement follows from (i).

(iii): Fix τ ∈ R_{++} and set (∀n ∈ N) ρ_n = ((R(0) + 1)/(ηβλ_n))^{1/r}. Then (5.57) yields (∀n ∈ N) ψ_0^♮((R(0) + 1)/λ_n) ≤ ρ_n. Since m/r ≥ 1, 1/(λ_n^{m/r} n^{1/q}) → 0 implies 1/(λ_n n^{1/q}) → 0. Moreover, since ρ_n → +∞, there exists n̄ ∈ N∖{0} such that, for every integer n ≥ n̄, τ/(λ_n n^{1/q}) ≤ ηβρ_n^{r−1}. Suppose that q > r and take an integer n ≥ n̄. Evaluating the maximum in (5.58), we obtain

(ψ̂_{ρ_n})^♮( τ/(λ_n n^{1/q}) ) ≤ 2^q ( τρ_n^{q−r}/(ηβ λ_n n^{1/q}) )^{1/(q−1)} = 2^q ( (R(0) + 1)^{q/r−1}/(ηβ)^{q/r} )^{1/(q−1)} ( τ/(λ_n^{q/r} n^{1/q}) )^{1/(q−1)}.  (5.64)

On the other hand, if q ≤ r, (5.58) yields

(ψ̂_{ρ_n})^♮( τ/(λ_n n^{1/q}) ) ≤ ( (τ/(ηβ)) (1/(λ_n n^{1/q})) )^{1/(r−1)}.  (5.65)

Thus (5.18), together with (5.64) and (5.65), implies that (5.10) is fulfilled. Likewise, the assumption log n/(λ_n^{m/r} n^{1/q}) → 0 implies that (5.11) holds. Altogether, the statement follows from Theorem 5.4(iii).
A Appendix

A.1 Lipschitz continuity of convex functions

We state here some basic facts on Lipschitz properties of convex functions.

Proposition A.1 Let B be a real Banach space and let F : B → [0, +∞] be proper and convex. Then the following hold:

(i) [39, Proposition 1.11] Let u_0 ∈ B, and suppose that there exist a neighborhood U of u_0 and c ∈ R_+ such that

(∀u ∈ U)  |F(u) − F(u_0)| ≤ c‖u − u_0‖.  (A.1)

Then ∂F(u_0) ≠ ∅ and sup ‖∂F(u_0)‖ ≤ c.

(ii) [57, Corollary 2.2.12] Let u_0 ∈ B, and suppose that, for some (ρ, δ) ∈ R_{++}², F is bounded on u_0 + B(ρ + δ). Then F is Lipschitz continuous relative to u_0 + B(ρ) with constant

((2ρ + δ)/(ρ + δ)) (1/δ) sup F(u_0 + B(ρ + δ)).  (A.2)
if p = 1 if p > 1.
(ii) Let ρ ∈ R++ . Then F is Lipschitz continuous relative to B(ρ) with constant ( c if p = 1 p−1 3cp max{1, ρ } + (p − 1)b if p > 1.
(A.3)
(A.4)
Proof. (i): Let (ǫ, δ) ∈ R²_{++}. Since F ≤ c‖·‖^p + b, F is bounded on u + B(ǫ + δ), and it follows from Proposition A.1(ii) that F is Lipschitz continuous relative to u + B(ǫ) with constant ((2ǫ + δ)/(ǫ + δ))(1/δ)(c(‖u‖ + ǫ + δ)^p + b). Then Proposition A.1(i) entails that ∂F(u) ≠ ∅ and

sup‖∂F(u)‖ ≤ ((2ǫ + δ)/(ǫ + δ)) (1/δ) ( c(‖u‖ + ǫ + δ)^p + b ). (A.5)

Letting ǫ → 0^+ in (A.5), we get

sup‖∂F(u)‖ ≤ c( ‖u‖/δ + 1 )( ‖u‖ + δ )^{p−1} + b/δ. (A.6)

If p = 1, letting δ → +∞ in (A.6) yields sup‖∂F(u)‖ ≤ c. Now suppose that p > 1 and set s = max{‖u‖, 1}. Then, since ‖u‖ ≤ s, (A.6) implies that

sup‖∂F(u)‖ ≤ c( s/δ + 1 ) s^{p−1} ( 1 + δ/s )^{p−1} + b/δ ≤ c( s/δ + 1 ) s^{p−1} e^{δ(p−1)/s} + b/δ, (A.7)

where we took into account that (1 + δ/s)^{s/δ} ≤ e. By choosing δ = s/(p − 1), we get sup‖∂F(u)‖ ≤ 3cp s^{p−1} + (p − 1)b/s, and (A.3) follows since 1/s ≤ 1.

(ii): Let (u, v) ∈ B(ρ)². It follows from (i) that ∂F(u) ≠ ∅ and ∂F(v) ≠ ∅. Let u* ∈ ∂F(u) and v* ∈ ∂F(v). Then F(v) − F(u) ≥ ⟨v − u, u*⟩ and F(u) − F(v) ≥ ⟨u − v, v*⟩. Hence |F(u) − F(v)| ≤ max{‖u*‖, ‖v*‖} ‖u − v‖, and the statement follows from (i). □
Proposition A.3 Let B be a real Banach space, let ρ ∈ R_{++}, let p ∈ ]1, +∞[, let b ∈ R_+, let c ∈ R_{++}, and set F = c‖·‖^p + b. Then F is Lipschitz continuous relative to B(ρ) with constant cpρ^{p−1}.

Proof. Let (u, v) ∈ B² and let u* ∈ J_{B,p}(u). Then (2.7) yields ‖u‖^p − ‖v‖^p ≤ p⟨u − v, u*⟩ ≤ p‖u*‖ ‖u − v‖ = p‖u‖^{p−1} ‖u − v‖. Swapping u and v yields

|‖u‖^p − ‖v‖^p| ≤ p max{ ‖u‖^{p−1}, ‖v‖^{p−1} } ‖u − v‖, (A.8)

and the claim follows. □
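As a quick numerical sanity check (our illustration, not part of the paper's argument), the following Python sketch compares the generic Lipschitz constant of Proposition A.2(ii) with the sharper constant of Proposition A.3 on the model case F = c‖·‖^p with b = 0 in the Euclidean space R^d; all parameter values are arbitrary choices of ours.

```python
import numpy as np

# Model case F(u) = c * ||u||^p, so that F <= c||.||^p + b with b = 0.
rng = np.random.default_rng(0)
d, p, c, rho, b = 5, 3.0, 2.0, 1.5, 0.0

def F(u):
    return c * np.linalg.norm(u) ** p

# Lipschitz constants on B(rho): Proposition A.2(ii) (any convex F dominated
# by c||.||^p + b) vs. Proposition A.3 (exact power function, sharper).
L_A2 = 3 * c * p * max(1.0, rho ** (p - 1)) + (p - 1) * b
L_A3 = c * p * rho ** (p - 1)

def random_point_in_ball(radius):
    """Uniform sample from the closed Euclidean ball of the given radius."""
    x = rng.standard_normal(d)
    return radius * rng.random() ** (1 / d) * x / np.linalg.norm(x)

# Largest sampled difference quotient |F(u) - F(v)| / ||u - v|| on B(rho).
worst_ratio = max(
    abs(F(u) - F(v)) / np.linalg.norm(u - v)
    for u, v in ((random_point_in_ball(rho), random_point_in_ball(rho))
                 for _ in range(2000))
)

# The sampled quotients stay below c*p*rho^(p-1), which is itself below
# the generic constant of Proposition A.2(ii).
assert worst_ratio <= L_A3 <= L_A2
```

The experiment also shows why Proposition A.3 is worth stating separately: for the pure power function its constant cpρ^{p−1} is smaller than the generic bound by roughly a factor of 3.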
A.2 Concentration inequalities in Banach spaces

This section provides fundamental concentration inequalities in probability theory in Banach spaces. In the Hilbert space setting, these results are well known [55]. The following result collects [55, Theorem 3.3.1], [46, Theorem 6.13], and a fundamental inequality [34, Proposition 9.11] which is valid in Rademacher-type Banach spaces (see Section 2).

Lemma A.4 Let (Ω, A, P) be a probability space, let B be a separable real Banach space, let n ∈ N ∖ {0}, and let (U_i)_{1≤i≤n} be a family of independent integrable random variables from Ω to B. Then, for every ε ∈ R_{++} and every t ∈ R_{++},

P( ‖Σ_{i=1}^n U_i‖ ≥ nε ) ≤ exp( −tεn + t E_P‖Σ_{i=1}^n U_i‖ + Σ_{i=1}^n E_P( e^{t‖U_i‖} − 1 − t‖U_i‖ ) ). (A.9)

Moreover, if B is of Rademacher type q ∈ ]1, 2] with Rademacher constant T_q and if, for every i ∈ {1, ..., n}, E_P U_i = 0, then

E_P‖Σ_{i=1}^n U_i‖^q ≤ (2T_q)^q Σ_{i=1}^n E_P‖U_i‖^q. (A.10)
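For intuition, here is a small Monte Carlo check (our illustration, with the normalization T_2 = 1 for the Hilbert case B = R^d, an assumption of this sketch) that the type inequality (A.10) holds with room to spare for independent mean-zero summands, for which it in fact reduces to an identity in Hilbert spaces.

```python
import numpy as np

# Monte Carlo check of (A.10) for B = R^d (Rademacher type q = 2; we take
# T_2 = 1, so (2*T_q)^q = 4 -- a normalization assumption for this sketch).
rng = np.random.default_rng(1)
d, n, trials = 3, 20, 5000

# Independent, mean-zero summands U_1, ..., U_n, uniform on [-1, 1]^d.
U = rng.uniform(-1.0, 1.0, size=(trials, n, d))

lhs = np.mean(np.linalg.norm(U.sum(axis=1), axis=1) ** 2)         # E ||sum_i U_i||^2
rhs = n * np.mean(np.linalg.norm(U.reshape(-1, d), axis=1) ** 2)  # sum_i E ||U_i||^2

# In a Hilbert space, independence and E U_i = 0 give lhs = rhs exactly
# (expand the square), so the type-2 bound lhs <= 4 * rhs is loose here.
assert lhs <= 4 * rhs
```

In a genuinely non-Hilbertian type-q space the two sides need not agree, which is where the constant (2T_q)^q earns its keep.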
We now provide the Banach space valued versions of the classical Hoeffding and Bernstein inequalities. The proofs are similar to those of [46, Theorem 6.14 and Corollary 6.15], which deal with the Hilbert space case. A closely related result is [10, Corollary 2.2].
Theorem A.5 Let (Ω, A, P) be a probability space and let B be a separable real Banach space of Rademacher type q ∈ ]1, 2] with Rademacher constant T_q. Let (β, σ) ∈ R²_{++}, let n ∈ N ∖ {0}, let (U_i)_{1≤i≤n} be a family of independent random variables from Ω to B satisfying max_{1≤i≤n}‖U_i‖ ≤ β P-a.s., and let τ ∈ R_{++}. Then the following hold:

(i) (Hoeffding's inequality)

P( ‖(1/n) Σ_{i=1}^n (U_i − E_P U_i)‖ ≥ 4βT_q/n^{1−1/q} + 4τβ/(3n) + 2β√(2τ/n) ) ≤ e^{−τ}. (A.11)

(ii) (Bernstein's inequality) Suppose that, for every i ∈ {1, ..., n}, E_P U_i = 0 and E_P‖U_i‖^q ≤ σ^q. Then

P( ‖(1/n) Σ_{i=1}^n U_i‖ ≥ 2σT_q/n^{1−1/q} + 2τβ/(3n) + √(2β^{2−q}σ^q τ/n) ) ≤ e^{−τ}. (A.12)

Proof. (ii): It follows from Jensen's inequality and Lemma A.4 that

( E_P‖Σ_{i=1}^n U_i‖ )^q ≤ E_P‖Σ_{i=1}^n U_i‖^q ≤ (2T_q)^q Σ_{i=1}^n E_P‖U_i‖^q ≤ (2T_q)^q n σ^q. (A.13)

Hence E_P‖Σ_{i=1}^n U_i‖ ≤ 2T_q σ n^{1/q}. Now let t ∈ R_+. Then

Σ_{i=1}^n E_P( e^{t‖U_i‖} − 1 − t‖U_i‖ ) = Σ_{i=1}^n Σ_{m=2}^{+∞} (t^m/m!) E_P( ‖U_i‖^{m−q} ‖U_i‖^q ) ≤ Σ_{m=2}^{+∞} (t^m/m!) β^{m−q} Σ_{i=1}^n E_P‖U_i‖^q ≤ (nσ^q/β^q)( e^{tβ} − 1 − tβ ), (A.14)

and, using Lemma A.4, we obtain that, for every ε ∈ R_{++},

P( ‖Σ_{i=1}^n U_i‖ ≥ nε ) ≤ exp( −tεn + 2tσT_q n^{1/q} + (nσ^q/β^q)( e^{tβ} − 1 − tβ ) ). (A.15)

For every ε ∈ R_{++} such that εn − 2T_qσn^{1/q} > 0, the right-hand side of (A.15) reaches its minimum at

t̄ = (1/β) log(1 + α), where α = ( εn − 2T_qσn^{1/q} ) β^{q−1}/(nσ^q). (A.16)

Moreover, as in [46, Theorem 6.14], one gets

−t̄εn + 2t̄σT_q n^{1/q} + (nσ^q/β^q)( e^{t̄β} − 1 − t̄β ) ≤ −(3nσ^q/(2β^q)) α²/(α + 3). (A.17)

Now set

γ = β^q τ/(3nσ^q) and ε = (τβ/(3n))( √(6/γ + 1) + 1 ) + 2T_qσ/n^{1−1/q}. (A.18)

Then εn − 2T_qσn^{1/q} > 0, and (A.16) yields

α = (3γn/(τβ))( ε − 2T_qσ/n^{1−1/q} ) = γ + √(γ² + 6γ), (A.19)

so that α² = 2γ(α + 3) = 2β^qτ(α + 3)/(3nσ^q). Thus, (A.15) and (A.17) yield

P( ‖(1/n) Σ_{i=1}^n U_i‖ ≥ ε ) ≤ e^{−τ}. (A.20)

From (A.18), substituting the expression of γ into that of ε, we obtain

ε = 2T_qσ/n^{1−1/q} + τβ/(3n) + √( β²τ²/(9n²) + 2σ^qτβ^{2−q}/n ) ≤ 2T_qσ/n^{1−1/q} + 2τβ/(3n) + √( 2β^{2−q}σ^qτ/n ), (A.21)

and (A.12) follows.

(i): For every i ∈ {1, ..., n}, set V_i = U_i − E_P U_i, so that E_P V_i = 0, ‖V_i‖ ≤ 2β P-a.s., and E_P‖V_i‖^q ≤ (2β)^q. Therefore the statement follows from (ii) applied to (V_i)_{1≤i≤n} with β and σ replaced by 2β. □
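As a closing illustration (ours, not taken from the paper), the Hoeffding bound (A.11) can be probed by simulation in the Hilbert case B = R^d with q = 2 and the normalization T_2 = 1 (an assumption of this sketch): the empirical tail probability at the theorem's threshold should not exceed e^{−τ}.

```python
import numpy as np

# Monte Carlo probe of Hoeffding's inequality (A.11) for B = R^d, q = 2,
# with the normalization T_q = 1 (an assumption of this sketch).
rng = np.random.default_rng(2)
d, n, tau, trials = 3, 200, 1.0, 2000
beta = np.sqrt(d)  # ||U_i|| <= beta a.s. for U_i uniform on [-1, 1]^d

# Deviation threshold from (A.11): 4*beta*T_q/n^(1-1/q) + 4*tau*beta/(3*n)
# + 2*beta*sqrt(2*tau/n), specialized to q = 2 and T_q = 1.
eps = 4 * beta / np.sqrt(n) + 4 * tau * beta / (3 * n) + 2 * beta * np.sqrt(2 * tau / n)

U = rng.uniform(-1.0, 1.0, size=(trials, n, d))  # independent, E_P U_i = 0
dev = np.linalg.norm(U.mean(axis=1), axis=1)     # ||(1/n) sum_i (U_i - E_P U_i)||
emp_prob = float(np.mean(dev >= eps))

assert emp_prob <= np.exp(-tau)  # empirical tail within the e^{-tau} budget
```

As is typical of exponential inequalities, the observed tail is far below the budget e^{−τ}; the bound is dimension-free and holds uniformly over all distributions satisfying the hypotheses, which is what consistency proofs such as Theorem 5.4 require.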
References

[1] R. A. Adams and J. J. F. Fournier, Sobolev Spaces, 2nd ed. Elsevier, Amsterdam 2003.
[2] A. Antoniadis, D. Leporini, and J.-C. Pesquet, Wavelet thresholding for some classes of non-Gaussian noise, Statist. Neerlandica, vol. 56, pp. 434–453, 2002.
[3] H. Attouch, Viscosity solutions of minimization problems, SIAM J. Optim., vol. 6, pp. 769–805, 1996.
[4] H. Attouch, G. Buttazzo, and G. Michaille, Variational Analysis in Sobolev and BV Spaces. SIAM, Philadelphia, PA 2006.
[5] H. Attouch and R. J.-B. Wets, Quantitative stability of variational systems: I. The epigraphical distance, Trans. Amer. Math. Soc., vol. 328, pp. 695–729, 1991.
[6] H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, New York 2011.
[7] O. Blasco and J. van Neerven, Spaces of operator-valued functions measurable with respect to the strong operator topology, in: Vector Measures, Integration and Related Topics, pp. 65–78. Birkhäuser, Basel 2010.
[8] B. Beauzamy, Introduction to Banach Spaces and Their Geometry, 2nd ed. North-Holland, Amsterdam 1985.
[9] G. Beer, Topologies on Closed and Closed Convex Sets. Kluwer, Dordrecht 1993.
[10] D. Bosq, Linear Processes in Function Spaces. Springer, New York 2000.
[11] N. Bourbaki, Intégration, Chapitres 1 à 4, 2nd ed. Hermann, Paris 1965. English translation: Integration I, Springer, New York 2004.
[12] P. Bühlmann and S. van de Geer, Statistics for High-Dimensional Data. Springer, Berlin 2011.
[13] D. Butnariu, Y. Censor, and S. Reich, Iterative averaging of entropic projections for solving stochastic convex feasibility problems, Comput. Optim. Appl., vol. 8, pp. 21–39, 1997.
[14] D. Butnariu and A. N. Iusem, Totally Convex Functions for Fixed Points Computation and Infinite Dimensional Optimization. Kluwer, Dordrecht 2000.
[15] D. Butnariu, A. N. Iusem, and C. Zălinescu, On uniform convexity, total convexity and convergence of the proximal point and outer Bregman projection algorithms in Banach spaces, J. Convex Anal., vol. 10, pp. 35–61, 2003.
[16] D. Butnariu and E. Resmerita, Bregman distances, totally convex functions, and a method for solving operator equations in Banach spaces, Abstr. Appl. Anal., art. 84919, 39 pp., 2006.
[17] C. Carmeli, E. De Vito, and A. Toigo, Vector valued reproducing kernel Hilbert spaces of integrable functions and Mercer theorem, Anal. Appl. (Singap.), vol. 4, pp. 377–408, 2006.
[18] C. Carmeli, E. De Vito, A. Toigo, and V. Umanità, Vector valued reproducing kernel Hilbert spaces and universality, Anal. Appl. (Singap.), vol. 8, pp. 19–61, 2010.
[19] C. Castaing and M. Valadier, Convex Analysis and Measurable Multifunctions. Lecture Notes in Math. 580. Springer, New York 1977.
[20] I. Cioranescu, Geometry of Banach Spaces, Duality Mappings and Nonlinear Problems. Kluwer, Dordrecht 1990.
[21] P. L. Combettes, Strong convergence of block-iterative outer approximation methods for convex optimization, SIAM J. Control Optim., vol. 38, pp. 538–565, 2000.
[22] P. L. Combettes and J.-C. Pesquet, Proximal thresholding algorithm for minimization over orthonormal bases, SIAM J. Optim., vol. 18, pp. 1351–1376, 2007.
[23] C. De Mol, E. De Vito, and L. Rosasco, Elastic-net regularization in learning theory, J. Complexity, vol. 25, pp. 201–230, 2009.
[24] E. De Vito, L. Rosasco, A. Caponnetto, M. Piana, and A. Verri, Some properties of regularized kernel methods, J. Mach. Learn. Res., vol. 5, pp. 1363–1390, 2004.
[25] J. Diestel and J. J. Uhl Jr., Vector Measures. AMS, Providence, RI 1977.
[26] N. Dinculeanu, Vector Measures. Pergamon Press, Oxford 1967.
[27] N. Dinculeanu, Vector Integration and Stochastic Integration in Banach Spaces. Wiley-Interscience, New York 2000.
[28] G. E. Fasshauer, F. J. Hickernell, and Q. Ye, Solving support vector machines in reproducing kernel Banach spaces with positive definite functions, Appl. Comput. Harmon. Anal., vol. 38, pp. 115–139, 2015.
[29] I. Fonseca and G. Leoni, Modern Methods in the Calculus of Variations: Lp Spaces. Springer, New York 2007.
[30] W. Fu, Penalized regressions: the bridge versus the lasso, J. Comput. Graph. Stat., vol. 7, pp. 397–416, 1998.
[31] K. Goebel and S. Reich, Uniform Convexity, Hyperbolic Geometry, and Nonexpansive Mappings. Marcel Dekker, New York 1984.
[32] L. Györfi, M. Kohler, A. Krzyżak, and H. Walk, A Distribution-Free Theory of Nonparametric Regression. Springer, New York 2002.
[33] V. Koltchinskii, Sparsity in penalized empirical risk minimization, Ann. Inst. Henri Poincaré Probab. Stat., vol. 45, pp. 7–57, 2009.
[34] M. Ledoux and M. Talagrand, Probability in Banach Spaces: Isoperimetry and Processes. Springer, New York 1991.
[35] J. Lindenstrauss and L. Tzafriri, Classical Banach Spaces II. Springer, Berlin 1979.
[36] R. Lucchetti, Convexity and Well-Posed Problems. Springer, New York 2006.
[37] C. A. Micchelli and M. Pontil, A function representation for learning in Banach spaces, in: Lecture Notes in Comput. Sci. 3120, pp. 255–269. Springer, New York 2004.
[38] J.-P. Penot, Continuity properties of projection operators, J. Inequal. Appl., vol. 5, pp. 509–521, 2005.
[39] R. R. Phelps, Convex Functions, Monotone Operators and Differentiability, 2nd ed. Lecture Notes in Math. 1364. Springer, New York 1993.
[40] M. M. Rao and Z. D. Ren, Theory of Orlicz Spaces. Marcel Dekker, New York 1991.
[41] R. T. Rockafellar, Conjugate Duality and Optimization. SIAM, Philadelphia, PA 1974.
[42] B. Schölkopf, R. Herbrich, and A. Smola, A generalized representer theorem, in: Computational Learning Theory, Lecture Notes in Comput. Sci. 2111, pp. 416–426. Springer, Berlin 2001.
[43] G. Song and H. Zhang, Reproducing kernel Banach spaces with the ℓ1 norm II: Error analysis for regularized least square regression, Neural Comput., vol. 23, pp. 2713–2729, 2011.
[44] I. Steinwart, Consistency of support vector machines and other regularized kernel classifiers, IEEE Trans. Inform. Theory, vol. 51, pp. 128–142, 2005.
[45] I. Steinwart, Two oracle inequalities for regularized boosting classifiers, Stat. Interface, vol. 2, pp. 271–284, 2009.
[46] I. Steinwart and A. Christmann, Support Vector Machines. Springer, New York 2008.
[47] R. Tibshirani, Regression shrinkage and selection via the lasso, J. R. Stat. Soc. Ser. B Stat. Methodol., vol. 58, pp. 267–288, 1996.
[48] A. B. Tsybakov, Introduction to Nonparametric Estimation. Springer, New York 2009.
[49] A. W. van der Vaart and J. A. Wellner, Weak Convergence and Empirical Processes. Springer, New York 1996.
[50] V. N. Vapnik, Statistical Learning Theory. Wiley, New York 1998.
[51] S. Villa, S. Salzo, L. Baldassarre, and A. Verri, Accelerated and inexact forward-backward algorithms, SIAM J. Optim., vol. 23, pp. 1607–1633, 2013.
[52] A. A. Vladimirov, Ju. E. Nesterov, and Ju. N. Čekanov, Uniformly convex functionals, Vestnik Moskov. Univ. Ser. XV Vychisl. Mat. Kibernet., vol. 3, pp. 12–23, 1978.
[53] H. K. Xu, Inequalities in Banach spaces with applications, Nonlinear Anal., vol. 16, pp. 1127–1138, 1991.
[54] Z. B. Xu and G. F. Roach, Characteristic inequalities of uniformly convex and uniformly smooth Banach spaces, J. Math. Anal. Appl., vol. 157, pp. 189–210, 1991.
[55] V. Yurinsky, Sums and Gaussian Vectors. Lecture Notes in Math. 1617. Springer, New York 1995.
[56] C. Zălinescu, On uniformly convex functions, J. Math. Anal. Appl., vol. 95, pp. 344–374, 1983.
[57] C. Zălinescu, Convex Analysis in General Vector Spaces. World Scientific, River Edge, NJ 2002.
[58] H. Zhang, Y. Xu, and J. Zhang, Reproducing kernel Banach spaces for machine learning, J. Mach. Learn. Res., vol. 10, pp. 2741–2775, 2009.
[59] H. Zhang and J. Zhang, Regularized learning in Banach spaces as an optimization problem: representer theorems, J. Global Optim., vol. 54, pp. 235–250, 2012.
[60] H. Zhang and J. Zhang, Vector-valued reproducing kernel Banach spaces with applications to multi-task learning, J. Complexity, vol. 29, pp. 195–215, 2013.
[61] H. Zou and T. Hastie, Regularization and variable selection via the elastic net, J. R. Stat. Soc. Ser. B Stat. Methodol., vol. 67, pp. 301–320, 2005.