FAST INERTIAL DYNAMICS AND FISTA ALGORITHMS IN CONVEX OPTIMIZATION. PERTURBATION ASPECTS.
arXiv:1507.01367v1 [math.OC] 6 Jul 2015
HEDY ATTOUCH AND ZAKI CHBANI

Abstract. In a Hilbert space setting H, we study the fast convergence properties as t → +∞ of the trajectories of the second-order differential equation
\[
\ddot{x}(t) + \frac{\alpha}{t}\,\dot{x}(t) + \nabla\Phi(x(t)) = g(t),
\]
where ∇Φ is the gradient of a convex continuously differentiable function Φ : H → R, α is a positive parameter, and g : [t_0, +∞[ → H is a "small" perturbation term. In this damped inertial system, the viscous damping coefficient α/t vanishes asymptotically, but not too rapidly. For α ≥ 3 and \(\int_{t_0}^{+\infty} t\|g(t)\|\,dt < +\infty\), just assuming that argmin Φ ≠ ∅, we show that any trajectory of the above system satisfies the fast convergence property
\[
\Phi(x(t)) - \min_H \Phi \le \frac{C}{t^2}.
\]
For α > 3, we show that any trajectory converges weakly to a minimizer of Φ, and we show the strong convergence property in various practical situations. This complements the results obtained by Su-Boyd-Candès, and Attouch-Peypouquet-Redont, in the unperturbed case g = 0. The parallel study of the time-discretized version of this system provides new insight into the effect of errors, or perturbations, on Nesterov-type algorithms. We obtain fast convergence of the values, and convergence of the trajectories, for a perturbed version of the variant of FISTA recently considered by Chambolle-Dossal, and Su-Boyd-Candès.
1. Introduction

Throughout the paper, H is a real Hilbert space endowed with the scalar product ⟨·,·⟩, with ‖x‖² = ⟨x,x⟩ for any x ∈ H. Let Φ : H → R be a convex differentiable function whose gradient ∇Φ is Lipschitz continuous on bounded sets. We suppose that S = argmin Φ is nonempty. Let α be a positive parameter. We are going to study the asymptotic behaviour (as t → +∞) of the trajectories of the second-order differential equation
\[
(1)\qquad (\mathrm{AVD})_{\alpha,g}\qquad \ddot{x}(t) + \frac{\alpha}{t}\,\dot{x}(t) + \nabla\Phi(x(t)) = g(t),
\]
and consider similar questions for the corresponding algorithms. Let us fix some t_0 > 0. The second member g : [t_0, +∞[ → H is a perturbation term (integrable source term), such that g(t) is small for large t. Precisely, in our main result, Theorem 2.1, assuming that α ≥ 3 and \(\int_{t_0}^{+\infty} t\|g(t)\|\,dt < +\infty\), we show that any trajectory of (1) satisfies the fast convergence property
\[
(2)\qquad \Phi(x(t)) - \min_H \Phi \le \frac{C}{t^2}.
\]
This extends the fast convergence of the values obtained by Su, Boyd and Candès in [41] in the unperturbed case g = 0. In Theorem 3.1, when α > 3, we show that any trajectory of (1) converges weakly to a minimizer of Φ, which extends the convergence result obtained by Attouch, Peypouquet, and Redont in [15] in the case g = 0.

This inertial system involves a viscous damping attached to the term (α/t)ẋ(t). It is an isotropic linear damping with a viscous parameter α/t which vanishes asymptotically, but not too rapidly. The asymptotic behaviour of the inertial gradient-like system
\[
(3)\qquad (\mathrm{AVD})\qquad \ddot{x}(t) + a(t)\dot{x}(t) + \nabla\Phi(x(t)) = 0,
\]
with Asymptotic Vanishing Damping ((AVD) for short), has been studied by Cabot, Engler and Gadat in [24], [25]. As a main result, they proved that, under moderate decrease of a(·) to zero, i.e., a(t) → 0 as t → +∞ with \(\int_{t_0}^{+\infty} a(t)\,dt = +\infty\), then for any trajectory x(·) of (3)
\[
(4)\qquad \Phi(x(t)) \to \min_H \Phi.
\]
As a striking property, for the specific choice a(t) = α/t with α ≥ 3, for example when considering
\[
(5)\qquad \ddot{x}(t) + \frac{3}{t}\,\dot{x}(t) + \nabla\Phi(x(t)) = 0,
\]
Date: July 6, 2015.
Key words and phrases. Convex optimization, fast gradient methods, inertial dynamics, vanishing viscosity, Nesterov method, FISTA algorithm, perturbations, errors.
With the support of ECOS grant C13E03. Effort sponsored by the Air Force Office of Scientific Research, Air Force Material Command, USAF, under grant number FA9550-14-1-0056.
it has been proved by Su, Boyd, and Candès in [41] that the fast convergence property of the values (2) is satisfied by the trajectories of (5). In the same article [41], the authors show that (5) can be seen as a continuous version of the fast convergent method of Nesterov; see [31]-[32]-[33]-[34]. For the continuous dynamic, a related study concerning the case a(t) = 1/t^θ, 0 < θ < 1, has been developed by Jendoubi and May in [29], with, roughly speaking, O(1/t^{1+θ}) convergence. The analysis developed in [29] does not contain the case a(t) = α/t, where the introduction of an additional scaling, due to the coefficient α, requires a specific analysis. That is our main concern in this paper. Our results provide new insight into the effect of perturbations or errors in the associated algorithms. They provide a guideline for the study of the preservation, under small perturbations, of the fast convergence property of the corresponding Nesterov-type algorithms. Specifically, we consider a perturbed version of the variant of FISTA recently considered by Chambolle and Dossal [26], and Su, Boyd and Candès [41]. We obtain fast convergence of the values in the case α ≥ 3, and convergence of the trajectories in the case α > 3. Convergence of the trajectories in the case α = 3, which corresponds to Nesterov's algorithm, is still an open question.

2. Fast convergence of the values

Let Φ : H → R be a convex function whose gradient ∇Φ is Lipschitz continuous on bounded sets. Let t_0 > 0, α > 0, and g : [t_0, +∞[ → H such that \(\int_{t_0}^{+\infty} \|g(t)\|\,dt < +\infty\). We consider the second-order differential equation
\[
(6)\qquad (\mathrm{AVD})_{\alpha,g}\qquad \ddot{x}(t) + \frac{\alpha}{t}\,\dot{x}(t) + \nabla\Phi(x(t)) = g(t).
\]
From the Cauchy-Lipschitz theorem, for any Cauchy data x(t_0) = x_0 ∈ H, ẋ(t_0) = x_1 ∈ H, we immediately infer the existence and uniqueness of a local solution to (6). The global existence follows from the energy estimate proved in Proposition 2.1, in the next paragraph. Throughout this paper we will use the following Gronwall-Bellman lemma; see [20, Lemme A.5] for a proof.
Lemma 2.1. Let m ∈ L¹(t_0, T; R) be such that m ≥ 0 a.e. on ]t_0, T[, and let c be a nonnegative constant. Suppose that w is a continuous function from [t_0, T] into R that satisfies, for all t ∈ [t_0, T],
\[
\frac{1}{2}w^2(t) \le \frac{1}{2}c^2 + \int_{t_0}^{t} m(\tau)\,w(\tau)\,d\tau.
\]
Then, for all t ∈ [t_0, T],
\[
|w(t)| \le c + \int_{t_0}^{t} m(\tau)\,d\tau.
\]
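Though Lemma 2.1 is used purely analytically in what follows, it is easy to sanity-check numerically. The sketch below (in Python, with the arbitrary illustrative choices t_0 = 0, c = 1, m ≡ 1 and w(t) = 1 + t/2, none of which come from the paper) verifies both the integral hypothesis and the resulting pointwise bound along a time grid:

```python
import math

# Numerical sanity check of the Gronwall-Bellman lemma (Lemma 2.1), with the
# hypothetical choices t0 = 0, c = 1, m ≡ 1, and w(t) = 1 + t/2, which
# satisfies the integral inequality (1/2)w(t)^2 <= (1/2)c^2 + ∫_0^t m w.
c = 1.0
dt = 0.001
int_mw = 0.0   # running value of ∫_0^t m(τ) w(τ) dτ
int_m = 0.0    # running value of ∫_0^t m(τ) dτ
t = 0.0
for _ in range(5000):
    w = 1.0 + 0.5 * t
    # hypothesis of the lemma at time t
    assert 0.5 * w * w <= 0.5 * c * c + int_mw + 1e-6
    # conclusion of the lemma at time t: |w(t)| <= c + ∫_0^t m
    assert abs(w) <= c + int_m + 1e-6
    int_mw += w * dt   # m ≡ 1
    int_m += dt
    t += dt
print("Gronwall-Bellman bound verified on this example")
```

Here the bound is far from tight (|w(t)| = 1 + t/2 against c + ∫m = 1 + t); equality is attained only for the extremal solution of the integral inequality.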
2.1. Energy estimates. The following estimates are obtained by considering the global energy of the system, and showing that it is a strict Lyapunov function.

Proposition 2.1. Suppose α > 0, and \(\int_{t_0}^{+\infty} \|g(t)\|\,dt < +\infty\). Then, for any orbit x : [t_0, +∞[ → H of (AVD)_{α,g},
\[
(7)\qquad \sup_{t}\ \|\dot{x}(t)\| < +\infty,
\]
\[
(8)\qquad \int_{t_0}^{+\infty} \frac{1}{t}\,\|\dot{x}(t)\|^2\,dt < +\infty.
\]
Precisely, for any t ≥ t_0,
\[
(9)\qquad \|\dot{x}(t)\| \le \|\dot{x}(t_0)\| + \sqrt{2\big(\Phi(x_0)-\min\nolimits_H\Phi\big)} + \int_{t_0}^{\infty}\|g(\tau)\|\,d\tau,
\]
\[
(10)\qquad \int_{t_0}^{\infty} \frac{\alpha}{\tau}\,\|\dot{x}(\tau)\|^2\,d\tau \le \frac{1}{2}\|\dot{x}(t_0)\|^2 + \Phi(x_0) - \min_H\Phi + \Big(\|\dot{x}(t_0)\| + \sqrt{2\big(\Phi(x_0)-\min\nolimits_H\Phi\big)} + \|g\|_{L^1(t_0,\infty)}\Big)\|g\|_{L^1(t_0,\infty)}.
\]
Proof. Fix some T > t_0. For t_0 ≤ t ≤ T, let us define the energy function
\[
(11)\qquad W_T(t) := \frac{1}{2}\|\dot{x}(t)\|^2 + \Phi(x(t)) - \min_H\Phi + \int_{t}^{T}\langle \dot{x}(\tau), g(\tau)\rangle\,d\tau.
\]
Since ẋ is continuous and g is integrable, the energy function W_T is well defined. Differentiating W_T with respect to time, and using (AVD)_{α,g}, we obtain
\[
\dot{W}_T(t) = \langle \dot{x}(t),\ \ddot{x}(t) + \nabla\Phi(x(t)) - g(t)\rangle = \Big\langle \dot{x}(t),\ -\frac{\alpha}{t}\dot{x}(t)\Big\rangle,
\]
that is,
\[
(12)\qquad \dot{W}_T(t) + \frac{\alpha}{t}\,\|\dot{x}(t)\|^2 \le 0.
\]
Hence W_T(·) is a decreasing function. In particular, W_T(t) ≤ W_T(t_0), i.e.,
\[
\frac{1}{2}\|\dot{x}(t)\|^2 + \Phi(x(t)) - \min_H\Phi + \int_{t}^{T}\langle \dot{x}(\tau), g(\tau)\rangle\,d\tau \le \frac{1}{2}\|\dot{x}(t_0)\|^2 + \Phi(x_0) - \min_H\Phi + \int_{t_0}^{T}\langle \dot{x}(\tau), g(\tau)\rangle\,d\tau.
\]
As a consequence,
\[
\frac{1}{2}\|\dot{x}(t)\|^2 \le \frac{1}{2}\|\dot{x}(t_0)\|^2 + \Phi(x_0) - \min_H\Phi + \int_{t_0}^{t}\|\dot{x}(\tau)\|\,\|g(\tau)\|\,d\tau.
\]
Applying the Gronwall-Bellman Lemma 2.1, we obtain
\[
\|\dot{x}(t)\| \le \Big(\|\dot{x}(t_0)\|^2 + 2\big(\Phi(x_0)-\min\nolimits_H\Phi\big)\Big)^{1/2} + \int_{t_0}^{t}\|g(\tau)\|\,d\tau \le \|\dot{x}(t_0)\| + \sqrt{2\big(\Phi(x_0)-\min\nolimits_H\Phi\big)} + \int_{t_0}^{t}\|g(\tau)\|\,d\tau.
\]
This being true for arbitrary T > t_0 and t_0 ≤ t ≤ T, we deduce that
\[
(13)\qquad \|\dot{x}(t)\| \le \|\dot{x}(t_0)\| + \sqrt{2\big(\Phi(x_0)-\min\nolimits_H\Phi\big)} + \int_{t_0}^{\infty}\|g(\tau)\|\,d\tau,
\]
which gives (7) and (9). As a consequence, the function W (corresponding to T = +∞)
\[
(14)\qquad W(t) := \frac{1}{2}\|\dot{x}(t)\|^2 + \Phi(x(t)) - \min_H\Phi + \int_{t}^{\infty}\langle \dot{x}(\tau), g(\tau)\rangle\,d\tau
\]
is well defined, and is minorized by
\[
(15)\qquad -\|\dot{x}\|_{L^\infty(t_0,+\infty)}\int_{t_0}^{\infty}\|g(\tau)\|\,d\tau.
\]
By (12) we have
\[
(16)\qquad \dot{W}(t) + \frac{\alpha}{t}\,\|\dot{x}(t)\|^2 \le 0.
\]
Integrating (16) from t_0 to t, and using (13), (15), we obtain
\[
\int_{t_0}^{\infty} \frac{\alpha}{\tau}\,\|\dot{x}(\tau)\|^2\,d\tau \le \frac{1}{2}\|\dot{x}(t_0)\|^2 + \Phi(x(t_0)) - \min_H\Phi + \|\dot{x}\|_{L^\infty(t_0,+\infty)}\int_{t_0}^{\infty}\|g(\tau)\|\,d\tau < +\infty
\]
\[
\le \frac{1}{2}\|\dot{x}(t_0)\|^2 + \Phi(x(t_0)) - \min_H\Phi + \Big(\|\dot{x}(t_0)\| + \sqrt{2\big(\Phi(x_0)-\min\nolimits_H\Phi\big)} + \|g\|_{L^1(t_0,+\infty)}\Big)\|g\|_{L^1(t_0,+\infty)},
\]
which gives (8) and (10).
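The estimates of Proposition 2.1 can be observed on a concrete trajectory. The following minimal sketch is purely illustrative and not part of the paper's argument: it integrates (AVD)_{α,g} on H = R with the hypothetical data Φ(x) = x²/2, α = 3, g(t) = 1/t², x(t_0) = 2, ẋ(t_0) = 0, by a semi-implicit Euler scheme, and checks the explicit velocity bound (9):

```python
import math

# Purely illustrative numerical sketch (not from the paper): integrate
# (AVD)_{α,g} on H = R with the hypothetical data Φ(x) = x²/2 (so ∇Φ(x) = x
# and min Φ = 0), α = 3, g(t) = 1/t², t0 = 1, x(t0) = 2, ẋ(t0) = 0, and
# check the a priori velocity bound (9):
#   |ẋ(t)| ≤ |ẋ(t0)| + sqrt(2(Φ(x0) − min Φ)) + ∫_{t0}^∞ |g(τ)| dτ.
alpha, t, dt = 3.0, 1.0, 1e-3
x, v = 2.0, 0.0
bound = abs(v) + math.sqrt(2.0 * 0.5 * x * x) + 1.0   # ∫_1^∞ τ⁻² dτ = 1
vmax = 0.0
while t < 50.0:
    a = -(alpha / t) * v - x + 1.0 / (t * t)   # ẍ from (AVD)_{α,g}
    v += dt * a          # semi-implicit (symplectic) Euler step
    x += dt * v
    t += dt
    vmax = max(vmax, abs(v))
assert vmax <= bound
print(f"sup_t |x'(t)| ≈ {vmax:.3f}  (theoretical bound {bound:.3f})")
```

In this example the observed supremum of |ẋ| is well below the theoretical bound 3, as expected: the Gronwall argument discards the dissipation term (α/t)‖ẋ‖².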
2.2. The main result. Let us state our main result.

Theorem 2.1. Suppose that α ≥ 3, and \(\int_{t_0}^{+\infty} \tau\|g(\tau)\|\,d\tau < +\infty\). Then, for any orbit x : [t_0, +∞[ → H of (AVD)_{α,g}, we have the following fast convergence of the values:
\[
\Phi(x(t)) - \min_H\Phi = O\Big(\frac{1}{t^2}\Big).
\]
Precisely,
\[
(17)\qquad \frac{2}{\alpha-1}\,t^2\big(\Phi(x(t)) - \inf\nolimits_H\Phi\big) \le C + 2\left(\Big(\frac{C}{\alpha-1}\Big)^{1/2} + \frac{1}{\alpha-1}\int_{t_0}^{\infty}\tau\|g(\tau)\|\,d\tau\right)\int_{t_0}^{\infty}\tau\|g(\tau)\|\,d\tau,
\]
with
\[
C = \frac{2}{\alpha-1}\,t_0^2\big(\Phi(x_0) - \inf\nolimits_H\Phi\big) + (\alpha-1)\Big\|x_0 - x^* + \frac{t_0}{\alpha-1}\dot{x}(t_0)\Big\|^2.
\]
Moreover,
\[
(18)\qquad \sup_{t \ge t_0}\Big\|x(t) - x^* + \frac{t}{\alpha-1}\dot{x}(t)\Big\| \le \Big(\frac{C}{\alpha-1}\Big)^{1/2} + \frac{1}{\alpha-1}\int_{t_0}^{\infty}\tau\|g(\tau)\|\,d\tau < +\infty.
\]
Proof. The proof is an adaptation to our setting (with an integrable source term g) of the argument developed by Su-Boyd-Candès in [41]. Fix some T > t_0, and x* ∈ S = argmin Φ. For t_0 ≤ t ≤ T, let us define the energy function
\[
(19)\qquad \mathcal{E}_{\alpha,g,T}(t) := \frac{2}{\alpha-1}\,t^2\big(\Phi(x(t)) - \inf\nolimits_H\Phi\big) + (\alpha-1)\Big\|x(t) - x^* + \frac{t}{\alpha-1}\dot{x}(t)\Big\|^2 + 2\int_{t}^{T}\tau\Big\langle x(\tau) - x^* + \frac{\tau}{\alpha-1}\dot{x}(\tau),\ g(\tau)\Big\rangle\,d\tau.
\]
Let us show that
\[
\dot{\mathcal{E}}_{\alpha,g,T}(t) + 2\,\frac{\alpha-3}{\alpha-1}\,t\big(\Phi(x(t)) - \min\nolimits_H\Phi\big) \le 0.
\]
Differentiation of E_{α,g,T}(·) gives
\[
(20)\qquad \dot{\mathcal{E}}_{\alpha,g,T}(t) = \frac{4}{\alpha-1}\,t\big(\Phi(x(t)) - \inf\nolimits_H\Phi\big) + \frac{2}{\alpha-1}\,t^2\langle \nabla\Phi(x(t)), \dot{x}(t)\rangle + 2(\alpha-1)\Big\langle x(t)-x^*+\frac{t}{\alpha-1}\dot{x}(t),\ \dot{x}(t) + \frac{1}{\alpha-1}\dot{x}(t) + \frac{t}{\alpha-1}\ddot{x}(t)\Big\rangle - 2t\Big\langle x(t)-x^*+\frac{t}{\alpha-1}\dot{x}(t),\ g(t)\Big\rangle
\]
\[
(21)\qquad\quad = \frac{4}{\alpha-1}\,t\big(\Phi(x(t)) - \inf\nolimits_H\Phi\big) + \frac{2}{\alpha-1}\,t^2\langle \nabla\Phi(x(t)), \dot{x}(t)\rangle + 2(\alpha-1)\Big\langle x(t)-x^*+\frac{t}{\alpha-1}\dot{x}(t),\ \frac{\alpha}{\alpha-1}\dot{x}(t) + \frac{t}{\alpha-1}\big(\ddot{x}(t)-g(t)\big)\Big\rangle.
\]
Then use (AVD)_{α,g} in this last expression, i.e., ẍ(t) − g(t) = −(α/t)ẋ(t) − ∇Φ(x(t)), to obtain
\[
\dot{\mathcal{E}}_{\alpha,g,T}(t) = \frac{4}{\alpha-1}\,t\big(\Phi(x(t)) - \inf\nolimits_H\Phi\big) + \frac{2}{\alpha-1}\,t^2\langle \nabla\Phi(x(t)), \dot{x}(t)\rangle - 2t\Big\langle x(t)-x^*+\frac{t}{\alpha-1}\dot{x}(t),\ \nabla\Phi(x(t))\Big\rangle
\]
\[
(22)\qquad\quad = \frac{4}{\alpha-1}\,t\big(\Phi(x(t)) - \inf\nolimits_H\Phi\big) - 2t\langle x(t)-x^*,\ \nabla\Phi(x(t))\rangle.
\]
By convexity of Φ,
\[
\Phi(x^*) \ge \Phi(x(t)) + \langle x^*-x(t), \nabla\Phi(x(t))\rangle.
\]
Replacing in (22), we obtain
\[
\dot{\mathcal{E}}_{\alpha,g,T}(t) + \Big(2 - \frac{4}{\alpha-1}\Big)\,t\big(\Phi(x(t)) - \inf\nolimits_H\Phi\big) \le 0.
\]
Equivalently,
\[
(23)\qquad \dot{\mathcal{E}}_{\alpha,g,T}(t) + 2\,\frac{\alpha-3}{\alpha-1}\,t\big(\Phi(x(t)) - \inf\nolimits_H\Phi\big) \le 0.
\]
As a consequence, for α ≥ 3, the function E_{α,g,T} is nonincreasing. In particular, E_{α,g,T}(t) ≤ E_{α,g,T}(t_0), which gives
\[
\frac{2}{\alpha-1}\,t^2\big(\Phi(x(t)) - \inf\nolimits_H\Phi\big) + (\alpha-1)\Big\|x(t)-x^*+\frac{t}{\alpha-1}\dot{x}(t)\Big\|^2 + 2\int_{t}^{T}\tau\Big\langle x(\tau)-x^*+\frac{\tau}{\alpha-1}\dot{x}(\tau),\ g(\tau)\Big\rangle\,d\tau
\]
\[
\le \frac{2}{\alpha-1}\,t_0^2\big(\Phi(x_0) - \inf\nolimits_H\Phi\big) + (\alpha-1)\Big\|x_0-x^*+\frac{t_0}{\alpha-1}\dot{x}(t_0)\Big\|^2 + 2\int_{t_0}^{T}\tau\Big\langle x(\tau)-x^*+\frac{\tau}{\alpha-1}\dot{x}(\tau),\ g(\tau)\Big\rangle\,d\tau.
\]
Equivalently,
\[
(24)\qquad \frac{2}{\alpha-1}\,t^2\big(\Phi(x(t)) - \inf\nolimits_H\Phi\big) + (\alpha-1)\Big\|x(t)-x^*+\frac{t}{\alpha-1}\dot{x}(t)\Big\|^2 \le C + 2\int_{t_0}^{t}\tau\Big\langle x(\tau)-x^*+\frac{\tau}{\alpha-1}\dot{x}(\tau),\ g(\tau)\Big\rangle\,d\tau,
\]
with
\[
C = \frac{2}{\alpha-1}\,t_0^2\big(\Phi(x_0) - \inf\nolimits_H\Phi\big) + (\alpha-1)\Big\|x_0-x^*+\frac{t_0}{\alpha-1}\dot{x}(t_0)\Big\|^2.
\]
From (24) we infer
\[
(25)\qquad \frac{1}{2}\Big\|x(t)-x^*+\frac{t}{\alpha-1}\dot{x}(t)\Big\|^2 \le \frac{C}{2(\alpha-1)} + \frac{1}{\alpha-1}\int_{t_0}^{t}\Big\|x(\tau)-x^*+\frac{\tau}{\alpha-1}\dot{x}(\tau)\Big\|\,\|\tau g(\tau)\|\,d\tau.
\]
Applying once more the Gronwall-Bellman Lemma 2.1, we obtain
\[
(26)\qquad \Big\|x(t)-x^*+\frac{t}{\alpha-1}\dot{x}(t)\Big\| \le \Big(\frac{C}{\alpha-1}\Big)^{1/2} + \frac{1}{\alpha-1}\int_{t_0}^{t}\tau\|g(\tau)\|\,d\tau.
\]
Since \(\int_{t_0}^{+\infty} t\|g(t)\|\,dt < +\infty\), it follows that
\[
(27)\qquad \sup_{t}\Big\|x(t)-x^*+\frac{t}{\alpha-1}\dot{x}(t)\Big\| \le \Big(\frac{C}{\alpha-1}\Big)^{1/2} + \frac{1}{\alpha-1}\int_{t_0}^{\infty}\tau\|g(\tau)\|\,d\tau < +\infty.
\]
Returning to (24), we conclude that
\[
(28)\qquad \frac{2}{\alpha-1}\,t^2\big(\Phi(x(t)) - \inf\nolimits_H\Phi\big) \le C + 2\left(\Big(\frac{C}{\alpha-1}\Big)^{1/2} + \frac{1}{\alpha-1}\int_{t_0}^{\infty}\tau\|g(\tau)\|\,d\tau\right)\int_{t_0}^{\infty}\tau\|g(\tau)\|\,d\tau.
\]
Remark 2.1. As a consequence, the energy function
\[
(29)\qquad \mathcal{E}_{\alpha,g}(t) := \frac{2}{\alpha-1}\,t^2\big(\Phi(x(t)) - \inf\nolimits_H\Phi\big) + (\alpha-1)\Big\|x(t)-x^*+\frac{t}{\alpha-1}\dot{x}(t)\Big\|^2 + 2\int_{t}^{+\infty}\tau\Big\langle x(\tau)-x^*+\frac{\tau}{\alpha-1}\dot{x}(\tau),\ g(\tau)\Big\rangle\,d\tau
\]
is well defined, and is a Lyapunov function for the dynamical system (AVD)_{α,g}.
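As an illustration of Theorem 2.1 (again with hypothetical data, not taken from the paper), one can track the rescaled values t²(Φ(x(t)) − min Φ) along a discretized trajectory and observe that they remain bounded:

```python
import math

# Illustrative check (data not from the paper) of the fast convergence rate
# in Theorem 2.1: with Φ(x) = x²/2, α = 3 and g(t) = 1/t²
# (so ∫ t|g(t)| dt < ∞), the quantity t²(Φ(x(t)) − min Φ) stays bounded
# along the discretized trajectory, i.e. Φ(x(t)) − min Φ = O(1/t²).
alpha, t, dt = 3.0, 1.0, 1e-3
x, v = 2.0, 0.0
sup_scaled = 0.0
while t < 100.0:
    a = -(alpha / t) * v - x + 1.0 / (t * t)   # ẍ from (AVD)_{α,g}
    v += dt * a
    x += dt * v
    t += dt
    sup_scaled = max(sup_scaled, t * t * 0.5 * x * x)  # t²(Φ(x(t)) − min Φ)
print(f"sup_t t²(Φ(x(t)) − min Φ) ≈ {sup_scaled:.3f}")
assert sup_scaled < 50.0
```

For this quadratic example the supremum is attained near t_0 and the rescaled values then decay, so the observed constant is far smaller than the worst-case C of (17).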
3. Convergence of trajectories

In the case α > 3, provided that the second member g(t) is sufficiently small for large t, we are going to show the convergence of the trajectories of the system
\[
(\mathrm{AVD})_{\alpha,g}\qquad \ddot{x}(t) + \frac{\alpha}{t}\,\dot{x}(t) + \nabla\Phi(x(t)) = g(t).
\]
3.1. Main statement, and preliminary results. The following convergence result is an extension to the perturbed case (with a source term g) of the convergence result obtained by Attouch-Peypouquet-Redont in [15].

Theorem 3.1. Let Φ : H → R be a convex continuously differentiable function such that S = argmin Φ is nonempty. Suppose that α > 3 and \(\int_{t_0}^{+\infty} t\|g(t)\|\,dt < +\infty\). Let t_0 > 0, and x : [t_0, +∞[ → H be a classical solution of (AVD)_{α,g}. Then, the following convergence properties hold:

a) (weak convergence) There exists some x* ∈ argmin Φ such that
\[
(30)\qquad x(t) \rightharpoonup x^* \ \text{weakly as}\ t \to +\infty.
\]
b) (fast convergence) There exists a positive constant C such that
\[
(31)\qquad \Phi(x(t)) - \min_H\Phi \le \frac{C}{t^2},
\]
\[
(32)\qquad \int_{t_0}^{\infty} t\big(\Phi(x(t)) - \inf\nolimits_H\Phi\big)\,dt < +\infty.
\]
c) (energy estimate)
\[
(33)\qquad \int_{t_0}^{\infty} t\|\dot{x}(t)\|^2\,dt < +\infty,
\]
\[
(34)\qquad \|\dot{x}(t)\| \le \frac{C}{t},
\]
and hence
\[
(35)\qquad \lim_{t\to\infty}\|\dot{x}(t)\| = 0.
\]

In order to analyze the convergence properties of the trajectories of system (1), we will use Opial's lemma [35], which we recall in its continuous form; see also [22], who initiated the use of this argument to analyze the asymptotic convergence of nonlinear contraction semigroups in Hilbert spaces.

Lemma 3.1. Let S be a nonempty subset of H and x : [0, +∞[ → H a map. Assume that
(i) for every z ∈ S, lim_{t→+∞} ‖x(t) − z‖ exists;
(ii) every weak sequential cluster point of the map x belongs to S.
Then
\[
w\text{-}\lim_{t\to+\infty} x(t) = x_\infty
\]
exists, for some element x_∞ ∈ S.

We also need the following result concerning the integration of a first-order nonautonomous differential inequality; see [15].

Lemma 3.2. Suppose that δ > 0, and let w : [δ, +∞[ → R be a continuously differentiable function that satisfies the differential inequality
\[
(36)\qquad \dot{w}(t) + \frac{\alpha}{t}\,w(t) \le k(t),
\]
for some α > 1, and some nonnegative function k : [δ, +∞[ → R such that t ↦ tk(t) ∈ L¹(δ, +∞). Then
\[
(37)\qquad w_+ \in L^1(\delta, +\infty).
\]
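Lemma 3.2 can be illustrated on an explicitly solvable example, chosen for illustration only and not taken from the paper: for δ = 1, α = 2 and k(t) = 1/t³ (so that t·k(t) = 1/t² is integrable), the equation ẇ + (α/t)w = k(t) with w(1) = 1 gives (t²w)' = 1/t, hence w(t) = (1 + ln t)/t², whose positive part is integrable with ∫₁^∞ w₊ = 2:

```python
import math

# Numerical check (hypothetical data, not from the paper) of Lemma 3.2:
# with δ = 1, α = 2, k(t) = 1/t³ (so t·k(t) = 1/t² is integrable), the
# solution of ẇ + (α/t) w = k(t), w(1) = 1, is w(t) = (1 + ln t)/t².
# Its positive part is integrable: ∫_1^∞ (1 + ln t)/t² dt = 2 exactly.
int_w_plus = 0.0
t, dt = 1.0, 1e-2
while t < 1e4:
    w = (1.0 + math.log(t)) / (t * t)
    int_w_plus += max(w, 0.0) * dt   # Riemann sum of ∫ w₊
    t += dt
assert abs(int_w_plus - 2.0) < 0.05
print(f"∫ w₊ ≈ {int_w_plus:.3f}  (exact value 2)")
```

The truncation at t = 10⁴ and the Riemann-sum error together account for the small discrepancy from the exact value 2.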
3.2. Proof of the convergence results.

Proof. Step 1. Let us return to the decrease property (23), which is satisfied by the Lyapunov function E_{α,g}:
\[
\dot{\mathcal{E}}_{\alpha,g}(t) + 2\,\frac{\alpha-3}{\alpha-1}\,t\big(\Phi(x(t)) - \inf\nolimits_H\Phi\big) \le 0.
\]
By integration of this inequality, we obtain
\[
\mathcal{E}_{\alpha,g}(t) + 2\,\frac{\alpha-3}{\alpha-1}\int_{t_0}^{t}\tau\big(\Phi(x(\tau)) - \inf\nolimits_H\Phi\big)\,d\tau \le \mathcal{E}_{\alpha,g}(t_0).
\]
By definition of E_{α,g}, and neglecting its nonnegative terms, we infer
\[
2\int_{t}^{+\infty}\tau\Big\langle x(\tau)-x^*+\frac{\tau}{\alpha-1}\dot{x}(\tau),\ g(\tau)\Big\rangle\,d\tau + 2\,\frac{\alpha-3}{\alpha-1}\int_{t_0}^{t}\tau\big(\Phi(x(\tau)) - \inf\nolimits_H\Phi\big)\,d\tau \le \mathcal{E}_{\alpha,g}(t_0).
\]
Hence
\[
2\,\frac{\alpha-3}{\alpha-1}\int_{t_0}^{t}\tau\big(\Phi(x(\tau)) - \inf\nolimits_H\Phi\big)\,d\tau \le \mathcal{E}_{\alpha,g}(t_0) + 2\int_{t_0}^{+\infty}\Big\|x(\tau)-x^*+\frac{\tau}{\alpha-1}\dot{x}(\tau)\Big\|\,\|\tau g(\tau)\|\,d\tau.
\]
By (27), we have
\[
\sup_{t}\Big\|x(t)-x^*+\frac{t}{\alpha-1}\dot{x}(t)\Big\| < +\infty.
\]
As a consequence,
\[
2\,\frac{\alpha-3}{\alpha-1}\int_{t_0}^{t}\tau\big(\Phi(x(\tau)) - \inf\nolimits_H\Phi\big)\,d\tau \le \mathcal{E}_{\alpha,g}(t_0) + 2\sup_{t}\Big\|x(t)-x^*+\frac{t}{\alpha-1}\dot{x}(t)\Big\|\int_{t_0}^{+\infty}\|\tau g(\tau)\|\,d\tau.
\]
Since α > 3, we deduce that
\[
(38)\qquad \int_{t_0}^{+\infty}\tau\big(\Phi(x(\tau)) - \inf\nolimits_H\Phi\big)\,d\tau < +\infty.
\]
Step 2. Let us show that
\[
\int_{t_0}^{\infty} t\|\dot{x}(t)\|^2\,dt < +\infty.
\]
To that end, we use the energy estimate obtained by taking the scalar product of (1) with t²ẋ(t):
\[
(39)\qquad t^2\langle \ddot{x}(t), \dot{x}(t)\rangle + \alpha t\|\dot{x}(t)\|^2 + t^2\langle \nabla\Phi(x(t)), \dot{x}(t)\rangle = t^2\langle g(t), \dot{x}(t)\rangle.
\]
By the classical chain rule for differentiation, and the Cauchy-Schwarz inequality, we obtain
\[
(40)\qquad \frac{1}{2}\frac{d}{dt}\big(t^2\|\dot{x}(t)\|^2\big) + (\alpha-1)\,t\|\dot{x}(t)\|^2 + t^2\frac{d}{dt}\big(\Phi(x(t)) - \inf\nolimits_H\Phi\big) \le \|tg(t)\|\,\|t\dot{x}(t)\|.
\]
After integration from t_0 to t, and integration by parts of the last term on the left-hand side,
\[
\frac{t^2}{2}\|\dot{x}(t)\|^2 - \frac{t_0^2}{2}\|\dot{x}(t_0)\|^2 + (\alpha-1)\int_{t_0}^{t} s\|\dot{x}(s)\|^2\,ds + t^2\big(\Phi(x(t)) - \inf\nolimits_H\Phi\big) - t_0^2\big(\Phi(x(t_0)) - \inf\nolimits_H\Phi\big) - 2\int_{t_0}^{t} s\big(\Phi(x(s)) - \inf\nolimits_H\Phi\big)\,ds \le \int_{t_0}^{t}\|sg(s)\|\,\|s\dot{x}(s)\|\,ds.
\]
As a consequence, for some constant C ≥ 0, depending only on the Cauchy data,
\[
(41)\qquad \frac{t^2}{2}\|\dot{x}(t)\|^2 + (\alpha-1)\int_{t_0}^{t} s\|\dot{x}(s)\|^2\,ds \le C + 2\int_{t_0}^{t} s\big(\Phi(x(s)) - \inf\nolimits_H\Phi\big)\,ds + \int_{t_0}^{t}\|sg(s)\|\,\|s\dot{x}(s)\|\,ds.
\]
By (38) we have \(\int_{t_0}^{\infty} s\big(\Phi(x(s)) - \inf_H\Phi\big)\,ds < +\infty\). Moreover α > 1. As a consequence, from (41) we deduce that, for some other constant C,
\[
(42)\qquad \frac{1}{2}\|t\dot{x}(t)\|^2 \le C + \int_{t_0}^{t}\|sg(s)\|\,\|s\dot{x}(s)\|\,ds.
\]
Applying the Gronwall-Bellman Lemma 2.1, we obtain
\[
\|t\dot{x}(t)\| \le \sqrt{2C} + \int_{t_0}^{t}\|sg(s)\|\,ds.
\]
Since \(\int_{t_0}^{+\infty} t\|g(t)\|\,dt < +\infty\), we infer
\[
(43)\qquad \sup_{t}\|t\dot{x}(t)\| < +\infty.
\]
Returning to (41), we deduce that
\[
(44)\qquad (\alpha-1)\int_{t_0}^{t} s\|\dot{x}(s)\|^2\,ds \le C + 2\int_{t_0}^{\infty} s\big(\Phi(x(s)) - \inf\nolimits_H\Phi\big)\,ds + \sup_{t}\|t\dot{x}(t)\|\int_{t_0}^{\infty}\|sg(s)\|\,ds,
\]
which gives
\[
\int_{t_0}^{\infty} t\|\dot{x}(t)\|^2\,dt < +\infty.
\]
Moreover, combining (27), that is,
\[
\sup_{t}\Big\|x(t)-x^*+\frac{t}{\alpha-1}\dot{x}(t)\Big\| < +\infty,
\]
with (43), we deduce that
\[
(45)\qquad \sup_{t}\|x(t)\| < +\infty,
\]
i.e., all the orbits are bounded.

Step 3. Our proof of the weak convergence property of the orbits of (AVD)_{α,g} relies on Opial's lemma. Given x* ∈ argmin Φ, let us define h : [0, +∞[ → R_+ by
\[
(46)\qquad h(t) = \frac{1}{2}\|x(t) - x^*\|^2.
\]
By the classical chain rule,
\[
(47)\qquad \dot{h}(t) = \langle x(t) - x^*, \dot{x}(t)\rangle,
\]
\[
(48)\qquad \ddot{h}(t) = \langle x(t) - x^*, \ddot{x}(t)\rangle + \|\dot{x}(t)\|^2.
\]
Combining these two equations, and using (1), we obtain
\[
(49)\qquad \ddot{h}(t) + \frac{\alpha}{t}\dot{h}(t) = \|\dot{x}(t)\|^2 + \Big\langle x(t) - x^*,\ \ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t)\Big\rangle
\]
\[
(50)\qquad\quad = \|\dot{x}(t)\|^2 + \langle x(t) - x^*,\ -\nabla\Phi(x(t)) + g(t)\rangle.
\]
By monotonicity of ∇Φ and ∇Φ(x*) = 0,
\[
(51)\qquad \langle x(t) - x^*,\ -\nabla\Phi(x(t))\rangle \le 0.
\]
By (49) and (51) we infer
\[
(52)\qquad \ddot{h}(t) + \frac{\alpha}{t}\dot{h}(t) \le \|\dot{x}(t)\|^2 + \|x(t) - x^*\|\,\|g(t)\|.
\]
Equivalently,
\[
(53)\qquad \ddot{h}(t) + \frac{\alpha}{t}\dot{h}(t) \le k(t),
\]
with
\[
k(t) := \|\dot{x}(t)\|^2 + \|x(t) - x^*\|\,\|g(t)\|.
\]
By (45) the orbit is bounded. Hence, for some constant C ≥ 0,
\[
k(t) \le \|\dot{x}(t)\|^2 + C\|g(t)\|.
\]
By assumption \(\int_{t_0}^{+\infty} t\|g(t)\|\,dt < +\infty\), and by (33), \(\int_{t_0}^{\infty} t\|\dot{x}(t)\|^2\,dt < +\infty\). Hence t ↦ tk(t) ∈ L¹(t_0, +∞). Applying Lemma 3.2 with w(t) = ḣ(t), we deduce that w_+ ∈ L¹(t_0, +∞). Equivalently, ḣ_+ ∈ L¹(t_0, +∞), which implies that the limit of h(t) exists as t → +∞. This proves item (i) of Opial's lemma. We complete the proof by observing that item (ii) is satisfied too. Indeed, since Φ(x(t)) converges to inf Φ, every weak sequential cluster point of x(·) is a minimizer of Φ.
3.3. Strong convergence results. Since the work of J.-B. Baillon, we know that, without additional assumptions, the trajectories of gradient systems may fail to converge strongly. Let us examine some situations of practical interest where strong convergence of the trajectories of (AVD)_{α,g} is satisfied.

Strong convergence under int(argmin Φ) ≠ ∅. We will need the following result; see [15, Lemma 5.4].

Lemma 3.3. Suppose that δ > 0, and let f : [δ, +∞[ → H be a continuous function such that f ∈ L¹(δ, +∞; H). Suppose that α > 1 and x : [δ, +∞[ → H is a classical solution of
\[
t\ddot{x}(t) + \alpha\dot{x}(t) = f(t).
\]
Then x(t) converges strongly in H as t → ∞.
Theorem 3.2. Suppose that α > 3, \(\int_{t_0}^{+\infty} t\|g(t)\|\,dt < +\infty\), and int(argmin Φ) ≠ ∅. Let x(·) be a classical global solution of equation (1). Then there exists some x* ∈ argmin Φ such that x(t) → x* strongly as t → +∞.

Proof. We follow the same approach as that proposed in [15, Theorem 3.1]. We first observe that the assumption int(argmin Φ) ≠ ∅ implies the existence of some z̄ ∈ H and ρ > 0 such that, for all x ∈ H,
\[
\langle \nabla\Phi(x),\ x - \bar{z}\rangle \ge \rho\|\nabla\Phi(x)\|.
\]
In particular, for all t ≥ t_0,
\[
\langle \nabla\Phi(x(t)),\ x(t) - \bar{z}\rangle \ge \rho\|\nabla\Phi(x(t))\|.
\]
Combining this inequality with (22), which we recall below (written with x* = z̄, which is a minimizer of Φ),
\[
\dot{\mathcal{E}}_{\alpha,g}(t) = \frac{4}{\alpha-1}\,t\big(\Phi(x(t)) - \inf\nolimits_H\Phi\big) - 2t\langle x(t) - \bar{z},\ \nabla\Phi(x(t))\rangle,
\]
we obtain
\[
(54)\qquad \dot{\mathcal{E}}_{\alpha,g}(t) + 2\rho t\|\nabla\Phi(x(t))\| \le \frac{4}{\alpha-1}\,t\big(\Phi(x(t)) - \inf\nolimits_H\Phi\big).
\]
Let us return to (23), which after integration, and using α > 3, gives
\[
\int_{t_0}^{\infty} t\big(\Phi(x(t)) - \inf\nolimits_H\Phi\big)\,dt < +\infty.
\]
As a consequence, by integrating (54), we deduce that
\[
\int_{t_0}^{\infty} t\|\nabla\Phi(x(t))\|\,dt < +\infty.
\]
By setting f(t) = tg(t) − t∇Φ(x(t)), we can rewrite equation (1) as tẍ(t) + αẋ(t) = f(t). Since all the assumptions of Lemma 3.3 are satisfied, we can affirm that x(t) converges strongly to some x* ∈ H. Recalling that Φ(x(t)) → inf_H Φ and that Φ is continuous, we obtain x* ∈ argmin Φ.

Strong convergence in the case of an even function. Recall that Φ : H → R is an even function if Φ(−x) = Φ(x) for all x ∈ H. In this case, 0 ∈ argmin_H Φ.

Theorem 3.3. Suppose that α > 3, \(\int_{t_0}^{+\infty} t\|g(t)\|\,dt < +\infty\), and Φ is an even function. Let x(·) be a classical global solution of equation (1). Then there exists some x̄ ∈ argmin_H Φ such that x(t) converges strongly to x̄ as t → +∞.

Proof. Set, for t_0 ≤ τ ≤ r,
\[
y(\tau) = \|x(\tau)\|^2 - \|x(r)\|^2 - \frac{1}{2}\|x(\tau) - x(r)\|^2.
\]
Differentiating twice, we obtain
\[
\dot{y}(\tau) = \langle \dot{x}(\tau),\ x(\tau) + x(r)\rangle
\]
and
\[
\ddot{y}(\tau) = \|\dot{x}(\tau)\|^2 + \langle \ddot{x}(\tau),\ x(\tau) + x(r)\rangle.
\]
From these two equations and (1), we deduce that
\[
(55)\qquad \ddot{y}(\tau) + \frac{\alpha}{\tau}\dot{y}(\tau) = \|\dot{x}(\tau)\|^2 + \Big\langle \ddot{x}(\tau) + \frac{\alpha}{\tau}\dot{x}(\tau),\ x(\tau) + x(r)\Big\rangle = \|\dot{x}(\tau)\|^2 + \langle g(\tau) - \nabla\Phi(x(\tau)),\ x(\tau) + x(r)\rangle.
\]
Let us now consider the energy function
\[
W(\tau) = \frac{1}{2}\|\dot{x}(\tau)\|^2 + \Phi(x(\tau)) + \int_{\tau}^{\infty}\langle \dot{x}(t), g(t)\rangle\,dt.
\]
We have \(\frac{d}{d\tau}W(\tau) = -\frac{\alpha}{\tau}\|\dot{x}(\tau)\|^2\), and therefore W is a nonincreasing function. As a consequence, W(τ) ≥ W(r), which equivalently gives
\[
\frac{1}{2}\|\dot{x}(\tau)\|^2 + \Phi(x(\tau)) \ge \frac{1}{2}\|\dot{x}(r)\|^2 + \Phi(x(r)) - \int_{\tau}^{r}\langle \dot{x}(t), g(t)\rangle\,dt.
\]
Using the convex differential inequality Φ(−x(r)) ≥ Φ(x(τ)) − ⟨∇Φ(x(τ)), x(τ) + x(r)⟩, and the even property of Φ, Φ(x(r)) = Φ(−x(r)), we deduce that
\[
\frac{1}{2}\|\dot{x}(\tau)\|^2 \ge -\langle \nabla\Phi(x(\tau)),\ x(\tau) + x(r)\rangle - \int_{\tau}^{r}\langle \dot{x}(t), g(t)\rangle\,dt.
\]
Returning to (55), we finally obtain
\[
\ddot{y}(\tau) + \frac{\alpha}{\tau}\dot{y}(\tau) \le \frac{3}{2}\|\dot{x}(\tau)\|^2 + \langle g(\tau),\ x(\tau) + x(r)\rangle + \int_{\tau}^{r}\langle \dot{x}(t), g(t)\rangle\,dt.
\]
Let us recall that, by Theorem 3.1, the trajectory x(·) converges weakly, and hence is bounded. Moreover, by (34), we have ‖ẋ(t)‖ ≤ C/t. Hence, for some constant C,
\[
(56)\qquad \ddot{y}(\tau) + \frac{\alpha}{\tau}\dot{y}(\tau) \le k(\tau) := \frac{3}{2}\|\dot{x}(\tau)\|^2 + C\|g(\tau)\| + C\int_{\tau}^{+\infty}\frac{1}{t}\|g(t)\|\,dt.
\]
Observe that the function k does not depend on r. Let us verify that τ ↦ τk(τ) ∈ L¹(t_0, +∞). By Theorem 3.1, we have \(\int_{t_0}^{\infty} t\|\dot{x}(t)\|^2\,dt < +\infty\). By assumption, \(\int_{t_0}^{+\infty} t\|g(t)\|\,dt < +\infty\). Moreover, by the Fubini theorem,
\[
\int_{t_0}^{\infty}\tau\int_{\tau}^{\infty}\frac{1}{t}\|g(t)\|\,dt\,d\tau \le \frac{1}{2}\int_{t_0}^{\infty} t\|g(t)\|\,dt < +\infty.
\]
By integration of (56), by a similar argument as in Lemma 3.2, we obtain
\[
(57)\qquad \dot{y}(\tau) \le \frac{C}{\tau^\alpha} + \frac{1}{\tau^\alpha}\int_{t_0}^{\tau} u^\alpha k(u)\,du,
\]
where C = t_0^α |ẏ(t_0)| ≤ 2 t_0^α ‖ẋ(t_0)‖ ‖x‖_∞. Set
\[
K(\tau) := \frac{C}{\tau^\alpha} + \frac{1}{\tau^\alpha}\int_{t_0}^{\tau} u^\alpha k(u)\,du.
\]
By using the Fubini theorem once more, and the fact that τ ↦ τk(τ) ∈ L¹(t_0, +∞), we deduce that K ∈ L¹(t_0, +∞). Integrating ẏ(τ) ≤ K(τ) from t to r, we obtain
\[
\frac{1}{2}\|x(t) - x(r)\|^2 \le \|x(t)\|^2 - \|x(r)\|^2 + \int_{t}^{r} K(\tau)\,d\tau.
\]
Since Φ is even, we have 0 ∈ argmin Φ. Hence lim_{t→+∞} ‖x(t)‖² exists (see the proof of Theorem 3.1). As a consequence, x(t) has the Cauchy property as t → +∞, and hence converges.
4. The case argmin Φ = ∅

Theorem 4.1. Suppose α > 0, \(\int_{t_0}^{+\infty} \|g(t)\|\,dt < +\infty\), and inf Φ > −∞. Then, for any orbit x : [t_0, +∞[ → H of (AVD)_{α,g}, the following minimizing property holds:
\[
\lim_{t\to+\infty}\Phi(x(t)) = \inf_H\Phi.
\]
We will use the following lemma, see [15].

Lemma 4.1. Take δ > 0, and let f ∈ L¹(δ, +∞) be nonnegative. Consider a nondecreasing continuous function ψ : (δ, +∞) → (0, +∞) such that lim_{t→+∞} ψ(t) = +∞. Then
\[
\lim_{t\to+\infty}\frac{1}{\psi(t)}\int_{\delta}^{t}\psi(s)f(s)\,ds = 0.
\]
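A quick numerical illustration of Lemma 4.1 (with the arbitrary choices δ = e, f(s) = 1/s² and ψ(t) = ln t, the same weight used in the proof of Theorem 4.1): since ∫ ln(s)/s² ds has the closed form −(1 + ln s)/s, the weighted averages can be evaluated exactly and are seen to decay to 0:

```python
import math

# Illustration (hypothetical data, not from the paper) of Lemma 4.1 with
# δ = e, f(s) = 1/s² (nonnegative, integrable) and ψ(t) = ln t (positive,
# nondecreasing, → +∞): (1/ψ(t)) ∫_δ^t ψ(s) f(s) ds → 0 as t → +∞.
for t_end in (1e2, 1e4, 1e6):
    # closed form: ∫_e^t ln(s)/s² ds = 2/e − (1 + ln t)/t
    integral = 2.0 / math.e - (1.0 + math.log(t_end)) / t_end
    avg = integral / math.log(t_end)
    print(f"t = {t_end:.0e}:  (1/ψ(t)) ∫ ψ f ≈ {avg:.4f}")
assert avg < 0.06   # the averages decrease toward 0
```

Here the numerator stays bounded (by ∫_e^∞ ln(s)/s² ds = 2/e) while ψ(t) = ln t grows without bound, which is exactly the mechanism behind the lemma.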
Proof of Theorem 4.1. Let us first return to the proof of the energy estimates in Proposition 2.1. Replacing min Φ by inf Φ in the expression of the energy function, we obtain by the same argument
\[
(58)\qquad \sup_{t}\|\dot{x}(t)\| < +\infty,
\]
\[
(59)\qquad \int_{t_0}^{+\infty}\frac{1}{t}\|\dot{x}(t)\|^2\,dt < +\infty.
\]
Consider the function h(t) = ½‖x(t) − z‖², where this time z is an arbitrary element of H. We can easily verify that
\[
\ddot{h}(t) + \frac{\alpha}{t}\dot{h}(t) = \|\dot{x}(t)\|^2 - \langle \nabla\Phi(x(t)),\ x(t) - z\rangle + \langle g(t),\ x(t) - z\rangle.
\]
By convexity of Φ, we obtain
\[
(60)\qquad \ddot{h}(t) + \frac{\alpha}{t}\dot{h}(t) + \Phi(x(t)) - \Phi(z) \le \|\dot{x}(t)\|^2 + \langle g(t),\ x(t) - z\rangle.
\]
Consider the energy function
\[
W(t) = \frac{1}{2}\|\dot{x}(t)\|^2 + \Phi(x(t)) - \inf_H\Phi + \int_{t}^{\infty}\langle \dot{x}(s), g(s)\rangle\,ds.
\]
By classical differentiation rules, and (1),
\[
\frac{d}{dt}W(t) = \langle \dot{x}(t),\ \ddot{x}(t) + \nabla\Phi(x(t)) - g(t)\rangle = -\frac{\alpha}{t}\|\dot{x}(t)\|^2 \le 0.
\]
As a consequence, W is a nonincreasing function. Moreover, W is minorized by \(-\|\dot{x}\|_{L^\infty}\int_{t_0}^{+\infty}\|g(s)\|\,ds\). Hence, there exists some W_∞ ∈ R such that W(t) → W_∞ as t → ∞. Let us take advantage of this property, and reformulate (60) with the help of W:
\[
(61)\qquad \ddot{h}(t) + \frac{\alpha}{t}\dot{h}(t) + W(t) + \inf_H\Phi - \Phi(z) \le \frac{3}{2}\|\dot{x}(t)\|^2 + \langle g(t),\ x(t) - z\rangle + \int_{t}^{\infty}\langle \dot{x}(s), g(s)\rangle\,ds.
\]
For every t > 0, W(t) ≥ W_∞. Setting B_∞ = W_∞ + inf_H Φ − Φ(z), we obtain
\[
B_\infty \le \frac{3}{2}\|\dot{x}(t)\|^2 + \|g(t)\|\,\|x(t) - z\| + \|\dot{x}\|_{L^\infty}\int_{t}^{\infty}\|g(s)\|\,ds - \frac{1}{t^\alpha}\frac{d}{dt}\big(t^\alpha\dot{h}(t)\big).
\]
Multiplying this last inequality by 1/t, and integrating between two reals 0 < t_0 < θ, we get
\[
(62)\qquad B_\infty\ln\frac{\theta}{t_0} \le \frac{3}{2}\int_{t_0}^{\theta}\frac{1}{t}\|\dot{x}(t)\|^2\,dt + \int_{t_0}^{\theta}\frac{\|g(t)\|\,\|x(t)-z\|}{t}\,dt + \|\dot{x}\|_{L^\infty}\int_{t_0}^{\theta}\frac{1}{t}\Big(\int_{t}^{\infty}\|g(s)\|\,ds\Big)dt - \int_{t_0}^{\theta}\frac{1}{t^{\alpha+1}}\frac{d}{dt}\big(t^\alpha\dot{h}(t)\big)\,dt.
\]
Let us estimate the integrals on the right-hand side of (62):

(1) By (59), \(\int_{t_0}^{+\infty}\frac{1}{t}\|\dot{x}(t)\|^2\,dt < +\infty\).

(2) Exploiting the relation ‖x(t) − z‖ ≤ ‖x(t_0) − z‖ + \(\int_{t_0}^{t}\|\dot{x}(s)\|\,ds\), we obtain
\[
\int_{t_0}^{\theta}\frac{\|g(t)\|\,\|x(t)-z\|}{t}\,dt \le \Big(\frac{\|x_0 - z\|}{t_0} + \|\dot{x}\|_{L^\infty}\Big)\int_{t_0}^{+\infty}\|g(t)\|\,dt < +\infty.
\]
(3) After integration by parts,
\[
\int_{t_0}^{\theta}\frac{1}{t}\Big(\int_{t}^{\infty}\|g(s)\|\,ds\Big)dt = \ln\theta\int_{\theta}^{\infty}\|g(s)\|\,ds - \ln t_0\int_{t_0}^{\infty}\|g(s)\|\,ds + \int_{t_0}^{\theta}\|g(t)\|\ln t\,dt.
\]
(4) Set \(I = \int_{t_0}^{\theta}\frac{1}{t^{\alpha+1}}\frac{d}{dt}\big(t^\alpha\dot{h}(t)\big)\,dt\). By integrating by parts twice,
\[
I = \Big[\frac{1}{t}\dot{h}(t)\Big]_{t_0}^{\theta} + (\alpha+1)\int_{t_0}^{\theta}\frac{1}{t^2}\dot{h}(t)\,dt = C + \frac{1}{\theta}\dot{h}(\theta) + \frac{1+\alpha}{\theta^2}h(\theta) + 2(1+\alpha)\int_{t_0}^{\theta}\frac{1}{t^3}h(t)\,dt,
\]
where C collects the boundary terms at t_0. Since h ≥ 0, we have −I ≤ −C − (1/θ)ḣ(θ). Then notice that |ḣ(θ)| = |⟨ẋ(θ), x(θ) − z⟩| ≤ ‖ẋ‖_{L∞}(‖x(t_0) − z‖ + θ‖ẋ‖_{L∞}).

Collecting the above results, we deduce from (62) that
\[
(63)\qquad B_\infty\ln\frac{\theta}{t_0} \le C + \|\dot{x}\|_{L^\infty}\Big(\ln\theta\int_{\theta}^{\infty}\|g(s)\|\,ds + \int_{t_0}^{\theta}\|g(t)\|\ln t\,dt\Big),
\]
for some constant C independent of θ.
Dividing by ln(θ/t_0), and letting θ → +∞, thanks to Lemma 4.1 with ψ(t) = ln t, we conclude that B_∞ ≤ 0. Equivalently, for every z ∈ H, W_∞ ≤ Φ(z) − inf_H Φ, which leads to W_∞ ≤ 0.

On the other hand, it is easy to see that \(W(t) \ge \Phi(x(t)) - \inf_H\Phi - \|\dot{x}\|_{L^\infty}\int_{t}^{+\infty}\|g(s)\|\,ds\). Passing to the limit, as t → +∞, we deduce that 0 ≥ W_∞ ≥ lim sup (Φ(x(t)) − inf_H Φ). Since we always have inf_H Φ ≤ lim inf Φ(x(t)), we conclude that lim_{t→+∞} Φ(x(t)) = inf_H Φ.

Remark 4.1. In [14], in the unperturbed case g = 0, it has been observed that, when argmin Φ = ∅, the fast convergence property of the values, as given in Theorem 2.1, may fail to be satisfied. A fortiori, without making additional assumptions on the perturbation term, we also lose the fast convergence property in the perturbed case (take g = 0!).

5. From continuous to discrete dynamics and algorithms

Time discretization of dissipative gradient-based dynamical systems leads naturally to algorithms which, under appropriate assumptions, have similar convergence properties. This approach has been followed successfully in a variety of situations. For a general abstract discussion see [7], [8], and, in the case of dynamics with inertial features, see [3], [4], [6], [14], [15]. To cover practical situations involving constraints and/or nonsmooth data, we need to broaden our scope. This leads us to consider the nonsmooth structured convex minimization problem
\[
(64)\qquad \min\ \{\Phi(x) + \Psi(x) : x \in H\},
\]
where
• Φ : H → R ∪ {+∞} is a convex lower semicontinuous proper function (which possibly takes the value +∞);
• Ψ : H → R is a convex continuously differentiable function, whose gradient is Lipschitz continuous.

The optimal solutions of (64) satisfy
\[
\partial\Phi(x) + \nabla\Psi(x) \ni 0,
\]
where ∂Φ is the subdifferential of Φ in the sense of convex analysis. In order to adapt our dynamic to this nonsmooth situation, we will consider the corresponding differential inclusion
\[
(65)\qquad \ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \partial\Phi(x(t)) + \nabla\Psi(x(t)) \ni g(t).
\]
This dynamic falls within the following framework:
\[
(66)\qquad \ddot{x}(t) + a(t)\dot{x}(t) + \partial\Theta(x(t)) \ni g(t),
\]
where Θ : H → R ∪ {+∞} is a convex lower semicontinuous proper function, and a(·) is a positive damping parameter. The detailed study of this differential inclusion goes far beyond the scope of the present article; see [10] for some results in the case of a fixed positive damping parameter, i.e., a(t) = γ > 0 fixed, and g = 0. A formal analysis of this system shows that the Lyapunov analysis developed in the previous sections still holds, as long as one does not use the Lipschitz continuity property of the gradient (cocoercivity property). This is based on the fact that the convexity (subdifferential) inequalities are still valid, as well as the (generalized) chain rule for differentiation; see [20]. Thus, setting Θ(x) = Φ(x) + Ψ(x), we can reasonably assume that, for α > 3 and \(\int_{t_0}^{+\infty} t\|g(t)\|\,dt < +\infty\), each trajectory of (65) enjoys rapid convergence of the values,
\[
\Theta(x(t)) - \min_H\Theta \le \frac{C}{t^2},
\]
and weak convergence of the trajectory to an optimal solution. Indeed, we are going to use these ideas as a guideline, and so introduce corresponding fast converging algorithms, making the link with Nesterov [31]-[34], Beck-Teboulle [19], and so extending the recent works of Chambolle-Dossal [26], Su-Boyd-Candès [41], and Attouch-Peypouquet-Redont [14] to the perturbed case.

As a basic ingredient of the discretization procedure, in order to preserve the fast convergence properties of the dynamical system (65), we discretize it implicitly with respect to the nonsmooth function Φ, and explicitly with respect to the smooth function Ψ. Taking a fixed time step size h > 0, and setting t_k = kh, x_k = x(t_k), the implicit/explicit finite-difference scheme for (65) gives
\[
(67)\qquad \frac{1}{h^2}(x_{k+1} - 2x_k + x_{k-1}) + \frac{\alpha}{kh^2}(x_k - x_{k-1}) + \partial\Phi(x_{k+1}) + \nabla\Psi(y_k) \ni g_k,
\]
where y_k is a linear combination of x_k and x_{k-1}, that will be made precise below. After developing (67), we obtain
\[
(68)\qquad x_{k+1} + h^2\partial\Phi(x_{k+1}) \ni x_k + \Big(1 - \frac{\alpha}{k}\Big)(x_k - x_{k-1}) - h^2\nabla\Psi(y_k) + h^2 g_k.
\]
A natural choice for y_k leading to a simple formulation of the algorithm (other choices are possible, offering new directions of research for the future) is
\[
(69)\qquad y_k = x_k + \Big(1 - \frac{\alpha}{k}\Big)(x_k - x_{k-1}).
\]
Using the classical proximal operator (equivalently, the resolvent operator of the maximal monotone operator ∂Φ)
\[
(70)\qquad \mathrm{prox}_{\gamma\Phi}(x) = \operatorname{argmin}_{\xi\in H}\Big\{\Phi(\xi) + \frac{1}{2\gamma}\|\xi - x\|^2\Big\} = (I + \gamma\partial\Phi)^{-1}(x),
\]
and setting s = h², the algorithm can be written as
\[
(71)\qquad \begin{cases} y_k = x_k + \big(1 - \frac{\alpha}{k}\big)(x_k - x_{k-1});\\[2pt] x_{k+1} = \mathrm{prox}_{s\Phi}\big(y_k - s(\nabla\Psi(y_k) - g_k)\big). \end{cases}
\]
For practical purposes, and in order to fit with the existing literature on the subject, it is convenient to work with the following equivalent formulation:
\[
(72)\qquad (\mathrm{AVD})_{\alpha,g}\text{-algo}\qquad \begin{cases} y_k = x_k + \frac{k-1}{k+\alpha-1}(x_k - x_{k-1});\\[2pt] x_{k+1} = \mathrm{prox}_{s\Phi}\big(y_k - s(\nabla\Psi(y_k) - g_k)\big). \end{cases}
\]
Indeed, we have \(\frac{k-1}{k+\alpha-1} = 1 - \frac{\alpha}{k+\alpha-1}\). When α is an integer, up to the reindexation k ↦ k + α − 1, we obtain the same sequences (x_k) and (y_k). For general α > 0, we can easily verify that the algorithm (AVD)_{α,g}-algo is still associated with the dynamical system (65). This algorithm falls within the scope of the proximal-based inertial algorithms [4], [30], [39], and forward-backward methods. In the unperturbed case, g_k = 0, it has been recently considered by Chambolle-Dossal [26], Su-Boyd-Candès [41], and Attouch-Peypouquet-Redont [14]. It enjoys fast convergence properties which are very similar to those of the continuous dynamic.
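To make the iteration (72) concrete, here is a short self-contained Python sketch of one possible instance, with illustrative data that does not come from the paper: Φ = λ‖·‖₁ (whose proximal operator is componentwise soft thresholding), Ψ a positive definite quadratic, and errors g_k of size (k+1)⁻³, so that Σ_k k‖g_k‖ < +∞:

```python
import math

# Illustrative sketch (data not from the paper) of the perturbed algorithm
# (AVD)_{α,g}-algo (72) on a small lasso-type problem in R²:
#   Φ(x) = lam·‖x‖₁  (prox_{sΦ} = componentwise soft thresholding),
#   Ψ(x) = ½⟨Ax, x⟩ − ⟨b, x⟩  (so ∇Ψ(x) = Ax − b, L = λ_max(A)),
#   g_k = ((k+1)⁻³, (k+1)⁻³), so that Σ_k k‖g_k‖ < +∞ as required.
A = [[3.0, 1.0], [1.0, 2.0]]     # symmetric positive definite
b = [1.0, 1.0]
lam, alpha = 0.1, 4.0            # alpha > 3: the iterates converge
s = 0.25                         # step size s < 1/L, here L = λ_max(A) ≈ 3.62

def grad_psi(x):
    return [sum(aij * xj for aij, xj in zip(row, x)) - bi
            for row, bi in zip(A, b)]

def prox_l1(v, tau):             # soft thresholding = prox of tau·‖·‖₁
    return [math.copysign(max(abs(vi) - tau, 0.0), vi) for vi in v]

x_prev = x = [0.0, 0.0]
for k in range(1, 3001):
    beta = (k - 1.0) / (k + alpha - 1.0)
    y = [xi + beta * (xi - xpi) for xi, xpi in zip(x, x_prev)]
    gk = (k + 1.0) ** -3          # perturbation magnitude at step k
    gp = grad_psi(y)
    x_prev, x = x, prox_l1([yi - s * (gi - gk) for yi, gi in zip(y, gp)],
                           s * lam)

# For this data the minimizer can be computed by hand from the optimality
# condition Ax* − b + lam·sign(x*) ∋ 0, giving x* = (0.18, 0.36).
assert abs(x[0] - 0.18) < 1e-3 and abs(x[1] - 0.36) < 1e-3
print("x_k ->", [round(xi, 4) for xi in x])
```

Since g_k → 0 and is summable against k, the perturbed iteration reaches the unperturbed minimizer; with α = 4 > 3 this is consistent with the convergence of the sequences (x_k) discussed below.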
For α = 3, g_k = 0, we recover the classical algorithm based on the ideas of Nesterov and Güler, and developed by Beck-Teboulle (FISTA):
\[
(73)\qquad \begin{cases} y_k = x_k + \frac{k-1}{k+2}(x_k - x_{k-1});\\[2pt] x_{k+1} = \mathrm{prox}_{s\Phi}\big(y_k - s\nabla\Psi(y_k)\big). \end{cases}
\]
An important question regarding the FISTA method, as described in (73), is the convergence of the sequences (x_k) and (y_k); indeed, it is still an open question. A major interest in considering the broader context of the (AVD)_{α,g}-algo algorithms is that, for α > 3, these sequences converge, and they allow errors/perturbations and the use of approximation methods. We will see that the proof of the convergence properties of the (AVD)_{α,g}-algo algorithms can be obtained in a parallel way to the convergence analysis in the continuous case in Theorem 3.1.
Theorem 5.1. Let Φ : H → R ∪ {+∞} be a convex lower semicontinuous proper function, and Ψ : H → R a convex continuously differentiable function, whose gradient is L-Lipschitz continuous. Suppose that S = argmin(Φ + Ψ) is nonempty. Suppose that α ≥ 3, 0 < s < 1/L, and \(\sum_{k\in\mathbb{N}} k\|g_k\| < +\infty\). Let (x_k) be a sequence generated by the algorithm (AVD)_{α,g}-algo. Then
\[
(\Phi+\Psi)(x_k) - \min_H(\Phi+\Psi) = O\Big(\frac{1}{k^2}\Big).
\]
Precisely,
\[
(74)\qquad (\Phi+\Psi)(x_k) - \min_H(\Phi+\Psi) \le \frac{C(\alpha-1)}{2s\,(k+\alpha-2)^2},
\]
with C given by
\[
C = G(0) + 2s\left(\sqrt{\frac{E(0)}{\alpha-1}} + \frac{2s}{\alpha-1}\sum_{j=0}^{\infty}(j+\alpha-1)\|g_j\|\right)\sum_{j=0}^{\infty}(j+\alpha-1)\|g_j\|,
\]
where E(·) is the Lyapunov sequence defined in (75) below, and
\[
G(0) = \frac{2s}{\alpha-1}(\alpha-2)^2\big(\Theta(x_0) - \Theta^*\big) + (\alpha-1)\|y_0 - x^*\|^2,
\]
with Θ := Φ + Ψ and Θ* := min_H Θ.
Proof. To simplify notations, we set Θ = Φ + Ψ, and take x* ∈ argmin Θ, i.e., Θ(x*) = inf Θ. In a parallel way to the continuous case, our proof is based on showing that (E(k)) is a non-increasing sequence, where E(k) is the discrete version of the Lyapunov function E_{α,g}(t) (we shall justify further that it is well defined), and which is given by

(75)    E(k) := (2s/(α−1)) (k+α−2)² (Θ(xk) − Θ(x*)) + (α−1)‖zk − x*‖² + 2s Σ_{j=k}^{∞} (j+α−1) ⟨gj, zj+1 − x*⟩,

with

(76)    zk := ((k+α−1)/(α−1)) yk − (k/(α−1)) xk.

In the passage from the continuous to the discrete dynamic, we recall that we must use the reindexation k ↦ k + α − 1. Note that E(k) is equal to the Lyapunov function considered by Su-Boyd-Candès in [41, Theorem 4.3], plus a perturbation term. Let us introduce the function Ψk : H → R which is defined by

∀y ∈ H,  Ψk(y) := Ψ(y) − ⟨gk, y⟩.

We also set Θk = Φ + Ψk. We have ∇Ψk(y) = ∇Ψ(y) − gk, and hence ∇Ψk is still L-Lipschitz continuous. We can reformulate our algorithm with the help of Ψk as follows:

(77)  (AVD)α,g-algo    yk = xk + ((k−1)/(k+α−1)) (xk − xk−1);
                       xk+1 = prox_sΦ (yk − s∇Ψk(yk)).
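As a sanity check of the role of Ψk, the following sketch runs (77) with α = 4 and a perturbation gk = 1/k³ (so that Σ k‖gk‖ < +∞) on the same illustrative 1-D lasso instance as above; all data are assumptions for the experiment. It monitors the scaled gap (k+α−2)² (Θ(xk) − min Θ), which Theorem 5.1 predicts remains bounded.

```python
import math

# Same illustrative 1-D instance as in the FISTA sketch (an assumption):
# Phi(x) = lam*|x|, Psi(x) = 0.5*(x - b)**2, L = 1.
lam, b, L = 0.1, 1.0, 1.0
s = 0.9 / L          # 0 < s < 1/L
alpha = 4.0          # alpha >= 3

def theta(x):
    return lam * abs(x) + 0.5 * (x - b) ** 2

def prox_sPhi(v):
    return math.copysign(max(abs(v) - s * lam, 0.0), v)

theta_min = theta(b - lam)   # minimizer x* = b - lam for this instance

x_prev = x = 0.0
scaled_gaps = []
for k in range(1, 401):
    g_k = 1.0 / k ** 3                           # perturbation, sum k*|g_k| < +inf
    y = x + (k - 1) / (k + alpha - 1) * (x - x_prev)
    # gradient of Psi_k(y) = Psi(y) - g_k * y is (y - b) - g_k
    x_prev, x = x, prox_sPhi(y - s * ((y - b) - g_k))
    scaled_gaps.append((k + alpha - 2) ** 2 * (theta(x) - theta_min))
```

The gap (Θ(xk) − min Θ) is always nonnegative, and the scaled gaps stay bounded, in accordance with (74).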
In order to analyze the convergence properties of the above algorithm, it is convenient to introduce the operator G_{s,k} : H → H which is defined by, for all y ∈ H,

G_{s,k}(y) = (1/s) ( y − prox_sΦ (y − s∇Ψk(y)) ).

Equivalently, prox_sΦ (y − s∇Ψk(y)) = y − sG_{s,k}(y),
and the algorithm (77) can be formulated as

(78)  (AVD)α,g-algo    yk = xk + ((k−1)/(k+α−1)) (xk − xk−1);
                       xk+1 = yk − sG_{s,k}(yk).
The variable zk, which is defined in (76) by zk = ((k+α−1)/(α−1)) yk − (k/(α−1)) xk, will play an important role. It comes naturally into play as a discrete version of the term (t/(α−1)) ẋ(t) + x(t) − x* which enters E_{α,g}(t). Indeed,

(79)    ((k+α−1)/(α−1)) (xk+1 − xk) + xk = ((k+α−1)/(α−1)) xk+1 − (k/(α−1)) xk = zk+1,

where the last equality comes from (80) below. Let us examine the recursive relation satisfied by zk. We have

(80)    zk+1 = ((k+α)/(α−1)) yk+1 − ((k+1)/(α−1)) xk+1
             = ((k+α)/(α−1)) [ xk+1 + (k/(k+α)) (xk+1 − xk) ] − ((k+1)/(α−1)) xk+1
             = ((k+α−1)/(α−1)) xk+1 − (k/(α−1)) xk
             = ((k+α−1)/(α−1)) (yk − sG_{s,k}(yk)) − (k/(α−1)) xk;

(81)    zk+1 = zk − (s/(α−1)) (k+α−1) G_{s,k}(yk).
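The identities (79) and (81) are purely algebraic consequences of the update rules, and can be checked numerically. The sketch below does so in the scalar case H = R, on an illustrative instance (λ, b, s, α and gk are assumptions for the experiment), comparing zk+1 computed from the definition (76), from (79), and from the recursion (81).

```python
import math

# Scalar (H = R) sanity check of (79) and (81) on an illustrative instance.
lam, b = 0.1, 1.0
s, alpha = 0.9, 4.0

def prox_sPhi(v):
    return math.copysign(max(abs(v) - s * lam, 0.0), v)

def G_sk(y, g):
    # G_{s,k}(y) = (1/s)(y - prox_{s Phi}(y - s grad Psi_k(y))), grad Psi_k(y) = (y - b) - g
    return (y - prox_sPhi(y - s * ((y - b) - g))) / s

def z(k, y_k, x_k):
    # definition (76)
    return (k + alpha - 1) / (alpha - 1) * y_k - k / (alpha - 1) * x_k

max_err = 0.0
x_prev, x = 0.3, 0.7
for k in range(1, 60):
    g_k = 1.0 / k ** 3
    y = x + (k - 1) / (k + alpha - 1) * (x - x_prev)
    x_next = y - s * G_sk(y, g_k)                       # one step of (78)
    y_next = x_next + k / (k + alpha) * (x_next - x)    # extrapolation at step k+1
    z_def = z(k + 1, y_next, x_next)                               # (76) at k+1
    z_79 = (k + alpha - 1) / (alpha - 1) * (x_next - x) + x        # identity (79)
    z_81 = z(k, y, x) - s * (k + alpha - 1) / (alpha - 1) * G_sk(y, g_k)  # recursion (81)
    max_err = max(max_err, abs(z_def - z_79), abs(z_def - z_81))
    x_prev, x = x, x_next
```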
We now use the classical formula of proximal gradient (also called forward-backward) analysis (see [19], [26], [37], [41]): for any x, y ∈ H,

(82)    Θk(y − sG_{s,k}(y)) ≤ Θk(x) + ⟨G_{s,k}(y), y − x⟩ − (s/2)‖G_{s,k}(y)‖².

Note that this formula is valid since s ≤ 1/L, and ∇Ψk is L-Lipschitz continuous. Let us write successively this formula at y = yk and x = xk, then at y = yk and x = x*. We obtain

(83)    Θk(yk − sG_{s,k}(yk)) ≤ Θk(xk) + ⟨G_{s,k}(yk), yk − xk⟩ − (s/2)‖G_{s,k}(yk)‖²;
(84)    Θk(yk − sG_{s,k}(yk)) ≤ Θk(x*) + ⟨G_{s,k}(yk), yk − x*⟩ − (s/2)‖G_{s,k}(yk)‖².

Multiplying (83) by k/(k+α−1), and (84) by (α−1)/(k+α−1), then adding the two resulting inequalities, and using xk+1 = yk − sG_{s,k}(yk), we obtain

Θk(xk+1) ≤ (k/(k+α−1)) Θk(xk) + ((α−1)/(k+α−1)) Θk(x*) − (s/2)‖G_{s,k}(yk)‖²
           + ⟨G_{s,k}(yk), (k/(k+α−1)) (yk − xk) + ((α−1)/(k+α−1)) (yk − x*)⟩.
Let us rewrite the above scalar product as follows:

(85)    ⟨G_{s,k}(yk), (k/(k+α−1)) (yk − xk) + ((α−1)/(k+α−1)) (yk − x*)⟩
        = ((α−1)/(k+α−1)) ⟨G_{s,k}(yk), (k/(α−1)) (yk − xk) + yk − x*⟩
        = ((α−1)/(k+α−1)) ⟨G_{s,k}(yk), ((k+α−1)/(α−1)) yk − (k/(α−1)) xk − x*⟩
        = ((α−1)/(k+α−1)) ⟨G_{s,k}(yk), zk − x*⟩.

Combining (83)-(84) with (85), we obtain

(86)    Θk(xk+1) ≤ (k/(k+α−1)) Θk(xk) + ((α−1)/(k+α−1)) Θk(x*) + ((α−1)/(k+α−1)) ⟨G_{s,k}(yk), zk − x*⟩ − (s/2)‖G_{s,k}(yk)‖².
In order to write (86) in a recursive form, we use the relation (81) satisfied by zk, which gives

zk+1 − x* = zk − x* − (s/(α−1)) (k+α−1) G_{s,k}(yk).

After developing

‖zk+1 − x*‖² = ‖zk − x*‖² − 2 (s/(α−1)) (k+α−1) ⟨zk − x*, G_{s,k}(yk)⟩ + (s²/(α−1)²) (k+α−1)² ‖G_{s,k}(yk)‖²,

and multiplying the above expression by (α−1)²/(2s(k+α−1)²), we obtain

((α−1)²/(2s(k+α−1)²)) ( ‖zk − x*‖² − ‖zk+1 − x*‖² ) = ((α−1)/(k+α−1)) ⟨G_{s,k}(yk), zk − x*⟩ − (s/2)‖G_{s,k}(yk)‖².
Replacing this expression in (86), we obtain

(87)    Θk(xk+1) ≤ (k/(k+α−1)) Θk(xk) + ((α−1)/(k+α−1)) Θk(x*) + ((α−1)²/(2s(k+α−1)²)) ( ‖zk − x*‖² − ‖zk+1 − x*‖² ).

Equivalently,

(88)    Θk(xk+1) − Θk(x*) ≤ (k/(k+α−1)) (Θk(xk) − Θk(x*)) + ((α−1)²/(2s(k+α−1)²)) ( ‖zk − x*‖² − ‖zk+1 − x*‖² ).
Returning to Θ(y) = Θk(y) + ⟨gk, y⟩, we obtain

(89)    Θ(xk+1) − Θ(x*) ≤ (k/(k+α−1)) (Θ(xk) − Θ(x*)) + ((α−1)²/(2s(k+α−1)²)) ( ‖zk − x*‖² − ‖zk+1 − x*‖² )
                        + ⟨gk, xk+1 − x*⟩ − (k/(k+α−1)) ⟨gk, xk − x*⟩.

After reduction,

(90)    Θ(xk+1) − Θ(x*) ≤ (k/(k+α−1)) (Θ(xk) − Θ(x*)) + ((α−1)²/(2s(k+α−1)²)) ( ‖zk − x*‖² − ‖zk+1 − x*‖² )
                        + ⟨gk, xk+1 − xk + ((α−1)/(k+α−1)) (xk − x*)⟩.

Multiplying by (2s/(α−1)) (k+α−1)², we obtain
(91)    (2s/(α−1)) (k+α−1)² (Θ(xk+1) − Θ(x*)) ≤ (2s/(α−1)) k (k+α−1) (Θ(xk) − Θ(x*)) + (α−1) ( ‖zk − x*‖² − ‖zk+1 − x*‖² )
        + (2s/(α−1)) (k+α−1)² ⟨gk, xk+1 − xk + ((α−1)/(k+α−1)) (xk − x*)⟩.

For α ≥ 3 one can easily verify that

k (k+α−1) ≤ (k+α−2)².

More precisely,

k (k+α−1) = (k+α−2)² − k(α−3) − (α−2)² ≤ (k+α−2)² − k(α−3).
As a consequence, from (91) we deduce that

(92)    (2s/(α−1)) (k+α−1)² (Θ(xk+1) − Θ(x*)) + 2s ((α−3)/(α−1)) k (Θ(xk) − Θ(x*)) ≤ (2s/(α−1)) (k+α−2)² (Θ(xk) − Θ(x*))
        + (α−1) ( ‖zk − x*‖² − ‖zk+1 − x*‖² ) + (2s/(α−1)) (k+α−1)² ⟨gk, xk+1 − xk + ((α−1)/(k+α−1)) (xk − x*)⟩.

Setting

(93)    G(k) = (2s/(α−1)) (k+α−2)² (Θ(xk) − Θ*) + (α−1)‖zk − x*‖²,

we can reformulate (92) as

(94)    G(k+1) + 2s ((α−3)/(α−1)) k (Θ(xk) − Θ(x*)) ≤ G(k) + (2s/(α−1)) (k+α−1)² ⟨gk, xk+1 − xk + ((α−1)/(k+α−1)) (xk − x*)⟩.

Equivalently,

(95)    G(k+1) + 2s ((α−3)/(α−1)) k (Θ(xk) − Θ(x*)) ≤ G(k) + 2s (k+α−1) ⟨gk, ((k+α−1)/(α−1)) (xk+1 − xk) + xk − x*⟩.
Using (79),

zk+1 = ((k+α−1)/(α−1)) (xk+1 − xk) + xk,

we deduce that

(96)    G(k+1) + 2s ((α−3)/(α−1)) k (Θ(xk) − Θ(x*)) ≤ G(k) + 2s (k+α−1) ⟨gk, zk+1 − x*⟩.
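Inequality (96) can be checked numerically along an actual run of the algorithm. The sketch below does so in the scalar case H = R, on an illustrative lasso-type instance with known minimizer x* = b − λ (all data, including the perturbation gk = 1/k³, are assumptions); it verifies that the left-hand side of (96) never exceeds the right-hand side.

```python
import math

# Numerical check of inequality (96) on an illustrative 1-D instance:
# Phi(x) = lam*|x|, Psi(x) = 0.5*(x - b)**2, L = 1, minimizer x* = b - lam.
lam, b, L = 0.1, 1.0, 1.0
s, alpha = 0.9, 4.0
xstar = b - lam

def theta(x):
    return lam * abs(x) + 0.5 * (x - b) ** 2

theta_star = theta(xstar)

def prox_sPhi(v):
    return math.copysign(max(abs(v) - s * lam, 0.0), v)

def G_fun(k, x_k, z_k):
    # G(k) as in (93), scalar case
    return (2 * s / (alpha - 1)) * (k + alpha - 2) ** 2 * (theta(x_k) - theta_star) \
           + (alpha - 1) * (z_k - xstar) ** 2

x_prev = x = 0.0
worst = -float("inf")   # max over k of (lhs - rhs); should stay <= 0
for k in range(1, 200):
    g_k = 1.0 / k ** 3
    y = x + (k - 1) / (k + alpha - 1) * (x - x_prev)
    z_k = (k + alpha - 1) / (alpha - 1) * y - k / (alpha - 1) * x     # (76)
    x_next = prox_sPhi(y - s * ((y - b) - g_k))                        # (77)
    z_next = (k + alpha - 1) / (alpha - 1) * (x_next - x) + x          # (79)
    lhs = G_fun(k + 1, x_next, z_next) \
          + 2 * s * (alpha - 3) / (alpha - 1) * k * (theta(x) - theta_star)
    rhs = G_fun(k, x, z_k) + 2 * s * (k + alpha - 1) * g_k * (z_next - xstar)
    worst = max(worst, lhs - rhs)
    x_prev, x = x, x_next
```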
We now develop an analysis similar to the continuous case. Given some integer K, set

E_K(k) = G(k) + Σ_{j=k}^{K} 2s (j+α−1) ⟨gj, zj+1 − x*⟩.

Then (96) is equivalent to

(97)    E_K(k+1) + 2s ((α−3)/(α−1)) k (Θ(xk) − Θ(x*)) ≤ E_K(k).

Hence, the sequence (E_K(k)) is nonincreasing. In particular E_K(k) ≤ E_K(0), which gives

G(k) + Σ_{j=k}^{K} 2s (j+α−1) ⟨gj, zj+1 − x*⟩ ≤ G(0) + Σ_{j=0}^{K} 2s (j+α−1) ⟨gj, zj+1 − x*⟩.

As a consequence,

(98)    G(k) ≤ G(0) + Σ_{j=0}^{k−1} 2s (j+α−1) ⟨gj, zj+1 − x*⟩.
By definition of G(k), neglecting some nonnegative terms, and by the Cauchy-Schwarz inequality, we infer

(α−1)‖zk − x*‖² ≤ G(0) + 2s Σ_{j=0}^{k−1} (j+α−1) ‖gj‖ ‖zj+1 − x*‖.

Equivalently,

(99)    ‖zk − x*‖² ≤ G(0)/(α−1) + (2s/(α−1)) Σ_{j=1}^{k} (j+α−2) ‖gj−1‖ ‖zj − x*‖.
We then use the following result, a discrete version of Gronwall's lemma.

Lemma 5.1. Let (ak) be a sequence of positive real numbers such that

a_k² ≤ c + Σ_{j=1}^{k} βj aj,

where (βj) is a sequence of positive real numbers such that Σ_j βj < +∞, and c is a positive real number. Then

ak ≤ √c + Σ_{j=1}^{∞} βj.

Proof. Set Ak := sup_{1≤j≤k} aj. Then, for 1 ≤ l ≤ k,

a_l² ≤ c + Σ_{j=1}^{l} βj aj ≤ c + Ak Σ_{j=1}^{∞} βj.

Passing to the supremum with respect to l, 1 ≤ l ≤ k, we obtain

A_k² ≤ c + Ak Σ_{j=1}^{∞} βj.

By an elementary algebraic computation (if A² ≤ c + AB with A, B, c ≥ 0, then A ≤ (B + √(B² + 4c))/2 ≤ B + √c), it follows that

Ak ≤ √c + Σ_{j=1}^{∞} βj.
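The lemma can be exercised numerically at the equality boundary: choose (ak) so that a_k² = c + Σ_{j≤k} βj aj holds with equality (taking the positive root of the resulting quadratic at each step), and compare with the bound √c + Σ βj. The data below are illustrative choices.

```python
import math

# Equality-case check of Lemma 5.1 with illustrative data:
c = 2.0
beta = [1.0 / (j * j) for j in range(1, 201)]   # positive, summable
bound = math.sqrt(c) + sum(beta)                 # the lemma's bound

a, S = [], 0.0           # S accumulates sum_{j < k} beta_j * a_j
for bk in beta:
    # a_k solves a^2 = c + S + bk*a, i.e. the equality case of the hypothesis
    ak = (bk + math.sqrt(bk * bk + 4 * (c + S))) / 2
    a.append(ak)
    S += bk * ak

max_a = max(a)
```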
Following the proof of Theorem 5.1. From (99), applying Lemma 5.1 with ak = ‖zk − x*‖, we deduce that

(100)    ‖zk − x*‖ ≤ M := √(G(0)/(α−1)) + (2s/(α−1)) Σ_{j=0}^{∞} (j+α−1) ‖gj‖.

Note that M is finite, because of the assumption Σ_{k∈N} k‖gk‖ < +∞. Returning to (98), we obtain

(101)    G(k) ≤ C := G(0) + 2s ( Σ_{j=0}^{∞} (j+α−1) ‖gj‖ ) ( √(G(0)/(α−1)) + (2s/(α−1)) Σ_{j=0}^{∞} (j+α−1) ‖gj‖ ).
By definition of G(k), and the nonnegativity of its constitutive terms, we finally obtain

(2s/(α−1)) (k+α−2)² (Θ(xk) − Θ*) ≤ C,

which gives (74).
Remark 5.1. In the particular case α = 3, for a perturbed version of the classical FISTA algorithm, Schmidt, Le Roux, and Bach proved in [40] a result similar to Theorem 5.1 concerning the fast convergence of the values.

Let us now study the convergence of the sequence (xk).

Theorem 5.2. Let Φ : H → R ∪ {+∞} be a convex lower semicontinuous proper function, and Ψ : H → R a convex continuously differentiable function, whose gradient is L-Lipschitz continuous. Suppose that S = argmin(Φ + Ψ) is nonempty. Suppose that α > 3, 0 < s < 1/L, and Σ_{k∈N} k‖gk‖ < +∞. Let (xk) be a sequence generated by the algorithm (AVD)α,g-algo. Then,
i) Σ_k k ( (Φ + Ψ)(xk) − inf(Φ + Ψ) ) < +∞;
ii) Σ_k k ‖xk+1 − xk‖² < +∞;
iii) (xk) converges weakly, as k → +∞, to some x* ∈ argmin(Φ + Ψ).
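The three claims can be illustrated numerically on the 1-D instance used in the earlier sketches (all data are assumptions for the experiment): run (77) with α = 4 > 3 and gk = 1/k³, and monitor the partial sums appearing in items i) and ii).

```python
import math

# Numerical illustration of Theorem 5.2 on an illustrative 1-D instance:
# Phi(x) = lam*|x|, Psi(x) = 0.5*(x - b)**2, L = 1; alpha = 4 > 3, s < 1/L.
lam, b, L = 0.1, 1.0, 1.0
s, alpha = 0.9, 4.0

def prox_sPhi(v):
    return math.copysign(max(abs(v) - s * lam, 0.0), v)

def theta(x):
    return lam * abs(x) + 0.5 * (x - b) ** 2

theta_min = theta(b - lam)   # minimizer x* = b - lam for this instance

x_prev = x = 0.0
sum_i = sum_ii = 0.0
for k in range(1, 2001):
    g_k = 1.0 / k ** 3       # perturbation with sum k*|g_k| < +inf
    y = x + (k - 1) / (k + alpha - 1) * (x - x_prev)
    x_prev, x = x, prox_sPhi(y - s * ((y - b) - g_k))
    sum_i += k * (theta(x) - theta_min)    # partial sums of item i)
    sum_ii += k * (x - x_prev) ** 2        # partial sums of item ii)
```

Both partial sums stay bounded, and the iterates approach the minimizer, in accordance with items i)-iii).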
Proof. The demonstration parallels that of Theorem 3.1.
Step 1. Let us return to (96),

G(k+1) + 2s ((α−3)/(α−1)) k (Θ(xk) − Θ(x*)) ≤ G(k) + 2s (k+α−1) ⟨gk, zk+1 − x*⟩.

By (100), we know that the sequence (zk) is bounded. Summing the above inequalities, and using α > 3 (the G-terms telescope, and the perturbation terms are absolutely summable since (zk) is bounded and Σ_k k‖gk‖ < +∞), we obtain

(102)    Σ_k k ( (Φ + Ψ)(xk) − inf(Φ + Ψ) ) < +∞,

which is item i).
Step 2. Now apply the fundamental inequality (82), which can be equivalently written as follows:

(103)    Θk(y − sG_{s,k}(y)) + (1/(2s)) ‖y − sG_{s,k}(y) − x‖² ≤ Θk(x) + (1/(2s)) ‖x − y‖².

Take y = yk, and x = xk. Since xk+1 = yk − sG_{s,k}(yk), and yk − xk = ((k−1)/(k+α−1)) (xk − xk−1), we obtain

(104)    Θk(xk+1) + (1/(2s)) ‖xk+1 − xk‖² ≤ Θk(xk) + (1/(2s)) ((k−1)²/(k+α−1)²) ‖xk − xk−1‖².

Equivalently, by definition of Θk,

(105)    Θ(xk+1) + (1/(2s)) ‖xk+1 − xk‖² ≤ Θ(xk) + (1/(2s)) ((k−1)²/(k+α−1)²) ‖xk − xk−1‖² + ⟨gk, xk+1 − xk⟩.
To shorten notations, set θk = Θ(xk) − Θ(x*), dk = ½‖xk − xk−1‖², a = α − 1. By the Cauchy-Schwarz inequality, and with these notations, (105) gives

(106)    (1/s) ( dk+1 − ((k−1)²/(k+a)²) dk ) ≤ (θk − θk+1) + ‖gk‖ ‖xk+1 − xk‖.

After multiplication by (k+a)², we obtain

(107)    (1/s) ( (k+a)² dk+1 − (k−1)² dk ) ≤ (k+a)² (θk − θk+1) + (k+a)² ‖gk‖ ‖xk+1 − xk‖.

Summing from k = 1 to k = K gives

(108)    Σ_{k=1}^{K} ( (k+a)² dk+1 − (k−1)² dk ) ≤ s Σ_{k=1}^{K} (k+a)² (θk − θk+1) + s Σ_{k=1}^{K} (k+a)² ‖gk‖ ‖xk+1 − xk‖.
By a computation similar to Chambolle-Dossal [26, Corollary 2], we equivalently obtain

(109)    (K+a)² dK+1 + Σ_{k=2}^{K} a (2k+a−2) dk ≤ s ( (a+1)² θ1 − (K+a)² θK+1 + Σ_{k=2}^{K} (2k+2a−1) θk + Σ_{k=1}^{K} (k+a)² ‖gk‖ ‖xk+1 − xk‖ ).
By (102) we have Σ_k (2k+2a−1) θk < +∞. Hence there exists some constant C such that, for all K ∈ N,

(110)    (K+a)² ‖xK+1 − xK‖² ≤ C + 2s Σ_{k=1}^{K} (k+a)² ‖gk‖ ‖xk+1 − xk‖.
We now proceed with an argument parallel to that used in the proof of Theorem 3.1. Let us write (110) as follows, with rk := (k+a)‖xk+1 − xk‖:

(111)    r_k² ≤ C + 2s Σ_{j=1}^{k} (j+a) ‖gj‖ rj.
We make appeal to the following discrete version of the Gronwall-Bellman lemma.

Lemma 5.2. Let (rk) be a sequence of positive real numbers such that, for all k ≥ 1,

r_k² ≤ C + Σ_{j=1}^{k} ωj rj,

where C is a positive constant, and Σ_j ωj < +∞, with ωj ≥ 0. Then the sequence (rk) is bounded, with

rk ≤ √C + Σ_{j∈N} ωj.
Proof. For simplicity, let us assume ωj > 0 (one can always reduce to this situation by adding some arbitrarily small positive constant; see Brezis [20] for the proof of this lemma in the continuous case). Set Ak := C + Σ_{j=1}^{k} ωj rj, A0 = C. We have r_k² ≤ Ak, and Ak+1 − Ak = ωk+1 rk+1. Equivalently rk+1 = (Ak+1 − Ak)/ωk+1, which gives

(Ak+1 − Ak)/ωk+1 ≤ √Ak+1,

and hence

Ak+1/√Ak+1 − Ak/√Ak+1 ≤ ωk+1.

From this, and using that the sequence (Ak) is increasing, we deduce that

√Ak+1 − √Ak ≤ ωk+1.

Summing this inequality, and using rk ≤ √Ak, gives the claim.
Following the proof of Theorem 5.2. Let us apply Lemma 5.2 to inequality (111) with rj = (j+a)‖xj+1 − xj‖ and ωj = 2s(j+a)‖gj‖. By the assumption on the perturbation term, Σ_k k‖gk‖ < +∞, we have Σ_j ωj < +∞, and we deduce that

(112)    sup_k k ‖xk+1 − xk‖ < +∞.

Injecting this information in (109), we obtain

(113)    Σ_k a (2k+a−2) dk ≤ C + Σ_k (2k+2a−1) θk + sup_k ( (k+a)‖xk+1 − xk‖ ) Σ_k (k+a)‖gk‖.

From a = α − 1 ≥ 2, (102), and the definition of dk, we deduce that

Σ_k k ‖xk+1 − xk‖² < +∞,

which is our claim ii).
Step 3. The last step consists in applying Opial's lemma, whose discrete version is stated below.

Lemma 5.3. Let S be a nonempty subset of H, and (xk) a sequence of elements of H. Assume that
(i) for every z ∈ S, lim_{k→+∞} ‖xk − z‖ exists;
(ii) every weak sequential cluster point of the sequence (xk) belongs to S.
Then

w-lim_{k→+∞} xk = x∞

exists, for some element x∞ ∈ S.
We are going to apply Opial's lemma with S = argmin(Φ + Ψ). By Theorem 5.1, we have (Φ + Ψ)(xk) → min(Φ + Ψ) (indeed, we have proved fast convergence). By the lower semicontinuity of Φ + Ψ for the weak topology of H, we immediately obtain that item (ii) of Opial's lemma is satisfied. Thus the only point to verify is that lim ‖xk − x*‖ exists for any x* ∈ argmin(Φ + Ψ). Equivalently, we are going to show that lim hk exists, with hk := ½‖xk − x*‖². The beginning of the proof is similar to [4], [26]. It consists in establishing a discrete version of the second-order differential inequality (52),

ḧ(t) + (α/t) ḣ(t) ≤ ‖ẋ(t)‖² + ‖x(t) − x*‖ ‖g(t)‖.

We use the parallelogram identity, which in an equivalent form can be written as follows: for any a, b, c ∈ H,

(114)    ½‖a − b‖² + ½‖a − c‖² = ½‖b − c‖² + ⟨a − b, a − c⟩.

Taking b = x*, a = xk+1, c = xk, we obtain

½‖xk+1 − x*‖² + ½‖xk+1 − xk‖² = ½‖xk − x*‖² + ⟨xk+1 − x*, xk+1 − xk⟩.

Equivalently,

(115)    hk − hk+1 = ½‖xk+1 − xk‖² + ⟨xk+1 − x*, xk − xk+1⟩.
By definition of yk, we have

xk − xk+1 = yk − xk+1 − ((k−1)/(k+α−1)) (xk − xk−1).

Replacing in (115), we obtain

(116)    hk − hk+1 = ½‖xk+1 − xk‖² + ⟨xk+1 − x*, yk − xk+1⟩ − ((k−1)/(k+α−1)) ⟨xk+1 − x*, xk − xk−1⟩.
Let us now use the monotonicity of ∂Φ. Since −s∇Ψ(x*) ∈ s∂Φ(x*), and yk − xk+1 − s∇Ψ(yk) + sgk ∈ s∂Φ(xk+1), we have

⟨yk − xk+1 − s∇Ψ(yk) + sgk + s∇Ψ(x*), xk+1 − x*⟩ ≥ 0.

Equivalently,

⟨yk − xk+1, xk+1 − x*⟩ + s ⟨∇Ψ(x*) − ∇Ψ(yk) + gk, xk+1 − x*⟩ ≥ 0.

Replacing in (116), we obtain

(117)    hk+1 − hk + ½‖xk+1 − xk‖² + s ⟨∇Ψ(yk) − ∇Ψ(x*) − gk, xk+1 − x*⟩ − ((k−1)/(k+α−1)) ⟨xk+1 − x*, xk − xk−1⟩ ≤ 0.

We now use the co-coercivity of ∇Ψ:

(118)    ⟨∇Ψ(yk) − ∇Ψ(x*), xk+1 − x*⟩ = ⟨∇Ψ(yk) − ∇Ψ(x*), xk+1 − yk⟩ + ⟨∇Ψ(yk) − ∇Ψ(x*), yk − x*⟩
         ≥ (1/L) ‖∇Ψ(yk) − ∇Ψ(x*)‖² + ⟨∇Ψ(yk) − ∇Ψ(x*), xk+1 − yk⟩
         ≥ (1/L) ‖∇Ψ(yk) − ∇Ψ(x*)‖² − ‖∇Ψ(yk) − ∇Ψ(x*)‖ ‖xk+1 − yk‖
         ≥ −(L/2) ‖xk+1 − yk‖².

Combining (117) and (118),
(119)    hk+1 − hk + ½‖xk+1 − xk‖² − (sL/2) ‖xk+1 − yk‖² − s‖gk‖ ‖xk+1 − x*‖ − ((k−1)/(k+α−1)) ⟨xk+1 − x*, xk − xk−1⟩ ≤ 0.

Let us use again (114) with b = x*, a = xk, c = xk−1. We obtain

½‖xk − x*‖² + ½‖xk − xk−1‖² = ½‖xk−1 − x*‖² + ⟨xk − x*, xk − xk−1⟩.

Equivalently,

(120)    hk−1 − hk = ½‖xk − xk−1‖² − ⟨xk − x*, xk − xk−1⟩.

Combining (119) with (120), we obtain

(121)    hk+1 − hk − ((k−1)/(k+α−1)) (hk − hk−1) ≤ −½‖xk+1 − xk‖² + (sL/2)‖xk+1 − yk‖² + s‖gk‖ ‖xk+1 − x*‖
         + ((k−1)/(k+α−1)) ( ½‖xk − xk−1‖² + ⟨xk − xk−1, xk+1 − xk⟩ ).
By definition of yk = xk + ((k−1)/(k+α−1)) (xk − xk−1), we have xk+1 − yk = xk+1 − xk − ((k−1)/(k+α−1)) (xk − xk−1). Hence

‖xk+1 − yk‖² = ‖xk+1 − xk‖² + ((k−1)/(k+α−1))² ‖xk − xk−1‖² − 2 ((k−1)/(k+α−1)) ⟨xk+1 − xk, xk − xk−1⟩.

Substituting in (121), we obtain

(122)    hk+1 − hk − γk (hk − hk−1) ≤ −((1 − sL)/2) ‖xk+1 − yk‖² + s‖gk‖ ‖xk+1 − x*‖ + (γk + γk²) ‖xk − xk−1‖²,

where γk = (k−1)/(k+α−1). Since 0 < s < 1/L, we have (1 − sL)/2 > 0. On the other hand, since γk < 1, we have γk + γk² < 2γk. Hence

(123)    hk+1 − hk − γk (hk − hk−1) ≤ s‖gk‖ ‖xk+1 − x*‖ + 2γk ‖xk − xk−1‖².
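The final lower bound in (118) uses only the convexity of Ψ and the L-Lipschitz continuity of ∇Ψ, so it must hold at arbitrary points. The sketch below checks it at pseudo-random scalar points for the illustrative quadratic Ψ(x) = ½(x − b)² (for which L = 1, an assumption of the experiment); random reals play the roles of yk, xk+1 and x*.

```python
import random

# Random-point check of the last bound in (118) for Psi(x) = 0.5*(x - b)**2,
# whose gradient x -> x - b is L-Lipschitz with L = 1.
random.seed(0)
b, L = 1.0, 1.0

def grad_Psi(x):
    return x - b

worst = 0.0
for _ in range(10000):
    y = random.uniform(-10, 10)   # plays y_k
    u = random.uniform(-10, 10)   # plays x_{k+1}
    w = random.uniform(-10, 10)   # plays x*
    lhs = (grad_Psi(y) - grad_Psi(w)) * (u - w)
    # a positive violation would contradict (118)
    violation = -(L / 2) * (u - y) ** 2 - lhs
    worst = max(worst, violation)
```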
By (100), we know that the sequence (zk) is bounded. By (112), we know that sup_k k‖xk+1 − xk‖ < +∞. Since xk = zk+1 − ((k+α−1)/(α−1)) (xk+1 − xk), we deduce that the sequence (xk) is bounded. Returning to (123), we have, for some constant C,

(124)    hk+1 − hk − γk (hk − hk−1) ≤ C‖gk‖ + 2γk ‖xk − xk−1‖².

We now use the estimate obtained in Step 2, namely Σ_k k‖xk+1 − xk‖² < +∞. Combined with the assumption Σ_k k‖gk‖ < +∞, we deduce that

(125)    hk+1 − hk − γk (hk − hk−1) ≤ ωk,

for some nonnegative sequence (ωk) such that Σ_{k∈N} kωk < +∞. Taking positive parts, we obtain

(126)    (hk+1 − hk)⁺ ≤ γk (hk − hk−1)⁺ + ωk.
We now use the following lemma, which is a discrete version of Lemma 3.2.

Lemma 5.4. Let (ak) be a sequence of nonnegative real numbers such that, for all k ≥ 1,

ak+1 ≤ ((k−1)/(k+α−1)) ak + ωk,

where α ≥ 3, and Σ_k kωk < +∞, with ωk ≥ 0. Then the sequence (ak) is summable, i.e.,

Σ_{k∈N} ak < +∞.

Proof. Since α ≥ 3 we have α − 1 ≥ 2, and hence

ak+1 ≤ ((k−1)/(k+2)) ak + ωk.

Multiplying this expression by (k+1)², we obtain

(k+1)² ak+1 ≤ ((k−1)(k+1)²/(k+2)) ak + (k+1)² ωk.

Then note that, for every integer k,

(k−1)(k+1)²/(k+2) ≤ k².

Hence

(k+1)² ak+1 ≤ k² ak + (k+1)² ωk.

Summing this inequality with respect to j = 1, 2, ..., k−1, we obtain

k² ak ≤ a1 + Σ_{j=1}^{k−1} (j+1)² ωj.

Dividing by k², and summing with respect to k, we obtain

Σ_k ak ≤ a1 Σ_k 1/k² + Σ_k (1/k²) Σ_{j=1}^{k−1} (j+1)² ωj.

Applying Fubini's theorem to this last sum, we obtain

Σ_k ak ≤ a1 Σ_k 1/k² + Σ_j (j+1)² ωj Σ_{k=j+1}^{∞} 1/k².

We have

Σ_{k=j+1}^{∞} 1/k² ≤ ∫_j^{∞} dt/t² = 1/j.
Hence

Σ_k ak ≤ a1 Σ_k 1/k² + Σ_j ((j+1)²/j) ωj < +∞,

which by (j+1)²/j ≤ 4j for j ≥ 1 gives the claim.
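The key estimate of the proof, k² ak ≤ a1 + Σ_{j=1}^{k−1} (j+1)² ωj, and the resulting summability can be checked numerically; α = 3 and ωk = 1/k³ below are illustrative choices for which Σ_k kωk < +∞.

```python
# Numerical check of Lemma 5.4 and of the key estimate from its proof.
alpha = 3.0
N = 3000
w = [1.0 / k ** 3 for k in range(1, N + 1)]      # w_k for k = 1..N, sum k*w_k < +inf
a = [1.0]                                        # a_1 = 1
for k in range(1, N):
    # recursion of the lemma, taken with equality (worst case)
    a.append((k - 1) / (k + alpha - 1) * a[k - 1] + w[k - 1])

prefix = 0.0      # accumulates sum_{j=1}^{k-1} (j+1)^2 * w_j
ok = True
for k in range(1, N + 1):
    # key estimate: k^2 a_k <= a_1 + sum_{j=1}^{k-1} (j+1)^2 w_j
    if k * k * a[k - 1] > a[0] + prefix + 1e-9:
        ok = False
    prefix += (k + 1) ** 2 * w[k - 1]

partial_sum = sum(a)
```

The partial sums of Σ ak stay below the proof's bound a1 Σ 1/k² + Σ ((j+1)²/j) ωj, which is less than 10 for these data.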
End of the proof of Theorem 5.2. Let us apply Lemma 5.4, via (126), with ak = (hk − hk−1)⁺. We obtain

Σ_k (hk − hk−1)⁺ < +∞,

which, combined with hk nonnegative, gives the convergence of the sequence (hk), and ends the proof. Indeed, the sequence hk − Σ_{j≤k} (hj − hj−1)⁺ is nonincreasing and bounded from below, hence convergent, and the subtracted sum converges.
References
[1] B. Abbas, H. Attouch, B. F. Svaiter, Newton-like dynamics and forward-backward methods for structured monotone inclusions in Hilbert spaces, JOTA, DOI 10.1007/s10957-013-0414-5, (2013).
[2] S. Adly, H. Attouch, A. Cabot, Finite time stabilization of nonlinear oscillators subject to dry friction, Nonsmooth Mechanics and Analysis (edited by P. Alart, O. Maisonneuve and R. T. Rockafellar), Adv. in Math. and Mech., Kluwer (2006), pp. 289–304.
[3] F. Alvarez, On the minimizing property of a second-order dissipative system in Hilbert spaces, SIAM J. Control Optim., 38 (2000), No. 4, pp. 1102–1119.
[4] F. Alvarez, H. Attouch, An inertial proximal method for maximal monotone operators via discretization of a nonlinear oscillator with damping, Set-Valued Analysis, 9 (2001), No. 1-2, pp. 3–11.
[5] F. Alvarez, H. Attouch, Convergence and asymptotic stabilization for some damped hyperbolic equations with non-isolated equilibria, ESAIM Control Optim. Calc. Var., 6 (2001), pp. 539–552.
[6] F. Alvarez, H. Attouch, J. Bolte, P. Redont, A second-order gradient-like dissipative dynamical system with Hessian-driven damping. Application to optimization and mechanics, J. Math. Pures Appl., 81 (2002), No. 8, pp. 747–779.
[7] F. Alvarez, J. Peypouquet, Asymptotic almost-equivalence of Lipschitz evolution systems in Banach spaces, Nonlinear Anal., 73 (2010), No. 9, pp. 3018–3033.
[8] F. Alvarez, J. Peypouquet, A unified approach to the asymptotic almost-equivalence of evolution systems without Lipschitz conditions, Nonlinear Anal., 74 (2011), No. 11, pp. 3440–3444.
[9] H. Attouch, G. Buttazzo, G. Michaille, Variational Analysis in Sobolev and BV Spaces. Applications to PDEs and Optimization, MPS/SIAM Series on Optimization, 6, SIAM, Philadelphia, PA, second edition, 2014, 793 pages.
[10] H. Attouch, A. Cabot, P. Redont, The dynamics of elastic shocks via epigraphical regularization of a differential inclusion, Adv. Math. Sci. Appl., 12 (2002), No. 1, pp. 273–306.
[11] H. Attouch, M.-O. Czarnecki, Asymptotic control and stabilization of nonlinear oscillators with non-isolated equilibria, J. Differential Equations, 179 (2002), pp. 278–310.
[12] H. Attouch, X. Goudou, P. Redont, The heavy ball with friction method. The continuous dynamical system: global exploration of the local minima of a real-valued function by asymptotic analysis of a dissipative dynamical system, Commun. Contemp. Math., 2 (2000), No. 1, pp. 1–34.
[13] H. Attouch, P.-E. Maingé, P. Redont, A second-order differential system with Hessian-driven damping; application to non-elastic shock laws, Differential Equations and Applications, 4 (2012), No. 1, pp. 27–65.
[14] H. Attouch, J. Peypouquet, P. Redont, A dynamical approach to an inertial forward-backward algorithm for convex minimization, SIAM J. Optim., 24 (2014), No. 1, pp. 232–256.
[15] H. Attouch, J. Peypouquet, P. Redont, Fast convergence of an inertial dynamical system with asymptotic vanishing damping, 2015, to appear.
[16] H. Attouch, A. Soubeyran, Inertia and reactivity in decision making as cognitive variational inequalities, Journal of Convex Analysis, 13 (2006), pp. 207–224.
[17] J.-B. Baillon, Un exemple concernant le comportement asymptotique de la solution du problème du/dt + ∂φ(u) ∋ 0, Journal of Functional Analysis, 28 (1978), pp. 369–376.
[18] H. Bauschke, P. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, CMS Books in Mathematics, Springer, (2011).
[19] A. Beck, M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci., 2 (2009), No. 1, pp. 183–202.
[20] H. Brézis, Opérateurs maximaux monotones dans les espaces de Hilbert et équations d'évolution, Lecture Notes 5, North Holland, (1972).
[21] H. Brézis, Asymptotic behavior of some evolution systems, in Nonlinear Evolution Equations, Academic Press, New York, (1978), pp. 141–154.
[22] R. E. Bruck, Asymptotic convergence of nonlinear contraction semigroups in Hilbert spaces, J. Funct. Anal., 18 (1975), pp. 15–26.
[23] A. Cabot, Inertial gradient-like dynamical system controlled by a stabilizing term, J. Optim. Theory Appl., 120 (2004), pp. 275–303.
[24] A. Cabot, H. Engler, S. Gadat, On the long time behavior of second order differential equations with asymptotically small dissipation, Transactions of the American Mathematical Society, 361 (2009), pp. 5983–6017.
[25] A. Cabot, H. Engler, S. Gadat, Second order differential equations with asymptotically small dissipation and piecewise flat potentials, Electronic Journal of Differential Equations, 17 (2009), pp. 33–38.
[26] A. Chambolle, Ch. Dossal, On the convergence of the iterates of FISTA, HAL Id: hal-01060130, https://hal.inria.fr/hal-01060130v3, submitted on 20 Oct 2014.
[27] R. Cominetti, J. Peypouquet, S. Sorin, Strong asymptotic convergence of evolution equations governed by maximal monotone operators with Tikhonov regularization, Journal of Differential Equations, 245 (2008), No. 12, pp. 3753–3763.
[28] N. Flammarion, F. Bach, From averaging to acceleration, there is only a step-size, (2015), HAL-01136945.
[29] M. A. Jendoubi, R. May, Asymptotics for a second-order differential equation with nonautonomous damping and an integrable source term, Applicable Analysis, 2014.
[30] A. Moudafi, M. Oliny, Convergence of a splitting inertial proximal method for monotone operators, J. Comput. Appl. Math., 155 (2003), No. 2, pp. 447–454.
[31] Y. Nesterov, A method of solving a convex programming problem with convergence rate O(1/k²), Soviet Mathematics Doklady, 27 (1983), pp. 372–376.
[32] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Applied Optimization, 87, Kluwer Academic Publishers, Boston, MA, 2004.
[33] Y. Nesterov, Smooth minimization of non-smooth functions, Mathematical Programming, 103 (2005), No. 1, pp. 127–152.
[34] Y. Nesterov, Gradient methods for minimizing composite objective function, CORE Discussion Papers, 2007.
[35] Z. Opial, Weak convergence of the sequence of successive approximations for nonexpansive mappings, Bull. Amer. Math. Soc., 73 (1967), pp. 591–597.
[36] B. O'Donoghue, E. J. Candès, Adaptive restart for accelerated gradient schemes, Found. Comput. Math., 2013.
[37] N. Parikh, S. Boyd, Proximal algorithms, Foundations and Trends in Optimization, 1 (2013), pp. 123–231.
[38] J. Peypouquet, Analyse asymptotique de systèmes d'évolution et applications en optimisation, Thèse, Université Paris 6, (2008).
[39] D. A. Lorenz, T. Pock, An inertial forward-backward algorithm for monotone inclusions, J. Math. Imaging Vision, pp. 1–15, 2014 (online).
[40] M. Schmidt, N. Le Roux, F. Bach, Convergence rates of inexact proximal-gradient methods for convex optimization, NIPS'11 - 25th Annual Conference on Neural Information Processing Systems, Dec. 2011, Granada, Spain, HAL inria-00618152v3.
[41] W. Su, S. Boyd, E. J. Candès, A differential equation for modeling Nesterov's accelerated gradient method: theory and insights, preprint.

Institut de Mathématiques et Modélisation de Montpellier, UMR 5149 CNRS, Université Montpellier 2, place Eugène Bataillon, 34095 Montpellier cedex 5, France
E-mail address: [email protected]

Laboratoire IBN AL-BANNA de Mathématiques et Applications (LIBMA), Cadi Ayyad University, Faculty of Sciences Semlalia, Mathematics, 40000 Marrakech, Morocco
E-mail address: [email protected]