
Minimizing Finite Sums with the Stochastic Average Gradient Algorithm

Mark Schmidt (University of British Columbia)
Joint work with Nicolas Le Roux and Francis Bach



Context: Machine Learning for “Big Data”

Large-scale machine learning: large N, large P.
  N: number of observations (inputs).
  P: dimension of each observation.

Regularized empirical risk minimization: find x^* solving

  \min_{x \in \mathbb{R}^P} \; \underbrace{\frac{1}{N} \sum_{i=1}^N \ell(x^T a_i)}_{\text{data-fitting term}} + \underbrace{\lambda r(x)}_{\text{regularizer}}.

Applications in any data-oriented field: vision, bioinformatics, speech, natural language, the web.

Main practical challenges:
  Choosing the regularizer r and the data-fitting term ℓ.
  Designing/learning good features a_i.
  Efficiently solving the problem when N or P is very large.


This talk: Big-N Problems

We want to minimize the sum of a finite set of smooth functions:

  \min_{x \in \mathbb{R}^P} \; g(x) := \frac{1}{N} \sum_{i=1}^N f_i(x).

We are interested in cases where N is very large.

We will focus on strongly-convex functions g. The simplest example is ℓ2-regularized least squares,

  f_i(x) := (a_i^T x - b_i)^2 + \frac{\lambda}{2} \|x\|^2.

Other examples include any ℓ2-regularized convex loss: logistic regression, Huber regression, smooth SVMs, CRFs, etc.
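For concreteness, one of these examples can be written out explicitly (this particular form is standard, but it is not spelled out on the slide): ℓ2-regularized logistic regression with labels b_i ∈ {-1, +1} uses

  f_i(x) := \log\!\left(1 + \exp(-b_i \, a_i^T x)\right) + \frac{\lambda}{2} \|x\|^2.

Each such f_i is convex with a Lipschitz-continuous gradient (L_i \le \|a_i\|^2/4 + \lambda), and g is λ-strongly convex, so the assumptions used throughout the talk hold.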


Stochastic vs. Deterministic Gradient Methods

We consider minimizing g(x) = \frac{1}{N} \sum_{i=1}^N f_i(x).

Deterministic gradient method [Cauchy, 1847]:

  x^{t+1} = x^t - \alpha_t g'(x^t) = x^t - \frac{\alpha_t}{N} \sum_{i=1}^N f_i'(x^t).

  Linear convergence rate: O(\rho^t).
  Iteration cost is linear in N.
  Fancier methods exist, but their iteration cost is still O(N).

Stochastic gradient method [Robbins & Monro, 1951]: random selection of i(t) from {1, 2, ..., N},

  x^{t+1} = x^t - \alpha_t f_{i(t)}'(x^t).

  Iteration cost is independent of N.
  Sublinear convergence rate: O(1/t).
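As a minimal illustration of the two update rules (a sketch, not from the slides; the least-squares losses f_i(x) = (a_i^T x - b_i)^2 and the step sizes are assumptions made for the example):

    import numpy as np

    def full_gradient_step(x, A, b, alpha):
        """Deterministic step: x - (alpha/N) * sum_i f_i'(x) for f_i(x) = (a_i^T x - b_i)^2."""
        grad = 2 * A.T @ (A @ x - b) / len(b)   # touches every row: O(N P) per iteration
        return x - alpha * grad

    def stochastic_gradient_step(x, A, b, alpha, rng):
        """Stochastic step: x - alpha * f_i'(x) for a single random example i."""
        i = rng.integers(len(b))                # O(P) per iteration, independent of N
        grad_i = 2 * (A[i] @ x - b[i]) * A[i]
        return x - alpha * grad_i

    # Example usage on synthetic data:
    # rng = np.random.default_rng(0)
    # A, b = rng.standard_normal((1000, 20)), rng.standard_normal(1000)
    # x = full_gradient_step(np.zeros(20), A, b, alpha=1e-3)
    # x = stochastic_gradient_step(x, A, b, alpha=1e-3, rng=rng)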



Motivation for New Methods

FG (full gradient) methods have O(N) iteration cost with an O(\rho^t) rate.
SG (stochastic gradient) methods have O(1) iteration cost with an O(1/t) rate.

[Figure: log(excess cost) vs. time, showing the deterministic, stochastic, and hybrid regimes.]

Goal: the best of both worlds, an O(1) iteration cost with an O(\rho^t) rate.


Prior Work on Speeding up SG Methods

A variety of methods have been proposed to speed up SG methods:

  Step-size strategies, momentum, gradient/iterate averaging:
  Polyak & Juditsky (1992), Tseng (1998), Kushner & Yin (2003), Nesterov (2009), Xiao (2010), Hazan & Kale (2011), Rakhlin et al. (2012).

  Stochastic versions of accelerated and Newton-like methods:
  Bordes et al. (2009), Sunehag et al. (2009), Ghadimi & Lan (2010), Martens (2010), Xiao (2010), Duchi et al. (2011).

None of these methods improve on the O(1/t) rate.


Existing linear convergence results:

  Constant step-size SG, accelerated SG:
  Kesten (1958), Delyon & Juditsky (1993), Nedic & Bertsekas (2000).
  Linear convergence, but only up to a fixed tolerance: O(\rho^t) + O(\alpha).

  Hybrid methods, incremental average gradient:
  Bertsekas (1997), Blatt et al. (2007), Friedlander & Schmidt (2012).
  Linear rate, but iterations make full passes through the data.

  Special problem classes:
  Collins et al. (2008), Strohmer & Vershynin (2009), Schmidt & Le Roux (2012), Shalev-Shwartz & Zhang (2012).
  Linear rate, but limited choice for the f_i's.


Stochastic Average Gradient

Assume only that each f_i is convex, each f_i' is L-Lipschitz continuous, and g is µ-strongly convex.

Is it possible to have an O(\rho^t) rate with an O(1) cost? YES!

The stochastic average gradient (SAG) algorithm: randomly select i(t) from {1, 2, ..., N}, compute f_{i(t)}'(x^t), and set

  x^{t+1} = x^t - \frac{\alpha_t}{N} \sum_{i=1}^N y_i^t.

Memory: y_i^t = f_i'(x^t) from the last iteration t where i was selected.

  This assumes that the gradients of the other examples don't change; the assumption becomes accurate as \|x^{t+1} - x^t\| → 0.
  SAG is a stochastic variant of the incremental aggregated gradient (IAG) method [Blatt et al., 2007].


Convergence Rate of SAG: Attempt 1

Proposition 1. With \alpha_t = \frac{1}{2NL}, the SAG iterations satisfy

  E[g(x^t) - g(x^*)] \le \left(1 - \frac{\mu}{8LN}\right)^t C.

A convergence rate of O(\rho^t) with a cost of O(1). Is this useful?

Not really: this rate is very slow, with performance similar to the cyclic method.


Convergence Rate of SAG: Attempt 2

Proposition 2. With \alpha_t \in \left[\frac{1}{2N\mu}, \frac{1}{16L}\right] and N \ge \frac{8L}{\mu}, the SAG iterations satisfy

  E[g(x^t) - g(x^*)] \le \left(1 - \frac{1}{8N}\right)^t C.

This allows much bigger step-sizes than Attempt 1.

Proof sketch: use a Lyapunov function L(\theta^t) that is quadratic in (\theta^t - \theta^*), built from blocks of the form A = a_1 ee^T + a_2 I, B = be, C = cI (with e a stacked vector of identity matrices), together with a multiple of g(x^t) - g(x^*). The proof involves finding constants \{\alpha, a_1, a_2, b, c, d, h, \delta, \gamma\} such that

  E[L(\theta^t) \mid \mathcal{F}_{t-1}] \le (1 - \delta) L(\theta^{t-1})  and  L(\theta^t) \ge \gamma [g(x^t) - g(x^*)].

Applying the first inequality recursively and bounding the initial Lyapunov function gives the constant.


Constants in Convergence Rate

What are the constants?

If we initialize with y_i^0 = 0 we have

  C = [g(x^0) - g(x^*)] + \frac{4L}{N} \|x^0 - x^*\|^2 + \frac{\sigma^2}{16L}.

If we initialize with y_i^0 = f_i'(x^0) - g'(x^0) we have

  C = \frac{3}{2} [g(x^0) - g(x^*)] + \frac{4L}{N} \|x^0 - x^*\|^2.

If we initialize with N stochastic gradient iterations,

  [g(x^0) - g(x^*)] = O(1/N)  and  \|x^0 - x^*\|^2 = O(1/N).


Convergence Rate in Convex Case

Assume only that each f_i is convex, each f_i' is L-Lipschitz continuous, and some minimizer x^* exists.

Theorem. With \alpha_t \le \frac{1}{16L} the SAG iterations satisfy

  E[g(x^t) - g(x^*)] = O(1/t).

This is faster than the SG lower bound of O(1/\sqrt{t}).

Same algorithm and step-size as in the strongly-convex case:
  The algorithm is adaptive to strong convexity.
  Faster convergence rate if µ is locally bigger around x^*.

The same algorithm could be used in the non-convex case.

Contrast with stochastic dual coordinate ascent:
  Requires an explicit strongly-convex regularizer.
  Not adaptive to µ, and does not allow µ = 0.

Comparing FG and SG Methods

[Figure: objective minus optimum vs. effective passes on quantum (n = 50000, p = 78) and rcv1 (n = 697641, p = 47236), comparing AFG, L-BFGS, IAG, SG, and ASG.]

Comparison of competitive deterministic and stochastic methods.

SAG Compared to FG and SG Methods

[Figure: objective minus optimum vs. effective passes on quantum (n = 50000, p = 78) and rcv1 (n = 697641, p = 47236), adding SAG-LS to the previous comparison.]

SAG starts fast and stays fast.

SAG Compared to Coordinate-Based Methods

[Figure: objective minus optimum vs. effective passes on quantum (n = 50000, p = 78) and rcv1 (n = 697641, p = 47236), comparing SAG-LS with PCD-L, DCA, ASG, and L-BFGS.]

PCD/DCA are similar on some problems, much worse on others.


SAG Implementation Issues

Basic SAG loop (a runnable sketch follows after the list of issues):

  while(1)
    Sample i from {1, 2, ..., N}.
    Compute f_i'(x).
    d = d - y_i + f_i'(x).
    y_i = f_i'(x).
    x = x - (α/N) d.

Issues:
  Should we normalize by N?
  Can we reduce the memory?
  Can we handle sparse data?
  How should we set the step size?
  When should we stop?
  Can we use mini-batches?
  Can we handle constraints or non-smooth problems?
  Should we shuffle the data?
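The following is a minimal runnable version of the loop above (a sketch, assuming least-squares losses f_i(x) = (a_i^T x - b_i)^2, a fixed step size, and dense data; the normalization, memory, and sparsity refinements discussed on the next slides are omitted):

    import numpy as np

    def sag_least_squares(A, b, alpha, num_iters, seed=0):
        """Minimal SAG sketch for f_i(x) = (a_i^T x - b_i)^2."""
        rng = np.random.default_rng(seed)
        N, P = A.shape
        x = np.zeros(P)
        y = np.zeros((N, P))   # y[i] = last stored gradient of example i
        d = np.zeros(P)        # d = sum_i y[i], maintained incrementally
        for _ in range(num_iters):
            i = rng.integers(N)
            g = 2 * (A[i] @ x - b[i]) * A[i]   # f_i'(x)
            d += g - y[i]                      # swap the old stored gradient for the new one
            y[i] = g
            x -= (alpha / N) * d               # step along the average of the stored gradients
        return x

Here the N-by-P array y is the O(NP) memory cost that the next slides reduce to O(N) for linear models.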


Implementation Issues: Normalization

Should we normalize by N in the early iterations? The parameter update becomes

  x = x - \frac{\alpha}{M} d,

where we normalize by the number of examples seen so far (M). This gives better performance in the early iterations, and is similar to doing one pass of SG.


Implementation Issues: Memory Requirements

Can we reduce the memory? The memory update for f_i(a_i^T x):

  Compute f_i'(δ), with δ = a_i^T x.
  d = d + a_i (f_i'(δ) - y_i).
  y_i = f_i'(δ).

This uses that the gradient of f_i(a_i^T x) is a_i f_i'(δ), so we only store the scalars f_i'(δ).
Reduces the memory from O(NP) to O(N).
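In code, the scalar-memory variant of the earlier sketch might look as follows (same illustrative assumptions: least-squares losses and a fixed step size):

    import numpy as np

    def sag_linear_scalar_memory(A, b, alpha, num_iters, seed=0):
        """SAG sketch for linear losses f_i(a_i^T x), storing only N scalars."""
        rng = np.random.default_rng(seed)
        N, P = A.shape
        x = np.zeros(P)
        y = np.zeros(N)        # y[i] = stored scalar derivative f_i'(delta_i)
        d = np.zeros(P)        # d = sum_i a_i * y[i]
        for _ in range(num_iters):
            i = rng.integers(N)
            delta = A[i] @ x
            g = 2 * (delta - b[i])     # scalar derivative of (delta - b_i)^2
            d += A[i] * (g - y[i])     # update the running sum of gradients
            y[i] = g
            x -= (alpha / N) * d
        return x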


Implementation Issues: Memory Requirements

Can we reduce the memory in general? We can re-write the SAG iteration as

  x^{t+1} = x^t - \frac{\alpha_t}{N} \left[ f_i'(x^t) - f_i'(x^i) + \sum_{j=1}^N f_j'(x^j) \right],

where x^i is the iterate at which example i was last selected.

The SVRG/mixedGrad methods ('memory-free methods') instead use

  x^{t+1} = x^t - \alpha \left[ f_i'(x^t) - f_i'(\tilde{x}) + \frac{1}{N} \sum_{j=1}^N f_j'(\tilde{x}) \right],

where we occasionally update \tilde{x}.
[Mahdavi et al., 2013; Johnson & Zhang, 2013; Zhang et al., 2013; Konecny & Richtarik, 2013; Xiao & Zhang, 2014]

This gets rid of the memory at the cost of two f_i' evaluations per iteration.
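A minimal sketch of the SVRG-style update (same least-squares assumption; the epoch structure with an inner-loop length m is standard for SVRG, but the specific values of m and the step size are illustrative, not taken from the slides):

    import numpy as np

    def svrg_least_squares(A, b, alpha, num_epochs, m, seed=0):
        """SVRG sketch: occasional full gradient at a reference point x_tilde, no per-example memory."""
        rng = np.random.default_rng(seed)
        N, P = A.shape
        x = np.zeros(P)
        for _ in range(num_epochs):
            x_tilde = x.copy()
            full_grad = 2 * A.T @ (A @ x_tilde - b) / N       # (1/N) sum_j f_j'(x_tilde)
            for _ in range(m):
                i = rng.integers(N)
                g_x = 2 * (A[i] @ x - b[i]) * A[i]            # f_i'(x)
                g_tilde = 2 * (A[i] @ x_tilde - b[i]) * A[i]  # f_i'(x_tilde)
                x -= alpha * (g_x - g_tilde + full_grad)      # variance-reduced step
        return x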


Implementation Issues: Sparsity

Can we handle sparse data? The parameter update for each variable j is

  x_j = x_j - \frac{k\alpha}{M} d_j.

For sparse data, d_j is typically constant between the iterations that touch variable j, so we can apply the previous k updates lazily when it changes (see the sketch below).

Reduces the iteration cost from O(P) to O(\|f_i'(x)\|_0).
Standard tricks allow ℓ2-regularization and ℓ1-regularization.
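One way to implement these 'just-in-time' updates is sketched below. To keep the bookkeeping simple it assumes a fixed step size α/N (ignoring the normalization by M), least-squares losses, and a scipy CSR matrix; it is meant only to show the deferred-update idea, not the released implementation:

    import numpy as np

    def sag_sparse_lazy(A, b, alpha, num_iters, seed=0):
        """SAG sketch with lazy updates; A is a scipy.sparse.csr_matrix."""
        rng = np.random.default_rng(seed)
        N, P = A.shape
        x = np.zeros(P)
        y = np.zeros(N)                 # stored scalar derivatives f_i'(delta_i)
        d = np.zeros(P)                 # d = sum_i a_i * y[i]
        last = np.zeros(P, dtype=int)   # iteration up to which x[j] has been updated
        step = alpha / N
        for t in range(1, num_iters + 1):
            i = rng.integers(N)
            start, end = A.indptr[i], A.indptr[i + 1]
            idx, vals = A.indices[start:end], A.data[start:end]
            # Catch up the coordinates we are about to touch: d[idx] has been
            # constant since they were last visited, so the skipped steps are identical.
            x[idx] -= (t - 1 - last[idx]) * step * d[idx]
            last[idx] = t - 1
            delta = vals @ x[idx]
            g = 2 * (delta - b[i])
            d[idx] += vals * (g - y[i])
            y[i] = g
            # The step for iteration t itself is also deferred to the next catch-up.
        x -= (num_iters - last) * step * d   # flush the remaining deferred steps
        return x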


Implementation Issues: Step-Size

How should we set the step size?
  Theory: α = 1/16L.  Practice: α = 1/L.

What if L is unknown, or smaller near x^*?
  Start with a small L.
  Increase L until we satisfy

    f_i\!\left(x - \frac{1}{L} f_i'(x)\right) \le f_i(x) - \frac{1}{2L} \|f_i'(x)\|^2

  (assuming \|f_i'(x)\|^2 is not negligibly small).
  Decrease L between iterations.
  (This is the Lipschitz approximation procedure from FISTA.)

For f_i(a_i^T x), checking this condition costs O(1) in N and P, since the left-hand side is f_i\!\left(a_i^T x - \frac{f_i'(\delta)}{L} \|a_i\|^2\right).
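A sketch of this backtracking test for the linear-model case (the doubling factor is an illustrative choice, and the slide's 'decrease L between iterations' step is omitted):

    def update_lipschitz_estimate(f_scalar, delta, g, a_sq_norm, L, eps=1e-8):
        """Backtracking estimate of L for f_i(a_i^T x).

        f_scalar:  callable giving the scalar loss value f_i(delta).
        delta:     current value of a_i^T x.
        g:         scalar derivative f_i'(delta).
        a_sq_norm: squared norm ||a_i||^2.
        """
        if g * g * a_sq_norm < eps:      # skip the test when the gradient is tiny
            return L
        # Increase L until f_i(delta - (g/L)||a_i||^2) <= f_i(delta) - g^2 ||a_i||^2 / (2L).
        while f_scalar(delta - (g / L) * a_sq_norm) > f_scalar(delta) - (g * g * a_sq_norm) / (2 * L):
            L *= 2.0
        return L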


Implementation Issues: Termination Criterion

When should we stop? Normally we check the size of the gradient norm \|g'(x)\|. Since SAG has y_i → f_i'(x), we can instead check the size of

  \left\| \tfrac{1}{N} d \right\| = \left\| \tfrac{1}{N} \sum_{i=1}^N y_i \right\| \to \|g'(x)\|.
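In code this check is essentially free, since d is already maintained by the loop (a sketch; `tol` is an assumed tolerance, not a value from the talk):

    import numpy as np

    def sag_should_stop(d, N, tol=1e-6):
        """Stopping test using the running sum d = sum_i y_i.

        Since each y_i converges to f_i'(x), ||d||/N converges to ||g'(x)||."""
        return np.linalg.norm(d) / N < tol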


SAG Implementation Issues: Mini-Batches

Can we use mini-batches? Yes: define each f_i to include more than one example.
  Reduces the memory requirements.
  Allows parallelization.
  But we must decrease L for good performance:

    L_B \le \frac{1}{|B|} \sum_{i \in B} L_i \le \max_{i \in B} \{L_i\}.

In practice, use the Lipschitz approximation procedure to determine L_B.


Implementation Issues: Constraints/Non-Smooth

For composite problems with a non-smooth regularizer r,

  \min_x \frac{1}{N} \sum_{i=1}^N f_i(x) + r(x).

We can use a proximal-gradient variant when r is simple:

  x = \mathrm{prox}_{\alpha r(\cdot)}\!\left[x - \frac{\alpha}{N} d\right].

[Mairal, 2013, 2014; Xiao & Zhang, 2014; Defazio et al., 2014]

E.g., with r(x) = λ\|x\|_1 this is stochastic iterative soft-thresholding, with the same convergence rate as the smooth case.

ADMM variants exist when r is a simple/linear composition, r(Ax). [Wong et al., 2013]

If the f_i are non-smooth, we could smooth them or use dual methods.
[Nesterov, 2005; Lacoste-Julien et al., 2013; Shalev-Shwartz & Zhang, 2013]
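For r(x) = λ\|x\|_1 the proximal step is just componentwise soft-thresholding of the SAG update (a sketch; the cited proximal variants differ in details such as how d and the step size are handled):

    import numpy as np

    def soft_threshold(z, tau):
        """prox of tau * ||.||_1: componentwise shrinkage toward zero."""
        return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

    def prox_sag_step(x, d, alpha, N, lam):
        """One step of x <- prox_{alpha * lam * ||.||_1}( x - (alpha/N) d )."""
        return soft_threshold(x - (alpha / N) * d, alpha * lam)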


SAG Implementation Issues: Non-Uniform Sampling

Does re-shuffling and doing full passes work better? NO!
  Performance is intermediate between IAG and SAG.

Can non-uniform sampling help?
  Bias the sampling towards the Lipschitz constants L_i.
  Justification: duplicating each example proportionally to L_i leaves the objective unchanged,

    \frac{1}{N} \sum_{i=1}^N f_i(x) = \frac{1}{\sum_i L_i} \sum_{i=1}^N \sum_{j=1}^{L_i} \frac{L_{\mathrm{mean}}}{L_i} f_i(x),

  and the convergence rate then depends on L_{\mathrm{mean}} instead of L_{\max}.
  Combine with the line-search for adaptive sampling (see the paper/code for details).
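A minimal sketch of Lipschitz-weighted sampling (mixing in some uniform sampling is shown here as an illustrative choice; the adaptive scheme actually used combines the biased sampling with the line-search, as the slide notes):

    import numpy as np

    def sample_index(L, rng, mix=0.5):
        """Sample an example index, mixing uniform sampling with sampling
        proportional to the (estimated) Lipschitz constants L[i]."""
        N = len(L)
        if rng.random() < mix:
            return rng.integers(N)            # uniform draw
        p = np.asarray(L, dtype=float)
        return rng.choice(N, p=p / p.sum())   # biased toward large L_i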

SAG with Non-Uniform Sampling

[Figure: objective minus optimum vs. effective passes on protein (n = 145751, p = 74) and sido (n = 12678, p = 4932), comparing AGD, IAG, L-BFGS, SG-C, ASG-C, PCD-L, DCA, and SAG.]

These are the datasets where SAG had the worst relative performance.

[Figure: the same datasets, adding SAG-LS (Lipschitz), the variant with line-search and Lipschitz sampling.]

Lipschitz sampling helps a lot.


Newton-like Methods

Can we use a scaling matrix H?

  x = x - \frac{\alpha}{N} H^{-1} d.

This didn't help in my experiments with a diagonal H, and a non-diagonal H loses sparsity.

A quasi-Newton variant has been proposed with empirically faster convergence, but much more overhead. [Sohl-Dickstein et al., 2014]


Conclusion and Discussion

Faster theoretical convergence using only the 'sum' structure.

Simple algorithm, empirically better than the theory predicts.

A black-box stochastic gradient algorithm: adaptive to problem difficulty, with a line-search and a termination criterion.

Recent/current developments: memory-free variants, non-smooth variants, improved constants, relaxed strong-convexity, non-uniform sampling, quasi-Newton variants.

Future developments: non-convex analysis, parallel/distributed methods.

Thank you for the invitation.
