Conditional Gradient with Enhancement and Thresholding for Atomic Norm Constrained Minimization

Nikhil Rao, with Parikshit Shah and Stephen Wright

University of Wisconsin - Madison

(Large) data can often be modeled as being made up of a few “simple” components:

- wavelet coefficients
- unit rank matrices
- sparse overlapping sets
- paths/cliques
- ... and many more

Atoms and the atomic norm

We assume the variable $x \in \mathbb{R}^p$ can be represented as a combination of a small number of atoms $a$ drawn from an atomic set $\mathcal{A}$:

$$x = \sum_{i=1}^{k} c_i a_i, \qquad c_i \ge 0 \;\; \forall i.$$

The atomic norm is

$$\|x\|_{\mathcal{A}} = \inf \Big\{ \sum_{a \in \mathcal{A}} c_a \;:\; x = \sum_{a \in \mathcal{A}} c_a a,\; c_a \ge 0 \Big\}.$$

The atomic norm is the ‘L1’ analog for general structurally constrained signals:

- $a_i = \pm e_i \;\Rightarrow\; \|x\|_{\mathcal{A}} = \|x\|_1$
- $a_i = u v^T \;\Rightarrow\; \|x\|_{\mathcal{A}} = \|x\|_*$
- $a_i = u_G / \|u_G\|,\; G \in \mathcal{G} \;\Rightarrow\; \|x\|_{\mathcal{A}} = \sum_{G \in \mathcal{G}} \|x_G\|_2$

The atoms can be edges/cliques in a graph, individual Fourier/wavelet components, ...

Chandrasekaran, V., et al. "The convex geometry of linear inverse problems." Foundations of Computational Mathematics 12.6 (2012).
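The only access to $\mathcal{A}$ that the algorithms below need is a linear minimization oracle (LMO), $\arg\min_{a \in \mathcal{A}} \langle g, a \rangle$. A minimal Python/NumPy sketch of this oracle for two of the atomic sets above; the function names (and the use of a full SVD for the unit-rank case) are illustrative assumptions, not part of the original slides:

```python
import numpy as np

def lmo_l1(g):
    """LMO for the sparse atomic set {+e_i, -e_i}: return the signed
    coordinate vector that minimizes <g, a>."""
    i = int(np.argmax(np.abs(g)))
    a = np.zeros_like(g)
    a[i] = -np.sign(g[i])
    return a

def lmo_nuclear(G):
    """LMO for the unit-rank atomic set {u v^T : ||u||_2 = ||v||_2 = 1}:
    <G, u v^T> is minimized by -u1 v1^T built from the top singular pair."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)  # a power method suffices in practice
    return -np.outer(U[:, 0], Vt[0, :])
```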


We would like to have a method that lets us solve

$$\min_x \; f(x) = \tfrac{1}{2}\|y - \Phi x\|^2 \quad \text{subject to} \quad \|x\|_{\mathcal{A}} \le \tau,$$

where $\Phi$ is the measurement operator and $y$ the observed data.
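As a concrete instance of this problem, here is a minimal synthetic setup sketch matching the l1 experiment described later in the deck (p = 2048, m = 512 Gaussian measurements, AWGN 0.01); the sparsity level and all names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
p, m, k = 2048, 512, 100                      # signal length, measurements, nonzeros
x_true = np.zeros(p)
x_true[rng.choice(p, size=k, replace=False)] = rng.standard_normal(k)

Phi = rng.standard_normal((m, p)) / np.sqrt(m)    # Gaussian measurement matrix
y = Phi @ x_true + 0.01 * rng.standard_normal(m)  # AWGN, sigma = 0.01

def f(x):
    """Objective 0.5 * ||y - Phi x||^2."""
    r = y - Phi @ x
    return 0.5 * float(r @ r)

def grad_f(x):
    """Gradient -Phi^T (y - Phi x)."""
    return -Phi.T @ (y - Phi @ x)
```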

Motivation: Frank-Wolfe / Conditional Gradient

(Frank and Wolfe '56, Clarkson '08, Jaggi '13, Jaggi and Sulovský '10, Bach et al. '12, Harchaoui et al. '12, Dudik et al. '12, Tewari et al. '11, ...)

Solve a (relatively) simple optimization at each iteration:

$$a_t = \arg\min_{a \in \mathcal{A}} \langle \nabla f_t, a \rangle$$

Line search:

$$\hat{\gamma} = \arg\min_{\gamma \in [0,1]} f\big((1-\gamma)x + \gamma \tau a_t\big)$$

Update:

$$x \leftarrow (1-\hat{\gamma})x + \hat{\gamma}\,\tau a_t$$
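A minimal sketch of this loop for the least-squares setup above with the l1 atoms (so the scaled atom is a vertex of the tau-scaled l1 ball), reusing lmo_l1, grad_f, Phi, and y from the earlier sketches; the closed-form line search for the quadratic loss and the iteration count are illustrative choices:

```python
def frank_wolfe(tau, iters=150):
    """Vanilla Frank-Wolfe for  min 0.5*||y - Phi x||^2  s.t.  ||x||_1 <= tau
    (a sketch using the setup and lmo_l1 defined above)."""
    x = np.zeros(p)
    for _ in range(iters):
        a = tau * lmo_l1(grad_f(x))       # best vertex of the scaled atomic ball
        d = a - x                         # Frank-Wolfe direction
        Pd = Phi @ d
        # exact line search for the quadratic loss, clipped to [0, 1]
        gamma = float(np.clip((Pd @ (y - Phi @ x)) / (Pd @ Pd), 0.0, 1.0))
        x = x + gamma * d
    return x

x_cg = frank_wolfe(tau=np.abs(x_true).sum())   # e.g. tau set to ||x_true||_1
```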

[Plot: true signal vs. CG recovery; p = 2048, m = 512 Gaussian measurements, AWGN 0.01.]

The greedy update steps might choose suboptimal atoms to represent the solution, and/or lead to less parsimonious solutions, and/or miss some components.


Our goal is to develop a greedy scheme that retains the computational advantages of FW and also incorporates a "self-correcting" mechanism to purge suboptimal atoms.

CoGEnT: Conditional Gradient with Enhancement and Truncation

Solve

$$\min_x \; f(x) = \tfrac{1}{2}\|y - \Phi x\|^2 \quad \text{subject to} \quad \|x\|_{\mathcal{A}} \le \tau.$$

Define the residuals $r_t = y - \Phi x_{t-1}$ and $w_t = y - \Phi\,\tau a_t$.

FORWARD STEP (conditional gradient; very efficient in most cases):

$$a_t = \arg\min_{a \in \mathcal{A}} \langle \nabla f(x_{t-1}), a \rangle, \qquad \tilde{A}_t = [A_{t-1} \;\; a_t].$$

Line search parameter for the L2 loss, in closed form:

$$\gamma_t = \frac{\langle r_t,\, r_t - w_t \rangle}{\|r_t - w_t\|^2}, \qquad \tilde{c}_t = [(1-\gamma_t)c_{t-1};\; \gamma_t \tau].$$

ENHANCEMENT (optional): re-optimize the coefficients over the selected atoms,

$$\tilde{c}_t = \arg\min_{c \ge 0,\; \|c\|_1 \le \tau} \tfrac{1}{2}\|y - \Phi \tilde{A}_t c\|^2,$$

using an $O(k \log k)$ projection onto the simplex, with $c = [(1-\gamma_t)c_{t-1};\; \gamma_t \tau]$ as a warm start. The warm start makes enhancement efficient.

Set $\tilde{x}_t = \tilde{A}_t \tilde{c}_t$.
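A sketch of the forward step with the closed-form line search and the optional enhancement, continuing the earlier sketches. The project_l1_simplex helper is the standard O(k log k) sort-based projection onto the constraint set; solving the enhancement subproblem by a fixed number of warm-started projected-gradient passes is an illustrative choice, not the authors' implementation:

```python
def project_l1_simplex(c, tau):
    """Euclidean projection onto {c >= 0, sum(c) <= tau} via the
    classic O(k log k) sort-based algorithm."""
    c = np.maximum(c, 0.0)
    if c.sum() <= tau:
        return c
    u = np.sort(c)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(c) + 1) > (css - tau))[0][-1]
    theta = (css[rho] - tau) / (rho + 1.0)
    return np.maximum(c - theta, 0.0)

def cogent_forward(x_prev, atoms, coeffs, tau, enhance_steps=20):
    """CoGEnT forward step (sketch): greedy atom selection, closed-form line
    search for the quadratic loss, and optional coefficient enhancement."""
    a = lmo_l1(grad_f(x_prev))                # new atom (l1 atoms as an example)
    atoms = atoms + [a]
    r = y - Phi @ x_prev                      # residual at x_{t-1}
    w = y - Phi @ (tau * a)                   # residual at the vertex tau * a
    gamma = float(np.clip((r @ (r - w)) / ((r - w) @ (r - w)), 0.0, 1.0))
    c = np.append((1.0 - gamma) * coeffs, gamma * tau)

    # Enhancement: re-optimize c over the selected atoms by projected gradient,
    # warm-started at the line-search coefficients.
    A = np.column_stack(atoms)                # p x k matrix of current atoms
    PA = Phi @ A
    step = 1.0 / np.linalg.norm(PA, 2) ** 2   # 1 / Lipschitz constant of the reduced problem
    for _ in range(enhance_steps):
        c = project_l1_simplex(c - step * (PA.T @ (PA @ c - y)), tau)
    return A @ c, atoms, c

# one forward step starting from zero:
x1, atoms1, c1 = cogent_forward(np.zeros(p), [], np.array([]), tau=np.abs(x_true).sum())
```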

CoGEnT: Conditional Gradient with Enhancement and Truncation

BACKWARD STEP (Zhang '08, Jain et al. '11 for OMP):

$$a_{\text{bad}} = \arg\min_{a \in \tilde{A}_t} \Big\{ f(\tilde{x}_t) - c_a \langle \nabla f(\tilde{x}_t), a \rangle + \tfrac{1}{2} c_a^2 \|\Phi a\|^2 \Big\}$$

(the scalar $\|\Phi a\|^2$ is computed and stored when the atom is selected).

Truncation: set $\bar{A} = \tilde{A}_t \setminus a_{\text{bad}}$, find the corresponding $\bar{c}_t$, and let $\bar{x} = \bar{A}\bar{c}_t$.

If

$$f(\bar{x}) \le \eta f(x_{t-1}) + (1-\eta) f(\tilde{x}_t),$$

accept the truncation: $x_t = \bar{x}$, $A_t = \bar{A}$, $c_t = \bar{c}_t$; otherwise keep the forward iterate: $x_t = \tilde{x}_t$, $A_t = \tilde{A}_t$, $c_t = \tilde{c}_t$. Multiple atoms can be removed in a single iteration.

[Figure: the threshold $\eta f(x_{t-1}) + (1-\eta) f(\tilde{x}_t)$ lies between $f(\tilde{x}_t)$ and $f(x_{t-1})$; truncation is accepted when $f(\bar{x})$ falls below it.]

Zhang, T. "Adaptive forward-backward greedy algorithm for learning sparse representations." IEEE Trans. Info. Theory 57.7 (2011).
Jain, P., Tewari, A., and Dhillon, I. "Orthogonal matching pursuit with replacement." NIPS 2011.
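A matching sketch of the backward step for the quadratic loss, continuing the sketches above. Re-fitting the kept coefficients with the same projected-gradient enhancement is one reasonable reading of "find the corresponding c_bar", and eta = 0.5 is an arbitrary illustrative value:

```python
def cogent_backward(x_tilde, atoms, coeffs, x_prev, tau, eta=0.5):
    """CoGEnT backward (truncation) step for the quadratic loss (sketch)."""
    if len(atoms) <= 1:
        return x_tilde, atoms, coeffs
    g = grad_f(x_tilde)
    # quadratic estimate of the objective after deleting atom i:
    # f(x_tilde) - c_i <grad f(x_tilde), a_i> + 0.5 * c_i^2 * ||Phi a_i||^2
    scores = [f(x_tilde) - c * (g @ a) + 0.5 * c**2 * np.sum((Phi @ a) ** 2)
              for a, c in zip(atoms, coeffs)]
    i_bad = int(np.argmin(scores))

    kept = [a for j, a in enumerate(atoms) if j != i_bad]
    c_bar = np.delete(coeffs, i_bad)
    A_bar = np.column_stack(kept)
    # "find the corresponding c_bar": here, re-fit the kept coefficients with
    # the same warm-started projected-gradient enhancement as the forward step
    PA = Phi @ A_bar
    step = 1.0 / np.linalg.norm(PA, 2) ** 2
    for _ in range(20):
        c_bar = project_l1_simplex(c_bar - step * (PA.T @ (PA @ c_bar - y)), tau)
    x_bar = A_bar @ c_bar

    # accept only if the objective stays below the interpolated threshold
    if f(x_bar) <= eta * f(x_prev) + (1.0 - eta) * f(x_tilde):
        return x_bar, kept, c_bar
    return x_tilde, atoms, coeffs
```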


Revisiting the l1 example

p = 2048, m = 512 Gaussian measurements, AWGN 0.01.

[Plot: true signal vs. CG recovery.]
[Plot: true signal vs. CoGEnT recovery.]

A Comparison with Frank-Wolfe

Away steps (Guelat and Marcotte '86): at each iteration, choose an atom as follows:

$$a_{\text{fwd}} = \arg\min_{a \in \mathcal{A}} \langle \nabla f_t, a \rangle, \qquad d_t = a_{\text{fwd}} - x_t,$$
$$a_{\text{away}} = \arg\max_{a \in \mathcal{A}} \langle \nabla f_t, a \rangle, \qquad d_a = x_t - a_{\text{away}}.$$

If $\langle d_t, \nabla f_t \rangle > \langle d_a, \nabla f_t \rangle$, set $a_t = a_{\text{away}}$; else set $a_t = a_{\text{fwd}}$. This gives better solutions than vanilla FW, but does not always make solutions more sparse.

Truncation steps: at each iteration, choose an atom

$$a_{\text{fwd}} = \arg\min_{a \in \mathcal{A}} \langle \nabla f_t, a \rangle,$$

and choose an atom to remove based on the quadratic form

$$a_{\text{bad}} = \arg\min_{a \in \tilde{A}_t} \Big\{ -c_a \langle \nabla f(\tilde{x}_t), a \rangle + \tfrac{1}{2} c_a^2 \|\Phi a\|^2 \Big\}.$$

This explicitly deletes atoms and tries to represent the solution using the remaining atoms.

Guelat, J. and Marcotte, P. "Some comments on Wolfe's 'away step'." Mathematical Programming 35.1 (1986): 110-119.
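For contrast, a minimal sketch of the away-step selection rule for the l1 atoms, continuing the earlier sketches. Restricting the away search to the currently active atoms is an assumption (the usual away-step convention) rather than something stated on the slide:

```python
def away_or_forward_step(x, active_atoms, tau):
    """Away-step selection rule (sketch): compare the forward direction toward
    the best new vertex with the direction moving mass off the worst active atom."""
    g = grad_f(x)
    a_fwd = tau * lmo_l1(g)
    d_fwd = a_fwd - x
    if not active_atoms:
        return a_fwd, d_fwd
    i_away = int(np.argmax([g @ (tau * a) for a in active_atoms]))
    a_away = tau * active_atoms[i_away]
    d_away = x - a_away
    # per the slide: if <d_fwd, grad f> > <d_away, grad f>, take the away step
    if g @ d_fwd > g @ d_away:
        return a_away, d_away
    return a_fwd, d_fwd
```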

On a noisy compressed sensing instance (p = 3000, m = 700 measurements, true sparsity = 100):

- Away steps: estimated sparsity = 344, L2 error = 0.0011, L1 error = 0.0020.
- Truncation (CoGEnT): estimated sparsity = 154, L2 error = 0.0006, L1 error = 0.0009.

Convergence

Noise-free measurements: $y = \Phi x^\star$. Suppose $\exists\, x^\# : \|x^\#\|_{\mathcal{A}} < \tau$ and $y = \Phi x^\#$. Then CoGEnT converges at a linear rate:

$$f(x_T) \le f(x_0)\exp(-CT).$$

Noisy measurements: $y = \Phi x^\star + n$, with $x^\#$ the optimal solution. Then CoGEnT converges at a sublinear rate:

$$f(x_T) - f(x^\#) \le \frac{C}{T}.$$

The proofs closely follow classical proofs of convergence for CG methods (Beck and Teboulle '04, Tewari et al. '11).

Tewari, A., et al. "Greedy algorithms for structurally constrained high dimensional problems." NIPS 2011.
Beck, A. and Teboulle, M. "A conditional gradient method with linear rate of convergence for solving convex linear systems." Mathematical Methods of Operations Research 59.2 (2004): 235-247.

Results

l1 recovery: length 2048 signal, 512 Gaussian measurements, 95% sparse, AWGN 0.01.

[Plot: objective vs. number of iterations for CoGEnT and CG.]
[Plot: objective vs. time for CoGEnT, CoGEnT without enhancement ("CoGEnT -E"), and CG.]

Latent group lasso: proximal point methods require the replication of variables. Overlapping groups of size 50; run times for CoGEnT vs. SpaRSA:

# groups | true dimension | replicated dimension | CoGEnT  | SpaRSA
100      | 2030           | 5000                 | 14.9    | 22.2
1000     | 20030          | 50000                | 210.9   | 461.6
1200     | 24030          | 60000                | 358.64  | 778.2
1500     | 30030          | 75000                | 574.9   | 1376.6
2000     | 40030          | 100000               | 852.02  | 2977
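Since the latent group lasso atoms are the unit-norm vectors supported on a single group (the $u_G/\|u_G\|$ atoms listed earlier), the forward step's linear minimization reduces to picking the group with the largest gradient energy. A standalone sketch; the group layout shown (size 50, stride 20) is inferred from the "true dimension" column above and is otherwise an assumption:

```python
import numpy as np

def lmo_latent_group(g, groups):
    """LMO for latent group lasso atoms {u / ||u||_2 : supp(u) inside some group G}:
    <g, a> is minimized by a = -g_G / ||g_G||_2 for the group with largest ||g_G||_2."""
    norms = [np.linalg.norm(g[G]) for G in groups]
    G = groups[int(np.argmax(norms))]
    a = np.zeros_like(g)
    a[G] = -g[G] / np.linalg.norm(g[G])
    return a

# e.g. 100 groups of size 50 with stride 20, consistent with the first table row:
# true dimension 99*20 + 50 = 2030, replicated dimension 100*50 = 5000
groups = [np.arange(20 * j, 20 * j + 50) for j in range(100)]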

Matrix completion:

[Plot: true singular values vs. those recovered by Conditional Gradient and by CoGEnT.]

More results

CS off the grid (Tang et al. '12). Atoms: complex sinusoids. We employ an adaptive gridding procedure to refine our iterates, and the backward step allows for the removal of suboptimal atoms.

[Plots: true vs. recovered frequencies and amplitudes, for CoGEnT and for CG.]
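Here the forward step needs an (approximate) minimizer of $\langle \nabla f_t, a(f) \rangle$ over a continuum of sinusoid atoms $a(f)$. A standalone sketch of one way to do this with a coarse frequency grid that is locally refined; the grid sizes, number of refinement rounds, and phase handling are illustrative assumptions (the slides only state that an adaptive gridding procedure is used):

```python
import numpy as np

def sinusoid_atom(freq, n):
    """Unit-norm complex sinusoid atom with frequency in [0, 1)."""
    return np.exp(2j * np.pi * freq * np.arange(n)) / np.sqrt(n)

def lmo_sinusoids(g, n, coarse=512, refine_rounds=3, local=11):
    """Approximate LMO over complex sinusoid atoms: pick the best frequency on a
    coarse grid, then refine a local grid around it a few times."""
    freqs = np.linspace(0.0, 1.0, coarse, endpoint=False)
    width = 1.0 / coarse
    f_best = 0.0
    for _ in range(refine_rounds):
        scores = np.array([abs(np.vdot(sinusoid_atom(fr, n), g)) for fr in freqs])
        f_best = freqs[int(np.argmax(scores))]
        freqs = np.linspace(f_best - width, f_best + width, local)
        width /= local
    a = sinusoid_atom(f_best, n)
    z = np.vdot(g, a)                      # <g, a> (conjugate-linear in g)
    # rotate the atom's phase so that Re<g, a> is as negative as possible
    return -a * np.conj(z) / abs(z) if abs(z) > 0 else a
```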

Sparse Overlapping Sets lasso (Rao et al., NIPS '13). Atom selection: check whether a group is "active" and select an index from the "most correlated" group. The convergence results hold even for approximate atom selections.

[Plot: true signal vs. CoGEnT recovery.]

Conclusions

- CoGEnT allows us to solve very general high-dimensional inference problems.
- The backward step yields sparse(r) solutions when compared to standard CG.
- Identical convergence rates to CG.

Extensions

- Online settings.
- Demixing applications (preliminary results seem promising).

Thank You
