Conditional Gradient with Enhancement and Thresholding for Atomic Norm Constrained Minimization
Nikhil Rao, with Parikshit Shah and Stephen Wright
University of Wisconsin - Madison
(Large) data can be modeled as made up of a few "simple" components: wavelet coefficients, unit-rank matrices, sparse overlapping sets, paths/cliques in a graph, ... and many more.
Atoms and the atomic norm

We assume the variable $x \in \mathbb{R}^p$ can be represented as a combination of a small number of atoms $a \in \mathcal{A}$ (the atomic set):
$$x = \sum_{i=1}^{k} c_i a_i, \qquad c_i \ge 0 \;\; \forall i$$

The atomic norm is
$$\|x\|_{\mathcal{A}} = \inf\Big\{ \sum_{a \in \mathcal{A}} c_a \;:\; x = \sum_{a \in \mathcal{A}} c_a a,\; c_a \ge 0 \Big\}$$

The atomic norm is the 'L1' analog for general structurally constrained signals:
- $a_i = \pm e_i \Rightarrow \|x\|_{\mathcal{A}} = \|x\|_1$
- $a_i = u v^T \Rightarrow \|x\|_{\mathcal{A}} = \|x\|_*$
- $a_i = u_G / \|u_G\| \Rightarrow \|x\|_{\mathcal{A}} = \sum_{G \in \mathcal{G}} \|x_G\|_2$

The atoms can be edges/cliques in a graph, individual Fourier/wavelet components, ...

Chandrasekaran, V. et al. "The convex geometry of linear inverse problems." Foundations of Computational Mathematics 12.6 (2012)
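A quick numerical illustration of the three special cases above, computed from their closed forms (a hypothetical toy example; the vectors, matrix, and groups are made up for illustration):

```python
import numpy as np

x = np.array([3.0, 0.0, -2.0, 1.0])

# Atoms {+/- e_i}: the atomic norm is the l1 norm.
l1_atomic = np.sum(np.abs(x))                          # ||x||_1 = 6

# Atoms {u v^T : ||u|| = ||v|| = 1}: the atomic norm of a matrix is the
# nuclear norm, i.e. the sum of its singular values.
X = np.outer([1.0, 2.0], [0.5, -1.0]) + np.outer([0.0, 1.0], [1.0, 1.0])
nuclear_atomic = np.sum(np.linalg.svd(X, compute_uv=False))

# Atoms {u_G / ||u_G||}: the atomic norm is the (latent) group lasso penalty
# sum_G ||x_G||_2, shown here for non-overlapping groups.
groups = [[0, 1], [2, 3]]
group_atomic = sum(np.linalg.norm(x[g]) for g in groups)

print(l1_atomic, nuclear_atomic, group_atomic)
```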
We would like to have a method that lets us solve
$$\min_x \; f(x) = \tfrac{1}{2}\|y - \Phi x\|^2 \quad \text{subject to} \quad \|x\|_{\mathcal{A}} \le \tau$$
where $\Phi$ is the measurement (design) matrix.
Motivation: Frank-Wolfe / Conditional Gradient
(Frank and Wolfe '56, Clarkson '08, Jaggi '13, Jaggi and Sulovský '10, Bach et al. '12, Harchaoui et al. '12, Dudik et al. '12, Tewari et al. '11, ...)

Solve a (relatively) simple optimization at each iteration:
$$a_t = \arg\min_{a \in \mathcal{A}} \langle \nabla f_t, a \rangle$$
Line search:
$$\hat{\gamma} = \arg\min_{\gamma \in [0,1]} f\big((1-\gamma)x + \gamma \tau a_t\big)$$
Update:
$$x \leftarrow (1-\hat{\gamma})x + \hat{\gamma}\tau a_t$$
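A minimal sketch of this loop for the $\ell_1$ case ($\mathcal{A} = \{\pm e_i\}$), where the linear minimization step just picks the coordinate with the largest gradient magnitude; `Phi`, `y`, and `tau` are assumed names for the measurement matrix, observations, and constraint radius:

```python
import numpy as np

def frank_wolfe_l1(Phi, y, tau, iters=200):
    """Vanilla Frank-Wolfe for min 0.5*||y - Phi x||^2 s.t. ||x||_1 <= tau."""
    p = Phi.shape[1]
    x = np.zeros(p)
    for _ in range(iters):
        grad = Phi.T @ (Phi @ x - y)
        i = np.argmax(np.abs(grad))            # linear minimization oracle
        a = np.zeros(p)
        a[i] = -tau * np.sign(grad[i])         # extreme point tau * (+/- e_i)
        d = a - x                              # FW direction
        Pd = Phi @ d
        # exact line search for the quadratic loss, clipped to [0, 1]
        gamma = np.clip(-((Phi @ x - y) @ Pd) / (Pd @ Pd + 1e-12), 0.0, 1.0)
        x = x + gamma * d
    return x
```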
[Figure: $\ell_1$ recovery with $p = 2048$, $m = 512$ Gaussian measurements, AWGN 0.01; true signal vs. CG recovery.]

The greedy update steps might choose suboptimal atoms to represent the solution, and/or lead to less parsimonious solutions, and/or miss some components.
Our goal is to develop a greedy scheme that retains the computational advantages of FW, and also incorporates a "self-correcting" mechanism to purge suboptimal atoms.
CoGEnT: Conditional Gradient with Enhancement and Truncation

Solve $\min_x f(x) = \tfrac{1}{2}\|y - \Phi x\|^2$ subject to $\|x\|_{\mathcal{A}} \le \tau$.

FORWARD STEP (Conditional Gradient; very efficient in most cases):
$$a_t = \arg\min_{a \in \mathcal{A}} \langle \nabla f(x_{t-1}), a \rangle, \qquad \tilde{A}_t = [A_{t-1} \;\; a_t]$$
Line-search parameter for the $\ell_2$ loss, with $r_t = y - \Phi x_{t-1}$ and $w_t = y - \tau \Phi a_t$:
$$\gamma_t = \frac{\langle r_t,\, r_t - w_t \rangle}{\|r_t - w_t\|^2}, \qquad \tilde{c}_t = [(1-\gamma_t)c_{t-1} \;\; \gamma_t \tau]$$

ENHANCEMENT (optional): solve
$$\tilde{c}_t = \arg\min_{c \ge 0,\; \|c\|_1 \le \tau} \tfrac{1}{2}\|y - \Phi \tilde{A}_t c\|^2$$
with $c = [(1-\gamma_t)c_{t-1} \;\; \gamma_t \tau]$ as a warm start; this needs only an $O(k \log k)$ projection onto the simplex, and the warm start makes enhancement efficient.

Set $\tilde{x}_t = \tilde{A}_t \tilde{c}_t$.
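A hedged sketch of this forward step with exact line search and a projected-gradient enhancement, specialized to the $\ell_1$ atomic set; `Phi`, `y`, `tau`, `atoms`, and the simplex-projection helper are illustrative assumptions rather than the authors' code:

```python
import numpy as np

def project_onto_scaled_simplex(v, tau):
    """Euclidean projection onto {c : c >= 0, sum(c) <= tau}."""
    c = np.maximum(v, 0.0)
    if c.sum() <= tau:
        return c
    u = np.sort(c)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * (np.arange(len(u)) + 1) > css - tau)[0][-1]
    theta = (css[rho] - tau) / (rho + 1.0)
    return np.maximum(c - theta, 0.0)

def cogent_forward(Phi, y, tau, atoms, c, x, enhance=True, inner_iters=20):
    p = Phi.shape[1]
    grad = Phi.T @ (Phi @ x - y)
    i = np.argmax(np.abs(grad))
    a = np.zeros(p); a[i] = -np.sign(grad[i])        # a_t = argmin_a <grad, a>
    r = y - Phi @ x                                  # r_t
    w = y - tau * (Phi @ a)                          # w_t
    gamma = np.clip(r @ (r - w) / ((r - w) @ (r - w) + 1e-12), 0.0, 1.0)
    atoms = atoms + [a]
    c = np.append((1.0 - gamma) * c, gamma * tau)    # warm start [(1-g)c, g*tau]
    M = np.column_stack(atoms)
    if enhance:
        # enhancement: warm-started projected gradient on the constrained
        # least-squares problem min_{c>=0, ||c||_1<=tau} 0.5*||y - Phi M c||^2
        G = Phi @ M
        step = 1.0 / (np.linalg.norm(G, 2) ** 2 + 1e-12)
        for _ in range(inner_iters):
            c = project_onto_scaled_simplex(c - step * (G.T @ (G @ c - y)), tau)
    return atoms, c, M @ c                           # x_tilde = A_tilde c_tilde
```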
CoGEnT: Conditional Gradient with Enhancement and Truncation

BACKWARD STEP (Truncation):
$$a_{\mathrm{bad}} = \arg\min_{a \in \tilde{\mathcal{A}}} \Big\{ f(\tilde{x}_t) - c_a \langle \nabla f(\tilde{x}_t), a \rangle + \tfrac{1}{2} c_a^2 \|\Phi a\|^2 \Big\}$$
($\|\Phi a\|^2$ is a scalar that can be computed and stored when the atom is selected.)

Set $\bar{A} = \tilde{A} \setminus a_{\mathrm{bad}}$, find the corresponding $\bar{c}_t$, and let $\bar{x} = \bar{A}\bar{c}_t$.

Update (cf. Zhang '08, Jain et al. '11 for OMP):
- if $f(\bar{x}) \le \eta f(x_{t-1}) + (1-\eta) f(\tilde{x}_t)$: set $x_t = \bar{x}$, $A_t = \bar{A}$, $c_t = \bar{c}_t$
- otherwise: set $x_t = \tilde{x}_t$, $A_t = \tilde{A}_t$, $c_t = \tilde{c}_t$

The backward step can remove multiple atoms at a single iteration.

Zhang, T. "Adaptive forward-backward greedy algorithm for learning sparse representations." IEEE Trans. Info. Theory 57.7 (2011)
Jain, P., Tewari, A. and Dhillon, I. "Orthogonal matching pursuit with replacement." NIPS 2011
[Diagram: the acceptance threshold $\eta f(x_{t-1}) + (1-\eta) f(\tilde{x}_t)$ lies between $f(\tilde{x}_t)$ and $f(x_{t-1})$; the truncated iterate $\bar{x}$ is accepted whenever $f(\bar{x})$ falls below this threshold.]
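A hedged sketch of this truncation test for the least-squares loss, with the current atoms stored as columns of `M` and the refit done by unconstrained least squares as a stand-in for the constrained refit (names and helper choices are illustrative assumptions):

```python
import numpy as np

def cogent_backward(Phi, y, M, c, x_tilde, f_prev, eta=0.5):
    """Try to drop the least useful atom; keep the drop only if the test passes."""
    f = lambda z: 0.5 * np.sum((y - Phi @ z) ** 2)
    if M.shape[1] <= 1:
        return M, c, x_tilde                       # nothing sensible to remove
    grad = Phi.T @ (Phi @ x_tilde - y)
    # estimated change in f from removing atom a_j with weight c_j
    # (the constant f(x_tilde) term is dropped; it does not affect the argmin):
    #   -c_j * <grad, a_j> + 0.5 * c_j**2 * ||Phi a_j||**2
    scores = -c * (M.T @ grad) + 0.5 * c**2 * np.sum((Phi @ M) ** 2, axis=0)
    j_bad = np.argmin(scores)
    keep = np.ones(M.shape[1], dtype=bool)
    keep[j_bad] = False
    M_bar = M[:, keep]
    c_bar, *_ = np.linalg.lstsq(Phi @ M_bar, y, rcond=None)
    x_bar = M_bar @ c_bar
    # accept the truncated iterate only if it does not hurt the objective too much
    if f(x_bar) <= eta * f_prev + (1.0 - eta) * f(x_tilde):
        return M_bar, c_bar, x_bar
    return M, c, x_tilde
```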
Revisiting the $\ell_1$ example ($p = 2048$, $m = 512$ Gaussian measurements, AWGN 0.01):
[Figure: true signal vs. CG recovery (top) and true signal vs. CoGEnT recovery (bottom).]
A Comparison with Frank-Wolfe

Away steps (Guelat and Marcotte '86): At each iteration, choose an atom as follows:
$$a_{\mathrm{fwd}} = \arg\min_{a \in \mathcal{A}} \langle \nabla f_t, a \rangle, \qquad d_t = a_{\mathrm{fwd}} - x_t$$
$$a_{\mathrm{away}} = \arg\max_{a \in \mathcal{A}} \langle \nabla f_t, a \rangle, \qquad d_a = x_t - a_{\mathrm{away}}$$
If $\langle d_t, \nabla f_t \rangle > \langle d_a, \nabla f_t \rangle$, set $a_t = a_{\mathrm{away}}$; else $a_t = a_{\mathrm{fwd}}$. Away steps give better solutions than vanilla FW, but do not always make solutions more sparse.

Truncation steps: At each iteration, choose an atom $a_{\mathrm{fwd}} = \arg\min_{a \in \mathcal{A}} \langle \nabla f_t, a \rangle$, and choose an atom to remove based on the quadratic form
$$a_{\mathrm{bad}} = \arg\min_{a \in \tilde{\mathcal{A}}} \Big\{ -c_a \langle \nabla f(\tilde{x}_t), a \rangle + \tfrac{1}{2} c_a^2 \|\Phi a\|^2 \Big\}$$
This explicitly deletes atoms and tries to represent the solution using the remaining atoms.

Guelat, J. and Marcotte, P. "Some comments on Wolfe's 'away step'." Mathematical Programming 35.1 (1986): 110-119.
Example: $p = 3000$, $m = 700$ noisy CS measurements, true sparsity = 100.
- Away steps: estimated sparsity = 344, L2 error = 0.0011, L1 error = 0.0020
- Truncation: estimated sparsity = 154, L2 error = 0.0006, L1 error = 0.0009
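For contrast, a small sketch of the away-step atom choice written as in the rule above, over a finite list of atom vectors (`atoms`, `grad`, and `x` are illustrative assumptions); CoGEnT's truncation instead deletes $a_{\mathrm{bad}}$ outright:

```python
import numpy as np

def away_or_forward_atom(grad, x, atoms):
    """Pick the forward or the away atom, following the rule on the slide."""
    scores = np.array([grad @ a for a in atoms])
    a_fwd = atoms[np.argmin(scores)]        # usual FW atom
    a_away = atoms[np.argmax(scores)]       # atom most aligned with the gradient
    d_fwd = a_fwd - x
    d_away = x - a_away
    # take the away atom when the forward direction is the less promising one
    return a_away if d_fwd @ grad > d_away @ grad else a_fwd
```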
Convergence

Noise-free measurements, $y = \Phi x^\star$: suppose $\exists\, x^\# : \|x^\#\|_{\mathcal{A}} < \tau,\; y = \Phi x^\#$. Then CoGEnT converges at a linear rate:
$$f(x_T) \le f(x_0)\exp(-CT)$$

Noisy measurements, $y = \Phi x^\star + n$: let $x^\#$ be the optimal solution. Then CoGEnT converges at a sublinear rate:
$$f(x_T) - f(x^\#) \le \frac{C}{T}$$

The proofs closely follow classical proofs for convergence of CG methods (Beck and Teboulle '04, Tewari et al. '11).

Tewari, A. et al. "Greedy algorithms for structurally constrained high dimensional problems." NIPS 2011
Beck, A. and Teboulle, M. "A conditional gradient method with linear rate of convergence for solving convex linear systems." Mathematical Methods of Operations Research 59.2 (2004): 235-247.
Results

Length-2048 signal, 512 Gaussian measurements, 95% sparse, AWGN 0.01 ($\ell_1$ recovery):
[Figure: objective vs. # iterations (CoGEnT, CG) and objective vs. time (CoGEnT, CoGEnT-E, CG).]
Latent group lasso: proximal point methods require the replication of variables (overlapping groups of size 50).

# groups   true dimension   replicated dimension   CoGEnT    SpaRSA
100        2030             5000                   14.9      22.2
1000       20030            50000                  210.9     461.6
1200       24030            60000                  358.64    778.2
1500       30030            75000                  574.9     1376.6
2000       40030            100000                 852.02    2977
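A small illustration of where the "replicated dimension" column comes from, under one group structure consistent with the first row of the table (100 consecutive groups of size 50 with overlap 30; this particular layout is an assumption):

```python
# Proximal methods for the latent group lasso give each group its own copy of
# the coordinates it covers, so they work in dimension sum(|G|) instead of p.
groups = [list(range(20 * i, 20 * i + 50)) for i in range(100)]  # size-50 groups
true_dim = max(max(g) for g in groups) + 1       # 2030
replicated_dim = sum(len(g) for g in groups)     # 5000
print(true_dim, replicated_dim)
```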
Matrix completion:
[Figure: singular values of the recovered matrix; true singular values vs. Conditional Gradient vs. CoGEnT.]
More results

CS off the grid (Tang et al. '12). Atoms: complex sinusoids. We employ an adaptive gridding procedure to refine our iterates, and the backward step allows for the removal of suboptimal atoms.
[Figure: true vs. recovered amplitudes over frequency, for CoGEnT and for CG.]
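One generic way such an adaptive gridding step could look (a hedged sketch, not the authors' exact procedure): score complex-sinusoid atoms against the residual on a coarse frequency grid, then refine locally around the best frequency. The atom normalization and grid sizes are assumptions:

```python
import numpy as np

def select_frequency(residual, m, coarse=1024, refine_levels=3):
    """Pick a sinusoid frequency by coarse search followed by local refinement."""
    n = np.arange(m)
    freqs = np.linspace(0.0, 1.0, coarse, endpoint=False)
    best = freqs[0]
    for _ in range(refine_levels):
        atoms = np.exp(2j * np.pi * np.outer(n, freqs)) / np.sqrt(m)
        scores = np.abs(atoms.conj().T @ residual)   # correlation with residual
        best = freqs[np.argmax(scores)]
        width = freqs[1] - freqs[0]
        freqs = np.linspace(best - width, best + width, 65)  # finer local grid
    return best
```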
Sparse Overlapping Sets lasso (Rao et al., NIPS '13): check if a group is "active" and select an index from the "most correlated" group. The convergence results hold even for approximate atom selections.
[Figure: true signal vs. CoGEnT recovery.]
Conclusions
- CoGEnT allows us to solve very general high-dimensional inference problems
- The backward step yields sparse(r) solutions compared to standard CG
- Convergence rates are identical to CG

Extensions: online settings, demixing applications (preliminary results seem promising)
Thank You