Adaptive Consensus ADMM for Distributed Optimization
Zheng Xu, with Gavin Taylor, Hao Li, Mario Figueiredo, Xiaoming Yuan, and Tom Goldstein
Outline
• Consensus problem in distributed computing
• Alternating direction method of multipliers (ADMM) and the penalty parameter
• Adaptive consensus ADMM (ACADMM) with spectral stepsize: a fully automated optimizer
• The O(1/k) convergence rate of ADMM with adaptive penalty
• Numerical results on various applications and datasets
Statistical learning problem

$$\min_v \; f(v) + g(v)$$

• Example (elastic net regression):

$$\min_x \; \tfrac{1}{2}\|Dx - c\|_2^2 + \rho_1 \|x\|_1 + \tfrac{\rho_2}{2}\|x\|_2^2$$

[Figure: the linear model drawn as a matrix-vector product $D\,x \approx c$, with data matrix $D$, unknown coefficients $x$, and observations $c$.]
Problem decomposition and data parallelism

$$\min_v \; \sum_{i=1}^N f_i(v) + g(v)$$

• Example:

$$\min_x \; \sum_{i=1}^N \tfrac{1}{2}\|D_i x - c_i\|_2^2 + \rho_1 \|x\|_1 + \tfrac{\rho_2}{2}\|x\|_2^2$$

where the data are partitioned into row blocks $D = [D_1; \ldots; D_i; \ldots; D_N]$ and $c = [c_1; \ldots; c_i; \ldots; c_N]$.

[Figure: the rows of $D$ and $c$ split into $N$ blocks, one block per node.]
Consensus problem

$$\min_{u_i, v} \; \sum_{i=1}^N f_i(u_i) + g(v), \quad \text{subject to } u_i = v.$$

• Example: each local node $i$ holds $f_i(u_i) = \tfrac{1}{2}\|D_i u_i - c_i\|_2^2$; the central server holds $v$ and $g(v) = \rho_1 \|v\|_1 + \tfrac{\rho_2}{2}\|v\|_2^2$.

[Figure: $N$ local nodes, each with its data block and objective $f_i(u_i)$, communicating with a central server that holds $v$ and $g(v)$.]
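To make the node/server split concrete, here is a minimal Python/NumPy sketch of the example above; the sizes, the synthetic data, and all function names are illustrative assumptions rather than anything from the talk.

```python
import numpy as np

# Illustrative problem sizes and synthetic data (assumptions, not from the slides).
rng = np.random.default_rng(0)
N, d = 4, 20                          # number of nodes, feature dimension
D = rng.standard_normal((100, d))     # full design matrix
c = rng.standard_normal(100)          # full observation vector

# Row-block split across nodes: D = [D_1; ...; D_N], c = [c_1; ...; c_N].
D_blocks = np.array_split(D, N)
c_blocks = np.array_split(c, N)

def f_i(u, i):
    """Local objective on node i: (1/2) ||D_i u - c_i||^2."""
    r = D_blocks[i] @ u - c_blocks[i]
    return 0.5 * (r @ r)

def g(v, rho1=0.1, rho2=0.1):
    """Global elastic-net regularizer held at the central server."""
    return rho1 * np.abs(v).sum() + 0.5 * rho2 * (v @ v)
```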
Consensus ADMM

$$\min_{u_i, v} \; \sum_{i=1}^N f_i(u_i) + g(v), \quad \text{subject to } u_i = v.$$

$$u_i^{k+1} = \arg\min_{u_i} \; f_i(u_i) - \langle \lambda_i^k, u_i \rangle + \tfrac{\tau_i^k}{2}\|v^k - u_i\|^2$$

$$v^{k+1} = \arg\min_{v} \; g(v) + \sum_{i=1}^N \Big( \langle \lambda_i^k, v \rangle + \tfrac{\tau_i^k}{2}\|v - u_i^{k+1}\|^2 \Big)$$

$$\lambda_i^{k+1} = \lambda_i^k + \tau_i^k \,(v^{k+1} - u_i^{k+1})$$
Consensus ADMM and the penalty parameter

Same updates as above; the penalty $\tau_i^k$ in the quadratic terms $\tfrac{\tau_i^k}{2}\|\cdot\|^2$ is the only free parameter!
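Both subproblems of the elastic-net example have closed forms: a ridge-type linear solve on each node and a soft-thresholding step at the server. The following serial sketch (reusing the illustrative `D_blocks`/`c_blocks` shards from the earlier sketch) spells out one possible implementation; it keeps every $\tau_i$ fixed, which is exactly the gap the adaptive method fills.

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding: prox of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def consensus_admm(D_blocks, c_blocks, rho1=0.1, rho2=0.1, tau0=1.0, iters=100):
    N, d = len(D_blocks), D_blocks[0].shape[1]
    u = np.zeros((N, d))      # local variables u_i
    lam = np.zeros((N, d))    # dual variables lambda_i
    v = np.zeros(d)           # global (server) variable
    tau = np.full(N, tau0)    # per-node penalties; fixed here, adapted by ACADMM
    for k in range(iters):
        # Local updates: minimize f_i(u_i) - <lam_i, u_i> + (tau_i/2) ||v - u_i||^2,
        # i.e. solve (D_i^T D_i + tau_i I) u_i = D_i^T c_i + lam_i + tau_i v.
        for i in range(N):
            A = D_blocks[i].T @ D_blocks[i] + tau[i] * np.eye(d)
            u[i] = np.linalg.solve(A, D_blocks[i].T @ c_blocks[i] + lam[i] + tau[i] * v)
        # Server update: the elastic-net prox reduces to soft-thresholding.
        a = rho2 + tau.sum()
        v = soft((tau[:, None] * u - lam).sum(axis=0) / a, rho1 / a)
        # Dual ascent on the consensus constraint u_i = v.
        lam += tau[:, None] * (v - u)
    return v
```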
Background: gradient descent
• Objective: $\min_x F(x)$
• Gradient descent: $x^{k+1} = x^k - \tau^k \nabla F(x^k)$

[Figure: one gradient step of length $\tau^k$ from $x^k$.]
Background: quadratic case
• Objective: $\min_x F(x)$, gradient descent $x^{k+1} = x^k - \tau^k \nabla F(x^k)$
• If $F$ is quadratic, $F(x) = \tfrac{\alpha}{2}\|x - x^\star\|^2$
• Then the optimal stepsize is $\tau^k = 1/\alpha$: since $\nabla F(x^k) = \alpha (x^k - x^\star)$, a single step gives $x^{k+1} = x^k - \tfrac{1}{\alpha}\,\alpha (x^k - x^\star) = x^\star$.
Background: spectral stepsize
• Objective: $\min_x F(x)$, gradient descent $x^{k+1} = x^k - \tau^k \nabla F(x^k)$
• Spectral (Barzilai-Borwein) stepsize:
  - assume the function is locally quadratic with curvature $\alpha$;
  - estimate the curvature by solving a 1-d least squares fit $\nabla F(x) \approx \alpha x + a$;
  - run gradient descent with $\tau^k = 1/\alpha$.

J. Barzilai and J. Borwein. Two-point step size gradient methods. 1988.
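A minimal sketch of the BB stepsize in code, assuming the common BB1 variant; the names and constants are illustrative:

```python
import numpy as np

def bb_gradient_descent(grad, x0, iters=50, tau0=1e-3):
    """Gradient descent with the Barzilai-Borwein (spectral) stepsize.
    The BB step tau = <s, s> / <s, y>, with s = x^k - x^{k-1} and
    y = grad(x^k) - grad(x^{k-1}), estimates the inverse curvature
    1/alpha of a local quadratic model."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    x_new = x - tau0 * g              # bootstrap with a small fixed step
    for _ in range(iters):
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        sy = s @ y
        tau = (s @ s) / sy if sy > 0 else tau0   # fall back if curvature is unusable
        x, g = x_new, g_new
        x_new = x - tau * g
    return x_new

# On the quadratic F(x) = (alpha/2) ||x - x_star||^2 the first BB step
# recovers tau = 1/alpha exactly, so the next iterate lands on x_star.
alpha, x_star = 4.0, np.ones(5)
sol = bb_gradient_descent(lambda x: alpha * (x - x_star), np.zeros(5))
```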
Advantages of the spectral stepsize
• Automates stepsize selection
• Achieves fast (superlinear) convergence
• What about ADMM?

J. Barzilai and J. Borwein. Two-point step size gradient methods. 1988.
Y. Dai. A new analysis on the Barzilai-Borwein gradient method. 2013.
Spectral penalty of ADMM

Consensus problem:

$$\min_{u_i, v} \; \sum_{i=1}^N f_i(u_i) + g(v), \quad \text{subject to } u_i = v.$$

• Assume the function is locally quadratic
• Estimate the curvature(s)
• Use them to decide the penalty parameter
Dual interpretation
• Consensus problem:

$$\min_{u_i, v} \; \sum_{i=1}^N f_i(u_i) + g(v), \quad \text{subject to } u_i = v,$$

rewritten in matrix form as

$$\min_{u, v} \; f(u) + g(v), \quad \text{subject to } u + Bv = 0,$$

where $B = -(I_d; \ldots; I_d)$ and $u = (u_1; \ldots; u_N)$.

• Dual problem, via Fenchel conjugates:

$$\min_\lambda \; \underbrace{f^*(\lambda) - \langle \lambda, b \rangle}_{\hat f(\lambda)} + \underbrace{g^*(B^\top \lambda)}_{\hat g(\lambda)}$$

No constraints!
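For completeness, a short derivation sketch of that dual via standard Lagrangian duality, written for the general constraint $Au + Bv = b$ (here $A = I$ and $b = 0$):

```latex
\begin{align*}
\max_{\lambda}\, \min_{u,v}\; & f(u) + g(v) + \langle \lambda,\; b - Au - Bv \rangle \\
  &= \max_{\lambda}\; \langle \lambda, b\rangle
     \;\underbrace{-\,\max_{u}\big(\langle A^{\top}\lambda, u\rangle - f(u)\big)}_{-\,f^{*}(A^{\top}\lambda)}
     \;\underbrace{-\,\max_{v}\big(\langle B^{\top}\lambda, v\rangle - g(v)\big)}_{-\,g^{*}(B^{\top}\lambda)} \\
  &\;\Longleftrightarrow\;
     \min_{\lambda}\; \underbrace{f^{*}(\lambda) - \langle \lambda, b\rangle}_{\hat f(\lambda)}
     \;+\; \underbrace{g^{*}(B^{\top}\lambda)}_{\hat g(\lambda)}
     \qquad (A = I).
\end{align*}
```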
Dual problem and DRS

$$\min_\lambda \; \underbrace{f^*(\lambda) - \langle \lambda, b \rangle}_{\hat f(\lambda)} + \underbrace{g^*(B^\top \lambda)}_{\hat g(\lambda)}$$

ADMM on the primal variables $(u, v, \lambda)$ is equivalent to Douglas-Rachford splitting (DRS) on the dual pair $(\hat\lambda, \lambda)$, where $\hat\lambda_i^{k+1} = \lambda_i^k + \tau_i^k (v^k - u_i^{k+1})$.
Linear approximation
• Assume the gradients are linear:

$$\partial \hat f(\hat\lambda) = \alpha \cdot \hat\lambda + a, \qquad \partial \hat g(\lambda) = \beta \cdot \lambda + b$$

[Figure: the dual gradients drawn as scaled copies of $\hat\lambda$ and $\lambda$.]

Z. Xu, M. Figueiredo, and T. Goldstein. Adaptive ADMM with spectral penalty parameter selection. 2017.
Linear approximation
• The same linear relation, stacked over the $N$ nodes:

$$\partial \hat f(\hat\lambda) = \alpha \cdot \hat\lambda + a, \qquad \partial \hat g(\lambda) = \beta \cdot \lambda + b, \qquad \hat\lambda = [\hat\lambda_i]_{i \le N}, \;\; \lambda = [\lambda_i]_{i \le N}$$

[Figure: the stacked dual variables and their gradients, one block per node.]

Z. Xu, M. Figueiredo, and T. Goldstein. Adaptive ADMM with spectral penalty parameter selection. 2017.
Linear approximation
• The gradients are linear, with a node-specific slope for each block:

$$\partial \hat f(\hat\lambda_i) = \alpha_i \cdot \hat\lambda_i + a_i, \qquad \partial \hat g(\lambda_i) = \beta_i \cdot \lambda_i + b_i, \qquad i = 1, \ldots, N$$

• This motivates a node-specific penalty parameter.

[Figure: each block $i$ of the stacked dual variables has its own slope, $\alpha_i$ for $\hat f$ and $\beta_i$ for $\hat g$.]

Z. Xu, M. Figueiredo, and T. Goldstein. Adaptive ADMM with spectral penalty parameter selection. 2017.
C. Song, S. Yoon, and V. Pavlovic. Fast ADMM algorithm for distributed optimization with adaptive penalty. 2016.
Linear approximation
• In matrix form, the gradients are linear:

$$\partial \hat f(\hat\lambda) = M_\alpha \hat\lambda + a \quad \text{and} \quad \partial \hat g(\lambda) = M_\beta \lambda + b,$$

where $M_\alpha$, $M_\beta$ are diagonal matrices.
• Node-specific spectral stepsizes: $T_\alpha = M_\alpha^{-1}$ for $\hat f(\hat\lambda)$ and $T_\beta = M_\beta^{-1}$ for $\hat g(\lambda)$.
Node-specific spectral penalty
• Schema:

$$\tau_i^k = 1/\sqrt{\alpha_i \beta_i}, \qquad \forall i = 1, \ldots, N$$

• Estimation and safeguarding, from the ADMM variables $(u, v, \lambda)$:
  - fit the curvatures from finite differences, $\hat\lambda_i^k - \hat\lambda_i^{k_0} = \alpha_i^k \cdot (u_i^k - u_i^{k_0})$ and $\lambda_i^k - \lambda_i^{k_0} = \beta_i^k \cdot (v^k - v^{k_0})$;
  - safeguard with correlation tests $\alpha^k_{\mathrm{cor},i}$ and $\beta^k_{\mathrm{cor},i}$: the penalty is adapted only when the local linear model fits well.
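A minimal sketch of the estimate-and-safeguard step for one node, following the schema above; the 1-d least-squares fit and the correlation test mirror the slide, while the threshold value and the fallback behavior are simplified assumptions (the paper's hybrid rules are more refined):

```python
import numpy as np

EPS_COR = 0.2  # correlation threshold for the safeguard (assumed value)

def curvature_and_corr(dx, dlam):
    """Fit dlam ~ alpha * dx by 1-d least squares, and return the
    correlation that measures how linear the relation really is."""
    alpha = (dx @ dlam) / max(dx @ dx, 1e-12)
    corr = (dx @ dlam) / max(np.linalg.norm(dx) * np.linalg.norm(dlam), 1e-12)
    return alpha, corr

def update_penalty(tau_i, du_i, dlamhat_i, dv, dlam_i):
    """Node-specific spectral penalty tau_i = 1 / sqrt(alpha_i * beta_i),
    adapted only when both correlation safeguards pass; otherwise the
    previous penalty is kept."""
    alpha_i, a_cor = curvature_and_corr(du_i, dlamhat_i)  # curvature of f_hat
    beta_i, b_cor = curvature_and_corr(dv, dlam_i)        # curvature of g_hat
    if a_cor > EPS_COR and b_cor > EPS_COR and alpha_i > 0 and beta_i > 0:
        return 1.0 / np.sqrt(alpha_i * beta_i)
    return tau_i
```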
O(1/k) convergence with adaptivity
• Bounded adaptivity:

$$\sum_{k=1}^{\infty} (\eta^k)^2 < \infty, \quad \text{where } (\eta^k)^2 = \max_{i \in \{1, \ldots, N\}} (\eta_i^k)^2, \qquad (\eta_i^k)^2 = \max\{\tau_i^k / \tau_i^{k-1} - 1, \;\; \tau_i^{k-1} / \tau_i^k - 1\}.$$

• Under this condition, the norm of the residuals converges to zero.
• Worst-case ergodic O(1/k) convergence rate in the variational inequality sense.
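One simple way an implementation could enforce the summable bound (an illustrative safeguard, not necessarily the rule used in the paper) is to clip each penalty update so its relative change decays like $1/k^2$:

```python
import numpy as np

def clip_penalty(tau_new, tau_old, k, C=100.0):
    """Clip the adapted penalty so that the squared adaptivity
    (eta^k)^2 = max(tau_new/tau_old - 1, tau_old/tau_new - 1)
    is at most C / k**2; since sum_k C/k^2 is finite, the bounded
    adaptivity condition holds by construction. C and the 1/k^2
    schedule are illustrative choices, not taken from the slides."""
    s = C / k**2
    return float(np.clip(tau_new, tau_old / (1.0 + s), tau_old * (1.0 + s)))
```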
Experiments
• ADMM methods:
  - Consensus ADMM (CADMM) [Boyd et al. 2011]
  - Residual balancing (RB-ADMM) [He et al. 2000]
  - Consensus residual balancing (CRB-ADMM) [Song et al. 2016]
  - Adaptive ADMM (AADMM) [Xu et al. 2017]
• Applications:
  - Linear regression with an elastic net regularizer
  - Sparse logistic regression
  - Support vector machine
  - Semidefinite programming
Residual plot
• Application: sparse logistic regression
• Dataset: News20, size 19,996 × 1,355,191
• Distributed over 128 cores

[Figure: left, relative residual (log scale, $10^{-5}$ to $10^{1}$) vs. iterations (0-350) for CADMM, AADMM, and ACADMM; right, the per-node penalty $\tau$ vs. iterations for ACADMM.]
More numerical results
• More results in the paper!

Iterations (and runtime in seconds); 128 cores are used; absence of convergence after n iterations is indicated as n+.

| Application   | Dataset   | CADMM        | RB-ADMM      | AADMM        | CRB-ADMM     | ACADMM      |
|---------------|-----------|--------------|--------------|--------------|--------------|-------------|
| EN regression | MNIST     | 100+(1.49e4) | 88(1.29e3)   | 40(5.99e3)   | 87(1.27e4)   | 14(2.18e3)  |
| EN regression | News20    | 100+(4.61e3) | 100+(4.60e3) | 100+(5.17e3) | 100+(4.60e3) | 78(3.54e3)  |
| Sparse logreg | MNIST     | 325(444)     | 212(387)     | 325(516)     | 203(286)     | 149(218)    |
| Sparse logreg | News20    | 316(4.96e3)  | 211(3.84e3)  | 316(6.36e3)  | 207(3.73e3)  | 137(2.71e3) |
| SVM           | MNIST     | 1000+(930)   | 172(287)     | 73(127)      | 285(340)     | 41(88.0)    |
| SVM           | News20    | 259(2.63e3)  | 262(2.74e3)  | 259(3.83e3)  | 267(2.78e3)  | 217(2.37e3) |
| SDP           | Ham-9-5-6 | 100+(2.01e3) | 100+(2.14e3) | 35(860)      | 100+(2.14e3) | 30(703)     |
Robust to initial penalty selection
• More sensitivity analysis in the paper!

[Figure: ENRegression-Synthetic2; iterations ($10^1$ to $10^3$) vs. initial penalty parameter ($10^{-2}$ to $10^{4}$) for CADMM, RB-ADMM, AADMM, CRB-ADMM, and ACADMM; ACADMM is the least sensitive to the initialization.]
Acceleration by distribution

[Figure: SVM-Synthetic2; left, iterations vs. number of cores; right, runtime in seconds vs. number of cores, for CADMM, RB-ADMM, AADMM, CRB-ADMM, and ACADMM (log-log scale).]
Summary
• A fully automated optimizer for the consensus problem in distributed computing
• Node-specific spectral penalty for ADMM
• O(1/k) convergence rate of ADMM with adaptive penalty
• Numerical results on various applications and datasets
Thank you! Poster #28 tonight.
Gavin Taylor, Hao Li, Mario Figueiredo, Xiaoming Yuan, Tom Goldstein