GADGET SVM: a Gossip-bAseD sub-GradiEnT SVM Solver
Chase Hensel, Columbia University
Haimonti Dutta, Center for Computational Learning Systems, Columbia University
June 18, 2009
Road map
1. Preliminaries
   ◮ SVM
   ◮ The Distributed Setting
   ◮ Current Solutions to Distributed SVM
   ◮ Using Gossip
2. The Gadget Algorithm
3. Experimental Results
4. Future Work and Challenges
What is SVM?
The SVM algorithm finds the maximum-margin hyper-plane which separates two classes of data. E.g. two classes: Overweight and Not-overweight.
[Figure: scatter plot of two classes of example points, Height (vertical axis) vs. Weight (horizontal axis).]
The variant we use achieves this by solving:
$$\min_{w} \; \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{j=1}^{n} \max\{0,\, 1 - y_j \langle x_j, w\rangle\}$$
Note that there is no bias term, and we do not use Mercer kernels.
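For concreteness, here is a minimal NumPy sketch of this objective (the helper name svm_objective and the arrays X, y, w are illustrative, not from the slides):

```python
import numpy as np

def svm_objective(w, X, y, lam):
    """Regularized hinge-loss objective: lam/2 * ||w||^2 + (1/n) * sum_j max{0, 1 - y_j <x_j, w>}.

    w   : (d,) weight vector
    X   : (n, d) matrix of examples, one row per example
    y   : (n,) labels in {-1, +1}
    lam : regularization parameter lambda
    """
    margins = y * (X @ w)                      # y_j <x_j, w> for every example
    hinge = np.maximum(0.0, 1.0 - margins)     # hinge loss per example
    return 0.5 * lam * np.dot(w, w) + hinge.mean()
```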
The Distributed Setting
In the real world, data can be spread across a network!
[Figure: a small network of five nodes a, b, c, d, e connected by edges.]
Now we try to solve the Distributed SVM (DSVM) problem:
$$\min_{w} \; \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{m}\sum_{j=1}^{n_i} \max\{0,\, 1 - y_j \langle w, x_j\rangle\}$$
Our data is spread across m nodes, each with a different set of examples.
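A small sketch (again with illustrative names) showing that the DSVM objective is the same quantity with the hinge losses summed over the nodes' local example sets; node_data is a hypothetical list of per-node (X_i, y_i) pairs:

```python
import numpy as np

def dsvm_objective(w, node_data, lam):
    """DSVM objective: lam/2 * ||w||^2 + (1/n) * sum over nodes i and local examples j of the hinge loss."""
    n = sum(len(y_i) for _, y_i in node_data)          # total number of examples across all nodes
    hinge_sum = 0.0
    for X_i, y_i in node_data:                         # local hinge losses at node i
        hinge_sum += np.maximum(0.0, 1.0 - y_i * (X_i @ w)).sum()
    return 0.5 * lam * np.dot(w, w) + hinge_sum / n
```

Splitting the data this way changes nothing about the objective; the difficulty is that no single node can evaluate it without communication.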
Why is the Distributed Setting Interesting?
The setting is common:
◮ Federated Databases
◮ Peer-to-Peer Networks: Gnutella, BitTorrent
◮ Wireless Networks: Cell Phones, 802.11
◮ Sensor Networks: Swarm Robotics, Power Grid
◮ Other: POP3 email

There are hard challenges:
◮ How can we minimize communication?
◮ How can we perform well on many different topologies?
◮ How can we deal with dynamic networks?
Current solutions to DSVM
◮ Compute local solutions, send support vectors to a master node, repeat [SLS99, CCH05].
   ◮ In the worst case this centralizes the data.
◮ Design an algorithm for a specific topology [GCB+05, LRV08].
   ◮ What about other networks?

There are many open areas for improvement!
Gossip
We can model information flow like the spread of gossip in a social network.
[Figure: the five-node example network a, b, c, d, e.]
a talks to b, c, and e; b talks to a, c, and e; c talks to a, b, and d . . .
1. Each node starts with a piece of information.
2. Each node gossips to its neighbors about its current information.
3. Each node updates its information based on the data obtained from its neighbors.
4. The nodes repeat this process until convergence.
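For intuition only, a minimal sketch (not the paper's protocol) of synchronous gossip on the five-node example network, where each node repeatedly averages its value with its neighbors' values:

```python
import numpy as np

# Example topology from the slide: a-b, a-c, a-e, b-c, b-e, c-d, d-e.
neighbors = {
    "a": ["b", "c", "e"],
    "b": ["a", "c", "e"],
    "c": ["a", "b", "d"],
    "d": ["c", "e"],
    "e": ["a", "b", "d"],
}

def gossip_round(values):
    """One synchronous gossip round: every node averages its value with its neighbors' values."""
    return {
        node: np.mean([values[node]] + [values[nbr] for nbr in neighbors[node]])
        for node in values
    }

values = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0, "e": 5.0}
for _ in range(50):
    values = gossip_round(values)
print(values)  # all nodes converge to a common consensus value
```

Note that this naive scheme reaches a consensus but not necessarily the exact network average; computing exact sums and averages without a doubly stochastic update is what the Push Sum protocol, described later, provides.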
Road map
1. Preliminaries
2. The Gadget Algorithm
   ◮ Another look at the DSVM equation
   ◮ Sub-Gradient methods
   ◮ The Gadget Algorithm
   ◮ Gossipy Sub-Gradients?
   ◮ Push Sum
   ◮ A bit on convergence
   ◮ Running Time
3. Experimental Results
4. Future Work and Challenges
Another look at the DSVM equation
Recall the DSVM equation,
$$\min_{w} \; \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{m}\sum_{j=1}^{n_i} \max\{0,\, 1 - y_j \langle w, x_j\rangle\}$$
It is λ-strongly-convex. Loosely speaking, this is a quantitative strengthening of convexity (the Hessian, where it exists, is at least λI), and it makes DSVM nicely behaved.
Definition. A function $f : \mathbb{R}^d \to \mathbb{R}$ is $\lambda$-strongly-convex if for all $x, y \in \mathbb{R}^d$ and all $g \in \partial f(x)$,
$$f(y) \ge f(x) + \langle g,\, y - x\rangle + \frac{\lambda}{2}\|x - y\|_2^2.$$
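As a quick sanity check (not from the slides), the regularizer alone already meets this definition: for $f(w) = \frac{\lambda}{2}\|w\|_2^2$ the gradient at $x$ is $\lambda x$, and
$$f(y) - f(x) - \langle \lambda x,\, y - x\rangle = \frac{\lambda}{2}\|y\|_2^2 - \frac{\lambda}{2}\|x\|_2^2 - \lambda\langle x, y\rangle + \lambda\|x\|_2^2 = \frac{\lambda}{2}\|y - x\|_2^2,$$
so the inequality holds with equality. Adding the convex hinge-loss term can only increase the left-hand side, which is why the full DSVM objective is $\lambda$-strongly-convex.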
Sub-Gradient methods
Solving DSVM as a black-box optimization problem is a bad idea:
1. It ignores strong convexity.
2. It ignores the DSVM equation's simplicity.
Instead, we will use a gradient-descent variant designed for simple strongly convex optimization problems.
Gadget($\lambda$, $S_i$, $T$, $B$)

  $\hat{w}_i^{(1)} = \bar{w}_i = 0$
  for $t = 1, \ldots, T$ do
    Set $S_i^+ = \{(x, y) \in S_i : y \langle \hat{w}_i^{(t)}, x\rangle < 1\}$
    Set $L_i^{(t)} = \sum_{(x,y) \in S_i^+} y\, x$
    Set $\hat{L}_{g,i}^{(t)} \leftarrow$ Push Sum($B$, $L_i^{(t)}$, $n_i$)
    Set $\alpha^{(t)} = \frac{1}{\lambda t}$
    Set $w_i^{(t+\frac{1}{2})} = (1 - \lambda\alpha^{(t)})\,\hat{w}_i^{(t)} + \alpha^{(t)} \hat{L}_{g,i}^{(t)}$
    Set $\hat{w}_i^{(t+\frac{1}{2})} \leftarrow$ Push Sum($B$, $w_i^{(t+\frac{1}{2})}$, $1$)
    Set $\hat{w}_i^{(t+1)} = \min\left\{1,\; \frac{1/\sqrt{\lambda}}{\|\hat{w}_i^{(t+\frac{1}{2})}\|}\right\} \hat{w}_i^{(t+\frac{1}{2})}$
    $\bar{w}_i = \bar{w}_i + \hat{w}_i^{(t+1)}$
    ($\bar{w}_i / t$ is the current estimate of the maximum-margin hyper-plane)
  end for
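A minimal single-node Python sketch of one Gadget round, under the assumption of a push_sum(value, weight) callback that returns the gossip estimate of (global sum of values) / (global sum of weights); the names X_i, y_i, and the callback are illustrative (a network-wide simulation of Push Sum is sketched after the Push Sum slide):

```python
import numpy as np

def gadget_round(w_hat, w_bar, X_i, y_i, t, lam, push_sum):
    """One round of Gadget at node i, following the pseudocode above.

    w_hat : current local estimate hat{w}_i^(t), shape (d,)
    w_bar : running sum of iterates bar{w}_i, shape (d,)
    X_i   : (n_i, d) local examples; y_i : (n_i,) labels in {-1, +1}
    """
    # S_i^+ : local examples that violate the margin under the current estimate.
    violated = y_i * (X_i @ w_hat) < 1.0
    L_i = (y_i[violated, None] * X_i[violated]).sum(axis=0)   # local loss sub-gradient sum

    L_hat = push_sum(L_i, len(y_i))      # 1st Push Sum: per-example network average of sub-gradients
    alpha = 1.0 / (lam * t)              # step size alpha^(t)

    w_half = (1.0 - lam * alpha) * w_hat + alpha * L_hat
    w_hat_half = push_sum(w_half, 1.0)   # 2nd Push Sum: keep the nodes' weight vectors close

    # Project onto the ball of radius 1/sqrt(lambda).
    scale = min(1.0, (1.0 / np.sqrt(lam)) / np.linalg.norm(w_hat_half))
    w_hat_next = scale * w_hat_half

    w_bar = w_bar + w_hat_next           # bar{w}_i / t is the current hyper-plane estimate
    return w_hat_next, w_bar
```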
Gossipy Sub-Gradients?
What about the network? The Push Sum protocol uses gossip to deterministically simulate a random walk on the network.
[Figure: the five-node example network a, b, c, d, e.]
$$B \;=\; \begin{pmatrix} 0 & \tfrac{1}{3} & \tfrac{1}{3} & 0 & \tfrac{1}{3} \\ \tfrac{1}{3} & 0 & \tfrac{1}{3} & 0 & \tfrac{1}{3} \\ \tfrac{1}{3} & \tfrac{1}{3} & 0 & \tfrac{1}{3} & 0 \\ 0 & 0 & \tfrac{1}{2} & 0 & \tfrac{1}{2} \\ \tfrac{1}{3} & \tfrac{1}{3} & 0 & \tfrac{1}{3} & 0 \end{pmatrix} \qquad \text{(rows and columns ordered } a, b, c, d, e\text{)}$$
B is a stochastic matrix representing the nodes' communication pattern, and the probability transition matrix of the random walk. B's mixing time is proportional to the number of rounds of communication.
Push Sum Protocol
Push-Sum($B$, $V$, $w$)
  Each node $i$ starts with weight $w_i$ equal to the input weight $w$.
  All nodes set $S_i$ equal to the local input vector $V$.
  loop
    Set $S_i = \sum_{j \in N(i)} b_{j,i} S_j$
    Set $w_i = \sum_{j \in N(i)} b_{j,i} w_j$
  end loop
  Return $S_i / w_i$
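A minimal simulation sketch of Push Sum on the example matrix B above, using synchronous rounds and a fixed round count instead of a real stopping rule (all names are illustrative):

```python
import numpy as np

# Row-stochastic communication matrix B from the earlier slide (node order a, b, c, d, e).
B = np.array([
    [0,   1/3, 1/3, 0,   1/3],
    [1/3, 0,   1/3, 0,   1/3],
    [1/3, 1/3, 0,   1/3, 0  ],
    [0,   0,   1/2, 0,   1/2],
    [1/3, 1/3, 0,   1/3, 0  ],
])

def push_sum(B, values, weights, rounds=100):
    """Each node i keeps a sum S_i and a weight w_i; both are mixed through B every round,
    and S_i / w_i at every node converges to (sum of all values) / (sum of all weights)."""
    S = np.array(values, dtype=float)
    w = np.array(weights, dtype=float)
    for _ in range(rounds):
        S = B.T @ S    # S_i <- sum_j b_{j,i} S_j
        w = B.T @ w    # w_i <- sum_j b_{j,i} w_j
    return S / w

# With weight 1 at every node, each node recovers the network average (1+2+3+4+5)/5 = 3.
print(push_sum(B, values=[1, 2, 3, 4, 5], weights=[1, 1, 1, 1, 1]))
```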
Two calls to Push Sum per round?
Gadget makes two Push Sum calls in every round of the loop above:
◮ The first computes a global sub-gradient.
◮ The second ensures that the nodes' local weight vectors are close (within a fixed relative error).
A bit on convergence
Theorem. If $f(w)$ is the function from the DSVM equation and $w^\star$ is its optimal solution, then for all $T \ge 1$,
$$\sum_{t=1}^{T} \left( f(\hat{w}_i^{(t)}) - f(w^\star) \right) \le \frac{c^2}{2\lambda}\left(\log(T) + 1\right) + c \sum_{t=1}^{T} \left\| \hat{w}_i^{(t)} - \tilde{w}^{(t)} \right\|_2 + 2\sqrt{\lambda} \sum_{t=1}^{T} \left\| \tilde{L}_g(\tilde{w}^{(t)}) - \tilde{L}_g^{(t)} \right\|_2.$$
Here $c$ is the maximum norm of any sub-gradient, $\tilde{w}^{(t)}$ is the network-average weight vector at time $t$, and $\tilde{L}_g^{(t)}$ is the network-average sub-gradient of the loss at time $t$.
A bit on convergence, continued . . .
The same bound, with the two Push Sum error terms identified:
$$\sum_{t=1}^{T} \left( f(\hat{w}_i^{(t)}) - f(w^\star) \right) \le \frac{c^2}{2\lambda}\left(\log(T) + 1\right) + \underbrace{c \sum_{t=1}^{T} \left\| \hat{w}_i^{(t)} - \tilde{w}^{(t)} \right\|_2}_{\text{error due to the second call to Push Sum}} + \underbrace{2\sqrt{\lambda} \sum_{t=1}^{T} \left\| \tilde{L}_g(\tilde{w}^{(t)}) - \tilde{L}_g^{(t)} \right\|_2}_{\text{error due to the first call to Push Sum}}.$$
Running Time
Gadget obtains an $\epsilon$-accurate solution in $\tilde{O}\!\left(\frac{d\,(n_i' + \mathrm{deg}'\,\tau_{\mathrm{mix}})}{\lambda\epsilon}\right)$ time, requiring each node to send $\tilde{O}\!\left(\frac{d\,\mathrm{deg}'\,\tau_{\mathrm{mix}}}{\lambda\epsilon}\right)$ messages, where $n_i'$ is the maximum number of examples at any node, $d$ is the total number of non-zero features, $\tau_{\mathrm{mix}}$ denotes the mixing time of a random walk on the network $G$, and $\mathrm{deg}'$ is the maximum degree of any node.
◮ Running time is linearly proportional to the maximum number of examples at any node.
◮ Message complexity does not depend on the number of examples.
Road map
1. Preliminaries
2. The Gadget Algorithm
3. Experimental Results
4. Future Work and Challenges
Experimental Results
Using the Enron email spam data set described in [MAP06], we compared Gadget to a centralized projected sub-gradient algorithm, Pegasos [SSSS07], and to another DSVM variant due to Caragea et al. [CCH05]. Both distributed algorithms were run on a simulated network using the wheel topology.
[Figure: a small wheel-topology network with nodes a, b, c, d, e.]
The wheel's mixing time is essentially a constant.
Experimental Results
[Figure: CPU time (seconds) vs. number of nodes (20 to 200) for Caragea, Gadget, and Pegasos; d=12, epsilon=0.1, lambda=0.1, n_i=1000.]
CPU time for various network sizes.
Experimental Results
[Figure: number of messages (×10^5) vs. number of nodes (20 to 200) for Caragea and Gadget; d=12, epsilon=0.1, lambda=0.1, n_i=1000.]
Message complexity for various network sizes.
Experimental Results
[Figure: CPU time (seconds) vs. epsilon (0.02 to 0.07) for Caragea, Gadget, and Pegasos; d=12, m=25, lambda=0.1, n_i=1000.]
CPU time as a function of epsilon for a 25-node wheel with fixed Push Sum accuracy.
Experimental Results
[Figure: number of messages (×10^5) vs. epsilon (0.02 to 0.07) for Caragea and Gadget; d=12, m=25, lambda=0.1, n_i=1000.]
Number of messages as a function of epsilon for a 25-node wheel with fixed Push Sum accuracy.
Experimental Results
[Figure: CPU time (seconds) vs. n_i (0 to 15000) for Caragea and Gadget; d=12, m=20, lambda=0.1, epsilon=0.04.]
CPU time as a function of n_i for a 20-node wheel with fixed epsilon.
Experimental Results
[Figure: number of messages (×10^4) vs. n_i (0 to 15000) for Caragea and Gadget; d=12, m=20, lambda=0.1, epsilon=0.04.]
Number of messages as a function of n_i for a 20-node wheel with fixed epsilon.
Road map
1. Preliminaries
2. The Gadget Algorithm
3. Experimental Results
4. Future Work and Challenges
   ◮ What's Next?
   ◮ A Distributed Headache?
What's Next?
◮ Dynamic networks
◮ Stateless SVM?
◮ Incorporating bias
◮ Better gossip protocols
◮ Different sub-gradient steps
A Distributed Headache?
Mercer kernels require evaluating the kernel between every pair of training examples, as in
$$\sum_{i,j} K(x_i, x_j)\, c_i c_j,$$
so in the distributed setting this centralizes the data. For a truly practical DSVM variant to exist, this problem needs to be addressed.
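To make the difficulty concrete, a small illustrative sketch (not from the slides) of the kernel term: evaluating it needs every x_i paired with every x_j, which a single node can only do if it has access to all of the data:

```python
import numpy as np

def kernel_term(X, c, kernel):
    """Computes sum_{i,j} K(x_i, x_j) c_i c_j.

    X : (N, d) matrix of ALL examples -- the problem: in the distributed
        setting no single node holds all of X.
    c : (N,) coefficients; kernel : function (x, z) -> float.
    """
    N = len(c)
    K = np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])  # N x N Gram matrix
    return c @ K @ c

rbf = lambda x, z: np.exp(-np.linalg.norm(x - z) ** 2)  # an example Mercer kernel
X = np.random.randn(6, 3)
c = np.random.randn(6)
print(kernel_term(X, c, rbf))
```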
Another practical issue is synchronization: ensuring that all nodes are in the same part of the same round of communication is hard. Real networks also aren't truly gossipy; e.g. Gnutella has a node/super-node structure.
References

[CCH05] Cornelia Caragea, Doina Caragea, and Vasant Honavar. Learning support vector machines from distributed data sources. In AAAI, pages 1602-1603, 2005.

[GCB+05] Hans Peter Graf, Eric Cosatto, Léon Bottou, Igor Dourdanovic, and Vladimir Vapnik. Parallel support vector machines: The cascade SVM. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems 17, pages 521-528. MIT Press, Cambridge, MA, 2005.

[LRV08] Yumao Lu, Vwani Roychowdhury, and Lieven Vandenberghe. Distributed parallel support vector machines in strongly connected networks. IEEE Transactions on Neural Networks, 19(7):1167-1178, 2008.

[MAP06] V. Metsis, I. Androutsopoulos, and G. Paliouras. Spam filtering with naive Bayes - which naive Bayes? In Third Conference on Email and Anti-Spam (CEAS), 2006.

[SLS99] Nadeem Ahmed Syed, Huan Liu, and Kah Kay Sung. Handling concept drifts in incremental learning with support vector machines. In KDD '99: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 317-321, New York, NY, USA, 1999. ACM.

[SSSS07] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In ICML '07: Proceedings of the 24th International Conference on Machine Learning, pages 807-814, New York, NY, USA, 2007. ACM.