GADGET SVM: a Gossip-bAseD sub-GradiEnT SVM solver

Chase Hensel, Columbia University

and Haimonti Dutta, Center for Computational Learning Systems, Columbia University

June 18, 2009

Road map

1. Preliminaries
   ◮ SVM
   ◮ The Distributed Setting
   ◮ Current Solutions to Distributed SVM
   ◮ Using Gossip
2. The Gadget Algorithm
3. Experimental Results
4. Future Work and Challenges

What is SVM?

The SVM algorithm finds the maximum-margin hyper-plane which separates two classes of data. E.g. two classes: Overweight and Not-overweight.

[Figure: scatter plot of Height vs. Weight for the two classes, with the maximum-margin separating hyper-plane shown over three build steps in the original slides.]

The variant we use achieves this by solving:

$$\min_w \; \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{j=1}^{n} \max\{0,\; 1 - y_j \langle x_j, w\rangle\}$$

Notice there is no bias term, and we don't use Mercer kernels.
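As a concrete, non-authoritative illustration, here is a minimal NumPy sketch of this objective; the function name svm_objective and the variable layout are our own assumptions, not part of the slides:

    import numpy as np

    def svm_objective(w, X, y, lam):
        # (lam/2)*||w||^2 + (1/n) * sum_j max{0, 1 - y_j <x_j, w>}
        # X: (n, d) array of examples, y: (n,) labels in {-1, +1}, lam: regularizer.
        margins = y * (X @ w)                   # y_j <x_j, w> for every example
        hinge = np.maximum(0.0, 1.0 - margins)  # hinge loss per example
        return 0.5 * lam * np.dot(w, w) + hinge.mean()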

The Distributed Setting

In the real world, data can be spread across a network!

[Figure: a five-node network with nodes a, b, c, d, e.]

Now we try to solve the Distributed SVM (DSVM) problem:

$$\min_w \; \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{m}\sum_{j=1}^{n_i} \max\{0,\; 1 - y_j \langle w, x_j\rangle\}$$

Our data is spread across m nodes, each with a different set of examples.
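A minimal sketch of ours showing the same objective when the examples are partitioned across m nodes: the hinge loss decomposes into per-node sums, while the 1/n factor and the regularizer remain global.

    import numpy as np

    def dsvm_objective(w, node_data, lam):
        # node_data: list of (X_i, y_i) pairs, one per node; labels in {-1, +1}.
        n = sum(len(y_i) for _, y_i in node_data)        # total number of examples
        hinge_sum = 0.0
        for X_i, y_i in node_data:                       # the loss decomposes over nodes
            hinge_sum += np.maximum(0.0, 1.0 - y_i * (X_i @ w)).sum()
        return 0.5 * lam * np.dot(w, w) + hinge_sum / n  # global normalization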

Why is the Distributed Setting Interesting?

The setting is common:
◮ Federated databases
◮ Peer-to-peer networks: Gnutella, BitTorrent
◮ Wireless networks: cell phones, 802.11
◮ Sensor networks: swarm robotics, the power grid
◮ Other: POP3 email

There are hard challenges:
◮ How can we minimize communication?
◮ How can we perform well on many different topologies?
◮ How can we deal with dynamic networks?

Current solutions to DSVM



◮ Compute local solutions, send support vectors to a master node, repeat [SLS99, CCH05].
  ◮ In the worst case this centralizes the data.
◮ Design an algorithm for a specific topology [GCB+05, LRV08].
  ◮ What about other networks?

There are many open areas for improvement!

Gossip

We can model information flow like the spread of gossip in a social network.

[Figure: the five-node example network a, b, c, d, e.]

a talks to b, c, and e; b talks to a, c, and e; c talks to a, b, and d; ...

1. Each node starts with a piece of information.
2. Each node gossips to its neighbors about its current information.
3. Each node updates its information based on the data obtained from its neighbors.
4. The nodes repeat this process until convergence (see the sketch below).
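As a toy illustration of this loop (ours, not from the slides), a synchronous neighbor-averaging sketch on the five-node example network; the helper name gossip_average and the fixed round count are assumptions:

    import numpy as np

    def gossip_average(values, neighbors, rounds=50):
        # Each node repeatedly replaces its value with the mean of its own value
        # and its neighbors' values; all nodes converge to a common consensus value
        # (recovering the exact global average needs a protocol such as Push Sum).
        values = dict(values)
        for _ in range(rounds):
            values = {node: float(np.mean([values[node]] + [values[n] for n in nbrs]))
                      for node, nbrs in neighbors.items()}
        return values

    # The example network: edges a-b, a-c, a-e, b-c, b-e, c-d, d-e.
    neighbors = {"a": ["b", "c", "e"], "b": ["a", "c", "e"],
                 "c": ["a", "b", "d"], "d": ["c", "e"], "e": ["a", "b", "d"]}
    print(gossip_average({"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0, "e": 5.0}, neighbors))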

Road map

1. Preliminaries
2. The Gadget Algorithm
   ◮ Another look at the DSVM equation
   ◮ Sub-Gradient methods
   ◮ The Gadget Algorithm
   ◮ Gossipy Sub-Gradients?
   ◮ Push Sum
   ◮ A bit on convergence
   ◮ Running Time
3. Experimental Results
4. Future Work and Challenges

Another look at the DSVM equation

Recall the DSVM equation,

$$\min_w \; \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{m}\sum_{j=1}^{n_i} \max\{0,\; 1 - y_j \langle w, x_j\rangle\}$$

It is strongly convex. Loosely speaking, this implies the Hessian is always positive semi-definite, and that DSVM is nicely behaved.

Definition
A function f : R^d → R is λ-strongly-convex if for all x, y ∈ R^d and all g ∈ ∂f(x),

$$f(y) \ge f(x) + \langle g, y - x\rangle + \frac{\lambda}{2}\|x - y\|_2^2 .$$

Sub-Gradient methods

Solving DSVM as a black-box optimization problem is a bad idea:
1. It ignores strong convexity.
2. It ignores the DSVM equation's simplicity.
Instead, we will use a gradient-descent variant designed for simple, strongly convex optimization problems.
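For reference (our own restatement, consistent with the per-node quantity L_i in the algorithm below), one sub-gradient of the DSVM objective at w is:

    % the regularizer contributes \lambda w; the hinge term contributes -y_j x_j
    % exactly for the examples that currently violate the margin (y_j <w, x_j> < 1)
    \[
      \lambda w \;-\; \frac{1}{n} \sum_{j \,:\, y_j \langle w, x_j \rangle < 1} y_j x_j
      \;\in\; \partial f(w).
    \]

This is the quantity the Gadget algorithm estimates in a distributed way: each node sums y x over its local margin violators, and Push Sum aggregates these local sums across the network.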

Gadget(λ, S_i, T, B)

  ŵ_i^(1) = w̄_i = 0
  for t = 1, ..., T do
    Set S_i^+ = { (x, y) ∈ S_i : y ⟨ŵ_i^(t), x⟩ < 1 }
    Set L_i^(t) = Σ_{(x,y) ∈ S_i^+} y x
    Set L̂_{g,i}^(t) ← Push Sum(B, L_i^(t), n_i)
    Set α^(t) = 1/(λt)
    Set w_i^(t+1/2) = (1 − λα^(t)) ŵ_i^(t) + α^(t) L̂_{g,i}^(t)
    Set ŵ_i^(t+1/2) ← Push Sum(B, w_i^(t+1/2), 1)
    Set ŵ_i^(t+1) = min{ 1, (1/√λ) / ‖ŵ_i^(t+1/2)‖ } · ŵ_i^(t+1/2)
    w̄_i = w̄_i + ŵ_i^(t+1)
    (w̄_i / t is the current estimate of the maximum margin hyper-plane)
  end for
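A minimal single-node sketch of the loop above (ours, not the authors' implementation). The push_sum call is a stand-in for the protocol on the following slides; stubbed out for a one-node "network", the loop reduces to a Pegasos-style update:

    import numpy as np

    def push_sum(B, value, weight):
        # Placeholder for the Push Sum protocol (next slides): in a real network it
        # gossips (value, weight) pairs and returns an estimate of
        # sum(values) / sum(weights); with a single node it is simply value / weight.
        return value / weight

    def gadget_node(X, y, lam, T, B=None):
        # One node's loop. X: (n_i, d) local examples, y: (n_i,) labels in {-1, +1}.
        n_i, d = X.shape
        w_hat = np.zeros(d)                                  # current local iterate
        w_bar = np.zeros(d)                                  # running sum of iterates
        for t in range(1, T + 1):
            violators = y * (X @ w_hat) < 1                  # S_i^+: margin violators
            L_i = (y[violators, None] * X[violators]).sum(axis=0)
            L_hat = push_sum(B, L_i, n_i)                    # ~ global average loss sub-gradient
            alpha = 1.0 / (lam * t)
            w_half = (1 - lam * alpha) * w_hat + alpha * L_hat
            w_half = push_sum(B, w_half, 1.0)                # keep local iterates close together
            radius = 1.0 / np.sqrt(lam)                      # project onto ball of radius 1/sqrt(lam)
            w_hat = min(1.0, radius / (np.linalg.norm(w_half) + 1e-12)) * w_half
            w_bar += w_hat
        return w_bar / T                                     # estimate of the max-margin hyper-plane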

Gossipy Sub-Gradients?

What about the network??? The Push Sum protocol uses gossip to deterministically simulate a random walk on the network.

[Figure: the five-node example network a, b, c, d, e.]

$$B = \begin{array}{c|ccccc}
  & a & b & c & d & e \\ \hline
a & 0 & 1/3 & 1/3 & 0 & 1/3 \\
b & 1/3 & 0 & 1/3 & 0 & 1/3 \\
c & 1/3 & 1/3 & 0 & 1/3 & 0 \\
d & 0 & 0 & 1/2 & 0 & 1/2 \\
e & 1/3 & 1/3 & 0 & 1/3 & 0
\end{array}$$

B is a stochastic matrix representing the nodes' communication pattern, and the probability transition matrix of the random walk. B's mixing time is proportional to the number of rounds of communication.

Push Sum Protocol

Push-Sum(B, V, w)

  Each node starts with w_i equal to its input weight.
  All nodes set S_i equal to the local input vector V.
  loop
    Set S_i = Σ_{j ∈ N(i)} b_{j,i} S_j
    Set w_i = Σ_{j ∈ N(i)} b_{j,i} w_j
  end loop
  Return S_i / w_i
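A minimal simulation of the protocol (our sketch; the fixed round count and variable names are assumptions), run for all nodes at once with the example matrix B from the previous slide:

    import numpy as np

    def push_sum(B, values, weights, rounds=50):
        # values: (m, d) local vectors, weights: (m,) local scalars. Assuming B is
        # irreducible and aperiodic, every row of the result approaches
        # sum(values) / sum(weights), the ratio of the global sums.
        S = np.array(values, dtype=float)
        w = np.array(weights, dtype=float)
        for _ in range(rounds):
            S = B.T @ S          # S_i <- sum_{j in N(i)} b_{j,i} S_j
            w = B.T @ w          # w_i <- sum_{j in N(i)} b_{j,i} w_j
        return S / w[:, None]

    # Usage with the five-node matrix B from the previous slide:
    B = np.array([[0, 1/3, 1/3, 0, 1/3],
                  [1/3, 0, 1/3, 0, 1/3],
                  [1/3, 1/3, 0, 1/3, 0],
                  [0, 0, 1/2, 0, 1/2],
                  [1/3, 1/3, 0, 1/3, 0]])
    values = np.arange(5.0)[:, None]          # one number per node: 0, 1, 2, 3, 4
    print(push_sum(B, values, np.ones(5)))    # every row approaches 2.0, the global average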

Two calls to Push Sum per round?

[The slide repeats the Gadget pseudocode with the two Push Sum calls highlighted.]

◮ The first computes a global sub-gradient.
◮ The second ensures that the nodes' local weight vectors stay close (within a fixed relative error).

A bit on convergence

Theorem
If f(w) is the function from the DSVM equation and w⋆ is its optimal solution, then for all T ≥ 1,

$$\sum_{t=1}^{T} \Big( f(\hat{w}_i^{(t)}) - f(w^\star) \Big) \le \frac{c^2}{2\lambda}\big(\log(T) + 1\big) + c \sum_{t=1}^{T} \big\| \hat{w}_i^{(t)} - \tilde{w}^{(t)} \big\|_2 + \frac{2}{\sqrt{\lambda}} \sum_{t=1}^{T} \big\| \tilde{L}_g(\tilde{w}^{(t)}) - \tilde{L}_g^{(t)} \big\|_2 .$$

Here c is the maximum norm of any sub-gradient, w̃^(t) is the network-average weight vector at time t, and L̃_g^(t) is the network-average sub-gradient of the loss at time t.

A bit on convergence continued...

$$\sum_{t=1}^{T} \Big( f(\hat{w}_i^{(t)}) - f(w^\star) \Big) \le \frac{c^2}{2\lambda}\big(\log(T) + 1\big) + c \sum_{t=1}^{T} \big\| \hat{w}_i^{(t)} - \tilde{w}^{(t)} \big\|_2 + \frac{2}{\sqrt{\lambda}} \sum_{t=1}^{T} \big\| \tilde{L}_g(\tilde{w}^{(t)}) - \tilde{L}_g^{(t)} \big\|_2 .$$

◮ The term c Σ_t ‖ŵ_i^(t) − w̃^(t)‖_2 is the error due to the second call to Push Sum.
◮ The term (2/√λ) Σ_t ‖L̃_g(w̃^(t)) − L̃_g^(t)‖_2 is the error due to the first call to Push Sum.

Running Time

Gadget obtains an ε-accurate solution in

$$\tilde{O}\!\left( \frac{d\,(n_i' + \mathrm{deg}'\,\tau_{\mathrm{mix}})}{\lambda \epsilon} \right)$$

time, requiring each node to send

$$\tilde{O}\!\left( \frac{d\,\mathrm{deg}'\,\tau_{\mathrm{mix}}}{\lambda \epsilon} \right)$$

messages, where n_i' is the maximum number of examples at any node, d is the total number of non-zero features, τ_mix denotes the mixing time of a random walk on the network G, and deg' is the maximum degree of any node.

◮ Running time is linearly proportional to the maximum number of examples at any node.
◮ Message complexity does not depend on the number of examples.

Road map

1. Preliminaries
2. The Gadget Algorithm
3. Experimental Results
4. Future Work and Challenges

Experimental Results

Using the Enron email spam dataset described in [MAP06], we compared Gadget to both a centralized projected sub-gradient algorithm, Pegasos [SSSS07], and another DSVM variant due to Caragea et al. [CCH05]. Both distributed algorithms were run on a simulated network using the wheel topology.

[Figure: a five-node wheel network a, b, c, d, e.]

The wheel's mixing time is essentially a constant.

Experimental Results

[Figure: CPU time (seconds) vs. number of nodes (20-200), comparing Caragea, Gadget, and Pegasos; d=12, epsilon=0.1, lambda=0.1, n_i=1000.]

CPU time for various network sizes.

Experimental Results

[Figure: number of messages vs. number of nodes (20-200), comparing Caragea and Gadget; d=12, epsilon=0.1, lambda=0.1, n_i=1000.]

Message complexity for various network sizes.

Experimental Results

[Figure: CPU time (seconds) vs. epsilon (0.02-0.07), comparing Caragea, Gadget, and Pegasos; d=12, m=25, lambda=0.1, n_i=1000.]

CPU time as a function of epsilon for a 25-node wheel with fixed Push Sum accuracy.

Experimental Results

[Figure: number of messages vs. epsilon (0.02-0.07), comparing Caragea and Gadget; d=12, m=25, lambda=0.1, n_i=1000.]

Number of messages as a function of epsilon for a 25-node wheel with fixed Push Sum accuracy.

Experimental Results

[Figure: CPU time (seconds) vs. n_i (0-15000), comparing Caragea and Gadget; d=12, m=20, lambda=0.1, epsilon=0.04.]

CPU time as a function of n_i for a 20-node wheel with fixed epsilon.

Experimental Results

[Figure: number of messages vs. n_i (0-15000), comparing Caragea and Gadget; d=12, m=20, lambda=0.1, epsilon=0.04.]

Number of messages as a function of n_i for a 20-node wheel with fixed epsilon.

Road map

1. Preliminaries
2. The Gadget Algorithm
3. Experimental Results
4. Future Work and Challenges
   ◮ What's Next?
   ◮ A Distributed Headache?

What's Next?

◮ Dynamic networks
◮ Stateless SVM?
◮ Incorporating bias
◮ Better gossip protocols
◮ Different sub-gradient steps

A Distributed Headache?

Mercer kernels require evaluating the kernel between every pair of examples in the SVM objective,

$$\sum_{i,j} K(x_i, x_j)\, c_i\, c_j .$$

In the distributed setting this centralizes the data. For a truly practical DSVM variant to exist, this problem needs to be addressed.
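To see the coupling concretely, a small sketch of our own (the function name and the RBF kernel choice are illustrative, not from the slides): every example is paired with every other example, including examples held by other nodes.

    import numpy as np

    def kernel_term(points, c, kernel):
        # sum_{i,j} K(x_i, x_j) c_i c_j: every pair of examples appears,
        # so splitting the points across nodes does not decompose the sum.
        total = 0.0
        for i, xi in enumerate(points):
            for j, xj in enumerate(points):
                total += kernel(xi, xj) * c[i] * c[j]
        return total

    rbf = lambda a, b: np.exp(-np.dot(a - b, a - b))   # an example Mercer kernel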

Another practical issue is synchronization: ensuring that all nodes are in the same part of the same round of communication is hard. Real networks aren't truly gossipy; e.g. Gnutella has a node/super-node structure.

References

[CCH05] Cornelia Caragea, Doina Caragea, and Vasant Honavar. Learning support vector machines from distributed data sources. In AAAI, pages 1602-1603, 2005.

[GCB+05] Hans Peter Graf, Eric Cosatto, Léon Bottou, Igor Dourdanovic, and Vladimir Vapnik. Parallel support vector machines: The cascade SVM. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems 17, pages 521-528. MIT Press, Cambridge, MA, 2005.

[LRV08] Yumao Lu, Vwani Roychowdhury, and Lieven Vandenberghe. Distributed parallel support vector machines in strongly connected networks. IEEE Transactions on Neural Networks, 19(7):1167-1178, 2008.

[MAP06] V. Metsis, I. Androutsopoulos, and G. Paliouras. Spam filtering with naive Bayes - which naive Bayes? In Third Conference on Email and Anti-Spam (CEAS), 2006.

[SLS99] Nadeem Ahmed Syed, Huan Liu, and Kah Kay Sung. Handling concept drifts in incremental learning with support vector machines. In KDD '99: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 317-321, New York, NY, USA, 1999. ACM.

[SSSS07] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In ICML '07: Proceedings of the 24th International Conference on Machine Learning, pages 807-814, New York, NY, USA, 2007. ACM.