MapReduce/Bigtable for Distributed Optimization

MapReduce/Bigtable for Distributed Optimization

Keith Hall, Scott Gilpin, Gideon Mann
presented by: Slav Petrov

Zurich, Thornton, New York
Google Inc.


Outline

•  Large-scale Gradient Optimization
   –  Distributed Gradient
   –  Asynchronous Updates
   –  Iterative Parameter Mixtures
•  MapReduce and Bigtable
•  Experimental Results


Gradient Optimization Setting

Goal: find

  θ* = argmin_θ f(θ)

If f is differentiable, then solve via gradient updates:

  θ^(i+1) = θ^i − α ∇f(θ^i)

Consider the case where f is composed of a sum of differentiable functions f_q; then the gradient update can be written as:

  θ^(i+1) = θ^i − α ∑_q ∇f_q(θ^i)

This case is the focus of the talk.
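
To make the setting concrete, here is a minimal sketch (my illustration, not code from the talk; grad_fq and the data layout are assumed) of gradient descent on an objective built from per-item functions f_q:

  import numpy as np

  def gradient_descent(grad_fq, data, theta0, alpha=0.1, iterations=100):
      """grad_fq(theta, q) is assumed to return ∇f_q(θ) for one data item q."""
      theta = np.array(theta0, dtype=float)
      for _ in range(iterations):
          # The full gradient is the sum of the per-item gradients ∇f_q(θ).
          full_grad = sum(grad_fq(theta, q) for q in data)
          theta -= alpha * full_grad  # descent step toward argmin_θ f(θ)
      return theta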

Maximum Entropy Models

  X: input space
  Y: output space
  Φ : X × Y → R^n: feature mapping
  S = ((x_1, y_1), ..., (x_m, y_m)): training data
  p_θ[y|x] = (1/Z) exp(θ · Φ(x, y)): probability model

  θ* = argmin_θ − (1/m) ∑_i log p_θ(y_i | x_i)

The objective function is a summation of functions, each of which is computed on one data instance.
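
As an illustration (my own sketch, not the authors' code), the per-instance gradient of the negative log-likelihood is the expected feature vector under the model minus the observed feature vector; features(x, y) below stands in for Φ(x, y), and the label set is assumed to be small and finite:

  import numpy as np

  def grad_neg_log_likelihood(theta, x, y, labels, features):
      """∇_θ[−log p_θ(y|x)] = E_{y'~p_θ(·|x)}[Φ(x, y')] − Φ(x, y)."""
      scores = np.array([theta @ features(x, yp) for yp in labels])
      probs = np.exp(scores - scores.max())
      probs /= probs.sum()  # p_θ(y'|x) for each candidate label y'
      expected = sum(p * features(x, yp) for p, yp in zip(probs, labels))
      return expected - features(x, y)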


Distributed Gradient

Observe that the gradient update can be broken down into three phases:

  θ^(i+1) = θ^i − α ∑_q ∇f_q(θ^i)

  –  ∇f_q(θ^i): can be done in parallel; cheap to compute
  –  ∑_q: aggregates the per-item gradients
  –  Update step: depends on the number of features, not on the data size

[Chu et al. '07]

At each iteration, the complete θ must be sent to each parallel worker.
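
A minimal sketch of the three phases, using Python's multiprocessing as a stand-in for MapReduce (the sharding scheme and the grad_fq helper are assumptions, not the authors' implementation):

  from multiprocessing import Pool
  import numpy as np

  def shard_gradient(args):
      # "Map" phase: each worker sums ∇f_q(θ) over its own shard of the data.
      theta, shard, grad_fq = args
      return sum(grad_fq(theta, q) for q in shard)

  def distributed_gradient_step(theta, shards, grad_fq, alpha=0.1):
      # The complete θ is shipped to every worker at every iteration.
      # grad_fq must be a top-level (picklable) function.
      with Pool(processes=len(shards)) as pool:
          partials = pool.map(shard_gradient,
                              [(theta, shard, grad_fq) for shard in shards])
      full_grad = np.sum(partials, axis=0)  # "Reduce" phase: sum partial gradients.
      return theta - alpha * full_grad      # Update step: cost scales with #features.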


Stochastic Gradient

Alternatively, approximate the sum ∑_q ∇f_q(θ^i) by a subset of the functions. If the subset is of size 1, then the algorithm is online:

  θ^(i+1) = θ^i − α ∇f_q(θ^i)

Stochastic gradient approaches provably converge, and in practice are often much quicker than exact gradient calculation methods.
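
A minimal sketch of the online special case (illustrative only; grad_fq is again an assumed per-item gradient helper):

  import random
  import numpy as np

  def stochastic_gradient(grad_fq, data, theta0, alpha=0.1, epochs=10):
      theta = np.array(theta0, dtype=float)
      for _ in range(epochs):
          for q in random.sample(data, len(data)):  # visit items in random order
              theta -= alpha * grad_fq(theta, q)    # θ^(i+1) = θ^i − α ∇f_q(θ^i)
      return theta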




Asynchronous Updates

Asynchronous updates are a distributed extension of stochastic gradient.

Each worker:
  –  gets the current θ^i
  –  computes ∇f_q(θ^i)
  –  updates the global parameter vector

Since the workers do not compute in lock-step, some gradients will be based on old parameters. Nonetheless, this also converges. [Langford et al. '09]
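
A minimal sketch of the idea using threads and a shared parameter vector (my illustration, not the Langford et al. algorithm verbatim; grad_fq is an assumed helper, and real systems would use separate machines rather than Python threads):

  import threading

  def async_sgd(grad_fq, data, theta, alpha=0.01, n_workers=4):
      """theta is a shared numpy array; every worker writes into it in place."""
      def worker(shard):
          for q in shard:
              snapshot = theta.copy()        # get the current θ (it may already be stale)
              g = grad_fq(snapshot, q)       # gradient computed against old parameters
              theta[:] = theta - alpha * g   # lock-free write to the global parameter vector
      shards = [data[i::n_workers] for i in range(n_workers)]
      threads = [threading.Thread(target=worker, args=(s,)) for s in shards]
      for t in threads:
          t.start()
      for t in threads:
          t.join()
      return theta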


Iterative Parameter Mixtures

Separately estimate θ on different samples, and then combine. Iterative: take the resulting θ as the starting point for the next round and rerun.

For certain classes of objective functions (e.g. maxent and perceptron), convergence can also be shown. [Mann et al. '09, McDonald et al. '10]

Little communication between workers.
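
A minimal sketch of one round-based realization (uniform averaging and the generic train helper are my assumptions):

  import numpy as np

  def iterative_parameter_mixture(train, shards, theta0, rounds=5):
      """train(shard, theta_init) is assumed to return parameters fit on one shard."""
      theta = np.array(theta0, dtype=float)
      for _ in range(rounds):
          # Each shard is trained independently (in practice, on parallel workers).
          local_thetas = [train(shard, theta) for shard in shards]
          # Mix: average the per-shard parameters and reuse as the next starting point.
          theta = np.mean(local_thetas, axis=0)
      return theta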


Typical Datacenter (at Google)

Commodity machines, with relatively few cores (e.g.
Racks of machines connected by