MapReduce/Bigtable for Distributed Optimization

Keith Hall, Scott Gilpin, Gideon Mann
presented by: Slav Petrov
Zurich, Thornton, New York
Google Inc.
Outline

• Large-scale Gradient Optimization
  – Distributed Gradient
  – Asynchronous Updates
  – Iterative Parameter Mixtures
• MapReduce and Bigtable
• Experimental Results
Gradient Optimization Setting

Goal: find θ* = argmin_θ f(θ)

If f is differentiable, then solve via gradient updates:

    θ^{i+1} = θ^i + α ∇f(θ^i)

Consider the case where f is composed of a sum of differentiable functions f_q; then the gradient update can be written as:

    θ^{i+1} = θ^i + α Σ_q ∇f_q(θ^i)

This case is the focus of the talk.
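As a concrete (toy) illustration of this decomposed update, here is a minimal NumPy sketch; the component functions f_q(θ) = -(θ - x_q)^2, the data values, and the step size α are illustrative assumptions, not from the talk.

```python
import numpy as np

# Toy decomposable objective: f(theta) = sum_q f_q(theta) with
# f_q(theta) = -(theta - x_q)^2, so grad f_q(theta) = -2 * (theta - x_q).
x = np.array([1.0, 2.0, 4.0])   # one component function f_q per value x_q (made-up data)

def grad_fq(theta, x_q):
    return -2.0 * (theta - x_q)

theta = 0.0
alpha = 0.05                    # illustrative step size
for i in range(100):
    # theta^{i+1} = theta^i + alpha * sum_q grad f_q(theta^i)
    theta = theta + alpha * sum(grad_fq(theta, x_q) for x_q in x)

print(theta)                    # approaches the maximizer, the mean of x (~2.33)
```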
Maximum Entropy Models

X : input space
Y : output space
Φ : X × Y → R^n : feature mapping
S = ((x_1, y_1), ..., (x_m, y_m)) : training data
p_θ[y|x] = (1/Z) exp(θ · Φ(x, y)) : probability model

    θ* = argmin_θ (1/m) Σ_i log p_θ(y_i | x_i)

The objective function is a summation of functions, each of which is computed on one data instance.
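As an aside on what one such per-instance function looks like, here is a minimal NumPy sketch of the per-instance log-likelihood and its gradient for the model above; the tiny label set and random feature vectors are made-up purely for illustration.

```python
import numpy as np

def log_prob_and_grad(theta, phi_xy, y_i):
    """Per-instance objective f_i(theta) = log p_theta(y_i | x_i) and its gradient.

    phi_xy: array of shape (num_labels, num_features); row y holds Phi(x_i, y).
    y_i:    index of the observed label for this instance.
    """
    scores = phi_xy @ theta                         # theta . Phi(x_i, y) for every label y
    scores = scores - scores.max()                  # shift for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()   # p_theta(y | x_i)
    log_p = np.log(probs[y_i])
    # grad log p_theta(y_i | x_i) = Phi(x_i, y_i) - E_{p_theta}[Phi(x_i, .)]
    grad = phi_xy[y_i] - probs @ phi_xy
    return log_p, grad

# Tiny made-up instance: 3 labels, 4 features.
rng = np.random.default_rng(0)
phi = rng.normal(size=(3, 4))
theta = np.zeros(4)
print(log_prob_and_grad(theta, phi, y_i=1))
```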
Distributed Gradient

Observe that the gradient update can be broken down into three phases:

    θ^{i+1} = θ^i + α Σ_q ∇f_q(θ^i)

• ∇f_q(θ^i) : can be done in parallel
• Σ_q : cheap to compute
• Update step : depends on the number of features, not the data size

[Chu et al. '07]

At each iteration, the complete θ must be sent to each parallel worker.
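The sketch below runs the three phases with Python's multiprocessing in place of MapReduce; the toy objective f_q(θ) = -(θ - x_q)^2, the shards, and the step size are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from multiprocessing import Pool

# Made-up data split into shards, one per worker; f_q(theta) = -(theta - x_q)^2.
DATA_SHARDS = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
ALPHA = 0.05   # illustrative step size

def shard_gradient(args):
    """Phase 1 (parallel): a worker sums grad f_q(theta) over its own shard."""
    theta, shard = args
    return float(np.sum(-2.0 * (theta - shard)))

if __name__ == "__main__":
    theta = 0.0
    with Pool(processes=len(DATA_SHARDS)) as pool:
        for i in range(50):
            # the complete theta is sent to every worker at each iteration
            partial = pool.map(shard_gradient, [(theta, s) for s in DATA_SHARDS])
            total_grad = sum(partial)              # phase 2: cheap sum over shards
            theta = theta + ALPHA * total_grad     # phase 3: update step, independent of data size
    print(theta)   # approaches the mean of all values (3.5)
```

Note how the only per-iteration communication is broadcasting θ to the workers and collecting one partial sum per shard.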
Stochastic Gradient

Alternatively, approximate the sum Σ_q ∇f_q(θ^i) by a subset of the functions f_q. If the subset is of size 1, then the algorithm is online:

    θ^{i+1} = θ^i + α ∇f_q(θ^i)

Stochastic gradient approaches provably converge, and in practice are often much quicker than exact gradient calculation methods.
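Under the same toy-objective assumptions as above, a minimal sketch of the size-1 (online) case:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=1.0, size=1000)   # made-up data; f_q(theta) = -(theta - x_q)^2

theta = 0.0
alpha = 0.01                                    # illustrative (constant) step size
for i in range(5000):
    q = rng.integers(len(x))                    # sample a single component function f_q
    # theta^{i+1} = theta^i + alpha * grad f_q(theta^i)
    theta = theta + alpha * (-2.0 * (theta - x[q]))

print(theta)   # hovers near the sample mean of x; a decaying step size would sharpen convergence
```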
Asynchronous Updates

Asynchronous updates are a distributed extension of stochastic gradient. Each worker:

1. Get the current θ^i
2. Compute ∇f_q(θ^i)
3. Update the global parameter vector

Since the workers do not compute in lock-step, some gradients will be based on old parameters. Nonetheless, this also converges. [Langford et al. '09]
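The sketch below mimics this scheme with Python threads rather than the distributed workers and shared parameter store used in the actual system; the toy objective, made-up data, and step size are the same illustrative assumptions as above. Updates are deliberately unsynchronized, so some gradients are computed from stale parameters.

```python
import threading
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=2000)   # made-up data; f_q(theta) = -(theta - x_q)^2
theta = np.zeros(1)                                # shared "global" parameter vector
ALPHA = 0.01

def worker(worker_id, steps=500):
    local_rng = np.random.default_rng(worker_id)
    for _ in range(steps):
        snapshot = theta.copy()                    # 1) get the current theta (possibly already stale)
        q = local_rng.integers(len(data))
        grad = -2.0 * (snapshot - data[q])         # 2) compute grad f_q at that snapshot
        theta += ALPHA * grad                      # 3) update the global parameters without locking

threads = [threading.Thread(target=worker, args=(w,)) for w in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(theta)   # close to the sample mean despite some updates using stale parameters
```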
Iterative Parameter Mixtures

Separately estimate θ on different samples, and then combine. Iterative: take the resulting θ as the starting point for the next round and rerun.

For certain classes of objective functions (e.g. maxent and perceptron), convergence can also be shown. [Mann et al. '09, McDonald et al. '10]

Little communication between workers.
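A sketch of the iterative mixing loop under the same toy-objective assumptions; the uniform parameter average, shard count, and epoch count are illustrative choices rather than the paper's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data split across 4 workers; toy objective f_q(theta) = -(theta - x_q)^2.
shards = np.array_split(rng.normal(loc=3.0, scale=1.0, size=1200), 4)
ALPHA = 0.05

def train_on_shard(theta_start, shard, epochs=5):
    """Each worker trains independently on its shard (here: plain gradient steps)."""
    theta = theta_start
    for _ in range(epochs):
        for x_q in shard:
            theta = theta + ALPHA * (-2.0 * (theta - x_q))
    return theta

theta = 0.0
for round_id in range(3):
    per_shard = [train_on_shard(theta, shard) for shard in shards]  # no communication during training
    theta = float(np.mean(per_shard))   # mix: uniform average of the per-shard parameters
    print(f"round {round_id}: theta = {theta:.3f}")
```

Workers only exchange parameters once per round, which is what keeps communication low compared to the per-iteration broadcast of distributed gradient.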
Typical Datacenter (at Google)

• Commodity machines, with relatively few cores (e.g.
• Racks of machines connected by