Fast Averaging - Google Sites

1/17

Fast Averaging

Outline: Introduction, Algorithm, MapReduce, Conclusions

Shreeshankar Bodas Massachusetts Institute of Technology

Joint work with Devavrat Shah

August 4, 2011

2/17

Introduction

Task: Auto-complete suggestions

3/17

Introduction

f...

4/17

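Introduction

Given: $x_1, x_2, \ldots, x_n \in \mathbb{R}_+$.

Want: $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$.

[Figure: the values $x_1, x_2, x_3, \ldots, x_n$ pass through a chain of summation nodes to give $x_1 + x_2 + \cdots + x_n$, which is scaled by $1/n$ to give $\mu$.]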

5/17

Deterministic Algorithm

Advantages:
- Exact answer
- Distributed

Issues:
- Complexity: Θ(n) (i.e., latency)
- Robustness

6/17

Introduction

Example: if every value is the same (5, 5, 5, 5, 5), sample any one!

Want: How many values to sample, for “good” performance?

7/17

Our Contribution

- Propose a randomized algorithm for averaging
- Analyze the trade-off between accuracy and latency
- Improve the job completion time of MapReduce

Intuition: Numbers “regular” ⇒ mean computation “easy”

8/17

(Centralized) Algorithm

Proposed algorithm (pick $r$ out of $n$):
- Randomly select $r$ out of the $n$ numbers $x_1, x_2, x_3, \ldots, x_n$
  - Sample every number $B(r, 1/n)$ times
- Report their average
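The sampling step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the $r$ draws are uniform and with replacement, which is one reading under which each number is picked $B(r, 1/n)$ times. The function name `sample_mean` and the `seed` parameter are hypothetical.

```python
import random


def sample_mean(values, r, seed=None):
    """Estimate the mean of `values` from r uniform draws with replacement.

    Assumption: sampling with replacement, so any fixed index is chosen
    Binomial(r, 1/n) times across the r draws.
    """
    if r <= 0 or not values:
        raise ValueError("need r >= 1 and a non-empty input")
    rng = random.Random(seed)  # local generator keeps the sketch reproducible
    n = len(values)
    draws = [values[rng.randrange(n)] for _ in range(r)]
    return sum(draws) / r


if __name__ == "__main__":
    data = [float(i) for i in range(1, 10001)]
    print("exact mean:   ", sum(data) / len(data))
    print("sampled mean: ", sample_mean(data, r=500, seed=1))
```

When all values are equal, a single draw already returns the exact mean; the more the values vary, the larger $r$ has to be for the same accuracy.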

9/17

(Centralized) Algorithm

Features of the proposed algorithm:
- Distributed implementation possible
- Online
- Trade-off of accuracy for speed
- Robust to node failures

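10/17

Main Result

Theorem. Under our algorithm, if

$$ r \ge \frac{1}{\epsilon^2} \log\frac{2}{\delta} \times \text{some constant}, $$

then

$$ \mathbb{P}\left( \left| \frac{\hat{\mu} - \mu}{\mu} \right| \ge \epsilon \right) \le \delta. $$

The constant depends on $k$, the number of finite moments of $\{x_i\}_{i=1}^{\infty}$.

Compare with the Chernoff bound for i.i.d. $x_i$s:

$$ r \gtrsim \frac{1}{\epsilon^2} \log\frac{1}{\delta}. $$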
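To get a feel for the bound, the sample size it asks for can be computed directly. The sketch below is only illustrative: the slides do not give the constant, so `c` is a hypothetical placeholder (default 1.0), the logarithm is assumed to be natural, and the function name `required_samples` is made up for this example.

```python
import math


def required_samples(eps, delta, c=1.0):
    """Smallest integer r with r >= c * (1 / eps^2) * log(2 / delta).

    `c` stands in for the unspecified constant of the theorem; the natural
    logarithm is an assumption.
    """
    if eps <= 0 or not (0 < delta < 1):
        raise ValueError("need eps > 0 and 0 < delta < 1")
    return math.ceil(c * math.log(2.0 / delta) / eps ** 2)


if __name__ == "__main__":
    for eps in (0.2, 0.1, 0.05):
        print(eps, required_samples(eps, delta=0.05))
```

Note that the result does not depend on $n$: halving $\epsilon$ quadruples the number of samples, while tightening $\delta$ costs only a logarithmic factor.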

11/17

Motivation

- Large data center (a cluster of computers)
- Used by Microsoft, Google, Amazon, Facebook, ...
- What functions can I compute?

12/17

What is MapReduce?

- Map: divide input; number-crunching!
- Reduce: combine outputs ∼ summation

One master server, m slave servers:
- Slave servers ≡ Mappers
- Master server ≡ Reducer
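Averaging fits this pattern when each mapper returns a partial sum together with a count, and the reducer adds them up. The following is a single-process sketch under stated assumptions, not code for any real MapReduce framework: the names `map_partial`, `reduce_mean`, and `mapreduce_mean` are hypothetical, and the round-robin split across `m` mappers is an arbitrary choice.

```python
def map_partial(chunk):
    """Mapper: crunch one chunk into a (partial sum, count) pair."""
    return (sum(chunk), len(chunk))


def reduce_mean(partials):
    """Reducer: combine mapper outputs by summation, then divide once."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    if count == 0:
        raise ValueError("no data to average")
    return total / count


def mapreduce_mean(values, m):
    """Split `values` round-robin over m mappers and reduce the results."""
    chunks = [values[i::m] for i in range(m)]
    return reduce_mean([map_partial(chunk) for chunk in chunks])


if __name__ == "__main__":
    print(mapreduce_mean([1, 2, 3, 4, 5, 6], m=3))
```

Sending (sum, count) pairs rather than per-chunk averages keeps the reducer a plain summation even when the chunks have different sizes.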

13/17

MapReduce

MapReduce can be used for:
- Word count, URL access count, ...
- Searching for text
- Reverse web-link graph
- Max(·) of an array
- Histogram
- ...

Reduce ∼ Summation

14/17

Example

Task: Auto-complete suggestions

15/17

Why it works

[Figure: bar chart comparing true frequency and estimated frequency per keyword. x-axis: Keyword (Facebook, Fox News, Firefox, FedEx); y-axis: Frequency.]

16/17

Why it works

Mathematically, if the sequence $\{x_i\}_{i=1}^{n}$ is “regular,” the earlier result applies.

Intuition: Heavy hitters are well represented in under-sampling.

∴ Top-5 suggestions etc. can be computed “quickly”.
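The heavy-hitter intuition can be tried out on a toy query log. This sketch rests on assumptions that are not in the slides: the log is a plain Python list of keyword strings, the under-sample is drawn uniformly with replacement, and the name `top_k_from_sample` is hypothetical.

```python
import random
from collections import Counter


def top_k_from_sample(log, k, r, seed=None):
    """Return up to k keywords that are most frequent in r sampled entries.

    Assumption: `log` is a list of keyword strings and the r entries are
    drawn uniformly with replacement.
    """
    if not log or r <= 0:
        raise ValueError("need a non-empty log and r >= 1")
    rng = random.Random(seed)
    sample = [log[rng.randrange(len(log))] for _ in range(r)]
    # Counter.most_common sorts by estimated (sampled) frequency.
    return [keyword for keyword, _ in Counter(sample).most_common(k)]


if __name__ == "__main__":
    # Toy log with made-up counts, skewed toward one keyword.
    toy_log = ["facebook"] * 700 + ["firefox"] * 200 + ["fedex"] * 100
    print(top_k_from_sample(toy_log, k=2, r=100, seed=0))
```

Keywords that account for a large share of the log dominate even a small sample, whereas rare keywords may be missed; that is acceptable when only the top few suggestions are needed.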

17/17

Conclusions

- Sequence is “regular” ⇒ mean computation “easy”
- MapReduce:
  - Used in data centers for processing huge logs of data
  - Performs “simple” mathematical operations
  - Reduce = Summation
- Randomized algorithm for fast averaging: trade-off between accuracy, completion time, and confidence


Thanks! Questions?

18/17

External Arrivals, Departures

[Figure: queries arrive at rate λ per unit time. Labels: query files F1, F2, F4, Fk, Fℓ; servers S1, S2, ..., Sm; File 1, File 2, ..., File n with values x1, x2, ..., xn.]

- $x_1, x_2, \ldots, x_n$: query-specific numbers
- Files ≡ search-query logs
- Many more files than servers, m ∼



n

19/17

Bounding Response Time

[Figure: timeline of incoming queries. Labels: 1/λ, T, “Processing begins here”, “Server occupancy time”, “Early termination”.]

Response time ≤ 2T

20/17

Analysis

Trade-off between accuracy, response time, and confidence:

Probability of error,

$$ \mathbb{P}\left( \left| \frac{\hat{\mu} - \mu}{\mu} \right| \ge \epsilon \right) \le f(\epsilon, T, p), $$

where $p = \mathbb{P}(\text{server } S_j \text{ samples file } F_i)$.

$f(\cdot)$ can be computed using Markov chain analysis.