1/17
Fast Averaging
Introduction Algorithm MapReduce Conclusions
Shreeshankar Bodas Massachusetts Institute of Technology
Joint work with Devavrat Shah
August 4, 2011
2/17
Introduction
Task: Auto-complete suggestions
3/17
Introduction
[Figure: auto-complete suggestions for the partial query "f. . ."]
4/17
Introduction
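Given: $x_1, x_2, \ldots, x_n \in \mathbb{R}_+$
Want: $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$
[Figure: the inputs $x_1, x_2, x_3, \ldots, x_n$ feed summation nodes ($\Sigma$) that produce $x_1 + x_2 + \cdots + x_n$, which is scaled by $1/n$ to give $\mu$]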
5/17
Deterministic Algorithm
Advantages:
  Exact answer
  Distributed
Issues:
  Complexity: Θ(n) (i.e., latency)
  Robustness
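The exact, distributed computation on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `partial_sum` and `exact_mean`, and the round-robin split across `m` nodes, are assumptions made for the example. Every one of the n values is read once, which is the Θ(n) cost noted above.

```python
from typing import List, Sequence


def partial_sum(chunk: Sequence[float]) -> float:
    """Work done by one node: add up its own share of the values."""
    return sum(chunk)


def exact_mean(values: Sequence[float], m: int) -> float:
    """Split the values across m (hypothetical) nodes, sum locally, then combine."""
    if not values:
        raise ValueError("need at least one value")
    if m < 1:
        raise ValueError("need at least one node")
    # Round-robin split: node j gets values[j], values[j + m], ...
    chunks: List[Sequence[float]] = [values[j::m] for j in range(m)]
    total = sum(partial_sum(chunk) for chunk in chunks)
    return total / len(values)
```

In this sketch the answer is only available once every node has reported its partial sum, so a single slow or failed node delays the result.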
6/17
Introduction
[Figure: a list of values, all equal to 5]
Sample any one!
Want: How many values to sample, for “good” performance?
7/17
Our Contribution
Propose a randomized algorithm for averaging
Analyze trade-off between accuracy and latency
Improve job completion time of MapReduce
Intuition: Numbers “regular” ⇒ mean computation “easy”
8/17
(Centralized) Algorithm
Proposed Algorithm
  Randomly select r out of n numbers
    - Sample every number B(r, 1/n) times
  Report their average
[Figure: pick r out of the n numbers $x_1, x_2, x_3, \ldots, x_n$]
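A minimal sketch of the centralized sampling step, assuming the r numbers are drawn uniformly with replacement (which is what makes each number's pick count Binomial(r, 1/n)). The function name `sampled_mean` and the `seed` parameter are hypothetical and only serve the example.

```python
import random
from typing import Optional, Sequence


def sampled_mean(values: Sequence[float], r: int, seed: Optional[int] = None) -> float:
    """Estimate the mean from r uniform draws with replacement."""
    if not values:
        raise ValueError("need at least one value")
    if r < 1:
        raise ValueError("need at least one sample")
    rng = random.Random(seed)
    # Each draw picks index i with probability 1/n, so after r draws the
    # number of times a given value was picked is Binomial(r, 1/n).
    picks = [values[rng.randrange(len(values))] for _ in range(r)]
    return sum(picks) / r
```

The cost depends on r rather than n, and a value that cannot be read (for example, because its node failed) could simply be redrawn.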
9/17
(Centralized) Algorithm
Features of the proposed algorithm:
  Distributed implementation possible
  Online
  Trade-off accuracy for speed
  Robust to node failures
10/17
Main Result
Theorem. Under our algorithm, if
$$r \ge \frac{1}{\epsilon^2} \log\frac{2}{\delta} \times \text{some constant},$$
then
$$P\left( \frac{|\hat{\mu} - \mu|}{\mu} \ge \epsilon \right) \le \delta.$$
The constant depends on k, the number of finite moments of $\{x_i\}_{i=1}^{\infty}$.
Compare with the Chernoff bound for i.i.d. $x_i$'s:
$$r \gtrsim \frac{1}{\epsilon^2} \log\frac{1}{\delta}$$
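To get a feel for the sample-size condition r ≥ (1/ε²) log(2/δ) × constant, the helper below evaluates it numerically. It is only a sketch: the slide does not give the constant, so `c` is a hypothetical parameter (defaulting to 1 as a placeholder), the logarithm is assumed to be natural, and the name `required_samples` is made up for the example.

```python
import math


def required_samples(eps: float, delta: float, c: float = 1.0) -> int:
    """Smallest integer r with r >= c * (1 / eps**2) * log(2 / delta).

    c stands in for the unspecified constant in the theorem (an assumption).
    """
    if not (eps > 0 and 0 < delta < 1 and c > 0):
        raise ValueError("need eps > 0, 0 < delta < 1, c > 0")
    return math.ceil(c * math.log(2.0 / delta) / eps ** 2)
```

With the placeholder c = 1, `required_samples(0.1, 0.05)` returns 369. The bound does not involve n, so the number of samples stays fixed as the data set grows.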
11/17
Motivation
Large data center (a cluster of computers)
Used by Microsoft, Google, Amazon, Facebook, . . .
What functions can I compute?
12/17
What is MapReduce?
Map: Divide Input, Number-crunching!
Reduce: Combine Outputs ∼ Summation
One master-server, m slave-servers
  Slave servers ≡ Mappers
  Master server ≡ Reducer
13/17
MapReduce
MapReduce can be used for:
  Word-count, URL access count, . . .
  Searching for text
  Reverse web-link graph
  Max(·) of an array
  Histogram
  . . .
Reduce ∼ Summation
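As an illustration of "Reduce ∼ Summation," here is a single-process sketch of word-count in the map/reduce style. It does not use any real MapReduce framework; the names `map_count`, `reduce_counts`, and `word_count`, and the round-robin split among `m` mappers, are assumptions for the example.

```python
from collections import Counter
from typing import Iterable, List


def map_count(chunk: Iterable[str]) -> Counter:
    """Mapper: count the words in one chunk of the input."""
    counts: Counter = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partials: Iterable[Counter]) -> Counter:
    """Reducer: combine the mapper outputs by summation."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts key by key
    return total


def word_count(lines: List[str], m: int) -> Counter:
    """Divide the input among m (hypothetical) mappers, then reduce."""
    chunks = [lines[j::m] for j in range(m)]
    return reduce_counts(map_count(chunk) for chunk in chunks)
```

For example, `word_count(["fox news", "Fox sports", "facebook"], 2)` counts "fox" twice and "facebook" once; the reducer only ever adds up per-mapper counts.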
14/17
Example
Task: Auto-complete suggestions
15/17
Why it works
[Figure: bar chart comparing True Frequency and Estimated Frequency for the keywords Facebook, Fox News, Firefox, and FedEx; axes: Keyword, Frequency]
16/17
Why it works
Mathematically: if the sequence $\{x_i\}_{i=1}^{n}$ is “regular,” the earlier result applies
Intuition: Heavy-hitters are well-represented in under-sampling
∴ Top 5 suggestions etc. can be computed “quickly”
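The heavy-hitter intuition can be sketched as follows: draw a small uniform sample of the query log and rank keywords by their sampled counts. This is an illustrative sketch only; the function name `top_k_from_sample`, the in-memory list of queries, and the `seed` parameter are assumptions, not part of the talk.

```python
import random
from collections import Counter
from typing import List, Optional, Sequence


def top_k_from_sample(queries: Sequence[str], r: int, k: int = 5,
                      seed: Optional[int] = None) -> List[str]:
    """Return the k most frequent keywords among r uniformly sampled queries."""
    if not queries:
        raise ValueError("need at least one query")
    rng = random.Random(seed)
    # Uniform draws with replacement: frequent keywords dominate the sample.
    sample = [queries[rng.randrange(len(queries))] for _ in range(r)]
    return [word for word, _ in Counter(sample).most_common(k)]
```

A keyword that makes up a large share of the log also makes up roughly the same share of the sample, so its rank tends to survive under-sampling, while rare keywords may be missed entirely.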
17/17
Conclusions
MapReduce:
  Used in data centers for processing huge logs of data
  Performs “simple” mathematical operations
  Reduce = Summation
Randomized algorithm for fast averaging: trade-off between accuracy, completion time, and confidence
Sequence is “regular” ⇒ Mean computation “easy”
Thanks! Questions?
18/17
External Arrivals, Departures
[Figure: queries arrive at rate λ per unit time; servers $S_1, S_2, \ldots, S_m$; files File 1, File 2, . . . , File n holding $x_1, x_2, \ldots, x_n$; file labels $F_1, F_2, F_4, F_k, F_\ell$]
$x_1, x_2, \ldots, x_n$: query-specific numbers
Files ≡ Search-query logs
Many more files than servers, $m \sim \sqrt{n}$
19/17
Bounding Response Time
[Figure: timeline of incoming queries with spacing 1/λ and a window of length T; labels: “Processing begins here,” “Server Occupancy,” “Early termination”; axis: Time]
Response time ≤ 2T
20/17
Analysis
Trade-off between accuracy, response-time, and confidence:
Probability of error,
$$P\left( \frac{|\hat{\mu} - \mu|}{\mu} \ge \epsilon \right) \le f(\epsilon, T, p),$$
where $p = P(\text{Server } S_j \text{ samples file } F_i)$.
$f(\cdot)$ can be computed using Markov chain analysis