Photon: fault-tolerant and scalable joining of continuous data streams

Manpreet Singh ([email protected])


Agenda
● Problem and motivation
● Systems challenges
● Design
● Production deployment
● Future work

The Stats Loop

[Diagram: Users on Google.com issue queries and clicks and are served ads; Advertisers and Publishers set up accounts, campaigns, and budgets and receive reports, billing, etc. Ads Serving logs events; the Stats Engine continuously aggregates those logs into Persistent Storage (hourly/daily/monthly sums of clicks, impressions, cost, etc. per advertiser, publisher, keyword, etc.), and the continuously updated stats feed ads inventory and budgets back into serving.]

Problem
● Join two continuous streams of events
  ○ Based on a shared id
  ○ Stored in the Google File System
  ○ Replicated in multiple datacenters
● Motivation:
  ○ Join click with query
  ○ Logged by multiple servers at different times
    ■ Cannot put the query inside the click

Systems challenges
● Exactly-once semantics
  ○ Output used for billing / internal monitoring
● Reliability
  ○ Fault-tolerance in the cloud
  ○ Automatically handle a single-datacenter disaster
● Scalability
  ○ Millions of joins per minute
● Latency
  ○ O(seconds)

Singly-homed system
● With a patched-on failover tool
  ○ Ran in production for several years
● Very high maintenance cost
● Datacenters have downtimes:
  ○ planned
  ○ unplanned
  ○ random hiccups
● Cannot provide a very high uptime SLA

How about running a MapReduce?
● Read both query logs and click logs
● Mapper output key is query_id
● Reducer outputs the joined event
● Issues:
  ○ batch job
  ○ high latency (setup cost, stragglers, etc.)
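A minimal sketch of this batch approach in Python (record shapes and helper names are assumptions, not the production MapReduce): the mapper keys every query and click record by query_id, and the reducer emits one joined record per (query, click) pair.

# Batch-join sketch; the shuffle phase is simulated with an in-memory dict.
from collections import defaultdict

def map_record(record):
    # record: {"type": "query" | "click", "query_id": ..., ...}
    yield record["query_id"], record

def reduce_records(query_id, records):
    queries = [r for r in records if r["type"] == "query"]
    clicks = [r for r in records if r["type"] == "click"]
    for q in queries:
        for c in clicks:
            yield {"query_id": query_id, "query": q, "click": c}

def run_batch_join(all_records):
    groups = defaultdict(list)                  # "shuffle": group map outputs by key
    for record in all_records:
        for key, value in map_record(record):
            groups[key].append(value)
    joined = []
    for query_id, records in groups.items():
        joined.extend(reduce_records(query_id, records))
    return joined

Because this is a batch job over complete log files, it only produces joined output after the run finishes, which is exactly the latency problem listed above.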


Challenges of running in multiple datacenters
● Stateful system
  ○ Need to ensure exactly-once semantics
● State needs to be replicated synchronously to multiple datacenters
  ○ ==> Paxos (guarantees that a majority of the group members have all the updates)

High-level idea

[Diagram: identical pipelines in DataCenter 1 and DataCenter 2, both writing to Paxos-backed storage replicated across datacenters DCA, DCB, and DCE]

● Run the pipeline in multiple datacenters in parallel:
  ○ Each pipeline processes every event
● Paxos-backed storage is shared across pipelines
  ○ Used to dedup events (at-most-once semantics)
  ○ Maintains persistent state at the event level

PaxosDB: key building block
● Key-value store built on top of raw Paxos
● Runs the Paxos algorithm to guarantee consistent replication
  ○ Multi-row transactions with conditional updates
  ○ Auto-elects a new master
● Key challenge: scalability
  ○ Need to join millions of events per minute
  ○ With cross-country replicas, fewer than 10 Paxos transactions/sec

[Diagram: an RPC client sends an RPC request to an RPC server, which goes through the PaxosDB library to a Paxos group replicated across datacenters DCA, DCB, and DCE]
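A minimal in-memory stand-in for the kind of key-value interface described above (class and method names are assumptions, not the actual PaxosDB API): conditional single-key inserts plus an atomic multi-row transaction, with the Paxos replication abstracted away behind a lock.

# Hypothetical stand-in for PaxosDB; in the real system every mutation is a
# Paxos transaction replicated to a majority of cross-datacenter replicas.
import threading

class FakePaxosDB:
    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()     # stands in for Paxos serialization

    def insert_key(self, key, value):
        """Conditional insert: returns False if the key already exists."""
        with self._lock:
            if key in self._rows:
                return False
            self._rows[key] = value
            return True

    def multi_row_transaction(self, inserts):
        """Apply many conditional inserts atomically; return one flag per insert."""
        with self._lock:
            results = []
            for key, value in inserts:
                if key in self._rows:
                    results.append(False)
                else:
                    self._rows[key] = value
                    results.append(True)
            return results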


Architecture of a single IdRegistry server


IdRegistry made scalable: sharding
● Run Paxos transactions in parallel for independent keys

[Diagram: the RPC client sends RPC request1 and RPC request2 to two different RPC servers; each server has its own PaxosDB library and Paxos group replicated across DCA, DCB, and DCE]
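A minimal sketch of the routing this implies (the function name and hash choice are illustrative): each event_id is hashed to one of num_shards independent IdRegistry shards, so transactions for unrelated keys commit in parallel on different Paxos groups.

# Hypothetical shard-routing helper.
import hashlib

def shard_for(event_id: str, num_shards: int) -> int:
    """Map an event_id to one of num_shards independent IdRegistry shards."""
    digest = hashlib.md5(event_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

Requests destined for different shards can then be sent to different RPC servers concurrently.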


IdRegistry made scalable: server-side batching
● Batch multiple RPC requests into a single Paxos transaction

[Diagram: the RPC client sends RPC requests 1, 2, 3 to one RPC server and RPC request 4 to another; each server batches the RPC requests it receives into a single Paxos transaction through its PaxosDB library, replicated across DCA, DCB, and DCE]
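A minimal sketch of this queue-and-commit pattern, with all names assumed (the sketch is Python; the production server is not): RPC handler threads enqueue their requests, and a single registrar thread drains the queue, commits the whole batch as one multi-row Paxos transaction, and then answers every waiting caller.

# Hypothetical batching IdRegistry shard server.
import queue
import threading

class BatchingRegistryServer:
    def __init__(self, paxos_db, max_batch=100):
        # paxos_db: object whose multi_row_transaction([(key, value), ...])
        # atomically applies the inserts and returns one success flag per insert.
        self.paxos_db = paxos_db
        self.max_batch = max_batch
        self.pending = queue.Queue()

    def handle_rpc(self, event_id, joiner_id):
        """Called on an RPC handler thread; blocks until the batch commits."""
        done, result = threading.Event(), {}
        self.pending.put((event_id, joiner_id, done, result))
        done.wait()
        return result["inserted"]          # False => duplicate event; skip the join

    def registrar_loop(self):
        """Single thread: drain up to max_batch requests, commit them together."""
        while True:
            batch = [self.pending.get()]   # block for the first request
            while len(batch) < self.max_batch and not self.pending.empty():
                batch.append(self.pending.get_nowait())
            flags = self.paxos_db.multi_row_transaction(
                [(event_id, joiner_id) for event_id, joiner_id, _, _ in batch])
            for (_, _, done, result), inserted in zip(batch, flags):
                result["inserted"] = inserted
                done.set()

registrar_loop is meant to run on its own thread; since each of the (fewer than 10) cross-country Paxos transactions per second now carries a whole batch of requests, a shard can serve far more than 10 requests per second.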


IdRegistry made scalable: client-side batching
● Batch multiple client requests into a single RPC

[Diagram: the RPC client batches multiple client requests into a single RPC, sending requests 1, 2, 3 to one shard and request 4 to another; each RPC server still batches the RPC requests it receives into a single Paxos transaction through its PaxosDB library, replicated across DCA, DCB, and DCE]
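A minimal sketch of the client side (names assumed): the client buffers requests per destination shard and ships each buffer as a single RPC once it fills up, so one network round trip carries many registrations.

# Hypothetical client-side batching for IdRegistry requests.
from collections import defaultdict

class BatchingRegistryClient:
    def __init__(self, send_rpc, num_shards, max_batch=50):
        # send_rpc(shard, [(event_id, joiner_id), ...]) issues one RPC to one shard.
        self.send_rpc = send_rpc
        self.num_shards = num_shards
        self.max_batch = max_batch
        self.buffers = defaultdict(list)

    def register(self, event_id, joiner_id):
        shard = hash(event_id) % self.num_shards   # real code would use a stable hash
        self.buffers[shard].append((event_id, joiner_id))
        if len(self.buffers[shard]) >= self.max_batch:
            self.flush(shard)

    def flush(self, shard):
        batch, self.buffers[shard] = self.buffers[shard], []
        if batch:
            self.send_rpc(shard, batch)            # one RPC carries many requests

A real client would also flush on a short timer so that requests are never held back for long during quiet periods.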


IdRegistry made scalable
● Sharding
  ○ Paxos transactions happen in parallel for each shard
  ○ Load-balance amongst shards:
    ■ Shard by hash(event_id) mod num_shards
  ○ Built automated support for resharding (see the sketch below)
    ■ Attach a timestamp to each key
    ■ Use timestamp-based sharding
    ■ GC old keys
● Server-side batching
  ○ RPC thread adds the input request to an in-memory queue
  ○ A single Paxos thread extracts multiple requests, performs a single multi-row Paxos transaction, and sends the RPC responses
● Client-side batching
  ○ Client batches multiple RPC requests into a single RPC request
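A minimal sketch of the timestamp-based resharding mentioned above (the config format and the cut-over timestamp are assumptions): the shard count can change at a chosen time, and each key is routed with the shard count that was in effect at its event timestamp, so keys written before the change keep resolving to their old shards until they are garbage-collected.

# Hypothetical resharding config: (effective_from_unix_timestamp, num_shards),
# sorted by timestamp.
import bisect

SHARD_CONFIGS = [
    (0,          100),    # original shard count
    (1700000000, 128),    # upshard at this (illustrative) timestamp
]

def shard_for(event_id: str, event_timestamp: int) -> int:
    """Route a key using the shard count in effect at the event's timestamp."""
    times = [t for t, _ in SHARD_CONFIGS]
    idx = max(bisect.bisect_right(times, event_timestamp) - 1, 0)
    _, num_shards = SHARD_CONFIGS[idx]
    return hash(event_id) % num_shards             # real code would use a stable hash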


Photon architecture in a single datacenter

[Diagram: the Log Dispatcher reads the Click Logs and, with retries, sends each click to the Joiner. The Joiner looks up the click's query_id in the EventStore, which is built from the Query Logs, and gets the query back (steps 1-2); it then registers the event in the Paxos-based IdRegistry (step 3) and writes the joined event to the Joined Click Logs (step 4).]
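A minimal end-to-end sketch of the per-click flow in this diagram (the component interfaces are assumptions): look up the query, claim the event_id in the IdRegistry, and only write the joined event if the claim succeeded.

# Hypothetical joiner logic for a single click.
def join_click(click, event_store, id_registry, output, joiner_id):
    """Join one click with its query at most once; tell the caller what happened."""
    query = event_store.lookup(click["query_id"])      # steps 1-2: query_id -> query
    if query is None:
        return "retry_later"                           # query log may still be in transit
    # Step 3: conditional insert into the Paxos-based IdRegistry; False means a
    # joiner (here or in another datacenter) already claimed this event.
    if not id_registry.insert_key(click["event_id"], joiner_id):
        return "duplicate"
    output.write({"event_id": click["event_id"],       # step 4: joined click log
                  "query": query, "click": click})
    return "joined"

The Log Dispatcher keeps retrying any click that comes back as "retry_later", which is what gives the pipeline its at-least-once side.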


Photon architecture in multiple datacenters

[Diagram: identical pipelines run in DataCenter 1 and DataCenter 2. In each, a Log Dispatcher reads the Click Logs and retries events into a Joiner, which looks up queries in the EventStore (built from the Query Logs) and writes the Joined Click Logs. Both Joiners dedup through the shared Paxos-based IdRegistry, the only state shared across the two datacenters.]

PaxosDB RPC semantics
● InsertKey:
  ○ Returns false if the key already exists
  ○ Else inserts the key
● What if PaxosDB inserts the key but the Joiner does not get the response in time?
  ○ Observed 0.01% loss in production
● Write a unique id for the joiner along with the event_id:
  ○ event_id is the key
  ○ Joiner id is the value
  ○ Handles Joiner RPC retries gracefully
● If the Joiner crashes after writing to Paxos, run offline recovery
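A minimal sketch of that retry handling (the registry client interface, including a get() call, is assumed): because the stored value is the joiner's own id, a retried insert that finds the key already present can distinguish "my earlier write actually committed" from "another joiner owns this event".

# Hypothetical retry-safe registration against the IdRegistry.
def register_with_retries(id_registry, event_id, joiner_id, max_attempts=3):
    for _ in range(max_attempts):
        try:
            if id_registry.insert_key(event_id, joiner_id):
                return True                        # we own this event
            # Key exists: if it carries our joiner_id, a previous attempt whose
            # response was lost did commit, so treat the retry as a success.
            return id_registry.get(event_id) == joiner_id
        except TimeoutError:                       # response lost in flight
            continue                               # retry the RPC
    return False                                   # give up; the dispatcher retries later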


Reading input events continuously
● Events stored in the Google File System
● Periodically stat the directory:
  ○ identify new files
  ○ check for growth of existing files
● Keep track of how far each file has been read
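A minimal sketch of that scanning loop over a local directory standing in for GFS (paths, polling interval, and keeping the progress only in memory are assumptions; the real dispatcher persists its progress):

# Hypothetical dispatcher scan loop.
import os
import time

def scan_forever(log_dir, process_bytes, poll_seconds=5):
    progress = {}                                   # file path -> bytes already read
    while True:
        for name in os.listdir(log_dir):
            path = os.path.join(log_dir, name)
            size = os.path.getsize(path)
            done = progress.get(path, 0)            # 0 for newly discovered files
            if size > done:                         # the file has grown since last scan
                with open(path, "rb") as f:
                    f.seek(done)
                    process_bytes(path, f.read(size - done))
                progress[path] = size
        time.sleep(poll_seconds)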


EventStore
● Sequentially read the query logs and store a mapping from query_id to query
  ○ In a distributed hash table (e.g. Bigtable)
● CacheEventStore:
  ○ The majority of clicks happen within a few minutes of the query
  ○ Cache the query_id ---> query mapping for recent queries in RAM
  ○ Use distributed cache servers (similar to memcache)
● LogsEventStore:
  ○ Performance optimization, since our query logs are sorted
  ○ Fallback in case of a cache miss
  ○ Use binary search over the sorted event stream on disk
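A minimal sketch of the two-level lookup (the data layout is an assumption; here the on-disk log is kept sorted and searched by query_id purely for simplicity): try the in-RAM cache of recent queries first, and fall back to a binary search over the sorted log on a miss.

# Hypothetical two-level EventStore.
import bisect

class EventStore:
    def __init__(self, recent_cache, sorted_log):
        self.recent_cache = recent_cache   # dict: query_id -> query (CacheEventStore)
        self.sorted_log = sorted_log       # list of (query_id, query), sorted by query_id
        self._keys = [qid for qid, _ in sorted_log]

    def lookup(self, query_id):
        query = self.recent_cache.get(query_id)     # fast path for recent queries
        if query is not None:
            return query
        return self._logs_lookup(query_id)          # LogsEventStore fallback

    def _logs_lookup(self, query_id):
        i = bisect.bisect_left(self._keys, query_id)
        if i < len(self.sorted_log) and self.sorted_log[i][0] == query_id:
            return self.sorted_log[i][1]
        return None                                 # query not (yet) in the logs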


Purgatory (in Log Dispatcher)
● What if a subset of the query logs is delayed?
● The Log Dispatcher reads every event from the click log and keeps retrying until the join succeeds
  ○ Maintains state in persistent storage (Google File System)
  ○ Ensures at-least-once semantics
● Exponential backoff in case of failure
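A minimal sketch of the retry loop with exponential backoff (names and delays are assumptions; the real dispatcher also persists unjoined events between attempts):

# Hypothetical per-event retry loop in the dispatcher.
import time

def dispatch_with_backoff(event, try_join, base_delay=1.0, max_delay=60.0):
    """Keep retrying until the join lands (at-least-once semantics)."""
    delay = base_delay
    while True:
        if try_join(event):            # e.g. an RPC to a Joiner
            return
        time.sleep(delay)              # back off, e.g. while the query log is delayed
        delay = min(delay * 2, max_delay)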


Operational challenges
● Need a 24x7 running system
  ○ Low maintenance
● Volume of input traffic fluctuates by 2x within a day
● Bursts of traffic in case of network issues
● Predicting resource requirements
● Auto-resize the jobs to handle traffic growth and spikes
● Reliable monitoring

Production deployment
● System running in production for several months
  ○ O(700)-way sharded PaxosDB
  ○ 5 PaxosDB replicas spread across the East Coast, West Coast, and Midwest
  ○ 2 Photon pipelines running on the East Coast and West Coast
  ○ Order-of-magnitude higher uptime SLA than the singly-homed system
  ○ Survived multiple planned / unplanned datacenter downtimes
  ○ Much less noisy alerts

Photon withstanding a real datacenter disaster

End-to-end latency of Photon

Effectiveness of server-side batching in IdRegistry

IdRegistry dynamic time-based upsharding

Dispatcher in one datacenter catching up after downtime

Photon wasted joins minimized

Photon EventStore lookups in a single datacenter

How we addressed the systems challenges
● Exactly-once semantics
  ○ PaxosDB ensures at-most-once
  ○ Dispatcher retry ensures at-least-once
● Reliability
  ○ PaxosDB needs only a majority of members to be up
  ○ Storing global state in PaxosDB allows the pipelines to run in multiple datacenters
● Scalability
  ○ Reshard to grow the number of PaxosDB servers
  ○ All the other workers are stateless
● Latency
  ○ Mostly RPC-based communication amongst jobs
  ○ Most data transfers happen in RAM

Next BIG challenges
● Better resource utilization
● Join multiple log sources

More cool problems we are solving in AdWords Backend
● Real-time logs processing
  ○ Read 100T logs per day
  ○ Compute stats over 200+ dimensions with low latency
● Petabyte-scale database storage engine
  ○ 32M rows updated per sec, 140T total rows
  ○ 200K read queries/sec with a latency of 90ms
● Efficiently backfill stats for the last N years
  ○ O(50B) rows per day
  ○ Too big to store in a database
● And many more cool projects...
  ○ Join us and find out more!
  ○ Manpreet Singh ([email protected])

Google Application Process
1. Apply Now!
2. Resume Review and Qualification
3. First Round Interviews: 2 Technical Phone Screens
4. Full-time Positions: 3-5 Onsite Interviews; Internships: 1 Host Matching Interview
5. Hiring Committee - Offer

Full-time: google.com/students/eng
Internships: google.com/students/intern

Questions