Photon: fault-tolerant and scalable joining of continuous data streams
Manpreet Singh ([email protected])
Agenda
● Problem and motivation
● Systems challenges
● Design
● Production deployment
● Future work

The Stats Loop

[Diagram: advertisers and publishers set up accounts, campaigns, and budgets; Ads Serving shows ads to users on Google.com and logs query and click events; the Stats Engine continuously aggregates these logs into persistent storage as hourly/daily/monthly sums of clicks, impressions, cost, etc. per advertiser, publisher, keyword, etc.; the stats continuously update reports, billing, ads inventory, budgets, and more.]

Problem
● Join two continuous streams of events
  ○ Based on a shared id
  ○ Stored in the Google File System
  ○ Replicated in multiple datacenters
● Motivation:
  ○ Join click with query
  ○ Logged by multiple servers at different times
    ■ Cannot put query inside the click

Systems challenges
● Exactly-once semantics
  ○ Output used for billing / internal monitoring
● Reliability
  ○ Fault-tolerance in the cloud
  ○ Automatically handle single-datacenter disasters
● Scalability
  ○ Millions of joins per minute
● Latency
  ○ O(seconds)

Singly-homed system
● With a patched-on failover tool
  ○ Ran in production for several years
● Very high maintenance cost
● Datacenters have downtimes:
  ○ planned
  ○ unplanned
  ○ random hiccups
● Cannot provide a very high uptime SLA

How about running a MapReduce?
● Read both query logs and click logs
● Mapper output key is query_id
● Reducer outputs the joined event (see the sketch below)
● Issues:
  ○ batch job
  ○ high latency (setup cost, stragglers, etc.)

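A minimal sketch of the batch alternative above, written as plain Python rather than a real MapReduce job; the record field names are made up for illustration. The map phase keys both logs by query_id, and the reduce phase emits one joined event per click whose query is present.

```python
# Hypothetical illustration of the batch join; field names are assumptions.
from collections import defaultdict

def map_phase(query_log, click_log):
    """Group both streams by query_id (the mapper output key)."""
    groups = defaultdict(lambda: {"query": None, "clicks": []})
    for query in query_log:              # e.g. {"query_id": "q1", "text": "flowers"}
        groups[query["query_id"]]["query"] = query
    for click in click_log:              # e.g. {"query_id": "q1", "click_id": "c7"}
        groups[click["query_id"]]["clicks"].append(click)
    return groups

def reduce_phase(groups):
    """Emit one joined event per click whose query was seen."""
    for query_id, group in groups.items():
        if group["query"] is None:
            continue                     # query log missing or late: click dropped this run
        for click in group["clicks"]:
            yield {**click, "query": group["query"]}
```

The job cannot emit anything until all input files are available, which is exactly the setup-cost and straggler latency the slide calls out.
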
Challenges of running in multiple datacenters
● Stateful system
  ○ Need to ensure exactly-once semantics
● State needs to be replicated synchronously to multiple datacenters
  ○ ==> Paxos (guarantees that a majority of group members have all the updates)

High-level idea

[Diagram: pipelines in DataCenter 1 and DataCenter 2 both talk to Paxos-backed storage replicated across datacenters DCA, DCB, and DCE.]

● Run the pipeline in multiple datacenters in parallel:
  ○ Each pipeline processes every event
● Paxos-backed storage is shared across pipelines
  ○ Used to dedup events (at-most-once semantics) (see the sketch below)
  ○ Maintains persistent state at the event level

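A minimal sketch of the dedup idea, using a thread lock as a local stand-in for what the Paxos-backed storage actually provides (an atomic, replicated insert-if-absent); class and method names here are hypothetical.

```python
import threading

class SharedRegistry:
    """Stand-in for the Paxos-backed store shared by all pipelines."""
    def __init__(self):
        self._committed = set()
        self._lock = threading.Lock()   # Paxos provides this atomicity across datacenters

    def insert_if_absent(self, event_id) -> bool:
        with self._lock:
            if event_id in self._committed:
                return False            # another pipeline already committed this event
            self._committed.add(event_id)
            return True

def run_pipeline(name, events, registry, output):
    """Every pipeline processes every event; the shared registry deduplicates."""
    for event_id in events:
        if registry.insert_if_absent(event_id):
            output.append((name, event_id))   # at most one pipeline emits each event
```

Calling run_pipeline once per simulated datacenter over the same event list leaves exactly one copy of each event in the output, regardless of which pipeline got there first.
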
PaxosDB: key building block
● Key-value store built on top of raw Paxos
● Runs the Paxos algorithm to guarantee consistent replication
  ○ Multi-row transactions with conditional updates
  ○ Auto-elects a new master
● Key challenge: scalability
  ○ Need to join millions of events/minute
  ○ With cross-country replicas, fewer than 10 Paxos transactions/sec

[Diagram: an RPC client sends a request to an RPC server, which uses the PaxosDB library to replicate state via Paxos across datacenters DCA, DCB, and DCE.]

Architecture of a single IdRegistry server
IdRegistry made scalable: sharding
● Run Paxos transactions in parallel for independent keys

[Diagram: the RPC client sends request 1 and request 2 to two different RPC servers; each server has its own PaxosDB library and Paxos replica group (DCA, DCB, DCE).]

IdRegistry made scalable: server-side batching
● Batch multiple RPC requests into a single Paxos transaction

[Diagram: the RPC client sends requests 1, 2, and 3 to one RPC server and request 4 to another; each server batches the RPC requests it receives before committing them through its PaxosDB library and Paxos replica group (DCA, DCB, DCE).]

IdRegistry made scalable: client-side batching
● Batch multiple client requests into a single RPC

[Diagram: the RPC client batches client requests 1, 2, and 3 into a single RPC to one server and sends request 4 to another; each RPC server still batches requests before committing them through its PaxosDB library and Paxos replica group (DCA, DCB, DCE).]

IdRegistry made scalable
● Sharding (see the sketch below)
  ○ Paxos transactions happen in parallel for each shard
  ○ Load-balance amongst shards:
    ■ Shard by hash(event_id) mod num_shards
  ○ Built automated support for resharding:
    ■ Attach a timestamp to each key
    ■ Use timestamp-based sharding
    ■ GC old keys
● Server-side batching
  ○ RPC thread adds the incoming request to an in-memory queue
  ○ A single Paxos thread extracts multiple requests, performs one multi-row Paxos transaction, and sends the RPC responses
● Client-side batching
  ○ Client batches multiple RPC requests into a single RPC request

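A sketch of the sharding and server-side batching described above, not the actual IdRegistry code: PaxosDB is represented by a hypothetical db object with a multi_row_insert method, and values such as the batch limit are made up.

```python
import hashlib
import queue
import threading

NUM_SHARDS = 700   # the deployment slide mentions an O(700)-way sharded PaxosDB

def shard_for(event_id: str) -> int:
    """Static sharding: hash(event_id) mod num_shards."""
    digest = hashlib.md5(event_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

class IdRegistryShard:
    """One shard: RPC threads enqueue requests, a single Paxos thread batches them."""

    def __init__(self, db):
        self._db = db                          # hypothetical PaxosDB client for this shard
        self._queue = queue.Queue()            # in-memory queue filled by RPC threads
        threading.Thread(target=self._paxos_loop, daemon=True).start()

    def insert_async(self, event_id, joiner_id, respond):
        """Called from an RPC thread; `respond` delivers the RPC response later."""
        self._queue.put((event_id, joiner_id, respond))

    def _paxos_loop(self):
        while True:
            batch = [self._queue.get()]        # block until at least one request arrives
            while len(batch) < 128 and not self._queue.empty():
                batch.append(self._queue.get_nowait())
            # One multi-row Paxos transaction commits the whole batch...
            rows = [(event_id, joiner_id) for event_id, joiner_id, _ in batch]
            results = self._db.multi_row_insert(rows)   # hypothetical: one bool per row
            # ...then every waiting RPC gets its response.
            for (_, _, respond), ok in zip(batch, results):
                respond(ok)
```

Batching amortizes the fixed cross-datacenter cost of a Paxos round over many requests, which is how throughput gets past the ~10 transactions/sec per replica group limit mentioned earlier.
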
Photon architecture in a single datacenter

[Diagram: the Log Dispatcher continuously reads the Click Logs and sends each click to the Joiner, retrying failed events; the Joiner looks up the click's query_id in the EventStore (built from the Query Logs) to fetch the query, registers the event in the Paxos-based IdRegistry, and writes the result to the Joined Click Logs (Steps 1-4 in the slide).]

Photon architecture in multiple datacenters

[Diagram: two identical pipelines, one per datacenter, each with its own Query Logs, Click Logs, Log Dispatcher (with retry), EventStore, Joiner, and Joined Click Logs; both pipelines share the Paxos-based IdRegistry.]

PaxosDB RPC semantics
● InsertKey:
  ○ Returns false if the key already exists
  ○ Else inserts the key
● What if PaxosDB inserts the key but the Joiner does not get the response in time?
  ○ Observed 0.01% loss in production
● Write a unique id for the joiner along with the event_id (see the sketch below):
  ○ event_id is the key
  ○ Joiner id is the value
  ○ Handles Joiner RPC retries gracefully
● If the Joiner crashes after writing to Paxos, run offline recovery

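A sketch of the retry-safe commit described above: the event_id is the key and the joiner's unique id is the value, so a joiner whose first response was lost can tell the difference between "some other joiner owns this event" and "my own earlier insert succeeded". The paxos_db calls are hypothetical client methods, not a documented API.

```python
def try_commit(paxos_db, event_id, joiner_id) -> bool:
    """Return True iff this joiner owns the event and may write the joined output."""
    if paxos_db.insert_key(event_id, value=joiner_id):   # hypothetical InsertKey wrapper
        return True                        # fresh insert: this joiner won the event
    # Key already exists. If the stored value is our own joiner id, our earlier
    # attempt succeeded but the RPC response was lost, so the retry still counts.
    return paxos_db.read_value(event_id) == joiner_id    # hypothetical read call
```

If the joiner crashes after the insert but before writing the joined output, the stored joiner id also tells the offline recovery which joins need to be redone.
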
Reading input events continuously
● Events stored in the Google File System
● Periodically stat the directory (see the sketch below):
  ○ identify new files
  ○ check for growth of existing files
● Keep track of per-file progress

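A minimal sketch of the polling loop, using the local filesystem as a stand-in for GFS; the function name and parameters are hypothetical.

```python
import os
import time

def watch_log_directory(log_dir, poll_interval_s=5):
    """Periodically stat log_dir; yield (file, old_size, new_size) for new or grown files."""
    known_sizes = {}                       # the per-file progress the dispatcher tracks
    while True:
        for name in sorted(os.listdir(log_dir)):
            path = os.path.join(log_dir, name)
            size = os.stat(path).st_size
            if size > known_sizes.get(name, 0):
                yield name, known_sizes.get(name, 0), size
                known_sizes[name] = size   # remember how far this file has been read
        time.sleep(poll_interval_s)
```
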
EventStore
● Sequentially read the query logs and store a mapping from query_id to query
  ○ In a distributed hash table (e.g. Bigtable)
● CacheEventStore:
  ○ The majority of clicks happen within a few minutes of the query
  ○ Cache the query_id ---> query mapping for recent queries in RAM
  ○ Use distributed cache servers (similar to memcache)
● LogsEventStore (see the sketch below):
  ○ Performance optimization, since our query logs are sorted
  ○ Fallback in case of a cache miss
  ○ Binary search over the sorted event stream on disk

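A sketch of the two-level lookup: the in-RAM cache (CacheEventStore) is tried first, and a binary search over a sorted log (LogsEventStore) handles misses. The in-memory list of (query_id, query) pairs is a simplified stand-in for the sorted log files on disk.

```python
import bisect

class EventStore:
    """Map query_id -> query: cache first, sorted on-disk log as a fallback."""

    def __init__(self, cache, sorted_log):
        self._cache = cache                          # dict-like: recent query_id -> query
        self._log = sorted_log                       # list of (query_id, query), sorted by query_id
        self._log_keys = [qid for qid, _ in sorted_log]

    def lookup(self, query_id):
        query = self._cache.get(query_id)            # most clicks arrive minutes after the query
        if query is not None:
            return query
        # Cache miss: binary-search the sorted event stream (LogsEventStore).
        i = bisect.bisect_left(self._log_keys, query_id)
        if i < len(self._log_keys) and self._log_keys[i] == query_id:
            return self._log[i][1]
        return None                                  # query not logged yet: caller retries later
```
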
Purgatory (in the Log Dispatcher)
● What if a subset of the query logs is delayed?
● The Log Dispatcher reads every event from the log and keeps retrying until the join succeeds (see the sketch below)
  ○ Maintains state in persistent storage (Google File System)
  ○ Ensures at-least-once semantics
● Exponential backoff in case of failure

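A sketch of the retry loop, with the joiner call and its return convention made up for the example: the dispatcher keeps a click in purgatory and retries with jittered exponential backoff until the joiner reports success.

```python
import random
import time

def dispatch_with_retry(click, send_to_joiner, max_backoff_s=60.0):
    """Retry until the Joiner confirms the join (at-least-once semantics)."""
    backoff = 1.0
    while not send_to_joiner(click):       # False usually means the matching query log is late
        time.sleep(backoff + random.uniform(0, backoff / 2))   # jittered exponential backoff
        backoff = min(backoff * 2, max_backoff_s)
```

The real dispatcher also checkpoints pending clicks to GFS (the persistent state on the slide), so a restart does not lose events still waiting to be joined.
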
Operational challenges
● Need a 24x7 running system
  ○ Low maintenance
● Volume of input traffic fluctuates by 2x within a day
● Bursts of traffic in case of network issues
● Predicting resource requirements
● Auto-resizing the jobs to handle traffic growth and spikes
● Reliable monitoring

Production deployment
● System running in production for several months
  ○ O(700)-way sharded PaxosDB
  ○ 5 PaxosDB replicas spread across the East Coast, West Coast, and Midwest
  ○ 2 Photon pipelines running on the East Coast and West Coast
  ○ An order of magnitude higher uptime SLA than the singly-homed system
  ○ Survived multiple planned / unplanned datacenter downtimes
  ○ Much less noisy alerts

Photon withstanding a real datacenter disaster
End-to-end latency of Photon
Effectiveness of server-side batching in IdRegistry
IdRegistry dynamic time-based upsharding
Dispatcher in one datacenter catching up after downtime
Photon wasted joins minimized
Photon EventStore lookups in a single datacenter
How we addressed the systems challenges
● Exactly-once semantics
  ○ PaxosDB ensures at-most-once
  ○ Dispatcher retry ensures at-least-once
● Reliability
  ○ PaxosDB needs only a majority of members to be up
  ○ Storing global state in PaxosDB allows the pipeline to run in multiple datacenters
● Scalability
  ○ Reshard the PaxosDB servers as needed
  ○ All the other workers are stateless
● Latency
  ○ Mostly RPC-based communication amongst jobs
  ○ Most data transfers happen in RAM

Next BIG challenges
● Better resource utilization
● Join multiple log sources

More cool problems we are solving in the AdWords Backend
● Real-time logs processing
  ○ Read 100T logs per day
  ○ Compute stats over 200+ dimensions with low latency
● Petabyte-scale database storage engine
  ○ 32M rows updated per sec, 140T total rows
  ○ 200K read queries/sec with a latency of 90ms
● Efficiently backfill stats for the last N years
  ○ O(50B) rows per day
  ○ Too big to store in a database
● And many more cool projects...
  ○ Join us and find out more!
  ○ Manpreet Singh ([email protected])

Google Application Process
1. Apply Now!
2. Resume Review and Qualification
3. First Round Interviews - 2 Technical Phone Screens
4. Full-time Positions: 3-5 Onsite Interviews; Internships: 1 Host Matching Interview
5. Hiring Committee - Offer

Full-time: google.com/students/eng
Internships: google.com/students/intern

Questions