Amazon Redshift & Amazon DynamoDB
Michael Hanisch, Amazon Web Services
Erez Hadas-Sonnenschein, clipkit GmbH
Witali Stohler, clipkit GmbH
2014-05-15
© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Amazon Redshift & Amazon DynamoDB
Amazon Redshift
Amazon Redshift
Fast, simple, petabyte-scale data warehousing for less than $1,000/TB/year
A fully managed data warehouse service
• Massively parallel relational data warehouse
• Takes care of cluster management and distribution of your data
• Columnar data store with variable compression
• Optimized for complex queries across many large tables
• Use standard SQL & standard BI tools
Amazon Redshift
Amazon DynamoDB
A fully managed fast key-value store
• Fast, predictable performance
• Simple and fast to deploy
• Easy to scale as you go, up to millions of IOPS
• Pay only for what you use: read/write IOPS + storage
• Data is automatically replicated across data centers
Amazon DynamoDB
Amazon DynamoDB • Fast insert & update • Limited query capability (single table only) • NoSQL database
Amazon Redshift • Fast queries • Flexible queries (JOINs, aggregation functions, …) • SQL
Queries in Amazon DynamoDB
Queries in Amazon DynamoDB • Query or BatchQuery APIs retrieve items • Scan & filter to comb through a whole table • You have to join tables in your own code!
Amazon DynamoDB
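Since DynamoDB has no server-side JOIN, the join happens in application code. A minimal sketch of that, assuming hypothetical "videos" and "publishers" tables keyed by publisher_id (boto3 is imported lazily so the pure join helper works without AWS access):

```python
def join_items(videos, publisher):
    """The 'join' DynamoDB cannot do for us: attach publisher
    attributes to every video item in our own code."""
    return [dict(v, publisher_name=publisher.get("name")) for v in videos]

def videos_with_publisher(publisher_id, region="eu-west-1"):
    import boto3  # lazy import: only needed when actually calling AWS
    from boto3.dynamodb.conditions import Key

    dynamodb = boto3.resource("dynamodb", region_name=region)
    # Query retrieves all items sharing the hash key ...
    videos = dynamodb.Table("videos").query(
        KeyConditionExpression=Key("publisher_id").eq(publisher_id)
    )["Items"]
    # ... and a second request fetches the related item from the other table.
    publisher = dynamodb.Table("publishers").get_item(
        Key={"publisher_id": publisher_id}
    ).get("Item", {})
    return join_items(videos, publisher)
```

Every extra relationship means another round trip and more client-side merging, which is why analytical queries quickly become painful.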
Queries in Amazon DynamoDB (2) • Apache Hive on Amazon EMR can access data in DynamoDB • Run HiveQL queries for bulk processing • Can integrate data in HDFS, Amazon S3, …
HiveQL queries on Amazon EMR
Amazon DynamoDB
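The EMR DynamoDB connector exposes a DynamoDB table to Hive through an external table definition. A sketch of building that DDL, with hypothetical table and attribute names; for brevity every Hive column is declared `string` here, although the real mapping also supports numeric types:

```python
def hive_ddl(hive_table, dynamo_table, column_mapping):
    """Build the CREATE EXTERNAL TABLE statement that maps a DynamoDB
    table into Hive via the connector's storage handler."""
    # column_mapping: {hive_column: dynamodb_attribute}
    cols = ",\n  ".join(f"{h} string" for h in column_mapping)
    mapping = ",".join(f"{h}:{d}" for h, d in column_mapping.items())
    return (
        f"CREATE EXTERNAL TABLE {hive_table} (\n  {cols}\n)\n"
        "STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'\n"
        "TBLPROPERTIES (\n"
        f'  "dynamodb.table.name" = "{dynamo_table}",\n'
        f'  "dynamodb.column.mapping" = "{mapping}"\n'
        ")"
    )

ddl = hive_ddl("plays", "clipkit-plays",
               {"video_id": "videoId", "country": "country"})
```

Once the external table exists, ordinary HiveQL (including joins against data in HDFS or Amazon S3) runs against it as bulk MapReduce jobs.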
Queries in Amazon DynamoDB (3) • Import data into Amazon Redshift • Use SQL queries, use BI tools etc. • Powerful analytics and aggregation functions
Amazon Redshift
Amazon DynamoDB
Importing Data into Amazon Redshift
TMTOWTDI – There's More Than One Way To Do It
Query & Insert
#1 Query / BatchQuery against Amazon DynamoDB
#2 Retrieve items
#3 INSERT … INTO (…) on Amazon Redshift
Query & Insert The Good • Full control over queries • Decide which items you want to move to Redshift • Process data on the way
The Bad • Slow • Inefficient on the Redshift side of things • Does not scale well
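The query & insert approach can be sketched in a few lines; the item-at-a-time INSERTs are exactly what makes it slow and inefficient on the Redshift side. Table and column names are hypothetical, and the connection would come from a PostgreSQL driver such as psycopg2:

```python
def insert_statement(table, item, columns):
    """Build one parameterized INSERT for one DynamoDB item."""
    cols = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    params = tuple(item.get(c) for c in columns)
    return f"INSERT INTO {table} ({cols}) VALUES ({placeholders})", params

def copy_items(dynamo_items, redshift_conn, table, columns):
    """Move already-retrieved items into Redshift, one row per INSERT --
    full control over filtering and transformation, but it does not scale."""
    with redshift_conn.cursor() as cur:
        for item in dynamo_items:
            sql, params = insert_statement(table, item, columns)
            cur.execute(sql, params)
    redshift_conn.commit()
```

The upside of owning this loop is that arbitrary filtering and per-item processing can happen between the Query and the INSERT.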
The COPY Command
#1 COPY FROM … issued on Amazon Redshift
#2 Parallel scans against the Amazon DynamoDB table ("politely ask for the table")
#3 Items are returned and loaded into Amazon Redshift
The COPY Command • COPY a single table at a time • From one Amazon DynamoDB table into one Amazon Redshift table • Fast – executed in parallel on all data nodes in the Amazon Redshift cluster • Can be limited to use a certain percentage of provisioned throughput on the DynamoDB table
The COPY Command
COPY (col1, col2, …)
FROM 'dynamodb://'
CREDENTIALS 'aws_access_key_id=…;aws_secret_access_key=…'
READRATIO 10 -- use 10% of available read capacity
COMPROWS 0   -- how many rows to read to determine compression
[…other options…]
The COPY Command
• Attributes are mapped to columns by name
• Case of column names is ignored
• Attributes that do not map to a column are ignored
• Missing attributes are stored as NULL or empty values
• Only works for STRING and NUMBER attributes
The COPY Command The Good • Easy to use • Fast • Efficient use of resources • Scales linearly with cluster size • Only uses certain percentage of read throughput
The Bad • Whole tables only • No processing in between • Can only copy from a DynamoDB table in the same region • Only works with STRING and NUMBER types
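In an ETL script, the COPY statement shown above is typically assembled and sent over the standard PostgreSQL protocol. A sketch of building it, with hypothetical table names; credentials are passed in rather than hard-coded:

```python
def dynamodb_copy(table, columns, dynamo_table, credentials, readratio=10):
    """Build a Redshift COPY statement that pulls a whole DynamoDB table.

    readratio caps the share of the DynamoDB table's provisioned read
    throughput the load may consume (10 = 10%)."""
    return (
        f"COPY {table} ({', '.join(columns)}) "
        f"FROM 'dynamodb://{dynamo_table}' "
        f"CREDENTIALS '{credentials}' "
        f"READRATIO {readratio}"
    )
```

Executing the returned string via e.g. psycopg2's `cursor.execute()` makes every Redshift data node scan its share of the DynamoDB table in parallel.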
Query & Insert at Scale #1 Query / BatchQuery in parallel
Amazon DynamoDB
#2 Retrieve Items
#3 INSERT … INTO (…) in parallel
Amazon Redshift
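The parallel retrieval side of this pattern maps directly onto DynamoDB's parallel Scan, where the Segment/TotalSegments parameters split one table into disjoint ranges that workers read concurrently. A sketch with a hypothetical table name (boto3 is imported lazily so the pure parameter helper works without AWS access):

```python
from concurrent.futures import ThreadPoolExecutor

def scan_kwargs(table, segment, total_segments, start_key=None):
    """Parameters for one page of a parallel Scan: each worker owns
    one segment out of total_segments."""
    kwargs = {"TableName": table, "Segment": segment,
              "TotalSegments": total_segments}
    if start_key:
        kwargs["ExclusiveStartKey"] = start_key
    return kwargs

def scan_segment(table, segment, total_segments):
    import boto3  # lazy import: only needed when actually calling AWS
    client = boto3.client("dynamodb")
    items, start_key = [], None
    while True:  # page through this worker's segment
        page = client.scan(**scan_kwargs(table, segment, total_segments, start_key))
        items.extend(page["Items"])
        start_key = page.get("LastEvaluatedKey")
        if not start_key:
            return items

def parallel_scan(table, workers=8):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda seg: scan_segment(table, seg, workers),
                         range(workers))
    return [item for part in parts for item in part]
```

The same fan-out works whether the workers run as threads on one machine or as tasks on an Amazon EMR cluster.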
Query & Insert at Scale – on Amazon EMR
#1 Query / BatchQuery in parallel (Amazon EMR → Amazon DynamoDB)
#2 Retrieve items
#3 INSERT … INTO (…) in parallel (Amazon EMR → Amazon Redshift)
Query & Import using Amazon EMR
#1 Query / BatchQuery in parallel (Amazon EMR → Amazon DynamoDB)
#2 Retrieve items
#3 Export file(s) from Amazon EMR to Amazon S3
#4 COPY … FROM s3://
#5 Amazon Redshift retrieves the files from Amazon S3
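The S3 staging step boils down to serializing the retrieved items into delimited files and pointing one COPY at the whole prefix. A sketch, with hypothetical bucket, prefix, and column names:

```python
import csv
import io

def items_to_csv(items, columns, delimiter="|"):
    """Flatten already-retrieved DynamoDB items into one delimited
    staging file (missing attributes become empty fields)."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter)
    for item in items:
        writer.writerow([item.get(c, "") for c in columns])
    return buf.getvalue()

def s3_copy(table, bucket, prefix, credentials, delimiter="|"):
    """COPY loads every staged file under the prefix, in parallel
    across the Redshift data nodes."""
    return (
        f"COPY {table} FROM 's3://{bucket}/{prefix}' "
        f"CREDENTIALS '{credentials}' DELIMITER '{delimiter}'"
    )
```

Uploading the files (e.g. with boto3's `put_object`) and executing the COPY would be the remaining glue; the staging detour also allows arbitrary processing between export and load.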
Query & Import using Amazon EMR (2)
#1 Query / BatchQuery in parallel (Amazon EMR → Amazon DynamoDB)
#2 Retrieve items
#3 COPY … FROM emr://
#4 Amazon Redshift retrieves the files from HDFS
Query & Import using Amazon EMR The Good • Decide which items you want to move to Redshift • Full control over queries • Process data on the way • Scales well • Integrates with other data sources easily
The Bad • Additional complexity • Additional cost (for EMR) • Slower than direct COPY from Amazon DynamoDB
Please welcome
Erez Hadas-Sonnenschein, Sr. Product Manager
Witali Stohler, Data Warehouse & BI Specialist
clipkit GmbH
Video Syndication – The Possibilities
Content – Partner Overview
News, Sports, Cars/motor, Business/finances, Music, Gaming, Cinema, Cooking/food, Lifestyle/fashion, Traveling, Computer/mobile, Fitness/wellness, Knowledge/hobby, Entertainment
clipkit Player – Analytics (Metrics)
Full screen, Category, Playlist position, Play/Pause, Progress position, Mute/Unmute, Volume
Location (country, city), Language, Browser, Operating system, Video ID, Publisher URL, etc.
First Implementation (Expensive and Slow)
• Designed in the company's early days
• Not dimensioned for this volume of data
• Slow copy process from S3 to the database (old PHP application architecture)
• Fixed EC2 pricing (expensive to support peak hours)
• PostgreSQL scalability limitations
• At times the copy process was so slow that data arrived with a delay of ~3 days
Analytics / Metrics (Requests Graph)
Analytics / Metrics (Numbers)
• ~6,000,000 new entries per day
• ~1,000 requests per second (peak hours)
• ~25 requests per second (off-peak hours)
• Request volume grows by 4,000% over the course of a day
Second Implementation (Expensive and Slow)
• Inserts all went into one (big) table
• The COPY command only works on whole tables
• The minimum delay was one day
• Our solution had to increase the provisioned throughput, and that was expensive
NO REAL-TIME DATA
Third Implementation (Cheap and Fast)
Third Implementation – DynamoDB
• Java SDK AmazonDynamoDBAsyncClient (fire-and-forget)
• Easy to create and delete tables
• Write latency ~5 ms
• Throughput auto-scales with Dynamic DynamoDB
• One table per day
• Continuous iteration and copy to Redshift
• We only pay for what we use
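The one-table-per-day trick can be sketched as a small rotation routine: write to a table named for the current day, COPY it into Redshift, then delete it so its throughput is no longer billed. Table names, the schema, and the capacity figures are hypothetical:

```python
import datetime

def daily_table_name(prefix, day=None):
    """e.g. 'plays-2014-05-15' -- one DynamoDB table per day."""
    day = day or datetime.date.today()
    return f"{prefix}-{day.isoformat()}"

def rotate(client, prefix, day):
    """After the day's table has been copied into Redshift, drop it
    and pre-create tomorrow's table (boto3-style DynamoDB client assumed)."""
    client.delete_table(TableName=daily_table_name(prefix, day))
    client.create_table(
        TableName=daily_table_name(prefix, day + datetime.timedelta(days=1)),
        AttributeDefinitions=[{"AttributeName": "event_id", "AttributeType": "S"}],
        KeySchema=[{"AttributeName": "event_id", "KeyType": "HASH"}],
        # Write-heavy workload: high write capacity, modest read capacity
        ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 100},
    )
```

Because each day's table is short-lived, COPY never competes with live writes for long, and deleting the table is cheaper than deleting millions of individual items.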
Third Implementation – Redshift • Standard PostgreSQL JDBC • Fully managed by Amazon • Automated Backups and Fast Restores
• ~7,000 items inserted per second
• Queries over more than 1 billion entries in under 2 seconds
• Data available in near real time (maximum 1 minute delay)
Third Implementation – Conclusions
• Java web application – auto-scales (off-peak: 1 small instance)
• DynamoDB – one table per day (deleted after being copied) – auto-scales – ~5 ms PutItem latency
• Redshift – inserts ~7,000 items per second – fully managed
Thank You!