
Big Data is Dead, Long Live Business Intelligence?

Michael Muckel, Head of Data Platform
Markus Schmidberger, Data Platform Architect
Glomex GmbH – A ProSiebenSat.1 Media SE company
Berlin, April 12th 2016

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Glomex: A ProSiebenSat.1 company


Glomex – The Global Media Exchange

[Figure: the Glomex Video Value Platform connects content providers and publishers.]
Glomex Video Value Platform: Media Delivery Platform, Media Exchange Platform
Content providers: External broadcasters, Web-only content owners
Publishers: Non-P7S1 publishers

Glomex – Data Platform

Video Value Platform: Media Delivery Platform, Media Exchange Platform, Data Platform
Data Platform: Real-Time Monitoring, Batch Analytics, Machine Learning

Key Components of our New Data Platform

Real-Time Monitoring: Enable our development teams to serve our content to our users in the best quality possible.

Analytics: Provide our teams with access to the data to enable data-driven development of new features and products.

Content Discovery: Find the most relevant content for our customers and their users.

Lambda Architecture ≠ AWS Lambda

Graphic provided by http://lambda-architecture.net
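
As a minimal sketch of the Lambda Architecture idea (not the Glomex implementation), the batch layer recomputes a complete view from the master dataset, the speed layer keeps an incremental view of events the batch run has not seen yet, and the serving layer merges both at query time. The `video_id` field and the play-count view are hypothetical and only serve as an illustration.

```python
from collections import Counter


def build_batch_view(master_events):
    """Batch layer: recompute the full view from the immutable master dataset."""
    return Counter(event["video_id"] for event in master_events)


def update_speed_view(speed_view, event):
    """Speed layer: incrementally add an event not yet covered by a batch run."""
    speed_view[event["video_id"]] += 1


def query(batch_view, speed_view, video_id):
    """Serving layer: merge the batch view and the speed view for one key."""
    return batch_view[video_id] + speed_view[video_id]


# Hypothetical data: three historical plays, one play that just arrived.
master = [{"video_id": "a"}, {"video_id": "a"}, {"video_id": "b"}]
batch_view = build_batch_view(master)
speed_view = Counter()
update_speed_view(speed_view, {"video_id": "a"})
plays_a = query(batch_view, speed_view, "a")
```

Once the next batch run has absorbed the recent events, the speed view can be reset, which keeps the incremental state small.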


Simplify Data Processing

data → ingest / collect → store → process / analyze → visualize / serve → answers

Criteria: Time to Answer (Latency), Throughput, Cost (more concrete numbers at the end)

Data Processing in Big Data World

[Figure: AWS big data reference diagram organized by the stages Collect, Store, Analyze (including ETL), and Consume.]
Data types: Transactional Data, Search Data, File Data, Stream Data
Sources: Applications (Web Apps, Mobile Apps on Android and iOS), Logging, IoT
Collection: Logstash
Storage (SQL, NoSQL, Search, Cache, File Storage, Stream Storage; Hot, Warm, Cold): Amazon DynamoDB, Amazon RDS, Amazon ES, Amazon ElastiCache, Amazon S3, Amazon Glacier, Apache Kafka, Amazon Kinesis
Analysis (Interactive, Batch, Stream Processing, ML; Fast, Slow): Amazon Redshift, Impala, Pig, Amazon Elastic MapReduce, Amazon ML, Amazon Kinesis, AWS Lambda
Consumption: Analysis & Visualization (Amazon QuickSight, IDE, Notebooks), Apps & APIs, Predictions

Our Data Platform Architecture

Data Platform - MicroService Layout

[Figure: microservice layout in four columns: INGEST, STORE, PROCESS & ANALYSE, VISUALIZE & SERVE.]
Inputs: Content API, CDN files, data streams, other modules
Import services: Content Import Service, CDN Log Import Service, AdProxy Log Import Service, VAS Log Import Service, Player Feedback Import Service, External Data Import Service
Storage: Data Lake, Data Layer
Services: Content Discovery Service, Data Management Service, Metadata Service, KPI & Analytics Service, Technical Monitoring Service, Dev / Ops Analytics Service, Data Quality Service, Data Science Analytics Service, Data Platform Monitoring Service
Outputs: Data API, Portal, Real-Time Dashboards, Data Science UI, Data Platform Access, Team

Real-Time Player Monitoring Data Platform - MicroService Layout


Monitoring Video-Streaming Experience

Focus on Metrics from the User's Perspective

From Server-Uptime to (anonymized) Real-User Monitoring

1. Analyze
2. Take Actions
3. Automate

Our Ingest Process


Kinesis Firehose is doing its job

Next session: “Streaming Data: The Opportunity and How to Work With It”
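
The slides do not show how events reach Kinesis Firehose. As a hedged sketch, a producer could group click-stream events into batches that respect the `PutRecordBatch` limits (500 records and 4 MiB per call). The event fields and the injected `put_record_batch` callable are assumptions; in practice the callable could wrap `boto3.client("firehose").put_record_batch(DeliveryStreamName=..., Records=batch)`.

```python
import json

MAX_RECORDS_PER_BATCH = 500            # PutRecordBatch limit per call
MAX_BYTES_PER_BATCH = 4 * 1024 * 1024  # 4 MiB per call


def to_firehose_record(event):
    """Serialize one event as newline-terminated JSON, so objects in S3 stay line-splittable."""
    payload = json.dumps(event, separators=(",", ":")) + "\n"
    return {"Data": payload.encode("utf-8")}


def batch_records(events):
    """Yield lists of records that stay within the count and size limits."""
    batch, batch_bytes = [], 0
    for event in events:
        record = to_firehose_record(event)
        size = len(record["Data"])
        too_many = len(batch) == MAX_RECORDS_PER_BATCH
        too_big = batch_bytes + size > MAX_BYTES_PER_BATCH
        if batch and (too_many or too_big):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(record)
        batch_bytes += size
    if batch:
        yield batch


def send_all(events, put_record_batch):
    """Send all events through the injected callable and return the number of records sent."""
    sent = 0
    for batch in batch_records(events):
        put_record_batch(batch)
        sent += len(batch)
    return sent


# Hypothetical events; a list stands in for the real delivery stream.
collected = []
total_sent = send_all(({"event": "play", "n": i} for i in range(1201)), collected.append)
```

A production producer would also inspect the response for partially failed records and retry them; that part is left out here.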


Data Facts

20 GB: click-stream data per day in Kinesis Firehose

5 billion: records processed per day

~100 ms: data freshness to S3

ElasticSearch + Grafana for real-time analyses

Not AWS managed!
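
For illustration only, the following sketch builds a request body in the newline-delimited format of the Elasticsearch Bulk API, which is one common way to load monitoring events into a self-managed cluster. The index name and document fields are hypothetical, and the slides do not state how Glomex indexes its data. Older Elasticsearch versions also expect a `_type` in the action line.

```python
import json


def bulk_body(index_name, docs):
    """Build a Bulk API body: one action line followed by one source line per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    if not lines:
        return ""
    # The Bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical daily index and player events.
body = bulk_body(
    "player-events-2016.04.12",
    [{"video_id": "a", "buffering_ms": 120}, {"video_id": "b", "buffering_ms": 0}],
)
```

The body would be sent with an HTTP POST to the cluster's `_bulk` endpoint; a time-based index name makes it easy to point a dashboard at a date range.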


ElasticSearch on Spot Instances


CDN ’Batch Processing’ Data Platform - MicroService Layout


Processing CDN-Logs

25 GB: zipped log files per day

300 million+: records processed per day

Normal challenges with external data sources: out-of-order delivery, data quality issues, varying file sizes, etc.
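
One hedged way to cope with these challenges is to make file processing idempotent and tolerant of bad lines: skip files that were already handled, count rejected lines instead of failing, and partition records by their own timestamp rather than by arrival order. The tab-separated log format below (timestamp, status, bytes, url) is hypothetical; the real CDN format is not given in the slides.

```python
import gzip
import io

EXPECTED_FIELDS = 4  # hypothetical layout: timestamp, status, bytes, url


def parse_line(line):
    """Return a record dict, or None if the line fails basic quality checks."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != EXPECTED_FIELDS:
        return None
    timestamp, status, size, url = parts
    if not (status.isdigit() and size.isdigit()):
        return None
    return {"timestamp": timestamp, "status": int(status), "bytes": int(size), "url": url}


def process_file(file_key, gz_bytes, processed_keys):
    """Parse one gzipped log file once; return (records grouped by day, rejected line count)."""
    if file_key in processed_keys:
        return {}, 0  # a re-delivered file is ignored
    by_day, rejected = {}, 0
    with gzip.open(io.BytesIO(gz_bytes), "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            record = parse_line(line)
            if record is None:
                rejected += 1
                continue
            # Partition by event date (assumes ISO 8601 timestamps), not by arrival order.
            by_day.setdefault(record["timestamp"][:10], []).append(record)
    processed_keys.add(file_key)
    return by_day, rejected


# Hypothetical sample: two valid lines from different days and one broken line.
sample = gzip.compress(
    b"2016-04-12T10:00:00Z\t200\t1024\t/a.mp4\n"
    b"broken line\n"
    b"2016-04-11T23:59:59Z\t404\t0\t/b.mp4\n"
)
seen = set()
first_result = process_file("cdn/part-0001.gz", sample, seen)
second_result = process_file("cdn/part-0001.gz", sample, seen)
```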


Requirements for our Data Processing Pipeline

• Monitor Complete Pipeline
• Enable Reprocessing of Historical Datasets
• Be Ready to Scale

Our CDN Pipeline


AWS Lambda Limits

5 min: AWS Lambda timeout
512 MB: AWS Lambda temp disk

• How to process an 800 MB gzipped logfile?
• How to split compressed gzip files?
• Splitter using Amazon SQS and Amazon EC2 Spot Instances
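
A gzip stream cannot be cut at arbitrary byte offsets, so a splitter has to decompress it sequentially and re-compress smaller parts. The sketch below shows that idea with a bounded memory footprint; it is an assumption about how such a splitter might work, not the implementation from the talk, and the SQS and Spot Instance wiring is left out.

```python
import gzip
import io


def split_gzip_lines(src, max_lines):
    """Yield gzip-compressed chunks holding at most max_lines lines each.

    src is a binary file object containing a gzip stream; lines are never cut in half.
    """
    buffer, count = io.BytesIO(), 0
    writer = gzip.GzipFile(fileobj=buffer, mode="wb")
    with gzip.GzipFile(fileobj=src, mode="rb") as reader:
        for line in reader:
            writer.write(line)
            count += 1
            if count == max_lines:
                writer.close()  # flushes the gzip trailer into the buffer
                yield buffer.getvalue()
                buffer, count = io.BytesIO(), 0
                writer = gzip.GzipFile(fileobj=buffer, mode="wb")
    writer.close()
    if count:
        yield buffer.getvalue()


# Hypothetical input: ten log lines split into parts of four lines.
original = b"".join(b"line %d\n" % i for i in range(10))
chunks = list(split_gzip_lines(io.BytesIO(gzip.compress(original)), 4))
```

Each chunk is a valid gzip file on its own, so it can be uploaded separately and processed by a function that stays within the timeout and temp disk limits.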

Our Meta Data Store

AWS Big Data Blog:

https://blogs.aws.amazon.com/bigdata/post/Tx2YRX3Y16CVQFZ/Building-andMaintaining-an-Amazon-S3-Metadata-Index-without-Servers
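
In the spirit of the linked blog post, a metadata index can be fed by S3 event notifications that trigger a function for each new object. The following sketch only converts a notification into index items; the item layout is hypothetical, and writing the items to a store (for example a DynamoDB table via `put_item`) is left to an injected callable.

```python
from urllib.parse import unquote_plus


def index_items_from_s3_event(event):
    """Extract one metadata item per created object from an S3 event notification."""
    items = []
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue  # ignore deletions and other event types
        s3 = record["s3"]
        items.append({
            "bucket": s3["bucket"]["name"],
            # Object keys arrive URL-encoded in notifications.
            "key": unquote_plus(s3["object"]["key"]),
            "size": s3["object"].get("size", 0),
            "event_time": record["eventTime"],
        })
    return items


def handler(event, context=None, put_item=print):
    """Lambda-style entry point; put_item is a stand-in for the real index writer."""
    items = index_items_from_s3_event(event)
    for item in items:
        put_item(item)
    return len(items)


# Hypothetical notification with one created and one removed object.
sample_event = {"Records": [
    {"eventName": "ObjectCreated:Put", "eventTime": "2016-04-12T10:00:00.000Z",
     "s3": {"bucket": {"name": "cdn-logs"}, "object": {"key": "2016/04/12/cdn+log.gz", "size": 2048}}},
    {"eventName": "ObjectRemoved:Delete", "eventTime": "2016-04-12T10:05:00.000Z",
     "s3": {"bucket": {"name": "cdn-logs"}, "object": {"key": "old.gz"}}},
]}
written = []
indexed_count = handler(sample_event, put_item=written.append)
```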


Be serverless and serve data

Amazon Kinesis, AWS Lambda

AWS Lambda, Amazon API Gateway
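
A rough sketch of the two serverless pairings: one function decodes the base64-encoded payloads that Kinesis hands to AWS Lambda, and another answers an API Gateway request with JSON. The payload fields, the `video_id` path parameter, and the `lookup` callable are hypothetical; the response shape assumes the API Gateway Lambda proxy integration, which may differ from the setup used in the talk.

```python
import base64
import json


def kinesis_handler(event, context=None):
    """Decode the JSON payloads of all Kinesis records in one Lambda invocation."""
    decoded = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        decoded.append(json.loads(payload.decode("utf-8")))
    return decoded


def make_api_handler(lookup):
    """Create a handler that serves lookup(video_id) as a JSON HTTP response."""
    def api_handler(event, context=None):
        params = event.get("pathParameters") or {}
        result = lookup(params.get("video_id"))
        found = result is not None
        return {
            "statusCode": 200 if found else 404,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps(result if found else {"error": "not found"}),
        }
    return api_handler


# Hypothetical invocation data.
kinesis_event = {"Records": [
    {"kinesis": {"data": base64.b64encode(b'{"video_id": "a"}').decode("ascii")}},
]}
decoded_events = kinesis_handler(kinesis_event)
api_handler = make_api_handler({"a": {"plays": 3}}.get)
ok_response = api_handler({"pathParameters": {"video_id": "a"}})
missing_response = api_handler({"pathParameters": None})
```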

CDN Batch Facts

2.3 min: average run-time of AWS Lambda
600 rec/sec: processing time
6: parallel AWS Lambda functions
1 $ / hour: cost for 25 GB/day CDN processing

[Charts: AWS Lambda duration, Redshift CPU]
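
As a back-of-the-envelope check of these figures against the 300 million CDN records per day mentioned earlier, the calculation below compares the required average rate with the available rate. It assumes that 600 rec/sec is the rate of a single function and that 6 functions run in parallel; the slides do not state this explicitly.

```python
RECORDS_PER_DAY = 300_000_000   # "300 Million +" CDN records per day
SECONDS_PER_DAY = 24 * 60 * 60
RATE_PER_FUNCTION = 600         # rec/sec, assumed to be per AWS Lambda function
PARALLEL_FUNCTIONS = 6

# Average rate needed to keep up with one day of logs.
required_rate = RECORDS_PER_DAY / SECONDS_PER_DAY
# Combined rate under the per-function assumption.
available_rate = RATE_PER_FUNCTION * PARALLEL_FUNCTIONS
headroom = available_rate / required_rate
```

Under these assumptions the required average is about 3,472 records per second against 3,600 available, so the pipeline keeps up with only a few percent of headroom on average.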


Data Science Environment Data Platform - MicroService Layout


Data Science Environment

Project Jupyter: http://jupyter.org/

Data Science Environment - Architecture

Data Sources: Amazon Kinesis, Amazon Redshift, Amazon S3, Elasticsearch
Cluster Technology: Amazon EMR (in development)
Development: Github (in development)

Our Lambda Architecture on AWS

Data Platform - Lambda Architecture

[Figure: the data platform mapped onto the three layers of the Lambda Architecture.]
Layers: Batch Layer, Speed Layer, Serving Layer
Inputs: other player modules, CDN files, data stream
Components shown: Instance with Kinesis Agent, Amazon Kinesis Firehose, AWS Lambda, S3, Amazon Redshift, Amazon Elastic MapReduce + Spark, Amazon API Gateway, EC2 with Elasticsearch, EC2 with Grafana, EC2 with Caravel, EC2 with Jupyter
Consumers: Portal, Team, Applications

Key Takeaways

Lambda Architecture
• Enrich your traditional, batch-driven BI workflow with real-time analytics
• Use the Lambda Architecture as a guiding principle and adapt it to your needs

Key Takeaways

Focus on feature development and robust pipelines, not on infrastructure management
• AWS managed services provide a robust way to run complex big data infrastructures
• Follow best practices provided by AWS and the community

Key Takeaways

Provide an open data environment
• Trust the creativity of your engineering teams to find insights in your datasets
• Structure your data so that it can be accessed in processed and raw form
• Notebooks provide easy access to even large distributed datasets

We are hiring …

• Data Scientists
• Data Engineers
• Project Managers

Michael Muckel, Head of Data Platform
Markus Schmidberger, Data Platform Architect
Glomex GmbH – A ProSiebenSat.1 Media SE company