Hortonworks Data Platform
Mohd Izhar, Abyres Sdn Bhd

© Hortonworks Inc. 2011 – 2015. All Rights Reserved

About Hortonworks

Customer Momentum
• ~700 customers (as of November 4, 2015)
• 152 customers added in Q3 2015
• Publicly traded on NASDAQ: HDP

Hortonworks Data Platform
• Completely open multi-tenant platform for any app and any data
• Consistent enterprise services for security, operations, and governance

Partner for Customer Success
• Leader in open-source community, focused on innovation to meet enterprise needs
• Unrivaled Hadoop support subscriptions

Founded in 2011
Original 24 architects, developers, operators of Hadoop from Yahoo!
800+ employees
1500+ ecosystem partners


Big Data: One of Top Priorities for CIOs

Source: September 2014 survey of 100 CIOs from the US and Europe


The Chasm

We are here…

Adoption segments:
• Innovators & enthusiasts
• Early adopters & visionaries
• Early majority & pragmatists
• Late majority & conservatives
• Laggards & skeptics

Hadoop Journey


Big Data Functional Architecture: Key Tenets of Lambda Architecture

Figure: Lambda Architecture. Elements shown: new data stream; batch layer (store; pre-compute views & deep learning); speed layer (process streams; incremental views); serving layer (business views); query.

Batch Layer
• Manages master data
• Immutable, append-only set of raw data
• Cleanse, normalize & pre-compute batch views
• Advanced statistical calculations

Speed Layer
• Real-time event stream processing
• Computes real-time views

Serving Layer
• Low-latency, ad-hoc query
• Reporting, BI & dashboard
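The three layers can be illustrated with a minimal single-process sketch. It assumes a hypothetical event format of (user, page) and a page-view count as the only view; the class and method names are illustrative and are not part of any HDP component.

```python
from collections import Counter


class LambdaSketch:
    """Toy Lambda Architecture: batch, speed and serving layers in one process."""

    def __init__(self):
        self.master_data = []           # batch layer: immutable, append-only raw events
        self.batch_view = Counter()     # batch layer: view pre-computed from master data
        self.realtime_view = Counter()  # speed layer: incremental view of recent events

    def ingest(self, event):
        """A new event is sent to both the batch layer and the speed layer."""
        _user, page = event
        self.master_data.append(event)  # append only, never updated in place
        self.realtime_view[page] += 1   # incremental update for low latency

    def run_batch(self):
        """Recompute the batch view from all raw data, then reset the speed view."""
        self.batch_view = Counter(page for _user, page in self.master_data)
        self.realtime_view.clear()

    def query(self, page):
        """Serving layer: merge the batch view with the real-time view."""
        return self.batch_view[page] + self.realtime_view[page]
```

In this sketch a query stays correct between batch runs because events that the last batch run has not yet seen are still counted in the real-time view.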

Big Data Functional Architecture: Key Tenets of Lambda Architecture (Hadoop 1)

Figure: the Lambda Architecture diagram (batch, speed and serving layers) annotated with Hadoop 1.

Big Data Functional Architecture: Key Tenets of Lambda Architecture (Hadoop 2)

Figure: the Lambda Architecture diagram (batch, speed and serving layers) annotated with Hadoop 2.

Hadoop 2 & YARN

Timeline markers: 2006; 2009; October 23, 2013
Milestone labels: Hadoop w/ MapReduce; MR-279: YARN; Hadoop 2 & YARN based architecture

Hadoop w/ MapReduce: largely batch processing
• MapReduce (batch) on HDFS (Hadoop Distributed File System), nodes 1…N
• Silo'd clusters
• Largely batch system
• Difficult to integrate

YARN: Data Operating System
• Batch, interactive and real-time workloads on HDFS (Hadoop Distributed File System), nodes 1…N

Architected & led development of YARN to enable the Modern Data Architecture

Apache Hadoop – Data Operating System

Shared Compute & Workload Management
• Common data platform, many applications
• Support multi-tenant access & processing
• Batch, interactive & real-time use cases

BATCH, INTERACTIVE & REAL-TIME DATA ACCESS
• Script: Pig (on Tez)
• SQL: Hive (on Tez)
• Java / Scala: Cascading (on Tez)
• NoSQL: HBase, Accumulo (on Slider)
• Stream: Storm (on Slider)
• In-Memory: Spark
• Search: Solr
• Others: ISV Engines

YARN: Data Operating System (Cluster Resource Management)

Common & Shared Scale Out Storage
• Shared data assets
• Flexible schema
• Cross workload access

HDFS (Hadoop Distributed File System)

Enterprise Hadoop

Hortonworks Data Platform

Core Capabilities of Enterprise Hadoop

PRESENTATION & APPLICATION
Enable both existing and new applications to provide value to the organization

ENTERPRISE MGMT & SECURITY
Empower existing operations and security tools to manage Hadoop

GOVERNANCE & INTEGRATION
Load data and manage according to policy

DATA ACCESS
Access your data simultaneously in multiple ways (batch, interactive, real-time)

DATA MANAGEMENT
Store and process all of your Corporate Data Assets

SECURITY
Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection

OPERATIONS
Deploy and effectively manage the platform

DEPLOYMENT OPTIONS
Provide deployment choice across physical, virtual, cloud

Hortonworks Data Platform 2.3

GOVERNANCE & INTEGRATION
• Data Lifecycle & Governance: Falcon, Atlas
• Data Workflow: Sqoop, Flume, Kafka, NFS, WebHDFS

DATA ACCESS
• Batch: MapReduce
• Script: Pig
• SQL: Hive
• NoSQL: HBase, Accumulo, Phoenix
• Stream: Storm
• Search: Solr
• In-memory: Spark
• Others: ISV Engines
• Frameworks shown beneath the engines: Tez, Slider
• YARN: Data Operating System

DATA MANAGEMENT
• HDFS (Hadoop Distributed File System), nodes 1…N

SECURITY
• Administration, Authentication, Authorization, Auditing, Data Protection
• Ranger, Knox, Atlas, HDFS Encryption

OPERATIONS
• Provisioning, Managing, & Monitoring: Ambari, Cloudbreak, Zookeeper
• Scheduling: Oozie

Deployment Choice
• Linux, Windows
• On-Premise, Cloud

HDP At A Glance

Data Storage and Resource Management
• HDFS (storage) works closely with MapReduce (data processing) to provide scalable, fault-tolerant, cost-efficient storage for big data.
• YARN is the process resource manager. It is designed to be co-deployed with HDFS so that there is a single cluster, providing the ability to move the computation resource to the data.

Data Engines
• MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a Hadoop cluster.
• Tez (SQL) leverages the MapReduce paradigm to enable the execution of complex directed acyclic graphs (DAGs) of tasks. Tez eliminates unnecessary tasks, synchronization barriers and I/O to HDFS, speeding up data processing.
• Spark is an open-source cluster computing framework for data analytics. It provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to machine learning algorithms.
• Storm (data streaming) is a distributed real-time computation system for processing fast, large streams of data. Storm topologies can be written in any programming language.

Data Access
• Pig (scripting) is a high-level language (Pig Latin) for data analysis programs, and infrastructure for evaluating these programs via MapReduce.
• Hive (SQL) provides data warehouse infrastructure, enabling data summarization, ad-hoc query and analysis of large data sets. The query language, HiveQL (HQL), is similar to SQL.
• HCatalog (SQL) is a table and storage management layer that provides users of Pig, MapReduce and Hive with a relational view of data in HDFS.
• HBase (NoSQL) is a non-relational database that provides random real-time access to data in very large tables. HBase supports updates, inserts and deletes, with atomic operations at the row level. HBase includes support for SQL through Phoenix.
• Accumulo (NoSQL) is based on Google's BigTable design and adds a few novel improvements, notably cell-based access control.
• Phoenix is a SQL skin over HBase, delivered as a client-embedded JDBC driver targeting low-latency queries over HBase data.
• Solr is an open source enterprise search platform from the Apache Lucene project. It includes full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document handling, distributed search and index replication.
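The MapReduce programming model named above can be illustrated with a word count. This is a minimal single-process sketch of the map, shuffle and reduce steps; the function names are illustrative and this is not the Hadoop MapReduce API, which distributes the same steps across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values by key, as a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

Because each map call sees one record and each reduce call sees one key, both phases can run in parallel on different nodes, which is what makes the model suit large data sets.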

HDP At A Glance

Governance & Integration
• Atlas is a scalable and extensible set of core foundational governance services that enables enterprises to effectively and efficiently meet their compliance requirements within Hadoop.
• Falcon is a data governance engine that defines Oozie data pipelines, monitors those pipelines (in coordination with Ambari), and tracks those pipelines for dependencies and audits.
• Kafka (message broker): Apache Kafka is publish-subscribe messaging implemented as a distributed commit log.
• Sqoop is a tool that efficiently transfers bulk data between Hadoop and structured datastores such as relational databases.
• Flume is a distributed service to collect, aggregate and move large amounts of streaming data into HDFS.

Security
• Knox (perimeter security) is a REST gateway for Hadoop providing network isolation, SSO, authentication, authorization and auditing functions.
• Ranger (security management) is a framework to monitor and manage security assets across a cluster. It provides central administration, monitoring and security hooks for Hadoop applications.

Operations & Management
• Ambari (cluster management) is an operational framework to provision, manage and monitor Hadoop clusters. It includes a web interface for administrators to start/stop/test services and change configurations.
• Cloudbreak is a cloud-agnostic Hadoop-as-a-Service API. It abstracts provisioning and eases management and monitoring of on-demand clusters.
• Zookeeper is a distributed configuration and synchronization service and a naming registry for distributed systems. Distributed applications use Zookeeper to store and mediate updates to important configuration information.
• Oozie (job scheduling) enables administrators to build complex data transformations, enables greater control over long-running jobs and can schedule repetitions of those jobs.

Libraries
• Mahout is a suite of machine learning libraries designed to be scalable and robust.
• Cascading is an application development platform for building data applications on Apache Hadoop.
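The description of Kafka as publish-subscribe messaging built on a commit log can be sketched as follows. This is a toy, single-process stand-in for one log; the class and method names are hypothetical and are not the Kafka client API, and the distribution and replication that Kafka provides are left out.

```python
class CommitLog:
    """Toy append-only log with independent read positions per subscriber."""

    def __init__(self):
        self._records = []   # append-only; records are never modified
        self._offsets = {}   # next offset to read, per subscriber name

    def publish(self, record):
        """Append a record and return its offset in the log."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, subscriber, max_records=10):
        """Return the next unread records for this subscriber and advance it."""
        start = self._offsets.get(subscriber, 0)
        batch = self._records[start:start + max_records]
        self._offsets[subscriber] = start + len(batch)
        return batch
```

Each subscriber keeps its own offset, so a slow consumer and a fast consumer can read the same records without affecting each other.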

Data Access


Stinger Initiative – DELIVERED

Interactive SQL-IN-Hadoop Delivered

Stinger Project: next generation SQL based interactive query in Hadoop
• Speed: Hive query performance has increased by 100X to allow for interactive query times (seconds)
• Scale: The only SQL interface to Hadoop designed for queries that scale from TB to PB
• SQL: Support broadest range of SQL semantics for analytic applications running against Hadoop

Stinger Phase 1:
• Base Optimizations
• SQL Types
• SQL Analytic Functions
• ORCFile Modern File Format

Stinger Phase 2:
• SQL Types
• SQL Analytic Functions
• Advanced Optimizations
• Performance Boosts via YARN

Stinger Phase 3:
• Hive on Apache Tez
• Query Service (always on)
• Buffer Cache
• Cost Based Optimizer (Optiq)

Figure: stack of Custom Apps, Business Analytics, SQL (Window Functions), Apache Hive, Apache MapReduce, Apache Tez, Apache YARN, ORC File and HDFS (Hadoop Distributed File System), nodes 1…N.

Apache Hive Contribution… an Open Community at its finest
• 1,672 Jira Tickets Closed
• 145 Developers
• 44 Companies
• ~390,000 Lines Of Code Added… (2x)
• 13 Months

Stream Processing With Storm

Apache Storm
Real-time event processing for sensor and business activity monitoring
• Unlocks new business cases for Hadoop
• Key component of a data lake architecture
• Scale: Ingest millions of events per second. Fast query on petabytes of data
• Integrated with Ambari to manage

Investment Phases

Phase-1
• Install, Start, & Stop via Ambari
• Kafka, HBase, & HDFS Connectors
• Ganglia & Nagios based monitoring

Phase-2
• Storm-on-YARN
• Ingest & Notification for JMS
• Data persistence: EDWs, RDBMS, Cassandra

Phase-3
• High Availability mgmnt w/Ambari
• AD/LDAP plugin for authentication
• Declarative "wiring"
• Hive update support
• Advanced scheduler

In-Memory With Spark

Apache Spark
Spark allows you to do fast iterative processing on in-memory datasets
• Spark SQL
• Spark Streaming
• MLlib
• GraphX

Distributed Database With Apache HBase

• 100% Open Source
• Store and Process Petabytes of Data
• Flexible Schema
• Scale out on Commodity Servers
• High Performance, High Availability
• Integrated with YARN
• SQL and NoSQL Interfaces

• Dynamic Schema
• Scales Horizontally to PB of Data
• Directly Integrated with Hadoop

Figure: HBase RegionServers on YARN (Data Operating System) over HDFS (permanent data storage), nodes 1…N.
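The "dynamic schema" and random-access points can be made concrete with a toy model of a wide-column table: row keys kept in sorted order, columns named "family:qualifier" that exist only where written, and timestamped cell versions. This is a sketch under those assumptions; the class is hypothetical and is not the HBase client API.

```python
import bisect
import time


class ToyTable:
    """In-memory toy: sorted row keys, sparse columns, timestamped versions."""

    def __init__(self):
        self._keys = []   # row keys kept sorted for range scans
        self._rows = {}   # row key -> {"family:qualifier": [(ts, value), ...]}

    def put(self, row, column, value, ts=None):
        """Write one cell version; any column name is accepted (dynamic schema)."""
        ts = time.time() if ts is None else ts
        if row not in self._rows:
            bisect.insort(self._keys, row)
            self._rows[row] = {}
        versions = self._rows[row].setdefault(column, [])
        versions.append((ts, value))
        versions.sort(key=lambda pair: pair[0], reverse=True)  # newest first

    def get(self, row, column):
        """Random read: newest value of one cell, or None if absent."""
        versions = self._rows.get(row, {}).get(column)
        return versions[0][1] if versions else None

    def scan(self, start, stop):
        """Range scan: row keys in [start, stop), in sorted order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return self._keys[lo:hi]
```

Keeping row keys sorted is what lets a table be split into contiguous key ranges and spread across servers, which is the role the RegionServers play in the figure.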

HDP Search

Apache Solr
Open source enterprise search for Hadoop and HDP
• Open architecture: In the community, for the community
• Simple, powerful UI for advanced search applications
• High performance indexing & sub-second search times over billions of documents
• Deep Integration Roadmap with HDP

SQL on Hadoop Summary

Apache Hive
• Strengths: Most comprehensive SQL; Scale; Maturity
• Use Cases: ETL Offload; Reporting; Large-scale aggregations
• Unique Capabilities: Robust cost-based optimizer; Mature ecosystem (BI, backup, security and replication)

SparkSQL
• Strengths: In-memory; Low latency
• Use Cases: Exploratory analytics; Dashboards
• Unique Capabilities: Language-integrated Query

Apache Phoenix
• Strengths: Real-time read / write; Transactions; High concurrency
• Use Cases: Dashboards; System-of-engagement; Drill-down / Drill-up
• Unique Capabilities: Real-time read / write

Data Governance


Data Governance With Apache Atlas

Apache Atlas

Industry standards and models
• Healthcare: HIPAA, HL7
• Financial: SOX, Dodd-Frank
• Energy: PPDM
• Retail: PCI, PII
• Other: CWM

Knowledge Store
• Taxonomies
• Policy Rules
• Type-System
• Models
• Audit Store

Services
• Search
• Lineage
• Exchange
• Data Lifecycle Management
• Tag Based Policies
• Real Time Tag Based Access Control

REST API
• Modern, flexible access to Atlas services, HDP components and external tools

Search
• SQL-like DSL (Domain Specific Language)
• Support for key word, faceted and full text searches

Lineage
• Capture all SQL runtime activity on HiveServer2, providing lineage for both data and schema

Exchange
• Leverage existing metadata by importing it from ETL tools, ERP systems and data warehouses
• Export metadata to downstream systems

Data Life Cycle Management With Apache Falcon

Automate the movement and processing of data sets
• No hard-coding complex data set and pipeline processing

Dataset Lifecycle Management
• Replicate datasets (HDFS, Hive) as part of DR/Backup/Archival plans
• Falcon triggers processes for retries and handles late data arrival.
• Establish the retention policies for datasets.
• Falcon schedules and handles eviction

Figure: Production Cluster and Backup & Archive Cluster.
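The retention and eviction idea on the Falcon slide can be sketched as a cutoff check. This is an illustration only, assuming a hypothetical mapping from dataset instance path to timestamp; it is not Falcon's API or its entity definition format.

```python
from datetime import datetime, timedelta


def instances_to_evict(instances, retention, now):
    """Return dataset instances older than the retention window.

    `instances` maps a hypothetical instance path to its datetime;
    `retention` is a timedelta such as timedelta(days=30).
    """
    cutoff = now - retention
    return sorted(path for path, ts in instances.items() if ts < cutoff)
```

A scheduler would call a check like this periodically and delete, or archive to the backup cluster, whatever it returns.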

Security


HDP Security: comprehensive, complete and simple

Security in HDP is the most comprehensive and complete for Hadoop
• HDP ensures comprehensive enforcement of security policy across the entire Hadoop stack
• HDP provides functionality across the complete set of security requirements
• HDP is the only solution to provide a single simple interface for security policy definition and maintenance

Administration: Central management & consistent security
Authentication: Authenticate users and systems
Authorization: Provision access to data
Audit: Maintain a record of data access
Data Protection: Protect data at rest and in motion

Security with HDP

HDP 2.3: Centralized Security Administration w/ Ranger

Authentication: Who am I/prove it?
• Kerberos
• API security with Apache Knox

Authorization: What can I do?
• Fine grain access control with Apache Ranger

Audit: What did I do?
• Centralized audit reporting w/ Apache Ranger

Data Protection: Can data be encrypted at rest and over the wire?
• Wire encryption in Hadoop
• Native HDFS and partner encryption
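The authorization ("What can I do?") and audit ("What did I do?") questions can be sketched together as a central policy check that logs every decision. The classes below are hypothetical and only illustrate the idea; they are not the Ranger policy model or API, and authentication is assumed to have happened already.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    resource: str   # hypothetical resource name, for example a table
    users: set      # users the policy applies to
    actions: set    # actions the policy grants


@dataclass
class Authorizer:
    policies: list
    audit_log: list = field(default_factory=list)

    def is_allowed(self, user, action, resource):
        """Authorization: allow only if some policy grants the action."""
        allowed = any(
            p.resource == resource and user in p.users and action in p.actions
            for p in self.policies
        )
        # Audit: record every decision, whether allowed or denied.
        self.audit_log.append((user, action, resource, allowed))
        return allowed
```

Denying by default and logging denials as well as grants keeps the audit trail complete, which is the point of managing policy in one place.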

Operations

Hadoop Operations With Apache Ambari
• Provision, manage and monitor Hadoop clusters
• Provision Developer Views for data analysis and application development

Hadoop Development with Apache Ambari
• Visual SQL editor empowers developers to efficiently write, view, and edit SQL queries

Cloudbreak - Launch HDP on Any Cloud for Any Application

Cloudbreak
1. Pick a Blueprint
2. Choose a Cloud
3. Launch HDP!

Example Ambari Blueprints:
• IoT Apps (Storm, HBase, Hive)
• BI / Analytics (Hive)
• Data Science (Spark)
• Dev / Test (all HDP services)
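To show what "pick a blueprint" might involve, the sketch below builds a small blueprint document as JSON. The overall shape (a "Blueprints" header plus "host_groups" with components and cardinality) follows the Ambari Blueprint format as I understand it; the blueprint name, host group layout and component names are assumptions for illustration and should be checked against the stack definition in use.

```python
import json


def make_blueprint(name, stack_version, host_groups):
    """Build a blueprint document as a plain dict.

    `host_groups` maps a group name to (cardinality, [component names]).
    """
    return {
        "Blueprints": {
            "blueprint_name": name,
            "stack_name": "HDP",
            "stack_version": stack_version,
        },
        "host_groups": [
            {
                "name": group,
                "cardinality": cardinality,
                "components": [{"name": c} for c in components],
            }
            for group, (cardinality, components) in host_groups.items()
        ],
    }


# Hypothetical "Data Science (Spark)" layout; component names are assumed.
data_science = make_blueprint(
    "data-science",
    "2.3",
    {
        "master": ("1", ["NAMENODE", "RESOURCEMANAGER", "SPARK_JOBHISTORYSERVER"]),
        "worker": ("1+", ["DATANODE", "NODEMANAGER", "SPARK_CLIENT"]),
    },
)
print(json.dumps(data_science, indent=2))
```

Because the blueprint describes services per host group rather than per host, the same document can be reused when choosing a different cloud in step 2.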

Proactive Support With SmartSense

Hortonworks Data Platform

Figure: HDP component stack, repeated from the Hortonworks Data Platform 2.3 slide (governance & integration, data access, data management, security, operations, deployment choice).

Enterprise Hadoop – Hortonworks Data Platform
Together We Modernize Data Architectures