Hortonworks Data Platform
Mohd Izhar, Abyres Sdn Bhd
© Hortonworks Inc. 2011 – 2015. All Rights Reserved
About Hortonworks
Founded in 2011
Original 24 architects, developers, operators of Hadoop from Yahoo!
800+ employees
1500+ ecosystem partners

Customer Momentum
• ~700 customers (as of November 4, 2015)
• 152 customers added in Q3 2015
• Publicly traded on NASDAQ: HDP

Hortonworks Data Platform
• Completely open multi-tenant platform for any app and any data
• Consistent enterprise services for security, operations, and governance

Partner for Customer Success
• Leader in open-source community, focused on innovation to meet enterprise needs
• Unrivaled Hadoop support subscriptions
Big Data: One of Top Priorities for CIOs
September 2014 survey of 100 CIOs from the US and Europe
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
The CHASM
We are here…
Innovators & enthusiasts
Early adopters & visionaries
Early majority & pragmatists
Late majority & conservatives
Laggards & Skeptics
Hadoop Journey
Big Data Functional Architecture: Key Tenets of Lambda Architecture
[Diagram: a New Data Stream feeds the Batch Layer (Store; Pre-Compute Views & Deep Learning) and the Speed Layer (Process Streams; Incremental Views); Business Views are exposed through the Serving Layer for Query]

Batch Layer
• Manages master data
• Immutable, append-only set of raw data
• Cleanse, Normalize & Pre-Compute Batch Views
• Advanced Statistical Calculations

Speed Layer
• Real Time Event Stream Processing
• Computes Real-Time Views

Serving Layer
• Low-latency, ad-hoc query
• Reporting, BI & Dashboard
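The three Lambda layers listed above can be illustrated with a small, self-contained Python sketch. It is not taken from the slides: the page-view events, the function names and the use of simple counters are all assumptions chosen only to show how a batch view and a real-time view might be merged at query time.

```python
from collections import Counter

def build_batch_view(master_data):
    """Batch layer: recompute a view from the immutable, append-only master data."""
    return Counter(event["page"] for event in master_data)

def update_realtime_view(realtime_view, event):
    """Speed layer: fold one newly arrived event into the incremental view."""
    realtime_view[event["page"]] += 1
    return realtime_view

def query(batch_view, realtime_view, page):
    """Serving layer: answer a query by merging the batch and real-time views."""
    return batch_view.get(page, 0) + realtime_view.get(page, 0)

# Hypothetical page-view events already stored in the master data set.
master_data = [{"page": "/home"}, {"page": "/home"}, {"page": "/pricing"}]
batch_view = build_batch_view(master_data)

# Hypothetical events that arrived after the last batch run.
realtime_view = Counter()
for event in [{"page": "/home"}, {"page": "/docs"}]:
    update_realtime_view(realtime_view, event)

print(query(batch_view, realtime_view, "/home"))  # 3
```

In a real deployment the batch view would be recomputed periodically and the real-time view reset to cover only the events since that run; the sketch leaves that hand-over out.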
Big Data Functional Architecture: Key Tenets of Lambda Architecture (Hadoop 1)
[Diagram: the same Batch, Speed and Serving layers as on the previous slide, annotated for Hadoop 1]
Big Data Functional Architecture: Key Tenets of Lambda Architecture (Hadoop 2)
[Diagram: the same Batch, Speed and Serving layers as on the previous slide, annotated for Hadoop 2]
Hadoop 2 & YARN
Timeline: 2006, 2009, October 23, 2013
MR-279: YARN

Hadoop w/ MapReduce
• MapReduce: largely batch processing
• HDFS (Hadoop Distributed File System)
• Silo’d clusters
• Largely batch system
• Difficult to integrate

Hadoop2 & YARN based Architecture
• Batch, Interactive and Real-Time workloads
• YARN: Data Operating System
• HDFS (Hadoop Distributed File System)

Architected & led development of YARN to enable the Modern Data Architecture
Apache Hadoop – Data Operating System

Shared Compute & Workload Management
• Common data platform, many applications
• Support multi-tenant access & processing
• Batch, interactive & real-time use cases

BATCH, INTERACTIVE & REAL-TIME DATA ACCESS
• Script: Pig (on Tez)
• SQL: Hive (on Tez)
• Java / Scala: Cascading (on Tez)
• NoSQL: HBase, Accumulo (on Slider)
• Stream: Storm (on Slider)
• In-Memory: Spark
• Search: Solr
• Others: ISV Engines

YARN: Data Operating System (Cluster Resource Management)

Common & Shared Scale Out Storage
• Shared data assets
• Flexible schema
• Cross workload access

HDFS (Hadoop Distributed File System)
Enterprise Hadoop
Hortonworks Data Platform
Core Capabilities of Enterprise Hadoop
• PRESENTATION & APPLICATION: Enable both existing and new applications to provide value to the organization
• ENTERPRISE MGMT & SECURITY: Empower existing operations and security tools to manage Hadoop
• GOVERNANCE & INTEGRATION: Load data and manage according to policy
• DATA ACCESS: Access your data simultaneously in multiple ways (batch, interactive, real-time)
• DATA MANAGEMENT: Store and process all of your Corporate Data Assets
• SECURITY: Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection
• OPERATIONS: Deploy and effectively manage the platform
• DEPLOYMENT OPTIONS: Provide deployment choice across physical, virtual, cloud
Hortonworks Data Platform 2.3

GOVERNANCE & INTEGRATION
• Data Lifecycle & Governance: Falcon, Atlas
• Data Workflow: Sqoop, Flume, Kafka, NFS, WebHDFS

DATA ACCESS
• Batch: MapReduce
• Script: Pig
• SQL: Hive
• NoSQL: HBase, Accumulo, Phoenix
• Stream: Storm
• Search: Solr
• In-memory: Spark
• Others: ISV Engines
• Execution frameworks shown beneath the engines: Tez, Slider

DATA MANAGEMENT
• YARN: Data Operating System
• HDFS (Hadoop Distributed File System)

SECURITY
• Administration, Authentication, Authorization, Auditing, Data Protection: Ranger, Knox, Atlas, HDFS Encryption

OPERATIONS
• Provisioning, Managing, & Monitoring: Ambari, Cloudbreak, Zookeeper
• Scheduling: Oozie

Deployment Choice
• Linux, Windows
• On-Premise, Cloud
HDP At A Glance

Data Storage and Resource Management
• HDFS (storage) works closely with MapReduce (data processing) to provide scalable, fault-tolerant, cost-efficient storage for big data.
• YARN is the cluster resource manager. It is designed to be co-deployed with HDFS so that there is a single cluster, which makes it possible to move computation to the data.

Data Engines
• Pig (scripting) is a high-level language (Pig Latin) for data analysis programs, together with the infrastructure for evaluating these programs via MapReduce.
• Hive (SQL) provides data warehouse infrastructure, enabling data summarization, ad-hoc query and analysis of large data sets. The query language, HiveQL (HQL), is similar to SQL.
• MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a Hadoop cluster.
• HCatalog (SQL) is a table and storage management layer that provides users of Pig, MapReduce and Hive with a relational view of data in HDFS.
• Tez (SQL) leverages the MapReduce paradigm to enable the execution of complex Directed Acyclic Graphs (DAGs) of tasks. Tez eliminates unnecessary tasks, synchronization barriers and I/O to HDFS, speeding up data processing.
• Spark is an open-source cluster computing framework for data analytics. It provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to machine learning algorithms.
• Storm (data streaming) is a distributed real-time computation system for processing fast, large streams of data. Storm topologies can be written in any programming language.
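The MapReduce programming model mentioned above can be sketched in a few lines of plain Python. This is only a single-process illustration of the map, shuffle and reduce steps using a word count; it does not use the Hadoop API, and the function names are assumptions made for the example.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

counts = word_count(["big data on Hadoop", "Hadoop stores big data"])
print(counts["hadoop"])  # 2
```

On a cluster, the map and reduce functions run in parallel on different nodes and the shuffle moves data across the network; the sketch only shows the shape of the computation.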
Data Access
• HBase (NoSQL) is a non-relational database that provides random real-time access to data in very large tables. HBase is transactional and allows updates, inserts and deletes. HBase includes support for SQL through Phoenix.
• Accumulo (NoSQL) is based on Google's BigTable design and adds a few novel improvements to it, in the form of cell-based access control.
• Phoenix is a SQL skin over HBase, delivered as a client-embedded JDBC driver, targeting low-latency queries over HBase data.
• Solr is an open source enterprise search platform from the Apache Lucene project. It includes full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document handling, and it provides distributed search and index replication.
HDP At A Glance

Governance & Integration
• Atlas is a scalable and extensible set of core foundational governance services that enables enterprises to effectively and efficiently meet their compliance requirements within Hadoop.
• Falcon is a data governance engine that defines Oozie data pipelines, monitors those pipelines (in coordination with Ambari), and tracks those pipelines for dependencies and audits.
• Sqoop is a tool that efficiently transfers bulk data between Hadoop and structured datastores such as relational databases.
• Flume is a distributed service to collect, aggregate and move large amounts of streaming data into HDFS.
• Kafka (message broker): Apache Kafka is publish-subscribe messaging implemented as a distributed commit log.

Security
• Knox (perimeter security) is a REST gateway for Hadoop providing network isolation, SSO, authentication, authorization and auditing functions.
• Ranger (security management) is a framework to monitor and manage security assets across a cluster. It provides central administration, monitoring and security hooks for Hadoop applications.

Operations & Management
• Ambari (cluster management) is an operational framework to provision, manage and monitor Hadoop clusters. It includes a web interface for administrators to start/stop/test services and change configurations.
• Cloudbreak is a cloud-agnostic Hadoop-as-a-Service API. It abstracts provisioning and eases management and monitoring of on-demand clusters.
• Zookeeper is a distributed configuration and synchronization service and a naming registry for distributed systems. Distributed applications use Zookeeper to store and mediate updates to important configuration information.
• Oozie (job scheduling) enables administrators to build complex data transformations, gives greater control over long-running jobs and can schedule repetitions of those jobs.

Libraries
• Mahout is a suite of machine learning libraries designed to be scalable and robust.
• Cascading is an application development platform for building data applications on Apache Hadoop.
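Kafka is described above as publish-subscribe messaging implemented as a commit log. The toy Python class below illustrates that idea in a single process: an append-only list of records plus one read offset per consumer group. It is a conceptual sketch only; the class and method names are hypothetical and are not the Kafka client API, and partitioning and replication are left out.

```python
class CommitLog:
    """Toy append-only log with an independent read position per consumer group."""

    def __init__(self):
        self._records = []   # records are only ever appended, never modified
        self._offsets = {}   # consumer group name -> next offset to read

    def publish(self, message):
        """Append a message and return the offset it was stored at."""
        self._records.append(message)
        return len(self._records) - 1

    def poll(self, group, max_records=10):
        """Return the next unread records for a group and advance its offset."""
        start = self._offsets.get(group, 0)
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch

log = CommitLog()
log.publish("order-1")
log.publish("order-2")

print(log.poll("billing"))               # ['order-1', 'order-2']
print(log.poll("audit", max_records=1))  # ['order-1']
```

Because each group keeps its own offset, several subscribers can read the same records at their own pace without affecting one another.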
Data Access
Stinger Initiative – DELIVERED
Interactive SQL-IN-Hadoop Delivered
Next generation SQL based interactive query in Hadoop

• Speed: Hive query performance has increased by 100X to allow for interactive query times (seconds)
• Scale: The only SQL interface to Hadoop designed for queries that scale from TB to PB
• SQL: Support broadest range of SQL semantics for analytic applications running against Hadoop

Stinger Project
• Stinger Phase 1: Base Optimizations; SQL Types; SQL Analytic Functions; ORCFile Modern File Format
• Stinger Phase 2: SQL Types; SQL Analytic Functions; Advanced Optimizations; Performance Boosts via YARN
• Stinger Phase 3: Hive on Apache Tez; Query Service (always on); Buffer Cache; Cost Based Optimizer (Optiq)

[Diagram: Custom Apps and Business Analytics on top of Apache Hive (Window Functions), running on Apache Tez and Apache MapReduce over Apache YARN, with ORC File on HDFS (Hadoop Distributed File System)]

Apache Hive Contribution… an Open Community at its finest
• 1,672 Jira Tickets Closed
• 145 Developers
• 44 Companies
• ~390,000 Lines Of Code Added… (2x)
• 13 Months
Stream Processing With Storm

Apache Storm
Real-time event processing for sensor and business activity monitoring
• Unlocks new business cases for Hadoop
• Key component of a data lake architecture
• Scale: Ingest millions of events per second. Fast query on petabytes of data
• Integrated with Ambari to manage

Investment Phases
• Phase-1: Install, Start, & Stop via Ambari; Kafka, HBase, & HDFS Connectors; Ganglia & Nagios based monitoring
• Phase-2: Storm-on-YARN; Ingest & Notification for JMS; Data persistence: EDWs, RDBMS, Cassandra
• Phase-3: High Availability mgmnt w/Ambari; AD/LDAP plugin for authentication; Declarative “wiring”; Hive update support; Advanced scheduler
In-Memory With Spark

Apache Spark
Spark allows you to do fast iterative processing on in-memory datasets
• Spark SQL
• Spark Streaming
• MLlib
• GraphX
Distributed Database With Apache HBase
• 100% Open Source
• Store and Process Petabytes of Data
• Flexible Schema
• Scale out on Commodity Servers
• High Performance, High Availability
• Integrated with YARN
• SQL and NoSQL Interfaces
• Dynamic Schema
• Scales Horizontally to PB of Data
• Directly Integrated with Hadoop

[Diagram: HBase RegionServers running on YARN: Data Operating System, over HDFS (Permanent Data Storage)]
HDP Search

Apache Solr
Open source enterprise search for Hadoop and HDP
• Open architecture: In the community, for the community
• Simple, powerful UI for advanced search applications
• High performance indexing & sub-second search times over billions of documents
• Deep Integration Roadmap with HDP
SQL on Hadoop Summary

Apache Hive
• Strengths: Most comprehensive SQL; Scale; Maturity
• Use Cases: ETL Offload; Reporting; Large-scale aggregations
• Unique Capabilities: Robust cost-based optimizer; Mature ecosystem (BI, backup, security and replication)

SparkSQL
• Strengths: In-memory; Low latency
• Use Cases: Exploratory analytics; Dashboards
• Unique Capabilities: Language-integrated Query

Apache Phoenix
• Strengths: Real-time read / write; Transactions; High concurrency
• Use Cases: Dashboards; System-of-engagement; Drill-down / Drill-up
• Unique Capabilities: Real-time read / write
Data Governance
Data Governance With Apache Atlas

• Healthcare: HIPAA, HL7
• Financial: SOX, Dodd-Frank
• Energy: PPDM
• Retail: PCI, PII
• Other: CWM

Apache Atlas
• REST API: Modern, flexible access to Atlas services, HDP components and external tools
• Search: SQL-like DSL (Domain Specific Language); support for key word, faceted and full text searches
• Lineage: Capture all SQL runtime activity on HiveServer2, providing lineage for both data and schema
• Exchange: Leverage existing metadata by importing it from ETL tools, ERP systems and data warehouses; export metadata to downstream systems

[Diagram: REST-based API over Services (Search, Lineage, Exchange); Knowledge Store (Taxonomies, Policy Rules, Type-System, Models); Audit Store; Data Lifecycle Management; Tag Based Policies; Real Time Tag Based Access Control]
Data Life Cycle Management With Apache Falcon

Automate the movement and processing of data sets
• No hard-coding complex data set and pipeline processing

Dataset Lifecycle Management
• Replicate datasets (HDFS, Hive) as part of DR/Backup/Archival plans
• Falcon triggers processes for retries and handles late data arrival.
• Establish the retention policies for datasets.
• Falcon schedules and handles eviction

[Diagram: Production Cluster and Backup & Archive Cluster]
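The retention and eviction bullets on the Falcon slide can be made concrete with a short Python sketch. It only illustrates the idea of a retention window: the dataset instance names, the 30-day limit and the function name are assumptions for the example, and nothing here reflects how Falcon itself is configured or implemented.

```python
from datetime import datetime, timedelta

def select_for_eviction(instances, retention, now):
    """Return the names of dataset instances older than the retention window."""
    cutoff = now - retention
    return sorted(name for name, created in instances.items() if created < cutoff)

# Hypothetical dataset instances keyed by name, with their creation times.
instances = {
    "clicks/2015-01-01": datetime(2015, 1, 1),
    "clicks/2015-03-15": datetime(2015, 3, 15),
    "clicks/2015-03-31": datetime(2015, 3, 31),
}

expired = select_for_eviction(instances, timedelta(days=30), now=datetime(2015, 4, 1))
print(expired)  # ['clicks/2015-01-01']
```

A scheduler would run a check like this periodically and delete or archive the selected instances.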
Security
HDP Security: comprehensive, complete and simple

Security in HDP is the most comprehensive and complete for Hadoop
• Administration: Central management & consistent security
• Authentication: Authenticate users and systems
• Authorization: Provision access to data
• Audit: Maintain a record of data access
• Data Protection: Protect data at rest and in motion

• HDP ensures comprehensive enforcement of security policy across the entire Hadoop stack
• HDP provides functionality across the complete set of security requirements
• HDP is the only solution to provide a single simple interface for security policy definition and maintenance
Security with HDP

HDP 2.3: Centralized Security Administration w/ Ranger
• Authentication (Who am I/prove it?): Kerberos; API security with Apache Knox
• Authorization (What can I do?): Fine grain access control with Apache Ranger
• Audit (What did I do?): Centralized audit reporting w/ Apache Ranger
• Data Protection (Can data be encrypted at rest and over the wire?): Wire encryption in Hadoop; Native HDFS and partner encryption
Operations
Hadoop Operations With Apache Ambari
• Provision, manage and monitor Hadoop clusters
• Provision Developer Views for data analysis and application development
Hadoop Development with Apache Ambari
Visual SQL editor empowers developers to efficiently write, view, and edit SQL queries
Cloudbreak - Launch HDP on Any Cloud for Any Application

Cloudbreak
1. Pick a Blueprint
2. Choose a Cloud
3. Launch HDP!

Example Ambari Blueprints:
• IoT Apps (Storm, HBase, Hive)
• BI / Analytics (Hive)
• Data Science (Spark)
• Dev / Test (all HDP services)
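To give a feel for what "picking a Blueprint" involves, the Python sketch below builds a small JSON document in the general shape of an Ambari Blueprint (a `Blueprints` header plus a list of `host_groups`). Treat it as an assumption-laden illustration: the blueprint name, host group names, cardinalities and component selection are made up for the example, the helper function is hypothetical, and the exact schema should be checked against the Ambari documentation before use.

```python
import json

def make_blueprint(name, stack_version, host_groups):
    """Build a blueprint-shaped dict from {group name: [component names]}."""
    return {
        "Blueprints": {
            "blueprint_name": name,
            "stack_name": "HDP",
            "stack_version": stack_version,
        },
        "host_groups": [
            {
                "name": group,
                "cardinality": "1",  # illustrative value only
                "components": [{"name": component} for component in components],
            }
            for group, components in host_groups.items()
        ],
    }

# Hypothetical "Data Science (Spark)" layout with one master and one worker group.
blueprint = make_blueprint(
    "data-science-sketch",
    "2.3",
    {
        "master": ["NAMENODE", "RESOURCEMANAGER", "ZOOKEEPER_SERVER"],
        "worker": ["DATANODE", "NODEMANAGER", "SPARK_CLIENT"],
    },
)

print(json.dumps(blueprint, indent=2))
```

The cloud choice and the launch step are handled by Cloudbreak itself and are not modelled here.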
Proactive Support With SmartSense
Enterprise Hadoop – Hortonworks Data Platform
Together We Modernize Data Architectures