Technical Computing: The New Era
Simulation, Analytics, Big Data
Sylvie Boin, Technical Computing Sales Manager
Emmanuel Lecerf, Platform Computing Sales
Technical Computing: IBM announcements at a glance
IBM Technical Computing portfolio: refresh of major products and solutions for mainstream technical computing
• Application Ready Solutions for Auto/Aero, Life Sciences, Petroleum, Big Data
• Power Systems™: Engine for faster insights
• Flex Systems™: Integrated hybrid system
• System x®: Redefining x86
• Blue Gene®: Extremely fast, energy efficient supercomputer
• NeXtScale System™ (New): Hyperscale, Density, Flexibility
• System Storage®: High performance storage
• GPFS™ Storage Server: Big data storage
• Intelligent Cluster™: Factory-integrated, interoperability-tested system with compute, storage, networking and cluster management
• Software: Parallel Environment, HPC Cloud, IBM Platform LSF® Family, IBM Platform™ Symphony Family, IBM Platform HPC, IBM Platform Cluster Manager, xCAT, GPFS™
Application Ready Solutions: new enhancements

Industry | ISV | Applications | IBM platform
Life Sciences | Accelrys | Accelrys Pipeline Pilot NGS collection | Flex System x240, V7000 Unified, Platform HPC, GPFS
Auto/Aero Engineering | ANSYS | ANSYS, FLUENT, Remote 3D | Flex System x240, DS3500, Platform HPC, GPFS
Petroleum | Schlumberger | ECLIPSE, INTERSECT | Flex System x240, DS3500, Platform HPC, GPFS
Life Sciences | CLC-Bio | CLC Genomics Server | Flex System x240, V7000 Unified, Platform HPC, GPFS
Life Sciences Comp. Chem. | Gaussian | Gaussian | Flex System p260, p460, DS3500, Platform LSF, GPFS
Auto/Aero Engineering | MSC Software | MSC Nastran, Patran, SimManager | Flex System x240, NeXtScale, GPFS, V7000 Unified, Platform HPC
Auto/Aero Engineering | Dassault Systemes | ABAQUS | System x3650 M4, Flex System x240, NeXtScale, GPFS, Platform HPC, MIO, integrated storage
Life Sciences | mpiBLAST | mpiBLAST | NeXtScale, Platform HPC, GPFS
Big Data | IBM SWG | InfoSphere BigInsights | PowerLinux 7R2, Platform PCM, Symphony, GPFS, integrated storage

Status: Live, Live, Live, Live, Live, 90% go, Live
New October content: Adding Flex IVB, NeXtScale, Platform HPC; Adding Flex System p460, with LSF, PCM; Adding IVB, NeXtScale, Platform HPC; NeXtScale, Platform HPC; Adding NeXtScale, Platform HPC
The characteristics of big data
Twitter: from 6,000,000 users pushing out 300,000 tweets per day to 500,000,000 users pushing out 400,000,000 tweets per day (83x the users, 1333x the tweets)
• Volume: Cost-efficiently processing the growing volume. 50x growth from 2010 to 2020, to 35 ZB
• Velocity: Responding to the increasing velocity. 30 billion RFID sensors and counting
• Variety: Collectively analyzing the broadening variety. 80% of the world's data is unstructured
• Veracity: Establishing the veracity of big data sources. 1 in 3 business leaders don't trust the information they use to make decisions
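The 83x and 1333x multipliers quoted for Twitter on this slide follow directly from the user and tweet counts. A quick check, with variable names chosen here purely for illustration:

```python
# Figures quoted on the slide.
users_before, users_after = 6_000_000, 500_000_000
tweets_before, tweets_after = 300_000, 400_000_000  # tweets per day


def growth_factor(before, after):
    """Return how many times larger `after` is than `before`."""
    return after / before


user_growth = growth_factor(users_before, users_after)
tweet_growth = growth_factor(tweets_before, tweets_after)

print(f"users: {user_growth:.0f}x, tweets per day: {tweet_growth:.0f}x")
```

Tweet volume grew far faster than the user base, which is the point the slide makes about data growth.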
5 Big Data Patterns
• Big Data Exploration: Find, visualize, understand all big data to improve business knowledge
• Enhanced 360° View of the Customer: Achieve a true unified view, incorporating internal and external sources
• Security/Intelligence Extension: Lower risk, detect fraud and monitor cyber security in real-time
• Operations Analysis: Analyze a variety of machine data for improved business results
• Data Warehouse Augmentation: Integrate big data and data warehouse capabilities to increase operational efficiency
Hadoop MapReduce
• De facto "Big Data" standard
• Pioneered at Google / Yahoo!
• Framework for writing applications to rapidly process vast datasets
• More cost effective than traditional data warehouse / BI infrastructure
• Dramatic performance gains
• Java based
• From our perspective: Just another distributed computing problem
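To make the programming model concrete, here is a minimal single-process sketch of the map, shuffle and reduce phases using the classic word-count example. It is not the Hadoop API (Hadoop itself is Java based, and a real job distributes these phases across a cluster); all function names are illustrative.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big insights", "data in motion"])
print(counts)
```

Because each map call and each reduce call is independent of the others, the work can be spread over many nodes, which is why the slide frames it as a distributed computing problem.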
Common Pain Points in Big Data Hadoop environment
• Limited HA features in the workload engine
• Large performance overhead during job initiation
• Resource silos associated with MapReduce applications
• Single purpose clusters - under utilized resources
• Not adaptive
• Scheduling engine lacks sophistication
• No way to manage a shared services model tied to an SLA
• Difficult to troubleshoot
• Difficult to manage as the cluster scales
• Lack of application life cycle / rolling upgrades
• Scalability concerns
• Lack of reporting tools
IBM's Big Data Architecture
[Figure: architecture diagram with the following blocks]
• Sources: Data in Motion, Data at Rest, Data in Many Forms
• Information Ingestion and Operational Information: Stream Processing, Data Integration, Master Data
• Real-time Analytics (Streams): Video/Audio, Network/Sensor, Entity Analytics, Predictive
• Landing Area, Analytics Zone and Archive (Hadoop): Raw Data, Structured Data, Text Analytics, Data Mining, Entity Analytics, Machine Learning
• Exploration, Integrated Warehouse, and Mart Zones: Discovery, Deep Reflection, Operational, Predictive
• Consumers: Intelligence Analysis, Decision Management, BI and Predictive Analytics, Navigation and Discovery
• Information Governance, Security and Business Continuity
• IBM Platform Computing – shared infrastructure
The MapReduce Architecture: 3 logical layers … 3 options in IBM
• Applications or End User Access: IBM Software (BigInsights, SPSS, analytics…)
• MapReduce Workload Management: Platform Symphony
• Distributed Parallel File Systems / Data Storage: GPFS-FPO
IBM Platform Symphony: "Analytics meets Infrastructure"
Two distinct value propositions
1. Analytics: Use a fast, distributed software infrastructure to accelerate and provide greater capacity for business-critical analytic workloads
2. Infrastructure (HW & SW): Use sophisticated policies to optimize the use of infrastructure resources and ensure alignment to the goals of the business
As the scale of problems grows, an agile, distributed infrastructure becomes ever more critical to project success.
IBM Platform Symphony Architecture
[Figure: architecture diagram with the following blocks]
• IBM Platform Symphony Management Console
• COMPUTE INTENSIVE: Low-latency Service-oriented Application Middleware
• DATA INTENSIVE: Enhanced Hadoop MapReduce Processing Framework
• Service Instance Manager (SIM)
• IBM Platform Symphony Core
• IBM Resource Orchestrator
• IBM Platform Symphony Enterprise Reporting Framework
Different workloads demand different SLAs
“I need an updated counterparty credit risk analysis for the final earnings report by 2:00 pm”
“I wonder if teenagers in California still think red shoes are cool?”
Cluster Sprawl - Silos of underutilized, incompatible clusters
• A: Risk analytics, Sensitivity analysis, Monte Carlo simulation
• B: Metadata generation, File classification, Batch analysis
• C: Search, Analysis, Concept Recognition
• D: Data Intensive Apps
[Figure: node grids labeled A to D, split across Cluster 1, Cluster 2 and Cluster 3]
Support for Diverse Workloads & Platforms
Heterogeneous Application Support
• A: Risk analytics, Sensitivity analysis, Monte Carlo simulation
• B: Metadata generation, File classification, Batch analysis
• C: Search, Analysis, Concept Recognition
• D: Data Intensive Apps
Workload Manager
[Figure: a single shared grid of nodes, each labeled with the application (A to D) currently running on it]
Resource Orchestration
Multiple Instances of MapReduce in a Single Cluster
Sophisticated, Policy Based, Resource Sharing
• Sharing while preserving ownership
• Near 100% sustained resource utilization – Dynamic Allocation
• Allocations flex during runtime to reflect business priorities
• Enables application level SLA management
Easy to Manage
• Sophisticated Workload Management
• Interact with Running Jobs
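The idea of sharing while preserving ownership can be sketched as a two-step allocation: each application first receives what it needs up to the share it owns, then idle slots are lent to applications whose demand exceeds their ownership. This is a simplified illustration under assumed rules, not Platform Symphony's actual policy engine or API; the function name, the slot counts and the lending order (dictionary order) are all hypothetical.

```python
def allocate(total_slots, owned, demand):
    """Share idle slots while guaranteeing each app its owned share.

    owned:  app -> slots the app owns (must sum to total_slots)
    demand: app -> slots the app currently wants
    """
    assert sum(owned.values()) == total_slots
    # Step 1: every app gets what it needs, up to what it owns.
    alloc = {app: min(owned[app], demand.get(app, 0)) for app in owned}
    idle = total_slots - sum(alloc.values())
    # Step 2: lend idle slots to apps whose demand exceeds their ownership.
    # Lending order is simply dictionary order here (an arbitrary choice).
    for app in owned:
        extra = min(idle, demand.get(app, 0) - alloc[app])
        if extra > 0:
            alloc[app] += extra
            idle -= extra
    return alloc


owned = {"A": 40, "B": 30, "C": 20, "D": 10}

# A is mostly idle, so B borrows A's unused slots.
quiet = allocate(100, owned, {"A": 10, "B": 60, "C": 20, "D": 30})

# A's demand returns: A reclaims its owned share and B falls back to its own.
busy = allocate(100, owned, {"A": 40, "B": 60, "C": 20, "D": 30})

print(quiet)
print(busy)
```

Re-running the allocation whenever demand changes is what lets allocations flex during runtime while the cluster stays fully used in both scenarios.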
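GPFS-FPO: Enterprise Class Replacement for HDFS

Category | Feature | GPFS-FPO (GPFS 3.5) | HDFS
Performance | Terasort: large reads | X | X
Performance | HBase: small writes | X | X
Performance | Metadata intensive | X | X
Enterprise readiness | POSIX compliance | X |
Enterprise readiness | Metadata replication | X |
Enterprise readiness | Distributed name node | X |
Protection & recovery | Snapshot | X |
Protection & recovery | Asynchronous replication | X |
Protection & recovery | Backup | X |
Security & integrity | Access control lists | X |
Ease of use | Policy based ingest | X |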
IBM InfoSphere BigInsights v2.1 Enterprise Edition
[Figure: component stack; the legend distinguishes Open Source components from IBM components]
• Visualization & Discovery: BigSheets, Dashboard & Visualization
• Applications & Development: Big SQL, Apps, Text Analytics, Workflow, Pig & Jaql, MapReduce, Hive
• Administration: Admin Console, Monitoring
• Advanced Analytic Engines: R, Text Processing Engine & Extractor Library, Adaptive Algorithms
• Workload Optimization: Integrated Installer, Enhanced Security, Splittable Text Compression, Adaptive MapReduce, Flexible Scheduler, High Availability
• Runtime: MapReduce (IBM Platform Symphony Advanced Edition), ZooKeeper, Oozie, Jaql, Lucene, Pig, HCatalog, Hive, Index
• Data Store: HBase
• File System: HDFS, IBM GPFS-FPO
• Management: Security, Audit & History, Lineage
• Integration: JDBC, Netezza, DB2, Streams, DataStage, Guardium, Platform Computing, Cognos, Flume, Sqoop
Benchmark: Short Running Tasks – Results
Scheduler performance (tasks per second):
• Hadoop 0.20.2: 3.3
• Hadoop: 30.3
• Symphony 6.1: 1516
Symphony 6.1 can schedule ~50x more tasks per second than the current Hadoop release.
Production clusters running workloads with large numbers of tasks benefit greatly from a fast scheduler.
Hadoop results taken from Hadoop World 2011 performance presentation, Lipcon & Chen.
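A back-of-the-envelope calculation shows where the ~50x figure comes from and why scheduler throughput matters for workloads made of many short tasks. The workload size below is hypothetical, and the estimate assumes the scheduler is the only bottleneck, which a real cluster would not guarantee.

```python
# Tasks-per-second figures quoted on the benchmark slide.
rates = {
    "Hadoop 0.20.2": 3.3,
    "Hadoop": 30.3,
    "Symphony 6.1": 1516.0,
}

# Speedup of Symphony over the faster of the two Hadoop results.
speedup = rates["Symphony 6.1"] / rates["Hadoop"]

# Hypothetical workload size, chosen only for illustration.
num_tasks = 1_000_000
dispatch_seconds = {name: num_tasks / rate for name, rate in rates.items()}

print(f"speedup vs Hadoop: {speedup:.1f}x")
for name, seconds in dispatch_seconds.items():
    print(f"{name}: {seconds / 3600:.1f} h to schedule {num_tasks:,} tasks")
```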
Benchmark: SWIM: Facebook 2010 Workload – Results
[Figure: bar chart of run time in seconds (axis 0 to 8000) for Hadoop 1.0.1 and Symphony 6.1; Symphony 6.1 is 7.5x faster]
Thank you!