Machine Learning on Streaming Data via Integration of Oracle Advanced Analytics and Oracle Stream Explorer
Mauricio Arango, Alex Ardel
January 28, 2016
Agenda
• Machine learning on streaming data
• Oracle based workflow and architecture for streaming analytics
• Streaming data platform (Oracle Stream Explorer)
• Model generation (Oracle Advanced Analytics)
• Model transfer
• Real-time scoring
• Demo
• Summary
Machine Learning on Streaming Data
Why is it Important?
• Massive expansion of sensors across businesses & industries, governments, and scientific domains – e.g. IoT
• Increasingly, data streams from sensors require near real-time predictive analytics – data value decreases with time
  – Fault detection and forecasting in complex infrastructure systems – e.g. wind turbine farms
  – Complex scientific equipment monitoring – e.g. cooling systems in particle accelerators
  – Security threat detection – e.g. network intrusion detection
  – Car & truck condition monitoring – e.g. detecting patterns that signal upcoming mechanical problems
  – Environmental monitoring – early detection of air & water quality issues
Machine Learning Analytics on Streaming Data
• Use of machine learning algorithms to detect patterns & perform predictions based on models created from historical data
• Predictions performed on input data streams – capable of handling very high-throughput streams
[Diagram: Historical data (used to build model) → Machine Learning → Prediction System; Streaming data (predictions performed on it) → Prediction System → Prediction stream]
Machine Learning Analytics on Streaming Data
• Streaming machine learning workflows involve two main stages:
  – Model building (training)
    • Uses historical data as input – may involve handling very large datasets
    • Requires a toolset with support for an expanding set of key machine learning algorithms
    • Performed with low frequency
  – Model scoring
    • Performed with high frequency
    • No historical data required – the model representation contains all required information
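The split between the two stages can be sketched in a few lines of plain Python. This is only an illustration, not the ORE or OSX implementation: the single-predictor least-squares fit, the function names `build_model` and `score_stream`, and the sample data are all assumptions made for the example. The point it shows is that scoring needs nothing but the fitted model representation.

```python
from statistics import mean

def build_model(history):
    """Model building: fit y = intercept + slope * x by least squares."""
    xs = [x for x, _ in history]
    ys = [y for _, y in history]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in history)
             / sum((x - x_bar) ** 2 for x in xs))
    return {"intercept": y_bar - slope * x_bar, "slope": slope}

def score_stream(model, events):
    """Model scoring: one prediction per event, using only the model."""
    for x in events:
        yield model["intercept"] + model["slope"] * x

# Hypothetical historical data: (sensor reading, observed outcome) pairs.
history = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]

model = build_model(history)                         # run rarely
predictions = list(score_stream(model, [5.0, 6.0]))  # run per event
```

Because `score_stream` receives only the `model` dictionary, the historical data can stay on the model-building platform while the scoring side handles the event stream.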
Streaming Machine Learning Architecture
• Use separate optimized platforms for each workflow stage:
  – Oracle R Enterprise (ORE), a component of Oracle Advanced Analytics (OAA), for model development and building
  – Oracle Stream Explorer (OSX) for streaming data scoring and stream pre/postprocessing
[Diagram: Oracle Stream Explorer: Input event stream → Preprocessing, model import & scoring function creation → Model Scoring → Prediction stream. Oracle R Enterprise: Training data → Model Builder. Model transfer from ORE to OSX using PMML.]
Streaming Machine Learning End-to-End Workflow
• End goals:
  – Fully automated OSX scoring application generation triggered by PMML model input
  – Automated model refresh
  – Simple workflow control via OSX user interface application
[Diagram: Model generation (OAA) → Conversion to standard format (R/PMML) → Model repository creation / export (OAA) → Model import (OSX) → Scoring function creation (OSX) → Scoring on input streams (OSX)]
Streaming Platform
Oracle Stream Explorer Overview
Real-time Analytics Platform
Enterprise Event Processing
• High Volume
• Continuous Streaming
• Sub-Millisecond Latency
• Disparate Sources
• Time-Window Processing
• Pattern Matching
• High Availability/Scalability
• Coherence Integration
• Geospatial, Geofencing
• Big Data Integration
• K-means clustering
Internet of Things Embedded Event Processing
FOG Computing, Raspberry Pi, “Sea of data”
• Filtering
• Correlation
• Aggregation
• Pattern matching
Cloud Services*
Stream Processing Toolset
• Event Pattern Examples
  – congestion detection
  – silent failure detection
  – geospatial movement
  – location based
  – anomaly detection via clustering
• Integration with business event visualization (BAM)
• Very high performance:
  – Sparc T5 – 4 million events/sec
  – Exalogic – 30 million events/sec
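Time-window processing, one of the capabilities listed above, can be illustrated with a small sliding-window average. This is a plain-Python sketch, not Stream Explorer or CQL code; the class name `TimeWindowAverage`, the 10-second span, and the sample events are assumptions chosen for the example.

```python
from collections import deque

class TimeWindowAverage:
    """Average of values whose timestamps fall within the last `span` seconds."""

    def __init__(self, span):
        self.span = span
        self.events = deque()   # (timestamp, value), oldest first
        self.total = 0.0

    def add(self, timestamp, value):
        """Add one event and return the average over the current window."""
        self.events.append((timestamp, value))
        self.total += value
        # Evict events that have slid out of the window.
        while self.events and self.events[0][0] <= timestamp - self.span:
            _, old = self.events.popleft()
            self.total -= old
        return self.total / len(self.events)

window = TimeWindowAverage(span=10)
averages = [window.add(t, v) for t, v in [(0, 10.0), (5, 20.0), (12, 30.0)]]
```

Keeping a running total means each event is added once and evicted once, so the per-event cost stays constant on average no matter how many events the window holds.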
Streaming Applications Modeled as Data Flow Graphs
Event Processing Network (EPN)
[Diagram: Input event streams → EPN → Output event streams]
EPN (Event Processing Network) Elements: Adapter, Channel, Cache, POJO, Processor
• Application logic contained in processor nodes
• Programmed in Java and Continuous Query Language (CQL)
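The data-flow idea behind an EPN can be sketched with three of the element types named above: an adapter that turns external input into events, channels that carry events between nodes, and a processor that holds the application logic. The sketch is plain Python rather than the Java/CQL used by the platform, and the class and function names, the CSV input format, and the threshold filter are all hypothetical.

```python
from collections import deque

class Channel:
    """Carries events between EPN nodes (here: a simple FIFO queue)."""

    def __init__(self):
        self._queue = deque()

    def put(self, event):
        self._queue.append(event)

    def drain(self):
        while self._queue:
            yield self._queue.popleft()

def adapter(raw_lines, out):
    """Adapter: converts external input (CSV text) into events."""
    for line in raw_lines:
        sensor, value = line.split(",")
        out.put({"sensor": sensor, "value": float(value)})

def processor(inp, out, threshold):
    """Processor: holds the application logic (here: a simple filter)."""
    for event in inp.drain():
        if event["value"] > threshold:
            out.put(event)

# Wire the graph: adapter -> input channel -> processor -> output channel.
input_channel, output_channel = Channel(), Channel()
adapter(["t1,20.5", "t2,99.0", "t1,101.2"], input_channel)
processor(input_channel, output_channel, threshold=50.0)
alerts = list(output_channel.drain())
```

Because nodes only talk through channels, a processor can be replaced or chained with another one without touching the adapter.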
Stream Explorer User Interface
• New Fast Data, Real-time Streaming Analytics Platform for the Business Audience
  – Hides challenges and complexities of using streaming data analytics platforms
  – Provides accelerated delivery time to market of real-time event driven solutions
• Simple Cloud-Aware Canvas Façade
  – Analyze simulated or live data feeds to detect event patterns, perform event correlation, aggregation, filtering & clustering
  – Provides out-of-the-box patterns for industry specific solutions
  – Integrates transparently with the runtime development platform to include predictive algorithms
Oracle Internet of Things Cloud Service Infrastructure
[Diagram. Stages: Gather, Enrich, Stream, Manage, Analyze & Acquire, Organize, Action. Labels: Devices / Internet “Things”; Java Device; Oracle Event Processing Embedded; Other Devices; 3rd Party Device Cloud; WWAN; 2G/3G/LTE Network; Firewall; Messaging Proxy; Oracle Cloud; Cloud Service Gateway; Endpoint Management; Event Processing; Oracle Stream Explorer; Dispatcher REST/JMS; Oracle Database; Hadoop; Oracle Business Intelligence Cloud Service; Oracle Integration Cloud Service; Communications Service Provider Applications; Network Management and Policy; Device Management; Charging and Billing; Enterprise Cloud or On Premise; Custom Application; Field Service; CRM / OM / SFA; ERP; Financials; SCM; HCM; Industry Vertical Applications]
Model Generation
Model generation
• Oracle R Enterprise (ORE) is a component of the Oracle Advanced Analytics (OAA) option of Oracle Database Enterprise Edition
  – Makes the R statistical programming language and environment ready for the enterprise and big data.
  – R users can develop, refine, and deploy R scripts that leverage the parallelism and scalability of Oracle Database to automate data analysis.
• ORE Transparency Layer: in-database data exploration, preparation and analysis through implicit translation of R into SQL. Operations on database data as though they were R objects, using R syntax.
• ORE Embedded R Execution: execution of R scripts by one or more R engines running on the database server
  – Allows the use of open source CRAN packages in R scripts running on the database server.
  – Eliminates moving data from the Oracle Database server to a client R session.
  – Uses the database server to start, manage, and control the execution of R scripts in R engines running on the server.
  – Leverages the memory and processing power of the database server machine for R engine execution.
  – Enables data-parallel and task-parallel execution of user-defined R functions.
[Diagram callouts: Large number of models built with Oracle R Enterprise; multiple models generated concurrently with ore.*Apply() functions]
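The data-parallel pattern described above (partition the data, then build one model per partition concurrently) can be sketched outside the database as follows. This is a loose Python analogy only: it does not use or reproduce the ORE API, the helper `group_apply` is hypothetical, the "model" is just a per-group mean, and the carrier codes and delays are invented sample data. In ORE the parallelism comes from R engines on the database server rather than from Python threads.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

def fit_mean_model(rows):
    """Toy 'model': the mean delay of one partition."""
    return {"n": len(rows), "mean_delay": mean(r["delay"] for r in rows)}

def group_apply(rows, key, fit, workers=4):
    """Partition rows by `key`, then fit one model per partition concurrently."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {k: pool.submit(fit, part) for k, part in partitions.items()}
        return {k: f.result() for k, f in futures.items()}

# Hypothetical sample data.
rows = [
    {"carrier": "AA", "delay": 10.0},
    {"carrier": "AA", "delay": 20.0},
    {"carrier": "DL", "delay": 5.0},
]
models = group_apply(rows, "carrier", fit_mean_model)
```

The user-supplied function (`fit_mean_model` here) sees only its own partition, which is what lets many models be generated independently.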
Model Transfer
Model transfer between generator and scoring engine
Predictive Model Markup Language (PMML)
• Standard for representing data mining models for sharing between different tools and environments
• Brainchild of the Data Mining Group (DMG), www.dmg.org. Mature standard supported by over 20 vendors and open source analytic organisations
• XML schema for describing the model structure
[Diagram: Model Building (R, Python, SAS, SPSS, SAP, Knime, …) → PMML → Model Consumption. Caption: Contribution to at least one version (PMML or PFA)]
PMML components
• Header: version/timestamp/model development environment …
• Data Dictionary: definitions for all fields used by the DM model (data types, ranges …)
• Data Transformations: mapping of user data in suitable form for the DM model
• Model: definitions for data mining model
  – Mining Schema: fields usage (active/target), policies for handling missing/invalid values, outliers …
  – Targets: syntax for handling target categories
  – Outputs
• Multiple Models
• Scope of Fields
• Taxonomies
• Model Verification
• …
Classes of models supported by PMML
• Association Rules
• Baseline Models
• Decision Trees
• Center- & Distribution-based Clustering
• Regression & General Regression
• k-Nearest Neighbors
• Neural Networks
• Naïve Bayes
• Sequences
• Text
• Time Series
• Support Vector Machines
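To make the component list concrete, the following sketch assembles a minimal PMML-style skeleton with Python's standard `xml.etree.ElementTree`. It is an illustration under assumptions, not output of ORE or the R pmml package: PMML version 4.2 is assumed, the field names are borrowed from the flight-delay example in this deck, and the header text, application name, and zero intercept are placeholders.

```python
import xml.etree.ElementTree as ET

NS = "http://www.dmg.org/PMML-4_2"  # assumed PMML version: 4.2

def build_skeleton():
    """Assemble the main PMML components in document order."""
    pmml = ET.Element("PMML", {"version": "4.2", "xmlns": NS})

    # Header: version/timestamp/model development environment.
    header = ET.SubElement(pmml, "Header", {"description": "illustrative skeleton"})
    ET.SubElement(header, "Application", {"name": "example-builder"})

    # Data Dictionary: definitions of all fields used by the model.
    dictionary = ET.SubElement(pmml, "DataDictionary", {"numberOfFields": "2"})
    for name in ("DEPDELAY", "ARRDELAY"):
        ET.SubElement(dictionary, "DataField",
                      {"name": name, "optype": "continuous", "dataType": "double"})

    # Data Transformations (left empty in this sketch).
    ET.SubElement(pmml, "TransformationDictionary")

    # Model: here a regression model with its Mining Schema.
    model = ET.SubElement(pmml, "RegressionModel", {"functionName": "regression"})
    schema = ET.SubElement(model, "MiningSchema")
    ET.SubElement(schema, "MiningField", {"name": "DEPDELAY", "usageType": "active"})
    ET.SubElement(schema, "MiningField", {"name": "ARRDELAY", "usageType": "target"})
    ET.SubElement(model, "RegressionTable", {"intercept": "0.0"})  # placeholder
    return pmml

pmml = build_skeleton()
components = [child.tag for child in pmml]
xml_text = ET.tostring(pmml, encoding="unicode")
```

Printing `xml_text` shows how the Header, Data Dictionary, Data Transformations, and Model components nest inside a single XML document.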
PMML linear regression model example
ONTIME_S dataset: Flight arrival delay prediction
ARRDELAY ~ DEPDELAY + DISTANCE + DAYOFWEEK
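A scoring engine that receives such a model only needs to read the regression table from the PMML document. The sketch below parses a hand-written PMML 4.2 fragment for the formula above and scores one event. The coefficients are invented for illustration and are not fitted to ONTIME_S; treating DAYOFWEEK as a categorical predictor is an assumption; the helper `load_scorer` is hypothetical and ignores optional PMML features such as predictor exponents and missing-value policies.

```python
import xml.etree.ElementTree as ET

# Hand-written PMML with made-up coefficients (illustration only).
PMML_TEXT = """<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header description="illustrative only"/>
  <DataDictionary numberOfFields="4">
    <DataField name="ARRDELAY" optype="continuous" dataType="double"/>
    <DataField name="DEPDELAY" optype="continuous" dataType="double"/>
    <DataField name="DISTANCE" optype="continuous" dataType="double"/>
    <DataField name="DAYOFWEEK" optype="categorical" dataType="string"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="ARRDELAY" usageType="target"/>
      <MiningField name="DEPDELAY"/>
      <MiningField name="DISTANCE"/>
      <MiningField name="DAYOFWEEK"/>
    </MiningSchema>
    <RegressionTable intercept="1.0">
      <NumericPredictor name="DEPDELAY" coefficient="0.5"/>
      <NumericPredictor name="DISTANCE" coefficient="0.25"/>
      <CategoricalPredictor name="DAYOFWEEK" value="7" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""

NS = {"p": "http://www.dmg.org/PMML-4_2"}

def load_scorer(pmml_text):
    """Read the regression table and return a scoring function."""
    table = ET.fromstring(pmml_text).find(".//p:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    numeric = {n.get("name"): float(n.get("coefficient"))
               for n in table.findall("p:NumericPredictor", NS)}
    categorical = {(c.get("name"), c.get("value")): float(c.get("coefficient"))
                   for c in table.findall("p:CategoricalPredictor", NS)}

    def score(event):
        total = intercept
        for name, coef in numeric.items():
            total += coef * event[name]
        for (name, value), coef in categorical.items():
            if str(event[name]) == value:   # term applies only on a match
                total += coef
        return total

    return score

score = load_scorer(PMML_TEXT)
prediction = score({"DEPDELAY": 10.0, "DISTANCE": 100.0, "DAYOFWEEK": 7})
```

The returned `score` function closes over the parsed coefficients, so the XML is read once at model import and each event afterwards costs only a few multiplications.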
R ‘pmml’ package (cran.r-project.org/package=pmml) supports conversion to PMML for:
• ada (ada)
• arules
• coxph (survival)
• glm (stats)
• glmnet (glmnet)
• hclust (stats)
• kmeans (stats)
• ksvm (kernlab)
• lm (stats)
• multinom (nnet)
• naiveBayes (e1071)
• nnet (nnet)
• randomForest (randomForest)
• rfsrc (randomForestSRC)
• rpart (rpart)
• svm (e1071)
Model generation/conversion/export flow
[Diagram: Model Generation → Conversion to PMML → Model repository/export]
• ORE Embedded R execution: ore.*Apply() for data- and task-parallel execution
• Concurrent generation of large number of models frml