Machine Learning on Streaming Data via Integration of Oracle ... - IOUG



Machine Learning on Streaming Data via Integration of Oracle Advanced Analytics and Oracle Stream Explorer
Mauricio Arango, Alex Ardel
January 28, 2016


Agenda
• Machine learning on streaming data
• Oracle based workflow and architecture for streaming analytics
• Streaming data platform (Oracle Stream Explorer)
• Model generation (Oracle Advanced Analytics)
• Model transfer
• Real-time scoring
• Demo
• Summary


Machine Learning on Streaming Data
Why is it Important?
• Massive expansion of sensors across businesses & industries, governments, and scientific domains – e.g. IoT
• Increasingly, data streams from sensors require near real-time predictive analytics – data value decreases with time
  – Fault detection and forecasting in complex infrastructure systems – e.g. wind turbine farms
  – Complex scientific equipment monitoring – e.g. cooling systems in particle accelerators
  – Security threat detection – e.g. network intrusion detection
  – Car & truck condition monitoring – e.g. detecting patterns that signal upcoming mechanical problems
  – Environmental monitoring – early detection of air & water quality issues


Machine Learning Analytics on Streaming Data

• Use of machine learning algorithms to detect patterns & perform predictions based on models created from historical data
• Predictions performed on input data streams – capable of handling very high-throughput streams

[Diagram: historical data (used to build the model) feeds Machine Learning; streaming data (predictions performed on it) feeds the Prediction System, which outputs the prediction stream]


Machine Learning Analytics on Streaming Data

• Streaming machine learning workflows involve two main stages:
  – Model building (training)
    • Uses historical data as input – may involve handling very large datasets
    • Requires toolset with support for expanding set of key machine learning algorithms
    • Performed with low frequency
  – Model scoring
    • Performed with high frequency
    • No historical data required – model representation contains all required information
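The split between the two stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the one-variable least-squares model, the function names `train` and `score_stream`, and the toy data are all hypothetical and are not taken from ORE or OSX. The point it shows is that scoring needs only the model parameters, never the historical data.

```python
from statistics import mean


def train(history):
    """Model building: fit y = intercept + slope * x by ordinary least squares.

    `history` is a list of (x, y) pairs; this runs rarely, on stored data.
    """
    xs = [x for x, _ in history]
    ys = [y for _, y in history]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in history)
             / sum((x - x_bar) ** 2 for x in xs))
    return {"intercept": y_bar - slope * x_bar, "slope": slope}


def score_stream(model, events):
    """Model scoring: one prediction per incoming event.

    Only the model parameters are used; no historical data is touched.
    """
    for x in events:
        yield model["intercept"] + model["slope"] * x


# Hypothetical historical data lying on the line y = 1 + 2x.
history = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
model = train(history)

# Hypothetical input stream; a generator keeps the scoring side incremental.
predictions = list(score_stream(model, iter([10.0, 20.0])))
```

Because `score_stream` is a generator, it processes events one at a time, which is the property that matters for high-frequency scoring.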


Streaming Machine Learning Architecture
• Use separate optimized platforms for each workflow stage:
  – Oracle R Enterprise (ORE) component of Oracle Advanced Analytics (OAA) for model development and building
  – Oracle Stream Explorer (OSX) for streaming data scoring, stream pre/postprocessing

[Diagram: in Oracle R Enterprise, training data feeds the Model Builder; the model is transferred from ORE to OSX using PMML; in Oracle Stream Explorer, the input event stream passes through preprocessing, model import & scoring function creation, then Model Scoring, which outputs the prediction stream]


Streaming Machine Learning End-to-End Workflow
• End goals:
  – Fully automated OSX scoring application generation triggered by PMML model input
  – Automated model refresh
  – Simple workflow control via OSX user interface application

[Flow: Model generation (OAA) → Conversion to standard format (R/PMML) → Model repository creation / export (OAA) → Model import (OSX) → Scoring function creation (OSX) → Scoring on input streams (OSX)]


Streaming Platform


Oracle Stream Explorer Overview
Real-time Analytics Platform

[Diagram labels: Enterprise Event Processing; Internet of Things Embedded Event Processing; FOG Computing; Raspberry Pi; “Sea of data”; Cloud Services*]

• Filtering
• Correlation
• Aggregation
• Pattern matching

Enterprise Event Processing
• High Volume
• Continuous Streaming
• Sub-Millisecond Latency
• Disparate Sources
• Time-Window Processing
• Pattern Matching
• High Availability/Scalability
• Coherence Integration
• Geospatial, Geofencing
• Big Data Integration
• K-means clustering

Stream Processing Toolset
• Event Pattern Examples
  – congestion detection
  – silent failure detection
  – geospatial movement
  – location based
  – anomaly detection via clustering
• Integration with business event visualization (BAM)
• Very high performance:
  – SPARC T5 – 4 million events/sec
  – Exalogic – 30 million events/sec


Streaming Applications Modeled as Data Flow Graphs
Event Processing Network (EPN)

[Diagram: input event streams → EPN → output event streams]

EPN (Event Processing Network) Elements: Adapter, Channel, Cache, POJO, Processor

• Application logic contained in processor nodes
• Programmed in Java and Continuous Query Language (CQL)
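The data-flow-graph idea can be illustrated with chained Python generators. This is a rough analogy only: OSX applications are written in Java and CQL, and nothing here is OSX API. The function names, the `timestamp,sensor,value` input format, and the count-based window are assumptions made for this sketch.

```python
from collections import deque


def adapter(raw_lines):
    """Adapter: turn raw 'timestamp,sensor,value' strings into event dicts."""
    for line in raw_lines:
        ts, sensor, value = line.split(",")
        yield {"ts": int(ts), "sensor": sensor, "value": float(value)}


def filter_processor(events, threshold):
    """Processor: pass through only events whose value reaches the threshold."""
    for event in events:
        if event["value"] >= threshold:
            yield event


def window_avg_processor(events, size):
    """Processor: emit the average over the last `size` events (count-based window)."""
    window = deque(maxlen=size)
    for event in events:
        window.append(event["value"])
        yield {"ts": event["ts"], "avg": sum(window) / len(window)}


# Hypothetical input; chaining the generators plays the role of the channels
# that connect nodes in the graph.
raw = ["1,s1,10.0", "2,s1,2.0", "3,s1,20.0", "4,s1,30.0"]
output_stream = list(
    window_avg_processor(filter_processor(adapter(raw), threshold=5.0), size=2)
)
```

Each stage consumes events lazily from the previous one, so the pipeline handles one event at a time rather than materializing the whole stream.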


Stream Explorer User Interface
• New Fast Data, Real-time Streaming Analytics Platform for the Business Audience
  – Hides challenges and complexities of using streaming data analytics platforms
  – Provides accelerated delivery time to market of real-time event driven solutions
• Simple Cloud-Aware Canvas Façade
  – Analyze simulated or live data feeds to detect event patterns, perform event correlation, aggregation, filtering & clustering
  – Provides out-of-the-box patterns for industry specific solutions
  – Integrates transparently with the runtime development platform to include predictive algorithms


Oracle Internet of Things Cloud Service Infrastructure

[Architecture diagram; labels by area:
Devices / Internet “Things”: Java Device, Oracle Event Processing Embedded, Other Devices, 3rd Party Device Cloud
Network: WWAN, 2G/3G/LTE Network, Communications Service Provider Applications, Network Management and Policy, Device Management, Charging and Billing
Oracle Cloud: Cloud Service Gateway, Firewall, Messaging Proxy, Endpoint Management, Event Processing, Oracle Stream Explorer, Dispatcher REST/JMS, Oracle Database, Hadoop, Oracle Business Intelligence Cloud Service, Oracle Integration Cloud Service
Enterprise Cloud or On Premise: Custom Application, Field Service, CRM / OM / SFA, ERP (Financials, SCM, HCM), Industry Vertical Applications
Stages: Gather, Enrich, Stream, Manage, Analyze & Acquire, Organize, Action]


Model Generation


Model generation

Large number of models built with Oracle R Enterprise

• Oracle R Enterprise (ORE) is a component of the Oracle Advanced Analytics (OAA) option of Oracle Database Enterprise Edition
  – Makes the R statistical programming language and environment ready for the enterprise and big data.
  – R users can develop, refine, and deploy R scripts that leverage the parallelism and scalability of Oracle Database to automate data analysis.
• ORE Transparency Layer
  – In-database data exploration, preparation and analysis through implicit translation of R into SQL.
  – Operations on database data as though they were R objects using R syntax.
• ORE Embedded R Execution: execution of R scripts by one or more R engines running on the database server
  – Allows the use of open source CRAN packages in R scripts running on the database server.
  – Eliminates moving data from the Oracle Database server to a client R session.
  – Uses the database server to start, manage, and control the execution of R scripts in R engines running on the server.
  – Leverages the memory and processing power of the database server machine for R engine execution.
  – Enables data-parallel and task-parallel execution of user-defined R functions.
  – Multiple models generated concurrently with ore.*Apply() functions
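The data-parallel pattern behind building many models concurrently can be sketched in Python as "partition the rows by a key, then fit one model per partition in parallel". This is a stand-in for the idea only, not the ORE API: the `ore.*Apply()` functions run R inside the database, whereas the thread pool, the field names `group`, `x`, `y`, and the simple least-squares fit below are assumptions chosen for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def fit_group(rows):
    """Fit y = intercept + slope * x on one partition of the data."""
    xs = [r["x"] for r in rows]
    ys = [r["y"] for r in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return {"intercept": y_bar - slope * x_bar, "slope": slope}


def build_models(rows, key):
    """Partition rows by `key` and fit one model per partition concurrently."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    with ThreadPoolExecutor(max_workers=4) as pool:
        fitted = list(pool.map(fit_group, groups.values()))
    # dict preserves insertion order, so keys and fitted models line up.
    return dict(zip(groups, fitted))


# Hypothetical data: group "A" follows y = x, group "B" follows y = 5 + 2x.
rows = [
    {"group": "A", "x": 0.0, "y": 0.0},
    {"group": "A", "x": 1.0, "y": 1.0},
    {"group": "A", "x": 2.0, "y": 2.0},
    {"group": "B", "x": 0.0, "y": 5.0},
    {"group": "B", "x": 1.0, "y": 7.0},
    {"group": "B", "x": 2.0, "y": 9.0},
]
models = build_models(rows, key="group")
```

The result is a dictionary with one fitted model per group, which mirrors the "large number of models" outcome described on the slide.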


Model Transfer


Model transfer between generator and scoring engine
Predictive Model Markup Language (PMML)

[Diagram: Model Building and Model Consumption, with tools R, Python, SAS, SPSS, SAP, Knime, …; marker legend: contribution to at least one version (PMML or PFA)]

www.dmg.org: Data Mining Group
• Standard for representing data mining models for sharing between different tools and environments
• Brainchild of DMG. Mature standard supported by over 20 vendors and open source analytic organisations
• XML schema for describing the model structure


PMML components
• Header: version/timestamp/model development environment …
• Data Dictionary: definitions for all fields used by the DM model (data types, ranges …)
• Data Transformations: mapping of user data in suitable form for the DM model
• Mining Schema: fields usage (active/target), policies for handling missing/invalid values, outliers …
• Targets: syntax for handling target categories
• Model: definitions for data mining model
• Outputs
• Multiple Models
• Scope of Fields
• Taxonomies
• Model Verification

Classes of models supported by PMML
• Association Rules
• Baseline Models
• Decision Trees
• Center- & Distribution-based Clustering
• Regression & General Regression
• k-Nearest Neighbors
• Neural Networks
• Naïve Bayes
• Sequences
• Text
• Time Series
• Support Vector Machines
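To make the component list concrete, here is a minimal sketch that assembles a small PMML regression document with Python's standard `xml.etree.ElementTree`. The element and attribute names (`Header`, `DataDictionary`, `MiningSchema`, `RegressionTable`, `NumericPredictor`) follow the PMML 4.2 schema; the helper name `build_pmml`, the field names `y`, `x1`, `x2`, and the coefficient values are hypothetical. A real export from the R 'pmml' package carries more detail than this.

```python
import xml.etree.ElementTree as ET


def build_pmml(intercept, coefficients, target):
    """Assemble a minimal PMML 4.2 linear regression document as a string."""
    pmml = ET.Element("PMML", version="4.2", xmlns="http://www.dmg.org/PMML-4_2")

    # Header: provenance of the model.
    header = ET.SubElement(pmml, "Header", description="linear regression sketch")
    ET.SubElement(header, "Application", name="hand-written example")

    # Data Dictionary: every field the model uses, with its type.
    fields = [target] + list(coefficients)
    dictionary = ET.SubElement(pmml, "DataDictionary", numberOfFields=str(len(fields)))
    for name in fields:
        ET.SubElement(dictionary, "DataField", name=name,
                      optype="continuous", dataType="double")

    # Model, with its Mining Schema stating how each field is used.
    model = ET.SubElement(pmml, "RegressionModel", functionName="regression")
    schema = ET.SubElement(model, "MiningSchema")
    ET.SubElement(schema, "MiningField", name=target, usageType="target")
    for name in coefficients:
        ET.SubElement(schema, "MiningField", name=name, usageType="active")

    # The model parameters themselves.
    table = ET.SubElement(model, "RegressionTable", intercept=str(intercept))
    for name, coef in coefficients.items():
        ET.SubElement(table, "NumericPredictor", name=name, coefficient=str(coef))

    return ET.tostring(pmml, encoding="unicode")


# Hypothetical model: y = 1.5 + 0.9 * x1 - 0.25 * x2
doc = build_pmml(1.5, {"x1": 0.9, "x2": -0.25}, target="y")
```

The resulting string is plain XML, so any consumer that understands the PMML schema can read the model without access to the tool that produced it.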


PMML
ONTIME_S dataset: Flight arrival delay prediction

linear regression model example

ARRDELAY ~ DEPDELAY + DISTANCE + DAYOFWEEK
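On the consuming side, a scoring function for this kind of model can be derived from the PMML alone. The sketch below parses a hand-written PMML fragment and returns a function that scores one event at a time. Everything numeric in it is an assumption: the intercept and coefficients are made up, not fitted on ONTIME_S, and DAYOFWEEK is treated as a plain numeric predictor for simplicity (a model fitted with DAYOFWEEK as a factor would use categorical predictors instead). The helper name `load_scorer` is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"p": "http://www.dmg.org/PMML-4_2"}

# Hand-written PMML with made-up coefficients (not a real fit).
PMML_DOC = """<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header/>
  <DataDictionary numberOfFields="4">
    <DataField name="ARRDELAY" optype="continuous" dataType="double"/>
    <DataField name="DEPDELAY" optype="continuous" dataType="double"/>
    <DataField name="DISTANCE" optype="continuous" dataType="double"/>
    <DataField name="DAYOFWEEK" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="ARRDELAY" usageType="target"/>
      <MiningField name="DEPDELAY"/>
      <MiningField name="DISTANCE"/>
      <MiningField name="DAYOFWEEK"/>
    </MiningSchema>
    <RegressionTable intercept="-2.0">
      <NumericPredictor name="DEPDELAY" coefficient="1.0"/>
      <NumericPredictor name="DISTANCE" coefficient="-0.002"/>
      <NumericPredictor name="DAYOFWEEK" coefficient="0.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""


def load_scorer(pmml_text):
    """Parse a PMML linear regression and return a function scoring one event."""
    root = ET.fromstring(pmml_text)
    table = root.find("p:RegressionModel/p:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefs = {
        p.get("name"): float(p.get("coefficient"))
        for p in table.findall("p:NumericPredictor", NS)
    }

    def score(event):
        # Linear predictor: intercept + sum of coefficient * field value.
        return intercept + sum(c * event[name] for name, c in coefs.items())

    return score


score = load_scorer(PMML_DOC)
# Hypothetical input event from the stream.
predicted_delay = score({"DEPDELAY": 10.0, "DISTANCE": 500.0, "DAYOFWEEK": 2.0})
```

The scorer is built once at model import time and then applied per event, which matches the import-then-score split between model transfer and real-time scoring.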

R ‘pmml’ package (cran.r-project.org/package=pmml): supports conversion to PMML for:
• ada (ada)
• arules
• coxph (survival)
• glm (stats)
• glmnet (glmnet)
• hclust (stats)
• kmeans (stats)
• ksvm (kernlab)
• lm (stats)
• multinom (nnet)
• naiveBayes (e1071)
• nnet (nnet)
• randomForest (randomForest)
• rfsrc (randomForestSRC)
• rpart (rpart)
• svm (e1071)


Model generation/conversion/export flow

[Flow: Model generation → Conversion to PMML → Model repository/export]

• ORE Embedded R execution: ore.*Apply() for data- and task-parallel execution
• Concurrent generation of large number of models
frml