BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE ECOSYSTEM

Dr. Alex Liu

Chief Data Scientist, Analytics Services @ IBM
Sep 12, 2018
NASA JPL SVCP

ALEX LIU INTRODUCTION
Chief Data Scientist – Analytics Services at IBM

• A data science thought leader
• Chief Data Scientist for a few corporations before joining IBM
• Taught advanced data analytics at the University of Southern California and the University of California at Irvine
• Consulted for the United Nations, Ingram Micro …
• M.S. and Ph.D. from Stanford University

DATA SCIENCE: TURNING DATA INTO VALUE WITH MODELS
Data science produces insights and value via a complicated process and a big set of tools

[Figure: data stores in the tool set: BigInsights (HDFS), Cloudant (DBaaS), dashDB (Analytics), SQDB (Managed DB2), Swift (Object Storage)]

DATA SCIENCE PROJECTS RETURN VERY VALUABLE RESULTS, BUT A LOT FAIL
• Netflix, for example, integrates data science into each part of its business; it estimates a billion dollars in incremental value from personalization and recommendation alone.
• Knight Capital Group, for instance, lost $440 million in 45 minutes after a mistake in updating a model.
• Gartner estimated that 60% of big data projects fail in 2016, and in 2017.
• Reproducibility crisis & fast insight demands

DATA SCIENCE – COMPLICATED
VERY COMPLICATED FLOWS JUST FOR THE MODEL BUILDING STAGE
• More than 50 different algorithms: SVM, Neural Net, Decision Trees/Forests, Naïve Bayes, Regression, SMO, k-Nearest Neighbor, Clustering, Rules, …
• Combinatorially explosive number of parameter choices per algorithm: kernel type, pruning strategy, number of trees in a forest, learning rate, …
• Wide variation in performance across different algorithm implementations (e.g., SPSS vs Python vs WEKA vs SPARK …)
• User-defined algorithms
• Substantial cost in user and compute time
    – The user spends time trying new combinations and parameters
    – The computational cost of training a single SVM can exceed 24h
    – Selection is commonly based on data scientist bias
• Each additional pipeline stage increases complexity dramatically!
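The combinatorial growth described above can be made concrete with a small counting sketch. The algorithm and parameter names echo the slide, but the candidate values, the `GRIDS` dictionary, and the helper functions are illustrative assumptions, not part of the original talk or of any IBM tool.

```python
from itertools import product
from math import prod

# Hypothetical hyperparameter grids. Names follow the slide; the candidate
# values are made up purely to illustrate how fast the search space grows.
GRIDS = {
    "svm": {"kernel": ["linear", "rbf", "poly"], "C": [0.1, 1, 10, 100]},
    "random_forest": {
        "n_trees": [50, 100, 200, 500],
        "max_depth": [4, 8, 16],
        "pruning": ["none", "cost_complexity"],
    },
    "neural_net": {
        "learning_rate": [1e-3, 1e-2, 1e-1],
        "hidden_units": [32, 64, 128],
    },
}


def count_configs(grid):
    """Number of distinct parameter settings for one algorithm."""
    return prod(len(values) for values in grid.values())


def enumerate_configs(grid):
    """Yield every parameter setting of one algorithm as a dict."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def search_space_size(grids, stage_options=()):
    """Model configurations multiplied by the option count of each extra stage."""
    model_configs = sum(count_configs(grid) for grid in grids.values())
    return model_configs * prod(stage_options)


if __name__ == "__main__":
    print(search_space_size(GRIDS))          # model-building stage only
    print(search_space_size(GRIDS, (3, 4)))  # plus two hypothetical upstream stages
```

Even with only three algorithms and two or three parameters each, adding two upstream stages (say, three scaling options and four feature-selection options) multiplies the number of candidate pipelines by twelve.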


IMPORTANCE OF AUTOMATIONS & COMMUNITIES

AUTOMATION ~ Compare a Data Scientist with and without computer-based augmentation. Show that computer-augmented data science can reduce time-to-result by an order of magnitude and improve the quality of results.

COMMUNITY ~ Self-learn and validate using open competitions or evaluations (e.g., Kaggle, OpenML), IBM customer engagements
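As a rough illustration of the automation idea on this slide, the sketch below lets the computer score several candidate models by cross-validation and pick the best one, instead of relying on manual trial and error. It is a toy under stated assumptions: the two candidate models, the fold scheme, and all function names are hypothetical and are not taken from the talk or from any IBM product.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Baseline model: always predict the training mean."""
    m = mean(ys)
    return lambda x: m


def fit_linear(xs, ys):
    """One-variable least-squares line."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def cv_mse(fit, xs, ys, k=3):
    """Mean squared error over k interleaved cross-validation folds."""
    n = len(xs)
    errors = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        model = fit(train_x, train_y)
        errors.extend((model(xs[i]) - ys[i]) ** 2 for i in test_idx)
    return mean(errors)


def auto_select(candidates, xs, ys, k=3):
    """Score every candidate and return the name with the lowest CV error."""
    scores = {name: cv_mse(fit, xs, ys, k) for name, fit in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    xs = list(range(12))
    ys = [2 * x + 1 for x in xs]  # synthetic data with a linear trend
    best, scores = auto_select({"mean": fit_mean, "linear": fit_linear}, xs, ys)
    print(best, scores)
```

Scoring candidates on held-out folds rather than on the training data is what keeps the automated choice from simply rewarding the most flexible model.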


DS ASSISTED BY AI WITHIN A DS COMMUNITY
1) Bring automation into key areas of large-scale data analysis tasks
• Overcome “analytic decision overload” for Data Scientists
• Enable Data Scientists to:
    – view and interact with the decision-making process in an online fashion
    – obtain rapid insights from data to answer key questions
• Augmentation vs. Automation

2) Integrated system of tools, working with DS communities
• An integrated system for scientists to easily handle data, analytical, and application needs
• Upload and prepare data from various sources
• Cross-platform modeling and machine learning implementation
• Cross-platform analytic deployments on Big Data platforms

IBM Research

WATSON STUDIO LOCAL
One Platform for IBM Analytics Team

[Figure: platform diagram. Labels recovered from the slide:]
• Roles: Developer, Data Steward, Data Scientist, Data Engineer
• Tools: Watson Studio, Data Science Tools, Jupyter Notebooks, RStudio, Spark ML, Data Refinery, Watson Analytics, Cognos, Dashboards
• Watson Data Platform Persistence Cloud Services: IBM Analytics Engine, Db2 Warehouse on Cloud, IBM Cloud Object Storage, IBM Compose, IBM Cloudant
• Data sources: 32 Different Connections Plugin, IBM Cloud, Hadoop, On-premises data, Data from the IBM Cloud & third party clouds

IBM Confidential

IBM Data Science Experience summary


TAKING A DATA SCIENCE ECOSYSTEM APPROACH
A DATA SCIENCE ECOSYSTEM HAS THREE BASIC ELEMENTS
1) DATA PORTAL
2) DATA SCIENCE COMMUNITY
3) DATA SCIENCE PLATFORM

RMDS COMMUNITIES AT IBM GLENDALE
• Pasadena/Glendale Meetup Community: a local face-to-face community with more than 1100 members, https://www.meetup.com/RMDS_LA/
• LinkedIn group with 29K participants: https://www.linkedin.com/groups/1895501

Aim to create an environment for utilizing big data analytics to create smart cities and smart commerce

EX1: citizen data science ecosystem with open data

• 105,000+ collections
• 349 citizen apps
• 500,000 data resources
• 175 agencies
• 450 APIs

Source: City of LA Mayor’s Tech Advisor Presentation at RMDS Meetup.


http://www.ibm.com/weather

EXAMPLE – 1KM VISIBLE (GOES-R WILL BE EVEN BETTER)

EX2: A data science ecosystem with weather data

[Figure: ecosystem diagram. Labels recovered from the slide:]
• Data: Weather Data, Transaction, IoT Data
• Platform ~ IBM DSX
• Community: connecting all the data scientists from a DS community
• Outputs: Applications, Solutions, Optimizing Operations, Analytical Insights for Transformation
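One way to picture how weather data and transaction data combine into an analytical insight is a simple join by date. The sketch below is purely illustrative: the records, the field names (`date`, `precip_mm`, `amount`), and the rain threshold are made-up assumptions and do not come from the IBM weather services or from the talk.

```python
from collections import defaultdict
from statistics import mean

# Made-up daily weather observations (hypothetical field names).
WEATHER = [
    {"date": "2018-09-01", "precip_mm": 0.0},
    {"date": "2018-09-02", "precip_mm": 5.2},
    {"date": "2018-09-03", "precip_mm": 0.0},
    {"date": "2018-09-04", "precip_mm": 12.0},
]

# Made-up sales transactions (hypothetical field names).
TRANSACTIONS = [
    {"date": "2018-09-01", "amount": 100.0},
    {"date": "2018-09-01", "amount": 50.0},
    {"date": "2018-09-02", "amount": 80.0},
    {"date": "2018-09-03", "amount": 200.0},
    {"date": "2018-09-04", "amount": 60.0},
    {"date": "2018-09-04", "amount": 20.0},
]


def daily_sales(transactions):
    """Sum transaction amounts per calendar day."""
    totals = defaultdict(float)
    for t in transactions:
        totals[t["date"]] += t["amount"]
    return dict(totals)


def sales_by_condition(weather, transactions, rain_threshold_mm=1.0):
    """Average daily sales on rainy versus dry days, joined on the date key."""
    totals = daily_sales(transactions)
    groups = {"rainy": [], "dry": []}
    for obs in weather:
        if obs["date"] not in totals:
            continue  # no sales recorded for that day
        key = "rainy" if obs["precip_mm"] >= rain_threshold_mm else "dry"
        groups[key].append(totals[obs["date"]])
    return {name: mean(values) for name, values in groups.items() if values}


if __name__ == "__main__":
    print(sales_by_condition(WEATHER, TRANSACTIONS))
```

A real analysis would need far more data and controls for seasonality and day of week; the point here is only the shape of the join between the two data sources.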

A MAJORITY OF RETAIL AND CP EXECUTIVES INDICATE WEATHER HAS A SIGNIFICANT IMPACT ON BUSINESS DECISION-MAKING

Weather either influences all human decisions or triggers automated actions in the following areas:

• Worker allocation and staff scheduling: 51%
• Work safely: 50%
• Inventory pricing: 50%
• Customer interactions: 45%
• Marketing / messaging: 41%
• Inventory placements: 40%
• Routes and transportation: 39%
• Supply chain / sourcing: 35%
• Product development: 33%