BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE ECOSYSTEM
Dr. Alex Liu
Chief Data Scientist Analytics Services @ IBM
[email protected] Sep 12, 2018 NASA JPL SVCP
ALEX LIU INTRODUCTION
Chief Data Scientist – Analytics Services at IBM
• A data scientist and thought leader
• Chief Data Scientist for a few corporations before joining IBM
• Taught advanced data analytics at the University of Southern California and the University of California, Irvine
• Consulted for the United Nations, Ingram Micro …
• M.S. and Ph.D. from Stanford University
DATA SCIENCE: TURNING DATA INTO VALUE WITH MODELS
Data science produces insights and value through a complicated process and a big set of tools
BigInsights (HDFS)
Cloudant (DBaaS)
dashDB (Analytics)
SQDB (Managed DB2)
Swift (Object Storage)
DATA SCIENCE PROJECTS RETURN VERY VALUABLE RESULTS, BUT MANY FAIL
• Netflix integrates data science into each part of its business and estimates a billion dollars in incremental value from personalization and recommendation alone.
• Knight Capital Group lost $440 million in 45 minutes after a mistake in updating a model.
• Gartner estimated that 60% of big data projects fail in 2016, and in 2017.
• Reproducibility crisis and demand for fast insights
DATA SCIENCE – COMPLICATED
VERY COMPLICATED FLOWS JUST FOR THE MODEL BUILDING STAGE
• More than 50 different algorithms: SVM, Neural Net, Decision Trees/Forests, Naïve Bayes, Regression, SMO, k-nearest Neighbor, Clustering, Rules, …
• Combinatorially explosive number of parameter choices per algorithm: kernel type, pruning strategy, number of trees in a forest, learning rate, …
• Wide variation in performance across different algorithm implementations (e.g., SPSS vs Python vs WEKA vs SPARK …)
• User-defined algorithms
• Substantial cost in user and compute time
  • User spends time trying new combinations and parameters
  • Computational cost for training a single SVM can exceed 24h
  • Selection commonly based on data scientist bias
• Each additional pipeline stage increases complexity dramatically!
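To make the combinatorial explosion concrete, here is a minimal Python sketch that counts configurations in a small search space. The algorithm names come from the slide; the parameter names, candidate values, the number of preprocessing options, and the one-hour training time are illustrative assumptions, not figures from the talk.

```python
from itertools import product
from math import prod

# Hypothetical search space: algorithm names are from the slide, but the
# parameter names and candidate values are illustrative assumptions.
SEARCH_SPACE = {
    "SVM": {
        "kernel": ["linear", "poly", "rbf", "sigmoid"],
        "C": [0.01, 0.1, 1, 10, 100],
        "gamma": [0.001, 0.01, 0.1, 1],
    },
    "Random Forest": {
        "n_trees": [50, 100, 200, 500],
        "max_depth": [4, 8, 16, None],
        "pruning": ["none", "cost_complexity"],
    },
    "Neural Net": {
        "learning_rate": [0.0001, 0.001, 0.01, 0.1],
        "hidden_layers": [1, 2, 3],
        "units": [32, 64, 128, 256],
        "activation": ["relu", "tanh"],
    },
}


def count_configs(grid):
    """Number of distinct settings in one algorithm's parameter grid."""
    return prod(len(values) for values in grid.values())


def enumerate_configs(grid):
    """Yield every parameter combination in the grid as a dict."""
    names = list(grid)
    for combo in product(*(grid[name] for name in names)):
        yield dict(zip(names, combo))


per_algorithm = {name: count_configs(grid) for name, grid in SEARCH_SPACE.items()}
model_configs = sum(per_algorithm.values())

# Assumed number of options for one extra upstream pipeline stage
# (for example, a choice of feature scaling method).
PREPROCESSING_OPTIONS = 3
pipeline_configs = model_configs * PREPROCESSING_OPTIONS

# Assumed average training time per configuration, in hours.
HOURS_PER_RUN = 1.0
exhaustive_hours = pipeline_configs * HOURS_PER_RUN

print(per_algorithm)
print(model_configs, pipeline_configs, exhaustive_hours)
```

Even with only three algorithms and a handful of values per parameter, adding a single pipeline stage multiplies the number of runs, which is why exhaustive manual exploration quickly becomes impractical.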
IMPORTANCE OF AUTOMATION & COMMUNITIES
AUTOMATION ~ Compare data scientists with and without computer-based augmentation. Show that computer-augmented data science can reduce time-to-result by an order of magnitude and improve the quality of results.
COMMUNITY ~ Self-learn and validate using open competitions or evaluations (e.g., Kaggle, OpenML) and IBM customer engagements.
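The slides do not say which automation technique is used. As one simple illustration of computer-based augmentation, the sketch below runs a budgeted random search over a parameter space instead of relying on hand-picked settings. The function `random_search`, the `SPACE` dictionary, and `toy_score` are all hypothetical; `toy_score` is a stand-in for a real validation metric.

```python
import random


def random_search(space, score_fn, budget, seed=0):
    """Sample `budget` random configurations and return the best one found.

    space    : dict mapping parameter name -> list of candidate values
    score_fn : callable taking a config dict and returning a number
               (higher is better)
    """
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(budget):
        config = {name: rng.choice(values) for name, values in space.items()}
        score = score_fn(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


# Hypothetical parameter space (names and values are assumptions).
SPACE = {
    "learning_rate": [0.0001, 0.001, 0.01, 0.1],
    "n_trees": [50, 100, 200, 500],
}


def toy_score(config):
    # Placeholder for a real validation metric; it is highest (zero)
    # at learning_rate=0.01 and n_trees=200.
    return -abs(config["learning_rate"] - 0.01) - abs(config["n_trees"] - 200) / 1000


best_config, best_score = random_search(SPACE, toy_score, budget=10)
print(best_config, best_score)
```

In practice `score_fn` would train and validate a model, and the budget would be set by the available compute time; a fixed seed keeps the search reproducible.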
DS ASSISTED BY AI WITHIN A DS COMMUNITY
1) Bring automation into key areas of large-scale data analysis tasks
• Overcome “analytic decision overload” for data scientists
• Enable data scientists to:
  • view and interact with the decision-making process in an online fashion
  • obtain rapid insights from data to answer key questions
• Augmentation vs. automation
2) Integrated system of tools, working with DS communities
• An integrated system for scientists to easily handle data, analytical, and application needs
• Upload and prepare data from various sources
• Cross-platform modeling and machine learning implementation
• Cross-platform analytic deployments on Big Data platforms
IBM Research
WATSON STUDIO LOCAL
[Diagram: One Platform for IBM Analytics Team on IBM Cloud. Role labels: Developer, Data Steward, Data Scientist, Data Engineer. Data science tools: Watson Studio, Jupyter Notebooks, RStudio, Spark ML, Data Refinery. Other labels: Watson Analytics, Dashboards, Cognos, Hadoop. Watson Data Platform persistence and cloud services: IBM Analytics Engine, Db2 Warehouse on Cloud, IBM Cloud Object Storage, IBM Compose, IBM Cloudant. Data sources: on-premises data, data from the IBM Cloud and third-party clouds, 32 different connections plugin.]
IBM Data Science Experience summary
TAKING A DATA SCIENCE ECOSYSTEM APPROACH
A DATA SCIENCE ECOSYSTEM HAS THREE BASIC ELEMENTS
1) DATA PORTAL
2) DATA SCIENCE COMMUNITY
3) DATA SCIENCE PLATFORM
RMDS COMMUNITIES AT IBM GLENDALE
• Pasadena/Glendale Meetup community: a local face-to-face community with more than 1,100 members, https://www.meetup.com/RMDS_LA/
• LinkedIn group with 29K participants: https://www.linkedin.com/groups/1895501
• Aims to create an environment for using big data analytics to build smart cities and smart commerce
EX1: citizen data science ecosystem with open data
• 105,000+ collections
• 349 citizen apps
• 500,000 data resources
• 175 agencies
• 450 APIs
Source: City of LA Mayor’s Tech Advisor Presentation at RMDS Meetup.
http://www.ibm.com/weather
EXAMPLE – 1KM VISIBLE (GOES-R WILL BE EVEN BETTER)
EX2: A data science ecosystem with weather data
[Diagram labels: Weather Data; Transaction; IoT Data; Platform ~ IBM DSX; Connecting all the data scientists from a DS community; Applications; Solutions; Optimizing Operations; Analytical Insights for Transformation]
A MAJORITY OF RETAIL AND CP EXECUTIVES INDICATE WEATHER HAS A SIGNIFICANT IMPACT ON BUSINESS DECISION-MAKING
Weather either influences all human decisions or triggers automated actions in the following areas
• Worker allocation and staff scheduling: 51%
• Work safely: 50%
• Inventory pricing: 50%
• Customer interactions: 45%
• Marketing / messaging: 41%
• Inventory placements: 40%
• Routes and transportation: 39%
• Supply chain / sourcing: 35%
• Product development: 33%