Machine Learning Startups
My Background
Lessons Learned
● Turn hard problems into easy ones
● ML in practice requires carefully formulating research problems
● ...and being creative about bootstrapping training data
Lessons Learned
● Many ways to capture dependencies
● Training data and features > models
Lessons Learned
● A model is not a product
● Nobody cares about your ideas
Flightcaster
Predicting the real-time state of the global air traffic network
The Prediction Problem
Given a flight F scheduled to depart at time T, predict the likelihood that F departs at T, T+n1, T+n2, … (sketch below)
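A minimal sketch of the target encoding this implies, in Clojure. The bucket boundaries and field names are illustrative assumptions, not FlightCaster's actual bins:

```clojure
;; Hypothetical delay buckets standing in for T, T+n1, T+n2.
(defn delay-bucket
  "Label a flight by how late it actually departed (epoch millis in)."
  [scheduled-ms actual-ms]
  (let [delay-min (/ (- actual-ms scheduled-ms) 60000.0)]
    (cond
      (< delay-min 15) :on-time     ; departed at T
      (< delay-min 60) :delayed     ; departed around T+n1
      :else            :very-late))) ; departed at T+n2 or later
```

A model then outputs a probability for each bucket rather than a single point estimate.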
Featurizing
● Carrier, FAA, and weather data
● Nightly reset is a natural cadence for feature vectors
● Every aircraft has a unique tail #
● Fuzzy k-way join on tail #, time, and location; isolate incorrect joins by keeping feature vectors independent (sketch after this list)
● Past positions: already delayed at prediction time?
● Weather and status: FAA groundings at airports on the path?
● Featurizing time: how delayed, and how many minutes from departure?
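A minimal sketch of the fuzzy-join idea, shown pairwise for two feeds (the real join was k-way; the tolerance, field names, and match rule here are assumptions):

```clojure
;; Records from two feeds match when tail numbers and locations agree
;; and timestamps are close enough.
(defn fuzzy-match? [a b]
  (and (= (:tail-number a) (:tail-number b))
       (= (:airport a) (:airport b))
       (< (Math/abs (- (:timestamp a) (:timestamp b)))
          (* 30 60 1000)))) ; within 30 minutes (illustrative tolerance)

(defn fuzzy-join
  "Merge matching records into one feature vector. Each joined record
   stays independent, so one incorrect join can't contaminate others."
  [feed-a feed-b]
  (for [a feed-a, b feed-b
        :when (fuzzy-match? a b)]
    (merge a b)))
```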
Models
Trees could pick up dependencies that a linear model couldn't, but the performance gain became trivially incremental once we added more sophisticated ways of featurizing the dependencies.
Tools and Deployment
● Clojure on Hadoop for featurizing and model training
● Wrap complexity in a simple API
● FP is awesome for data pipelines
● Write models to JSON (sketch below)
● Product team used Rails: read the JSON and make predictions
● Predictions stored in the production DB for evaluation
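A minimal sketch of that handoff, using clojure.data.json; the model shape and field names are illustrative:

```clojure
(require '[clojure.data.json :as json])

;; A trained logistic model reduced to plain data...
(def model
  {:type      "logistic"
   :intercept -1.2
   :weights   {"delay-so-far" 0.8, "faa-ground-stop" 2.1}})

;; ...serialized to JSON for the Rails side to load and score.
(spit "departure-model.json" (json/write-str model))
```

The Rails app then only needs a dot product and a sigmoid to make predictions from this file.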
Pain Points
The log-based debugging paradigm sucks. You don't want to catch ETL and feature-engineering issues in the Hadoop setting, but at the same time you can't catch them at tiny scale, because they need real data at material scale.
● Dirty data: manual entry
● Early days of Clojure / Hadoop
● Deploying JSON models rather than services
Lessons learned
● Model selection mattered less than featurizing
● Many ways to capture dependencies
Intuitions of domain experts are useful but often misleading. Use domain experts to identify data sources, then build good tools and take a scientific approach to exploring the feature space.
Computational graph with higher-order functions in order to log structured data; this inspired fast debugging with plumbing.graph at Prismatic. Isolate issues: single-thread, multi-thread, multi-process, and multi-machine. (See the sketch after this list.)
Production was OK, not great. Better to put ML behind services: the full-stack product team calls the backend team's APIs.
A model is not a product. Humans don't understand probability distributions, even when discretized or turned into a classification. Solve a human need directly: turn predictions into recommendations, etc.
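For reference, a computational graph in this style looks like the stats example from Prismatic's plumbing README: each node is a keyword function (fnk) whose argument names declare its dependencies, so structured logging and isolation can be wrapped around every step.

```clojure
(require '[plumbing.core :refer [fnk sum]]
         '[plumbing.graph :as graph])

(def stats-graph
  {:n  (fnk [xs]   (count xs))             ; depends on input :xs
   :m  (fnk [xs n] (/ (sum identity xs) n))
   :m2 (fnk [xs n] (/ (sum #(* % %) xs) n))
   :v  (fnk [m m2] (- m2 (* m m)))})       ; depends on computed nodes

(def stats (graph/compile stats-graph))
(into {} (stats {:xs [1 2 3 6]}))
;; => {:n 4, :m 3, :m2 25/2, :v 7/2}
```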
Prismatic
Personalized Ranking of People, Topics, and Content
The Personalized Ranking Problem
Given an index of content, display the content that maximizes the likelihood of user engagement.
Intention: maximize long-term engagement. Proxy: maximize session interactions.
Content
● Focused crawling of Twitter, Facebook, and the web
● Maximum-coverage algorithms
● Spam content and de-duping
Featurizing
● Content and interaction features
● Feature crosses and hacks for dependencies
● Bootstrapping weight hacks: can't train on overly sparse interactions
● Scores for interests (topics, people, publishers, …)
Models: Personalized Ranking
● Logistic regression: newsfeed ranking has to be ultra-fast in production (~100 ms)
● Learning to rank: loss is inversions
● Universal features, user-specific weight vectors (sketch below)
● Snapshot every session
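A minimal sketch of scoring with universal features and a user-specific weight vector; the feature names and representation are illustrative, not the production system:

```clojure
(defn sigmoid [z] (/ 1.0 (+ 1.0 (Math/exp (- z)))))

(defn score
  "Logistic score of one story for one user: cheap enough to stay
   inside a ~100ms ranking budget."
  [user-weights story-features]
  (sigmoid (reduce-kv (fn [acc feature value]
                        (+ acc (* value (get user-weights feature 0.0))))
                      0.0
                      story-features)))

(defn rank-feed
  "Order candidate stories by descending personalized score."
  [user-weights stories]
  (sort-by #(- (score user-weights (:features %))) stories))
```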
Models: Classification
How do you train a large set of topic classifiers? Latent topic models don't work. But how would we get labeled data to train a classifier for each topic?
Enter distant supervision: create a mechanism to bootstrap training data with noisy labels. Requires lots of heuristics and clever hacks.
● Snarf docs with Twitter queries, etc.
● Create positives and negatives using filters and distance measures
● Lots of techniques to featurize text for filters and training
(sketch below)
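A minimal sketch of the bootstrapping step; the queries, thresholds, and labeling rule are illustrative assumptions, not Prismatic's actual heuristics:

```clojure
(require '[clojure.string :as str])

;; Hand-written query terms per topic, used to snarf candidate docs.
(def topic-queries {:climbing ["rock climbing" "bouldering" "belay"]})

(defn noisy-label
  "Positive if the doc matches enough query terms, negative if it
   matches none, nil if too ambiguous to use as training data."
  [topic doc-text]
  (let [text (str/lower-case doc-text)
        hits (count (filter #(str/includes? text %)
                            (topic-queries topic)))]
    (cond
      (>= hits 2)  :positive
      (zero? hits) :negative
      :else        nil)))
```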
Tools and Deployment
Clojure! Plumbing is on GitHub. Clojure backend and ClojureScript frontend. Graph, schema, and ML libs.
Pain Points
Presentation biases: people click what they're shown, which biases clicks toward top stories and creates a self-reinforcing viral propagation engine.
Data issues: dupes are easy, but spambots and spam networks keep getting more sophisticated, especially on Twitter. Bootstrapping distant supervision is hard but OK. Bootstrapping ranking with sparse interactions is super hard.
Social vs. interest-based personalization: what's interesting vs. what's viral? How do you define what's interesting? How much is a share worth compared with dwell time? Researchers are biased toward their own preferences.
Lessons learned
We went overboard with Clojure (NIH), and the environment changes fast: we missed Spark, etc. We automated classifier training data and retraining with zero intervention. Interactions can be optimized a lot (> 50%).
When data is too sparse, optimize the product before optimizing models. Heuristic IR may be good enough for a while; the investment in learning to rank is massive.
Goal
So far
10 Companies 3 Years $65M
2 Companies 6 Months $1MM
Unsexy low beta
Proprietary models, proprietary data
Cyber MGA
Indirect losses:
● stock price
● credit rating
● sales
Market
➔ $2.5B today
➔ 35% growth
➔ $50B in 10 years
Catalysts
➔ SEC and EU regulations
➔ High-profile breaches
➔ Large indirect losses
First iteration - Supervised Learning
➔ Positives: recorded breaches
➔ Negatives: random sample of companies (not attacked)
➔ Features: security features
● DNS records, certs, service vulnerabilities, …
First iteration - Supervised Learning #FAIL
➔ Incorrect assumptions: that breached companies have worse security, and that the negative samples were not attacked
Likelihood of breach
Absence of historical data and nonstationarity create a challenging environment ➔ Rich current data isn’t available historically and decays in predictive power over time ➔ Could static data be a more robust and stable predictor of risk?
Relationship with catastrophes
Insurance models exist for earthquakes, floods, hurricanes
● Sparse events (cannot estimate probabilities from frequencies)
● Events are correlated (how true is this for cyber?)
● Can we draw from ideas in cat risk to model cyber risk?
Relationship with catastrophes
Cat:
➔ stochastic simulation using physical models
➔ impacts change in magnitude but not type
Cyber:
➔ behavior of incentivized cyber-attackers is hard to model
➔ impacts change over time
Broader Approach
Premium Decomposition
[Diagram: an industry baseline breach frequency is adjusted by risk features (infrastructure security, social engineering, dynamic behavior, assets and lifecycle), then combined with size of loss and an uncertainty load to decompose the premium]
Simplifying assumption: we can start incrementally, and loss magnitude will always hit the limit. Likelihood and uncertainty depend on the breach sample:
➔ estimate uncertainty from confidence
➔ estimate likelihood from risk features
(sketch below)
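Under that simplifying assumption, the decomposition reduces to something like the following sketch. The formula and the uncertainty-load factor are hypothetical illustrations, not the actual pricing model:

```clojure
(defn premium
  "Loss always hits the limit, so expected loss = p(breach) * limit;
   a hypothetical uncertainty load inflates the price when confidence
   in the likelihood estimate is low."
  [{:keys [limit breach-likelihood uncertainty-load]}]
  (* limit breach-likelihood (+ 1.0 uncertainty-load)))

(premium {:limit 1e6 :breach-likelihood 0.02 :uncertainty-load 0.5})
;; => 30000.0
```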
Indirect Losses
Quantifying indirect losses is complicated
➔ normalizing market and industry effects
➔ effect of news and corporate events?
➔ over what time period?
➔ how do we define a statistically significant loss?
Investigation Tools
Roadmap

Version | Freq. estimation     | Loss model    | Pricing support
V1      | Industry-based freq. | Stock loss    | Uncertainty from variance
V2      | Net. security model  | --            | --
V3      | Behavior of company  | --            | Better uncertainty quant.
V4      | --                   | Sales losses  | --
V5      | --                   | Credit rating | --
V6      | Social engineering   | --            | --
V7      | --                   | --            | Pricing model
Future Challenges
Accumulation risk
➔ correlated breaches
➔ autonomous vehicles
➔ supply chain
➔ physical damage
Bloomberg for Back Office
The world's first AI-enabled compliance solution
Market
➔ Banks spend ~$100B on compliance
➔ ~$20B on analytics alone
➔ growing at 20% annually
Catalysts
➔ 9/11 and the 2008 crisis
➔ 20X explosion in fines
➔ Exec departures
Computer Vision
Image due diligence
Image distance for ID check (distance sketch below)
➔ Detect faces in the image using pre-trained models
➔ Transform the image using real-time pose estimation
➔ Represent the face on the hypersphere using the neural network
➔ Apply any classification technique to the resulting features
Image due diligence
➔ Check whether the photos on several IDs belong to the same person
➔ Perform image due diligence against databases of criminals and other databases
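A minimal sketch of the distance step in the ID check, assuming face embeddings from a pre-trained model are already in hand; the threshold is a tunable assumption, not a calibrated value:

```clojure
(defn dot [a b] (reduce + (map * a b)))

(defn cosine-similarity
  "Angle-based similarity between two embedding vectors."
  [a b]
  (/ (dot a b)
     (* (Math/sqrt (dot a a)) (Math/sqrt (dot b b)))))

(defn same-person?
  "Do two ID photos plausibly show the same face?"
  [embedding-a embedding-b threshold]
  (>= (cosine-similarity embedding-a embedding-b) threshold))
```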
NLP: Detecting Adverse News
Raptor NLP: detecting adverse news
IR approach: name + keyword in the same sentence
● Low false negatives
● High false positives (sketch below)
Example: John Smith
● Judge John Smith sentenced James Doe for money laundering.
● Amy Smith is accused of the murder of her brother John Smith.
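A minimal sketch of this IR baseline; the keyword list and naive sentence splitter are illustrative:

```clojure
(require '[clojure.string :as str])

(def risk-keywords #{"fraud" "forgery" "laundering" "sentenced" "accused"})

(defn sentences [text]
  (str/split text #"(?<=[.!?])\s+"))

(defn adverse-mention?
  "Flags any sentence containing both the name and a risk keyword.
   Both example sentences above match this rule: low false negatives,
   high false positives, exactly as the slide says."
  [person text]
  (some (fn [s]
          (and (str/includes? s person)
               (some #(str/includes? (str/lower-case s) %)
                     risk-keywords)))
        (sentences text)))
```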
Problem formulation
Classification approaches:
● General entity-centric "sentiment" classifier
○ High coverage
○ Not easy to interpret and understand what is going on
● Multiple specific relationship extractors (X sentenced for Y, X accused of Y, …)
○ Lower coverage
○ Easy to debug and understand
Distant Supervision: Training data
Training data: generate noisy training data using heuristics.
Positive examples: look for mentions of people with bad news.
Negative examples: the tricky and hard part. Many heuristics:
● Use lists of judges and attorneys and search for their mentions
● Simple syntactic rules: "X said", ...
(sketch below)
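A minimal sketch of those two negative-example heuristics; the names, lists, and rules are illustrative:

```clojure
(require '[clojure.string :as str])

;; e.g. judges and attorneys drawn from an external list
(def known-safe-roles #{"John Smith" "Jane Roe"})

(defn negative-example?
  "Noisy 'not adverse' label for a (person, sentence) mention."
  [person sentence]
  (or (contains? known-safe-roles person)                ; list heuristic
      (str/starts-with? sentence (str person " said")))) ; syntactic rule
```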
Distant Supervision: Training data
Heuristics fall into three categories:
a) Poor: doesn't work
b) Low coverage: only catches a few samples
c) Good: big impact on performance
Distant Supervision: One nice heuristic
Different sources have different rates of true vs. false positives (think bbc.com vs. court-proceedings reports). Use this information together with some other heuristic to gain a lot of negative samples. One such heuristic might even be a previous version of the classifier.
Distant Supervision: Languages
We have to work in multiple languages, which limits the use of features coming from tools like dependency parsers. Currently exploring heuristics based on parse trees and machine translation.
Distant Supervision: Modeling Approach
Need a model that captures entity-centric features and word order.
● Logistic regression with classic text features (raw bigrams, entity-centric features, dependency-parse features, …)
○ Lots of time spent building features
○ Easier to understand, interpret, and debug than neural nets
● Deep learning: RNNs/CNNs
○ Saves time on feature engineering
○ Hard to debug, understand, and interpret
○ Currently slightly better performance than features + logistic
Modeling Approach: Recurrent Networks
Distant Supervision: Modeling Approach
Pre-trained word embeddings:
a) Help achieve better performance
b) Can be easily obtained for any language
c) Can be shared across multiple tasks
Modeling Approach: Convolutional Networks
Distant Supervision: Modeling Approach
CNN vs. RNN setups for NLP:
a) CNNs are coming into NLP from CV
b) CNNs are faster than RNNs and can have similar performance
c) In our case, currently a tie
Key Takeaways and open problems
Open problems with false negatives:
● Information spanning multiple sentences:
○ Coreference resolution (John is mayor of Boston. He was sentenced for …)
○ Discourse analysis (relations between sentences)
● Analysis of formatted text (tables, bullet points, …)
Key Takeaways and open problems
Key takeaways:
● Improving training data helps a lot more than tweaking the model
● Avoid the academic trap of testing many neural-net architectures
Risk Ranking and Networks
Google for Risk
➔ Google wins in ranking because it has the most user click data
➔ We win in risk because we have analyst annotation data
Google for risk
Task: search CDD/EDD sources and rank the results based on the risk they represent.
Goal: do not miss anything important AND filter as many false positives as possible.
[Diagram: a query such as "fraud charged forgery" flows into Raptor, which returns ranked results]
Google for risk - approaches
Problems:
● accurate identification of the person (name collisions)
● identifying the right context the person is mentioned in
Additional requirements:
● interpretable results on all levels: rank, risk, NLP
● utilize user feedback: implicit vs. explicit
Risk Model Validation and Interpretability
Google for risk - approaches
Prediction vs. Ranking:
Prediction
● Want scores and filtering
● Still interesting to order results
● Loss is error
Learning to rank
● Want optimal ordering of results
● Scores not interesting
● Loss is the number of inversions (sketch below)
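A minimal sketch of the inversion loss, counting pairs the model orders differently from the analyst ground truth (O(n²) for clarity):

```clojure
(defn inversions
  "Pairs (i, j) where predicted scores disagree with true relevance."
  [scores relevance]
  (count
   (for [i (range (count scores))
         j (range (inc i) (count scores))
         :when (not= (compare (scores i) (scores j))
                     (compare (relevance i) (relevance j)))]
     [i j])))

(inversions [0.9 0.3 0.7] [3 2 1])
;; => 1 (the second and third items are swapped)
```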
Facebook for risk
Risk Networks
Task: identify risks from the person's social network.
Evaluate risk in the network at different levels:
● node
● edge
● path
● subgraph
Facebook for risk: Streamline investigation with the risk network
➔ Links between all people and business entities
➔ PageRank for risk (sketch below)
➔ See the riskiest paths through the network
➔ Drill down into high-risk customer-customer and customer-entity relationships
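A minimal PageRank sketch over an adjacency map, as a stand-in for "PageRank for risk"; the damping factor, iteration count, and toy edges are conventional defaults and illustrations, not the production configuration:

```clojure
(defn pagerank
  "Iterative PageRank over a map of node -> set of out-neighbors."
  [graph & {:keys [damping iters] :or {damping 0.85 iters 30}}]
  (let [nodes (keys graph)
        n     (count nodes)
        base  (/ (- 1.0 damping) n)]
    (loop [rank (zipmap nodes (repeat (/ 1.0 n)))
           i    0]
      (if (= i iters)
        rank
        (recur (into {}
                     (for [v nodes]
                       [v (+ base
                             (* damping
                                (reduce + 0.0
                                        (for [[u outs] graph
                                              :when (contains? outs v)]
                                          (/ (rank u) (count outs))))))]))
               (inc i))))))

;; Toy slice of a risk network (edges are illustrative):
(pagerank {:david-cameron #{:blairmore}
           :ian-cameron   #{:blairmore}
           :blairmore     #{:david-cameron}})
```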
Facebook for risk
Building the multigraph (1)
Nodes = entities (people and orgs); edges = relationships
1. Start by extracting relationships from structured databases:
○ Wikipedia
○ company registers
○ Panama Papers, etc.
2. De-duplicate nodes across different datasets
○ another ML problem
Facebook for risk
Query: David Cameron. Risk: Blairmore Holdings Inc.
[Network diagram: David Cameron linked to people (Ian Donald Cameron, Arthur Elwen, Nancy Gwen, Mary Fleur Mount) and to companies (Blairmore Holdings Inc, Univel Limited, Mannings Heath Hotel Limited, Accelerated Mailing & Marketing Limited, Cameron Optometry Limited), with relationships sourced from Open Corporates, Wikipedia, and the Panama Papers]
Facebook for risk
Building the multigraph (2)
Extract additional relationships from text => more NLP
○ named-entity extraction for nodes
○ relation extraction for edges
Example (subject / relation / object): "[Ian Cameron] [was a director of] [Blairmore Holdings Inc], an investment fund run from the Bahamas but named after the family's ancestral home in Aberdeenshire."
Facebook for risk: Roadmap
1. Start with CDD/EDD risk scores at the node level
2. Propagate risk across edges to derive edge weights
3. Social network analysis:
○ random walks with restarts: PageRank, HITS, Personalized PageRank
4. Subgraph risk ranking: use SNA approaches to featurize the graph for ranking
Facebook for risk: Link prediction
Task: inferring new relationships from the network and behavior
Approach:
1. add behavior data to the network
2. extract features from the network (sketch below)
○ node-based: Jaccard's coefficient, preferential attachment, …
○ path-based: Katz, PageRank, Personalized PageRank, …
○ feature matrices over pairs of nodes: Path Ranking Algorithm
3. combine with semantic features for each node
4. treat as a binary classification problem
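A minimal sketch of two of the node-based features named above, over a map of node -> neighbor set:

```clojure
(require '[clojure.set :as set])

(defn jaccard-coefficient
  "Overlap of the two nodes' neighborhoods."
  [graph a b]
  (let [na (graph a)
        nb (graph b)
        union (set/union na nb)]
    (if (empty? union)
      0.0
      (/ (count (set/intersection na nb)) (double (count union))))))

(defn preferential-attachment
  "Product of the two nodes' degrees."
  [graph a b]
  (* (count (graph a)) (count (graph b))))

;; These become columns in the binary-classification feature matrix,
;; alongside path-based and semantic features.
```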
CEAI Topics
● Computational Finance
● Computer Vision
● Computational Bio & Medicine
Proactive Full-stack MGAs