Machine Learning Startups
My Background
Lessons Learned
● Turn hard problems into easy ones
● ML in practice requires carefully formulating research problems
● ...and being creative about bootstrapping training data
Lessons Learned
● Many ways to capture dependencies
● Training data and features > models
Lessons Learned
● A model is not a product
● Nobody cares about your ideas
Flightcaster
Predicting the real-time state of the global air traffic network
The Prediction Problem
Given a flight F scheduled to depart at time T, predict the likelihood that F departs at T, T+n1, T+n2, … (sketch below)
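A minimal sketch of the target encoding this implies, in Clojure. The bucket boundaries and field names are illustrative assumptions, not FlightCaster's actual bins:

```clojure
;; Hypothetical delay buckets standing in for T, T+n1, T+n2.
(defn delay-bucket
  "Label a flight by how late it actually departed (epoch millis in)."
  [scheduled-ms actual-ms]
  (let [delay-min (/ (- actual-ms scheduled-ms) 60000.0)]
    (cond
      (< delay-min 15) :on-time     ; departed at T
      (< delay-min 60) :delayed     ; departed around T+n1
      :else            :very-late))) ; departed at T+n2 or later
```

A model then outputs a probability for each bucket rather than a single point estimate.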
Featurizing
● Carrier, FAA, and weather data
● Nightly reset is a natural cadence for feature vectors
● Every aircraft has a unique tail #
● Fuzzy k-way join on tail #, time, and location; isolate incorrect joins by keeping feature vectors independent (sketch after this list)
● Past positions: already delayed at prediction time?
● Weather and status: FAA groundings at airports on the path?
● Featurizing time: how delayed, and how many minutes from departure?
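A minimal sketch of the fuzzy-join idea, shown pairwise for two feeds (the real join was k-way; the tolerance, field names, and match rule here are assumptions):

```clojure
;; Records from two feeds match when tail numbers and locations agree
;; and timestamps are close enough.
(defn fuzzy-match? [a b]
  (and (= (:tail-number a) (:tail-number b))
       (= (:airport a) (:airport b))
       (< (Math/abs (- (:timestamp a) (:timestamp b)))
          (* 30 60 1000)))) ; within 30 minutes (illustrative tolerance)

(defn fuzzy-join
  "Merge matching records into one feature vector. Each joined record
   stays independent, so one incorrect join can't contaminate others."
  [feed-a feed-b]
  (for [a feed-a, b feed-b
        :when (fuzzy-match? a b)]
    (merge a b)))
```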
Models
Trees could pick up dependencies that a linear model couldn't, but the performance gain became trivially incremental once we added more sophisticated ways of featurizing the dependencies.
Tools and Deployment
● Clojure on Hadoop for featurizing and model training
● Wrap complexity in a simple API
● FP is awesome for data pipelines
● Write models to JSON (sketch below)
● Product team used Rails: read the JSON and make predictions
● Predictions stored in the production DB for evaluation
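A minimal sketch of that handoff, using clojure.data.json; the model shape and field names are illustrative:

```clojure
(require '[clojure.data.json :as json])

;; A trained logistic model reduced to plain data...
(def model
  {:type      "logistic"
   :intercept -1.2
   :weights   {"delay-so-far" 0.8, "faa-ground-stop" 2.1}})

;; ...serialized to JSON for the Rails side to load and score.
(spit "departure-model.json" (json/write-str model))
```

The Rails app then only needs a dot product and a sigmoid to make predictions from this file.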
Pain Points
The log-based debugging paradigm sucks. You don't want to catch ETL and feature-engineering issues in the Hadoop setting, but at the same time you can't catch them at tiny scale, because they need real data at material scale.
● Dirty data: manual entry
● Early days of Clojure / Hadoop
● Deploying JSON models rather than services
Lessons learned
● Model selection mattered less than featurizing
● Many ways to capture dependencies
Intuitions of domain experts are useful but often misleading. Use domain experts to identify data sources, then build good tools and take a scientific approach to exploring the feature space.
Computational graph with higher-order functions in order to log structured data; this inspired fast debugging with plumbing.graph at Prismatic. Isolate issues: single-thread, multi-thread, multi-process, and multi-machine. (See the sketch after this list.)
Production was OK, not great. Better to put ML behind services: the full-stack product team calls the backend team's APIs.
A model is not a product. Humans don't understand probability distributions, even when discretized or turned into a classification. Solve a human need directly: turn predictions into recommendations, etc.
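For reference, a computational graph in this style looks like the stats example from Prismatic's plumbing README: each node is a keyword function (fnk) whose argument names declare its dependencies, so structured logging and isolation can be wrapped around every step.

```clojure
(require '[plumbing.core :refer [fnk sum]]
         '[plumbing.graph :as graph])

(def stats-graph
  {:n  (fnk [xs]   (count xs))             ; depends on input :xs
   :m  (fnk [xs n] (/ (sum identity xs) n))
   :m2 (fnk [xs n] (/ (sum #(* % %) xs) n))
   :v  (fnk [m m2] (- m2 (* m m)))})       ; depends on computed nodes

(def stats (graph/compile stats-graph))
(into {} (stats {:xs [1 2 3 6]}))
;; => {:n 4, :m 3, :m2 25/2, :v 7/2}
```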
Prismatic
Personalized Ranking of People, Topics, and Content
The Personalized Ranking Problem
Given an index of content, display the content that maximizes the likelihood of user engagement.
Intention: maximize long-term engagement. Proxy: maximize session interactions.
Content
● Focused crawling of Twitter, Facebook, and the web
● Maximum-coverage algorithms
● Spam content and de-duping
Featurizing
● Content and interaction features
● Feature crosses and hacks for dependencies
● Bootstrapping weight hacks: can't train on overly sparse interactions
● Scores for interests (topics, people, publishers, …)
Models: Personalized Ranking
● Logistic regression: newsfeed ranking has to be ultra-fast in production (~100 ms)
● Learning to rank: loss is inversions
● Universal features, user-specific weight vectors (sketch below)
● Snapshot every session
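A minimal sketch of scoring with universal features and a user-specific weight vector; the feature names and representation are illustrative, not the production system:

```clojure
(defn sigmoid [z] (/ 1.0 (+ 1.0 (Math/exp (- z)))))

(defn score
  "Logistic score of one story for one user: cheap enough to stay
   inside a ~100ms ranking budget."
  [user-weights story-features]
  (sigmoid (reduce-kv (fn [acc feature value]
                        (+ acc (* value (get user-weights feature 0.0))))
                      0.0
                      story-features)))

(defn rank-feed
  "Order candidate stories by descending personalized score."
  [user-weights stories]
  (sort-by #(- (score user-weights (:features %))) stories))
```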
Models: Classification
How do you train a large set of topic classifiers? Latent topic models don't work. But how would we get labeled data to train a classifier for each topic?
Enter distant supervision: create a mechanism to bootstrap training data with noisy labels. Requires lots of heuristics and clever hacks.
● Snarf docs with Twitter queries, etc.
● Create positives and negatives using filters and distance measures
● Lots of techniques to featurize text for filters and training
(sketch below)
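A minimal sketch of the bootstrapping step; the queries, thresholds, and labeling rule are illustrative assumptions, not Prismatic's actual heuristics:

```clojure
(require '[clojure.string :as str])

;; Hand-written query terms per topic, used to snarf candidate docs.
(def topic-queries {:climbing ["rock climbing" "bouldering" "belay"]})

(defn noisy-label
  "Positive if the doc matches enough query terms, negative if it
   matches none, nil if too ambiguous to use as training data."
  [topic doc-text]
  (let [text (str/lower-case doc-text)
        hits (count (filter #(str/includes? text %)
                            (topic-queries topic)))]
    (cond
      (>= hits 2)  :positive
      (zero? hits) :negative
      :else        nil)))
```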
Tools and Deployment
Clojure! Plumbing is on GitHub. Clojure backend and ClojureScript frontend. Graph, schema, and ML libs.
Pain Points
Presentation biases: people click what they're shown, which biases clicks toward top stories and creates a self-reinforcing viral propagation engine.
Data issues: dupes are easy, but spambots and spam networks keep getting more sophisticated, especially on Twitter. Bootstrapping distant supervision is hard but OK. Bootstrapping ranking with sparse interactions is super hard.
Social vs. interest-based personalization: what's interesting vs. what's viral? How do you define what's interesting? How much is a share worth compared with dwell time? Researchers are biased toward their own preferences.
Lessons learned
We went overboard with Clojure (NIH), and the environment changes fast: we missed Spark, etc. We automated classifier training data and retraining with zero intervention. Interactions can be optimized a lot (> 50%).
When data is too sparse, optimize the product before optimizing models. Heuristic IR may be good enough for a while; the investment in learning to rank is massive.
Goal
So far
10 Companies 3 Years $65M
2 Companies 6 Months $1MM
Unsexy low beta
Proprietary models, proprietary data
Cyber MGA
Indirect losses:
● stock price
● credit rating
● sales
Market
➔ $2.5B today
➔ 35% growth
➔ $50B in 10 years
Catalysts
➔ SEC and EU regulations
➔ High-profile breaches
➔ Large indirect losses
First iteration - Supervised Learning
➔ Positives: recorded breaches
➔ Negatives: random sample of companies (not attacked)
➔ Features: security features
● DNS records, certs, service vulnerabilities, …
First iteration - Supervised Learning #FAIL
➔ Incorrect assumptions: that breached companies have worse security, and that the negative samples were not attacked
Likelihood of breach
Absence of historical data and nonstationarity create a challenging environment ➔ Rich current data isn’t available historically and decays in predictive power over time ➔ Could static data be a more robust and stable predictor of risk?
Relationship with catastrophes
Insurance models exist for earthquakes, floods, hurricanes
● Sparse events (cannot estimate probabilities from frequencies)
● Events are correlated (how true is this for cyber?)
● Can we draw from ideas in cat risk to model cyber risk?
Relationship with catastrophes
Cat:
➔ stochastic simulation using physical models
➔ impacts change in magnitude but not type
Cyber:
➔ behavior of incentivized cyber-attackers is hard to model
➔ impacts change over time
Broader Approach
Premium Decomposition
[Diagram: an industry baseline breach frequency is adjusted by risk features (infrastructure security, social engineering, dynamic behavior, assets and lifecycle), then combined with size of loss and an uncertainty load to decompose the premium]
Simplifying assumption: we can start incrementally, and loss magnitude will always hit the limit. Likelihood and uncertainty depend on the breach sample:
➔ estimate uncertainty from confidence
➔ estimate likelihood from risk features
(sketch below)
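Under that simplifying assumption, the decomposition reduces to something like the following sketch. The formula and the uncertainty-load factor are hypothetical illustrations, not the actual pricing model:

```clojure
(defn premium
  "Loss always hits the limit, so expected loss = p(breach) * limit;
   a hypothetical uncertainty load inflates the price when confidence
   in the likelihood estimate is low."
  [{:keys [limit breach-likelihood uncertainty-load]}]
  (* limit breach-likelihood (+ 1.0 uncertainty-load)))

(premium {:limit 1e6 :breach-likelihood 0.02 :uncertainty-load 0.5})
;; => 30000.0
```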
Indirect Losses
Quantifying indirect losses is complicated
➔ normalizing market and industry effects
➔ effect of news and corporate events?
➔ over what time period?
➔ how do we define a statistically significant loss?
Investigation Tools
Roadmap

Version | Freq. estimation     | Loss model    | Pricing support
V1      | Industry-based freq. | Stock loss    | Uncertainty from variance
V2      | Net. security model  | --            | --
V3      | Behavior of company  | --            | Better uncertainty quant.
V4      | --                   | Sales losses  | --
V5      | --                   | Credit rating | --
V6      | Social engineering   | --            | --
V7      | --                   | --            | Pricing model
Future Challenges
Accumulation risk
➔ correlated breaches
➔ autonomous vehicles
➔ supply chain
➔ physical damage
Bloomberg for Back Office
The world's first AI-enabled compliance solution
Market
➔ Banks spend ~$100B on compliance
➔ ~$20B on analytics alone
➔ growing at 20% annually
Catalysts
➔ 9/11 and the 2008 crisis
➔ 20X explosion in fines
➔ Exec departures
Computer Vision
Image due diligence
Image distance for ID check (distance sketch below)
➔ Detect faces in the image using pre-trained models
➔ Transform the image using real-time pose estimation
➔ Represent the face on the hypersphere using the neural network
➔ Apply any classification technique to the resulting features
Image due diligence
➔ Check whether the photos on several IDs belong to the same person
➔ Perform image due diligence against databases of criminals and other databases
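A minimal sketch of the distance step in the ID check, assuming face embeddings from a pre-trained model are already in hand; the threshold is a tunable assumption, not a calibrated value:

```clojure
(defn dot [a b] (reduce + (map * a b)))

(defn cosine-similarity
  "Angle-based similarity between two embedding vectors."
  [a b]
  (/ (dot a b)
     (* (Math/sqrt (dot a a)) (Math/sqrt (dot b b)))))

(defn same-person?
  "Do two ID photos plausibly show the same face?"
  [embedding-a embedding-b threshold]
  (>= (cosine-similarity embedding-a embedding-b) threshold))
```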
NLP: Detecting Adverse News
Raptor NLP: detecting adverse news
IR approach: name + keyword in the same sentence
● Low false negatives
● High false positives (sketch below)
Example: John Smith
● Judge John Smith sentenced James Doe for money laundering.
● Amy Smith is accused of the murder of her brother John Smith.
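A minimal sketch of this IR baseline; the keyword list and naive sentence splitter are illustrative:

```clojure
(require '[clojure.string :as str])

(def risk-keywords #{"fraud" "forgery" "laundering" "sentenced" "accused"})

(defn sentences [text]
  (str/split text #"(?<=[.!?])\s+"))

(defn adverse-mention?
  "Flags any sentence containing both the name and a risk keyword.
   Both example sentences above match this rule: low false negatives,
   high false positives, exactly as the slide says."
  [person text]
  (some (fn [s]
          (and (str/includes? s person)
               (some #(str/includes? (str/lower-case s) %)
                     risk-keywords)))
        (sentences text)))
```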
Problem formulation
Classification approaches:
● General entity-centric "sentiment" classifier
○ High coverage
○ Not easy to interpret and understand what is going on
● Multiple specific relationship extractors (X sentenced for Y, X accused of Y, …)
○ Lower coverage
○ Easy to debug and understand
Distant Supervision: Training data
Training data: generate noisy training data using heuristics.
Positive examples: look for mentions of people with bad news.
Negative examples: the tricky and hard part. Many heuristics:
● Use lists of judges and attorneys and search for their mentions
● Simple syntactic rules: "X said", ...
(sketch below)
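A minimal sketch of those two negative-example heuristics; the names, lists, and rules are illustrative:

```clojure
(require '[clojure.string :as str])

;; e.g. judges and attorneys drawn from an external list
(def known-safe-roles #{"John Smith" "Jane Roe"})

(defn negative-example?
  "Noisy 'not adverse' label for a (person, sentence) mention."
  [person sentence]
  (or (contains? known-safe-roles person)                ; list heuristic
      (str/starts-with? sentence (str person " said")))) ; syntactic rule
```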
Distant Supervision: Training data
Heuristics fall into three categories:
a) Poor: doesn't work
b) Low coverage: only catches a few samples
c) Good: big impact on performance
Distant Supervision: One nice heuristic
Different sources have different rates of true vs. false positives (think bbc.com vs. court-proceedings reports). Use this information together with some other heuristic to gain a lot of negative samples. One such heuristic might even be a previous version of the classifier.
Distant Supervision: Languages
We have to work in multiple languages, which limits the use of features coming from tools like dependency parsers. Currently exploring heuristics based on parse trees and machine translation.
Distant Supervision: Modeling Approach
Need a model that captures entity-centric features and word order.
● Logistic regression with classic text features (raw bigrams, entity-centric features, dependency-parse features, …)
○ Lots of time spent building features
○ Easier to understand, interpret, and debug than neural nets
● Deep learning: RNNs/CNNs
○ Saves time on feature engineering
○ Hard to debug, understand, and interpret
○ Currently slightly better performance than features + logistic
Modeling Approach: Recurrent Networks
Distant Supervision: Modeling Approach
Pre-trained word embeddings:
a) Help achieve better performance
b) Can be easily obtained for any language
c) Can be shared across multiple tasks
Modeling Approach: Convolutional Networks
Distant Supervision: Modeling Approach
CNN vs. RNN setups for NLP:
a) CNNs are coming into NLP from CV
b) CNNs are faster than RNNs and can have similar performance
c) In our case, currently a tie
Key Takeaways and open problems
Open problems with false negatives:
● Information spanning multiple sentences:
○ Coreference resolution (John is mayor of Boston. He was sentenced for …)
○ Discourse analysis (relations between sentences)
● Analysis of formatted text (tables, bullet points, …)
Key Takeaways and open problems
Key takeaways:
● Improving training data helps a lot more than tweaking the model
● Avoid the academic trap of testing many neural-net architectures
Risk Ranking and Networks
Google for Risk
➔ Google wins in ranking because it has the most user click data
➔ We win in risk because we have analyst annotation data
Google for risk
Task: search CDD/EDD sources and rank the results based on the risk they represent.
Goal: do not miss anything important AND filter as many false positives as possible.
[Diagram: a query such as "fraud charged forgery" flows into Raptor, which returns ranked results]
Google for risk - approaches
Problems:
● accurate identification of the person (name collisions)
● identifying the right context the person is mentioned in
Additional requirements:
● interpretable results on all levels: rank, risk, NLP
● utilize user feedback: implicit vs. explicit
Risk Model Validation and Interpretability
Google for risk - approaches
Prediction vs. Ranking:
Prediction
● Want scores and filtering
● Still interesting to order results
● Loss is error
Learning to rank
● Want optimal ordering of results
● Scores not interesting
● Loss is the number of inversions (sketch below)
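A minimal sketch of the inversion loss, counting pairs the model orders differently from the analyst ground truth (O(n²) for clarity):

```clojure
(defn inversions
  "Pairs (i, j) where predicted scores disagree with true relevance."
  [scores relevance]
  (count
   (for [i (range (count scores))
         j (range (inc i) (count scores))
         :when (not= (compare (scores i) (scores j))
                     (compare (relevance i) (relevance j)))]
     [i j])))

(inversions [0.9 0.3 0.7] [3 2 1])
;; => 1 (the second and third items are swapped)
```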
Facebook for risk
Risk Networks
Task: identify risks from the person's social network.
Evaluate risk in the network at different levels:
● node
● edge
● path
● subgraph
Facebook for risk: Streamline investigation with the risk network
➔ Links between all people and business entities
➔ PageRank for risk (sketch below)
➔ See the riskiest paths through the network
➔ Drill down into high-risk customer-customer and customer-entity relationships
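A minimal PageRank sketch over an adjacency map, as a stand-in for "PageRank for risk"; the damping factor, iteration count, and toy edges are conventional defaults and illustrations, not the production configuration:

```clojure
(defn pagerank
  "Iterative PageRank over a map of node -> set of out-neighbors."
  [graph & {:keys [damping iters] :or {damping 0.85 iters 30}}]
  (let [nodes (keys graph)
        n     (count nodes)
        base  (/ (- 1.0 damping) n)]
    (loop [rank (zipmap nodes (repeat (/ 1.0 n)))
           i    0]
      (if (= i iters)
        rank
        (recur (into {}
                     (for [v nodes]
                       [v (+ base
                             (* damping
                                (reduce + 0.0
                                        (for [[u outs] graph
                                              :when (contains? outs v)]
                                          (/ (rank u) (count outs))))))]))
               (inc i))))))

;; Toy slice of a risk network (edges are illustrative):
(pagerank {:david-cameron #{:blairmore}
           :ian-cameron   #{:blairmore}
           :blairmore     #{:david-cameron}})
```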
Facebook for risk
Building the multigraph (1)
Nodes = entities (people and orgs); edges = relationships
1. Start by extracting relationships from structured databases:
○ Wikipedia
○ company registers
○ Panama Papers, etc.
2. De-duplicate nodes across different datasets
○ another ML problem
Facebook for risk
Query: David Cameron. Risk: Blairmore Holdings Inc.
[Network diagram: David Cameron linked to people (Ian Donald Cameron, Arthur Elwen, Nancy Gwen, Mary Fleur Mount) and to companies (Blairmore Holdings Inc, Univel Limited, Mannings Heath Hotel Limited, Accelerated Mailing & Marketing Limited, Cameron Optometry Limited), with relationships sourced from Open Corporates, Wikipedia, and the Panama Papers]
Facebook for risk
Building the multigraph (2)
Extract additional relationships from text => more NLP
○ named-entity extraction for nodes
○ relation extraction for edges
Example (subject / relation / object): "[Ian Cameron] [was a director of] [Blairmore Holdings Inc], an investment fund run from the Bahamas but named after the family's ancestral home in Aberdeenshire."
Facebook for risk: Roadmap
1. Start with CDD/EDD risk scores at the node level
2. Propagate risk across edges to derive edge weights
3. Social network analysis:
○ random walks with restarts: PageRank, HITS, Personalized PageRank
4. Subgraph risk ranking: use SNA approaches to featurize the graph for ranking
Facebook for risk: Link prediction
Task: inferring new relationships from the network and behavior
Approach:
1. add behavior data to the network
2. extract features from the network (sketch below)
○ node-based: Jaccard's coefficient, preferential attachment, …
○ path-based: Katz, PageRank, Personalized PageRank, …
○ feature matrices over pairs of nodes: Path Ranking Algorithm
3. combine with semantic features for each node
4. treat as a binary classification problem
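A minimal sketch of two of the node-based features named above, over a map of node -> neighbor set:

```clojure
(require '[clojure.set :as set])

(defn jaccard-coefficient
  "Overlap of the two nodes' neighborhoods."
  [graph a b]
  (let [na (graph a)
        nb (graph b)
        union (set/union na nb)]
    (if (empty? union)
      0.0
      (/ (count (set/intersection na nb)) (double (count union))))))

(defn preferential-attachment
  "Product of the two nodes' degrees."
  [graph a b]
  (* (count (graph a)) (count (graph b))))

;; These become columns in the binary-classification feature matrix,
;; alongside path-based and semantic features.
```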
CEAI Topics
● Computational Finance
● Computer Vision
● Computational Bio & Medicine
Proactive Full-stack MGAs