Turning Data into Knowledge using Statistics and Semantics Prof. Lise Getoor UC Santa Cruz @lgetoor NIST Ontology Summit April 19, 2017
Data: Challenge Most data does not look like this
Or even like this
It looks more like this
Knowledge: Challenge Most of the knowledge does not look like this
Or even like this
It looks more like this
The Problem
§ Traditional statistical machine learning approaches assume:
- A random sample of homogeneous objects from a single relation
- Independent, identically distributed (IID)
§ Traditional knowledge-based approaches assume:
- Logical language for describing structure
- No noise and no uncertainty
§ Real domains:
- Multi-relational and heterogeneous
- Noisy and uncertain
Need methods which can:
1. Make use of logical structure
2. Handle uncertainty
3. Perform collective inference
[Getoor & Taskar ’07]
Statistical Relational Learning (SRL) http://linqs.cs.umd.edu/projects/Tutorials/nips2012.pdf
http://psl.linqs.org
Stephen Bach
Jay Pujara
Matthias Broecheler
Ben London
Bert Huang
Arti Ramesh
Alex Memory
Lily Mihalkova
Jimmy Foulds
Angelika Kimmig
Shobeir Fakhraei
Hui Miao
Dhanya Sridhar
Shachi Kumar
Probabilistic Soft Logic (PSL)
Declarative language based on logics to express collective probabilistic inference problems
- Predicate = relationship or property
- Atom = (continuous) random variable
- Rule = capture dependency or constraint
- Set = define aggregates
PSL Program = Rules + Input DB
Reference: Hinge-Loss Markov Random Fields and Probabilistic Soft Logic, Stephen H. Bach, Matthias Broecheler, Bert Huang, Lise Getoor, arXiv 2015
Ontology Alignment
[Figure: two company ontologies to be aligned. Concepts in one: Company, Employees, Staff, Sales Person, Developer, Consulting, Customer, Service & Products, Software, Hardware; relations: provides, buys, develops, helps, works for, sells to. Concepts in the other: Organization, Employee, Technician, Sales, Accountant, Customers, IT Services, Products & Services, Software Dev, Hardware; relations: develop, interacts with, work for, sells.]
Ontology Alignment: Match, Don’t Match?
[Figure: the same two company ontologies, with the question of whether a pair of concepts matches.]
Ontology Alignment: Similar to what extent?
[Figure: the same two company ontologies, with the question of how similar a pair of concepts is.]
Entity Resolution
§ Entities
- People References
§ Attributes
- Name
§ Relationships
- Friendship
§ Goal: Identify references that denote the same person
[Figure: reference graph with nodes A to H; names “John Smith” and “J. Smith” attached to A and B; friend edges; “=” links between references.]
Entity Resolution
§ References, names, friendships
§ Use model to express dependencies
- “If two people have similar names, they are probably the same”
- “If two people have similar friends, they are probably the same”
- “If A=B and B=C, then A and C must also denote the same person”
Entity Resolution: similar names
- “If two people have similar names, they are probably the same”
A.name ≈{str_sim} B.name => A≈B : 0.8
Entity Resolution: similar friends
- “If two people have similar friends, they are probably the same”
{A.friends} ≈{} {B.friends} => A≈B : 0.6
Entity Resolution: transitivity
- “If A=B and B=C, then A and C must also denote the same person”
A≈B ^ B≈C => A≈C : ∞
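The three entity-resolution rules can be read as weighted penalties. The following is a minimal Python sketch, assuming the Łukasiewicz relaxation that PSL builds on: a rule body => head incurs the hinge penalty max(0, body - head), and a conjunction of two soft truth values is max(0, a + b - 1). The weights 0.8 and 0.6 come from the slides; every truth value below is a made-up illustration, and the function names are hypothetical rather than part of the PSL software.

```python
def luk_and(a, b):
    """Lukasiewicz conjunction of two soft truth values in [0, 1]."""
    return max(0.0, a + b - 1.0)


def distance_to_satisfaction(body, head):
    """Hinge penalty of the rule body => head; zero once head >= body."""
    return max(0.0, body - head)


# Hypothetical soft truth values (assumptions for illustration only).
name_sim_ab = 0.9     # str_sim(A.name, B.name)
friend_sim_ab = 0.5   # set similarity of {A.friends} and {B.friends}
same_ab, same_bc, same_ac = 0.4, 0.8, 0.1

# Weighted soft rules from the slides (weights 0.8 and 0.6).
penalty_names = 0.8 * distance_to_satisfaction(name_sim_ab, same_ab)
penalty_friends = 0.6 * distance_to_satisfaction(friend_sim_ab, same_ab)

# Transitivity has infinite weight, so it acts as a hard constraint:
# a feasible assignment must drive this gap to zero.
transitivity_gap = distance_to_satisfaction(luk_and(same_ab, same_bc), same_ac)

print(penalty_names, penalty_friends, transitivity_gap)
```

With these toy values the name rule is the most violated one, so raising the truth of A≈B would lower the total penalty the most.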
Hinge-loss Markov Random Fields
§ PSL makes large-scale reasoning scalable by mapping logical rules to convex functions
§ Continuous variables in [0,1]
P(Y \mid X) = \frac{1}{Z} \exp\left[ -\sum_{j=1}^{m} w_j \max\{\ell_j(Y, X), 0\}^{p_j} \right]
§ Potentials are hinge-loss functions
§ Subject to arbitrary linear constraints
§ Log-concave!
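To make the hinge-loss density concrete, here is a small Python sketch that evaluates the sum inside the exponent and the unnormalized density for a fixed assignment. It assumes each linear function ℓ_j is given as coefficients over Y plus an offset into which the observed X has already been folded; that representation, the function names, and the two example potentials are illustrative assumptions, not the PSL implementation. The normalizer Z is not computed.

```python
import math


def hinge_energy(y, potentials):
    """Sum of w_j * max(l_j(y), 0) ** p_j over all potentials.

    Each potential is (weight, coeffs, offset, power), where
    l_j(y) = sum(coeffs[i] * y[i]) + offset.
    """
    total = 0.0
    for weight, coeffs, offset, power in potentials:
        linear = sum(c * v for c, v in zip(coeffs, y)) + offset
        total += weight * max(linear, 0.0) ** power
    return total


def unnormalized_density(y, potentials):
    """exp(-energy); dividing by Z would give P(Y | X)."""
    return math.exp(-hinge_energy(y, potentials))


# Hypothetical example with two unknowns in [0, 1].
y = [0.4, 0.1]
potentials = [
    (0.8, [-1.0, 0.0], 0.9, 1),   # linear hinge: 0.8 * max(0.9 - y0, 0)
    (2.0, [1.0, -1.0], 0.0, 2),   # squared hinge: 2.0 * max(y0 - y1, 0)^2
]

print(hinge_energy(y, potentials), unnormalized_density(y, potentials))
```

Because every term is a hinge of a linear function raised to the power 1 or 2, the energy is convex in y, which is what makes the density log-concave.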
PSL Foundations
§ PSL makes large-scale reasoning scalable by mapping logical rules to convex functions
§ Three principles justify this mapping:
- LP programs for MAX SAT with approximation guarantees [Goemans and Williamson, ’94]
- Pseudomarginal LP relaxations of Boolean Markov random fields [Wainwright, et al., ’02]
- Łukasiewicz logic, a logic for reasoning about continuous values [Klir and Yuan, ’95]
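As a brief illustration of the third principle, the sketch below writes out the standard Łukasiewicz operators in Python and checks that they agree with Boolean logic at the corners 0 and 1 while staying graded in between. The function names are hypothetical.

```python
from itertools import product


def luk_and(a, b):
    return max(0.0, a + b - 1.0)


def luk_or(a, b):
    return min(1.0, a + b)


def luk_not(a):
    return 1.0 - a


# At the Boolean corners the operators reduce to classical AND / OR / NOT.
corners_agree = all(
    luk_and(a, b) == float(bool(a) and bool(b))
    and luk_or(a, b) == float(bool(a) or bool(b))
    for a, b in product([0.0, 1.0], repeat=2)
) and luk_not(0.0) == 1.0 and luk_not(1.0) == 0.0

# In between, truth values are graded rather than forced to 0 or 1.
partial_and = luk_and(0.7, 0.6)
partial_or = luk_or(0.3, 0.4)

print(corners_agree, partial_and, partial_or)
```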
PSL in a Slide
§ MAP inference in PSL translates into a convex optimization problem -> inference is really fast!
§ Inference is further enhanced with state-of-the-art optimization and distributed processing paradigms such as ADMM & GraphLab -> inference is even faster!
§ Outperforms discrete MRFs in terms of speed, and (very) often accuracy
§ Support for latent variables and weight learning
§ PSL is flexible: applied to image segmentation, activity recognition, stance detection, sentiment analysis, document classification, drug target prediction, latent social groups and trust, engagement modeling, ontology alignment, knowledge graph identification, and looking for more!
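The claim that MAP inference is a convex optimization problem can be illustrated on a one-variable toy model. The sketch below is not ADMM and not how PSL solves inference; it uses plain ternary search, which is enough for a convex function of a single variable on [0, 1]. The energy, its weights, and the prior term are assumptions chosen for illustration.

```python
def energy(y):
    """Toy hinge-loss energy over one unknown y, the truth value of A≈B."""
    name_rule = 0.8 * max(0.9 - y, 0.0)   # similar names (0.9) push y up
    prior = 0.3 * max(y, 0.0)             # hypothetical weak prior pushes y down
    return name_rule + prior


def argmin_convex_1d(f, lo=0.0, hi=1.0, iters=100):
    """Ternary search for a minimizer of a convex function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) <= f(m2):
            hi = m2   # a minimizer lies in [lo, m2]
        else:
            lo = m1   # a minimizer lies in [m1, hi]
    return (lo + hi) / 2.0


y_map = argmin_convex_1d(energy)
print(y_map, energy(y_map))
```

Here the stronger name rule wins over the prior until y reaches the observed similarity, so the minimizer sits at the kink of the hinge. Convexity guarantees that this local search finds a global optimum, which is the property large-scale solvers exploit.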
NEED: KR for ML What about ML for KR?
Pujara, Miao, Getoor, Cohen, ISWC 2013
Knowledge Graph Identification
§ Problem: Collectively reason about noisy, inter-related fact extractions
§ Task: NELL fact-promotion (web-scale IE)
- Millions of extractions, with entity ambiguity and confidence scores
- Rich ontology: Domain, Range, Inverse, Mutex, Subsumption
§ Goal: Determine which facts to include in NELL’s knowledge base
Inferring a Knowledge Graph
[Figure: Internet -> Collectively Infer -> Knowledge Graph]
Statistics: website reliability, extractor confidence, parse features, n-gram frequency
Semantics: subsumption, mutual exclusion, domain/range, same-entity
Knowledge Graph Identification
Problem: [Figure: Extraction Graph -> Knowledge Graph Identification (KGI) -> Knowledge Graph]
• Perform graph identification:
– entity resolution
– link prediction
– collective classification
• Enforce ontological constraints
• Incorporate source uncertainty
Graph Identification in KGI
Noisy Extractions: [rules not preserved]
Entity Resolution:
SAMEENT(E1, E2) ∧ LBL(E1, L) ⟹ LBL(E2, L)
SAMEENT(E1, E2) ∧ REL(E1, E, R) ⟹ REL(E2, E, R)
SAMEENT(E1, E2) ∧ REL(E, E1, R) ⟹ REL(E, E2, R)
KGI Representation of Ontological Rules
Adapted from Jiang et al., ICDM 2012
Illustration of KGI
Uncertain Extractions:
.5: Lbl(Kyrgyzstan, bird)
.7: Lbl(Kyrgyzstan, country)
.9: Lbl(Kyrgyz Republic, country)
.8: Rel(Kyrgyz Republic, Bishkek, hasCapital)
Ontology:
Dom(hasCapital, country)
Mut(country, bird)
Entity Resolution:
SameEnt(Kyrgyz Republic, Kyrgyzstan)
[Figure: (Annotated) Extraction Graph with nodes Kyrgyzstan, Kyrgyz Republic, bird, country, Bishkek and edges Lbl, Dom, SameEnt, Rel(hasCapital).]
[Figure: After Knowledge Graph Identification: Kyrgyzstan and Kyrgyz Republic, with Lbl to country and Rel(hasCapital) to Bishkek.]
(Pujara et al., ISWC13)
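The Kyrgyzstan example can be worked through numerically. The Python sketch below is a simplification under stated assumptions: it treats the extraction confidences from the slide directly as soft truth values, sets Mut and SameEnt to 1.0 (the slide gives no confidence for them), and assumes the mutual-exclusion rule has the form Mut(L1, L2) ∧ Lbl(E, L1) ⟹ ¬Lbl(E, L2). The "after" values are chosen by hand to show a consistent assignment; they are not output of the KGI system.

```python
def luk_and(a, b):
    return max(0.0, a + b - 1.0)


def luk_not(a):
    return 1.0 - a


def gap(body, head):
    """Distance to satisfaction of body => head."""
    return max(0.0, body - head)


def rule_gaps(lbl, mut=1.0, same_ent=1.0):
    """Violation of the mutual-exclusion and same-entity label rules."""
    mut_gap = gap(luk_and(mut, lbl["Kyrgyzstan", "country"]),
                  luk_not(lbl["Kyrgyzstan", "bird"]))
    same_gap = gap(luk_and(same_ent, lbl["Kyrgyz Republic", "country"]),
                   lbl["Kyrgyzstan", "country"])
    return mut_gap, same_gap


# Confidences from the slide, used here as truth values (assumption).
before = {("Kyrgyzstan", "bird"): 0.5,
          ("Kyrgyzstan", "country"): 0.7,
          ("Kyrgyz Republic", "country"): 0.9}

# Hand-picked consistent assignment: drop "bird", propagate "country".
after = {("Kyrgyzstan", "bird"): 0.0,
         ("Kyrgyzstan", "country"): 0.9,
         ("Kyrgyz Republic", "country"): 0.9}

print(rule_gaps(before), rule_gaps(after))
```

Both rules are violated by the raw extractions and satisfied once the bird label is dropped and the country label is shared across the two co-referent names, which mirrors the cleaned graph on the slide.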
Jay Pujara UMD Thesis 2016
Three knowledge graphs
NELL: Broad-domain KG from IE, 1.7M candidate facts, 70K ontological relations
MusicBrainz: User-contributed KG of 800K artists and works
Freebase: Collective entity resolution to integrate 13M MusicBrainz artists & albums

Key result: both statistical features and semantic constraints help, but combining them always wins. We can do this fast!

 | NELL AUC | NELL F1 | MusicBrainz AUC | MusicBrainz F1 | Freebase AUC | Freebase F1
MLN Ontology (Jiang, ICDM12) | .899 | .836 | -- | -- | -- | --
Source statistics | .888 | .843 | .672 | .788 | .416 | .734
Entity resolution semantics | .809 | .804 | .797 | .831 | |
Ontological semantics | .899 | .832 | .753 | .832 | .569 | .805
All of the above | .904 | .854 | .901 | .919 | .724 | .840
Closing Comments
Graph Identification++
Opportunity!
Thank You! More information: http://www.soe.ucsc.edu/~getoor http://psl.linqs.org