
Turning Data into Knowledge using Statistics and Semantics
Prof. Lise Getoor, UC Santa Cruz, @lgetoor
NIST Ontology Summit, April 19, 2017

Data: Challenge
Most data does not look like this

Or even like this

It looks more like this

Knowledge: Challenge
Most of the knowledge does not look like this

Or even like this

It looks more like this

The Problem
§  Traditional statistical machine learning approaches assume:
-  A random sample of homogeneous objects from a single relation
-  Independent, identically distributed (IID)
§  Traditional knowledge-based approaches assume:
-  Logical language for describing structure
-  No noise and no uncertainty
§  Real domains:
-  Multi-relational and heterogeneous
-  Noisy and uncertain
Need methods which can:
1.  Make use of logical structure
2.  Handle uncertainty
3.  Perform collective inference
[Getoor & Taskar ’07]

Statistical Relational Learning (SRL)
http://linqs.cs.umd.edu/projects/Tutorials/nips2012.pdf

http://psl.linqs.org

Stephen Bach

Jay Pujara

Matthias Broecheler

Ben London

Bert Huang

Arti Ramesh

Alex Memory

Lily Mihalkova

Jimmy Foulds

Angelika Kimmig

Shobeir Fakhraei

Hui Miao

Dhanya Sridhar

Shachi Kumar


Probabilistic Soft Logic (PSL)
Declarative language based on logics to express collective probabilistic inference problems
-  Predicate = relationship or property
-  Atom = (continuous) random variable
-  Rule = capture dependency or constraint
-  Set = define aggregates
PSL Program = Rules + Input DB
Reference: Hinge-Loss Markov Random Fields and Probabilistic Soft Logic, Stephen H. Bach, Matthias Broecheler, Bert Huang, Lise Getoor, arXiv 2015
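To make the vocabulary above concrete, here is a minimal Python sketch of how atoms, rules, and an input DB might be held in memory. This is not the PSL API: the class names `Atom` and `Rule`, the predicate names `SimilarName` and `Same`, and the dictionary layout are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    """A predicate applied to arguments; its value is a soft truth in [0, 1]."""
    predicate: str
    args: tuple


@dataclass
class Rule:
    """A weighted implication body => head; an infinite weight marks a hard constraint."""
    body: list
    head: Atom
    weight: float


# Input DB: observed atoms mapped to soft truth values (hypothetical data).
db = {Atom("SimilarName", ("A", "B")): 0.9}

# One rule in the spirit of "similar names suggest the same person".
rule = Rule(
    body=[Atom("SimilarName", ("A", "B"))],
    head=Atom("Same", ("A", "B")),
    weight=0.8,
)

# PSL Program = Rules + Input DB
program = {"rules": [rule], "db": db}
print(program["rules"][0].head, program["db"][rule.body[0]])
```

Freezing `Atom` makes it hashable, so the same object can serve as a dictionary key in the DB and as a building block of rules.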


Ontology Alignment
[Figure: two company ontologies to be aligned. Concept labels: Service & Products, Software, Hardware, Consulting, Customer, Sales Person, Staff, Company, Employees, Developer, Organization, Customers, Products & Services, Software Dev, IT Services, Employee, Technician, Sales, Accountant. Relation labels: provides, buys, develops, develop, helps, works for, work for, sells to, sells, interacts with.]


Ontology Alignment
[Figure: the two ontologies again, captioned "Match, Don’t Match?"]


Ontology Alignment
[Figure: the two ontologies again, captioned "Similar to what extent?"]


Entity Resolution
§  Entities
-  People References
§  Attributes
-  Name
§  Relationships
-  Friendship
§  Goal: Identify references that denote the same person
[Figure: references A (name "John Smith") and B (name "J. Smith"), with friend links to references C through H, some of which are marked as equal.]


Entity Resolution
§  References, names, friendships
§  Use model to express dependencies
-  "If two people have similar names, they are probably the same"
-  "If two people have similar friends, they are probably the same"
-  "If A=B and B=C, then A and C must also denote the same person"


Entity Resolution
-  "If two people have similar names, they are probably the same"
A.name ≈{str_sim} B.name => A≈B : 0.8


Entity Resolution
-  "If two people have similar friends, they are probably the same"
{A.friends} ≈{} {B.friends} => A≈B : 0.6


Entity Resolution
-  "If A=B and B=C, then A and C must also denote the same person"
A≈B ^ B≈C => A≈C : ∞
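The rules above operate on soft truth values in [0, 1]. A minimal sketch of how such a rule can be scored, assuming the Łukasiewicz relaxation that the talk cites as one of PSL's foundations: the conjunction of values is max(0, sum - (n - 1)), and a rule body => head is violated by max(0, body - head). The function names and all numeric truth values below are hypothetical, picked only to illustrate the arithmetic.

```python
def luk_and(*values):
    """Lukasiewicz conjunction of soft truth values in [0, 1]."""
    return max(0.0, sum(values) - (len(values) - 1))


def distance_to_satisfaction(body, head):
    """How far the rule body => head is from being satisfied (0 means satisfied)."""
    return max(0.0, body - head)


# Hypothetical soft truth values, chosen only for illustration.
name_sim_ab = 0.9   # str_sim(A.name, B.name)
same_ab = 0.4       # current belief that A≈B
same_bc = 0.8       # current belief that B≈C
same_ac = 0.1       # current belief that A≈C

# A.name ≈ B.name => A≈B (weight 0.8 on the slide)
d_name = distance_to_satisfaction(name_sim_ab, same_ab)

# A≈B ^ B≈C => A≈C (hard constraint on the slide)
d_trans = distance_to_satisfaction(luk_and(same_ab, same_bc), same_ac)

print(f"name rule: {d_name:.2f}, transitivity rule: {d_trans:.2f}")
```

With these values the name rule is violated by 0.5 and the transitivity rule by 0.1. Raising the belief in A≈B reduces the first violation but increases the body of the second, which is why the variables have to be inferred jointly.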


Hinge-loss Markov Random Fields
§  PSL makes large-scale reasoning scalable by mapping logical rules to convex functions
§  Continuous variables in [0,1]

$$P(Y \mid X) = \frac{1}{Z} \exp\left[ -\sum_{j=1}^{m} w_j \max\{\ell_j(Y, X), 0\}^{p_j} \right]$$

§  Potentials are hinge-loss functions
§  Subject to arbitrary linear constraints
§  Log-concave!
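The density on this slide can be evaluated up to the normalizer Z with a few lines of Python. The sketch below assumes each potential is given as a triple (ℓ_j, w_j, p_j) with ℓ_j already evaluated at some assignment; the function names and the example numbers are hypothetical.

```python
import math


def hinge_potential(ell, weight, p=1):
    """One hinge-loss potential: w * max(ell, 0) ** p."""
    return weight * max(ell, 0.0) ** p


def energy(terms):
    """Weighted sum of hinge-loss potentials; terms is a list of (ell, weight, p)."""
    return sum(hinge_potential(ell, w, p) for ell, w, p in terms)


def unnormalized_density(terms):
    """exp(-energy); dividing by Z would give P(Y | X)."""
    return math.exp(-energy(terms))


# Hypothetical values of the linear functions ell_j at one assignment of Y.
terms = [
    (0.5, 0.8, 1),   # violated linear hinge
    (-0.2, 0.6, 1),  # satisfied rule: contributes nothing
    (0.1, 2.0, 2),   # squared hinge
]

print(f"energy = {energy(terms):.2f}")
print(f"unnormalized density = {unnormalized_density(terms):.4f}")
```

Because each potential is a maximum of a linear function and zero (optionally raised to a power of 1 or 2), the energy is convex in Y, which is what makes the density log-concave.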


PSL Foundations
§  PSL makes large-scale reasoning scalable by mapping logical rules to convex functions
§  Three principles justify this mapping:
-  Linear programs for MAX SAT with approximation guarantees [Goemans and Williamson, ’94]
-  Pseudomarginal LP relaxations of Boolean Markov random fields [Wainwright, et al., ’02]
-  Łukasiewicz logic, a logic for reasoning about continuous values [Klir and Yuan, ’95]


PSL in a Slide
§  MAP inference in PSL translates into a convex optimization problem -> inference is really fast!
§  Inference is further enhanced with state-of-the-art optimization and distributed processing paradigms such as ADMM & GraphLab -> inference is even faster!
§  Outperforms discrete MRFs in terms of speed, and (very) often accuracy
§  Support for latent variables and weight learning
§  PSL is flexible: applied to image segmentation, activity recognition, stance detection, sentiment analysis, document classification, drug target prediction, latent social groups and trust, engagement modeling, ontology alignment, knowledge graph identification, and looking for more!
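To illustrate why MAP inference becomes a convex problem, here is a toy sketch with a single unknown y standing for the soft truth of A≈B. It uses the rule weights 0.8 and 0.6 from the entity resolution slides; the similarity values 0.9 and 0.3, the negative prior with weight 0.5, and the use of a simple ternary search (in place of ADMM, which the talk names but which is not implemented here) are all assumptions made for illustration.

```python
def energy(y, name_sim=0.9, friend_sim=0.3):
    """Hinge-loss energy for one unknown y = truth of A≈B (values are hypothetical)."""
    return (
        0.8 * max(0.0, name_sim - y)      # similar names  => A≈B
        + 0.6 * max(0.0, friend_sim - y)  # similar friends => A≈B
        + 0.5 * max(0.0, y)               # assumed prior: not A≈B
    )


def minimize_convex_1d(f, lo=0.0, hi=1.0, iters=200):
    """Ternary search; valid here because the energy is convex on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) <= f(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2


y_map = minimize_convex_1d(energy)
print(f"MAP value for A≈B: {y_map:.3f}, energy: {energy(y_map):.3f}")
```

The minimizer lands at y = 0.9: below that point the name rule (weight 0.8) outweighs the prior (weight 0.5), and above it only the prior is active. Real PSL models have many coupled variables, so a one-dimensional search is only a stand-in for a proper convex solver.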

NEED: KR for ML
What about ML for KR?

Pujara, Miao, Getoor, Cohen, ISWC 2013

Knowledge Graph Identification
§  Problem: Collectively reason about noisy, inter-related fact extractions
§  Task: NELL fact-promotion (web-scale IE)
-  Millions of extractions, with entity ambiguity and confidence scores
-  Rich ontology: Domain, Range, Inverse, Mutex, Subsumption
§  Goal: Determine which facts to include in NELL’s knowledge base

Inferring a Knowledge Graph
[Figure: Internet -> Collectively Infer -> Knowledge Graph]
Statistics: Website reliability, Extractor confidence, Parse features, n-gram frequency
Semantics: Subsumption, Mutual Exclusion, Domain/Range, Same-Entity


Knowledge Graph Identification
Problem: [Figure: an Extraction Graph and the Knowledge Graph to be inferred from it]
Knowledge Graph Identification (KGI) =
•  Perform graph identification:
–  entity resolution
–  link prediction
–  collective classification
•  Enforce ontological constraints
•  Incorporate source uncertainty


Graph Identification in KGI
Noisy Extractions:
Entity Resolution:
SAMEENT(E1, E2) ∧ LBL(E1, L) ⟹ LBL(E2, L)
SAMEENT(E1, E2) ∧ REL(E1, E, R) ⟹ REL(E2, E, R)
SAMEENT(E1, E2) ∧ REL(E, E1, R) ⟹ REL(E, E2, R)

KGI Representation of Ontological Rules

Adapted from Jiang et al., ICDM 2012

Illustration of KGI
Uncertain Extractions:
.5: Lbl(Kyrgyzstan, bird)
.7: Lbl(Kyrgyzstan, country)
.9: Lbl(Kyrgyz Republic, country)
.8: Rel(Kyrgyz Republic, Bishkek, hasCapital)
Ontology:
Dom(hasCapital, country)
Mut(country, bird)
Entity Resolution:
SameEnt(Kyrgyz Republic, Kyrgyzstan)
[Figure: (Annotated) Extraction Graph with nodes Kyrgyzstan, Kyrgyz Republic, Bishkek, country, and bird, connected by Lbl, Rel(hasCapital), Dom, and SameEnt edges]
[Figure: After Knowledge Graph Identification: Kyrgyzstan / Kyrgyz Republic with an Lbl edge to country and a Rel(hasCapital) edge to Bishkek]
(Pujara et al., ISWC13)
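The Kyrgyzstan example can be worked through numerically. The sketch below makes two simplifying assumptions: extractor confidences are used directly as soft truth values, and mutual exclusion is encoded as MUT(L1, L2) ∧ LBL(E, L1) ⟹ ¬LBL(E, L2), a rule form that does not appear in the extracted slides. The same-entity rule is the SAMEENT ∧ LBL rule from the entity resolution slide. Function names are hypothetical.

```python
def luk_and(a, b):
    """Lukasiewicz conjunction of two soft truth values."""
    return max(0.0, a + b - 1.0)


def luk_not(a):
    """Lukasiewicz negation."""
    return 1.0 - a


def distance_to_satisfaction(body, head):
    """Violation of body => head; 0 means the rule is satisfied."""
    return max(0.0, body - head)


# Confidences from the slide, used directly as truth values (a simplification).
lbl = {
    ("Kyrgyzstan", "bird"): 0.5,
    ("Kyrgyzstan", "country"): 0.7,
    ("Kyrgyz Republic", "country"): 0.9,
}
mut_country_bird = 1.0  # Mut(country, bird), taken as fully true
same_ent = 1.0          # SameEnt(Kyrgyz Republic, Kyrgyzstan), taken as fully true

# Assumed encoding: MUT(country, bird) ∧ LBL(Kyrgyzstan, country) => ¬LBL(Kyrgyzstan, bird)
d_mut = distance_to_satisfaction(
    luk_and(mut_country_bird, lbl[("Kyrgyzstan", "country")]),
    luk_not(lbl[("Kyrgyzstan", "bird")]),
)

# SAMEENT(E1, E2) ∧ LBL(E1, L) => LBL(E2, L) with E1 = Kyrgyz Republic, E2 = Kyrgyzstan
d_same = distance_to_satisfaction(
    luk_and(same_ent, lbl[("Kyrgyz Republic", "country")]),
    lbl[("Kyrgyzstan", "country")],
)

print(f"mutual exclusion: {d_mut:.2f}, same-entity: {d_same:.2f}")
```

Both rules are violated by 0.2 at the raw extraction values. Lowering Lbl(Kyrgyzstan, bird) and raising Lbl(Kyrgyzstan, country) reduces both violations at once, which matches the resolved graph in which the merged entity is labeled country.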

Jay Pujara UMD Thesis 2016

Three knowledge graphs
NELL: Broad-domain KG from IE, 1.7M candidate facts, 70K ontological relations
MusicBrainz: User-contributed KG of 800K artists and works
Freebase: Collective entity resolution to integrate 13M MusicBrainz artists & albums

Key result: both statistical features and semantic constraints help, but combining them always wins
We can do this fast!

                               NELL          MusicBrainz   Freebase
                               AUC    F1     AUC    F1     AUC    F1
MLN Ontology (Jiang, ICDM12)   .899   .836   --     --     --     --
Source statistics              .888   .843   .672   .788   .416   .734
Entity resolution semantics    .809   .804
Ontological semantics          .899   .832   .753   .832   .569   .805
All of the above               .904   .854   .901   .919   .724   .840
[The slide also shows the values .797 and .831; their table cells are not recoverable.]

Closing Comments

Graph Identification++

Opportunity!

Thank You!
More information:
http://www.soe.ucsc.edu/~getoor
http://psl.linqs.org
