Materializing and Querying Learned Knowledge

Volker Tresp, Yi Huang Siemens, Corporate Technology

Markus Bundschus University of Munich

Achim Rettinger Technical University of Munich

Deductive and Inductive Reasoning 

In many Semantic Web (SW) domains, a tremendous number of statements (expressed as triples) might be true, but only a small number of statements is known to be true or can be inferred to be true



 There are regularities in the data
 But: these cannot be captured by axioms
 Example: tall people tend to have higher weight



The goal of this work
 Estimate the truth values of statements by exploiting regularities in the SW data with machine learning
 Store the probabilistic triples in the SW knowledge base
 Make those triples available for querying

IRMLES 2009

A Regular SPARQL Query

Query (including deductive inference):
 Find all actors that act in movies that are filmed in an Italian city


A SPARQL Query with Learned Probabilities

Query (including deduction and induction):
 Find all actors that are likely to act in movies that are filmed in an Italian city

[Slide annotations on the query figure: "learn!", "integrate in query", "sort by probability"]

Example result list: Ronald Reagan, George W. Bush, … ("Damn … should have excluded US presidents")

Requirements

 Machine learning should be “push-button”, requiring a minimum of user intervention
 Learning time should scale well with the size of the SW
 The statements and their probabilities predicted by machine learning should be easy to integrate into SPARQL-type querying
 Machine learning should be suitable for the data situation on the SW, with sparse data (e.g., only a small number of persons are friends) and missing information (e.g., some people don't reveal private information)


The Key Steps
 User defines the key entity (e.g., person)
 User defines the population (e.g., persons that are actors)
 LarKC defines a sample (subset of the population)
 LarKC finds all triples in which the key entity is either subject or property value
 Aggregated features are calculated
 The (sparse, incomplete) data matrix is generated (including deduced triples)
 Pruning: columns with few ones are removed
 Learning by matrix-completion methods
 The learned model makes predictions on the sample
 The learned model is applied to the population
 A subset of the estimated (probabilistic) triples is written into the triple store
 Queries can be formulated
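The matrix-generation and pruning steps above can be sketched in code. This is an illustrative sketch, not the authors' implementation; the function name, the column-labelling scheme, and the `min_ones` threshold are our assumptions.

```python
# Illustrative sketch (not the authors' code): turn triples involving key
# entities into a sparse binary entity-by-attribute data matrix, then
# prune columns with fewer than min_ones ones.

def build_data_matrix(triples, key_entities, min_ones=2):
    """triples: iterable of (subject, predicate, object) tuples."""
    columns = {}                          # column label -> column index
    rows = {e: {} for e in key_entities}  # sparse binary rows
    for s, p, o in triples:
        if s in key_entities:             # key entity as subject
            col = columns.setdefault(f"{p}={o}", len(columns))
            rows[s][col] = 1
        if o in key_entities:             # key entity as property value
            col = columns.setdefault(f"{s}.{p}", len(columns))
            rows[o][col] = 1
    # Pruning: count ones per column and keep only dense-enough columns
    counts = {}
    for r in rows.values():
        for c in r:
            counts[c] = counts.get(c, 0) + 1
    kept = sorted(c for c, n in counts.items() if n >= min_ones)
    remap = {c: i for i, c in enumerate(kept)}
    labels = [lbl for lbl, c in sorted(columns.items(), key=lambda kv: kv[1])
              if c in remap]
    matrix = {e: {remap[c]: 1 for c in r if c in remap}
              for e, r in rows.items()}
    return matrix, labels
```

Missing entries stay absent from the sparse rows rather than being stored as zeros, which matches the sparse/incomplete data situation the slides describe.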

FOAF Experiment

[Figure: FOAF example graph. Joe, Jack, and Mary are linked by foaf:knows edges. Joe attends Harvard (a subclass of Ivy League), has residence Boston (locatedIn region NE-US), has a dateOfBirth of 1980, has posted a number of blogs (#ofBlogs), holds an OnlineChatAccount, and has an image. Further triples are deduced from subclass relations and the rule below.]

RULE: If born between 1979 and 1989, then in ageGroup ThirtySomething
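The deductive rule on the slide can be sketched as a small function that materializes the derived triple before learning; the function name and triple encoding are ours.

```python
# Sketch of the slide's deductive rule: derive an ageGroup triple
# from a dateOfBirth value (function name and encoding are ours).

def age_group_triples(person, birth_year):
    """If born between 1979 and 1989, assign ageGroup ThirtySomething."""
    if 1979 <= birth_year <= 1989:
        return [(person, "ageGroup", "ThirtySomething")]
    return []
```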

Data Matrix (FOAF)

[Figure: binary data matrix with rows Joe, Jack, Mary and columns such as ageGroupThirtySomething, ResidenceBoston, ResidenceNYC, ResidenceInRegionNE-US, holdsOnlineChatAccount, hasImage, knowsJoe, knowsJack, knowsMary]

FOAF Experiment Statistics
 We selected 636 persons with "dense" friendship information
 On average, a given person has 18 friends
 Numerical values, such as date of birth or the number of blog posts, were discretized
 The resulting data matrix, after pruning columns with few ones, has 636 persons (rows) and 491 columns
 462 of the 491 columns (friendship attributes) refer to the property knows
 The remaining columns (general attributes) refer to general information about age, location, number of blog posts, attended school, etc.
 We can then answer queries such as:
 Who would likely want to be Jack's friend?
 Which female persons in the north-east US would likely want to be Jack's friend?
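The second query amounts to filtering persons by general attributes and ranking them by the learned probability of the knowsJack column. A minimal sketch, assuming hypothetical `predictions` and `attributes` dictionaries (not part of the original system):

```python
# Hypothetical post-processing of learned probabilities: answer
# "which female persons in the north-east US would likely want to be
# Jack's friend" by filtering on general attributes and sorting by
# the predicted probability of the knowsJack column.

def likely_friends(predictions, attributes, required, top_n=3):
    """predictions: {person: P(knows Jack)}; attributes: {person: set of attrs}."""
    candidates = [
        (person, p) for person, p in predictions.items()
        if required <= attributes.get(person, set())   # subset test
    ]
    candidates.sort(key=lambda pair: -pair[1])          # sort by probability
    return candidates[:top_n]
```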

Learning Approaches

 SVD based:
X = U D V^T
X̂ = U D^(r) V^T   (D^(r) keeps only the r largest singular values)

 NNMF:
X = A B^T   with a_{i,j} ≥ 0, b_{i,j} ≥ 0

 LDA:
X = A B^T   with a_{i,k} = P(attr = i | z = k), b_{j,k} = P(KE = j | z = k)
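The SVD-based reconstruction X̂ = U D^(r) V^T can be illustrated in pure Python for the simplest case r = 1, using power iteration to find the leading singular pair (the slides use a full reduced-rank SVD; this sketch and its function name are ours).

```python
# Minimal pure-Python sketch of the SVD idea: a rank-1 reconstruction
# X_hat = sigma * u * v^T obtained by power iteration. The full method
# keeps the r largest singular values; here r = 1 for brevity.
import math

def rank1_completion(X, iters=100):
    m, n = len(X), len(X[0])
    v = [1.0] * n
    for _ in range(iters):
        # u <- X v, normalized
        u = [sum(X[i][j] * v[j] for j in range(n)) for i in range(m)]
        nu = math.sqrt(sum(x * x for x in u)) or 1.0
        u = [x / nu for x in u]
        # v <- X^T u, normalized; its norm converges to sigma_1
        v = [sum(X[i][j] * u[i] for i in range(m)) for j in range(n)]
        sigma = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / sigma for x in v]
    # Reconstructed (dense) matrix: every cell gets a predicted value
    return [[sigma * u[i] * v[j] for j in range(n)] for i in range(m)]
```

For a rank-1 input matrix the reconstruction recovers the input exactly; on the sparse, incomplete data matrix the reconstruction fills in predicted values for the unobserved cells.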

Experimental Results

[Figure: NDCG scores for the different learning approaches]
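For reference, the NDCG metric used in the comparison can be computed as follows; this is our own implementation of the standard definition (binary relevance, log2 discount), not the authors' evaluation code.

```python
# Sketch of the NDCG ranking metric: DCG of the predicted ordering,
# normalized by the DCG of the ideal ordering.
import math

def ndcg(ranked_relevance):
    """ranked_relevance: relevance of each item in predicted rank order."""
    def dcg(rels):
        # positions are 1-based, so the discount is log2(rank + 1)
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    idcg = dcg(sorted(ranked_relevance, reverse=True))
    return dcg(ranked_relevance) / idcg if idcg > 0 else 0.0
```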

Persisting Probabilistic Triples

• quadruple
PersonA foaf:knows PersonB 0.758

• reification (simplest but high memory cost)
_:node rdf:type rdf:Statement
_:node rdf:subject PersonA
_:node rdf:predicate foaf:knows
_:node rdf:object PersonB
_:node prob 0.758

• blank node
PersonA kp _:node
_:node foaf:knows PersonB
_:node prob 0.758
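A sketch of the reification variant as code: one probabilistic triple expands into five plain RDF triples on a fresh blank node (the function name and the `prob` property label follow the slide; the serializer itself is ours).

```python
# Hypothetical serializer for the reification variant: one probabilistic
# triple becomes five plain RDF triples on a blank node.

def reify(subject, predicate, obj, prob, node="_:s1"):
    return [
        (node, "rdf:type", "rdf:Statement"),
        (node, "rdf:subject", subject),
        (node, "rdf:predicate", predicate),
        (node, "rdf:object", obj),
        (node, "prob", f"{prob:.3f}"),
    ]
```

The five-fold expansion is what makes reification the most memory-hungry of the three options.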

Results: Who wants to be Trelena’s Friend


Conclusion and Outlook
 We presented a novel generic learning approach for deriving probabilistic SW statements and demonstrated how these can be integrated into an extended SPARQL query
 The approach is suitable for typical situations with sparse/missing data
 The learning process is to a large degree autonomous (a design goal)
 Generalization from the sample to the population is linear in the size of the population (a matrix multiplication)
 Learned statements are materialized for fast querying
 LDA showed the best performance (Bayesian averaging)
 Part of the EU FP7 project LarKC

