Materializing and Querying Learned Knowledge
Volker Tresp, Yi Huang Siemens, Corporate Technology
Markus Bundschus University of Munich
Achim Rettinger Technical University of Munich
Deductive and Inductive Reasoning
In many Semantic Web (SW) domains, a tremendous number of statements (expressed as triples) might be true, but only a small number of statements are known to be true or can be inferred to be true
There are regularities in the data
But: they cannot be captured by axioms
Example: large people tend to have a higher weight
The goal of this work:
• Estimate the truth values of statements by exploring regularities in the SW data with machine learning
• Store the probabilistic triples in the SW-KB
• Make those triples available for querying
A Regular SPARQL Query
Query (including deductive inference): Find all actors that act in movies that are filmed in an Italian city
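A minimal sketch of such a query, assuming a hypothetical movie vocabulary (the ex: prefix and the properties ex:actsIn, ex:filmedIn, ex:locatedIn are illustrative and not taken from the slides):

```sparql
# Deductive query: actors acting in movies filmed in an Italian city
# (ex: vocabulary is an assumption for illustration)
PREFIX ex: <http://example.org/movies#>
SELECT DISTINCT ?actor
WHERE {
  ?actor a            ex:Actor ;
         ex:actsIn    ?movie .
  ?movie ex:filmedIn  ?city .
  ?city  ex:locatedIn ex:Italy .
}
```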
A SPARQL Query with Learned Probabilities
Query (including deduction and induction): Find all actors that are likely to act in movies that are filmed in an Italian city
Learn the probabilities, integrate them into the query, and sort the results by probability
Result: Ronald Reagan, George W. Bush, … ("Damn … should have excluded US presidents")
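There is no standard SPARQL syntax for such probabilistic queries; the sketch below assumes the learned statements have been materialized as reified triples carrying a hypothetical ex:prob value (see "Persisting Probabilistic Triples" below), so that plain SPARQL can select and rank by the probability:

```sparql
# Rank actors by the learned probability of acting in a movie filmed in an
# Italian city; ex: vocabulary and ex:prob are assumptions, not the authors'
# concrete syntax.
PREFIX ex:  <http://example.org/movies#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?actor ?p
WHERE {
  ?stmt  rdf:type      rdf:Statement ;
         rdf:subject   ?actor ;
         rdf:predicate ex:actsIn ;
         rdf:object    ?movie ;
         ex:prob       ?p .
  ?movie ex:filmedIn   ?city .
  ?city  ex:locatedIn  ex:Italy .
}
ORDER BY DESC(?p)
```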
Requirements
• Machine learning should be "push-button", requiring a minimum of user intervention
• Learning time should scale well with the size of the SW
• The statements and their probabilities, as predicted by machine learning, should be easy to integrate into SPARQL-type querying
• Machine learning should be suitable for the data situation on the SW, with sparse data (e.g., only a small number of persons are friends) and missing information (e.g., some people don't reveal private information)
The Key Steps
• User defines the key entity (person)
• User defines the population (persons that are actors)
• LarKC: defines a sample (subset of the population)
• LarKC: finds all triples in which the key entity is either subject or property value (see the query sketch below)
• Aggregated features are calculated
• The (sparse, incomplete) data matrix is generated (including deduced triples)
• Pruning: columns with few ones are removed
• Learning by matrix-completion methods
• The learned model makes predictions on the sample
• The learned model is applied to the population
• A subset of the estimated (probabilistic) triples is written into the triple store
• Queries can be formulated
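The step that gathers all triples in which the key entity occurs can be expressed as a plain SPARQL query; the concrete resource ex:Jack is only an illustrative placeholder:

```sparql
# Collect all triples in which the key entity appears as subject or as object
PREFIX ex: <http://example.org/foaf-data#>
SELECT ?p ?x ?direction
WHERE {
  { ex:Jack ?p ?x . BIND("outgoing" AS ?direction) }
  UNION
  { ?x ?p ex:Jack . BIND("incoming" AS ?direction) }
}
```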
FOAF Experiment
[Figure: example FOAF graph — persons Joe, Jack, and Mary linked by knows edges, with attributes such as attends Harvard (Ivy League), residence Boston (located in the NE-US region), dateOfBirth 1980, ageGroup ThirtySomething, holds OnlineChatAccount, #ofBlogs, and image; some statements are derived from subclass relations]
RULE: If born between 1979 and 1989, then ageGroup is ThirtySomething
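One way to materialize such a rule is a SPARQL CONSTRUCT (an equivalent deduction rule would work as well); the ex: property names and the use of a plain year value are assumptions:

```sparql
# Derive ageGroup ThirtySomething for persons born between 1979 and 1989
# (ex: identifiers are illustrative)
PREFIX ex: <http://example.org/foaf-data#>
CONSTRUCT { ?person ex:ageGroup ex:ThirtySomething . }
WHERE {
  ?person ex:yearOfBirth ?year .
  FILTER (?year >= 1979 && ?year <= 1989)
}
```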
Data Matrix (FOAF)
[Figure: sparse person-attribute matrix — rows are the key entities Joe, Jack, Mary, …; columns include ageGroupThirtySomething, ResidenceInRegionNE-US, knowsJoe, knowsJack, knowsMary, ResidenceBoston, ResidenceNYC, holdsOnlineChatAccount, hasImage]
FOAF Experiment Statistics
• We selected 636 persons with "dense" friendship information; on average, a given person has 18 friends
• Numerical values such as date of birth or the number of blog posts were discretized
• The resulting data matrix, after pruning columns with few ones, has 636 persons (rows) and 491 columns
• 462 of the 491 columns (friendship attributes) refer to the property knows
• The remaining columns (general attributes) refer to general information about age, location, number of blog posts, attended school, etc.
• We can then answer queries such as: Who would likely want to be Jack's friend? Which female persons in the north-east US would likely want to be Jack's friends? (see the query sketch below)
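Assuming the learned foaf:knows statements are materialized as reified triples with a hypothetical ex:prob value (see "Persisting Probabilistic Triples" below), the second query could be sketched as follows; all ex: identifiers are illustrative:

```sparql
# Female persons in the NE-US region, ranked by the learned probability
# of knowing Jack (ex: identifiers and ex:prob are assumptions)
PREFIX ex:   <http://example.org/foaf-data#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?person ?p
WHERE {
  ?person foaf:gender        "female" ;
          ex:residenceRegion ex:NE_US .
  ?stmt   rdf:type      rdf:Statement ;
          rdf:subject   ?person ;
          rdf:predicate foaf:knows ;
          rdf:object    ex:Jack ;
          ex:prob       ?p .
}
ORDER BY DESC(?p)
```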
Learning Approaches

SVD-based:
$X = U D V^T$, $\hat{X} = U D^{(r)} V^T$ (where $D^{(r)}$ retains only the $r$ largest singular values)

NNMF:
$X = A B^T$ with $a_{i,j} \ge 0$, $b_{i,j} \ge 0$

LDA:
$X = A B^T$ with $a_{i,k} = P(\mathrm{attr} = i \mid z = k)$, $b_{j,k} = P(\mathrm{KE} = j \mid z = k)$
Experimental Results
[Figure: NDCG scores for the different learning approaches]
Persisting Probabilistic Triples
Three ways to persist a probabilistic statement (PersonA foaf:knows PersonB with probability 0.758); a SPARQL sketch follows below:
• Quadruple: the triple (PersonA, foaf:knows, PersonB) is extended by a fourth element carrying the probability 0.758
• Reification (simplest, but high memory cost): a blank node _:node with rdf:type rdf:Statement, rdf:subject PersonA, rdf:predicate foaf:knows, rdf:object PersonB, and prob 0.758
• Blank node: PersonA kp _:node ; _:node foaf:knows PersonB ; _:node prob 0.758
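A minimal sketch of how the reification and blank-node variants could be written to the triple store with SPARQL 1.1 INSERT DATA; ex:prob and ex:kp are illustrative property names, and the quadruple variant would instead require a quad/named-graph extension of the store:

```sparql
# Materialize: PersonA foaf:knows PersonB with probability 0.758
# (ex:prob and ex:kp are assumed property names)
PREFIX ex:   <http://example.org/prob#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

INSERT DATA {
  # reification variant
  _:s rdf:type      rdf:Statement ;
      rdf:subject   ex:PersonA ;
      rdf:predicate foaf:knows ;
      rdf:object    ex:PersonB ;
      ex:prob       0.758 .

  # blank-node variant
  ex:PersonA ex:kp _:k .
  _:k foaf:knows ex:PersonB ;
      ex:prob    0.758 .
}
```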
Results: Who wants to be Trelena’s Friend
Conclusion and Outlook
• We presented a novel generic learning approach for deriving probabilistic SW statements and demonstrated how these can be integrated into an extended SPARQL query
• The approach is suitable for the typical situation of sparse and missing data
• The learning process is to a large degree autonomous (goal!)
• Generalization from the sample to the population is linear in the size of the population (matrix!)
• Learned statements are materialized for fast querying
• LDA showed the best performance (Bayesian averaging)
• Part of EU-FP7 project LarKC