Unifying Personalized PageRank and Prolog

William W. Cohen with: William Yang Wang, Katie Mazaitis, Einat Minkov, Ni Lao, Tom Mitchell & others

Machine Learning Dept. and Language Technologies Inst. School of Computer Science Carnegie Mellon University

My History Machine Learning

Representation languages: DBs, KR

Text cat, IR, IE

History 82 84 86 88 90 92 94 96 98 00 04 08 12

•  1982/1984: Ehud Shapiro’s thesis:
   –  MIS: Learning logic programs as debugging an empty Prolog program
   –  Thesis contained 17 figures and a 25-page appendix that were a full implementation of MIS in Prolog
   –  Incredibly elegant work
•  “Computer science has a great advantage over other experimental sciences: the world we investigate is, to a large extent, our own creation, and we are the ones to determine if it is simple or messy.”

History 82 84 86 88 90 92 94 96 98 00 04 08 12

•  Grad school Rutgers, job at AT&T
•  Worked in group doing KR, DB, learning, information retrieval, …
•  My work: learning logical (description-logic-like, Prolog-like, rule-based) representations that model large noisy real-world datasets.

History 82 84 86 88 90 92 94 96 98 00 04 08 12

•  The web takes off –  as predicted by William Gibson

•  IR folks start looking at retrieval and question answering with the Web
•  Alon Halevy (DB guy) starts the Information Manifold project to integrate data on the web
   –  VLDB 2006 10-year Best Paper Award for 1996 paper on IM
•  I got very interested in information integration….

History 82 84 86 88 90 92 94 96 98 00 04 08 12

•  As the world of computer science gets richer and more complex, computer science can no longer limit itself to studying “our own creation”.
•  Tension exists between
   –  Elegant theories of representation
   –  The not-so-elegant real world that is being represented
•  Concise logical representations often “don’t fit” complex real-world data

History 82 84 86 88 90 92 94 96 98 00 04 08 12

•  The beauty of the real world is its complexity….


WHIRL language:

Query Q:
SELECT R.a,S.a,S.b,T.b FROM R,S,T WHERE R.a~S.a and S.b~T.b
(~ TFIDF-similar)

Link items as needed by Q

Incrementally produce a ranked list of possible links, with “best matches” first. User (or downstream process) decides how much of the list to generate and examine.
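A minimal Python sketch of the idea behind a similarity join such as R.a~S.a: score every pair of strings by TF-IDF cosine similarity and hand back the pairs lazily, best matches first, so the caller decides how far down the list to read. This is a brute-force illustration, not the WHIRL implementation; the function names, the tokenizer, and the particular TF-IDF weighting are assumptions made for the example.

```python
import heapq
import math
import re
from collections import Counter


def tfidf_vectors(strings):
    """Return one unit-length TF-IDF vector (dict: token -> weight) per string."""
    docs = [Counter(re.findall(r"\w+", s.lower())) for s in strings]
    n = len(docs)
    # Document frequency: number of strings that contain each token.
    df = Counter(tok for d in docs for tok in d)
    vecs = []
    for d in docs:
        v = {t: math.log(1 + tf) * math.log(1 + n / df[t]) for t, tf in d.items()}
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        vecs.append({t: w / norm for t, w in v.items()})
    return vecs


def similarity_join(r_values, s_values):
    """Lazily yield (score, r, s) in order of decreasing cosine similarity."""
    vecs = tfidf_vectors(list(r_values) + list(s_values))
    r_vecs, s_vecs = vecs[:len(r_values)], vecs[len(r_values):]
    heap = []
    for r, rv in zip(r_values, r_vecs):
        for s, sv in zip(s_values, s_vecs):
            score = sum(w * sv.get(t, 0.0) for t, w in rv.items())
            if score > 0:
                heap.append((-score, r, s))  # negate: heapq is a min-heap
    heapq.heapify(heap)
    while heap:
        neg_score, r, s = heapq.heappop(heap)
        yield -neg_score, r, s


# Illustrative data: the strings are taken from the slides' examples.
R = ["William W. Cohen, CMU", "Christos Faloutsos, CMU"]
S = ["Dr. W. W. Cohen", "George W. Bush"]
for score, r, s in similarity_join(R, S):
    print(f"{score:.3f}  {r!r} ~ {s!r}")
```

Because `similarity_join` is a generator, a caller that only wants the single best link can call `next(similarity_join(R, S))` and stop there.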

























History 82 84 86 88 90 92 94 96 98 00 04 08 12

•  Alon Halevy (DB guy) starts the Information Manifold project to integrate data on the web
   –  VLDB 2006 10-year Best Paper Award for 1996 paper on IM
•  William Cohen (ML guy) wrote the WHIRL system, bridging KR/DB ideas with a key IR idea: integration by reasoning about the similarity of strings
•  Combining complex models of similarity and logic
   –  SIGMOD 2008 10-Year Best Paper Award for 1998 Paper on WHIRL

Beyond TFIDF: graph similarity 82 84 86 88 90 92 94 96 98 00 04 08 12

[Figure: graph over the name strings “William W. Cohen, CMU”, “Dr. W. W. Cohen”, “Christos Faloutsos, CMU”, “George H. W. Bush” and “George W. Bush”, with term nodes “cohen” and “dr”]

Personal Info Management as Similarity Queries on a Graph Einat Minkov, Univ Haifa [SIGIR 2006, EMNLP 2008, TOIS 2010]


[Figure: email graph with edge labels “Sent To” and “Term In Subject” and nodes William, graph, proposal, CMU, 6/17/07, 6/18/07]


Beyond TFIDF: graph similarity 82 84 86 88 90 92 94 96 98 00 04 08 12

•  Personalized PageRank aka Random Walk with Restart:
   –  Similarity measure for nodes in a graph, analogous to TFIDF for text in a WHIRL database
   –  natural extension to PageRank
   –  amenable to learning parameters of the walk (gradient search, w/ various optimization metrics):
      •  Toutanova, Manning & Ng, ICML 2004; Nie et al, WWW 2005; Xi et al, SIGIR 2005
   –  very fast to compute
   –  queries:
      Given type t* and node x, find y:T(y)=t* and y~x
      Given type t* and nodes X, find y:T(y)=t* and y~X
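A small Python sketch of such a query, under stated assumptions: the graph is a dict of out-neighbour lists, the walk restarts at the seed nodes X with probability alpha, and dangling nodes send their mass back to the seeds. The function names, the toy email graph, and alpha=0.2 are illustrative choices, not taken from the talk.

```python
def personalized_pagerank(graph, seeds, alpha=0.2, iters=100):
    """Random walk with restart by power iteration.

    graph: dict node -> list of out-neighbours
    seeds: distinct nodes the walk restarts at (uniformly)
    """
    nodes = set(graph) | {v for nbrs in graph.values() for v in nbrs}
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(restart)
    for _ in range(iters):
        nxt = {n: alpha * restart[n] for n in nodes}
        for u in nodes:
            nbrs = graph.get(u, [])
            if not nbrs:
                # Assumption: a dangling node returns its mass to the seeds.
                for s in seeds:
                    nxt[s] += (1 - alpha) * p[u] / len(seeds)
                continue
            share = (1 - alpha) * p[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share
        p = nxt
    return p


def similarity_query(graph, node_type, seeds, target_type, **kwargs):
    """Rank nodes y with T(y) = target_type by their similarity y ~ seeds."""
    scores = personalized_pagerank(graph, seeds, **kwargs)
    hits = [(score, y) for y, score in scores.items()
            if node_type.get(y) == target_type and y not in seeds]
    return [y for score, y in sorted(hits, reverse=True)]


# Hypothetical toy email graph: terms, messages and persons.
graph = {
    "term:andy": ["msg:1"],
    "msg:1": ["term:andy", "person:andrew", "person:bob"],
    "msg:2": ["person:andrew"],
    "person:andrew": ["msg:1", "msg:2"],
    "person:bob": ["msg:1"],
}
types = {n: n.split(":")[0] for n in graph}
ranked = similarity_query(graph, types, ["term:andy"], "person")
print(ranked)
```

Restricting the ranked list to one node type is what turns a generic graph-similarity score into a typed query of the form "find y: T(y)=t* and y~X".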

Tasks can be reduced to similarity queries

Person name disambiguation
   [ term “andy” file msgId ] “person”

Threading
   •  What are the adjacent messages in this thread?
   •  A proxy for finding “more messages like this one”
   [ file msgId ] “file”

Alias finding
   What are the email-addresses of Jason?...
   [ term Jason ] “email-address”

Meeting attendees finder
   Which email-addresses (persons) should I notify about this meeting?
   [ meeting mtgId ] “email-address”

Results on one task + Learning

[Figure: results chart, “Mgmt. game”]











Beyond TFIDF: graph similarity 82 84 86 88 90 92 94 96 98 00 04 08 12

•  Personalized PageRank aka Random Walk with Restart:
   –  Given type t* and nodes X, find y:T(y)=t* and y~X

•  New and better learning methods
   –  richer parameterization
   –  faster PPR inference
   –  structure learning

•  Other tasks:
   –  relation-finding in parsed text
   –  information management for biologists
   –  inference in large noisy knowledge bases
   –  work with Ni Lao (formerly CMU, now Google)

History Machine Learning

Representation languages: DBs, KR

Linguistic similarity: NLP, IE, IR

Machine Learning

Representation languages: DBs, KR

Linguistic → graph similarity: NLP, IE, IR


Unifying Personalized PageRank and Prolog: ProPPR

William Yang Wang, Katie Mazaitis

Sample ProPPR program….

Horn rules

features of rules

D’oh! This is a graph!

.. and search space…

•  Score for a query soln (e.g., “Z=sport” for “about(a,Z)”) depends on probability of reaching a ☐ node*
•  Learn transition probabilities based on features of the rules
•  Implicit “reset” transitions with (p≥α) back to query node
•  Looking for answers supported by many short proofs

“Grounding” size is O(1/αε) … i.e., independent of DB size → fast approx incremental inference (Andersen, Chung, Lang, 08)

Learning: supervised variant of personalized PageRank (Backstrom & Leskovec, 2011)

*Exactly as in Stochastic Logic Programs [Cussens, 2001]
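To make the O(1/αε) claim concrete, here is a hedged Python sketch of a push-style approximate personalized PageRank: mass is pushed outward from the query node only while a node's residual per out-edge exceeds ε, and the graph is expanded lazily. It illustrates the general technique rather than ProPPR's actual grounding code; `approx_ppr`, the callable `neighbors`, and the treatment of nodes without out-edges as absorbing are assumptions of this sketch.

```python
from collections import deque


def approx_ppr(neighbors, seed, alpha=0.2, eps=1e-4):
    """Push-style approximate personalized PageRank from `seed`.

    neighbors: callable node -> list of successors (the graph is only
               expanded where the walk actually puts mass)
    Returns (p, r, pushes): approximate scores, leftover residuals, and the
    number of push operations performed.
    """
    p, r = {}, {seed: 1.0}
    queue = deque([seed])
    pushes = 0
    while queue:
        u = queue.popleft()
        nbrs = neighbors(u)
        deg = max(len(nbrs), 1)
        mass = r.get(u, 0.0)
        if mass / deg <= eps:
            continue  # residual too small to be worth pushing
        pushes += 1
        r[u] = 0.0
        if not nbrs:
            # Assumption: a node with no out-edges absorbs its residual.
            p[u] = p.get(u, 0.0) + mass
            continue
        p[u] = p.get(u, 0.0) + alpha * mass
        share = (1 - alpha) * mass / deg
        for v in nbrs:
            r[v] = r.get(v, 0.0) + share
            queue.append(v)
    return p, r, pushes


# An unbounded binary tree stands in for a "database" of arbitrary size:
# the walk still terminates after touching only a small neighbourhood.
def children(n):
    return [2 * n + 1, 2 * n + 2]


p, r, pushes = approx_ppr(children, 0, alpha=0.2, eps=1e-3)
print(pushes, "pushes;", len(p), "nodes scored")
```

Each push on a node with out-edges moves more than α·ε·deg(u) of the unit mass into `p`, so the total out-degree of pushed nodes is bounded by 1/(αε) no matter how large the underlying graph is.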

Sample Task: Citation Matching
•  Task:
   •  citation matching (Alchemy: Poon & Domingos).
•  Dataset:
   •  CORA dataset, 1295 citations of 132 distinct papers.
   •  Training set: section 1-4.
   •  Test set: section 5.
•  ProPPR program:
   •  translated from corresponding Markov logic network (dropping non-Horn clauses)
   •  # of rules: 21.

Task: Citation Matching

Time: Citation Matching vs Alchemy

“Grounding” is independent of DB size

Accuracy: Citation Matching

Our rules / UW rules

AUC scores: 0.0=low, 1.0=hi; w=1 is before learning

It gets better…
•  Learning uses many example queries
   •  e.g.: sameCitation(c120,X) with X=c123+, X=c124-, …
•  Each query is grounded to a separate small graph (for its proof)
•  Goal is to tune weights on these edge features to optimize RWR on the query-graphs.
•  Can do SGD and run RWR separately on each query-graph
   •  Graphs do share edge features, so there’s some synchronization needed

Learning can be parallelized by splitting on the separate “groundings” of each query
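The training loop described above can be sketched in Python as follows. Everything specific here is an assumption made for illustration: edge weights are exp(w[feature]) normalized per node, the loss rewards RWR mass on positive answers and penalizes it on negative ones, and the gradient is taken by finite differences to keep the sketch short (a real implementation would use an analytic gradient). The query graphs and feature names are hypothetical.

```python
import math


def rwr(edges, start, w, alpha=0.2, iters=50):
    """RWR on one query graph. edges: dict u -> list of (v, feature)."""
    nodes = set(edges) | {v for out in edges.values() for v, _ in out}
    p = {n: 0.0 for n in nodes}
    p[start] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in nodes}
        nxt[start] += alpha  # reset transitions back to the query node
        for u in nodes:
            out = edges.get(u, [])
            if not out:  # assumption: solution nodes return to the query
                nxt[start] += (1 - alpha) * p[u]
                continue
            z = sum(math.exp(w[f]) for _, f in out)
            for v, f in out:
                nxt[v] += (1 - alpha) * p[u] * math.exp(w[f]) / z
        p = nxt
    return p


def loss(query, w):
    """Negative log-likelihood style loss on one grounded query."""
    edges, start, pos, neg = query
    p = rwr(edges, start, w)
    return (-sum(math.log(p[y]) for y in pos)
            - sum(math.log(1 - p[y]) for y in neg))


def sgd(queries, features, epochs=20, lr=0.5, h=1e-5):
    """One shared weight vector, updated one query graph at a time."""
    w = {f: 0.0 for f in features}  # exp(0) = 1: uniform before learning
    for _ in range(epochs):
        for q in queries:
            base = loss(q, w)
            grad = {}
            for f in w:
                w[f] += h
                grad[f] = (loss(q, w) - base) / h  # finite difference
                w[f] -= h
            for f in w:
                w[f] -= lr * grad[f]
    return w


# Two hypothetical groundings that share the edge features.
queries = [
    ({"q1": [("c123", "title"), ("c124", "venue")]}, "q1", ["c123"], ["c124"]),
    ({"q2": [("c201", "title"), ("c202", "venue")]}, "q2", ["c201"], ["c202"]),
]
w = sgd(queries, ["title", "venue"])
print(w)
```

Each call to `loss` touches only one query graph, which is why the work splits naturally across groundings; only the shared dictionary `w` would need synchronization in a parallel version.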

Another  Sample  Task  

Lao: A learned random walk strategy is a weighted set of random-walk “experts”, each of which is a walk constrained by a path (i.e., sequence of relations)

Recommending papers to cite in a paper being prepared:
1) papers co-cited with on-topic papers
6) approx. standard IR retrieval
7,8) papers cited during the past two years
12-13) papers published during the past two years
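A hedged Python sketch of one such path-constrained "expert" and of a weighted combination of experts. The KB layout (a dict keyed by node and relation), the function names, the relations AthletePlaysSport and SportPlayedInLeague, the entities other than HinesWard, and the weights are all hypothetical, chosen only to make the example runnable.

```python
def path_constrained_walk(kb, source, path):
    """Distribution over nodes reached from `source` by following `path`.

    kb:   dict (node, relation) -> list of neighbour nodes
    path: sequence of relation names; at each step the walker picks
          uniformly among the edges with the required relation
    """
    dist = {source: 1.0}
    for rel in path:
        nxt = {}
        for node, prob in dist.items():
            nbrs = kb.get((node, rel), [])
            for v in nbrs:
                nxt[v] = nxt.get(v, 0.0) + prob / len(nbrs)
        dist = nxt  # walkers with no matching edge are dropped
    return dist


def combined_score(kb, source, target, weighted_paths):
    """Weighted sum of the experts' probabilities of reaching `target`."""
    return sum(theta * path_constrained_walk(kb, source, path).get(target, 0.0)
               for path, theta in weighted_paths)


# Hypothetical KB fragment and hypothetical learned weights.
kb = {
    ("HinesWard", "AthletePlaysForTeam"): ["Steelers"],
    ("Steelers", "TeamPlaysInLeague"): ["NFL"],
    ("HinesWard", "AthletePlaysSport"): ["Football"],
    ("Football", "SportPlayedInLeague"): ["NFL", "NCAA"],
}
experts = [
    (("AthletePlaysForTeam", "TeamPlaysInLeague"), 1.5),
    (("AthletePlaysSport", "SportPlayedInLeague"), 0.5),
]
print(combined_score(kb, "HinesWard", "NFL", experts))
```

Each path plays the role of one feature; learning then amounts to choosing the per-path weights so that correct targets receive the highest combined score.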

Another study: learning inference rules for a noisy KB (Lao, Cohen, Mitchell 2011)

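[Figure: KB graph around HinesWard for the query AthletePlaysInLeague(HinesWard, ?), with edge labels AthletePlaysForTeam, TeamPlaysInLeague, IsA, isa-1 and PlaysIn, the node label “American”, and the annotation “Synonyms of the query team”]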

•  Paths learned are like ProPPR rules
•  …but they are learned separately for each relation type, and one learned rule can’t call another

athletePlaysSport(Athlete,Sport) ← onTeam(Athlete,Team), teamPlaysSport(Team,Sport)

teamPlaysSport(Team,Sport) ← memberOf(Team,Conference), hasMember(Conference,Team2), plays(Team2,Sport).
teamPlaysSport(Team,Sport) ← onTeam(Athlete,Team), athletePlaysSport(Athlete,Sport)

•  Paths learned are like ProPPR rules
•  …but they are learned separately for each relation type, and one learned rule can’t call another

athletePlaysSportViaRule(Athlete,Sport) ← onTeamViaKB(Athlete,Team), teamPlaysSportViaKB(Team,Sport)

teamPlaysSportViaRule(Team,Sport) ← memberOfViaKB(Team,Conference), hasMemberViaKB(Conference,Team2), playsViaKB(Team2,Sport).
teamPlaysSportViaRule(Team,Sport) ← onTeamViaKB(Athlete,Team), athletePlaysSportViaKB(Athlete,Sport)

Experiment:
•  Take top K paths for each predicate learned by Lao’s PRA
   •  (I don’t know how to do structure learning for ProPPR yet)
•  Convert to a mutually recursive ProPPR program
•  Train weights on entire program (~=800 rules, 12k queries)

athletePlaysSport(Athlete,Sport) ← onTeam(Athlete,Team), teamPlaysSport(Team,Sport)
athletePlaysSport(Athlete,Sport) ← athletePlaysSportViaKB(Athlete,Sport)

teamPlaysSport(Team,Sport) ← memberOf(Team,Conference), hasMember(Conference,Team2), plays(Team2,Sport).
teamPlaysSport(Team,Sport) ← onTeam(Athlete,Team), athletePlaysSport(Athlete,Sport)
teamPlaysSport(Team,Sport) ← teamPlaysSportViaKB(Team,Sport)

Joint Inference for Relation Prediction
•  Task: link prediction.
•  Dataset: a subset of 19,527 beliefs from NELL.
•  Training set: 12,331 queries.
•  Test set: 1,185 queries.
•  # Rules: 797.

You can do more with ProPPR…

Machine Learning

Representation languages: DBs, KR


Linguistic → graph similarity: NLP, IE, IR

•  Semantically simple
•  Extends PPR and Prolog
•  Scalable and flexible:
   •  Applicable to very large databases even with arbitrary recursion in a logic program
   •  Easily parallelizable learning-to-perform-PPR method
o  Not (yet) fast
o  Learned probabilities are about a proof process on a logic program, not about state of the world