Document not found! Please try again

Use of Semantics in Real Life Applications - York University

72 downloads 10857 Views 11MB Size Report
Oct 29, 2010 - Source National Visualization and Analytics .... Text fields. « … the certification test is documented in the report … » Numerical .... Business.
Gregory Grefenstette Chief Science Officer Exalead

Use of Semantics in Real Life Applications 29 October 2010

Confidential Information

What is semantics Semantics is many things to many people My definition of semantics (general public) Semantics in Search Engines Successful semantics Entity extraction Topic segmentation Affect analysis, domain analysis

Semantics in Products Voxalead News Urbanizer

Future LOD

10/29/2010

2

Confidential Information

Semantic Entities

D O I :

Unified Access to Heterogeneous Audiovisual Archives Y. Avrithis, G. Stamou, M. Wallace et al. 10.3217/jucs-009-06-0510

10/29/2010

3

Confidential Information

Semantic store The Multidimensional Learning Model: A Novel Cognitive Psychology-Based Model for Computer Assisted Instruction in Order to Improve Learning in Medical Students Tarek M. Abdelhamid, M.D., Formerly from the Medical Education Development Office, The University of Auckland School of Medicine and Health Science

Figure 3 According to the level of processing approach the semantic (meaningful) coding of information provides the deepest form of processing.

10/29/2010

4

Confidential Information

Major levels of linguistic structure

Source National Visualization and Analytics Center: Illuminating the Path: The Research and Development Agenda for Visual Analytics p.110., 2005 Author James J. Thomas and Kristin A. Cook (Ed.)

10/29/2010

5

Confidential Information

Syntactic and semantic level description of the surface text

10/29/2010

6

Confidential Information

Semantic Level

Announcing the Release of the August 2009 CTP for the Open XML SDK BrianJones 27 Aug 2009 9:21 AM

10/29/2010

7

Confidential Information

Semantic Map

LearningTip #50: Strategies Help Reluctant Silent Readers Read to Learn By Joyce Melton Pagés, Ed.D. Educator, President of KidBibs

10/29/2010

8

Confidential Information

Semantic Structure Foundations of language: brain, meaning, grammar, evolution By Ray Jackendoff

http://www.cse.iitk.ac.in/users/amit/books/jackendoff-2002-foundations-of-language.html/

10/29/2010

9

Confidential Information

Semantic Levels and Computer Evidence

http://www.anguas.com

Mar 18 2010 10/29/2010

10

Confidential Information

Semantic Gap

http://mklab.iti.gr/svmcf

10/29/2010

11

Confidential Information

Semantic Filtering

M. Naphade, I. Kozintsev and T. Huang, “A Factor Graph Framework for Semantic Video Indexing”, accepted for publication in IEEE Transactions on Circuits and Systems for Video Technology http://domino.research.ibm.com

10/29/2010

12

Confidential Information

Semantic Vocabulary

Jingen Liu, Yang Yang and Mubarak Shah, , Learning Semantic Visual Vocabularies Using Diffusion Distance IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

10/29/2010

13

Confidential Information

Semantic Needs

http://blog.webgenomeproject.org/internethierarchy-of-needs-elaborated-part-4/

10/29/2010

14

Confidential Information

Unified Semantic Ecosystem

http://dita.xml.org/wiki/level-six-universalsemantic-ecosystem

10/29/2010

15

Confidential Information

Swoogle Semantic Web indexer

10/29/2010

16

Confidential Information

Semantic Web The element of the Semantic Web Can be encoded in XML Simplicity and mathematical consistency This is called Resource Description Framework (RDF)

http://www.w3.org/2007/Talks/1211-whit-tbl/ Tim Berners-Lee MIT Computer Science & Artificial Intelligence Laboratory (CSAIL) Southampton University School of Electronics and Computer Science World Wide Web Consortium

10/29/2010

17

Confidential Information

Semantic Web RDF data...

Subject and object node using same URIs http://www.w3.org/2007/Talks/1211-whit-tbl/ Tim Berners-Lee MIT Computer Science & Artificial Intelligence Laboratory (CSAIL) Southampton University School of Electronics and Computer Science World Wide Web Consortium 10/29/2010

18

Confidential Information

Semantics Upper level classes Relations between concepts Predicates extracted from text Restrictions on content Labels on visual data « meaning » Shared master data space RDF triples

10/29/2010

19

Confidential Information

MY EXPERIENCE WITH SEMANTICS Confidential Information

My thinking in 2007 Language identification tokenization Word segmentation Morphological analysis Part of speech tagging

Lemmatisation Spelling correction, synonym expansion

Phrase chunking Entity recognition Summarization

Dependency parsing

Sentiment analysis

Event analysis

Document analysis Topic identification

clustering

Classification

10/29/2010

21

Confidential Information

In 2008, I joined Exalead

16 milliards de pages web 2 milliards d’images 50 millions de videos

www.exalead.com/search

22

Confidential Information

Searching for known items with facets

ASIDE ABOUT ENTERPRISE SEARCH Confidential Information

10/29/2010

24

Confidential Information

Semantic Dimensions Facets

Hunting Gathering Shopping

10/29/2010

25

Confidential Information

Confidential Information

END OF ASIDE BACK TO CLIENTS DESIRE FOR SEMANTICS Confidential Information

I learned that this is what Clients think

Anything added to but not in the original text is semantics Anything fancier than indexing original strings

Anything that has "language"

10/29/2010

28

Confidential Information

Semantics is three things to the ordinary person

Is this an example of that?

Are these two things the same? What is the relation between these two things?

10/29/2010

29

Confidential Information

Semantics is three things to the ordinary person

Is this an example of that? Language identification, entity extraction, word sense disambiguation, affect analysis

Are these two things the same? Tokenization, spelling correction, morphological analysis, summarization, synonym expansion, paraphrase, anaphor

What is the relation between these two things? Parsing, event extraction, relation extraction, classification, clustering

10/29/2010

30

Confidential Information

Semantics in a Search Engine Text fields « … the certification test is documented in the report … »

Numerical fields 3.14159265

Date 16/11/1957

GPS coordinates, real world coordinates 48.451065619, 1.4392089

Categories Top/Animals/Pets/Dog

Value separated fields Color: outside red ; interior: white ; trimming: silver

Metadata (attribute:value) Source : dailymotion

10/29/2010

31

Confidential Information

Exalead Fields capture most types, documents capture rows Exalead Field types Text Dates Numbers Categories

Exalead Documents Each doc an entity, database rows, or Each doc a text, with entities added in fields

32 10/29/2010

32

Confidential Information

Semantics Processes in a Search Engine

autosuggestion language identification related terms related queries spell checker faceted search

lemmatisation synonyms phonetic cross language Ontology Matcher

local search phonetic lemmatisation stopwords synonyms sentiment analysis named entity

INDEX

QueryMatcher FastRules HTMLRelevantContextExtractor Categorizer Clusterer 10/29/2010

33

Confidential Information

33

AutoSuggestion Example

Auto completion of user query as they type

10/29/2010

34

Confidential Information

Auto Suggestion Resource ....

Jones, G. et al. Non-hierarchic document clustering Willett Introduction Cluster analysis, or automatic classification, is a multivariate statistical technique ... henceforth a GA, for document clustering. 04 nov. 2003 - 20k MyCategory: MachineLearning Source : Vie société : Documentation : Press : Exalead EN-FR Press excerpts : Old - News Articles : Articles de recherche : NeuralNetworks :clustering Personnalité : Gareth Jones, Peter Willett, Edward Arnold, Van Nostrand Reinhold Lieu : Berlin, Duran, Sheffield, New York, London Organisation : British Library, Wellcome, Everitt, American Society, ACM, Springer

add a category to a document matching a compiled regular expression

100 000 queries with boolean large multi word expression matching 1 Mb / sec / server FastRulesAdvantages Very fast rules categorization Support huge set of rules (> 100k rules) Regular expression on chunk (for example a regular expression on an url, title or text) Available as a standalone component (Exa and Java) Drawbacks Limited syntax (only AND/OR/NOT operators are supported) Rules must be compiled before usage 10/29/2010

45

Confidential Information

RulesMatcher, Named Entities

image of use?

10/29/2010

47

Confidential Information

SentimentAnalyzer

10/29/2010

48

Confidential Information

Categorizer Business

Consumer

Shopping

Services

Customer Service

Inqueries

Documents D’apprentissage

Pets

Signature de la classe

Business

Consumer

Shopping

Services

Nouveau document

Inqueries

Customer Service

10/29/2010

Pets

49

Confidential Information

Semantics Processes in a Search Engine

autosuggestion language identification related terms related queries spell checker faceted search

lemmatisation synonyms phonetic cross language Ontology Matcher

local search phonetic lemmatisation stopwords synonyms sentiment analysis named entity

INDEX

QueryMatcher FastRules HTMLRelevantContextExtractor Categorizer Clusterer 10/29/2010

50

Confidential Information

50

With documents as containers With rich semantic markup With database offloading database queries generating documents

Search engines can become part of a process

Search Based Applications

10/29/2010

51

Confidential Information

Search-Based Applications Search Engines can provide a powerful search-based infrastructure...

… for a new generation of applications

Without Offloading High complexity/costs, Low performance/reusability

With Offloading Low complexity/costs, High performance/reusability

10/29/2010

52

Confidential Information

Affordance of Search Engines

Web Search Technology has evolved over the past decade to perfect distribution of information Handle 100s of millions of database records

Offer 99,999% availability Follow a 2-month innovation roadmap

Match low revenue model ($0,0001/ session)

Support impossible to forecast traffic

Be usable without training

Provide sub-second response times

Present up-to-date 10/29/2010 information

53

Confidential Information

53

Databases

Search Based

Search engines

Applications

10/29/2010

54

Confidential Information

Transfer from research to industry

SEMANTICS SUCCESS STORIES Confidential Information

Named Entities "Named entities are phrases that contain the names of persons, organizations, locations, times, and quantities." (CoNLL 2002) MUC 1 (1987) thru MUC 6 (1995) Jim bought 300 shares of Acme Corp. in 2006.

http://cs.nyu.edu/cs/faculty/grishman/ 10/29/2010

56

Confidential Information

Voxalead

10/29/2010

57

Confidential Information

10/29/2010

58

Confidential Information

VoxaleadNews

Methodology RSS feed

Video file

Meta data

Stream splitter

Audio track .wav

Extract vignettes .jpg

Title Date Channel …

Video re encoding .mp4

Speech-to-Text

Text with timetracking .xml

Push API Index

Metadata Named Entities Cross Lingual

59 10/29/2010

59

Confidential Information

10/29/2010

60

Confidential Information

10/29/2010

61

Confidential Information

10/29/2010

62

Confidential Information

10/29/2010

63

Confidential Information

10/29/2010

64

Confidential Information

Confidential Information

Transfer from research to industry

SEMANTICS SUCCESS STORY 2

Confidential Information

Topic Segmentation

http://people.ischool.berkeley.edu/~hearst/papers/acl94.pdf 10/29/2010

68

Confidential Information

Professor Marti Hearst

Topic Segmentation

http://people.ischool.berkeley.edu/~hearst/papers/acl94.pdf Professor Marti Hearst 10/29/2010

69

Confidential Information

Text segmentation in practice

10/29/2010

70

Confidential Information

10/29/2010

71

Confidential Information

10/29/2010

72

Confidential Information

10/29/2010

73

Confidential Information

10/29/2010

74

Confidential Information

Transfer from research to industry

SEMANTICS SUCCESS STORY 3

Confidential Information

Brief History of Semantic Fields and Affect Analysis active

. spry

young

slow

Deese (1965) semantic axes fast

free association experiments “big-small”, “hot-cold”, etc.

old passive

Berlin & Kay (1969) Lehrer (1974) semantic fields Levinson (1983) linguistic scale set of alternate or contrastive expressions arranged on an axis by degree of semantic strength red

red

orange

amber

yellow

tan

green

blue

yellow

blue

10/29/2010

76

Confidential Information

Affect Lexicons Lasswell Value Dictionary (1969) Eight dimensions: WEALTH, POWER, RECTITUDE, RESPECT, ENLIGHTENMENT, SKILL, AFFECTION, AND WELLBEING with positive or negative orientation e.g., admire: RESPECT (positive)

General Inquirer dictionary (Stone, et al. 1965) 9051 headwords 1,915 positive and 2,291 negative words (Pos/Neg)

also labels: Active, Passive, ... , Pleasure, Pain, …Human, Animate, …, Region, Route,…, Fetch, Stay, ..

http://www.wjh.harvard.edu/~inquirer/inqdict.txt

10/29/2010

77

Confidential Information

Clairvoyance Affect Lexicon

"arrogance" sn "superiority" 0.7 0.9 .. "gleeful" adj “happiness” 0.7 0.6 "gleeful" adj “excitement” 0.3 0.6 …

84 pair affect classes (positive/negative)

Confidential Information

Finding Emotive Words using Patterns appear feel is look seem

almost extremely so very too

X

“appearing so …” “feeling extremely…” “was too…” “looks almost….”

105 patterns with conjugated word forms

10/29/2010

79

Confidential Information

Most Common Words from All Patterns Using snippets from all 105 patterns: 8957 good 2906 important 2506 happy 2455 small 2024 bad 1976 easy 1951 far 1745 difficult 1697 hard 1563 pleased

… in all 15111 different, inflected word

10/29/2010

80

Confidential Information

SO-PMI of emotive pattern words

37.5 33.6 32.9 29.0 26.2 24.6 22.9 21.9 -17.1 -17.5 -18.1 -18.5 -18.6 -22.7

knowlegeable tailormade eyecatching huggable surefooted timesaving personable welldone …. underdressed uncreative disapproving meanspirited unwatchable discombobulated Confidential Information

Mood of a city

URBANIZER.COM

Confidential Information

http://labs.exalead.com/

10/29/2010

83

Confidential Information

10/29/2010

84

Confidential Information

10/29/2010

85

Confidential Information

10/29/2010

86

Confidential Information

10/29/2010

87

Confidential Information

10/29/2010

88

Confidential Information

10/29/2010

89

Confidential Information

10/29/2010

90

Confidential Information

Extraction + Abstraction + Abstraction

"…decor is not particularly elegant…" ambiance

↓ 70

Business Lunch ≈ service ε [0,40] V cuisine ε [45,100] V ambiance ε [45,100]

10/29/2010

91

Confidential Information

URBANIZER Domain modeling

Service Domain vocabulary

Cuisine

Sentiment for domain Syntactic patterns

Ambiance

Confidential Information

Domain analysis, same as Affect Analysis

Deese (1965) semantic axes free association experiments “big-small”, “hot-cold”, etc.

quick homey upscale

active young

slow

casual fast

refined

old

leisurely

passive

…spent {two, three} hours…

10/29/2010

93

Confidential Information

10/29/2010

94

Confidential Information

10/29/2010

95

Confidential Information

10/29/2010

96

Confidential Information

Another example of domain modeling

10/29/2010

97

Confidential Information

10/29/2010

98

Confidential Information

10/29/2010

99

Confidential Information

Foreseeable, near term

FUTURE SUCCESSES

Confidential Information

Dbpedia, RDF triples

10/29/2010

101

Confidential Information

Wikipedia InfoBoxes

10/29/2010

102

Confidential Information

Implicit tables

103 10/29/2010

103

Confidential Information

Linked Open Data Cloud

2008

2008 2008

2007

2008

10/29/2010

104

2009

Confidential Information

2009

LOD2 turns linked data into knowledge interlink SILK

create

DXX Engine

merge

poolparty SemMF

OntoWiki

2007

Sigma

20082008 WiQA2008

ORE

Virtouso

verify DL-Learner

2008

2009

classify

MonetDB Sindice

enrich 105 Confidential Information

Challenge : searching ad hoc, idiosyncratic spaces

http://www.imrc.kist.re.kr/wiki/LifeLog

10/29/2010

106

Confidential Information

Microsoft SenseCam

http://research.microsoft.com/enus/um/cambridge/projects/sensecam/

http://justgetthere.us/blog/plugin/tag/cameras

10/29/2010

107

Confidential Information

http://gadgets.infoniac.com/head-worncamera-for-special-events.html 10/29/2010

108

Confidential Information

Bike My Way

10/29/2010

109

Confidential Information

Insta Mapper

10/29/2010

110

Confidential Information

Challenge : searching ad hoc, idiosyncratic spaces

Hui Yang and Jamie Callan. A Metric-based Framework for Automatic Taxonomy Induction. In Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics (ACL2009), Singapore. Aug 2-7, 2009

http://www.imrc.kist.re.kr/wiki/LifeLog

10/29/2010

111

Confidential Information

Conclusions Semantics What is this? Are these the same? Who does what to whom?

Search Engines now handle rich semantics Search based applications, between DB and IR

Semantic Research Successful transfer to industry

Near Future Integration crowd source knowledge Wikipedia Linked open data

10/29/2010

112

Confidential Information

Thank you

10/29/2010

113

Confidential Information

Suggest Documents