Oct 29, 2010 - Source National Visualization and Analytics .... Text fields. « ⦠the certification test is documented in the report ⦠» Numerical .... Business.
Gregory Grefenstette Chief Science Officer Exalead
Use of Semantics in Real Life Applications 29 October 2010
Confidential Information
What is semantics Semantics is many things to many people My definition of semantics (general public) Semantics in Search Engines Successful semantics Entity extraction Topic segmentation Affect analysis, domain analysis
Semantics in Products Voxalead News Urbanizer
Future LOD
10/29/2010
2
Confidential Information
Semantic Entities
D O I :
Unified Access to Heterogeneous Audiovisual Archives Y. Avrithis, G. Stamou, M. Wallace et al. 10.3217/jucs-009-06-0510
10/29/2010
3
Confidential Information
Semantic store The Multidimensional Learning Model: A Novel Cognitive Psychology-Based Model for Computer Assisted Instruction in Order to Improve Learning in Medical Students Tarek M. Abdelhamid, M.D., Formerly from the Medical Education Development Office, The University of Auckland School of Medicine and Health Science
Figure 3 According to the level of processing approach the semantic (meaningful) coding of information provides the deepest form of processing.
10/29/2010
4
Confidential Information
Major levels of linguistic structure
Source National Visualization and Analytics Center: Illuminating the Path: The Research and Development Agenda for Visual Analytics p.110., 2005 Author James J. Thomas and Kristin A. Cook (Ed.)
10/29/2010
5
Confidential Information
Syntactic and semantic level description of the surface text
10/29/2010
6
Confidential Information
Semantic Level
Announcing the Release of the August 2009 CTP for the Open XML SDK BrianJones 27 Aug 2009 9:21 AM
10/29/2010
7
Confidential Information
Semantic Map
LearningTip #50: Strategies Help Reluctant Silent Readers Read to Learn By Joyce Melton Pagés, Ed.D. Educator, President of KidBibs
10/29/2010
8
Confidential Information
Semantic Structure Foundations of language: brain, meaning, grammar, evolution By Ray Jackendoff
http://www.cse.iitk.ac.in/users/amit/books/jackendoff-2002-foundations-of-language.html/
10/29/2010
9
Confidential Information
Semantic Levels and Computer Evidence
http://www.anguas.com
Mar 18 2010 10/29/2010
10
Confidential Information
Semantic Gap
http://mklab.iti.gr/svmcf
10/29/2010
11
Confidential Information
Semantic Filtering
M. Naphade, I. Kozintsev and T. Huang, “A Factor Graph Framework for Semantic Video Indexing”, accepted for publication in IEEE Transactions on Circuits and Systems for Video Technology http://domino.research.ibm.com
10/29/2010
12
Confidential Information
Semantic Vocabulary
Jingen Liu, Yang Yang and Mubarak Shah, , Learning Semantic Visual Vocabularies Using Diffusion Distance IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
10/29/2010
13
Confidential Information
Semantic Needs
http://blog.webgenomeproject.org/internethierarchy-of-needs-elaborated-part-4/
10/29/2010
14
Confidential Information
Unified Semantic Ecosystem
http://dita.xml.org/wiki/level-six-universalsemantic-ecosystem
10/29/2010
15
Confidential Information
Swoogle Semantic Web indexer
10/29/2010
16
Confidential Information
Semantic Web The element of the Semantic Web Can be encoded in XML Simplicity and mathematical consistency This is called Resource Description Framework (RDF)
http://www.w3.org/2007/Talks/1211-whit-tbl/ Tim Berners-Lee MIT Computer Science & Artificial Intelligence Laboratory (CSAIL) Southampton University School of Electronics and Computer Science World Wide Web Consortium
10/29/2010
17
Confidential Information
Semantic Web RDF data...
Subject and object node using same URIs http://www.w3.org/2007/Talks/1211-whit-tbl/ Tim Berners-Lee MIT Computer Science & Artificial Intelligence Laboratory (CSAIL) Southampton University School of Electronics and Computer Science World Wide Web Consortium 10/29/2010
18
Confidential Information
Semantics Upper level classes Relations between concepts Predicates extracted from text Restrictions on content Labels on visual data « meaning » Shared master data space RDF triples
10/29/2010
19
Confidential Information
MY EXPERIENCE WITH SEMANTICS Confidential Information
My thinking in 2007 Language identification tokenization Word segmentation Morphological analysis Part of speech tagging
Lemmatisation Spelling correction, synonym expansion
Phrase chunking Entity recognition Summarization
Dependency parsing
Sentiment analysis
Event analysis
Document analysis Topic identification
clustering
Classification
10/29/2010
21
Confidential Information
In 2008, I joined Exalead
16 milliards de pages web 2 milliards d’images 50 millions de videos
www.exalead.com/search
22
Confidential Information
Searching for known items with facets
ASIDE ABOUT ENTERPRISE SEARCH Confidential Information
10/29/2010
24
Confidential Information
Semantic Dimensions Facets
Hunting Gathering Shopping
10/29/2010
25
Confidential Information
Confidential Information
END OF ASIDE BACK TO CLIENTS DESIRE FOR SEMANTICS Confidential Information
I learned that this is what Clients think
Anything added to but not in the original text is semantics Anything fancier than indexing original strings
Anything that has "language"
10/29/2010
28
Confidential Information
Semantics is three things to the ordinary person
Is this an example of that?
Are these two things the same? What is the relation between these two things?
10/29/2010
29
Confidential Information
Semantics is three things to the ordinary person
Is this an example of that? Language identification, entity extraction, word sense disambiguation, affect analysis
Are these two things the same? Tokenization, spelling correction, morphological analysis, summarization, synonym expansion, paraphrase, anaphor
What is the relation between these two things? Parsing, event extraction, relation extraction, classification, clustering
10/29/2010
30
Confidential Information
Semantics in a Search Engine Text fields « … the certification test is documented in the report … »
Numerical fields 3.14159265
Date 16/11/1957
GPS coordinates, real world coordinates 48.451065619, 1.4392089
Categories Top/Animals/Pets/Dog
Value separated fields Color: outside red ; interior: white ; trimming: silver
Metadata (attribute:value) Source : dailymotion
10/29/2010
31
Confidential Information
Exalead Fields capture most types, documents capture rows Exalead Field types Text Dates Numbers Categories
Exalead Documents Each doc an entity, database rows, or Each doc a text, with entities added in fields
32 10/29/2010
32
Confidential Information
Semantics Processes in a Search Engine
autosuggestion language identification related terms related queries spell checker faceted search
lemmatisation synonyms phonetic cross language Ontology Matcher
local search phonetic lemmatisation stopwords synonyms sentiment analysis named entity
INDEX
QueryMatcher FastRules HTMLRelevantContextExtractor Categorizer Clusterer 10/29/2010
33
Confidential Information
33
AutoSuggestion Example
Auto completion of user query as they type
10/29/2010
34
Confidential Information
Auto Suggestion Resource ....
Jones, G. et al. Non-hierarchic document clustering Willett Introduction Cluster analysis, or automatic classification, is a multivariate statistical technique ... henceforth a GA, for document clustering. 04 nov. 2003 - 20k MyCategory: MachineLearning Source : Vie société : Documentation : Press : Exalead EN-FR Press excerpts : Old - News Articles : Articles de recherche : NeuralNetworks :clustering Personnalité : Gareth Jones, Peter Willett, Edward Arnold, Van Nostrand Reinhold Lieu : Berlin, Duran, Sheffield, New York, London Organisation : British Library, Wellcome, Everitt, American Society, ACM, Springer
add a category to a document matching a compiled regular expression
100 000 queries with boolean large multi word expression matching 1 Mb / sec / server FastRulesAdvantages Very fast rules categorization Support huge set of rules (> 100k rules) Regular expression on chunk (for example a regular expression on an url, title or text) Available as a standalone component (Exa and Java) Drawbacks Limited syntax (only AND/OR/NOT operators are supported) Rules must be compiled before usage 10/29/2010
45
Confidential Information
RulesMatcher, Named Entities
image of use?
10/29/2010
47
Confidential Information
SentimentAnalyzer
10/29/2010
48
Confidential Information
Categorizer Business
Consumer
Shopping
Services
Customer Service
Inqueries
Documents D’apprentissage
Pets
Signature de la classe
Business
Consumer
Shopping
Services
Nouveau document
Inqueries
Customer Service
10/29/2010
Pets
49
Confidential Information
Semantics Processes in a Search Engine
autosuggestion language identification related terms related queries spell checker faceted search
lemmatisation synonyms phonetic cross language Ontology Matcher
local search phonetic lemmatisation stopwords synonyms sentiment analysis named entity
INDEX
QueryMatcher FastRules HTMLRelevantContextExtractor Categorizer Clusterer 10/29/2010
50
Confidential Information
50
With documents as containers With rich semantic markup With database offloading database queries generating documents
Search engines can become part of a process
Search Based Applications
10/29/2010
51
Confidential Information
Search-Based Applications Search Engines can provide a powerful search-based infrastructure...
… for a new generation of applications
Without Offloading High complexity/costs, Low performance/reusability
With Offloading Low complexity/costs, High performance/reusability
10/29/2010
52
Confidential Information
Affordance of Search Engines
Web Search Technology has evolved over the past decade to perfect distribution of information Handle 100s of millions of database records
Offer 99,999% availability Follow a 2-month innovation roadmap
Match low revenue model ($0,0001/ session)
Support impossible to forecast traffic
Be usable without training
Provide sub-second response times
Present up-to-date 10/29/2010 information
53
Confidential Information
53
Databases
Search Based
Search engines
Applications
10/29/2010
54
Confidential Information
Transfer from research to industry
SEMANTICS SUCCESS STORIES Confidential Information
Named Entities "Named entities are phrases that contain the names of persons, organizations, locations, times, and quantities." (CoNLL 2002) MUC 1 (1987) thru MUC 6 (1995) Jim bought 300 shares of Acme Corp. in 2006.
http://cs.nyu.edu/cs/faculty/grishman/ 10/29/2010
56
Confidential Information
Voxalead
10/29/2010
57
Confidential Information
10/29/2010
58
Confidential Information
VoxaleadNews
Methodology RSS feed
Video file
Meta data
Stream splitter
Audio track .wav
Extract vignettes .jpg
Title Date Channel …
Video re encoding .mp4
Speech-to-Text
Text with timetracking .xml
Push API Index
Metadata Named Entities Cross Lingual
59 10/29/2010
59
Confidential Information
10/29/2010
60
Confidential Information
10/29/2010
61
Confidential Information
10/29/2010
62
Confidential Information
10/29/2010
63
Confidential Information
10/29/2010
64
Confidential Information
Confidential Information
Transfer from research to industry
SEMANTICS SUCCESS STORY 2
Confidential Information
Topic Segmentation
http://people.ischool.berkeley.edu/~hearst/papers/acl94.pdf 10/29/2010
68
Confidential Information
Professor Marti Hearst
Topic Segmentation
http://people.ischool.berkeley.edu/~hearst/papers/acl94.pdf Professor Marti Hearst 10/29/2010
69
Confidential Information
Text segmentation in practice
10/29/2010
70
Confidential Information
10/29/2010
71
Confidential Information
10/29/2010
72
Confidential Information
10/29/2010
73
Confidential Information
10/29/2010
74
Confidential Information
Transfer from research to industry
SEMANTICS SUCCESS STORY 3
Confidential Information
Brief History of Semantic Fields and Affect Analysis active
. spry
young
slow
Deese (1965) semantic axes fast
free association experiments “big-small”, “hot-cold”, etc.
old passive
Berlin & Kay (1969) Lehrer (1974) semantic fields Levinson (1983) linguistic scale set of alternate or contrastive expressions arranged on an axis by degree of semantic strength red
red
orange
amber
yellow
tan
green
blue
yellow
blue
10/29/2010
76
Confidential Information
Affect Lexicons Lasswell Value Dictionary (1969) Eight dimensions: WEALTH, POWER, RECTITUDE, RESPECT, ENLIGHTENMENT, SKILL, AFFECTION, AND WELLBEING with positive or negative orientation e.g., admire: RESPECT (positive)
General Inquirer dictionary (Stone, et al. 1965) 9051 headwords 1,915 positive and 2,291 negative words (Pos/Neg)
also labels: Active, Passive, ... , Pleasure, Pain, …Human, Animate, …, Region, Route,…, Fetch, Stay, ..
http://www.wjh.harvard.edu/~inquirer/inqdict.txt
10/29/2010
77
Confidential Information
Clairvoyance Affect Lexicon
"arrogance" sn "superiority" 0.7 0.9 .. "gleeful" adj “happiness” 0.7 0.6 "gleeful" adj “excitement” 0.3 0.6 …
84 pair affect classes (positive/negative)
Confidential Information
Finding Emotive Words using Patterns appear feel is look seem
almost extremely so very too
X
“appearing so …” “feeling extremely…” “was too…” “looks almost….”
105 patterns with conjugated word forms
10/29/2010
79
Confidential Information
Most Common Words from All Patterns Using snippets from all 105 patterns: 8957 good 2906 important 2506 happy 2455 small 2024 bad 1976 easy 1951 far 1745 difficult 1697 hard 1563 pleased
… in all 15111 different, inflected word
10/29/2010
80
Confidential Information
SO-PMI of emotive pattern words
37.5 33.6 32.9 29.0 26.2 24.6 22.9 21.9 -17.1 -17.5 -18.1 -18.5 -18.6 -22.7
knowlegeable tailormade eyecatching huggable surefooted timesaving personable welldone …. underdressed uncreative disapproving meanspirited unwatchable discombobulated Confidential Information
Mood of a city
URBANIZER.COM
Confidential Information
http://labs.exalead.com/
10/29/2010
83
Confidential Information
10/29/2010
84
Confidential Information
10/29/2010
85
Confidential Information
10/29/2010
86
Confidential Information
10/29/2010
87
Confidential Information
10/29/2010
88
Confidential Information
10/29/2010
89
Confidential Information
10/29/2010
90
Confidential Information
Extraction + Abstraction + Abstraction
"…decor is not particularly elegant…" ambiance
↓ 70
Business Lunch ≈ service ε [0,40] V cuisine ε [45,100] V ambiance ε [45,100]
10/29/2010
91
Confidential Information
URBANIZER Domain modeling
Service Domain vocabulary
Cuisine
Sentiment for domain Syntactic patterns
Ambiance
Confidential Information
Domain analysis, same as Affect Analysis
Deese (1965) semantic axes free association experiments “big-small”, “hot-cold”, etc.
quick homey upscale
active young
slow
casual fast
refined
old
leisurely
passive
…spent {two, three} hours…
10/29/2010
93
Confidential Information
10/29/2010
94
Confidential Information
10/29/2010
95
Confidential Information
10/29/2010
96
Confidential Information
Another example of domain modeling
10/29/2010
97
Confidential Information
10/29/2010
98
Confidential Information
10/29/2010
99
Confidential Information
Foreseeable, near term
FUTURE SUCCESSES
Confidential Information
Dbpedia, RDF triples
10/29/2010
101
Confidential Information
Wikipedia InfoBoxes
10/29/2010
102
Confidential Information
Implicit tables
103 10/29/2010
103
Confidential Information
Linked Open Data Cloud
2008
2008 2008
2007
2008
10/29/2010
104
2009
Confidential Information
2009
LOD2 turns linked data into knowledge interlink SILK
create
DXX Engine
merge
poolparty SemMF
OntoWiki
2007
Sigma
20082008 WiQA2008
ORE
Virtouso
verify DL-Learner
2008
2009
classify
MonetDB Sindice
enrich 105 Confidential Information
Challenge : searching ad hoc, idiosyncratic spaces
http://www.imrc.kist.re.kr/wiki/LifeLog
10/29/2010
106
Confidential Information
Microsoft SenseCam
http://research.microsoft.com/enus/um/cambridge/projects/sensecam/
http://justgetthere.us/blog/plugin/tag/cameras
10/29/2010
107
Confidential Information
http://gadgets.infoniac.com/head-worncamera-for-special-events.html 10/29/2010
108
Confidential Information
Bike My Way
10/29/2010
109
Confidential Information
Insta Mapper
10/29/2010
110
Confidential Information
Challenge : searching ad hoc, idiosyncratic spaces
Hui Yang and Jamie Callan. A Metric-based Framework for Automatic Taxonomy Induction. In Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics (ACL2009), Singapore. Aug 2-7, 2009
http://www.imrc.kist.re.kr/wiki/LifeLog
10/29/2010
111
Confidential Information
Conclusions Semantics What is this? Are these the same? Who does what to whom?
Search Engines now handle rich semantics Search based applications, between DB and IR
Semantic Research Successful transfer to industry
Near Future Integration crowd source knowledge Wikipedia Linked open data
10/29/2010
112
Confidential Information
Thank you
10/29/2010
113
Confidential Information