19.6.2012
Text Stream Processing
Artificial Intelligence Laboratory, Jožef Stefan Institute, Ljubljana, Slovenia
Dunja Mladenić, Marko Grobelnik, Blaž Fortuna, Delia Rusu
ailab.ijs.si
Introduction to text streams
• What are text streams
• Properties of text streams
• Motivation
• Pre-processing of text streams
• Text quality
Text stream processing
• Topic detection
• Entity, event and fact extraction and resolution
• Word sense disambiguation
• Summarization
• Sentiment analysis
• Social network analysis
Concluding remarks
• Key literature overview
• Further publicly available tools
• Conclusions
• Questions and discussion
Introduction to Text Streams
• What are text streams
• Properties of text streams
• Motivation
• Pre-processing of text streams
• Text quality
What are data streams
• Continuously arriving data, usually in real time
• Dealing with streams is often easy, but it gets hard when the data stream is intensive and complex operations on the data are required
In such situations, usually:
• the volume of data is too big to be stored
• the data can be scanned thoroughly only once
• the data is highly non-stationary (its properties change over time), so approximation and adaptation are key to success
Therefore, a typical solution is not to store the observed data explicitly, but to keep it in an aggregate form that still allows the required operations to be executed
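To illustrate what keeping an aggregate instead of the raw data can look like, here is a minimal sketch. It is not taken from the slides: the class name `RunningStats` and the choice of Welford's one-pass mean/variance update are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Aggregate summary of a stream: count, mean and variance in O(1) memory."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        # Welford's one-pass update: each item is seen once and then discarded.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Population variance of everything observed so far.
        return self.m2 / self.count if self.count else 0.0


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # stand-in for a live stream
    stats.update(value)
```

After the loop, `stats.mean` and `stats.variance` describe the whole stream although no individual value was stored.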
Stream processing
Who works with real-time data processing?
• “Stream Mining” (a subfield of “Data Mining”) deals with mining data streams in different scenarios, in relation to machine learning and databases: http://en.wikipedia.org/wiki/Data_stream_mining
• “Complex Event Processing” is a research area concerned with discovering complex events from simple ones by inference, statistics, etc.: http://en.wikipedia.org/wiki/Complex_Event_Processing
Motivation for stream processing
Why would one need (near) real-time information processing?
Because time and reaction speed correlate with many target quantities, e.g.:
• on the stock exchange, with earnings
• in controlling, with quality of service
• in fraud detection, with safety
Generally, we can say: Reaction Speed == Value. If our systems react fast, we create new value!
What are text streams
• A continuous, often rapid, ordered sequence of texts
• Text information arriving continuously over time in the form of a data stream
• News and similar regular reports: news articles, online comments on news, online traffic reports, internal company reports, web searches, scientific papers, patents
• Social media: discussion forums (e.g. Twitter, Facebook), short messages on phones or computers, chat, transcripts of phone conversations, blogs, e-mails
Demo: http://newsfeed.ijs.si
NewsFeed
Properties of text streams
• Produced at a high rate over time
• Can be read only once or a small number of times (due to the rate and/or overall volume)
• Challenging for computing and storage capabilities: approaches must be efficient and scalable
• Strong temporal dimension
• Modularity over time and sources (topic, sentiment, …)
Example task: evolution of research topics and communities over time
• Based on time-stamped research publication titles and authors
• Observe which topics/communities shrank, which emerged and which split over time, and when the turning points occurred
TimeFall: monitoring dynamic, temporally evolving graphs and streams based on Minimum Description Length (MDL)
• Find good cut-points in time and stitch together the communities: a good cut-point leads to a shorter description length
• Fast and efficient incremental algorithm; scales to large datasets and is easily parallelizable
Example task: evolution of research topics and communities over time
• Given: n time-stamped events (e.g. papers), each related to several of m items (e.g. title words and/or author names)
• Find cluster patterns and summarize their evolution in time
[Figure: panels 1 to 5; axes: Papers, Words, Time (1990, 1991, 1992), Word Clusters]
TimeFall on 12 million medical publications from PubMed MEDLINE over 40 years
• Scales linearly with the product of the number of initial time point blocks and the number of nonzeros in the matrix
J. Ferlez, C. Faloutsos, J. Leskovec, D. Mladenic, M. Grobelnik. Monitoring Network Evolution using MDL. International Conference on Data Engineering (ICDE 2008).
Pre-processing text streams
Basic text pre-processing
• Removing stop words, applying stemming
Representing text for internal processing
• Splitting into units (e.g. sentences or words)
• Mapping to an internal representation (e.g. feature vectors of words, vectors of ontology concepts)
Pre-processing for aligning/merging text streams
• Time-wise alignment of multiple text streams: coordinated text streams (appearing over the same time window, e.g. news)
• Content alignment, possibly across different languages
Example: stop words
The city hosts a great number of religious buildings, many of them dating back to medieval times.
Example: stemming
city hosts great number religious buildings, many them dating back medieval times.
Stems: hosts → host, religious → religi, buildings → build, dating → date, medieval → mediev, times → time
Example: splitting into units of words
city host great number religi build, many them date back mediev time.
Feature vector of words: (city, host, great, number, religi, build, many, them, date, back, mediev, time)
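The three steps of the running example (stop-word removal, stemming, mapping to a word-based feature vector) can be sketched as follows. This is only an illustration: the stop-word list contains just the words dropped in the example, and `TOY_STEMS` is a hypothetical lookup table standing in for a real stemmer such as Porter's.

```python
import re
from collections import Counter

# Assumed stop-word list: just the words removed in the slide example.
STOP_WORDS = {"the", "a", "of", "to"}

# Hypothetical lookup standing in for a real stemmer (e.g. Porter);
# it only covers the words of the example sentence.
TOY_STEMS = {
    "hosts": "host", "religious": "religi", "buildings": "build",
    "dating": "date", "medieval": "mediev", "times": "time",
}


def preprocess(text: str) -> list[str]:
    """Lower-case, split into words, drop stop words, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [TOY_STEMS.get(w, w) for w in words if w not in STOP_WORDS]


sentence = ("The city hosts a great number of religious buildings, "
            "many of them dating back to medieval times.")
tokens = preprocess(sentence)
feature_vector = Counter(tokens)  # sparse bag-of-words: word -> count
```

On a real stream the same function would be applied to each arriving text, with the lookup table replaced by an actual stemming algorithm.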
Text Quality
Factors:
• Vocabulary use
• Grammatical and fluent sentences
• Structure and coherence
• Non-redundant information
• Referential clarity, e.g. proper usage of pronouns
Models of text quality
• Global coherence: overall document organization
• Local coherence: adjacent sentences
• Language-model-based approaches
Text Stream Processing
[Diagram components: Web, Web Crawler, Text Pre-Processing, Topic Detection, Information Extraction, Word Sense Disambiguation, Summarization, Sentiment Analysis, Social Network Analysis, Text Stream Processing Results]
Topic Detection
[Figure: example with the topics Religion and Art]
Topic Detection
Supervised techniques
• The data is labeled with predefined topics
• Machine learning algorithms are used to predict the labels of unseen data
Unsupervised techniques
• Identify patterns and structure within the dataset
• Clustering: grouping data sharing similar topics
• Statistical methods: probabilistic topic modeling
Probabilistic Topic Modeling
• Topic: a probability distribution over words in a fixed vocabulary
• Given an input corpus containing a number of documents, each consisting of a sequence of words, the goal is to find useful sets of topics
Latent Dirichlet Allocation
• Documents can have multiple topics (in the example: Religion, Art)
D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, January 2003.
LDA Generative Process
• A topic is a distribution over words
• A document is a mixture of topics (defined at the level of the corpus)
• Each word is drawn from one of the corpus-level topics
For each document, generate the words:
1. Randomly choose a distribution over the topics
2. For each word in the document:
   a) Randomly choose a topic from the distribution over topics (step 1)
   b) Randomly choose a word from the corresponding distribution over the vocabulary
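The generative story above can be simulated in a few lines. The sketch below is illustrative only: the two toy topics, their probabilities and the Dirichlet parameter `alpha` are assumptions, and the code generates a document from given topics rather than inferring topics from data.

```python
import random


def sample_dirichlet(alpha: list[float], rng: random.Random) -> list[float]:
    """Draw from a Dirichlet distribution by normalizing gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, n_words, alpha, rng):
    """Follow the generative story: topic mixture, then (topic, word) per position."""
    # Step 1: randomly choose this document's distribution over topics.
    theta = sample_dirichlet(alpha, rng)
    names = list(topics)
    words = []
    for _ in range(n_words):
        # Step 2a: choose a topic from the document's topic distribution.
        topic = rng.choices(names, weights=theta)[0]
        # Step 2b: choose a word from that topic's distribution over the vocabulary.
        vocab = list(topics[topic])
        probs = list(topics[topic].values())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words


# Toy corpus-level topics; the words and probabilities are illustrative only.
TOPICS = {
    "religion": {"religious": 0.5, "monastery": 0.25, "church": 0.25},
    "art": {"art": 0.4, "painter": 0.4, "sculpture": 0.2},
}

rng = random.Random(0)
theta, doc = generate_document(TOPICS, n_words=10, alpha=[0.5, 0.5], rng=rng)
```

Fitting LDA means running this process in reverse: given only the words, estimate the topics and the per-document mixtures.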
LDA Generative Process (example: Craiova guide)
1. Assume a number of topics for the document collection
2. Choose a distribution over the topics
3. For each word: choose a topic assignment, then choose the word from the topic
Example word probabilities within the topics: religious 0.03, monastery 0.01, church 0.01, art 0.02, painter 0.02, sculpture 0.01, park 0.01, garden 0.01
Topic Models - Extensions
Hierarchical Topic Models
• D. Blei, T. Griffiths, and M. Jordan. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, 57(2):1–30, 2010.
• Q. Ho, J. Eisenstein, E. P. Xing. Document Hierarchies from Text and Links. WWW 2012.
Dynamic Topic Models
• D. Blei and J. Lafferty. Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning, 2006.
Topic Detection in Streams
• Unsupervised methods: simpler approaches (e.g. clustering) and probabilistic topic models
• Challenging because of the amount and dynamics of the data
• E.g. online inference for LDA, which fits a topic model to random Wikipedia articles
Topic Detection Tools
Available implementations
• LDA, HLDA, …: http://www.cs.princeton.edu/~blei/topicmodeling.html
• Mallet, a toolkit for statistical NLP: http://mallet.cs.umass.edu/
Clustering on text streams
• Grouping similar documents while adjusting to changes in the topics over time
• Clusters are generated as the data arrives and are stored in a tree
• A new example is added by adjusting the whole path from the root to the leaf node that receives it, which may involve adding, removing, splitting and merging clusters
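As a much simplified illustration of clustering documents as they arrive, the sketch below keeps a flat list of cluster centroids instead of the tree described above, and never splits, merges or removes clusters. The class name `OnlineClusterer` and the similarity threshold are assumptions made for the example.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class OnlineClusterer:
    """Single-pass clustering: each document is seen once, in arrival order."""

    def __init__(self, threshold: float = 0.3):
        self.threshold = threshold  # assumed similarity cut-off, not from the slides
        self.centroids: list[Counter] = []  # summed word counts per cluster

    def add(self, tokens: list[str]) -> int:
        doc = Counter(tokens)
        scores = [cosine(doc, c) for c in self.centroids]
        if scores and max(scores) >= self.threshold:
            best = scores.index(max(scores))
            self.centroids[best].update(doc)  # adapt the cluster to the new document
        else:
            self.centroids.append(doc)  # open a new cluster for a new topic
            best = len(self.centroids) - 1
        return best


clusterer = OnlineClusterer()
stream = [
    ["church", "monastery", "religious"],
    ["painter", "sculpture", "art"],
    ["church", "religious", "building"],
    ["art", "painter", "gallery"],
]
labels = [clusterer.add(doc) for doc in stream]
```

A tree-based variant would apply the same assign-or-create decision at every level on the path from the root to a leaf.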
Clustering on Reuters V1 news (colors showing predefined topics)
B. Novak, Algorithm for identifying topics in text streams, 2008
Topic Detection - DEMOS A 100-topic browser of the dynamic topic model fit to Science (1882-2001) http://topics.cs.princeton.edu/Science/
Browsing search results http://searchpoint.ijs.si/
[Screenshot: 100-topic browser, Science (1882-2001), with timeline marks at 1890, 1940 and 2000]
Search Point
Entity Extraction
• A subtask of information extraction
• Identifying elements in text which belong to a predefined group of things:
  • Names of people, locations, organizations (most common)
  • Time expressions, quantities, money amounts, percentages
  • Gene and protein names
  • Etc.
Entity Extraction Approaches
Lists of entities (gazetteers) and grammar rules
• E.g. GATE (General Architecture for Text Engineering)
• H. Cunningham, et al. Text Processing with GATE (Version 6). University of Sheffield Department of Computer Science. 15 April 2011.
Statistical models
• E.g. Stanford NER: linear-chain Conditional Random Field (CRF) sequence models
• J. R. Finkel, T. Grenager, and C. Manning. Incorporating Nonlocal Information into Information Extraction Systems by Gibbs Sampling. In ACL 2005, pp. 363-370.
Collective Entity Resolution
Entity resolution: discover entities and map them to corresponding references (e.g. from a database, knowledge base, etc.)
Approaches:
• Pairwise similarity using the attributes of references
• Relational clustering using both attribute and relational information
  I. Bhattacharya, L. Getoor. Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data (TKDD), 2007.
• Topic models for the context of every word in a knowledge base
  P. Sen. Collective Context-Aware Topic Models for Entity Disambiguation. WWW 2012.
Entity Resolution to Linked Data
Enhance named entity classification using Linked Data features
Y. Ni, L. Zhang, Z. Qiu, C. Wang. Enhancing the Open-Domain Classification of Named Entity using Linked Open Data. ISWC, 2010.
• Build a type knowledge base of (name string, type) pairs from Linked Open Data (LOD), e.g. from the triple (dbpedia:Craiova, rdf:type, Place) derive (Craiova, Place)
• Use WordNet as an intermediate taxonomy to compute the similarity between the LOD type and the target type
Entity Resolution to Linked Data
Finding all possible forms under which an entity can occur in text
• Resource descriptions: the most useful are rdfs:label and foaf:name
• Redirect relationship
• Given (entity1, type1) and (entity2, ?), where entity1 has URI1, entity2 has URI2 and URI1 owl:sameAs URI2, conclude (entity2, type1)
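The owl:sameAs inference above can be sketched as a small propagation loop. This is an illustration under assumptions: the function name `propagate_types` is made up, and the second URI is a placeholder rather than a real Linked Data identifier.

```python
# Known (URI -> type) pairs, e.g. derived from rdf:type triples.
types = {"dbpedia:Craiova": "Place"}

# owl:sameAs links between URIs; the second URI is a made-up placeholder.
same_as = [("dbpedia:Craiova", "example:Craiova_city")]


def propagate_types(types: dict[str, str],
                    same_as: list[tuple[str, str]]) -> dict[str, str]:
    """Copy a known type across owl:sameAs links until nothing changes."""
    result = dict(types)
    changed = True
    while changed:
        changed = False
        for left, right in same_as:
            # owl:sameAs is symmetric, so propagate in both directions.
            for src, dst in ((left, right), (right, left)):
                if src in result and dst not in result:
                    result[dst] = result[src]
                    changed = True
    return result


resolved = propagate_types(types, same_as)
```

Repeating the loop until no change occurs also covers chains of sameAs links, where a type has to travel across several URIs.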
Relation Extraction
Identifying relationships between entities (and, more generally, phrases)
Traditional relation extraction
• The target relation is given, together with corresponding extraction patterns for the relation
• A specific corpus
Open relation extraction (and, more generally, open information extraction)
• Diverse relations, not previously fixed
• Corpus: the Web
M. Banko, O. Etzioni. The Tradeoffs Between Open and Traditional Relation Extraction. ACL, 2008.
Identifying Relations for Open IE
3-step method:
• Label: automatically label sentences with extractions (arg1, relation phrase, arg2)
• Learn: learn a relation phrase extractor (e.g. using a CRF)
• Extract: given a sentence, identify (arg1, arg2) and the relation phrase (based on the learned relation extractor)
Examples
• TextRunner: M. Banko, M. Cafarella, S. Soderland, M. Broadhead, O. Etzioni. Open Information Extraction from the Web. IJCAI 2007.
• WOE: F. Wu, D.S. Weld. Open Information Extraction using Wikipedia. ACL 2010.
Identifying Relations for Open IE: REVERB
• Input: POS-tagged and NP-chunked sentence
• Identify relation phrases using syntactic and lexical constraints
• Find a pair of NP arguments for each relation phrase and assign a confidence score (logistic regression classifier)
• Output: (x, y, z) extraction triples
A. Fader, S. Soderland, and O. Etzioni. Identifying Relations for Open Information Extraction. EMNLP 2011.
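A syntactic constraint on relation phrases can be expressed as a pattern over POS tags. The sketch below encodes the pattern V | V P | V W* P from the REVERB paper in simplified form; the mapping from Penn Treebank tags to one-letter classes, the function names and the example sentence are assumptions, and the lexical constraint, argument finding and confidence scoring are all left out.

```python
import re


def coarse(tag: str) -> str:
    """Map a Penn Treebank POS tag (assumed tag set) to a one-letter class."""
    if tag == "RP":
        return "r"  # particle
    if tag == "TO":
        return "t"  # infinitive marker
    if tag == "IN":
        return "i"  # preposition
    if tag == "DT":
        return "d"  # determiner
    for prefix, letter in (("VB", "v"), ("RB", "a"), ("NN", "n"),
                           ("JJ", "j"), ("PRP", "p")):
        if tag.startswith(prefix):
            return letter
    return "o"  # anything else


# V | V P | V W* P with V = verb particle? adverb?, W = noun/adj/adv/pron/det,
# P = preposition/particle/infinitive marker.
RELATION = re.compile(r"vr?a?(?:[njapd]*[irt])?")


def relation_phrases(tagged: list[tuple[str, str]]) -> list[str]:
    """Return the word spans whose POS sequence matches the pattern."""
    classes = "".join(coarse(tag) for _, tag in tagged)
    return [" ".join(word for word, _ in tagged[m.start():m.end()])
            for m in RELATION.finditer(classes)]


tagged = [("Faust", "NNP"), ("made", "VBD"), ("a", "DT"), ("deal", "NN"),
          ("with", "IN"), ("the", "DT"), ("devil", "NN")]
phrases = relation_phrases(tagged)
```

Because the pattern is matched greedily, the whole phrase "made a deal with" is taken as the relation, so "a deal" is not mistaken for an argument.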
Identifying Relations for Open IE
Key points:
• Relation phrases are identified holistically, as opposed to word by word
• Potential phrases are filtered based on statistics (lexical constraints)
• Relation first, as opposed to arguments first: the relation phrase is not confused with the arguments (e.g. “made a deal with”)
DEMO: http://openie.cs.washington.edu/#
REVERB
NELL: Never Ending Language Learning
Addressed tasks
• Reading task: read the Web and extract a knowledge base of structured facts and knowledge
• Learning task: improve (and update) reading over time, so that past information is extracted more accurately
A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E.R. Hruschka Jr. and T.M. Mitchell. Toward an Architecture for Never-Ending Language Learning. AAAI, 2010.
Never Ending Language Learning
[Figure from A. Carlson et al. 2010]
Never Ending Language Learning
• Coupled Pattern Learner (CPL): extracts instances of categories and relations (using contextual patterns)
• Coupled SEAL (CSEAL): queries the Web with beliefs from each category or relation, and mines lists and tables to extract new instances
• Coupled Morphological Classifier (CMC): one regression model per category; classifies noun phrases
• Rule Learner (RL): infers new relation instances
DEMO: http://rtw.ml.cmu.edu/rtw/
NELL
Domain and Summary Templates
Domain templates
• Event-centric: the focus is on events described with verbs
• E. Filatova, V. Hatzivassiloglou, K. McKeown. Automatic creation of domain templates. In Proceedings of COLING/ACL 2006.
Summary templates
• Entity-centric: the focus is on summarizing entity categories
• P. Li, J. Jiang, Y. Wang. Generating templates of entity summaries with an entity-aspect model and pattern mining. ACL 2010.
Domain Templates
• A domain is a set of events of a particular type, e.g. presidential elections, football championships
• Domains can be instantiated: instances of events of a particular type, e.g. Euro Championship 2012
• Different levels of granularity; hierarchical structure for domains
• Template: a set of attribute-value pairs, where the attributes specify functional roles characteristic of the domain events
Domain Templates
Use a corpus describing instances of events within a domain and learn the domain templates (general characteristics of the domain):
• The verbs are used as a starting point: estimate the importance of each verb given the domain
• The sentences containing the top X verbs are parsed
• The most frequent subtrees are kept (FREQuent Tree miner)
• The named entities are substituted with more generic constructs, e.g. POS tags
• The frequent subtrees are merged together
Domain Templates
E.g. terrorist attack domain
• Important verbs: killed, told, found, injured, reported, happened, blamed, arrested, died, linked
• Frequent subtrees:
  (VP(ADVP(NP))(VBD killed)(NP(CD 34)(NNS people)))
  (VP(ADVP)(VBD killed)(NP(CD 34)(NNS people)))
• Merging subtrees: (VBD killed)(NP(NUMBER)(NNS people))
Summary Templates
• Starting point: a collection of entity summaries for a given entity category
• Goal: to obtain a summary template for the entity category
E.g. the physicist category:
• ENTITY received his phd from ? university
• ENTITY studied ? under ?
• ENTITY earned his ? in physics from university of ?
• ENTITY was awarded the medal in ?
• ENTITY won the ? award
• ENTITY received the Nobel prize in physics in ?
Summary Templates
• Identify subtopics (aspects) of the summary collection using LDA (see Topic Detection)
• Each word is labeled as one of: a stop word, a background word, a document word, an aspect word
Summary Templates
• Sentence patterns are generated for each aspect by frequent subtree pattern mining
• Fixed structure of a sentence pattern: aspect words, background words, stop words
• Template slots, which vary between documents: document words
Summary Templates
Sentence pattern generation:
1. Locate subject entities (using heuristics), e.g. pronouns in a biography
2. Generate parse trees (using the Stanford Parser) and label the stop, background, aspect, document and entity words given by the topic model
3. Mine frequent subtree patterns (using FREQT)
4. Prune patterns without entities or aspect words
5. Convert subtree patterns to sentence patterns (find the sentences that generated the pattern)
Word sense disambiguation
Identifying the meaning of words in context
• Supervised WSD: requires words labeled with their senses; a classification task
• Unsupervised WSD: known as word sense induction; a clustering task
• Knowledge-based WSD: relies on knowledge resources such as WordNet, Wikipedia, OpenCyc, etc.
R. Navigli. Word sense disambiguation: A survey. ACM Computing Surveys, 41(2), 2009.
Word Sense Disambiguation
S.P. Ponzetto and R. Navigli. Knowledge-rich Word Sense Disambiguation Rivaling Supervised Systems. ACL 2010.
• Extend WordNet with Wikipedia relations
• Apply simple knowledge-based approaches
• Performance was comparable to state-of-the-art supervised approaches
WSD Evaluation
Evaluation workshops: Senseval, SemEval, …
WSD evaluation topics (SemEval 2010)
• Cross-lingual WSD
• WSD on a specific domain
• Word sense induction
• Disambiguating sentiment-ambiguous adjectives
Evaluation topics related to WSD (SemEval 2012)
• Semantic textual similarity: similarity between two sentences
• Relational similarity: similarity between pairs of words
Summarization
Extractive
• Identifying relevant sentences that belong to the summary
Abstractive
• Identifying/paraphrasing sections of the document to be summarized
• E.g. summarization as phrase extraction: K. Woodsend, M. Lapata. Automatic Generation of Story Highlights. ACL, 2010.
  • Joint content selection and compression model
  • ILP model to determine the phrases that form the highlights
Summarization Evaluation
• Several evaluation workshops: Document Understanding Conferences (DUC), Text Analysis Conferences (TAC)
• Metrics: ROUGE (n-gram based)
• Linguistic quality: grammaticality, non-redundancy, referential clarity, focus, structure and coherence
E. Pitler, A. Louis, A. Nenkova. Automatic Evaluation of Linguistic Quality in Multi-Document Summarization. ACL 2010.
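To make the n-gram basis of ROUGE concrete, here is a minimal sketch of ROUGE-N recall for a single reference summary. It is a simplification: the function names and example sentences are assumptions, tokenization is plain whitespace splitting, and the official toolkit adds options (stemming, multiple references, precision and F-scores) that are not shown.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate: str, reference: str, n: int = 1) -> float:
    """Fraction of reference n-grams that also occur in the candidate (clipped counts)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0


reference = "the city hosts many religious buildings"
candidate = "the city has many buildings"
score_1 = rouge_n_recall(candidate, reference, n=1)
score_2 = rouge_n_recall(candidate, reference, n=2)
```

Here four of the six reference unigrams and one of the five reference bigrams are covered, which shows how quickly the score drops as n grows.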
Sentiment analysis
Broad sense: sentiment analysis ~ opinion mining, the “computational treatment of opinion, sentiment, and subjectivity in text” (B. Pang, L. Lee, 2008)
Surveys, book chapters:
• B. Pang, L. Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2(1-2), pp. 1–135, 2008.
• B. Liu. Sentiment Analysis and Subjectivity. Handbook of Natural Language Processing, Second Edition (editors: N. Indurkhya and F. J. Damerau), 2010.
• B. Liu. Web Data Mining - Exploring Hyperlinks, Contents and Usage Data, Ch. 11: Opinion Mining, Second Edition, Springer, 2011.
Interactive Approach to Sentiment Analysis
http://aidemo.ijs.si/render/index.html
[Screenshot of the interface, with labeled parts: model and task selection; query; model-based filters; examples retrieved by uncertainty or class-margin sampling; query result, grouped by predicted label]
T. Stajner and I. Novalija. Managing Diversity through Social Media. ESWC 2012 Workshop on Common value management.
Architecture
• Indexed documents
• Active learning for modeling topic and sentiment (diversity analysis)
• Interactive user interface
• Example: a stream of social media posts relevant to brand management
Social Network Analysis
• Modeling social relationships
• Network theory concepts:
  • Nodes: individuals within the network
  • Edges: relationships between individuals
Mario Karlovčec, Dunja Mladenić, Marko Grobelnik, Mitja Jermol. Visualizations of Business and Research Collaboration in Slovenia. Proc. of the Information Technology Interfaces 2012.
Influence and Passivity in Social Media
• The majority of Twitter users are passive information consumers: they do not forward content to the network
• Influence and passivity are based on information forwarding activity
• Passivity: based on the user's retweeting rate and the audience retweeting rate; it reflects how difficult it is for other users to influence the user
Algorithm ~ HITS
• Passivity score ~ authority score. Most passive: robot users, who follow many users but retweet only a small percentage
• Influence score ~ hub score. Most influential: news services, which post many links that are forwarded by other users
D.M. Romero, W. Galuba, S. Asur, and B.A. Huberman. Influence and Passivity in Social Media. ECML PKDD, 2011.
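Since the algorithm is described by analogy with HITS, a sketch of plain HITS may help to see the mutual reinforcement between the two scores. This is standard HITS by power iteration on an unweighted graph, not the weighted influence-passivity algorithm of the paper; the function name, the toy graph and its node names are assumptions.

```python
def hits(edges: list[tuple[str, str]], iterations: int = 50) -> tuple[dict, dict]:
    """Plain HITS by power iteration; returns (hub, authority) scores."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of the hub scores of the nodes pointing at you.
        auth = {n: sum(hub[s] for s, t in edges if t == n) for n in nodes}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub: sum of the authority scores of the nodes you point at.
        hub = {n: sum(auth[t] for s, t in edges if s == n) for n in nodes}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth


# Hypothetical graph: an edge (a, b) means "a links to b".
edges = [("news", "alice"), ("news", "bob"), ("news", "carol"), ("alice", "bob")]
hub, auth = hits(edges)
```

In this toy graph the node with the most outgoing links gets the top hub score and the node with the most incoming links gets the top authority score; the paper's algorithm couples its two scores in the same iterative way, with edge weights derived from retweeting behavior.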
Key literature
Further publicly available tools
Topic detection
• David Blei’s homepage: http://www.cs.princeton.edu/~blei/topicmodeling.html
• Mallet: http://mallet.cs.umass.edu/
Natural language toolkits
• GATE: http://gate.ac.uk/
• OpenNLP: http://opennlp.apache.org/
• NLTK: http://nltk.org/
Entity extraction
• Stanford NER: http://nlp.stanford.edu/ner/index.shtml
Relation extraction
• NELL: http://rtw.ml.cmu.edu/rtw/
• REVERB: http://openie.cs.washington.edu/
WSD
• WordNet::SenseRelate: http://senserelate.sourceforge.net/
Conclusions
• Dealing with streams is often easy, but it gets hard when the data stream is intensive and complex operations on the data are required
• Topic detection: online inference (e.g. for LDA) is currently a new direction
• Entity, relationship and template extraction, sentiment analysis and social network analysis are already applied to streams
• Word sense disambiguation: complex knowledge bases (e.g. WordNet + Wikipedia) coupled with simple disambiguation algorithms work well
• Summarization: abstractive summaries are better suited to text streams
Questions and discussion