Mining Semantic Associations from Unstructured Text

Aditya Ramana Rachakonda · Srinath Srinivasa · Sumant Kulkarni · M.S. Srinivasan
Abstract Mining latent semantics from text corpora is an important problem with several applications. Models that explain how semantic associations get embedded into natural language are of immense interest. In this work, we take a cognitive modeling approach to this problem. The text corpus is viewed as a "cognitive space," and each document is treated as a cognitive "episode" whose objective is to convey some meaning within some context. Communication of semantics is proposed to take place in a 3-layer model. The top-most layer, called the analytic layer, maintains the semantic worldview of the communicator. Communication happens in the context of an episode, where semantic associations from the analytic layer are combined with the episodic objectives to create conceptual sentences. This is performed at the second layer, called the episodic layer. The last layer, called the linguistic layer, provides language-specific vocabularies and inflections that convert conceptual sequences from the episodic layer into communication units like speech utterances, gestures or written text. Mining latent semantics then becomes the problem of inferring semantic-layer associations by observing linguistic-layer recordings like text. This is addressed by proposing one or more episodic hypotheses on patterns of co-occurrences of terms in the linguistic layer. An episodic hypothesis builds on observational patterns and generalizes observations to semantic associations. A suite of semantic associations is described in the paper, and each is reliably mined from text using its corresponding episodic hypothesis.
Keywords latent semantics, cognitive models, co-occurrence
M.S. Srinivasan is affiliated with IBM India Software Lab and with IIIT Bangalore as a part-time research scholar.
Open Systems Lab, International Institute of Information Technology - Bangalore, 26/C, Electronics City, Hosur Road, Bangalore, India 560100.
E-mail: {aditya.ramana, sri, sumant}@iiitb.ac.in, [email protected]
1 Introduction
Extracting inherent meaning, or latent semantics, from human-generated text corpora, such as collections of news articles, emails and blog posts, is an important problem with several application areas. A practical approach to acquiring such latent semantics is to observe patterns of term distribution. Term distributions in human-generated text are not independent of each other, and sets of semantically related terms tend to occur together (Sahlgren, 2006). In linguistics, this correlation is termed the distributional hypothesis (Rubenstein and Goodenough, 1965), and is stated as: "words which are similar in meaning occur in similar contexts." Due to this correlation, observing occurrence and co-occurrence patterns of terms has become the primary mode of gathering latent lexical semantics from unstructured text.

Mining latent semantics based on co-occurrences has been receiving research attention for several years now. Research towards this end can be broadly viewed along the following schools of thought: (a) co-occurrence graph mining; (b) dimensionality reduction; and (c) generative models. These algorithms are sometimes enhanced with techniques derived from machine learning, like supervised or unsupervised learning. Section 2 provides a brief overview of some algorithms using these approaches.

Despite such diversity, our philosophical understanding of how semantics is embedded in text remains rather rudimentary. While there seems to be a general consensus that semantics can be discerned from co-occurrence patterns, to the best of our knowledge there are no conceptual models describing how specific kinds of semantic associations lead to specific forms of co-occurrences. In this work, we try to answer this question.

We approach the problem by viewing the document corpus as a cognitive space, rather than a collection of text. A cognitive space is a space where "meaning," rather than "information," is exchanged. The building blocks of a cognitive space are concepts rather than terms. Terms are basically language-specific representations of concepts. While terms are visible to us from the corpus, what they actually represent are concepts that are "latent" in the communicators' minds. Each document in the corpus is seen as a cognitive episode where a person or a set of persons is trying to communicate some meaning. Meaning is communicated by referring to associations contained in our semantic memory. In cognitive science, semantic memory is defined as that part of our brain that stores general-knowledge associations across concepts, collectively forming our world-view. Thus, mining semantics from a document corpus gives us insights into the shared world-view of all the participants in the corpus. Such semantic associations are also called analytic semantics.

Our problem is now one of discerning analytic semantics between concepts by observing co-occurrences across terms, which are basically linguistic handles for concepts. To make this connection from the term space to the conceptual space, we introduce a notion called episodic hypotheses. An episodic hypothesis makes an assertion about the kinds of co-occurrence patterns that can be observed across episodes when a particular semantic association is being communicated. For this reason, episodic hypotheses are also called observational hypotheses, and we shall be using these terms interchangeably in this paper.
An example of an analytic semantic association and its episodic hypothesis is as follows. Consider a semantic association called "topic," relating a concept with a set of concepts. Given a set of concepts represented by the terms {Federer, Nadal, Wimbledon}, the concept Tennis is usually seen as their "topic." The "topic" relationship at the analytic level is defined with a notion of "aboutness": the set of concepts {Federer, Nadal, Wimbledon} are collectively about the concept represented by the term Tennis (and possibly about other concepts too). The topic for a given set of concepts is hence the concept whose hypothetical "aboutness" is maximized in the analytic layer, when the given set of concepts is considered as one unit.

How do we know how the "aboutness" relationship is distributed in the analytic layer? To answer this, we propose an episodic hypothesis about the way in which terms co-occur that reflects the nature of the aboutness relationship. The hypothesis proposed in this paper is: if a concept represented by term t is a "topic" for a set of concepts represented by the terms in a set U, then the more terms of U an episode mentions, the more likely it becomes that the episode will also mention the term t as the length of the episode increases. In other words, the hypothesis asserts that it is very unlikely to have long, semantically meaningful episodes (documents, conversations, etc.) mentioning the terms Federer, Nadal and Wimbledon without also mentioning Tennis. Essentially: to make a meaningful document or episode about a topic, we have to mention the topic itself at some point or the other. Such episodic hypotheses form the core of our model. A hypothesis is then tested on a co-occurrence graph by reducing it to some variant of a graph mining algorithm, which is detailed later in this paper.

The contributions of this paper can be summarized as follows. We propose a 3-layer cognitive model to help us reduce semantic associations to observable patterns of co-occurrences. We also propose four different semantic associations, namely topical anchors, semantic siblings, topical markers and topic expansion, and provide corresponding hypotheses to mine each of the associations from a textual corpus. The latent semantics described here are associations across concepts representing a collective world-view. This is characteristically different from several related works in topic modeling, which associate latent semantics with documents rather than terms. Our work does not provide richer semantics about a given document, but provides elements of the shared world-view, given a corpus of documents. The semantics extracted using this model can be used as features of a machine learning model to solve specific problems.

The underpinnings of the proposed model draw upon relevant concepts from Analytic Philosophy and Cognitive Psychology to understand the notion of "meaning" and the way our brain is thought to understand meaning. Both of these are detailed later in the paper.

The rest of the paper is organized as follows. Section 2 briefly describes the related literature in mining latent semantics. Section 3 describes the philosophical underpinnings upon which the 3-layer model (Section 4) is defined. Four different latent semantic associations based on this model are then demonstrated in Section 5, and concluding remarks are noted in Section 6.
2 Related Literature
In this section, we survey the literature in mining latent semantics. The objective here is to provide a realistic backdrop for this work, rather than a comprehensive overview of the literature. For the latter, the interested reader may refer to Berry (2003); Hotho et al (2005); Pang and Lee (2008); Sebastiani (2002).

Co-occurrence Graph Mining
Co-occurrence graph mining is one of the fundamental ways by which the distributional hypothesis can be applied to semantics mining. Semantic associations are generally captured as properties of a specific graph. The co-occurrence graph is generally algorithm-specific, as different co-occurrence graphs tend to serve different purposes.

Widdows and Dorow (2002) and Dorow et al (2005) use lists of nouns from a part-of-speech (POS) tagged corpus to construct a co-occurrence graph. They then use a random-walk based graph clustering algorithm like MCL (see van Dongen (2000)) to identify significant clusters of words, thereby distinguishing between word senses. Instead of a corpus-wide graph, Mihalcea and Tarau (2004) propose a co-occurrence graph of terms inside a document and compute a random-walk centrality of the nodes to identify the terms representing the topic of the document. They also show that the same technique can be used on the sentences of a document to identify the sentence that best summarizes it. Intuitively, the nodes which are central in the graph tend to represent the topic of the context, in this case the document.

There are several algorithms which use a co-occurrence graph consisting of heterogeneous nodes. Ghose et al (2007) use a noun-adjective bipartite co-occurrence graph to determine the sentiments associated with the nouns. A bipartite random walk is used to identify important opinions (adjectives) and important product features (nouns) in a given product corpus. Similarly, Dagan et al (1999) use a noun-verb bipartite co-occurrence graph for speech disambiguation. In this work, based on a well-defined cognitive model and a set of co-occurrence primitives, we propose a slew of semantic association mining algorithms on a homogeneous co-occurrence graph.

Dimensionality Reduction
A significant part of the recent research interest in latent semantics can be attributed to Latent Semantic Analysis (LSA) by Deerwester et al (1990). LSA represents a text corpus as a rectangular term-document matrix, where each document is a vector in a vector space of terms. It then uses singular value decomposition to identify the basis dimensions of the vector space along with their importance. LSA then collapses the vector space by eliminating all but the top k dimensions, so that document vectors which were far apart in the original space come closer in the new space. Such a recomputed space establishes associations between documents and terms beyond what was explicitly captured.
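To make the reduction concrete, the following is a minimal sketch of the LSA pipeline in Python using scikit-learn; the toy corpus and the choice of k = 2 dimensions are illustrative assumptions, not parameters from any work cited here.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus: two tennis documents and one unrelated document.
docs = [
    "federer won the wimbledon final",
    "nadal beat federer at the french open",
    "the shuttle launch was delayed by bad weather",
]

X = CountVectorizer().fit_transform(docs)   # term-document count matrix
svd = TruncatedSVD(n_components=2)          # keep only the top-k dimensions
doc_vectors = svd.fit_transform(X)          # documents in the collapsed space

# Documents sharing few literal terms can move closer in the reduced space.
print(cosine_similarity(doc_vectors).round(2))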
LSA was novel and was able to detect significant semantic associations in standardized English vocabulary tests (see Landauer and Dumais (1997); Turney (2001)), and also found use in automated essay assessment (see Miller (2003)). Landauer and Dumais (1997) used LSA to address philosophical questions of knowledge acquisition, specifically Plato's problem: the question of how humans acquire much more knowledge than the information that they have been exposed to. The authors claim that human semantic memory extracts the meanings of terms from their context of use by capitalizing on the weak interrelations between the terms it has been exposed to, somewhat similar to the SVD reductions in LSA.

Dimension reduction techniques have also been applied to co-occurrence graph mining, by Lund et al (1995) in the Hyperspace Analogue to Language (HAL) and by Rohde et al (2004) in the Correlated Occurrence Analogue to Lexical Semantics (COALS). Both algorithms work on a term-term matrix composed of co-occurrence vectors, instead of document vectors, for mining semantic associations.

Despite its impressive results, LSA and its variants do not have sound mathematical underpinnings for the extracted semantics. This means that, while terms can be semantically related by collapsing dimensions, LSA is not able to assign a label to such associations. In addition, LSA computations are global in nature, involving the entire corpus, which makes incremental recomputation of semantic relatedness difficult. Several research efforts have tried to extend LSA in different directions, as well as explore newer models for capturing latent semantics.

Generative Models
Generative models in semantics started as a mathematically sound extension to LSA. Here, the documents in the corpus are considered to be generated by a mixture of one or more random processes. The challenge is then to estimate the parameters of these latent processes such that their generative term distributions minimize the error with respect to the observed term distributions of the corpus. Hofmann (1999) proposed pLSI, a probabilistic topical mixture model. Here, a document is modeled as comprising a set of topics, where each topic generates terms with a given probability distribution. The topics are the latent (unobserved) variables, whose probabilities of expression have to be determined from the corpus. This is estimated using a maximum likelihood estimation technique, running the Expectation Maximization (EM) algorithm. LSA in itself has been extensively used and extended for different kinds of topic estimation problems; some representative examples include Kim et al (2008); Wei-jiang et al (2009).

An increasingly popular topic model in recent times is Latent Dirichlet Allocation (LDA), first proposed by Blei et al (2003). LDA extends pLSI by proposing a 3-layer hierarchical model, where a document is modeled as generated by a mixture of k (finite) hypothetical topics, and a topic is a probability distribution over all the terms observed in the corpus, based on a Dirichlet prior. Statistical techniques like expectation maximization or Gibbs sampling are applied to invert this process and identify the set of topics which generated the corpus; see Steyvers and Griffiths (2007). LDA has grown in popularity in recent times, with several implementations publicly available and several extensions over the base LDA model. Some examples include Anthes (2010); Deng et al (2011); Griffiths and Steyvers (2004); Shivashankar et al (2011); Vulić et al (2011); Arora and Ravindran (2008); Krestel et al (2009); Pilz and Paaß (2011); Wahabzada et al (2011); Wei and Croft (2006).

Since parameter estimation in generative models is based on iterative optimization, it is only guaranteed to converge to a local optimum. Thus, the result may vary for the same dataset over different runs.
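For comparison with the co-occurrence approach developed in this paper, the following is a brief sketch of LDA-based topic estimation using scikit-learn's implementation; the corpus and the topic count are invented for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "federer nadal wimbledon tennis grand slam",
    "wheat rice barley harvest agriculture",
    "nadal won the tennis match at wimbledon",
]

X = CountVectorizer().fit_transform(docs)
# Different random initializations can converge to different local optima,
# which is the caveat noted above; fixing the seed makes one run repeatable.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)   # per-document topic mixtures

print(doc_topics.round(2))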
In contrast to the current literature on latent semantics, our work is primarily based on cognitive modeling of semantics. Rather than model a document as a mixture of topics, we look at documents as semantic communication, where the author of the document is trying to communicate some meaningful sentences to the reader. We then look into Cognitive Science and the way human memory is organized, to build a model of how the terms in a document get generated.

Other Relevant Literature
There are machine learning approaches to semantics which generalize on training data to discover semantic associations in the supervised case, or look for recurring patterns of associations in the unsupervised case. These are mostly used in conjunction with some of the above techniques. For instance, in Widdows and Dorow (2002), graph clustering is used to identify different word senses. Turney and Littman (2003) combine LSA with classification techniques to determine the "semantic orientation" of a body of text, where orientations are broadly classified as positive (praise) or negative (criticism). Similarly, Zu et al (2003) apply feature transformation techniques based on dimensionality reduction in order to improve the accuracy of automatic text classification.

Another relevant area of work is the literature on ontology discovery from text (see Buitelaar et al (2005)) in the form of semantic relationships across terms. Association rule mining and shallow parsing are some of the main approaches used for mining semantic relationships; see Pradhan et al (2003); Carreras and Màrquez (2005); Màrquez et al (2008); Coppola et al (2008).

Finally, a rich body of literature exists in computational models of cognition that are syncretically relevant to our work. Some suggested reading includes Busemeyer and Diederich (2010); Lamberts and Goldstone (2005); Polk and Seifert (2002). Broadly, computational models of cognition can be divided into connectionist models and rational models; see Chapter 3 in Polk and Seifert (2002). Connectionist models describe cognitive activity in terms of neural connectivity and feedback, while rational models attempt to describe higher levels of cognitive activity with semantic abstractions. It is quite possible that several issues addressed here are also addressed in computational models of cognition, some of which, like Soar, are widely used. The work proposed in this paper is of a much narrower scope than general cognitive modeling, which includes aspects of cognition like goal-setting and strategizing. This work is best seen from the vantage point of bringing cognitive concepts into mining latent semantics from text corpora.
3 Semantics and Cognition
Before presenting the 3-layer model, we look into relevant concepts from Analytic Philosophy and Cognitive Psychology that form the underpinnings of the proposed model.
3.1 Semantics in Analytic Philosophy

The question of what meaning is, and whether concepts exist on their own, has evoked philosophical investigations over several centuries. In the early 20th century, philosophical questions into the nature of what is real were asked in the scientific and mathematical communities. In response to this, Moore
(1905), Russell (1919) and Wittgenstein (1922) developed a school of thought that is now called Analytic Philosophy (AP); see Preston (2006). Analytic Philosophy asserts that several, perhaps infinitely many, concepts exist on their own (are simpliciter). Not only do concepts exist on their own; their characteristic relationships are also simpliciter. This turned philosophical investigations towards exploring the implications of assertions, rather than interpretations of what they could possibly denote. For instance, given a statement like "There is a cat on the table," Analytic Philosophy would not probe into the meaning of terms like "cat" and "table" apart from their commonsense interpretation. Instead, it would perhaps lead to derived implications like "If there is milk on the table, the cat is likely to drink the milk," and so on. Analytic Philosophy found favor among scientific and mathematical communities, since its arguments advocated the existence of an objective external reality independent of the observer.

Several schools of thought emerged from Analytic Philosophy. Among them, of interest to us is Ordinary Language Philosophy (OLP), first developed by Wittgenstein (1953). One of the main arguments of OLP may be termed "meaning is use." The founding principles of AP were based on the notion that concepts exist simpliciter, but AP also assumed that linguistic terms have unambiguous, well-known interpretations of the concepts they represent. Hence, the term "cat" represented the commonsense definition of cat, regardless of where it appeared. OLP challenged such a notion and asserted that the meaning acquired by a term is based on the context in which it is used. In fact, "meaningless" terms can acquire meaning depending on the context of use, as in the meaning acquired by the term "jabberwock" in the following paragraph:

"Everyday I go to work in my jabberwock. My jabberwock is made for urban commuting. It can seat five people and runs on petrol. It gives about 20 kilometers to a liter, making it the most fuel-efficient jabberwock in its class."

We can see that the term "jabberwock" acquires the meaning "car" in the paragraph, based on the other terms appearing in its context, like "commuting," "petrol," "seat" and "fuel." This "meaning is use" argument forms the basis for the distributional hypothesis of linguistics discussed in Section 1.

OLP is consistent with Analytic Philosophy in that concepts are still considered simpliciter. The connectedness of OLP is in the linguistic (term) space: it is the association of terms with concepts that is determined by use, and not the concepts themselves. Hence, in the above example, the meaning of the term "car" is still its well-understood commonsense meaning; only its association with the term "jabberwock" is established by the way "jabberwock" is used in the paragraph.

Given these, the following are the essential takeaways from Analytic Philosophy pertinent to latent semantics in text:

1. There exist several concepts that are simpliciter (that exist on their own) and can be considered basic building blocks of any semantic statement.
2. A concept is represented by one or more linguistic terms when used in communication.
3. Association of terms to concepts is determined by use. The encyclopedic meaning of a term is simply the dominant interpretation acquired by the term by virtue of its use by a population.

Fig. 1 Taxonomy of the brain's long-term memory. Diagram adapted from the web-page: The Brain from Top to Bottom: An interactive website on human brain and behavior, http://thebrain.mcgill.ca/flash/index a.html (last accessed on Oct 22, 2011).
3.2 Semantics in Cognitive Psychology

Cognitive psychology is concerned with the way the human brain interprets the world and derives meanings from experiences. Human memory is understood to be broadly classified into two forms: long-term and short-term memory. Short-term memory stores small amounts of information for ready access and is responsible for cognitive processing on the fly. Long-term memory describes latent knowledge acquired by the brain over time, and plays a central role in model-building, reasoning and problem solving.

Long-term memory is itself of two broad kinds: imperative and declarative memory. Imperative memory is responsible for procedural knowledge that defines skills and reflexes. In contrast, declarative memory is responsible for storing knowledge describing the world. Our current understanding of declarative memory is founded on the work of Tulving (1972), where two contrasting forms of long-term memory are proposed: semantic and episodic memory. This is schematically shown in Figure 1. Both episodic and semantic memories are declarative in nature, meaning that they store "what-is" knowledge, rather than procedural "how-to" knowledge. Their point of difference is that episodic memory stores knowledge of an autobiographical nature, which involves vivid knowledge of the "episode" that the subject was involved in. Semantic memory, on the other hand, stores "general knowledge" constructs that may not be attributable to any particular autobiographical episode. For example, a piece of knowledge like "I learned that my neighbor is a vegetarian when we had dinner at their place last night" is episodic knowledge, while a statement like "June is the month preceding July" represents general knowledge, even when there is no recollection of any specific episode where this knowledge was learned.

Semantic memory can be seen as the answer to Plato's problem in the human brain, since the general knowledge obtained by the brain is much more than the episodic knowledge that it has been exposed to. There is some kind of cognitive
activity that happens in the brain that continually converts episodic knowledge into general-knowledge elements in the semantic memory. There is also evidence of interplay between episodic and semantic memory; see Greenberg and Verfaellie (2010). Semantic memory creates general knowledge by taking inputs from episodic memory and combining them with other, already known general-knowledge elements. Analogously, the way a person reacts to stimuli in a given episode, and the kinds of inferences that are made by the episodic memory, are influenced by the general-knowledge "insights" already present in the person's semantic memory.

Given the above, a document corpus can be viewed as a collection of cognitive communication episodes. Each document represents an instance where the author wishes to communicate some semantics to the reader. The author collects his/her thoughts based on what the episode is about and on what the author has in his/her semantic memory. These thoughts are then labeled with appropriate linguistic terms based on the language of communication and the author's linguistic style. The recipient in turn parses the linguistic sentences, disambiguates terms for their concept associations, and finally makes changes to his/her semantic memory. When we look for latent semantics using algorithmic techniques, the algorithm in effect acts as a hypothetical recipient of the intended communication. We then need to model the way semantics are generated by the author and broken down by the recipient. This is addressed in the next section.
4 3-layer Semantics Model
Consolidating the different perspectives on semantics presented thus far, we propose the following conceptual model for explaining human semantics and then architect an equivalent model for machines to mine latent semantics in text corpora.
4.1 Human Semantics Model

Semantic communication in human beings involves three layers of operation. The top-most layer, called the analytic layer, hosts the semantic memory, containing general-knowledge associations. Semantic communication originates in the second layer, called the episodic layer. The episodic context determines what background knowledge can be assumed, and what needs to be communicated. The analytic layer is queried one or more times for the general-knowledge elements required to build the communication. The third, linguistic layer provides language-specific terms, converting a conceptual sentence into spoken or written sentences in a given language. The linguistic layer is not just responsible for vocabulary, but also for linguistic styles like figures of speech, sarcasm, poetic language, etc.

At the recipient's end, the same three layers are involved, but in the reverse order. When encountering written or spoken text, the linguistic layer parses the input, disambiguates terms and extracts semantic associations between terms. The episodic layer correlates semantic associations from a given sentence with other associations in the same episode and with the episodic context. This interpretation, along with associations with sensory inputs, is recorded in the episodic memory
and perhaps acted upon by other parts of the brain to provide adequate responses. The analytic layer works in the background and distills general-knowledge semantics from what was learned over several episodes. Figure 2 schematically depicts the above model.

Fig. 2 3-layer model for semantic communication

The constituent elements of each of these layers are detailed as follows.

Analytic Layer
The analytic layer can be modeled as a collection of a set of concepts and a set of associations between those concepts. Concepts are mental abstractions of real-world entities like Federer or Space Shuttle, or of abstract notions like π or Machine Learning. Each concept may be related to other concepts by a variety of semantic associations.
Concepts and associations collectively form the person's world-view. World-views vary from person to person, but also overlap significantly, which is a prerequisite for meaningful communication to happen.

Associations are mental abstractions of relationships between concepts, or between sets of concepts considered as a unit. For example, plays(Federer, Wimbledon) is an association describing an ordered relationship from the concept Federer to the concept Wimbledon. Apart from specific, precisely-defined associations, concepts also indirectly influence other concepts in the analytic layer. We focus on a specific type of hypothetical indirect association called "aboutness" for a number of the semantics that we extract below. Every concept in the analytic layer is said to have an "aboutness" relationship with every other known concept. Aboutness is modeled as a score in the real interval [0, 1]. The aboutness scores of a concept c over all the concepts in the analytic layer are represented as an aboutness distribution A(c), and the aboutness score of a target concept t from a source concept c is represented as A_c(t). The aboutness score of a concept unto itself is maximal: A_c(c) = 1; in other words, every concept is maximally "about" itself. For any two semantically unrelated concepts c and t, A_c(t) = A_t(c) = 0.

Sets of concepts in the analytic layer sometimes function as one unit. Associations and aboutness relationships are sometimes managed over sets of concepts rather than individual concepts.
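The following toy sketch, in Python, illustrates how such aboutness distributions can be represented; the concepts and all scores are invented for illustration and are not part of the model's formal machinery.

# Aboutness modeled as score distributions over known concepts in [0, 1].
# All concepts and scores below are invented for illustration.
aboutness = {
    "Federer":   {"Tennis": 0.9, "Wimbledon": 0.7},
    "Wimbledon": {"Tennis": 0.95, "Federer": 0.7},
}

def A(c, t):
    """Aboutness score A_c(t) of target concept t from source concept c."""
    if c == t:
        return 1.0                            # every concept is maximally about itself
    return aboutness.get(c, {}).get(t, 0.0)   # unrelated concepts score 0

print(A("Federer", "Federer"), A("Federer", "Tennis"), A("Federer", "Rice"))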
Episodic Layer
The episodic layer deals with cogent snippets of information, called episodes, which help in building the complex analytic layer above. An episode is an autobiographical situation involving the subject (the speaker or author, and the reader or listener), and has a certain episodic context and objectives. An episode is abstracted as a small subset of concepts which are associated together and express an idea or episodic objective. The episodic objective pertains to what the episode is "about," and is the glue which binds the concepts and associations expressed in an episode. Hence an episode like "Federer won a Grand Slam" can be said to contain the concepts Federer and Grand Slam and the association win(Federer, Grand Slam), with the episodic objective of reporting a Tennis event.

Although the semantics referred to in episodes are sourced from the analytic layer, what is stated in an episode about a concept may be quite different from the semantic signature of the concept in the analytic layer. The episodic objective may not be stated explicitly in the episode at all. Specifically, an episodic context may make several assumptions about the background knowledge already possessed by the recipient. In the above case, the sentence "Federer won a Grand Slam" makes the intended semantic sense to the recipient only if the recipient already knows the concepts Federer and Grand Slam in their analytic layer, and the context Tennis in which the expressed semantic association is to be placed.

Linguistic Layer
The linguistic layer is the layer which populates an idea in
an episode with terms from our vocabulary. The primary building block of the linguistic layer is the vocabulary set. Depending on the language in which the concepts are expressed, the vocabulary may change. The human brain sometimes uses terms from a vocabulary to perform internal cogitation as well. However, if the intended sentence is to be communicated in a language different from the one it is thought in, the linguistic layer steps in to replace concept names with terms from the required vocabulary. Hence, we may "think" in English (in the episodic layer), but "speak" in German (from the linguistic layer), and so on.

Terms represented in a human language are external manifestations of concepts. Unlike concepts, terms exhibit synonymy and polysemy. The association between terms and concepts is not strict, and the variations in "term sense" contribute to linguistic richness and flexibility. Terms are also not directly associated with one another; they gain their associations through their usage in episodes and the corresponding concepts they get associated with. This association is a model of meaning in our language. As we keep using terms differently, they start getting associated with different concepts, and hence their associations change in accordance with usage.

The 3-layer model is a simplified representation of a complex cognitive reality. Significant notions like truth, question answering, generalization, model building, etc. are ignored from the semantic standpoint of the model. The proposed model is detailed just enough to extract specific semantic associations across terms from document corpora.
4.2 Machine Semantics Architecture

Given a model of the process by which humans embed semantics in text or speech, some of those semantics should be recoverable by running a similar process on
the machine. Hence the 3-layer model is extended onto the machine to extract semantic associations. To build a reasonably complex world-view, an average human would have processed a large number of episodes; on the machine's side, the document corpus serves a similar purpose. The machine semantics architecture is depicted in Figure 3 and is described bottom-up as follows.

Fig. 3 Machine Semantics Architecture

Linguistic Layer
The linguistic layer processes input documents from a corpus based on their language and identifies terms in the text. In our implementation, we assume the language to be English and use a combination of existing algorithms and heuristics to extract terms. The rest of the semantic processing works only on terms; the query and response are both sets of terms, not grammatical sentences.

Episodic Layer
Earlier, an episode was modeled as a set of concepts and
associations which represent a cogent snippet of information. As the concepts are unobservable in text, we use the terms in the document corpus as stand-ins for the mental concepts. Similarly, since we only perform shallow parsing of the documents, we do not know exactly what the associations between the concepts mentioned in a document are. Instead, we simply use the co-occurrences between terms in a document as stand-ins for associations between concepts.

To measure co-occurrence, every document is modeled as a set of occurrence contexts. Occurrence contexts are sets of observed terms based on some logical grouping, like paragraph markers or section markers in a document. If such markers are not available, the occurrence context is deemed to be the entire document. Hence, sets of co-occurring terms in a document form an episode for the machine.

Co-occurrence Layer
As inferred from the episodic layer, the terms and their co-
occurrences are the only facts observed from the document corpus. The analytic layer is then a single co-occurrence graph that combines all the co-occurrences across terms found across all episodes in the episodic layer. The co-occurrence graph is a raw, "uncooked" form of the semantic associations of the semantic memory.
For the sake of clarity, to distinguish it from the actual analytic layer, it is termed the co-occurrence layer. In a co-occurrence graph, since terms are used as stand-ins for concepts, the nodes are plagued by natural language problems like synonymy and polysemy, and the association edges, being unlabeled and undirected, carry little meaning on their own. But even though the co-occurrence graph is a crude representation of semantic associations, it is still a rich source of semantics. Formally, the undirected co-occurrence graph G is defined as

G = (T, C, w)    (1)
where T is the set of all terms in the corpus and C is the set of all pair-wise co-occurrences across terms inferred from the episodes in the corpus. The function w gives the corresponding co-occurrence count in the corpus: the number of occurrence contexts, across all documents in the corpus, that contain both elements of a co-occurring pair. Figure 4(i) depicts an example co-occurrence graph.

To enable the extraction of semantics from the co-occurrence graph, we need to define several primitives by which we can operate on the graph. Given a set of terms X, their closure X∗ is the set of all terms which co-occur with at least one of the terms in X. Their focus X⊥ is the set of all terms which co-occur with all the terms in X. In the example graph shown in Figure 4(i), the closure of nodes {c, e} is the set of nodes {a, b, c, d, e, f, g, h}, as shown in Figure 4(ii), and their focus is {c, e, f, g}, as shown in Figure 4(iii). A set of terms X is said to be coherent if X⊥ ≠ ∅. Incoherent terms, i.e., terms which do not share co-occurring terms, are of little use in co-occurrence based semantics. In the example, the terms {a, d} do not share neighbors and are thus incoherent (Fig. 4(iv)).

Given a term t, its co-occurrence neighborhood N(t) is the set of all terms co-occurring with t, along with their co-occurrence counts. In other words, it is the "star"-like sub-graph of G originating from t. Formally:

N(t) = (T_N(t), C_N(t), w)    (2)

where T_N(t) = {t} ∪ {u | u ∈ T, {t, u} ∈ C}, C_N(t) = {{t, u} | u ∈ T, {t, u} ∈ C}, and w is the corresponding edge weight in G. In the example, the neighborhood of e, N(e), is the highlighted sub-graph shown in Figure 4(v). The neighborhood of a set of terms X can be defined in its two canonical forms: N(X∗), the neighborhood closure, and N(X⊥), the neighborhood focus. Formally:

N(X∗) = ⋃_{x∈X} N(x)    (3)

N(X⊥) = ⋂_{x∈X} N(x)    (4)
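The following is a minimal sketch of these primitives over a toy adjacency map; the graph only loosely mirrors Figure 4(i), and its edge weights are invented. Note that T_N(t) contains t itself, which is why the focus of {c, e} contains c and e. Later sketches in this section reuse the names defined here.

from functools import reduce

# Toy co-occurrence graph: term -> {co-occurring term: co-occurrence count}.
G = {
    "a": {"c": 2},
    "b": {"c": 1},
    "c": {"a": 2, "b": 1, "e": 3, "f": 1, "g": 2},
    "d": {"e": 1},
    "e": {"c": 3, "d": 1, "f": 2, "g": 1, "h": 1},
    "f": {"c": 1, "e": 2},
    "g": {"c": 2, "e": 1},
    "h": {"e": 1},
}

def cooc_set(t):
    """T_N(t): the term t together with every term it co-occurs with."""
    return {t} | set(G[t])

def closure(X):
    """X*: terms co-occurring with at least one term in X (union)."""
    return set().union(*(cooc_set(x) for x in X))

def focus(X):
    """X⊥: terms co-occurring with all terms in X (intersection)."""
    return reduce(set.intersection, (cooc_set(x) for x in X))

def coherent(X):
    """A set of terms is coherent iff its focus is non-empty."""
    return bool(focus(X))

print(sorted(closure({"c", "e"})))   # a b c d e f g h, cf. Figure 4(ii)
print(sorted(focus({"c", "e"})))     # c e f g, cf. Figure 4(iii)
print(coherent({"a", "d"}))          # False: a and d share no neighbors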
In the example, Figures 4(vi) and 4(vii) depict the neighborhoods of the closure and focus of {c, e}. The neighborhood represents the relationship of a set of terms with all the terms which co-occur with them. When computing the neighborhood of a set of concepts X, the primitives X∗ and X⊥ are treated as separate compound concepts, so that the neighborhoods N(X∗) and N(X⊥) still appear like star graphs.

Fig. 4 Co-occurrence Primitives
The co-occurrence weights of terms in the neighborhood of X⊥ and X∗ are updated as follows:

w(X⊥, u) = min_{x∈X} w(x, u)    (5)

w(X∗, u) = Σ_{x∈X} w(x, u)    (6)
The above are essentially multi-set (bag) intersection and union operations: since the co-occurrence count between a pair of terms x, u is the number of occurrence contexts containing both x and u, the co-occurrence counts for the focus and closure are naturally modeled as multi-set intersection and union operations respectively.
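Continuing the sketch above, the compound weights of Equations 5 and 6 follow directly from the adjacency map:

# Edge weight from a compound concept to a term u (Equations 5 and 6):
# the focus takes a multi-set intersection (min), the closure a union (sum).
def w_focus(X, u):
    return min(G[x].get(u, 0) for x in X)

def w_closure(X, u):
    return sum(G[x].get(u, 0) for x in X)

print(w_focus({"c", "e"}, "f"))     # min(1, 2) = 1
print(w_closure({"c", "e"}, "f"))   # 1 + 2 = 3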
Given a co-occurrence graph G, the semantic context ψ(t) of a term t is the sub-graph induced by the vertexes of the neighborhood, T_N(t). An induced sub-graph H of a graph G contains a subset of the vertexes of G and all edges {v1, v2} of G such that v1, v2 ∈ V(H). Formally:

ψ(t) = (T_N(t), C_ψ(t), w_t)    (7)

where C_ψ(t) = {{v1, v2} | v1, v2 ∈ T_N(t), {v1, v2} ∈ C}. As before, for a set of terms X we define the semantic contexts of their closure and focus:

ψ(X∗) = ⋃_{x∈X} ψ(x)    (8)

ψ(X⊥) = ⋂_{x∈X} ψ(x)    (9)
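In the running sketch, the semantic context is an induced sub-graph computation:

# ψ(t) (Equation 7): the sub-graph of G induced on T_N(t), keeping every
# edge of G whose endpoints both lie in the co-occurrence set of t.
def semantic_context(t):
    nodes = cooc_set(t)
    return {v: {u: w for u, w in G[v].items() if u in nodes} for v in nodes}

ctx = semantic_context("e")
# The edge f-c appears in ψ(e) even though it is absent from the star N(e).
print(ctx["f"])   # {'c': 1, 'e': 2}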
The semantic context of a set of terms captures the inter-relationships between all the terms with which they co-occur. In the example, Figure 4(viii) represents the semantic context of e. The edges {c, f}, {c, g}, {f, g} and {f, h} are present in ψ(e) but not in N(e). Figures 4(ix) and 4(x) depict the semantic contexts of the closure and focus of {c, e}, and can be compared with the closure and focus of the neighborhoods for distinction.

The semantic context is an important data structure for co-occurrence based latent semantics. We claim that a large number of latent semantics pertinent to the terms in X will be found within the semantic context of either X∗ or X⊥. It is not necessary to process the entire graph for extracting latent semantics related to a small set of terms. Of course, it is important that any such set of terms X have a coherent co-occurrence context in the first place.

As an association, co-occurrence between two terms along with a weight is a quantified, undirected relationship between the terms. However, when viewed from the vantage point of either of the terms, the relative probability of co-occurrence of the other term need not be identical. This asymmetry is captured with the notion of generatability. If a term u co-occurs with t, then the probability that an arbitrarily chosen co-occurrence of t happens to be with u is called the generatability of u in the context of t. Formally:

Γ_{t→u} = w({t, u}) / Σ_{{t,v}∈C} w({t, v})  if {t, u} ∈ C, and 0 otherwise.    (10)
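In the running sketch, generatability is a simple normalization of edge weights:

# Γ_{t→u} (Equation 10): t's edge weight to u, normalized by the total
# weight of all of t's co-occurrence edges.
def generatability(t, u):
    total = sum(G[t].values())
    return G[t].get(u, 0) / total if total else 0.0

# The two directions of the same edge generally differ; this is exactly
# the asymmetry the notion is meant to capture.
print(generatability("c", "e"))   # 3/9 ≈ 0.33
print(generatability("e", "c"))   # 3/8 ≈ 0.38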
Every edge {u, v} ∈ C thus has two generatability probabilities associated with it: Γ_{u→v} and Γ_{v→u}. Figure 4(xi) is a directed graph where every edge in the example is decomposed into its corresponding generatability edges. All the generatability probabilities originating from a term u form a probability distribution, called the generatability distribution Γ_u of u over its neighborhood N(u). In the example, the generatability distribution of e, Γ_e, is shown in Figure 4(xii). Generatability of a target term can be extended from a single source term to sets of source terms using the focus and closure operators. Hence, Γ_{X∗→u} and Γ_{X⊥→u} are calculated as above, with the co-occurrence weights adjusted according to the multi-set operations specified in Equations 5 and 6.
Using the primitives described above, episodic hypotheses can be expressed on the co-occurrence layer, yielding semantic associations in the context of a query.
5 Episodic Hypotheses for Mining Latent Semantics
To distill semantic associations from the raw co-occurrence graph, we introduce the notion of episodic hypotheses. An episodic hypothesis is a theory about how a given semantic association in the analytic layer gets communicated across different episodes. It asserts co-occurrence patterns across terms or sets of terms and associates them with specific semantic associations across concepts in the analytic layer. Four example episodic hypotheses are presented in this paper. The notion of an episodic hypothesis is generic enough that more such algorithms can be designed over the proposed framework.

To test the hypotheses presented here, an encyclopedic document corpus based on Wikipedia articles is used as a reference. That is, each episode is assumed to have the objective of educating the reader about a particular topic. This makes it easier to test an algorithm by using human subjects to validate the results. However, the algorithms are generic enough to work on any kind of dataset. For instance, a separate test was also conducted on a dataset from an internal corporate blog environment. An example of a contrasting result between this and the Wikipedia dataset had to do with the terms that were considered related to the topic Europe. In the Wikipedia dataset, the topic Europe was associated with country names, places, historical events, etc., while in the corporate blog dataset it was associated with terms related to travel and photography. This is still a meaningful result, as the population that created the blog dataset primarily associated Europe with travel and photography. But using such an arbitrary dataset makes it difficult to validate the results of a proposed episodic hypothesis, as the evaluators may not be privy to the characteristics of the population that created the dataset. For this reason, Wikipedia is used as the benchmark.

The algorithms presented here are all based on heuristics that map patterns at one level to interpretations at a higher level. There is no human intervention, training or learning involved. However, in any given application, the proposed framework can be augmented with other conventional forms of mining and machine learning. The objective here is to show evidence for our assertion that analytic semantics is latent in the way terms co-occur across episodes.

Document Corpus
Before delving into the episodic hypotheses, we present the characteristics of the document corpus on which the co-occurrence graph is built. The entire text of the English-language Wikipedia is used for measuring co-occurrences. The dataset was cleaned by removing all the non-article pages (category pages, talk pages, user pages and so on) and stub pages, and from each page the tables, info-boxes and general references were removed from the text. In the experiments, co-occurrence was measured between entities, where an entity is taken to be a noun or a noun phrase. A Wikipedia entry for an article on a topic is divided into sections, and there are subtle variations in context across sections of the same document. We treated
each section as an occurrence context. All the entities which co-occurred in an occurrence context were added to the co-occurrence graph by incrementing the edge between them. Each occurrence context is treated as a set of entities and is added to the co-occurrence graph separately. If two words co-occur in three different sections of a document, then when the document is processed their co-occurrence count is incremented by three. The number of times a word occurs within a section is ignored, because co-occurrence inside a section is taken to be a binary variable: it is either present or absent.

The Wikipedia data was obtained in May 2011, and the co-occurrence graph built from it contains more than 7 million nodes and 155 million edges. As the algorithms described below were devised at various points in time, an older version of the Wikipedia data was sometimes used. Such instances are explicitly mentioned in the relevant sections.
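The construction just described can be sketched as follows; entity extraction is stubbed out, and the two "documents" are invented for illustration.

from collections import defaultdict
from itertools import combinations

# Co-occurrence within a section (occurrence context) is binary: every
# section containing a pair increments that pair's edge by exactly one.
cooc = defaultdict(int)

def add_document(sections):
    for section in sections:
        entities = set(section)                       # repeats inside a section are ignored
        for u, v in combinations(sorted(entities), 2):
            cooc[(u, v)] += 1                         # one increment per occurrence context

add_document([["Federer", "Wimbledon"], ["Federer", "Nadal", "Wimbledon"]])
add_document([["Nadal", "Wimbledon"]])

print(cooc[("Federer", "Wimbledon")])   # 2: the pair appears in two contexts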
5.1 Topical Anchors

The first kind of semantic association that we consider is called Topical Anchors. Topical anchors are concepts representing the topic of an episode, based on the terms that have occurred in the episode. For example, if a document has words like Federer, Nadal and Wimbledon, it would be very useful if their association with Tennis could be established, even if the word Tennis does not appear in the document. Tennis here acts as the topical anchor for this set of words. Topical anchors find applications in automatic labeling of conversations, email messages, etc., and are useful in settings like handling customer complaints.

Topical anchors were published earlier, independently of the 3-layer model, by Rachakonda and Srinivasa (2009a,b), and the idea is included briefly in this work for the sake of completeness. The contribution in this work is to revisit the algorithms presented earlier through the lens of the 3-layer model and present them as an episodic hypothesis, thus providing an explanation as to why the algorithms result in topical anchors.

To extract topical anchors, we need to first define what topical anchors mean at the analytic layer and hypothesize about what kinds of patterns in the lower layers approximate them.

Analytic Definition
The topical anchor t of a set of concepts Q is the concept whose aboutness distribution A(t) resembles the aboutness distribution of Q, A(Q). This implies that the topical anchor t is a concept which is semantically about the same set of concepts as the set Q is collectively about. The next step is to reduce this definition into an episodic or observational hypothesis. If t is a topic for the set of concepts in Q, how will it be evidenced across different episodes?

Observational Hypothesis
If a set of terms Q is observed in an episode, the topical anchor of the terms is the term t whose probability of generation increases with the length of the episode or with the number of such episodes.

To explain the episodic hypothesis in intuitive terms, consider the following example as a hypothetical claim: suppose we are witness to a conversation, a lecture, or an article containing the terms {Roger Federer, Wimbledon, Davis Cup};
then the longer the conversation, lecture or article gets, or the more such episodes containing the above terms we observe, the more likely we are to encounter the term Tennis. Note the difference between the analytic definition of topical anchors and the observational hypothesis: the hypothesis addresses observable patterns in human language usage that represent latent analytic meaning.

Co-occurrence Algorithm
Given a coherent set of terms Q, in a corpus represented as a co-occurrence graph G, their topical anchor is the term with the highest cumulative generatability score in an infinitely long random walk executed on ψ(Q∗).

To find the topical anchor term as described in the observational hypothesis, we find the term which is most generatable in the semantic context of the closure of Q. The most generatable term in the semantic context ψ(Q∗) is the most central term under a random walk. In our implementation, an OPIC-like algorithm is adopted to perform the random walk; see Abiteboul et al (2003) for OPIC. Every node representing a term in Q is initialized with a seed cash, and this cash is distributed to its neighbors in accordance with their generatability values. For example, if a node u has a cash of x and Γ_{u→r} is 0.05, then r gets a cash of 0.05x from u. This process is iterated by picking any node uniformly at random and distributing its cash to its neighbors, again in accordance with its generatability to other nodes. As this process is repeated, the cash-flow history at every node is recorded. The cash-flow history of a node is the total cash distributed by the node over all iterations up to this point. The relative ordering of cash-flow histories of nodes converges to a fixed point indicating the centrality of nodes in the context. A set of example results of this algorithm is shown in Table 1.

There is a subtle but important distinction in the way the cash is distributed to a node's neighbors. The cash at u is not distributed in the ratio of its co-occurrence edge weights to its neighbors, but according to the generatability of each neighbor. These two values would be exactly the same for cash distributed by a node u if N(u) ⊆ ψ(Q∗). But when N(u) contains terms which are not in ψ(Q∗), the two metrics differ.
Fig. 5 An example sub-graph: a dotted circle encloses ψ(Q∗), containing u, r, s and t; u also co-occurs with a and b outside the circle, with edge weights w(u, a) = 18, w(u, b) = 19, w(u, r) = 2 and w(u, s) = 1.
For example, in the graph shown in Figure 5, let the sub-graph in the dotted circle indicate ψ(Q∗) and the sub-graph with the blue edges indicate N(u).
Table 1 Example results from the topical anchor experiments

Input Terms (Q)                       | Topical Anchors
MIT, Stanford, Harvard                | University, College, United States
Manchester United, Chelsea, Arsenal   | London, Football, Football Club
Injection, Surjection, Bijection      | Mathematics, Set, Function
Rice, Wheat, Barley                   | Food, Agriculture, Maize
Volt, Watt, Ohm, Tesla                | Unit, Electricity, Current
As earlier, let u have a cash of x to distribute. If we take only ψ(Q∗), i.e., ignore a and b, and distribute the cash at u in the ratio of its co-occurrence edge weights within ψ(Q∗), then r and s would receive cash of 0.66x and 0.33x respectively. But when distributing cash using generatability, where Γ_{u→r} = 0.05 and Γ_{u→s} = 0.025, r and s receive cash of 0.05x and 0.025x respectively. This effectively reduces u's say in determining the topical anchor of ψ(Q∗), by taking into account the structure of the graph outside of ψ(Q∗).

When we distribute cash along generatability edges, not all the cash at a node is necessarily distributed. The undistributed cash is leaked out of the system and is not distributed any further. This is important because, for a query like {Roger Federer, Wimbledon, Davis Cup}, a term like Football ends up receiving a considerable amount of cash due to its higher global popularity. If the cash from Football were redistributed in proportion to its co-occurrences, it would get an unfair say in the outcome of the random walk, because it has a large number of co-occurrence edges into the sub-graph, like the node u in the previous example, even though its percentage of co-occurrence edges into the sub-graph is still low. To account for this, cash is distributed in accordance with the generatability of the neighbors present, and the undistributed cash is leaked out of the system.
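The following is a simplified rendering of this procedure over the toy graph from the sketches of Section 4.2; the iteration count and seed cash are arbitrary choices, and the actual implementation runs until the relative ordering of cash-flow histories converges.

import random

# Cash-leakage random walk (OPIC-style) on ψ(Q*). Cash is distributed by
# generatability over the FULL graph G, so cash destined for terms outside
# the sub-graph leaks away. Reuses G and closure() from the earlier sketches.
def topical_anchors(Q, iterations=10000, k=3):
    random.seed(0)                            # repeatable toy run
    nodes = closure(Q)                        # vertexes of ψ(Q*)
    cash = {v: 1.0 if v in Q else 0.0 for v in nodes}
    history = {v: 0.0 for v in nodes}         # cash that has flowed through each node
    for _ in range(iterations):
        v = random.choice(sorted(nodes))      # pick a node uniformly at random
        history[v] += cash[v]
        total = sum(G[v].values())            # normalizer over ALL of G
        for u, w in G[v].items():
            if u in nodes:                    # cash sent to outside terms is leaked
                cash[u] += cash[v] * w / total
        cash[v] = 0.0
    return sorted(nodes, key=history.get, reverse=True)[:k]

print(topical_anchors({"c", "e"}))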
5.1.1 Experimental Results

A set of 100 human-generated queries was chosen, and volunteers were asked to write down at most three topical anchors for each query, independent of the algorithm. The volunteers were not aware of the existence of such an algorithm. There were a total of 86 volunteers; each volunteer was given 30 random queries from the hundred questions, and the topical anchors given by the volunteers were recorded. These volunteer-generated topical anchors were compared with the topical anchors generated by the algorithm.

For the experiment, we partitioned the topical anchors given by the users into confidence intervals based on the percentage of users agreeing upon a topical anchor. A confidence interval from x to x + c is a bucket where x% to (x + c)% of the users who answered that query agree that the terms in the bucket are topical anchors for the query. For example, if 95% of the users answer computer for the query "CPU, hard disk, monitor, mouse," then we put computer into the 90-100 confidence interval. The confidence intervals are in steps of 10, starting from 40-50 and going up to 90-100. User-generated topical anchors which are not in any confidence interval above 40 are ignored for lack of adequate support. With a confidence cut-off at 40, there are a total of 156 topical anchors in confidence intervals of 40 and above, across all the 100 chosen queries. Some queries, like "summer, winter, spring, autumn," have just one anchor, season, and
some others, like "Volt, Watt, Ohm, Tesla," have several user-given anchors, like unit, electricity and physics.

We evaluated three different algorithms for computing the most central nodes. First we used a variant of TF-IDF to identify the most important nodes in a sub-graph. As there is no notion of a document in the co-occurrence graph, we had to use an interpretation of TF-IDF which is similar in spirit. The term frequency (TF) of a node for a context ψ(Q∗) was defined as the sum of the weights of its edges to nodes in the context. The inverse document frequency (IDF) of a node was defined as the log of the ratio between the sum of all edge weights of edges from all the nodes in the context and the sum of all edge weights of edges from the given node. The product of TF and IDF was used as the score of a node, as shown in Equation 11:

TF(i, ψ(Q∗)) = Σ_{x ∈ T_ψ(Q∗)} w(i, x)

IDF(i, ψ(Q∗)) = log( (Σ_{x ∈ T_ψ(Q∗)} Σ_{y ∈ T} w(x, y)) / (Σ_{z ∈ T} w(i, z)) )    (11)
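A sketch of this baseline over the same toy structures used earlier:

import math

# TF-IDF baseline (Equation 11): TF is a node's edge weight into the
# context; IDF penalizes nodes that are heavy in the global graph.
def tfidf(i, context_nodes):
    tf = sum(w for u, w in G[i].items() if u in context_nodes)
    context_weight = sum(sum(G[x].values()) for x in context_nodes)
    node_weight = sum(G[i].values())
    return tf * math.log(context_weight / node_weight)

nodes = closure({"c", "e"})
print(sorted(nodes, key=lambda i: tfidf(i, nodes), reverse=True)[:3])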
Fig. 6 Comparison between TF-IDF and random walks with and without cash leakage
We also compared the two modes of distributing cash: (i) the cash of a node is distributed in accordance with generatability, and the undistributed cash is leaked out of the system, and (ii) the cash of a node is distributed in the ratio of the co-occurrence weights in the sub-graph. The former is called cl, for cash leakage, and the latter opic. Each algorithm was executed in three different variants based on the number of topical anchors it chooses. The number of correctly identified topical anchors of tfidf 1 was computed, where the 1 stands for picking only one topical anchor per query. The numbers of correctly identified topical anchors of tfidf 3 and tfidf 10 were also computed, where the number of topical anchors generated by the algorithm is 3 and 10 respectively. The same procedure was repeated with opic 1, opic 3 and opic 10, and with cl 1, cl 3 and cl 10.
On the whole, there were 3 different algorithms, each run in 3 variants with varying numbers of topical anchors, on each of the 100 test queries. The results are presented in Figure 6, with the confidence intervals for the topical anchors on the horizontal axis and the number of correctly picked topical anchors, termed the hit-rate, on the vertical axis. The gray vertical bars indicate the number of topical anchors chosen by the volunteers, and the plot compares these with the hit-rates of all three algorithms. The results show that tfidf 10 and opic 10 could correctly pick only 40 and 56 of the 156 topical anchors respectively. In contrast, even the random walk cl 1 performed much better, with a hit-rate of 67; the cash-leaking random walks cl 3 and cl 10 correctly identified 110 and 149 of the topical anchors respectively. This implies that cl 3 could correctly generate 70% of the user-generated topical anchors. Please refer to the earlier work by Rachakonda and Srinivasa (2009b) for further experiments involving topical anchors. This experiment validates the algorithmic claim that the topical anchors of a set of terms are the terms with the highest cumulative generatability, and in turn validates the observational hypothesis that topical anchors are the terms which are most generatable in a text containing the query terms.
5.2 Semantic Siblings

The second algorithm that we present concerns the notion of semantic siblings. Semantic siblings are sets of terms representing concepts that play similar roles in one or more settings, but are not synonyms. For example, the terms Sapphire, Emerald, Topaz are semantic siblings, as they are all types of gems. In an analytic sense, semantic siblings are concepts that share the same conceptual parent. But without a formal ontology, identifying semantic siblings is not straightforward. Identifying semantic siblings is an important problem in several applications like semantic query expansion, recommender systems, semantic matching of documents, etc. Related Work There are several existing techniques which exploit structural cues for mining semantic siblings. For example, there are algorithms which use HTML and XHTML tags, like Brunzel and Spiliopoulou (2006), and comma-separated terms in a sentence, like Sarmento et al (2007). Also, words occurring along a column or a row of a table in a web page are likely to be siblings; see He and Xin (2011). Sometimes ‘x is a y’ patterns in text can be used to determine the parent-child tree, which can in turn help determine semantic siblings Phillips and Riloff (2002). The semantic sibling relationship between terms can also act as a foundation for a different algorithm: Dorow et al (2005) use co-occurrences constructed out of lists of nouns (semantic siblings) to disambiguate between different senses of a term. In this work, we hypothesize that, like several other semantic associations, semantic siblings are also latent in co-occurrence patterns and can be mined without using structural cues like lists or tables. This ensures that the algorithm can
be language agnostic and can perform well even on unstructured corpora like free text or transcripts. To extract semantic siblings and to differentiate them from synonyms, the problem is posed as a set expansion problem: a small set (cardinality 3) of semantic siblings is taken as input, and a larger set (cardinality 20) of siblings is returned as the result. As before, we follow the methodology of defining semantic siblings in the analytic layer, hypothesizing what approximates this definition in terms of episodic observations, and finally deriving the co-occurrence algorithm. In topical anchors, a given set of terms had few terms (mostly one) representing the topic, and hence the algorithm was convergent; in semantic siblings there are several possible siblings, and hence the problem is by nature divergent. Analytic Definition A semantic sibling s of a set of concepts Q = {q_1, q_2, ..., q_n} is a concept whose aboutness distribution resembles the aboutness distribution of each of the concepts in Q, i.e., (A(q_1) ≈ A(s)) ∧ (A(q_2) ≈ A(s)) ∧ ... ∧ (A(q_n) ≈ A(s)). This implies that the semantic sibling should be a concept which is semantically similar to each of the given concepts. Observational Hypothesis Elements of a set of concepts Q are said to be semantic siblings of one another if, given an episode e that features one of the concepts q ∈ Q, it is possible to find another episode e′ featuring some other concept q′ ∈ Q, with the rest of the concepts and associations in e′ nearly identical to those of e. The observational hypothesis is based on a notion of replaceability of semantically similar concepts. In the association plays(Federer, Wimbledon), Federer can be replaced by Nadal, but not by Germany. Two concepts which are semantically similar will be replaceable in most of their associations. Two different algorithms on the co-occurrence graph were tested for computing replaceability. Co-occurrence Algorithm 1 (direct) Given a coherent set of terms Q which are semantic siblings of one another, in a corpus represented as a co-occurrence graph G, a term s is a semantic sibling of the terms in Q if the properties of the neighborhood N(s) are similar to the properties of the neighborhood N(q_i) of each term in Q. We use the generatability distribution of a node to capture the properties of its neighborhood. In this algorithm, henceforth referred to as direct, the generatability distribution Γ_s of a candidate sibling is compared with the generatability distribution Γ_{q_i} of each term in Q. This results in a vector of scores; the magnitude of the vector is a measure of the replaceability of s and is used to determine the semantic siblings. Co-occurrence Algorithm 2 (interleaved) Given a coherent set of terms Q which are semantic siblings of one another, in a corpus represented as a co-occurrence graph G, a term s is a semantic sibling of the terms in Q if the properties of the neighborhood N(((Q \ {q_i}) ∪ {s})∗), where one of the query terms is replaced by s, are similar to the properties of the neighborhood N(Q∗).
In this algorithm, henceforth referred to as interleaved, the joint generatability distribution Γ_Q of the terms in Q over N(Q∗) is estimated by assuming them to be independent. Similarly, a new set Q′_i = (Q \ {q_i}) ∪ {s} is constructed, where the ith input sibling is replaced by the candidate s. The candidate s can replace q_i if Γ_Q is similar to Γ_{Q′_i}. Again there is a vector of scores, and a similar process is followed as in direct. Both algorithms compare one probability distribution with another, and for this purpose we used the Kullback-Leibler divergence (KL divergence). The KL divergence of a distribution B with respect to a distribution A is given as:

D_{KL}(A \| B) = \sum_i A(i) \ln \frac{A(i)}{B(i)}    (12)
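A sketch of the direct scoring built on equation 12, with distributions represented as dicts of probabilities; the epsilon guard against zero entries and the function names are our assumptions:

import math

def kl_divergence(a, b, eps=1e-12):
    # D_KL(A || B) over the support of A
    return sum(p * math.log(p / max(b.get(i, 0.0), eps))
               for i, p in a.items() if p > 0.0)

def direct_score(candidate_dist, query_dists):
    # One KL term per query sibling; the vector's magnitude is the score,
    # and candidates with the *lowest* magnitude are the best siblings.
    vec = [kl_divergence(g, candidate_dist) for g in query_dists]
    return math.sqrt(sum(v * v for v in vec))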
KL divergence between two distributions is a non-negative number in the range [0, ∞], where a lower value indicates higher similarity. Given a query Q = {q_1, q_2, q_3} and a candidate sibling s, the direct algorithm computes the vector ⟨D_KL(Γ_{q_1} ‖ Γ_s), D_KL(Γ_{q_2} ‖ Γ_s), D_KL(Γ_{q_3} ‖ Γ_s)⟩. For the same example, the resultant vector in the interleaved algorithm would be ⟨D_KL(Γ_Q ‖ Γ_{Q′_1}), D_KL(Γ_Q ‖ Γ_{Q′_2}), D_KL(Γ_Q ‖ Γ_{Q′_3})⟩. Every term in ψ(Q∗) is chosen as a candidate, and the above vectors are computed using the direct and interleaved methods. In each algorithm, the candidates are ordered by the magnitude of the vector; semantic siblings are those terms whose vectors have the lowest magnitudes. The dataset and evaluation methodology were similar to topical anchors, but as the algorithm has several correct answers, volunteers were shown the results and asked to choose the correct semantic siblings from the ones generated. Some sample queries along with the top 10 siblings given by the interleaved algorithm are shown below:

Sapphire, Emerald, Topaz      Roger Federer, Rafael Nadal, Andy Roddick
Gemstone                      Janko Tipsarevic
Opal                          Marat Safin
Amethyst                      Arnaud Clement
Garnet                        Mario Ancic
Peridot                       Mardy Fish
Lapis Lazuli                  Marcos Baghdatis
Turquoise                     Jurgen Melzer
Beryl                         Paul-Henri Mathieu
Onyx                          Jose Acasuso
Pearl                         Michael Berrer
Table 2 Semantic siblings results
5.2.1 Experimental Results
As in the case of topical anchors, a set of 100 human-generated semantic sibling sets of size 3 was chosen. For each of these 100 sets, 20 semantic siblings were computed using the direct and the interleaved methods. To eliminate bias towards either algorithm,
the results of these computations were merged and sorted in alphabetical order before presenting them to our human evaluators. Evaluators were presented with each query set of sibling terms accompanied by the larger expanded set of candidate siblings. So, for a given semantic sibling query, the evaluators were shown anywhere between 20 and 40 candidate siblings, depending on the overlap between the two algorithms' results. The evaluation proved a difficult task, as there were a total of 3506 decisions to be made across the hundred queries, making it more time-intensive than the topical anchors evaluation. To ensure that each query received enough answers, the order of the queries was randomized between evaluators. The evaluators were asked to evaluate as many queries as they felt comfortable with. A total of 18 evaluators volunteered; on average, every query received answers from 6.9 evaluators, with a minimum of 3 evaluators for 3 of the 100 queries, while the remaining 97 queries had four or more evaluators. To eliminate erroneous accidental clicks and other trivial biases, every term in the results which was chosen as a semantic sibling by at least two evaluators was considered a semantic sibling. A set of semantic siblings tends to exhibit a hypothetical parent term, i.e., they are either co-hyponyms of a shared hypernym, like {Federer, Nadal, Roddick} → Tennis player or sportsperson, or co-meronyms of a shared holonym, like {Germany, France, Spain} → Europe. Technically, Entity is a hypernym and Universe is a holonym for any given set of terms, but they are too generic to be of any use. Hence, the evaluators were allowed to find the hypothetical parent at whatever level of generalization they found suitable in everyday usage, given that the query terms are semantic siblings themselves. Another point of note is that the evaluation was for choosing semantic siblings and not for replaceable terms. For the query {Sapphire, Emerald, Topaz}, Gemstone is not a semantic sibling, though any of them can be readily replaced by it, as it is more accurately the conceptual parent. Similarly, for the query {Ford, Toyota, Honda}, the term Toyota Corolla is not a sibling, though it can replace Toyota in some contexts. Based on such an evaluation, we plotted the accuracy of both the direct and the interleaved algorithms in Figure 7. For generating 20 semantic siblings, interleaved yielded 63.4% accuracy and direct 59.2% accuracy on average. Although direct and interleaved have similar accuracies, their overlap is very small. Of the 3506 terms given to the evaluators, 2133 were chosen as semantic siblings by at least two evaluators. Among those 2133, 1267 were generated by interleaved and 1184 by direct. This means that only 318 siblings were generated by both algorithms, which indicates that the result spaces of the two algorithms overlap minimally. In interpreting these results, several factors must be taken into account. For example, in the semantic context ψ(Q∗) where Q = {Federer, Nadal, Roddick}, the number of nodes was 2334. Among those, fewer than 100 represented tennis players, that is, around 4% of the nodes. Sometimes the number of semantic siblings available in the context is even lower than 20. For example, one of the queries in the evaluation had three terms, {Aries, Taurus, Gemini}, from the zodiac signs. Though their semantic context was large, the number of possible semantic siblings can only be 9 more. Of those 9, all except Cancer were picked up by the algorithm.
(Histogram: number of queries against the number of correct siblings per query, from 0 to 20, for the interleaved and direct algorithms.)
Fig. 7 Accuracy of the direct and the interleaved algorithms
Hence the chosen threshold of 20 might not be optimal, but it was chosen as a trade-off between maximizing the number of semantic siblings obtained and reducing the load on the evaluators. Different kinds of terms exhibit different co-occurrence patterns, and terms which represent people have a specific signature of their own. In our dataset, we found that the algorithms performed significantly better on those queries whose terms represented people: for such queries, the interleaved algorithm had an accuracy of 77.5% and direct had an accuracy of 65.6%. Both algorithms are purely heuristics derived from the cognitive model, and hence are not optimized through any supervised learning. We found that if we had an oracle who could choose the better algorithm on looking at the query, the overall accuracy could be improved to 71%. We also ran the same queries against structured datasets: WordNet (http://wordnet.princeton.edu/), YAGO2 (http://www.mpi-inf.mpg.de/yago-naga/yago/), and DBPedia (http://dbpedia.org/About). Using WordNet, YAGO2 and DBPedia, we computed the semantic siblings for all the hundred queries, and the results are presented in Table 3.
                    Queries answered    Mean no. of results
WordNet             28                  22
YAGO2               36                  0.51M
DBPedia             52                  1.75M
direct ≥ 5          84                  NA
direct ≥ 10         73                  NA
interleaved ≥ 5     94                  NA
interleaved ≥ 10    78                  NA
Table 3 Comparison with structured datasets
WordNet is a hand-tagged semantic corpus which defines several semantic associations between terms. WordNet provides a relationship between terms called
sister terms, which is similar to the semantic sibling relationship. Each term has a set of senses, and a sister term could be a semantic sibling in any of the senses. To get the semantic siblings of a query, we took the intersection of the sister terms of each of the query terms. Only 28 of the 100 queries had a non-null intersection, and for those 28 cases WordNet returned an average of 22 results per query. YAGO2 is a semi-automatically built set of semantic associations. Like WordNet, YAGO2 describes several relationships between terms, one of which is is-a. The is-a relationship captures hyponym-hypernym associations, which underlie semantic siblings. Hence, to find the semantic siblings of a query, we computed the intersection of the is-a parents of the query terms and then counted the other terms which share one of their parents with all the query terms. We found that 36 of the 100 queries had all query terms represented, but there was no simple way to eliminate the highly generic is-a parents. Hence each of these queries had hundreds of thousands of semantic siblings. Even assuming that a clever technique could prune the result set to an acceptable size, two-thirds of the queries still cannot be answered. DBPedia uses the structural information in Wikipedia to identify semantic attributes, and a technique similar to the one for YAGO2 was applied to it. Because DBPedia is based on Wikipedia, its size is much greater than that of YAGO2, and hence more queries could be answered. But as with YAGO2, each query returned millions of candidate siblings. In comparison, the results of direct had 5 or more siblings for 84 queries and 10 or more siblings for 73 queries; similarly, the results of interleaved had 5 or more siblings for 94 queries and 10 or more siblings for 78 queries. This is because the algorithms are based on co-occurrence in unstructured text and do not rely on structural cues for semantics extraction.
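For reference, the WordNet sister-term baseline of Table 3 can be approximated with, e.g., the NLTK interface to WordNet. The sketch below is only an illustration of the intersection procedure (the helper names are ours, and a real evaluation would also need sense handling):

from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def sister_terms(word):
    """All lemma names that share a direct hypernym with any sense of `word`."""
    sisters = set()
    for synset in wn.synsets(word):
        for hypernym in synset.hypernyms():
            for sibling in hypernym.hyponyms():
                sisters.update(lemma.name() for lemma in sibling.lemmas())
    return sisters

def wordnet_siblings(query):
    """Intersection of sister terms across all query terms."""
    sets = [sister_terms(q) for q in query]
    return set.intersection(*sets) - set(query)

print(wordnet_siblings(['sapphire', 'emerald', 'topaz']))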
5.3 Topical Markers

The third kind of semantic association that we present is called Topical Markers. Topical markers are concepts which are unique to a topic and are unlikely to be significantly associated with other topics. For example, the usage of the concept double fault in an episode can be used to determine the topic of the episode as Tennis. Hence double fault is a topical marker for Tennis. Similarly, Federer is another topical marker for Tennis. The goal here is to identify a set of topical markers given a topic. Topical markers can be used to find snippets related to a topic in a large stream of text. For instance, the topical markers for Machine Learning can be computed and used to search a text stream like Twitter to identify tweets related to machine learning with high confidence, even though the term machine learning itself need not appear in the tweet. In the space of terms and documents, the relative importance of a term in a document with respect to the entire corpus was initially addressed using metrics like TF-IDF Jones (1972). Though TF-IDF assigns a higher weight to terms important in a document, it fails to identify them uniquely. For example, if we consider the Internet as a corpus and a page on Harry Potter as our document, the term magic might have a high weight for that page. Still, it is not unique to Harry Potter, unlike the terms Hogwarts or Hermione Granger. Statistically improbable
phrases (a notion originally introduced by Amazon.com; see http://www.amazon.com/gp/search-inside/sipshelp.html) address this issue by looking for terms which are not only important but also unique to a document. They have been successfully used to determine and delete duplicates in a document corpus Errami et al (2010). Although statistically improbable phrases can be found in the term-document space, there is no such equivalent in the conceptual space. For example, unlike Harry Potter, concepts like machine learning or Tennis need not be confined to the boundaries of a small set of documents. Analytic Definition A topical marker m for a given concept t is a concept such that t has high aboutness values for those concepts for which m has a high aboutness value, and not necessarily vice versa. Let us consider Tennis and one of its topical markers, double fault. Tennis will have a high aboutness to those concepts which are related to double fault, but not necessarily the other way around. Observational Hypothesis The term m is a topical marker of a term t if, on observing m in an episode, the probability of generation of t increases with the
length of the episode or with the number of such episodes. The observational hypothesis is similar to that of topical anchors, but instead of observing a set of terms to determine the topic, the topical marker term can independently determine the topic. Co-occurrence Algorithm Given a topic represented as a term t, in a corpus represented as a co-occurrence graph G, a topical marker m is a term which, with a very high probability, generates terms only in ψ(t). A topical marker m is a term whose generatability distribution lies almost completely within the vertexes of ψ(t), as it should be unique to ψ(t). It should, with a very high probability, generate terms which are central in ψ(t). To compute the terms which are central to ψ(t), we adopt a variant of the HITS algorithm Kleinberg (1999). This algorithm uses mutual recursion to simultaneously compute the most central (authorities) and the most unique (hubs) vertexes of a context. First, a bipartite graph is created by dividing each vertex in ψ(t) into an authority and a hub, as shown in Figure 8. Initially, the scores of all the hubs are initialized to 1/n, where n is the number of vertexes in ψ(t). The authority and hub scores of the terms are then computed recursively as in equation 13:

auth(a) = \sum_{x \in T_{\psi(t)} \wedge x \in N(a)} \Gamma_{x \to a} \times hub(x)

hub(a) = \sum_{x \in T_{\psi(t)} \wedge x \in N(a)} \Gamma_{a \to x} \times auth(x)    (13)
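A sketch of the mutual recursion in equation 13, including the per-iteration normalization described next. The dict-of-dicts representation of Γ and the fixed iteration cap are assumptions of ours, not the paper's implementation:

def marker_scores(nodes, gamma, iterations=50):
    """nodes: vertexes of psi(t); gamma: gamma[u][v] = generatability of v from u."""
    n = len(nodes)
    hub = {v: 1.0 / n for v in nodes}
    auth = {v: 0.0 for v in nodes}
    for _ in range(iterations):
        # Authorities accumulate generatability *into* them from hubs ...
        for a in nodes:
            auth[a] = sum(gamma.get(x, {}).get(a, 0.0) * hub[x] for x in nodes)
        # ... and hubs accumulate generatability *out of* them to authorities.
        for a in nodes:
            hub[a] = sum(gamma.get(a, {}).get(x, 0.0) * auth[x] for x in nodes)
        # Normalize after every iteration, as in the paper.
        for scores in (auth, hub):
            total = sum(scores.values()) or 1.0
            for v in scores:
                scores[v] /= total
    # Marker score: product of authority and hub scores.
    return {v: auth[v] * hub[v] for v in nodes}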
The hub and authority scores thus computed are normalized after every iteration. After convergence, the product of the authority and hub scores of each term is treated as the marker score of the term. The terms are ranked by their marker scores, and the top k terms are chosen as the topical markers of t. Two different methods were used for evaluating topical markers. The first method was a human evaluation of the results generated by the algorithm.
Fig. 8 Example: Semantic context to bipartite graph
This evaluation methodology was similar to that of semantic siblings. Since each query results in many correct topical markers, the volunteers were shown the top 30 results and asked to select the correct topical markers from them, for the given topic. Some example results are given in Table 4.

machine learning                                  quantum mechanics
weighted majority algorithm                       quantum tic tac toe
boosting                                          schrödinger equation
semi-supervised learning                          hamiltonian
knowledge discovery                               copenhagen interpretation
institute of photogrammetry and geoinformation    epr paradox
supervised learning                               quantum field theory
3d geometry                                       density matrix
international conference on machine learning      quantum entanglement
concept drift                                     wave function
data mining                                       wave function collapse
data stream mining                                physics
unsupervised learning                             bell's theorem
computational intelligence                        electron
image interpretation                              classical mechanics
rapidminer                                        hilbert space
Table 4 Topical marker results
5.3.1 Experimental Results
A set of 50 topics was randomly chosen as queries from the set of human-generated queries used for topical anchors and semantic siblings. A total of 19 people volunteered for this evaluation. 30 topical markers were generated for each of the 50 query topics, so every evaluator had to make a maximum of 1500 decisions. Hence they were asked to evaluate as many queries as they felt comfortable with. As in the case of semantic siblings, to eliminate erroneous accidental clicks and other trivial biases, every term in the results which was chosen by at least two evaluators was considered a topical marker.
Based on this evaluation method, our algorithm yielded an accuracy of 93.8% for generating 30 topical markers. The accuracy is plotted in Figure 9.
(Histogram: number of queries against the number of correct topical markers per query, from 0 to 30.)
Fig. 9 Accuracy of the topical markers algorithm
In the second method of evaluation, 10 of the 50 query topics were randomly chosen. For each of these topics, we selected the top 10 topical markers generated by our algorithm and, for each of them, performed a search on the web using the Google Custom Search API (http://developers.google.com/custom-search/). In general, this API only searches on certain websites, but it was suitably configured to perform searches over the entire web. For each of the 10 topical markers used as search queries, we selected the top 10 results returned by the API. Hence, there were 100 web pages for each topic (10 results for each of the 10 topical markers of the topic). We manually classified these web pages as being relevant to the topic or not. In total, 1000 such decisions were made (for the 10 topics), and the evaluation found 902 of the 1000 pages to be correct. The search queries were the topical markers generated by our algorithm for the given topic, and not the topics themselves. Hence, the high number of web pages classified as correctly belonging to the topic indicates the effectiveness of the topical markers algorithm in uniquely determining the topic. Figure 10 shows, for each topic, the number of search results across all 10 of its topical markers which were classified as correctly belonging to the topic. The 10 topics are shown on the x-axis. In this experiment, we found that the reason why Jazz had only 78 pertinent pages of the 100 returned is that many of the Jazz musicians who were topical markers shared their names with other real-world people not popular enough to feature on Wikipedia, but popular enough to have a presence on the Internet. Sometimes we also found metaphorical usage of a marker making it unrelated to the topic. For example, for the query Bible, Jesus was a topical marker, but one of the results in the web search for Jesus is the Twitter page of a profile called Jesus (http://twitter.com/jesus/), unrelated to the concept Bible. This can be attributed to the fact that search engines try to diversify the set of results returned Agrawal et al (2009).
(Bar plot: number of correct search results, out of 100, for each of the 10 topics.)
Fig. 10 Effectiveness of topical markers in determining the topic

Google AdWords                           Topical markers
postulates of quantum mechanics          quantum tic tac toe
basics of quantum mechanics              schrödinger equation
quantum mechanics pdf                    hamiltonian
quantum mechanics basics                 copenhagen interpretation
perturbation theory quantum mechanics    epr paradox
what is quantum mechanics                quantum field theory
quantum mechanical model of atom         density matrix
quantum mechanics wiki                   quantum entanglement
lectures on quantum mechanics            wave function
books on quantum mechanics               wave function collapse
quantum mechanics books                  physics
application of quantum mechanics         bell's theorem
quantum mechanics video lectures         electron
relativistic quantum mechanics           classical mechanics
quantum statistical mechanics            hilbert space
Table 5 Results for quantum mechanics
We also compared the topical markers with the top 30 terms having the highest TF-IDF for each of the 50 query topics. The TF-IDF metric used for this purpose is the same as that in equation 11, used in the topical anchors evaluation. The TF-IDF algorithm fares very poorly compared to our topical markers algorithm, as it fails to generate terms that are not just important but also unique to the topic. For example, for the topic capitalism, the terms with the highest TF-IDF scores are United States, Germany, France, United Kingdom, etc. Even though these terms are important to capitalism, they are not topical markers. In addition, we also performed a qualitative comparison of our results with Google AdWords (http://adwords.google.com) for the query topics. The Google AdWords service allows the user to find important keywords for any topic (which they may wish to advertise on). This mostly helps in query expansion, but it is not suitable for finding Statistically Improbable Phrases (SIPs) for a topic. For example, for the topic quantum mechanics, the results generated by Google AdWords and the topical markers generated by our
algorithm are shown in Table 5. It is seen that all the keywords suggested by Google AdWords contain the term quantum mechanics, and hence they miss all those important topical markers which do not contain the topic name but uniquely identify it. This suggests that augmenting Google AdWords with topical markers may be attractive.
5.4 Topic Expansion

The last semantic association we consider in this paper is called Topic Expansion. Topic expansion of a topic t is the process of unfolding t into a set of concepts that are collectively “about” t. In a document corpus, where terms are used as a stand-in for concepts, topic expansion also has to contend with the problem of multiple senses of terms. Each sense of a term represents a different concept in the analytic layer, but at the linguistic layer, the same term is used to refer to the different concepts. For example, the term Java is the handle for the concepts Java island, Java programming language, and Java coffee. When we want to describe the term Java using other concepts, we do not have any way to identify which of the senses of Java we are referring to. Hence, we have to identify and describe all the senses of the term Java. Sense partitioning of topical terms is thus an integral aspect of topic expansion. Essentially, topic expansion has to perform two different tasks: identify the different senses (concepts) that a term represents, and expand each sense into a set of terms which are collectively about the sense in which the term is used. Related Work Currently, we are not aware of any available methods for topic expansion in the specific form introduced above. However, the Word Sense Disambiguation (WSD) algorithms of Widdows and Dorow (2002) and Dorow et al (2005) are closely related. As discussed earlier, they use a co-occurrence graph constructed out of lists of terms which share a semantic sibling relationship. To disambiguate the senses of a term, its neighborhood graph is clustered into different components belonging to distinct senses. In a similar vein, SenseClusters by Purandare and Pedersen (2004) is a tool which uses co-occurrences as feature vectors so as to use existing clustering techniques. Along with co-occurrences, they also use other features, derived using heuristics or from algorithms like LSA, to find clusters representing different senses of a given word. Though word sense disambiguation is an integral part of topic expansion, it does not address the problem completely. In topic expansion, we need to represent the ordering of the terms based on their importance to the topic term; this notion of importance is generally absent in word sense disambiguation algorithms. Another relevant area of work is topic modeling. A topic modeling algorithm like Latent Dirichlet Allocation (LDA), by Blei et al (2003), can be seen to be closely related to topic expansion. Given a corpus, LDA assumes that each document is made up of multiple topics mixed in different proportions. Using this assumption, it tries to cluster terms in the corpus along the different topics that represent the corpus. While this is different from expanding a topic given the topical anchor term, we will suitably modify LDA to use it as a benchmark for comparison.
Topic expansion finds various applications where a more detailed description of a topic is needed, as well as in sense disambiguation. As with the rest of the episodic hypotheses described earlier, we follow the 3-step approach to defining topic expansion. First, we define topic expansion in the analytic layer. Then, we hypothesize what approximates the analytic definition in terms of episodic observations (the observational hypothesis). Finally, we explain the co-occurrence algorithm which approximates the analytic definition and the observational hypothesis. Analytic Definition For a topic s, a topic expansion TE(s) = {c_1, c_2, c_3, ..., c_n} is a set of concepts which collectively display a high aboutness for s. When describing the results of a topic expansion, it also makes sense to order them in terms of their individual aboutness scores for s, making TE(s) a tuple of terms ⟨c_1, c_2, c_3, ..., c_n⟩, where:

A_{c_1}(s) \geq A_{c_2}(s) \geq A_{c_3}(s) \geq ... \geq A_{c_n}(s)

We know that a concept representing a topic has an aboutness score of 1 for itself. Hence it will always be the first term in TE(s); that is, c_1 = s. Observational Hypothesis In a long enough conversation about a topic t, the probability of the joint occurrence of concepts about t, including t itself, is much higher than the joint probability of concepts unrelated to t. In other words, as the length and the number of episodes about topic t increase, topically relevant terms tend to cluster together within and across episodes, and these clusters tend to include the topical term t. The last condition in the hypothesis makes it consistent with the episodic hypothesis for topical anchors. Co-occurrence Algorithm For a given term t, topic expansion is done in four steps. In the first step, called the cluster generation step, we create clusters of terms in N(t) based on their generatability with t. In the second step, called the cluster merge step, we merge the generated clusters based on similarity. The third step, called the filtration step, removes redundant clusters; and the final step, called the ranking step, orders the terms in each cluster based on their relevance to the topical term t. Each of these steps is described in more detail below. Cluster generation: Given the topical term t, we start expanding t to include other co-occurring terms based on their generatability. If |N(t)| = k, a total of k clusters are initially generated. This is done by running a cluster-generate algorithm k times, once with each co-occurring term. The ith run of the cluster-generate algorithm generates a cluster using t and the ith most generatable term in N(t). In any given run, as the expanded set grows, we consider it as one unit and take the generatability scores from the focus of the expanded set. The cluster-generate algorithm for the ith run is described in Algorithm 1 below. The term u_i, which is the ith most generatable term from t, is used as the “key” sense term for expanding the ith most important cluster. The index i associated with a cluster represents the importance rank of the generated cluster. Cluster merging: Cluster generation generates all possible sets of closely occurring terms from the given topic t. After this step, two or more clusters may represent the same sense and are redundant as separate clusters.
ALGORITHM 1: Cluster generation with the ith most generatable term
Data: Corpus co-occurrence graph G; topic term t; co-occurring term u, where u is the
      ith most generatable term in the neighborhood of t when this algorithm is called
      for the ith time
Result: Cluster Ci of terms containing t, u and a subset of terms from N(t)
X ← {t, u}
while N(X⊥) exists do
    let v ∈ N(X⊥) be the term in N(X⊥) with the highest generatability Γ_{X⊥→v}
    X ← X ∪ {v}
end
return X
In the second step, namely the cluster merge step, we merge clusters iteratively based on their similarity. The similarity between clusters C_a and C_b is given by:

O(C_a, C_b) = \frac{|C_a \cap C_b|}{\min(|C_a|, |C_b|)}    (14)
Algorithm 2 below describes cluster merging.
ALGORITHM 2: Cluster merging
Data: Set of clusters C resulting from cluster generation; similarity threshold α
Result: Set of merged clusters S, with |S| ≤ |C|, representing the different predominant
        senses of the query topic t
S ← C
while there exist clusters Ci, Cj ∈ S such that O(Ci, Cj) ≥ α do
    S ← S \ {Ci}
    S ← S \ {Cj}
    S ← S ∪ {Ci ∪ Cj}
    assign min(i, j) as the cluster index of the newly created cluster Ci ∪ Cj
end
return S
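A compact Python rendering of Algorithms 1 and 2 is sketched below. The helper most_generatable (which should return the neighbor of the current cluster focus with the highest generatability, or None when N(X⊥) is exhausted) and the loop structure are our assumptions; α = 0.9 is the value used in the experiments that follow.

def generate_cluster(t, u, most_generatable):
    # Algorithm 1: grow a cluster from the topic term and a key sense term.
    cluster = {t, u}
    while True:
        v = most_generatable(cluster)
        if v is None:
            return cluster
        cluster.add(v)

def overlap(a, b):
    # Equation (14): overlap coefficient between two clusters.
    return len(a & b) / min(len(a), len(b))

def merge_clusters(clusters, alpha=0.9):
    # Algorithm 2: merge clusters whose overlap exceeds alpha; list order
    # encodes the importance rank, and a merge keeps the lower index.
    merged = list(clusters)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if overlap(merged[i], merged[j]) >= alpha:
                    merged[i] = merged[i] | merged[j]
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged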
The outcome of the cluster merge step should be dominant clusters depicting the different senses of the topic t. However, depending on the vagaries of the threshold parameter, there could be extraneous clusters left over which are not highly generatable from t and do not depict any major sense of the topic t. Filtration: In the third step, called the filtration step, such extraneous clusters are filtered out. Recall that the cluster index represents the “importance” of a cluster, i.e., the dominance of the sense that the cluster represents; the lower the index, the greater the importance. When two clusters are merged in step two, they retain the lower index. This means that leftover extraneous clusters tend to have a higher index, and hence lower importance. In the filtration step, we drop such low-importance clusters using a logic similar to cluster merging. Each time, the pair of clusters with the highest overlap is chosen, and the less important of the two is dropped. This process is repeated until the maximum overlap between any two clusters in the overlap matrix drops below another, more moderate threshold β. After doing this, we arrive at n ≤ |S| clusters, each representing a different dominant sense of the topic.
Input term (t): Amazon
  Sense 1: Amazon, Amazon River, Amazon Rainforest, Rainforest, Brazil, Peru, Andes
  Sense 2: Amazon, Amazon.com, Brazil, Consumer electronics, Services, MP3, Internet, Company, October 23, Software
  Sense 3: Amazon, Amazons, River, Artemis, Nile, Greek, Herodotus, Mythology, Civilization, Black sea

Input term (t): Corpus
  Sense 1: Corpus, Habeas corpus, Eighth amendment, Mandamus, Capital punishment in the United States, Writ, Appeal, Sentence, Amendment, Governor
  Sense 2: Corpus, Native speaker, Word sense disambiguation, Natural language, Machine translation, Computational linguistics, Natural language processing, Substitution, Linguistic typology, Recognition
  Sense 3: Corpus, Hippocratic corpus, Kos, Hippocrates, History of medicine, Oath, Acute, New York university, Medical, Byzantine

Input term (t): Filter
  Sense 1: Filter, Boolean prime ideal theorem, Order theory, Ideal, Partially ordered set, Boolean algebra, Glossary of order theory, Infimum, Axiom of choice, Lattice
  Sense 2: Filter, Glass, Water, Light, Metal, Oxygen, Color, Heat, Chemistry, Fish
  Sense 3: Filter, Camera, Exposure, Photography, Glass, Lens, Photographic film, Visible light, Light, Optics
Table 6 Example results from the topic expansion experiments
Ranking: In the final step, called the ranking step, we rank the terms in each topic expansion cluster in decreasing order of their importance for the sense of the cluster. The importance of a term to a sense is computed with a measure called the exclusivity score. The exclusivity of two terms t_m and t_n is defined as:

E(t_m, t_n) = \Gamma_{t_m \to t_n} \cdot \Gamma_{t_n \to t_m}    (15)
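Equation 15 and the ranking step admit a direct transcription, again assuming a dict-of-dicts representation of the generatability (the names are illustrative):

def exclusivity(a, b, gamma):
    # Equation (15): product of the two-way generatabilities.
    return gamma.get(a, {}).get(b, 0.0) * gamma.get(b, {}).get(a, 0.0)

def rank_cluster(topic, cluster, gamma):
    # Ranking step: order cluster terms by decreasing exclusivity with the topic.
    return sorted(cluster,
                  key=lambda term: exclusivity(topic, term, gamma),
                  reverse=True)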
Exclusivity is the product of the two-way generatabilities and is an undirected score on the co-occurrence graph. The exclusivity score of a term with respect to the topic represents not only the importance of the term to the topic, but also the importance of the topic to the term. Table 6 shows some examples of topic terms and the dominant senses expanded using the above algorithm.

5.4.1 Experimental Results
To evaluate topic expansion, 25 polysemous terms were chosen as topics, so as to demonstrate the sense disambiguation aspect of the algorithm. For each of these 25 terms, the topic expansion algorithm was run with α = 0.9 and β = 0.5. The results of the algorithm were compared with topic modeling and word sense disambiguation algorithms. For each of the 25 terms, we compared the topic expansions (clusters) generated by our algorithm with the topics generated by LDA. Instead of giving a topic term as input, we gave all the documents in our corpus containing the term as input to LDA. LDA generates a fixed number of topics as output; in our case, we generated 10 topics. Some of the input documents might mention the topic only in passing, which could cause LDA to generate some unrelated topics. To account for this discrepancy, we chose the three most related topics from the 10 output topics
such that they possessed distinct senses. For the same topic term, we picked the three best clusters from the results of topic expansion, using the inherent ranking of clusters generated by the algorithm. To evaluate the algorithms, we represented each cluster by its top 10 terms. To measure the goodness of the clusters, we used two metrics, cohesiveness and relatedness. Cohesiveness of a cluster is a measure, assigned by the evaluator, specifying how closely the terms within the cluster relate to each other to form a given sense. Relatedness of a cluster is a measure, assigned by the evaluator, specifying the relevance of the cluster to one of the senses of the topic term t. There are six such clusters for each input (three from LDA and three from topic expansion), and the clusters were jumbled so that an evaluator looking at the clusters would not be able to identify which algorithm generated a cluster. A total of 22 volunteers evaluated the topic expansion of each of the input topic terms. We asked them to rate each cluster on its cohesiveness and its relatedness on a scale of 0 to 3, with 0 corresponding to nonsense or completely unrelated and 3 to excellent. Such evaluations require a lot of manual effort: there were 25 topics, each having six clusters, where each cluster had 10 terms. Hence, for the complete evaluation, the volunteers had to go through 1500 terms and comprehend them before passing judgment. They were given the choice of evaluating only those sets which they were comfortable with. On average, each volunteer evaluated 15 input terms. The cohesiveness of each cluster was calculated as the mean user rating for cohesiveness of that cluster. The overall cohesiveness score of a topic term was the mean of its three cluster-wise cohesiveness scores. The overall cohesiveness score was computed for both algorithms for each of the input topic terms, as shown in Figure 11. We observed that the cohesiveness of the topic expansion clusters was better than that of the LDA clusters.
(Plot: cohesiveness scores of topic expansion vs. LDA for the 25 topic terms: address, amazon, anand, apple, auto, blackberry, boot, bowling, buffalo, cloud, coat, corpus, cream, delphi, disk, file, filter, jaguar, java, kiwi, pulsar, route, sandal, switch, window.)
Fig. 11 Comparison of cohesiveness scores between topic expansion and LDA
Similar to cohesiveness, we also computed the overall relatedness score for each topic term, for the topic expansion and LDA clusters separately. We observed that the relatedness of the topic expansion clusters was always higher than that of the LDA clusters for any given topic. The relatedness scores for the two algorithms are shown in Figure 12.
The objective of this comparison is not to show that topic expansion outperforms LDA. We understand that LDA and topic expansion are not directly comparable. However, LDA serves as a good backdrop against which we can calibrate the proposed episodic hypothesis on topic expansion, which in turn demonstrates the effectiveness of the proposed cognitive model for latent semantics mining.
(Plot: relatedness scores of topic expansion vs. LDA for the same 25 topic terms.)
Fig. 12 Comparison of relatedness scores between topic expansion and LDA.
In another experiment, we compared the relatedness scores of clusters generated using topic expansion against those of clusters generated by the word sense disambiguation algorithm proposed by Dorow et al (2005). That algorithm uses Markov Clustering (MCL) on the neighborhood graph of the topic term, after removing the topic term itself, to detect different senses. We used the same algorithm with the parameters given in the original paper (inflation parameter = 2, expansion parameter = 2). MCL performs random walks starting from different nodes in the graph; the idea is that random walks in a clustered graph tend to stay within the cluster from which they originated. Using this property, we computed the clusters and took the 10 most important nodes in each cluster for our evaluation. The evaluation was similar to the earlier ones, and the overall relatedness and cohesiveness scores were computed for each of the 25 input terms based on evaluator inputs. We found that the results of topic expansion were considerably better than those of the MCL-based clustering method, as shown in Figure 13. The word sense disambiguation algorithm was originally proposed on a specific co-occurrence graph built by connecting nouns occurring in a list; such nouns tend to be semantic siblings rather than just co-occurring entities. For example, Federer and Tennis cannot co-occur in such a graph. In this experiment, though, we used our co-occurrence graph, as the original graph is not readily available. This might be the reason why the algorithm consistently underperformed in comparison to topic expansion. We also observed that the latter method does not focus on ordering the terms within the clusters according to their importance with respect to the topic. Hence the cohesiveness scores of its clusters, based on the top 10 terms, were quite low and did not warrant a comparison.
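For reference, MCL alternates expansion (matrix powers, simulating random-walk steps) with inflation (elementwise powers followed by re-normalization, which sharpens intra-cluster flow). A minimal numpy sketch with the parameters above is given below; it is an illustration only and omits the pruning and sparse-matrix machinery of production MCL implementations.

import numpy as np

def mcl(adjacency, expansion=2, inflation=2, iterations=100, tol=1e-6):
    # adjacency: square numpy array of non-negative edge weights.
    m = adjacency.astype(float) + np.eye(len(adjacency))  # self-loops aid convergence
    m /= m.sum(axis=0)                                    # column-stochastic
    for _ in range(iterations):
        prev = m.copy()
        m = np.linalg.matrix_power(m, expansion)          # expansion step
        m = m ** inflation                                # inflation step
        m /= m.sum(axis=0)                                # re-normalize columns
        if np.allclose(m, prev, atol=tol):
            break
    # Rows with significant diagonal mass act as cluster attractors; nodes
    # reachable from the same attractor form one cluster.
    attractors = np.where(m.diagonal() > tol)[0]
    clusters = {tuple(np.where(m[a] > tol)[0]) for a in attractors}
    return [list(c) for c in clusters]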
(Plot: relatedness scores of topic expansion vs. word sense disambiguation for the 25 topic terms.)
Fig. 13 Comparison of relatedness scores between topic expansion and word sense disambiguation.
These experiments validate that the topic expansion algorithm gives an ordered set of highly exclusive terms with respect to the topic, and that it compares favorably against using existing topic modeling and word sense disambiguation algorithms to solve this problem.
6 Conclusion
The research presented in this paper represents a significant departure in how the problem of mining latent semantics is approached. In contrast to vector space or generative models, we look to cognitive science and analytic philosophy to build background models of latent semantics. Cognitive modeling has a rich body of literature which can deeply impact research in text mining and analytics. In fact, our contention is that the notion of analytics will eventually give way to a notion of cognitics. While analytics is primarily about extracting knowledge from data, cognitics is about model building as well as feeding semantics back into the operations of the system from which the data is collected. Just as major application programs currently ship with an inbuilt analytics module, we envisage that future applications will ship with a cognitics module that not only extracts knowledge from the application's dynamics, but also intelligently contributes to those dynamics. Rudimentary forms of cognitics already exist in the form of recommender systems. We believe that the proposed model, which views semantics from three different layers (linguistic, episodic and analytic), will form an important element of the design of any cognitics module.

Acknowledgements The authors would like to thank Mandar R. Mutalikdesai and all the Masters' students and other volunteers who contributed to the implementation as well as the evaluation of the different episodic hypotheses.
References

Abiteboul S, Preda M, Cobena G (2003) Adaptive on-line page importance computation. In: Proceedings of the 12th international conference on World Wide Web, ACM, New York, NY, USA, WWW '03, pp 280–290
Agrawal R, Gollapudi S, Halverson A, Ieong S (2009) Diversifying search results. In: Proceedings of the Second ACM International Conference on Web Search and Data Mining, ACM, New York, NY, USA, WSDM '09, pp 5–14
Anthes G (2010) Topic models vs. unstructured data. Communications of the ACM 53:16–18
Arora R, Ravindran B (2008) Latent dirichlet allocation based multi-document summarization. In: Proceedings of the second workshop on Analytics for noisy unstructured text data, ACM, New York, NY, USA, AND '08, pp 91–97
Berry MW (2003) Survey of Text Mining. Springer-Verlag New York, Inc., Secaucus, NJ, USA
Blei D, Ng A, Jordan M (2003) Latent dirichlet allocation. Journal of Machine Learning Research 3:993–1022
Brunzel M, Spiliopoulou M (2006) Discovering semantic sibling groups from web documents with xtreem-sg. In: Proceedings of the 15th international conference on Managing Knowledge in a World of Networks, Springer-Verlag, Berlin, Heidelberg, EKAW '06, pp 141–157
Buitelaar P, Cimiano P, Magnini B (2005) Ontology learning from text: An overview. Ontology learning from text: Methods, evaluation and applications 123:3–12
Busemeyer J, Diederich A (2010) Cognitive modeling. Sage
Carreras X, Màrquez L (2005) Introduction to the CoNLL-2005 shared task: Semantic role labeling. In: Proceedings of the Ninth Conference on Computational Natural Language Learning, Association for Computational Linguistics, pp 152–164
Coppola B, Moschitti A, Pighin D (2008) Generalized framework for syntax-based relation mining. In: 2008 Eighth IEEE International Conference on Data Mining, IEEE, pp 153–162
Dagan I, Lee L, Pereira FC (1999) Similarity-based models of word cooccurrence probabilities. Machine Learning 34(1):43–69
Deerwester S, Dumais S, Furnas G, Landauer T (1990) Indexing by latent semantic analysis. Journal of the American Society for Information Sciences 41:391–407
Deng H, Zhao B, Han J (2011) Collective topic modeling for heterogeneous networks. In: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, ACM, New York, NY, USA, SIGIR '11, pp 1109–1110
van Dongen S (2000) Graph clustering by flow simulation. PhD thesis, University of Utrecht, Utrecht
Dorow B, Widdows D, Ling K, Eckmann JP, Sergi D, Moses E (2005) Using curvature and Markov clustering in graphs for lexical acquisition and word sense discrimination. In: 2nd Workshop organized by the MEANING Project (MEANING-2005), Trento, Italy
Errami M, Sun Z, George AC, Long TC, Skinner MA, Wren JD, Garner HR (2010) Identifying duplicate content using statistically improbable phrases. Bioinformatics 26(11):1453–1457
Ghose A, Ipeirotis PG, Sundararajan A (2007) Opinion mining using econometrics: A case study on reputation systems. In: Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics (ACL)
Greenberg DL, Verfaellie M (2010) Interdependence of episodic and semantic memory: Evidence from neuropsychology. Journal of the International Neuropsychological Society 16:748–753
Griffiths TL, Steyvers M (2004) Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America 101(Suppl 1):5228–5235
He Y, Xin D (2011) SEISA: Set expansion by iterative similarity aggregation. In: Proceedings of the 20th international conference on World Wide Web, ACM, New York, NY, USA, WWW '11, pp 427–436
Hofmann T (1999) Probabilistic latent semantic indexing. In: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR 1999), pp 50–57
Hotho A, Nuernberger A, Paass G (2005) A brief survey of text mining. LDV Forum - GLDV Journal for Computational Linguistics and Language Technology 20(1):19–62
Jones KS (1972) A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation 28:11–21
Kim YM, Pessiot JF, Amini MR, Gallinari P (2008) An extension of PLSA for document clustering. In: Proceedings of the 17th ACM conference on Information and knowledge management, ACM, New York, NY, USA, CIKM '08, pp 1345–1346
Kleinberg JM (1999) Authoritative sources in a hyperlinked environment. J ACM 46(5)
Krestel R, Fankhauser P, Nejdl W (2009) Latent dirichlet allocation for tag recommendation. In: Proceedings of the third ACM conference on Recommender systems, ACM, New York, NY, USA, RecSys '09, pp 61–68
Lamberts K, Goldstone R (2005) Handbook of cognition. SAGE
Landauer TK, Dumais ST (1997) A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review 104(2):211–240
Lund K, Burgess C, Atchley RA (1995) Semantic and associative priming in high-dimensional semantic space. Conference of the Cognitive Science Society 17
Màrquez L, Carreras X, Litkowski K, Stevenson S (2008) Semantic role labeling: an introduction to the special issue. Computational Linguistics 34(2):145–159
Mihalcea R, Tarau P (2004) TextRank: Bringing order into texts. In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP 2004)
Miller T (2003) Essay assessment with latent semantic analysis. Journal of Educational Computing Research 29(4)
Moore GE (1905) The nature and reality of the objects of perception. Proceedings of the Aristotelian Society 6:68–127
Pang B, Lee L (2008) Opinion mining and sentiment analysis. Found Trends Inf Retr 2:1–135
Phillips W, Riloff E (2002) Exploiting strong syntactic heuristics and co-training to learn semantic lexicons. In: Proceedings of the ACL-02 conference on Empirical methods in natural language processing - Volume 10, Association for Computational Linguistics, Stroudsburg, PA, USA, EMNLP '02, pp 125–132
Pilz A, Paaß G (2011) From names to entities using thematic context distance. In: Proceedings of the 20th ACM international conference on Information and knowledge management, ACM, New York, NY, USA, CIKM '11, pp 857–866
Polk T, Seifert C (2002) Cognitive modeling. Bradford Books, MIT Press
Pradhan S, Hacioglu K, Ward W, Martin JH, Jurafsky D (2003) Semantic role parsing: Adding semantic structure to unstructured text. In: Proceedings of the Third IEEE International Conference on Data Mining, IEEE Computer Society, Washington, DC, USA, ICDM '03, pp 629–
Preston A (2006) Analytic philosophy
Purandare A, Pedersen T (2004) SenseClusters: finding clusters that represent word senses. In: Demonstration Papers at HLT-NAACL 2004, Association for Computational Linguistics, Stroudsburg, PA, USA, HLT-NAACL-Demonstrations '04, pp 26–29
Rachakonda AR, Srinivasa S (2009a) Finding the topical anchors of a context using lexical co-occurrence data. In: Proceedings of the 18th ACM conference on Information and knowledge management, CIKM '09, p 1741
Rachakonda AR, Srinivasa S (2009b) Vector-based ranking techniques for identifying the topical anchors of a context. In: Proceedings of COMAD 2009
Rohde DLT, Gonnerman LM, Plaut DC (2004) An improved method for deriving word meaning from lexical co-occurrence. Cognitive Science
Rubenstein H, Goodenough JB (1965) Contextual correlates of synonymy. Commun ACM 8(10):627–633
Russell B (1919) The Philosophy of Logical Atomism. The Monist, Open Court
Sahlgren M (2006) The word-space model. PhD thesis, Stockholm University
Sarmento L, Jijkuon V, de Rijke M, Oliveira E (2007) “More like these”: Growing entity classes from seeds. In: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, ACM, New York, NY, USA, CIKM '07, pp 959–962
Sebastiani F (2002) Machine learning in automated text categorization. ACM Comput Surv 34:1–47
Shivashankar S, Srivathsan S, Ravindran B, Tendulkar AV (2011) Multi-view methods for protein structure comparison using latent dirichlet allocation. Bioinformatics 27(13):i61–i68
Steyvers M, Griffiths T (2007) Probabilistic topic models. Chapter in: Latent Semantic Analysis: A Road to Meaning
Tulving E (1972) Episodic and semantic memory. In: Tulving E, Donaldson W (eds) Organization of Memory
Turney PD (2001) Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In: Proceedings of the Twelfth European Conference on Machine Learning (ECML-2001)
Turney PD, Littman ML (2003) Measuring praise and criticism: Inference of semantic orientation from association. ACM Trans Inf Syst 21:315–346
Vulić I, De Smet W, Moens MF (2011) Identifying word translations from comparable corpora using latent topic models. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2, Association for Computational Linguistics, Stroudsburg, PA, USA, HLT '11, pp 479–484
Wahabzada M, Kersting K, Pilz A, Bauckhage C (2011) More influence means less work: fast latent dirichlet allocation by influence scheduling. In: Proceedings of the 20th ACM international conference on Information and knowledge management, ACM, New York, NY, USA, CIKM '11, pp 2273–2276
Wei X, Croft WB (2006) LDA-based document models for ad-hoc retrieval. In: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, ACM, New York, NY, USA, SIGIR '06, pp 178–185
Wei-jiang L, Tie-jun Z, Wen-mao Z (2009) PLSA-based query expansion. In: Proceedings of the 2009 Eighth IEEE/ACIS International Conference on Computer and Information Science, IEEE Computer Society, Washington, DC, USA, pp 400–405
Widdows D, Dorow B (2002) A graph model for unsupervised lexical acquisition. In: 19th International Conference on Computational Linguistics, pp 1093–1099
Wittgenstein L (1922) Tractatus Logico-Philosophicus. London: Routledge, 1981
Wittgenstein L (1953) Philosophical Investigations. Blackwell, Oxford, translated by G.E.M. Anscombe
Zu G, Ohyama W, Wakabayashi T, Kimura F (2003) Accuracy improvement of automatic text classification based on feature transformation. In: Proceedings of the 2003 ACM symposium on Document engineering, ACM, New York, NY, USA, DocEng '03, pp 118–120