Semantic Patterns between Documents and the Vocabulary

Eric Van Horenbeeck

Faculty of Arts and Philosophy, Department of Linguistics

Topical Facets:

Semantic Patterns between Documents and the Vocabulary

Topical facets:

semantic patterns between documents and the lexicon

Dissertation submitted in fulfillment of the requirements for the degree of Doctor in Linguistics and Literature at the University of Antwerp, to be defended by Eric Van Horenbeeck. Supervisor: Prof. Dr. Walter Daelemans

Antwerpen, 2008


© Copyright 2008 - Eric Van Horenbeeck. All rights reserved. No part of this book may be reproduced or transmitted in any form or by any means, electronic or mechanical, including print, photocopying, recording, or by any information storage and retrieval system, without written permission from the author. [email protected]
Design: Niko Van Horenbeeck
ISBN: 978-90-5728-105-1


Interpretation would be impossible if the life-expressions were totally alien. It would be unnecessary if there were nothing alien about them. It must therefore lie between the two extremes.
Wilhelm Dilthey (1833-1911), quoted in Jürgen Habermas, 1972: 164


Abstract

An author uses words from the lexicon, organized according to a set of grammatical rules, a process permitting the production of an unlimited stream of texts. Once finished, the document petrifies: changing a word or adding a comma brings forth a new text. There is, however, more structure in language productions than loose words and finished documents. The amount of reused text chunks is striking, chunks that are more complex than words but less so than the text itself or its main components. These chunks make up a third layer between petrified documents and the constituent tokens, a layer of recycled phrases. Texts that share many of these fragments also have a meaning in common, hence the name topical facets. The word topic refers to the pivotal theme of a language production; it concerns what a text is about. The term facet specifies that only a fraction of the whole topic is available; many topical facets are needed to compose a full topic. When we collect texts based on the topical facets they have in common, we posit at the same time that the documents are also gathered on the basis of a shared meaning. A relatively recent theory underpins this conjecture, a theory describing natural language with a special network class: the small-world network. The theory acknowledges that words show different frequencies of use, but argues in particular that words maintain more or less intense relations with other words. The network captures these relations as word types with their combination rules. A small-world network is dynamic: it is possible to add and remove texts without recalculating the whole structure. A language user follows a path over the network that ultimately results in the text he or she will pronounce or write down. Efficient communication forces the language user to choose words, phrases, and rules that are available in the network and that are also used by others when bringing up the same subject. The small-world theory is the central concept guiding the development of an instrument to describe (digital) documents. The instrument is a network construction fed with a constant flow of texts. After each step, word relations are analyzed without human intervention, which allows the application to learn how documents are connected. In a digital environment the application thus gets a head start on a person who might not know the content. To check the assumptions, we use a 30,000-document corpus from eight different news providers. Every article has been manually annotated by researchers in the US and assigned, where appropriate, to one of sixty different news topics. In this controlled environment we test the application on its ability to allocate each document to the right topic without external help. Unsupervised information detection and document classification based on certain properties are possible uses of topical facets. As part of this dissertation we looked into the semantic component; further research on other language features present in the network is conceivable.


Summary

An author uses words from a lexicon, grouped according to the rules of grammar. This process permits the production of an unlimited number of texts. Once completed, the document petrifies. Exchanging one word or adding a comma is enough to turn it into a different text. Yet there is more structure in language productions than loose words on the one hand and finished documents on the other. It is striking how large the share is of reused text chunks that are more complex than words but simpler than the whole text or its parts. They form a third layer between the petrified document and the words that built it, a layer of recycled word sequences. Texts that share many such fragments have a meaning in common, hence the name topical facets. Topic refers to the central theme of a language production, what a text is about. The term facet indicates that only a fragment of the whole topic is involved. Several topical facets are needed to compose a complete topic. Since we can group texts on the basis of the topical facets they have in common, we assume that documents are thereby also grouped on the basis of the meanings they share. Behind this hypothesis lies a relatively recent theory that describes natural language with a special kind of network: the small-world network. The theory acknowledges that words have different frequencies of use, but holds above all that words maintain more or less intense relations with other words. The network captures these relations as word types with their combination rules. A small-world network is dynamic. Texts can be added and removed without rebuilding the entire network. A language user follows a path over the network that ultimately yields the text he or she will utter or write down. Efficient communication obliges the language user to choose words, word sequences, and combination rules that are available in the network and that are partly used by others when the same subject comes up. In this dissertation the small-world theory is the central concept behind the development of an instrument for describing (digital) documents. The instrument is a network construction to which texts are continuously added. After each step the relations between words are analyzed without human intervention. In this way the application learns what binds documents together, giving it a head start on the human user who, in a digital environment, does not know the content of the files. We use a corpus of more than 30,000 news articles from eight different sources to test the premises. Researchers in the US manually annotated all the articles and, where appropriate, assigned them to one of sixty different topics. This creates a controlled environment in which we examine whether the application can group the news articles by topic without outside help. Automatic information detection and document classification based on certain properties are some applications of topical facets. Within the scope of this dissertation the semantic component was examined; further research could study other language structures in the network.


Preface

About ten years ago I had the privilege to work on the development of a comprehensive electronic clipping service funded by the joint Belgian newspaper publishers. A key feature was a computer-controlled selection of news clippings based on an individual query profile. The Belga News Agency now hosts Mediargus, as the package is called, and serves over 11,000 professional customers on a daily basis (www.mediargus.be). At the time, commercial information retrieval systems offered keyword search over an inverted document list augmented with Boolean operators. A while later I became involved in a project to continuously collect all cultural activities nationwide into a proprietary database structure, and to redistribute the data to affiliated subscribers according to dedicated publishing schemes. The Flemish regional government has taken over the cultural database, which has operated as a public service since (www.cultuurweb.be). Here too there was a need for a flexible gathering method. Just as in the previous case, the available search and retrieval possibilities fell somewhat short of what I expected. Both experiences made me curious about the limits and possibilities of information storage and retrieval, and about the underlying principles that direct research in this field. Alcatel granted me a three-year scholarship on a thesis proposal to investigate a number of computational ideas, such as the notion of a semantic network and the emergence of meaning from a stream of symbols, and to build a practical device to handle input and exhibit language patterns. Soon afterwards Alcatel withdrew from the project as the result of a worldwide reorganization. I decided not to break off the work and provided the remainder of the funding myself by combining the Ph.D. work with a job outside academia, resulting in a production process that lasted longer than anticipated.

The present dissertation is set in the domain of computational linguistics. The non-specialist reader with an interest in the field may consult The Oxford Handbook of Computational Linguistics (Mitkov, 2005) for a general introduction or for an overview of its main components. The work presented here implies the existence of a sizeable piece of software. Because the program as such is not the focus of the following pages, user documentation is not enclosed. However, the thesis includes pseudo-code of key software components. The full source code, written in Java, with its companion documentation is available on request. The program is working and demonstrable. The bibliographic reference style in this work is APA 5, based on the Publication Manual of the American Psychological Association, Fifth Edition (2001).


Acknowledgements

First, I wish to thank my supervisor Prof. Dr. Walter Daelemans for his guidance throughout my work at the University of Antwerp and for providing support in moments of doubt. His constructive comments helped me to clarify my thoughts, to kill the occasional darlings, and to organize the thesis better. I especially admire his patience with a work that took time to mature and that inevitably drifted away from its initial setup. A word of thanks also to the co-director of CNTS, Prof. Dr. Steven Gillis, for shaping an environment appropriate for intellectual work, where novel ideas can sprout and grow. Doing research is a competitive affair, and Walter – together with Steven – takes great care to maintain the academic and professional status of the CNTS research groups. Multi-disciplinary research is important for CNTS, and the presence of multiple disciplines is encouraged. I am much indebted to Hans Cobben, who worked for the Alcatel Internet Division at that time and who is now with SunGard. Hans provided the initial grant that allowed me to start this undertaking. I thank the numerous other people with whom I had the chance to discuss the various problems I try to solve in the thesis. I am grateful to Dr. Toon Calders in particular for reading and commenting on some important technical aspects and for agreeing to be on my thesis committee. I would like to thank Dr. Guy De Pauw and Dr. Iris Hendrickx for the time and effort spent in reviewing this dissertation. I also want to acknowledge the current and past members of the faculty and the CNTS research groups, some of whom I met for a short period only, while with others I traveled all those years. There is my former roommate Guy and my present roommate Iris, and in alphabetical order – hoping to overlook no one – Agnita, An, Anja, Anne, Annemie, Antal, Bart, Emmanuel, Erik, Evie, Fien, Frederik, Georges, Gert D., Griet, Hanne, Helena, Inge, Jan, Jo, Karen, Kevin, Kim, Lieve, Maarten, Marc, Marie-Laure, Martin, Martine, Øydis, Renate, Véronique, and Vincent. Thank you all for the support, the lunch breaks, the dinners now and then, and for the pleasant working atmosphere. The picture has another side too. Making a dissertation is not conducive to one's social life; anyone who went through it will attest to that. I bear the responsibility for the neglect of personal relations with many much-loved people. Even more so, I am deeply grateful for the unconditional friendship and support I received from my friends and from all the members of my family.


Table of Contents

1. Introduction 11
1.1 An Overview 11
1.2 Main Proposition and Contributions 13
1.3 Short Walkthrough 16
1.4 Prosthesis as a Meta-Model 22
1.5 Corpus, Collections, Documents, Texts, and Words 23
1.6 Structure of the Thesis 26

2. Topical Facets 28
2.1 Introduction 28
2.2 Inferring Content 30
2.3 Topical Facets at Work 36
2.4 Analysis of Non-Combined Documents 41
2.5 Open Sets and the Closure Issue 44
2.6 Generalizing Topical Information 48
2.7 Topics and Queries 54
2.8 Queries by Proxy 58
2.9 Semantic Preference Clustering 61
2.10 Related Research 64
2.11 Summary 67

3. Experiments and Evaluation 69
3.1 Introduction 69
3.2 The Corpus 71
3.3 Experimental Setup 72
3.4 Stratified Sample Experiment 75
3.4.1 Cambodian Elections 76
3.4.2 Pinochet Trial 77
3.4.3 Osama bin Laden Indictment 79
3.4.4 US Congressional Elections 81
3.4.5 Chechnya Rebel Kidnapping 83
3.4.6 US Secretary Richardson's Visit to Taiwan 84
3.5 Topic Detection Experiment 85
3.5.1 The TDT 2000 Challenge 86
3.5.2 Tracking and Detection Defined 87
3.5.3 Overview of the Submitted Systems 87
3.5.4 UMass and the Topical Facet Application 88
3.6 Discussion 90
3.7 Related Research 93
3.8 Summary 95

4. Text Network 96
4.1 Introduction 96
4.2 Units and Time in a Text Network 98
4.3 Network Relations 102
4.4 Similarity of Documents 105
4.5 Topological Awareness 106
4.6.1 Context and Meaning 108
4.6.2 The tf*idf Formula Revisited 110
4.6.3 The Information Spectrum 114
4.7 Informative Arcs and Allocations 117
4.8 Text Network as Data Compression 121
4.9 Computational Complexity 124
4.10 Related Research 126
4.10.1 Many Guises of a Network 126
4.10.2 Related Work on Co-Occurrences and Stemming 128
4.10.3 Summarization 130
4.11 Summary 132

5. From Random Relations to Self-Organizing Structures 134
5.1 Introduction 134
5.2 Are Networks Necessary? 136
5.3 From Erdös to Milgram 143
5.4 Long Paths, Short Paths, and a Random Walk 146
5.5 Topological Awareness 148
5.6 Growth and Decay 150
5.7 Language as a Hierarchical and Modular Network 155
5.8 Community Driven Content 158
5.9 The Importance of Time 160
5.9.1 Short-term Horizon 160
5.9.2 Long-term Horizon 161
5.10 Related Research 162
5.11 Summary 166

6. General Conclusion and Further Research 167
6.1 Conclusion 167
6.2 Further Research 168
6.2.1 Hubs and Authorities 169
6.2.2 Abstract Concepts 170
6.2.3 Summarization 170
6.2.4 Directed Acyclic Word Graph (DAWG) 171
6.2.5 Semantic Web 171

Appendix 173
Topic Detection and Tracking Tasks 173
Israel-Hamas Documents 174
Gray Whales 185
American Football 186
Pinochet 188
US Secretary Richardson's Visit to Taiwan 190
US Congressional Elections 192
Chechnya Rebel Kidnapping 194
Topic Detection and Tracking Results 196
Timing Four Network Building Tasks 197

Glossary 198
Tables and Figures 204
References 207
Index 218


1. Introduction

1.1 An Overview

In bygone days accessing relevant information was usually described as the retrieval of information out of a fixed collection; a traditional, well-controlled collection, as the Google founders Sergey Brin and Lawrence Page call it (Brin & Page, 1998). The notion of a fixed collection supposes, apart from a time frame to define the scope of fixed, a gatekeeper function to regulate and control the admission of a document to the collection. The gatekeeper is not necessarily a person. Dewey's Decimal Classification is an example of a gatekeeper function. First published in 1876, it has been in continuous use to the present day in more than 200,000 libraries worldwide. Every book gets one entry. For instance, specific breeds and kinds of horses are stored under the index 636.11-636.17. A library patron must have prior knowledge of horses and know that an Arabian is an Oriental horse, so that he or she can look up the link for an Oriental horse to find the entry for the Arabian horse. Another widely accepted system of library cataloging is the Universal Decimal Classification (UDC), developed by the Belgian bibliographers Paul Otlet and Henri La Fontaine at the end of the 19th century. It is derived from Dewey's decimal classification. The method includes faceted elements through the application of additional symbols to indicate relationships between subjects. UDC has been extended over many years to cope with the increasing number of disciplines of human knowledge and now contains over 220,000 subdivisions (Rayward, 1997; Levie, 2006). The Decimal Classification is designed for the arrangement of books on library shelves. In our shelfless networked environment there is a constant and growing flow of electronic content challenging the gatekeeper's set of criteria. The gatekeeper's plight is the risk of turning away, misplacing, or overlooking previously unseen documents (Eppler & Mengis, 2004; Williams & Delli Carpini, 2004). According to Bella Hass Weinberg, professor of Library and Information Science at St. John's University in New York, most of the sophisticated cataloging schemes developed to get a grip on the matter had already been abandoned or ignored a decade ago (Weinberg, 1996). People consult the Internet for the most recent information, and they will not wait for an international committee to decide in which division to place a new topic. Today, librarians involved in categorization encounter accusations of bias, lack of specificity, and the widespread belief that no human content analysis is necessary since every word is searchable. There are at least three problems connected to this conviction: it is not evident what a body of digital data is about; a user might not even know how to formulate a query to unlock the information available to him; and knowing a search word might result in an overload of out-of-context documents, while having a word without knowing its synonyms and derivations often results in overlooking potentially relevant data. A few years back The Scientist devoted a cover story to the issue (Fogarty & Bahls, 2002). The journal quotes a professor of biochemistry at Stanford's School of Medicine who challenges students in his computational biology classes to search the university's database for yeast membrane proteins. Without knowing all the synonyms for key membrane proteins, such as transmembrane, innermembrane, and outermembrane, the students generally find 20 to 200 instances, while the database contains almost 2,000 proteins.
The same article recounts how researchers at the University of Minnesota discovered after three years of research that results they were writing up had already been published: a gene recognized by one name in the United States has another name in Spain or France. More recently, Communications of the ACM raised the same issue. In a headline article Ben Shneiderman remarks that finding relevant items with traditional keyword search may prove challenging; searchers often need prolonged assistance in discovering the range of possibilities and in learning the concepts and terminology of the classification schemes (Shneiderman, 2007). The impact of similar threats and opportunities of content management on civil society and the business environment is the subject of a detailed report by the European Commission (Stanbridge et al., 2004). The following chapters discuss a computational approach that analyzes the content of documents, unsupervised and without the assistance of a priori knowledge. The focus will be on the extraction of semantic information. At the risk of being trapped in an endless recursion of interpreting words with other words, the sense of the phrase semantic information needs a short clarification. In an overview Luciano Floridi observes that different authors use different explanations for the term semantic and the term information, depending on the cluster of requirements and desiderata orienting their research (Floridi, 2005). In the context of this work, semantic information stands for symbols having the possibility to elicit meaning for the human user of a computer system. The meaning of a word is a function of the relationships it contracts with other words in a particular lexical field or subsystem, which organizes related words and expressions and shows their relationship to one another; it cannot be adequately described except in terms of these relationships. The concept returns in chapter 4, where it underpins the calculation of the informative value of words. Paragraph 1.4 will take a closer look at the difficulty with conceptual definitions and proposes to consider the program under study as a prosthesis. Analysis of content leads to a computer-generated document description labeled topical facet: topical because it links two or more documents based on semantic agreement, facet because the accord is highly fractional. The semantic agreement in question is a collection of shared phrases or multi-word units composed of significant words. The fact of being fractional relaxes the requirement of some knowledge-representation systems to allocate a single topic or a single concept to a document or a document subdivision. One might picture topical facets as occupying the middle ground between collections of documents and an inventory of words obtained from those documents. The words drawn from a general vocabulary at the command of the author constitute an assembly of unordered building blocks, while a document is a fixed language production. Topical facets are situated in between: they have more structure than loose words, but unlike a document they are not confined to a specific language realization. On the one hand a topical facet is composed of words linking it to the vocabulary, and on the other hand it is related to two or more documents. In spite of its fractional nature this representation succeeds in bringing about a family resemblance between groups of documents. The resemblance can be used to assist in a number of document-related tasks. The retrieval of texts based on a shared topic will demonstrate the usefulness of the underlying technology and will test its premises.
Topical facets allow a polythetic classification as discussed in van Rijsbergen: a polythetic class possesses a minimal number of defining characteristics, though none of the features has to be found in each member of the category. This is in contrast to a monothetic class, defined in terms of criteria that are both necessary and sufficient to identify members of that class (van Rijsbergen, 1979: 28–30).
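A minimal sketch of the two membership rules may make the contrast concrete. The feature sets and the threshold below are invented for illustration and are not taken from the thesis:

    import java.util.*;

    // Monothetic: all criteria are necessary and sufficient.
    // Polythetic: enough of the criteria, but no single one is required.
    public class ClassificationSketch {

        static boolean monothetic(Set<String> features, Set<String> criteria) {
            return features.containsAll(criteria);
        }

        static boolean polythetic(Set<String> features, Set<String> criteria, int minMatches) {
            long matches = criteria.stream().filter(features::contains).count();
            return matches >= minMatches;
        }

        public static void main(String[] args) {
            Set<String> doc = Set.of("election", "ballot", "senate");
            Set<String> topic = Set.of("election", "ballot", "congress", "campaign");
            System.out.println(monothetic(doc, topic));    // false: 'congress' and 'campaign' are missing
            System.out.println(polythetic(doc, topic, 2)); // true: two shared features suffice
        }
    }

A document may thus belong to a topic without exhibiting every feature of that topic, which is exactly the behavior topical facets rely on.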


To evaluate the system, an annotated test corpus is used. It was developed for a specific topic detection and tracking task and contains articles from eight US news agencies. The corpus is properly introduced in chapter 3 on page 71. The task will be to extract documents related to a specific topic. In this work a topic is defined as a component with topical facets as its constituent parts. To make this process work, all documents are cast into one large unrestricted text network. An unrestricted text does not contain predetermined constructs. The number and type of the features are based on the actual word usage in the text. They cannot be anticipated in advance and are not confined to a particular domain, but are extracted while the data are coming in. No clearing of stop words or correction of typographical errors is applied. The network serving as a model is based on the description of human language in terms of a graph of word interactions. Text networks will be tackled by means of a graph-theoretical toolkit. Graphs are very convenient for the job due to their non-linear, non-hierarchical structure and because relations are central to network analysis: in contrast to the study of features or characteristics of tokens, documents, and communities, network analysis studies in essence the connections between them (Monge & Contractor, 2001: 441). A minor inconvenience is the confusion about the name of such a network. Semantic network and neural network in particular are designations often seen, but not directly applicable to the work presented here. Different readings of the network concept will receive due attention further below in the Related Research paragraphs. An important aspect explaining the successful working of human language as a network is the constant recycling of a relatively limited set of components in ever-different arrangements. According to Alexander Mehler in a recent review, text network analysis is only starting in corpus linguistics as well as in computational and quantitative linguistics. In particular, the exploration of intertextual relations beyond hyperlinks as a source of networking within text corpora is at the frontier of the field (Mehler, 2007). In current corpus linguistic studies textual networking turns up increasingly, because web-based interfaces are popular and because databases of scientific publications have opened up. There is, for instance, Medline, a biomedical bibliographic information repository with nearly eleven million records from over 7,300 different publications since 1965, or press archives in general, or repositories in areas of technical communication, such as LexisNexis with five billion searchable documents from more than 32,000 legal, news, and business sources. The practice of committing huge data repositories to trees or tables according to standard taxonomies is either unaffordable due to the cost in money or time, or has the undesirable side effect of curtailing the creativity of its users.

1.2 Main Proposition and Contributions

The research described in this dissertation is about unsupervised information discovery. For some time now a particular form of complex networks, the small-world variant, has been investigated in relation to natural language productions (e.g. the structure of the World Wide Web) and language aspects (e.g. content derivation). Watts and Strogatz, Kleinberg, Albert and Barabási, and Ferrer i Cancho, among others, developed a small-world theory of networks that will serve as the analytical model for the work at hand (Ferrer i Cancho, 2005; Albert & Barabási, 2002; Kleinberg, 2000; Watts & Strogatz, 1998). The present work offers a way to cast any number of documents, without human intervention, into a text network spanned by types of tokens as nodes and the relations between them as links. Typical of this kind of network is that every type becomes a node linked to other nodes on the condition that this link exists in at least one document. The network grows by adding and analyzing the texts in succession while remembering the publication date and source of the added data. The topical facet model is language independent, a claim based on a conjecture by Dorogovtsev and Mendes. These authors postulate that the structure of word interactions in any human language belongs to the small-world class of networks (Dorogovtsev & Mendes, 2001).

The main proposition is that the incremental growth process of the network reveals structural similarities between documents. The observed similarities take the form of a layer between the unbounded diversity of the vocabulary and the uniqueness of a document. The similarities carry semantic information and are called topical facets. In order to substantiate the claim that topical facets can be useful in processes relying on patterns of semantic similarity, an Information Retrieval (IR) task will be evaluated. Throughout the study the TDT3 corpus provides a testing environment. TDT stands for Topic Detection and Tracking, an Information Retrieval task on a purpose-built corpus. The corpus is a collection of manually categorized ground-truth documents allowing the validation of detection and tracking measurements or techniques. The collection spans the period Oct.–Dec. 1998 with data from English and Mandarin news sources and is introduced in detail on page 71 in chapter 3.

The contributions of the present work are situated on two levels. First there is the algorithmic domain. The development of a special kind of network into a theoretically well-founded language description conjecture justifies the use of an unrestricted text network for IR and Natural Language Processing (NLP) tasks. Encoding information about nodes and relations between nodes makes novel approaches with graph-based algorithms possible. The topical facet layer may be a starting point for an alternative view on how to formulate a query, and on the relation between an inquiry and a topic of interest that is, or is not, present in the data. Query by proxy is proposed as a solution when one cannot take for granted the existence of a common ground between the query and the data. The detection of relevant information proceeds in a two-step fashion. The use of topical facets allows for an initial recall step aiming at including as many relevant items in the finding set as possible. In the next phase non-relevant or weakly related items are eliminated to increase the purity of the retrieval performance. This is achieved with the concerted action of three tools: core extraction to identify the topical facets contributing most to the retrieval output, document-by-document similarity to ensure a relation between two documents based on a mutually shared semantic construction rather than on loose words alone, and semantic preference clustering to group documents around the query proxy.

The second contribution level concerns the implementation of the research results in an operational application. A network is a computationally efficient document repository. The Topical Facets Application encounters 12,388,120 tokens, but stores only approximately one percent of them (136,431) as different types.
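A minimal sketch may show why this compression falls out of the data structure: each type is stored once as a node, and repeated tokens only update counters on existing nodes and arcs. The class layout and the toy sentence below are illustrative only, not the thesis code, which is written in Java and available on request:

    import java.util.*;

    // Illustrative sketch: a type is stored once; repeated tokens merely
    // update counters on existing nodes and directed arcs.
    public class TypeStoreSketch {
        // type -> (successor type -> arc frequency)
        private final Map<String, Map<String, Integer>> network = new HashMap<>();
        private long tokenCount = 0;

        public void addDocument(String text) {
            String[] tokens = text.toLowerCase().split("\\s+");
            tokenCount += tokens.length;
            for (int i = 0; i + 1 < tokens.length; i++)
                network.computeIfAbsent(tokens[i], t -> new HashMap<>())
                       .merge(tokens[i + 1], 1, Integer::sum);   // local adjustment only
            network.computeIfAbsent(tokens[tokens.length - 1], t -> new HashMap<>());
        }

        public long tokens() { return tokenCount; }
        public int types()   { return network.size(); }

        public static void main(String[] args) {
            TypeStoreSketch net = new TypeStoreSketch();
            net.addDocument("sustained winds hit manila and sustained winds closed the port");
            System.out.println(net.tokens() + " tokens, " + net.types() + " types");
        }
    }

The same locality is what keeps the addition of a document cheap, as explained next.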


Adding data is normally computationally demanding, though not in this case: adding a document requires only local adjustments (adding new words and new arcs) and limited recalculation, making information discovery in a constant data flow feasible. Deriving semantic information in a network setting contributes to the intelligibility of the algorithmic procedures. Highly informative phrases are available to give a shorthand description of a document or a topical facet. It stimulates information discovery by circumstantially browsing a body of unknown data. The next paragraph presents a short walkthrough to make a few of the main aspects under discussion more tangible. Paragraph 1.4 explains why the application is viewed as a prosthesis. Paragraph 1.5 looks into common notions such as corpus, word, collection, community of readers, and the difference between text and document. Certain concepts obtain a specific meaning in the range of this work. The final paragraph 1.6 gives an overview of the remaining chapters.
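Before the walkthrough, a small sketch may make the two-step detection scheme described among the contributions above more concrete. The facet contents, document labels, and threshold below are invented for illustration; the actual core extraction, document-by-document similarity, and semantic preference clustering are developed in chapter 2:

    import java.util.*;

    // Illustrative two-step retrieval: a broad recall step over topical facets,
    // then a refinement step that keeps only strongly linked documents.
    public class TwoStepRetrievalSketch {

        // A topical facet links two or more documents (contents invented here).
        record Facet(int id, Set<String> documents) {}

        // Step 1 (recall): every document sharing a facet with the proxy document.
        static Set<String> recall(String proxy, List<Facet> facets) {
            Set<String> candidates = new HashSet<>();
            for (Facet f : facets)
                if (f.documents().contains(proxy)) candidates.addAll(f.documents());
            candidates.remove(proxy);
            return candidates;
        }

        // Step 2 (precision): keep candidates linked by "core" facets, here naively
        // defined as facets connecting the proxy to at least minShared candidates.
        static Set<String> refine(String proxy, Set<String> candidates,
                                  List<Facet> facets, int minShared) {
            Set<String> kept = new HashSet<>();
            for (Facet f : facets) {
                if (!f.documents().contains(proxy)) continue;
                Set<String> overlap = new HashSet<>(f.documents());
                overlap.retainAll(candidates);
                if (overlap.size() >= minShared) kept.addAll(overlap);
            }
            return kept;
        }

        public static void main(String[] args) {
            List<Facet> facets = List.of(
                new Facet(1, Set.of("doc357", "doc4550", "doc4806", "doc4867")),
                new Facet(24, Set.of("doc357", "doc5228", "doc9001")));
            Set<String> broad = recall("doc357", facets);
            System.out.println("recall: " + broad);
            System.out.println("refined: " + refine("doc357", broad, facets, 3));
        }
    }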


1.3 Short Walkthrough

Fig. 1-1. Screenshot with some of the topical facets found in 830 documents.

In order to illustrate the preceding concepts at once, we present some screens from the Topical Facet Application. The reader is invited to imagine a user with no clue about the content. Therefore, keyword search is not an option: the user will have to explore the data first. Figure 1-1 shows some of the 438 topical facets found in 830 documents for October 15 – October 16, 1998, an arbitrarily chosen period. The middle panel shows the details of topical facets #1, #2, and #3. Listed under topical facet #1 are four documents labeled Doc 357, Doc 4550, Doc 4806, and Doc 4867. Each document has a few phrases indicative of its content. All the phrases are computer generated without manual intervention or clean-up. Topical facet #1 links the four documents mentioned and contains words such as outside manila and sustained winds.


Fig. 1-2. Screenshot with document 357 and topical facets #1 and #24.

As shown in figure 1-2, we now see another facet, #24, connecting three articles. Because a topical facet sits between the vocabulary and the document, it can be reached by any word it contains, but also by any document linked by it. The user may select any document and inspect all the connecting topical facets. Alternatively, he or she can choose to start from any topical facet and look up the different documents sharing this facet. If the user clicks on document 357 – the first from the list of four seen in the previous screenshot – a window opens with the text of this document and with any topical facet related to it. The first article also appears in the set of topical facet #1; the two others are new. Topical facet #24 is typified by the words coast guard. In document 5228, the last of this group, the term ethiopian asylum comes up.


Fig. 1-3. Screenshot of a document retrieval task on asylum seekers. The query is in the bottom middle pane.

Suppose our user stops browsing and wants more information on asylum seekers. With that intention in mind, he or she sets a somewhat broader time interval (from 10/10 to 20/10) and formulates a query about concern over the rising number of Somali and Ethiopian asylum seekers who died at sea. This step illustrates the topic detection feature of the application. Figure 1-3 shows the result. The system returns six documents for the period October 10 – October 20, 1998. Except for the initial document nothing more is found on Ethiopian or Somali refugees in this time interval. However, the system discovers documents on Palestinian and Kosovo Albanian asylum seekers. This concludes our short overview of how a user becomes acquainted with a body of unknown data and how he or she uses this outcome to launch a query. An extensive evaluation is postponed until chapter 3.


1.4 Prosthesis as a Meta-Model

When working with human language one cannot avoid treading on fields studied by other disciplines such as neurobiology, psycholinguistics, or philosophy. A confrontation between the different points of view on the faculty of language is scientifically important, but it is often difficult to maintain a distance from discussions on mind and meaning theories and to justify the preference of one theory over another (Edelman, 2007). The dilemma is not limited to the choice of a theory. Any hypothesis will develop the instruments necessary to validate its explanation of reality. Applying a methodology can be explained as taking a position in favor of the underlying principles; not using it could be seen as rejecting the theory. In computational linguistics, many standard models ignore the ordering of words – hence the informal bag-of-words tag used to designate them (Blei, Ng & Jordan, 2003). This thesis follows a converse path where word order is central. According to Avi Arampatzis, most attempts to break out of the bag-of-words paradigm by employing NLP and other linguistic resources are inconsistent or produce at best dubious results (Arampatzis, 2001). This dissertation takes up the challenge without formally proving whether a network model of human language has more representational power in general than standard techniques. A pragmatic solution to avoid premature commitment to one theory over another, or to a given methodology, is to define the computational model under study as a prosthesis. The prosthesis metaphor is more than a figure of speech. It conceives the device as something that has to interface with a human user. The responsibilities of both parties in the man-machine interaction will have to be explained, but the machine component has no use for world knowledge, that matter being relegated to its human owner. As will become apparent, the Topical Facets Application stops short at the point where semantic information shared by different documents is extracted. The information is highly fractional and inconclusive, hence the necessary intervention of the user to restrict the boundless possible interpretations in function of a specific demand. It is quite easy to furnish the device with a form of self-reference by letting it auto-generate queries to itself and analyze the outcome. But this is a very different playground, opening other lines of research not pursued any further here. Next, and perhaps more consequentially, the prosthesis approach puts the development in a performance perspective. The designer has to balance a level of abstraction with a level of practicality. Practical and theoretical considerations necessary to get the assembly working will justify the choice of models, elements, and methods. The prosthesis, in other words, is closed to the broader world of its user. A wooden leg is an example of a prosthesis serving a straightforward function. Every carpenter can make one without knowledge of limb embryology or muscle anatomy. A wooden leg is a substitutive prosthesis, replacing a body part that no longer performs normally. The work presented here is a prosthesis extending the natural limitations of the human body, in the way a gun or a bicycle does.

By closing the device off from its natural surroundings, the developer can pick whatever component he considers suitable for the task without the need to revert to the way a natural language emerges or the human brain functions, or to answer questions about whether there is any (logical) truth in the way the machine copes with information. A prosthesis does not have to engage itself in proving the truth-value of its utterances (Fernández-Armesto, 1997: 51). Jon Barwise is one of the few authors to inquire into this matter. He provides elements for a logic of intelligent interaction between a computational device and a human being; he discusses the mechanism by which a representation and the embedding circumstances may combine into a relational interface (Barwise, 1989). However, pending the formulation of a complete prosthesis theory, the responsibility of the developer is limited to answering assessment questions on how and why his contraption works. Roland Hausser observes that a sparrow remains in the air owing to the same aerodynamic laws as an airplane, but that it would be ludicrous to therefore require mating and breeding behavior from a jumbo jet (Hausser, 1999: 4). Closing the airplane prosthesis off from the broader natural environment where humans and sparrows each evolve their skills does not prevent a research worker from looking for clues, ideas, or laws of nature that might be put to good use when designing a machine. The neural network concept is another example. The principle of synaptic modulation, first articulated by Donald O. Hebb in 1949, is still an inspiration for computational modelers, even if in neuroscience his biological speculations have been largely superseded (Hebb, Brown & Milner, 2002: F12-13). No contemporary researcher in the machine-learning field would claim a resemblance to the natural brain. The prosthesis view in this dissertation takes a position in between the generally acknowledged AI standpoint that an application should have, at least in principle, a symbolic representation of the world-context it operates in, and the belief that a large amount of intelligent activity can be controlled without any human-instilled reasoning (Kirsh, 1991). Whatever the case, the Topical Facets Application claims no human-like abilities. Attention will go first to a performing software application and to an appropriate evaluation method. The theoretical foundations of the underlying model are dealt with in the chapters thereafter.

1.5 Corpus, Collections, Documents, Texts, and Words

While a separate glossary at the end of the book succinctly outlines the symbols and technical terms appearing throughout this work, it is appropriate to look for a more extensive definition of a few central concepts. The components reviewed here are supposed to behave according to the given description without additional provisions. The diagram in figure 1-4 sketches how they are associated. Because the relation between the components is circular, it makes no difference where to start. At the right-hand side the diagram introduces a community of readers and a domain of interest. It is almost a truism to state that a community of readers will select particular documents in preference to those chosen by other communities of readers, since they relate to favored domains of interest. However, a corollary of this proposition is that the existence of a domain of interest can be inferred as explicitly from the existence of a community of users as from a content analysis of the documents. The converse is also true: the existence of a body of content presumes a community of users. The language user brings about the fitting to reality, not the message. This is the underlying rationale for community content derivation, according to Peter Gärdenfors and Ray Jackendoff, among many others (Gärdenfors, 1990). A crucial factor is the author using concepts shared with the intended audience (Jackendoff, 1998). It is understood that natural language users are free to participate in as many different communities as they wish. Chapter 5 will present results on community-driven content research. The relationship between a reader and his document is not everlasting, because the informative value of a document changes. Issues have a diachronic aspect: they emerge, develop, and disappear. Measuring the frequency or intensity reveals the value that the news providers of a community attach to certain themes in the course of time.

Fig. 1-4. Relation between words, texts, documents, and collections.

A stream of documents nurtures the information need of a community. Documents are often delivered as a combined set, in this work limited to one type of aggregation, a collection. A newspaper is an example of a collection, although it can be any assortment of documents with no other binding than the shared interest of its users. Henceforth, collection will signify the set of documents issued by the same source and in the same period. The source in question is not the actual author, but the publisher of the collection. This restricts the extent of the notion collection by comparison with the idiomatic usage of the term, both with regard to the source of the document and to the period covering a collection. The document is the basic ingredient available to the Topical Facets Application. In this work, document stands for a text fitted with a source and a timestamp. One finds a similar distinction between document and text in the XML document markup language (Bray, Paoli, Sperberg-McQueen, Maler & Yergeau, 2006). To see the conceptual difference between a text and a document, the reader may imagine a press release sent out by some international corporation. The release will land on the desk of many newspaper editors. Each editor has to decide whether to throw the text away, to let it lie about for a while, to integrate the content into a new article, to render it as is, to shorten it, and so on. When the text eventually appears in the paper it becomes a document. For a reader the source of the document is the time-stamped newspaper, not the international corporation. The responsibility of a document publisher (the source) is to maintain coherence with regard to the domain(s) of interest covered; even more so if the set of documents is a publication appearing continuously, because of the circular relation between the community and its news providers. A community expects to be informed on domains relevant to it, while the news provider expects to get the public endorsement justifying his work and hence his existence as a publisher. These are the considerations guiding an editor when he ponders over the press release. The issue will not be pursued further, since this work assumes the existence of a stable community of readers and the ready availability of a community-related corpus as a prerequisite. For additional reading, we refer to Jeremy Tunstall, a sociologist who pioneered research in this field (Tunstall, 1972). The subject has not vanished from sight since (Gans, 2004; Tanner, 2004). Text is at the heart of a document. Any author who wants to inform a potential reader about an issue will choose words appropriate for his purpose. That is the reason why a text exhibits lexical coherence (Morris & Hirst, 1991). To continue the line of thought about the relation between community and content: George A. Miller and Philip N. Johnson-Laird call the common frame of reference shared by a speech community a set of core concepts, implying the participation of the members of a speech community (Miller & Johnson-Laird, 1987; Cooper, 2000: 107). The implicit coherence is about discourse in its day-to-day manifestation, not the meaning from a dictionary or the coherence expected from some formalized language (Lakoff & Johnson, 1981: 59–69). In view of the network approach set out below, and to link the internal structure with the surrounding society, it is relevant to mention Michael Halliday. According to this author, all texts attempt to present a coherent world of the text with something he calls the textual function: a world in which all the elements of the text cohere internally and which itself coheres to its relevant environment (Halliday, 1978: 187–189). In the same way as the concepts of community, collection, and domain of interest are given, it is supposed without further inspection that the texts on hand possess sufficient internal coherence. Words are the basic language elements in this setting. There exist several ways of defining a word, and these are not equivalent (McKean, 2001). The phonological word is a piece of speech behaving as a unit of pronunciation, in accordance with criteria varying from language to language. A lexical item (or lexeme) is an abstract unit of the lexicon of a language with a more or less readily identifiable meaning or function. The lexical item can appear in several grammatical forms. Some items appear to be words by some criteria, but exist in two or three pieces, sometimes separated by other material such as hyphens and periods. There are abbreviations and logograms (a written character that is not a letter of the alphabet, representing a word by convention: §, for example). A vocabulary contains all the words at the disposition of a language user, whether or not he or she will ever use them. A word then is defined as a member of a vocabulary, rendered as an orthographic chain of characters with a white space at each end, but with no white space in the middle (Juilland & Roceric, 1972). It is a matter of further research whether this operational definition can be replaced by a language-independent word parser (see page 171 for a possible approach). An actually used word is called a token.
A type is the class with those actually used tokens as its instances. In the sentence The Virgo constellation will enter in a conflict with the Leo constellation, there are two tokens the and two tokens constellation, yet there is only one type for each of them. Both are words from the English vocabulary. A word can also be called a term. A term is an identification assigned to a concept, though for all practical reasons the basic element is henceforth called word, term, or token, used interchangeably when there is no danger of ambiguity. When a word, term, or token is used in a strictly computational context devoid of any linguistic property, it is called a label. A corpus is the top category comprising all available documents augmented with additional information. This level is outside the scope of the application and is not represented in figure 1-4. By design the computer application has no access to the meta-information, that being the privilege of the human owner. In accordance with the prosthesis model, a system interfaces with its owner and its users within the framework of an assignment. Meta-information of a corpus may include, among other things, a technical description of the data format, information about the way the documents were collected, rules on how to present the data to a system, an external list of keywords, or manually added tags and features. The accompanying Topical Facet computer application handles Latin character sets with a top-to-bottom and left-to-right reading direction. Because of this constraint Mandarin texts in the corpus are not processed, although the original TDT task description called for a bilingual treatment. Nevertheless, the underlying model is essentially language independent. Information-bearing items manifest themselves on any material substrate (paper, electronic, spoken, or written on the shoulder blade of a camel). On pragmatic grounds the present system is limited to computer-readable messages presented as a sequence of words selected from a natural language vocabulary of unknown length. From the preceding assumptions one can anticipate finding relevant word relations in the context of a document, discovering semantic relations in a collection of documents, and learning about domain relations in the context of a community. Even without taking a position with regard to the meaningfulness of a text, it is expected that those relations deliver a representation of the information content of a document (Dretske, 1999). The concepts of community and collection are a central organizational feature. The data have a timestamp, linking collection and community to the notion of temporariness. The subsequent chapters will look at these concepts in detail.
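The whitespace-bounded word definition and the type/token distinction translate directly into code. The sketch below reuses the Virgo sentence from above; the Document record merely illustrates the definition of a document as a text fitted with a source and a timestamp (the source and timestamp values are invented), and is not the thesis data model:

    import java.util.*;

    // Illustrative sketch of the definitions in this paragraph.
    public class VocabularySketch {

        // A document is a text fitted with a source and a timestamp.
        record Document(String text, String source, String timestamp) {}

        public static void main(String[] args) {
            Document d = new Document(
                "The Virgo constellation will enter in a conflict with the Leo constellation",
                "some-newspaper", "1998-10-15");

            // A word: an orthographic chain of characters bounded by white space.
            String[] tokens = d.text().toLowerCase().split("\\s+");

            // Types are the classes instantiated by the tokens actually used.
            Map<String, Integer> types = new TreeMap<>();
            for (String token : tokens) types.merge(token, 1, Integer::sum);

            System.out.println(tokens.length + " tokens, " + types.size() + " types");
            System.out.println(types);   // 'the' and 'constellation' each occur twice
        }
    }

Running it prints 12 tokens and 10 types, the ratio the Topical Facets Application exploits on a much larger scale.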

1.6 Structure of the Thesis

Chapter 2, Topical Facets, is next; it explains the derivation of topical facets from a network before presenting background information on how and why. This arrangement of the content has the advantage of presenting straight away the purpose of topical facets as a partial document description that can be used in a number of situations, for instance in document classification and in information retrieval. The disadvantage is that the reader sometimes has to jump back and forth in the text in order to find an explanation of specific concepts associated with graphs or networks. In chapter 3, Experiments and Evaluation, a number of detailed experiments illustrate the methodology. The chapter concludes with an evaluation. An appendix is available with details on the topic detection task used to evaluate the system, and with text samples and related topical facets to illustrate the key concepts. The construction of the text network in general and its theoretical foundation are the subject of the two consecutive chapters. Chapter 4, Text Network, covers background material assumed in the previous chapters and pertaining to the construction of an unrestricted text network. The assessment of its components concerns the data reduction gained and the computational complexity of the system. Chapter 5, From Random Relations to Self-Organizing Structures, is an account of the development of the small-world type of networks from a general concept to a model of human language. It constitutes the theoretical underpinning of the use of an unrestricted text network. Chapter 6, General Conclusion and Further Research, summarizes the important properties and limitations of topical facets, and points out some opportunities for future research. In addition to the Appendix, a Glossary is included with brief definitions of technical terms and abbreviations. The list of Tables and Figures also contains the location in the text of the key algorithms of the application. The work concludes with the bibliographic references of the thesis and an index of authors and main concepts.


2. Topical Facets

2.1 Introduction

This chapter discusses the construction of small components form a network environment, called topical facets. A detailed explanation of the machinery of networks is postponed until chapter four. There is, however, the notion of informative arc in need of a short clarification at this stage. An arc stands for a sequence between two words, say a and b, actually realized in a text. Therefore, it has direction; ab is different from ba. The term informative stands for the semantic quality of an arc. The informative value of an arc is the sum of the informative value of its constitutive terms2. Being informative is about being more of less relevant to the meaning of a text. With the topical facet approach the focal point will shift from relevance inside a text carried by the informative value of words, to semantic relations extending over the boundaries of documents and collections. The following elements participate in the construction: • A network accommodating all unrestricted texts presented to the system. • Arcs (informative and other). • Texts consisting of ordered successions of arcs. • Documents each including one text, a timestamp, and a source. • Collections containing sets of documents with the same timestamp and source. • The community, a group of people using the same collections. A topical facet partially describes the information shared by several documents by collecting informative arcs found in those documents. The result is an open set of topical facets at the disposition of the user. The set is open since any new document added to the system may contain elements relating to existing documents, thus forcing the system to reconsider the semantic structures composed so far. Since more than one document can share the same topical facet and because several topical facets typically link one document with one or more other documents, the user can visit the entire network hopping from one facet to the next, a phenomenon known as unlimited semiosis. While unlimited semiosis is generally considered problematic, it is one of the strengths of the topical facet concept as it allows the user to access repeatedly the information contained in a body of documents in different ways without the need to re-analyze the data. The term tacit knowledge was coined by Michael Polanyi. He asserts that as a result from how the brain processes sense data a human being knows more than he can tell (Polanyi, 1983: 4). The ability to link documents in numerous ways provides the Topical Facets Application with something similar that might be called tacit information; the network has more information on hand than one can ask. Topical facets are elements forming a semantic relation between two or more documents. One possible use of the topical facet layer, investigated here, is the topic based retrieval and categorization of documents. A topic is defined as a closed construct made up with several topical facets generating a specific description of the data, thereby eliminating other possible combinations and interpretations. We argue that a well-formed query is a topic trigger. A user who is not familiar with the information in front of him, has the option to browse the data starting from one particular document or one particular topical facet, and to explore the semantic relations presented by the 2

A user who is not familiar with the information in front of him has the option to browse the data, starting from one particular document or one particular topical facet, and to explore the semantic relations presented by the system prior to articulating his query. The documents retrieved as a result constitute a category based on shared salient properties. The triggering query could be seen as a member of that category. The topical facets model does not take transitivity between the query and the corpus for granted. Therefore, in order not to confront the data directly, a proxy will substitute for the query. The proxy is a document from the corpus that best represents the query when ranking documents on relevance. An evaluation procedure of the retrieval output will enhance its precision. The procedure calculates the main core of the topical facets intervening in the document retrieval. Documents not linked by the facets from the main core are eliminated. A document-by-document similarity metric quantifies the semantic relation of the remaining documents with the proxy. Finally, a semantic preference clustering method uses the similarity value to assemble documents based on their maximal mutual preference. The remainder of this chapter is organized as follows: paragraph 2.2 Inferring Content deals with methods to fit the application with a content-inferring capacity. In paragraph 2.3 Topical Facets at Work some production examples are shown; paragraph 2.4 Analysis of Non-Combined Documents deals with how long-range relations are made. Paragraph 2.5 Open Sets and the Closure Issue discusses the specific behavior of topical facets. Paragraph 2.6 Generalizing Topical Information is about ways to unlock the available information, and 2.7 Topics and Queries examines the resemblance between the notions topic and query. Paragraph 2.8 Queries by Proxy deals with a function detecting topics in a stream of incoming documents. Paragraph 2.9 Semantic Preference Clustering explains how the retrieved documents are ranked. In paragraph 2.10 Related Research we look at other Information Retrieval approaches using faceted information components. The chapter finishes with a summary in paragraph 2.11.

2.2 Inferring Content

It will become apparent that a token has different relations with other tokens from one document to the next, and that one infers dissimilarity in meaning from that behavior. A computer program can calculate these differences without claiming to know anything about the true meaning of the tokens involved. The operation yields informative words as a corollary. The diagram in figure 2-1 sketches the situation: collection c1 has three documents A, B, and C, and collection c2 has three different documents D, E, and F. The intersection of the words in A, B, and C from collection c1 produces a set containing all the elements common to A, B, and C. The same operation with the documents from collection c2 gives the words shared by D, E, and F. Removing these common elements leaves each document inside a collection with the words making it distinctive from the other documents. How to determine the informative value of a token or a pair of tokens in a network is the subject of chapter 4.

Fig. 2-1. Inferring informative words at document level by removing common words: two sets of common words identified in six documents (A, B, C and D, E, F) from the collections c1 and c2.
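The common-word removal of figure 2-1 amounts to plain set operations. The following is a minimal Python sketch; the function name and the toy word sets are hypothetical, and documents are reduced to sets of word types:

def distinctive_words(collection):
    """collection maps a document id to its set of word types. Words shared
    by every document in the collection are removed, leaving each document
    with the words that make it distinctive (figure 2-1)."""
    common = set.intersection(*collection.values())
    return {doc: words - common for doc, words in collection.items()}

# The overlap Q of the distinctive words of B (collection c1) and
# F (collection c2) suggests a semantic association (figure 2-2).
c1 = distinctive_words({"A": {"the", "press", "met"},
                        "B": {"the", "press", "hamas"}})
c2 = distinctive_words({"E": {"a", "truce", "deal"},
                        "F": {"a", "truce", "hamas"}})
Q = c1["B"] & c2["F"]   # {'hamas'}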

Whereas the informative value of words is largely about dissimilarity among documents, the following series of operations will concentrate on the similarity of token relations between texts and parts of texts, bringing the Topical Facets Application a step closer to the construction of meaningful semantic associations. The first matter of importance is breaking out of the collection boundary without losing the benefit of informative components gathered at document level. As always, the application has a role as prosthesis to its owner, without any implied attribution of human-like motivation, characteristics, or performance. The adjective meaningful, for instance, in the phrase a meaningful semantic association, ultimately affects only the user of the program. However, it is derived from a conjecture about actual language realizations. Figure 2-1 sketches how to obtain distinctive words inside a collection. The diagram in figure 2-2 shows what happens when performing the same intersection operation on the distinctive word sets from the collections c1 and c2. One discovers a similarity between two documents based on the existence of a set of informative words Q used jointly by document B and document F, or Q = B ∩ F, inferring a potentially meaningful semantic association. The words located both in B and F would form ordered pairs with x ∈ Q, y ∈ B, and y' ∈ F. If we accept that R stands for a semantic relation, then word order matters and the similarity between two words is asymmetric:

∃x ∃y (R(x, y) ∧ ¬R(y, x))        (2.1)



That some – or even all – words in these two documents are composed of the same characters does not make them equivalent. For instance, the words y and y' may seem comparable, yet they can be irreflexive at the semantic level: ∃y ¬R(y, y'). To reveal the meaning of a text one has to look for characteristics more complex than words sharing the same external aspect.

Fig. 2-2. The similarity of dissimilar words: inferring similar content from the presence of dissimilar words in documents B and F, belonging to the collections c1 and c2 respectively, introduced in figure 2-1.

All the necessary relations are in the text. Embedding the text with other texts in a setting such as a network is a way to preserve these relations. All the words become nodes in a network. More precisely: every word will label a node. In a text each word is strung to another word in a linear fashion. By mapping the words of a text onto a network, the nodes representing the words inherit the same relations as the text. One can reconstruct the original text hopping from one node to the next by following the path in the right direction. A walk in the network is a path with a direction, and an arc is the directed link between two nodes. A node is also called a vertex. In network terminology the walk representing a text is a maximal connected component. Maximal means that no other vertex can be added without altering the defining characteristic of the component, in this case the ordering of the connected vertices. Adding or removing a word in a text would change it into another text. Smaller structures inside a text can have the same property: a sentence, or a phrase, for instance. An unrestricted text network is the graph collecting any text under the condition that each different word can label one node only. With this construction in place, building a method to retrieve content-bearing parts requires three steps:

• Finding documents with similar components;
• Collecting those components;
• Grouping the documents sharing the same set of components.

Because the framework in which this happens is an unrestricted text network where every text is a subnetwork, one might equally well use the linguistic term phrase instead of arc or component. In paragraph 4.7 on page 117 the reader finds a more detailed definition of the informative arcs under consideration. To illustrate the procedure, consider the following example with a few individuals – whom we meet again further down – found in some of the documents and collections making up a text network. We define a document d = (text, source, time) as a tuple with a text, a source, and a timestamp. Each document d ∈ D, where D is the pool of documents that builds the network G. The notation for a network is G(V, A, g), where V is a set of vertices {v1, v2, …, vn}, A is the set of directed lines or arcs with A ⊆ V², and g is a function associating with each arc an ordered pair of labeled vertices (vi, vj). The labels in question are the words from the document, and the ordering of the vertices is determined by the order of the words in the text. A collection c is a set of documents c = {d1, d2, … dn} grouped by an equivalence relation R on D. If the relation R is the fact of having the same source and time, then for each ds ∈ D the set {d ∈ D | (d, ds) ∈ R} is a collection.
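In code, the document tuple and the grouping into collections could be represented as follows – a minimal Python sketch; the class and function names are hypothetical:

from dataclasses import dataclass
from collections import defaultdict

@dataclass(frozen=True)
class Document:
    """A document d = (text, source, time); the text is kept here as the
    ordered tuple of its word tokens, preserving word order."""
    text: tuple
    source: str
    time: str

def collections_from(documents):
    """Group the pool D by the equivalence relation R of having the same
    source and timestamp; every resulting group is a collection c."""
    groups = defaultdict(list)
    for d in documents:
        groups[(d.source, d.time)].append(d)
    return groups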

Assume six documents extracted from several document collections. Although a document should have a timestamp and a source by our definition, these elements are not considered at this point. The documents are labeled d1, d2, d3, d4, d5, and d6 and carry the following content:

d1 is about Bill Clinton, Hamas, a bomb factory, Israel, and Madeleine Albright; the words are abbreviated to BC, HM, BF, IS, and MA.
d2 is about Bill Clinton, Jonathan Pollard, and Madeleine Albright, abbreviated to BC, JP, and MA.
d3 is about Israel, Jonathan Pollard, and the US Navy, abbreviated to IS, JP, and UN.
d4 is about Israel, Hamas, and a bomb factory, abbreviated to IS, HM, and BF.
d5 is about the US Navy, abbreviated to UN.
d6 is about a bomb factory, abbreviated to BF.

Assume also that the individuals mentioned here are represented by arcs having been identified as informative (including Hamas and Israel). As a matter of convenience all other words that might be present in the documents are ignored. The Venn-diagram in figure 2-3 offers a view on the content relations between the six documents.


Fig. 2-3. Venn-diagram showing how informative data are shared between the six documents.

Step 1 Collecting documents

The first step brings about document relations based on shared phrases. In our toy example we take the Bill Clinton bigram for granted, but in an unrestricted text network Bill and Clinton are just labels attached to a node. A node or vertex will have a one-to-many relation with the documents making up the network. To describe the process as an operation on a network we represent the Bill Clinton bigram as an arc. According to the definition an arc abi is composed of two vertices from the pool V of vertices available in the network G. In this case we would index it with b (for Bill) and c (for Clinton), or (vb, vc) with vb, vc ∈ V in G = (V, A). The Bill Clinton phrase is an ordered pair, making it a member of the set of arcs A in the network. The superscript i identifies the arc as informative and distinguishes it from arcs not having this characteristic. By virtue of the one-to-many relation of its constituting vertices, the arc abi can be present in one or more of the documents d from the set of documents D. Then, for every arc abi there is a set s indexed with that arc that collects all the documents containing abi:

sab = {d | abi ∈ d ∧ d ∈ D}        (2.2)

In words: set sab gathers those documents having arc abi as a member. After visiting all arcs in A, the set S = {sab, sbc, … smn} is obtained, where for every informative arc there is a set with all the documents containing that arc. Table 2-1 shows the set S on the last row with the sets of documents containing the phrases from the example. As an illustration, see how the documents d1 and d2 share the phrase BC and the phrase MA.

     BC        IS           HM        JP        MA        UN        BF
d1   √         √            √                   √                   √
d2   √                                √         √
d3             √                      √                   √
d4             √            √                                       √
d5                                              
d6                                                        √(d5)     √(d6)
S    {d1,d2}   {d1,d3,d4}   {d1,d4}   {d2,d3}   {d1,d2}   {d3,d5}   {d1,d4,d6}

Table 2-1. Relations between the phrases and the documents.
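Equation (2.2) translates directly into an inverted index from arcs to documents. A minimal Python sketch with hypothetical names, using the toy documents of figure 2-3 with the arcs abbreviated to their phrase labels:

from collections import defaultdict

def documents_per_arc(docs):
    """Equation (2.2) in code: for every informative arc ab, the set s_ab
    of documents containing it. docs maps a document id to its set of
    informative arcs, as in table 2-1."""
    S = defaultdict(set)
    for doc_id, arcs in docs.items():
        for arc in arcs:
            S[arc].add(doc_id)
    return S

S = documents_per_arc({"d1": {"BC", "IS", "HM", "MA", "BF"},
                       "d2": {"BC", "JP", "MA"},
                       "d3": {"IS", "JP", "UN"},
                       "d4": {"IS", "HM", "BF"},
                       "d5": {"UN"},
                       "d6": {"BF"}})
assert S["BC"] == {"d1", "d2"}   # the first cell of the S row in table 2-1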


Step 2 Collecting the Topical Facets

The second step collects different arcs appearing in similar document sets, as illustrated in figure 2-4. Each set is a topical facet.

Fig. 2-4. Arcs shared by document sets are grouped.

For instance, set {d1, d3, d4} contains {d1, d4} as a subset, bringing together IS and HM. Because {d1, d4} is also a subset of {d1, d4, d6}, the content of this subset is joined to the content of the main set, yielding {HM, BF} and illustrating how the same phrase can be a part of more than one phrase collection. Set bsab is constructed for any s ∈ S:

bsab = {cdi ∈ A | scd ⊆ sab}        (2.3)

In words: set bsab gathers every arc cdi that indexes a set found in S containing the same documents. The set of sets B = {bsab, bsbc, … bsmn} is obtained, where every set has the phrases appearing in the same documents. The collecting document set indexes the set of phrases. If a set sab holds the same documents that were also collected by set sdf, then sab = sdf. The arcs indexing these document sets are brought together in bsab = bsdf. Each maximal element in B is a topical facet.

Step 3 Collecting document sets

It is straightforward to assemble set ed containing the sets s of which a document d is a member. A document can be a member of different sets in S because topical facets are by definition not restricted to one document, and one document can accommodate several topical facets. The result of this operation on the test documents is illustrated in table 2-2. The table is a list representation of the Venn-diagram in figure 2-3.

     Document sets
d1   {d1,d2}, {d1,d3,d4}, {d1,d4,d6}
d2   {d1,d2}, {d2,d3}
d3   {d1,d3,d4}, {d2,d3}, {d3,d5}
d4   {d1,d3,d4}, {d1,d4,d6}
d5   {d3,d5}
d6   {d1,d4,d6}

Table 2-2. Document membership.


ed = {sab | d ∈ sab ∧ ab ∈ A}        (2.4)

In E = {ed1, ed2, … edn} there is a set for every document d containing every document set s ∈ S on the condition that d ∈ s. Every set is indexed with its document identification tag. Many small subsets now gather the documents based on the phrases they share, which is the list representation of the document-by-document matrix. This has two consequences. First, any document from such a pool is similar – up to a degree – to all the other documents in the set; up to a degree implying that the more phrases two documents share, the more their content is similar. Second, any phrase participating in bringing together these documents is thereby related to all the other collaborating phrases from the pool. A question about Hamas [HM] leads directly to a set containing four documents d1, d3, d4, and d6, since HM is an element of two topical facets: one indexed with {IS, HM} and a second with {HM, BF}. Other phrases from these four documents are Bill Clinton [BC] and Madeleine Albright [MA] in d1, Israel [IS] in d1, d3, and d4, and bomb factory [BF] in d1, d4, and d6. Jonathan Pollard [JP] and the US Navy [UN] enter the picture via d3. The d3 document does not mention Hamas, illustrating how a document gets retrieved in spite of lacking the query word (Fig. 2-5). Whether this is relevant or not is discussed below in paragraph 2.5 on open sets and the closure issue. Paragraph 2.4 will analyze real examples with material from the corpus.

Fig. 2-5. Retrieving documents and information referring to a query about “Hamas”. The query activates the topical facets {IS, HM} and {HM, BF}, pointing to the document sets {d1, d3, d4} and {d1, d4, d6}; the facets supply the core information (IS, HM, BF), while the related documents add BC and MA (d1) and JP and UN (d3).

Elicited by the query, the topical facets return core information on Hamas based on data in the repository. We learn that Hamas is related to a bomb factory and to Israel, but not directly to Clinton and Albright, or to Pollard and the US Navy. There is no additional information on the bomb factory in these data. From the informative phrases found in the related documents, it turns out that Israel involves Bill Clinton and Madeleine Albright, and, in a separate case, Jonathan Pollard and the US Navy.
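The retrieval walk of figure 2-5 can be sketched as a lookup over the facet – document relation. A minimal Python illustration with hypothetical names, reusing the toy facets above:

def facets_and_documents(arc_label, B):
    """B maps a topical facet (a frozenset of arc labels) to the set of
    documents it links. Return the facets mentioning the queried arc and
    the union of their documents, as in the walk of figure 2-5."""
    facets = [facet for facet in B if arc_label in facet]
    docs = set().union(*(B[facet] for facet in facets)) if facets else set()
    return facets, docs

B = {frozenset({"IS", "HM"}): {"d1", "d3", "d4"},
     frozenset({"HM", "BF"}): {"d1", "d4", "d6"}}
facets, docs = facets_and_documents("HM", B)
# docs == {'d1', 'd3', 'd4', 'd6'}: d3 is retrieved although it never
# mentions Hamas, exactly as described above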

2.3 Topical Facets at Work

The following paragraph discusses in depth the construction of a series of topical facets – abbreviated to TF – in a two-day sample from the corpus. The period is 01/10/1998 – 02/10/1998. The system scanned 685 documents belonging to 8 sources (ABC – APW – CNN – MNB – NBC – NYT – PRI – VOA). 370 topical facets were found, combining 615 documents; 70 documents remained unrelated (10%). These unrelated documents will receive attention in the next paragraph. Topical facets from this sample may show up in many places; one topical facet, for example, manages to link 45 documents. The general content of this hefty TF is a triple-digit decline of the stock market on Wall Street, a slowdown in financial markets worldwide, and remarks on this situation by Ernst Welteke, who will succeed Hans Tietmeyer as German Bundesbank president in 1999. This subject apparently gets a lot of attention in the media at that moment. The smallest TF-induced set combines only 3 documents and is about a Serb-Albanian conflict in Prizren, where most of the Albanian population was either forced or intimidated into leaving town.

The algorithm implements the three main steps in the construction of topical facets from collections and documents found in the network.

Input:   D = {d | d ∈ G}                set with the documents known to the network G
         A = {ab | ab ∈ d ∧ d ∈ D}      set of arcs found in the documents of the network
Output:  S                              a set of document sets indexed with members of A
         B                              a set of arc sets indexed with members of S
         E                              a set of document sets from S indexed with members of D

for all ab ∈ A                                     for every informative arc in the network
    sab := {d | ab ∈ d ∧ ab is informative}        collecting the documents containing arc ab
    S := {sab | ab ∈ A}                            set of document sets indexed with an arc
end for
for all s ∈ S                                      for every arc-indexed document set
    bs := {ab ∈ A | sab = s}                       collecting the arcs having the same document set
    B := {bsab | sab ∈ S}                          set of arc sets indexed with a document set
end for
for all d ∈ D                                      for every document
    ed := {s ∈ S | d ∈ s}                          collecting the document sets containing d
    E := {ed | d ∈ D}                              set of document sets indexed with a document
end for

Algorithm 2-1. Pseudo-code for the Topical Facets main program component.
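A compact runnable rendering of Algorithm 2-1 might look as follows. It is a sketch under the assumption that a document is reduced to its set of informative arcs, and it uses the strict equality of the pseudo-code; the subset joining described in step 2 would be a refinement on top of it:

from collections import defaultdict

def topical_facets(docs):
    """docs maps a document id to its set of informative arcs.
    Returns the sets S, B, and E of Algorithm 2-1."""
    S = defaultdict(set)                 # step 1: document set per arc (2.2)
    for doc_id, arcs in docs.items():
        for arc in arcs:
            S[arc].add(doc_id)
    B = defaultdict(set)                 # step 2: arcs sharing a document set
    for arc, doc_set in S.items():
        B[frozenset(doc_set)].add(arc)   # each key/value pair is a facet
    E = defaultdict(set)                 # step 3: document sets per document (2.4)
    for doc_set in S.values():
        for doc_id in doc_set:
            E[doc_id].add(frozenset(doc_set))
    return S, B, E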

Fig. 2-6. Distribution of the number of topical facets over a document. The x-axis shows six frequency classes with the average number of topical facets per class; the y-axis shows the number of documents per frequency class.

On average one TF joins 7.9 documents (Fig. 2-6). A topical facet is not one component, but a set of shared arcs. One expects a tendency towards more TFs with increasing document length. However, while this may be true for smaller documents with fewer than 250 words, it cannot be generalized over the document collection (Fig. 2-7).

Fig. 2-7. Relation between the number of topical facets (y-axis) and the length of a document (x-axis).

In this corpus any single document has on average 13.2 topical facets. It implies that any document can be combined with many other documents, depending on the selection of facets used to retrieve them. An obvious form of selection, but not the only one, is to formulate a query. If a document is on average associated with 13.2 TFs and each TF links 7.9 documents, any random document out of a group of 615 can show up in any of 13.2 × 7.9 × 615 = 64,132 meaningful document combinations, or approximately |D| × 10², where |D| is the number of documents in the set. These complex document relations have the advantage of allowing many different viewpoints on the same data set. The same document may appear in different retrieval outputs depending on the perspective expressed by the query.


Let us analyze TF1241 and TF1321 as an illustration. These are two of the 370 different topical facets found in the first days of October 1998. Since only a small number of arcs are involved, and because they link only a few texts of modest length, it is feasible to expose a major part of the process in detail. All the computer-generated phrases making up these topical facets are shown. A sequential identification number without further importance identifies the documents. TF1241 consists of 8 phrases: [group hamas, hamas activists, horrific attack, palestinian territories, palestinian militants, yitzhak mordechai, militant palestinian, palestinian group]. TF1241 brings together 6 documents: 290, 393, 408, 609, 632, and 672. TF1321 has 10 phrases: [palestinian authority, authority israel, palestinian security, militant group, horrific attack, government spokesman, palestinian territories, palestinian militants, bomb factory, yitzhak mordechai]. This topical facet connects 10 documents: 45, 62, 268, 269, 311, 367, 393, 408, 609, and 672. Because four of the sixteen documents are linked by TF1241 and also by TF1321, twelve different documents are involved in this analysis. Table 2-3 lists the documents linked by TF1241 and TF1321, as well as supplementary topical facets relating to these texts, but disregarded at present. The smallest document is d311; it has only topical facet 1321 to link it to the others. Extracts of the twelve documents are reproduced in the Appendix Israel-Hamas on page 174 ff., together with the topical facets concerned. This allows the reader to examine the soundness of the established relationships.

Doc id  Document Label           TF id       Other topical facets related to the document,                Document Length
                                             but not analyzed                                             (tokens)
45      19981001_APW0580         1321        903, 1399, 2727, 3561, 3566                                  252
62      19981001_APW0855         1321        1429, 2727, 3640                                             363
268     19981001_VOA1700.0226    1321        1365, 1399, 2727, 3324                                       181
269     19981001_VOA1700.0293    1321        1397                                                         73
290     19981001_VOA1700.1985    1241        1397, 2727, 2856                                             155
311     19981001_VOA1800.0303    1321        -                                                            47
367     19981002_APW0564         1321        1110, 1421, 2558, 2727                                       318
393     19981002_APW1025         1241, 1321  903, 1397, 1399, 1429, 2727, 3566                            315
408     19981002_APW1076         1241, 1321  903, 1110, 1365, 1397, 1399, 1421, 1429, 2558, 2727, 3566    573
609     19981002_VOA1700.0249    1241, 1321  1365, 3247, 3400                                             172
632     19981002_VOA1700.2128    1241        2727                                                         46
672     19981002_VOA1800.2520    1241, 1321  1365, 1397, 1399, 2727, 3054, 3247, 3400                     371

Table 2-3. Documents linked by topical facets 1241 and 1321.


The following is an Automatic Speech Recognition (ASR) rendering of d632, including transcription errors and without punctuation, as is. The short Voice of America (VOA) item from the five o'clock news on October 2, 1998 is also represented as a text graph in figure 2-8:

israel has sealed its borders with the west bank and gaza strip amid warnings the militant palestinian group hamas is planning a major attack on israel a closure remains in effect until at least tuesday and bands palestinian workers from going to their jobs in israel

Fig. 2-8. Document 632 as a contracted graph.

The label d632 in the middle of the graph in figure 2-8 stands for the placeholder vertex, representing the non-informative tokens found in document 632. Even with a few function words lacking, the reader might grasp the meaning of the original text by following the directed arcs. The Introduction to this chapter touched briefly on the existence of non-informative tokens; paragraph 4.6.3 on page 114 elaborates on the subject. In figure 2-9, the same document 632 is connected to documents 311 and 609. The graphical representation is restricted to these three small articles with the connecting topical facets 1241 and 1321, for readability reasons. The application makes a distinction between militant and militants, as it does between group and groups. Lemmatizing would diminish the complexity of the TFs in this case. The presence of the words bomb factory is no accident. The phrase is part of TF1321, joining the three documents in view, but it is as such not in the text of any of these. However, one will find it in documents 45, 393, 408, and 672, which are linked by the same topical facet 1321 too. The benefit of having a phrase in a topical facet that is missing in the texts linked by this facet becomes clear when a question is asked about a bomb factory. The query generates TF1321 and as a result it will point to the documents containing the actual wording, but also to the three documents lacking these terms. The subject of questioning the system receives attention later in this chapter on page 54.


Fig. 2-9. Contracted graphs of three related documents (d311, d632, d609) stitched together by TF1241 (dark arrows) and TF1321 (grey arrows). More topical facets are involved in describing the relation between these documents, but they are not shown in the figure for the sake of clarity.

2.4 Analysis of Non-Combined Documents

Having single words in common is not enough to connect two articles. There is only a relation when documents share one or more topical facets, and sharing a topical facet assumes sharing a set of arcs (word pairs). Despite the presence of similar words, the system could not link the following 70 documents from the period 01/10/1998 – 02/10/1998 because the proper topical facets were missing: 6, 7, 8, 10, 15, 34, 47, 58, 68, 90, 100, 101, 104, 110, 115, 131, 138, 139, 154, 165, 166, 197, 198, 200, 202, 204, 206, 242, 256, 272, 280, 283, 323, 326, 330, 338, 366, 377, 383, 396, 399, 419, 442, 456, 460, 461, 482, 483, 502, 506, 511, 514, 517, 519, 521, 547, 558, 561, 572, 581, 588, 589, 600, 611, 622, 675, 676, 677, 679, 681. There are several reasons why this happens: a new issue is emerging and has not yet caught on in the community; a local editor pays attention to a subject that leaves everybody else in the news community indifferent; the ASR-system jumbles an item to the point of total unintelligibility. However, as unused texts are rescanned at every new session of the application, most of them are eventually recovered and properly linked to related articles. This ability is of importance since it permits relating kindred subjects that are separated by arbitrarily long periods. Table 2-4 summarizes the rescan history of the 70 documents in a two-week period. The first column gives the identification number of the rescanned document, the second column the date(s) of a successful rescan, and the third column shows the topical facet(s) relating this document to others. More TFs imply more links to more different documents. The majority of the non-combined documents (57%) are recuperated within three days: 14 documents at day + 1, 10 at day + 2, and another 16 at day + 3. A fortnight later just 10 articles (14%) remain unlinked. The multiple updates in the table illustrate that texts can acquire additional hooks to other documents at a later stage, when new data are added to the system. To understand how rescanning works, consider the following short examples with articles from the table. Document 502 from Oct. 2nd is rescued from oblivion on Oct. 15th. It is a story about an Asian long-horned beetle that was found aboard a Chinese freighter. The beetle, with no natural enemies and unaffected by pesticides, is chewing to death the maples, elms, and chestnuts in US backyards and forests (19981002_NBC1830.1273). TF1552 anchors it with an appropriate agriculture department phrase to related documents. The network stores highly informative components on beetles and sundry trees too. At this moment the system cannot use them to construct a topical facet, for the reason that, at least up to Oct. 15th, no other text has comparable beetle phrases. The text may show up in a query related to agriculture, but not in a query about beetles. On Oct. 16th the Voice of America relates the very same item in document 19981016_VOA1700.2893, effectively creating a new topical facet with beetles and trees this time – unfortunately outside the arbitrary time frame set for this exercise.


Doc   Rescan Date         new Topical Facets generated
6     7/10                [2037]
7     not recuperated in period 2/10 – 15/10
8     not recuperated in period 2/10 – 15/10
10    2/10                [3336]
15    2/10                [592]
34    3/10, 6/10          [390, 1688], [3393]
47    2/10                [2817]
58    3/10                [2284]
68    3/10                [436]
90    4/10                [778]
100   not recuperated in period 2/10 – 15/10
101   4/10, 8/10          [762], [1656]
104   4/10, 15/10         [1007], [1415]
110   4/10                [2404]
115   3/10, 5/10          [475], [1706]
131   2/10                [1899]
138   4/10                [2404]
139   9/10                [2810]
154   not recuperated in period 2/10 – 15/10
165   13/10               [4012]
166   2/10, 4/10          [2960], [2970]
197   8/10                [2276]
198   4/10                [1037]
200   7/10                [1188]
202   2/10                [2023]
204   8/10                [3594]
206   14/10               [1485]
242   5/10                [938]
256   3/10                [1346, 423]
272   4/10                [1010]
280   9/10                [3101]
283   5/10                [737]
323   2/10, 14/10         [3454], [1598]
326   3/10, 7/10          [548], [4081, 4318, 4165]
330   3/10                [2206]
338   8/10, 15/10         [2093], [3889, 3695]
366   8/10                [1952]
377   4/10                [1402]
383   8/10                [1257]
396   3/10                [1745]
399   2/10                [2384, 2281, 878]
419   4/10                [1749]
442   5/10                [4855, 2614]
456   4/10                [3233]
460   not recuperated in period 2/10 – 15/10
461   not rescanned in period 2/10 – 15/10
482   4/10                [3271, 4121]
483   4/10, 9/10          [3129], [2477]
502   15/10               [1552]
506   not recuperated in period 2/10 – 15/10
511   6/10                [1503]
514   8/10                [2078, 2471]
517   6/10                [3360]
519   4/10                [843]
521   2/10, 3/10, 8/10    [3077], [2435], [3540, 3363, 1290]
547   4/10                [1520]
558   2/10, 3/10          [1987], [1790, 1589]
561   6/10                [3115, 2420, 1914]
572   not recuperated in period 2/10 – 15/10
581   7/10                [1450]
588   2/10, 14/10         [1465], [5380, 2346]
589   10/10, 14/10        [3346], [1081]
600   not recuperated in period 2/10 – 15/10
611   6/10, 14/10         [2005], [4235, 3569]
622   4/10                [459]
675   8/10, 15/10         [2987], [985, 4798]
676   2/10                [1465]
677   not recuperated in period 2/10 – 15/10
679   5/10, 12/10         [3294], [5047]
681   2/10                [3336, 3446]

10 documents not picked up; 17 multiple updates.

Table 2-4. Overview of the rescan history of 70 unused documents in a two-week period.


Document 521 is a lengthy account in the New York Times from Oct. 2nd about the Russian post. More than 1,000 mail cars have been sidetracked with up to 18 tons of letters, newspapers, and parcels. The state Railway Ministry refuses to carry more mail until the Russian Post Office makes good on 210 million rubles in old bills (19981002_NYT0250). The story of the clogged Russian postal service gets a link on 2/10, 3/10, and 8/10 to global economic crises with TF3077, TF2435, and TF3540. TF1290 finally carries the component postal service, a phrase encountered in a document not about this particular incident, but nonetheless semantically related. Other successfully rescanned documents include a Van Gogh exhibition in the National Gallery in Washington (d326: 19981002_ABC1830.1445) and the start of a trial implicating US fighter pilots accused of accidentally killing twenty skiers in the Italian Alps (d588: 19981002_PRI2000.0264). These are examples of high-profile subjects picked up the following day and beyond by other sources. TF1987 links four documents with sporting results via the names of two football teams that played the previous day (table 2-5 and table 2-6). At first, no informative relation was found between document 558 and any of the other 684 documents.

Arc         Vertex Labels   TF id
5412*525    ohio state      1987
14249*525   penn state      1987

Table 2-5. Shared dyads in TF1987.

In the next period 02/10/1998 – 03/10/1998 topical facet TF1987 links it to documents 621, 662, and 822. All four are about US college football (the text on American Football is in the Appendix on page 186 ff.).

Doc id  Document Label          TF id  Other topical facets linked to the document  Document Length (tokens)
558     19981002_NYT0404        1987   -                                            501
621     19981002_VOA1700.1271   1987   1077, 683                                    139
662     19981002_VOA1800.1255   1987   683                                          184
822     19981003_CNN1600.1236   1987   -                                            82

Table 2-6. Topical facet 1987 links document 558 to three other documents in a new imbricating period.

To conclude this paragraph on rescanning unused documents, we present two examples of ASR-rendered articles that did not find a link to other documents in the period 2/10 – 15/10. The first article states that religious services are good for health, and the second, crippled by a poor ASR-rendering, seems to be about new video releases coming on the market.


d8: 19981001_ABC1830.0920 – full text, as is

and there was some confirmation today of something we’ve heard before in fact reported here before faith may help to save your life quite live to the study of mostly affluent white adults published in the american journal of public health found that people who attend religious services have substantially lower mortality rates than people who do not

d462: 19981002_CNN1600.1458 – full text, as is

what and you have this week lots of people read the city of angels the film was over one of the video software dealers association rebels are said in a row mercury rising makes its debut in second place by while the budding singer and primary bill

In this paragraph we saw how topical facets link older documents to newer ones, and are as such not restricted to a specific moment in time. The exercise also demonstrates how long-distance relations between documents are formed. The content of a TF is not fixed; an update is always possible by including or removing documents. These observations lead to the next subject.

2.5 Open Sets and the Closure Issue

The text network is a container holding by definition all the data from the documents presented to it. Even though n-ary relations are defined between these data, this does not make the network a database in the conventional sense. Every vertex can be an entity or an attribute. In document d632, the palestinian group object has a hamas property, but a hamas entity with a palestinian group attribute is equally valid. On the other hand, the network consists of nodes representing the type of a token, and each document d from the set of documents D is a walk over the network. It is easy to return all the documents that visit vertex vq, where q is the lookup query. The closure of q is the set containing all ordered pairs (vq, d) with the desired property. This is a very shallow use of the possibilities of a text network. In fact no network is needed here; any inverted file index will do. Storing a mapping from words to their locations in a document is a much-applied structure in commercial document retrieval systems. The interesting part of the system under study is the existence of semantic relations between tokens both within and across documents of different collections with different time frames. However, it comes at a price: when all the prepared relations are taken into account, no closure in the foregoing sense is possible. When it is possible to infiltrate the entire network from the center to the farthest corners, how to avoid generating countless unproductive hypotheses in the search for information? This problem is known as unlimited semiosis, an interesting problem in its own right (Eco, 1994: 144). Take document 408 from table 2-3 as an illustration (the text on Israel-Hamas is in the Appendix on page 174). With 573 tokens it is the lengthiest article, carrying the most facets. Wide-ranging articles on very specific subjects tend to produce large topical facets. Apart from topical facet 1321, eleven other TFs are linked to it: 903, 1110, 1241, 1365, 1397, 1399, 1421, 1429, 2558, 2727, and 3566. Especially facets 1110, 1429, and 2558 are relevant here, since they include many components pointing to other texts, several of them not found in document 408. As always, all the computer-generated phrases are shown and no manual clean-up was applied.

TF1110 contains: [madeleine albright, sandy berger, adviser sandy, albright defense, political authorization, blatantly untrue, mideast summit, publicized atrocities, 300,000 refugees, joint chiefs, joseph ralston, gen joseph, gordon smith, widely publicized, begin serious, serious negotiations, negotiations toward, republican senators, state madeleine].

TF2558 contains: [bill clinton, pollard’s fate, free pollard, governments disavowed, israeli governments, israeli spy, passing secret, naval intelligence, daily yediot, possible release, recently israeli, intelligence analyst, ahronot reported, military documents, operation however, mideast summit, see how, critical issue, war machine, tactical position, rugova told, vocal supporters, bosnia starting, 4 tv, cannot allow, spread again, most vocal, britain among, tv news, disavowed pollard, pollard israeli, granted pollard, jonathan pollard, yediot ahronot, shadowy israeli, israeli intelligence-gathering, israeli citizenship, israeli agent, rogue operation, clinton agreed, intelligence-gathering body, early release, arrested outside, including cabinet, visited him, acknowledged him, many israeli, agent many, emerged later, officials including, states until].

TF1429 is almost identical to TF2558, except for benjamin netanyahu replacing bill clinton, and because it lacks a reference to the conflict in former Yugoslavia. Figure 2-10 gives a schematic view on the subjects touched by TF1110, TF1429, and TF2558 in the first days of October 1998. It illustrates the relation to certain themes missing in d408, but taken care of in other texts. The main actors collected by the three topical facets are now briefly presented to clarify the shortcut to those different stories and to justify the diagram:

• William “Bill” Clinton serves as the 42nd President of the United States from 1993 to 2001.
• Madeleine Albright holds the office of United States Secretary of State from January 23, 1997 until January 20, 2001.
• Benjamin Netanyahu is the 9th Prime Minister of Israel from June 1996 to July 1999.
• Samuel “Sandy” Berger serves as the United States National Security Advisor under President Clinton. He helps to formulate actions against Iraq and to plan the NATO bombing campaign against former Yugoslavia.
• Joseph Ralston is at that time Vice Chairman of the Joint Chiefs of Staff in the US Army.
• Gordon Smith is a Republican senator on the Foreign Relations Committee working on evidence linking Osama bin Laden to various terrorist activities.
• Jonathan Pollard is an ex-United States Naval civilian intelligence analyst condemned to a life sentence for spying in 1986. Israel publicly denied that Pollard was an Israeli spy until 1998, when he was granted Israeli citizenship.
• Ibrahim Rugova, the first President of Kosovo, confronted with the massive Serb oppression of the Albanian population (the 300,000 refugees component in TF1110 is about this conflict), is looking for support at that time from the US, Great Britain, NATO, and the European Union.

Just a few of the phrases from the three topical facets are actually seen in document 408. Two of the topical facets refer largely to the Pollard spy-case, although there is not a single word of it in any of the selected articles. Should this be considered as concept drift? According to Alexey Tsymbal, it is difficult to distinguish between a true concept drift and noise. Some algorithms may overreact to noise erroneously interpreting it as concept drift while others may be highly robust to noise, adjusting to the changes too slowly (Tsymbal, 2004). Noisy data are a problem in information retrieval and in classification tasks, both in a supervised and in an unsupervised context. A related issue is the consistency over time. A concept may cease to be valid though it retains partial overlap with the new situation. The issue will turn up again in chapter 3 where we evaluate the decision of the application to classify or not certain documents as belonging to a given topic.

Fig. 2-10. A view on other issues besides Israel and Palestine involving Bill Clinton and Madeleine Albright: Iraq, an Israeli spy in the US Navy, Serbia, the USA, and Great Britain.

Is the Pollard case an example of concept drift? It depends. There is a strong semantic overlap involving people and places seen in other contexts. Readers versed in Israeli-US foreign policy relations may detect more predictive features than someone who is not. Since the documents stem from the same community in the same period, one may presume that this spy case creates an embarrassing situation for the protagonists of the peace negotiations. In that sense the feedback given by topical facets 1429 and 2558 should not be considered as random noise, but as informative background provided by design. Information of this sort can trigger a thinking-outside-the-box insight precisely because it deviates from the obvious by pointing to the edge of a topical centre, where a meaningful link still exists. This is especially relevant in the case of non-routine problems: those a problem solver has not previously solved and for which no preexisting answer is at hand. It helps to reconstruct the mental model a reader has about an issue (Mayer, 1996). Bill Clinton is the president of the United States in 1998, and it will come as no surprise that the ‘bill clinton’ phrase used on its own can activate more issues than merely the foreign policy pointers found in the facets seen so far. To name one: that very day the Clinton Administration was implicated in a confrontation between animal rights activists and an Indian tribe who got the authorization to kill its first gray whale in seventy years off the coast of Washington state’s Olympic peninsula (text on Gray Whales in the Appendix on page 185). As nothing in the current topical facet list points directly to that subject, there is no danger of a relevance drift into the Pacific Ocean at this stage.

Although no closure can be obtained as a matter of course, it does not mean that these sequences are fruitless or unproductive. By virtue of the semantic relation between the components, topical facets constitute a middle layer between the document and the lexis. Defining a concept from the vocabulary up is like a query containing just bill clinton. It will produce a cluttered feedback, equivalent to the bewildering and equally pointless feat of describing a whale in general. Where should it start, what information should it not include? Although ultimately all words from a text will be reused, they are loose sand to the system. On the other side lies the document layer, where each item embodies a genuine, but petrified language realization. The aphorism at the beginning of this dissertation is about this dilemma. When starting from the vocabulary, every grammatical combination of words can be expressed. When relying solely on existing documents, one supposes that everything has already been said and merely lies waiting to be disclosed. The analytical rendering of this old and appealing belief is that acquiring knowledge is a dynamical system with an attractive limit-cycle to which language productions inevitably evolve, given enough time. Nassim Taleb discusses this fallacy and many related ones at length in a recent essay (Taleb, 2007). The subject will turn up in chapter 5 on page 140 ff. in a discussion about bag-of-words, where word order is ignored. A metaphor for the type of retrieval from a closed phase space is the jigsaw puzzle. Taken together the puzzle parts reshape the image of the assignment. The task is by definition accessible and complete. A 1000-piece puzzle may be awkward to put together, but the final result and all intermediary states are known. A personnel administration provides a comparable situation. Only facts fitting into the strict scheme of an employee file are classified. No great surprises are to be expected here when querying the database tables. Looking up well-organized data is often the most reasonable search method. However, it is not always easy to understand what a body of unstructured digital data is about. A user might not even know how to formulate a query to unlock the information. For a visitor with no clue about the nature of the data, topical facets provide proto-inductive information like semantic stepping stones through the data. Proto-inductive, because collecting the components happens with no specific learning goal. The proto-inductive information is a partial description of a document space, more complex than a vocabulary and less specific than the documents. The description occurs before conceptualization, and the inductive process of establishing connections between the various inputs is the responsibility of the owner in this arrangement. As prosthesis the software assists the user to pick one topical facet or one document and to display an open content space, roughly the way it was done in the introductory Walkthrough section on page 16. If these data were to be used in a document classification or retrieval task, the system is to induce a view describing the instance class containing the attributes requested by the user. To achieve this, an additional step is required.

2.6 Generalizing Topical Information

Finding relevant information is a process. Academic libraries issue extensive guidelines on how to look for information (Booth, Williams & Colomb, 2003). In real life the user often cannot resolve what he is looking for, yet all the same, he will know it when he sees it. The emphasis then is not on finding something specific but on becoming familiar with the area of interest, on expecting to encounter familiar facts. Or on hoping to experience a serendipity effect, coined in the 18th century by Horace Walpole in a letter to Horace Mann about people “(…) always making discoveries, by accidents and sagacity, of things they were not in quest of” (Johnson, 2002: 58). Let us call this circumstantial roaming of the search space the serendipical search.

Fig. 2-11. Serendipical insight is obtained as a consequence of finding a link aq outside a system, able to bridge two previously unrelated subgraphs pertaining to a problem P inside that system.

Although this chance encounter may look very natural, and phrases with serendipity may be part of the Internet lingo, the serendipity concept is regarded with suspicion due to its fictional origin and its seemingly arbitrary nature. In this work we give it a provisional definition. Assume that inside a system two partial information structures exist, one with a terminal vertex a' and the second with an initial vertex q'. Each structure is partially related to some problem P, but they are not directly linked to each other. A serendipical find would be the encounter with a component aq from outside the system, allowing a direct connection between those two structures. The ensuing graph could be, for instance, a decision tree solving P. The external component is useless without the existence of the appropriate docking sites within the system. Uncovering the missing link at the outside is not possible when one does not know what to connect to at the inside. Figure 2-11 illustrates the idea and shows how chance favors the prepared mind (a quote attributed to Louis Pasteur, 1822-1895). Not every unusual finding is of interest by itself, and many results appearing serendipitous should be more properly characterized as surprising (Dunbar, 1996: 388 ff.). The approximate search is a second search strategy, with more structure. Here a user knows roughly where to look, but he cannot jump to it without being acquainted with the exact terminology. Navigating through the content neighborhood, though, offers him a broad view on what plausible answers are available. Finally, there is the direct search. This is the standard method, presuming a level of expert knowledge competent enough to construct a query with the appropriate

vocabulary, and addressed to the relevant sources. Often these three different search approaches intermingle, a matter labeled information foraging by Peter Pirolli (Pirolli, 2007). Direct search is generally the preferred method to evaluate retrieval systems. The official TDT task guarantees the implied search competence by first selecting a few relevant documents from the corpus, and next presenting them as a query to the system. The topical facet prosthesis needs to support all three searches. Ultimately, the system should be able to deliver an answer to a direct search question, or to the roamer or the approximate searcher once they have made up their mind. The job to undertake is to dissociate topical facets from the originating articles, so that they lose their factual, documentary nature and turn into a freestanding informative object called topic, necessary to deal with precise questions. To achieve this goal, the topical facet – document matrix is translated into a bipartite and then a unipartite graph, and the main core of that graph is calculated. Abstracting away from data typified by source and date yields generalized topical information. The abstraction is not unconditional; it may more properly be called a generalization within the limits of a given set of documents, i.e., bounded by source and time. To illustrate the line of attack, table 2-7 collects the same 12 documents and 19 topical facets on the Israeli-Palestinian conflict and the Hamas involvement as previously listed in table 2-3. The right column ‘Total docs linked’ in the table shows how many documents a particular topical facet brings together. As a reading example: TF1397 [palestinian police, hamas activists] on the sixth row links 4 documents: 269, 393, 408, and 672. The bottom row of the table shows the total number of facets pointing to a particular document.

Topical   d45  d62  d268  d269  d290  d311  d367  d393  d408  d609  d632  d672    Total docs
facets                                                                            linked
903       √                                             √     √                   3
1110                                              √           √                   2
1241                                 √                  √     √     √     √   √   6
1321      √    √    √     √               √       √     √     √     √         √   10
1365                √                                         √     √         √   4
1397                      √          √                  √     √               √   5
1399      √         √                                   √     √               √   5
1421                                              √           √                   2
1429           √                                        √     √                   3
2558                                              √           √                   2
2727      √    √    √                √            √     √     √           √   √   9
2856                                 √                                            1
3054                                                                          √   1
3247                                                                √         √   2
3324                √                                                             1
3400                                                                √         √   2
3561      √                                                                       1
3566      √                                             √     √                   3
3640           √                                                                  1
Total
facets    6    4    5     2     4    1     5      8     12    5     2     9
involved

Table 2-7. Topical facet - document relation.


As a reading example: document 311 in the middle of the table has only topical facet 1321 pointing to it. This does not mean that d311 is weakly related to the subject. Document 311 is a small item, one of the three articles in the graph on page 40 (Fig. 2-9), and topical facet 1321 is much focused, connecting ten of the twelve documents. To find a document really weakly related to a subject, one needs both a topical facet linking no more than two documents, and a document with that and only that topical facet pointing to it. The relations in the topical facet – document matrix can be rendered as a bipartite graph. A bipartite graph

Gb(Vb, Eb)        (2.5)

is different from the general graph description in the sense that the set of vertices Vb consists of two nonempty disjoint sets, v1 for the documents and v2 for the topical facets, and that every edge in the edge set Eb joins a vertex in v1 to a vertex in v2 and to nothing else. Every topical facet is linked to the documents in which it participates.

Fig. 2-12. Illustration of the transformation of the Topical Facets – Document matrix (table 2-7) into a bipartite graph, with the 19 facets on one side and the 12 documents on the other. The solid black lines refer to the reading example in the text.

The topical facet – document matrix and its accompanying bipartite graph is effectively an autoassociative memory component. Autoassociative memories allow the retrieval of a stored pattern associated with an input pattern. The associative memory is a connection weight matrix wk, whose weight values for a particular associated pattern pair (xk, yk) are computed as:

(wij)k = (xi)k (yj)k        (2.6)
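For illustration, the outer-product weights of equation (2.6), superimposed over all stored pairs, could be computed as follows. This is a minimal sketch with hypothetical names, not the implementation used by the application:

def associative_weights(pairs):
    """Connection weights for associated pattern pairs (x, y) following
    equation (2.6); the memory superimposes the per-pair products
    (w_ij)_k = (x_i)_k (y_j)_k. Patterns are equal-length lists of +1/-1."""
    n, m = len(pairs[0][0]), len(pairs[0][1])
    w = [[0] * m for _ in range(n)]
    for x, y in pairs:
        for i in range(n):
            for j in range(m):
                w[i][j] += x[i] * y[j]
    return w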

Reference clues are associated with actual memory contents until a desirable match is found. Using them is often not so straightforward and requires a good understanding of the problem space. In our case, topical facets make up the input and the related documents the output. Because these relations are inherently present in the network, no supplementary tuning or external intervention is necessary. Figure 2-12 illustrates the bipartite concept. To keep the graph readable only a handful of the edges are highlighted.


As a reading example: on the top row left, topical facet 903 links three documents on the bottom row: 45, 393, and 408. Additional topical facets link to document 45 as well: 1321, 1399, 2727, 3561, and 3566. Other relations are grayed out to maintain readability. Next, the bipartite model transforms into a unipartite graph by connecting topical facets, provided they appear together at least once in a document (Fig. 2-13).

Fig. 2-13. A unipartite graph derived from figure 2-12 depicts the interrelations between topical facets. The circled facets belong to the main core.
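The projection could be sketched as follows; the function name is hypothetical, and the input is assumed to be the facet – document relation of figure 2-12:

from itertools import combinations
from collections import defaultdict

def unipartite_projection(facet_docs):
    """facet_docs maps a topical facet id to the set of documents it links
    (the bipartite relation of figure 2-12). Two facets become neighbors
    when they appear together in at least one document (figure 2-13)."""
    doc_facets = defaultdict(set)
    for facet, docs in facet_docs.items():
        for d in docs:
            doc_facets[d].add(facet)
    neighbors = defaultdict(set)
    for facets in doc_facets.values():
        for f1, f2 in combinations(facets, 2):
            neighbors[f1].add(f2)
            neighbors[f2].add(f1)
    return neighbors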

Topical facets in a core with low k-values have fewer links with other topical facets in that core, suggesting a limited contribution to the semantic content of a document. The core number of vertex v is the highest order of a core containing this vertex. The application will extract the topical facets belonging to the main core. Consider the unipartite graph given in figure 2-13, summarized in table 2-8. The core with k = 9 has 12 members. These are linked to at least 9 other members and possibly to other nodes with a lower k. No node outside the 9-core is connected to 9 or more members of the core. Consequently, 9 is the maximum core number in this graph. For instance, the topical facet node 1365 has 12 links: 9 to the other 9-core elements, one link to node 3054 (8-core), one to node 3324 (4-core), and one to node 2856 (3-core). Since nine is the highest order that contains node 1365 it is a member of the 9-core. The main core is obtained by recursively deleting all vertices and incident lines of degrees less than k. The remaining graph is the k-core. In this case, the main core has k = 9 and forms a cluster containing 12 of the 19 topical facets involved. k-cores can be described as cohesive regions that may contain subsets that are more cohesive.


The subgraph Hk = (W, A|W), where A|W is the set of arcs induced by the set W, is a k-core or a core of order k if and only if

∀v ∈ W : degH(v) ≥ k        (2.7)

and Hk is the maximum subgraph with this property. The core of maximum order is the main core (Batagelj & Zaveršnik, 2002; Guillaume & Latapy, 2004).

k      Frequency   Frequency %   Topical Facets
3      2           10.5          2856, 3640
4      1           5.3           3324
5      1           5.3           3561
8      3           15.8          3054, 3247, 3400
9      12          63.2          twelve other facets
       19          100.0

Table 2-8. Clusters of topical facets from weakly to strongly related.

The following weakly related components with k < 9 are discarded without losing essential information: TF2856 [told reporters, critical issue, war machine, tactical position, rugova told, leader tom] TF3640 [six months, enough votes] TF3324 [we don’t, today i] TF3561 [even if, did you] TF3054 [last month, crashed off ] TF3247 [defense minister, turkey’s defense] TF3400 [most serious, no details] These weak facets contain an idiosyncratic signature proper to the writer of the text (e.g. crashed off), and/or additional topical details not adopted by other authors (turkey’s defense minister). The twelve encircled topical facets in figure 2-14 belong to the main core. They are closely knit around a common semantic entity. Knowing the source of the main core – a cluster of 12 interrelated semantic components – and by virtue of its cohesive characteristic, the core represents condensed information about a specific subject. The generalization is now dissociated from the originating documents. It is not really an abstraction, since it is relative to the set of documents from a given community and only significant inside the specified time interval. In the Israeli-Palestinian case under discussion, the intersection of the topical facets in the main core produces the following – alphabetically sorted – phrases: a cknowledged him, agent many, ahronot reported, arrested outside, bomb factory, clinton agreed, daily yediot, disavowed pollard, early release, emerged later, free pollard, government spokesman, governments disavowed, granted pollard, hamas activist, horrific attack, including cabinet, intelligence analyst, intelligence-gathering


body, israeli agent, israeli citizenship, israeli governments, israeli intelligence-gathering, israeli security, israeli spy, israeli troops, jonathan pollard, joseph ralston, leader yasser, madeleine albright, many israeli, middle east, mideast summit, militant group, militant groups, militant palestinian, military documents, moshe fogel, most vocal, naval intelligence, officials including, operation however, palestinian group, palestinian militants, palestinian security, palestinian territories, passing secret, pollard israeli, pollard's fate, possible release, recently israeli, rogue operation, shadowy israeli, states until, visited him, yasser arafat, yediot ahronot, yitzhak mordechai

The sorting is presented here to accommodate the reader; the software encounters these phrases in their natural locations, i.e. in the respective text graphs. Because the phrases are arcs, one can assemble longer strings such as leader yasser – yasser arafat and daily yediot – yediot ahronot – ahronot reported, yielding: leader yasser arafat and daily yediot ahronot reported. The phrase collection is a freestanding informative object providing the reader with a rough summarization/impression of the events. The following algorithm extracts a subgraph, based on the core number k of the vertices. It recursively deletes all vertices, and the lines incident with them, of degree less than k. The remaining graph is the k-core.

Input:
    TF := (tf i,j)m×n    the topical facets matrix, m = number of arcs, n = number of vertices
    G = (V, A)           network G with its set of vertices V and set of directed lines or arcs A;
                         V ordered in increasing order of the number of lines coming in or leaving
                         a vertex (the alldegree of v ∈ V)
Output:
    MC                   set with the main core topical facets

for all v ∈ V                                   for every vertex in the network, in order
    corev := degv                               the core number for v is its alldegree
    Rv := {v' ∈ V | ab ∈ A, v' ∈ ab ∧ v ∈ ab}   neighbors of v
    for all v' ∈ Rv                             for every neighbor of v
        if degv' > degv
            degv' := degv' − 1                  decrease the alldegree of v' by one
            reorder V
        end if
    end for
end for
for all tf ∈ TF
    MC := {tf | max corev}                      collect the topical facets having vertices
                                                with the highest core number
end for

Algorithm 2-2. Core extraction on the unipartite topical facet graph.
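A minimal sketch of the same peeling procedure in Python, assuming the unipartite facet graph is available as an adjacency dictionary mapping a facet id to the set of its neighbors; the function names are illustrative and not part of the application:

def core_numbers(adj):
    # Repeatedly remove a vertex of minimum current degree; the degree
    # at removal time is its core number (Batagelj & Zaversnik, 2002).
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    core, remaining = {}, set(adj)
    while remaining:
        v = min(remaining, key=lambda u: degree[u])
        core[v] = degree[v]
        for u in adj[v]:
            # only neighbors of higher degree lose a link, which keeps
            # the recorded core numbers monotonically non-decreasing
            if u in remaining and degree[u] > degree[v]:
                degree[u] -= 1
        remaining.remove(v)
    return core

def main_core(adj):
    # The vertices belonging to the core of maximum order.
    core = core_numbers(adj)
    k_max = max(core.values())
    return {v for v, k in core.items() if k == k_max}

Applied to the unipartite graph of figure 2-13, main_core would return the twelve facets of the 9-core.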


When the system interfaces with a precise question, i.e. when a user is doing more than simply cruising the topical facet space, the question has to be translated into a set of elements comprehensible to the system. The application tries to match the query with existing topical facets and responds with pointers to the relevant documents or with any other action programmed by the owner. In a felicitous situation each term can be linked to a topical facet. Each topical facet activates other topical facets, and topical facets have pointers to relevant documents, as sketched in figure 2-14.

Query

    Term – Facet Dictionary       Facet – Document Dictionary
    w1 {TFa, TFb, TFc, …}         TFb {da, db, dc, …}
    w2 {TFb, TFd, TFe, …}         TFe {dd, df, dg, …}
    wn {TFa, TFc, TFe, …}         TFn {db, dg, dq, …}

Fig. 2-14. The query elicits a Term – Document relation via the Topical Facets layer. Words {w1, w2, w3} from the query activate certain facets {TFb, TFe, TFn} and every facet in turn activates a set of documents.
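The two dictionaries of figure 2-14 behave like ordinary inverted indexes. A minimal sketch in Python, assuming both are plain mappings from keys to sets (the names term_to_facets and facet_to_docs are illustrative):

def documents_for_query(query_terms, term_to_facets, facet_to_docs):
    # First hop: words activate topical facets.
    facets = set()
    for term in query_terms:
        facets |= term_to_facets.get(term, set())
    # Second hop: every activated facet points to documents.
    documents = set()
    for tf in facets:
        documents |= facet_to_docs.get(tf, set())
    return documents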

2.7 Topics and Queries

Is the intersection of the topical facets from the main core a topic, and what is the relation between a query and a topic? It is understood that the query intended here is anything of greater substance than a single token thrown at a full-text indexer.

With regard to the meaning of the term topic, there is no unique definition. The DARPA Topic Detection and Tracking Project says: A topic is defined to be a seminal event or activity, along with all directly related events and activities. Early in the DARPA study the meaning of topic was restricted to an event, something happening at some specific time and place. The recent definition is broader and includes events without necessarily pointing to specific places and/or moments. The events may be unexpected, such as an accident, or anticipated, such as a political election (Wayne, 2000). The description is similar to definitions found in Webster's Revised Unabridged Dictionary or in the Cambridge International Dictionary of English. A joint property in these definitions is the designation of topic as the pivot of an event, action, conversation, thought, etc. This characteristic is similar to the outcome of the set intersection operations performed in the previous paragraph, where a relation was sought between salient elements from different text parts in a collection characterized by a community.

Information retrieval theory turns to relevance theory, and specifically to logical relevance, for a formal description of this pivotal attribute. According to John Hutchins the premise is the following: if a given document d is about a request Q, then there is a high likelihood that d will be relevant with respect to the associated information need. Thus, the information retrieval problem is reduced to deciding the aboutness relation between documents and queries (Hutchins, 1977).


This view on IR has since evolved into a theoretical research field known as Situation Theory or Situation Semantics (Barwise, 1981; Huibers, Lalmas & van Rijsbergen, 1996)8. A topic is aboutness enclosed in a spatio-temporal frame: it is about something happening at a particular moment in time, at a particular place. A query is the representation of an information need. Based on the definition of topic as the pivotal element in a language realization, a well-formed query is a topic with a question mark. Any sensible answer should contain, explicitly or implicitly, the topic kernel formulated by the query, augmented with relevant additional and related items of information at the disposal of the answering system, if any.

Formally: a query Q is a nonempty set Q = {A, B, C} whose elements are themselves sets, namely the topical facets A, B, C implied by the query. If these TFs have no elements in common known to the application, then A ∩ B ∩ C = ∅, meaning that as far as the system is concerned, the query is unintelligible. Answering a question requires understanding the question to begin with. If, however, the query contains material the application understands, i.e. if the tokens used in the query are embedded in topical facets, then the outcome TP of the intersection of these topical facets is the set of activated facet elements (implying in the same move the existence of possibly deactivated elements).

    TP = { x ∈ ∩i=1..|Q| Ai }, with {Ai : i ∈ Q}    (2.8)

In words: a set TP is obtained with every element x from the intersection of a topical facets set Ai with every other topical facets set in Q. The set TP is called topic of interest, or topic for short.

Example: suppose the user is interested in information on a bailout package passed by the Japanese parliament to save the banking system, at a specific moment in time. Assume that this query points to three topical facets containing components known by the system:
• The banking system
• The Japanese parliament
• A bailout package

Each separate topical facet collects a number of elements with links to a great many distinct documents. It so happened that many economies in Latin America, Russia, and Asia faced monetary difficulties at that moment in 1998, and more than one bailout was set up by national and international financial institutions. Therefore, more than one topical facet contains the phrase banking system or bailout package, each pointing to documents only partially relevant to the query. For instance TF179 points to 19 documents and TF184 to 36 documents in the period Oct. 15th – Oct. 20th 1998.

Topical facet # 179 containing [monetary fund, bailout package, latin america's, northern portugal, 30 billion, world bank, american nations]. Related documents: 19.
Topical facet # 184 containing [interest rates, bailout package, highest level, discount rate, sustained economic, financial institutions, rates late]. Related documents: 36.

8 Related Research on page 65 has references on this theme.


Topical facet # 229 containing [banking system, market rally]. Related documents: 7.
Topical facet # 401 containing [banking sector, japan's upper, upper house]. Related documents: 3.

Retrieving all the documents somehow connected by the bailout facets would overshoot the question, as testified by the following examples implicating Thailand, Russia, and Brazil, but not Japan:

d4897: 19981016_APW0453 – extract
(…) Thailand's interest rates, although now coming down slightly, remain among the highest in the region, the legacy of a tight monetary policy dictated by the terms of a dlrs 17.2 billion International Monetary Fund economic bailout package.

d5140: 19981016_NYT0286 – extract
(…) Since the economic collapse of Russia in August, investors and economists have sharpened their focus on Brazil. While President Clinton lobbied Congress for $18 billion to restock the International Monetary Fund this week, with an eye toward assisting Brazil, the team that returned here from the IMF's annual meeting in Washington earlier this month scoured government accounts for politically feasible sources of savings and reform. (…) Michel Temer, president of the lower house of Parliament, said in an interview here that Congress would begin debate on social security reform Nov. 4, and would begin what promises to be a stormy debate on political reform soon after (…)

[Figure: three overlapping topical facets, japanese parliament, banking system, and bailout; the intersection TP holds the activated components characterizing the topic, the regions outside it the deactivated components and a zone of partial agreement.]

Fig. 2-15. A query generates the intersection of the implicated topical facets, activating and deactivating certain components and constituting a topic TP.


The intersection of all the topical facets restrains the choice and gives the collection of returned documents a higher relevance with respect to the request (Fig. 2-15). The intersection of topical facets rules out Russia, Brazil, or Thailand as parties concerned in the relief measures; it deactivates any banking issue not related to the bailout and activates only the financial support theme, among the many points on the agenda of Japan's House of Representatives. The reader is reminded that the current intersection is an intersection of topical facets and not of tokens: neither the words japanese parliament, nor bailout or banking system need to be literally rendered for a text to be retrieved. For instance, the sample text below uses japan's upper house of parliament and japanese government. The operation locates 12 articles from six sources in the chosen period:

ABC19981016.1830.0936, APW19981016.0251, APW19981019.0532, APW19981019.0507, CNN19981017.1130.0307, NYT19981016.0148, NYT19981016.0249, NYT19981019.0076, PRI19981016.2000.1846, VOA19981016.0600.2936, VOA19981016.0600.2183, and VOA19981016.0600.0197

Two are reproduced below:

Example d5192: 19981016_VOA0600.0197 – full text ASR, as is.
japan's upper house of parliament has given final approval to a package of laws the clinton and japan's ailing banking sector the new laws also that they five hundred twenty billion dollar fund of taxpayer dollars to bail out a week but solvent banks earlier parliament passed a supplemental budget to fund the reforms japan is that under intense international pressure to find a banking center there's been crippled by a series as speculative and loans made during the nineteen eighty

Example d4859: 19981016_ABC1830.0936 – full text ASR, as is.
still on the money tonight two steps today to address the most obvious weak spots in the global economy that japanese government has given final approval for public funds to failing banks international monetary fund knows it will be getting that eighteen billion extra dollar from the u. s. when congress and the president sign the budget agreement next week

Regions with a partial overlap are generated too. They are, strictly speaking, off the mark although still thematically related to the central topic. The partial overlap is found in documents with a similarity value at the limit of the on-topic range; values closer to the topic have a stronger relation. In this case it could be about bailing out other banking systems in other countries, as seen in documents 4897 and 5140 above. When the query mobilizes many topical facets the peripheral zone around the topic can be very rich and interesting, as illustrated by the examples on page 76 ff.
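A minimal sketch of the facet intersection of equation 2.8 in Python, assuming each topical facet is stored as the set of its component phrases; the abbreviated facet contents below are illustrative excerpts of the example above:

# Illustrative, abbreviated facet contents from the bailout example.
facets = {
    "TF179": {"monetary fund", "bailout package", "world bank"},
    "TF184": {"interest rates", "bailout package", "financial institutions"},
}

def topic_of_interest(facet_ids, facets):
    # Equation 2.8: intersect the topical facets implied by the query.
    # An empty outcome means the query is unintelligible to the system.
    sets = [facets[f] for f in facet_ids]
    return set.intersection(*sets) if sets else set()

print(topic_of_interest(["TF179", "TF184"], facets))  # {'bailout package'}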


2.8 Queries by Proxy

When a system responds to a query by delivering a set of documents, the answer is customarily called a retrieval output, obtained by ranking documents in descending order with respect to their relation to the query. The selection of documents will only approximate the query. A widely used method is the vector space model introduced by Gerard Salton and his SMART group more than 30 years ago (Salton, 1989). Queries and documents are represented in the same vector notation in a shared n-dimensional space, suggesting a geometrical reading of the retrieval process. Assuming the query to be in the same space as the relevant documents is assuming that a collection exists of yet unknown documents that can be revealed by comparing a representation of a candidate document to a representation of the query. In other words, the retrieval output is defined as a category with the query acting as a prototype helping to identify category members based on shared salient properties. According to a broadly accepted view, a particular member and a set of membership rules, such as similarity, define a category (Lakoff, 1990: 6; Rosch, Mervis, Gray, Johnson & Boyes-Braem, 1976). Peter Pirolli presents a recent overview on prototype and exemplar theories in information extraction (Pirolli, 2007: 20 – 3).

Queries are not necessarily good prototypes. Except perhaps for the direct searcher, no interrogator has by definition a sufficient knowledge of the body of data he is questioning, and there is a priori no reason why the query would contain features with a good, a poor, or any family resemblance to those of the data. This is especially true in the event of a serendipical or an approximate search. This view is supported by Peter Bollmann-Sdorra and Vijay Raghavan, who show how documents and queries are different objects that should not be treated carelessly in a uniform manner (Bollmann-Sdorra & Raghavan, 1998). To understand why this is the case, one should remember that the similarity of documents has been derived from the existence of a society producing and sharing communal content. Any question addressed to the collection possibly originates from an outsider, using a different language, expressing different interests (Wei Jin & Srihari, 2007). In the official TDT evaluation setup, to be reviewed in the next chapter, training documents retrieved in advance from the on-topic subset of the corpus ensure the relevance of the query.

The Topical Facets Application, on the other hand, does not start from the principle that a relation between the query and the data exists. If one cannot take for granted transitivity between the query and the corpus, one should look for a document similar to the query, instead of asking if the query is similar to a document. For each document it is always possible to construct a query that will retrieve the document, while the reverse may cause difficulties. Consequently, a true prototype for a retrieval output is the single document from a body of data representing the query best, in the sense that this document would retrieve the query. The Topical Facets Application searches for the shortest document approximating an exact replica of the query. Having such a prototype in our model settles the problem of moving about different search spaces. For the user of the system there is the additional benefit that evaluating the retrieval output can start with inspecting this prototype. If the prototype is off the mark, the rest of the retrieved documents will be even more inadequate.
In the special case where only one document in the data answers the query, it would also be recovered. All documents linked to the activated topical facets can now be compared to the proxy with a document-by-document similarity method. The next paragraph describes the semantic preference clustering procedure using the similarity scores obtained. A similarity score ranging from 0 to 1 conveys the degree of resemblance.
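A minimal sketch of the proxy selection in Python; the argument names are illustrative, candidates is assumed to map a document id to its token sequence, and similarity is a document-by-document measure such as the one formalized in the next paragraph:

def query_proxy(query, candidates, similarity):
    # The best proxy is the document most similar to the query; among
    # equally similar candidates the shortest one wins, approximating
    # "the shortest document that would retrieve the query".
    def rank(doc_id):
        text = candidates[doc_id]
        return (similarity(query, text), -len(text))
    return max(candidates, key=rank)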


The algorithm computes a document-by-document similarity value by comparing each document of a set with all the other documents in that set. The similarity value is the combination of three metrics: vertex similarity, walk similarity, and vertex/walk ratio.

Input:
    DR                   a collection of documents related to a query
Output:
    SDR := {Sdi}         similarity values for every d ∈ DR

for all di ∈ DR                                 for every document di from a collection of related documents
    let ti ∈ G                                  ti is the text graph representing document di in network G
    Vi := {v ∈ ti}                              set of vertices from text graph ti
    Ai := {a ∈ ti}                              set of arcs from text graph ti
    Wi := {ai, ai+1, ai+2, ai+3, …} ∈ Ai        set of walks from text graph ti
    for all dj ∈ DR ∧ (i < j)                   for every document dj from the same collection, with i < j
        let tj ∈ G                              tj is the text graph representing document dj in network G
        Vj := {v ∈ tj}                          set of vertices from text graph tj
        Aj := {a ∈ tj}                          set of arcs from text graph tj
        Wj := {aj, aj+1, aj+2, aj+3, …} ∈ Aj    set of walks from text graph tj
        if |Vi ∩ Vj| > 0                        if there is at least one vertex in common
            simV = 2 |Vi ∩ Vj| / (|Vi| + |Vj|)                          vertex similarity
            simW = 2 |Wi ∩ Wj| / (|Wi| + |Wj|)                          walk similarity
            α = 2 |Vi ∩ Vj| / (2 |Vi ∩ Vj| + |Wi| + |Wj|), β = 1 − α    vertex/walk ratio
            Sdi := {simV × (α + β × simW) | simW > 0}                   similarity value for document di
        end if
    end for
end for

Algorithm 2-3. Document-by-document similarity.


The similarity measure between two texts represented by their graphs Ti and Tj starts with the intersection of the graphs Ti ∩ Tj = Tk, where Tk holds the vertices and arcs appearing in both Ti and Tj. As stated earlier, looking for a carbon copy of a document is not interesting. Asymmetry keeps a conversation going. In a situation with perfect symmetry both parties hold the same information, making any exchange pointless. Moreover, perfect symmetry in a dialogue implies that both parties are aware of this fact. Ditto in a search: a symmetric query is identical to the information sought. It will return at best a confirmation of what is already known. To answer these concerns, the following metric defines an estimate of the match between two documents:

    similarity = simV × (α + β × simW)    (2.9)

The method is based on a document similarity measure developed for conceptual graphs by Manuel Montes-y-Gómez et al. (Montes-y-Gómez, López-López & Gelbukh, 2000), adapted here for use in text networks. simV, the first part of equation 2.9, simply measures how many terms the two text graphs Ti and Tj have in common. Only informative vertices as defined in the introduction (and later in chapter 4) take part. The vertex similarity value simV (2.10) is a Dice coefficient (Dice, 1945). It normalizes for document length by dividing by the sum of the number of vertices in both documents. A coefficient of 1.0 indicates identical sets of vertices while 0.0 signifies a complete lack of similarity.

    simV = 2 |VTi ∩ VTj| / (|VTi| + |VTj|)    (2.10)

where |VTi ∩ VTj| is the number of vertices shared between the two documents.

The second part of equation 2.9 calculates the relational similarity among terms. Walk similarity simW (2.11) tells how many of the same vertices are connected in both texts and how strong that connection is, the idea being that the same term appearing in two documents is more semantically similar if it is used in a similar context. The context under discussion is formalized as a walk in the text graph. A walk is a sequence of incident arcs and vertices. A connection is evaluated by looking for walks that are similar to a degree. |WTi ∩ WTj| is the degree of connection of the shared vertices, i.e. of vertices seen in a walk obtained by intersecting the two text graphs Ti and Tj. The similarity metric measures the proportion between the degree of connection of the vertices in this document and the degree of connection of the same vertices in both documents, using a modified formula for the Dice coefficient:

    simW = 2 |WTi ∩ WTj| / (|WTi| + |WTj|)    (2.11)

According to Montes-y-Gómez et al., the combination of simV and simW is roughly multiplicative. The relational similarity has a secondary importance: there cannot be a relation without the existence of some shared vertices. Even without common relations a level of resemblance remains


between two texts by virtue of the presence of similar terms. So, while the cumulative similarity measure is proportional to simV, it will not be zero when simW = 0. Smoothing out the effect of simW is the third part in the formula. In case simW = 0 the overall similarity depends solely on the value of the vertex similarity simV. When simW > 0 it attributes a fraction of the general similarity to the relational likeness. Coefficient α expresses the importance of this fraction. The values of α and β depend on the structure of the text graphs Ti and Tj. Their values reflect the degree of connection of the elements of Ti in the graphs Ti and Tj. The coefficient α expresses the ratio of information delivered by the vertices proper as compared to the information coming from vertices connected by a walk:

    α = 2 |VTi ∩ VTj| / (2 |VTi ∩ VTj| + |WTi| + |WTj|)    (2.12)

and its complement β = 1 − α, where |VTi ∩ VTj| is the number of shared vertices and |WTi| + |WTj| is the degree of connection of the shared vertices. The algorithm calculates for each document-by-document relation the different similarity components by intersecting a set of documents with itself, except when the first and second document keys point to each other, or when this intersection has been done already. Therefore, n(n − 1)/2 computations are possible, where n is the number of documents. The result is a matrix where every cell holds one document-by-document similarity value. Only cells with a walk similarity simW > 0 are entered. This guarantees a relation between two documents based on a mutually shared semantic construction and not only on loose words. Chapter 3 examines the contribution of the document-by-document metric to the topic detection and tracking experiments.
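A minimal sketch of equations 2.9 to 2.12 in Python, assuming each text graph has been reduced to a set of informative vertices and a set of walks (hashable tuples of arcs); the function name is illustrative:

def doc_similarity(vertices_i, walks_i, vertices_j, walks_j):
    # Returns None when the documents share no vertices or no walks,
    # mirroring the rule that only cells with simW > 0 are entered.
    shared_v = len(vertices_i & vertices_j)
    if shared_v == 0:
        return None
    shared_w = len(walks_i & walks_j)
    if shared_w == 0:
        return None
    sim_v = 2 * shared_v / (len(vertices_i) + len(vertices_j))            # eq. 2.10
    sim_w = 2 * shared_w / (len(walks_i) + len(walks_j))                  # eq. 2.11
    alpha = 2 * shared_v / (2 * shared_v + len(walks_i) + len(walks_j))   # eq. 2.12
    beta = 1 - alpha
    return sim_v * (alpha + beta * sim_w)                                 # eq. 2.9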

2.9 Semantic Preference Clustering

Being related to the prototype is a necessary, but not a sufficient reason to be relevant to the query. The semantic relation can be superficial, based on one topical facet only. Semantic Preference Clustering, or collecting documents in clusters built on their mutual semantic preference, addresses this weakness. Table 2-9 illustrates the concept. This document-by-document matrix relates to the Cambodian Elections Task, subject of a detailed analysis in the next chapter on page 76.

The first cluster is made with the documents most related to the prototype (the document most similar to the query). The prototype is the focus of the first cluster. If a document has its highest similarity score with the focus document, i.e. when no similarity score between the document and any other document is higher, then that document enters the cluster. Documents appear in one cluster only (hard clustering). At the end of the procedure documents not consigned to any cluster are removed. One may assume that they have no business with the query, since no favorite relation with any of the retrieved documents was found. In this specific case the application produces five clusters containing 13 different documents in response to the Cambodia query. The first cluster contains the original prototype


document together with two others [651, 4880, 7031]. Cluster 2 contains the documents [7379, 7399, 7759]. Cluster 3 has two documents [4442, 4463]. Cluster 4 has three documents [28, 654, 5113] and cluster 5 contains two documents [4532, 5295]. Together the clustered documents make up the final retrieval output.

The clustering works as follows: there is a set D of documents, analyzed on their mutual similarity. Results are stored in a symmetric document matrix D × D with d nodes. Each node in the matrix contains a similarity value v. D[i, j] stands for the similarity score between vi and vj. D[i, i] = 0 is the similarity value of a document with itself. vj is a nearest neighbor (NN) of vi if and only if D[i, j] = max D[i, h], where h = 1…d and h ≠ i. If vj is a NN of vi then vi and vj are in the same cluster. In this setting a cluster is a connected component in the symmetric transitive closure of the nearest neighbor relation. In other words, we say that vi and vj are strongly connected if and only if vi = vj, or:
• vi is a NN of vj, or
• vj is a NN of vi, or
• ∃w ∈ V such that vi is strongly connected to w and w is strongly connected to vj.
Semantic Preference Clustering is a nearest neighbor algorithm, a special case of the k-NN classifier where the class is predicted to be the class with the highest similarity score, i.e. when k = 1.
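A minimal sketch of the procedure in Python, following the sequential description above (prototype as first focus, leftovers seeding the next cluster); sim is assumed to be a mapping from document-id pairs to the symmetric similarity values, and singleton clusters correspond to the documents the application removes:

def semantic_preference_clusters(sim, docs, prototype):
    def s(i, j):
        if i == j:
            return 0.0
        return sim.get((i, j), sim.get((j, i), 0.0))

    def nearest(d):
        # nearest neighbour of d over the whole document set
        return max((o for o in docs if o != d), key=lambda o: s(d, o))

    unassigned = [d for d in docs if d != prototype]
    clusters, focus = [], prototype
    while focus is not None:
        # a document joins the cluster when the focus is its nearest neighbour
        members = [d for d in unassigned if nearest(d) == focus]
        clusters.append([focus] + members)
        unassigned = [d for d in unassigned if d not in members]
        # next focus: the unassigned document most similar to the prototype
        focus = max(unassigned, key=lambda d: s(d, prototype), default=None)
        if focus is not None:
            unassigned.remove(focus)
    return clusters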


The following is a reading example of table 2-9, about clustering the results of the document-by-document similarity matrix based on their semantic preference. Document 7031 in the circled cell range has a relation > 0 with all other documents in the matrix. The documents with the highest similarity values are: 651 with a value of 0.165, 4880 with a value of 0.141, 7379 with a value of 0.196, 7399 with a value of 0.222, and 7759 with a value of 0.131. Document 651 has its highest similarity value with d7031; equally so for d4880. Between d7379 and d7399 a strong mutual relation exists, with a similarity value of 0.671 instead of 0.196 for d7379 with d7031 and 0.222 for d7399 with d7031. Finally, the similarity value between d7759 and d7399 is 0.160 instead of 0.131 between d7759 and d7031. This produces a first cluster with d7031 as focus document and with d651 and d4880 as members, having the highest preference for document 7031. A second cluster is started with the document having the next highest similarity score with the prototype, in this case document 7399. The remaining documents that did not join the first cluster are scanned again. As shown previously, document 7399 will collect d7379 and d7759. The process is repeated until all documents are processed. Documents not belonging to a cluster are removed from the retrieval output.
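As a usage note: running the sketch above on the values of table 2-9, with the document identifiers in docs and d7031 as prototype, and filtering out the singleton clusters of the leftover documents 4559, 4760, and 7062, reproduces the five clusters described here:

clusters = [c for c in semantic_preference_clusters(sim, docs, 7031)
            if len(c) > 1]
# [[7031, 651, 4880], [7399, 7379, 7759], [4442, 4463],
#  [654, 28, 5113], [5295, 4532]]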


Table 2-9. Symmetric document-by-document similarity matrix with documents related to a specific query. The row and column headers hold a document identification number, the cells in the matrix give the mutual similarity value. The circled cell range refers to a reading example in the text.


Document    28     651    654    4442   4463   4532   4559   4760   4880   5113   5295   7031   7062   7379   7399   7759
28          -      0.000  0.027  0.005  0.002  0.003  0.019  0.005  0.014  0.008  0.009  0.007  0.005  0.000  0.000  0.008
651         0.000  -      0.036  0.017  0.019  0.002  0.018  0.000  0.056  0.005  0.017  0.165  0.006  0.107  0.120  0.084
654         0.027  0.036  -      0.031  0.032  0.015  0.000  0.008  0.065  0.021  0.033  0.051  0.018  0.020  0.025  0.052
4442        0.005  0.017  0.031  -      0.651  0.013  0.012  0.003  0.023  0.010  0.002  0.061  0.008  0.067  0.072  0.048
4463        0.002  0.019  0.032  0.651  -      0.013  0.009  0.002  0.028  0.010  0.002  0.033  0.008  0.019  0.031  0.046
4532        0.003  0.002  0.015  0.013  0.013  -      0.019  0.005  0.012  0.011  0.040  0.005  0.009  0.009  0.009  0.018
4559        0.019  0.018  0.000  0.012  0.009  0.019  -      0.007  0.012  0.006  0.017  0.013  0.011  0.008  0.008  0.012
4760        0.005  0.000  0.008  0.003  0.002  0.005  0.007  -      0.016  0.009  0.006  0.017  0.011  0.008  0.008  0.023
4880        0.014  0.056  0.065  0.023  0.028  0.012  0.012  0.016  -      0.004  0.038  0.141  0.017  0.044  0.055  0.085
5113        0.008  0.005  0.021  0.010  0.010  0.011  0.006  0.009  0.004  -      0.004  0.013  0.013  0.009  0.013  0.014
5295        0.009  0.017  0.033  0.002  0.002  0.040  0.017  0.006  0.038  0.004  -      0.024  0.005  0.009  0.005  0.015
7031        0.007  0.165  0.051  0.061  0.033  0.005  0.013  0.017  0.141  0.013  0.024  -      0.007  0.196  0.222  0.131
7062        0.005  0.006  0.018  0.008  0.008  0.009  0.011  0.011  0.017  0.013  0.005  0.007  -      0.010  0.011  0.028
7379        0.000  0.107  0.020  0.067  0.019  0.009  0.008  0.008  0.044  0.009  0.009  0.196  0.010  -      0.671  0.108
7399        0.000  0.120  0.025  0.072  0.031  0.009  0.008  0.008  0.055  0.013  0.005  0.222  0.011  0.671  -      0.160
7759        0.008  0.084  0.052  0.048  0.046  0.018  0.012  0.023  0.085  0.014  0.015  0.131  0.028  0.160  0.108  -

2.10 Related Research

In this chapter we looked at network features supporting the discovery of semantic relations between documents, and at how topical facets allow classifying documents in a number of ways. There is a vast body of research on how best to approach information classification tasks, ranging from purely empirical constructions engineered to solve a specific problem to elaborate general language models.

Philip Resnik is known for his contribution on semantic relatedness using network representations. His network is not an unrestricted text network: he evaluates semantic similarity with a taxonomy by measuring the distance between the nodes corresponding to the items being compared. The shorter the path from one node to another, the more similar they are. An acknowledged problem is that links in a taxonomy represent uniform distances (Resnik, 1999). In 2003 David Blei, Andrew Ng, and Michael Jordan developed the Latent Dirichlet Allocation (LDA) model (Blei, Ng & Jordan, 2003). The model assumes that words in a document arise from a mixture of topics. The topics as defined by the authors are shared by all documents in a collection. The topic proportions are document-specific and randomly drawn from a Dirichlet distribution. Two years later David Blei and John Lafferty proposed the Correlated Topic Model. This model uses a more flexible distribution for the topic proportions, allowing for covariance among the components. The result is a more realistic model of the latent topic structure where the presence of one latent topic may be correlated with the presence of another. Under a pure Dirichlet distribution the components of the proportions vector are nearly independent, leading to the unrealistic assumption that the presence of one topic is not correlated with the presence of another (Blei & Lafferty, 2005).

Long before the advent of the small-world theories it was accepted that objects could be ordered in multiple ways rather than in a single pre-determined, or taxonomic, order. Faceted classification, for instance, does not assign fixed slots to subjects, but uses the mutually exclusive and collectively exhaustive properties of a class. According to Talmy Givón, learning thrives on the gradual re-interpretation of categorical boundaries, something impossible to accommodate within a discrete categorical system (Givón, 1999: 91). The term facets was introduced by the Indian librarian and classificationist S.R. Ranganathan for his Colon Classification in the early 1930s. It is a flexible classification method because it makes few assumptions about the scope and organization of the domain (Taylor, Miller & Wynar, 2006). A well-known application of faceted classification is in faceted navigation systems enabling a user to explore an information tree going from a category to its sub-categories, while choosing the order in which the categories are presented. This differs from a traditional taxonomy where the hierarchy of categories is fixed. For example, a used car guide might group cars first by brand, then by type, price, motorization, performance, and options. In a faceted system one user might decide first to divide the cars by price, then by brand, and then by options, while another user could first sort the cars by type and then by performance. Faceted navigation, like taxonomic navigation, guides users by showing them the available categories, but does not require them to browse through a fixed hierarchy that may not suit their way of thinking.
One application using the concept is the classification tool developed by OCLC, the Online Computer Library Centre in Dublin, Ohio9, with support from the U.S. Library of Congress (Dean, 2003). It is a metadata scheme called FAST (Faceted Application of Subject Terminology)

9 OCLC owns the rights on the Dewey Decimal Classification system, mentioned in the Introduction, and is responsible for its maintenance and development. 60,000 libraries in 112 countries around the world utilize OCLC services.


designed for web subject data classification. Another system uses the Semantic GrowBag to demonstrate faceted classification. It automatically organizes topic facets for community-specific document collections. The facets are a way to categorize content. The approach uses metadata tags from documents to extract information about prevalent topics and then applies PageRank to determine relations between these topics (Diederich, Thaden & Balke, 2006). The approach is based on insights from Marti Hearst (Hearst, 2006).

The discussion on faceted versus taxonomic classification deals with practical difficulties; underlying it, however, is a conception with important implications. If the purpose of a computer system is to model a representation in the domain of information classification, then, according to Peter Gärdenfors, there are two dominating methods. The symbolic approach assumes that a cognitive system can be described as a Turing machine with a finite set of symbols, a finite set of states, and a procedure. The second approach is associationism, where associations among different kinds of information elements carry the burden of representation (Gärdenfors, 2000: 1 – 4). This dichotomy goes back to the discussion between Bertrand Russell, for whom a sentence expresses a proposition which is true or not, and John L. Austin's view where there is always a contextual parameter coming between sentence and proposition: the situation the sentence is about. This philosophical debate has impact on language theory, notably through the contributions of Rodney Needham (Needham, 1975) and Richard Rorty (Rorty, 1980), who states that every representation is a mediation and that we must drop the notion of correspondence and see propositions connected with other propositions rather than with the world. For a position in favor of the truth conditions of sentences, see Barbara Abbott (Abbott, 1997). Jon Barwise and John Etchemendy prefer the Austinian solution that is fundamental to situation semantics (Barwise & Etchemendy, 1987: 122 – 23).

In Gärdenfors' cognitive model information is represented geometrically. Points or regions in a space of domains represent properties and concepts. Context is modeled as a function distributing weights on the domains. For Lawrence Barsalou a concept is a dynamical system, a skill for tailoring representations to the constraints of situated action. Because the same category can be encountered in a variety of settings and serve many goals, no single representation could possibly serve in all the different situations (Barsalou, 2003).

Situation Theory was developed out of efforts by Jon Barwise to provide the semantics for a situation. Situations are entities with characteristics having relations with one another. When the relevance of a document depends on an inference, the document is less relevant when that inference is uncertain. Since in standard logic relations are truth-based rather than information-based, the correspondence is erroneously rendered. Situation Theory formulates a logic describing this relation as a flow of information to capture its relevance (Devlin, 1995; Dretske, 1999). Mounia Lalmas, who examines several logical models, gives an interesting overview of Information Retrieval founded on these ideas (Lalmas, 1998). Nowadays the issue of context, situation, or aboutness often remains hidden in the operational definitions of various mainstream retrieval models.
In the vector space model, for example, if the cosine of the angle between document and query vectors is above a certain threshold, the document is supposed to be about the query. The same is true in probabilistic retrieval models. A function


evaluating the degree of correspondence between a document and a query has the following form: a set of relevant documents is a priori associated with a test query. In the actual experiment a matching function ranks a list of documents, depending on the test score between the test query and a particular document representation. Statistical tests of significance are applied to compare performances of different ranking functions across a set of test queries.

The empirical research model is a cornerstone of several document classification tasks, though it is questioned as well. Several of the matching functions rely on constants, and the value attributed to a parameter can seriously affect the outcome of the function. The particular values for these constants are not obtained from theory, but are tuned according to a particular document collection and test query set (Wong, Song, Bruza & Cheng, 2001). This influences the results. Véronique Hoste shows that apart from the bias created by model preference, important performance fluctuations are obtained due to parameter selection: sometimes she observes improvements, sometimes large deviations (Hoste, 2005: 179). Eamonn Keogh argues that a system should be able to set its parameters a priori from analysis of the information environment, and not by estimating parameters from the data themselves, in order to get parameter-free cognitive models (Keogh, Lonardi & Ratanamahatana, 2004). This amounts to reckoning with the situation as defined by J. Barwise.

Finally, a word about query expansion. Though topical facets intervene in solving a query, their use should not be understood as a form of query expansion. Under the bag-of-words model, if an otherwise relevant document does not contain one or more terms from the query, then the document will not be retrieved. The aim of query expansion is to reduce this mismatch by expanding the query with words or phrases with a related meaning. Query expansion collects or generates additional search terms such as synonyms, related concepts, or stemmed and lemmatized variants of a query term. The effect of a standard query expansion on retrieval performance is regularly tested, and the marks on the report are often low (Kekäläinen & Järvelin, 1998). Terms in a topical facet, on the other hand, are selected a priori, i.e. before any question is formulated, and originate from a semantic relation found in the documents available to the system. A strictly local document environment defines the linkage between the terms, not an association based on a sense shared by a language community.


2.11 Summary

Topical facets offer a partial description of the semantic relation between documents. The relation builds on similar words and similar phrases between documents. The document similarity in question is not just token resemblance. Starting from informative words at document level, informative components spanning several documents are formed. Documents are assumed similar if more of these components are shared, without expecting to find a full one-to-one match. Topical facets are distinct from other information descriptors in one or more of the following aspects:

• Topical facets are gathered in an unsupervised way from unrestricted text data;
• A topical facet is dynamic in the sense that new documents may enrich existing facets, without the need to rerun the analysis of the complete data set. Documents can be removed just as well;
• The dynamical construction of a topical facet allows capturing diachronic aspects present in the related documents;
• The topical facet generalizes the semantic information found in documents, in the sense that it abstracts away from the context of the collections and thus from the source and time defining a collection;
• A topical facet is not a descriptor of any particular document; it is a set of similarities between two or more documents. The same component from a specific topical facet may turn up in other facets. The same topical facet can be found in more than one document;
• The topical facets layer restricts the data set in an information retrieval task. The intersection of the topical facets related to a task activates certain elements and deactivates others.

The set of topical facets is open. Exploring the content is possible in several ways: serendipical, approximate, or direct. It is the open nature of the set of topical facets that makes the serendipical and approximate exploration of the content possible. The serendipical search, or the circumstantial browsing of the search space, allows exploration of unknown data. The approximate search assumes a user who knows roughly where to look; navigating through the content neighborhood offers him a broad view on what plausible answers are available. The direct search, presuming a level of expert knowledge competent enough to construct a query with the appropriate vocabulary, is also supported.

Topical facets are proto-inductive because the components have no meaning as such. The proto-inductive information is a partial description of a document space. The description occurs before conceptualization, and the inductive process of establishing connections happens when a query is formulated. Formulating a query results in the extraction of a closed subset of topical facets, followed by a generalizing operation whereby a bipartite graph is transformed into a unipartite graph and the maximum core of this graph is extracted. This dissociates topical facets from the originating articles, losing their factual nature in the process.

Based on the definition of topic as the pivotal element in a language realization, we define a well-formed query as a topic with a question mark. Any sensible answer should contain the topic kernel formulated by the query, augmented with additional and related items of information at


the disposal of the answering system. A topic as a meaningful concept is represented by a pattern of connectivity between several nodes associated with related concepts. The retrieved documents constitute a category, where category membership is based on shared salient properties. A query can be seen as a particular member of the category, defined by a set of membership rules such as similarity. However, the topical facets model does not take transitivity between the query and the corpus for granted, nor that a relation between the query and the data exists. Therefore, the query will not confront the data directly, but is represented by a proxy, a document from the corpus that best represents the query. For each document it is always possible to construct a query that will retrieve the document, while the reverse may cause difficulties. All documents linked to the activated topical facets can be compared to the query prototype with a document-by-document similarity method and ranked accordingly. This guarantees that the mutually shared semantic construction, characterizing the relation between two documents, is not derived from a few loose words.

Retrieved documents are related to a query. The relationship, however, may be superficial. The document similarity value assists in grouping the documents in clusters. Semantic preference clustering organizes the retrieval output according to the semantic distance that separates the documents from the pivotal query proxy.


3. Experiments and Evaluation

3.1 Introduction

The topical facets model uses a multi-stage data extraction and processing procedure. First, all documents are translated into network components, and then the system looks for patterns inside and across the constituting data. The likeness is not limited to the presence of specific words in a document; the terms also need to have a semantic relation. Next, the system rallies all documents related to a query (the recall phase) and ranks them on similarity with a proxy of the query (the precision phase).

The previous chapter describes three different ways to explore a body of unstructured information. Serendipical search, or circumstantial browsing, allows the unconstrained exploration of unknown data. Approximate search offers a broad view on what plausible answers are available. The direct search, or standard search, presumes the competence to construct a query with the proper vocabulary, addressed to the relevant sources. In this section we report on two experiments to evaluate the third mode, standard search, with material provided by the TDT consortium. Topic Detection and Tracking is a well-defined NLP task, using an annotated corpus of predefined topics. For the first experiment we take a stratified sample of six detection tasks. It permits a detailed analysis of the procedure and the results. The second experiment runs a batch of all 60 tasks as defined by the TDT consortium over the whole corpus. The statistics are compared to the published results of other participants.

A relevance metric will allow us to judge how well the Topical Facets Application performs. In the world of Information Retrieval relevance is generally expressed as the outcome of a precision and recall estimation. In the special corpus, assembled and annotated for the purpose of a controlled topic detection and tracking task, precision and recall are recast as false alarm and miss probabilities. These concepts catch the same idea. Although widely used, the notion of relevance is not without controversy, because to a certain extent relevance is in the eye of the beholder. Further, any information base big enough to make search engines interesting is too large to find out how many matches there are. In such a case, to achieve a 100% recall with 100% certainty is to retrieve every document in the database. To get near a 100% precision is to retrieve the one and only document with the highest confidence. Yet, even if they are not easy to use quantitatively, precision and recall are indispensable to evaluate search systems.

Empirical studies of retrieval performance show a tendency for the precision to decline as the document recall increases. Michael Buckland and Fredric Gey examine the nature of the relationships between recall and the number of documents retrieved, and between precision and the number of documents retrieved, in the context of different assumptions about the retrieval performance. It is demonstrated that a tradeoff between recall and precision is inevitable whenever the retrieval performance is better than retrieval at random (Buckland & Gey, 1994). Examination of the mathematical relationship between precision and recall shows that a quadratic recall function describes the empirical recall-precision behavior if transformed into a tangent parabola. We observe a similar quadratic curve in the results of one TDT contender, but not in the results of the Topical Facets Application. Buckland and Gey demonstrate how an initial retrieval, emphasizing high recall, followed by detailed searching of the initially retrieved set, can be used to improve


both recall and precision simultaneously. While retrieval in two stages can be beneficial for very large databases or systems with limited retrieval capabilities, the tradeoff between precision and recall remains. The Topical Facets Application acts on this advice and performs a retrieval step prior to the analysis of the document assembly. It explains the reasonable precision results obtained. In the event of a large retrieval yield the program truncates the output, which leads to a somewhat erratic behavior in the search space. This, however, is an indication that optimizing the recall procedure might result in a higher overall relevance score.

An important difference with the official TDT task is how the Topical Facets Application handles the detection task. As a rule a TDT system should not look ahead nor defer a decision about a story being on topic. In contrast with this guideline, participants get between two and four training documents for every topic. These training documents are extracted randomly from the set of manually selected on-topic documents and offer in effect a look into the future and a perfect match. The Topical Facets Application cannot take advantage of the training documents and resorts to the topic description given in the formal guidelines to formulate a query (see Table 3-8 for an example of these instructions).

Paragraph 3.2 The Corpus draws up the special testing and training data inventory used to build the network, and the evaluation method. Paragraph 3.3 discusses the experimental setup of the topic detection and tracking task. Paragraph 3.4 Stratified Sample Experiment reports on six detailed cases chosen from the available test bank. Paragraph 3.5 Topic Detection Experiment presents the results of a competitive topic detection and tracking experiment, followed by paragraph 3.6 Discussion, where the particular behavior of the Topical Facets Application is examined. Selected aspects of retrieval methods get attention in paragraph 3.7 Related Research. Paragraph 3.8 Summary wraps up the chapter.


3.2 The Corpus

All the data in this thesis are drawn from a corpus produced by the Defense Advanced Research Projects Agency (DARPA). The idea for the project originated in 1996, when DARPA realized that it needed technology to determine the topical structure of news streams without human intervention. In 1997 a pilot study laid the essential groundwork. During 1998 ASR data were added, and Chinese data in 1999 (Wayne, 2000). The following US radio and television networks provided the English material: Associated Press Worldstream Service (APW), The New York Times News Service (NYT), Public Radio International (PRI), The Voice of America (VOA), American Broadcasting Company (ABC), CNN Cable News Network (CNN), MS-NBC (MSN), and National Broadcasting Company (NBC). Mandarin sources include Zaobao, Xinhua, and the Mandarin newsfeed of the Voice of America. The TDT3 corpus is the third version, released to the broader research community in 2001.

            Source (English)                   TDT3 Stories   Source (Mandarin)       TDT3 Stories
Newswire    NYT New York Times News Service           6,871   Zaobao                         3,817
            APW Associated Press Worldstream          7,338   Xinhua                         5,153
Radio       VOA English News Service                  3,948   VOA Mandarin                   3,371
            PRI The World                             1,575
Television  CNN Headline News                         9,003
            ABC World News Tonight                    1,012
            MSNBC News With Brian Williams              683
            NBC Nightly News                            846
            Total English stories                    31,276   Total Mandarin stories        12,341

Table 3-1. TDT3 Corpus stories count per news source.

Table 3-1 gives an overview of the number of stories (texts) provided by the different sources. The English corpus counts approximately 15 million tokens. DARPA still funds the corpus production and maintenance. It is available via the Linguistic Data Consortium (LDC). Further in the text the abbreviation TDT refers to this corpus. TDT research is continuing under a new DARPA program known by the acronym TIDES (Translingual Information Detection, Extraction, and Summarization). Part of the raw material was made available as edited electronic documents, part as audio files. Word errors due to the automatic speech transcription vary widely: the average error rates range from approximately 25 to 30%. In addition, the speech transcription programs do not supply punctuation marks, nor do they differentiate between capital and lower-case letters. Therefore, the ensuing network omits punctuation everywhere and converts all words, even the obvious proper names, to lower-case. The system reads in the data as plain text with XML markup tags. The Topical Facets Application scans the continuous strings of text to find embedded format


codes, and converts the content to either the source of the text, the publication date, or the story itself. The news stream is segmented so that each story discusses a single topic. Dragon Systems transcribed the audio files using their software, in collaboration with the National Institute of Standards and Technology (NIST) using BBN software10. Since it was not feasible to name and locate every event in the corpus, the LDC selected 60 topics at random. Human lead annotators chose these stories from the various sources and wrote topic descriptions for the suitable ones. The question of where to draw the line on including (or excluding) related events was settled by creating rules for each of the topic types. These rules guided the annotation11. The annotating staff read every text in the corpus and labeled it as YES, NO, or BRIEF with respect to each topic (with BRIEF signifying that the topic was mentioned, but occupied less than 10% of the story). Each story in the complete TDT corpus has a tag depending on whether or not it discusses one of the 60 predefined topics (Wayne, 2000).

Special instructions apply to the use of this corpus:
• Algorithms can only employ the content of the data, plus information about source, date, and time.
• The topic detection task requires systems to group incoming stories into unsupervised topic clusters, creating new topic clusters as needed, without a possibility to look ahead. However, systems are allowed to defer a decision for a maximum of 10 stories.
• For training purposes the application gets between one and four stories per topic. The total amount of training stories is less than 240 on a corpus of 31,000 (< 0.8%).

The number of on-topic stories in the total collection varies greatly, from over 700 items to less than 5, spanning an a priori probability range from 2.4% down to 0.01%. The on-topic fraction, or the ratio of relevant items, is sometimes called generality in the literature (Salton & McGill, 1983; Menczer, 2004). However, the term gets other interpretations too, for instance to indicate the specificity of a document with regard to a query (Hyun Woong Shin, Hovy, McLeod & Pryor, 2006), making it rather ambiguous; therefore, it will not be used here. The list of the 60 tasks, with the number of on-topic stories in the corpus tagged by human annotators, is given in the Appendix on page 173.

3.3 Experimental Setup

This paragraph explains the settings of the official TDT task and the way the Topical Facets Application relates to it in order to evaluate its performance. The main distinction between the two methods lies in defining how a system knows what the object of interest is. The official TDT setting expects a system to learn a task by analyzing a set of typical documents, handpicked from the ground truth set. The Topical Facets Application, by contrast, has implicit knowledge about all semantically relevant document relations, but is oblivious of the corpus builders' intentions. The topical facets layer describes the information shared by several documents by collecting informative patterns found in and across those documents. The result is an open set of topical facets, allowing for multiple views on the documents making up the data repository.

10 DARPA is the US Defense Advanced Research Projects Agency (http://www.darpa.mil/). The Linguistic Data Consortium (LDC) is an association of universities, private companies, and US government research laboratories. The University of Pennsylvania is LDC's host institution (http://www.ldc.upenn.edu/). Dragon Systems Inc. is a private company supplier of speech and language technology (http://www.dragonsys.com). NIST is a US government technology agency that works with the industry to develop and apply technology, measurements, and standards (http://www.nist.gov/). BBN is a private company founded in 1948 by Richard Bolt, Leo Beranek and Robert Newman from the Massachusetts Institute of Technology (MIT) (http://www.bbn.com/).
11 See the Pinochet Trial briefing on page 78 for an example.


Since more than one document can share the same topical facet, and because several topical facets typically link one document with one or more other documents, many categorizations are possible. Filling bins with interesting data is not practical with the topical facets way of looking at a corpus of documents. If pressed to do so, the Topical Facets Application could construct approximately 3,100,000 semantically related sets (|D| × 10² combinations are possible, where |D| is the number of documents) with the 31,000 documents from the corpus. A pure detection task directly on the data, as imposed by the TDT rules, is therefore not workable. Building the topical facets layer is separate from the extraction of information. The user is not supposed to obtain representative training material from the data, but has to formulate a query. The application then selects a prototype, the document that best resembles the user's topic of interest, from a collection of related documents. This approach does not turn the Topical Facets Application into a variant of the traditional information retrieval systems, where index terms are used to catalog and retrieve documents. Index terms are keywords or sets of terms appearing in the documents of some collection. An expert with the competence to construct a query with the appropriate vocabulary, addressed to the relevant sources, may expect to retrieve the material relevant to his inquiry. The Topical Facets Application does not assume that such a relation exists between the query and the data.

In the Topical Facets setting an information request cannot address the data directly; instead the system looks for the shortest document that best represents the query, in the sense that this document would retrieve the query. As a first step the system links the terms from the query to all topical facets containing these terms and collects all documents related to the retrieved facets. This is the initial recall sweep, suggested by M. Buckland and F. Gey (Buckland & Gey, 1994). For the Cambodian Elections task, the first subject in the next paragraph, the human annotators found and tagged 10 articles deemed to be on-topic in the interval of the experiment, but unknown to the software application. This is about 0.12% of all available documents and is called the on-topic fraction of the corpus. At the outset a query about the Cambodian elections mobilizes 274 topical facets. A unipartite transformation, described in the previous chapter, reduces the set of facets to 32. The number of candidate documents linked by the facets decreases accordingly to 68. Next, the system looks up a prototype that approximates the query statement among the candidate documents. Relevant documents should have some semantic relation with this prototype. The Semantic Preference Clustering groups documents based on their semantic preference with respect to the prototype. Documents showing no relation are removed from the initial selection. The system complies with the TDT rules by scanning the documents one by one, respecting the sequence of the incoming documents and without permission to delay a decision. To evaluate the experiments we use a contingency table and the standard evaluation metrics described in table 3-2 (Manning & Schütze, 1999: 268-9). The procedure to measure recall/precision and miss/false alarm in the TDT setting is explained next.

                Retrieved   Not Retrieved
On-topic           A             B
Off-topic          C             D

Label               Formula                       Description
Precision           A / (A + C)                   Proportion of retrieved material being on-topic
Recall              A / (A + B)                   Proportion of on-topic material that is retrieved
Miss                B / (A + B)                   Proportion of on-topic material not retrieved; this value equals (1 − recall)
False alarm         C / (C + D)                   Proportion of off-topic material that is retrieved, also called fallout
On-topic fraction   (A + B) / (A + B + C + D)     Proportion of the collection being on-topic

Table 3-2. Contingency table and evaluation metrics.
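For concreteness, the metrics of Table 3-2 can be computed as in the following minimal Python sketch (an illustration, not the thesis software; the example call reproduces the Cambodia figures of section 3.4.1, where 9 documents were found correctly, 4 false alarms were given, and 1 document was missed in a sample of 8,326 stories):

```python
# A minimal sketch: evaluation metrics from the contingency counts of Table 3-2.

def metrics(a, b, c, d):
    """a = on-topic retrieved, b = on-topic missed,
    c = off-topic retrieved, d = off-topic rejected."""
    precision = a / (a + c)
    recall = a / (a + b)
    return {
        "precision": precision,
        "recall": recall,
        "miss": b / (a + b),               # = 1 - recall
        "false_alarm": c / (c + d),        # fallout
        "on_topic_fraction": (a + b) / (a + b + c + d),
        "f_score": 2 * precision * recall / (precision + recall),
    }

# Cambodia task (section 3.4.1): precision 0.69, recall 0.90, F-score 0.78.
print(metrics(a=9, b=1, c=4, d=8326 - 14))
```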

The outcome is obtained by macro averaging: the precision, recall, and F-score measures are averaged over the classes. The F-score with equal weight for precision and recall is defined as:

F = (2 · Precision · Recall) / (Precision + Recall)    (3.1)

As will turn out below, the system manages to discover on-topic documents overlooked by the human annotators. We will discuss this matter in detail where it occurs. In order not to bias the score when comparing with other contributions, overlooked documents are not taken into account. Apart from precision and recall, the general outcome of the systems competing in the TDT task is also measured in terms of a detection cost. Detection cost is the weighted sum of the probability of missing a relevant document and the probability of returning an off-topic document, labeled false alarm in this setting. The measure combines the likelihood of errors (miss and false alarm) with prior probabilities and costs of errors:

Cost = C_miss · P(miss) · P(target) + C_false · P(false) · P(non-target)    (3.2)

where P(miss) and P(false) are the conditional probabilities of a miss and a false alarm, P(target) and P(non-target) are the a priori probabilities of a story being on- or off-topic, and C_miss and C_false are the costs of a miss and a false alarm, respectively. The TDT Consortium fixed these numbers to constant values, based on empirical evidence from the training data (Table 3-3). Given the TDT cost measures, a system could always answer no, yielding a 100% miss rate with no false alarms, at a cost of 0.02. Alternatively, it could always answer yes, at a cost of 0.098. In order to compare with these no-effort approaches, the TDT cost measure is normalized by the minimum of those two values. The normalized cost of always saying no is then 1.0 and of always saying yes 4.9; a useful approach must therefore have a normalized cost below one.

Measure         Detection   Tracking
C_miss          1.0         1.0
C_false         0.1         0.1
P(target)       0.02        0.02
P(non-target)   0.98        0.98

Table 3-3. Cost probability constants.
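The cost arithmetic above can be verified with a short sketch (again an illustration, not the official TDT scoring tool; the constants are those of Table 3-3):

```python
# Sketch of the TDT detection-cost computation with the constants of Table 3-3.
C_MISS, C_FALSE = 1.0, 0.1
P_TARGET, P_NON_TARGET = 0.02, 0.98

def detection_cost(p_miss, p_false):
    return C_MISS * p_miss * P_TARGET + C_FALSE * p_false * P_NON_TARGET

def normalized_cost(p_miss, p_false):
    # Normalize by the cheaper of the two no-effort strategies:
    # always "no" (miss everything) or always "yes" (flag everything).
    floor = min(detection_cost(1.0, 0.0), detection_cost(0.0, 1.0))
    return detection_cost(p_miss, p_false) / floor

print(detection_cost(1.0, 0.0))   # always "no"  -> 0.02
print(detection_cost(0.0, 1.0))   # always "yes" -> 0.098
print(normalized_cost(0.0, 1.0))  # -> 4.9
```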


3.4

Stratified Sample Experiment

The first of the two experimental setups allows a detailed evaluation of the two-step retrieval approach. A 10% stratified sample from the 60 predefined TDT tasks includes the task with the largest number of on-topic stories (Task 30050 – US Elections, with 741 documents) and the task with the fewest (Task 30058 – Secretary Richardson in Taiwan, with 4 documents). The corpus material covers the months of October, November, and December 1998. Some events occur earlier, such as the Cambodian elections, or refer to earlier events (Pinochet, Osama bin Laden). The number of articles and the number of on-topic texts involved are given with each of the tests in the next section. The different periods contain the same number of days and include data from before, during, or after the main event, as outlined in table 3-4, where E stands for the moment in time when the particular incident took place. The cells in the table marked Test designate the story-sampling period. An event may be announced in advance and may receive attention afterwards. The setting of the six events is as follows:
• The Cambodian elections were held in July 1998. The coalition is formed in November.
• Pinochet's arrest in London occurs on October 16, 1998. Court negotiations last the rest of the year.
• Osama bin Laden is assumed responsible for plotting and executing attacks on the American embassies in Africa in August 1998. His indictment is issued on November 4, 1998.
• The U.S. mid-term elections were held on November 3, 1998.
• The Chechnyan rebel kidnapping took place in early October 1998. The hostages were found beheaded on December 8 of the same year.
• Secretary Richardson visited Taiwan from November 9 to November 11, 1998.

Task                    Event (E)                                     Test sample
30001 Cambodia          elections July 1998; coalition in November    October 1 – 25
30003 Pinochet          arrest October 16, 1998                       November 1 – 25
30005 Osama bin Laden   indictment November 4, 1998                   October 15 – November 14
30050 US Elections      elections November 3, 1998                    November 1 – 25
30056 Chechnya          kidnapping October; beheadings December 8     December 1 – 25
30058 Richardson        visit November 9 – 11, 1998                   November 1 – 25

Table 3-4. Timetable with six events (E) and test data (Test) sampled from the corpus. The corpus covers the whole of October, November, and December 1998.


3.4.1 Cambodian Elections

Topic task 30001 is the first of the manually annotated topics and deals with the general elections in Cambodia, which took place in July 1998. Two main characters in the event are Hun Sen, leader of the People's Party, and Prince Norodom Ranariddh, leader of FUNCINPEC. The People's Party beats FUNCINPEC, and in November 1998 the two parties agree to a coalition government. According to the annotation rules, all stories about the election itself (campaigns, results of the election) are considered on topic, including the citizens' responses to the election, the government efforts to stop the protests, negotiations between the two parties, details of the agreement reached between the parties, and reactions of Cambodian citizens and world leaders to the agreement.

Retrieved documents sorted on descending similarity with the prototype   Similarity Score
(7031) 19981022_APW0269                                                  0.9531
(7399) 19981022_VOA1700.2109                                             0.2221
(7379) 19981022_VOA1700.0277                                             0.1955
(651)  19981002_VOA1800.0338                                             0.1498
(4880) 19981016_APW0240                                                  0.1471
(7759) 19981023_VOA0600.1907                                             0.1307
(4442) 19981014_VOA0600.0193                                             0.0741
(654)  19981002_VOA1800.0507                                             0.0563
(4463) 19981014_VOA0600.2457                                             0.0417
(5295) 19981017_APW0346                                                  0.0192
(5113) 19981016_NYT0219                                                  0.0144
(28)   19981001_APW0315                                                  0.0078
(4532) 19981015_APW0140                                                  0.0067

Table 3-5. The final set of documents retrieved for the Cambodia query.

Fig. 3-1. Retrieved documents (x-axis) sorted on their similarity with the prototype document (y-axis). Cambodia prototype document 7031 omitted. The black squares mark the unrelated items. Logarithmic trend line and coefficient of determination (r² = 0.94) added.

The query formulated to the system is to return all documents relevant to the following statement:

Cambodia's People's Party beats the FUNCINPEC party in national elections. Later the two parties with Hun Sen, leader of the People's Party and Prince Norodom Ranariddh agree to a coalition government in Phnom Penh, Cambodia.

The total number of news stories in the sampled period October 1 – October 25, 1998 is 8,326. Document 7031 is the prototype chosen by the application to represent the query. A total of 68 documents are considered initial candidates for the retrieval output and are compared to the prototype, using the similarity metric explained in the previous paragraph. Table 3-5 shows the 13 remaining documents with a meaningful relation to the prototype. The system identified nine files correctly, gave four false alarms, and missed one document. Table 3-6 has the score:

Precision   0.69
Recall      0.90
F-score     0.78

Table 3-6. Results for the Cambodia query.

The four unrelated stories are listed below, ordered by diminishing relevance to the prototype. The numbers behind some of the key terms mark how often that word appears in the text.
• d5295 (19981017_APW0346) discusses a confidence vote in the Italian Parliament. Common elements are government (5), coalition (2), party, and parties (4).
• d5113 (19981016_NYT0219) covers elections in Swaziland. Related elements are elections (3), government, and parties.
• d28 (19981001_APW0315) is about an incident with Anwar Ibrahim, deputy prime minister and finance minister of Malaysia. The link with the prototype is triggered by elements such as People's Party, leader, and government (3).
• d4532 (19981015_APW0140) is about elections in Lesotho. Related elements are government, national, elections (4), party, and parties (5).
From the system's point of view this is sufficient for a meaningful relation, though a weak one at the end of a list of related documents. Figure 3-1 shows the divide between related and unrelated stories. In all samples, a logarithmic line provides the best fit for the ranking of documents according to their similarity with the prototype.
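The logarithmic trend and its coefficient of determination can be recomputed along the following lines (a sketch assuming NumPy; the scores are those of Table 3-5 with the prototype omitted, and polyfit is our illustrative choice of fitting routine, not necessarily the tooling behind the published figures):

```python
# Sketch: fit a logarithmic trend y = a*ln(rank) + b to the ranked
# similarity scores of Table 3-5 and compute the r^2 shown in Fig. 3-1.
import numpy as np

scores = np.array([0.2221, 0.1955, 0.1498, 0.1471, 0.1307, 0.0741,
                   0.0563, 0.0417, 0.0192, 0.0144, 0.0078, 0.0067])
ranks = np.arange(1, len(scores) + 1)

a, b = np.polyfit(np.log(ranks), scores, deg=1)   # least-squares fit
fitted = a * np.log(ranks) + b

ss_res = np.sum((scores - fitted) ** 2)
ss_tot = np.sum((scores - scores.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(f"y = {a:.4f} ln(x) + {b:.4f}, r^2 = {r2:.2f}")
```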

3.4.2 Pinochet Trial

A second test uses topic task 30003 about the arrest and trial of former Chilean president Augusto Pinochet. Table 3-8 gives an example of the full description of a detection and tracking assignment as determined by the TDT consortium; similar instructions exist for every topic task. The system is requested to return every document discovered between November 1 and November 25, 1998, pertaining to the following declaration:


Former Chilean dictator General Augusto Pinochet, who ruled Chile from 1973-1990, is arrested in a London hospital on a warrant issued by Spanish Judge Baltasar Garzon on charges of genocide and torture during his reign.

The global document count for this period is 8,888. Human annotators identified 88 on-topic articles, a 1% on-topic fraction. As explained earlier, a prototype document is selected from the set of provisional articles to represent the query; the text of this prototype is in the Appendix on page 188. The initial set is an intermediate stage. After going through the document-by-document routine and the Semantic Preference Clustering, the software identified 50 articles correctly, gave 16 false alarms, and missed 38 on-topic documents. The same evaluation metric as in the previous example yields the following results (Table 3-7):

Precision   0.76
Recall      0.57
F-score     0.65

Table 3-7. Results for the Pinochet query.

Pinochet Trial
WHAT: Pinochet, who ruled Chile from 1973-1990, is arrested on charges of genocide and torture during his reign.
WHO: Former Chilean dictator General Augusto Pinochet; Judge Baltasar Garzon ("Superjudge").
WHERE: Pinochet is arrested and held in London, then later extradited to Spain.
WHEN: The arrest occurs on October 16th 1998; court negotiations last the rest of the year.
Topic Explication: Pinochet was arrested in a London hospital on a warrant issued by Spanish Judge Baltasar Garzon. Pinochet appealed his arrest and a London court agreed, but the decision was overturned by Britain's highest court. After much legal wrangling over the site of the trial, the British Courts ruled that Spain should proceed with the extradition request; Pinochet continues to fight it.
ON TOPIC: stories covering any angle of the legal process surrounding this trial (including Pinochet's initial arrest in October, his appeals, British Court rulings, reactions of world leaders and Chilean citizens to the trial, etc.). Stories about Pinochet's reign or legacy are not on topic unless they explicitly discuss this trial.
Rule of Interpretation: 3 Legal/Criminal Cases. Examples: crimes, arrests, cases. The event might be the crime, the arrest, the sentencing, the arraignment, the search for a suspect. The topic is the whole package: crime, investigation, searches, victims, witnesses, trial, counsel, sentencing, punishment, and other similarly related things.

Table 3-8. The TDT guidelines for the Pinochet Trial topic.


Figure 3-2 illustrates the relation of the retrieved documents with the prototype. Unrelated documents cluster together at the end of the similarity-value curve. The Pinochet Trial appendix on page 189 gives the full text of two outlier documents, 9103 and 8964, assumed to be off-topic since they were not tagged. The first is about the Swiss government investigating possible bank accounts belonging to Pinochet; the second is an official British warning for its citizens traveling to Chile. The software evaluated them as similar enough to the prototype to be included in the set of relevant documents. They were not tagged as on-topic by the human annotators, although they meet the conditions explicated by the TDT guidelines; the reader may agree that these texts have probably been overlooked. The appendix also reproduces d10587, the first real false alarm. This text, on the edge of the on-topic zone, is about the alleged torturing of Coptic Christians by the Egyptian police and the reactions of local and international human rights groups. The document is off-topic; the thematic resemblance at the border of the on-topic region is nonetheless apparent. It is caused by the partial agreement between the intervening topical facets, in particular the alleged involvement of the authorities in arresting and torturing citizens and the attention of international human rights groups, semantic components found in the Pinochet articles too.

Fig. 3-2. Pinochet Trial documents sorted on their similarity (y-axis) with the prototype (omitted). The large squares mark the so-called false alarms d9103, d8964, and d10587. Logarithmic trend line and coefficient of determination (r² = 0.91) added.

3.4.3 Osama bin Laden Indictment

Topic task 30005 focuses on the Saudi-born millionaire and terrorist Osama bin Laden. He was indicted on 238 counts for plotting and executing bomb attacks on the American embassies in Africa in August 1998. The query statement reads:


Osama bin Laden was indicted by the US District Court in New York for the attacks on American embassies in Africa in August 1998 through his Afghanistan based terrorist group, al Qaeda. Efforts by the CIA and reactions from the Muslim world.

In the period October 15 – November 14, 1998, the corpus contained 10,944 stories, of which 38 were on-topic for this task, an on-topic fraction of 0.347%. The system identified 12 files correctly, gave four false alarms, and overlooked 26 documents. The metrics of performance are shown in table 3-9.

Precision   0.75
Recall      0.31
F-score     0.44

Table 3-9. Results for the Osama bin Laden query.

Fig. 3-3. Osama bin Laden Indictment. Documents sorted on their similarity (y-axis) with the prototype (omitted). The black squares mark the so-called false alarms d11930, d12652, d12036, d12051, and d15316. Logarithmic trend line and coefficient of determination (r² = 0.98) added.

The alleged false alarm d11930 (19981104_VOA1600.0003) is in fact on target, although the name of Osama bin Laden does not appear in the article; the first black square in figure 3-3 stands for the document in question. The following is the ASR rendering, as is:

authorities have indicted so the terrorist though some of them love men one of his top aides in connection with the deadly u. s. embassy bombings in kenya and tanzania u. s. attorney mary jo white announced the charges at a news conference in new york wednesday it’s


out of the lot and as military commander mohammad a. t. charged with plotting and carrying out the most heinous acts of international terrorism and murder their alleged victims include the hundreds of african an american citizens who tragically lost their lives in the embassy bombings in east africa on august seventh nineteen eighty eight and the thousands more were seriously injured u. s. state department said it is offering a reward of up to five million dollars for information leading to their arrest or conviction

The next three stories are off-topic borderline cases, but thematically related, as witnessed by short extracts from each of them:

d12652 (19981106_NYT0494): President Clinton expressed sympathy to Israel Friday after a car-bomb explosion in a Jerusalem market wounded 24 people.

d12036 (19981105_APW0888): Two leaders of Egypt's main Muslim militant group on Thursday urged their followers to abandon violence in favor of spreading their ideology in public campaigns.

d12051 (19981105_APW1209): The United Nations on Thursday detailed what it called horrific massacres of civilians in Afghanistan by the Taliban militia during its capture of a northern city in August, including mass suffocation, execution and torture.

The last story is completely off the mark:

d15316 (19981114_CNN1300.0252): The house judiciary committee is set to begin its impeachment inquiry against president Clinton Thursday.

Most of the remaining off-topic information deals with the Clinton-Lewinsky scandal, prominent at that time. The system takes these documents into consideration because of the presence of topical facets with active US judicial elements in both the bin Laden and the Lewinsky case.

3.4.4 US Congressional Elections

Topic task 30050 deals with the mid-term elections of the 106th United States Congress. It is a tricky task, as the topic overlaps with topic 30024 concerning the resignation of Newt Gingrich as Speaker of the House of Representatives a month after the elections. A motivating factor in Gingrich's resignation was the loss of seats attributed to voter dissatisfaction with the Republican efforts to impeach Clinton over the Monica Lewinsky scandal. The current topic is also related to task 30046 about Bob Livingston, selected to replace Newt Gingrich. Livingston announced he would not accept the role of Speaker and would leave Congress. Livingston acknowledged the


charges of marital infidelity published in Hustler Magazine. He called in vain on President Clinton to follow his lead and resign too. The topic-eliciting statement runs as follows:

Campaign coverage of the mid-Term Congressional Elections of the 106th Congress of the United States, involving all seats in the House of Representatives and 34 of the 100 Senate seats.

This rather meager query demonstrates the advantage of having a prototype to collect topic-related documents. The sentence is short, with only a few informative words, while the prototype is extensive and to the point (note 12).

Fig. 3-4. US Congress Elections. Documents sorted on similarity (y-axis) with the prototype (omitted). The black squares mark the false alarms (d9823, d10332, d8010, d10508, d14618, d14643, d14469, d11111, d14341, d15012). Logarithmic trend line and coefficient of determination (r² = 0.89) added.

Note 12. Prototype text in the Appendix on page 192.

For this task 354 on-topic files were annotated in the period November 1 – November 25, on a total file count of 8,888, representing an on-topic fraction of 4%. The system identified 41 files correctly, gave 10 false alarms, and missed 313 documents (Table 3-10). The first three so-called false alarms shown in figure 3-4 are in fact documents bearing upon the subject, as testified by the excerpts shown in the Appendix. Document 10508, further down the line, is also on-topic. If these four documents had been put into the right class by the annotators, precision would be 88%. The first real false alarm is d14618, about a debate in the US Congress on the ratification of the international Kyoto agreement on global warming. The same subject appears in d14643 and d11111. Documents 14469 and 14341 are about a congressional hearing on the position of the


Clinton administration regarding Saddam Hussein; finally, d15012 discusses Alan Greenspan meeting with the Federal Reserve Board.

Precision   0.80
Recall      0.12
F-score     0.20

Table 3-10. Results for the US Congress query.

The relatively low similarity marks are striking, but typical. The US Congress remains a forum of debate, even during elections. Many political subjects turning up are not covered by the short and rather general query, which brings down the overall score.

3.4.5 Chechnya Rebel Kidnapping

Topic task 30056 is about Chechen gunmen seizing three British citizens, Darren Hickey, Rudolf Petschi, and Peter Kennedy, plus a New Zealander, Stanley Shaw, in early October. They were later beheaded. Criminals in Chechnya have abducted dozens of foreigners and Russians, usually seeking large ransoms, since the end of a bitter independence war with Moscow in 1996. In late December Chechen leaders begin to address the issue in Parliament. On topic are stories covering the kidnapping or murder of these four foreigners, investigations and arrests in the case, direct responses to this situation in the Chechen government, and protests against the escalating violence by the Chechen people. As in the previous cases the system has to return documents related to a specific statement and restricted to a given interval, here December 1 – December 25, 1998:

In early October, three Britons, Darren Hickey, Rudolf Petschi and Peter Kennedy and a New Zealander, Stanley Shaw, were seized by Chechen gunmen, and later beheaded. The heads of four Western kidnapping victims are found near Grozny, Chechnya.

Among the available 8,380 articles the system detected 65 initial candidate documents, of which 25 related to the prototype. The text of the programmatically chosen prototype is on page 194 in the Chechnya Rebel Kidnapping appendix. The human annotators marked 28 on-topic files for this task, an on-topic fraction of 0.33%. The system identified 22 files correctly, gave one false alarm, and missed six. The false alarm is about a kidnapped French United Nations official who was freed in the same period on the border between the republics of Chechnya and Ingushetia (Fig. 3-5). Off-topic perhaps, but undoubtedly from a kindred class of events (note 13). The score for this task is in table 3-11:

Precision   0.96
Recall      0.79
F-score     0.86

Table 3-11. Results for the Chechnya Kidnapping query.

Note 13. Text in the Appendix on page 190.


Fig. 3-5. Chechnya Rebel Kidnapping documents sorted on similarity (y-axis) with the prototype (omitted). The large square marks the single false alarm. Logarithmic trend line and coefficient of determination (r² = 0.86) added.

3.4.6 US Secretary Richardson's Visit to Taiwan

Topic task 30058 is about a U.S. – Taiwan business conference on economic and commercial issues, and Secretary Bill Richardson attending it. Richardson met Taiwanese president Lee Teng-hui and local business leaders. The visit irritated the political leaders in mainland China. The following statement was submitted to the system:

US Energy Secretary Bill Richardson visited Taiwan for a joint U.S.-Taiwan business conference and met with Taiwanese president Lee Teng-hui and prominent Taiwanese business leaders. The controversial visit angered mainland China, which registered a formal protest in Washington.

For this task four annotated on-topic files were to be found among 8,888 stories, representing an on-topic fraction of 0.045% in the sample subset covering November 1 to November 25, 1998. The system missed no documents and ranked the four relevant files correctly at the top of the retrieval output, followed by nineteen false alarms. Table 3-12 shows the results for this experiment.

Precision   0.17
Recall      1.00
F-score     0.30

Table 3-12. Results for the Bill Richardson query.


Fig. 3-6. Richardson in Taiwan. Documents sorted on their similarity (y-axis) with the prototype (omitted). The large squares mark the so-called false alarms, among them d11110, d18944, and d10856. Logarithmic trend line and coefficient of determination (r² = 0.96) added.

Again, it is worth seeing what the false alarms are about. The first off-topic document, 11110, discusses the Chinese attitude towards recent Taiwanese diplomatic initiatives (Fig. 3-6). The second article, 18944, is about a trip made in the same period by another American envoy to Russia and then to Japan, where the relations between Japan and Taiwan were on the agenda, which inevitably implicates Beijing. Document 10856 and the third article further down the relevance road are about Iraq's decision to halt the work of the UN arms inspectors. Extracts from these texts are in the appendix on page 191. Here again, articles following immediately after the on-topic texts still have a thematic relationship with the subject in view. Below a similarity value of 0.04 the affinity drifts away rapidly. The value of 0.04 is an empirical finding without an apparent analytical justification.

3.5

Topic Detection Experiment

A conclusion from the first experiment is that ranking with a proxy document works fairly well. The sorting allows the discovery of several on-topic stories overlooked by human annotators. An advantage of the Topical Facets Application is that these observations, and the results in general, can be motivated semantically. A limitation of the actual system is the arbitrary cutoff applied when the set of retrieved documents is large. Documents are grouped in four baskets: the first bin contains the documents with the highest facet values, the second the next highest values, and so on. When more than a certain number of candidate documents needs to be processed, say one thousand, the system only considers the first and the second basket, possibly neglecting valuable on-topic documents in the course of action.

After this detailed look at the topic detection task by means of a stratified sample, the overall assessment of the topical facets model can now be undertaken. Since the start of the TDT project in


1998 several research groups built programs for tackling the topic tracking and detection tasks. For instance, Allan and his co-workers report results on an earlier version of the TDT corpus containing 15,863 stories (Allan, Papka & Lavrenko, 1998). Table 3-13 has the score obtained on a detection task for 25 events:

Precision   0.45
Recall      0.54
F-score     0.49

Table 3-13. Results for UMass with an earlier version of the corpus.

In other experiments the effect of mixed-language corpora was reviewed. UMass obtained an English-only cost of 0.23 and a Chinese-only cost of 0.25. This is similar to the tracking results in a monolingual setting, showing that language makes little difference. Evaluation results in general show low false alarm rates, though with relatively high miss rates (Wayne, 2000). The report finds considerable variability in per-topic performance across sites and across versions, and little correlation between systems.

3.5.1 The TDT 2000 Challenge

The TDT3 corpus provides the material for the unrestricted text network and for a workbench with an evaluation plan for testing the assembly of topical facets. This paragraph will now point out how the system under study measures up to the results published at the official 2000 TDT Evaluation Workshop. DARPA, the University of Massachusetts (UMass), Carnegie Mellon University (CMU), and Dragon Systems, further called the TDT Consortium, developed the TDT tasks and the complementary evaluation experiments. According to George Doddington, TDT relates to Information Retrieval, but goes beyond the conventional concerns, particularly in the weight put on discovering new information and the focus on specific events, rather than on subject categories (Doddington & Fiscus, 2000). It poses five challenges:
• Story Segmentation: to provide a transcription of an audio source stream into its constituent stories.
• Topic Tracking: to detect stories in multiple source streams discussing a target topic, with N_t of them known.
• Topic Detection: to classify all stories unsupervised in clusters based on topic similarity. New topics must be detected as the incoming stories are processed.
• First-Story Detection: to detect the first story that discusses a topic, for all topics. It is essentially the same as Topic Detection, differing only in the system output.
• Link Detection: to detect whether a pair of stories discuss the same topic. The link detection task represents a basic functionality needed to support all applications (including the TDT applications of topic detection and tracking). It is related to the topic tracking task with N_t = 1.
Story segmentation, first-story detection, and link detection are not the point of interest here, so only topic tracking and topic detection will have our attention further in the text.

86

3. Experiments and Evaluation

3.5.2 Tracking and Detection Defined

For the official TDT topic tracking task, one to four stories identified as on-topic are given as training material to the contenders. Their system must track that topic and collect all related stories. The training documents are selected randomly from the annotated truth set. Knowing that some stories are on-topic permits discarding most of the news items as irrelevant. Since topics are processed independently, the possibility exists that one story is tracked by multiple topics, i.e. that the story could discuss more than one topic. The detection task, on the other hand, forces a strict partitioning of the stream of news stories, requiring that each story be assigned to a single topic cluster. Unlike tracking, the detection task has no human input. The only helpful information about stories and topics comes from pre-annotated training data that are likely to share only a few topics with the news being dealt with. Humans are not permitted to correct or confirm a system's output during or after a run. A major distinction between tracking and detection, as the TDT Consortium defines it, is the underlying assumption about whether or not stories can be on multiple topics. When new topics appear in the stream of news, the system must decide, unsupervised, to create a new cluster. Most approaches to detection rely on story clustering, partitioning all arriving news into bins depending on the topic being discussed. The first time a story does not fit the cluster being tracked, the system creates a second cluster and starts trailing both clusters simultaneously.
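Such a single-pass detection scheme can be outlined as follows (an illustrative sketch, not one of the submitted systems; the overlap measure and the threshold value are placeholders standing in for the tuned similarity functions the participants used):

```python
# Sketch of single-pass topic detection: each incoming story joins the
# best-matching cluster, or founds a new cluster when none is close enough.
from collections import Counter

THRESHOLD = 0.2  # placeholder; real systems tune this on training data

def similarity(story_terms, cluster_profile):
    # Simple term overlap standing in for cosine or model-based scores.
    shared = sum(min(story_terms[t], cluster_profile[t]) for t in story_terms)
    total = sum(story_terms.values())
    return shared / total if total else 0.0

clusters = []  # each cluster is a Counter over terms

def detect(story_tokens):
    terms = Counter(story_tokens)
    scored = [(similarity(terms, c), i) for i, c in enumerate(clusters)]
    best, idx = max(scored, default=(0.0, -1))
    if best >= THRESHOLD:
        clusters[idx].update(terms)   # adapt the cluster profile
        return idx
    clusters.append(terms)            # first story of a new topic
    return len(clusters) - 1
```

Deferring a decision for up to 10 stories, as the TDT rules allow, would amount to buffering incoming stories before calling detect.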

3.5.3 Overview of the Submitted Systems

Before discussing the outcome of the modified UMass and Topical Facets systems, a closer look at the TDT report based on the final workshop presentation puts the results in perspective. The entry for the Third Topic Detection Task includes six participants: BBN, Dragon Systems Inc., the National Taiwan University, Carnegie Mellon University, IBM, and the University of Massachusetts. The Topic Tracking Test has seven contestants: BBN, Dragon Systems Inc., the University of Iowa, Carnegie Mellon University, GE, the University of Massachusetts, and the University of Pennsylvania (Doddington & Fiscus, 2000). IBM presented the best detection system, according to Charles Wayne, with a normalized detection cost of 0.26; the second best system (UMass) had a cost of 0.32. The most successful system used logistic regression to combine probabilities from a topic spotting technique (with a language model built from concatenated training stories) and an information retrieval technique (in which the unknown story produces the model), followed by normalization (with thousands of known off-topic stories) and adaptation (with high-scoring test documents added to the training set and parameters re-estimated). The best tracking system was delivered by BBN at a normalized cost of 0.092; the second best comes from CMU at a cost of 0.14. The BBN method computes how much more probable a document is under the hypothesis that it is on-topic than under the null hypothesis, i.e. that it is not relevant. The training documents for a topic are pooled to form a large query, and the system computes the posterior probability that a test document is relevant given the query (Leek, Jin, Sista & Schwartz, 2000). The Carnegie Mellon University participants use a text classification model, represented by


weighted word vectors and scored by calculating the cosine distance. The k nearest training vectors, which may be positive or negative instances for a given event, are retrieved, and their votes are combined to score the current story (Carbonell et al., 1999). The Dragon Systems approach supplements the background topic models with a language model for a specific event. For each story the system loops through all the stories. A story is declared an instance of an event if its score against the event model deviates enough from its scores against the background topic models. A story, as defined in the corpus, is a text with a tag indicating whether it belongs to a topic or not; the system has no access to these labels. The system considers switching each story from its present topic to each of the others, based on the same distance measure as before. As a consequence, some clusters vanish and additional clusters are created. One difficulty is the computational complexity of the clustering task: as the clusters are constantly being updated, each update triggers a topic-model build phase. The topic builds are kept simple to keep the detection task tractable (Yamron, Knecht & van Mulbregt, 2001).
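The CMU-style scoring can be illustrated with a small sketch (assumed details: raw term-frequency vectors and similarity-weighted votes; the actual system used its own weighting scheme):

```python
# Sketch of kNN tracking with cosine similarity: the k nearest labeled
# training vectors vote on whether a new story belongs to the event.
import math
from collections import Counter

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def knn_score(story_tokens, training, k=5):
    """training: list of (term Counter, label) pairs with label +1
    (on-topic) or -1 (off-topic). Returns a signed score for the story."""
    vec = Counter(story_tokens)
    ranked = sorted(training, key=lambda ex: cosine(vec, ex[0]), reverse=True)
    return sum(label * cosine(vec, v) for v, label in ranked[:k])
```

A positive score would declare the story on-topic for the event in question.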

3.5.4 UMass and the Topical Facets Application

Earlier we observed that a pure detection task directly on the data would not be feasible for the Topical Facets Application. Fortunately, the approach implemented by this system is comparable to an adapted form of the tracking/detection task, performed by James Allan of the University of Massachusetts on the same data set. Allan sees a relationship between tracking and detection when only one story is available. In that case a tracking system is expected to use that single story to find all remaining stories on the same topic. To detect whether a new topic is coming up, the system handles each incoming story as potentially new, assigns it to a cluster if one exists, creates a new one otherwise, and starts tracking that single story (Allan, 2002). UMass uses the training data to create a short query to represent the event being tracked, and the query is applied to all subsequent stories. A story is declared to belong to the given event if it matches the query under a cosine-vector similarity measure. This approach is similar to how the Topical Facets Application works, with the difference that the query of the latter is derived directly from the topic annotation instruction, and of course with the difference of the similarity methods used. A second reason to compare with UMass is the availability of detailed data: while other entrants publish only a global normalized cost, UMass gives both precision and recall, and miss and false alarm probabilities. Figure 3-7 shows the results produced by UMass with the TDT3 data, presented as miss and false alarm probabilities. The UMass mean detection score (marked with a large X in the figure) is at P(miss) = 24% and P(false) = 1%. The results from the Topical Facets Application, shown as miss and false alarm probabilities, are presented in figure 3-8. The mean detection mark for the Topical Facets Application is at P(miss) = 48% and P(false) = 0.19%. The scores are in table 3-14. The Appendix on page 196 contains the topic list with miss-false and recall-precision values for every task.

Precision   0.56
Recall      0.52
F-score     0.48

Table 3-14. Results of the Topical Facets Application.


Fig. 3-7. Performance of the UMass system. Each topic in the system's output is represented as a single point.

The outcome is drawn in a probability grid developed by the TDT consortium. The best position, with the fewest misses and false alarms, is in the lower left-hand corner; the maximal precision/recall return is situated in the upper right corner. The random line in the figure assumes that 100 items are relevant in a retrievable set of 1,000. It marks the limit of a perverse retrieval, in which all of the nonrelevant items are retrieved before the first relevant one: until 90% of the documents have been extracted, the next document will always be nonrelevant or missed. Recall remains zero until no more irrelevant documents remain, at which time the system has no choice but to retrieve relevant items. Detailed results from both systems show that the UMass score has a 50% lower miss percentage, at the price of a 5-fold higher false alarm rate.

Fig. 3-8. Performance of the Topical Facets Application, plotted as miss probability (%) against false alarm probability (%). Each topic in the output is represented as a single point. No account is taken of on-topic documents overlooked by the annotators.


The UMass system returns 5 times more unrelated articles in order to double the number of relevant texts, compared to the Topical Facets Application (Fig. 3-9).

Fig. 3-9. Miss and false alarm probabilities of UMass and the Topical Facets Application compared: UMass produces relatively more false stories to obtain a lower miss probability.

3.6

Discussion

In the six tests of paragraph 3.4 there is a reliable logarithmic relation between the similarity value and the fact of being on-topic, with a coefficient of determination (r²) between 0.86 and 0.98. The explanatory variable (the content of a document as expressed by the topical facets it contains) explains nearly all of the variation in the response variable (the degree of similarity with the prototype). Yet the transition from being on-topic to being unrelated is gradual and cannot be derived mathematically. The position in the document ranking where a user would draw the line between relevant and non-relevant lies well beyond the point of inflection of the curve. This raises the question of balancing precision and recall. Topical facets are apparently good at extracting a precise answer after a recall phase that yields a high proportion of on-topic material. It is accepted that a tradeoff exists between high precision and high recall (Buckland & Gey, 1994); a higher miss score is then the converse of a lower false alarm mark. We saw the results of 60 TDT tasks with the cosine-vector method of UMass in figure 3-7. The probability of selecting an unrelated article is the control variable on the x-axis, called false alarm in this context. For each value of the control variable there is a value of the dependent variable on the y-axis, i.e. the probability of overlooking a relevant text, called the miss probability. There is certainty at the extremes: nothing is overlooked when the number of false alarms is maximal, and no unrelated article will be retrieved when the number of missing texts is maximal. The regression function m() estimates the other possible relations between the extremes, and is presented here as a semi-parametric model, with few assumptions about its shape at this point:

y_i = m(x_i) + e_i    (3.3)

where the e_i's are the residuals or errors.

The 60 assignments of the TDT task are a subset of all the possible retrieval experiments with the data from the corpus. The residual for each test is the vertical projection of the observed value of the dependent variable, i.e. the probability of missing a relevant document, on its predicted value. If for every value of x a perfect value for y were predicted, there would be no residual: every point would lie on the regression line. In reality one expects some variability across the experiments. Regression assumes that the residuals have a constant variance and are normally distributed. The variance of each of the sampling distributions should be the same, so a covariance matrix of residuals from repeated samples with a linear relation should have a constant value down the diagonal and zeros off the diagonal. One way to appreciate the variability is drawing a scatter plot of the standardized residuals against an explanatory variable. The miss residuals and false alarm probabilities from the UMass data demonstrate a strong non-linear relation (Fig. 3-10). Consequently, any experiment with the cosine-vector method from UMass will generate miss-false results distributed over the probability space according to the quadratic function:

y = -a·x² + b·x - c    (3.4)



Parameter settings and particular attributes of the data define the specific location on the curve.
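The shape of this relation can be checked with a quadratic least-squares fit (a sketch assuming NumPy; the (false alarm, miss residual) pairs would come from the per-topic results, which are not reproduced here):

```python
# Sketch: fit the quadratic of equation (3.4) to (false alarm, miss residual)
# pairs and report r^2, as in the analysis behind Fig. 3-10.
import numpy as np

def quadratic_fit(false_alarm, miss_residual):
    """Both arguments are 1-D NumPy arrays of equal length."""
    coeffs = np.polyfit(false_alarm, miss_residual, deg=2)
    fitted = np.polyval(coeffs, false_alarm)
    ss_res = np.sum((miss_residual - fitted) ** 2)
    ss_tot = np.sum((miss_residual - miss_residual.mean()) ** 2)
    return coeffs, 1 - ss_res / ss_tot

# An r^2 near zero (as with the Topical Facets residuals in Fig. 3-11)
# would indicate that no such systematic relation is present.
```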

Fig. 3-10. Strong parabolic relation (r² = 0.81) between the miss residuals (y-axis) and the false alarm values (x-axis) in the UMass data.

This contrasts sharply with the behavior of the Topical Facets Application residuals (Fig. 3-11). The error terms fan out in a funnel shape, a typical sign of heteroscedasticity, the term used to indicate a condition in which the variance of a dependent variable does not have a constant value across all


levels of the independent variable (Studenmund & Cassidy, 2001: 371 ff.). As there is no significant relation, no prediction can be made in this case with regard to the quality of the outcome of an experiment. At issue is the configuration of the model, not its tuning. It will make sense to improve the method used to retrieve documents in the recall phase. The cluster of good test results in the lower left-hand corner of figure 3-8 is not a random incident, nor, for that matter, is any of the other results. The variance of the residuals hints at a bias caused by the aggregation of data with different distributions. This requires a word of explanation. The system does not directly confront documents with a query, but tries to collect the most salient topical facets first. The salient documents related to these facets are ranked first. Because the number of topical facets used to calculate the probabilities varies, the variance cannot be constant. It should be possible to optimize the quality of the involved facets by eliminating phrases that are not specific enough in relation to the query, improving in consequence the quality of the recall outcome.

Fig. 3-11. Funnel-shaped heteroscedasticity (r² = 0.00) observed in the distribution of the Topical Facets residuals.

In this version of the program the lower bound of an informative token was set at four times the standard deviation from the mean informative value in a collection (note 14). Restricting the entry to two times the standard deviation would limit the number of common words. Named entities and other associations would then make up a larger proportion of the informative arcs, suiting this specific data set better. However, this arrangement is crude, since it violates the principle of parameter-free models, and, more significantly, it carries the risk of eliminating otherwise valuable words. Chapter 6 on page 169 has a suggestion for further research to address this difficulty.

Note 14. See equation 4.19 on page 115.
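The cutoff itself is a one-line rule. The following sketch (illustrative names, not the thesis code; see equation 4.19 for the actual informative value) shows the effect of choosing four versus two standard deviations:

```python
# Sketch: keep only tokens whose informative value exceeds the collection
# mean by k standard deviations; k = 4 in the reported experiments, while
# k = 2 would admit more common words.
import statistics

def informative_tokens(values, k=4.0):
    """values: dict mapping token -> informative value in a collection."""
    mean = statistics.mean(values.values())
    sd = statistics.pstdev(values.values())
    bound = mean + k * sd
    return {tok for tok, v in values.items() if v >= bound}
```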


Other factors could also be involved. For instance, the number of stories about a topic partially influences the probability that the Topical Facets Application misses relevant documents. The system overlooks a relatively large number of articles. This is a program-design issue: the current version sorts the candidate documents on their informative value and truncates the number of files to process, in order to manage the processing load. Especially with subjects attracting a lot of media attention, this leads to a loss of valuable data. However, the coefficient of determination suggests that the number of relevant documents explains only 30% of the total variation of the miss probability (Fig. 3-12), implying that including all recalled stories would improve the relevance rate by approximately one third. A more important contribution will come from a better document description, as suggested above.

Fig. 3-12. The miss probability for a topical facets experiment against the number of stories per topic, on a double logarithmic scale (r² = 0.30).

3.7

Related Research

TopCat (Topic Categories) is a technique for identifying topics that recur in articles of a text corpus (Clifton, Cooley & Rennie, 1999). The TopCat test corpus consists of an earlier TDT2 data set with 60,000 stories from two newswires, two televised sources, and two radio sources, covering January to June 1998. Because TopCat reports on experiments with this previous corpus and does not fully comply with the TDT rules, it is not possible to put both systems side by side. Clifton et al. report that TopCat gives a false alarm probability of 0.0026 and a miss ratio of 0.42 on the entire six-month corpus. According to the authors, the difference in specificity between the human-defined topics and the TopCat-discovered topics is the main reason for the relatively high miss probability. Chris Clifton, Robert Cooley, and Jason Rennie work preferentially with associations. They extract named entities, using linguistic cues to identify people, places, and organizations, as opposed to full text. Groups of named entities occurring together form frequent itemsets, capturing closely


related entities corresponding to an ongoing topic in the corpus. Given that they use (parts of) the corpus as a static collection, they face the problem of how to alert the user when something has changed, either when new topics emerge or when new information is added to an existing topic. One approach they suggest is to identify a changed frequent itemset as the result of new documents contributing to an old frequent itemset, or to identify a new frequent itemset as the result of documents that did not previously support an itemset. Carrying this through the hypergraph partitioning used to cluster the documents of the corpus proved a great challenge, unsolved at the moment of publication. TopCat was not submitted as an entry in the TDT challenge; the work was performed while the authors were at MITRE (note 15).

How do topical facets relate to other document descriptions used in information retrieval? Information retrieval (IR) systems in general do not include the entire contents of a document. Any IR method presupposes a manageable representation of the information. The internal representation uses some sort of descriptor, such as an infon, a conceptual graph, an index expression, a formula, and so on (Devlin, 1995). When a user submits his information need, the query is transformed into an internal representation. The transformation of the query is similar to the process used to characterize a document's content. The system compares the query description with all the document representations and decides by a matching technique which representations are likely to be relevant. These then become the retrieved documents. Depending on how adequate the system is, the retrieved documents correspond in some degree to the available relevant documents. Additional semantic knowledge supplied by a thesaurus can support the process; synonymous relationships are an example: a document indexed by an item is indexed by its synonyms too (Lalmas, 1998). It is known that a global vector similarity measurement may produce misleading results when ambiguities are not properly addressed. For this reason, the SMART Information Retrieval Environment at Cornell University uses a dual search strategy, in which local vector similarity supplements the global vector matching. The system attempts to detect locally matching text fragments such as sentences and paragraphs. This local text match is then used as a filter affecting the results of the global text comparison operation. A similarity between a query and a document is accepted as correct only if the respective texts contain sufficiently similar local text structures; according to Gerard Salton and his co-workers, this means the presence of at least one matching sentence pair with high similarity scores (Salton, Allan & Singhal, 1996). In the approach presented here the text structure is not a sentence but a topical facets relation. The basic assumptions, however, are kindred: extracting information gains from applying a two-step method. The dual-layer approach is a way to reduce model complexity. The model complexity problem addresses the question of choosing the number of free parameters needed to fit a given data set. Given enough parameters, any model will match any given data set perfectly. Such a good fit is misleading, since it fits the noise in the data too and has no more predictive power than the raw facts.
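The dual (global plus local) check can be outlined as follows (a sketch of the general idea, not the SMART implementation; the cosine measure and both thresholds are illustrative choices):

```python
# Sketch: accept a global query-document match only when at least one
# sentence of the document also matches the query locally.
import math
from collections import Counter

def cosine(a, b):
    num = sum(a[t] * b.get(t, 0) for t in a)
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def dual_match(query_tokens, sentences, global_t=0.2, local_t=0.3):
    """sentences: list of token lists, one per sentence of the document."""
    q = Counter(query_tokens)
    doc = Counter(tok for s in sentences for tok in s)
    if cosine(q, doc) < global_t:
        return False
    # Local filter: require at least one sufficiently similar sentence.
    return any(cosine(q, Counter(s)) >= local_t for s in sentences)
```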

Note 15. A not-for-profit MIT spin-off, providing engineering and technical services to the US government, based in Bedford, Massachusetts.


3.8

Summary

The first part of the chapter introduces a 15 million-word corpus that supplies the data for the text network. The corpus also constitutes an environment for controlled testing: with sixty predefined topics scattered over the corpus, it is possible to assess information retrieval systems with a topic detection and tracking task. The stratified sample in the first experiment describes in depth aspects of the Topical Facets Application in different task settings. Special attention goes to the discussion of the relevance of the retrieved material. The ranking of the retrieved articles is semantically justified, allowing the user insight into why the Topical Facets Application ranks some documents before others. The issue of defining the border between on-topic stories and stories unrelated to the task is raised.

Task              Precision   Recall   F-score   Miss   False
Cambodia          0.69        0.90     0.78      0.33   0.0058
Pinochet          0.76        0.57     0.65      0.34   0.0021
Osama bin Laden   0.75        0.31     0.44      0.65   0.0035
Richardson        0.17        1.00     0.30      0.25   0.0002
US elections      0.80        0.12     0.20      0.91   0.0001
Chechnya          0.96        0.79     0.86      0.48   0.0001
Average           0.59        0.39     0.46      0.49   0.0019

Table 3-15. Results for the Topical Facets experiments.

The Topic Detection and Tracking challenge from DARPA partially evaluates the application. The test was designed for purpose-built systems, and not all the required assignments in the original examination are relevant to the case developed in this thesis. Nevertheless, the published results contain enough reference material to allow an assessment. The Topical Facets Application has a relatively low false alarm score and a relatively high miss rate. The UMass score has a 50% lower miss percentage for a 5-fold higher false alarm probability. Analysis of the residuals of the miss-false probability relation shows that the Topical Facets tracking method can be structurally improved, which is the subject of further research.

Task                   Miss   False
Partial Topical Facet  0.49   0.0019
Full Topical Facet     0.48   0.0019
UMass                  0.24   0.01

Table 3-16. Results for the TDT task.


4.

Text Network

4.1

Introduction

After presenting the topical facet with an evaluation in an information retrieval task, the next two chapters look into the network concept underpinning the model. Chapter 5 gives an overview of how a general network model was adapted to fit the domain of natural language. The present chapter introduces the way an unrestricted text network is built, the kind of domains it is concerned with, and the objects, properties, and relationships it deals with. The characteristic unrestricted relates to the fact that the network contains no predetermined concepts and is not confined to a particular domain. Features are derived from the actual data; no removal of stop words or typographical errors is applied. At the beginning the system is empty, knowing nothing of the external world. It searches the sequentially incoming texts from the available corpus material for the smallest possible structures sufficient to build a generalizing representation of the data. In the terminology of Pat Langley this is an online learning task, as the instances are presented one at a time, in contrast to offline learning, where all the data are available simultaneously. The definition of learning supposes an improvement in performance through the acquisition of knowledge in an environment (Langley, 1996: 5).

To do this, the software translates the stream of potentially noisy messages into an assembly of lines and nodes called the unrestricted text network. The nodes or vertices stand for the words, more precisely for their token types, and the lines linking these points are the arcs. The main advantage of using a text network is that it can easily be analyzed for interesting characteristics, using graph theory and well-established network analysis methods. Networks can be represented and manipulated efficiently in a computer. The type of a word is a string that is unique inside the entire network. Vertices remember their position in the text and recognize left and right neighbors. A vertex knows which documents it is used in and gathers the information needed to calculate its informative value inside a collection of these documents. Since vertices have informative weight, any arc joining two of them has an informative value as well. Two tokens can be linked more than once, and the frequency of these links contributes to the arc weight. Since the words of a text are ordered, arcs have a direction: incoming from the vertex to the left or outgoing to the vertex on the right-hand side. Arcs allow traveling from one vertex to another and from one document to another, with the vertices acting as junctions.

It is understood that the idea of using some form of network construction is not new. The World Wide Web would never have sprung into existence without the development of the hypertext idea. In essence, hypertext consists of chunks of text that are not organized in the conventional linear sequence characterizing ordinary information, but in a network format. Each part can have pointers to an unlimited number of other chunks, devising associations between the chunks. Every web author is encouraged and supported in creating, editing, and deleting text fragments and pointers, building user communities along the line (Conklin, 1987). The uses and functions of a network are illustrated with a toy network, to clarify essential components and relations, and with short samples from the real text corpus when it comes to operational matters such as performance and implementation.
The use of unrestricted text networks in theoretical language research is justified by the collective endeavor of many researchers and is the subject of chapter 5. Most of the methods needed to apply these insights in an empirical setting are standard graph and network tools, augmented with a few recent instruments such as core extraction and topological awareness, the latter allowing for a location in the network (where) and for a time factor (when). In the following paragraphs of this chapter, the components of the text network are described with their appropriate graph notation. The construction of a text network starts with paragraph 4.2 Units and Time in a Text Network and with paragraph 4.3 about Network Relations. A view on text comparison is given in paragraph 4.4 Similarity of Documents. Paragraph 4.5 Topological Awareness is about navigating over the network. How network components are semantically distinguished is the subject of paragraph 4.6 Informative Value of Network Components and paragraph 4.7 Informative Arcs and Allocations. Paragraph 4.8 Text Network as Data Compression demonstrates how the network reduces the volume of the original data. Paragraph 4.9 Computational Complexity looks at the use of computer resources by the system. A view on other uses of the network concept and related work on co-occurrences and summarization is given in paragraph 4.10 Related Research. The chapter concludes with an overview in paragraph 4.11 Summary.

4.2 Units and Time in a Text Network

To assemble an unrestricted text network we assume the following components to be readily available:
• Words, the smallest components in the application, represented by their types;
• Text, a construction with words linked in a specific order;
• Document, implying a text with a source and a timestamp;
• Collection, defined as a set of documents having the same source and the same timestamp.
These elements are yet unconnected. Setting up a network starts with reading the first document into an otherwise empty environment. Source and timestamp are kept as features attached to the text content. A single text is already a simple network, i.e. a linear graph with a beginning and an end. Together the graphs create a network, with the restriction that any token is assigned only once to a point in the network. This point is called a vertex and it represents the class of which a token is an instance. A directed line called an arc symbolizes the relation between two tokens observed in the text.

[Figure 4-1 here: the five example sentences T1–T5 below, merged into one graph of word vertices connected by directed arcs.]

Fig. 4-1. A simple five-sentence unrestricted text network

To illustrate some of these aspects take the following one-sentence texts as an example:
T1: The lion from the circus performs a summersault every evening.
T2: Yesterday evening a lion escaped from the zoo; no trace of the animal yet.
T3: The Leo constellation is the Nemean Lion that was strangled by Hercules.


T4: The Virgo constellation will enter in a conflict with the Leo constellation.
T5: As a living animal it walks freely in the savannah, much less freely in the zoo or in a circus.

The particular geometric arrangement of the vertices in figure 4-1 carries no significance. When visualizing a network or parts of it, only the relationship is of importance. A commonly used notation for a network G is the tuple:

G = (V, A, g, r)        (4.1)

where V is a set of vertices {v1, v2, …, vn}, A is the set of directed lines or arcs, and g is a function associating with each arc a an ordered pair of vertices ⟨vi, vj⟩, named the endpoints of a, where vi is the initial point and vj the terminal point of a. r is any other function on V or A, adding certain features such as a label, a weight, and so on.

An alternative way to express the relation between two elements is calling vi adjacent to vj and vj adjacent from vi. The line ⟨vi, vj⟩ is said to be incident on vi and vj. When the focus is on the two endpoints rather than on the link between them, an arc is also called a dyad in the context of network analysis. In a general linguistic setting, the sequence of any two words is named a bigram. Every vertex represents a type tk, an element from a vocabulary L, identified by a label.

[Figure 4-2 here: cumulative number of types (y-axis, 80,000–330,000) against thirty document-reading steps (x-axis); the dotted logarithmic trend line fits with r² = 0.98.]

Fig. 4-2. The stepwise addition of new documents to the network (x-axis) results in a diminishing increase of new types (y-axis). The dotted line is a logarithmic trend line with a nearly perfect match of r² = 0.98.

When the network receives more texts, chances are that many of the types are already present. A new labeled vertex v_tk is created on the condition v_tk ∉ V. Fewer and fewer new types are created, without the process ever stopping. Figure 4-2 illustrates thirty document-reading moments on the x-axis with material from the corpus used throughout this work. The y-axis shows the cumulative number of novel vertices obtained after each step. The one hundred fifty thousand type mark is reached after a couple of steps, yet it takes an additional twenty steps to double this number.


A natural language manifestation has on its surface a linear sequence of tokens (Katzner, 2002). In a language with a left-to-right reading direction the left arc of a vertex is called the incoming link and the arc to the right the outgoing link. A text is a well-ordered (linear) sequence of terms having as first term an initial vertex and as last one the terminal vertex. The initial vertex of a text has no left neighbor and consequently no incoming link. Its terminal vertex has no right neighbor and no outgoing link. Punctuation marks are ignored in this particular situation for the reason that parts of the corpus do not contain any. In a text network made with properly marked texts each punctuation symbol would become a vertex too. By convention n indicates the number of nodes in a graph, or n = |V|, and m the number of lines, or m = |A|. Except for the initial and terminal vertices all vertices of a text are linked by at least two arcs. The number of lines coming into a vertex, deg_i, is the input degree, while the output degree deg_o is the number of lines going out of a vertex; these are sometimes abbreviated to indegree and outdegree. The total number of lines coming in or going out of a vertex, deg_s, is the alldegree. The degree of a network is the average of the alldegree of its vertices. If deg_v is the degree of a vertex v in a graph G with n vertices and m links, the average network degree is

deg_a = (1/n) · Σ_{v ∈ V} deg_v        (4.2)
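As an illustration of formula 4.2, a minimal sketch, assuming the network is stored as a set of directed arcs (not the implementation of this work):

def average_network_degree(arcs, vertices):
    # arcs: iterable of (initial, terminal) vertex pairs
    alldegree = {v: 0 for v in vertices}
    for a, b in arcs:
        alldegree[a] += 1   # outgoing line of a
        alldegree[b] += 1   # incoming line of b
    return sum(alldegree.values()) / len(vertices)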

The distribution of links between words in a natural language bears no relation to a Gaussian distribution. It resembles the Mandelbrot-Zipf term-frequency ranking. Work by Dorogovtsev and Mendes and by Ferrer i Cancho and Solé shows that the arrangement described by George Zipf is the univariate rendering of a complex networked structure. Zipf searched for a principle of least effort to explain the apparent equilibrium between uniformity and diversity in the use of words. Recent research proves it to be the optimal solution for maximizing the referential power of human language (Ferrer i Cancho, 2003).16 Figure 4-3 is a different view on the rank-frequency plot. Here links between words are counted and ranked, not the words themselves. The x-axis represents the cumulative count of all types in the corpus, ordered according to the number of observed links that each type carries on the y-axis. A link exists when a given word has a neighbor word, either at its left or at its right. As a reading example: the figure shows how only 2.6% of the types in the network generate 50% of all the links.

16 For a theoretical exposition on this subject see page 152 in chapter 5.


[Figure 4-3 here: cumulative share of token types (x-axis) against the share of links they carry (y-axis); 123 (2.6%) different token types produce 50% of the links.]

Fig. 4-3. Degree distribution of types: a small percentage (2.6%) of the available types (cumulative on the x-axis) collects the larger part (50%) of all links (y-axis).
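The 2.6% figure in figure 4-3 can be reproduced from the degree counts alone. A minimal sketch, assuming a hypothetical mapping from each type to its number of links:

def types_covering_half_the_links(alldegree):
    # alldegree: dict mapping each type to its number of links
    counts = sorted(alldegree.values(), reverse=True)
    half = sum(counts) / 2.0
    running = 0
    for k, c in enumerate(counts, start=1):
        running += c
        if running >= half:
            return k, k / len(counts)   # number and share of types needed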

The network grows by adding text to it, making the resulting graph dynamic. In a family of graphs, a dynamic graph is a graph G at a given time step t, symbolized by G(t). This expands the graph description to

G = (V, A, g, r, t)        (4.3)

Time step t is an interval of time, such that in any interval one action is performed. The stepwise addition of one new document is an important rule enforced on the prosthesis. In an online setting it is allowed neither to access the complete corpus at once, nor to preview any upcoming data. The time step under discussion should not be confused with the timestamp attached as a feature to a document, as mentioned on page 24. In addition, the system is not concerned with the moment in time the event described in the text actually took place. It is possible to add a document d produced at any moment t − x to a network G at any moment t. Henceforth, the shorthand graph notation G = (V, A) will replace the extensive formula in cases where the intended meaning is clear.


NetworkBuilder processes one token tk at a time, respecting the sequence of text T. Token type tk has a label (the word) and a key to its document. The type indexes a vertex. A new vertex and a new arc are created if appropriate.

Input:
    C := {c1, c2, …, cn}          texts with the same source and timestamp define a collection c
    L := {tk1, tk2, …, tkn}       label map, set of all types known to the system
    V := {v_l | l ∈ L}            set of vertices labeled with l, an element of L
    A := {ab | a ∈ V, b ∈ V}      set of arcs
    L = ∅; V = ∅; A = ∅           L, V, and A can be empty at the onset of a text network
Output:
    G(V, A)

for each token tk ∈ T, out of a collection T ∈ ci
    if tk ∈ L                     if the token tk is known, a labeled vertex exists
        get v_tk                  get the vertex labeled with tk
    else
        L := L ∪ {tk}             token tk is added to map L
        define v_tk               define a new vertex indexed with type tk
        V := V ∪ {v_tk}           new vertex added to the vertex set
    end if
    get v′                        get the previous vertex
    if (v′ ∈ V) ∧ (ab ∉ A)        if a previous vertex v′ exists and arc ab is not yet in A
        let ab = ⟨v′, v_tk⟩       an arc is created between the previous and the current vertex
    end if
end for

Algorithm 4-1. Network Builder
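The following is a compact Python rendering of algorithm 4-1, a sketch only (it reuses the illustrative Vertex class from the introduction and omits the collection bookkeeping):

def add_text(tokens, vertices, arcs):
    # vertices: dict label -> Vertex; arcs: set of (label, label) pairs
    previous = None
    for tk in tokens:
        if tk not in vertices:            # unknown type: create a labeled vertex
            vertices[tk] = Vertex(tk)
        if previous is not None and (previous, tk) not in arcs:
            arcs.add((previous, tk))      # directed arc from the previous vertex
        previous = tk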

4.3 Network Relations

Text T1, introduced in the preceding paragraph, has the relation: the → lion → from → the → circus (…). If l is the number of tokens in a text, any text may be symbolically represented as v1Rv2Rv3 … Rvl, or with a shortcut v1Rvl, meaning that there exists a directed line from term v1 to term vl, possibly with cycles because the same term can be reused in the same text. In the trivial case of a network with only one text, it is easy to reconstruct the original by starting from any point on the line: one has to go upstream to the initial vertex and/or downstream to the terminal vertex, while observing the ordered sequence as given by the term indices. Under this definition, any text T is a strongly connected component of the network if for any pair of vertices vi and vj in the vertex set of T, there is a directed path from i to j. A network component is a maximal connected subnetwork when nothing can be added without destroying the property of connectedness. A text is a maximal strongly connected subnetwork; adding or removing elements would result in a different text. The strong connection property is put to use when analyzing text collections. Since all vertices in a text are by definition connected, the sequence of the vertices (v1, v2, …, vk) is called a walk, or:

(v_i, v_{i+1}) ∈ A,  i = 1, …, n − 1        (4.4)

The length of a walk is the number of lines traveled (n − 1). Vertex vu from text T1 is reachable by vertex vv from text T2 if and only if there exists a walk containing initial vertex vv and terminal vertex vu. Although in principle not all vertices should be reachable from one text to another, the accessibility condition will be met soon enough in a typical text network of any dimension. This is due to the hub function performed by the highly linked terms. A hub is a place of convergence where paths arrive from one or more directions and are forwarded out in one or more other directions. The vertex labeled with the in figure 4-1 is an example. A hub allows bridging small clusters of otherwise relatively seldom used terms, making an unrestricted text network into an efficient modular meaning-making machine17. The definite article the functions as a junction in the five-sentence network introduced earlier, allowing for instance to link circus with constellation, terms that otherwise remain unrelated in this context. The application presented here presupposes full reachability of all its vertices, under the assumption that any text is a connected graph and that no text T exists in the network where all of its tokens are unknown to the language L, or T ∩ L = ∅.

Because the relations between vertices are at the heart of many propositions developed here, a more comprehensive formal description of network relations is appropriate. Let U = {u1, u2, …, un} be a finite set of units. Among the units in U some connections are possible. These connections are described as binary relations R1 ⊆ U × U. uiR1uj is read as: unit ui has a relation with unit uj. For example, if set U is composed of types, the relation R1 is the arc between two types expressing the sequential order in a particular language production. Collecting subsets of vertices to populate set U is possible when the vertices carry additional features identified by an index. Another example of the use of relations is the assumption that a document needs readers. This permits deriving a set of documents from a reader community, or deriving a set of readers from a set of related documents, without having knowledge of who the readers are, nor of the content of the documents. Let X = {r1, r2, … rn} be a set containing readers and Y = {d1, d2, … dn} another set containing documents. Let R be a binary relation (X, Y, G) on elements from X associated with elements of Y, where G is a subset of the Cartesian product X × Y defined by the relation R. If the R-relation stands for uses, then xRy means x uses y, or x → y. A transitive closure is obtained by the intersection of all transitive relations containing R. In other words, the transitive closure returns sets with all the readers using a certain document y, for every y ∈ Y. If an R′-relation on the same sets X and Y is defined as is used by, then yR′x means y is used by x, or y ← x. The transitive closure now returns sets with all the documents used by a reader x, for every x ∈ X. Finally a set X′ is populated with sets of readers using the same document and a set Y′ with sets of documents used by the same readers, both the result of the previous transitive closures. The R′ relation on X′ and Y′ now yields sets of documents used by the same set of readers. However, no full transitivity between readers and documents is implied:

17 Exposition of this attribute in chapter 5 on page 153.

(∀x )doc (x ) → (∃y )(reader (y ) ∧ R (x, y ))



but (∀y ) reader (y ) → (∃x ) doc (x ) ∧ ¬R (x, y ) 

(4.5)

(

)



In words: all documents have at least one reader, but not everybody reads all documents. Google is a striking illustration of how relevant content can be successfully derived from looking at the relations of a community of readers, without the need to look into the substance of the constitutive documents (Brin & Page, 1998). The PageRank algorithm driving Google considers the web as a graph G = (V, A) where the vertices V are web pages and the arcs A are outward-pointing links from one document to another. The outward link starting at one document is the inward link of its target, giving In(Vi) for the predecessor and Out(Vi) for the successor links. The PageRank (PR) value of a page is:

PR(Vi) = (1 − p) + p · Σ_{Vj ∈ In(Vi)} PR(Vj) / |Out(Vj)|        (4.6)

where p ∈ [0, 1] is the probability to follow an outlink, chosen uniformly at random, of page Vi. PageRank is the stationary distribution of the random walk, where a link from page Vi to page Vj is considered as a vote by page Vi for page Vj.
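A compact sketch of formula 4.6 as an iterative computation (illustrative only; this dissertation does not itself implement PageRank):

def pagerank(inlinks, outdegree, p=0.85, iterations=50):
    # inlinks: dict page -> list of predecessor pages
    # outdegree: dict page -> number of outgoing links
    pr = {v: 1.0 for v in inlinks}
    for _ in range(iterations):
        pr = {v: (1 - p) + p * sum(pr[w] / outdegree[w] for w in inlinks[v])
              for v in inlinks}
    return pr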

T1         the   lion  from  the   circus  performs
the        0     1     0     0     1       0
lion       0     0     1     0     0       0
from       0     0     0     1     0       0
circus     0     0     0     0     0       1
performs   0     0     0     0     0       0

Table 4-1. An adjacency matrix representation of the binary relations in a one-sentence text network.

In summary, various relations populate the set R = {R1, R2, R3, … Rn}. In the resulting structure G = (U, R), G is a graph with the set of relations R describing all possible connections among all the elements of the set U. A network using relation R can be represented by a binary matrix:

Rt = rij  nxn 1 t i R t j where rij =   0 otherwise

(4.7)

Here r_ij is a real number expressing the strength of the relation R between units t_i and t_j; in the binary case it simply records the presence of an arc. Table 4-1 illustrates the idea: there exists a relation between the and lion and between the and circus.
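A minimal sketch of formula 4.7 for one tokenized sentence (illustrative; note that it collapses the duplicate the onto one row and column, whereas table 4-1 keeps both occurrences as columns):

def adjacency_matrix(tokens):
    # Build the binary relation of formula 4.7 from consecutive bigrams.
    types = list(dict.fromkeys(tokens))          # unique types, in order of appearance
    index = {t: k for k, t in enumerate(types)}
    r = [[0] * len(types) for _ in types]
    for left, right in zip(tokens, tokens[1:]):
        r[index[left]][index[right]] = 1
    return types, r

# adjacency_matrix("the lion from the circus performs".split())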


4.4 Similarity of Documents

Several tokens of the sample texts introduced at the beginning of this chapter are identical: the, lion, evening, and a. There is no need to express identical tokens more than once. The preceding and following terms reveal the identity of a token. Human readers use the structure in a text to see a similarity between sentences without them being the same. In order to introduce a comparable capacity into the system, we have to translate this skill into an operation on network relations. Consider the following one-sentence texts:
Ta: Yesterday morning a lion escaped from the zoo
Tb: Yesterday morning a lion escaped from the zoo
Tc: Yesterday evening a lion escaped from the zoo
If two sets of identical types have exactly similar relations they would yield exactly the same sentence. In the context of a text network, the sets of vertices are partially ordered and bounded.

[Figure 4-4 here: the vertices v1 (yesterday), v2 (morning), v3 (a), v4 (lion), v5 (escaped), v6 (from), v7 (the), v8 (zoo), and v9 (evening) shared by the three texts.]

Fig. 4-4. Similar and identical texts represented as a text graph.

Let V1 = {v1, v2, v3, …, v8} be the vertex set of text subgraph GTa, V2 = {v1, v2, v3, …, v8} the vertex set of text subgraph GTb, and V3 = {v1, v9, v3, …, v8} the vertex set of GTc (Fig. 4-4). A one-to-one mapping f(vi) = i exists between two graphs G1 and G2 if there is an arc between vi and vj, where vi ∈ VG1, vj ∈ VG1, only if there is an arc between v′i and v′j, where v′i ∈ VG2, v′j ∈ VG2. This is the case with all v in GTa and GTb, and partially so with GTa or GTb and GTc. The part of GTc that maps to GTa or GTb is the common subgraph. Consider graph G′ = (V′, A′, g′) and graph G = (V, A, g), where V and V′ are sets of vertices, A and A′ are sets of arcs, and g and g′ are functions associating with each arc an ordered pair of vertices. G′ is called a subgraph of G, G′ ⊆ G, if V′ ⊆ V, A′ ⊆ A, and g(vi, vj) = g′(vi, vj) for all (vi, vj) ∈ V′.

Detecting similarity between documents is a problem of graph matching, referring to the task of finding a mapping f from the vertices of a given graph G1 to the vertices of another graph G2. If the mapping preserves all arcs and all vertices we have two isomorphic graphs. Graph matching is known to be computationally complex because it is costly to find a mapping for a pair of given graphs in a network through the exhaustive enumeration of all possible mappings. Usually the resemblance will be incomplete or nonexistent. However, a text graph is characterized by the existence of unique vertex labels; each vertex in the graph possesses a label different from all other labels. This condition implies that, whenever two graphs are being matched with each other, each vertex has at most one candidate for possible assignment under a mapping function f in the other graph. Any candidate is uniquely defined through its label and significant portions of the search space can be eliminated (Dickinson, Bunke, Dadej & Kraetzl, 2005). The document-by-document similarity method in chapter two on page 60 applies this property.
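Because vertex labels are unique, the common subgraph of two texts reduces to set operations on labels and label pairs. A minimal sketch, illustrative only and not the similarity method of chapter two:

def common_subgraph(tokens_a, tokens_b):
    def arcs(tokens):
        # The arcs of one text are its consecutive bigrams.
        return set(zip(tokens, tokens[1:]))
    shared_vertices = set(tokens_a) & set(tokens_b)
    shared_arcs = arcs(tokens_a) & arcs(tokens_b)
    return shared_vertices, shared_arcs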

4.5 Topological Awareness

When a second, a third, and more texts are added to the network the individual components are no longer separable. Individual texts share more and more terms and without adding supplementary features it is not clear what direction to choose at a junction on a network path. If vi − vj is allowed by text Tn and vj − vi is allowed by text Tm, the sense of direction is lost. This is why unrestricted text networks are by definition dense, in the sense that the number of arcs is greater than the number of vertices, while a relational matrix representing only one text is mostly sparse (the number of arcs is of the same order as the number of vertices). Many network representations are global in nature and require the client in such a case to access the entire network in order to derive useful information with a breadth-first or depth-first search, even if the sought piece of information is local and pertains to only a few vertices. Here also the notion of graph labeling is a valuable aid. The idea is to identify vertices in a way that allows inferring a neighborhood relation directly from the labels, without using any additional information sources. Figure 4-5 pictures how vertices with an even and with an odd index can be accessed quickly without the need to explore the entire lattice. This is consistent with the Kleinberg network model as amended by Fraigniaud, Gavoille and Paul, in the sense that a vertex has local knowledge (of its adjacent vertices) and is aware of long-range links as well, without claiming to be informed about the entire network, as exposed on page 148.

[Figure 4-5 here: a small graph with vertices v1–v5 labeled with even and odd indices.]

Fig. 4-5. Topological awareness. Appropriate graph labeling (here with an even and odd index) allows traveling fast over a complex network.

An intuitive analogy is traveling at night to an unknown city in an area with unnamed streets and without a roadmap, as opposed to a journey with a GPS system over a well-indicated highway grid. Features to achieve these long-range relations in the text network are at hand: documents have a source and a temporal dimension. Source and time are meta-features with regard to the text content available to the system, and by the definition used in this work a document is a text with a source and a timestamp, so the link is not arbitrary. Labeling a graph with sources and timestamps ultimately provides every vertex with a unique identifier. A document is identified by a member from a set of document sources S = {s1, s2, … sn}, where each s is a source responsible for the production of texts. A document is also recognizable by the moment in time it was produced. The system is aware of the moment of publication of a document. The network will ignore information lying outside the scope fixed for an interactive session. Given that source and time co-occur in the document we define a relation

Rd ⊆ S × K        (4.8)

to index the vertices belonging to the text identified by this document relation, where K is a calendar and S a set of sources. Texts emanating from the same source on the same date describe a collection. A collection c is a family of texts indexed by the same document relation index. A newspaper is an adequate example. Newspaper cn, where n ∈ Rd, gathers all texts issued by the same publisher si ∈ S on the same day ki ∈ K. The collection set C = {c1, c2, …, cn} brings together all the collections found in the network. We can identify a text by replacing the individual meta-information source and timestamp with the collection it is a member of. A collection is a container of texts and a text is a container of vertices. A document has a one-to-one relation with a collection, but one vertex can be a member of many documents (a one-to-many relation). These properties allow the extraction of fully functional parts of the global network. A formal description of the operation starts with a set V′ of vertices selected on a given collection index c. Set V′ ⊆ V is a subset of the vertices in the network G = (V, A, g), where g is the arc building function. A function g′ associates adjacent vertices in V′ and yields an arc set A′ ⊆ A. The final result is an induced subgraph

G′ = (V′, A′, g′)        (4.9)

The subgraph G’ contains all components from the texts in c. The subgraph in view is a core, since the selection criterion for the vertices ensures for every vertex v in the collection V’:

∀v ∈ V′ : deg_s(v) ≥ k        (4.10)

where k is the core number of vertex v and deg_s its alldegree (indegree + outdegree). G′ is the maximum subgraph with this property (Batagelj & Zaveršnik, 2002). In words: a core is a decomposition of a graph based on the connectivity of a subset of its vertices. Because the extracted components are texts and strongly connected, each core is a connected subgraph as well. For instance, a newspaper with all the articles it contains is a collection and hence a subgraph in the network.
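A minimal sketch of core extraction in the sense of formula 4.10, using the standard iterative pruning; parameter names are illustrative:

def k_core(vertices, arcs, k):
    # Repeatedly remove vertices whose alldegree drops below k.
    core = set(vertices)
    changed = True
    while changed:
        degree = {v: 0 for v in core}
        for a, b in arcs:
            if a in core and b in core:
                degree[a] += 1
                degree[b] += 1
        removable = {v for v in core if degree[v] < k}
        core -= removable
        changed = bool(removable)
    return core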


4.6 Informative Value of Network Components

With the basic structures in place, the next step is to provide the network with the possibility to extract semantic information. The meaning of a text is a function of the relationships it engages with words in a particular lexical field, or subsystem, and it cannot be adequately described except in terms of these relationships. In contrast with bag-of-words methods there is no need to discard stop words or hapax (dis, tris) legomena. From the standpoint of an unrestricted text network, there exist neither words with too low an information content nor words with too high a value. To qualify a token as having low information content is similar to saying that the token has no relations specific enough to make it stand out against the background of all other relations in a collection of documents. More precisely: a token gets a low information value when it is linked to many other tokens traversing many documents, while a highly informative token has fewer, but more privileged relations. Therefore, the same type can be more or less informative depending on the quality of its relations. How to quantify this quality is the subject of the subsequent sections.

4.6.1 Context and Meaning

How can we qualify each token type with an informative value? As indicated previously, informative value is relative with regard to the context of use. A collection of documents defines and restricts that context. Therefore, context plays a significant role. According to Graeme Hirst, among others, the notion of context is a problematic one (Hirst, 2000). What is a context in one case might not be a context in another. There is apparently no procedure to determine whether a particular entity is a context just from looking at properties or attributes. When it comes to the effects of context in natural language there is consensus on one point: it constrains interpretation because as a source of information it reduces or eliminates ambiguity, vagueness, or underspecification. Some authors, Douglas Lenat for instance, venture to analyze the context-space and describe up to a dozen dimensions along which contexts may vary (Lenat, 1998). The prosthesis policy of this work recasts the responsibility for all things context that transcend the pragmatic context definition used here to the human client of the software application. It is up to him to evaluate whether assertions are accurate in one context or fake in another. The application expects the user to accommodate apparently contradictory information by partitioning it out to the different metacontexts appropriate to him. The application assumes its duties as prosthesis by organizing the data according to the different sources and according to the community that takes advantage of these sources. The corpus material in this case is acquired from United States news providers and the relevant community is the user group of these American agencies.

It is assumed that the documents take on meaning in the micro-context of a collection as defined above. Lawrence Barsalou explains how this happens. Rather than simulating a general nose concept in isolation, the nose symbol appears in the context of a background object, for example, a face. Such a simulation does not produce a global representation of a property that covers the thing across all relevant categories. The nose representation rather simulates specific noses, such as those for humans, dogs, fishes, and airplanes, and one highly schematic one (Barsalou, 2005: 393 – 40). Barsalou concludes that studying the skill to construct temporary abstractions dynamically is more informative than attempting to discover one particular abstraction representing a category. Following Barsalou and transposing his proposition to the prosthesis environment, it is asserted that tokens acquire meaning in a context of other token symbols. A context activates certain features and obstructs others. This scheme was put to work in chapter two to define a topic from the overlap of topical facets18. Linda Smith and Diana Heise report how experience with correlations causes increased attention to the combinations of features entering into those correlations (Smith & Heise, 1992: 249). In a network environment the features in question can only be relations between these symbols, since token symbols are the only material available. The coherent string of tokens we name message does not transfer anything. It is the receiver who extracts meaning out of these message-signals by observing the relations between the symbols (Deacon, 1998: 112). The receiver not only observes a relation, but a relation augmented with feature weights. Without weight the relation would merely be a link. Feature weights modulate the prominence of a link in comparison with other links. In a symbol context the weight must be derived from relations, i.e. based on the relative frequency of these relations and its components. Informative value is the term used to delineate the feature weight of a link.

18 See chapter 2 on page 55.

[Figure 4-6 here: number of links per type for the most frequent words in a sample of US news documents; the count of 287 for ‘the’ is omitted.]

Fig. 4-6. In US news documents the word ‘president’ is fairly common (highly linked). The y-axis shows the number of links for each type. The frequency for ‘the’ with a count of 287 is omitted.

The high redundancy of information in a text is the starting point to get to the informative value. One expects to find a class of words that reappears continuously, a second class of words seen occasionally, and surely not a text collection filled with carbon copies. Although inside a collection different subjects tend to be spread over different texts, a lot of words are recycled from one use to the next. This phenomenon manifests itself with certain words that carry a higher number of links than one would expect from a mere frequency count of the same word in a general lexicon. For example, in a random subset extracted from the TDT-corpus, the word president occupies the 20th position among words like (Fig. 4-6): the, and, to, of, a, in, is, that, for, i, this, on, but, it, he, his, was, from, now, president, what, have, out, with, you, as, her, will, about, are, … Contrary to the Mandelbrot-Zipf school that describes the word frequency as a continuous function, Dorogovtsev and Mendes propose an explanation founded on a networked view, fitting the data with two power-law parts with different exponents19. Highly frequent words (e.g. function words, but not only these) have a different role as compared to words in the second part of the model. They serve as hubs. This explains the appearance of president as a pivotal point in many texts of the corpus, something not surprising for collections made up of documents provided by news agencies from the United States. In the method to be exposed in the next paragraph it is not the word frequency that matters, but the frequency of links between the words. As frequent links presume frequent words, it makes sense to use the experience acquired in word frequency research and to apply adjustments when needed. The main difference between the two types of frequency is that link frequency concerns both endpoints of the arc: the initial vertex and the terminal vertex. Link frequency captures how prominent a word is within a given document. The document frequency of a term is a marker of informativeness in the context of a document collection. A word carrying a substantial semantic load either will show up several times in a document, or will not occur at all. An illustration is given in figure 4-2 on page 141. Most of the words used in newspaper texts are semantically loaded: they appear intensively for a short period and vanish when the subject is out of interest. Semantic lightweights tend to spread out over numerous documents continuously. To quantify this behavior the combined term frequency and document frequency metric with its tf*idf symbol is used in various information retrieval and classification tasks. It is put to use here in a somewhat modified form.

19 See page 156 in chapter 5 for the theoretical details.

4.6.2 The tf*idf Formula Revisited

Karen Spärck Jones and Stephen Robertson are often named as standing at the origin of term frequency and its companion, the inverse document frequency, in information retrieval tasks (Spärck Jones, 1972; Robertson, 1977). Looking back in 2004 at the assorted flavors of the now widely used tf*idf metric, Robertson remarks that although neither the original Spärck Jones idf-formula, nor the relevance weighting model of Robertson and Spärck Jones, makes use of within-document term frequency (tf), the combined measure proved robust and hard to beat (Robertson, 2004). The tf*idf model proper was proposed by Gerard Salton and Yiming Yang in 1973 at the outcome of empirical studies of combinations of weighting factors (see Salton & Buckley, 1988 for an early evaluation). In his review Robertson looks at the use of the tf-factor and why it is multiplied with the idf-weight. Parts of the argumentation are appropriate for the networked approach presented here. Often the use of the term frequency is justified by referring to Zipf’s law. But the number of occurrences of a word in a body of continuous text is not the same as the term frequency used in the standard idf-formulation, where the number of documents is counted in which the term occurs, irrespective of the frequency of the term in each document. The standard definition of the inverse document frequency being:

idf(t) = log( |D| / df(t) ),  where  df(t) = Σ_{d ∈ D} [1 if d contains t; 0 otherwise]        (4.11)

The consequences of this problem are that document length needs to be taken into account, and that the basis on which |D| / df(t) was seen as a probability has changed. Here |D| is the number of documents in a collection. To rework the formula, let tf_i be the frequency of term t_i in a document, or more generally tf_i,j for t_i in a document d_j. Let dl_j refer to the number of term positions (the length of the document). Summing over all the terms in the document yields:

dl_j = Σ_i tf_i,j        (4.12)

The independence assumption says that the number of times a term occurs does not affect the probability that it can reappear in the same text. This allows adding a weight w_i for each time term t_i turns up. When weights are cumulated over the complete document, one gets tf_i occurrences of each term t_i. By multiplying the idf-weight w_i by the number of occurrences tf_i, the event of seeing a particular term now relates to a particular term position. By replacing df(t) by Σ_j tf_i,j and |D| by Σ_j dl_j, the probability is transferred from the event space of documents to the event space of positions of terms in the combined texts of the collection. This approach suggested by Robertson is similar to the way Gerard Salton et al. resolved the problem with the different event spaces:


w_ik = ( tf_ik · log(|D| / d_k) ) / √( Σ_{j=1}^{l} (tf_ij)² · (log(|D| / d_j))² )        (4.13)

where w_ik represents the weight of a term in a document, tf_ik is the frequency of occurrence of the term in view, and d_k is the number of texts in the collection with that term. Not the number of documents with an occurrence of word t is now central, but the sum of the relative frequencies of t in each document. The frequency information inside a document is considered instead of just the binary occurrence. The particular form of the denominator is used for length normalization (Salton, 1989). When working with short documents in collections with little variation in length the tf-measure is not so important. If the document span varies greatly inside and over the different collections, the combination of the term frequency with the idf-measure is justified. However, multiplying idf with the tf-component propagates the highly non-linear behavior of the term frequency to the overall document score. Apart from the use of link frequency instead of the term frequency, the term weighting system of this application is similar to the models found in the literature, including the use of document length normalization (Manning & Schütze, 1999: 544).


tw = 0.5 + ( 0.5 · tl_t,d ) / max_t(tl_t,d)        (4.14)

where tw stands for the token link weight, tl_t,d is the alldegree of the vertex representing this type, and max_t is the maximum alldegree found in this document, normalizing the value for document length. Finally the term weight is multiplied with the inverse document frequency to obtain the informative value i_v of the vertex v:

i_v = tw · idf(t)        (4.15)
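A sketch of formulas 4.14 and 4.15 combined (illustrative; the variable names are assumptions, and the collection-restricted idf of algorithm 4-2 below is passed in as a ready-made value):

def informative_value(alldegree, max_alldegree, idf_in_collection):
    # Formula 4.14: link-frequency weight, normalized for document length.
    tw = 0.5 + (0.5 * alldegree) / max_alldegree
    # Formula 4.15: weight times the inverse document frequency.
    return tw * idf_in_collection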



A network environment justifies using link frequency over term frequency for the reason that it captures the interaction between words. Another difference with the standard formula is the restricted definition of collection. The same word used in different collections yields a different term weight or informative value. It has been observed before that the probability of occurrence of a term fluctuates with the context in which it appears. Figure 4-7 is an illustration of this collection effect. The noun question gets a different value in each of the 13 collections sampled from the test corpus. In five cases it scores below the average value for all 5,230 types involved. Figure 4-8 shows a similar view on the proper name albanian. The word was observed in eight collections of the sample and it grades three times under the general average. The implication is that a word is considered informative in one context/collection and less so in another, something to be expected from any meaningful natural language production. It demonstrates why estimating the informativeness of a word on a local level instead of on corpus level is useful.

[Figure 4-7 here: informative value of the token question in 13 collections, plotted against the average over all 5,230 types.]

Fig. 4-7. Informative value of token question in 13 different collections (x-axis). The dotted line is the average value (y-axis) of the 5,230 different types found in these collections.

[Figure 4-8 here: informative value of the token albanian in eight collections, plotted against the average over all types in the sample.]

Fig. 4-8. The information value (y-axis) of the word albanian in eight different collections (x-axis). The dotted line is the average value of all types in this sample.

Table 4-2 shows the frequencies of terms used in the context of a conversation about clothes, contrasted with their relative frequencies in an entire corpus (Wu, 1998: 3). The informative weight differences of the same term reflect the dynamic nature of the semantic characteristics in message production (Dillard & Solomon, 2000), a feature put to use in content analysis. For instance, gas and guys are acoustically similar. When combined with lot of, a speech recognizer experiences difficulties. Both trigrams have the same relative frequency in the Switchboard corpus: f(gas|lot of) = 0.0006 and f(guys|lot of) = 0.0006. Switchboard is a collection of 2,430 spontaneous telephone conversations, averaging six minutes in length. When restricted to the conversation topic buying a car, the relative frequency of the combination gas – lot of rises 15-fold to 0.009, while the guys – lot of combination drops nearly to zero (Wu, 2002).

Term         Frequency in clothes conversation   Frequency in the entire Switchboard corpus   log difference
appearance   0.000452                            0.000016                                     3.35397
attire       0.000452                            0.000004                                     4.74027
attorneys    0.000602                            0.000028                                     3.08204
avon         0.000301                            0.000008                                     3.64167
backless     0.000301                            0.000003                                     4.74028
baggy        0.000301                            0.000004                                     4.33483
bakery       0.000602                            0.000011                                     4.04712
blouse       0.000602                            0.000011                                     4.04709
blouses      0.000904                            0.000011                                     4.45255
blue         0.002410                            0.000120                                     3.00199
boots        0.000904                            0.000020                                     3.82395

Table 4-2. Sensitive words in a conversation about clothes.


In general, the use of words in sentences depends on the topic of discussion. When people are talking about buying a car and hear the ambiguous expression lot of gas/guys they tend to select gas instead of guys, since the former is more relevant to the topic of discussion.

4.6.3 The Information Spectrum

The informative weights of the types in a collection shape an information spectrum (Fig. 4-9). In order to put this spectrum to practical use, a filter is applied to separate the valuable from the locally less helpful data. The filter consists of two components: the noise level of a collection and the lower bound of the informative values found in a collection. First the noise level. Real data are invariably noisy. They contain errors caused by external processes and events: in transcription by omitting punctuation, white spaces, etc. To simplify the treatment of errors the noise-inducing process is viewed as additive because it is continuous and independent of other processes. The background noise level of collection c is empirically defined as the first decile over the minimum informative value observed in that collection.

noise_c = 1.1 · min(i_v ∈ c)        (4.16)

[Figure 4-9 here: number of types (y-axis, 1–351) per informative value (x-axis, range roughly 0.55–1.34) for one day of the NYT.]

Fig. 4-9. Distribution of informative tokens from one day in the NYT sorted from high to low informative value. The x-axis shows a part of the full informative value range, the y-axis has the number of types for each value.

Subtracting the noise level from the informative value determines the minimum acceptability of a vertex as an informative token. The second component of the filter is the lower bound of the informative values of a collection. The mean


ī_v = (1/n) · Σ_c i_v        (4.17)

or arithmetic average of the informative values is used to describe the highly skewed distribution at hand. The standard deviation

σ_iv = √( (1/n) · Σ_c (i_v − ī_v)² )        (4.18)

is the measure of the spread of the values i_v inside a collection c, where n is the number of values. The values are widely spread and the distribution is unknown. For an unknown distribution with a known expected value u and a known variance σ²_X, Chebyshev’s inequality ensures that at least 94% of the values are within four standard deviations from the mean (Harris & Stocker, 1998). Therefore, in order to catch as many informative words as possible, the lower bound c_lb for a collection is set to four standard deviations below the mean:

c_lb = ī_v − 4 · σ_iv        (4.19)

There is no upper limit since no word has an informative value too large to consider. A type has an acceptable information value when

i_v − noise_c > c_lb        (4.20)

In words: to be acceptable, the informative value i_v of vertex v over the noise level in a collection should be greater than the lower information bound c_lb of that collection. Types failing to meet this requirement have their informative weight replaced by a zero placeholder i_v = 0, but they are not eliminated from the network.
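A sketch of the complete filter of formulas 4.16–4.20 (illustrative; it assumes the informative values of one collection have already been computed):

from math import sqrt

def informative_filter(values):
    # values: dict mapping each type to its informative value in one collection
    noise = 1.1 * min(values.values())                     # formula 4.16
    mean = sum(values.values()) / len(values)              # formula 4.17
    sigma = sqrt(sum((v - mean) ** 2 for v in values.values()) / len(values))  # 4.18
    lower_bound = mean - 4 * sigma                         # formula 4.19
    # Formula 4.20: keep the value if acceptable, else the zero placeholder.
    return {t: (v if v - noise > lower_bound else 0.0) for t, v in values.items()}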

     Collection    Informative types   Total types   Total tokens   Mean value   Lower bound   Files
 1   19981001NYT   10,787              11,014        73,998         0.8343       0.1114        75
 2   19981002NYT    9,806              10,016        69,137         0.8523       0.1219        78
 3   19981001CNN    2,348               2,368         9,649         0.9491       0.0668        80
 4   19981002CNN    2,420               2,450         9,577         0.9359       0.0869        74
 5   19981001APW    4,657               4,729        23,128         0.8873       0.1108        76
 6   19981002APW    4,910               4,990        26,618         0.8885       0.1338        84
 7   19981001ABC    1,024               1,093         3,097         0.5248       0.0590        12
 8   19981002ABC    1,061               1,118         3,062         0.5210       0.0357        12
 9   19981001VOA    2,328               2,377         9,419         0.8288       0.1045        52
10   19981002VOA    3,415               3,479        16,624         0.9005       0.1158        79
11   19981001NBC      976               1,119         2,866         0.4291       0.0276         8
12   19981002NBC      936               1,055         2,703         0.3721       0.0714         6
13   19981001MNB    1,566               1,703         7,046         0.4967       0.0418        12
14   19981002MNB    1,511               1,649         6,937         0.5043       0             13
15   19981002PRI    2,124               2,183         8,150         0.6532       0.0586        24

Table 4-3. Informative tokens in a sample of fifteen collections.


Table 4-3 shows fifteen collections extracted from the first two days in October 1998 to illustrate the outcome of this intervention. The column Informative types lists the number of terms matching the acceptability condition. Most types are informatively acceptable. Other columns display additional information on the collections, such as the total number of tokens, the mean informative value of the collection, the lower informative bound, and the number of files populating the collection. Take the New York Times collection from October 2 as an illustration. The highest informative value is 1.34 for the token GE (General Electric), followed by Salinas, Blumenthal and Thicke. These are probably the hapaxes in this day’s New York Times. The lowest acceptable value is 0.1382 for the word months. As can be seen in the graph in figure 4-9, there is a high concentration of valuable words in the range 0.88 – 0.78. The highest concentration is found at value 0.796 with 372 words.

Pseudo-code calculating the informative value i_v of a vertex over all the collections where this vertex is seen.

Input:
    V := {v | v ∈ G}             set with all vertices in the network
    C := {c | c ∈ G}             set with all collections in the network
    D := {d | d ∈ G}             set with all documents in the network
Output:
    V′ := {v′ | v′ ∈ G}          set of vertices updated with informative value

for all v ∈ V                    for every vertex
    for all c ∈ C                in every collection
        Dc := {d | d ∈ c}        collect every document d from this collection
        Dv := {d | d ∈ D ∧ v ∈ d}      collect every document d using this vertex
        Dcv := Dc ∩ Dv           set of documents using this vertex in this collection
        meanMax := ( Σ_{d ∈ Dcv} max deg_a ) / |Dcv|       the mean highest link frequency
        meanDeg_av := ( Σ_{d ∈ Dcv} deg_av ) / |Dcv|       the mean alldegree of this vertex
        tf := 0.5 + (0.5 · meanDeg_av) / meanMax           the token link frequency
        tf·idf := tf · log( |Dc| / |Dcv| )
        let v = v_{tf·idf, c}    vertex v is labeled with the tf*idf value, indexed with collection c
    end for
end for

Algorithm 4-2. The informative value of a vertex.

The following is an example of a 99-word cluster found at value 0.7972: adam, adultery, alienated, allusions, amiss, approaching, bathsheba’s, blessed, blindly, blow, caring, centralized, chapters, chronicles, collapses, comprises, confessing, connived, consciousness, constellation, curse, decatur, denounce, disciplined, drawl, duplicitous, dust, embedded, eve, exclaims, exegetes, facilitating, feminist, fidelity, foolishness, forgiven, four-minute, furloughs, ga, glorious, guilt, hata’ti, hatchet, household, husband’s, imagines, knox, literary, lust, lyhwh, messianic, monarchy, narrative, ominous, one-on-one, orders, ought, ourselves, passion, passive, paternity, pregnant, professor’s, profoundly, prophet, pulpits, punish, rabbinical, recounts, relies, retelling, royal, ruthless, sages, saul, scriptures, seminary, sincere, sinned, skepticism, soldier, spasm, spots, strange, swift, testament, theological, thread, traditionally, trail, traps, tribes, unloosed, unmentioned, verbs, verse, warns, wheedle, wheeler-dealer

Types with a zero informative value include: a, about, after, all, also, an, and, are, as, at, be, because, been, before, but, by, can, did, do, even, first, for, from, had, has, have, he, his, if, in, into, is, it, its, last, like, many, more, most, new, no, not, now, of, on, one, only, or, other, out, over, said, so, some, than, that, the, their, them, there, they, this, to, two, up, very, was, way, we, well, were, what, when, which, who, will, with, would, year, years

The reader’s attention is drawn to the word new, considered of low informative value. As a result, the combination New York will not yield, at least not in this collection. Neither did the terms american and president make it, nor clinton, with the lowest value score of all. These terms will not participate when the system is looking for informative tokens inside this collection. However, clinton, new york et al. remain at the disposal of the network; they may be valuable in other collections and consequently they can be learned by the system.

4.7 Informative Arcs and Allocations

The previous paragraph argues that the interaction with other tokens permits the calculation of the informative value of a token. Dorogovtsev and Mendes state that language is a network of volatile tokens collaborating with a stable and highly connected kernel of words, as explained on page 156 ff. Building on work by other researchers, they show how the structure of a language word web results from new words connecting preferentially to an existing word, while at the same time new edges preferentially emerge between pairs of existing words. These two processes are the key factors of edge emergence. Their word web is an analytical exercise covering the English language in its entirety and is therefore necessarily probability driven. In the unrestricted text network at hand the same principles apply with a few adaptations:
• Observed and expected values can substitute for probabilities;
• Direction is not lost, allowing the use of arcs rather than edges;
• The modified tf*idf metric adequately captures the concept of preferential emergence and the concept of new and existing words.


Returning to the last point: the link frequency concept used in the tf-part of the tf*idf metric has the same meaning as the number of connections used by Dorogovtsev and Mendes. The idf-part deals with the new – existing word feature. When a new text is added to the network, each word is evaluated against the terms from the collection already known by the system. A new word would be a term not seen in previous texts, while an old or existing word would be present in many documents. In the Dorogovtsev and Mendes model preferential attachment happens when a node is chosen with a probability proportional to the number of its connections. The word web model supplies arguments in favor of using arcs when describing language realizations by a computational device. Interestingly, there also exists research on the favorable influence of directional word pairs on human memory. Michael Kahana reports on the human ability to recover an entire pattern with the help of associations (Kahana, 2002). Teuvo Kohonen describes an implementation of binary relations consisting of ordered sets of items for the representation of linguistic expressions. He shows that it is possible to create complex structures from elementary relations and he conjectures that knowledge in general might be represented by such relational graphs (Kohonen, 1989: 27–9)20. In network analysis these complex structures may take the form of 3-, 4- and 5-rings, or k-rings in general. A k-ring is a closed chain of length k. Well studied is the 3-ring or triad, a triangular connective in directed graphs. It is a matter of further research whether these structures reveal interesting linguistic features. In any case a substantial portion of the triadic structure is explained by nodal and dyadic features. Katherine Faust reports how higher-order properties of networks are generated from lower-order dynamics. She underscores the importance of nodal and dyadic parameters such as network density, nodal indegree and outdegree distributions, and the dyad census (Faust, 2007).

20 See page 164 for more details.

Attributes of vertex v        Collection c: informative   Collection c: non-informative
Document d: informative       c+ d+                       c- d+
Document d: non-informative   c+ d-                       c- d-

Table 4-4. Possible combinations of the informative value of a vertex, where ‘-’ stands for not having, and ‘+’ for having, the attribute.

The network construction procedure creates arcs or dyads by connecting the vertices of a document. Given that vertices can have an informative value in one collection but not in another, a vertex obtains one out of four possible outcomes for every collection – document combination (table 4-4). Only vertices with the c+ d+ combination are considered informative when it comes to constructing informative arcs. The procedure to qualify and connect the vertices in the network leads to the following three relations:
• Two linked adjacent vertices v_i and v_j with an acceptable informative value, defined as i_vi − noise_c > c_lb at the initial point and i_vj − noise_c > c_lb at the terminal point, identify an informative arc, represented as arc ab_i.
• Two linked adjacent vertices with one acceptable informative value at the terminal point and a placeholder at the initial point are called a dummy arc;
• Two linked adjacent vertices with one acceptable informative value at the initial point and a placeholder at the terminal point are called a dummy arc as well.
The values of an informative arc need not be the same: i_vi ≠ i_vj. When two tokens have exactly the same informative value, one may assume that both have a similar relation to the text in which they live. If in addition they are adjacent, they also share a relation with each other. Let v_i and v_j be two vertices sharing the same value i_vi = i_vj and having an adjacency relation R; then

∀v_i ∀v_j : R(v_i, v_j) → ¬R(v_j, v_i)        (4.21)

This relation is only symmetric if v_i = v_j, in other words when the relation relates the same token to itself, which is highly improbable. The general phenomenon of related words likely to be used in the same context is named an association. This special bond between two vertices gives rise to a fourth network relation:
• An arc with exactly the same informative value at the initial point and the terminal point is defined to be an association, represented as arc ab_a.
Several of these associations can be chained into one associative string, for as long as the interlocking pairs of vertices (v_i, v_j), (v_j, v_k), (v_k, v_l), … meet the conditions. In the ensuing network operations the first three qualified arcs will play the major part. However, associations provide strong hints about the content of a text and as such they will turn up when the system needs to interface with the human user. It is possible, though, that an otherwise normal text does not contain many associations. For example, the following document (19981002_PRI.2000.1982) yields one association, devastating bomb, for the reason that only two of the adjacent vertices share exactly the same informative value in the collection of which this text is a member:

two months after the devastating bomb attack on the u. s. embassy in kenya the fund set up to help relatives of those who died has paid out its first benefits about thirty close relatives received one thousand seven hundred dollars at a ceremony in the main park in the capital nairobi a fund now totals nearly four million dollars that was established with donations from the public

Of course, enough other vertices in the text have a significant value to construct the necessary informative arcs to unlock the content in the event of an information retrieval task. Table 4-5 shows the most informative associations in four texts from the same two-day sample described in table 4-3. The phrases are computer generated without manual interference. The notion of association in this method is indirectly linguistically motivated. It manages to capture some interesting co-occurrences and phrases:
• Named entities, such as nat turner, gene autry, los angeles. These act like fixed phrases appearing in the same form throughout the text. The first name and the family name may show up independently.
• Idioms or frozen expressions: pop charts, art collector, best sellers.


• Terminological expressions referring to objects in technical domains: flame retardant, dimethyl methylphosphonate.
• Noun phrases (NP) without the optional determiners or prepositional phrases (because the system replaces these function words with a dummy symbol when they stay under the informative lower bound of a collection): multifaceted characters, sensible basics, recording artist.
• Verb phrases (VP): young fans rarely questioned, bounced along.
• Adjective phrases (AP): always considered myself.

19981001 NYT 0501: tobacco-plantation clan, nat turner, turn-of-the-century victorian, good-hearted squalor, darrell larson, birdlike redhead, andie macdowell, bridget terry, neurotic perfectionist

19981001 APW 0538: tel aviv suburb, flame retardant, dangerous chemicals aboard, 190 liters, dimethyl methylphosphonate, nerve gas, nrc handelsblad, intended recipient, shaul yahalom, nes ziona

19981002 PRI 2000.3309: holding addressed, simpson producer, minus overkill, sit sydney's, dance hinder, pastels candidacy, multifaceted characters, des bouquets, pop charts, sensible basics, musical finale, art collector, opposition third, always collected, gay drag, simon hunt, spoiler mendes, scheck raised, completely intimately

19981002 NBC 1830.1603: young fans rarely questioned, league baseball, bounced along, great horse, champion roy, popular phone store, nod yes, best sellers, gene autry, always considered myself, star audrey, producer criticized, million records, businessman owning, los angeles tv stations, sings mama, recording artist, order gets

Table 4-5. Associations collected from four texts
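To make the qualification procedure concrete, the following minimal sketch classifies the arc between two adjacent vertices and chains association pairs into associative strings. The names (Vertex, classify_arc, associative_strings) and the exact threshold test are illustrative assumptions, not the dissertation's implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Vertex:
    token: str
    iv: Optional[float]   # informative value in this collection; None marks a placeholder

def classify_arc(vi: Vertex, vj: Vertex, noise: float, clb: float) -> str:
    """Qualify the arc between two adjacent vertices.

    Returns 'informative' (arc_abi), 'association' (arc_aba), 'dummy',
    or 'none', following the relations described above.
    """
    ok_i = vi.iv is not None and vi.iv - noise > clb
    ok_j = vj.iv is not None and vj.iv - noise > clb
    if ok_i and ok_j:
        # identical informative values upgrade the arc to an association
        return "association" if vi.iv == vj.iv else "informative"
    if ok_i or ok_j:
        return "dummy"    # one informative endpoint, one placeholder
    return "none"

def associative_strings(vertices: List[Vertex], noise: float, clb: float) -> List[List[str]]:
    """Chain interlocking association pairs (v_i, v_j), (v_j, v_k), ... into strings."""
    chains, current = [], []
    for vi, vj in zip(vertices, vertices[1:]):
        if classify_arc(vi, vj, noise, clb) == "association":
            if not current:
                current = [vi.token]
            current.append(vj.token)
        else:
            if len(current) > 1:
                chains.append(current)
            current = []
    if len(current) > 1:
        chains.append(current)
    return chains
```

Applied to the Nairobi text above with suitable informative values, such a routine would return the single chain ['devastating', 'bomb'].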

Certain subclasses of collocations are lacking because tokens with a zero placeholder are ruled out, such as collocations with so-called light verbs like make, take, and do, and verb-particle constructions like to tell off and to go down. They are excluded not because of their intrinsic nature, but as the result of a low informative value. The co-occurrences presented so far assume adjacency of words, yet a phrase or a named entity can be an association even if it has non-consecutive components. Figure 4-10 illustrates how a collocational trigram could be derived from two arcs under the following conditions:


$v_i^s - v_j^0 \qquad v_j^0 - v_k^{s'}$

where the superscripts s and s′ mark the same informative value $iv_i = iv_k$ and the superscript 0 indicates a zero placeholder. At distance k − i = 2 the informative vertices are separated by one dummy (zero-informative) vertex. This vertex can then be replaced by a term f out of a limited set F of function words with the lowest semantic content, yielding noun-noun constructions through a prepositional phrase such as Embassy in Kenya, Bank of England, and Rudolph the Red Nosed Reindeer. However, a small exercise shows that more is needed to make this work: when the system constructs trigrams as described, it returns not only named entities and other valuable associations, but also many more common sentence chunks like trying to land, hundred and forty, and arrested in london.

[Figure 4-10 shows two bigram arcs $v_i^s - v^0$ and $v^0 - v_k^{s'}$ (panel A); the shared placeholder $v^0$ is replaced by a function word $v_f$ taken from F = {the, and, to, of, a, in} (panel B).]

Fig. 4-10. Creating a non-adjacent trigram collocation by replacing the $v^0$ placeholder in two bigrams (A) with a function word $f \in F$ (B).
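The derivation of Fig. 4-10 can be sketched as follows; the function name and the data layout are assumptions for illustration. It presumes the token sequence is available both with placeholders (iv set to None) and in its original surface form, so the dummy can be restored to its function word.

```python
F = {"the", "and", "to", "of", "a", "in"}   # the limited function-word set of Fig. 4-10

def trigram_collocations(vertices, surface):
    """Derive non-adjacent trigram collocations v_i^s - v_j^0 - v_k^s'.

    vertices: list of (token, iv) pairs, with iv None for a zero placeholder;
    surface: the original tokens, aligned with `vertices`.
    """
    found = []
    for i in range(len(vertices) - 2):
        (ti, iv_i), (_, iv_j), (tk, iv_k) = vertices[i:i + 3]
        # one dummy between two vertices holding the same informative value
        if iv_j is None and iv_i is not None and iv_i == iv_k:
            f = surface[i + 1]
            if f in F:              # restore the placeholder to a function word f in F
                found.append((ti, f, tk))
    return found
```

As the text notes, this test over-generates: chunks like trying to land pass it as readily as bank of england, so a further selection step is needed before accepting a candidate as a collocation.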

4.8 Text Network as Data Compression

The way the Topical Facets Application organizes its data differs from the TDT corpus. In the corpus one data file contains all the texts from one source for one day; the application presented here keeps one file for every article. There are 31,501 useful files, i.e. files with a text length > 2 tokens, with an average size of 3.3 KB. In total, the system has 100 MB of raw data at its disposal. The processed text volume of this corpus increases linearly according to the function y = 420.96x − 12,788 with a coefficient of determination r² = 0.98. The general form is the familiar linear regression:

$y = mx + b$    (4.22)

where m is the slope of the curve and b the intercept with the y-axis. When the raw text from the corpus is converted into a labeled graph, the application no longer needs the original file: it can be fully reconstructed on demand thanks to the maximal connectedness property of a text. Therefore the network dimension grows essentially with the number of new types and not with the number of files (Fig. 4-11).
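A toy sketch may make the reconstruction property concrete. It assumes nothing about the real data structures; it only shows that a document kept as a walk over labeled type vertices can be rebuilt verbatim, which is why the raw file becomes redundant.

```python
def to_type_walk(tokens):
    """Register each token under its type and keep the document as a walk of type ids."""
    type_ids = {}                                         # type label -> vertex id
    walk = [type_ids.setdefault(t, len(type_ids)) for t in tokens]
    return type_ids, walk

def reconstruct(type_ids, walk):
    """Rebuild the original token sequence from the labeled walk."""
    labels = {vid: lab for lab, vid in type_ids.items()}
    return [labels[vid] for vid in walk]

tokens = "yesterday evening a lion escaped from the zoo".split()
type_ids, walk = to_type_walk(tokens)
assert reconstruct(type_ids, walk) == tokens              # the raw file is redundant
```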


[Figure 4-11 plots two series over 500 files: the cumulative impact of 500 raw text files (linear growth, right y-axis) against the cumulative impact of the same 500 files stored as a text network (flattening growth, left y-axis).]

Fig. 4-11. Synthetic view on a network growing with the logarithm of the number of files (x-axis), compared to a linear increase with the full text (number of tokens divided by 1,000 on the right y-axis).

The best fit to the data of this corpus is given by the equation y = 60,955 ln(x) + 104,841 with r² = 0.98, an instance of the logarithmic regression function:

$y = c \ln x + b$    (4.23)

where c is a constant and b the intercept with the y-axis. In both functions x is the number of text files, log-transformed in the second case. After adding 500 files the network volume is less than 25 percent of the full text set, and the gap grows with every file (that is, without counting the overhead containing document information). The corpus counts 12,388,120 tokens but only 136,431 types (1.1%), in agreement with Zipf's observation. The tempered expansion rate positively influences the computational complexity of the software. The network as conceived reduces computational complexity by traveling over labeled type nodes instead of over unlabeled tokens. Representing the low-informative types with a placeholder symbol speeds up processing even more, not by reducing the network size as such, but by permitting the extraction of a temporary subnetwork with a reduced dimension. In this temporary subnetwork a vertex contraction process joins all placeholders, removing the edges between any two vertices being contracted. No information is lost since the labels of the vertices used to construct the graph also label the substitute dummy vertex, and the text keeps its maximally connected subgraph characteristic.
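The type/token economy that drives this compression is straightforward to verify on any corpus. The following generic sketch, not taken from the dissertation's code, records the growth of the type inventory while tokens stream in; on the corpus above the final checkpoint would approach (12,388,120, 136,431), the 1.1 percent ratio quoted in the text.

```python
def type_token_curve(token_stream, step=100_000):
    """Record how many distinct types have appeared after every `step` tokens."""
    seen, checkpoints = set(), []
    for n, tok in enumerate(token_stream, 1):
        seen.add(tok)                                    # repetitions cost nothing
        if n % step == 0:
            checkpoints.append((n, len(seen)))
    return checkpoints
```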


[Figure 4-12 draws the unrestricted text network of a five-sentence toy text about a lion; highly connected function words such as the, of, a, and is link the informative vertices.]

Fig. 4-12. In this five-sentence unrestricted text network the informative words are highlighted.

Assume that the following tokens from the toy text network presented earlier (Fig. 4-12) are considered to be of informative value:

T1: lion, circus, performs, summersault, evening
T2: yesterday, evening, lion, escaped, zoo, trace, animal
T3: leo, constellation, Nemean, lion, strangled, Hercules
T4: virgo, constellation, enter, conflict, leo, constellation
T5: living, animal, freely, savannah, walks, less, freely, zoo, circus

A dummy vertex replaces any two adjacent non-informative words in this temporary subnetwork, while the arcs linking them are removed. The use of a dummy is an alternative to the removal of stop words. Figure 4-13 illustrates the resulting contracted graph. Let G = (V, A) be a graph containing vertices u and v, and let G′ be the graph resulting from the contraction that replaces u and v with w and removes arc (u, v) and/or (v, u). Then

$G' = (V', A')$    (4.24)

where

$V' = (V \setminus \{u, v\}) \cup \{w\}$
$A' = A \setminus \{(u, v), (v, u)\}$


In words: there is a set of vertices, say {u, v}, that is replaceable by a vertex w such that w is adjacent to the union of the nodes to which u and v were originally adjacent. Arcs directly linking two vertices of that set are removed upon contraction.


Fig. 4-13. Representation of a reduced temporary subnetwork after replacing the set of non-informative types with one placeholder vertex (the dummy node in the graph).
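The contraction of eq. (4.24) is easy to state in code. The sketch below generalizes it from a single pair {u, v} to the whole placeholder set, as in Fig. 4-13; the function and variable names are illustrative assumptions, not the dissertation's implementation.

```python
def contract_placeholders(vertices, arcs, is_informative, dummy="dummy"):
    """Contract all non-informative vertices of a temporary subnetwork into one dummy.

    vertices: vertex labels; arcs: set of (initial, terminal) label pairs.
    Arcs between two contracted vertices are removed, as eq. (4.24) requires.
    The original labels remain available to relabel the dummy vertex, so no
    information is lost and the text stays reconstructable.
    """
    mapped = {v: (v if is_informative(v) else dummy) for v in vertices}
    new_vertices = set(mapped.values())
    # keep an arc only if its endpoints land on different vertices after mapping
    new_arcs = {(mapped[u], mapped[v]) for (u, v) in arcs if mapped[u] != mapped[v]}
    return new_vertices, new_arcs

# Toy fragment of sentence T2: "yesterday evening a lion escaped from the zoo"
V = {"yesterday", "evening", "a", "lion", "escaped", "from", "the", "zoo"}
A = {("yesterday", "evening"), ("evening", "a"), ("a", "lion"),
     ("lion", "escaped"), ("escaped", "from"), ("from", "the"), ("the", "zoo")}
informative = {"yesterday", "evening", "lion", "escaped", "zoo"}
V2, A2 = contract_placeholders(V, A, lambda v: v in informative)
# ("from", "the") disappears; ("a", "lion") becomes ("dummy", "lion"), and so on.
```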

The contraction presented here has deeper implications than being merely a local complexity-reducing device. On page 156 Ramon Ferrer i Cancho and Ricard Solé show analytically that any word in a lexicon can be reached with fewer than three intermediate steps. An example on page 39 illustrates the point: the highly available function words act as hubs in the text, allowing a jump to any word in very few steps indeed.

4.9 Computational Complexity

The computational complexity of an application in big O notation gives the upper bound of an algorithm's usage of computational resources as a function of the size of the input data. Building the unrestricted text network includes the allocation of the words to a type, the construction of arcs, the calculation of an informative value for the vertices, and the lookup of associations. This part loops over the tokens presented to the system. Because only the types require processing, the application uses logarithmic time O(log n), where n is the number of vertices (Fig. 4-14). The algorithm that calculates the topical facets visits all informative arcs and requires constant time per arc, contributing O(m) at most, where m is the number of arcs. The total time complexity of the application with regard to building the network and the topical facet layer is O(m + log n).


[Figure 4-14 plots network size in tokens (left y-axis, up to about 330,000) against cumulative processing time (right y-axis); the time series is fitted by y = 0.0023 ln(x) + 0.0017 with r² = 0.92.]

Fig. 4-14. A 300,000 token network (left y-axis) and the cumulative processing time needed to import the text files, calculate the informative value and collect the topical facets (right y-axis with decimal time notation).

The use of the system in a topic search is a distinct process on an existing network; therefore it is sensible to account for the complexity of this operation separately. The document similarity algorithm considers a subset of documents $D_r$ deemed relevant. The subset has a size $|D_r| < |D|$ and is scanned twice: once to look up a prototype and once to rank the documents according to their similarity score with the prototype. To compute the similarity, arcs and vertices are visited inside each document $d \in D_r$ in time $O(\max(m_d, n_d))$. Since this is executed for each arc of the network G at most, the total time complexity of a topical search is O(max(m, n)). Because in a connected network m ≥ n − 1, O(max(m, n)) = O(m). In real situations the input size is expected to be a much smaller subset, hence $m_d \ll m$.
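A schematic rendering of the two scans may help; the prototype criterion and the similarity function below are invented stand-ins for illustration, as the dissertation defines its own versions of both elsewhere.

```python
def topical_search(relevant_docs, informative_arcs, similarity):
    """Two passes over the relevant subset D_r.

    Pass 1 looks up a prototype (here, for illustration only, the document
    with the most informative arcs); pass 2 ranks every document in D_r by
    its similarity score with that prototype.
    """
    # Pass 1: select the prototype document
    prototype = max(relevant_docs, key=lambda d: len(informative_arcs(d)))
    # Pass 2: rank all documents against the prototype
    return sorted(relevant_docs, key=lambda d: similarity(prototype, d), reverse=True)
```

Both passes are linear in |D_r| apart from the final sort, which matches the O(m) bound derived above when the per-document work is charged to the arcs visited.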
