The Design of a Topic Tracking System

Joe Carthy, Department of Computer Science, University College Dublin, Ireland
Alan F. Smeaton, School of Computer Applications, Dublin City University, Ireland
[email protected], [email protected]

Abstract

This paper describes research into the development of techniques to build effective topic tracking systems. Topic tracking involves tracking a given news event in a stream of news stories, i.e. finding all subsequent stories in the news stream that discuss the given event. This research has grown out of the Topic Detection and Tracking (TDT) initiative sponsored by DARPA. The paper describes the results of a topic tracking system designed using traditional IR techniques and outlines a new approach to TDT using lexical chaining which should improve effectiveness.

1 Introduction

This paper is concerned with the design, implementation and evaluation of two topic tracking systems: the first system (TTS1: Topic Tracking System 1) is based on the use of traditional IR techniques and the second system (TTS2) incorporates the use of lexical chaining. Similar work has been carried out by Allan et al. [1]. TTS1 utilizes a term weighting scheme which is a modified version of that described by Robertson and Sparck Jones [2]. This work is the initial phase of a long-term project to investigate the effectiveness of lexical chaining as a basis for topic tracking and detection. The reason for developing TTS1 is that it will act as a performance benchmark against which to evaluate the effectiveness of a tracking system such as TTS2, which is based on lexical chaining.

1.1 Topic Detection and Tracking

Topic detection and tracking research has grown out of a DARPA-sponsored initiative to investigate the computational task of finding new events and tracking existing events in a stream of textual news stories from multiple sources. These sources include news broadcast programs such as CNN news and newswire sources such as Reuters. The information in these sources is assumed to be divided into a sequence of stories which provide information on one or more events. The TDT problem may be divided into three major tasks:

• Detection: identify those stories in a news stream that are the first to describe a new event occurring in the news stream, where the event has not been predefined or predicted.
• Tracking: given a small number of sample stories about an event, find all following stories in a news stream about the same event, from some point in time onwards.
• Segmentation: segment a stream of news data, especially recognized speech, into distinct stories.

1.1.1 Topic Tracking



The tracking task is defined as that of associating incoming stories with events known to the system. An event is defined as "known" by its association with stories that discuss the event, so each target event is defined by a list of stories that define it. If we take an event such as "the Kobe earthquake", then the first story (or first N stories) in the corpus describing the Kobe earthquake could be used as the definition of that event. In event tracking, a target event is given by providing a set of stories that describe this event, and each successive story in the corpus is to be classified as to whether or not it describes the target event. To support this task the corpus is divided into two parts: a training set which comprises stories known to be either about the event or not about the event, and a test set of stories which have to be classified. The tracking task is to correctly classify all of the stories in the test set as to whether or not they discuss the target event. The training and test sets will differ for each target event.

An interesting question for the tracking task involves quantifying the number of stories that are used to define the target event. This number is referred to as Nt. It is desirable to minimize the Nt value, as we wish to begin tracking as soon as possible after the training stories have been presented. Consider an event concerning an explosion that has just occurred: it is essential to begin tracking immediately, as opposed to using stories that occur weeks after the event. In the TDT pilot study Nt values of 1, 2, 4, 8 and 16 were used.

1.1.2 TDT Evaluation Corpus

A TDT test corpus was constructed to facilitate the pilot study mentioned above. This corpus includes 15,863 news stories from July 1, 1994 to June 30, 1995. Half of the stories are Reuters news articles and the other half are CNN broadcast news stories which have been transcribed. The corpus includes relevance judgements for a set of 25 events covering a broad spectrum of interests such as disaster stories (e.g. the Oklahoma City bombing; the Kobe earthquake in Japan) and crime stories (e.g. the OJ Simpson trial). Every story in the corpus was judged with respect to every event by two sets of assessors and any conflicts were reconciled by a third assessor. This exhaustive assessment meant that the corpus is quite small in comparison to TREC corpora, but a second and larger TDT corpus has recently been created by the Linguistic Data Consortium [3].

1.2 Lexical Chaining

The notion of lexical chaining derives from work in the area of textual cohesion by Halliday and Hasan [4]. In linguistics, the term text is used to refer to any passage, spoken or written, that forms a unified whole. This unity or cohesion may be due, for example, to an anaphoric reference which provides cohesion between sentences. In its simplest form it is the presupposition of something that has gone before, whether in the preceding sentence or not. Where the cohesive elements occur over a number of sentences, a cohesive chain is formed. For example, the lexical chain {mud pie, dessert, mud pie, chocolate, it} could be constructed from the sentences: "John had mud pie for dessert. Mud pie is made of chocolate. John really enjoyed it." The word it in the third sentence refers back to dessert in the first sentence. In this example it can also be seen that repetition (mud pie in the first and second sentences) also contributes to the cohesion of the text.

Lexical cohesion is, as the name suggests, lexical: it involves the selection of a lexical item that is in some way related to one occurring previously, and it is established through the structure of the lexis or vocabulary. Reiteration is a form of lexical cohesion which involves the repetition of a lexical item. This may involve simple repetition of the word, but also includes the use of a synonym, near-synonym or superordinate. For example, in the sentences "John bought a Jag. He loves the car.", the superordinate car refers back to the subordinate Jag. The part-whole relationship is also an example of lexical cohesion, e.g. airplane and wing.

A lexical chain is a sequence of related words in a text, spanning short distances (adjacent words or sentences) or long ones (the entire text). A chain is independent of the grammatical structure of the text; in effect it is a list of words that captures a portion of the cohesive structure of the text. A lexical chain can provide a context for the resolution of an ambiguous term and enable identification of the concept the term represents. Morris and Hirst [5] were the first researchers to suggest the use of lexical chains to determine the structure of texts, and a number of researchers such as Stairmand [6], Green [7] and Kazman [8] have since developed IR applications using lexical chains as an aid to representing document content. Stairmand and Black [9] point out that the "most relevant documents are not necessarily those with the highest incidence of the search terms, but rather those in which the concepts represented by the search terms are the focus of the document. It follows that weighting key concepts according to the relative strength of the context in which they occur provides a more accurate reflection of the significance of a concept in a document, since the context will tend to embody a topic rather than a term in isolation." This quote summarizes why we believe that the use of lexical chaining may be useful in topic tracking: by identifying the lexical chains in a news story we hope to identify the focus of that story, which can then be used in tracking it.

It is important to realise that determining lexical chains is not a sophisticated natural language analysis process. In applications such as machine translation, structural linkages must be identified and resolved with 100% accuracy in order for the translation process itself to be 100% correct. In determining lexical chains the level of linguistic processing required is far lower. As the name suggests, processing is done at the lexical rather than the structural level, making it a kind of "cheap and dirty" or, more correctly, "slightly less effective but faster, more robust and more tolerant of errors" process. This makes it achievable and scalable to large collections of documents.

A key factor in the design of any IR system is the notion of aboutness and how we represent what a document is about. The vast majority of IR systems represent document content via lists of keywords. Any given document, and in particular a news story, will typically have a central theme or focus, and a lexical chain is one technique that can be used to identify it. By developing the theory of lexical chaining we postulate that it will be possible to build more sophisticated indexing techniques than the simple keyword-based ones that dominate current IR systems.
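As a toy illustration of reiteration-based chaining (our own sketch, not part of either tracking system; the RELATED table is invented for this example and stands in for a real lexical resource):

    # Toy illustration of lexical chaining by reiteration and synonym lookup.
    # The RELATED table is hand-made for this example only.
    RELATED = {
        "jag": {"car", "jag"},                 # superordinate/subordinate
        "car": {"car", "jag", "automobile"},
        "dessert": {"dessert", "mud_pie"},
        "mud_pie": {"dessert", "mud_pie", "chocolate"},
        "chocolate": {"mud_pie", "chocolate"},
    }

    def toy_chains(tokens):
        chains = []  # each chain is a list of related tokens in text order
        for tok in tokens:
            for chain in chains:
                # link by repetition or by membership in a related-word set
                if any(tok == t or tok in RELATED.get(t, ()) for t in chain):
                    chain.append(tok)
                    break
            else:
                chains.append([tok])
        return chains

    text = "john had mud_pie for dessert mud_pie is made of chocolate".split()
    print(toy_chains(text))
    # one chain collects ['mud_pie', 'dessert', 'mud_pie', 'chocolate']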

2 Topic Tracking System Design

A topic tracking system (TTS1) was designed using traditional IR techniques, i.e. stories were represented by keyword lists stored in inverted files. Stories were preprocessed to remove stopwords and, in addition, keywords were stemmed (this process was also applied to all queries). As mentioned earlier, the topic to be tracked is associated with one or more stories; the number of stories used is called the Nt value. The overall operation of the tracking system may be summarized in the following set of steps:

1. Construct a tracking query for the current event from the Nt stories for this event.
2. Compare each new document in the document stream to the tracking query.
3. If the new document is sufficiently similar to the tracking query then flag it as tracking that event.
4. Repeat steps 1 to 3 for all Nt values.
5. Repeat steps 1 to 4 for all events to be tracked.

The tracking query is simply the set of terms from the Nt stories that define an event. In the context of event tracking, terms from the stories (events) to be tracked may be regarded as query terms. The problem then becomes one of computing the similarity of a given story to an event descriptor for the event to be tracked. Thus, from a conventional IR perspective, we wish to compute

S = Sim(EventDesci, Storyj)

and if S > Thresholdi then Storyj tracks Eventi.

In computing the threshold for a given eventi and Nt value, we construct an event descriptor (EventDesci) by taking all terms from the tracking set, i.e. the set of Nt stories. We then compute the threshold for this Nt value as the similarity of EventDesci with itself:

Ti = Sim(EventDesci, EventDesci)

Thus we have a separate threshold for each Nt value. We obviously expect this threshold to have a very high value, given that we are comparing an entity with itself. As a consequence, we would not expect the similarity of other stories with the event being tracked to have such a high score. Hence, we use what we call a thresholding constant TC when deciding if a given story tracks an event:

S = Sim(EventDesci, Storyj)

and if S > TC * Thresholdi then Storyj tracks Eventi.
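The following minimal sketch shows how this decision rule can be expressed (our reconstruction, not the original implementation; the names build_event_desc, tracks_event and sim are ours, and sim stands for the similarity function described below):

    # Sketch of the TTS1 tracking decision rule. An event descriptor is
    # modelled as the pooled terms of the Nt tracking stories; a story is
    # flagged if its similarity exceeds TC times the self-similarity
    # threshold Ti = Sim(EventDesc_i, EventDesc_i).
    from collections import Counter

    def build_event_desc(tracking_stories):
        """Pool all preprocessed terms of the Nt tracking stories."""
        desc = Counter()
        for story_terms in tracking_stories:
            desc.update(story_terms)
        return desc

    def tracks_event(story_terms, event_desc, sim, tc):
        """Return True if the story tracks the event, per S > TC * Ti."""
        threshold = sim(event_desc, event_desc)  # Ti; computed once per event
        score = sim(event_desc, Counter(story_terms))
        return score > tc * threshold

In practice the self-similarity threshold would be computed once per event and Nt value rather than per story; it is recomputed here only for brevity.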

The value of TC determines the recall (proportion of relevant documents retrieved) and precision (proportion of retrieved documents that are relevant) of the system: low values of TC (e.g. 0.28, 0.30) yield high recall but low precision, whereas high values of TC (e.g. 0.60) yield higher precision but lower recall. The effects of varying the value of TC can be seen in the performance values in Figures 1 to 4, for TC values between 0.25 and 0.60 at 0.05 intervals. These performance metrics were obtained using the evaluation software (tdteval) that comes with the TDT corpus. The similarity function is based on the traditional technique of matching query terms with index terms and computing a document score based on:

• the number of matches between a query and a document
• the frequency of occurrence of terms in documents and in the collection
• the length of the documents.

Many matching functions have been proposed [10] and the system described here uses a modified version of the Combined Weight [2] for computing document scores. We have used such a system in previous research with successful results [11]. The Combined Weight combines the collection frequency of a term, its term frequency and the document length to give an overall term weight. It may be defined as follows for term ti in document dj.

Given

n = number of documents in which term ti occurs
N = number of documents in the collection

the Collection Frequency Weight (also known as the Inverse Document Frequency) for term ti is

CFWi = log N - log n

The Term Frequency for term ti in document dj is

TFij = number of occurrences of ti in dj

The Document Length is

DLj = total number of term occurrences in dj

which may be normalized by dividing by the average document length, giving

NDLj = DLj / (average DL over all documents)

The Combined Weight for one term ti and one document dj is then

CWij = [ CFWi * TFij * (K1 + 1) ] / [ K1 * ((1 - b) + (b * NDLj)) + TFij ]

(K1 and b are tuning constants; it has been found empirically that setting K1 = 2 and b = 0.75 gives useful results [12].) The overall score for a document dj is simply the sum of the weights of the query terms present in the document.

Modified Combined Weight

Because of the nature of event tracking we do not have collection frequency data (CFW) available, i.e. we do not know in advance what stories are going to be presented to the system. This also means that we cannot compute the average document length (for NDL) over all documents. As a consequence, we omitted the CFW element from the above formula and used the average length of the stories in the tracking set as the average document length, giving:

Mod-CWij = [ TFij * (K1 + 1) ] / [ K1 * ((1 - b) + (b * NDLj)) + TFij ]
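As a concrete sketch of this weighting (ours; the variable names tf, doc_len and avg_tracking_len are not from the paper), the modified weight and the resulting document score can be computed as follows:

    # Modified Combined Weight: the CFW factor is omitted and the document
    # length is normalized by the average length of the tracking-set stories.
    K1 = 2.0   # tuning constants, following [12]
    B = 0.75

    def mod_cw(tf, doc_len, avg_tracking_len, k1=K1, b=B):
        """Mod-CWij = [TFij * (K1+1)] / [K1 * ((1-b) + b*NDLj) + TFij]."""
        ndl = doc_len / avg_tracking_len   # NDLj
        return (tf * (k1 + 1)) / (k1 * ((1 - b) + b * ndl) + tf)

    def doc_score(query_terms, doc_tf, doc_len, avg_tracking_len):
        """Overall score: sum of Mod-CW weights of query terms in the doc."""
        return sum(mod_cw(doc_tf[t], doc_len, avg_tracking_len)
                   for t in query_terms if t in doc_tf)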

3 Results

It is not clear what the appropriate evaluation measures for a TDT system are. A number of possibilities arise, such as the standard IR metrics of ranked precision and recall. In the TDT pilot study, system effectiveness was measured by the miss rate (false negatives) and the false alarm rate (false positives, or fallout). The miss rate was defined as the total number of misses divided by the number of stories that were judged as tracking an event. The false alarm rate was defined as the total number of false alarms divided by the total number of stories which were judged not to track the event. These were chosen because the problem was perceived as a detection task as opposed to ranked retrieval. The pilot study was concerned with both the new event detection problem and the tracking problem. In this paper, where we are concerned only with the tracking task, we use both forms of evaluation (precision/recall and miss rate/false alarm rate), on the basis that story tracking is very similar to document retrieval, as we can view the story (or set of stories) being tracked as analogous to a query.

As might be expected from a system based on traditional IR methods, there is the usual trade-off between precision and recall. In the case of recall, we can see from Figure 1 that an Nt value of 1 is sufficient to give good performance, whereas from Figure 2, for precision, we see that an Nt value of 2 seems to give the best results. The results show that a small number of stories can be used to describe an event for tracking purposes, i.e. high Nt values are not required. In fact performance degrades when higher Nt values are used. This may be caused by the dilution of terms pertinent to the real story because of the incorporation of too many facets from too many stories. We see from Figures 3 and 4 that the false alarm rate and miss rate also demonstrate a trade-off relationship, as might be expected. Thus, as the false alarm rate drops from a high of 14% (Nt = 1) at a TC value of 0.25 to almost zero at a TC value of 0.55, the miss rate rises from around 10% to almost 100% over the same TC range. Overall, the performance of TTS1 is broadly in line with the results reported by Allan et al. [1]. The poor precision performance highlights the need for further research to improve effectiveness.

[Figure 1: Average Recall versus Threshold. Average recall plotted against TC values from 0.25 to 0.60 for Nt = 1, 2, 4, 8 and 16.]

[Figure 2: Average Precision versus Threshold. Average precision plotted against TC values from 0.25 to 0.60 for Nt = 1, 2, 4, 8 and 16.]

[Figure 3: Miss Rate versus Threshold. Percentage miss rate plotted against TC values from 0.25 to 0.60 for Nt = 1, 2, 4, 8 and 16.]

[Figure 4: False Alarm Rate versus Threshold. Percentage false alarm rate plotted against TC values from 0.25 to 0.60 for Nt = 1, 2, 4, 8 and 16.]

4 Topic Tracking using Lexical Chaining

In this section we outline how lexical chaining is used in our second topic tracking system, TTS2. As we have seen earlier, a lexical chain is a sequence of related words in a text. In order to construct lexical chains we must be able to identify relationships between words. This is made possible by the use of WordNet [13,14], a computational lexicon developed at Princeton University. In WordNet, synonym sets (synsets) are used to represent concepts, and a synset consists of all those terms that may be used to refer to that concept. For example, the concept airplane could be represented by the synset {airplane, aeroplane, plane}. A synset signifies the existence of a concept and makes no attempt to explain what the concept is. A WordNet synset has a numerical identifier such as 02054514. Links between synsets in WordNet represent conceptual relations such as synonymy, hyponymy (is-a), meronymy (part-of) etc. The synset identifier can be used to represent the concept referred to in the synset, for indexing and lexical chaining purposes.

In our current research we use synset identifiers as the elements of lexical chains. For example, if any of the terms airplane, aeroplane or plane occur in a story which is being parsed into lexical chains, then the synset identifier 02054514 will be added to the chain for each of the terms. This has the added advantage that it automatically addresses the problem of synonymy that faces keyword-based IR systems [15]. A lexical chain may be characterised by the proportion of the document spanned by the chain; this is called the span of the chain. A chain may also be characterised by its density, i.e. the number of elements in the chain divided by the span of the chain. These chain scores have already been used [6,9] in IR systems.
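For illustration, synset identifiers and their relations can be inspected through the WordNet interface bundled with NLTK (a convenience we use only for this sketch, not the access method of the original work; synset offsets differ across WordNet versions, so the identifier printed may not match 02054514):

    # Looking up synset identifiers and related synsets via NLTK's WordNet
    # interface (requires the 'wordnet' corpus to be downloaded).
    from nltk.corpus import wordnet as wn

    for syn in wn.synsets("airplane", pos=wn.NOUN):
        # offset() is the numeric synset identifier; lemma_names() lists
        # the terms that make up the synset.
        print(syn.offset(), syn.lemma_names())

    plane = wn.synsets("airplane", pos=wn.NOUN)[0]
    related = plane.hypernyms() + plane.hyponyms() + plane.part_meronyms()
    print([s.name() for s in related])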

4.1 Lexical Chaining Procedure

The lexical chaining module of TTS2 operates as follows. Consider a given story as a stream of terms which has been preprocessed to remove stopwords.

1. Take the ith term in the story and, using WordNet, generate the set Neighbouri of its related synsets using the hyponym/hypernym (is-a) and meronym/holonym (part-of) relationships.
2. For each other term in the document, if it is a member of the set Neighbouri then add its synset identifier to the lexical chain for termi. The location of the chain element in the document stream is also stored.
3. If the lexical chain contains 3 or more elements then store the chain in a chain index file.
4. Repeat steps 1, 2 and 3 for all terms in the story.
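A sketch of this procedure is given below, reusing the NLTK WordNet interface from the previous sketch (our reconstruction under stated assumptions; the helper names neighbour_set, chain_for and story_chains are ours, and only directly related synsets are added to a Neighbour set, as discussed next):

    # Sketch of the TTS2 chaining steps above (not the original code).
    from nltk.corpus import wordnet as wn

    def neighbour_set(term):
        """Synset ids directly related to `term` via is-a and part-of links."""
        related = set()
        for syn in wn.synsets(term):
            related.add(syn.offset())
            for rel in (syn.hypernyms() + syn.hyponyms() +
                        syn.part_meronyms() + syn.part_holonyms()):
                related.add(rel.offset())
        return related

    def chain_for(i, terms):
        """Steps 1-2: collect (position, synset id) pairs for terms related
        to the i-th term; step 3's length test is applied by the caller."""
        neighbours = neighbour_set(terms[i])
        chain = []
        for j, other in enumerate(terms):
            if j == i:
                continue
            for syn in wn.synsets(other):
                if syn.offset() in neighbours:
                    chain.append((j, syn.offset()))  # store the location too
                    break
        return chain

    def story_chains(terms, min_len=3):
        """Step 4: repeat for all terms; keep chains of 3+ elements."""
        return [c for i in range(len(terms))
                if len(c := chain_for(i, terms)) >= min_len]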

Choosing which synsets to add to the Neighbour set is obviously a critical step. If we add too many synsets, by following too many links in the WordNet hierarchy for example, we will generate false chains (Stairmand [6]). For this reason, our initial experiments use only synsets that are directly related to the term whose Neighbour set is being generated.

From step 3 we can see that a lexical chain is represented by adding its elements (i.e. synset identifiers) to a document-chain index which stores the lexical chains for every document in the collection. An entry in this index records the document identifier, the set of (one or more) lexical chains for this document and, for each chain element, its location in that document. This allows us to compute the span and density of lexical chains as needed.

One of the problems with using WordNet in IR, as pointed out by Stairmand [6], is that many terms are not represented in WordNet, in particular proper nouns. Such terms are often very valuable for retrieval purposes, especially in news stories for event tracking and detection. For this reason we also use the conventional term index that was used in the tracking system described in Section 2. By combining the use of lexical chains with a term index, full use can be made of all of the information available in a news story to facilitate topic tracking. As Kominek and Kazman [16] point out, "combining multiple indexing techniques is more powerful than relying on just one". To date, other researchers [6,7,8] utilising lexical chaining in IR have not done this.

As for TTS1, the topic to be tracked is associated with one or more (Nt) stories. The overall operation of TTS2 may be summarized in the following set of steps:

1. Compute the tracking set of lexical chains for the current event from the Nt stories for this event.
2. Construct a tracking query for the current event from the Nt stories for this event.
3. Compare each new document in the document stream to the tracking descriptor (comprising the tracking set and tracking query).
4. If the new document is sufficiently similar to the tracking descriptor then flag it as tracking that event.
5. Repeat steps 1 to 4 for all Nt values.
6. Repeat steps 1 to 5 for all events to be tracked.

The tracking set is the set of all lexical chains extracted from the Nt stories that define an event. The tracking query consists of all terms in the Nt stories. Comparing a document to the tracking descriptor involves computing two similarities, one based on term similarity as defined in Section 2 and one based on lexical chain similarity:

Term Similarity = Sim(EventDesci, Storyj), as defined earlier
Chain Similarity = Ch-Sim(Tracking-Seti, Storyj)

S = Term Similarity + Chain Similarity

and if S > Thresholdi then Storyj tracks Eventi.

As in Section 2, when computing the threshold for a given eventi and Nt value, we compute the threshold for this Nt value as the similarity of the event with itself:

Ti = Sim(EventDesci, EventDesci) + Ch-Sim(Tracking-Seti, Tracking-Seti)

As before, we have a separate threshold for each Nt value and we use a thresholding constant TC when deciding if a given story tracks an event:

if S > TC * Thresholdi then Storyj tracks Eventi
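Continuing the sketch from Section 2 (the function names are ours; sim and ch_sim stand for the paper's Sim and Ch-Sim functions), the TTS2 decision might read:

    # Combined TTS2 decision: term similarity plus chain similarity,
    # thresholded by TC times the event's self-similarity Ti.
    def tts2_tracks(story, event_desc, tracking_set, sim, ch_sim, tc):
        threshold = (sim(event_desc, event_desc) +
                     ch_sim(tracking_set, tracking_set))   # Ti
        s = sim(event_desc, story) + ch_sim(tracking_set, story)
        return s > tc * threshold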

Computing Chain Similarity

A number of similarity measures may be used for computing Ch-Sim(Tracking-Seti, Storyj), i.e. lexical chain similarity. One such measure is the Overlap Coefficient [10], which may be defined as follows for two lexical chains c1 and c2:

Overlap Coefficient = |c1 ∩ c2| / min(|c1|, |c2|)

This measure has the advantage that if the elements of a short chain are subsumed in a longer chain we get a high matching value. The Ch-Sim function is computed as follows:

1. Compute the overlap coefficient of lexical chain i in the tracking set with all chains in Storyj and record the highest matching value as max_matchi.
2. Repeat step 1 for all n lexical chains in the tracking set.
3. Ch-Sim(Tracking-Seti, Storyj) is then the sum of max_matchi over i = 1 to n.
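A minimal sketch of these definitions, representing each chain as a set of synset identifiers (the representation and the names overlap and ch_sim are ours):

    # Overlap Coefficient and Ch-Sim. Each chain is a set of synset ids;
    # Ch-Sim sums, for every tracking-set chain, its best overlap with
    # any chain extracted from the story.
    def overlap(c1, c2):
        """|c1 intersect c2| / min(|c1|, |c2|)."""
        if not c1 or not c2:
            return 0.0
        return len(c1 & c2) / min(len(c1), len(c2))

    def ch_sim(tracking_chains, story_chains):
        return sum(max((overlap(t_chain, s_chain) for s_chain in story_chains),
                       default=0.0)
                   for t_chain in tracking_chains)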

At the time of submitting this paper, experiments are under way to evaluate the effectiveness of this approach.

5 Conclusions and Future Work

In this paper we have described two topic tracking systems, TTS1 based on traditional IR techniques and TTS2 based on the use of lexical chaining techniques. The performance of TTS1 is largely what might be expected when using this methodology. We hypothesize that significant improvements in performance will be gained by moving away from a keyword-based model to one based on lexical chaining i.e. TTS2. Experiments are currently underway to test this hypothesis.

6 References

[1] James Allan et al., Topic Detection and Tracking Pilot Study Final Report, in Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, February 1998.
[2] S.E. Robertson and K. Sparck Jones, Simple, Proven Approaches to Text Retrieval, University of Cambridge Computer Laboratory Technical Report no. 356, 1994 (updated 1996, 1997).
[3] Linguistic Data Consortium: http://www.ldc.upenn.edu/
[4] M. Halliday and R. Hasan, Cohesion in English, Longman, 1976.
[5] Jane Morris and Graeme Hirst, Lexical Cohesion Computed by Thesaural Relations as an Indicator of the Structure of Text, Computational Linguistics 17(1), March 1991.
[6] Mark Stairmand, A Computational Analysis of Lexical Cohesion with Applications in Information Retrieval, Ph.D. Thesis, UMIST, 1996.
[7] Stephen J. Green, Automatically Generating Hypertext by Comparing Semantic Similarity, University of Toronto Technical Report no. 366, October 1997.
[8] Rick Kazman, Reem Al-Halimi, William Hunt and Marilyn Mantei, Four Paradigms for Indexing Video Conferences, IEEE Multimedia 3(1).
[9] Mark A. Stairmand and William J. Black, Conceptual and Contextual Indexing using WordNet-derived Lexical Chains, Proceedings of the BCS IRSG Colloquium 1997, pp. 47-65.
[10] C.J. van Rijsbergen, Information Retrieval, Butterworths, 1979.
[11] Joe Carthy and Alan F. Smeaton, IR Experiments with and without Relevance Feedback, Technical Report, Computer Science Department, UCD, January 1999.
[12] S.E. Robertson and S. Walker, On Relevance Weights with Little Relevance Information, Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 16-24, ACM Press, 1997.
[13] George Miller, WordNet: An On-line Lexical Database, Special Issue, International Journal of Lexicography 3(4), 1990.
[14] Christiane Fellbaum (Ed.), WordNet: An Electronic Lexical Database and Some of its Applications, MIT Press, 1996.
[15] G.W. Furnas, T.K. Landauer, L.M. Gomez and S.T. Dumais, The Vocabulary Problem in Human-System Communication, Communications of the ACM 30(11), pp. 964-971, November 1987.
[16] John Kominek and Rick Kazman, Accessing Multimedia through Concept Clustering, Proceedings of CHI '97, March 1997, pp. 19-26.
