Message extraction through estimation of relevance

Christopher Landauer and Clinton Mah

8.1 Introduction

In both the fast-access and the large-volume ends of the information processing spectrum the end user may be called an information analyst. His task as part of an information processing system is to provide the insights and to ask the right questions of the system. The computer should perform all of the routine analysis and comparison of documents. The interaction between the analyst and the computer must make it easy for the analyst to ask the necessary questions and to interpret the computer's response as an answer.

Experiments with the METER system have shown that associative retrieval offers a unique capability for obtaining information in response to queries produced by the analyst. In fact, associative methods offer the most assistance precisely in the cases that are difficult to handle in any other way, namely, when there is a large amount of unformatted English text. Associative methods are also useful when the volume or volatility of the data precludes any detailed knowledge of its contents. In this case the analyst will have to rely on the responses to questions in order to gain any knowledge of specific events. Associative retrieval methods allow an analyst to obtain that kind of information even without knowledge of specifics. With any Boolean or keyword system, an analyst must have a more detailed knowledge of vocabulary in order to obtain a comparable response.

The METER system was designed with several goals in mind. Specifically, the system was to exploit the methods of associative retrieval in an effective and inexpensive fashion, and to allow a naive user (that is, someone unfamiliar with the exact content of the database) to access useful information with minimal effort. Our success with these particular goals far exceeded our expectations in the light of the huge amount of research work already completed in the area. The particular implementation we built was expected to keep pace with a database of up to 20 000 messages that arrive continuously at a maximum rate of 4000-5000 per day. The system must have decent response times (one or two minutes) with five simultaneous users, and almost 24-hour access. The system was required to run on a DEC PDP-11/45 or 11/70 without special hardware.

As a tool for information analysis, the METER system was designed in a
sufficiently modular fashion to permit application to several different processing environments. Moreover, the system has been designed to be self-tuning. For any specific environment, the system produces peripheral information that can be used to tune the system to that environment while it is running. In fact, METER is an implementation design strategy as well as a particular set of computer programs. The Scaled METER System is an experimental implementation of METER designed as a test bed to support evaluation of and comparisons between associative retrieval algorithms. All experiments described in this chapter were performed with the Scaled METER System.

The basic structure of all METER systems is derived from consideration of earlier systems for retrospective document searches (see Salton, 1968; Lancaster and Fayen, 1973; Hersh et al., 1977), although the original application area, current events analysis (for example, investigative reporting), has required substantial changes in organisation (see Landauer et al., 1977; Hersh and Landauer, 1978). The structural differences have been shown to apply more generally to other text retrieval problems.

For current events analysis, the raw information arrives in the form of short segments of English text (less than 1000 words), called messages, at a rate of up to 5000 per day. At any given time the current database consists of about 20 000 messages (up to one week of messages).

The operation of a METER system can be described in database access terms as follows: the system builds an access mechanism for the stream of messages without disturbing the original text. The preparation of the text stream for retrievals has two main processing steps. The message acceptor transforms the text into a vector form that is amenable to statistical analysis, and the message analyser builds the access pathways for the current database window.

The message acceptor performs all local text analyses, that is, those that relate to one message independently of the others. These analyses include the lexical normalisation, suffix removal and text reduction described in Sections 8.2, 8.3 and 8.4. The output is a sequence of vectors (or lists), with one list for each message, that describe the occurrences of terms within the message.

The message analyser performs all global analyses, that is, those that relate to more than one message. These analyses include the collection of term statistics, the selection of content stems (index terms), the computation of stem-to-stem associations and the final reduction of each message vector to its retrieval form. The more difficult of these processes are discussed in Sections 8.5 and 8.7. The files produced during this phase are the basic access pathways.

The third main part of a METER system is the message extractor. This phase is the on-line retrieval mechanism that makes use of the access pathways built by the previous message analysis. Some of the query processing details are described in Sections 8.6 and 8.8-8.10. The user interface, described in Section 8.11, is the only part of the METER system that an information analyst will see. Section 8.12 describes our finding that, even without stem associations, a statistically oriented keyword retrieval system will generally function better in a non-expert environment than a Boolean system. Section 8.13 describes the evaluation procedure for the various configurations of the Scaled METER System, and Section 8.14 gives an overall
summary of the advantages of this system over other systems of its type.

Throughout this chapter we use the words 'message', 'text' and 'document' interchangeably, and also 'stem' and 'term'. Both 'content stem' and 'index term' refer to those stems which are chosen to represent the messages in a database.

8.2 Lexical normalisation

Any system that performs any kind of word processing needs a specific convention about what is actually a word. There are many places in which decisions must be made, and each such decision has an effect on all subsequent processing. In this section we describe the particular choices used in the METER systems, with some comments on the general problem.

Because METER text analysis is based on counting occurrences of words in messages, we first had to decide what an occurrence of a word should be. This is a critical problem in a statistical system, because it ultimately affects everything else in the system. Consequently, we spent a great deal of time developing procedures for handling hyphenation and abbreviation, for filtering out numerical strings and other non-words, and for stemming, the reduction of words to a root form by removal of suffixes.

First, we currently assume that a word must contain a letter and may contain embedded digits. An apostrophe will always end the sequence. All letters are translated to lower case, as we do not account for the differences between capital letters and others. A sequence of digits (without a letter) will not be considered. In a document collection of technical abstracts the rules for accepting numbers as words should probably allow numerical strings and be sufficiently flexible to allow decimal points and scientific notation.

Hyphenation of a word across a line boundary does not break the word. The hyphen and intervening space characters are ignored. On the other hand, hyphenation of a sequence of letters and digits within a line is treated differently. If the sequence after the hyphen contains no letters (for example, 'adam-12'), then the hyphen is ignored. If the sequence after the hyphen contains a letter (for example, 'a79-c'), or if the sequence before the hyphen does not (for example, '12-31a'), then the hyphen breaks the sequence into two parts. We have chosen to split hyphenated pairs of words, because the word pair will not otherwise be related to other, individual occurrences of the two words. Since the two words occur together in a message, the association relation between them will account for their connection.

Some abbreviations are also accepted as words. If a period (dot '.') occurs between two letters or digits, without intervening spaces, then the period is ignored. Thus, the sequence 'U.S.A.' produces the word 'usa', but the sequence 'U. S. A.' is accepted in three parts, all of which are subsequently discarded.

All words with fewer than three letters and digits are ignored. These 'short words' usually are function words (for example, 'to', 'of', 'be'), and seldom contain useful information.

Finally, there is a list of common function words, called stopwords, which we assume a priori to have no usefulness for retrievals. This list of stopwords includes the most common words, such as 'the', 'and', 'then', as well as some
less frequent words (for example, 'nevertheless', 'because', 'however'). The current METER System has a standard list of about 200 stopwords, but, in general, the list should be tuned somewhat for each particular database.

For application to large document collections, a METER system will have several extensions to these word conventions. These extensions all involve recognising specific sequences for the purpose of bypassing the normal processing. A major source of error in the current system is the occurrence of proper names. Names of people, organisations and locations, names that contain more than one word, and abbreviations of names all have different spelling and connection conventions from those of other words. Foreign words are another obvious source of error. The only known marginally effective methods for dealing with these problems are to use a specific dictionary of acceptable terms or to use some recognition heuristics. Both methods suffer from serious deficiencies. An illustration of the problem in this area is given by the word 'begin'. In many cases the word should be considered as a stopword, since it is very general. If so, however, a current events database would not be able to find the current Prime Minister of Israel.
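To make these conventions concrete, the following sketch shows one way in which they might be implemented. It is our reading of the rules in this section rather than the actual METER code, and the stopword set shown is a tiny illustrative subset of the standard list of about 200.

    import re

    STOPWORDS = {'the', 'and', 'then', 'nevertheless', 'because', 'however'}  # illustrative subset

    def words_of(text):
        """Split text into words under the conventions of Section 8.2."""
        # A hyphen at a line boundary does not break a word: drop the hyphen
        # and the intervening space characters.
        text = re.sub(r'-\s*\n\s*', '', text).lower()
        words = []
        # An apostrophe always ends a sequence, so it is excluded from the
        # token pattern; "don't" yields 'don' plus a short word 't'.
        for token in re.findall(r"[a-z0-9.\-]+", text):
            # A period between two letters or digits is ignored, so 'u.s.a.'
            # becomes 'usa'; stray leading/trailing dots and hyphens are dropped.
            token = re.sub(r'(?<=[a-z0-9])\.(?=[a-z0-9])', '', token).strip('.-')
            for part in split_hyphens(token):
                # A word must contain a letter, have at least three letters
                # and digits, and not be a stopword.
                if re.search('[a-z]', part) and len(part) >= 3 and part not in STOPWORDS:
                    words.append(part)
        return words

    def split_hyphens(token):
        """Apply the in-line hyphen rule to one token."""
        parts = token.split('-')
        out = [parts[0]]
        for part in parts[1:]:
            # The hyphen is ignored when the piece after it has no letter and
            # the piece before it does ('adam-12' -> 'adam12'); otherwise it
            # breaks the sequence ('a79-c' and '12-31a' -> two parts).
            if part and not re.search('[a-z]', part) and re.search('[a-z]', out[-1]):
                out[-1] += part
            else:
                out.append(part)
        return out

For example, words_of('U.S.A. and Adam-12 begin') yields ['usa', 'adam12', 'begin'], with 'begin' surviving because it is not in the illustrative stopword subset.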

8.3 Stemming

The purpose of a stemming procedure is to reduce words to their root forms by removing affixes and possibly by transforming the results to get rid of irregularities. This reduction is important in statistical text analysis, because grammatical variations of a word can then be mapped into instances of the same term for counting. The total number of distinct terms will also be reduced, which makes computation of such statistics as pairwise term associations more tractable. A stemmer also helps to simplify query entry by eliminating the need for using 'wild card' specifiers to indicate word roots (for example, using 'comput*' to indicate all words beginning with 'comput').

The quality of stemming ultimately puts an upper bound on the effectiveness of a statistical retrieval system. Understemming fails to map similar words together and may lead to omission of significant relationships. Overstemming incorrectly lumps words together, suggesting spurious relationships. In either case the underlying noise level in a statistical system increases and the accuracy of retrievals falls off. The development of a good stemming procedure is therefore a prime prerequisite of any statistical retrieval system.

The METER stemmer is a sophisticated procedure based on English spelling rules and many special cases. It is fairly comprehensive, being the result of over three years of evolution through iterative experimentation (trial and error). Careful stemming, in fact, appears to be one major factor in the improved performance of METER over earlier statistical retrieval systems such as RADCOL (see Morris, Carroll and Jayne, 1975; Hersh et al., 1977). The procedure continues to be refined.

At present the METER stemmer consists of two distinct components: an English inflectional stemmer and a general suffix remover. The inflectional stemmer concentrates on '-s', '-ed' and '-ing' endings, which by themselves can account for up to 40 per cent of the cases where stemming was necessary in texts processed so far with METER. The frequency of these endings justifies
making a special effort to do these correctly. About 300 different string patterns are recognised by the inflectional stemmer, including such cases as

    rushes → rush
    plunged → plunge
    writing → write
    flies → fly

The inflectional stemmer can restore a final '-e' that is dropped when a suffix is appended, which eliminates a major weakness of stemmers on systems such as RADCOL.

The METER general suffix remover handles all suffixes other than '-s', '-ed' and '-ing'. Because of space constraints, it cannot treat as many special cases per suffix as the inflectional stemmer; but it is still rather extensive, recognising about 800 different string patterns. For example,

    comprehensive → comprehend
    differentiation → differ
    judgement → judge
    angular → angle
    transmission → transmit
    energy → energe
    American → amer

Unlike the inflectional stemmer, the general suffix remover will not always produce full words. All that the suffix remover attempts to do is to derive distinct stems for counting. It includes a recogniser for certain common English irregular forms and uses the final '-e' restoration procedure of the inflectional stemmer.

Both the inflectional stemmer and the general suffix remover have been repeatedly tested against three text collections of 1200, 1700 and 2500 documents, respectively; and both have been much revised. Their effectiveness is best seen when compared with the stemmer of the RADCOL system, from which many ideas of METER were derived. The RADCOL stemmer recognised about 500 different string patterns, but tended to overstem a great deal without really avoiding understemming as intended. In one experiment with the RADCOL stemmer, involving about 400 words, there were 14 instances of significant understemming or inconsistent stemming. This is more than 3 per cent of total stems, not counting misspellings. The METER stemmer has an error rate of less than 1 per cent on understemming and overstems a great deal less than the RADCOL stemmer.

The main weakness of the METER stemmer is with proper names and adjectives, and with foreign words, since these tend to be irregular with respect to English spelling rules. As the stemmer has been refined, these words have accounted for most errors. Further development of the stemmer is expected as additional collections of text become available for analysis. The only practical limit here is the amount of memory that can be devoted to stemming when a METER system is run on a minicomputer such as the DEC PDP-11/45. At present it should be possible to incorporate about 200 more string patterns, which would leave plenty of room for future development.

We note that the stemming problem for English, although it is complicated
by the large number of words of foreign descent, is not nearly as hard as the problem faced by the Responsa project (see Attar and Fraenkel, 1977; Attar et al., 1978), since Hebrew is much more inflected, and written Hebrew is much more lexically ambiguous (owing to missing vowel signs).

Of all procedures, the stemmer was by far the most complex and required the most effort. A stemmer, however, is essential in a statistical retrieval system; keeping separate counts for grammatical variations of the same word takes up too much space and may yield counts that are too low for deriving meaningful statistics. A stemmer also helps the user by eliminating the need for 'wild card' conventions for indicating simultaneous searches on all variants of a word.
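The pattern tables themselves are too large to reproduce, but a minimal sketch conveys the flavour of the inflectional stemmer. The handful of rules and the final '-e' restoration list below are illustrative stand-ins for the roughly 300 string patterns described above, not METER's actual tables.

    RESTORE_E = {'plung', 'writ', 'judg'}  # illustrative final '-e' restoration entries

    def inflectional_stem(word):
        """Remove '-s', '-ed' and '-ing' endings in the spirit of the METER
        inflectional stemmer (a tiny fraction of its special cases)."""
        if word.endswith('ies') and len(word) > 4:       # flies -> fly
            return word[:-3] + 'y'
        if word.endswith('es') and word[-3] in 'shxz':   # rushes -> rush
            return word[:-2]
        if word.endswith('s') and not word.endswith('ss'):
            return word[:-1]                             # plans -> plan
        for suffix in ('ed', 'ing'):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                stem = word[:-len(suffix)]
                if len(stem) > 2 and stem[-1] == stem[-2]:
                    stem = stem[:-1]                     # stopped -> stop
                if stem in RESTORE_E:
                    stem += 'e'                          # plunged -> plunge
                return stem
        return word

A general suffix remover would follow the same pattern-driven style for the remaining endings, producing counting stems such as 'energe' and 'amer' that need not be full words.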

8.4 Text reduction

Each word that occurs in message text is either ignored (if it is too short or if it is a stopword) or stemmed. The result of these processes is a sequence of stems, called raw stems, in the same order as that in which the corresponding words appeared in the original text. It should be noted that stopwords and short words do not become raw stems; they are entirely unavailable for processing.

The final transformation of a message for processing begins with the ordered sequence of raw stems produced by the stemmer. The current METER system collects the stems, orders them alphabetically and represents the message by the resulting micro-concordance. This list consists of the raw stems together with an importance rating ('frequency weight') for the stem in the message. Our current rating uses log2(1 + raw count) as opposed to the raw count directly. This scheme avoids over-emphasising stems that occur many times in one message. Moreover, it is consistent with common statistical practice to avoid using raw count data directly (see Tukey, 1977). Other term weighting schemes are described in Chapter 5 of this volume, together with some simulated experiments to compare them.

Another reduction scheme is to use the ordered stem sequence directly. The result of some experiments with a small database is described in Attar and Fraenkel (1977), but the conclusions drawn have not been tested on larger databases. Algorithms have been designed for incorporating this option into the Scaled METER System, but they have not been tested. Our expectation is that, as the individual messages get larger, this scheme will perform better in comparison with the micro-concordance scheme currently used in METER.
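Building the micro-concordance is then a small computation. The sketch below applies the log2(1 + raw count) weighting directly to an ordered list of raw stems; it is an illustration of the scheme, not the METER implementation.

    import math
    from collections import Counter

    def micro_concordance(raw_stems):
        """Reduce an ordered sequence of raw stems to an alphabetised list of
        (stem, frequency weight) pairs, weighted by log2(1 + raw count)."""
        counts = Counter(raw_stems)
        return sorted((stem, math.log2(1 + n)) for stem, n in counts.items())

    # For example:
    #   micro_concordance(['soviet', 'grain', 'soviet', 'harvest'])
    # gives [('grain', 1.0), ('harvest', 1.0), ('soviet', 1.585...)], so a
    # stem occurring twice gets weight log2(3) rather than 2.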

8.5 Content stem selection

In a statistical retrieval system based on associations, we usually will not want to compute statistics for all stems contained in a text collection. From a practical standpoint, this would take too much time as well as too much space in most applications. What we can do is to choose a fairly large sub-set of stems that are in some sense the most interesting of all the stems. Here it is natural to define 'interesting' in a statistical way, using what we call 'content measures'.


Intuitively, a stem is interesting in so far as it helps to discriminate between messages in a given collection. If it occurs with the same frequency in all messages, then it will be of little value for retrieval of messages. A good stem for retrieval should have a highly non-uniform distribution of occurrence within a collection, characterised by a high value of statistical variance for frequency of occurrence in individual messages. The variance or something similar is typically the basis for defining a content measure.

We cannot use variance by itself, because it depends too much on the number of messages that contain a stem. In fact, for the kinds of distributions typical for actual stems, ranking by variance is about the same as ranking by the number of messages containing the stem. The variance divided by the square of the mean leads to the opposite problem, giving a ranking about the same as the reverse ranking for number of messages containing a stem. Division by the mean seems to be the best compromise for a variance-based content measure.

Variance divided by mean is essentially the content measure employed by METER. This is a version of the measure originally proposed by Dennis (see Dennis, 1967). The major departure has been to use a log count instead of the raw count as the frequency weight of a stem in a message (see Section 8.4) and to multiply the variance by the total number of texts in a collection, to avoid having to store fractions. We have investigated about 20 other content measures for METER, but the useful ones tend to correlate highly (at least 0.60) with variance over mean, while requiring more computation. Other, more extensive, comparisons have been designed, and are at present being performed.

A problem with the use of variance is the difficulty of comparing the statistics for a frequently and an infrequently occurring stem. Unfortunately, given Zipf's Law (see Miller and Newman, 1958), it is typical in natural-language text that only a minority of stems are not in these two categories. Computing statistics by use of logarithmic frequencies helps somewhat but does not completely resolve the problem. The approach of the METER system here has been to consider a stem only if the number of messages containing it falls between arbitrarily chosen lower and upper bounds. In the case of infrequent stems, we probably lack a large enough sample to compute usable statistics; and in the case of the most frequent stems, we do not find them especially helpful for retrievals.

On the whole, a single statistic is probably inadequate to measure all the senses of 'content' useful in an associative retrieval system. We need in METER at least four different kinds of content stems: (1) stems corresponding to referential words, as opposed to purely syntactic function words; (2) stems that maximally separate messages in some hyperspatial sense; (3) stems that characterise the information content of messages; and (4) stems for which we can compute statistical associations. A variance-over-mean measure works well in selecting stems for message separation but has dubious value for the other senses of 'content'. Its problems with choosing stems for association will be described later. In the case of filtering out function words, statistical approaches, in general, perform poorly and in heterogeneous collections can be misleading. For example, in a collection of messages consisting of journalistic text mixed with telegraphic text, the word 'the' was found to be useful in separating the two
classes of messages by style, though not by content. The statistical measure correctly located words which discriminated between two classes of documents, but the two classes were not relevant to the needs of the user. It is easier and more effective simply to have lists of those words we want to filter out.

In the case of choosing stems for information content, we should note that this sort of criterion applies to individual messages rather than to an entire collection. For example, the stems 'soviet' and 'agriculture' may well be the best stems to index the content of a given message, even though those stems may be contained in all messages of a given collection and thus give poor discrimination. Because discrimination seems to be the better thing to have in a retrieval system, our strategy in METER has been to focus on that criterion in selecting a content measure and to consider the matter of information content only in the later context of designing retrieval algorithms.
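In code, the measure and the document-frequency bounds of this section look as follows. This is a sketch of the description above, not the METER implementation, and the bounds shown are arbitrary placeholders, since the text gives no specific values.

    def content_measure(weights, n_texts):
        """Variance-over-mean content measure for one stem, after Dennis (1967)
        as modified in METER. 'weights' lists the stem's frequency weight,
        log2(1 + count), in each message containing it; the weight is zero in
        the other n_texts - len(weights) messages. The variance is multiplied
        by n_texts so that no fractions need be stored."""
        mean = sum(weights) / n_texts
        # n_texts * variance = sum of squared deviations over all messages,
        # including the zero-weight ones.
        scaled_var = (sum((w - mean) ** 2 for w in weights)
                      + (n_texts - len(weights)) * mean ** 2)
        return scaled_var / mean

    def select_content_stems(stem_weights, n_texts, lo=5, hi=2000):
        """Rank stems by content measure, considering a stem only if the
        number of messages containing it lies between the bounds."""
        eligible = {s: w for s, w in stem_weights.items() if lo <= len(w) <= hi}
        return sorted(eligible,
                      key=lambda s: content_measure(eligible[s], n_texts),
                      reverse=True)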

8.6 Query formation and stem weighting

METER accepts a query initially expressed as text, either entered from a user terminal or derived from a previously retrieved message. This text is converted into a query vector through the same procedures as those used for messages, except that content stems are also selected. With queries in vector form, it is possible to think of them as points in the same vector space; and we can define a notion of nearness in that space as a way of retrieving messages relative to a query.

A major problem with this use of a vector space is that the significance of a stem is not directly proportional to its frequency of occurrence. In fact, from an information-theoretic view, it is exactly the opposite; an infrequent stem, when it occurs in a message, carries more information than a frequent one. An argument in favour of the vector space approach is given in Sparck Jones (1973), where two retrieval systems (without associations) are compared, one of which uses the Boolean frequency weighting (that is, the weight for a stem is 1 if it occurs and 0 if it does not), and the other of which uses raw counts. Other results and discussion may be found in Salton (1968, 1975).

The vector space approach, however, tends to be dominated by high-frequency stems, largely because the high-frequency stems in a message vector account for the major portion of its Euclidean length. For example, consider a retrieval for a query with two stems 'soviet' and 'dissident' having equal weight, but with 'soviet' occurring more than 100 times more frequently than 'dissident' in the target document collection. A vector space scheme tends to retrieve messages containing 'soviet' first, because there will probably be more messages containing the more frequent stem and because that stem will probably appear more often in a given message. As a result, the more specific portions of a query will be overpowered by the more general portions.

To avoid the problem, we need to adjust the weights of stems in a query vector so that specific stems are favoured over general stems. An easy way of doing this is to multiply the initial weight of query stems by the content measure for the stem, which describes the usefulness of the stem in discriminating between messages. The variance-over-mean measure seems to be effective, as is the inverse document frequency (a measure based on the
logarithm of the number of messages containing a stem). This use of the inverse document frequency is described in Sparck Jones (1973).

Two new difficulties are introduced by such weighting. First, it becomes harder to interpret what estimated relevance values really mean, especially if they are computed for different queries. Second, it is unclear what to do when a retrieved message is used as a query.

The first difficulty is not especially serious, since estimated relevance is hard to interpret in any circumstance. We have effectively eliminated it as a problem by making the numbers invisible to a user through the employment of unscaled histograms that encourage the interpretation of estimated relevance only in a relative sense. In the case of a retrieved message as a query, we decided to include special weighting, on the grounds that messages looking like the retrieved message have probably been retrieved already. The assumption here is that the user just wants the particular stems of the retrieved message for the next query. This strategy ultimately allows the user to scan a larger number of messages in a collection; note that consequently a message used as a query will not retrieve itself with an estimated relevance of 1.0. The function of retrieving messages that look like a given message has been assigned more appropriately to the relevance feedback component of METER.
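One plausible rendering of this weighting step is sketched below, using the inverse document frequency as the discrimination factor; the variance-over-mean measure of Section 8.5 could be substituted for it. The function is our illustration, not the METER code.

    import math
    from collections import Counter

    def weight_query(query_stems, doc_freq, n_texts):
        """Build a query vector in which specific stems are favoured over
        general ones by scaling each stem's frequency weight by an inverse
        document frequency (Sparck Jones, 1973)."""
        vector = {}
        for stem, n in Counter(query_stems).items():
            freq_weight = math.log2(1 + n)              # as in Section 8.4
            idf = math.log2(n_texts / doc_freq[stem])   # rarer stems score higher
            vector[stem] = freq_weight * idf
        return vector

    # With 'soviet' in 5000 of 20000 messages and 'dissident' in 40, equal raw
    # counts give 'dissident' about 4.5 times the weight of 'soviet'.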

8.7 Term associations

The original motivation for computing statistical associations between pairs of stems was to provide for an automatically generated thesaurus containing synonyms for query expansions. According to structural linguistic theory, the meaning of a word is established by its range of possible contexts; and nearly synonymous words should occur in similar contexts. If first-order stem associations are defined according to co-occurrence in messages, then it is reasonable to interpret them as a kind of context, and second-order associations as a kind of synonymy relation. Expansion of a query by second-order associations should therefore reduce the chance of not retrieving a relevant message because the wrong words appear in the query.

As it turns out, however, results from experiments with various document collections invalidate this assumption. Rubenstein and Goodenough (1965), for example, compared second-order associations with human judgements of degree of synonymy, and only at the extremes was there a monotonic relationship between the two quantities. For the middle ranges of judged degree of synonymy, there was no correlation at all with the second-order associations. In our experiments, moreover, query expansion with second-order associations has never improved recall in any controlled test run so far, which suggests that it is extremely hard to come up with an actual example of this occurrence. Furthermore, there is some indication that the actual associations now employed in METER may contain too much noise to support this expectation.

Associations in the current METER system were computed as a simple inner product of characteristic vectors for different stems, where a vector describes the frequency weight of a given stem in each document of a collection. This yields a K × K sparse symmetric matrix of associations A with zero along the main diagonal, where K is the total number of associated
content stems. The matrix of first-order associations is (A + aI), where a is an empirically selected value for self-association; the matrix of second-order associations is

    (A + aI)² = A² + 2aA + a²I
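In matrix terms the whole computation is compact. The sketch below follows our reading of this section and uses dense NumPy arrays for brevity, where METER itself stored a sparse matrix on a minicomputer.

    import numpy as np

    def association_matrices(W, a):
        """Given W, a K x D array whose rows are the characteristic vectors of
        the K content stems (the frequency weight of each stem in each of the
        D documents), form the first- and second-order association matrices."""
        A = (W @ W.T).astype(float)     # inner products of characteristic vectors
        np.fill_diagonal(A, 0.0)        # zero self-association on the diagonal
        first = A + a * np.eye(len(A))  # A + aI
        second = first @ first          # (A + aI)^2 = A^2 + 2aA + a^2 I
        return first, second

    # Expanding a query vector q of length K by second-order associations is
    # then the matrix-vector product second @ q.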

Expansion of a query corresponds to multiplication of the query vector by one of these association matrices. Ordinarily, we would have a ~
