SQLET: Short Query Linguistic Expansion Techniques

Presented at RIAO’97, Montreal, June 23-25, 1997

SQLET: Short Query Linguistic Expansion Techniques, Palliating One-Word Queries by Providing Intermediate Structure to Text

Gregory Grefenstette

Rank Xerox Research Centre, Grenoble Laboratory, 6 chemin de Maupertuis, F-38240 Meylan, France. E-mail: [email protected]. Tel. +33 4 76 61 50 50

Abstract. Most people using the WWW try to find information using one- or two-word queries. The information retrieval systems derived from research models were designed for longer queries and do not provide an adequate response to these users' needs. On the other hand, recent advances in natural language processing permit the extraction of typed information centered on one or two words. We review a selection of this typed information and describe how it could be used to present an intermediate structure to the user, a structure fitting between their short queries and the documents found on the Web. The user would first access this structure and, having found in it a use corresponding to their information need, could then access the corresponding Web pages more quickly and precisely, without random searching. An example is presented for a sample short query.

Keywords: text mining; light parsing; information extraction; collocations; short query.

1 Introduction

There is a well-known gap between the characteristics of the typical user studied by Information Retrieval (IR) researchers and the typical information seeker on the Web. Most real searchers use only a few words [LK95], but the bulk of Information Retrieval research has been based on longer queries. There are many reasons why the theoretical model has been maintained despite its known divergence from actual use:

• Some information seekers (government intelligence gatherers, for example) do express their information requests as a long description of the subjects that interest them.

• The basic accounting of system improvement in information retrieval, ever since Cleverdon [Cle91] in the 1960s, has been based on matching lists of system-retrieved documents against previously established lists of documents judged to be relevant. For the relevance judgments to represent an objective standard against which to measure, the queries must be relatively explicit about what is being searched for. This means that the queries must be verbose. A short one- or two-word query could not be used in such experiments without the relevance judgments either becoming subjective (the judge reinterpreting what the searcher was looking for) or being reduced to a simple judgment of whether a pattern was found in a text (something any file-system search tool provides).

As a result of the above, the vector space model of information retrieval¹ has been favored as an underlying IR implementation in research, but this model works best when many axes (many query words) are being used to circumscribe the space of relevant documents. When one or two words are given as a query on a Web browser, the user is essentially casting a fishing line into an enormous ocean of pages to see what bites. Pages are reeled in, and the user must examine each page to determine whether the use of the word in that page carries the meaning they intended. Pages are returned based on the word's inclusion in the title field, or on the relative frequency of the given word in the pages indexed². Some search engines allow the user to specify a positional arrangement so that the query words appear "near" each other, or in a certain order. The burden of predicting how the words are used is nonetheless left on the user, the browser considering the whole space of the Web as a flat index structure. As a response to this flat structure, a few search engines, such as Yahoo!³, provide a hierarchical structure into which pages are inserted; but this insertion is done manually, which limits the number of pages that can be indexed.

In this article we describe a necessary new addition to Web searching, adapted to the problem of short queries, particularly one-word queries. We argue not only that the page index should be made available to the searcher, but that many tools exist for creating a linguistically derived structure that the WWW user should be able to browse starting from a one-word query. This structure can be seen as an intermediate structure between the user and WWW documents. Rather than delving straight into the documents, as is currently the case, the user should be able to see what multiword linguistic structures a word appears in; these more precise structures can then be clicked to reach the Web-based documents.

The paper is organized as follows. After presenting existing approaches to the problem in Section 2, we present SQLET (Short Query Linguistic Expansion Techniques) for realizing this structure from raw text. Focusing on a given word from a short query, we present methods for presenting different types of common noun phrases in Section 3.1, and of noun phrases involving proper names in Section 3.2. Section 3.3 discusses syntactic structures involving verb forms of the given word. Section 3.4 gives a sample of higher-order affinities that can be extracted from text concerning a given word. Section 4 concludes the paper.

¹ This model [SM83] considers each word in the language as an axis in a high-dimensional space. Each document and each query is plotted in this space. To find the documents closest to a query, one calculates the cosine of the angle between the query point and all the document points, taking the closest points as the relevant documents.

² This method of ranking pages has led to the "word spamming" problem, in which a malicious HTML programmer fills a WWW page with a large number of occurrences of a hidden keyword (such as a competitor's company name) so that the programmer's page is shown first to the unsuspecting user.

³ http://www.yahoo.com
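The cosine ranking described in footnote 1 can be sketched in a few lines of Python. This is a minimal illustration of the vector space model over sparse term-frequency vectors; the query and document strings below are invented toy data:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine of the angle between two sparse term-frequency vectors,
    in the spirit of the vector space model of [SM83]."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy data: a two-word query against three tiny "documents".
query = Counter("research funding".split())
docs = [Counter("research funding council grant".split()),
        Counter("market research firm survey".split()),
        Counter("football league results".split())]

# Rank documents by decreasing cosine similarity to the query.
ranked = sorted(range(len(docs)), key=lambda i: cosine(query, docs[i]),
                reverse=True)
```

With a one- or two-word query almost every coordinate of the query vector is zero, which is why this ranking degenerates into the "fishing line" behavior described above.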

2 Related Commercial Query Tools and Related Research

A possible method for improving the precision of a query, by including frequently occurring terms from an initially retrieved document set, can be implemented in the DIALOG information retrieval system using the RANK command⁴, first introduced in 1994. Once a user performs a search, RANK will sort by frequency all the indexed words appearing in the result set and display them to the user, who can then use them to make their query more precise. This feedback step can happen after a search is performed but before the user examines the first set of documents returned, in which case we can consider it a first approximation of the method proposed here. Recently, François Bourdoncle of the École des Mines de Paris has developed a somewhat similar version of this intermediate approach between query and documents, called LiveTopics, for the AltaVista⁵ browser. LiveTopics shows, in either a graph or a list, the words which appear in the neighborhood of the original query words in the retrieved documents⁶. One can click on these words, which are then added to the query to improve the precision of the original query. The techniques presented in this paper could be presented in a similar fashion, but the relations between words will here be based on a pre-stored linguistic analysis over the whole corpus, and not on co-occurrence within a document set retrieved in response to a query.

In research systems, two other approaches responding to the inadequacies of current IR systems with regard to the short-query problem can be found in Kwok [Kwo96] and Hearst et al. [HKP95]. Kwok remarks that short queries lack two elements that make vector space retrieval over long queries successful: the variety of terms used to match the query to documents, and the indication of the importance of the words in the query. This latter point, which most affects precision according to Kwok, is automatically estimated in traditional information retrieval systems from the frequency with which a word appears in the query. Of course, in a short query each word will usually appear only once, and frequency cannot give any clue to a term's importance. To show the interest of giving more weight to the important words in a short query, Kwok first performed a series of experiments in which a human user manually picked the important words in a query, and these words were repeated twice in the query submitted to the IR system. This simple technique improved average precision by 10%, a large gain in information retrieval circles. Kwok then devised an automatic method of weighting query words, replacing their initial weight of 1 by the average number of times a word appears in a document, when it appears, divided by a function of the absolute frequency of the word in the whole document database. This weighting gives rare words that appear fairly often in the documents containing them more weight than common words, or than rare words that appear only sporadically. He showed [Kwo96] that this automatic weighting performed as well as having a user choose the most important words in a query. Kwok's methods try to improve the precision of the initial set of documents retrieved by a short query. But a short query can often pick up many different aspects of word meaning in

⁴ http://www.dialog.com/dialog/publications/rank-qrc.html

⁵ http://altavista.digital.com

⁶ This is hardly new. The same idea was described by Doyle [Doy61] in the early 1960s as a realization of her Semantic Road Maps for interactive document-collection browsing.

the document subcollection defined by all the documents containing the words in the query. Hearst et al. describe a system in which the documents returned from an initial query are clustered along these strands of meaning. This clustering is automatically performed on the documents containing the short-query terms by a technique developed in a collection browsing system called Scatter/Gather [CPKT92]. This linear-time technique works by splitting the returned documents into a fixed set of clusters; each cluster is then represented by its most representative terms, which are displayed to the user along with the first few titles in each cluster. The user can then easily see which of the groupings corresponds most closely to the information they were looking for, and use the whole cluster as a new request via a traditional relevance feedback technique [Har92]. For example, Hearst et al. [HKP95] show that, given the one-word query star over the text of a general encyclopedia, the 400 documents returned are automatically clustered by the Scatter/Gather technique into groups characterized by words such as Cluster 2: flag, rug, weave, carpet, pattern, stripe, force; Cluster 3: game, player, team, league, ball, football, professional; Cluster 4: energy, hydrogen, radiation, planet, temperature, gas; etc. These clusters show different aspects of the meaning of the polysemous word star: as a pattern, as a person, as a physical phenomenon, and so on. This technique, which retrieves and then clusters documents, gives a general view of the words used in the same documents as short-query words, and provides an alternative, dynamic approach to using the more fine-grained pre-stored structures proposed below.
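The cluster-and-label step of such a system can be sketched as follows. This is a deliberately simplified one-pass grouping, not the actual Scatter/Gather algorithms of [CPKT92]: the first k documents serve as seeds, each document joins its most similar seed, and each cluster is labeled by its most frequent terms. All document strings below are invented examples echoing the star clusters above:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def scatter(docs, k=2):
    """One assign-and-label pass: the first k documents act as seeds,
    each document joins its most similar seed, and every cluster is
    summarized by its three most frequent terms."""
    vecs = [Counter(d.split()) for d in docs]
    clusters = [[] for _ in range(k)]
    for i, v in enumerate(vecs):
        best = max(range(k), key=lambda j: cosine(v, vecs[j]))
        clusters[best].append(i)
    labels = []
    for members in clusters:
        merged = Counter()
        for i in members:
            merged.update(vecs[i])
        labels.append([t for t, _ in merged.most_common(3)])
    return clusters, labels
```

The displayed labels let the user pick a whole cluster as a refined request, as in the relevance-feedback loop described above.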

3 SQLET

We believe that when Web users give one word as their information need, what they would really like to know is how that word is being used on the Web; once that use is determined, they would like to see the pages that use the word in that way. Providing a semantically tagged version of the Web, in which all words would map into semantic classes that would themselves be linked to Web pages, still seems utopian given the little progress made in the field of real-world semantics⁷ [LGP+91]. But today, simpler language processing technology is mature enough to provide a lower-level, linguistic structuring of the information found on Web pages. Given the structures in which a word is found, its meaning can often be deduced. For example, dog can mean many things, but one easily sees its nuances of meaning when given such phrases as dog biscuit or rabid dog or walk a dog (not to mention the inevitable example: hot dog). These and other types of linguistically motivated structures can be automatically extracted from text using present-day natural language technology. SQLET proposes providing this structure in response to a short WWW query, giving an intermediate view of Web material much as a back-of-the-book index gives an intermediate view of a book's content. As an example of the type of information that can be recognized, we extracted from the British National Corpus⁸ all the sentences that contain the string research. There are 16,632 such sentences. This 3-Mbyte corpus corresponds, in a sense, to gathering a large number of

⁷ See http://www.cyc.com/tech.html#kb for a contradictory claim.

⁸ The British National Corpus (BNC) is a 100-million-word corpus of British written and spoken English. For more information, see http://info.ox.ac.uk/bnc/index.html. We parsed BNC files a to g using our light, finite-state parser [Gre96] in order to extract all the structures illustrated in this paper.

WWW pages containing any form of that string. We will use this same corpus as an example for all the following sections.

3.1 Noun Phrases

Not only for English but for many other languages, there exists robust natural language technology [GS97] for performing basic processing of raw text: (1) tokenization: dividing the input text into units; (2) morphological analysis: associating grammatical information (e.g., being a verb form or an adjectival form) with those units; (3) part-of-speech disambiguation: deciding what grammatical role a word is playing in context (e.g., is pin a verb or a noun in a given sentence); (4) lemmatization: producing the normalized form of the word (e.g., thought as a verb goes to think). Once part-of-speech tags have been assigned to words, one can extract simple noun phrases by defining the contours of a noun phrase in terms of its part-of-speech patterns [Sch96]. Much of the terminology found in any corpus is composed of such simple noun phrases, and when a word is embedded in these phrases, one has a good indication of the word's meaning. Running the BNC research corpus through a noun phrase extractor produces a list of noun phrases such as market research, research project, research paper, empirical research, research money, research minister, research embryo, etc. 3494 different simple noun phrases can be found, 406 of which appear more than twice. In the WWW setting, each of these phrases can be linked back to the pages from which it was extracted. But how should this linguistically structured information be presented to the user? In some cases research is the head of the noun phrase, and in other cases it modifies some other noun. We cannot expect the naive user of the WWW to know or care about these linguistic distinctions, but we can rephrase these relations in a way that may be more understandable to the lay user: we can label those phrases in which research appears at the end of the phrase as types of research, and those in which it appears at the beginning as things involved in research.
Though the actual relations that a word can play in a two-word phrase are multifarious [Sal85, War78], these labels are vague enough to cover many of them. Given these labels, we can present the noun phrase structures to the user in order of decreasing frequency, as in Figure 1. Due to tagging errors, some instances of research listed under research things are actually being used as verbs. Still, the association of research with another word often makes its exact meaning clearer.

types of research:

market research, recent research, own research, scientific research, social research, medical research, new research, basic research, empirical research, clinical research, extensive research, little research, area research, international research, historical research, late research, current research, health research, industrial research, major research, nuclear research, academic research, field research, good research, fusion research, cancer research, ...

research things:

research project, research programme, research centre, research finding, research team, research study, research area, research group, research institute, research result, research worker, research method, research laboratory, research work, research year, research student, research effort, research interest, research evidence, research report, research department, research field, research council, research station, research assistant,....

Figure 1. Explaining the uses of research in noun phrases without using linguistic terminology.
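The pattern-based extraction and the Figure 1 labeling can be sketched as follows. The tag set and the tagged sentence are hypothetical stand-ins for a real tagger's output, and the patterns are much cruder than those of [Sch96]:

```python
# Hypothetical (word, tag) pairs, standing in for a POS tagger's output.
tagged = [("recent", "ADJ"), ("research", "NOUN"), ("shows", "VERB"),
          ("that", "CONJ"), ("the", "DET"), ("research", "NOUN"),
          ("project", "NOUN"), ("needs", "VERB"), ("empirical", "ADJ"),
          ("research", "NOUN")]

def noun_phrases(tagged):
    """Extract simple noun phrases as maximal ADJ/NOUN runs containing
    at least one noun and more than one word."""
    phrases, current = [], []
    for word, tag in tagged + [("", "END")]:   # sentinel flushes last run
        if tag in ("ADJ", "NOUN"):
            current.append((word, tag))
            continue
        if len(current) > 1 and any(t == "NOUN" for _, t in current):
            phrases.append(" ".join(w for w, _ in current))
        current = []
    return phrases

def classify(phrases, word="research"):
    """Label phrases the way Figure 1 does: 'types of research' when the
    word is the head (last position), 'research things' when it is the
    modifier (first position)."""
    types = [p for p in phrases if p.endswith(" " + word)]
    things = [p for p in phrases if p.startswith(word + " ")]
    return types, things
```

In the WWW setting, each extracted phrase would also keep a pointer back to the pages it came from.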

3.2 Proper Names

For languages such as English, which use capitalization as a proper-name indicator, it is easy to use the patterns of upper- and lower-case letters characteristic of names to recognize a large number of proper names. An awk program for collecting names from tokenized text is easy to write [Gre94a, p. 150]. Alternative name recognition methods use lists of proper-name markers [Bor67]. Recognized names may be semantically typed [Don93] by decoding their subcomponents. As a simple approximation, once noun phrases have been recognized, one can separate those noun phrases containing non-sentence-initial upper case from lower-case noun phrases, as shown in Figure 2. This supplementary division of noun phrases provides a further automatic structuring of a single word's uses for the WWW user. Using this capitalization heuristic, we can describe noun phrases including our example word research as samples of Named research and Names involving Research that exist in the text.

Named research:

American research, Belfast research, AIDS research, British research, German research, Antarctic research, UK research, General practice research, French research, European research, Early research, Copernican research programme, al-Anbar space research base, Xerox research, Watson research laboratory, UK market research firm, Tenovus research team, Star Wars research, Soviet military research power, Riso research centre, Recent US research, Newsons research, NHS development research, Massachusetts research outfit, London-based market research firm, Lakatosian research programme, French research institute, Forty Years research, EC research, Dutch research, Cambridge research lab, Blue Skies research, Australian research

Names involving Research: Medical Research Council, Cray Research Inc, Engineering Research Council, AST Research Inc, Social Research Council, Cancer Research Campaign, Imperial Cancer Research Fund, Community Planning Research, Research Councils, Social Research, Natural Environment Research Council, Science Policy Research Unit, Cancer Research, Research Council, Medical Research, Food Research Council, Building Research Establishment, Research Centre, Digital Research, Defence Research Agency, Water Research Centre, Research Department, Cray Research, Conservative Research Department, Labour Research, British Library Research, ..., Xerox Palo Alto Research Center

Figure 2. Noun phrases in which research appears with a "proper name."
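The capitalization heuristic behind Figure 2 is only a few lines of code. This sketch assumes sentence-initial capitals have already been neutralized upstream (e.g., by lower-casing sentence-initial words during tokenization), so any remaining upper case is treated as a proper-name signal:

```python
def split_by_capitalization(phrases):
    """Separate noun phrases containing a capitalized word from purely
    lower-case ones.  Assumes sentence-initial capitalization has been
    removed upstream, so remaining capitals indicate proper names."""
    named, plain = [], []
    for phrase in phrases:
        if any(word[0].isupper() for word in phrase.split()):
            named.append(phrase)
        else:
            plain.append(phrase)
    return named, plain
```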

3.3 Syntactic Relations

A step beyond recognition of noun phrases is recognizing non-contiguous yet syntactically tied words, a process that can be called shallow or light parsing. Light parsing can be performed in a variety of ways, from simple recognition of fixed patterns [Cho88,Sma93] to extracting parse subtrees [Deb82,Abn91,Hin93]. We perform light parsing [Gre94a,Gre96] by

(1) recognizing any sequence of nominal units and marking them as noun groups, (2) recognizing any sequence of verbal units and marking them as verbal groups, (3) identifying and labeling group heads, and (4) extracting typed relations between group heads using finite-state filters. These steps allow us to extract labeled relations between nouns and verbs. Verb phrases involving research, used either as a verb or as a noun, can be extracted in this way from our research corpus. We then find the typical subjects and direct objects of the verb research, and the typical verbs associated with the noun research. For example, the most frequently found direct object of research was book, 14 times out of 483 uses of research as a verb, as in the BNC sentence: "At times, as I have researched this book, it has occurred to me that the second-hand book world is the only place left in England where knowledge of anything but the latest semi-literate fads still exists." The most common subject of research, other than a personal pronoun, is student, as shown in the BNC extract: "While abroad, students research a project in each of their languages." As with simple noun phrases, we are once again confronted with the problem of how to present this linguistically typed information in a comprehensible way. The naive user of the WWW will not want to know about direct objects and subjects; in fact, most naive users of the WWW will hardly remember what a noun or a verb is. The information may be conveyed more simply by using circumlocutions such as "Things one can research" instead of the more academic "direct objects of research". We can talk around subjects as examples of "Who can research", and explain adverbial modification as "How can one research". These presentations of research as a verb can be displayed as in Figure 3.

Things one can research: research book, research somebody, research development, research project, research finding, research it, research history, research method, research laboratory, research need, research material, research report, research there, research matter, research area, research one, research study, research issue, research subject, research problem, research worker, research community, research programme, research way, research design, research market, research topic, research question, ...

Who can research:

somebody research, student research, part research, time research, team research, body research, university research, project research, sociologist research, company research, issue research, historian research, computer research, worth research, scientist research, group research, career research, implication research, man research, ...

How can one research: research carefully, research thoroughly, research extensively, research deeply, research properly, research meticulously, research currently, research widely, research surprisingly, research recently, research poorly, research exhaustively, research usually, research sufficiently, research seriously, research relatively, research really, research probably, research previously, research particularly, research highly, research fully, research eventually, ...

Figure 3. Verb phrases involving the verb research.

Similarly, one can find different ways of presenting verbs used with the noun research. One can list things that research can do, or things that are done to research, as in Figure 4, which covers the uses of the word as a direct object and as a subject.

Things that one does to research: do research, conduct research, carry research, undertake research, need research, fund research, market research, support research, base research, publish research, commission research, require research, develop research, use research, direct research, continue research, concern research, sponsor research, encourage research, complete research, begin research, detail research, make research, focus research, involve research, promote research, ...

What can research do: research show, research suggest, research indicate, research reveal, research have, research go, research provide, research fund, research lead, research take, research find, research focus, research confirm, research demonstrate, research support, research come, research concentrate, research continue, research do, research involve, research become, research make, research identify, research use, research begin, research tell, research carry, research address, research produce, ...

Figure 4. Verb phrases in which research appears as a noun.
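The four light-parsing steps can be sketched as a chunker plus a crude filter over the resulting group sequence. The tag set is hypothetical, the head of a group is simplified to its last noun or verb, and the example sentence is the simplified BNC extract "students research a project"; a real system would use finite-state filters over richer structures [Gre96]:

```python
def chunk(tagged):
    """Steps (1)-(3): group maximal runs of nominal or verbal tags and
    keep each group's head (its last noun or verb)."""
    kinds = {"NOUN": "N", "ADJ": "N", "DET": "N", "VERB": "V", "ADV": "V"}
    heads = {"N": ("NOUN",), "V": ("VERB",)}
    groups, i = [], 0
    while i < len(tagged):
        kind = kinds.get(tagged[i][1])
        if kind is None:          # punctuation etc.: skip
            i += 1
            continue
        j = i
        while j < len(tagged) and kinds.get(tagged[j][1]) == kind:
            j += 1
        head = next((w for w, t in reversed(tagged[i:j]) if t in heads[kind]),
                    tagged[j - 1][0])
        groups.append((kind, head))
        i = j
    return groups

def svo(groups):
    """Step (4): emit typed relations from N V N group sequences."""
    pairs = []
    for (k1, h1), (k2, h2), (k3, h3) in zip(groups, groups[1:], groups[2:]):
        if (k1, k2, k3) == ("N", "V", "N"):
            pairs.append(("SUBJ", h1, h2))
            pairs.append(("OBJ", h2, h3))
    return pairs

# Simplified tagging of the BNC example "students research a project".
sent = [("students", "NOUN"), ("research", "VERB"),
        ("a", "DET"), ("project", "NOUN")]
```

The extracted (SUBJ, student, research) and (OBJ, research, project) pairs are exactly the kind of typed relations displayed in Figures 3 and 4.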

3.4 First and Second Order Affinities

The linguistically derived structures mentioned in the previous sections show the words that are syntactically tied to research. If a WWW user poses the one-word query research, then presenting these lists might provide a useful overview for distinguishing the meaning the searcher was initially interested in. As an anecdote, in early 1997 the Web browser AltaVista considered research a stop word, since it had found 9 million occurrences of the word on the Web; after a certain number of occurrences, on this browser, a word no longer serves as an independent index term.

The way a word is used in text, as presented in the previous sections, and the other words which co-occur with a given word, as brought out by the RANK function of DIALOG, LiveTopics of AltaVista, and Scatter/Gather, are all instances of first-order affinity [Gre94b]: words often found in the same vicinity as the given word. Adding words from this vicinity can improve the precision of the documents retrieved. Second-order affinities are words which, with respect to some reference corpus, need not occur in the same context as a given word, but whose syntactic contexts are similar to those of the given word. In our example, this means finding words to which the same things are done as are done to research, or which do the same things that research does. In order to find such words, one can compare all the bits of information that one can glean from a corpus for a given word like research. These comparisons can lead to the discovery of terms in the same semantic area as the given word, providing near-synonyms or antonyms. Using a smaller corpus than the BNC, we parsed and extracted all the uses of the word research in 33 Mbytes of text from the online

archives of the LINGUIST list⁹. Using similarity comparison techniques [Gre94a, Chap. 3], we found that the words used most like research were development, application, theory, course, discussion, data, aspect, knowledge, use, and approach. These are additional words that may be used to refine the initial query.
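Second-order affinity can be sketched by comparing the sets of syntactic contexts in which two words occur. [Gre94a] uses a weighted Jaccard measure; this sketch uses plain Jaccard, and the (relation, governor) contexts below are invented for illustration, not extracted from a corpus:

```python
def jaccard(a, b):
    """Plain Jaccard similarity between two sets of contexts."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical (relation, governor) contexts, as a light parser might
# produce them; the words and contexts are invented for illustration.
contexts = {
    "research":    {("OBJ", "conduct"), ("OBJ", "fund"), ("SUBJ", "show")},
    "development": {("OBJ", "fund"), ("SUBJ", "show"), ("OBJ", "support")},
    "football":    {("OBJ", "play"), ("SUBJ", "entertain")},
}

def most_similar(word, contexts):
    """The word whose syntactic contexts best overlap the given word's."""
    others = [w for w in contexts if w != word]
    return max(others, key=lambda w: jaccard(contexts[word], contexts[w]))
```

Here development shares the contexts "is funded" and "shows", so it surfaces as a second-order neighbor of research even in sentences where the two words never co-occur.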

4 Conclusion

We have presented a number of natural language processing tools that could be used to provide an intermediate structure between short queries and the WWW. The tools provide, for each word, a list of the ways in which that word is used with other words across the indexed documents. The presence of the other words allows users to quickly find the meanings they had in mind when the query was produced, or to add words that improve the precision of the initial query. The burden of the work is placed upon the computer indexing, but existing natural language processing tools are already capable of providing the necessary text analysis. And the question of the additional storage costs generated by collecting these structured uses is becoming less meaningful daily.

References

[Abn91] Steven Abney. Parsing by chunks. In Steven Abney, Robert Berwick, and Carol Tenny, editors, Principle-Based Parsing. Kluwer Academic Publishers, Dordrecht, 1991.

[Bor67] C. Borkowski. An experimental system for the automatic identification of personal names and personal titles in newspaper texts. American Documentation, 18:131, July 1967.

[Cho88] Yaacov Choueka. Looking for a needle in a haystack, or locating interesting collocational expressions in large textual databases. In RIAO'88 Conference Proceedings, pages 609-623, MIT, Cambridge, Mass., March 1988.

[Cle91] Cyril W. Cleverdon. The significance of the Cranfield tests on index languages. In A. Bookstein, Y. Chiaramella, G. Salton, and V. V. Raghavan, editors, Proceedings of the 14th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, pages 3-12, New York, October 13-16, 1991.

[CPKT92] Douglass Cutting, Jan O. Pedersen, David Karger, and John W. Tukey. Scatter/Gather: A cluster-based approach to browsing large document collections. In Proceedings of SIGIR'92, pages 318-329, Copenhagen, Denmark, June 21-24, 1992. ACM.

[Deb82] Fathi Debili. Analyse Syntaxico-Sémantique Fondée sur une Acquisition Automatique de Relations Lexicales-Sémantiques. PhD thesis, University of Paris XI, France, 1982.

[Don93] David D. Donaldson. Internal and external evidence in the identification and semantic categorization of proper names. In B. Boguraev and J. Pustejovsky, editors, Proceedings of the SIGLEX Workshop on Acquisition of Lexical Knowledge from Text, pages 32-43, Columbus, OH, 1993.

⁹ http://www.emich.edu/~linguist

[Doy61] Lauren B. Doyle. Semantic road maps for literature searchers. Journal of the ACM, 8(4):553-578, October 1961.

[Gre94a] Gregory Grefenstette. Explorations in Automatic Thesaurus Discovery. Kluwer Academic Publishers, Boston, 1994.

[Gre94b] Gregory Grefenstette. Corpus-derived first, second and third-order word affinities. In Sixth Euralex International Congress, Amsterdam, August 30-September 3, 1994.

[Gre96] G. Grefenstette. Light parsing as finite state filtering. In Workshop on Extended Finite State Models of Language, Budapest, Hungary, August 11-12, 1996. ECAI'96.

[GS97] G. Grefenstette and F. Segond. Multilingual natural language processing. International Journal of Corpus Linguistics, 2(1), 1997.

[Har92] Donna Harman. Relevance feedback revisited. In Proceedings of SIGIR'92, Copenhagen, Denmark, June 21-24, 1992. ACM.

[Hin93] Donald Hindle. A parser for text corpora. In B.T.S. Atkins and A. Zampolli, editors, Computational Approaches to the Lexicon. Clarendon Press, 1993.

[HKP95] Marti A. Hearst, David Karger, and Jan O. Pedersen. Scatter/Gather as a tool for the navigation of retrieval results. In Robin Burke, editor, Working Notes of the AAAI Fall Symposium on AI Applications in Knowledge Navigation and Retrieval, Cambridge, MA, November 1995. AAAI.

[Kwo96] K. L. Kwok. A new method for weighting query terms for ad-hoc retrieval. In Proceedings of the 19th ACM/SIGIR Conference, pages 187-196, 1996.

[LGP+91] D. B. Lenat, R. V. Guha, D. Pratt, K. Pittman, W. Pratt, and K. Goolsbey. The world according to CYC, part 4. Technical Report ACT-CYC-002-91, Microelectronics and Computer Technology Corporation, Austin, TX, January 1991.

[LK95] X. A. Lu and R. B. Keefer. Query expansion/reduction and its impact on information retrieval effectiveness. In Donna Harman, editor, The Third Text REtrieval Conference (TREC-3), pages 231-239, Washington, 1995. U.S. Government Printing Office. NIST Special Publication 500-225.

[Sal85] Gerard Salton. A note on information retrieval models. In RIAO'85, pages 2-27, Grenoble, France, March 18-20, 1985. CID, Paris, and IMAG.

[Sch96] Anne Schiller. Multilingual part-of-speech tagging and noun phrase mark-up. In 15th European Conference on Grammar and Lexicon of Romance Languages, University of Munich, September 19-21, 1996.

[SM83] Gerard Salton and M. McGill. An Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.

[Sma93] Frank Smadja. Retrieving collocations from text: Xtract. Computational Linguistics, 19(1):143-178, March 1993.

[War78] Beatrice Warren. Semantic Patterns of Noun-Noun Compounds. Acta Universitatis Gothoburgensis, Göteborg, Sweden, 1978. Gothenburg Studies in English, 41.