Scalable Query Assistance for Search Engines
John Darlington, Yike Guo and Stefan Rüger
[email protected]
Department of Computing, Imperial College London SW7 2BZ, England
Abstract

We present a novel algorithm that computes words related to a query by analysing the subset of hit documents with respect to the whole document collection.
1 Introduction

Search engines such as Fujitsu's AP-Tera, DEC's AltaVista, Hotbot, Yahoo, Excite, Webcrawler and Lycos take (mostly Boolean) queries as input and display a list of hit documents together with their ranking. A common problem with this approach is that weeding out all the irrelevant hits can try one's patience, and this is nothing that the search engine itself can solve. For example, imagine the original query "computer". The search engine cannot possibly know whether the user prefers documents about software, hardware, video games, multi-media, the internet, etc. Clearly, in this case, the user frontend should be able to suggest appropriate terms depending on the hit documents. This is what "query assistance" is all about. It should assist in expressing the information need, fully exploit the search engine's technology, assess the relevance of hit documents and satisfy the human user's time constraints. The query assistance algorithm should be able to analyse a set of hit documents (as a subset of a document collection) and identify terms that are potentially interesting for the user, related to and typical for the hit documents, and relevant for discriminating between the hit documents.
Fulfilling the second objective seems to be crucial, because words that are typical for the hit documents reveal the structure of the hit documents better than words that merely discriminate between subsets. For example, imagine again the query "computer". It might be that the collection of hit documents can be divided into three exclusive groups that are discriminated by containing the words "year", "house" and "summer", respectively. As these words are not specifically used in the hit document collection, this would not help a user with domain knowledge about computers, and a user without domain knowledge would be led to think that computers have to do with housing and seasons. Clearly, we would want to be given words such as "hardware" and "software" that are typical for the hit documents.

Figure 1 shows a typical query scenario. A document collection contains millions to billions of documents. It is not uncommon that a query hits thousands of documents, where further query refinements seem to be necessary in order to obtain a manageable result in the tens of documents. As the set of hit documents is a more or less arbitrary subset of the whole document collection, its composition can vary in unforeseeable ways. To obtain high-quality query assistance, it seems to be essential to compute discriminative and related terms (both with respect to the set of hit documents) at query time, in real time. Thus, the scope for preprocessing the whole document collection is limited. Nevertheless, preprocessing plays a vital role, because it seems to be prohibitive to access the hit documents at query time. For example, with today's limited bandwidth of the internet, it would be virtually impossible for a WWW search engine to access all hit documents at query time. We therefore suggest preparing a document summary at index time (see Section 2) that contains a selection of potentially interesting words. Based on this, we discuss a novel, scalable and easily parallelisable algorithm for dealing with the query assistance challenge in Section 3. The algorithm is being implemented into the server side of Fujitsu's AP-Tera on an AP 3000 computer. A prototype version on a workstation with tens of thousands of newspaper articles is readily available and can be demonstrated during the conference.

[Figure 1: A typical query scenario with need for query assistance. A collection of 1 million to 1 billion documents (about 1 million different words) is narrowed by a Boolean query to 1,000-10,000 hit documents (about 50,000 different words); query assistance then helps to reach 10-100 relevant documents.]

2 Preprocessing of the Whole Document Set
During the process of indexing documents, usually a structure is created for every document that describes it. This structure would contain the name of the document, the owner id, the modification date, and also a field for a short document abstract. We suggest using such a field to store a preprocessed summary (or surrogate) that is the basis of efficient query-time computations. This field plays a vital role in the desire to avoid accessing the document itself at query time. We designed, executed and evaluated several experiments in order to assess the feasibility and relevance of the following possible preprocessing steps:

1. removal of SGML/XML commands
2. folding upper case to lower case
3. identifying abbreviations and detecting sentences
4. part-of-speech analysis
5. filtering the nouns
6. removal of stop words
7. stemming (Porter's stemming algorithm)
8. term and document frequency computation
9. computation of potentially interesting words to be included in the document surrogate

We implemented prototype versions of these steps in Perl, except for the part-of-speech tagging, where we ported a public-domain C programme (Brill 1995; Brill 1993).
Definition: The term frequency $t_{ij}$ is defined as the number of occurrences of a term $j$ in document $i$. The document frequency $d_j$ is the number of documents that actually contain term $j$. Both are considered to be important quantities, as they are easily computed and maintained when adding or deleting documents from a document collection.
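Since both quantities drive everything that follows, a minimal sketch of their computation over tokenised documents may be helpful. The prototype steps were written in Perl; the sketch below uses Python purely for illustration, and the function and variable names are our own.

```python
from collections import Counter, defaultdict

def frequencies(documents):
    """documents: mapping document id i -> list of (already tokenised) terms.
    Returns term frequencies t[i][j] and document frequencies d[j]."""
    t = {}                    # t_ij: number of occurrences of term j in document i
    d = defaultdict(int)      # d_j: number of documents that contain term j
    for doc_id, terms in documents.items():
        t[doc_id] = Counter(terms)
        for term in t[doc_id]:
            d[term] += 1      # adding or deleting a document only touches its own terms
    return t, d

t, d = frequencies({"doc1": ["computer", "software", "computer"],
                    "doc2": ["computer", "hardware"]})
# t["doc1"]["computer"] == 2, d["computer"] == 2
```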
It was concluded that part-of-speech tagging (noun identification) proves to be useful. The "aboutness" of documents is usually conveyed in nouns rather than in verbs, adjectives and adverbs, and certainly not in prepositions, pronouns and the like. In English, many words can be used as different parts of speech: "father" as a noun and as a verb have quite different meanings, and "swimming" can be used as an adjective, noun or participle. A careful grammatical analysis is hence necessary to filter all the nouns; of course, the nouns include all the proper names and product names. In our experiments, the noun-only suggestion of additional query words proved to be much more sensible than the suggestion of a far bigger list of words that includes verbs and adjectives. The smaller nouns-only list helps the user to categorise the documents more quickly; this approach also saves space.
Stemming seems to be questionable. Unless the Boolean queries and the index terms of the inverted documents are stemmed in the same way, the disadvantages of stemming for query assistance seem to outweigh the benefits. In our experiments we found that stemming increases recall but tends to decrease precision. Furthermore, the suggestion of stemmed words should be avoided: if, for example, the stemmed form "econom" were suggested, it would not be clear whether to include "economy" or "economist" in the query.
"Interesting" words are sufficiently well identified by a heuristic. The third of the noun vocabulary with the highest document frequency was kept (but nouns with a document frequency exceeding 66% of the collection were dropped). Although this already reduces the summary, only the 128 most important words per document were picked to represent a document, ranked according to $t_{ij} + 5\tanh(100\, d_j/d)$, where $d$ denotes the number of documents. This heuristic picks nouns with a medium document frequency (they are discriminative) that are often used in the document; it also limits the storage requirements. For the time being, we leave it to further empirical evidence to substantiate or modify this heuristic.
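Stated as code, the heuristic could look as follows. This is a sketch under the assumptions spelled out above (top third of the noun vocabulary, 66% cut-off, 128 words per document, score $t_{ij} + 5\tanh(100\,d_j/d)$); the names are illustrative.

```python
import math

def interesting_words(doc_tf, noun_df, d, max_words=128):
    """doc_tf: term frequencies t_ij of the nouns in one document i;
    noun_df: overall document frequencies d_j; d: total number of documents."""
    # keep the third of the noun vocabulary with the highest document frequency,
    # but drop nouns whose document frequency exceeds 66% of the collection
    ranked = sorted(noun_df, key=noun_df.get, reverse=True)
    kept = {w for w in ranked[:len(ranked) // 3] if noun_df[w] / d <= 0.66}

    def score(w):                          # favours high tf and medium-to-high df
        return doc_tf[w] + 5 * math.tanh(100 * noun_df[w] / d)

    candidates = [w for w in doc_tf if w in kept]
    return sorted(candidates, key=score, reverse=True)[:max_words]
```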
\Interesting" words are suciently well identi ed by a heuristic. The
third of the noun vocabulary with the highest document frequency was kept (but nouns with a document frequency exceeding 66% were dropped). Although this already reduces the summary version, only the 128 most important (according to + 5 tanh(100 ), where means the number of documents) words per document were picked to represent a document. This heuristic picks nouns with a medium document frequency (they are discriminative) that are often used in the document. This approach also limits the storage requirements. For the time being, we leave it to further empirical evidence to substantiate or modify this heuristic. The algorithm of Brill for part of speech analysis can be readily applied for English texts. The algorithm is rule-based and the rules had been learnt from sample texts. As this programme package also contains the learning algorithms, the same method can, in principle, be applied to other IndoEuropean languages with some initial eort. tij
dj =d
[Figure 2: Preprocessing data flow. Each documents file (documents file 1, ..., documents file n) is processed in parallel through the stages (a) pretag, (b) tagger/pos-tagger, (c) remtag and (d) tfdf; a final summarising stage (e) combines the partial results into the table with word ids and document frequencies and the document surrogates.]

The basic assumption is that a document collection consists of one or more files. Each file contains one or more SGML/XML-tagged document bodies that are each preceded by a special line, such as the document name.
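Because the per-file stages (a) to (d) are independent of each other, they can be distributed naively over processors; only the summarising stage (e) is sequential. A minimal sketch of this structure, with a deliberately simplified per-file stage and names of our own:

```python
from collections import Counter
from multiprocessing import Pool

def preprocess_file(path):
    """Stages (a)-(d) for one documents file, drastically simplified here:
    lower-casing and token counting stand in for pretag, tagging, remtag and tfdf."""
    with open(path, encoding="utf-8") as fh:
        tokens = fh.read().lower().split()
    return Counter(tokens)

def summarise(partial_counts):
    """Stage (e): merge the per-file results into one word table with
    document frequencies (here simply the number of files containing the word)."""
    table = Counter()
    for counts in partial_counts:
        table.update(counts.keys())
    return table

def preprocess_collection(paths, workers=4):
    with Pool(workers) as pool:             # the trivially parallel part
        partial = pool.map(preprocess_file, paths)
    return summarise(partial)               # the small sequential part
```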
Table 1: Time per megabyte for preprocessing

  (a) preparing for (b)               41 s
  (b) part-of-speech tagging         173 s
  (c) removal of non-nouns            11 s
  (d) frequency computation            8 s
  (e) interesting word computation    11 s

The output of the whole preprocessing is one summary file that contains a table of the used words together with an id number and the document frequency for each. This file also contains a document surrogate for each document, i.e., a collection of up to 128 words together with the term frequency of each word. Currently, the term frequency is not used. Our query assistance algorithm is based on this file. We applied the preprocessing modules to 45 megabytes of Financial Times articles from 1992 (15,569 documents with 21,359 different potentially interesting nouns; source: TIPSTER/TREC CD-ROMs) in order to obtain an example set of surrogate data.
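The abstract field of a document can then hold the surrogate in a very compact form. The exact byte layout is an assumption in the following sketch: one three-byte word code followed by a one-byte term frequency per entry.

```python
def pack_surrogate(entries):
    """entries: list of (word_code, term_frequency) pairs, at most 128 of them."""
    assert len(entries) <= 128
    blob = bytearray()
    for code, tf in entries:
        blob += code.to_bytes(3, "big")       # 3-byte word code: up to ~16.7 million words
        blob += min(tf, 255).to_bytes(1, "big")
    return bytes(blob)

def unpack_surrogate(blob):
    return [(int.from_bytes(blob[k:k + 3], "big"), blob[k + 3])
            for k in range(0, len(blob), 4)]
```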
2.2 Multi-Language Support

It is not necessary that all document bodies are written in the same language. However, if several document collections in different languages are given, it makes sense to process them separately and to merge them afterwards with the programme mergesummaries, as shown in Figure 3.

[Figure 3: Multi-language support. The surrogate data of each language collection (language 1, e.g. Japanese, ..., language k, e.g. English), each consisting of a table with word ids and document frequencies plus the document surrogates, are combined by mergesummaries into a multi-language surrogate.]
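The merge itself mainly has to reconcile the word tables of the individual collections: shared words keep one id and their document frequencies add up, and the surrogates of each part are rewritten with the new ids. The following sketch assumes an in-memory representation {word: (word_id, document_frequency)}; the actual file format of mergesummaries is not described here.

```python
def merge_summaries(table_a, table_b):
    """Returns the merged word table plus id-remapping dictionaries that can be
    used to rewrite the document surrogates of each part."""
    merged, remap_a, remap_b = {}, {}, {}
    next_id = 0
    for table, remap in ((table_a, remap_a), (table_b, remap_b)):
        for word, (old_id, df) in table.items():
            if word not in merged:
                merged[word] = [next_id, 0]
                next_id += 1
            merged[word][1] += df            # document frequencies simply add up
            remap[old_id] = merged[word][0]
    return {w: tuple(v) for w, v in merged.items()}, remap_a, remap_b
```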
3 Query Assistance Prototype

We implemented a prototype for the query assistance. The input for this prototype consists of the surrogate data. In the real application, the surrogate data would come from the preprocessed document abstract field of the query result structure. In our prototype, the query has to be emulated in order to address a sensible subset of hit documents within the whole collection. The emulation of the search engine also requires that the prototype build an inverted document list. Simple one-word queries such as "computer" may be used to produce a typical hit document set. The main point of the prototype is that a list of, say, 100 words is computed that are related to the hit documents and that may be used to further narrow down the query. The computation time is proportional to the number of hit documents. For many hit documents, it seems to be sufficient to process a random sample of, say, 1,000 documents or to process the 1,000 best-ranked documents (ranked by the search engine).
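For the emulation of the search engine, a plain inverted list over the surrogates is sufficient. A minimal sketch, assuming surrogates are available as sets of word ids and that the mapping from query words to word ids is handled elsewhere:

```python
from collections import defaultdict

def build_inverted_index(surrogates):
    """surrogates: mapping document id -> iterable of word ids."""
    index = defaultdict(set)
    for doc_id, word_ids in surrogates.items():
        for wid in word_ids:
            index[wid].add(doc_id)
    return index

# a one-word query such as "computer" then reduces to one lookup:
# hits = build_inverted_index(surrogates)[computer_word_id]   # word id assumed known
```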
3.1 Relevance Assignment
Let $H$ be the relevant subset of hit documents. The document surrogates of $H$ use a certain vocabulary $V_H$, which is a subset of $V$. Let $h_j$ be the document frequency of a word $j \in V_H$ in the set $H$, i.e., $h_j$ tells how many hit documents contain the word $j$. The set $V_H$ and the numbers $h_j$ for each word $j \in V_H$ have to be computed at query time. Fortunately, the time needed for this computation is limited, as at most 128,000 word ids from at most 1,000 surrogates have to be processed. The 100 words with the highest weight

$$w_j := \frac{h_j}{d_j}\, h_j \log\frac{|H|}{h_j}$$

are chosen. The right factor encourages a medium hit document frequency $h_j$: we do not want to suggest words that appear in all hit documents, and we do not want to emphasise words that appear in only a few hit documents. The left factor $h_j/d_j$ is responsible for picking terms that are typical for the hit document collection $H$. The maximal value of $h_j/d_j$ is attained for $h_j = d_j$, meaning that all occurrences of word $j$ in the whole document collection lie within the subset $H$. The minimal possible value of $h_j/d_j$ is $1/(d - |H| + 1)$, reached by words that appear in only one hit document but otherwise appear in all other documents. In order to strengthen the "intelligent" behaviour of the left factor, one could emphasise it more by squaring it, as in

$$w_j := \left(\frac{h_j}{d_j}\right)^{2} h_j \log\frac{|H|}{h_j}.$$
We also conducted experiments where the term frequency $t_{ij}$ of the surrogate word $j$ in the hit document $i$ was involved in the weight definition. No visible change in the set of proposed words was achieved. This brought us to the conclusion that it might be worthwhile to save the storage space for the term frequencies in the surrogate documents. We assign a relative relevance number

$$r_j := 100\, \frac{w_j}{\max_l w_l}$$

between 0 and 100 to each word $j$. We believe that this number expresses the combined virtues of the relevance of a word to the hit document collection $H$ and the discriminative power of this word for $H$.
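Putting Section 3.1 together, the query-time computation amounts to one pass over the (sampled) hit surrogates followed by a weighting step. The sketch below uses the weight and relevance formulas given above; the names are illustrative.

```python
import math
from collections import Counter

def suggest_words(hit_surrogates, collection_df, top=100):
    """hit_surrogates: one set of word ids per (sampled) hit document;
    collection_df: overall document frequencies d_j per word id."""
    hits = [set(s) for s in hit_surrogates]
    n = len(hits)                          # |H|
    h = Counter()                          # h_j: hit document frequencies
    for words in hits:
        h.update(words)
    # w_j = (h_j / d_j) * h_j * log(|H| / h_j)
    w = {j: (hj / collection_df[j]) * hj * math.log(n / hj) for j, hj in h.items()}
    w_max = max(w.values(), default=1.0) or 1.0
    r = {j: 100.0 * wj / w_max for j, wj in w.items()}   # relevance number r_j
    return sorted(r, key=r.get, reverse=True)[:top], r
```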
3.2 Similarity Computation & Term Clustering
The ranked list of the top related words as computed in the previous subsection seems to be a useful ingredient of query assistance. However, the pure list (sorted according to the above relevance measure) does not seem to be the optimal way of presentation. Some of the related words are clearly more akin to each other than others, and semantically similar words should be grouped together before presenting them to the user. This intrinsic similarity of the words is given by their joint usage in documents. We only consider the set of hit documents in order to compute the similarity of the words $j$ and $l$, in the following way:

$$\mathrm{sim}(j, l) := \frac{2\, h_{j,l}}{h_j + h_l},$$

where $h_{j,l}$ is the number of hit documents that contain both $j$ and $l$. There seem to be two main ways to compute the similarity of two words. One way is to use the search engine itself, which usually returns the number of hits for a query $q$: $h_{j,l}$ could be computed by querying for "$j$ and $l$ and $q$", where $q$ stands for the query that produced the hit document set $H$; similarly, $h_j$ and $h_l$ can be computed quite easily. Another way of computing the similarity would be to compute a bit vector $v_j$ for each word $j$, where each bit stands for a hit document; the corresponding bit $v_{ji}$ is set if $j$ occurs in document $i$. Then

$$\mathrm{sim}(j, l) = \frac{2\,\#(v_j \mathbin{\&} v_l)}{\#v_j + \#v_l},$$

where $\#x$ stands for the number of bits set in bit-vector $x$.
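With the hit documents numbered consecutively, the bit vectors can be ordinary Python integers, and the similarity becomes a few bit operations. A minimal sketch (int.bit_count requires Python 3.10 or later):

```python
def bit_vectors(hit_surrogates):
    """One bit vector per word: bit i is set if word j occurs in hit document i."""
    v = {}
    for i, words in enumerate(hit_surrogates):
        for j in words:
            v[j] = v.get(j, 0) | (1 << i)
    return v

def sim(v_j, v_l):
    """sim(j, l) = 2 #(v_j & v_l) / (#v_j + #v_l)."""
    denom = v_j.bit_count() + v_l.bit_count()
    return 2 * (v_j & v_l).bit_count() / denom if denom else 0.0
```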
The triangle similarity matrix ($\mathrm{sim}(j, l)$ for $j < l$) is the input for a suitable cluster algorithm. We have chosen a hierarchical cluster algorithm to produce a clustering. A typical result is shown in Figure 4 as a tree. Cutting the tree at a certain height, e.g. 40, identifies seven subclusters in this case. The word with the biggest relevance number within each cluster may serve as a representative word for that cluster.
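The clustering step itself can be delegated to a standard hierarchical clustering routine once the similarities are turned into distances. The linkage criterion is not fixed above, so average linkage is an assumption in the following sketch:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_words(words, sim, cut=0.6):
    """words: the suggested words; sim(a, b): similarity in [0, 1];
    cut: distance threshold at which the dendrogram is cut into subclusters."""
    n = len(words)
    dist = np.zeros((n, n))
    for a in range(n):
        for b in range(a + 1, n):
            dist[a, b] = dist[b, a] = 1.0 - sim(words[a], words[b])
    tree = linkage(squareform(dist), method="average")     # hierarchical clustering
    labels = fcluster(tree, t=cut, criterion="distance")   # cut the tree at a height
    clusters = {}
    for word, label in zip(words, labels):
        clusters.setdefault(label, []).append(word)
    return clusters
```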
3.3 Scalability

The preprocessing time is more or less linear in the size of the document collection. As discussed before and as indicated by Figure 2 and Table 1, the preprocessing time can be cut down according to the number of available processors by data parallelism. At query time, the computation of the list of suggested words is linear in the number of hit documents that are processed and roughly quadratic in the number of terms that are to be suggested. As both quantities can be limited by, say, the top 1,000 hit documents and a list of 100 words, this time is effectively bounded by a constant. The number of hit documents and the number of words to be suggested may be tuned to the performance of the implementation or of the hardware, or even adapted to the load of the server at runtime. If it seems undesirable to limit
the analysis to a subset of the hit documents, the following parallelisation scheme might be taken into consideration. The most time-consuming part of the relevance computation is the joining of the surrogate vocabulary of the hit document set $H$ and the computation of the hit document frequencies. Imagine that we already have two sets of words $V_a$ and $V_b$ that originate from two disjoint subsets $H_a$ and $H_b$ of $H$, and that we already have the corresponding $H_{a/b}$-document frequencies. The algorithm that joins the two sets $V_a$ and $V_b$ and computes the document frequencies of the occurring words with respect to the document set union $H_a \cup H_b$ is basically the same as the one used in Figure 3. This algorithm can be applied independently to the parallel elements of a binary tree whose leaves are the hit documents. This parallelisation scheme resembles that of the reduce operation of the MPI library.
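The scheme boils down to a pairwise, tree-shaped reduction of (vocabulary, hit document frequency) tables, in which every level of the tree can run in parallel. A sequential sketch of the reduction order follows; the actual distribution over processors, e.g. via MPI, is omitted.

```python
from collections import Counter

def join(freq_a, freq_b):
    """Join two vocabularies with their hit document frequencies, computed on
    disjoint subsets H_a and H_b; frequencies w.r.t. the union simply add up."""
    joined = Counter(freq_a)
    joined.update(freq_b)
    return joined

def tree_reduce(parts):
    """Reduce pairwise, level by level, like a binary reduction tree."""
    while len(parts) > 1:
        parts = [join(parts[k], parts[k + 1]) if k + 1 < len(parts) else parts[k]
                 for k in range(0, len(parts), 2)]
    return parts[0] if parts else Counter()
```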
4 Conclusion

The prototype implementation demonstrated the feasibility of the suggested approach very well. We expect that this algorithm can be turned into a practical graphical user interface with useful functionality that stands out from common search engines' interfaces.
References

Brill, E. (1993). ftp://ftp.cs.jhu.edu/pub/brill/Programs/RULE_BASED_TAGGER_V.1.14.tar.Z.

Brill, E. (1995). Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics 21(4), 543-565.
Acknowledgement. This work was supported
by the Fujitsu European Centre for Information Technology.
[Figure 4: Hierarchical clustering of the suggested words, shown as a dendrogram over a similarity scale from 0 to 100; for a computer-related query, the clusters group suggested words such as "software", "internet", "web", "windows", "macintosh" and "multimedia".]