Query Translation for Distributed Information Gathering on the Web

Boris Chidlovskii and Uwe M. Borghoff

Xerox Research Centre Europe, Grenoble Laboratory, 6, Chemin de Maupertuis, F-38240 Meylan, France. E-mail: @xrce.xerox.com

Abstract

The heterogeneity of Web information services poses new problems in the processing of user queries over distributed federated data. It was recently proven that query translation between two information services supporting all Boolean operators can be done in an optimal way [4, 3]. On the Web, however, it is common that at least one Boolean operator is not supported. We study the case where a Boolean user query cannot be translated directly and must be split into sub-queries. We propose two strategies for query subsumption and discuss in detail the strategy which minimizes the number of submitted sub-queries. We derive an appropriate query form for the minimal strategy and demonstrate how both translation strategies are implemented in the Knowledge Brokers system [2, 6].

1 Introduction

Distributed information retrieval and gathering on the Web relies upon request brokering and cooperation with multiple search services. Due to the heterogeneity of the Web, the services usually support query languages with various facilities for formulating a query. As a result, the user of a broker system is assumed to be provided with a front-end query language which usually unifies the various search facilities and hides the heterogeneity of the Web search services [11, 4]. The front-end does not manage queries and data internally; instead it relies upon external services to retrieve the information the user is asking for. To do this, the front-end must be able to translate user queries (written in the unified front-end language) to the native languages supported by the Web information services.

If a user query contains operations not supported by a native language, the first step of the query translation approximates the original query with a query expressible in the native language. To guarantee a correct response to the user query, the approximated query subsumes the original one. As a result, the Web service response is a superset of the data asked for by the original query. To eliminate the irrelevant data, the front-end performs post-filtering as the second step of the query processing. To avoid crude query subsumption and wasted network communication, the query translation should be optimal.

1.1 Query translation problem

Every Web information service contains a search engine as a tool to provide fast retrieval of data. The different commercial and public/shareware search engines (Verity, Digital, Fulcrum, Excite, OpenTex, Glimpse, Swish, etc.) all support simple one-keyword queries. The differences between the engines become apparent when preparing a complex query. Most engines support the Boolean operators AND, OR, NOT (= AND NOT) in full or at least in reduced form. Binary proximity word operators like W(n) and Near(n) require two words to appear in a document in a given order and within a given distance n (usually measured in words). Unary word operators usually include phrases and stemming. Finally, the attribute search predicates CONTAINS and EQUALS are used to restrict the search to some document attributes, such as Title, Author, Abstract etc.

An important case in which query translation is required between two search engines was studied in [4, 5]. That work gives a theoretical foundation and provides an efficient algorithm for optimal translation under the basic assumption that both the front-end and the native languages support the Boolean operations AND, OR and NOT for the query formulation. The languages may, however, differ in their support of the word proximity operators W(n), Near(n) and the attribute search predicates CONTAINS, EQUALS etc.

For each operator or predicate that is supported differently in the two languages, a rewriting step is required to subsume the operator by operators available in the native language. As a result, the procedure of query translation is as follows. First, the user query is transformed into disjunctive normal form (DNF), which was proven to preserve the minimality property [4]. The predicate rewriting step then replaces the unsupported predicates with their subsumptions. Finally, the subsuming query is sent to the Web service. Once the response is received from the service, post-filtering techniques are used to discard the irrelevant data.

Example 1. Consider the query

$Q_1$ = Title CONTAINS fault W(2) tolerance,

which should retrieve all documents with the two words fault and tolerance in the title and with at most two words in between. If a native query language supports the Boolean operator AND but does not support the operator W(n), we should subsume W(n) with AND operations. Therefore, the subsuming query will be $Q_s$ = Title CONTAINS fault AND tolerance. After the data have been retrieved from the information service, the post-filtering discards those documents which do not meet the W(2) condition.

1.2 Query translation on the Web

The algorithms for basic query translation assume that the Boolean operators AND, OR, NOT are supported by the native language. This assumption guarantees one-query subsumption, that is, any front-end query may be subsumed with one native query. Unfortunately, on the Web this assumption does not hold for many information services. Moreover, even if the Boolean operators are supported, there are many obstacles which make the basic query translation difficult or even impossible. To study the problem we need to understand how a typical Web information service works.

Any Web information service labeled by its URL (Uniform Resource Locator) is connected to an underlying search engine through, for example, a cgi-script (see Figure 1). The Web page provides a form for entering queries which are sent to the cgi-script. The form is introduced with the html tag <form> and terminated with the tag </form>. It defines the method get or post which is used to send queries to the cgi-script and contains one or more input areas defined by <input> tags [13]. Simple text windows accept textual information while additional options are introduced by radio buttons, boxes and pop-up lists. Radio buttons require the user to select exactly one of them, while boxes allow the user to select one or more options. A pop-up list works similarly to radio buttons, allowing the user to select one item from a list. Finally, a form often contains a submit button which causes the submission of the input data to the cgi-script.

Figure 1. Typical architecture of a Web search service. [Diagram: clients (user browsers and a broker system) access the Web search page, whose form passes queries through the CGI-script to the search engine on the server.]

To distinguish the query language of the underlying search engine from the set of queries the user can formulate using the input windows and options of the Web page, we call the latter the page query language. The engine language and the page language often coincide. Numerous Web information services, including well-known search services like AltaVista, WebCrawler, Yahoo etc., have query forms that contain only one input window. In such a case, the input strings are simply forwarded to the search engine for processing and therefore the page language is equivalent to that of the search engine. However, the design of many specialized Web search pages is driven by specific application needs and requirements, which results in the page query language being quite different from the engine language.

Obviously, the expressive power of the page language can be as high as that of the engine language. However, so-called "advanced search pages", which offer a convenient query formulation interface, often reduce the overall power of the page language with respect to the engine language. We say that a Web page query language has an operational limitation if it either fails to support all the Boolean operations AND, OR, NOT or supports them in a non-standard way. Thus, cases in which the page language capacity is limited to one word also belong to this group. For example, services for searching FTP sites like Archie [7] allow one

word only (or, more precisely, a regular expression) in a query. A number of special Web services have developed their own search engines, and most of them also accept only one word. Some examples are language-to-language translations, dictionary look-up services (e.g., WWWWebster Dictionary at http://www.m-w.tom/netdict.htm) and genome databases (e.g., FASTA-SWAP Pattern database at http://dot.imgen.bot.tmc.edu:9331).

We pay particular attention to the translation of Boolean queries into a Web page language which supports the ranking model for query formulation and for relevance feedback [1]. The ranking model is used by many Web services and is fundamentally different from the Boolean query model. An example of a Web service based on the ranking query model is the Yahoo search engine (http://www.yahoo.com). The goal of the Yahoo service is a wider search and better ranking of the information retrieved rather than a reduction of the amount of data. The order of keywords in the query is used to rank the data. A keyword can appear in the query with a leading sign "+" or "-". The sign "+" is placed in front of a keyword that must appear in the response. The sign "-" is placed in front of a keyword to exclude documents containing the word. The standard Boolean query "A AND B" therefore becomes "+A +B" in Yahoo, and the query "A OR B" becomes "A B". However, there is no direct translation for the query "A AND B OR C" (see footnote 1).

The limitation posed by the ranking model on the Boolean query translation is operational. In fact, any Boolean query containing only the operator OR or only the operator AND can be easily translated into a ranking query. Moreover, if an AND-query is extended with NOT terms, the translation is still straightforward. For example, the Boolean query "A AND B NOT C" is translated as "+A +B -C". In other words, the Yahoo search page supports either the {AND, NOT} or the {OR} Boolean operator set.
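The translation rules for the ranking model can be illustrated with a minimal sketch. The helper `to_ranking_query` and its parameter names are hypothetical (they are not part of the Yahoo interface or of the Knowledge Brokers system); the sketch only encodes the two supported operator sets, {AND, NOT} and {OR}, and refuses mixed queries.

```python
def to_ranking_query(required=(), excluded=(), optional=()):
    """Build a ranking-model query string using "+" / "-" keyword signs.

    required: keywords joined by AND   -> "+kw"
    excluded: keywords negated by NOT  -> "-kw"
    optional: keywords joined by OR    -> "kw" (no sign)

    Mixing optional keywords with required or excluded ones has no direct
    Boolean counterpart (the "A AND B OR C" case), so it is refused.
    """
    if optional and (required or excluded):
        raise ValueError("no direct translation for mixed AND/NOT and OR terms")
    if excluded and not required:
        # NOT is binary (AND NOT): it needs a positive term to its left.
        raise ValueError("a NOT term needs at least one positive AND term")
    parts = ["+" + kw for kw in required]
    parts += ["-" + kw for kw in excluded]
    parts += list(optional)
    return " ".join(parts)


print(to_ranking_query(required=["A", "B"]))                  # +A +B
print(to_ranking_query(optional=["A", "B"]))                  # A B
print(to_ranking_query(required=["A", "B"], excluded=["C"]))  # +A +B -C
```

A query such as "A AND B OR C" raises an error here on purpose: it has to be subsumed first, as discussed later in the paper.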
To conclude our examples of operational limitations, some Web services support only the AND (or only the OR) operator despite having a powerful search engine in use, in order to force the user to be more (or less) selective during the search (see, for example, the ACM Digital Library at http://www.acm.com).

Our contribution. We consider query translation for Web information services with different operational limitations and propose two possible strategies for translation, namely the direct and the minimal strategy. We study the advantages and disadvantages of both strategies. For the minimal strategy, which aims at minimizing the number of query submissions, we prove some important results and derive appropriate procedures. We also discuss how query translation is implemented in the Knowledge Brokers system [2, 6].

The remainder of the paper is organized as follows. In Section 2 we consider the query translation architecture, the front-end query model and the predicate rewriting rules adopted in the Knowledge Brokers. Then, in Section 3 we discuss the subsumption strategies for the case of operational limitations and prove some important results. In Section 4 we show the subsumption procedure for the minimal model. Finally, Section 5 contains some concluding remarks.

2 Query translation architecture

The architecture of query translation in the Knowledge Brokers is close to that adopted in [4]. Figure 2 shows the steps of query processing from submission to obtaining the results. First, the user query is parsed and an internal tree-like representation is created. The query translator processes the query tree on the basis of the operators supported in the native language to generate a subsuming query tree. Concurrently, the filtering function for the post-filtering is generated. If a one-query subsumption is impossible, several sub-queries are created. The subsuming query tree(s) are then put into the syntax of the native language and sent off over the net. After the raw html file(s) have been received and document items have been extracted from them, the post-filtering uses the filtering function to discard the irrelevant items.

2.1 User query language

The front-end query model in the Knowledge Brokers includes a number of basic predicates which may be used to formulate a query. Below we group the basic predicates according to their semantics:

1. Boolean operators AND, OR, NOT (= AND NOT) are used to compose a complex query. The predicate AND indicates that both terms should be verified in a retrieved document, and the predicate OR indicates that at least one of them should be verified. The operator NOT is binary in the majority of information retrieval systems; it means that any retrieved document should satisfy the first term and not the second.

1 Note that the direct translation of AND as "+", NOT as "-" and OR as no sign is not correct. For the query "A AND B OR C", the translation "+A +B C" would not return a document containing only C.

Figure 2. Query translation architecture. [Diagram labels: user query, Parser, Query transformator, Description of native language (operations, syntax), Syntax generator, native query(s), Web search page, result page(s), Description of result page, Merger and extractor, filter, Post-filtering, reply, request to reformulate the query; the components are divided between the broker server and the remote server.]

Query         ::= Predicate ( AND | OR | NOT ) Query
                | Predicate
Predicate     ::= Attribute CONTAINS C-Expression
                | Attribute EQUALS E-Expression
                | Date EQUALS DateTerm
E-Expression  ::= E-Expression OR Phrase
                | Phrase
C-Expression  ::= C-Expression ( AND | OR | NOT ) C-Expression
                | ProximityExp
                | WordTerm
                | Phrase
ProximityExp  ::= WordTerm W(n) WordTerm
                | WordTerm Near(n) WordTerm
Phrase        ::= " (Word)+ "
WordTerm      ::= Word
                | Word*          // stemming

Figure 3. Abstract syntax of the front-end query language.


2. Binary word predicates between two word terms A and B:
   • the query A W(n) B requires the term A to appear in the text within n words before B;
   • the query A Near(n) B requires the terms A and B to appear in the text within n words of each other, in any order.

3. Unary word predicates:
   • a phrase is a quoted string (like "information retrieval") which is supposed to appear in the document;
   • A* (stemming) matches any word with the stem A. For example, system* matches system, systems, systematic etc.

4. A set of document attributes can be searched by using the attribute predicates CONTAINS and EQUALS. When EQUALS is used to search an attribute, only one or several phrases connected by the Boolean operator OR are allowed. In contrast, in the CONTAINS predicate any binary and unary predicates listed in the previous items may be used. In addition, a special attribute Date can be searched with the predicates BEFORE and AFTER to retrieve documents dated before or after a given date.
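The word predicates listed above can be made concrete with a small matching sketch over tokenized text, of the kind a post-filter might use. Two points are assumptions rather than statements from the paper: "within n words" is read as "at most n words in between" (which agrees with Example 1 and with phrases being chains of W(0)), and stemming is approximated by a plain prefix match. All function names are illustrative.

```python
def positions(tokens, term):
    """Indexes at which `term` occurs; a trailing "*" is treated as a crude
    prefix match standing in for stemming (an assumption, not a real stemmer)."""
    if term.endswith("*"):
        stem = term[:-1]
        return [i for i, tok in enumerate(tokens) if tok.startswith(stem)]
    return [i for i, tok in enumerate(tokens) if tok == term]


def match_w(tokens, a, b, n):
    """A W(n) B: A occurs before B with at most n words in between."""
    pos_b = positions(tokens, b)
    return any(0 < j - i <= n + 1 for i in positions(tokens, a) for j in pos_b)


def match_near(tokens, a, b, n):
    """A Near(n) B: like W(n), but in either order."""
    return match_w(tokens, a, b, n) or match_w(tokens, b, a, n)


def match_phrase(tokens, words):
    """A phrase is a chain of W(0) steps: the words must be adjacent, in order."""
    k = len(words)
    return any(tokens[i:i + k] == list(words) for i in range(len(tokens) - k + 1))


title = "a fault and crash tolerance primer".split()
print(match_w(title, "fault", "tolerance", 2))     # True: two words in between
print(match_w(title, "tolerance", "fault", 2))     # False: wrong order
print(match_near(title, "tolerance", "fault", 2))  # True: order is ignored
print(match_phrase("modern information retrieval".split(),
                   ["information", "retrieval"]))  # True
```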

The abstract syntax of the front-end query language in the Knowledge Brokers is given in Figure 3. The internal representation of a user query is a binary tree where an internal node contains a basic predicate and a tree leaf contains a word term, phrase or attribute name.

Example 2. The query

$Q_2$ = (Title CONTAINS distributed W(2) system*) AND (Author EQUALS Lamport)

is represented by the binary tree shown in Figure 4. The attribute predicates CONTAINS and EQUALS are considered as binary non-symmetrical operators with the left child containing the attribute name and the right child containing the query term.

Figure 4. Query tree example. [Tree: the root AND has two children, CONTAINS and EQUALS; CONTAINS has the children Title and W(2), and W(2) has the children distributed and system*; EQUALS has the children Author and Lamport.]

2.2 Basic query subsumption

To translate a user query efficiently, we should construct a native query which returns a minimal amount of irrelevant "extra" data. We say that such a native query $Q_s$ minimally subsumes the user query $Q$. We use the disjunctive normal form (DNF) of a query, which preserves the minimality of translation when no operational limitation occurs [4]. The DNF of the user query $Q$ is a disjunction of the form $Q_{dnf} = C_1 \lor C_2 \lor \dots \lor C_k$, where each term $C_i$ is a conjunction of predicates of the form $C_i = \tilde{P}_1 \tilde{P}_2 \cdots \tilde{P}_n$. Each predicate $\tilde{P}_j$ is either a basic predicate $P_j$ or its negation $\neg P_j$.

Example 3. The DNF of the query

$Q_3$ = Title CONTAINS Web NOT Usenet OR Internet

has two conjunctions, $Q^3_{dnf} = \tilde{P}_{11} \tilde{P}_{12} \lor \tilde{P}_{21}$, with two basic predicates $P_{11}$ = Title CONTAINS Web and $P_{12}$ = Title CONTAINS Usenet in the first conjunction and one basic predicate $P_{21}$ = Title CONTAINS Internet in the second one. Note that the predicate $P_{12}$ is negated.

To generate the minimal subsumption $Q_s$ for a query $Q$, all unsupported basic predicates in $Q_{dnf}$ should be rewritten by using the supported predicates. Each unsupported predicate $P$ should have a positive and a negative subsumption, $P_s^+$ and $P_s^-$, such that $\langle P_s^- \rangle \subseteq \langle P \rangle \subseteq \langle P_s^+ \rangle$, where $\langle Q \rangle$ is the set of documents returned by a query $Q$. Then, each entry of $P$ in the user query is replaced with the positive subsumption $P_s^+$, since $\langle P \rangle \subseteq \langle P_s^+ \rangle$, while each entry of its negation $\neg P$ should be replaced with $\neg P_s^-$, since $\langle \neg P \rangle \subseteq \langle \neg P_s^- \rangle$. Besides the positive and negative subsumptions, the predicate $P$ may have a neutral subsumption $P_s^=$ supported by the native language, such that $\langle P \rangle = \langle P_s^= \rangle$.

The basic translation procedure described above produces the optimal subsumption when the page language has no operational limitation. If such a limitation takes place, the procedure cannot translate the query properly and we combine it with other tools.

Filtering function. The filtering function required for post-filtering is generated during the basic query translation. The function looks like a query and is initially equivalent to the user query tree. The nodes for supported predicates or predicates having a neutral subsumption are removed, and the filtering tree is obtained by contracting the tree after the removal of the nodes. The nodes left in the filtering tree contain all the basic predicates subsumed during the translation.
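The interplay between predicate rewriting and the filtering function can be sketched as follows. This is a simplified illustration under stated assumptions, not the Knowledge Brokers implementation: the query is already in DNF, basic predicates are opaque strings, neutral subsumptions are not modelled, and the names `Literal` and `subsume_dnf` are hypothetical.

```python
from typing import NamedTuple


class Literal(NamedTuple):
    predicate: str          # a basic predicate, kept as an opaque string here
    negated: bool = False


def subsume_dnf(dnf, supported, positive, negative):
    """Rewrite a DNF for a native language and collect what must be post-filtered.

    dnf       : list of conjunctions, each a list of Literal
    supported : set of predicates the native language accepts as they are
    positive  : dict P -> P_s+ (a native predicate returning a superset of P)
    negative  : dict P -> P_s- (a native predicate returning a subset of P)

    A literal without a usable subsumption is dropped from its conjunction,
    which amounts to replacing it with the constant true. An empty
    conjunction in the result therefore means the translation degenerated.
    """
    native, to_filter = [], set()
    for conjunction in dnf:
        rewritten = []
        for lit in conjunction:
            if lit.predicate in supported:
                rewritten.append(lit)
                continue
            to_filter.add(lit.predicate)          # must be re-checked later
            table = negative if lit.negated else positive
            substitute = table.get(lit.predicate)
            if substitute is not None:
                rewritten.append(Literal(substitute, lit.negated))
        native.append(rewritten)
    return native, to_filter


# Example 1: W(2) is unsupported and is subsumed with AND.
q1 = "Title CONTAINS fault W(2) tolerance"
native, to_filter = subsume_dnf(
    dnf=[[Literal(q1)]],
    supported=set(),
    positive={q1: "Title CONTAINS fault AND tolerance"},
    negative={},
)
print(native)     # one conjunction holding the AND-based subsuming predicate
print(to_filter)  # the W(2) predicate, to be verified on the returned documents
```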

2.3 Word predicates

All rewriting rules for the basic predicates are collected in Table 5. As one predicate may have several rewriting rules (for different native languages), the rules are grouped by the predicates given in the first column of the table. The second column shows different combinations of supported predicates. The third and fourth columns show the subsumption type and the subsumption itself.

Word predicates compose the largest group of rewriting rules. Consider, for example, the rules for the operator W(n) when it is not supported by a native language. The operator can obviously be subsumed with Near(n) or, if the latter is not supported, with the AND operator. Table 5 also contains a special case of subsumption for W(n), when a native language supports the operator W only for some values of n (as in AltaVista, where the proximity operator Near is supported with the default value n = 10). Table 5 proposes similar rewriting rules for the proximity operator Near(n) and for any word phrase, which is equivalent to a sequence of words connected with the operator W(0).

2.4 Attribute predicates

Before any Web service is included in the Knowledge Brokers, we extract a list of queryable attributes of the service, including the full-text search where available. When subsuming a predicate EQUALS or CONTAINS, we first check whether the attribute used in the predicate is queryable. If so, the positive subsumption for the predicate EQUALS uses the predicate CONTAINS instead (see Table 5). In contrast, CONTAINS has no positive subsumption with EQUALS (see footnote 2). If no attribute search is possible but the full-text search is available, both predicates CONTAINS and EQUALS are rewritten as full-text search queries.

Note that the table provides no positive subsumption for any predicate (BEFORE and AFTER) against the attribute Date. Indeed, such a predicate can appear in the subsuming query only if the native query language supports a similar attribute.

Table 5 represents only neutral and positive subsumptions, as negative subsumptions are easily derived from positive ones. If there exists a positive subsumption for an operator $Op_1$ with an operator $Op_2$, then the same rule can be symmetrically used as a negative subsumption of operator $Op_2$ with operator $Op_1$. For example, the rule A Near(n) B is a positive subsumption for A W(n) B. Therefore, the operator W(n) is a negative subsumption for Near(n).

2.5 Query rejection

For any information service, there often exists a number of user queries which cannot be subsumed with native queries. Such a query is rejected and the user is asked to reformulate the query. Query rejection happens when the query translation degenerates, that is, when the rewriting process described above produces just the logical constant true. This happens if none of the query predicates is supported and none of the rules in Table 5 can be applied. For example, the query "Title CONTAINS mediation" will be rejected if neither Title nor full-text attributes are supported.

The basic subsumption procedure assumes that all Boolean operators are supported. Therefore, the operators Near(n), W(n) and phrases always have positive subsumptions with the operator AND. In contrast, an attribute predicate CONTAINS or EQUALS should be subsumed with the constant true if the attribute is not queryable and the full-text search is not supported. If all attribute predicates of the query are subsumed with true, the translation degenerates and the query is rejected.

The last point is the processing of the stemming operators. If stemming is not supported, its neutral subsumption (see Table 5) requires all interpretations for a given stem. As an example, the Knowledge Brokers contain a stemmer which provides an interpretation list for each English stem [14]. If such a tool is not available, each stemming operator is subsumed with true.

In the remainder of the paper we consider only acceptable queries, that is, we assume that at least one attribute in a query is queryable or that the full-text search is supported.
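The attribute rules and the rejection condition can be summarized in a short sketch. It is an illustration only: the function names are hypothetical, predicates are rendered as plain strings, and the fall-back to full-text search for a queryable attribute whose operator is unsupported is an assumption that goes slightly beyond the rule stated in the text.

```python
TRUE = None  # stands for the logical constant true (predicate cannot be expressed)


def subsume_attribute(op, attr, term, queryable, native_ops, full_text):
    """Positive subsumption of `attr op term`, op being "CONTAINS" or "EQUALS".

    queryable  : set of attributes the service lets us search
    native_ops : subset of {"CONTAINS", "EQUALS"} supported by the service
    full_text  : True if plain full-text search is available
    Returns a native predicate string, or TRUE if nothing applies.
    """
    if attr in queryable:
        if op in native_ops:
            return f"{attr} {op} {term}"           # supported as it is
        if op == "EQUALS" and "CONTAINS" in native_ops:
            return f"{attr} CONTAINS {term}"       # EQUALS -> CONTAINS
    if full_text:
        return term                                # full-text search fall-back
    return TRUE


def translate_or_reject(predicates, queryable, native_ops, full_text):
    """Subsume each (op, attr, term); reject when everything degenerates to true."""
    rewritten = [subsume_attribute(op, attr, term, queryable, native_ops, full_text)
                 for op, attr, term in predicates]
    kept = [p for p in rewritten if p is not TRUE]
    if not kept:
        raise ValueError("query rejected: please reformulate")
    return kept


print(subsume_attribute("EQUALS", "Author", "Lamport",
                        {"Author"}, {"CONTAINS"}, False))  # Author CONTAINS Lamport
print(translate_or_reject([("CONTAINS", "Title", "mediation")],
                          set(), set(), True))             # ['mediation']
```

With neither a queryable Title attribute nor full-text search, `translate_or_reject` raises an error for "Title CONTAINS mediation", mirroring the rejection example in the text.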

3 Query subsumption and operational limitations

2 The need to subsume CONTAINS with EQUALS arises only if a Web page language supports EQUALS but not CONTAINS. This case is extremely rare on the Web.

Although all operational limitations are usually coupled to plain full-text search, formally they cannot rule out other query features. For example, a page language which does not support the operators OR and NOT can still support attribute and proximity predicates, etc. For the sake of clarity, we separate the problems related to the operational limitations from those not related, and introduce an intermediate page language. The intermediate page language (I-language) is obtained from the original page language by completing the Boolean operator set whenever an operational limitation takes place. A query in the intermediate language is called an I-query. An I-query is a Boolean expression over basic predicates and possibly contains all three Boolean operators. Any basic predicate may have an easy subsumption in the native language while the expression as a whole may not.

The introduction of the I-language imposes a two-step procedure upon user query subsumption: a user query is first subsumed with an I-query, then this I-query is translated into the page language. As we will see in Section 3.3, the first step is easy and any user query has an optimal subsumption in the I-language. Such a subsumption is directly generated by the basic translation procedure described in Section 2.2. Now we concentrate on the translation of an I-query into the page language. It consists in the subsumption of a Boolean expression containing all Boolean operators by an expression where one or more Boolean operators are not supported.

Basic predicate          Native language    Type      Subsumption
-----------------------  -----------------  --------  ------------------------------------
A Near(n) B              OR, W(n) ∀n        neutral   (A W(n) B) OR (B W(n) A)
                         Near(m), m > n     positive  A Near(m) B
                         AND                positive  A AND B
A W(n) B                 W(m), m > n        positive  A W(m) B
                         Near(m), m ≥ n     positive  A Near(m) B
                         AND                positive  A AND B
"A1 A2 ... Ak" (phrase)  W(0)               neutral   A1 W(0) A2 W(0) ... W(0) Ak
                         W(m), m > 0        positive  A1 W(m) A2 W(m) ... W(m) Ak
                         Near(0)            positive  A1 Near(0) A2 Near(0) ... Near(0) Ak
                         Near(m), m > 0     positive  A1 Near(m) A2 Near(m) ... Near(m) Ak
                         AND                positive  A1 AND A2 AND ... AND Ak
A* (stemming)            OR                 neutral   A1 OR ... OR Ap
attr EQUALS A            CONTAINS           positive  attr CONTAINS A
attr CONTAINS A          full text search   positive  A

Figure 5. Table of basic predicate subsumptions.
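As an illustration of how the W(n) rows of the table of basic predicate subsumptions (Figure 5) might be applied, the sketch below selects a rewriting for a given native language. The function name, its parameters and the preference order (W(m) before Near(m) before AND) are assumptions made for the example; the condition m ≥ n for Near(m) is likewise a reading of the table rather than a quotation.

```python
def subsume_w(a, b, n, w_values=(), near_values=(), has_and=True):
    """Pick a rewriting for `a W(n) b` following the W(n) rows of the table.

    w_values / near_values: distances m for which the native language offers
    W(m) / Near(m); the smallest admissible m is chosen so that as little
    extra data as possible is returned.
    Returns (kind, native_query); kind is "supported" or "positive".
    """
    if n in w_values:
        return "supported", f"{a} W({n}) {b}"
    wider_w = [m for m in w_values if m > n]
    if wider_w:
        return "positive", f"{a} W({min(wider_w)}) {b}"
    wider_near = [m for m in near_values if m >= n]
    if wider_near:
        return "positive", f"{a} Near({min(wider_near)}) {b}"
    if has_and:
        return "positive", f"{a} AND {b}"
    raise ValueError("no subsumption available for W(n)")


# A service offering only Near with the fixed distance 10:
print(subsume_w("fault", "tolerance", 2, near_values=[10]))
# A service offering only AND:
print(subsume_w("fault", "tolerance", 2))
```

Every "positive" result returns a superset of the requested documents, so the original W(n) condition still has to be checked by post-filtering.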

Subsumption for operators AND and NOT. The main difference between classical Boolean algebra and the use of Boolean operators in information retrieval (including the Web) is the NOT operator. Unlike in Boolean algebra, the operator NOT in information retrieval is binary and equivalent to AND NOT. First, this makes it impossible to rewrite the operator OR using the De Morgan rule $A \text{ OR } B = \neg(\neg A \text{ AND } \neg B)$. In other words, if the operator OR is not supported, there is no way to rewrite each operator OR as a combination of the operators AND and NOT. Second, each disjunct of the query DNF must contain at least one positive predicate.

The Boolean operators AND and NOT need only positive subsumptions, since negative subsumptions are never required when the query is given in the DNF. Positive subsumptions for the operators AND and NOT are rather simple: each operator may be subsumed by one of its two surrounding terms. For "A AND B", it may be either A or B since $\langle A \text{ AND } B \rangle \subseteq \langle A \rangle$ and $\langle A \text{ AND } B \rangle \subseteq \langle B \rangle$. For "A NOT B", the first term A subsumes the operator since $\langle A \text{ NOT } B \rangle \subseteq \langle A \rangle$. However, there is no way to subsume in one shot the query "A OR B" if the operator OR is not supported. This leads us to the following result:

Lemma 1 If the operator OR is supported, any front-end query may be subsumed with one native query.

A native language which does not support OR poses more problems than one that does not support AND or NOT. In this case, the subsumption with one native query cannot be guaranteed for most queries. The user query must be split into sub-queries which are sent to the information service independently.

Lemma 1 gives the key to the query translation problem for the ranking model described in Section 1.2. If a user query contains only the operators AND and NOT, it is translated into the native language by putting the sign "+" in front of every positive predicate and the sign "-" in front of each negative predicate. If the query contains only the operator OR, all query keywords are simply reproduced in the native ranking query. If all three Boolean operators are present in the user query, the rewriting rules for the operators AND and NOT are applied until no such operator is left. The resulting query contains only the operator OR and is directly transformed into a ranking query.

Example 4. The Boolean query $Q_4$ = "A AND B OR C NOT D" can be subsumed with the ranking queries "A C" or "B C".

3.1 Subsumption strategy

The subsumption minimality proven for the basic query translation (see Section 2.2) cannot be extended to the case of an unsupported OR since a one-query subsumption generally fails. In such a case, two opposite subsumption strategies are possible. The direct strategy simply splits the original query DNF into a number of sub-queries equal to the number of disjuncts. In contrast, the minimal strategy tries to modify the original query and obtain one subsuming query or, if this fails, to minimize the number of subsuming sub-queries. The choice of the optimal strategy depends on a number of factors which we now discuss in detail.

Example 5. Consider the I-query $Q_5$ = A AND B OR A AND C when the native language does not support the OR operator. The direct strategy splits the query into two sub-queries A AND B and A AND C, sends them to the information service and merges the returned results afterwards. In contrast, the minimal solution submits one subsuming query A. When the responses are returned, all documents containing neither B nor C are filtered out.

Each of the two strategies has its own advantages and disadvantages. The advantage of the direct approach is that a simple merge is sufficient to produce the result list. In contrast, in the minimal solution a superset of the data is returned but the information service is contacted only once. Below we list the main issues which differentiate the two strategies:

1. The amount of data transferred over the net: For the query $Q_5$, the comparison reduces to detecting which of the two following sets is bigger: $\langle A \text{ AND } \neg B \text{ AND } \neg C \rangle$ or $\langle A \text{ AND } B \text{ AND } C \rangle$ (see Figure 6), since the part $\langle A \text{ AND } B \text{ AND } C \rangle$ is taken twice in the direct solution. As $\langle \neg A \rangle$ is usually much bigger than $\langle A \rangle$ for a basic predicate A, $\langle A \text{ AND } \neg B \text{ AND } \neg C \rangle$ is usually bigger than $\langle A \text{ AND } B \text{ AND } C \rangle$.

2. Query processing by the information service: The query processing algorithms vary from one service to another. Two main trends exist:
   • the fewer keywords the query has, the faster a search engine processes the query;
   • because requests are serialized on the service's server, several queries take longer to be processed than a single query.

3. Postprocessing of returned data: In the minimal strategy, the post-filtering is linear. In the direct strategy, the merge of the two returned lists of documents rarely remains linear, as documents in the results often appear in different order (due to internal ranking mechanisms).

4. A limited number of documents returned, or returned documents split over a number of answer pages (as in AltaVista): The direct strategy handles this situation much better. In fact, both sub-queries A AND B and A AND C for the query $Q_5$ return documents requested by the query. In contrast, with the minimal strategy, the result list (or the first page) for the subsuming query A may contain no documents related to the original query, that is, containing either B or C.

Figure 6. Venn diagram for the example query. [Diagram: the sets A, B and C with the regions <A and B and C> and <A and not B and not C> marked.]

To sum up, the direct strategy works better for cases 1 and 4 above, while items 2 and 3 are handled more smoothly when the minimal strategy is used. Therefore, the query subsumption process should be able to take advantage of both strategies. In the Knowledge Brokers system, one of the two strategies is statically attached to each information service and any user query sent to the

service is processed following this strategy. The choice of the strategy is made on the basis of a statistical evaluation during a training period. In the remainder of this section we evaluate both strategies in a situation when the operator OR is not supported.

Direct subsumption strategy. In the case of an unsupported OR operator, the direct strategy simply splits the DNF query into a number of sub-queries equal to the number of disjuncts of the query DNF. In contrast, the minimal strategy requires some query analysis and particular query transformation techniques.

3.2 Minimal subsumption strategy

The first goal of the minimal strategy is to detect a possible one-query subsumption for an I-query. A simple and immediate solution uses the conjunctive normal form (CNF) of an I-query instead of the DNF. The CNF of the query $Q$ is a conjunction of the form $Q_{cnf} = D_1 D_2 \cdots D_k$, where each term $D_i$ is a disjunction of predicates of the form $D_i = \tilde{P}_1 \lor \tilde{P}_2 \lor \cdots \lor \tilde{P}_n$. As in the DNF, each predicate $\tilde{P}_j$ is either $P_j$ (a positive basic predicate) or $\lnot P_j$ (a negative basic predicate). A disjunction $D_i$ is called simple if it contains one predicate only, that is, $D_i = \tilde{P}_1$. The following lemma shows how the query CNF can be used for verifying a one-query subsumption.

Lemma 2 Let operator OR be not supported by a given information service. Then, an I-query can be subsumed with one native query iff the query CNF contains at least one simple disjunction with a positive predicate.

Proof. Let $Q$ be an I-query. If the CNF of $Q$ contains a disjunction with one positive predicate $P$, that is, $Q_{cnf} = P \cdot Q'$, then obviously $P$ subsumes $Q$. On the other side, if an I-query $Q$ can be subsumed with a query $Q'$ containing no operator OR, then the original query can be presented as a conjunction $Q = Q' \cdot Q''$. As $Q'$ contains no OR, it is a conjunction of (one or more) basic predicates and at least one of them is positive. Therefore, the CNF of $Q$ contains at least one simple disjunction with a positive predicate. □

The query CNF can help us to detect a one-query subsumption. However, if the one-query subsumption fails and the minimal number of sub-queries for submission is required, the query CNF does not help. To solve the problem we build a special query form which is different from both DNF and CNF but allows an immediate generation of the minimal number of subsuming sub-queries.

A conjunction $C = c_1 \cdots c_m$ is called a P-conjunction if at least one of the elements $c_i$ is a basic positive predicate. For example, $A (B \lor C)$ is a P-conjunction while $(A \lor B)(B \lor C)$ is not. A query $Q$ has a top-level OR (shortly, k-TON) form ($k \ge 1$) if it can be presented as a disjunction of $k$ P-conjunctions. A k-TON form is a minimal TON form (MTON) if there does not exist a $(k-1)$-TON form of $Q$. For example, the query ``$A B \lor C \lnot A \lor B D$'' has a 2-TON form ``$B (A \lor D) \lor C \lnot A$'' which is minimal. Note that the query can be represented as one conjunction like ``$(B \lor C \lnot A)(A \lor D \lor C \lnot A)$'', but it is not a P-conjunction. Furthermore, a query can have more than one MTON form. For example, the I-query ``$A B \lor A C \lor B C$'' has three MTON forms with $k = 2$: ``$A (B \lor C) \lor B C$'', ``$A B \lor C (A \lor B)$'' and ``$A C \lor B (A \lor C)$''.

Theorem 1 Let operator OR be not supported. Then, an I-query can be subsumed with k subsuming sub-queries iff its MTON form has k conjunctions.

Proof. Let an I-query $Q$ be given and let its MTON form contain $k$ conjunctions. Each conjunction is a P-conjunction and contains at least one positive predicate $P_i$. Therefore, the $k$ sub-queries given by the predicates $P_i$, $i = 1, \ldots, k$, subsume the query. Now assume that the query MTON form contains $k$ conjunctions but the query can be subsumed with $k - 1$ sub-queries $Q_i$, $i = 1, \ldots, k - 1$. None of the $Q_i$ contains OR and the disjunction $Q_1 \lor Q_2 \lor \cdots \lor Q_{k-1}$ subsumes the original query. Therefore, there exists a $(k-1)$-TON query form, which contradicts the assumption. □

The MTON query form may coincide with the DNF, as for the query ``$A B \lor C D$'' (no simplification can be done), or with the CNF, as for the query ``$A (B \lor C)$''. Finally, the result on a one-query subsumption (Lemma 1) can be reformulated as follows:

Corollary 1 Let operator OR be not supported. Then, a one-query subsumption is possible iff the original query has a 1-TON form.
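The one-query test of Lemma 2 is easy to mechanize for small queries. The sketch below is an illustration under stated assumptions rather than the authors' code: a query is given as a DNF (a list of sets of literals, with a leading `~` marking a negative predicate), and the helper names `dnf_to_cnf` and `one_query_subsumers` are hypothetical. The CNF is obtained by plain distribution, which is exponential in general and only reasonable for short user queries.

```python
from itertools import product


def dnf_to_cnf(dnf):
    """Distribute a DNF (list of sets of literals) into a list of CNF clauses."""
    clauses = set()
    for choice in product(*dnf):  # pick one literal from every conjunction
        clause = frozenset(choice)
        # Skip tautologies such as (A OR NOT A).
        if any(("~" + lit) in clause for lit in clause if not lit.startswith("~")):
            continue
        clauses.add(clause)
    # Absorption: drop every clause that strictly contains another clause.
    return [c for c in clauses if not any(other < c for other in clauses)]


def one_query_subsumers(dnf):
    """Positive predicates forming a simple disjunction in the CNF (Lemma 2)."""
    units = [next(iter(c)) for c in dnf_to_cnf(dnf) if len(c) == 1]
    return sorted(lit for lit in units if not lit.startswith("~"))


# A AND B OR A AND C: the CNF is A (B OR C), so A alone subsumes the query.
q5 = [{"A", "B"}, {"A", "C"}]
# A AND B OR C AND D: no simple positive disjunction, so one query is not enough.
q_split = [{"A", "B"}, {"C", "D"}]
```

An empty result from `one_query_subsumers` means that the query has no 1-TON form and must be split into at least two sub-queries.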


Unfortunately, finding the MTON form is an NP-complete problem. To show this, we consider the query DNF and denote the set of its conjunctions as $D$. Then we mark all conjunctions of $D$ containing the same positive basic predicate $P_i$ as the subset $S_i$. Thus, we reduce the MTON form problem to the minimal cover of $D$ by the subsets $S_i$, which is shown to be NP-complete [9].
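For short queries the minimal cover can still be found by exhaustive search. The following sketch follows the set-cover view described above; the function names and the literal encoding (a leading `~` for a negative predicate) are assumptions for illustration, and the search is exponential in the number of predicates.

```python
from itertools import combinations


def positive_predicates(conjunction):
    """Positive basic predicates of one DNF conjunction."""
    return {lit for lit in conjunction if not lit.startswith("~")}


def mton_size(dnf):
    """Smallest set of positive predicates covering every DNF conjunction."""
    candidates = sorted(set().union(*(positive_predicates(c) for c in dnf)))
    for k in range(1, len(dnf) + 1):
        for chosen in combinations(candidates, k):
            # Every conjunction must contain at least one chosen predicate.
            if all(positive_predicates(c) & set(chosen) for c in dnf):
                return k, chosen
    raise ValueError("some conjunction has no positive predicate")


# A B OR C (NOT A) OR B D: covered by B and C, i.e. B (A OR D) OR C (NOT A).
example = [{"A", "B"}, {"C", "~A"}, {"B", "D"}]
# A B OR A C OR B C: any two of the three predicates give a cover.
triangle = [{"A", "B"}, {"A", "C"}, {"B", "C"}]
```

The chosen predicates are exactly the sub-queries to submit; the remaining parts of each P-conjunction are applied by the filter on the front-end side.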

Although user queries are typically short in real applications and the MTON form is easily obtained, we describe a greedy algorithm for the general case. The algorithm obtains a k-TON query form with $k$ close to the optimal value. The algorithm starts with the query DNF of $n$ conjunctions as the n-TON form. Then it applies to conjunctions of the current TON form the distributive rule $A T_1 \lor A T_2 = A (T_1 \lor T_2)$, where $A$ is a positive basic predicate. Each time the rule is applied, the cardinality of the query TON form is reduced by 1. The procedure stops when the distributive rule cannot be applied any longer. Therefore, it takes at most $n - 1$ steps to obtain the resulting TON form. Consider the query ``$A B \lor A C \lor A D \lor B E \lor C F \lor D G$''. The greedy algorithm generates the following 4-TON form: ``$A (B \lor C \lor D) \lor B E \lor C F \lor D G$'', while the MTON form has 3 disjuncts: ``$B (A \lor E) \lor C (A \lor F) \lor D (A \lor G)$''.
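A greedy construction of this kind can be sketched as follows. The text does not say which predicate is factored out first, so this sketch assumes the most frequent positive predicate is chosen, with ties broken alphabetically; the function name `greedy_ton` and the literal encoding are likewise illustrative assumptions.

```python
from collections import Counter


def greedy_ton(dnf):
    """Group DNF conjunctions by repeatedly factoring out a positive predicate.

    Returns a list of (predicate, remainders) pairs, one per P-conjunction.
    """
    remaining = [frozenset(c) for c in dnf]
    form = []
    while remaining:
        counts = Counter(
            lit for c in remaining for lit in c if not lit.startswith("~")
        )
        if not counts:
            raise ValueError("a conjunction without positive predicate remains")
        # Assumed greedy choice: most frequent predicate, then alphabetical order.
        best = min(counts, key=lambda p: (-counts[p], p))
        group = [c for c in remaining if best in c]
        form.append((best, [sorted(c - {best}) for c in group]))
        remaining = [c for c in remaining if best not in c]
    return form


# A B OR A C OR A D OR B E OR C F OR D G
QUERY = [
    {"A", "B"}, {"A", "C"}, {"A", "D"},
    {"B", "E"}, {"C", "F"}, {"D", "G"},
]
form = greedy_ton(QUERY)
sub_queries = [predicate for predicate, _ in form]
```

On the query above this choice factors out A first and ends with four P-conjunctions, reproducing the 4-TON form of the text, whereas the optimal form needs only the three sub-queries B, C and D.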

3.3 Front-end query subsumption

The subsumption of I-queries discussed above is the second and more complex step of the subsumption procedure discussed in Section 2.2. To complete the procedure, we now discuss how any front-end query can be subsumed with an I-query. Although various I-queries can subsume a given user query, the following lemma states that there exists an optimal I-subsumption. The corresponding proof demonstrates how to obtain such a subsumption.

Lemma 3 For any user query there exists a minimal subsuming I-query.

Proof. Both the front-end language and the I language support all Boolean operators and therefore there is a minimal subsumption of a user query with one I-query. The minimal subsumption is conducted with the basic query subsumption procedure given in Section 2.2. The procedure takes the query DNF and applies the predicate rewriting rules given in Table 1. Every attribute predicate EQUALS is substituted with the predicate CONTAINS and all document attributes are substituted with full-text. The basic predicates W(n), Near(n) and phrases are positively subsumed with one or more AND-predicates. In the negative case, they are subsumed with false. Stemming is subsumed with a number of OR-operators in the positive case, and with false otherwise. As each disjunction of the DNF contains at least one positive predicate, the query never degenerates. □

The existence of a minimal I-subsumption for a user query allows us to combine it with Theorem 1 and obtain the main result for the user query subsumption.

Theorem 2 Let operator OR be not supported by a given information service. Then, a front-end query Q can be subsumed with k subsuming sub-queries iff its I-subsuming query has an MTON form with k conjunctions.

Example 6. Consider the query $Q_6$ = A W(2) B OR C Near(0) A and assume that the operator OR is not supported. The subsuming I-query for $Q_6$ is ``A AND B OR C AND A''. Since the latter can be subsumed with one predicate A, the original query can be subsumed with A, too.

Filtering function. Subsumptions for the operators AND and NOT are positive subsumptions and neither of them changes the filtering function. The splitting of a user query into sub-queries when the operator OR is not supported does not change the filtering function either. There is, however, an exception. If no basic predicate in a user query was subsumed by rewriting rules before the splitting, there is no need for post-filtering. For example, for the user query ``A OR B'', which is split into the sub-queries A and B, the merge of returned documents does not need any filtering function.

3.4 Multi-step subsumption

The final issue is the case when multiple predicate rewriting is necessary. For example, it occurs during the translation of a query with the predicate Near(n) to a native language supporting only one-word queries. In such a case, Near(n) is first rewritten with AND and the resulting query is then rewritten into a one-word query. In Figure 7 we represent the predicate rewriting rules as a graph where each node corresponds to one predicate and each edge corresponds to one rewriting rule. One transition in the graph is one subsumption. Multiple rewriting corresponds to a chain of transitions in the graph. The predicate rewriting stops when all predicates in the query fit into the native language. Note that multiple rewriting does not concern queries with the operator OR and the graph in Figure 7 contains no node for the operator. Another basic predicate not included in the graph is the stemming operator, as it is subsumed with the use of the operator OR. In the Knowledge Brokers system, when the stemming operator is not supported in a native language, the list of interpretations is generated by a local stemmer [14].
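The multi-step rewriting of Section 3.4 amounts to following a chain of edges in the rewriting graph until a supported predicate is reached. The sketch below is only an illustration: the edge set `REWRITE` is read off the rewriting rules mentioned in the text (EQUALS to CONTAINS; W(n), Near(n) and phrases to AND; AND to a one-word query) and is an assumption, since it may not match every edge of Figure 7.

```python
# Assumed rewriting edges: predicate -> the predicate it is subsumed with.
REWRITE = {
    "EQUALS": "CONTAINS",
    "W(n)": "AND",
    "NEAR(n)": "AND",
    "Phrase": "AND",
    "AND": "one word",
}


def rewriting_chain(predicate, supported):
    """Follow rewriting rules until the predicate fits the native language."""
    chain = [predicate]
    while chain[-1] not in supported:
        successor = REWRITE.get(chain[-1])
        if successor is None:
            raise ValueError(f"no rewriting rule for {chain[-1]!r}")
        chain.append(successor)  # one transition in the graph = one subsumption
    return chain


# A native language that accepts one-word queries only.
near_chain = rewriting_chain("NEAR(n)", {"one word"})
# A native language with CONTAINS, AND and one-word queries.
equals_chain = rewriting_chain("EQUALS", {"CONTAINS", "AND", "one word"})
```

Each element after the first in a chain corresponds to one subsumption step, so the length of the chain indicates how much precision is lost before the query reaches the information service.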

Figure 7. Predicate rewriting graph. (Nodes: one word, CONTAINS, AND, NOT, EQUALS, NEAR(n), W(n), Phrase.)

4 Query translation algorithm

The query subsumption algorithm for the minimal model³ given in Figure 4 contains two steps. The first step subsumes the user query with the optimal I-query using the query DNF and predicate rewritings. If no operational limitation occurs, the algorithm stops. Otherwise, the operators NOT and AND are checked first and subsumed if necessary. Finally, if the operator OR is not supported, the minimal k-TON form is constructed. Each P-conjunction of the k-TON form produces a sub-query which is then sent to the information service.

One choice in the algorithm is kept hidden, namely, how to choose one of the two terms when an operator AND is subsumed. Both terms are acceptable, but for practical reasons we are interested in the more selective term. The term selectivity is unknown a priori. However, there are a number of empirical observations which can help us in the choice. Each of them can fail in the case of a particular query but appears to be helpful over a large set of queries. Here we list some well-known observations used in the Knowledge Brokers, in the order of decreasing importance: (i) different attributes have different selectivities; (ii) the stemming is less selective than any plain keyword; (iii) a longer keyword is often more selective than a shorter one.

³ The query translation algorithm for the direct strategy is simpler and can be easily adapted from there.

5 Conclusion

We have discussed the translation of Boolean queries on the Web when operational limitations in a Web page language lead to splitting a user query into sub-queries before sending them over the net. We have presented two approaches to the problem and analyzed their advantages and disadvantages. For the minimal model, which minimizes the number of sub-queries, we have derived a number of important results and constructed the most appropriate query representation to achieve a low computational complexity. In the Knowledge Brokers, each information service is statically attached to the direct or minimal translation strategy, i.e., the strategy is chosen during a training period and repeatedly used over the user session. In some cases, however, the best strategy should be chosen dynamically. An appropriate and accurate cost model is required to choose the strategy dynamically on the basis of the query, the information service and the net characteristics.

References

[1] E. Bertino, B. C. Ooi, R. Sacks-Davis et al. Indexing Techniques for Advanced Database Systems. Kluwer Academic Publishing, 1997.

[2] J.-M. Andreoli, U. Borghoff, R. Pareschi. Constraint-Based Knowledge Brokers. In Proc. 1st Intl. Symp. on Parallel Symbolic Computation (PASCO'94), Lecture Notes Series in Computing 5, pp. 1-11.

[3] C.-C. K. Chang, H. Garcia-Molina, A. Paepcke. Predicate Rewriting for Translating Boolean Queries in a Heterogeneous Information System. Technical Report SIDL-WP-1996-0028, Stanford University, 1996.

[4] C.-C. K. Chang, H. Garcia-Molina, A. Paepcke. Boolean Query Mapping Across Heterogeneous Information Sources. In IEEE Transactions on Knowledge and Data Engineering, vol. 8, no. 4, 1996.

[5] C.-C. K. Chang, H. Garcia-Molina. Evaluating the Cost of Boolean Query Mapping. In Proc. 2nd ACM Intern. Conf. Digital Library, 1997.

[6] B. Chidlovskii, U. M. Borghoff, P.-Y. Chevalier. Toward Sophisticated Wrapping of Web-based Information Repositories. In Proc. Intern. RIAO'97 Conf., 1997, pp. 123-135.


Procedure SUBSUME (query Q)
Input: query Q in the DNF: $Q = \bigvee_{i=1}^{k} C_i = \bigvee_{i=1}^{k} \big( \bigwedge_{j=1}^{p_i} \tilde{P}_{ij} \big)$
Output: the subsuming query(s) and the filtering function for the minimal model.
begin
    // Step 1: translate the query Q into an I-query
    Create a copy of the query as the filtering tree
    for each basic predicate $\tilde{P}_{ij}$ in the query do
        if predicate $\tilde{P}_{ij}$ is supported then
            remove it from the filtering tree
        else // Subsume the predicate $\tilde{P}_{ij}$
            if exists a neutral subsumption for $\tilde{P}_{ij}$ then
                rewrite the predicate and remove it from the filtering tree
            else if $\tilde{P}_{ij} = P_{ij}$ then
                replace $\tilde{P}_{ij}$ with its positive subsumption
            else if $\tilde{P}_{ij} = \lnot P_{ij}$ then
                replace $\tilde{P}_{ij}$ with its negative subsumption
    return the filtering function
    // Step 2: translate the I-query into the native language
    if operator NOT is not supported then
        for each entry of NOT do
            subsume A NOT B with A
    if operator AND is not supported then
        for each entry of AND do
            subsume A AND B with A or B
    if OR is supported then
        return the subsuming query
    else
        obtain the k-TON query form by applying the rule $A T_1 \lor A T_2 = A (T_1 \lor T_2)$
        for each P-conjunction of the MTON form do
            return the subsuming positive predicate as a sub-query
end
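Step 2 of the procedure leaves open which of the two terms survives when A AND B is subsumed with A or B. The selectivity observations (i) to (iii) listed in Section 4 could be encoded as a lexicographic sort key, as in the minimal sketch below. The `Term` record, the function names and in particular the `ATTRIBUTE_RANK` values are hypothetical: the paper states only that attributes differ in selectivity, not which attribute ranks higher.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical ranking: a higher value means a more selective attribute.
ATTRIBUTE_RANK = {"Author": 3, "Title": 2, "Abstract": 1, None: 0}


@dataclass(frozen=True)
class Term:
    keyword: str
    attribute: Optional[str] = None  # None stands for a full-text search
    stemmed: bool = False


def selectivity_key(term):
    """Observations (i)-(iii) in decreasing order of importance."""
    return (
        ATTRIBUTE_RANK.get(term.attribute, 0),  # (i) attribute selectivity
        not term.stemmed,                       # (ii) plain keyword beats stemming
        len(term.keyword),                      # (iii) longer keyword preferred
    )


def subsume_and(left, right):
    """Subsume `left AND right` with the term estimated to be more selective."""
    return max(left, right, key=selectivity_key)
```

The term that is dropped stays in the filtering function, so a poor choice costs network traffic but not correctness.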

[7] A. Emtage and P. Deutsch. Archie - an Electronic Directory Service for the Internet. In Proc. USENIX Winter Conference, 1992, pp. 93-110.

[8] D. Florescu, L. Raschid, P. Valduriez. Using Heterogeneous Equivalences for Query Rewriting in Multidatabase Systems. In Proc. Cooperative Inform. Systems Conf., 1995.

[9] M. Garey and D. Johnson. A Guide to the Theory of NP-Completeness. Freeman and Company, 1979.

[10] L. Raschid, Y. Chang, B. J. Dorr. Query Transformation Techniques for Interoperable Query Processing in Cooperative Information Systems. In Proc. Cooperative Inform. Systems Conf., 1994.

[11] Ch. Reck and B. König-Ries. An Architecture for Transparent Access to Semantically Heterogeneous Information Sources. In Proc. Cooperative Information Agents, Lect. Note Comp. Science, vol. 1202, 1997.

[12] V. Vassalos, Y. Papakonstantinou. Describing and Using Query Capabilities of Heterogeneous Sources. In Proc. 23rd VLDB Conference, Athens, Greece, 1997.

[13] W. Weinman. The CGI Book. New Riders Publishing, 1996.

[14] Xerox Research Centre Europe: Linguistic tools. At http://www.xrce.xerox.com/research/mltt/home.html.
