On the integration of IR and databases

Arjen P. de Vries and Annita N. Wilschut
Centre for Telematics and Information Technology
University of Twente

July 2, 1998

Abstract

Integration of information retrieval (IR) in database management systems (DBMSs) has proven difficult. Previous attempts at integration suffered from inherent performance problems, or lacked the desirable separation between logical and physical data models. To overcome these problems, we discuss a database approach based on structural object-orientation. We implement IR techniques using extensions to an object algebra called MOA. MOA has been implemented on top of the database backend Monet, a state-of-the-art high-performance database kernel with a binary relational interface. Our prototype implementation of the inference network retrieval model using MOA and Monet demonstrates the feasibility of this approach. We evaluate the benefits of the resulting system. We conclude with a discussion of query optimization in the context of extending an object algebra.

1 Introduction

Information retrieval (IR) is concerned with the retrieval of (usually text) documents that are likely to be relevant to the user's information need as expressed by his request [vR79]. IR retrieves documents based on their content. Because the request (or query) is never a perfect expression of the information need, IR is an iterative process.

Database management systems (DBMSs) traditionally support a different kind of retrieval. Van Rijsbergen summarizes the differences between data retrieval (as in databases) and IR as follows [vR79]. Retrieval in databases is deductive and exact; retrieval in IR uses inductive inference and is approximate. Objects in a database possess attributes both necessary and sufficient to belong to a class. Conversely, no attribute of a document's content is either necessary or sufficient to judge its relevance to an information need.


The differences between the problems studied in the two fields have resulted in little overlap between the functionality provided by database and IR systems. Modern extensible database systems now start to support non-traditional data types, such as text or multimedia documents. Simply adding new data types to a database system is however insufficient. The problems with multimedia retrieval are very similar to those studied in IR [dVB98]. Multimedia retrieval requires, even more than text retrieval, an iterative search process involving many different representations of the objects in the database [dVvdVB97].

Summarizing the previous paragraphs, DBMSs do not sufficiently support searching on content, and IR systems do not handle structured data. However, many applications of information systems have requirements that can only be met by a combination of data retrieval and information retrieval: e.g. patient data in hospital systems, multimedia data in digital libraries, and business reports in office information systems. The characteristic feature of such applications is the combination of content management with the usual manipulation of formatted data; these two aspects are often referred to as the logical structure and the content structure of a document [MRT91]. Consider an information request in a digital library for 'recent news bulletins about earthquakes in California'. In this example, 'recent' and 'news' refer to attributes of the objects in the library (their logical structure), while 'earthquake' and 'California' refer to the content of those objects (their content structure). Note that the combination of constraints on content and other attributes of the documents is also important for the implementation of Mizzaro's different notions of relevance in the IR process [Miz98].

Another argument in favour of integrated systems is architectural. Databases and IR systems share requirements like concurrency control, recovery, and internal indexing techniques. DBMSs already provide such support. The integration of IR in databases can also help the IR researcher to concentrate on retrieval models and reduce the implementation effort involved in experimental IR.

This paper presents an implemented prototype of an integrated IR and database system. In the next section we review previous approaches to such integration, discuss some problems with these systems, and present an outline of our design.

2 Background and problem statement

We first introduce a model of IR. Next, we review previous attempts at the integration of IR in (relational) databases. Finally, we propose a new approach to accomplish a fully integrated system.

2.1 Information retrieval systems

The development of an IR system consists of three steps (see e.g. [WY95]). First, we choose an appropriate scheme to represent the documents. Second, we model query formulation. Third, we select a ranking function which determines the extent to which a document is relevant to a query. The combination of these three steps is known as the retrieval model.

Usually, documents and queries are simply represented as a collection of words. Sometimes, the words are reduced to their stems (stemming), and often frequent words with little semantics are left out (stopping). Techniques from natural language processing may be applied to identify phrases. The features resulting from these procedures are called the indexing terms.

A wide variety of ranking formulas exists; see [ZM98] for an overview of the most common ones. Typically, the ranking is determined using statistics about the distribution of the indexing terms over the document collection. Most statistics are based on the term frequency (tf), the number of times that a term occurs in a document. Usually, the raw values are normalized to account for document length before they are used for ranking, e.g. by dividing by max tf. Sometimes, log-normalization is used, because the difference between one or two occurrences of a query term in a document is much more significant than the difference between fifteen or sixteen. For example, the term frequency component (tf) in the ranking formula of the popular retrieval system InQuery is [CCB95]:

    0.4 + 0.6 · (log(tf + 0.5) / log(max tf + 1.0))    (1)

The document frequency (df) is the number of documents in which a term occurs. Because a high term frequency of 'car' in a document from a collection of documents about cars does not add much information about the relevance of that document for the query 'red sports cars', tf is usually multiplied with the inverse document frequency. The inverse document frequency (idf) is usually defined as log(N/df), with N the number of documents in the collection, and expresses the amount of surprise when we observe a term in a document. The product tf · idf is the most popular weighting in IR and is used in most ranking formulas.
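To make these statistics concrete, the following Python sketch computes the normalized tf of equation 1 and a tf · idf weight. It is an illustration only; the function names are ours and do not belong to any of the systems discussed here.

import math

def ntf(tf, max_tf):
    # Normalized term frequency, following equation (1).
    return 0.4 + 0.6 * (math.log(tf + 0.5) / math.log(max_tf + 1.0))

def idf(n_docs, df):
    # Inverse document frequency: log(N / df).
    return math.log(n_docs / df)

# tf.idf weight of a term occurring 3 times in a document whose most
# frequent term occurs 5 times, in a collection of 1000 documents of
# which 10 contain the term:
print(ntf(3, 5) * idf(1000, 10))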

2.2 Integration of IR in relational databases

Integration of IR and databases has been a research topic for several years. The main interest in databases from IR researchers has been the more traditional database support: concurrent addition, update, and retrieval of documents, as well as scalability of the system. Early research by MacLeod compared the model of text retrieval systems with the data models in relational databases [Mac91]. He concluded that many


difficulties arise when mapping a document onto a record-oriented structure. Schek and Pistor suggested the use of the NF2 algebra for the integration of IR and databases [SP82]. Their work is however only applicable to Boolean retrieval models, which cannot handle the intrinsic uncertainty of the IR process.

Similar to the development of geographical information systems, researchers therefore started to build 'coupled' systems: an information retrieval component handles the queries on content and the database component handles the queries on the logical structure. The most primitive approach is to use two completely different systems with a common interface, e.g. [GTZ93]. Modern extensible database systems allow the integration of user-defined data types, functions, and search accelerators (e.g. Illustra [SM96], Starburst [HCL+90], or Monet [BK95]). Therefore, a tighter integration has become feasible, in which the extension model of the database is used to encapsulate the otherwise completely separated IR component (see e.g. [DDSS95], [VCC96] and [DM97]).

The desire to model uncertainty in database systems inspired the recent development of probabilistic relational algebras, e.g. [FR97] and [LLRS97]. Although this development will eventually lead to a very interesting framework for implementing probabilistic retrieval models in a database system, these algebras require some fundamental changes in the core algorithms of (extended) relational database systems. Furthermore, query optimization and physical database design under these models still require a lot of research.

2.3 Integration of IR in object databases

Previous approaches to IR and database integration have also suggested the application of object-oriented databases; see [Mil97] for an overview. Object data models allow nesting and, in addition to sets, work with multiple collection types. Therefore, an object-oriented model is well suited for the implementation of IR. A good example of the design of IR in an OODBMS is the Cobra system described in [MMR97] and [Mil97].

Current implementations of OODBMSs often provide little support for data independence. Although object-orientation is very useful for data modeling of complex objects and their behaviour, this does not necessarily imply that the implementation of the objects at the database level should be the same as the view on the objects at the conceptual level. Often, the implementation of object databases uses conventional programming languages to extend the OODBMS. The structure of the complex objects is hard-coded in the implementation of the object, and only the interface of the complex object is known to the database. The system cannot 'look inside' the objects to plan the execution of a query that involves several method calls. For many operations on the complex objects, this leads to object-at-a-time processing instead of the often preferable collection-at-a-time processing. Algebraic optimization of query expressions, proven very successful for relational databases, is hard to achieve because the


structure of the objects is hidden inside the object. Concurrency control and parallelization can only be defined at object granularity.

These problems can be avoided using structural object-orientation. In a database system with support for structural object-orientation, we distinguish between atomic data types and structures over these types. Because the database also manages the structure, it can 'look inside' the objects and make better decisions with respect to query processing and concurrency control.

In this paper, we study an approach based on structural object-orientation for the design of an integrated IR and database system. This design consists of three levels. At the top level, we model the information retrieval functionality. We define extensions that model the three components of a retrieval model: document representation, query formulation, and the ranking process. This level implements the management of content and adds it to the object algebra. The second level is the MOA object algebra. MOA provides a nested object data model using bags, objects, and tuples. The formal definition of structure generators allows the addition of structures specific to information retrieval. MOA provides the functionality required to manage the logical structure of the documents. At the bottom level, the MOA object algebra is mapped to a flat binary relational model. This data model is employed by Monet, an efficient database kernel that is used as a backend for query execution. The use of a different physical data model brings us the benefit of data independence and avoids the pitfalls of object-oriented databases mentioned before. The success of the combination of MOA and Monet has been demonstrated for an OO version of the TPC-D decision support benchmark in [BWK98].

In the remainder of this paper, we discuss the implementation of an integrated IR and database system using structural object-orientation. We do not further discuss our retrieval model from an IR viewpoint (refer to [dVB98] for our ideas in this respect), but concentrate on the database aspects of our design. We start in section 3 with a quick overview of the Monet database kernel and the MOA object algebra. In section 4, we discuss the integration of IR in MOA and Monet. We design new MOA structures that support IR functionality in section 4.1. Next, we introduce the physical data model and define algebraic operations to support IR (section 4.2). After discussing the resulting design in section 4.3, we investigate the opportunities for query optimization in section 5, and finish with conclusions and further research.

3 Monet and the MOA object algebra

In this section, we give overviews of the database Monet, the MOA object algebra, and its implementation on Monet.

3.1 Monet

Monet is an extensible parallel database kernel that has been developed at the UvA and the CWI since 1994 [BK95]. Monet is intended to serve as a backend in various application domains. In this research, we use Monet to implement an IR retrieval model with algebraic database operations. The mapping of standard MOA to Monet is described in [BWK98]. Monet has also been used successfully as a backend for geographic information systems [BQK96] as well as commercial data mining applications.

Monet's design is based on two trends. First, the average main memory in workstations gets larger and larger. Therefore, processing should be focused on operations performed in main memory instead of on disk; all primitive operations in Monet assume that their data fit in main memory. Second, operating systems evolve towards micro-kernels, i.e. they make part of the OS functionality accessible to applications. A DBMS should therefore not attempt to 'improve on' or replace the OS functionality for memory management. If the tables in Monet get too large for main memory, the database uses memory-mapped files. It uses the lower-level OS primitives to advise on the buffer management strategy. This way, the MMU can do the job in hardware.

Monet implements a binary relational model. The data is stored in Binary Association Tables (BATs). BATs are tables with two columns, the head and the tail, each storing an atomic value. Structured data is decomposed over many narrow tables. This vertical fragmentation is especially useful when the objects consist of many attributes but most queries on these objects refer to only some of them. Monet accesses only the attributes that are used in the query (which is thus more likely to be performed in main memory). The extra cost of re-assembling multi-attribute data can be reduced by efficient vectorized (semi)join algorithms, as discussed in [BWK98].

The Monet Interface Language (MIL) consists of the BAT-algebra, which contains basic operations on bags, and a collection of control structures. Table 1 gives a semi-formal impression of the BAT-algebra. The algebra operations materialize their results and never change their operands. Two operations deserve a little more attention. The multiplex constructor [f] allows bulk application of any algebraic operation to the tail of a BAT. It is also defined on multiple BATs, as the operation performed on all combinations of tail values over the natural join on head values. The set-aggregate constructor {g} is used for bulk aggregation. It is defined for each aggregate function that maps a set of tail-type values onto a base type. It groups over the head and calculates the aggregate of

the corresponding sets of tail values.

BAT command              result
AB.mirror                {ba | ab ∈ AB}
AB.semijoin( CD )        {ab | ab ∈ AB ∧ ∃cd ∈ CD : a = c}
AB.join( CD )            {ad | ab ∈ AB ∧ cd ∈ CD ∧ b = c}
AB.select( t )           {ab | ab ∈ AB ∧ b = t}
AB.select( low, high )   {ab | ab ∈ AB ∧ low ≤ b ≤ high}
[f](AB)                  {af(b) | ab ∈ AB}
[f](AB,...,XY)           {af(b, ..., y) | ab ∈ AB ∧ ... ∧ xy ∈ XY ∧ a = ... = x}
{g}(AB)                  {ag(Sa) | a ∈ A ∧ Sa = {b | ab ∈ AB}}

Table 1: A sample of the MIL commands

The algebraic commands perform an additional dynamic optimization step before execution. Based on the properties of the operands, the most efficient algorithm is chosen at run-time. For instance, the join algorithm may be a hashjoin, but also a mergejoin that assumes the join columns to be ordered, or a syncjoin that assumes the join columns to be identical. When two BATs have an identical head column, they are said to be synced. The semijoin and the multiplex operations can be performed very efficiently on synced BATs, because we do not have to compare the values: we already know beforehand that the head values in the operands are equal. Note that the run-time evaluation of mirror does not physically swap head and tail, but instead creates a view on the operand. Similarly, when the kernel knows that a BAT is ordered on tail, a range select on this BAT does not create a new BAT but a view.
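To illustrate the semantics of Table 1, the following minimal Python sketch mimics a BAT as a list of (head, tail) pairs. This is our own didactic rendering; Monet implements these operations as optimized algorithms selected at run-time.

# A BAT as a list of (head, tail) pairs; a sketch of the semantics of
# Table 1, not of Monet's optimized implementation.
def mirror(ab):                # {ba | ab in AB}
    return [(b, a) for a, b in ab]

def semijoin(ab, cd):          # {ab | ab in AB and some cd in CD has a = c}
    heads = {c for c, _ in cd}
    return [(a, b) for a, b in ab if a in heads]

def join(ab, cd):              # {ad | ab in AB, cd in CD, b = c}
    return [(a, d) for a, b in ab for c, d in cd if b == c]

def select(ab, low, high):     # {ab | ab in AB and low <= b <= high}
    return [(a, b) for a, b in ab if low <= b <= high]

def multiplex(f, ab):          # [f](AB) = {a f(b) | ab in AB}
    return [(a, f(b)) for a, b in ab]

def set_aggregate(g, ab):      # {g}(AB): group on head, aggregate the tails
    groups = {}
    for a, b in ab:
        groups.setdefault(a, []).append(b)
    return [(a, g(bs)) for a, bs in groups.items()]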

3.2 MOA, an algebra on an extensible object datamodel

The object algebra MOA consists of an extensible object data model and an algebra on this data model. This section summarizes the data model and the algebra. A more elaborate description of MOA and an evaluation of the performance of its implementation on Monet can be found in [BWK98].

3.2.1 Datamodel

The MOA data model is based on the concepts of base types and structuring primitives. MOA assumes a finite set of ADT-style base types. The internal structure of base types cannot be accessed directly; it is only accessible via operations. Base types allow efficient implementation in a general-purpose programming language. They are


implemented at the level of the physical storage system. Base values (which are instances of base types) are atomic. The base types for our MOA implementation are provided by the underlying physical system, Monet. MOA inherits the base type extensibility provided by Monet.

The MOA data model consists of structures that can recursively be used to generate structured types. A structure combines known types to create a structured type. The set, bag, and tuple are well-known structuring primitives that are provided in all OO data models. The collection of structures constitutes the data model at the logical level. Structures are mapped by the MOA implementation onto the physical system. This mapping provides data independence between the MOA data model and the physical storage system. The logical-to-physical mapping also allows the system to optimize query execution plans.

The MOA kernel system supports a generic notion of orthogonal structures. The actual implementations of structures are added to the system in separate modules. The result is an orthogonal type system that consists of atomic types, a bag structure and a tuple structure, which are provided as structure modules to the system, and possibly other user-defined structures. The type system can now be summarized as follows:

basetypes: τ is a type if τ is an atomic type.
structures: If τ1, ..., τn is a, possibly empty, list of types and T is a structure defined over τ1, ..., τn, then T(τ1, ..., τn) is a type.

MOA also supports the notion of classes and objects. As these concepts are not directly relevant in the context of this paper, they are not discussed here. In the remainder, we assume that MOA is supplied with the bag and the tuple structure, and we will define a number of structures to model documents and statistics over document collections.
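These two rules can be rendered as a small recursive datatype. The following Python sketch is our own illustration of the type grammar, not part of the MOA implementation.

from dataclasses import dataclass
from typing import Tuple, Union

@dataclass(frozen=True)
class Atomic:
    name: str                     # an atomic type, e.g. "int" or "str"

@dataclass(frozen=True)
class Structure:
    name: str                     # a structure, e.g. "BAG" or "TUPLE"
    params: Tuple["Type", ...]    # possibly empty list of parameter types

Type = Union[Atomic, Structure]

# By the two rules, BAG< TUPLE< int, str, BAG< int > > > is a type:
t = Structure("BAG", (Structure("TUPLE", (Atomic("int"), Atomic("str"),
        Structure("BAG", (Atomic("int"),)))),))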

3.2.2 Algebra

The algebra on the MOA data model consists of the operations defined on the available atomic and structured types. Each atomic type has its own accessor functions. The accessor functions on the atomic types are executed directly in the physical system. Each structure definition comes with operations on the structure. The logical MOA operations on structures are translated into efficient physical execution plans. In MOA, the bag structure implements many operations that are usual in query algebras, like the selection, map, join, semijoin, nest, unnest, etc. The tuple structure implements an operation to extract an attribute from the tuple. Atomic types simply add their accessor functions to the algebra.


Example Due to shortage of space, we cannot give the full definition of the MOA data model and query algebra. To give an idea of the language, we illustrate MOA with an example that assumes an instance of MOA that supports the bag and the tuple structures. Assume that X is a value of a nested structured type:

X: BAG< TUPLE< int, Atomic, BAG< Atomic > > >

Now the MOA expression

select[attr(THIS,0) = 1](X)

selects those tuples from the operand bag into a result bag for which the zeroth (integer) attribute has value 1. The result is a value of the same type as the operand. After the selection operation, we may want to remove the integer attribute from the resulting tuples:

map[TUPLE< attr(THIS,1), attr(THIS,2) >](select[attr(THIS,0) = 1](X))

Let's now assume that Y is a value of type BAG< int >. Now the expression

join[count(attr(THIS,2)), THIS, TUPLE](X,Y)

joins the elements of bag X with bag Y on the equality of the count of the bag attribute in the tuples in X and the value of the integers in Y. The join uses the TUPLE structure to generate the result. Consequently, the result is a BAG of TUPLEs, each pairing an element of X with a matching element of Y.

3.2.3 Implementation

MOA is implemented on Monet, which only supports binary relational tables. In the case of MOA, we use full vertical decomposition [CK85] to store structured data in binary tables. The combination of BATs storing values and a structure function on those BATs forms the representation of a structured value. The full details of the mapping of MOA on Monet are described in [BWK98]. Here we give an example illustrating the idea. Figure 1 shows the mapping of bag value X in the example above on Monet. The rectangles correspond to binary tables and the ovals together constitute the structure function that allows reconstructing the structured value X from its decomposition. The structure function used in this example is:


BAG< indexX, TUPLE< Atomic, Atomic, BAG< indexA2, Atomic, idsA2 > > >;

[Figure 1: The mapping of the MOA data model on Monet's binary tables]
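To give an impression of what full vertical decomposition amounts to, the following Python sketch (our own simplification; the actual mapping uses the structure function and index BATs of Figure 1) flattens a nested bag value into two-column tables.

# Sketch of full vertical decomposition [CK85]: every attribute of the
# nested value goes into its own two-column table, keyed by object id.
X = [(1, "a", [10, 11]), (2, "b", [12])]   # a bag of tuples (int, str, bag of int)

attr0, attr1 = [], []   # tuple oid -> atomic attribute value
nested = []             # tuple oid -> oid of an element of its nested bag
values = []             # element oid -> atomic value of that element

eid = 0
for oid, (i, s, bag) in enumerate(X):
    attr0.append((oid, i))
    attr1.append((oid, s))
    for v in bag:
        nested.append((oid, eid))
        values.append((eid, v))
        eid += 1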

There is a one-to-one correspondence between a structure function and the structured type that is mapped via the structure function.

[Figure 2: MOA query execution by translation to MIL]

The idea behind the algebra implementation is to translate a query on the representation of the structured operands into a representation of the structured query result. Figure 2 illustrates this process: the query is a MOA expression


on a structure expression on BATs, and its translation is a MIL program on the operand BATs that generates result BATs, which in turn are operands of another structure expression that represents the result.
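As an illustration of this translation idea, the following Python sketch (our own simplification) shows how a logical select on a decomposed bag of tuples becomes a select on one attribute table followed by a semijoin that restricts the remaining attribute table.

# Translating select[attr(THIS,0) = 1](X) on a vertically decomposed bag:
attr0 = [(0, 1), (1, 2), (2, 1)]        # oid -> integer attribute
attr1 = [(0, "a"), (1, "b"), (2, "c")]  # oid -> string attribute

# 1. select on the BAT that holds the tested attribute
hits = [(oid, v) for oid, v in attr0 if v == 1]

# 2. semijoin the other attribute BAT with the qualifying oids; together
#    the result BATs again represent a value of the operand's type
oids = {oid for oid, _ in hits}
rest = [(oid, v) for oid, v in attr1 if oid in oids]
print(hits, rest)    # [(0, 1), (2, 1)] [(0, 'a'), (2, 'c')]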

4 Integration of IR in MOA/Monet

In the following subsections, we introduce IR structures in MOA, and then describe the implementation of IR as algebraic database operations. We conclude with a discussion of the resulting design.

4.1 IR in MOA

Knowing the basics of MOA and its standard structures, we now introduce new structures for IR. We begin with the definition of DOCREP, a structure that represents the content of a document. The implementation of DOCREP encapsulates the document representation at the physical level that we introduce in section 4.2.2.

The combination of the structures BAG and DOCREP allows us to model document collections in MOA. The approach most similar to 'normal' IR systems is to specify a document collection as a BAG of DOCREPs. We can also use the standard structure TUPLE to model a document as a combination of content and logical structure, e.g. as a tuple of category and content. This allows us to select a subset of the collection based on a selection predicate on the document's category using standard MOA expressions. Example 1 shows the selection of 'News' documents from an extent docs defined with the structure definition for a document collection.

Example 1


document collection structure definition:

BAG< TUPLE< Category : str, Content : DOCREP > >;

its associated structure function:

docs ::= BAG>;

expression for subset selection:

select[ =( THIS.Category, "News" ) ]( docs );

When querying a collection on content, we compute a score to express the belief that a document is relevant to the query. The computation of this belief usually requires global statistics of the document collection. Because we may have several document collections in one database, we model the collection statistics explicitly in the DCSTAT structure. Also, when we reduce a collection using conditions on the logical structure of documents, we may in some cases want to use the statistics of the subcollection for content querying (e.g. when selecting documents that are news items), but in other cases use the statistics of the original collection (e.g. when selecting the documents written at the University of Twente). An explicit structure for the collection statistics makes both strategies possible.

In the current implementation, a DCSTAT is constructed using a collection name, the dictionary (or indexing vocabulary), the number of documents in the collection, and the document frequencies (df) of the indexing terms. Another approach to constructing a DCSTAT would be to define an aggregate method getStats in structure DOCREP that creates a new DCSTAT given a collection of documents.

DCSTAT's method nidf encapsulates the calculation of normalized inverse document frequency scores for the query terms in its operand. If we do not specify query terms, nidf returns tuples of term identifiers with normalized idf values for all terms in the indexing vocabulary. We can therefore also use a semijoin between its result and the query terms, obtaining the same answer as with the direct call. Example 2 shows the queries for both approaches, where stats and query are extents of the structure definitions of nidf's input arguments.

Example 2


nidf method definition:

in:  DCSTAT, BAG< oid >
out: TUPLE< termid: oid, nidf: dbl >

equivalent expressions for the idf of query terms:

nidf( stats, query );
semijoin[ THIS.nidf, THIS, TUPLE ]( nidf( stats ), query );

Because an IR retrieval model calculates a belief score for a document given a query, one would expect the full belief calculation to be specified in a method of DOCREP. Anticipating a retrieval model that we proposed for multimedia retrieval in [dVB98], we introduce an extra structure DOCNET to model belief combination. The complete process of belief computation is now divided into two steps. First, DOCREP method getBL constructs a DOCNET for each document in the collection. This DOCNET models the query terms and their beliefs. Next, the operations on DOCNET combine term beliefs into a final judgement per document. This separation of belief computation into two steps allows later extensions of our algebra with types for multimedia content, e.g. IMGREP, that can also produce DOCNETs given a query. Example 3 shows the corresponding MOA expression.

Example 3

map[sum(THIS)]( map[getBL(THIS, query, stats)]( docs ) );

We conclude this subsection with a combination of conditions on content and constraints on logical structure, exposing the full benefits of the integration of IR and databases. The expression in example 4 retrieves documents that match the query and are in the 'News' category.

Note that we do not introduce MOA as a language for end-users. We aim at support for a full-fledged declarative object query language like OQL [CBB+97]. MOA is only an implementation platform for such a language. At present, we have not yet implemented the translation of OQL to MOA. Therefore, our current system should be considered only as a basis for IR application development.

Example 4

map[sum(THIS)](
  map[getBL( THIS.Content, query, stats )](
    select[ =( THIS.Category, "News" ) ]( docs ) ));

4.2 Algebraic processing in IR implementation

In this subsection, we describe the bottom layer of our IR implementation. The MOA structures and expressions of the previous section are translated into operations in the Monet database kernel. We first introduce Monet's extension mechanisms [BK95]. Next, we discuss the representation of document content in Monet's binary association tables (BATs). Finally, we describe an extension module that adds probabilistic reasoning to the database kernel, based on the well-known inference network retrieval model [Tur91], [TC92].

4.2.1 User-defined operators in Monet

Before we can perform the calculations necessary for belief computation in the database, we have to define new algebraic operations. Monet allows us to define new MIL operations in two ways: define new procedures in MIL, or add extensions written in C or C++.

To implement the ranking functions of a specific retrieval model, we prefer to write the functions that estimate probabilities directly in MIL. For a MIL procedure that computes the tf component following equation 1, refer to ntf in example 5. After collecting the necessary tf and max tf values in synced BATs tfs and maxtfs, the MIL expression [ntf](tfs,maxtfs) computes the normalized term frequencies.

Example 5

# Normalized term frequency
PROC ntf( tf, maxtf ) := {
    RETURN 0.4 + 0.6*(log10(tf + 0.5)/log10(maxtf + 1.0));
}

Because there is as yet no 'best' model for information retrieval, many different ranking formulas are used in experiments with a single retrieval model (see e.g. the variations between tf expressions used in different experiments with InQuery [Tur91], [CCB95], and [ACC+97]). Implementing the ranking formulas as MIL procedures is very convenient for experimentation; we do not have to recompile our system every time we want to try a new ranking formula.

Another way to add new functionality to Monet uses its extension system. The extension system supports user-defined data types, user-defined search accelerators, and user-defined functions [BQK96]. In section 4.2.3, we apply Monet's extensibility to implement special probabilistic operators. A different application of extension modules is interfacing to code written by others. In a trice, we added Frakes and Cox's implementation of Porter's stemming algorithm [FBY92] to our system; see operation porter in example 6.


Example 6

>module(stemmer);
>stemmed_ws := [porter](ws);
>print( ws, stemmed_ws );
#-----------------------------------------#
# BAT: ws                | stemmed_ws     #
# (oid)      (str)       | (str)          #
#-----------------------------------------#
[ 0@0,       "test",       "test"    ]
[ 1@0,       "testing",    "test"    ]
[ 2@0,       "policy",     "polici"  ]
[ 3@0,       "police",     "polic"   ]
[ 4@0,       "officer",    "offic"   ]

4.2.2 Document representations in Monet

The information about the documents that is used in the IR process must be stored in BATs. We term this the flattened document representation. At first sight, the mapping may seem clumsy and merely an introduction of extra complexity. This representation is however the foundation of an algebraic implementation of IR in Monet, and hence the key to data independence, query optimization and parallelization. Here, we take a closer look at the flattened representation of documents and give an impression of the operations issued at the execution level.

To perform IR query processing as algebraic manipulation of BATs, we first define a flattened representation of a document collection. Table 2 shows an example of a flattened collection containing three documents.[1] We define three synced BATs, dj, ti, and tfij. These BATs store the frequency tf_ij of term t_i in document d_j, for each term t_i occurring in document d_j.

When computing document scores for query q we proceed as follows. First, we join the query terms with the document terms. Next, using additional joins, we look up the document ids and term frequencies. Note that these joins are executed very efficiently, because we use synced BATs. Assuming the belief in a term is given by equation 1, we need the document-specific values of max tf. Given our flattened document representation, these are computed using the set-aggregate version of max. The normalized term frequencies are computed from the tf and max tf tables with MIL procedure ntf. Similarly, we can compute normalized inverse document frequencies from the collection statistics, and use these in combination with normalized term frequencies and document lengths to produce a wide variety of ranking formulas.

[1] Notice that we do not model proximity information in our current implementation. Using Monet's extensibility of data types, we expect no problems providing an efficient implementation of proximity operators; see also [BCC94]. The implementation of an ADT for word location lists bears similarity to the polygon ADT in the GIS extensions [BQK96].


document collection:

d1 : accac
d2 : aebbe
d3 : bcdbbc

dj   ti   tfij
1    a    2
1    c    3
2    a    1
2    b    2
2    e    2
3    b    3
3    c    2
3    d    1

query: a b

intermediate results:

qdj   qti   qtfij   qntfij
1     a     2       0.796578
2     a     1       0.621442
2     b     2       0.900426
3     b     3       0.942206

Table 2: Representation of documents in BATs
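The processing steps just described can be traced on the data of Table 2. The following Python sketch (ours) mirrors the join on the query terms, the {max} set-aggregate, and the ntf procedure, and reproduces the qntfij column.

import math

# The flattened collection of Table 2: three synced BATs.
dj    = [1, 1, 2, 2, 2, 3, 3, 3]
ti    = ["a", "c", "a", "b", "e", "b", "c", "d"]
tfij  = [2, 3, 1, 2, 2, 3, 2, 1]
query = ["a", "b"]

def ntf(tf, maxtf):   # equation 1, as in the MIL procedure ntf of example 5
    return 0.4 + 0.6 * (math.log10(tf + 0.5) / math.log10(maxtf + 1.0))

# join the query terms with the document terms, keeping doc ids and tfs
hits = [(d, t, tf) for d, t, tf in zip(dj, ti, tfij) if t in query]

# {max} set-aggregate over the head: document-specific max tf values
maxtf = {}
for d, tf in zip(dj, tfij):
    maxtf[d] = max(maxtf.get(d, 0), tf)

for d, t, tf in hits:
    print(d, t, tf, round(ntf(tf, maxtf[d]), 6))
# 1 a 2 0.796578 / 2 a 1 0.621442 / 2 b 2 0.900426 / 3 b 3 0.942206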

4.2.3 Probabilistic reasoning in Monet

Turtle and Croft introduced the inference network retrieval model in IR, based on a restricted class of Bayesian inference networks [Tur91], [TC92]. An efficient implementation of this model, the InQuery system, has been shown to be very effective in the TREC evaluations [CCB95], [ACC+97]. Wong et al. showed that it is theoretically possible to model inference in generic Bayesian networks as ordinary queries in a relational database with an additional probabilistic join operator [WXN94]. Vasanthakumar et al. implemented a somewhat similar procedure to compute the probabilities of the inference network retrieval model as SQL queries [VCC96]. User-defined functions add the necessary probabilistic operators to compute beliefs to a DEC Rdb V6.0 relational DBMS. After the term-based probability estimates have been calculated, a combined score is computed to express the belief in a document given the query.

Our approach to integrating probabilistic reasoning in Monet is similar to the implementation in [VCC96]. The probabilistic operator to combine belief lists, PICEval, is based on the evaluation of PIC matrices as described in [Gre96], [GCT97], and [GCT98]. A problem with the implementation of probabilistic reasoning in [VCC96] is the

computation of a full outer join between the terms that occur in the query and the terms that occur in the documents. If the query consists of n_q query terms and the document collection consists of N_d documents, then the intermediate result requires space for n_q · N_d records. Query terms that do not occur in a document also take up space in the intermediate result. Because most documents usually contain only some of the query terms, this intermediate result is likely to be a long but sparse table.[2] The full outer join is only used to assign default beliefs to those terms that do not occur in the document.

[2] Even when the user enters a short query, the processing of relevance feedback [Rob90], [HC93] and techniques like local context analysis [XC96] add many terms to an initial query.

Instead of computing the full outer join, we handle the default beliefs from within the probabilistic operators. To this end, we developed queryPICEval, a special version of PICEval with the query as an extra operand. For the terms occurring in the query that are not listed in the intermediate result, the default belief is inserted while combining the evidence. To illustrate the difference in processing between the two approaches, we present a short example in table 3.

doc   term   belief
d1    a      0.56
d1    b      default
d2    a      0.67
d2    b      0.82
d3    a      default
d3    b      0.73

With outer join

doc   term   belief
d1    a      0.56
d2    a      0.67
d2    b      0.82
d3    b      0.73

Without outer join

Table 3: Intermediate tables during probabilistic inference.

In the following MIL example, the term beliefs are combined using InQuery's probabilistic #sum operator:

pic := newPIC( "sum", query.count );
docscores := {queryPICEval}(qdj, qti, qbelief, query, pic);
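The following Python sketch (ours) illustrates the idea behind queryPICEval on the 'without outer join' table of Table 3. For illustration we assume that #sum averages the term beliefs and that the default belief is 0.4; the real operator evaluates a PIC matrix instead.

# Combine term beliefs per document without the outer join: query terms
# missing from the sparse belief list get the default belief during the
# combination. Assumption: #sum = average, default belief = 0.4.
DEFAULT_BELIEF = 0.4

def combine_sum(doc_beliefs, query):
    beliefs = [doc_beliefs.get(t, DEFAULT_BELIEF) for t in query]
    return sum(beliefs) / len(beliefs)

# The sparse intermediate result of Table 3 ("without outer join"):
sparse = {"d1": {"a": 0.56}, "d2": {"a": 0.67, "b": 0.82}, "d3": {"b": 0.73}}
query = ["a", "b"]
print({d: combine_sum(b, query) for d, b in sparse.items()})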

4.3 Discussion

In the previous subsections, we gave an overview of our implementation of IR query processing in the MOA/Monet environment, using a mixture of kernel operators and user-defined functions. Here, we summarize the benefits of the data independence offered by our design based on structural object-orientation.

The kernel chooses at runtime from a variety of algorithms to execute the algebraic operations as efficiently as possible. Similar to the use of indices in relational databases, we may increase performance by defining access structures on the BATs. When using a parallel machine, the IR code automatically benefits from the implementation of parallel kernel algorithms.

The separation between physical execution and the specification of a retrieval model in algebraic operations is very useful for experimental IR research. The researcher can experiment with new retrieval models without worrying too much about low-level implementation issues like hash tables, index trees, and parallelization. The separation between the MIL programs in the database and the structure specified in MOA also supports the design of IR experiments: the MOA expressions that model the experiment do not change when we vary the retrieval model.

The combination of MOA and its extensions for IR makes it possible to combine constraints on the content of documents with constraints on their logical structure. Moreover, we can easily manage several versions of the same underlying collection, e.g. using different stemming algorithms or extracting concepts with varying NLP techniques.

These benefits can in principle also be achieved at the Monet level. However, writing MIL programs on the flattened document representation requires detailed knowledge of the mapping from documents to BATs. The expression in example 4 is translated into a MIL program consisting of 36 operations, and is already quite complex to comprehend, let alone construct. Clearly, MOA expressions have a great advantage over low-level BAT manipulation. Remember though that we do not intend MOA as a language for the end-user; only a language at the calculus level (e.g. OQL) is suitable for the average (expert) database user.

5 Algebraic optimization in MOA

The main goal of following the algebraic approach in our design is algebraic query optimization. In this section, we discuss the benefits of our design in this respect. We first give a short example of query optimization at the level of MOA expressions. We then distinguish three types of optimization that have to be taken into account when developing MOA extensions. First, using MOA structures allows a better, task-specific mapping to the relational database kernel. Second, several semantically identical query translations may be possible for the same MOA expression. A query optimizer is likely to have trouble generating these equivalent expressions, and we argue that the extension model should provide alternative expressions. The best alternative can then be chosen by the query optimizer, using information about the runtime state of the database. Finally, we may use search accelerators like materialized views and indices on path expressions within the implementation of structures.

5.1 Query optimization of MOA expressions

Recall the MOA expression in example 4. This expression retrieves the belief scores of those documents that are in the 'News' category. A semantically equivalent expression is given in example 7. Here, we specify that after computing belief scores of all documents, we are only interested in those documents that are in the 'News' category.

Example 7

map[THIS.DocBeliefs](
  select[=(THIS.Category,"News")](
    map[TUPLE< Category: THIS.Category,
               DocBeliefs: sum(THIS.TermBeliefs) >](
      map[TUPLE< Category: THIS.Category,
                 TermBeliefs: getBL(THIS.Content, query, stats) >](
        docs ))));

The process of query optimization should rewrite the expression of example 7 into the expression of example 4 before executing the automatically generated MIL program. The underlying principle in this example is known as 'push select down'. In the current implementation, we do not have a good query optimizer. It is however important to realize that this optimization is hard, if not impossible, to achieve without following the algebraic approach. Our implementation using an object algebra, on the other hand, can clearly be adapted to perform this style of optimization.

5.2 Defining structures to increase efficiency

Schek and Pistor have previously proposed the NF2 algebra for the integration of IR and databases [SP82]. Thus, the document content can also be modelled in standard MOA, using a nested set of normal tuples. Queries using our DOCREP structure are however not only simpler to understand (better data independence), but also require fewer algebraic operations to be processed. The mapping to MIL of the NF2 equivalent of example 1 requires 10 joins and a select; the DOCREP version uses only 6 joins and the same select.[3] Similarly, the query in example 2 with the query terms as operand of the nidf implementation for DCSTAT can be translated with far fewer operations than the standard MOA query using the semijoin. The extra knowledge encoded in new structures can thus contribute to the generation of more efficient MIL programs. Note that although we introduce this type of performance improvement in a section on query optimization, it is important to realize that it takes place beforehand, at design time, and does not further improve query processing at runtime.

[3] Of course, it is important to realize that in general, the number of operations is not directly proportional to the amount of processing that is required.

5.3 Defining alternative mappings

In theory, query optimization comes down to searching the space of semantically equivalent algebraic expressions for the one expression with the highest estimated performance. Because this search problem is NP-complete, real-life query optimizers only consider a small subspace of all possible candidate expressions. Often, the best expressions are not generated automatically. The designers of MOA structures have much knowledge about the high-level process supported by the structure. We therefore propose that the designers should provide a set of semantically equivalent algebraic expressions, which may then be used as candidate expressions by a query optimizer.

For instance, consider belief computation in the DOCREP structure. When we compute a ranking function like equation 1, two competing strategies can calculate the required max tf values. First, we can compute max tf for all documents, and then select the values that are needed for the query under consideration:

djmaxtfij := {max}( join( dj.mirror, tfij ) );
join( qdj, djmaxtfij );

An alternative approach is to preselect the document identifiers of the documents in which the query terms occur. This way, the max tf values are only computed for those documents that are relevant to the query.

djreq := semijoin( dj.mirror, qdj.mirror );
djmaxtfij := {max}( join( djreq, tfij ) );
join( qdj, djmaxtfij );

The choice between the two evaluation strategies should be made by the query optimizer. The second approach may be more efficient than the first if the query terms do not occur in many documents. Although the alternatives in the previous example may well be generated automatically (the rewrite follows the well-known 'push selects down the tree' strategy), we found several alternatives for complex high-level operations that are not so likely to be found by a query optimizer. The rewriting process of MOA structures should therefore be adapted to dynamically rewrite structures, selecting the best alternative based on the run-time data distribution.
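On the toy data of Table 2, the two strategies can be mimicked in a few lines of Python (our illustration of the MIL plans above); the assert confirms that both compute the same max tf values for the queried documents.

# The BATs of Table 2 as (head, tail) pairs.
dj   = [(0, 1), (1, 1), (2, 2), (3, 2), (4, 2), (5, 3), (6, 3), (7, 3)]
tfij = [(0, 2), (1, 3), (2, 1), (3, 2), (4, 2), (5, 3), (6, 2), (7, 1)]
qdj  = [(0, 1), (1, 2), (2, 2), (3, 3)]   # query hits -> doc ids

def maxtf_per_doc(pairs):                 # the {max} set-aggregate
    out = {}
    for d, tf in pairs:
        out[d] = max(out.get(d, 0), tf)
    return out

# Strategy 1: compute max tf for all documents, then pick the queried ones.
all_max = maxtf_per_doc((d, tf) for (_, d), (_, tf) in zip(dj, tfij))
s1 = [(q, all_max[d]) for q, d in qdj]

# Strategy 2: preselect the documents hit by the query, then aggregate.
hit_docs = {d for _, d in qdj}
sel_max = maxtf_per_doc((d, tf) for (_, d), (_, tf) in zip(dj, tfij)
                        if d in hit_docs)
s2 = [(q, sel_max[d]) for q, d in qdj]
assert s1 == s2    # same result; the costs differ with term selectivity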

5.4 Search accelerators in MOA extensions

Finally, we identified a third type of optimization that should be taken into account when designing structures. Consider again the previous example of the computation of max tf values. Instead of computing these results on-the-fly whenever they are needed, we can also store this intermediate result as a materialized view on djmaxtfij. Similarly, we may want to define access structures on the results of complex computations [SM96]. In the context of multimedia data, we may even want to create such an access structure without maintaining the results themselves in the database. The next time we need the answer of this computation, we only use the access structure as a filter to reduce the set of objects for which the expression must be recomputed.

These types of optimization can of course be hard-coded within the implementation of structures. We believe however that the application of such search accelerators should be orthogonal to the specification of the mapping of the complex structure to the flattened database model. An open question though is how these accelerators can be defined. In relational databases, the user defines indexes using the SQL CREATE INDEX command. However, these indexes are defined on tables that exist in the schema. In our case, the tables and functions for which we want to define search accelerators are hidden within the structure's mapping and may not be visible in the schema at all. The identification of useful search accelerators may only be solved as an additional task for the MOA rewriter, either interactively with the user or automatically.

6 Conclusions and future research

We reported on the integration of IR and databases using a structural object-oriented approach to database design. At the highest level, we integrated a well-known information retrieval model in the MOA object algebra. We defined a mapping of the belief computations onto algebraic operations in a database kernel based on binary associations, with a strong emphasis on set-at-a-time operations. This approach provides data independence between the logical and the physical database. As a result, parallelization and the selection of access structures have become orthogonal to the development of an information retrieval model. More importantly, an algebraic approach with an object-oriented interface allows algebraic query optimization.

We demonstrated the benefits of an integrated approach to IR and databases, and gave several examples illustrating the kind of queries that can be specified in our system. We discussed the role of query optimization for the efficient processing of queries that combine logical structure and content. We also argued that well-designed structures may increase efficiency and should support the process of query optimization.


Our research goals include the implementation of a multimedia digital library, following the structural object-oriented approach used in this paper. We also aim to implement a variety of text retrieval models, and to use Monet's extensibility to add ADTs for proximity information. In the near future, we plan to implement a LIST structure, and to study the optimization that can take place when we are only interested in the N best matches. With respect to optimization, we are especially interested in the role of new structures during query optimization. Finally, we want to scale up to very large document collections through the exploitation of parallelism. For this purpose, we plan to extend the MOA implementation to generate parallel MIL programs.

Acknowledgements

Jan Flokstra provided us with the infrastructure to make this work possible. We are also grateful to the Monet team in general for many explanations and support, and want to thank Peter Boncz in particular, without whom prototyping our ideas on Monet would have been impossible.

References

[ACC+97] J. Allan, J.P. Callan, W.B. Croft, L. Ballesteros, J. Broglio, J. Xu, and H. Shu. INQUERY at TREC-5. In Proceedings of the Fifth Text REtrieval Conference (TREC-5), pages 119–132, Gaithersburg, MD, 1997. National Institute of Standards and Technology. Special publication 500-238.

[BCC94] E.W. Brown, J.P. Callan, and W.B. Croft. Fast incremental indexing for full-text information retrieval. In Proceedings of the 20th International Conference on Very Large Databases (VLDB), Santiago, Chile, 1994.

[BK95] P.A. Boncz and M.L. Kersten. Monet: An impressionist sketch of an advanced database system. In BIWIT'95: Basque international workshop on information technology, July 1995.

[BQK96] P.A. Boncz, C.W. Quak, and M.L. Kersten. Monet and its geographic extensions. In Proceedings of the 1996 EDBT conference, 1996.

[BWK98] P. Boncz, A.N. Wilschut, and M.L. Kersten. Flattening an object algebra to provide performance. In Fourteenth International Conference on Data Engineering, pages 568–577, Orlando, Florida, February 1998.

[CBB+97] R.G.G. Cattell, D.K. Barry, D. Bartels, M. Berler, J. Eastman, S. Gamerman, D. Jordan, A. Springer, H. Strickland, and D. Wade. The Object Database Standard: ODMG 2.0. Morgan Kaufmann Publishers Inc., 1997.

[CCB95] J.P. Callan, W.B. Croft, and J. Broglio. TREC and TIPSTER experiments with INQUERY. Information Processing and Management, 31(3):327–343, 1995.

[CK85] G.P. Copeland and S.N. Khoshafian. A decomposition storage model. In Proceedings of the SIGMOD Conference, pages 268–279, 1985.

[DDSS95] S. DeFazio, A. Daoud, L.A. Smith, and J. Srinivasan. Integrating IR and RDBMS using cooperative indexing. In Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR'95), pages 84–92, 1995.

[DM97] S. Dessloch and N. Mattos. Integrating SQL databases with content-specific search engines. In Proceedings of the 23rd VLDB conference, Athens, Greece, 1997.

[dVB98] A.P. de Vries and H.M. Blanken. The relationship between IR and multimedia databases. In IRSG'98, Autrans, France, March 1998.

[dVvdVB97] A.P. de Vries, G.C. van der Veer, and H.M. Blanken. Let's talk about it: Dialogues with multimedia databases. To appear in Displays; also available as tech report CTIT 97-13 from the Centre for Telematics and Information Technology, The Netherlands, 1997.

[FBY92] William B. Frakes and Ricardo Baeza-Yates, editors. Information Retrieval: Data Structures & Algorithms. Prentice Hall, 1992.

[FR97] N. Fuhr and Th. Rölleke. A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Transactions on Information Systems, 15(1):32–66, January 1997.

[GCT97] W.R. Greiff, W.B. Croft, and H. Turtle. Computationally tractable probabilistic modeling of boolean operators. In Proceedings of the 20th annual international ACM SIGIR conference on research and development in information retrieval, pages 119–128, July 1997.

[GCT98] W. Greiff, W.B. Croft, and H. Turtle. PIC matrices: A computationally tractable class of probabilistic query operators. Technical Report IR-132, The Center for Intelligent Information Retrieval, 1998. Submitted to ACM TOIS.

[Gre96] W.R. Greiff. Computationally tractable, conceptually plausible classes of link matrices for the INQUERY inference network. Technical Report TR-96-66, University of Massachusetts, Amherst, Massachusetts, September 1996.

[GTZ93] J. Gu, U. Thiel, and J. Zhao. Efficient retrieval of complex objects: Query processing in a hybrid DB and IR system. In Proceedings of the 1st German National Conference on Information Retrieval, 1993.

[HC93] D. Haines and W.B. Croft. Relevance feedback and inference networks. In Proceedings of the sixteenth annual international ACM SIGIR conference on research and development in information retrieval (SIGIR'93), pages 2–11, 1993.

[HCL+90] L.M. Haas, W. Chang, G.M. Lohman, J. McPherson, P.F. Wilms, G. Lapis, B. Lindsay, H. Pirahesh, M. Carey, and E. Shekita. Starburst mid-flight: As the dust clears. IEEE Transactions on Knowledge and Data Engineering, 2(1):143–160, March 1990.

[LLRS97] L.V.S. Lakshmanan, N. Leone, R. Ross, and V.S. Subrahmanian. ProbView: A flexible probabilistic database system. ACM Transactions on Database Systems, 22(3):419–469, September 1997.

[Mac91] I.A. Macleod. Text retrieval and the relational model. Journal of the American society for information science, 42(3):155–165, 1991.

[Mil97] T.J. Mills. Content modelling in multimedia information retrieval systems. The Cobra retrieval system. PhD thesis, University of Cambridge, July 1997.

[Miz98] S. Mizzaro. How many relevances in information retrieval? Interacting With Computers, 10(3):305–322, 1998. In press.

[MMR97] T.J. Mills, K. Moody, and K. Rodden. Cobra: a new approach to IR system design. In Proceedings of RIAO'97, pages 425–449, July 1997.

[MRT91] C. Meghini, F. Rabitti, and C. Thanos. Conceptual modeling of multimedia documents. IEEE Computer, 24(10):23–30, October 1991.

[Rob90] S.E. Robertson. On term selection for query expansion. Journal of documentation, 46(4):359–364, 1990.

[SM96] M. Stonebraker and D. Moore. Object-relational DBMSs: The next great wave. Morgan Kaufmann Publishers, Inc., 1996.

[SP82] H.-J. Schek and P. Pistor. Data structures for an integrated data base management and information retrieval system. In Proceedings of the Eighth International Conference on Very Large Data Bases, pages 197–207, Mexico City, 1982.

[TC92] H.R. Turtle and W.B. Croft. A comparison of text retrieval models. The computer journal, 35(3):279–290, 1992.

[Tur91] H.R. Turtle. Inference networks for document retrieval. PhD thesis, University of Massachusetts, 1991.

[VCC96] S.R. Vasanthakumar, J.P. Callan, and W.B. Croft. Integrating INQUERY with an RDBMS to support text retrieval. Bulletin of the technical committee on data engineering, 19(1):24–34, March 1996.

[vR79] C.J. van Rijsbergen. Information retrieval. Butterworths, London, 2nd edition, 1979. Out of print, available online from http://www.dcs.glasgow.ac.uk/Keith/Preface.html.

[WXN94] S.K.M. Wong, Y. Xiang, and X. Nie. Representation of Bayesian networks as relational databases. In Advances in intelligent computing - IPMU '94, pages 117–130, Paris, France, 1994. Springer.

[WY95] S.K.M. Wong and Y.Y. Yao. On modeling information retrieval with probabilistic inference. ACM Transactions on Information Systems, 13(1):38–68, January 1995.

[XC96] J. Xu and W.B. Croft. Query expansion using local and global document analysis. In Proceedings of the 19th International Conference on Research and Development in Information Retrieval (SIGIR '96), pages 4–11, Zürich, Switzerland, 1996.

[ZM98] J. Zobel and A. Moffat. Exploring the similarity space. SIGIR Forum, 32(1), Spring 1998.