MIRE: A Multidimensional Information Retrieval Engine for Structured Data and Text
Jinho Lee, David Grossman, Ratko Orlandic
Information Retrieval Laboratory, Department of Computer Science, Illinois Institute of Technology
{leejin,dagr}@ir.iit.edu, [email protected]

Abstract
This paper presents an original information-retrieval engine, called MIRE, for integrating structured data and text. Among other things, MIRE is designed to work in a natural and efficient way with the inherent hierarchies of structured data. Although multidimensional access methods were originally developed for spatial applications, they can be successfully used to index hierarchical structured data and to add to an existing information-retrieval engine the capability of navigating hierarchical dimensions. To support this capability, MIRE enhances the processing algorithms of an existing multidimensional access method to avoid overflow and to support hierarchical dimensions. Compared to a search engine that maintains a separate index for each type of search, the multidimensional approach shows a significant reduction in the number of page accesses over a large document collection.

1. Introduction
As the volume of information grows, the need to integrate large quantities of structured and unstructured data has become much more pronounced. User requests such as "Find all documents that contain the term data mining and were published in Chicago during 2001" require a search of both structured and unstructured data. While addresses, phone numbers, and dates are typical examples of structured data, many pieces of information are represented as unstructured or semi-structured data, such as emails or resumes.

There are essentially four different approaches to integrating structured data and text. The first uses a single user interface that ties a database and an information retrieval system together [13]. The problems with this approach are maintaining consistency between two disparate systems and the difficulty of building middleware that adequately shields users from the underlying systems.

The second approach requires fundamental changes to the SQL standard to support text-based operations [11]. However, because the inherent portability of standard SQL is lost, this approach locks users into a single database system.

The third approach uses standard SQL to implement typical IR functionality. A relation is used to model an inverted index, and standard SQL is used to implement information retrieval functionality and relevance ranking [4]. This approach has been shown to be portable, as it can be implemented on any database management system (DBMS). To address slow query response times and the integrity of the indexed data, researchers are investigating the problems of clustering data as well as parallel data access and processing [20, 21]. While the relational database approach can be used for information retrieval, it lacks the ability to easily handle hierarchies within the structured data.

The fourth approach is to build an information retrieval engine on top of an OLAP (On-Line Analytic Processing) facility. OLAP applications are designed to support powerful analytical tasks over shared multidimensional data. This enables analysts, managers, and executives to gain insight into key measures of business performance [19]. New design approaches, such as the “star schema” and its variants, are being tailored to OLAP applications [10, 22]. The star schema consists of two kinds of relational tables: dimensional tables and fact tables. While dimensional tables contain characteristics of reference data (e.g., time, location, product, store, and salesperson), the fact tables contain measures such as amount-of-purchase or, for our purpose, word-appears-in-document. The fact table also contains a set of keys that are used to reference dimensional tables [22]. By modeling the text and hierarchical structured data in a star schema, users can take advantage of typical OLAP operations such as slicing, dicing, and drilling up and down. These operations add significantly to the users' understanding of the documents. As a result, users can interactively explore the documents returned from their queries and potentially gain relevant insight more quickly than by reading through a ranked list of documents [12].

We present the design of a new information-retrieval engine, called MIRE, which integrates an inverted index for text and a multidimensional access method for structured hierarchical data. Multidimensional access methods were originally developed for spatial applications and used to support multidimensional similarity searching in the sense of spatial proximity [8]. However, we find that multidimensional access methods can also be used to handle hierarchical structured data and can provide OLAP functionalities such as drilling up and down on a specific dimension. This, in turn, enables a degree of integration of structured data and text that has not been achieved before. We performed extensive tests on our multidimensional approach (MIRE) and on a single-dimensional approach. MIRE showed a significant reduction in the number of page accesses compared with the single-dimensional approach.

The rest of this paper is organized into five sections. Section 2 describes the background work, and Sections 3 and 4 describe the design of the Multidimensional Information Retrieval Engine (MIRE) and its functionalities. Section 5 presents the experimental results, and future work is discussed in Section 6.

Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC’02) 0-7695-1506-1/02 $17.00 © 2002 IEEE

2. Background
Prior work can be partitioned into two areas. Section 2.1 describes inverted indexes for information retrieval, and Section 2.2 discusses multidimensional access methods. Both are used, with modifications, as part of our Multidimensional Information Retrieval Engine (MIRE).

2.1 Inverted Indexes for IR
Typically, users issue a query over a document collection and obtain documents ranked by some measure of relevance. Numerous strategies exist to improve the computation of relevance; a commonly used one is the vector space model (VSM) [17]. In this model, documents and queries are represented as points in a vector space. Both documents and queries are represented as vectors of the form $D_i = \langle d_{i1}, d_{i2}, \ldots, d_{it} \rangle$ and $Q = \langle q_1, q_2, \ldots, q_t \rangle$, where $d_{ik}$ and $q_k$ represent the relevance of term $k$ with respect to a document or query, and $t$ is the number of distinct terms. The weighting factor $d_{ik}$ is the weight of term $k$ in document $D_i$ and can be calculated as $d_{ik} = tf_{ik} \cdot idf_k$, where $tf_{ik}$ denotes the frequency of term $k$ in document $D_i$, and $idf_k$ is calculated as $\log(N / n_k)$, where $N$ is the number of documents in the collection and $n_k$ denotes the number of documents containing term $k$. The weighting factor $q_k$ for a term in a query can be calculated in the same way as $d_{ik}$.

Modern search engines use different similarity measures to rank documents in terms of how closely they satisfy the user's query. To implement relevance ranking, we use the tf-idf measure. The similarity coefficient between a document and a query is defined by the vector product $SC(D_i, Q) = \sum_{k=1}^{t} d_{ik} \cdot q_k$. Searching for the most relevant document with respect to a given query is equivalent to finding the document with the highest value of $SC(D_i, Q)$.

The vector space model is typically implemented by using an inverted index. An inverted index contains a term dictionary, or lexicon, with an entry for each term. Associated with each term is a posting list. The posting list contains an entry for each distinct occurrence of a term in a document, together with the weight (tf-idf) of the term in that document [5].
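The weighting and ranking scheme above can be illustrated with a small in-memory sketch. It is not MIRE's implementation: the function names `build_inverted_index` and `rank`, and the use of Python dictionaries as posting lists, are assumptions made only for illustration.

```python
import math
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """docs maps doc_id -> list of terms. Returns (postings, idf).

    postings[term][doc_id] holds the weight d_ik = tf_ik * idf_k.
    """
    n_docs = len(docs)                                   # N
    tf = {d: Counter(terms) for d, terms in docs.items()}
    df = Counter(t for counts in tf.values() for t in counts)  # n_k
    idf = {t: math.log(n_docs / n_k) for t, n_k in df.items()}
    postings = defaultdict(dict)
    for d, counts in tf.items():
        for t, f in counts.items():
            postings[t][d] = f * idf[t]
    return postings, idf

def rank(query_terms, postings, idf):
    """Return (doc_id, SC) pairs sorted by decreasing SC(D_i, Q)."""
    # Query weights q_k are computed the same way as document weights.
    q = {t: f * idf[t] for t, f in Counter(query_terms).items() if t in idf}
    scores = defaultdict(float)
    for t, q_k in q.items():
        for d, d_ik in postings[t].items():
            scores[d] += d_ik * q_k                      # sum of d_ik * q_k
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

docs = {
    1: ["data", "mining", "data"],
    2: ["data", "retrieval"],
    3: ["text", "retrieval"],
}
postings, idf = build_inverted_index(docs)
ranking = rank(["data", "mining"], postings, idf)
```

Only the posting lists of the query terms are traversed, so documents that share no term with the query (document 3 here) never receive a score.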

2.2 Multidimensional Access Methods
Multidimensional access methods were originally developed for spatial applications, and numerous such methods have been proposed. An excellent survey can be found in [3]. Early multidimensional data structures were main-memory structures that did not take secondary storage into consideration. Even though the price of physical memory is rapidly decreasing, the amount of data to be handled continues to grow. Using main-memory structures for data that resides on disk does not provide better performance because there is no control over how the operating system performs disk access [3]. Many access methods, such as the LSD tree [7] and the buddy-tree [18], have been designed with secondary storage management in mind. Tree-based access methods such as the k-d-b tree or the hB-tree, a generalization of the B-tree, were also proposed. More recently, many multidimensional access methods, such as the SS-tree [24], X-tree [1], and SR-tree [9], were proposed for data with many dimensions.

3. Treat IR as an OLAP Application
A typical OLAP application has a fact table that contains atomic information about a particular fact and links to various dimensions. The classic example is a SALES fact table with dimensions such as LOCATION and TIME that indicate where and when the sale occurred. Often, these dimensions are hierarchical. A fact table for a document collection may contain basic information about a document, with links to TIME and LOCATION dimensions that identify when and where the document was published. These dimensions have hierarchies. The TIME dimension has a hierarchy of year, month, and day. Similarly, the LOCATION dimension has a hierarchy of region, state, county, and city. Other hierarchies may well be reasonable for other dimensions. The ORGANIZATION dimension, which identifies who wrote a document, has a hierarchy of university or laboratory, department or division, and individual person. The CATEGORY dimension, which describes the contents of a document, can have the hierarchy shown in Figure 1.
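As a rough illustration of the fact-table view described above, the following sketch models a document fact table with keys into TIME and LOCATION dimension tables and rolls documents up to a chosen hierarchy level. The sample rows, the dictionary layout, and the function name `roll_up` are hypothetical and serve only to show the idea.

```python
from collections import Counter

# Dimension tables: key -> hierarchy levels (hypothetical sample rows).
TIME = {
    1: {"year": 2001, "month": 3, "day": 14},
    2: {"year": 2001, "month": 7, "day": 2},
    3: {"year": 2000, "month": 7, "day": 9},
}
LOCATION = {
    10: {"region": "Midwest", "state": "Illinois", "city": "Chicago"},
    11: {"region": "Midwest", "state": "Illinois", "city": "Urbana"},
    12: {"region": "West", "state": "California", "city": "San Jose"},
}

# Fact table: one row per document, with keys referencing the dimensions.
DOCUMENT_FACTS = [
    {"doc_id": 100, "time_key": 1, "location_key": 10},
    {"doc_id": 101, "time_key": 2, "location_key": 11},
    {"doc_id": 102, "time_key": 3, "location_key": 10},
    {"doc_id": 103, "time_key": 2, "location_key": 12},
]

def roll_up(facts, time_level, location_level):
    """Count documents grouped at the given level of each hierarchy."""
    counts = Counter()
    for row in facts:
        t = TIME[row["time_key"]][time_level]
        loc = LOCATION[row["location_key"]][location_level]
        counts[(t, loc)] += 1
    return counts

by_year_state = roll_up(DOCUMENT_FACTS, "year", "state")   # drilled up
by_year_city = roll_up(DOCUMENT_FACTS, "year", "city")     # drilled down
```

Drilling down from state to city only changes the level at which the LOCATION key is resolved; the fact rows themselves are untouched.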

[Figure 1: tree diagram rooted at CATEGORY, with nodes labeled Medical, Computer, Legal, DBMS, O.S, and IR]

Figure 1: The hierarchy of the CATEGORY dimension.

This category hierarchy can be obtained from a thesaurus, from a concept hierarchy such as WordNet [23, 25], or from a manually derived set of related words. Using OLAP for document collections provides new functionalities such as browsing through a document collection and quickly identifying patterns surrounding where and when documents are written. We have previously used Microsoft OLAP (On-Line Analytic Processing) Services to build a prototype of MIRE (Multidimensional Information Retrieval Engine). With this prototype, we were able to verify that OLAP, when applied to information retrieval, provides several functionalities not found in typical search engines [12]. However, to address scalability issues, we have built a custom version of MIRE that does not rely upon any commercial products. We have taken three approaches into consideration:

1. Build an inverted index and use two separate B-trees to access the dimensional data in LOCATION and TIME. (BIRE)

2. Build an inverted index and use a single multidimensional access structure (modified k-d-b tree) to access the structured data in LOCATION and TIME. (MIRE)

3. Build a single multidimensional access structure (modified k-d-b tree) to access TEXT, LOCATION, and TIME together. (SMIRE)

SMIRE uses a single multidimensional access structure for the text, TIME, and LOCATION. A document has a one-to-one relationship with LOCATION and TIME, whereas it has a one-to-many relationship with terms. It turns out that SMIRE is not a feasible model because a single multidimensional access structure cannot handle two different types of relationship efficiently. To index LOCATION, TIME, and terms in one multidimensional access structure, the LOCATION and TIME attributes, which have a one-to-one relationship with the document, must be replicated for every term of the document, producing input vectors of the form (Document Id, term, term weight, LOCATION, TIME). The storage overhead grows in proportion to the number of documents. For a large document collection, SMIRE is therefore not a viable model.
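The replication argument against SMIRE can be made concrete with a small sketch. The record layout and the function names `smire_vectors` and `mire_entries` are hypothetical; the point is only to compare how many copies of the LOCATION and TIME values each design stores.

```python
def smire_vectors(docs):
    """SMIRE-style input: one (doc_id, term, weight, location, time)
    vector per term, so LOCATION and TIME are repeated for every term."""
    return [
        (d["id"], term, weight, d["location"], d["time"])
        for d in docs
        for term, weight in d["terms"].items()
    ]

def mire_entries(docs):
    """MIRE-style input: one structured point per document plus
    postings that carry no structured attributes."""
    structured = [(d["id"], d["location"], d["time"]) for d in docs]
    postings = [
        (term, d["id"], weight)
        for d in docs
        for term, weight in d["terms"].items()
    ]
    return structured, postings

docs = [
    {"id": 1, "location": "Chicago", "time": 2001,
     "terms": {"data": 0.4, "mining": 1.1, "index": 0.7}},
    {"id": 2, "location": "Urbana", "time": 2000,
     "terms": {"data": 0.4, "retrieval": 0.9}},
]

vectors = smire_vectors(docs)
structured, postings = mire_entries(docs)
smire_copies = len(vectors)      # copies of (LOCATION, TIME) under SMIRE
mire_copies = len(structured)    # copies of (LOCATION, TIME) under MIRE
```

In this toy collection SMIRE stores the structured attributes once per term occurrence, while the separated design stores them once per document.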

4. The Design of the Multidimensional Information Retrieval Engine (MIRE)
Since a single multidimensional access structure (SMIRE) is not a viable alternative, we implemented BIRE and MIRE and compared their performance. A structured query such as “Find all documents with Information Retrieval, which are published in Illinois during 2000” is processed in two steps. First, the LOCATION and TIME constraints are evaluated against the multidimensional access structure, and the set of documents retrieved from this structure is stored in an intermediate document set. Second, the documents that contain the term “Information Retrieval” are obtained from the inverted index lists and stored in another intermediate set. The final set of ranked documents is obtained by merging these two intermediate document sets.

While many multidimensional access structures have been proposed, we use a modified k-d-b tree structure [15] in our system because of its conceptual simplicity and relatively good performance in low-dimensional spaces. The modified k-d-b tree uses the first-division splitting algorithm of the LSD tree [7] and the buddy-tree [18]. This policy is based on a first-division plane that does not force a partition of the rectangles corresponding to the lower-level nodes. It avoids the downward propagation of splits associated with forced splitting and improves on the performance of the original k-d-b tree [2, 3]. Therefore, we used the first-division splitting algorithm to implement our multidimensional access structure.

One of the important tasks for implementers of a multidimensional access structure is to avoid collisions caused by overflow when splitting a point page. If too many identical values are found in one dimension and the split is performed on this dimension, a single node can overflow. To handle this problem, we adopted another modification of the original k-d-b tree. An extra dimension is initially added to the point vectors. The value of this extra dimension is an auto-increment number. If a collision cannot be resolved with the existing dimensions, this new dimension is used as the split dimension to avoid a collision due to overflow.

The next interesting issue is how multidimensional access methods handle hierarchical information such as LOCATION (region, state, city) and TIME (year, month, day).
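The role of the extra auto-increment dimension in page splitting can be sketched as follows. This is a simplified illustration, not the first-division splitting algorithm itself: it assumes a plain median split on each dimension, and the names `with_sequence`, `choose_split`, and `split_page` are hypothetical.

```python
from itertools import count

def with_sequence(vectors):
    """Append an auto-increment number as an extra, final dimension."""
    seq = count()
    return [tuple(v) + (next(seq),) for v in vectors]

def choose_split(points):
    """Pick (dimension, value) so that both sides of the split are
    non-empty. Real dimensions are tried first; the sequence dimension
    comes last and always separates a page of two or more points."""
    n_dims = len(points[0])
    for dim in range(n_dims):
        values = sorted(p[dim] for p in points)
        median = values[len(values) // 2]
        if values[0] < median:          # some point falls below the plane
            return dim, median
    raise ValueError("a page with fewer than two points cannot be split")

def split_page(points):
    """Split an overflowing point page into two pages."""
    dim, value = choose_split(points)
    left = [p for p in points if p[dim] < value]
    right = [p for p in points if p[dim] >= value]
    return dim, left, right

# Four documents with identical LOCATION and TIME values: no real
# dimension can separate them, so the sequence dimension is used.
identical = with_sequence([("Illinois", 2000)] * 4)
dim_a, left_a, right_a = split_page(identical)

# Here TIME can separate the points, so the extra dimension is not needed.
mixed = with_sequence([("Illinois", 2000), ("Illinois", 2001), ("Ohio", 2001)])
dim_b, left_b, right_b = split_page(mixed)
```

Because the sequence numbers are unique, a split on the extra dimension always leaves points on both sides, which is what prevents a page of identical values from overflowing indefinitely.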
The LOCATION dimension is a hierarchical dimension: it can be drilled down from region to state and from state to city, and it can be drilled up from city to state and from state to region. To support these hierarchical characteristics, we use the following technique. In our model, a location can be represented as a 4-byte value that is represented as (state
