An Evaluation of the Incorporation of a Semantic Network into a Multidimensional Retrieval Engine

Jinho Lee, David Grossman, Ratko Orlandic
Information Retrieval Laboratory
Department of Computer Science
Illinois Institute of Technology
Chicago, IL 60616

{leejin2, grossman, orlandic}@iit.edu

ABSTRACT

This paper describes a new method for incorporating a hierarchical category dimension into an Information Retrieval framework. The approach uses the synonym sets and the hyponym (“is-a”) relations defined within Wordnet to derive a conceptual hierarchical category dimension. The hierarchical nature of a category dimension not only provides an overview of a set of documents but also improves the effectiveness and the efficiency of searching documents. An evaluation is performed on two different types of models, and the multidimensional approach shows a significant reduction in the number of page accesses over a large document collection.

Categories & Subject Descriptors: Add here

General Terms: Add here

1. INTRODUCTION

We have previously demonstrated that the Multidimensional Information Retrieval Engine (MIRE) has the ability to integrate structured data and text with hierarchical support for LOCATION and TIME dimensions [10]. The multidimensional nature of the system has provided efficient handling of hierarchical structured data and typical OLAP functionalities such as “roll up” and “drill down” on specific dimensions. The goal of this research is to extend that work by incorporating a semantic network such as Wordnet [2, 3, 13, 15] into our system as a category dimension. The rest of this paper is organized into five sections. Section 2 describes the background work. Section 3 shows the design of the Multidimensional Information Retrieval Engine (MIRE). Section 4 proposes a way to automatically derive a hierarchical category dimension and Section 5 presents the experimental results. Conclusions and future work are discussed in Section 6.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CIKM’03, November 3–8, 2003, New Orleans, Louisiana, USA. Copyright 2003 ACM 1-58113-723-0/03/0011…$5.00.

2. BACKGROUND

2.1 Multidimensional Access Methods

Multidimensional access methods were originally developed for spatial applications and used to support similarity searching. An excellent survey of these methods is found in [5]. Early multidimensional data structures used main memory and did not consider secondary storage. However, many access methods, such as the LSD tree [7] and the Buddy tree [14], can effectively deal with data on secondary storage. The k-d-b tree [11], which is a generalization of the B-tree, is one of the most popular of these. More recently, many multidimensional access methods such as the SS-tree [16], the X-tree [1], and the SR-tree [9] were designed for data in high-dimensional spaces.

2.2 Deriving Conceptual Hierarchy

Numerous approaches automatically derive hierarchical categories from text. Hearst found that certain key phrases could be used to indicate hyponym (“is-a”) and hypernym (“is-a-typeof”) relations. Sentences that contain these special key phrases were parsed to identify hierarchical relations [6]. Woods used phrase analysis and a large knowledge base to arrange terms into a concept hierarchy [17]. Sanderson built concept hierarchies through the use of a simple term association technique called subsumption. For two terms, x and y, x is said to subsume y if the documents that contain y are a subset of the documents that contain x. This concept hierarchy has been able to significantly improve the users’ understanding of documents [12]. More recently, Lawrie, Croft, and Rosenberg used probabilistic language models to find topic words for hierarchical summarization [4]. They recast the language model as a graph where vertices represent terms and edges represent the affinity between two vertices. This enabled them to restate the problem as the Dominating Set Problem (DSP). Through this approach, they were able to find topic terms that have maximal predictive power and generally capture the topics of documents. To create subsequent levels, they used different conditional probabilities that consider the co-occurrence of the parent terms as an important factor.
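As an illustration of the subsumption relation described in Section 2.2, the following is a minimal sketch that applies the strict subset definition stated there (x subsumes y if the documents containing y are a subset of those containing x). The function names `build_postings` and `subsumes` and the toy collection are assumptions made for this example; they are not taken from [12].

```python
from collections import defaultdict


def build_postings(docs):
    """Map each term to the set of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return postings


def subsumes(postings, x, y):
    """Return True if every document containing y also contains x."""
    docs_x = postings.get(x, set())
    docs_y = postings.get(y, set())
    # A term that occurs nowhere is not considered subsumed by anything.
    return bool(docs_y) and docs_y <= docs_x


# Hypothetical toy collection, for illustration only.
docs = {
    1: "animal dog",
    2: "animal cat",
    3: "animal dog bark",
    4: "plant tree",
}
postings = build_postings(docs)
print(subsumes(postings, "animal", "dog"))  # True
print(subsumes(postings, "dog", "animal"))  # False
```

In this toy collection "animal" would be placed above "dog" in the hierarchy, because every document mentioning "dog" also mentions "animal" while the reverse does not hold.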


3. DESIGN OF MIRE

In our prior work, three different approaches were considered for our system [10].

• BIRE: Build an Inverted Index for text and use three separate B-trees to access dimensional data in the LOCATION, TIME and CATEGORY dimensions.

• MIRE: Build an Inverted Index for text and use a single multidimensional access structure to manipulate hierarchical structured data in the LOCATION, TIME and CATEGORY dimensions.

• SMIRE: Build a single multidimensional access structure to handle TEXT, LOCATION, TIME, and CATEGORY together.

We have shown that the single multidimensional access structure (SMIRE) is not a viable model due to storage overhead [10]. Therefore, we implemented only MIRE and BIRE for this evaluation. We used a modified k-d-b tree due to its conceptual simplicity and good performance in low-dimensional space. We modified the split algorithm of the k-d-b tree to avoid a collision caused by an overflow, which happens when a split is performed on skewed data. When a collision occurs, the improved split algorithm uses a special axis as the split dimension. This special dimension has a clear uniqueness property, so it never overflows when a split is performed. In our implementation, the document identifier was used as the special dimension because every document identifier is unique.

A star schema used to model our document collection is shown in Figure 1.

[Figure: star schema; labels: Time (Year, Month, Day), Term, Weight, Location (Region, State, City), Category (1st Cat, 2nd Cat, 3rd Cat)]

Figure 1. Star Schema

The star schema in Figure 1 shows four basic dimensions and a typical fact table. The arrows along the dimensions indicate that those dimensions have hierarchies. The TIME dimension has a hierarchy of (Year, Month, Day) and the LOCATION dimension has a hierarchy of (Region, State, City). The TERM dimension includes information on each word or term that is indexed. The CATEGORY dimension describes various subject categories used to represent a document.

4. DERIVING THE CATEGORY DIMENSION

We use Wordnet [2, 3, 13, 15] to automatically derive a hierarchical category. To build a hierarchical category dimension, each document was parsed and a synonym set was identified for each non-stop-word. We then followed the “is-a” relationship up to the root of the hierarchy. If a term has multiple meanings, all the senses and their ancestors were also visited. We assigned each term its tf-idf weight, and this weight was accumulated at each synonym set visited. A map is used to maintain the synonym sets and the corresponding weights. Simply tracking a visit count does not reflect the importance of a term appearing in a document. Thus, we combine the importance of a term and the visit count so that the result provides a more descriptive category for a document. This synonym set hierarchy is constructed for each document.

The recursive algorithm that tracks the weight is shown in Figure 2.

RecursiveTrace (SynsetPtr sPtr, new_weight) {
    while (sPtr) {
        if sPtr exists in the map {
            // add up the tf-idf weight
            weight = weight + new_weight;
        } else {
            insert a new synonym set into the map with the initial weight of a term
        }
        // follow up the hierarchy recursively up to the root
        RecursiveTrace (sPtr->ptrlist, new_weight);
        // if a term has multiple senses
        // go through all senses
        sPtr = sPtr->nextss;
    }
}

Figure 2. Recursive Algorithm for Weight Tracking

To acquire a hierarchical category for each document, root synonym sets are first searched from the synonym set hierarchy obtained from the previous step. The root synonym set that has the largest weight becomes the first category. A subsequent category is chosen from its child synonym sets. Among the child synonym sets, the synonym set having the largest weight becomes the second-level category. In a similar manner, all levels of subsequent categories are obtained for each document. For the LOCATION dimension, state and city are mapped to one integer value using the following bit operations.

Location = ((state index)
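The weight tracking and the top-down category selection described in Section 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated in this paper: the Wordnet lookups are replaced by two small hypothetical dictionaries (`HYPERNYMS` and `SENSES`), and the function names `trace`, `document_weights` and `category_path` are invented for the example. As in the description above, a synonym set accumulates the tf-idf weight on every visit, so an ancestor shared by several senses is counted once per visit.

```python
from collections import defaultdict

# Hypothetical miniature "is-a" hierarchy: synonym set -> parent synonym sets.
HYPERNYMS = {
    "entity": [],
    "animal": ["entity"],
    "artifact": ["entity"],
    "dog": ["animal"],
    "cat": ["animal"],
    "car": ["artifact"],
}
# Hypothetical term -> senses lookup; a term may have several senses.
SENSES = {"dog": ["dog"], "cat": ["cat"], "jaguar": ["cat", "car"]}


def trace(synset, weight, weights):
    """Add `weight` to `synset` and to every ancestor up to the root."""
    weights[synset] += weight
    for parent in HYPERNYMS[synset]:
        trace(parent, weight, weights)


def document_weights(term_weights):
    """Build the synonym set -> accumulated weight map for one document."""
    weights = defaultdict(float)
    for term, tfidf in term_weights.items():
        for synset in SENSES.get(term, []):  # visit all senses of the term
            trace(synset, tfidf, weights)
    return weights


def category_path(weights):
    """Pick the heaviest root, then repeatedly the heaviest visited child."""
    children = defaultdict(list)
    for child, parents in HYPERNYMS.items():
        for parent in parents:
            children[parent].append(child)
    candidates = [s for s in weights if not HYPERNYMS[s]]  # visited roots
    path = []
    while candidates:
        best = max(candidates, key=lambda s: weights[s])
        path.append(best)
        candidates = [c for c in children[best] if c in weights]
    return path


# Hypothetical tf-idf weights of the terms in one document.
w = document_weights({"dog": 2.0, "cat": 1.0, "jaguar": 0.5})
print(category_path(w))  # ['entity', 'animal', 'dog']
```

In this example the document is assigned the category path entity, animal, dog: "animal" outweighs "artifact" at the second level because three of the four visited senses pass through it.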