Modeling and Querying Textual Data using E-R Models and SQL

Vipul Kashyap* and Marek Rusinkiewicz
Microelectronics and Computer Technology Corporation (MCC)
Austin, Texas 78759-5398, USA
[email protected]
April 14, 1997

Abstract

Recent emerging technologies such as internetworking and the World Wide Web have significantly expanded the types of data available to an information management system. Textual data is the most prevalent of these data types. In this paper we discuss an approach based on domain ontologies (expressed as E-R Models) and SQL for modeling and querying textual data as implemented in the InfoSleuth project at MCC. We identify the basic shortcomings of current approaches for textual data on the web: (a) lack of precision; and (b) lack of interoperation, and discuss how our approach helps alleviate the above shortcomings. Techniques for mapping concepts in a domain ontology to the underlying textual data and for translation of queries expressed in SQL to the underlying information retrieval operations are presented. Limitations of current indexing technologies in supporting expressions translated from SQL are identified and heuristic approaches are proposed to overcome the same.

1 Introduction

In recent years, there has been an explosion in the volume and variety of data available through the World Wide Web and related internetworking technologies. Most of this data is textual and typically has minimal structure associated with it. The existing technologies for accessing information on the Web use keyword-based search engines [KM91, Alt, Inf] that neither assume nor impose any structure on the underlying data. These approaches suffer from the following shortcomings:

Lack of Precision. When users on the Web specify a keyword-based query, they get a large number of irrelevant answers. The reasons for this may be identified as:

- Lack of an expressive query language. Most of the time, a user finds it difficult to express his information requirements as a collection of keywords.
- Limitations of current indexing technologies. Structured databases support concept-based querying where a user can express his query based on concepts and constraints. For example, a database schema expressed in an E-R model can be viewed as a collection of concepts that are related to each other. A query expressed in SQL can refer to these concepts in the FROM clause and specify constraints in the WHERE clause.

To improve the precision of textual queries, we need to complement keyword-based querying with concept- and constraint-based approaches.

Lack of Interoperation. Users may often want to get results from different indexing engines (meta-searching) that use different query languages and operators, or combine the textual information with information stored in structured databases. Correlation across these types of resources could also be used to improve the precision of the results.

* Ph.D. candidate, Rutgers University

These shortcomings lead to a number of questions that constitute motivation for our work.

- Is it possible to use query languages such as SQL to enable a user to express his query better?
- Is it possible to use a common domain ontology (expressed as an E-R model) and query language (SQL) to abstract out differences in the query languages used by different search engines?
- If we do adopt a common model and query language such as the one used in structured databases, to what extent will it facilitate the interoperation/integration of information across textual and structured databases?
- To what extent can such approaches be supported by current textual indexing technologies?

Related Work

In this position paper we focus on issues related to the use of a common model and query language typically used for structured databases, viz., E-R models and SQL, as implemented in the InfoSleuth system [BBB+97]. Data models and query languages have been proposed for semi-structured documents [Abi97]. The extraction of structure involves parsing the underlying documents in a tolerant manner. In our approach, we map elements in a domain ontology into a set of potential structures that may appear in the text body and increment a relevance measure based on their existence. This is similar to viewing the domain ontology as a data guide (using the terminology of [Abi97]) that provides a loose description of the structure of data [BDFS97]. Lightweight models have been used to model semi-structured data [AQM+, BDFS97]. We, however, use a heavyweight domain ontology in an indicative sense. For instance, attributes of an entity are used as a list of indicative features or evidences that can be searched for in the text body of a document.

The mapping from the model elements to the structures is accomplished in a domain-specific manner and provides a set of concepts based on which a textual database can be queried. This is in contrast to concept-based searches supported by indexing engines such as Excite [Exc], where the concepts are constructed by statistical methods and are generic in nature. SQL queries based on concepts and constraints are mapped to constructs on the potential structures associated with the model elements. Approaches for combined querying of structured and textual data have been presented in [YA94, CM94], where information retrieval operations are incorporated in an SQL-like query language. Optimization techniques that combine database processing and full text operations have been proposed in [BDGM95, CDY95]. In our approach an SQL query is mapped by the system to appropriate information retrieval operations that are transparent to the user.

The structure of this paper is as follows. In Section 2 we discuss our broad approach for describing different types of multimedia data with the help of domain ontologies. In Section 3, we discuss the use of the E-R model to describe textual data. Section 4 discusses the mapping of SQL to information retrieval operations. Section 5 presents the conclusions and ongoing work.

2 Domain Ontology-based view of the Information Space

The InfoSleuth system [BBB+97] is comprised of a network of cooperating agents representing users and information resources that address the information needs of the users. The contents of the information resources are described using concepts from domain ontologies. Ontologies give a concise, uniform, and declarative description of semantic information independent of the underlying media of representation. InfoSleuth thus views an information source at the level of its relevant semantic concepts, and information requests are specified in terms of the ontological concepts in a media-independent manner. Thus users can specify a query containing concepts from a domain ontology in SQL and get back tuples from a structured database, text documents or images from the various information resources (Figure 1). The above approach facilitates the integration and querying of multimedia data in a seamless manner at the level of semantic concepts obtained from a domain ontology. A critical challenge in this approach is to map ontological concepts to the underlying textual and image data. An approach for describing image data using ontological concepts is described in [Per]. In this paper we shall describe an approach to describe and query textual documents in a domain-specific manner.

[Figure 1: Ontology-based access of multimedia information. Labels: Domain Ontologies; SQL Query; Tuples, Text, Images ...; InfoSleuth Agent Infrastructure; Text Database; Image Database; Structured Databases.]

Consider a portion of the politics domain ontology represented as an E-R model and illustrated in Figure 2. The entities person and party model information about political personalities and parties respectively. The relationship active_in models the participation of a person in the various political parties. We also introduce a special entity called document which contains information about the various elements of the E-R model. The relationship has_document models the association between the various entities and the documents in which they appear.

[Figure 2: A portion of the Politics Ontology. Labels: ER-Object; person; party; document; active_in with roles (member_of) and (has_member); has_document.]

3 Mapping E-R models to Textual Data

The Entity-Relationship (E-R) model has been widely used as a conceptual model to capture user requirements and characteristics of an application domain. In the case of structured data, entities, attributes and relationships are mapped to the underlying tables (relational model) or objects (object-oriented model). We first discuss a relevant subset of information retrieval operations used to construct topic expressions supported by the Verity indexing engine [Inc94]. With the help of an example used in the InfoSleuth system, we demonstrate how an E-R model can be mapped to these topic expressions. We also compare and contrast with the approach used to map an E-R model to structured data.

3.1 Information Retrieval Operations to create Topic Expressions

We now enumerate some of the information retrieval operators which can be used to construct the topic expressions [Inc94]. Each of these operators by itself defines a topic, and the operators can be combined in different ways to define richer topics. These topic expressions are used to query document collections based on content and can be considered as views on the underlying textual data.

[Figure 3: Mapping an E-R model to structured and textual data. The diagram shows the entities person (attributes name, profession) and party (attributes name, ideology), linked by the relationship active_in with roles (member_of) and (has_member), together with the following mappings:

  Table person: name, profession
  Table active_in: person.name, party.name
  Topic person: (person) (politician) (minister) (diplomat) (chancellor) (congressman) (senator) (speaker) (reformer)
  Parameterized Topic profession: [person.name] (appointed) (); [person.name] (elected) (); [person.name] (become) ()
  Parameterized Topic active_in: [person.name] (leads) [party.name]; [person.name] (belongs) [party.name]; [person.name] (represents) [party.name]]

- (W1): Checks whether W1 is a word in the text body.
- (W1, W2, ..., Wk): Checks whether W1, W2, ..., Wk form a phrase (in the same order) in the text body.
- (W1, W2, ..., Wk): Checks whether W1, W2, ..., Wk appear in the same sentence (in any order) in the text body.
- (W1): Checks whether a thesaurus expansion of W1 appears in the text body.
- (W1): Checks whether the root/stem of a word appears in the text body.
- (T1): This may be a pre-defined topic defined using the above and following operators.
- (T1, T2, ..., Tk): Checks whether the topics T1, T2, ..., Tk appear in the text body. Each of the topics can have a weight associated with it and, depending on the presence of the topics, the weights can be "accrued".
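To make the semantics of these checks concrete, the following is a minimal sketch in Python of the word, phrase, sentence, and weight-accruing checks. It is not the Verity API: the function names, the naive sentence splitter, and the cap of 1.0 on accrued weights are all assumptions made for illustration, and the thesaurus and stem checks are omitted because they need external linguistic resources.

```python
import re

def sentences(text):
    """Split text into lowercase word lists, one list per sentence (naive splitter)."""
    parts = re.split(r"[.!?]+", text.lower())
    return [re.findall(r"[a-z0-9']+", p) for p in parts if p.strip()]

def has_word(text, word):
    """Word check: does `word` occur anywhere in the text body?"""
    return any(word.lower() in s for s in sentences(text))

def has_phrase(text, words):
    """Phrase check: do the words occur consecutively, in the given order?"""
    target = [w.lower() for w in words]
    n = len(target)
    return any(s[i:i + n] == target
               for s in sentences(text)
               for i in range(len(s) - n + 1))

def in_same_sentence(text, words):
    """Sentence check: do all the words occur in one sentence, in any order?"""
    target = {w.lower() for w in words}
    return any(target <= set(s) for s in sentences(text))

def accrue(text, weighted_topics):
    """Accrue check: sum the weights of the topics present (cap at 1.0 is assumed)."""
    score = sum(weight for topic, weight in weighted_topics if topic(text))
    return min(1.0, score)

# Hypothetical document and weights, for illustration only.
doc = "Yeltsin was elected President. The senator met a diplomat."
person_topic = [
    (lambda t: has_word(t, "senator"), 0.4),
    (lambda t: has_word(t, "diplomat"), 0.4),
    (lambda t: has_word(t, "minister"), 0.4),
]
print(has_phrase(doc, ["elected", "President"]))
print(in_same_sentence(doc, ["President", "Yeltsin"]))
print(accrue(doc, person_topic))
```

Each check is a predicate over the text body, so richer topics can be composed by passing predicates to `accrue`, which mirrors the way topics are combined in the list above.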

3.2 Mapping E-R model elements to Topic Expressions

Consider a more detailed version of the politics ontology (Figure 2) illustrated in Figure 3. The entity person has attributes name and profession; and the entity party has attributes name and ideology describing them. With the help of the example in Figure 3, we now illustrate the mapping from an E-R model to views or topics in the underlying textual database. The following cases arise:

Entity Mapping. In the case of structured data, the entity person is mapped to a table, whereas for textual data, it is mapped to a topic. The underlying representations are different in these cases. In the case of structured data, the instances of an entity are a set of tuples, whereas for textual data, the instances are a set of words/patterns appearing in documents. A critical consequence is that concept extensions in textual data lack the notion of object identity or type.

Attribute Mapping. In the case of structured data, the attribute profession is mapped to a table column. However, in the case of textual data it is mapped to a parameterized topic. In the above example, whenever we want to evaluate a condition like profession = 'President', we search for patterns like [person.name] appointed President, [person.name] elected President, etc. This brings forth a critical distinction between structured and textual data. In the case of structured data, the fact that someone is a president is known with certainty, as there is a notion of a key which uniquely identifies a person. As discussed in [Abi97], one could possibly parse semi-structured documents to infer the profession of a person. But in the case of unstructured data, we make the indexing engine search for the above-mentioned patterns, and this gives rise to uncertainty in the answers. The two main reasons for this are: the lack of structure in the underlying data, leading to the absence of the notion of a key; and our approach of searching for patterns at run-time, as opposed to the parsing/extraction resorted to for semi-structured documents.

Relationship Mapping. The relationship active_in models the association between the entities person and party. In the case of structured data the relationship is mapped to a table which contains the object ids of the entities as foreign keys. In the case of textual data, since there is no notion of object identity, we search for patterns such as [person.name] leads [party.name], [person.name] belongs [party.name], etc. This again gives rise to uncertainty for reasons similar to those in the previous cases.

Parameterization of Topics. A critical feature not supported by the underlying indexing technology is the parameterization of topics. This is necessary to support querying of document collections in the SQL sense. A parameter of the type [person.name] results in the associated topic being inserted at that position, whereas a parameter of the other type (shown as () in Figure 3) results in the substitution of a value from the SQL query being translated.
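The two kinds of parameter substitution can be sketched as simple template filling. The sketch below is only an illustration under stated assumptions: the `{value}` placeholder for the SQL value parameter, the template strings, and the helper names `name_topic` and `instantiate` are hypothetical and are not part of the indexing engine or of InfoSleuth.

```python
import re

# Hypothetical pattern templates, loosely following Figure 3.
# "[person.name]" is a topic parameter; "{value}" stands for a value taken
# from the SQL query (this placeholder syntax is an assumption of the sketch).
PROFESSION_PATTERNS = [
    "[person.name] (appointed) ({value})",
    "[person.name] (elected) ({value})",
    "[person.name] (become) ({value})",
]

def name_topic(full_name):
    """Render a name as a group of single-word topics."""
    return "(" + ", ".join(f"({w})" for w in full_name.split()) + ")"

def instantiate(patterns, topics, value):
    """Fill topic parameters such as [person.name], then the SQL value parameter."""
    out = []
    for p in patterns:
        # Replace each [entity.attribute] parameter with its associated topic.
        p = re.sub(r"\[([\w.]+)\]", lambda m: topics[m.group(1)], p)
        # Replace the value parameter with the constant from the WHERE clause.
        out.append(p.replace("{value}", value))
    return out

# Condition: name = 'Aleksandr Shokhin' and profession = 'President'
for line in instantiate(PROFESSION_PATTERNS,
                        {"person.name": name_topic("Aleksandr Shokhin")},
                        "President"):
    print(line)
```

Keeping the templates as data means that new indicative patterns for an attribute can be added without touching the substitution logic.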

4 Translating SQL to Information Retrieval Operations

We now discuss, with the help of examples, how SQL queries, specified on concepts in an E-R model and containing constraints on those concepts, can be translated to topic expressions which are evaluated against the underlying text indexing engine. The various possibilities corresponding to the select, project and join relational algebra operations are considered. We identify the cases which are supported by current textual indexing technologies and consider heuristic approaches to implement those that are not supported.

4.1 Operations supported by Current Indexing Technologies

We now discuss the class of queries that we were able to process by appropriate instantiation of the parameters in the topics associated with the model elements. The evaluation of the topic expressions thus generated is supported by current indexing technology. It may be noted that, as opposed to structured data (where a tuple is either in the answer or not), in the case of document data each tuple is in the answer with a degree of relevance.

Simple Concept Retrieval

  select has_document from person

The associated topic (person) (Figure 3) is determined and sent to the text indexing engine. It was observed that there were some documents which contained a lot of information about the concept person but did not contain the keyword "person" in the document. On the other hand, a document containing the keyword "person" was ranked lower as it contained less information about the concept person.

Selection Query

  select has_document from person where name = 'Aleksandr Shokhin' and profession = 'President'

The translation of this query depends on the topic illustrated in Figure 3 and is illustrated in Figure 4. It was observed that the documents that were returned (with some weight < 1) contained information about deputy prime minister Aleksandr Shokhin. It may be noted that such a query would have returned an empty (or false) answer in the case of structured data. But since we are not able to capture the association between the name and profession of a person in a precise manner, we do get these types of documents. It is arguable, however, that on the Web, in a knowledge discovery environment, this may actually be a useful facility.

  Entity person => (person)
  Attribute person.name => ()
  Attribute person.profession => (([person.name], (appointed), ()) .......)

  select has_document from person where name = 'Aleksandr Shokhin' and profession = 'President'

  (person)
  ((Aleksandr), (Shokhin))
  ((((Aleksandr), (Shokhin)), (appointed), (President)) .... )
  ((person), ((Aleksandr), (Shokhin)), ((((Aleksandr), (Shokhin)), (appointed), (President)) ......))

Figure 4: Translating an SQL query to a topic

Filtering as a Selection Query. The above selection query consists of constraints that depend on the text of the document. However, it is useful to extract certain attributes of a document, such as title, author and date, and use them to filter out irrelevant documents. Filtering conditions are supported by some indexing technologies and can be expressed as selection queries in our approach. Consider the following query:

  select has_document from person where name = 'Mikhail Gorbachev' and has_document.author = 'Julia Wishnevsky'

This query can be interpreted in the following ways:

- Priority to filtering conditions. First select those documents that satisfy the filtering conditions and then order them according to the measure of relevance based on the presence of the appropriate concepts in the text body. The topic expression corresponding to this assumption is:

  (((person), ((Mikhail), (Gorbachev))), (author 'Julia Wishnevsky'))

- Priority to textual relevance. First order all the documents according to the measure of relevance based on the presence of the appropriate concepts in the text body. A higher relevance measure is given to documents satisfying the filtering conditions. The topic expression corresponding to this assumption is:

  ((person), ((Mikhail), (Gorbachev)), (author 'Julia Wishnevsky'))

Implicit Join

  select has_document from active_in where has_member.name = 'Labor Party' and member_of.name = 'Adrian Paunescu'

This is an example of an implicit join, which in the case of structured data gets translated to the following query involving the appropriate tables:

  select has_document from active_in, person, party where active_in.member_of = person.id and active_in.has_member = party.id and person.name = 'Adrian Paunescu' and party.name = 'Labor Party'

The same query can be translated into the following query based on the associated parameterized topic (Figure 3).

  ((((Adrian), (Paunescu)), (leader), ((Labor), (Party))) .... )

This is an example of a class of join queries whose translations can be evaluated by the underlying indexing engine. This is because we have directly mapped the relationship active_in to a topic, which avoids the need for a join expression across two parameterized topics.
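The two interpretations of a filtering condition described earlier in this section differ only in how the filter interacts with the relevance ranking. The following sketch contrasts them on a toy collection. Everything in it is an assumption for illustration: the `Doc` record, the relevance scores, the second author name, and the additive `boost` of 0.2 are hypothetical, and a real indexing engine would compute relevance itself.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    title: str
    author: str
    relevance: float  # hypothetical score from the text engine, in [0, 1]

def filter_then_rank(docs, author):
    """Priority to filtering: keep only matching documents, then rank them."""
    kept = [d for d in docs if d.author == author]
    return sorted(kept, key=lambda d: d.relevance, reverse=True)

def rank_with_boost(docs, author, boost=0.2):
    """Priority to relevance: keep all documents, boost those that match."""
    def score(d):
        bonus = boost if d.author == author else 0.0
        return min(1.0, d.relevance + bonus)
    return sorted(docs, key=score, reverse=True)

# Invented sample data.
docs = [
    Doc("A", "Julia Wishnevsky", 0.5),
    Doc("B", "Someone Else", 0.6),
    Doc("C", "Julia Wishnevsky", 0.3),
]
print([d.title for d in filter_then_rank(docs, "Julia Wishnevsky")])
print([d.title for d in rank_with_boost(docs, "Julia Wishnevsky")])
```

Under the first interpretation, document B never appears, however relevant its text is; under the second it is retained but can be overtaken by documents that satisfy the filter.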

4.2 Limitations of current Indexing Technologies

We observed that the following two types of queries could not be processed due to limitations in current indexing technologies. Approaches to overcome these limitations, based on post-processing of query results, are outlined.

Projection Query

  select profession from person where name = 'Yeltsin'

As discussed earlier, the association between the name and profession of a person is not known with certainty. However, the Verity indexing engine does return the set of patterns that match the topic corresponding to the attribute person.profession. So one approach would be to post-process the patterns returned by evaluating the following topic:

  (((Yeltsin), (appointed), WILDCARD) ((Yeltsin), (becomes), WILDCARD) ((Yeltsin), (appointed), WILDCARD) .... )

Depending on the patterns returned, the words appearing in the same position as WILDCARD may be deemed to be the profession of Yeltsin. However, it is possible that the WILDCARD may not be replaced by a unique value in each case. In that case, a set of answers, each associated with a relevance measure, will be returned.
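The post-processing step for projection can be sketched as follows. This is a rough illustration, not the engine's behavior: regular expressions stand in for the patterns returned by the indexing engine, the verb list and sample passages are invented, and using the relative frequency of each captured word as its relevance measure is an assumption.

```python
import re
from collections import Counter

# Verbs taken from the profession patterns; the regex form is an assumption.
VERBS = ["appointed", "elected", "becomes"]

def project_profession(name, passages):
    """Collect the words found in the WILDCARD position and score them by frequency."""
    verbs = "|".join(VERBS)
    pattern = re.compile(
        r"\b" + re.escape(name) + r"\b[^.]*?\b(?:" + verbs + r")\s+(\w+)",
        re.IGNORECASE,
    )
    hits = Counter()
    for text in passages:
        for match in pattern.finditer(text):
            hits[match.group(1).lower()] += 1
    total = sum(hits.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in hits.items()}

# Invented sample passages.
passages = [
    "Yeltsin was elected President.",
    "In 1991 Yeltsin becomes President of Russia.",
    "Yeltsin appointed ministers on Monday.",
]
print(project_profession("Yeltsin", passages))
```

The third passage shows why the answer is a set rather than a single value: the word after "appointed" is not a profession of Yeltsin, yet it occupies the WILDCARD position and so is returned with a lower relevance measure.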

Join Query

  select has_document from active_in where member_of.name = 'Yeltsin' and has_member.ideology = 'communism'

The translation of this results in the join between these two topics:

  T1 = (((Yeltsin), (leader), [party.name]) ... )
  T2 = (([party.name], (subscribes), (communism)) ... )

The topic which the indexing engine needs to evaluate is:

  (T1, T2), s.t. T1.party.name = T2.party.name

These kinds of conditions are currently not supported by the underlying indexing technologies. However, one could still adopt the strategy of post-processing the patterns returned by the Verity indexing engine, where in each of T1 and T2, [party.name] is replaced by a WILDCARD. At the post-processing stage we could check for the equality of the wildcards returned for each of the expressions and assign a relevance measure based on the number of matches.
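The wildcard-equality check can be sketched in a few lines. As before, this is an illustration under assumptions: the two regular expressions stand in for T1 and T2 with [party.name] replaced by a capture group, the party names and documents are invented, and scoring a document by the number of wildcard values shared by its T1 and T2 matches is one possible reading of "number of matches".

```python
import re

# Hypothetical stand-ins for T1 and T2, with [party.name] as the capture group.
T1_REGEX = r"Yeltsin leads the (\w+) party"
T2_REGEX = r"(\w+) party subscribes to communism"

def join_score(text, t1_regex, t2_regex):
    """Count the wildcard values on which T1 and T2 matches agree in one document."""
    t1_values = {m.group(1) for m in re.finditer(t1_regex, text)}
    t2_values = {m.group(1) for m in re.finditer(t2_regex, text)}
    return len(t1_values & t2_values)

def rank_by_join(docs, t1_regex, t2_regex):
    """Return (doc_id, score) pairs with a non-zero join score, best first."""
    scored = {doc_id: join_score(text, t1_regex, t2_regex)
              for doc_id, text in docs.items()}
    return sorted(((d, s) for d, s in scored.items() if s > 0),
                  key=lambda pair: pair[1], reverse=True)

# Invented sample documents with fictional party names.
docs = {
    "d1": "Yeltsin leads the Alpha party. The Alpha party subscribes to communism.",
    "d2": "Yeltsin leads the Alpha party. The Beta party subscribes to communism.",
}
print(rank_by_join(docs, T1_REGEX, T2_REGEX))
```

Document d2 matches both topics individually but is excluded, because the wildcard values differ; this is exactly the condition T1.party.name = T2.party.name that the indexing engine cannot evaluate on its own.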

5 Conclusions and Ongoing Work

We have presented in this paper an approach based on a common E-R model and language (SQL) for querying textual databases. A proposal for meta-searching [GCGMP97] has identified the basic issues in interoperation across various indexing technologies. These issues are also relevant in the broader context of interoperation across textual and structured data. In the InfoSleuth system, the relevant source is chosen based on the E-R model describing the data source, whether it contains textual, image or structured data. At run-time the concepts and constraints in the query are matched to those in the E-R model to determine the relevant data source [BBB+97]. The concepts and constraints in the query are mapped to information retrieval expressions supported by the indexing engine. This is the responsibility of the individual source and results in encapsulation of the query language heterogeneity discussed in [GCGMP97]. The use of a common E-R model and SQL facilitates the merging of results across textual databases and also of results across structured and textual databases. More research is required, however, to define the semantics of combining results from structured and textual databases. Some applications require the functionality of processing textual data to insert information in a structured database, and this is also facilitated by this approach. We thus believe that an approach based on E-R models and SQL for textual data has its advantages and is worth exploring. We expect this approach to enable improvement in the precision of results and achievement of greater interoperation.

We are currently working on providing integrated access to multi-lingual (e.g., English and Polish) document collections, and on rudimentary combination of results from textual and structured data. We are experimenting with approaches based on post-processing of results to support join and projection operations, which we believe shall enhance the precision of the answers. The use of natural language processing, fact extraction techniques and taxonomic subsumption [W+] is also being explored, and we hope to incorporate some of these techniques in future releases of the InfoSleuth system.

References

[Abi97] S. Abiteboul. Querying Semi-Structured Data. In Proceedings of ICDT, 1997.

[Alt] Altavista. http://www.altavista.digital.com.

[AQM+] S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. Wiener. The LOREL query language for semi-structured data. ftp://db.stanford.edu/pub/papers/lorel96.ps.

[BBB+97] R. Bayardo, W. Bohrer, R. Brice, A. Cichocki, G. Fowler, A. Helal, V. Kashyap, T. Ksiezyk, G. Martin, M. Nodine, M. Rashid, M. Rusinkiewicz, R. Shea, C. Unnikrishnan, A. Unruh, and D. Woelk. InfoSleuth: Semantic Integration of Information in Open and Dynamic Environments. In Proceedings of the 1997 ACM International Conference on the Management of Data (SIGMOD), Tucson, Arizona, May 1997.

[BDFS97] P. Buneman, S. Davidson, M. Fernandez, and D. Suciu. Adding structure to unstructured data. In Proceedings of ICDT, 1997.

[BDGM95] S. Brin, J. Davis, and H. Garcia-Molina. Copy Detection Mechanisms for Digital Documents. In Proceedings of the 1995 ACM SIGMOD, May 1995.

[CDY95] S. Chaudhuri, U. Dayal, and T. Yan. Join Queries with External Text Sources: Execution and Optimization Techniques. In Proceedings of the 1995 ACM SIGMOD, May 1995.

[CM94] M. Consens and T. Milo. Optimizing Queries on Files. In Proceedings of the 1994 ACM SIGMOD, May 1994.

[Exc] Excite. http://www.excite.com.

[GCGMP97] L. Gravano, C. Chang, H. Garcia-Molina, and A. Paepcke. STARTS: Stanford Proposal for Internet Meta-Searching. In Proceedings of the 1997 ACM International Conference on the Management of Data (SIGMOD), Tucson, Arizona, May 1997.

[Inc94] Verity Inc. Verity Developer Kit (VDK) API Reference Guide V1.0.3, 1994.

[Inf] Infoseek. http://www.infoseek.com.

[KM91] B. Kahle and A. Medlar. An Information System for Corporate Users: Wide Area Information Servers. Connexions - The Interoperability Report, 5(11), November 1991.

[Per] B. Perry. Notes on incorporating Content-based Image Agents in InfoSleuth. Working Notes: Hughes Research Laboratories.

[W+] W. A. Woods et al. Conceptual Indexing for Precision Content Retrieval. http://www.sunlabs.com/research/knowledge/index.html.

[YA94] T. Yan and J. Annevelink. Integrating a Structured-Text Retrieval System with an Object-Oriented Database System. In Proceedings of the 20th VLDB Conference, September 1994.
