Structured Document Databases

Arijit Sengupta

September 4, 1996

Abstract

This is a summary of research activities so far performed in the Structured Document Database project. The project involves building a database environment for storing, querying and updating structured documents - in particular, documents encoded in SGML (Standard Generalized Markup Language) [ISO86]. The project has three major components - (i) designing a query language for querying structured documents, (ii) designing and implementing a query engine capable of performing queries written in this query language, and (iii) designing and implementing an interactive Graphical User Interface capable of expressing queries in this language. This report describes the work done in each of the above areas, and projected plans for the immediate future.

1 Introduction

Documents have been a very important field of research for a very long time. The extensive use of documents in every aspect of human life, in almost every field of education, and in almost every task in everyday life makes them a very significant source of information. Starting from cave painting to the modern-day electronic texts, documents have been used in almost every stage of human development history. In this era of automation, we have very sophisticated systems for creating and editing documents, for grammar correction and spell-checking, for searching and automatic search-replacement of text, and for formatting and typesetting. Documents have been enhanced by the incorporation of tags which store information related to their structure. However, once documents are created and printed, not many tools can use these documents in an efficient way to retrieve information. This is why we need document databases.

2 The domain of Structured Document Databases

Documents can be divided roughly into two categories - documents without any structure and documents with some embedded structure. Simple ASCII text documents fall in the first category. Most word processor documents and documents in special formats (nroff, LaTeX, HTML, SGML, etc.) are examples of structured documents. Structured documents can be roughly described as documents with some form of internal structure. This structure could be in terms of physical structure (such as font, size, layout - as seen in word processor documents, nroff, etc.), or in terms of logical structure (purely logical information about the structure without any indication of layout, such as SGML and to some extent HTML), or mixed structure containing both physical and logical information (such as LaTeX).

2.1 Why are documents so special?

Structured documents form a class of their own in terms of databases for two reasons: (i) their complex structure and (ii) the importance of the relative ordering of document components. Relational databases are flat in nature. Any complexity in the logical structure of the data in relational databases is resolved by "normalizing" the structure into a number of flat tabular components. Moreover, in relational databases, the relative order of the tuples is not important. On the other hand, structured documents have a complex hierarchical structure - a particular component of a document can be composed of multiple components (e.g. a chapter can consist of a heading and a number of sections). Moreover, the relative order among these components in documents is important (e.g. the first chapter in a book should always come before the second chapter).

3 Current State of the Art

The previous work in structured document database research can be divided into two major approaches - (i) the top-down design approach and (ii) the bottom-up design approach. In the former approach, the design starts at the conceptual level, where a model of the database is first formed; query languages based on this model are then designed, and finally implemented using currently available or newly designed hardware and/or software facilities. In the latter approach, an implementation strategy (possibly based on a special data structure or an existing system) is first decided on, and the query languages and capabilities of the system are then decided based on this implementation strategy. A small classification of current work based on these two broad categories can be found in my thesis proposal [Sen95].

4 Overview of Current Work

The current work can be broadly classified under the top-down approach category. In this work, we decide first on a data format, the types of queries that we would like to perform on this data, and a query language that would be able to express these queries. We then design a backend engine and a frontend interface that would be able to perform the same class of queries on this data. The basic requirements of such a system are the following:

1. Data Format: The database will contain documents in SGML format. It should possibly have support for the HyTime [ISO92] and DSSSL [ISO94] standards.

2. Query Language: The users of the system will be able to pose queries on the database using a query language that will have at least the following properties:
   (a) Tractable: The query language should be tractable - queries written in the language should be computable in a reasonable amount of time using a reasonable amount of disk space and other computer resources.
   (b) Complete: The query language should be complete - the set of queries possible in the query language should be a self-complete set consisting of a formally complete class of queries.
   (c) Simple: The query language should be simple - it should be easy enough to formulate queries using the query language. Queries written in the language should be close to the English description.

3. Query Engine: There should be a backend engine that can be used to compute the results of the queries written in the above query language. To achieve this efficiently, the engine should be able to keep special indices based on the data, and should be able to optimize the queries by making good use of the internal structure and available indices. Basic components of the query engine should be the following:
   (a) Index builder: Given the data in raw SGML format, the index builder should be able to build specific data structures that help speed up access to the data.
   (b) Query optimizer: The query optimizer should be able to rewrite the query specified by the user to an equivalent query (or a sequence of operations) that will yield the same result but be more efficient.
   (c) Core query engine: The core query engine should be able to take an optimized query from the optimizer, break it up into steps and build the result making good use of the indices available.

4. Query Interface: There should be an interface to the query engine which will allow the users to either type in the query in the query language format, or formulate the query using a graphical interface. The graphical interface will have the advantage that it will not require the users to know the internal structure of the databases or the syntactic constructs of a query language. However, because of the complex nature of the database, this visual interface for the query language may be somewhat restricted compared to the actual query language, but it should be able to pose a significant subset of the class of queries that the query language can pose.

5 The Query Language

The proposed query language has its roots in SQL (Structured Query Language) [SQL86]. SQL is a natural language implementation of the more formal "Relational Calculus", and can pose a class of queries which are all solvable in polynomial time. The language we propose is an extension of standard SQL that introduces two main additional features, along with some smaller extensions:

1. Complex selection: In standard SQL, the structure of the basic components (tables) is flat, so selection conditions involve only one level of indirection - a column in a table. However, for documents, the internal structure can be represented as a tree, and selecting particular nodes in the tree will involve some type of traversal of the tree. A particular traversal of the tree can lay out a path from the root of the tree to the target node (or a set of target nodes). So selections for document databases involve "path expressions" that specify a set of paths from one node to a set of target nodes. In the proposed language, path expressions are allowed in the selection condition as a start node, a target node, and a series of intermediate nodes. This is a simplified version of general path regular expressions.

2. Structured result forming: In standard SQL, the result of a query is a set of columns specified in the select statement. This works well in relational databases since the output from a query also appears as a table, thus maintaining the closure property. However, for structured documents, to maintain the closure property, queries should yield document components, which are likely to be complex themselves. To support this, the query language allows complex grouping operations in the select statement by means of an "output DTD" which specifies the relationship between the fields in the query and the output structure of those fields.

3. Other extensions: Some other small extensions of SQL are necessary to properly access SGML attribute values, to properly use HyTime links, and to express comparison between relative positions.

More details on the two main extensions and example queries can be obtained from [Sen96a].

5.1 Properties of the query language

The proposed query language was designed to perform the basic first-order-logic type of queries with complex selections and path expressions. It still needs to be formally proved whether or not this language is in PTIME. Here we take the three main types of queries that can be posed with the query language, and show that they are indeed in PTIME. To discuss the problem, it is first necessary to define the idea of a complex tuple. In relational databases, one row in a table is a tuple. In hierarchical databases, however, it is not so easy to pinpoint tuples. One might argue that in the case of SGML documents, the SGML document itself is one large tuple - and in the case of document databases containing multiple documents, each document is a tuple. However, as is common with SGML, it is often desirable to increase the granularity of tuples - which implies that one single document may contain multiple tuples of different types. Thus tuples are often associated with a type for that tuple. For example, in Figure 1(i), the subtrees rooted at the four nodes marked B can be considered four tuples of type B. The root of the tree itself can be considered to be a tuple of type A. If the tuple is instantiated with particular values of the typed nodes, it will be represented as the tree in Figure 1(ii). In short, the granularity of the tuples can be as low as the whole document, or as high as every node in the hierarchy.

[Figure 1: (i) A complex tuple, (ii) The same tuple with types instantiated]

5.2 Algebraic representation of complex tuples

We are going to represent complex tuples using a prefix notation in which the root of the tree appears first, and all the children of the root appear in parentheses in a recursive manner. So, the tree in Figure 1(ii) will be represented by the string $a_1(b_1\ b_2(c_1\ c_2)\ b_3\ b_4(c_3\ c_4))$. A tuple rooted at $a_1$ and having two subtrees $T_1$ and $T_2$ as two branches will be represented as $a_1(T_1\ T_2)$. A tuple rooted at $a_1$ and having a descendant $T_1$ somewhere in the tuple will be represented as $a_1(\cdots T_1 \cdots)$.
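The prefix notation can be illustrated with a minimal Python sketch. The `Node` class and the `to_prefix` function are hypothetical names introduced only for this illustration; they are not part of the system described in this report.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A node of a complex tuple: a value plus ordered children (hypothetical type)."""
    value: str
    children: List["Node"] = field(default_factory=list)


def to_prefix(node: Node) -> str:
    """Serialize a tuple as root(child child ...), recursing into the children."""
    if not node.children:
        return node.value
    inner = " ".join(to_prefix(child) for child in node.children)
    return f"{node.value}({inner})"


# The instantiated tuple of Figure 1(ii), read off the string given in the text.
figure_1 = Node("a1", [
    Node("b1"),
    Node("b2", [Node("c1"), Node("c2")]),
    Node("b3"),
    Node("b4", [Node("c3"), Node("c4")]),
])

print(to_prefix(figure_1))  # a1(b1 b2(c1 c2) b3 b4(c3 c4))
```

Because the children are kept in a list, the serialization preserves the relative order of the components, which is the property that distinguishes these tuples from relational ones.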

5.3 Selection and Projection

The Select and Project operations for trees are similar to those in relational databases, with the difference that they both use path expressions. Evaluation of path expressions depends only on the size of the schema and not on the size of the documents, since the schema is usually constant. So the complexity of either operation is equivalent to the complexity of the same operation in relational databases, and both are hence in PTIME.
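As an illustration of selection by path expression, the following Python sketch matches a path given as a start type, optional intermediate types, and a target type against a document tree. It assumes that each step of the path may skip any number of levels (a descendant reading of the `..` operator used in the queries of this report). The `Elem` class, the `select` function and the sample data are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Elem:
    """An element with a type name, optional text, and ordered children."""
    name: str
    text: str = ""
    children: List["Elem"] = field(default_factory=list)


def select(node: Elem, path: List[str]) -> List[Elem]:
    """Return, in document order, the nodes reached by matching `path`.

    path[0] is the start type, path[-1] the target type, and anything in
    between an intermediate type. Each step may skip any number of levels.
    Targets nested below an already matched target are not searched for.
    """
    if not path:
        return []
    # Consume the first path step if this node has the expected type.
    rest = path[1:] if node.name == path[0] else path
    if not rest:
        return [node]  # the whole path is matched: this node is a target
    found: List[Elem] = []
    for child in node.children:
        found.extend(select(child, rest))
    return found


# Hypothetical sample data with placeholder values.
books = Elem("Books", children=[
    Elem("Book", children=[Elem("Title", "T1"), Elem("Author", "A1")]),
    Elem("Book", children=[Elem("Title", "T2"), Elem("Author", "A2")]),
])

titles = [e.text for e in select(books, ["Books", "Title"])]
```

The function visits each node at most once, so the cost of one selection in this sketch is linear in the size of the document tree.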

5.4 The "join" problem

Normally, for a complex object database, joins are not so important, since joins are built into the structure of the database. Joins are very necessary in relational databases, as the conceptual structure in relational databases is broken (or normalized) into multiple flat structured tables, and any query involving two different components will involve a "join" of the two components. This is not necessary in complex object databases, since the structures are never broken. However, some queries involving different parts of the structure might require a comparison of a node in the hierarchy with another node in a separate hierarchy. This leads to a special type of join in these databases. For example, one might want to find all the pairs of authors who have written books with at least one common title. This requires comparing the title nodes of two different complex tuples. The join operation (in particular, the equi-join operation) compares the values of two nodes in the same or different subtrees, and constructs a new tree combining each pair of involved subtrees that have the same value of the joining node. The subtrees are combined by creating a new root with the roots of the component subtrees as children. Formally, if $T_A$ is the collection of tuples of the form $A(T_{A_1} \cdots T_{A_{n_a}})$ and $T_B$ is the collection of tuples of the form $B(T_{B_1} \cdots T_{B_{n_b}})$, then the join of these sets of tuples, denoted $T_A \bowtie_{A=B} T_B$, is the collection of tuples of the form $r_n(a_i(T_{A_1} \cdots T_{A_{n_a}})\ b_j(T_{B_1} \cdots T_{B_{n_b}}))$, where $a_i = b_j$ for some $i$ and $j$. The new value $r_n$ is a temporarily introduced root, which keeps the tree structure consistent. Note that it is possible to extend this formalism to forests, in which case the introduction of a temporary root will not be necessary. Normally, a join by itself will not have any specific meaning, but will be followed by subsequent projections from the resulting tree that will remove the introduced pseudo-root.
For example, the query described at the beginning of this section will be solved as:

    SELECT B1..Author, B2..Author
    FROM Books B1, Books B2
    WHERE B1..Title = B2..Title AND B1..Author <> B2..Author

Note that the join on Title was immediately followed by projection of the Author nodes, so the result will not contain the temporary root that is introduced.
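The join followed by projection can be sketched in Python as follows. This is only an illustration under stated assumptions: two tuples are taken to join when they share at least one value of the joining node, and the names `Node`, `values`, `equi_join` and `author_pairs`, as well as the sample data, are hypothetical.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import List, Tuple


@dataclass
class Node:
    """A node of a complex tuple (hypothetical representation)."""
    name: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def values(tup: Node, name: str) -> List[str]:
    """Text of `tup` itself and of every descendant having the given type name."""
    hits = [tup.text] if tup.name == name else []
    for child in tup.children:
        hits.extend(values(child, name))
    return hits


def equi_join(left: List[Node], right: List[Node], key: str) -> List[Node]:
    """Combine each pair of tuples sharing a `key` value under a temporary root."""
    joined = []
    for a, b in product(left, right):
        if set(values(a, key)) & set(values(b, key)):
            joined.append(Node("r", children=[a, b]))  # pseudo-root r_n
    return joined


def author_pairs(books: List[Node]) -> List[Tuple[str, str]]:
    """Join Books with itself on Title, then project pairs of distinct Authors."""
    pairs = []
    for root in equi_join(books, books, "Title"):
        b1, b2 = root.children
        for x, y in product(values(b1, "Author"), values(b2, "Author")):
            if x != y:
                pairs.append((x, y))  # the projection drops the pseudo-root
    return pairs


# Hypothetical sample data with placeholder values.
books = [
    Node("Book", children=[Node("Title", "T1"), Node("Author", "A1")]),
    Node("Book", children=[Node("Title", "T1"), Node("Author", "A2")]),
    Node("Book", children=[Node("Title", "T2"), Node("Author", "A3")]),
]
result = author_pairs(books)
```

As with the SQL query, each qualifying pair appears in both orders, and the temporary root exists only between the join and the projection.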

5.5 An alternative query language

An alternative, but equivalent query language, in which SGML itself is used to formulate the query using an SQL DTD, is described in [Sen96b].

6 The Query Engine

The work on the query engine is still in its infancy, primarily because of the lack of good low-level database support in the chosen platforms. Two different versions of the engine have been built - however, both are restricted in functionality. The first version is built using Shore [CDNM94], an object-oriented data repository, and the second version is built using the Open Text Pat database system. The current version of the user interface is interfaced with the Pat engine, because of its support for most of the required queries. However, the final version is planned to be independent of any commercial database engine.

6.1 Using Shore

Shore [CDNM94] is an object-oriented persistent data repository. It provides internal data-integrity and locking functionality and lets the programmer specify his/her data format. The version of the query engine built using Shore first processes the data and builds a persistent structure. The structure is very similar to the parse tree of the SGML documents. The parse tree is built from the ESIS (Element Structure Information Set) output produced by a validating SGML parser, and is stored in Shore's persistent object repository. Although not implemented at this time, it is possible to generate special data structures on top of the parse tree that enable more efficient access to specific nodes in the tree. At query time, the parse tree is read back from the Shore repository. The query is converted to a traversal strategy, and the result is formed using a set of nodes of the parse tree.
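How a parse tree can be rebuilt from ESIS output is sketched below in Python. The sketch assumes that the parser emits the line-oriented text format used by sgmls-family parsers, in which `(` starts an element, `)` ends it, `-` carries character data, and `A` lines give the attributes of the element that follows. The report does not say which parser was used, so this format, the `Element` class and the `parse_esis` function are assumptions made for illustration, and the Shore storage step is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Element:
    """One node of the parse tree (hypothetical in-memory form)."""
    name: str
    attrs: Dict[str, str] = field(default_factory=dict)
    text: str = ""
    children: List["Element"] = field(default_factory=list)


def parse_esis(lines: List[str]) -> Optional[Element]:
    """Build a parse tree from ESIS lines (subset: A, (, ) and - commands)."""
    root: Optional[Element] = None
    stack: List[Element] = []
    pending: Dict[str, str] = {}  # attribute lines precede their start tag
    for line in lines:
        if not line:
            continue
        code, rest = line[0], line[1:]
        if code == "A":  # "Aname TYPE value"; IMPLIED attributes have no value
            parts = rest.split(" ", 2)
            pending[parts[0]] = parts[2] if len(parts) == 3 else ""
        elif code == "(":  # start of element
            elem = Element(rest, attrs=pending)
            pending = {}
            if stack:
                stack[-1].children.append(elem)
            else:
                root = elem
            stack.append(elem)
        elif code == ")":  # end of element
            stack.pop()
        elif code == "-":  # character data; only the \n escape is decoded here
            stack[-1].text += rest.replace("\\n", "\n")
        # Any other command (such as the final "C") is ignored in this sketch.
    return root


# Hypothetical ESIS output for a two-element document.
sample = ["(POEM", "ANUM CDATA 1", "(TITLE", "-Ode", ")TITLE",
          "(LINE", "-first line", ")LINE", ")POEM", "C"]
tree = parse_esis(sample)
```

The sketch concatenates the character data of an element into one string, so the position of text relative to child elements in mixed content is not preserved; a fuller version would store text as child nodes.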

6.2 Using Pat

Pat, or Open Text 5.0 [Ope94], is a document database system that uses a proprietary "Patricia tree" index for fast access to structured data. Although very fast and efficient, the Pat query language is not adequate for first-order logic queries. Moreover, Pat's query language is difficult to understand, and formulating complex queries in Pat requires a thorough knowledge of the structure of the document. However, it provides adequate support for tree traversal functions, and we used this feature to integrate it with the current user interface. In this interface, the queries sent from the interface are first sent to an optimizer which determines an optimal traversal path. This path is then used sequentially to traverse the tree and gather the accumulated result. The following is a brief description of the algorithm implemented in this optimization.

6.2.1 Single Accumulator Evaluation

This optimization works similarly to a single-accumulator microprocessor. The accumulator keeps the current result corresponding to a particular node in the tree. If we denote the accumulator by $A$, then $A = (N, E)$, where $N$ is the node corresponding to the accumulator and $E$ is the current result. Depending on the new operation $X = (P, E_1)$, three cases might arise:

Case I: $P \in \mathrm{subtree}(N)$ - or $P$ is a descendant of $N$. In this case, $A$ is evaluated to $(N, E \oplus (E_1 \longrightarrow N))$, where $\oplus$ is the operator (only AND and OR are supported currently) between $A$ and $X$, and $\longrightarrow$ is the ancestor operation.

Case II: $N \in \mathrm{subtree}(P)$ - or $P$ is an ancestor of $N$. In this case, $A$ is evaluated to $(P, E_1 \oplus (E \longrightarrow P))$.

Case III: $N$ and $P$ are in different subtrees. In this case, $A$ is evaluated to $(R, (E_1 \longrightarrow R) \oplus (E \longrightarrow R))$, where $R$ is the closest common ancestor of $N$ and $P$.

[Figure 2: Single Accumulator Evaluation - Case I, Case II, Case III, and exceptions for non-root target node]

The above evaluation technique works as long as there is no restriction on the node for the result, or as long as the result node is the root of the tree. However, if the result node is not the root of the tree, then the result needs to be modified so that the accumulator is targeted towards the result node. For example, in Figure 2(iv), if the accumulator is in region $A$, the target node is $T$, and the next operation is in region $B$, then even if the common ancestor is not $T$, the accumulator needs to be re-evaluated so that it corresponds to $T$, and it needs to stay locked at that node for the rest of the operations. This possibly leads to some amount of inefficiency.
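The three cases can be sketched in Python as shown below. The sketch is not the implemented optimizer: it assumes that a result is a set of element instances, that the ancestor operation lifts each instance to its enclosing instance of the given type, and that AND and OR are set intersection and union. The names `Inst`, `lift`, `closest_common_ancestor` and `step`, the schema and the sample instances are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Inst:
    """One element instance in a document (hypothetical representation)."""
    ident: str
    etype: str
    parent: Optional["Inst"] = None


Schema = Dict[str, Optional[str]]  # element type -> parent type (None for root)


def type_path(schema: Schema, etype: str) -> List[str]:
    """Element types from `etype` up to the root of the schema tree."""
    path = [etype]
    while schema[path[-1]] is not None:
        path.append(schema[path[-1]])
    return path


def closest_common_ancestor(schema: Schema, n: str, p: str) -> str:
    n_path = type_path(schema, n)
    for t in type_path(schema, p):
        if t in n_path:
            return t
    raise ValueError("types are not in the same schema tree")


def lift(instances: Set[Inst], etype: str) -> Set[Inst]:
    """Ancestor operation: enclosing instances of type `etype`."""
    lifted = set()
    for inst in instances:
        node: Optional[Inst] = inst
        while node is not None and node.etype != etype:
            node = node.parent
        if node is not None:
            lifted.add(node)
    return lifted


def step(schema: Schema, acc: Tuple[str, Set[Inst]], op: str,
         operand: Tuple[str, Set[Inst]]) -> Tuple[str, Set[Inst]]:
    """Fold one operation X = (P, E1) into the accumulator A = (N, E)."""
    (n, e), (p, e1) = acc, operand
    # R is N in Case I, P in Case II, and a proper common ancestor in Case III.
    r = closest_common_ancestor(schema, n, p)
    left, right = lift(e, r), lift(e1, r)
    if op == "AND":
        return (r, left & right)
    if op == "OR":
        return (r, left | right)
    raise ValueError("only AND and OR are supported")


# Hypothetical schema and instances.
schema = {"poem": None, "title": "poem", "stanza": "poem", "line": "stanza"}
p1 = Inst("p1", "poem")
t1 = Inst("t1", "title", p1)
s1 = Inst("s1", "stanza", p1)
l1 = Inst("l1", "line", s1)
p2 = Inst("p2", "poem")
s2 = Inst("s2", "stanza", p2)
l3 = Inst("l3", "line", s2)

case_iii = step(schema, ("title", {t1}), "AND", ("line", {l1, l3}))
```

Because AND and OR are commutative and lifting a result to its own node leaves it unchanged, the three cases reduce in this sketch to a single rule: lift both results to the closest common ancestor and combine them there. The locking of the accumulator at a non-root target node is not modelled.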

6.2.2 Multi-accumulator evaluation

To avoid the inefficiency resulting from a non-root target node, the evaluation strategy can be made to use multiple accumulators. In particular, two accumulators can be used: one for the subtree rooted at the target node (region A in Figure 2(iv)), and one for the rest of the tree (region B in Figure 2(iv)). However, such strategies can get complex when the regions are combined using different logical operators. In that case, the accumulators have to be merged into one.

7 The Query Interface

The proposed visual query interface is analogous to the "Query By Example" method for relational databases [Zlo77]. This method uses a visual template for the instances in the database for querying. Users then use a point-and-click method to point to specific regions of the template to specify query strings corresponding to the particular regions. We call this type of interface "Query By Template" or QBT. A prototype interface that implements the projection and selection queries has already been built and tested for usability. A preliminary description of the interface and the motivations for it can be obtained from [SD96]. The interface has been built using Java(TM) [Jav95], an object-oriented distributed programming language. In spite of its state of infancy and sub-optimal efficiency, we decided to choose Java over other graphical interface builders because of its availability and the capability of running Java-based programs from a WWW browser. The current version of the interface is being used for sending queries to an Open Text [Ope94] database containing the Chadwyck-Healey English Poetry Full-Text database [Cha94]. The interface is designed so that the backend can be easily modified to use an alternative engine that can perform tree-traversal type of operations.

The current version of the interface has the following features:

1. A three-pane visualization of the query language. The first and the easiest version is the query template, containing a graphical template of the data instances in the database. The second visualization is the internal schema, using a tree-like interface that displays a graphical view of the structure of the document. In the third level, the interface is simply an editor where the user can type his/her query in. In the first level, the user associates the structure of the document with the visual representation. Although not implemented in the current prototype, it is possible to zoom into a particular region and refine queries for that region by specifying more query strings in its components.

2. Combination of queries using logical operators. All three screens have their own way of combining query components using logical operators. In the template screen, the user can visually link various components of his/her query using logical operators. The absence of a link between two query components implies an "AND" operator, and explicit links can be made for either "OR" or "AND" operations. Because of the nature of the implicit links, explicit ORs are given higher precedence at the time of evaluation, although this is not the usual convention. The SQL queries written in the SQL screen can use logical operators as in the SQL syntax.

3. Direct validation of queries. The SQL queries are validated at the interface level before being sent to the server.

4. Easy interfacing with any database. Although the interface was written using Pat and the Chadwyck-Healey database, it is designed to be easily usable with other databases and other database systems. To use a different database, the administrator only needs to set up a few configuration files and run an auto-configuration utility that configures most of the interface, which can be fine-tuned later. To use a separate database system, the administrator needs to write a small amount of code to perform the interface using the existing class library. Any database system that is capable of running tree-traversal type of queries can work with this interface. This includes Pat, any DSSSL engine, and most other existing document database systems.
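The precedence rule for the template screen (feature 2 above) can be made concrete with a small Python sketch. It assumes that the query components are plain condition strings and that only explicit OR links need to be recorded, since the absence of a link already means AND. The `combine` function and the sample conditions are hypothetical and do not reproduce the interface's actual class library.

```python
from typing import Dict, List, Tuple


def combine(components: List[str], or_links: List[Tuple[int, int]]) -> str:
    """Group OR-linked components first, then AND the groups together.

    `components` are condition strings; `or_links` are pairs of indices into
    `components` that the user joined with an explicit OR link.
    """
    group = list(range(len(components)))  # group[i] points towards a representative

    def find(i: int) -> int:
        while group[i] != i:
            i = group[i]
        return i

    for a, b in or_links:
        ra, rb = find(a), find(b)
        if ra != rb:
            group[max(ra, rb)] = min(ra, rb)  # smallest index represents the group

    clusters: Dict[int, List[str]] = {}
    for i, text in enumerate(components):
        clusters.setdefault(find(i), []).append(text)

    parts = []
    for rep in sorted(clusters):
        members = clusters[rep]
        if len(members) == 1:
            parts.append(members[0])
        else:
            parts.append("(" + " OR ".join(members) + ")")
    return " AND ".join(parts)


# Hypothetical query components from a template screen.
query = combine(
    ["title = 'X'", "author = 'A'", "author = 'B'"],
    [(1, 2)],
)
```

With these inputs the explicit OR between the two author conditions is evaluated before the implicit AND with the title condition, which is the opposite of the usual convention in which AND binds more tightly than OR.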

8 Future Work

The main areas where more work needs to be done are the foundations and the core query engine (without external database support). The work on the interface is close to the end, and an extensive usability test has been performed. The first usability analysis showed some problems with the interface that have since been fixed. Another smaller usability test will probably be performed to check for other possible problems. The current version of the query engine does not support joins, but they will be implemented in the next phase. The ultimate aim of this project is to produce a prototype system that has querying capabilities like those of standard relational databases and is not restricted to some ad-hoc structure-dependent queries. The system should also not depend on any external database system, should be able to use SGML in its native format, and should be able to produce SGML documents as output from the query engine.

References

[CDNM94] M. Carey, D. DeWitt, J. Naughton, and M. Solomon. Shoring up persistent applications. ACM Sigmod Conference Proceedings, May 1994.

[Cha94] Chadwyck-Healey. The English Poetry Full-Text Database, 1994. The works of more than 1,250 poets from 600 to 1900.

[ISO86] International Standards Organization. ISO 8879: Information Processing - Text and Office Systems - Standard Generalized Markup Language (SGML), 1986.

[ISO92] International Standards Organization. ISO/IEC 10744: Hypermedia/Time-based Structuring Language: HyTime, 1992.

[ISO94] International Standards Organization. ISO/IEC DIS 10179: Document Style Semantics and Specification Language: DSSSL, 1994.

[Jav95] Sun Microsystems. The Java(TM) Language Specification: Version 1.0 Beta, 1995.

[Ope94] Open Text Corporation. Open Text 5.0, 1994.

[SD96] Arijit Sengupta and Andrew Dillon. Extending SGML to accommodate database functions: A methodological overview. JASIS, 1996. To appear in the JASIS special issue on Structured Information/Standards for Document Architectures.

[Sen95] Arijit Sengupta. Design and implementation of a database environment for the manipulation of structured documents. Proposal for Ph.D. Thesis, April 1995.

[Sen96a] Arijit Sengupta. Demand more from your SGML database! Bringing SQL under the SGML limelight. 9(4):1-7, April 1996.

[Sen96b] Arijit Sengupta. Standardizing the querying process with SGML: The SQL DTD. In Tommie Usdin and Debbie Lapeyre, editors, Proceedings of the SGML'96 Conference. Graphic Communications Association, November 1996. To appear in the Conference Proceedings.

[SQL86] ANSI X3.135-1986, Database Language SQL, 1986.

[Zlo77] M. M. Zloof. Query by example: A database language. IBM Systems Journal, 16(4), 1977.