Efficient Retrieval of Structured Documents from Object-Relational Databases
Rafael Berlanga, María José Aramburu and Salvador García
Universitat Jaume I, Departamento de Informática
Campus Riu Sec, E-12071 Castellón, SPAIN
Tel. +34-64-72 8304 Fax. +34-64-72 8435
{berlanga, aramburu, garcia}@inf.uji.es
Abstract: This paper proposes a new and efficient method to represent, store and retrieve structured documents from object-relational databases. Its main contribution consists of a codification scheme for document structures that assigns codes to documents as an additional attribute. Thus, retrieval conditions on the structure of documents can be evaluated by applying these codes, avoiding traversing object references. The paper also gives some clues to the construction of a repository of structured documents over object-relational tables, so that query conditions regarding the contents, structure and metadata of documents are executed by the underlying database system.

1. Introduction
The rise of Internet-based digital libraries has led to a general demand for efficient tools to store and retrieve documents. Nowadays, document repositories can be queried by means of information retrieval mechanisms extended with some predicates to specify conditions on document structures [1-2]. Additionally, the storage and retrieval of SGML documents [3] has been approached from the perspective of object databases extended with information retrieval mechanisms [4-5].
Although information retrieval techniques have proven very useful when applied to centralised document repositories, they cannot support some of the most important requirements of digital libraries. Among these requirements we highlight the distributed architecture of these systems, security matters, and the management of dynamic document metadata. Given that object-relational database systems can satisfy many of these requirements, a valid approach to the development of future digital libraries consists of integrating object-relational database technology with information retrieval techniques. In fact, the latest versions of the most important commercial database systems already provide
an information retrieval module together with many object-oriented and distributed features (e.g. Oracle [6]).
In this paper we propose an efficient method to represent, store and retrieve structured documents from object-relational databases. Its main contribution is a new codification scheme for document structures that assigns codes to documents as an additional attribute. Retrieval conditions on the structure of documents can be evaluated by applying these codes, thus avoiding the traversal of object references. As a global result, this paper gives some guidelines for the construction of a repository of structured documents over a group of relational tables whose columns store multimedia data. Assuming that an information retrieval module is provided, query conditions regarding the contents, structure and metadata of documents can be executed by the underlying object-relational database system without any further extensions.
The rest of the paper is organised as follows. Section 2 describes the adopted data model for structured documents and its implementation in an object-relational database. Section 3 analyses the document retrieval language, paying special attention to query conditions on the structure of documents. Section 4 describes a new method to represent and codify structured documents in object-relational databases. The translation of the document retrieval language into the object-relational query language is analysed in Section 5. Section 6 discusses other approaches to the modelling and storage of structured documents. Finally, Section 7 gives some conclusions.

2. TOODOR Document Model
TOODOR (Temporal Object-Oriented Document Organisation and Retrieval) is a document model intended to represent and query structured documents with temporal behaviour. In previous works [7-8], we have shown its utility for the construction of digital libraries. This section summarises some of its main features.
The document model of TOODOR relies on a type system denoted DTS whose constructors are similar to those defined by SGML [3]. This type system starts from a set of attribute names Att, a set of class names Class, and a set of multimedia data types RawData, to construct document types as follows [9-10]:
DTS := RawData | Class | (T1 |...| Tm) | T+ | T* | T? | [a1: T1,..., am: Tm]
By using TOODOR, the schema of a digital library consists of a group of document classes whose definitions can evolve over time. Each class is expressed as a 3-tuple in the following way:
C = (C-name, C-span, C-history)
Here, C-name is a class name from Class, C-span is the lifespan of the class, and C-history contains a sequence of type definitions for the class. More specifically, the history of a class is represented as a series of 3-tuples (T, Pop, I), where T is a valid type from DTS, Pop groups the set of class instances created with that type, and I is a time interval that indicates when instances are inserted according to that class definition. The integrity constraints of this data model were already described in [8,10].
An important property of this data model is that the schema of a TOODOR digital library cannot contain recursive definitions and therefore cycles never appear in its composition hierarchies. Thus, the schema can always be represented as a directed acyclic graph (DAG), whose node set comprises the names of its attributes and classes, and whose edge set represents the composition relationships between classes and attributes. Given that classes can be redefined, edges are labelled with time periods. Additionally, the label “SET” is assigned to denote multi-valued composition relationships (i.e. attributes with the + and * type constructors). Figure 1 shows an example schema in which the class Article has two historical definitions, together with its representation as a DAG.

C1 = (C-name = Section, C-span = [1, now],
      C-history = (T1 = [date:date, contents:Article+],
                   Pop1 = [sec@1, ..., sec@12], I1 = [1, now]))
C2 = (C-name = Article, C-span = [1, now],
      C-history = (
        (T1 = [title:string, authors:string+, abstract:text?, body:text],
         Pop1 = [art@1, art@2, ..., art@20], I1 = [1, 50]),
        (T2 = [title:string, authors:string+, keywords:string+, body:text],
         Pop2 = [art@50, art@302, ..., art@501], I2 = [51, now])))

[Figure: DAG of the schema. Nodes: Section, date, contents, Article, title, authors, keywords, abstract, body and the raw data types date, string and text. Edges are labelled with their validity periods ([1, now], [1, 50] or [51, now]) and with SET when the component is multi-valued.]
Fig. 1: Example of Schema and Graph
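The schema graph of Fig. 1 can be sketched in a few lines of Python. This is only an illustration under assumptions: the edge list is derived from the two class definitions above rather than read off the figure, the SET flag is placed on the edge leaving the class, and the names EDGES, children_at and is_acyclic are hypothetical.

```python
from collections import defaultdict

NOW = float("inf")  # stand-in for the open-ended instant "now"

# (parent, child, set_valued, (start, end)); layout assumed from C1 and C2
EDGES = [
    ("Section", "date", False, (1, NOW)),
    ("Section", "contents", True, (1, NOW)),
    ("contents", "Article", False, (1, NOW)),
    ("Article", "title", False, (1, NOW)),
    ("Article", "authors", True, (1, NOW)),
    ("Article", "abstract", False, (1, 50)),
    ("Article", "keywords", True, (51, NOW)),
    ("Article", "body", False, (1, NOW)),
]


def children_at(node, t):
    """Components of `node` whose composition edge is valid at instant t."""
    return [c for p, c, _, (lo, hi) in EDGES if p == node and lo <= t <= hi]


def is_acyclic(edges):
    """Depth-first search that fails as soon as a back edge is found."""
    graph = defaultdict(list)
    for parent, child, _, _ in edges:
        graph[parent].append(child)
    state = {}  # node -> "open" while on the DFS stack, "done" afterwards

    def visit(node):
        if state.get(node) == "open":
            return False          # reached a node that is still being explored
        if state.get(node) == "done":
            return True
        state[node] = "open"
        ok = all(visit(nxt) for nxt in graph[node])
        state[node] = "done"
        return ok

    return all(visit(n) for n in list(graph))
```

Calling children_at("Article", 10) returns the components of the first historical definition, while children_at("Article", 60) returns those of the second one.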

2.1. Some Implementation Issues
When representing the schema of a digital library in an object-relational database, the following two relational tables can be used to store the schema graph:
node_table(Node_name, Type, Code)
edge_table(Node_name1, Node_name2, Set_valued, Span)
The first table describes the nodes in the schema graph, where Type indicates whether the node denotes a class, an attribute or a raw data type. The column Code holds a unique identifier for each node, which is later used to codify the schema. The second table describes the composition relationships, where Node_name1 and Node_name2 are two nodes connected by an edge during the period Span, and Set_valued indicates whether the relationship is multi-valued or single-valued. These two tables form part of the digital library data dictionary, so they can be used to consult the schema, to validate the consistency of document structures or to parse path expressions.
Concerning the instance documents of a possible Internet-based digital library, they could be expressed in either HTML or XML format. However, the management of HTML documents in this application is not optimal because the logical structure of the documents has to be inferred from their formatting tags [11], which cannot always be done automatically. In contrast, supposing that the XML tag names correspond to the names of the attributes and classes in the schema, XML documents can be directly inserted into the database. In Section 4 we present a storage format for XML documents that allows for their efficient retrieval in an object-relational database.

3. Document Retrieval Language
This section describes the main features of the retrieval language of TOODOR, named TDRL (Temporal Document Retrieval Language). In this language, retrieval conditions can be of three types: structural, contents and temporal conditions. This section only deals with structural conditions, whereas Section 5.3 discusses how to combine structural and contents conditions during query processing. The range of temporal conditions of TDRL was already presented in [12].
Syntactically, TDRL is based on the OQL standard [13], and allows the retrieval of documents by specifying sentences with the following format:
SELECT Otarget
FROM Type1 as O1, ..., Typek as Ok
WHERE Cd1 and ... and Cdn
AT TimeSpan
The variable Otarget represents the set of documents included in the portion of the digital library specified by the object variables O1, ..., Ok and the conditions Cd1, ..., Cdn. The AT clause can be used to make a temporal projection of the database, so that only the documents inserted during the specified time interval are retrieved.

3.1 Path Expressions
As in other object query languages, path expressions in TDRL are used to generate the set of objects that can be accessed through the corresponding schema trajectory. The syntax of path expressions is as follows:
Path ::= Class Path’
Path’ ::= Element Path’ | ε
Element ::= .Att | .Att(Num) | .#
Notice that when a path expression includes multi-valued references, it is possible to specify a number in parentheses that indicates the exact position of the intended component. Furthermore, TDRL allows for generalised path expressions [14] by using the symbol ‘#’, which matches any valid path of any length. The following sentences are examples of queries with path expressions:
SELECT O FROM The_Times.contents.sections(3).articles(2).title as O;
SELECT O FROM The_Times.contents.#.title as O;
SELECT O FROM News.data.place as O;
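As an illustration of the path syntax, the sketch below splits a path expression into its class name and its steps. It is a hypothetical helper (parse_path and STEP are not part of TDRL), and it assumes that class and attribute names are plain identifiers made of letters, digits and underscores.

```python
import re

# One step of a path: ".name", ".name(3)" or the generalised step ".#"
STEP = re.compile(r"\.(?:(#)|([A-Za-z_]\w*)(?:\((\d+)\))?)")


def parse_path(text):
    """Split a path expression into (class_name, steps).

    Each step is ("#", None) for a generalised step, or (attribute, position)
    where position is None when no index is given.
    """
    head = re.match(r"[A-Za-z_]\w*", text)
    if head is None:
        raise ValueError("a path expression must start with a class name")
    steps, pos = [], head.end()
    while pos < len(text):
        m = STEP.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}: {text[pos:]!r}")
        if m.group(1):
            steps.append(("#", None))
        else:
            index = int(m.group(3)) if m.group(3) else None
            steps.append((m.group(2), index))
        pos = m.end()
    return head.group(0), steps
```

For the first example query, parse_path("The_Times.contents.sections(3).articles(2).title") yields the class The_Times followed by four steps, two of which carry a position.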
Path expressions must always terminate in a single class; otherwise they are said to be ambiguous. To determine such a class, it is necessary to navigate across the digital library schema. Path expression ambiguity arises in two cases: when the type of a component of the path is a union of types, and when a component of the path has several historical definitions. In both cases the reached objects can belong to different types. To avoid this ambiguity, queries with ambiguous path expressions must be rewritten before execution into several sub-queries in the following way:
1. The query “SELECT O FROM C.#.att as O;” where the type of att is (C1 | C2), contains an ambiguous path expression (C.#.att). Therefore, it must be rewritten into the following two sub-queries:
SELECT O FROM C.#.att{C1} as O;
SELECT O FROM C.#.att{C2} as O;
In this case, when generating the domain of the variable O, only the objects of the specified class will be considered.
2. The query “SELECT O FROM C.#.att as O;” with att of type C1 during the period I1 and of type C2 during the period I2, must be translated into the following sub-queries:
SELECT O FROM C.#.att as O AT I1;
SELECT O FROM C.#.att as O AT I2;
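The two rewriting rules can be sketched as a small string-level function. This is a simplified illustration, not the query analyser of TOODOR: the function name rewrite_ambiguous is hypothetical, and the union members or historical periods are assumed to have been looked up in the schema beforehand.

```python
def rewrite_ambiguous(path, union_types=(), periods=()):
    """Return unambiguous sub-queries for `SELECT O FROM <path> as O;`.

    union_types: classes of a union type reached by the path (rule 1).
    periods:     validity periods of the historical definitions (rule 2).
    """
    if union_types:
        # Rule 1: restrict the reached objects to one class per sub-query
        return [f"SELECT O FROM {path}{{{t}}} as O;" for t in union_types]
    if periods:
        # Rule 2: project the database onto one period per sub-query
        return [f"SELECT O FROM {path} as O AT {p};" for p in periods]
    return [f"SELECT O FROM {path} as O;"]  # the path was not ambiguous
```

For instance, rewrite_ambiguous("C.#.att", union_types=["C1", "C2"]) produces the two sub-queries of the first case.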
Another matter caused by ambiguity is that query conditions may be inconsistent for some of the objects in the query. In this respect, when a condition cannot be evaluated over a certain object in TDRL, the result is always false.

3.2 Structural Conditions
Apart from generating the values of object variables by means of path expressions, TDRL also allows the evaluation of structural conditions on documents. For this purpose the following predicates are defined:
• in(o1, o2) is true if there exists a path expression of any length that goes from o1 to o2. For instance, the following query retrieves all the articles within the fourth section of a newspaper:
SELECT A FROM Article as A, The_Times.#.sections(4) as S WHERE in(S, A);
• child(o1, o2, i) is true if o1 is the i-th child of o2. The parameter i is optional.
• same-parent(o1, o2, d) is true if o1 and o2 are siblings of the same parent and their positions among the siblings differ by d, so that d = 1 denotes contiguous siblings. As an example, the following query looks for news items about 'Brasil' that are contiguous to a news item about 'IMF':
SELECT N1 FROM News as N1, News as N2 WHERE same-parent(N1, N2, 1) and contains(N1, 'Brasil') and contains(N2, 'IMF');
• common-ancestor(o1, o2, l) is true if o1 and o2 have a common ancestor that is located in the composition tree at least l levels above the shallower of both objects.
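To make the semantics of these predicates concrete, the following sketch evaluates them by navigating an explicit composition tree, which is the reference-traversal strategy that the rest of the paper tries to avoid. The Node class and the function names are assumptions made for the illustration, and child positions are taken to be 1-based.

```python
class Node:
    """A document component that knows its parent and its ordered children."""

    def __init__(self, name, parent=None):
        self.name, self.parent, self.children = name, parent, []
        if parent is not None:
            parent.children.append(self)


def ancestors(node):
    """Proper ancestors of a node, nearest first."""
    out = []
    while node.parent is not None:
        node = node.parent
        out.append(node)
    return out


def depth(node):
    return len(ancestors(node))


def is_in(o1, o2):
    """in(o1, o2): a composition path leads from o1 down to o2."""
    return o1 in ancestors(o2)


def child(o1, o2, i=None):
    """child(o1, o2, i): o1 is a child of o2, the i-th one if i is given."""
    if o1.parent is not o2:
        return False
    return i is None or o2.children.index(o1) + 1 == i


def same_parent(o1, o2, d):
    """same-parent(o1, o2, d): siblings whose positions differ by d."""
    if o1.parent is None or o1.parent is not o2.parent:
        return False
    siblings = o1.parent.children
    return abs(siblings.index(o1) - siblings.index(o2)) == d


def common_ancestor(o1, o2, l):
    """common-ancestor(o1, o2, l): a shared ancestor lies at least l levels
    above the shallower of the two objects."""
    shallow = min(depth(o1), depth(o2))
    shared = [a for a in ancestors(o1) if a in ancestors(o2)]
    return any(shallow - depth(a) >= l for a in shared)
```

Every call walks parent references, so the cost grows with the depth of the tree and with the number of candidate pairs.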
Previous works have defined some algebra operators with similar semantics (see [2, 15] for reviews). In the next section we propose a new codification scheme for document structures that allows structural conditions to be evaluated at a lower computational cost.

4. Representing Documents with the Object-Relational Data Model
In TOODOR, documents are represented as objects with the following format:
D = (Oid, It, Vt, Contents)
Oid denotes the unique identifier of the document, It is its time of publication, Vt is the time interval during which its contents are valid, and the multimedia contents of the document are represented in the last component. Furthermore, given the object-oriented data model of TOODOR, the Contents component also stores the references that represent the composition relationships between documents. Notice that this is the most widely accepted approach for storing structured documents in object-oriented databases [5, 14].
However, object references entail some important drawbacks. On the one hand, although references are typed, the union of types is not supported by the data models of current solutions [5, 14]. This issue increases the complexity of query processing tasks. On the other hand, evaluating structural conditions requires navigating through object references, which is too expensive for large volumes of documents.
In this section we propose a codification scheme for storing structured documents with object references that has the advantage of allowing structural conditions to be evaluated efficiently. With this scheme the problem of the union of types is also avoided. As a result, documents can be stored in object-relational tables and queried by following the same model.

4.1 Document Representation
Starting from a TOODOR digital library schema, in this section we explain how to map documents onto object-relational tables.
Firstly, each document class in the initial schema is associated with a table whose rows contain its instances. The format of each table depends on the metadata defined for the corresponding document class [7] (i.e. attributes such as author, data, etc.). All the tables follow the format:
class-name(Oid, SCode, It, Vt, Metadata, Contents)
Here Oid is a unique identifier, SCode is a special code relating to the document structure, Metadata is a record of attributes describing the metadata, and the document contents are stored in the last column as multimedia data values with references.
The idea of a code for representing the structure of documents was taken from the work presented in [16] for evaluating recursive relational queries. The objective is to define a codification scheme for expressing the position of each instance document inside the global schema of classes. In order to build codes that can be efficiently processed, a codification scheme must satisfy the following two properties:
1. Codes must induce a clustering of each document and its components. In this way, each tree of composition corresponds to a single cluster.
2. Codes must facilitate the evaluation of structural conditions and path expressions, avoiding as much as possible navigating through object references.
The next section describes how to obtain the code for each inserted document.

4.2 Codifying Document Structures
The code of a document object represents its position inside the logical organisation of the whole repository. This position is determined by the path followed through the digital library schema before inserting it in its corresponding class. Given that the schema of a digital library in TOODOR is a directed acyclic graph and the structure of a document is a tree, this path is unique for each inserted object. Consequently, by codifying these paths it is possible to assign a different code to each object, which also indicates its exact position.
Given that each code, here denoted SCode, must be unique for each inserted document object, it is also necessary to ensure a different code for each object at the end of a multi-valued reference. Thus, for all the siblings of an ordered sequence, it is their relative position in the sequence that distinguishes them.
Each document in HTML or XML format can be considered as a tree whose internal nodes correspond to instances of the classes in the schema and whose leaves are instances of multimedia raw data types that store the contents of the document. We denote with parent(N) the parent of the node N, and with Nroot the root of the composition tree. For repetitive components, pos(N) expresses the position of the node N with respect to its siblings. Finally, each SCode is built by concatenating the codes associated to the elements E of the schema (see Section 2), which are denoted code(E). Therefore, to obtain the SCode of a node, the following recursive function can be applied:
SCode(Nroot) = code(Nroot) • Root_Id
SCode(N) = SCode(parent(N)) • code(N) • pos(N)
The operator • concatenates codes which, depending on the size of the application, can be represented as strings of either characters or bits. Each pair of codes (code(N) • pos(N)) obtained for a node is named a segment. Thus, the number of segments in an SCode coincides with the level of the node in its composition tree. Furthermore, Root_Id denotes a unique identifier associated to the root node, which is inherited as a prefix by all the SCodes of the tree nodes. Thus, it constitutes the identifier of the whole composition tree.

[Figure: composition tree of the document Sect#5, whose component date holds the value 1/2/1999 and whose component contents holds Artl#1, Artl#2 and Artl#3.]

Schema Encoding
E          Code(E)
Section    S
date       d
contents   c
Article    A
String     s
Date       D
...        ...

Object SCodes
Object     Path    SCode
Sect#5     S       S5
Date#1     SdD     S5d0D0
Artl#1     ScA     S5c1A0
Artl#2     ScA     S5c2A0
Artl#3     ScA     S5c3A0

Fig. 2: Schema Encoding and its SCodes
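The recursive definition of SCode can be checked against the values of Fig. 2 with the sketch below. It rests on assumptions read off the figure: attribute elements are tree nodes of their own, single-valued components take position 0, positions fit in one character, and the Node class and the CODE table are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

# Schema encoding of Fig. 2: element name -> one-character code
CODE = {"Section": "S", "date": "d", "contents": "c",
        "Article": "A", "String": "s", "Date": "D"}


@dataclass
class Node:
    element: str                      # class, attribute or raw type name
    pos: int = 0                      # position among siblings, 0 if single-valued
    parent: Optional["Node"] = None


def scode(node, root_id):
    """SCode(Nroot) = code(Nroot) . Root_Id
    SCode(N)     = SCode(parent(N)) . code(N) . pos(N)"""
    if node.parent is None:
        return CODE[node.element] + str(root_id)
    return scode(node.parent, root_id) + CODE[node.element] + str(node.pos)


# The composition tree of Sect#5 (only two of its branches are built here)
sect = Node("Section")
date_value = Node("Date", 0, Node("date", 0, sect))
article_2 = Node("Article", 0, Node("contents", 2, sect))
```

With Root_Id 5, scode(sect, 5), scode(date_value, 5) and scode(article_2, 5) give the codes listed for Sect#5, Date#1 and Artl#2.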
It can be proved that this codification scheme defines a partial order between document elements by means of the prefix relationship. For this reason the objects in a composition tree representing a certain document can be clustered and indexed by a conventional B+-tree. Therefore, the first property of an efficient codification scheme is satisfied. In the next section, the advantages of this codification for the evaluation of path expressions and structural conditions are analysed.
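The clustering effect of the prefix relationship can be illustrated with a sorted list, used here as a stand-in for the leaf level of a B+-tree. The function subtree is a hypothetical name, and the sketch assumes that Root_Id has a fixed width so that one root code is never a prefix of another.

```python
from bisect import bisect_left


def subtree(sorted_codes, prefix):
    """Range scan: every code that starts with `prefix` in a sorted list.

    Codes sharing a prefix are contiguous in lexicographic order, so one
    binary search locates the start and a sequential scan finds the end.
    """
    lo = bisect_left(sorted_codes, prefix)
    hi = lo
    while hi < len(sorted_codes) and sorted_codes[hi].startswith(prefix):
        hi += 1
    return sorted_codes[lo:hi]


codes = sorted(["S6c1A0", "S5c2A0", "S5", "S5d0D0", "S6", "S5c1A0", "S5c3A0"])
```

Here subtree(codes, "S5") returns the whole composition tree of the document with Root_Id 5, and subtree(codes, "S5c") only the part under its contents component.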

5. Query Processing in TOODOR
Under this representation, query processing in TOODOR can be summarised in the following steps. Firstly, the TDRL sentence is analysed in order to find ambiguous path expressions and inconsistent conditions. As a result, some query elements can be simplified and the initial sentence may be divided into several queries without ambiguity. After this step, the TDRL sentence is translated into the query language provided by the object-relational database adopted for implementing TOODOR. Thus, query processing is carried out by the underlying database system. Finally, the answer to the query is arranged for a proper presentation to the application.

5.1 Evaluation of Path Expressions
When processing a path expression, the system first checks that it is well formed with respect to the digital library schema and that it is not ambiguous. Then, the query processor locates the final class that stores the objects in the domain of the variable associated to the path expression. For a path expression P, we denote this class as goal(P). Afterwards the path expression is codified with the same scheme used for assigning the SCode to each inserted document (see Section 4.2). SCode(Path) is the resulting SCode and its evaluation is made by applying the interpreted grammar below:
Path ::= Class Path’ {CodePath = Code(Class) • '%' • CodePath’}
Path’ ::= Element Path’2 {CodePath’ = CodeElement • CodePath’2}
        | ε {CodePath’ = ''}
Element ::= .Att {CodeElement = Code(Att) • '_' • Code(type(Att)) • '_'}
          | .Att(Num) {CodeElement = Code(Att) • [Num] • Code(type(Att)) • '_'}
          | .# {CodeElement = '%'}
In this grammar, we assume that SCodes are represented as strings of characters, and that each code segment comprises two characters (except for the code of the root). We use two wildcards: ‘_’ to denote one anonymous character, and ‘%’ to denote one anonymous sub-string of any length. Additionally, we use the operator [N] to denote the conversion of the number N to a string. Once the path expression is codified and its final class identified, the objects of its domain are generated by a select operation as follows:
table(Path) = SEL_{like(SCode, SCode(Path))} (goal(Path))
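A minimal sketch of this select operation, assuming the schema encoding of Fig. 2 and using SQLite only as a convenient stand-in for the object-relational system. The names like_pattern, select_path, CODE and TYPE_OF are hypothetical, and the path is given already parsed as a class name plus (attribute, position) steps.

```python
import sqlite3

CODE = {"Section": "S", "date": "d", "contents": "c", "Article": "A", "Date": "D"}
TYPE_OF = {"date": "Date", "contents": "Article"}  # type(Att), assumed from Fig. 2


def like_pattern(class_name, steps):
    """Encode a parsed path expression as a LIKE pattern over SCodes."""
    pattern = CODE[class_name] + "%"           # '%' absorbs the Root_Id
    for att, num in steps:
        if att == "#":
            pattern += "%"                     # generalised step
        else:
            pos = "_" if num is None else str(num)
            pattern += CODE[att] + pos + CODE[TYPE_OF[att]] + "_"
    return pattern


db = sqlite3.connect(":memory:")
# SQLite's LIKE ignores ASCII case by default, whereas SCodes are case-sensitive
db.execute("PRAGMA case_sensitive_like = ON")
db.execute("CREATE TABLE Article (Oid TEXT, SCode TEXT)")
db.executemany(
    "INSERT INTO Article VALUES (?, ?)",
    [("Artl#1", "S5c1A0"), ("Artl#2", "S5c2A0"), ("Artl#3", "S5c3A0")],
)


def select_path(goal_table, pattern):
    """table(Path): rows of goal(Path) whose SCode matches the pattern."""
    query = f"SELECT Oid FROM {goal_table} WHERE SCode LIKE ? ORDER BY Oid"
    return [oid for (oid,) in db.execute(query, (pattern,))]
```

For the path Section.contents(2) the pattern is 'S%c2A_', which selects Artl#2 only, whereas Section.contents gives 'S%c_A_' and selects the three articles.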
Here the operator like/2 corresponds to the standard SQL operator for evaluating string matching conditions. The like operator is needed because the codes of path expressions include wildcards. One limitation of this approach is that the wildcard ‘%’ does not constrain the length of the intermediate sub-strings that can appear in path expression codes. In our case, the size of segments is fixed and therefore the length of these sub-strings should be a multiple of that size, which ‘%’ cannot express. As a consequence, the operator like can produce some false drops in the answer, which need to be removed. It is worth mentioning that the inclusion of regular expressions in the like operator would solve this drawback.

5.2 Evaluation of Structural Conditions
After this process, each variable in a TDRL sentence has its initial domain restricted to the objects in a table. The table associated to the variable O is here denoted table(O). A query can only contain unary and binary conditions, that is, conditions on a single variable and conditions relating two variables of the query. Thus, unary conditions can be evaluated with a select operation, whereas binary conditions are evaluated with join operations over the corresponding tables. In the special case of the structural predicates, the required join operation can be evaluated in terms of the SCodes associated to the objects involved. Supposing that X and Y are two tables with the domains of two object variables involved in a structural condition, this can be evaluated as follows:
structural_condition(X, Y) ⇒ JOIN_{join_condition} (X, Y)
The evaluation of each structural predicate of Section 3.2 needs a different join condition. As these are conditions over the strings that represent the SCodes of the objects, their definition requires the following string operators:
• prefix(code1, code2) indicates whether the first code string is a prefix of the second.
• length(code) gives the number of segments that constitute the string code.
• code|p returns the left part of the code string truncated after its p-th segment.
• code|p returns the right part of the code string truncated after its p-th segment.
• code[p] returns the p-th segment of code.
• code[p].i returns the first or second (i) component of the p-th segment of code.
Table 1 specifies how the structural predicates of Section 3.2 are evaluated by means of the previous string operators. In this table, code1 is the SCode of the object o1 and code2 is the SCode of o2.

Predicate                    Join Condition
in(o1, o2)                   prefix(code1, code2)
child(o1, o2, i)             code1|length(code1)-2 = code2 ∧
                             code1[length(code1)-1].2 = i
same-parent(o1, o2, d)       code1|length(code1)-2 = code2|length(code2)-2 ∧
                             code1[length(code1)-1].1 = code2[length(code2)-1].1 ∧
                             abs(code1[length(code1)-1].2 - code2[length(code2)-1].2) = d
common-ancestor(o1, o2, l)   ∃ p / code1|p = code2|p ∧
                             length(code1|p) ≥ l ∧ length(code2|p) ≥ l

Table 1: Join Conditions to Evaluate Structural Conditions on Documents
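The join conditions of Table 1 can be sketched over character SCodes as follows. This is an illustration under stated assumptions rather than the implementation used in TOODOR: the root part is taken to be two characters long (ROOT_LEN), segments are two characters long with a one-digit position, and all function names are hypothetical.

```python
ROOT_LEN = 2  # assumed: one character for the class code plus a one-character Root_Id
SEG_LEN = 2   # one character for code(N) plus one character for pos(N)


def length(code):
    """Number of segments that follow the root part."""
    return (len(code) - ROOT_LEN) // SEG_LEN


def left(code, p):
    """code|p: the root part plus the first p segments."""
    return code[:ROOT_LEN + p * SEG_LEN]


def seg(code, p):
    """code[p]: the p-th segment, counting from 1."""
    start = ROOT_LEN + (p - 1) * SEG_LEN
    return code[start:start + SEG_LEN]


def is_in(c1, c2):
    return c2.startswith(c1)                     # prefix(code1, code2)


def child(c1, c2, i=None):
    n = length(c1)
    if n < 2 or left(c1, n - 2) != c2:
        return False
    return i is None or int(seg(c1, n - 1)[1]) == i


def same_parent(c1, c2, d):
    n1, n2 = length(c1), length(c2)
    if n1 < 2 or n2 < 2 or left(c1, n1 - 2) != left(c2, n2 - 2):
        return False
    s1, s2 = seg(c1, n1 - 1), seg(c2, n2 - 1)
    return s1[0] == s2[0] and abs(int(s1[1]) - int(s2[1])) == d


def common_ancestor(c1, c2, l):
    top = min(length(c1), length(c2))
    return any(left(c1, p) == left(c2, p) for p in range(l, top + 1))


def join(xs, ys, condition):
    """Nested-loop join of two lists of SCodes under a join condition."""
    return [(x, y) for x in xs for y in ys if condition(x, y)]
```

With the three article codes of Fig. 2, join(articles, articles, lambda a, b: same_parent(a, b, 1)) pairs each article with its contiguous siblings without following any object reference.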

5.3 Combining Structural and Contents Conditions
Given that the SCode of a document distinguishes it from the rest of the instances in the repository, these codes could be used as document identifiers in an information retrieval system. As a result, structural conditions could be evaluated with the same mechanisms as contents conditions. These mechanisms would first extract from an inverted index the SCodes of the documents that satisfy the contents conditions. Afterwards, the retrieved SCodes would serve to evaluate the structural conditions before retrieving the documents from the repository.
Additionally, SCodes can be used to improve the precision and recall of the information retrieval system. The idea is to assign to each SCode a level of relevance depending on its relative position in the schema. In other words, those topics that appear in a relevant position of the document such as the title would have a higher level of relevance
than those that appear in another section. Thus, starting from the following function to assign a relevance to each SCode:
structure_relevance : SCode → [0, 1]
and after processing the conditions on contents and structure, the resulting documents could be ordered by their relevance in terms of both the frequency of topics (IDF [17]) and the position of topics. In general, the relevance of a document could be evaluated as follows:
relevance(D) = f(IDF_D, structure_relevance(SCode(D)))
With the purpose of improving the precision and recall of the system, we are currently analysing some possible linear functions to rank answers.

6. Discussion
Traditionally, the problem of storing and retrieving structured documents has been approached from two separate perspectives: databases and information retrieval systems. In both areas, an important topic of research is how to represent the structure of documents so that it can be applied during query processing.
In the area of database systems the most frequently adopted approach is the object-oriented data model. Under this model, documents are represented as trees of composition with references between their component objects [14]. Thus, structural query conditions are evaluated by traversing object references. In order to process conditions on document contents, these systems usually support some kind of coupling with an information retrieval system [4, 5]. Among their general drawbacks, we highlight their limited ability to manage flexible document types, and the high computational cost involved in processing structural conditions by traversing references.
Recently, following the global interest in Internet-based applications, some new database models for semi-structured data have been proposed [5, 18]. Their main objective is to store documents in a database and retrieve them without any specific schema that indicates how to represent them.
However, the absence of a schema implies high computational costs during query processing because query optimisation is difficult. Furthermore, the possible adoption of XML for future Internet-based applications leads us to consider conventional typed databases as the most suitable solution.
In the information retrieval area, several extensions that consider the structure of documents during retrieval operations have been proposed. A good review of these extensions appears in [2]. In general terms, these formalisms define an algebra whose operators combine conditions on the structure and contents of documents. The evaluation of these operators requires two indexes: one for the text terms and another for the composition trees of documents. Although this method increases the complexity of the information retrieval system, it allows queries to be evaluated quite efficiently. However, these systems do not support metadata in their queries, and their operators over the structure of documents make it difficult to specify path expressions and other query conditions. Finally, conditions on contents and structure are evaluated separately, which in many cases is not the right alternative because they are not independent.
The work presented in this paper is an attempt to join both areas with the purpose of integrating conditions on the structure and contents of documents. On the one hand, queries are specified by means of a language that also supports conditions on metadata. This language adopts the object-oriented formalism when specifying path expressions, so that it is possible to indicate the origin of each document. On the other hand, the proposed language includes some operators similar to those provided by information retrieval systems to specify conditions on the contents and structure of documents. In Section 5.3 we showed how to combine both types of conditions when assigning a level of relevance to each document. Furthermore, all these features can be efficiently executed because they are supported by means of some small extensions to an object-relational database system with information retrieval operations.

7. Conclusions
In this paper we have presented a new approach to the representation and retrieval of structured documents that can be directly supported by current object-relational databases. Its main novelty is a codification scheme for the composition hierarchies of documents that allows for their storage in relational tables, and their retrieval by means of an SQL-based language with some extensions.
TOODOR has been implemented on top of a commercial object-relational database system with good information retrieval capabilities [20]. In the current version, structure
codes have been represented as strings of characters so that they are queried by using the string matching operators of the database system. Text queries are performed by an information retrieval server, which is fully maintained by the database management system.
This first version of TOODOR is being tested on a national newspaper document database. The schema of this database has about 30 elements (types and attributes) and about 10,000 documents are stored. In this context, the size of the structure codes is at most 20 characters. Preliminary results show a good performance for this approach, and typical queries on the structure and contents of newspapers are evaluated in the order of seconds [8].
Future work is mainly focused on applying TOODOR to distributed digital libraries. Additionally, further work is being carried out in defining document relevance functions that relate contents and structure.

Acknowledgements
This work has been partially funded by the CICYT project TEL97-1119 and the Fundación Caixa Castelló.

References
[1] A. Salminen and F. Tompa. “PAT Expressions: an Algebra for Text Search”. In COMPLEX’92, pp. 309-332, 1992.
[2] R. Baeza-Yates and G. Navarro. “Integrating Contents and Structure in Text Retrieval”. In SIGMOD Record, Vol. 25, No. 1, pp. 67-79, 1996.
[3] ISO 8879. Information Processing - Text and Office Systems, Standard Generalized Markup Language, 1996.
[4] M. Volz, K. Aberer and K. Böhm. “Applying a Flexible OODBMS-IRS-Coupling for Structured Document Handling”. In Proc. International Conference on Data Engineering, pp. 10-19, 1996.
[5] S. Cluet. “Modelling and Querying Semi-Structured Data”. In Lecture Notes in Computer Science, Vol. 1299, Springer-Verlag, pp. 192-213, 1997.
[6] C. Doherty. “Database Systems Management in Oracle8”. In Proc. SIGMOD ACM Conference, pp. 510-511, 1998.
[7] M. Aramburu and R. Berlanga. “An Approach to a Digital Library of Newspapers”. In Information Processing & Management, Vol. 33(5), pp. 645-661, 1997.
[8] M. Aramburu. “TOODOR: A Temporal Database Model for Historical Documents”. PhD Thesis, The University of Birmingham, UK, 1998.
[9] M. Aramburu and R. Berlanga. “Temporal Object-Oriented Document Organisation and Retrieval”. In Proc. of the Third Biennial World Conference on Integrated Design and Process Technology: Issues and Applications of Database Technology, Society for Design and Process Science, pp. 368-375, 1998.
[10] M. Aramburu and R. Berlanga. “Metadata for a Digital Library of Historical Documents”. In Proc. 8th International Conference on Database and Expert System Applications, pp. 409-418, 1997.
[11] I. Sanz, R. Berlanga and M. Aramburu. “Gathering Metadata from Web-based Repositories of Historical Publications”. In Proc. 9th Workshop on Database and Expert Systems Applications, pp. 473-478, 1998.
[12] M. Aramburu and R. Berlanga. “A Retrieval Language for Historical Documents”. In Proc. of 9th International Conference on Database and Expert Systems Applications, pp. 216-225, 1998.
[13] R. Cattell (ed.). “The Object Database Standard: ODMG-93 Release 1.2”. Morgan Kaufmann, 1996.
[14] V. Christophides, et al. “From Structured Documents to Novel Query Facilities”. In Proc. of the ACM SIGMOD International Conference on Management of Data, pp. 313-324, 1994.
[15] G. Navarro and R.A. Baeza-Yates. “Proximal Nodes: A Model to Query Document Databases by Content and Structure”. In ACM Transactions on Information Systems, Vol. 15, No. 4, pp. 400-435, 1997.
[16] J. Teuhola. “Path Signatures: A Way to Speed Up Recursion in Relational Databases”. In IEEE Transactions on Knowledge and Data Engineering, Vol. 8(3), 1996.
[17] G. Salton and M. McGill. “Automatic Text Processing”. Addison-Wesley, 1989.
[18] S. Abiteboul et al. “The Lorel Query Language for Semistructured Data”. In Journal of Digital Libraries, Vol. 1, No. 1, pp. 68-88, 1997.
[19] Y. Chen and K. Aberer. “Layered Index Structures in Document Database Systems”. In Proc. of the International Conference on Information and Knowledge Management, pp. 406-413, 1998.
[20] Oracle 8.0.4, User and Administrator Guides, 1998.