Technical Report DI 01-01/97
Metadata in a Digital Library of Periodicals
María José Aramburu Cabo, Rafael Berlanga Llavori
Departamento de Informática, January 1997
email: {aramburu,berlanga}@inf.uji.es
Universitat Jaume I, U. P. Informática, Campus de Penyeta Roja, 12071 CASTELLON
Summary
This work is devoted to the analysis and design of a digital library of periodicals and scientific journals. Libraries of this kind belong to a set of applications on structured historical documents, which was identified in previous work by the authors [Ara96][Ara97]. These problems have been approached from the point of view of object-oriented databases, so that the multimedia content of academic journals is stored in a document base that preserves the logical organisation with which it was conceived by the publishers. In this context, the query language of the document base is in charge of extracting the whole or partial content of the stored journals by describing attributes such as authors, keywords, publication dates, etc. In the proposed model, the concept of metadata plays a very important role. First, it provides a suitable framework for modelling and classifying the composition relationships between documents and their temporal properties. Second, the use of metadata improves the expressive power of the definition and query languages.
Abstract
This paper is devoted to the analysis and design of a digital library of periodicals and academic journals. Libraries of this kind belong to a set of applications on historical structured documents, which we identified in our previous work [Ara96][Ara97]. We have addressed these applications from the point of view of temporal object-oriented databases, so that the whole multimedia contents of the academic journals are stored in a document database that preserves the logical organisation given by the publishers. In this context, the database query language is in charge of extracting the global or partial contents of stored journals by describing the desired features, such as authors, keywords and dates, as well as their temporal relationships. In our model, metadata plays an outstanding role. Firstly, the concept of metadata provides a well-suited framework to model and classify the composition relationships between documents and their temporal properties. Secondly, the use of metadata improves the expressive power of both the document definition and the query languages.
Keywords: Digital Library, Periodicals, Structured Documents, Temporal Object Oriented Data Model, Metaclasses, Metadata.
1. Introduction
In this paper we present an approach to a digital library that stores and retrieves academic journals. As introduced in [Fox95], publishers and librarians have recently taken up this application because it raises many new expectations among their users. We have analysed the application from a particular point of view that gives users additional benefits. In our case, the whole multimedia contents of the publications are kept in a database, and their logical organisation is preserved as defined by the publishers. Users can then query the database to retrieve the total or partial contents of journals by describing their attributes, such as authors and keywords, but also their logical organisation and relationships. In their queries, users can also specify the particular views of the journals that they wish to be presented.
In addition, we have given our application a strong temporal dimension, which is its main novelty. We intend to express all the temporal attributes of journals as well as their temporal behaviour, and we allow users to specify all this information in their queries as further retrieval conditions. In this way, queries become more complex and descriptive, and answers are better adapted to the initial user needs.
In order to make the requirements of the application concrete (section 1.2), subsection 1.1 first analyses the properties of academic journals and periodicals as we understand them from the point of view of our application.
1.1 Some Properties of Periodicals
We regard periodicals as logical organisations of other document components written by many authors. Ultimately, the basic units of information that make up documents are multimedia elements such as texts, pictures, graphics and so on. Moreover, the composition of a journal usually follows a well-established profile, which constitutes what we term its document type. However, document types frequently need to be modified in order to accommodate the publication of a special issue, or to temporarily include extra sections that do not appear in regular editions.
On the other hand, periodicals exhibit a clear temporal behaviour. Time attributes such as the date and the frequency of publication are always associated with these documents. Additionally, it is important to define the time span during which the information contained in these publications is considered up to date [Tan93].
1.2 The Requirements of the Application
From both the nature of the documents in hand and the environment of the digital library, we identify a set of global requirements, namely [Ara96]:
- support for multimedia data types and user attributes,
- classification and organisation of many kinds of documents,
- provision for document identity and sharing,
- a language to define the document types, and
- a language to query the intended documents from the database.
Additionally, more complex requirements arise from the representation of those document properties that refer neither to their contents nor to user attributes. Specifically, the intended system should support the following features:
1. The generic structure of a document type should be defined in flexible terms and without ambiguity. Further, any instance of a document type should have a structure compatible with the generic structure of that type.
2. Temporal evolution of the document base schema needs to be supported.
3. Temporal information about documents should be explicitly defined within the data model.
4. Consistency notions for a temporal document base should be defined.
From this list, two kinds of requirements can be identified: those referring to the nature of document types and those referring to the temporal behaviour of documents. Both need to be expressed in terms of data associated with document types rather than with document instances. For this reason, in our approach we refer to all this information as the metadata of a document base.
This paper is mainly devoted to defining a formal framework for a document database that copes with the requirements above. First, Section 2 describes the formal data model following the usual definitions for an object-oriented data model. Section 3 briefly shows how this data model can represent the metadata and data of a database of periodicals and publications. Finally, Section 4 analyses the roles of metadata in a digital library of periodicals.
2. A Data Model for Historical Documents
The database literature has so far treated the requirements described in the previous section separately. First, it is widely accepted that the most appropriate approach to modelling structured documents is the object-oriented data model [Chr94][Özs95]. Second, the most natural way of incorporating metadata into this data model is to define metaclasses [Dia94]. Finally, time features have been included in most data models for databases [Tan93][Bert95].
In this section, we present a formal data model for historical documents that integrates all these aspects. Within an object-oriented data model, we make use of metaclasses to include time attributes and behaviour for describing the history of document types and objects. Therefore, this information is viewed and treated as metadata.
The remainder of the section is organised as follows. Subsection 2.1 describes the type systems on which the data model relies. Then, subsections 2.2, 2.3 and 2.4 define metaclasses, classes and objects. These definitions are adaptations of those described in other formalisms [Bert95][Abi95], made in order to cope with the application requirements. Subsection 2.5 examines the class hierarchy concept. Finally, subsection 2.6 defines the schema and the object database.
2.1 Type Systems
The proposed data model relies on two different type systems, MTS and DTS. The former is intended to define metadata types, whereas the latter is mainly intended to describe document types. The syntax of these types is described in the following subsections. In the remainder of the paper we denote the set of all object identifiers as OI and the set of all class identifiers as CI.
2.1.1 MTS Type System
Every type of the type system MTS has the following syntax:
τ := ATOMIC | time | interval | [A1:τ1,.., An:τn] | {τ}
where the type group ATOMIC contains all the atomic types such as integer, real, char, string, etc.; time is the type that designates the domain of time instants; interval is the type that designates the domain of time intervals; [A1:τ1,.., An:τn] is the tuple constructor; and {τ} is the set constructor. Both constructors form new types from existing ones.
The value domains of the atomic and structured types follow the usual semantics given in the complex value data model [Abi95]. They are denoted with the function dom(type). Further, the domain for the type time is defined as in [Bert95], so that it is isomorphic to the set of natural numbers. Hence, time intervals, denoted [t1, t2], are interpreted as sets of consecutive time instants, on which the set operators ∩, ∪ and ⊆ can be used to define the basic temporal relationships and operators over intervals. Formally, these domains are stated as follows:
TIME = {0, 1, 2, 3, 4, 5, 6, .., now, ...}
INT = {[x, y] | x, y ∈ TIME, x ≤ y}
2.1.2 DTS Type System
The type system DTS bears great similarities to the Document Type Definition language of SGML [ISO86]. Indeed, both are mainly aimed at representing the generic structure of documents in flexible terms [And89]. Moreover, both have a grammar-based syntax that defines the generic structure of a document as a hierarchy of nested elements [Gol90]. The syntax of the DTS types is as follows:
τ := RAWDATA | CLASS | (τ1 |...| τm) | τ+ | τ* | τ? | [|| A1:τ1,.., Am:τm ||]
Here, each type of RAWDATA is a multimedia type such as text, picture, graphic, etc. Each type of CLASS is a class identifier. Together, both sets of types make up the set of basic unstructured data types, in terms of which structured types are defined as follows:
1. The constructor '|' expresses the union of types and gives the range of data types, τ1,..., τm, that a component can take.
2. Flexible structured components are formed with either basic or union components plus a suffix that indicates the degree of optionality: '+' expresses that the component is expected to appear at least once, '*' indicates that the component can appear zero or more times and '?' expresses that the component can appear once or not at all.
3. Finally, the tree-like structure is formed with nested ordered tuples (1), using the attribute names Ai as the node concepts. The components τi are either basic, union or flexible structured components, or nested ordered tuples.
(1) The use of nested ordered tuples is necessary here in order to represent the ordering between the components of a document, which is a strong feature of structured documents.
Example: The following expressions are DTS types:
[|| body:Paper+, photo:Photograph ||]
[|| title:Text, author:Text, body:(Figure | Photograph)* ||]
where Paper is the name of a class and both Text and Photograph are raw data types.
The values of any multimedia type τ from RAWDATA are denoted with dom(τ). On the other hand, the function π*(c, t) [Bert95] denotes the set of object identifiers that belong to a class c at time t. Therefore, this function gives us the value domain of the type c ∈ CLASS at a certain time instant. The temporal nature of this function requires a time-dependent definition of the domain values for the type system DTS. This leads us to define the valid set of values that can be associated with each DTS type τ at a given time instant, which is called the legal values of τ.
Definition (Legal Values). Let τ be a DTS type. The legal values of τ at time t, denoted ⟦τ⟧_t, are defined as follows:
• ⟦τ⟧_t = dom(τ) for all t, if τ ∈ RAWDATA
• ⟦τ⟧_t = π*(τ, t), if τ ∈ CLASS
• ⟦τ1 |...| τm⟧_t = ∪_{i=1..m} ⟦τi⟧_t
• ⟦τ+⟧_t = 2^⟦τ⟧_t − {∅}
• ⟦τ*⟧_t = 2^⟦τ⟧_t
• ⟦τ?⟧_t = {∅} ∪ {{v} | v ∈ ⟦τ⟧_t}
• if τ = [|| A1:τ1,.., An:τn ||] then ⟦τ⟧_t = {[|| v1,.., vn ||] | vi ∈ ⟦τi⟧_t for all i ∈ [1, n]}
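The definition of legal values can be read as a membership test. The following Python sketch is only an illustration under stated assumptions: the class names (Raw, Cls, Alt, Plus, Star, Opt, Tup), the function is_legal, the mapping of raw data types to Python types and the sample population are all hypothetical and not part of the model. Object identifiers are assumed to be strings, and values of τ+, τ* and τ? are assumed to be represented as Python sets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Raw:            # a type from RAWDATA, e.g. Text
    name: str

@dataclass(frozen=True)
class Cls:            # a type from CLASS (a class identifier)
    name: str

@dataclass(frozen=True)
class Alt:            # union type (t1 | ... | tm)
    options: tuple

@dataclass(frozen=True)
class Plus:           # t+
    inner: object

@dataclass(frozen=True)
class Star:           # t*
    inner: object

@dataclass(frozen=True)
class Opt:            # t?
    inner: object

@dataclass(frozen=True)
class Tup:            # nested ordered tuple [|| A1:t1, .., An:tn ||]
    fields: tuple     # ((attribute name, type), ...)


def is_legal(value, typ, t, raw_dom, members):
    """Check whether value is a legal value of typ at time t.

    raw_dom maps a raw data type name to a Python type (assumption);
    members(c, t) plays the role of pi*(c, t).
    """
    if isinstance(typ, Raw):
        return isinstance(value, raw_dom[typ.name])
    if isinstance(typ, Cls):
        return isinstance(value, str) and value in members(typ.name, t)
    if isinstance(typ, Alt):
        return any(is_legal(value, o, t, raw_dom, members) for o in typ.options)
    if isinstance(typ, (Plus, Star, Opt)):
        if not isinstance(value, (set, frozenset)):
            return False
        if isinstance(typ, Plus) and len(value) == 0:
            return False          # t+ excludes the empty set
        if isinstance(typ, Opt) and len(value) > 1:
            return False          # t? allows the empty set or a singleton
        return all(is_legal(v, typ.inner, t, raw_dom, members) for v in value)
    if isinstance(typ, Tup):
        return (isinstance(value, tuple)
                and len(value) == len(typ.fields)
                and all(is_legal(v, ft, t, raw_dom, members)
                        for v, (_, ft) in zip(value, typ.fields)))
    raise TypeError(f"unknown type constructor: {typ!r}")


# Hypothetical data: the class Paper has two members from time 1 onwards.
RAW_DOM = {"Text": str, "Photograph": bytes}
MEMBERS = {"Paper": {"i#1", "i#2"}}

def members(class_id, t):
    return MEMBERS.get(class_id, set()) if t >= 1 else set()

# [|| body:Paper+, photo:Photograph ||]
doc_type = Tup((("body", Plus(Cls("Paper"))), ("photo", Raw("Photograph"))))
```

The check is a membership test rather than an enumeration because the power sets in the definition are too large to build explicitly.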
2.2 Metaclasses
Our data model is built on the concept of metaclass. From a conceptual point of view, a metaclass is a class of classes. In other words, a metaclass abstracts the common structural and operational properties of a set of classes. A metaclass signature can be formulated as a 5-tuple of the form:
MC = 〈id, meta_type, c_meth, min_type, o_meth〉
where id is the metaclass identifier, meta_type is an MTS type, c_meth is a set of method signatures defined over the type system MTS, min_type is a DTS type and o_meth is a set of method signatures defined over DTS types.
A metaclass is itself a class whose object instances are classes. The state and behaviour of its classes are defined by the type meta_type and the set of methods c_meth respectively. At the same time, a metaclass induces a class hierarchy with a single root class, which is named after the metaclass. This root class has the associated type min_type and the set of methods o_meth. Thus, all the instances of a metaclass are specialisations of its root class. The function Π(m) returns the set of identifiers of the classes that belong to the metaclass denoted by m.
We assume that metaclasses cannot form hierarchies, that is, the populations of metaclasses are totally disjoint. This ensures that object populations do not participate in different class hierarchies. With this restriction, the model satisfies the invariant given in [Bert94] for hierarchies with multiple roots.
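As an illustration, the metaclass signature and the disjointness assumption on metaclass populations could be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: types and method signatures are kept as plain strings, and the names Metaclass, populations_disjoint, metaclass_of and the sample data are assumptions introduced here.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Metaclass:
    """MC = <id, meta_type, c_meth, min_type, o_meth>.

    Types and method signatures are plain strings, a simplification
    made for this sketch only.
    """
    id: str
    meta_type: str
    c_meth: frozenset
    min_type: str
    o_meth: frozenset


def populations_disjoint(population):
    """population maps a metaclass id m to Pi(m), its set of class ids."""
    return all(population[a].isdisjoint(population[b])
               for a, b in combinations(population, 2))


def metaclass_of(class_id, population):
    """Return the id of the metaclass whose population holds class_id, or None."""
    owners = [m for m, classes in population.items() if class_id in classes]
    if len(owners) > 1:
        raise ValueError(f"{class_id} belongs to several metaclasses: {owners}")
    return owners[0] if owners else None


# Hypothetical metaclass and populations, for illustration only.
periodical = Metaclass(
    id="periodical",
    meta_type="[Name:string, Period:time]",
    c_meth=frozenset(),
    min_type="[|| Number:integer, Volume:integer ||]",
    o_meth=frozenset(),
)
population = {
    "document": {"paper", "short_paper"},
    "periodical": {"DBJournal"},
}
```

Because the populations are disjoint, metaclass_of returns at most one metaclass for a class, which is the property that Invariant 1 in the next subsection relies on.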
2.3 Classes
The main feature of the classes of our model is that they evolve over time. In particular, document classes are expected to change their type and state as new formats and metadata arise from the needs of up-to-date publications. This feature implies a time-dependent definition of types, class hierarchies and the database schema. All these concepts are explained in turn. We define a class signature as the following 5-tuple:
C = 〈id, lifespan, type, history, mc〉
where:
1. id is the class identifier,
2. lifespan is a value of INT that indicates the time interval during which the class is valid,
3. type is a value from {historic, static} that indicates whether the class evolves,
4. history is a tuple as follows:
   (h_type = (τ1@i1,.., τn@in),
    c_state = (v1@i1,.., vk@ik),
    i_ext = (p1@i1,.., pn@in),
    m_ext = (p*1@i1,.., p*n@in))
   where τ1..τn denote DTS types, v1..vk denote MTS values, p1..pn and p*1..p*n are sets of object identifiers from OI, and i1..in are time intervals from INT. Here, the attribute h_type expresses the history of the type of C. The attribute c_state represents the history of the class meta-attributes. Finally, i_ext contains the population of instance objects and m_ext contains the population of member objects. The intervals of each of these series must meet each other in such a way that they are disjoint and their union is included in the lifespan of C. Notice that the intervals of h_type, i_ext and m_ext form the same temporal sequence.
5. mc is the metaclass to which the class C belongs.
Invariant 1: For each class C with C.id = c there exists a metaclass M with M.id = m so that if c ∈ Π(m) then C.mc = m.
As mentioned earlier, the function π*(c, t) returns the member population of the class C identified by c at time t, that is, the set of object identifiers p such that ∃i, p@i ∈ C.history.m_ext and t ∈ i. The function π(c, t) is defined analogously to denote the instance population of the class identified by c.
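The constraint on the intervals of a history series, and the lookup performed by π*(c, t), can be sketched in Python as follows. The sketch assumes that time instants are integers and that an interval [x, y] is a pair (x, y) with both ends included; the function names valid_series and value_at, and the optional require_meet flag, are hypothetical helpers introduced here.

```python
def valid_series(intervals, lifespan, require_meet=False):
    """Check that intervals are well formed, pairwise disjoint and inside lifespan.

    With require_meet=True, consecutive intervals must also be adjacent,
    i.e. each one starts at the instant after the previous one ends.
    """
    ordered = sorted(intervals)
    lo, hi = lifespan
    for start, end in ordered:
        if start > end or start < lo or end > hi:
            return False
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start <= prev_end:
            return False          # the two intervals overlap
        if require_meet and next_start != prev_end + 1:
            return False          # there is a gap between them
    return True


def value_at(series, t):
    """series is a list of (value, (start, end)); return the value valid at t.

    Applied to m_ext this yields pi*(c, t); None means no interval contains t.
    """
    for value, (start, end) in series:
        if start <= t <= end:
            return value
    return None


# Hypothetical member extent of a class with lifespan [1, 100].
m_ext = [({"i#1"}, (1, 30)), ({"i#1", "i#3"}, (31, 50))]
```

Applying value_at to h_type instead of m_ext gives the function type(c, t) used in the definitions that follow.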
The following functions simplify the notation of the forthcoming definitions. The function type(c, t) gives the historical type of the class C with identifier c at time t. The lifespan of a class is accessed through its identifier by the function lifespan(c).
Definition (Class Consistency). Let C be a class and M its metaclass (i.e. C.mc = M.id). The class C is said to be consistent if the following conditions hold:
1. for each τ@i ∈ C.history.h_type, τ ≤_t M.min_type for all t ∈ i, and
2. for each v@i ∈ C.history.c_state, v ∈ dom(M.meta_type).
Example: The following class definition can be considered to represent the evolution of the class of short papers of a document database:
Doc = 〈 id = short_paper,
        lifespan = [1, 100],
        type = historic,
        history = (
          h_type = ( [|| title:string, body:section+, ref:string* ||]@[1, 30],
                     [|| title:string, abstract:text, body:section+, ref:string* ||]@[31, 50],
                     [|| title:string, keywords:string+, body:section+, ref:string* ||]@[51, 100] ),
          c_state = ( (published=100)@[1, 100] ),
          i_ext = ( {i#1, i#4, i#5, i#6}@[1, 30],
                    {i#3, i#7, i#8, i#9}@[31, 50],
                    {i#10, i#11}@[51, 100] ) ),
        mc = documents 〉
The type section is supposed to be a previously defined class, whereas the types string and text are supposed to be raw data types.
2.4 Objects
An object signature is the following tuple:
O = 〈oid, e_time, vt, v, c_id〉
where oid ∈ OI is the object identifier, e_time is a time instant, vt is a time interval, v is the value of the object and c_id is the identifier of the most specific class to which the object O belongs at time e_time.
The time instant e_time (edition time) represents the instant at which the value of the object is associated with a historical type of the class c_id. Invariant 2 accounts for the existence of such a historical type for the e_time of each object.
The time interval vt (up-to-date time) represents the interval during which the object is regarded as up to date. This attribute is tightly related to our applications, and its value is usually given by an interpretation of the object contents. For instance, the up-to-date time of a newspaper article is given by the interpretation obtained after reading and understanding its text.
Example: Given the above definition, the following example shows a short_paper object:
Doc = 〈 oid = i#21,
        e_time = 40,
        vt = [10, 41],
        v = (title=“Object-Orientation..”, abstract=i#24, body={i#134, i#211, i#56}, ref={}),
        c_id = short_paper 〉
Invariant 2: For each object O there exists a class C (with identifier c) and a time t so that if O.oid ∈ π(c, t) then:
1. O.c_id = c and
2. there exists a historical type τ@i ∈ C.history.h_type so that O.e_time ∈ i.
Definition (Object Consistency). An object O is consistent if O.v is a legal value of its historical type at its edition time, that is, O.v ∈ ⟦type(O.c_id, t)⟧_t with t = O.e_time.
In order to define a consistent set of objects, we introduce the function ref(o) [Bert94], which returns the set of object identifiers to which the object identified by o refers. Moreover, I(O) denotes the set of identifiers of the objects in O.
Definition (Consistent Set of Objects) [Bert94]. A set OBJ is a consistent set of objects if the following conditions hold:
1. for all objects o ∈ OBJ, o is a consistent object,
2. for each pair of objects o, o' ∈ OBJ, if o.oid = o'.oid then o.e_time = o'.e_time, o.v = o'.v and o.vt = o'.vt,
3. for all objects o ∈ OBJ, each identifier in ref(o) must be contained in I(OBJ).
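The three conditions of a consistent set of objects can be sketched in Python as follows. This is an illustrative sketch under assumptions: the class Obj, the explicit refs field (standing in for ref(o), which the model derives from the object value) and the function consistent_set are hypothetical, and the per-object consistency check of condition 1 is passed in as a predicate instead of being implemented here.

```python
from dataclasses import dataclass


@dataclass
class Obj:
    """O = <oid, e_time, vt, v, c_id>, plus an explicit set of references."""
    oid: str
    e_time: int
    vt: tuple                      # (start, end), both ends included
    v: dict                        # attribute name -> value
    c_id: str
    refs: frozenset = frozenset()  # stands in for ref(o) in this sketch


def consistent_set(objs, is_consistent=lambda o: True):
    """Check the three conditions of a consistent set of objects."""
    ids = {o.oid for o in objs}    # I(OBJ)
    first_seen = {}
    for o in objs:
        # Condition 1: every object is itself consistent.
        if not is_consistent(o):
            return False
        # Condition 2: objects sharing an oid agree on e_time, v and vt.
        first = first_seen.setdefault(o.oid, o)
        if (first.e_time, first.v, first.vt) != (o.e_time, o.v, o.vt):
            return False
        # Condition 3: every referenced identifier belongs to I(OBJ).
        if not o.refs <= ids:
            return False
    return True


# Hypothetical objects, for illustration only.
paper = Obj("i#21", 40, (10, 41), {"title": "T", "abstract": "i#24"},
            "short_paper", frozenset({"i#24"}))
abstract = Obj("i#24", 40, (10, 41), {"contents": "..."}, "text")
```

In this sketch a set containing only paper is rejected, because its reference to i#24 is dangling until abstract is added.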
2.5 Inheritance
A class hierarchy is defined as the tuple 〈Cl, ≤_IS-A〉 where Cl is a set of class identifiers from CI and ≤_IS-A is a ternary relationship formed by pairs of classes from Cl and time intervals i. Semantically, ≤_IS-A defines partial orders over Cl for each disjoint time interval. In this context, the expression c ≤_IS-A^i c' states that the class c is a subclass of c' during the time interval i.
It is worth noticing that the class hierarchy always forms a forest [Abi95], since each metaclass induces a static root class from which the rest of the classes specialise.
Invariant 3. Let 〈Cl, ≤_IS-A〉 be a class hierarchy. For each 〈c, c', i〉 ∈ ≤_IS-A the following conditions must hold:
1. lifespan(c) ⊆ lifespan(c') and
2. lifespan(c') ⊆ i
Due to the evolution of classes, the definition of the subtype relationship is time-dependent.
Definition (Subtypes). Let τ and τ' be DTS types. Then τ is a subtype of τ' at time t, denoted τ ≤_t τ', if and only if one of the following holds:
• τ = τ'
• τ, τ' ∈ CLASS and there exists an interval i so that τ ≤_IS-A^i τ' and t ∈ i
• τ = τ1+, τ' = τ2+ and τ1 ≤_t τ2
• τ = τ1*, τ' = τ2* and τ1 ≤_t τ2
• τ = τ1?, τ' = τ2? and τ1 ≤_t τ2
• τ = τ1+, τ' = τ2* and τ1 ≤_t τ2
• τ = τ1*, τ' = τ2? and τ1 ≤_t τ2
• τ = (τ1 |...| τm), τ' = (τ'1 |..| τ'k), k ≤ m and for each i ∈ [1, k], τi ≤_t τ'i
• τ = [|| A1:τ1,.., Am:τm ||], τ' = [|| A'1:τ'1,.., A'k:τ'k ||], k ≤ m and for each i ∈ [1, k], τi ≤_t τ'i and Ai = A'i
Definition (int-well-formed CH). A class hierarchy 〈Cl, ≤_IS-A〉 is intensionally well-formed if for each 〈c, c', i〉 ∈ ≤_IS-A, type(c, t) ≤_t type(c', t) holds for all t ∈ i.
Definition (ext-well-formed CH). A class hierarchy 〈Cl, ≤_IS-A〉 is extensionally well-formed if for each 〈c, c', i〉 ∈ ≤_IS-A, π*(c, t) ⊆ π*(c', t) holds for all t ∈ i.
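The extensional condition lends itself to a direct check, since time instants are isomorphic to the natural numbers and every interval is therefore finite. The following Python sketch assumes integer time instants and closed intervals (start, end); the function name ext_well_formed and the sample populations are hypothetical.

```python
def ext_well_formed(isa, members):
    """Check pi*(c, t) <= pi*(c', t) for every <c, c', i> in isa and every t in i.

    isa is an iterable of (c, c_super, (start, end)) triples;
    members(c, t) plays the role of pi*(c, t) and returns a set of oids.
    """
    return all(members(c, t) <= members(c_super, t)
               for c, c_super, (start, end) in isa
               for t in range(start, end + 1))


# Hypothetical member populations that do not change over time.
EXT = {"short_paper": {"i#1"}, "paper": {"i#1", "i#2"}}

def members(class_id, t):
    return EXT[class_id]
```

For instance, ext_well_formed([("short_paper", "paper", (1, 3))], members) holds for the data above, whereas swapping the two classes makes the check fail because i#2 is not a member of short_paper.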
2.6 Schema and Object Database
It is important to notice that the intended applications are concerned with recording the history of the document types and instances rather than with the rules for their evolution. Accordingly, the formalisation presented here does not focus on how document types have to evolve, but on defining when a document schema is historically consistent. Further, document instances are always attached to exactly one document type at a given instant, so that they can never change their state after insertion.
In view of the above considerations, the schema of a historical document database can be defined as follows:
Schema = 〈MCl, ClDef, ≤_IS-A〉
where MCl is a set of metaclass definitions and ClDef is a set of consistent class definitions that forms an int-well-formed class hierarchy under the relation ≤_IS-A. Invariant 1 must hold to keep the instance relationship between classes and metaclasses consistent, and Invariant 3 must be satisfied to keep the class hierarchy consistent.
An instance of the above schema is simply a consistent set of objects OBJ. Invariant 2 must be satisfied for OBJ in order to keep the instance relationship between the objects in OBJ and the classes of the schema consistent. Further, the class hierarchy formed by the schema must be extensionally well-formed with respect to OBJ.
3. Defining a Schema for a Digital Library of Periodicals
To cope with the requirements for document applications, we propose the definition of four metaclasses, namely: document, publication, periodical and multimedia data. The basic metaclasses are the document and the multimedia ones, because they support the basic document types (e.g. article, paper, section, etc.) and indexed multimedia data (e.g. text, picture, graphic, etc.) respectively. The metaclass publication groups all the document compendiums that have been published at a given time, whereas the metaclass periodical groups all the document compendiums that are published periodically. Table 1 summarises the relevant metadata that define each of these predefined metaclasses [Ara96]. Their semantics are briefly explained in turn.

Metaclass   | meta-attributes (MTS)                                   | domain attributes (DTS)
document    | -                                                       | Rev-ref:{CLASS}
publication | -                                                       | Date-P:date, ISBN:string, Site:string*, Editor:string+
periodical  | Name:string, Period:time, ISSN:string, Editor:{string}  | Number:integer, Volume:integer
multimedia  | -                                                       | Rev-ref:{CLASS}, Index:ATOMIC, Contents:RAWDATA

Table 1: Sample of Metadata for the four predefined metaclasses.
In the table, the weak types CLASS, ATOMIC and RAWDATA each represent a multiple or-type involving all the types in the type group of the same name. This allows objects to take values from the domains of all the types of the corresponding group. Apart from the usual metadata associated with publications and periodicals (e.g. Volume, ISBN, Period, etc.), a set of useful metadata has been included. Thus, documents incorporate an attribute called Rev-ref to maintain the inverse references to their container objects. Multimedia data also includes an attribute for inverse references. Moreover, multimedia data is indexed by a key called Index, which is usually obtained by applying a feature-extraction function to its contents. For more detailed information about the metadata associated with historic documents, the reader can consult our previous work [Ara96][Ara97].
4. The Roles of Metadata
In this section we discuss and summarise the role that metadata has played in developing our digital library of journals. Before proceeding, we review the motivations behind the use of metadata.
Among the application requirements identified in the introduction, two groups cannot be met with conventional object-oriented data models. The first group concerns the flexibility of document structures, whereas the second concerns the temporal properties of documents. While the first group requires minor extensions to the object-oriented data model [Chr94][Özs95], the second group demands a more powerful framework to state the integrity constraints that ensure data consistency over time. Metadata has allowed us to tackle both problems at once. By using the concept of metaclass, we have developed a formal data model consisting of two levels of information:
• The first level deals with meta-information about the temporal behaviour and the generic structure of documents.
• The second level deals with the document classes, their attributes and the specific document structures and relationships.
The information that instantiates the first level constrains the set of values that the instances in the second level can take. Similarly, the first level constitutes a layer on top of which the logical schema for a concrete application can be developed easily and with more meaning. Moreover, this separation allows the metadata and the data in the database to be considered separately. Therefore, users and applications can consult them as convenient for each kind of operation to be executed.
To better understand the benefits that this approach brings to our application, the following subsections explain how metadata can be applied at different phases of execution.
4.1 Document Insertion
When new documents are inserted into the database, metadata is applied in two ways:
1- The definition of the generic structure of a document serves to validate the specific structure of its instances. Thus, the logical references contained within the state of a document object must preserve the aggregation relationships established at class level.
2- From the value of certain temporal meta-attributes, the value of other attributes of the document instances can be automatically deduced. This is the case for the document valid time vt, which could be deduced by default from the Date-P (resp. Period) of the publication (resp. periodical) that contains it, or from the set of valid times of the objects that the document refers to. This might constitute a way to propagate the temporal attributes of documents upwards or downwards through the aggregation hierarchy.
In general, supporting all these metadata in the data model helps to enforce the integrity of the database. At the same time, the definition of metadata at class level assists the instantiation process.
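The text leaves open how the default value of vt would actually be computed. The following Python sketch shows one possible reading, purely as an assumption: an explicit vt always wins, otherwise the smallest interval covering the valid times of the referenced objects is used, and otherwise the degenerate interval at the publication date. The function names hull and default_vt, the priority order and the use of integer time instants are all hypothetical choices of this sketch.

```python
def hull(intervals):
    """Smallest (start, end) interval covering all given intervals, or None."""
    intervals = list(intervals)
    if not intervals:
        return None
    return (min(start for start, _ in intervals),
            max(end for _, end in intervals))


def default_vt(explicit_vt, referenced_vts, date_p):
    """Deduce a default vt for a document being inserted.

    Assumed priority: an explicit vt, then the hull of the valid times of
    the referenced objects, then the single instant date_p of the container.
    """
    if explicit_vt is not None:
        return explicit_vt
    covered = hull(referenced_vts)
    if covered is not None:
        return covered
    return (date_p, date_p)
```

Using the hull propagates valid times upwards through the aggregation hierarchy; propagating downwards would instead copy the container's interval to components that have no vt of their own.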
4.2 Query Formulation
When querying the system with a Document Retrieval Language (DRL) such as the one in [Ara97], the user must be acquainted with the properties of the intended documents. This is especially true when defining trajectories through the components of a document. The document retrieval language could overcome this drawback if metadata were queried together with the intended data.
From the formal data model defined in section 2, it follows that every class in the base is an instance of some metaclass. Consequently, we can consider applying our DRL to retrieve the metadata encapsulated in class objects. In this way, it would be possible to ask for the meta-attribute DBJournal.period to find out the period of this sort of periodical. Similarly, asking for DBJournal.type would return both the generic structure and the user attributes of the class DBJournal. However, due to the historic nature of the database schema, the retrieval of this kind of information should specify the time at which it is queried. Therefore, the clause at defined in DRL also has to be supported at the metaclass level.
For example, the following query returns the names of the subclasses of DBJournal that are published yearly and that contain the class Review among their components:
select c
from Periodical c
where c.period = 1year and c contains_type Review and c subclass DBJournal
at [1/10/1995, 1/11/1996]
Notice that evaluating the query type Periodical yields the set of classes that instantiate that metaclass. When processing this query, the evaluator must notice that Periodical is not a class but a metaclass, so that the variable c takes as its domain the set of classes that are instances of that metaclass. The predicate contains_type/2 accounts for the class aggregation hierarchy, whereas the predicate subclass/2 corresponds to the relation ≤_IS-A defined in section 2.5. All the query predicates must be evaluated in the context of the time specified by the clause at.
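To make the evaluation of such a metadata query more concrete, the following Python sketch filters an in-memory list of class records. It is not the DRL evaluator: the record layout (ClassRecord), the helper names (covers, holds, select_classes) and the sample classes are hypothetical. Three further assumptions are made: a predicate must hold throughout the whole interval of the at clause, dates are read as day/month/year, and contains_type and subclass only look at direct components and direct superclasses.

```python
from dataclasses import dataclass
from datetime import date


def covers(stored, query):
    """True if the stored interval contains the whole query interval."""
    return stored[0] <= query[0] and query[1] <= stored[1]


@dataclass
class ClassRecord:
    id: str
    mc: str               # metaclass identifier
    period: list          # history of the meta-attribute: [(value, interval)]
    components: list      # aggregation: [(component class id, interval)]
    superclasses: list    # IS-A: [(superclass id, interval)]


def holds(history, wanted, at):
    """True if wanted appears in history with an interval covering at."""
    return any(value == wanted and covers(interval, at)
               for value, interval in history)


def select_classes(records, metaclass, period, component, superclass, at):
    """select c from <metaclass> c where c.period = <period>
    and c contains_type <component> and c subclass <superclass> at <at>"""
    return [r.id for r in records
            if r.mc == metaclass
            and holds(r.period, period, at)
            and holds(r.components, component, at)
            and holds(r.superclasses, superclass, at)]


# Hypothetical schema contents, for illustration only.
nineties = (date(1990, 1, 1), date(1999, 12, 31))
since_94 = (date(1994, 1, 1), date(1999, 12, 31))
records = [
    ClassRecord("DBAnnual", "Periodical", [("1year", nineties)],
                [("Review", since_94)], [("DBJournal", nineties)]),
    ClassRecord("DBMonthly", "Periodical", [("1month", nineties)],
                [("Review", since_94)], [("DBJournal", nineties)]),
    ClassRecord("Review", "Document", [], [], []),
]
at = (date(1995, 10, 1), date(1996, 11, 1))
result = select_classes(records, "Periodical", "1year", "Review", "DBJournal", at)
```

With this data only DBAnnual is selected. Moving the start of the at interval before 1994 empties the result, because Review was not yet a component of that class.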
On the other hand, given that ODMG-93 [Cat96] allows nesting of queries, it is correct to embed queries about metadata into the queries for documents. With this capacity, the retrieval language becomes much more expressive, since users can formulate queries without an exact understanding of the logical schema of the database. For instance, suppose a user wishes to know which periodicals of Springer were published in May. This can be done in DRL as follows:
select p
from Periodical as p
where p.publisher = 'Springer'
at [1/5/96, 31/5/96]
A more complex query would ask for all the documents published by John Smith in the DBJournal during 1995. In this case, a nested query appears in the clause from, as follows:
select d
from (select c from Document c where DBJournal contains_type c) as d
where d.author = 'John Smith' and DBJournal contains d
at [1/5/95, 31/5/95]
To evaluate this query, the nested query must be solved first to obtain the set of classes that satisfy it. The union of the instances of these classes constitutes the domain over which the variable d ranges during further query processing. In this case, the clause at affects both queries.
In conclusion, the Document Retrieval Language can be extended to accept queries on metadata. This facility is very important for becoming aware of the document base schema before formulating any requests for documents. In general, it is often necessary to have the definition of a class available in order to establish what kind of publication it is and how likely it is that the required information will be found there. Further, knowing the generic structure of a document class allows trajectories, projections and conditions over documents to be specified correctly. Similarly, the temporal properties of documents can make query formulation easier for users. In this sense, notice that it is possible to draw up queries that combine conditions over the data and the metadata of a class. This facility increases the expressiveness of the query language and, at the same time, allows users to specify more complex retrieval conditions that take advantage of the metadata stored in the base. Finally, due to the evolution of document classes, users may need to know the definition of the involved classes that is valid at the time at which the database is going to be projected. In this respect, metadata is used to perceive the evolution of publications over time.
5. Conclusions
In this paper, a new approach to the design of a digital library of scientific journals has been proposed. After analysing the application, we have developed a complete data model that covers complex properties of these documents and advanced retrieval conditions. The basis of our approach is the integration of an object-oriented data model extended with metaclasses with a large amount of metadata about the documents in the application. The difference from other proposals for this application is that our model integrates metadata with the rest of the document attributes and contents and, at the same time, differentiates between them.
The use of metadata in our approach has allowed us to enhance previous document applications in several ways. Firstly, we store and project the global contents of the journals with their original organisation. Secondly, a temporal dimension has been added to the model, which facilitates the specification of temporal relationships between the documents to be retrieved. Furthermore, users and applications are always able to consult metadata. Specifically, we have provided a query language that combines conditions on both data and metadata. In this way, queries become more expressive and precise.
7 References
[Abi95] Abiteboul, S., Hull, R. and Vianu, V. “Foundations of Databases”. Addison Wesley Publishing Company, 1995.
[And89] André, J., Furuta, R. and Quint, V., editors. “Structured Documents”. The Cambridge Series on Electronic Publishing, Cambridge University Press, 1989.
[Ara96] Aramburu, M.J. and Berlanga, R. “Object Oriented Modelling of Periodicals”. 7th Workshop on Database and Expert System Applications, IEEE, Zurich, 1996.
[Ara97] Aramburu, M. and Berlanga, R. “An Approach to a Digital Library of Newspapers”. To appear in Information Processing & Management, Special Issue on Electronic News, 1997.
[Bert94] Guerrini, G., Bertino, E. and Bal, R. “A Formal Definition of the Chimera Object-Oriented Data Model”. Technical Report IDEA, ESPRIT Project 6333, May 1994.
[Bert95] Bertino, E., Ferrari, F. and Guerrini, G. “A Formal Temporal Object-Oriented Data Model”. Technical Report 141-95, Università di Milano, 1995.
[Cat96] Cattell, R.G.G., Ed. “The Object Database Standard: ODMG-93 Release 1.2”. San Francisco: Morgan Kaufmann Publishers, 1996.
[Chr94] Christophides, V., Abiteboul, S., Cluet, S. and Scholl, M. “From Structured Documents to Novel Query Facilities”. Proceedings of the ACM SIGMOD International Conference on Management of Data, Minnesota, USA, 1994.
[Dia94] Díaz, O. and Paton, N. “Extending ODBMSs Using Metaclasses”. IEEE Software, 1994.
[Fox95] Fox, E.A., Akscyn, R.M., Furuta, R. and Leggett, J. “Digital Libraries”. Communications of the ACM, 38(4), 22-103, 1995.
[Gol90] Goldfarb, C. “The SGML Handbook”. Oxford: Clarendon Press, 1990.
[ISO86] ISO 8879. Information Processing - Text and Office Systems - Standard Generalized Markup Language, 1986.
[Tan93] Tansel, A. et al. “Temporal Databases: Theory, Design and Implementation”. The Benjamin/Cummings Publishing Company, Inc., 1993.